# <font color='Black'>Exploratory Analysis of the Google Play Store Games Dataset</font>

In this notebook we will perform Exploratory Data Analysis (EDA) on a dataset of top Google Play Store games. You can download the dataset from [here](https://www.kaggle.com/dhruvildave/top-play-store-games).
    

<div align ="center">
<img src="https://www.gamedesigning.org/wp-content/uploads/2015/08/A121.jpg" width="200" height="150" align="center"/>
    </div>


The Google Play Store offers a wide variety of applications for users to download and use. Some applications are free, while others are paid. The store also hosts a very large number of games, which users download in huge numbers every day. To analyze which games are popular, we can look at their ratings. This can help many individuals and companies, including users, developers, and advertising companies.

Let's import the necessary libraries and load the data.


In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

In [2]:
data = pd.read_csv('../Google play store/data/android-games.csv')
data.head()

Unnamed: 0,rank,title,total ratings,installs,average rating,growth (30 days),growth (60 days),price,category,5 star ratings,4 star ratings,3 star ratings,2 star ratings,1 star ratings,paid
0,1,Garena Free Fire- World Series,86273129,500.0 M,4,2.1,6.9,0.0,GAME ACTION,63546766,4949507,3158756,2122183,12495915,False
1,2,PUBG MOBILE - Traverse,37276732,500.0 M,4,1.8,3.6,0.0,GAME ACTION,28339753,2164478,1253185,809821,4709492,False
2,3,Mobile Legends: Bang Bang,26663595,100.0 M,4,1.5,3.2,0.0,GAME ACTION,18777988,1812094,1050600,713912,4308998,False
3,4,Brawl Stars,17971552,100.0 M,4,1.4,4.4,0.0,GAME ACTION,13018610,1552950,774012,406184,2219794,False
4,5,Sniper 3D: Fun Free Online FPS Shooting Game,14464235,500.0 M,4,0.8,1.5,0.0,GAME ACTION,9827328,2124154,1047741,380670,1084340,False


In [3]:
print(data.shape)
print(data.isnull().sum())
print(data.info())
print(data.describe())

(1730, 15)
rank                0
title               0
total ratings       0
installs            0
average rating      0
growth (30 days)    0
growth (60 days)    0
price               0
category            0
5 star ratings      0
4 star ratings      0
3 star ratings      0
2 star ratings      0
1 star ratings      0
paid                0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1730 entries, 0 to 1729
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rank              1730 non-null   int64  
 1   title             1730 non-null   object 
 2   total ratings     1730 non-null   int64  
 3   installs          1730 non-null   object 
 4   average rating    1730 non-null   int64  
 5   growth (30 days)  1730 non-null   float64
 6   growth (60 days)  1730 non-null   float64
 7   price             1730 non-null   float64
 8   category          1730 non-null   object 
 9   5 star ratings   

## Observations
 - There are 1730 rows and 15 variables, and there are no missing values. A dataset this clean is very rare in the real world.
 - The dataset has games from different categories, with different ratings and different numbers of installs.
 - The `installs` variable holds useful numerical information but is stored as text (`object`). It would be a good idea to convert it to a numerical variable.
 - The numerical variables deserve special attention in further analysis.

Let's make the necessary modifications before moving on to the analysis.

In [4]:
data['installs'].value_counts()

10.0 M      805
50.0 M      252
5.0 M       245
100.0 M     204
1.0 M       192
500.0 k      15
500.0 M      12
100.0 k       3
1000.0 M      2
Name: installs, dtype: int64

Everything is in millions except `500.0 k` and `100.0 k`. Let's first convert those to millions and then cast the column to a numerical format.

In [5]:
def convert(val):
  if val == '500.0 k':
    return '0.5 M'
  elif val == '100.0 k':
    return '0.1 M'
  else:
    return val

data['installs'] = data['installs'].apply(convert) # apply the function directly to the dataframe 
data['installs'] = data['installs'].str.replace('M', '').str.strip().astype('float')
data = data.rename(columns={'installs':'install(M)'})
data['install(M)'].value_counts()
    

10.0      805
50.0      252
5.0       245
100.0     204
1.0       192
0.5        15
500.0      12
0.1         3
1000.0      2
Name: install(M), dtype: int64
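
The `convert` function above handles only the two `k` values present in this dataset. A slightly more general version could parse the number and the suffix separately. The following is a minimal sketch, not part of the original notebook: the helper name `installs_to_millions` and the `SUFFIX_DIVISOR` table are hypothetical, and it assumes every value has the form `'<number> <suffix>'` with suffix `k` or `M`, as in the `value_counts()` output.

```python
# Divisors that turn a suffixed count into millions (the unit of `install(M)`).
SUFFIX_DIVISOR = {"k": 1000.0, "M": 1.0}

def installs_to_millions(value: str) -> float:
    """Convert strings such as '500.0 k' or '10.0 M' to a float in millions."""
    number, suffix = value.split()
    return float(number) / SUFFIX_DIVISOR[suffix]

# Values taken from the `installs` column shown above
examples = ["500.0 k", "100.0 k", "1.0 M", "1000.0 M"]
converted = [installs_to_millions(v) for v in examples]
print(converted)
```

On the raw column this could stand in for the two conversion steps, for example `data['installs'].apply(installs_to_millions)`, and an unexpected suffix would raise a `KeyError` instead of passing through silently.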

In [6]:
print(data['price'].value_counts())
print(data['paid'].value_counts())

0.00    1723
1.99       3
1.49       1
2.99       1
7.49       1
0.99       1
Name: price, dtype: int64
False    1723
True        7
Name: paid, dtype: int64


As we can see, most of the applications are free to use and very few are paid. With only 7 paid games out of 1730, the price varies too little to make any difference, so we can drop the `price` column.

Once that's done, we can move on to the analysis.

In [7]:
data.drop('price', axis=1, inplace=True)

In [8]:
# categorical data visualization
fig = px.histogram(data, x ='category')

fig.update_layout(xaxis={'categoryorder':'total descending'},
                  title_text ='Total count of games per category',
                  xaxis_title_text ='Category',
                  yaxis_title_text ='Count',
                  bargap =0.2)
fig.show()


In [9]:
# check the downloads for only free games
free = data[~data['paid']][['install(M)','category']]

In [10]:
fig = px.bar(free, x='category', y='install(M)')
fig.update_layout(xaxis ={'categoryorder':'total descending'})
fig.show()

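As we can see, the Game Action category has the highest total number of installs among free games. It is not the largest category by number of games, though: the category counts show GAME CARD (126) and GAME WORD (104) ahead of the other categories, which have 100 games each.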

In [11]:
data['category'].value_counts()

GAME CARD            126
GAME WORD            104
GAME SIMULATION      100
GAME SPORTS          100
GAME STRATEGY        100
GAME BOARD           100
GAME ROLE PLAYING    100
GAME ACTION          100
GAME ARCADE          100
GAME EDUCATIONAL     100
GAME ADVENTURE       100
GAME RACING          100
GAME TRIVIA          100
GAME PUZZLE          100
GAME MUSIC           100
GAME CASINO          100
GAME CASUAL          100
Name: category, dtype: int64

In [12]:
fig = px.histogram(data,x='total ratings',title='Total rating of the games')
 
fig.update_layout(
  xaxis_title_text ='Total ratings',
  yaxis_title_text = 'Count',
  bargap = 0.2
)
out_fig = px.box(data,x='total ratings', hover_data=['title','category'])
out_fig.update_traces(quartilemethod='inclusive')

In [13]:
fig.show()
out_fig.show()

We can conclude from the histogram that most games have between 0 and 500k total ratings. The box plot confirms this: it shows a large number of outliers on the high side, which pull the mean up and away from the median.

The distribution is highly right-skewed, so the median is a better summary of a typical game than the mean.
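
The gap between mean and median can be checked numerically. The sketch below is an illustration only: it uses the `total ratings` values of the ten rows that are visible in this notebook's printed output, not the full 1730-row dataset, and the variable names are hypothetical.

```python
import pandas as pd

# `total ratings` for the ten rows visible in this notebook's printed output;
# an illustrative subset, not the full dataset.
sample = pd.Series(
    [35665901, 31367945, 86273129, 4816448, 55766763,
     10188038, 21632735, 11506051, 13050503, 37276732],
    name="total ratings",
)

mean_ratings = sample.mean()
median_ratings = sample.median()

# In a right-skewed distribution the mean is pulled above the median
# and the sample skewness is positive.
print(f"mean   = {mean_ratings:,.0f}")
print(f"median = {median_ratings:,.0f}")
print(f"skew   = {sample.skew():.2f}")
```

On the full DataFrame the same check would be `data['total ratings'].agg(['mean', 'median', 'skew'])`.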


In [14]:
install_fig = px.histogram(data,x='install(M)',title='Number of games installed in Millions')

# Style the install histogram
install_fig.update_layout(
  xaxis_title_text ='Installed in millions',
  yaxis_title_text = 'Count',
  bargap = 0.2
)
install_out = px.box(data,x='install(M)', hover_data=['title','category'])
install_out.update_traces(quartilemethod='inclusive')

In [15]:
install_fig.show()
install_out.show()

The hover labels in the box plot show Candy Crush Saga at 1 billion installs and Clash of Clans at 500 million installs. The dataset actually contains 2 games with 1 billion installs and 12 games with 500 million installs; because those points overlap, the box plot shows only one title at each level, so make sure you check it against the dataset.
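
Because overlapping points hide titles in the box plot, listing the games at each install level is a more reliable check. This is a minimal sketch on a handful of rows copied from this notebook's printed output; the names `games`, `top_installs` and `titles_by_level` are hypothetical.

```python
import pandas as pd

# A few rows copied from this notebook's printed output (illustrative subset).
games = pd.DataFrame({
    "title": ["Subway Surfers", "Candy Crush Saga", "Clash of Clans",
              "Temple Run", "Pou"],
    "install(M)": [1000.0, 1000.0, 500.0, 500.0, 500.0],
})

# Keep the games with at least 500 million installs ...
top_installs = games[games["install(M)"] >= 500]

# ... and collect every title at each install level.
titles_by_level = top_installs.groupby("install(M)")["title"].apply(list)
print(titles_by_level)
```

Run against the full `data` DataFrame, the same filter should list all 2 + 12 games counted in the `value_counts()` output.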


In [16]:
rating_cat = data.groupby('category')['total ratings'].mean()
rating_cat

category
GAME ACTION          4.011344e+06
GAME ADVENTURE       8.935617e+05
GAME ARCADE          1.793780e+06
GAME BOARD           4.457431e+05
GAME CARD            3.326041e+05
GAME CASINO          3.619031e+05
GAME CASUAL          2.470866e+06
GAME EDUCATIONAL     1.529804e+05
GAME MUSIC           2.163020e+05
GAME PUZZLE          9.466929e+05
GAME RACING          1.139027e+06
GAME ROLE PLAYING    7.087648e+05
GAME SIMULATION      9.341417e+05
GAME SPORTS          1.353829e+06
GAME STRATEGY        1.856570e+06
GAME TRIVIA          2.982217e+05
GAME WORD            3.943603e+05
Name: total ratings, dtype: float64

In [17]:
install_cat = data.groupby('category')['install(M)'].mean()
install_cat

category
GAME ACTION          74.100000
GAME ADVENTURE       18.030000
GAME ARCADE          71.610000
GAME BOARD           21.230000
GAME CARD            12.484127
GAME CASINO           7.715000
GAME CASUAL          63.970000
GAME EDUCATIONAL     17.895000
GAME MUSIC           12.487000
GAME PUZZLE          36.210000
GAME RACING          46.750000
GAME ROLE PLAYING    14.080000
GAME SIMULATION      27.710000
GAME SPORTS          33.610000
GAME STRATEGY        23.910000
GAME TRIVIA           6.901000
GAME WORD            12.317308
Name: install(M), dtype: float64

In [18]:
fdata = [
    go.Bar(x=list(rating_cat.index), y=list(rating_cat.values), name='Total Ratings by Category', offsetgroup=0),
    go.Bar(x=list(install_cat.index), y=list(install_cat.values), name='Install by Category', yaxis='y2', offsetgroup=1)
]
# Axis settings passed as plain dicts
layout = go.Layout(
    yaxis=dict(title=dict(text='Total Ratings', font=dict(color='SteelBlue'))),
    yaxis2=dict(title=dict(text='Installation', font=dict(color='DarkOrange')),
                overlaying='y', side='right')  # draw the second y axis on the right
)
compare = go.Figure(data=fdata, layout=layout)
# bargroupgap sets the gap between bars that share the same x position
compare.update_layout(title='Total Ratings & Installs by Categories', xaxis={'categoryorder':'total descending'}, bargap=0.2, bargroupgap=0.1)
compare.update_xaxes(title_text="Categories")
compare.show()



The action category has the highest mean number of ratings and the highest mean installs. Casual and arcade games are also heavily downloaded, even though they receive noticeably fewer ratings than action games.
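
One way to put ratings and installs on a common footing is to divide one by the other. The sketch below is an assumption-laden illustration: it copies the category means for four categories from the two `groupby` outputs above, and the column names are hypothetical. Note that `total ratings` counts how many ratings a game received, not how favorable they were.

```python
import pandas as pd

# Category means copied from the two groupby outputs above (four categories only).
summary = pd.DataFrame(
    {
        "mean_total_ratings": [4.011344e6, 1.793780e6, 2.470866e6, 1.856570e6],
        "mean_installs_m": [74.10, 71.61, 63.97, 23.91],
    },
    index=["GAME ACTION", "GAME ARCADE", "GAME CASUAL", "GAME STRATEGY"],
)

# Number of ratings received per million installs
summary["ratings_per_m_installs"] = (
    summary["mean_total_ratings"] / summary["mean_installs_m"]
)
print(summary.sort_values("ratings_per_m_installs", ascending=False))
```

With the full results, the same ratio could be computed directly as `rating_cat / install_cat`, since both Series share the category index.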

In [19]:
pf = data['paid'].value_counts()
pf

False    1723
True        7
Name: paid, dtype: int64

In [20]:
label = pf.index.map({False: 'Free', True: 'Paid'})  # derive labels from the index so they always match the counts
pf_fig = px.pie(values=list(pf.values), names=list(label), title='Paid and free games')
pf_fig.update_traces(textposition='inside',textinfo='percent+label')
pf_fig.show()

In [21]:
df_30 = data.groupby('category')['growth (30 days)'].mean()
df_60 = data.groupby('category')['growth (60 days)'].mean()

In [22]:
growth_data = [
    go.Bar(x=list(df_30.index), y=list(df_30.values), name='Growth by category in 30 days', offsetgroup=0),
    go.Bar(x=list(df_60.index), y=list(df_60.values), name='Growth by category in 60 days', yaxis='y2', offsetgroup=1)
]
# Axis settings passed as plain dicts
layout = go.Layout(
    yaxis=dict(title=dict(text='30 days growth', font=dict(color='SteelBlue'))),
    yaxis2=dict(title=dict(text='60 days growth', font=dict(color='DarkOrange')),
                overlaying='y', side='right')  # draw the second y axis on the right
)
compare = go.Figure(data=growth_data, layout=layout)
# bargroupgap sets the gap between bars that share the same x position
compare.update_layout(title='30 and 60 day growth by categories', xaxis={'categoryorder':'total descending'}, bargap=0.2, bargroupgap=0.1)
compare.update_xaxes(title_text="Categories")
compare.show()

- Games in the casino category show the most growth over 30 days, even though games in the action category get more ratings and installs than games in the other categories.
- For games in the casino, adventure, and role playing categories, growth over 60 days is significantly lower than growth over 30 days.
- With the data currently available, we cannot conclude why the growth chart varies in this manner.

Let us explore the top 20 games by number of installs.

In [23]:
top_20 = data.sort_values(by ='install(M)',ascending=False).head(20)
top_20

Unnamed: 0,rank,title,total ratings,install(M),average rating,growth (30 days),growth (60 days),category,5 star ratings,4 star ratings,3 star ratings,2 star ratings,1 star ratings,paid
200,1,Subway Surfers,35665901,1000.0,4,0.5,1.0,GAME ARCADE,27138572,3366600,1622695,814890,2723142,False
626,1,Candy Crush Saga,31367945,1000.0,4,0.9,1.6,GAME CASUAL,23837448,4176798,1534041,486005,1333650,False
0,1,Garena Free Fire- World Series,86273129,500.0,4,2.1,6.9,GAME ACTION,63546766,4949507,3158756,2122183,12495915,False
207,8,Temple Run,4816448,500.0,4,0.7,1.5,GAME ARCADE,3184391,438320,318164,204384,671187,False
1426,1,Clash of Clans,55766763,500.0,4,0.3,1.0,GAME STRATEGY,43346128,5404966,2276203,971321,3768141,False
1026,1,Hill Climb Racing,10188038,500.0,4,0.4,0.8,GAME RACING,7148370,982941,607603,338715,1110407,False
1326,1,8 Ball Pool,21632735,500.0,4,1.2,630.8,GAME SPORTS,16281475,2268294,1017204,425693,1640067,False
630,5,Pou,11506051,500.0,4,0.2,0.5,GAME CASUAL,8175679,1051014,688712,346244,1244400,False
628,3,My Talking Angela,13050503,500.0,4,0.6,1.4,GAME CASUAL,9165205,1073761,636763,399662,1775110,False
1,2,PUBG MOBILE - Traverse,37276732,500.0,4,1.8,3.6,GAME ACTION,28339753,2164478,1253185,809821,4709492,False


In [29]:
# Use a separate name so the `data` DataFrame is not overwritten
top20_traces = [
    go.Scatter(x=list(top_20['title']), y=list(top_20['5 star ratings'].values), name='5 star rating'),
    go.Bar(x=list(top_20['title']), y=list(top_20['install(M)'].values), name='Number of installs', yaxis='y2', opacity=0.5)
]

# Axis settings passed as plain dicts
layout = go.Layout(
    yaxis=dict(title=dict(text='5 star rating', font=dict(color='SteelBlue'))),
    yaxis2=dict(title=dict(text='install(M)', font=dict(color='DarkOrange')),
                overlaying='y', side='right')  # draw the second y axis on the right
)

figure = go.Figure(data=top20_traces, layout=layout)
figure.update_layout(bargap=0.2, bargroupgap=0.1)
figure.update_xaxes(title_text='Game title')
figure.show()



The above chart is largely self-explanatory; further analysis would require more data.
Even though Candy Crush Saga and Subway Surfers each have 1 billion installs, that does not automatically mean they have the most 5-star ratings.
Garena Free Fire- World Series, with 500 million installs, has the highest number of 5-star ratings of all.
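
A raw count of 5-star ratings favors games that receive many ratings overall, so it can help to also look at the share of ratings that are 5-star. The following is a minimal sketch using four rows copied from the top-20 output above; the column name `5 star share` and the variable names are hypothetical and not part of the original analysis.

```python
import pandas as pd

# Four rows copied from the top-20 output above (illustrative subset).
rows = pd.DataFrame({
    "title": ["Subway Surfers", "Candy Crush Saga",
              "Garena Free Fire- World Series", "Clash of Clans"],
    "total ratings": [35665901, 31367945, 86273129, 55766763],
    "5 star ratings": [27138572, 23837448, 63546766, 43346128],
}).set_index("title")

# Fraction of all ratings that are 5-star
rows["5 star share"] = rows["5 star ratings"] / rows["total ratings"]

most_five_star = rows["5 star ratings"].idxmax()   # largest raw count
highest_share = rows["5 star share"].idxmax()      # largest proportion
print(rows[["5 star ratings", "5 star share"]].round(3))
print("Most 5-star ratings:", most_five_star)
print("Highest 5-star share:", highest_share)
```

Among these four rows the game with the most 5-star ratings is not the one with the highest 5-star share, which suggests checking both measures before ranking games by rating quality.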