- Recently I received a lot of feedback from friends who have just changed, or are about to change, their careers towards Data Analysis, Data Science and Machine Learning. They pointed out the lack of material that bridges the start of the analysis journey and the advanced techniques.

- They are looking for detailed yet beginner-friendly Exploratory Data Analysis examples that are not overly complicated (no assortment of regression or normalization techniques, etc.) and that show them how to start and, most importantly, how to read descriptive statistics and graphs.

- After receiving this feedback, I decided to create a series of EDAs on different datasets, without making things too complicated for people taking their first steps in the DS/ML journey.

### This notebook is part of the 9 Beginner Friendly EDAs. If these EDAs are helpful to anyone, I will be more than happy.




#### **INTRO**



- In this study, we are going to perform Exploratory Data Analysis (EDA) on the Top Games on Google Play Store dataset.
- The study aims to be beginner friendly and to explain each step along the way as fully as possible.
- The dataset contains the top 100 games in each game category on Google Play Store, along with their ratings and other data such as price and number of installs.

- First, let's import the required libraries.
- We will use Plotly's interactive environment for visualization.

In [None]:
import pandas as pd
import numpy as np


import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

### Overview Stage

- Read the csv
- Look for basic information about the dataset

In [None]:
df= pd.read_csv('../input/top-play-store-games/android-games.csv')
df.head()

In [None]:
df.shape

- We have 1730 games and 15 different variables to work on.

In [None]:
df.isnull().sum()

- We have a very clean dataset, which is very rare in the real world.
- A dataset without missing values is like having a unicorn in your backyard, or Ronaldo playing for your favorite local team :)

In [None]:
df.info()

- 'installs' contains numbers, so it should be an integer or float data type. However, it is stored as an object data type. It is good to take note of that.


In [None]:
df.describe()

Before going further, let's summarize what we have got from the dataset.

- Our dataset has games from different categories, with different ratings and different numbers of installs.
- The 'installs' variable holds useful numerical information. It would be a good idea to adjust it so that we can use it as a numerical variable.
- There are no missing values, which is very good news for the data preparation stage.
- The 'category' column is a categorical variable. It would be good to see whether there are any notable differences among the game categories.
- The numerical variables deserve special attention in further analysis.
- 'paid' and 'price' seem to have a lot in common. We need to look at them in detail and, if necessary, drop one of them for simplicity.

- Let's make the necessary adjustments before moving to the analysis part.

In [None]:
df['installs'].value_counts()

- Let's make 'installs' a numerical variable by doing a small adjustment.

In [None]:
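def in_thousand(inst):
    # Express the two values given in thousands ('k') in millions ('M')
    if inst == '500.0 k':
        return '0.5 M'
    elif inst == '100.0 k':
        return '0.1 M'
    else:
        return inst

df['installs'] = df['installs'].apply(in_thousand)

# Remove the 'M' suffix and convert the remaining text to float
df['installs'] = df['installs'].str.replace('M', '').str.strip().astype('float')

# Rename the column so that the unit is clear
df = df.rename(columns={'installs': 'installs_in_million'})
df['installs_in_million'].value_counts()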
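
- The adjustment above handles the two 'k' values by name. As a minimal sketch of a more general alternative, the hypothetical helper `installs_to_millions` below reads the number and the unit separately. It assumes every value looks like `'<number> <unit>'` and that only the units 'k' and 'M' occur; the sample values are made up for illustration.

```python
import pandas as pd

def installs_to_millions(text):
    """Convert strings such as '500.0 k' or '10.0 M' to a float in millions."""
    number, unit = text.split()
    # Assumption: only 'k' (thousands) and 'M' (millions) appear in the column
    divisor = {'k': 1000.0, 'M': 1.0}[unit]
    return float(number) / divisor

# Made-up sample values, not the real dataset
sample = pd.Series(['500.0 k', '100.0 k', '10.0 M', '1000.0 M'])
installs_in_million = sample.apply(installs_to_millions)
print(installs_in_million.tolist())
```

- An unexpected unit raises a `KeyError` here, which is a useful warning that the column contains something we did not plan for.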

- Let's look at the 'price' and 'paid' columns and decide whether we need both of them or can drop one.

In [None]:
df['price'].value_counts()

In [None]:
df['paid'].value_counts()

- OK, more than 99% of the games are free, so the sample is too small to compare different price ranges.
- As a common rule of thumb, a sample smaller than 30 is usually too small to represent the population well.
- For this dataset, the 'price' column does not have much to offer for further analysis.
- So let's drop the 'price' column.
- Dropping columns and deleting rows are decisions to be taken very cautiously, and they should be based on analysis and domain knowledge.
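
- Before dropping a column, it can help to confirm that it really repeats information held elsewhere. The sketch below cross-tabulates 'paid' against "price is above zero" on a made-up sample. It assumes 'paid' is boolean and 'price' is numeric, which may not match the real columns exactly.

```python
import pandas as pd

# Made-up sample: the real 'paid' and 'price' values may look different
sample = pd.DataFrame({
    'paid':  [False, False, False, True, True],
    'price': [0.0, 0.0, 0.0, 1.99, 4.99],
})

# Rows: value of 'paid'; columns: whether 'price' is above zero
overlap = pd.crosstab(sample['paid'], sample['price'] > 0)
print(overlap)

# True only when 'paid' and 'price > 0' agree on every row
columns_agree = (sample['paid'] == (sample['price'] > 0)).all()
print(columns_agree)
```

- If all counts sit on the diagonal of the table, the two columns tell the same free/paid story, and keeping one of them is enough for that purpose.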

In [None]:
df.drop('price', axis=1, inplace=True)

In [None]:
df.info()

- Seems OK.  Let's move on to the next step: **analysis part**.

### Analysis Part

- Let's first look at the categories

### Game Categories

In [None]:
df['category'].value_counts(normalize=True)

- The categories are almost the same size.

In [None]:
fig = px.histogram(df, x="category", title='Game Categories')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

### Total Ratings

In [None]:
df['total ratings'].describe()

In [None]:

fig = px.histogram(df, x= 'total ratings', title='Total Ratings of the Games')


fig.show()

In [None]:
fig = px.box(df, x='total ratings', hover_data=['title', 'category'])
fig.update_traces(quartilemethod="inclusive")
fig.show()

- As we have seen in the histogram, quite a lot of the games fall in the range of 0 to 500,000 ratings.
- On the other hand, we have quite a number of outliers, which increase the mean and pull it further away from the median.
- We have a highly skewed distribution, more specifically a right-skewed distribution, with possible outliers on the maximum side. It is good to remember this for further analysis.
- In these kinds of situations, it is a good idea to take a median-based approach.
- The median, rather than the mean, should be used to draw insights from such distributions.
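
- To see why the median is the safer summary here, the small sketch below uses made-up rating counts where one game is far larger than the rest. The numbers are purely illustrative and do not come from the dataset.

```python
import pandas as pd

# Made-up rating counts: most games are small, one is huge
ratings = pd.Series([100, 200, 300, 400, 500, 1_000_000])

print("mean:  ", ratings.mean())    # pulled upwards by the single large value
print("median:", ratings.median())  # stays close to the typical game
print("skew:  ", ratings.skew())    # a positive value indicates right skew
```

- A single extreme value moves the mean a long way, while the median barely notices it. This is the same pattern we see in the histogram and the box plot.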

### Number of Game Installs

In [None]:
df['installs_in_million'].describe()

In [None]:
fig = px.histogram(df, x= 'installs_in_million', title='Number of Game Install in Millions')

fig.show()

In [None]:
fig = px.box(df, x='installs_in_million', hover_data=['title', 'category'])
fig.update_traces(quartilemethod="inclusive")
fig.show()

- We have a right-skewed distribution with possible outliers.
- Candy Crush Saga with 1 billion installs and Clash of Clans with 500 million installs are shown in the box plot.
- It is a good idea to always check against the dataset: it contains 2 games with 1 billion installs and 12 games with 500 million installs, while the box plot shows only one example for each of these install counts.
- The size of the outliers definitely affects the mean and the shape of the distribution.
- The difference between the mean and the median is really huge (mean = 29.1M, median = 10M).
- As mentioned above, it is a good idea to use a median-based approach.
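
- The points a box plot draws as outliers usually come from the 1.5 × IQR rule. The sketch below computes those limits by hand on made-up install counts. It uses pandas' default quantile method, so the limits can differ slightly from the ones Plotly draws with `quartilemethod="inclusive"`.

```python
import pandas as pd

# Made-up install counts in millions, not the real dataset
installs = pd.Series([1, 5, 5, 10, 10, 10, 50, 100, 500, 1000], dtype='float')

q1 = installs.quantile(0.25)
q3 = installs.quantile(0.75)
iqr = q3 - q1  # interquartile range: the width of the box

# Values beyond these limits are flagged as possible outliers
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

outliers = installs[(installs > upper_fence) | (installs < lower_fence)]
print(outliers.tolist())
```

- Being flagged by this rule does not make a value wrong. Games with hundreds of millions of installs are real, they are just very different from the typical game.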

### Paid-Free Games

In [None]:
df['paid'].value_counts(normalize=True)

In [None]:
paid_free = df['paid'].value_counts()
# value_counts() sorts by frequency, so the free games (the majority) come first
label = ['Free', 'Paid']
fig = px.pie(paid_free, values=paid_free.values, names=label,
             title='Paid & Free Games')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

- Almost all of the games in this dataset (all except 7 out of 1730) are free.

- OK, from this point on we can look deeper into the dataset.

### Total Ratings by Category

In [None]:
total_ratings_by_category = df.groupby('category')['total ratings'].mean()
total_ratings_by_category

In [None]:
fig = px.bar(total_ratings_by_category, x= total_ratings_by_category.index, y=total_ratings_by_category.values, labels={'y':'Total Ratings'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Games in the action, casual, strategy, arcade and sports categories get considerably more ratings than games in the educational and music categories.
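
- The comparison above uses the mean per category. Since we saw that total ratings are right skewed, it can be worth looking at the mean and the median side by side. The sketch below does this with `agg` on a made-up sample; the category names and numbers are illustrative only.

```python
import pandas as pd

# Made-up sample: one very large game in 'action', none in 'music'
sample = pd.DataFrame({
    'category': ['action', 'action', 'action', 'music', 'music', 'music'],
    'total ratings': [1_000, 2_000, 90_000_000, 1_000, 2_000, 3_000],
})

# One row per category, with mean, median and number of games
summary = sample.groupby('category')['total ratings'].agg(['mean', 'median', 'count'])
print(summary)
```

- When the mean of a category is far above its median, a few very large games are driving the average for that category.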

### Number of Game Installations by Game Category

In [None]:
install_by_category = df.groupby('category')['installs_in_million'].mean()
install_by_category

In [None]:
fig = px.bar(install_by_category, x= install_by_category.index, y=install_by_category.values, labels={'y':'Install in Millions'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Games in the action, arcade and casual categories are installed significantly more than games in the trivia, casino and word categories.

In [None]:
growth_by_category_30 = df.groupby('category')['growth (30 days)'].mean()
growth_by_category_30

In [None]:
fig = px.bar(growth_by_category_30, x= growth_by_category_30.index, y=growth_by_category_30, labels={'y':'Growth in 30 days'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Even though games in the action category get more ratings and are installed more than games in the other categories, games in the casino category show more growth in 30 days.


- Let's see whether the same is also true for the 60-day growth.

In [None]:
growth_by_category_60 = df.groupby('category')['growth (60 days)'].mean()
growth_by_category_60

In [None]:
fig = px.bar(growth_by_category_60, x= growth_by_category_60.index, y=growth_by_category_60, labels={'y':'Growth in 60 days'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Nope, the 60-day growth for games in the casino, adventure and role playing categories is significantly lower than their 30-day growth.
- With the given dataset we can only speculate; we cannot draw analytical conclusions. We would need more variables to explain the significant differences between the 30-day and 60-day growth for some of the categories.

- Let's look at the top 3 ranked games in each category in detail.
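
- To compare the two growth periods without switching between charts, both columns can be averaged in one `groupby` and placed side by side. This is a minimal sketch on made-up numbers; the column names follow the dataset, while the extra 'difference' column is added only for illustration.

```python
import pandas as pd

# Made-up growth values, not the real dataset
sample = pd.DataFrame({
    'category': ['casino', 'casino', 'action', 'action'],
    'growth (30 days)': [40.0, 60.0, 2.0, 4.0],
    'growth (60 days)': [1.0, 3.0, 3.0, 5.0],
})

# Mean of both growth columns per category, in one table
growth = sample.groupby('category')[['growth (30 days)', 'growth (60 days)']].mean()

# Illustrative helper column: how much the 60-day figure differs from the 30-day one
growth['difference'] = growth['growth (60 days)'] - growth['growth (30 days)']
print(growth.sort_values('difference'))
```

- Sorting by the difference puts the categories with the largest gap between the two periods at the top of the table.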

### Top 3 Ranked Games by Category

In [None]:
top_ranked_games = df[df['rank']<4][['rank','title','category', 'total ratings', 'installs_in_million', '5 star ratings']]
top_ranked_games

### Top 3 Games by Category and Their Total Ratings

In [None]:
fig = px.scatter(top_ranked_games, y='title', x='total ratings',
                 hover_data=['category', 'rank'], color='category',
                 title="Top 3 Games by Their Total Ratings")
fig.show()

- As mentioned above, games in the action, casual, strategy, arcade and sports categories get considerably more ratings than games in the educational and music categories.
- The same is true even for the top ranked games in these categories.

### Top 3 Games by Category and Their Installs in Millions

In [None]:
fig = px.scatter(top_ranked_games, y='title', x='installs_in_million',
                 hover_data=['category', 'rank'], color='category',
                 title="Top 3 Games by Their Installations in Millions")
fig.show()

- As mentioned above, games in the action, arcade and casual categories are installed significantly more than games in the trivia, casino and word categories.

- The same is true even for the top ranked games in these categories.

### Top 3 Games by Category and Their 5 star ratings

In [None]:
fig = px.scatter(top_ranked_games, y='title', x='5 star ratings',
                 hover_data=['category', 'rank'], color='category',
                 title="Top 3 Games by 5 Star Ratings")
fig.show()

- Games in the action, casual, strategy and arcade categories also get more 5 star ratings than games in the educational and music categories.

- And finally, let's see the top 20 games.

### Top 20 Games

In [None]:
top_20 = df.sort_values(by='installs_in_million', ascending=False).head(20)
top_20

In [None]:
fig = px.bar(top_20, x='title', y='installs_in_million', hover_data=['5 star ratings'], color='category')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- The top 2 games have 1 billion installs.
- The following 12 games have 500 million installs.
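
- Many games share exactly the same install count, so a cut at a fixed number of rows may fall in the middle of a group of ties. The sketch below shows the difference between `head` and `nlargest` with `keep='all'` on a made-up sample; the titles and numbers are illustrative only.

```python
import pandas as pd

# Made-up sample with three games tied at 500 million installs
sample = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'installs_in_million': [1000.0, 500.0, 500.0, 500.0, 100.0],
})

# head(2) cuts after two rows, even though B, C and D are tied
top_2_cut = sample.sort_values(by='installs_in_million', ascending=False).head(2)

# keep='all' keeps every row tied with the last selected value
top_2_all = sample.nlargest(2, 'installs_in_million', keep='all')

print(len(top_2_cut), len(top_2_all))
```

- With `keep='all'` the result can contain more rows than requested, which makes it clear that the tied games are equally entitled to a place in the list.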

In [None]:
fig = px.bar(top_20, x='title', y='total ratings', hover_data=['5 star ratings'], color='category')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- It is important to see that, even though Candy Crush Saga and Subway Surfers have 1 billion installs, this does not automatically mean that they get the highest total number of ratings.
- Garena Free Fire-World Series, with 500 million installs, has more than 86 million total ratings.

## This notebook is a part of the 9 Beginner Friendly EDAs
## If you like this one, you can also check out other notebooks in the Beginner Friendly EDAs series!

* [Data Analyst Jobs - EDA](https://www.kaggle.com/kaanboke/plotly-data-analyst-jobs)
* [Hollywood Top Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-movies)
* [UDEMY Courses EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-udemy)
* [World Happiness Report - EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-eda)
* [Countries Life Expectancy](https://www.kaggle.com/kaanboke/plotly-beginner-friendly)
* [Netflix Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-netflix)
* [Amazon Top 50 Bestselling Books EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-amazon)
* [London bike Sharing EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-london-bike)


- Thanks to the dataset contributor for this data. I really enjoyed working on it.

- It was quite a pleasure to share this detailed, beginner friendly EDA with you. Thanks for your time.

- All the best