# Team 2 - Google Play Store

![](https://www.brandnol.com/wp-content/uploads/2019/04/Google-Play-Store-Search.jpg)

_For more information about the dataset, read [here](https://www.kaggle.com/lava18/google-play-store-apps)._

## Your tasks
- Name your team!
- Read the source and do some quick research to understand more about the dataset and its topic
- Clean the data
- Perform Exploratory Data Analysis on the dataset
- Analyze the data more deeply and extract insights
- Visualize your analysis on Google Data Studio
- Present your works in front of the class and guests next Monday

## Submission Guide
- Create a Github repository for your project
- Upload the dataset (.csv file) and the Jupyter Notebook to your Github repository. In the Jupyter Notebook, **include the link to your Google Data Studio report**.
- Submit your works through this [Google Form](https://forms.gle/oxtXpGfS8JapVj3V8).

## Tips for Data Cleaning, Manipulation & Visualization
- Here are some of our tips for Data Cleaning, Manipulation & Visualization. [Click here](https://hackmd.io/cBNV7E6TT2WMliQC-GTw1A)

_____________________________

## Some Hints for This Dataset:
- There are lots of null values. How should we handle them?
- Column `Installs` and `Size` have some strange values. Can you identify them?
- Values in `Size` column are currently in different format: `M`, `k`. And how about the value `Varies with device`?
- `Price` column is not in the right data type
- And more...


# TODO
- Remove `NaN` from Rating.
- Convert `M` and `k` to milions and thounsands in Size.
- - Remove `+` in Installs.
- Change `Varies with device` in Size with median of the same Genres.
- Change `Varies with device` in Android Ver with 'unknown' text.
- Remove `$` in Price.

In [145]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [146]:
df = pd.read_csv('google-play-store.csv')

In [147]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [148]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [149]:
# preprocess Rating
print('NaN percent: ', df['Rating'].isna().sum() /  df['Rating'].size)
print(df['Category'].unique())
print(df[df['Category'] == '1.9'])

NaN percent:  0.13596531685268887
['ART_AND_DESIGN' 'AUTO_AND_VEHICLES' 'BEAUTY' 'BOOKS_AND_REFERENCE'
 'BUSINESS' 'COMICS' 'COMMUNICATION' 'DATING' 'EDUCATION' 'ENTERTAINMENT'
 'EVENTS' 'FINANCE' 'FOOD_AND_DRINK' 'HEALTH_AND_FITNESS' 'HOUSE_AND_HOME'
 'LIBRARIES_AND_DEMO' 'LIFESTYLE' 'GAME' 'FAMILY' 'MEDICAL' 'SOCIAL'
 'SHOPPING' 'PHOTOGRAPHY' 'SPORTS' 'TRAVEL_AND_LOCAL' 'TOOLS'
 'PERSONALIZATION' 'PRODUCTIVITY' 'PARENTING' 'WEATHER' 'VIDEO_PLAYERS'
 'NEWS_AND_MAGAZINES' 'MAPS_AND_NAVIGATION' '1.9']
                                           App Category  Rating Reviews  \
10472  Life Made WI-Fi Touchscreen Photo Frame      1.9    19.0    3.0M   

         Size Installs Type     Price Content Rating             Genres  \
10472  1,000+     Free    0  Everyone            NaN  February 11, 2018   

      Last Updated Current Ver Android Ver  
10472       1.0.19  4.0 and up         NaN  


In [150]:
# We have a strange Category '1.9' for App name 'Life Made WI-Fi Touchscreen Photo Frame'
# This app is PHOTOGRAPHY
df['Category'][10472] = 'PHOTOGRAPHY'
df.loc[10472]

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                      PHOTOGRAPHY
Rating                                                 19
Reviews                                              3.0M
Size                                               1,000+
Installs                                             Free
Type                                                    0
Price                                            Everyone
Content Rating                                        NaN
Genres                                  February 11, 2018
Last Updated                                       1.0.19
Current Ver                                    4.0 and up
Android Ver                                           NaN
Name: 10472, dtype: object

In [151]:
print(df.groupby('Category').mean()['Rating'])

Category
ART_AND_DESIGN         4.358065
AUTO_AND_VEHICLES      4.190411
BEAUTY                 4.278571
BOOKS_AND_REFERENCE    4.346067
BUSINESS               4.121452
COMICS                 4.155172
COMMUNICATION          4.158537
DATING                 3.970769
EDUCATION              4.389032
ENTERTAINMENT          4.126174
EVENTS                 4.435556
FAMILY                 4.192272
FINANCE                4.131889
FOOD_AND_DRINK         4.166972
GAME                   4.286326
HEALTH_AND_FITNESS     4.277104
HOUSE_AND_HOME         4.197368
LIBRARIES_AND_DEMO     4.178462
LIFESTYLE              4.094904
MAPS_AND_NAVIGATION    4.051613
MEDICAL                4.189143
NEWS_AND_MAGAZINES     4.132189
PARENTING              4.300000
PERSONALIZATION        4.335987
PHOTOGRAPHY            4.238679
PRODUCTIVITY           4.211396
SHOPPING               4.259664
SOCIAL                 4.255598
SPORTS                 4.223511
TOOLS                  4.047411
TRAVEL_AND_LOCAL       4.109292

In [152]:
# Convert M and k to milions and thounsands in Size.
def convert_size(sizeStr):
    if('M' in sizeStr):
        return float(sizeStr.split('M')[0]) * 1000000
    if('k' in sizeStr):
        return float(sizeStr.split('k')[0]) * 1000
    else:
        return -1 # we mask -1 for `Varial of Device` so that df['Size'] will be float64 type
    
df['Size'] = df['Size'].apply(convert_size)
df['Size'] 

0        19000000.0
1        14000000.0
2         8700000.0
3        25000000.0
4         2800000.0
5         5600000.0
6        19000000.0
7        29000000.0
8        33000000.0
9         3100000.0
10       28000000.0
11       12000000.0
12       20000000.0
13       21000000.0
14       37000000.0
15        2700000.0
16        5500000.0
17       17000000.0
18       39000000.0
19       31000000.0
20       14000000.0
21       12000000.0
22        4200000.0
23        7000000.0
24       23000000.0
25        6000000.0
26       25000000.0
27        6100000.0
28        4600000.0
29        4200000.0
            ...    
10811     3900000.0
10812    13000000.0
10813     2700000.0
10814    31000000.0
10815     4900000.0
10816     6800000.0
10817     8000000.0
10818     1500000.0
10819     3600000.0
10820     8600000.0
10821     2500000.0
10822     3100000.0
10823     2900000.0
10824    82000000.0
10825     7700000.0
10826          -1.0
10827    13000000.0
10828    13000000.0
10829     7400000.0


In [168]:
# `Varies with device` in Size
print('Varies with device percent: ', df[df['Size']== - 1]['Size'].size /  df['Size'].size * 100 , '%')
medianSizeEachCategory = df[df['Size'] != 'Varies with device' ].groupby('Category').mean()
medianSizeEachCategory

Varies with device percent:  15.644313255234756 %


Unnamed: 0_level_0,Rating,Size
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
ART_AND_DESIGN,4.358065,11800000.0
AUTO_AND_VEHICLES,4.190411,17679840.0
BEAUTY,4.278571,12233960.0
BOOKS_AND_REFERENCE,4.346067,11351650.0
BUSINESS,4.121452,12584490.0
COMICS,4.155172,11462550.0
COMMUNICATION,4.158537,8057305.0
DATING,3.970769,15062470.0
EDUCATION,4.389032,14793470.0
ENTERTAINMENT,4.126174,13200000.0


In [211]:
# set median value
def get_median_size(category):
    return medianSizeEachCategory.loc[ category ,'Size']

for index, row in df.iterrows():
    if(row['Size']== -1):
        df.set_value(index, 'Size' ,  get_median_size( df.loc[index]['Category']))

print(df[df['Size']== -1]['Size'].size)

  


0
