**Exploratory analysis of Google Play Store app data**

**Synopsis**: In this notebook, we perform a preliminary exploratory analysis of data related to ~10k apps in the Google Play Store.

**Data source**: The data was obtained from the Kaggle database https://www.kaggle.com/lava18/google-play-store-apps.

In [1]:
# Packages and loading data
import numpy as np
import pandas as pd
import csv

app_data = pd.read_csv("googleplaystore.csv")

app_data.head(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up


In [2]:
print(app_data.describe(include = "all"))

           App Category       Rating Reviews                Size    Installs  \
count    10841    10841  9367.000000   10841               10841       10841   
unique    9660       34          NaN    6002                 462          22   
top     ROBLOX   FAMILY          NaN       0  Varies with device  1,000,000+   
freq         9     1972          NaN     596                1695        1579   
mean       NaN      NaN     4.193338     NaN                 NaN         NaN   
std        NaN      NaN     0.537431     NaN                 NaN         NaN   
min        NaN      NaN     1.000000     NaN                 NaN         NaN   
25%        NaN      NaN     4.000000     NaN                 NaN         NaN   
50%        NaN      NaN     4.300000     NaN                 NaN         NaN   
75%        NaN      NaN     4.500000     NaN                 NaN         NaN   
max        NaN      NaN    19.000000     NaN                 NaN         NaN   

         Type  Price Content Rating Gen

There are a few issues:
- There seem to be many apps with the same name.
- The max rating (19) is out of scale.

However, let us first check which are the unique categories.

In [3]:
app_data["Category"].value_counts()

FAMILY                 1972
GAME                   1144
TOOLS                   843
MEDICAL                 463
BUSINESS                460
PRODUCTIVITY            424
PERSONALIZATION         392
COMMUNICATION           387
SPORTS                  384
LIFESTYLE               382
FINANCE                 366
HEALTH_AND_FITNESS      341
PHOTOGRAPHY             335
SOCIAL                  295
NEWS_AND_MAGAZINES      283
SHOPPING                260
TRAVEL_AND_LOCAL        258
DATING                  234
BOOKS_AND_REFERENCE     231
VIDEO_PLAYERS           175
EDUCATION               156
ENTERTAINMENT           149
MAPS_AND_NAVIGATION     137
FOOD_AND_DRINK          127
HOUSE_AND_HOME           88
AUTO_AND_VEHICLES        85
LIBRARIES_AND_DEMO       85
WEATHER                  82
ART_AND_DESIGN           65
EVENTS                   64
COMICS                   60
PARENTING                60
BEAUTY                   53
1.9                       1
Name: Category, dtype: int64

It is not clear what the category '1.9' represents, let us investigate further.

In [4]:
bool_category_1_9 = app_data["Category"] == "1.9"
app_data.loc[bool_category_1_9, :]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


In [5]:
app_data.loc[bool_category_1_9, "Android Ver"] = app_data.loc[bool_category_1_9, "Current Ver"]
app_data.loc[bool_category_1_9, "Current Ver"] = app_data.loc[bool_category_1_9, "Last Updated"]
app_data.loc[bool_category_1_9, "Last Updated"] = app_data.loc[bool_category_1_9, "Genres"]
app_data.loc[bool_category_1_9, "Genres"] = app_data.loc[bool_category_1_9, "Content Rating"]
app_data.loc[bool_category_1_9, "Content Rating"] = app_data.loc[bool_category_1_9, "Price"]
app_data.loc[bool_category_1_9, "Price"] = app_data.loc[bool_category_1_9, "Type"]
app_data.loc[bool_category_1_9, "Type"] = app_data.loc[bool_category_1_9, "Installs"]
app_data.loc[bool_category_1_9, "Installs"] = app_data.loc[bool_category_1_9, "Size"]
app_data.loc[bool_category_1_9, "Size"] = app_data.loc[bool_category_1_9, "Reviews"]
app_data.loc[bool_category_1_9, "Reviews"] = app_data.loc[bool_category_1_9, "Rating"]
app_data.loc[bool_category_1_9, "Rating"] = app_data.loc[bool_category_1_9, "Category"]
app_data.loc[bool_category_1_9, "Category"] = np.nan

app_data.loc[bool_category_1_9, :]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,,1.9,19,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up


This also solves the problem of the maximum rating being 19. Next, we investigate the apps with the same name.

In [6]:
app_data["App"].value_counts().head(10)

ROBLOX                                               9
CBS Sports App - Scores, News, Stats & Watch Live    8
8 Ball Pool                                          7
Duolingo: Learn Languages Free                       7
Candy Crush Saga                                     7
ESPN                                                 7
Zombie Catchers                                      6
Temple Run 2                                         6
Helix Jump                                           6
Bowmasters                                           6
Name: App, dtype: int64

In [7]:
bool_ROBLOX = app_data["App"] == "ROBLOX" 
app_data.loc[bool_ROBLOX,:]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1653,ROBLOX,GAME,4.5,4447388,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
1701,ROBLOX,GAME,4.5,4447346,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
1748,ROBLOX,GAME,4.5,4448791,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
1841,ROBLOX,GAME,4.5,4449882,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
1870,ROBLOX,GAME,4.5,4449910,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
2016,ROBLOX,FAMILY,4.5,4449910,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
2088,ROBLOX,FAMILY,4.5,4450855,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
2206,ROBLOX,FAMILY,4.5,4450890,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
4527,ROBLOX,FAMILY,4.5,4443407,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up


In [8]:
bool_Candy_Crush = app_data["App"] == "Candy Crush Saga" 
app_data.loc[bool_Candy_Crush,:]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1655,Candy Crush Saga,GAME,4.4,22426677,74M,"500,000,000+",Free,0,Everyone,Casual,"July 5, 2018",1.129.0.2,4.1 and up
1705,Candy Crush Saga,GAME,4.4,22428456,74M,"500,000,000+",Free,0,Everyone,Casual,"July 5, 2018",1.129.0.2,4.1 and up
1751,Candy Crush Saga,GAME,4.4,22428456,74M,"500,000,000+",Free,0,Everyone,Casual,"July 5, 2018",1.129.0.2,4.1 and up
1842,Candy Crush Saga,GAME,4.4,22429716,74M,"500,000,000+",Free,0,Everyone,Casual,"July 5, 2018",1.129.0.2,4.1 and up
1869,Candy Crush Saga,GAME,4.4,22430188,74M,"500,000,000+",Free,0,Everyone,Casual,"July 5, 2018",1.129.0.2,4.1 and up
1966,Candy Crush Saga,GAME,4.4,22430188,74M,"500,000,000+",Free,0,Everyone,Casual,"July 5, 2018",1.129.0.2,4.1 and up
3994,Candy Crush Saga,FAMILY,4.4,22419455,74M,"500,000,000+",Free,0,Everyone,Casual,"July 5, 2018",1.129.0.2,4.1 and up


These differ only by number of reviews so they are repeated entries and we drop them accordingly.

In [9]:
app_data_clean = app_data.drop_duplicates("App")

print("New dataframe:", app_data_clean.shape)
print("Old dataframe:", app_data.shape)

app_data_clean["App"].value_counts().head()

New dataframe: (9660, 13)
Old dataframe: (10841, 13)


Public Digital Library                  1
McClatchy DC Bureau                     1
EY-Parthenon                            1
Real Basketball                         1
Profile Pictures and DP for Whatsapp    1
The DL Hughley Show                     1
Acorns - Invest Spare Change            1
DRAGON BALL LEGENDS                     1
Centimeter Ruler                        1
Fortune City - A Finance App            1
Name: App, dtype: int64

In [17]:
app_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9660 entries, 0 to 10840
Data columns (total 13 columns):
App               9660 non-null object
Category          9659 non-null object
Rating            8197 non-null object
Reviews           9660 non-null object
Size              9660 non-null object
Installs          9660 non-null object
Type              9659 non-null object
Price             9660 non-null object
Content Rating    9660 non-null object
Genres            9659 non-null object
Last Updated      9660 non-null object
Current Ver       9652 non-null object
Android Ver       9658 non-null object
dtypes: object(13)
memory usage: 1.0+ MB


The dtypes are incorrect, we fix this next.

In [36]:
app_data_clean["Rating"] = float(app_data_clean["Rating"])

TypeError: cannot do label indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [4.1] of <class 'float'>

We have cleaned the database. We now perform some exploratory analysis. We start by looking at the top apps in terms of number of installs (1,000,000,000+).

In [13]:
bool_top_apps = app_data_clean["Installs"] == "1,000,000,000+"

# app_data_clean.loc[bool_top_apps,:]
app_data_clean.loc[bool_top_apps,:].describe()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
count,20,20,20.0,20,20,20,20,20,20,20,20,20,20
unique,20,11,8.0,20,3,1,1,1,3,11,9,3,3
top,Google Play Movies & TV,COMMUNICATION,4.3,10484169,Varies with device,"1,000,000,000+",Free,0,Everyone,Communication,"August 3, 2018",Varies with device,Varies with device
freq,1,6,5.0,1,18,20,20,20,11,6,5,18,18


In [11]:
app_data_clean.describe()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
count,9660,9659,8197.0,9660,9660,9660,9659,9660,9660,9659,9660,9652,9658
unique,9660,33,40.0,5331,461,21,2,92,6,118,1377,2817,33
top,Public Digital Library,FAMILY,4.3,0,Varies with device,"1,000,000+",Free,0,Everyone,Tools,"August 3, 2018",Varies with device,4.1 and up
freq,1,1832,897.0,593,1227,1417,8903,8904,7904,826,252,1055,2202
