# App Store Analytics
Explore data on thousands of apps from the Google Play Store.

In [16]:
import pandas as pd
import numpy as np

## Data Exploration

In [17]:
df_apps = pd.read_csv('data/apps.csv')
df_apps.sample(n=10, replace=True, random_state=1)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
235,Trovami se ci riesci,GAME,5.0,11,6.1,10,Free,0,Everyone,Arcade,"March 11, 2017",2.3 and up
5192,Castlight Mobile,MEDICAL,3.7,529,26.0,100000,Free,0,Everyone,Medical,"July 30, 2018",5.0 and up
905,California Cop Assist CA Cop,PRODUCTIVITY,3.0,5,7.0,100,Paid,$4.99,Everyone,Productivity,"February 11, 2014",2.2 and up
7813,QuickShortcutMaker,PERSONALIZATION,4.6,41000,2.0,1000000,Free,0,Everyone,Personalization,"February 23, 2014",1.6 and up
2895,HCP Anywhere,BUSINESS,4.7,114,8.6,5000,Free,0,Everyone,Business,"March 30, 2018",4.3 and up
5056,Type S LED,LIFESTYLE,3.5,628,8.35,100000,Free,0,Everyone,Lifestyle,"August 14, 2017",4.3 and up
144,Cy-Fair Houston Chamber,BUSINESS,,0,5.0,5,Free,0,Everyone,Business,"June 6, 2018",4.1 and up
4225,Cl-app!,SPORTS,,41,0.344727,10000,Free,0,Everyone,Sports,"May 2, 2013",2.0 and up
7751,PJ Masks: HQ,FAMILY,4.1,13731,55.0,1000000,Free,0,Everyone,Entertainment;Action & Adventure,"June 22, 2018",4.0 and up
3462,Old: CL-150,FAMILY,3.8,120,24.0,10000,Free,0,Everyone,Education,"May 21, 2018",4.0.3 and up


In [18]:
df_apps.shape

(10841, 12)

So our DataFrame has **10841** rows and **12** columns.

Let's check if there are any `NaN` values in it.

In [19]:
df_apps.isna().any()

App               False
Category          False
Rating             True
Reviews           False
Size_MBs          False
Installs          False
Type               True
Price             False
Content_Rating    False
Genres            False
Last_Updated      False
Android_Ver        True
dtype: bool

We can see that there are `NaN` values in the **Rating**, **Type** and **Android_Ver** columns.

In [20]:
df_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  int64  
 4   Size_MBs        10841 non-null  float64
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content_Rating  10841 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last_Updated    10841 non-null  object 
 11  Android_Ver     10839 non-null  object 
dtypes: float64(2), int64(1), object(9)
memory usage: 1016.5+ KB


Next, we remove the **Last_Updated** and **Android_Ver** columns because this analysis does not need them.

In [21]:
df_apps = df_apps.drop(['Last_Updated', 'Android_Ver'], axis=1)

Let's see how many rows in the **Rating** and the **Type** columns have `NaN` values.

In [22]:
print(df_apps['Rating'].isna().value_counts())
print(df_apps['Type'].isna().value_counts())

False    9367
True     1474
Name: Rating, dtype: int64
False    10840
True         1
Name: Type, dtype: int64
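
As a side note, the same counts can be read for every column at once with `isna().sum()`. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not `df_apps`), so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_apps; the values are invented for illustration.
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Rating': [4.5, np.nan, 3.9, np.nan],
    'Type': ['Free', 'Paid', None, 'Free'],
})

# isna() gives a boolean frame; summing it counts the True values per column.
nan_counts = toy.isna().sum()
print(nan_counts)
```

On the real data this would be `df_apps.isna().sum()`, which should report the same totals as the two `value_counts()` calls above.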


There are 1474 rows with `NaN` in **Rating** and 1 row with `NaN` in **Type**. Let's drop them.

In [23]:
df_apps = df_apps.dropna(axis=0)
df_apps.isna().any()

App               False
Category          False
Rating            False
Reviews           False
Size_MBs          False
Installs          False
Type              False
Price             False
Content_Rating    False
Genres            False
dtype: bool

In [24]:
df_apps.shape

(9367, 10)
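
Calling `dropna` with no arguments drops a row if *any* column is missing. That works here because **Rating** and **Type** are the only remaining columns with `NaN`. If other columns could contain gaps that we want to keep, the `subset` parameter restricts the check. A minimal sketch on made-up data (`toy` and its `Notes` column are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame; 'Notes' is a hypothetical column with gaps we do not care about.
toy = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Rating': [4.5, np.nan, 3.9],
    'Type': ['Free', 'Paid', 'Free'],
    'Notes': [None, 'x', None],
})

# Drop a row only when Rating or Type is missing.
cleaned = toy.dropna(subset=['Rating', 'Type'])

# Without subset, every row here has at least one NaN and would be dropped.
dropped_all = toy.dropna()

print(cleaned)
print(len(dropped_all))
```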

We are not done yet. We also have to find duplicated apps and delete them. We do this by treating rows that share the same values in the **App**, **Type** and **Price** columns as duplicates.

For example, let's look for duplicated **Instagram** entries in the DataFrame.

In [25]:
df_instagram = df_apps[df_apps['App'] == 'Instagram']
df_instagram

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social


In [26]:
df_instagram.duplicated()

10806    False
10808    False
10809     True
10810    False
dtype: bool

We can see that there are 4 entries for **Instagram**, but only the one at index **10809** is flagged as a duplicate, because it is the only row that exactly matches an earlier one (the others differ in **Reviews**). If we simply call [.drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html?highlight=drop_duplicate) on our DataFrame, it will only remove the row at index **10809**.

Therefore we have to tell pandas how to identify a duplicate. The `subset` parameter takes the names of the columns to compare; in this case, we look for duplicates in the **App**, **Type** and **Price** columns.

In [27]:
df_apps[df_apps.duplicated(subset=['App', 'Type', 'Price'])]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
946,420 BZ Budeze Delivery,MEDICAL,5.0,2,11.0,100,Free,0,Mature 17+,Medical
1133,MouseMingle,DATING,2.7,3,3.9,100,Free,0,Mature 17+,Dating
1196,"Cardiac diagnosis (heart rate, arrhythmia)",MEDICAL,4.4,8,6.5,100,Paid,$12.99,Everyone,Medical
1231,Sway Medical,MEDICAL,5.0,3,22.0,100,Free,0,Everyone,Medical
1247,Chat Kids - Chat Room For Kids,DATING,4.7,6,4.9,100,Free,0,Mature 17+,Dating
...,...,...,...,...,...,...,...,...,...,...
10836,Subway Surfers,GAME,4.5,27723193,76.0,1000000000,Free,0,Everyone 10+,Arcade
10837,Subway Surfers,GAME,4.5,27724094,76.0,1000000000,Free,0,Everyone 10+,Arcade
10838,Subway Surfers,GAME,4.5,27725352,76.0,1000000000,Free,0,Everyone 10+,Arcade
10839,Subway Surfers,GAME,4.5,27725352,76.0,1000000000,Free,0,Everyone 10+,Arcade
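
To see the effect of `subset` in isolation, here is a small sketch on made-up rows that mimic the Instagram case (`toy` and its values are hypothetical). With the default `keep='first'`, the first occurrence is never flagged; only later matches are.

```python
import pandas as pd

# Toy rows: three entries for the same app whose Reviews counts differ.
toy = pd.DataFrame({
    'App': ['Instagram', 'Instagram', 'Instagram', 'Other'],
    'Reviews': [100, 105, 100, 7],
    'Type': ['Free', 'Free', 'Free', 'Free'],
    'Price': ['0', '0', '0', '0'],
})

# Compare whole rows: only an exact copy of an earlier row is flagged.
full_row = toy.duplicated()

# Compare only the key columns: every repeat of the app is flagged.
by_key = toy.duplicated(subset=['App', 'Type', 'Price'])

print(full_row.tolist())
print(by_key.tolist())
```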


In [28]:
df_apps = df_apps.drop_duplicates(subset=['App', 'Type', 'Price'])
df_apps[df_apps.App == 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social


We have now removed the duplicated Instagram rows, along with the duplicates of other apps.

In [29]:
df_apps.shape

(8199, 10)
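
Note that `drop_duplicates` keeps the first occurrence by default, so which **Reviews** value survives depends on row order. One possible alternative, not what this notebook does, is to sort first so that the row with the most reviews is kept. A sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Toy rows with differing Reviews counts for the same app.
toy = pd.DataFrame({
    'App': ['Instagram', 'Instagram', 'Instagram', 'Other'],
    'Reviews': [100, 105, 100, 7],
    'Type': ['Free', 'Free', 'Free', 'Free'],
    'Price': ['0', '0', '0', '0'],
})

key = ['App', 'Type', 'Price']

# Sort so the highest Reviews count comes first, keep that row per key,
# then restore the original index order.
deduped = (
    toy.sort_values('Reviews', ascending=False)
       .drop_duplicates(subset=key)
       .sort_index()
)

# Sanity check: no key combination appears more than once.
has_duplicates = deduped.duplicated(subset=key).any()

print(deduped)
print(has_duplicates)
```

The same `duplicated(subset=...).any()` check can be run on `df_apps` after the cell above to confirm that no duplicates remain.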