<a href="https://colab.research.google.com/github/MeidanGR/GooglePlayStore_Preprocessing_Python/blob/main/GooglePlayStore_DataPreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Google Play Store Data Preprocessing**
This data preprocessing process was built for an academic project in a BI course. The resulting data is analyzed within that project.

Major thanks to **Gautham Prakash** for publishing and maintaining this dataset, which you can find at https://www.kaggle.com/gauthamp10/google-playstore-apps


# **Google auth & libraries**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np

# **CSV LOAD**


In [43]:
#Loading csv
df = pd.read_csv('/content/drive/My Drive/Google-Playstore.csv')

print("Dataframe original shape is", np.shape(df), end='\n\n')
print(df.info())
df.sample(2)

Dataframe original shape is (603047, 23)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 603047 entries, 0 to 603046
Data columns (total 23 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   App Name           603046 non-null  object 
 1   App Id             603047 non-null  object 
 2   Category           603047 non-null  object 
 3   Rating             599287 non-null  float64
 4   Rating Count       599287 non-null  float64
 5   Installs           602875 non-null  object 
 6   Minimum Installs   602875 non-null  float64
 7   Maximum Installs   603047 non-null  int64  
 8   Free               603047 non-null  bool   
 9   Price              603047 non-null  float64
 10  Currency           602866 non-null  object 
 11  Size               603047 non-null  object 
 12  Minimum Android    602556 non-null  object 
 13  Developer Id       603045 non-null  object 
 14  Developer Website  423783 non-null  object 
 15  Developer

Unnamed: 0,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,Price,Currency,Size,Minimum Android,Developer Id,Developer Website,Developer Email,Released,Last Updated,Content Rating,Privacy Policy,Ad Supported,In App Purchases,Editors Choice
2811,"100 великих авантюристов,обман",com.biika.velikihAvantyristov,Books & Reference,4.6,178.0,"10,000+",10000.0,39972,True,0.0,USD,9.9M,2.3 and up,biika.studio,https://d631eeb21.app-ads-txt.com,biika.studio@gmail.com,"Mar 19, 2015",27/02/2020 15:21,Everyone,https://docs.google.com/document/d/1bQkG6-p7X8...,True,False,False
196738,Estonian MessagEase Wordlist,com.exideas.cp.words.estonian,Tools,0.0,0.0,100+,100.0,223,True,0.0,USD,2.7M,2.1 and up,Exideas,http://www.exideas.com,support@exideas.com,"Mar 10, 2013",26/03/2013 3:16,Everyone,http://www.exideas.com/ME/privacy.php,False,False,False


# **DATA FILTERING**
## Agenda: 
Reduce the data to ~100K rows, which gives a manageable dataframe size and shape and a faster run in Google Data Studio.

## Guidelines:
- **Minimum Android** is the minimum Android version on which the app can run. Only the most popular values are kept: "4.1 and up", "4.0.3 and up", "6.0 and up", and "Varies with device".
- **Size** must be at least 1 MB (sizes given in kilobytes, ending in "k", are dropped).
- **Currency** will be USD only.
- **Last updated** during 2020.
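
As an illustration of these guidelines, here is a minimal sketch that expresses the same four conditions as a boolean mask instead of a `query` string. The `toy` dataframe and its values are made up for the example; only the column names and the kept Android values come from this notebook.

```python
import pandas as pd

# Toy rows standing in for the real CSV (values are invented for illustration).
toy = pd.DataFrame({
    "Minimum_Android": ["4.1 and up", "5.0 and up", "Varies with device", "6.0 and up"],
    "Size": ["9.9M", "2.7M", "512k", "38M"],
    "Currency": ["USD", "USD", "USD", "EUR"],
    "Last_Updated": ["27/02/2020 15:21", "20/07/2020 18:24",
                     "05/06/2020 3:30", "09/07/2020 22:11"],
})

# The four Minimum Android values kept by the guidelines.
keep_android = ["4.1 and up", "4.0.3 and up", "6.0 and up", "Varies with device"]

mask = (
    toy["Minimum_Android"].isin(keep_android)   # popular Android versions only
    & ~toy["Size"].str.endswith("k")            # drop kilobyte-sized apps
    & (toy["Currency"] == "USD")                # USD only
    & toy["Last_Updated"].str.contains("2020")  # updated during 2020
)
filtered = toy[mask]
print(filtered)
```

Only the first toy row passes all four conditions; each of the others fails exactly one of them.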


In [44]:
#Column labels space fix
df.columns = [x.replace(" ", "_") for x in df.columns]

#Rows SELECT
df = df.query('(Minimum_Android == "4.1 and up" | Minimum_Android == "4.0.3 and up" | Minimum_Android == "Varies with device" | \
Minimum_Android == "6.0 and up") & not Size.str.endswith("k") & Currency == "USD" & Last_Updated.str.contains("2020")',engine='python')

#Columns FILTER
df = df.filter(items=['App_Name', 'Category', 'Rating', 'Rating_Count', 'Installs', 'Minimum_Installs','Maximum_Installs', 'Price', 'Size','Released','Last_Updated','Ad_Supported', 'In_App_Purchases'])

#Saving a copy of dataframe
dfcopy = df.copy(deep=True)


In [45]:

print("Dataframe reduced shape is", np.shape(df))
df.sample(10)

Dataframe reduced shape is (112092, 13)


Unnamed: 0,App_Name,Category,Rating,Rating_Count,Installs,Minimum_Installs,Maximum_Installs,Price,Size,Released,Last_Updated,Ad_Supported,In_App_Purchases
314008,LastPass Authenticator,Productivity,4.2,3162.0,"500,000+",500000.0,962310,0.0,38M,"Mar 16, 2016",20/07/2020 18:24,False,False
537700,The Healthy Mummy,Health & Fitness,4.6,2206.0,"50,000+",50000.0,87572,0.0,8.3M,"Oct 16, 2018",21/07/2020 11:39,False,True
403479,Oden Master,Simulation,3.9,2386.0,"100,000+",100000.0,295345,0.0,5.2M,"Nov 14, 2017",05/06/2020 3:30,True,True
247561,Happy Birthday Greetings HD,Lifestyle,0.0,0.0,500+,500.0,816,0.0,3.4M,"Jul 2, 2016",22/02/2020 22:01,True,False
519618,Super Wellness,Health & Fitness,0.0,0.0,10+,10.0,30,0.0,7.9M,"Aug 18, 2017",11/06/2020 23:25,False,False
308661,Kumpulan Lagu HITS 2020 Offline plus lirik,Music & Audio,3.4,34.0,"10,000+",10000.0,10164,0.0,36M,"Dec 30, 2019",29/03/2020 12:18,True,False
260024,Horoscope of Money and Career - Daily & Free,Finance,4.4,1142.0,"100,000+",100000.0,152300,0.0,3.6M,"Apr 5, 2013",29/06/2020 19:45,True,True
269084,한국항공대학교 모바일학생증(KAU ID),Education,1.2,131.0,"5,000+",5000.0,7619,0.0,3.7M,"Feb 24, 2017",08/01/2020 10:26,False,False
67229,Baby Story Tracker Milestone Sticker Photo Editor,Photography,4.7,1531.0,"100,000+",100000.0,412088,0.0,13M,"Nov 18, 2019",09/07/2020 22:11,True,True
283898,스마트 원패스 시스템,Tools,2.8,95.0,"10,000+",10000.0,25504,0.0,3.9M,"Oct 22, 2017",25/03/2020 8:38,False,False


# **CHECKPOINT DATA INFO**

In [46]:
print("Dataframe NULL count:")
print(df.isnull().sum(axis = 0), end="\n\n")

df.info()

Dataframe NULL count:
App_Name               1
Category               0
Rating              1880
Rating_Count        1880
Installs               0
Minimum_Installs       0
Maximum_Installs       0
Price                  0
Size                   0
Released            6338
Last_Updated           0
Ad_Supported           0
In_App_Purchases       0
dtype: int64

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112092 entries, 20 to 603046
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   App_Name          112091 non-null  object 
 1   Category          112092 non-null  object 
 2   Rating            110212 non-null  float64
 3   Rating_Count      110212 non-null  float64
 4   Installs          112092 non-null  object 
 5   Minimum_Installs  112092 non-null  float64
 6   Maximum_Installs  112092 non-null  int64  
 7   Price             112092 non-null  float64
 8   Size              112092 non-null  object 

# **DATA OPTIMIZATION**
- The single null value in **App_Name** is not acceptable for further analysis and is removed, along with the 4 apps named "#NAME?".
- The null values in **Rating** and **Rating_Count** are replaced with the column mean.
- Column dtypes are corrected where necessary.
- Column value formats are corrected where necessary.
- An **Ad_bool** column is added. It mirrors **Ad_Supported** but holds 0 and 1 values, so it can be used in aggregations such as Sum.
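
The mean replacement and the 0/1 flag can be sketched on a toy dataframe as follows. This is an illustration only: the values are invented, and the direct `astype("int8")` cast on the boolean column is a shorter alternative, not the string-replacement route used in the cells of this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with one missing rating (values are invented for illustration).
toy = pd.DataFrame({
    "Rating": [4.0, np.nan, 2.0],
    "Rating_Count": [10.0, np.nan, 30.0],
    "Ad_Supported": [True, False, True],
})

# Replace nulls with the column mean (mean() skips NaN by default).
for col in ["Rating", "Rating_Count"]:
    toy[col] = toy[col].fillna(toy[col].mean())

# A boolean column casts directly to 0/1 integers.
toy["Ad_bool"] = toy["Ad_Supported"].astype("int8")

# The 0/1 column can now be summed to count ad-supported apps.
ads_total = int(toy["Ad_bool"].sum())
print(toy)
print("Ad-supported apps:", ads_total)
```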

In [47]:
#NULL Extraction
df = df.query('not App_Name.isnull() & not App_Name=="#NAME?"',engine='python')

# Function replaces np.nan values in every float64 column of the data frame with the column mean \ dal7collab
def rep_missings(data_frame):
    d = data_frame.copy(deep = True)
    for c in data_frame:
        if data_frame[c].dtype == 'float64':
            if data_frame[c].isnull().values.any():
                m = data_frame[c].dropna().mean()
                d[c] = d[c].fillna(m) #assigning back, since inplace=True on a selected column may not update d
    return d

# Applying the function
df = rep_missings(df)

In [48]:

df['Size'] = df['Size'].replace(to_replace="Varies with device", #"Varies with device" has no numeric size, so it becomes NaN.
                                   value = np.nan)
df['Size'] = df['Size'].str.replace('M',"") #trimming 'M' from all values in order to cast to float.

#Adding Ad_bool column, described above.

df['Ad_Supported'] = df['Ad_Supported'].astype('string')
df['Ad_bool']= df['Ad_Supported']
df['Ad_bool']= df['Ad_bool'].str.replace("True", "1")
df['Ad_bool']= df['Ad_bool'].str.replace("False", "0")

#Column types correction
df['App_Name'] = df['App_Name'].astype('string')
df['Category'] = df['Category'].astype('string')
df['Rating'] = df['Rating'].astype('float32')
df['Rating_Count'] = df['Rating_Count'].astype('int64')
df['Installs'] = df['Installs'].astype('string')
df['Minimum_Installs'] = df['Minimum_Installs'].astype('int64')
df['Price'] = df['Price'].astype('float32')
df['Released'] = pd.to_datetime(df['Released']).dt.strftime('%Y-%m') #Keeping only yyyy-mm from the full datetime.
#Note: raw Last_Updated values are day-first (e.g. 27/02/2020 15:21), so default parsing may swap day and month when the day is 12 or lower.
df['Last_Updated'] = pd.to_datetime(df['Last_Updated']).dt.strftime('%Y-%m') #Keeping only yyyy-mm from the full datetime.
df['Size'] = df['Size'].astype('float32')
df['Ad_bool'] = df['Ad_bool'].astype('int8')


print("Dataframe final shape is", np.shape(df))
df.sample(10)

Dataframe final shape is (112087, 14)


Unnamed: 0,App_Name,Category,Rating,Rating_Count,Installs,Minimum_Installs,Maximum_Installs,Price,Size,Released,Last_Updated,Ad_Supported,In_App_Purchases,Ad_bool
458610,Real Bike Stunt Racer 3D - Motorcylce Games 2020,Travel & Local,4.6,170,"100,000+",100000,106306,0.0,6.2,2019-05,2020-07,True,True,1
96975,Bus Simulator 2015 New York,Simulation,3.7,28793,"1,000,000+",1000000,2969673,0.0,16.0,2015-07,2020-06,True,True,1
541444,Thunder Assault: Raiden Striker,Arcade,4.5,135492,"5,000,000+",5000000,6668803,0.0,8.4,2015-08,2020-09,True,True,1
374157,Name Art - Write Name With Candle,Art & Design,4.5,313,"100,000+",100000,108943,0.0,40.0,2018-09,2020-09,True,False,1
210204,fishing time,Puzzle,0.0,0,5+,5,5,0.0,27.0,2020-04,2020-04,True,False,1
30117,Acı Mesajlar,Entertainment,5.0,7,"1,000+",1000,1623,0.0,11.0,2017-07,2020-11,True,False,1
62600,Autobahn,Finance,4.5,14,"1,000+",1000,1500,0.0,28.0,2018-12,2020-07,False,False,0
316275,Learn Armenian - 50 languages,Education,4.5,179,"10,000+",10000,27651,0.0,39.0,2016-10,2020-11,True,True,1
295144,Kaizen English Gujarati spoken English app,Education,4.1,477,"50,000+",50000,59027,0.0,,2015-10,2020-07,True,True,1
83142,Biocom - ביוקום,Business,3.4,13,"1,000+",1000,2322,0.0,14.0,2016-02,2020-05,False,False,0
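
The raw **Last_Updated** values shown earlier (for example `27/02/2020 15:21`) are written day-first. The sketch below, on two values copied from the samples above, shows how the extracted `yyyy-mm` depends on which field is read as the month. The explicit `format` strings are an assumption about the column layout based on those samples, not something the notebook itself specifies.

```python
import pandas as pd

# Two raw Last_Updated strings taken from the sample rows above.
raw = pd.Series(["27/02/2020 15:21", "05/06/2020 3:30"])

# Reading the strings as day/month/year.
day_first = pd.to_datetime(raw, format="%d/%m/%Y %H:%M").dt.strftime("%Y-%m")

# Reading the ambiguous second string as month/day/year instead.
ambiguous = pd.Series(["05/06/2020 3:30"])
month_first = pd.to_datetime(ambiguous, format="%m/%d/%Y %H:%M").dt.strftime("%Y-%m")

print(day_first.tolist())    # ['2020-02', '2020-06']
print(month_first.tolist())  # ['2020-05']
```

Passing an explicit format (or `dayfirst=True`) removes the ambiguity for dates such as `05/06/2020`, where both readings are valid.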


# **POSTPROCESSING DATA INFO**

In [52]:
#General information after preprocessing
print(df.info(),end = '\n\n')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112087 entries, 20 to 602972
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   App_Name          112087 non-null  string 
 1   Category          112087 non-null  string 
 2   Rating            112087 non-null  float32
 3   Rating_Count      112087 non-null  int64  
 4   Installs          112087 non-null  string 
 5   Minimum_Installs  112087 non-null  int64  
 6   Maximum_Installs  112087 non-null  int64  
 7   Price             112087 non-null  float32
 8   Size              101269 non-null  float32
 9   Released          105749 non-null  object 
 10  Last_Updated      112087 non-null  object 
 11  Ad_Supported      112087 non-null  string 
 12  In_App_Purchases  112087 non-null  bool   
 13  Ad_bool           112087 non-null  int8   
dtypes: bool(1), float32(3), int64(3), int8(1), object(2), string(4)
memory usage: 10.0+ MB
None



# **EXPORT TO GOOGLE DRIVE**


In [53]:
#Exporting a csv into google drive
#df.to_csv('/content/drive/My Drive/GooglePlay-Store_Optimized_v13.csv')

# **STATISTICS**

In [54]:
#Retrieve column general info
name = 'Ad_bool'
print(df[name].describe())
print("Column's real dtype is:", df[name].dtypes)

count    112087.000000
mean          0.619626
std           0.485481
min           0.000000
25%           0.000000
50%           1.000000
75%           1.000000
max           1.000000
Name: Ad_bool, dtype: float64
Column's real dtype is: int8
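
The summary above reports `float64` because `describe()` returns its statistics as floats, while the column itself stays `int8`. For a 0/1 column the mean is also the share of rows equal to 1, so the `0.619626` above reads as roughly 62% of apps being ad-supported. A minimal sketch on invented values:

```python
import pandas as pd

# Toy 0/1 flag column (values are invented for illustration).
flags = pd.Series([1, 0, 1, 1], dtype="int8", name="Ad_bool")

ad_count = int(flags.sum())     # number of rows flagged 1
ad_share = float(flags.mean())  # share of rows flagged 1
summary = flags.describe()      # statistics come back as float64

print(ad_count, ad_share)
print(summary.dtype, flags.dtype)
```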
