<a href="https://colab.research.google.com/github/MeidanGR/GooglePlayStore_Preprocessing_Python/blob/main/GooglePlayStore_DataPreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Google Play Store Data Preprocessing**
This data preprocessing progress has been established for an academic project of a BI course. The data will be analyzed within the project.

Major thanks to **Gautham Prakash** for publishing & maintenance this dataset, which you can find at https://www.kaggle.com/gauthamp10/google-playstore-apps


# **Google auth & libraries**

In [100]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [101]:
import pandas as pd
import numpy as np

# **CSV LOAD**


In [115]:
#Loading csv
df = pd.read_csv('/content/drive/My Drive/Google-Playstore.csv')

print("Dataframe original shape is", np.shape(df))
df.sample(2)

Dataframe original shape is (603047, 23)


Unnamed: 0,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,Price,Currency,Size,Minimum Android,Developer Id,Developer Website,Developer Email,Released,Last Updated,Content Rating,Privacy Policy,Ad Supported,In App Purchases,Editors Choice
223238,Fun Color: Coloring Games & Happy Color by Nu...,com.paint.color.by.number.coloring.pages.pixel...,Entertainment,4.8,716.0,"50,000+",50000.0,74100,True,0.0,USD,11M,5.0 and up,Fun Coloring World,https://www.launchertheme.com/,coloring.product3@gmail.com,"May 15, 2020",20/07/2020 18:13,Everyone,http://montiapp.com/coloring_fun_privacy.html,True,True,False
181734,Easy Checklist (Legacy),br.com.checklistfacil,Business,3.8,101.0,"10,000+",10000.0,21586,True,0.0,USD,22M,4.1 and up,Checklist Fácil,http://www.rz2.com.br,contato@checklistfacil.com.br,"Apr 15, 2015",07/12/2017 22:34,Everyone,http://www.rz2.com.br/politica-de-privacidade,False,False,False


# **DATA FILTERING**
## Agenda: 
Data reduction up to ~100K rows, for a decent memory space & faster analysis.

## Guidelines:
- **Minimum Android** is the minimum android version which allowes operation for the app. The mose popular values were found: 4.1\4.0.3\6.0 and up & Varies with device.
- **Size** should exceed 1MB.
- **Currency** will be USD only.
- **Last updated** during 2020.


In [116]:
#Coloumn labels space fix
df.columns = [x.replace(" ", "_") for x in df.columns]

#Rows SELECT
df = df.query('(Minimum_Android == "4.1 and up" | Minimum_Android == "4.0.3 and up" | Minimum_Android == "Varies with device" | \
Minimum_Android == "6.0 and up") & not Size.str.endswith("k") & Currency == "USD" & Last_Updated.str.contains("2020")',engine='python')

#Coloumns FILTER
df = df.filter(items=['App_Name', 'Category', 'Rating', 'Rating_Count', 'Installs', 'Price', 'Size','Released','Last_Updated','Ad_Supported', 'In_App_Purchases'])

#Saving a copy of dataframe
dfcopy = df.copy(deep=True)

print("Dataframe reduced shape is", np.shape(df))
df.sample(10)

Dataframe reduced shape is (112092, 11)


Unnamed: 0,App_Name,Category,Rating,Rating_Count,Installs,Price,Size,Released,Last_Updated,Ad_Supported,In_App_Purchases
104470,Capital Public Radio App,Music & Audio,4.1,827.0,"10,000+",0.0,16M,"Jan 7, 2012",13/06/2020 1:46,True,False
446997,Rádio CBN,News & Magazines,3.6,9736.0,"1,000,000+",0.0,58M,"Sep 18, 2012",03/04/2020 4:26,True,False
104307,CAP Mob,Education,0.0,0.0,5+,0.0,16M,"Nov 26, 2019",17/02/2020 5:29,False,False
326261,Live Wallpapers 4K - Full HD Live Wallpapers,Personalization,4.2,484.0,"100,000+",0.0,4.1M,"May 20, 2020",19/07/2020 23:04,True,False
408724,Oraciones Diarias y Protección,Books & Reference,4.6,343.0,"50,000+",0.0,5.3M,"Mar 27, 2017",05/01/2020 22:09,True,False
62057,Australia Stocks,Finance,4.6,386.0,"10,000+",0.0,13M,"Jan 4, 2016",06/07/2020 20:57,True,False
501960,Social IVI,Lifestyle,0.0,0.0,0+,0.0,7.3M,"Jun 13, 2020",14/06/2020 2:31,False,False
497084,Sleep TIMER,Lifestyle,3.6,1450.0,"100,000+",0.0,6.4M,"Oct 21, 2014",28/04/2020 5:05,True,False
583905,"WATU - Accept Payments, Send Money, Pay Bills",Finance,,,100+,0.0,Varies with device,,27/06/2020 15:55,False,False
215687,Food Merge - Idle Tap Clicker Merger Game,Lifestyle,4.0,27.0,"5,000+",0.0,28M,"Nov 13, 2018",01/07/2020 14:12,True,True


# **INITIAL DATA INFO**

In [117]:
print("Dataframe NULL count:")
print(df.isnull().sum(axis = 0), end="\n\n")

df.info()

Dataframe NULL count:
App_Name               1
Category               0
Rating              1880
Rating_Count        1880
Installs               0
Price                  0
Size                   0
Released            6338
Last_Updated           0
Ad_Supported           0
In_App_Purchases       0
dtype: int64

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112092 entries, 20 to 603046
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   App_Name          112091 non-null  object 
 1   Category          112092 non-null  object 
 2   Rating            110212 non-null  float64
 3   Rating_Count      110212 non-null  float64
 4   Installs          112092 non-null  object 
 5   Price             112092 non-null  float64
 6   Size              112092 non-null  object 
 7   Released          105754 non-null  object 
 8   Last_Updated      112092 non-null  object 
 9   Ad_Supported      112092 non-null  bool   
 

In [105]:
#Retrieve coloumn general info 
name = 'Category'
df[name].describe()

count        112092
unique           48
top       Education
freq          12333
Name: Category, dtype: object

# **DATA CLEAN & CORRECTION**
- A null value that was found in **App_Name** are not acceptable for furthor analysis.
- The null values in **Rating, Rating_Count** will be replaced with mean values.

In [118]:
#NULL Extraction
df = df.query('not App_Name.isnull()',engine='python')

# Function replaces np.nan values in all of the data frame with relevant statistics \ dal7collab
def rep_missings(data_frame):
    d = data_frame.copy(deep = True)
    for c in data_frame:
        if (data_frame[c].dtype == 'float64'):
            if data_frame[c].isnull().values.any() :
                m = data_frame[c].dropna().mean()
                d[c].fillna(m, inplace = True)
    return d;
# Activating the function
df = rep_missings(df)

#Coloumn types correction
df['App_Name'] = df['App_Name'].astype({'App_Name': 'string'})
df['Category'] = df['Category'].astype({'Category': 'string'})
df['Rating_Count'] = df['Rating_Count'].astype({'Rating_Count': 'int32'})
df['Installs'] = df['Installs'].astype({'Installs': 'string'})
df['Size'] = df['Size'].astype({'Size': 'string'})
df['Released']= pd.to_datetime(df['Released']) 
df['Last_Updated']= pd.to_datetime(df['Last_Updated']) 

#'Size' coloumn excluding "Varies with device"
df['Size'] = df['Size'].replace(to_replace="Varies with device", 
                                   value = np.nan)


print("Dataframe final shape is", np.shape(df))
df.sample(10)


Dataframe final shape is (112091, 11)


Unnamed: 0,App_Name,Category,Rating,Rating_Count,Installs,Price,Size,Released,Last_Updated,Ad_Supported,In_App_Purchases
78551,Best Waterfall Wallpaper,Personalization,0.0,0,100+,0.0,14M,2018-11-08,2020-06-05 21:35:00,True,False
108739,Cats game: Stretch the cat,Casual,0.0,0,"1,000+",0.0,17M,2020-02-25,2020-10-05 20:38:00,True,False
396828,اللؤلؤ والمرجان فيما اتفق عليه الشيخان,Books & Reference,4.4,78,"10,000+",0.0,5.0M,2019-07-20,2020-06-24 22:31:00,True,False
467362,River Crossing IQ Logic Puzzles & Fun Brain Games,Board,4.2,47194,"5,000,000+",0.0,6.5M,2017-03-11,2020-06-15 17:13:00,True,False
526129,Tail Skins,Events,2.2,617,"100,000+",0.0,75M,2019-02-04,2020-10-06 20:48:00,True,True
157891,Didi Kempot - Tulung (Offline),Music & Audio,0.0,0,50+,0.0,2.8M,2020-03-28,2020-03-29 05:41:00,True,False
232953,Glitter Heart Love Coloring Book for Girls,Art & Design,3.9,574,"1,000,000+",0.0,3.1M,2019-11-08,2020-03-02 07:58:00,True,False
99352,群雄异世界,Strategy,3.1,61,"1,000+",0.0,16M,2020-05-26,2020-08-07 20:35:00,False,True
380373,New Year countdown,Personalization,4.6,44032,"1,000,000+",0.0,43M,2015-11-12,2020-02-13 15:55:00,True,False
88850,Body by KP,Health & Fitness,0.0,0,0+,0.0,15M,2020-05-19,2020-05-20 12:29:00,False,False


# **POSTPROCESSING DATA INFO**

In [119]:
#Genereal information after preprocessing
print(df.info(),end = '\n\n')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112091 entries, 20 to 603045
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   App_Name          112091 non-null  string        
 1   Category          112091 non-null  string        
 2   Rating            112091 non-null  float64       
 3   Rating_Count      112091 non-null  int32         
 4   Installs          112091 non-null  string        
 5   Price             112091 non-null  float64       
 6   Size              101269 non-null  string        
 7   Released          105753 non-null  datetime64[ns]
 8   Last_Updated      112091 non-null  datetime64[ns]
 9   Ad_Supported      112091 non-null  bool          
 10  In_App_Purchases  112091 non-null  bool          
dtypes: bool(2), datetime64[ns](2), float64(2), int32(1), string(4)
memory usage: 8.3 MB
None



# **EXPORT TO GOOGLE DRIVE**


In [120]:
#Exporting a csv into google drive
df.to_csv('/content/drive/My Drive/GooglePlay-Store_Optimized.csv')