<a href="https://colab.research.google.com/github/MeidanGR/GooglePlayStore_Preprocessing_Python/blob/main/GooglePlayStore_DataPreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Google Play Store Data Preprocessing**
This data preprocessing progress has been established for an academic project of a BI course. The data will be analyzed within the project.

Major thanks to **Gautham Prakash** for publishing & maintenance this dataset, which you can find at https://www.kaggle.com/gauthamp10/google-playstore-apps


# **Google auth & libraries**

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import pandas as pd
import numpy as np

# **CSV LOAD**


In [4]:
#Loading csv
df = pd.read_csv('/content/drive/My Drive/Google-Playstore.csv')

print("Dataframe original shape is", np.shape(df), end='\n\n')
print(df.info())
df.sample(2)

Dataframe original shape is (603047, 23)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 603047 entries, 0 to 603046
Data columns (total 23 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   App Name           603046 non-null  object 
 1   App Id             603047 non-null  object 
 2   Category           603047 non-null  object 
 3   Rating             599287 non-null  float64
 4   Rating Count       599287 non-null  float64
 5   Installs           602875 non-null  object 
 6   Minimum Installs   602875 non-null  float64
 7   Maximum Installs   603047 non-null  int64  
 8   Free               603047 non-null  bool   
 9   Price              603047 non-null  float64
 10  Currency           602866 non-null  object 
 11  Size               603047 non-null  object 
 12  Minimum Android    602556 non-null  object 
 13  Developer Id       603045 non-null  object 
 14  Developer Website  423783 non-null  object 
 15  Developer

Unnamed: 0,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,Price,Currency,Size,Minimum Android,Developer Id,Developer Website,Developer Email,Released,Last Updated,Content Rating,Privacy Policy,Ad Supported,In App Purchases,Editors Choice
245096,Hadramihouse,com.appswiz.hadramihousebcdded,Business,0.0,0.0,1+,1.0,3,True,0.0,USD,6.1M,4.4 and up,Appswiz W.I,https://www.hadramihouse.com/,appswiz.w.i@gmail.com,"Jan 21, 2019",22/01/2019 4:21,Everyone,https://www.hadramihouse.com/,False,False,False
421177,PetWorld - Fantasy Animals,com.tivola.fantasy.free,Simulation,3.8,14332.0,"1,000,000+",1000000.0,1301109,True,0.0,USD,16M,2.3 and up,Tivola,http://www.tivola.com,apps@tivola.de,"Sep 16, 2015",16/09/2015 21:33,Everyone,https://www.tivola.de/datenschutz_vorlage/date...,True,True,False


# **DATA FILTERING**
## Agenda: 
Data reduction down to ~100K rows, for a decent dataframe volume, shape, & a faster run in Google Data Studio.

## Guidelines:
- **Minimum Android** is the minimum android version which allowes operation for the app. The mose popular values were found: 4.1\4.0.3\6.0 and up & Varies with device.
- **Size** must exceed 1MB.
- **Currency** will be USD only.
- **Last updated** during 2020.


In [5]:
#Coloumn labels space fix
df.columns = [x.replace(" ", "_") for x in df.columns]

#Rows SELECT
df = df.query('(Minimum_Android == "4.1 and up" | Minimum_Android == "4.0.3 and up" | Minimum_Android == "Varies with device" | \
Minimum_Android == "6.0 and up") & not Size.str.endswith("k") & Currency == "USD" & Last_Updated.str.contains("2020")',engine='python')

#Coloumns FILTER
df = df.filter(items=['App_Name', 'Category', 'Rating', 'Rating_Count', 'Installs', 'Price', 'Size','Released','Last_Updated','Ad_Supported', 'In_App_Purchases'])

#Saving a copy of dataframe
dfcopy = df.copy(deep=True)

print("Dataframe reduced shape is", np.shape(df))
df.sample(10)

Dataframe reduced shape is (112092, 11)


Unnamed: 0,App_Name,Category,Rating,Rating_Count,Installs,Price,Size,Released,Last_Updated,Ad_Supported,In_App_Purchases
243408,GUNDAM BATTLE: GUNPLA WARFARE,Action,3.7,33210.0,"500,000+",0.0,60M,"Jul 30, 2019",21/07/2020 7:44,False,True
107604,Cassell,Food & Drink,0.0,0.0,10+,0.0,17M,"May 14, 2020",09/07/2020 21:27,False,False
438894,"Protect Yourself - File, Message & Voice Encry...",Tools,,,500+,0.0,Varies with device,,01/05/2020 6:25,True,False
380442,New Year Photo Editor 2021,Photography,0.0,0.0,100+,0.0,43M,"Nov 12, 2019",15/07/2020 5:42,True,False
116993,Christmas Sweeper 3,Puzzle,4.5,39698.0,"1,000,000+",0.0,Varies with device,"Dec 11, 2015",25/06/2020 21:17,True,True
6750,3 000 кроссвордов,Word,4.6,16329.0,"100,000+",0.0,1.6M,"Sep 2, 2015",17/04/2020 5:47,True,True
355092,Misión Tiburón,Education,4.9,36.0,"1,000+",0.0,4.6M,"Jun 7, 2016",16/05/2020 19:23,False,False
526993,Talking Wolf,Entertainment,4.1,3484.0,"500,000+",0.0,77M,"Oct 19, 2018",01/06/2020 16:37,True,False
186980,Electric Screen for Prank Live Wallpaper &Laun...,Personalization,3.4,47739.0,"10,000,000+",0.0,23M,"Sep 19, 2017",10/06/2020 7:57,True,False
438783,ProRamadan,Books & Reference,5.0,275.0,"10,000+",0.0,52M,"Apr 29, 2018",14/04/2020 19:35,False,False


# **CHECKPOINT DATA INFO**

In [6]:
print("Dataframe NULL count:")
print(df.isnull().sum(axis = 0), end="\n\n")

df.info()

Dataframe NULL count:
App_Name               1
Category               0
Rating              1880
Rating_Count        1880
Installs               0
Price                  0
Size                   0
Released            6338
Last_Updated           0
Ad_Supported           0
In_App_Purchases       0
dtype: int64

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112092 entries, 20 to 603046
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   App_Name          112091 non-null  object 
 1   Category          112092 non-null  object 
 2   Rating            110212 non-null  float64
 3   Rating_Count      110212 non-null  float64
 4   Installs          112092 non-null  object 
 5   Price             112092 non-null  float64
 6   Size              112092 non-null  object 
 7   Released          105754 non-null  object 
 8   Last_Updated      112092 non-null  object 
 9   Ad_Supported      112092 non-null  bool   
 

# **DATA CLEAN & CORRECTION**
- A null value that was found in **App_Name** are not acceptable for furthor analysis, also 4 apps named: "#NAME?" should be cleaned.
- The null values in **Rating, Rating_Count** will be replaced with mean values.

In [7]:
#NULL Extraction
df = df.query('not App_Name.isnull() & not App_Name=="#NAME?"',engine='python')

# Function replaces np.nan values in all of the data frame with relevant statistics \ dal7collab
def rep_missings(data_frame):
    d = data_frame.copy(deep = True)
    for c in data_frame:
        if (data_frame[c].dtype == 'float64'):
            if data_frame[c].isnull().values.any() :
                m = data_frame[c].dropna().mean()
                d[c].fillna(m, inplace = True)
    return d;

# Activating the function
df = rep_missings(df)

In [45]:
#'Size' coloumn excluding "Varies with device"
df['Size'] = df['Size'].replace(to_replace="Varies with device", 
                                   value = np.nan)

#Coloumn types correction
df['App_Name'] = df['App_Name'].astype({'App_Name': 'string'})
df['Category'] = df['Category'].astype({'Category': 'string'})
df['Rating'] = df['Rating'].astype({'Rating': 'float32'})
df['Rating_Count'] = df['Rating_Count'].astype({'Rating_Count': 'int32'})
df['Installs'] = df['Installs'].astype({'Installs': 'string'})
df['Price'] = df['Price'].astype({'Price': 'float32'})
df['Released']= pd.to_datetime(df['Released']) 
df['Last_Updated']= pd.to_datetime(df['Last_Updated']) 
df['Size'] = df['Size'].astype({'Size': 'string'})


print("Dataframe final shape is", np.shape(df))
df.sample(10)

Dataframe final shape is (112087, 11)


  mask = arr == x


Unnamed: 0,App_Name,Category,Rating,Rating_Count,Installs,Price,Size,Released,Last_Updated,Ad_Supported,In_App_Purchases
484423,Sehat Bugar,Health & Fitness,0.0,0,5+,0.0,61M,2020-04-20,2020-05-28 23:54:00,True,False
216310,Football Wallpapers - ⚽ Fantasy Football Soccer,Personalization,4.0,10,"1,000+",0.0,28M,2020-01-13,2020-01-13 15:17:00,True,False
42681,Alien Tech Cube 3D,Personalization,3.0,1160,"100,000+",0.0,11M,2017-03-28,2020-02-19 15:33:00,True,False
216830,Forex trading for beginners Guide,Books & Reference,0.0,0,1+,0.0,28M,2020-07-19,2020-07-18 21:45:00,True,False
256578,History GK In Hindi (Theory + Quiz + Oneliner),Education,4.0,957,"50,000+",0.0,3.5M,2018-03-08,2020-02-06 11:35:00,True,False
335238,Magic: Puzzle Quest,Role Playing,4.0,97847,"1,000,000+",0.0,4.2M,2015-12-09,2020-03-06 19:01:00,False,True
335829,Mahakali Aarti,Music & Audio,0.0,0,500+,0.0,4.3M,2018-12-20,2020-08-07 10:38:00,True,False
414154,Panduan Shalat AR: Belajar Wudhu dan Shalat,Education,0.0,0,100+,0.0,5.4M,2020-03-02,2020-06-24 10:08:00,False,False
15614,আল্লাহর ৯৯ নাম অর্থ ফজিলত - Audio,Education,4.0,1114,"100,000+",0.0,103M,2018-04-16,2020-01-07 16:01:00,True,False
22029,विलोम शब्द (Antonyms in Hindi),Education,2.0,153,"10,000+",0.0,10M,2018-10-27,2020-06-23 00:19:00,True,False


# **POSTPROCESSING DATA INFO**

In [46]:
#Genereal information after preprocessing
print(df.info(),end = '\n\n')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112087 entries, 20 to 602972
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   App_Name          112087 non-null  string        
 1   Category          112087 non-null  string        
 2   Rating            112087 non-null  float32       
 3   Rating_Count      112087 non-null  int32         
 4   Installs          112087 non-null  string        
 5   Price             112087 non-null  float32       
 6   Size              101269 non-null  string        
 7   Released          105749 non-null  datetime64[ns]
 8   Last_Updated      112087 non-null  datetime64[ns]
 9   Ad_Supported      112087 non-null  bool          
 10  In_App_Purchases  112087 non-null  bool          
dtypes: bool(2), datetime64[ns](2), float32(2), int32(1), string(4)
memory usage: 7.5 MB
None



# **EXPORT TO GOOGLE DRIVE**


In [10]:
#Exporting a csv into google drive
df.to_csv('/content/drive/My Drive/GooglePlay-Store_Optimized.csv')

# **STATISTICS**

In [51]:
#Retrieve coloumn general info 
name = 'In_App_Purchases'
print(df[name].describe(include=[object]))
print("Coloumn's real dtype is:", df[name].dtypes)

count     112087
unique         2
top        False
freq       90677
Name: In_App_Purchases, dtype: object
Coloumn's real dtype is: bool
