### Preprocessing Data 

This notebook prepares our dataset for modelling. The goal is to transform our boolean and categorical columns into numerical columns.

In [83]:
import pandas as pd

In [84]:
df_clean = pd.read_csv(filepath_or_buffer="../data/Clean_Dataset.csv")
df_clean.head()

Unnamed: 0,Category,Rating,Rating Count,Free,Price,Content Rating,Ad Supported,In App Purchases,Editors Choice
0,Entertainment,3.9,68.0,True,0.0,Everyone,False,False,False
1,Lifestyle,0.0,0.0,True,0.0,Everyone,False,False,False
2,Shopping,4.3,918.0,True,0.0,Everyone,True,False,False
3,Finance,5.0,6.0,True,0.0,Everyone,False,False,False
4,Food & Drink,4.3,830.0,True,0.0,Everyone,True,False,False


### Changing some values into binary

Our models need numerical inputs, so we convert the four boolean columns (`Free`, `Ad Supported`, `In App Purchases` and `Editors Choice`) into binary integers: `True` becomes 1 and `False` becomes 0.

In [85]:
df_clean["Free"] = df_clean["Free"].astype(int)
df_clean["In App Purchases"] = df_clean["In App Purchases"].astype(int)
df_clean["Editors Choice"] = df_clean["Editors Choice"].astype(int)
df_clean["Ad Supported"] = df_clean["Ad Supported"].astype(int)
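
The four conversions above can also be written as a single call, since `astype` accepts a column-to-dtype mapping. The snippet below is only a sketch on a small stand-in frame (`demo` and its values are hypothetical, not the notebook's data). The exact integer width you get from `int` (for example `int32` or `int64`) may depend on your platform and NumPy version.

```python
import pandas as pd

# Hypothetical stand-in for df_clean with the same boolean columns.
demo = pd.DataFrame({
    "Free": [True, True, False],
    "Ad Supported": [False, True, False],
    "In App Purchases": [False, False, True],
    "Editors Choice": [False, False, False],
    "Price": [0.0, 0.0, 1.99],
})

bool_cols = ["Free", "Ad Supported", "In App Purchases", "Editors Choice"]

# astype accepts a {column: dtype} mapping, so all four columns
# are converted in one call; columns not in the mapping are untouched.
demo = demo.astype({col: int for col in bool_cols})

print(demo.dtypes)
```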

We can now check that the data type change took effect by using `.info()`.

In [86]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9663 entries, 0 to 9662
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Category          9663 non-null   object 
 1   Rating            9663 non-null   float64
 2   Rating Count      9663 non-null   float64
 3   Free              9663 non-null   int32  
 4   Price             9663 non-null   float64
 5   Content Rating    9663 non-null   object 
 6   Ad Supported      9663 non-null   int32  
 7   In App Purchases  9663 non-null   int32  
 8   Editors Choice    9663 non-null   int32  
dtypes: float64(3), int32(4), object(2)
memory usage: 528.6+ KB


We can also confirm that these columns now contain binary values.

In [87]:
df_clean.head()

Unnamed: 0,Category,Rating,Rating Count,Free,Price,Content Rating,Ad Supported,In App Purchases,Editors Choice
0,Entertainment,3.9,68.0,1,0.0,Everyone,0,0,0
1,Lifestyle,0.0,0.0,1,0.0,Everyone,0,0,0
2,Shopping,4.3,918.0,1,0.0,Everyone,1,0,0
3,Finance,5.0,6.0,1,0.0,Everyone,0,0,0
4,Food & Drink,4.3,830.0,1,0.0,Everyone,1,0,0


### Compact Category and Content Rating Columns

The first step is to compact the Category column: we keep the 20 most frequent categories (in terms of value counts) and replace all other categories with 'other'.

In [88]:
categorical_df = df_clean.select_dtypes('object').copy()
categorical_df.head()

Unnamed: 0,Category,Content Rating
0,Entertainment,Everyone
1,Lifestyle,Everyone
2,Shopping,Everyone
3,Finance,Everyone
4,Food & Drink,Everyone


In [89]:
keep_categories = list(df_clean['Category'].value_counts().sort_values(ascending=False)[0:20].index)

In [90]:
df_clean.loc[~df_clean['Category'].isin(keep_categories), 'Category'] = 'other'

In [91]:
df_clean['Category'].value_counts()

Category
other                1708
Education            1032
Music & Audio         724
Tools                 596
Business              594
Entertainment         586
Books & Reference     496
Lifestyle             494
Personalization       355
Health & Fitness      328
Shopping              316
Productivity          315
Food & Drink          303
Travel & Local        281
Finance               264
Arcade                232
Puzzle                228
Social                211
Casual                209
Communication         196
Sports                195
Name: count, dtype: int64

In [92]:
df_clean["Category"].nunique()

21
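
The keep-the-top-N-and-lump-the-rest pattern used above can be wrapped in a small reusable helper. This is a minimal sketch, not the notebook's code: the function name `lump_rare`, its parameters and the toy data are all hypothetical. Note that when several values tie at the cutoff, which of them is kept depends on the order returned by `value_counts`.

```python
import pandas as pd


def lump_rare(series: pd.Series, top_n: int, other_label: str = "other") -> pd.Series:
    """Keep the top_n most frequent values and replace the rest with other_label."""
    # value_counts() sorts by frequency in descending order by default.
    keep = series.value_counts().index[:top_n]
    # where() keeps values where the condition holds and substitutes elsewhere.
    return series.where(series.isin(keep), other_label)


# Toy data: "A" appears 3 times, "B" twice, "C" and "D" once each.
demo = pd.Series(["A", "A", "A", "B", "B", "C", "D"])
lumped = lump_rare(demo, top_n=2)

print(lumped.value_counts())
```

Because the helper returns a new Series instead of modifying the frame in place, the original column stays available for comparison until you assign the result back.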

The second step is to compact the Content Rating column: we keep only the most frequent value (which is 'Everyone') and replace every other content rating with 'Not for Everyone'.

In [93]:
content_rating = list(df_clean['Content Rating'].value_counts().sort_values(ascending=False)[0:1].index)

In [94]:
df_clean.loc[~df_clean['Content Rating'].isin(content_rating), 'Content Rating'] = 'Not for Everyone'

In [95]:
df_clean['Content Rating'].value_counts()

Content Rating
Everyone            8402
Not for Everyone    1261
Name: count, dtype: int64

### Dummies

The goal of creating dummies is to transform our categorical data into numerical form. 
The first dummies we create are for the Category column. 

In [96]:
dummies = pd.get_dummies(df_clean['Category'], prefix = 'Category', dtype= int) 
df_clean = pd.concat([df_clean, dummies], axis=1)
df_clean = df_clean.drop('Category', axis=1)
df_clean

Unnamed: 0,Rating,Rating Count,Free,Price,Content Rating,Ad Supported,In App Purchases,Editors Choice,Category_Arcade,Category_Books & Reference,...,Category_Music & Audio,Category_Personalization,Category_Productivity,Category_Puzzle,Category_Shopping,Category_Social,Category_Sports,Category_Tools,Category_Travel & Local,Category_other
0,3.9,68.0,1,0.0,Everyone,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0.0,0.0,1,0.0,Everyone,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4.3,918.0,1,0.0,Everyone,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,5.0,6.0,1,0.0,Everyone,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4.3,830.0,1,0.0,Everyone,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9658,0.0,0.0,1,0.0,Everyone,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9659,4.3,142.0,1,0.0,Everyone,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9660,5.0,9.0,1,0.0,Not for Everyone,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
9661,0.0,0.0,1,0.0,Everyone,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [97]:
dummies = pd.get_dummies(df_clean['Content Rating'], prefix = 'Content_Rating', dtype= int) 
df_clean = pd.concat([df_clean, dummies], axis=1)
df_clean = df_clean.drop('Content Rating', axis=1)
df_clean

Unnamed: 0,Rating,Rating Count,Free,Price,Ad Supported,In App Purchases,Editors Choice,Category_Arcade,Category_Books & Reference,Category_Business,...,Category_Productivity,Category_Puzzle,Category_Shopping,Category_Social,Category_Sports,Category_Tools,Category_Travel & Local,Category_other,Content_Rating_Everyone,Content_Rating_Not for Everyone
0,3.9,68.0,1,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0.0,0.0,1,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,4.3,918.0,1,0.0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
3,5.0,6.0,1,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,4.3,830.0,1,0.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9658,0.0,0.0,1,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
9659,4.3,142.0,1,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9660,5.0,9.0,1,0.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9661,0.0,0.0,1,0.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
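
As an alternative to the two concat-and-drop cells above, `pd.get_dummies` can encode several columns of a frame in one call through its `columns` argument, and it then removes the original columns itself. The sketch below uses a hypothetical toy frame. It also sets `drop_first=True`, which the notebook does not do: dropping one level per column avoids perfectly collinear dummy columns, which can matter for unregularised linear models and is usually unnecessary for tree-based models.

```python
import pandas as pd

# Hypothetical toy frame with the same two categorical columns.
demo = pd.DataFrame({
    "Rating": [3.9, 0.0, 4.3, 5.0],
    "Category": ["Tools", "other", "Tools", "Finance"],
    "Content Rating": ["Everyone", "Everyone", "Not for Everyone", "Everyone"],
})

encoded = pd.get_dummies(
    demo,
    columns=["Category", "Content Rating"],   # columns to encode and remove
    prefix=["Category", "Content_Rating"],    # one prefix per encoded column
    dtype=int,
    drop_first=True,                          # assumption: not used in the notebook
)

print(encoded.columns.tolist())
```

With `drop_first=True`, the two-level Content Rating column collapses into a single binary column, which carries the same information as the pair of dummies shown above.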


In [98]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9663 entries, 0 to 9662
Data columns (total 30 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Rating                           9663 non-null   float64
 1   Rating Count                     9663 non-null   float64
 2   Free                             9663 non-null   int32  
 3   Price                            9663 non-null   float64
 4   Ad Supported                     9663 non-null   int32  
 5   In App Purchases                 9663 non-null   int32  
 6   Editors Choice                   9663 non-null   int32  
 7   Category_Arcade                  9663 non-null   int32  
 8   Category_Books & Reference       9663 non-null   int32  
 9   Category_Business                9663 non-null   int32  
 10  Category_Casual                  9663 non-null   int32  
 11  Category_Communication           9663 non-null   int32  
 12  Category_Education  

In [99]:
df_clean.to_csv('../data/Dataset_for_modeling.csv', index=False)
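
Before relying on the saved file, it can be worth checking that no non-numeric column is left and that the CSV reloads with the same shape and column names. The following is a rough sketch on a hypothetical toy frame, written to a temporary directory so that it does not touch the real data folder. The file name `demo_for_modeling.csv` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the fully encoded df_clean.
demo = pd.DataFrame({
    "Rating": [3.9, 4.3],
    "Free": [1, 1],
    "Category_Tools": [0, 1],
})

# Any column that is not numeric would be listed here.
non_numeric = demo.select_dtypes(exclude="number").columns.tolist()
if non_numeric:
    raise ValueError(f"non-numeric columns remain: {non_numeric}")

# Write the frame to a temporary location and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_for_modeling.csv")  # hypothetical file name
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == demo.shape
same_columns = reloaded.columns.tolist() == demo.columns.tolist()
print(same_shape, same_columns)
```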