## Data Loading

INTRODUCTION: 
This notebook loads the dataset, builds a data dictionary, and verifies the shape of the dataset (number of rows and columns). We then clean the data: we check for duplicates, remove missing values, and drop columns where necessary. Finally, we save the clean dataset to a CSV file for use in our EDA.

In [135]:
import pandas as pd

### Step 1: Loading data 

In [136]:
df = pd.read_csv(filepath_or_buffer= "../data/Extract_Dataset.csv") 
df.head()

,Unnamed: 0,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,...,Developer Website,Developer Email,Released,Last Updated,Content Rating,Privacy Policy,Ad Supported,In App Purchases,Editors Choice,Scraped Time
0,0,Untap,in.untap,Entertainment,3.9,68.0,"10,000+",10000.0,10291,True,...,https://untap.in,hello@untap.in,"Nov 2, 2020","Nov 02, 2020",Everyone,,False,False,False,2021-06-16 11:24:00
1,1,Green Meadows,com.ooweboowebengineers.greenmeadows,Lifestyle,0.0,0.0,50+,50.0,90,True,...,http://ooweboo.co.za,ray@ooweboo.co.za,"May 29, 2017","May 29, 2017",Everyone,http://appmc2.net/privacy?company=OOWEBOO%20We...,False,False,False,2021-06-16 10:54:34
2,2,YG SELECT,com.makeshop.powerapp.ygnext,Shopping,4.3,918.0,"100,000+",100000.0,135038,True,...,http://www.ygeshop.com,app.ygselect@gmail.com,"Jan 20, 2016","May 12, 2021",Everyone,http://www.ygeshop.com/m/privacy.html,True,False,False,2021-06-16 02:21:54
3,3,Vinca Wealth,com.bag4wealth,Finance,5.0,6.0,50+,50.0,53,True,...,https://bag4wealth.com,acmatics.app@gmail.com,"Jun 30, 2020","May 11, 2021",Everyone,https://bag4wealth.com/finnsys/app/privacy.php,False,False,False,2021-06-16 01:29:59
4,4,Drink recipes,com.drinks.recipes,Food & Drink,4.3,830.0,"100,000+",100000.0,142498,True,...,http://cookwithlove.biz/,andrei.nazarco@gmail.com,"May 20, 2014","Jul 13, 2020",Everyone,http://cookwithlove.biz/privacy_policy/drink_r...,True,False,False,2021-06-16 08:43:51


The output shows an extra index column whose name starts with Unnamed. The DataFrame already has its own index, so we drop the Unnamed column.

In [137]:
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
df.head()

,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,Price,...,Developer Website,Developer Email,Released,Last Updated,Content Rating,Privacy Policy,Ad Supported,In App Purchases,Editors Choice,Scraped Time
0,Untap,in.untap,Entertainment,3.9,68.0,"10,000+",10000.0,10291,True,0.0,...,https://untap.in,hello@untap.in,"Nov 2, 2020","Nov 02, 2020",Everyone,,False,False,False,2021-06-16 11:24:00
1,Green Meadows,com.ooweboowebengineers.greenmeadows,Lifestyle,0.0,0.0,50+,50.0,90,True,0.0,...,http://ooweboo.co.za,ray@ooweboo.co.za,"May 29, 2017","May 29, 2017",Everyone,http://appmc2.net/privacy?company=OOWEBOO%20We...,False,False,False,2021-06-16 10:54:34
2,YG SELECT,com.makeshop.powerapp.ygnext,Shopping,4.3,918.0,"100,000+",100000.0,135038,True,0.0,...,http://www.ygeshop.com,app.ygselect@gmail.com,"Jan 20, 2016","May 12, 2021",Everyone,http://www.ygeshop.com/m/privacy.html,True,False,False,2021-06-16 02:21:54
3,Vinca Wealth,com.bag4wealth,Finance,5.0,6.0,50+,50.0,53,True,0.0,...,https://bag4wealth.com,acmatics.app@gmail.com,"Jun 30, 2020","May 11, 2021",Everyone,https://bag4wealth.com/finnsys/app/privacy.php,False,False,False,2021-06-16 01:29:59
4,Drink recipes,com.drinks.recipes,Food & Drink,4.3,830.0,"100,000+",100000.0,142498,True,0.0,...,http://cookwithlove.biz/,andrei.nazarco@gmail.com,"May 20, 2014","Jul 13, 2020",Everyone,http://cookwithlove.biz/privacy_policy/drink_r...,True,False,False,2021-06-16 08:43:51


Only one index now appears.
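
As an alternative to filtering column names after loading, the saved index can be consumed at load time. The sketch below assumes the extra column is simply a row index that was written to the CSV file earlier; it uses a small in-memory CSV with hypothetical sample rows rather than the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics a CSV saved together with its index.
sample_csv = io.StringIO(
    ",App Name,Category\n"
    "0,Untap,Entertainment\n"
    "1,Green Meadows,Lifestyle\n"
)

# index_col=0 tells pandas to use the first column as the row index,
# so no "Unnamed: 0" column is created.
sample_df = pd.read_csv(sample_csv, index_col=0)
print(sample_df.columns.tolist())
```

With this option the name-based filter is no longer needed, although the filter used in this notebook remains a reasonable choice when the number of stray index columns is not known in advance.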

### Step 2: Data Dictionary
This is the data dictionary for the project. It describes each column of the dataset used to train and test the model.

App Name:

    - object 
    - The name of the application on the Google Play Store
    
App Id: 

    - object
    - The unique identifier of the application
    
Category: 

    - object
    - The category the application belongs to
    - There are 48 categories of applications
    
Rating: 

    - float64
    - The average rating of the application
    - Average rating from 1 to 5; in the rows shown above, 0.0 appears for applications with a Rating Count of 0

Rating Count:

    - float64
    - The number of people who rated the application
    
Installs:

    - object
    - The approximate number of installs of the application
    - The number is rounded down and followed by a plus sign. For example, if Maximum Installs is 53, Installs is 50+.

Minimum Installs:

    - float64
    - The minimum number of installs of the application, rounded
    - The numeric counterpart of Installs, without the plus sign. For example, if Maximum Installs is 135038, Minimum Installs is 100000.0.

Maximum Installs:

    - int64
    - The total number of installs of the application
    
Free: 

    - bool
    - Whether the application is free of charge
    - True/False

Price:

    - float64
    - The cost of the mobile application
    - Numeric value; 0.0 for the free applications shown above

Currency:

    - object
    - The currency in which the price of the mobile application is listed
             
Size:

    - object
    - The size of the mobile application in terms of storage space
    - A number followed by M (megabytes)

Minimum Android:

    - object
    - The minimum version of the Android operating system required to run the mobile application.
    - Version number followed by "and up" (for example, 4.4 and up)

Developer Id:

    - object
    - The unique identifier of the developer
    
Developer Website:

    - object
    - The website associated with the developer or company behind the mobile application

Developer Email: 

    - object
    - The email address of the developer or company responsible for the mobile application.
        
Released:    

    - object
    - The date when the mobile application was initially released on the Google Play Store.
    - Format: Month day, year (for example, Nov 2, 2020)

Last Updated: 

    - object
    - The date when the mobile application was last updated 
    - Format: Month day, year (for example, Nov 02, 2020)

Content Rating:

    - object
    - The content rating for which the mobile application is suitable
    - There are 6 possible values

Privacy Policy:

    - object
    - A link to the privacy policy associated with the mobile application.
     
Editors Choice:

    - bool
    - Whether the mobile application has been selected as an "Editor's Choice" on the Google Play Store
    - True/ False

Scraped Time:
 
    - object
    - The date and time when the data was collected from the Google Play Store
    - year-month-day hour:minutes:seconds
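
Part of a data dictionary like the one above can be generated from the DataFrame itself, which helps keep the listed dtypes in sync with the data. The following is a minimal sketch on a hypothetical three-row sample; `sample_df` and `data_dictionary` are illustrative names, and the written descriptions would still have to be added by hand.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be the loaded `df`.
sample_df = pd.DataFrame(
    {
        "Category": ["Entertainment", "Lifestyle", "Shopping"],
        "Rating": [3.9, 0.0, 4.3],
        "Free": [True, True, True],
    }
)

# One row per column: dtype, number of distinct values, and an example value.
data_dictionary = pd.DataFrame(
    {
        "dtype": sample_df.dtypes.astype(str),
        "n_unique": sample_df.nunique(),
        "example": sample_df.iloc[0],
    }
)
print(data_dictionary)
```

Applied to the full dataset, the `n_unique` column could be used to confirm figures such as the number of categories or content ratings quoted in the dictionary.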

### Step 3: Check datatypes and format

In [138]:
df.shape

(10000, 24)

This DataFrame contains 10,000 rows and 24 columns.

In [139]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   App Name           10000 non-null  object 
 1   App Id             10000 non-null  object 
 2   Category           10000 non-null  object 
 3   Rating             9906 non-null   float64
 4   Rating Count       9906 non-null   float64
 5   Installs           9999 non-null   object 
 6   Minimum Installs   9999 non-null   float64
 7   Maximum Installs   10000 non-null  int64  
 8   Free               10000 non-null  bool   
 9   Price              10000 non-null  float64
 10  Currency           9999 non-null   object 
 11  Size               9999 non-null   object 
 12  Minimum Android    9969 non-null   object 
 13  Developer Id       10000 non-null  object 
 14  Developer Website  6628 non-null   object 
 15  Developer Email    10000 non-null  object 
 16  Released           9694

We can see a mixture of numerical and non-numerical columns. Many columns have the "object" dtype, which pandas uses for text; some of these are categorical (for example Category and Content Rating), while others hold identifiers or dates stored as text. There are also boolean columns, which hold True/False values. Finally, some columns have missing data: Rating, Rating Count, Installs, Minimum Installs, Currency, Size, Minimum Android, Developer Website (the column with the most null values), Released, and Privacy Policy.
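
The grouping of columns by kind can also be obtained programmatically with `select_dtypes`. This is a small sketch on a hypothetical sample with the same kinds of columns as the dataset; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample with the same kinds of columns as the dataset.
sample_df = pd.DataFrame(
    {
        "Category": ["Finance", "Music"],
        "Rating": [5.0, 0.0],
        "Maximum Installs": [53, 14],
        "Free": [True, True],
    }
)

# Numeric columns (int and float); boolean columns are not counted as numbers.
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
# Boolean columns.
boolean_cols = sample_df.select_dtypes(include="bool").columns.tolist()
# Everything else, which here means the text columns.
other_cols = sample_df.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(numeric_cols, boolean_cols, other_cols)
```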

### Step 4: Checking for duplicate values

In [140]:
df.duplicated().sum()

0

There are no duplicated rows in the dataset. 

In [141]:
df.T.duplicated().sum()

0

There are no duplicated columns in the dataset. 
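
`df.duplicated()` only flags rows that are identical in every column. A stricter check is to look for repeated values in a column that is expected to be unique, such as App Id. The sketch below uses a hypothetical sample; whether repeated App Id values occur in the real dataset is not checked in this notebook.

```python
import pandas as pd

# Hypothetical sample: the two "in.untap" rows differ only in Rating Count.
sample_df = pd.DataFrame(
    {
        "App Id": ["in.untap", "in.untap", "com.bag4wealth"],
        "Rating Count": [68.0, 70.0, 6.0],
    }
)

# Rows identical in every column.
full_row_duplicates = sample_df.duplicated().sum()
# Rows that repeat an App Id already seen earlier in the frame.
app_id_duplicates = sample_df.duplicated(subset=["App Id"]).sum()

print(full_row_duplicates, app_id_duplicates)
```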

### Step 5: Checking and Dealing with missing values

In [142]:
df.isna().sum()

App Name                0
App Id                  0
Category                0
Rating                 94
Rating Count           94
Installs                1
Minimum Installs        1
Maximum Installs        0
Free                    0
Price                   0
Currency                1
Size                    1
Minimum Android        31
Developer Id            0
Developer Website    3372
Developer Email         0
Released              306
Last Updated            0
Content Rating          0
Privacy Policy       1746
Ad Supported            0
In App Purchases        0
Editors Choice          0
Scraped Time            0
dtype: int64

In [143]:
df.isna().sum()/df.shape[0]*100

App Name              0.00
App Id                0.00
Category              0.00
Rating                0.94
Rating Count          0.94
Installs              0.01
Minimum Installs      0.01
Maximum Installs      0.00
Free                  0.00
Price                 0.00
Currency              0.01
Size                  0.01
Minimum Android       0.31
Developer Id          0.00
Developer Website    33.72
Developer Email       0.00
Released              3.06
Last Updated          0.00
Content Rating        0.00
Privacy Policy       17.46
Ad Supported          0.00
In App Purchases      0.00
Editors Choice        0.00
Scraped Time          0.00
dtype: float64
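
The counts and percentages above can be combined into a single table sorted by the share of missing values, which makes the worst columns easier to spot. This is a minimal sketch on a hypothetical four-row sample; `missing_summary` is an illustrative name.

```python
import pandas as pd

# Hypothetical sample with missing values in two columns.
sample_df = pd.DataFrame(
    {
        "Rating": [3.9, None, 4.3, 5.0],
        "Developer Website": ["https://untap.in", None, None, "https://bag4wealth.com"],
        "Free": [True, True, True, True],
    }
)

# isna().mean() is the fraction of missing values per column.
missing_summary = pd.DataFrame(
    {
        "missing": sample_df.isna().sum(),
        "percent": sample_df.isna().mean() * 100,
    }
).sort_values("percent", ascending=False)

print(missing_summary)
```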

For most columns the percentage of null values is small. Developer Website and Privacy Policy have much larger percentages (33.72% and 17.46%), and their values cannot be imputed because they are unique to each application. We will drop the Developer Website and Privacy Policy columns, as they contain many null values and are not needed for our prediction. 

In [144]:
df.drop(columns=['Developer Website', 'Privacy Policy'], inplace=True)

In [145]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   App Name          10000 non-null  object 
 1   App Id            10000 non-null  object 
 2   Category          10000 non-null  object 
 3   Rating            9906 non-null   float64
 4   Rating Count      9906 non-null   float64
 5   Installs          9999 non-null   object 
 6   Minimum Installs  9999 non-null   float64
 7   Maximum Installs  10000 non-null  int64  
 8   Free              10000 non-null  bool   
 9   Price             10000 non-null  float64
 10  Currency          9999 non-null   object 
 11  Size              9999 non-null   object 
 12  Minimum Android   9969 non-null   object 
 13  Developer Id      10000 non-null  object 
 14  Developer Email   10000 non-null  object 
 15  Released          9694 non-null   object 
 16  Last Updated      10000 non-null  object 

The two columns have been dropped. 
We can now deal with the remaining missing values. 

In [146]:
df.isna().sum()

App Name              0
App Id                0
Category              0
Rating               94
Rating Count         94
Installs              1
Minimum Installs      1
Maximum Installs      0
Free                  0
Price                 0
Currency              1
Size                  1
Minimum Android      31
Developer Id          0
Developer Email       0
Released            306
Last Updated          0
Content Rating        0
Ad Supported          0
In App Purchases      0
Editors Choice        0
Scraped Time          0
dtype: int64

In [147]:
df.shape

(10000, 22)

In [148]:
df.isna().sum()/df.shape[0]*100

App Name            0.00
App Id              0.00
Category            0.00
Rating              0.94
Rating Count        0.94
Installs            0.01
Minimum Installs    0.01
Maximum Installs    0.00
Free                0.00
Price               0.00
Currency            0.01
Size                0.01
Minimum Android     0.31
Developer Id        0.00
Developer Email     0.00
Released            3.06
Last Updated        0.00
Content Rating      0.00
Ad Supported        0.00
In App Purchases    0.00
Editors Choice      0.00
Scraped Time        0.00
dtype: float64

The remaining columns have only a small percentage of missing values (at most 3.06%, for Released), so we drop the rows that contain them. 
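
Missing values in different columns can fall on the same rows, so the number of rows removed by `dropna` may be smaller than the sum of the per-column counts. In this notebook the row count goes from 10000 to 9663, which means 337 rows are removed. The sketch below shows how the number of affected rows could be counted before dropping, using a hypothetical sample.

```python
import pandas as pd

# Hypothetical sample: row 1 is missing both values, row 3 only Released.
sample_df = pd.DataFrame(
    {
        "Rating": [3.9, None, 4.3, 5.0],
        "Released": ["Nov 2, 2020", None, "Jan 20, 2016", None],
    }
)

# Rows with at least one missing value; these are the rows dropna() removes.
rows_with_missing = sample_df.isna().any(axis=1).sum()
cleaned = sample_df.dropna()

print(rows_with_missing, len(sample_df) - len(cleaned))
```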

In [149]:
df.dropna(inplace = True)

In [150]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9663 entries, 0 to 9999
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   App Name          9663 non-null   object 
 1   App Id            9663 non-null   object 
 2   Category          9663 non-null   object 
 3   Rating            9663 non-null   float64
 4   Rating Count      9663 non-null   float64
 5   Installs          9663 non-null   object 
 6   Minimum Installs  9663 non-null   float64
 7   Maximum Installs  9663 non-null   int64  
 8   Free              9663 non-null   bool   
 9   Price             9663 non-null   float64
 10  Currency          9663 non-null   object 
 11  Size              9663 non-null   object 
 12  Minimum Android   9663 non-null   object 
 13  Developer Id      9663 non-null   object 
 14  Developer Email   9663 non-null   object 
 15  Released          9663 non-null   object 
 16  Last Updated      9663 non-null   object 
 17  

Resetting the index

In [151]:
df.shape

(9663, 22)

In [152]:
df

,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,Price,...,Minimum Android,Developer Id,Developer Email,Released,Last Updated,Content Rating,Ad Supported,In App Purchases,Editors Choice,Scraped Time
0,Untap,in.untap,Entertainment,3.9,68.0,"10,000+",10000.0,10291,True,0.0,...,4.4 and up,Pat H,hello@untap.in,"Nov 2, 2020","Nov 02, 2020",Everyone,False,False,False,2021-06-16 11:24:00
1,Green Meadows,com.ooweboowebengineers.greenmeadows,Lifestyle,0.0,0.0,50+,50.0,90,True,0.0,...,4.0 and up,Business Apps - OOWEBOO,ray@ooweboo.co.za,"May 29, 2017","May 29, 2017",Everyone,False,False,False,2021-06-16 10:54:34
2,YG SELECT,com.makeshop.powerapp.ygnext,Shopping,4.3,918.0,"100,000+",100000.0,135038,True,0.0,...,5.0 and up,YG PLUS,app.ygselect@gmail.com,"Jan 20, 2016","May 12, 2021",Everyone,True,False,False,2021-06-16 02:21:54
3,Vinca Wealth,com.bag4wealth,Finance,5.0,6.0,50+,50.0,53,True,0.0,...,4.1 and up,Developed By: 'ARM Fintech',acmatics.app@gmail.com,"Jun 30, 2020","May 11, 2021",Everyone,False,False,False,2021-06-16 01:29:59
4,Drink recipes,com.drinks.recipes,Food & Drink,4.3,830.0,"100,000+",100000.0,142498,True,0.0,...,4.0.3 and up,Nazarco,andrei.nazarco@gmail.com,"May 20, 2014","Jul 13, 2020",Everyone,True,False,False,2021-06-16 08:43:51
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,ngtzit,com.ionicframework.mobile482715,Music,0.0,0.0,10+,10.0,14,True,0.0,...,4.1 and up,camelCaseD,amused.rutabega@gmail.com,"Oct 9, 2015","Oct 09, 2015",Everyone,False,False,False,2021-06-16 12:22:12
9995,Deep Memorial Public School,com.edunext.dmps,Education,4.3,142.0,"1,000+",1000.0,2490,True,0.0,...,4.4 and up,Edunext Technologies,edunexttech@gmail.com,"Mar 4, 2016","Aug 26, 2020",Everyone,False,False,False,2021-06-16 00:56:34
9997,설운도 트로트 노래모음,korea.singer.sulundo,Music & Audio,5.0,9.0,"1,000+",1000.0,1035,True,0.0,...,4.4 and up,sopiapark,parkhanye28@gmail.com,"Apr 23, 2020","Feb 03, 2021",Teen,True,False,False,2021-06-16 12:40:44
9998,Coq,fr.enfantdoudou.coq,Entertainment,0.0,0.0,500+,500.0,522,True,0.0,...,4.0 and up,thanki,pridgua@gmail.com,"Apr 16, 2020","Apr 16, 2020",Everyone,True,False,False,2021-06-16 12:06:11


Some index values are now missing (for example, 9996) because rows with missing values were dropped. We need to reset the index.

In [153]:
df.reset_index(inplace= True)

In [154]:
df

,index,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,...,Minimum Android,Developer Id,Developer Email,Released,Last Updated,Content Rating,Ad Supported,In App Purchases,Editors Choice,Scraped Time
0,0,Untap,in.untap,Entertainment,3.9,68.0,"10,000+",10000.0,10291,True,...,4.4 and up,Pat H,hello@untap.in,"Nov 2, 2020","Nov 02, 2020",Everyone,False,False,False,2021-06-16 11:24:00
1,1,Green Meadows,com.ooweboowebengineers.greenmeadows,Lifestyle,0.0,0.0,50+,50.0,90,True,...,4.0 and up,Business Apps - OOWEBOO,ray@ooweboo.co.za,"May 29, 2017","May 29, 2017",Everyone,False,False,False,2021-06-16 10:54:34
2,2,YG SELECT,com.makeshop.powerapp.ygnext,Shopping,4.3,918.0,"100,000+",100000.0,135038,True,...,5.0 and up,YG PLUS,app.ygselect@gmail.com,"Jan 20, 2016","May 12, 2021",Everyone,True,False,False,2021-06-16 02:21:54
3,3,Vinca Wealth,com.bag4wealth,Finance,5.0,6.0,50+,50.0,53,True,...,4.1 and up,Developed By: 'ARM Fintech',acmatics.app@gmail.com,"Jun 30, 2020","May 11, 2021",Everyone,False,False,False,2021-06-16 01:29:59
4,4,Drink recipes,com.drinks.recipes,Food & Drink,4.3,830.0,"100,000+",100000.0,142498,True,...,4.0.3 and up,Nazarco,andrei.nazarco@gmail.com,"May 20, 2014","Jul 13, 2020",Everyone,True,False,False,2021-06-16 08:43:51
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9658,9994,ngtzit,com.ionicframework.mobile482715,Music,0.0,0.0,10+,10.0,14,True,...,4.1 and up,camelCaseD,amused.rutabega@gmail.com,"Oct 9, 2015","Oct 09, 2015",Everyone,False,False,False,2021-06-16 12:22:12
9659,9995,Deep Memorial Public School,com.edunext.dmps,Education,4.3,142.0,"1,000+",1000.0,2490,True,...,4.4 and up,Edunext Technologies,edunexttech@gmail.com,"Mar 4, 2016","Aug 26, 2020",Everyone,False,False,False,2021-06-16 00:56:34
9660,9997,설운도 트로트 노래모음,korea.singer.sulundo,Music & Audio,5.0,9.0,"1,000+",1000.0,1035,True,...,4.4 and up,sopiapark,parkhanye28@gmail.com,"Apr 23, 2020","Feb 03, 2021",Teen,True,False,False,2021-06-16 12:40:44
9661,9998,Coq,fr.enfantdoudou.coq,Entertainment,0.0,0.0,500+,500.0,522,True,...,4.0 and up,thanki,pridgua@gmail.com,"Apr 16, 2020","Apr 16, 2020",Everyone,True,False,False,2021-06-16 12:06:11


The index has been reset, but the original index was kept as a new column named index, which we need to remove. 

In [155]:
df = df.loc[:, ~df.columns.str.contains('^index')]
df.head()

,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,Price,...,Minimum Android,Developer Id,Developer Email,Released,Last Updated,Content Rating,Ad Supported,In App Purchases,Editors Choice,Scraped Time
0,Untap,in.untap,Entertainment,3.9,68.0,"10,000+",10000.0,10291,True,0.0,...,4.4 and up,Pat H,hello@untap.in,"Nov 2, 2020","Nov 02, 2020",Everyone,False,False,False,2021-06-16 11:24:00
1,Green Meadows,com.ooweboowebengineers.greenmeadows,Lifestyle,0.0,0.0,50+,50.0,90,True,0.0,...,4.0 and up,Business Apps - OOWEBOO,ray@ooweboo.co.za,"May 29, 2017","May 29, 2017",Everyone,False,False,False,2021-06-16 10:54:34
2,YG SELECT,com.makeshop.powerapp.ygnext,Shopping,4.3,918.0,"100,000+",100000.0,135038,True,0.0,...,5.0 and up,YG PLUS,app.ygselect@gmail.com,"Jan 20, 2016","May 12, 2021",Everyone,True,False,False,2021-06-16 02:21:54
3,Vinca Wealth,com.bag4wealth,Finance,5.0,6.0,50+,50.0,53,True,0.0,...,4.1 and up,Developed By: 'ARM Fintech',acmatics.app@gmail.com,"Jun 30, 2020","May 11, 2021",Everyone,False,False,False,2021-06-16 01:29:59
4,Drink recipes,com.drinks.recipes,Food & Drink,4.3,830.0,"100,000+",100000.0,142498,True,0.0,...,4.0.3 and up,Nazarco,andrei.nazarco@gmail.com,"May 20, 2014","Jul 13, 2020",Everyone,True,False,False,2021-06-16 08:43:51
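
The two steps above (resetting the index, then removing the leftover index column) can be done in one call by passing `drop=True` to `reset_index`. The sketch below illustrates this on a hypothetical sample and is an alternative to the cells above, not what the notebook ran.

```python
import pandas as pd

# Hypothetical sample with one missing App Name.
sample_df = pd.DataFrame(
    {"App Name": ["Untap", None, "YG SELECT"], "Rating": [3.9, 0.0, 4.3]}
)

# drop=True discards the old index instead of adding it as an "index" column.
cleaned = sample_df.dropna().reset_index(drop=True)

print(cleaned.index.tolist())
print(cleaned.columns.tolist())
```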


### Step 6: Dropping Columns 

In [156]:
df.drop(columns = ['Last Updated','App Name','Size','Minimum Android','Minimum Installs', 'App Id', 'Currency', 'Scraped Time', 'Developer Email', 'Developer Id', 'Installs', 'Released', 'Maximum Installs'], inplace= True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(columns = ['Last Updated','App Name','Size','Minimum Android','Minimum Installs', 'App Id', 'Currency', 'Scraped Time', 'Developer Email', 'Developer Id', 'Installs', 'Released', 'Maximum Installs'], inplace= True)
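
The message above is a pandas warning, not an error: `df` was produced by a `.loc` column selection and then modified in place, so pandas cannot tell whether the change was meant for the selection or for the original DataFrame. Whether the warning appears depends on the pandas version and settings. The sketch below shows two common ways to avoid it, on a hypothetical sample: taking an explicit copy after the selection, and reassigning the result of `drop` instead of using `inplace=True`.

```python
import pandas as pd

# Hypothetical sample with a leftover "index" column.
sample_df = pd.DataFrame(
    {
        "index": [0, 2],
        "App Name": ["Untap", "YG SELECT"],
        "Rating": [3.9, 4.3],
    }
)

# Option 1: take an explicit copy after selecting columns.
selected = sample_df.loc[:, ~sample_df.columns.str.contains("^index")].copy()

# Option 2: reassign the result of drop() instead of using inplace=True.
selected = selected.drop(columns=["App Name"])

print(selected.columns.tolist())
```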


We dropped these columns because they are not useful for our predictions. 
We then use df.info() to check that they were dropped correctly. 

In [157]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9663 entries, 0 to 9662
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Category          9663 non-null   object 
 1   Rating            9663 non-null   float64
 2   Rating Count      9663 non-null   float64
 3   Free              9663 non-null   bool   
 4   Price             9663 non-null   float64
 5   Content Rating    9663 non-null   object 
 6   Ad Supported      9663 non-null   bool   
 7   In App Purchases  9663 non-null   bool   
 8   Editors Choice    9663 non-null   bool   
dtypes: bool(4), float64(3), object(2)
memory usage: 415.3+ KB


We save the cleaned data to a new CSV file for our EDA. 

In [158]:
df.to_csv('../data/Clean_Dataset.csv', index = False)
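
To check that the file was written as expected, it can be read back and compared with the DataFrame. The sketch below does this with a temporary directory and a hypothetical sample rather than `../data/Clean_Dataset.csv`; the file name `clean_sample.csv` is illustrative.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample standing in for the cleaned DataFrame.
sample_df = pd.DataFrame(
    {"Category": ["Finance", "Music"], "Rating": [5.0, 0.0], "Free": [True, True]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clean_sample.csv")
    # index=False keeps the row index out of the file.
    sample_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same shape and column names.
print(reloaded.shape == sample_df.shape)
print(reloaded.columns.tolist() == sample_df.columns.tolist())
```

Because the file is written with `index=False`, reading it back does not produce an Unnamed column like the one removed in Step 1.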

CONCLUSION: 
We cleaned our dataset by checking for duplicates (none were found) and removing missing data. We also dropped columns that would not be beneficial for our prediction. 
We saved our new dataset to a CSV file to use in our EDA.