# Imports <a id="imports"></a>
This cell contains the imports necessary for the project.

In [67]:
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import numpy as np

"""
repair.py contains functions that help with repairing the dataset and
performing data munging
"""
import repair 

In [140]:
df = pd.read_csv("googleplaystore.csv")

# The Dataset <a id="ds"></a> 
The dataset we chose is the ["Google Play Store Apps"](https://www.kaggle.com/lava18/google-play-store-apps) dataset, which contains 10,841 data rows.

## The Headings <a id="hd"></a> 
| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Facebook | SOCIAL | 4.1 | 78158306 | Varies with device | 1,000,000,000+ | Free | 0.0 | Teen | Social | August 3, 2018 | Varies with device | Varies with device |

### App
- Application name

### Category
- The category the app belongs to

### Rating
- Overall user rating of the app (as when scraped)

### Reviews 
- Number of user reviews for the app (as when scraped)

### Size
- Size of the app (as when scraped)

### Installs 
- Number of user downloads/installs for the app (as when scraped)

### Type
-  Paid or Free

### Price
- Price of the app (as when scraped)

### Content Rating
- Age group the app is targeted at - Children / Mature 21+ / Adult

### Genres
- An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
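
As a minimal sketch of how a multi-genre value could be separated, the snippet below splits two sample strings on a semicolon. It assumes the semicolon is the separator used throughout the column, which matches values such as `Art & Design;Action & Adventure` in this dataset; the `genres` series itself is only an illustration.

```python
import pandas as pd

# Sample values in the format the Genres column appears to use
genres = pd.Series(["Art & Design;Action & Adventure", "Beauty"])

# Split on the semicolon to get a list of genres per app
genre_lists = genres.str.split(";")
print(genre_lists.tolist())
# [['Art & Design', 'Action & Adventure'], ['Beauty']]
```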

### Last Updated 
- Date when the app was last updated on the Play Store (as when scraped)

### Current Ver 
- Current version of the app available on Play Store (as when scraped)

### Android Ver
- Min required Android version (as when scraped)

# Data Munging

## Data types
In the current version of the dataset, most of the columns are of type object:

In [3]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

The *"Reviews"* and *"Price"* columns should be of a numerical type. We convert them with ```pandas.to_numeric()```, after using the ```remove_dollar_sign()``` function in ```repair.py``` to strip the dollar sign from the prices.

In [4]:
# create new dataset of cleaned data of extracted dollar sign
repair.remove_dollar_sign("googleplaystore.csv", "googleplaystore2.csv")

repair.py: Printing done


In [5]:
# update the data frame
df = pd.read_csv("googleplaystore2.csv")

# convert Price to numerical
df["Price"] = pd.to_numeric(df["Price"])

In [6]:
# convert Reviews column to numerical
df["Reviews"] = pd.to_numeric(df["Reviews"])

In [7]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size               object
Installs           object
Type               object
Price             float64
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

After fixing the data, all columns that should be numeric now have a numeric type.
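
The same kind of conversion can also be sketched in memory, without writing an intermediate CSV file. The snippet below is only an illustration on a few made-up price strings and is not the actual contents of `repair.py`; the sample values and the choice of `errors="coerce"` are assumptions.

```python
import pandas as pd

# Made-up sample values for illustration; not taken from the dataset
prices = pd.Series(["0", "$4.99", "$1.49", "Everyone"])

# Strip the literal dollar sign, then convert to a numeric dtype.
# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
cleaned = pd.to_numeric(prices.str.replace("$", "", regex=False), errors="coerce")

print(cleaned.dtype)     # float64
print(cleaned.tolist())
```

Coercing unparseable values to NaN keeps the conversion from failing on a malformed row, at the cost of introducing new null values that then need to be handled.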

## Filling null values

The output below shows the number of null values in each column.

In [141]:
df.apply(lambda x: sum(x.isnull()),axis=0)

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       0
Genres               1
Last Updated         0
Current Ver          8
Android Ver          2
dtype: int64
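
The per-column count can also be written without `apply`. The sketch below uses a small made-up frame (the `sample` name and its values are assumptions) to show the shorter `isnull().sum()` form, which should give the same counts.

```python
import numpy as np
import pandas as pd

# Small made-up frame for illustration
sample = pd.DataFrame({
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# isnull() gives a boolean frame; summing counts the True values per column
null_counts = sample.isnull().sum()
print(null_counts)
```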

In [17]:
# Get data rows where the value is null
null_values = df.loc[df["Rating"].isnull()]

In [131]:
null_values

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | Mcqueen Coloring pages | ART_AND_DESIGN | NaN | 61 | 7.0M | 100,000+ | Free | 0.00 | Everyone | Art & Design;Action & Adventure | March 7, 2018 | 1.0.0 | 4.1 and up |
| 113 | Wrinkles and rejuvenation | BEAUTY | NaN | 182 | 5.7M | 100,000+ | Free | 0.00 | Everyone 10+ | Beauty | September 20, 2017 | 8.0 | 3.0 and up |
| 123 | Manicure - nail design | BEAUTY | NaN | 119 | 3.7M | 50,000+ | Free | 0.00 | Everyone | Beauty | July 23, 2018 | 1.3 | 4.1 and up |
| 126 | Skin Care and Natural Beauty | BEAUTY | NaN | 654 | 7.4M | 100,000+ | Free | 0.00 | Teen | Beauty | July 17, 2018 | 1.15 | 4.1 and up |
| 129 | Secrets of beauty, youth and health | BEAUTY | NaN | 77 | 2.9M | 10,000+ | Free | 0.00 | Mature 17+ | Beauty | August 8, 2017 | 2.0 | 2.3 and up |
| 130 | Recipes and tips for losing weight | BEAUTY | NaN | 35 | 3.1M | 10,000+ | Free | 0.00 | Everyone 10+ | Beauty | December 11, 2017 | 2.0 | 3.0 and up |
| 134 | Lady adviser (beauty, health) | BEAUTY | NaN | 30 | 9.9M | 10,000+ | Free | 0.00 | Mature 17+ | Beauty | January 24, 2018 | 3.0 | 3.0 and up |
| 163 | Anonymous caller detection | BOOKS_AND_REFERENCE | NaN | 161 | 2.7M | 10,000+ | Free | 0.00 | Everyone | Books & Reference | July 13, 2018 | 1.0 | 2.3 and up |
| 180 | SH-02J Owner's Manual (Android 8.0) | BOOKS_AND_REFERENCE | NaN | 2 | 7.2M | 50,000+ | Free | 0.00 | Everyone | Books & Reference | June 15, 2018 | 3.0 | 6.0 and up |
| 185 | URBANO V 02 instruction manual | BOOKS_AND_REFERENCE | NaN | 114 | 7.3M | 100,000+ | Free | 0.00 | Everyone | Books & Reference | August 7, 2015 | 1.1 | 5.1 and up |


According to the null counts above, there are 1474 rows with null values for **Rating**. This is problematic because this field is key for our predictive modelling later on, and these null values make up about 13.6% of the total dataset. The two most obvious ways to reduce the impact are to replace the null values with the mean of the Rating column or to delete the affected rows. Both have clear problems:
1. Deleting all 1474 rows may hurt the performance of the machine learning algorithms later on, all the more so because Rating will be a key attribute of the predictive model
2. The mean of the entire Rating column may not accurately represent the true rating of the rows with null values
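
To make the trade-off concrete, the sketch below first works out the share of missing ratings from the counts reported above, then applies both simple options to a small made-up frame. The `sample` frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Share of rows with a null Rating, using the counts reported above
total_rows = 10841
null_ratings = 1474
share = null_ratings / total_rows
print(f"{share:.1%}")  # 13.6%

# Made-up frame to illustrate the two simple options
sample = pd.DataFrame({"Rating": [4.0, np.nan, 3.0, np.nan, 5.0]})

# Option 1: delete the rows with a null rating (loses two of five rows here)
dropped = sample.dropna(subset=["Rating"])

# Option 2: fill every null with the single overall mean of the column
filled = sample["Rating"].fillna(sample["Rating"].mean())

print(len(dropped))
print(filled.tolist())
```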

There are a couple of workarounds that we have thought of:
- Compute the mean of the non-null ratings for each category and use it to replace the null values in that category
- If a category has only a few rows with null values, delete those rows instead

In [144]:
def fill_null_ratings_for_category(df):
    """Fill null ratings in place with the mean rating of the app's category."""
    # Get all unique values of the Category column
    categories = df["Category"].unique().tolist()

    for category in categories:
        # Boolean mask selecting the rows of the current category
        in_category = df["Category"] == category

        # mean() skips NaN, so this is the mean of the non-null ratings
        avg = df.loc[in_category, "Rating"].mean()

        df.loc[in_category, "Rating"] = df.loc[in_category, "Rating"].fillna(avg)


fill_null_ratings_for_category(df)
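
For comparison, the per-category fill can also be sketched without an explicit loop by using `groupby` with `transform`. This is an alternative formulation on made-up rows, not the function used in this notebook. Dropping the rows whose category has no rated apps at all is an assumption about how such leftovers might be handled.

```python
import numpy as np
import pandas as pd

# Made-up sample rows for illustration
sample = pd.DataFrame({
    "Category": ["BEAUTY", "BEAUTY", "BEAUTY", "DATING", "DATING", "EVENTS"],
    "Rating": [4.0, np.nan, 4.4, 3.5, np.nan, np.nan],
})

# Mean of the non-null ratings in each category, aligned to the original rows
category_mean = sample.groupby("Category")["Rating"].transform("mean")

# Fill each null rating with the mean of its own category
sample["Rating"] = sample["Rating"].fillna(category_mean)

# A category with no rated apps has no mean, so its rows stay null; drop them
sample = sample.dropna(subset=["Rating"])
print(sample)
```

Because `transform` returns a series aligned with the original index, the fill happens in one vectorised step instead of one assignment per category.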

In [145]:
df.apply(lambda x: sum(x.isnull()),axis=0) 

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              1
Price             0
Content Rating    0
Genres            1
Last Updated      0
Current Ver       8
Android Ver       2
dtype: int64

In [41]:
beauty_cat_null

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 478 | Truth or Dare Pro | DATING | NaN | 0 | 20M | 50+ | Paid | 1.49 | Teen | Dating | September 1, 2017 | 1.0 | 4.0 and up |
| 479 | Private Dating, Hide App- Blue for PrivacyHider | DATING | NaN | 0 | 18k | 100+ | Paid | 2.99 | Everyone | Dating | July 25, 2017 | 1.0.1 | 4.0 and up |
| 480 | Ad Blocker for SayHi | DATING | NaN | 4 | 1.2M | 100+ | Paid | 3.99 | Teen | Dating | August 2, 2018 | 1.2 | 4.0.3 and up |
| 610 | Random Video Chat | DATING | NaN | 3 | 16M | 1,000+ | Free | 0.0 | Mature 17+ | Dating | July 15, 2018 | 4.20 | 4.0.3 and up |
| 613 | Random Video Chat App With Strangers | DATING | NaN | 3 | 4.8M | 1,000+ | Free | 0.0 | Mature 17+ | Dating | July 17, 2018 | 1. | 4.0 and up |
| 617 | Meet With Strangers: Video Chat & Dating | DATING | NaN | 2 | 3.7M | 500+ | Free | 0.0 | Mature 17+ | Dating | July 16, 2018 | 1. | 4.0 and up |
| 620 | Ost. Zombies Cast - New Music and Lyrics | DATING | NaN | 1 | 4.6M | 100+ | Free | 0.0 | Teen | Dating | July 20, 2018 | 1.0 | 4.0.3 and up |
| 621 | Dating White Girls | DATING | NaN | 0 | 3.6M | 50+ | Free | 0.0 | Mature 17+ | Dating | July 20, 2018 | 1.0 | 4.0 and up |
| 623 | Geeks Dating | DATING | NaN | 0 | 13M | 50+ | Free | 0.0 | Mature 17+ | Dating | July 10, 2018 | 1.0 | 4.1 and up |
| 624 | Live chat - free video chat | DATING | NaN | 1 | 8.7M | 500+ | Free | 0.0 | Mature 17+ | Dating | July 23, 2018 | 3.52 | 4.0.3 and up |


In [42]:
beauty_cat_not_null

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 476 | Moco+ - Chat, Meet People | DATING | 4.2 | 1545 | Varies with device | 10,000+ | Paid | 3.99 | Mature 17+ | Dating | June 19, 2018 | 2.6.139 | 4.1 and up |
| 477 | Calculator | DATING | 2.6 | 57 | 6.2M | 1,000+ | Paid | 6.99 | Everyone | Dating | October 25, 2017 | 1.1.6 | 4.0 and up |
| 481 | AMBW Dating App: Asian Men Black Women Interra... | DATING | 3.5 | 2 | 17M | 100+ | Paid | 7.99 | Mature 17+ | Dating | January 21, 2017 | 1.0.1 | 4.0 and up |
| 482 | Zoosk Dating App: Meet Singles | DATING | 4.0 | 516801 | Varies with device | 10,000,000+ | Free | 0.00 | Mature 17+ | Dating | August 2, 2018 | Varies with device | Varies with device |
| 483 | OkCupid Dating | DATING | 4.1 | 285726 | 15M | 10,000,000+ | Free | 0.00 | Mature 17+ | Dating | July 30, 2018 | 11.10.1 | 4.1 and up |
| 484 | Match™ Dating - Meet Singles | DATING | 3.7 | 76646 | Varies with device | 10,000,000+ | Free | 0.00 | Mature 17+ | Dating | July 23, 2018 | Varies with device | Varies with device |
| 485 | Hily: Dating, Chat, Match, Meet & Hook up | DATING | 4.1 | 2556 | 56M | 100,000+ | Free | 0.00 | Mature 17+ | Dating | August 1, 2018 | 2.5.2 | 4.1 and up |
| 486 | Hinge: Dating & Relationships | DATING | 4.2 | 7779 | 12M | 500,000+ | Free | 0.00 | Mature 17+ | Dating | August 3, 2018 | 6.1.3 | 5.0 and up |
| 487 | Casual Dating & Adult Singles - Joyride | DATING | 4.5 | 61637 | 11M | 5,000,000+ | Free | 0.00 | Mature 17+ | Dating | July 31, 2018 | 4.17.2 | 4.1 and up |
| 488 | BBW Dating & Plus Size Chat | DATING | 4.4 | 12632 | 29M | 1,000,000+ | Free | 0.00 | Mature 17+ | Dating | July 27, 2018 | 3.5.0.1 | 4.1 and up |


['ART_AND_DESIGN',
 'AUTO_AND_VEHICLES',
 'BEAUTY',
 'BOOKS_AND_REFERENCE',
 'BUSINESS',
 'COMICS',
 'COMMUNICATION',
 'DATING',
 'EDUCATION',
 'ENTERTAINMENT',
 'EVENTS',
 'FINANCE',
 'FOOD_AND_DRINK',
 'HEALTH_AND_FITNESS',
 'HOUSE_AND_HOME',
 'LIBRARIES_AND_DEMO',
 'LIFESTYLE',
 'GAME',
 'FAMILY',
 'MEDICAL',
 'SOCIAL',
 'SHOPPING',
 'PHOTOGRAPHY',
 'SPORTS',
 'TRAVEL_AND_LOCAL',
 'TOOLS',
 'PERSONALIZATION',
 'PRODUCTIVITY',
 'PARENTING',
 'WEATHER',
 'VIDEO_PLAYERS',
 'NEWS_AND_MAGAZINES',
 'MAPS_AND_NAVIGATION']

In [92]:
print()

98       4.7
99       4.9
100      4.7
101      3.9
102      3.9
103      4.2
104      4.6
105      4.3
106      4.7
107      4.7
108      4.8
109      4.2
110      4.3
111      4.5
112      4.1
113      4.2
114      4.2
115      4.5
116      4.4
117      4.0
118      4.1
119      4.1
120      4.4
121      4.6
122      4.5
123      4.2
124      3.9
125      4.4
126      4.2
127      4.6
128      3.8
129      4.2
130      4.2
131      4.0
132      4.3
133      4.5
134      4.2
135      4.1
136      3.7
137      4.7
138      4.2
5192     3.1
5998     4.2
6222     4.0
6318     4.2
6506     4.0
6828     4.2
7021     4.5
7460     4.2
8505     4.2
9205     4.0
10022    4.7
10289    3.9
Name: Rating, dtype: float64


In [93]:
df.apply(lambda x: sum(x.isnull()),axis=0)

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       0
Genres               1
Last Updated         0
Current Ver          8
Android Ver          2
dtype: int64

In [62]:
df.loc[df["Category"] == "BEAUTY", "Rating"]

98       4.7
99       4.9
100      4.7
101      3.9
102      3.9
103      4.2
104      4.6
105      4.3
106      4.7
107      4.7
108      4.8
109      4.2
110      4.3
111      4.5
112      4.1
113      NaN
114      4.2
115      4.5
116      4.4
117      4.0
118      4.1
119      4.1
120      4.4
121      4.6
122      4.5
123      NaN
124      3.9
125      4.4
126      NaN
127      4.6
128      3.8
129      NaN
130      NaN
131      4.0
132      4.3
133      4.5
134      NaN
135      4.1
136      3.7
137      4.7
138      4.2
5192     3.1
5998     NaN
6222     4.0
6318     NaN
6506     4.0
6828     NaN
7021     4.5
7460     NaN
8505     NaN
9205     4.0
10022    4.7
10289    3.9
Name: Rating, dtype: float64