In [1]:
import pandas as pd
import numpy as np

In [2]:
# read in our data
store_data = pd.read_csv("data/google-play-store-apps/googleplaystore.csv")
reviews_data = pd.read_csv("data/google-play-store-apps/googleplaystore_user_reviews.csv")

In [None]:
# scratch: alternative ways to inspect missing values
# pd.isnull(store_data).sum() > 0
# store_data.isna().any(axis=0)
# store_data.loc[store_data['Price'].isnull()]
# print(store_data.iloc[5]['Price'])
# store_data[pd.isnull(store_data).any(axis=1)]

## Checking out the missing values

First, we check which attributes in the data have missing values.

In [39]:
pd.isnull(store_data).sum() > 0

App               False
Category          False
Rating             True
Reviews           False
Size              False
Installs          False
Type               True
Price             False
Content Rating     True
Genres            False
Last Updated      False
Current Ver        True
Android Ver        True
dtype: bool

The `Rating`, `Type`, `Content Rating`, `Current Ver`, and `Android Ver` attributes contain NaN values.
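
A boolean flag per column does not show how many values are missing. The following is a minimal sketch, on a small made-up frame (`toy` is a hypothetical stand-in for `store_data`, and its values are invented), of listing only the affected columns together with their counts:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for store_data; the values are made up for illustration.
toy = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Rating": [4.1, np.nan, np.nan],
    "Type": ["Free", None, "Paid"],
})

# Count missing values per column, then keep only the columns that have any.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Applied to `store_data`, the same two lines should list the five attributes named above with their missing-value counts.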

---

#### Type:

Let's start with `Type`, the attribute that indicates whether the app is `Paid` or `Free`.

In [49]:
store_data.loc[store_data['Type'].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
9148,Command & Conquer: Rivals,FAMILY,,0,Varies with device,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device,Varies with device


So the app in row `9148`, `Command & Conquer: Rivals`, appears to be the only one with a NaN value in `Type`.

Since its `Price` is 0, we can reasonably assume that it is a free app and set its `Type` to `Free`.

In [89]:
# set Type by row label; equivalent to the positional lookup with the default index
store_data.loc[9148, 'Type'] = 'Free'
store_data.loc[store_data['Type'].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
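
The assumption that a zero price implies a free app can be checked against the rows whose `Type` is known before filling the gap. This is a minimal sketch on a made-up frame: `toy` is hypothetical, and it assumes `Price` is stored as text such as `"0"` or `"$4.99"`, which should be verified against the real column.

```python
import pandas as pd

# Hypothetical stand-in for store_data; Price is assumed to be stored as text.
toy = pd.DataFrame({
    "Type": ["Free", "Paid", None, "Free"],
    "Price": ["0", "$4.99", "0", "0"],
})

# Among rows with a known Type, collect the Type values seen at price "0".
known = toy.dropna(subset=["Type"])
types_at_zero = set(known.loc[known["Price"] == "0", "Type"])
print(types_at_zero)

# Fill Type only where it is missing and the price is zero.
mask = toy["Type"].isnull() & (toy["Price"] == "0")
toy.loc[mask, "Type"] = "Free"
print(toy)
```

If `types_at_zero` contained anything other than `Free` on the real data, the fix above would need a second look.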


---

#### Rating:

The `Rating` attribute represents the rating given to the app by its users, on a scale from 1 to 5.

In [54]:
store_data['Rating'].isnull().sum()

1474

There appear to be 1474 apps with missing rating values. There are a few ways to handle them. We decided to replace each missing rating with the mean rating of the other apps in the same install range. For example, if app 2 has a missing rating and `100,000+` installs, and the mean rating of all the other apps with `100,000+` installs is `3.5`, then app 2's rating is set to `3.5`.

In [92]:
store_data['Rating'].mean()

4.193338315362448
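
The per-install-range replacement described above can be sketched with `groupby` and `transform`. The frame below is a made-up example (`toy` and its values are hypothetical, not taken from the dataset), and this is one possible way to do it rather than the notebook's final implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for store_data; the values are made up for illustration.
toy = pd.DataFrame({
    "Installs": ["100,000+", "100,000+", "100,000+", "1,000+", "1,000+"],
    "Rating": [3.0, 4.0, np.nan, 4.5, np.nan],
})

# Mean rating within each install range, aligned to the original rows.
# Missing ratings are ignored when the mean is computed.
group_mean = toy.groupby("Installs")["Rating"].transform("mean")

# Replace only the missing ratings with the mean of their install range.
toy["Rating"] = toy["Rating"].fillna(group_mean)
print(toy)
```

An install range in which every rating is missing would stay NaN under this approach; the overall mean computed above could serve as a fallback in that case.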

---