Cleaning the Apple dataset

Purpose: This notebook cleans the Apple dataset and then tests the result.

In [1]:
from os.path import exists, isfile

import re

import pandas as pd
import numpy as np
import math
import great_expectations as ge

In [2]:
save_path = '../../datasets/2200_clean_apple.csv'

In [3]:
raw_path = "../../datasets/1400_kaggle_dataset_apple.csv"

# Stop here if the raw file is missing, rather than failing later inside read_csv.
if not exists(raw_path):
    raise FileNotFoundError("Missing dataset file: " + raw_path)

df_apple = ge.read_csv(raw_path)
df_apple.head()

Unnamed: 0,id,track_name,size_bytes,price,rating_count_tot,user_rating,cont_rating,prime_genre
0,281656475,PAC-MAN Premium,100788224,3.99,21292,4.0,4+,Games
1,281796108,Evernote - stay organized,158578688,0.0,161065,4.0,4+,Business
2,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,0.0,188583,3.5,4+,Others
3,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,0.0,262241,4.0,12+,Lifestyle
4,282935706,Bible,92774400,0.0,985920,4.5,4+,Books & Reference


We have previously normalized the genres, so we expect every value of prime_genre to belong to the set of genres we want.

In [4]:
valid_genre = set(['Utilities', 'Auto & Vehicles', 'Books & Reference', 'Business',
       'Entertainment', 'Social Networking', 'Education', 'News',
       'Food & Drink', 'Health & Fitness', 'Others', 'Lifestyle', 'Games'])

df_apple.expect_column_values_to_be_in_set('prime_genre', valid_genre)

{'success': True,
 'result': {'element_count': 7197,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'unexpected_percent_nonmissing': 0.0,
  'partial_unexpected_list': []}}
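
For readers without Great Expectations at hand, roughly the same membership check can be sketched in plain pandas. This is only an illustration on a made-up three-row frame: the shortened genre set, the "Weather" value and the `summary` dictionary are assumptions, and the dictionary only loosely mirrors the fields of the result above.

```python
import pandas as pd

# Shortened version of valid_genre from the cell above, for brevity.
valid_genre = {"Games", "Business", "Others", "Lifestyle", "Books & Reference"}

# Toy data; "Weather" is a made-up value that is not in the allowed set.
toy = pd.DataFrame({"prime_genre": ["Games", "Business", "Weather"]})

# Select the genres that fall outside the allowed set.
unexpected = toy.loc[~toy["prime_genre"].isin(valid_genre), "prime_genre"]

# Hypothetical summary, loosely modelled on the expectation result.
summary = {
    "success": bool(unexpected.empty),
    "unexpected_count": len(unexpected),
    "unexpected_list": list(unexpected),
}
```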

In [5]:
df_apple.info()

<class 'great_expectations.dataset.pandas_dataset.PandasDataset'>
RangeIndex: 7197 entries, 0 to 7196
Data columns (total 8 columns):
id                  7197 non-null int64
track_name          7197 non-null object
size_bytes          7197 non-null int64
price               7197 non-null float64
rating_count_tot    7197 non-null int64
user_rating         7197 non-null float64
cont_rating         7197 non-null object
prime_genre         7197 non-null object
dtypes: float64(2), int64(3), object(3)
memory usage: 449.9+ KB


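We want size to be a numerical feature, so we convert "size_bytes" from a byte count to a float giving each app's size in megabytes (here 1 megabyte = 1024 × 1024 bytes). The column keeps its original name until the final rename.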

In [6]:
df_apple['size_bytes'] = df_apple['size_bytes'].map(lambda x: x*1.0/(1024*1024))
df_apple['size_bytes'].describe()

count    7197.000000
mean      189.909414
std       342.566408
min         0.562500
25%        44.749023
50%        92.652344
75%       173.497070
max      3839.463867
Name: size_bytes, dtype: float64
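
The same conversion can also be written without a per-row lambda. The sketch below is a minimal illustration on a two-row frame; the constant `BYTES_PER_MB` and the column name `size_mb` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical constant: the notebook divides by 1024 * 1024 (binary megabytes).
BYTES_PER_MB = 1024 * 1024

# Toy data using two sizes from the head() output above.
toy = pd.DataFrame({"size_bytes": [100788224, 158578688]})

# Vectorised division; the result is float64 without an explicit cast.
toy["size_mb"] = toy["size_bytes"] / BYTES_PER_MB
```

Writing the result to a separate column keeps the raw byte count available and avoids a column called size_bytes that actually holds megabytes.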

Since we will later compare the content ratings of Apple apps and Google apps, both should use the same standard. Here we map the Apple content ratings onto a common set of labels.

In [7]:
df_apple['cont_rating'].unique()

array(['4+', '12+', '17+', '9+'], dtype=object)

In [8]:
df_apple['cont_rating'].isnull().sum()

0

In [9]:
df_apple['cont_rating'] = df_apple['cont_rating'].replace({'4+': 'Everyone', '9+': 'Everyone 10+', '12+': 'Teen', '17+': 'Mature 17+'})
df_apple['cont_rating'].unique()

array(['Everyone', 'Teen', 'Mature 17+', 'Everyone 10+'], dtype=object)
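
One thing to keep in mind is that replace leaves any label missing from the dictionary untouched, so a new rating would pass through silently. A possible guard, sketched here on toy data, is to use Series.map and look for missing results. The name `APPLE_TO_COMMON` and the label "21+" are made up for the illustration.

```python
import pandas as pd

# Mapping copied from the cell above.
APPLE_TO_COMMON = {
    "4+": "Everyone",
    "9+": "Everyone 10+",
    "12+": "Teen",
    "17+": "Mature 17+",
}

# Toy column; "21+" is a made-up label to show the failure mode.
ratings = pd.Series(["4+", "12+", "21+"])

# Series.map yields a missing value for keys absent from the dict,
# whereas Series.replace would leave "21+" unchanged.
mapped = ratings.map(APPLE_TO_COMMON)
unmapped = list(ratings[mapped.isna()])
```

An empty `unmapped` list would indicate that every label was covered by the mapping.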

In [10]:
df_apple['user_rating'].describe()

count    7197.000000
mean        3.526956
std         1.517948
min         0.000000
25%         3.500000
50%         4.000000
75%         4.500000
max         5.000000
Name: user_rating, dtype: float64

It is unreasonable for apps to have a rating of 0. If the 0 only means that no users have rated the app, it is a placeholder rather than a real rating, and those rows are outliers that should be dropped.

In [11]:
df_apple.expect_column_values_to_not_be_null('id')

{'success': True,
 'result': {'element_count': 7197,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'partial_unexpected_list': []}}

In [12]:
df_apple.expect_column_values_to_not_be_null('track_name')

{'success': True,
 'result': {'element_count': 7197,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'partial_unexpected_list': []}}

In [13]:
df_apple.expect_column_values_to_be_unique('id')

{'success': True,
 'result': {'element_count': 7197,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'unexpected_percent_nonmissing': 0.0,
  'partial_unexpected_list': []}}

In [14]:
df_apple.expect_column_values_to_be_unique('track_name')

{'success': False,
 'result': {'element_count': 7197,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 4,
  'unexpected_percent': 0.0005557871335278589,
  'unexpected_percent_nonmissing': 0.0005557871335278589,
  'partial_unexpected_list': ['VR Roller Coaster',
   'VR Roller Coaster',
   'Mannequin Challenge',
   'Mannequin Challenge']}}

In [15]:
df_apple[df_apple['track_name']=='VR Roller Coaster']

Unnamed: 0,id,track_name,size_bytes,price,rating_count_tot,user_rating,cont_rating,prime_genre
3319,952877179,VR Roller Coaster,161.669922,0.0,107,3.5,Everyone,Games
5603,1089824278,VR Roller Coaster,229.801758,0.0,67,3.5,Everyone,Games


In [16]:
df_apple[df_apple['track_name']=='Mannequin Challenge'][['size_bytes', 'rating_count_tot', 'user_rating']]

Unnamed: 0,size_bytes,rating_count_tot,user_rating
7092,104.623047,668,3.0
7128,56.8125,105,4.0


In [17]:
df_apple = df_apple[~df_apple.duplicated(subset=['track_name'])]
df_apple.shape

(7195, 8)
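
The de-duplication above keeps the first row for each title. Since the two "VR Roller Coaster" rows have different ids, they appear to be distinct listings, so it can help to inspect every member of a duplicate group before dropping. A minimal sketch on toy rows taken from the outputs above:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [952877179, 1089824278, 281656475],
    "track_name": ["VR Roller Coaster", "VR Roller Coaster", "PAC-MAN Premium"],
    "rating_count_tot": [107, 67, 21292],
})

# keep=False flags every row in a duplicate group, which is useful for inspection.
all_dupes = toy[toy.duplicated(subset=["track_name"], keep=False)]

# The default keep="first" retains the first row per title, as the cell above does.
deduped = toy.drop_duplicates(subset=["track_name"])
```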

In [18]:
df_apple[df_apple['user_rating']==0]['rating_count_tot']

199     0
301     0
330     0
441     0
452     0
515     0
531     0
553     0
575     0
612     0
658     0
667     0
669     0
721     0
778     0
779     0
809     0
844     0
859     0
870     0
905     0
957     0
1030    0
1048    0
1049    0
1060    0
1068    0
1072    0
1085    0
1102    0
       ..
7089    0
7094    0
7095    0
7096    0
7104    0
7107    0
7118    0
7120    0
7124    0
7126    0
7132    0
7133    0
7135    0
7143    0
7145    0
7149    0
7151    0
7152    0
7153    0
7157    0
7164    0
7165    0
7173    0
7176    0
7178    0
7181    0
7182    0
7184    0
7185    0
7189    0
Name: rating_count_tot, Length: 929, dtype: int64

Ratings given by 3 or fewer users are not reliable. We keep only apps that have been rated by more than 3 users.

In [19]:
min_reviews = 3
df_apple = df_apple[df_apple['rating_count_tot']>min_reviews]
df_apple.shape

(6078, 8)
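
The filter uses a strict comparison, so an app with exactly min_reviews ratings is dropped. The toy example below contrasts this with an inclusive comparison; the inclusive variant is shown only for comparison and is not what this notebook does.

```python
import pandas as pd

min_reviews = 3  # same threshold as the cell above

toy = pd.DataFrame({"rating_count_tot": [0, 3, 4, 107]})

# Strict comparison, as in the notebook: exactly 3 ratings is not enough.
strict = toy[toy["rating_count_tot"] > min_reviews]

# Inclusive alternative (an assumption, for comparison only).
inclusive = toy[toy["rating_count_tot"] >= min_reviews]
```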

In [20]:
df_apple.expect_column_min_to_be_between('rating_count_tot', min_reviews, 100000000)

{'success': True,
 'result': {'observed_value': 4,
  'element_count': 6078,
  'missing_count': 0,
  'missing_percent': 0.0}}

While we are dealing with ratings, let's normalise the Google and Apple ratings so that they are on the same scale.

In [21]:
df_apple['normed_rating'] = df_apple['user_rating']/df_apple['user_rating'].max()
df_apple.expect_column_values_to_be_between('normed_rating', 0, 1)

{'success': True,
 'result': {'element_count': 6078,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'unexpected_percent_nonmissing': 0.0,
  'partial_unexpected_list': []}}
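
Dividing by the observed maximum works here because the highest rating in the data is 5.0. If a subset happened to contain no 5.0 rating, the same rating would be scaled differently. A possible alternative, sketched below, divides by the top of the rating scale instead; `RATING_SCALE_MAX` is a hypothetical constant and assumes a 0 to 5 scale.

```python
import pandas as pd

# Assumed top of the rating scale; the describe() output above peaks at 5.0.
RATING_SCALE_MAX = 5.0

# Toy data that deliberately contains no 5.0 rating.
toy = pd.DataFrame({"user_rating": [3.5, 4.0, 4.5]})

# Observed-max normalisation, as in the cell above: depends on the rows present.
by_observed_max = toy["user_rating"] / toy["user_rating"].max()

# Fixed-scale normalisation: a given rating always maps to the same value.
by_scale_max = toy["user_rating"] / RATING_SCALE_MAX
```

In the toy data the rating 4.5 becomes 1.0 under the first scheme and 0.9 under the second.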

In [22]:
def z_score(column, df):
    return (df[column] - df[column].mean())/df[column].std()

df_apple['z_score'] = z_score('user_rating', df_apple)
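
As a small worked illustration of the helper, the toy ratings below have mean 4 and sample standard deviation 1, so the z-scores are easy to check by hand. Note that Series.std() uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

def z_score(column, df):
    # Same definition as the cell above; Series.std() defaults to ddof=1.
    return (df[column] - df[column].mean()) / df[column].std()

# Toy ratings: mean is 4.0 and the sample standard deviation is 1.0.
toy = pd.DataFrame({"user_rating": [3.0, 4.0, 5.0]})
toy["z_score"] = z_score("user_rating", toy)
```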

In [23]:
df_apple.loc[df_apple['rating_count_tot']>0,'log_apple_reviews'] = df_apple[df_apple['rating_count_tot']>0]['rating_count_tot'].apply(lambda x: math.log(x, 10))
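
Because the earlier filter already removed apps with no ratings, the mask on positive counts selects every row. Under that assumption the transform could also be sketched with NumPy's vectorised base-10 logarithm; the column name `log_reviews` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy review counts; all strictly positive, as after the min_reviews filter.
toy = pd.DataFrame({"rating_count_tot": [10, 1000, 100000]})

# Vectorised base-10 logarithm over the whole column.
toy["log_reviews"] = np.log10(toy["rating_count_tot"])
```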

In [24]:
df_apple[df_apple['rating_count_tot'].isnull()]

Unnamed: 0,id,track_name,size_bytes,price,rating_count_tot,user_rating,cont_rating,prime_genre,normed_rating,z_score,log_apple_reviews


In [25]:
df_apple.columns=['apple_id', 'apple_title', 'apple_size', 'apple_price', 'apple_reviews', 'apple_rating', 'apple_pegi', 'apple_genre', 'normed_apple_rating', 'z_score_apple', 'log_apple_reviews']

In [26]:
df_apple.to_csv(save_path, index=False)
df_apple.shape

(6078, 11)