Downloading the Google Play Store dataset from Kaggle

Purpose
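This notebook downloads the Google Play Store apps dataset from Kaggle, removes rows and columns that are not needed, and "normalizes" the genres by merging the original categories into a smaller set.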

In [1]:
google_kaggle_dataset = '1300_kaggle_dataset_google.csv'

In [2]:
datasets_dir = '../../datasets/'
force_download = False 

In [3]:
import subprocess
from os.path import isfile, exists
import pandas as pd
import numpy as np

To avoid logging in to Kaggle, a copy of the original dataset is hosted on GitHub and downloaded from there.
The dataset is not downloaded every time. Once it is stored in the datasets folder, it is simply read from disk unless force_download is set to True. The file is served as a plain CSV, so no decompression step is needed.

In [4]:
google_dataset = 'https://raw.githubusercontent.com/EloiseXu/Data-Science-in-Practice/master/googleplaystore.csv'
local_filename = google_dataset.split('/')[-1]
google_filename = datasets_dir + local_filename

if not isfile(google_filename) or force_download:
    # Fetch the CSV with curl; -L follows redirects, check=True raises if curl fails.
    subprocess.run(['curl', '-L', google_dataset, '--output', google_filename], check=True)
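
The download-once logic described above can also be sketched in pure Python, which avoids depending on a curl binary. This is only a minimal sketch and not the notebook's own method: download_if_missing and its fetch parameter are hypothetical names introduced here, and urllib.request.urlretrieve stands in for the curl call.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, dest, force=False, fetch=urlretrieve):
    """Download url to dest unless dest already exists (hypothetical helper).

    Returns True if a download was performed and False if the cached
    file was reused. `fetch` is injectable so that the caching logic can
    be exercised without network access.
    """
    if os.path.isfile(dest) and not force:
        return False  # cached copy found, nothing to do
    dest_dir = os.path.dirname(dest)
    if dest_dir:
        # Make sure the target folder exists before writing into it.
        os.makedirs(dest_dir, exist_ok=True)
    fetch(url, dest)
    return True
```

With the variables defined in this notebook, a call would look like download_if_missing(google_dataset, google_filename, force=force_download).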

In [5]:
google_apps = pd.read_csv(google_filename)
google_apps.shape

(10841, 13)

In [6]:
google_apps.sample(10)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
8772,Dr. Dominoes,GAME,4.0,2700,5.7M,"500,000+",Free,0,Everyone,Board,"January 3, 2018",1.06,4.1 and up
6328,BJ - Confidential,COMMUNICATION,,0,3.2M,10+,Free,0,Teen,Communication,"April 23, 2018",1.7,4.1 and up
7890,CT-Stream Player,FAMILY,3.9,84,11M,"10,000+",Free,0,Everyone,Entertainment,"November 21, 2016",3.4,3.0 and up
4305,Magic Tiles - TWICE Edition (K-Pop),GAME,4.4,2351,62M,"100,000+",Free,0,Everyone,Music,"July 25, 2018",1009001,4.0 and up
10027,OMG Gross Zit - Date Nightmare,FAMILY,3.7,70105,67M,"10,000,000+",Free,0,Everyone,Casual,"June 6, 2018",1.0.3,4.1 and up
4912,TOSHIBA Smart AC,TOOLS,2.6,60,18M,"10,000+",Free,0,Everyone,Tools,"August 4, 2018",2.1.20180804_01,4.0.3 and up
4631,U-Dictionary: Best English Learning Dictionary,FAMILY,4.5,166886,21M,"10,000,000+",Free,0,Everyone 10+,Education,"July 29, 2018",3.6.2,4.0.3 and up
3387,Live 3D Neon Blue Love Heart Keyboard Theme,PERSONALIZATION,4.3,6626,9.1M,"1,000,000+",Free,0,Everyone,Personalization,"July 25, 2018",6.7.25.2018,4.0.3 and up
7207,Beast of Lycan Isle CE,FAMILY,4.1,2683,20M,"50,000+",Free,0,Everyone 10+,Casual,"January 23, 2014",1.0,2.3 and up
7113,CBRadioTab,TOOLS,3.9,127,1.5M,"50,000+",Free,0,Everyone,Tools,"April 1, 2017",1.0,3.0 and up


We evaluate the quality of each app mainly by its rating, so apps without a valid rating are dropped. Missing values in "Type", "Content Rating", "Current Ver", or "Android Ver" can be left as they are, because those fields are not essential to our analysis.

In [7]:
google_apps.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

In [8]:
google_apps = google_apps.dropna(subset=['Rating'])
google_apps.shape

(9367, 13)
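
The shape above is consistent with the null counts: 10841 rows minus 1474 missing ratings leaves 9367. As a small illustration of why dropna(subset=['Rating']) removes only unrated apps and leaves other gaps alone, here is a sketch on a made-up frame. The toy values are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for google_apps (values are made up).
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, np.nan, 3.5],
    "Type": ["Free", "Free", None],
})

# Only the row with a missing Rating (app B) is dropped;
# the missing Type of app C does not cause its row to be removed.
kept = toy.dropna(subset=["Rating"])
print(kept["App"].tolist())
print(kept.isnull().sum())
```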

App versions are not useful for this project. We keep only each app's title, category, rating, number of reviews, size, price, and content rating.

In [9]:
col_n = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Price', 'Content Rating']
google_apps = google_apps[col_n].copy()

In [10]:
google_apps.groupby('Category').count()

Category,App,Rating,Reviews,Size,Price,Content Rating
1.9,1,1,1,1,1,0
ART_AND_DESIGN,62,62,62,62,62,62
AUTO_AND_VEHICLES,73,73,73,73,73,73
BEAUTY,42,42,42,42,42,42
BOOKS_AND_REFERENCE,178,178,178,178,178,178
BUSINESS,303,303,303,303,303,303
COMICS,58,58,58,58,58,58
COMMUNICATION,328,328,328,328,328,328
DATING,195,195,195,195,195,195
EDUCATION,155,155,155,155,155,155


"1.9" is not a valid category. Its row also has no "Content Rating", which suggests a malformed record, so we drop it. In addition, we merge related categories so that each resulting genre contains enough apps to be evaluated.

In [11]:
google_apps = google_apps[google_apps['Category'] != '1.9']
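
Rather than spotting stray values such as "1.9" by eye, they can be flagged with a pattern check. The sketch below assumes that every valid category label consists only of uppercase letters and underscores; that holds for the labels shown in the output above, but it is an assumption for the rest of the column. The toy series is made up for illustration.

```python
import pandas as pd

# Toy Category column; "1.9" mimics the stray value seen in the groupby output.
categories = pd.Series(["GAME", "ART_AND_DESIGN", "1.9", "TOOLS"])

# Assumption: a valid label is made of uppercase letters and underscores only.
is_valid = categories.str.match(r"^[A-Z_]+$")

# Everything that fails the pattern is reported for manual inspection.
suspicious = categories[~is_valid]
print(suspicious.tolist())
```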

In [12]:
def normalise_genre(genre):
    """Map a Play Store category (e.g. 'GAME') to one of the merged genre names in std."""
    genre = str(genre)
    # Merged genre names; the integers in genre2cat are indices into this list.
    std = ['Utilities', 'Auto & Vehicles', 'Food & Drink', 'Health & Fitness', 'Lifestyle', 'Games', 'Books & Reference', 'Business', 'Entertainment', 'Social Networking', 'Education', 'News', 'Others']
    # Original category -> index of its merged genre in std.
    genre2cat = {'ART_AND_DESIGN': 0, 'BEAUTY': 0,
                'AUTO_AND_VEHICLES': 1, 'TRAVEL_AND_LOCAL': 1, 'MAPS_AND_NAVIGATION': 1,
                'FOOD_AND_DRINK': 2,
                'HEALTH_AND_FITNESS': 3, 'SPORTS': 3, 'MEDICAL': 3,
                'LIFESTYLE': 4, 'SHOPPING': 4,
                'GAME': 5,
                'BOOKS_AND_REFERENCE': 6, 'TOOLS': 6, 'PRODUCTIVITY': 6, 'PHOTOGRAPHY': 6,
                'BUSINESS': 7, 'FINANCE': 7,
                'COMICS': 8, 'ENTERTAINMENT': 8, 'VIDEO_PLAYERS': 8,
                'COMMUNICATION': 9, 'DATING': 9, 'SOCIAL': 9,
                'EDUCATION': 10,
                'EVENTS': 11, 'NEWS_AND_MAGAZINES': 11,
                'HOUSE_AND_HOME': 12, 'LIBRARIES_AND_DEMO': 12, 'FAMILY': 12, 'PERSONALIZATION': 12, 'PARENTING': 12, 'WEATHER': 12}
    # A category that is not listed in genre2cat raises KeyError here.
    return std[genre2cat[genre.strip()]]

In [13]:
google_apps['Category'] = google_apps['Category'].map(normalise_genre)
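
Because normalise_genre looks each category up in a dictionary, any category missing from genre2cat raises a KeyError, which is why "1.9" had to be filtered out first. One way to list unmapped categories before normalising is to map with a plain dictionary and look for NaN results. The sketch below uses an abbreviated, hypothetical mapping (category_map) and made-up data, not the full table above.

```python
import pandas as pd

# Abbreviated, hypothetical mapping for illustration only.
category_map = {"GAME": "Games", "FINANCE": "Business", "BUSINESS": "Business"}

# Toy Category values; "1.9" has no entry in category_map.
raw = pd.Series(["GAME", "FINANCE", "1.9"])

# Series.map with a dict leaves unknown keys as NaN instead of raising KeyError.
mapped = raw.map(category_map)

# Categories that would have failed in a strict dictionary lookup.
unmapped = raw[mapped.isna()].unique().tolist()
print(unmapped)
```

If the resulting list is empty, every category has a target genre and the strict lookup is safe to run.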

In [14]:
google_apps.to_csv(datasets_dir + google_kaggle_dataset, index=False)
google_apps.shape

(9366, 7)