Downloading the Apple dataset from Kaggle

Purpose: This notebook downloads the Apple dataset from Kaggle. Genres are "normalized" by merging them into a smaller set of categories.

In [1]:
import subprocess
from os.path import isfile, exists
import pandas as pd
import numpy as np

In [2]:
apple_kaggle_dataset = '1400_kaggle_dataset_apple.csv'

To avoid logging in to Kaggle, a copy of the original dataset is hosted on GitHub and downloaded from there. The dataset is first downloaded and, if it is compressed, unzipped. We don't download it every time: once it is stored in the datasets folder, we only need to open it.

In [3]:
datasets_dir = '../../datasets/'
force_download = False 

In [4]:
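apple_dataset = 'https://raw.githubusercontent.com/EloiseXu/Data-Science-in-Practice/master/AppleStore.csv'
local_filename = apple_dataset.split('/')[-1]
apple_filename = datasets_dir + local_filename

# Download only when the file is missing locally or a refresh is forced.
if not isfile(apple_filename) or force_download:
    curl_cmd = ['curl', '-L', apple_dataset, '--output', apple_filename]
    # check=True raises CalledProcessError if curl exits with an error.
    subprocess.run(curl_cmd, check=True)
    # The file is fetched as a plain .csv, so it needs no decompression.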
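
The download-once idea described above can also be sketched without shelling out to curl. The helper below is only an illustration: the name `download_if_missing` and its `force` parameter are hypothetical and not part of the original notebook, and it assumes the URL serves an uncompressed file.

```python
import os
import shutil
import urllib.request


def download_if_missing(url, dest, force=False):
    """Fetch `url` into `dest` unless the file is already there.

    Hypothetical helper: returns True when a download actually happened,
    False when the cached copy was reused.
    """
    if os.path.isfile(dest) and not force:
        return False
    # Create the target folder if needed ('.' when dest has no folder part).
    os.makedirs(os.path.dirname(dest) or '.', exist_ok=True)
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(dest, 'wb') as out:
        shutil.copyfileobj(response, out)
    return True
```

In this notebook it could be called as `download_if_missing(apple_dataset, apple_filename, force=force_download)`.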

In [5]:
apple_apps = pd.read_csv(apple_filename)
apple_apps.shape

(11100, 17)

In [6]:
apple_apps.sample(10)

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,game_enab
439,367003839,Hotels & Vacation Rentals by Booking.com,105609216,USD,0.0,31261,374,4.5,4.5,14.4,4+,Travel,37,5,40,1,0
2728,761980429,戦国乙女〜剣戟に舞う白き剣聖〜,1406185472,USD,14.99,0,0,0.0,0.0,1.0.4,17+,Games,43,0,2,0,0
3282,891194610,Earn to Die 2,102755328,USD,1.99,3072,82,4.5,4.5,1.3,12+,Games,39,5,14,1,0
2921,819197891,Don't Touch This - Secret Data Vault,40141824,USD,6.99,615,117,4.5,4.5,2.8,17+,Photo & Video,37,5,12,1,0
4951,1014277964,SelfieCity,71816192,USD,0.0,252,1,5.0,5.0,2.9.1,4+,Photo & Video,37,0,5,1,0
7959,1097724199,ぱちモンパズル〜簡単無料パズルRPGゲーム,66735104,USD,0.0,0,0,0.0,0.0,1.1.2,9+,Games,38,0,2,1,0
8336,1106487399,,0,,0.0,0,0,0.0,0.0,,,,1,1,1,0,0
2600,720208070,Block Fortress: War,451338240,USD,1.99,1889,804,4.5,4.5,1.2.4,12+,Games,43,5,1,1,0
6576,1068875169,Escape Game Escape from Lost Memory,185708544,USD,0.0,0,0,0.0,0.0,1.0.0,4+,Games,38,5,1,1,0
8698,1114711020,,0,,0.0,0,0,0.0,0.0,,,,1,1,1,0,0


In [7]:
apple_apps.isnull().sum()

id                     0
track_name          3903
size_bytes             0
currency            3903
price                  0
rating_count_tot       0
rating_count_ver       0
user_rating            0
user_rating_ver        0
ver                 3903
cont_rating         3903
prime_genre         3903
sup_devices.num        0
ipadSc_urls.num        0
lang.num               0
vpp_lic                0
game_enab              0
dtype: int64

We need to pick out apps that are released on both platforms by matching their titles, so apps without a valid title are dropped.

In [8]:
apple_apps = apple_apps.dropna(subset=['track_name'])
apple_apps.shape

(7197, 17)
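
The null counts above are identical (3903) for `track_name`, `currency`, `ver`, `cont_rating` and `prime_genre`, which suggests that the same rows are empty in all of them. A minimal sketch of how that could be checked is shown below; `same_missing_rows` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def same_missing_rows(df, columns):
    """Return True if every listed column is null in exactly the same rows."""
    masks = df[columns].isnull()
    # A row is consistent when its null flags are all True or all False.
    consistent = masks.all(axis=1) | ~masks.any(axis=1)
    return bool(consistent.all())


# Toy data (not the real dataset): the second row is empty in every column.
toy = pd.DataFrame({
    'track_name': ['A', None, 'C'],
    'currency': ['USD', None, 'USD'],
    'prime_genre': ['Games', None, 'Book'],
})
print(same_missing_rows(toy, ['track_name', 'currency', 'prime_genre']))  # prints True
```

If the check holds on the real data, dropping rows without a `track_name` also removes every row that lacks a genre or content rating.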

The version and number of languages of an app are not useful in this project. We keep only each app's id, title, size, price, rating count, user rating, content rating and genre.

In [9]:
col_n = ['id', 'track_name', 'size_bytes', 'price', 'rating_count_tot', 'user_rating', 'cont_rating', 'prime_genre']
# Keep only the selected columns; copy() avoids modifying a view later on.
apple_apps = apple_apps[col_n].copy()

In [10]:
apple_apps.groupby('prime_genre').count()

prime_genre,id,track_name,size_bytes,price,rating_count_tot,user_rating,cont_rating
Book,112,112,112,112,112,112,112
Business,57,57,57,57,57,57,57
Catalogs,10,10,10,10,10,10,10
Education,453,453,453,453,453,453,453
Entertainment,535,535,535,535,535,535,535
Finance,104,104,104,104,104,104,104
Food & Drink,63,63,63,63,63,63,63
Games,3862,3862,3862,3862,3862,3862,3862
Health & Fitness,180,180,180,180,180,180,180
Lifestyle,144,144,144,144,144,144,144


We merge some genres so there will be enough apps in each genre. 

In [11]:
def normalise_genre(genre):
    """Map an original genre to one of the merged categories in `std`."""
    genre = str(genre)
    std = ['Utilities', 'Auto & Vehicles', 'Food & Drink', 'Health & Fitness', 'Lifestyle', 'Games', 'Books & Reference', 'Business', 'Entertainment', 'Social Networking', 'Education', 'News', 'Others']
    # Each original genre points to the index of its merged category in `std`.
    genre2cat = {'Utilities': 0,
                 'Navigation': 1, 'Travel': 1,
                 'Food & Drink': 2,
                 'Health & Fitness': 3, 'Sports': 3, 'Medical': 3,
                 'Lifestyle': 4, 'Shopping': 4,
                 'Games': 5,
                 'Book': 6, 'Reference': 6,
                 'Finance': 7, 'Business': 7, 'Productivity': 7,
                 'Entertainment': 8, 'Music': 8,
                 'Social Networking': 9, 'Photo & Video': 9,
                 'Education': 10,
                 'News': 11,
                 'Catalogs': 12, 'Weather': 12}
    # An unknown genre raises KeyError rather than being silently mislabelled.
    return std[genre2cat[genre.strip()]]

In [12]:
apple_apps['prime_genre'] = apple_apps['prime_genre'].map(normalise_genre)
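
Because `normalise_genre` looks each genre up in a dictionary, a label that is missing from the mapping raises a `KeyError`. Before mapping, it can help to list any labels the mapping does not cover. The sketch below is an assumption-laden illustration: `unmapped_genres`, `known` and the toy series are hypothetical, and `known` stands in for the keys of `genre2cat`.

```python
import pandas as pd


def unmapped_genres(genres, known):
    """Return the sorted genre labels that have no entry in `known`."""
    # Strip whitespace the same way normalise_genre does before the lookup.
    seen = {str(g).strip() for g in genres.dropna().unique()}
    return sorted(seen - set(known))


# Hypothetical subset of the mapping keys and toy data, for illustration only.
known = {'Games', 'Book', 'Travel'}
toy = pd.Series(['Games', ' Book', 'Travel', 'Stickers'])
print(unmapped_genres(toy, known))  # prints ['Stickers']
```

An empty list means every genre in the column can be mapped safely.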

In [13]:
apple_apps.to_csv(datasets_dir + apple_kaggle_dataset, index=False)
apple_apps.shape

(7197, 8)