# Capstone Project on Apple App Rating 

### Problem Summary:
With millions of apps available today, mobile app analytics is a useful way to understand which strategies drive growth and user retention. This data set contains details for more than 7,000 Apple iOS mobile applications, e.g. size, price, genre, rating count, and description. The data was extracted from the iTunes Search API on Apple's website. The goal is to predict whether an app's overall rating is above 4 stars (1 = yes, 0 = no), which we treat as the mark of a very good app.

# Data Wrangling Steps

1. Data Collection
    * Locating the data
    * Data loading
    * Data joining
2. Data Organization
    * File structure
    * Git & GitHub
3. Data Definition
    * Column names
    * Data types (numeric, categorical, timestamp, etc.)
    * Description of the columns
    * Count or percent per unique values or codes (including NA)
    * The range of values or codes
4. Data Cleaning
    * NA or missing data
    * Duplicates

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

In [2]:
print(os.getcwd())

/Users/oluwafemibabatunde


In [3]:
path = '/Users/oluwafemibabatunde/Desktop/Springboard/capstone_one/apple-app'

In [4]:
os.chdir(path)  # os.chdir returns None, so its result is not stored
os.listdir(path)

['app_train.csv', 'app_test.csv', 'sample_submission.csv']
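
A minimal sketch of listing the data files without changing the working directory, using `pathlib`. The helper name `list_csv_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def list_csv_files(data_dir):
    """Return the sorted names of the CSV files directly inside data_dir."""
    return sorted(p.name for p in Path(data_dir).glob("*.csv"))


# Example call (assumes the data folder is the current directory):
# list_csv_files(".")
```

Keeping the folder as an explicit argument means later cells can build full file paths from it instead of depending on the current working directory.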

In [5]:
df_train = pd.read_csv('app_train.csv')

In [10]:
df_train.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,desc,rating
0,1169417102,ミリオン行進曲,196380672,USD,4.99,1,1,1.02,4+,Games,38,5,1,1,音楽事務所の社長に就任して歌手の卵を育てる、アイドル歌手育成シミュレーションゲーム ！\r\...,0
1,965748314,Pinata Hunter 3,38805504,USD,0.0,199,199,1.0.0,9+,Games,43,3,16,0,"Finally, it is back! The Pinata is here with t...",0
2,307764057,niconico,25808896,USD,0.0,182,0,6.52,17+,Entertainment,37,5,3,1,The Niconico app allows you to watch Niconico ...,0
3,1005783927,Frozen Frenzy Mania: Challenging Match 3 Games,296790016,USD,0.0,4104,143,2.1.1,4+,Games,37,5,1,1,Match ice cream treats to break through cookie...,1
4,350642635,Plants vs. Zombies,105379840,USD,0.99,426463,680,1.9.13,9+,Games,38,0,5,1,The game requires iOS 6 compatible device.\r\n...,1


In [6]:
df_test = pd.read_csv('app_test.csv')

In [11]:
df_test.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,desc
0,893525571,Weather & Radar Pro Ad-Free,87799808,USD,2.99,6,1,4.3.2,4+,Weather,37,3,21,1,Hourly weather forecasts. Rainfall radar and s...
1,1116564897,多纳餐厅2,104617984,USD,2.99,0,0,3.1,4+,Education,38,5,1,1,Every child has a little chef dream! Donut Res...
2,1140507373,Mini Games Maps for Minecraft PE - The Best Ma...,93831168,USD,0.0,943,943,1.0,4+,Entertainment,37,4,1,1,Explore the BEST Minecraft PE MINI GAMES Maps ...
3,346184215,TaxCaster – Free tax refund calculator,7111680,USD,0.0,17516,125,7.2,4+,Finance,37,5,1,1,Get a quick estimate of your 2016 tax refund.\...
4,1071712425,中高英文法を10時間で！マジグラ,49196032,USD,0.99,1,1,1.1.0,4+,Education,37,0,1,1,英文法に苦手意識はありませんか？\r\nマジグラを1日たった20分続ければTOEIC英文法を...


In [50]:
figpath = "/Users/oluwafemibabatunde/Desktop/Springboard/capstone_one/apple-app/figures"
os.makedirs(figpath, exist_ok=True)  # create the folder only if it is missing

In [51]:
modelpath = "/Users/oluwafemibabatunde/Desktop/Springboard/capstone_one/apple-app/model"
os.makedirs(modelpath, exist_ok=True)  # create the folder only if it is missing

In [8]:
df_train.shape

(5197, 16)

In [9]:
df_test.shape

(2000, 15)
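
The shapes show that the test set has one column fewer than the training set; judging from the previews, the missing column is the target `rating`. A minimal sketch for confirming this, using small made-up frames in place of `df_train` and `df_test`. The helper name `extra_columns` is illustrative only.

```python
import pandas as pd


def extra_columns(train, test):
    """Return columns present in train but missing from test, in train order."""
    return [col for col in train.columns if col not in test.columns]


# Hypothetical stand-ins for df_train and df_test (values are made up).
train_demo = pd.DataFrame({"id": [1, 2], "price": [0.0, 0.99], "rating": [0, 1]})
test_demo = pd.DataFrame({"id": [3], "price": [2.99]})

print(extra_columns(train_demo, test_demo))
```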

In [12]:
df_train.describe()

Unnamed: 0,id,size_bytes,price,rating_count_tot,rating_count_ver,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,rating
count,5197.0,5197.0,5197.0,5197.0,5197.0,5197.0,5197.0,5197.0,5197.0,5197.0
mean,864588100.0,197058900.0,1.741312,12884.72,463.634982,37.422359,3.719454,5.405426,0.993265,0.434866
std,271391900.0,341742400.0,6.534191,70802.12,4039.02234,3.626466,1.981193,7.901467,0.081796,0.495787
min,281656500.0,618496.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0
25%,599852900.0,47585280.0,0.0,27.0,1.0,37.0,3.0,1.0,1.0,0.0
50%,981819100.0,98246660.0,0.0,307.0,24.0,37.0,5.0,1.0,1.0,0.0
75%,1082678000.0,185370600.0,1.99,2908.0,145.0,38.0,5.0,8.0,1.0,1.0
max,1187839000.0,4025970000.0,299.99,2161558.0,177050.0,47.0,5.0,75.0,1.0,1.0
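
Because `rating` is a 0/1 target, its mean in the summary above (about 0.435) is the share of apps rated above 4 stars. A short sketch of reading the class balance directly, using a made-up series in place of `df_train['rating']`:

```python
import pandas as pd

# Hypothetical target column standing in for df_train["rating"].
rating_demo = pd.Series([1, 0, 0, 1, 0], name="rating")

# Share of each class; for a 0/1 column the share of 1s equals the mean.
class_share = rating_demo.value_counts(normalize=True).sort_index()
print(class_share)
```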


In [13]:
df_test.describe()

Unnamed: 0,id,size_bytes,price,rating_count_tot,rating_count_ver,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,859344700.0,204527800.0,1.686995,12914.19,451.9,37.2045,3.675,5.5115,0.9925
std,270864400.0,401090800.0,3.390946,87292.42,3595.068056,4.009079,1.998593,7.967974,0.086299
min,282614200.0,589824.0,0.0,0.0,0.0,11.0,0.0,0.0,0.0
25%,602553500.0,45226750.0,0.0,28.75,1.0,37.0,3.0,1.0,1.0
50%,969482400.0,93445630.0,0.0,275.0,21.0,37.0,5.0,1.0,1.0
75%,1081903000.0,175047400.0,2.99,2535.0,131.0,38.0,5.0,8.0,1.0
max,1188376000.0,3968638000.0,59.99,2974676.0,117470.0,47.0,5.0,74.0,1.0


In [15]:
df_train.dtypes

id                    int64
track_name           object
size_bytes            int64
currency             object
price               float64
rating_count_tot      int64
rating_count_ver      int64
ver                  object
cont_rating          object
prime_genre          object
sup_devices.num       int64
ipadSc_urls.num       int64
lang.num              int64
vpp_lic               int64
desc                 object
rating                int64
dtype: object
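
The dtypes split the columns into numeric ones and text-like ones. A sketch of separating the two groups programmatically with `select_dtypes`, on a made-up frame that reuses a few of the column names above:

```python
import pandas as pd

# Hypothetical stand-in for df_train (values are made up).
demo = pd.DataFrame({
    "price": [0.0, 0.99],
    "prime_genre": ["Games", "Weather"],
    "rating": [0, 1],
})

# Numeric columns (int and float) versus everything else.
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
text_cols = demo.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, text_cols)
```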

In [43]:
# Keys match the column names in df_train exactly
variable_description = {'id': 'App ID', 'track_name': 'App Name',
'size_bytes': 'Size (in Bytes)', 'currency': 'Currency Type',
'price': 'Price amount',
'rating_count_tot': 'User Rating counts (for all versions)',
'rating_count_ver': 'User Rating counts (for current version)',
'ver': 'Latest version code',
'cont_rating': 'Content Rating',
'prime_genre': 'Primary Genre',
'sup_devices.num': 'Number of supported devices',
'ipadSc_urls.num': 'Number of screenshots shown for display',
'lang.num': 'Number of supported languages',
'vpp_lic': 'Vpp Device Based Licensing Enabled',
'desc': 'App description',
'rating': 'Whether the overall user rating is above 4 stars or not (1=yes, 0=no)'}
variables = pd.DataFrame.from_dict(variable_description, orient='index')
variables.index.name = 'Variables'
variables = variables.rename(columns={0: 'Description'})
variables

Variables,Description
id,App ID
track_name,App Name
size_bytes,Size (in Bytes)
currency,Currency Type
price,Price amount
rating_count_tot,User Rating counts (for all versions)
rating_count_ver,User Rating counts (for current version)
ver,Latest version code
cont_rating,Content Rating
prime_genre,Primary Genre


In [44]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5197 entries, 0 to 5196
Data columns (total 16 columns):
id                  5197 non-null int64
track_name          5197 non-null object
size_bytes          5197 non-null int64
currency            5197 non-null object
price               5197 non-null float64
rating_count_tot    5197 non-null int64
rating_count_ver    5197 non-null int64
ver                 5197 non-null object
cont_rating         5197 non-null object
prime_genre         5197 non-null object
sup_devices.num     5197 non-null int64
ipadSc_urls.num     5197 non-null int64
lang.num            5197 non-null int64
vpp_lic             5197 non-null int64
desc                5197 non-null object
rating              5197 non-null int64
dtypes: float64(1), int64(9), object(6)
memory usage: 649.8+ KB
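
`info()` reports 5197 non-null entries for every column, which indicates that nothing is missing. A sketch of an explicit missing-value check; `missing_summary` is a hypothetical helper and the frame below is made up to show a non-zero result:

```python
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    percent = counts / len(df) * 100
    return pd.DataFrame({"missing": counts, "percent": percent})


# Hypothetical frame with one missing price (values are made up).
demo = pd.DataFrame({"price": [0.0, None, 1.99, 0.99], "rating": [0, 1, 1, 0]})
print(missing_summary(demo))
```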


In [45]:
df_train.nunique()

id                  5197
track_name          5196
size_bytes          5144
currency               1
price                 34
rating_count_tot    2488
rating_count_ver     949
ver                 1305
cont_rating            4
prime_genre           23
sup_devices.num       20
ipadSc_urls.num        6
lang.num              54
vpp_lic                2
desc                5179
rating                 2
dtype: int64

In [47]:
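unique_counts = df_train.nunique()
# Note: DataFrame.size is rows * columns (5197 * 16 = 83152 cells), not the row count
total_cells = df_train.size
percentage_unique = (unique_counts / total_cells) * 100
print(percentage_unique)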

id                  6.250000
track_name          6.248797
size_bytes          6.186261
currency            0.001203
price               0.040889
rating_count_tot    2.992111
rating_count_ver    1.141283
ver                 1.569415
cont_rating         0.004810
prime_genre         0.027660
sup_devices.num     0.024052
ipadSc_urls.num     0.007216
lang.num            0.064941
vpp_lic             0.002405
desc                6.228353
rating              0.002405
dtype: float64
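
Note that `DataFrame.size` is rows × columns (5197 × 16 = 83,152 here), so the figures above are shares of all cells in the frame. If the intent is the percentage of distinct values per column relative to the number of rows, a sketch under that assumption divides by `len(df)` instead. The frame below is made up.

```python
import pandas as pd

# Hypothetical stand-in for df_train (values are made up).
demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "currency": ["USD", "USD", "USD", "USD"],
    "rating": [0, 1, 1, 0],
})

# Distinct values per column as a percentage of the row count.
unique_pct = demo.nunique() / len(demo) * 100
print(unique_pct)
```

With this denominator an identifier column such as `id` comes out at 100%, and a constant column such as `currency` comes out near zero.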


In [49]:
df_train.agg(['min', 'max']).T  # string names avoid passing Python builtins to agg

Unnamed: 0,min,max
id,281656475,1187838770
track_name,! OH Fantastic Free Kick + Kick Wall Challenge,Ｘ:15秒の人気 アクション ゲーム
size_bytes,618496,4025969664
currency,USD,USD
price,0,299.99
rating_count_tot,0,2161558
rating_count_ver,0,177050
ver,0.0.15,v2.13.9
cont_rating,12+,9+
prime_genre,Book,Weather


In [55]:
duplicateRowsDF = df_train[df_train.duplicated()]
duplicateRowsDF

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,desc,rating
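
`duplicated()` with no arguments flags only rows that match on every column. The unique counts earlier show 5196 distinct `track_name` values across 5197 rows, so at least one name repeats even though no full row does. A sketch of checking duplicates on a subset of columns, with made-up data:

```python
import pandas as pd

# Hypothetical stand-in for df_train (values are made up).
demo = pd.DataFrame({
    "id": [10, 11, 12],
    "track_name": ["Alpha", "Beta", "Alpha"],
})

# keep=False marks every row in a duplicated group, not just the later ones.
same_name = demo[demo.duplicated(subset=["track_name"], keep=False)]
print(same_name)
```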


Some of the data wrangling steps were skipped because the data was already clean at its source. 
`df_train` contains no null values or duplicate rows, and no feature was dropped from the data set. 
Hence, there is no need to write out a new copy of the data set; the original data will be used.