# Capstone Project on Apple App Rating 

### Problem Summary:
With millions of apps available today, mobile app analytics is a useful way to understand which existing strategies drive growth and retention of future users. This data set contains details for more than 7,000 Apple iOS mobile applications, e.g. size, price, genre, rating count, and description. The data was extracted from the iTunes Search API on the Apple Inc. website. The goal is to predict whether the overall rating for an app is more than 4 stars (1 = yes, 0 = no), which we treat as the mark of a very good app.
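
The target rule stated above can be sketched as follows. The column name `user_rating` and the sample values are hypothetical; the training file already ships with the binary `rating` column, so this only illustrates the "more than 4 stars" rule.

```python
import pandas as pd

# Hypothetical average star ratings; `user_rating` is an assumed column name.
apps = pd.DataFrame({"user_rating": [4.5, 4.0, 3.5, 5.0]})

# 1 if the overall rating is strictly above 4 stars, else 0.
apps["rating"] = (apps["user_rating"] > 4).astype(int)
print(apps)
```

An app rated exactly 4.0 is labelled 0 under this strict reading of the rule.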

# Data Wrangling Steps

1. Data Collection
    * Locating the data
    * Data loading
    * Data joining
2. Data Organization
    * File structure
    * Git & Github
3. Data Definition
    * Column names
    * Data types (numeric, categorical, timestamp, etc.)
    * Description of the columns
    * Count or percent per unique values or codes (including NA)
    * The range of values or codes
4. Data Cleaning
    * NA or missing data
    * Duplicates
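
Most of the data-definition items in step 3 can be gathered into a single table. A minimal sketch, assuming a pandas DataFrame; `summarize_columns` is a hypothetical helper and the toy frame only stands in for the real data.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, unique count and unique share per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
        "pct_unique": df.nunique() / len(df) * 100,
    })

# Toy data standing in for the app table.
toy = pd.DataFrame({"price": [0.0, 0.99, 0.0, None], "genre": ["Games"] * 4})
summary = summarize_columns(toy)
print(summary)
```

Note that `nunique()` ignores missing values by default, so the missing price is not counted as a distinct value.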

Importing packages

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from langdetect import detect, DetectorFactory
from textblob import TextBlob
from collections import Counter

Print out current working directory

In [2]:
print(os.getcwd())

/Users/oluwafemibabatunde


In [3]:
path = '/Users/oluwafemibabatunde/Desktop/Springboard/capstone_one/apple-app'

Change current working directory

In [4]:
os.chdir(path)  # os.chdir returns None, so there is nothing useful to assign
os.listdir()    # with no argument, lists the current working directory

['app_train.csv',
 'app_test.csv',
 'figures',
 'model',
 'data',
 'sample_submission.csv']

Import Train Data using pandas

In [5]:
df_train = pd.read_csv('app_train.csv')

In [6]:
df_train.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,desc,rating
0,1169417102,ミリオン行進曲,196380672,USD,4.99,1,1,1.02,4+,Games,38,5,1,1,音楽事務所の社長に就任して歌手の卵を育てる、アイドル歌手育成シミュレーションゲーム ！\r\...,0
1,965748314,Pinata Hunter 3,38805504,USD,0.0,199,199,1.0.0,9+,Games,43,3,16,0,"Finally, it is back! The Pinata is here with t...",0
2,307764057,niconico,25808896,USD,0.0,182,0,6.52,17+,Entertainment,37,5,3,1,The Niconico app allows you to watch Niconico ...,0
3,1005783927,Frozen Frenzy Mania: Challenging Match 3 Games,296790016,USD,0.0,4104,143,2.1.1,4+,Games,37,5,1,1,Match ice cream treats to break through cookie...,1
4,350642635,Plants vs. Zombies,105379840,USD,0.99,426463,680,1.9.13,9+,Games,38,0,5,1,The game requires iOS 6 compatible device.\r\n...,1


Creating file structure for data organization

In [7]:
datapath = "/Users/oluwafemibabatunde/Desktop/Springboard/capstone_one/apple-app/data"
if not os.path.isdir(datapath):
   os.makedirs(datapath)

In [8]:
figpath = "/Users/oluwafemibabatunde/Desktop/Springboard/capstone_one/apple-app/figures"
if not os.path.isdir(figpath):
   os.makedirs(figpath)

In [9]:
modelpath = "/Users/oluwafemibabatunde/Desktop/Springboard/capstone_one/apple-app/model"
if not os.path.isdir(modelpath):
   os.makedirs(modelpath)
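
The three folder-creation cells follow the same pattern, so they could be folded into one loop. A sketch using `os.makedirs(..., exist_ok=True)`, which makes the explicit existence check unnecessary; a temporary directory stands in for the project folder here.

```python
import os
import tempfile

# A temporary directory stands in for the project root (assumption for the demo).
project_root = tempfile.mkdtemp()

created = []
for name in ("data", "figures", "model"):
    folder = os.path.join(project_root, name)
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    created.append(folder)

print(sorted(os.listdir(project_root)))
```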

View shape of dataframe

In [10]:
df_train.shape

(5197, 16)

View descriptive Statistics of dataframe

In [11]:
df_train.describe()

Unnamed: 0,id,size_bytes,price,rating_count_tot,rating_count_ver,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,rating
count,5197.0,5197.0,5197.0,5197.0,5197.0,5197.0,5197.0,5197.0,5197.0,5197.0
mean,864588100.0,197058900.0,1.741312,12884.72,463.634982,37.422359,3.719454,5.405426,0.993265,0.434866
std,271391900.0,341742400.0,6.534191,70802.12,4039.02234,3.626466,1.981193,7.901467,0.081796,0.495787
min,281656500.0,618496.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0
25%,599852900.0,47585280.0,0.0,27.0,1.0,37.0,3.0,1.0,1.0,0.0
50%,981819100.0,98246660.0,0.0,307.0,24.0,37.0,5.0,1.0,1.0,0.0
75%,1082678000.0,185370600.0,1.99,2908.0,145.0,38.0,5.0,8.0,1.0,1.0
max,1187839000.0,4025970000.0,299.99,2161558.0,177050.0,47.0,5.0,75.0,1.0,1.0
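
Because `rating` is a 0/1 column, its mean in the table above is the share of apps labelled 1. A short worked calculation using the `count` and `mean` values reported by `describe()`:

```python
n_apps = 5197           # count row of describe()
mean_rating = 0.434866  # mean of the binary rating column

# The mean of a 0/1 column is the fraction of ones.
n_positive = round(n_apps * mean_rating)
n_negative = n_apps - n_positive
print(n_positive, n_negative)
```

Roughly 43% of the training apps fall in the positive class, so the two classes are only mildly imbalanced.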


Count of missing values per column

In [12]:
df_train.isnull().sum()

id                  0
track_name          0
size_bytes          0
currency            0
price               0
rating_count_tot    0
rating_count_ver    0
ver                 0
cont_rating         0
prime_genre         0
sup_devices.num     0
ipadSc_urls.num     0
lang.num            0
vpp_lic             0
desc                0
rating              0
dtype: int64

In [13]:
df_train.isna().sum()

id                  0
track_name          0
size_bytes          0
currency            0
price               0
rating_count_tot    0
rating_count_ver    0
ver                 0
cont_rating         0
prime_genre         0
sup_devices.num     0
ipadSc_urls.num     0
lang.num            0
vpp_lic             0
desc                0
rating              0
dtype: int64

Language counts for the app name (track_name) column

In [14]:
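texl70 = df_train['track_name']
langdet = []

# Detect the language of each app name. detect() raises on text without
# usable features (e.g. purely numeric names); those rows are labelled 'no'.
# Caution: 'no' is also langdetect's code for Norwegian.
for i in range(len(df_train)):
    try:
        lang = detect(texl70[i])
    except Exception:
        lang = 'no'
        print("This row throws error:", texl70[i])
    langdet.append(lang)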

This row throws error: 1010!
This row throws error: 2048
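
One caveat with the loop above: the fallback label `'no'` is also the code langdetect uses for Norwegian, so failed rows and Norwegian names land in the same bucket. A minimal sketch of a wrapper with a distinct fallback label follows. `label_languages` and `fake_detect` are hypothetical; the stub only imitates a detector that raises on text without letters. In the notebook, `langdetect.detect` would be passed instead, and setting `DetectorFactory.seed = 0` beforehand makes langdetect's results repeatable between runs.

```python
from collections import Counter

def label_languages(texts, detector, failed="unknown"):
    """Apply `detector` to each text; use `failed` when detection raises."""
    labels = []
    for text in texts:
        try:
            labels.append(detector(text))
        except Exception:
            labels.append(failed)
    return labels

def fake_detect(text):
    # Stub detector for the demo: raises when the text has no letters.
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "en"

labels = label_languages(["Plants vs. Zombies", "2048", "1010!"], fake_detect)
print(Counter(labels).most_common())
```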


In [15]:
new_vals = Counter(langdet).most_common()  # most_common() already returns descending order

for a, b in new_vals:
    print(a, b)

en 2789
ja 412
de 362
zh-cn 200
tl 115
ko 114
ro 108
it 100
fr 87
nl 75
af 74
no 74
id 68
ca 61
pt 57
sw 48
es 47
da 44
sv 43
hr 41
pl 37
cy 35
so 32
et 29
tr 27
fi 25
vi 22
lt 19
sl 14
sk 13
cs 8
lv 5
hu 5
sq 5
ar 2
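
`Counter.most_common()` already returns pairs sorted from most to least frequent, so no extra sorting step is needed. A small sketch with three of the counts from the output above; slicing with `[::1]` only copies the list, while `[::-1]` reverses it.

```python
from collections import Counter

counts = Counter({"ja": 412, "en": 2789, "de": 362})

ranked = counts.most_common()  # descending by count
same_order = ranked[::1]       # a plain copy, order unchanged
ascending = ranked[::-1]       # reversed: least frequent first

print(ranked)
print(ascending)
```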


Language counts for the app description (desc) column

In [23]:
texldesc = df_train['desc']
langdesc = []

for i in range(len(df_train)):
    try:
        langdes = detect(texldesc[i])
    except Exception:
        langdes = 'no'
        print("This row throws error:", texldesc[i])
    langdesc.append(langdes)

# Count the description detections (langdesc), not the name detections (langdet)
new_lan = Counter(langdesc).most_common()  # already in descending order

for v, m in new_lan:
    print(v, m)


View Datatypes of Variables in Dataframe

In [16]:
df_train.dtypes

id                    int64
track_name           object
size_bytes            int64
currency             object
price               float64
rating_count_tot      int64
rating_count_ver      int64
ver                  object
cont_rating          object
prime_genre          object
sup_devices.num       int64
ipadSc_urls.num       int64
lang.num              int64
vpp_lic               int64
desc                 object
rating                int64
dtype: object
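
`cont_rating` is stored as text such as `4+` or `17+`. If a numeric minimum age is more convenient later on, it could be parsed as below. This is only a sketch, assuming every value has the form digits followed by `+`; `min_age` is a hypothetical helper.

```python
def min_age(cont_rating: str) -> int:
    """Convert a content rating like '17+' to the integer 17."""
    return int(cont_rating.rstrip("+"))

ages = [min_age(value) for value in ["4+", "9+", "17+"]]
print(ages)
```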

Table giving description of variables

In [17]:
variable_description = {'id': 'App ID', 'track_name': 'App Name',
'size_bytes': 'Size (in Bytes)', 'currency': 'Currency Type',
'price': 'Price amount',
'rating_count_tot': 'User Rating counts (for all versions)',
'rating_count_ver': 'User Rating counts (for current version)',
'ver': 'Latest version code',
'cont_rating': 'Content Rating',
'prime_genre': 'Primary Genre',
'sup_devices.num': 'Number of supporting devices',
'ipadSc_urls.num': 'Number of screenshots shown for display',
'lang.num': 'Number of supported languages',
'vpp_lic': 'Vpp Device Based Licensing Enabled',
'desc': 'App description',
'rating': 'Whether the overall user rating is above 4 stars or not (1=yes, 0=no)'}
variables = pd.DataFrame.from_dict(variable_description, orient='index')
variables.index.name = 'Variables'
variables = variables.rename(columns={0:'Description'})
variables

Variables,Description
id,App ID
track_name,App Name
size_bytes,Size (in Bytes)
currency,Currency Type
price,Price amount
rating_count_tot,User Rating counts (for all versions)
rating_count_ver,User Rating counts (for current version)
ver,Latest version code
cont_rating,Content Rating
prime_genre,Primary Genre
sup_devices.num,Number of supporting devices
ipadSc_urls.num,Number of screenshots shown for display
lang.num,Number of supported languages
vpp_lic,Vpp Device Based Licensing Enabled
desc,App description
rating,"Whether the overall user rating is above 4 stars or not (1=yes, 0=no)"


More information on columns in dataframe

In [18]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5197 entries, 0 to 5196
Data columns (total 16 columns):
id                  5197 non-null int64
track_name          5197 non-null object
size_bytes          5197 non-null int64
currency            5197 non-null object
price               5197 non-null float64
rating_count_tot    5197 non-null int64
rating_count_ver    5197 non-null int64
ver                 5197 non-null object
cont_rating         5197 non-null object
prime_genre         5197 non-null object
sup_devices.num     5197 non-null int64
ipadSc_urls.num     5197 non-null int64
lang.num            5197 non-null int64
vpp_lic             5197 non-null int64
desc                5197 non-null object
rating              5197 non-null int64
dtypes: float64(1), int64(9), object(6)
memory usage: 649.8+ KB


In [19]:
df_train.nunique()

id                  5197
track_name          5196
size_bytes          5144
currency               1
price                 34
rating_count_tot    2488
rating_count_ver     949
ver                 1305
cont_rating            4
prime_genre           23
sup_devices.num       20
ipadSc_urls.num        6
lang.num              54
vpp_lic                2
desc                5179
rating                 2
dtype: int64

Checking percentage of unique values per variable in dataframe

In [20]:
dfSki = df_train.nunique()
n_rows = len(df_train)  # divide by the row count; df_train.size is rows x columns
percentage_dfSki = (dfSki / n_rows) * 100
print(percentage_dfSki)

id                  100.000000
track_name           99.980758
size_bytes           98.980181
currency              0.019242
price                 0.654224
rating_count_tot     47.873773
rating_count_ver     18.260535
ver                  25.110641
cont_rating           0.076967
prime_genre           0.442563
sup_devices.num       0.384837
ipadSc_urls.num       0.115451
lang.num              1.039061
vpp_lic               0.038484
desc                 99.653646
rating                0.038484
dtype: float64


Verify counts of currency

In [24]:
df_train['currency'].value_counts()

USD    5197
Name: currency, dtype: int64
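
A column with a single value carries no information for prediction. Such columns can be found in one pass rather than checked one at a time. A minimal sketch on toy data; the frame below is illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "currency": ["USD", "USD", "USD"],
    "price": [0.0, 0.99, 4.99],
})

# Columns whose number of distinct values is exactly one.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```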

Dropping redundant variables. The track_name and desc columns are dropped because 2,408 of the records were detected as not being in English. Dropping those rows instead would mean parting with about 46% of the data, so dropping the two columns is the more reasonable step. The currency column is dropped as well because it holds a single value (USD). Another option would be to translate the text to English; whether that is feasible is still an open question.
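
The 2,408 and 46% figures follow from the language counts for `track_name`, where 2,789 of the 5,197 names were detected as English:

```python
total_apps = 5197
english_names = 2789  # 'en' count from the track_name language detection

non_english = total_apps - english_names
share = non_english / total_apps * 100
print(non_english, round(share, 1))
```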

In [29]:
df = df_train.drop(['track_name','currency','desc'], axis=1) 

Checking for duplicated rows in dataframe

In [31]:
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF

Unnamed: 0,id,size_bytes,price,rating_count_tot,rating_count_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,rating
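
`df.duplicated()` only flags rows that match on every column. Checking a key column separately can surface repeats that differ elsewhere. A sketch on toy data, using the `subset` and `keep` parameters of `DataFrame.duplicated`:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": [0.0, 0.99, 1.99, 0.0],
})

full_dupes = toy[toy.duplicated()]                         # identical rows
id_dupes = toy[toy.duplicated(subset=["id"], keep=False)]  # all rows sharing an id
print(len(full_dupes), len(id_dupes))
```

Here no row is an exact copy of another, yet two rows share the same `id`.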


Some of the data wrangling steps were skipped because the data was already clean at its source.
df_train contains no null values and no duplicated rows.
The only change is the removal of the track_name, currency and desc columns, so the reduced dataframe is written out for use in the next step.

Write out cleaned dataframe to folder

In [33]:
df.to_csv(r'/Users/oluwafemibabatunde/Desktop/Springboard/capstone_one/apple-app/data/step1_output.csv')
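
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. Passing `index=False` avoids that. A sketch with temporary files; the toy frame is illustrative.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "price": [0.0, 0.99]})
out_dir = tempfile.mkdtemp()

with_index = os.path.join(out_dir, "with_index.csv")
without_index = os.path.join(out_dir, "without_index.csv")
toy.to_csv(with_index)                  # default: index written as first column
toy.to_csv(without_index, index=False)  # index omitted

cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_with)
print(cols_without)
```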