# Mobile Apps analysis
In our upcoming project, we plan to publish our apps on the App Store. We will use data analysis to help our developers understand what types of apps are likely to attract more users.

# Opening and Exploring the Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store.

Collecting data for about two million apps requires a significant amount of time and money, so we'll analyze a sample of data instead. To avoid spending resources on collecting new data ourselves, we should first check whether relevant existing data is available at no cost. Luckily, there is a data set that seems suitable for our purpose:

- A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from this link.

Let's start by opening the data set and then continue with exploring the data.

In [35]:
import pandas as pd
import re
import numpy as np

In [36]:
ios_df = pd.read_csv('AppleStore.csv')

In [37]:
print('Explore ios df')
ios_df.info()  # info() prints its summary itself and returns None
ios_df.head()

Explore ios df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7197 entries, 0 to 7196
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                7197 non-null   int64  
 1   track_name        7197 non-null   object 
 2   size_bytes        7197 non-null   int64  
 3   currency          7197 non-null   object 
 4   price             7197 non-null   float64
 5   rating_count_tot  7197 non-null   int64  
 6   rating_count_ver  7197 non-null   int64  
 7   user_rating       7197 non-null   float64
 8   user_rating_ver   7197 non-null   float64
 9   ver               7197 non-null   object 
 10  cont_rating       7197 non-null   object 
 11  prime_genre       7197 non-null   object 
 12  sup_devices.num   7197 non-null   int64  
 13  ipadSc_urls.num   7197 non-null   int64  
 14  lang.num          7197 non-null   int64  
 15  vpp_lic           7197 non-null   int64  
dtypes: float64(3), int64(8), ob

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


# Data cleaning
Our company only builds apps that are free to download and install, and we design them for an English-speaking audience. This means that we need to do the following:

- Remove non-English apps
- Remove apps that aren't free


## Apps that are non-English

In [38]:
ios_df['isEnglish'] = ios_df["track_name"].apply(lambda name: 1 if name.isascii() else 0)
notEnglish_Apps = ios_df.loc[ios_df['isEnglish'] == 0]
notEnglish_Apps

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,isEnglish
24,284815942,Google – Search made just for mobile,179979264,USD,0.00,479440,203,3.5,4.0,27.0,17+,Utilities,37,4,33,1,0
26,466965151,The Sims™ FreePlay,695603200,USD,0.00,446880,1832,4.5,4.0,5.29.0,12+,Games,38,5,12,1,0
31,543186831,8 Ball Pool™,86776832,USD,0.00,416736,19076,4.5,4.5,3.9.1,4+,Games,38,5,10,1,0
42,297368629,Lose It! – Weight Loss Program and Calorie Cou...,182054912,USD,0.00,373835,402,4.0,4.5,8.0.2,4+,Health & Fitness,37,3,1,1,0
46,366247306,▻Sudoku,71002112,USD,0.00,359832,17119,4.5,5.0,5.4,4+,Games,40,5,7,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7189,1070725569,【悲報】鬼ヶ島終了のお知らせ　-ゾンビ桃太郎が3Dすぎて鬼やばいwww-,147131392,USD,0.00,0,0,0.0,0.0,1.0.1,12+,Games,40,4,1,1,0
7190,1069789014,中学英文法総復習 パターンで覚える 瞬間英文法,22881280,USD,1.99,0,0,0.0,0.0,1.0.6,4+,Education,37,0,31,1,0
7191,1069796800,Brain15 − 脳トレ 無料パズル −,8912896,USD,0.00,0,0,0.0,0.0,1.2,12+,Games,38,0,1,1,0
7193,1069830936,【謎解き】ヤミすぎ彼女からのメッセージ,16808960,USD,0.00,0,0,0.0,0.0,1.2,9+,Book,38,0,1,1,0


The table above lists the 1490 apps that were not identified as English. Because isascii rejects any name that contains a non-ASCII character, it also flags some English apps whose names include symbols, e.g. Fruit Ninja® and The Sims™ FreePlay. A preview of the table shows that the genuinely non-English apps include names in Chinese, Japanese and Arabic.
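The behaviour of the ASCII check can be reproduced in isolation. The following is a minimal sketch using Python's built-in `str.isascii()` (available from Python 3.7) on a few names copied from the tables above; the list `sample_names` and the dictionary `ascii_flags` exist only for this illustration.

```python
# Names copied from the tables above; sample_names is an illustrative list.
sample_names = [
    "Facebook",
    "The Sims™ FreePlay",
    "8 Ball Pool™",
    "▻Sudoku",
    "中学英文法総復習 パターンで覚える 瞬間英文法",
]

# str.isascii() is True only if every character is in the ASCII range,
# so a single trademark sign is enough to flag an English name.
ascii_flags = {name: name.isascii() for name in sample_names}

for name, flag in ascii_flags.items():
    print(f"{flag!s:5} {name}")
```

Only "Facebook" passes the check here, which is why symbols such as ™ have to be handled before the filter is applied.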

In [39]:
dict_ = {'™':'','®':'','–':'','!':'','℠':'','Ⓞ':'','▻':'','':'','é':'','⋆':'','·':'','‼':'','‰':'',' ！':'','&':'and','’':'','◎':'','—':'',':':'',' － ':'','С':'',"'":'','！':'','∞':'','• ':'','－':'','∘':'','▪':'','■ ':'','＆':'and','⁺':''}
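# Apply each replacement once to the whole column; regex=False treats keys as literal text
for key, value in dict_.items():
    if key:  # skip the empty key, which would be a no-op
        ios_df['track_name'] = ios_df['track_name'].str.replace(key, value, regex=False)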

ios_df['isEnglish'] = ios_df["track_name"].apply(lambda name: 1 if name.isascii() else 0)
notEnglish_Apps = ios_df.loc[ios_df['isEnglish'] == 0]
notEnglish_Apps.shape

(1086, 17)

After removing these symbols, only 1086 apps are identified as non-English. In the next step, we create a table that contains only the English apps.
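An alternative to maintaining a replacement table by hand is to tolerate a small number of non-ASCII characters per name. The sketch below is not the method used in this notebook: the function name `is_probably_english` and the threshold of 3 are assumptions chosen for illustration, and names in languages that use mostly Latin letters would still pass.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii non-ASCII characters.

    Both the function name and the default threshold are illustrative
    assumptions, not part of the analysis above.
    """
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


# Names copied from the table of apps flagged as non-English.
examples = ["The Sims™ FreePlay", "▻Sudoku", "【謎解き】ヤミすぎ彼女からのメッセージ"]
results = [is_probably_english(name) for name in examples]
print(results)
```

With this rule, a name with one stray symbol is kept, while a name written mostly in another script is still rejected.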

In [40]:
English_Apps = ios_df.loc[ios_df['isEnglish'] == 1]
English_Apps.head()
English_Apps.shape

(6111, 17)

## Apps that are English & Free

In this section, we select the apps that are both free and English, and compute the proportion of free apps among all English apps.

In [41]:
free_English_Apps = English_Apps.loc[English_Apps['price']==0]
free_English_Apps.shape

(3178, 17)

In [42]:
proportion_free_english_apps = free_English_Apps.shape[0]/English_Apps.shape[0]
print(proportion_free_english_apps)

0.5200458190148912


About 52% of the English apps are free. Next, we find out which genre is the most common among them.
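The same kind of proportion can also be obtained without building a second table, because the mean of a boolean column is the fraction of `True` values. This is a sketch on made-up prices; `toy_apps` and `share_free` are illustrative names, and only the `price` column name comes from the data set.

```python
import pandas as pd

# Toy data with made-up prices; only the 'price' column name matches the data set.
toy_apps = pd.DataFrame({"price": [0.0, 0.0, 1.99, 0.0, 4.99]})

# The mean of a boolean Series is the fraction of True values.
share_free = (toy_apps["price"] == 0).mean()
print(share_free)
```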

In [43]:
prime_genre_dict = dict()

for prime_genre in free_English_Apps['prime_genre']:
    if prime_genre in prime_genre_dict:
        prime_genre_dict[prime_genre] += 1
    else:
        prime_genre_dict[prime_genre] = 1

sort_orders = sorted(prime_genre_dict.items(), key=lambda x: x[1], reverse=True)
for i in sort_orders:
	print(i[0], i[1])

Games 1857
Entertainment 252
Photo & Video 159
Education 118
Social Networking 104
Shopping 81
Utilities 77
Sports 68
Music 65
Health & Fitness 63
Productivity 54
Lifestyle 50
News 43
Travel 37
Finance 35
Weather 28
Food & Drink 26
Reference 17
Business 16
Book 12
Navigation 6
Medical 6
Catalogs 4


In [44]:
proportion_game = prime_genre_dict['Games']/free_English_Apps.shape[0]
print(proportion_game)

0.5843297671491504


The list above shows that Games is by far the most common genre among the free English apps, accounting for about 58% of them. Looking further, most of the apps are designed for entertainment (games, photo and video, social networking, sports, music) rather than for practical purposes (education, shopping, utilities, productivity, lifestyle). Because the Apple App Store data set has no information on installs, we use rating_count_tot as a proxy for popularity in the following steps.
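A frequency table like the one above can also be produced with pandas directly. The sketch below assumes nothing beyond `Series.value_counts`; the rows in `toy_genres` are made up, and only the column name `prime_genre` is taken from the data set.

```python
import pandas as pd

# Toy genre column with made-up rows; 'prime_genre' matches the data set's column name.
toy_genres = pd.Series(
    ["Games", "Games", "Games", "Music", "Education"], name="prime_genre"
)

counts = toy_genres.value_counts()                # apps per genre, most common first
shares = toy_genres.value_counts(normalize=True)  # same table as fractions of all apps

print(counts)
print(shares)
```

Passing `normalize=True` gives the share of each genre, which avoids a separate division by the number of rows.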

In [45]:
rating_dict = dict()

# Pair each app's genre with that app's own rating count
for prime_genre, rating in zip(free_English_Apps['prime_genre'], free_English_Apps['rating_count_tot']):
    if prime_genre in rating_dict:
        rating_dict[prime_genre] += rating
    else:
        rating_dict[prime_genre] = rating

sort_rating = sorted(rating_dict.items(), key=lambda x: x[1], reverse=True)
for i in sort_rating:
    print(i[0], i[1])

Each app's rating count is added to its own genre only. Even so, totals favour genres with many apps, and Games alone accounts for about 58% of the free English apps. Therefore, let's average rating_count_tot over the number of apps in each genre.
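When totals are accumulated per genre, each app's rating count has to be paired with that app's own genre. The sketch below uses made-up numbers (`toy_genres`, `toy_ratings` and the result dictionaries are illustrative names) to contrast pairing the two columns with `zip` against nesting one loop inside the other.

```python
# Toy rows with made-up numbers: genre and rating count of three apps.
toy_genres = ["Games", "Games", "Music"]
toy_ratings = [100, 300, 50]

# Paired: zip matches each app's genre with that same app's rating count.
paired_totals = {}
app_counts = {}
for genre, rating in zip(toy_genres, toy_ratings):
    paired_totals[genre] = paired_totals.get(genre, 0) + rating
    app_counts[genre] = app_counts.get(genre, 0) + 1
paired_averages = {g: paired_totals[g] / app_counts[g] for g in paired_totals}

# Nested, for contrast: every app's rating count is added to every genre,
# once per app in that genre, so the totals only track the app counts.
nested_totals = {}
for genre in toy_genres:
    for rating in toy_ratings:
        nested_totals[genre] = nested_totals.get(genre, 0) + rating
nested_averages = {g: nested_totals[g] / app_counts[g] for g in nested_totals}

print(paired_totals, paired_averages)
print(nested_totals, nested_averages)
```

In the nested version, dividing by the app counts gives the same value for every genre (the sum of all rating counts), so nearly identical per-genre averages are a sign that the ratings were not grouped by genre.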

In [46]:
# Average rating count per app in each genre, taken from each app's own rating_count_tot
value_dict = free_English_Apps.groupby('prime_genre')['rating_count_tot'].mean().to_dict()

sort_proportion = sorted(value_dict.items(), key=lambda x: x[1], reverse=True)
for i in sort_proportion:
    print(i[0], i[1])


## Conclusion

Among the free English apps, entertainment-oriented genres dominate by app count, with Games alone accounting for about 58%. Whether these genres also attract the most users per app depends on the average rating_count_tot within each genre, as computed in the cell above. In either case the market is full of entertainment apps, so if we build one, we need to develop something distinctive to differentiate ourselves from competitors.