# **DataQuest Guided Project: Profitable App Profiles in the App Market**




## **Objectives**

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets.



For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

This is the source of <a href='https://www.kaggle.com/lava18/google-play-store-apps'>Google Play Stroe</a> data set.

This is the source of <a href='https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps'>Apple App Store</a> data set.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

## **Importing the data set**

In [1]:
import csv

with open('AppleStore.csv') as applecsv:
    appledata=list(csv.reader(applecsv))
    
    apple_header=appledata[0]
    appledata=appledata[1:]

with open('googleplaystore.csv') as googlecsv:
    googledata=list(csv.reader(googlecsv))
    
    google_header=googledata[0]
    googledata=googledata[1:]


This purpose of this function is to make it eaiser to read the datasets.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False, print_rows=False):
    dataset_slice = dataset[start:end]
    if print_rows:
        for row in dataset_slice:
            print(row)
            print('\n') # adds a new (empty) line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
print(google_header)
print('\n')
explore_data(appledata,0,-1,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows: 7197
Number of columns: 16


<font size="4">**Data Cleaning**</font>

According to the discussion section from the google play data, there is an error on line 10472. Let's test it.

In [3]:
print(googledata[10472])
print('\n')
print(google_header)
print('\n')
print(googledata[0])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', '11-Feb-18', '1.0.19', '4.0 and up', '']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


The app on line 10472 is: **Life Made WI-Fi Touchscreen Photo Frame**, and we can see that the rating is 19. This is clearly off because the maximum rating for a Google Play app is 5 (as mentioned in the discussions section, this problem is caused by a missing value in the 'Category' column). As a consequence, we'll delete this row.

In [4]:
print(len(googledata))
del googledata[10472]
print(len(googledata))

10841
10840


<font size="4">**Removing Duplicate Entries**</font>



<font size="3">**Part One**</font>



As it turns out, some apps have duplicate entries. For example: Instagram

In [5]:
for app in googledata:
    name=app[0]
    if name=='Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', '31-Jul-18', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', '31-Jul-18', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', '31-Jul-18', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', '31-Jul-18', 'Varies with device', 'Varies with device']


There are a total of 1181 cases in which apps appeared more than once 

In [6]:
duplicate_apps=[]
unique_apps=[]

for app in googledata:
    name=app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps:',len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:',duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Since we don't want duplicate apps for our analysis, we need to find a criteria to delete unwanted apps.

From the Instagram example, we can see that the main difference happnes on the fourth column on each row, which cooresponds to the number of reviews. The higher the number of reviews an app has probably indicates the more recent the data was collected. We won't remove rows randomly, but rather we'll keep the rows that have the highest number of reviews because the higher the number of reviews, the more reliable the ratings.

In order to do this, we will:
- Create a dictionary where each key is a unqiue app name, and the value is the highest numbers of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app(and we only select the apps with teh highest number of reviews)


<font size="3">**Part Two**</font>

We will start by building the dictionary. 

In [7]:
reviews_max={}
for app in googledata:
    name=app[0]
    n_reviews=float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name]=n_reviews
        
    elif name not in reviews_max:
        reviews_max[name]=n_reviews
        


In the previous code cell, we found that there are 1,181 cases where an app occurs mroe than once, so the length of our dictionary shoud be equal to the difference between the length of our data set and 1,181.

In [8]:
print('Expected length:',len(googledata)-1181)
print('Actual length:',len(reviews_max))

Expected length: 9659
Actual length: 9659


We only want to keep the app with the highest number of  reviews. We will do that by:

* Create two empty list:
    1. google_clean
    2. already_added

Loop through the data set:
* If n_reviews is the same as the number of maximum reviews of the app name(number can be found in the reviews_max) **and** name is not already in the list already_added. *(This is important because some apps might have the same number of reviews, so we have to add in this condition to prevent adding duplicates of the same app.)*
    1. Append the entire row to the google_clean list(this store our cleaned data set)
    2. Append the name of the app name to the already_added list, this helps us to keep track of apps that we already added.
        

In [9]:
google_clean=[]
already_added=[]

for app in googledata:
    name=app[0]
    n_reviews=float(app[3])
    
    if n_reviews==reviews_max[name] and name not in already_added:
        google_clean.append(app)
        already_added.append(name)
        
for i in google_clean:
    if 'Instagram' in i:
        print(i)


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', '31-Jul-18', 'Varies with device', 'Varies with device']


The Instagram did return the highest number of reviews as we have seen from earlier code.
Now let's see if the data set have 9,659 rows.

In [10]:
print(len(google_clean))
print(len(already_added))

9659
9659


<font size="4">**Removing Non English Apps**</font>

Our next step in cleaning the data set is finding out apps not in English, since we are building an app for English speaking audience. Some examples as below


In [11]:
print(appledata[813][1])
print(appledata[6731][1])
print('\n')
print(google_clean[4412][0])
print(google_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


One way to remove non English apps is to remove each app with a name containing a symbol not commonly used in the English text.

In the ASCII system, commonly used characters in an English text are in the range of 0-127. We can built a function that detects whether a character belongs to the set of common English characters or not.

**If the number is equal to or less than 127, then the character belongs to the set of common English characters.**

Our app names are stored as strings. So we need to use indexing to select an individual character, and iterate on the string using a for loop.

First we will build a function that takes in a string and returns False if there's any character in the string that doesn't belong to the set of common English characters, otherwise it returns True.

In [12]:
def return_string(data):
    for i in data:
        if ord(i) > 127:
            return False
       
    return True

Let's test the function with the following English and non English name apps.

In [13]:
print(return_string('Instagram'))
print(return_string('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(return_string('Docs To Go™ Free Office Suite'))
print(return_string('Instachat 😜'))
print(ord('™'))
print(ord('😜'))

True
False
False
False
8482
128540


Now we realized that our function couldn't recognize emojis and characters like ™ since they fall outsie the ASCII range of 127. To make sure we can still include some apps with characters like emoji and special characters, we will only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

*Updated function for detecting non English apps*

In [14]:
def return_string(data):
    x=0

    for i in data:
        if ord(i) > 127:
            x+=1
            if x>3:
                return False
       
    return True
print(return_string('Instagram'))
print(return_string('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(return_string('Docs To Go™ Free Office Suite'))
print(return_string('Instachat 😜'))

True
False
True
True


Let's test out the new function with our data sets.

In [15]:
number_nonEnglish=[]
number_english=[]
for app in already_added:
    if return_string(app)==False:
        print(app)
        number_nonEnglish.append(app)
    else:
        number_english.append(app)


Flame - درب عقلك يوميا
သိင်္ Astrology - Min Thein Kha BayDin
РИА Новости
صور حرف H
L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]
RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템
AJ렌터카 법인 카셰어링
Al Quran Free - القرآن (Islam)
中国語 AQリスニング
日本AV历史
Ay Yıldız Duvar Kağıtları
বাংলা টিভি প্রো BD Bangla TV
Cъновник BG
CSCS BG (в български)
뽕티비 - 개인방송, 인터넷방송, BJ방송
BL 女性向け恋愛ゲーム◆俺プリクロス
SecondSecret ‐「恋を読む」BLノベルゲーム‐
BL 女性向け恋愛ゲーム◆ごくメン
あなカレ【BL】無料ゲーム
감성학원 BL 첫사랑
BQ-መጽሐፍ ቅዱሳዊ ጥያቄዎች
BS Calendar / Patro / पात्रो
Vip视频免费看-BT磁力搜索
Билеты ПДД CD 2019 PRO
Offline Jízdní řády CG Transit
Bonjour 2017 Abidjan CI ❤❤❤❤❤
CK 初一 十五
الفاتحون Conquerors
DG ग्राम / Digital Gram Panchayat
DM הפקות
DW فارسی By dw-arab.com
لعبة تقدر تربح DZ
বাংলাflix
RPG ブレイジング ソウルズ アクセレイト
英漢字典 EC Dictionary
ECナビ×シュフー
أحداث وحقائق | خبر عاجل في اخبار العالم
EG SIM CARD (EGSIMCARD, 이지심카드)
パーリーゲイツ公式通販｜EJ STYLE（イージェイスタイル）
FAHREDDİN er-RÂZİ TEFSİRİ
I'm Rich/Eu sou Rico/أنا غني/我很有錢
AÖF Ev İdaresi 1. Sınıf
Ey Sey Storytime រឿងនិទានតាឥសី
哈哈姆特不EY
FP Разбиты

Let's do the same thing for applestore apps.
First we need to create a new list containing the name of the apps since our function only works with strings, we will only append the names of the apps.

In [16]:
apple_appname=[]
apple_nonEnglish=[]
apple_english=[]
for app in appledata:
    apple_appname.append(app[1])
    
for app in apple_appname:
    if return_string(app)==False:
        print(app)
        apple_nonEnglish.append(app)
    else:
        apple_english.append(app)
        


爱奇艺PPS -《欢乐颂2》电视剧热播
聚力视频HD-人民的名义,跨界歌王全网热播
优酷视频
网易新闻 - 精选好内容，算出你的兴趣
淘宝 - 随时随地，想淘就淘
搜狐视频HD-欢乐颂2 全网首播
阴阳师-全区互通现世集结
百度贴吧-全球最大兴趣交友社区
百度网盘
爱奇艺HD -《欢乐颂2》电视剧热播
乐视视频HD-白鹿原,欢乐颂,奔跑吧全网热播
万年历-值得信赖的日历黄历查询工具
新浪新闻-阅读最新时事热门头条资讯视频
喜马拉雅FM（听书社区）电台有声小说相声英语
央视影音-海量央视内容高清直播
腾讯视频HD-楚乔传,明日之子6月全网首播
手机百度 - 百度一下你就得到
百度视频HD-高清电视剧、电影在线观看神器
MOMO陌陌-开启视频社交,用直播分享生活
QQ 浏览器-搜新闻、选小说漫画、看视频
同花顺-炒股、股票
聚力视频-蓝光电视剧电影在线热播
快看漫画
乐视视频-白鹿原,欢乐颂,奔跑吧全网热播
酷我音乐HD-无损在线播放
随手记（专业版）-好用的记账理财工具
Dictionary ( قاموس عربي / انجليزي + ودجيت الترجمة)
滴滴出行
高德地图（精准专业的手机地图）
百度HD-极速安全浏览器
美丽说-潮流穿搭快人一步
百度地图-智能的手机导航，公交地铁出行必备
Majiang Mahjong（单机+川麻+二人+武汉+国标）
土豆视频HD—高清影视综艺视频播放器
360手机卫士-超安全的来电防骚扰助手
QQ浏览器HD-极速搜索浏览器
搜狗输入法-Sogou Keyboard
百度网盘 HD
大众点评-发现品质生活
讯飞输入法-智能语音输入和表情斗图神器
美柚 - 女生助手
爱奇艺 - 电视剧电影综艺娱乐视频播放器
搜狐视频-欢乐颂2 全网首播
百度地图HD
QQ同步助手-新机一键换机必备工具
QQ音乐-来这里“发现・音乐”
腾讯新闻-头条新闻热点资讯掌上阅读软件
土豆（短视频分享平台）
风行视频+ HD - 电影电视剧体育视频播放器
仙劍奇俠傳5 - 劍傲丹楓
YY- 小全民手机直播交友软件
腾讯视频-欢乐颂2全网首播
中华万年历-2亿用户首选的日历软件
央视影音HD-海量央视内容高清直播
蘑菇街-网红直播搭配的购物特卖平台
Keep - 移动健身教练 自由运动场
美团 - 吃喝玩乐全都有
百度贴吧HD
腾讯手机管家-拦截骚

激ムズ！お前、脱出できんの？～反射神経アクションゲーム～
意味が分かると怖い話【意味怖】-この怖い話の意味が分かるか…
おかず甲子園
絶対に笑える話　腹筋崩壊の笑える話
時計仕掛けの彼女 /明日なんて来なきゃいいのにね
マンガモンスター 　無料マンガ/無料本/人気マンガ
妹型杀器
武神战纪OL-热血三国志国战手游
激ムズ！にゃんコプター
謎解きあの人からメール
东方财富网领先版-财经资讯&股票开户
暗黑屠魔者2（唉哟-还不错哦）
エジコイ！〜エジプト神と恋しよっ〜
剑雨奇缘-万人同屏PK,唯美风仙侠
PINGPONG（ピンポン）- 君の反射神経Lvはいくつ？
椅子ドンVR~一ノ宮英介 編~
マンガきゅんと‐漫画が全話読み放題の少女まんがアプリ
ダイエットやメイク、ネイルのコスメ情報 - シェリル
国金宝理财-15%年化收益的金融投资赚钱软件
女性向けまとめ読みアプリ - pool（プール）-
イケメン革命◆アリスと恋の魔法 恋愛ゲーム
ライブチャット、ビデオチャット通話が楽しめる！ ライブでゴーゴー
百盈足球-专业足篮比分赛事预测
ねこめし屋 -マンガも読めるネコゲーム料理店経営の無料育成シュミレーション-
Please,Dad. I wanna live.　おとうさん、おねがい。わたし生きたいよ。
ひとほろぼし
ST channel ［エスティーチャンネル］- 雑誌『セブンティーン』公式アプリ
ごちうさアラーム～シャロ編～
战歌联萌-新英雄登场 惊喜嘉年华
テニスクラブ物語
不良西游-神魔悟空大战篇
ScaryStory in Japan 怖い話し無料
-The 穴通し3D- 君の記憶力x反射神経を問う! ～Mr.CURVEからの挑戦状 ～
【殺人現場へようこそ】推理サスペンス劇場/謎解き大人の脳トレゲーム
鳥として生きた男　その壮絶な人生
キクタン 【Advanced】 6000 ～聞いて覚える英単語～(アルク)
全民夺宝(官方)
大盛グルメ食堂
MineSweeper　マインスイーパ無料
秒速で1億円 貢ぐ男　～美女キャラ集結！from ギャングロード JOKER～
学年ビリのギャルが今さら受験してみた
吕布帮帮忙-三国策略卡牌，指间微操轻松畅玩
キクタンTOEIC(R) Test Score 600 ～聞いて覚える英単語～(アルク)
通信量チェッカー
脱出ゲ

Let's print out the number of Non English and English apps from Apple and Google data sets

In [17]:
print('This is the number of non English app in Google Play data set:',len(number_nonEnglish))
print('This is the number of English app in Google Play data set:',len(number_english))
print('\n')
print('This is the number of non English app in Apple data set:',len(apple_nonEnglish))
print('This is the number of English app in Apple data set:',len(apple_english))


This is the number of non English app in Google Play data set: 45
This is the number of English app in Google Play data set: 9614


This is the number of non English app in Apple data set: 1014
This is the number of English app in Apple data set: 6183


Now we are going to create a new list for both apple and google data set that only includes the English apps as new sets of cleaned data

In [18]:
eng_google=[]
eng_apple=[]

for app in google_clean:
    name=app[0]
    if name in number_english:
        eng_google.append(app)
for app in appledata:
    name=app[1]
    if name in apple_english:
        eng_apple.append(app)
        
print('This is the number of English app in Apple data set:',len(eng_google))
print('This is the number of English app in Apple data set:',len(eng_apple))

This is the number of English app in Apple data set: 9614
This is the number of English app in Apple data set: 6183


Current progress in our data cleaning:
* Removed inaccurate data
* Removed duplicate app entries
* Removed non English apps

The example we are using in this guided project is that our company only builds free apps, we will need to isolate only the free apps for our analysis.

<font size="4">**Removing Paid Apps**</font>


Let's test to see if every row is consistent with it's input value

In [19]:
for i in google_clean:
    if i[6]!='Free' and i[6]!='Paid':
        print(i)
    

['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', '28-Jun-18', 'Varies with device', 'Varies with device']


Since there is one row in google data that has a vlaue of 'NaN', but the total price is '0', we will use index of 7, with is the price tag of an app instead of the binary value of Free or Paid as our basis for determining whether an app is free or not.

In [20]:
google_final_clean=[]
apple_final_clean=[]
for app in eng_google:
    if app[7]=='0':
        google_final_clean.append(app)

for app in eng_apple:
    if app[4]=='0.0':
        apple_final_clean.append(app)

print('This is the cleaned data with only free apps from google play store:',len(google_final_clean))
print('This is the cleaned data with only free apps from app store:',len(apple_final_clean))

This is the cleaned data with only free apps from google play store: 8864
This is the cleaned data with only free apps from app store: 3222


So far, we spent a good amount of time on cleaning data, and:

- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps
- Isolated the free apps

<font size='4'>**Main Steps in Validation Strategy**</font>

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.






<font size="4">**Most Common Apps by Genre**</font>


In order to accomplish our goals, let's begin by finding out the most common genres for each market. We do that by building a frequency tables for a few columns in our data sets.

For google play data set, we will use Genres and Category columns to build frequency tables.
For apple play data set, we will use prime_genre column to build frequency tables.

We will build 2 functions for analysis:
* Function to show percentages
* display the percentages in a descending order


We are going to create a function to find frequency table of a genres.
For that we will need to build a dictionary that can store the genre as key and the number(frequency) of that genre as value


In [21]:
def freq_table(dataset,index):
    table={}
    total=0
    frequency={}
    for app in dataset:
        total+=1
        value=app[index]
        if value in table:
            table[value]+=1
        else:
            table[value]=1
    table_percentage={}
    
    for key in table:
        percentage=(table[key]/total)*100
        table_percentage[key]=percentage
        
    return table_percentage



Since dictionary doesn't have order, we will build a second function which can help us display the entries in the frequency table in a descending order.


In [22]:

def display_table(dataset,index):
    sort_orders=sorted(freq_table(dataset,index).items(),key=lambda x: x[1],reverse=True)
    for i in sort_orders:
        print(i[0],"{:.2f}".format(i[1])),
    
   
  

print('Google play store most popular genres from column Category in decending order:\n',)
print(display_table(google_final_clean,1))


Google play store most popular genres from column Category in decending order:

FAMILY 18.91
GAME 9.72
TOOLS 8.46
BUSINESS 4.59
LIFESTYLE 3.90
PRODUCTIVITY 3.89
FINANCE 3.70
MEDICAL 3.53
SPORTS 3.40
PERSONALIZATION 3.32
COMMUNICATION 3.24
HEALTH_AND_FITNESS 3.08
PHOTOGRAPHY 2.94
NEWS_AND_MAGAZINES 2.80
SOCIAL 2.66
TRAVEL_AND_LOCAL 2.34
SHOPPING 2.25
BOOKS_AND_REFERENCE 2.14
DATING 1.86
VIDEO_PLAYERS 1.79
MAPS_AND_NAVIGATION 1.40
FOOD_AND_DRINK 1.24
EDUCATION 1.16
ENTERTAINMENT 0.96
LIBRARIES_AND_DEMO 0.94
AUTO_AND_VEHICLES 0.93
HOUSE_AND_HOME 0.82
WEATHER 0.80
EVENTS 0.71
PARENTING 0.65
ART_AND_DESIGN 0.64
COMICS 0.62
BEAUTY 0.60
None


In [23]:

print('Google play store most popular genres from column Genres in decending order:\n',)
print(display_table(google_final_clean,9))



Google play store most popular genres from column Genres in decending order:

Tools 8.45
Entertainment 6.07
Education 5.35
Business 4.59
Lifestyle 3.89
Productivity 3.89
Finance 3.70
Medical 3.53
Sports 3.46
Personalization 3.32
Communication 3.24
Action 3.10
Health & Fitness 3.08
Photography 2.94
News & Magazines 2.80
Social 2.66
Travel & Local 2.32
Shopping 2.25
Books & Reference 2.14
Simulation 2.04
Dating 1.86
Arcade 1.85
Video Players & Editors 1.77
Casual 1.76
Maps & Navigation 1.40
Food & Drink 1.24
Puzzle 1.13
Racing 0.99
Libraries & Demo 0.94
Role Playing 0.94
Auto & Vehicles 0.93
Strategy 0.91
House & Home 0.82
Weather 0.80
Events 0.71
Adventure 0.68
Comics 0.61
Art & Design 0.60
Beauty 0.60
Parenting 0.50
Card 0.45
Casino 0.43
Trivia 0.42
Educational;Education 0.39
Board 0.38
Educational 0.37
Education;Education 0.34
Word 0.26
Casual;Pretend Play 0.24
Music 0.20
Entertainment;Music & Video 0.17
Puzzle;Brain Games 0.17
Racing;Action & Adventure 0.17
Casual;Brain Games 0.14
Ca

In [24]:
print('Apple app store most popular genres from column prime_genre in decending order:\n',)
print(display_table(apple_final_clean,11))

Apple app store most popular genres from column prime_genre in decending order:

Games 58.16
Entertainment 7.88
Photo & Video 4.97
Education 3.66
Social Networking 3.29
Shopping 2.61
Utilities 2.51
Sports 2.14
Music 2.05
Health & Fitness 2.02
Productivity 1.74
Lifestyle 1.58
News 1.33
Travel 1.24
Finance 1.12
Weather 0.87
Food & Drink 0.81
Reference 0.56
Business 0.53
Book 0.43
Navigation 0.19
Medical 0.19
Catalogs 0.12
None


Google play store seems to have more apps dedicated to more practical purposes, as opposed to app store 58.16% of apps are Game genre. 

Although genres column have much more specific genres than category column, for the purpose of this project, we will only be looking at the category column since we are not woking with finer analysis at this point.

<font size='4'>**Most Popular Apps by Genre**</font>

To find out what genres are the most popular(have the most users, we can find the total number of Installs in an app, but this information is missing for app store, so we will look for total number of user rating as a proxy, the column is rating_count_tot.

<font size='3'>**App store apps: Using *rating count total* as a metrics to substitude Total Downloads of Apps**</font>

We need to begin with calculating the average number of user ratings per app genre on the app store:
* Isolate the apps of each genre.
* Sum up the user ratings for the apps of that genre.
* Divide the sum by the number of apps belonging to that genre (not by the total number of apps).
* In the end, I also sorted the list in descending order for ease of readability

In [25]:
prime_genre=freq_table(apple_final_clean,11)
tot_count=[]
avg_count_tot=[]
print('This is the average count total of user rating per genre.\n')

for genre in prime_genre:
    total=0
    len_genre=0
    for app in apple_final_clean:
        genre_app=app[11]
        if genre_app == genre:
            tot_count=float(app[5])
            total=total+tot_count
            len_genre+=1
    num_user_rating=total/len_genre
    kv_tuple=num_user_rating,genre
    avg_count_tot.append(kv_tuple)
    
rating_sorted=sorted(avg_count_tot,reverse=True)

for entry in rating_sorted:
    print(entry[1],"{:.2f}".format(entry[0]))


This is the average count total of user rating per genre.

Navigation 86090.33
Reference 74942.11
Social Networking 71548.35
Music 57326.53
Weather 52279.89
Book 39758.50
Food & Drink 33333.92
Finance 31467.94
Photo & Video 28441.54
Travel 28243.80
Shopping 26919.69
Health & Fitness 23298.02
Sports 23008.90
Games 22788.67
News 21248.02
Productivity 21028.41
Utilities 18684.46
Lifestyle 16485.76
Entertainment 14029.83
Business 7491.12
Education 7003.98
Catalogs 4004.00
Medical 612.00


Let's find out the average score of user rating per genre 

In [26]:
prime_genre=freq_table(apple_final_clean,11)
avg_rating=[]
avg_rating_tot=[]
print('This is the average score of user rating per genre.\n')

for genre in prime_genre:
    total=0
    len_genre=0
    for app in apple_final_clean:
        genre_app=app[11]
        if genre_app == genre:
            avg_rating=float(app[7])
            total=total+avg_rating
            len_genre+=1
    avg_tot=total/len_genre
    rating_tuple=avg_tot,genre
    avg_rating_tot.append(rating_tuple)
    
rating_sorted=sorted(avg_rating_tot,reverse=True)

for entry in rating_sorted:
    print(entry[1],"{:.2f}".format(entry[0]))

This is the average score of user rating per genre.

Catalogs 4.12
Games 4.04
Productivity 4.00
Business 3.97
Shopping 3.97
Music 3.95
Photo & Video 3.90
Navigation 3.83
Health & Fitness 3.77
Reference 3.67
Education 3.64
Food & Drink 3.63
Social Networking 3.59
Entertainment 3.54
Utilities 3.53
Travel 3.49
Weather 3.48
Lifestyle 3.41
Finance 3.38
News 3.24
Book 3.07
Sports 3.07
Medical 3.00


We are going to do the same thing for ratings on version updates, to compare the potential differences with original users rating.

In [27]:
prime_genre=freq_table(apple_final_clean,11)
ver_rating=[]
avg_ver_rating_tot=[]
print('This is the updated version average score of user rating per genre.\n')


for genre in prime_genre:
    total=0
    len_genre=0
    for app in apple_final_clean:
        genre_app=app[11]
        if genre_app == genre:
            ver_rating=float(app[8])
            total=total+ver_rating
            len_genre+=1
    avg_num_user=total/len_genre
    version_tuple=avg_num_user,genre
    avg_ver_rating_tot.append(version_tuple)
    
rating_ver_sorted=sorted(avg_ver_rating_tot,reverse=True)

for entry in rating_ver_sorted:
    print(entry[1],"{:.2f}".format(entry[0]))

This is the updated version average score of user rating per genre.

Catalogs 4.00
Productivity 3.94
Music 3.93
Games 3.91
Reference 3.86
Health & Fitness 3.62
Shopping 3.49
Photo & Video 3.38
Entertainment 3.32
Medical 3.25
Food & Drink 3.25
Book 3.14
Utilities 3.12
Education 3.11
Business 3.06
Weather 3.02
Social Networking 2.99
Lifestyle 2.92
Finance 2.85
Travel 2.74
Sports 2.68
News 2.66
Navigation 2.25


<font size="3">**Google Play Store Average Downloads**</font>

Since we have data about the number of installs for google play market, we should be able to get more direct idea about genre popularity.

One issue we run into is that install numbers are not very precise, most values are open ended:ex(100+, 1,000+, 5,000+, etc.) However, since we only want to find out which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

For the purpose of this project, we are going to consider the number at face values, for 100,000+, we will treat it like 100,000 total installs and so on.

The numbers in install column are in strings, so we will need to convert it to float, we will need to remove the commas and the plus characters before we can do so.

In [28]:
category_freq=freq_table(google_final_clean,1)
avg_category_install=[]
print('This is the average installation of apps per category.\n\n')
for category in category_freq:
    total=0
    len_genre=0
    for app in google_final_clean:
        install=app[5]
        genre=app[1]
        if category==genre:
            install=install.replace('+','')
            install=install.replace(',','')
            len_genre+=1
            total=total+float(install)
    avg=total/len_genre
    install_tuple=avg,category
    avg_category_install.append(install_tuple)
sorted_avg_install=sorted(avg_category_install,reverse=True)

for cat in sorted_avg_install:
    print(cat[1],"{:.2f}".format(cat[0]))

This is the average installation of apps per category.


COMMUNICATION 38456119.17
VIDEO_PLAYERS 24727872.45
SOCIAL 23253652.13
PHOTOGRAPHY 17840110.40
PRODUCTIVITY 16787331.34
GAME 15588015.60
TRAVEL_AND_LOCAL 13984077.71
ENTERTAINMENT 11640705.88
TOOLS 10801391.30
NEWS_AND_MAGAZINES 9549178.47
BOOKS_AND_REFERENCE 8767811.89
SHOPPING 7036877.31
PERSONALIZATION 5201482.61
WEATHER 5074486.20
HEALTH_AND_FITNESS 4188821.99
MAPS_AND_NAVIGATION 4056941.77
FAMILY 3695641.82
SPORTS 3638640.14
ART_AND_DESIGN 1986335.09
FOOD_AND_DRINK 1924897.74
EDUCATION 1833495.15
BUSINESS 1712290.15
LIFESTYLE 1437816.27
FINANCE 1387692.48
HOUSE_AND_HOME 1331540.56
DATING 854028.83
COMICS 817657.27
AUTO_AND_VEHICLES 647317.82
LIBRARIES_AND_DEMO 638503.73
PARENTING 542603.62
BEAUTY 513151.89
EVENTS 253542.22
MEDICAL 120550.62


Communication have the most installs, to find out how much of them are coming from a few apps that might be considered outliers (Facebook messenger, WhatsApp,Gmail,etc), we need to set some conditions.

In [29]:
for app in google_final_clean:
    if app[1]=='COMMUNICATION' and (app[5]=='1,000,000,000+'
                                   or app[5]=='500,000,000+'
                                   or app[5]=='100,000,000+'):
        print(app[0],':',app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

Let's see waht happens when we remove these apps

In [30]:
under_100m=[]

for app in google_final_clean:
    install=app[5]
    install=install.replace(',','')
    install=install.replace('+','')
    if (app[1]=='COMMUNICATION') and (float(install)<100000000):
        under_100m.append(float(install))

sum(under_100m)/len(under_100m)


3603485.3884615386

The average has roughly 1/10 of it's orignal size.


Similar patterns probably exist for video players category, photography apps, social apps, and producitivty apps.

These genres will be hard to compete with all those companies dominating the market, while givng off the impression that the genre is more popular than they really are.

Let's take a look at some of the genres around the top ten rankings that also have the least amount of apps within that genre. The logic behind this approach is to find a genre that is fairly popular yet have less competition than some of the genres higher up on the list and avoid genres that have outlier apps like we discovered earlier.

In [31]:
len_personal=0
len_book=0
len_tool=0
len_fitness=0
len_education=0
len_sports=0

for app in google_final_clean:
    if app[1]=='PERSONALIZATION':
        #print(app[0],':',app[5])
        len_personal+=1
    if app[1] == 'BOOKS_AND_REFERENCE':
        len_book+=1
    if app[1] == 'SPORTS':
        len_sports+=1
         
        
    if app[1] == 'TOOLS':
        len_tool+=1
    if app[1] == 'HEALTH_AND_FITNESS':
        len_fitness+=1
    if app[1] == 'EDUCATION':
        len_education+=1
        
print(len_personal)
print(len_book)
print(len_tool)
print(len_fitness)
print(len_education)
print(len_sports)

294
190
750
273
103
301


As we can see, books and education genres have the least amount of apps compare to other genres.

Let's first find out if there are extremely popular apps that may skew the average

In [32]:
for app in google_final_clean:
    if app[1]=='BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        books=app[0],app[5]
        print('Books apps:',books,)
        
    elif app[1]=='EDUCATION' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        education=app[0],app[5]
        print('Education apps:',education)
        

Books apps: ('Google Play Books', '1,000,000,000+')
Books apps: ('Bible', '100,000,000+')
Books apps: ('Amazon Kindle', '100,000,000+')
Books apps: ('Wattpad 📖 Free Books', '100,000,000+')
Books apps: ('Audiobooks from Audible', '100,000,000+')


The education genre has no apps that exceed 100,000,000+ downloads, with only a few with the book genre. 

Let's see if there are some apps in the middle in terms of popularity(between 1,000,000 to 100,000,000 downloads)

In [33]:
for app in google_final_clean:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        books=app[0], ':', app[5]
        print('Books apps:',books,)
        
    elif app[1] == 'EDUCATION' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        education=app[0], ':', app[5]
        print('Education apps:',education)

Books apps: ('Wikipedia', ':', '10,000,000+')
Books apps: ('Cool Reader', ':', '10,000,000+')
Books apps: ('Book store', ':', '1,000,000+')
Books apps: ('FBReader: Favorite Book Reader', ':', '10,000,000+')
Books apps: ('Free Books - Spirit Fanfiction and Stories', ':', '1,000,000+')
Books apps: ('AlReader -any text book reader', ':', '5,000,000+')
Books apps: ('FamilySearch Tree', ':', '1,000,000+')
Books apps: ('Cloud of Books', ':', '1,000,000+')
Books apps: ('ReadEra – free ebook reader', ':', '1,000,000+')
Books apps: ('Ebook Reader', ':', '5,000,000+')
Books apps: ('Read books online', ':', '5,000,000+')
Books apps: ('eBoox: book reader fb2 epub zip', ':', '1,000,000+')
Books apps: ('All Maths Formulas', ':', '1,000,000+')
Books apps: ('Ancestry', ':', '5,000,000+')
Books apps: ('HTC Help', ':', '10,000,000+')
Books apps: ('Moon+ Reader', ':', '10,000,000+')
Books apps: ('English-Myanmar Dictionary', ':', '1,000,000+')
Books apps: ('Golden Dictionary (EN-AR)', ':', '1,000,000+')


As we can see, there are quite a few book apps that appeared in the range between 1,000,000 to 100,000,000 downloads, and far fewer apps for educatons. Let's change the numbers from string to float, so we can rank them in descending orders to find out what's more popular in each individual genres, and make an educated suggesstion based on the data.

Frist we will start with the book genre. 


In [34]:
popular_book_install=[]
print('This is a list of book genre apps ranked in descending order from the most popular to least popular for downloads between 1,000,000 to 100,000,000. .\n\n')

for app in google_final_clean:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        books=app[5]
        books=books.replace('+','')
        books=books.replace(',','')
        book= float(books), app[0]
        popular_book_install.append(book)
sorted_book=sorted(popular_book_install,reverse=True)

for cat in sorted_book:
    print(cat[1],':',"{:.2f}".format(cat[0]))


This is a list of book genre apps ranked in descending order from the most popular to least popular for downloads between 1,000,000 to 100,000,000. .


Wikipedia : 10000000.00
Spanish English Translator : 10000000.00
Quran for Android : 10000000.00
Oxford Dictionary of English : Free : 10000000.00
NOOK: Read eBooks & Magazines : 10000000.00
Moon+ Reader : 10000000.00
JW Library : 10000000.00
HTC Help : 10000000.00
FBReader: Favorite Book Reader : 10000000.00
English Hindi Dictionary : 10000000.00
English Dictionary - Offline : 10000000.00
Dictionary.com: Find Definitions for English Words : 10000000.00
Dictionary - Merriam-Webster : 10000000.00
Dictionary : 10000000.00
Cool Reader : 10000000.00
Aldiko Book Reader : 10000000.00
Al-Quran (Free) : 10000000.00
Al'Quran Bahasa Indonesia : 10000000.00
Al Quran Indonesia : 10000000.00
Read books online : 5000000.00
English to Hindi Dictionary : 5000000.00
Ebook Reader : 5000000.00
Dictionary - WordWeb : 5000000.00
Bible KJV : 5000000.00
Ances

Then we do the same for education genre

In [35]:
popular_edu_install=[]
print('This is a list of education genre apps ranked in descending order from the most popular to least popular for downloads between 1,000,000 to 100,000,000. .\n\n')

for app in google_final_clean:
    if app[1] == 'EDUCATION' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        edu=app[5]
        edu=books.replace('+','')
        edu=books.replace(',','')
        edu_list= float(edu), app[0]
        popular_edu_install.append(edu_list)
sorted_edu=sorted(popular_edu_install,reverse=True)

for cat in sorted_edu:
    print(cat[1],':',"{:.2f}".format(cat[0]))


This is a list of education genre apps ranked in descending order from the most popular to least popular for downloads between 1,000,000 to 100,000,000. .


myHomework Student Planner : 5000000.00
edX - Online Courses by Harvard, MIT & more : 5000000.00
Udemy - Online Courses : 5000000.00
Udacity - Lifelong Learning : 5000000.00
Timetable : 5000000.00
Thai Handwriting : 5000000.00
THAI DICT 2018 : 5000000.00
SoloLearn: Learn to Code for Free : 5000000.00
Socratic - Math Answers & Homework Help : 5000000.00
Rosetta Stone: Learn to Speak & Read New Languages : 5000000.00
Remind: School Communication : 5000000.00
Quizlet: Learn Languages & Vocab with Flashcards : 5000000.00
Programming Hub, Learn to code : 5000000.00
PINKFONG Baby Shark : 5000000.00
NeuroNation - Focus and Brain Training : 5000000.00
My Study Life - School Planner : 5000000.00
My Class Schedule: Timetable : 5000000.00
Mermaids : 5000000.00
Memorado - Brain Games : 5000000.00
Math Tricks : 5000000.00
Lynda - Online Trainin

The book market seems to be dominated by software for reading ebooks, and collections of libraries and dictionaries. So if we want to build book apps, these might be the ones we want to stay clear from or find specific dictionaries that haven't been done yet, like a Chinese English dictionary, if there's a market for it.

For education genre, programming apps are quite popular, as with schedulers and planners. These might be the apps we want to avoid making.

One thing we need to be aware of is that the top app for the education genre has 5,000,000 downloads, which puts it into the middle of the list in the book genre. Another potential red flag for the education genre is that in the app store, though the education genre originally received higher average reviews from users, the reviews took a dip after versions update, and the book genre's average reviews are now higher than that of education genre.




<font size='4'>**Conclusions**</font>


From this perspective, I agree with the conclusion reached by DataQuest that the book app seems like a promising genre to delve into, especially if in the future one wants to tap into the app store market. In addition, since the market is occupied with libraries and dictionaries, our app might require special features to stand out from the crowd.

One trend we have noticed is that the top ranking genres are communication and social media-related apps. Even the top apps in the education genre are online resources that provide a platform that allows people to connect and share ideas through online forums. From this observation, perhaps an online publishing site for people to write their own novels to share with other users, and a dedicated discussion forum for people to discuess the novels by other users or talk freely about books of their choice would be a good app to develop.

Although this might be beyond the scope of this analysis, since Apple app store have more than 50% of apps made for games, we may even introduce mini games based on the stories written by the users, to encourage more user engagement, and also cater to the app store market trend.