
# Profitable App Profiles for the App Store and Google Play Markets
Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.


In [1]:
from csv import reader

### The Google Play data set ###
# Read every row, then split the header row from the data rows.
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

### The App Store data set ###
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print()
       
       

In [3]:
print(ios_header)
print()
explore_data(ios,0,3)






['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']




In [4]:
print(android_header)
print()
explore_data(android,0,3, True)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13



We see that the Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

In [5]:

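# Map each column name to its index so columns can be looked up by name.
dic = {header: i for i, header in enumerate(android_header)}
# print(dic)

# (Price, Type) pairs for every app, kept for ad hoc inspection.
Combined = [(data[7], data[6]) for data in android]
# print(Combined)

# Keep the ninth value of row 10472 for reference, then drop that row.
NaN = android[10472][8]

del android[10472]

# Print the row just before the deleted one as a sanity check.
print(android[10471])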



['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


## This Google Play data has duplicate entries

In [6]:
# Collect every distinct app name.
unique_apps = set(app[0] for app in android)

# Every name in android is, by construction, a member of unique_apps, so this
# filter keeps all rows: the list holds every app name, repeated or not.
duplicate_apps = [app[0] for app in android if app[0] in unique_apps]

print('Number of duplicate apps:', len(duplicate_apps))
print()
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 10840

Examples of duplicate apps: ['Photo Editor & Candy Camera & Grid & ScrapBook', 'Coloring book moana', 'U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'Sketch - Draw & Paint', 'Pixel Draw - Number Art Coloring Book', 'Paper flowers instructions', 'Smoke Effect Photo Maker - Smoke Editor', 'Infinite Painter', 'Garden Coloring Book', 'Kids Paint Free - Drawing Fun', 'Text on Photo - Fonteee', 'Name Art Photo Editor - Focus n Filters', 'Tattoo Name On My Photo Editor', 'Mandala Coloring Book', '3D Color Pixel by Number - Sandbox Art Coloring']


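We don't want to count any app more than once when we analyze the data, so we need to remove the duplicate entries and keep only one entry per app. Note that the 10840 printed above is the total number of rows rather than the number of repeated entries, because the membership test in the cell above is true for every name. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.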
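One way to count only the repeated entries is to track the names already seen and record a name each time it shows up again. The sketch below uses a hypothetical helper, `find_duplicates`, and a few made-up rows in place of the real `android` list; the review counts are illustrative only.

```python
def find_duplicates(rows, name_index=0):
    """Return the names of rows whose name already appeared earlier in rows."""
    seen = set()
    duplicates = []
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)  # second or later occurrence
        else:
            seen.add(name)           # first occurrence
    return duplicates


# Made-up rows standing in for the android list (hypothetical values).
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '1000'],
    ['Box', 'BUSINESS', '4.2', '200'],
    ['Instagram', 'SOCIAL', '4.5', '1200'],
    ['Instagram', 'SOCIAL', '4.5', '900'],
]
duplicates = find_duplicates(sample_rows)
print('Number of duplicate apps:', len(duplicates))
```

Applied to the real data, the length of the returned list should equal the number of rows minus the number of unique app names.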

If we examine the rows of an app that appears more than once (Instagram, for example), the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows: rather than removing rows randomly, we'll keep the row with the highest number of reviews, because the higher the number of reviews, the more reliable the ratings.

To do that, we will:

1. Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app.
2. Use the dictionary to create a new data set, which will have only one entry per app (the entry with the highest number of reviews).

In [7]:
import operator

# Pass 1: record the highest review count seen for each app name.
reviews_max = {}
print(android_header)
for app in android:
    name = app[0]
    n_reviews = float(app[3])

    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

# Pass 2: keep the first row whose review count matches that maximum.
android_clean = []
already_added = set()  # names already copied into android_clean
for app in android:
    name = app[0]
    n_reviews = float(app[3])

    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(app)
        already_added.add(name)

print('Number of cleaned android data:',len(android_clean))
print('Max Reviews:',len(reviews_max))
        
        
    

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
Number of cleaned android data: 9659
Max Reviews: 9659


The cleaned Android data set and the `reviews_max` dictionary both contain 9659 entries, so exactly one row was kept per unique app name.
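The same result can also be reached in a single pass by storing the best row seen so far for each name, instead of storing only the review count. This is a minimal sketch under that assumption; `keep_most_reviewed` is a hypothetical helper, and the sample rows are made up. Unlike the two-pass cell above, the returned rows are ordered by the first appearance of each name.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows (hypothetical values) in the Google Play column order.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '1000'],
    ['Box', 'BUSINESS', '4.2', '200'],
    ['Instagram', 'SOCIAL', '4.5', '1200'],
    ['Instagram', 'SOCIAL', '4.5', '1200'],
]
cleaned = keep_most_reviewed(sample_rows)
print(len(cleaned))
```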

In [8]:
explore_data(android_clean,0,3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13



The rows above are a few examples from the cleaned Android data.

In [9]:
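def take_string(string):
    # Count the characters outside the ASCII range (code point above 127).
    count = 0
    for char in string:
        if ord(char) > 127:
            count += 1
    # Allow up to three such characters so that names containing an emoji
    # or a symbol like ™ still pass.
    return count <= 3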
print(take_string('Docs To Go™ Free Office Suite'),
      take_string('Instachat 😜'),take_string('™')) 

True True True


In [10]:
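# Split the Google Play rows by whether the app name (column 0) passes
# take_string. The loop runs over android, so the duplicate rows that were
# left out of android_clean are still included here.
android_english = []
non_english = []
for app in android:
    name = app[0]
    if take_string(name):
        android_english.append(app)
    else:
        non_english.append(name)

# Same split for the App Store rows. Column 0 here is 'id' (all digits), so
# every row passes; the app name itself is in column 1 ('track_name').
ios_english = []
for app in ios:
    name = app[0]
    if take_string(name):
        ios_english.append(app)
    else:
        non_english.append(name)

print(len(android_english), len(ios_english))
print(non_english)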
        


    
    

10795 7197
['Flame - درب عقلك يوميا', 'သိင်္ Astrology - Min Thein Kha BayDin', 'РИА Новости', 'صور حرف H', 'L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템', 'AJ렌터카 법인 카셰어링', 'Al Quran Free - القرآن (Islam)', '中国語 AQリスニング', '日本AV历史', 'Ay Yıldız Duvar Kağıtları', 'বাংলা টিভি প্রো BD Bangla TV', 'Cъновник BG', 'CSCS BG (в български)', '뽕티비 - 개인방송, 인터넷방송, BJ방송', 'BL 女性向け恋愛ゲーム◆俺プリクロス', 'SecondSecret ‐「恋を読む」BLノベルゲーム‐', 'BL 女性向け恋愛ゲーム◆ごくメン', 'あなカレ【BL】無料ゲーム', '감성학원 BL 첫사랑', 'BQ-መጽሐፍ ቅዱሳዊ ጥያቄዎች', 'BS Calendar / Patro / पात्रो', 'Vip视频免费看-BT磁力搜索', 'Билеты ПДД CD 2019 PRO', 'Offline Jízdní řády CG Transit', 'Bonjour 2017 Abidjan CI ❤❤❤❤❤', 'CK 初一 十五', 'الفاتحون Conquerors', 'DG ग्राम / Digital Gram Panchayat', 'DM הפקות', 'DW فارسی By dw-arab.com', 'لعبة تقدر تربح DZ', 'বাংলাflix', 'RPG ブレイジング ソウルズ アクセレイト', '英漢字典 EC Dictionary', 'ECナビ×シュフー', 'أحداث وحقائق | خبر عاجل في اخبار العالم', 'EG SIM CARD (EGSIMCARD, 이지심카드)', 'パーリーゲイツ公式通販｜EJ STYLE（イージェイスタイル）', 'FAH
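Because the app name sits in a different column in each data set ('App' at index 0 for Google Play, 'track_name' at index 1 for the App Store), the filter can take the name column as a parameter. The sketch below is self-contained and uses hypothetical helpers (`is_english`, `filter_english`) and made-up rows; it illustrates the idea and is not the cell that produced the output above.

```python
def is_english(name, max_non_ascii=3):
    """Hypothetical helper: True if name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii


def filter_english(rows, name_index):
    """Return only the rows whose name column passes is_english."""
    return [row for row in rows if is_english(row[name_index])]


# Made-up rows (hypothetical); only the columns needed here are included.
android_sample = [
    ['Instachat 😜', 'SOCIAL'],
    ['РИА Новости', 'NEWS_AND_MAGAZINES'],
]
ios_sample = [
    ['284882215', 'Facebook'],
    ['000000000', 'РИА Новости'],  # made-up id
]

android_english_sample = filter_english(android_sample, name_index=0)  # 'App'
ios_english_sample = filter_english(ios_sample, name_index=1)          # 'track_name'
print(len(android_english_sample), len(ios_english_sample))
```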

In [11]:
# Free Google Play apps: 'Price' is column 7 and free apps are stored as '0'.
free_android = []
for app in android:
    price = app[7]
    if price == '0':
        free_android.append(app)

# Free App Store apps: 'price' is column 4 and free apps are stored as '0.0'.
free_ios = []
for app in ios:
    price = app[4]
    if price == '0.0':
        free_ios.append(app)
print(len(free_ios),len(free_android))

4056 10040


Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.

2. If the app has a good response from users, we develop it further.

3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

In [12]:
print(android_header)
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Of these columns, the genre columns ('prime_genre' for the App Store, 'Category' and 'Genres' for Google Play) look the most useful: we can build frequency tables from them to find the most common genres in each market.

In [13]:
def freq_table(dataset,index):
    dic = {}
    for row in dataset:
        name = row[index]
        if name not in dic:
            dic[name] = 0
        if name in dic:
            dic[name] +=1 
    return dic

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

print('Displaying Apple store Prime Genres')
print()
# display_table prints the table itself and returns None, so it is called
# directly rather than wrapped in print().
display_table(ios, 11)

Displaying Apple store Prime Genres

Games : 3862
Entertainment : 535
Education : 453
Photo & Video : 349
Utilities : 248
Health & Fitness : 180
Productivity : 178
Social Networking : 167
Lifestyle : 144
Music : 138
Shopping : 122
Sports : 114
Book : 112
Finance : 104
Travel : 81
News : 75
Weather : 72
Reference : 64
Food & Drink : 63
Business : 57
Navigation : 46
Medical : 23
Catalogs : 10


## Comments on Prime_genre Column

The most common genre is Games, and the runner-up is Entertainment. It would seem that App Store apps are mostly built for recreational purposes rather than practical ones such as education, shopping, utilities, productivity, or lifestyle.

Given how heavily a few recreational genres dominate the App Store catalog, it is plausible that recreational apps are the better way to tap into the App Store market, with the caveat that these counts measure how many apps exist in each genre, not how many users they have. A "prime" genre to go into would be Games.

In [14]:
print('Displaying Play Store Categories')
print()
# display_table prints the table itself and returns None, so it is called
# directly rather than wrapped in print().
display_table(android, 1)

Displaying Play Store Categories

FAMILY : 1972
GAME : 1144
TOOLS : 843
MEDICAL : 463
BUSINESS : 460
PRODUCTIVITY : 424
PERSONALIZATION : 392
COMMUNICATION : 387
SPORTS : 384
LIFESTYLE : 382
FINANCE : 366
HEALTH_AND_FITNESS : 341
PHOTOGRAPHY : 335
SOCIAL : 295
NEWS_AND_MAGAZINES : 283
SHOPPING : 260
TRAVEL_AND_LOCAL : 258
DATING : 234
BOOKS_AND_REFERENCE : 231
VIDEO_PLAYERS : 175
EDUCATION : 156
ENTERTAINMENT : 149
MAPS_AND_NAVIGATION : 137
FOOD_AND_DRINK : 127
HOUSE_AND_HOME : 88
LIBRARIES_AND_DEMO : 85
AUTO_AND_VEHICLES : 85
WEATHER : 82
ART_AND_DESIGN : 65
EVENTS : 64
PARENTING : 60
COMICS : 60
BEAUTY : 53


## Comments on Play Store categories

The most common category on the Play Store is Family. Unlike the App Store, the Play Store is not dominated by recreational apps; its catalog is more holistic, with each category holding a comparatively modest share of the market. No single category outperforms the others by a wide margin.

Even so, Family and Game are the two largest categories and remain the better ones to tap into.
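Raw counts make it hard to compare shares across two stores of different sizes, so a percentage version of the frequency table can help. For instance, FAMILY's 1972 rows are roughly 18% of the 10840 rows counted above. The sketch below assumes a hypothetical helper, `freq_table_percent`, and made-up rows.

```python
def freq_table_percent(dataset, index):
    """Return {value: percentage of rows} for the column at index."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up rows (hypothetical): app name and category only.
sample_rows = [
    ['App A', 'FAMILY'],
    ['App B', 'GAME'],
    ['App C', 'FAMILY'],
    ['App D', 'TOOLS'],
]
shares = freq_table_percent(sample_rows, 1)

# Print the categories from the largest share to the smallest.
for category, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(category, ':', round(share, 2))
```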

In [15]:
print('Displaying Play Store Genres')
print()
display_table(android, 9)  # prints the table itself; returns None

Displaying Play Store Genres

Tools : 842
Entertainment : 623
Education : 549
Medical : 463
Business : 460
Productivity : 424
Sports : 398
Personalization : 392
Communication : 387
Lifestyle : 381
Finance : 366
Action : 365
Health & Fitness : 341
Photography : 335
Social : 295
News & Magazines : 283
Shopping : 260
Travel & Local : 257
Dating : 234
Books & Reference : 231
Arcade : 220
Simulation : 200
Casual : 193
Video Players & Editors : 173
Puzzle : 140
Maps & Navigation : 137
Food & Drink : 127
Role Playing : 109
Strategy : 107
Racing : 98
House & Home : 88
Libraries & Demo : 85
Auto & Vehicles : 85
Weather : 82
Adventure : 75
Events : 64
Comics : 59
Art & Design : 58
Beauty : 53
Education;Education : 50
Card : 48
Parenting : 46
Board : 44
Educational;Education : 41
Casino : 39
Trivia : 38
Educational : 37
Casual;Pretend Play : 31
Word : 29
Entertainment;Music & Video : 27
Education;Pretend Play : 23
Music : 22
Casual;Action & Adventure : 21
Racing;Action & Adventure : 20
Puzzle;Bra

## Comments on Play Store Genres

The most common genres on the Play Store are "Tools", "Entertainment", "Education", and "Medical". Genres on the Play Store are much more utilitarian than those on the App Store. The distribution of genres is also more holistic and encompassing, with no single genre taking a dominant share of the market.

In my opinion, because the apps available on the Play Store are so function-oriented, the Play Store market will be more lasting and will offer better branding opportunities across a wide demographic.


In [16]:
print(ios_header)

# Overall mean user rating across all App Store rows.
total = 0
len_genre = 0
for app in ios:
    total += float(app[7])
    len_genre += 1
print()
print(total, len_genre)
average = total/len_genre
print()
print('App Genre average:',average)
print()

# Count, per genre, the apps whose rating lies within 0.2 of that mean.
dic = {}
for app in ios:
    rating = float(app[7])
    if average - 0.2 <= rating <= average + 0.2:
        genre = app[11]
        dic[genre] = dic.get(genre, 0) + 1

print('Most frequent Genre encompassing the App average rating:',max(dic.items(), key=operator.itemgetter(1))[0])

            
            
   

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

25383.5 7197

App Genre average: 3.526955675976101

Most frequent Genre encompassing the App average rating: Games


## App profile recommendation for the Apple Store

Having gone through all the apps, computed their overall average rating, and found the genre that most often sits near that average, the app profile recommendation for the App Store is a gaming app.
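A complementary check is to average the rating separately within each genre, which shows how genres compare with one another rather than with the overall mean. This is a minimal sketch, assuming a hypothetical helper `average_by_genre` and made-up rows; on the real data the genre and rating columns would be indexes 11 and 7.

```python
def average_by_genre(rows, genre_index, value_index):
    """Return {genre: mean of the numeric column at value_index}."""
    totals = {}
    counts = {}
    for row in rows:
        genre = row[genre_index]
        totals[genre] = totals.get(genre, 0.0) + float(row[value_index])
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


# Made-up rows (hypothetical): genre and user rating only.
sample_rows = [
    ['Games', '4.5'],
    ['Games', '3.5'],
    ['Social Networking', '3.5'],
]
genre_averages = average_by_genre(sample_rows, genre_index=0, value_index=1)
for genre, mean_rating in genre_averages.items():
    print(genre, ':', mean_rating)
```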



In [25]:
import operator

print()
for category in freq_table(android,9):
    # total and len_category are reset for every genre, so after the loop
    # they describe only the last genre visited.
    total = 0
    len_category = 0
    for app in android:
        category_app = app[9]
        if category_app == category:
            # '10,000+' becomes 10000.0
            Installs = float(app[5].replace(',','').replace('+',''))
            total += Installs
            len_category += 1
print(total,len_category)
print()
Average_Installs = total/len_category
print('Average of app genre:',Average_Installs)
print()
# Round the average down to a whole number of millions.
rounded_average = Average_Installs // 1000000 * 1000000

# Count, per genre, the apps with at most rounded_average installs.
dic = {}
for app in android:
    Installs = int(app[5].replace(',','').replace('+',''))
    if Installs <= rounded_average:
        genre = app[9]
        dic[genre] = dic.get(genre, 0) + 1
# Tools apps whose installs are at or above the average computed above.
print(list(app[0] for app in android if app[9] == 'Tools'
         and int(app[5].replace(',','').replace('+','')) >= Average_Installs
        ))
print()
print('Most frequent category around average installs:',max(dic.items(), key=operator.itemgetter(1))[0])


269172550.0 29

Average of app genre: 9281812.068965517

['Moto File Manager', 'Google', 'Google Translate', 'Moto Display', 'Motorola Alert', 'Motorola Assist', 'Cache Cleaner-DU Speed Booster (booster & cleaner)', 'Moto Voice', 'Calculator', 'Device Help', 'Account Manager', 'myMetro', 'File Manager', 'My Telcel', 'Calculator - free calculator, multi calculator app', 'ASUS Sound Recorder', 'Samsung Max - Data Savings & Privacy Protection', 'ZenUI Help', 'SHAREit - Transfer & Share', 'ZenUI Keyboard – Emoji, Theme', 'Files Go by Google: Free up space on your phone', 'File Manager -- Take Command of Your Files Easily', 'Samsung Calculator', 'Clear', 'Phone', 'HTC Lock Screen', 'Gboard - the Google Keyboard', 'Google Korean Input', 'AT&T Smart Wi-Fi', 'Google app for Android TV', 'Sound Recorder: Recorder & Voice Changer Free', 'Remote Link (PC Remote)', 'HTC Sense Input', 'Share Music & Transfer Files - Xender', 'App vault', 'My love', 'DuraSpeed', 'Digital Alarm Clock', 'Alarm Clock 
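The totals printed above (269172550.0 over 29 apps) cover a single genre, because the accumulators in the cell are reset on each pass of the outer loop. A per-genre average can be sketched by grouping the parsed install counts under each genre first. The helpers `parse_installs` and `average_installs_by_genre` are hypothetical names, and the rows are made up.

```python
def parse_installs(text):
    """Turn a Play Store installs string such as '10,000+' into an int."""
    return int(text.replace(',', '').replace('+', ''))


def average_installs_by_genre(rows, genre_index=9, installs_index=5):
    """Return (genre, mean installs) pairs, highest mean first."""
    grouped = {}
    for row in rows:
        installs = parse_installs(row[installs_index])
        grouped.setdefault(row[genre_index], []).append(installs)
    averages = {genre: sum(values) / len(values) for genre, values in grouped.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up rows (hypothetical): name, genre, installs.
sample_rows = [
    ['App A', 'Tools', '1,000,000+'],
    ['App B', 'Medical', '10,000+'],
    ['App C', 'Tools', '500,000+'],
]
ranking = average_installs_by_genre(sample_rows, genre_index=1, installs_index=2)
for genre, mean_installs in ranking:
    print(genre, ':', mean_installs)
```

The default indexes match the 'Genres' and 'Installs' columns of the Google Play header, so the real rows could be passed without overriding them.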

## App profile recommendation for Play Store

An app profile recommendation for the Google Play Store would be an app in the Tools category. Currently, Google Play Store profiles that are performing above average include, for example, "Moto File Manager" and "Account Manager". In all likelihood, an app profile in the genre of "Tools" similar to the provided examples of profiles having above 