# Profitable app profiles for the App Store and Google Play markets

----------------------------

My aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. 

At my ficticous company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. My goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

## Opening and exploring the data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Naturally I've sourced two relatively small (but substantial) datasets which I will open and explore below.

In [1]:
from csv import reader

opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

The function below will takes four parameters. The dataset, start and end integers for the row slice and a boolean to control whether we want to see the number of rows and columns.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


With the above data I would suggest that the important columns for us would include ...

**iOS** - 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'
**Android** - 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'

## Removing inaccurate and duplicate data

According to a discussion posted on kaggle for this dataset, I've found that users have reported a missing category value for row 10472 in the google play data.

In [5]:
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [6]:
del android[10472]

I see no other discussion from the forum around inaccurate data, however there does seem to be many duplicates. 

Lets check...

In [7]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:15])

Number of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


All I've done above is create a list that holds unique and duplicate apps data. I then iterated over the Android data set and assigned each app to either 'unique' or 'duplicate' depending on whether they were already in 'unique' or not. 

We can see that 1181 duplicate apps were returned. 1181 seems like an awful lot of duplicate apps here.

Lets now remove these duplicate entries...

In [8]:
reviews_max = {}

for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
    
print(len(reviews_max))

9658


In [9]:
android_clean = []
already_added = []

for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
print(len(android_clean))

9658


Now we have a new list with zero duplicate entries inside the ```android_clean``` list.

My next job will be to find out if any of these app entries contain non-english characters. The reason for this is of course, our company is only going to develop english speaking apps and naturally we really only care for the relevant, similar data. We can build a function that iterates over strings and detects whether characters belong to ASCII system or not, then remove those apps which do not conform. 

In [10]:
def ascii_character(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True
        
print(ascii_character('instagram'))
print(ascii_character('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False


The above function works as expected, however there are some apps that use emojis and other characters which can still be included in English titled apps. To combat this, I will amend the function to only detect the apps which have 3 or more non-ascii characters. This isn't perfect but should be quite effective.

In [11]:
def ascii_characters(string): #Have only added pluralisation to the func name
    char_count = 0
    for character in string:
        if ord(character) > 127:
            char_count += 1
            if char_count >= 3:
                return False
    return True
        
print(ascii_characters('instagram'))
print(ascii_characters('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(ascii_characters('Instachat 😜'))

True
False
True


This looks much better. Now I can use this function on both iOS and Android data sets to filter out the non-English apps and append them to a new list.

In [12]:
ios_new = []
android_new = []

for app in ios:
    name = app[1]
    if ascii_characters(name):
        ios_new.append(app)
        
for app in android_clean:
    name = app[1]
    if ascii_characters(name):
        android_new.append(app)
        
print(len(ios))
print(len(android))
print('\n')
print(len(ios_new))
print(len(android_new))

7197
10840


6155
9658


In [13]:
explore_data(ios, 813, 814)
print('\n')
explore_data(ios_new, 813, 814)

['445375097', '爱奇艺PPS -《欢乐颂2》电视剧热播', '224617472', 'USD', '0.0', '14844', '0', '4.0', '0.0', '6.3.3', '17+', 'Entertainment', '38', '5', '3', '1']




['951704333', 'Filterra – Photo Editor, Effects for Pictures', '99465216', 'USD', '0.0', '14744', '2178', '4.5', '4.5', '1.10', '4+', 'Photo & Video', '37', '5', '13', '1']




Using `explore_data` above, we can see an example of row 813 in the new iOS list which previously had a non-ascii app name. This has now been removed. 

Finally for the data cleaning process, we should now isolate the free apps from those that charge. After all, we also aim to develop free apps.

In [14]:
ios_new_free = []
android_new_free = []

for app in ios_new:
    price = app[4]
    if price == '0.0':
        ios_new_free.append(app)
        
for app in android_new:
    price = app[7]
    if price == '0':
        android_new_free.append(app)
        


print(len(ios_new))
print(len(android_new))
print('\n')
print(len(ios_new_free))
print(len(android_new_free))

6155
9658


3203
8904


We have been left with 3203 iOS apps and 8904 Android apps after data cleaning, this should be adequate for our analysis. We have now removed the inaccurate data, duplicate app entries and non-English apps.

## Most common apps by genre

Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

Build a minimal Android version of the app, and add it to Google Play.
If the app has a good response from users, we develop it further.
If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.
Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our data sets.

In [19]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [20]:
ios_final = ios_new_free
android_final = android_new_free

display_table(ios_final, -5)

Games : 58.25788323446769
Entertainment : 7.836403371838902
Photo & Video : 4.995316890415236
Education : 3.6840462066812365
Social Networking : 3.3093974399000934
Shopping : 2.5913206369029034
Utilities : 2.466437714642523
Sports : 2.1542304089915705
Music : 2.0605682172962845
Health & Fitness : 2.0293474867311896
Productivity : 1.7483609116453322
Lifestyle : 1.5610365282547611
News : 1.3424914142990947
Travel : 1.248829222603809
Finance : 1.0927255697783327
Weather : 0.8741804558226661
Food & Drink : 0.8117389946924758
Reference : 0.5307524196066188
Business : 0.5307524196066188
Book : 0.3746487667811427
Navigation : 0.18732438339057134
Medical : 0.18732438339057134
Catalogs : 0.1248829222603809


In [17]:
display_table(android_final, 1) #Category

FAMILY : 18.980233602875114
GAME : 9.703504043126685
TOOLS : 8.434411500449237
BUSINESS : 4.5822102425876015
LIFESTYLE : 3.930817610062893
PRODUCTIVITY : 3.8858939802336026
FINANCE : 3.6837376460017968
MEDICAL : 3.515274034141959
SPORTS : 3.380503144654088
PERSONALIZATION : 3.313117699910153
COMMUNICATION : 3.234501347708895
HEALTH_AND_FITNESS : 3.0660377358490565
PHOTOGRAPHY : 2.9424977538185084
NEWS_AND_MAGAZINES : 2.8301886792452833
SOCIAL : 2.6504941599281224
TRAVEL_AND_LOCAL : 2.324797843665768
SHOPPING : 2.2461814914645104
BOOKS_AND_REFERENCE : 2.178796046720575
DATING : 1.853099730458221
VIDEO_PLAYERS : 1.7969451931716083
MAPS_AND_NAVIGATION : 1.4150943396226416
FOOD_AND_DRINK : 1.2353998203054808
EDUCATION : 1.1680143755615455
ENTERTAINMENT : 0.9546271338724168
LIBRARIES_AND_DEMO : 0.9321653189577718
AUTO_AND_VEHICLES : 0.9209344115004492
HOUSE_AND_HOME : 0.8198562443845463
WEATHER : 0.7973944294699011
EVENTS : 0.7075471698113208
PARENTING : 0.651392632524708
ART_AND_DESIGN : 0

At first glance, the landscape looks much different on Play store. However, when I look at the 'family' section in the store, it does seem that most of these apps are in fact **games** for kids.

In [21]:
display_table(android_final, -4) #Genres

Tools : 8.423180592991914
Entertainment : 6.087151841868823
Education : 5.3908355795148255
Business : 4.5822102425876015
Lifestyle : 3.919586702605571
Productivity : 3.8858939802336026
Finance : 3.6837376460017968
Medical : 3.515274034141959
Sports : 3.447888589398023
Personalization : 3.313117699910153
Communication : 3.234501347708895
Action : 3.088499550763702
Health & Fitness : 3.0660377358490565
Photography : 2.9424977538185084
News & Magazines : 2.8301886792452833
Social : 2.6504941599281224
Travel & Local : 2.3135669362084457
Shopping : 2.2461814914645104
Books & Reference : 2.178796046720575
Simulation : 2.0664869721473496
Dating : 1.853099730458221
Arcade : 1.8418688230008984
Video Players & Editors : 1.7744833782569631
Casual : 1.7520215633423182
Maps & Navigation : 1.4150943396226416
Food & Drink : 1.2353998203054808
Puzzle : 1.1230907457322552
Racing : 0.9883198562443846
Role Playing : 0.9321653189577718
Libraries & Demo : 0.9321653189577718
Strategy : 0.9209344115004492
Au

When looking at the genres data within the play store, **tools** comes out on top. It's not completely clear what we are looking at in regards to the differences in genres and categories, but what we can tell is that genres data offers us a more granular look... ie it has more categories.

So at this point in the analysis, we can see the app store is dominated by games and apps designed for fun. Whilst the play store shows a more balanced landscape between fun and practical apps. 

## Most popular apps by genre on the app store

One of the best ways to find out what the most popular apps are is to calculate the avergae number of installs per app genre. In the google play set we can find this in the `installs` column, but the app store data has this missing. As a workaround, we can take the total number of user ratings as a proxy, we find this in `rating_count_tot`.

In [22]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Medical : 612.0
Business : 7491.117647058823
Finance : 32367.02857142857
Weather : 52279.892857142855
Book : 46384.916666666664
Lifestyle : 16815.48
News : 21248.023255813954
Food & Drink : 33333.92307692308
Productivity : 21028.410714285714
Photo & Video : 28441.54375
Travel : 28243.8
Navigation : 86090.33333333333
Catalogs : 4004.0
Utilities : 19156.493670886077
Education : 7003.983050847458
Music : 57326.530303030304
Sports : 23008.898550724636
Shopping : 27230.734939759037
Health & Fitness : 23298.015384615384
Reference : 79350.4705882353
Entertainment : 14195.358565737051
Games : 22886.36709539121
Social Networking : 71548.34905660378


On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [23]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The aim here is to find popular genres **but** it seems that some genres are heavily influenced by some large players like we see above. The same can be said for social networking apps where Facebook etc skew the results and for reference apps, the bible has the same effect.

In [24]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
Jishokun-Japanese English Dictionary & Translator : 0



However, 'reference' seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, or finance. The book genre seem to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:

Weather apps — people generally don't spend too much time in-app, and the chances of making profit from in-app adds are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

Food and drink — examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.

Finance apps — these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.