# Profitable App Profiles: App Store and Google Play Markets

This project condenses application data to find the most suitable types of mobile apps to develop for both the App Store and Google Play. The aim is to provide data-driven information on which types of apps are most likely to succeed. The financial model is based on free-to-use apps with in-app ads.

## Where is the data?

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over four million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first see whether any relevant existing data is available at no cost. Luckily, there are two data sets that seem suitable for our purpose:

- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).
  

In [7]:
from csv import reader

# Google Play data set
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))  # file is closed when the block ends
android_header = android[0]
android = android[1:]

# Apple App Store data set
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

Let's look at the first few rows of each data set using the explore_data function.

In [14]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(android_header)
explore_data(android, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The Android data set has 10,841 rows and 13 columns. Some useful columns might be: App, Category, Rating, Reviews, Installs, Type, Price, Genres.


Do the same for iOS:


In [13]:
print(ios_header)
explore_data(ios, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


The iOS data set has 7,197 rows and 16 columns. Useful columns might be: 'track_name', 'price', 'rating_count_tot', 'user_rating', 'prime_genre'.
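
The code in this notebook addresses columns by numeric position (for example, index 11 for 'prime_genre'). As a minimal sketch, the positions could instead be looked up by name from the header rows printed above. The helper name `column_index` is hypothetical and not part of the original analysis.

```python
# Header rows copied from the output above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


def column_index(header, column_name):
    """Return the position of column_name in header (hypothetical helper)."""
    return header.index(column_name)


print(column_index(android_header, 'Installs'))      # 5
print(column_index(ios_header, 'prime_genre'))       # 11
print(column_index(ios_header, 'rating_count_tot'))  # 5
```

Looking indices up this way makes it harder to read the wrong column by accident, at the cost of one extra call per column.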

## Prune Data

The Google Play data set has a dedicated discussion section, and we can see that one of the discussions outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [16]:
print(android_header, '\n')
print(android[0], '\n')
print(android[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Row 10472 is missing its 'Category' value, which shifts every later value one column to the left. For example, the 'Installs' column for this row holds 'Free' and the 'Rating' column holds '19'. Let's delete this row and print the same index to verify that the following rows have shifted up.

In [17]:
del android[10472]  # Run only once: re-running deletes the row now at this index
print(android[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
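
The shifted row above is one column shorter than the header, so rows like it could also be found by comparing lengths rather than relying on a known index. The following is a sketch under that assumption; `find_malformed_rows` is a hypothetical helper, and the two sample rows are copied from the output above.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


sample_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']
sample_rows = [
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1',
     '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design',
     'January 7, 2018', '1.0.0', '4.0.3 and up'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
]

malformed = find_malformed_rows(sample_rows, sample_header)
print([index for index, _ in malformed])  # [1]

# Filtering instead of using `del` is safe to re-run.
cleaned = [row for row in sample_rows if len(row) == len(sample_header)]
print(len(cleaned))  # 1
```

A length check only catches rows with a missing or extra field. It would not catch a row whose values are wrong but whose column count is correct.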


So far there doesn't seem to be any bad data from the App Store.


### Duplicate Data

Further scouring of the Play Store discussions hints at some duplicate entries. Let's check.

In [20]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate cases: ', len(duplicate_apps))
print('Sample of duplicate apps: ', duplicate_apps[0:10])

Number of duplicate cases:  1181
Sample of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


It is best to check the App Store data as well:

In [40]:
duplicate_apps = []
unique_apps = []

for app in ios:
    app_id = app[0]  # column 0 is 'id' (track_name is column 1)
    if app_id in unique_apps:
        duplicate_apps.append(app_id)
    else:
        unique_apps.append(app_id)
        
print('Number of duplicate cases: ', len(duplicate_apps))
print('Sample of duplicate apps: ', duplicate_apps[0:10])

Number of duplicate cases:  0
Sample of duplicate apps:  []
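
The duplicate checks above test membership in a list, which gets slower as the list grows. As a sketch of an alternative, `collections.Counter` can count every value in one pass. The function name `count_duplicates` and the small sample rows are made up for illustration.

```python
from collections import Counter


def count_duplicates(dataset, key_index):
    """Count duplicate cases and list the values that occur more than once."""
    counts = Counter(row[key_index] for row in dataset)
    # A value seen n times contributes n - 1 duplicate cases,
    # matching how the loops above count them.
    n_duplicates = sum(count - 1 for count in counts.values())
    repeated = [value for value, count in counts.items() if count > 1]
    return n_duplicates, repeated


# Illustrative rows: [name, reviews]
sample = [['Slack', '51507'], ['Box', '159'], ['Slack', '51507'],
          ['Slack', '51510'], ['Box', '160'], ['Zenefits', '296']]

n_duplicates, repeated = count_duplicates(sample, 0)
print(n_duplicates)  # 3
print(repeated)      # ['Slack', 'Box']
```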


Let's see whether the duplicate entries differ from each other.

#### Android

Looking at the entries for 'Slack' below, there is really only one difference: the number of reviews (the fourth column) is slightly larger for the bottom entry. Although the difference is slight, it makes a good criterion for removing duplicates: the more reviews an app has, the better its rating represents the audience.

In [23]:
print(android_header)
for app in android:
    name = app[0]
    if name == 'Slack':
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


### Removing Duplicate Data

This process is based on the number of reviews for each duplicate: we keep the entry with the highest number of reviews, since it best represents the largest audience. As a simple sanity check, we can work out how many rows should be left by subtracting the number of duplicates from the total number of entries. Removing the duplicates consists of two main tasks. First, we create a dictionary in which each key is a unique app name and the corresponding value is the highest number of reviews for that app. Then we use that dictionary to build the new, de-duplicated data set.

In [41]:
print('Expected Android Length: ', len(android) - 1181)

Expected Android Length:  9659


#### Android

In [38]:
reviews_max = {}

for row in android:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews   #Updates the existing dictionary value only if it is larger than the existing one
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
#print(len(reviews_max))

In [39]:
android_clean = []
already_added = []

for row in android:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added: 
        android_clean.append(row) #appends entire row to clean data if it has the most reviews and isn't a duplicate
        already_added.append(name)
        
print('Actual Android Length: ', len(android_clean))

Actual Android Length:  9659
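
The two-pass approach above could also be written as a single pass that stores the best row per app name directly. This is only a sketch: `keep_most_reviewed` is a hypothetical name, and the sample rows are shortened to four columns for readability.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row on ties, as the two-pass version does.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Illustrative rows: [App, Category, Rating, Reviews]
sample = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]

deduplicated = keep_most_reviewed(sample)
print(len(deduplicated))  # 2
print(deduplicated[0])    # ['Slack', 'BUSINESS', '4.4', '51510']
```

One difference to be aware of: the result is ordered by each name's first appearance, whereas the two-pass version orders rows by where the most-reviewed entry sits in the original list.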


### Removing Unwanted Data

The apps this company would like to focus on are oriented towards English-speaking audiences, so we can remove apps whose names contain non-English characters. **Pt.1)** This can be done by checking the Unicode code point of each character in the app name against the range used for English text. Under the ASCII standard, the characters commonly used in English have code points from 0 to 127, so a character with a code point above 127 is unlikely to be English. **Pt.2)** In an age of emojis and trademark symbols, it is not reasonable to exclude every app whose name has a few non-English characters. Instead, it seems reasonable to exclude only app names with more than three non-ASCII characters.

**Pt.1**

In [44]:
def common_eng(string):
    for char in string:
        if ord(char) > 127:
            return False
    return True


print(common_eng('Instagram'))
print(common_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(common_eng('Docs To Go™ Free Office Suite'))
print(common_eng('Instachat 😜'))


print(ord('™'))
print(ord('😜'))

True
False
False
False
8482
128540


**Pt.2**

In [None]:
def common_eng(string):
    non_ascii = 0
    
    for char in string:
        if ord(char) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:
        return True
        
print(common_eng('Instagram'))
print(common_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(common_eng('Docs To Go™ Free Office Suite'))
print(common_eng('Instachat 😜'))
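
The cut-off of three non-ASCII characters is a judgment call, so it may help to make it a parameter. The sketch below assumes the same rule as Pt.2. The name `is_probably_english` and the parameter `max_non_ascii` are hypothetical.

```python
def is_probably_english(text, max_non_ascii=3):
    """Return True if text has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for char in text if ord(char) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Instagram'))                      # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
print(is_probably_english('Docs To Go™ Free Office Suite'))  # True
print(is_probably_english('Instachat 😜'))                    # True

# With a threshold of 0 the function behaves like the stricter Pt.1 version.
print(is_probably_english('Docs To Go™ Free Office Suite', max_non_ascii=0))  # False
```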


Applying this function to both data sets:

In [49]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if common_eng(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if common_eng(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

### Isolate Free Apps

The apps this company wants to focus on are free apps that generate revenue through in-app ads. We can isolate these apps by checking each app's price and appending the app to a new, final list only if it is free.

In [52]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print('Total free, English apps on Android: ', len(android_final))
print('Total free, English apps on iOS:     ', len(ios_final))

Total free, English apps on Android:  8864
Total free, English apps on iOS:      3222
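
The filter above compares price strings exactly ('0' for Google Play, '0.0' for the App Store), so it depends on how each data set happens to write a free price. A more tolerant sketch converts the price to a number first. The helper `is_free` is hypothetical, and the '$4.99' format for paid apps is an assumption about the data rather than something shown above.

```python
def is_free(price_text):
    """Return True if the price string represents zero.

    Assumes prices look like '0', '0.0' or '$4.99' (an assumption about
    the data sets, not something verified here).
    """
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0


print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```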


## Deciding the Best App

The data is now clean enough to start looking for trends in which types of apps are the most successful. Because the end goal is to release the app on both the App Store and Google Play, we need to find app profiles that are successful in both markets. Let's begin the analysis by getting a sense of the most common genres in each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and for the Genres and Category columns of the Google Play data set.

In [53]:
def freq_table(dataset, index):
    '''A function for creating a frequency table'''
    ft = {}
    total = 0
    
    for row in dataset:
        value = row[index]
        total += 1
        if value in ft:
            ft[value] += 1
        else:
            ft[value] = 1
            
    ft_percent = {}
    
    for key in ft:
        ft_percent[key] = (ft[key] / total) * 100
        
    return ft_percent


def display_table(dataset, index):
    """Display frequency table sorted by percentage high to low"""
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
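
For comparison, the same percentage table could be built with `collections.Counter` from the standard library. This is a sketch of an alternative, not a replacement for the functions above. The name `freq_table_counter` and the sample rows are made up for illustration.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}


# Illustrative rows: [name, genre]
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]

for genre, percentage in freq_table_counter(sample, 1).items():
    print(genre, ':', percentage)
# Games : 75.0
# Education : 25.0
```

Because `most_common()` already orders values from most to least frequent, no separate sorting step is needed for display.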

Table for iOS column 'prime_genre'

In [54]:
display_table(ios_final, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Games are the most common type of app on the App Store by a large margin (about 58%), followed by Entertainment (about 8%). Only one of the top five genres, Education, is aimed at practical use rather than personal enjoyment. It would be premature to recommend a type of app from this data alone, because a large number of available apps does not necessarily mean a large number of users.

Table for Android column 'Category'

In [55]:
display_table(android_final, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Similarly, on the Play Store the most common categories, Family and Game, relate more to entertainment and fun than to practicality. It is still too early to recommend a type of app, but it is interesting to note that the disparity between categories is not nearly as large as on the App Store.

Table for Android Column 'Genres'

In [56]:
display_table(android_final, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The genres on the Play Store look quite different from the previous two tables. The most common genres are mostly practical ones, such as Tools, Education, Business, and Productivity. Entertainment is still large (about 6%), but it is not the most prominent.

Overall, these trends suggest that the most common types of apps are for gaming and entertainment, followed by apps for practicality and utility. These types of apps won't necessarily be the most successful, because the number of apps in a genre does not tell us how many people actually use them. We should look further into which apps have the most users.

### Most Popular Apps on the App Store by Genre

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but the App Store data set has no such column. As a workaround, we'll use the total number of user ratings as a proxy, which we can find in the rating_count_tot column. As a caution, this could skew the data, as some apps offer incentives for leaving reviews.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [57]:
genres_ios = freq_table(ios_final, 11)

for genre in genres_ios:
    total = 0
    len_genre = 0
    
    for app in ios_final:
        genre_app = app[11]
        
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    
    avg_n_ratings = total / len_genre
    
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
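
The loop above rescans the whole data set once per genre. As a sketch of an alternative, the totals and counts can be accumulated in a single pass with `collections.defaultdict`. The function name `average_by_group` and the sample rows are illustrative only.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Return {group: average of the numeric column} in one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Illustrative rows: [genre, rating_count]
sample = [['Games', '100'], ['Games', '300'], ['Weather', '50']]

print(average_by_group(sample, 0, 1))  # {'Games': 200.0, 'Weather': 50.0}
```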


In general, the highest average numbers of ratings come from Navigation, Reference, and Social Networking. Perhaps an app that combines all three could find a lot of success: for example, an app that shows users where other people are, how to get there, and what fun things there are to do in the area.

### Most Popular Apps on the Play Store by Genre

The Google Play data set has more direct data on how many times an app was installed, although the values are open-ended ranges (such as 100,000+) rather than exact counts, as shown below:

In [58]:
display_table(android_final, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


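To compute averages, we remove the commas and plus signs and convert each string to a float, so that, for example, '100,000+' is treated as 100,000 installs: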

In [59]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    
    for app in android_final:
        category_app = app[1]
        
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
            
    avg_n_installs = total / len_category
    
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

Overall, the Communication category on the Play Store has the highest average number of installs, at about 38.5 million per app. Other categories with high averages include Video Players (about 24.7 million), Social (about 23.3 million), and Game (about 15.6 million). An app that combines video, social, and gaming features could be successful.
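
The string clean-up applied to the Installs column above can be isolated in a small helper, which makes the conversion easy to check on its own. This is a minimal sketch, and `parse_installs` is a hypothetical name.

```python
def parse_installs(text):
    """Convert an Installs value such as '1,000,000+' to a float.

    The result is a lower bound: '1,000,000+' means at least one million.
    """
    return float(text.replace(',', '').replace('+', ''))


print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('500+'))        # 500.0
print(parse_installs('0'))           # 0.0
```

Because every value is a lower bound, the category averages built from it understate the true number of installs.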


## Conclusion

After analyzing the most popular types of apps across Android and iOS, the app most likely to succeed would be something that mixes social networking and gaming, perhaps with ways for users to find each other in real life. Because these findings are based on raw averages, further inspection is needed to see whether anything at a finer level within these categories makes them more successful than others. Perhaps a few very large apps are skewing the numbers away from other possibilities.
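
One simple way to check whether a few very large apps are skewing a category is to compare its mean with its median. The sketch below uses made-up install counts, not values from either data set, and `mean_and_median` is a hypothetical helper.

```python
from statistics import mean, median


def mean_and_median(values):
    """Return (mean, median); a large gap between them suggests skew."""
    return mean(values), median(values)


# Made-up install counts: four small apps and one very large one.
installs = [1000.0, 5000.0, 10000.0, 50000.0, 1000000000.0]

average, middle = mean_and_median(installs)
print(average)  # 200013200.0
print(middle)   # 10000.0
```

Here the single large app pulls the mean far above the median, which is the kind of gap that would be worth checking for within each category before settling on an app profile.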