# Profitable App Profiles for the App Store and Google Play Markets

## This project is about apps on Google Play and the App Store. Our aim is to find mobile app profiles that are profitable in both markets.

In [1]:
from csv import reader

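# encoding='utf8' avoids a UnicodeDecodeError on systems whose default encoding is not UTF-8
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
ios_header = ios[0]
ios = ios[1:]

opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
android_header = android[0]
android = android[1:]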

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

explore_data(ios, 0, 5, True)
print('\n')
explore_data(android, 0, 5, True)
print('\n')
print(ios_header)
print('\n')
print(android_header)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967',

In [2]:
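# Remove the row that contains inaccurate data. Run this cell only once:
# each re-run would delete whatever row sits at index 10472 at that point.
del android[10472]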


Exploring the Google Play data, we find that some apps have more than one entry. For instance, the Google app appears more than once:

In [3]:
for app in android:
    name = app[0]
    if name == 'Google':
        print(app)   

['Google', 'TOOLS', '4.4', '8033493', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Google', 'TOOLS', '4.4', '8021623', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 3, 2018', 'Varies with device', 'Varies with device']


We can count the total number of duplicate entries in the data:

In [4]:
duplicate_apps=[]
unique_apps=[]

for app in android:
    name=app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of Duplicate Values:-',len(duplicate_apps))
print('\nExamples of duplicate apps:-\n')
print(duplicate_apps[:15])

Number of Duplicate Values:- 1181

Examples of duplicate apps:-

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In total, there are 1181 cases where an app occurs more than once.

We do not want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If you examine the rows we printed two cells above for the Google app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. Rather than removing rows randomly, we will keep the row with the highest number of reviews, because the higher the number of reviews, the more reliable the rating.

To do that, we will:

* Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
* Use the dictionary to create a new data set, which will have only one entry per app (the entry with the highest number of reviews)

In [5]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
  
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

    elif name not in reviews_max: 
        reviews_max[name] = n_reviews
    
# print(reviews_max)
print(len(reviews_max))

9659


We found that there are 1181 cases where an app occurs more than once, so the length of the dictionary of unique apps should equal the length of our data set minus 1181.
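The length check described above can be sketched on a handful of toy rows. The rows and review counts below are made up for illustration, and the `toy_` names are hypothetical; only the column positions (name at index 0, reviews at index 3) follow the Google Play data.

```python
# Toy rows in the same layout as the Google Play data: name at index 0, reviews at index 3.
# All values here are invented for illustration.
toy_android = [
    ['Box', 'BUSINESS', '4.2', '1000'],
    ['Box', 'BUSINESS', '4.2', '1000'],
    ['Slack', 'BUSINESS', '4.4', '200'],
    ['Slack', 'BUSINESS', '4.4', '250'],
    ['Zenefits', 'BUSINESS', '4.1', '50'],
]

toy_reviews_max = {}
toy_duplicates = 0

for app in toy_android:
    name = app[0]
    n_reviews = float(app[3])
    if name in toy_reviews_max:
        # A repeated name counts as one duplicate; keep the larger review count.
        toy_duplicates += 1
        toy_reviews_max[name] = max(toy_reviews_max[name], n_reviews)
    else:
        toy_reviews_max[name] = n_reviews

expected_length = len(toy_android) - toy_duplicates
print('Expected length:', expected_length)
print('Actual length:', len(toy_reviews_max))
```

If the two printed numbers ever differ on the real data, the duplicate count and the dictionary were probably built from different versions of the data set.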

Let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we will only keep the entries with the highest number of reviews. In the code below:

    We start by initializing two empty lists, android_clean and already_added.
    We loop through the android data set, and for every iteration:
        We isolate the name of the app and the number of reviews.
        We add the current row (app) to the android_clean list, and the app name (name) to the already_added list if:
            The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary; and
            The name of the app is not already in the already_added list. We need this supplementary condition to account for cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same for each). If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps.

In [6]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
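To see why the second condition matters, here is a minimal sketch on invented rows in which one app has two entries tied on review count. The `toy_` names, `without_check`, `with_check`, and `seen_names` are hypothetical; using a set instead of a list for the names is a swapped-in choice, not what the cell above does.

```python
# Invented rows: 'Box' appears twice with the same (tied) review count.
toy_android = [
    ['Box', 'BUSINESS', '4.2', '1000'],
    ['Box', 'BUSINESS', '4.2', '1000'],
    ['Slack', 'BUSINESS', '4.4', '200'],
    ['Slack', 'BUSINESS', '4.4', '250'],
]
toy_reviews_max = {'Box': 1000.0, 'Slack': 250.0}

# Checking only the review count keeps both tied 'Box' rows.
without_check = [
    app for app in toy_android
    if toy_reviews_max[app[0]] == float(app[3])
]

# Adding the name check keeps only the first of the tied rows.
with_check = []
seen_names = set()  # a set gives fast membership tests; a list also works
for app in toy_android:
    name = app[0]
    if toy_reviews_max[name] == float(app[3]) and name not in seen_names:
        with_check.append(app)
        seen_names.add(name)

print(len(without_check))  # 3
print(len(with_check))     # 2
```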

We use English for the apps we develop at our company, and we'd like to analyze only the apps that are designed for an English-speaking audience. However, if we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not designed for an English-speaking audience.

In [7]:
print(ios[813][1])
print(ios[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


In [8]:
def characters_Ascii(a_string):
    for character in a_string:
        if ord(character) > 127:
            return False
    
    return True

print(characters_Ascii('Instagram'))
print(characters_Ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(characters_Ascii('Docs To Go™ Free Office Suite'))
print(characters_Ascii('Instachat 😜'))

True
False
False
False
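The strict check above also rejects names such as 'Docs To Go™ Free Office Suite' and 'Instachat 😜', which are probably English. One possible refinement, shown here only as a sketch, is to tolerate a small number of non-ASCII characters. The function name `is_probably_english` and the threshold of three are assumptions; the rest of this notebook keeps using characters_Ascii.

```python
def is_probably_english(a_string, max_non_ascii=3):
    """Return True if the string has at most max_non_ascii characters outside the ASCII range."""
    non_ascii = 0
    for character in a_string:
        if ord(character) > 127:
            non_ascii += 1
    return non_ascii <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))   # True
print(is_probably_english('Instachat 😜'))                     # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
```

A threshold like this is still a heuristic: a short non-English name with three or fewer non-ASCII characters would pass.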


In [9]:
ios_english = []
android_english = []

for app in ios:
    name = app[1]
    if characters_Ascii(name):
        ios_english.append(app)

for app in android_clean:
    name = app[0]
    if characters_Ascii(name):
        android_english.append(app)
        
explore_data(ios_english, 0, 3, True)
print('\n')
explore_data(android_english, 0, 3, True) 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 5707
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '

So far in the data cleaning process, we've done the following:

* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

Isolating the free apps will be our last step in the data cleaning process. In the next step, we're going to start analyzing the data.

In [10]:
ios_free = []
android_free = []

for app in ios_english:
    price = float(app[4])
    if price == 0.0:
        ios_free.append(app)
        
for app in android_english:
    price = app[7]
    if price == '0':
        android_free.append(app)
            
print(len(ios_free))
print(len(android_free))

2922
8408
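The cell above compares iOS prices as numbers and Android prices as the exact string '0'. A minimal sketch of a single helper that handles both is shown below. The name `is_free` is hypothetical, and the assumption that paid Android prices are written with a leading dollar sign (for example '$4.99') should be checked against the Price column before relying on it.

```python
def is_free(price_text):
    """Return True when a price string represents zero.

    Assumes prices look like '0', '0.0', or '$4.99' (leading dollar sign is an assumption).
    """
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```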


So far, we've spent a good amount of time cleaning data, including the following:

    1. Removing inaccurate data
    2. Removing duplicate app entries
    3. Removing non-English apps
    4. Isolating the free apps

As we mentioned in the introduction, our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

    * Build a minimal Android version of the app, and add it to Google Play.
    * If the app has a good response from users, we develop it further.
    * If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

    
Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by determining the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our datasets.

We will build two functions we can use to analyze the frequency tables:

* One function to generate frequency tables that show percentages
* Another function that we can use to display the percentages in descending order

The freq_table() function takes two inputs: dataset (a list of lists) and index (an integer).

It returns the frequency table (as a dictionary) for the column at that index, with the frequencies expressed as percentages. The display_table() function then prints those percentages in descending order.

In [11]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = table[key] / total * 100
        table_percentages[key] = percentage
        
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    print("Total count :- ",len(table_display))
    print("\n\n")
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
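For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library. The function name `freq_table_counter` and the toy rows are hypothetical; this is an alternative way to get the same kind of result, not the function used in the rest of the notebook.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at the given index."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

# Invented rows: name at index 0, genre at index 1.
toy_rows = [
    ['app_a', 'Games'],
    ['app_b', 'Games'],
    ['app_c', 'Music'],
    ['app_d', 'Games'],
]

toy_table = freq_table_counter(toy_rows, 1)

# Sort by percentage, highest first, as display_table() does.
for genre, percentage in sorted(toy_table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percentage)
```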

In [12]:
print(android_header)
print('\n')
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Let's examine the frequency table for the prime_genre column of the App Store data set.

In [13]:
display_table(ios_free, 11)

Total count :-  23



Games : 59.171800136892536
Entertainment : 7.529089664613278
Photo & Video : 5.133470225872689
Education : 3.8329911019849416
Social Networking : 3.1143052703627654
Shopping : 2.4982888432580426
Utilities : 2.2587268993839835
Music : 2.1560574948665296
Sports : 2.0533880903490758
Health & Fitness : 1.9849418206707734
Productivity : 1.7111567419575633
Lifestyle : 1.4715947980835045
News : 1.3347022587268993
Travel : 1.1293634496919918
Finance : 1.0951403148528405
Weather : 0.8898015058179329
Food & Drink : 0.8898015058179329
Reference : 0.5133470225872689
Business : 0.5133470225872689
Book : 0.2737850787132101
Medical : 0.20533880903490762
Navigation : 0.13689253935660506
Catalogs : 0.10266940451745381


The table has 23 distinct genres. Among the free English apps, more than half (59.17%) are games. Entertainment apps are close to 7.53%, followed by photo and video apps at about 5.13%. Only 3.83% of the apps are designed for education, followed by social networking apps, which account for 3.11% of the apps in our data set.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users: the demand might not match the supply.

Let's continue with the Genres and Category columns of the Google Play data set (indexes 9 and 1).

In [14]:
display_table(android_free, 9) 

Total count :-  108



Tools : 8.563273073263558
Entertainment : 6.089438629876309
Education : 5.387725975261656
Business : 4.709800190294957
Productivity : 3.9724072312083734
Lifestyle : 3.8772597526165553
Finance : 3.73453853472883
Medical : 3.6393910561370126
Sports : 3.3301617507136063
Personalization : 3.306374881065652
Communication : 3.2231208372978117
Health & Fitness : 3.1279733587059946
Action : 3.116079923882017
Photography : 3.0090390104662226
News & Magazines : 2.7949571836346334
Social : 2.664129400570885
Travel & Local : 2.3073263558515698
Shopping : 2.247859181731684
Books & Reference : 2.1883920076117986
Simulation : 2.0813510941960036
Dating : 1.8315889628924835
Arcade : 1.8315889628924835
Casual : 1.7721217887725977
Video Players & Editors : 1.736441484300666
Maps & Navigation : 1.3558515699333968
Food & Drink : 1.2012369172216937
Puzzle : 1.1298763082778307
Racing : 1.0228353948620361
Role Playing : 0.939581351094196
Auto & Vehicles : 0.939581351094196
Strategy : 0.

In [15]:
# Category
display_table(android_free, 1)

Total count :-  33



FAMILY : 18.803520456707897
GAME : 9.60989533777355
TOOLS : 8.575166508087536
BUSINESS : 4.709800190294957
PRODUCTIVITY : 3.9724072312083734
LIFESTYLE : 3.8891531874405327
FINANCE : 3.73453853472883
MEDICAL : 3.6393910561370126
PERSONALIZATION : 3.306374881065652
SPORTS : 3.258801141769743
COMMUNICATION : 3.2231208372978117
HEALTH_AND_FITNESS : 3.1279733587059946
PHOTOGRAPHY : 3.0090390104662226
NEWS_AND_MAGAZINES : 2.7949571836346334
SOCIAL : 2.664129400570885
TRAVEL_AND_LOCAL : 2.3073263558515698
SHOPPING : 2.247859181731684
BOOKS_AND_REFERENCE : 2.1883920076117986
DATING : 1.8315889628924835
VIDEO_PLAYERS : 1.7602283539486203
MAPS_AND_NAVIGATION : 1.3558515699333968
FOOD_AND_DRINK : 1.2012369172216937
EDUCATION : 1.165556612749762
ENTERTAINMENT : 0.939581351094196
AUTO_AND_VEHICLES : 0.939581351094196
LIBRARIES_AND_DEMO : 0.9039010466222646
HOUSE_AND_HOME : 0.8087535680304472
WEATHER : 0.7968601332064701
EVENTS : 0.7136060894386299
ART_AND_DESIGN : 0.6660323501

The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea of the kinds of apps that have the most users.

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [16]:
genres_ios = freq_table(ios_free, 11)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 78567.30769230769
Photo & Video : 29249.766666666666
Games : 21560.75072296125
Music : 55396.01587301587
Reference : 89562.6
Health & Fitness : 19418.620689655174
Weather : 48275.57692307692
Travel : 34115.57575757576
Shopping : 28877.575342465752
News : 23382.17948717949
Navigation : 125037.25
Lifestyle : 17260.53488372093
Entertainment : 15006.227272727272
Food & Drink : 33333.92307692308
Sports : 25791.666666666668
Finance : 26038.6875
Education : 6103.464285714285
Productivity : 22842.22
Utilities : 11571.69696969697
Book : 16671.0
Business : 6839.6
Catalogs : 5195.0
Medical : 612.0
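The nested loop above rescans the whole data set once per genre. A minimal single-pass sketch is shown below on invented rows; the function name `average_by_group` and the toy values are hypothetical, and on the real data the group and value indexes would be 11 and 5.

```python
def average_by_group(dataset, group_index, value_index):
    """Average the numeric column at value_index for each distinct value at group_index."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Invented rows: genre at index 0, rating count at index 1.
toy_rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Music', '50'],
]

toy_averages = average_by_group(toy_rows, 0, 1)
for genre in toy_averages:
    print(genre, ':', toy_averages[genre])
```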


On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings:

In [17]:
for app in ios_free:
    if app[11] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. The same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking, or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by very few apps that have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then reworking the averages, but we will leave this level of detail for later.
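As a rough illustration of how much a few giants move the average, the sketch below uses the four Navigation rating counts printed earlier and compares the mean with the median and with a mean that excludes very large apps. The helper `mean_below` and the cap of 100,000 ratings are assumptions chosen for illustration, not a threshold used elsewhere in this analysis.

```python
from statistics import mean, median

# Rating counts of the four free Navigation apps printed earlier.
navigation_ratings = [345046, 154911, 187, 5]

def mean_below(values, cap):
    """Mean of the values strictly below cap, or None if no value qualifies."""
    kept = [value for value in values if value < cap]
    return mean(kept) if kept else None

print(mean(navigation_ratings))                # 125037.25
print(median(navigation_ratings))              # 77549.0
print(mean_below(navigation_ratings, 100000))  # 96
```

With only four apps in the genre, any of these summaries is fragile; the point is only that the plain mean is dominated by the two largest apps.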

Reference apps have about 89,563 user ratings on average, but it's actually the Bible and Dictionary.com that skew the average upward:

In [18]:
for app in ios_free:
    if app[11] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
Jishokun-Japanese English Dictionary & Translator : 0


In [19]:
display_table(android_free, 5)

Total count :-  21



1,000,000+ : 15.592293054234062
100,000+ : 11.596098953377735
10,000+ : 10.442435775451951
10,000,000+ : 10.323501427212179
1,000+ : 8.480019029495718
100+ : 7.088487155090391
5,000,000+ : 6.660323501427213
500,000+ : 5.5542340627973354
50,000+ : 4.7216936251189345
5,000+ : 4.5313986679353
10+ : 3.5442435775451955
500+ : 3.246907706945766
50,000,000+ : 2.2121788772597526
100,000,000+ : 2.1289248334919124
50+ : 1.9743101807802093
5+ : 0.8206470028544244
1+ : 0.5114176974310181
500,000,000+ : 0.285442435775452
1,000,000,000+ : 0.22597526165556614
0+ : 0.04757373929590866
0 : 0.011893434823977166
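The Installs values are open-ended buckets stored as text ('100,000+', '1,000,000+', and so on), so they have to be turned into numbers before they can be averaged. A minimal sketch of that conversion follows; the name `parse_installs` is hypothetical, and treating '100,000+' as exactly 100,000 is a simplifying assumption that understates the true install counts.

```python
def parse_installs(installs_text):
    """Convert a string such as '1,000,000+' to a float by dropping commas and the plus sign."""
    cleaned = installs_text.replace(',', '').replace('+', '')
    return float(cleaned)

print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('0+'))          # 0.0
print(parse_installs('0'))           # 0.0
```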


In [20]:
categories_android = freq_table(android_free, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            # Installs look like '1,000,000+': drop the commas and the plus sign
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1932519.642857143
AUTO_AND_VEHICLES : 645317.2278481013
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8504745.97826087
BUSINESS : 1602958.308080808
COMICS : 880440.625
COMMUNICATION : 36106662.328413285
DATING : 764959.4610389611
EDUCATION : 1844897.9591836734
ENTERTAINMENT : 12346329.11392405
EVENTS : 232885.83333333334
FINANCE : 1348224.9426751593
FOOD_AND_DRINK : 1974937.1386138613
HEALTH_AND_FITNESS : 4263642.1749049425
HOUSE_AND_HOME : 1391211.1911764706
LIBRARIES_AND_DEMO : 674917.2368421053
LIFESTYLE : 1375297.3058103975
GAME : 15434835.816831684
FAMILY : 3633707.342820999
MEDICAL : 119216.81045751635
SOCIAL : 24441088.17857143
SHOPPING : 7307823.2010582015
PHOTOGRAPHY : 18099283.85375494
SPORTS : 3647640.208029197
TRAVEL_AND_LOCAL : 14487541.68041237
TOOLS : 11084333.292649098
PERSONALIZATION : 5027006.791366907
PRODUCTIVITY : 16972497.946107786
PARENTING : 544745.6363636364
WEATHER : 5219216.7164179105
VIDEO_PLAYERS : 25234606.216216218
NEWS_AND_MAGAZINES 