# Profitable App Profiles for the App Store and Google Play Markets

Our goal is to find out which kinds of apps users are most likely to download and, based on the collected data, to analyze what types of apps users prefer.

In [1]:
from csv import reader

### The Google Play data set ###
# Assumes the file is UTF-8 encoded; the with block closes it after reading.
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

### The App Store data set ###
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

def explore_data(dataset, start, end, rows_and_columns= False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(ios_header)
print('\n')
explore_data(ios, 0, 4, True)



['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


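The Google Play data set has a dedicated discussion section, and one of the discussions describes an error in row 10472. As the output below shows, that row has 12 values while the header has 13: the Category value is missing, so the remaining values are shifted one column to the left.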

In [2]:
print(android[10472])
print('\n')
print(android_header)
print('\n')
print(ios_header)
#print(android[1000])
#print(ios[1000])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [3]:
del android[10472]
print(len(android))

10840
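
The faulty row could also be located without relying on the discussion section. The following is a minimal sketch, assuming that a malformed row is simply one whose length differs from the header's. The helper name `find_malformed_rows` and the sample data are hypothetical and do not come from the real files.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]


# Hypothetical sample data, not taken from the real files.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # one value is missing, so the row is too short
    ['App C', 'GAME', '4.5'],
]

bad_rows = find_malformed_rows(sample_rows, sample_header)
print(bad_rows)
```

If several rows were flagged, deleting them from the highest index down would keep the remaining indexes valid.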


In [4]:
duplicate_apps = []
unique_apps = []
for row in android:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    unique_apps.append(name)
print(len(duplicate_apps))
    

1181


We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If we examine the duplicate rows for an app such as Instagram, the main difference appears in the fourth position of each row, which corresponds to the number of reviews. The differing numbers show that the data was collected at different times.

We could use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.

In [5]:
reviews_max_1 = {}
reviews_max_2 = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max_1 and reviews_max_1[name] < n_reviews:
        reviews_max_1[name] = n_reviews
    if name not in reviews_max_1:
        reviews_max_1[name] = n_reviews
for app in ios:
    name = app[1]
    n_reviews_2 = float(app[5])
    if name in reviews_max_2 and reviews_max_2[name] < n_reviews_2:
        reviews_max_2[name] = n_reviews_2
    if name not in reviews_max_2:
        reviews_max_2[name] = n_reviews_2
        
print(len(reviews_max_1))
print(len(reviews_max_2))

9659
7195


Earlier, we looped through the Google Play data set and found that there are 1,181 duplicates. After we remove the duplicates, we should be left with 9,659 rows (10,840 minus 1,181).

To remove the duplicates, we will:

   * Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app
   * Use the information stored in the dictionary to create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews)


In the code cell below:

   * We initialize two empty lists for each data set: *android_clean* and *already_added_1* for Google Play, and *ios_clean* and *already_added_2* for the App Store.
   * We loop through each data set, and for every iteration:
        * We isolate the name of the app and the number of reviews.
        * We add the current row (app) to the clean list, and the app name (name) to the already-added list, if:
             * The number of reviews of the current app matches the maximum recorded for that app in the corresponding dictionary (reviews_max_1 or reviews_max_2); and
             * The name of the app is not already in the already-added list. We need this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we only checked whether the review count equals the recorded maximum, we would still end up with duplicate entries for some apps.


In [6]:
android_clean = []
ios_clean = []
already_added_1 = []
already_added_2 = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max_1[name] and name not in already_added_1:
        android_clean.append(app)
        already_added_1.append(name)
for app in ios:
    name = app[1]
    n_reviews_2 = float(app[5])
    if n_reviews_2 == reviews_max_2[name] and name not in already_added_2:
        ios_clean.append(app)
        already_added_2.append(name)
print(len(android_clean))
print(len(ios_clean))

9659
7195


Just as expected, we have 9659 rows for android and 7195 rows for ios.
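
The two passes above (first finding the maximum, then filtering) can also be collapsed into one. The following is only a sketch of that alternative, not the method used in this notebook: it stores the best row per app name in a dictionary. The function name `remove_duplicates` and the sample rows are hypothetical.

```python
def remove_duplicates(dataset, name_index, reviews_index):
    """Keep one row per app name: the first row with the highest review count."""
    best_rows = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())


# Hypothetical sample rows: [name, number of reviews, label].
sample = [
    ['Box', '100', 'first'],
    ['Instagram', '50', 'old'],
    ['Box', '100', 'second'],
    ['Instagram', '70', 'new'],
]

deduped = remove_duplicates(sample, 0, 1)
print(deduped)
```

A dictionary lookup is also cheaper than the `name not in already_added_1` list search, which scans the whole list on every iteration. Note that the rows come back in the order in which each name first appears, which may differ from the order produced by the cell above.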

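Some app names suggest that the app is not aimed at an English-speaking audience, and we want to exclude those apps. English text normally uses ASCII characters (code points 0 to 127), but requiring every character to be ASCII would also discard English apps whose names contain emoji or symbols such as ™. To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters: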

In [7]:
def is_English(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1 
        if non_ascii > 3:
            return False
    return True

print(is_English('Instagram'))
print(is_English('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_English('Docs To Go™ Free Office Suite'))
print(is_English('Instachat 😜'))

True
False
True
True
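
For comparison, the same rule can be written more compactly by counting the non-ASCII characters with a generator expression. This is a sketch only; the name `is_english_compact` and the `max_non_ascii` parameter are hypothetical additions, and the function counts every character rather than stopping early as `is_English` does.

```python
def is_english_compact(string, max_non_ascii=3):
    """Return True when the string has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii


names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
results = [is_english_compact(name) for name in names]
print(results)
```

Making the threshold a parameter makes it easy to check how many apps a stricter or looser rule would remove.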


In [8]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_English(name):
        android_english.append(app)
        
for app in ios_clean:
    name = app[1]
    if is_English(name):
        ios_english.append(app)
explore_data(android_english, 0, 3, True)
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'G

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, so we need to isolate the free apps for our analysis.

Isolating the free apps is the last step in the data cleaning process. After that, we'll start analyzing the data.

In [9]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8864
3220


So, after cleaning all the non-English and non-free apps, we're left with 8864 android apps and 3220 iOS apps.
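
The filter above compares price strings literally ('0' for Google Play, '0.0' for the App Store), so it depends on the exact formatting in each file. A slightly more tolerant sketch converts the price to a number first. It assumes that prices are plain numbers, optionally prefixed with a dollar sign; that format is an assumption, not something shown in the output above. The helper name `is_free` is hypothetical.

```python
def is_free(price):
    """Return True when a price string represents zero.

    Assumes prices look like '0', '0.0' or '$4.99' (an assumption about the data).
    """
    cleaned = price.strip().lstrip('$')
    return float(cleaned) == 0.0


checks = [is_free('0'), is_free('0.0'), is_free('$4.99'), is_free('2.99')]
print(checks)
```

With such a helper, one loop could serve both data sets, for example `[app for app in android_english if is_free(app[7])]`. An empty or non-numeric price would raise a `ValueError`, which would at least make malformed rows visible.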

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

To minimize risks and overhead, our validation strategy for an app idea comprises three steps:

   * Build a minimal Android version of the app, and add it to Google Play.
   * If the app has a good response from users, we then develop it further.
   * If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

To find the most common genres in each market, we'll build a frequency table for the prime_genre column of the App Store data set, and for the Genres and Category columns of the Google Play data set.

We'll build two functions we can use to analyze the frequency tables:

   * One function to generate frequency tables that show percentages
   * Another function we can use to display the percentages in a descending order


In [10]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
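
As an aside, the standard library's `collections.Counter` can do the counting and the sorting in one step. The following is a sketch of an equivalent table, not a replacement for the functions above; the name `freq_table_counter` and the sample rows are hypothetical.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage}, ordered from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by descending count.
    return {key: count / total * 100 for key, count in counts.most_common()}


# Hypothetical sample rows: [name, genre].
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(sample, 1)
print(table)
```

Because dictionaries preserve insertion order, iterating over the result already gives the values in descending order of frequency.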
        

In [11]:
display_table(ios_final, -5)

Games : 58.13664596273293
Entertainment : 7.888198757763975
Photo & Video : 4.968944099378882
Education : 3.6645962732919255
Social Networking : 3.291925465838509
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


As we can see from above, more than half of the apps are Games (58%). Entertainment (8%), Photo & Video (5%) and Education (4%) follow at a considerable distance. Most of the free English apps on the App Store are designed for fun (games, entertainment, photo and video, social networking, etc.) rather than for practical purposes (education, shopping, utilities). However, this doesn't imply that games also have the largest number of users: the number of apps offered in a genre (supply) need not match user demand.

In [13]:
display_table(android_final, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

On Google Play, fewer apps are designed for fun, and a good number are designed for practical purposes (family, tools, business, productivity, etc.).

So far, the App Store appears to be dominated by apps designed for fun, whereas Google Play shows a more balanced mix of fun and practical apps.

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, we'll use the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

In [16]:
genres_ios = freq_table(ios_final, -5)
for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            # rating_count_tot: the number of ratings, used as a proxy for users
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Finance : 31467.944444444445
Photo & Video : 28441.54375
Weather : 52279.892857142855
Travel : 28243.8
Entertainment : 14029.830708661417
News : 21248.023255813954
Food & Drink : 33333.92307692308
Social Networking : 71548.34905660378
Book : 39758.5
Games : 22812.92467948718
Business : 7491.117647058823
Catalogs : 4004.0
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Music : 57326.530303030304
Education : 7003.983050847458
Productivity : 21028.410714285714
Shopping : 26919.690476190477
Sports : 23008.898550724636
Utilities : 18684.456790123455
Health & Fitness : 23298.015384615384
Medical : 612.0
Reference : 74942.11111111111
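
The loop above rescans the whole data set once per genre, and its output is unsorted, which makes the top genres hard to spot. The following sketch computes the same kind of averages in a single pass and sorts them. The function name `average_by_group` and the sample rows are hypothetical.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Return (group, average) pairs sorted by descending average."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    averages = {group: totals[group] / counts[group] for group in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample rows: [genre, number of ratings].
sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]
ranking = average_by_group(sample, 0, 1)
for genre, average in ranking:
    print(genre, ':', average)
```

On the notebook's data, a call such as `average_by_group(ios_final, -5, 5)` would be expected to give the averages printed above, ordered from highest to lowest.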


In [17]:
display_table(android_final, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


In [26]:
android_category = freq_table(android_final, 1)
for category in android_category:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

BOOKS_AND_REFERENCE : 8767811.894736841
AUTO_AND_VEHICLES : 647317.8170731707
ENTERTAINMENT : 11640705.88235294
FAMILY : 3695641.8198090694
GAME : 15588015.603248259
COMICS : 817657.2727272727
LIBRARIES_AND_DEMO : 638503.734939759
BUSINESS : 1712290.1474201474
SPORTS : 3638640.1428571427
FINANCE : 1387692.475609756
DATING : 854028.8303030303
MAPS_AND_NAVIGATION : 4056941.7741935486
PARENTING : 542603.6206896552
LIFESTYLE : 1437816.2687861272
WEATHER : 5074486.197183099
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
NEWS_AND_MAGAZINES : 9549178.467741935
TRAVEL_AND_LOCAL : 13984077.710144928
EVENTS : 253542.22222222222
HEALTH_AND_FITNESS : 4188821.9853479853
PHOTOGRAPHY : 17840110.40229885
BEAUTY : 513151.88679245283
MEDICAL : 120550.61980830671
HOUSE_AND_HOME : 1331540.5616438356
COMMUNICATION : 38456119.167247385
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
TOOLS : 10801391.298666667
PRODUCTIVITY : 16787331.344927534
PERSONALIZATION : 5201482.6122448975
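
The string cleaning inside the loop above can be isolated in a small helper, which makes it easier to test on its own. This is a sketch; the name `parse_installs` is hypothetical, and it assumes every Installs value consists of digits, commas and an optional trailing plus sign, as in the frequency table shown earlier.

```python
def parse_installs(installs):
    """Convert a string such as '1,000,000+' to the integer 1000000."""
    return int(installs.replace(',', '').replace('+', ''))


parsed = [parse_installs(value) for value in ['1,000,000+', '500+', '0+', '0']]
print(parsed)
```

The trailing plus sign suggests that each value is a lower bound of a bucket rather than an exact count, so the averages above are presumably approximate.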
