**Profitable App Profiles for the App Store and Google Play Markets**

Revenue for apps that are free to download and install on Google Play and the App Store depends mostly on the number of people who use the app. The main source of revenue is in-app ads, so profitability requires more users who see and engage with those ads.

**Objective:** To analyze data to help developers understand what types of apps are likely to attract more users.


In [1]:
import os

In [2]:
os.chdir(r"C:\Users\nicho\Downloads")

In [3]:
print("my path is ", os.getcwd())

my path is  C:\Users\nicho\Downloads


In [4]:
from csv import reader
opened_file = open('AppleStore.csv', errors='ignore')
read_file = reader(opened_file)
ios_apps = list(read_file)
ios_header = ios_apps[0]
ios = ios_apps[1:]

In [5]:
opened_file = open('googleplaystore.csv', errors='ignore')
read_file = reader(opened_file)
android_apps = list(read_file)
android_header = android_apps[0]
android = android_apps[1:]

In [6]:
print(ios_header)
print('\n')
print(android_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [7]:
def explore_data(dataset,start,end,rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('No of rows: ', len(dataset))
        print('No of columns: ', len(dataset[0]))
              

**Cleaning the datasets**
The function below checks for rows with missing data in both the Android and iOS datasets. A row is flagged when its length differs from the length of the header.
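
As a minimal sketch of the same check on toy data, the snippet below flags rows whose length differs from the header's. The helper name `find_malformed_rows` and the sample rows are hypothetical and are not part of the datasets used in this notebook.

```python
def find_malformed_rows(rows, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Hypothetical toy data: the second row is missing its 'Category' value.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '4.5'],
]

bad_rows = find_malformed_rows(toy_rows, toy_header)
print(bad_rows)
```

Returning the indexes instead of printing them makes it easier to delete the flagged rows afterwards.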

In [8]:
def missing_data(dataset):
    # Pick the expected row length once, from the matching header.
    if dataset == android:
        header_length = len(android_header)
    elif dataset == ios:
        header_length = len(ios_header)
    else:
        header_length = len(dataset[0])

    # Print every row whose length differs from the header, with its index.
    for index, data in enumerate(dataset):
        if len(data) != header_length:
            print(data)
            print('\n')
            print(index)
      

In [9]:
print(missing_data(android))
print(missing_data(ios))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


10472
None
None


In [10]:
explore_data(ios,0,2,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


No of rows:  7197
No of columns:  16


**Deleting the row with missing data.**

In [11]:
del android[10472]

In [12]:
explore_data(android,0,2,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


No of rows:  10840
No of columns:  13


**Removing duplicates**
Duplicates are not removed at random. For each app with several entries, we look at the reviews column and keep the entry with the highest number of reviews: the higher the number of reviews, the more recent the data is likely to be.
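
The rule of keeping the most-reviewed entry can be sketched on toy rows as follows. This is only an illustration under assumptions: the helper name `keep_most_reviewed` and the sample rows are hypothetical, and the output order follows the first appearance of each name, which may differ from the order produced by the two-step approach used in the cells below.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per app name: the first row holding the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews,
        # so exact ties keep the earlier row.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Hypothetical toy rows: [name, number of reviews]
toy_rows = [
    ['App A', '100'],
    ['App B', '5'],
    ['App A', '250'],
    ['App A', '250'],
]

deduped = keep_most_reviewed(toy_rows, 0, 1)
print(deduped)
```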

In [13]:
def duplicates(dataset,index):
    dup_apps = []
    unique_apps =[]
    
    for data in dataset:
        name= data[index]
        if name in unique_apps:
            dup_apps.append(name)
            
        else:
            unique_apps.append(name)
            
    print('Duplicate apps: ', len(dup_apps))
    print('Unique apps: ', len(unique_apps))
    

In [14]:
print(duplicates(android,0))
print(duplicates(ios,1))

Duplicate apps:  1181
Unique apps:  9659
None
Duplicate apps:  2
Unique apps:  7195
None


In [15]:
androids_max={}
for data in android:
    name = data[0]
    n_reviews= float(data[3])
    
    if name in androids_max and  androids_max[name] < n_reviews:
        androids_max[name] = n_reviews
        
    elif name not in androids_max:
        androids_max[name] = n_reviews
    

In [16]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(androids_max))

Expected length: 9659
Actual length: 9659


In [17]:
android_data=[]
added=[]
for data in android:
    name = data[0]
    n_reviews = float(data[3])
    
    if (androids_max[name] == n_reviews) and (name not in added):
        android_data.append(data)
        added.append(name)
    

In [18]:
ios_max = {}
for data in ios:
    name=data[1]
    n_reviews= float(data[5])
    
    if name in ios_max and ios_max[name] < n_reviews:
        ios_max[name] = n_reviews
        
    elif name not in ios_max:
        ios_max[name] = n_reviews
    

In [19]:
ios_data = []
added_ios = []
for data in ios:
    name= data[1]
    n_reviews = float(data[5])
    
    if (ios_max[name] == n_reviews) and (name not in added_ios):
        ios_data.append(data)
        added_ios.append(name)
        

In [20]:
print(explore_data(ios_data,0,1,True))
print(explore_data(android_data,0,1,True))

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


No of rows:  7195
No of columns:  16
None
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


No of rows:  9659
No of columns:  13
None


**We need to filter out non-English apps**

In [21]:
def check_english_apps(word):
    start = 0
    for characters in word:
        if ord(characters) > 127:
            start += 1
            
        if start > 3:
            return False
        
    return True

In [22]:
android_eng_apps = []
for data in android_data:
    name = check_english_apps(data[0])
    
    if name == True:
        android_eng_apps.append(data)
    

In [23]:
explore_data(android_eng_apps,-5,-1,True)

['Sya9a Maroc - FR', 'FAMILY', '4.5', '38', '53M', '5,000+', 'Free', '0', 'Everyone', 'Education', 'July 25, 2017', '1.48', '4.1 and up']


['Fr. Mike Schmitz Audio Teachings', 'FAMILY', '5.0', '4', '3.6M', '100+', 'Free', '0', 'Everyone', 'Education', 'July 6, 2018', '1.0', '4.1 and up']


['Parkinson Exercices FR', 'MEDICAL', 'NaN', '3', '9.5M', '1,000+', 'Free', '0', 'Everyone', 'Medical', 'January 20, 2017', '1.0', '2.2 and up']


['The SCP Foundation DB fr nn5n', 'BOOKS_AND_REFERENCE', '4.5', '114', 'Varies with device', '1,000+', 'Free', '0', 'Mature 17+', 'Books & Reference', 'January 19, 2015', 'Varies with device', 'Varies with device']


No of rows:  9509
No of columns:  13


In [24]:
ios_eng_apps = []
for data in ios_data:
    name = check_english_apps(data[1])
    
    if name == True:
        ios_eng_apps.append(data)
        

In [25]:
explore_data(ios_eng_apps,-5,-1,True)

['1070854722', 'Be-be-bears!', '480781312', 'USD', '2.99', '0', '0', '0.0', '0.0', '1.0.2.5', '4+', 'Games', '35', '5', '13', '1']


['1169971902', 'Hey Duggee: We Love Animals', '136347648', 'USD', '2.99', '0', '0', '0.0', '0.0', '1.2', '4+', 'Games', '40', '5', '1', '1']


['1170406182', 'Shark Boom - Challenge Friends with your Pet', '245415936', 'USD', '0.0', '0', '0', '0.0', '0.0', '1.0.9', '4+', 'Games', '38', '5', '1', '1']


['1070052833', 'Go!Go!Cat!', '91468800', 'USD', '0.0', '0', '0', '0.0', '0.0', '1.1.2', '12+', 'Games', '37', '2', '2', '1']


No of rows:  6098
No of columns:  16


**Isolating free apps from the lists of English apps**
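
An app counts as free when its price, converted to a number, equals zero. A minimal sketch of that conversion is shown below, assuming prices are strings such as `'0'`, `'0.0'`, or `'$4.99'`. The helper name `parse_price` and the sample values are hypothetical.

```python
def parse_price(raw_price):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(raw_price.replace('$', ''))

# Hypothetical sample price strings
prices = [parse_price(p) for p in ['0', '0.0', '$4.99']]
is_free = [p == 0 for p in prices]

print(prices)
print(is_free)
```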

In [26]:
android_free_eng_apps = []
for data in android_eng_apps:
    # Strip any '$' sign before converting the price string to a number.
    price = float(data[7].replace('$', ''))
    
    if price == 0:
        android_free_eng_apps.append(data)

print(explore_data(android_free_eng_apps,0,1,True))

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


No of rows:  8769
No of columns:  13
None


In [27]:
ios_free_eng_apps = []
for data in ios_eng_apps:
    # Strip any '$' sign before converting the price string to a number.
    price = float(data[4].replace('$', ''))
            
    if price == 0:
        ios_free_eng_apps.append(data)
        
print(explore_data(ios_free_eng_apps,0,1,True))
        

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


No of rows:  3167
No of columns:  16
None


**To minimize risks and overhead, the validation strategy for an app idea consists of three steps:**

**1:** Build a minimal Android version of the app, and add it to Google Play. 
**2:** If the app has a good response from users, we develop it further. 
**3:** If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because the end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets.


In [28]:
def freq_table(dataset,index):
    table = {}
    total=0
    
    for data in dataset:
        total += 1
        name=data[index]
        
        if name in table:
            table[name] += 1
        else:
            table[name]=1
    
    table_percentages={}
    for data in table:
        table_percentages[data] = (table[data]/total)*100
        
    return table_percentages
        

In [29]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [30]:
display_table(ios_free_eng_apps,11)

Games : 58.509630565203665
Entertainment : 7.8307546574044835
Photo & Video : 5.052099778970635
Education : 3.7259235869908434
Social Networking : 3.2838648563309127
Shopping : 2.5260498894853174
Utilities : 2.3997473950110515
Sports : 2.178718029681086
Music : 2.0524155352068205
Health & Fitness : 1.9892642879696874
Productivity : 1.7050836754025893
Lifestyle : 1.5472055573097567
News : 1.3261761919797914
Travel : 1.1367224502683928
Finance : 1.1051468266498263
Weather : 0.8525418377012947
Food & Drink : 0.8209662140827282
Reference : 0.5367856015156299
Business : 0.5367856015156299
Book : 0.3789074834227976
Navigation : 0.1894537417113988
Medical : 0.1894537417113988
Catalogs : 0.12630249447426586


In [31]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [32]:
unique_genre = freq_table(ios_free_eng_apps,-5)
print(unique_genre)

{'Social Networking': 3.2838648563309127, 'Photo & Video': 5.052099778970635, 'Games': 58.509630565203665, 'Music': 2.0524155352068205, 'Reference': 0.5367856015156299, 'Health & Fitness': 1.9892642879696874, 'Weather': 0.8525418377012947, 'Utilities': 2.3997473950110515, 'Travel': 1.1367224502683928, 'Shopping': 2.5260498894853174, 'News': 1.3261761919797914, 'Navigation': 0.1894537417113988, 'Lifestyle': 1.5472055573097567, 'Entertainment': 7.8307546574044835, 'Food & Drink': 0.8209662140827282, 'Sports': 2.178718029681086, 'Book': 0.3789074834227976, 'Finance': 1.1051468266498263, 'Education': 3.7259235869908434, 'Productivity': 1.7050836754025893, 'Business': 0.5367856015156299, 'Catalogs': 0.12630249447426586, 'Medical': 0.1894537417113988}


In [33]:
ratings_per_genre = {}
for genre in unique_genre:
    total = 0
    len_genre = 0
    
    for data in ios_free_eng_apps:
        genre_app = data[-5]
        
        if genre_app == genre:
            rating = float(data[5])
            total += rating
            len_genre += 1
    
    avg_ratings = total/len_genre
    ratings_per_genre[genre]= avg_ratings
    
    print(genre, ':', avg_ratings)
    
    

Social Networking : 72916.54807692308
Photo & Video : 28441.54375
Games : 23009.927145169993
Music : 58205.03076923077
Reference : 79350.4705882353
Health & Fitness : 24037.634920634922
Weather : 54215.2962962963
Utilities : 19900.473684210527
Travel : 31358.5
Shopping : 27816.2
News : 21750.071428571428
Navigation : 86090.33333333333
Lifestyle : 16739.34693877551
Entertainment : 14364.774193548386
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 46384.916666666664
Finance : 32367.02857142857
Education : 7003.983050847458
Productivity : 21799.14814814815
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


In [34]:
def disp_table(dictionary):
    table = dictionary
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [35]:
disp_table(ratings_per_genre)

Navigation : 86090.33333333333
Reference : 79350.4705882353
Social Networking : 72916.54807692308
Music : 58205.03076923077
Weather : 54215.2962962963
Book : 46384.916666666664
Food & Drink : 33333.92307692308
Finance : 32367.02857142857
Travel : 31358.5
Photo & Video : 28441.54375
Shopping : 27816.2
Health & Fitness : 24037.634920634922
Games : 23009.927145169993
Sports : 23008.898550724636
Productivity : 21799.14814814815
News : 21750.071428571428
Utilities : 19900.473684210527
Lifestyle : 16739.34693877551
Entertainment : 14364.774193548386
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


Using the average number of ratings per app as a proxy for the number of users, Navigation apps on iOS have the most users, followed by Reference apps, while Medical apps have the fewest.
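
The per-genre averages above are computed with a nested loop, which scans the whole dataset once per genre. As a hedged alternative, the sketch below computes the same kind of average in a single pass. The helper name `average_by_group` and the toy rows are hypothetical.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` for each value of `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical toy rows: [name, genre, number of ratings]
toy_rows = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Weather', '50'],
]

averages = average_by_group(toy_rows, 1, 2)
print(averages)
```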

In [36]:
user_ratings_per_genre = {}
for genre in unique_genre:
    total = 0
    len_genre = 0
    
    for data in ios_free_eng_apps:
        genre_app = data[-5]
        
        if genre_app == genre:
            rating = float(data[7])
            total += rating
            len_genre += 1
    
    avg_user_rating = total/len_genre
    user_ratings_per_genre[genre]= avg_user_rating
    
print(disp_table(user_ratings_per_genre))

Catalogs : 4.125
Productivity : 4.064814814814815
Games : 4.060172692930383
Shopping : 3.975
Business : 3.9705882352941178
Music : 3.953846153846154
Photo & Video : 3.903125
Health & Fitness : 3.888888888888889
Reference : 3.8823529411764706
Navigation : 3.8333333333333335
Education : 3.635593220338983
Food & Drink : 3.6346153846153846
Social Networking : 3.6009615384615383
Book : 3.5833333333333335
Utilities : 3.5723684210526314
Entertainment : 3.5403225806451615
Weather : 3.4814814814814814
Finance : 3.4714285714285715
Lifestyle : 3.4591836734693877
Travel : 3.4444444444444446
News : 3.2261904761904763
Sports : 3.0652173913043477
Medical : 3.0
None


Catalogs apps on iOS have the highest average user rating, followed by Productivity apps, while Medical and Sports apps have the lowest average user ratings.

In [37]:
display_table(android_free_eng_apps,5)

1,000,000+ : 15.748660052457522
100,000+ : 11.517846960884935
10,000,000+ : 10.594138442239707
10,000+ : 10.206408940586156
1,000+ : 8.370395712167863
100+ : 6.944919603147451
5,000,000+ : 6.865092941042309
500,000+ : 5.553654920743528
50,000+ : 4.76679210856426
5,000+ : 4.481696886760178
10+ : 3.5237769414984603
500+ : 3.204470293077888
50,000,000+ : 2.3035693921769873
100,000,000+ : 2.1325122590945376
50+ : 1.9272436993955981
5+ : 0.7868628121792679
1+ : 0.5131713992473486
500,000,000+ : 0.27369141293191923
1,000,000,000+ : 0.22807617744326605
0+ : 0.045615235488653205
0 : 0.011403808872163301


In [46]:
android_installs = freq_table(android_free_eng_apps,1)
for category in android_installs:
    total = 0
    len_category = 0
    for data in android_free_eng_apps:
        category_app = data[1]
        
        if category_app == category:
            n_installs = data[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            
            total += float(n_installs)
            len_category += 1
            
    avg_android_installs = total/len_category
    android_installs[category]= avg_android_installs

print(disp_table(android_installs))

COMMUNICATION : 38590581.08741259
VIDEO_PLAYERS : 24878048.860759493
SOCIAL : 23628689.23275862
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15612234.167650532
TRAVEL_AND_LOCAL : 14120454.07804878
ENTERTAINMENT : 11767380.952380951
TOOLS : 10902378.834454913
NEWS_AND_MAGAZINES : 9626407.357723577
BOOKS_AND_REFERENCE : 8329168.936170213
SHOPPING : 7072366.590909091
PERSONALIZATION : 5240358.986111111
WEATHER : 5212877.101449275
HEALTH_AND_FITNESS : 4219697.055350553
MAPS_AND_NAVIGATION : 4115374.214876033
SPORTS : 3725100.537414966
FAMILY : 3709707.689530686
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1951283.8055555555
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1447458.976676385
HOUSE_AND_HOME : 1380033.7285714287
FINANCE : 1365500.4049079753
DATING : 861409.5521472392
COMICS : 859042.1568627451
AUTO_AND_VEHICLES : 654074.8271604938
LIBRARIES_AND_DEMO : 649314.0506329114
PARENTING : 552875.1785714285
BEAUTY : 513151.8867

This data shows that communication apps have the most installs on average. However, the figure is skewed: a few apps in the category, such as WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts, have a very high number of installs.

The video players category, the runner-up with roughly 24.9 million installs on average, is skewed in the same way. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

These niches seem to be dominated by a few giants who are hard to compete against.
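
To see how a single giant can pull a category average up, consider the toy figures below. The install counts are made up for illustration, and the 100 million cutoff is an assumption that mirrors the threshold applied in the next cell.

```python
from statistics import mean, median

# Hypothetical install counts for one category: four small apps and one giant.
toy_installs = [1_000, 5_000, 10_000, 50_000, 1_000_000_000]
cutoff = 100_000_000

overall_mean = mean(toy_installs)
capped_mean = mean(n for n in toy_installs if n < cutoff)
typical = median(toy_installs)

print('Mean with the giant:', overall_mean)
print('Mean without the giant:', capped_mean)
print('Median:', typical)
```

The mean with the giant is several orders of magnitude above what a typical app in the list achieves, which is why the averages are recomputed without the largest apps.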

If we remove the apps with 100 million or more installs, the category averages change as follows:

In [52]:
android_installs = freq_table(android_free_eng_apps,1)
for category in android_installs:
    total = 0
    len_category = 0
    for data in android_free_eng_apps:
        category_app = data[1]
        
        if category_app == category:
            n_installs = data[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            if n_installs < 100000000:
                total += float(n_installs)
                len_category += 1
            
    avg_android_installs = total/len_category
    android_installs[category]= avg_android_installs

print(disp_table(android_installs))

PHOTOGRAPHY : 7670532.29338843
GAME : 6240256.451204056
ENTERTAINMENT : 6183037.974683545
VIDEO_PLAYERS : 5575380.67114094
WEATHER : 5212877.101449275
SHOPPING : 4664914.948186529
COMMUNICATION : 3617398.420849421
PRODUCTIVITY : 3379657.318885449
TOOLS : 3221943.2408963586
SOCIAL : 3113497.2694063927
SPORTS : 3065683.4178082193
TRAVEL_AND_LOCAL : 2973465.43
PERSONALIZATION : 2532940.6714285715
MAPS_AND_NAVIGATION : 2503867.899159664
FAMILY : 2345591.1286407765
HEALTH_AND_FITNESS : 2020586.996282528
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1951283.8055555555
EDUCATION : 1833495.145631068
NEWS_AND_MAGAZINES : 1514799.2181069958
BOOKS_AND_REFERENCE : 1445020.4347826086
HOUSE_AND_HOME : 1380033.7285714287
BUSINESS : 1226918.7407407407
LIFESTYLE : 1159293.6520467836
FINANCE : 1062009.6369230768
DATING : 861409.5521472392
COMICS : 859042.1568627451
AUTO_AND_VEHICLES : 654074.8271604938
LIBRARIES_AND_DEMO : 649314.0506329114
PARENTING : 552875.1785714285
BEAUTY : 513151.8867924528