Profitable App Profiles for Google Play Markets

The aim of this project is to identify which app profiles are the most profitable in the Google Play market, enabling developers to make better data-driven decisions about which apps to develop.

The goal is to help developers who want to build FREE apps decide which type of app to build in order to maximize in-app purchases.

To simplify the process, I will be using a sample of roughly 10,000 Android apps.

Android - https://www.kaggle.com/lava18/google-play-store-apps

STEPS INCLUDED IN THIS PROJECT 

DATA CLEANING
1. Removed inaccurate/missing data
2. Removed duplicate app entries
3. Removed non-English apps
4. Isolated the free apps

DATA ANALYSIS
1. Frequency table (app genre overview)
2. Installs overview
3. Filtering apps that skew averages
4. Explore photography apps
5. Recommendation




We define an explore_data function that prints a slice of a dataset and, optionally, the number of rows and columns it contains.

In [100]:
def explore_data(dataset, start, end, rows_and_columns=False): 
    dataset_slice = dataset[start:end]
    for row in dataset_slice: 
        print(row)
        print('\n') # this adds a new empty line after each row
    if rows_and_columns: 
        print('Number of rows:', len(dataset))
        print('Number of columns: ', len(dataset[0]))
        

Opening the dataset

In [101]:
from csv import reader


#Google Play Data
opened_file = open('googleplaystore.csv', encoding='utf8')  # explicit encoding: app names contain non-ASCII characters
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]
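
As a minimal sketch of the same loading step, the header/data split can be wrapped in a small helper. The name load_rows and the sample rows are hypothetical and only illustrate the idea; an in-memory io.StringIO stands in for the real file.

```python
import csv
import io

# Stand-in for the real file; both data rows are made up for illustration.
sample_csv = io.StringIO(
    'App,Category,Installs\n'
    'Example App,TOOLS,"1,000+"\n'
    'Another App,GAME,"50,000+"\n'
)

def load_rows(file_obj):
    """Return (header, data_rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

header, rows = load_rows(sample_csv)
print(header)
print('Number of rows:', len(rows))
```

With the real dataset, the file object would come from something like `with open('googleplaystore.csv', encoding='utf8', newline='') as f:`, which also closes the file once loading is finished.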


Testing the explore_data() function
Also noting which columns will be the most useful for the analysis

In [102]:
print('ANDROID \n', android_header, '\n')
explore_data(android, 0, 5, True)  # explore_data prints directly and returns None, so it is not wrapped in print()

ANDROID 
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free',

START DATA CLEANING PROCESS

Checking for rows with missing data (rows whose length differs from the header)

In [103]:
header_length = len(android_header)
for index, row in enumerate(android):
    # A row whose length differs from the header has a missing or extra field
    if len(row) != header_length:
        print(row)
        print(index)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472


In [104]:
# Deleting said row
print (len(android))
del android[10472]
print (len(android))

10841
10840
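
The check-then-delete steps above can also be sketched as a single pass that separates well-formed rows from malformed ones. The helper name split_malformed and the toy rows are hypothetical. Using enumerate avoids list.index(), which returns the position of the first equal row and so can point at the wrong entry when the dataset contains duplicates.

```python
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # one column short (made-up example)
    ['App C', 'GAME', '4.4'],
]

def split_malformed(rows, n_columns):
    """Return (good_rows, bad_rows); bad_rows holds (index, row) pairs."""
    good, bad = [], []
    for index, row in enumerate(rows):
        if len(row) == n_columns:
            good.append(row)
        else:
            bad.append((index, row))
    return good, bad

good_rows, bad_rows = split_malformed(rows, len(header))
print('Malformed rows:', bad_rows)
print('Rows kept:', len(good_rows))
```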


Looking for duplicates 



In [105]:

duplicate_apps = [] 
unique_apps = []

for x in android: 
    name = x[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else: 
        unique_apps.append(name)
print('number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:5])

number of duplicate apps:  1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


After determining that there are 1181 duplicate entries, we investigate why they occur

In [106]:
duplicate_names = []

for x in android: 
    name = x[0]
    if name in duplicate_apps:
        duplicate_names.append(name)
        
print (duplicate_names)
    

['Coloring book moana', 'Mcqueen Coloring pages', 'UNICORN - Color By Number & Pixel Art Coloring', 'Textgram - write on photos', 'Wattpad 📖 Free Books', 'Amazon Kindle', 'Dictionary - Merriam-Webster', 'NOOK: Read eBooks & Magazines', 'Oxford Dictionary of English : Free', 'Spanish English Translator', 'NOOK App for NOOK Devices', 'Ebook Reader', 'English Dictionary - Offline', 'Docs To Go™ Free Office Suite', 'Google My Business', 'OfficeSuite : Free Office + PDF Editor', 'Curriculum vitae App CV Builder Free Resume Maker', 'Facebook Pages Manager', 'Box', 'Call Blocker', 'ZOOM Cloud Meetings', 'Facebook Ads Manager', 'Quick PDF Scanner + OCR FREE', 'SignEasy | Sign and Fill PDF and other Documents', 'Quick PDF Scanner + OCR FREE', 'Genius Scan - PDF Scanner', 'Tiny Scanner - PDF Scanner App', 'Fast Scanner : Free PDF Scan', 'Mobile Doc Scanner (MDScan) Lite', 'TurboScan: scan documents and receipts in PDF', 'Tiny Scanner Pro: PDF Doc Scan', 'Box', 'Zenefits', 'Google Ads', 'Google M

PUBG MOBILE appears more than once in this list, so we look at its entries to see how they differ

In [107]:
print (android_header)

for x in android: 
    name = x[0]
    if name == 'PUBG MOBILE':
        print(x)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['PUBG MOBILE', 'GAME', '4.4', '3715656', '36M', '50,000,000+', 'Free', '0', 'Teen', 'Action', 'July 24, 2018', '0.7.0', '4.3 and up']
['PUBG MOBILE', 'GAME', '4.4', '3714270', '36M', '50,000,000+', 'Free', '0', 'Teen', 'Action', 'July 24, 2018', '0.7.0', '4.3 and up']
['PUBG MOBILE', 'GAME', '4.4', '3716278', '36M', '50,000,000+', 'Free', '0', 'Teen', 'Action', 'July 24, 2018', '0.7.0', '4.3 and up']
['PUBG MOBILE', 'GAME', '4.4', '3697174', '36M', '50,000,000+', 'Free', '0', 'Teen', 'Action', 'July 24, 2018', '0.7.0', '4.3 and up']


The results indicate that the only difference between the entries is the number of reviews, which suggests the same app was recorded at different times, resulting in multiple entries.

We will therefore keep the entry with the most reviews and remove the others, on the assumption that the highest review count belongs to the most recent record.

In [108]:

reviews_max = {}

for x in android:
    name = x[0]
    n_reviews = float(x[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

#confirming that we were successful in removing all duplicates
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


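In this step we build a new dataset, android_clean, that keeps for each app only the entry with its highest review count. The already_added list ensures an app is added only once, even when several of its entries tie for the highest count.

As an additional check, android_clean should contain 9659 rows.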

In [109]:
android_clean = []
already_added = []

for x in android: 
    name = x[0]
    n_reviews = float(x[3])
    if name not in already_added and reviews_max[name] == n_reviews:
        android_clean.append(x)
        already_added.append(name)
        
explore_data(android_clean,0,5,True)
        

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9659
Number of columns:  13
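
The two-step approach above (find the maximum review count per app, then filter) can also be sketched as one pass that stores the best row per name in a dictionary. The helper name keep_most_reviewed is hypothetical, and the rows are shortened to four columns for illustration; the PUBG MOBILE review counts are the ones printed earlier, while 'Other App' is made up.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['PUBG MOBILE', 'GAME', '4.4', '3715656'],
    ['PUBG MOBILE', 'GAME', '4.4', '3714270'],
    ['Other App', 'TOOLS', '4.0', '120'],
    ['PUBG MOBILE', 'GAME', '4.4', '3716278'],
    ['PUBG MOBILE', 'GAME', '4.4', '3697174'],
]

deduplicated = keep_most_reviewed(sample)
for row in deduplicated:
    print(row)
```

Dictionary lookups also avoid the repeated `name in list` scans used above, which get slower as the lists grow. Note that the result is ordered by each app's first appearance, which can differ slightly from the order of android_clean.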


With duplicates removed successfully, the next step is to remove non-English apps

We do this by creating a function, is_english, that checks every character in a string: if any character's code point (as returned by ord()) falls outside the ASCII range 0-127, the name is treated as non-English

In [110]:
def is_english(string):
    for x in string: 
        if ord(x) > 127:
            return False
    return True  # every character is within the ASCII range

In [111]:
is_english('欢乐颂2》电视剧热')

False

This isn't foolproof, however: names like 'Docs To Go™ Free Office Suite' or 'Instachat 😜' contain characters outside the 0-127 range but still belong to English apps. We therefore add a counter, so that a name is treated as non-English only if it has more than 3 characters outside that range.

In [112]:
def is_english(string):
    count = 0
    for x in string: 
        if ord(x) > 127:
            count += 1
    if count > 3: 
        return False
    else: 
        return True

In [113]:
is_english('Instachat 😜')

True
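
To make the threshold explicit, here is a compact sketch of the same rule with the cutoff exposed as a parameter. The parameter name max_non_ascii is an assumption added for illustration; the default of 3 matches the rule described above.

```python
def is_english(name, max_non_ascii=3):
    """Treat a name as English unless it has more than max_non_ascii
    characters outside the ASCII range (code points above 127)."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

examples = [
    'Instachat 😜',
    'Docs To Go™ Free Office Suite',
    '欢乐颂2》电视剧热',
]
results = {name: is_english(name) for name in examples}
for name, verdict in results.items():
    print(name, '->', verdict)
```

The first two names each contain a single non-ASCII character and pass, while the third contains eight and is rejected. This remains a heuristic: an English name with four or more emoji would still be filtered out, and a short non-English name could slip through.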

In [114]:
english_list = []

for x in android_clean: 
    name = x[0]
    if is_english(name):
        english_list.append(x)
        
explore_data(english_list,0,3,True)
        

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns:  13


Now that we have filtered out the non-English apps, the next step is to isolate the free apps

In [115]:
android_free = []

for x in english_list: 
    price = x[7]
    if price == '0':
        android_free.append(x)
        
explore_data(android_free,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns:  13


Now that we have a clean dataset, we begin the analysis by exploring the most common app categories and genres, building frequency tables to develop app profiles


In [116]:


def freq_table(dataset, index): 
    table = {}
    total = 0
    
    for x in dataset:
        total += 1
        value = x[index]
        if value in table: 
            table[value] += 1
        else: 
            table[value] = 1
            
    table_percentages = {}
    
    for x in table: 
        percentage = (table[x] / total) * 100
        table_percentages[x] = percentage
        
    return table_percentages
        

#Convert dictionary to list of lists, sorted and printed

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
display_table(android_free,1)



FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Above we found that the FAMILY and GAME categories have the largest representation.

However, after exploring the FAMILY category, it quickly becomes apparent that many of its apps are also games for kids.

So games have the largest representation, with tools coming in second.
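
For reference, the percentage table used above can also be sketched with collections.Counter from the standard library. The function name freq_table_percent and the toy rows are hypothetical; the logic mirrors freq_table, with the entries already ordered from most to least frequent.

```python
from collections import Counter

def freq_table_percent(rows, index):
    """Return {value: percentage of rows}, most frequent value first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100
            for value, count in counts.most_common()}

# Made-up rows: only the category column (index 1) matters here.
toy_rows = [
    ['App A', 'GAME'],
    ['App B', 'TOOLS'],
    ['App C', 'GAME'],
    ['App D', 'FAMILY'],
]

table = freq_table_percent(toy_rows, 1)
for category, percentage in table.items():
    print(category, ':', percentage)
```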

In [117]:
display_table(android_free,9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Now that we have an idea of which app spaces contain the most apps, we look at which kinds of apps have the most downloads, since a larger user base means more potential revenue from in-app purchases

In [118]:
display_table(android_free,5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


The Installs column isn't very precise: it shows open-ended buckets such as '50+' rather than the actual number of downloads

We will strip the commas and plus signs, convert the values to floats, and average them by category
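
The conversion step can be sketched as a small helper. The name parse_installs is hypothetical. Note that the result is only the lower bound of each bucket, so '50+' is counted as exactly 50 installs.

```python
def parse_installs(text):
    """Convert an Installs value such as '1,000,000+' to a float.

    The value is the lower bound of the bucket, not the true count.
    """
    return float(text.replace(',', '').replace('+', ''))

# Values taken from the Installs frequency table above
for raw in ['1,000,000,000+', '50+', '0']:
    print(raw, '->', parse_installs(raw))
```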

In [119]:

    
categories_android = freq_table(android_free,1)


category_list = []

for category in categories_android: 
    total = 0
    len_category = 0
    for x in android_free:
        category_app = x[1]
        if category_app == category:
            n_installs = x[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '') 
            total += float(n_installs)
            len_category += 1
    avg_n_installs = float(total / len_category)
    print(category, ':', avg_n_installs)
    addtolist = (category,avg_n_installs)
    category_list.append(addtolist)

BEAUTY : 513151.88679245283
PRODUCTIVITY : 16787331.344927534
HOUSE_AND_HOME : 1331540.5616438356
ENTERTAINMENT : 11640705.88235294
DATING : 854028.8303030303
MEDICAL : 120550.61980830671
EDUCATION : 1833495.145631068
COMICS : 817657.2727272727
GAME : 15588015.603248259
FINANCE : 1387692.475609756
AUTO_AND_VEHICLES : 647317.8170731707
SOCIAL : 23253652.127118643
BUSINESS : 1712290.1474201474
PHOTOGRAPHY : 17840110.40229885
SHOPPING : 7036877.311557789
COMMUNICATION : 38456119.167247385
LIBRARIES_AND_DEMO : 638503.734939759
HEALTH_AND_FITNESS : 4188821.9853479853
FAMILY : 3695641.8198090694
WEATHER : 5074486.197183099
PERSONALIZATION : 5201482.6122448975
NEWS_AND_MAGAZINES : 9549178.467741935
EVENTS : 253542.22222222222
VIDEO_PLAYERS : 24727872.452830188
TOOLS : 10801391.298666667
SPORTS : 3638640.1428571427
FOOD_AND_DRINK : 1924897.7363636363
PARENTING : 542603.6206896552
TRAVEL_AND_LOCAL : 13984077.710144928
LIFESTYLE : 1437816.2687861272
MAPS_AND_NAVIGATION : 4056941.7741935486
BOOKS

In [120]:
# Looking for the category with the highest average downloads
max_avg = 0  # named max_avg to avoid shadowing the built-in max()
for x in category_list:
    if x[1] > max_avg:
        max_avg = x[1]
for x in category_list:
    if x[1] == max_avg:
        print('Max Average Downloads: ', x[0],': ', x[1])



Max Average Downloads:  COMMUNICATION :  38456119.167247385


In [121]:
# Checking whether applications with over 1 billion downloads are skewing the average

for x in android_free:
    if x[1] == 'COMMUNICATION' and x[5] == '1,000,000,000+':
        print(x[0])
    

WhatsApp Messenger
Messenger – Text and Video Chat for Free
Skype - free IM & video calls
Google Chrome: Fast & Secure
Gmail
Hangouts


To prevent these apps from skewing the data, we will recompute the averages, excluding any application in the '100,000,000+' bucket or higher
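
A minimal sketch of the same idea, using a numeric cutoff instead of a list of bucket labels. The function name average_installs, the cap parameter, and the toy rows are assumptions for illustration; the defaults for the column indexes match the dataset (category at 1, installs at 5).

```python
from collections import defaultdict

def average_installs(rows, cap=None, category_index=1, installs_index=5):
    """Average installs per category, skipping rows at or above cap."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        raw = row[installs_index]
        installs = float(raw.replace(',', '').replace('+', ''))
        if cap is not None and installs >= cap:
            continue  # treat very large apps as outliers
        totals[row[category_index]] += installs
        counts[row[category_index]] += 1
    return {category: totals[category] / counts[category]
            for category in totals}

# Made-up rows with three columns, so installs_index is 2 here.
toy_rows = [
    ['Giant App', 'COMMUNICATION', '1,000,000,000+'],
    ['Small App', 'COMMUNICATION', '1,000+'],
    ['Other App', 'COMMUNICATION', '3,000+'],
]

uncapped = average_installs(toy_rows, installs_index=2)
capped = average_installs(toy_rows, cap=100_000_000, installs_index=2)
print('Uncapped:', uncapped)
print('Capped:', capped)
```

This makes a single pass over the data rather than one pass per category. A category whose apps are all at or above the cutoff is simply left out of the result, whereas dividing by a zero count would raise ZeroDivisionError.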

In [122]:
category_list = []

for category in categories_android: 
    total = 0
    len_category = 0
    for x in android_free:
        category_app = x[1]
        if category_app == category and (x[5] not in ['100,000,000+','500,000,000+','1,000,000,000+']):
            n_installs = x[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '') 
            total += float(n_installs)
            len_category += 1
    avg_n_installs = float(total / len_category)
    print(category, ':', avg_n_installs)
    addtolist = (category,avg_n_installs)
    category_list.append(addtolist)


BEAUTY : 513151.88679245283
PRODUCTIVITY : 3379657.318885449
HOUSE_AND_HOME : 1331540.5616438356
ENTERTAINMENT : 6118250.0
DATING : 854028.8303030303
MEDICAL : 120550.61980830671
EDUCATION : 1833495.145631068
COMICS : 817657.2727272727
GAME : 6272564.694894147
FINANCE : 1086125.7859327218
AUTO_AND_VEHICLES : 647317.8170731707
SOCIAL : 3084582.5201793723
BUSINESS : 1226918.7407407407
PHOTOGRAPHY : 7670532.29338843
SHOPPING : 4640920.541237113
COMMUNICATION : 3603485.3884615386
LIBRARIES_AND_DEMO : 638503.734939759
HEALTH_AND_FITNESS : 2005713.6605166052
FAMILY : 2342897.527075812
WEATHER : 5074486.197183099
PERSONALIZATION : 2549775.832167832
NEWS_AND_MAGAZINES : 1502841.8775510204
EVENTS : 253542.22222222222
VIDEO_PLAYERS : 5544878.133333334
TOOLS : 3191461.128987517
SPORTS : 2994082.551839465
FOOD_AND_DRINK : 1924897.7363636363
PARENTING : 542603.6206896552
TRAVEL_AND_LOCAL : 2944079.6336633665
LIFESTYLE : 1152128.779710145
MAPS_AND_NAVIGATION : 2484104.7540983604
BOOKS_AND_REFERENCE 

In [123]:
max_avg = 0  # named max_avg to avoid shadowing the built-in max()
for x in category_list:
    if x[1] > max_avg:
        max_avg = x[1]
for x in category_list:
    if x[1] == max_avg:
        print('Max Average Downloads: ', x[0],': ', x[1])

Max Average Downloads:  PHOTOGRAPHY :  7670532.29338843


With the largest players excluded so that they no longer skew the data, photography applications have the highest average number of downloads.

Next, we explore the photography apps with the highest downloads
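
The search for the top category can also be sketched with the built-in max() and a key function. The three averages below are copied from the output above; the variable names top_category and top_average are hypothetical.

```python
# (category, average installs) pairs taken from the filtered output above
category_list = [
    ('GAME', 6272564.694894147),
    ('PHOTOGRAPHY', 7670532.29338843),
    ('ENTERTAINMENT', 6118250.0),
]

# key tells max() to compare the pairs by their average, not their name
top_category, top_average = max(category_list, key=lambda pair: pair[1])
print('Max Average Downloads: ', top_category, ': ', top_average)
```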



In [124]:
photography_list = []


for x in android_free:
    if x[1] == 'PHOTOGRAPHY':
        n_installs = x[5]
        n_installs = n_installs.replace(',', '')
        n_installs = n_installs.replace('+', '') 
        n_installs = float(n_installs)
        to_append = (n_installs,x[0])
        photography_list.append(to_append)
        
        
sorted(photography_list, reverse = True)
        

[(1000000000.0, 'Google Photos'),
 (100000000.0, 'Z Camera - Photo Editor, Beauty Selfie, Collage'),
 (100000000.0, 'YouCam Perfect - Selfie Photo Editor'),
 (100000000.0, 'YouCam Makeup - Magic Selfie Makeovers'),
 (100000000.0, 'Sweet Selfie - selfie camera, beauty cam, photo edit'),
 (100000000.0, 'S Photo Editor - Collage Maker , Photo Collage'),
 (100000000.0, 'Retrica'),
 (100000000.0, 'PicsArt Photo Studio: Collage Maker & Pic Editor'),
 (100000000.0, 'PhotoGrid: Video & Pic Collage Maker, Photo Editor'),
 (100000000.0, 'Photo Editor Pro'),
 (100000000.0, 'Photo Editor Collage Maker Pro'),
 (100000000.0, 'Photo Collage Editor'),
 (100000000.0, 'LINE Camera - Photo editor'),
 (100000000.0, 'Cymera Camera- Photo Editor, Filter,Collage,Layout'),
 (100000000.0, 'Candy Camera - selfie, beauty camera, photo editor'),
 (100000000.0, 'Camera360: Selfie Photo Editor with Funny Sticker'),
 (100000000.0, 'BeautyPlus - Easy Photo Editor & Selfie Camera'),
 (100000000.0, 'B612 - Beauty & Fil

After reviewing the top photography apps, it appears that the ones with the most installs are either photo editors (used after a picture is taken) or camera apps that offer controls beyond the device's default settings

CONCLUSION 

Based on these results, we recommend that developers consider building a photo-editing or camera-control app aimed at a specific niche that can satisfy a target customer base, for example by emulating a popular style of photography or by making it easier to turn photos into better pictures, albums or collages.

The analysis suggests that there is substantial potential revenue in the photography app space. Given the high average downloads for free photography apps, in-app purchases can be a viable business model for this type of application.