# Do You Wanna Develop an App? Free App Profitability on Online Marketplaces

A hypothetical client has asked us to make recommendations for a new app. The client intends to build a mobile app for iOS and Android, and wants us to analyze existing apps to guide its development. The client wishes to make an app for an English-speaking audience that is free to install and offers no in-app purchases; the revenue will come from in-app advertisements. We therefore expect their revenue to be most closely tied to the app's number of users, and will use this as the starting point of our analysis.

We will find through this analysis that there are untapped markets in the free app space. These markets are indicated by high numbers of users and reviews, but low numbers of overall apps in those categories. We will identify one such example, and explain how to find others using our results.


## Finding, Opening, and Exploring the Data

As of September 2018, the two marketplaces for these platforms, the Apple App Store and the Google Play Store, each had over 2 million available apps.
<img src="https://s3.amazonaws.com/dq-content/350/py1m8_statista.png" width="400" align="right">

Gathering data on more than 4 million apps would be too costly and time-consuming for our hypothetical client, and is beyond the real-world scope of this project. Instead, we'll use samples of data scraped from the web by users lava18 and ramamet4 and posted to Kaggle.com. Thank you to those users for making this data available to us.

The two data sets we will be using were retrieved in September 2021. Our [first data set](https://www.kaggle.com/lava18/google-play-store-apps) provides 10841 app entries from the Google Play Store in February 2019. Our [second data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) provides 7197 app entries from the Mobile App Store in June 2018. Let's open and explore the data to get a sense of where we're starting.

In [1]:
from csv import reader

Apple_file = open('AppleStore.csv',encoding='utf8')
read_Apple_file = reader(Apple_file)
Apple_data = list(read_Apple_file)

Android_file = open('googleplaystore.csv',encoding='utf8')
read_Android_file = reader(Android_file)
Android_data = list(read_Android_file)

#Print rows in a readable fashion, and give the total size of the data set if requested
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n')

#Observe the header and 2 rows from each data set
print('Android Header: \n')
explore_data(Android_data, 0, 3, rows_and_columns = True)
print('Apple Header: \n')
explore_data(Apple_data, 0, 3, rows_and_columns = True)

Android Header: 

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


Apple Header: 

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernot

We define a function `explore_data` so that we can repeatedly explore slices of variable size and location. The function will also print the number of rows and columns in the data set if we request it. We can use these slices to identify columns that will be useful to our analysis.

For the Android Market we identify `'Category'`, `'Rating'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'`.

The App Store column labels are less self-explanatory, but after checking the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) we can see that we're interested in `'price'`, `'currency'`, `'rating_count_tot'`, `'rating_count_ver'`, `'user_rating'`, `'user_rating_ver'`, and `'prime_genre'`.
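Because the later cells refer to columns by hard-coded position, it can help to confirm those positions against the header rows. The following is a minimal sketch using the two header rows printed above; `column_index` is a hypothetical helper and is not part of the notebook.

```python
# Header rows copied from the output above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
apple_header = ['', 'id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

def column_index(header, names):
    """Map each requested column name to its position in the header row."""
    return {name: header.index(name) for name in names}

print(column_index(android_header, ['App', 'Category', 'Reviews', 'Installs', 'Price', 'Genres']))
print(column_index(apple_header, ['track_name', 'price', 'rating_count_tot', 'prime_genre']))
```

The positions it reports (for example `'Reviews'` at 3 and `'prime_genre'` at 12) are the same ones used in the index comments throughout the code.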
## Data Cleaning

While reading the documentation for the Android data, we learn of an empty data value leading to a row of inconsistent length. We check the entry in question and see that it is indeed shorter than the header due to a missing `'Genre'`. We delete this row, but provide a check in our notebook cell so that we don't accidentally delete multiple rows when running the code again.

In [2]:
#Show that the row in question is a different length than the header
print(len(Android_data[0]))
print(len(Android_data[10473]))
#This check makes sure multiple rows aren't deleted when running the code again
if len(Android_data[10473]) != 13 :
    del Android_data[10473]
#Confirm the error row is gone    
print(len(Android_data[10473]))

13
12
13
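The index 10473 above came from the data set's documentation. If no such pointer were available, rows of inconsistent length could be located by comparing every row to the header. This is only a sketch on toy data; `find_malformed_rows` is a hypothetical helper, and the three-column rows merely stand in for `Android_data`.

```python
def find_malformed_rows(dataset):
    """Return (index, length) for every row whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data: a header plus three rows, one of which is too short.
toy = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],              # a missing value leaves this row one column short
    ['App C', 'GAME', '3.9'],
]
print(find_malformed_rows(toy))    # [(2, 2)]
```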


We also learn that many rows in the Android data are different entries for the same app, likely from different versions. We run a for loop to create a list of the unique app names, and determine that there are 1181 duplicates to remove. 

In [3]:
duplicate_apps = []
unique_apps = []

#Add each app name to a list, and flag names that are already in the list
for app in Android_data:
    name = app[0]            #Android app index 0
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Duplicate apps: ', len(duplicate_apps))


Duplicate apps:  1181
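The loop above checks membership in a list, which rescans the list for every row. As a possible alternative, the same count can be obtained by tallying names with `collections.Counter`. This is a sketch on toy rows, not the notebook's method; `count_duplicates` is a hypothetical name.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Count rows whose name has already appeared earlier in the list."""
    counts = Counter(row[name_index] for row in rows)
    # Every occurrence after the first counts as one duplicate.
    return sum(n - 1 for n in counts.values())

toy_rows = [['App A'], ['App B'], ['App A'], ['App A'], ['App C']]
print(count_duplicates(toy_rows))  # 2
```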


We need to determine which entry of a given duplicated app to retain. We'll keep the entry with the greatest number of reviews, as this is likely to be the most recent data. We create an empty dictionary and loop through our data, storing only the highest review count for each unique app name.

In [4]:
reviews_max = {}
#Loop through each app name and find the greatest number of reviews.
for app in Android_data[1:]:
    name = app[0]                        #Android app index 0
    n_reviews = float(app[3])            #Android Reviews index 3
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
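For comparison, the lookup and the row selection can be folded into one pass by storing the best row itself instead of only its review count. The sketch below runs on toy rows; `keep_most_reviewed` is a hypothetical helper, and it returns rows in order of each name's first appearance, which may differ from the order the notebook produces.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row seen with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        if name not in best or float(row[reviews_index]) > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

toy_rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.5', '50'],
    ['App A', 'TOOLS', '4.1', '250'],
    ['App A', 'TOOLS', '4.1', '250'],   # exact repeat of the best row
]
for row in keep_most_reviewed(toy_rows):
    print(row)
```

Because the comparison is strict, an exact repeat of the best row does not replace it, so each name ends up with exactly one entry.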

Once we have this dictionary, we'll use it to scrub the Android data of duplicate entries. Since our client is interested in creating a free app for an English-speaking audience, we'll also remove apps with non-English names and any apps with a price other than 0. Because some English app names contain a few non-English (non-ASCII) characters such as emojis or trademark symbols, we'll only remove apps whose names contain more than three such characters. The remaining rows should represent the free English-language apps in our Apple and Android samples, which we will call our Apps of Interest.

In [5]:
android_clean = []
already_added = []
for app in Android_data[1:]:
    name = app[0]                        #Android app index 0
    n_reviews = float(app[3])            #Android Reviews index 3
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

def english(string):
    non_english = 0
    for character in string:
        if ord(character) > 127:
            non_english += 1
    if non_english > 3:
        return False
    else:
        return True

android_clean_eng = []
for app in android_clean:
    name = app[0]            #Android app index 0
    if english(name):
        android_clean_eng.append(app)
        
apple_eng = []
for app in Apple_data[1:]:
    name = app[2]            #Apple track_name index 2
    if english(name):
        apple_eng.append(app)

android_free = []
for app in android_clean_eng:
    price = app[7]           #Android Price index 7
    if price == '0':
        android_free.append(app)
        
apple_free = []
for app in apple_eng:
    price = app[5]           #Apple price index 5
    if price == '0':
        apple_free.append(app)
print('Apps of Interest, Android: ', len(android_free))
print('Apps of Interest, Apple: ', len(apple_free))
print('Apps of Interest, Total: ', len(android_free)+len(apple_free))

Apps of Interest, Android:  8864
Apps of Interest, Apple:  3222
Apps of Interest, Total:  12086


Having narrowed our original data sets of over 18,000 entries down to just over 12,000 apps that are relevant to our client's interests, we can begin our analysis and recommendations.

## Analysis

### Part 1: Genre

To begin, we're going to find the most common genres in each market. We'll build a function that generates a frequency table for a given column index, and use it to find the most common types of apps. We'll use the Android `'Category'` and `'Genres'` columns and the Apple `'prime_genre'` column.

In [6]:
def freq_table(data_set, index):
    frequencies = {}
    total = 0
    for app in data_set:
        total += 1
        column = app[index]
        if column in frequencies:
            frequencies[column] += 1
        else:
            frequencies[column] = 1
    
    frequency_percent = {}
    
    for key in frequencies:
        percent = (frequencies[key] / total) * 100
        frequency_percent[key] = round(percent, 2)
        
    return frequency_percent

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
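As a possible shorter variant, the counting and sorting in the two functions above could be handled by `collections.Counter`. This is a sketch under that assumption, shown on toy rows; `freq_table_counter` is a hypothetical name and not part of the notebook.

```python
from collections import Counter

def freq_table_counter(data_set, index):
    """Return {value: percentage of rows}, rounded to 2 places, most common first."""
    counts = Counter(row[index] for row in data_set)
    total = sum(counts.values())
    return {value: round(n / total * 100, 2) for value, n in counts.most_common()}

toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_counter(toy, 1))  # {'Games': 75.0, 'Music': 25.0}
```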

In [7]:
display_table(apple_free, 12) #Apple prime_genre index = 12

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


From this table we see that more than half (58.16%) of the free English apps in our App Store sample are Games. Entertainment and Photo & Video apps make up about 8% and 5%, respectively. It's important to remember that this is just an analysis of the number of apps in each category, not necessarily the popularity or profitability of those apps. We'll continue our analysis with the Android market.

In [26]:
display_table(android_free, 1) #Android Category index 1

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


We can see a few things immediately. Android does not have any category as dominant as Games on the App Store; Family is first with 18.9% and Game is second with 9.7%. However, if we look at the apps in the Family category, it does appear that the majority of them are games for kids. Even so, Family and Game together make up only 28.6% of the Android apps, less than half of the 58.2% share that Games hold on the App Store.\*

<img src="https://camo.githubusercontent.com/0d974d86dfcb791ca1d1505f810ba5a191bb6248e413d9b6ae1a60b211cca918/68747470733a2f2f73332e616d617a6f6e6177732e636f6d2f64712d636f6e74656e742f3335302f7079316d385f66616d696c792e706e67" width="600">

\*It does appear that, at time of writing, the Android store has removed the Family designation and moved kids' games into the Game category. This shouldn't greatly affect the conclusions we've drawn, but would be an area to update were we to revisit this project.
        
Tools, Business, Productivity, and Finance each account for between 3% and 9% of apps on the Android market; on the App Store, none of the equivalent categories was above 3%. It appears that there is more space in the Android store for practical apps, whereas the App Store is dominated by games. The `'Genres'` column of the Android data confirms this assessment.

In [9]:
display_table(android_free, 9) #Android Genres index 9

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

The results for Android genres are a lot more granular; they're broken up into far more sub-groups than the App Store data. There are situations where genres should be combined for more accurate analysis; for example, `'Educational;Education'`, `'Educational'`, and `'Education;Education'` could all be combined. However, since this is a preliminary analysis, we will merely point it out here in case we want to revisit the project later.
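If we did revisit this, one way to combine such labels would be to keep only the primary genre (the text before the `;`) and map near-duplicates to a single name. The sketch below is one possible approach, not something the notebook does; `GENRE_ALIASES` and `normalize_genre` are hypothetical, and dropping the secondary genre is an assumption that discards some information.

```python
# Hypothetical mapping of near-duplicate labels to one canonical label.
GENRE_ALIASES = {'Educational': 'Education'}

def normalize_genre(label):
    """Keep the primary genre (text before ';') and apply the alias mapping."""
    primary = label.split(';')[0].strip()
    return GENRE_ALIASES.get(primary, primary)

for label in ['Educational;Education', 'Educational', 'Education;Education', 'Casual;Pretend Play']:
    print(label, '->', normalize_genre(label))
```

With this mapping the three education labels all become `Education`, and `Casual;Pretend Play` becomes `Casual`.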

### Part 2: User Ratings and Installs

Having categorized each data set by genre, we can begin to look at metrics which might indicate profitability for our client. The number of user ratings or installs gives us an indication of how many people use each app, and therefore how many ads will be seen. Below we find the average number of user ratings for each genre on the App Store, and the average number of installs for each category on the Android market.

In [36]:
apple_genres = freq_table(apple_free, 12)      #Apple prime_genre index 12
apple_avg_review = {}
for genre in apple_genres:
    total = 0
    len_genre = 0
    for app in apple_free:
        genre_app = app[12]
        if genre_app == genre:
            n_ratings = float(app[6])          #Apple rating_count_tot index 6
            total += n_ratings
            len_genre += 1
    average_n_ratings = round((total / len_genre), 2)
    apple_avg_review[genre] = average_n_ratings
apple_avg_review = sorted(apple_avg_review.items(),key=lambda x: x[1], reverse=True)
k = len(apple_avg_review)
for i in range (0, k):
    row = apple_avg_review[i]
    print(row[0], ':', row[1])

Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89
Book : 39758.5
Food & Drink : 33333.92
Finance : 31467.94
Photo & Video : 28441.54
Travel : 28243.8
Shopping : 26919.69
Health & Fitness : 23298.02
Sports : 23008.9
Games : 22788.67
News : 21248.02
Productivity : 21028.41
Utilities : 18684.46
Lifestyle : 16485.76
Entertainment : 14029.83
Business : 7491.12
Education : 7003.98
Catalogs : 4004.0
Medical : 612.0


In [35]:
android_genres = freq_table(android_free, 1)   #Android Category index 1
android_avg_installs = {}
for category in android_genres:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]                #Android Installs index 5
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_installs = round((total / len_category))
    android_avg_installs[category] = avg_installs
    
android_avg_installs = sorted(android_avg_installs.items(),key=lambda x: x[1], reverse=True)
k = len(android_avg_installs)
for i in range (0, k):
    row = android_avg_installs[i]
    print(row[0], ':', row[1])

COMMUNICATION : 38456119
VIDEO_PLAYERS : 24727872
SOCIAL : 23253652
PHOTOGRAPHY : 17840110
PRODUCTIVITY : 16787331
GAME : 15588016
TRAVEL_AND_LOCAL : 13984078
ENTERTAINMENT : 11640706
TOOLS : 10801391
NEWS_AND_MAGAZINES : 9549178
BOOKS_AND_REFERENCE : 8767812
SHOPPING : 7036877
PERSONALIZATION : 5201483
WEATHER : 5074486
HEALTH_AND_FITNESS : 4188822
MAPS_AND_NAVIGATION : 4056942
FAMILY : 3695642
SPORTS : 3638640
ART_AND_DESIGN : 1986335
FOOD_AND_DRINK : 1924898
EDUCATION : 1833495
BUSINESS : 1712290
LIFESTYLE : 1437816
FINANCE : 1387692
HOUSE_AND_HOME : 1331541
DATING : 854029
COMICS : 817657
AUTO_AND_VEHICLES : 647318
LIBRARIES_AND_DEMO : 638504
PARENTING : 542604
BEAUTY : 513152
EVENTS : 253542
MEDICAL : 120551


We see that the social genre (`Social Networking` on Apple, `SOCIAL` on Android) is in the top three of both the Apple average ratings and the Android average installs. From this result we might think it makes sense to recommend that our client build a social media app. But it's important to remember that this category is absolutely dominated by apps like Facebook and Twitter, each having hundreds of millions of users. Outliers like these skew the averages and make certain genres appear more popular than they are. A better metric might be the median for each category; this will show us how a middle-of-the-road app might perform in each market.
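A small worked example shows how strongly one outlier can pull the mean while leaving the median untouched. The review counts here are hypothetical values invented for illustration, not figures from either data set.

```python
import statistics

# Hypothetical review counts for one genre: four modest apps and one giant.
reviews = [1200, 900, 1500, 700, 2_000_000]

mean_reviews = statistics.mean(reviews)
median_reviews = statistics.median(reviews)
print('mean:  ', mean_reviews)
print('median:', median_reviews)
```

The mean lands above 400,000 even though four of the five apps have fewer than 2,000 reviews, while the median stays at 1,200.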

In [41]:
import statistics
apple_genres = freq_table(apple_free, 12)   #Apple prime_genre index 12
apple_med_review = {}
for genre in apple_genres:
    reviews = []
    for app in apple_free:
        genre_app = app[12]
        if genre_app == genre:
            n_ratings = float(app[6])       #Apple rating_count_tot index 6
            reviews.append(n_ratings)
    med = statistics.median(reviews)
    apple_med_review[genre] = med
apple_med_review = sorted(apple_med_review.items(),key=lambda x: x[1], reverse=True)
k = len(apple_med_review)
for i in range (0, k):
    row = apple_med_review[i]
    print(row[0], ':', row[1])
print('Number of Genres: ', k)

Productivity : 8737.5
Navigation : 8196.5
Reference : 6614.0
Shopping : 5936.0
Social Networking : 4199.0
Music : 3850.0
Health & Fitness : 2459.0
Photo & Video : 2206.0
Finance : 1931.0
Sports : 1628.0
Food & Drink : 1490.5
Catalogs : 1229.0
Entertainment : 1197.5
Business : 1150.0
Lifestyle : 1111.0
Utilities : 1110.0
Games : 901.5
Travel : 798.5
Education : 606.5
Medical : 566.5
Book : 421.5
News : 373.0
Weather : 289.0
Number of Genres:  23


In [40]:
android_genres = freq_table(android_free, 1)#Android Category index 1
android_med_installs = {}
for category in android_genres:
    installs = []
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]            #Android Installs index 5
            n_installs = n_installs.replace('+', '')
            n_installs = float(n_installs.replace(',', ''))
            installs.append(n_installs)
    med = statistics.median(installs)
    android_med_installs[category] = med
android_med_installs = sorted(android_med_installs.items(),key=lambda x: x[1], reverse=True)
k = len(android_med_installs)
for i in range (0, k):
    row = android_med_installs[i]
    print(row[0], ':', row[1])
print('Number of Genres: ', k)

EDUCATION : 1000000.0
ENTERTAINMENT : 1000000.0
GAME : 1000000.0
SHOPPING : 1000000.0
PHOTOGRAPHY : 1000000.0
WEATHER : 1000000.0
VIDEO_PLAYERS : 1000000.0
COMMUNICATION : 500000.0
FOOD_AND_DRINK : 500000.0
HEALTH_AND_FITNESS : 500000.0
HOUSE_AND_HOME : 500000.0
ART_AND_DESIGN : 100000.0
AUTO_AND_VEHICLES : 100000.0
COMICS : 100000.0
FAMILY : 100000.0
SOCIAL : 100000.0
SPORTS : 100000.0
TRAVEL_AND_LOCAL : 100000.0
TOOLS : 100000.0
PERSONALIZATION : 100000.0
PRODUCTIVITY : 100000.0
PARENTING : 100000.0
MAPS_AND_NAVIGATION : 100000.0
BEAUTY : 50000.0
BOOKS_AND_REFERENCE : 50000.0
NEWS_AND_MAGAZINES : 50000.0
DATING : 10000.0
FINANCE : 10000.0
LIBRARIES_AND_DEMO : 10000.0
LIFESTYLE : 10000.0
BUSINESS : 1000.0
EVENTS : 1000.0
MEDICAL : 1000.0
Number of Genres:  33


Because the numbers in the `'Installs'` column are very rough estimates, reported only as open-ended buckets such as `'10,000+'` or `'500,000+'`, the medians collapse onto a handful of values and don't give us much useful information. We'll re-run our code, but this time use the `'Reviews'` column to get a better idea of where the middle of each category is.
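To illustrate why bucketed values make for coarse medians, the sketch below parses a few install strings the same way the cells above do. `parse_installs` is a hypothetical helper, and the list of values is a made-up category rather than real rows.

```python
import statistics

def parse_installs(text):
    """Convert a Play Store install bucket such as '10,000+' to a number."""
    return float(text.replace('+', '').replace(',', ''))

# Hypothetical category: every value is a bucket floor, not an exact count.
installs = ['10,000+', '500,000+', '100,000+', '100,000+', '1,000,000+']
parsed = [parse_installs(value) for value in installs]

print(sorted(set(parsed)))          # only a few distinct values are possible
print(statistics.median(parsed))    # the median is always one of the bucket floors
```

An app with 100,001 installs and one with 499,999 installs both parse to 100000.0, so the median cannot distinguish them.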

In [39]:
android_genres = freq_table(android_free, 1)  #Android Category index 1
android_med_reviews = {}
for category in android_genres:
    reviews = []
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            n_reviews = float(app[3])      #Android Reviews index 3
            reviews.append(n_reviews)
    med = statistics.median(reviews)
    android_med_reviews[category] = med
android_med_reviews = sorted(android_med_reviews.items(),key=lambda x: x[1], reverse=True)
k = len(android_med_reviews)
for i in range (0, k):
    row = android_med_reviews[i]
    print(row[0], ':', row[1])
print('Number of Categories: ', k)

GAME : 35371.5
ENTERTAINMENT : 35279.0
PHOTOGRAPHY : 31985.0
EDUCATION : 13612.0
SHOPPING : 13085.0
WEATHER : 11297.0
COMMUNICATION : 6454.0
VIDEO_PLAYERS : 5555.0
HEALTH_AND_FITNESS : 3908.0
SOCIAL : 3884.0
FOOD_AND_DRINK : 3779.0
HOUSE_AND_HOME : 3280.0
TRAVEL_AND_LOCAL : 2277.0
PRODUCTIVITY : 2131.0
SPORTS : 1981.0
MAPS_AND_NAVIGATION : 1799.5
COMICS : 1677.0
FAMILY : 869.0
NEWS_AND_MAGAZINES : 656.5
PERSONALIZATION : 652.0
TOOLS : 645.0
PARENTING : 528.5
ART_AND_DESIGN : 486.0
DATING : 478.0
FINANCE : 467.5
AUTO_AND_VEHICLES : 352.0
BOOKS_AND_REFERENCE : 314.0
BEAUTY : 187.0
LIFESTYLE : 151.0
LIBRARIES_AND_DEMO : 131.0
EVENTS : 48.0
MEDICAL : 22.0
BUSINESS : 15.0
Number of Categories:  33


These numbers are much more informative. We see that the median Android Game or Entertainment app draws more reviews than the median app in almost any other category on that marketplace, and the same is true of Productivity and Reference apps on the App Store. This is an interesting contrast to the observation we made earlier, where the frequency tables showed more practical apps on Android and more games on iOS.
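The median cells above rescan the full data set once per category. As a possible alternative, the rows can be grouped in a single pass with `collections.defaultdict` before taking medians. This is a sketch on toy rows; `median_by_group` is a hypothetical helper and not the notebook's code.

```python
import statistics
from collections import defaultdict

def median_by_group(rows, group_index, value_index):
    """Group rows by one column and return the median of another, highest first."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_index]].append(float(row[value_index]))
    medians = {group: statistics.median(values) for group, values in groups.items()}
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)

toy = [['a', 'GAME', '10'], ['b', 'GAME', '30'], ['c', 'TOOLS', '5'],
       ['d', 'GAME', '20'], ['e', 'TOOLS', '7']]
print(median_by_group(toy, 1, 2))  # [('GAME', 20.0), ('TOOLS', 6.0)]
```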

## Conclusion

When there are more apps in a given genre, there is more competition among those apps, so each is likely to get fewer users. Genres with fewer apps but higher numbers of users and reviews might indicate a market where demand exceeds supply. Our client is looking for an app idea that will generate the most users and engagement, and therefore the most ad revenue. We want to recommend that our client make an app in a genre that has:

1. Generally high numbers of reviews on both app marketplaces
2. A lower total number of apps than average

Using these criteria, our recommendation to our client would be to look into developing a Shopping app.

The Shopping genre ranks 4th out of 23 and 5th out of 33 for median number of reviews on the Apple and Android markets, respectively. But Shopping sits near the middle of both frequency tables for number of apps per genre, at 2.61% of the Apple apps and 2.25% of the Android apps. This indicates that there is high demand for these kinds of apps, but not so much competition that our client's app would get buried in a sea of similar products.
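To look for other candidates, the two criteria could be applied mechanically to one marketplace at a time. The sketch below is a rough illustration only: `candidate_genres` and its `top_n` parameter are hypothetical, and the figures are five rows copied from the App Store tables above, so the mean share is computed over those five genres and not the full table.

```python
def candidate_genres(median_reviews, share_percent, top_n=5):
    """Genres in the top_n by median reviews whose share of apps is below the mean share."""
    mean_share = sum(share_percent.values()) / len(share_percent)
    ranked = sorted(median_reviews, key=median_reviews.get, reverse=True)
    return [genre for genre in ranked[:top_n] if share_percent[genre] < mean_share]

# A handful of App Store rows from the median and frequency tables above.
apple_median = {'Productivity': 8737.5, 'Shopping': 5936.0, 'Entertainment': 1197.5,
                'Games': 901.5, 'Weather': 289.0}
apple_share = {'Productivity': 1.74, 'Shopping': 2.61, 'Entertainment': 7.88,
               'Games': 58.16, 'Weather': 0.87}

print(candidate_genres(apple_median, apple_share, top_n=2))
```

Running the same check on the Android figures and keeping only genres that pass on both marketplaces would mirror the reasoning behind the Shopping recommendation; the category names differ between the two stores, so they would need to be matched by hand first.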

In this project I've used Python in the following ways:

1. Opened, read, and formatted .csv files into lists of lists
2. Cleaned the data of incomplete rows, non-English strings, and duplicate apps
3. Defined functions that created and displayed frequency tables
4. Ran loops within loops to find means and medians for each sub-group of our data

Jacob Simon, October 2021