# Profitable App Profiles for the App Store and Google Play Markets

At our company, we build Android and iOS apps that are free to download and install. Our main source of revenue is in-app ads, so the revenue of a given app is largely determined by its number of users.

The goal of this project is to explore and analyse information from the Google Play and App Store markets, in order to understand which types of apps are most likely to attract the most users.

*The Google Play Store dataset and documentation can be found and downloaded [here](https://www.kaggle.com/datasets/lava18/google-play-store-apps)*

*The Apple App Store dataset and documentation can be found and downloaded [here](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps)*

## Opening and exploring the data

We'll start by opening and reading in each file, followed by transforming it into a list of lists. 

We then separate the header that contains each column title and the dataset that we will work from.

*As documented at the link above for the Google dataset, row 10472 is missing information, so we remove it.*

In [268]:
from csv import reader

# Read in the data
opened_apple_file = open("AppleStore.csv", encoding="utf8")
read_apple_file = reader(opened_apple_file)

# Transform read file into a list of lists
apple_lists = list(read_apple_file)
apple_header = apple_lists[0]    
apple_dataset = apple_lists[1:]

# Read in the data
opened_google_file = open("googleplaystore.csv", encoding="utf8")
read_google_file = reader(opened_google_file)

# Transform read file into a list of lists
google_lists = list(read_google_file)
google_header = google_lists[0]
google_dataset = google_lists[1:]

del google_dataset[10472]    # Removing incorrect row
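
As a hedged sketch, a row that is missing information could also be located programmatically instead of by a hard-coded index. This assumes the faulty row shows up as a row with a different number of columns than the header, which is an assumption about the data rather than something stated above. The helper name `find_malformed_rows` and the toy rows are hypothetical.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose column count differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]


# Toy data for illustration only (not taken from the real files)
header = ["App", "Category", "Rating"]
rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "4.5"],            # one value missing, so the columns are shifted
    ["App C", "GAME", "3.9"],
]

bad_indices = find_malformed_rows(header, rows)
print(bad_indices)

# Delete from the end so that the remaining indices stay valid
for i in sorted(bad_indices, reverse=True):
    del rows[i]
```

Checking the column count first makes it easier to confirm that the index being deleted really is the faulty row.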

To make exploring the data easier, we have defined a function that allows us to select and display the entire dataset or segments of it. There is also additional functionality that shows us the total number of rows and columns in a given dataset.

We'll then use the function to explore a segment of each dataset.

In [269]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') 

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(apple_header)
print("\n")
explore_data(apple_dataset, 0, 3, True)
print("\n")
print(google_header)
print("\n")
explore_data(google_dataset, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 

## Duplicates

The Google Play Store dataset has a number of duplicates. 

Below, we iterate through each row in the dataset, extracting the name of the application at each iteration. We use the name of the application to populate two lists, one for the uniques, and the other for duplicates.

We then print the total number of duplicates and a few examples from the list.

In [270]:
google_uniques = []
google_duplicates = []

for row in google_dataset:
    name = row[0]
    if name in google_uniques:
        google_duplicates.append(name)
    else:
        google_uniques.append(name)

print(f"The total number of duplicates is {len(google_duplicates)}")
print("\n")
print(f"Examples of duplicates include: {google_duplicates[:20]}")

The total number of duplicates is 1181


Examples of duplicates include: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']
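
The same count can be sketched with `collections.Counter`, which avoids repeatedly searching a list (`name in google_uniques` scans the whole list each time). This is an alternative sketch, not the method used above. The helper name `count_duplicates` and the toy rows are hypothetical.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return (names seen more than once with their counts, number of extra rows)."""
    name_counts = Counter(row[name_index] for row in rows)
    duplicates = {name: n for name, n in name_counts.items() if n > 1}
    # Each name seen n times contributes n - 1 surplus rows
    n_extra = sum(n - 1 for n in duplicates.values())
    return duplicates, n_extra


# Toy rows for illustration only
rows = [["Box"], ["Slack"], ["Box"], ["Zenefits"], ["Box"], ["Slack"]]
duplicates, n_extra = count_duplicates(rows)
print(duplicates)
print(n_extra)
```

Here `n_extra` corresponds to the length of the duplicates list built in the cell above, since every repeat after the first occurrence is counted once.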


## Removing duplicates

The main difference between the duplicates is the number of reviews, indicating that the data was collected at different times. Rather than removing the duplicates randomly, we will use the number of reviews as our indicator. The higher the number of reviews, the more recent the data should be. Therefore, for each set of duplicates, we will keep the entry that has the highest number of reviews and remove all other duplicates. 

We start by initialising a dictionary to store the application names and their corresponding number of reviews. We iterate through each row in the dataset, and for each we may:

- add the name and number of reviews to the dictionary, if the name is not already present in the dictionary
- set the number of reviews in the dictionary to the number of reviews in our current iteration, if the name is already present in the dictionary AND the existing number of reviews in the dictionary is lower

In [271]:
reviews_max = {}

for row in google_dataset:
    name = row[0]
    n_reviews = float(row[3])

    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

    if name not in reviews_max:
        reviews_max[name] = n_reviews    

Next, we initialise an empty list for our cleaned data and an empty list for names already added to the cleaned data list.

We then iterate through each row in the dataset. If the number of reviews in the current iteration matches the number of reviews in the dictionary created above AND the name is not already present in our already added list, we:

- add the row to our cleaned data list
- add the name to our already added list *(This prevents us adding duplicates where the number of reviews is the same)*

In [272]:
google_clean = []
already_added = []

for row in google_dataset:
    name = row[0]
    n_reviews = float(row[3])

    if reviews_max[name] == n_reviews and name not in already_added:
        google_clean.append(row)
        already_added.append(name)
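
The two passes above can also be sketched as a single pass that stores, for each name, the row with the most reviews seen so far. This is an alternative sketch under the same assumptions (name in column 0, review count in column 3). The helper name `keep_most_reviewed` and the toy rows are hypothetical.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict comparison: on a tie, the row seen first is kept
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows for illustration only (made-up review counts)
rows = [
    ["Box", "BUSINESS", "4.2", "159"],
    ["Slack", "BUSINESS", "4.4", "51507"],
    ["Box", "BUSINESS", "4.2", "160"],
    ["Slack", "BUSINESS", "4.4", "51507"],
]
deduped = keep_most_reviewed(rows)
print(deduped)
```

Because a dictionary lookup replaces the `name not in already_added` list search, this version scales better on large datasets.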

## Non-English apps

Currently, our company solely uses English for the apps we develop. As such, we'd like to analyse the apps that are designed for an English-speaking audience.

Both datasets have apps with names that suggest they are not designed for an English-speaking audience. In the ASCII system, the characters most commonly used in English text fall in the range 0 to 127. However, app names may also contain symbols and emojis that fall outside this range.

We have defined a helper function to check whether an app name is likely to be for an English-speaking audience. To minimise data loss, we only consider an app non-English if more than 3 of its characters fall outside the ASCII range of 0 to 127. This allows for a few symbols or emojis within an app name.

The function:

- iterates through each character of the given app name, checking whether its code point (as returned by `ord()`) is greater than 127
- tracks the number of characters that meet the condition
- returns False if there are more than 3 characters that meet the condition, or returns True otherwise

In [273]:
def is_english(string):
    count = 0
    
    for c in string:
        # ord() returns the Unicode code point of the given character
        if ord(c) > 127:    
            count += 1

    # English if at most 3 characters fall outside the ASCII range
    return count <= 3
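
To illustrate the threshold, here is a compact, self-contained sketch of the same check applied to a few names. Apart from "Instagram", the names are made up for illustration. The `max_non_ascii` parameter is a hypothetical addition, not part of the function above.

```python
def is_english(name, max_non_ascii=3):
    """Return True if at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for c in name if ord(c) > 127)
    return non_ascii <= max_non_ascii


examples = [
    "Instagram",            # no characters outside ASCII
    "Task Planner™ Pro",    # one symbol outside ASCII
    "Quiz 😜",              # one emoji (a single code point)
    "日本語の学習アプリ",      # nine characters outside ASCII
]

for name in examples:
    print(name, "->", is_english(name))
```

Only the last name exceeds the limit of three characters, so it is the only one treated as non-English.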

## Removing non-English apps

To remove the non-English apps, we start by initialising two empty lists.

We then iterate through each row of each dataset, using the helper function to check whether the app's name suggests an English app.

Each English app is added to the list for its market, either the Apple App Store or the Google Play Store.

In [274]:
apple_english = []
google_english = []

for row in apple_dataset:
    name = row[1]

    if is_english(name):
        apple_english.append(row)

for row in google_clean:
    name = row[0]

    if is_english(name):
        google_english.append(row)

## Isolating the free apps

As we only build apps that are free to download and install, we want to isolate the free apps for our analysis.

We start by initialising two empty lists for our final datasets.

We then iterate through each row of each dataset, checking if the app is free. If so, it is added to the final dataset for its market.

Finally, we print the total number of apps in each market.

In [275]:
apple_final = []
google_final = []

for row in apple_english:
    price = row[4]

    if price == "0.0":
        apple_final.append(row)

for row in google_english:
    price = row[7]

    if price == "0":
        google_final.append(row)

print(f"We have {len(apple_final)} iOS apps in our final Apple dataset")
print(f"We have {len(google_final)} Android apps in our final Google dataset")

We have 3222 iOS apps in our final Apple dataset
We have 8864 Android apps in our final Google dataset
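
The checks above compare price strings exactly (`"0.0"` for Apple, `"0"` for Google). As a hedged sketch, a single numeric check could cover both formats. It assumes that a paid price may carry a leading `$`, which is an assumption about the Google data rather than something shown above. The helper name `is_free` is hypothetical.

```python
def is_free(price):
    """Return True if a price string represents zero, e.g. '0', '0.0' or '$0'."""
    cleaned = price.strip().lstrip("$")   # assumption: paid prices may look like '$4.99'
    return float(cleaned) == 0.0


for price in ["0.0", "0", "$4.99"]:
    print(price, "->", is_free(price))
```

A value that is not numeric after cleaning raises a `ValueError`, which surfaces malformed rows instead of silently dropping them.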


## Most common apps by genre

Our aim is to determine the types of apps that are likely to attract more users, as our revenue is highly influenced by the number of users. 

To minimise risks and overhead, our validation strategy for an app idea consists of three steps:

1) Build a minimal Android version of the app, and add it to Google Play
2) If the app has a good response from users, we'll develop it further
3) If the app is profitable after six months, we build an iOS version of the app and add it to the App Store

As our end goal is to add the app to both markets, we need to find app profiles that are successful on both. The columns that are most relevant are `prime_genre` for the Apple dataset and `Category` / `Genres` for the Google dataset.

Below, we define two functions, `freq_table(dataset, index)` and `display_table(dataset, index)`.

- The first generates a frequency table for the chosen column of a dataset. It iterates through each row of the dataset, using a dictionary to count the frequency for each value. It then converts the values to percentages of the overall column
- The second uses our frequency table function and then iterates through the result, converting each key-value pair from the dictionary into a tuple. We do this so that we can easily sort the percentages and then print them in our desired format

In [276]:
def freq_table(dataset, index):
    freq_table = {}

    for row in dataset:
        value = row[index]

        if value in freq_table:
            freq_table[value] += 1
        else:
            freq_table[value] = 1

    for key in freq_table:
        freq_table[key] = (freq_table[key] / len(dataset)) * 100

    return freq_table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    
    for key in table:
        key_val_as_tuple = (table[key], key)    # tuple created with value first, to enable sorting by value
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)    # as above, sorted by value
    
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
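
For comparison, the same percentage table can be sketched with `collections.Counter` and a `key` function for sorting, which removes the need to build value-first tuples. This is an alternative sketch, not the functions defined above. The names `freq_table_pct` and `sorted_table` and the toy rows are hypothetical.

```python
from collections import Counter


def freq_table_pct(dataset, index):
    """Map each value in the chosen column to its percentage of all rows."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


def sorted_table(dataset, index):
    """Return (value, percentage) pairs sorted by percentage, largest first."""
    table = freq_table_pct(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Toy rows for illustration only
rows = [["a", "Games"], ["b", "Games"], ["c", "Education"], ["d", "Games"]]
for value, pct in sorted_table(rows, 1):
    print(f"{value} : {pct:.2f}")
```

Rounding only at print time keeps the stored percentages exact while making the output easier to read.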

We begin by examining the `prime_genre` column of the Apple dataset.

In [277]:
display_table(apple_final, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The most common genre we find is Games, covering 58.16% of the free English apps. Entertainment follows with 7.88%, Photo & Video with 4.97%, Education with 3.66% and then Social Networking with 3.29%. These genres form our top 5.

The general impression is that most of the apps are designed for entertainment purposes (Games, Entertainment, Photo & Video, Social Networking etc), rather than practical purposes (Education, Utilities, Productivity, Finance etc). 

This may suggest that entertainment apps are the most popular, as they are created in the largest numbers. However, a large number of apps in a genre does not by itself imply a large number of users.

Next, we'll examine the `Category` column of the Google dataset.

In [278]:
display_table(google_final, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The most common category is Family, covering 18.91% of the free English apps. Game follows with 9.72%, Tools with 8.46%, Business with 4.59% and Lifestyle with 3.90%. These categories form our top 5, with Productivity close behind at 3.89%.

Further investigation shows that the Family category encompasses many games for children, leaving us with the impression that games form the biggest share. This is in line with our findings from the Apple market, although the Google market has a much larger share of apps for practical purposes.

As with the Apple market, our data shows which categories are most frequent in the Google market, but it does not tell us which have the largest number of users.

Finally, we'll examine the `Genres` column of the Google dataset.

In [279]:
display_table(google_final, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The most common genre is Tools, covering 8.45% of the free English apps. Entertainment follows with 6.07%, Education with 5.35%, Business with 4.59% and Productivity with 3.89% (tied with Lifestyle). These genres form our top 5.

We again find a much more even balance between apps for entertainment and those for practical purposes than in the Apple market.

We do, however, find that the `Genres` column has a much larger number of categories, with more specific names. Most notably, many categories have small individual percentages but names suggesting that they are games. As such, the `Category` column seems the better, more concise column to use for our analysis of the Google dataset, in conjunction with the `prime_genre` column of the Apple dataset.

## Most popular apps by genre on the Apple App Store

One way to find out which genres are the most popular is to calculate the average number of installs for each app genre. For the Google dataset, we can use the `Installs` column, however there is no similar column in the Apple dataset. As a workaround, we'll use the average number of user ratings for each genre, which can be found using the `rating_count_tot` column.

Below, we generate a frequency table for the `prime_genre` column of the Apple dataset.

We then go through each genre in the frequency table, finding the rows in the overall dataset that match the genre. We do this to total the number of user ratings for each genre and then calculate the average number of user ratings for each genre.

In [280]:
prime_genre_freq_table = freq_table(apple_final, 11)

for genre in prime_genre_freq_table:
    total = 0
    len_genre = 0

    for row in apple_final:
        prime_genre = row[11]

        if prime_genre == genre:
            total_ratings = float(row[5])

            total += total_ratings
            len_genre += 1
            
    avg_ratings = total / len_genre

    print(f"The {genre} genre has an average of {avg_ratings} user ratings")

The Social Networking genre has an average of 71548.34905660378 user ratings
The Photo & Video genre has an average of 28441.54375 user ratings
The Games genre has an average of 22788.6696905016 user ratings
The Music genre has an average of 57326.530303030304 user ratings
The Reference genre has an average of 74942.11111111111 user ratings
The Health & Fitness genre has an average of 23298.015384615384 user ratings
The Weather genre has an average of 52279.892857142855 user ratings
The Utilities genre has an average of 18684.456790123455 user ratings
The Travel genre has an average of 28243.8 user ratings
The Shopping genre has an average of 26919.690476190477 user ratings
The News genre has an average of 21248.023255813954 user ratings
The Navigation genre has an average of 86090.33333333333 user ratings
The Lifestyle genre has an average of 16485.764705882353 user ratings
The Entertainment genre has an average of 14029.830708661417 user ratings
The Food & Drink genre has an average 

We find several genres that have an average number of user ratings over 50,000. These include Navigation, Social Networking, Reference, Music and Weather.

Further investigation shows that in these categories, a small number of apps skew the average by having a very large number of user ratings individually. Across these categories, this includes apps such as Waze, Facebook, Dictionary.com and Spotify. As such, further investigation may be required to obtain a better picture of the market for these genres.
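
One way to gauge this kind of skew is to compare the mean with the median, since the median is far less affected by a handful of very large values. The sketch below uses made-up toy numbers, not values from the dataset. The helper name `summarise_ratings` is hypothetical.

```python
from statistics import mean, median


def summarise_ratings(rating_counts):
    """Return the mean and median of a list of rating counts."""
    return {"mean": mean(rating_counts), "median": median(rating_counts)}


# Made-up toy values: four small apps and one giant
toy_counts = [120, 450, 800, 1500, 2000000]
summary = summarise_ratings(toy_counts)
print(summary)
```

When the mean is many times larger than the median, as in this toy example, the genre average says more about its biggest apps than about a typical app.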

The Games genre has an average of roughly 22,800 user ratings, which initially seems much lower than those mentioned above. However, given that over 58% of the free English apps in the Apple market are games, this is a strong average. Creating an enticing, enjoyable game could lead to a large number of users.

With the Apple market dominated by entertainment apps, it may also be an idea to create an app for practical purposes. Both the Health & Fitness and Productivity genres could be a good fit for this idea as they can be gamified in nature, enticing users to return regularly. This may take the form of promoting regular exercise, starting/sustaining desirable habits or leaving behind undesirable ones. This could offer the opportunity to blend both the entertainment and practical markets into a successful app idea.

## Most popular apps by genre on the Google Play Store

Below we calculate the average number of installs for each genre of the Google market. 

We find that the install values are grouped rather than specific. For example, there are groups for 50+ installs, 100,000+, 1,000,000+ and even 1,000,000,000+. As these are not precise, we use the data only to get a general idea of the user bases. We will therefore treat each group as its lower bound: 100,000+ is counted as exactly 100,000 and 1,000,000+ as exactly 1,000,000, even though the true number could be higher.

We follow a similar process to the Apple dataset above, however we need to format our install values by removing the `+` and `,` characters and then converting to a `float` for computation.
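
The conversion described above can be sketched in isolation as a small helper. The name `parse_installs` is hypothetical. The input strings follow the grouped format described above.

```python
def parse_installs(installs):
    """Convert a grouped install string such as '100,000+' to its lower bound."""
    cleaned = installs.replace("+", "").replace(",", "")
    return float(cleaned)


for value in ["50+", "100,000+", "1,000,000,000+"]:
    print(value, "->", parse_installs(value))
```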

In [281]:
category_freq_table = freq_table(google_final, 1)

for category in category_freq_table:
    total = 0
    len_category = 0

    for row in google_final:
        app_category = row[1]

        if app_category == category:
            num_installs = row[5]
            num_installs = num_installs.replace("+", "")
            num_installs = num_installs.replace(",", "")
            
            total += float(num_installs)
            len_category += 1

    avg_installs = total / len_category

    print(f"The {category} genre has {avg_installs} average installs")

The ART_AND_DESIGN genre has 1986335.0877192982 average installs
The AUTO_AND_VEHICLES genre has 647317.8170731707 average installs
The BEAUTY genre has 513151.88679245283 average installs
The BOOKS_AND_REFERENCE genre has 8767811.894736841 average installs
The BUSINESS genre has 1712290.1474201474 average installs
The COMICS genre has 817657.2727272727 average installs
The COMMUNICATION genre has 38456119.167247385 average installs
The DATING genre has 854028.8303030303 average installs
The EDUCATION genre has 1833495.145631068 average installs
The ENTERTAINMENT genre has 11640705.88235294 average installs
The EVENTS genre has 253542.22222222222 average installs
The FINANCE genre has 1387692.475609756 average installs
The FOOD_AND_DRINK genre has 1924897.7363636363 average installs
The HEALTH_AND_FITNESS genre has 4188821.9853479853 average installs
The HOUSE_AND_HOME genre has 1331540.5616438356 average installs
The LIBRARIES_AND_DEMO genre has 638503.734939759 average installs
The L
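
The per-category averages can also be sketched in a single pass over the data, rather than rescanning the whole dataset once per category. This is an alternative sketch, not the method used above. The names `average_by_group` and `installs_of` and the toy rows are hypothetical.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_fn):
    """Average value_fn(row) over the rows sharing the same value in group_index."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += value_fn(row)
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


def installs_of(row):
    """Read the install count from column 2 of a toy row."""
    return float(row[2].replace("+", "").replace(",", ""))


# Toy rows for illustration only (made-up install groups)
rows = [
    ["App A", "TOOLS", "1,000+"],
    ["App B", "TOOLS", "3,000+"],
    ["App C", "GAME", "500+"],
]
averages = average_by_group(rows, 1, installs_of)
print(averages)
```

Passing the value extraction in as a function lets the same helper serve both markets, for example with a function that reads `rating_count_tot` for the Apple rows.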

We find that on average, the Communication genre has the most installs, with a figure of over 38 million. Following this, we have Video Players with over 24 million, Social with over 23 million, Photography with over 17 million and Productivity with over 16 million. This forms our top 5.

Similarly to the Apple market, many of these genres are dominated by individual apps that have an extremely large number of installs, often several hundred million or even more than a billion. This leaves us with the impression that we would need to compete with large organisations or extremely popular products to succeed.

The Game genre has an average number of installs over 15 million. Similarly to the Apple market, this is a strong average given our discovery that the top 2 most common genres in the Google market (Family + Game) are in reality a combined game market. As such, creating an enjoyable, exciting game could lead to a large number of users.

As we are looking for apps that will ultimately succeed on both markets, the Health and Fitness and Productivity categories also seem to be good options for the Google store. Gamifying the experience, perhaps to support health and wellness or to encourage users to take up a form of exercise, could also lead to a large user base.

## Conclusion

We analysed data from the Apple App Store and Google Play Store markets, with the goal of recommending app profiles that can meet our business goals. 

We concluded that games are popular on both markets and form a large share of each. Although there are already many games on each market, the analysis showed strong average numbers of ratings and installs, leaving us with the impression that a well-made game could still be a strong option for a new app.

We also looked at the potential to gamify an app in the Health and Fitness or Productivity categories, as both have smaller shares of their respective markets and are less dominated by individual apps than other categories. Blending these practical categories with a gamified experience offers a promising route to success.