# 🚀 Mobile Apps Data Exploration

Mobile Apps are on the rise, the two major platforms that offer these apps are Google Play and Apple App Store. In Q3 2022 there were more than 3.5 milion apps on Google's platform and more than 1.6 million apps on Apple's platform [(Statista)](https://www.netsolutions.com/insights/launch-app-on-android-or-ios-app-store-first/). The competition is high and consequently software development companies need to understand what kind of an app has the highest chance to succeed. That is the main purpose of this exploration.

This analysis dives into a dataset on mobile apps of two major app marketplaces. We will investigate their user data to try to underpin what would be the best app genre to concentrate on if we want to maximse our chances of having a succesful and profitable app. The company is looking to launch an app with the following attributes:
1. Free App
2. English language only
3. Monetisation through in-app purchases

Having these three conditions we will try to narrow down the areas to explore by providing a recommendation on the 3 app genres that present the biggest opportunity.

The company plans to first rollout on Google Play to prove the concept amd take advantage of a larger userbase. Once the concept is proven they are planning to expand to Apple App Store. That's why data from the Google dataset will have a bigger weight when we'll make our recommendations later.

This project's goal is also to demonstrate an example of how we can use Python to analyze data to gain business insights and **help stakeholders make informed decisions**.

## Data sources

For our analysis we will use publicly available Apple Store and Google Store datasets from Kaggle. 

They were both collected in 2021 but the Google one has June's data whereas the Apple one has October's data. *This may cause some discrepancies due to seasonal popularity of one genre or the other but because we're going to be looking at the data at an aggregate level this effect shouldn't be that major. For a proper analysis, **data collected during the same period would be preferable.

[Link](https://www.kaggle.com/datasets/gauthamp10/apple-appstore-apps) to Apple Store data. 

[Link](https://www.kaggle.com/datasets/gauthamp10/google-playstore-apps) to Google Store data.

## Step 1: Exploring Data Sets
In this section we will create a function that opens datasets and transforms them into a list of lists, so that we can manipulate the data more easily.

After that, we will look at the structure of the datasets, the number of columns and rows and see a couple rows of data as examples.

In [6]:
from csv import reader 

# This function takes in the path to the dataset to analyse and returns the dataset in a form of a list of lists
def open_dataset(dataset_path):
    opened_file = open(dataset_path)
    read_file = reader(opened_file)
    data = list(read_file) #transforming the dataset into a list of lists
    opened_file.close() 
    return data 

In [7]:
apple_data = open_dataset("appleAppData.csv")
google_data = open_dataset("Google-Playstore.csv")

In [8]:
# This function takes 4 arguments, dataset to analyse, which rows to print out and a flag whether it contains 
# a header or not. Most of datasets have headers so the start_row argument has default value of 2 to skip printing
# the header.
# Function returns header (if applicable), if no header is presented it prints out first row, the number of rows 
# selected and finally the number of rows and columns in the dataset.
def explore_dataset(dataset, start_row = 2, end_row = 4, has_header = True):
    if has_header:
        print("Header:", dataset[0])
        rows = len(dataset)-1 # substracting header column from row count
    else:
        print("Header: No header")
        rows = len(dataset)
        
    print("Showing rows ", start_row, "to ", end_row, "\n")
    
    for row in dataset[start_row-1:end_row+1]:
        print(row, "\n") # iterating over each row and printing it out
    
    print("Number of rows = ", rows)
    print("Number of columns = ", len(dataset[0]))

In [9]:
explore_dataset(apple_data)

Header: ['App_Id', 'App_Name', 'AppStore_Url', 'Primary_Genre', 'Content_Rating', 'Size_Bytes', 'Required_IOS_Version', 'Released', 'Updated', 'Version', 'Price', 'Currency', 'Free', 'DeveloperId', 'Developer', 'Developer_Url', 'Developer_Website', 'Average_User_Rating', 'Reviews', 'Current_Version_Score', 'Current_Version_Reviews']
Showing rows  2 to  4 

['com.hkbu.arc.apaper', 'A+ Paper Guide', 'https://apps.apple.com/us/app/a-paper-guide/id1277517387?uo=4', 'Education', '4+', '21993472', '8.0', '2017-09-28T03:02:41Z', '2018-12-21T21:30:36Z', '1.1.2', '0.0', 'USD', 'True', '1375410542', 'HKBU ARC', 'https://apps.apple.com/us/developer/hkbu-arc/id1375410542?uo=4', '', '0.0', '0', '0.0', '0'] 

['com.dmitriev.abooks', 'A-Books', 'https://apps.apple.com/us/app/a-books/id1031572002?uo=4', 'Book', '4+', '13135872', '10.0', '2015-08-31T19:31:32Z', '2019-07-23T20:31:09Z', '1.3', '0.0', 'USD', 'True', '1031572001', 'Roman Dmitriev', 'https://apps.apple.com/us/developer/roman-dmitriev/id1031

In [10]:
explore_dataset(google_data)

Header: ['App Name', 'App Id', 'Category', 'Rating', 'Rating Count', 'Installs', 'Minimum Installs', 'Maximum Installs', 'Free', 'Price', 'Currency', 'Size', 'Minimum Android', 'Developer Id', 'Developer Website', 'Developer Email', 'Released', 'Last Updated', 'Content Rating', 'Privacy Policy', 'Ad Supported', 'In App Purchases', 'Editors Choice', 'Scraped Time']
Showing rows  2 to  4 

['Gakondo', 'com.ishakwe.gakondo', 'Adventure', '0.0', '0', '10+', '10', '15', 'True', '0', 'USD', '10M', '7.1 and up', 'Jean Confident Irénée NIYIZIBYOSE', 'https://beniyizibyose.tk/#/', 'jean21101999@gmail.com', 'Feb 26, 2020', 'Feb 26, 2020', 'Everyone', 'https://beniyizibyose.tk/projects/', 'False', 'False', 'False', '2021-06-15 20:19:35'] 

['Ampere Battery Info', 'com.webserveis.batteryinfo', 'Tools', '4.4', '64', '5,000+', '5000', '7662', 'True', '0', 'USD', '2.9M', '5.0 and up', 'Webserveis', 'https://webserveis.netlify.app/', 'webserveis@gmail.com', 'May 21, 2020', 'May 06, 2021', 'Everyone', 

## Step 2: Filtering the dataset
Both data sets contain big amounts of data, so in order to speed execution times we are going to apply filters to shrink the size down.

We know we need only free apps and English speaking apps so we can filter based on these two conditions right away since the rest isn't of interest for this company.

**Filtering free apps only**

In [24]:
# This function takes in 4 arguments, the dataset on which to perform the filter, the column that contains the value
# to filter by, the value used for filtering and finally what kind of a filter we want. "Include" will keep only rows
# that match the value whereas "Exclude" will keep all rows except those that contain that value.
# This function returns a filtered dataset (list of lists)
def filter_value(dataset, column_index, value, filter_type = "Include"):
    filtered_data = []
    if filter_type == "Include":
        for row in dataset:
            if row[column_index] == value:
                filtered_data.append(row)
        print(len(dataset)-len(filtered_data), "rows were filtered out.")
        return filtered_data
    elif filter_type == "Exclude":
        for row in dataset:
            if row[column_index] != value:
                filtered_data.append(row)
        print(len(dataset)-len(filtered_data), "rows were filtered out.")
        return filtered_data
    else:
        print('Filter_type not recognized. Only "Include" and "Exclude" are permitted.')
    
print("Google Play Store:")
google_data_free = filter_value(google_data, 8, "True")
print("\nApple App Store:")
apple_data_free = filter_value(apple_data, 12, "True")

Google Play Store:
45069 rows were filtered out.

Apple App Store:
102993 rows were filtered out.


**Filtering apps that are most likely English**

In [14]:
# Step 1: Checking whether we can use currency as an indicator
currencies_google = {}
for app in google_data:
    currency = app[10]
    if currency in currencies_google:
        currencies_google[currency] += 1
    else:
        currencies_google[currency] = 1
print("Google Play Store:\n", currencies_google, "\n")

currencies_apple = {}
for app in apple_data:
    currency = app[11]
    if currency in currencies_apple:
        currencies_apple[currency] += 1
    else:
        currencies_apple[currency] = 1
print("Apple App Store:\n", currencies_apple)

Google Play Store:
 {'Currency': 1, 'USD': 2311548, 'XXX': 1236, 'CAD': 2, 'EUR': 6, 'INR': 5, '': 135, 'VND': 1, 'GBP': 3, 'BRL': 1, 'KRW': 1, 'TRY': 1, 'RUB': 1, 'SGD': 1, 'AUD': 1, 'PKR': 1, 'ZAR': 1} 

Apple App Store:
 {'Currency': 1, 'USD': 1230376}


Currency doesn't seem to be a good indicator of the language of the app, therefore we will use the characters used in its name as a proxy. Essentially we will consider an app to be English if it doesn't contain 3 or more non-English characters. Ths approach has its limitations as many languages don't use any special characters that would distinguish them from English.

In [16]:
# Step 2: Writing a function that checks each app name
# This function takes in the name of the app as an argument and returns the number of non-English characters
# This way we can set the condition for how many non-English characters we want later and the function is more 
# versatile
def is_name_english(app_name):
    non_english_char_count = 0
    for character in app_name:
        if ord(character) > 127: 
            # Checking ASCII code of each character, most used English characters have a number < 127
            non_english_char_count += 1
    return non_english_char_count

# Check that it works as expected for the name '爱奇艺PPS -《欢乐颂2》电视剧热播', should return 13, 0 and 1
print(is_name_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_name_english('Instagram'))
print(is_name_english('Instagram 😜'))

13
0
1


In [25]:
# Step 3: Filtering all entries that contain app names with 3 or more non-English characters
# This function takes in three arguments, the data set to filter, the column index that contains app name,
# and the min number of non-english characters to classify app as a non-English app. A filtered data set (list) 
# is returned.
def filter_non_en(dataset, index, characters = 3):
    filtered_data = []
    for app in dataset:
        app_name = app[index]
        if is_name_english(app_name) < characters:
            filtered_data.append(app)
    print(len(dataset)-len(filtered_data), "rows were filtered out.")
    return filtered_data

print("Google Play Store:")
google_data_free_en = filter_non_en(google_data_free, 0)
print("Remaining rows:", len(google_data_free_en), "\n")


print("Apple App Store:")
apple_data_free_en = filter_non_en(apple_data_free, 1)
print("Remaining rows:", len(apple_data_free_en),"\n")

Google Play Store:
196218 rows were filtered out.
Remaining rows: 2071658 

Apple App Store:
27255 rows were filtered out.
Remaining rows: 1100129 



Our datasets are still quite robust and for the purposes of this analysis we are going to refine even further. We are only interested in apps that people engage with so we will take apps that have at least 1 review.

**Filtering apps that have at least 1 review**

In [26]:
# This function takes the datased and the index of the column that countains review count as arguments
# This function returns a filtered list
def filter_no_reviews(dataset, index):
    filtered_data = []
    for app in dataset:
        no_of_reviews = app[index]
        if no_of_reviews != '0' and no_of_reviews != '': # The Google data set contained null values as well
            filtered_data.append(app)
    return filtered_data

print("Google Play Store:")
google_data_free_en = filter_value(google_data_free_en, 4, "0", filter_type = "Exclude")
print("Remaining rows:", len(google_data_free_en), "\n")

print("Apple App Store:")
apple_data_free_en = filter_value(apple_data_free_en, -3, "0", filter_type = "Exclude")
print("Remaining rows:", len(apple_data_free_en), "\n")

Google Play Store:
967225 rows were filtered out.
Remaining rows: 1104433 

Apple App Store:
610704 rows were filtered out.
Remaining rows: 489425 



Now in the Google dataset we have moticed that sometimes when there are no reviews the field is "" instead of "0" so we will do one additional filter.

In [28]:
print("Google Play Store:")
google_data_free_en = filter_value(google_data_free_en, 4, "", filter_type = "Exclude")
print("Remaining rows:", len(google_data_free_en), "\n")

Google Play Store:
22011 rows were filtered out.
Remaining rows: 1082422 



Since we want to make money through in-app purchases we might be interested only in apps that offer this possibility. The Google Store data set offers this option so let's apply one final filter.

**Filter apps that don't allow in-app purchases from the Google Store dataset**

In [29]:
# filtered_data = []
# for app in google_data_free_en:
#     allows_in_app = app[-3]
#     if allows_in_app == "True": 
#         filtered_data.append(app)
# google_data_free_en = filtered_data

print("Google Play Store:")
google_data_free_en = filter_value(google_data_free_en, -3, "True", filter_type = "Include")
print("Remaining rows:", len(google_data_free_en), "\n")

print("Final filtered Apple dataset has: ", len(apple_data_free_en))
print("Final filtered Google dataset has: ",len(google_data_free_en))

Google Play Store:
939972 rows were filtered out.
Remaining rows: 142450 

Final filtered Apple dataset has:  489425
Final filtered Google dataset has:  142450


## Step 3: Cleaning the Data
Now that we have our final filtered datasets we can proceed to clean the data.

In this step we will try to clean the data from the most common issue. Removing **duplicate values that we will first identify using a function and then we will remove them.**

In [27]:
from tqdm import tqdm
from time import sleep
# This function takes three arguments, the dataset we want to find duplicates in and the column in which we are 
# interested and the number of examples we want to show. 
# By default this function shows one example of an app that is in the dataset multiple times.

def count_duplicates(dataset, column_index): 
    duplicates = []
    unique_apps = []
    for row in tqdm(dataset):
        value = row[column_index]
        if value in unique_apps:
            duplicates.append(value)
        else:
            unique_apps.append(value)
    if len(duplicates) == 0:
        print("There are no duplicates for chosen column.")
    else:
        print("There are ", len(duplicates), " duplicates in this dataset.")
        print("Here are some examples of duplicates:")
        print(duplicates[:5])
        
    return unique_apps

In [38]:
apple_unique_apps = count_duplicates(apple_data_free_en,0)

100%|██████████████████████████████████| 489425/489425 [59:05<00:00, 138.04it/s]

There are no duplicates for chosen column.





After removing apps with no reviews no duplicates were found in the data so additional step not needed.

## Step 4: Analyzing the data
After filtering and cleaning of the data we are going to analyse the datasets. The company wants to approach the launch by first launching on Google Play Store and then launching on Apple App Store. 

Essentially Google store will be used as a proof of concept where an MVP of the app will be developed and afterwards the app will be expanded to other markets.

Threfore we will look at what genres are popular on both markets but a bigger emphasis will be placed on Google's data.

In [60]:
# Step 1: Create a frequency table function
# This function takes in 2 arguments, the name of the dataset and the index of the column for which we want the 
# frequencies

def freq_table(dataset, index):
    table = {}
    length = len(dataset[1:]) # We need to omit the header from total length
    for row in dataset: # first we calculate the absolute number of occurences for each app
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    for app in table: # secondly we will calculate the share of each app from total
        table[app] = round(table[app]/length*100,1)
    return table

apple_freq_table = freq_table(apple_data_free_en, 3)
google_freq_table = freq_table(google_data_free_en, 2)

# Step 2: Display the created frequency table (most of the function code comes from dataquest). I have added two
# additional arguments to permit rounding and to display units like %, nothing is displayed by default.
def display_dictionary(table, rounding = True, round_by = 1, unit = ""):
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    if rounding == False:
        for entry in table_sorted:
            print(entry[1], ':', entry[0], unit)
    else:
        for entry in table_sorted:
            print(entry[1], ':', round(entry[0],round_by), unit)

print("App Genres on Apple store by count descending:\n")
display_dictionary(apple_freq_table, unit = "%")
print("\nApp Genres on Google store by count descending:\n")
display_dictionary(google_freq_table, unit = "%")

App Genres on Apple store by count descending:

Games : 23.3 %
Education : 9.0 %
Business : 7.8 %
Lifestyle : 6.9 %
Utilities : 6.2 %
Health & Fitness : 5.0 %
Entertainment : 5.0 %
Finance : 4.6 %
Productivity : 3.8 %
Food & Drink : 3.7 %
Shopping : 3.2 %
Travel : 2.7 %
Photo & Video : 2.7 %
Music : 2.7 %
Sports : 2.4 %
Social Networking : 2.3 %
Medical : 2.1 %
Reference : 1.6 %
News : 1.6 %
Book : 1.0 %
Navigation : 0.9 %
Stickers : 0.5 %
Weather : 0.4 %
Graphics & Design : 0.2 %
Magazines & Newspapers : 0.1 %
Developer Tools : 0.1 %

App Genres on Google store by count descending:

Games : 43.5 %
Education : 7.4 %
Tools : 6.4 %
Entertainment : 4.1 %
Books & Reference : 3.8 %
Sports : 3.4 %
Health & Fitness : 3.3 %
Productivity : 2.8 %
Lifestyle : 2.7 %
News & Magazines : 2.3 %
Music & Audio : 2.2 %
Travel & Local : 1.9 %
Personalization : 1.9 %
Social : 1.8 %
Photography : 1.4 %
Finance : 1.2 %
Business : 1.2 %
Weather : 1.1 %
Communication : 1.1 %
Dating : 0.9 %
Medical : 0.8 %
Maps

Comparing the results from the two markets, we can see that Google Play data has a different categorisation compared to the Apple data. At first glance it would seem that in Google store, Education apps are the most popular but if we dive deeper we can see that it is because the "Games" category is split into different genres. 

Considering this, games would still come up as the most frequent category of apps on the Google store.

Let's look at the ranking if we were to group all game genres together. [Documentation used for clasification](https://support.google.com/googleplay/android-developer/answer/9859673?hl=en#zippy=%2Cgames%2Capps). 
Sports category can be both an app for sports coverage but it can be also a game. For purposes of this analyses we are considering it to be the sports coverage app category as the share looks similar to what we see with Apple store data. More information is needed for a more accurate answer.


In [61]:
# Step 1: Writing down all genres that we can group under "Games"
game_genres = ["Puzzle", "Arcade", "Casual", "Simulation", "Action", "Adventure", "Role Playing", "Educational", 
               "Strategy", "Racing", "Word", "Casino", "Card", "Board", "Trivia"] 

# Step 2: Creating a modified data set with games category grouped (will facilitate calculations later)
google_data_free_en_grouped = google_data_free_en

for row in google_data_free_en_grouped:
    genre = row[2]
    if genre in game_genres:
        row[2] = "Games"

google_freq_grouped = freq_table(google_data_free_en_grouped, 2)

print("Grouped App Genres on Google store by count descending:\n")
display_dictionary(google_freq_grouped, unit = "%")

Grouped App Genres on Google store by count descending:

Games : 43.5 %
Education : 7.4 %
Tools : 6.4 %
Entertainment : 4.1 %
Books & Reference : 3.8 %
Sports : 3.4 %
Health & Fitness : 3.3 %
Productivity : 2.8 %
Lifestyle : 2.7 %
News & Magazines : 2.3 %
Music & Audio : 2.2 %
Travel & Local : 1.9 %
Personalization : 1.9 %
Social : 1.8 %
Photography : 1.4 %
Finance : 1.2 %
Business : 1.2 %
Weather : 1.1 %
Communication : 1.1 %
Dating : 0.9 %
Medical : 0.8 %
Maps & Navigation : 0.8 %
Video Players & Editors : 0.7 %
Food & Drink : 0.7 %
Auto & Vehicles : 0.5 %
Music : 0.4 %
Art & Design : 0.4 %
Shopping : 0.3 %
Parenting : 0.3 %
Comics : 0.3 %
Libraries & Demo : 0.2 %
House & Home : 0.2 %
Events : 0.1 %
Beauty : 0.1 %


Now if we look at the results we can clearly see that for both Google Play Store and Apple App Store, games are the most frequent genre of apps followed by educational apps.

While for the education category the differences between the two platforms are just 2pp for the games category the differences are significant. App store seems to have a more balanced distribution between the different categories whereas Google platform is mostly inondated with Gaming apps.

Now the sheer volume of apps isn't the only metric company should explore as too many apps in one category might signify that it will be hard to stand out against competition as the market is already over saturated. That's why we will now try to determine:
1. Which apps have the most installs on average (Google data only)
2. Which apps receive the most reviews on average
3. Which apps have the highest average rating

In [62]:
# Step 1: Changing data types for the number columns in both datasets to make latter calculations easier

for row in google_data_free_en_grouped:
    row[7] = int(row[7]) # Installs
    row[4] = int(row[4]) # No. of reviews
    row[3] = float(row[3]) # Rating
    
for row in apple_data_free_en:
    row[-3] = int(row[-3]) # No. of reviews
    row[-4] = float(row[-4]) # Rating

# Step 2: Create a function that calculates averages per metric like installs or rankings and then groups it by 
# selected column. This will enable us to reuse this function for the 3 use cases above.

# This app takes three arguments. First the dataset to analyze then the index of the column that contains the metric 
# we want to see average of. For example average number of installs. Then it takes the index of the column that we 
# want to group the average by like genre.
# App returns a dictionary where key is the group by column and value is the average for that column value.
def average_metric_per_column(dataset, metric_index, column_index):
    addition = {} # Dictionary to hold the sum of the metric index
    count = {} # Dictionary to hold the occurences per value in selected column
    table = {} # Dictionary with averages
    for row in dataset: # Iterating over dataset to get occurences and sum of metric value
        column_value = row[column_index]
        metric_value = row[metric_index]
        if column_value in count:
            count[column_value] += 1
            addition[column_value] += metric_value  # we have converted to floats and integers earlier, so no 
                                                    # conversion needed
        else:
            count[column_value] = 1
            addition[column_value] = metric_value
    for key in addition:
        table[key] = addition[key]/count[key]
    return table

**1. Which apps have the most installs on average (Google data only)**

In [63]:
# For the number of installs we will take the Max installs column to have a precise number, Installs column has only
# ranges.
google_installs = average_metric_per_column(google_data_free_en_grouped, 7,2)
print("Average number of installs on Google Store per genre, descending:\n")
display_dictionary(google_installs, True)

Average number of installs on Google Store per genre, descending:

Communication : 8954326.9 
Social : 7842230.5 
Video Players & Editors : 7140289.5 
Photography : 5633600.0 
Productivity : 5333618.6 
Music : 5089678.9 
Music & Audio : 2484965.9 
Games : 2435950.1 
Sports : 2165192.2 
Tools : 1962104.2 
Entertainment : 1768577.6 
Weather : 1644048.5 
Personalization : 1524009.3 
Beauty : 1416537.8 
Art & Design : 1291105.4 
Business : 1207974.9 
House & Home : 1145224.7 
Comics : 938966.2 
Shopping : 879034.6 
Dating : 798010.1 
Health & Fitness : 769244.1 
Maps & Navigation : 698487.9 
Lifestyle : 685614.0 
Libraries & Demo : 534469.1 
Parenting : 448761.8 
Travel & Local : 427189.5 
Education : 390540.4 
Auto & Vehicles : 360913.6 
Books & Reference : 346250.2 
Events : 345092.5 
Finance : 240606.0 
Food & Drink : 219024.3 
Medical : 180187.9 
News & Magazines : 147346.0 


Contrary to the number of apps, in terms of the number of installs "Games" category doesn't come on top. This shows that although there are many gaming apps on this platform, the market is probably oversaturated with these apps and therefore the number of installs per appp gets diluted.

The two categories that come on top are Communication and Social which contain mostly social networking apps so that isn't surprising.

Let's look at top 10 apps when it comes to the number of installs.

In [64]:
from operator import itemgetter # importing itemgetter for sorting purposes

i = 0
sorted_list = sorted(google_data_free_en_grouped, key=itemgetter(7), reverse = True)
while i<10:
    print(sorted_list[i][0], "-", sorted_list[i][2], ":", sorted_list[i][7])
    i += 1

Google Drive - Productivity : 7028265259
Facebook - Social : 6782619635
Messenger – Text and Video Chat for Free - Communication : 5054312355
Instagram - Social : 3559871277
Microsoft OneDrive - Productivity : 2056017889
Subway Surfers - Games : 1704495994
SHAREit - Transfer & Share - Tools : 1666016612
Microsoft Word: Write, Edit & Share Docs on the Go - Productivity : 1651577965
TikTok - Social : 1645811582
Snapchat - Social : 1621265491


**2. Which apps receive the most reviews on average**

In [65]:
google_reviews = average_metric_per_column(google_data_free_en_grouped, 4,2)
print("Average number of reviews on Google Store per genre, descending:\n")
display_dictionary(google_reviews)

i = 0
print("\nTop 10 most reviewed apps descending:\n")
sorted_list = sorted(google_data_free_en_grouped, key=itemgetter(4), reverse = True)
while i<10:
    print( sorted_list[i][0], "-", sorted_list[i][2], ":", sorted_list[i][4])
    i += 1

Average number of reviews on Google Store per genre, descending:

Social : 148416.1 
Communication : 119482.8 
Video Players & Editors : 88892.7 
Photography : 57469.5 
Music : 40003.1 
Sports : 38432.1 
Games : 35469.1 
Music & Audio : 26782.8 
Tools : 21436.1 
Personalization : 19627.3 
Art & Design : 17171.6 
Productivity : 17140.7 
Entertainment : 16465.7 
Shopping : 15770.5 
Weather : 14667.4 
Beauty : 11810.7 
Comics : 11333.4 
Health & Fitness : 9950.2 
Dating : 9558.0 
Business : 8576.4 
Lifestyle : 8115.7 
Maps & Navigation : 7257.1 
Parenting : 7222.0 
House & Home : 6997.3 
Books & Reference : 5248.7 
Education : 4692.3 
Finance : 4063.7 
Auto & Vehicles : 2828.4 
Medical : 2604.5 
Events : 2543.6 
Travel & Local : 2351.0 
Libraries & Demo : 2064.9 
News & Magazines : 1845.9 
Food & Drink : 1546.1 

Top 10 most reviewed apps descending:

Instagram - Social : 120206190
Facebook - Social : 117850066
Garena Free Fire - Rampage - Games : 89177097
Messenger – Text and Video Chat 

Looking at the ranking, we can see that it is very similar to the number of installs therefore, using the number of reviews for Apple Store as a proxy to deduce popular app categories, seems to be a sensible approach.

Now let's look at Apple Store.

In [66]:
apple_reviews = average_metric_per_column(apple_data_free_en, -3,3)
print("Average number of reviews on Apple Store per genre, descending:\n")
display_dictionary(apple_reviews)

# Now let's look at top 10 most reviewed apps so that we have a compariosn with Google store for specific apps

i = 0
print("\nTop 10 most reviewed apps descending:\n")
sorted_list = sorted(apple_data_free_en, key=itemgetter(-3), reverse = True)
while i<10:
    print( sorted_list[i][1], "-", sorted_list[i][3], ":", sorted_list[i][-3])
    i += 1

Average number of reviews on Apple Store per genre, descending:

Photo & Video : 5654.7 
Weather : 5461.7 
Shopping : 4629.3 
Finance : 4053.4 
Travel : 4006.9 
Music : 3894.2 
Food & Drink : 3187.3 
Social Networking : 2762.8 
Graphics & Design : 2596.9 
Book : 2364.5 
Developer Tools : 2360.9 
Navigation : 2349.9 
Games : 2227.1 
Entertainment : 2034.5 
News : 1746.4 
Reference : 1646.0 
Productivity : 1602.1 
Health & Fitness : 1205.3 
Lifestyle : 1139.0 
Utilities : 1118.6 
Sports : 1111.9 
Business : 703.3 
Education : 586.9 
Medical : 514.4 
Magazines & Newspapers : 123.5 
Stickers : 22.9 

Top 10 most reviewed apps descending:

YouTube: Watch, Listen, Stream - Photo & Video : 22685334
Instagram - Photo & Video : 21839585
Spotify New Music and Podcasts - Music : 18893225
Venmo - Finance : 12634191
DoorDash - Food Delivery - Food & Drink : 12517538
TikTok - Entertainment : 10598509
Lyft - Travel : 10241777
WhatsApp Messenger - Social Networking : 9090956
Pandora: Music & Podcasts 

Overal we can see that Apple Store has significantly smaller number of reviews on average. This might signify either less engaged users or simply less users overal. Another factor that can influence this is that the data was collected at two different time points and also with different collection methods.

Finally let's answer the final question, what category has the highest average rating.

**3. Which apps have the highest average rating**

In [67]:
google_rating = average_metric_per_column(google_data_free_en_grouped, 3,2)
print("Average app rating on Google Store per genre, descending:\n")
display_dictionary(google_rating)

Average app rating on Google Store per genre, descending:

Weather : 4.3 
Books & Reference : 4.2 
Education : 4.2 
Libraries & Demo : 4.2 
Finance : 4.1 
Music & Audio : 4.1 
Health & Fitness : 4.1 
Games : 4.1 
Shopping : 4.1 
Events : 4.1 
Parenting : 4.1 
Food & Drink : 4.0 
Medical : 4.0 
Music : 4.0 
Business : 4.0 
Productivity : 4.0 
Sports : 4.0 
Travel & Local : 4.0 
Lifestyle : 4.0 
Social : 4.0 
Personalization : 4.0 
Communication : 4.0 
Photography : 4.0 
Entertainment : 4.0 
Art & Design : 4.0 
Tools : 4.0 
Auto & Vehicles : 4.0 
Maps & Navigation : 3.9 
Video Players & Editors : 3.9 
Comics : 3.9 
Beauty : 3.9 
House & Home : 3.8 
News & Magazines : 3.7 
Dating : 3.7 


In [68]:
apple_rating = average_metric_per_column(apple_data_free_en, -4,3)
print("Average app rating on Apple Store per genre, descending:\n")
display_dictionary(apple_rating)

Average app rating on Apple Store per genre, descending:

Shopping : 4.4 
Graphics & Design : 4.3 
Developer Tools : 4.3 
Food & Drink : 4.3 
Lifestyle : 4.2 
Health & Fitness : 4.2 
Social Networking : 4.2 
Music : 4.2 
Magazines & Newspapers : 4.2 
Business : 4.1 
Sports : 4.1 
Reference : 4.1 
Book : 4.1 
Stickers : 4.1 
Medical : 4.1 
Finance : 4.1 
News : 4.1 
Education : 4.0 
Productivity : 4.0 
Games : 4.0 
Travel : 4.0 
Entertainment : 3.9 
Weather : 3.9 
Utilities : 3.8 
Photo & Video : 3.7 
Navigation : 3.7 


# Conclusion and Recommendations
So there are two approaches that I would recommend to an app developpment company.

## 1st Approach
Trying to get into a segment that is popular and has high engagement (has high number of installs/reviews) with  positive reviews.

The problem with this approach is that penetrating into this market might be more difficult as there are already apps that are very popular and of high quality based on the reviews and installs. Therefore the company would have to convince the users that their app can compete with what they are using currently.

## 2nd Approach
On the other hand, if we look into apps that are popular (have fairly high number of installs/reviews) but have the lower average ratings we might have a better chance of coming up with an app that will have the audience and will be willing to try something new as they aren't too satisfied with the current app.

## Final Recommendation

I would recommend the 2nd approach, so that's the one I'm going to develop below.

Comparing data from Google and Apple in terms of market saturation, the hardest markets to penetrate will be in gaming, education and utility/tools apps. These categories have the highest share of apps on these markets and therefore it will be hard to stand out in the crowd.

We need people to actually download our app and install it on their device so that we can convince to make an in-app purchase. Ideally we want to find a category that people install often but the competition (number of apps) isn't that big. And the average rating of the apps is towards the lower end so that we have a better chance of bringing improvement. Last thing we might want to consider are categories that have well-established apps with loyal users.

Here I would consider the following genres, Social, Communication, Video and Photography apps. These categories have usually a couple of well established apps (Facebook, Instagram...) and people don't go for new apps on the market that often. A similar case might be in the Music category.

If we wanted to create an objective way to choose the category we could give a weight to the different parameters and then score each category and take the one with the highest score. As we are first launching on Google Play Store we will start with that one and then simply verify if the recommendation makes sense for Apple App Store as well.

This is the weight we're going to attribute to the different factors:
1. Share of apps 30% (Category with the smallest number of apps gets the most points)
2. Number of Installs 60% (Category with the highest average number of installs gets the most points)
3. Average rating 10% (Category with the lowest average ranking gets the most points)

final_score = 30% * Number of Apps rank + 60% * Average number of installs + 10% * Average rating

We won't look at the number of reviews as those closely correlate with the number of installs. The number of installs gets the highest weight as it is the factor that will help us the most in establishing a big enough audience to market to. 

One thing to say is that the spread of rating (difference between highest and lowest rating) isn't very big, therefore this will not have that big of an impact and we will give it a lower weight equally.

In [92]:
sorted_google_apps = sorted(google_freq_grouped, key=google_freq_grouped.get, reverse = True)

sorted_google_installs = sorted(google_installs,key=google_installs.get, reverse = False) 
# we have sorted from smallest number of installs to highest number of installs so that the app with most installs 
# gets most points (highest index/ranking)

sorted_google_rating = sorted(google_rating,key=google_rating.get, reverse = True)

score_google = {}  # Dictionary that will hold the category name as the key and then a list of the three different ranks in
            # in this order: no. of apps, average no. of installs, average rating
    
for genre in sorted_google_apps:
    score_google[genre] = (list(sorted_google_apps).index(genre)+1)*0.3 + (list(sorted_google_installs).index(genre)+1)*0.6 + (list(sorted_google_rating).index(genre)+1)*0.1
    # We have pre sorted the different lists above, therefore their index equals to the rankings/score of the genres
    # We have added 1 to each index as otherwise the first one would be 0.
    # Afterwards we have just multiplied each factor bz corresponding weight defined above and then summed it together
    
print(display_dictionary(score_google))

Video Players & Editors : 29.0 
Communication : 28.3 
Music : 26.6 
Social : 26.0 
Beauty : 25.9 
Photography : 25.4 
House & Home : 23.3 
Art & Design : 22.6 
Comics : 22.2 
Productivity : 22.0 
Music & Audio : 20.7 
Weather : 19.3 
Personalization : 19.2 
Sports : 19.1 
Shopping : 18.9 
Tools : 18.5 
Dating : 18.4 
Entertainment : 18.0 
Business : 18.0 
Games : 17.3 
Maps & Navigation : 16.9 
Libraries & Demo : 16.6 
Parenting : 15.8 
Auto & Vehicles : 14.4 
Events : 13.9 
Lifestyle : 11.8 
Health & Fitness : 11.2 
Travel & Local : 10.8 
Food & Drink : 10.2 
Medical : 9.1 
Finance : 7.7 
News & Magazines : 6.9 
Education : 5.7 
Books & Reference : 5.3 
None


Unsurprisingly the top 4 genres to pursue are 

1. Video Players & Editors 
2. Communication 
3. Music 
4. Social

This is probably the reason why apps like Facebook, Instagram and Youtube generate such a big quantity of money. But as I mentioned above it will be very heard to beat these well-established apps.

Therefore, I would recommend looking into the following genres:
1. Beauty
2. House & Home
3. Art & Design
4. Comics
5. Productivity

I have on purpose omitted photography category, as here I'm afraid we might have apps such as Photoshop that will be very heard to develop and to beat in terms of popularity. Even with apps that have paid filters as Instagram now offers a vast range of filters for free.

At the end of the day, the company should also consider its expertise and actual capacity to build an app in a genre. A succesfull app in any genre requires a fairly original idea and good execution. So if a company has a good ideas in any of the genres above it might have a higher chance to gain larger profit from its launch.

Now, we know we want to launch the concept on Google Play Store but let's check whether we will get a similar result for the Apple data set. Here we will supplement installs for reviews as we don't have the information on the installs.

In [95]:
sorted_apple_apps = sorted(apple_freq_table, key=apple_freq_table.get, reverse = True)

sorted_apple_reviews = sorted(apple_reviews,key=apple_reviews.get, reverse = False) 

sorted_apple_rating = sorted(apple_rating,key=apple_rating.get, reverse = True)

score_apple = {}  # Dictionary that will hold the category name as the key and then a list of the three different ranks in
            # in this order: no. of apps, average no. of installs, average rating
    
for genre in sorted_apple_apps:
    score_apple[genre] = (list(sorted_apple_apps).index(genre)+1)*0.3 + (list(sorted_apple_reviews).index(genre)+1)*0.6 + (list(sorted_apple_rating).index(genre)+1)*0.1
    # We have pre sorted the different lists above, therefore their index equals to the rankings/score of the genres
    # We have added 1 to each index as otherwise the first one would be 0.
    # Afterwards we have just multiplied each factor bz corresponding weight defined above and then summed it together
    
print(display_dictionary(score_apple))

Weather : 24.2 
Photo & Video : 22.0 
Travel : 19.5 
Graphics & Design : 18.2 
Navigation : 17.9 
Shopping : 17.8 
Finance : 17.8 
Developer Tools : 17.7 
Book : 17.5 
Music : 17.0 
Social Networking : 16.9 
Food & Drink : 15.4 
News : 14.6 
Reference : 13.2 
Entertainment : 12.1 
Games : 10.7 
Productivity : 10.6 
Magazines & Newspapers : 9.6 
Sports : 9.2 
Stickers : 8.6 
Medical : 8.4 
Utilities : 8.1 
Health & Fitness : 7.8 
Lifestyle : 6.5 
Business : 4.9 
Education : 4.8 
None


As you can see for Apple App Store we see a different picture. Beauty and House and Home genres don't even exist in their list of genres but Design apps ranks fairly high on both of them. 

So a good design app might be something that will work on both platforms.

## Additional data
Let's take a final look at the data. We will look at the top 10 apps in terms of installs for the Beauty, House & Home and Art & Design genres.

In [105]:
i = 0

google_beauty = filter_value(google_data_free_en_grouped, 2, "Beauty", filter_type = "Include")
sorted_list = sorted(google_beauty, key=itemgetter(7), reverse = True)

print("\nTop 10 Google Apps based on number of installs Beauty category:\n")
while i<10:
    print( sorted_list[i][0], "-", sorted_list[i][2], ":", sorted_list[i][7])
    i += 1

142280 rows were filtered out.

Top 10 Google Apps based on number of installs Beauty category:

Perfect365: One-Tap Makeover - Beauty : 53849836
Mirror Plus: Mirror with Light for Makeup & Beauty - Beauty : 41505377
Beauty Makeup Editor: Beauty Camera, Photo Editor - Beauty : 32076828
DSLR HD Camera : 4K HD Camera Ultra Blur Effect - Beauty : 9026840
Crown Editor - Heart Filters for Pictures - Beauty : 7813579
Hairstyles step by step - Beauty : 7651901
YuFace: Makeup Camera, Makeover Face Editor Magic - Beauty : 7413964
Ice Queen - Dress Up & Makeup - Beauty : 7313753
camera for instagram filters & effects: IG filters - Beauty : 6484792
Mirror Camera  (Mirror + Selfie Camera) - Beauty : 5974006


In [106]:
i = 0

google_house = filter_value(google_data_free_en_grouped, 2, "House & Home", filter_type = "Include")
sorted_list = sorted(google_house, key=itemgetter(7), reverse = True)

print("\nTop 10 Google Apps based on number of installs House & Home category:\n")
while i<10:
    print( sorted_list[i][0], "-", sorted_list[i][2], ":", sorted_list[i][7])
    i += 1

142203 rows were filtered out.

Top 10 Google Apps based on number of installs House & Home category:

Universal TV Remote Control - House & Home : 96444714
SURE - Smart Home and TV Universal Remote - House & Home : 34855441
Planner 5D - Home & Interior Design Creator - House & Home : 30189982
Alfred Home Security Camera: Baby Monitor & Webcam - House & Home : 26557624
Home Design 3D - House & Home : 18456706
Room Planner: Home Interior & Floorplan Design 3D - House & Home : 10147915
Home Security Camera WardenCam - reuse old phones - House & Home : 9504176
Universal TV Remote - House & Home : 7551952
idealista - House & Home : 6923059
Smart TV's Remote Control - House & Home : 4841215


In [109]:
i = 0

google_design = filter_value(google_data_free_en_grouped, 2, "Art & Design", filter_type = "Include")
sorted_list = sorted(google_design, key=itemgetter(7), reverse = True)

print("\nTop 10 Google Apps based on number of installs Art & Design category:\n")
while i<10:
    print( sorted_list[i][0], "-", sorted_list[i][2], ":", sorted_list[i][7])
    i += 1

141812 rows were filtered out.

Top 10 Google Apps based on number of installs Art & Design category:

Canva: Graphic Design, Video Collage, Logo Maker - Art & Design : 134619454
ibis Paint X - Art & Design : 100950900
Flipaclip: Cartoon Animation Creator & Art Studio - Art & Design : 32068539
PaperColor - Art & Design : 27024527
MediBang Paint - Make Art ! - Art & Design : 25820756
U Launcher Lite-New 3D Launcher 2020, Hide apps - Art & Design : 20583655
Floor Plan Creator - Art & Design : 20093590
Photos Alive - Jellify - Art & Design : 19354476
Unfold — Story Maker & Instagram Template Editor - Art & Design : 17803439
Adobe Spark Post: Graphic Design & Story Templates - Art & Design : 16431111


Based on looking at the different apps a makeover app or a home planning app might perform particularly well.