# Profitable App Profiles for the Apple iOS App Store and Google Play Markets 

### Purpose of this Project
This project demonstrates my progress and competency in using Python and Jupyter to produce a professional data analysis report. 
The style of this document follows the principles of **literate programming**, a paradigm introduced by Donald Knuth in 1984 in which code is accompanied by natural-language explanation that annotates the code and discusses its outputs. This workflow suits data analysis well: code can be annotated and clarified in markdown cells, and results can be discussed directly in the same document. It is also a good fit for data visualization, since figures and graphs can be embedded alongside the code that produces them (though this project does not focus on that aspect of the data analysis process).

### Project Background
I am acting as a data analyst working for a company that builds Android and iOS mobile apps. The company only builds apps that are *free* to download and install and that target English-speaking audiences. Its business model therefore relies on in-app ads as the main source of revenue. Consequently, the revenue for any given app is driven primarily by the number of users of the app (and by how much those users engage with the ads within it).

>The main goal of this project is to gain an insight into what kinds of apps are most likely to attract the largest user base so that our developers can make more informed decisions on the directions they could take when building new applications.




### About the Datasets
As of September 2018, roughly 2 million iOS apps were available on the App Store and 2.1 million on Google Play. These figures are now outdated, and we can expect an even larger number of apps on both marketplaces today.

![img](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png) 

Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)

Analyzing more than 4 million records would require a significant amount of time and money, so we analyze a sample of the data instead. To avoid spending resources on data collection, we use relevant existing data that is available at no cost. The two datasets below are suitable for initial exploration:

>* Android app [Dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) from Google Play containing ~10,800 apps, collected in August 2018

>| Column name      | Description                              |
>| ---------------- | ---------------------------------------- |
>| App              | Application name                         |
>| Category         | Category app belongs to                  |
>| Rating           | Overall user rating of the app           |
>| Reviews          | Number of user reviews for the app       |
>| Size             | Size of the app (MB)                     |
>| Installs         | Number of user downloads/installs        |
>| Type             | Paid or Free                             |
>| Price            | Price of app                             |
>| Content Rating   | Age group that the app targets           |
>| Genres           | A more detailed category breakdown       |
>| Last Updated     | Date of last update                      |
>| Current Ver      | Current app version                      |
>| Android Ver      | Oldest compatible android OS version     |
---


>* iOS app [Dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) from the App Store containing ~7,200 apps, collected in July 2017

>| Column name      | Description                              |
>| ---------------- | ---------------------------------------- |
>| id               | App ID                                   |
>| track_name       | App Name                                 |
>| size_bytes       | Size (in Bytes)                          |
>| currency         | Currency Type                            |
>| price            | Price Amount                             |
>| rating_count_tot | User rating count (All versions)         |
>| rating_count_ver | User rating count (Current version)      |
>| user_rating      | Average User Rating (All versions)       |
>| user_rating_ver  | Average User Rating (Current version)    |
>| ver              | Latest version code                      |
>| cont_rating      | Content Rating                           |
>| prime_genre      | Primary Genre                            |
>| sup_devices.num  | Number of supporting devices             |
>| ipadSc_urls.num  | Number of screenshots showed for display |
>| lang.num         | Number of supported languages            |
>| vpp_lic          | Vpp Device Based Licensing Enabled       |

*This data is considerably out of date for the current year (2025), so any conclusions drawn may only be relevant to the app landscape of 2017-2018, when the datasets were collected.*

## Opening, Loading and Exploring the Datasets
Each dataset is opened and assigned to its own file object variable (`opened_apple_file` and `opened_google_file`). The `reader` function, imported from the `csv` module of the Python Standard Library, then reads each opened file object so that its contents can be converted into a nested list (a list of lists) for analysis.

A **context manager** is used to handle the file operations using the `with` and `as` keywords to ensure the files are opened and closed as intended (this helps to prevent resource leaks). These leaks can occur if a program doesn't release resources (such as opened files) properly which may lead to performance issues or crashes.

In [1]:
from csv import reader

with open("AppleStore.csv", encoding="utf8") as opened_apple_file:
  read_apple_file = reader(opened_apple_file)
  apple_data = list(read_apple_file)
  
apple_header = apple_data[0]
ios_app_data = apple_data[1:]

with open("googleplaystore.csv", encoding="utf8") as opened_google_file:
  read_google_file = reader(opened_google_file)
  google_data = list(read_google_file)
  
google_header = google_data[0]
android_app_data = google_data[1:]
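The two `with` blocks above follow the same open, read, split-off-the-header pattern, so they could be folded into one helper. The sketch below is only an illustration: `load_csv` is a hypothetical name that does not appear elsewhere in this notebook, and it is exercised on a small temporary file rather than on the real datasets.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # newline="" is what the csv module documentation recommends for files passed to csv.reader
    with open(path, encoding=encoding, newline="") as opened_file:
        rows = list(csv.reader(opened_file))
    return rows[0], rows[1:]


# Demonstration on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as demo_file:
        csv.writer(demo_file).writerows([["App", "Price"], ["Demo app", "0"]])
    demo_header, demo_rows = load_csv(demo_path)

print(demo_header)
print(demo_rows)
```

With such a helper, each dataset would load in a single line, and the file is still closed by the context manager inside the function.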

To make it simple to explore each dataset, we first define a function `explore_data()` which can be used repeatedly to print rows in a digestible manner. The function takes any dataset as a nested list alongside a start row and an end row, and prints each row in the slice between them. The `show_rows_and_columns` parameter gives us the option to also print the number of rows and columns in the dataset passed to it. The `header` parameter indicates whether the dataset includes a header row, so that the header is excluded from the row count.

In [2]:
def explore_data(dataset, start, end, show_rows_and_columns=False, header=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # print() appends its own newline, so this leaves two blank lines after each row

    if show_rows_and_columns and header:
        print('Number of rows:', len(dataset) - 1) # if dataset contains header, remove one from row count
        print('Number of columns:', len(dataset[0]))
        
    elif show_rows_and_columns and not header:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
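Because `explore_data()` only prints, the row and column counts cannot be reused later in the analysis. As a minimal sketch, a companion function could return them instead; `dataset_shape` is a hypothetical name introduced purely for illustration, and the toy dataset is made up.

```python
def dataset_shape(dataset, header=False):
    """Return (number of data rows, number of columns) for a list of lists."""
    # if the dataset includes a header row, exclude it from the row count
    n_rows = max(len(dataset) - 1, 0) if header else len(dataset)
    # guard against an empty dataset, where dataset[0] would raise IndexError
    n_columns = len(dataset[0]) if dataset else 0
    return n_rows, n_columns


toy_dataset = [["name", "price"], ["a", "0"], ["b", "1.99"]]
print(dataset_shape(toy_dataset, header=True))
print(dataset_shape(toy_dataset[1:]))
```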

In [3]:
explore_data(apple_data, 0, 4, show_rows_and_columns=True, header=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
explore_data(google_data, 0, 4, show_rows_and_columns=True, header=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


## Cleaning the datasets

### Removal of inaccurate datapoints
We begin by cleaning the datasets to ensure all records contain all of the relevant data for the analysis, removing any records that contain missing/incorrect data. 

The iOS dataset is free of duplicate entries (verified below), and there are no known issues with inaccurate data according to users of the dataset on the original source.

In [5]:
unique_ids = []
duplicate_ids = []

for app in ios_app_data:
  app_id = app[0]
  if app_id not in unique_ids:
    unique_ids.append(app_id)
  else:
    duplicate_ids.append(app_id)
    
print("Number of unique apps: " + str(len(unique_ids)))
print("\n")
print("Number of duplicate apps: " + str(len(duplicate_ids)))
  

Number of unique apps: 7197


Number of duplicate apps: 0
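The check above tests membership in a list, which scans the whole list for every app. For a few thousand rows that is fine, but a `set` gives the same answer with constant-time lookups on average. The sketch below uses made-up IDs, not the real data.

```python
toy_ids = ["101", "102", "101", "103", "102", "101"]  # made-up IDs for illustration

seen_ids = set()     # set membership checks do not scan every element
duplicate_ids = []

for app_id in toy_ids:
    if app_id in seen_ids:
        duplicate_ids.append(app_id)
    else:
        seen_ids.add(app_id)

print("Number of unique apps:", len(seen_ids))
print("Number of duplicate apps:", len(duplicate_ids))
```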


The Google Play dataset, however, has some known issues and duplicate entries, which are addressed below.

In [6]:
explore_data(google_data, 10473, 10474)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




In [7]:
explore_data(google_data, 9149, 9150)

['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']




The row at index 10473 in the raw Google Play dataset: ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] is missing its category, so every subsequent value is shifted one column to the left (the row has 12 fields instead of 13). Rather than guessing what the correct category should be, it is safer to delete this record.

The row at index 9149 in the raw Google Play dataset: ['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device'] has a missing rating and a missing type (both recorded as NaN, "not a number"). Again, it is safer to delete this record.

*These are known issues with the dataset (found in the Discussion tab from the original data source on Kaggle).*
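The misaligned row was found through the Kaggle discussion, but this kind of problem can also be detected programmatically by comparing each row's length with the header's. The sketch below is an illustration on made-up rows; `find_misaligned_rows` is a hypothetical helper, not part of the original analysis. Note that a length check would not catch the second record, which has the right number of fields but contains `'NaN'` placeholders.

```python
def find_misaligned_rows(dataset_with_header):
    """Return (index, row) pairs for rows whose length differs from the header's."""
    expected_length = len(dataset_with_header[0])
    return [
        (index, row)
        for index, row in enumerate(dataset_with_header)
        if len(row) != expected_length
    ]


toy_data = [
    ["App", "Category", "Rating"],
    ["Good app", "TOOLS", "4.1"],
    ["Broken app", "1.9"],  # category missing, so the row is one field short
]
misaligned = find_misaligned_rows(toy_data)
print(misaligned)
```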

In [8]:
deleted_google_data = [google_data.pop(10473), google_data.pop(9149)]

deleted_google_data

[['Life Made WI-Fi Touchscreen Photo Frame',
  '1.9',
  '19',
  '3.0M',
  '1,000+',
  'Free',
  '0',
  'Everyone',
  '',
  'February 11, 2018',
  '1.0.19',
  '4.0 and up'],
 ['Command & Conquer: Rivals',
  'FAMILY',
  'NaN',
  '0',
  'Varies with device',
  '0',
  'NaN',
  '0',
  'Everyone 10+',
  'Strategy',
  'June 28, 2018',
  'Varies with device',
  'Varies with device']]
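The order of the two `pop()` calls matters: removing the row with the larger index first leaves the smaller index pointing at the intended row. A minimal sketch on a toy list shows what would go wrong in the opposite order.

```python
letters = ["a", "b", "c", "d", "e"]
targets = [1, 3]  # we want to remove "b" and "d"

# Popping the larger index first leaves the smaller index untouched.
safe = letters.copy()
removed_safe = [safe.pop(i) for i in sorted(targets, reverse=True)]

# Popping the smaller index first shifts every later element left by one,
# so index 3 now points at "e" instead of "d".
unsafe = letters.copy()
removed_unsafe = [unsafe.pop(i) for i in sorted(targets)]

print(removed_safe)    # ['d', 'b']
print(removed_unsafe)  # ['b', 'e']
```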

We confirm the removal of the two records by checking which rows now occupy the indexes that previously held the inaccurate data:

In [9]:
explore_data(google_data, 10473, 10474)

['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']




In [10]:
explore_data(google_data, 9149, 9150)

['Star Wars™: Galaxy of Heroes', 'FAMILY', '4.5', '1461698', '67M', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Role Playing', 'May 21, 2018', '0.12.334385', '4.1 and up']




Running the `explore_data()` function on the googleplay dataset below confirms the deletion of the two records (the dataset now contains 10,839 records with the deleted records stored in the `deleted_google_data` variable).

In [11]:
explore_data(google_data, 0, 4, show_rows_and_columns=True, header=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10839
Number of columns: 13


### Handling duplicate datapoints
Additional known issues exist in the Google Play dataset (mentioned in the discussion section of the original source) - specifically the occurrence of multiple entries for the same app. 

For instance, there are four total entries for the Instagram app, likely because the data was scraped on four separate occasions. 

*Since the only difference between these entries is the total number of reviews, we can remove the entries with **fewer** reviews. Keeping the entry with the greatest number of reviews gives the highest likelihood that we retain the most recently collected datapoint for the app (which is preferable to deleting duplicate entries at random).*

In [12]:
for app in google_data[1:]: # google_data reflects the deletions above; android_app_data was sliced before them
  name = app[0]
  if name == "Instagram":
    print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


To get a full understanding of the state of the dataset, we create a list of all unique apps and a dictionary of all apps that have more than one entry in the dataset (with a value indicating how many extra copies exist). 

The code below loops over all rows in the Google Play dataset, assigning the app name to the `name` variable. If the app name is not yet in the `unique_apps` list, it is added to that list. If the name is already in `unique_apps` but not yet in the `duplicate_apps` dictionary, it is added to the dictionary with a value of 1 (for the first repeated entry). If the name is already present in the dictionary, 1 is added to its stored value, giving the total number of extra entries for that app.

The number of unique apps in the dataset can be found by simply returning the length of the `unique_apps` list. The number of unique apps with extra entries is found by returning the length of the `duplicate_apps` dictionary, with the total number of duplicates to be removed calculated by taking the sum of all values within the dictionary.

In [13]:
duplicate_apps = {}
unique_apps = []

for app in google_data[1:]:
  name = app[0]
  
  if name not in unique_apps: 
    unique_apps.append(name)
  else:
    if name not in duplicate_apps:
      duplicate_apps[name] = 1
    else:
      duplicate_apps[name] += 1
    
print("Number of unique apps: " + str(len(unique_apps)))
print("\n")
print("Number of apps with duplicate entries: " + str(len(duplicate_apps)))
print("\n")
print("Number of duplicates to remove: " + str(sum(duplicate_apps.values())))

Number of unique apps: 9658


Number of apps with duplicate entries: 798


Number of duplicates to remove: 1181
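The same three figures can be obtained more compactly with `collections.Counter` from the standard library. The sketch below is an alternative formulation on a made-up list of names, not the code used to produce the output above.

```python
from collections import Counter

toy_names = ["Instagram", "Slack", "Instagram", "Instagram", "Slack", "Zoom"]  # made-up sample

name_counts = Counter(toy_names)  # maps each name to its number of entries
# every entry beyond the first counts as an extra copy
extra_entries = {name: n - 1 for name, n in name_counts.items() if n > 1}

print("Number of unique apps:", len(name_counts))
print("Number of apps with duplicate entries:", len(extra_entries))
print("Number of duplicates to remove:", sum(extra_entries.values()))
```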


Based on our criterion for removing duplicate entries (*removing all duplicates with fewer total reviews*), we can create a dictionary `max_reviews` that stores each app name as a key and that app's highest number of reviews as the value. The length of this dictionary matches the number of unique apps found above.

In [14]:
max_reviews = {}

for app in google_data[1:]:
  name = app[0]
  n_reviews = float(app[3])
  
  if name in max_reviews and max_reviews[name] < n_reviews:
    max_reviews[name] = n_reviews
  elif name not in max_reviews:
    max_reviews[name] = n_reviews
    
print(len(max_reviews))

9658


We can use this `max_reviews` dictionary to create a cleaned dataset where all duplicate entries have been removed to leave only the entries with the greatest number of user reviews. 

We create two empty lists: `android_data_cleaned` to store the app data with no duplicates, and `already_added` to track the app names that have already been added to the cleaned dataset. For each app in the original `google_data` dataset (with the header omitted), we assign the name of the app to `name` and its number of reviews to `n_reviews`. If the app is not present in the `already_added` list AND its number of reviews equals the maximum for that app (stored in the `max_reviews` dictionary created above), the row is appended to `android_data_cleaned` and the app name is appended to `already_added`. The second list is needed because several entries for the same app can share the maximum number of reviews; without it, each of them would be kept.

We then verify that the expected number of datapoints is present in the final cleaned list (9,658).

In [15]:
android_data_cleaned = []
already_added = []

for app in google_data[1:]:
  name = app[0]
  n_reviews = float(app[3])
  
  if name not in already_added and float(max_reviews[name]) == n_reviews:
    android_data_cleaned.append(app)
    already_added.append(name)
    
print("Number of apps after cleaning: " + str(len(android_data_cleaned)))
print("\n")
print(f"First 3 rows of cleaned dataset: \n {android_data_cleaned[:3]}")

Number of apps after cleaning: 9658


First 3 rows of cleaned dataset: 
 [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]
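The two passes above (building `max_reviews`, then filtering) could also be combined into a single pass that stores the best row seen so far for each app. This is a sketch under stated assumptions rather than the method used in this notebook: `keep_most_reviewed` is a hypothetical helper, the rows are made up, and the output is ordered by the first appearance of each app name, which may differ from the row order of `android_data_cleaned`.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # strict ">" keeps the first row seen when review counts tie
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())


toy_rows = [
    ["Instagram", "SOCIAL", "4.5", "100"],
    ["Maps", "TRAVEL", "4.3", "50"],
    ["Instagram", "SOCIAL", "4.5", "120"],
    ["Instagram", "SOCIAL", "4.5", "100"],
]
deduplicated = keep_most_reviewed(toy_rows)
print(deduplicated)
```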


### Removal of Non-English apps
As we are only interested in apps designed for an English-speaking audience, we can remove any apps whose names indicate they are not aimed at English speakers. Both datasets contain apps that meet this criterion.

To do this, we can use the built-in `ord()` function, which returns the Unicode code point of a character. The characters most commonly used in English text have code points in the range 0-127, as defined by ASCII (American Standard Code for Information Interchange). Based on this range, we can define a function that checks whether every character in a string belongs to the set of common English characters.

In [16]:
def is_common_english(string):
  for char in string:
    if ord(char) > 127:
      return False
  return True

print(is_common_english("Instagram"))
print(is_common_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(is_common_english("Docs To Go™ Free Office Suite"))
print(is_common_english("Instachat 😜"))

True
False
False
False
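To see why the last two names are rejected, it helps to look at the code points of the offending characters. The short sketch below prints them, and also shows `str.isascii()` (available from Python 3.7), which performs the same all-characters-below-128 test as the first version of `is_common_english()`.

```python
sample_chars = ["a", "™", "😜", "爱"]

# map each character to its Unicode code point
code_points = {char: ord(char) for char in sample_chars}

for char, code_point in code_points.items():
    # anything above 127 falls outside the ASCII range
    print(repr(char), code_point, code_point > 127)

print("Instagram".isascii())
print("Docs To Go™ Free Office Suite".isascii())
```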


To improve the function, we relax the check so that it returns False only when more than 3 characters fall outside the ASCII range. This reduces the likelihood of wrongly identifying English apps as non-English. The filter is still not perfect: English apps with more than 3 non-ASCII characters are wrongly rejected, and non-English apps with 3 or fewer non-ASCII characters are wrongly accepted. 

For our purposes, this is an acceptable level of error (since very few English apps are likely to have more than 3 non-ASCII characters).

A count variable is initialised to 0 before the loop and, for every character in the input string that falls outside the allowed ASCII range, 1 is added to the count. If the final count is greater than 3, the string is deemed non-English; otherwise, it is deemed English.

In [17]:
def is_common_english(string):
  count = 0
  
  for char in string:
    if ord(char) > 127:
      count += 1
  
  if count > 3:
    return False
  else:
    return True

print(is_common_english("Instagram"))
print(is_common_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(is_common_english("Docs To Go™ Free Office Suite"))
print(is_common_english("Instachat 😜"))

True
False
True
True
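The threshold of 3 is a judgment call, so it can be convenient to expose it as a parameter. The sketch below is a hypothetical variant (`is_mostly_ascii` and `max_non_ascii` are illustrative names, not part of the notebook) that applies the same counting rule.

```python
def is_mostly_ascii(string, max_non_ascii=3):
    """Return True if at most `max_non_ascii` characters fall outside the ASCII range."""
    non_ascii_count = sum(1 for char in string if ord(char) > 127)
    return non_ascii_count <= max_non_ascii


print(is_mostly_ascii("Instagram"))
print(is_mostly_ascii("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(is_mostly_ascii("Docs To Go™ Free Office Suite"))
print(is_mostly_ascii("Instachat 😜", max_non_ascii=0))  # a stricter setting rejects the emoji
```

With the default threshold, this gives the same results as `is_common_english()` on the four test names above.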


The `is_common_english()` function is used to filter the App Store and Google Play datasets, creating new nested lists for both datasets (`english_apple_data_cleaned` and `english_android_data_cleaned`) which contain only apps designed for English-speaking audiences. 

This is done by looping over the apps in both datasets and appending each app's row to the list of English apps if `is_common_english()` returns `True` for the name of the app. If `is_common_english()` returns `False`, the app name is appended to the corresponding list of non-English app names. A sample of each list of non-English names is printed to help verify that the identified apps are indeed not designed for English speakers.

In [18]:
english_apple_data_cleaned = []
non_english_apple_apps = []

english_android_data_cleaned = []
non_english_android_apps = []

for app in apple_data[1:]:
  name = app[1]
  if is_common_english(name):
    english_apple_data_cleaned.append(app)
  else:
    non_english_apple_apps.append(name)
    
for app in android_data_cleaned:
  name = app[0]
  if is_common_english(name):
    english_android_data_cleaned.append(app)
  else:
    non_english_android_apps.append(name)
  
line_break = "-" * 515  
 
print("Number of English apps in App Store dataset: " + str(len(english_apple_data_cleaned)))
print("\n")
print("Number of non-English apps removed from the App Store dataset: " + str(len(non_english_apple_apps)))
print("\n")
print("Sample of non-English apps from the App Store dataset: " + str(non_english_apple_apps[:15]))
print("\n")
print(line_break)
print("\n")
print("Number of English apps in Google Play dataset: " + str(len(english_android_data_cleaned)))
print("\n")
print("Number of non-English apps removed from the cleaned Google Play dataset: " + str(len(non_english_android_apps)))
print("\n")
print("Sample of non-English apps from the Google Play dataset: " + str(non_english_android_apps[:15]))



Number of English apps in App Store dataset: 6183


Number of non-English apps removed from the App Store dataset: 1014


Sample of non-English apps from the App Store dataset: ['爱奇艺PPS -《欢乐颂2》电视剧热播', '聚力视频HD-人民的名义,跨界歌王全网热播', '优酷视频', '网易新闻 - 精选好内容，算出你的兴趣', '淘宝 - 随时随地，想淘就淘', '搜狐视频HD-欢乐颂2 全网首播', '阴阳师-全区互通现世集结', '百度贴吧-全球最大兴趣交友社区', '百度网盘', '爱奇艺HD -《欢乐颂2》电视剧热播', '乐视视频HD-白鹿原,欢乐颂,奔跑吧全网热播', '万年历-值得信赖的日历黄历查询工具', '新浪新闻-阅读最新时事热门头条资讯视频', '喜马拉雅FM（听书社区）电台有声小说相声英语', '央视影音-海量央视内容高清直播']


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Numbe

In [19]:
line_break = "-" * 195

print("First 3 rows of the cleaned App Store dataset: \n")
explore_data(english_apple_data_cleaned, 0, 3, show_rows_and_columns=True, header=False)
print(line_break)
print("First 3 rows of the cleaned Google Play dataset: \n")
explore_data(english_android_data_cleaned, 0, 3, show_rows_and_columns=True, header=False)

First 3 rows of the cleaned App Store dataset: 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
First 3 rows of the cleaned Google Play dataset: 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, H

This cleaning process has left us with 6,183 unique App Store apps and 9,613 unique Android apps. 

### Isolating Free Apps
As mentioned in the introduction, the company is only interested in building apps that are *free* to download and install, so we need to filter the cleaned data so that only free apps remain for our analysis.

To do this, we initialise two more lists to hold the free English iOS apps and the free English Android apps, looping through the `english_apple_data_cleaned` and `english_android_data_cleaned` datasets to extract the free apps into each list. Paid apps are collected in separate lists so that the split can be checked.

In [20]:
free_english_apple_data_cleaned = []
paid_english_apple_data_cleaned = []

for app in english_apple_data_cleaned:
  price = app[4]
  if price == "0.0":
    free_english_apple_data_cleaned.append(app)
  else:
    paid_english_apple_data_cleaned.append(app)
    
free_english_android_data_cleaned = []
paid_english_android_data_cleaned = []

for app in english_android_data_cleaned:
  price = app[7]
  if price == "0":
    free_english_android_data_cleaned.append(app)
  else:
    paid_english_android_data_cleaned.append(app)
    
print("Number of free English apps in the App Store dataset: " + str(len(free_english_apple_data_cleaned)))
print("\n")
print("Number of paid English apps in the App Store dataset: " + str(len(paid_english_apple_data_cleaned)))
print("\n")
print("Number of free English apps in the Google Play dataset: " + str(len(free_english_android_data_cleaned)))
print("\n")
print("Number of paid English apps in the Google Play dataset: " + str(len(paid_english_android_data_cleaned)))

Number of free English apps in the App Store dataset: 3222


Number of paid English apps in the App Store dataset: 2961


Number of free English apps in the Google Play dataset: 8863


Number of paid English apps in the Google Play dataset: 750


In [21]:
line_break = "-" * 200

explore_data(free_english_apple_data_cleaned, 0, 3, show_rows_and_columns=True, header=False)
print(line_break)
explore_data(free_english_android_data_cleaned, 0, 3, show_rows_and_columns=True, header=False)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Ar

After filtering to ensure all of the apps are free, we are left with 3,222 iOS apps and 8,863 Android apps for our analysis.
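The filter above compares price strings exactly (`"0.0"` for the App Store data and `"0"` for Google Play), which works for these two files but would miss other spellings of a zero price. As a more defensive sketch, the price could be parsed as a number first; `is_free` is a hypothetical helper, and the assumption that prices may carry a leading `$` is mine rather than something shown in the rows above.

```python
def is_free(price_text):
    """Return True if a price string represents zero (hypothetical helper)."""
    cleaned = price_text.strip().lstrip("$")  # assumption: prices may carry a leading "$"
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # unparseable prices are conservatively treated as not free
        return False


for price in ["0", "0.0", "$0.00", "$4.99", "Everyone"]:
    print(repr(price), is_free(price))
```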

## Analysis: Profitable App Profiles

To minimize risk and overhead, the company suggests a **validation strategy** for an app idea:

>1. Build a minimal Android version of the app and add it to the Google Play market.
>
>2. If the app receives a good response from users, further develop the app.
>
>3. If the app is profitable after the first 6-month period, build an iOS version of the app and add it to the Apple App Store.


Since the end goal is to add apps to both marketplaces, we need to identify app profiles that are likely to be successful in both markets. 

#### We can begin our analysis by determining the most common genres for each marketplace.

We first define a `freq_table()` function with parameters `dataset` (a list of lists), `column` (the column index) and a keyword argument `percentage`, which is set to False by default. The function generates a dictionary whose keys are the unique values found in the specified column. The value stored for each key is an integer giving the number of occurrences of that key in the column (*i.e.,* the frequency). 

If `percentage` is set to True, the frequency values are converted into percentage strings (rounded to two decimal places, with a `%` suffix).


In [22]:
def freq_table(dataset, column, percentage=False):
  frequency_table = {}
  for row in dataset:
    value = row[column]
    if value in frequency_table:
      frequency_table[value] += 1
    else:
      frequency_table[value] = 1
      
  if percentage:
    total_frequency = sum(frequency_table.values())
    for key in frequency_table:
      frequency_table[key] = str(round((frequency_table[key] / total_frequency) * 100, 2)) + "%"
      
  return frequency_table

freq_table(free_english_apple_data_cleaned, column=11, percentage=True) 
# generates a frequency table for the prime_genre column in the App Store dataset, listing the results as a percentage of the total number of apps

{'Social Networking': '3.29%',
 'Photo & Video': '4.97%',
 'Games': '58.16%',
 'Music': '2.05%',
 'Reference': '0.56%',
 'Health & Fitness': '2.02%',
 'Weather': '0.87%',
 'Utilities': '2.51%',
 'Travel': '1.24%',
 'Shopping': '2.61%',
 'News': '1.33%',
 'Navigation': '0.19%',
 'Lifestyle': '1.58%',
 'Entertainment': '7.88%',
 'Food & Drink': '0.81%',
 'Sports': '2.14%',
 'Book': '0.43%',
 'Finance': '1.12%',
 'Education': '3.66%',
 'Productivity': '1.74%',
 'Business': '0.53%',
 'Catalogs': '0.12%',
 'Medical': '0.19%'}
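Storing the percentages as strings makes them easy to read but awkward to sort or compare. As an alternative sketch, the table could keep numeric percentages and leave the `%` formatting to display time. `percentage_table` is a hypothetical name, and the rows below are made up.

```python
from collections import Counter


def percentage_table(dataset, column):
    """Return {value: percentage of rows} with percentages kept as floats."""
    counts = Counter(row[column] for row in dataset)
    total = sum(counts.values())
    return {value: round(count / total * 100, 2) for value, count in counts.items()}


toy_rows = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
toy_percentages = percentage_table(toy_rows, column=1)
print(toy_percentages)  # {'Games': 75.0, 'Music': 25.0}
```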

To improve the clarity of the `freq_table()` output, we define another function `display_table()` to display the same results in descending order. A dictionary preserves insertion order but cannot be sorted in place, so we convert the data into a structure that can be sorted.

To do this, we first assign to the `table` variable the frequency table produced for a given dataset, column and percentage setting (True/False), and initialise an empty `table_display` list. We then write a for loop to iterate over the keys of the `table` dictionary, creating a `(value, key)` tuple for every key:value pair. These tuples are appended to the `table_display` list. 

If the percentage parameter is False, the `table_display` list is sorted using the `sorted()` function. Because each element is a tuple of the form `(value, key)`, the elements are compared by value first, and setting the `reverse` parameter to True orders them from the largest value to the smallest. We then print `key : value` for each entry in the `table_sorted` variable to give the desired result.

The same logic applies if the percentage parameter is True, except that the values in the `table` dictionary must first have the `%` character removed and be converted to floats so that they can be sorted numerically (sorted as strings, `'7.88%'` would rank above `'58.16%'`).

In [23]:
def display_table(dataset, column, percentage):
    table = freq_table(dataset, column, percentage)
    table_display = []

    for key in table:
        if percentage:
            # strip the "%" character and convert to float so the values sort numerically
            value = float(table[key].strip("%"))
        else:
            value = table[key]
        table_display.append((value, key))

    # sort once, after all (value, key) tuples have been collected
    table_sorted = sorted(table_display, reverse=True)

    for entry in table_sorted:
        if percentage:
            print(entry[1], ':', str(entry[0]) + "%")
        else:
            print(entry[1], ':', entry[0])
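An alternative to building `(value, key)` tuples is to sort the dictionary items directly with a key function. This is only a sketch of that idea under the assumption that the table maps labels to numbers; `sorted_items` and `toy_table` are hypothetical names that do not appear elsewhere in this notebook.

```python
def sorted_items(table):
    """Return (key, value) pairs ordered from the largest value to the smallest."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Toy frequency table (hypothetical counts, not taken from the datasets).
toy_table = {"Music": 3, "Games": 10, "Weather": 1}

ordered = sorted_items(toy_table)
for key, value in ordered:
    print(key, ':', value)
```

One difference to be aware of: when two values tie, this version keeps the items in their original dictionary order, whereas sorting `(value, key)` tuples in reverse orders tied entries by key in reverse alphabetical order.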


#### App Store: Most Common Genres

In [24]:
display_table(free_english_apple_data_cleaned, column=11, percentage=False) # prints a frequency table for the prime_genre column in the App Store dataset (display_table returns nothing, so the result is not assigned)

Games : 1874
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4


In [25]:
display_table(free_english_apple_data_cleaned, column=11, percentage=True) # prints a percentage breakdown for the prime_genre column in the App Store dataset

Games : 58.16%
Entertainment : 7.88%
Photo & Video : 4.97%
Education : 3.66%
Social Networking : 3.29%
Shopping : 2.61%
Utilities : 2.51%
Sports : 2.14%
Music : 2.05%
Health & Fitness : 2.02%
Productivity : 1.74%
Lifestyle : 1.58%
News : 1.33%
Travel : 1.24%
Finance : 1.12%
Weather : 0.87%
Food & Drink : 0.81%
Reference : 0.56%
Business : 0.53%
Book : 0.43%
Navigation : 0.19%
Medical : 0.19%
Catalogs : 0.12%


Among **Free English Apps** on the App Store, over half of all apps in this subset (58.16%) fall into the gaming genre. Entertainment apps are close to 8%, followed by Photo & Video apps which are just under 5%. Only 3.66% of apps are designed for Education, followed by Social Networking apps, making up only 3.29% of the Free English app market.

The general impression this breakdown provides is that Free English apps on the App Store are dominated by apps designed for gaming and entertainment rather than apps designed for practical purposes/utility (such as education, shopping, productivity, lifestyle). 

This does not necessarily mean, however, that apps in the largest genres also have the greatest number of users. The number of apps in a genre gives us no insight into how many people use them (we will probe into this more deeply below).

#### Google Play Store: Most Common Categories

In [26]:
display_table(free_english_android_data_cleaned, column=1, percentage=False) # prints a frequency table for the Category column in the Google Play dataset

FAMILY : 1675
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53


In [27]:
display_table(free_english_android_data_cleaned, column=1, percentage=True) # prints a percentage breakdown for the Category column in the Google Play dataset

FAMILY : 18.9%
GAME : 9.73%
TOOLS : 8.46%
BUSINESS : 4.59%
LIFESTYLE : 3.9%
PRODUCTIVITY : 3.89%
FINANCE : 3.7%
MEDICAL : 3.53%
SPORTS : 3.4%
PERSONALIZATION : 3.32%
COMMUNICATION : 3.24%
HEALTH_AND_FITNESS : 3.08%
PHOTOGRAPHY : 2.94%
NEWS_AND_MAGAZINES : 2.8%
SOCIAL : 2.66%
TRAVEL_AND_LOCAL : 2.34%
SHOPPING : 2.25%
BOOKS_AND_REFERENCE : 2.14%
DATING : 1.86%
VIDEO_PLAYERS : 1.79%
MAPS_AND_NAVIGATION : 1.4%
FOOD_AND_DRINK : 1.24%
EDUCATION : 1.16%
ENTERTAINMENT : 0.96%
LIBRARIES_AND_DEMO : 0.94%
AUTO_AND_VEHICLES : 0.93%
HOUSE_AND_HOME : 0.82%
WEATHER : 0.8%
EVENTS : 0.71%
PARENTING : 0.65%
ART_AND_DESIGN : 0.64%
COMICS : 0.62%
BEAUTY : 0.6%


The landscape of the **Free English apps** on the Google Play marketplace seems notably different from that of the App Store. There are fewer apps designed for fun (Games, Entertainment etc.) and a larger proportion designed for practical purposes (Tools, Business, Lifestyle, Productivity). The two largest categories are `FAMILY` and `GAME`, which together make up just over one quarter of all Free English apps analysed (28.63%).

Looking more closely at the types of apps within the `FAMILY` category, we note that the majority appear to be games designed for young children (see code cell below), meaning around a quarter of all Free English apps in the sample fall into the gaming genre.

Even so, it is still interesting to note that practical apps seem to be better represented in the Google Play dataset than in the App Store dataset.

In [28]:
android_family_apps = []

for app in free_english_android_data_cleaned:
  app_name = app[0]
  category = app[1]
  if category == "FAMILY":
    android_family_apps.append(app_name)

android_family_apps[:25]

['Jewels Crush- Match 3 Puzzle',
 'Coloring & Learn',
 'Mahjong',
 'Super ABC! Learning games for kids! Preschool apps',
 'Toy Pop Cubes',
 'Educational Games 4 Kids',
 'Candy Pop Story',
 'Princess Coloring Book',
 'Hello Kitty Nail Salon',
 'Candy Smash',
 'Happy Fruits Bomb - Cube Blast',
 'Princess Adventures Puzzles',
 'Kids Educational Game 3 Free',
 'Puzzle Kids - Animals Shapes and Jigsaw Puzzles',
 'Coloring book moana',
 'Baby Panda Care',
 'Kids Educational :All in One',
 'Number Counting games for toddler preschool kids',
 'Learn To Draw Glow Flower',
 'No. Color - Color by Number, Number Coloring',
 'Draw.ly - Color by Number Pixel Art Coloring',
 'Baby puzzles',
 'Garden Fruit Legend',
 'Barbie™ Fashion Closet',
 'Candy Day']
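Reading app names is a quick check, but the `Genres` column offers a more systematic one: tallying the `Genres` values of the `FAMILY` apps shows how many carry a game-like label. The sketch below assumes the column layout used throughout this notebook (name at index 0, `Category` at index 1, `Genres` at index 9); `genres_within_category`, `make_row` and the toy rows are hypothetical and exist only for this illustration.

```python
from collections import Counter

def genres_within_category(dataset, category, category_col=1, genres_col=9):
    """Tally the Genres values of every row whose Category matches."""
    return Counter(row[genres_col] for row in dataset if row[category_col] == category)

def make_row(name, category, genres):
    """Build a 10-column toy row with only the columns used here filled in."""
    row = [""] * 10
    row[0], row[1], row[9] = name, category, genres
    return row

# Invented rows; the real call would pass free_english_android_data_cleaned.
toy_rows = [
    make_row("Toy puzzle", "FAMILY", "Puzzle"),
    make_row("Toy colouring", "FAMILY", "Casual;Pretend Play"),
    make_row("Toy jigsaw", "FAMILY", "Puzzle"),
    make_row("Toy wrench", "TOOLS", "Tools"),
]

family_genres = genres_within_category(toy_rows, "FAMILY")
print(family_genres.most_common())
```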

#### Google Play Store: Most Common Genres

In [29]:
display_table(free_english_android_data_cleaned, column=9, percentage=False) # prints a frequency table for the Genres column in the Google Play dataset

Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 80
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

In [30]:
display_table(free_english_android_data_cleaned, column=9, percentage=True) # prints a percentage breakdown for the Genres column in the Google Play dataset

Tools : 8.45%
Entertainment : 6.07%
Education : 5.35%
Business : 4.59%
Productivity : 3.89%
Lifestyle : 3.89%
Finance : 3.7%
Medical : 3.53%
Sports : 3.46%
Personalization : 3.32%
Communication : 3.24%
Action : 3.1%
Health & Fitness : 3.08%
Photography : 2.94%
News & Magazines : 2.8%
Social : 2.66%
Travel & Local : 2.32%
Shopping : 2.25%
Books & Reference : 2.14%
Simulation : 2.04%
Dating : 1.86%
Arcade : 1.85%
Video Players & Editors : 1.77%
Casual : 1.76%
Maps & Navigation : 1.4%
Food & Drink : 1.24%
Puzzle : 1.13%
Racing : 0.99%
Role Playing : 0.94%
Libraries & Demo : 0.94%
Auto & Vehicles : 0.93%
Strategy : 0.9%
House & Home : 0.82%
Weather : 0.8%
Events : 0.71%
Adventure : 0.68%
Comics : 0.61%
Beauty : 0.6%
Art & Design : 0.6%
Parenting : 0.5%
Card : 0.45%
Casino : 0.43%
Trivia : 0.42%
Educational;Education : 0.39%
Board : 0.38%
Educational : 0.37%
Education;Education : 0.34%
Word : 0.26%
Casual;Pretend Play : 0.24%
Music : 0.2%
Racing;Action & Adventure : 0.17%
Puzzle;Brain Games

Comparing the results of the `Genres` column to the `Category` column within the Google Play dataset, we note that the `Genres` column is more granular, containing more distinct values. The exact difference between the two is not obvious, but the most common genres in the `Genres` column generally align with the most common categories listed above from the `Category` column. Since we are only interested in insights at a higher level, we can focus on the `Category` column as it is coarser (i.e., less granular).
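Part of the extra granularity comes from compound labels such as `Casual;Pretend Play`, which are visible in the `Genres` output above. If a coarser view of `Genres` were ever needed, one option is to keep only the text before the semicolon. This is a sketch under the assumption that the first part is the main genre, which the dataset does not confirm; `primary_genre` is a hypothetical helper.

```python
from collections import Counter

def primary_genre(genres_value):
    """Return the text before the first ';' (the whole string if there is none)."""
    return genres_value.split(";")[0]

# Labels copied from the Genres output above; the list itself is a toy sample.
toy_genres = ["Casual", "Casual;Pretend Play", "Puzzle;Brain Games", "Puzzle", "Tools"]

collapsed = Counter(primary_genre(value) for value in toy_genres)
print(collapsed)
```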

We can now turn our attention to analysing the most popular apps by genre on both marketplaces.

### Most Popular Apps by Genre on the App Store
One route we can take to determine what genres have the largest user base is to calculate the average number of installs for each app genre. We can use the `Installs` column when working with the Google Play dataset, but this information is not readily available for the App Store data.

As a workaround, we can use the total number of user ratings (`rating_count_tot`) as a proxy.

In [31]:
free_english_apple_genres_frequencies = freq_table(free_english_apple_data_cleaned, column=11, percentage=False)

for genre in free_english_apple_genres_frequencies:
  total = 0
  len_genre = 0
  for app in free_english_apple_data_cleaned:
    genre_app = app[11]
    if genre_app == genre:
      n_ratings = float(app[5])
      total += n_ratings
      len_genre += 1
      
  avg_n_ratings = round(total / len_genre, 2)
  print(genre, ':', avg_n_ratings)

Social Networking : 71548.35
Photo & Video : 28441.54
Games : 22788.67
Music : 57326.53
Reference : 74942.11
Health & Fitness : 23298.02
Weather : 52279.89
Utilities : 18684.46
Travel : 28243.8
Shopping : 26919.69
News : 21248.02
Navigation : 86090.33
Lifestyle : 16485.76
Entertainment : 14029.83
Food & Drink : 33333.92
Sports : 23008.9
Book : 39758.5
Finance : 31467.94
Education : 7003.98
Productivity : 21028.41
Business : 7491.12
Catalogs : 4004.0
Medical : 612.0
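
The nested loops above scan the whole dataset once per genre. For larger datasets, the same averages can be accumulated in a single pass. The following is a minimal sketch of that approach, not the code that produced the output above; `average_by_group` and the toy rows are hypothetical.

```python
from collections import defaultdict

def average_by_group(dataset, group_col, value_col):
    """Return {group: mean of value_col} using a single pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        totals[row[group_col]] += float(row[value_col])
        counts[row[group_col]] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}

# Toy rows: [genre, rating count] (hypothetical values).
toy_rows = [["Games", "100"], ["Games", "300"], ["Music", "50"]]

averages = average_by_group(toy_rows, group_col=0, value_col=1)
print(averages)
```

With the real data, the call would presumably use `group_col=11` and `value_col=5`, matching the indices used in the cell above.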


Navigation apps have the greatest mean number of user ratings. This mean is heavily influenced by Waze and Google Maps, which together have almost half a million ratings. The median is a better measure here, as it is far less sensitive to these extreme values.

A median of just under 8,200 ratings is still relatively **large** for a genre as a whole, so there could be some potential here. It is worth noting, however, that there are only 6 data points for this genre in the entire dataset, so we cannot be confident that a median of 8,200 would hold up with a larger sample.

>Even so, it could still be worth looking into for the company as the market is definitely not oversaturated in this genre!

In [32]:
for app in free_english_apple_data_cleaned:
  if app[11] == "Navigation":
    print(app[1], ':', app[5])

print("\n")    
print("Median: " + str((12811+3582)/2))

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Median: 8196.5
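
The median in the cell above is computed from two hand-picked middle values, which would need updating if the data changed. The standard library's `statistics.median` avoids this. A small sketch, using the six rating counts printed in the Navigation output above:

```python
from statistics import median

# Rating counts copied from the Navigation output above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

# median() sorts the values itself and averages the two middle ones
# when the list has an even length.
navigation_median = median(navigation_ratings)
print("Median:", navigation_median)
```

In the notebook itself, the list could be built by collecting `float(app[5])` for every app whose genre matches, rather than typing the values in.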


The Social Networking genre looks challenging to break into, as the apps with the most ratings (Facebook, Pinterest, Skype, Messenger, etc.) disproportionately drive up the mean number of user ratings.

If we take the median for the Social Networking genre, we get a much smaller value of 4,199 total ratings. The market is also far more saturated in this genre, so we can expect it to be a lot harder for a new Social Networking app to succeed.

A similar story applies to other big genres such as Music (Pandora, Spotify and Shazam all significantly skew the average). We could remove the extremely popular apps within each genre for a more balanced picture, but this level of detail can be added at a later date.
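
As a rough illustration of what removing the most popular apps would involve, the sketch below recomputes a mean after dropping every value at or above a cutoff. The cutoff of 100,000 ratings is an arbitrary assumption chosen for the example, and `mean_below` is a hypothetical helper; the rating counts are the Navigation values printed earlier.

```python
def mean_below(values, cutoff):
    """Mean of the values strictly below cutoff; None if nothing remains."""
    kept = [value for value in values if value < cutoff]
    if not kept:
        return None
    return round(sum(kept) / len(kept), 2)

# Rating counts copied from the Navigation output above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

full_mean = round(sum(navigation_ratings) / len(navigation_ratings), 2)
trimmed_mean = mean_below(navigation_ratings, cutoff=100_000)
print(full_mean, trimmed_mean)
```

The same helper could be applied genre by genre, although the choice of cutoff would need to be justified for each marketplace.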

In [33]:
social_networking_apple_apps = []
for app in free_english_apple_data_cleaned:
  if app[11] == "Social Networking":
    social_networking_apple_apps.append(app)
    print(app[1], ':', app[5])

print("\n")    
print("Number of social networking apps: " + str(len(social_networking_apple_apps)))
print("Median: " + str((4253+4145)/2))

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

Reference apps have a mean number of ratings of 74,942 - though this is skewed upwards by the Bible and Dictionary.com apps. There does seem to be some potential in this genre however, with a median of 6,614 ratings. It is also not an overly saturated market, with only 18 apps falling within this category.

One avenue to explore could be to take a popular book and build this into an app, adding additional features beyond the raw digital book. This could be along the lines of providing the user with daily quotes from the book, an audio version of the book or integrated study materials (e.g., quizzes, flash cards, an annotation feature) if the book is a text commonly studied as part of a school course/university course etc. 

We could also integrate extra features such as an in-built dictionary/pronunciation guide, include user stats (such as reading speed), include reminders to read for a specified amount of time each day and so on.

This idea does seem somewhat promising, especially considering that the App Store has a large emphasis on gaming and 'for-fun' style apps. To give us the best chance of success, the app should ideally not be launched into a saturated genre and should offer features that break the mould of the apps within that genre.

In [34]:
reference_apple_apps = []
for app in free_english_apple_data_cleaned:
  if app[11] == "Reference":
    reference_apple_apps.append(app)
    print(app[1], ':', app[5])

print("\n")    
print("Number of reference apps: " + str(len(reference_apple_apps)))
print("Median: " + str((8535+4693)/2))

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Number of reference apps: 18
Median: 6614.0


Other genres with a relatively high average number of ratings include `Food & Drink`, `Weather` and `Finance`. 

These genres are not likely to be of interest to us:

* `Food & Drink`: Starbucks, Dunkin' Donuts, McDonald's, ... Making a popular app in this genre would require a vast amount of work and domain knowledge outside of the scope of a start-up app developing company.

* `Weather`: people generally do not spend much time in-app and there are alternatives in the market that will not present the user with any ads. There is no good reason to believe building a weather app with in-app ads would do better than competitors on the market.

* `Finance`: these apps involve personal banking, bill payment, money transfers and so on. This would require domain knowledge outside of the scope of the company.

### Most Popular Apps by Genre on the Google Play Store

As mentioned above, the Google Play dataset has an `Installs` column, so we can get a more accurate picture of app popularity.

In [35]:
display_table(free_english_android_data_cleaned, column=5, percentage=True)

1,000,000+ : 15.73%
100,000+ : 11.55%
10,000,000+ : 10.55%
10,000+ : 10.2%
1,000+ : 8.39%
100+ : 6.92%
5,000,000+ : 6.83%
500,000+ : 5.56%
50,000+ : 4.77%
5,000+ : 4.51%
10+ : 3.54%
500+ : 3.25%
50,000,000+ : 2.3%
100,000,000+ : 2.13%
50+ : 1.92%
5+ : 0.79%
1+ : 0.51%
500,000,000+ : 0.27%
1,000,000,000+ : 0.23%
0+ : 0.05%


Unfortunately, we lose some precision as the install numbers are grouped into wide ranges. An app in the 100,000+ bucket could have anywhere between 100,000 and 499,999 installs (the next bucket starts at 500,000+), and we cannot know the exact figure. We can still use this data as we are more interested in the bigger picture.
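
Since a `500,000+` bucket also appears in the output above, an app labelled `100,000+` has fewer than 500,000 installs. The range implied by each label can be derived by sorting the lower bounds and pairing each with the next one. This is an illustrative sketch only; `parse_installs` is a hypothetical helper and just a few of the labels are included.

```python
def parse_installs(label):
    """Turn a label such as '100,000+' into the integer 100000."""
    return int(label.replace(",", "").replace("+", ""))

# A few bucket labels copied from the output above.
labels = ["1,000,000+", "100,000+", "500,000+", "50,000+"]

lower_bounds = sorted(parse_installs(label) for label in labels)
# Pair each lower bound with the next one; the last bucket has no upper bound here.
bucket_ranges = list(zip(lower_bounds, lower_bounds[1:]))
for low, next_low in bucket_ranges:
    print(f"{low:,}+ covers {low:,} to {next_low - 1:,}")
```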

In the same manner as we calculated the average total ratings of the genres for the App Store, we calculate the average number of installs for the `Category` column of the Google Play dataset. We handle the imprecision noted above by treating an app with, say, 100,000+ installs as having exactly 100,000. To do this, we remove the "," and "+" characters from the `Installs` strings and convert the result to a float so that we can perform the arithmetic needed to find the mean.

In [36]:
free_english_android_category_frequencies = freq_table(free_english_android_data_cleaned, column=1, percentage=False)

for category in free_english_android_category_frequencies:
  total = 0
  len_category = 0
  for app in free_english_android_data_cleaned:
    category_app = app[1]
    if category_app == category:
      n_installs = app[5]
      n_installs = n_installs.replace(",", "")
      n_installs = n_installs.replace("+", "")
      n_installs = float(n_installs)
      total += n_installs
      len_category += 1
      
  avg_n_installs = round(total / len_category, 2)  # mean installs for this category
  print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.09
AUTO_AND_VEHICLES : 647317.82
BEAUTY : 513151.89
BOOKS_AND_REFERENCE : 8767811.89
BUSINESS : 1712290.15
COMICS : 817657.27
COMMUNICATION : 38456119.17
DATING : 854028.83
EDUCATION : 1833495.15
ENTERTAINMENT : 11640705.88
EVENTS : 253542.22
FINANCE : 1387692.48
FOOD_AND_DRINK : 1924897.74
HEALTH_AND_FITNESS : 4188821.99
HOUSE_AND_HOME : 1331540.56
LIBRARIES_AND_DEMO : 638503.73
LIFESTYLE : 1437816.27
GAME : 15588015.6
FAMILY : 3697848.17
MEDICAL : 120550.62
SOCIAL : 23253652.13
SHOPPING : 7036877.31
PHOTOGRAPHY : 17840110.4
SPORTS : 3638640.14
TRAVEL_AND_LOCAL : 13984077.71
TOOLS : 10801391.3
PERSONALIZATION : 5201482.61
PRODUCTIVITY : 16787331.34
PARENTING : 542603.62
WEATHER : 5074486.2
VIDEO_PLAYERS : 24727872.45
NEWS_AND_MAGAZINES : 9549178.47
MAPS_AND_NAVIGATION : 4056941.77


Of the Free English apps on the Google Play store, `COMMUNICATION` apps have the greatest average number of installs (38,456,119). This number is heavily skewed upwards by a select few apps (WhatsApp, Messenger, Skype, Google Chrome, Gmail, Hangouts, ...). If we removed all apps with over 100 million installs, the average would drop roughly tenfold (to around 3-4 million installs).
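
The effect of excluding the giants can be checked by recomputing the average with an upper limit on installs. The sketch below assumes the column layout used in this notebook (`Category` at index 1, `Installs` at index 5) and runs on invented toy rows, so its numbers are not the real figures; `parse_installs`, `average_installs` and `make_row` are hypothetical helpers.

```python
def parse_installs(label):
    """Turn a label such as '1,000,000+' into the integer 1000000."""
    return int(label.replace(",", "").replace("+", ""))

def average_installs(dataset, category, max_installs=None):
    """Mean installs for one category, optionally keeping only apps below max_installs."""
    values = []
    for row in dataset:
        if row[1] != category:
            continue
        installs = parse_installs(row[5])
        if max_installs is None or installs < max_installs:
            values.append(installs)
    if not values:
        return None
    return round(sum(values) / len(values), 2)

def make_row(name, category, installs):
    """Build a 6-column toy row with only the columns used here filled in."""
    row = [""] * 6
    row[0], row[1], row[5] = name, category, installs
    return row

# Invented rows; the real call would pass free_english_android_data_cleaned.
toy_rows = [
    make_row("Toy giant", "COMMUNICATION", "1,000,000,000+"),
    make_row("Toy chat", "COMMUNICATION", "1,000,000+"),
    make_row("Toy mail", "COMMUNICATION", "500,000+"),
    make_row("Toy wrench", "TOOLS", "10,000+"),
]

full_average = average_installs(toy_rows, "COMMUNICATION")
capped_average = average_installs(toy_rows, "COMMUNICATION", max_installs=100_000_000)
print(full_average, capped_average)
```

Comparing install counts as numbers also avoids listing every bucket label by hand when filtering.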

In [37]:
for app in free_english_android_data_cleaned:
  if app[1] == "COMMUNICATION" and (app[5] == "1,000,000,000+" or app[5] == "500,000,000+"):
    print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+


It is a similar story for the `VIDEO_PLAYERS` category at 24,727,872 installs. The market is dominated by giants such as YouTube, Google Play Movies & TV and MX Player. 

In [38]:
for app in free_english_android_data_cleaned:
  if app[1] == "VIDEO_PLAYERS" and (app[5] == "1,000,000,000+" or app[5] == "500,000,000+"):
    print(app[0], ':', app[5])

YouTube : 1,000,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+


The pattern is repeated for other categories including `SOCIAL`, `PHOTOGRAPHY` and `PRODUCTIVITY`, all of which have averages skewed by the extreme number of installs of a few giants. The main concern is that these markets are saturated, and it will be very challenging to succeed against such dominant apps.

The `GAME` category is popular but, as expected, quite oversaturated.

In [39]:
for app in free_english_android_data_cleaned:
  if app[1] == "GAME" and (app[5] == "1,000,000,000+" or app[5] == "500,000,000+" or app[5] == "100,000,000+"):
    print(app[0], ':', app[5])

Sonic Dash : 100,000,000+
PAC-MAN : 100,000,000+
Roll the Ball® - slide puzzle : 100,000,000+
Piano Tiles 2™ : 100,000,000+
Pokémon GO : 100,000,000+
Extreme Car Driving Simulator : 100,000,000+
Trivia Crack : 100,000,000+
Angry Birds 2 : 100,000,000+
Candy Crush Saga : 500,000,000+
8 Ball Pool : 100,000,000+
Subway Surfers : 1,000,000,000+
Candy Crush Soda Saga : 100,000,000+
Clash Royale : 100,000,000+
Clash of Clans : 100,000,000+
Plants vs. Zombies FREE : 100,000,000+
Pou : 500,000,000+
Flow Free : 100,000,000+
My Talking Angela : 100,000,000+
slither.io : 100,000,000+
Cooking Fever : 100,000,000+
Yes day : 100,000,000+
Score! Hero : 100,000,000+
Dream League Soccer 2018 : 100,000,000+
My Talking Tom : 500,000,000+
Sniper 3D Gun Shooter: Free Shooting Games - FPS : 100,000,000+
Zombie Tsunami : 100,000,000+
Helix Jump : 100,000,000+
Crossy Road : 100,000,000+
Temple Run 2 : 500,000,000+
Talking Tom Gold Run : 100,000,000+
Agar.io : 100,000,000+
Bus Rush: Subway Edition : 100,000,00

The `BOOKS_AND_REFERENCE` category is also relatively popular (8,767,812 installs on average) and, since Reference was a stand-out candidate in the App Store dataset, is definitely worth looking into.

In [40]:
books_and_reference_apps = []
for app in free_english_android_data_cleaned:
  if app[1] == "BOOKS_AND_REFERENCE":
    books_and_reference_apps.append(app)
    print(app[0], ':', app[5])
    
print("\n")
print("Number of books and reference apps: " + str(len(books_and_reference_apps)))

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

In [41]:
for app in free_english_android_data_cleaned:
  if app[1] == "BOOKS_AND_REFERENCE" and (app[5] == "1,000,000,000+" or app[5] == "500,000,000+" or app[5] == "100,000,000+"):
    print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


Only a small number of very popular apps skew the average, so this market does show some potential. 

We can look into the types of apps that lie in the middle range for installs (say 1,000,000 to 50,000,000):

In [42]:
for app in free_english_android_data_cleaned:
  if app[1] == "BOOKS_AND_REFERENCE" and (app[5] == "50,000,000+" or app[5] == "10,000,000+" or app[5] == "5,000,000+" or app[5] == "1,000,000+"):
    print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

The category seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries. If we venture into this category, we should avoid building apps that compete directly with these subcategories, as this would make it more challenging to succeed.

There are also many apps built around a highly popular book (such as the Quran), which supports the initial idea discussed in the App Store analysis section above. A potentially profitable app could come in the form of taking a more recent popular book and building it into an app with additional features to enhance the user experience. This aligns quite nicely with the landscapes of both the App Store and Google Play marketplaces.