# App Profile Recommendation
## Introduction
In this project I will perform a data analysis on the free mobile apps world. 
More specifically, I'll use these two datasets provided by Kaggle:
- [Google Play Store Apps Dataset](https://www.kaggle.com/lava18/google-play-store-apps/home) for the Google Play data [Data collected in August 2018]
- [Mobile App Store ( 7200 apps)](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) for the App Store data [Data collected in July 2017]

The aim of this project is to find the best category to generate earnings among free apps, whose source of revenue consists of in-app ads.

## Data preparation
The first step to take is to analyze the two datasets. To accomplish this task, I created a function named `explore_data()` to print the rows of the dataset in a more readable way:

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    """
    This function print the rows of a given dataset.
    Parameters:
        dataset A list of lists, assuming no header row
        start: Start index of a slice from the data set
        end: End index of a slide from the data set
        rows_and_columns(optional): If true, prints the number of rows and columns 
    """
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Now it's time to open the two data sets and to explore the first few lines of them:

In [2]:
from csv import reader

# Google Play data set
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

# App Store data set
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

In [3]:
print(android_header)
print('\n')
explore_data(android, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

In [4]:
print(ios_header)
print('\n')
explore_data(ios, 0, 5, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37'

## Data cleaning
To clean the data in the datasets I'll need to:
1. Remove/Correct inaccurate data
2. Remove duplicate data
3. Remove the non-English apps 
4. Remove the non-free apps

### 1 - Remove/Correct inaccurate data
First of all I detect on this [Kaggle discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on the Google Play Store Apps datasets that the entry with index 10472 has no 'Category' data, and this causes a shift for the next columns

In [5]:
print(android_header)
print('\n')
print(android[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


I choose to delete this row from the dataset

In [6]:
del android[10472]

Apparently there are no missing fields in the Google Play Store Apps dataset, and the same applies to the App Store dataset.
### 2 - Remove duplicate data
Now it's time to search for duplicated data in the dataset:

In [7]:
duplicate_apps = [] # List to store the name of duplicate apps
unique_apps = [] # List to store the name of unique apps
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


I try to compare the duplicates in the dataset to examine what changes between one and the other. For example, I analyze the duplicates of the Google Ads app:

In [8]:
for app in android:
    name = app[0]
    if name == 'Slack':
        print(app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


There are two rows with exatcly the same data, while the third has a different value on the fourth field, which corresponds to the number of reviews. 

I use this information to build a criterion for removing the duplicates: I keep only the row with the higher number of reviews, removing the other entries. After we remove the duplicates, in the Google Play dataset we should be left with 9659 rows:

In [9]:
print('Expected length:', len(android) - 1181)

Expected length: 9659


To remove the duplicates, I will:
* Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app
* Use the dictionary to create a new dataset which will have only one entry per app, corresponding to the one with the highest number of reviews

In [10]:
# Creation of the dictionary

reviews_max = {} # Dictionary containing the app name as a key and the highest number of reviews as a value
for app in android:
    name = app[0]
    n_reviews = int(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
print('Length of reviews_max:', len(reviews_max)) # Expected length = 9659

Length of reviews_max: 9659


In [11]:
# Creation of a new dataset with only one entry per app

android_clean = [] # New (clean) dataset for the Google Play Store Apps
already_added = [] # List that stores app names
for app in android:
    name = app[0]
    n_reviews = int(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
print('Length of android_clean:', len(android_clean)) # Expected length = 9659

Length of android_clean: 9659


### 3 - Remove the non-free apps
Since I'd like to analyze only the apps that are directed to an English-speaking audience, I'll have to remove from the datasets all the apps whose name suggests that they are not directed toward an English-speaking audience.

One way to do this is to remove each app whose name contains characters not belonging to the common English characters. This means that I'm going to remove all the apps whose name contains a character that is not in the range of 0-127 according to the [ASCII](https://en.wikipedia.org/wiki/ASCII) system, that is the range of the English characters.

In [12]:
def detect_english_name(app_name):
    """
    Detect if the input app_name contains a non english char
    
    Parameters:
        app_name: String containing the app_name
    Returns:
        Boolean: False if there's any character in app_name that doesn't belong to the set of English characters, otherwise True        
    """
    for curr_char in app_name:
        if ord(curr_char) > 127:
            return False
    return True   

Two examples:

In [13]:
detect_english_name('Twitter')

True

In [14]:
detect_english_name('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

The problem of this approach is that it filters out a lot of apps that contains some characters (like ™) that fall outside the ASCII range, but that are still English apps. Since this will cause a loss of useful data, I'll only remove an app if its name has more than three characters with corresponding number falling outside the ASCII range. The filter will not be perfect, but it should be fairly effective

In [15]:
def detect_english_name(app_name):
    """
    Detect if the input app_name contains three or more non english char
    
    Parameters:
        app_name: String containing the app_name
    Returns:
        Boolean: False if there's any character in app_name that doesn't belong to the set of English characters, otherwise True        
    """
    non_eng_ascii = 0
    for curr_char in app_name:
        if ord(curr_char) > 127:
            non_eng_ascii += 1
            if non_eng_ascii == 3:
                return False
    return True   

Three examples:

In [16]:
detect_english_name('Docs To Go™ Free Office Suite')

True

In [17]:
detect_english_name('Instachat 😜')

True

In [18]:
detect_english_name('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

Now I'll use this newly created function to filter out non-English apps from both data sets:

In [19]:
android_eng = []
ios_eng = []

for app in android_clean:
    name = app[0]
    if detect_english_name(name):
        android_eng.append(app)
        
for app in ios:
    name = app[1]
    if detect_english_name(name):
        ios_eng.append(app)
        
explore_data(android_eng, 0, 3, True)
print('\n')
explore_data(ios_eng, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9597
Number of columns: 13


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188

### 4 - Remove the non-free apps
The final step of the data cleaning process consists in isolating the free apps from both the datasets. For the Google Play Store Apps dataset the price of the app is described by the column with index 7, while in the App Store dataset it has index 4

In [20]:
android_eng_free = []
ios_eng_free = []

for app in android_eng:
    price = app[7]
    if price == '0':
        android_eng_free.append(app)

for app in ios_eng:
    price = app[4]
    if price == '0.0':
        ios_eng_free.append(app)
        
print(len(android_eng_free))
print(len(ios_eng_free))

8848
0


## Detect the most common apps by genre
I begin the analysis by getting a sense of what are the most common genres for each market. For this, I'll need to build a frequency table for a few columns in the datasets. 
Let's start by analyzing the columns of the two datasets:

In [21]:
print(android_header)
print('\n')
explore_data(android_eng_free, 0, 2, False)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']




In [22]:
print(ios_header)
print('\n')
explore_data(ios_eng_free, 0, 2, False)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']




From the analysis I can conclude that:
* For the Google Play Store Apps I'll need to build a frequency table for both the *Category* and *Genres* columns
* For the App Store I'll need to build a frequency table for the *prime_genre* column

I'll need to build two functions to analyze the frequency tables:
1. A function to generate frequency tables that show percentages
2. Another function I can use to display the percentages in a descending order

In [23]:
def freq_table(dataset, index):
    """
    Return the frequency table (as a dictionary) for any column we want
    Parameters:
        dataset: A list of lists
        index: An integer describing the column index of the dataset rows
    Returns:
        freq_table_perc: A dictionary that contains the frequency table of the chosen column
        """
    freq_dict = {}
    for app in dataset:
        curr_row = app[index]
        if curr_row in freq_dict:
            freq_dict[curr_row] += 1
        else:
            freq_dict[curr_row] = 1
            
    freq_table_perc = {}
    total = len(dataset)
    for key in freq_dict:
        percentage = (freq_dict[key] / total) * 100
        freq_table_perc[key] = percentage
    return freq_table_perc

In [24]:
def display_table(dataset, index):
    """
    Generates a frequency table using the freq_table() function, transforms the 
    frequency table into a list of tuples, sorts the list in a 
    descending order and prints the entries of the frequency table in descending order
 
    Parameters:
        dataset: A list of lists
        index: An integer describing the column index of the dataset rows
    """
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

### iOs App Store
I start the analysis by examining the frequency table for the *prime_genre* column of the App Store dataset:

In [25]:
display_table(ios_eng_free, -5)

The most common genre is definitely the one of the Games, with a dominant 58.25%. The other genres in the top 3 are Entertainment (7.83%) and Photo & Video (5%). A comment on this is that in the App Store, regarding the free english apps, is dominated by apps that are designed for fun (games, entertainment, photo & video, social networking, sports, music, etc.), while apps with a more practical purpose (like education, shopping, utilies, productivity, etc.) are more rare. However, this doesn't imply that the most numerous apps are also the ones that have the major number of users, since the demand might not be the same as the offer.

### Google Play Store Apps
Now I analyze the frequency table generated for the *Category* and *Genres* column of the Google Play data set

In [26]:
display_table(android_eng_free, 1) # Category

FAMILY : 18.942133815551536
GAME : 9.697106690777577
TOOLS : 8.453887884267631
BUSINESS : 4.599909584086799
PRODUCTIVITY : 3.899186256781193
LIFESTYLE : 3.887884267631103
FINANCE : 3.7070524412296564
MEDICAL : 3.5375226039783
SPORTS : 3.390596745027125
PERSONALIZATION : 3.322784810126582
COMMUNICATION : 3.2323688969258586
HEALTH_AND_FITNESS : 3.0854430379746836
PHOTOGRAPHY : 2.949819168173599
NEWS_AND_MAGAZINES : 2.802893309222423
SOCIAL : 2.667269439421338
TRAVEL_AND_LOCAL : 2.3395117540687163
SHOPPING : 2.2490958408679926
BOOKS_AND_REFERENCE : 2.1360759493670884
DATING : 1.8648282097649187
VIDEO_PLAYERS : 1.7970162748643763
MAPS_AND_NAVIGATION : 1.3901446654611211
FOOD_AND_DRINK : 1.2432188065099457
EDUCATION : 1.164104882459313
ENTERTAINMENT : 0.9606690777576853
LIBRARIES_AND_DEMO : 0.9380650994575045
AUTO_AND_VEHICLES : 0.9267631103074141
HOUSE_AND_HOME : 0.8024412296564195
WEATHER : 0.7911392405063291
EVENTS : 0.7120253164556962
PARENTING : 0.6555153707052441
ART_AND_DESIGN : 0.64

This leads to very different results with respect to the iOs App Store analysis. The game category here doesn't dominate, and instead the practical purposes apps are the most diffused (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) means mostly games for kids.

Even so, practical apps seem to have a better representation on Google Play compared to App Store. This can be confirmed by the frequency table for the Genres column:

In [27]:
display_table(android_eng_free, -4)

Tools : 8.44258589511754
Entertainment : 6.080470162748644
Education : 5.357142857142857
Business : 4.599909584086799
Productivity : 3.899186256781193
Lifestyle : 3.8765822784810124
Finance : 3.7070524412296564
Medical : 3.5375226039783
Sports : 3.4584086799276674
Personalization : 3.322784810126582
Communication : 3.2323688969258586
Action : 3.096745027124774
Health & Fitness : 3.0854430379746836
Photography : 2.949819168173599
News & Magazines : 2.802893309222423
Social : 2.667269439421338
Travel & Local : 2.328209764918626
Shopping : 2.2490958408679926
Books & Reference : 2.1360759493670884
Simulation : 2.0456600361663653
Dating : 1.8648282097649187
Arcade : 1.842224231464738
Video Players & Editors : 1.7744122965641953
Casual : 1.763110307414105
Maps & Navigation : 1.3901446654611211
Food & Drink : 1.2432188065099457
Puzzle : 1.1301989150090417
Racing : 0.9945750452079566
Role Playing : 0.9380650994575045
Libraries & Demo : 0.9380650994575045
Auto & Vehicles : 0.9267631103074141
St

The Genres column represents more detailed sub categories for the apps (for example the games genres like Puzzle, Arcade, Racing, etc.).

This confirm the impression given by the analysis of the Category column, I found that Google Play has a more balanced landscape of both practical and for-fun apps. Now I'd like to find the app categories which have the most number of users.

## Most popular apps by genre

In the Google Play Store Apps dataset the popularity of an app can be easily found by using the *Installs* column, that represents the number of user downloads divided by the installs for the app. 

In the iOs App Store, instead, this column is missing, so as a workaround I'll take the *rating_count_tot* column to represents the popularity.

### iOs App Store
I start with calculating the average number of user ratings per app genre on the app Store. I need to:
1. Isolate the apps of each genre
2. Sum up the user ratings for the apps of that genre
3. Divide the sum by the number of apps belonging to that genre

I start with generating a frequency table for the prime genre column:

In [28]:
genre_table = freq_table(ios_eng_free, -5)
for genre in genre_table:
    total = 0 # Stores the sum of the number of user ratings for each genre
    len_genre = 0 # Stores the number of apps specific to each genre
    for app in ios_eng_free:
        genre_app = app[-5]
        if genre_app == genre:
            num_ratings = float(app[5])
            total += num_ratings
            len_genre += 1
    avg_user_ratings = total / len_genre
    print(genre, ':', avg_user_ratings)

The Genre with the most number of users it the Navigation one, but if I analyze the apps in this genre we'll discover that this is due to the fact that Google Maps and Waze have a very large amount of user ratings:

In [29]:
for app in ios_eng_free:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # Print name and number of ratings

The same reasoning applies to the categories of Social Networking and Music, where few apps dominates the market. The average number of ratings seem to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps rarely get past the 10,000 threshold. I could get a better picture by removing these extremely popular apps for each genre and then rework the averages.

### Google Play Store Apps
As said before, this dataset has a column that take into accounts the number of installs for a certain app. However, the install column represents an interval instead of a precise number:

In [30]:
display_table(android_eng_free, 5) # Display the Install column frequency table

1,000,000+ : 15.75497287522604
100,000+ : 11.539330922242314
10,000,000+ : 10.567359855334539
10,000+ : 10.194394213381555
1,000+ : 8.39737793851718
100+ : 6.928119349005425
5,000,000+ : 6.826401446654612
500,000+ : 5.560578661844485
50,000+ : 4.769439421338156
5,000+ : 4.486889692585895
10+ : 3.5375226039783
500+ : 3.2436708860759493
50,000,000+ : 2.2830018083182644
100,000,000+ : 2.1360759493670884
50+ : 1.9213381555153706
5+ : 0.7911392405063291
1+ : 0.5085895117540687
500,000,000+ : 0.27124773960216997
1,000,000,000+ : 0.22603978300180833
0+ : 0.045207956600361664
0 : 0.011301989150090416


This could represents a problem since an app with 100,000+ installs can mean that the app has 100,000 installs, 200,000, or 350,000. However, I don't need very precise data for this project, since I only want to find out which app genres attract the most users, and I don't need perfect precision with respect to the number of users.

I'll leave the numbers as they are, considering for example that an app with 100,000+ installs has 100,000 installs. To perform computations, I'll need to convert each install number from string to float. 

In [31]:
category_table = freq_table(android_eng_free, 1)
for category in category_table:
    total = 0
    len_category = 0
    for app in android_eng_free:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace(',', '')
            installs = installs.replace('+', '')
            installs = float(installs)
            total += installs
            len_category += 1
    avg_installs = total / len_category
    print(category, ':', avg_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8814199.78835979
BUSINESS : 1712290.1474201474
COMICS : 832613.8888888889
COMMUNICATION : 38590581.08741259
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1360598.042253521
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1446158.2238372094
GAME : 15544014.51048951
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3650602.276666667
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10830251.970588235
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5145550.285714285
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_MAGAZ

On average, the communication category have the most installs: 38,590,581. This number is heavily influenced by a few apps that have over one billion installs, like Telegram, WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts, and a few others with over 100 and 500 million installs:

In [32]:
for app in android_eng_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If I removed all the communication apps that have over 100 million installs, the average would be reduced by a large amount:

In [33]:
under_100_m = []

for app in android_eng_free:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3617398.420849421

The same pattern hold for other categories like the video players category, social apps, photography apps, or productivity apps. Again, the main concern is that these app genres might seem more popular than they really are because they're dominated by a few giants who are hard to compete against. The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so I'd like to come up with a different app recommendation if possible.

The books and reference category seems popular for both the datasets, and it doesn't seem that there are a large amount of dominating apps in this category:

In [34]:
for app in android_eng_free:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

This category includes various types of apps, like software for processing and reading books, tutorials, dictionaries, audiobooks etc. There 

The book and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there's still a small number of extremely popular apps:

In [35]:
for app in android_eng_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


So this market shows potential! I can get some app ideas based on the kind of apps that are in the middle of the popularity scale:

In [36]:
for app in android_eng_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

## Conclusion
It's probably not a good idea to build apps for reading or processing ebooks, as well as various collections of libraries and dictionaries, since there'll be some significant competition with some giants like Wikipedia or Favorite Book Reader

I also notice there are quite a few apps built around the Quran and the Bible, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book that reach a very high popularity) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.