# Profitable App Profiles for App Store and Google Play Markets

## 1. Project Introduction
In this project, I will work as a data analyst for a company that builds Android and iOS mobile apps. The apps are available on Google Play and on the App Store. The company only builds apps that are free to download and install, meaning the main source of revenue consists of in-app ads. The number of users determines the revenue for any given app: the more users who see and engage with the ads, the better.

The goal of this project is to analyze data to help the app developers understand what type of apps are likely to attract more users.

## 2. Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Analyzing all of these apps is not feasible here, so I will work with the following two sample datasets, which seem suitable for the purpose:

[Google Play Store dataset](https://www.kaggle.com/lava18/google-play-store-apps): Approximately ten thousand Android apps<br>
[App Store dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps): Approximately seven thousand iOS apps

Let's start by opening and exploring these two datasets:

In [1]:
from csv import reader

# Open the Apple Store dataset (UTF-8 is assumed; app names contain non-ASCII characters)
with open("AppleStore.csv", encoding="utf8") as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

# Open the Google Play Store dataset
with open("googleplaystore.csv", encoding="utf8") as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

We use the explore_data() function to explore the datasets, because it lets us print rows repeatedly in a readable way. The function can also show the number of rows and columns.

The explore_data() function takes the following parameters:<br>

  - "dataset": a list of lists
  - "start" and "end": integers, representing the start and end indices of a slice from the dataset
  - "rows_and_columns": boolean, with "False" as the default argument
  
The function first slices the dataset using "dataset[start:end]".<br>
Next, it loops through the slice and, for each iteration, prints a row and adds an empty line after it using "print('\n')".<br>
Finally, it prints the number of rows and columns if "rows_and_columns" is "True".

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(android_header) # to display the column names
print('\n')
explore_data(android, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The output shows that the Google Play dataset has 10841 rows (one per app) and 13 columns. The columns relevant to this analysis are:

- App
- Category
- Reviews
- Installs
- Type
- Price
- Genres

Let's continue by exploring the Apple Store dataset:

In [4]:
print(ios_header)
print("\n")
explore_data(ios, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


The output shows that the Apple Store dataset contains 7197 apps and 16 columns. The columns that seem relevant for the analysis are:

- track_name (app name)
- currency
- price
- rating_count_tot (number of ratings for all app versions)
- rating_count_ver (number of ratings for the current app version)
- prime_genre

## 3. Data Cleaning

Before beginning the analysis, I need to make sure that the data is accurate. Otherwise, the results of the analysis will be wrong. Therefore, the following tasks have to be done:

- Detect inaccurate data, and correct or remove it
- Detect duplicate data, and remove the duplicates

### Removing Inaccurate Data

The Google Play dataset has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and one of the discussions describes an error for row 10472.

First, I print the mentioned row and compare it with the header and with a correct row to check whether it is really incorrect.

In [5]:
print(android[10472]) # print incorrect row
print("\n")
print(android_header) # print header
print("\n")
print(android[1]) # print another correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


The app in row 10472 is named "Life Made WI-Fi Touchscreen Photo Frame". Comparing the row with the header and with a correct row shows that the "Category" value is missing, which shifts every following value one column to the left. Therefore, I will delete this row.
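
A more general check than comparing rows by eye is to flag every row whose length differs from the header's length. The sketch below is only an illustration: the helper name `find_malformed_rows` and the three sample rows are made up and are not part of the original notebook.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs for rows whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Made-up sample data (not taken from the real CSV files)
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # 'Category' is missing, so the row is too short
    ['App C', 'GAME', '3.9'],
]

malformed = find_malformed_rows(sample_rows, sample_header)
for index, row in malformed:
    print(index, row)
```

On the real data, the call would presumably be `find_malformed_rows(android, android_header)`.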

In [6]:
print("Number of apps before deletion:", len(android))
del android[10472]
print("Number of apps after deletion:", len(android))

Number of apps before deletion: 10841
Number of apps after deletion: 10840


The number of apps in the Google Play dataset has decreased to 10840 after the deletion of the wrong data.

### Removing Duplicates

The Google Play dataset [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) also shows that the dataset has duplicate entries for some apps.

I use the following technique to find the number of duplicates and print some examples:

- Create two lists: one for the names of duplicate apps, one for the names of unique apps
- Loop through the Google Play dataset: if the name of the app is already in the unique names list, add it to the duplicates list; otherwise, add it to the unique names list

In [7]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print("Number of duplicate apps:", len(duplicate_apps))
print("\n")
print("Examples of duplicate apps:", duplicate_apps[:15])



Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
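
The list-based membership test above rescans `unique_apps` for every row. A possible variant, sketched here with a set and a hypothetical helper name `count_duplicates`, gives the same kind of result with constant-time lookups. The sample rows and review counts are made up for illustration.

```python
def count_duplicates(dataset, name_index=0):
    """Split app names into repeated occurrences and the set of names seen."""
    seen = set()       # names encountered at least once
    duplicates = []    # every repeated occurrence after the first
    for row in dataset:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates, seen

# Made-up sample rows: [name, number of reviews]
sample_rows = [['Box', '159'], ['Slack', '51507'], ['Box', '159'], ['Box', '159']]
duplicates, unique = count_duplicates(sample_rows)
print('Number of duplicate apps:', len(duplicates))
print('Number of unique apps:', len(unique))
```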


Certain apps should not get counted more than once, so I need to remove the 1181 duplicate entries and keep only one entry per app. 

The duplicate entries of an app differ mainly in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. We won't remove rows randomly; instead, we'll keep the row with the highest number of reviews, because the higher the number of reviews, the more reliable the ratings.

To do that, we will:

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

In [8]:
reviews_max = {} # dictionary with unique apps

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [9]:
print(len(reviews_max))

9659


The Google Play dataset was reduced earlier to 10840 apps. Subtracting the 1181 duplicates gives 9659 unique apps, which matches the length of "reviews_max". So I can use the reviews_max dictionary to remove the duplicated rows.

To keep only the highest number of reviews for the duplicated apps, we apply the code below:

- We initialise two empty lists: android_clean and already_added.
- We loop through the Android dataset, and for every iteration:
    - We extract the name of the app (index: 0) and the number of reviews (index: 3).
    - We add the current row (app) to the android_clean list, and the app name (name) to the already_added list if:
        - The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary; and
        - The name of the app is not already in the already_added list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps.


In [10]:
android_clean = []    # store new cleaned dataset
already_added = []    # store app names

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

In [11]:
print(len(already_added))

9659


Using the "explore_data()" function again, we can check the number of rows of the clean Google Play dataset "android_clean". The number of rows should equal the length of reviews_max, 9659.

In [13]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
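
The two-pass approach above (first build `reviews_max`, then filter) could also be collapsed into a single pass that stores the best row per app name. This is a sketch under assumptions, not the notebook's method: `keep_most_reviewed` is a hypothetical helper and the sample rows are made up. Note that the result is ordered by the first appearance of each name, which can differ from the row order produced by the code above.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when several rows tie on review count
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up sample rows: [name, category, rating, reviews]
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Slack', 'BUSINESS', '4.4', '500'],
    ['Box', 'BUSINESS', '4.2', '100'],    # exact duplicate (tie on reviews)
    ['Slack', 'BUSINESS', '4.4', '650'],  # later snapshot with more reviews
]
deduplicated = keep_most_reviewed(sample_rows)
print(len(deduplicated))
```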


### Removing Non-English Apps

The company uses only English for the apps it develops, and therefore I would like to analyze only the apps that are designed for an English-speaking audience.

First, I define a function "english_checker" that helps me identify whether an app name contains non-English characters. Each character in a string has a corresponding Unicode code point, which the built-in ord() function returns. In the ASCII standard, the characters commonly used in English text fall in the range 0 to 127.
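
To make the code point idea concrete, the short sketch below prints what `ord()` returns for a few characters, including some that appear in the app names tested further down. The variable name `code_points` is only illustrative.

```python
# Map a few sample characters to their Unicode code points
sample_characters = ['a', 'A', '爱', '™', '😜']
code_points = {character: ord(character) for character in sample_characters}

for character, code_point in code_points.items():
    # The last column shows whether the character falls in the ASCII range
    print(repr(character), code_point, code_point <= 127)
```

Only 'a' (97) and 'A' (65) fall within the ASCII range; the Chinese character, the trademark symbol and the emoji all have code points above 127.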

In [19]:
def english_checker(any_string):
    # Return False as soon as one character lies outside the ASCII range
    for buchstabe in any_string:
        if ord(buchstabe) > 127:
            return False
    return True

Let me first check whether the function works correctly:

In [20]:
english_checker("Instagram")

True

In [21]:
english_checker('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [22]:
english_checker('Docs To Go™ Free Office Suite')

False

In [23]:
english_checker('Instachat 😜')

False

The function works, but it also tags apps whose names contain special characters (trademark symbols, emojis, etc.) as non-English. So I modify the function so that it only tags an app as non-English if its name has more than three characters outside the ASCII range. The new function will be called "english_checker_new".

In [25]:
def english_checker_new(any_string):
    i = 0
    for buchstabe in any_string:
        if ord(buchstabe) > 127:
            i += 1
        if i > 3:
            return False
    return True

In [27]:
print(english_checker_new("Instagram"))
print(english_checker_new('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_checker_new('Docs To Go™ Free Office Suite'))
print(english_checker_new('Instachat 😜'))

True
False
True
True
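
The same rule can be written with an explicit, adjustable threshold. The following is a sketch, not the notebook's function: the name `is_probably_english` and the parameter `max_non_ascii` are hypothetical. It also illustrates a limitation of the heuristic: a short non-English name with three or fewer non-ASCII characters still passes.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii characters above code point 127."""
    non_ascii_count = sum(1 for character in name if ord(character) > 127)
    return non_ascii_count <= max_non_ascii

print(is_probably_english('Instagram'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))

# Limitation: a three-character non-English name is still accepted
print(is_probably_english('中国語'))

# With max_non_ascii=0 the function behaves like the strict first version
print(is_probably_english('Docs To Go™ Free Office Suite', max_non_ascii=0))
```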


The function is not perfect, but it should be good enough for this analysis. I can now use it to filter out non-English apps from the two datasets. I will call the new datasets "google_english" and "apple_english".

In [31]:
google_english = []
apple_english = []

for app in android_clean:
    name = app[0]
    if english_checker_new(name):
        google_english.append(app)
        
for app in ios:
    name = app[1]
    if english_checker_new(name):
        apple_english.append(app)
        
print("English Android apps:")
print('\n')
explore_data(google_english, 0, 3, True)
print('\n')
print("English Ios apps:")
print('\n')
explore_data(apple_english, 0, 3, True)

English Android apps:


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


English Ios apps:


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130

The number of apps in the Google Play dataset was reduced from 9659 to 9614 apps. The Apple dataset was reduced by more than 1000 apps from 7197 to 6183, meaning that the original Apple dataset contained a lot of non-English apps.

### Isolating Free Apps

As mentioned before, the company only builds free apps, so this analysis is only concerned with free apps. In this step, I therefore remove all apps that are not free.

First, I check the headers again to find the column index that refers to the price of the apps in each dataset:

In [33]:
print('Google Dataset')
print(android_header)
print('\n')
print('Apple Dataset')
print(ios_header)

Google Dataset
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Apple Dataset
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [35]:
google_final = []
apple_final = []

for app in google_english:
    price = app[7]
    if price == '0':
        google_final.append(app)

for app in apple_english:
    price = app[4]
    if price == '0.0':
        apple_final.append(app)
        
print('Final Google apps data:', len(google_final))
print('Final Apple apps data:', len(apple_final))

Final Google apps data: 8864
Final Apple apps data: 3222


After this step, 8864 Google Play apps and 3222 App Store apps are left in the two final datasets.
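
The filter above compares price strings exactly ('0' for Google Play, '0.0' for the App Store), which works only as long as each file always formats a zero price the same way. A more tolerant check could parse the price as a number first. This is a sketch with a hypothetical helper `is_free`; that paid prices may carry a leading '$' is an assumption, not something shown in the rows printed above.

```python
def is_free(price_text):
    """Return True if the price string represents zero; unparseable values count as not free."""
    cleaned = price_text.strip().lstrip('$')  # assumption: paid prices may look like '$4.99'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False

for sample_price in ['0', '0.0', '0.00', '$4.99', 'Everyone']:
    print(repr(sample_price), is_free(sample_price))
```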

## 4. Data Analysis

### Most Common Apps by Genre

As mentioned above, the goal of this analysis is to determine the kinds of apps that are likely to attract more users, because the number of people using the apps affects the revenue.

The strategy of the company has three steps:

- build a minimal Android version of the app, and add it to Google Play
- if the app has a good response from users, develop it further
- if the app is profitable after six months, build an iOS version of the app and add it to the App Store

Therefore, I have to find app profiles that are successful in both markets. As a first step, I will determine the most common genres for each market. For this, I need to build frequency tables for the most interesting columns in the datasets.

So, first, I'll need to make two functions to start the analysis:

- One function to generate frequency tables that show percentages
- Another to display the percentages in descending order

In [36]:
# Freq_table function
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for app in dataset:
        total += 1
        value = app[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentage = {}
    for key in table:
        percentage = table[key] / total * 100
        table_percentage[key] = percentage
        
    return table_percentage

# Display_table function
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
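
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library, whose `most_common()` method already returns the values in descending order of frequency. The function name `freq_table_counter` and the sample rows are made up for illustration.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage}, ordered from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}

# Made-up sample rows: [name, genre]
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(sample_rows, 1)
for genre, percentage in table.items():
    print(genre, ':', percentage)
```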

Now I can use the "display_table" function to display the most common genres in both datasets:

In [37]:
print("Most common genre in App Store:")
print('\n')
display_table(apple_final, 11)

Most common genre in App Store:


Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In [39]:
print("Most common genre in Google Play Store:")
print('\n')
display_table(google_final, 9)

Most common genre in Google Play Store:


Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.93637184

Looking at the App Store frequency table, we see that the most common genre is Games (about 58% of the free English apps), followed far behind by Entertainment, Photo & Video and Education.

The top 5 genres in the Google Play Store are as follows:
Tools, Entertainment, Education, Business, Productivity (tied with Lifestyle)


### Most Popular Apps by Genre

In the next step I check which apps are the most popular based on the number of users. For the Apple Store, this data is not available. Instead I use the number of ratings ("rating_count_tot") as a proxy. For the Google store, I can use the "Installs" column to check how many users there are.

#### Apple

In [40]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [49]:
print(apple_final[0])

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


In [52]:
print('Average number of ratings per primary genre in iOS apps:')

genres_ios = freq_table(apple_final, 11)
for genre in genres_ios:
    total = 0      # sum of ratings for the genre
    len_genre = 0  # number of apps specific to each genre
    for app in apple_final:
        genre_app = app[11]
        if genre_app == genre:
            n_rating = float(app[5])
            total += n_rating
            len_genre += 1
    avg_n_rating = total / len_genre
    print(genre, ':', avg_n_rating)

Average number of ratings per primary genre in iOS apps:
Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
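
Because the averages above are printed in dictionary order, the top genres are hard to spot. A possible single-pass variant that also sorts the result is sketched below; `average_by_group` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return (group, average) pairs sorted by average, highest first."""
    totals = defaultdict(float)  # group -> sum of values
    counts = defaultdict(int)    # group -> number of rows
    for row in dataset:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    averages = {group: totals[group] / counts[group] for group in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Made-up sample rows: [genre, number of ratings]
sample_rows = [['Games', '100'], ['Games', '300'], ['Music', '500']]
ranking = average_by_group(sample_rows, 0, 1)
for genre, average in ranking:
    print(genre, ':', average)
```

With the notebook's data, the call would presumably be `average_by_group(apple_final, 11, 5)`.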


I can see that Navigation, Reference and Social Networking are the three genres with the highest average number of ratings. Let me check the most popular apps of these genres in more detail:

In [56]:
print('Navigation')
for app in apple_final:
    if app[11] == 'Navigation' and (int(app[5]) > 100000):
        print(app[1], ':', app[5])

print('\n')
print('Reference')
for app in apple_final:
    if app[11] == 'Reference' and (int(app[5]) > 100000):
        print(app[1], ':', app[5])

print('\n')        
print('Social Networking')        
for app in apple_final:
    if app[11] == 'Social Networking' and (int(app[5]) > 100000):
        print(app[1], ':', app[5])

Navigation
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911


Reference
Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047


Social Networking
Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778


I can now confirm that the average number of ratings (the proxy for the average number of users) in the top 3 genres is being pulled up by a small number of extremely popular apps. The same is most likely true for Music (Spotify, Pandora), Weather (Weather, Accuweather), and Book (Kindle, Audible).
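
One way to see how strongly a single giant app distorts an average is to compare the mean with the median, which is much less sensitive to outliers. The numbers below are made up for illustration and do not come from the dataset.

```python
from statistics import mean, median

# Made-up rating counts for one genre: one giant app and four small ones
rating_counts = [3000000, 1200, 900, 400, 100]

genre_mean = mean(rating_counts)
genre_median = median(rating_counts)

print('Mean:', genre_mean)      # dominated by the single giant app
print('Median:', genre_median)  # closer to what a typical app in the genre gets
```

In this toy example the mean is several hundred times larger than the median, which is the kind of gap that signals a genre dominated by a few apps.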

Unless the company wants to develop the less popular app categories (with fewer reviews), they will just have to try to compete with bigger players.

#### Google

In [57]:
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [59]:
category_android = freq_table(google_final, 1)

for category in category_android:
    total = 0  # sum of installs specific to each genre
    len_category = 0  # number of apps specific to each genre
    for app in google_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

The categories in the Google Play Store with the highest average number of installs are COMMUNICATION, VIDEO_PLAYERS, SOCIAL, PHOTOGRAPHY, and PRODUCTIVITY. As with the App Store, these numbers are most likely being pulled up by a few very popular apps. Let's check.

In [60]:
print('VIDEO_PLAYERS')
for app in google_final:
    if (app[1] == 'VIDEO_PLAYERS') and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+'):
        print(app[0], ':', app[5])

print('\n')
print('COMMUNICATION')
for app in google_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+'):
        print(app[0], ':', app[5])

VIDEO_PLAYERS
YouTube : 1,000,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+


COMMUNICATION
WhatsApp Messenger : 1,000,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+


The same pattern of a few extremely popular apps inflating the average number of users per category can be seen in the Google Store. We have YouTube for Video Players and Google Chrome, WhatsApp, and Messenger for Communication.
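
To gauge how much these giants inflate a category average, one could recompute the average while leaving out apps above some install threshold. The sketch below is an assumption-laden illustration: `parse_installs` and `average_installs` are hypothetical helpers, the 100,000,000 cutoff is arbitrary, and the sample rows are made up.

```python
def parse_installs(installs_text):
    """Convert a string such as '1,000,000+' to the integer 1000000."""
    return int(installs_text.replace('+', '').replace(',', ''))

def average_installs(rows, installs_index=1, max_installs=None):
    """Average installs over rows, optionally ignoring apps at or above max_installs."""
    values = [parse_installs(row[installs_index]) for row in rows]
    if max_installs is not None:
        values = [value for value in values if value < max_installs]
    return sum(values) / len(values) if values else 0.0

# Made-up sample rows: [name, installs]
sample_rows = [
    ['Giant app', '1,000,000,000+'],
    ['Small app A', '10,000+'],
    ['Small app B', '50,000+'],
]
overall = average_installs(sample_rows)
capped = average_installs(sample_rows, max_installs=100_000_000)
print('Average with all apps:', overall)
print('Average without the giants:', capped)
```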

Since a gamified or interactive reference or book app is one idea worth considering for the App Store, let's check what the BOOKS_AND_REFERENCE and GAME categories look like in the Google Play Store.

In [61]:
print('\n')
print('BOOKS_AND_REFERENCE')
for app in google_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+' 
                                            or app[5] == '500,000,000+' 
                                           or app[5] == '100,000,000+'
                                           or app[5] == '10,000,000+'):
        print(app[0], ':', app[5])
        
print('\n')
print('GAME')
for app in google_final:
    if app[1] == 'GAME' and (app[5] == '1,000,000,000+' 
                                            or app[5] == '500,000,000+' 
                                           or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])



BOOKS_AND_REFERENCE
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Google Play Books : 1,000,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Aldiko Book Reader : 10,000,000+
Wattpad 📖 Free Books : 100,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Quran for Android : 10,000,000+
Audiobooks from Audible : 100,000,000+
Dictionary.com: Find Definitions for English Words : 10,000,000+
English Dictionary - Offline : 10,000,000+
NOOK: Read eBooks & Magazines : 10,000,000+
Dictionary : 10,000,000+
Spanish English Translator : 10,000,000+
Dictionary - Merriam-Webster : 10,000,000+
JW Library : 10,000,000+
Oxford Dictionary of English : Free : 10,000,000+
English Hindi Dictionary : 10,000,000+


GAME
Sonic Dash : 100,000,000+
PAC-MAN : 100,000,000+
Roll the Ball® - slide puzzle : 100,000,000+
Piano Tiles 2™ : 100,000,000+

Dictionaries, e-book readers, and religious text references appear to be the most popular. On the other hand, casual games appear to dominate the Game category.

## 5. Conclusion

The goal of this project was to identify the type of app I would recommend developing if the business model is to attract users and earn revenue through ads. There appears to be no clear answer, but reference or educational apps that are interactive or 'gamified' seem to hold some promise. The company could develop primary-school-level reference materials (nursery rhymes, short stories) with simple mini-games (word matching, fill in the blanks, image matching) that award the user experience points.

For a more mature demographic, the company could create an app that recommends books or movies based on the books or movies users say they like. The app could also contain some basic information on movies and books.