# Profitable App Profiles for the App Store and Google Play Markets


## Introduction

This project analyses data from apps in the App Store and Google Play markets to determine which kinds of apps are the most profitable.

The data sets used in this project can be viewed by clicking the following links:

[Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps) by Lavanya Gupta

[Mobile App Store (7200 apps)](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) by Ramanathan 


**Project Goal**: To analyze data to help understand what type of apps are likely to attract more users and be profitable.
***

## Opening and Reading the Data Set

In the code block below we open the Apple App Store and Google Play Store data sets and assign them to `apple_data` and `google_data` respectively.

Note: Each data set has a header row, which was discovered in the data exploration stage. The headers are useful for identifying the type of data in each column.

The header rows have been assigned to the variables `apple_header` and `google_header`.

In [1]:
from csv import reader

opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
apple_data = list(read_file)
apple_header = apple_data[0]
apple_data = apple_data[1:]

opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
google_data = list(read_file)
google_header = google_data[0]
google_data = google_data[1:]
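
The loading step above can also be wrapped in a small helper so that each file is closed automatically. This is only a minimal sketch, not the notebook's own code: the `load_csv` name and the `encoding='utf8'` argument are assumptions, and the demo writes a tiny temporary file (a stand-in for the Kaggle files) so that it runs on its own.

```python
import csv
import os
import tempfile


def load_csv(path, encoding='utf8'):
    """Return (header, rows) from a CSV file. load_csv is a hypothetical helper."""
    with open(path, encoding=encoding, newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Demo on a throwaway file standing in for AppleStore.csv / googleplaystore.csv.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write('App,Category\nSlack,BUSINESS\nBox,BUSINESS\n')
    demo_path = tmp.name

demo_header, demo_rows = load_csv(demo_path)
os.remove(demo_path)
print(demo_header)
print(len(demo_rows))
```

With the real files, a call would look like `apple_header, apple_data = load_csv('AppleStore.csv')`.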

## Exploring the Data Set

In the code block below we define the `explore_data` function.

This function prints a slice of the data set and takes the following parameters:

* `dataset`, the list of lists holding the data extracted from the .csv file
* `start`, the starting position of the data slice
* `end`, the ending position of the data slice (exclusive)
* `rows_and_columns`, prints the number of rows and columns in the data set if `True` (`False` by default)

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(google_header, '\n')
explore_data(google_data,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


From the above output we can begin to understand the structure of the data set.<br/>
The header row was printed, and each of its elements is straightforward and easy to understand.

In [4]:
print(apple_header, '\n')
explore_data(apple_data,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


As we can see from the output above, some of the column names in `apple_header` are not so straightforward. <br/><br/>
Each column is explained in the [documentation of the data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

## Data Cleaning

### Deleting Wrong Data

In the Google Play data set [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), a comment describes an error for a certain row. <br/><br/>
Let's have a look at this particular row in the `google_data` data set.

In [5]:
print(google_data[10472], '\n') #Incorrect Data Row
print(google_header, '\n')      #Header Row
print(google_data[0])     #Correct Data Row for reference

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Row 10472 in `google_data` shows a rating of 19. This is clearly wrong, as Google Play ratings cannot exceed 5. Comparing the row with the header shows why: it has only 12 values instead of 13 because the `Category` value is missing, so every later value is shifted one column to the left. <br/><br/>
To proceed we can either delete this row or correct its columns.<br/>

Since the data set contains over 10,000 apps, losing a single row has no meaningful effect, so I chose to remove the row.

In [6]:
del google_data[10472] #do not run more than once

In [7]:
print(len(google_data)) # To check if row has been deleted

10840
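
Because the faulty row printed above has 12 values rather than 13, a length check is one way to drop such rows without relying on a hard-coded index. The sketch below is an alternative to the `del` statement, not the notebook's method: `drop_malformed_rows` is a hypothetical helper and the sample rows are made up for illustration.

```python
def drop_malformed_rows(rows, n_columns):
    """Keep only rows that have exactly n_columns fields."""
    return [row for row in rows if len(row) == n_columns]


# Made-up sample data: the second row is missing a field.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Slack', 'BUSINESS', '4.4'],
    ['Broken app', '19'],          # a field is missing, so values are shifted
    ['Box', 'BUSINESS', '4.2'],
]

cleaned = drop_malformed_rows(sample_rows, len(sample_header))
cleaned = drop_malformed_rows(cleaned, len(sample_header))  # safe to repeat
print(len(cleaned))
```

Unlike `del google_data[10472]`, this filter can be run more than once without removing valid rows.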


### Deleting Duplicate Entries

If we go through the Google Play Store data set [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) we can see that there are a number of duplicated app entries.<br/><br/>
We are going to remove these duplicated entries, but first we need to identify which apps are duplicated.<br/><br/>
The code below does this by separating the app names into two lists, `duplicate_apps` and `unique_apps`.

In [8]:
duplicate_apps = []
unique_apps = []

for app in google_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:',len(duplicate_apps),'\n')
print('Examples: ',duplicate_apps[:10])

Number of duplicate apps: 1181 

Examples:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
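
The same duplicate count can be obtained with `collections.Counter`, which avoids searching a list on every iteration. This is a sketch on a made-up list of names, not the notebook's code; the variable names are assumptions.

```python
from collections import Counter

# Made-up sample of app names containing repeats.
sample_names = ['Slack', 'Box', 'Slack', 'Zenefits', 'Box', 'Slack']

name_counts = Counter(sample_names)
# Every occurrence beyond the first counts as a duplicate entry.
n_duplicates = sum(count - 1 for count in name_counts.values())
duplicated_names = [name for name, count in name_counts.items() if count > 1]

print(n_duplicates)
print(duplicated_names)
```

On the real data, `sample_names` would be replaced by the first column of `google_data`.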


Taking 'Slack' as an example and printing every row with that name gives us the three instances shown below:

In [9]:
print(google_header,'\n')
for app in google_data:
    name = app[0]
    if name == 'Slack':
        print(app,'\n')

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device'] 

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device'] 

['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device'] 



To clean the data we create a dictionary with each unique app name as the key and the maximum number of reviews recorded for that app as the value.<br/><br/>
Among the duplicate entries, we keep the row with the highest number of reviews, since a higher review count suggests the row was collected most recently. In this instance, that is the last row of output for 'Slack'.

In [10]:
review_max = {}

for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
    if name in review_max and n_reviews > review_max[name]:
        review_max[name] = n_reviews
    elif name not in review_max:
        review_max[name] = n_reviews      


In [11]:
# Check that the dictionary contains every app exactly once (no duplicates)
print(len(google_data)-len(duplicate_apps))
print(len(review_max))


9659
9659

We then use `review_max` to build `android_clean`, keeping only the row whose review count matches the maximum for each app. The `already_added` list ensures an app is added only once when several of its rows share the same maximum review count.


In [12]:
android_clean = []
already_added = []

for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == review_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)


In [13]:
print(len(android_clean))

9659
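
The two steps above (finding the maximum, then filtering) can also be combined into a single pass that stores the best row per app name. This is a sketch under stated assumptions: `keep_most_reviewed` is a hypothetical helper, the sample rows are shortened, and the 'Box' values are made up.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Return one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened sample rows: name, category, rating, reviews.
sample_rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]

deduped = keep_most_reviewed(sample_rows)
print(deduped)
```

Note that the result is ordered by each name's first appearance, which can differ from the row order produced by the two-step approach.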


### Deleting Non-English Apps



The characters commonly used in English text all have code points in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system.<br/><br/> Based on this, we use the function below to flag app names that contain any character outside this range.

In [14]:
def is_english(app_name):
    for character in app_name:
        if ord(character) > 127:
            return False
    return True
    

In [15]:
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Instagram'))
print(is_english('Docs To Go™ Free Office Suite'))

False
False
True
False

This first version is too strict: it rejects English app names that contain a single emoji or a symbol such as ™. The revised function below only rejects a name when it contains more than three characters outside the ASCII range.


In [16]:
def is_english(app_name):
    count = 0
    for character in app_name:
        if ord(character) > 127:
            count += 1
        if count > 3:
            return False
    return True

In [17]:
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Instagram'))
print(is_english('Docs To Go™ Free Office Suite'))

True
False
True
True
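
The threshold of three characters is a judgment call, so it can be useful to expose it as a parameter. The sketch below is an equivalent formulation for experimentation, not a replacement for `is_english`; the `is_mostly_ascii` name and the `max_non_ascii` parameter are assumptions.

```python
def is_mostly_ascii(app_name, max_non_ascii=3):
    """True if app_name has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for character in app_name if ord(character) > 127)
    return non_ascii <= max_non_ascii


for sample_name in ['Instachat 😜', '爱奇艺PPS -《欢乐颂2》电视剧热播',
                    'Instagram', 'Docs To Go™ Free Office Suite']:
    print(sample_name, ':', is_mostly_ascii(sample_name))
```

Passing `max_non_ascii=0` reproduces the strict behaviour of the first version.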


In [18]:
english_android = []
english_apple = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        english_android.append(app)

for app in apple_data:
    name = app[1]
    if is_english(name):
        english_apple.append(app)

In [19]:
explore_data(english_android,0,3,True)
print('\n')
explore_data(english_apple,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

### Isolating Free Apps

Since our goal is to identify apps that generate revenue through in-app ads, we need to target free apps, as these are the apps that normally contain ads.<br/><br/>
Let's isolate the free apps in the new variables `android_final` and `ios_final`.

In [20]:
android_final = []
ios_final = []

for app in english_android:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in english_apple:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8864
3222
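
The comparison above relies on the exact strings `'0'` and `'0.0'`. A numeric check is a little more robust to formatting differences. This is only a sketch: `is_free` is a hypothetical helper, and the `'$'` prefix on paid prices is an assumption about the data rather than something shown in the output above.

```python
def is_free(price_text):
    """True if a price string such as '0', '0.0' or '$4.99' parses to zero."""
    return float(price_text.replace('$', '')) == 0.0


for sample_price in ['0', '0.0', '$4.99', '2.99']:
    print(sample_price, ':', is_free(sample_price))
```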


After cleaning, we are left with 8864 apps in the Google Play Store data set and 3222 apps in the Apple App Store data set.

We can now proceed with our data analysis to determine what type of apps are likely to attract more users.

## Analysing the Data

### Most Common Apps by Genre for App Store and Google Play market

The end goal is to determine the kinds of apps that are likely to attract more users, because revenue is strongly influenced by the number of people using the app.



In [21]:
print(google_header,'\n')
print(apple_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [22]:
def freq_table(dataset, index):
    freq_dict = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in freq_dict:
            freq_dict[value] +=1
        else:
            freq_dict[value] = 1
            
    percentage_dict = {}
    for value in freq_dict:
        percentage = float(freq_dict[value] / total)*100
        percentage_dict[value] = percentage
        
    return percentage_dict

def display_table(dataset, index): # Sorts and prints the percentage table highest to lowest
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', round(entry[0],2), '%')     
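
For comparison, the standard library's `collections.Counter` can build the same kind of percentage table in a few lines. This is a sketch on made-up genre values, not the notebook's implementation; `percentage_table` is a hypothetical name.

```python
from collections import Counter


def percentage_table(values):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Made-up sample of genre labels.
sample_genres = ['Games', 'Games', 'Games', 'Education', 'Music', 'Games',
                 'Education', 'Games']
for genre, percentage in percentage_table(sample_genres):
    print(genre, ':', round(percentage, 2), '%')
```

Values with equal counts may appear in a different order than in `display_table`, which sorts ties by name in reverse.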

In [23]:
display_table(ios_final,11)


Games : 58.16 %
Entertainment : 7.88 %
Photo & Video : 4.97 %
Education : 3.66 %
Social Networking : 3.29 %
Shopping : 2.61 %
Utilities : 2.51 %
Sports : 2.14 %
Music : 2.05 %
Health & Fitness : 2.02 %
Productivity : 1.74 %
Lifestyle : 1.58 %
News : 1.33 %
Travel : 1.24 %
Finance : 1.12 %
Weather : 0.87 %
Food & Drink : 0.81 %
Reference : 0.56 %
Business : 0.53 %
Book : 0.43 %
Navigation : 0.19 %
Medical : 0.19 %
Catalogs : 0.12 %


The above data shows that the most common genres among free App Store apps aimed at an English-speaking audience are:

* Games (58.16%)
* Entertainment (7.88%)
* Photo & Video (4.97%)

Each of these genres is focused on fun and entertainment.


In [24]:
display_table(android_final,1)


FAMILY : 18.91 %
GAME : 9.72 %
TOOLS : 8.46 %
BUSINESS : 4.59 %
LIFESTYLE : 3.9 %
PRODUCTIVITY : 3.89 %
FINANCE : 3.7 %
MEDICAL : 3.53 %
SPORTS : 3.4 %
PERSONALIZATION : 3.32 %
COMMUNICATION : 3.24 %
HEALTH_AND_FITNESS : 3.08 %
PHOTOGRAPHY : 2.94 %
NEWS_AND_MAGAZINES : 2.8 %
SOCIAL : 2.66 %
TRAVEL_AND_LOCAL : 2.34 %
SHOPPING : 2.25 %
BOOKS_AND_REFERENCE : 2.14 %
DATING : 1.86 %
VIDEO_PLAYERS : 1.79 %
MAPS_AND_NAVIGATION : 1.4 %
FOOD_AND_DRINK : 1.24 %
EDUCATION : 1.16 %
ENTERTAINMENT : 0.96 %
LIBRARIES_AND_DEMO : 0.94 %
AUTO_AND_VEHICLES : 0.93 %
HOUSE_AND_HOME : 0.82 %
WEATHER : 0.8 %
EVENTS : 0.71 %
PARENTING : 0.65 %
ART_AND_DESIGN : 0.64 %
COMICS : 0.62 %
BEAUTY : 0.6 %


The above data shows us the `Category` distribution of free apps marketed to an English-speaking audience in the Google Play Store. The top 3 categories are as follows:

* FAMILY : 18.91 %
* GAME : 9.72 %
* TOOLS : 8.46 %

Further investigation of the [Family category of the Play Store](https://play.google.com/store/apps/category/FAMILY?hl=en) shows that it contains mainly child-friendly apps and games.

Unlike in the Apple App Store data set, no single category makes up over half of the apps. Instead, the Google Play Store has a much more even distribution, with practical apps (such as those in the tools category) also having a significant presence compared to gaming apps.


This is further confirmed by analysing the `Genres` column of the data set below.

In [25]:
display_table(android_final,9)

Tools : 8.45 %
Entertainment : 6.07 %
Education : 5.35 %
Business : 4.59 %
Productivity : 3.89 %
Lifestyle : 3.89 %
Finance : 3.7 %
Medical : 3.53 %
Sports : 3.46 %
Personalization : 3.32 %
Communication : 3.24 %
Action : 3.1 %
Health & Fitness : 3.08 %
Photography : 2.94 %
News & Magazines : 2.8 %
Social : 2.66 %
Travel & Local : 2.32 %
Shopping : 2.25 %
Books & Reference : 2.14 %
Simulation : 2.04 %
Dating : 1.86 %
Arcade : 1.85 %
Video Players & Editors : 1.77 %
Casual : 1.76 %
Maps & Navigation : 1.4 %
Food & Drink : 1.24 %
Puzzle : 1.13 %
Racing : 0.99 %
Role Playing : 0.94 %
Libraries & Demo : 0.94 %
Auto & Vehicles : 0.93 %
Strategy : 0.91 %
House & Home : 0.82 %
Weather : 0.8 %
Events : 0.71 %
Adventure : 0.68 %
Comics : 0.61 %
Beauty : 0.6 %
Art & Design : 0.6 %
Parenting : 0.5 %
Card : 0.45 %
Casino : 0.43 %
Trivia : 0.42 %
Educational;Education : 0.39 %
Board : 0.38 %
Educational : 0.37 %
Education;Education : 0.34 %
Word : 0.26 %
Casual;Pretend Play : 0.24 %
Music : 0.2 %
R

The above data shows us the `Genres` distribution among free apps in the Google Play Store marketed toward an English-speaking audience.

Here we can see supporting evidence that practical apps have a major presence in the Play Store alongside 'fun' game apps.

The difference between the `Genres` and `Category` columns is unclear; however, the `Genres` column is much more granular. Since we only require a broad understanding for our analysis, from here on we will focus on the `Category` column.


### Most Popular Apps by Genres on App Store

We now try to find the most used apps by genre in the App Store.

The Google Play data set has an `Installs` column, which allows us to determine the average number of users per app genre.

The Apple App Store data set, however, has no data on the number of installs of an app.
As a proxy we are going to use the `rating_count_tot` column, which tells us the total number of ratings for a particular app.

In [26]:
freq_ios_genre = freq_table(ios_final,11)

for genre in freq_ios_genre:
    total = 0
    len_genre = 0
    for app in ios_final:
        app_genre = app[11]
        if app_genre == genre:
            total += float(app[5])
            len_genre += 1
    avg_rating =  total / len_genre
    print(genre, ':', avg_rating)       

Utilities : 18684.456790123455
Business : 7491.117647058823
Sports : 23008.898550724636
Weather : 52279.892857142855
Entertainment : 14029.830708661417
Travel : 28243.8
Food & Drink : 33333.92307692308
Music : 57326.530303030304
Finance : 31467.944444444445
Social Networking : 71548.34905660378
Medical : 612.0
Book : 39758.5
Catalogs : 4004.0
News : 21248.023255813954
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Navigation : 86090.33333333333
Games : 22788.6696905016
Reference : 74942.11111111111
Photo & Video : 28441.54375
Productivity : 21028.410714285714


The above data shows that the three genres with the highest average number of ratings per app are:
* Navigation (86k)
* Reference (75k)
* Social Networking (72k)

Let's dive into each of these genres to gain a better understanding of the distribution of total ratings for each app within them.

In [27]:
for app in ios_final:
    genre = app[11]
    app_name = app[1]
    n_rating = app[5]
    if genre == "Navigation":
        print(app_name, ':', n_rating)

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


We can see from the above output that Waze and Google Maps have far more ratings than the other apps in the 'Navigation' genre, and they inflate the average number of ratings we calculated earlier. 

Without these two apps in our data set, the average number of ratings for the 'Navigation' genre would be *4,146.25* instead of *86,090.3*.

This tells us that Google and Waze dominate the 'Navigation' genre. The chances of a new navigation app attracting a high number of users may be slim, as we would be competing with tech giants.
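
The two averages quoted for the 'Navigation' genre can be checked directly from the rating counts printed above. The snippet below is a small worked example; the variable names are illustrative.

```python
# Rating counts copied from the 'Navigation' output above.
navigation_ratings = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 345046,
    'Google Maps - Navigation & Transit': 154911,
    'Geocaching®': 12811,
    'CoPilot GPS – Car Navigation & Offline Maps': 3582,
    'ImmobilienScout24: Real Estate Search in Germany': 187,
    'Railway Route Search': 5,
}

all_counts = list(navigation_ratings.values())
mean_all = sum(all_counts) / len(all_counts)

# Drop the two largest apps (Waze and Google Maps) and average the rest.
rest = sorted(all_counts)[:-2]
mean_rest = sum(rest) / len(rest)

print(round(mean_all, 2))
print(round(mean_rest, 2))
```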

Now let's analyse the distribution of total ratings for each app within the 'Social Networking' genre.

In [28]:
for app in ios_final:
    genre = app[11]
    app_name = app[1]
    n_rating = app[5]
    if genre == "Social Networking":
        print(app_name, ':', n_rating)

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

Here we observe a pattern similar to the one we saw in the Navigation genre. Social media giants like Facebook, Pinterest and Skype heavily skew the results.

Competing in this market would be difficult, as it would take an entirely new social media concept to stand a chance of succeeding.

Apps in the 'Reference' genre have 74,942 user ratings on average, but the Bible and Dictionary.com apps are skewing this average.

In [29]:
for app in ios_final:
    genre = app[11]
    app_name = app[1]
    n_rating = app[5]
    if genre == "Reference":
        print(app_name, ':', n_rating)

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


However, this genre still shows some potential, as it seems to be a niche market in which our app could succeed.

With the rise in the number of people who own and use tablets, reading on a screen seems to be becoming the norm.

A potential app that may do particularly well would be one that allows users to download books or subscribe to newspapers to read on their devices. This app could also integrate its own dictionary so that users can look up definitions without having to switch between apps. It could also act as a PDF reader for users' own downloaded files.

There will definitely be similar apps on the store based on this concept. However, with the right pricing, whether a subscription model, a pay-per-read model or ads alone, this could become a profitable app and a big hit among readers. 

### Popular Apps by Genre in the Google Play market

Unlike the Apple App Store data set, the Google Play data set has an `Installs` column. This gives us a more accurate idea of the number of people using a particular app.

However, as we see below, the exact number of installs is not provided. Instead, the data set gives open-ended buckets (5,000+, 100,000+, ...).

In [42]:
display_table(android_final, 5)


1,000,000+ : 15.73 %
100,000+ : 11.55 %
10,000,000+ : 10.55 %
10,000+ : 10.2 %
1,000+ : 8.39 %
100+ : 6.92 %
5,000,000+ : 6.83 %
500,000+ : 5.56 %
50,000+ : 4.77 %
5,000+ : 4.51 %
10+ : 3.54 %
500+ : 3.25 %
50,000,000+ : 2.3 %
100,000,000+ : 2.13 %
50+ : 1.92 %
5+ : 0.79 %
1+ : 0.51 %
500,000,000+ : 0.27 %
1,000,000,000+ : 0.23 %
0+ : 0.05 %
0 : 0.01 %


To proceed with the analysis we are going to take these install numbers at face value (i.e. 1,000,000+ installs will be treated as 1,000,000).
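
Taking a bucket at face value amounts to stripping the commas and the trailing plus sign and converting what remains to a number. A minimal sketch of that conversion follows; `parse_installs` is a hypothetical helper name, and the labels are taken from the table above.

```python
def parse_installs(installs_text):
    """Convert a bucket label such as '1,000,000+' to its lower bound as an int."""
    return int(installs_text.replace(',', '').replace('+', ''))


for label in ['1,000,000+', '5,000+', '0+', '0']:
    print(label, '->', parse_installs(label))
```

Because each value is only a lower bound, averages computed from it understate the true number of installs.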

Let's have a look at the average number of installs per category below:

In [31]:
freq_android_category = freq_table(android_final,1)

for category in freq_android_category:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',','')
            n_installs = n_installs.replace('+','')
            total += float(n_installs)
            len_category += 1
    avg_installs = total/len_category
    print(category, ':', round(avg_installs,2))        


GAME : 15588015.6
FOOD_AND_DRINK : 1924897.74
SPORTS : 3638640.14
PARENTING : 542603.62
SOCIAL : 23253652.13
LIBRARIES_AND_DEMO : 638503.73
FAMILY : 3695641.82
TOOLS : 10801391.3
WEATHER : 5074486.2
PRODUCTIVITY : 16787331.34
ART_AND_DESIGN : 1986335.09
NEWS_AND_MAGAZINES : 9549178.47
TRAVEL_AND_LOCAL : 13984077.71
ENTERTAINMENT : 11640705.88
BOOKS_AND_REFERENCE : 8767811.89
HOUSE_AND_HOME : 1331540.56
SHOPPING : 7036877.31
COMMUNICATION : 38456119.17
LIFESTYLE : 1437816.27
HEALTH_AND_FITNESS : 4188821.99
DATING : 854028.83
MEDICAL : 120550.62
VIDEO_PLAYERS : 24727872.45
PHOTOGRAPHY : 17840110.4
PERSONALIZATION : 5201482.61
EVENTS : 253542.22
BUSINESS : 1712290.15
EDUCATION : 1833495.15
MAPS_AND_NAVIGATION : 4056941.77
BEAUTY : 513151.89
FINANCE : 1387692.48
COMICS : 817657.27
AUTO_AND_VEHICLES : 647317.82


From the above list we can see that communication apps have the highest average number of installs. This figure is heavily skewed by popular apps such as WhatsApp and Facebook Messenger.

In [36]:
for app in android_final:
    category = app[1]
    name = app[0]
    installs = app[5]
    if category=='COMMUNICATION' and (installs == '1,000,000,000+'
                                      or installs == '500,000,000+'
                                      or installs == '100,000,000+'):
        print(name, ':', installs)

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

Removing all apps with 100 million or more installs reduces the average for the communication category roughly tenfold.

In [37]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
print('Average without 100 million+ installs apps: ', sum(under_100_m) / len(under_100_m))

Average without 100 million+ installs apps:  3603485.3884615386


This trend is similar for video players (YouTube, Google Play Movies etc.), social apps (Facebook, Instagram etc.) and photography apps (Google Photos and others). 

The main concern here is that these categories seem more popular than they really are because of a few extremely popular apps. Hence, a new app in one of these categories might not succeed in gaining a large number of users.
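
One way to see how much a few giant apps distort a category is to compare the mean with the median, which is far less sensitive to outliers. The sketch below uses made-up install counts for two hypothetical categories. It is an illustration of the idea, not part of the original analysis.

```python
from statistics import mean, median

# Made-up install counts for two hypothetical categories.
sample_installs = {
    'DOMINATED': [1_000_000_000, 500_000_000, 10_000, 5_000, 1_000],
    'EVEN': [10_000_000, 5_000_000, 5_000_000, 1_000_000, 1_000_000],
}

for category, installs in sample_installs.items():
    print(category, ': mean =', mean(installs), ', median =', median(installs))
```

A large gap between the two figures, as in the 'DOMINATED' example, suggests that the average is driven by a handful of apps.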

The game category, as discovered in our analysis of the Apple App Store data set, is oversaturated.

Instead, let's focus on the 'BOOKS_AND_REFERENCE' category, since the equivalent genre stood out in our previous analysis and it shows a decent average number of installs at around 8.8 million.

In [39]:
for app in android_final:
    name = app[0]
    category = app[1]
    installs = app[5]
    if category == 'BOOKS_AND_REFERENCE' and (installs == '1,000,000,000+'
                                            or installs == '500,000,000+'
                                            or installs == '100,000,000+'):
        print(name, ':', installs)

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


From the data above we see that only a handful of apps in the 'BOOKS_AND_REFERENCE' category have over 100 million installs. This is promising, since it suggests the average installs figure is not as heavily skewed as in the categories discussed earlier.

Let's see which apps in this category sit in the middle of the range, with 5,000,000+, 10,000,000+ or 50,000,000+ installs.

In [43]:
for app in android_final:
    name = app[0]
    category = app[1]
    installs = app[5]
    if category == 'BOOKS_AND_REFERENCE' and (installs == '5,000,000+'
                                            or installs == '10,000,000+'
                                            or installs == '50,000,000+'):
        print(name, ':', installs)

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
FBReader: Favorite Book Reader : 10,000,000+
AlReader -any text book reader : 5,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Quran for Android : 10,000,000+
Dictionary.com: Find Definitions for English Words : 10,000,000+
English Dictionary - Offline : 10,000,000+
Bible KJV : 5,000,000+
NOOK: Read eBooks & Magazines : 10,000,000+
Dictionary : 10,000,000+
Spanish English Translator : 10,000,000+
Dictionary - Merriam-Webster : 10,000,000+
JW Library : 10,000,000+
Oxford Dictionary of English : Free : 10,000,000+
English Hindi Dictionary : 10,000,000+
English to Hindi Diction

## Conclusion

The data above looks very similar to what we saw for the 'Reference' genre in the App Store data.

This suggests the category is consistently in demand in both stores. As mentioned earlier, an e-book reader app that lets users download popular books such as the Bible and the Quran (not limited to religious books, of course) and has an embedded dictionary would likely be in demand in both stores.