# Identifying Profitable App Profiles for the App Store and Google Play

In this project I will have look at tow datasets from Kaggle, the [IOS App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) and [Google Play Store](https://www.kaggle.com/lava18/google-play-store-apps/downloads/google-play-store-apps.zip/6#googleplaystore.csv) datasets. These datasets are created by scraping the web and contain various information about apps. 
The aim of this project is to find mobile app profiles that are most profitable in the IOS App Store and Google Play. The focus will be only on apps for English speaking users and free apps. And therefore the the goal for this project is to analyze the data to understand what kinds of apps are likely to attract the most users.

In [3]:
# import reader to read csv's
from csv import reader

# open the app store data set
app_store_file = open("AppleStore.csv", encoding="utf8")
read_file = reader(app_store_file)
# make a list of list
app_store = list(read_file)
# removing header from the rest of the dataset
app_store_header = app_store[0]
app_store = app_store[1:]

### open the google play data set
google_file = open("googleplaystore.csv", encoding="utf8")
read_file = reader(google_file)
google = list(read_file)
google_header = google[0]
google = google[1:]

#### Exploring the datasets

In [4]:
# creating a explore data function 
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [5]:
# App store first 3 rows
explore_data(app_store, 0 , 3)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']




In [6]:
### Google play first 3 rows
explore_data(google, 0, 3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']




In [7]:
### Number of rows and columns of apple store data set
explore_data(app_store, 0, 0, rows_and_columns = True)

Number of rows: 7197
Number of columns: 17


In [8]:
### Number of rows and columns of google play data set
explore_data(google, 0, 0, rows_and_columns = True)

Number of rows: 10841
Number of columns: 13


In [9]:
### Column names of the apple store data set 
print(app_store_header)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Column description as found at [Kaggle](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)
The first column is the row number

|Column name| Description|
|------|------|
|"id"|App ID|
|"track_name"|App Name|
|"size_bytes"|Size (in Bytes)|
|"currency"| Currency Type|
|"price"| Price amount|
|"rating_count_tot"|User Rating counts (for all version)|
|"rating_count_ver"| User Rating counts (for current version)|
|"user_rating"| Average User Rating value (for all version)|
|"user_rating_ver"| Average User Rating value (for current version)|
|"ver"| Latest version code|
|"cont_rating"| Content Rating|
|"prime_genre"| Primary Genre|
|"sup_devices.num"| Number of supporting devices|
|"ipadSc_urls.num"| Number of screenshots showed or display|
|"lang.num"| Number of supported languages|
|"vpp_lic"| Vpp Device Based Licensing Enabled|

In [8]:
### Column names of the google play store data set
print(google_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Column description as found at [Kaggle](https://www.kaggle.com/lava18/google-play-store-apps/downloads/google-play-store-apps.zip/6#googleplaystore.csv)

|Column name| Description|
|------|------|
|"App"|Application name|
|"Category"|Category the app belongs to|
|"Rating"|Overall user rating of the app (as when scraped)|
|"Reviews"| Number of user reviews for the app (as when scraped)|
|"Size"| Size of the app (as when scraped)|
|"Installs"|Number of user downloads/installs for the app (as when scraped)|
|"Type"|Paid or Free|
|"Price"| Price of the app (as when scraped)|
|"Content Rating"| Age group the app is targeted at - Children / Mature 21+ / Adult|
|"Genres"| An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to Music, Game, Family genres.|
|"Last Updated"| Date when the app was last updated on Play Store (as when scraped)|
|"Current Ver"| Current version of the app available on Play Store (as when scraped)|
|"Android Ver"| Min required Android version (as when scraped)|

## Data cleaning
In this project we will focus on English-speaking apps and free to use apps.

1. remove duplicate entries
2. remove non-English apps
3. remove apps that aren't free

On Kaggle is a [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) thread of the google play data set in which the row 10472 is missing the `category` column and therefore the following columns have the wrong column data inside them. 

In [10]:
# google play data set 
print(google_header)
google[10472]

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

Delete the row 10472

In [11]:
del google[10472] # do not run it more than once

### 1. Remove duplicate entries
In the Google Play store data set some apps have duplicate entries. As seen below with the app `Instagram`, which has four duplicate entries.

The main difference in the duplicates is the fourth column, which is the number of Reviews given for the app. To get the most recent data we will keep the row with the highest number of reviews and remove the other entries.

In [12]:
for row in google:
    app_name = row[0]
    if app_name == "Instagram":
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In the Google Play store data set are 1181 duplicates.

In [13]:
#creating two empty lists for storing the names of duplicate and unique apps
duplicate_apps = []
unique_apps = []

for row in google:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print("Number of duplicate apps:", len(duplicate_apps))
print("\n")
print("Example of duplicate apps:", duplicate_apps[:15])

Number of duplicate apps: 1181


Example of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Not all apps differ in the Review column. The app "Box" has no difference at all between the duplicate entries

In [14]:
for row in google:
    app_name = row[0]
    if app_name == "Box":
        print(row)

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


The app "Slack" also varies only in the number of Reviews

In [15]:
for row in google:
    app_name = row[0]
    if app_name == "Slack":
        print(row)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


Before we delete all duplicated we first need to make a dictionary which stores the value with the highest number of review for each app.

In [16]:
### create dictionary where the key is the name and the value the highest number of review
name_and_reviews = {}
for row in google:
    app_name = row[0]
    n_reviews = float(row[3])
    if app_name in name_and_reviews and name_and_reviews[app_name] < n_reviews:
        name_and_reviews[app_name] = n_reviews
    elif app_name not in name_and_reviews:
        name_and_reviews[app_name] = n_reviews

We know form before that we have 1181 duplicated so lets check if the dictionary has the correct number of rows.

In [17]:
### the length of the new created dictionary
print(len(name_and_reviews))

9659


In [18]:
### expected length (original dataset minus the duplicates)
print(len(google)- 1181)

9659


Now we can remove the duplicates. First we iterate through the dataset and if the review number correspond with the number we find in the before created dictionary with the highest review number than we copy it in the new clean dataset. But as we seen before for the `Box` app: there are apps with the same reviews number, and therefore we need another condition which test if it is already added, so only one will be added.

In [19]:
### remove duplicte rows with help of the created dictionary
google_clean = [] # new cleaned data set
already_added = [] # store app names

for row in google:
    app_name = row[0]
    n_reviews = float(row[3])
    if n_reviews == name_and_reviews[app_name] and app_name not in already_added:
        google_clean.append(row)
        already_added.append(app_name)

In [21]:
#checking if our new data-set has the correct number
print(len(google_clean))

9659


### 2. Remove non-English apps
As we only interested in english apps we want to remove others

In [24]:
# some apps are not for english speaking audience 
print(google_clean[4412])

['中国語 AQリスニング', 'FAMILY', 'NaN', '21', '17M', '5,000+', 'Free', '0', 'Everyone', 'Education', 'June 22, 2016', '2.4.0', '4.0 and up']


In [29]:
print(app_store[814])

['926', '436957087', '搜狐新闻—新闻热点资讯掌上阅读软件', '136421376', 'USD', '0', '383', '0', '4.5', '0', '5.8.9', '17+', 'News', '38', '0', '1', '1']


English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

In Python each string has a corresponding number behind the scenes:

In [30]:
print(ord("a"))
print(ord("+"))

97
43


The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the [American Standard Code for Information Interchange](https://en.wikipedia.org/wiki/ASCII) system. 

In Python, strings are indexable and iterable, which means we can use indexing to select an individual character, and we can also iterate on the string using a for loop.

In [32]:
### A function which gives True if it is a string which correspond to the number 0 to 127 and False if higher 

def detect_language(string):
    for letter in string:
        if ord(letter) > 127:
            return False
    return True

In [33]:
# check if the function works
print(detect_language("Instagram"))
print(detect_language("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(detect_language("Docs To Go™ Free Office Suite"))
print(detect_language("Instachat 😜"))

True
False
False
False


In [25]:
print(ord("™"))
print(ord("😜"))

8482
128540


The function we created will label English apps incorrectly as non-English apps because these apps use special characters whic h correspond to a higher number than 127. To minimize the data loss impact we will only remove them if they have more than 3 characters above 127.

In [35]:
# new function for detecting if a app is English
def detect_language(string): 
    non_english = 0
    for letter in string:
        if ord(letter) > 127:
            non_english += 1 
    if non_english > 3: # see if it has more than 3 non english character
        return False
    else:
        return True
    

In [36]:
# check if the function works
print(detect_language("Instagram"))
print(detect_language("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(detect_language("Docs To Go™ Free Office Suite"))
print(detect_language("Instachat 😜"))

True
False
True
True


In [37]:
### cleaning the google data set
clean_google = []
for row in google_clean:
    app_name = row[0]
    is_english = detect_language(app_name)
    if is_english == True:
        clean_google.append(row)

In [38]:
print(len(clean_google))

9614


In [48]:
### cleaning the app store data set
clean_app_store =[]
for row in app_store:
    app_name = row[2]
    is_english = detect_language(app_name)
    if is_english == True:
        clean_app_store.append(row)

In [49]:
print(len(clean_app_store))

6183


### 3. Remove apps that aren't free
We are only interested in free apps therefore we seperate free apps from the clean data sets

Google play store

In [50]:
explore_data(clean_google, 0, 3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']




In [51]:
# the index for the type column is 6
free_google = []
for row in clean_google:
    type_col = row[6]
    if type_col == "Free":
        free_google.append(row)        

In [52]:
print(len(free_google))

8863


IOS App store

In [53]:
explore_data(clean_app_store, 0, 3)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']




In [56]:
# the index for the price is 5 and free is indicated as 0.0
free_appstore = []
for row in clean_app_store:
    price = row[5]
    if price == "0":
        free_appstore.append(row)

In [57]:
print(len(free_appstore))

3222


We are left with **8863** apps in the Google Play store and **3222** apps in the IOS Apple store.

## Data analysis
We first have a look, which free apps attract the most users in both App stores seperatly and than investigate which apps are successful in both stores.

First I analyse which is the most common genre for each market.
Google has two columns (`Category` and `Genres`) which can be used for this analysis and IOS Apple appstore has the `prime_genre` column.

Google Play store

In [58]:
# Genres is at index 9 and Category has index 1 
print(google_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


IOS App store

In [60]:
# prime_genre is index 12
print(app_store_header)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


First we build a function that generates a frequency tables with  percentage. Than a function which can display the percentage in a decending order. 

In [61]:
# function which generates a frequency table in %
def frequency_table(dataset, index):
    table = {}
    total = 1
    for app in dataset:
        genre = app[index]
        total += 1
        if genre in table:
            table[genre] += 1
        else:
            table[genre] = 1
    table_percentage = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentage[key] = percentage
    return table_percentage

In [62]:
### the display table will sort the frequency table in decending order
def display_table(dataset, index):
    table = frequency_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [64]:
print("App store Prime genre")
print(display_table(free_appstore,12))

App store Prime genre
Games : 58.144585789636984
Entertainment : 7.880856345020168
Photo & Video : 4.964318957493019
Education : 3.661185231151101
Social Networking : 3.288861309339125
Shopping : 2.6062674526838348
Utilities : 2.513186472230841
Sports : 2.140862550418864
Music : 2.04778156996587
Health & Fitness : 2.016754576481539
Productivity : 1.7375116351225566
Lifestyle : 1.5823766677008997
News : 1.3341607198262488
Travel : 1.2410797393732547
Finance : 1.1169717654359292
Weather : 0.8687558175612783
Food & Drink : 0.8067018305926157
Reference : 0.5584858827179646
Business : 0.5274588892336333
Book : 0.43437790878063914
Navigation : 0.18616196090598822
Medical : 0.18616196090598822
Catalogs : 0.12410797393732546
None


in the IOS Apple store dataset more than half of the apps (58.16%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps which amount for 3.29% of the apps. 

The general impression is that the IOS Apple Store is dominated by free-apps, that are designed for fun, while apps with practical purposes (education, shopping, utilities, productivity, lifestyle) are more rare. However, the fact that fun apps are the most numerous doesn't also imply that they also have the greatest number of users — the demand might not be the same as the offer.

In [65]:
print("Google play category")
print(display_table(free_google, 1))

Google play category
FAMILY : 18.896660649819495
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638

In the Google Play store it seems to be different. There are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) means mostly games for kids.

Even so, practical apps seem to have a better representation on Google Play compared to App Store. This picture is also confirmed by the frequency table we see for the Genres column:


In [66]:
print("Google play genre")
print(display_table(free_google, 9))

Google play genre
Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles

The distinct difference between the `Genres` and the `Category` columns is not clear, only that the `Genre` column has much more categories than the `Category` column. As we are only interested in the bigger picture in the moment, we will work with the `Category` column. 

We found that the IOS Apple App Store is dominated by apps designed for fun, while Google Play store shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea about the kind of apps that have the most users.

### Apps with the most viewers
In the Google play store data set we can use `installs` to calculate the average number of installs for each app genre, to find out which genre has the most users. In the IOS App store data set this information is not available. 
We can use the total number of user ratings as an alternate value (`rating_count_tot`).

#### IOS App store
First we seperate the apps per genre, than sum up the user ratings for each genre and than divide them by the number of apps belonging to that genre. 

In [67]:
# names of the column in apps_Store
print(app_store_header)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [68]:
# frequency table for the prime_genre
unique_genre = frequency_table(free_appstore, 12)

In [69]:
for genre in unique_genre:
    total = 0
    len_genre = 0
    for app in free_appstore:
        genre_app = app[12]
        if genre_app == genre:
            number_user = float(app[6])
            total += number_user
            len_genre += 1
    average_number = total / len_genre
    print(genre, ':', average_number)

Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0


**Navigation**

On average, navigation apps have the highest number of user reviews. This result is heavily influenced by two apps Waze and Google Maps, which have close to half a million user reviews together:

In [72]:
for app in free_appstore:
    if app[12] == 'Navigation':
        print(app[2], ':', app[6]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911


**Reference**

After Navigation Reference has the second highest average numbers of user reviews.  The Reference is highly skewed by one app the Bible, which accounts close to a million reviews all by itself.

In [73]:
for app in free_appstore:
    if app[12] == 'Reference':
        print(app[2], ':', app[6]) # print name and number of ratings

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
Merriam-Webster Dictionary : 16849
Google Translate : 26786
Night Sky : 12122
WWDC : 762
Jishokun-Japanese English Dictionary & Translator : 0
教えて!goo : 0
VPN Express : 14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Real Bike Traffic Rider Virtual Reality Glasses : 8


**Social Networking**

The third highest average number of user reviews per genre is Social Networking. And also here Social Networking is highly skewed by one app: Facebook, which accounts to nearly three million reviews.


In [74]:
for app in free_appstore:
    if app[12] == 'Social Networking':
        print(app[2], ':', app[6]) # print name and number of ratings

Facebook : 2974676
LinkedIn : 71856
Skype for iPhone : 373519
Tumblr : 334293
Match™ - #1 Dating App. : 60659
WhatsApp Messenger : 287589
TextNow - Unlimited Text + Calls : 164963
Grindr - Gay and same sex guys chat, meet and date : 23201
imo video calls and chat : 18841
Ameba : 269
Weibo : 7265
Badoo - Meet New People, Chat, Socialize. : 34428
Kik : 260965
Qzone : 1649
Fake-A-Location Free ™ : 354
Tango - Free Video Call, Voice and Chat : 75412
MeetMe - Chat and Meet New People : 97072
SimSimi : 23530
Viber Messenger – Text & Call : 164249
Find My Family, Friends & iPhone - Life360 Locator : 43877
Weibo HD : 16772
POF - Best Dating App for Conversations : 52642
GroupMe : 28260
Lobi : 36
WeChat : 34584
ooVoo – Free Video Call, Text and Voice : 177501
Pinterest : 1061624
知乎 : 397
Qzone HD : 458
Skype for iPad : 60163
LINE : 11437
QQ : 9109
LOVOO - Dating Chat : 1985
QQ HD : 5058
Messenger : 351466
eHarmony™ Dating App - Meet Singles : 11124
YouNow: Live Stream Video Chat : 12079
Cougar 

The same pattern applies also to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

The aim is to find popular genres, but navigation, social networking or reference apps might seem more popular than they really are. The average number of ratings seem to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then rework the averages.

#### Google Play store
The `Installs` column in the Google play store data set is not that precise as expected. The install numbers are open ended. So apps with 100,000+ installs could have numbers between 100,000 and 500,000. As we want to find out which app genre attracts the most users we do not need perfect precision with respect to the number of users.  

In [75]:
display_table(free_google, 5) # Installs columns

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372


We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from string to float. This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error

In [76]:
google_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [77]:
# first generating a frequency table for Category
category_freq_table = frequency_table(free_google, 1)

In [78]:
for category in category_freq_table:
    total = 0
    len_category = 0
    for app in free_google:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace("+", "")
            installs = installs.replace(",", "")
            n_installs = float(installs)
            total += n_installs
            len_category += 1
    avg_installs = total / len_category
    print(category , ":" , avg_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3697848.1731343283
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed up by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs:

In [79]:
for app in free_google:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                     ):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+


In [83]:
for app in free_google:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                           or app[5] == '1,000,000+'):
        print(app[0], ':', app[5])

Book store : 1,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Al Quran Al karim : 1,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
Hafizi Quran 15 lines per page : 1,000,000+
Satellite AR : 1,000,000+
Oxford A-Z of English Usage : 1,000,000+
Brilliant Quotes: Life, Love, Family & Motivation : 1,000,000+
Stats Royale for Clash Royale : 1,000,000+
wikiHow: how to do anything : 1,000,000+
EGW Writings : 1,000,000+
My Little Pony AR Guide : 1,000,000+


So far we learned that the most popular genres are influenced by a few apps which are dominating these genres. It also is not that easy to compare both app stores due to that they categorise their apps differently and provide different information for them. 
The genre `Reference` in IOS app store and `BOOKS_AND_REFERECE` in the google app store seem to be quite balanced (not to many dominating apps) and maybe more accessible for new apps. The next step to this project could be to analyse these genres more in depth and to find what kind of app would be most profitable to create. 


