# App Data Analysis - Profitability

This project is an **exploratory data analysis** of usage data for free apps available on Google Play (Android) and the App Store (iOS).

The goal of this project is to **analyze the available data and find out what makes an app more profitable**. Since all apps are free and revenue comes from engagement (in-app ads), profitability in this case means *attraction*: the more users engage with an app, the better.

--------

## Opening and Exploring the Data

We have a sample of data from both app markets (Apple and Google) that allows us to analyze the profitability for each case.

First, let's open the CSV files where the app datasets are stored.

In [1]:
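from csv import reader

# Read each CSV into a list of rows (the first row is the header).
# The `with` blocks close the files once they have been read.
with open('AppleStore.csv', encoding='utf8') as apple_file:
    apple_data = list(reader(apple_file))

with open('googleplaystore.csv', encoding='utf8') as google_file:
    google_data = list(reader(google_file))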

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(apple_data[1:], 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
explore_data(google_data[1:], 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [5]:
print('Apple Data columns:')
print(apple_data[0])

Apple Data columns:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [6]:
print('Google Data columns:')
print(google_data[0])

Google Data columns:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Above are the column names for both datasets. The meaning of each column is documented [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) for the Apple dataset and [here](https://www.kaggle.com/lava18/google-play-store-apps) for the Google dataset.

For this analysis, the following columns could help us find out what makes an app profitable:

**Apple Dataset:**

* `track_name` - App name
* `size_bytes` - App size, in bytes
* `price` - App price
* `rating_count_tot` - Total number of ratings
* `user_rating` - Average user rating value
* `cont_rating` - Content rating
* `prime_genre` - Primary genre
* `sup_devices.num` - Number of supported devices
* `ipadSc_urls.num` - Number of screenshots shown for display
* `lang.num` - Number of supported languages

**Google Dataset:**

* `App` - App name
* `Category` - Category (primary genre)
* `Rating` - Average user rating value
* `Reviews` - Total number of reviews (ratings)
* `Size` - App size, in megabytes (M)
* `Installs` - Number of downloads/installs
* `Type` - Indicates whether the app is free or paid
* `Price` - App price
* `Content Rating` - Content rating
* `Genres` - App genres

## Data Cleaning - Deleting wrong data

Now, let's check our app datasets and **remove any inaccurate or duplicate data**.

Also, since our company only builds free apps directed toward an English-speaking audience, we need to **remove all paid apps and all non-English apps**.

First of all, let's look for a specific error described [here](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015).

In [7]:
print(google_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


As we can see, the category for the app above is missing. Let's remove this row:

In [8]:
del google_data[10473]
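
The row above was located through a forum report. A more general check is sketched below, under the assumption that a malformed row has a different number of fields than the header. The helper name `find_malformed_rows` and the toy rows are illustrative only.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data: the header has three columns and the last row is missing one.
sample = [
    ['App', 'Category', 'Rating'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Broken Row', '1.9'],
]
print(find_malformed_rows(sample))  # [(2, ['Broken Row', '1.9'])]
```

This only catches rows with a missing or extra field. A row with the right number of fields but wrong values would need a separate check.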

Now, let's look for duplicate entries:

In [9]:
duplicate_apple_apps = []
unique_apple_apps = []

for app in apple_data[1:]:
    app_id = app[0]  # the `id` column identifies each App Store app
    if app_id in unique_apple_apps:
        duplicate_apple_apps.append(app_id)
    else:
        unique_apple_apps.append(app_id)
        
print('Number of duplicate apps - Apple:', len(duplicate_apple_apps))
print('\n')
print('Examples of duplicate apps - Apple:', duplicate_apple_apps[:15])

Number of duplicate apps - Apple: 0


Examples of duplicate apps - Apple: []


In [10]:
duplicate_google_apps = []
unique_google_apps = []

for app in google_data[1:]:
    name = app[0]
    if name in unique_google_apps:
        duplicate_google_apps.append(name)
    else:
        unique_google_apps.append(name)
        
print('Number of duplicate apps - Google:', len(duplicate_google_apps))
print('\n')
print('Examples of duplicate apps - Google:', duplicate_google_apps[:15])

Number of duplicate apps - Google: 1181


Examples of duplicate apps - Google: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


As we can see above, only the Google dataset has duplicate entries (1181).
We used the `id` column to look for duplicates in the Apple dataset and the `App` column to look for duplicates in the Google dataset.
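
The loops above test membership in a list, which takes longer as the list grows. The same count can be obtained in one pass with `collections.Counter`. This is a minimal sketch on toy rows, and the helper name `count_duplicates` is an assumption.

```python
from collections import Counter

def count_duplicates(dataset, index):
    """Count rows whose key at `index` repeats an earlier row."""
    counts = Counter(row[index] for row in dataset)
    # A key seen n times contributes n - 1 duplicate rows.
    return sum(n - 1 for n in counts.values())

sample = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits']]
print(count_duplicates(sample, 0))  # 2
```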

Let's check an example to build a criterion for removing duplicates:

In [11]:
for app in google_data[1:]:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


The only difference between these duplicated entries is in the fourth position of each row, which corresponds to the number of reviews. The data was probably collected at different times.

Let's keep the most recent data, which in this case means **keeping the row with the highest number of reviews**. To do so, let's create a dictionary with all unique apps and their highest number of reviews:

In [12]:
reviews_max_google = {}

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max_google and reviews_max_google[name] < n_reviews:
        reviews_max_google[name] = n_reviews
    elif name not in reviews_max_google:
        reviews_max_google[name] = n_reviews
        
print("We have", len(reviews_max_google), "unique entries for Google dataset.")

We have 9659 unique entries for Google dataset.


Now, let's use this dictionary to remove all duplicate rows from the Google dataset:

In [13]:
android_clean = []
already_added_google = []

# We include the already_added list in case there are duplicate
# entries with the same number of reviews.

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if reviews_max_google[name] == n_reviews and name not in already_added_google:
        android_clean.append(app)
        already_added_google.append(name)
        
print("We have", len(android_clean), "unique entries for Google dataset.")

We have 9659 unique entries for Google dataset.
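
The two steps above (build a dictionary of maximum review counts, then filter) can also be written as a single pass that stores the best row per app. The sketch below is an alternative, not the method used in this notebook. The helper name `keep_most_reviewed` is an assumption, the Instagram rows are shortened from the output above, and the `Box` row is made-up toy data.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]
for row in keep_most_reviewed(sample):
    print(row[0], row[3])
```

One difference in behavior: the result is ordered by the first appearance of each app name, whereas the cell above orders apps by the position of the row that was kept.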


In [14]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


In [15]:
explore_data(apple_data[1:], 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


After removing duplicate entries, we now have **9659 rows for the Google dataset and 7197 rows for the Apple dataset**.

Now, let's **remove all non-English apps**:

In [16]:
def english_app(string):
    count_non_ascii = 0
    for char in string:
        if ord(char) > 127:
            count_non_ascii += 1
        if count_non_ascii > 3:
            return False
    return True

The function above takes a string and returns **`False`** if **more than three characters** in it **fall outside the ASCII range** (code points above 127), which covers the characters commonly used in English text. Otherwise it returns **`True`**.

The threshold of three characters is adopted due to certain cases that have a few non-ASCII characters, like `Instachat 😜`.
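
To see which characters count against the threshold, the sketch below lists the characters of a name whose code point is above 127. The helper name `non_ascii_chars` is an assumption added for illustration.

```python
def non_ascii_chars(string):
    """Return the characters whose code point falls outside ASCII (0-127)."""
    return [char for char in string if ord(char) > 127]

print(non_ascii_chars('Instagram'))                      # []
print(non_ascii_chars('Docs To Go™ Free Office Suite'))  # ['™']
print(len(non_ascii_chars('Instachat 😜')))              # 1
print(len(non_ascii_chars('爱奇艺PPS -《欢乐颂2》电视剧热播')))
```

Names with one trademark sign or one emoji stay under the threshold, while a name written mostly in Chinese characters exceeds it.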

Let's test the function with some cases below:

In [17]:
print(english_app('Instagram'))
print(english_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_app('Docs To Go™ Free Office Suite'))
print(english_app('Instachat 😜'))

True
False
True
True


Now, let's use this function to filter out non-English apps from Google and Apple datasets:

In [18]:
android_non_english = []
android_clean_english = []

for row in android_clean:
    if english_app(row[0]):
        android_clean_english.append(row)
    else:
        android_non_english.append(row)

explore_data(android_clean_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


In [19]:
apple_non_english = []
apple_clean_english = []

for row in apple_data[1:]:
    if english_app(row[1]):
        apple_clean_english.append(row)
    else:
        apple_non_english.append(row)

explore_data(apple_clean_english, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


After removing non-English entries, we now have **9614 rows for the Google dataset and 6183 rows for the Apple dataset**.

Now, as our last step in the data cleaning process, we will **isolate all free apps**, since all of our apps are free to download and install.

In [20]:
google_final_dataset = []

for row in android_clean_english:
    if row[6] == 'Free':
        google_final_dataset.append(row)
        
explore_data(google_final_dataset, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8863
Number of columns: 13


In [21]:
apple_final_dataset = []

for row in apple_clean_english:
    if float(row[4]) == 0:
        apple_final_dataset.append(row)
        
explore_data(apple_final_dataset, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16


At the end of our data cleaning process, we now have **8863 rows for the Google dataset and 3222 rows for the Apple dataset**.

## Data Analysis

### Most Common Apps by Genre

We are using both Google and Apple datasets for this analysis because of our strategy as a company: our end goal is to add our apps on both Google Play and App Store. So, we need to find **app profiles that are successful on both markets**.

In our validation strategy for an app idea, we first build a minimal Android version of the app and add it to Google Play. Then, if the app gets a good response from users, we develop it further. If it is profitable after six months, we build an iOS version.

In order to begin this analysis, we are going to **generate frequency tables to find out what are the most common genres in each market**. To create these frequency tables, we are going to use the following columns for each dataset:

* Google dataset: `Genres`, `Category`
* Apple dataset: `prime_genre`
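
As a small illustration of the idea, a frequency table expressed in percentages can be built on a toy list of genres with `collections.Counter`. The helper name `percentage_table` and the toy list are assumptions for this sketch.

```python
from collections import Counter

def percentage_table(values):
    """Return {value: share of the total, in percent}."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total * 100 for value, count in counts.items()}

genres = ['Games', 'Games', 'Games', 'Education']
table = percentage_table(genres)

# Print the shares from largest to smallest.
for genre, share in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', share)
```

On this toy list, `Games` accounts for 75% of the entries and `Education` for 25%.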

In [22]:
def freq_table(dataset, index):
    freq_abs_table = {}
    for row in dataset:
        if row[index] in freq_abs_table:
            freq_abs_table[row[index]] += 1
        else:
            freq_abs_table[row[index]] = 1
    freq_rel_table = {}
    for key in freq_abs_table:
        prop = freq_abs_table[key] / len(dataset)
        freq_rel_table[key] = prop * 100
    return freq_rel_table

In [23]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [24]:
display_table(apple_final_dataset, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


As we can see by the results above:

* In the Apple dataset, the most common genres are **`Games` (58.16%), `Entertainment` (7.88%) and `Photo & Video` (4.97%)**.
* The majority of apps are designed for **entertainment purposes** rather than practical purposes, such as education (3.66%), shopping (2.61%) or social networking (3.29%).

Based on this frequency table, a **gaming app** seems to be the most recommended app profile. However, we know that **having a large number of gaming apps does not necessarily imply that these apps generally have a large number of users** (supply does not necessarily match demand).

In [25]:
display_table(google_final_dataset, 1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0.6431230960171499
COMICS : 0.6205573733498815
BEAUTY : 0.5979916506826132

In [26]:
display_table(google_final_dataset, 9)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

As we can see by the results above:

* On Google dataset, the most common categories are **`Family` (18.90%), `Game` (9.73%) and `Tools` (8.46%)**.
* Now, in comparison with the Apple dataset results, we have a higher percentage of **practical purpose apps** here, as we can see by the genres `Tools` (8.45%), `Education` (5.35%) and `Business` (4.59%). **Practical and fun apps seem to be more balanced**.

Based on this frequency table, a **gaming/family app** seems to be the most recommended app profile. However, as we said before, **having a large number of apps of a certain genre does not necessarily imply that these apps have a large number of users**!

### Most Popular Genres (App Store)

To find out what genres are the most popular (have the most users), we can calculate the average number of installs for each app genre.

However, this information is missing for the App Store dataset. So, we are going to use the **total number of user ratings** (`rating_count_tot`).

In [27]:
prime_genre_freq_table_apple = freq_table(
    apple_final_dataset, 11)
prime_genre_freq_table_apple

{'Book': 0.4345127250155183,
 'Business': 0.5276225946617008,
 'Catalogs': 0.12414649286157665,
 'Education': 3.662321539416512,
 'Entertainment': 7.883302296710118,
 'Finance': 1.1173184357541899,
 'Food & Drink': 0.8069522036002483,
 'Games': 58.16263190564867,
 'Health & Fitness': 2.0173805090006205,
 'Lifestyle': 1.5828677839851024,
 'Medical': 0.186219739292365,
 'Music': 2.0484171322160147,
 'Navigation': 0.186219739292365,
 'News': 1.3345747982619491,
 'Photo & Video': 4.9658597144630665,
 'Productivity': 1.7380509000620732,
 'Reference': 0.5586592178770949,
 'Shopping': 2.60707635009311,
 'Social Networking': 3.2898820608317814,
 'Sports': 2.1415270018621975,
 'Travel': 1.2414649286157666,
 'Utilities': 2.5139664804469275,
 'Weather': 0.8690254500310366}

In [28]:
for key in prime_genre_freq_table_apple:
    total = 0
    len_genre = 0
    
    for row in apple_final_dataset:
        genre_app = row[11]
        if genre_app == key:
            total += float(row[5])
            len_genre += 1
    
    avg_user_ratings = total / len_genre
    print(key, '-', round(avg_user_ratings, 2))

Photo & Video - 28441.54
Music - 57326.53
Utilities - 18684.46
Weather - 52279.89
News - 21248.02
Entertainment - 14029.83
Lifestyle - 16485.76
Productivity - 21028.41
Reference - 74942.11
Health & Fitness - 23298.02
Medical - 612.0
Shopping - 26919.69
Travel - 28243.8
Social Networking - 71548.35
Navigation - 86090.33
Education - 7003.98
Book - 39758.5
Games - 22788.67
Food & Drink - 33333.92
Business - 7491.12
Finance - 31467.94
Sports - 23008.9
Catalogs - 4004.0


According to the results above, we are able to see that the five most popular genres are (in terms of average ratings per app):

* **Navigation** - 86090.33 ratings per app
* **Reference** - 74942.11 ratings per app
* **Social Networking** - 71548.35 ratings per app
* **Music** - 57326.53 ratings per app
* **Weather** - 52279.89 ratings per app
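
A ranking like the one above can also be produced in code rather than read off the printed output. The sketch below computes the averages in a single pass and sorts them. The helper name `average_by_key` and the toy rows are assumptions for illustration.

```python
from collections import defaultdict

def average_by_key(rows, key_index, value_index):
    """Average the numeric column `value_index` for each value of `key_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += float(row[value_index])
        counts[row[key_index]] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Toy rows: [name, genre, rating count]. The values are made up.
sample = [
    ['A', 'Games', '100'],
    ['B', 'Games', '300'],
    ['C', 'Weather', '50'],
]
averages = average_by_key(sample, 1, 2)

# Sort genres by average, largest first.
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, '-', round(avg, 2))
```

For the cleaned Apple dataset, the equivalent call would be `average_by_key(apple_final_dataset, 11, 5)`.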

Let's check some examples for each of these genres to see how the ratings are distributed across apps:

In [32]:
for row in apple_final_dataset:
    if row[-5] == 'Navigation':
        print(row[1], '-', row[5])

Waze - GPS Navigation, Maps & Real-time Traffic - 345046
Google Maps - Navigation & Transit - 154911
Geocaching® - 12811
CoPilot GPS – Car Navigation & Offline Maps - 3582
ImmobilienScout24: Real Estate Search in Germany - 187
Railway Route Search - 5


The average number of app ratings for `Navigation` is heavily influenced by Waze and Google Maps, and there aren't many apps in this category.
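
Comparing the mean with the median makes this skew concrete. The sketch below uses the six rating counts printed above and the standard `statistics` module.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps listed above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print(round(mean(navigation_ratings), 2))  # 86090.33
print(median(navigation_ratings))          # 8196.5
```

The mean is roughly ten times the median, which reflects how strongly the two largest apps pull the average up.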

**Waze and Google Maps are already very big players in this category, so it's hard to create a big competitor.**

In [34]:
for row in apple_final_dataset:
    if row[-5] == 'Reference':
        print(row[1], '-', row[5])

Bible - 985920
Dictionary.com Dictionary & Thesaurus - 200047
Dictionary.com Dictionary & Thesaurus for iPad - 54175
Google Translate - 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran - 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition - 17588
Merriam-Webster Dictionary - 16849
Night Sky - 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) - 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools - 4693
GUNS MODS for Minecraft PC Edition - Mods Tools - 1497
Guides for Pokémon GO - Pokemon GO News and Cheats - 826
WWDC - 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free - 718
VPN Express - 14
Real Bike Traffic Rider Virtual Reality Glasses - 8
教えて!goo - 0
Jishokun-Japanese English Dictionary & Translator - 0


The average number of app ratings for `Reference` is heavily influenced by the Bible and Dictionary.com apps.

**This category could be a good idea for a new app** if the objective is getting a high number of installs and engagement. It's possible to find another popular book and create an app with many features related to that book, for example.

In [35]:
for row in apple_final_dataset:
    if row[-5] == 'Social Networking':
        print(row[1], '-', row[5])

Facebook - 2974676
Pinterest - 1061624
Skype for iPhone - 373519
Messenger - 351466
Tumblr - 334293
WhatsApp Messenger - 287589
Kik - 260965
ooVoo – Free Video Call, Text and Voice - 177501
TextNow - Unlimited Text + Calls - 164963
Viber Messenger – Text & Call - 164249
Followers - Social Analytics For Instagram - 112778
MeetMe - Chat and Meet New People - 97072
We Heart It - Fashion, wallpapers, quotes, tattoos - 90414
InsTrack for Instagram - Analytics Plus More - 85535
Tango - Free Video Call, Voice and Chat - 75412
LinkedIn - 71856
Match™ - #1 Dating App. - 60659
Skype for iPad - 60163
POF - Best Dating App for Conversations - 52642
Timehop - 49510
Find My Family, Friends & iPhone - Life360 Locator - 43877
Whisper - Share, Express, Meet - 39819
Hangouts - 36404
LINE PLAY - Your Avatar World - 34677
WeChat - 34584
Badoo - Meet New People, Chat, Socialize. - 34428
Followers + for Instagram - Follower Analytics - 28633
GroupMe - 28260
Marco Polo Video Walkie Talkie - 27662
Miitomo - 2

The average number of app ratings for `Social Networking` is heavily influenced by big players like Facebook, Pinterest, Skype, Tumblr and WhatsApp.

**Since there are so many big players in this category, it's hard to create a new big competitor.**

In [36]:
for row in apple_final_dataset:
    if row[-5] == 'Music':
        print(row[1], '-', row[5])

Pandora - Music & Radio - 1126879
Spotify Music - 878563
Shazam - Discover music, artists, videos & lyrics - 402925
iHeartRadio – Free Music & Radio Stations - 293228
SoundCloud - Music & Audio - 135744
Magic Piano by Smule - 131695
Smule Sing! - 119316
TuneIn Radio - MLB NBA Audiobooks Podcasts Music - 110420
Amazon Music - 106235
SoundHound Song Search & Music Player - 82602
Sonos Controller - 48905
Bandsintown Concerts - 30845
Karaoke - Sing Karaoke, Unlimited Songs! - 28606
My Mixtapez Music - 26286
Sing Karaoke Songs Unlimited with StarMaker - 26227
Ringtones for iPhone & Ringtone Maker - 25403
Musi - Unlimited Music For YouTube - 25193
AutoRap by Smule - 18202
Spinrilla - Mixtapes For Free - 15053
Napster - Top Music & Radio - 14268
edjing Mix:DJ turntable to remix and scratch music - 13580
Free Music - MP3 Streamer & Playlist Manager Pro - 13443
Free Piano app by Yokee - 13016
Google Play Music - 10118
Certified Mixtapes - Hip Hop Albums & Mixtapes - 9975
TIDAL - 7398
YouTube Mu

The average number of app ratings for `Music` is heavily influenced by big players like Pandora, Spotify, Shazam and SoundCloud.

**Again, there are many big players in this category, so it's hard to create a new big competitor.**

In [37]:
for row in apple_final_dataset:
    if row[-5] == 'Weather':
        print(row[1], '-', row[5])

The Weather Channel: Forecast, Radar & Alerts - 495626
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking - 208648
WeatherBug - Local Weather, Radar, Maps, Alerts - 188583
MyRadar NOAA Weather Radar Forecast - 150158
AccuWeather - Weather for Life - 144214
Yahoo Weather - 112603
Weather Underground: Custom Forecast & Local Radar - 49192
NOAA Weather Radar - Weather Forecast & HD Radar - 45696
Weather Live Free - Weather Forecast & Alerts - 35702
Storm Radar - 22792
QuakeFeed Earthquake Map, Alerts, and News - 6081
Moji Weather - Free Weather Forecast - 2333
Hurricane by American Red Cross - 1158
Forecast Bar - 375
Hurricane Tracker WESH 2 Orlando, Central Florida - 203
FEMA - 128
iWeather - World weather forecast - 80
Weather - Radar - Storm with Morecast App - 78
Yurekuru Call - 53
Weather & Radar - 37
WRAL Weather Alert - 25
Météo-France - 24
JaxReady - 22
Freddy the Frogcaster's Weather Station - 14
Almanac Long-Range Weather Forecast - 12
TodayAir

The average number of app ratings for `Weather` is heavily influenced by big players like The Weather Channel, AccuWeather and Yahoo Weather.

Again, there are many big players in this category, so it's hard to create a new big competitor. Also, **weather apps do not generate a high amount of engagement**, since generally people spend little time on weather apps.

#### Recommended profiles for App Store

With the considerations above, we recommend creating a **book-based app (`Reference`) or a practical app that relies on gamification**. The **main focus of the App Store seems to be entertainment apps**, but the **genres directly related to entertainment are either saturated (gaming apps, for example) or already dominated by big players (music and social networking apps, for example)**.

### Most Popular Genres (Google Play)

For the Google dataset, we have the number of installs for each app, so we can calculate the average number of installs for each app genre.

However, these numbers are open-ended (for example, `10,000,000+`). In order to calculate the average number of installs for each app genre, we need to remove commas and plus characters, converting each install number from string to float.
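
The conversion described above can be isolated in a small helper. This is a minimal sketch, and the name `parse_installs` is an assumption.

```python
def parse_installs(installs):
    """Convert an install label such as '10,000,000+' to a float."""
    return float(installs.replace('+', '').replace(',', ''))

print(parse_installs('10,000,000+'))  # 10000000.0
print(parse_installs('1,000+'))       # 1000.0
```

Because the labels are open-ended, the resulting number is a lower bound on the real install count, so the averages computed from it are approximate.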

In [29]:
prime_genre_freq_table_google = freq_table(
    google_final_dataset, 1)
prime_genre_freq_table_google

{'ART_AND_DESIGN': 0.6431230960171499,
 'AUTO_AND_VEHICLES': 0.9251946293580051,
 'BEAUTY': 0.5979916506826132,
 'BOOKS_AND_REFERENCE': 2.1437436533904997,
 'BUSINESS': 4.592124562789123,
 'COMICS': 0.6205573733498815,
 'COMMUNICATION': 3.2381812027530184,
 'DATING': 1.8616721200496444,
 'EDUCATION': 1.1621347173643235,
 'ENTERTAINMENT': 0.9590432133589079,
 'EVENTS': 0.7108202640189552,
 'FAMILY': 18.898792733837304,
 'FINANCE': 3.7007785174320205,
 'FOOD_AND_DRINK': 1.241114746699763,
 'GAME': 9.725826469592688,
 'HEALTH_AND_FITNESS': 3.0802211440821394,
 'HOUSE_AND_HOME': 0.8236488773552973,
 'LIBRARIES_AND_DEMO': 0.9364774906916393,
 'LIFESTYLE': 3.9038700214374367,
 'MAPS_AND_NAVIGATION': 1.399074805370642,
 'MEDICAL': 3.5315355974275078,
 'NEWS_AND_MAGAZINES': 2.798149610741284,
 'PARENTING': 0.6544059573507841,
 'PERSONALIZATION': 3.317161232088458,
 'PHOTOGRAPHY': 2.944826808078529,
 'PRODUCTIVITY': 3.8925871601038025,
 'SHOPPING': 2.245289405393208,
 'SOCIAL': 2.66275527473767

In [30]:
for key in prime_genre_freq_table_google:
    total = 0
    len_category = 0
    
    for row in google_final_dataset:
        category_app = row[1]
        if category_app == key:
            installs = row[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            
            total += float(installs)
            len_category += 1
    
    # Average number of installs for this category.
    avg_n_installs = total / len_category
    print(key, '-', round(avg_n_installs, 2))

HOUSE_AND_HOME - 1331540.56
PRODUCTIVITY - 16787331.34
PHOTOGRAPHY - 17840110.4
LIBRARIES_AND_DEMO - 638503.73
GAME - 15588015.6
SOCIAL - 23253652.13
BUSINESS - 1712290.15
AUTO_AND_VEHICLES - 647317.82
LIFESTYLE - 1437816.27
BOOKS_AND_REFERENCE - 8767811.89
SHOPPING - 7036877.31
FOOD_AND_DRINK - 1924897.74
DATING - 854028.83
BEAUTY - 513151.89
TOOLS - 10801391.3
WEATHER - 5074486.2
COMICS - 817657.27
FINANCE - 1387692.48
ART_AND_DESIGN - 1986335.09
COMMUNICATION - 38456119.17
SPORTS - 3638640.14
EDUCATION - 1833495.15
ENTERTAINMENT - 11640705.88
FAMILY - 3697848.17
NEWS_AND_MAGAZINES - 9549178.47
PERSONALIZATION - 5201482.61
MAPS_AND_NAVIGATION - 4056941.77
TRAVEL_AND_LOCAL - 13984077.71
EVENTS - 253542.22
VIDEO_PLAYERS - 24727872.45
MEDICAL - 120550.62
HEALTH_AND_FITNESS - 4188821.99
PARENTING - 542603.62


The results above show the five most popular genres by average installs per app:

* **Communication** - 38456119.17 installs per app
* **Video Players** - 24727872.45 installs per app
* **Social** - 23253652.13 installs per app
* **Photography** - 17840110.4 installs per app
* **Productivity** - 16787331.34 installs per app
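
The ranking above was read off the printed output by hand. As a minimal sketch, the same averages could be collected in a dictionary and sorted. The helper names `parse_installs` and `average_installs_by_category` are hypothetical, and `sample_rows` holds made-up rows that only mimic the column layout of the Google Play data (category at index 1, installs at index 5).

```python
def parse_installs(installs):
    """Turn a string such as '1,000,000+' into the number 1000000.0."""
    return float(installs.replace('+', '').replace(',', ''))


def average_installs_by_category(dataset, category_index=1, installs_index=5):
    """Return {category: average installs} in a single pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset:
        category = row[category_index]
        totals[category] = totals.get(category, 0) + parse_installs(row[installs_index])
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}


# Made-up rows for illustration only; they are not taken from the real dataset.
sample_rows = [
    ['App A', 'COMMUNICATION', '4.0', '10', '5M', '1,000,000+'],
    ['App B', 'COMMUNICATION', '4.1', '20', '6M', '3,000,000+'],
    ['App C', 'SOCIAL', '4.2', '30', '7M', '500,000+'],
    ['App D', 'MEDICAL', '4.3', '40', '8M', '1,000+'],
]

averages = average_installs_by_category(sample_rows)

# Sort categories from the highest to the lowest average and keep the top five.
ranking = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for category, avg_installs in ranking[:5]:
    print(category, '-', round(avg_installs, 2))
```

Passing `google_final_dataset` instead of `sample_rows` should give the same averages as the cell above, sorted in descending order.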

Let's look at the most-installed apps in each of these categories to see how installs are distributed across apps:

In [41]:
for row in google_final_dataset:
    if row[1] == 'COMMUNICATION' and (row[5] == '1,000,000,000+'
                                      or row[5] == '500,000,000+'
                                      or row[5] == '100,000,000+'):
        print(row[0], '-', row[5])

WhatsApp Messenger - 1,000,000,000+
imo beta free calls and text - 100,000,000+
Android Messages - 100,000,000+
Google Duo - High Quality Video Calls - 500,000,000+
Messenger – Text and Video Chat for Free - 1,000,000,000+
imo free video calls and chat - 500,000,000+
Skype - free IM & video calls - 1,000,000,000+
Who - 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji - 100,000,000+
LINE: Free Calls & Messages - 500,000,000+
Google Chrome: Fast & Secure - 1,000,000,000+
Firefox Browser fast & private - 100,000,000+
UC Browser - Fast Download Private & Secure - 500,000,000+
Gmail - 1,000,000,000+
Hangouts - 1,000,000,000+
Messenger Lite: Free Calls & Messages - 100,000,000+
Kik - 100,000,000+
KakaoTalk: Free Calls & Text - 100,000,000+
Opera Mini - fast web browser - 100,000,000+
Opera Browser: Fast and Secure - 100,000,000+
Telegram - 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer - 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure - 100,000,000+
Viber Mess

The average number of installs for `Communication` is heavily influenced by a handful of giants such as WhatsApp, Messenger, Skype, Gmail and Hangouts, each with 1,000,000,000+ installs.

**Again, it's hard to create a new big competitor in this category.** Because these few apps dominate the total, removing them would probably leave a much smaller average for the rest of the category.
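
The claim that the average would shrink without the biggest apps can be checked directly. The sketch below is one possible way to do it, not part of the original analysis: `average_installs` is a hypothetical helper with an optional `max_installs` cutoff, and the rows are invented examples in the same column layout as `google_final_dataset`.

```python
def parse_installs(installs):
    """Turn a string such as '1,000,000+' into the number 1000000.0."""
    return float(installs.replace('+', '').replace(',', ''))


def average_installs(dataset, category, max_installs=None):
    """Average installs for one category.

    If max_installs is given, apps with that many installs or more are left out.
    """
    values = []
    for row in dataset:
        if row[1] != category:
            continue
        installs = parse_installs(row[5])
        if max_installs is not None and installs >= max_installs:
            continue
        values.append(installs)
    # Guard against an empty selection to avoid dividing by zero.
    return sum(values) / len(values) if values else 0.0


# Made-up rows for illustration only; they are not taken from the real dataset.
sample_rows = [
    ['Giant Chat', 'COMMUNICATION', '4.0', '10', '5M', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '4.1', '20', '6M', '100,000+'],
    ['Tiny Chat', 'COMMUNICATION', '4.2', '30', '7M', '50,000+'],
]

overall = average_installs(sample_rows, 'COMMUNICATION')
without_giants = average_installs(sample_rows, 'COMMUNICATION', max_installs=100_000_000)

print('All apps:', round(overall, 2))
print('Under 100M installs:', round(without_giants, 2))
```

Running the same two calls on `google_final_dataset` would show how much of the `Communication` average comes from apps with 100,000,000+ installs.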

In [43]:
for row in google_final_dataset:
    if row[1] == 'VIDEO_PLAYERS' and (row[5] == '1,000,000,000+'
                                      or row[5] == '500,000,000+'
                                      or row[5] == '100,000,000+'):
        print(row[0], '-', row[5])

YouTube - 1,000,000,000+
Motorola Gallery - 100,000,000+
VLC for Android - 100,000,000+
Google Play Movies & TV - 1,000,000,000+
MX Player - 500,000,000+
Dubsmash - 100,000,000+
VivaVideo - Video Editor & Photo Movie - 100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera - 100,000,000+
Motorola FM Radio - 100,000,000+


The average number of installs for `Video Players` is heavily influenced by YouTube and Google Play Movies & TV.

It follows the same pattern as communication apps, where **big players inflate the average number of installs and it's hard to create a new competitor**.

In [44]:
for row in google_final_dataset:
    if row[1] == 'SOCIAL' and (row[5] == '1,000,000,000+'
                               or row[5] == '500,000,000+'
                               or row[5] == '100,000,000+'):
        print(row[0], '-', row[5])

Facebook - 1,000,000,000+
Facebook Lite - 500,000,000+
Tumblr - 100,000,000+
Pinterest - 100,000,000+
Google+ - 1,000,000,000+
Badoo - Free Chat & Dating App - 100,000,000+
Tango - Live Video Broadcast - 100,000,000+
Instagram - 1,000,000,000+
Snapchat - 500,000,000+
LinkedIn - 100,000,000+
Tik Tok - including musical.ly - 100,000,000+
BIGO LIVE - Live Stream - 100,000,000+
VK - 100,000,000+


The average number of installs for `Social` is heavily influenced by Facebook, Instagram, Google+, Snapchat and other big players.

It follows the same pattern as communication apps, where **big players inflate the average number of installs and it's hard to create a new competitor**.

In [45]:
for row in google_final_dataset:
    if row[1] == 'PHOTOGRAPHY' and (row[5] == '1,000,000,000+'
                                    or row[5] == '500,000,000+'
                                    or row[5] == '100,000,000+'):
        print(row[0], '-', row[5])

B612 - Beauty & Filter Camera - 100,000,000+
YouCam Makeup - Magic Selfie Makeovers - 100,000,000+
Sweet Selfie - selfie camera, beauty cam, photo edit - 100,000,000+
Google Photos - 1,000,000,000+
Retrica - 100,000,000+
Photo Editor Pro - 100,000,000+
BeautyPlus - Easy Photo Editor & Selfie Camera - 100,000,000+
PicsArt Photo Studio: Collage Maker & Pic Editor - 100,000,000+
Photo Collage Editor - 100,000,000+
Z Camera - Photo Editor, Beauty Selfie, Collage - 100,000,000+
PhotoGrid: Video & Pic Collage Maker, Photo Editor - 100,000,000+
Candy Camera - selfie, beauty camera, photo editor - 100,000,000+
YouCam Perfect - Selfie Photo Editor - 100,000,000+
Camera360: Selfie Photo Editor with Funny Sticker - 100,000,000+
S Photo Editor - Collage Maker , Photo Collage - 100,000,000+
AR effect - 100,000,000+
Cymera Camera- Photo Editor, Filter,Collage,Layout - 100,000,000+
LINE Camera - Photo editor - 100,000,000+
Photo Editor Collage Maker Pro - 100,000,000+


The average number of installs for `Photography` is heavily influenced by Google Photos and many other popular photo editors.

It follows the same pattern as communication apps, where **big players inflate the average number of installs and it's hard to create a new competitor**.

In [46]:
for row in google_final_dataset:
    if row[1] == 'PRODUCTIVITY' and (row[5] == '1,000,000,000+'
                                     or row[5] == '500,000,000+'
                                     or row[5] == '100,000,000+'):
        print(row[0], '-', row[5])

Microsoft Word - 500,000,000+
Microsoft Outlook - 100,000,000+
Microsoft OneDrive - 100,000,000+
Microsoft OneNote - 100,000,000+
Google Keep - 100,000,000+
ES File Explorer File Manager - 100,000,000+
Dropbox - 500,000,000+
Google Docs - 100,000,000+
Microsoft PowerPoint - 100,000,000+
Samsung Notes - 100,000,000+
SwiftKey Keyboard - 100,000,000+
Google Drive - 1,000,000,000+
Adobe Acrobat Reader - 100,000,000+
Google Sheets - 100,000,000+
Microsoft Excel - 100,000,000+
WPS Office - Word, Docs, PDF, Note, Slide & Sheet - 100,000,000+
Google Slides - 100,000,000+
ColorNote Notepad Notes - 100,000,000+
Evernote – Organizer, Planner for Notes & Memos - 100,000,000+
Google Calendar - 500,000,000+
Cloud Print - 500,000,000+
CamScanner - Phone PDF Creator - 100,000,000+


The average number of installs for `Productivity` is heavily influenced by Google Drive, Google Calendar, Cloud Print, Dropbox and many Microsoft Office apps.

It follows the same pattern as communication apps, where **big players inflate the average number of installs and it's hard to create a new competitor**.
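
The cells above differ only in the category name, so the filter could be written once and reused. This is a sketch under that assumption: `big_players` and `BIG_PLAYER_BUCKETS` are hypothetical names, and the rows are invented examples with the app name at index 0, the category at index 1 and the installs string at index 5.

```python
# The three install buckets the cells above compare against.
BIG_PLAYER_BUCKETS = ('1,000,000,000+', '500,000,000+', '100,000,000+')


def big_players(dataset, category, buckets=BIG_PLAYER_BUCKETS):
    """Return (name, installs) pairs for apps of one category in the given buckets."""
    return [
        (row[0], row[5])
        for row in dataset
        if row[1] == category and row[5] in buckets
    ]


# Made-up rows for illustration only; they are not taken from the real dataset.
sample_rows = [
    ['Big Player', 'VIDEO_PLAYERS', '4.0', '10', '5M', '1,000,000,000+'],
    ['Mid Player', 'VIDEO_PLAYERS', '4.1', '20', '6M', '100,000,000+'],
    ['Small Player', 'VIDEO_PLAYERS', '4.2', '30', '7M', '10,000+'],
    ['Big Notes', 'PRODUCTIVITY', '4.3', '40', '8M', '500,000,000+'],
]

for category in ('VIDEO_PLAYERS', 'PRODUCTIVITY'):
    print(category)
    for name, installs in big_players(sample_rows, category):
        print(' ', name, '-', installs)
```

Membership in a tuple (`row[5] in buckets`) performs the same test as the chained `or` comparisons, and the bucket list only has to be changed in one place.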

Let's also check the `Books and Reference` category, which we found promising for the App Store. It's popular on Google Play too, though less so than the categories analyzed so far:

In [47]:
for row in google_final_dataset:
    if row[1] == 'BOOKS_AND_REFERENCE' and (row[5] == '1,000,000,000+'
                                            or row[5] == '500,000,000+'
                                            or row[5] == '100,000,000+'):
        print(row[0], '-', row[5])

Google Play Books - 1,000,000,000+
Bible - 100,000,000+
Amazon Kindle - 100,000,000+
Wattpad 📖 Free Books - 100,000,000+
Audiobooks from Audible - 100,000,000+


This category has fewer big players than the others: only five apps reach 100,000,000+ installs. **So there is probably good potential for this category on Google Play**.

This category includes a variety of apps, from ebook readers and library collections (like Google Play Books and Amazon Kindle) to dictionaries and book-based apps (like the Bible).
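
One way to compare how crowded each category is at the top is to count, per category, how many apps cross an install threshold. The sketch below assumes a threshold of 100,000,000 installs to match the cells above. `big_player_summary` is a hypothetical helper and the rows are invented examples, not real data.

```python
def parse_installs(installs):
    """Turn a string such as '1,000,000+' into the number 1000000.0."""
    return float(installs.replace('+', '').replace(',', ''))


def big_player_summary(dataset, threshold=100_000_000):
    """Return {category: (apps at or above threshold, total apps)}."""
    summary = {}
    for row in dataset:
        big, total = summary.get(row[1], (0, 0))
        if parse_installs(row[5]) >= threshold:
            big += 1
        summary[row[1]] = (big, total + 1)
    return summary


# Made-up rows for illustration only; they are not taken from the real dataset.
sample_rows = [
    ['Reader One', 'BOOKS_AND_REFERENCE', '4.0', '10', '5M', '1,000,000,000+'],
    ['Dictionary', 'BOOKS_AND_REFERENCE', '4.1', '20', '6M', '1,000,000+'],
    ['Quotes', 'BOOKS_AND_REFERENCE', '4.2', '30', '7M', '50,000+'],
    ['Office A', 'PRODUCTIVITY', '4.3', '40', '8M', '500,000,000+'],
    ['Office B', 'PRODUCTIVITY', '4.4', '50', '9M', '100,000,000+'],
    ['Notes', 'PRODUCTIVITY', '4.5', '60', '4M', '10,000+'],
]

summary = big_player_summary(sample_rows)
for category, (big, total) in sorted(summary.items()):
    print(category, '-', big, 'of', total, 'apps at 100,000,000+ installs')
```

A category with few apps above the threshold relative to its size leaves more room for a newcomer, which is the argument made for `Books and Reference` above.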

#### Recommended profiles for Google Play

With the considerations above, we again recommend creating a **book-based app (`Reference`)**, since the **genres directly related to entertainment (gaming apps, for example) seem to be saturated, while others (music and social networking apps, for example) already have many big players**.