# Profitable Apps in iOS and Android

#### Goals
For this project we'll establish what kinds of apps are likely to attract more users for our company.

The apps our company builds are:
* free to download and install, and
* monetized mainly through in-app ads.

This means our **revenue** for any given app is mostly **influenced by the number of users** who use our app: the more users who see and engage with the ads, the better.

## Data source
Collecting data for over 4 million apps would require a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we use two existing data sets that seem suitable for our goals:

* [A data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018.
* [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017.

## Examining the data sets

In [1]:
from csv import reader

#open and read the App Store data set
#(encoding='utf8' is set explicitly so non-English app names load the same way on every platform)
opened_file_ios = open('AppleStore.csv', encoding='utf8')
read_file_ios = reader(opened_file_ios)
apps_data_ios = list(read_file_ios)
opened_file_ios.close()

#extract the header and the data
ios_header = apps_data_ios[0]
ios_data = apps_data_ios[1:]

#open and read the Google Play data set
opened_file_droid = open('googleplaystore.csv', encoding='utf8')
read_file_droid = reader(opened_file_droid)
apps_data_droid = list(read_file_droid)
opened_file_droid.close()

#extract the header and the data
droid_header = apps_data_droid[0]
droid_data = apps_data_droid[1:]
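
The two loading steps above follow the same pattern, so they could be wrapped in a small helper. The sketch below is only one possible way to do it: the name `load_csv` and the UTF-8 default are our own assumptions, not part of the original notebook.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return (header, rows). Hypothetical helper."""
    # The with-statement closes the file even if reading fails.
    with open(path, encoding=encoding, newline='') as csv_file:
        rows = list(reader(csv_file))
    # The first row is the header, the remaining rows are the data.
    return rows[0], rows[1:]

# Intended usage, with the file names from the cell above:
# ios_header, ios_data = load_csv('AppleStore.csv')
# droid_header, droid_data = load_csv('googleplaystore.csv')
```

Note that this helper returns the header and the data separately, so it does not produce the combined `apps_data_ios` and `apps_data_droid` lists used later in this notebook.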

To make it easier to explore the data sets, we create a function, `explore_data()`, that prints rows in a more readable way.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
#test the function 
explore_data(apps_data_ios, 10,11, rows_and_columns=True)

['343200656', 'Angry Birds', '175966208', 'USD', '0.0', '824451', '107', '4.5', '3.0', '7.4.0', '4+', 'Games', '38', '0', '10', '1']


Number of rows: 7198
Number of columns: 16


We print the header of the App Store data set to see its column names.

In [4]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


We print the header of the Google Play data set to see its column names.

In [5]:
print(droid_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


We use the function to explore several rows of the Android and Apple apps datasets.

In [6]:
explore_data(apps_data_droid, 1, 4)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']




In [7]:
explore_data(apps_data_ios,1,4)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']




In [8]:
print("The number of columns in the iOS dataset is", len(ios_header))

The number of columns in the iOS dataset is 16


In [9]:
print("The number of columns in the Android dataset is", len(droid_header))

The number of columns in the Android dataset is 13


### Exploring columns
The columns that interest us most in the Android and iOS datasets are listed below, with their zero-based column indices:

|iOS column|iOS index|Android column|Android index|Description|
|---|---|---|---|---|
|`price`|4|`Price`|7|price|
|`prime_genre`|11|`Genres`|9|type, or genre|
|`rating_count_tot`|5|`Installs`|5|number of installs|
|`user_rating`|7|`Rating`|2|overall rating|

**Notes:**
Because not all columns have exact matches between the datasets, we treat the Android `Category` column, rather than `Genres`, as the closest equivalent of the iOS `prime_genre` column.

Since there is no explicit column for installs in iOS we use the column `"rating_count_tot"` (User Rating count for all versions) as the closest approximation of the total number of downloads.

For ratings the Android figure represents the overall user rating, which is equivalent to iOS's `'user_rating'`.
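
Column positions are easy to get wrong when counted by hand. As a small sketch, the zero-based indices can be looked up from the header lists printed earlier with `list.index()`. The names `ios_columns`, `droid_columns`, `ios_index` and `droid_index` are introduced here only for illustration.

```python
# Headers copied from the printed output earlier in this notebook.
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
droid_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                'Current Ver', 'Android Ver']

# Columns of interest (illustrative selection).
ios_columns = ['price', 'prime_genre', 'rating_count_tot', 'user_rating']
droid_columns = ['Price', 'Category', 'Genres', 'Installs', 'Rating']

# Map each column name to its zero-based position in the header.
ios_index = {name: ios_header.index(name) for name in ios_columns}
droid_index = {name: droid_header.index(name) for name in droid_columns}

print(ios_index)
print(droid_index)
```

These are the same indices the later code cells use, for example `row[11]` for `prime_genre` and `app[7]` for the Android `Price`.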


### Deleting data
Before the analysis we need to clean the dataset. 

There is an entry in the Android dataset that [has been reported as incorrect](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015).

In [10]:
print(droid_data[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


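The entry is missing its `Category` value, so every following value is shifted one column to the left; the `Rating` position, for example, holds 19, which is above the maximum rating of 5. We delete this row.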

In [11]:
del droid_data[10472]

In [12]:
#check what's there after deleting
print(droid_data[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
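
A row with a missing value, like the one deleted above, is shorter than the header, so such rows can also be found programmatically. The sketch below shows one way to do it; `find_malformed_rows` is a hypothetical helper and the sample rows are made up for illustration.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Made-up rows for illustration (not taken from the data set).
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],           # one value missing, so the row is too short
    ['App C', 'GAME', '4.5'],
]

print(find_malformed_rows(sample_header, sample_rows))
```

Running a check like this before deleting by index also guards against deleting the wrong row if the cell is executed twice.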


### Handling duplicates
The dataset might contain duplicate entries, which would distort our analysis.

#### Identifying duplicates
Below we check for duplicates by creating two lists.
* One for storing the name of duplicate apps.
* One for storing the name of unique apps.

In [13]:
unique_droid_apps = []
duplicate_droid_apps = []

for app in droid_data:
    #pick the column where the name is stored in the dataset
    name = app[0]
    if name in unique_droid_apps:
        duplicate_droid_apps.append(name)
    else:
        unique_droid_apps.append(name)
    

print("The number of unique apps is:", len(unique_droid_apps))
print("The number of duplicate apps is:", len(duplicate_droid_apps))
print('Some Examples of duplicate apps are:', duplicate_droid_apps[:10])

The number of unique apps is: 9659
The number of duplicate apps is: 1181
Some Examples of duplicate apps are: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
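
Checking `name in some_list` gets slower as the list grows. As an alternative sketch, `collections.Counter` can count every name in one pass. The helper name `count_duplicates` and the sample rows are our own illustration, not part of the original analysis.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return (number of unique names, number of extra duplicate rows)."""
    name_counts = Counter(row[name_index] for row in rows)
    n_unique = len(name_counts)
    # Every occurrence after the first one counts as a duplicate.
    n_duplicates = sum(count - 1 for count in name_counts.values())
    return n_unique, n_duplicates

# Made-up rows that only contain an app name.
sample = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits']]
print(count_duplicates(sample))
```

This counts duplicates the same way as the loop above: an app that appears three times contributes two duplicate rows.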


We don't want to count certain apps more than once when we analyze data. We need to **remove the duplicate entries** and keep only one entry per app. We could remove the duplicate rows randomly, but we can also find a better way.

If we examine one instance of a duplicate app, we can see the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times.

Below we can see one example, Instagram.

In [14]:
for app in droid_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We can assume that the higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.

#### Removing duplicates

To remove the duplicates, we will:

* Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

* Use the information stored in the dictionary to create a new data set, which will have only one entry per app. For each app, we'll only select the entry with the highest number of reviews.

In [15]:
print('Expected number of apps after removing duplicates:', len(droid_data)-len(duplicate_droid_apps))

Expected number of apps after removing duplicates: 9659


In [16]:
print('Expected number of apps after removing duplicates:', len(droid_data)-1181)

Expected number of apps after removing duplicates: 9659


The code below builds a dictionary that stores, for each app, the maximum number of reviews.
It loops through the Android data set and:
* Adds every new app encountered to the dictionary, together with its number of reviews.
* If the app already exists in the dictionary, compares the stored value (number of reviews) with the current one and updates it when necessary.

If the name already exists in the `reviews_max` dictionary, the code checks whether the stored number of reviews, `reviews_max[name]`, is smaller than the number of reviews of the current entry.

If the stored number is smaller, it is updated with the new, higher value. If not, the entry is skipped.

In [17]:
reviews_max = {}
for app in droid_data:
    #identify the column where the name is
    name = app[0]
    #identify the col where the no of reviews is and turn into a float
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
            
print("Number of apps after removing duplicates is", len(reviews_max))

Number of apps after removing duplicates is 9659


We'll use the `reviews_max` dictionary to remove the duplicates. For the duplicate cases, we'll only **keep the entries with the highest number of reviews**. In the code cell below:

We initialize two empty lists, `android_clean` and `already_added`.
We then loop through the Android data set (`droid_data`), and for every iteration:
* We isolate the name of the app and the number of reviews.
* We add the current row (`app`) to the `android_clean` list, and the app name (`name`) to the `already_added` list if:
    - The number of reviews of the current app matches the number of reviews of that app as described in the `reviews_max` dictionary; and
    - The name of the app is not already in the `already_added` list.

We need this supplementary condition to account for cases where the highest number of reviews of a duplicate app is the same for more than one entry. For example, the Box app has three entries with the same number of reviews. If we only checked `reviews_max[name] == n_reviews`, we would still end up with duplicate entries for some apps.

In [18]:
android_clean = []
already_added = []

for app in droid_data:
    name = app[0]
    n_reviews = float(app[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

In [19]:
#check if the lengths of the lists match the expected length after removing duplicates
print(len(android_clean))
print(len(already_added))
print(len(reviews_max))

9659
9659
9659
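
The two passes above (first the maximum per app, then the filtering) could also be combined into a single pass that stores the whole row in the dictionary. This is only a sketch of that alternative; `keep_most_reviewed` is a hypothetical name, and the sample rows are shortened versions of the Instagram entries shown earlier.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # The strict '>' keeps the first row when several rows tie for the maximum.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Shortened rows: name, category, rating, number of reviews.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
print(keep_most_reviewed(sample))
```

One difference worth noting: the rows come back in the order in which each app name first appears, which may differ from the order of `android_clean`, although the same rows should be selected.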


### Handling non-English apps

The language we use for the apps we develop at our company is English, and we'd like to analyze only the apps that are directed toward an English-speaking audience. However, if we explore the data long enough, we'll find that both data sets have apps whose name suggests that they are not directed toward an English-speaking audience. 

For example:

In [20]:
print(ios_data[813][1])
print(ios_data[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


We're not interested in keeping these kinds of apps, so we'll remove them.

One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

Behind the scenes, each character we use in a string has a corresponding number associated with it. For instance, the corresponding number for character 'a' is 97, for character 'A' is 65, and for character '爱' is 29,233. We can get the corresponding number of each character using the `ord()` function.

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we will build a function that detects whether a character belongs to the set of common English characters or not. If the number is **equal to or less than 127**, then the character belongs to the set of common English characters, otherwise it doesn't.

Emojis and some characters like ™ fall outside the ASCII range and have corresponding numbers that are over 127. 
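
As a quick illustration of the numbers mentioned above, the snippet below prints the value returned by `ord()` for a few characters and whether it falls within the ASCII range. The variable name `code_points` is introduced here only for this example.

```python
# Map a few sample characters to the number returned by ord().
code_points = {character: ord(character) for character in ['a', 'A', '爱', '™', '😜']}

for character, number in code_points.items():
    # A number of 127 or less means the character is within the ASCII range.
    print(repr(character), number, number <= 127)
```

Only 'a' and 'A' are within the ASCII range; the Chinese character, the trademark sign and the emoji all map to numbers above 127.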

We'll only remove an app **if its name has more than three characters with corresponding numbers falling outside the ASCII range**. This means all English apps with up to three emoji or other special characters will still be labeled as English.

In [21]:
# input = string
# output = True if there are at most 3 characters with
# a corresponding number over 127 (outside the ASCII range), else False
def check_values(string):
    counter = 0
    # each character in the string is converted into its corresponding number
    for character in string:
        character_number = ord(character)
        # if the number is over 127 we increment the counter
        if character_number > 127:
            counter = counter + 1
    if counter <= 3:
        return True
    else:
        return False

print(check_values('Instachat 😜'))
print(check_values('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False
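
The same rule can be written more compactly, with the threshold as a parameter. This is a sketch of an equivalent formulation rather than a replacement for the function above; the name `is_english`, the parameter `max_non_ascii` and the two `My App` names are made up for illustration.

```python
def is_english(name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
# Boundary cases with made-up names: three non-ASCII characters pass, four do not.
print(is_english('My App™ 😜😜'))
print(is_english('My App™ 😜😜😜'))
```

The heuristic is not perfect: an English app with four or more emoji is dropped, and a non-English app with a short name that uses three or fewer non-ASCII characters is kept.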


The above function checks if an app can be defined as English or not. Now we can loop through the whole data set to check all the names and create a list that contains only the English apps.

In [22]:
#list we want to add the English apps to
English_apps_droid = []

#loop through the data set to find the name
for app in android_clean:
    name = app[0]
    #run the check_values function on it to see if it's English or not
    #if yes, add it to the list
    if check_values(name):
        English_apps_droid.append(app)

#check if the function works
print(English_apps_droid[100])

['Hairstyles step by step', 'BEAUTY', '4.6', '4369', '14M', '100,000+', 'Free', '0', 'Everyone', 'Beauty', 'July 25, 2018', '1.9', '4.0.3 and up']


We can use the same function on the iOS dataset.

In [23]:
English_apps_ios = []

#loop through the data set to find the name
for app in ios_data:
    name = app[1]
    #run the check_values function on it to see if it's English or not
    #if yes, add it to the list
    if check_values(name):
        English_apps_ios.append(app)
        
#check if the function works
print(English_apps_ios[100])

['303849934', 'Beer Pong Game', '188956672', 'USD', '0.0', '187315', '9', '2.0', '4.0', '17.05.15', '17+', 'Games', '37', '5', '9', '1']


In [24]:
#check how many apps are left in each dataset
print(len(English_apps_droid))
print(len(English_apps_ios))

9614
6183


## Isolating free apps
At the moment our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis.

1. We isolate the free apps for the final list of Android apps.

In [25]:
clean_droid = []

#input & output are datasets

#loop through all values and if the app is free append to the dataset
for app in English_apps_droid:
    #for each row in the dataset the price is what sits in column 7
    price = app[7]
    #if that value is 0 we append to the clean_droid list
    if price == '0':
        clean_droid.append(app)

#check one example row to see how it worked
print(clean_droid[10])

['Name Art Photo Editor - Focus n Filters', 'ART_AND_DESIGN', '4.4', '8788', '12M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'July 31, 2018', '1.0.15', '4.0 and up']


2. We isolate the free apps for the final list of iOS apps.

In [26]:
clean_ios = []
for app in English_apps_ios:
    price = app[4]
    if price == '0.0':
        clean_ios.append(app)

#check one example row to see how it worked        
print(clean_ios[10])

['512939461', 'Subway Surfers', '156038144', 'USD', '0.0', '706110', '97', '4.5', '4.0', '1.72.1', '9+', 'Games', '38', '5', '1', '1']


We check the number of apps in each final dataset.

In [27]:
print('The final number of Android apps is:', len(clean_droid))
print('The final number of iOS apps is:', len(clean_ios))

The final number of Android apps is: 8864
The final number of iOS apps is: 3222
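
The two filters above compare the price as text (`'0'` for Google Play and `'0.0'` for the App Store). A single numeric check could cover both. The sketch below assumes that paid prices may carry a leading `$` sign, which is not shown in this notebook, and `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """Return True if a price string represents zero."""
    # Assumption: prices may look like '0', '0.0' or '$4.99'.
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0

for price in ['0', '0.0', '$4.99', '2.99']:
    print(price, is_free(price))
```

An empty or non-numeric price would raise a `ValueError` here, which is arguably better than silently treating a malformed row as paid or free.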


## Market analysis

Our company's revenue is highly influenced by the number of people using our apps, therefore our aim is to determine the kinds of apps that are likely to **attract more users**.

### Idea validation
Our idea validation strategy is as follows:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users => Develop it further.
3. If the app is profitable after six months => Build an iOS version of the app and add it to the App Store.

Our end goal is to add the app on both the App Store and Google Play.

We need to **find app profiles that are successful on both markets**.

## Genre analysis
We **want to know the most common genres for each market**. 

For this, we'll need to build frequency tables for a few columns in our data sets.

* In the Google Play dataset, the genre information is in columns 1 (`Category`) and 9 (`Genres`).
* In the iOS dataset, the genre information is in column 11 (`prime_genre`).

We will write a function that creates frequency tables showing each genre's share of apps as a percentage.

The function below builds such a table as a dictionary for any data set and column index.

In [28]:
# input: a dataset, and an index number for the column to explore
# output: a dictionary where each key is a genre and each value is
# the percentage of apps that belong to that genre

def freq_table(dataset, index):
    genres_table = {}

    # we loop through all the rows checking for the values in the genres column
    for row in dataset:
        # for each row in the dataset we define the genre as a specified column
        genre = row[index]
        # if the genre exists in the dictionary already we increment the value by one
        if genre in genres_table:
            genres_table[genre] += 1
        # if it doesn't exist in the dictionary we create it
        else:
            genres_table[genre] = 1

    # calculate the total number of values
    total = sum(genres_table.values())
    # we want to convert the numbers into %
    for key in genres_table:
        # the number to retrieve is the value corresponding to the key
        number = genres_table[key]
        # convert the number into a %
        percentage = (number / total)*100
        # update the value in the table
        genres_table[key] = round(percentage,5)
    return genres_table
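
For comparison, the same kind of percentage table can be sketched with `collections.Counter` from the standard library. The name `freq_table_counter`, the `digits` parameter and the sample rows are our own illustration.

```python
from collections import Counter

def freq_table_counter(dataset, index, digits=5):
    """Return {value: percentage of rows} for the column at position index."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: round(count / total * 100, digits) for key, count in counts.items()}

# Made-up rows: name and category.
sample = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'FAMILY']]
print(freq_table_counter(sample, 1))
```

Keeping the raw counts in a `Counter` also makes `counts.most_common()` available when the absolute numbers are of interest.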

Now we create a frequency table for app categories in Android.

In [29]:
freq_table(clean_droid,1)

{'ART_AND_DESIGN': 0.64305,
 'AUTO_AND_VEHICLES': 0.92509,
 'BEAUTY': 0.59792,
 'BOOKS_AND_REFERENCE': 2.1435,
 'BUSINESS': 4.59161,
 'COMICS': 0.62049,
 'COMMUNICATION': 3.23782,
 'DATING': 1.86146,
 'EDUCATION': 1.162,
 'ENTERTAINMENT': 0.95894,
 'EVENTS': 0.71074,
 'FINANCE': 3.70036,
 'FOOD_AND_DRINK': 1.24097,
 'HEALTH_AND_FITNESS': 3.07987,
 'HOUSE_AND_HOME': 0.82356,
 'LIBRARIES_AND_DEMO': 0.93637,
 'LIFESTYLE': 3.90343,
 'GAME': 9.72473,
 'FAMILY': 18.90794,
 'MEDICAL': 3.53114,
 'SOCIAL': 2.66245,
 'SHOPPING': 2.24504,
 'PHOTOGRAPHY': 2.94449,
 'SPORTS': 3.39576,
 'TRAVEL_AND_LOCAL': 2.33529,
 'TOOLS': 8.46119,
 'PERSONALIZATION': 3.31679,
 'PRODUCTIVITY': 3.89215,
 'PARENTING': 0.65433,
 'WEATHER': 0.80099,
 'VIDEO_PLAYERS': 1.79377,
 'NEWS_AND_MAGAZINES': 2.79783,
 'MAPS_AND_NAVIGATION': 1.39892}

We create a similar frequency table for the iOS apps.

In [30]:
freq_table(clean_ios,11)

{'Social Networking': 3.28988,
 'Photo & Video': 4.96586,
 'Games': 58.16263,
 'Music': 2.04842,
 'Reference': 0.55866,
 'Health & Fitness': 2.01738,
 'Weather': 0.86903,
 'Utilities': 2.51397,
 'Travel': 1.24146,
 'Shopping': 2.60708,
 'News': 1.33457,
 'Navigation': 0.18622,
 'Lifestyle': 1.58287,
 'Entertainment': 7.8833,
 'Food & Drink': 0.80695,
 'Sports': 2.14153,
 'Book': 0.43451,
 'Finance': 1.11732,
 'Education': 3.66232,
 'Productivity': 1.73805,
 'Business': 0.52762,
 'Catalogs': 0.12415,
 'Medical': 0.18622}

Dictionaries are not sorted by value, which makes the frequency tables above difficult to analyze.

We need to build a second function which can help us display the entries in the frequency table in a descending order.

In [31]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
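
As a side note, the sorting step can also be sketched without building the intermediate list of tuples, by passing a `key` function to `sorted()`. The helper name `sorted_table` is hypothetical; the three sample values are taken from the iOS frequency table above.

```python
def sorted_table(table):
    """Return (key, value) pairs sorted by value, highest first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

sample_table = {'Games': 58.16263, 'Catalogs': 0.12415, 'Entertainment': 7.8833}
for genre, percentage in sorted_table(sample_table):
    print(genre, ':', percentage)
```

The two approaches can order tied values differently: sorting `(value, key)` tuples in reverse breaks ties in reverse alphabetical order, while the `key`-based sort keeps tied entries in their original dictionary order.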

Below is a table of the iOS `prime_genre` column in order of popularity.

In [32]:
display_table(clean_ios, 11)

Games : 58.16263
Entertainment : 7.8833
Photo & Video : 4.96586
Education : 3.66232
Social Networking : 3.28988
Shopping : 2.60708
Utilities : 2.51397
Sports : 2.14153
Music : 2.04842
Health & Fitness : 2.01738
Productivity : 1.73805
Lifestyle : 1.58287
News : 1.33457
Travel : 1.24146
Finance : 1.11732
Weather : 0.86903
Food & Drink : 0.80695
Reference : 0.55866
Business : 0.52762
Book : 0.43451
Navigation : 0.18622
Medical : 0.18622
Catalogs : 0.12415


Below is a table of the Android `Category` column in order of popularity.

In [33]:
display_table(clean_droid, 1)

FAMILY : 18.90794
GAME : 9.72473
TOOLS : 8.46119
BUSINESS : 4.59161
LIFESTYLE : 3.90343
PRODUCTIVITY : 3.89215
FINANCE : 3.70036
MEDICAL : 3.53114
SPORTS : 3.39576
PERSONALIZATION : 3.31679
COMMUNICATION : 3.23782
HEALTH_AND_FITNESS : 3.07987
PHOTOGRAPHY : 2.94449
NEWS_AND_MAGAZINES : 2.79783
SOCIAL : 2.66245
TRAVEL_AND_LOCAL : 2.33529
SHOPPING : 2.24504
BOOKS_AND_REFERENCE : 2.1435
DATING : 1.86146
VIDEO_PLAYERS : 1.79377
MAPS_AND_NAVIGATION : 1.39892
FOOD_AND_DRINK : 1.24097
EDUCATION : 1.162
ENTERTAINMENT : 0.95894
LIBRARIES_AND_DEMO : 0.93637
AUTO_AND_VEHICLES : 0.92509
HOUSE_AND_HOME : 0.82356
WEATHER : 0.80099
EVENTS : 0.71074
PARENTING : 0.65433
ART_AND_DESIGN : 0.64305
COMICS : 0.62049
BEAUTY : 0.59792


Below is a table of Android app `Genres` in order of popularity.

In [34]:
display_table(clean_droid, 9)

Tools : 8.44991
Entertainment : 6.06949
Education : 5.34747
Business : 4.59161
Productivity : 3.89215
Lifestyle : 3.89215
Finance : 3.70036
Medical : 3.53114
Sports : 3.46345
Personalization : 3.31679
Communication : 3.23782
Action : 3.10244
Health & Fitness : 3.07987
Photography : 2.94449
News & Magazines : 2.79783
Social : 2.66245
Travel & Local : 2.32401
Shopping : 2.24504
Books & Reference : 2.1435
Simulation : 2.04197
Dating : 1.86146
Arcade : 1.85018
Video Players & Editors : 1.77121
Casual : 1.75993
Maps & Navigation : 1.39892
Food & Drink : 1.24097
Puzzle : 1.12816
Racing : 0.99278
Role Playing : 0.93637
Libraries & Demo : 0.93637
Auto & Vehicles : 0.92509
Strategy : 0.91381
House & Home : 0.82356
Weather : 0.80099
Events : 0.71074
Adventure : 0.6769
Comics : 0.60921
Beauty : 0.59792
Art & Design : 0.59792
Parenting : 0.49639
Card : 0.45126
Casino : 0.4287
Trivia : 0.41742
Educational;Education : 0.39486
Board : 0.38357
Educational : 0.37229
Education;Education : 0.33845
Word :

### Analysis
The most common genres are:
* iOS: Games, Entertainment
* Android Categories: Family, Game
* Android Genres: Tools, Entertainment

In the iOS dataset, the majority of apps (about 58%) belong to the first genre, Games; the runner-up, Entertainment, is far behind at about 8%.
Most apps in iOS fall into the category of "general fun", rather than utilities.

In the Android dataset the picture is more varied: there isn't such a big gap between the top option in each column and the runners-up.

**Conclusion:** It appears the best kind of app to build would be a family game. "Family" is the most common category on Android and "Games" is the most common genre on iOS. In addition, the runner-up on Android is "Game" in `Category` and "Entertainment" in `Genres`.

## Analysis of the Number of Installs (Popularity)
We still don't know what kind of apps have the most users.

A way to find out what genres are the most popular (have the most users) is to **calculate the average number of installs for each app genre**. 

For the Google Play data set, this information is in the `Installs` column (index 5).

This information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column (index 5).

We need to calculate the average number of user ratings per app genre. To do that, we'll need to:

* Isolate the apps of each genre
* Sum up the number of installs or user ratings for the apps of that genre
* Divide the sum by the number of apps belonging to that genre
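
One practical detail for the Google Play side: the `Installs` values are strings such as `'10,000+'`, so they have to be converted to numbers before they can be summed. A minimal sketch of that conversion follows; `parse_installs` is a hypothetical helper, and the sample strings are taken from the rows printed earlier.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' into a number."""
    # Remove the thousands separators and the trailing plus sign.
    return float(text.replace(',', '').replace('+', ''))

for raw in ['10,000+', '500,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Since `'10,000+'` means "at least 10,000", the converted values are lower bounds, and averages built from them are approximations rather than exact install counts.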

### Popularity in iOS

In [42]:
ios_genres_table = freq_table(clean_ios,11)
print(ios_genres_table)

{'Social Networking': 3.28988, 'Photo & Video': 4.96586, 'Games': 58.16263, 'Music': 2.04842, 'Reference': 0.55866, 'Health & Fitness': 2.01738, 'Weather': 0.86903, 'Utilities': 2.51397, 'Travel': 1.24146, 'Shopping': 2.60708, 'News': 1.33457, 'Navigation': 0.18622, 'Lifestyle': 1.58287, 'Entertainment': 7.8833, 'Food & Drink': 0.80695, 'Sports': 2.14153, 'Book': 0.43451, 'Finance': 1.11732, 'Education': 3.66232, 'Productivity': 1.73805, 'Business': 0.52762, 'Catalogs': 0.12415, 'Medical': 0.18622}


In [43]:
#loop through the genres and sum all the user ratings per genre
#count the number of apps belonging to each genre
#divide the number of ratings by the number of apps belonging to each genre
#to get the average number of ratings per genre

for genre in ios_genres_table:
    #total user ratings
    total = 0
    #number of apps in each genre
    len_genre = 0
    #loop through the main data set and compare genres with the frequency table
    for row in clean_ios:
        genre_app = row[11]
        #if the genre in the frequency table is the same as in the main dataset
        if genre_app == genre:
            #convert the number of ratings to a float and add it to the total for that genre
            n_ratings = float(row[5])
            total += n_ratings
            len_genre += 1
    #compute the average number of ratings per genre
    average_ratings = total / len_genre
    print(genre, ':', average_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
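
The nested loop above scans the whole data set once per genre. As an alternative sketch, the totals and counts can be accumulated in a single pass and the result sorted so the genres with the highest averages come first. The name `average_by_genre` and the sample rows are made up for illustration.

```python
from collections import defaultdict

def average_by_genre(rows, genre_index, value_index):
    """Return (genre, average value) pairs, highest average first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        genre = row[genre_index]
        totals[genre] += float(row[value_index])
        counts[genre] += 1
    averages = {genre: totals[genre] / counts[genre] for genre in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Made-up rows: name, genre, number of ratings.
sample = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Weather', '50'],
]
print(average_by_genre(sample, 1, 2))
```

For the iOS data in this notebook, the call would presumably be `average_by_genre(clean_ios, 11, 5)`.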


Navigation, Reference and Social Networking have the highest average number of user ratings on iOS. We start by taking a closer look at the Social Networking genre.

In [45]:
#investigate names of apps in the social Networking Category
for app in clean_ios:
    if app[11] == 'Social Networking':
        print(app[1])

Facebook
Pinterest
Skype for iPhone
Messenger
Tumblr
WhatsApp Messenger
Kik
ooVoo – Free Video Call, Text and Voice
TextNow - Unlimited Text + Calls
Viber Messenger – Text & Call
Followers - Social Analytics For Instagram
MeetMe - Chat and Meet New People
We Heart It - Fashion, wallpapers, quotes, tattoos
InsTrack for Instagram - Analytics Plus More
Tango - Free Video Call, Voice and Chat
LinkedIn
Match™ - #1 Dating App.
Skype for iPad
POF - Best Dating App for Conversations
Timehop
Find My Family, Friends & iPhone - Life360 Locator
Whisper - Share, Express, Meet
Hangouts
LINE PLAY - Your Avatar World
WeChat
Badoo - Meet New People, Chat, Socialize.
Followers + for Instagram - Follower Analytics
GroupMe
Marco Polo Video Walkie Talkie
Miitomo
SimSimi
Grindr - Gay and same sex guys chat, meet and date
Wishbone - Compare Anything
imo video calls and chat
After School - Funny Anonymous School News
Quick Reposter - Repost, Regram and Reshare Photos
Weibo HD
Repost for Instagram
Live.me – Li

The category of "Social Networking" is popular, but the numbers might be skewed by apps like Facebook and Pinterest, and by messengers like WhatsApp, Kik and Messenger. A simple new app is unlikely to be able to compete with them.

We investigate another popular category in the dataset, "Navigation".

In [53]:
for app in clean_ios:
    if app[11] == 'Navigation':
        print(app[1], app[5])

Waze - GPS Navigation, Maps & Real-time Traffic 345046
Google Maps - Navigation & Transit 154911
Geocaching® 12811
CoPilot GPS – Car Navigation & Offline Maps 3582
ImmobilienScout24: Real Estate Search in Germany 187
Railway Route Search 5


Similar results can be seen in the Navigation category, with Waze and Google Maps accounting for the vast majority of user ratings.
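
To put a number on this concentration, the sketch below uses the six Navigation rating counts printed above to compute the share held by the top two apps, and compares the mean with the median. The helper name `top_share` is our own.

```python
from statistics import mean, median

# Rating counts copied from the Navigation output above.
navigation_ratings = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 345046,
    'Google Maps - Navigation & Transit': 154911,
    'Geocaching®': 12811,
    'CoPilot GPS – Car Navigation & Offline Maps': 3582,
    'ImmobilienScout24: Real Estate Search in Germany': 187,
    'Railway Route Search': 5,
}

def top_share(counts, n):
    """Return the fraction of the total held by the n largest values."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:n]) / sum(ordered)

counts = list(navigation_ratings.values())
print('Share of the top 2 apps:', round(top_share(counts, 2), 4))
print('Mean:', round(mean(counts), 2))
print('Median:', median(counts))
```

The top two apps hold roughly 97% of all ratings in the genre, and the median is about a tenth of the mean, which suggests that the genre average says little about a typical Navigation app.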

Let's check two other popular categories: `Weather` and `Food & Drink`:

In [56]:
for app in clean_ios:
    if app[11] == 'Weather':
        print(app[1], app[5])

The Weather Channel: Forecast, Radar & Alerts 495626
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking 208648
WeatherBug - Local Weather, Radar, Maps, Alerts 188583
MyRadar NOAA Weather Radar Forecast 150158
AccuWeather - Weather for Life 144214
Yahoo Weather 112603
Weather Underground: Custom Forecast & Local Radar 49192
NOAA Weather Radar - Weather Forecast & HD Radar 45696
Weather Live Free - Weather Forecast & Alerts 35702
Storm Radar 22792
QuakeFeed Earthquake Map, Alerts, and News 6081
Moji Weather - Free Weather Forecast 2333
Hurricane by American Red Cross 1158
Forecast Bar 375
Hurricane Tracker WESH 2 Orlando, Central Florida 203
FEMA 128
iWeather - World weather forecast 80
Weather - Radar - Storm with Morecast App 78
Yurekuru Call 53
Weather & Radar 37
WRAL Weather Alert 25
Météo-France 24
JaxReady 22
Freddy the Frogcaster's Weather Station 14
Almanac Long-Range Weather Forecast 12
TodayAir 0
wetter.com 0
WarnWetter 0


The first app has more than twice as many user ratings as either the second or the third one. It might be hard to compete with the top apps. Weather also isn't really a category where one can innovate.

In [55]:
for app in clean_ios:
    if app[11] == 'Food & Drink':
        print(app[1], app[5])

Starbucks 303856
Domino's Pizza USA 258624
OpenTable - Restaurant Reservations 113936
Allrecipes Dinner Spinner 109349
DoorDash - Food Delivery 25947
UberEATS: Uber for Food Delivery 17865
Postmates - Food Delivery, Faster 9519
Dunkin' Donuts - Get Offers, Coupons & Rewards 9068
Chick-fil-A 5665
McDonald's 4050
Deliveroo: Restaurant Delivery - Order Food Nearby 1702
SONIC Drive-In 1645
Nowait Guest 1625
7-Eleven, Inc. 1356
Outback 805
Bon Appetit 750
Starbucks Keyboard 457
Whataburger 197
Delish Eatmoji Keyboard 154
Lieferheld - Delicious food delivery service 29
Lieferando.de 29
McDo France 22
Chefkoch - Rezepte, Kochen, Backen & Kochbuch 20
Youmiam 9
Marmiton Twist 2
Open Food Facts 1


Top food brands have their own apps which dominate the rankings here. 

In line with the earlier conclusions about popular genres, we next look at "Games" and "Entertainment".

In [57]:
for app in clean_ios:
    if app[11] == 'Games':
        print(app[1], app[5])

Clash of Clans 2130805
Temple Run 1724546
Candy Crush Saga 961794
Angry Birds 824451
Subway Surfers 706110
Solitaire 679055
CSR Racing 677247
Crossy Road - Endless Arcade Hopper 669079
Injustice: Gods Among Us 612532
Hay Day 567344
PAC-MAN 508808
DragonVale 503230
Head Soccer 481564
Despicable Me: Minion Rush 464312
The Sims™ FreePlay 446880
Sonic Dash 418033
8 Ball Pool™ 416736
Tiny Tower - Free City Building 414803
Jetpack Joyride 405647
Bike Race - Top Motorcycle Racing Games 405007
Kim Kardashian: Hollywood 397730
Trivia Crack 393469
WordBrain 391401
Sniper 3D Assassin: Shoot to Kill Gun Game 386521
Flow Free 373857
Geometry Dash Lite 370370
▻Sudoku 359832
Fruit Ninja® 327025
Pixel Gun 3D 301182
Temple Run 2 295211
My Horse 293857
Word Cookies! 287095
Dragon City Mobile 277268
The Simpsons™: Tapped Out 274501
Plants vs. Zombies™ 2 267394
Clash Royale 266921
Pokémon GO 257627
CSR Racing 2 257100
Star Wars™: Commander 253448
Boom Beach 241929
MARVEL Contest of Champions 233599
MADDEN

In [58]:
for app in clean_ios:
    if app[11] == 'Entertainment':
        print(app[1], app[5])

Netflix 308844
Fandango Movies - Times + Tickets 291787
Colorfy: Coloring Book for Adults 247809
IMDb Movies & TV - Trailers and Showtimes 183425
TRUTH or DARE!!! - FREE 171055
Mad Libs 117889
Twitch 109549
Action Movie FX 101222
Voice Changer Plus 98777
iFunny :) 98344
The CW 97368
The Moron Test 88613
DIRECTV 81006
ABC – Watch Live TV & Stream Full Episodes 78890
Xbox 72187
Redbox 60236
Talking Tom Cat 2 for iPad 56399
Hulu: Watch TV Shows & Stream the Latest Movies 56170
NBC – Watch Now and Stream Full TV Episodes 55950
Emoji> 55338
DIRECTV App for iPad 47506
Amazon Prime Video 43667
CBS Full Episodes and Live TV 39436
FOX NOW - Watch Full Episodes and Stream Live TV 39391
Talking Angela for iPad 32763
Recolor - Coloring Book 31180
Talking Ben the Dog for iPad 31116
Talking Tom Cat for iPad 29492
YouTube Kids 28560
Tom's Love Letters 27711
HBO GO 26278
NFL Sunday Ticket 24258
Pigment - Coloring Book for Adults 23967
Disney Channel – Watch Full Episodes, Movies & TV 21082
BuzzTube - 

The Games category is very long and, apart from the first two games with the highest values, the rest are more evenly distributed. For Entertainment, movie and TV apps lead the ranking, followed by what also look like games or game-related apps.
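
One way to put a number on how concentrated a genre is: compute the share of all ratings held by its top few apps. The sketch below is illustrative only; `top_share` is a hypothetical helper that does not appear elsewhere in this notebook, and the sample counts are the first five Weather values printed above.

```python
def top_share(counts, n=2):
    """Return the fraction of the total held by the n largest values."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0  # avoid dividing by zero on empty or all-zero input
    return sum(ordered[:n]) / total

# First five rating counts from the Weather listing above (a sample, not the full genre)
weather_sample = [495626, 208648, 188583, 150158, 144214]
print(round(top_share(weather_sample, n=1), 3))
```

In the notebook itself, the counts could be built with something like `[float(app[5]) for app in clean_ios if app[11] == 'Games']`, assuming `app[5]` still holds the rating count as a string.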

## Popularity in Android 
We have installs data for Android, but the install numbers aren't precise enough: most values are open-ended (100+, 1,000+, 5,000+, etc.).

In [59]:
display_table(clean_droid, 5)

1,000,000+ : 15.72653
100,000+ : 11.55235
10,000,000+ : 10.54829
10,000+ : 10.19856
1,000+ : 8.3935
100+ : 6.91561
5,000,000+ : 6.82536
500,000+ : 5.56182
50,000+ : 4.77211
5,000+ : 4.51264
10+ : 3.54242
500+ : 3.2491
50,000,000+ : 2.30144
100,000,000+ : 2.13222
50+ : 1.91787
5+ : 0.78971
1+ : 0.50767
500,000,000+ : 0.27076
1,000,000,000+ : 0.22563
0+ : 0.04513
0 : 0.01128


We'll leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number from string to float. This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error.
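
The conversion described above can be sketched in isolation as follows. `parse_installs` is a hypothetical helper name used only for illustration; the notebook performs the same two `replace` calls inline.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    # Remove the plus sign and the thousands separators before converting
    cleaned = raw.replace('+', '').replace(',', '')
    return float(cleaned)

for raw in ['100+', '1,000,000+', '0']:
    print(raw, '->', parse_installs(raw))
```

Calling `float('1,000,000+')` directly raises a `ValueError`, which is why both characters have to go first.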

We'll create a table similar to the one we built for iOS, this time showing the install averages for the Google Play apps, grouped by the Category column.

In [60]:
# create a frequency table for the Google Play categories
droid_genres_table = freq_table(clean_droid,1)
print(droid_genres_table)

{'ART_AND_DESIGN': 0.64305, 'AUTO_AND_VEHICLES': 0.92509, 'BEAUTY': 0.59792, 'BOOKS_AND_REFERENCE': 2.1435, 'BUSINESS': 4.59161, 'COMICS': 0.62049, 'COMMUNICATION': 3.23782, 'DATING': 1.86146, 'EDUCATION': 1.162, 'ENTERTAINMENT': 0.95894, 'EVENTS': 0.71074, 'FINANCE': 3.70036, 'FOOD_AND_DRINK': 1.24097, 'HEALTH_AND_FITNESS': 3.07987, 'HOUSE_AND_HOME': 0.82356, 'LIBRARIES_AND_DEMO': 0.93637, 'LIFESTYLE': 3.90343, 'GAME': 9.72473, 'FAMILY': 18.90794, 'MEDICAL': 3.53114, 'SOCIAL': 2.66245, 'SHOPPING': 2.24504, 'PHOTOGRAPHY': 2.94449, 'SPORTS': 3.39576, 'TRAVEL_AND_LOCAL': 2.33529, 'TOOLS': 8.46119, 'PERSONALIZATION': 3.31679, 'PRODUCTIVITY': 3.89215, 'PARENTING': 0.65433, 'WEATHER': 0.80099, 'VIDEO_PLAYERS': 1.79377, 'NEWS_AND_MAGAZINES': 2.79783, 'MAPS_AND_NAVIGATION': 1.39892}


The loop below prints the average number of installs per category for apps in Google Play.

In [61]:
# input: the category frequency table and the cleaned Google Play data set
# output: the average number of installs per category, printed one per line
# for each category, sum the installs of its apps and divide by the number of apps

for category in droid_genres_table:
    total = 0         # sum of installs for this category
    len_category = 0  # number of apps in this category
    for app in clean_droid:
        category_app = app[1]
        # only count apps whose category matches the one from the frequency table
        if category_app == category:
            n_installs = app[5]
            # strip '+' and ',' so the string can be converted to float
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    # average number of installs for this category
    average = total / len_category
    print(category, ':', average)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

In [67]:
# function to list the apps in a category with 50,000,000+ or 100,000,000+ installs

def popular(category):
    for app in clean_droid:
        if app[1] == category and app[5] in ('50,000,000+', '100,000,000+'):
            print(app[0], ':', app[5])
            
#investigate apps in travel and local
popular("TRAVEL_AND_LOCAL")

trivago: Hotels & Travel : 50,000,000+
Booking.com Travel Deals : 100,000,000+
VZ Navigator : 50,000,000+
TripAdvisor Hotels Flights Restaurants Attractions : 100,000,000+
2GIS: directory & navigator : 50,000,000+
MAPS.ME – Offline Map and Travel Navigation : 50,000,000+
Google Earth : 100,000,000+


In line with previous findings, we check the "Game" and "Entertainment" categories.

In [70]:
popular("GAME")

Sonic Dash : 100,000,000+
PAC-MAN : 100,000,000+
Bubble Witch 3 Saga : 50,000,000+
Roll the Ball® - slide puzzle : 100,000,000+
Block Craft 3D: Building Simulator Games For Free : 50,000,000+
Love Balls : 50,000,000+
Piano Tiles 2™ : 100,000,000+
Pokémon GO : 100,000,000+
Snake VS Block : 50,000,000+
Extreme Car Driving Simulator : 100,000,000+
Trivia Crack : 100,000,000+
Angry Birds 2 : 100,000,000+
PUBG MOBILE : 50,000,000+
Summoners War : 50,000,000+
Lords Mobile: Battle of the Empires - Strategy RPG : 50,000,000+
8 Ball Pool : 100,000,000+
Candy Crush Soda Saga : 100,000,000+
Toy Blast : 50,000,000+
Clash Royale : 100,000,000+
Clash of Clans : 100,000,000+
Plants vs. Zombies FREE : 100,000,000+
Flow Free : 100,000,000+
My Talking Angela : 100,000,000+
slither.io : 100,000,000+
Cooking Fever : 100,000,000+
Yes day : 100,000,000+
Gardenscapes : 50,000,000+
Score! Hero : 100,000,000+
Magic Tiles 3 : 50,000,000+
Granny : 50,000,000+
Dream League Soccer 2018 : 100,000,000+
Sniper 3D Gun

In [71]:
popular("ENTERTAINMENT")

Hotstar : 100,000,000+
Talking Angela : 100,000,000+
Talking Ginger 2 : 50,000,000+
Amazon Prime Video : 50,000,000+
IMDb Movies & TV : 100,000,000+
Twitch: Livestream Multiplayer Games & Esports : 50,000,000+
PlayStation App : 50,000,000+
Talking Ben the Dog : 100,000,000+
Netflix : 100,000,000+


There are only a few main players in the "Travel and local" and "Entertainment" categories. In contrast, the "Games" sector seems more evenly distributed, with many games at 100 million installs or more.
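
To compare categories by count rather than by reading the lists, a numeric threshold can replace the exact string matches. The following is a minimal sketch, assuming the same column layout as `clean_droid` (category at index 1, installs at index 5); `count_popular` and the sample rows are made up for illustration.

```python
def count_popular(dataset, category, min_installs=50_000_000,
                  category_index=1, installs_index=5):
    """Count apps in `category` whose install floor is at least `min_installs`."""
    count = 0
    for app in dataset:
        if app[category_index] != category:
            continue
        # '50,000,000+' -> 50000000
        installs = int(app[installs_index].replace('+', '').replace(',', ''))
        if installs >= min_installs:
            count += 1
    return count

# Made-up rows: name, category, three unused columns, installs
sample = [
    ['App A', 'GAME', '', '', '', '100,000,000+'],
    ['App B', 'GAME', '', '', '', '50,000,000+'],
    ['App C', 'GAME', '', '', '', '1,000+'],
    ['App D', 'ENTERTAINMENT', '', '', '', '500,000,000+'],
]
print(count_popular(sample, 'GAME'))
```

Unlike `popular()`, this version also counts the buckets above 100,000,000+ (such as 500,000,000+), so its totals can be higher than the lists printed above.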

# Conclusions 
The app we create should have a chance of:
* becoming popular and
* appealing to most users

It should therefore most likely fall into the Games category. As the **next step** of this analysis, we could investigate more categories and check whether there is a secondary characteristic, such as a sub-genre, that we can define for the game we want to create.
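
A possible starting point for that next step is to average installs per sub-genre within a single category. This is only a sketch: it assumes each Google Play row keeps a `Genres` value at index 9, which has not been checked in this notebook, and `installs_by_subgenre`, `make_row`, and the sample rows are hypothetical.

```python
def installs_by_subgenre(dataset, category, genre_index=9,
                         category_index=1, installs_index=5):
    """Return {sub-genre: average installs} for the apps in `category`."""
    totals = {}
    counts = {}
    for app in dataset:
        if app[category_index] != category:
            continue
        genre = app[genre_index]
        installs = float(app[installs_index].replace('+', '').replace(',', ''))
        totals[genre] = totals.get(genre, 0.0) + installs
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: totals[genre] / counts[genre] for genre in totals}

def make_row(name, category, installs, genre):
    """Build a made-up row with installs at index 5 and genre at index 9."""
    return [name, category, '', '', '', installs, '', '', '', genre]

sample = [
    make_row('App A', 'GAME', '100,000,000+', 'Puzzle'),
    make_row('App B', 'GAME', '50,000,000+', 'Puzzle'),
    make_row('App C', 'GAME', '1,000+', 'Action'),
    make_row('App D', 'TOOLS', '10,000+', 'Tools'),
]
print(installs_by_subgenre(sample, 'GAME'))
```

As with the category averages, a handful of very large apps can dominate a sub-genre's average, so the result is best read alongside the app counts.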