# App Revenue Analysis

This introductory data science project takes the standpoint of a data analyst working at an app development company.

The company only builds apps that are free to download, so its revenue comes from in-app ads. Revenue, and therefore profit, depends on user engagement.

This project therefore analyses two data sets (one each for Android and iOS) and aims to help app developers understand which types of apps are likely to attract more users.

## Importing the data sets
The first step in any data analysis project is to compile and load in the data you will utilise.

Currently there are over 4 million apps across Apple and Android app stores. It would be extremely time and cost intensive to collect data for all of these apps and therefore a sample was taken.

Luckily, sample data for both Apple and Android had already been collated:

- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data from approximately 10,000 Android apps from Google Play; the data was collected in August 2018.

- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. 

Both of these are opened in the code below, with the header rows saved in separate variables from the data rows to make the analysis easier.

In [27]:
from csv import reader

## The Android data set ##
opened_file = open('C:/Users/ben/Documents/Data science course python/googleplaystore.csv', encoding="utf8")
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

## The App Store data set ##
opened_file = open('C:/Users/ben/Documents/Data science course python/AppleStore.csv' , encoding="utf8")
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
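
The cell above leaves both files open after reading. As a minimal sketch, the same loading step could be wrapped in a helper that closes the file automatically. The name `load_csv` and the throwaway demo file are assumptions made for illustration and are not part of the original notebook:

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The `with` block closes the file even if parsing raises an error.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Demonstration on a small temporary file rather than the real data sets.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as f:
        csv.writer(f).writerows([["App", "Price"], ["Example App", "0"]])
    demo_header, demo_rows = load_csv(demo_path)

print(demo_header)  # ['App', 'Price']
print(demo_rows)    # [['Example App', '0']]
```

With the real files, the call would look like `android_header, android = load_csv(path_to_googleplaystore_csv)`, where the path variable is whatever location the file is stored at.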




**Function 1 : `explore_data`**

Once the data has been loaded, a quick way to navigate through each data set is required. 

Therefore, a function is created to ease the analysis process.

The `explore_data()` function:

*Takes in four parameters:*

- `dataset`, which is expected to be a list of lists.
- `start` and `end`, which are both expected to be integers and represent the starting and the ending indices of a slice from the data set.
- `rows_and_columns`, which is expected to be a Boolean and has False as a default argument.

*Slices the data set using `dataset[start:end]`.*

*Loops through the slice, and for each iteration, prints a row and adds a new line after that row using `print('\n')`.*

- The `\n` in `print('\n')` is a special character and won't be printed. Instead, the \n character adds a new line, and we use `print('\n')` to add some blank space between rows.

*Prints the number of rows and columns if `rows_and_columns` is `True`.*

- `dataset` shouldn't have a header row, otherwise the function will print the wrong number of rows (one more row compared to the actual length).

Below we create this function and then test it on the android data set:

In [28]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We can conclude that the Android data set has 10841 app entries and 13 columns.

The definition of each column can be found in the data set's [documentation](https://www.kaggle.com/lava18/google-play-store-apps).

From this information, the columns that might be useful for our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.
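
Because each row is a plain list, these columns are later accessed by position. A small sketch of how the positions could be looked up from the header instead of counted by hand (the names `useful_columns` and `column_index` are made up for this example):

```python
# Header row as printed above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

useful_columns = ['App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', 'Genres']

# Map each column name to its position in a row.
column_index = {name: android_header.index(name) for name in useful_columns}

print(column_index)
# {'App': 0, 'Category': 1, 'Reviews': 3, 'Installs': 5, 'Type': 6, 'Price': 7, 'Genres': 9}
```

These are the same indices used throughout the rest of the notebook, for example `app[3]` for `Reviews` and `app[5]` for `Installs`.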

Applying the same function to the Apple store data:

In [30]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


We have 7197 iOS apps in this data set, and 16 columns.

Not all column names are self-explanatory in this case, but details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

As above, the potentially useful columns include 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'.

## Cleaning the Data Sets ## 

Before any meaningful data analysis can take place, the data must be purged of duplicates and other potentially irrelevant data.

**Removing Incorrect Data**

According to the discussion section for the Android data set, there is an incorrect row at index `10472`.

Extracting this row to check if it is indeed incorrect:

In [36]:
print(android_header)
print("\n")
print(android[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


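The `Rating` column for this app reads `19`, which is clearly incorrect as the maximum rating on Google Play is 5. The row has only 12 values against 13 columns: the `Category` value is missing, so every later value is shifted one column to the left. It is therefore necessary to delete this row from the data set: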

In [38]:
print(len(android))

### del android[10472] ###

print(len(android))

10841
10840


The `del` statement was commented out after its first run, so that re-running the cell does not delete a second, valid row.
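
An alternative that is safe to re-run is to filter on row length rather than delete by index, since the faulty row printed above has 12 values against 13 header columns. The sketch below uses a made-up three-column sample rather than the real data set:

```python
header = ['App', 'Category', 'Rating']

rows = [
    ['Good App', 'TOOLS', '4.1'],
    ['Shifted App', '1.9'],          # one value is missing, so the row is short
    ['Another App', 'GAME', '3.9'],
]

# Keep only rows that have exactly as many values as the header.
clean_rows = [row for row in rows if len(row) == len(header)]

# Running the same filter again changes nothing, unlike `del rows[1]`.
clean_again = [row for row in clean_rows if len(row) == len(header)]

print(len(rows), len(clean_rows), len(clean_again))  # 3 2 2
```

This only catches rows with a missing or extra value; a row with the right number of columns but a wrong value would still need a separate check.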

## Finding Duplicates ##

When scrolling through the Android data set, there appear to be multiple entries for certain apps.

For example:

In [42]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Instagram appears 4 times in the Android data set. Three of these four entries, as well as any other duplicate rows, need to be removed.

The following code prints the number of duplicate entries in the Android data set, along with some example names, to better understand the cleaning required:

In [45]:
duplicate_apps = []
unique_apps = [] 

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:' , len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:' , duplicate_apps[:10]) #Prints the first 10 apps that are duplicates 



Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
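
As a hedged aside, the same count can be obtained with `collections.Counter` from the standard library, which avoids repeatedly searching a growing list. The sample names below are made up for illustration:

```python
from collections import Counter

# Made-up sample of app names; in the notebook these would be app[0] for each row.
app_names = ['Box', 'Slack', 'Box', 'Instagram', 'Box', 'Slack']

name_counts = Counter(app_names)

# Names that occur more than once, with their number of occurrences.
duplicates = {name: n for name, n in name_counts.items() if n > 1}

# Every occurrence beyond the first counts as one duplicate entry,
# which matches how the loop above fills `duplicate_apps`.
n_extra_rows = sum(n - 1 for n in duplicates.values())

print(duplicates)    # {'Box': 3, 'Slack': 2}
print(n_extra_rows)  # 3
```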


### Removing Duplicates (Part 1) ###

It would be simple to remove duplicates at random, leaving one entry for each app.

However, to improve the accuracy of our analysis, we use the fourth column, `Reviews`, which states the number of reviews.

A higher number of reviews suggests that the row was collected more recently, so it is likely to be the most up-to-date entry.

We will therefore keep only one entry per app: the one with the highest number of reviews.

To do that, the steps are:

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app

- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

In [50]:
reviews_max = {} #Creating the empty dictionary

for app in android:
    name = app[0]
    reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < reviews:
        reviews_max[name] = reviews
    
    elif name not in reviews_max:
        reviews_max[name] = reviews



We previously found that there were `1181` duplicate entries.

To check that the code above has worked as expected, the length of the new dictionary `reviews_max` should equal the length of the `android` data set minus the number of duplicates:

In [51]:
print('Expected value:' , len(android) - 1181 )
print('\n')
print('Actual value:' , len(reviews_max))

Expected value: 9659


Actual value: 9659


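### Removing Duplicates (Part 2) ###

Next we will use the dictionary created above to remove the duplicate rows. The steps are outlined below:

1) Create two empty lists: `android_clean` (which will store our new cleaned data set) and `already_added` (which will just store app names).

2) Loop through the Google Play data set, and for each iteration:
- Assign the app name to a variable named `name`.
- Convert the number of reviews to float, and assign it to a variable named `reviews`.
- If `reviews` equals `reviews_max[name]` and `name` is not yet in `already_added`, append the whole row to `android_clean` and the name to `already_added`. The second condition is needed because several duplicate rows can share the same highest number of reviews.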

In [52]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    reviews = float(app[3])
    
    if reviews == reviews_max[name] and (name not in already_added) :
        android_clean.append(app) #Using `app` transfers all data to this list, as opposed to just the app name
        
        already_added.append(name) #Used to make sure no duplicates occur    
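
The two passes above (build `reviews_max`, then filter) could also be sketched as a single pass that stores the best row per app directly. This is an alternative shown under assumptions, not the notebook's method: the rows are shortened to four columns and the `Box` review count is made up.

```python
# Shortened sample rows: [App, Category, Rating, Reviews].
rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],   # made-up review count
]

best_row = {}  # app name -> row with the highest number of reviews seen so far

for row in rows:
    name = row[0]
    n_reviews = float(row[3])
    # The strict `>` keeps the first row when several rows tie for the maximum.
    if name not in best_row or n_reviews > float(best_row[name][3]):
        best_row[name] = row

deduplicated = list(best_row.values())

for row in deduplicated:
    print(row)
```

One difference worth noting: the result here is ordered by the first appearance of each app name, whereas `android_clean` is ordered by the position of the row that was kept.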
    
    

## Data Checking ##

Using the `explore_data` function to check there are now no duplicates in the `android_clean` data set:

In [54]:
explore_data(android_clean, 0 , 5 , True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9659
Number of columns: 13


There are `9659` rows, just as expected

## Removing non-English apps ##

### Part 1 ###

Scrolling through the data sets shows that some app names are not in English.

For example:

In [59]:
print(ios[813][1])
print(ios[6731][1])

print('\n')

print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


The characters commonly used in English text (the letters of the English alphabet, the digits 0-9, and common punctuation) all have ASCII codes in the range 0-127.

We can therefore create a function `is_english()` which returns `True` only if a string consists solely of such characters:

In [82]:
def is_english(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

#example use of function
print('"Hello" is English:' , is_english('Hello'))
print("'爱奇艺' is English:" , is_english('爱奇艺'))

"Hello" is English: True
'爱奇艺' is English: False
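
For reference, Python 3.7 and later provide the built-in `str.isascii()`, which performs the same check as this first version of the function. A quick sketch comparing the two (the name `is_ascii_only` is used here only to avoid clashing with `is_english`):

```python
def is_ascii_only(text):
    # True when every character has a code point of 127 or below.
    return all(ord(ch) <= 127 for ch in text)


samples = ['Hello', '爱奇艺', '']

for s in samples:
    # Both approaches agree on every sample, including the empty string.
    print(repr(s), is_ascii_only(s), s.isascii())
```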


### Part 2 ###

The `is_english` function does what it was designed to do.

However, a problem arises when we consider strings such as:

- 'Docs To Go™ Free Office Suite'
- 'Instachat 😜'

The ™ symbol and emojis fall outside the ASCII range (their code points are above 127), so these names return `False` when passed to our function:


In [86]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(ord('™'))
print(ord('😜'))


False
False
8482
128540


To mitigate this problem we can alter the `is_english` function to return `False` only if there are more than 3 non-ASCII characters in the string.

Whilst not a perfect solution, it is accurate enough for the purpose of this analysis.

In [88]:
def is_english(string):
    number_non_english = 0
    for character in string:
        if ord(character) > 127:
            number_non_english += 1
    
    if number_non_english > 3:
         return False
    else:
        return True
    
#Test new function
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
True
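
To make the trade-off concrete, here is a sketch of the same idea with the threshold exposed as a parameter. The parameter name `max_non_ascii` and the helper `count_non_ascii` are assumptions added for illustration:

```python
def count_non_ascii(text):
    # Number of characters whose code point lies outside the ASCII range.
    return sum(1 for ch in text if ord(ch) > 127)


def is_english(text, max_non_ascii=3):
    # Same rule as above: reject only when MORE than `max_non_ascii` characters are non-ASCII.
    return count_non_ascii(text) <= max_non_ascii


print(is_english('Instachat 😜'))                     # True: one non-ASCII character
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False: 13 non-ASCII characters
print(is_english('爱奇艺'))                           # True: exactly 3, so it slips through
```

The last line shows the kind of case the heuristic misses: a very short non-English name has too few characters to exceed the threshold.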


Below we use the updated `is_english` function to further clean our data

In [95]:
android_english = []
ios_english = []

#For the google play data set
for app in android_clean:
    name = app[0]
    
    if is_english(name):
        android_english.append(app)

print('Android Data Set:')       
print(explore_data(android_english , 0 , 5 , True))
print('\n')

#For the IOS data set
for app in ios:
    name = app[0]
    
    if is_english(name):
        ios_english.append(app)

print('IOS Data Set:')       
print(explore_data(ios_english , 0 , 5 , True))

Android Data Set:
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9614
Number of columns: 13
None


IOS Data Set:
['2848822

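We now have 9614 Android apps. The iOS count is unchanged at 7197 because the iOS loop above checks `app[0]`, which is the `id` column; the app name (`track_name`) is at index 1, so no iOS apps were actually filtered out.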
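
In the iOS loop above, `name = app[0]` reads the `id` column (see `ios_header`), which is always numeric. A minimal sketch of the difference, using two shortened rows in which the second `id` is a made-up placeholder:

```python
def is_english(text):
    # Same rule as the updated function: at most 3 non-ASCII characters.
    return sum(1 for ch in text if ord(ch) > 127) <= 3


# Shortened rows: [id, track_name]. The second id is hypothetical.
ios_sample = [
    ['284882215', 'Facebook'],
    ['000000000', '爱奇艺PPS -《欢乐颂2》电视剧热播'],
]

# Filtering on index 0 tests the numeric id, so every row passes.
filtered_by_id = [app for app in ios_sample if is_english(app[0])]

# Filtering on index 1 tests the app name, which is what was intended.
filtered_by_name = [app for app in ios_sample if is_english(app[1])]

print(len(filtered_by_id), len(filtered_by_name))  # 2 1
```

Applying the name-based filter to the real data would change the iOS row counts and percentages reported in the rest of this notebook, which were produced with the `id`-based filter.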

## Removing Paid Apps

As our company is only interested in the free app market, the next step is to remove any apps that cost money to download.

Using a similar method to above:

In [110]:
android_final = []
ios_final = []

for app in ios_english:
    price = float(app[4])
    if price == 0:
        ios_final.append(app)
        
print('Number of IOS apps:' , len(ios_final))
print('\n')
        
for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
print('Number of Android apps:' , len(android_final))

Number of IOS apps: 4056


Number of Android apps: 8864


After this final stage of data cleansing we are left with 4056 iOS apps and 8864 Android apps.

# Data Analysis #

## Validation Strategy ##

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1) Build a minimal Android version of the app, and add it to Google Play.

2) If the app has a good response from users, we develop it further.

3) If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

We therefore need to create tables for both the Android and iOS data sets that show which genres and categories of apps are the most common.

We will use the `prime_genre` column of the App Store data set, and the `Genres` and `Category` columns of the Google Play data set.


### Frequency Table Function

We'll build two functions we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function we can use to display the percentages in a descending order

In [112]:
def freq_table(dataset , index):
    
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        
        if value in table:
            table[value] += 1 #If the value is already in the dictionary adds 1 to the frequency
            
        else:
            table[value] = 1 #If the value does not exist in dictionary, creates name with value of 1
    
    table_percentages = {} 
    
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages # This is the end of the first function


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0]) #This is the end of the second function
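
For comparison, the same two steps can be sketched with `collections.Counter` and a sort key. The function names `freq_table_counter` and `sorted_table` are made up here so they do not shadow the functions above, and the sample rows are illustrative only:

```python
from collections import Counter


def freq_table_counter(dataset, index):
    # Count how often each value occurs in the chosen column.
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # Convert the raw counts to percentages of all rows.
    return {value: count / total * 100 for value, count in counts.items()}


def sorted_table(dataset, index):
    table = freq_table_counter(dataset, index)
    # Sort the (value, percentage) pairs by percentage, largest first.
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Music']]

for value, percentage in sorted_table(sample, 1):
    print(value, ':', percentage)
# Games : 75.0
# Music : 25.0
```

The only behavioural difference from `display_table` is the order of exact ties: the tuple sort above breaks ties by name in reverse alphabetical order, while this sketch keeps tied values in the order they first appear.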
       
            

### IOS Genre Frequency Table

Using our newly defined functions to analyse the `prime_genre` column of the App Store data set:

In [117]:
print(ios_header) #The index for prime_genre found as 11
print('\n')

display_table(ios_final , 11 )

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


We can see that the majority (55.6%) of the free apps in our iOS data set are games.

### Android Genre Frequency Tables

Using the same function on the `Category` and `Genres` columns of the Android data set:

In [127]:
print(android_header) # Category index is 1 and Genres index is 9
print('\n')

print('Catagory Frequency Table:')
print('\n')
display_table(android_final , 1)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Catagory Frequency Table:


FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.936371841155234

The Android data set has a larger share of tools and business related apps than the App Store data set (~21% across `TOOLS`, `BUSINESS`, `LIFESTYLE` and `PRODUCTIVITY`).

Similarly for the `Genres` column:

In [129]:
print('Genres Frequency Table:')
print('\n')
display_table(android_final , 9)

Genres Frequency Table:


Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & 

This also shows that the Android app store has a larger proportion of "practical" apps as opposed to games or other fun related products.

**However**

Just because there are large numbers of games in the iOS data set and large numbers of tool related apps in the Android data set **does not** mean that these types of apps have a lot of downloads, just that they exist.

## Apps With The Most Downloads

Following on from above, we must now find which apps have the largest number of downloads.

For the Google Play data set, we can find this information in the `Installs` column, but for the iOS App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

Below, we calculate the average number of user ratings per app genre on the iOS App Store by:

- Isolating the apps of each genre.
- Summing up the user ratings for the apps of that genre.
- Dividing the sum by the number of apps belonging to that genre (not by the total number of apps).

In [136]:
genres_ios = freq_table(ios_final , 11 )

for genre in genres_ios:
    total = 0
    len_genre = 0
    
    for app in ios_final:
        genre_app = app[11]
        
        if genre_app == genre:
            number_of_ratings = float(app[5])
            total += number_of_ratings
            len_genre += 1
    
    average_number_of_ratings = total/len_genre
    print(genre, ':' , average_number_of_ratings)




Social Networking : 53078.195804195806
Photo & Video : 27249.892215568863
Games : 18924.68896765618
Music : 56482.02985074627
Reference : 67447.9
Health & Fitness : 19952.315789473683
Weather : 47220.93548387097
Utilities : 14010.100917431193
Travel : 20216.01785714286
Shopping : 18746.677685950413
News : 15892.724137931034
Navigation : 25972.05
Lifestyle : 8978.308510638299
Entertainment : 10822.961077844311
Food & Drink : 20179.093023255813
Sports : 20128.974683544304
Book : 8498.333333333334
Finance : 13522.261904761905
Education : 6266.333333333333
Productivity : 19053.887096774193
Business : 6367.8
Catalogs : 1779.5555555555557
Medical : 459.75
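
The nested loop above scans the whole data set once per genre. As a sketch under stated assumptions, the same averages can be accumulated in a single pass and printed in descending order, which makes the ranking easier to read. The helper name `average_by_group` and the sample rows are made up for illustration:

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    totals = defaultdict(float)  # group -> sum of values
    counts = defaultdict(int)    # group -> number of rows
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    # Divide each sum by the number of rows in that group only.
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows: [genre, number of ratings].
sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]

averages = average_by_group(sample, 0, 1)

for genre, average in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', average)
# Games : 200.0
# Music : 50.0
```

For the notebook's data the equivalent call would be `average_by_group(ios_final, 11, 5)`.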


The highest average numbers of user ratings are in `Reference`, `Music`, `Social Networking` and `Weather`.

However, these averages are likely skewed by a few very popular apps that hold a monopoly or oligopoly in their genre, e.g. Facebook, Instagram and LinkedIn for `Social Networking`.

Some popular genres with no obvious oligopoly are therefore:
- `Food & Drink`
- `Sports`
- `Book`

As our app is ideally going to be released on both the iOS and Android app stores, a similar analysis is conducted below for the Android market.

### Android Apps Frequency by Genre
For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [138]:
display_table(android_final , 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


The problem arises as we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. 

However, we don't need very precise data for our purposes — we only want to get an idea which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).
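
The conversion described above can be sketched as a small helper. The name `parse_installs` is an assumption for this example; the notebook performs the same replacements inline:

```python
def parse_installs(text):
    # Strip the trailing plus sign and the thousands separators, then convert.
    return float(text.replace('+', '').replace(',', ''))


print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('100+'))        # 100.0
print(parse_installs('0'))           # 0.0

# Without the clean-up, float() cannot parse the raw string.
try:
    float('1,000,000+')
except ValueError as error:
    print('Conversion failed:', error)
```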

In [143]:
categories_android = freq_table(android_final , 1)

for category in categories_android:
    total = 0
    len_category = 0
    
    for app in android_final:
        category_app = app[1]
        
        if category_app == category:
            number_installs = app[5]
            number_installs = number_installs.replace('+' , '')
            number_installs = number_installs.replace(',' , '')
            number_installs = float(number_installs)
            
            total += number_installs
            len_category += 1
            
    average_number_installs = total/len_category
    print(category , ':' , average_number_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

## Conclusion

As a reminder, the genres with the most promise in the iOS App Store were:

- `Food & Drink`
- `Sports`
- `Book`

Looking at the Android data set above, we can see that these three are also popular on the Android store.

However, a food and drink app will require some form of cooking knowledge to be successful, which is beyond the scope of this company. Similarly, a sports app will require knowledge of a sport rather than just relaying scores (as apps like Sky Sports already have a monopoly in this niche market).

Therefore an app within the `Book` (iOS) and `BOOKS_AND_REFERENCE` (Android) categories has the greatest potential across both markets.