# Analysing Mobile App Data 
## Dataquest Guided Project 2
### Introduction:
In this project we pretend to be data analysts working for a company that creates mobile applications. The company only builds free applications and depends on ad revenue for profit. The project's goal is to analyse data to help our developers understand what types of apps are likely to attract more users.

# Opening & Exploring the Data

We start by creating a function, `explore_data()`, to explore both datasets. It takes 3 (or 4) arguments:
1. `dataset`: the dataset as a list of rows (each row is itself a list), e.g. the rows read from `AppleStore.csv`,
2. `start`: the index in the dataset list to start from,
3. `end`: the index in the dataset list to stop at (the row at this index is not printed),
4. `rows_and_columns`: whether to print the number of rows and columns; optional, since the default value is `False`.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of Columns: ', len(dataset[0]))

## Extracting files

We use Python file handling and the `csv` module's `reader` to extract the datasets, then read the data as a `list`.
The header row is stored separately and removed from each dataset with slicing: `google[1:]` and `apple[1:]`.

In [2]:
google_file = open('googleplaystore.csv')
apple_file = open('AppleStore.csv')
from csv import reader
read_google = reader(google_file)
read_apple = reader(apple_file)
google = list(read_google)
google_header = google[0]
google = google[1:]
apple = list(read_apple)
apple_header = apple[0]
apple = apple[1:]
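
The cell above leaves both file handles open and relies on the platform's default text encoding. A minimal sketch of an alternative, assuming the files are UTF-8 encoded (the notebook does not state the encoding), wraps the read in a `with` block. The helper name `load_csv` is hypothetical and not part of the original project:

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return (header, rows); the file is closed on exit."""
    # newline='' is what the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline='') as csv_file:
        rows = list(reader(csv_file))
    return rows[0], rows[1:]
```

It could be used as `google_header, google = load_csv('googleplaystore.csv')`, which yields the same header/data split as the slicing above.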

Printing the first two rows of each dataset (the header rows were already removed above).

In [3]:
print('Google Playstore Data:\n')
explore_data(google, 0,2, rows_and_columns=True)
print('\nApple Store Data:\n')
explore_data(apple, 0,2, rows_and_columns=True)

Google Playstore Data:

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows:  10841
Number of Columns:  13

Apple Store Data:

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows:  7197
Number of Columns:  16


### Checking only headers to find valuable columns for later analysis

In [4]:
important_google = google_header[1:4] + google_header[5:10]
print(important_google)

['Category', 'Rating', 'Reviews', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres']


In [5]:
important_apple = apple_header[4:8] + apple_header[10:11]
print(important_apple)

['price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'cont_rating']


## Deleting Wrong Data:

### Deleting inaccurate data

Using the `del` statement to delete an inaccurate entry from the Google Playstore dataset; the entry is missing a value in the `Category` column.

In [6]:
#deleting 'Life Made WI-FI Touchscreen Photo Frame' because it's missing a category name
#run this cell only once: running it again deletes a different row
del google[10472]
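
A hard-coded index is fragile if the cell is re-run. A hedged sketch of locating such rows programmatically, assuming that a missing value shows up as a row shorter than the header (the helper `find_malformed_rows` and the toy data are hypothetical, not part of the original project):

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [index for index, row in enumerate(dataset) if len(row) != expected]

# Toy data (not taken from the real files): the second row lacks a category.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

bad_indexes = find_malformed_rows(toy_rows, toy_header)
print(bad_indexes)

# Delete from the end so that the earlier indexes stay valid.
for index in reversed(bad_indexes):
    del toy_rows[index]
```

Deleting in reverse order matters because removing a row shifts every later row down by one.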



## Removing Duplicate Entries:

### Checking for duplicates

Finding every instance of `Instagram`:

In [7]:
for app in google:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


### Removing duplicates

Creating two empty lists:
`duplicate_apps = []` to collect the name of every repeated entry in the Google Playstore dataset,
`unique_apps = []` to collect each app name the first time it appears.

A `for` loop iterates over the google dataset. If the name is already in `unique_apps`, it is appended to `duplicate_apps`; otherwise (`else`) it is appended to `unique_apps`.

This gives us 1181 duplicate entries and an expected number of 9659 unique applications.

In [8]:
duplicate_apps = []
unique_apps = []

for app in google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicated apps: ', len(duplicate_apps), '\n')
print('Examples of duplicated apps: ', duplicate_apps[:12], '\n')
print('Expected application number: ', len(google) - len(duplicate_apps))
    

Number of duplicated apps:  1181 

Examples of duplicated apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM'] 

Expected application number:  9659
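
Membership tests on a list scan the whole list, so the loop above gets slower as `unique_apps` grows. A minimal sketch of the same idea with a `set`, which offers fast membership checks (the function name `split_duplicates` and the toy rows are assumptions for illustration):

```python
def split_duplicates(dataset, name_index=0):
    """Return (set of unique names, list of repeated names)."""
    seen = set()
    duplicates = []
    for row in dataset:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)   # second or later occurrence
        else:
            seen.add(name)            # first occurrence
    return seen, duplicates

# Toy rows with only the name column.
toy = [['Box'], ['Slack'], ['Box'], ['Box']]
unique_names, duplicate_names = split_duplicates(toy)
print(len(unique_names), duplicate_names)
```

As in the original cell, an app that appears three times contributes two entries to the duplicates list.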


### Criterion for removal 

I used the suggested criterion for removing duplicates, which is to keep the entry with the highest number of reviews.

In this code:
1. create an empty dictionary, `reviews_max`,
2. start a `for` loop iterating over the Google Playstore dataset,
3. assign the first column (`row[0]`, 'App') to the variable `name`,
4. assign the fourth column (`row[3]`, 'Reviews'), converted to `float`, to the variable `n_reviews`,
5. if the name is already in the dict and the stored review count is lower than `n_reviews`, update the stored value to `n_reviews`,
6. if the name is `not in` the dict, simply add the name to the dict with its review count,
7. lastly, print the expected length (the dataset length minus the `1181` duplicates) and compare it to the length of the newly formed `reviews_max` dictionary.

In [9]:
reviews_max = {}

for row in google:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print('Expected length: ', len(google) - 1181, '\n')
print('Actual length: ', len(reviews_max))



Expected length:  9659 

Actual length:  9659


### Creating a new cleaned dataset

We use the newly formed dictionary `reviews_max` to clean our dataset.
1. create two new lists, one for the cleaned dataset and one to store app names,
2. loop through the google data with a `for` loop, reading the name `row[0]` and the reviews `row[3]`,
3. check that the row's review count is equal to (`==`) the maximum stored in `reviews_max` for that name, and that the name is not already in the `already_added` list,
4. if so, append the row to `google_clean` and the name to `already_added`; the second list makes sure that no duplicate enters `google_clean`, even when several rows share the maximum review count,
5. lastly, run the `explore_data()` function from earlier on the new and cleaned `google_clean` list.

In [10]:
google_clean = []
already_added = []

for row in google:
    name = row[0]
    n_reviews = float(row[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        google_clean.append(row)
        already_added.append(name)

explore_data(google_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9659
Number of Columns:  13
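
The two passes above (build `reviews_max`, then filter) can also be folded into one. This is only a sketch of that variation, not the method the assignment suggests; `keep_most_reviewed` is a hypothetical name, and the toy rows are shortened to four columns (the Instagram review counts come from the output printed earlier, the `Box` row is made up):

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # app name -> best row seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

toy = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Box', 'BUSINESS', '4.2', '159'],
]
cleaned = keep_most_reviewed(toy)
for row in cleaned:
    print(row)
```

One difference to be aware of: the result is ordered by each name's first appearance, whereas the two-pass version keeps every surviving row at its original position.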


## Removing Non-English Apps:

First we print some applications given in the assignment, to see the non-ASCII characters.

In [11]:
print(apple[813][1])
print(apple[6731][1])

print(google_clean[4412][0])
print(google_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


We then create a function that takes a string as an argument and checks whether any character's code point is over 127, the highest value in the ASCII range, which covers the characters commonly used in English text.

1. create the function and set the variable `non_ascii` to `0`,
2. loop over each character in the given string, and if its `ord()` value is higher (`>`) than `127`, increment `non_ascii` by `1`,
3. then write another `if` statement to accept every string with `3` or fewer non-ASCII characters, allowing for applications like `Instachat 😜`,
4. lastly, test the result with strings provided in the assignment.

In [12]:
def english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Docs To Go™ Free Office Suite'))
print(english('Instachat 😜'))
print(ord('™'))
print(ord('😜'))


True
False
True
True
8482
128540
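
For comparison, the same check can be written more compactly. This is a sketch with the same threshold of 3; the name `is_english` and the `max_non_ascii` parameter are introduced here for illustration only:

```python
def is_english(name, max_non_ascii=3):
    """True if the name has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
```

Making the threshold a parameter keeps the heuristic easy to tune; it remains a heuristic, since some non-English names use only a few non-ASCII characters.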


Now we need to filter out all the non-English apps from our two datasets:
1. create two empty lists for only English apps,
2. loop through the datasets, using the `english()` function created above to find names with 3 or fewer non-ASCII characters,
3. append those apps to the empty lists so they only contain qualified apps,
4. print 3 apps from each dataset, plus the number of rows and columns, to see the difference.

In [13]:
google_eng = []
apple_eng = []

for app in google_clean:
    name = app[0]
    if english(name):
        google_eng.append(app)

for app in apple:
    name = app[1]
    if english(name):
        apple_eng.append(app)

explore_data(google_eng, 0, 3, True)
print('\n')
explore_data(apple_eng, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of Columns:  13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+'

## Isolating the Free Apps:

The last bit of data cleaning is to isolate the free apps, so that the final datasets contain no inaccurate data, duplicates, non-English apps or paid apps.

1. we start again with two lists, one for each dataset,
2. loop through each dataset to find the price, and if the price is equal to `'0'` (Google Playstore) or `'0.0'` (Apple Store) we append the app to the new list, giving us a new number of apps for each dataset:

Google Playstore = 8864

AppleStore = 3222

In [14]:
google_apps = []
apple_apps = []

for app in google_eng:
    price = app[7]
    if price == '0':
        google_apps.append(app)
        
for app in apple_eng:
    price = app[4]
    if price == '0.0':
        apple_apps.append(app)
      
    
print(len(google_apps))
print(len(apple_apps))

8864
3222
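
The two stores spell a zero price differently (`'0'` versus `'0.0'`), which is why the cell compares against two different strings. A hedged sketch of a single check that compares the numeric value instead; it assumes that paid prices may carry a leading `$` (the value `'$4.99'` below is a made-up example, and `is_free` is a hypothetical helper):

```python
def is_free(price):
    """True if the price string represents zero, e.g. '0' or '0.0'."""
    # Assumption: a currency sign, if present, is a leading '$'.
    return float(price.replace('$', '')) == 0.0

for price in ['0', '0.0', '$4.99']:
    print(price, '->', is_free(price))
```

With such a helper, both loops could share the same condition, e.g. `if is_free(app[7])` for Google and `if is_free(app[4])` for Apple.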


## Most Common Apps by Genre:

Our end goal with this project is to reach as many users as possible on both Google Play and the Apple Store, so we need to identify app profiles that succeed in both, giving our company a higher chance of success.

1. We start by creating a function `freq_table` taking two arguments, `dataset` and `index`; the function can be used on both datasets later, saving us some time,
2. create an empty dictionary called `frequence` to contain the results, and a counter `total` to count the rows,
3. loop through the dataset, and for each iteration increment (`+= 1`) the `total` value and read the value at the given `index` of the row,
4. if the value is already in the `frequence` table, increment its count by 1; if not, add it with a count of 1,
5. create a new empty dictionary, `freq_percentage`, to hold each key's count as a percentage of the total,
6. loop through the `frequence` dict and, for each `key`, calculate the percentage, round it with the `round()` function to make the results easier to read, and store it in `freq_percentage`, before finally returning the dict,
7. then create a function called `table_display` to display the data from `freq_table`, taking dataset and index as arguments,
8. declare a variable for the frequency table, and an empty list for the display,
9. loop through the table and add each pair to the `display` list as a `(value, key)` tuple,
10. lastly, sort the display list in descending order with the `sorted()` function, and print each entry in a `for` loop.

In [15]:
def freq_table(dataset, index):
    frequence = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in frequence:
            frequence[value] += 1
        else:
            frequence[value] = 1
    
    freq_percentage = {}
    for key in frequence:
        percentage = (frequence[key] / total) * 100
        freq_percentage[key] = round(percentage, 2)
        
    return freq_percentage

def table_display(dataset, index):
    table = freq_table(dataset, index)
    display = []
    for key in table:
        key_val = (table[key], key)
        display.append(key_val)
        
    table_sorted = sorted(display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
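
The standard library's `collections.Counter` does the counting step for us. The following is a sketch of an equivalent frequency table on toy data, not a replacement for the functions above; the name `freq_table_counter` and the toy rows are assumptions:

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows}, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: round(count / total * 100, 2)
            for key, count in counts.most_common()}

# Toy rows: [name, genre]
toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_counter(toy, 1))
```

Because `most_common()` already orders the keys by count, a separate display function that sorts the entries would not be needed with this variant.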
        

Testing the `table_display` function - with Apple Store data from index `-5` (prime_genre)

In [16]:
table_display(apple_apps, -5)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


On the Apple Store, over 50% (58.16%) of the free applications with English names are games. This gives the impression that applications need some aspect of fun to increase their chance of success.

Note: Given that 'Games' is the most common genre, it might also be the hardest market, and many of the games might not be very popular.

Testing the `table_display` function - with Google Play data from index `1` (Category)

In [17]:
table_display(google_apps, 1)

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


Testing with index `-4` (Genre)

In [18]:
table_display(google_apps, -4)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

As we can see, the Google Play data is distributed very differently from the Apple Store data: the entertainment share is lower, and there are more applications operating as 'tools'. The 'FAMILY' category, which accounts for almost 19% of Google Play content, is mostly comprised of games, bringing the total share of games to around 30%, but that is still nowhere near the Apple Store's 58%.

Google Play's `Genres` column seems to be a sub-category of its `Category` column, and since this project takes a 'larger-picture' approach, we will focus on `Category` instead of `Genres`.

## Most Popular Apps by Genre - Apple Store:

To find out which applications are the most popular, we can look at the number of installs for each genre. The Google Play dataset has the column `Installs` for this; for the Apple Store dataset we use the total number of user ratings, `rating_count_tot`, as a proxy.

For the Apple Store dataset we calculate the average number of user ratings per genre:
1. we use the `freq_table()` function from earlier to get the genres, then start a `for` loop over them with a counter for total ratings and one for genre length,
2. we start a nested `for` loop where the `genre_app` variable is the value at row index `-5` (`prime_genre`),
3. if `genre_app` matches the current genre, we convert the rating count (index `5`, `rating_count_tot`) to `float`, add it to our `total` counter and increment the genre length by 1,
4. we create a variable for the average, the total ratings divided (`/`) by the genre length, and store it in a `res` variable rounded to 2 decimal places,
5. lastly we print them out as `key : value` pairs.

In [19]:
genre_apple = freq_table(apple_apps, -5)

for genre in genre_apple:
    total = 0
    genre_len = 0
    for app in apple_apps:
        genre_app = app[-5]
        if genre_app == genre:
            ratings = float(app[5])
            total += ratings
            genre_len += 1
    
    avg_ratings = total / genre_len
    res = round(avg_ratings, 2)
    print(genre, ':', res)

Social Networking : 71548.35
Photo & Video : 28441.54
Games : 22788.67
Music : 57326.53
Reference : 74942.11
Health & Fitness : 23298.02
Weather : 52279.89
Utilities : 18684.46
Travel : 28243.8
Shopping : 26919.69
News : 21248.02
Navigation : 86090.33
Lifestyle : 16485.76
Entertainment : 14029.83
Food & Drink : 33333.92
Sports : 23008.9
Book : 39758.5
Finance : 31467.94
Education : 7003.98
Productivity : 21028.41
Business : 7491.12
Catalogs : 4004.0
Medical : 612.0
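
The nested loop above walks the whole dataset once per genre. A minimal sketch of a single-pass alternative that accumulates totals and counts per key; `average_by_key` is a hypothetical helper, and the toy rows use a simplified three-column layout rather than the real one:

```python
from collections import defaultdict

def average_by_key(dataset, key_index, value_index):
    """Return {key: average of the numeric column}, rounded to 2 decimals."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        key = row[key_index]
        totals[key] += float(row[value_index])
        counts[key] += 1
    return {key: round(totals[key] / counts[key], 2) for key in totals}

# Toy rows: [name, rating_count, genre]
toy = [
    ['Geocaching®', '12811', 'Navigation'],
    ['Railway Route Search', '5', 'Navigation'],
    ['WWDC', '762', 'Reference'],
]
print(average_by_key(toy, 2, 1))
```

On the real data the call would presumably look like `average_by_key(apple_apps, -5, 5)`.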


### Findings:

The calculations above show that `Navigation` has the highest average number of user ratings, but a closer look at the genre shows that `Waze` and `Google Maps` hold most of the users, meaning stiff competition.

The same holds for `Social Networking`, the third most rated genre, and `Music` in 4th: these genres all have large user-bases, but they are divided among a few large companies, which again means stiff competition.

In [20]:
for app in apple_apps:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


At first glance the `Reference` genre might seem to be another case of this, with the `Bible` and the `Dictionary` having a large user-base. However, since games and other 'fun' apps flood the Apple Store, and the `Education` genre only amounts to around 4% (3.66%), and given that the Google Playstore's `TOOLS` genre is at 

In [21]:
for app in apple_apps:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


## Most Popular Apps by Genre - Google Play:

As previously stated, the Google Playstore dataset has a column called `Installs`, which we can use to find the most popular apps by category. First we print the frequency of each value in the `Installs` column to see its string values before doing our calculations.

At first glance we can see a problem: the values are open-ended buckets rather than exact counts. After `100,000+`, for example, the next value is `500,000+`, which leaves a lot of room in between, but the data will have to do.

In [22]:
table_display(google_apps, 5)

1,000,000+ : 15.73
100,000+ : 11.55
10,000,000+ : 10.55
10,000+ : 10.2
1,000+ : 8.39
100+ : 6.92
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.51
10+ : 3.54
500+ : 3.25
50,000,000+ : 2.3
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05
0 : 0.01


Now, to the calculations:
1. we create a variable by calling the frequency function from earlier, and start a `for` loop with two counters, one for total installs and one for category length,
2. we start a nested loop over the apps, and if the app's category `app_cat` matches the current category, we retrieve the value of the `Installs` column for this app,
3. then we use the `replace()` string method to strip the commas and the plus sign, turning strings like `100,000+` into digits only,
4. we then convert the result from `str` to `float` and add it to the `total` counter, then increment the category length by 1,
5. lastly, we find the average installs by dividing the total installs by the category length, and convert the result to `int` to make it more readable (this drops the decimals).

In [23]:
google_cat = freq_table(google_apps, 1)

for category in google_cat:
    total = 0
    cat_len = 0
    for app in google_apps:
        app_cat = app[1]
        if app_cat == category:
            installs = app[5]
            installs = installs.replace(',', '')
            installs = installs.replace('+', '')
            total += float(installs)
            cat_len += 1
    avg_installs = total / cat_len
    res = int(avg_installs)
    print(category, ':', res)

ART_AND_DESIGN : 1986335
AUTO_AND_VEHICLES : 647317
BEAUTY : 513151
BOOKS_AND_REFERENCE : 8767811
BUSINESS : 1712290
COMICS : 817657
COMMUNICATION : 38456119
DATING : 854028
EDUCATION : 1833495
ENTERTAINMENT : 11640705
EVENTS : 253542
FINANCE : 1387692
FOOD_AND_DRINK : 1924897
HEALTH_AND_FITNESS : 4188821
HOUSE_AND_HOME : 1331540
LIBRARIES_AND_DEMO : 638503
LIFESTYLE : 1437816
GAME : 15588015
FAMILY : 3695641
MEDICAL : 120550
SOCIAL : 23253652
SHOPPING : 7036877
PHOTOGRAPHY : 17840110
SPORTS : 3638640
TRAVEL_AND_LOCAL : 13984077
TOOLS : 10801391
PERSONALIZATION : 5201482
PRODUCTIVITY : 16787331
PARENTING : 542603
WEATHER : 5074486
VIDEO_PLAYERS : 24727872
NEWS_AND_MAGAZINES : 9549178
MAPS_AND_NAVIGATION : 4056941
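
The string clean-up in step 3 is reused in later cells, so it can be worth isolating. A small sketch, with `parse_installs` as a hypothetical helper name; it treats each bucket such as `100,000+` as its lower bound, which is the same simplification the cell above makes:

```python
def parse_installs(installs):
    """Turn an Installs string such as '100,000+' into an integer."""
    return int(installs.replace(',', '').replace('+', ''))

for raw in ['0', '100,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Returning an `int` rather than a `float` keeps the values exact and avoids the later `int()` conversion for display.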


We find that, on average, communication apps have the most installs with 38456119, but this could be another case where a few large apps hold the majority of the user-base, so we check with the code below:

1. if an app in the `COMMUNICATION` category has an `Installs` value of `1,000,000,000+`, `500,000,000+` or `100,000,000+`, we print it. The result is a long list of apps, meaning a large portion of the user-base is already taken.

In [24]:
for app in google_apps:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                     or app[5] == '500,000,000+'
                                     or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])
        

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

Next we look at how communication apps with under 100 million installs are doing, and we find that the average is reduced by a lot (roughly tenfold, from 38456119 to about 3603485).

In [25]:
under_100m = []

for app in google_apps:
    installs = app[5]
    installs = installs.replace(',', '')
    installs = installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(installs) < 100000000):
        under_100m.append(float(installs))
        
sum(under_100m) / len(under_100m)

3603485.3884615386

Now, since we want our potential app to suit both the Google Playstore and the Apple Store, we can look at how `BOOKS_AND_REFERENCE`, which is equivalent to the Apple Store's `Reference` genre, is doing.

Thankfully, this category is fairly popular, with `8767811` installs on average.

So, let's see how big the top players are, and how much of the market they occupy, with the same method as for `COMMUNICATION`:

1. first we print the name and install count of every app in the category, before we start a new loop filtering on 1b, 500m, 100m and 50m.

In [26]:
for app in google_apps:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

We find that this category has fewer big players than the communication category, which makes it a better fit for our purpose.

In [27]:
for app in google_apps:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


Then we check the apps with between 1 million and 50 million installs, and find that a lot of applications are doing fairly well, meaning this market might be a good fit.

In [29]:
for app in google_apps:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

# Conclusion:

In this project we used Python to analyse data about how free, English-only mobile apps are doing in the App Store and Google Playstore, with the goal of finding an application profile with the highest chance of success for our company's app developers.

In conclusion, the book and reference categories might be the best fit for our purpose. They provide markets that aren't flooded with content, unlike the 'Games' section of the Apple Store, and they have a record of smaller applications finding success with a more evenly shared user-base than, for example, `Social Networking`, where the majority of the user-base is owned by a few big companies.

So, I would recommend the `BOOKS_AND_REFERENCE` (Google Playstore) and `Reference` (Apple Store) categories to o