# Profitable App Profiles for Apple Store & Google Play

This project is aimed at investigating the profitability of the different applications found on the Apple Store and Google Play. The profitability of the applications will be determined via how much ad-revenue the applications can generate via in-app ads. The ad-revenue will be dependent on the number of users an application has, and it will be the primary marker for the profitability of the application. 

The goal of the project will be to increase the understanding of which types of applications are likely to attract more users via analyzing the data taken from Apple Store and Google Play. A sce

In [1]:
from csv import reader

#Open, read and store AppleStore.csv as a list of list
open_file_appStore = open('AppleStore.csv', encoding='utf8')
read_file_appStore = reader(open_file_appStore)
apps_data_apple = list(read_file_appStore)

#Open, read and store googleplaystore.csv as a list of list
open_file_googlePlayStore = open('googleplaystore.csv', encoding='utf8')
read_file_googlePlayStore = reader(open_file_googlePlayStore)
apps_data_google = list(read_file_googlePlayStore)

We define the explore_data function below, which allows us to present the data in a more readable fashion. 

In [2]:
#Defining the 'Explore funtion'
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') 
        
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns', len(dataset[0]))


After having stored the datasets, and having defined a exploration function: We are able to observe a part of the data set in order to get a better understanding of how it is structured. 

In [3]:
print(explore_data(apps_data_apple,1,5,True))
print(explore_data(apps_data_google,1,5,True))

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows:  7198
Number of columns 17
None
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Desig

In [4]:
print('Columns in AppleStore.csv: ', apps_data_apple[0])
print('\n')
print('Columns in googleplaystore.csv: ', apps_data_google [0])

Columns in AppleStore.csv:  ['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Columns in googleplaystore.csv:  ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


### Variables
Above we have the different columns from the two datasets. 
The columns describe the variables: 

Apple Store: 

"id" : App ID

"track_name": App Name

"size_bytes": Size (in Bytes)

"currency": Currency Type

"price": Price amount

"rating_count_tot": User Rating counts (for all version)

"rating_count_ver": User Rating counts (for current version)

"user_rating" : Average User Rating value (for all version)

"user_rating_ver": Average User Rating value (for current version)

"ver" : Latest version code

"cont_rating": Content Rating

"prime_genre": Primary Genre

"sup_devices.num": Number of supporting devices

"ipadSc_urls.num": Number of screenshots showed for display

"lang.num": Number of supported languages

"vpp_lic": Vpp Device Based Licensing Enabled

Google play: 

"Application name": The name of the application

"Category": Category the app belongs to

"Rating": Overall user rating of the app (as when scraped)

"Reviews": Number of user reviews for the app (as when scraped)

"Size" Size of the app (as when scraped)

"Installs": Number of user downloads/installs for the app (as when scraped)

"Type" Paid or Free

"Price": Price of the app (as when scraped)

"Content Rating": Age group the app is targeted at - Children / Mature 21+ / Adult

"Genres": An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to

---
Out of these columns we can see some variables that are of interest for analysis: Total rating count, Average user rating, Genre, Content Rating, number of installs. 

   - [Apple Store Dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps)
   -[Google Play Dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) 


## Pre-processing

This section will focus on the pre-processing of the data. Were our aim is to extract the applications which are written in english and are ad-drive (i.e free). We will furthermore filter out any duplicate, missing or incorrect data from the two datasets. 

#### Duplicates & missing values

The section below will be dedicated to identifying duplicates in the two datasets. We will furthermore remove data points with missing or incorrect values.

In [5]:
#Identifying two faulty data points mentioned in documentation

print(apps_data_google[10473])
print(apps_data_google[9149])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']


In [6]:
del apps_data_google[10473]
#del apps_data_google[9149]

In [7]:
#Identifying duplicate values in Apple Store Dataset 
ios_unique_apps = []
ios_duplicate_apps = []
iteration = 0
for row in apps_data_apple:
    iteration += 1
    app_name = row[1]
    if app_name in ios_unique_apps:
        ios_duplicate_apps.append(app_name)
    else:
        ios_unique_apps.append(app_name)

print(len(ios_unique_apps))
print(len(ios_duplicate_apps))
print(ios_duplicate_apps)



7198
0
[]


In the code above we can see that two apps are flagged as duplicates. But these are separate apps with different developers according to [this discussion](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409) 

It is worth noting however that there are duplicate values in the Google Play dataset, which we can observe via the code snippet below. The code lists all applications that have duplicate names from the dataset, counts how many they are, and give a few example of duplicate entries. 

In [8]:
#Identifying duplicate values in Google Play Dataset 
duplicate_apps = []
unique_apps = []

for app in apps_data_google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps:', len(duplicate_apps))
print('Examples of duplicate apps:', duplicate_apps[:3])




Number of duplicate apps: 1181
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business']


We can observe that there are over 1000 duplicate entries in the Google Play dataset, and we are also given some examples of these duplicates. It is a necessity to remove the duplicates from the dataset in order to avoid a skewed analysis later on. But it is also worth noting that the duplicates might differ in quality, which necessitates a method for choosing which duplicates to keep. This will be acomplished by selecting the data point which has the highest amount of reviews, seeing as that one should be most up-to-date. 

In [9]:
#Example of duplicate data in google play store
for app in apps_data_google:
    name = app[0]
    if name == 'Instagram':
        print(app)


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [10]:
reviews_max = {}

for app in apps_data_google[1:]:
    name = app[0]
    #if 'M' in app[3]:
     #   str_n_reviews = app[3].replace('M','')
      #  n_reviews = float(str_n_reviews)
       # n_reviews *= 1000000
    #else:
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(apps_data_google))
print(len(apps_data_google[1:]) - len(duplicate_apps) )
print(len(reviews_max))

10841
9659
9659


The code above identifies the duplicated entries, and thereafter returns the one with the highest amount of rewievs. This is controlled via a couple of print statements. Below is the code used to delete the duplicated entries.

In [11]:
android_clean = []
already_added = []

for app in apps_data_google[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

In [12]:
def eng_character(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
   
    if non_ascii > 3:
        return False
    else:
        return True

print('Is "Instagram" a english word?: ', eng_character('Instagram'))
print('Is "爱奇艺PPS -《欢乐颂2》电视剧热播" a english word?: ', eng_character('爱奇艺PPS -《欢乐颂2》电视剧热播'))

Is "Instagram" a english word?:  True
Is "爱奇艺PPS -《欢乐颂2》电视剧热播" a english word?:  False


This project will be focused on the profitability of english-speaking apps. Which is why we have developed and tested a function above, which checks if the name of an app is written in english or not. The function is not perfect but should be satisfactory for now. In the code below we will implement the function on our two data_sets, and thereafter we will extract the english-speaking apps. 

In [13]:
eng_android = []
eng_apple = []
non_english_g = []
non_english_a = []

for app in android_clean:
    name = app[0]
    if eng_character(name):
        eng_android.append(app)
    else:
        non_english_g.append(app)


for app in apps_data_apple:
    name = app[1]
    if eng_character(name):
        eng_apple.append(app)
    else:
        non_english_a.append(app)

explore_data(eng_android, 0, 5, True)
print('\n')
explore_data(eng_apple, 1, 5, True) #different index value due to header being included

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows:  9614
Number of columns 13


['1', '281656475', 'PAC-MAN Premium', '1007882

Above we have som examples of the cleaned data, where we:
- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps

We will now focus on filtering based by price. The target data is apps that are free, seeing as they typically hold ad-revenue as their main source of income, and are most applicable to our project. 


In [14]:
android_final = []
apple_final = []

android_charged = []
apple_charged = []
for app in eng_android:
    price = app[7]
    if price == '0':
        android_final.append(app)
    
for app in eng_apple[1:]:
    price = app[4]
    if price == '0.0':
        apple_final.append(app)

print(len(android_final))
print(len(apple_final))

8864
0


## Analysing the data

The data pre-processing is complete, and we have a data on a collection on apps that are relevant towards the target of presenting a profit profile for the application. This section will be focused on analysing the data, with the goal of having a clear understanding of what drives an applications profitability. This will then enable us to propose a plan for an profitable app as well as a validation strategy for said app. We will hereafter utilize the following validation strategy for building a profitable app: 
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Let us begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set. 

In [15]:
# Function for generating frequency tables with 
# percentages
def freq_table(dataset, index):
    frequency_table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in frequency_table:
            frequency_table[value] += 1
        else:
            frequency_table[value] = 1
    
    percentage_frequency = {}
    
    for key in frequency_table:
        percentage = (frequency_table[key]/total)*100
        percentage_frequency[key] = percentage
    
    return percentage_frequency

# Function for displaying the frequency tables
# in a read able manner. 

def display_table (dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

def printdict (dataset):
    diction = dataset
    a_list = []
    key = ''
    value = 0
    for iteration in diction:
        key = iteration[0]
        value = iteration[1]
        print(key, ': ', value)


### Apple Frequency Table (prime_genre)

Below we have a frequency table, in percentages, based on the relative frequencies of different instances of the "prime_genre" variable in the Apple Store dataset. This table presents us information regarding how prevalent the different genres of applications are on the apple store. 

The first finding of note is that the most common genre, games, consists of more than half of the applications. It represents 58% of the application found at the Apple store and it is by far the most popular genre. We can furthermore see that the frequencies of the genres are very imbalanced seeing as the next common genre only represents 8% of the applications. 

Most of the application seems to be aimed at entertainment as a whole, with the two most popular genres being 'Games' and 'Entertainment'. There are however signs of the there being a prevalent amount of applications designed for pratical purposes as well with genres such as Photo & Video (5%), education (4%) and shopping (3%) being quite common. They are however not comparable to the scope of entertainment-aligned applications. 

It is difficult to arrive at an recommended app profile with this information alone. The large amount of entertainment applications would suggest that the demand for those are higher amongst Apple customers, with Games representing the largest market, but there are no indicators for how popular specific applications are withing these genres. It is entirely possible that one of the less common genres hosts some of the more popular applications. 

In [16]:
print(display_table(apple_final, -5))

None


### Android Frequency Table (Category)

The previous section covered an frequency table over the genres in the apple store. The table below shows the frequencies of the different categories in the Google Play store. 

Here we can see that the data is less imbalanced as compared to the Apple Store. The most common genre, "Family" only represents 19% of the applications, instead of the representing over half of the applications such as "Games" did for the Apple store. There is a more even distribution amongs the frequencies in the Google Play Store. 

The landscape of the apps seem to differ from the Apple Store as well. In the Apple store there was an overwhelming amount of applications aimed at entertainment. The apps found in the Google Play store seems to be more aimed at utility, even though entertainment focused apps still are common. It is worth noting however that the most common genre, "Family", mainly consists on games aimed at younger audiences. Further investigation of the "Genres" table suggests that most apps are focused on utility however. Where the most common genre is "Tools" with a frequency of 8% and many other popular genres such as: "Education", "Business" and "Productivity".

There is currently not enough information to suggest an recommended app profile for the landscape, due to similar reasons that it was impossible to recommend one for the Apple store. 

In [17]:
print(display_table(android_final, 1))

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

### Android Frequency Table (Genre) 

In [18]:
print(display_table(android_final, -4))

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

### Popular Genres & Genre Recommendation (Apple)
The information presented above is not enough to generate a recommended app profile, and additional information is required. One of the most tangible pieces of information we are missing is how popular apps within an genre are. Which is why we are going to investigate which genres are most popular by users below, we are thereafter going to be able to suggest a genre for our recommended App Profile. 

In [19]:
prime_genres = freq_table(apple_final, -5)

for genre in prime_genres:
    total = 0
    len_genre = 0
    for app in apple_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_users = total/len_genre
    print(genre,': ', avg_n_users)
            

Navigation is the most popular genre on average, but this is heavily influenced by apps such as Google maps and Waze. Which are substantially more popular than the other alternatives and drive up the average user amount of the app. 

In [20]:
for apps in apple_final:
    if apps[-5] == 'Navigation':
        print(apps[1], ': ', apps[5])

Social Networking, Music and Weather genre are also similarily popular, where Social Networking might be explained by popular social media such as Facebook, Pinterest and Skype. It is also worth noting that other popular social media apps might not fall under Social Networking. Instagram, for example, is categorized as being under the genre Photo & Video. 

In [21]:
for apps in apple_final:
    if apps[-5] == 'Social Networking':
        print(apps[1], ': ', apps[5])   

In [22]:
for apps in apple_final:
    if apps[1] == 'Instagram':
        print(apps[-5])

The music genres popularity can also be explained by a few big actors such as Pandora and Spotify, and the in-app time typically spent by users is short (which mitigates the benefits of an ad-driven app).

In [23]:
for apps in apple_final:
    if apps[-5] == 'Music':
        print(apps[1], ': ', apps[5])   

The user distribibution is more even when it comes to the weather apps (even though there is a discrepancy between the giant apps and the runner-ups). But the main issue with the genre is that users typically do not spend a long time on weather apps. A critical point when the main source of income is ad-revenue. 

In [24]:
for apps in apple_final:
    if apps[-5] == 'Weather':
        print(apps[1], ': ', apps[5])   

The reference genres user count seems to be due to a few popular app books such as the Bible and Dictionaries. The genre is one of the most popular however, and it would seem conceivable that the market could accomodate a new popular app. A recommended app profile could consist of making an app for a popular book (not already covered in the genre), and attract customers by including bonus features and content beyond the actual book. The apps would in that case be open for long periods of time, and the app users would therefore generate more ad-revenue than they would with an an, for example, weather app (which users typically only visit briefly).

In [25]:
for apps in apple_final:
    if apps[-5] == 'Reference':
        print(apps[1], ': ', apps[5])   

### Popular Genres & Genre Recommendation (Google Play)

We are now moving on to the Google Play store, where we are going to implement a similar approach in order to investigate the most popular genres. The investigated variable will be the number of installs however, instead of the number of user ratings. 

Note: The number of installs variable is not very precise (with intervalls such as: 100+, 1 000+. 10 000+, 100 000+ etc.) but the accuracy should be sufficient for finding the most popular genre, a value like 100 000+ will be considered equivalent to 100 000. 



In [26]:
freq_table_category = freq_table(android_final, 1)
for category in freq_table_category:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_installs = total/len_category
    print(category, ': ', avg_installs)


ART_AND_DESIGN :  1986335.0877192982
AUTO_AND_VEHICLES :  647317.8170731707
BEAUTY :  513151.88679245283
BOOKS_AND_REFERENCE :  8767811.894736841
BUSINESS :  1712290.1474201474
COMICS :  817657.2727272727
COMMUNICATION :  38456119.167247385
DATING :  854028.8303030303
EDUCATION :  1833495.145631068
ENTERTAINMENT :  11640705.88235294
EVENTS :  253542.22222222222
FINANCE :  1387692.475609756
FOOD_AND_DRINK :  1924897.7363636363
HEALTH_AND_FITNESS :  4188821.9853479853
HOUSE_AND_HOME :  1331540.5616438356
LIBRARIES_AND_DEMO :  638503.734939759
LIFESTYLE :  1437816.2687861272
GAME :  15588015.603248259
FAMILY :  3695641.8198090694
MEDICAL :  120550.61980830671
SOCIAL :  23253652.127118643
SHOPPING :  7036877.311557789
PHOTOGRAPHY :  17840110.40229885
SPORTS :  3638640.1428571427
TRAVEL_AND_LOCAL :  13984077.710144928
TOOLS :  10801391.298666667
PERSONALIZATION :  5201482.6122448975
PRODUCTIVITY :  16787331.344927534
PARENTING :  542603.6206896552
WEATHER :  5074486.197183099
VIDEO_PLAYERS 

We can see from the table above that there are som categories which stick out in terms of popularity. Such as: Communication, Games, Social and Books & References. The largest by far is the Communication category, but this is heavily influenced by a few giants such as Facebook and WhatsApp with over a billion installs.  


In [27]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0],': ', app[5])

WhatsApp Messenger :  1,000,000,000+
imo beta free calls and text :  100,000,000+
Android Messages :  100,000,000+
Google Duo - High Quality Video Calls :  500,000,000+
Messenger – Text and Video Chat for Free :  1,000,000,000+
imo free video calls and chat :  500,000,000+
Skype - free IM & video calls :  1,000,000,000+
Who :  100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji :  100,000,000+
LINE: Free Calls & Messages :  500,000,000+
Google Chrome: Fast & Secure :  1,000,000,000+
Firefox Browser fast & private :  100,000,000+
UC Browser - Fast Download Private & Secure :  500,000,000+
Gmail :  1,000,000,000+
Hangouts :  1,000,000,000+
Messenger Lite: Free Calls & Messages :  100,000,000+
Kik :  100,000,000+
KakaoTalk: Free Calls & Text :  100,000,000+
Opera Mini - fast web browser :  100,000,000+
Opera Browser: Fast and Secure :  100,000,000+
Telegram :  100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer :  100,000,000+
UC Browser Mini -Tiny Fast Private & Secure :  

The Category would be much less impressive however if we excluded these outliers:

In [28]:
under_100m = []
for app in android_final:
    app_c = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace(',','')
    n_installs = n_installs.replace('+','')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100m.append(float(n_installs))
adj_avg = sum(under_100m)/len(under_100m)
diff = 38456119.167247385 - adj_avg
print('Adjusted average: ', adj_avg)
print('Difference: ', diff) 


Adjusted average:  3603485.3884615386
Difference:  34852633.77878585


This adjusted average shows us that there probably are not very many gains from competing with these giant apps, seeing as the average amount of installations for other apps is so low. The runner up to communication in terms of popularity are Video Players, where we can see a similar pattern: 

In [29]:
for app in android_final:
    if app[1] == 'VIDEO_PLAYERS' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0],': ', app[5])
under_100m = []
for app in android_final:
    app_c = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace(',','')
    n_installs = n_installs.replace('+','')
    if (app[1] == 'VIDEO_PLAYERS') and (float(n_installs) < 100000000):
        under_100m.append(float(n_installs))
adj_avg = sum(under_100m)/len(under_100m)
diff = 38456119.167247385 - adj_avg
print('\n')
print('Adjusted average: ', adj_avg)
print('Difference: ', diff) 


YouTube :  1,000,000,000+
Motorola Gallery :  100,000,000+
VLC for Android :  100,000,000+
Google Play Movies & TV :  1,000,000,000+
MX Player :  500,000,000+
Dubsmash :  100,000,000+
VivaVideo - Video Editor & Photo Movie :  100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera :  100,000,000+
Motorola FM Radio :  100,000,000+


Adjusted average:  5544878.133333334
Difference:  32911241.033914052


So that is also a candidate that is excluded from contention. The main issue with many of the more popular genres are that they are very skewed and misleading. There are categories such as games, but these can run the risk of being over-saturated as we saw in the Apple Store analysis. A better alternative could be to investigate if Books & Reference is a valid genre to recommend, seeing as we then could develop a cohesive app recommendation profile between the Apple Store and the Google Play Store. 

The first step in that case should be to investigate the most popular apps from the Genre: 

In [30]:
for app in android_final:
    app_c = app[1]
    if app_c == 'BOOKS_AND_REFERENCE':
        print(app[0],': ', app[5])
        
    

E-Book Read - Read Book for free :  50,000+
Download free book with green book :  100,000+
Wikipedia :  10,000,000+
Cool Reader :  10,000,000+
Free Panda Radio Music :  100,000+
Book store :  1,000,000+
FBReader: Favorite Book Reader :  10,000,000+
English Grammar Complete Handbook :  500,000+
Free Books - Spirit Fanfiction and Stories :  1,000,000+
Google Play Books :  1,000,000,000+
AlReader -any text book reader :  5,000,000+
Offline English Dictionary :  100,000+
Offline: English to Tagalog Dictionary :  500,000+
FamilySearch Tree :  1,000,000+
Cloud of Books :  1,000,000+
Recipes of Prophetic Medicine for free :  500,000+
ReadEra – free ebook reader :  1,000,000+
Anonymous caller detection :  10,000+
Ebook Reader :  5,000,000+
Litnet - E-books :  100,000+
Read books online :  5,000,000+
English to Urdu Dictionary :  500,000+
eBoox: book reader fb2 epub zip :  1,000,000+
English Persian Dictionary :  500,000+
Flybook :  500,000+
All Maths Formulas :  1,000,000+
Ancestry :  5,000,00

There are some very popular apps such as WattPad and Google Play Books, which could potentially skew the dataset: 

In [31]:
for app in android_final:
    app_c = app[1]
    if app_c == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                           or app[5] == '500,000,000+'
                                           or app[5] == '100,000,000+'):
        print(app[0],': ', app[5])

Google Play Books :  1,000,000,000+
Bible :  100,000,000+
Amazon Kindle :  100,000,000+
Wattpad 📖 Free Books :  100,000,000+
Audiobooks from Audible :  100,000,000+


The amount of extremely popular apps does not seem to be as extensive as they were for other categories. There could be potential for this category, but it bears investigating the moderately popular apps. Where we might garner deeper insight into the viability of the genre. 

In [32]:
for app in android_final:
    app_c = app[1]
    if app_c == 'BOOKS_AND_REFERENCE' and (app[5] == '50,000,000+'
                                           or app[5] == '10,000,000+'
                                           or app[5] == '5,000,000+'
                                           or app[5] == '1,000,000+'):
        print(app[0],': ', app[5])

Wikipedia :  10,000,000+
Cool Reader :  10,000,000+
Book store :  1,000,000+
FBReader: Favorite Book Reader :  10,000,000+
Free Books - Spirit Fanfiction and Stories :  1,000,000+
AlReader -any text book reader :  5,000,000+
FamilySearch Tree :  1,000,000+
Cloud of Books :  1,000,000+
ReadEra – free ebook reader :  1,000,000+
Ebook Reader :  5,000,000+
Read books online :  5,000,000+
eBoox: book reader fb2 epub zip :  1,000,000+
All Maths Formulas :  1,000,000+
Ancestry :  5,000,000+
HTC Help :  10,000,000+
Moon+ Reader :  10,000,000+
English-Myanmar Dictionary :  1,000,000+
Golden Dictionary (EN-AR) :  1,000,000+
All Language Translator Free :  1,000,000+
Aldiko Book Reader :  10,000,000+
Dictionary - WordWeb :  5,000,000+
50000 Free eBooks & Free AudioBooks :  5,000,000+
Al-Quran (Free) :  10,000,000+
Al Quran Indonesia :  10,000,000+
Al'Quran Bahasa Indonesia :  10,000,000+
Al Quran Al karim :  1,000,000+
Al Quran : EAlim - Translations & MP3 Offline :  5,000,000+
Koran Read &MP3 30

Book readers seem to be very popular within this area, and it could be a viable idea to base ones app of a book. Most of the apps made that are moderately popular seem to be centered around books, and most of them seem to take the shape of being book readers. There are furthermore popular apps based on singular books, as we saw with the Apple Store.

Most of the apps in the genre are, as mentioned previously, libraries or dictionaires. But there are also some popular apps based on books such as the Quran, which would suggest that apps focusing on singular books also hold the potential to be popular. 


## Conclusions

One viable strategy for the recommended app profile could be to develop a book reader.  But it would be important to distinguish the app from other apps in the genre. Some possible avenues could be to focus more on modern literature, or try to find a popular and specialised area of literature to focus the app on. There could furthermore be oppurtunities within the functionality of the app. One could for example strive to add functionality such as forums, recommendations, bonus content, reading lists etc. in order to distinguish the app from more well-established reading apps. 

Another alternative would be to build an app focusing on a book, or smaller collection of books, in order to distinguish and specalize the apps further. This would also allow for more in-depth functionality within the app which could be a viable selling-point for users. This could be implemented as a app focusing on a sub-genre, author or series where the app offers reading functionality, but also focuses on offering bonus content, behind-the-scenes-content and discussion forums for fans to discuss the book and present their fan-made work. 