# Data Analysis of Android and iOS apps

To better understand which kinds of apps our dev team should focus on, we are going to go through datasets of apps available on iOS and Android devices.

As our only source of income is in-app ads (our apps are free), we will try to understand which apps attract and retain the most users.

### What about the datasets?

Both datasets, for Android and iOS, come from kaggle.com:

- [The Android Dataset](https://www.kaggle.com/lava18/google-play-store-apps) was created in August 2018.

- [The iOS Dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) was created in July 2017.

Those 2 datasets do not represent all the apps on iOS and Android, as there are more than 4 million apps in total. We will focus on a sample of that data, going through 7197 iOS apps and 10841 Android apps, as we do not have the time and money to go through all of them.

First, let's open both files and create a function that lets us explore the datasets quickly by printing rows.

In [2]:
#Let's open the Apple dataset
from csv import reader

with open('AppleStore.csv', encoding='utf8') as opened_file:
    apple_set = list(reader(opened_file))
apple_header = apple_set[0] #We separate the header from the rest of the file
apple_set = apple_set[1:]

#Let's open the Android dataset
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android_set = list(reader(opened_file))
android_header = android_set[0]
android_set = android_set[1:]


def explore_data(apps_data, start, end, rows_and_column=False):
    dataset_slice = apps_data[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_column:
        print('Number of rows :',len(apps_data))
        print('Number of columns :',len(apps_data[0]))
        print('\n')

print(apple_header)
print('\n')
explore_data(apple_set, 0 , 2 , True)
print('\n')
print(android_header)
print('\n')
explore_data(android_set, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows : 7197
Number of columns : 16




['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Ev

### 1) The columns description

We printed the column names of both datasets in order to better understand how they are organized. We can see that they don't have the same number of columns (13 for Android vs 16 for iOS) and that they describe the same items in different ways (e.g. 'Size' vs 'size_bytes').

Here are the columns that are relevant to our project: 


| **Android**     | **iOS** |
| ----------- | ----------- |
| App     | track_name      |
| Category   | currency        |
| Reviews    | price       |
| Installs   | rating_count_tot        |
| Type   | rating_count_ver       |
| Price   | prime_genre      |
| Genres  |        |

### Cleaning the data

Before any analysis, let's focus on cleaning the datasets. 

Thanks to this [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on the forum, we know that there is an error in the Android dataset that we need to take care of. 


In [3]:
#First let's see if there is actually an issue with entry 10472, as discussed in the forum.
print(android_header)
print('\n')
print(android_set[10472])

#We can see that the 'Category' value is indeed missing, which shifts the other columns. Let's delete this row.
#(Run this cell only once, otherwise another row gets deleted.)
del android_set[10472]





['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
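
A quick way to catch this kind of shifted row without relying on the forum is to compare the length of each row with the length of the header. The sketch below runs on a tiny made-up header and made-up rows (not taken from the real file), and `find_malformed_rows` is a hypothetical helper name.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Made-up sample: the second row is missing its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.8'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
for index, row in malformed:
    print(index, row)
```

Called as `find_malformed_rows(android_header, android_set)` before the deletion, this check should flag the row discussed above, since it has 12 values for 13 columns.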


### Android Duplicates

Now let's focus on the duplicate entries of the Android set. Some applications appear several times in our Dataset, which could distort the results. 



In [4]:
#Example of duplicate entries: 

instagram_entries = []
for app in android_set:
    name = app[0]
    if name == 'Instagram':
        instagram_entries.append(app)
        
print(instagram_entries)

[['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'], ['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'], ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'], ['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']]


Now let's count how many duplicate apps we have! 

In [5]:
duplicate_apps = []
unique_apps = []

for app in android_set:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print("Number of duplicated apps :", len(duplicate_apps))
print('\n')
print("Example of duplicated app :", duplicate_apps[:15])


Number of duplicated apps : 1181


Example of duplicated app : ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


- The duplicate rows differ only in the 'Reviews' column, so they probably describe the same app recorded at different times.
- The higher the number of reviews, the more recent the record should be.
- So instead of deleting duplicates at random, we will keep, for each app, only the row with the highest number of reviews.

In [6]:
review_max={}

for app in android_set:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in review_max and review_max[name] < n_reviews:
        review_max[name] = n_reviews
        
    elif name not in review_max:
        review_max[name] = n_reviews

#Let's check that we have the right number of applications!
#We should have 10840 - 1181 = 9659 apps left!

print(len(review_max))
    

9659


Let's erase the duplicate rows by:
- creating 2 empty lists, android_clean and already_added
- adding an app to the android_clean list if
    - its number of reviews matches the number stored in our dictionary review_max 
    - its name is not already in the already_added list, which avoids adding the same app twice when several duplicates share the same maximum number of reviews

In [7]:
android_clean = []
already_added = []

for app in android_set:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == review_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

#Let's check if we have the right number of apps in our clean data set!
print(len(android_clean))

9659
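
The two passes above can also be folded into a single one by storing the best row for each app name in a dictionary. This is only a sketch on made-up rows: `keep_most_reviewed` is a hypothetical helper, and it assumes the review count sits at index 3, as in the Android dataset.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when two duplicates tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows: 'App A' appears three times with different review counts.
sample_rows = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App B', 'TOOLS', '4.0', '20'],
    ['App A', 'SOCIAL', '4.5', '150'],
    ['App A', 'SOCIAL', '4.5', '150'],
]

deduplicated = keep_most_reviewed(sample_rows)
print(len(deduplicated))
```

One difference with the cells above: the result is ordered by the first appearance of each name, not by the position of the row that was kept.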


### The language issue

We now need to get rid of the non-English apps, as our customer base is English-speaking. 

To do that, we will remove from our datasets the apps whose names contain non-English characters.

The characters commonly used in English text are all encoded in ASCII with numbers from 0 to 127, so let's create a function that detects characters outside this range (we use the *ord* function to get the number corresponding to a character).

We need to go through each character of a string in order to check each of them.
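
To make the 0 to 127 range concrete, here is a small illustration of what `ord` returns for a few characters. The sample characters are arbitrary choices; the non-ASCII ones are written as escape sequences so that the snippet does not depend on the file encoding.

```python
# Code points of a few characters; only those above 127 fall outside ASCII.
# The last three are the trademark sign, a Chinese character and an emoji.
samples = ['a', 'Z', '+', '\u2122', '\u7231', '\U0001F61C']

for character in samples:
    code_point = ord(character)
    label = 'ASCII' if code_point <= 127 else 'non-ASCII'
    # ascii() prints the escaped form, which is safe on any terminal.
    print(ascii(character), code_point, label)

non_ascii = [character for character in samples if ord(character) > 127]
```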

In [8]:
def english_check(word):
    for letter in word:
        if ord(letter) > 127:
            return False
    return True
    
print(english_check('Docs To Go™ Free Office Suite'))
english_check('😜')

False


False

We can see that our english_check function rejects any name containing a special character or an emoji, even though these can be used by both non-English and English speakers. 

In order to avoid as many mistakes as possible, we will only remove the apps that have 3 or more non-ASCII characters in their name. This is not a perfect solution but it should be fairly effective. 

Let's update our function english_check!

In [9]:
def english_check(word):
    non_english_count = 0
    for letter in word:
        if ord(letter) > 127:
            non_english_count += 1
        if non_english_count == 3:
            return False
    return True

#Let's try our new function
print(english_check('Docs To Go™ Free Office Suite'))
print(english_check('Docs To Go™ Free Office Suite😜😜'))

True
False


Now let's go through both our datasets to filter out the non-English apps using the english_check function (for the Android apps we use the android_clean list that we created earlier).

In [10]:
english_android_apps = []
for app in android_clean:
    name = app[0]
    if english_check(name):
        english_android_apps.append(app)

print(len(english_android_apps))
explore_data(english_android_apps, 0, 3, True)

english_apple_apps = []
for app in apple_set:
    name = app[1]
    if english_check(name):
        english_apple_apps.append(app)
        
print(len(english_apple_apps))
explore_data(english_apple_apps, 0, 3, True)

9597
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows : 9597
Number of columns : 13


6155
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24

### Free vs paid apps

As we want to develop **free** apps with in-app advertising, we need to filter our datasets so that they only contain free apps.

Let's go through both datasets and, just like we did for the language of the apps, create new lists with only the free apps. 

In [11]:
#Let's find which column refers to the price

print(android_header)
print(apple_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [12]:
free_android_apps = []

for app in english_android_apps:
    price =  app[7]
    if price == "0":
        free_android_apps.append(app)
        
print(len(free_android_apps))

free_apple_apps = []

for app in english_apple_apps:
    price =  float(app[4])
    if price == 0.0 :
        free_apple_apps.append(app)
        
print(len(free_apple_apps))

8848
3203
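
The two filters above compare prices in two different ways (a string for Android, a float for iOS). One helper could cover both, as sketched below. It assumes that paid prices may carry a leading '$', which is an assumption about the raw file and not something shown in the output above; `is_free` is a hypothetical name.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' equals zero."""
    # Assumption: a paid price may be written with a leading dollar sign.
    return float(price_text.strip().lstrip('$')) == 0.0


for price in ['0', '0.0', '$4.99', '2.99']:
    print(price, '->', is_free(price))
```

With such a helper, both loops could use the same test: `is_free(app[7])` for Android and `is_free(app[4])` for iOS.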


**In the end, we have 8848 Android apps and 3203 iOS apps left to analyze.**

# Analyzing the data

To better understand what we should analyze in our datasets, let's go into the details of our business strategy.

We depend heavily on the number of users of our apps, as our income comes from in-app ads.

Our strategy will be to: 

1. Build a Minimum Viable Product of our app for the Google Play Store

2. If the app attracts users, develop it further

3. If the app is profitable after 6 months, create an iOS version.

We need to develop an app that would be successful on both Android and iOS.

First let's build a frequency table in order to better understand which genres of apps are the most common on both platforms.

Based on the column names, we will use the 'prime_genre' column of the App Store dataset, and the 'Genres' and 'Category' columns of the Google Play dataset.

In order to create our Frequency Table we need to create two functions : 

- A function to generate the frequency table with percentages
- A function to display the percentages in a descending order

In [13]:
def freq_table(dataset, index):
    frequency_table = {}
    
    for row in dataset:
        value = row[index]
        if value in frequency_table:
            frequency_table[value] += 1
        else:
            frequency_table[value] = 1
    
    table_percentages = {}
    for key in frequency_table:
        percentage = (frequency_table[key]/len(dataset))*100
        table_percentages[key]=percentage
        
    return table_percentages
        
def display_table(dataset,index):
    table = freq_table(dataset,index)
    table_display = []
    
    for key in table:
        key_val_as_tuple = (table[key],key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':' , entry[0])
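
As an alternative, the same percentages can be computed with `collections.Counter` from the standard library. The sketch below uses made-up rows, and `freq_table_counter` is a hypothetical name chosen so that it does not clash with the function above.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, using collections.Counter."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up rows with the genre at index 1.
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Education'],
    ['App D', 'Games'],
]

percentages = freq_table_counter(sample_rows, 1)
# Print the values with the highest percentage first.
for value, percentage in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(value, ':', percentage)
```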

Now let's display the frequency table for the columns we chose earlier, for both Android and iOS apps.

In [14]:
display_table(free_apple_apps, 11) #index 11 is the 'prime_genre' column


Games : 58.25788323446769
Entertainment : 7.836403371838902
Photo & Video : 4.995316890415236
Education : 3.6840462066812365
Social Networking : 3.3093974399000934
Shopping : 2.5913206369029034
Utilities : 2.466437714642523
Sports : 2.1542304089915705
Music : 2.0605682172962845
Health & Fitness : 2.0293474867311896
Productivity : 1.7483609116453322
Lifestyle : 1.5610365282547611
News : 1.3424914142990947
Travel : 1.248829222603809
Finance : 1.0927255697783327
Weather : 0.8741804558226661
Food & Drink : 0.8117389946924758
Reference : 0.5307524196066188
Business : 0.5307524196066188
Book : 0.3746487667811427
Navigation : 0.18732438339057134
Medical : 0.18732438339057134
Catalogs : 0.1248829222603809


We can clearly see that **'Games'** is the **most represented genre** on the App Store (more than 58% of the apps), at least among free English apps!

'Games' are followed by 'Entertainment' and 'Photo & Video' apps, and then by the 'Education' and 'Social Networking' genres. The top three is made up **only of entertainment-oriented apps**, but this doesn't mean that they are popular. Maybe there is a huge supply and not so much demand!

Now let's see the results for the Android dataset: 

In [18]:
display_table(free_android_apps, 1)


FAMILY : 18.942133815551536
GAME : 9.697106690777577
TOOLS : 8.453887884267631
BUSINESS : 4.599909584086799
PRODUCTIVITY : 3.899186256781193
LIFESTYLE : 3.887884267631103
FINANCE : 3.7070524412296564
MEDICAL : 3.5375226039783
SPORTS : 3.390596745027125
PERSONALIZATION : 3.322784810126582
COMMUNICATION : 3.2323688969258586
HEALTH_AND_FITNESS : 3.0854430379746836
PHOTOGRAPHY : 2.949819168173599
NEWS_AND_MAGAZINES : 2.802893309222423
SOCIAL : 2.667269439421338
TRAVEL_AND_LOCAL : 2.3395117540687163
SHOPPING : 2.2490958408679926
BOOKS_AND_REFERENCE : 2.1360759493670884
DATING : 1.8648282097649187
VIDEO_PLAYERS : 1.7970162748643763
MAPS_AND_NAVIGATION : 1.3901446654611211
FOOD_AND_DRINK : 1.2432188065099457
EDUCATION : 1.164104882459313
ENTERTAINMENT : 0.9606690777576853
LIBRARIES_AND_DEMO : 0.9380650994575045
AUTO_AND_VEHICLES : 0.9267631103074141
HOUSE_AND_HOME : 0.8024412296564195
WEATHER : 0.7911392405063291
EVENTS : 0.7120253164556962
PARENTING : 0.6555153707052441
ART_AND_DESIGN : 0.64

The 'Category' column is more balanced in the Android dataset. We can see that 'FAMILY' apps are the most common, followed by 'GAME' and 'TOOLS'. 

The classification is not quite clear here, as labels such as 'FAMILY' or 'TOOLS' are not self-explanatory. Let's see what the results are when we use the 'Genres' column instead.

In [19]:
display_table(free_android_apps, 9)


Tools : 8.44258589511754
Entertainment : 6.080470162748644
Education : 5.357142857142857
Business : 4.599909584086799
Productivity : 3.899186256781193
Lifestyle : 3.8765822784810124
Finance : 3.7070524412296564
Medical : 3.5375226039783
Sports : 3.4584086799276674
Personalization : 3.322784810126582
Communication : 3.2323688969258586
Action : 3.096745027124774
Health & Fitness : 3.0854430379746836
Photography : 2.949819168173599
News & Magazines : 2.802893309222423
Social : 2.667269439421338
Travel & Local : 2.328209764918626
Shopping : 2.2490958408679926
Books & Reference : 2.1360759493670884
Simulation : 2.0456600361663653
Dating : 1.8648282097649187
Arcade : 1.842224231464738
Video Players & Editors : 1.7744122965641953
Casual : 1.763110307414105
Maps & Navigation : 1.3901446654611211
Food & Drink : 1.2432188065099457
Puzzle : 1.1301989150090417
Racing : 0.9945750452079566
Role Playing : 0.9380650994575045
Libraries & Demo : 0.9380650994575045
Auto & Vehicles : 0.9267631103074141
St

The 'Genres' column confirms our first impression: games take a far smaller share on the Android platform than on the Apple one ('Tools' comes 1st, 'Entertainment' 2nd, and 'Education' and 'Business' 3rd and 4th). 

The difference between the 'Category' and 'Genres' columns is not very clear, but as the 'Genres' column is much more granular and we only need the big picture, we will focus on the 'Category' column.

For now we cannot conclude anything. As we said after looking into the iOS dataset, we only have the number of apps per category, not the number of users of those apps. Even if the majority of iOS apps are 'Games', building one would be a terrible choice if nobody uses them.

We can only say that the majority of free English apps on iOS are games and that the landscape is more balanced on Android, with both practical and fun apps. 

The next step will be to see how many people installed the apps of each genre. This information is available on Android in the 'Installs' column but is lacking on iOS. Instead we will use the total number of user ratings, 'rating_count_tot', as a proxy.

First let's calculate the average number of user ratings per app genre on the App Store. We will :

1. Isolate the apps of each genre
2. Calculate the average number of reviews for each genre

In [36]:
genre_iOS = freq_table(free_apple_apps,11) 

for genre in genre_iOS:
    total = 0
    len_genre = 0
    for app in free_apple_apps:
        genre_app = app[11]
        if genre_app == genre:
            number_ratings = float(app[5])
            total += number_ratings
            len_genre += 1
    avg_number_rating = total/len_genre
    print(genre , ':' , avg_number_rating)


Book : 46384.916666666664
Music : 57326.530303030304
Finance : 32367.02857142857
Education : 7003.983050847458
Weather : 52279.892857142855
Sports : 23008.898550724636
Entertainment : 14195.358565737051
Food & Drink : 33333.92307692308
Productivity : 21028.410714285714
Reference : 79350.4705882353
News : 21248.023255813954
Lifestyle : 16815.48
Travel : 28243.8
Games : 22886.36709539121
Navigation : 86090.33333333333
Catalogs : 4004.0
Medical : 612.0
Health & Fitness : 23298.015384615384
Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Business : 7491.117647058823
Utilities : 19156.493670886077
Shopping : 27230.734939759037


We can see that 'Navigation' has the highest average number of ratings, but this figure is heavily influenced by *Waze* and *Google Maps*. As we can see below, those two applications carry a lot of weight.


In [38]:
for app in free_apple_apps:
    if app[11] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
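
Because a couple of very large apps can pull the average up, the median is a useful second look. The sketch below uses the six 'Navigation' rating counts printed above and the standard `statistics` module; it is an illustration, not part of the original analysis.

```python
from statistics import mean, median

# Rating counts of the six free Navigation apps listed above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('mean   :', mean(navigation_ratings))
print('median :', median(navigation_ratings))
```

The mean is about 86,090 ratings while the median is 8,196.5, which shows how strongly the two largest apps drive the average of this genre.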


In second place comes 'Reference', which includes dictionaries, wiki-type apps and translation apps. This could be an interesting segment to target; moreover, it only represents 0.53% of the free English apps on iOS, so there is much less competition than in the 'Games' segment (which has a lower average number of ratings per app). 

Even if, just like in 'Navigation', a few very popular applications may skew the results (for example *Dictionary.com* and *Bible*), this is still a promising genre.

In [41]:
for app in free_apple_apps:
    if app[11] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
Jishokun-Japanese English Dictionary & Translator : 0


We could take another famous book, as was done with the *Bible*, turn it into an app, and add a dictionary or other features (quizzes, commentaries, additional information...). 

Now let's have a look at the Android dataset, and more specifically at the number of installs per category. Thanks to our *display_table* function, we can see that the install counts are not exact but grouped into ranges (e.g. 100+, 10,000+...). 

In [53]:
display_table(free_android_apps,5)

1,000,000+ : 15.75497287522604
100,000+ : 11.539330922242314
10,000,000+ : 10.567359855334539
10,000+ : 10.194394213381555
1,000+ : 8.39737793851718
100+ : 6.928119349005425
5,000,000+ : 6.826401446654612
500,000+ : 5.560578661844485
50,000+ : 4.769439421338156
5,000+ : 4.486889692585895
10+ : 3.5375226039783
500+ : 3.2436708860759493
50,000,000+ : 2.2830018083182644
100,000,000+ : 2.1360759493670884
50+ : 1.9213381555153706
5+ : 0.7911392405063291
1+ : 0.5085895117540687
500,000,000+ : 0.27124773960216997
1,000,000,000+ : 0.22603978300180833
0+ : 0.045207956600361664
0 : 0.011301989150090416


In order to use those figures we need to get rid of the '+' and the ',' in each figure. Now let's calculate the average number of installs per genre on Android.
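
The clean-up step can be isolated in a small helper, sketched here with the hypothetical name `parse_installs`. Note that the result is only a lower bound: '10,000+' means at least 10,000 installs.

```python
def parse_installs(raw_value):
    """Convert an 'Installs' string such as '1,000,000+' to a float lower bound."""
    cleaned = raw_value.replace(',', '').replace('+', '')
    return float(cleaned)


for raw in ['0', '0+', '10,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```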

In [59]:
category_android = freq_table(free_android_apps,1) 

for category in category_android:
    total = 0
    len_category = 0
    for app in free_android_apps:
        category_app = app[1]
        if category_app == category:
            number_install = app[5]
            number_install = number_install.replace(',','')
            number_install = number_install.replace('+','')
            total += float(number_install)
            len_category += 1
    avg_number_install = total/len_category
    print(category , ':' , avg_number_install)

MAPS_AND_NAVIGATION : 4049274.6341463416
TRAVEL_AND_LOCAL : 13984077.710144928
NEWS_AND_MAGAZINES : 9549178.467741935
ART_AND_DESIGN : 1986335.0877192982
BUSINESS : 1712290.1474201474
HOUSE_AND_HOME : 1360598.042253521
BOOKS_AND_REFERENCE : 8814199.78835979
LIFESTYLE : 1446158.2238372094
COMICS : 832613.8888888889
PRODUCTIVITY : 16787331.344927534
TOOLS : 10830251.970588235
LIBRARIES_AND_DEMO : 638503.734939759
FINANCE : 1387692.475609756
BEAUTY : 513151.88679245283
AUTO_AND_VEHICLES : 647317.8170731707
DATING : 854028.8303030303
SHOPPING : 7036877.311557789
SOCIAL : 23253652.127118643
PARENTING : 542603.6206896552
HEALTH_AND_FITNESS : 4188821.9853479853
FOOD_AND_DRINK : 1924897.7363636363
COMMUNICATION : 38590581.08741259
GAME : 15544014.51048951
VIDEO_PLAYERS : 24727872.452830188
EVENTS : 253542.22222222222
MEDICAL : 120550.61980830671
PHOTOGRAPHY : 17840110.40229885
FAMILY : 3695641.8198090694
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
PERSONALIZATION : 5201482.

We can see that 'COMMUNICATION' is by far the most installed category, but this could be heavily influenced by famous messaging applications such as 'Messenger' and 'WhatsApp'.
Let's see what's inside this category.


In [61]:
for app in free_android_apps:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

Creating a new communication application would be a waste of time, as the market leaders are already well established.

We find the same pattern with video applications (a category dominated by YouTube, Google Play Movies & TV...), social networks (Facebook, Instagram...) and productivity tools (Word, Dropbox...).

Those huge apps heavily skew our averages, and those categories are hard for a new app to enter.
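
One way to measure how much the giants weigh on an average is to recompute it while skipping the apps above a chosen install threshold. The sketch below works on made-up rows; `average_installs`, its `max_installs` parameter and the 100,000,000 cut-off are illustrative assumptions, not part of the analysis above.

```python
def average_installs(rows, max_installs=None, installs_index=5):
    """Average the 'Installs' lower bounds, optionally skipping very large apps."""
    values = []
    for row in rows:
        installs = float(row[installs_index].replace(',', '').replace('+', ''))
        if max_installs is None or installs < max_installs:
            values.append(installs)
    return sum(values) / len(values) if values else 0.0


# Made-up rows with the install count at index 5, as in the Android dataset.
sample_rows = [
    ['Giant app', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Mid app', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Small app', 'COMMUNICATION', '', '', '', '10,000+'],
]

print(average_installs(sample_rows))
print(average_installs(sample_rows, max_installs=100_000_000))
```

In this toy sample the average drops from about 334 million to 505,000 installs once the single giant app is left out.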

The 'BOOKS_AND_REFERENCE' category has an interesting average number of installs per app (about 8.8 million). Since the equivalent genre looked promising on iOS, let's check whether this category is also worth investing our time and money in on Android! 

Let's see what type of apps it contains and whether they are similar to the ones in the *Reference* genre on iOS.

In [65]:
for app in free_android_apps:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

There are just a few famous apps in this category : 



In [67]:
for app in free_android_apps:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


If we take a look at the mid-range apps, between 1,000,000 and 100,000,000 installs, we can see that most of them are ebook readers. So we should probably avoid competing directly with all those apps.

In [69]:
for app in free_android_apps:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

We can see that a famous book like the *Quran* seems to interest a lot of users, so our first idea, which we came up with after looking into the Apple dataset, could be really interesting. 

# Conclusion

Thanks to our data analysis of both the Android and iOS datasets, we came to the conclusion that turning a famous book into an app with a lot of additional content (commentaries, quizzes, anecdotes...) could be a profitable idea.

This choice was motivated by the fact that those apps attract a fair number of users and that the market is not saturated with them, at least among the free English apps of those online stores!