# Most attractive apps

Since the main source of revenue is in-app ads, the number of users is a significant factor. This project analyzes mobile app data to help developers understand which types of apps are likely to attract more users on Google Play and the App Store.

The goal of the project is to identify the kinds of apps that are likely to increase revenue by attracting the highest possible number of users.
Once the most attractive app profile is found, it can guide the following validation strategy:
  - Build a minimal Android version of the app and add it to Google Play
  - If the app has a good response from users, develop it further
  - If the app is profitable after six months, build an iOS version and add it to the App Store

In [1]:
from csv import reader

In [2]:
with open('AppleStore.csv', 'r') as file_opened:
    read_lines = reader(file_opened)
    apple_data = list(read_lines)

with open('googleplaystore.csv', 'r') as file_opened:
    read_lines = reader(file_opened)
    google_data = list(read_lines)
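
The loading step can also be wrapped in a small helper. The sketch below is only one possible variant: the name `load_csv` and the UTF-8 default are assumptions, not part of the original notebook. Passing an explicit encoding matters because app names contain non-ASCII characters, and `newline=""` is what the `csv` module documentation recommends for files handed to `csv.reader`.

```python
import csv
import os
import tempfile

def load_csv(path, encoding="utf-8"):
    """Read a CSV file and return (header, rows) as lists of strings."""
    with open(path, newline="", encoding=encoding) as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Demo on a tiny temporary file so the sketch runs without the real datasets.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    tmp.write("App,Category\nSketch,ART_AND_DESIGN\n")
    demo_path = tmp.name

header, rows = load_csv(demo_path)
os.remove(demo_path)
print(header)
print(rows)
```

With the real files, the call would look like `load_csv('AppleStore.csv')`, assuming the files are UTF-8 encoded.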

The function `explore_data` prints the rows of a dataset between the `start` and `end` indices (`end` excluded). When the argument `rows_and_columns` is set to `True`, it also prints the number of rows and columns in the dataset.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
explore_data(apple_data, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


In [5]:
explore_data(google_data, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


The documentation for the `apple_data` can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

The documentation for the `google_data` can be found [here](https://www.kaggle.com/lava18/google-play-store-apps).

In [6]:
copy_apple = apple_data.copy()
copy_google = google_data.copy()

In [7]:
header_apple_data = list(enumerate(copy_apple[0]))
print('Apple data header: ', header_apple_data)
print('\n')
header_google_data = list(enumerate(copy_google[0]))
print('Google data header: ', header_google_data)

Apple data header:  [(0, 'id'), (1, 'track_name'), (2, 'size_bytes'), (3, 'currency'), (4, 'price'), (5, 'rating_count_tot'), (6, 'rating_count_ver'), (7, 'user_rating'), (8, 'user_rating_ver'), (9, 'ver'), (10, 'cont_rating'), (11, 'prime_genre'), (12, 'sup_devices.num'), (13, 'ipadSc_urls.num'), (14, 'lang.num'), (15, 'vpp_lic')]


Google data header:  [(0, 'App'), (1, 'Category'), (2, 'Rating'), (3, 'Reviews'), (4, 'Size'), (5, 'Installs'), (6, 'Type'), (7, 'Price'), (8, 'Content Rating'), (9, 'Genres'), (10, 'Last Updated'), (11, 'Current Ver'), (12, 'Android Ver')]
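
The enumerated headers are used to look up column positions by eye. As a minimal sketch, the same lookup can be kept in a dictionary so that later code does not rely on hard-coded numbers. The helper name `column_index` is hypothetical; the header list is copied from the output above.

```python
def column_index(header):
    """Map each column name to its position in the header row."""
    return {name: position for position, name in enumerate(header)}

# Header copied from the Google Play output above.
google_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']

google_cols = column_index(google_header)
print(google_cols['Installs'])  # 5
print(google_cols['Type'])      # 6
```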


## Data cleaning:
   - Remove incomplete rows
   - Remove duplicate apps
   - Remove non-English apps
   - Remove apps that aren't free

### Find and Remove Incomplete rows
The function `incomplete_row` finds the rows whose length differs from the length of the header row. These rows will be removed.

In [8]:
def incomplete_row(dataset):
    found = False
    for index, row in enumerate(dataset[1:], start=1):
        if len(row) != len(dataset[0]):
            found = True
            print(row)
            print('Incomplete row index: ', index)
    # Report success only when no mismatching row was found.
    if not found:
        print('All rows are complete!')

#### Apple data

In [9]:
incomplete_row(apple_data)

All rows are complete!


#### Google data

In [10]:
incomplete_row(google_data)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Incomplete row index:  10473


The row at index `10473` is missing its `Category` value, so every other value is shifted one column to the left. The row must be removed.

In [11]:
del google_data[10473]

In [12]:
print('New length of google_data: ', len(google_data))

New length of google_data:  10841
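
Deleting by a hard-coded index works for a single bad row. As a hedged alternative, the sketch below separates complete and incomplete rows in one pass, which avoids depending on a specific position. The function name `split_by_length` and the toy data are made up for illustration.

```python
def split_by_length(dataset):
    """Return (complete, incomplete); dataset[0] is treated as the header."""
    expected = len(dataset[0])
    complete = [dataset[0]]
    incomplete = []
    for index, row in enumerate(dataset[1:], start=1):
        if len(row) == expected:
            complete.append(row)
        else:
            # Keep the original position so the row can be inspected later.
            incomplete.append((index, row))
    return complete, incomplete

# Toy data: one valid row and one row with a missing value.
toy = [
    ['App', 'Category', 'Rating'],
    ['Sketch', 'ART_AND_DESIGN', '4.5'],
    ['Broken row', '1.9'],
]

complete, incomplete = split_by_length(toy)
print(len(complete))
print(incomplete)
```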


### Find and Remove Duplicate rows
The function `find_duplicate` searches a dataset by app name and returns one entry for every repeated occurrence of a name.

In [13]:
def find_duplicate(dataset, index):
    duplicate_apps = []
    unique_apps = []
    for row in dataset[1:]:
        name = row[index]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
    print(len(duplicate_apps))
    return duplicate_apps

#### Apple data

In [14]:
find_duplicate(apple_data, 1)

2


['Mannequin Challenge', 'VR Roller Coaster']

The dataset `apple_data` has two duplicated app names. The rows printed below show a different id and version for each occurrence, so it can be assumed that the data were collected at different times for different versions of the same app. The rows for the older versions will therefore be removed. Since the newest versions have more total ratings, the `rating_count_tot` column is used as the guide for the removal.
With only two duplicates, finding and removing them can be done manually for this dataset.

In [15]:
for row in apple_data[1:]:
    name = row[1]
    if name == 'Mannequin Challenge' or name == 'VR Roller Coaster':
        print(row)

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


In [16]:
for app in apple_data[1:]:
    id_app = app[0]
    if id_app == '1178454060' or id_app == '1089824278':
        print(apple_data.index(app))

4464
4832


In [17]:
print(len(apple_data))
# Delete the higher index first; deleting 4464 first would shift the row at 4832 to 4831.
del apple_data[4832]
del apple_data[4464]
print(len(apple_data))

7198
7196
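
When rows are removed by position, each `del` shifts every later row one place to the left, so an index found earlier may no longer point at the intended row. A minimal sketch of one way to avoid this is to delete from the highest index down. The helper `delete_rows` and the toy list are hypothetical.

```python
def delete_rows(dataset, indices):
    """Delete the rows at the given positions in place."""
    # Highest index first, so the remaining indices stay valid.
    for index in sorted(set(indices), reverse=True):
        del dataset[index]

toy = ['header', 'a', 'b', 'c', 'd']
delete_rows(toy, [1, 3])
print(toy)  # ['header', 'b', 'd']
```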


#### Google data

The dataset `google_data` has 1181 duplicate entries by app name. Exploring these rows suggests that data for some apps were collected more than once, at the same or at different times. For each app, only the row with the highest number of `Reviews` will be kept and the others will be removed.

In [18]:
find_duplicate(google_data, 0)

1181


['Quick PDF Scanner + OCR FREE',
 'Box',
 'Google My Business',
 'ZOOM Cloud Meetings',
 'join.me - Simple Meetings',
 'Box',
 'Zenefits',
 'Google Ads',
 'Google My Business',
 'Slack',
 'FreshBooks Classic',
 'Insightly CRM',
 'QuickBooks Accounting: Invoicing & Expenses',
 'HipChat - Chat Built for Teams',
 'Xero Accounting Software',
 'MailChimp - Email, Marketing Automation',
 'Crew - Free Messaging and Scheduling',
 'Asana: organize team projects',
 'Google Analytics',
 'AdWords Express',
 'Accounting App - Zoho Books',
 'Invoice & Time Tracking - Zoho',
 'join.me - Simple Meetings',
 'Invoice 2go — Professional Invoices and Estimates',
 'SignEasy | Sign and Fill PDF and other Documents',
 'Quick PDF Scanner + OCR FREE',
 'Genius Scan - PDF Scanner',
 'Tiny Scanner - PDF Scanner App',
 'Fast Scanner : Free PDF Scan',
 'Mobile Doc Scanner (MDScan) Lite',
 'TurboScan: scan documents and receipts in PDF',
 'Tiny Scanner Pro: PDF Doc Scan',
 'Docs To Go™ Free Office Suite',
 'OfficeS

The dictionary `duplicates` counts how many times each app name is repeated, which shows that some apps have more than one duplicate. For example, the app `Viber Messenger` has 5 entries: four in the `duplicate_apps` list and one in the `unique_apps` list.

In [19]:
duplicates = {}
for name in find_duplicate(google_data, 0):
    if name in duplicates:
        duplicates[name] += 1
    else:
        duplicates[name] = 1
print(duplicates)
print(len(duplicates))

1181
{'Quick PDF Scanner + OCR FREE': 2, 'Box': 2, 'Google My Business': 2, 'ZOOM Cloud Meetings': 1, 'join.me - Simple Meetings': 2, 'Zenefits': 1, 'Google Ads': 2, 'Slack': 2, 'FreshBooks Classic': 1, 'Insightly CRM': 1, 'QuickBooks Accounting: Invoicing & Expenses': 2, 'HipChat - Chat Built for Teams': 1, 'Xero Accounting Software': 1, 'MailChimp - Email, Marketing Automation': 1, 'Crew - Free Messaging and Scheduling': 1, 'Asana: organize team projects': 1, 'Google Analytics': 1, 'AdWords Express': 1, 'Accounting App - Zoho Books': 1, 'Invoice & Time Tracking - Zoho': 1, 'Invoice 2go — Professional Invoices and Estimates': 1, 'SignEasy | Sign and Fill PDF and other Documents': 1, 'Genius Scan - PDF Scanner': 1, 'Tiny Scanner - PDF Scanner App': 1, 'Fast Scanner : Free PDF Scan': 1, 'Mobile Doc Scanner (MDScan) Lite': 1, 'TurboScan: scan documents and receipts in PDF': 1, 'Tiny Scanner Pro: PDF Doc Scan': 1, 'Docs To Go™ Free Office Suite': 1, 'OfficeSuite : Free Office + PDF Editor

In [20]:
for app in google_data[1:]:
    name = app[0]
    if name == 'Viber Messenger':
        print(app)

['Viber Messenger', 'COMMUNICATION', '4.3', '11334799', 'Varies with device', '500,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 18, 2018', 'Varies with device', 'Varies with device']
['Viber Messenger', 'COMMUNICATION', '4.3', '11334973', 'Varies with device', '500,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 18, 2018', 'Varies with device', 'Varies with device']
['Viber Messenger', 'COMMUNICATION', '4.3', '11334973', 'Varies with device', '500,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 18, 2018', 'Varies with device', 'Varies with device']
['Viber Messenger', 'COMMUNICATION', '4.3', '11335255', 'Varies with device', '500,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 18, 2018', 'Varies with device', 'Varies with device']
['Viber Messenger', 'COMMUNICATION', '4.3', '11335481', 'Varies with device', '500,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 18, 2018', 'Varies with device', 'Varies with device']


In [21]:
review_max = {}

for row in google_data[1:]:
    name = row[0]
    n_review = int(row[3])
    if name in review_max and review_max[name] < n_review:
        review_max[name] = n_review
    if name not in review_max:
        review_max[name] = n_review
print(len(review_max))
print(len(google_data[1:])-1181)
print(review_max['Viber Messenger'])

9659
9659
11335481


In [22]:
google_data_clean = []
google_name_added = []

for app in google_data[1:]:
    name = app[0]
    n_review = int(app[3])
    if name not in google_name_added and n_review == review_max[name]:
        google_data_clean.append(app)
        google_name_added.append(name)
print(len(google_data_clean))

9659


In [23]:
for app in google_data_clean:
    name = app[0]
    if name == 'Viber Messenger':
        print(app)

['Viber Messenger', 'COMMUNICATION', '4.3', '11335481', 'Varies with device', '500,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 18, 2018', 'Varies with device', 'Varies with device']
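
The two steps above (building `review_max`, then filtering) can also be expressed as a single pass that keeps, for each name, the row with the most reviews. This is only a sketch: the function name `keep_most_reviewed` is an assumption, and the second toy app is invented. It relies on dictionaries preserving insertion order (Python 3.7+), and on ties it keeps the first row seen.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = int(row[reviews_index])
        # Strict comparison: on a tie, the first row seen is kept.
        if name not in best or reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Viber review counts are taken from the output above; 'Example App' is invented.
toy_rows = [
    ['Viber Messenger', 'COMMUNICATION', '4.3', '11334799'],
    ['Example App', 'TOOLS', '4.0', '10'],
    ['Viber Messenger', 'COMMUNICATION', '4.3', '11335481'],
    ['Example App', 'TOOLS', '4.0', '10'],
]

deduplicated = keep_most_reviewed(toy_rows, name_index=0, reviews_index=3)
for row in deduplicated:
    print(row)
```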


### Find and Remove non-English apps

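The function `english_app` is used to separate out the non-English apps. If an app name contains more than 3 characters outside the ASCII range (code point above 127), the app is treated as non-English and removed from the dataset. The tolerance of 3 keeps English names that contain a few symbols such as `™`.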

In [45]:
def english_app(string):
    non_eng = 0
    for letter in string:
        if ord(letter) > 127:
            non_eng += 1
    if non_eng > 3:
        return False
    return True
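
To see how the heuristic behaves, the sketch below applies an equivalent check to a few names. The function name `is_probably_english` and the `max_non_ascii` parameter are assumptions introduced for illustration, and the last sample name is invented.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True when at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

samples = [
    # Plain ASCII name.
    'Instagram',
    # One non-ASCII symbol, still accepted.
    'Docs To Go™ Free Office Suite',
    # Three non-ASCII characters, so the heuristic accepts it.
    '教えて!goo',
    # Invented name with eight non-ASCII characters, rejected.
    '中文名称应用程序',
]

for name in samples:
    print(name, ':', is_probably_english(name))
```

The third sample shows the limit of the approach: a short non-English name with three or fewer non-ASCII characters passes the filter, which is why a few such apps still appear in later listings in this notebook.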

#### Apple data

In [47]:
apple_data_eng = []

for row in apple_data[1:]:
    name = row[1]
    if english_app(name):
        apple_data_eng.append(row)
print(len(apple_data_eng))    

6181


#### Google data

In [48]:
google_data_eng = []

for row in google_data_clean:
    name = row[0]
    if english_app(name):
        google_data_eng.append(row)
print(len(google_data_eng))

9614


### Find and Remove non-Free apps

#### Apple data

In [51]:
apple_data_free = []

for row in apple_data_eng:
    price = float(row[4])
    if price == 0.0:
        apple_data_free.append(row)
print(len(apple_data_free))

3221


#### Google data

In [54]:
google_data_free = []

for row in google_data_eng:
    app_type = row[6]  # avoid shadowing the built-in `type`
    if app_type == 'Free':
        google_data_free.append(row)
print(len(google_data_free))

8863


## Find the most popular genres
An effective strategy for deciding what kind of apps should be built is to explore the most common genres in both stores. As a first approximation, it can be assumed that these kinds of apps have higher demand.

The function `freq_table` creates a dictionary that maps each genre in the dataset to the percentage of apps that belong to it. The function `display_table` prints these percentages in descending order.

In [64]:
def freq_table(dataset, index):
    freq_dict = {}
    total = 0
    
    for app in dataset:
        total += 1
        genre = app[index]
        if genre in freq_dict:
            freq_dict[genre] += 1
        else:
            freq_dict[genre] = 1
            
    percent_freq = {}
    for key in freq_dict:
        percent_freq[key] = (freq_dict[key]/total) * 100
    return percent_freq
    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    sort_table = []
    for key in table:
        to_tuple = (table[key], key)
        sort_table.append(to_tuple)
        
    sort_freq_perc = sorted(sort_table, reverse = True)
    for item in sort_freq_perc:
        print(item[1], ':', item[0])
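
For comparison, the same kind of percentage table can be built with `collections.Counter` from the standard library. This is a sketch of an alternative, not the notebook's method: the name `percent_table` and the toy rows are made up, and ties are ordered by first appearance rather than by name.

```python
from collections import Counter

def percent_table(rows, index):
    """Return [(value, percentage)] sorted from most to least frequent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Toy rows: (name, genre).
toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Education']]

for genre, percent in percent_table(toy, 1):
    print(genre, ':', round(percent, 2))
```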

#### Apple data
Exploring the results for the `apple_data_free` dataset, we see that the genre `Games` comes first with about 58% of the apps, far ahead of `Entertainment` in second place with about 8%.

In [65]:
display_table(apple_data_free, 11)

Games : 58.149642968022356
Entertainment : 7.885749767153058
Photo & Video : 4.967401428127911
Education : 3.6634585532443342
Social Networking : 3.290903446134741
Shopping : 2.607885749767153
Utilities : 2.5147469729897547
Sports : 2.1421918658801617
Music : 2.049053089102763
Health & Fitness : 2.018006830176964
Productivity : 1.7385904998447685
Lifestyle : 1.5833592052157717
News : 1.334989133809376
Travel : 1.2418503570319777
Finance : 1.11766532132878
Weather : 0.8692952499223843
Food & Drink : 0.8072027320707855
Reference : 0.55883266066439
Business : 0.5277864017385905
Book : 0.43464762496119214
Navigation : 0.18627755355479667
Medical : 0.18627755355479667
Catalogs : 0.12418503570319776


#### Google data
The landscape is different for the `google_data_free` dataset. `FAMILY` is the most common category on Google Play with about 19% of the apps, followed by `GAME` with about 10%. Since there is no corresponding genre in the `apple_data_free` dataset, we cannot really draw any definitive conclusion. It seems, though, that apps built for entertainment lead in both stores.

In [66]:
display_table(google_data_free, 1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

### Find the number of downloads for each genre

Another important factor is the number of downloads for each genre. This factor may give us more information about the kind of apps we should build to increase revenue. For the `google_data_free` dataset we will use the data in the `Installs` column. Since the `apple_data_free` dataset has no corresponding column, we will use `rating_count_tot` as a proxy and try to reach a decision from there.

#### Apple data
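Exploring the results, we conclude that the most popular apps by average number of ratings are the `Navigation` apps, with 86090.33 ratings per app. The `Reference` apps come second and the `Social Networking` apps third. The per-genre listings that follow show that these averages are driven by a handful of very large apps, for example Waze and Google Maps in `Navigation`.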

In [74]:
genre_apple = freq_table(apple_data_free, 11)

for genre in genre_apple:
    total = 0
    len_genre = 0
    for app in apple_data_free:
        genre_app = app[11]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', round(avg_n_ratings, 2))

Social Networking : 71548.35
Photo & Video : 28441.54
Games : 22800.78
Music : 57326.53
Reference : 74942.11
Health & Fitness : 23298.02
Weather : 52279.89
Utilities : 18684.46
Travel : 28243.8
Shopping : 26919.69
News : 21248.02
Navigation : 86090.33
Lifestyle : 16485.76
Entertainment : 14029.83
Food & Drink : 33333.92
Sports : 23008.9
Book : 39758.5
Finance : 31467.94
Education : 7003.98
Productivity : 21028.41
Business : 7491.12
Catalogs : 4004.0
Medical : 612.0
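
The nested loop above scans the whole dataset once per genre. A minimal sketch of a single-pass version accumulates a total and a count per group. The helper `average_by_group` is hypothetical; the toy rows reuse the six `Navigation` apps and rating counts listed in this notebook, so the result can be checked against the value printed above.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Columns: name, rating_count_tot, prime_genre.
navigation_rows = [
    ['Waze - GPS Navigation, Maps & Real-time Traffic', '345046', 'Navigation'],
    ['Google Maps - Navigation & Transit', '154911', 'Navigation'],
    ['Geocaching', '12811', 'Navigation'],
    ['CoPilot GPS', '3582', 'Navigation'],
    ['ImmobilienScout24: Real Estate Search in Germany', '187', 'Navigation'],
    ['Railway Route Search', '5', 'Navigation'],
]

averages = average_by_group(navigation_rows, group_index=2, value_index=1)
print(round(averages['Navigation'], 2))  # 86090.33
```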


In [76]:
for app in apple_data_free:
    genre = app[11]
    if genre == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


In [77]:
for app in apple_data_free:
    genre = app[11]
    if genre == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


In [78]:
for app in apple_data_free:
    genre = app[11]
    if genre == 'Social Networking':
        print(app[1], ':', app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

#### Google data
For the Google Play market, we actually have data about the number of `Installs`, so we should be able to get a clearer picture of genre popularity. However, the install numbers are not precise: the values are open-ended (100+, 1,000+, 5,000+, etc.).

In [79]:
display_table(google_data_free,5)

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000, 200,000, or 350,000 installs. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users. In the computation below, each value is converted to a number by removing the commas and the plus sign, so every app is counted at its lower bound.

In [80]:
categories_android = freq_table(google_data_free, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in google_data_free:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3697848.1731343283
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_
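
The string cleanup used in the loop above can be isolated in a small helper, which makes the conversion easy to check on its own. The name `parse_installs` is an assumption for this sketch, and it returns an `int` where the notebook uses `float`.

```python
def parse_installs(text):
    """Convert a value such as '1,000,000+' to its lower bound as an int."""
    return int(text.replace(',', '').replace('+', ''))

print(parse_installs('1,000,000+'))  # 1000000
print(parse_installs('500+'))        # 500
print(parse_installs('0+'))          # 0
```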

On average, `COMMUNICATION` apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts) and a few others with over 100 and 500 million installs.

In [81]:
for app in google_data_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

In [83]:
under_100_m = []

for app in google_data_free:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386
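
Leaving out the `COMMUNICATION` apps with 100 million or more installs lowers the average from 38,456,119 to about 3,603,485, which shows how sensitive the mean is to a few very large apps. The median is one summary that is less affected by such outliers. It is not used in the original analysis, and the install counts in the sketch below are invented for illustration only.

```python
from statistics import mean, median

# Invented install counts: three very large apps and four small ones.
installs = [1_000_000_000, 500_000_000, 100_000_000,
            1_000_000, 100_000, 10_000, 1_000]

print('mean  :', round(mean(installs)))
print('median:', median(installs))  # 1000000
```

Here the mean is pulled far above the typical app by the three giants, while the median stays at the middle value of the list.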

The `BOOKS_AND_REFERENCE` genre looks fairly popular as well, with an average of 8,767,811 installs. It is interesting to explore this genre in more depth, since the `Reference` genre also showed some potential on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

In [84]:
for app in google_data_free:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. Taking a popular book (perhaps a more recent one) and turning it into an app could therefore be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

## Conclusion

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.