# Profitable App Profiles for the App Store and Google Play Markets

Our company builds Android and iOS mobile apps. Since we only build apps that are free to download and install, our main source of revenue is in-app ads. This means that our revenue depends on the number of users of our apps: the more users see and engage with the ads, the higher our revenue becomes. The purpose of this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

### Collection of the Data set
According to Statista, as of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. Collecting data for over 4 million apps requires a significant amount of time and money, so we will collect and analyze a sample of the data instead.

There are two data sets that contain the samples we need. The first one contains data about approximately 10,000 Android apps from Google Play, collected in August 2018. It can be downloaded from [this link](https://www.kaggle.com/datasets/lava18/google-play-store-apps).

The second data set contains data about approximately 7,000 iOS apps from the App Store, collected in July 2017. This data set can be downloaded from [this link](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

In [1]:
# open both data sets (App Store and Google Play) and turn each one into a list of rows
# keep the header row of each data set separate from the actual data
from csv import reader

opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]
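
The loading step above can also be wrapped in a small helper so that each file is closed as soon as it has been read. This is only a sketch: the name `load_dataset` and the default `encoding='utf8'` are assumptions made for illustration and are not part of the original notebook.

```python
from csv import reader

def load_dataset(path, encoding='utf8'):
    """Return (header, rows) for a CSV file (hypothetical helper)."""
    # The with-block closes the file even if reading fails part-way.
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    # The first row holds the column names, the rest is the data.
    return rows[0], rows[1:]
```

With such a helper, the two data sets could be loaded as `ios_header, ios = load_dataset('AppleStore.csv')` and `android_header, android = load_dataset('googleplaystore.csv')`.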

In [2]:
# If you run into an error named UnicodeDecodeError, add encoding="utf8" to the open() function
# (for instance, use open('AppleStore.csv', encoding='utf8')).
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))
        
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13


In [3]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  7197
Number of columns:  16


# Deleting Wrong Data

At this stage, we need to make sure the data we analyze is accurate; otherwise the results of our analysis will be wrong. So we will do **data cleaning** before the analysis, which involves the following steps:

- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove the duplicates.
- Remove non-English apps
- Remove apps that aren't free

In a dedicated discussion section of the Google Play data set, one of the discussions outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [4]:
print(android[10472]) #incorrect row
print('\n')
print(android_header) # header
print('\n')
print(android[0])     # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


The output above shows that row 10472 corresponds to the app *Life Made WI-Fi Touchscreen Photo Frame* and that its rating appears to be 19. This is clearly invalid because the maximum rating for a Google Play app is 5. As mentioned in the discussion section, the problem is caused by a missing value in the *'Category'* column, which shifts every later value one column to the left. Therefore, we will delete this row.
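
A row with a missing value like this one is shorter than the header, so a simple length check can flag it. The sketch below illustrates that idea on a tiny made-up sample; the function name `find_malformed_rows` and the sample rows are hypothetical and not part of the data set.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [index for index, row in enumerate(rows) if len(row) != expected]

# Tiny made-up sample: the second row is missing its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Good App', 'TOOLS', '4.1'],
    ['Shifted App', '1.9'],
]
print(find_malformed_rows(sample_header, sample_rows))  # prints [1]
```

The check only catches rows with a missing or extra column; a wrong value in an otherwise complete row would still need a separate inspection.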

In [5]:
print(len(android))
del android[10472] # don't run this more than once
print(len(android))

10841
10840


# Removing duplicate entries

Some apps on Google Play have duplicate entries. For instance, Instagram has four entries.

In [6]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [7]:
# we need to remove the duplicate entries and keep only one entry per app.
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('The number of duplicate apps is ', len(duplicate_apps))
print('The number of unique apps is ', len(unique_apps))

The number of duplicate apps is  1181
The number of unique apps is  9659


To remove the duplicates, we will do the following:

- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

- Use the information stored in the dictionary and create a new dataset, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).
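
The two steps above can also be folded into a single pass that keeps, for each app name, the row with the most reviews. This is a sketch rather than the notebook's own method: the function name `keep_most_reviewed` and the shortened sample rows are assumptions. The review counts are converted with `float()` so that they are compared as numbers (as text, `'9'` would count as larger than `'10'`).

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])  # numeric, not text, comparison
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    # dicts keep insertion order (Python 3.7+), so apps stay in first-seen order
    return list(best.values())

# Shortened sample rows: name, category, rating, number of reviews.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Sketch', 'ART_AND_DESIGN', '4.5', '215644'],
]
for row in keep_most_reviewed(sample):
    print(row)
```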

In [8]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])  # convert to a number so the comparison is numeric, not alphabetical
    if name in reviews_max and n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print('The number of apps without duplication is ', len(reviews_max))

The number of apps without duplication is  9659


Now we can use the dictionary (i.e. *reviews_max*) to drop the duplicate rows and build a new list with no duplicate data (we will call this new list *android_clean*).

In [9]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])  # same conversion as in reviews_max
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name) # make sure this is inside the if block

In [10]:
# after running this line, we are expecting our new list to have 9,659 rows like
# we had from our dictionary
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9659
Number of columns:  13


# Removing Non-English apps

Since we use English for the apps we develop at our company, we'd like to analyze only the apps that are designed for an English-speaking audience. So we will remove all the non-English apps.

One way to do this is to remove each app with a name containing a symbol that isn't commonly used in English text. The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. 

If an app name contains a character whose corresponding number is greater than 127, then it probably means that the app has a non-English name.
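
To make the 0 to 127 rule concrete, the built-in `ord()` function returns the number that corresponds to a character. The short sketch below uses it to count the characters of a name that fall outside the ASCII range; the helper name `count_non_ascii` and the example names are made up for illustration.

```python
def count_non_ascii(text):
    """Count the characters whose corresponding number is above 127."""
    return sum(1 for character in text if ord(character) > 127)

print(ord('a'))   # 97, inside the ASCII range
print(ord('™'))   # 8482, outside the ASCII range
print(count_non_ascii('Instagram'))                      # 0
print(count_non_ascii('Docs To Go™ Free Office Suite'))  # 1
```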

In [11]:
def is_english(string):
    
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
    # emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127 
    # so to minimize the impact of the loss of English apps with emojis and special characters, 
    # we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range
 
    if non_ascii > 3:
        return False
    else:
        return True

Using the new function *is_english()*, we can filter out non-English apps from both datasets. We loop through each dataset, and if an app name is identified as English, we append the whole row to a separate list.

In [12]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of columns:  13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+'

Our datasets contain both free and non-free apps, so we need to isolate the free apps for our analysis. We do this by selecting the apps whose price is 0 and adding them to a new list, which will be our final list for the analysis.
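
Matching on an exact price string works only as long as every free app is written the same way in a given data set. A slightly more tolerant check parses the price as a number first. The sketch below is an assumption-laden illustration: the helper name `is_free` is hypothetical, and the idea that paid prices may carry a leading `$` is a guess about the raw data that should be verified against the files.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all (for example a misplaced text value): not free.
        return False

for price in ['0', '0.0', '$4.99', 'Everyone']:
    print(price, '->', is_free(price))
```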

In [13]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))
print(android_header)
        
    

8862
3222
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


## The context of our project
### Why we want to find an app profile that fits both the App Store and Google Play

Our goal is to determine the kinds of apps that are likely to attract more users because the number of people using our apps affects our revenue.
To minimize risks and overhead, our **validation strategy** for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

## Most Common Apps by Genre

Next, we want to identify the columns we can use to generate frequency tables and determine the most common genres in each market. After inspecting both data sets, we'll build a frequency table for the *prime_genre* column of the App Store data set, and for the *Genres* and *Category* columns of the Google Play data set.

In [14]:
# Since we want the percentage of apps that belong to each genre, a dictionary (genre -> percentage) is a natural fit
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        genre = row[index]
        if genre in table:
            table[genre] += 1
        else:
            table[genre] = 1
            
    freq_percentage = {}
    for key in table:
        freq_percentage[key] = (table[key]/total) * 100
        
    return freq_percentage

# A dictionary is not sorted by its values, which makes the frequency table hard to read.
# display_table() below prints the entries in descending order of frequency.
# It relies on the built-in sorted() function, which takes an iterable (like a list,
# dictionary, tuple, etc.) and returns a list of its elements in ascending or descending
# order (the reverse parameter controls which). Each entry is stored as a
# (frequency, key) tuple so that sorted() orders the entries by frequency.
def display_table(dataset, index):
    table_display = []
    table = freq_table(dataset, index)
    for key in table:
        key_as_a_tuple = (table[key], key)
        table_display.append(key_as_a_tuple)
        
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], " : ", entry[0])              
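
As an alternative sketch, the standard library's `collections.Counter` can produce the same kind of sorted frequency table with less code. The function name `freq_table_sorted` and the sample rows are made up for illustration; this is not the notebook's own implementation.

```python
from collections import Counter

def freq_table_sorted(dataset, index):
    """Return (value, percentage) pairs, most frequent first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields the (value, count) pairs in descending order of count
    return [(value, count / total * 100) for value, count in counts.most_common()]

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]
for genre, percentage in freq_table_sorted(sample, 1):
    print(genre, ' : ', percentage)
```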

In [15]:
display_table(ios_final, -5) #prime_genre

Games  :  58.16263190564867
Entertainment  :  7.883302296710118
Photo & Video  :  4.9658597144630665
Education  :  3.662321539416512
Social Networking  :  3.2898820608317814
Shopping  :  2.60707635009311
Utilities  :  2.5139664804469275
Sports  :  2.1415270018621975
Music  :  2.0484171322160147
Health & Fitness  :  2.0173805090006205
Productivity  :  1.7380509000620732
Lifestyle  :  1.5828677839851024
News  :  1.3345747982619491
Travel  :  1.2414649286157666
Finance  :  1.1173184357541899
Weather  :  0.8690254500310366
Food & Drink  :  0.8069522036002483
Reference  :  0.5586592178770949
Business  :  0.5276225946617008
Book  :  0.4345127250155183
Navigation  :  0.186219739292365
Medical  :  0.186219739292365
Catalogs  :  0.12414649286157665


By inspection, we see that more than half of the free English apps on the App Store are games (**58.16%**). Entertainment apps are close to **8% (7.88%)**, followed by photo and video apps, which are close to **5%**. Apps designed for education make up only **3.66%** of the apps, and social networking apps account for only **3.29%**.

The general impression is that apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.) dominate the App Store, and apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are far less common.

In [16]:
display_table(android_final, 1) #Category

FAMILY  :  18.934777702550214
GAME  :  9.693071541412774
TOOLS  :  8.451816745655607
BUSINESS  :  4.5926427443015125
LIFESTYLE  :  3.9043105393816293
PRODUCTIVITY  :  3.8930264048747465
FINANCE  :  3.7011961182577298
MEDICAL  :  3.5206499661475967
SPORTS  :  3.39652448657188
PERSONALIZATION  :  3.3175355450236967
COMMUNICATION  :  3.238546603475513
HEALTH_AND_FITNESS  :  3.080568720379147
PHOTOGRAPHY  :  2.945159106296547
NEWS_AND_MAGAZINES  :  2.798465357707064
SOCIAL  :  2.663055743624464
TRAVEL_AND_LOCAL  :  2.335815842924848
SHOPPING  :  2.2455427668697814
BOOKS_AND_REFERENCE  :  2.143985556307831
DATING  :  1.8618821936357481
VIDEO_PLAYERS  :  1.7941773865944481
MAPS_AND_NAVIGATION  :  1.399232678853532
FOOD_AND_DRINK  :  1.2412547957571656
EDUCATION  :  1.1735499887158656
ENTERTAINMENT  :  0.9591514330850823
LIBRARIES_AND_DEMO  :  0.9365831640713158
AUTO_AND_VEHICLES  :  0.9252990295644324
HOUSE_AND_HOME  :  0.8237418190024826
WEATHER  :  0.8011735499887158
EVENTS  :  0.710900473

The picture seems different on Google Play. There are noticeably fewer apps designed for fun than apps designed for practical purposes (family, tools, business, lifestyle, productivity, etc.).

In [17]:
display_table(android_final, -4) #Genres

Tools  :  8.440532611148726
Entertainment  :  6.070864364703228
Education  :  5.348679756262695
Business  :  4.5926427443015125
Productivity  :  3.8930264048747465
Lifestyle  :  3.8930264048747465
Finance  :  3.7011961182577298
Medical  :  3.5206499661475967
Sports  :  3.4642292936131795
Personalization  :  3.3175355450236967
Communication  :  3.238546603475513
Action  :  3.1031369893929135
Health & Fitness  :  3.080568720379147
Photography  :  2.945159106296547
News & Magazines  :  2.798465357707064
Social  :  2.663055743624464
Travel & Local  :  2.324531708417964
Shopping  :  2.2455427668697814
Books & Reference  :  2.143985556307831
Simulation  :  2.0424283457458814
Dating  :  1.8618821936357481
Arcade  :  1.8505980591288649
Video Players & Editors  :  1.7716091175806816
Casual  :  1.7490408485669149
Maps & Navigation  :  1.399232678853532
Food & Drink  :  1.2412547957571656
Puzzle  :  1.128413450688332
Racing  :  0.9930038366057323
Role Playing  :  0.9365831640713158
Libraries & De

The Genres column also confirms that practical apps seem to have a better representation on Google Play. Though the difference between the Genres and the Category columns is not clear, we can notice that the Genres column has more categories. We're only looking for the bigger picture at the moment, so we'll only work with the Category column for now.
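
One reason the Genres column has more distinct values is visible in the rows printed earlier: a single entry can combine several genres with a semicolon, as in `'Art & Design;Pretend Play'`. The sketch below shows how such values could be split and counted per individual genre; the function name `count_single_genres` and the three sample values are chosen only for illustration.

```python
from collections import Counter

def count_single_genres(genre_values):
    """Split combined values such as 'Art & Design;Pretend Play' and count each part."""
    counts = Counter()
    for value in genre_values:
        for part in value.split(';'):
            counts[part.strip()] += 1
    return counts

sample = ['Art & Design', 'Art & Design;Pretend Play', 'Tools']
for genre, count in count_single_genres(sample).most_common():
    print(genre, ' : ', count)
```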

## Most Popular Apps by Genre on the App Store

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. The App Store data set has no column with the number of installs, so we'll use the total number of user ratings as a proxy, which we can find in the *rating_count_tot* column.
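
The per-genre average described above can also be computed in a single pass over the data by keeping a running total and a running count for each genre. The following is a sketch on made-up rows; the function name `average_by_group` and the sample values are hypothetical.

```python
def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column}, computed in one pass."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows: name, genre, number of ratings.
sample = [
    ['A', 'Games', '10'],
    ['B', 'Games', '30'],
    ['C', 'Music', '5'],
]
print(average_by_group(sample, 1, 2))  # {'Games': 20.0, 'Music': 5.0}
```

Compared with looping over the whole data set once per genre, this touches each row only once, which matters little at this size but scales better.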

In [18]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    number_of_genre_apps = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            number_of_genre_apps += 1
    
    avg_n_ratings = total / number_of_genre_apps
    print(genre, " : ", avg_n_ratings)

Social Networking  :  71548.34905660378
Photo & Video  :  28441.54375
Games  :  22788.6696905016
Music  :  57326.530303030304
Reference  :  74942.11111111111
Health & Fitness  :  23298.015384615384
Weather  :  52279.892857142855
Utilities  :  18684.456790123455
Travel  :  28243.8
Shopping  :  26919.690476190477
News  :  21248.023255813954
Navigation  :  86090.33333333333
Lifestyle  :  16485.764705882353
Entertainment  :  14029.830708661417
Food & Drink  :  33333.92307692308
Sports  :  23008.898550724636
Book  :  39758.5
Finance  :  31467.944444444445
Education  :  7003.983050847458
Productivity  :  21028.410714285714
Business  :  7491.117647058823
Catalogs  :  4004.0
Medical  :  612.0


On average, **navigation** apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings:

In [19]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. The same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

These few apps, which have hundreds of thousands of user ratings, skew the average number of ratings. To get a better picture of the averages, it would therefore make sense to remove them.
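
To illustrate how strongly a couple of giants pull the average up, the sketch below compares the mean of the six Navigation rating counts printed above with their median and with a mean that ignores the largest apps. The helper `mean_below` and the cut-off of 100,000 ratings are assumptions chosen for illustration only.

```python
from statistics import mean, median

def mean_below(values, cap):
    """Average only the values strictly below cap (hypothetical cut-off)."""
    kept = [value for value in values if value < cap]
    return mean(kept) if kept else None

# Rating counts of the six Navigation apps printed above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print(mean(navigation_ratings))                # about 86090.33
print(median(navigation_ratings))              # 8196.5
print(mean_below(navigation_ratings, 100000))  # 4146.25
```

The median and the capped mean are both far below the plain mean, which suggests that a typical navigation app has far fewer ratings than the genre average implies.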

Taking another look, we observe that reference apps have 74,942 user ratings on average, but this average is skewed by the Bible and Dictionary.com apps:

In [20]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


There seems to be room for a niche here. We could take another popular book and turn it into an app with attractive features such as quotes from the book, an audio version, etc. Additionally, we could integrate a dictionary within the app so that users don't have to leave our app for an external one.

This idea seems plausible because the App Store is dominated by for-fun apps. Our app might have a better chance of standing out among the large number of for-fun apps saturating the market.

Other genres that seem popular, like weather, book, food and drink, or finance, will not be our focus. People generally don't spend much time in weather apps, so the chances of making a profit from in-app ads are low. Food and drink apps require actual cooking and a delivery service, which is outside the scope of our company. Finance is outside our scope too.

# Most Popular Apps by Genres on Google Play

The Google Play data set has a column that gives the number of installs for every app. However, the numbers are not precise, since the data points in that column are open-ended values (100+, 1,000+, 5,000+, etc.).

In [21]:
display_table(android_final, 5) # the Installs column

1,000,000+  :  15.741367637102236
100,000+  :  11.554953735048521
10,000,000+  :  10.516813360415256
10,000+  :  10.200857594222523
1,000+  :  8.395396073121193
100+  :  6.917174452719477
5,000,000+  :  6.838185511171294
500,000+  :  5.574362446400361
50,000+  :  4.773188896411646
5,000+  :  4.513653802753328
10+  :  3.5432182351613632
500+  :  3.2498307379823967
50,000,000+  :  2.2906793048973144
100,000,000+  :  2.1214172872940646
50+  :  1.9183028661701647
5+  :  0.7898894154818324
1+  :  0.5077860528097494
500,000,000+  :  0.2708192281651997
1,000,000,000+  :  0.22568269013766643
0+  :  0.045136538027533285
0  :  0.011284134506883321


So we will strip the commas and the plus signs, convert the install numbers (which are strings) to floats, and compute the average number of installs for each genre (category).
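
The conversion step can be isolated in a tiny helper, sketched below under the assumption that every value in the column looks like the buckets listed above. The name `parse_installs` is hypothetical.

```python
def parse_installs(text):
    """Turn an install bucket such as '1,000,000+' into the float 1000000.0."""
    return float(text.replace(',', '').replace('+', ''))

for bucket in ['1,000,000+', '500+', '0']:
    print(bucket, '->', parse_installs(bucket))
```

Because each bucket is open-ended, the parsed number is a lower bound: an app listed as `1,000,000+` is treated as having exactly 1,000,000 installs.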

In [22]:
category_android = freq_table(android_final, 1)
table_category = {}
table_category_list = []

for category in category_android:
    total_installs = 0
    total_num_of_apps = 0
    for app in android_final:
        if app[1] == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total_installs += float(n_installs)
            total_num_of_apps += 1        
    avg_num = total_installs / total_num_of_apps
    table_category[category] = avg_num

for key in table_category:
    key_as_a_tuple = (table_category[key], key)
    table_category_list.append(key_as_a_tuple)
        
table_sorted = sorted(table_category_list, reverse=True)
for entry in table_sorted:
    print(entry[1], " : ", entry[0])

COMMUNICATION  :  38456119.167247385
VIDEO_PLAYERS  :  24727872.452830188
SOCIAL  :  23253652.127118643
PHOTOGRAPHY  :  17805627.643678162
PRODUCTIVITY  :  16787331.344927534
GAME  :  15560965.599534342
TRAVEL_AND_LOCAL  :  13984077.710144928
ENTERTAINMENT  :  11640705.88235294
TOOLS  :  10682301.033377837
NEWS_AND_MAGAZINES  :  9549178.467741935
BOOKS_AND_REFERENCE  :  8767811.894736841
SHOPPING  :  7036877.311557789
PERSONALIZATION  :  5201482.6122448975
WEATHER  :  5074486.197183099
HEALTH_AND_FITNESS  :  4188821.9853479853
MAPS_AND_NAVIGATION  :  4056941.7741935486
FAMILY  :  3694276.334922527
SPORTS  :  3638640.1428571427
ART_AND_DESIGN  :  1986335.0877192982
FOOD_AND_DRINK  :  1924897.7363636363
EDUCATION  :  1820673.076923077
BUSINESS  :  1712290.1474201474
LIFESTYLE  :  1437816.2687861272
FINANCE  :  1387692.475609756
HOUSE_AND_HOME  :  1331540.5616438356
DATING  :  854028.8303030303
COMICS  :  817657.2727272727
AUTO_AND_VEHICLES  :  647317.8170731707
LIBRARIES_AND_DEMO  :  638

After sorting the average installs for each genre, we find that communication apps have the most installs (38,456,119). This number is heavily skewed by a few extremely popular apps such as WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts.

In [23]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

In [24]:
# Now, let's compute the average for communication apps with fewer than 100 million installs
under_100_m = []
for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386

In addition to communication apps, video player apps, social apps, photography apps, and productivity apps exhibit the same pattern. Moreover, these niches seem to be dominated by a few giants that are hard to compete against. The games genre seems pretty popular but a bit saturated.
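
The same check can be repeated for any category by turning it into a function. The sketch below assumes the Google Play column layout used throughout (category at index 1, installs at index 5); the function name `average_installs_below`, the cut-off value, and the sample rows are made up for illustration.

```python
def average_installs_below(rows, category, cap):
    """Average installs of one category, ignoring apps at or above cap."""
    kept = []
    for row in rows:
        installs = float(row[5].replace(',', '').replace('+', ''))
        if row[1] == category and installs < cap:
            kept.append(installs)
    return sum(kept) / len(kept) if kept else None

# Made-up rows in the Google Play layout (only columns 1 and 5 matter here).
sample = [
    ['Big Chat', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '', '', '', '10,000+'],
    ['Tiny Chat', 'COMMUNICATION', '', '', '', '1,000+'],
    ['Player', 'VIDEO_PLAYERS', '', '', '', '5,000,000+'],
]
print(average_installs_below(sample, 'COMMUNICATION', 100_000_000))  # 5500.0
```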

The next niches we could consider are the 'news and magazines' and 'books and reference' genres, which look fairly popular as well, with an average number of installs of 9,549,178 and 8,767,811 respectively. Since we have already found that the book and reference genre could plausibly work well on the App Store, and our aim is to recommend an app with the potential to generate profit on both the App Store and Google Play, we will consider that genre next.

In [25]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ' : ', app[5])

E-Book Read - Read Book for free  :  50,000+
Download free book with green book  :  100,000+
Wikipedia  :  10,000,000+
Cool Reader  :  10,000,000+
Free Panda Radio Music  :  100,000+
Book store  :  1,000,000+
FBReader: Favorite Book Reader  :  10,000,000+
English Grammar Complete Handbook  :  500,000+
Free Books - Spirit Fanfiction and Stories  :  1,000,000+
Google Play Books  :  1,000,000,000+
AlReader -any text book reader  :  5,000,000+
Offline English Dictionary  :  100,000+
Offline: English to Tagalog Dictionary  :  500,000+
FamilySearch Tree  :  1,000,000+
Cloud of Books  :  1,000,000+
Recipes of Prophetic Medicine for free  :  500,000+
ReadEra – free ebook reader  :  1,000,000+
Anonymous caller detection  :  10,000+
Ebook Reader  :  5,000,000+
Litnet - E-books  :  100,000+
Read books online  :  5,000,000+
English to Urdu Dictionary  :  500,000+
eBoox: book reader fb2 epub zip  :  1,000,000+
English Persian Dictionary  :  500,000+
Flybook  :  500,000+
All Maths Formulas  :  1,000

This genre contains a variety of Google Play apps: dictionaries, ebook readers, collections of libraries, etc. Inspecting this result suggests that there is still a small number of very popular apps that skew the average.

In [26]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


However, the result above shows only a few very popular apps, so this market still shows potential.

### Now let's consider the Genres column of the Google Play data set

In [27]:
genre_android = freq_table(android_final, -4)
table_genre = {}
table_genre_list = []

for genre in genre_android:
    total_installs = 0
    total_num_of_apps = 0
    for app in android_final:
        if app[9] == genre:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total_installs += float(n_installs)
            total_num_of_apps += 1        
    avg_num = total_installs / total_num_of_apps
    table_genre[genre] = avg_num

for key in table_genre:
    key_as_a_tuple = (table_genre[key], key)
    table_genre_list.append(key_as_a_tuple)
        
table_sorted = sorted(table_genre_list, reverse=True)
for entry in table_sorted:
    print(entry[1], " : ", entry[0])

Communication  :  38456119.167247385
Adventure;Action & Adventure  :  35333333.333333336
Video Players & Editors  :  24947335.796178345
Social  :  23253652.127118643
Arcade  :  22888365.48780488
Casual  :  19630958.51612903
Puzzle;Action & Adventure  :  18366666.666666668
Photography  :  17805627.643678162
Educational;Action & Adventure  :  17016666.666666668
Productivity  :  16787331.344927534
Racing  :  15910645.681818182
Travel & Local  :  14051476.145631067
Casual;Action & Adventure  :  12916666.666666666
Action  :  12603588.872727273
Strategy  :  11199902.530864198
Tools  :  10683213.20053476
Tools;Education  :  10000000.0
Role Playing;Brain Games  :  10000000.0
Lifestyle;Pretend Play  :  10000000.0
Casual;Music & Video  :  10000000.0
Card;Action & Adventure  :  10000000.0
Adventure;Education  :  10000000.0
News & Magazines  :  9549178.467741935
Music  :  9445583.333333334
Educational;Pretend Play  :  9375000.0
Word  :  9094458.695652174
Puzzle;Brain Games  :  9013125.0
Racing;Act

Using the Genres column of the Google Play data set, we still find that communication, video player, and social apps have extremely high numbers of installs. Books and Reference, which is currently our recommendation, still has a fairly high average number of installs (8,767,811). It could be useful to add features similar to those of puzzle and brain games to our book app, since those genres also have a decent number of installs. Other suggested features include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

The books and reference niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there would be significant competition.

We also notice that quite a few apps are built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent one) and turning it into an app could be profitable for both the Google Play and the App Store markets.

# Conclusion

The goal of this project was to recommend an app profile that can be profitable on both the App Store and Google Play, based on data-driven analysis. To do so, we analyzed data about mobile apps in both markets, identified patterns, and provided recommendations.

We concluded that taking a popular book (perhaps a more recent one) and creating an app out of it could be profitable for both the Google Play and App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book, such as daily quotes from the book, an audio version of the book, quizzes, puzzles and brain games on the book, a forum where people can discuss the book, etc.