# Profitable App Profiles to Develop for Google Play and App Store Markets

In this project, we try to find app profiles that are profitable for iOS and Android developers in their respective markets. We are working as data analysts for a company that develops both Android and iOS apps. This project is therefore focused on helping the company make informed decisions about which apps to build.

Since the company develops free applications for both Google Play and the App Store, its revenue comes from in-app ads. This in turn depends on the number of downloads and the number of users who interact with those ads. Our main goal is therefore to analyse current data on both Google Play and the App Store to find out which app profiles are likely to attract more users and be profitable for the company to develop.

There are two data sets that seem suitable for our goals:

   * A data set containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. The data set was downloaded directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
   * A data set containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. The data set was downloaded directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).
   
   
We'll start by opening and exploring these two data sets. To make them easier to explore, we created a function named `explore_data()` that can be used to repeatedly print rows in a readable way.

In [1]:
from csv import reader

## Data set for Android apps on Google Play
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
android_header = android[0]   # first row holds the column names
android = android[1:]         # remaining rows are the apps

## Data set for iOS apps on App Store
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
iOS = list(read_file)
opened_file.close()
iOS_header = iOS[0]
iOS = iOS[1:]
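
The two loading steps above follow the same pattern, so they could be wrapped in a small helper. The sketch below is only an illustration: `load_dataset` is a hypothetical name that does not appear elsewhere in this project, and it assumes the files are UTF-8 encoded.

```python
from csv import reader


def load_dataset(path, encoding='utf8'):
    """Read a CSV file and return (header, rows).

    `load_dataset` is a hypothetical helper, not part of the original notebook.
    """
    # The `with` block closes the file even if reading fails
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]


# Possible usage (assumes the CSV files sit next to the notebook):
# android_header, android = load_dataset('googleplaystore.csv')
# iOS_header, iOS = load_dataset('AppleStore.csv')
```

Returning the header and the rows separately mirrors the `android_header` / `android` split used in the cell above.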

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(android_header)
print('\n')
explore_data(android, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


In [3]:
print(iOS_header)
print('\n')
explore_data(iOS, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
print(android_header)
print('\n')
print(iOS_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The two lists above contain the column names of the Android and iOS data sets respectively. From the Android data set, the columns that are essential for our analysis are *App*, *Category*, *Rating*, *Installs*, *Type* and *Genres*.
The essential columns from the iOS data set are *track_name*, *currency*, *price*, *rating_count_tot*, *rating_count_ver* and *prime_genre*.

For descriptions of the App Store columns, see the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home#AppleStore.csv).

## Data Cleaning

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error in row 10472 of the Android data set. This row is missing its category (shown below), which has shifted the remaining columns in that row one position to the left.

In [5]:
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
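
The row above has 12 values while the header has 13 columns. A length check like the one below could locate such rows without relying on the discussion forum. This is a minimal sketch: `find_malformed_rows` is a hypothetical helper and the data is a toy example, not the real file.

```python
def find_malformed_rows(dataset, header):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]


# Toy data (not taken from the real files)
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Sketch', 'ART_AND_DESIGN', '4.5'],
    ['Photo Frame', '1.9'],            # category missing, so the row is short
    ['Slack', 'BUSINESS', '4.4'],
]
print(find_malformed_rows(toy_rows, toy_header))  # [1]
```

On the real data, the call would be `find_malformed_rows(android, android_header)`.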


### Part 1
#### Deleting Rows with Missing Values and Duplicate Rows

We're going to start our data cleaning process by deleting this particular row from our data set.

In [6]:
del android[10472]

The Google Play data set contains duplicate entries for some apps. The code below lists a few examples of apps with duplicate entries and then prints the total number of duplicates.

In [7]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Examples of apps with duplicate entries:', duplicate_apps[:10])
print('\n')
print('Number of duplicate entries:', len(duplicate_apps))

Examples of apps with duplicate entries: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


Number of duplicate entries: 1181
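
The check `name in unique_apps` scans a list on every iteration, which gets slower as the list grows. A set gives the same result with constant-time lookups on average. The sketch below illustrates the idea on toy rows; `count_duplicates` is a hypothetical helper name.

```python
def count_duplicates(dataset, name_index=0):
    """Return (duplicate_names, unique_names) for the given name column."""
    seen = set()          # names encountered so far
    duplicates = []       # one entry per repeated occurrence
    for row in dataset:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates, seen


# Toy rows: only the app name matters here
toy_apps = [['Box'], ['Slack'], ['Box'], ['Box']]
toy_duplicates, toy_unique = count_duplicates(toy_apps)
print(toy_duplicates)       # ['Box', 'Box']
print(len(toy_unique))      # 2
```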


When we examine these duplicate entries closely, we notice that some duplicates differ in the number of reviews (the fourth value in each row), while others are identical in every column. Examples are shown below:

In [8]:
for app in android:
    name = app[0]
    if name == 'Google Ads':
        print(app)
print('\n')
for app in android:
    name = app[0]
    if name == 'ZOOM Cloud Meetings':
        print(app)
print('\n')
for app in android:
    name = app[0]
    if name == 'Quick PDF Scanner + OCR FREE':
        print(app)
print('\n')
for app in android:
    name = app[0]
    if name == 'Slack':
        print(app)

['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29331', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']


['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614', '37M', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 20, 2018', '4.1.28165.0716', '4.0 and up']
['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614', '37M', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 20, 2018', '4.1.28165.0716', '4.0 and up']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2',

From the examples above, it can be seen that **Google Ads, Quick PDF Scanner + OCR FREE and Slack**, with 3 entries each, have 2 different values in the fourth column (reviews). On the other hand, **ZOOM Cloud Meetings** has 2 entries that are identical in every column.

Therefore, we have to remove these duplicates, leaving just one entry per app.

Rather than removing duplicates at random, we keep only the entry with the highest number of reviews, since it is most likely the most recent one.

In [9]:
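reviews_max = {}

# NOTE: the header row was already removed when the file was loaded,
# so `android[1:]` also skips the first app in the data set.
for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])

    # Keep the highest review count seen so far for each app name
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

    elif name not in reviews_max:
        reviews_max[name] = n_reviews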

In a previous code cell, we showed that the number of duplicate entries is 1,181. Therefore, the length of our dictionary (`reviews_max`) must equal the number of rows we looped over minus 1,181. Let's check in the code below:

In [10]:
print('Expected length:', len(android[1:]) - 1181)
print('Actual Dictionary length:', len(reviews_max))

Expected length: 9658
Actual Dictionary length: 9658


Now, we want to remove the duplicate rows, keeping only the row with the maximum number of reviews for each app.

As shown in the code below, we want to achieve the above goal by:

- Creating two empty lists:
    1. `android_clean` (this will store our new cleaned data set)
    2. `already_added` (this will store app names)
- Looping through the Google Play data set and, for each iteration:
    * We isolate the name of the app and the number of reviews.
    * Then we add the current row (`app`) to the `android_clean` list, and the app name (`name`) to the `already_added` list if:
        - The number of reviews of the current app matches the number of reviews of that app as described in the `reviews_max` dictionary; and
        - The name of the app is not already in the `already_added` list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for `reviews_max[name] == n_reviews`, we'll still end up with duplicate entries for some apps.

In [11]:
android_clean = []
already_added = []

for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
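
The two passes above (build `reviews_max`, then filter) can also be packaged as one function. The following is a sketch under assumptions: `keep_most_reviewed` is a hypothetical name, it uses a set instead of the `already_added` list, and the toy rows only keep the first four columns of the Google Ads and ZOOM entries shown earlier.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    # Pass 1: highest review count per app name
    reviews_max = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        reviews_max[name] = max(n_reviews, reviews_max.get(name, n_reviews))

    # Pass 2: keep the first row that reaches that maximum
    cleaned = []
    added = set()
    for row in dataset:
        name = row[name_index]
        if float(row[reviews_index]) == reviews_max[name] and name not in added:
            cleaned.append(row)
            added.add(name)
    return cleaned


toy_android = [
    ['Google Ads', 'BUSINESS', '4.3', '29313'],
    ['Google Ads', 'BUSINESS', '4.3', '29313'],
    ['Google Ads', 'BUSINESS', '4.3', '29331'],
    ['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614'],
    ['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614'],
]
for row in keep_most_reviewed(toy_android):
    print(row)
```

In the toy example, Google Ads keeps its 29331-review row and ZOOM Cloud Meetings keeps a single copy of its identical rows.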

Let's check whether the number of rows in our new data set is 9,658. We also display the first three rows of the data set.

In [12]:
explore_data(android_clean, 0, 3, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9658
Number of columns: 13


Exactly as expected! We have 9,658 rows in our new data set.

### Part 2
#### Deleting Entries for Non-English Apps

Our company develops apps targeted at English-speaking individuals. If we are to explore our two data sets, we would find some apps with non-English names. Examples of these apps present in our data sets are shown below:

In [13]:
print(android_clean[4411][0])
print(android_clean[7939][0])
print('\n')
print(iOS[813][1])
print(iOS[6731][1])

‰∏≠ÂõΩË™û AQ„É™„Çπ„Éã„É≥„Ç∞
ŸÑÿπÿ®ÿ© ÿ™ŸÇÿØÿ± ÿ™ÿ±ÿ®ÿ≠ DZ


Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠
„ÄêËÑ±Âá∫„Ç≤„Éº„É†„ÄëÁµ∂ÂØæ„Å´ÊúÄÂæå„Åæ„Åß„Éó„É¨„Ç§„Åó„Å™„ÅÑ„Åß „ÄúË¨éËß£„ÅçÔºÜ„Éñ„É≠„ÉÉ„ÇØ„Éë„Ç∫„É´„Äú


To obtain accurate results in our analysis, we want to remove the apps with non-English names from our data sets. We create a function that checks whether an app name is English: English text mostly uses characters in the ASCII range (code points 0 to 127), so the function returns `False` as soon as it finds a character outside that range. Below, we check whether the function works effectively.

In [14]:
def is_english(string):
    
    for character in string:
        if ord(character) > 127:
            return False
        
    return True
    
print(is_english('Instagram'))
print(is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))
print(is_english('Docs To Go‚Ñ¢ Free Office Suite'))
print(is_english('Instachat üòú'))

True
False
False
False


The function above incorrectly labels some English apps as non-English because their names contain special characters such as ™ or emoji. Using it as is would remove many useful apps from our data.

Below, we modify this function to filter out non-English apps while minimizing data loss. To do this, we only remove an app if its name has more than three characters outside the ASCII range.

In [15]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:    
        return True
    
print(is_english('Instagram'))
print(is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))
print(is_english('Docs To Go‚Ñ¢ Free Office Suite'))
print(is_english('Instachat üòú'))

True
False
True
True


The function above seems to work!
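
The threshold of three is a judgment call, so it can be convenient to expose it as a parameter. The sketch below is one possible variant; `is_probably_english` and `max_non_ascii` are hypothetical names chosen for illustration and are not used elsewhere in this project.

```python
def is_probably_english(string, max_non_ascii=3):
    """Return True if at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Instagram'))                                      # True
print(is_probably_english('Docs To Go™ Free Office Suite'))                  # True
print(is_probably_english('Docs To Go™ Free Office Suite', max_non_ascii=0)) # False
```

Setting `max_non_ascii=0` reproduces the stricter first version of the function.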

Now we use this function to filter out non-English apps from our two data sets (`android_clean` and `iOS`). We store the English apps of each data set in a new list and count the rows in each.

In [16]:
english_android = []
english_iOS = []
    
for app in android_clean:
    name = app[0]
    
    if is_english(name):
        english_android.append(app)
            
for app in iOS:
    name = app[1]
    
    if is_english(name):
        english_iOS.append(app)
        
explore_data(english_android, 0, 3, True)
print('\n')
explore_data(english_iOS, 0, 3, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9613
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

Previously, we had 9,658 rows in `android_clean` and 7,197 rows in `iOS` (header rows excluded). After removing the non-English apps, we are left with 9,613 rows for the Android data set and 6,183 rows for the iOS data set.

Therefore, we have removed 45 non-English apps from the Android data set and 1,014 non-English apps from the iOS data set.

So far, our data cleaning process has involved:

* Removing inaccurate data
* Removing duplicate app entries
* Removing non-English apps

Now, remember that our company only builds apps that are free to download and install, and the main source of revenue is in-app ads. But our data sets contain both free and non-free apps. Hence, there's a need to isolate only free apps for our analysis.

In this process of isolating free apps, we create separate lists for free apps (based on the price column) and then check the number of rows (i.e. apps) that we are left with. 

In [17]:
free_android_apps = []
free_iOS_apps = []

for app in english_android:
    price = app[7]
    if price == '0':
        free_android_apps.append(app)
        
for app in english_iOS:
    price = app[4]
    if price == '0.0':
        free_iOS_apps.append(app)
        
        
print(len(free_android_apps))
print(len(free_iOS_apps))

8863
3222


We're now left with 8,863 Android apps and 3,222 iOS apps, which is enough for our analysis.
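
The filter above compares price strings exactly (`'0'` for Android, `'0.0'` for iOS). A numeric comparison would treat both spellings the same way. The sketch below assumes that paid Google Play prices carry a leading dollar sign (for example `'$4.99'`); that format is an assumption here, since no paid row is shown in this notebook, and `is_free` is a hypothetical helper.

```python
def is_free(price):
    """Return True if a price string represents zero.

    Assumes prices look like '0', '0.0' or '$4.99' (the last one is an
    assumed format, not shown in the outputs above).
    """
    return float(price.replace('$', '').strip()) == 0.0


print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```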

Excellent! Through our data cleaning process, we have successfully been able to:

1) Remove inaccurate data
2) Remove duplicate app entries
3) Remove non-English apps
4) Isolate free apps into separate lists

Now let's move on to the analysis.


## Data Analysis

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps. 

To achieve our aim while minimizing risks and overhead, our validation strategy for an app idea comprises these three steps:

   1. Build a minimal Android version of the app, and add it to Google Play.
   2. If the app has a good response from users, we develop it further.
   3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.
   
 
Since we want to build the app for both Google Play and App Store, we need to find app profiles that are successful on both markets. 

### Part 1

We begin this analysis by getting a sense of what the most common genres are for each market. In order to do this, we'll build frequency tables for a few columns in our data sets.

Let's inspect the header rows of our two data sets to try and know which columns will be essential for building the frequency tables (based on how common the genres are in each market).

In [18]:
print(android_header)
print('\n')
print(iOS_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


From the `android_header` list above, the *Category* (2nd) and *Genres* (10th) columns are needed for building our frequency tables.
From the `iOS_header` list, we need *prime_genre* (the 12th column, at index 11 or equivalently -5) to build the second frequency table.

We are going to build two functions in order to generate and analyze our frequency tables:

 * A function to generate frequency tables that show percentages
 * Another function to display the percentages in a descending order

In [19]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
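
For comparison, the same frequency table can be built with `collections.Counter` from the standard library. This is only an alternative sketch; `freq_table_counter` and `sorted_percentages` are hypothetical names, and the rows are toy data.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


def sorted_percentages(dataset, index):
    """Return (value, percentage) pairs, largest percentage first."""
    table = freq_table_counter(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


toy_categories = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]
for value, percentage in sorted_percentages(toy_categories, 1):
    print(value, ':', percentage)
# GAME : 75.0
# TOOLS : 25.0
```

Sorting with `key=lambda item: item[1]` avoids building the intermediate list of `(percentage, key)` tuples.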

We start by reviewing the frequency table for the `Category` column in the Android data set to find the category with the largest number of apps.

In [20]:
display_table(free_android_apps, 1)

FAMILY : 18.910075595170937
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

As can be seen above, the **Family** category has the highest frequency, followed by **Game** and **Tools**. This frequency table suggests that a larger proportion of apps on Google Play are made for practical purposes (Family, Tools, Business, Lifestyle, etc.). Apps designed solely for fun (the Game category) represent only about 10% of the total.

However, if we examine each category, we find that the Family category consists mostly of games for children. Nevertheless, practical apps still have a larger presence on Google Play.

Next, we'll examine the frequency table for the Genres column.

In [21]:
display_table(free_android_apps, 9)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

The difference between the Category and Genres columns is not entirely clear. However, one notable difference between both of them is that the Genres column has more categories. Since we're after a bigger picture at this moment, we'll only consider the Category column as we move forward.

Next, we'll examine the `prime_genre` column in the iOS data set.

In [22]:
display_table(free_iOS_apps, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Unlike what we found in the Android frequency tables, the Games genre dominates the App Store with 58% (well over half) of the free English apps. Entertainment follows with close to 8%, and Photo & Video with about 5%. Only about 3.7% of the apps fall into the Education genre, 2.6% into Shopping and 2.5% into Utilities.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps designed for fun (games, entertainment, photo and video, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous does not imply that they also have the greatest number of users: the demand might not match the offer.

### Part 2

With our analysis so far, we've found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps. Now, we'd like to get an idea about the kind of apps that have most users.

One way to know what genres have the most users is to calculate the average number of installs for each app genre.
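
In the Google Play data, the `Installs` column holds strings such as `'10,000+'`, so they need converting before any average can be computed. The sketch below shows one possible conversion; `parse_installs` and `average_installs` are hypothetical helpers, and treating an open-ended value like `'10,000+'` as exactly 10,000 is a simplifying assumption. The toy rows reuse the four Android apps printed at the start of this project.

```python
def parse_installs(value):
    """Convert a string like '10,000+' to a float (10000.0)."""
    return float(value.replace(',', '').replace('+', ''))


def average_installs(dataset, category, category_index=1, installs_index=5):
    """Average parsed installs over the rows belonging to `category`."""
    values = [parse_installs(row[installs_index])
              for row in dataset if row[category_index] == category]
    return sum(values) / len(values) if values else 0.0


# Toy rows: only the name, category and installs columns are filled in
toy_installs = [
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '', '', '', '10,000+'],
    ['Coloring book moana', 'ART_AND_DESIGN', '', '', '', '500,000+'],
    ['U Launcher Lite', 'ART_AND_DESIGN', '', '', '', '5,000,000+'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '', '', '', '50,000,000+'],
]
print(parse_installs('10,000+'))                          # 10000.0
print(average_installs(toy_installs, 'ART_AND_DESIGN'))   # 13877500.0
```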


### Apps with Most Users by Genre on App Store

The App Store data set contains no information on the number of installs. However, the `rating_count_tot` column, which holds the total number of user ratings, can serve as a proxy for it.

To calculate the average number of user ratings per app genre on App Store, we are going to:

   * Isolate the apps of each genre
   * Sum up the user ratings for the apps of that genre
   * Divide the sum by the number of apps belonging to a particular genre

In [23]:
iOS_genres = freq_table(free_iOS_apps, -5)

for genre in iOS_genres:
    total = 0
    len_genre = 0
    for app in free_iOS_apps:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
            
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Lifestyle : 16485.764705882353
Health & Fitness : 23298.015384615384
Reference : 74942.11111111111
News : 21248.023255813954
Music : 57326.530303030304
Utilities : 18684.456790123455
Productivity : 21028.410714285714
Catalogs : 4004.0
Weather : 52279.892857142855
Book : 39758.5
Education : 7003.983050847458
Shopping : 26919.690476190477
Navigation : 86090.33333333333
Food & Drink : 33333.92307692308
Travel : 28243.8
Sports : 23008.898550724636
Games : 22788.6696905016
Social Networking : 71548.34905660378
Entertainment : 14029.830708661417
Finance : 31467.944444444445
Business : 7491.117647058823
Photo & Video : 28441.54375
Medical : 612.0
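
The nested loop above scans the whole data set once per genre. The same averages can be accumulated in a single pass with two dictionaries. This is a sketch: `average_by_genre` is a hypothetical helper, and the toy rows hold only a name, a rating count and a genre (the rating counts are taken from the outputs in this notebook).

```python
def average_by_genre(dataset, genre_index, value_index):
    """Return {genre: average of the numeric column} in one pass."""
    totals = {}
    counts = {}
    for row in dataset:
        genre = row[genre_index]
        totals[genre] = totals.get(genre, 0.0) + float(row[value_index])
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


toy_ios = [
    ['Waze', '345046', 'Navigation'],
    ['Google Maps', '154911', 'Navigation'],
    ['Bible', '985920', 'Reference'],
]
print(average_by_genre(toy_ios, genre_index=2, value_index=1))
# {'Navigation': 249978.5, 'Reference': 985920.0}
```

On the real data, the call would be `average_by_genre(free_iOS_apps, -5, 5)`.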


On average, Navigation apps have the most user ratings on the App Store, followed by Reference, Social Networking, Music and Weather. However, if we examine the data critically, we find that the Navigation figure is driven by Waze and Google Maps, which together have close to half a million user ratings (shown below). The other genres mentioned above also contain a few apps from big companies that inflate the genre's average (shown below).

Therefore, apps in these genres might not actually be the best recommendations for our company, and we cannot hastily settle on an app profile.

In [24]:
for app in free_iOS_apps:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print app name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
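
One way to see how strongly a couple of giant apps pull the average is to compare the mean with the median, which is far less sensitive to outliers. The sketch below uses the six Navigation rating counts listed above; using the median here is a suggestion, not part of the original analysis.

```python
from statistics import mean, median

# Rating counts of the free Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('Mean:  ', mean(navigation_ratings))    # about 86090.33, as in the genre table
print('Median:', median(navigation_ratings))  # 8196.5
```

The median is roughly ten times smaller than the mean, which reflects how much Waze and Google Maps dominate this genre.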


In [25]:
for app in free_iOS_apps:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5]) # print app name and number of ratings


Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The Bible and Dictionary.com Dictionary & Thesaurus apps account for most of the ratings in the Reference genre and inflate its average.

In [26]:
for app in free_iOS_apps:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5]) # print app name and number of ratings


Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miito

Facebook and Pinterest together have approximately 4 million ratings. These two apps heavily influence the Social Networking genre, whose apps average 71,548 ratings.

The same pattern is recognizable in the Music genre. As seen below, Pandora and Spotify strongly skew the average number of ratings for the Music genre.

In [27]:
for app in free_iOS_apps:
    if app[-5] == 'Music':
        print(app[1], ':', app[5]) # print app name and number of ratings

Pandora - Music & Radio : 1126879
Spotify Music : 878563
Shazam - Discover music, artists, videos & lyrics : 402925
iHeartRadio – Free Music & Radio Stations : 293228
SoundCloud - Music & Audio : 135744
Magic Piano by Smule : 131695
Smule Sing! : 119316
TuneIn Radio - MLB NBA Audiobooks Podcasts Music : 110420
Amazon Music : 106235
SoundHound Song Search & Music Player : 82602
Sonos Controller : 48905
Bandsintown Concerts : 30845
Karaoke - Sing Karaoke, Unlimited Songs! : 28606
My Mixtapez Music : 26286
Sing Karaoke Songs Unlimited with StarMaker : 26227
Ringtones for iPhone & Ringtone Maker : 25403
Musi - Unlimited Music For YouTube : 25193
AutoRap by Smule : 18202
Spinrilla - Mixtapes For Free : 15053
Napster - Top Music & Radio : 14268
edjing Mix:DJ turntable to remix and scratch music : 13580
Free Music - MP3 Streamer & Playlist Manager Pro : 13443
Free Piano app by Yokee : 13016
Google Play Music : 10118
Certified Mixtapes - Hip Hop Albums & Mixtapes : 9975
TIDAL : 7398
YouTube 
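One way to quantify how strongly a few giants skew a genre's average is to compare the mean with the median. The sketch below is only illustrative: it hard-codes the Music rating counts printed above (the listing is cut off, so the list is partial) and uses Python's `statistics` module rather than the notebook's own helpers. The name `music_ratings` is introduced here and is not part of the original analysis.

```python
from statistics import mean, median

# Rating counts copied from the (partial) Music listing above.
music_ratings = [
    1126879, 878563, 402925, 293228, 135744, 131695, 119316, 110420,
    106235, 82602, 48905, 30845, 28606, 26286, 26227, 25403, 25193,
    18202, 15053, 14268, 13580, 13443, 13016, 10118, 9975, 7398,
]

mean_ratings = mean(music_ratings)
median_ratings = median(music_ratings)

# Average with the two largest apps (Pandora and Spotify) left out.
without_top_two = sorted(music_ratings, reverse=True)[2:]
mean_without_top_two = mean(without_top_two)

print('Mean:', round(mean_ratings))
print('Median:', median_ratings)
print('Mean without top two:', round(mean_without_top_two))
```

A mean far above the median signals that a handful of very large apps dominate the genre, which is the effect described for Social Networking and Music.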

### App Profiles Recommendation for App Store
Based on the information obtained so far, namely that:

   * The Games category dominates the App Store with 58% of the apps
   * Navigation is the most popular genre on the App Store, followed by Reference, Social Networking, Music and Weather

we can make app profile recommendations for our company. Two criteria guide our recommendations: many people should install the app, and they should spend enough time in it to interact with the in-app ads, which are our company's source of revenue.

Navigation apps, although the most popular genre on the App Store, cannot be recommended. Apps in the Navigation category (mainly maps and GPS tools) are not games, and people don't spend enough active time in them. In-app ads may not suit apps that users mostly run in the background, or apps such as *Speedometer*. Such apps are meant to support safe driving, so in-app ads may be a distraction, and users may not have time to interact with them. The Weather genre has a further drawback: getting reliable live weather data may require us to connect our apps to non-free APIs.

A notable genre to consider is Reference, which comes second in the popularity ranking. We could develop book apps that incorporate quizzes, quotes and brain teasers. This could attract more users and increase their interaction with the apps. Since users can derive fun from such apps, the Reference genre (combined with game elements through our company's ingenuity) is a strong recommendation.

Dictionary apps with embedded word games such as Scrabble, word search, riddles, crosswords and word puzzles should also be a major focus for our developers. Another recommendation is to develop audiobook apps that convert sections of popular religious books and reference texts to audio. This has great potential to attract users and generate revenue through in-app ads.

Other popular genres such as Music, Social Networking and Food could also have been recommended. However, most apps in these categories involve delivering a service (which our company does not do) and require very large financial resources. Therefore, such app profiles are not a good choice for our company.



### Apps with Most Users by Genre on Google Play

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture of genre popularity. However, the install numbers are not precise. Most values are open-ended buckets (100,000+, 1,000,000+, 5,000,000+, etc.):

In [28]:
display_table(free_android_apps, 5) # Installs column

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.188423784271691
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835
0 : 0.011282861333634209


We cannot determine the exact number of installs for buckets such as 1,000,000+ or 100,000+ from the table above. Fortunately, we don't need precise figures to determine which app categories attract the most users.

We're going to leave the numbers as they are, which means that we'll treat an app with 100,000+ installs as having 100,000 installs, an app with 1,000,000+ installs as having 1,000,000 installs, and so on. However, to perform the computations, we need to convert each install number to a float, removing the commas and the plus characters in the process. We'll do this directly in the loop below, where we also compute the average number of installs for each category.

In [37]:
categories_android = freq_table(free_android_apps, 1)  # table of categories (column 1)

for category in categories_android:
    total = 0         # sum of installs in this category
    len_category = 0  # number of apps in this category
    for app in free_android_apps:
        category_app = app[1]
        if category_app == category:
            # Strip ',' and '+' so that '1,000,000+' becomes 1000000.0
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

EVENTS : 253542.22222222222
DATING : 854028.8303030303
LIBRARIES_AND_DEMO : 638503.734939759
PRODUCTIVITY : 16787331.344927534
BEAUTY : 513151.88679245283
BUSINESS : 1712290.1474201474
FOOD_AND_DRINK : 1924897.7363636363
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
HEALTH_AND_FITNESS : 4188821.9853479853
GAME : 15588015.603248259
SHOPPING : 7036877.311557789
HOUSE_AND_HOME : 1331540.5616438356
BOOKS_AND_REFERENCE : 8767811.894736841
ART_AND_DESIGN : 2021626.7857142857
FINANCE : 1387692.475609756
EDUCATION : 1833495.145631068
TRAVEL_AND_LOCAL : 13984077.710144928
PARENTING : 542603.6206896552
LIFESTYLE : 1437816.2687861272
VIDEO_PLAYERS : 24727872.452830188
FAMILY : 3695641.8198090694
MAPS_AND_NAVIGATION : 4056941.7741935486
PHOTOGRAPHY : 17840110.40229885
PERSONALIZATION : 5201482.6122448975
COMMUNICATION : 38456119.167247385
WEATHER : 5074486.197183099
SPORTS : 3638640.1428571427
MEDICAL : 120550.61980830671
ENTERTAINMENT : 11640705.88235294
SOCIAL : 23253652.1271
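The averages above are printed in arbitrary order, which makes the ranking hard to read. A minimal sketch of one way to tidy this up is shown below. It assumes a hypothetical helper `parse_installs()` and a small dictionary `avg_installs` filled with a few of the averages printed above (truncated to whole installs); neither name appears in the original notebook.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(raw.replace(',', '').replace('+', ''))

# A few category averages copied from the output above, truncated to integers.
avg_installs = {
    'EVENTS': 253542,
    'PRODUCTIVITY': 16787331,
    'GAME': 15588015,
    'BOOKS_AND_REFERENCE': 8767811,
    'VIDEO_PLAYERS': 24727872,
    'COMMUNICATION': 38456119,
    'MEDICAL': 120550,
}

# Sort categories from the highest average to the lowest.
ranked = sorted(avg_installs.items(), key=lambda item: item[1], reverse=True)

for category, average in ranked:
    print(f'{category:<20} {average:>12,}')
```

In the notebook itself, the same sorting step could be applied to a dictionary built inside the loop instead of printing each average immediately.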

### App Profiles Recommendation for Google Play

The information obtained so far is that:

   * A larger proportion of apps in the Google Play Store are made for practical purposes (Family, Tools, Business, Lifestyle etc.), with only some fun apps (Game)
   * Communication has the highest average number of installs on Google Play (more than 38 million per app), followed by Video_Players (~25 million), Social (~23 million), Photography (~18 million), Productivity (~17 million), Game (~16 million), Travel_And_Local (~14 million), Entertainment (~12 million), Tools (~11 million) etc.
   

One key observation is that most apps on Google Play are made for practical purposes. A hasty recommendation would be Communication apps, since that category has the highest average number of installs. However, a closer look shows that this high figure is mainly driven by a few apps (such as WhatsApp Messenger, Facebook Messenger, Skype, Gmail and Telegram) made by big companies.

These apps are shown below:

In [44]:
for app in free_android_apps:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+' or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Me

When the apps shown above, each with 100 million or more installs, are removed and the average is computed only from apps with fewer than 100 million installs, the average for the Communication category drops sharply (as shown below).

In [40]:
under_100_m = []

for app in free_android_apps:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386

The same pattern appears in the Video_Players, Social, Photography and Productivity categories, which are dominated by apps such as YouTube and MX Player; Facebook and Instagram; Google Photos; and Microsoft Word and Dropbox, respectively.

We are concerned that these app categories might seem more popular than they really are. Additionally, these niches appear to be dominated by a few giants that are hard to compete against, given our company's limited revenue.

The Game category seems quite popular, but we previously found that this part of the market is somewhat saturated, so we'd like to come up with a different app recommendation if possible. The Books and Reference category appears to have good prospects.
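The outlier check used for the Communication category can be wrapped in a small function so that it is easy to repeat for other categories, such as Books and Reference. This is a sketch, not part of the original notebook: `avg_installs_below()` and `make_row()` are hypothetical helpers, and `sample_apps` holds only a handful of app names and install figures taken from this notebook's Books and Reference listings, with placeholder values in the columns that are not used.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(raw.replace(',', '').replace('+', ''))

def avg_installs_below(apps, category, threshold):
    """Average installs for apps in `category` with fewer than `threshold`
    installs. Returns None when no app qualifies."""
    installs = [
        parse_installs(app[5])
        for app in apps
        if app[1] == category and parse_installs(app[5]) < threshold
    ]
    if not installs:
        return None
    return sum(installs) / len(installs)

def make_row(name, category, installs):
    # Columns 2-4 are placeholders; only indexes 0, 1 and 5 matter here.
    return [name, category, '', '', '', installs]

sample_apps = [
    make_row('Google Play Books', 'BOOKS_AND_REFERENCE', '1,000,000,000+'),
    make_row('Bible', 'BOOKS_AND_REFERENCE', '100,000,000+'),
    make_row('Wikipedia', 'BOOKS_AND_REFERENCE', '10,000,000+'),
    make_row('Cool Reader', 'BOOKS_AND_REFERENCE', '10,000,000+'),
    make_row('Book store', 'BOOKS_AND_REFERENCE', '1,000,000+'),
    make_row('Flybook', 'BOOKS_AND_REFERENCE', '500,000+'),
]

# Average over every sample row, then only over rows under 100 million installs.
print(avg_installs_below(sample_apps, 'BOOKS_AND_REFERENCE', float('inf')))
print(avg_installs_below(sample_apps, 'BOOKS_AND_REFERENCE', 100_000_000))
```

Passing `free_android_apps` instead of `sample_apps` would run the same comparison on the full data set, assuming the column layout used throughout this notebook.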

### Part 3

### App Profiles Recommendation for App Store and Google Play

_Remember, our primary goal is to find App Profiles that will work well (profit-wise) in both App Store and Google Play._

The Books and Reference category also looks fairly popular, with an average of 8,767,811 installs. It is worth exploring in more depth, since we already found that this category has some potential to work well on the App Store.

Let's take a look at some of the apps from this genre and their number of installs:

In [42]:
for app in free_android_apps:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+

The Books and Reference category includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. The category average also appears to be influenced by a small number of extremely popular apps:

In [43]:
for app in free_android_apps:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+' or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad üìñ Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


Unlike in the Communication category, the number of extremely popular apps in the Books and Reference category is small. We can also check the apps that fall in the middle in terms of popularity.

In [46]:
for app in free_android_apps:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '50,000,000+' or app[5] == '10,000,000+' or app[5] == '5,000,000+' or app[5] == '1,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
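The chained `or` comparisons used in the filters above grow quickly as more install buckets are added. A possible alternative, sketched here under the assumption that rows keep the same layout (name at index 0, category at index 1, installs at index 5), is to convert the install string to a number and test a numeric range. `in_install_range()` and `sample_apps` are hypothetical names; the sample rows reuse app names and install figures from the listings above, with placeholder values in the unused columns.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to an int."""
    return int(raw.replace(',', '').replace('+', ''))

def in_install_range(app, low, high):
    """True when the app's install bucket satisfies low <= installs < high."""
    return low <= parse_installs(app[5]) < high

# Placeholder strings fill the columns that this check does not read.
sample_apps = [
    ['Google Play Books', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Wikipedia', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Flybook', 'BOOKS_AND_REFERENCE', '', '', '', '500,000+'],
    ['Moon+ Reader', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Litnet - E-books', 'BOOKS_AND_REFERENCE', '', '', '', '100,000+'],
]

# Same bucket span as the filter above: 1,000,000+ up to 50,000,000+.
mid_popularity = [
    app[0]
    for app in sample_apps
    if app[1] == 'BOOKS_AND_REFERENCE'
    and in_install_range(app, 1_000_000, 100_000_000)
]

print(mid_popularity)
```

A numeric range also picks up any bucket between the two bounds automatically, so no bucket label has to be typed out by hand.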

This category appears to be dominated by software for reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent one) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, the market already appears to be full of libraries. Therefore, as mentioned earlier, we need to add some special features besides the raw version of the book. These might include daily quotes from the book, an audio version of the book, quizzes and word puzzles based on the book, or a forum where people can discuss the book. We can also develop dictionary or book apps with embedded games such as Scrabble, word search, riddles and crosswords.


## Conclusions

In this project, we analyzed data on the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book and turning it into an app could be profitable for both the Google Play and the App Store markets. We found that the markets are already full of libraries and references, so we need to add some special features besides the raw version of such a book. These might include daily quotes from the book, an audio version of the book, quizzes and word puzzles based on the book, or a forum where people can discuss the book.