# Profitable App Profiles for the App Store and Google Play Markets

This project aims to survey data on apps on The Play Store and The App Store. Since all these apps are free, their main source of revenue is through in-app advertisements. Hence, the number of users greatly determine the revenue for any given app. The more users who see and engage with the ads, the better. 

The goal of this project is to analyse the data and understand what type of apps are more likely to attract a higher amount of users. 

## The Use of Sample Data

Since there are over 4 million apps on both the Play Store and App Store, it would take a lot of time and money to collect data on all of them. Therefore, it makes more sense to take a sample representative of the whole data for analysis purposes. 

* The sample dataset for apps from Google Play Store contains of 10,000 Android apps

* The sample dataset for apps from App Store contains of 7,000 ios apps

In [1]:
#opening both the sample datasets and saving them as list of lists

file1 = open('googleplaystore.csv', encoding = 'utf8')
from csv import reader
reader_file1 = reader(file1)
google_data = list(reader_file1)

file2 = open('AppleStore.csv', encoding = 'utf8')
from csv import reader
reader_file2 = reader(file2)
ios_data = list(reader_file2)

## Exploring the first few rows of the sample datasets

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(google_data, 0, 3)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']




In [4]:
explore_data(ios_data, 0, 3)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']




We can see that the first row in both the datasets consists of a row of Column Names (a.k.a the Header Row)

Let's print out the Header rows seperately and try to identify which columns will be useful for our analysis. 

(Incase the column names are not descriptive enough, the links to the dataset documentations are given here:)
* [Google Play Store Dataset](https://www.kaggle.com/lava18/google-play-store-appsGoogle)

* [App Store Dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)


In [5]:
google_apps_header = google_data[0]
print(google_apps_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [6]:
ios_header = ios_data[0]
print(ios_data[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


For the Play Store dataset:
Columns like 'Category', 'Price', 'Rating', 'Installs', 'Content Rating', 'Genres', and 'Reviews' seem very useful for analysing which kind of apps are popular with users. 

For the App Store dataset:
Columns like ''track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre' seem very useful for analysing which kind of apps are popular with users

## Deleting inaccurate and duplicate data

The goal in this step is to remove all non-english apps, all apps that aren't free, and all duplicate entries of apps

In [7]:
#Checkimg for amy row discrepancies in both the datasets i.e. if any column values are missing

google_length = len(google_data[0])
ios_length = len(ios_data[0])

print(google_length, ios_length)

13 16


In the below function, we loop over each row of the given dataset, based on which row the dataset is from, the length of that row is checked by comparing it with the google_length or the ios_length. If the values match, the row is then stored in a seperate list. Finally, we compare the new list with the old list (i.e. the old datasets) and check if both have the same lengths.

If not, we know that the dataset has row(s) with missing column values.

In [8]:
ios_data_1 = []
google_data_1 = []

def check_row_length(dataset, app_type):
    
    for row in dataset:
        if app_type == "Google":
            if len(row) == google_length:
                google_data_1.append(row)
        elif app_type == "ios":
            if len(row) == ios_length:
                ios_data_1.append(row)
        else:
            print("Wrong app type entered!")
            
check_row_length(ios_data, "ios")
check_row_length(google_data, "Google")

In [9]:
print(len(ios_data),len(ios_data_1))
print(len(google_data),len(google_data_1))

7198 7198
10842 10841


As we can see, the number of rows are the same in the ios_data and ios_data_1 list created through the check_row_length function. This means that there are no column values missing in the ios dataset and we can continue with the original dataset.

For google however, there seems to be one row with a missing column value. To get over this, we can simply assign google_data to the corrected google_data_1 list and move ahead.

In the next few steps, we will be checking if there are any duplicate entries in both the datasets.

In [10]:
google_data = google_data_1
print(len(google_data))

10841


In [11]:
#Checking for duplicates in ios_data 

ios_unique_apps = [] 
ios_duplicate_apps = [] 

for app in ios_data: 
    app_name = app[1] 

    if app_name not in ios_unique_apps:
         ios_unique_apps.append(app_name)
    else:
         ios_duplicate_apps.append(app)

In [12]:
print(f'The number of unique apps:{len(ios_unique_apps)}')
print(f'The number of duplicate apps:{len(ios_duplicate_apps)}')

The number of unique apps:7196
The number of duplicate apps:2


In [13]:
#Checking for duplicates in google_data

google_unique_apps = [] 
google_duplicate_apps = [] 

for app in google_data: 
    app_name = app[0] 

    if app_name not in google_unique_apps:
        google_unique_apps.append(app_name)
    else:
        google_duplicate_apps.append(app)


In [14]:
print(f'The number of unique apps:{len(google_unique_apps)}')
print(f'The number of duplicate apps:{len(google_duplicate_apps)}')

The number of unique apps:9660
The number of duplicate apps:1181


As displayed by the output above, there are 90 duplicate apps in the ios dataset and 1181 duplicates in the google dataset. However, it is unwise to remove these duplicates randomly. 

In [15]:
for element in google_data:
    if element[0] == "Facebook":
        print(element)
        
for element in ios_data:
    if element[1] == "Facebook":
        print(element)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


For example: In the output shown above, we can see that it is wiser to keep the data with the higher number of reviews (the number in the fourth position), as it seems to be the more recent data of the two. 

It is also likely that the other duplicate apps have a similar situation. This is why the row with the highest number of reviews will be retained!

The below function takes in the dataset, name index, and review index positions as parameters. The whole point of the function is to loop over each row in the dataset, take the name element of the row and see if it aready exists in the dictionaries. (g_reviews_max dictionary for google dataset, i_reviews_max dictionary for apple dataset). If the name already exists, we then check if the corresponding 'value' (which is the number of the reviews) for that particular name 'key' in the dictionary  is lower than the corresponding number of reviews in the row being iterated over. If yes, then we replace the value, and otherwise, we just ignore it. 

If the name is not in the dictionaries, then we create a new entry in the dictionary where the name becomes the 'key', and the corresponding value of the number of reviews in that particular row becomes the 'value'.

In [16]:

def max_reviews(dataset, name_index, reviews_index):
    
    g_reviews_max = {}
    i_reviews_max = {}

    for row in dataset[1:]:
        name = row[name_index]
        reviews = float(row[reviews_index])
        
        if dataset == google_data:
            if name in g_reviews_max and reviews > g_reviews_max[name]:
                g_reviews_max[name] = reviews
            else:
                g_reviews_max[name] = reviews
       
                    
        elif dataset == ios_data:
            if name in i_reviews_max and reviews > i_reviews_max[name]:
                i_reviews_max[name] = reviews
            else:
                i_reviews_max[name] = reviews
         
            
        else:
            print("Wrong dataset")
                
    if dataset == google_data:
        return g_reviews_max
    else:
        return i_reviews_max
    

In [17]:
g_reviews_max = max_reviews(google_data, 0, 3)
i_reviews_max = max_reviews(ios_data,1,5 )

Now let's check if our dictionary has actually captured the entry with the highest number of reviews. Just a few lines ago, we noticed that the highest number of reviews for facebook was 78128208. Therefore, our dictionary g_reviews_max with the key 'Facebook' should store the same value

In [18]:
g_reviews_max['Facebook']

78128208.0

In [19]:
i_reviews_max['Instagram']

2161558.0

As we can see from the above outputs, our dictionaries have captured the maximum number of reviews for each app successfully

In the below function, we are going to create two lists
* apps_clean to store the data for each app and it's corresponding row with the highest number of reviews
* apps_already_added to store the names of the apps that we have already added, so that we don't add apps twice in the case that two rows have the same highest number of reviews value (two rows that contain the facebook value: 78128208 for example. We only want one of these rows to be entered)

We basically loop over each row in our Play Store and App Store datasets and check for two conditions
* If the name of the app is not present in the apps_already_added list
* If the corresponding reviews in that row is equal to the highest number of reviews for the app in the g_reviews_max/i_reviews_max dictionary

Provided these conditions are satisfied, we then add the row to the apps_clean list, and the name element of the row to the apps_already_added list

In [20]:
def remove_duplicates(dataset, name_index, reviews_index):
    apps_clean = []
    apps_already_added = []
    
    for row in dataset[1:]:
        name = row[name_index]
        reviews = float(row[reviews_index])
        
        if dataset == google_data:
            if name not in apps_already_added and reviews == g_reviews_max[name]:
                apps_clean.append(row)
                apps_already_added.append(name)
                
        elif dataset == ios_data:
            if name not in apps_already_added and reviews == i_reviews_max[name]:
                apps_clean.append(row)
                apps_already_added.append(name)
                
        else:
            print("Wrong dataset")
    
    return apps_clean

In [21]:
clean_google_data = remove_duplicates(google_data, 0, 3)

In [22]:
print(len(clean_google_data), len(google_unique_apps))

for element in clean_google_data:
    if element[0] == "Facebook":
        print(element)

9659 9660
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


In [23]:
clean_ios_data = remove_duplicates(ios_data, 1, 5)

In [24]:
print(len(clean_ios_data), len(ios_unique_apps))

for element in clean_ios_data:
    if element[1] == "Facebook":
        print(element)

7195 7196
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


As we can see from the above outputs, the clean_google_data and clean_ios_data lists contain only the unique app names along with their respective highest number of reviews. Note that the length of both the datasets match with their respective google_unique_apps and ios_unique_apps lists. (-1 for the Header row)

## Deleting Non English App Data

Since Python follows the UTF8 encoding system, all english characters have values between 0 - 127. So we are going to make a function that checks for values of individual characters in the app names and if more than three of them have values above 127, classifies the app as non english and ignore it. 

While this isn't a fullproof technique, it should save us from wrongly classifying apps that contain special characters or emojis as non english apps. 

In [25]:
def check_eng(dataset,name_index):
    english_apps = []
 
    for row in dataset:
        name = row[name_index]
        count = 0
        for element in name:
            if ord(element) > 127:
                count += 1
        if count <= 3:
            english_apps.append(row)
        
    return english_apps

In [26]:
eng_google_apps = check_eng(clean_google_data, 0)
eng_ios_apps = check_eng(clean_ios_data, 1)

In [27]:
print(len(eng_google_apps))
print(len(eng_ios_apps))

9614
6181


## Isolating Free Apps

The free_apps function is going to take in the eng_google_apps and eng_ios_apps lists as the dataset arguments, along with the index positions of the 'Price' value in both the datasets. Then it is going to loop over each row and check if the price value is equal to 0, if yes, then the row will be appended to the final cleaned list of apps, if not, the row will be ignored. 

In [29]:
def free_apps(dataset, price_index):
    final_apps = []
    
    for row in dataset:
        price = row[price_index]
        if price == "0" or price == "0.0":
            final_apps.append(row)
            
    return final_apps

In [30]:
final_google_data = free_apps(eng_google_apps, 7)
final_ios_data = free_apps(eng_ios_apps, 4)

In [31]:
print(len(final_google_data))
print(len(final_ios_data))

print(final_ios_data[0:2])

8864
3220
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']]


Finally, we have 8864 Play Store apps and 3220 App Store apps to conduct our analysis on.

# Analyzing the Cleaned Lists

The validation strategy for an app idea has three steps:

* Build a minimal Android version of the app, and add it to Google Play.
* If the app has a good response from users, develop it further.
* If the app is profitable after six months, build an iOS version of the app and add it to the App Store.

Because the end goal of this project is to  find app profiles that are successful in both markets. Let's begin the analysis by determining the most common genres for each market.

For the Play Store dataset: Columns like 'Category', and 'Genres' can give us an idea of how many users use apps per genre and category

For the App Store dataset: Similary 'prime_genre' seem very useful for analysing which kind of apps are popular with users

The freq_table function is going to create a dictionary that stores each unique genre as a seperate 'key', and the percentage of the total number of apps in each category w.r.t the total number of apps in the whole store as the 'value'. Basically, the function is going to help us build a frequency table for different app genres.

Since dictionary keys cannot be sorted over, we are then going to use the display_table function to take the key value pairs of the dictionary returned by the freq_table function, reverse it and store it as tuples in a seperate list. This list will then be sorted in the reverse order to give us an idea about which category/genre has the most number of apps on the playstore.

In [32]:
def freq_table(dataset, index):
    freq_dict = {}
    total_rows = 0
    
    for row in dataset:
        total_rows += 1
        key_value = row[index]
        if key_value in freq_dict:
            freq_dict[key_value] += 1
        elif key_value not in freq_dict:
            freq_dict[key_value] = 1
    
    for key in freq_dict:
        freq_dict[key] /= total_rows 
        freq_dict[key] = round(freq_dict[key]*100, 2)
            
    return freq_dict

In [33]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [34]:
display_table(final_google_data, 1 )

FAMILY : 19.22
GAME : 9.51
TOOLS : 8.46
BUSINESS : 4.58
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.54
SPORTS : 3.42
PERSONALIZATION : 3.32
COMMUNICATION : 3.25
HEALTH_AND_FITNESS : 3.07
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.78
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.13
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
ENTERTAINMENT : 0.88
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


In [35]:
display_table(final_google_data, -4)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.58
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.54
Sports : 3.46
Personalization : 3.32
Communication : 3.25
Action : 3.1
Health & Fitness : 3.07
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.86
Video Players & Editors : 1.78
Casual : 1.75
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.44
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

As shown by the above two tables, we can make out that there is a dominance of family related and general tools related apps on the Play Store. Which is  quite different from the dominance that game apps seem to have on the App Store (See below).



In [36]:
display_table(final_ios_data, -5)

Games : 58.14
Entertainment : 7.89
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.52
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.34
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


As depicted by the output above, apps belonging to the Games and Entertainment section have the highest number of entries/rows in the App Store dataset. While this may not necessarily translate to these genres having the highest number of users, it can be concluded that in the App Store, amongst English Free Apps atleast, more than 50% of the apps belong to the Games and Entertainment sector. 

### Why cannot we assume already that these apps also have the highest number of users?

Because this table only counts the frequency of the genres/categories. Meaning out of 10,000 rows we could have 5000 rows with data of apps belonging to the games category, but with only 3 number of reviews or 10 installs per game app. Whereas we could have just two apps from the social media category, with 10000000 number of reviews/installs each. That is why the above frequency tables alone are not conducive. 

To truly dive deeper and understand if these genres also have the highest number of users, we will need to take a look at the number of reviews/installs. This number will give us a much better perspective of how many users are actually using these apps. 

The above tables however, have given us a fair idea of which category of apps dominate the Play Store and App Store, amongst free apps using English as their main language atleast. They also tell us which categories are saturated.

## Getting to the nitty-gritty, finding the popular apps amongst users

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot app.

In [37]:
ios_unique_genre = freq_table(final_ios_data, -5)


# Finding the most popular IOS Apps

In [38]:
print(ios_unique_genre)

{'Social Networking': 3.29, 'Photo & Video': 4.97, 'Games': 58.14, 'Music': 2.05, 'Reference': 0.56, 'Health & Fitness': 2.02, 'Weather': 0.87, 'Utilities': 2.52, 'Travel': 1.24, 'Shopping': 2.61, 'News': 1.34, 'Navigation': 0.19, 'Lifestyle': 1.58, 'Entertainment': 7.89, 'Food & Drink': 0.81, 'Sports': 2.14, 'Book': 0.43, 'Finance': 1.12, 'Education': 3.66, 'Productivity': 1.74, 'Business': 0.53, 'Catalogs': 0.12, 'Medical': 0.19}


In the code block below, we are looping over each row in the final_ios_dataset, checking if the prime_genre column value of the row is matching with the present 'genre' key from the ios_unique_genre dictionary being looped over. 

If yes, we enter the loop and count the total number of user reviews, the total number of apps, and calculate the average number of reviews per app per genre by the calculation (total number of reviews/total number of apps)

Then we store this average number along with the name of the genre in a tuple and append that to the ios_list (which we will later use for sorting)

In [39]:
ios_list = []

for genre in ios_unique_genre:
    total = 0
    len_genre = 0
    for row in final_ios_data:
        genre_app = row[-5]
        if genre == genre_app:
            user_ratings = float(row[5])
            total += user_ratings
            len_genre += 1
    avg_user_ratings = round(total/len_genre, 2)
    
    print(genre, ':', avg_user_ratings)
    ios_list.append((avg_user_ratings, genre))


Social Networking : 71548.35
Photo & Video : 28441.54
Games : 22812.6
Music : 57326.53
Reference : 74942.11
Health & Fitness : 23298.02
Weather : 52279.89
Utilities : 18684.46
Travel : 28243.8
Shopping : 26919.69
News : 21248.02
Navigation : 86090.33
Lifestyle : 16485.76
Entertainment : 14029.83
Food & Drink : 33333.92
Sports : 23008.9
Book : 39758.5
Finance : 31467.94
Education : 7003.98
Productivity : 21028.41
Business : 7491.12
Catalogs : 4004.0
Medical : 612.0


In [40]:
sorted_ios_list = sorted(ios_list, reverse = True)
print(sorted_ios_list)

[(86090.33, 'Navigation'), (74942.11, 'Reference'), (71548.35, 'Social Networking'), (57326.53, 'Music'), (52279.89, 'Weather'), (39758.5, 'Book'), (33333.92, 'Food & Drink'), (31467.94, 'Finance'), (28441.54, 'Photo & Video'), (28243.8, 'Travel'), (26919.69, 'Shopping'), (23298.02, 'Health & Fitness'), (23008.9, 'Sports'), (22812.6, 'Games'), (21248.02, 'News'), (21028.41, 'Productivity'), (18684.46, 'Utilities'), (16485.76, 'Lifestyle'), (14029.83, 'Entertainment'), (7491.12, 'Business'), (7003.98, 'Education'), (4004.0, 'Catalogs'), (612.0, 'Medical')]


In [41]:
for entry in sorted_ios_list:
    print(f'{entry[1]} : {entry[0]}')

Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89
Book : 39758.5
Food & Drink : 33333.92
Finance : 31467.94
Photo & Video : 28441.54
Travel : 28243.8
Shopping : 26919.69
Health & Fitness : 23298.02
Sports : 23008.9
Games : 22812.6
News : 21248.02
Productivity : 21028.41
Utilities : 18684.46
Lifestyle : 16485.76
Entertainment : 14029.83
Business : 7491.12
Education : 7003.98
Catalogs : 4004.0
Medical : 612.0


Here we can see that navigation apps have the highest number of apps.However, the original display table showed us that navigation apps only have a 0.19% share in the total number of apps present in the App Store. This means that there are two or more incredibly popular apps that have garnered very high number of reviews, and this has skewed the average. It is highly possible that reference, social media, and the music categories exhibit the same behaviour

We can deepdive into this further by checking out the most popular apps by category

In [42]:
for element in final_ios_data:
    if element[-5] == 'Navigation':
        print(element[1], ':', element[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Look at the above output, there are only 5 navigation apps and the number of reviews range from 5 to 300,000+. It's clear that the average number of reviews for navigation apps have been heavily skewed by the incredibly high reviews Waze and Google Maps have received.

Builidng an app in this genre would be quite risky as the above data shows that Waze and Google Maps already are dominationg the navigation space, and in general navigation apps do not seem to be very popular as evidenced by their low 0.19% share. 

Having in-app advertisements inside navigation apps could also lead to user frustration, watching an ad while trying to navigate a route is definitely not going to be a pleasant experience.

In [43]:
for element in final_ios_data:
    if element[-5] == 'Social Networking':
        print(element[1], ':', element[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

Similary Social Media is showing a similar pattern with Facebook, Pinterest, and Skype getting the most number of reviews. Another point to note is that this space is already quite saturated with a high number of apps.

We know that the App Store is dominated by apps belonging to the fun and entertainment sector (over 58%), meaning that there is scope for more functional/educational apps as the fun and entertainment sector seems quite saturated. So, let's take a look at Book and Reference categories as both of them fall outside the fun and entertainment sector, but at the same time, have accumulated a decent amount of user reviews. 

In [44]:
for element in final_ios_data:
    if element[-5] == 'Book':
        print(element[1], ':', element[5])

Kindle – Read eBooks, Magazines & Textbooks : 252076
Audible – audio books, original series & podcasts : 105274
Color Therapy Adult Coloring Book for Adults : 84062
OverDrive – Library eBooks and Audiobooks : 65450
HOOKED - Chat Stories : 47829
BookShout: Read eBooks & Track Your Reading Goals : 879
Dr. Seuss Treasury — 50 best kids books : 451
Green Riding Hood : 392
Weirdwood Manor : 197
MangaZERO - comic reader : 9
ikouhoushi : 0
MangaTiara - love comic reader : 0
謎解き : 0
謎解き2016 : 0


In [45]:
for element in final_ios_data:
    if element[-5] == 'Reference':
        print(element[1], ':', element[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Notice how there is quite some overlap between the two categories. In the book categories, apps like Kindle and Audible, that provide users with the ability to listen to podcasts, read through pdfs online, store books offline and be able to read them on the go seem to be the most popular. 

Similarly if you look at the reference category, religious books and dictionaries seem to be incredibly popular. It's safe to say that these categories are not exactly saturated either, given their market shares (from the display table for ios apps above) are 0.43% and 0.56% only.

Something that's a good food for thought is that a single app could possibly belong to both the categories. You could develop a single app that allows you to store pdfs, and serve as a dictionary, or an app that converts text to speech for any books that you purchase digitally, and helps you listen to podcasts. The greatest pro here is that these are apps that users are more likely to spend a lot of time on, meaning that good revenue can be generated from in-app advertisements. 


Food & Drink, Health & Fitness, and Finance, also fall into a similar category i.e. they're not saturated and there seems to be good scope for growth. However, good Subject Matter Experts will be needed to build a high quality app in these domains, and hiring them will cost extra, which is something important to keep in mind!

# Finding the Most Popular Google Play Store Apps

The Google Play Store dataset actually has data about the number of installs. However, the install numbers don't seem precise enough since most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [46]:
display_table(final_google_data, 5)

1,000,000+ : 15.75
100,000+ : 11.56
10,000,000+ : 10.5
10,000+ : 10.21
1,000+ : 8.39
100+ : 6.92
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.51
10+ : 3.54
500+ : 3.25
50,000,000+ : 2.3
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05
0 : 0.01


While this data does not give us the exact number, we can work with it given that our main goal is to find out which apps and which categories attract the most users.

However, since this data is in string format and has commas and plus signs, we will need to replace them with empty spaces in order to convert these values to float. Converting them to float will later help us sort the number of installs in descending order, making our analysis a bit easier

In [47]:
google_list = []

In [48]:
google_unique_genre = freq_table(final_google_data, 1)

for genre in google_unique_genre:
    total1= 0
    len_category = 0
    for row in final_google_data:
        genre_app = row[1]
        if genre == genre_app:
            n_installs = row[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            total1+= n_installs
            len_category += 1
    avg_user_installs = round(total1/len_category, 2)
    
    print(genre, ':', avg_user_installs)
    google_list.append((avg_user_installs, genre))

ART_AND_DESIGN : 1986335.09
AUTO_AND_VEHICLES : 647317.82
BEAUTY : 513151.89
BOOKS_AND_REFERENCE : 8767811.89
BUSINESS : 1704192.34
COMICS : 817657.27
COMMUNICATION : 38326063.2
DATING : 854028.83
EDUCATION : 1768500.0
ENTERTAINMENT : 9146923.08
EVENTS : 253542.22
FINANCE : 1387692.48
FOOD_AND_DRINK : 1924897.74
HEALTH_AND_FITNESS : 4167457.36
HOUSE_AND_HOME : 1331540.56
LIBRARIES_AND_DEMO : 638503.73
LIFESTYLE : 1437816.27
GAME : 12914435.88
FAMILY : 5180161.79
MEDICAL : 123064.79
SOCIAL : 23253652.13
SHOPPING : 7036877.31
PHOTOGRAPHY : 17840110.4
SPORTS : 4274688.72
TRAVEL_AND_LOCAL : 13984077.71
TOOLS : 10801391.3
PERSONALIZATION : 5201482.61
PRODUCTIVITY : 16772838.59
PARENTING : 542603.62
WEATHER : 5074486.2
VIDEO_PLAYERS : 24790074.18
NEWS_AND_MAGAZINES : 9549178.47
MAPS_AND_NAVIGATION : 4056941.77


In [49]:
sorted_google_list = sorted(google_list, reverse = True)

for entry in sorted_google_list:
    print(f'{entry[1]} : {entry[0]}')


COMMUNICATION : 38326063.2
VIDEO_PLAYERS : 24790074.18
SOCIAL : 23253652.13
PHOTOGRAPHY : 17840110.4
PRODUCTIVITY : 16772838.59
TRAVEL_AND_LOCAL : 13984077.71
GAME : 12914435.88
TOOLS : 10801391.3
NEWS_AND_MAGAZINES : 9549178.47
ENTERTAINMENT : 9146923.08
BOOKS_AND_REFERENCE : 8767811.89
SHOPPING : 7036877.31
PERSONALIZATION : 5201482.61
FAMILY : 5180161.79
WEATHER : 5074486.2
SPORTS : 4274688.72
HEALTH_AND_FITNESS : 4167457.36
MAPS_AND_NAVIGATION : 4056941.77
ART_AND_DESIGN : 1986335.09
FOOD_AND_DRINK : 1924897.74
EDUCATION : 1768500.0
BUSINESS : 1704192.34
LIFESTYLE : 1437816.27
FINANCE : 1387692.48
HOUSE_AND_HOME : 1331540.56
DATING : 854028.83
COMICS : 817657.27
AUTO_AND_VEHICLES : 647317.82
LIBRARIES_AND_DEMO : 638503.73
PARENTING : 542603.62
BEAUTY : 513151.89
EVENTS : 253542.22
MEDICAL : 123064.79


From the above output, we can see that apps belonging to communication, video players, and social media have the highest number of installs. In the App Store we noticed that the top apps had averages skewed due to a few incredibly popular apps. Let's test that theory for apps on the Play Store as well. 

It is highly likely that apps belonging to the 'Social' category will depict the same behaviour here.

In [51]:
for element in final_google_data:
    if element[1] == 'SOCIAL':
        print(element[0], ':', element[5])

Social network all in one 2018 : 100,000+
TextNow - free text + calls : 10,000,000+
The Messenger App : 1,000,000+
Messenger Pro : 1,000,000+
Free Messages, Video, Chat,Text for Messenger Plus : 1,000,000+
The Video Messenger App : 100,000+
Jodel - The Hyperlocal App : 1,000,000+
Hide Something - Photo, Video : 5,000,000+
Love Sticker : 1,000,000+
Web Browser & Fast Explorer : 5,000,000+
VidStatus app - Status Videos & Status Downloader : 5,000,000+
Love Images : 1,000,000+
Web Browser ( Fast & Secure Web Explorer) : 500,000+
SPARK - Live random video chat & meet new people : 5,000,000+
Golden telegram : 50,000+
Facebook Local : 1,000,000+
Meet – Talk to Strangers Using Random Video Chat : 5,000,000+
MobilePatrol Public Safety App : 1,000,000+
💘 WhatsLov: Smileys of love, stickers and GIF : 1,000,000+
HTC Social Plugin - Facebook : 10,000,000+
Kate Mobile for VK : 10,000,000+
Family GPS tracker KidControl + GPS by SMS Locator : 1,000,000+
Moment : 1,000,000+
Text Me: Text Free, Call Fr

As shown above, the social networking category is definitely not a good idea to build a new app in. Not only is this field incredibly saturated on the Play Store, but also there are a lot of big names that have already estabilished a brand name and taken hold of a big part of the market share. Building a new social media app will require some serious capital investment (marketing, operations, state of the art features, a solid legal team, and maintainence), as you will also have to account for the competition that you'll be entering into. 

A similar situation applies for the Communication category. This field is incredibly saturated and it's going to take a lot of marketing to even get your app to be noticed. 

If you see the data carefully, you will also see that the numbers are completely skewed due to the social media giants. Whatsapp, Facebook have over a billion installs, and a few other apps have installations in the millions. This makes these app categories seem more popular than they are. 

It's also evident that these categories/genres are dominated by a few giants, who will be hard to compete against.

In [52]:
for element in final_google_data:
    if element[1] == 'COMMUNICATION':
        print(element[0], ':', element[5])

Messenger – Text and Video Chat for Free : 1,000,000,000+
Messenger for SMS : 10,000,000+
Gmail : 1,000,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
imo free video calls and chat : 500,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
My Vodacom SA : 5,000,000+
Calls & Text by Mo+ 

Since our priority is to build an app that can be profitable on both the stores, it's worth the effort to dive deeper into the 'BOOKS_AND_REFERENCE' category to see which apps have the most amount of reviews.

In [54]:
for element in final_google_data:
    if element[1] == 'BOOKS_AND_REFERENCE':
        print(element[0], ':', element[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

We can see that  there are a lot of apps already in the books and reference category, provided that only a few are popular and many of them just have reviews in the low thousands. However, if we do want to build an app that will be a success in both the stores, we should dive deeper to see the apps that have the highest number of installs in this category. Doing so will give us an idea about the kind of topics/subjects/fields that attract more users.

In [56]:
for element in final_google_data:
    if element[1] == 'BOOKS_AND_REFERENCE' and (element[5] == '1,000,000+' or element[5] == '5,000,000+' or element[5] == '10,000,000+' or element[5] == '50,000,000+'):
        print(element[0], ':', element[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

We can see that this category seems to be dominated by apps that make it easy to store and read books digitally, and apps containing collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition. However, building an app that has the software to store books digitally AND also contain an inbuilt dictionary/library of sorts should be a recipe for success. The main selling factor here is convenience as users will then have to rely on just one app.

There are also quite a few apps built around the book Quran, which suggests that building an app around a religious book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. However, the apps need to have some special features to thrive on both the stores. This might include daily quotes from a religious book, an audio version of the book, a dictionary, quizzes on the book, a forum where people can discuss the book, and the features mentioned in the paragraph above.

# Conclusion

In this project we analysed data on apps from the Play Store and App Store with the goal of finding a suitable app profile for both the markets. 

The original datasets were first cleaned to remove rows with missing column values, then they were further refined by removing the duplicates and only the rows of apps with the highest number of reviews were retained. After which, apps with non english names were removed from the datasets. 

Analysis was then conducted on these two clean datasets and we came to the conclusion that an app which can provide users with the features of reading their favourite books digitally, storing pdfs, having dictionary and library features, and perhaps also having religious books/quotes can provide good revenue through in app advertisements on both the stores.
