# Profitable App Profiles for the App Store and Google Play Markets

This project surveys data on apps in the Google Play Store and the App Store. The analysis focuses on apps that are free to download, whose main source of revenue is in-app advertising. Hence, the number of users largely determines the revenue for any given app. The more users who see and engage with the ads, the better.

The goal of this project is to analyse the data and understand what types of apps are more likely to attract a larger number of users.

## The Use of Sample Data

With over 4 million apps available across the Play Store and the App Store, collecting data on all of them would take a lot of time and money. It makes more sense to analyse a representative sample instead.

* The sample dataset for the Google Play Store contains approximately 10,000 Android apps

* The sample dataset for the App Store contains approximately 7,000 iOS apps

In [1]:
#opening both the sample datasets and saving them as lists of lists
from csv import reader

# the with statement closes each file once it has been read
with open('googleplaystore.csv', encoding = 'utf8') as file1:
    google_data = list(reader(file1))

with open('AppleStore.csv', encoding = 'utf8') as file2:
    ios_data = list(reader(file2))

## Exploring the first few rows of the sample datasets

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(google_data, 0, 3)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']




In [4]:
explore_data(ios_data, 0, 3)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']




We can see that the first row in both datasets holds the column names (a.k.a. the header row).

Let's print out the header rows separately and try to identify which columns will be useful for our analysis.

(In case the column names are not descriptive enough, the links to the dataset documentation are given here:)
* [Google Play Store Dataset](https://www.kaggle.com/lava18/google-play-store-apps)

* [App Store Dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)


In [5]:
google_apps_header = google_data[0]
print(google_apps_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [7]:
ios_header = ios_data[0]
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


For the Play Store dataset:
Columns like 'Category', 'Price', 'Rating', 'Installs', 'Content Rating', 'Genres', and 'Reviews' seem very useful for analysing which kinds of apps are popular with users.

For the App Store dataset:
Columns like 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre' seem very useful for analysing which kinds of apps are popular with users.

## Deleting inaccurate and duplicate data

Since this analysis is only concerned with free English apps, all non-English apps and all apps that aren't free will eventually be removed. Before that, this step checks both datasets for malformed rows and removes duplicate entries.

In [35]:
#Checking for any row discrepancies in both the datasets i.e. if any column values are missing

google_length = len(google_data[0])
ios_length = len(ios_data[0])

print(google_length, ios_length)

13 16


In [37]:
ios_data_1 = []
google_data_1 = []

def check_row_length(dataset, app_type):
    
    for row in dataset:
        if app_type == "Google":
            if len(row) == google_length:
                google_data_1.append(row)
        elif app_type == "ios":
            if len(row) == ios_length:
                ios_data_1.append(row)
        else:
            print("Wrong app type entered!")
            
check_row_length(ios_data, "ios")
check_row_length(google_data, "Google")

In [40]:
print(len(ios_data),len(ios_data_1))
print(len(google_data),len(google_data_1))

7198 7198
10841 10841


As we can see, the number of rows is the same in both the original google_data/ios_data lists and the new lists created by the check_row_length function. This means that every row has as many values as its header row, so no column values are missing in either dataset and we can continue with the original datasets.
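
A variant of this check that also reports where any malformed rows sit can be sketched as follows. The function name `find_bad_rows` and the toy data are hypothetical and are not part of the datasets used here.

```python
def find_bad_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])  # the header row defines the expected column count
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Toy data for illustration only (not taken from either store dataset).
sample = [
    ['name', 'price', 'genre'],
    ['App A', '0', 'Games'],
    ['App B', 'Games'],  # one value is missing in this row
]
bad_rows = find_bad_rows(sample)
print(bad_rows)
```

Knowing the index of a malformed row makes it possible to inspect it before deciding whether to delete or repair it.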

In the next few steps, we will be checking if there are any duplicate entries in both the datasets.

In [96]:
#Checking for duplicates in ios_data 

ios_unique_apps = [] 
ios_duplicate_apps = [] 

for app in ios_data: 
    app_name = app[1] 

    if app_name not in ios_unique_apps:
         ios_unique_apps.append(app_name)
    else:
         ios_duplicate_apps.append(app)

In [97]:
print(f'The number of unique apps:{len(ios_unique_apps)}')
print(f'The number of duplicate apps:{len(ios_duplicate_apps)}')

The number of unique apps:7196
The number of duplicate apps:2


In [20]:
#Checking for duplicates in google_data

google_unique_apps = [] 
google_duplicate_apps = [] 

for app in google_data: 
    app_name = app[0] 

    if app_name not in google_unique_apps:
        google_unique_apps.append(app_name)
    else:
        google_duplicate_apps.append(app)


In [21]:
print(f'The number of unique apps:{len(google_unique_apps)}')
print(f'The number of duplicate apps:{len(google_duplicate_apps)}')

The number of unique apps:9660
The number of duplicate apps:1181


As the output above shows, there are 2 duplicate apps in the iOS dataset and 1181 duplicates in the Google dataset. However, it is unwise to remove these duplicates randomly.
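
Testing `app_name not in` a list rescans that list for every row, which gets slow as the data grows. One possible alternative, sketched here with the standard library's `collections.Counter`, counts names in a single pass. The helper name `count_duplicates` and the toy rows are assumptions made for illustration.

```python
from collections import Counter


def count_duplicates(dataset, name_index):
    """Return (number of unique names, number of surplus duplicate rows)."""
    counts = Counter(row[name_index] for row in dataset)
    unique_names = len(counts)
    # Every occurrence after the first one counts as a duplicate row.
    duplicate_rows = sum(n - 1 for n in counts.values())
    return unique_names, duplicate_rows


# Toy rows for illustration only.
rows = [['Facebook', '10'], ['Facebook', '12'], ['Example App', '7']]
result = count_duplicates(rows, 0)
print(result)
```

As in the loops above, passing a dataset that still includes its header row would count the header as one of the unique names.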

In [64]:
for element in google_data:
    if element[0] == "Facebook":
        print(element)
        
for element in ios_data:
    if element[1] == "Facebook":
        print(element)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


For example, the output above shows two Facebook rows in the Play Store data that differ only in their number of reviews (the fourth value in each row). It makes sense to keep the row with the higher number of reviews, 78158306, since it is likely the more recent of the two.

It is likely that the other duplicate apps are in a similar situation, so for every app only the row with the highest number of reviews will be retained.
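
The retention rule can also be sketched in a single pass that remembers the best row seen so far for each name. This is a minimal sketch rather than the approach taken in this notebook; `keep_most_reviewed` is a hypothetical helper, and the rows are trimmed toy data modelled on the Facebook output above.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # A strict comparison keeps the earlier row when two rows tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows for illustration only; 'Example App' is made up.
rows = [
    ['Facebook', '78158306'],
    ['Facebook', '78128208'],
    ['Example App', '100'],
]
deduplicated = keep_most_reviewed(rows, 0, 1)
print(deduplicated)
```

Because dictionaries preserve insertion order in Python 3.7 and later, the apps come out in the order in which their names first appeared.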

In [101]:

def max_reviews(dataset, name_index, reviews_index):
    # Maps each app name to the highest review count found among its rows
    reviews_max = {}

    for row in dataset[1:]:  # skip the header row
        name = row[name_index]
        reviews = float(row[reviews_index])

        # Store the count for a new name, or replace a smaller stored count
        if name not in reviews_max or reviews > reviews_max[name]:
            reviews_max[name] = reviews

    return reviews_max

In [102]:
g_reviews_max = max_reviews(google_data, 0, 3)
i_reviews_max = max_reviews(ios_data, 1, 5)

Now let's check whether our dictionary has actually captured the entry with the highest number of reviews. A few cells ago, we saw that the highest number of reviews for Facebook in the Play Store data was 78158306. Therefore, our dictionary g_reviews_max should store the same value under the key 'Facebook'.

In [103]:
g_reviews_max['Facebook']

78158306.0

In [104]:
i_reviews_max['Instagram']

2161558.0

As we can see from the above outputs, our dictionaries have captured the maximum number of reviews for these apps successfully.

In the function below, we create two lists:
* apps_clean to store, for each app, its row with the highest number of reviews
* apps_already_added to store the names of the apps that we have already added, so that we don't add an app twice when two of its rows share the same highest number of reviews

We loop over each row in our Play Store and App Store datasets and check two conditions:
* The name of the app is not present in the apps_already_added list
* The number of reviews in that row is equal to the highest number of reviews for the app in the g_reviews_max/i_reviews_max dictionary

If both conditions are satisfied, we add the row to the apps_clean list and the app's name to the apps_already_added list.

In [108]:
def remove_duplicates(dataset, name_index, reviews_index):
    apps_clean = []
    apps_already_added = []

    # Choose the matching dictionary once, rather than on every row
    if dataset == google_data:
        reviews_max = g_reviews_max
    elif dataset == ios_data:
        reviews_max = i_reviews_max
    else:
        print("Wrong dataset")
        return apps_clean

    for row in dataset[1:]:  # skip the header row
        name = row[name_index]
        reviews = float(row[reviews_index])

        if name not in apps_already_added and reviews == reviews_max[name]:
            apps_clean.append(row)
            apps_already_added.append(name)

    return apps_clean

In [109]:
clean_google_data = remove_duplicates(google_data, 0, 3)

In [116]:
print(len(clean_google_data), len(google_unique_apps))

for element in clean_google_data:
    if element[0] == "Facebook":
        print(element)

9659 9660
['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


In [112]:
clean_ios_data = remove_duplicates(ios_data, 1, 5)

In [117]:
print(len(clean_ios_data), len(ios_unique_apps))

for element in clean_ios_data:
    if element[1] == "Facebook":
        print(element)

7195 7196
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


As we can see from the above outputs, the clean_google_data and clean_ios_data lists contain one row per unique app name, namely the row with the highest number of reviews. Note that each cleaned list is one entry shorter than its google_unique_apps or ios_unique_apps counterpart, because those lists also counted the header row.
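
Beyond comparing lengths, a quick sanity check that no app name survives twice could look like the sketch below. The helper `has_unique_names` and the toy rows are hypothetical.

```python
def has_unique_names(rows, name_index):
    """Return True if every row has a distinct value at name_index."""
    names = [row[name_index] for row in rows]
    return len(names) == len(set(names))


# Toy rows for illustration only.
print(has_unique_names([['Facebook', '1'], ['Example App', '2']], 0))
print(has_unique_names([['Facebook', '1'], ['Facebook', '2']], 0))
```

Under these assumptions, calling it as `has_unique_names(clean_google_data, 0)` or `has_unique_names(clean_ios_data, 1)` would be expected to return True after deduplication.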

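## Deleting Non-English App Data

The check_eng function below treats an app as English if its name contains at most three characters outside the ASCII range (code points above 127). This way, names that include an occasional emoji or a symbol such as ™ are still kept.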

In [144]:
def check_eng(dataset,name_index):
    english_apps = []
 
    for row in dataset:
        name = row[name_index]
        count = 0
        for element in name:
            if ord(element) > 127:
                count += 1
        if count <= 3:
            english_apps.append(row)
        
    return english_apps

In [145]:
eng_google_apps = check_eng(clean_google_data, 0)
eng_ios_apps = check_eng(clean_ios_data, 1)

In [146]:
print(len(eng_google_apps))
print(len(eng_ios_apps))

9614
6181


## Isolating Free Apps

In [148]:
print(eng_google_apps[1])
print('\n')
print(eng_ios_apps[1])

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


The free_apps function takes the eng_google_apps and eng_ios_apps lists as its dataset argument, along with the index position of the price value in the respective dataset. It loops over each row and checks whether the price is zero (stored as the string '0' or '0.0'). If it is, the row is appended to the final cleaned list of apps; otherwise, the row is ignored.

In [149]:
def free_apps(dataset, price_index):
    final_apps = []
    
    for row in dataset:
        price = row[price_index]
        if price == "0" or price == "0.0":
            final_apps.append(row)
            
    return final_apps

In [150]:
final_google_data = free_apps(eng_google_apps, 7)
final_ios_data = free_apps(eng_ios_apps, 4)

In [151]:
print(len(final_google_data))
print(len(final_ios_data))

8864
3220


This leaves 8864 Play Store apps and 3220 App Store apps to conduct our analysis on.
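
Comparing against the exact strings '0' and '0.0' works for the values seen in the rows printed above. A slightly more tolerant check could parse the price as a number instead, as in this sketch. The helper `is_free` is hypothetical, and the assumption that paid prices may carry a leading '$' is not confirmed by the rows shown in this notebook.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip('$')  # assumed: prices may start with '$'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # prices that cannot be parsed are treated as not free


print(is_free('0'), is_free('0.0'), is_free('$4.99'))
```

Treating unparseable values as not free is a conservative choice: such rows are left out of the analysis instead of being counted as free apps.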

# Analyzing the Cleaned Lists

The validation strategy for an app idea has three steps:

* Build a minimal Android version of the app, and add it to Google Play.
* If the app has a good response from users, develop it further.
* If the app is profitable after six months, build an iOS version of the app and add it to the App Store.

Because the end goal of this project is to find app profiles that are successful in both markets, let's begin the analysis by determining the most common genres for each market.

For the Play Store dataset: Columns like 'Category' and 'Genres' can give us an idea of how the apps are distributed across categories and genres.

For the App Store dataset: Similarly, the 'prime_genre' column shows how the apps are distributed across genres.

In [156]:
def freq_table(dataset, index):
    freq_dict = {}
    total_rows = 0
    
    for row in dataset:
        total_rows += 1
        key_value = row[index]
        if key_value in freq_dict:
            freq_dict[key_value] += 1
        else:
            freq_dict[key_value] = 1
    
    for key in freq_dict:
        freq_dict[key] /= total_rows 
        freq_dict[key] = round(freq_dict[key]*100, 2)
            
    return freq_dict

In [157]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
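
The two helpers above can also be condensed with `collections.Counter` from the standard library. The following is a minimal sketch under that assumption; `percentage_table` and the toy rows are hypothetical names and data.

```python
from collections import Counter


def percentage_table(dataset, index):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common()]


# Toy rows for illustration only.
rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = percentage_table(rows, 1)
print(table)
```

One difference from display_table: values with equal counts appear here in first-seen order, whereas display_table sorts tied entries by name in reverse alphabetical order.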

In [158]:
display_table(final_google_data, 1 )

FAMILY : 19.22
GAME : 9.51
TOOLS : 8.46
BUSINESS : 4.58
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.54
SPORTS : 3.42
PERSONALIZATION : 3.32
COMMUNICATION : 3.25
HEALTH_AND_FITNESS : 3.07
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.78
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.13
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
ENTERTAINMENT : 0.88
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


In [160]:
display_table(final_google_data, -4)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.58
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.54
Sports : 3.46
Personalization : 3.32
Communication : 3.25
Action : 3.1
Health & Fitness : 3.07
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.86
Video Players & Editors : 1.78
Casual : 1.75
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.44
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

In [161]:
display_table(final_ios_data, -5)

Games : 58.14
Entertainment : 7.89
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.52
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.34
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12
