# Profitable App Profiles for the App Store and Google Play Markets

We try to find a profitable free app profile that will suit both the App Store and Google Play markets. For that task, we will be using two data sets - [one](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about around 7200 App Store applications, and the [second](https://www.kaggle.com/lava18/google-play-store-apps) one containing data about around 10 000 Google Play applications. 

My goal in this project is to further my data science skills with some hands-on problems. 

First, let's start by opening our data sets, loading them into lists and then printing the first few rows of each data set to make sure they've been loaded correctly. Let's also count the amount of columns and rows to make sure that we've loaded all data into our lists. 

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row, '\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n')

In [2]:
def read_data(dataset_path):
    
    from csv import reader

    open_dataset = open(dataset_path, encoding='utf8')
    read_dataset = reader(open_dataset)
    list_dataset = list(read_dataset)    
    return list_dataset

In [3]:
appstore_dataset = read_data('E:\Jupyter\Profitable App Profiles for App Store and Google Play\AppleStore.csv')
googleplay_dataset = read_data('E:\Jupyter\Profitable App Profiles for App Store and Google Play\googleplaystore.csv')

explore_data(appstore_dataset, 0, 5, True)
explore_data(googleplay_dataset, 0, 5, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'] 

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'] 

['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'] 

['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'] 

Number of rows: 7198
Number of columns: 17


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 

As we can see above, our lists seem to be loaded correctly. Now, let's assume that our app has to be free and directed towards english-speaking audience. We need to clean our data sets by removing wrong data and paid or non-english apps and then print column names(descriptions of which can be found in the links provided at the beginning) to find columns that might help us with our analysis.

The Google Play data set has a dedicated discussion section, where one of [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) mentions an error at entry 10472(10473 with header included). Let's print that row and see if it really is incorrect. 

In [4]:
print(googleplay_dataset[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


It is indeed incorrect. What we can do now is just simply remove that row with `del` statement. Let's print that row again. 

In [5]:
del googleplay_dataset[10473]
print(googleplay_dataset[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


Now that we know there are no more errors in our data, we should check our data sets for duplicates by writing a simple function that will check apps' names and create a list of unique apps and a list of duplicates. Note that our data sets are built in a different order, so we need to add `name_index` argument to our function, which will let us work with both data sets using only one function. 

In [6]:
def find_duplicates(dataset, name_index):
    duplicate_apps = []
    unique_apps = []
    for row in dataset[1:]:
        if row[0] in unique_apps:
            duplicate_apps.append(row[name_index])
        else:
            unique_apps.append(row[name_index])
    print('There are ' + str(len(duplicate_apps)) + ' duplicate apps in the data set')
    len(duplicate_apps)
    return len(duplicate_apps)


In [7]:
android_duplicates = find_duplicates(googleplay_dataset, 0)
appstore_duplicates = find_duplicates(appstore_dataset, 2)

There are 1181 duplicate apps in the data set
There are 0 duplicate apps in the data set


We can see there are quite a few duplicates in our data set that we will need to remove. Since we can't do that randomly, we have to find a criterion which will make our data the most accurate. In this case, we can judge the relevance of our data by the amount of user reviews(the latest app data has the most reviews). 

To check if our function works correctly, we can calculate the expected amount of unique apps and compare it with the length of our list after removing duplicates from it. 

In [8]:
def remove_duplicates_android(dataset):
    reviews_max = {}
    for row in dataset[1:]:
        if row[0] in reviews_max:
            if row[3] > reviews_max[row[0]]:
                reviews_max[row[0]] = row[3]
        else:
            reviews_max[row[0]] = row[3]
    print('Expected number of unique android apps: ', len(dataset) - 1 -android_duplicates)
    print('Our number of unique android apps: ', len(reviews_max))
    clean_data = []
    clean_data.append(dataset[0])
    added_apps = []
    for row in dataset[1:]:
        if row[3] == reviews_max[row[0]] and row[0] not in added_apps:
            clean_data.append(row)
            added_apps.append(row[0])
    return clean_data

In the code above, we created a dictionary to hold the unique names of our apps and their highest amount of reviews. It allowed us later to create two empty lists - for cleaned data and for the names of apps that have already been put into that list. 

Then, we simply just looped through our data set(excluding header) and checked if the current app's number of reviews equals its highest amount possible saved in the `reviews_max` and if its name has already been added to our control list called `added_apps` that ensures no duplicates are put into `clean_data` list. 

To make sure that our function work correctly, we also checked if the length of cleaned list equals the expected number of unique apps. 

In [9]:
googleplay_dataset = remove_duplicates_android(googleplay_dataset)

Expected number of unique android apps:  9659
Our number of unique android apps:  9659


We can finally move on to the next problem - removing non-english apps. To do that, we should know that each character we use has its own corresponding number, which we can get by using `ord()` function. According to ASCII system, the characters commonly used in english texts are numbered 0 to 127. Since some english apps might use non-english characters, we will assume that non-english app is an app that contains more than 3 non-english characters in its name. We can clean non-english apps from our data sets by checking each row's name column for characters with assigned number outside of our 0-127 range. To do that, we will create a function that iterates through our lists and then removes every app that contains more than 3 non-english characters. 

In [10]:
def remove_non_english(dataset, name_index):
    clean_data = []
    for row in dataset:
        non_english_chars = 0
        for character in row[name_index]:
            if ord(character) > 127:
                non_english_chars += 1
        if non_english_chars <= 3:
            clean_data.append(row)
    return clean_data

Let's create some simple and short test lists to check if our functions work correctly.

In [11]:
test_list_android = [['Instagram'], ['爱奇艺PPS -《欢乐颂2》电视剧热播'], ['Docs To Go™ Free Office Suite'], ['Instachat 😜']]
test_list_appstore = [['','','Instagram'], ['','','爱奇艺PPS -《欢乐颂2》电视剧热播'], ['','','Docs To Go™ Free Office Suite'], ['','','Instachat 😜']]

test_list_android = remove_non_english(test_list_android, 0)
test_list_appstore = remove_non_english(test_list_appstore, 2)
print(test_list_android, '\n', test_list_appstore)

[['Instagram'], ['Docs To Go™ Free Office Suite'], ['Instachat 😜']] 
 [['', '', 'Instagram'], ['', '', 'Docs To Go™ Free Office Suite'], ['', '', 'Instachat 😜']]


As we can see, our cleaning functions got rid of an app named **爱奇艺PPS -《欢乐颂2》电视剧热播**, but left **Instachat 😜** and **Docs To Go™ Free Office Suite**, both of which also contain non-english characters. Now that we've confirmed that everything works, we can remove non-english apps from our data sets. 

In [12]:
googleplay_dataset = remove_non_english(googleplay_dataset, 0)
appstore_dataset = remove_non_english(appstore_dataset, 2)
print('Remaining number of android apps: ', len(googleplay_dataset))
print('Remaining number of ios apps: ', len(appstore_dataset))

Remaining number of android apps:  9615
Remaining number of ios apps:  6184


The last thing we have to do to clean our data is removing any non-free apps. We can see that in both data sets prices are strings and the price of each free app is a string `'0'`. It allows us to iterate through our data sets and to assign free apps(apps containing `'0'` in price olumn) to a new list. 

In [13]:
def remove_non_free(dataset, price_index):
    clean_data = []
    clean_data.append(dataset[0])
    for row in dataset[1:]:
        if row[price_index] == '0':
            clean_data.append(row)
    return clean_data

In [14]:
googleplay_dataset = remove_non_free(googleplay_dataset, 7)
appstore_dataset = remove_non_free(appstore_dataset, 5)
print('Number of free android apps: ', len(googleplay_dataset))
print('Number of free ios apps: ', len(appstore_dataset))

Number of free android apps:  8863
Number of free ios apps:  3223


As we can see, after the final cleaning we're left with 8862 Android apps and 3222 iOS apps. We can now move on to finding the most suitable app profile for our analysis. Our app is going to be free, so we need to determine the kinds of apps that are likely to attract the most users. 

Our validation strategy for an app idea is made of three steps:
- We create a minimal version of our app and add it to Google Play
- If the app gets a good response we develop it futher
- If the app is profitable after six months, we build an iOS version and add it to App Store

Since our goal is to find an app that will perform well on Google Play and App Store, we need to find app profiles that are successful on both markets. We can start our analysis by finding the most popular genres on both platforms. To do that, we will write a function that creates an empty dictionary, fills it with genres listed in our data sets and if the app's genre already exists in our dictionary, it will increase our genre's frequency by 1.