# Profitable App Profiles for the App Store and Google Play Markets

In this project, we explore what types of applications are available for download on the Apple App Store and Google Play. We analyze free applications, which rely on advertising revenue, and look for trends in the data that point to profitable app profiles.

First, we import the application data. In the code below, the .csv files containing the application data are opened in Python, and each one is stored as a list of lists with the header row separated from the data rows.

In [1]:
from csv import reader

### The Google Play data set ###
android_open = open('googleplaystore.csv', encoding='utf8')
android_read = reader(android_open)
android = list(android_read)
android_header = android[0]
android = android[1:]

### The App Store data set ###
ios_open = open('AppleStore.csv', encoding='utf8')
ios_read = reader(ios_open)
ios = list(ios_read)
ios_header = ios[0]
ios = ios[1:]
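
The cell above leaves both files open. As a minimal sketch, the same loading step can be wrapped in a context manager so that each file is closed after reading. The helper name `load_csv` is hypothetical and not part of the original notebook, and the demo writes a tiny temporary file so the snippet runs without the Kaggle data.

```python
import os
import tempfile
from csv import reader


def load_csv(path):
    """Return (header, rows) for a CSV file; the file is closed on exit."""
    # newline='' is the setting the csv module recommends for files given to reader()
    with open(path, encoding='utf8', newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]


# Demo on a throwaway two-row file (a stand-in for googleplaystore.csv)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write('App,Category\nSketch,ART_AND_DESIGN\nBox,BUSINESS\n')

demo_header, demo_rows = load_csv(tmp.name)
os.remove(tmp.name)  # clean up the temporary file

print(demo_header)
print(len(demo_rows))
```

With the real data, the call would look like `android_header, android = load_csv('googleplaystore.csv')`.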

Next, we will define a function called explore_data that will allow us to slice our data and make it more readable. This function will also allow us to see the size of our dataset (numbers of rows and columns).

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Below we display the first few rows of the Android application data, along with the number of rows and columns. More detailed information about the dataset can be found [here](https://www.kaggle.com/lava18/google-play-store-apps).

In [3]:
print(android_header)
print('\n')
explore_data(android,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


See below for description of the columns that are found in the Google Play dataset.

| Column Title   | Description |
| ----------- | :----------- |
| App      | Application name |
| Category   | Category the app belongs to        |
| Rating   | Overall user rating of the app (as when scraped)       |
| Reviews   | Number of user reviews for the app (as when scraped)        |
| Size   | Size of the app (as when scraped)       |
| Installs   | Number of user downloads/installs for the app (as when scraped)      |
| Type   | Paid or Free  |
| Price   | Price of the app (as when scraped)       |
| Content Rating   | Age group the app is targeted at - Children / Mature 21+ / Adult       |
| Genres   | An app can belong to multiple genres (apart from its main category). Eg, a musical family game will belong to     |
| Last Updated   | Date the application was last updated        |
| Current Ver   | Current version of the application |
| Android Ver   | Current version of Android that is compatible with the application        |

Below we display the first few rows of the Apple iOS application data, along with the number of rows and columns. More detailed information about the dataset can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).



In [4]:
print(ios_header)
print('\n')
explore_data(ios,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


See below for description of the columns that are found in the Apple Store dataset.

| Column Title   | Description |
| ----------- | :----------- |
| id      | App ID |
| track_name   | App Name        |
| size_bytes   | Size (in Bytes)       |
| currency   | Currency Type        |
| price   | Price amount       |
| rating_count_tot   | User Rating counts (for all versions)      |
| rating_count_ver   | User Rating counts (for current version)  |
| user_rating   | Average User Rating value (for all versions)  |
| user_rating_ver   | Average User Rating value (for current version)       |
| ver   | Latest version code      |
| cont_rating   | Content Rating   |
| prime_genre   | Primary Genre |
| sup_devices.num   | Number of supporting devices |
| ipadSc_urls.num   | Number of screenshots shown for display |
| lang.num   | Number of supported languages |
| vpp_lic   | Vpp Device Based Licensing Enabled        |

# Removing an Incorrect App Entry

In the discussion section of the Google Play Store data set, one of the discussions outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [5]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


It is clear that the "Life Made WI-Fi" entry is incorrect: the row has only 12 values for 13 columns, and its second value is '1.9' where the category should be. We delete this row from the dataset below.

In [6]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android))

10841
10840
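
The faulty row was located here through the Kaggle discussion, but a row that is missing a value can also be found programmatically by comparing its length with the header's. The following is a minimal sketch on toy data; `find_malformed_rows` is a hypothetical helper that is not part of the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data: the second row has no category, so it is one column short
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]

for index, row in find_malformed_rows(sample_header, sample_rows):
    print(index, len(row), row)
```

Run on the full dataset before the deletion, `find_malformed_rows(android_header, android)` would be expected to flag any row whose length is not 13.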


# Removing Duplicate App Entries

Next, we will begin to analyze our data to determine if there are any errors or corrections needed. We will begin by determining if there are any duplicates in the Android dataset.

The code below looks at the name of each application in our android dataset. If the name is unique, it will be added to a list of all of the unique app names. If the name has already been added previously to the list of unique apps, it will be added to the duplicates list.

In [7]:
unique_android_apps = []
duplicate_android_apps = []

for app in android:
    if app[0] in unique_android_apps:
        duplicate_android_apps.append(app[0])
    else:
        unique_android_apps.append(app[0])
        
print('There are ' + str(len(unique_android_apps)) + ' unique apps in the android data set.' + '\n\n' +'There are '+ str(len(duplicate_android_apps)) + ' duplicate apps in the android data set.')
print('\n The first few unique application names will be printed below \n\n' + str(unique_android_apps[0:4]))
print('\n The first few duplicate application names will be printed below \n\n' + str(duplicate_android_apps[0:4]))

There are 9659 unique apps in the android data set.

There are 1181 duplicate apps in the android data set.

 The first few unique application names will be printed below 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'Coloring book moana', 'U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'Sketch - Draw & Paint']

 The first few duplicate application names will be printed below 

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings']
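
Checking `app[0] in unique_android_apps` scans a list on every iteration, which gets slow as the list grows. One possible alternative, sketched here on toy data, tracks the names already seen in a set so that each membership test takes roughly constant time. The function name `split_unique_and_duplicates` is hypothetical.

```python
def split_unique_and_duplicates(rows, name_index=0):
    """Return (unique_names, duplicate_names), preserving first-seen order."""
    seen = set()            # fast membership checks
    unique_names = []
    duplicate_names = []
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicate_names.append(name)   # every repeat occurrence is recorded
        else:
            seen.add(name)
            unique_names.append(name)
    return unique_names, duplicate_names


sample = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zoom']]
unique_names, duplicate_names = split_unique_and_duplicates(sample)
print(unique_names)
print(duplicate_names)
```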


We want to remove duplicate apps from our dataset while keeping exactly one entry per application. To decide which entry to keep, we look at the number of reviews: the assumption is that the higher the number of reviews, the more recent the entry. We will create a dictionary that maps each unique application to the maximum number of reviews recorded for it in our dataset.

To build the dictionary, we loop through the android data set. Each time we encounter a new application name, we add the application and its number of reviews to the dictionary. Whenever we encounter a duplicate whose number of reviews is greater than the value already stored for that app, we update the stored value.

In [8]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))
print(reviews_max['Photo Editor & Candy Camera & Grid & ScrapBook'])


9659
159.0


Now that we have a dictionary mapping each unique application to its maximum number of reviews, we will use it to clean up our dataset.

For this task, we create new empty lists called android_clean and already_added.

We loop through the android data set, adding each qualifying row to android_clean and its name to already_added. A row is added only if both of the following are true:
- The number of reviews for the row equals the maximum number of reviews for that app (per the reviews_max dictionary created above). This ensures that we do not keep out-of-date entries with a smaller review count.
- The application name has not already been added to the already_added list. This ensures that if several duplicates share the same maximum review count, only one of them is saved to the cleaned data set.


In [9]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
print(len(android_clean))
        

9659
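
The two passes above (building the maximum-review dictionary, then filtering) can also be folded into one. The sketch below is an alternative to the notebook's approach, not a restatement of it: it stores the best row seen so far for each name and replaces it only when a strictly higher review count appears, so ties keep the first row. The function name `keep_most_reviewed` is hypothetical, and the resulting order follows each name's first appearance, which may differ from the order of android_clean.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace only on a strictly larger count, so ties keep the earlier row
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows: name, category, rating, reviews
sample = [
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Zoom', 'BUSINESS', '4.4', '31614'],
    ['Box', 'BUSINESS', '4.2', '160'],
    ['Box', 'BUSINESS', '4.2', '160'],
]
deduped = keep_most_reviewed(sample)
print(len(deduped))
print(deduped[0])
```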


Above, we have removed any duplicate applications from the android data set. 

Now, we will review the iOS data set and use app IDs to determine whether we need to remove duplicates from that data set as well.

In [10]:
unique_ios_apps = []
duplicate_ios_apps = []

for app in ios:
    if app[0] in unique_ios_apps:
        duplicate_ios_apps.append(app[0])
    else:
        unique_ios_apps.append(app[0])
        
print('There are ' + str(len(unique_ios_apps)) + ' unique apps in the ios data set.' + '\n\n' +'There are '+ str(len(duplicate_ios_apps)) + ' duplicate apps in the ios data set.')
print("\n The first few unique application id's will be printed below \n\n" + str(unique_ios_apps[0:4]))


There are 7197 unique apps in the ios data set.

There are 0 duplicate apps in the ios data set.

 The first few unique application id's will be printed below 

['284882215', '389801252', '529479190', '420009108']


# Removing Non-English Applications
For our analysis we are interested only in applications aimed at an English-speaking audience. Upon review of our cleaned data sets, we find that some applications in both the iOS and Android stores have names in other languages. We will remove these applications to further clean the data sets.

Below we can see a couple of examples from each app store that will be removed.

In [11]:
print(ios[813][1])
print(ios[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


Now, we will define a function that takes a string and determines whether it contains non-English characters. For this exercise, we treat a character as non-English if its code point, as returned by the ord() function, is greater than 127, that is, outside the ASCII range.

Some characters (such as the ™ and 😜 characters in the strings tested below) fall outside the ASCII range of 0 to 127 but should not by themselves mark a name as non-English. For this reason, we only treat a string as non-English once it contains more than three such characters.

See below for the definition of the function and a few test cases.

In [12]:
def is_english(string):
    string = str(string)
    non_english_char = 0
    for character in string:
        if ord(character) > 127:
            non_english_char += 1
            if non_english_char == 4:
                return False
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
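
To make the threshold explicit, here is a small sketch of the same rule with the cut-off exposed as a parameter. The name `is_mostly_ascii` and the parameter `max_non_ascii` are hypothetical. Unlike is_english, it counts every character instead of returning early, which is simpler but does slightly more work on long strings.

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """Return True if text has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for ch in str(text) if ord(ch) > 127)
    return non_ascii <= max_non_ascii


print(is_mostly_ascii('Instagram'))
print(is_mostly_ascii('Docs To Go™ Free Office Suite'))
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_mostly_ascii('abc™™™'))    # exactly three non-ASCII characters
print(is_mostly_ascii('abc™™™™'))   # four non-ASCII characters
```

The last two calls probe the boundary: a name with exactly three non-ASCII characters is still accepted, while a fourth one causes it to be rejected.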


In [13]:
android_english_clean = []
android_non_english_clean = []

for app in android_clean:
    if is_english(app[0]) == True:
        android_english_clean.append(app)
    else:
        android_non_english_clean.append(app)
        
print(len(android_english_clean))
print(len(android_non_english_clean))

ios_english_clean = []
ios_non_english_clean = []

for app in ios:
    if is_english(app[1]) == True:
        ios_english_clean.append(app)
    else:
        ios_non_english_clean.append(app)

print(len(ios_english_clean))
print(len(ios_non_english_clean))


9614
45
6183
1014


So far, we have spent time cleaning our Apple store and Android datasets, removing incorrect data, removing duplicate entries and non-English applications.

Now we want to isolate the applications that are free to download. To do so, we will loop through the cleaned lists of applications and create new lists containing only the apps whose price is zero.

To do this, we define a new function, free_apps, that takes a list of applications and the index of the price column as inputs. For both datasets, float() can be used to identify applications with a price of zero. However, Android applications with a non-zero price cannot be converted with float() because the price contains a '$' symbol in front of the amount. For this reason, and because we are only looking for free applications, we use a try / except block to skip over these paid applications.

Free applications from our datasets are returned as outputs of the free_apps function and can be saved as new variables below.

In [14]:
def free_apps(app_list,price_column):
    free_app_list = []
    for app in app_list:
        try:
            app_price = float(app[price_column])
        except ValueError:  # raised by float() for prices such as '$4.99'
            continue
        if app_price == 0:
            free_app_list.append(app)
    return free_app_list

free_android_apps = free_apps(android_english_clean,7)
print(len(free_android_apps))

free_ios_apps = free_apps(ios_english_clean,4)
print(len(free_ios_apps))



8864
3222
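
Skipping rows with try / except works because only free apps are needed, but it would also hide any other unparsable value. As an alternative sketch, the '$' prefix can be stripped before conversion so that every price is parsed explicitly. The helper names `parse_price` and `free_apps_by_price` are hypothetical, and the '$4.99' format in the toy rows is an assumption based on the description above.

```python
def parse_price(raw):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(str(raw).strip().lstrip('$'))


def free_apps_by_price(app_list, price_column):
    """Return the rows whose parsed price is exactly zero."""
    return [app for app in app_list if parse_price(app[price_column]) == 0]


# Toy rows: name, price
sample = [['Free App', '0'], ['Paid App', '$4.99'], ['Another Free App', '0.0']]
free_sample = free_apps_by_price(sample, 1)
print(len(free_sample))
print(parse_price('$4.99'))
```

With this version, a genuinely malformed price raises a ValueError instead of being dropped silently, which may or may not be desirable depending on how clean the data is.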


We would like to find out whether some categories of applications are popular on both the iOS App Store and Google Play. To do so, we compute the percentage of free applications belonging to each genre in each store. The relevant columns are prime_genre for the iOS dataset and Category for the Android dataset.

We will use the freq_table function below to loop through our free lists of applications to create a frequency table / dictionary of the genres of applications. The display_proportion_table function will be used to sort the genres of applications from largest proportion to lowest proportion. 

In [15]:
def freq_table(dataset, index):
    table = {}
    numelements = 0
    for app in dataset:
        numelements += 1
        if app[index] in table:
            table[app[index]] += 1
        else:
            table[app[index]] = 1
            
    table_percentages = {}
    for key in table:
        table_percentages[key] = table[key] * 100 / numelements
    return table_percentages

def display_proportion_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
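
The same percentage table can be built with the standard library's `collections.Counter`, shown here as a sketch on toy data. The function name `percentage_table` is hypothetical. One difference from display_proportion_table: `most_common()` orders ties by first appearance, whereas sorting (percentage, key) tuples in reverse orders ties by key.

```python
from collections import Counter


def percentage_table(dataset, index):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(key, count * 100 / total) for key, count in counts.most_common()]


# Toy rows: name, genre
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Music']]
for genre, pct in percentage_table(sample, 1):
    print(genre, ':', pct)
```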
        

We will use the functions defined above to review the categories of applications found in our free iOS applications list.

In [16]:
display_proportion_table(free_ios_apps, 11) #prime_genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.6623215394165114
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.017380509000621
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Above, we can see that among free English applications, Games is by far the most common genre on the iOS App Store, followed by Entertainment and Photo & Video. These three genres make up over 70% of the apps in our list, so the free iOS catalog seems to be largely populated with applications for fun or social purposes.

Let's see if the same trend is found for android applications.

In [17]:
display_proportion_table(free_android_apps, 1) #Category column

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.700361010830325
MEDICAL : 3.5311371841155235
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.237815884476534
HEALTH_AND_FITNESS : 3.079873646209386
PHOTOGRAPHY : 2.9444945848375452
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768953
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418774
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075813
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 0

Interestingly, on Google Play the category with the most free applications is Family, followed by Game. The Android dataset appears to contain more productivity-focused and fewer game and entertainment applications than the iOS dataset, with the Tools and Business categories together making up more than 13% of free Android applications. Let's take a closer look at the Family category to gain more insight into our top Android category.

![alt text](FamilyCategory.png "Family Category of Applications")

We can see above that the Family category of free applications largely consists of games for children. While games are the dominant genre on iOS, games and family apps are split into two separate categories on Google Play. The two stores clearly categorize apps differently, which makes direct comparisons between them somewhat difficult.

# Install and Rating Counts for Application Categories
Now that we know the number of applications that are published for each category, it would be useful to know the popularity of each of these categories of applications. If we are looking to publish a popular and profitable application, it might be useful to publish it under a category that is often downloaded. 
For the Android store, a useful piece of data for determining popularity is the Installs column. This information is not present in the iOS dataset, so we will use the total rating count (rating_count_tot) as a proxy.

In [19]:
ios_category_ratingcount_frequency = freq_table(free_ios_apps, 11)

ios_category_ratings = {}
for category in ios_category_ratingcount_frequency:
    total = 0
    len_genre = 0
    for app in free_ios_apps:
        if app[11] == category:
            total += float(app[5])
            len_genre += 1
    ios_category_ratings[category] = total / len_genre

for i in ios_category_ratingcount_frequency:
    print(i,':', ios_category_ratings[i])

            

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


We can see that free iOS applications in the Navigation, Reference, and Social Networking genres tend to receive the most ratings on average, while Medical, Catalogs, and Education applications receive few. We will assume (and this assumption may or may not be valid) that the number of downloads an application gets is positively correlated with the number of ratings it receives. Under this assumption, we might choose to develop a social networking application: the genre is moderately represented on the iOS App Store (about 3.3% of free apps) and tends to attract a high number of ratings.
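
The averages above are computed by rescanning the whole app list once per genre. A single pass that accumulates a running total and count for each genre gives the same result with less work. The sketch below uses toy data, and the function name `average_by_group` is hypothetical.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows: genre, rating count
sample = [['Games', '100'], ['Games', '300'], ['Medical', '50']]
averages = average_by_group(sample, 0, 1)
print(averages)
```

For the iOS data in this notebook, the equivalent call would be `average_by_group(free_ios_apps, 11, 5)`.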

In [29]:
android_category_installcount_frequency = freq_table(free_android_apps, 1)

android_category_average_install_freqtable = {}
for category in android_category_installcount_frequency:
    total = 0
    len_category = 0
    for app in free_android_apps:
        category_app = app[1]
        if category_app == category:
            len_category += 1
            installs = app[5]
            installs = (installs.replace("+",""))
            installs_cleaned = int(installs.replace(",",""))
            total += installs_cleaned
    average_installs = total / len_category
    #print(category, ':', str(int(average_installs)))
    android_category_average_install_freqtable[category] = int(average_installs)
#print(android_category_average_install_freqtable)

import operator
sorted_x = sorted(android_category_average_install_freqtable.items(), key=operator.itemgetter(1),reverse=True)
print(sorted_x)


            

[('COMMUNICATION', 38456119), ('VIDEO_PLAYERS', 24727872), ('SOCIAL', 23253652), ('PHOTOGRAPHY', 17840110), ('PRODUCTIVITY', 16787331), ('GAME', 15588015), ('TRAVEL_AND_LOCAL', 13984077), ('ENTERTAINMENT', 11640705), ('TOOLS', 10801391), ('NEWS_AND_MAGAZINES', 9549178), ('BOOKS_AND_REFERENCE', 8767811), ('SHOPPING', 7036877), ('PERSONALIZATION', 5201482), ('WEATHER', 5074486), ('HEALTH_AND_FITNESS', 4188821), ('MAPS_AND_NAVIGATION', 4056941), ('FAMILY', 3695641), ('SPORTS', 3638640), ('ART_AND_DESIGN', 1986335), ('FOOD_AND_DRINK', 1924897), ('EDUCATION', 1833495), ('BUSINESS', 1712290), ('LIFESTYLE', 1437816), ('FINANCE', 1387692), ('HOUSE_AND_HOME', 1331540), ('DATING', 854028), ('COMICS', 817657), ('AUTO_AND_VEHICLES', 647317), ('LIBRARIES_AND_DEMO', 638503), ('PARENTING', 542603), ('BEAUTY', 513151), ('EVENTS', 253542), ('MEDICAL', 120550)]


Above we can see that Communication, Video Players, and Social applications receive the most installs on average on Google Play, while Parenting, Beauty, Events, and Medical applications receive the fewest. Communication applications make up a moderate share of free apps (about 3.2%) and attract many installs, so a communication app may be a good candidate to develop.
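
The install counts are stored as strings such as '10,000+', so they have to be cleaned before averaging. The sketch below isolates that step in a small helper and ranks categories by average installs in one pass. The names `parse_installs` and `rank_by_average_installs` are hypothetical, and the toy rows use only two columns, so the column indices are passed explicitly.

```python
def parse_installs(raw):
    """Convert an install bucket such as '10,000+' to the integer 10000."""
    return int(raw.replace('+', '').replace(',', ''))


def rank_by_average_installs(rows, category_index=1, installs_index=5):
    """Return (category, average installs) pairs, highest average first."""
    totals = {}
    for row in rows:
        total, count = totals.get(row[category_index], (0, 0))
        totals[row[category_index]] = (total + parse_installs(row[installs_index]),
                                       count + 1)
    # Integer division mirrors the truncation done with int() in the cell above
    averages = {cat: total // count for cat, (total, count) in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Toy rows: category, installs
sample = [
    ['MEDICAL', '1,000+'],
    ['COMMUNICATION', '1,000,000+'],
    ['COMMUNICATION', '500,000+'],
]
ranking = rank_by_average_installs(sample, category_index=0, installs_index=1)
print(ranking)
```

Note that the '+' means each value is a lower bound on the true install count, so these averages are approximate.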