# Data Analysis: Monetising Free Apps

The goal of this project is to analyse app data to identify and understand which types of apps are likely to attract more users.



In [2]:
from csv import reader

## Apple Store Data Set ##
# 'with' closes the file once it has been read; encoding='utf8' avoids
# decode errors on app names that contain non-ASCII characters
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

## Google Play Data Set ##
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]


In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print('Apple Store Sample', '\n')
print(ios_header, '\n')
explore_data(ios, 0, 5, rows_and_columns = True)

print('\n')
print('Google Play Store Sample', '\n')
print(android_header, '\n')
explore_data(android, 0, 5, rows_and_columns = True)

Apple Store Sample 

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of column

## Data Cleaning

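Row 10472 of the Google Play data set is missing its 'Category' value: it has 12 values instead of 13, so every value after the app name is shifted one column to the left.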

In [4]:
print(android[10472])
del android[10472]

print("Data #10472 Deleted")

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Data #10472 Deleted
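
One way to spot rows like this without knowing the index in advance is to compare each row's length with the header's length. The sketch below runs on a tiny made-up sample rather than the real file, and `find_malformed_rows` is a hypothetical helper name that is not part of the notebook above.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != len(header)]


# Tiny illustrative sample (not the real data set).
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],            # 'Category' is missing, so the row is too short
    ['App C', 'GAME', '4.5'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
print(malformed)
```

A length check only catches rows with missing or extra values; a row with the right number of columns but wrong content would still need a separate check.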


After the deletion, check which row now sits at index 10472.

In [5]:
android[10472]

['osmino Wi-Fi: free WiFi',
 'TOOLS',
 '4.2',
 '134203',
 '4.1M',
 '10,000,000+',
 'Free',
 '0',
 'Everyone',
 'Tools',
 'August 7, 2018',
 '6.06.14',
 '4.4 and up']

As shown above, the Google Play data set is not completely error-free, and not all of its errors are of the same type. Another kind is duplicate entries, which we can detect with the loop below.

In [6]:
unique_apps = []
duplicate_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('Examples of duplicate apps:', duplicate_apps[:5])

Number of duplicate apps: 1181
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
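
The loop above checks membership in a list, which gets slower as the list grows. A possible alternative, sketched here on made-up names, tracks the names already seen in a set instead. `count_duplicates` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def count_duplicates(names):
    """Split names into the set of unique names and the list of repeats."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)   # every occurrence after the first
        else:
            seen.add(name)
    return seen, duplicates


# Made-up sample names for illustration.
sample_names = ['Box', 'Instagram', 'Box', 'Instagram', 'Instagram', 'Slack']
unique_names, duplicate_names = count_duplicates(sample_names)
print('Number of duplicate entries:', len(duplicate_names))  # 3
print('Number of unique names:', len(unique_names))          # 3
```

As in the original loop, an app that appears three times contributes two entries to the duplicates list.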


Here is another approach, useful when we already know which app to look for. Having collected the names of duplicated apps, we can print every entry for one of them, such as 'Instagram'.

In [7]:
for app in android:
    if app[0] == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


The main difference among these entries is the value in the 4th column, the number of reviews. This suggests that the entries were collected at different times.

The more recent an entry is, the more reviews it should have. Hence, for each app we'll keep only the entry with the highest number of reviews, which is a better criterion than picking an entry at random.

Next, we'll work out how to remove all duplicate entries from our data.

First, we'll create a dictionary that maps each **unique** app name to its highest number of reviews.

In [8]:
## Dictionary mapping each unique app name to its highest review count ##
reviews_max = {}

## Iterate through every data point ##
for app in android:
    name = app[0]
    n_reviews = float(app[3])

    if name in reviews_max and reviews_max[name] < n_reviews:
        ## The app name is already in the dictionary: replace the stored
        ## count ONLY IF the current entry has a higher number of reviews
        reviews_max[name] = n_reviews

    elif name not in reviews_max:
        ## The app name is not in the dictionary yet: store this count
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


Second, now that the dictionary holds each unique name with its highest number of reviews, we'll iterate through the data set again and match each entry against the dictionary.

Before that, we'll create 2 empty lists, `android_clean` and `already_added`, to separate the unique entries from the duplicates.

Iteration: if an entry's number of reviews matches the number in the dictionary and the app name is not yet in `already_added`, we append the entry to `android_clean` and then add the app name to `already_added`. The name check matters because several duplicates of an app can share the same highest review count, and without it all of them would be added.


In [9]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
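
The two passes above can also be folded into one. The sketch below is an alternative, not the method used in this notebook: it stores the whole row in a dictionary keyed by app name and replaces it only when a later row has strictly more reviews. `keep_most_reviewed` and the sample rows are made up for illustration.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened, made-up rows: name, category, rating, reviews.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

sample_clean = keep_most_reviewed(sample_rows)
print(len(sample_clean))   # 2
print(sample_clean[0])     # ['Instagram', 'SOCIAL', '4.5', '66577446']
```

One difference to be aware of: the result is ordered by each name's first appearance, whereas the two-pass version orders rows by where the kept entry sits in the data set.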
                

We found earlier that there are 1181 duplicate entries, so our clean data set should have 10840 (the number of entries in the data set) - 1181 = 9659 entries. Let's check.

In [10]:
print(len(android_clean))

9659


Hooray!

# Removing Non-English Apps

Our data sets contain non-English apps, which we need to remove because our company is only interested in the English app market.

To decide whether an app is aimed at an English-speaking audience, we'll use a simple rule: if an app name contains any non-English characters, we classify the app as non-English.

Strings are iterable with a `for ... in ...` loop, so we can check each character in an app name and use its numeric code to determine whether the name contains foreign characters.

ASCII (American Standard Code for Information Interchange) assigns a number from 0 to 127 to each English letter, digit and common symbol. In Python, the `ord(character)` function returns a character's Unicode code point, which equals its ASCII value for these characters; any character with a value above 127 is outside ASCII.
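
To make this concrete, here is a small sketch that prints the `ord()` value of a few characters, some of them taken from the app names used in this section. The list of sample characters is chosen only for illustration.

```python
# A few sample characters: letters, a digit, a symbol, and three non-ASCII ones.
samples = ['a', 'Z', '7', '+', '™', '爱', '😜']

# Map each character to its code point.
code_points = {character: ord(character) for character in samples}

for character, code_point in code_points.items():
    label = 'ASCII' if code_point <= 127 else 'non-ASCII'
    print(repr(character), code_point, label)
```

The first four characters come out below 128, while '™', '爱' and '😜' are far above it, which is what the check in the next function relies on.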

Now let's build a function that checks whether a given string is in English or not.

In [11]:
def english_checker(string):
    for character in string:
        if ord(character) > 127:
            return False
        
    return True

print(english_checker('Instagram'))
print(english_checker('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_checker('Docs To Go™ Free Office Suite'))
print(english_checker('Instachat 😜'))

True
False
False
False


It has filtered out the non-English characters just fine! However, the 3rd and 4th apps are clearly intended for an English-speaking audience and just contain special characters in their names. We don't want to filter them out of our data set, so we'll have to adjust our function.

We'll modify the function so that it only returns `False` if an app name contains more than 3 non-ASCII characters.

In [12]:
def english_checker(string):
    count = 0
    for character in string:
        if ord(character) > 127:
            count += 1
    if count > 3:
        return False
    return True

print(english_checker('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_checker('Docs To Go™ Free Office Suite'))
print(english_checker('Instachat 😜'))

False
True
True


This is not a perfect way to clean the data set, but it is good enough for our demo.
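
Since the cut-off of 3 is a judgment call, it can help to make it a parameter so that different thresholds are easy to try. The following is only a sketch of that idea: `is_probably_english` and `max_non_ascii` are hypothetical names, and the default of 3 mirrors the `count > 3` test used above.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))    # True
print(is_probably_english('Instachat 😜'))                      # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))       # False
print(is_probably_english('Instachat 😜', max_non_ascii=0))     # False
```

With `max_non_ascii=0` the function behaves like the first, stricter version of the checker.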

Let's now use our function to filter the non-English apps out of our data sets.

In [13]:
android_clean_english = []
android_non_english = []

for app in android_clean:
    name = app[0]
    english = english_checker(name)
    if english == True:
        android_clean_english.append(app)
    else:
        android_non_english.append(app)
        
print(len(android_clean_english))
print(len(android_non_english))


9614
45


In [14]:
ios_english = []
ios_non_english = []

for app in ios:
    name = app[1]
    english = english_checker(name)
    if english == True:
        ios_english.append(app)
    else:
        ios_non_english.append(app)
        
print(len(ios_english))
print(len(ios_non_english))

6183
1014


We've removed 45 apps from the Google Play data set and a whopping 1014 apps from the Apple Store data set.

# Removing Non-Free Apps

Our company only develops free apps, so we'll have to remove the entries for apps priced above 0.

We'll create a new empty list that will only contain data for free apps, then iterate through the entries in our data set and check their prices. We'll only store an entry in the new list if its price is zero.

Beware that the prices are stored as strings, so the conditionals have to compare against the string `'0'` (or `'0.0'` in the App Store data set), NOT the number 0!
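
A quick sketch of why this matters: comparing a string with a number is simply `False` in Python, with no error to warn us. The helper `is_free` below is hypothetical, and it assumes that a paid price might carry a leading `'$'`, which is an assumption rather than something shown in the samples above.

```python
price_android = '0'      # the free price as it appears in the Google Play rows
price_ios = '0.0'        # the free price as it appears in the App Store rows

print(price_android == 0)       # False: a string never equals a number
print(price_android == '0')     # True
print(price_ios == '0')         # False: the two stores format zero differently
print(float(price_ios) == 0)    # True once converted to a number


def is_free(price_text):
    """Return True when a price string represents zero."""
    # Assumption: a paid price may start with '$'; strip it before converting.
    return float(price_text.lstrip('$')) == 0


print(is_free('0'), is_free('0.0'), is_free('$4.99'))   # True True False
```

Converting to a number first lets one check cover both stores, at the cost of failing loudly on any price string that is not numeric.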

In [15]:
android_final = []
for app in android_clean_english:
    price = app[7]
    if price == '0':     #Beware of zero string!!
        android_final.append(app)
        
ios_final = []

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))
        

8864
3222


After filtering, we're left with 8864 entries in the Google Play data set and 3222 entries in the App Store data set.

## Development Strategy

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea comprises three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.


Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

# Analysis

So which columns of the data sets can we use for analysis and prediction? The 'Genres' and 'Category' columns from the Google Play data set and the 'prime_genre' column from the App Store data set seem appropriate, because they divide the apps into definite categories.

We're now going to create 2 functions: one builds the frequency table of a column as percentages, and the other displays that table.

The frequency table function takes the data set and a column index as parameters. It first creates 2 variables: an empty dictionary that will store the count for each category, and an integer set to 0 that will count the total number of entries in the data set.

The function then iterates through the data set. For each entry, it increments the total counter by 1 and reads the category name. If the category is not yet in the dictionary it is stored with a value of 1; otherwise its existing count is incremented by 1.

After this first loop, we create a second empty dictionary that will store the percentage for each category.

A second loop then goes through the dictionary of counts. For each category, the function divides its count by the total, multiplies the result by 100, and stores the percentage in the second dictionary with the category as key.

Lastly, the function returns the dictionary containing the percentage for each category.

In [16]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages
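
For comparison, the same table can be built with `collections.Counter` from the standard library. This is only an alternative sketch: `freq_table_counter` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


# Made-up rows: app name, genre.
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]

sample_percentages = freq_table_counter(sample_rows, 1)
print(sample_percentages)   # {'Games': 75.0, 'Music': 25.0}
```

`Counter` takes care of the "first time seen versus already present" branch, so only the percentage step is left to write by hand.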



The next function, which displays the contents of the percentage table, also takes the data set and the column index as parameters.

First, it calls the previously defined function and stores the returned percentage dictionary in a variable.

It then creates an empty list that will hold tuples of the form (percentage, category), i.e. (value, key). We need this conversion because we want to display the table in order of percentage, not in alphabetical order of category. Sorting a dictionary directly orders it by its keys, which here are the category names, so each pair is reversed and stored in a separate list.

The function then iterates through the table, builds each tuple in the desired order, and appends it to the list.

After the loop, we create a sorted copy of the list with the `sorted()` function. We want the list ordered from high to low, so we set the `reverse` argument to `True`.

Finally, we iterate through the sorted list and print its contents.

In [17]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
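
Swapping each pair into a (value, key) tuple is one way to sort by percentage. Another, sketched below on a made-up table, is to pass a `key` function to `sorted()` so that the pairs can stay in (key, value) order. `sort_by_percentage` is a hypothetical helper name.

```python
def sort_by_percentage(table):
    """Return (category, percentage) pairs sorted from highest to lowest."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Made-up percentages for illustration.
sample_table = {'Music': 25.0, 'Games': 60.0, 'Weather': 15.0}

sample_sorted = sort_by_percentage(sample_table)
for genre, percentage in sample_sorted:
    print(genre, ':', percentage)
```

The two approaches can order tied percentages differently: the tuple version breaks ties by category name, while this version keeps tied entries in the order they appear in the dictionary.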

Let's analyse the 'prime_genre' column of the App Store data set.

In [18]:
print('App Store: Percentage Table by prime_genre')
display_table(ios_final, 11)


App Store: Percentage Table by prime_genre
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The most common genre among the free English apps in the App Store is Games, which makes up about 58% of the data set. That is overwhelming considering that the next most common genre, Entertainment, makes up only about 8%.

The data suggests that most of the free English apps in the App Store are designed for entertainment rather than practical purposes: within the top 10 genres, only 4 are practical, in 4th (Education), 6th (Shopping), 7th (Utilities) and 10th (Health & Fitness) position, and together they account for only about 11% of the data set.
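
The "about 11%" figure is simply the sum of the four percentages printed in the table above, as this small check shows. The grouping of these four genres as "practical" is the judgment made in the text, not a property of the data.

```python
# Percentages copied from the App Store table above.
practical_genres = {
    'Education': 3.662321539416512,
    'Shopping': 2.60707635009311,
    'Utilities': 2.5139664804469275,
    'Health & Fitness': 2.0173805090006205,
}

practical_share = sum(practical_genres.values())
print(round(practical_share, 1))   # 10.8
```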

Although it is clear that most developers build free English apps for entertainment purposes, especially games, it isn't clear whether that is because these types of apps reliably generate profit for developers. Further investigation is required.

Let's now analyse the Google Play Store data set by Category.

In [19]:
print('Google Play Store: Percentage Table by Category')
display_table(android_final, 1)


Google Play Store: Percentage Table by Category
FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661


When divided by category, the free English apps in the Google Play Store are much more evenly distributed than those in the Apple Store: 'Family' is the largest category at about 19%, followed by 'Game' at about 10% and 'Tools' at about 8%.

The data suggests that free English apps in the Google Play Store are mostly designed for practical purposes, as opposed to the free English apps in the Apple Store.

In [20]:
print('Google Store: Percentage Table by Genre')
display_table(android_final, 9)

Google Store: Percentage Table by Genre
Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411

Grouping the Google Play Store data set by Genre gives results similar to grouping it by Category.

For the Google Play Store data, it seems that developers are generally more interested in building apps for practical purposes than for entertainment. However, it isn't clear whether a specific genre or category is distinctly favoured, as Games is in the Apple Store.

These frequency tables reveal which genres developers produce the most, but not which genres have the most users.


# Which Genre/Category gets downloaded the most?

## Apple Store

In [21]:
apple_genre_table = freq_table(ios_final, 11)
print(apple_genre_table)

{'Business': 0.5276225946617008, 'Social Networking': 3.2898820608317814, 'Book': 0.4345127250155183, 'Finance': 1.1173184357541899, 'Photo & Video': 4.9658597144630665, 'Medical': 0.186219739292365, 'Shopping': 2.60707635009311, 'Food & Drink': 0.8069522036002483, 'Lifestyle': 1.5828677839851024, 'Sports': 2.1415270018621975, 'Health & Fitness': 2.0173805090006205, 'Education': 3.662321539416512, 'Travel': 1.2414649286157666, 'Entertainment': 7.883302296710118, 'Catalogs': 0.12414649286157665, 'Reference': 0.5586592178770949, 'Utilities': 2.5139664804469275, 'News': 1.3345747982619491, 'Productivity': 1.7380509000620732, 'Music': 2.0484171322160147, 'Navigation': 0.186219739292365, 'Games': 58.16263190564867, 'Weather': 0.8690254500310366}


In [22]:
for genre in apple_genre_table:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[11]
        if genre_app == genre:
            rating_count = float(app[5])
            total += rating_count
            len_genre += 1
            
    average_ratings = total / len_genre
    print(str(genre) + ': ' + str(average_ratings))

Business: 7491.117647058823
Social Networking: 71548.34905660378
Book: 39758.5
Finance: 31467.944444444445
Photo & Video: 28441.54375
Medical: 612.0
Shopping: 26919.690476190477
Food & Drink: 33333.92307692308
Lifestyle: 16485.764705882353
Sports: 23008.898550724636
Health & Fitness: 23298.015384615384
Education: 7003.983050847458
Travel: 28243.8
Entertainment: 14029.830708661417
Catalogs: 4004.0
Reference: 74942.11111111111
Utilities: 18684.456790123455
News: 21248.023255813954
Productivity: 21028.410714285714
Music: 57326.530303030304
Navigation: 86090.33333333333
Games: 22788.6696905016
Weather: 52279.892857142855
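
The nested loops above scan the whole data set once per genre. A possible single-pass alternative, sketched here on made-up rows, accumulates a running total and count per group as it goes. `average_by_group` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} in one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows: genre, rating count (stored as strings, as in the CSV).
sample_rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Music', '50'],
]

sample_averages = average_by_group(sample_rows, 0, 1)
print(sample_averages)   # {'Games': 200.0, 'Music': 50.0}
```

With the real data this would be called with the genre column (11) and the rating count column (5), the same indices used in the loop above.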


In [23]:
print(ios_final[0][5])

2974676


Using the total number of ratings as a stand-in for popularity, Social Networking, Finance, Food & Drink and Reference look like promising genres.

## Google Play Store

In [24]:
android_category_table = freq_table(android_final, 1)
print(android_category_table)

{'BUSINESS': 4.591606498194946, 'EVENTS': 0.7107400722021661, 'PHOTOGRAPHY': 2.944494584837545, 'LIBRARIES_AND_DEMO': 0.9363718411552346, 'MEDICAL': 3.531137184115524, 'VIDEO_PLAYERS': 1.7937725631768955, 'COMICS': 0.6204873646209386, 'GAME': 9.724729241877256, 'BEAUTY': 0.5979241877256317, 'FINANCE': 3.7003610108303246, 'LIFESTYLE': 3.9034296028880866, 'PRODUCTIVITY': 3.892148014440433, 'SHOPPING': 2.2450361010830324, 'HOUSE_AND_HOME': 0.8235559566787004, 'COMMUNICATION': 3.2378158844765346, 'FAMILY': 18.907942238267147, 'DATING': 1.861462093862816, 'ART_AND_DESIGN': 0.6430505415162455, 'PERSONALIZATION': 3.3167870036101084, 'WEATHER': 0.8009927797833934, 'NEWS_AND_MAGAZINES': 2.7978339350180503, 'FOOD_AND_DRINK': 1.2409747292418771, 'MAPS_AND_NAVIGATION': 1.3989169675090252, 'AUTO_AND_VEHICLES': 0.9250902527075812, 'TRAVEL_AND_LOCAL': 2.33528880866426, 'EDUCATION': 1.1620036101083033, 'TOOLS': 8.461191335740072, 'HEALTH_AND_FITNESS': 3.0798736462093865, 'PARENTING': 0.654332129963898

In [25]:
for category in android_category_table:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            install = app[5]
            install = install.replace(',', '')
            install = install.replace('+', '')
            install = float(install)
            total += install
            len_category += 1
    average = total / len_category
    print(category, average)

BUSINESS 1712290.1474201474
EVENTS 253542.22222222222
PHOTOGRAPHY 17840110.40229885
LIBRARIES_AND_DEMO 638503.734939759
MEDICAL 120550.61980830671
VIDEO_PLAYERS 24727872.452830188
COMICS 817657.2727272727
GAME 15588015.603248259
BEAUTY 513151.88679245283
FINANCE 1387692.475609756
LIFESTYLE 1437816.2687861272
PRODUCTIVITY 16787331.344927534
SHOPPING 7036877.311557789
HOUSE_AND_HOME 1331540.5616438356
COMMUNICATION 38456119.167247385
FAMILY 3695641.8198090694
DATING 854028.8303030303
ART_AND_DESIGN 1986335.0877192982
PERSONALIZATION 5201482.6122448975
WEATHER 5074486.197183099
NEWS_AND_MAGAZINES 9549178.467741935
FOOD_AND_DRINK 1924897.7363636363
MAPS_AND_NAVIGATION 4056941.7741935486
AUTO_AND_VEHICLES 647317.8170731707
TRAVEL_AND_LOCAL 13984077.710144928
EDUCATION 1833495.145631068
TOOLS 10801391.298666667
HEALTH_AND_FITNESS 4188821.9853479853
PARENTING 542603.6206896552
SPORTS 3638640.1428571427
SOCIAL 23253652.127118643
ENTERTAINMENT 11640705.88235294
BOOKS_AND_REFERENCE 8767811.89473
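
The install counts are stored as strings such as `'1,000+'`, which is why the loop strips the commas and the plus sign before converting. That cleaning step can be pulled out into a small helper, sketched below; `parse_installs` is a hypothetical name.

```python
def parse_installs(text):
    """Convert an install string such as '1,000+' into a float."""
    return float(text.replace(',', '').replace('+', ''))


print(parse_installs('1,000+'))            # 1000.0
print(parse_installs('10,000,000+'))       # 10000000.0
print(parse_installs('1,000,000,000+'))    # 1000000000.0
```

Because the trailing `+` means "at least this many", the parsed numbers are lower bounds, so the category averages above are best read as rough estimates rather than exact install counts.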