# Data Analysis of App Profitability via Ad Revenue

In this project, I'm a Data Analyst working for a company that builds only free apps available on Google Play and the App Store. 

Because the apps are free, most of my company's profits come from the ad revenue we generate from advertisements seen within the apps. The revenue we generate therefore directly correlates to the number of users who download and use our apps. 

The aim of this project is thus to analyze consumer patterns and other data so as to glean insight into what kind of app might be most popular among users, to help our software developers make more educated decisions on what kind of apps to create. 

In [2]:
from csv import reader


# google play dataset

opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

# apple store dataset 

opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

# print(android_header)
# print(ios_header)

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

        
# Exploring the data:
        
explore_data(android, 1, 3, True)

explore_data(ios, 1, 3, True)


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


The App Store data set has 7197 rows and 16 columns.
The Google Play data set has 10841 rows and 13 columns.

These figures exclude the header rows.

In [3]:
print(ios_header)
print('\n')
print(android_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In the case of the App Store, the columns that might be of interest to our project are most likely:

- price (Price amount)
- currency (Currency Type)
- rating_count_tot (User Rating counts (for all versions))
- rating_count_ver (User Rating counts (for current version))
- user_rating (Average User Rating value (for all versions))
- prime_genre (Primary Genre)


For Google Play, the columns of interest to us are most likely:

- App
- Category
- Reviews
- Installs
- Type
- Price
- Genres


# Data Cleaning

Because our company is only interested in data pertaining to free apps, the apps that come with a price need to be filtered out.

Additionally, data that is incorrect or duplicate will also need to be pruned out of the tables.

In [4]:
# It's been reported that row 10472 is incorrect for Google Play data.

print(android[10472])
print('\n')
print(android_header)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


It would appear that the 'Category' column for row 10472 in the Google Play dataset is missing.

Comparing the length of row 10472 with the length of the header row confirms this:

In [5]:
print(len(android[10472]))
print(len(android_header))

12
13


To remedy this, we'll delete the errant row:

In [6]:
del android[10472]
print(android[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
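
Rather than relying on a reported row index, the same length check can be applied to every row. The following is a minimal sketch under that assumption; `find_malformed_rows` and the sample data are hypothetical and not part of the original notebook.

```python
# Hypothetical helper: flag rows whose length differs from the header's.
def find_malformed_rows(rows, header):
    """Return (index, row_length) for every row that does not match the header length."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]


# Made-up sample: the second row is missing its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.6'],
]

malformed = find_malformed_rows(sample_rows, sample_header)
print(malformed)  # [(1, 2)]
```

If several rows were flagged, deleting them from the highest index to the lowest would keep the remaining indices valid.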


Some duplicate entries also exist, as seen below:

In [7]:
duplicates = []
unique = []

for app in android:
    if app[0] in unique:
        duplicates.append(app[0])
    else:
        unique.append(app[0])
        
print(len(duplicates))

1181


It looks like 1181 entries are duplicates. We'll build a dictionary to see how many times each app is repeated:

In [8]:
duplicates_dict = {}

for app in duplicates:
    if app in duplicates_dict:
        duplicates_dict[app] += 1
    else:
        duplicates_dict[app] = 1
        
print(duplicates_dict)

{'Truecaller: Caller ID, SMS spam blocking & Dialer': 1, 'My Dressing - Fashion closet': 1, 'Booking.com Travel Deals': 2, 'Candy Crush Jelly Saga': 1, 'Tiny Scanner Pro: PDF Doc Scan': 1, 'Apartment List: Housing, Apt, and Property Rentals': 1, 'Showtime Anytime': 1, 'DC Comics': 1, 'Viki: Asian TV Dramas & Movies': 2, 'Black People Meet Singles Date': 1, 'Candy Camera - selfie, beauty camera, photo editor': 2, 'Wheretoget: Shop in style': 1, 'No.Draw - Colors by Number 2018': 1, 'USA TODAY': 2, 'Yandex Browser with Protect': 1, 'Crackle - Free TV & Movies': 2, 'FP Notebook': 1, 'Ada - Your Health Guide': 1, 'G Cloud Backup': 1, 'Tumblr': 2, 'JH Blood Pressure Monitor': 1, 'InstaBeauty -Makeup Selfie Cam': 1, 'QuickBooks Accounting: Invoicing & Expenses': 2, 'Wordscapes': 3, 'Calm - Meditate, Sleep, Relax': 1, 'RT 516 VET': 1, 'Zappos – Shoe shopping made simple': 1, 'Curriculum vitae App CV Builder Free Resume Maker': 1, 'Podcast App: Free & Offline Podcasts by Player FM': 1, 'Random

The above dictionary shows us that the app 'Sniper 3D Gun Shooter: Free Shooting Games - FPS' is repeated 5 times, in addition to having appeared uniquely once. 
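
As an aside, the repeat counts can also be obtained with `collections.Counter` from the standard library. This is only a sketch on made-up app names (`sample_names` is hypothetical), showing the same "extra copies" count that `duplicates_dict` holds.

```python
from collections import Counter

# Made-up names standing in for the first column of the Google Play rows.
sample_names = ['App A', 'App B', 'App A', 'App C', 'App A', 'App B']

name_counts = Counter(sample_names)

# Keep only names that occur more than once, and subtract 1 so the value
# counts the extra copies, matching how duplicates_dict counts repeats.
extra_copies = {name: n - 1 for name, n in name_counts.items() if n > 1}
print(extra_copies)  # {'App A': 2, 'App B': 1}
```

The sum of these values corresponds to the number of duplicate rows counted earlier.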

We'll now try to ascertain why 'Sniper 3D Gun Shooter: Free Shooting Games - FPS' was repeated:

In [9]:
print(android_header)
print('\n')
for app in android:
    if app[0] == 'Sniper 3D Gun Shooter: Free Shooting Games - FPS':
        print(app)
        print('\n')


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Sniper 3D Gun Shooter: Free Shooting Games - FPS', 'GAME', '4.6', '7671249', 'Varies with device', '100,000,000+', 'Free', '0', 'Mature 17+', 'Action', 'August 2, 2018', 'Varies with device', 'Varies with device']


['Sniper 3D Gun Shooter: Free Shooting Games - FPS', 'GAME', '4.6', '7672495', 'Varies with device', '100,000,000+', 'Free', '0', 'Mature 17+', 'Action', 'August 2, 2018', 'Varies with device', 'Varies with device']


['Sniper 3D Gun Shooter: Free Shooting Games - FPS', 'GAME', '4.6', '7672495', 'Varies with device', '100,000,000+', 'Free', '0', 'Mature 17+', 'Action', 'August 2, 2018', 'Varies with device', 'Varies with device']


['Sniper 3D Gun Shooter: Free Shooting Games - FPS', 'GAME', '4.6', '7674252', 'Varies with device', '100,000,000+', 'Free', '0', 'Mature 17+', 'Action', 'August 2, 2018', 'Varies with device'

Above we see that the only difference between the rows is the number of reviews in the 'Reviews' column.

It looks like the differing review counts are what caused the app to be recorded in separate rows. We can also reasonably assume that the row with the highest number of reviews is the most recently collected one.

We'll therefore handle duplicates as follows:
Where the name of an app is repeated in multiple rows, only the row with the highest number of reviews will be kept. The rest will be discarded.

In [10]:
reviews_max = {}

for each in android:
    name = each[0]
    n_reviews = float(each[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
            
            
        
print('Expected number of rows after clearing duplicates: ', len(android) - 1181)
print('Length of dictionary after removing duplicates and keeping max review values:', len(reviews_max))




Expected number of rows after clearing duplicates:  9659
Length of dictionary after removing duplicates and keeping max review values: 9659


In [11]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name) 
        
explore_data(android_clean, 0, 2, True)        

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13


It looks like the removal of duplicate rows was successful. Only the rows with the highest review counts were kept.
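
The two-step approach above (first find the maximum, then filter) can also be written as a single pass that stores the best row per app name. This is a hedged sketch, not the notebook's method: `keep_most_reviewed` and the sample rows are hypothetical, and the column positions are passed in as assumptions.

```python
# Hypothetical single-pass variant: remember the best row seen so far for each name.
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the first row encountered is kept.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows: name, category, rating, reviews.
sample_rows = [
    ['App A', 'GAME', '4.6', '100'],
    ['App B', 'TOOLS', '4.2', '50'],
    ['App A', 'GAME', '4.6', '120'],
    ['App A', 'GAME', '4.6', '120'],
]

deduped = keep_most_reviewed(sample_rows)
print(deduped)
```

Dictionary lookups avoid the repeated list membership test on `already_added`. Note that the result is ordered by each name's first appearance, which may differ slightly from the order of `android_clean`.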

Now, we have to remove apps that are not in English, since our company only develops English-language apps.

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.

We'll start by writing a function that iterates over a string to detect if any non-English character is present:


In [12]:
def englishchars(string):
    
    for each in string:
        if ord(each) > 127:
            return False
    return True
        
        
print(englishchars('LinkedIn'))
print(englishchars('爱奇艺PPS -《欢乐颂2》电视剧热播'))   

True
False


However, some English apps also return a false value if they use certain emojis or special characters.

In [13]:
print(englishchars('Docs To Go™ Free Office Suite'))
print(englishchars('Instachat 😜'))

False
False


In [14]:
print(englishchars('Docs To Go™ Free Office Suite'))
print(englishchars('Instachat 😜'))

print(ord('™'))
print(ord('😜'))


False
False
8482
128540


To avoid discarding English apps whose names contain a few characters outside the ASCII range (0 to 127), we'll only filter out an app if its name contains 4 or more such characters.

In [15]:
def englishchars(string):
    
    count = 0
    for each in string:
        if ord(each) > 127:
            count += 1
            if count >= 4:
                return False
    return True

print(englishchars('Instachat😜'))
print(englishchars('Instachat😜😜'))
print(englishchars('Instachat😜😜😜'))
print(englishchars('Instachat😜😜😜😜'))

True
True
True
False


This modified function now only returns a false value when there are more than 3 non-English characters. 

Now, we use this function to include only English apps from the above android_clean dataset.

In [16]:
android_english_clean = []
android_non_eng = []

for each in android_clean:
    name = each[0]
    if englishchars(name) == True:
        android_english_clean.append(each)
    else:
        android_non_eng.append(each)
        
print(len(android_english_clean))
print(len(android_non_eng))

explore_data(android_non_eng, 0, 5, True)   

9614
45
['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up']


['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'July 26, 2018', '4.2.1', '4.0.3 and up']


['РИА Новости', 'NEWS_AND_MAGAZINES', '4.5', '44274', '8.0M', '1,000,000+', 'Free', '0', 'Everyone', 'News & Magazines', 'August 6, 2018', '4.0.6', '4.4 and up']


['صور حرف H', 'ART_AND_DESIGN', '4.4', '13', '4.5M', '1,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 27, 2018', '2.0', '4.0.3 and up']


['L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'LIFESTYLE', '4.0', '45224', '49M', '5,000,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'August 1, 2018', '6.5.1', '4.1 and up']


Number of rows: 45
Number of columns: 13


We see above that most of the apps that were discarded do indeed seem to be non-English apps.
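
The cutoff of three non-ASCII characters is a heuristic, so it can help to expose it as a parameter and try different values. The sketch below assumes a hypothetical function name, `is_probably_english`, and a hypothetical parameter, `max_non_ascii`; neither appears in the original notebook.

```python
# Hypothetical variant of englishchars with the cutoff exposed as a parameter.
def is_probably_english(name, max_non_ascii=3):
    # Count the characters whose code point lies outside the ASCII range (0 to 127).
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))  # True
print(is_probably_english('Instachat 😜', max_non_ascii=0))   # False
print(is_probably_english('РИА Новости'))                      # False
```

With the default of 3 this behaves like the modified `englishchars` above, while `max_non_ascii=0` reproduces the stricter first version.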

Now we do the same for iOS apps.

In [18]:
ios_eng = []
ios_non_eng = []

for each in ios:
    name = each[1]
    if englishchars(name) == True:
        ios_eng.append(each)
    else:
        ios_non_eng.append(each)
        
print(len(ios_eng))
print(len(ios_non_eng))

explore_data(ios_non_eng, 0, 2, True) 

6183
1014
['445375097', '爱奇艺PPS -《欢乐颂2》电视剧热播', '224617472', 'USD', '0.0', '14844', '0', '4.0', '0.0', '6.3.3', '17+', 'Entertainment', '38', '5', '3', '1']


['405667771', '聚力视频HD-人民的名义,跨界歌王全网热播', '90725376', 'USD', '0.0', '7446', '8', '4.0', '4.5', '5.0.8', '12+', 'Entertainment', '24', '4', '1', '1']


Number of rows: 1014
Number of columns: 16


In [24]:
print(ios_header)
print(android_header)
print('\n')
print(ios_header.index('price'))
print(android_header.index('Price'))

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


4
7


From the above lists, we recall that the price index for iOS apps is 4, and the price index for Google Play apps is 7.

In [29]:
android_free = []

for each in android_english_clean:
    price = each[7]
    if price == '0':
        android_free.append(each)
    
ios_free = []
    
for each in ios_eng:
    price = each[4]
    if price == '0.0':
        ios_free.append(each)
        
print(len(ios_free))
print(len(android_free))

3222
8864


After the clean-up processes, we arrive at our final datasets, android_free and ios_free.

android_free has 8864 rows, and ios_free has 3222 rows. We thus move on to data analysis of these cleaned up datasets.
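
The free-app filter above compares price strings exactly ('0' for Google Play, '0.0' for the App Store), so it depends on how each file formats a zero price. A more tolerant sketch converts the string to a number first. The helpers `parse_price` and `is_free` are hypothetical, and the leading '$' on paid prices is an assumption about the data format.

```python
# Hypothetical helpers: compare prices numerically instead of as strings.
def parse_price(raw):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(raw.strip().lstrip('$'))


def is_free(raw):
    return parse_price(raw) == 0.0


print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

The same `is_free` check could then be applied to both data sets, for example `is_free(each[7])` for Google Play rows and `is_free(each[4])` for App Store rows.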

# Data Analysis

To reiterate the goal of our company: since our revenue is based on advertisements shown in-app, it is strongly determined by app popularity. 

Our validation strategy for app ideas usually comprises these 3 steps:
1) Build a minimalistic Android version of the app and add it to Google Play.
2) If the app is well-received, develop it further.
3) If the app remains profitable after 6 months, we build an iOS version and add it to the App Store as well.

Since our end goal is to have our apps up on both markets, we want to be looking at other apps that are likewise successful in these two markets. 

First, we'll go about looking at the frequencies of the genre of each app. Mainly, we'll consider the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

In [38]:
def freq_table(dataset, index):
    # Count how often each value appears in the given column.
    counts = {}
    total_amt = 0
    for each in dataset:
        total_amt += 1
        column = each[index]
        if column in counts:
            counts[column] += 1
        else:
            counts[column] = 1

    # Convert the raw counts to percentages of all rows.
    percentages = {}
    for key in counts:
        percentage = (counts[key] / total_amt) * 100
        percentages[key] = percentage

    return percentages
  
    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
display_table(ios_free, -5)
print('\n')
display_table(android_free, 9)
print('\n')
display_table(android_free, 1)



Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.31678700

Above, we see tables expressing how common each category or genre is, as a percentage of the apps in our cleaned data sets.
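
For comparison, a sorted percentage table can also be built with `collections.Counter`. This is a minimal sketch on made-up rows; `freq_table_counter` is a hypothetical name, not a replacement the original notebook uses.

```python
from collections import Counter


# Hypothetical Counter-based variant of a sorted percentage table.
def freq_table_counter(dataset, index):
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields values from most to least frequent.
    return {value: count / total * 100 for value, count in counts.most_common()}


# Made-up rows: app name, genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]

table = freq_table_counter(sample, 1)
for genre, pct in table.items():
    print(genre, ':', pct)
```

Because dictionaries preserve insertion order, iterating over the result prints the most frequent value first without a separate sorting step.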

In particular, let's take a closer look at the prime_genre column of the App Store data set.

In [39]:
display_table(ios_free, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We see that 'Games' is the most common genre by a staggering margin, accounting for 58.16% of the free English apps. The next most common genre is 'Entertainment' at 7.88%, which isn't that far ahead of the genres that follow it, such as 'Photo & Video' and 'Education'.

Meanwhile, 'Finance' apps constitute only 1.12%, and 'Business' apps only 0.53%, of the free English apps in the App Store data set.

One pattern that stands out here is that apps that are centered around entertainment are vastly more common than apps that serve some mundane but pragmatic purpose. 

However, one shouldn't be misled into thinking that the proportion of apps correlates to their popularity among users. It might just be that the entertainment app space is very saturated because the creation of apps in this genre is limited only by creativity.

Next, we look more closely at the frequency tables generated for the 'Category' and 'Genres' columns of the Google Play dataset:

In [40]:
display_table(android_free, 9)
print('\n')
display_table(android_free, 1)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The 'Category' table paints a very different picture from the one we saw for the iOS App Store. It looks like there are far fewer apps revolving around entertainment, and many more apps that address practical needs, such as 'Tools', 'Business', 'Lifestyle', and 'Productivity'. 

Taking a closer look, however, we see that the 'Family' category, which covers 19% of all apps, mainly consists of games aimed at children. 

Nonetheless, entertainment apps are still far less frequent than they were on the App Store. 

The 'Genres' column gives us further insight by splitting the categories more finely. The most common genre is Tools at 8.45%. The next most common is Entertainment at 6.07%, followed by Education (5.35%), Business (4.59%), and Productivity (3.89%, tied with Lifestyle). Of the top 5 most common genres, 4 are related to practical needs.

The main point to note at this juncture is that while the iOS App Store is dominated by entertainment apps (especially gaming ones), apps on the Google Play store enjoy much more even representation.

Further analysis is clearly needed to determine app popularity.

In [45]:
table = freq_table(ios_free, -5)

for genre in table:
    total = 0
    len_genre = 0
    for each in ios_free:
        genre_app = each[-5]
        if genre_app == genre:
            n_ratings = float(each[5])
            total += n_ratings
            len_genre += 1
        
    average = total / len_genre
    
    print(genre, ':', average)
            
# print('\n')
# freq_table(android_free, 9)
# print('\n')
# freq_table(android_free, 1)

Reference : 74942.11111111111
Navigation : 86090.33333333333
Medical : 612.0
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
News : 21248.023255813954
Weather : 52279.892857142855
Lifestyle : 16485.764705882353
Shopping : 26919.690476190477
Food & Drink : 33333.92307692308
Social Networking : 71548.34905660378
Catalogs : 4004.0
Utilities : 18684.456790123455
Entertainment : 14029.830708661417
Business : 7491.117647058823
Productivity : 21028.410714285714
Games : 22788.6696905016
Photo & Video : 28441.54375
Travel : 28243.8
Book : 39758.5
Education : 7003.983050847458
Music : 57326.530303030304
Finance : 31467.944444444445


At face value, it appears that Navigation apps have the largest number of ratings per app. However, this large average is heavily skewed by outliers such as Google Maps.

In [50]:
for app in ios_free:
    if app[-5] == 'Navigation':
        print('Number of raters for ', app[1]  ,'is: ', app[5])

Number of raters for  Chase Mobile℠ is:  233270
Number of raters for  Mint: Personal Finance, Budget, Bills & Money is:  232940
Number of raters for  Bank of America - Mobile Banking is:  119773
Number of raters for  PayPal - Send and request money safely is:  119487
Number of raters for  Credit Karma: Free Credit Scores, Reports & Alerts is:  101679
Number of raters for  Capital One Mobile is:  56110
Number of raters for  Citi Mobile® is:  48822
Number of raters for  Wells Fargo Mobile is:  43064
Number of raters for  Chase Mobile is:  34322
Number of raters for  Square Cash - Send Money for Free is:  23775
Number of raters for  Capital One for iPad is:  21858
Number of raters for  Venmo is:  21090
Number of raters for  USAA Mobile is:  19946
Number of raters for  TaxCaster – Free tax refund calculator is:  17516
Number of raters for  Amex Mobile is:  11421
Number of raters for  TurboTax Tax Return App - File 2016 income taxes is:  9635
Number of raters for  Bank of America - Mobile B

This confirms our preliminary suspicion that outliers were heavily skewing the results. Waze and Google Maps had a total of around 500,000 ratings between them, while the other 4 apps in the category had a total of fewer than 16,000. 

It is likely that this pattern also holds true when it comes to social networking apps, where prominent apps like Facebook might heavily skew the average figure. Music apps also have particularly big players like Pandora and Spotify that might skew results a lot. The Weather and Finance categories also seem to face this setback.

In [52]:
for app in ios_free:
    if app[-5] == 'Health & Fitness':
        print('Number of raters for ', app[1]  ,'is: ', app[5])

Number of raters for  Calorie Counter & Diet Tracker by MyFitnessPal is:  507706
Number of raters for  Lose It! – Weight Loss Program and Calorie Counter is:  373835
Number of raters for  Weight Watchers is:  136833
Number of raters for  Sleep Cycle alarm clock is:  104539
Number of raters for  Fitbit is:  90496
Number of raters for  Period Tracker Lite is:  53620
Number of raters for  Nike+ Training Club - Workouts & Fitness Plans is:  33969
Number of raters for  Plant Nanny - Water Reminder with Cute Plants is:  27421
Number of raters for  Sworkit - Custom Workouts for Exercise & Fitness is:  16819
Number of raters for  Clue Period Tracker: Period & Ovulation Tracker is:  13436
Number of raters for  Headspace is:  12819
Number of raters for  Fooducate - Lose Weight, Eat Healthy,Get Motivated is:  11875
Number of raters for  Runtastic Running, Jogging and Walking Tracker is:  10298
Number of raters for  WebMD for iPad is:  9142
Number of raters for  8fit - Workouts, meal plans and per

To some degree, the same situation applies to the 'Health & Fitness' genre, but one thing to note here is that the runners-up are fairly competitive nonetheless. 
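
One way to see how much a few large apps pull up an average is to compare the mean with the median. The sketch below uses only the 14 complete rating counts visible in the truncated 'Health & Fitness' output above, so it describes that subset rather than the whole genre.

```python
from statistics import mean, median

# Rating counts copied from the visible part of the output above (a subset only).
ratings = [
    507706, 373835, 136833, 104539, 90496, 53620, 33969,
    27421, 16819, 13436, 12819, 11875, 10298, 9142,
]

avg = mean(ratings)
mid = median(ratings)

print('mean  :', round(avg))
print('median:', mid)
```

A mean far above the median indicates that a handful of heavily rated apps dominate the average, which is the skew discussed for the other genres.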

It is possible that a targeted, well-designed app in this genre could do well, so long as it finds a suitable niche. Since entertainment seems to be so popular, it might be well-advised to build an app that 'gamifies' health and fitness, so to speak.

For instance, it could market health and fitness as something fun and enjoyable by leveraging people's love of sports, as well as their latent desire to get in shape. In addition to improving their fitness, it could track the activities they engage in, giving users a measurable sense of progress. 

Next, we analyze the Google Play apps.

In [53]:
display_table(android_free, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


In [62]:
table = freq_table(android_free, 1)

for category in table:
    total = 0
    len_genre = 0
    for each in android_free:
        category_app = each[1]
        if category_app == category:
            n_installs = each[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs)
            total += n_installs
            len_genre += 1
        
    average = total / len_genre
    
    print(category, ':', average)
            

DATING : 854028.8303030303
PRODUCTIVITY : 16787331.344927534
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
AUTO_AND_VEHICLES : 647317.8170731707
TOOLS : 10801391.298666667
ENTERTAINMENT : 11640705.88235294
NEWS_AND_MAGAZINES : 9549178.467741935
SHOPPING : 7036877.311557789
COMMUNICATION : 38456119.167247385
HEALTH_AND_FITNESS : 4188821.9853479853
EVENTS : 253542.22222222222
LIBRARIES_AND_DEMO : 638503.734939759
SPORTS : 3638640.1428571427
FAMILY : 3695641.8198090694
GAME : 15588015.603248259
HOUSE_AND_HOME : 1331540.5616438356
MEDICAL : 120550.61980830671
PHOTOGRAPHY : 17840110.40229885
LIFESTYLE : 1437816.2687861272
TRAVEL_AND_LOCAL : 13984077.710144928
FOOD_AND_DRINK : 1924897.7363636363
VIDEO_PLAYERS : 24727872.452830188
ART_AND_DESIGN : 1986335.0877192982
EDUCATION : 1833495.145631068
BEAUTY : 513151.88679245283
COMICS : 817657.2727272727
PERSONALIZATION : 5201482.6122448975
PARENTING : 542603.6206896552
SOCIAL : 23253652.127118643
WEATHER : 5074486.19718309

On average, Communication apps have the most installs at 38.5 million. However, this is clearly skewed by a handful of apps such as Gmail, WhatsApp, and Skype that each have over a billion installs.

In [64]:
for app in android_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
Skype - free IM & video calls : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+


After removing the outliers with a billion or more installs, the average number of installs plummets to about 17.9 million, which is around half of what it was with the outliers. 

Going further and defining outliers as apps with 100 million or more installs (as opposed to 1 billion), the average falls to only 3.6 million.

In [70]:
less_popular = []

for each in android_free:
    n_installs = each[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (each[1] == 'COMMUNICATION') and (float(n_installs) < 1000000000):
        less_popular.append(float(n_installs))
        
print(sum(less_popular) / len(less_popular))

less_popular = []

for each in android_free:
    n_installs = each[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (each[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        less_popular.append(float(n_installs))
        
sum(less_popular) / len(less_popular)

17924933.09964413


3603485.3884615386

A similar pattern is observed with categories like Video Players and social networking apps.
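
To repeat this check for other categories without copying the loop, the calculation can be wrapped in a function. This is a sketch under stated assumptions: `average_installs`, its `cap` parameter, `make_row`, and the sample rows are all hypothetical, and the default column indices follow the Google Play header shown earlier.

```python
# Hypothetical helper: average installs for one category, optionally
# excluding apps whose install bucket is at or above `cap`.
def average_installs(dataset, category, cap=None, category_index=1, installs_index=5):
    values = []
    for row in dataset:
        if row[category_index] != category:
            continue
        n_installs = float(row[installs_index].replace(',', '').replace('+', ''))
        if cap is None or n_installs < cap:
            values.append(n_installs)
    # Return None rather than dividing by zero when nothing matches.
    return sum(values) / len(values) if values else None


def make_row(name, category, installs):
    # Made-up row with placeholder values in the unused columns.
    return [name, category, '', '', '', installs]


sample = [
    make_row('App A', 'COMMUNICATION', '1,000,000,000+'),
    make_row('App B', 'COMMUNICATION', '1,000,000+'),
    make_row('App C', 'COMMUNICATION', '500,000+'),
    make_row('App D', 'TOOLS', '10,000+'),
]

print(average_installs(sample, 'COMMUNICATION'))
print(average_installs(sample, 'COMMUNICATION', cap=1_000_000_000))  # 750000.0
```

On the notebook's data, a call such as `average_installs(android_free, 'VIDEO_PLAYERS', cap=100_000_000)` would apply the same cutoff used for the Communication category.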

We don't want to be misled into thinking that a category is more popular than it really is, and we especially don't want to dive into a category where we might have to compete against giants like YouTube and Facebook. 

In [76]:
for app in android_free:
    if app[1] == 'HEALTH_AND_FITNESS' and (app[5] == '1,000,000+'):
        print(app[0], ':', app[5])

Pedometer - Step Counter Free & Calorie Burner : 1,000,000+
Sportractive GPS Running Cycling Distance Tracker : 1,000,000+
Home Workout for Men - Bodybuilding : 1,000,000+
Sleep Sounds : 1,000,000+
Calorie Counter - EasyFit free : 1,000,000+
Bike Computer - GPS Cycling Tracker : 1,000,000+
Running Distance Tracker + : 1,000,000+
Walking: Pedometer diet : 1,000,000+
Keep Trainer - Workout Trainer & Fitness Coach : 1,000,000+
PumpUp — Fitness Community : 1,000,000+
Home workouts - fat burning, abs, legs, arms,chest : 1,000,000+
Running Weight Loss Walking Jogging Hiking FITAPP : 1,000,000+
StrongLifts 5x5 Workout Gym Log & Personal Trainer : 1,000,000+
Fitbit Coach : 1,000,000+
Map My Ride GPS Cycling Riding : 1,000,000+
Weight Loss Running by Verv : 1,000,000+
Map My Fitness Workout Trainer : 1,000,000+
Seven - 7 Minute Workout Training Challenge : 1,000,000+
Relax Meditation: Sleep with Sleep Sounds : 1,000,000+
Meditate OM : 1,000,000+
Meditation Music - Relax, Yoga : 1,000,000+
Simpl

Once again, the HEALTH_AND_FITNESS category appears to have a lot of potential. It has a modest average of about 4.2 million installs, and even after removing outliers with 100 million or more installs, it retains an impressive average of around 2 million installs. 

Moreover, if we look only at apps with fewer than 10 million installs, the average is still around 768,000, which is a respectable figure. 

In [82]:
less_popular = []

for each in android_free:
    n_installs = each[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (each[1] == 'HEALTH_AND_FITNESS') and (float(n_installs) < 100000000):
        less_popular.append(float(n_installs))
        
print(sum(less_popular) / len(less_popular))

less_popular = []

for each in android_free:
    n_installs = each[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (each[1] == 'HEALTH_AND_FITNESS') and (float(n_installs) < 10000000):
        less_popular.append(float(n_installs))
        
print(sum(less_popular) / len(less_popular))

2005713.6605166052
767984.9456066946


In [83]:
for app in android_free:
    if app[1] == 'HEALTH_AND_FITNESS' and (app[5] == '1,000,000+' or app[5] == '100,000+'):
        print(app[0], ':', app[5])

Pedometer - Step Counter Free & Calorie Burner : 1,000,000+
Sportractive GPS Running Cycling Distance Tracker : 1,000,000+
Home Workout for Men - Bodybuilding : 1,000,000+
Fat Burning Workout - Home Weight lose : 100,000+
Walking for Weight Loss - Walk Tracker : 100,000+
Sleep Sounds : 1,000,000+
Abs Training-Burn belly fat : 100,000+
Calorie Counter - EasyFit free : 1,000,000+
Bike Computer - GPS Cycling Tracker : 1,000,000+
Six Packs for Man–Body Building with No Equipment : 100,000+
Running Distance Tracker + : 1,000,000+
The TK-App - everything under control : 100,000+
Walking: Pedometer diet : 1,000,000+
Abs Workout - 30 Days Fitness App for Six Pack Abs : 100,000+
Keep Trainer - Workout Trainer & Fitness Coach : 1,000,000+
PumpUp — Fitness Community : 1,000,000+
Home workouts - fat burning, abs, legs, arms,chest : 1,000,000+
Running Weight Loss Walking Jogging Hiking FITAPP : 1,000,000+
StrongLifts 5x5 Workout Gym Log & Personal Trainer : 1,000,000+
Fitbit Coach : 1,000,000+
Map 

The apps here (those in the 100,000+ and 1,000,000+ install buckets) cover a variety of topics, such as weight loss, blood glucose levels, fertility, meditation, and sports. 

A large number of apps are centered around body-building, especially when it comes to things like having abs and very defined musculature. It would appear that making an app that sells these promises might be very profitable!

This also ties back in to our earlier conclusion from the iOS App Store about health and fitness products!

# Conclusion

Since weight loss (i.e. body fat reduction) is a significant part of body-building, especially for casual trainers hoping to achieve abs and more defined musculature, it might be a good idea to market a way of losing weight through an avenue that people enjoy, i.e. sports. This kind of app would likely do well on both Google Play and the App Store, which matches our company's goal of publishing on both markets.

Furthermore, fitness and sports-related apps often feature advertisements for things like nutrition and weight loss supplements. Because people interested in improving their health also tend to seek out these supplements to help them reach their goals more quickly, the advertisements we place in our apps would likely have a high click rate, thus generating higher revenue. 

The app should sell an approach to weight loss through very accessible and enjoyable means. An appropriately sensationalistic slogan like "lose weight doing what you love!" is bound to grab attention quickly and might go well with the app. 