# Maximizing ad revenue from free apps on Google Play and the App Store

In this project, I analyze two large data sets to generate insights that can help us maximize revenue from apps that are free on Google Play and the App Store. The source of revenue is in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use the app.

My goal for this project is to analyze data to help developers understand what type of apps are likely to attract more users.

## There are two data sets that seem suitable for our goals:

A data set containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from this [link](https://www.kaggle.com/lava18/google-play-store-apps)

A data set containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from this [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

In [1]:
from csv import reader

### The Google Play data set ###
# utf8 is set explicitly because some app names contain non-ASCII characters
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

### The App Store data set ###
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

To make it easier to explore the two data sets, we'll first write a function named explore_data() that we can use repeatedly to explore rows in a more readable way. We'll also add an option for our function to show the number of rows and columns for any data set.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(android,0,3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']




In [4]:
explore_data(ios,0,3)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']




To find the number of rows and columns of each data set, we pass True as the last argument (recall that the function assumes the data set passed in doesn't include a header row).


In [5]:
explore_data(android,0,3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [6]:
explore_data(ios, 0,3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In [7]:
print(ios_header)
print('\n')
print(android_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


For the App Store data set, the useful columns for our analysis are:

'track_name', 'size_bytes', 'currency', 'rating_count_tot', 'user_rating', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'

For the Google Play data set, the useful columns for our analysis are:

'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Content Rating', 'Genres'


## Data cleaning: We need to remove inaccurate data, duplicate data, non-English apps and apps that are not free

We need to:

Detect inaccurate data, and correct or remove it.
Detect duplicate data, and remove the duplicates.

Recall that at our company, we only build apps that are free to download and install, and that are directed toward an English-speaking audience. This means that we'll need to:

Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
Remove apps that aren't free.

## Removing incorrect data
The Google Play data set has a dedicated discussion section, and we can see that one of the discussions describes an error for a certain row. Here is the [Link](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)


In [8]:
print(android_header)
print('\n')
print(android[10471]) #The problem was reported in row 10472 but we need to check and see if this is with or without header
print ('\n')
print(android[10472])
print ('\n')
print(android[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


It is clear that the second row printed here (index 10472) is missing its Category value, so every later value has shifted one column to the left. The header has 13 columns but this row has only 12 elements. Let's delete this row.
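A row like this can also be found without knowing its reported position, by comparing each row's length with the header's. The following is a minimal sketch on made-up sample rows; find_malformed_rows is a hypothetical helper and is not part of the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data standing in for android_header / android (assumed shape only)
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],            # Category is missing, so the row is one short
    ['App C', 'GAME', '4.5'],
]

for index, row in find_malformed_rows(sample_header, sample_rows):
    print(index, row)
```

On the real data the same call would take android_header and android as arguments. A length check only catches rows with missing or extra fields, not rows whose values are simply wrong.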

In [9]:
print(len(android))#to check if the incorrect row was deleted, we check the number of rows before and after deletion
del(android[10472]) #Do not run this more than once otherwise you will be deleting good data
print(len(android))

10841
10840


The android data set had 10841 rows before the deletion and 10840 after it, which confirms that exactly one row was removed.

As a last check, let's print the row at index 10472. It should now be the app "osmino Wi-Fi: free WiFi", which was the row right under the deleted one.

In [10]:
print(android[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


## Removing duplicate data 

For the android list, we need to ensure that there are no duplicates. Sometimes the same app is entered multiple times in the data set. We should not remove the duplicates at random. Instead, we keep the entry with the highest number of reviews, on the assumption that the highest count marks the most recently collected data.


To count the duplicate rows, we loop through the data set, appending each app name to the unique_apps list the first time we see it and to the duplicate_apps list every time it appears again.

In [11]:
unique_apps = []
duplicate_apps = []

for row in android:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of unique apps=', len(unique_apps))
print('\n')
print('Number of duplicate apps=', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:7])

Number of unique apps= 9659


Number of duplicate apps= 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits']
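The name in unique_apps check scans a list, which gets slower as the list grows. A minimal alternative sketch, assuming only that the app name is in the first column, counts names with collections.Counter. The sample rows are made up for illustration.

```python
from collections import Counter

# Toy rows standing in for the android list (only the name column matters here)
sample_rows = [['Box'], ['Slack'], ['Box'], ['Zenefits'], ['Box']]

name_counts = Counter(row[0] for row in sample_rows)

n_unique = len(name_counts)                  # distinct app names
n_duplicates = len(sample_rows) - n_unique   # extra copies beyond the first
repeated_names = [name for name, count in name_counts.items() if count > 1]

print('Number of unique apps=', n_unique)
print('Number of duplicate apps=', n_duplicates)
print('Apps with more than one entry:', repeated_names)
```

Unlike duplicate_apps, repeated_names lists each repeated app once, however many extra copies it has.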


## Instead of random inclusion, we will include one entry per duplicated app based on one criterion, i.e. the highest number of reviews

_Using a loop, we can build a dictionary whose keys are app names and whose values are the highest number of reviews found among the entries for that app_

If the app already exists in the dictionary and the stored number of reviews is smaller than the one in the current row, the loop replaces the stored value with the higher one.

If the app does not yet exist in the dictionary, the loop adds a new key (the name of the app) with the number of reviews from the current row as its value.

Notice that the second branch must be an explicit "elif name not in reviews_max" check and not a plain else. A plain else would also run when the app is already in the dictionary with a higher number of reviews, and would overwrite that value with a lower one.


In [12]:
reviews_max = {}
for row in android:
    name = row[0]  # the app name is in the 1st column
    n_reviews = float(row[3])  # the number of reviews is in the 4th column
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [13]:
print(len(reviews_max)) #check that the dictionary worked. 
#It should show 9659 unique apps

9659
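The warning about the else statement can be checked on a toy example. The sketch below runs the same loop twice on made-up (name, reviews) pairs, once with the explicit membership check and once behaving like a plain else. max_reviews and its use_plain_else flag are hypothetical names introduced only for this comparison.

```python
# Toy (name, reviews) pairs; the second 'Box' entry has fewer reviews than the first
sample = [('Box', 200.0), ('Box', 150.0), ('Slack', 50.0)]


def max_reviews(pairs, use_plain_else):
    result = {}
    for name, n_reviews in pairs:
        if name in result and result[name] < n_reviews:
            result[name] = n_reviews
        elif use_plain_else or name not in result:
            # with a plain else, this branch also runs when the stored value
            # is already higher, so the lower count overwrites it
            result[name] = n_reviews
    return result


print(max_reviews(sample, use_plain_else=False))
print(max_reviews(sample, use_plain_else=True))
```

With the explicit check, 'Box' keeps its first, higher count. With the plain else, the later and lower count replaces it.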


## Avoiding duplicates for the same app with the same maximum reviews

We now have the dictionary reviews_max, which maps each app to its highest number of reviews. One problem remains: several entries for the same app can share that same highest number, and we need to keep only one of them. To do this, we loop over the android data set again and build 2 lists:

android_clean will contain exactly one row for each app, the one with the app's highest number of reviews. already_added will contain the names of the apps that are already in android_clean, so that a second entry with the same number of reviews is not added.

In [14]:
android_clean=[]
already_added=[]

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    # the parentheses are optional; they only make the two conditions easier to read
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)   # add the whole row to android_clean
        already_added.append(name)  # record only the name, so later copies are skipped

explore_data(android_clean, 0,3, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
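The two passes above (first reviews_max, then android_clean) could also be folded into one. This is a sketch of an alternative, not the method used in this notebook: it keeps the best row per app in a dictionary. The rows and review counts below are made up, and only the name and review columns are assumed.

```python
# Toy rows: [name, category, rating, reviews]
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159904'],
    ['Box', 'BUSINESS', '4.2', '159904'],   # tie with the row above
]

best_row = {}  # app name -> row with the highest review count seen so far
for row in sample_rows:
    name = row[0]
    n_reviews = float(row[3])
    # strict > keeps the earlier row when two rows tie on the review count
    if name not in best_row or n_reviews > float(best_row[name][3]):
        best_row[name] = row

clean_rows = list(best_row.values())
print(len(clean_rows))
```

One difference to keep in mind: clean_rows is ordered by the first appearance of each app name, while android_clean follows the position of the row that was kept.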


## Removing non-English apps
We will write a function that takes a string and returns False if the string contains characters that don't belong to the set of common English characters; otherwise it returns True.

Characters commonly used in English text are all in the ASCII range, which means their corresponding numbers, as returned by ord(), run from 0 to 127. Inside the function, we loop over the input string and check whether the number for each character is greater than 127.


Note:
Emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127. A function that returned False on the first such character would therefore reject apps that are actually English.

To counter this, we will only remove an app if its name has more than 3 characters with corresponding numbers falling outside the ASCII range. This means that all English apps with up to 3 emojis or other special characters will still be labeled as English.



In [15]:
def is_English(string):
    
    non_ascii=0
    for character in string:
        if ord(character) >127:
            non_ascii+=1
        
    if non_ascii>3:
        return False
    else: 
        return True
    
            
print(is_English('Docs To Go™ Free Office Suite'))
print(is_English('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_English('Instachat 😜'))

True
False
True
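To see where the threshold of 127 comes from, it helps to look at the numbers ord() returns for a few characters. The sketch below also shows a more compact way to write the same check. count_non_ascii and is_english_compact are hypothetical names, intended to behave like is_English rather than replace it.

```python
def count_non_ascii(name):
    """Count characters whose code point is outside the ASCII range (0-127)."""
    return sum(ord(character) > 127 for character in name)


def is_english_compact(name, max_non_ascii=3):
    # same rule as is_English: allow up to 3 non-ASCII characters
    return count_non_ascii(name) <= max_non_ascii


print(ord('a'), ord('™'), ord('😜'))
print(count_non_ascii('Docs To Go™ Free Office Suite'))
print(is_english_compact('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_compact('Instachat 😜'))
```

Making the threshold a parameter is a design choice that allows trying stricter or looser limits without editing the function body.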


## Filtering the non-English apps from both datasets

We loop through each dataset, identify the English apps and append each whole row to a separate list

In [16]:
English_apps_ios=[]

for row in ios:
    name=row[1]
    if is_English(name):
        English_apps_ios.append(row)
        
explore_data(English_apps_ios, 0,3,True)
   

    

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


In [17]:
English_apps_android=[]

for row in android_clean:
    name=row[0]
    if is_English(name):
        English_apps_android.append(row)
    
explore_data(English_apps_android, 0,3,True)



['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


## Filtering out free English apps:
We will create 2 new lists that contain only the free English apps and exclude all the rest.

We will loop through the English app lists for Android and iOS, and if the price is zero (stored as the string '0' in the Google Play data and '0.0' in the App Store data), we append the row to the new list.

In [18]:
free_eng_android=[]
free_eng_ios=[]

for row in English_apps_android:
    price = row[7]
    if price == '0':  # prices are strings; Google Play stores free as '0', so '0.0' would match nothing
        free_eng_android.append(row)

for row in English_apps_ios:
    price = row[4]
    if price == '0.0':  # the App Store file stores free as '0.0', so '0' would match nothing
        free_eng_ios.append(row)
        
        
print(len(free_eng_android))
print('\n')
print(len(free_eng_ios))


8864


3222
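Because the two files spell a zero price differently, an exact string comparison has to be adjusted per data set. A minimal sketch of a check that works for both converts the text to a number first. is_free is a hypothetical helper, and the assumption that paid prices may carry a leading '$' (as in '$4.99') is mine and is not shown in the output above.

```python
def is_free(price_text):
    """Return True when a price string represents zero.

    Assumes prices look like '0', '0.0' or '$4.99'; other formats are not handled.
    """
    return float(price_text.lstrip('$')) == 0.0


for price in ['0', '0.0', '$4.99', '2.99']:
    print(price, is_free(price))
```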


## Analysing our data to find app profiles that are successful on Google Play and the App Store

To minimise costs and risks, our strategy comprises 3 steps:
1- Build a minimal Android app and add it to Google Play
2- If the app is received well by the Android market, we develop it further
3- If the app is profitable after 6 months, we build an iOS version of the app and add it to the App Store

Looking at both datasets, there are 2 columns in each that could serve as indicators of success.

For the Google Play dataset:

The Installs column and the Rating column


For the App Store dataset:

The user_rating column and the rating_count_tot column


## Building frequency tables with percentages 

We'll need to build a frequency table for the prime_genre column of the App Store data set, and for the Genres and Category columns of the Google Play data set.

We'll build two functions we can use to analyze the frequency tables:

One function to generate frequency tables that show percentages

Another function we can use to display the percentages in a descending order

In [19]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
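The same percentage table can be built with collections.Counter from the standard library. The following is a sketch of an equivalent, not a replacement for the functions above. freq_table_counter and display_sorted are hypothetical names, and the sample rows are made up.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Percentage frequency table for one column, built with Counter."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


def display_sorted(table):
    # sort by percentage, largest first
    for key, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
        print(key, ':', percentage)


sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
display_sorted(freq_table_counter(sample_rows, 1))
```

Sorting on the percentage with a key function avoids building the intermediate list of (value, key) tuples.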

We start by examining the frequency table for the prime_genre column of the App Store data set, followed by the Category column of the Google Play data set.

In [20]:
display_table(free_eng_ios, -5)
print('\n')
display_table(free_eng_android, 1)


Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
C

## Analysing the most popular apps:

One way to find the most popular genres is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but it is missing from the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:

Isolate the apps of each genre.
Sum up the user ratings for the apps of that genre.
Divide the sum by the number of apps belonging to that genre (not by the total number of apps).

In [21]:
genres_ios = freq_table(free_eng_ios, -5)

for genre in genres_ios:
    total = 0      # sum of rating counts for this genre
    len_genre = 0  # number of apps in this genre
    for app in free_eng_ios:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1

    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)



Finance : 31467.944444444445
Productivity : 21028.410714285714
Photo & Video : 28441.54375
Catalogs : 4004.0
Navigation : 86090.33333333333
Shopping : 26919.690476190477
Food & Drink : 33333.92307692308
News : 21248.023255813954
Lifestyle : 16485.764705882353
Music : 57326.530303030304
Utilities : 18684.456790123455
Book : 39758.5
Business : 7491.117647058823
Sports : 23008.898550724636
Social Networking : 71548.34905660378
Education : 7003.983050847458
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
Entertainment : 14029.830708661417
Travel : 28243.8
Weather : 52279.892857142855
Medical : 612.0
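The nested loop above scans the whole data set once per genre. A single-pass alternative, sketched here on made-up (genre, rating count) pairs, accumulates a running sum and a count per genre and divides at the end. In the real rows these two values sit at app[-5] and app[5].

```python
from collections import defaultdict

# Toy rows: (genre, rating_count)
sample_apps = [('Games', 100.0), ('Music', 300.0), ('Games', 50.0)]

totals = defaultdict(float)  # genre -> sum of rating counts
counts = defaultdict(int)    # genre -> number of apps

for genre, n_ratings in sample_apps:
    totals[genre] += n_ratings
    counts[genre] += 1

# divide each sum by the number of apps in that genre, not by the overall total
averages = {genre: totals[genre] / counts[genre] for genre in totals}
for genre, average in averages.items():
    print(genre, ':', average)
```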


## Removing commas and plus signs from the number of installs in the Google Play dataset

We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture of genre popularity. However, the install numbers are not precise: most values are open-ended (100+, 1,000+, 5,000+, etc.).

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from string to float. This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error.
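The conversion can be tried on its own before it is used inside the loop. parse_installs is a hypothetical helper name; it applies the same two replace() calls described above.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to a float."""
    return float(text.replace('+', '').replace(',', ''))


for raw in ['100+', '1,000+', '1,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Calling float('1,000+') directly raises a ValueError, which is why both characters have to be removed first.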

In [25]:
categories_android=freq_table(free_eng_android, 1)

for category in categories_android:
    total=0
    len_category=0
    for app in free_eng_android:
        category_app= app[1]
        if category_app == category:
            n_installs=app[5]
            n_installs= n_installs.replace("+","")
            n_installs= n_installs.replace(",","")
            total += float(n_installs)
            len_category+=1

    avg_n_installs=total / len_category
            
    print(category, ":", avg_n_installs)
            
        


COMICS : 817657.2727272727
PHOTOGRAPHY : 17840110.40229885
SHOPPING : 7036877.311557789
FOOD_AND_DRINK : 1924897.7363636363
FAMILY : 3695641.8198090694
BOOKS_AND_REFERENCE : 8767811.894736841
BEAUTY : 513151.88679245283
WEATHER : 5074486.197183099
MEDICAL : 120550.61980830671
ART_AND_DESIGN : 1986335.0877192982
LIBRARIES_AND_DEMO : 638503.734939759
HOUSE_AND_HOME : 1331540.5616438356
HEALTH_AND_FITNESS : 4188821.9853479853
PERSONALIZATION : 5201482.6122448975
ENTERTAINMENT : 11640705.88235294
LIFESTYLE : 1437816.2687861272
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_MAGAZINES : 9549178.467741935
COMMUNICATION : 38456119.167247385
PRODUCTIVITY : 16787331.344927534
MAPS_AND_NAVIGATION : 4056941.7741935486
GAME : 15588015.603248259
AUTO_AND_VEHICLES : 647317.8170731707
PARENTING : 542603.6206896552
TRAVEL_AND_LOCAL : 13984077.710144928
SPORTS : 3638640.1428571427
TOOLS : 10801391.298666667
FINANCE : 1387692.475609756
SOCIAL : 23253652.127118643
EVENTS : 253542.22222222222
DATING : 854028.

## Recommendation for a Google Play app profile

From the analysis above, the top 7 categories by average number of installs are, in order: COMMUNICATION, VIDEO_PLAYERS, SOCIAL, PHOTOGRAPHY, PRODUCTIVITY, GAME and TRAVEL_AND_LOCAL.
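The ranking can be read off the printed averages by eye, but sorting makes it reproducible. The sketch below uses a subset of the averages from the output above, truncated to whole installs. In the notebook itself the dictionary would have to be filled inside the loop instead of printing each value.

```python
# A subset of the printed averages, truncated to whole installs and listed unsorted
avg_installs = {
    'TOOLS': 10801391,
    'GAME': 15588015,
    'COMMUNICATION': 38456119,
    'MEDICAL': 120550,
    'PHOTOGRAPHY': 17840110,
    'SOCIAL': 23253652,
    'ENTERTAINMENT': 11640705,
    'VIDEO_PLAYERS': 24727872,
    'TRAVEL_AND_LOCAL': 13984077,
    'PRODUCTIVITY': 16787331,
}

# sort by average installs, largest first, and keep the first 7
top_7 = sorted(avg_installs.items(), key=lambda item: item[1], reverse=True)[:7]
for category, installs in top_7:
    print(category, ':', installs)
```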

Since these top categories are already dominated by heavyweight tech giants like Facebook and Google, we have to offer a unique selling point that distinguishes our app from the mainstream communication apps.

Based on this, I would recommend an app that fits in one of the top 3 categories but borrows features from the other 6. An app that combines features from different categories can stand out.


## The current trend of slowing smartphone sales:

According to [CNBC (2018)](https://www.cnbc.com/2018/02/23/smartphone-sales-are-slowing-and-here-are-two-key-reasons-why.html), smartphone sales are slowing for two reasons:
a) a lack of innovation, with incremental benefits failing to entice new buyers
b) depreciation of high-end devices, as prices drop shortly after purchase.

This trend suggests that most consumers are holding on to their phones longer and are increasingly unimpressed with the frequency and diversity of new models. Older phones with a large number of installed apps are likely to suffer from slower performance and very limited memory space.


# Conclusion

Therefore, innovation that combines features from different genres into one efficient app can help users save space and keep their smartphones from slowing down.
