# "Earning through In-app ads – Analyzing the android market"

**My aim** in this project is to find an Android app profile that can generate reasonable profit from in-app ads. This can help an Android app development company choose which Category or Genre to target for its next app.

**The approach I took**, along with the primary constraints related to the Android market, is as follows:
- The app has to be free (users of paid apps do not want to see advertisements)
- Language preference: English
- The app profile should have the highest run-time engagement (the longer a user stays engaged with the app, the higher the chance of ad-generated earnings)

My approach led me to a **surprising deduction**. By analyzing the free Android app market through data extracted from the Google Play Store, I found that the ‘Books and Reference’ category seems more promising than the obviously expected ‘Entertainment’ and ‘Tools’ categories. Let's see how I reached this conclusion.



## 1. Opening and Exploring the Data

The data set contains data about approximately ten thousand Android apps from Google Play. You can download the data set directly from this link.

### Step 1: Opening and reading the CSV file:

In [1]:
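from csv import reader
import matplotlib.pyplot as plt

# Read the whole CSV into a list of rows; the with-block closes the file
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]   # first row holds the column names
android = android[1:]         # remaining rows are the apps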

### Step 2: Exploring the data by creating and using the explore function:
I created the function `explore`, which can be used repeatedly to print a slice of the main data set between given start and end row indices. Passing `row_column=True` also prints the total number of rows and columns of the data set.


In [2]:
def explore(dataset, start, end, row_column=False):
    dataslice = dataset[start:end]
    
    for row in dataslice:
        print (row)
        print ('\n')
    if row_column:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android_header, '\n')
explore(android, 0, 4, row_column=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


The Google Play data set has **10841 apps, each described by 13 features**.
Out of the 13 features, the ones useful for our analysis are ‘App’, ‘Category’, ‘Reviews’, ‘Installs’, ‘Type’, ‘Price’ and ‘Genres’.
The column names clearly state the features they hold. Anyone who has difficulty understanding a feature can have a look at the Dataset Documentation.

## 2. Data cleaning:

### Step 1: Detecting and removing rows with missing values:


Rows with missing data can be detected with a for loop that compares **‘requirelen’: the required number of entries in a row, determined from the header row** with **‘len(row)’: the number of entries actually present in each row**. If the latter differs from the former, that row is faulty (typically a value is missing) and needs to be worked upon.


In [3]:
requirelen = len(android_header)
fault = 0
# enumerate() gives each row's own index, even if two rows are identical
for index, row in enumerate(android):
    if len(row) != requirelen:
        print(index)
        fault += 1
print('Total faulty rows:' + str(fault))

10472
Total faulty rows:1


The row at *index 10472* is detected as a row with a missing value. Let’s compare it with the header and see which feature is missing.

In [4]:
print(android_header)
print(android[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Comparing the two, the value for the *‘Category’* column is missing, so every later value has shifted one column to the left. As a result, the ‘Rating’ of the app **‘Life Made WI-Fi Touchscreen Photo Frame’** reads as 19, which is more than the Google Play Store maximum of 5. So, let’s remove this row.
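Pairing each header with the value at the same position shows where the row goes out of step: '1.9' lands under Category and '19' under Rating. This is a small sketch that only reuses the two lists printed above; the variable names `header`, `faulty` and `paired` are chosen here for illustration.

```python
# Header and faulty row copied from the output above.
header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']
faulty = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M',
          '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018',
          '1.0.19', '4.0 and up']

# zip() pairs items by position and stops at the shorter list,
# so the last header ('Android Ver') is left without a value.
paired = dict(zip(header, faulty))
for column, value in paired.items():
    print('{:<15} {!r}'.format(column, value))

print('Missing fields:', len(header) - len(faulty))
```

Because the row is one field short, the mismatch starts at the second column and carries through to the end of the row.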

In [5]:
del android[10472]      #do not run this twice
print(len(android))     #to confirm deletion

10840


Now we have **10840 apps** remaining.

### Step 2: Removing duplicates:


Looking through the dataset, one can find many duplicates, i.e. more than one row for the same app. For example, the app **‘Google Ads’ has three rows**, which is not desirable.

In [6]:
for row in android:
    name = row[0]
    
    if name == 'Google Ads':
        print(row)

['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29331', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']


Let’s count the rows that can be considered ‘duplicates’ by looping over the app names in the dataset.

In [7]:
unique_app = []
duplicate_app = []
for row in android:
    name = row[0]
    if name in unique_app:
        duplicate_app.append(name)
    else:
        unique_app.append(name)

print('duplicate apps:' + str(len(duplicate_app)))
print('unique apps:' + str(len(unique_app)))

duplicate apps:1181
unique apps:9659


In total, there are **1,181 cases of duplicates** where an app occurs more than once.

I do not want these duplicates in my analysis because they can alter the results. To get rid of them, I want a list with only one row per app name.

But the **decisive question is: which row of an app, out of its duplicates, is the best fit for our analysis dataset?** In the ‘Google Ads’ example shown above, the noticeable difference is in the *number of reviews*.
**Answer: the row with the highest number of reviews can be considered the most recent data and should give the most reliable details for the app.**


For this, let’s first create a dictionary `reviews_max` that stores each unique app name as a key and the app's maximum number of reviews as the value.
At the end, I also confirm that unique apps (9659) = total apps (10840) - duplicate apps (1181).

In [8]:
reviews_max = {}
for row in android:
    name = row[0]
    nreviews = float(row[3])
    # Store the first value seen, then overwrite only with a higher one
    if name not in reviews_max or reviews_max[name] < nreviews:
        reviews_max[name] = nreviews
print(len(reviews_max))

9659


Now, let's use the dictionary `reviews_max` to remove the duplicates. I created a list `android_unique` and added each app once, using only the row with its maximum number of reviews. To avoid adding rows that do not have the highest review count, the `if` condition checks against `reviews_max`; the list `already_added` covers duplicates that tie on the maximum.

In [9]:
android_unique = []
already_added = []
for row in android:
    name = row[0]
    nreviews = float(row[3])
    if name not in already_added and nreviews == reviews_max[name]:
        android_unique.append(row)
        already_added.append(name)
print(len(android_unique))
print(android_unique[:6])

9659
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000
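The two passes above can also be collapsed into one by keeping, for each app name, the row with the most reviews seen so far. The following is a minimal sketch on a few sample rows (trimmed to the first four columns), not the code used in this analysis; `dedupe_by_max_reviews` is a hypothetical helper name.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # Strict '>' keeps the earlier row when review counts tie.
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Sample rows taken from the outputs above, trimmed to four columns.
sample = [
    ['Google Ads', 'BUSINESS', '4.3', '29313'],
    ['Google Ads', 'BUSINESS', '4.3', '29313'],
    ['Google Ads', 'BUSINESS', '4.3', '29331'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644'],
]
unique_rows = dedupe_by_max_reviews(sample)
print(len(unique_rows))
print(unique_rows[0])
```

One difference worth noting: the resulting order follows the first appearance of each app name, whereas `android_unique` follows the position of the row that holds the maximum.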

### Step 3: Modifying data to fit analysis purpose:

#### Part 1. Removing Non-English Apps

Looking through the dataset, one can also find many app names that are clearly not English.

To detect them, I first created a function `english` that uses the built-in function `ord` to check the character codes in an app name. Allowed characters: the English alphabet, the digits 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.). **All the allowed characters mentioned here have an ASCII code less than or equal to 127**. So, if the function meets a character outside this range, it returns `False`. Using this function in a for loop, I checked for non-English apps and found the following result:

In [10]:
def english (word):
    # NOTE: both branches return, so only the first character is inspected
    for element in word:
        if ord(element) > 127:
            return False
        else:
            return True

non_english = 0
for row in android_unique:
    name = row[0]
    if english(name) == False:
        print(android_unique.index(row), ':     ', row, '\n')
        non_english += 1

print('\n' + 'Number of non_english apps:' + str(non_english))

258 :      ['漫咖 Comics - Manga,Novel and Stories', 'COMICS', '4.1', '12088', '21M', '1,000,000+', 'Free', '0', 'Mature 17+', 'Comics', 'July 6, 2018', '2.3.1', '4.0.3 and up'] 

265 :      ['【Ranobbe complete free】 Novelba - Free app that you can read and write novels', 'COMICS', 'NaN', '1330', '22M', '50,000+', 'Free', '0', 'Everyone', 'Comics', 'July 3, 2018', '6.1.1', '4.2 and up'] 

640 :      ['🔥 Football Wallpapers 4K | Full HD Backgrounds 😍', 'ENTERTAINMENT', '4.7', '11661', '4.0M', '1,000,000+', 'Free', '0', 'Everyone', 'Entertainment', 'July 14, 2018', '1.1.3.2', '4.0.3 and up'] 

770 :      ['İşCep', 'FINANCE', '4.5', '381788', '32M', '10,000,000+', 'Free', '0', 'Everyone', 'Finance', 'August 2, 2018', '3.22.0', '4.1 and up'] 

1087 :      ['乐屋网: Buying a house, selling a house, renting a house', 'HOUSE_AND_HOME', '3.7', '2248', '15M', '100,000+', 'Free', '0', 'Everyone', 'House & Home', 'August 3, 2018', 'v3.1.1', '4.0 and up'] 

1180 :      ['သိင်္ Astrology - Min Thein Kha

I found that there are **36 non-English apps**. Going through the app names in the printed list, I found that some apps have been *considered non-English only because of special characters present in their names*. For example:
- '🔥 Football Wallpapers 4K | Full HD Backgrounds 😍' at index ‘640’, 
- ‘MultiCraft ― Free Miner! 👍’ at index ‘2821’.

To provide some leniency for such apps, I added a variable `count` that *counts the non-English characters in an app’s name, with an allowable limit of 5*. So now, the function `english5` treats an app name as non-English only if it contains more than 5 such characters.

This opens a grey area: a few non-English apps will pass the check, and a few English apps (with more than 5 special characters in their names) will still be considered non-English. But such instances should be rare, and for now we can say that the function `english5` works well, as shown below:

In [11]:
def english5 (word):
    count = 0
    for element in word:
        if ord(element) > 127:
            count+=1
    if count > 5:
        return False
    else:
        return True

android_english = []
non_english = 0
for row in android_unique:
    name=row[0]
    if english5(name):
        android_english.append(row)
    else:
        non_english += 1
        print(android_unique.index(row), ':     ', row, '\n')
print(non_english)

521 :      ['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up'] 

2643 :      ['РИА Новости', 'NEWS_AND_MAGAZINES', '4.5', '44274', '8.0M', '1,000,000+', 'Free', '0', 'Everyone', 'News & Magazines', 'August 6, 2018', '4.0.6', '4.4 and up'] 

3046 :      ['صور حرف H', 'ART_AND_DESIGN', '4.4', '13', '4.5M', '1,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 27, 2018', '2.0', '4.0.3 and up'] 

3174 :      ['L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'LIFESTYLE', '4.0', '45224', '49M', '5,000,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'August 1, 2018', '6.5.1', '4.1 and up'] 

3399 :      ['RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템', 'FAMILY', 'NaN', '4', '64M', '1+', 'Free', '0', 'Everyone', 'Education', 'July 17, 2018', '1.0.1', '4.4 and up'] 

4113 :      ['AJ렌터카 법인 카셰어링', 'MAPS_AND_NAVIGATION', 'NaN', '0', '27M', '10+', 'Free', '0', 'Everyone', 'Maps & Navigation', 'July 30, 2

From the results above, **30 apps have been considered non-English by our function, which allows up to 5 special characters.**

To remove these 30, I created a list `android_english` and added only the English apps from the previous list `android_unique`.

In [12]:
explore(android_english, 0, 5, row_column = True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9629
Number of columns: 13


Now we have **9629 English apps** left for analysis (9659 - 30 = 9629).
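To see how the 5-character allowance behaves, it helps to count the characters above code 127 in a few names taken from the outputs above. This is only an illustrative sketch; `count_non_ascii` is a hypothetical helper and not part of the notebook.

```python
def count_non_ascii(name):
    """Number of characters whose code point is above 127."""
    return sum(1 for ch in name if ord(ch) > 127)

# App names copied from the printed results above.
names = [
    '🔥 Football Wallpapers 4K | Full HD Backgrounds 😍',
    'İşCep',
    'РИА Новости',
]
for name in names:
    n = count_non_ascii(name)
    # An app is kept when it has at most 5 such characters.
    print(n, n <= 5, name)
```

The emoji-decorated name is kept, as intended, and the Cyrillic name is dropped. 'İşCep' is also kept because it has only two such characters, which is an instance of the grey area mentioned earlier.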

#### Part 2. Removing Paid Apps

Following the primary constraints for my analysis, I need to remove the paid apps from the list `android_english`.

To remove paid apps, I used a for loop to check the price of each app in *android_english* and copied only the apps with *price = ‘0’* into the new list `android_final`, as shown below:

In [13]:
android_final = []
for row in android_english:
    price = (row[7])
    if price == '0':
        android_final.append(row)
explore(android_final, 0, 5, row_column=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8879
Number of columns: 13


Now the list `android_final`, with **8879 apps that are English and free, with the faulty row and duplicates removed**, is ready for our analysis.


### Step 4: Analysing data:

#### Part 1. Number of apps per each category :

I started the exploratory data analysis by **counting the number of apps in each category** and computing the **percentage share of each category in the cleaned dataset** using the `cft()` function.

In [14]:
def cft (dataset, index):
    total = 0
    freq_table = {}
    per_table = {}
    
    for element in dataset:
        total += 1
        value = element[index]
        if value in freq_table:
            freq_table[value] += 1
        else:
            freq_table[value] = 1
            
    for element in freq_table:
        percentage = freq_table[element]/total*100
        per_table[element] = percentage
        
    return per_table

cft(android_final, 1)

{'ART_AND_DESIGN': 0.641964185155986,
 'AUTO_AND_VEHICLES': 0.9235274242594886,
 'BEAUTY': 0.5969140668994256,
 'BOOKS_AND_REFERENCE': 2.1624056763149007,
 'BUSINESS': 4.583849532605023,
 'COMICS': 0.6194391260277058,
 'COMMUNICATION': 3.2548710440364905,
 'DATING': 1.8583173780831175,
 'EDUCATION': 1.1262529564140107,
 'ENTERTAINMENT': 0.8784773060029283,
 'EVENTS': 0.7095393625408267,
 'FINANCE': 3.6941096970379546,
 'FOOD_AND_DRINK': 1.2388782520554116,
 'HEALTH_AND_FITNESS': 3.0634080414461087,
 'HOUSE_AND_HOME': 0.8221646581822277,
 'LIBRARIES_AND_DEMO': 0.9347899538236288,
 'LIFESTYLE': 3.9193602883207572,
 'GAME': 9.49431242257011,
 'FAMILY': 19.2364004955513,
 'MEDICAL': 3.5364342831399935,
 'SOCIAL': 2.657956977137065,
 'SHOPPING': 2.241243383263881,
 'PHOTOGRAPHY': 2.9507827458047076,
 'SPORTS': 3.412546457934452,
 'TRAVEL_AND_LOCAL': 2.331343619777002,
 'TOOLS': 8.446897173105079,
 'PERSONALIZATION': 3.322446221421331,
 'PRODUCTIVITY': 3.885572699628337,
 'PARENTING': 0.6532

But the result produced above is not sorted. So I created another function, `sort_table()`, to sort the data produced by `cft()` in *descending order*.
I also used **string formatting** to limit the percentages to 4 decimal places and add a % sign at the end.


In [15]:
def sort_table(dataset, index):
    table = cft(dataset, index)
    display = []
    for key in table:
        tpl = (table[key], key)
        display.append(tpl)
        
    table_sorted = sorted(display, reverse = True)
    for entry in table_sorted:
        string = "{} = {:.4f}%". format(entry[1], entry[0])
        print(string)
        
sort_table(android_final, 1)

FAMILY = 19.2364%
GAME = 9.4943%
TOOLS = 8.4469%
BUSINESS = 4.5838%
LIFESTYLE = 3.9194%
PRODUCTIVITY = 3.8856%
FINANCE = 3.6941%
MEDICAL = 3.5364%
SPORTS = 3.4125%
PERSONALIZATION = 3.3224%
COMMUNICATION = 3.2549%
HEALTH_AND_FITNESS = 3.0634%
PHOTOGRAPHY = 2.9508%
NEWS_AND_MAGAZINES = 2.8156%
SOCIAL = 2.6580%
TRAVEL_AND_LOCAL = 2.3313%
SHOPPING = 2.2412%
BOOKS_AND_REFERENCE = 2.1624%
DATING = 1.8583%
VIDEO_PLAYERS = 1.7795%
MAPS_AND_NAVIGATION = 1.4078%
FOOD_AND_DRINK = 1.2389%
EDUCATION = 1.1263%
LIBRARIES_AND_DEMO = 0.9348%
AUTO_AND_VEHICLES = 0.9235%
ENTERTAINMENT = 0.8785%
HOUSE_AND_HOME = 0.8222%
WEATHER = 0.7996%
EVENTS = 0.7095%
PARENTING = 0.6532%
ART_AND_DESIGN = 0.6420%
COMICS = 0.6194%
BEAUTY = 0.5969%


We can see that the categories **“FAMILY”, “GAME”, “TOOLS” and “BUSINESS” collectively account for 41.7614% of the free English apps in our dataset**, which can be interpreted as **“Numerous apps already exist in these categories, and a new app among them is likely to struggle for attention.”**
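The same kind of sorted percentage table can be built more compactly with `collections.Counter` from the standard library. The sketch below is an alternative to `cft()` and `sort_table()`, not the code that produced the output above; `percentage_table` and the toy rows are hypothetical.

```python
from collections import Counter

def percentage_table(rows, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100)
            for value, count in counts.most_common()]

# Hypothetical toy rows: only the category column (index 1) matters here.
toy = [
    ['a', 'FAMILY'], ['b', 'FAMILY'], ['c', 'GAME'], ['d', 'TOOLS'],
]
table = percentage_table(toy, 1)
for category, pct in table:
    print('{} = {:.4f}%'.format(category, pct))
```

`most_common()` already returns the entries in descending order of count, so no separate list of tuples has to be built for sorting.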

#### Part 2. Number of average installs per app for each category:

In this part, I looked at the average number of installs that an app can expect in a particular category. For this, we first have to define a list `categories_android` containing the names of all the categories present in our dataset.

In [16]:
categories_android = []
for each in android_final:
    name = each[1]
    if name not in categories_android:
        categories_android.append(name)
        
print(categories_android)
print("\n Total Categories = " + str(len(categories_android)))

['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION', 'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL', 'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER', 'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION']

 Total Categories = 33


We can see that there are **33 unique categories** in total.

Now, for every app, we add its number of installs to the corresponding category’s total installs and keep a count of the apps processed in each category.

At the end, **dividing total installs by number of apps** for each category gives the required statistic.

In [17]:
# Remove the trailing '+' from each app's install count (commas are stripped later)
for app in android_final:
    app[5] = app[5].replace('+', '')

In [18]:
avg_installs = {}
for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            total += int(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    avg_installs[category]=avg_n_installs
    
avg_installs

{'ART_AND_DESIGN': 1986335.0877192982,
 'AUTO_AND_VEHICLES': 647317.8170731707,
 'BEAUTY': 513151.88679245283,
 'BOOKS_AND_REFERENCE': 8676537.8125,
 'BUSINESS': 1700127.9852579853,
 'COMICS': 817657.2727272727,
 'COMMUNICATION': 38193481.66435986,
 'DATING': 854028.8303030303,
 'EDUCATION': 1768500.0,
 'ENTERTAINMENT': 9146923.076923076,
 'EVENTS': 253542.22222222222,
 'FINANCE': 1387692.475609756,
 'FOOD_AND_DRINK': 1924897.7363636363,
 'HEALTH_AND_FITNESS': 4167457.3602941176,
 'HOUSE_AND_HOME': 1331540.5616438356,
 'LIBRARIES_AND_DEMO': 638503.734939759,
 'LIFESTYLE': 1429869.0488505748,
 'GAME': 12914435.883748516,
 'FAMILY': 5168680.731850117,
 'MEDICAL': 123064.7898089172,
 'SOCIAL': 23253652.127118643,
 'SHOPPING': 7036877.311557789,
 'PHOTOGRAPHY': 17772018.759541985,
 'SPORTS': 4274688.722772277,
 'TRAVEL_AND_LOCAL': 13984077.710144928,
 'TOOLS': 10801391.298666667,
 'PERSONALIZATION': 5183850.806779661,
 'PRODUCTIVITY': 16772838.591304347,
 'PARENTING': 542603.6206896552,
 '

Let us sort the above results according to the highest number of average installs per app for particular category.

In [19]:
avg_installs_sorted = sorted(avg_installs.items(), key=lambda item: (item[1], item[0]), reverse= True)
avg_installs_sorted

[('COMMUNICATION', 38193481.66435986),
 ('VIDEO_PLAYERS', 24790074.17721519),
 ('SOCIAL', 23253652.127118643),
 ('PHOTOGRAPHY', 17772018.759541985),
 ('PRODUCTIVITY', 16772838.591304347),
 ('TRAVEL_AND_LOCAL', 13984077.710144928),
 ('GAME', 12914435.883748516),
 ('TOOLS', 10801391.298666667),
 ('NEWS_AND_MAGAZINES', 9472829.04),
 ('ENTERTAINMENT', 9146923.076923076),
 ('BOOKS_AND_REFERENCE', 8676537.8125),
 ('SHOPPING', 7036877.311557789),
 ('PERSONALIZATION', 5183850.806779661),
 ('FAMILY', 5168680.731850117),
 ('WEATHER', 5074486.197183099),
 ('SPORTS', 4274688.722772277),
 ('HEALTH_AND_FITNESS', 4167457.3602941176),
 ('MAPS_AND_NAVIGATION', 4025286.24),
 ('ART_AND_DESIGN', 1986335.0877192982),
 ('FOOD_AND_DRINK', 1924897.7363636363),
 ('EDUCATION', 1768500.0),
 ('BUSINESS', 1700127.9852579853),
 ('LIFESTYLE', 1429869.0488505748),
 ('FINANCE', 1387692.475609756),
 ('HOUSE_AND_HOME', 1331540.5616438356),
 ('DATING', 854028.8303030303),
 ('COMICS', 817657.2727272727),
 ('AUTO_AND_VEHIC

As seen above, **“COMMUNICATION”, “VIDEO_PLAYERS”, “SOCIAL”, “PHOTOGRAPHY”, “PRODUCTIVITY”, “TRAVEL_AND_LOCAL”, “GAME”, “TOOLS”, “NEWS_AND_MAGAZINES” and “ENTERTAINMENT” are the top 10 categories by average number of installs per app.** So, these categories appear to offer more users for an app.
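The per-category averages can also be computed in a single pass over the rows instead of looping over all apps once per category. The following is a sketch under the assumption that install counts look like '10,000+' as in the raw data; `parse_installs`, `average_installs` and the toy rows are hypothetical names and data.

```python
from collections import defaultdict

def parse_installs(text):
    """Turn a string such as '10,000+' into the integer 10000."""
    return int(text.replace(',', '').replace('+', ''))

def average_installs(rows, category_idx=1, installs_idx=5):
    """Average installs per category, computed in one pass over the rows."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        category = row[category_idx]
        totals[category] += parse_installs(row[installs_idx])
        counts[category] += 1
    return {category: totals[category] / counts[category]
            for category in totals}

# Hypothetical toy rows with only the fields used above filled in.
toy = [
    ['a', 'TOOLS', '', '', '', '10,000+'],
    ['b', 'TOOLS', '', '', '', '1,000+'],
    ['c', 'GAME', '', '', '', '500,000+'],
]
averages = average_installs(toy)
print(averages)
```

Since the install values are open-ended buckets such as '10,000+', any average built from them is best read as a rough lower-bound estimate.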

Let us check *the trend of number of installs per app* under the “COMMUNICATION” category.

In [20]:
install_counts = {}
for app in android_final:
    if app[1] == "COMMUNICATION":
        installs = app[5]
        installs = installs.replace(",", "")
        installs = int(installs)
        if installs not in install_counts:
            install_counts[installs] = 1
        else:
            install_counts[installs] += 1
        
install_counts_sorted = sorted(install_counts.items(), key=lambda item: (item[0], item[1]), reverse= True)
install_counts_sorted

for each in install_counts_sorted:
    string = '{:,} : {}'.format(each[0], each[1])
    print(string)

1,000,000,000 : 6
500,000,000 : 5
100,000,000 : 16
50,000,000 : 7
10,000,000 : 43
5,000,000 : 22
1,000,000 : 41
500,000 : 9
100,000 : 16
50,000 : 10
10,000 : 21
5,000 : 16
1,000 : 19
500 : 8
100 : 28
50 : 5
10 : 14
5 : 2
1 : 1


Above we can see that **6 apps have 1 billion (1,000 million) or more installs each and 5 apps have 500 million or more installs each**. **Don’t you think that these 11 apps together dominate the whole category?**

Let’s check what percentage of the category's installs these 11 apps account for against the rest.


In [21]:
users_total = 0
users_11 = 0

for each in install_counts:
    users_total += install_counts[each]*each
    if each > 499999999:
        users_11 += install_counts[each]*each
        
print('{:.4f}%'.format((users_11*100)/users_total))

77.0073%


That is surprising. **Just 11 apps hold 77% of the total installs in the COMMUNICATION category**, so the high category average says little about what a typical app can expect.
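The 77% figure can be re-derived directly from the install tiers printed earlier for COMMUNICATION. The sketch below hard-codes that table and generalizes the check to any threshold; `share_at_or_above` is a hypothetical helper name.

```python
# Install tiers and app counts copied from the COMMUNICATION output above.
communication = {
    1_000_000_000: 6, 500_000_000: 5, 100_000_000: 16, 50_000_000: 7,
    10_000_000: 43, 5_000_000: 22, 1_000_000: 41, 500_000: 9,
    100_000: 16, 50_000: 10, 10_000: 21, 5_000: 16, 1_000: 19,
    500: 8, 100: 28, 50: 5, 10: 14, 5: 2, 1: 1,
}

def share_at_or_above(tier_counts, threshold):
    """Percentage of all installs held by apps at or above `threshold`."""
    total = sum(tier * n for tier, n in tier_counts.items())
    top = sum(tier * n for tier, n in tier_counts.items() if tier >= threshold)
    return top * 100 / total

share = share_at_or_above(communication, 500_000_000)
print('{:.4f}%'.format(share))  # prints 77.0073%
```

Lowering the threshold shows how quickly the share grows once the next tier of apps is included.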

Curiosity drove me to **find out names of these giants.**


In [22]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000' or app[5] == '500,000,000'):
        print(app[0], ':', app[5])

Messenger – Text and Video Chat for Free : 1,000,000,000
Gmail : 1,000,000,000
imo free video calls and chat : 500,000,000
Google Duo - High Quality Video Calls : 500,000,000
UC Browser - Fast Download Private & Secure : 500,000,000
Skype - free IM & video calls : 1,000,000,000
WhatsApp Messenger : 1,000,000,000
Google Chrome: Fast & Secure : 1,000,000,000
LINE: Free Calls & Messages : 500,000,000
Hangouts : 1,000,000,000
Viber Messenger : 500,000,000


It would be unwise to target such categories, where beating giants like Facebook's Messenger, Gmail, Skype, WhatsApp and Hangouts will be next to impossible for a new app.

After digging out similar information for each of the other categories, I reached the following conclusions.

| Category Name    | Reason for not considering                                                 |
|:-----------------|:---------------------------------------------------------------------------|
| Communication    | Dominated by a few giants (WhatsApp, Messenger, Gmail, etc.)               |
| Video players    | Dominated by a few giants (YouTube, Google Play Movies & TV, MX Player)    |
| Social           | Dominated by a few giants (Facebook, Instagram, etc.)                      |
| Photography      | Dominated by a few giants (Adobe, Google Photos and editor, etc.)          |
| Productivity     | Dominated by a few giants (Microsoft Word, Dropbox, Google Calendar, etc.) |
| Travel & Local   | Enormous travel data and regular local updates are required                |
| Games            | Seems to be saturated already                                              |
| Tools            | Seems to be saturated already                                              |
| News & Magazines | Can be targeted, but Google's voice assistant reads news without ads       |
| Entertainment    | Dominated by a few giants (Netflix, Hotstar, Amazon Prime, etc.)           |

Our next category for consideration is **“BOOKS_AND_REFERENCE”**. Let’s check the distribution of installs for this category.

In [23]:
install_counts_b = {}
for app in android_final:
    if app[1] == "BOOKS_AND_REFERENCE":
        installs = app[5]
        installs = installs.replace(",", "")
        installs = int(installs)
        if installs not in install_counts_b:
            install_counts_b[installs] = 1
        else:
            install_counts_b[installs] += 1
        
install_counts_b_sorted = sorted(install_counts_b.items(), key=lambda item: (item[0], item[1]), reverse= True)
install_counts_b_sorted

for each in install_counts_b_sorted:
    string = '{:,} : {}'.format(each[0], each[1])
    print(string)


The distribution for this category matches our need. There is **only 1 app with 1,000 million installs, followed by 4 apps with 100 million installs each**. The rest of the distribution seems fairly even. The following apps hold the majority of installs:

In [24]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000' or app[5] == '500,000,000' or app[5] == '100,000,000'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000
Bible : 100,000,000
Amazon Kindle : 100,000,000
Wattpad 📖 Free Books : 100,000,000
Audiobooks from Audible : 100,000,000


Of the above, **Google Play Books** and **Amazon Kindle** charge for the majority of their books. **Wattpad Free Books** provides free story books only, and **Audiobooks from Audible** lacks an inbuilt dictionary and/or other features.

***
**Some Favorable Facts:**

* `BOOKS_AND_REFERENCE` is a category in which a user spends a lot of time reading on the screen, i.e. *more screen engagement*. The longer the user stays on screen, the more advertisements can be shown in a non-interrupting manner.

* An *embedded multilingual dictionary* that a user can consult while reading, along with *translator services*, could be offered, with a short 5-second advertisement shown before each access.

* Users will not be bothered much by advertisements if that is the price of *getting paid books to read for free*.

* While analyzing, *“Business, Education, Bible”* were found to be a few interesting topics that can be productively used for generating books and references in our app.
***

So if we choose this category, let's see what average number of installs we can target once the 5 giants are left out of consideration (the code below keeps only apps with at most 10 million installs):

In [25]:
install_total = 0
apps_total = 0
for app in android_final:
    if app[1] == "BOOKS_AND_REFERENCE":
        installs = app[5]
        installs = installs.replace(",", "")
        installs = int(installs)
        if installs < 10000001:
            install_total += installs
            apps_total += 1
        
string = 'Average installs = {:,.2f}'.format(install_total/apps_total)
print(string)

Average installs = 1,421,899.79


Looking at the average number of installs per app, neglecting the top 5 giants, **the average app in the “BOOKS_AND_REFERENCE” category has over 1 million installs, which suggests a new app could plausibly reach a similar figure.**

### Step 5: Conclusion:

With the *stated favorable facts and an estimated average of over 1 million installs*, it can be concluded that **‘Books and Reference’ seems more promising than the obviously expected ‘Entertainment, Social, Communication and Tools’ categories for a new app to “Earn through In-app ads”.**
