# Analyzing Profitable Apps for Google Play Store and iOS App Store

## Objective

The goal of this project is to analyse the marketplace for Profitable Apps on iOS Store and Google Play Store. This notebook is created from the perspective of a Data Analyst who is working for a company that specialises in creating Mobile Apps. 

We are trying to help the App Developers of our company make data-driven business decisions as to what type of app they should start developing, the target audience and other factors that would define how much feasible resources need to be put in to create a free app that generates revenue through in-app ads. 

We need to find a niche which has the optimum traffic and screen-time to have maximum number of customers engaging with the ads within our app. At the same time, it should not be dominated by apps that are hard to compete against or that do not saturate the market genre so much that there are hardly any customers who choose alternatives to these apps.

## Data

The first task would be to scrape or find data for these two App Stores with relevant information like app name, size, price, rating, number of reviews, number of installs etc. 
For better cost management, the first approach would be to find any readily available datasets on the web that we can use. 
Luckily for us, there are two Kaggle datasets that have the type of app data that we need.

1. [Google Play Store Dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps?resource=download) containing a scraped dataset of 10K apps from the Play Store.

2. [iOS App Store Dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) containing a scraped dataset of 7K apps from the iOS Mobile App Store.

### Reading the dataset 
Open the two datasets and instantiate them as list of lists. Create a function `explore_data` to explore the dataset so we don't have to write code to print the data whenever we need to explore it.

In [1]:
from csv import reader
import numpy as np

file_1 = reader(open('App_Dataset/AppleStore.csv', encoding = 'utf8'))           # Open and read datasets and create a list
applestore = list(file_1)

file_2 = reader(open('App_Dataset/googleplaystore.csv', encoding = 'utf8'))
playstore = list(file_2)

applestore_header = applestore[0]                                               # Isolate the header rows
applestore = applestore[1:]

playstore_header = playstore[0]
playstore = playstore[1:]

In [2]:
def explore_data(data, start, end, rows_columns = False):     # defines four parameters to execute the function
    data_slice = data[start:end]                              # creates a small printable slice of the dataset
    for row in data_slice:
        print(row)
        print('\n')                                           # print the selected rows
    if rows_columns:
        print('Number of rows:', len(data))
        print('Number of columns:', len(data[0]))             #print the number of rows and columns if desired

In [3]:
print(applestore_header,'\n')
explore_data(applestore,0,3,True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17




We have 7197 iOS apps in this data set, and the columns that seem interesting are: `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver`, and `prime_genre`. Not all column names are self-explanatory in this case, but details about each column can be found in the data set [documentation](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

In [4]:
print(playstore_header,'\n')
explore_data(playstore,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We see that the Google Play Store data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are `App`, `Category`, `Reviews`, `Installs`, `Type`, `Price`, and `Genres`.

## Data Wrangling

Now after gaining an insight to the format of our data and examples to its entries. We have an overview of our datasets and can begin the next phase of our project. Data Wrangling is an important role in any Data Analyst's work. The cleaner and more uniform the data is, the easier it is to analyse it. 

We can always come back and create more cleaning procedures if there are any errors encountered during the analysis part.

### Missing or incomplete data

Create a function to delete incomplete entries causing column shift inside a row.

In [5]:
def delete_incomplete(dataset,dataset_header):
    for row in dataset:
        i = 0                                                       # denotes the regular iteration of the for loop 
        n = 0                                                       # denotes the iteration increment when a deletion is made
        row_len = len(row)
        if row_len != len(dataset_header):
            del dataset[i-n]                                        # helps compensate the index after each deletion
            n = n+1
        i = i+1
        

In [6]:
print('Rows before deletion:',len(playstore))
delete_incomplete(playstore,playstore_header)
print('Rows after deletion:',len(playstore))

Rows before deletion: 10841
Rows after deletion: 10840


There was one row in this dataset which had discrepancies with its entries that would cause the row to have less number of columns. This might seem a very redundant procedure in this case but it is a necessary one. This would help avoid any errors that we would have encountered trying to convert our list of lists into a Pandas DataFrame.

In [7]:
print('Rows before deletion:',len(applestore))
delete_incomplete(applestore,applestore_header)
print('Rows after deletion:',len(applestore))

Rows before deletion: 7197
Rows after deletion: 7197


There were no incomplete rows in this dataset and we can move forwards with next steps of our data cleaning procedure.

### Deleting duplicate rows

Duplication is very common in any dataset and is always a good practice to check for any duplicates and remove them. We do this in order to avoid any errors in our analysis that might skew the conclusion that we come to. 

We create two empty lists, all the apps will be added to the `unique_apps` list once. After which, if the same app name is encountered again it will be appended to the `duplicate_apps` list.

In [8]:
duplicate_apps = []
unique_apps = []

for row in playstore:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

In [9]:
print(len(duplicate_apps))

1181


Google Play Store dataset has 1181 instances of duplication. This could mean 1181 different apps duplicated once, or a less number of apps with multiple entries.

In [10]:
duplicate_apps_unique = np.unique(duplicate_apps)
print(duplicate_apps[:30])

print('\n','Number of duplicate apps: ',len(duplicate_apps_unique))

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express', 'Accounting App - Zoho Books', 'Invoice & Time Tracking - Zoho', 'join.me - Simple Meetings', 'Invoice 2go — Professional Invoices and Estimates', 'SignEasy | Sign and Fill PDF and other Documents', 'Quick PDF Scanner + OCR FREE', 'Genius Scan - PDF Scanner', 'Tiny Scanner - PDF Scanner App', 'Fast Scanner : Free PDF Scan', 'Mobile Doc Scanner (MDScan) Lite']

 Number of duplicate apps:  798


There are 798 apps with duplication. There would be a few apps with more than 2 duplication.

In [11]:
for row in playstore:
    name = row[0]
    if name in duplicate_apps[:3]:
        print(row)
        print('\n')
        
print(playstore_header)

['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']


['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+'

In most cases the duplicate entries are identical. The discrepancies appear in the `Reviews` column for some apps like Instagram.

In [12]:
for row in playstore:
    name = row[0]
    if name == "Instagram":
        print(row)
        print('\n')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




The criteria to filter duplicate rows will be based on the `Reviews` column. The entry with the highest number of reviews would be the most recent updation. Therefore, that entry will be kept.

We will filter these duplicate apps by creating a dictionary with app name as the key and the maximum reviews that it has as its dictionary value. 

In [13]:
maxreview_dict = {}

for row in playstore:
    name = row[0]
    reviews = float(row[3])
    if name not in maxreview_dict:
        maxreview_dict[name] = reviews                                        
    elif name in maxreview_dict and reviews > maxreview_dict[name]:
        maxreview_dict[name] = reviews
        
print(maxreview_dict)
        

ValueError: could not convert string to float: '3.0M'

We encountered an error depicting 'M' being used in `Reviews` instead of numerically entering millions. Let's run a for loop to change each instance where this happens and turn it into numerical format. 

In [14]:
for row in playstore:
    name = row[0]
    review = row[3]
    if "M" in review:
        review = float(review.split('M')[0])*1000000          # use the first element of the split list and convert into million
        row[3] = review

In [44]:
maxreview_dict = {}

for row in playstore:
    name = row[0]
    reviews = float(row[3])
    if name not in maxreview_dict:                                  # first instance of app entry with its reviews
        maxreview_dict[name] = reviews
    elif name in maxreview_dict and reviews > maxreview_dict[name]: # if second entry has higher reviews, replace value
        maxreview_dict[name] = reviews
        
#print(maxreview_dict)
        

After creating a dictionary with each app and its highest rating among the duplicate entry, we can use this dictionary to filter through our data and create a new list where only the row with reviews matching the `maxreview_dict` value will be kept.

In [16]:
playstore_unique = []
added_names = []

for row in playstore:
    name = row[0]
    review = float(row[3])
    if maxreview_dict[name] == review and name not in added_names:
        playstore_unique.append(row)
        added_names.append(name)

In [17]:
print(len(playstore_unique))

9659


In [18]:
explore_data(playstore_unique,0,5,True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13


### Removing Non-English apps

Use ASCII standard character which have encoding between 0 and 127. Use `ord()` function to find out the encoding number. 
We will define a function that provides a boolean output to whether the app name is english or not.

Since there are some apps with emojis and other special characters within the English apps. We will allow a maximum of three non-ASCII characters in our app name filtering. This will ensure no unnecessary loss of useful data.

In [19]:
def is_english(name):
    non_ascii = 0                                          # ascii counter that increases each time there ord() returns value
    for char in name:                                      # greater than 127.
        if ord(char) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True

In [20]:
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Instagram'))

True
False
True


In [21]:
print(applestore[0][2])

PAC-MAN Premium


We have successfully created a function to filter out non-english apps while also keeping the apps that are in english but have a few special characters. Now this function can be used to remove the non-english apps from our unique apps dataset.

In [22]:
playstore_english = []
applestore_english = []

for row in playstore_unique:
    name = row[0]
    if is_english(name):
        playstore_english.append(row)

for row in applestore:
    name = row[2]
    if is_english(name):
        applestore_english.append(row)

explore_data(playstore_english,0,3,True)
print('\n')
explore_data(applestore_english,0,3,True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9614
Number of columns: 13


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '18858

### Isolating free apps

The next step would be to filter only the apps that are free. Since our app development model leverages in-app ads to generate revenue. We need to remove paid apps from our datasets before starting our analysis.

In [23]:
playstore_final = []
for row in playstore_english:
    price = row[7]
    if price == '0':
        playstore_final.append(row)   

In [24]:
len(playstore_final)

8863

In [25]:
applestore_final = []
for row in applestore_english:
    price = row[5]
    if price == '0':
        applestore_final.append(row)

In [26]:
len(applestore_final)

3222

The data wrangling procedure is now complete.

The final dataset contains only free english apps and the dataset contains unique apps without any duplication. 

## Data Analysis

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

This means finding genres which have scope on both the platforms.

### Most common apps by genre

We'll start by creating a function that generates a frequency table of the apps in each genre; convert it into percentage and sort it in a descending order. This would give us a clear picture of which genres have the most apps.

In [27]:
print(playstore_header,'\n')
explore_data(playstore_english,0,3,True)
print('\n',applestore_header,'\n')
explore_data(applestore_english,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9614
Number of columns: 13

 ['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['1', '281

In [28]:
def freq_table_percent(dataset,index):                                          # Takes the dataset and index to create a
    genres_freq = {}                                                            # frequency table with
    total = len(dataset)
    for row in dataset:
        genre = row[index]
        if genre not in genres_freq:
            genres_freq[genre] = 1
        elif genre in genres_freq:
            genres_freq[genre] += 1                                             # create a frequency table
    genres_freq_percent = {}
    for key in genres_freq:
        genres_freq_percent[key] = round(((genres_freq[key])/total)*100,2)      # convert it into frequency percentage
    
    return sorted(genres_freq_percent.items(), key=lambda item: item[1], reverse = True)   # return a descending order table
        
    

In [29]:
ios_genre_freq = freq_table_percent(applestore_final,-5)
ios_genre_freq

[('Games', 58.16),
 ('Entertainment', 7.88),
 ('Photo & Video', 4.97),
 ('Education', 3.66),
 ('Social Networking', 3.29),
 ('Shopping', 2.61),
 ('Utilities', 2.51),
 ('Sports', 2.14),
 ('Music', 2.05),
 ('Health & Fitness', 2.02),
 ('Productivity', 1.74),
 ('Lifestyle', 1.58),
 ('News', 1.33),
 ('Travel', 1.24),
 ('Finance', 1.12),
 ('Weather', 0.87),
 ('Food & Drink', 0.81),
 ('Reference', 0.56),
 ('Business', 0.53),
 ('Book', 0.43),
 ('Navigation', 0.19),
 ('Medical', 0.19),
 ('Catalogs', 0.12)]

According to our sample dataset, more than half of the apps on iOS Store are based around gaming or for-fun apps.

In [30]:
freq_table_percent(playstore_final, 1)

[('FAMILY', 18.91),
 ('GAME', 9.73),
 ('TOOLS', 8.46),
 ('BUSINESS', 4.59),
 ('LIFESTYLE', 3.9),
 ('PRODUCTIVITY', 3.89),
 ('FINANCE', 3.7),
 ('MEDICAL', 3.53),
 ('SPORTS', 3.4),
 ('PERSONALIZATION', 3.32),
 ('COMMUNICATION', 3.24),
 ('HEALTH_AND_FITNESS', 3.08),
 ('PHOTOGRAPHY', 2.94),
 ('NEWS_AND_MAGAZINES', 2.8),
 ('SOCIAL', 2.66),
 ('TRAVEL_AND_LOCAL', 2.34),
 ('SHOPPING', 2.25),
 ('BOOKS_AND_REFERENCE', 2.14),
 ('DATING', 1.86),
 ('VIDEO_PLAYERS', 1.79),
 ('MAPS_AND_NAVIGATION', 1.4),
 ('FOOD_AND_DRINK', 1.24),
 ('EDUCATION', 1.16),
 ('ENTERTAINMENT', 0.96),
 ('LIBRARIES_AND_DEMO', 0.94),
 ('AUTO_AND_VEHICLES', 0.93),
 ('HOUSE_AND_HOME', 0.82),
 ('WEATHER', 0.8),
 ('EVENTS', 0.71),
 ('PARENTING', 0.65),
 ('ART_AND_DESIGN', 0.63),
 ('COMICS', 0.62),
 ('BEAUTY', 0.6)]

The category `FAMILY` in the Play Store dataset is a very vague and broad genre. We will be using another column, `Genres` in the playstore dataset, which has a broader range of genres given instead of the `Category` column. We do this to see if we can get a better insight into what each category means.

In [31]:
freq_table_percent(playstore_final,-4)

[('Tools', 8.45),
 ('Entertainment', 6.07),
 ('Education', 5.35),
 ('Business', 4.59),
 ('Lifestyle', 3.89),
 ('Productivity', 3.89),
 ('Finance', 3.7),
 ('Medical', 3.53),
 ('Sports', 3.46),
 ('Personalization', 3.32),
 ('Communication', 3.24),
 ('Action', 3.1),
 ('Health & Fitness', 3.08),
 ('Photography', 2.94),
 ('News & Magazines', 2.8),
 ('Social', 2.66),
 ('Travel & Local', 2.32),
 ('Shopping', 2.25),
 ('Books & Reference', 2.14),
 ('Simulation', 2.04),
 ('Dating', 1.86),
 ('Arcade', 1.85),
 ('Video Players & Editors', 1.77),
 ('Casual', 1.76),
 ('Maps & Navigation', 1.4),
 ('Food & Drink', 1.24),
 ('Puzzle', 1.13),
 ('Racing', 0.99),
 ('Libraries & Demo', 0.94),
 ('Role Playing', 0.94),
 ('Auto & Vehicles', 0.93),
 ('Strategy', 0.91),
 ('House & Home', 0.82),
 ('Weather', 0.8),
 ('Events', 0.71),
 ('Adventure', 0.68),
 ('Comics', 0.61),
 ('Beauty', 0.6),
 ('Art & Design', 0.59),
 ('Parenting', 0.5),
 ('Card', 0.45),
 ('Casino', 0.43),
 ('Trivia', 0.42),
 ('Educational;Education

This gives us a idea of what categories/genres have a good supply of apps already present. Whether the market is saturated for each category or not. 
Now we will look at what apps have the most users and compare it with the genre frequency tables to find a sweet spot for our app development. 

For Google Play Store, this data is provided in the `Installs` columns. Since there is no such column in the Apple Store, we will use the `rating_count_tot` column as an alternate method of finding relative traffic on the apps per genre.  

### Popular apps sorted by genres

#### iOS Store

In [32]:
print(applestore_header,'\n')

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 



In [33]:
ios_genre = freq_table_percent(applestore_final, -5)
ios_genre

[('Games', 58.16),
 ('Entertainment', 7.88),
 ('Photo & Video', 4.97),
 ('Education', 3.66),
 ('Social Networking', 3.29),
 ('Shopping', 2.61),
 ('Utilities', 2.51),
 ('Sports', 2.14),
 ('Music', 2.05),
 ('Health & Fitness', 2.02),
 ('Productivity', 1.74),
 ('Lifestyle', 1.58),
 ('News', 1.33),
 ('Travel', 1.24),
 ('Finance', 1.12),
 ('Weather', 0.87),
 ('Food & Drink', 0.81),
 ('Reference', 0.56),
 ('Business', 0.53),
 ('Book', 0.43),
 ('Navigation', 0.19),
 ('Medical', 0.19),
 ('Catalogs', 0.12)]

We loop through each genre in our frequency tables and add up the popularity of each app and the length of each genre. Then we print out a sorted version of this calculation, giving us the popularity index per each genre.

In [34]:
popular_per_genre_ios = {}
for row in ios_genre:
    genre = row[0]
    popularity = 0
    len_genre = 0
    for app in applestore_final:
        ratings = float(app[6])
        if app[-5] == genre:
            popularity += ratings
            len_genre += 1
    popular_per_genre_ios[genre] = popularity/len_genre
popular_per_genre_ios = sorted(popular_per_genre_ios.items(), key=lambda item: item[1], reverse = True)

In [35]:
popular_per_genre_ios

[('Navigation', 86090.33333333333),
 ('Reference', 74942.11111111111),
 ('Social Networking', 71548.34905660378),
 ('Music', 57326.530303030304),
 ('Weather', 52279.892857142855),
 ('Book', 39758.5),
 ('Food & Drink', 33333.92307692308),
 ('Finance', 31467.944444444445),
 ('Photo & Video', 28441.54375),
 ('Travel', 28243.8),
 ('Shopping', 26919.690476190477),
 ('Health & Fitness', 23298.015384615384),
 ('Sports', 23008.898550724636),
 ('Games', 22788.6696905016),
 ('News', 21248.023255813954),
 ('Productivity', 21028.410714285714),
 ('Utilities', 18684.456790123455),
 ('Lifestyle', 16485.764705882353),
 ('Entertainment', 14029.830708661417),
 ('Business', 7491.117647058823),
 ('Education', 7003.983050847458),
 ('Catalogs', 4004.0),
 ('Medical', 612.0)]

Let's check for saturated genres where apps supply is high whereas traffic per genre is not that significant.
Also we need to check for genres that have big monopolies in them. 
As Data Analysts, we know one of the basic rules of statistics, i.e., the Pareto principle. Once an entity starts to dominate ~50% of the total sector, it will have a high tendency to keep dominating the market and increasing it's share.

Navigation and Social Networking apps can have a monopolised landscape. Lets confirm that by checking the apps with the most traffic in the first few genres.
If the traffic is divided amongst a range of apps, that would be a suitable scenario for us.

In [36]:
popular_genres_ios = ['Navigation','Reference','Social Networking','Music','Weather','Book','Food & Drink']
for genres in popular_genres_ios:
    print('\n','Popular apps in', genres,'\n')
    for row in applestore_final:
        name = row[2]
        ratings = row[6]
        genre = row[-5]
        if genre == genres:
            print(name,': Ratings = ', ratings,)


 Popular apps in Navigation 

Waze - GPS Navigation, Maps & Real-time Traffic : Ratings =  345046
Geocaching® : Ratings =  12811
ImmobilienScout24: Real Estate Search in Germany : Ratings =  187
Railway Route Search : Ratings =  5
CoPilot GPS – Car Navigation & Offline Maps : Ratings =  3582
Google Maps - Navigation & Transit : Ratings =  154911

 Popular apps in Reference 

Bible : Ratings =  985920
Dictionary.com Dictionary & Thesaurus : Ratings =  200047
Dictionary.com Dictionary & Thesaurus for iPad : Ratings =  54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : Ratings =  18418
Merriam-Webster Dictionary : Ratings =  16849
Google Translate : Ratings =  26786
Night Sky : Ratings =  12122
WWDC : Ratings =  762
Jishokun-Japanese English Dictionary & Translator : Ratings =  0
教えて!goo : Ratings =  0
VPN Express : Ratings =  14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : Ratings =  17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket W

#### Findings after analysis

* Navigation apps have big players like Waze, Google Maps that attract all the users.

* Reference apps mostly have traffic for apps like Bible or Dictionaries.

* Social Networking apps also have monopoly from big players like Facebook, WhatsApp, Pinterest, Messenger etc. It would take up   a lot of resources to compete with and disrupt the market with these companies' presence in the market.

* Music apps have traffic on Spotify, Pandora, SoundCloud adding the total to around 2 million users. 

* Since our objective is to develop a free app that monetizes on ads, weather apps are not a good niche. Since people dont         usually spend too much time on these apps. 

- The only balanced traffic we can notice is in genres `Reference` and `Book`. This provides us with a good sector that we can target.
However, we need to overlap this sector with Google Play Store as well; and confirm availability in this sector.

#### Popular apps in Google Play Store

In [37]:
print(playstore_header,'\n')

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 



Using the same method as for the iOS Store dataset, we loop through each genre, create a popularity index based on `Installs`. Since the format of this column is of the type `1,000,000+`,`5,000,000+` etc., we will use the number as they are and remove the `+` sign.

In [38]:
google_genre = freq_table_percent(playstore_final,1)
google_genre

[('FAMILY', 18.91),
 ('GAME', 9.73),
 ('TOOLS', 8.46),
 ('BUSINESS', 4.59),
 ('LIFESTYLE', 3.9),
 ('PRODUCTIVITY', 3.89),
 ('FINANCE', 3.7),
 ('MEDICAL', 3.53),
 ('SPORTS', 3.4),
 ('PERSONALIZATION', 3.32),
 ('COMMUNICATION', 3.24),
 ('HEALTH_AND_FITNESS', 3.08),
 ('PHOTOGRAPHY', 2.94),
 ('NEWS_AND_MAGAZINES', 2.8),
 ('SOCIAL', 2.66),
 ('TRAVEL_AND_LOCAL', 2.34),
 ('SHOPPING', 2.25),
 ('BOOKS_AND_REFERENCE', 2.14),
 ('DATING', 1.86),
 ('VIDEO_PLAYERS', 1.79),
 ('MAPS_AND_NAVIGATION', 1.4),
 ('FOOD_AND_DRINK', 1.24),
 ('EDUCATION', 1.16),
 ('ENTERTAINMENT', 0.96),
 ('LIBRARIES_AND_DEMO', 0.94),
 ('AUTO_AND_VEHICLES', 0.93),
 ('HOUSE_AND_HOME', 0.82),
 ('WEATHER', 0.8),
 ('EVENTS', 0.71),
 ('PARENTING', 0.65),
 ('ART_AND_DESIGN', 0.63),
 ('COMICS', 0.62),
 ('BEAUTY', 0.6)]

In [39]:
popular_per_genre_google = {}
for row in google_genre:
    genre = row[0]
    popularity = 0
    len_genre = 0
    for app in playstore_final:
        installs = float(app[5].replace(',','').replace('+',''))
        if app[1] == genre:
            popularity += installs
            len_genre += 1
    popular_per_genre_google[genre] = round(popularity/len_genre,2)
popular_per_genre_google = sorted(popular_per_genre_google.items(), key=lambda item: item[1], reverse = True)

In [40]:
popular_per_genre_google

[('COMMUNICATION', 38456119.17),
 ('VIDEO_PLAYERS', 24727872.45),
 ('SOCIAL', 23253652.13),
 ('PHOTOGRAPHY', 17840110.4),
 ('PRODUCTIVITY', 16787331.34),
 ('GAME', 15588015.6),
 ('TRAVEL_AND_LOCAL', 13984077.71),
 ('ENTERTAINMENT', 11640705.88),
 ('TOOLS', 10801391.3),
 ('NEWS_AND_MAGAZINES', 9549178.47),
 ('BOOKS_AND_REFERENCE', 8767811.89),
 ('SHOPPING', 7036877.31),
 ('PERSONALIZATION', 5201482.61),
 ('WEATHER', 5074486.2),
 ('HEALTH_AND_FITNESS', 4188821.99),
 ('MAPS_AND_NAVIGATION', 4056941.77),
 ('FAMILY', 3695641.82),
 ('SPORTS', 3638640.14),
 ('ART_AND_DESIGN', 2021626.79),
 ('FOOD_AND_DRINK', 1924897.74),
 ('EDUCATION', 1833495.15),
 ('BUSINESS', 1712290.15),
 ('LIFESTYLE', 1437816.27),
 ('FINANCE', 1387692.48),
 ('HOUSE_AND_HOME', 1331540.56),
 ('DATING', 854028.83),
 ('COMICS', 817657.27),
 ('AUTO_AND_VEHICLES', 647317.82),
 ('LIBRARIES_AND_DEMO', 638503.73),
 ('PARENTING', 542603.62),
 ('BEAUTY', 513151.89),
 ('EVENTS', 253542.22),
 ('MEDICAL', 120550.62)]

In [41]:
popular_genres_google = ['COMMUNICATION','VIDEO_PLAYERS','SOCIAL','BOOKS_AND_REFERENCE']
for genres in popular_genres_google:
    print('\n','Popular apps in', genres,'\n')
    for row in playstore_final:
        name = row[0]
        installs = float(row[5].replace('+','').replace(',',''))
        genre = row[1]
        if genre == genres:
            print(name,': Installs = ', installs,)


 Popular apps in COMMUNICATION 

WhatsApp Messenger : Installs =  1000000000.0
Messenger for SMS : Installs =  10000000.0
My Tele2 : Installs =  5000000.0
imo beta free calls and text : Installs =  100000000.0
Contacts : Installs =  50000000.0
Call Free – Free Call : Installs =  5000000.0
Web Browser & Explorer : Installs =  5000000.0
Browser 4G : Installs =  10000000.0
MegaFon Dashboard : Installs =  10000000.0
ZenUI Dialer & Contacts : Installs =  10000000.0
Cricket Visual Voicemail : Installs =  10000000.0
TracFone My Account : Installs =  1000000.0
Xperia Link™ : Installs =  10000000.0
TouchPal Keyboard - Fun Emoji & Android Keyboard : Installs =  10000000.0
Skype Lite - Free Video Call & Chat : Installs =  5000000.0
My magenta : Installs =  1000000.0
Android Messages : Installs =  100000000.0
Google Duo - High Quality Video Calls : Installs =  500000000.0
Seznam.cz : Installs =  1000000.0
Antillean Gold Telegram (original version) : Installs =  100000.0
AT&T Visual Voicemail : In

#### Findings after Analysis 

* Again there is a monopoly in sectors like `COMMUNICATION`,`VIDEO_PLAYERS`,`SOCIAL`.

* `Book` genre showed promise in the iOS platform. We will check the same genre in Google Play Store `BOOK_AND_REFERENCE`.

In [42]:
for app in playstore_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


Not many apps dominating the this genre. There are only a few apps that have a high number of installs. 
We can look at the type of apps that are somewhere in the mid-popularity of this niche. If those apps also have significant traffic, this would mean the sector is not saturated by the big players. We can them aim to develop an app similar to those in the mid-popularity range. Ensuring a good enough revenue and traffic through our desired business model. 

In [43]:
for app in playstore_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])



Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

Apart from Dictionaries and other popular book apps like the Bible and Qur'an, there are plenty of apps with Book reading software and ePDFs that have a uniform number of downloads.

## Conclusion

We looked through the different genres and were able to find one niche which shows promise. We can thus conclude that an app based around book reading, whether a single book or multiple ones, with a simplistic and fluid GUI can gain popularity and have significant user traffic for our company to benefit from ads revenue generation. However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

