# Analysis of Mobile App Data

In this project we analyze data on iOS and Android mobile apps available on Google Play and the App Store. We take the role of analysts at a company that builds iOS and Android apps. The apps our company builds are free to download and install, with all revenue generated by in-app advertisements.

Our goal is to analyze the data sets and gain insight into what type of app is likely to be the most profitable. Since revenue is driven entirely by the number of users who interact with the in-app advertisements, this means creating an app that attracts as many users as possible.

## Opening and Exploring the Data

We will be working with two data sets in this project which are both readily available on Kaggle:
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data on approximately 7,000 iOS apps from the App Store. The data set can be downloaded directly [here.](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)
- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data on approximately 10,000 Android apps from Google Play. The data set can be downloaded directly [here.](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)


To begin, we will open and explore the two data sets to get a better idea of what we are working with. To assist with this we will define a function `explore_data()` that prints rows of a data set in a readable way:

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

Now that we have a function to assist in exploring our data, we will open the two data sets so we can start exploring. We will also separate the header row from the rest of each data set, since we don't want to treat it as a data row:

In [2]:
from csv import reader

# Read each CSV with an explicit encoding; the app names contain non-ASCII text.
with open('AppleStore.csv', encoding='utf8') as open_file:
    ios_data = list(reader(open_file))
ios_header = ios_data[0]
ios_data = ios_data[1:]

with open('googleplaystore.csv', encoding='utf8') as open_file:
    android_data = list(reader(open_file))
android_header = android_data[0]
android_data = android_data[1:]

explore_data(ios_data, 1, 4, rows_and_columns=True)
print('\n')
explore_data(android_data, 1, 4, rows_and_columns=True)

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows:  7197
Number of columns:  16


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.

Calling `explore_data()` on our two data sets shows a few rows of each, which gives us an idea of what our data looks like. We also see that the App Store data set has 7197 rows and 16 columns, while the Google Play data set has 10841 rows and 13 columns. To get a better idea of what each column represents, let's print the header row of each data set:

In [3]:
print(ios_header)
print('\n')
print(android_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
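When reading raw rows, it can help to see each value next to its column name. The following is a minimal sketch, not part of the original notebook: `label_row` is a hypothetical helper, and the sample header and rows are a shortened copy of the Google Play layout shown above.

```python
# Hypothetical, shortened sample that mirrors the Google Play layout.
sample_header = ['App', 'Category', 'Rating', 'Reviews']
sample_rows = [
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644'],
]

def label_row(header, row):
    """Return a dict mapping each column name to the row's value."""
    return dict(zip(header, row))

labeled = label_row(sample_header, sample_rows[0])
for column, value in labeled.items():
    print(column, ':', value)
```

The same helper could be applied to `android_header` and a row of `android_data`, assuming both have the same length.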


If we take a look at the App Store data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps), we can get a better idea of what each column heading means and determine which columns will be useful in our analysis. 'track_name' is the name of the app, which we need to identify each entry. 'rating_count_tot' and 'user_rating' hold the number of ratings the app has received and its average rating, which can be valuable for determining how popular an app is. Finally, 'prime_genre' describes the genre that the app belongs to, which can help us determine which types of apps are more popular than others.

Likewise, looking at the Google Play data set [documentation](https://www.kaggle.com/lava18/google-play-store-apps), we see that 'App' refers to the name of the app. 'Category' and 'Genres' will be useful in grouping the apps together by content. 'Rating', 'Reviews', and 'Installs' will help us determine which types of apps are the most popular.

## Data Cleaning 

Before we progress further in our analysis we will pause to clean our data sets. To do this we will remove or correct any incorrect data points, remove any duplicate data points, and modify the data sets to better fit the purpose of our analysis. For instance, our company is only interested in developing *free* apps that are targeted toward an *English-speaking* audience. To that end, we will need to remove any non-free and non-English apps from our data sets.

### Deleting Wrong Data

We will begin our cleaning process by detecting and deleting any incorrect data in our data sets. First, let's take a look at the Google Play data set. If we explore the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section for the data set on Kaggle, we can see that there is an [error](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) in a row of the data set. Row 10472 is missing the 'Category' column, which caused all the columns after it to shift out of place. If we print the length of row 10472 we can see that it is 12 when we know that the number of columns for the Google Play data set is 13:

In [4]:
print(android_data[10472])
print(len(android_data[10472]))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12


To fix this error, we will go ahead and delete this row from the data set:

In [5]:
print(len(android_data))
del(android_data[10472])
print(len(android_data))

10841
10840
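The faulty row above was found through the Kaggle discussion. As a complementary check, one could scan for every row whose length differs from the header's. This is a minimal sketch under that assumption: `find_malformed_rows` is a hypothetical helper and the sample data is invented for illustration.

```python
def find_malformed_rows(dataset, header):
    """Return (index, length) pairs for rows whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Hypothetical sample: the second row is missing one column.
sample_header = ['App', 'Category', 'Rating']
sample_data = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

malformed = find_malformed_rows(sample_data, sample_header)
print(malformed)
```

A length check only catches rows with missing or extra columns; it would not detect a value that is wrong but sits in the right position.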


Now the Google Play data set is one step closer to being cleaned. The [discussion](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion) for the App Store data set doesn't show any obviously incorrect data, so we will assume that it is correct. We can now move on with the data cleaning process.

### Removing Duplicate Entries

For now we will continue cleaning the Google Play data set. When examining the data set or reading through the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, we see that some apps appear in the data set more than once. Take Instagram, for example. A quick check shows that it appears 4 separate times:

In [6]:
for app in android_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


If we do a deeper dive, we find that there are 1181 duplicate entries in the Google Play data set:

In [7]:
unique_entries = []
duplicate_entries = []

for app in android_data:
    name = app[0]
    if name in unique_entries:
        duplicate_entries.append(name)
    else:
        unique_entries.append(name)
        
print('Number of duplicate entries: ', len(duplicate_entries))
print('\n')
print('Examples of duplicate entries: ', duplicate_entries[:10])

Number of duplicate entries:  1181


Examples of duplicate entries:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
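The loop above checks membership in a list, which gets slower as the list grows. A possible alternative, sketched here with `collections.Counter` from the standard library, counts every name in one pass. The helper name `count_duplicates` and the sample rows are hypothetical.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0):
    """Return (number of extra rows, {name: occurrences}) for names seen more than once."""
    counts = Counter(row[name_index] for row in dataset)
    repeated = {name: n for name, n in counts.items() if n > 1}
    # Every occurrence beyond the first counts as one duplicate row.
    extra_rows = sum(n - 1 for n in repeated.values())
    return extra_rows, repeated

# Hypothetical sample: 'Box' appears three times, so two rows are duplicates.
sample = [['Box'], ['Slack'], ['Box'], ['Zenefits'], ['Box']]
extra_rows, repeated = count_duplicates(sample)
print('Number of duplicate entries: ', extra_rows)
print('Repeated names: ', repeated)
```

This counts duplicates the same way as the loop above: each occurrence after the first adds one to the total.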


Now we need to remove these duplicate entries so that there is only one entry per app. If we look at the printout of the four Instagram duplicates above, we see that they are nearly identical, with one important difference: the 'Reviews' column. We can use this to decide which duplicate to keep. The entry with the highest number of reviews is most likely the most recent one, so we will keep that entry and delete the rest. Below we see the expected number of apps in our data set after duplicates are removed:

In [8]:
print('Expected length after duplicate removal: ', len(android_data) - 1181)

Expected length after duplicate removal:  9659


We can now start working on removing the duplicate entries. First we will create a dictionary that holds the name of each unique app in the data set along with the highest number of reviews any entry for that app has received:

In [9]:
reviews_max = {}

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))
print(reviews_max['Instagram'])

9659
66577446.0


Our dictionary looks to be working as intended. It has 9659 entries, which is the number of unique apps we expect, and the value for 'Instagram' is 66577446, which we saw above is the highest number of reviews among the 'Instagram' entries. Now we can use our dictionary to remove the duplicate rows from the data set. We will create a new data set called `android_clean` and a list called `already_added`. Next we will loop through the Google Play data set, and if the 'Reviews' value for a row matches the value saved for that app name in our `reviews_max` dictionary, we will append the row to `android_clean`. The `already_added` list lets us catch duplicate entries of an app that have the same number of reviews:

In [10]:
android_clean = []
already_added = []

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
explore_data(android_clean, 0, 10, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+

Using our `explore_data` function we can see that our duplicate removal worked as intended: we are left with 9659 unique apps in the data set. If we take a quick look below at the App Store data set, checking the unique 'id' column, we can see that it has no duplicate entries, so we can move on with our data cleaning process.

In [11]:
unique_entries = []
duplicate_entries = []

for app in ios_data:
    app_id = app[0]
    if app_id in unique_entries:
        duplicate_entries.append(app_id)
    else:
        unique_entries.append(app_id)

print('Number of duplicate entries: ', len(duplicate_entries))

Number of duplicate entries:  0
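The two-step removal used for the Google Play data (build `reviews_max`, then filter with `already_added`) can also be sketched as a single pass that remembers the best row seen so far for each name. This is an illustrative alternative rather than the notebook's method. `keep_most_reviewed` is a hypothetical helper, and the sample rows use a shortened two-column layout with review counts taken from the Instagram rows above.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that, on ties, the earlier row is kept.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Shortened sample rows: [name, reviews].
sample = [
    ['Instagram', '66577313'],
    ['Box', '159'],
    ['Instagram', '66577446'],
    ['Instagram', '66509917'],
]
deduplicated = keep_most_reviewed(sample, name_index=0, reviews_index=1)
print(deduplicated)
```

One difference to be aware of: the result is ordered by the first appearance of each name, not by the position of the row that was kept.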


### Removing Non-English Apps 

The next step in our data cleaning process will be to remove any non-English apps, since the company we are working for is developing apps for an English speaking audience. This way we can refine our data and only analyze information on apps geared toward English speakers. Below, we find evidence of non-English apps in both of our data sets:

In [12]:
print(ios_data[813][1])
print(ios_data[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


In order to locate and remove these non-English apps, we will write a function that detects the presence of non-English characters in an app's name. Our function converts each character of the name to its numeric code (as returned by `ord()`) and checks whether it falls within the [ASCII](https://en.wikipedia.org/wiki/ASCII) range commonly used for English text (0 to 127). If a character's code is outside this range, the app is likely not built for English speakers and we can remove it. However, some English apps contain symbols that fall outside the 0-127 range, such as 'Docs To Go™ Free Office Suite' or 'Instachat 😜'. These are English apps, so we don't want to exclude them from our data sets. To address this issue, we will only remove an app if it has more than 3 non-ASCII characters in its name:

In [13]:
def check_english(string):
    non_english_chars = 0
    for char in string:
        if ord(char) > 127:
            non_english_chars += 1
    if non_english_chars > 3:
        return False
    else:
        return True

Let's now test out our `check_english` function with a few app names:

In [14]:
print(check_english('Instagram'))
print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instachat 😜'))

True
False
True
True
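To see why the threshold matters, it can help to look at the actual count of non-ASCII characters in each name. The sketch below splits the counting into its own helper. Both `count_non_ascii` and `is_probably_english` are hypothetical names, and the `max_non_ascii` parameter is an assumption that mirrors the threshold of 3 used above.

```python
def count_non_ascii(string):
    """Count characters whose code point falls outside the ASCII range (0 to 127)."""
    return sum(1 for char in string if ord(char) > 127)

def is_probably_english(string, max_non_ascii=3):
    """Treat a name as English if it has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(string) <= max_non_ascii

names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in names:
    print(count_non_ascii(name), is_probably_english(name))
```

This remains a heuristic: a short non-English name with three or fewer non-ASCII characters would still pass, and an English name with many emoji would be rejected.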


Our `check_english` function looks to be working as intended. We can now use it to remove non-English apps from the App Store and Google Play data sets:

In [15]:
ios_cleaned = []

for app in ios_data:
    name = app[1]
    if check_english(name):
        ios_cleaned.append(app)
        
android_cleaned = []

for app in android_clean:
    name = app[0]
    if check_english(name):
        android_cleaned.append(app)
        
print('Number of App Store Apps: ', len(ios_cleaned))
print('Number of Google Play Apps: ', len(android_cleaned))

Number of App Store Apps:  6183
Number of Google Play Apps:  9614


After our successful removal of non-English apps from our data sets we are left with 6183 apps in the App Store data set and 9614 apps in the Google Play data set. We will now proceed to the final step in our data cleaning process.

### Isolating Free Apps

As stated previously, our company is only interested in developing apps that are free to download and use. Since our data sets still contain apps that are not free to download, the final step in our data cleaning process is to remove these apps before beginning our analysis. To do this, we will loop through our data sets and check the price column ('price' for the App Store, 'Price' for Google Play). If an app has a nonzero price, it will be removed:

In [16]:
ios_final = []

for app in ios_cleaned:
    price = app[4]
    if price == '0' or price == '0.0':
        ios_final.append(app)
        
android_final = []

for app in android_cleaned:
    price = app[7]
    if price == '0' or price == '0.0':
        android_final.append(app)
        
print('Number of App Store Apps: ', len(ios_final))
print('Number of Google Play Apps: ', len(android_final))

Number of App Store Apps:  3222
Number of Google Play Apps:  8864
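The string comparison above relies on free apps being recorded exactly as '0' or '0.0'. A slightly more defensive sketch converts the price to a number first. It assumes that paid prices may carry a leading '$' and that an unparseable value should not count as free. `is_free` is a hypothetical helper, not part of the original notebook.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' equals zero."""
    cleaned = price.replace('$', '').strip()
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Values that are not numbers (for example text from a shifted row) are not free.
        return False

for sample_price in ['0', '0.0', '$4.99', 'Everyone']:
    print(sample_price, ':', is_free(sample_price))
```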


After removing non-free apps we are left with 3222 apps in the App Store data set and 8864 apps in the Google Play data set. Our data sets are now cleaned and curated for our purposes and we can move on to our data analysis.

## Data Analysis

### Determining Most Common Apps by Genre 

Now that we have cleaned up our data we are ready to start our analysis. For this project, the validation strategy for our app idea will be composed of three steps: 
1. Build a minimal Android version of the app and release on Google Play
2. If the app has good response from users, develop it further
3. If the app is profitable after six months, build an iOS version of the app and release on the App Store

Since our overall strategy involves building an app for both Google Play and the App Store, we should use our analysis to determine types of apps that are profitable in both marketplaces. Let's start by examining the header rows of both data sets to determine what columns will be useful for our analysis:

In [17]:
print(android_header)
print('\n')
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The 'Category' and 'Genres' columns of the Google Play data set look like they will be useful to us, as does the 'prime_genre' column of the App Store data set. We will start by building frequency tables for these columns to see which types of apps appear the most. To accomplish this, we will write two functions: `freq_table()` and `display_table()`. `freq_table()` takes a data set and a column index and builds a frequency table for that column in the form of a dictionary. `display_table()` builds the frequency table and displays it in descending order:

In [18]:
def freq_table(dataset, index):
    freq_dict = {}
    for row in dataset:
        col_value = row[index]
        if col_value in freq_dict:
            freq_dict[col_value] += 1
        else:
            freq_dict[col_value] = 1
    return freq_dict

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

Our functions are written and we will now use them to build and analyze a frequency table for the 'prime_genre' column of the App Store data set:

In [19]:
display_table(ios_final, 11)

Games : 1874
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4
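Raw counts are easier to compare across the two stores when expressed as percentages of the total. The following sketch shows one way to do that with `collections.Counter`. `freq_table_pct` is a hypothetical variant of `freq_table()`, and the sample rows are invented for illustration.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Return {value: percentage of rows}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}

# Hypothetical sample rows: [name, genre].
sample = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Education'],
    ['App D', 'Games'],
]
percentages = freq_table_pct(sample, 1)
for genre, pct in percentages.items():
    print(genre, ':', pct)
```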


The frequency table for the 'prime_genre' column of the App Store data set gives us lots of useful information. We can see that by far the most common type of app on the App Store is 'Games', with 'Entertainment' a distant second. It also appears that apps geared toward leisure, like 'Games', 'Entertainment' and 'Social Networking', are much more common than practical or utility apps. While our frequency table shows which types of apps are the most common on the App Store, it doesn't tell us how many users those apps have. A type of app being very common doesn't necessarily mean that it has the largest user base, so we will have to continue our analysis before we can make a recommendation. Let's now build similar frequency tables for the 'Category' and 'Genres' columns of the Google Play data set:

In [20]:
print('Category Column')
display_table(android_final, 1)
print('\n')
print('Genres Column')
display_table(android_final, 9)

Category Column
FAMILY : 1676
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53


Genres Column
Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players

Looking at the results, something that jumps out right away is that there seem to be many more distinct app types in the 'Genres' column than in the 'Category' column. The 'Category' column aligns better with the 'prime_genre' column of the App Store data set, so we will focus on it for our analysis. We can see that the 'FAMILY' category is by far the most common, with 'GAME' the second most common. We can assume that the 'FAMILY' category contains family-friendly apps, which may include games or entertainment apps for children. In this way, the App Store and Google Play data sets are similar, with apps geared toward entertainment being the most common. One difference is that practical apps like 'TOOLS', 'BUSINESS', 'LIFESTYLE', 'PRODUCTIVITY', 'FINANCE' and 'MEDICAL' seem to be more common on Google Play, which suggests a more even distribution of app types. However, as with our frequency table for the 'prime_genre' column of the App Store data set, we still do not know how many users each type of app has. That information will be very important when making our recommendation, so we must continue our analysis.

### Most Popular Apps by Genre on the App Store

Our frequency tables were helpful in determining which apps are the most common in both marketplaces, but in order to make an informed recommendation we need to know which apps are the most popular, that is, which apps have the most users. The Google Play data set has a column called 'Installs' that we can use to measure the average number of users per genre, but the App Store data set has no such column. For the App Store data set we will instead use the number of ratings an app has received as a measure of its popularity. This data can be found in the 'rating_count_tot' column. Below, we calculate the average number of ratings for each app genre:

In [21]:
genres = freq_table(ios_final, 11)

for genre in genres:
    total = 0
    len_genre = 0
    
    for app in ios_final:
        genre_app = app[11]
        if genre_app == genre:
            ratings = float(app[5])
            total += ratings
            len_genre += 1
            
    average = total / len_genre
    
    print(genre, ':', average)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
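The nested loop above scans the whole data set once per genre. As a sketch of an alternative, the totals and counts can be accumulated in dictionaries during a single pass. `average_by_group` is a hypothetical helper and the sample rows are invented, using a two-column layout instead of the real indices 11 and 5.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column} computed in one pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical sample rows: [genre, rating count].
sample = [['Games', '100'], ['Games', '300'], ['Medical', '50']]
averages = average_by_group(sample, 0, 1)
for genre, average in averages.items():
    print(genre, ':', average)
```

On the real data this would be called as `average_by_group(ios_final, 11, 5)`, assuming the column layout shown in the header above.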


The averages show that 'Navigation' apps have the highest average number of ratings, at roughly 86,090 per app. When we look closer at the 'Navigation' apps below, however, we see that the vast majority of those ratings belong to only two apps, Waze and Google Maps:

In [22]:
for app in ios_final:
    genre = app[11]
    if genre == 'Navigation':
        print(app[1], ':', float(app[5]))

Waze - GPS Navigation, Maps & Real-time Traffic : 345046.0
Google Maps - Navigation & Transit : 154911.0
Geocaching® : 12811.0
CoPilot GPS – Car Navigation & Offline Maps : 3582.0
ImmobilienScout24: Real Estate Search in Germany : 187.0
Railway Route Search : 5.0
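One way to quantify this skew is to compare the mean with the median and to measure what share of all ratings the top apps hold. The sketch below uses the six rating counts printed above. `top_share` is a hypothetical helper.

```python
from statistics import mean, median

# Rating counts copied from the 'Navigation' output above.
navigation_ratings = [345046.0, 154911.0, 12811.0, 3582.0, 187.0, 5.0]

def top_share(values, k):
    """Fraction of the total contributed by the k largest values."""
    ordered = sorted(values, reverse=True)
    return sum(ordered[:k]) / sum(ordered)

print('Mean:', round(mean(navigation_ratings), 2))
print('Median:', median(navigation_ratings))
print('Share of top 2:', round(top_share(navigation_ratings, 2), 3))
```

The median (8196.5) is far below the mean (about 86,090), and the top two apps account for roughly 97% of all ratings in the genre.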


This concentration makes 'Navigation' apps appear more popular than they may really be. We wouldn't necessarily want to recommend that our company develop a navigation app, since the genre is dominated by Waze and Google Maps and ratings drop off steeply after them. This suggests that it could be hard for a new navigation app to gain traction in the App Store. A similar pattern can be seen in the 'Social Networking' genre, which is dominated by Facebook and Pinterest. The 'Reference' genre is also dominated by Bible and dictionary apps, but it may be easier to break into than 'Navigation' or 'Social Networking'. Based on our analysis so far, it might be reasonable to create a reference app with some type of game mechanic built in, since games are the most common type of app and reference is one of the most popular genres by average ratings. The 'Reference' apps are listed below:

In [23]:
for app in ios_final:
    genre = app[11]
    if genre == 'Reference':
        print(app[1], ':', float(app[5]))

Bible : 985920.0
Dictionary.com Dictionary & Thesaurus : 200047.0
Dictionary.com Dictionary & Thesaurus for iPad : 54175.0
Google Translate : 26786.0
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418.0
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588.0
Merriam-Webster Dictionary : 16849.0
Night Sky : 12122.0
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535.0
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693.0
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497.0
Guides for Pokémon GO - Pokemon GO News and Cheats : 826.0
WWDC : 762.0
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718.0
VPN Express : 14.0
Real Bike Traffic Rider Virtual Reality Glasses : 8.0
教えて!goo : 0.0
Jishokun-Japanese English Dictionary & Translator : 0.0


### Most Popular Apps by Genre on Google Play

Now that we have an idea of what app we might want to build for the App Store, we will take a look at the Google Play data set to see what type of app could be profitable in that marketplace. Our validation strategy is to release the app on Google Play first and on the App Store later, so an app that can be profitable in both markets is ideal. The Google Play data set has an 'Installs' column, which describes the number of installs the app has from Google Play. We will use this metric to gauge each app's popularity. Let's take a look at the 'Installs' column below:

In [24]:
display_table(android_final, 5)

1,000,000+ : 1394
100,000+ : 1024
10,000,000+ : 935
10,000+ : 904
1,000+ : 744
100+ : 613
5,000,000+ : 605
500,000+ : 493
50,000+ : 423
5,000+ : 400
10+ : 314
500+ : 288
50,000,000+ : 204
100,000,000+ : 189
50+ : 170
5+ : 70
1+ : 45
500,000,000+ : 24
1,000,000,000+ : 20
0+ : 4
0 : 1


We can see that the install counts are not precise. They are given as open-ended ranges such as 100+, 1,000+, 10,000+ and so on. For our purposes this is fine, since we don't need a high level of precision for installs in our analysis. We will assume that the number shown is approximately how many installs the app has, so an app with 5,000+ in the 'Installs' column is treated as having 5,000 installs. We will calculate the average number of installs for each app category much as we did for the App Store data set, with the added step of stripping the '+' and ',' characters from the install string and converting it to a float:

In [30]:
categories = freq_table(android_final, 1)

for category in categories:
    total = 0
    len_category = 0
    
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            installs = app[5].replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
            total += installs
            len_category += 1
    
    average = int(total / len_category)
    
    print(category, ":", average)
    

ART_AND_DESIGN : 1986335
AUTO_AND_VEHICLES : 647317
BEAUTY : 513151
BOOKS_AND_REFERENCE : 8767811
BUSINESS : 1712290
COMICS : 817657
COMMUNICATION : 38456119
DATING : 854028
EDUCATION : 1833495
ENTERTAINMENT : 11640705
EVENTS : 253542
FINANCE : 1387692
FOOD_AND_DRINK : 1924897
HEALTH_AND_FITNESS : 4188821
HOUSE_AND_HOME : 1331540
LIBRARIES_AND_DEMO : 638503
LIFESTYLE : 1437816
GAME : 15588015
FAMILY : 3695641
MEDICAL : 120550
SOCIAL : 23253652
SHOPPING : 7036877
PHOTOGRAPHY : 17840110
SPORTS : 3638640
TRAVEL_AND_LOCAL : 13984077
TOOLS : 10801391
PERSONALIZATION : 5201482
PRODUCTIVITY : 16787331
PARENTING : 542603
WEATHER : 5074486
VIDEO_PLAYERS : 24727872
NEWS_AND_MAGAZINES : 9549178
MAPS_AND_NAVIGATION : 4056941
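The cleaning step inside the loop above can be isolated into a small helper, which makes it easy to test on its own. This is a minimal sketch. `parse_installs` is a hypothetical name, and the function applies the same two replacements as the cell above.

```python
def parse_installs(raw):
    """Convert an 'Installs' string such as '1,000,000+' to a float."""
    return float(raw.replace('+', '').replace(',', ''))

for raw_value in ['1,000,000+', '5,000+', '0+', '0']:
    print(raw_value, '->', parse_installs(raw_value))
```

A value that is not a number after cleaning would raise a `ValueError`, which is one more reason the malformed row had to be removed earlier.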


Communication apps are the most popular on Google Play, but let's see if there is a similar trend to the App Store where a few super popular apps at the top skew the results upward:

In [36]:
for app in android_final:
    category = app[1]
    name = app[0]
    installs = app[5]
    if category == "COMMUNICATION" and installs == "1,000,000,000+":
        print(name, ':', installs)

WhatsApp Messenger : 1,000,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
Skype - free IM & video calls : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
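To get a feel for how much a few giant apps can move an average, one could recompute it while leaving out apps above some install threshold. The sketch below uses invented app names and install counts. `average_installs` and its `cap` parameter are hypothetical, and the 100,000,000 cutoff is an arbitrary choice for illustration.

```python
def average_installs(rows, cap=None):
    """Average installs over (name, installs) pairs, optionally skipping apps at or above cap."""
    values = [installs for _, installs in rows if cap is None or installs < cap]
    return sum(values) / len(values)

# Hypothetical sample: one giant app and two smaller ones.
sample = [
    ('Giant Messenger', 1_000_000_000),
    ('Mid Chat', 1_000_000),
    ('Small Chat', 10_000),
]
print('All apps:', average_installs(sample))
print('Under 100M installs:', average_installs(sample, cap=100_000_000))
```

Note that the function would fail with a division by zero if every app were excluded by the cap, so a real implementation would need to handle that case.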


The list of 'COMMUNICATION' apps with 1,000,000,000+ installs confirms the skew: a handful of apps such as WhatsApp and Google Chrome pull the category's average installs higher than it would otherwise be. Looking at the average installs per app category, we can see that 'BOOKS_AND_REFERENCE' is fairly high at 8767811 average installs. Since our App Store analysis suggested that reference apps could be a promising niche, it is worth looking into this category on Google Play as well:

In [37]:
for app in android_final:
    category = app[1]
    name = app[0]
    installs = app[5]
    if category == "BOOKS_AND_REFERENCE":
        print(name, ':', installs)

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

We can see that installs are more evenly distributed across this category, with most of the very high install counts coming from apps such as the Bible and the Al-Quran. This is in line with what we saw on the App Store and gives us reason to think that the books and reference category might be a good place for our app to break in. If we combine a book or reference app with some type of game mechanic, the app could plausibly be popular on both Google Play and the App Store.

## Conclusion 

In this project we analyzed data on mobile apps that are downloaded via the App Store and Google Play. Our goal was to determine which type of app could be profitable in both marketplaces. Our validation strategy is to first release a minimal version of the app on Google Play, develop it further if users respond well, and, if it is profitable after six months, release an iOS version on the App Store. Our analysis showed that entertainment apps such as games are very common in both marketplaces, and that reference and book apps are popular in both while still leaving room for a new app to break in. Based on this, we would recommend developing a reference or book app that has some kind of game mechanic, such as an interactive book that tests the user's knowledge of the book with games or quizzes as they read.