# Profitable Apps for the App Store and Google Play Markets

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

# Step 1: Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over four million apps requires a significant amount of time and money, so we'll try to analyze a sample of data instead. To avoid spending resources on collecting new data ourselves, we should first see whether we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our purpose:

- A [data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

To make the data sets easier to explore, we'll first build a function named **explore_data()** that prints rows in a readable way.



In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

The **explore_data()** function:

- Takes in four parameters:
    - **dataset**, which is expected to be a list of lists.
    - **start** and **end**, which are both expected to be integers and represent the starting and the ending indices of a slice from the data set.
    - **rows_and_columns**, which is expected to be a Boolean and has False as a default argument.
- Slices the data set using **dataset[start:end]**.
- Loops through the slice, and for each iteration, prints a row and adds a new line after that row using **print('\n')**.
    - The **\n** in **print('\n')** is a special character and won't be printed. Instead, the **\n** character adds a new line, and we use **print('\n')** to add some blank space between rows.
- Prints the number of rows and columns if **rows_and_columns** is **True**.
    - If **dataset** includes a header row, the reported number of rows is one more than the number of apps.

We'll demonstrate the function below:

In [2]:
from csv import reader #importing the reader function to open up the files

In [3]:
opened_app_store_data = open('AppleStore.csv',encoding='utf8')
read_app_store_data = reader(opened_app_store_data)
app_store = list(read_app_store_data)

opened_google_play_data = open('googleplaystore.csv',encoding='utf8')
read_google_play_data = reader(opened_google_play_data)
google_store = list(read_google_play_data)
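
The cell above leaves both files open. As a minimal sketch of an alternative, the hypothetical helper below (`load_csv` is not part of the original notebook) reads a CSV file into a list of lists and closes the file automatically by using a `with` block:

```python
import csv


def load_csv(path, encoding="utf8"):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    # newline="" is the setting the csv module documentation recommends
    # when passing a file object to csv.reader
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.reader(f))
```

Assuming the two files sit next to the notebook, `load_csv('AppleStore.csv')` and `load_csv('googleplaystore.csv')` should produce the same lists as `app_store` and `google_store`.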

Now we'll use the **explore_data()** function we built to show:
1. The header of each data set, to get an understanding of what data we're working with
2. The number of rows and columns, as a gut check that things look normal

In [4]:
explore_data(app_store,0,2,rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


In [5]:
explore_data(google_store,0,2,rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


For this analysis we want to understand how apps that meet the criteria below perform, so we can advise our team on what we should be building.

### Criteria

* **Pricing:** Free to download and install
* **Revenue:** Source of revenue is in-app ads
* **User Count:** Has a higher number of users
* **Language:** English speaking audience

After a quick look at the headers of each data set, the columns below look like a good place to start.

### App Store
Variable: **app_store**. All values are stored as strings.

* Index 1: **track_name** - *App Name* - the name of the app
* Index 4: **price** - *App Cost* - tells us whether the app is free; free apps have the string '0.0'
* Index 5: **rating_count_tot** - *Total Number of Ratings* - will be used as an approximation for installs
* Index 10: **cont_rating** - *Content Rating* - the app's content (age) rating, such as '4+'
* Index 11: **prime_genre** - *App Category* - the genre/category the app is in

### Google Play Store
Variable: **google_store**. All values are stored as strings.

* Index 0: **App** - *App Name* - the name of the app
* Index 6: **Type** - *App's Cost* - tells us whether the app is free; free apps have the string 'Free'
* Index 5: **Installs** - *Number of Installs* - the install bracket the app falls into, such as '10,000+'
* Index 2: **Rating** - *Rating* - the app's average user rating
* Index 1: **Category** - *App Category* - the genre/category the app is in


## Data Cleaning Tasks
- Remove non-English apps, since this project is for an app maker that targets an English-speaking audience only
- Remove apps that aren't free
- Remove duplicate app entries
- Remove the row for the app "Life Made WI-Fi Touchscreen Photo Frame": according to the Google Play data set's discussion forum, it is missing its category value

In [6]:
google_store_header = google_store[0]
bad_idx = 0
for row in google_store:
    if len(row) != len(google_store_header):
        print(row)
        bad_idx = google_store.index(row)
print(bad_idx)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10473


In [7]:
# Double-check that the index points at the expected app
google_store[bad_idx]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [8]:
#delete the bad row
del google_store[bad_idx]
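
The loop above keeps only the last mismatching index it finds. As a hedged sketch, the hypothetical helper below (`find_malformed_rows` is not in the original notebook) collects every row whose length differs from the header's, using `enumerate` so that no separate `index()` lookup is needed. The sample rows are made up for illustration:

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Toy data (made up for illustration): the third row is missing a value
sample = [
    ["App", "Category", "Rating"],
    ["A", "TOOLS", "4.1"],
    ["B", "1.9"],
]
for idx, row in find_malformed_rows(sample):
    print(idx, row)
```

If several rows were malformed, deleting them from the highest index down would keep the remaining indices valid.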

### Removing Duplicates from both datasets

**Duplicate Entries**

Some apps appear more than once in the Google Play Store data set. The check below finds 1181 duplicate entries, and the duplicated rows differ only in the number of reviews.

In [9]:
# checking for how many duplicate app entries there are in the Google Play Store data
duplicate_apps = []
unique_apps = []

for app in google_store:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [10]:
# Will use Instagram as an example of an app being duplicated

for app in google_store:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


You can see that there are multiple Instagram entries and that the only difference between them is the number of reviews.

We will keep the entry with the most reviews and delete the others, because the entry with the most reviews should be the most recent.

Below we start removing the duplicates by creating a dictionary in which the key is the app name and the value is the highest number of reviews recorded for that app. We then use that dictionary as a reference against the main Google Play Store table to drop the duplicate entries.

In [11]:
reviews_max = {}
for app in google_store[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    
    if name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


Printing the length of **unique_apps** gives 9659 apps (9660 if we count the header row). The length of **reviews_max** is also 9659, so the dictionary holds exactly one entry per unique app, which is what we expect.

In [12]:
android_clean = []
already_added = []

android_clean.append(google_store[0])

for app in google_store[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

In [13]:
print(len(android_clean)) #with headline

9660


We cleaned this data set by comparing the number of reviews of each row in the original **google_store** data set with the **reviews_max** dictionary we built, which holds the highest review count for each app.

When the numbers matched, we made sure we hadn't already added that app by checking whether its name was in a list called *already_added*. This ensured we didn't add an app twice when two of its rows share the same highest review count.

Finally, we added the matching rows to the **android_clean** data set and stored each app's name in the *already_added* list.

We then checked that the length of the new data set matches the number of unique apps and the length of **reviews_max**. It does: 9660 rows including the header, which is 9659 apps.
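The same idea can be sketched in a single pass. The hypothetical helper below (`keep_most_reviewed` is not part of the original notebook) keeps, for each app name, the first row with the highest review count. It expects rows without the header, and its output is ordered by each name's first appearance, which may differ from the order of **android_clean**. The Instagram rows are shortened from the output above; the Box row uses a made-up review count.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Shortened toy rows: name, category, rating, reviews
sample = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Box", "BUSINESS", "4.2", "159872"],  # made-up review count
    ["Instagram", "SOCIAL", "4.5", "66577446"],
]
deduplicated = keep_most_reviewed(sample)
for row in deduplicated:
    print(row)
```

Under these assumptions, `keep_most_reviewed(google_store[1:])` would be expected to return one row per unique app.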

### Removing Non-English Apps from both datasets

Before filtering by language, a quick check on the App Store data set: its ids contain no duplicates, so no deduplication is needed there. The check is shown below.


In [14]:
unique_ids = []
duplicate_ids = []

for app in app_store[1:]:
    ios_id = app[0]
    
    # an id we have already seen counts as a duplicate
    if ios_id not in unique_ids:
        unique_ids.append(ios_id)
    else:
        duplicate_ids.append(ios_id)
        
print('Length of unique ids (no header row)', len(unique_ids))
print('Length of duplicate ids', len(duplicate_ids))
print('Length of app store dataset (without header row)', len(app_store[1:]))
        
        

Length of unique ids (no header row) 7197
Length of duplicate ids 0
Length of app store dataset (without header row) 7197


To check whether an app's name is English, we'll create a function that checks each character of the name.

We'll check each character against the ASCII standard by using the built-in function **ord()**. The characters commonly used in English text fall within the ASCII range of 0 to 127, so if a character's code is outside this range we treat the character, and therefore the app, as not English.
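As a small illustration of what **ord()** returns, the sketch below collects the code points of a few sample characters (chosen here for illustration) and labels each one as inside or outside the ASCII range:

```python
# Map each sample character to its Unicode code point
codes = {char: ord(char) for char in ["a", "Z", "7", "™", "爱", "😜"]}

for char, code_point in codes.items():
    label = "ASCII" if code_point <= 127 else "non-ASCII"
    print(repr(char), code_point, label)
```

On Python 3.7 or later, `str.isascii()` performs the same range check for a whole string at once.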

In [15]:
def is_english_v1(word):
    for char in word:
        if ord(char) > 127:
            print(char,'is not English ',word,'is not an English word')
            return False
    print(word,'is an English word')
    return True

In [16]:
print(is_english_v1('Instagram'))
print(is_english_v1('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_v1('Docs To Go™ Free Office Suite'))
print(is_english_v1('Instachat 😜'))
print(is_english_v1('123'))

Instagram is an English word
True
爱 is not English  爱奇艺PPS -《欢乐颂2》电视剧热播 is not an English word
False
™ is not English  Docs To Go™ Free Office Suite is not an English word
False
😜 is not English  Instachat 😜 is not an English word
False
123 is an English word
True


Unfortunately, our filter function also flags emojis and the trademark symbol as not English. So we'll rewrite the function to tolerate up to two characters that fall outside the ASCII range we specified, and to reject a name only when it reaches a third.

In [17]:
def is_english_v2(word):
    counter = 0
    for char in word:
        if ord(char) > 127:
            counter += 1
            if counter >= 3:
                print('Three or more characters were flagged as not within ASCII range',word,'is not an English word/phrase')
                return False
    print(word,'is an English word')
    return True

In [18]:
print(is_english_v2('Instagram'))
print(is_english_v2('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_v2('Docs To Go™ Free Office Suite'))
print(is_english_v2('Instachat 😜'))
print(is_english_v2('123'))

Instagram is an English word
True
Three or more characters were flagged as not within ASCII range 爱奇艺PPS -《欢乐颂2》电视剧热播 is not an English word/phrase
False
Docs To Go™ Free Office Suite is an English word
True
Instachat 😜 is an English word
True
123 is an English word
True


Now that version two of our filter function works, we'll recreate it without the print statements so we can use it to filter the iOS and Android data sets.

In [19]:
def is_english(word):
    counter = 0
    for char in word:
        if ord(char) > 127:
            counter += 1
            if counter >= 3:
                return False
    return True
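
The cutoff is hard-coded in the function above. As a sketch, the hypothetical variant below (`is_probably_english` and `max_non_ascii` are illustrative names, not part of the original notebook) exposes it as a parameter; with the default of 2 it should make the same decisions as **is_english()**:

```python
def is_probably_english(name, max_non_ascii=2):
    """Return True if `name` has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_probably_english('Instachat 😜', max_non_ascii=0))
```

Making the threshold explicit makes it easier to see how sensitive the row counts are to this choice.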

In [20]:
eng_only_app_store = []
eng_only_google_store = []

In [21]:
# Filtering the Apple Store dataset
for app in app_store:  # the header row passes the check, so it is kept
    name = app[1]
    if is_english(name):
        eng_only_app_store.append(app)

explore_data(eng_only_app_store,0,2,rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 6156
Number of columns: 16


In [22]:
# Filtering the Google Play Store dataset
for app in android_clean:  # the header row passes the check, so it is kept
    name = app[0]
    if is_english(name):
        eng_only_google_store.append(app)

explore_data(eng_only_google_store,0,2,rows_and_columns=True)
        

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 9598
Number of columns: 13


Filtering out non-English apps took the **Apple App Store** data set down from *7198* rows to *6156*, so ***1042*** apps were flagged as non-English.

Filtering out non-English apps took the **Google Play Store** data set down from *9660* rows to *9598*, so ***62*** apps were flagged as non-English.

### Removing Paid Apps from both datasets

Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

Isolating the free apps will be our last step in the data cleaning process.

In [23]:
def filter_row(dataset, column_idx, data_point, header = True):
    cleaned_dataset = []
    rows = dataset
    if header:
        cleaned_dataset.append(dataset[0]) # give the new dataset the header from the dataset that's being transformed
        rows = dataset[1:] # skip the header so it is never compared against data_point
    
    # keep only the rows whose value in column_idx equals data_point
    for row in rows:
        column_item = row[column_idx]
        if column_item == data_point:
            cleaned_dataset.append(row)
                
    return cleaned_dataset

In [24]:
cleaned_ios_store = filter_row(eng_only_app_store,4,'0.0')

In [25]:
cleaned_android_store = filter_row(eng_only_google_store,6,'Free')
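
The two calls above compare exact strings ('0.0' and 'Free'). That works for the values seen in these files, but it would miss a price written as '0' or '0.00'. As a hedged sketch, the hypothetical helper below (`is_free` is not part of the original notebook) compares prices numerically instead:

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' is zero."""
    cleaned = price_text.strip().lstrip("$")
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number (for example a header cell such as 'price')
        return False


for text in ["0.0", "0", "$4.99", "price"]:
    print(repr(text), is_free(text))
```

Whether this is needed depends on how consistently prices are formatted in the source files; the exact-match filter is simpler when the format is known.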

In [26]:
explore_data(cleaned_ios_store,0,5,rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 3204
Number of columns: 16


In [27]:
explore_data(cleaned_android_store,0,5,rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 8848
Number of columns: 13


Now let's make a checker function to ensure it worked.

In [28]:
# checker function

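def row_filter_checker(dataset, column_idx, data_point, header = True):
    # counts the rows whose value in column_idx differs from data_point
    counter = 0
    rows = dataset[1:] if header else dataset # skip the header row when there is one
    for row in rows:
        column_item = row[column_idx]
        if column_item != data_point:
            counter += 1
    return counter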

In [29]:
# checking to see if there are paid apps in the iOS dataset
print('We found',row_filter_checker(cleaned_ios_store,4,'0.0'),'paid apps in the iOS dataset.')

We found 0 paid apps in the iOS dataset.


In [30]:
# checking to see if there are paid apps in the Android dataset
print('We found',row_filter_checker(cleaned_android_store,6,'Free'),'paid apps in the Android dataset.')

We found 0 paid apps in the Android dataset.


Filtering out paid apps took the **Apple App Store** data set down from *6156* rows to *3204*, a reduction of ***2952*** paid apps.

Filtering out paid apps took the **Google Play Store** data set down from *9598* rows to *8848*, a reduction of ***750*** paid apps.

# Step 2: Analyzing the Data

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

## Building Frequency Tables

Let's begin the analysis by getting a sense of what are the most common genres for each market. 

**Step 1:** Inspect both data sets and identify the columns you could use to generate frequency tables to find out what are the most common genres in each market.

**iOS:**
- Index 11: "prime_genre" - app's type

**Android:**
- Index 1: "Category" - the app's type - ex: Art & Design
- Index 9: "Genres" - more specific app type - ex: Art & Design; Creativity

**Step 2:** We'll build two functions we can use to analyze the frequency tables:

1. One function to generate frequency tables that show percentages
2. Another function we can use to display the percentages in a descending order

The **freq_table()** function you see below:

- Takes in two parameters: dataset and index. dataset is expected to be a list of lists with a header row, and index is expected to be an integer.
- Creates an empty dictionary whose keys will be the values found in the chosen column.
- Counts how many times each key comes up.
- Converts each count into a percentage of the total number of rows.
- Returns a dictionary in which each key is a value from the column and each value is its frequency as a percentage.

The **display_table()** function you see below:

- Takes in three parameters: dataset, index, and need_idx. When need_idx is True (the default), dataset is expected to be a list of lists and index an integer. When need_idx is False, dataset is expected to be a dictionary that is already a table.
- Generates a frequency table using the freq_table() function when need_idx is True.
- Transforms the table into a list of tuples, then sorts the list in descending order.
- Prints the entries of the table in descending order.

In [31]:
def freq_table(dataset, index):
    f_table = {}
    total = 0
    
    for row in dataset[1:]:
        total += 1
        column = row[index]
        if column not in f_table:
            f_table[column] = 1
        else:
            f_table[column] += 1
            
    table_percentages = {}
    for key in f_table:
        percentage = (f_table[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages

In [43]:
def display_table(dataset, index=0, need_idx = True):
    # build the frequency table from rows, or use dataset as-is
    # when it is already a dictionary
    if need_idx:
        table = freq_table(dataset, index)
    else:
        table = dataset

    # (value, key) tuples sort by value first
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
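
For comparison, the counting step can also be written with `collections.Counter` from the standard library. This is only a sketch of an alternative, not the notebook's method; `freq_table_counter` is a hypothetical name and the sample rows are made up:

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Percentage frequency table for one column; assumes dataset[0] is a header."""
    counts = Counter(row[index] for row in dataset[1:])
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


# Toy data (made up for illustration)
sample = [
    ["name", "genre"],
    ["a", "Games"],
    ["b", "Games"],
    ["c", "Music"],
    ["d", "Games"],
]
table = freq_table_counter(sample, 1)
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ":", pct)
```

Sorting by value with a `key` function avoids building (value, key) tuples, although entries with equal percentages may then appear in a different order than in **display_table()**.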

### Ordered Frequency Table for iOS Genres

In [33]:
display_table(cleaned_ios_store, 11)

Games : 58.25788323446769
Entertainment : 7.836403371838902
Photo & Video : 4.995316890415236
Education : 3.6840462066812365
Social Networking : 3.3093974399000934
Shopping : 2.5913206369029034
Utilities : 2.466437714642523
Sports : 2.1542304089915705
Music : 2.0605682172962845
Health & Fitness : 2.0293474867311896
Productivity : 1.7483609116453322
Lifestyle : 1.5610365282547611
News : 1.3424914142990947
Travel : 1.248829222603809
Finance : 1.0927255697783327
Weather : 0.8741804558226661
Food & Drink : 0.8117389946924758
Reference : 0.5307524196066188
Business : 0.5307524196066188
Book : 0.3746487667811427
Navigation : 0.18732438339057134
Medical : 0.18732438339057134
Catalogs : 0.1248829222603809


#### Analysis of iOS Genres
1. What is the most common genre? What is the runner-up?
    - Most common genre is Games
    - Second most common genre is Entertainment
2. What other patterns do you see?
    - Four of the top five genres (Games, Entertainment, Photo & Video, and Social Networking) are about passing the time.
3. What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?
    - Entertainment
4. Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?
    - Not from this table alone. A higher number of apps within a genre could mean a higher total number of users, but to know whether this translates into dollars we'd need to know the revenue driven, the engagement rate, or some kind of proxy for the value of the app. Entertainment apps have a lower barrier to entry and require less commitment than utility apps, which are usually tied to a problem that needs to be solved and could therefore have a higher engagement or revenue rate.

### Ordered Frequency Table for Android Category

In [34]:
display_table(cleaned_android_store, 1)

FAMILY : 18.932971628800725
GAME : 9.698202780603594
TOOLS : 8.45484344975698
BUSINESS : 4.600429524132474
PRODUCTIVITY : 3.8996269922007465
LIFESTYLE : 3.888323725556686
FINANCE : 3.7074714592517237
MEDICAL : 3.537922459590822
SPORTS : 3.39097999321804
PERSONALIZATION : 3.3231603933536795
COMMUNICATION : 3.2327342602011986
HEALTH_AND_FITNESS : 3.0857917938284163
PHOTOGRAPHY : 2.950152594099695
NEWS_AND_MAGAZINES : 2.803210127726913
SOCIAL : 2.6675709279981916
TRAVEL_AND_LOCAL : 2.3397761953204474
SHOPPING : 2.2493500621679665
BOOKS_AND_REFERENCE : 2.136317395727365
DATING : 1.8650389962699219
VIDEO_PLAYERS : 1.797219396405561
MAPS_AND_NAVIGATION : 1.3903017972193965
FOOD_AND_DRINK : 1.2433593308466147
EDUCATION : 1.1642364643381937
ENTERTAINMENT : 0.9607776647451114
LIBRARIES_AND_DEMO : 0.938171131456991
AUTO_AND_VEHICLES : 0.9268678648129309
HOUSE_AND_HOME : 0.8025319317282694
WEATHER : 0.7912286650842093
EVENTS : 0.7121057985757884
PARENTING : 0.6555894653554878
ART_AND_DESIGN : 0.6

### Ordered Frequency Table for Android Genre

In [35]:
display_table(cleaned_android_store, 9)

Tools : 8.44354018311292
Entertainment : 6.081157454504352
Education : 5.357748389284503
Business : 4.600429524132474
Productivity : 3.8996269922007465
Lifestyle : 3.8770204589126256
Finance : 3.7074714592517237
Medical : 3.537922459590822
Sports : 3.458799593082401
Personalization : 3.3231603933536795
Communication : 3.2327342602011986
Action : 3.0970950604724763
Health & Fitness : 3.0857917938284163
Photography : 2.950152594099695
News & Magazines : 2.803210127726913
Social : 2.6675709279981916
Travel & Local : 2.3284729286763874
Shopping : 2.2493500621679665
Books & Reference : 2.136317395727365
Simulation : 2.045891262574884
Dating : 1.8650389962699219
Arcade : 1.8424324629818016
Video Players & Editors : 1.7746128631174407
Casual : 1.763309596473381
Maps & Navigation : 1.3903017972193965
Food & Drink : 1.2433593308466147
Puzzle : 1.1303266644060133
Racing : 0.9946874646772917
Role Playing : 0.938171131456991
Libraries & Demo : 0.938171131456991
Auto & Vehicles : 0.9268678648129309

#### Analysis of Android Genres & Category columns

1. What are the most common genres?
    - Family for Category
    - Tools for Genre
2. What other patterns do you see?
    - There's no Family genre; instead, it's probably broken up into more specific genres.
    - Entertainment is high on Genre, but low on Category.
    - Category is very broad and seems to be an aggregation of what's in Genre, but it's not clear how they map to each other.
3. Compare the patterns you see for the Google Play market with those you saw for the App Store market.
    - Games or Entertainment appears in the top two of every table.
    - Android has more functional apps among its top rankings, while iOS has more entertainment-oriented apps at its top.
4. Can you recommend an app profile based on what you found so far? Do the frequency tables you generated reveal the most frequent app genres or what genres have the most users?
    - The frequency tables do help me understand the genres, but I don't think we have enough for our recommendation. This is a very broad analysis.

## Understanding Most Popular Genres
One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. 

For the **Google Play data set**, we can find this information in the **Installs** column, but this information is missing for the App Store data set. 

As a workaround, we'll take the *total number of user ratings* as a ***proxy***, which we can find in the **rating_count_tot** column.

### iOS Store: Average # of User Ratings per Genre
1. Isolate the apps of each genre.
2. Sum up the user ratings for the apps of that genre.
3. Divide the sum by the number of apps belonging to that genre (not by the total number of apps).

In [36]:
# 1. Frequency table for iOS app genre
ios_genre_f_table = freq_table(cleaned_ios_store, 11)

In [41]:
# 2 - 3: sum the number of ratings per genre and compute the average
ios_popular_app = {}

for genre in ios_genre_f_table:
    total = 0
    len_genre = 0
    
    for row in cleaned_ios_store[1:]:
        genre_app = row[11]
        if genre_app == genre:
            n_ratings = float(row[5]) # rating_count_tot
            total += n_ratings
            len_genre += 1
    
    average_n_ratings = total / len_genre
    
    ios_popular_app[genre] = average_n_ratings
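
The nested loops above scan the whole data set once per genre. As a sketch of an alternative, the hypothetical helper below (`average_by_group` is not part of the original notebook) follows the same three steps but accumulates the sums and counts in a single pass. The sample rows are made up:

```python
def average_by_group(dataset, group_idx, value_idx):
    """Average of a numeric column per group, in one pass; dataset[0] is a header."""
    totals = {}  # group -> sum of values
    counts = {}  # group -> number of rows
    for row in dataset[1:]:
        group = row[group_idx]
        totals[group] = totals.get(group, 0.0) + float(row[value_idx])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy data (made up for illustration)
sample = [
    ["genre", "rating_count"],
    ["Games", "100"],
    ["Games", "300"],
    ["Music", "50"],
]
averages = average_by_group(sample, 0, 1)
print(averages)
```

Under these assumptions, `average_by_group(cleaned_ios_store, 11, 5)` would be expected to match **ios_popular_app**.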

In [45]:
# Analyze
display_table(ios_popular_app,need_idx = False)

Navigation : 86090.33333333333
Reference : 79350.4705882353
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 46384.916666666664
Food & Drink : 33333.92307692308
Finance : 32367.02857142857
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 27230.734939759037
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22886.36709539121
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 19156.493670886077
Lifestyle : 16815.48
Entertainment : 14195.358565737051
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


Navigation has the highest average number of user ratings and Reference has the second highest. Navigation makes sense since it's quite useful and needed in modern society, but Reference was a bit confusing until I looked it up.

Reference covers dictionaries, thesauruses, and translation apps: anything someone uses to look up information about language, be it via text, audio or video.

### Google Play Store: Average # of Installs per Category
The challenge here is that the install counts are strings containing characters such as '+' and ',', which makes them hard to work with numerically.

We'll need to clean each string and convert it into a float.
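As a minimal sketch of that cleaning step, the hypothetical helper below (`parse_installs` is an illustrative name) strips the commas and the plus sign before converting. The sample strings are taken from the **Installs** values shown earlier:

```python
def parse_installs(text):
    """Convert an install value such as '1,000,000+' to the float 1000000.0."""
    return float(text.replace(",", "").replace("+", ""))


for raw in ["10,000+", "5,000,000+", "1,000,000,000+"]:
    print(raw, "->", parse_installs(raw))
```

Note that a value such as '10,000+' means "at least 10,000", so the converted number is a lower bound rather than an exact install count.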

In [51]:
android_category_freq_table = freq_table(cleaned_android_store, 1)

display_table(android_category_freq_table,need_idx = False)

FAMILY : 18.932971628800725
GAME : 9.698202780603594
TOOLS : 8.45484344975698
BUSINESS : 4.600429524132474
PRODUCTIVITY : 3.8996269922007465
LIFESTYLE : 3.888323725556686
FINANCE : 3.7074714592517237
MEDICAL : 3.537922459590822
SPORTS : 3.39097999321804
PERSONALIZATION : 3.3231603933536795
COMMUNICATION : 3.2327342602011986
HEALTH_AND_FITNESS : 3.0857917938284163
PHOTOGRAPHY : 2.950152594099695
NEWS_AND_MAGAZINES : 2.803210127726913
SOCIAL : 2.6675709279981916
TRAVEL_AND_LOCAL : 2.3397761953204474
SHOPPING : 2.2493500621679665
BOOKS_AND_REFERENCE : 2.136317395727365
DATING : 1.8650389962699219
VIDEO_PLAYERS : 1.797219396405561
MAPS_AND_NAVIGATION : 1.3903017972193965
FOOD_AND_DRINK : 1.2433593308466147
EDUCATION : 1.1642364643381937
ENTERTAINMENT : 0.9607776647451114
LIBRARIES_AND_DEMO : 0.938171131456991
AUTO_AND_VEHICLES : 0.9268678648129309
HOUSE_AND_HOME : 0.8025319317282694
WEATHER : 0.7912286650842093
EVENTS : 0.7121057985757884
PARENTING : 0.6555894653554878
ART_AND_DESIGN : 0.6

In [60]:
explore_data(cleaned_android_store,0,4)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']




In [64]:
android_popular_app_table = {}

for category in android_category_freq_table:
    total = 0
    len_category = 0
    
    for row in cleaned_android_store[1:]:
        category_app = row[1]
        if category_app == category:
            num_installs = row[5]
            num_installs = num_installs.replace('+','')
            num_installs = num_installs.replace(',','')
            num_installs = float(num_installs)
            total += num_installs
            len_category += 1
    
    average_num_installs = total / len_category
    
    android_popular_app_table[category] = average_num_installs    
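
The nested loop above rescans the whole data set once per category. As an alternative sketch (not the method used in this notebook), the same averages can be computed in a single pass by accumulating a running total and a count per category. The function name `average_installs_by_category` and the toy rows are hypothetical; the column indices (1 for `Category`, 5 for `Installs`) follow the header row printed earlier.

```python
def average_installs_by_category(rows, category_idx=1, installs_idx=5):
    """Return {category: mean installs} using one pass over the rows."""
    totals = {}
    counts = {}
    for row in rows:
        category = row[category_idx]
        # Strip '+' and ',' so that '1,000+' becomes 1000.0.
        installs = float(row[installs_idx].replace('+', '').replace(',', ''))
        totals[category] = totals.get(category, 0.0) + installs
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}


# Toy rows (made up for illustration) laid out like the Google Play data.
sample_rows = [
    ['App A', 'TOOLS', '4.1', '10', '1M', '1,000+'],
    ['App B', 'TOOLS', '4.5', '20', '2M', '3,000+'],
    ['App C', 'SOCIAL', '4.0', '30', '3M', '500+'],
]
sample_averages = average_installs_by_category(sample_rows)
print(sample_averages)
```

On the real data this would be called on the rows without the header, for example `cleaned_android_store[1:]`.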

In [66]:
display_table(android_popular_app_table,need_idx = False)

COMMUNICATION : 38590581.08741259
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15544014.51048951
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10830251.970588235
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8814199.78835979
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5145550.285714285
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4049274.6341463416
FAMILY : 3697848.1731343283
SPORTS : 3650602.276666667
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1446158.2238372094
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1360598.042253521
DATING : 854028.8303030303
COMICS : 832613.8888888889
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 513151.886

### Analysis
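- The top three iOS genres, by average number of user ratings, are (1) Navigation, (2) Reference, and (3) Social Networking.
- The top three Android categories, by average number of installs, are (1) Communication, (2) Video Players, and (3) Social.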
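
Rather than reading a ranking off the printed table by eye, one option is to sort the dictionary by value. This is only a sketch: the helper name `top_n` is hypothetical, and the small dictionary holds a few values copied (and rounded) from the iOS output above.

```python
def top_n(table, n=3):
    """Return the n (key, value) pairs with the largest values, highest first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)[:n]


# A few rounded values from the iOS output, for illustration only.
ios_sample = {
    'Music': 57326.53,
    'Reference': 79350.47,
    'Navigation': 86090.33,
    'Social Networking': 71548.35,
}
ios_top_three = top_n(ios_sample, 3)
print(ios_top_three)
```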

The Reference genre covers things like books, dictionaries, and religious texts: essentially any written material that people consult to retrieve information.

The only genre that appears in the top three on both platforms is social. So why not build a community app around a specific book genre, or around books in general? That way, people can socialize around the information they're looking to consume and retrieve.

## Conclusions
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that an app combining reference and social features could be profitable for both the Google Play and App Store markets. The markets are already full of libraries, so we need to add some special features beyond the raw version of the book.