# Analysis of the Most Popular Free Apps

This project demonstrates how one can use Python and publicly available mobile application data to draw key insights from the characteristics of an application.

Specifically, the goal of this project is to understand the key drivers of popularity among free apps on the Google Play and iOS App Stores, to inform our strategic aim of increasing user traffic to our own suite of Android and iOS applications.

# Step 1: Exploratory Data Analysis (EDA)

### Open iOS and Google Play App Data for Analysis

In [1]:
from csv import reader
opened_file_1 = open('AppleStore.csv')
read_file_1 = reader(opened_file_1)
ios = list(read_file_1)
ios_header = ios[0]
ios_data = ios[1:]

opened_file_2 = open('googleplaystore.csv')
read_file_2 = reader(opened_file_2)
gp = list(read_file_2)
gp_header = gp[0]
gp_data = gp[1:]
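
The loading cell above leaves both files open and relies on the platform's default text encoding. As a minimal sketch, assuming the CSV files are UTF-8 encoded, the same header/data split could be wrapped in a small helper. The name `load_csv` and the temporary stand-in file are illustrative and not part of the original notebook:

```python
import csv
import os
import tempfile


def load_csv(path):
    """Return (header, rows) for a CSV file, closing the file when done."""
    # encoding='utf8' is an assumption about how the files were saved.
    with open(path, encoding='utf8', newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Tiny stand-in file so the sketch runs without the real datasets.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write('App,Category\nInstagram,SOCIAL\nSlack,BUSINESS\n')
    sample_path = tmp.name

sample_header, sample_data = load_csv(sample_path)
os.remove(sample_path)

print(sample_header)
print(sample_data)
```

With the real files, the call would look like `ios_header, ios_data = load_csv('AppleStore.csv')`.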

### Define the 'explore_data' function for easily slicing and printing subsets of the data

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### Print the first few rows of each dataset using the explore_data() function

In [3]:
print('iOS App Data Preview:' + '\n')
explore_data(ios_data, 1, 5)

print('Android App Data Preview:' + '\n')
explore_data(gp_data, 1, 5)

iOS App Data Preview:

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Android App Data Preview:

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch 

### Print first few rows PLUS count of rows and columns for each dataset -- exclude header row 


In [4]:
print('iOS App Data Preview:' + '\n')
explore_data(ios_data[1:], 1, 5, rows_and_columns=True)

print('------------------------')
print('\n')

print('Android App Data Preview:' + '\n')
explore_data(gp_data[1:], 1, 5, rows_and_columns=True)
print('------------------------')

iOS App Data Preview:

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1']


Number of rows: 7196
Number of columns: 16
------------------------


Android App Data Preview:

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & De

### Print the column names and identify which columns can be most helpful for our analysis:

In [5]:
print('iOS Column Names:')
print(ios_header)
print('\n')
print('Google Play Column names:')
print(gp_header)

iOS Column Names:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Google Play Column names:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Based on the column names above, and in the context of this project's goal, these appear to be the most useful columns to analyze for each dataset:

***iOS:***

| column name | description |
| --- | --- |
| 'price' | total price per app |
| 'user_rating' | avg rating |
| 'rating_count_tot' | number of total reviews per app |
| 'prime_genre' | app category type |


***Google Play:***

| column name | description |
| --- | ---|
| 'Price' | total price per app |
| 'Rating' | avg rating |
| 'Reviews' | number of total reviews |
| 'Category' | app category type |

# Step 2: Data Cleaning

### Now, we will look into a report that row 10472 of the Google Play dataset is missing its 'Rating' value
This is indicated on Kaggle, in [the discussion section for this dataset](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015).


In [6]:
print(gp_data[10472:10473])

[['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']]


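After inspecting the row, we can see that the 'Category' value is missing for row 10,472: the row has 12 values instead of 13, so every value after the app name is shifted one column to the left (for example, '1.9' sits in the 'Category' position). That being the case, we will delete this row to clean up the dataset: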

In [7]:
print(gp_data[10472])
del gp_data[10472]
print(gp_data[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
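
Because `del gp_data[10472]` removes whatever row happens to be at that position, running the cell a second time would delete a valid app. A hedged alternative is to locate malformed rows by comparing each row's length with the header's. The three-column rows below are made-up stand-ins for the real dataset:

```python
# Made-up, shortened rows; the second one is missing its Category value.
header = ['App', 'Category', 'Rating']
rows = [
    ['Instagram', 'SOCIAL', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
    ['Slack', 'BUSINESS', '4.4'],
]

# Indices of rows whose number of values differs from the header's.
bad_indices = [i for i, row in enumerate(rows) if len(row) != len(header)]
print('Malformed row indices:', bad_indices)

# Delete from the end so that earlier indices stay valid.
for i in reversed(bad_indices):
    del rows[i]

print('Rows remaining:', len(rows))
```

Running this twice is harmless, since the second pass finds no malformed rows.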


### Next we will check to see if there are duplicate entries for any apps in the datasets:

In [8]:
duplicate_apps = []
unique_apps = []

for app in gp_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:15])



Number of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
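
The membership test `name in unique_apps` scans a list on every iteration, which becomes slow as the list grows. As a sketch of an equivalent count, `collections.Counter` can tally names in a single pass. The sample names below are made up for illustration:

```python
from collections import Counter

# Made-up sample of app names; in the notebook these would be row[0] values.
sample_names = ['Box', 'Slack', 'Box', 'Zenefits', 'Box', 'Slack']

name_counts = Counter(sample_names)

# Every occurrence beyond the first counts as one duplicate,
# which matches how duplicate_apps is filled in the loop above.
duplicate_count = sum(count - 1 for count in name_counts.values())
duplicated_names = [name for name, count in name_counts.items() if count > 1]

print('Number of duplicate apps:', duplicate_count)
print('Names with duplicates:', duplicated_names)
```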


### We can see that there are quite a few duplicated app entries. To better understand the nature of the duplication, we are going to look at a specific duplicated app example:

In [9]:
for app in gp_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


It looks like the variation between these duplicates in the Google Play dataset lies at index 3, which represents 'Reviews' (i.e. the total number of reviews). The four Instagram rows show three different review counts (66,509,917, 66,577,313 and 66,577,446), which suggests the rows were collected at different times and that the higher counts are the more recent ones.

We could randomly remove some of the duplicates, but considering the pattern described above, it feels like there is a better approach. We will, instead, only keep the row with the highest number of reviews and then remove all other duplicate entries for each app.

### Removing the Duplicates:

Before writing the code to remove the duplicates, using the approach described above, I want to first determine what the expected number of records will be after removing the duplicate records. 

Based on calculations above, I know that there are 1,181 duplicate app records in the Google Play dataset.

In [10]:
print('Expected Length:', len(gp_data) - len(duplicate_apps))

Expected Length: 9659


#### To remove the duplicates, we will:
- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
- Use the information stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

First, let's create the dictionary, which will serve as our source of truth for the highest number of reviews received per app.

In [11]:
reviews_max = {}

for app in gp_data:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    
    if name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))        

9659


With a quick length check, we can see that the dictionary, `reviews_max`, contains one key for each unique app in the dataset, mapped to the highest number of reviews recorded for that app. Its length, 9,659, matches the expected length calculated above.

Next, we will use this new dictionary to create a new, deduped Google Play dataset.

To do so, we create two lists. The first, `android_clean`, captures the full row for each app that matches the key/value pairs we have in the dictionary created above, `reviews_max`. This is the new dataset.

The second list, `already_added`, keeps track of which apps have already been added to `android_clean` as we loop through the original dataset. Before adding a row to `android_clean`, we check that the app name is not present in `already_added`. Without this check, an app whose duplicate rows share the same highest review count would be added more than once.

In [12]:
android_clean = []
already_added = [] # Apps that have been added to Android Clean already

for app in gp_data:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

Next, let's briefly explore the new dataset to confirm the number of rows equals 9,659.

In [13]:
explore_data(android_clean, 0, 3, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


Awesome! We have 9659 rows, just as expected.
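
The two-pass approach above can also be sketched as a single pass that stores the best row per app in a dictionary. This is an alternative under stated assumptions, not the notebook's method: the helper name `dedupe_by_max_reviews` is hypothetical, the rows are the Instagram entries truncated to their first four columns, and the output order follows each app's first appearance rather than the position of its highest-review row.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when several rows tie for the maximum.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Instagram rows from the output above, truncated to four columns.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

deduped = dedupe_by_max_reviews(sample_rows)
print(deduped)
```

A dictionary lookup also avoids scanning the `already_added` list on every iteration.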

## Removing Non-English Apps

After working with and exploring this data for a while, it becomes apparent that there are multiple languages represented across both datasets. This is clear because the names of some apps are in languages other than English.

The following code reveals a couple examples of this from each dataset:

In [14]:
print('iOS Non-English App Examples:')
print(ios_data[813][1])
print(ios_data[6731][1])
print('\n')
print('Android Non-English App Examples:')
print(android_clean[4412][0])
print(android_clean[7940][0])


iOS Non-English App Examples:
爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


Android Non-English App Examples:
中国語 AQリスニング
لعبة تقدر تربح DZ


For this analysis, our organization is only interested in analyzing apps written in English. We currently use English only for the apps we develop, so analyzing that subset will be most similar and relevant to our business.

That being the case, we will remove all Non-English apps from the datasets. 

To do so, we will check each character of an app's name and look up the integer that represents it (its Unicode code point). Characters commonly used in English text fall within the ASCII range of 0 - 127, as described in the [Wikipedia article on ASCII](https://en.wikipedia.org/wiki/ASCII).

The `ord()` built-in function allows us to easily check this, as this example demonstrates:

In [15]:
print(ord('a'))
print(ord('A'))
print(ord('5'))
print(ord('+'))


97
65
53
43


Because strings are indexable and iterable, we can loop through each name character by character to find each character's Unicode value, as this demonstrates:

In [16]:
for app in android_clean[:5]:
    for character in app[0]:
        print('character: ' + str(character) + ' =')
        print(ord(character))
        print('\n')

character: P =
80


character: h =
104


character: o =
111


character: t =
116


character: o =
111


character:   =
32


character: E =
69


character: d =
100


character: i =
105


character: t =
116


character: o =
111


character: r =
114


character:   =
32


character: & =
38


character:   =
32


character: C =
67


character: a =
97


character: n =
110


character: d =
100


character: y =
121


character:   =
32


character: C =
67


character: a =
97


character: m =
109


character: e =
101


character: r =
114


character: a =
97


character:   =
32


character: & =
38


character:   =
32


character: G =
71


character: r =
114


character: i =
105


character: d =
100


character:   =
32


character: & =
38


character:   =
32


character: S =
83


character: c =
99


character: r =
114


character: a =
97


character: p =
112


character: B =
66


character: o =
111


character: o =
111


character: k =
107


character: U =
85


character:   =
32


character: L =
76

To make our code more reusable and flexible, we will write a function to take the approach that we described above for determining if the unicode values represent English characters or not.

We will start by keeping it simple by checking if a single string is composed of English values. 

If the loop finishes without the return statement being executed, then it means that none of the characters from the string had a corresponding value over 127, so the function returns `True`, indicating it is most likely an English app. If the return statement _is_ executed, the function will return `False`, indicating that app name is probably non-English. 

In [17]:
def eng_char_check(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True  # reached only when no character is outside the ASCII range

To confirm our new function is working as expected, we will run a series of string tests:

In [18]:
print('Test 1:', eng_char_check('Hello'))

Test 1: True


In [19]:
print('Test 2:', eng_char_check('你好'))

Test 2: False


In [20]:
print('Test 3:', eng_char_check('Docs To Go™ Free Office Suite'))

Test 3: False


In [21]:
print('Test 4:', eng_char_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))

Test 4: False


In [22]:
print('Test 5:', eng_char_check('Instachat 😜'))

Test 5: False


Looking at our tests above, it appears that our function isn't quite robust enough to handle certain characters that may be present in the titles of English apps. Specifically, tests 3 and 5 show that the trademark character and emoji have Unicode values above 127 and therefore return `False`, even though the names of those apps are clearly in English.

Given this, we need to find a way to minimize the loss of useful data that would occur if our function were left in its current state.

One way to do this is to return `False` only if the name contains more than a set number of characters outside the ASCII range. We will set the threshold at more than 3 such characters, which requires editing the function we just created so that it counts how many characters fall outside the ASCII range.

In [23]:
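def eng_char_check(string):
    noneng_char_count = 0  # number of characters outside the ASCII range
    for character in string:
        if ord(character) > 127:
            noneng_char_count += 1
    # Tolerate up to 3 non-ASCII characters (e.g. emoji or trademark symbols)
    if noneng_char_count > 3:
        return False
    else:
        return True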

Like above, we will run a series of tests to see if we have sufficiently accounted for the trademark and Emoji edge cases:

In [24]:
print('Test 1:', eng_char_check('Docs To Go™ Free Office Suite'))

Test 1: True


In [25]:
print('Test 2:', eng_char_check('Instachat 😜'))

Test 2: True


In [26]:
print('Test 3:', eng_char_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))

Test 3: False


Excellent! Our function appears to be correctly identifying English from non-English strings now, even with edge cases like non-text symbols factored in.
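
One caveat of a fixed threshold is that very short non-English names can still pass, because they contain three or fewer non-ASCII characters. The sketch below restates the same counting logic with the threshold as a parameter to make that visible. The names `count_non_ascii` and `is_probably_english` are illustrative and not part of the notebook:

```python
def count_non_ascii(text):
    """Count the characters whose code point lies outside the ASCII range."""
    return sum(1 for character in text if ord(character) > 127)


def is_probably_english(text, max_non_ascii=3):
    """Same rule as the threshold check above, with a configurable limit."""
    return count_non_ascii(text) <= max_non_ascii


for name in ['Docs To Go™ Free Office Suite', 'Instachat 😜', '你好']:
    print(name, '|', count_non_ascii(name), '|', is_probably_english(name))
```

The last name, '你好', has only two non-ASCII characters, so it is treated as English. The filter is therefore a heuristic, and a few non-English apps may remain in the cleaned datasets.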

Using this new function, we will now create new datasets for Android and iOS that consist solely of English apps.

In [27]:
android_eng_apps = []
ios_eng_apps = []

for app in android_clean:
    if eng_char_check(app[0]):
        android_eng_apps.append(app)

for app in ios_data:
    if eng_char_check(app[1]):
        ios_eng_apps.append(app)
        
print("Total Android Apps in English:", len(android_eng_apps))
print('\n')
print("Total iOS Apps in English:", len(ios_eng_apps))
print('\n')

print(android_eng_apps[0:3])
print(ios_eng_apps[0:3])

Total Android Apps in English: 9614


Total iOS Apps in English: 6183


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', 

## Isolating the Free Apps

Now that we have our new, cleaned up datasets consisting of English-only apps from the Google Play and iOS app stores, we would like to further trim down these datasets so that they only consist of free applications.

Let's do a quick examination of the column headers for each dataset to recall which indices represent "price".

In [28]:
print(gp_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [29]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Referencing the examples above, we can see that the price column for Android is at index 7, and the price column for iOS is at index 4.

Looking a little closer at the Android data columns, we can see that index 6, 'Type', holds the string 'Free' for the first few examples we looked at, so 'Type' may simply indicate whether an app is free or paid. Let's do a quick exploration of that column to clarify:

In [30]:
for app in android_eng_apps:
    print(app[6])

Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free
Free


Yes, as suspected, it appears that there are two categories under Type: 'Free' or 'Paid'.
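
Printing every row makes the distinct values hard to spot. As a sketch, tallying the column with `collections.Counter` shows each value once, together with how often it occurs. The three rows below are hypothetical and shortened to eight columns:

```python
from collections import Counter

# Hypothetical rows; index 6 is 'Type', as in the Google Play header.
sample_apps = [
    ['App A', 'TOOLS', '4.1', '159', '19M', '10,000+', 'Free', '0'],
    ['App B', 'GAME', '4.5', '2000', '25M', '50,000+', 'Paid', '$4.99'],
    ['App C', 'FAMILY', '3.9', '967', '14M', '500,000+', 'Free', '0'],
]

type_counts = Counter(app[6] for app in sample_apps)
print(type_counts)
```

On the real data the call would be `Counter(app[6] for app in android_eng_apps)`, which would also reveal any unexpected values in the column.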

We can use this to quickly separate the free from the paid apps for the Android dataset. For iOS, we can use the value of 'price' to separate apps with price == 0 from all paid apps.

In [31]:
free_android_eng_apps = []
free_ios_eng_apps = []

for app in android_eng_apps:
    if app[6] == 'Free':
        free_android_eng_apps.append(app)

for app in ios_eng_apps:
    if app[4] == '0.0':
        free_ios_eng_apps.append(app)
        
print('Total Free Android Apps in English:', len(free_android_eng_apps))
print('\n')
print('Total Free iOS Apps in English:', len(free_ios_eng_apps))
print('\n')        


Total Free Android Apps in English: 8863


Total Free iOS Apps in English: 3222




We are left with 8,863 Android apps and 3,222 iOS apps, which should be enough for us to do a proper analysis from here.
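
Comparing the price to the exact string `'0.0'` works for the iOS rows shown above, but it would miss a free app recorded as `'0'` or `'0.00'`. A slightly more defensive sketch converts the text to a number first. The helper name `is_free` and the `'$4.99'` price format are assumptions made for illustration:

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    # Stripping a leading '$' is an assumption about how paid prices are written.
    cleaned = price_text.replace('$', '').strip()
    return float(cleaned) == 0.0


for price in ['0.0', '0', '0.00', '$4.99', '1.99']:
    print(repr(price), '->', is_free(price))
```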

## Most Common Apps by Genre

As we mentioned in the beginning, our strategy is to identify the kinds of free apps that are most likely to attract users, because our free-app revenue is largely driven by how many people use our apps.

To minimize risk and overhead when developing a new app, our validation strategy will consist of the following three parts:
1. Build an MVP Android version of the app, then add it to the Google Play store.
2. If the Android app gets a high level of engagement, then we will develop it further.
3. If the app is profitable after 6 months, then we will develop an iOS version as well.

Because our ultimate goal is to develop an app that is successful on both the Google Play and Apple App Stores, we need to identify app profiles that have a track record of being successful in both markets.

Let's begin the analysis by identifying the most common genres in each market. To do so, we will generate frequency tables using the `Category` and `Genres` fields for Android and the `prime_genre` field for iOS.

For reference below, it's important to note the indices of these fields.

`Category` == 1

`Genres` == 9

`prime_genre` == 11

### Functions for Genre Frequency Tables

We will build two functions that we can use to analyze the frequency tables. 
- One function to generate frequency tables that show percentages
- Another function we can use to display the percentages in descending order

In [32]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else: 
            table[value] = 1 
            
    table_percentages = {}
    
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    
    return table_percentages
    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
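
For comparison, the same idea can be sketched with `collections.Counter` and a sort on the percentage value itself. This is an alternative formulation, not the notebook's implementation. The names `freq_table_pct` and `sorted_table` and the sample rows are made up:

```python
from collections import Counter


def freq_table_pct(rows, index):
    """Map each distinct value in a column to its percentage of all rows."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


def sorted_table(rows, index):
    """Return (value, percentage) pairs, largest percentage first."""
    table = freq_table_pct(rows, index)
    return sorted(table.items(), key=lambda pair: pair[1], reverse=True)


# Made-up rows with the category at index 1.
sample_rows = [
    ['App A', 'GAME'],
    ['App B', 'TOOLS'],
    ['App C', 'GAME'],
    ['App D', 'FAMILY'],
]

for value, percentage in sorted_table(sample_rows, 1):
    print(value, ':', percentage)
```

Sorting with a `key` keeps ties in their original order, whereas sorting `(percentage, key)` tuples in reverse orders ties by descending name.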

Now we will do a quick test to ensure the two functions to create and sort a frequency table are working as expected:

In [33]:
print(freq_table(free_android_eng_apps, 1))

{'SHOPPING': 2.245289405393208, 'FINANCE': 3.7007785174320205, 'DATING': 1.8616721200496444, 'ENTERTAINMENT': 0.9590432133589079, 'TRAVEL_AND_LOCAL': 2.335552296062281, 'MEDICAL': 3.5315355974275078, 'BOOKS_AND_REFERENCE': 2.1437436533904997, 'PHOTOGRAPHY': 2.944826808078529, 'VIDEO_PLAYERS': 1.7939749520478394, 'ART_AND_DESIGN': 0.6431230960171499, 'HEALTH_AND_FITNESS': 3.0802211440821394, 'LIBRARIES_AND_DEMO': 0.9364774906916393, 'FAMILY': 18.898792733837304, 'WEATHER': 0.8010831546880289, 'SPORTS': 3.396141261423897, 'GAME': 9.725826469592688, 'PERSONALIZATION': 3.317161232088458, 'EDUCATION': 1.1621347173643235, 'PARENTING': 0.6544059573507841, 'SOCIAL': 2.6627552747376737, 'FOOD_AND_DRINK': 1.241114746699763, 'LIFESTYLE': 3.9038700214374367, 'EVENTS': 0.7108202640189552, 'COMICS': 0.6205573733498815, 'AUTO_AND_VEHICLES': 0.9251946293580051, 'MAPS_AND_NAVIGATION': 1.399074805370642, 'NEWS_AND_MAGAZINES': 2.798149610741284, 'HOUSE_AND_HOME': 0.8236488773552973, 'BUSINESS': 4.5921245

In [34]:
display_table(free_android_eng_apps, 1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

Based on our test using the 'Category' column of the Android dataset, we can see that the `display_table` function prints the frequency table as percentages of records in descending order, which was our goal. Let's now generate a table for the 'Genres' field, the other Android field that is relevant for this genre analysis.

In [35]:
display_table(free_android_eng_apps, 9)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

For comparison, let's see how that stacks up against the most popular free iOS genres:

In [36]:
display_table(free_ios_eng_apps, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


**What is the most common genre for iOS?**

Looking at the frequency table for the `prime_genre` column (index 11), the most common genre among free English apps on the App Store is, by far, "Games", which makes up ~58% of them. "Entertainment" and "Photo & Video" follow as the 2nd and 3rd most common genres, at ~8% and ~5% respectively.

**What is the most common genre for Android?**

For Android, we looked at two columns, `Category` and `Genres`. "Family" is the top `Category` (~19%), followed by "Game" (~10%) and "Tools" (~8%), while "Tools" is the top `Genres` value (~8%), followed by "Entertainment" (~6%) and "Education" (~5%). Compared with iOS, a larger share of free Android apps appear to be designed for practical use, and the genre distribution is far more balanced.



At first glance, this suggests most free iOS apps are designed for entertainment rather than for practical needs, such as productivity or utilities apps. Free Android apps appear to be more commonly designed for practicality, even though entertainment apps are still very prevalent. However, the frequency tables only show how many apps exist in each genre, not which genres have the most users. This will be important for our ultimate consideration, so that is what we will look at next.

**Which genres are most popular (i.e. have the most users)?**

To measure genre popularity, we can calculate the average number of installs for each app genre. For the Google Play dataset, the `Installs` column is well suited to this. For iOS, install counts aren't available, so we will instead use the total number of user ratings as a proxy, found in the `rating_count_tot` column for each iOS app.
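
The `Installs` values seen earlier are strings such as `'500,000+'`, so they need to be converted before they can be averaged. A minimal sketch of that conversion follows. The helper name `parse_installs` is hypothetical, and the result should be read as a lower bound, since `'500,000+'` only says "at least 500,000":

```python
def parse_installs(installs):
    """Convert a string like '5,000,000+' to a float lower bound."""
    return float(installs.replace(',', '').replace('+', ''))


# Install strings taken from the Google Play rows shown earlier.
for raw in ['500,000+', '5,000,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```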

## iOS App Profile Recommendation

To understand genre popularity and how that can be factored into our iOS App profile recommendation, we start by calculating the average number of user ratings per app genre on the iOS App Store.

In [37]:
genre_freq_table_ios = freq_table(free_ios_eng_apps, 11)

for genre in genre_freq_table_ios:
    total = 0
    len_genre = 0
    for app in free_ios_eng_apps:
        genre_app = app[11]
        if genre_app == genre: 
            user_rating_total = float(app[5])
            total += user_rating_total
            len_genre += 1
    avg_user_rating_total = total / len_genre
    print(genre, ':', avg_user_rating_total)

Finance : 31467.944444444445
Navigation : 86090.33333333333
Travel : 28243.8
Productivity : 21028.410714285714
Shopping : 26919.690476190477
Catalogs : 4004.0
Weather : 52279.892857142855
Medical : 612.0
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Health & Fitness : 23298.015384615384
Utilities : 18684.456790123455
Business : 7491.117647058823
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Games : 22788.6696905016
Social Networking : 71548.34905660378
Entertainment : 14029.830708661417
Education : 7003.983050847458
Reference : 74942.11111111111
Music : 57326.530303030304
Lifestyle : 16485.764705882353
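
The loop above prints genres in whatever order the frequency table stores them, which makes the ranking hard to read. As a minimal sketch, the averages could be collected in a dictionary and sorted by value. The dictionary name `avg_ratings_by_genre` is an assumption (it is not part of the notebook), and the values are a hand-copied, rounded subset of the output above.

```python
# A subset of the averages printed above, rounded to two decimals.
# The dictionary name is hypothetical; the notebook only prints these values.
avg_ratings_by_genre = {
    'Finance': 31467.94,
    'Navigation': 86090.33,
    'Weather': 52279.89,
    'Social Networking': 71548.35,
    'Reference': 74942.11,
    'Music': 57326.53,
}

# Sort (genre, average) pairs by the average, largest first.
ranked_genres = sorted(
    avg_ratings_by_genre.items(),
    key=lambda pair: pair[1],
    reverse=True,
)

for genre, avg in ranked_genres:
    print(genre, ':', avg)
```

In the notebook itself, the same idea would mean storing each `avg_user_rating_total` in a dictionary inside the loop instead of printing it right away.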


On average, it looks like Navigation apps have the highest number of user ratings. Let's take a closer look at the types of apps that are driving
the highest rating volume in that category:

In [38]:
for app in free_ios_eng_apps:
    if app[11] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


It looks like the average rating count for the "Navigation" category is heavily skewed by Waze and Google Maps, which dominate the category. This doesn't seem like our best category choice.
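
One way to check this kind of skew, sketched here with the six rating counts copied from the output above, is to compare the mean with the median. This uses Python's standard `statistics` module, which the notebook itself does not import; the variable names are illustrative.

```python
from statistics import mean, median

# Rating counts copied from the Navigation output above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

# The mean is pulled upward by the two largest apps.
avg_ratings = mean(navigation_ratings)

# The median is the midpoint of the sorted counts, so it is
# far less sensitive to a couple of very large values.
median_ratings = median(navigation_ratings)

print('Mean:', round(avg_ratings, 2))
print('Median:', median_ratings)
```

The mean matches the Navigation average reported earlier (about 86,090), while the median is roughly ten times smaller, which supports reading the category average as driven by a couple of very large apps.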

Let's investigate a few more of the most popular categories until we find one that seems like a viable category for us to enter. We will look at "Social Networking" followed by "Music":

In [39]:
for app in free_ios_eng_apps:
    if app[11] == 'Social Networking':
        print(app[1], ':', app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

In [40]:
for app in free_ios_eng_apps:
    if app[11] == 'Music':
        print(app[1], ':', app[5])

Pandora - Music & Radio : 1126879
Spotify Music : 878563
Shazam - Discover music, artists, videos & lyrics : 402925
iHeartRadio – Free Music & Radio Stations : 293228
SoundCloud - Music & Audio : 135744
Magic Piano by Smule : 131695
Smule Sing! : 119316
TuneIn Radio - MLB NBA Audiobooks Podcasts Music : 110420
Amazon Music : 106235
SoundHound Song Search & Music Player : 82602
Sonos Controller : 48905
Bandsintown Concerts : 30845
Karaoke - Sing Karaoke, Unlimited Songs! : 28606
My Mixtapez Music : 26286
Sing Karaoke Songs Unlimited with StarMaker : 26227
Ringtones for iPhone & Ringtone Maker : 25403
Musi - Unlimited Music For YouTube : 25193
AutoRap by Smule : 18202
Spinrilla - Mixtapes For Free : 15053
Napster - Top Music & Radio : 14268
edjing Mix:DJ turntable to remix and scratch music : 13580
Free Music - MP3 Streamer & Playlist Manager Pro : 13443
Free Piano app by Yokee : 13016
Google Play Music : 10118
Certified Mixtapes - Hip Hop Albums & Mixtapes : 9975
TIDAL : 7398
YouTube Mu

Like "Navigation", the categories of "Social Networking" and "Music" both appear to be dominated by a few major players, such as Facebook and Pinterest for social, and Pandora and Spotify for Music.

Instead, we'd like to tackle a niche that offers a more level playing field for new entrants, so we will keep narrowing it down.
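
To put a rough number on how level a playing field is, one option is to measure what share of a genre's ratings belongs to its few largest apps. The sketch below assumes the same column positions as the iOS data (name at index 1, rating count at index 5, genre at index 11). The helpers `top_n_share` and `make_row` and the sample rows are hypothetical and are not taken from the notebook or from `AppleStore.csv`.

```python
def top_n_share(dataset, genre, n=3, genre_index=11, count_index=5):
    """Return the fraction of a genre's total ratings held by its n most-rated apps."""
    counts = sorted(
        (float(row[count_index]) for row in dataset if row[genre_index] == genre),
        reverse=True,
    )
    total = sum(counts)
    if total == 0:
        return 0.0  # no apps (or no ratings) found for this genre
    return sum(counts[:n]) / total


def make_row(name, rating_count, genre):
    """Build a placeholder row with the same column positions as the iOS data."""
    row = [''] * 16
    row[1], row[5], row[11] = name, str(rating_count), genre
    return row


# Illustrative rows only: made-up apps and counts.
sample_ios_apps = [
    make_row('Big App', 900, 'Demo'),
    make_row('Mid App', 80, 'Demo'),
    make_row('Small App', 20, 'Demo'),
    make_row('Other App', 500, 'Elsewhere'),
]

print(top_n_share(sample_ios_apps, 'Demo', n=1))
```

A share close to 1 for a small `n` would suggest a genre dominated by a few players, while a lower share would point to a more evenly spread genre. Calling it as `top_n_share(free_ios_eng_apps, 'Music')` would apply the same check to the real data.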

With everyone spending much more time at home than usual due to COVID-19, and with many on a tighter budget due to the pandemic's economic ripple effects, it would be nice to create a free app that brings entertainment (the second most common genre of iOS apps) _and_ health improvements to our potential users. "Health & Fitness" looks like a fairly popular genre based on user ratings (avg. ~23,300 ratings), so let's explore its potential.

In [41]:
for app in free_ios_eng_apps:
    if app[11] == 'Health & Fitness':
        print(app[1], ':', app[5])

Calorie Counter & Diet Tracker by MyFitnessPal : 507706
Lose It! – Weight Loss Program and Calorie Counter : 373835
Weight Watchers : 136833
Sleep Cycle alarm clock : 104539
Fitbit : 90496
Period Tracker Lite : 53620
Nike+ Training Club - Workouts & Fitness Plans : 33969
Plant Nanny - Water Reminder with Cute Plants : 27421
Sworkit - Custom Workouts for Exercise & Fitness : 16819
Clue Period Tracker: Period & Ovulation Tracker : 13436
Headspace : 12819
Fooducate - Lose Weight, Eat Healthy,Get Motivated : 11875
Runtastic Running, Jogging and Walking Tracker : 10298
WebMD for iPad : 9142
8fit - Workouts, meal plans and personal trainer : 8730
Garmin Connect™ Mobile : 8341
Record by Under Armour, connects with UA HealthBox : 7754
Fitstar Personal Trainer : 7496
My Cycles Period and Ovulation Tracker : 7469
Seven - 7 Minute Workout Training Challenge : 6808
RUNNING for weight loss: workout & meal plans : 6407
Lifesum – Inspiring healthy lifestyle app : 5795
Waterlogged - Daily Hydration Tr

Looking at this category, it is clear that the five apps with the most user ratings all primarily help users track exercise, nutrition, sleep, or some combination of the three.

From our earlier analysis of the most common genres, we also know that "Photo & Video" is the third most common app category, which reflects the fact that iPhones are known for letting users take excellent photos with the high-quality camera that comes with most models. High-quality photo-based apps deliver a joyful user experience and can also add a lot of convenience in certain applications, such as the barcode scanner in food tracking apps like MyFitnessPal.

One idea we could pursue, which would deliver time-saving convenience for users in the health/fitness category, is to create an iPhone app that lets a user scan a receipt of their grocery purchases and then recommends healthy meal combinations and recipes that can be made with the purchased ingredients. These recommendations could be aligned to the user's diet preference, such as Paleo, Keto, Vegan, etc.
To do this, the app could leverage the document scanning that comes with iOS 13's VisionKit, together with the text recognition (OCR) in Apple's Vision framework. Considering the popularity of food logging and healthy meal programs, this idea feels like it could have potential.

Before we develop it any further, let's switch over to the Google Play data and take a closer look at genre popularity based on total installs.

## Android App Profile Recommendation

For the Google Play dataset, we actually have data about the number of installs, so that should help us get a clearer picture of genre popularity. However, as we see below, the install numbers don't seem precise enough - most of the values are open-ended (100+, 1,000+, etc.):

In [43]:
print(gp_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [44]:
display_table(free_android_eng_apps, 5)

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


Because the data is no more precise than the open-ended install buckets shown above, we don't know whether an app in the "100,000+" bucket actually has 100K, 200K, or 350K downloads. Even so, we don't need this particular data to be that precise, as we are just trying to get a general idea of which app genre attracts the most users.

We are going to leave the numbers in the current state, which means we will make the assumption that any app with 10,000+ installs has 10,000 installs, and an app with 100,000+ installs has 100,000 installs, and so on. 

However, before we can work with those numbers, we will have to convert the install count numbers from strings to floats. To do this, we must remove the comma and plus characters that are currently present, otherwise conversion from string to float will fail.

In the code below, we loop through each category and perform that conversion in the process, to get to an average installs count for each category. 

In [48]:
category_freq_table_android = freq_table(free_android_eng_apps, 1)

for category in category_freq_table_android:
    total = 0
    len_category = 0
    for app in free_android_eng_apps:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_installs = total / len_category
    print(category, ':', avg_installs)            

SHOPPING : 7036877.311557789
FINANCE : 1387692.475609756
DATING : 854028.8303030303
ENTERTAINMENT : 11640705.88235294
TRAVEL_AND_LOCAL : 13984077.710144928
MEDICAL : 120550.61980830671
BOOKS_AND_REFERENCE : 8767811.894736841
PHOTOGRAPHY : 17840110.40229885
VIDEO_PLAYERS : 24727872.452830188
ART_AND_DESIGN : 1986335.0877192982
HEALTH_AND_FITNESS : 4188821.9853479853
LIBRARIES_AND_DEMO : 638503.734939759
FAMILY : 3697848.1731343283
WEATHER : 5074486.197183099
SPORTS : 3638640.1428571427
GAME : 15588015.603248259
PERSONALIZATION : 5201482.6122448975
EDUCATION : 1833495.145631068
PARENTING : 542603.6206896552
SOCIAL : 23253652.127118643
FOOD_AND_DRINK : 1924897.7363636363
LIFESTYLE : 1437816.2687861272
EVENTS : 253542.22222222222
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
MAPS_AND_NAVIGATION : 4056941.7741935486
NEWS_AND_MAGAZINES : 9549178.467741935
HOUSE_AND_HOME : 1331540.5616438356
BUSINESS : 1712290.1474201474
TOOLS : 10801391.298666667
BEAUTY : 513151.8867924528
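
The string clean-up and averaging in the loop above can also be sketched as two small helpers that make a single pass over the data. The function names `parse_installs` and `avg_installs_by_category` and the sample rows are assumptions for illustration, not part of the notebook. The column positions (category at index 1, installs at index 5) follow the Google Play header printed earlier.

```python
def parse_installs(raw):
    """Convert an install bucket such as '1,000,000+' to its lower bound as a float."""
    return float(raw.replace('+', '').replace(',', ''))


def avg_installs_by_category(dataset, category_index=1, installs_index=5):
    """Average the lower-bound install counts per category in one pass."""
    totals = {}
    counts = {}
    for app in dataset:
        category = app[category_index]
        totals[category] = totals.get(category, 0.0) + parse_installs(app[installs_index])
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}


# Illustrative rows only: made-up apps with just the columns used here filled in.
sample_android_apps = [
    ['App A', 'TOOLS', '', '', '', '1,000+'],
    ['App B', 'TOOLS', '', '', '', '10,000+'],
    ['App C', 'GAME', '', '', '', '500,000+'],
]

print(parse_installs('1,000,000+'))
print(avg_installs_by_category(sample_android_apps))
```

Because every bucket is read as its lower bound, the resulting averages are best treated as floors for comparing categories rather than as true install counts.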