## Profitable App Profiles for the App Store and Google Play Markets
By: Lauren Wilson  
Created: Dec 15, 2020

In this project we will use Python to find profitable app profiles.  Our company only builds apps that are free to install and directed at an English-speaking audience, and our main source of revenue is in-app advertisements. 
  
In the code below we will:
* Analyze an open source data set
* Use Python syntax to count frequencies and find meaningful insights
* Have fun learning!  


### Opening and Exploring the Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps available on Google Play.  This project uses two sample data sets:

* [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about 10,000 Android apps from Google Play.  You can download it directly from [this link](https://www.kaggle.com/lava18/google-play-store-apps/download) 
* [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about 7,000 Apple apps from the AppStore.  You can download it directly from [this link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/download) 

First we will open the two data sets and begin our exploration.

In [64]:
from csv import reader
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
apple = list(read_file)
apple = apple[1:]


In [2]:
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
google = list(read_file)
google = google[1:]
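
The loading cells above can also be wrapped in a small helper. The sketch below is only an illustration: the name `load_csv` and the `utf-8` encoding are assumptions, not part of the original notebook. It closes the file automatically and returns the header separately, which makes it harder to lose track of whether the header row is still in the list.

```python
from csv import reader

def load_csv(path, encoding='utf-8'):
    """Return (header, rows) for a CSV file.

    `load_csv` is a hypothetical helper; the encoding is an assumption.
    """
    # The with-block closes the file even if reading fails.
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    # First row is the header, everything else is data.
    return rows[0], rows[1:]
```

With this helper, a call such as `apple_header, apple = load_csv('AppleStore.csv')` would give the same `apple` list as the cell above, plus the column names.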


To make it easier to explore the data sets, we'll write a function named `explore_data()` that can be used repeatedly to print rows in an easy-to-read format.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


explore_data(google, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


### GooglePlay

We see that Google Play has 10,841 apps and 13 columns.  The columns that may be useful are described below. A complete listing of the dataset can be found [here](https://www.kaggle.com/lava18/google-play-store-apps).

|**Column Name and Zero Indexed Number**   |**Description**                                      |
|------------------------------------------|-----------------------------------------------------|
|                                   App [0]|                                     Application name|
|                              Category [1]|                         Category the app belongs to |
|                                Rating [2]|                       Overall user rating of the app|
|                               Reviews [3]|                              Number of user reviews |
|                              Installs [5]|       Number of user downloads/ installs for the app|
|                                  Type [6]|                                         Paid or Free|
|                        Content Rating [8]|                     Age group the app is targeted at|
|                                Genres [9]|                 An app can belong to multiple genres|

Now let's look at the Apple data set


In [4]:
explore_data(apple, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


### AppStore
We see there are fewer apps in this data set: 7,197 rows and 16 columns.  The column names are not as self-explanatory as those in the GooglePlay data set, so the potentially most useful columns are described below.  Further details can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).


|**Column Name and Zero Indexed Number**   |**Description**                                      |
|------------------------------------------|-----------------------------------------------------|
|                                 price [4]|                                         Price amount|
|                           user_rating [7]|          Average User Rating value(for all versions)|
|                          cont_rating [10]|                                       Content Rating|
|                          prime_genre [11]|                                        Primary Genre|
|                      sup_devices.num [12]|                         Number of supporting devices|
|                             lang.num [14]|                        Number of supported languages|







## Data Cleaning
 `Remember: Our company only works with free apps, directed towards an English-speaking audience`
 
 We will need to:
 * Remove NA values
 * Remove non-English apps
 * Remove apps that aren't free

Row 10473 of the original file (index 10472 in our list, because the header row was removed on loading) corresponds to the app *Life Made WI-Fi Touchscreen Photo Frame*, and its rating is listed as 19.  This is clearly wrong because the maximum rating for a Google Play app is 5.  As mentioned in the data set's discussion section, the problem is caused by a missing value in the `Category` column, which shifts the remaining values one column to the left.  As a consequence, we'll delete this row.

In [5]:
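print(google[10473])  # row after the faulty one: the header was removed, so list indices are one lower than file rows
print('\n')
print(google[0])      # first data row (the header was dropped when the file was loaded)
print('\n')
print(google[0])      # the same row again, as an example of a correct row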

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


In [6]:
print(google[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [7]:
# Delete the incorrect row. Don't run this cell more than once, or you will delete more rows than intended.
print(len(google))
del google[10472]
print(len(google))

10841
10840
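
The faulty row printed above has 12 values instead of 13, so rows of this kind can also be found by comparing each row's length with the expected number of columns. The sketch below is a minimal illustration on made-up sample rows; `find_malformed_rows` is a hypothetical helper name, not part of the original notebook.

```python
def find_malformed_rows(rows, expected_len):
    """Return (index, row) pairs whose column count differs from expected_len."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected_len]

# Tiny illustrative sample: the second row is missing its category value.
sample = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]
bad = find_malformed_rows(sample, expected_len=3)
print(bad)  # [(1, ['App B', '1.9'])]
```

On the real data, a call such as `find_malformed_rows(google, 13)` should report index 10472 if it is run before the row is deleted.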


### Handling Duplicates

The GooglePlay data set contains duplicate entries, which can distort our results.  Without correct data, proper insights cannot be made.  The code below finds these duplicates and prints a few examples.

In [8]:
unique_apps = []
duplicate_apps = []
for app in google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print()
print('10 Examples of duplicate apps:', duplicate_apps[:10])
   

Number of duplicate apps: 1181

10 Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
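
Checking `name in unique_apps` against a list gets slower as the list grows. As an alternative sketch (not the approach used in this notebook), `collections.Counter` can count every name in one pass. The helper name `count_duplicates` and the sample rows are assumptions for illustration.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return (number of duplicate entries, list of names that repeat)."""
    counts = Counter(row[name_index] for row in rows)
    # Every occurrence beyond the first counts as one duplicate entry,
    # which matches how duplicate_apps is filled in the cell above.
    n_duplicates = sum(c - 1 for c in counts.values())
    repeated = [name for name, c in counts.items() if c > 1]
    return n_duplicates, repeated

sample = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits']]
print(count_duplicates(sample))  # (2, ['Box'])
```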


In total there are 1,181 duplicate entries.  We will not remove duplicates at random.  Upon studying the data, we find that the observations were collected at different times and can contain different values for `Reviews`, as shown in the code below.  The three entries for the *Quick PDF Scanner + OCR FREE* app do not all have the same value in the fourth column (index 3).  We will use this as our criterion for removing duplicates: for each duplicated app, the entry with the highest number of reviews will be kept and the rest removed.  

Let's see this issue up close.

In [9]:
for app in google:
    name = app[0]
    if name == 'Quick PDF Scanner + OCR FREE':
        print(app)
        print()
        
        

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']



Other duplicated apps, such as *Box*, have the same number of reviews in every copy, as shown below.  In that case we only need to keep one of the entries.

In [10]:
for app in google:
    name = app[0]
    if name == 'Box':
        print(app)
        print()
        

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']



### Removing Duplicates

We will now use dictionaries to remove these duplicate values from the GooglePlay data.  To do that we will:

* Create a dictionary where each key is a unique app, and the value is the highest number of reviews for that app
* Use that dictionary to create a new data set with one entry per app

In [11]:
reviews_max = {}

for app in google:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
        

In [12]:
print(google[0])

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.  
 

In [13]:
print('Expected length:', len(google) - 1181)
print('Actual length:', len(reviews_max))  

Expected length: 9659
Actual length: 9659


We will now use an approach similar to the one used earlier to separate unique and duplicate entries.  This will remove the duplicate rows and give us cleaner data.  In the code below:

* We create two empty lists, `google_clean` and `already_added`.
* We loop through the Google data set and for every iteration:  
    * Assign the name of the app and the number of reviews to a variable
    * The number of reviews is converted to a float
    * We add the current row to the `google_clean` list and the app name to the `already_added` list if:
        * The number of reviews for that row matches the number of reviews for that app in our previous `reviews_max` dictionary AND
        * The name of the app is not in the `already_added` list.  We add this condition to handle the issue described earlier, where a duplicated app has the same number of reviews in every copy.  If we only check for `reviews_max[name] == n_reviews`, we will still end up with duplicate entries for some apps.
        
        
        


In [14]:
google_clean = []
already_added = []

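# Caution: the header was already removed when the file was loaded, so
# google[1:] skips the first data row; iterating over `google` would keep it.
for app in google[1:]: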
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        google_clean.append(app)
        already_added.append(name)
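
The two-step procedure (build `reviews_max`, then filter) can also be collapsed into a single pass that stores the best row per app. This is a sketch under assumptions: `dedupe_by_max_reviews` is a hypothetical helper, the sample rows are shortened, and the result is ordered by each app's first appearance, which may differ from the order produced by the loop above.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict ">" keeps the first row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Quick PDF', 'BUSINESS', '4.2', '80804'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Quick PDF', 'BUSINESS', '4.2', '80805'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]
clean = dedupe_by_max_reviews(sample)
print(len(clean))  # 2
```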
           

In [15]:
unique_apps = []
duplicate_apps = []
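# Caution: app[0] is the app id (the name is at index 1), so this checks for
# duplicate ids, and apple[1:] skips the first data row (header already removed).
for app in apple[1:]:
    app_id = app[0]
    if app_id in unique_apps:
        duplicate_apps.append(app_id)
    else:
        unique_apps.append(app_id)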
        
print('Number of duplicate apps:', len(duplicate_apps))
print()
print('10 Examples of duplicate apps:', duplicate_apps[:10])
   

Number of duplicate apps: 0

10 Examples of duplicate apps: []


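As we can see, there are no duplicate ids in the AppStore data set.  To finish, let's explore the new GooglePlay data set with `explore_data()` and check the number of rows.  We expected 9,659 rows, but the output shows 9,658: the cleaning loop iterates over `google[1:]`, and because the header was already removed when the file was loaded, the first data row is skipped.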

In [16]:
explore_data(google_clean, 0, 3, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9658
Number of columns: 13


### Removing Non-English Apps

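At our company we develop apps for an English-speaking audience.  Some apps in both data sets have names suggesting they are not directed toward an English-speaking audience.  The cell below prints one app name from each data set by index; the two names shown here happen to be English, so the checks that follow use explicitly non-English strings.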

In [17]:
print('In GooglePlay data:', google_clean[4412][0])
print()
print('In AppStore data:', apple[814][1])

In GooglePlay data: ClanHQ

In AppStore data: Filterra – Photo Editor, Effects for Pictures


We are not interested in non-English apps at this time and will remove them.  Behind the scenes, each character in a string has a corresponding number.  The number for `'Z'` is 90, for `'z'` it is 122, and for `'国'` it is 22,269.

In [18]:
print(ord('Z'))
print(ord('z'))
print(ord('国'))

90
122
22269


The numbers corresponding to the characters commonly used in English text are in the range 0 to 127, according to [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange).  Based on this range we can detect whether a character is a part of the English character set.  

An app name containing characters with values greater than 127 suggests the app has a non-English name.  Our app names are stored as strings.  Like lists, strings are indexable and iterable, which means we can use a for loop to detect non-English characters in the data sets.  

The function below takes in a string and returns a boolean indicating whether all of its characters are inside the English character range.  Let's check the following strings to ensure our function is working properly.

* 'Instagram'
* '爱奇艺PPS -《欢乐颂2》电视剧热播'
* 'Docs To Go™ Free Office Suite'
* 'Instachat 😜'

In [19]:
def lang_check(a_string):
    # Return False as soon as any character falls outside the ASCII range.
    for letter in a_string:
        if ord(letter) > 127:
            return False
    # Only report True after every character has been checked.
    return True
        
print(lang_check('Instagram'))
print(lang_check('Instachat 😜'))
print(lang_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(lang_check('Docs To Go™ Free Office Suite'))

True
False
False
False


The function detects the non-English name, but it is too strict: English app names that contain an emoji or the trademark symbol are rejected because those characters fall outside the 0 to 127 range. 

In [20]:
print('Smile Emoji:', lang_check('😜'), ', Character Value:', ord('😜'))
print('Trademark Symbol:', lang_check('™'), ', Character Value:', ord('™'))

Smile Emoji: False , Character Value: 128540
Trademark Symbol: False , Character Value: 8482


To remedy this, let's edit the function so that it only flags an app when its name has more than three characters outside the ASCII range.  This is not a perfect solution, but it is an efficient workaround.

In [21]:
def lang_check2(string):
    non_ascii = 0
    for letter in string:
        if ord(letter) > 127:
            non_ascii += 1
            
    # Names with up to three non-ASCII characters still count as English.
    return non_ascii <= 3
        
print(lang_check2('Instagram'))
print(lang_check2('Instachat 😜'))
print(lang_check2('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(lang_check2('Docs To Go™ Free Office Suite'))

True
True
False
True


The names containing an emoji or a trademark symbol now pass the check.  Let's confirm that the individual characters are also accepted by the new function.

In [22]:
print('Smile Emoji:', lang_check2('😜'), ', Character Value:', ord('😜'))
print('Trademark Symbol:', lang_check2('™'), ', Character Value:', ord('™'))

Smile Emoji: True , Character Value: 128540
Trademark Symbol: True , Character Value: 8482


Voilà! Now the individual characters will not cause our apps to be flagged as non-English. 
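
The threshold of three non-ASCII characters is a judgment call, so it can help to expose it as a parameter. The sketch below is an illustrative variant of `lang_check2()`; the name `is_probably_english` and the `max_non_ascii` parameter are assumptions, not part of the original notebook.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii characters above code 127."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Instagram'))                      # True
print(is_probably_english('Instachat 😜'))                    # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
print(is_probably_english('Docs To Go™ Free Office Suite'))  # True
```

Setting `max_non_ascii=0` reproduces the strict ASCII-only check, which makes it easy to compare how many apps each threshold removes.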

Now let's create a function to filter out non-English apps from both data sets.  In the code below we will:
* Create a function that takes a list of lists and an integer index as arguments and returns two lists
* Create two empty lists to store English and non-English apps
* Loop through the data set and, on each iteration:
    * Store the app name found at the index passed to the function
    * Call `lang_check2()` on the name: if it returns `True`, append the entire row to the list of English apps; otherwise, append the entire row to the list of non-English apps
    
   

In [23]:
def app_is_english(data, index):
    english_list = []
    non_english = []
    for app in data:
        name = app[index]
        if lang_check2(name) == True:
            english_list.append(app)
        else:
            non_english.append(app)
            
    return english_list, non_english


eng_list_appl, non_eng_list_appl = app_is_english(apple, 1)
eng_list_goog, non_eng_list_goog = app_is_english(google_clean, 0)

Running `explore_data()` on the result shows 9,613 rows in the newly cleaned GooglePlay data set.

In [24]:
explore_data(eng_list_goog, 0, 3, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9613
Number of columns: 13


There are now 6,183 rows in the newly cleaned AppStore data set.

In [25]:
explore_data(eng_list_appl, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


### Removing Non-Free Apps

So far in the data cleaning process, we:

* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps

Another criterion for this project is that the apps included in the data are free.  As the last step to our data cleaning project we will need to isolate only free apps for our analysis.

Let's create a function that does this.  The function will:
* Take a list of lists and an integer index as arguments and return two lists
* Create two empty lists to store free and non-free apps
* Loop through the data passed to the function and, on each iteration:
    * Store the price of the app found at the index passed to the function
    * Append the row to the list of free apps if the price is equal to 'Free', '0.0', or '0'; otherwise, append it to the list of non-free apps
    

In [26]:
def app_is_free(data, index):
    free_list = []
    not_free = []
    for app in data:
        price = app[index]
        # 'Paid' must not be treated as free, so only these three values match.
        if price in ('Free', '0.0', '0'):
            free_list.append(app)
        else:
            not_free.append(app)
            
    return free_list, not_free


final_appl, not_free_appl = app_is_free(eng_list_appl, 4)
final_goog, not_free_goog = app_is_free(eng_list_goog, 7)

There are 3,222 free apps in the AppStore data and 8,863 free apps in the GooglePlay data.   


*Note: The `Type` column (index 6) of the GooglePlay data holds the strings `'Free'` and `'Paid'`.  A condition that also matches `'Paid'` returns 9,611 apps for that column, which does not isolate the free apps, so `'Paid'` is deliberately left out and the `Price` column is used instead.  

In [27]:
print('# of Free AppStore Apps:', len(final_appl))
print("# of Free GooglePlay Apps:", len(final_goog))

# of Free AppStore Apps: 3222
# of Free GooglePlay Apps: 8863
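
Comparing price strings works here because free apps appear as `'0'` or `'0.0'`. A more general sketch converts the price to a number first. The helpers `parse_price` and `is_free` are hypothetical, and the `'$4.99'` format for paid apps is an assumption about how paid prices are written in the GooglePlay data.

```python
def parse_price(raw):
    """Convert a price string such as '0', '0.0', or '$4.99' to a float."""
    # Assumption: paid prices may carry a leading dollar sign.
    return float(raw.strip().lstrip('$'))

def is_free(raw):
    """Return True when the parsed price is exactly zero."""
    return parse_price(raw) == 0.0

print(is_free('0'), is_free('0.0'), is_free('$4.99'))  # True True False
```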


### Most Common Apps by Genre

As noted in the intro, our aim is to determine which kinds of apps will attract the most users, since our revenue is highly influenced by the number of users.  At our company, validating an app happens in three steps:

1. Build a beta Android version of the app, and add it to GooglePlay
2. If the app receives a strong, positive response from users, we develop it further
3. If the app is profitable after six months, we build an iOS version of the app and add it to the AppStore

Because our end goal is to add applications on both GooglePlay and the AppStore, we need to build app profiles that are successful on both markets.  For example, a profile that works on both markets might be a productivity app that applies gamification.

Let's begin our analysis by finding what the most common genres are for each market.  We will need to build frequency tables for a few columns in our data sets.

In [28]:
explore_data(final_appl, 0, 3)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']




We see the Primary Genre is stored at index 11 in the AppStore data.  This can be used to generate frequency tables and find the most common genres in this market.

In [29]:
explore_data(final_goog, 0, 3)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']




For the GooglePlay data, the main category and the genre are stored in separate columns.  The `Genres` column can list several genres for one app, while the `Category` column stores a single value.  We can use the columns at index 1 and index 9 to create our frequency tables. 

We will build two functions to analyze the frequency tables:
* One function to generate a frequency table of percentages
* Another function to display these percentages in descending order



Let's create a function for generating frequency tables

In [55]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for app in dataset:
        total+=1
        value = app[index]
        if value in table:
            table[value]+=1
        else:
            table[value]=1
            
    table_percent = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percent[key] = percentage
            
    return table_percent
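
The same percentages can be produced with `collections.Counter` from the standard library. This is an alternative sketch, not the function used in the rest of this notebook; `freq_table_counter` and the sample rows are assumptions for illustration.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows with that value at `index`}."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # Same arithmetic as freq_table(): (count / total) * 100.
    return {key: count / total * 100 for key, count in counts.items()}

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_counter(sample, 1))  # {'Games': 75.0, 'Music': 25.0}
```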



Now we have created our basic frequency table but it is stored in a dictionary and has no specific order to it. Using the built-in function `sorted()` on a dictionary will only return the keys and we would like the entire key, value pair for our frequency table.  Below we use a second function `display_table` to:

* Convert the dictionary into a list of tuples 
* Display the frequency table sorted in descending order for the `prime_genre`, `Genres`, and `Category` columns

In [56]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        

print('Freq Table for prime_genre Column\n')
display_table(final_appl, 11)

Freq Table for prime_genre Column

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665
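
Swapping each pair into a `(value, key)` tuple is one way to sort by percentage. Another option, sketched below, is to sort the dictionary's items directly with a `key` function. The helper name `sorted_table` and the sample percentages are made up for illustration.

```python
def sorted_table(table):
    """Return (key, value) pairs sorted by value, largest first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Made-up percentages, only to show the ordering.
percentages = {'Music': 25.0, 'Games': 62.5, 'News': 12.5}
for genre, pct in sorted_table(percentages):
    print(f'{genre} : {pct:.2f}')
```

Formatting with `:.2f` also keeps the printed table to two decimal places, which is usually enough precision for comparing genres.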


To better understand potential profiles, we used the AppStore data's `prime_genre` column.  This column holds one string detailing the primary genre for that application.  The most common genre was **Games**, with more than half (58.16%) of the data set being this type of application.  The next closest genre was **Entertainment** with only 7.88% of the data set, roughly seven times fewer apps than the leading genre, so it is safe to say that **Games** is the most common genre in the AppStore.  The most common apps seem to be for entertainment purposes and fall into the Games, Entertainment, Photo & Video, Education, and Social Networking genres.  

This is not to say these types of apps are the most popular.  We do not know how many users installed these apps on their devices, which would show actual popularity; the table simply states how frequently each genre appears among the free English apps in the AppStore data.

In [57]:
print('Freq Table for Genres Column\n')
display_table(final_goog, 9)

Freq Table for Genres Column

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & V

The `Genres` column in the GooglePlay data set can list multiple genres for a single app.  In GooglePlay it seems that more of the apps are designed for practical purposes (tools, business, lifestyle) as opposed to gaming.  This column includes many values, and for some observations it can be confusing to work out their actual genre in the GooglePlay store.  All of the percentages are small: the most common genre, **Tools**, makes up only 8.45% of the data set, and the least common, **Adventure;Education**, only 0.01%.  Because this column is so granular, I will use the less granular `Category` column going forward.  



In [58]:
print('Freq Table for Category Column\n')
display_table(final_goog, 1)

Freq Table for Category Column

FAMILY : 18.910075595170937
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.654

The `Category` column, also found in the GooglePlay data set, contains far fewer distinct values, and each observation has exactly one category.  **Family** has the highest number of observations, making up 18.91% of the data set.  **Game** represents another 9.73% of the data set.  Upon further investigation, we can see that the Family category consists mostly of children's games, as shown in the screenshot below.

![Google Play Store screenshot](gp.png)

At this point I feel comfortable stating that the AppStore seems to be geared more towards fun, while the GooglePlay store is a more balanced environment of both fun and practical applications.  Now we can move on to answering the question of which apps have the most users.

### Most Popular Apps by Genre on the AppStore

One way to find the app genres with the most users is to calculate the average number of installs for each genre.  For the GooglePlay data set we can find this information in the `Installs` column, but it is missing from the AppStore data set.  As a proxy, we can use the total number of user ratings found in the `rating_count_tot` column (index 5).  Let's start by using a nested loop to calculate the average number of user ratings per app genre on the AppStore.  To do this we will need to:
* Isolate the apps of each genre
* Sum up the user ratings for the apps of that genre
* Divide the sum by the number of apps belonging to that genre (not by the total number of apps)


In [68]:
freq_genre = freq_table(final_appl, 11)

for genre in freq_genre:
    total = 0
    len_genre = 0
    for app in final_appl:
        genre_app = app[11]
        if genre_app == genre:
            user_ratings = float(app[5])
            total+=user_ratings
            len_genre+=1
            
    avg_ratings = total/len_genre
    print(genre, ":", avg_ratings)
    


Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
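
The nested loop above re-scans the whole data set once per genre. A possible alternative, sketched here under assumptions, accumulates totals and counts in dictionaries in a single pass. The names `average_by_key` and `make_row` are hypothetical, and the sample rows only fill in the three columns this notebook reads (index 1 for the name, 5 for the rating count, 11 for the genre), using values copied from the notebook's Navigation listing.

```python
from collections import defaultdict


def average_by_key(dataset, key_index, value_index):
    """Return {key: mean of the value column} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        key = row[key_index]
        totals[key] += float(row[value_index])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


def make_row(name, ratings, genre):
    """Build a 12-column row with only the columns used here filled in."""
    row = [''] * 12
    row[1] = name
    row[5] = ratings
    row[11] = genre
    return row


# Illustrative sample: the six Navigation apps listed in this notebook.
sample = [
    make_row('Waze - GPS Navigation, Maps & Real-time Traffic', '345046', 'Navigation'),
    make_row('Google Maps - Navigation & Transit', '154911', 'Navigation'),
    make_row('Geocaching®', '12811', 'Navigation'),
    make_row('CoPilot GPS – Car Navigation & Offline Maps', '3582', 'Navigation'),
    make_row('ImmobilienScout24: Real Estate Search in Germany', '187', 'Navigation'),
    make_row('Railway Route Search', '5', 'Navigation'),
]

averages = average_by_key(sample, key_index=11, value_index=5)
for genre, avg in averages.items():
    print(genre, ':', avg)
```

With the full data set, the call would presumably be `average_by_key(final_appl, 11, 5)`, which should give the same averages as the nested loop.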


On average, Navigation apps have the highest number of user ratings. This average is skewed by a couple of giants: combined, Waze and Google Maps have almost half a million ratings, while the remaining Navigation apps have far fewer.

In [69]:
for app in final_appl:
    if app[11] == 'Navigation':
        print(app[1], ":", app[5]) # print the name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Social Networking and Reference apps come closest to matching the average user ratings of Navigation apps, with 71K+ and 74K+, respectively. These averages could also be skewed by extremely large outliers. Reference apps, for example, have a large number of user ratings, but the average is heavily skewed by the Bible app. Outliers with 100K+ ratings should arguably be removed, since many apps struggle to get past the 10K mark. We will save this issue for a later time.
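As a quick illustration of how strongly a few outliers pull the mean, the sketch below compares the mean and the median of the six Navigation rating counts printed above, and then recomputes the mean without the apps that have 100K+ ratings. The 100K cutoff is an assumption taken from the discussion above, not a tuned threshold.

```python
from statistics import mean, median

# Rating counts copied from the Navigation listing in this notebook.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

# Drop the apps at or above the assumed 100K outlier cutoff.
without_outliers = [r for r in navigation_ratings if r < 100_000]

print('mean            :', mean(navigation_ratings))
print('median          :', median(navigation_ratings))
print('mean w/o 100K+  :', mean(without_outliers))
```

The median is far less sensitive to Waze and Google Maps than the mean, which is one reason it can be a more representative summary for data this skewed.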

In [70]:
for app in final_appl:
    if app[11] == 'Reference':
        print(app[1], ":", app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The Reference market seems to have potential. We could take a popular book and turn it into an app with added features such as audio versions of the book, quizzes, daily quotes, or an embedded dictionary, so users would not need an external tool. Borrowing from the Social Networking market, we could also offer a direct link to the author or to other readers of the same book. This could take advantage of the high number of user ratings in both of these markets.

This all suggests the fun-dominated AppStore could be oversaturated with for-fun apps, and our company could profit from more practical apps.

### Most Popular Apps by Genre on the GooglePlay Store

For the AppStore we used the number of user ratings to create our app profile recommendation. The GooglePlay data set measures popularity by the number of installs, although the numbers are not exact, as seen below.

In [71]:
display_table(final_goog, 5) # the Installs column

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.188423784271691
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835
0 : 0.011282861333634209


For example, we don't know whether an app listed with 100,000+ installs has exactly 100,000 installs or several hundred thousand. For our purposes we don't need that precision; we just want to find out which app genre attracts the most users.

We will use the numbers as they are, so we'll treat an app with 100,000+ installs as having 100,000 installs, and so on. To perform calculations we need to convert each install number from a string to a float. We will use the string `replace()` method to remove the comma and plus characters, since `float()` raises a `ValueError` on strings that contain them.

In the code below we will compute the average number of installs for each category and print only the categories that average more than 20 million installs:
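As a small illustration of the cleaning step, the sketch below shows `float()` rejecting a raw install label and a helper that strips the commas and plus sign first. The helper name `parse_installs` is hypothetical and is not part of the notebook.

```python
def parse_installs(raw):
    """Convert an install label such as '1,000,000+' to a float."""
    cleaned = raw.replace(',', '').replace('+', '')
    return float(cleaned)


# float() cannot parse the raw label because of the commas and the plus sign.
try:
    float('1,000,000+')
except ValueError as err:
    print('float() rejects the raw string:', err)

# After cleaning, the conversion succeeds.
print(parse_installs('1,000,000+'))
print(parse_installs('0'))
```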

In [77]:
freq_cat = freq_table(final_goog, 1)

for category in freq_cat:
    total = 0
    len_cat = 0
    for app in final_goog:
        cat_app = app[1]
        if cat_app == category:
            installs = app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
            total+=installs
            len_cat+=1
            
    avg_installs = total/len_cat
    if avg_installs > 20000000:
        print(category, ":", avg_installs)    

COMMUNICATION : 38456119.167247385
SOCIAL : 23253652.127118643
VIDEO_PLAYERS : 24727872.452830188


Based solely on the average number of installs, the top 3 markets on the GooglePlay store are Communication, Social, and Video Players. Each averages over 20M installs per app, which indicates a lot of usage.

In [80]:
for app in final_goog:
    if app[1] == 'COMMUNICATION':
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 

GooglePlay apps like WhatsApp Messenger, Gmail, and Google Chrome, each with 1B+ installs, are heavily skewing the data set. Communication apps appear to be installed heavily, with most exceeding 100,000 installs, although some still struggle to break 100.

If we remove all the Communication apps with 100M or more installs, the average drops from about 38.5M to about 3.6M, roughly ten times smaller. Extremely large values like these are skewing the averages in each of the categories.

In [94]:
under_100_m = []

for app in final_goog:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
print("COMMUNICATION:", sum(under_100_m) / len(under_100_m))

COMMUNICATION: 3603485.3884615386
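
The same check could be repeated for other categories. The sketch below wraps the logic in a reusable function with an optional cap; `to_number` and `average_installs` are hypothetical names, and the sample rows are copied from the listings in this notebook with the unused columns left empty.

```python
def to_number(raw):
    """Convert an install label such as '5,000,000+' to a float."""
    return float(raw.replace(',', '').replace('+', ''))


def average_installs(dataset, category, cap=None):
    """Mean installs for one category; rows at or above `cap` are skipped."""
    values = []
    for app in dataset:
        if app[1] != category:
            continue
        installs = to_number(app[5])
        if cap is None or installs < cap:
            values.append(installs)
    # Return None rather than dividing by zero when nothing matches.
    return sum(values) / len(values) if values else None


# Illustrative sample: name at index 0, category at 1, installs at 5.
sample = [
    ['WhatsApp Messenger', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Messenger for SMS', 'COMMUNICATION', '', '', '', '10,000,000+'],
    ['My Tele2', 'COMMUNICATION', '', '', '', '5,000,000+'],
    ['imo beta free calls and text', 'COMMUNICATION', '', '', '', '100,000,000+'],
    ['Facebook', 'SOCIAL', '', '', '', '1,000,000,000+'],
    ['Telegram X', 'SOCIAL', '', '', '', '5,000,000+'],
]

print(average_installs(sample, 'COMMUNICATION'))
print(average_installs(sample, 'COMMUNICATION', cap=100_000_000))
print(average_installs(sample, 'SOCIAL', cap=100_000_000))
```

On the full data set, `average_installs(final_goog, 'COMMUNICATION', cap=100_000_000)` should reproduce the figure printed above.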


In [87]:
for app in final_goog:
    if app[1] == 'SOCIAL':
        print(app[0], ':', app[5])

Facebook : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Social network all in one 2018 : 100,000+
Pinterest : 100,000,000+
TextNow - free text + calls : 10,000,000+
Google+ : 1,000,000,000+
The Messenger App : 1,000,000+
Messenger Pro : 1,000,000+
Free Messages, Video, Chat,Text for Messenger Plus : 1,000,000+
Telegram X : 5,000,000+
The Video Messenger App : 100,000+
Jodel - The Hyperlocal App : 1,000,000+
Hide Something - Photo, Video : 5,000,000+
Love Sticker : 1,000,000+
Web Browser & Fast Explorer : 5,000,000+
LiveMe - Video chat, new friends, and make money : 10,000,000+
VidStatus app - Status Videos & Status Downloader : 5,000,000+
Love Images : 1,000,000+
Web Browser ( Fast & Secure Web Explorer) : 500,000+
SPARK - Live random video chat & meet new people : 5,000,000+
Golden telegram : 50,000+
Facebook Local : 1,000,000+
Meet – Talk to Strangers Using Random Video Chat : 5,000,000+
MobilePatrol Public Safety App : 1,000,000+
💘 WhatsLov: Smileys of love, sti

Facebook, Google+, and Instagram are skewing the data in the Social category. Each has over 1B installs, more than any of the other social apps in the Google Play store.

In [90]:
for app in final_goog:
    if app[1] == 'VIDEO_PLAYERS':
        print(app[0], ':', app[5])

YouTube : 1,000,000,000+
All Video Downloader 2018 : 1,000,000+
Video Downloader : 10,000,000+
HD Video Player : 1,000,000+
Iqiyi (for tablet) : 1,000,000+
Video Player All Format : 10,000,000+
Motorola Gallery : 100,000,000+
Free TV series : 100,000+
Video Player All Format for Android : 500,000+
VLC for Android : 100,000,000+
Code : 10,000,000+
Vote for : 50,000,000+
XX HD Video downloader-Free Video Downloader : 1,000,000+
OBJECTIVE : 1,000,000+
Music - Mp3 Player : 10,000,000+
HD Movie Video Player : 1,000,000+
YouCut - Video Editor & Video Maker, No Watermark : 5,000,000+
Video Editor,Crop Video,Movie Video,Music,Effects : 1,000,000+
YouTube Studio : 10,000,000+
video player for android : 10,000,000+
Vigo Video : 50,000,000+
Google Play Movies & TV : 1,000,000,000+
HTC Service － DLNA : 10,000,000+
VPlayer : 1,000,000+
MiniMovie - Free Video and Slideshow Editor : 50,000,000+
Samsung Video Library : 50,000,000+
OnePlus Gallery : 1,000,000+
LIKE – Magic Video Maker & Community : 50,

In the Video Players category, only YouTube and Google Play Movies & TV have 1B+ installs. Apps in this category appear to be downloaded much more heavily than those in other categories, and a majority have over 100,000 installs. This makes it a good candidate for our profile.


This could work well since the Books and Reference category, the counterpart of the AppStore's Reference genre, is also popular in the GooglePlay store. Going back to our earlier validation strategy, we ultimately want a profile that can translate to both the GooglePlay store and the AppStore.

Let's look at some of the apps from this genre and their number of installs:


In [95]:
for app in final_goog:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

The book and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there's still a small number of extremely popular apps that skew the average:

In [97]:
for app in final_goog:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [100]:
for app in final_goog:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

This area of the market seems to be dominated by software for processing and reading ebooks. There are also many dictionaries and library collections, so that market faces a high threat of competition and appears saturated. Many apps are built around a single book, the Quran, which suggests that an app built around a popular book could be profitable.
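
The mid-popularity filter used earlier matches four install labels one by one. A possible alternative, sketched here, converts the label to a number and checks a numeric range, which avoids listing every label by hand. `to_number` and `in_install_range` are hypothetical names, and the sample rows are copied from the listings above with the unused columns left empty.

```python
def to_number(raw):
    """Convert an install label such as '5,000,000+' to a float."""
    return float(raw.replace(',', '').replace('+', ''))


def in_install_range(app, low, high, installs_index=5):
    """True when low <= installs < high for the given row."""
    return low <= to_number(app[installs_index]) < high


# Illustrative sample: name at index 0, category at 1, installs at 5.
sample = [
    ['Wikipedia', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Google Play Books', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['E-Book Read - Read Book for free', 'BOOKS_AND_REFERENCE', '', '', '', '50,000+'],
    ['Ancestry', 'BOOKS_AND_REFERENCE', '', '', '', '5,000,000+'],
    ['Bible', 'BOOKS_AND_REFERENCE', '', '', '', '100,000,000+'],
]

# Keep apps with at least 1M and fewer than 100M installs.
mid_tier = [
    app[0]
    for app in sample
    if app[1] == 'BOOKS_AND_REFERENCE'
    and in_install_range(app, 1_000_000, 100_000_000)
]
print(mid_tier)
```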

### Conclusions

I would suggest creating an app that combines aspects of each of these popular markets: a video player targeted at a specific show or book, with built-in social networking that connects fans to creators and actors and lets them communicate directly with other fans while watching. This goes beyond being just a library. These special features could be what it takes to turn a profit in both the GooglePlay store and the AppStore.