# -- Insight into Profitable Apps --

The purpose of this project is two-fold. The first is to help deepen my understanding of data analysis. The second is to familiarize myself with the Jupyter Notebook environment.

In this project, I take mobile app data and provide insight into its relationship with users and revenue. By ingesting this data and providing an analysis, I hope to draw some conclusions and provide a service for those who can benefit from this information.

In [1]:
# imports
import csv

# open up AppleStore.csv 
apple_store_open_file = open('AppleStore.csv')
apple_store_read_file = csv.reader(apple_store_open_file) 
apple_store_data_csv = list(apple_store_read_file) # store everything as a list of lists
apple_store_header = apple_store_data_csv[0]
apple_store_data = apple_store_data_csv[1:]


In [2]:
# open up googleplaystore.csv
google_play_open_file = open('googleplaystore.csv')
google_play_read_file = csv.reader(google_play_open_file)
google_play_data_csv = list(google_play_read_file) # store everything as a list of lists
google_play_header = google_play_data_csv[0]
google_play_data = google_play_data_csv[1:]
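
The two loading cells above leave their file handles open. As a minimal sketch of an alternative, the hypothetical helper `load_csv` below (not part of the original notebook) reads a file inside a `with` block so it is closed automatically, then returns the header and the data rows separately. The `encoding='utf-8'` default is an assumption about how the CSV files are encoded.

```python
import csv

def load_csv(path, encoding='utf-8'):
    """Hypothetical helper: read a CSV file and return (header, rows)."""
    # newline='' is the setting the csv module documentation recommends for file objects
    with open(path, newline='', encoding=encoding) as f:
        rows = list(csv.reader(f))
    # the first row is the header, everything after it is data
    return rows[0], rows[1:]
```

With this helper, a call such as `apple_store_header, apple_store_data = load_csv('AppleStore.csv')` would stand in for the five lines in each loading cell.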


In [3]:
# Function to look through datasets and print out number of rows and columns
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### Explore the Apple data

In [4]:
explore_data(apple_store_data_csv, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


### Explore the Google data

In [5]:
explore_data(google_play_data_csv, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


# -- Cleaning Data --

## -- Clean Inaccurate Data --

With the following function, we can check whether any rows of data have a different number of columns than their header row. If a row has more or fewer columns than the header, we can note its index and delete it from the dataset.

In [6]:
# print every row whose number of fields differs from the header's, along with its index
def check_data_by_row_length(dataset, header):
    for i, row in enumerate(dataset):
        if len(row) != len(header):
            print(f'row information = {row}')
            print(f'index = {i}')
            print('\n')                    

### Checking for inaccurate data for Apple

In [7]:
check_data_by_row_length(apple_store_data, apple_store_header) 

### Checking for inaccurate data for Google

In [8]:
check_data_by_row_length(google_play_data, google_play_header)

row information = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
index = 10472




In [9]:
# we see that there is one row in the Google data that has incorrect information. We can delete it
del google_play_data[10472]
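
Deleting by a hard-coded index works once, but running the cell a second time would remove a different, valid row. One sketch of an alternative that is safe to re-run, using a hypothetical helper named `drop_malformed_rows`, keeps only the rows whose length matches the header. The sample rows are made up for illustration.

```python
def drop_malformed_rows(dataset, header):
    """Hypothetical helper: return only the rows with as many fields as the header."""
    expected = len(header)
    return [row for row in dataset if len(row) == expected]

# Made-up example: the second row is missing a field
header = ['App', 'Category', 'Rating']
rows = [['A', 'TOOLS', '4.1'], ['B', '1.9'], ['C', 'GAME', '3.9']]
print(drop_malformed_rows(rows, header))
# prints [['A', 'TOOLS', '4.1'], ['C', 'GAME', '3.9']]
```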

## -- Clean Data by Duplicate Entries --

We can clean up the data by checking for duplicate app entries. For the Apple Store data, we can check for duplicates using the id of each app, since each app should have a unique id. If any id appears more than once, the dataset contains duplicate apps.

### Check duplicate apps in Apple

In [10]:
# Check for duplicate entries based on app_id

def check_duplicate_id_apple(dataset):
    unique_ids = []
    unique_apps = []
    duplicate_apps = []
    for row in dataset:
        app_id = row[0]
        app_name = row[1]
        if app_id not in unique_ids: # grabbing the app_id and app_name from the same row
            unique_ids.append(app_id)
            unique_apps.append(app_name)
        else:
            duplicate_apps.append(app_name)
    print(f'- # of Duplicate Apps: {len(duplicate_apps)}') 
    print(f'- # of Unique Apps: {len(unique_apps)}')
        

In [11]:
check_duplicate_id_apple(apple_store_data) 
# there are no duplicate unique ids in the apple dataset

- # of Duplicate Apps: 0
- # of Unique Apps: 7197
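
The list membership test (`app_id not in unique_ids`) rescans the whole list for every row, which gets slow as datasets grow. A sketch of the same count using a set is shown here. The helper name `count_duplicates` is hypothetical, and it takes the column index as a parameter so one function could serve both datasets. The sample rows are made up.

```python
def count_duplicates(dataset, index):
    """Hypothetical helper: return (unique_count, duplicate_count) for one column."""
    seen = set()
    duplicates = 0
    for row in dataset:
        key = row[index]
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return len(seen), duplicates

# Made-up rows: 'Facebook' appears twice
rows = [['Facebook'], ['Instagram'], ['Facebook']]
print(count_duplicates(rows, 0))  # prints (2, 1)
```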


### Check duplicate apps in Google

For the Google apps, I can check for duplicate apps by checking the names of apps because in this data set, apps should have unique names.

In [12]:
def check_duplicate_name_google(dataset):
    unique_apps = []
    duplicate_apps = []
    for row in dataset:
        app_name = row[0]
        if app_name not in unique_apps:
            unique_apps.append(app_name)
        else:
            duplicate_apps.append(app_name)
  
    print(f'- # of Duplicate Apps: {len(duplicate_apps)}')        
    print(f'- # of Unique Apps: {len(unique_apps)}')
 


In [13]:
check_duplicate_name_google(google_play_data)

- # of Duplicate Apps: 1181
- # of Unique Apps: 9659


In [14]:
# examining data of duplicate apps with random example 
print(google_play_header)
for row in google_play_data:
    if row[0] == 'Facebook':
        print('\n')
        print(row)
        

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


There are 1181 duplicate apps with the same name within the Google apps.

After careful inspection of the Google apps data, we can see that duplicate entries differ in the 'Reviews' column. We assume that the entry with the highest 'Reviews' value is the most recent one, so for each unique app we retain the entry with the highest 'Reviews' value.
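
One detail needs care when comparing review counts: the CSV values are strings, and Python compares strings character by character rather than numerically. The sketch below shows the difference and a hypothetical helper, `keep_highest_reviews`, that converts to `int` before comparing. It assumes every 'Reviews' value is a whole number, which should hold once the malformed row has been removed.

```python
# String comparison is lexicographic, so '9' sorts after '10'
print('9' > '10')            # prints True
print(int('9') > int('10'))  # prints False

def keep_highest_reviews(dataset, name_index=0, reviews_index=3):
    """Hypothetical helper: keep one row per app name, the one with the most reviews."""
    best = {}
    for row in dataset:
        name = row[name_index]
        if name not in best or int(row[reviews_index]) > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())
```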

### Removing duplicate apps in Google data

In [15]:
# Cleaning the dataset to clear duplicates by highest review count
def collect__uptodate_apps_google(dataset):
    uptodate_apps = {}
    clean_list = []
    for row in dataset:
        app_name = row[0]
        reviews_value = row[3]
        if app_name not in uptodate_apps:
            uptodate_apps[app_name] = row
        else:
            # compare review counts as integers; comparing the raw strings would be lexicographic
            if int(reviews_value) > int(uptodate_apps[app_name][3]):
                uptodate_apps[app_name] = row
    print(f'- Number of unique apps: {len(uptodate_apps)}')
    
    # copy all the clean rows from uptodate_apps into a new list
    for key in uptodate_apps:
        clean_list.append(uptodate_apps[key])
        
    #return that new list
    return clean_list

In [16]:
clean_google_data = collect__uptodate_apps_google(google_play_data)
explore_data(clean_google_data, 0 , 3, True)

- Number of unique apps: 9659
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'FAMILY', '3.9', '974', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13


## -- Clean Data by English Apps --

We can now clean the data by removing any apps that are not in English. To do this, we look through each app name and check whether its characters fall within the ASCII range. Python's built-in ord() function returns the Unicode code point of a character; the characters commonly used in English text (letters, digits, and basic punctuation) have code points from 0 to 127 inclusive, so a character above 127 suggests a non-English name. Because emojis and symbols such as ™ also fall above 127, we allow up to three such characters per name.
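
The check described above can be sketched for a single name as follows. The helper name `is_english` and its `max_non_ascii` parameter are hypothetical, and the second and third sample names are made up for illustration. The last example, taken from the Apple data, shows a limitation of the heuristic: a short non-English name with only three non-ASCII characters still passes.

```python
def is_english(name, max_non_ascii=3):
    """Hypothetical helper: True if the name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Instagram'))         # prints True
print(is_english('Docs To Go™ Free'))  # prints True (one non-ASCII character)
print(is_english('中文应用程序'))        # prints False (six non-ASCII characters)
print(is_english('教えて!goo'))         # prints True (only three non-ASCII characters)
```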

### Clean Data by English Apps in Apple data

In [17]:
def identify_non_english_apps_apple(dataset):
    english_apps_data =[]
    for row in dataset:
        non_english_count = 0 # this is to account for apps with emojis
        app_name = row[1]
        for character in app_name:
            ascii_val = ord(character)
            if ascii_val > 127:
                non_english_count +=1
        if non_english_count <=3: #we will include apps with up to 3 emojis in the title 
            english_apps_data.append(row)
   
    return english_apps_data


In [18]:
english_apps_apple = identify_non_english_apps_apple(apple_store_data)
explore_data(english_apps_apple, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


### Clean Data by English Apps in Google data

In [19]:
def identify_non_english_apps_google(dataset):
    english_apps_data =[]
    for row in dataset:
        non_english_count = 0 # this is to account for apps with emojis
        app_name = row[0]
        for character in app_name:
            ascii_val = ord(character)
            if ascii_val > 127:
                non_english_count +=1
        if non_english_count <=3: #we will include apps with up to 3 emojis in the title 
            english_apps_data.append(row)

    return english_apps_data

In [20]:
english_apps_google = identify_non_english_apps_google(clean_google_data)
explore_data(english_apps_google, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'FAMILY', '3.9', '974', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13


## -- Checkpoint --

So far we have done the following for both the Apple and Google datasets:

- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps


#### For the purposes of this project, we are focused on free apps only because we aim to only build apps that are free, and the main source of revenue is in-app ads. Now we can clean the data to retain only apps that are free.

### -- Retrieve Only Free Apps from Apple --

In [21]:
def free_data_apple(dataset):
    free_apps = []
    for row in dataset:
        price = row[4]
        if price == '0' or price == '0.00' or price == '0.0':
            free_apps.append(row)
    return free_apps

In [22]:
final_clean_data_apple = free_data_apple(english_apps_apple)
explore_data(final_clean_data_apple, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16


### -- Retrieve Only Free Apps from Google --

In [23]:
def free_data_google(dataset):
    free_apps = []
    for row in dataset:
        price = row[7]
        if price == '0' or price == '0.00':
            free_apps.append(row)
    return free_apps

In [24]:
final_clean_data_google = free_data_google(english_apps_google)
explore_data(final_clean_data_google, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'FAMILY', '3.9', '974', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 8862
Number of columns: 13


### Final Number of Apps from Clean Apple Data:  3222
### Final Number of Apps from Clean Google Data:  8862
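
The two filters above match the price against specific strings ('0', '0.0', '0.00'). A sketch of a more general check converts the price to a number first. The helper name `is_free` is hypothetical, and the handling of a leading '$' is an assumption about how paid prices are written, since the sample rows shown here only contain '0' and '0.0'.

```python
def is_free(price):
    """Hypothetical helper: True if a price string represents zero."""
    # assumption: paid prices may carry a leading '$', e.g. '$4.99'
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # values that are not numbers at all are treated as not free
        return False

print(is_free('0'), is_free('0.0'), is_free('0.00'))  # prints True True True
print(is_free('$4.99'))                               # prints False
```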

## -- Analysis --

Our aim is to determine the kinds of apps that are likely to attract more users because revenue is highly influenced by the number of people using our apps.

### Criteria to determine apps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.

#### To begin the analysis, let's first start off by finding the most popular genres of apps within Apple and Google.

## -- Creating Frequency Tables --

We can create frequency tables to show the most common genres for apps within the Apple and Google datasets.

In [25]:

# form a frequency table showing app genres and their relative percentages to all other apps
# used to display percentages in descending order
def display_freq_table(dataset, index):
    table = freq_table(dataset, index) # helper function
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
# helper function to create a dictionary with key: app genre, value: # of apps within that genre
# used to return the entire freq table
def freq_table(dataset, index):
    freq_dict = {}
    freq_dict_percentages = {}
    total_dataset_length = len(dataset)
    for row in dataset:
        value = row[index]
        if value not in freq_dict:
            freq_dict[value] = 0
        freq_dict[value] +=1
    for key in freq_dict:
        freq_dict_percentages[key] = (freq_dict[key]/total_dataset_length) * 100
    return freq_dict_percentages
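
For comparison, the same kind of percentage table can be sketched with `collections.Counter` from the standard library, which handles both the counting and the descending sort. The helper name `freq_table_percent` is hypothetical and the sample rows are made up.

```python
from collections import Counter

def freq_table_percent(dataset, index):
    """Hypothetical helper: list of (value, percentage of rows), most common first."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # most_common() yields (value, count) pairs sorted by count, descending
    return [(value, count / total * 100) for value, count in counts.most_common()]

rows = [['Games'], ['Games'], ['Games'], ['Music']]
for value, pct in freq_table_percent(rows, 0):
    print(value, ':', pct)
# prints:
# Games : 75.0
# Music : 25.0
```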

### Apple Data 

In [26]:
display_freq_table(final_clean_data_apple, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Using the display_freq_table() function, we can see that 'Games' is the most popular app genre among free, English language Apple apps. They make up ~58% of apps within these constraints. 'Entertainment' and 'Photo & Video' apps are the second and third most popular apps at ~8% and ~5% respectively.

### Google Data 

In [27]:
display_freq_table(final_clean_data_google, 1)

FAMILY : 18.934777702550214
GAME : 9.693071541412774
TOOLS : 8.451816745655607
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.7941773865944481
MAPS_AND_NAVIGATION : 1.399232678853532
FOOD_AND_DRINK : 1.2412547957571656
EDUCATION : 1.1735499887158656
ENTERTAINMENT : 0.9591514330850823
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8237418190024826
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
PARENTING : 0.6544798013992327
ART_AND_DESIGN : 0.

In [28]:
display_freq_table(final_clean_data_google, 9)

Tools : 8.440532611148726
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5206499661475967
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7490408485669149
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.92529902956443

With the Google data, there are two columns that describe the genre of an app: 'Category' and 'Genres'. Frequency tables built from these two columns give different results. When we use 'Category', the three most common categories among free, English-language Google apps are 'FAMILY', 'GAME', and 'TOOLS'. They make up ~19%, ~10%, and ~8% of apps within these constraints, respectively.

When we use 'Genres' as the parameter, the top three most popular free, English language Google apps are 'Tools', 'Entertainment', and 'Education'. They make up ~8%, ~6%, and ~5% of apps within these constraints.

The frequency table generated from 'Genres' has many more entries than the one generated from 'Category', and the entries within 'Genres' also appear to be more specific. For our purposes, we will use the frequency table generated from 'Category' because it provides a bigger-picture view for our analysis.

Something to keep in mind is that these tables show which genres are most common in each store, but that does not mean apps in these genres also have the most users.

## -- Finding the Most Popular Apps by Genre --

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the Installs column, but this information is missing for the Apple Store dataset. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the 'rating_count_tot' column. 
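
The per-genre average described above can also be computed in a single pass over the data, without looping over the dataset once per genre. The sketch below uses a hypothetical helper, `average_by_group`, and assumes the value column already holds plain numeric strings (true for 'rating_count_tot'; the Google 'Installs' column would need cleaning first). The sample rows are made up.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Hypothetical helper: average of a numeric column per group, highest first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    averages = [(group, totals[group] / counts[group]) for group in totals]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)

rows = [['a', 'Nav', '10'], ['b', 'Nav', '20'], ['c', 'Ref', '5']]
print(average_by_group(rows, 1, 2))  # prints [('Nav', 15.0), ('Ref', 5.0)]
```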

### Apple Data - Finding the Most Popular Apps by Genre

In [29]:
def most_popular_apps_by_genre_apple(freq_table_apple, dataset):
    apps = [] 
    for genre in freq_table_apple: # start off with 1 genre from the frequency table
        total = 0
        len_genre = 0
        for row in dataset: # iterate through all apps in the clean dataset 
            app_genre = row[11]
            if app_genre == genre: 
                num_of_ratings = float(row[5])
                total += num_of_ratings
                len_genre +=1
        avg_num_ratings = total/len_genre # calculate the average number of ratings per app for this genre
        apps.append([genre, avg_num_ratings])
        
    # sort entire list of [genre, avg_num_ratings] by avg_num_ratings and display in descending order    
    apps.sort(key = lambda x : x[1])
    for row in reversed(apps):
        print(row[0], ':', row[1])

In [30]:
# retrieve a freq table for the Apple apps. 11 is the index for 'prime_genre'
freq_table_apple = freq_table(final_clean_data_apple, 11)

most_popular_apps_by_genre_apple(freq_table_apple, final_clean_data_apple) 

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


We can see that the 'Navigation' genre has the highest average number of user reviews. Let's dig into this a bit deeper by seeing all the 'Navigation' apps and the respective number of reviews per app.

In [31]:
for row in final_clean_data_apple:
    genre = row[11]
    app_name = row[1]
    num_reviews = row[5]
    if genre == 'Navigation':
        print(app_name, ':', num_reviews )
        

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


From the output above, we can see that there are only 6 apps in the 'Navigation' genre, and that Waze and Google Maps have far more reviews than the rest, which pulls the average up.
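
To see how strongly two large apps skew the average, the mean and median of the six 'Navigation' counts listed above can be compared. This is only an illustrative sketch using the standard library `statistics` module; the notebook itself reports averages only.

```python
from statistics import mean, median

# Rating counts of the six 'Navigation' apps from the output above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print(mean(navigation_ratings))    # about 86090.33, matching the genre average
print(median(navigation_ratings))  # prints 8196.5
```

The median is roughly a tenth of the mean, which suggests that the typical 'Navigation' app has far fewer ratings than the genre average implies.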

Let's do the same examination for 'Reference' apps.

In [32]:
for row in final_clean_data_apple:
    genre = row[11]
    app_name = row[1]
    num_reviews = row[5]
    if genre == 'Reference':
        print(app_name, ':', num_reviews )

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


We can see that the 'Bible' app accounts for the bulk of the user reviews within 'Reference' apps.

### Brief Analysis for Apple

Although 'Games' is the most common genre among Apple apps, 'Navigation' and 'Reference' have the highest average number of user reviews per app. Because there are so many apps in the 'Games' genre, a new game may struggle to stand out from the rest. Perhaps building an app that lets people read a popular book could draw a lot of traffic. Another idea could be building a navigation app for another common mode of transportation, such as trains or bikes.

### Google Data - Finding the Most Popular Apps by Genre

Below, we can investigate the Google data and see how the 'Installs' column is set up. The values are open-ended buckets rather than exact counts: an app listed as '100,000+' has at least 100,000 installs, but we don't know how many more, so the figures are not precise.

This level of precision is not important for our purposes, which is to find which app genres attract the most users. To perform computations, we will assume that '1,000,000+' installations equates to 1,000,000 installations, and so on. Next, we need to transform the data by removing the commas and '+' signs.
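
The transformation described above can be sketched for a single value as follows. The helper name `parse_installs` is hypothetical; it assumes every value looks like the ones in the 'Installs' column (digits, optional commas, and an optional trailing '+').

```python
def parse_installs(installs):
    """Hypothetical helper: convert a string like '1,000,000+' to the integer 1000000."""
    return int(installs.replace(',', '').replace('+', ''))

print(parse_installs('1,000,000+'))  # prints 1000000
print(parse_installs('0'))           # prints 0
```

Once the values are numeric, a filter such as `parse_installs(row[5]) >= 500000000` could stand in for comparing against several bucket strings one by one.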

In [33]:
display_freq_table(final_clean_data_google, 5)

1,000,000+ : 15.741367637102236
100,000+ : 11.554953735048521
10,000,000+ : 10.516813360415256
10,000+ : 10.200857594222523
1,000+ : 8.395396073121193
100+ : 6.917174452719477
5,000,000+ : 6.838185511171294
500,000+ : 5.574362446400361
50,000+ : 4.773188896411646
5,000+ : 4.513653802753328
10+ : 3.5432182351613632
500+ : 3.2498307379823967
50,000,000+ : 2.2906793048973144
100,000,000+ : 2.1214172872940646
50+ : 1.9183028661701647
5+ : 0.7898894154818324
1+ : 0.5077860528097494
500,000,000+ : 0.2708192281651997
1,000,000,000+ : 0.22568269013766643
0+ : 0.045136538027533285
0 : 0.011284134506883321


In [34]:
def most_popular_apps_by_genre_google(freq_table_google, dataset):
    apps = []
    for genre in freq_table_google: # start off with 1 genre from the frequency table
        total = 0
        len_genre = 0
        for row in dataset: # iterate through all apps in the clean dataset 
            app_genre = row[1]
            if app_genre == genre:
                num_of_installs = row[5]
                num_of_installs = num_of_installs.replace(',', '') # strip ',' and '+' so the string can be converted to a number
                num_of_installs = num_of_installs.replace('+','')
                num_of_installs = float(num_of_installs)
                total += num_of_installs
                len_genre +=1
        avg_num_installs = total/len_genre
        apps.append([genre, avg_num_installs])
    
    # sort entire list of [genre, avg_num_installs] by avg_num_installs and display in descending order
    apps.sort(key = lambda x : x[1])
    for row in reversed(apps):
        print(row[0], ':', row[1])         

In [35]:
# retrieve a freq table for the Google apps. 1 is the index for 'Category'
freq_table_google = freq_table(final_clean_data_google, 1)

most_popular_apps_by_genre_google(freq_table_google, final_clean_data_google)

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17805627.643678162
PRODUCTIVITY : 16787331.344927534
GAME : 15560965.599534342
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10682301.033377837
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3694276.334922527
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1820673.076923077
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

We can see in the Google data that 'COMMUNICATION' apps have the highest average number of installs of all genres. Let's explore the most popular individual apps within the 'COMMUNICATION' genre.

In [36]:
for row in final_clean_data_google:
    app_name = row[0]
    genre = row[1]
    installs = row[5]
    if genre == 'COMMUNICATION' and (installs == '1,000,000,000+' or installs == '500,000,000+'):
        print(app_name, ':', installs)

Messenger – Text and Video Chat for Free : 1,000,000,000+
WhatsApp Messenger : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+
imo free video calls and chat : 500,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages : 500,000,000+


Six apps within 'COMMUNICATION' have '1,000,000,000+' installs, including 'Messenger – Text and Video Chat for Free', 'WhatsApp Messenger', and 'Google Chrome: Fast & Secure'.

Let's explore the second most popular apps by genre: 'VIDEO_PLAYERS'.

In [37]:
for row in final_clean_data_google:
    app_name = row[0]
    genre = row[1]
    installs = row[5]
    if genre == 'VIDEO_PLAYERS' and (installs == '1,000,000,000+' or installs == '500,000,000+'):
        print(app_name, ':', installs)

YouTube : 1,000,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+


The top three most popular apps within 'VIDEO_PLAYERS', in no particular order, are 'YouTube', 'Google Play Movies & TV', and 'MX Player'.

### Brief Analysis for Google

We can see that the top apps within these two genres are dominated by large companies such as Meta and Google. Breaking into these markets may be tough because several apps already have '1,000,000,000+' installations. We can take a look at the 'BOOKS_AND_REFERENCE' genre instead, because we had some luck with that genre in the Apple data.

In [38]:
for row in final_clean_data_google:
    app_name = row[0]
    genre = row[1]
    installs = row[5]
    if genre == 'BOOKS_AND_REFERENCE' and (installs == '100,000,000+' or installs == '10,000,000+'):
        print(app_name, ':', installs)

Wattpad 📖 Free Books : 100,000,000+
Wikipedia : 10,000,000+
Amazon Kindle : 100,000,000+
Cool Reader : 10,000,000+
Dictionary - Merriam-Webster : 10,000,000+
NOOK: Read eBooks & Magazines : 10,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Oxford Dictionary of English : Free : 10,000,000+
Spanish English Translator : 10,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English Dictionary - Offline : 10,000,000+
Bible : 100,000,000+
Aldiko Book Reader : 10,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Quran for Android : 10,000,000+
Audiobooks from Audible : 100,000,000+
Dictionary.com: Find Definitions for English Words : 10,000,000+
Dictionary : 10,000,000+
JW Library : 10,000,000+
English Hindi Dictionary : 10,000,000+


Looking through the apps within the 'BOOKS_AND_REFERENCE' genre, I had to lower the 'installs' filter to '100,000,000+' and '10,000,000+' because no apps fell within '1,000,000,000+' or '500,000,000+'. Only 4 apps have '100,000,000+' installations, and the next threshold that contains apps is '10,000,000+'. This seems more promising because the market isn't dominated by a handful of companies and there is more variety. Having few apps at the highest installation thresholds and more apps at the lower ones suggests the market is more accessible. Dictionaries and religious books seem to be the most popular types of apps within this genre. Converting a religious text into an ebook, or creating a dictionary for a popular language, could be a plausible idea.