# Analysing Profitable App Profiles on App Store and Google Play Markets
---

For this project, I'm working as a data analyst for a company that builds Android and iOS mobile apps. 

We make our apps available on Google Play and in the App Store. We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app — the more users who see and engage with the ads, the better. 

My goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

# Opening and exploring the data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over 4 million apps requires a significant amount of time and money, so I'll analyze a sample of the data instead. To avoid spending resources on collecting new data myself, I first check whether any relevant existing data is available at no cost. Luckily, there are two data sets that seem suitable for my goals:

- [A dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- [A dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

I'll start by opening and exploring these two data sets. To make them easier to open and explore, I will create 2 functions. The first, named `open_dataset()`, reads the `AppleStore.csv` and `googleplaystore.csv` files, which contain the App Store dataset and the Google Play dataset respectively. 

In [29]:
def open_dataset(file_name):
    '''
    A function that takes a file name, opens the file, reads it and returns its content as a list of lists (header row included)
    '''
    from csv import reader
    # the with statement closes the file once it has been read
    with open(file_name, encoding='utf8') as opened_file:
        read_file = reader(opened_file)
        data = list(read_file)
    return data

In [30]:
# open App Store dataset
app_store_data = open_dataset('AppleStore.csv')
ios = app_store_data[1:]
ios_header = app_store_data[0]

# open Google Play dataset
google_play_data = open_dataset('googleplaystore.csv')
android = google_play_data[1:]
android_header = google_play_data[0]

I'll use the second function, `explore_data()`, repeatedly to print rows in a readable way.

In [31]:
def explore_data(dataset, start, end, rows_and_columns=False):
    '''
    A function that takes 4 parameters, a dataset (a list of lists), start and end (integers that represent the starting and the ending indices of a slice from the dataset), rows_and_columns (a Boolean that has False as a default argument). The function slices the dataset, loops through the slice, and for each iteration, prints a row and adds a new line after that row. If rows_and_columns is True, the function prints the number of rows and columns.
    ''' 
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### Exploration of the iOS apps

In [32]:
#print headers and few rows of ios dataset
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


This data set contains details on 7197 iOS apps. Each app (row) has values for price, user_rating, cont_rating, prime_genre, and more. Here is a [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) to read more about the column descriptions.

### Exploration of the Android apps

In [33]:
#print headers and few rows of android dataset
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


This data set contains details on 10841 Android apps. Each app (row) has values for category, rating, price, genres, and more. Here is a [link](https://www.kaggle.com/lava18/google-play-store-apps) to read more about the column descriptions.

## Data cleaning

### Deleting wrong data

Let's begin by detecting and deleting wrong data. To do this, I'll start by creating a function that checks whether any row has missing data and, if so, prints that row together with its index.

In [34]:
def missing_value(dataset):
    '''
    A function that compares the length of each row with the length of the first row of the dataset in order to find rows with missing values
    '''
    expected_length = len(dataset[0])
    for index, row in enumerate(dataset):  # enumerate gives each row's own index, even if an identical row appears earlier
        if len(row) != expected_length:
            print(row)
            print(index)

In [35]:
#display missing values in ios dataset
missing_value(ios)

In [36]:
#display missing values in android dataset
missing_value(android)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472


The row at index 10472 of the Android dataset has only 12 values instead of 13. The category value is missing, so every later value is shifted one column to the left. I choose to delete this row from my dataset.
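
The check described above can also return its findings instead of printing them, which makes them easier to reuse. The following is only a sketch on made-up toy rows; the helper name `find_short_rows` and the toy data are my own illustration, not part of the datasets.

```python
def find_short_rows(dataset, expected_len):
    """Return (index, row) pairs whose length differs from expected_len."""
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected_len]

# Toy data (made up for illustration): the second row lacks its category.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.8'],
]

bad_rows = find_short_rows(toy_rows, len(toy_header))
print(bad_rows)
```

Returning the pairs makes it possible to inspect the rows first and decide afterwards which indices to delete.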

In [37]:
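# Delete the malformed row. The length check makes this cell safe to run more than once:
# after the deletion, index 10472 holds a complete 13-column row, so nothing else is removed.
if len(android[10472]) != len(android_header):
    del android[10472]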

### Deleting duplicate entries

We don't want to count certain apps more than once when we analyze the data, so we need to remove the duplicate entries and keep only one entry per app. To do this, I create a function that stores the app names in two lists: one for the names of duplicate apps and one for the names of unique apps. Afterwards, I'll inspect the list of duplicate apps.

In [38]:
def detect_duplicates(dataset, index):
    '''
    A function that stores the app names in two lists (one for the names of duplicate apps, one for the names of unique apps), then prints the number of duplicates and a few examples
    '''
    unique_apps = [] 
    duplicate_apps = [] 

    for app in dataset: 
        app_name = app[index] 

        if app_name not in unique_apps:
            unique_apps.append(app_name)
        else:
            duplicate_apps.append(app_name)
    
    print('Number of duplicate apps: ', len(duplicate_apps))
    print('\n')
    print('Examples of duplicate apps: ', duplicate_apps[:15])
    

In [39]:
#display the duplicates apps of the ios dataset
detect_duplicates(ios, 1)

Number of duplicate apps:  2


Examples of duplicate apps:  ['Mannequin Challenge', 'VR Roller Coaster']
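
Checking `app_name not in unique_apps` against a list gets slower as the list grows. As a possible alternative, here is a sketch that counts the names with `collections.Counter`; the helper name `count_duplicates` and the toy rows are hypothetical.

```python
from collections import Counter

def count_duplicates(dataset, index):
    """Map each name that appears more than once to its number of occurrences."""
    counts = Counter(row[index] for row in dataset)
    return {name: n for name, n in counts.items() if n > 1}

# Toy rows (made up): 'Box' appears three times.
toy_rows = [['Box', '10'], ['Slack', '5'], ['Box', '12'], ['Box', '12']]

repeated = count_duplicates(toy_rows, 0)
# Every occurrence after the first one is an extra row.
extra_rows = sum(n - 1 for n in repeated.values())
print(repeated)
print(extra_rows)
```

Here `extra_rows` counts the same thing as the length of `duplicate_apps`: every occurrence of a name after its first one.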


There are 2 possible duplicate apps in the iOS dataset. Let's first look at Mannequin Challenge:

In [40]:
# to look at duplicate apps, I created a function that prints the headers and all the apps that have the same name 
def print_duplicate(header, dataset, index, name):
    print(header)
    print('\n')
    for app in dataset:
        app_name = app[index]
        if app_name == name:
            print(app)
            print('\n')

In [41]:
#display apps with track_name 'Mannequin Challenge'
print_duplicate(ios_header, ios, 1, 'Mannequin Challenge')

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']


['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']




We can see that there are several notable differences (content rating, rating count, app size, and version number). Since the content rating and version number style are different, alongside significantly different app sizes, I believe that these are different apps.

Now let's examine VR Roller Coaster.

In [42]:
#display apps with track_name 'VR Roller Coaster'
print_duplicate(ios_header, ios, 1, 'VR Roller Coaster')

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']


['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']




For these two apps, the decision is more difficult. The name, content rating, and user rating are all the same. However, there is a significant difference in the app size, as well as the version number style.

The version rating count is critical to my decision. As the name suggests, the version rating count (`rating_count_ver`) is the number of ratings for an app's current version, while the rating count (`rating_count_tot`) is the cumulative count of ratings. The first app in the list has 107 total ratings, and 102 of them came from its current version. The second app has 67 total ratings, and 44 of them came from its current version. If these two rows were the same app with data pulled at different times, the row with the higher rating count (107 total, 102 for the current version) would be the later one, and the ratings it received before its current version (107 - 102 = 5) would have to include at least the 44 ratings the other row attributes to version 0.81. A margin of 5 can't cover 44. The two rows also have different `id` values.

I conclude, therefore, that these apps are also different.
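
The arithmetic behind this conclusion can be checked in a few lines. This is only a sketch of the reasoning, assuming that the row with the higher total would be the later snapshot if both rows described the same app; the variable names are mine.

```python
# Counts copied from the two 'VR Roller Coaster' rows printed above.
later = {'total': 107, 'current_version': 102}   # id 952877179, version 2.0.0
earlier = {'total': 67, 'current_version': 44}   # id 1089824278, version 0.81

# Ratings the supposed later snapshot received before its current version.
margin = later['total'] - later['current_version']

# Under the same-app hypothesis, that margin must cover the ratings
# that version 0.81 had already collected in the other row.
same_app_possible = margin >= earlier['current_version']
print(margin, same_app_possible)
```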

Now let's have a look at the duplicate apps in the Android dataset.

In [43]:
#display duplicates in android dataset
detect_duplicates(android, 0)

Number of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


We can see that there are 1181 duplicate entries in the Android dataset. Let's print some of them to see where the differences are.

In [44]:
#print all apps with name 'Quick PDF Scanner + OCR FREE'
print_duplicate(android_header, android, 0, 'Quick PDF Scanner + OCR FREE')

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']




In [45]:
#print all apps with name 'Google My Business'
print_duplicate(android_header, android, 0, 'Google My Business')

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']


['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']


['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']




In [46]:
#print all apps with name 'Instagram'
print_duplicate(android_header, android, 0, 'Instagram')

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




If you examine the rows we printed, the entries for `Google My Business` show no differences at all, and for the 2 other apps the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. This is also the case for the other duplicates.

I use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, I'll only keep the row with the highest number of reviews and remove the other entries for any given app. 

I begin by storing the maximum number of reviews for each app in a dictionary.

In [47]:
reviews_max = {} #to store the max review of each app 

for app in android:
    n_reviews = float(app[3])
    name = app[0]
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews    

I'll now use the dictionary created above to remove the duplicate rows. I keep a row only if its number of reviews equals the maximum stored in the dictionary for that app and its name is not yet in the `already_added` list. The second condition handles apps such as `Google My Business`, which have several entries with the same number of reviews.

In [48]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
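
The two passes above can also be merged into one by keeping, for each name, the best row seen so far. This is only a sketch on made-up toy rows, not the method used in this notebook; the helper name `keep_most_reviewed` is hypothetical, and the result is ordered by the first appearance of each name, which may differ from the order of `android_clean`.

```python
def keep_most_reviewed(dataset, name_index, reviews_index):
    """Keep one row per name: the row with the highest review count.

    On ties, the first row seen is kept.
    """
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows (made up, only 4 columns): name, category, rating, reviews.
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159'],
]

deduplicated = keep_most_reviewed(toy_rows, 0, 3)
for row in deduplicated:
    print(row)
```

A dictionary lookup by name also avoids the repeated `name not in already_added` scan over a growing list.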

### Removing non-English apps

Recall that at our company, we design apps for an English-speaking audience, so I need to remove non-English apps.

One way to do this is to remove each app with a name containing a symbol that isn't commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Each character we use in a string has a corresponding number associated with it. The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system.

Based on this number range, I build a function that checks whether each character of an app name belongs to the set of common English characters. Emojis and symbols such as ™ also fall outside the ASCII range, so I tolerate a few of them: if an app name has 3 or fewer characters whose corresponding number is greater than 127, the app is considered an English app.

In [70]:
def is_english(app_name):
    '''
    A function that takes in a string and returns False if the input string has more than three characters that fall outside the ASCII range (0 - 127) (identify the string as non-English), otherwise it should return True.
    '''
    non_english_car = 0
    
    for character in app_name:
        if ord(character) > 127: #  ord() is a built-in function which gives the corresponding number of each character
            non_english_car += 1
        if non_english_car > 3:
            return False
    return True
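
To see how the three-character tolerance behaves, here is a small self-contained sketch that applies the same rule to a few sample names. The names `count_non_ascii` and `looks_english` and the sample strings are my own illustration.

```python
def count_non_ascii(text):
    """Count the characters whose code point is above 127."""
    return sum(1 for character in text if ord(character) > 127)

def looks_english(app_name, max_non_ascii=3):
    """Same rule as above: tolerate up to max_non_ascii non-ASCII characters."""
    return count_non_ascii(app_name) <= max_non_ascii

# Sample names chosen for illustration.
samples = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in samples:
    print(name, ':', looks_english(name))
```

The first three names pass because they contain at most one non-ASCII character, while the last one is rejected. The rule is a heuristic: a non-English name written entirely in ASCII letters would still pass.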

With this function, we can filter the English apps in both the iOS and Android datasets.

In [71]:
def english_apps(dataset, index):
    '''
    A function to filter out non-English apps from a dataset. It loops through the dataset and, if an app name is identified as English, appends the whole row to a separate list
    '''
    english_rows = []

    for app in dataset:
        name = app[index]
        if is_english(name):
            english_rows.append(app)
    return english_rows

In [72]:
english_ios = english_apps(ios, 1) # english apps of ios dataset

In [73]:
# display english apps in ios dataset
explore_data(english_ios, 0, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 6183
Number of columns: 16


In [74]:
english_android = english_apps(android_clean, 0) # english apps of android

In [75]:
#display english apps in android dataset
explore_data(english_android, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9614
Number of columns: 13


### Isolating the free apps

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. My datasets contain both free and non-free apps; I have to isolate only the free apps for our analysis.

In [76]:
def free_apps(dataset, index):
    '''
    A function that loops through the dataset to isolate the free apps in a separate list.
    '''
    free_english_apps = []
    
    for app in dataset:
        price = app[index]
        if price == '0' or price == '0.0':
            free_english_apps.append(app)
    return free_english_apps
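
Comparing the price with the strings `'0'` and `'0.0'` works for the rows shown so far, but it would miss other spellings of zero such as `'0.00'`. A possible alternative is to compare numbers instead. This is only a sketch: the helper name `is_free` is hypothetical, and it assumes that paid prices may carry a leading `$` (for example `'$4.99'`), which is an assumption about the data rather than something visible in the rows printed above.

```python
def is_free(price_text):
    """Return True when the price string represents zero.

    Assumes the string is a number, optionally prefixed with '$'.
    """
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0

# Sample price strings (made up for illustration).
for price in ['0', '0.0', '0.00', '$4.99']:
    print(price, ':', is_free(price))
```

`float()` raises a `ValueError` on text that is not a number, so malformed rows need to be removed before a check like this one is applied.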

In [77]:
free_english_ios = free_apps(english_ios, 4) # free english apps of the ios dataset

In [78]:
# display the free english apps in the ios dataset
explore_data(free_english_ios, 0, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 3222
Number of columns: 16


After I finished cleaning my iOS dataset, I have 3222 applications left.

In [79]:
free_english_android = free_apps(english_android, 7)  #free english apps of the android dataset

In [80]:
# display the free english apps in android dataset
explore_data(free_english_android, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8864
Number of columns: 13


After I finished cleaning my Android dataset, I have 8864 applications left.

## Most common apps by genre

As I mentioned in the introduction, my goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimize risks and overhead, my validation strategy for an app idea has three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, I need to find app profiles that are successful in both markets. 

Let's begin the analysis by determining the most common genres for each market. For this, I need to build a frequency table for the `prime_genre` column of the ios dataset, and for the `Genres` and `Category` columns of the android dataset.

I'll build two functions we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function that I'll use to display the percentages in a descending order

In [81]:
def freq_table(dataset, index):
    '''
    A function that returns the frequency table (as percentages) for any column we want.
    '''
    freq_genre = {}
    total = 0
    
    for app in dataset:
        total += 1
        genre = app[index]
        if genre in freq_genre:
            freq_genre[genre] += 1
        else:
            freq_genre[genre] = 1
            
    for genre in freq_genre:
        freq_genre[genre] /= total
        freq_genre[genre] *= 100
    
    return freq_genre

In [82]:
def display_table(dataset, index):
    '''
    A function that transforms the frequency table into a list of tuples to be able to sort it
    '''
    table = freq_table(dataset, index)
    table_display = []
    for key in table: 
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
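
The two functions above can also be condensed with `collections.Counter`, whose `most_common()` method already returns the entries sorted by count. This is only a sketch on made-up toy rows; the helper name `percentage_table` is hypothetical, and values with equal counts keep their order of first appearance instead of being sorted by name.

```python
from collections import Counter

def percentage_table(dataset, index):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Toy rows (made up): three games and one music app.
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

table = percentage_table(toy_rows, 1)
for genre, percentage in table:
    print(genre, ':', percentage)
```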

Let's now display the frequency tables.

In [83]:
display_table(free_english_ios, 11) # prime_genre (ios) frequency table

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The most common genre among free English apps on the App Store is by far `Games`, which alone accounts for about 58% of the apps, followed by `Entertainment` (about 8%), `Photo & Video` (about 5%), `Education` (about 3.7%), and `Social Networking` (about 3.3%).
The general impression is that the majority of free English apps are designed more for entertainment than for practical purposes.

In [84]:
display_table(free_english_android, 1) #category (android) frequency table

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [85]:
display_table(free_english_android, 9) #genres (android) frequency table

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Contrary to what we observed on the App Store, no genre really stands out from the others on Google Play.
Indeed, in the Category frequency table, FAMILY represents just under 19% of the apps on Google Play, followed by GAME (9.7%), TOOLS (8.5%), and BUSINESS (4.6%).
The Genres frequency table shows that Tools accounts for 8.4% of apps, Entertainment 6.1%, Education 5.3%, and Business 4.6%.

This already gives us an idea of the most common types of free English apps on the two stores. However, this analysis is not sufficient to recommend a genre of app that the company could develop. It is therefore more useful to analyse which apps are the most popular among users.

## Most popular apps by genre

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre.

For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

### Most popular apps by genre on the App Store

Below, we calculate the average number of user ratings per app genre on the App Store:

In [86]:
prime_genre_freq = freq_table(free_english_ios, 11)

for genre in prime_genre_freq:
    total = 0
    len_genre = 0
    for row in free_english_ios:
        genre_app = row[11]
        if genre_app == genre:
            nb_user_ratings = float(row[5])
            total += nb_user_ratings
            len_genre += 1

    avg_nb_user_rating = total/len_genre
    print(genre, ':',avg_nb_user_rating)


Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
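
The nested loop above scans the whole dataset once per genre. As a possible alternative, the averages can be collected in a single pass with two dictionaries. This is only a sketch on made-up toy rows; the helper name `average_by_group` is hypothetical.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Average the numeric column value_index for each value of group_index."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows (made up): genre and number of ratings.
toy_rows = [['Navigation', '100'], ['Navigation', '300'], ['Weather', '50']]

averages = average_by_group(toy_rows, 0, 1)
print(averages)
```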


On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings:

In [88]:
for app in free_english_ios:
    if app[11] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The same pattern applies to social networking apps, where the average is heavily influenced by a few giants such as Facebook, Pinterest, and Skype. The same applies to music apps, where a few big players such as Pandora, Spotify, and Shazam heavily influence the average.

Our goal is to find popular genres, but navigation, social networking, or music applications may seem more popular than they actually are. The average number of ratings seems to be distorted by the very few apps with hundreds of thousands of user ratings, while other apps may struggle to exceed the 10,000 threshold. We can get a better picture by deleting each type of these extremely popular applications and then recalculating the average, but we will save this level of detail for later.
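
To give a rough idea of how much a few giants distort the picture, here is a sketch that uses the six Navigation rating counts printed above. It compares the mean with the median (a statistic swapped in here for illustration, not one used elsewhere in this notebook) and with the mean once the two largest apps are left out.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps printed above, largest first.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('Mean:', mean(navigation_ratings))
print('Median:', median(navigation_ratings))
# Mean without Waze and Google Maps (the first two values).
print('Mean without the two giants:', mean(navigation_ratings[2:]))
```

The median (8196.5) and the mean without the two largest apps (4146.25) are both far below the overall mean of about 86,090.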

Reference apps have 74,942 user ratings on average, but it is actually the Bible and Dictionary.com that skew the average upward:

In [90]:
for app in free_english_ios:
    if app[11] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0
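
One way to see how far the average sits from a typical Reference app is to compare it with the median. The sketch below uses the 18 rating counts printed above; the median is a swapped-in robust statistic, not something the original analysis computes.

```python
from statistics import mean, median

# Rating counts copied from the Reference output above
reference_ratings = [
    985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122, 8535,
    4693, 1497, 826, 762, 718, 14, 8, 0, 0,
]

ref_mean = mean(reference_ratings)
ref_median = median(reference_ratings)

print(round(ref_mean, 2))  # 74942.11
print(ref_median)          # 6614.0
```

The median of about 6,614 ratings is less than a tenth of the mean, which is consistent with a couple of very large apps pulling the average up.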


However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app, adding features beyond the raw version of the book. These might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. In addition, we could embed a dictionary within the app, so users don't need to leave our app to look up words in an external one.

This idea seems to be in line with the fact that the App Store is dominated by entertainment applications. This indicates that the entertainment applications on the market may be a bit saturated, which means that practical applications may have a better chance to stand out among the many applications in the App Store.

Now let's analyze the Google Play market a bit.

## Most popular apps by genre on Google Play

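We compute the average number of installs for each category. The install counts are stored as open-ended strings such as `1,000,000+`, so we strip the commas and plus signs and treat each value as if it were exact.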

In [91]:
categories_freq = freq_table(free_english_android, 1)

for category in categories_freq:
    total = 0          # sum of installs across the category
    len_category = 0   # number of apps in the category
    for app in free_english_android:
        category_app = app[1]
        if category_app == category:
            # Installs look like '1,000,000+': strip ',' and '+' before converting
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 million or 500 million installs.

We see the same pattern in the video players category, which ranks second with 24,727,872 installs. That market is dominated by apps such as YouTube, Google Play Movies & TV, and MX Player. The pattern repeats for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), and productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants that are hard to compete against.

The game genre seems very popular, but we found earlier that this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre also looks fairly popular, with an average of 8,767,811 installs. It's worth exploring in more depth, because we found that this genre has some potential on the App Store, and our goal is to recommend an app genre that shows profit potential on both the App Store and Google Play.

Let's take a look at some applications of this type and the number of installations:

In [92]:
for app in free_english_android:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

This niche seems to be dominated by software for processing and reading e-books, along with various collections of libraries and dictionaries, so building similar apps is probably not a good idea, since we would face significant competition.
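To get a feel for how these apps spread across popularity levels, one could count them per install bracket. The sketch below uses only the 28 complete rows visible in the truncated output above; `parse_installs()` is a hypothetical helper, and the 1,000,000 to 100,000,000 range is an assumed definition of "mid-range" for this example.

```python
from collections import Counter

# Install strings copied, in order, from the complete rows of the output above
installs = [
    "50,000+", "100,000+", "10,000,000+", "10,000,000+", "100,000+",
    "1,000,000+", "10,000,000+", "500,000+", "1,000,000+", "1,000,000,000+",
    "5,000,000+", "100,000+", "500,000+", "1,000,000+", "1,000,000+",
    "500,000+", "1,000,000+", "10,000+", "5,000,000+", "100,000+",
    "5,000,000+", "500,000+", "1,000,000+", "500,000+", "500,000+",
    "1,000,000+", "5,000,000+", "10,000,000+",
]


def parse_installs(text):
    """Convert a string like '1,000,000+' to its lower bound as an int."""
    return int(text.replace(",", "").replace("+", ""))


bucket_counts = Counter(parse_installs(s) for s in installs)

# Print the brackets from least to most installs
for bracket in sorted(bucket_counts):
    print(f"{bracket:,}+ : {bucket_counts[bracket]}")

# Assumed "mid-range" band: between 1 million and 100 million installs
mid_range = sum(
    count for bracket, count in bucket_counts.items()
    if 1_000_000 <= bracket <= 100_000_000
)
print(mid_range)  # 15
```

In this partial sample, 15 of the 28 apps fall in the assumed mid-range band and only one passes a billion installs. Because the output is truncated, these counts describe the visible rows only, not the whole category.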

We also noticed that there are many apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent one) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, the market already seems full of libraries, so we need to add some special features besides the raw version of the book. These might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

## Conclusion

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.  
 
We concluded that taking a popular book and turning it into an app could be profitable for both the Google Play and the App Store markets. 
The markets are already full of libraries, so we need to add some special features besides the raw version of the book. 
This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc. 

## Acknowledgement

This is a guided project provided by Dataquest for understanding and practicing the fundamentals of Python for data science.