# Profitable App Profiles for the App Store and Google Play Markets

I am working as a data analyst for a company that builds Android and iOS mobile apps and makes them available on Google Play and in the App Store.
We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app.

My goal for this project is to analyse data to help our developers understand what types of apps are likely to attract more users.

### Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. To avoid spending resources on collecting new data ourselves, we should first try to find existing relevant data. Two datasets seem suitable for our purpose:

* A [dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018, and can be downloaded [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

* A [dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017, and can be downloaded [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).


In [1]:
from csv import reader

#Google Play dataset
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

#App store dataset
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
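
Both loading blocks above repeat the same steps. As a minimal sketch (the helper name `load_csv` and the in-memory sample are assumptions made for illustration, not part of the original notebook), the steps could be wrapped in one function that returns the header and the data rows separately:

```python
import csv
import io


def load_csv(file_obj):
    """Read an open CSV file object and return (header, rows)."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# A tiny in-memory sample stands in for the real files here.
sample = io.StringIO("App,Category\nInstagram,SOCIAL\nSlack,BUSINESS\n")
header, data = load_csv(sample)
print(header)
print(len(data))
```

With the real files, the same helper could be called inside a `with open('googleplaystore.csv', encoding='utf8') as f:` block, which also closes the file automatically once the data has been read.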


We will repeatedly use the `explore_data()` function to print rows in a readable way. The function also shows the number of rows and columns in any dataset.

In [2]:
print(ios[2])
print(ios[3])
print(ios[4])

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']
['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


In [3]:
print(ios[2])
print('\n')
print(ios[3])
print('\n')
print(ios[4])

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


In [4]:
def explore_data(dataset, start, end, rows_and_columns=False): #takes in four parameters
#dataset-list, start & end -integers, rows and columns - Boolean with default False    
    dataset_slice = dataset[start:end] #slice the dataset    
    for row in dataset_slice: #loop through the slice, for each iteration
        print(row) #prints a row 
        print('\n') #and adds a new line after row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


Taking a look at the Google Play dataset...

In [5]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We can see that the Google Play dataset contains 10,841 apps and 13 columns. Some columns that might be relevant to our analysis are: `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'`.

Taking a look at the App Store dataset...

In [6]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


The App Store dataset contains 7,197 apps and 16 columns. Some of the columns that would be of interest to us are: `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, and `'prime_genre'`. More details about the columns can be found [here](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

### Deleting Wrong Data

The Google Play dataset has a [discussion centre](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion?sort=undefined), and we can observe that [one of the discussions](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) explains an error for row 10472. Let us compare this row to the header row and another correct row by printing them.

In [7]:
print(android[10472])
print('\n')
print(android_header)
print('\n')
print(android[0])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Row 10472 corresponds to the app *Life Made WI-Fi Touchscreen Photo Frame*, and its rating appears as 19. This is clearly incorrect because the maximum rating for a Google Play app is 5. As mentioned in the [discussion section](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015), the problem is caused by a missing value in the `'Category'` column, which shifts every later value one column to the left (the row has 12 values, while the header has 13).
We will have to delete this row.

In [8]:
print(len(android))
del android[10472]
print(len(android))

10841
10840
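
Rows like this one can also be found programmatically, because a missing value makes the row shorter than the header. The following is a small sketch of that check; the function name `find_malformed_rows` and the sample rows are hypothetical and only serve as illustration:

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(dataset) if len(row) != len(header)]


sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5'],
    ['Broken app', '1.9'],             # one value is missing
    ['Slack', 'BUSINESS', '4.4'],
]

bad = find_malformed_rows(sample_rows, sample_header)
print(bad)
```

On the original, unmodified data, a call such as `find_malformed_rows(android, android_header)` would be expected to flag the faulty row. If several rows were flagged, deleting them from the highest index to the lowest would keep the remaining indexes valid.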


### Removing Duplicate Entries

If we look at the [discussions section](https://www.kaggle.com/datasets/lava18/google-play-store-apps), or explore the Google Play data set long enough, we'll observe that some apps have duplicate entries.

In [9]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [10]:
app_names = ['Instagram', 'Facebook']

print('Instagram' in app_names)
print('X' in app_names)
print(230 in app_names)
print('Facebook' in app_names)
#we use the in operator to check for membership in a list

True
False
False
True


There are actually 1,181 cases in total where an app is duplicated.

In [11]:
duplicate_apps = [] #created a list storing the name of duplicate apps
unique_apps = [] #created a list for storing unique names

for app in android: #looped through the android dataset. for each iteration:
    name = app[0] #saved the app name to a variable named 'name'
    if name in unique_apps: #if 'name' was already in the 'unique_apps' list
        duplicate_apps.append(name) #we append 'name' to the 'duplicate_apps' list
    else: #if 'name' wasn't already in the 'unique_apps' list
        unique_apps.append(name) #we append 'name' to the 'unique_apps'

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Some duplicate apps', duplicate_apps[:15])

Number of duplicate apps: 1181


Some duplicate apps ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
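
Checking membership with `in` on a list scans the list element by element, which gets slower as `unique_apps` grows. As a hedged alternative (the helper name `find_duplicates` and the sample names are assumptions for illustration), the same logic can keep the names it has already seen in a set, where membership checks are fast:

```python
def find_duplicates(names):
    """Return every name that repeats an earlier name, in order of appearance."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates


sample_names = ['Box', 'Slack', 'Box', 'Instagram', 'Box', 'Slack']
dups = find_duplicates(sample_names)
print('Number of duplicate apps:', len(dups))
print(dups)
```

For the Google Play data, the function could be called as `find_duplicates([app[0] for app in android])`.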


We need to remove duplicate entries and keep only one entry per app, so that we do not count certain apps more than once when we analyse the data.

In [12]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


If we examine the duplicate entries for the Instagram app, we observe that the only difference is in the fourth position of each row (*66577313*, *66577446*, *66577313*, *66509917*). This corresponds to the number of reviews. The different figures tell us that the data was collected at different times.

We can use this to come up with a criterion for removing the duplicate entries. The higher the number of reviews, the more recent the data should be. We will keep the row with the highest number of reviews and remove all other entries for any given app.

In [13]:
print('z' in ['a', 'b', 'c'])
print('z' not in ['a', 'b', 'c'])

False
True


We can use both the `in` and `not in` operators to check for membership. The `not in` operator also works on dictionaries; as with the `in` operator, the membership check is done only over the dictionary keys.

In [14]:
name_and_reviews = {'Instagram': 66577313, 'Facebook': 78158306}
print('LinkedIn' not in name_and_reviews)
print('Instagram' not in name_and_reviews)
#we can also use the 'not in' operator to check for membership in a dictionary. 
#as with the case of the 'in' operator, the membership check is only done over 
#the dictionary keys.

True
False


In [15]:
#create a dictionary where each key is a unique app name and the corresponding 
#dictionary value is the highest number of reviews of the app

reviews_max = {} #creating an empty dictionary

for app in android: #loop through the google play data set. for each iteration
    name = app[0] #assign the app name to a variable
    n_reviews = float(app[3]) #convert the number of reviews to float and assign it to a variable
    
    if name in reviews_max and reviews_max[name] < n_reviews: #if these conditions are met
        reviews_max[name] = n_reviews #update the number of reviews for that entry in the reviews_max dictionary
    elif name not in reviews_max: #if name not in the reviews_max dictionary
        reviews_max[name] = n_reviews #create a new entry in the dictionary where the key is the app name, and the value is the number of reviews.

Earlier, we found that there are 1,181 cases where an app is duplicated. This means that the length of our dictionary (of unique apps) should be equal to the difference between the length of our dataset and 1,181.

In [16]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


We will use the `reviews_max` dictionary to remove the duplicates. For each app, we will only keep the entry with the highest number of reviews.

In [17]:
android_clean = [] #create an empty list which will store our  new cleaned dataset
already_added = [] #create an empty list which will just store app names

for app in android: #loop through the google play dataset. for each iteration
    name = app[0] #assign the app name to the variable name
    n_reviews = float(app[3]) #convert the number of reviews to float, and assign it to the variable n_reviews
    
    if (reviews_max[name] == n_reviews) and (name not in already_added): #if these conditions are met:
        android_clean.append(app) #append the entire row to the android_clean list
        already_added.append(name) #append the name of the app name to the already_added list. this is to keep track of the apps that we already added.
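
The two steps above (finding the highest review count, then filtering) could also be combined into a single pass. This is only a sketch under stated assumptions: the function name `remove_duplicates` and its default column indexes are hypothetical, and it relies on dictionaries preserving insertion order (Python 3.7 or later). Unlike the loop above, it orders the result by the first appearance of each app name, so the row order may differ from `android_clean`.

```python
def remove_duplicates(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Store the row if the name is new or this row has more reviews.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

cleaned = remove_duplicates(sample)
for row in cleaned:
    print(row)
```

Because ties are resolved with a strict `>` comparison, the first of several rows with the same maximum is kept, which matches the role of the `already_added` list above.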

Let us quickly explore the new dataset to confirm that the number of rows is 9,659.

In [18]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


We have 9,659 rows, as expected. This is the same number we got after subtracting the number of duplicated rows from the length of our dataset.

### Removing Non-English Apps

If we explore the data long enough, we will find out that both datasets have apps with names that suggest they are not designed for an English-speaking audience.


In [19]:
print(ios[813][1])
print(ios[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


We are not interested in keeping these apps, so we will remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.) and other symbols (+, *, /, etc.).

All the characters we commonly use in an English text are encoded using the ASCII (American Standard Code for Information Interchange) system. Each ASCII character has a corresponding number between 0 and 127 associated with it.

For instance, the corresponding number for character `'a'` is 97, character `'A'` is 65, and character `'爱'` is 29,233.
We can get the corresponding number of each character using the `ord()` [built-in function](https://docs.python.org/3/library/functions.html#ord).

In [20]:
print(ord('a'))
print(ord('A'))
print(ord('爱'))
print(ord('5'))
print(ord('+'))

97
65
29233
53
43


If an app name contains a character whose corresponding number is greater than 127, then it probably means that the app has a non-English name.

In Python, strings are indexable and iterable, which means we can use indexing to select an individual character, and we can also iterate over the string using a for loop.

In [21]:
string = 'abc'
print(string[0])
print(string[1])
print(string[2])

a
b
c


In [22]:
for character in string:
    print(character)

a
b
c


We can take advantage of these principles and the built-in `ord()` function to build a function that checks an app name and tells us whether it contains non-ASCII characters.

In [23]:
def is_english(string): #takes in a string 
    
    for character in string: #iterate over the input string
        if ord(character) > 127: #for each iteration, check whether the number
                                   #associated with the character is > 127
            return False #when a character > 127, immediately return false
        
    return True # if the loop finishes running without the return statement,
                  #the function should return true. 
    
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False
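
For comparison, Python 3.7 and later provide the built-in string method `str.isascii()`, which performs the same all-or-nothing check. The wrapper name `is_ascii_only` below is an assumption used only for this sketch:

```python
def is_ascii_only(string):
    """Return True if every character in the string is in the ASCII range."""
    return string.isascii()  # requires Python 3.7 or later


names = ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播']

results = [is_ascii_only(name) for name in names]
# The same check written out with ord(), as in the function above.
manual = [all(ord(character) <= 127 for character in name) for name in names]

print(results)
print(results == manual)
```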


In [24]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

False
False


The function seems to work fine, but it can't correctly identify certain English names that use emojis or other symbols (™, — (em dash), – (en dash), etc.). This is because emojis and characters like `™` fall outside the ASCII range, and have corresponding numbers over 127.

In [25]:
print(ord('™'))
print(ord('😜'))

8482
128540


If we use the function as it is, we will lose useful data since many English apps will be incorrectly labeled as non-English. To minimise the impact of data loss, we will only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

In [26]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False


The function is not perfect, but very few non-English apps should get past our filter, and it will have to do at this point in our analysis.
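
To make the trade-off explicit, the threshold can be turned into a parameter. The following is a sketch only; the name `is_probably_english` and the parameter `max_non_ascii` are assumptions introduced here, not part of the analysis itself:

```python
def is_probably_english(string, max_non_ascii=3):
    """Return True if the string has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

# A stricter threshold rejects the emoji name again.
print(is_probably_english('Instachat 😜', max_non_ascii=0))

# Limitation: a short non-English name with three or fewer
# non-ASCII characters still passes the default filter.
print(is_probably_english('爱奇艺'))
```

The last call illustrates why the filter is imperfect: a name made up of only three non-ASCII characters is still accepted.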

We use the new function to filter out non-English apps from both datasets:

In [27]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

We are left with 9614 Android apps and 6183 iOS apps.

### Isolating Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our datasets contain both free and non-free apps; we will need to isolate only the free apps for our analysis.

In [28]:
android_final = []
ios_final = []

#looping through each dataset to isolate free apps in separate lists
for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8864
3222


We are left with 8864 Android apps and 3222 iOS apps for our analysis.
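
The comparisons above depend on the exact strings `'0'` and `'0.0'`. A slightly more defensive sketch converts the price to a number first. The helper name `is_free` is hypothetical, and the price formats used in the examples (such as `'$4.99'` or `'2.99'` for paid apps) are assumptions about how the datasets encode prices:

```python
def is_free(price):
    """Return True if a price string represents zero."""
    cleaned = price.strip().replace('$', '')  # drop a currency sign, if any
    return float(cleaned) == 0.0


print(is_free('0'))
print(is_free('0.0'))
print(is_free('$4.99'))
print(is_free('2.99'))
```

A value that cannot be converted to a number would raise a `ValueError`, which is arguably preferable to silently treating a malformed price as paid.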

### Most Common Apps by Genre

Our aim is to determine the kinds of apps that are likely to attract more users. This is because our revenue is highly influenced by the number of people using our apps.

To minimise risks and overheads, our validation strategy for an ideal app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

We will begin our analysis by getting a sense of the most common genres for each market. To achieve this, we will build a frequency table for the `Genres` and `Category` columns of the Google Play [dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) and the `prime_genre` column of the App Store [dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

We will build two functions we can use to analyse the frequency tables:

* One function to generate frequency tables that show percentages
* Another function we can use to display the percentages in descending order.


Because dictionaries are not sorted by their values, the frequency tables would be difficult to analyse as they are. We will make use of the `sorted()` [function](https://docs.python.org/3/library/functions.html#sorted) to build the second function, which helps us display the entries of the frequency table in descending order.

The function takes in an iterable data type (like a list, dictionary, tuple, etc.), and returns a list of the elements of that iterable, sorted in ascending or descending order (the `reverse` parameter controls whether the order is ascending or descending).

In [29]:
a_list = [50, 20, 100]
print(sorted(a_list))
print(sorted(a_list, reverse = True))

[20, 50, 100]
[100, 50, 20]


The `sorted()` function does not work too well with dictionaries because it only considers and returns the dictionary keys.

In [30]:
freq_table = {'Genre_1': 50, 'Genre_3': 20, 'Genre_2': 100}
sorted(freq_table)

['Genre_1', 'Genre_2', 'Genre_3']

The results of `sorted(freq_table)` were displayed, even though we did not specify the `print()` command. This is a feature of Jupyter - the output from the last command is displayed by default, without specifying `print()`.

The `sorted()` [function](https://docs.python.org/3/library/functions.html#sorted) works well if we transform the dictionary into a list of tuples, where each tuple contains a dictionary key along with its corresponding dictionary value. To ensure the sorting works right, the dictionary value comes first, and the dictionary key comes second:

In [31]:
freq_table = {'Genre_1': 50, 'Genre_3': 20, 'Genre_2': 100}
freq_table_as_tuple = [(50, 'Genre_1'), (20, 'Genre_3'), (100, 'Genre_2')]
sorted(freq_table_as_tuple)

[(20, 'Genre_3'), (50, 'Genre_1'), (100, 'Genre_2')]
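
As an aside, `sorted()` also accepts a `key` argument, so the same ordering can be obtained without swapping keys and values by hand. This is offered as an alternative sketch rather than the approach used in the rest of this notebook; the variable names are chosen for illustration:

```python
sample_freq = {'Genre_1': 50, 'Genre_3': 20, 'Genre_2': 100}

# items() yields (key, value) pairs; the key function sorts them by value.
by_value = sorted(sample_freq.items(), key=lambda item: item[1], reverse=True)
print(by_value)
# [('Genre_2', 100), ('Genre_1', 50), ('Genre_3', 20)]
```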

In [32]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages


def display_table(dataset, index): #takes in two parameters dataset will be a 
                                   #list of lists, index will be an integer
    table = freq_table(dataset, index) # generates a frequency table
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key) #transforms the frequency table into
                                             #a list of tuples

        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True) #sorts the list in desc
    for entry in table_sorted:
        print(entry[1], ':', entry[0]) #prints the entries of the frequency table
                                       #in desc order
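
The standard library's `collections.Counter` can do the counting and the sorting in one go. The sketch below is an alternative to the two functions above, under the assumption that a list of `(value, percentage)` pairs is an acceptable return type; the name `freq_table_counter` is hypothetical:

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return (value, percentage) pairs, most frequent value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


sample = [['a', 'GAME'], ['b', 'TOOLS'], ['c', 'GAME'], ['d', 'GAME']]
table = freq_table_counter(sample, 1)

for value, percentage in table:
    print(value, ':', percentage)
```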
        


We start by examining the `Genres` and `Category` columns of the Google Play dataset (these two columns seem to be related).

In [33]:
display_table(android_final, 1) #category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

It seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) consists mostly of games for kids.

In [34]:
display_table(android_final, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The difference between `Genres` and the `Category` columns is not crystal clear, but one thing we can notice is that the `Genres` column is much more granular (it has more categories). We are only looking for the bigger picture at the moment, so we will only work with the `Category` column moving forward.

We continue by examining the frequency table for the `prime_genre` column of the App Store dataset.

In [35]:
display_table(ios_final, -5) #prime_genre (index -5 is the same as index 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can see that among the free English apps, more than half (58.16%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our dataset.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo & video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous does not imply that they also have the greatest number of users.

Remember that our dataset only contains free English apps, so we should be careful not to extend our conclusions beyond this scope. If we find that games are the most numerous among the free English apps on the App Store, it does not mean we will see the same pattern on the App Store as a whole.

Up to this point, we found that Google Play shows a more balanced landscape of both practical and for-fun apps, while the App Store is dominated by apps designed for fun.
The next step is to find out which kinds of apps have the most users.

### Most Popular Apps by Genre on Google Play

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the `Installs` column, which should give us a clearer picture of genre popularity.

In [36]:
display_table(android_final, 5) #Installs

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


The install numbers do not seem precise enough, with values that are open-ended (100+, 1,000+, 5,000+, etc). We can not tell whether an app with 100,000+ installs has 100,000 installs, 200,000 or 350,000. We however do not need very precise data for our purposes, we only want to get an idea of which app genres attracts the most users, and we do not need perfect precision with respect to the number of users.

We are going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000 installs has 1,000,000 installs, and so on.

To perform computatuons, we will need to convert each install number to `float`. This means that we need to remove the commas and the plus character, otherwise the conversion will fail and raise an error. We will do this directly in the loop below, where we also compute the average number of installs for each genre (category).

To remove characters from strings, we can use the `str.replace(old, new)` [method](https://docs.python.org/3/library/stdtypes.html#str.replace) (a method, just like `list.append()` or `list.copy()`). `str.replace()` takes two parameters, `old` and `new`, and returns a new string in which all occurrences of `old` are replaced with `new`.

In [37]:
n_installs = '100,000+'
print(n_installs.replace('+', 'plus'))
print(n_installs.replace('1', 'one'))
print(n_installs.replace('&', 'ampersand')) #no change

100,000plus
one00,000+
100,000+


To remove certain characters, we can replace them with the empty string `''`.

In [38]:
n_installs = '100,000+'
print(n_installs.replace('+', ''))

100,000


Note that we need to reassign the result to `n_installs` if we want our changes saved, because `str.replace()` returns a new string and leaves the original unchanged.

In [39]:
n_installs = '100,000+'
n_installs = n_installs.replace('+', '')
print(n_installs)
n_installs = n_installs.replace(',', '')
print(n_installs)

100,000
100000
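
The two `replace` calls can also be wrapped in a small helper so that the cleaning step is written only once. The sketch below is an illustration only; `parse_installs` is a hypothetical name that is not used elsewhere in this project.

```python
def parse_installs(raw):
    """Turn an install string such as '100,000+' into a float."""
    # remove the commas and the plus sign before converting
    cleaned = raw.replace(',', '').replace('+', '')
    return float(cleaned)

print(parse_installs('100,000+'))
print(parse_installs('1,000,000,000+'))
print(parse_installs('0'))
```

Because the open-ended values are read as their lower bound, `'100,000+'` becomes `100000.0`, which matches the assumption stated earlier.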


In [40]:
categories_android = freq_table(android_final, 1) #frequency table of the Category column (cleaned dataset)

for category in categories_android: #loop over the unique categories
    total = 0 #sum of installs for this category
    len_category = 0 #number of apps in this category
    for app in android_final: #loop over the Google Play dataset
        category_app = app[1] #category of the current app
        if category_app == category:
            n_installs = app[5] #number of installs as a string, e.g. '100,000+'
            n_installs = n_installs.replace(',', '') #remove the commas
            n_installs = n_installs.replace('+', '') #remove the plus
            total += float(n_installs) #convert to float and add to the running total
            len_category += 1 #count the app
    avg_n_installs = total / len_category #average number of installs for this category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

We notice that, on average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps that have over one billion installs (such as WhatsApp and Facebook Messenger), and a few others with over 100 and 500 million installs.

In [41]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                     or app[5] == '500,000,000+'
                                     or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If we removed all the communication apps that have 100 million installs or more, the average would be reduced roughly ten times:

In [42]:
under_100m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100m.append(float(n_installs))

sum(under_100m) / len(under_100m)        

3603485.3884615386
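
As a quick check of the "roughly ten times" claim, the two averages reported above can be divided directly. This is only an arithmetic sketch; the variable names are illustrative and the two values are copied from the outputs above.

```python
# average for all communication apps (from the per-category output above)
avg_all = 38456119.167247385
# average after removing apps with 100,000,000+ installs (from the previous cell)
avg_under_100m = 3603485.3884615386

ratio = avg_all / avg_under_100m
print(round(ratio, 1))
```

The ratio comes out at about 10.7, which is consistent with the estimate.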

We observe the same pattern for the video players category, with an average of 24,727,872 installs. The market is dominated by apps like YouTube and Google Play Movies. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), and productivity apps (Microsoft Word, Dropbox, Google Calendar, etc.).

The main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.

The game genre seems pretty popular, but previously we found out that this part of the market is a bit saturated, so we would like to come up with a different app recommendation if possible.

The books and reference genre looks relatively popular as well, with an average number of 8,767,811 installs. Let's take a look at some of the apps from this genre and their number of installs.

In [43]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

The books and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there is still a small number of extremely popular apps that skew the average.

In [44]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


It seems that there are only a few very popular apps, so this market still shows potential. Let us try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [45]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                           or app[5] == '5,000,000+'
                                           or app[5] == '10,000,000+'
                                           or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H
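
The filter above lists each install bucket by hand. A possible alternative, sketched below under stated assumptions, converts the install string to a number and compares it against a range. `parse_installs` and `sample_apps` are hypothetical names; the sample rows are simplified `(name, installs)` pairs copied from the outputs above rather than full rows of `android_final`.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(raw.replace(',', '').replace('+', ''))

# Hypothetical sample: (app name, installs) pairs taken from the outputs above
sample_apps = [
    ('E-Book Read - Read Book for free', '50,000+'),
    ('Wikipedia', '10,000,000+'),
    ('Ancestry', '5,000,000+'),
    ('Bible', '100,000,000+'),
    ('Google Play Books', '1,000,000,000+'),
]

mid_popularity = []
for name, installs in sample_apps:
    # keep apps with at least 1,000,000 and fewer than 100,000,000 installs
    if 1_000_000 <= parse_installs(installs) < 100_000_000:
        mid_popularity.append(name)

print(mid_popularity)
```

With this sample, only Wikipedia and Ancestry fall inside the range. A numeric comparison like this avoids having to enumerate every bucket label.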

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries. It is probably not a good idea to build similar apps since there will be significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. Taking a popular book (perhaps a more recent one) and turning it into an app could be profitable for both the Google Play and the App Store markets.

Be that as it may, it looks like the market is already full of libraries, so we may need to add some special features besides the raw version of the book, such as quizzes on the book, a forum where people can discuss the book, etc.

### Most Popular Apps by Genre on the App Store

One way to find out which genres are the most popular is to calculate the average number of `Installs` for each app genre, but this information is missing for the App Store dataset. As a workaround, we will take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

Let us start with calculating the average number of user ratings per app genre on the App Store.

To calculate the average number of user ratings for each genre, we will use a nested for loop.

In [46]:
some_strings = ['FIRST', 'SECOND']
some_integers = [1, 2, 3, 4, 5]

for string in some_strings: #iterate over the some_strings list, for each iteration
    print(string) #print string (iteration variable)
    
    for integer in some_integers: #start another iteration over the list some_integers
        print(integer) #for each iteration over this list, we print integer (iteration variable)

FIRST
1
2
3
4
5
SECOND
1
2
3
4
5


We can see that for each of the two iterations over the list `some_strings` (there are two iterations because `some_strings` only contains two elements), there is another inner iteration happening over the list `some_integers`.

The second iteration over `some_strings` begins only when the iteration over `some_integers` is finished. Notice that all the elements of the list `some_integers` are printed for each of the two iterations over the list `some_strings`.

We call a loop inside another loop a **nested loop**. We will use a nested loop to calculate the averages we mentioned above.

In [47]:
genres_ios = freq_table(ios_final, -5) #frequency table for the prime_genre column, used to get the unique app genres

for genre in genres_ios: #loop over the unique genres
    total = 0 #sum of user ratings for this genre
    len_genre = 0 #number of apps in this genre
    for app in ios_final: #loop over the App Store dataset
        genre_app = app[-5] #genre of the current app
        if genre_app == genre: #the app belongs to the genre of the outer loop
            n_ratings = float(app[5]) #number of user ratings as a float
            total += n_ratings #add the number of ratings to the running total
            len_genre += 1 #count the app
    avg_n_ratings = total / len_genre #average number of user ratings for this genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
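
The nested loop scans the whole dataset once per genre. As an alternative sketch (not the approach used in this project), the same averages can be accumulated in a single pass with two dictionaries. `sample_rows` is a hypothetical, simplified list of `(genre, rating count)` pairs using values from the App Store dataset; the real rows have more columns.

```python
# Hypothetical sample: (genre, number of ratings as a string) pairs
sample_rows = [
    ('Navigation', '345046'),
    ('Navigation', '154911'),
    ('Reference', '985920'),
    ('Reference', '26786'),
]

totals = {}  # sum of ratings per genre
counts = {}  # number of apps per genre
for genre, n_ratings in sample_rows:
    totals[genre] = totals.get(genre, 0.0) + float(n_ratings)
    counts[genre] = counts.get(genre, 0) + 1

# divide each total by the matching count to get the average
averages = {genre: totals[genre] / counts[genre] for genre in totals}
for genre, avg in averages.items():
    print(genre, ':', avg)
```

For this sample the averages are 249978.5 for Navigation and 506353.0 for Reference. On a dataset of a few thousand rows the difference in running time is negligible, so the nested loop remains a reasonable choice here.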


On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user reviews.

In [48]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. The same applies to music, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking, or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by a very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then reworking the averages.
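
To illustrate how much the largest apps move the average, the sketch below recomputes the navigation average after dropping the most-rated apps. The rating counts are copied from the navigation output above; the cutoff of 100,000 ratings is an arbitrary assumption chosen for illustration, not a value used in this analysis.

```python
# Rating counts for the Navigation genre, copied from the output above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

# assumed cutoff for "extremely popular"; not part of the original analysis
threshold = 100_000

# keep only the apps below the cutoff
kept = [n for n in navigation_ratings if n < threshold]

avg_all = sum(navigation_ratings) / len(navigation_ratings)
avg_kept = sum(kept) / len(kept)

print(round(avg_all, 2))
print(avg_kept)
```

With all six apps the average is 86090.33, which matches the navigation figure in the table; without Waze and Google Maps it falls to 4146.25.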

Reference apps have 74,942 user ratings on average, but it is actually the Bible and Dictionary.com apps that skew the average number of ratings upward.

In [49]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


It is interesting to explore this since we found that this genre has some potential to work well on Google Play, and our aim is to recommend an app genre that shows potential for being profitable for both the App Store and Google Play.

One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. In addition, we could also embed a dictionary within the app, so users do not need to exit our app to look up words in an external app.

This idea fits well with the fact that the App Store is dominated by for-fun apps. This suggests that the market may be a bit saturated with for-fun apps, which means a practical app might have a better chance of standing out among the large number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, and finance. The book genre seems to overlap a bit with the app idea we described above, but the other genres do not seem too interesting to us:

* Weather apps - generally people do not spend too much time in-app, so the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

* Food and drink - making a popular food and drink app requires actual cooking and a delivery service which is outside the scope of our company.

* Finance apps - building a finance app requires domain knowledge, and we do not want to hire a finance expert just to build an app.

### Conclusion

In this project, we analysed data about the Google Play and App Store mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets seem to be already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.