# Profitable App Profiles for the App Store and Google Play Markets

We are working as data analysts for a company that builds apps for Android and iOS devices for an English-speaking audience. Our apps are free of charge, so our main source of revenue is in-app ads. Hence the number of users largely determines the revenue for each app: the more users who see and engage with the ads, the better. The goal of this project is to clean and analyse data to help our developers understand what kind of app is likely to attract more users.

## Opening and exploring the data

There are roughly 2 million apps on the App Store and 2.1 million on the Google Play Store. Collecting the data for 4.1 million apps would require a significant investment of time and money, so we focus on two smaller, more suitable datasets.

In [1]:
opened_apple = open('AppleStore.csv', encoding = 'utf8')
opened_google = open('googleplaystore.csv', encoding = 'utf8')

from csv import reader

read_apple = reader(opened_apple)
read_google = reader(opened_google)

apple = list(read_apple)
apple_header = apple[0]
apple = apple[1:]

google = list(read_google)
google_header = google[0]
google = google[1:]

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

        
explore_data(apple, 1, 5, rows_and_columns = True)

explore_data(google, 1, 5, rows_and_columns = True)

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw 
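
The loading steps in the first cell can also be wrapped in a small helper. The following is a minimal sketch rather than the notebook's own code: `parse_csv` and `load_csv` are hypothetical names, and the `with` statement is used so that the file is closed automatically once it has been read. The demo parses a made-up in-memory sample so that it runs without the dataset files.

```python
from csv import reader
from io import StringIO


def parse_csv(file_obj):
    """Split an open CSV file (or file-like object) into a header and data rows."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]


def load_csv(path, encoding='utf8'):
    """Open the file at `path`, parse it, and close it again automatically."""
    with open(path, encoding=encoding, newline='') as opened_file:
        return parse_csv(opened_file)


# Demo on a made-up two-row sample (hypothetical data, not from the real files)
sample = StringIO('App,Category\n"Sketch - Draw & Paint",ART_AND_DESIGN\nBox,BUSINESS\n')
sample_header, sample_rows = parse_csv(sample)
print(sample_header)
print(sample_rows)
```

With the real files, this could be called as `apple_header, apple = load_csv('AppleStore.csv')`.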

Here are the column names and descriptions for the iOS app data. See the [documentation](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) for more details.

| Column name | Description |
| ----------- | ----------- |
| id | App ID |
| track_name | App name |
| size_bytes | App size (in bytes) |
| currency | Currency type |
| price | App price |
| rating_count_tot | Number of user ratings for all versions |
| rating_count_ver | Number of user ratings for current version |
| user_rating | Average user rating for all versions |
| user_rating_ver | Average user rating for current version |
| ver | Latest version code |
| cont_rating | Content rating |
| prime_genre | App genre |
| sup_devices.num | Number of supported devices |
| ipadSc_urls.num | Number of screenshots shown for display |
| lang.num | Number of supported languages |
| vpp_lic | VPP device-based licensing enabled |

In [2]:
apple_header

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

The columns that could be useful for our analysis include `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver` and `prime_genre`.

We do the same for the Android app data. The column names are self-explanatory, so we do not describe them in detail. See the [documentation](https://www.kaggle.com/datasets/lava18/google-play-store-apps) for details.

In [3]:
google_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

The columns that seem useful to us are `App`, `Category`, `Reviews`, `Installs`, `Price` and `Genres`.

## Deleting Wrong Data

We must delete/correct inaccurate or duplicate data before proceeding with the analysis. Recall that we are only interested in the free apps, and are designing for an English-speaking audience. So we must also remove paid and non-English apps.

This [discussion](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) describes an error in row 10472 of the Google Play Store app data (counting from zero and excluding the header row). Let's explore this issue.

In [4]:
print(google[10472])
print("\n")
print(google_header)
print("\n")
print(google[0])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


We can see that the value for the `Category` column is missing, which shifts every subsequent value in that row one column to the left (the row has 12 fields instead of 13). So we delete this row:

In [5]:
del google[10472] #be careful to not run this more than once, to avoid deleting more rows than desired

In [6]:
google[10472]

['osmino Wi-Fi: free WiFi',
 'TOOLS',
 '4.2',
 '134203',
 '4.1M',
 '10,000,000+',
 'Free',
 '0',
 'Everyone',
 'Tools',
 'August 7, 2018',
 '6.06.14',
 '4.4 and up']
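
Rows like this one can also be located programmatically, because a shifted row has fewer fields than the header. The sketch below is one possible check, shown on made-up data; `find_malformed_rows` is a hypothetical helper, not part of the original notebook. Filtering into a new list (rather than using `del`) also means that re-running the cell cannot remove extra rows.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [index for index, row in enumerate(rows) if len(row) != expected]


# Made-up sample: the second row is missing its 'Category' value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Box', 'BUSINESS', '4.2'],
    ['Broken app', '1.9'],
    ['Slack', 'BUSINESS', '4.4'],
]

bad_indices = find_malformed_rows(sample_header, sample_rows)
print(bad_indices)

# Build a new list instead of using `del`, so re-running the cell is harmless
clean_rows = [row for row in sample_rows if len(row) == len(sample_header)]
print(len(clean_rows))
```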

Looking at the [discussion section](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps/discussion) for the App Store dataset, there do not seem to be any reports of wrong data.

## Removing duplicate entries

There are 4 entries with the name `Instagram` in the Google Play dataset:

In [7]:
for app in google:
    name = app[0]
    if name == "Instagram":
        print(app)
        print("\n")

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




Let's write some code to find out how many duplicate app names there are:

In [8]:
unique_apps = []
duplicate_apps = []

for app in google:
    name = app[0]
    if name not in unique_apps:
        unique_apps.append(name)
    else:
        duplicate_apps.append(name)
print("Number of unique apps:", len(unique_apps), "\nNumber of duplicate apps:",len(duplicate_apps),"\n")

print("Some duplicate app names:\n", duplicate_apps[:20])

Number of unique apps: 9659 
Number of duplicate apps: 1181 

Some duplicate app names:
 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']
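
As an aside, the same counts can be obtained with `collections.Counter` from the standard library, which avoids the repeated `not in` list lookups. This is a sketch on made-up names, offered as an alternative rather than as the method used in this notebook.

```python
from collections import Counter

# Made-up sample of app names (hypothetical data)
sample_names = ['Instagram', 'Box', 'Instagram', 'Slack', 'Box', 'Instagram']

name_counts = Counter(sample_names)

# Every occurrence beyond the first counts as a duplicate entry
n_unique = len(name_counts)
n_duplicates = sum(count - 1 for count in name_counts.values())
repeated = [name for name, count in name_counts.items() if count > 1]

print(n_unique, n_duplicates, repeated)
```

On the real data, the names would come from `[app[0] for app in google]`.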


We only want to keep one record for each app name. We'll keep the one with the most reviews as this is most likely to correspond to the most recent data. To do that, we will:

- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

- Use the information stored in the dictionary and create a new dataset, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

In [9]:
reviews_max = {}

for app in google:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
reviews_max['Instagram']

66577446.0

We now create the new dataset, without duplicates. 

We create two empty lists: one to hold the new dataset (without duplicates) and one to keep track of the app names that have already been added to it. We loop through the original dataset, reading each app's name and its number of reviews as a float. If the number of reviews is the same as the maximum that we found earlier, and the name isn't already in `already_added`, we append the app to the new dataset and its name to the `already_added` list. We need the `not in already_added` condition to account for cases where several duplicate entries share the maximum number of reviews.

In [10]:
android_clean = []
already_added = []

for app in google:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
len(android_clean)

9659
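
The two passes above can also be combined into one by storing the best row seen so far for each name. The following is a sketch under the same column layout (name at index 0, reviews at index 3); `keep_most_reviewed` is a hypothetical helper. One difference from the cells above: the result is ordered by the first appearance of each app name, not by the position of the row that is kept.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that ties keep the row that was seen first
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())


# Made-up rows with the same column positions as the Google Play data
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]
deduplicated = keep_most_reviewed(sample)
print(deduplicated)
```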

We can run the same check on the iOS apps, this time using the unique `id` column, but it turns out that there are no duplicates:

In [11]:
apple_header    

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [12]:
unique_ios_apps = []
duplicate_ios_apps = []

for app in apple:
    id_number = app[0]
    if id_number not in unique_ios_apps:
        unique_ios_apps.append(id_number)
    else:
        duplicate_ios_apps.append(id_number)
        
len(unique_ios_apps), len(duplicate_ios_apps)

(7197, 0)

## Removing non-English apps

One way to remove the non-English apps is to remove any app whose name contains characters not commonly used in English text, i.e. anything that isn't a letter of the English alphabet, a digit from 0 to 9, a punctuation mark (,.?!;:), or a special character (+_*/\`@#~-=). The characters commonly used in English text all belong to the _ASCII_ character set.

Each character has a number associated with it (its code point), which we can obtain using the `ord()` function. For ASCII characters, this number is always in the range 0 to 127. So we can remove every app whose name contains a character with a number outside this range.

In [13]:
ord('a')

97

In [14]:
ord('ñ')

241

We can iterate over each character in the string:

In [15]:
def is_English(string):
    for char in string:
        if ord(char) > 127:
            return False
    return True

In [16]:
print(is_English('Instagram'))
print(is_English('Dolphin Browser - Fast, Private & Adblock🐬'))

True
False


This works, but it labels some English apps as non-English merely because their names contain an emoji or two, so we end up removing far too many apps. We will make our condition less stringent by allowing names that contain up to 3 characters outside the ASCII range.

In [17]:
def is_English(string):
    non_ascii = 0
    
    for char in string:
        
        if ord(char) > 127:
            non_ascii += 1
            
        if non_ascii > 3:
            return False
        
    return True

print(is_English('Instagram'))
print(is_English('Dolphin Browser - Fast, Private & Adblock🐬'))
print(is_English('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_English('Docs To Go™ Free Office Suite'))
print(is_English('Instachat 😜'))
print(is_English('Español'))

True
True
False
True
True
True
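
The threshold of 3 is a judgment call, so it can be convenient to expose it as a parameter. The sketch below is a variant of the function above, not a replacement for it; `count_non_ascii`, `looks_english` and `max_non_ascii` are hypothetical names introduced here for illustration.

```python
def count_non_ascii(text):
    """Count the characters whose code point lies outside the ASCII range 0-127."""
    return sum(1 for char in text if ord(char) > 127)


def looks_english(text, max_non_ascii=3):
    """Treat a name as English if it has at most `max_non_ascii` non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii


print(looks_english('Instagram'))
print(looks_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(looks_english('Español', max_non_ascii=0))
```

Setting `max_non_ascii=0` reproduces the strict first version, which rejects `'Español'`.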


This function is still not perfect: some non-English apps may still get through, but it is good enough for our analysis. Let's now filter out the non-English apps:

In [18]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_English(name):
        android_english.append(app)

for app in apple:
    name = app[1]
    if is_English(name):
        ios_english.append(app)

In [19]:
print("Android English apps:")
explore_data(android_english,0,10)
print("iOS English apps:")
explore_data(ios_english,0,10)

Android English apps:
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', 

In [20]:
print("Number of English Android apps:", len(android_english))
print("Number of English iOS apps:", len(ios_english))

Number of English Android apps: 9614
Number of English iOS apps: 6183


## Isolating the Free Apps

So far we have:

- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps

It remains to remove the paid apps, as our company is only interested in apps that are free to download/install.

In [21]:
android_apps = []
ios_apps = []

for app in android_english:
    price = app[7]
    if price == "0":
        android_apps.append(app)
        
for app in ios_english:
    price = app[4]
    if price == "0.0":
        ios_apps.append(app)
        
explore_data(android_apps,0,5)
        
explore_data(ios_apps,0,5)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0'

In [22]:
len(android_apps), len(ios_apps)

(8864, 3222)
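
The comparisons above rely on the exact strings `"0"` and `"0.0"`. A numeric comparison is a little more forgiving of formatting differences. The sketch below assumes that non-zero prices may carry a leading `$` sign; `is_free` is a hypothetical helper, and treating unparseable values as "not free" is a design choice made here, not something taken from the datasets' documentation.

```python
def is_free(price_text):
    """Return True if a price string such as '0', '0.0' or '$4.99' represents zero."""
    cleaned = price_text.strip().replace('$', '')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Assumption: a value that cannot be parsed as a number is not a free app
        return False


# Made-up price strings (hypothetical data)
sample_prices = ['0', '0.0', '$4.99', '1.99', 'Everyone']
free_flags = [is_free(price) for price in sample_prices]
print(free_flags)
```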

We have now finished cleaning our data. Our goal now is to use the data to identify which apps are likely to attract more users, in order to maximise ad revenue. To minimise risks, our validation strategy for an app idea has three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

The end goal is to make an app suitable (and preferably profitable) in both the Android and iOS markets. Let's begin by determining the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our datasets. We will display the frequencies as percentages. We use a dictionary to do this.

In [23]:
def freq_table(dataset, index):
    frequency_table = {}
    total = 0
    for app in dataset:
        item = app[index]
        total += 1
        if item in frequency_table:
            frequency_table[item] += 1
        else:
            frequency_table[item] = 1
    
    percentage_table = {}
    
    for key in frequency_table:
        percentage = (frequency_table[key] / total) * 100
        percentage_table[key] = percentage
    return percentage_table
    
freq_table(android_apps,9)

{'Art & Design': 0.5979241877256317,
 'Art & Design;Creativity': 0.06768953068592057,
 'Auto & Vehicles': 0.9250902527075812,
 'Beauty': 0.5979241877256317,
 'Books & Reference': 2.1435018050541514,
 'Business': 4.591606498194946,
 'Comics': 0.6092057761732852,
 'Comics;Creativity': 0.01128158844765343,
 'Communication': 3.2378158844765346,
 'Dating': 1.861462093862816,
 'Education': 5.347472924187725,
 'Education;Creativity': 0.04512635379061372,
 'Education;Education': 0.33844765342960287,
 'Education;Pretend Play': 0.056407942238267145,
 'Education;Brain Games': 0.033844765342960284,
 'Entertainment': 6.069494584837545,
 'Entertainment;Brain Games': 0.078971119133574,
 'Entertainment;Creativity': 0.033844765342960284,
 'Entertainment;Music & Video': 0.16922382671480143,
 'Events': 0.7107400722021661,
 'Finance': 3.7003610108303246,
 'Food & Drink': 1.2409747292418771,
 'Health & Fitness': 3.0798736462093865,
 'House & Home': 0.8235559566787004,
 'Libraries & Demo': 0.936371841155234

This is good, but it is difficult to analyse the frequencies when they are not in order. For this, we will use the built-in `sorted()` function, which takes an iterable and returns a list of its elements in ascending (or, with `reverse = True`, descending) order. However, it doesn't work well with dictionaries directly, because it only considers and returns the dictionary keys. To remedy this, we transform the dictionary into a list of tuples, where each tuple contains a dictionary value along with its corresponding key. The value comes first and the key second, so that the tuples are sorted by value. There are simpler ways to do this using more advanced techniques. Using this workaround, we write a function called `display_table()`:

In [24]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    tuple_list = []
    
    for key in table:
        tup = (table[key], key)
        tuple_list.append(tup)
    sorted_table = sorted(tuple_list, reverse = True)
    
    for entry in sorted_table:
        print(entry[1] + ":", entry[0])
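
One of the simpler alternatives is the `key` argument of `sorted()`, which tells Python what to sort by, so the tuples no longer need to be reordered. This is a sketch on made-up percentages; `sort_table` is a hypothetical name. Unlike the tuple workaround, entries with equal values keep their original dictionary order instead of being ordered by key.

```python
def sort_table(table):
    """Return (key, value) pairs ordered from the largest value to the smallest."""
    return sorted(table.items(), key=lambda pair: pair[1], reverse=True)


# Made-up percentages (hypothetical data)
sample_table = {'Tools': 8.4, 'Games': 58.2, 'Education': 5.3}

for name, percentage in sort_table(sample_table):
    print(name + ':', percentage)
```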

In [25]:
display_table(ios_apps, -5) #prime_genre column in App Store

Games: 58.16263190564867
Entertainment: 7.883302296710118
Photo & Video: 4.9658597144630665
Education: 3.662321539416512
Social Networking: 3.2898820608317814
Shopping: 2.60707635009311
Utilities: 2.5139664804469275
Sports: 2.1415270018621975
Music: 2.0484171322160147
Health & Fitness: 2.0173805090006205
Productivity: 1.7380509000620732
Lifestyle: 1.5828677839851024
News: 1.3345747982619491
Travel: 1.2414649286157666
Finance: 1.1173184357541899
Weather: 0.8690254500310366
Food & Drink: 0.8069522036002483
Reference: 0.5586592178770949
Business: 0.5276225946617008
Book: 0.4345127250155183
Navigation: 0.186219739292365
Medical: 0.186219739292365
Catalogs: 0.12414649286157665


Looking at the `prime_genre` column of the App Store dataset, the most common (free, English language) app genre by far is `Games`, followed by `Entertainment`. Most of the apps are designed for entertainment purposes, with 4 of the top 5 genres being related to entertainment. More practical apps make up the bottom segment of the app genres. Based on this frequency table alone, one would be inclined to recommend gaming or entertainment apps for a potential app profile for developers. However, the large number of apps for these categories does not necessarily translate to a large number of users, so it would be wiser to analyse the number of users before coming to a conclusion. 

In [26]:
display_table(android_apps, 1) # Category column in Google Play dataset

FAMILY: 18.907942238267147
GAME: 9.724729241877256
TOOLS: 8.461191335740072
BUSINESS: 4.591606498194946
LIFESTYLE: 3.9034296028880866
PRODUCTIVITY: 3.892148014440433
FINANCE: 3.7003610108303246
MEDICAL: 3.531137184115524
SPORTS: 3.395758122743682
PERSONALIZATION: 3.3167870036101084
COMMUNICATION: 3.2378158844765346
HEALTH_AND_FITNESS: 3.0798736462093865
PHOTOGRAPHY: 2.944494584837545
NEWS_AND_MAGAZINES: 2.7978339350180503
SOCIAL: 2.6624548736462095
TRAVEL_AND_LOCAL: 2.33528880866426
SHOPPING: 2.2450361010830324
BOOKS_AND_REFERENCE: 2.1435018050541514
DATING: 1.861462093862816
VIDEO_PLAYERS: 1.7937725631768955
MAPS_AND_NAVIGATION: 1.3989169675090252
FOOD_AND_DRINK: 1.2409747292418771
EDUCATION: 1.1620036101083033
ENTERTAINMENT: 0.9589350180505415
LIBRARIES_AND_DEMO: 0.9363718411552346
AUTO_AND_VEHICLES: 0.9250902527075812
HOUSE_AND_HOME: 0.8235559566787004
WEATHER: 0.8009927797833934
EVENTS: 0.7107400722021661
PARENTING: 0.6543321299638989
ART_AND_DESIGN: 0.6430505415162455
COMICS: 0.62

In [27]:
display_table(android_apps, -4) # Genres column in Google Play dataset

Tools: 8.449909747292418
Entertainment: 6.069494584837545
Education: 5.347472924187725
Business: 4.591606498194946
Productivity: 3.892148014440433
Lifestyle: 3.892148014440433
Finance: 3.7003610108303246
Medical: 3.531137184115524
Sports: 3.463447653429603
Personalization: 3.3167870036101084
Communication: 3.2378158844765346
Action: 3.1024368231046933
Health & Fitness: 3.0798736462093865
Photography: 2.944494584837545
News & Magazines: 2.7978339350180503
Social: 2.6624548736462095
Travel & Local: 2.3240072202166067
Shopping: 2.2450361010830324
Books & Reference: 2.1435018050541514
Simulation: 2.0419675090252705
Dating: 1.861462093862816
Arcade: 1.8501805054151623
Video Players & Editors: 1.7712093862815883
Casual: 1.7599277978339352
Maps & Navigation: 1.3989169675090252
Food & Drink: 1.2409747292418771
Puzzle: 1.128158844765343
Racing: 0.9927797833935018
Role Playing: 0.9363718411552346
Libraries & Demo: 0.9363718411552346
Auto & Vehicles: 0.9250902527075812
Strategy: 0.913808664259927

The most common categories in the Google Play Store are `FAMILY` (which mainly includes games designed for children) and `GAME`.  The most common genres are `Tools` and `Entertainment`. These trends are not too dissimilar to the App Store. However, there are more practical apps represented on the Play Store, with `Tools`, `Education`, `Business` and `Productivity` being among the most common genres.

Based on the number of apps alone, the App Store has more of a focus on fun (namely entertainment, gaming and social networking), whereas the Play Store has more of a practical emphasis, although its most common categories still consist of gaming and entertainment.

In both cases, it is difficult to recommend an app genre without knowing the distribution of users, which we will explore next.

## Most Popular Apps by Genre on the App Store

We'd now like to analyse which kinds of apps have the most users. One way to determine this is by looking at the number of app installs, which can be found in the `Installs` column of the Google Play dataset. This information is not available in the App Store dataset, so we'll use the total number of user ratings, `rating_count_tot`, as a proxy instead. We'll calculate the average number of ratings per app in each genre. To do that, we need to do the following:

- Isolate the apps of each genre
- Add up the number of user ratings for the apps in that genre
- Divide the total number of ratings by the number of apps in that genre.

We'll need to use a **nested** loop. We start by generating the frequency table for the `prime_genre` column:

In [28]:
genres_table = freq_table(ios_apps, -5)
genres_table

{'Social Networking': 3.2898820608317814,
 'Photo & Video': 4.9658597144630665,
 'Games': 58.16263190564867,
 'Music': 2.0484171322160147,
 'Reference': 0.5586592178770949,
 'Health & Fitness': 2.0173805090006205,
 'Weather': 0.8690254500310366,
 'Utilities': 2.5139664804469275,
 'Travel': 1.2414649286157666,
 'Shopping': 2.60707635009311,
 'News': 1.3345747982619491,
 'Navigation': 0.186219739292365,
 'Lifestyle': 1.5828677839851024,
 'Entertainment': 7.883302296710118,
 'Food & Drink': 0.8069522036002483,
 'Sports': 2.1415270018621975,
 'Book': 0.4345127250155183,
 'Finance': 1.1173184357541899,
 'Education': 3.662321539416512,
 'Productivity': 1.7380509000620732,
 'Business': 0.5276225946617008,
 'Catalogs': 0.12414649286157665,
 'Medical': 0.186219739292365}

In [29]:
for genre in genres_table:
    total = 0 # number of user ratings in genre
    len_genre = 0 # number of apps in each genre
    for app in ios_apps:
        genre_app = app[-5]
        if genre_app == genre:
            no_of_ratings = float(app[5])
            total += no_of_ratings
            len_genre += 1
    avg_no_ratings = total/len_genre
    print(genre + ":", avg_no_ratings)

Social Networking: 71548.34905660378
Photo & Video: 28441.54375
Games: 22788.6696905016
Music: 57326.530303030304
Reference: 74942.11111111111
Health & Fitness: 23298.015384615384
Weather: 52279.892857142855
Utilities: 18684.456790123455
Travel: 28243.8
Shopping: 26919.690476190477
News: 21248.023255813954
Navigation: 86090.33333333333
Lifestyle: 16485.764705882353
Entertainment: 14029.830708661417
Food & Drink: 33333.92307692308
Sports: 23008.898550724636
Book: 39758.5
Finance: 31467.944444444445
Education: 7003.983050847458
Productivity: 21028.410714285714
Business: 7491.117647058823
Catalogs: 4004.0
Medical: 612.0
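
The nested loop scans the whole dataset once per genre. The same averages can be computed in a single pass by keeping a running total and count for each genre. The following is a sketch on made-up rows; `average_by_group` is a hypothetical helper, not part of the original notebook.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` within each value of `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows: name, genre, number of ratings
sample = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Weather', '50'],
]
averages = average_by_group(sample, group_index=1, value_index=2)
print(averages)
```

For the App Store data, the equivalent call would be `average_by_group(ios_apps, -5, 5)`.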


This would suggest that the most popular app genre is `Navigation`, but listing the apps in this genre shows that the average is skewed by a few very popular apps. The same is true for other genres like `Social Networking`.

In [30]:
for app in ios_apps:
    if app[-5] == "Navigation":
        print(app[1] + ":", app[5])

Waze - GPS Navigation, Maps & Real-time Traffic: 345046
Google Maps - Navigation & Transit: 154911
Geocaching®: 12811
CoPilot GPS – Car Navigation & Offline Maps: 3582
ImmobilienScout24: Real Estate Search in Germany: 187
Railway Route Search: 5
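
One way to quantify the skew is to compare the mean with the median, which is much less sensitive to a few very large values. The sketch below uses the six rating counts from the Navigation listing above; using the median here is a suggestion, not part of the original analysis.

```python
from statistics import mean, median

# Rating counts copied from the Navigation listing above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

navigation_mean = mean(navigation_ratings)
navigation_median = median(navigation_ratings)

print(round(navigation_mean, 2))
print(navigation_median)
```

The mean is about 86,090 ratings, whereas the median is 8,196.5, which shows how strongly Waze and Google Maps pull the average upwards.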


## Most Popular Apps by Genre on the Google Play Store

The `Installs` column seems like a useful way to quantify app popularity. On closer inspection, however, most of the values are imprecise, stating an open-ended range like `100+` instead of an exact value:

In [31]:
display_table(android_apps, 5)

1,000,000+: 15.726534296028879
100,000+: 11.552346570397113
10,000,000+: 10.548285198555957
10,000+: 10.198555956678701
1,000+: 8.393501805054152
100+: 6.915613718411552
5,000,000+: 6.825361010830325
500,000+: 5.561823104693141
50,000+: 4.7721119133574
5,000+: 4.512635379061372
10+: 3.5424187725631766
500+: 3.2490974729241873
50,000,000+: 2.3014440433213
100,000,000+: 2.1322202166064983
50+: 1.917870036101083
5+: 0.78971119133574
1+: 0.5076714801444043
500,000,000+: 0.2707581227436823
1,000,000,000+: 0.22563176895306858
0+: 0.04512635379061372
0: 0.01128158844765343


However, we don't need precise data; we just want to get an idea of which category of apps attracts the most users. So for our purposes we will treat an app with `100,000+` installs as having `100,000` installs, and so on. To analyse the numbers we need them as floats, without the `+` and `,` symbols. We'll use the `str.replace(old, new)` method to do this. To remove a character, we simply replace it with an empty string `""`.

In [32]:
string = "100,000+"
string.replace("+","")
string

'100,000+'

This method does not modify the original string (Python strings are immutable); it returns a new one. So we need to reassign the variable:

In [33]:
string = string.replace("+","")
string

'100,000'
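
Since the same two replacements are needed every time an install count is read, they can be chained and wrapped in a small function. This is a minimal sketch; `parse_installs` is a hypothetical name that does not appear elsewhere in this notebook.

```python
def parse_installs(text):
    """Convert an install label such as '100,000+' to a float."""
    # Each replace() returns a new string, so the calls can be chained
    return float(text.replace(',', '').replace('+', ''))


print(parse_installs('100,000+'))
print(parse_installs('0'))
```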

For the Google Play dataset, we'll use another nested for loop.

In [34]:
android_categories = freq_table(android_apps, 1) # freq table for category column
category_list = []
for category in android_categories:
    total = 0 # number of installs
    len_category = 0 
    for app in android_apps:
        app_category = app[1]
        if app_category == category:
            n_installs = app[5]
            n_installs = n_installs.replace("+", "")
            n_installs = n_installs.replace(",", "")
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_installs = total/len_category
    # Store (average, category) tuples so that the list can be ordered by number of installs
    tup = (avg_installs, category)
    category_list.append(tup)

# Sort once, after the loop has finished, in descending order of average installs
sorted_list = sorted(category_list, reverse = True)
sorted_list

[(38456119.167247385, 'COMMUNICATION'),
 (24727872.452830188, 'VIDEO_PLAYERS'),
 (23253652.127118643, 'SOCIAL'),
 (17840110.40229885, 'PHOTOGRAPHY'),
 (16787331.344927534, 'PRODUCTIVITY'),
 (15588015.603248259, 'GAME'),
 (13984077.710144928, 'TRAVEL_AND_LOCAL'),
 (11640705.88235294, 'ENTERTAINMENT'),
 (10801391.298666667, 'TOOLS'),
 (9549178.467741935, 'NEWS_AND_MAGAZINES'),
 (8767811.894736841, 'BOOKS_AND_REFERENCE'),
 (7036877.311557789, 'SHOPPING'),
 (5201482.6122448975, 'PERSONALIZATION'),
 (5074486.197183099, 'WEATHER'),
 (4188821.9853479853, 'HEALTH_AND_FITNESS'),
 (4056941.7741935486, 'MAPS_AND_NAVIGATION'),
 (3695641.8198090694, 'FAMILY'),
 (3638640.1428571427, 'SPORTS'),
 (1986335.0877192982, 'ART_AND_DESIGN'),
 (1924897.7363636363, 'FOOD_AND_DRINK'),
 (1833495.145631068, 'EDUCATION'),
 (1712290.1474201474, 'BUSINESS'),
 (1437816.2687861272, 'LIFESTYLE'),
 (1387692.475609756, 'FINANCE'),
 (1331540.5616438356, 'HOUSE_AND_HOME'),
 (854028.8303030303, 'DATING'),
 (817657.27272727

`COMMUNICATION` has the highest average number of installs per app, followed by `VIDEO_PLAYERS` and `SOCIAL`. This might lead us to think that these are the best app profiles for developers to target. However, these categories are dominated by a few giant companies that may be difficult to compete with. Their apps skew the averages, with many of them having over 100M installs each.

In [35]:
for app in android_apps:
    installs = app[5]
    installs = installs.replace("+","")
    installs = installs.replace(",","")
    installs = float(installs)
    if installs > 100000000:
        print("Category:",app[1],"\nApp name:", app[0],"\nInstalls:", installs,"\n")

Category: BOOKS_AND_REFERENCE 
App name: Google Play Books 
Installs: 1000000000.0 

Category: COMMUNICATION 
App name: WhatsApp Messenger 
Installs: 1000000000.0 

Category: COMMUNICATION 
App name: Google Duo - High Quality Video Calls 
Installs: 500000000.0 

Category: COMMUNICATION 
App name: Messenger – Text and Video Chat for Free 
Installs: 1000000000.0 

Category: COMMUNICATION 
App name: imo free video calls and chat 
Installs: 500000000.0 

Category: COMMUNICATION 
App name: Skype - free IM & video calls 
Installs: 1000000000.0 

Category: COMMUNICATION 
App name: LINE: Free Calls & Messages 
Installs: 500000000.0 

Category: COMMUNICATION 
App name: Google Chrome: Fast & Secure 
Installs: 1000000000.0 

Category: COMMUNICATION 
App name: UC Browser - Fast Download Private & Secure 
Installs: 500000000.0 

Category: COMMUNICATION 
App name: Gmail 
Installs: 1000000000.0 

Category: COMMUNICATION 
App name: Hangouts 
Installs: 1000000000.0 

Category: GAME 
App name: Candy Cru

Without these large values, the average number of installs for each category is much smaller. For the `COMMUNICATION` category:

In [36]:
under_100_m = []

for app in android_apps:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386

This motivates us to look at apps with a more modest number of installs in the categories that are less dominated by large companies:

In [37]:
for app in android_apps:
    name = app[0]
    installs = app[5]
    genre = app[1]
    if (installs == "500,000+" or installs == "1,000,000+") and genre == "BOOKS_AND_REFERENCE":
        print("Name:",name,"\nInstalls:", installs,"\n")

Name: Book store 
Installs: 1,000,000+ 

Name: English Grammar Complete Handbook 
Installs: 500,000+ 

Name: Free Books - Spirit Fanfiction and Stories 
Installs: 1,000,000+ 

Name: Offline: English to Tagalog Dictionary 
Installs: 500,000+ 

Name: FamilySearch Tree 
Installs: 1,000,000+ 

Name: Cloud of Books 
Installs: 1,000,000+ 

Name: Recipes of Prophetic Medicine for free 
Installs: 500,000+ 

Name: ReadEra – free ebook reader 
Installs: 1,000,000+ 

Name: English to Urdu Dictionary 
Installs: 500,000+ 

Name: eBoox: book reader fb2 epub zip 
Installs: 1,000,000+ 

Name: English Persian Dictionary 
Installs: 500,000+ 

Name: Flybook 
Installs: 500,000+ 

Name: All Maths Formulas 
Installs: 1,000,000+ 

Name: Only 30 days in English, the guideline is guaranteed 
Installs: 500,000+ 

Name: English-Myanmar Dictionary 
Installs: 1,000,000+ 

Name: Golden Dictionary (EN-AR) 
Installs: 1,000,000+ 

Name: All Language Translator Free 
Installs: 1,000,000+ 

Name: Azpen eReader 
Installs

In [38]:
for app in android_apps:
    name = app[0]
    installs = app[5]
    genre = app[1]
    if (installs == "500,000+" or installs == "1,000,000+") and (genre == "PRODUCTIVITY"):
        print("Name:",name,"\nInstalls:", installs,"\n")

Name: Power Booster - Junk Cleaner & CPU Cooler & Boost 
Installs: 1,000,000+ 

Name: MyMTN 
Installs: 1,000,000+ 

Name: Hacker's Keyboard 
Installs: 1,000,000+ 

Name: Security & Privacy 
Installs: 1,000,000+ 

Name: 7 Weeks - Habit & Goal Tracker 
Installs: 500,000+ 

Name: Loop - Habit Tracker 
Installs: 1,000,000+ 

Name: TickTick: To Do List with Reminder, Day Planner 
Installs: 1,000,000+ 

Name: Pushbullet - SMS on PC 
Installs: 1,000,000+ 

Name: Planner Pro-Personal Organizer 
Installs: 1,000,000+ 

Name: Cozi Family Organizer 
Installs: 1,000,000+ 

Name: IFTTT 
Installs: 1,000,000+ 

Name: Dashlane Free Password Manager 
Installs: 1,000,000+ 

Name: Solid Explorer Classic 
Installs: 1,000,000+ 

Name: Solid Explorer File Manager 
Installs: 1,000,000+ 

Name: Smart File Manager 
Installs: 1,000,000+ 

Name: Simple Notepad 
Installs: 1,000,000+ 

Name: Sticky Note + : Sync Notes 
Installs: 1,000,000+ 

Name: Squid - Take Notes & Markup PDFs 
Installs: 1,000,000+ 

Name: Jotte
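
The two listings above differ only in the category name, so the filter can be factored into a reusable function. The following is a sketch on made-up rows that follow the Google Play column positions (name at index 0, category at index 1, installs at index 5); `apps_with_installs` is a hypothetical helper.

```python
def apps_with_installs(rows, category, install_labels,
                       name_index=0, category_index=1, installs_index=5):
    """Return (name, installs) pairs for apps in `category` whose install label matches."""
    return [
        (row[name_index], row[installs_index])
        for row in rows
        if row[category_index] == category and row[installs_index] in install_labels
    ]


# Made-up rows with the same column positions as the Google Play data
sample = [
    ['Flybook', 'BOOKS_AND_REFERENCE', '3.9', '100', '5M', '500,000+'],
    ['IFTTT', 'PRODUCTIVITY', '4.5', '200', '9M', '1,000,000+'],
    ['Gmail', 'COMMUNICATION', '4.3', '900', '20M', '1,000,000,000+'],
]
mid_range = {'500,000+', '1,000,000+'}

print(apps_with_installs(sample, 'PRODUCTIVITY', mid_range))
```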

## Conclusions

Examining the mid-range of the `BOOKS_AND_REFERENCE` category, we see that Quran apps and dictionaries appear quite often. In the `PRODUCTIVITY` category, we see a lot of calendar, to-do list and reminder apps. From this we conclude that a profitable yet realistic app profile might be built around a popular book and combine it with organisation/productivity features. From what we saw earlier, gaming and entertainment still dominate the most popular categories, even excluding the large companies. So we surmise that a productivity/book app with a gamified or entertainment aspect may be the safest option.