## High-Profit App Strategies for the App Store and Google Play

Our goal in this project is to identify mobile app profiles that are profitable in the App Store and Google Play markets. As data analysts for a company specializing in Android and iOS app development, we aim to provide our developers with data-driven insights to guide their decisions on which types of apps to create.

Our company focuses exclusively on free-to-download apps, with in-app advertising as our primary revenue source. Therefore, the profitability of an app is directly tied to the number of users it attracts. The objective of this project is to analyze data that will help our developers understand which types of apps are most likely to draw a large user base.

In [2]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv', encoding='utf-8-sig')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set ###
opened_file = open('AppleStore.csv', encoding='utf-8-sig')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
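The two loading steps above repeat the same pattern, so they could be wrapped in a small helper. The sketch below is only an illustration: `load_csv` is a hypothetical name that does not appear in the original notebook, and it assumes each file has a header row. It uses a `with` block so the file is closed once it has been read.

```python
from csv import reader

def load_csv(path, encoding='utf-8-sig'):
    """Read a CSV file and return (header, rows).

    Hypothetical helper, not part of the original notebook.
    Assumes the first row of the file is the header.
    """
    # newline='' is what the csv module recommends when opening files for reader()
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]
```

With this helper, loading a data set would look like `android_header, android = load_csv('googleplaystore.csv')`.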

The `explore_data()` function does the following:

- Takes in four parameters:
    - `dataset`, which will be a list of lists.
    - `start` and `end`, which will both be integers and represent the starting and the ending indices of a slice from the dataset.
    - `rows_and_columns`, which is expected to be a Boolean and has `False` as a default argument.
- Slices the dataset using `dataset[start:end]`.
- Loops through the slice, and for each iteration, prints a row and adds a new line after that row using `print('\n')`.
    The `\n` in `print('\n')` is a special character that won't print. Instead, the `\n` character adds a new line, and we use `print('\n')` to add some blank space between rows.
- Prints the number of rows and columns if `rows_and_columns` is `True`. `dataset` shouldn't have a header row, or the function will print the wrong number of rows (one more row compared to the actual length).

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0])) # number of columns, taken from the first row

In [5]:
opened_file = open('AppleStore.csv')
from csv import reader
read_file = reader(opened_file)

In [6]:
print(ios[2])
print(ios[3])
print(ios[4])

['281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']
['282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']
['282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


In [7]:
print(ios[2])
print('\n')
print(ios[3])
print('\n')
print(ios[4])

['281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


In [8]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', '15-Jan-18', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We see that the *Google Play* data set has **10841 apps and 13 columns**. At a quick glance, the columns that might be useful for the purpose of our analysis are `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'`.

In [10]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 16


We have **7197 iOS apps** in this data set, and the columns that seem interesting are: `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, and `'prime_genre'`. Not all column names are self-explanatory in this case, but details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

In [12]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', '11-Feb-18', '1.0.19', '4.0 and up', '']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


The row 10472 corresponds to the app *Life Made WI-Fi Touchscreen Photo Frame*, and we can see that the rating is 19. This is clearly off because the maximum rating for a Google Play app is 5 (this problem is caused by a missing value in the `'Category'` column). As a consequence, we'll delete this row.
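Rather than spotting such a row by eye, we could scan for ratings outside the valid range. The sketch below is a rough illustration on toy rows: `find_bad_ratings` is a hypothetical helper, and it assumes that a missing rating is stored as the string `'NaN'` and should be skipped rather than flagged.

```python
import math

def find_bad_ratings(dataset, rating_index, max_rating=5.0):
    """Return the indices of rows whose rating is not a number in [0, max_rating].

    Hypothetical helper; assumes missing ratings are stored as 'NaN'.
    """
    bad_rows = []
    for i, row in enumerate(dataset):
        try:
            rating = float(row[rating_index])
        except (ValueError, IndexError):
            bad_rows.append(i)      # not a number at all, or the row is too short
            continue
        if math.isnan(rating):
            continue                # missing rating, not necessarily a shifted row
        if rating < 0 or rating > max_rating:
            bad_rows.append(i)
    return bad_rows

# Toy rows for illustration only (not taken from the real file)
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9', '19'],     # values shifted left, so the "rating" is 19
    ['App C', 'GAME', 'NaN'],   # missing rating
    ['App D', 'GAME', 'Free'],  # not a number
]
print(find_bad_ratings(toy_rows, 2))
```

On the real data, a call such as `find_bad_ratings(android, 2)` would be expected to flag the row discussed above, since its rating field holds 19.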

In [14]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android))

10841
10840


## Removing Duplicate Entries

**Part One**

If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application `Instagram` has four entries:

In [16]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', '31-Jul-18', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', '31-Jul-18', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', '31-Jul-18', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', '31-Jul-18', 'Varies with device', 'Varies with device']


In [17]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


# **Data Cleaning Process**
We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If you examine the rows we printed two cells above for the Instagram app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. Rather than removing rows randomly, we'll keep the row that has the highest number of reviews, because the higher the number of reviews, the more reliable the ratings.

To do that, we will:

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app.
- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews).

**Part Two**

Let's start by building the dictionary.

In [19]:
reviews_max = {}  # Initializes an empty dictionary to store the maximum number of reviews for each app

for app in android:  # Loops through each app in the android dataset
    name = app[0]  # Extracts the app name (assuming it's the first element in each app list)
    n_reviews = int(app[3])  # Converts the number of reviews (at index 3) to an integer

    if name in reviews_max and reviews_max[name] < n_reviews:  # If the app is already in the dictionary and the current number of reviews is greater than the stored value
        reviews_max[name] = n_reviews  # Updates the dictionary with the larger number of reviews

    elif name not in reviews_max:  # If the app is not already in the dictionary
        reviews_max[name] = n_reviews  # Adds the app and its number of reviews to the dictionary

In [20]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


In [21]:
android_clean = [] # will store the cleaned list of apps where each app appears only once, specifically the version with the highest number of reviews.
already_added = [] # keeps track of the app names that have already been added to `android_clean`, preventing duplicates.

for app in android:
    name = app[0]
    n_reviews = int(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app) # contains the entire row
        already_added.append(name) # make sure this is inside the if block. contains only the name of the app in the android_clean row

In [22]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', '8-Jun-18', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
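The same "keep the entry with the most reviews" rule can also be sketched in a single pass by storing the best row per app name in a dictionary. This is only an alternative illustration on toy rows: `keep_most_reviewed` is a hypothetical name, and unlike the approach above it returns rows in order of each app's first appearance, so the row order may differ.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the most reviews.

    Hypothetical helper. On ties, the first row seen is kept.
    """
    best = {}  # app name -> row with the highest review count seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        if name not in best or n_reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows for illustration: name, category, rating, number of reviews
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Some App', 'TOOLS', '4.0', '120'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
for row in keep_most_reviewed(toy_rows):
    print(row)
```

Dictionary lookups by name are fast, which avoids repeatedly searching a list of names as the data set grows.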


## Removing Non-English Apps

**Part One**


If you explore the data sets enough, you'll notice the names of some of the apps suggest they are not directed toward an English-speaking audience. Below, we see a couple of examples from both data sets:

In [24]:
print(ios[813][1])
print(ios[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

AliExpress Shopping App
Idle Armies


中国語 AQリスニング
لعبة تقدر تربح DZ


We're not interested in keeping these kinds of apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

All these characters that are specific to English texts are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127 associated with it, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.

We build this function below, and we use the built-in `ord()` function to find out the corresponding encoding number of each character.

In [26]:
def is_english(string): # checks whether a given string consists only of English (ASCII) characters
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False


The function seems to work fine, but some English app names use emojis or other symbols (™, — (em dash), – (en dash), etc.) that fall outside of the ASCII range. Because of this, we'll remove useful apps if we use the function in its current form.

In [28]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print('\n')
print(ord('™'))
print(ord('😜'))

False
False


8482
128540


**Part Two**

To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:

In [30]:
def is_english(string):              # returns False only if the name has more than three non-ASCII characters
    non_ascii = 0                    # counts the characters in the string that are outside the ASCII range
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1           # If a non-ASCII character is found, the non_ascii counter is incremented by 1.

    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
True
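The counting logic can also be written more compactly with the threshold exposed as a parameter, which makes it easier to experiment with stricter or looser filters. This is a sketch only: `is_mostly_ascii` and `max_non_ascii` are hypothetical names, and the default of 3 simply mirrors the rule described above.

```python
def is_mostly_ascii(string, max_non_ascii=3):
    """Return True if the string has at most max_non_ascii non-ASCII characters.

    Hypothetical variant of is_english() with a configurable threshold.
    """
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_mostly_ascii('Docs To Go™ Free Office Suite'))
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_mostly_ascii('Docs To Go™ Free Office Suite', max_non_ascii=0))
```

The filter remains a heuristic: a non-English name written entirely in ASCII characters would still pass.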


In [31]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', '8-Jun-18', 'Varies with device', '4.2 and up']


Number of rows: 9500
Number of columns: 13


['281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5

## Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.

In [33]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)

for app in ios_english:
    price = app[4]
    if price == '0':
        ios_final.append(app)

print('No. of Android apps left will be', len(android_final))
print('No. of ios apps left will be', len(ios_final))

No. of Android apps left will be 8760
No. of ios apps left will be 3169
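Comparing the price to the exact string `'0'` works for the rows shown here, but it would miss a free app stored as `'0.0'`. A slightly more defensive check could parse the price first. The sketch below is an illustration under assumptions: `parse_price` and `is_free` are hypothetical helpers, and it assumes paid prices may carry a leading `$` sign.

```python
def parse_price(price):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float.

    Hypothetical helper; assumes an optional leading '$' and no other symbols.
    """
    return float(price.strip().lstrip('$'))

def is_free(price):
    """Return True if the parsed price is zero."""
    return parse_price(price) == 0.0

for example in ['0', '0.0', '$4.99', '3.99']:
    print(example, ':', is_free(example))
```

In the filtering loops, `if is_free(price):` could then stand in for `if price == '0':`.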


# Most Common Apps by Genre__
Part On__e
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three sp- s:

Build a minimal Android version of the app, and add it to Google - Play.
If the app has a good response from users, we then develop it fu- rther.
If the app is profitable after six months, we also build an iOS version of the app and add it to the App
 Store.
Because our end goal is to add the app on `both the `App S`tore and Go`ogle Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets mi`ght be a product`ivity app that make`s use of gam`ification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'`ll build a freq`uency tab`le for the `prime_genre col`umn of th`e App Store data se`t, and` the `Genres a`nd Category colu`mns of the `Google Play __data set__.

Part Two
We'll build two functions we can use to analyze the frequ- ency tables:

One function to generate frequency tables that s- how percentages
Another function that we can use to display the percentages in a descending order

In [35]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100 # i.e. the number of times the value appears / total, as a percentage
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key) # i.e. (percentage of occurrence, value)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
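For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library. `freq_table_counter` is a hypothetical name used only for this illustration. Note that `most_common()` keeps tied values in first-seen order, whereas `display_table()` above breaks ties by sorting on the value name.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most frequent first.

    Hypothetical alternative to freq_table() built on collections.Counter.
    """
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}

# Toy rows for illustration only: name, genre
toy_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Weather'],
    ['App D', 'Games'],
]
for genre, percentage in freq_table_counter(toy_rows, 1).items():
    print(genre, ':', percentage)
```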

__Part Three__

We start by examining the frequency table for the `prime_genre` column of the App Store data set.

In [37]:
display_table(ios_final, 11) # from the function defined above

Games : 58.53581571473651
Entertainment : 7.82581255916693
Photo & Video : 5.0489113284947935
Education : 3.72357210476491
Social Networking : 3.2817923635216157
Shopping : 2.5244556642473968
Utilities : 2.398232881035027
Sports : 2.1773430104133795
Music : 2.0511202272010096
Health & Fitness : 1.9880088355948247
Productivity : 1.7040075733669928
Lifestyle : 1.5462290943515304
News : 1.3253392237298833
Travel : 1.1360050489113285
Finance : 1.1044493531082362
Weather : 0.8520037866834964
Food & Drink : 0.8204480908804039
Reference : 0.5364468286525718
Business : 0.5364468286525718
Book : 0.3786683496371095
Navigation : 0.18933417481855475
Medical : 0.18933417481855475
Catalogs : 0.12622278321236985


In [38]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


We can see that among the free English apps, more than half (58.54%) are games. Entertainment apps make up close to 8%, followed by photo and video apps at around 5%. Only 3.72% of the apps are designed for education, followed by social networking apps, which account for 3.28% of the apps in our data set.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users: the demand might not be the same as the offer.

Let's continue by examining the `Genres` and `Category` columns of the Google Play data set, as they are the two columns that seem to be related.

In [40]:
display_table(android_final, 1) # Category

FAMILY : 18.938356164383563
GAME : 9.657534246575343
TOOLS : 8.481735159817351
BUSINESS : 4.646118721461187
PRODUCTIVITY : 3.9383561643835616
LIFESTYLE : 3.9155251141552516
FINANCE : 3.721461187214612
MEDICAL : 3.550228310502283
SPORTS : 3.3333333333333335
PERSONALIZATION : 3.287671232876712
COMMUNICATION : 3.2534246575342465
HEALTH_AND_FITNESS : 3.093607305936073
PHOTOGRAPHY : 2.9794520547945202
NEWS_AND_MAGAZINES : 2.808219178082192
SOCIAL : 2.6484018264840183
TRAVEL_AND_LOCAL : 2.3401826484018264
SHOPPING : 2.2488584474885847
BOOKS_AND_REFERENCE : 2.146118721461187
DATING : 1.860730593607306
VIDEO_PLAYERS : 1.8036529680365299
MAPS_AND_NAVIGATION : 1.3812785388127853
FOOD_AND_DRINK : 1.2328767123287672
EDUCATION : 1.1757990867579908
ENTERTAINMENT : 0.9589041095890412
AUTO_AND_VEHICLES : 0.9246575342465754
LIBRARIES_AND_DEMO : 0.9018264840182649
WEATHER : 0.7876712328767124
HOUSE_AND_HOME : 0.7876712328767124
EVENTS : 0.7191780821917808
ART_AND_DESIGN : 0.6506849315068494
PARENTING : 

In [41]:
display_table(android_final, -4) # genre

Tools : 8.470319634703197
Entertainment : 6.084474885844749
Education : 5.3881278538812785
Business : 4.646118721461187
Productivity : 3.9383561643835616
Lifestyle : 3.904109589041096
Finance : 3.721461187214612
Medical : 3.550228310502283
Sports : 3.4018264840182644
Personalization : 3.287671232876712
Communication : 3.2534246575342465
Action : 3.105022831050228
Health & Fitness : 3.093607305936073
Photography : 2.9794520547945202
News & Magazines : 2.808219178082192
Social : 2.6484018264840183
Travel & Local : 2.328767123287671
Shopping : 2.2488584474885847
Books & Reference : 2.146118721461187
Simulation : 2.054794520547945
Dating : 1.860730593607306
Arcade : 1.82648401826484
Video Players & Editors : 1.7808219178082192
Casual : 1.7351598173515983
Maps & Navigation : 1.3812785388127853
Food & Drink : 1.2328767123287672
Puzzle : 1.141552511415525
Racing : 1.004566210045662
Role Playing : 0.9474885844748858
Strategy : 0.9246575342465754
Auto & Vehicles : 0.9246575342465754
Libraries &

The difference between the `Genres` and the `Category` columns is not crystal clear, but one thing we can notice is that the `Genres` column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the `Category` column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea about the kind of apps that have the most users.

## Most Popular Apps by Genre on the App Store

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [44]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:          # Outer loop: iterates over each genre in the genres_ios list
    total = 0                     # Initializes the total ratings for the current genre
    len_genre = 0                 # Initializes the count of apps in the current genre
    for app in ios_final:         # Inner loop: iterates over each app in the ios_final dataset
        genre_app = app[-5]       # Extracts the genre of the current app (from the -5th column)
        if genre_app == genre:    # Checks if the current app's genre matches the current genre being processed
            n_ratings = float(app[5])  # Extracts the number of ratings (from the 6th column) and converts it to a float
            total += n_ratings          # Adds the app's ratings to the total ratings for the current genre
            len_genre += 1              # Increments the count of apps in the current genre
    avg_n_ratings = total / len_genre   # Calculates the average number of ratings for the current genre
    print(genre, ':', avg_n_ratings)    # Prints the genre and its average number of ratings

Productivity : 21799.14814814815
Weather : 54215.2962962963
Shopping : 27816.2
Reference : 79350.4705882353
Finance : 32367.02857142857
Music : 58205.03076923077
Utilities : 19900.473684210527
Travel : 31358.5
Social Networking : 72916.54807692308
Sports : 23008.898550724636
Health & Fitness : 24037.634920634922
Games : 22985.211320754715
Food & Drink : 33333.92307692308
News : 21750.071428571428
Book : 46384.916666666664
Photo & Video : 28441.54375
Entertainment : 14364.774193548386
Business : 7491.117647058823
Lifestyle : 16739.34693877551
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0


In [45]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [46]:
genres_ios = freq_table(ios_final, -5)

def ratings_table(dataset, index):
    # Prints the frequency table for a column in descending order (same logic as display_table)
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key) # i.e. (percentage of occurrence, value)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

for genre in genres_ios:          # Outer loop: iterates over each genre in the genres_ios frequency table
    total = 0                     # Initializes the total ratings for the current genre
    len_genre = 0                 # Initializes the count of apps in the current genre
    for app in ios_final:         # Inner loop: iterates over each app in the ios_final dataset
        genre_app = app[-5]       # Extracts the genre of the current app (from the -5th column)
        if genre_app == genre:    # Checks if the current app's genre matches the current genre being processed
            n_ratings = float(app[5])  # Extracts the number of ratings (from the 6th column) and converts it to a float
            total += n_ratings          # Adds the app's ratings to the total ratings for the current genre
            len_genre += 1              # Increments the count of apps in the current genre
    avg_n_ratings = total / len_genre   # Calculates the average number of ratings for the current genre
    print(genre, ':', avg_n_ratings)    # Prints the genre and its average number of ratings

Productivity : 21799.14814814815
Weather : 54215.2962962963
Shopping : 27816.2
Reference : 79350.4705882353
Finance : 32367.02857142857
Music : 58205.03076923077
Utilities : 19900.473684210527
Travel : 31358.5
Social Networking : 72916.54807692308
Sports : 23008.898550724636
Health & Fitness : 24037.634920634922
Games : 22985.211320754715
Food & Drink : 33333.92307692308
News : 21750.071428571428
Book : 46384.916666666664
Photo & Video : 28441.54375
Entertainment : 14364.774193548386
Business : 7491.117647058823
Lifestyle : 16739.34693877551
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0


On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [48]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911


The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. The same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking, or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then reworking the averages, but we'll leave this level of detail for later.
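One lightweight way to see how strongly a few giants pull the average is to compare the mean with the median, which is far less sensitive to extreme values. The sketch below uses the six navigation rating counts printed earlier. `summarize_by_group` is a hypothetical helper, and swapping in the median is our own illustration rather than a step from the original analysis.

```python
from statistics import mean, median

def summarize_by_group(dataset, group_index, value_index):
    """Return {group: (mean, median)} for a numeric column.

    Hypothetical helper, not part of the original notebook.
    """
    groups = {}
    for row in dataset:
        groups.setdefault(row[group_index], []).append(float(row[value_index]))
    return {group: (mean(values), median(values)) for group, values in groups.items()}

# Rating counts of the six navigation apps listed earlier: name, genre, ratings
navigation_rows = [
    ['Waze', 'Navigation', '345046'],
    ['Geocaching', 'Navigation', '12811'],
    ['ImmobilienScout24', 'Navigation', '187'],
    ['Railway Route Search', 'Navigation', '5'],
    ['CoPilot GPS', 'Navigation', '3582'],
    ['Google Maps', 'Navigation', '154911'],
]
nav_mean, nav_median = summarize_by_group(navigation_rows, 1, 2)['Navigation']
print('mean:', round(nav_mean, 2))
print('median:', nav_median)
```

For these six apps the mean is about 86,090 ratings while the median is 8,196.5, which suggests that a typical navigation app is much less popular than the average implies.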

Reference apps have 79,350 user ratings on average, but it's actually the Bible and Dictionary.com which skew the average upward:

In [50]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5]) # print name and number of ratings

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
Merriam-Webster Dictionary : 16849
Google Translate : 26786
Night Sky : 12122
WWDC : 762
Jishokun-Japanese English Dictionary & Translator : 0
VPN Express : 14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Real Bike Traffic Rider Virtual Reality Glasses : 8


However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, and finance. The book genre seems to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:

- Weather apps: people generally don't spend too much time in-app, and the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

- Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.

- Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

## Most Popular Apps by Genre on Google Play

For the Google Play market, we have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [53]:
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [54]:
display_table(android_final, 5)

1,000,000+ : 15.74200913242009
100,000+ : 11.518264840182649
10,000,000+ : 10.60502283105023
10,000+ : 10.205479452054794
1,000+ : 8.367579908675799
100+ : 6.952054794520548
5,000,000+ : 6.872146118721462
500,000+ : 5.5479452054794525
50,000+ : 4.7716894977168955
5,000+ : 4.486301369863014
10+ : 3.515981735159817
500+ : 3.2077625570776256
50,000,000+ : 2.28310502283105
100,000,000+ : 2.134703196347032
50+ : 1.9292237442922375
5+ : 0.7876712328767124
1+ : 0.5136986301369862
500,000,000+ : 0.273972602739726
1,000,000,000+ : 0.228310502283105
0+ : 0.045662100456621
0 : 0.01141552511415525


One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes. We only want to get an idea of which app genres attract the most users, and for that we don't need perfect precision in the number of users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).
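
Taken in isolation, the conversion step can be sketched as a small helper. The name `parse_installs()` is an assumption made for this illustration; the notebook itself performs the same two `replace()` calls inline.

```python
def parse_installs(label):
    """Convert an install label such as '1,000,000+' to a float.

    Commas and the trailing plus sign are stripped first, because
    float('1,000,000+') would raise a ValueError.
    """
    cleaned = label.replace(',', '').replace('+', '')
    return float(cleaned)

print(parse_installs('100,000+'))
print(parse_installs('0'))
```

Because the plus sign is simply dropped, each label is treated as its lower bound, which matches the decision to count a 100,000+ app as having 100,000 installs.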

In [56]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '') # this replaces the ',' with nothing
            n_installs = n_installs.replace('+', '') # this replaces the '+' with nothing
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 654074.8271604938
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8329168.936170213
BUSINESS : 1712290.1474201474
COMICS : 859042.1568627451
COMMUNICATION : 38550548.03859649
DATING : 861409.5521472392
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11767380.952380951
EVENTS : 253542.22222222222
FINANCE : 1365500.4049079753
FOOD_AND_DRINK : 1951283.8055555555
HEALTH_AND_FITNESS : 4219697.055350553
HOUSE_AND_HOME : 1385541.463768116
LIBRARIES_AND_DEMO : 649314.0506329114
LIFESTYLE : 1447458.976676385
GAME : 15571586.690307328
FAMILY : 3716053.755274262
MEDICAL : 121161.87781350482
SOCIAL : 23628689.23275862
SHOPPING : 7103190.78680203
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3750580.6438356163
TRAVEL_AND_LOCAL : 14120454.07804878
TOOLS : 10902378.834454913
PERSONALIZATION : 5240358.986111111
PRODUCTIVITY : 16787331.344927534
PARENTING : 552875.1785714285
WEATHER : 5212877.101449275
VIDEO_PLAYERS : 24878048.860759493
NEWS_AND_MAGAZI
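
The loop above rescans the whole dataset once per category. As an alternative sketch (not the approach used in this notebook), the same averages can be accumulated in a single pass and then sorted, so the most-installed categories appear first. The function name, its parameters, and the `sample` rows are hypothetical; the rows only mimic the column positions of `android_final`.

```python
def average_installs_by_category(apps, category_index=1, installs_index=5):
    """Return (category, average installs) pairs, largest average first."""
    totals = {}
    counts = {}
    for app in apps:
        category = app[category_index]
        # Strip ',' and '+' so the label can be converted to a float.
        installs = float(app[installs_index].replace(',', '').replace('+', ''))
        totals[category] = totals.get(category, 0.0) + installs
        counts[category] = counts.get(category, 0) + 1
    averages = {c: totals[c] / counts[c] for c in totals}
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical rows: name at index 0, category at 1, installs at 5.
sample = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '1,000+'],
    ['App C', 'WEATHER', '', '', '', '5,000,000+'],
]

for category, avg in average_installs_by_category(sample):
    print(category, ':', avg)
```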

On average, communication apps have the most installs: 38,550,548. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 million or 500 million installs:

In [58]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Me

If we removed all the communication apps that have 100 million installs or more, the average would drop by roughly a factor of ten:

In [60]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m) # average number of installs for apps with fewer than 100m installs

3437620.895348837

We see the same pattern for the video players category, which is the runner-up with 24,878,049 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, and MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), and productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.
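
One quick way to check whether a category only looks popular because of a few giants is to compare its mean with its median, since the median is far less sensitive to extreme values. The sketch below uses Python's standard `statistics` module; the `installs` list and the `summarize()` helper are hypothetical and are not taken from the dataset.

```python
from statistics import mean, median

def summarize(installs):
    """Return the (mean, median) of a list of install counts."""
    return mean(installs), median(installs)

# Hypothetical install counts for one category: two giants, four small apps.
installs = [1_000_000_000, 500_000_000, 1_000_000, 100_000, 10_000, 1_000]

avg, mid = summarize(installs)
print('mean   :', avg)
print('median :', mid)
```

When the mean is many times larger than the median, as it is for this made-up list, the average is being driven by a few very large apps rather than by the typical app in the category.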

The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,329,169. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [62]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+

The books and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there's still a small number of extremely popular apps that skew the average:

In [64]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Audiobooks from Audible : 100,000,000+


However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [66]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                           or app[5] == '5,000,000+'
                                           or app[5] == '10,000,000+'
                                           or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
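
The filter above lists every install label explicitly. A possible alternative, shown here only as a sketch, is to convert the label to a number and test it against a range. The helper name `in_install_range()` is an assumption, and the `sample` rows reuse three app names and install labels from the outputs above with the other columns left blank.

```python
def in_install_range(label, low, high):
    """True if the install label's lower bound falls in [low, high)."""
    n = float(label.replace(',', '').replace('+', ''))
    return low <= n < high

# Rows mimic android_final: name at index 0, category at 1, installs at 5.
sample = [
    ['Wikipedia', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Google Play Books', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Litnet - E-books', 'BOOKS_AND_REFERENCE', '', '', '', '100,000+'],
]

mid_tier = [app[0] for app in sample
            if app[1] == 'BOOKS_AND_REFERENCE'
            and in_install_range(app[5], 1_000_000, 100_000_000)]
print(mid_tier)
```

With the half-open range from 1,000,000 up to (but not including) 100,000,000, this selects the same labels as the explicit comparison: 1,000,000+, 5,000,000+, 10,000,000+, and 50,000,000+.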

The market for ebook processing and reading software, as well as extensive libraries and dictionaries, appears to be highly competitive. Given the dominance of these types of apps, entering this space with a similar offering may not be advisable due to the significant competition.

However, we’ve observed a notable number of apps dedicated to the Quran, indicating that apps centered around popular books can indeed be profitable. Developing an app based on a well-known book (potentially a more recent bestseller) could offer opportunities for success in both the Google Play and App Store markets.

That said, the market is saturated with library apps, so simply providing a basic version of a book may not be enough. To stand out, consider incorporating unique features such as daily quotes from the book, an audio version, quizzes, or a discussion forum for readers.

## Conclusions

In this project, we analyzed data from the App Store and Google Play to recommend an app profile that could be profitable in both markets.

Our analysis suggests that developing an app based on a popular book, particularly a recent one, could be a successful strategy. Since both markets already have many basic e-book apps, adding unique features could help set the app apart. These features might include daily quotes from the book, an audio version, interactive quizzes, or a community forum for discussions.