# Profitable App Profiles for the App Store and Google Play Markets

For this project, I'll pretend that I'm working as a data analyst for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

Our company only builds apps that are designed for an English-speaking audience. They are also free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users who use our app &mdash; the more users that see and engage with the ads, the better. My goal for this project is to analyze data to help the developers understand what types of apps are likely to attract more users.

## Opening and Exploring the Data

To accomplish my goal, I'll need to collect and analyze data about mobile apps available on Google Play and the App Store. As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play.

![Number of apps &copy; Statista 2018](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png)

Collecting data for over 4 million apps requires a significant amount of time and money, so I'll analyze a sample of the data instead. To avoid spending resources on collecting new data, I'll first look for relevant existing data that is available at no cost. Luckily, there are two data sets that seem suitable for my goal:

- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 **Android** apps from **Google Play**; the data was collected in August 2018 (you can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv))
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 **iOS** apps from the **App Store**; the data was collected in July 2017 (you can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv))

I'll start by opening and exploring these two data sets. Before doing that though, to make the two data sets easier to explore, I'll create a function named `explore_data()` that I can repeatedly use to print rows of data in a readable way. In addition, when I use this function, I'll have the option to output the number of rows and columns in the data set.

In [1]:
def explore_data(data_set, start, end, rows_and_columns=False):
    data_slice = data_set[start:end]
    for row in data_slice:
        print(row)
        # Print an extra line of space after each row for readability
        print()
    
    if rows_and_columns:
        print('Number of rows:', len(data_set))
        print('Number of columns:', len(data_set[0]))

Now that I'm appropriately prepared, I'll open the two data sets.

In [2]:
from csv import reader

### The Google Play data set ###
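# An explicit encoding avoids decoding errors on platforms whose default isn't UTF-8
opened_file = open('../data_sets/googleplaystore.csv', encoding='utf8')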
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set ###
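# An explicit encoding avoids decoding errors on platforms whose default isn't UTF-8
opened_file = open('../data_sets/AppleStore.csv', encoding='utf8')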
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
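
The two loading blocks above repeat the same steps, so they could be folded into a small helper. The sketch below is one possible way to do that, not the code used in this project: `load_csv` is a hypothetical name, and the `encoding='utf8'` default is an assumption about how the files were saved. Using a `with` block also closes each file once it has been read.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path`."""
    # newline='' is what the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]

# Hypothetical usage with the same paths as above:
# android_header, android = load_csv('../data_sets/googleplaystore.csv')
# ios_header, ios = load_csv('../data_sets/AppleStore.csv')
```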

Next, I'll explore the two data sets by printing the first few rows of each one with the `explore_data()` function. Above the example rows of each data set, you'll also be able to see its columns. The column names should give you better insight into what each data point represents. I'll inspect these columns myself so that I can identify which ones might help me with my analysis.

In [3]:
print(android_header)
print()

explore_data(android, 0, 3, True)
print()

print(ios_header)
print()

explore_data(ios, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']

Number of rows: 10841
Number of columns: 13

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

['284882215', '

Most of the columns in the data sets are pretty self-explanatory, but if you don't understand some of the columns, you can view the documentation for the Google Play data set [here](https://www.kaggle.com/lava18/google-play-store-apps) and the App Store data set [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

And as for which columns will help me reach my goal, I believe that the following are the most interesting at a quick glance:

**Google Play** data set:

- `'App'`
- `'Category'`
- `'Reviews'`
- `'Installs'`
- `'Type'`
- `'Price'`
- `'Genres'`

**App Store** data set:

- `'track_name'`
- `'currency'`
- `'price'`
- `'rating_count_tot'`
- `'rating_count_ver'`
- `'prime_genre'`

## Deleting Wrong Data

Before beginning my analysis, I need to make sure that the data I analyze is accurate, otherwise the results of my analysis will be wrong. This means that I need to:

- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove the duplicates.

My company only builds apps that are free to download and install, and they are directed toward an English-speaking audience. This means that I'll need to:

- Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
- Remove apps that aren't free.

I'll begin by detecting and deleting wrong data.

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and you can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error for a certain row (entry 10472). This entry is missing its `'Category'` value, which shifts every following value one column to the left. Additionally, the app has an empty value for its genre. To show you this, I'll print the exact row from the data set, and then I'll delete this incorrect entry from the Google Play data set.

In [4]:
print(android[10472])

del android[10472]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
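
The faulty row was found through the Kaggle discussion, but a shifted row like this one also has fewer values than the header has columns (12 instead of 13), so a length check can flag it. The sketch below shows one way to do that. `find_bad_rows` is a hypothetical helper, and the header and rows are toy data made up for illustration.

```python
def find_bad_rows(header, data_set):
    """Return the indexes of rows whose length differs from the header's."""
    bad_indexes = []
    for index, row in enumerate(data_set):
        if len(row) != len(header):
            bad_indexes.append(index)
    return bad_indexes

# Toy data (made up): the second row is missing its 'Category' value
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.9'],
]

print(find_bad_rows(toy_header, toy_rows))  # [1]
```

If a check like this flags several rows, deleting them from the highest index down keeps the remaining indexes valid.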


## Removing Duplicate Entries

### Part One

Continuing the data cleaning process, if you explore the Google Play data set long enough or look at the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), you'll notice that some apps have duplicate entries. For instance, Instagram has four entries.

In [5]:
for app in android:
    name = app[0]
    
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In total, there are 1,181 cases where an app occurs more than once.

In [6]:
unique_apps = []
duplicate_apps = []

for app in android:
    name = app[0]
    
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of unique apps:', len(unique_apps))

print('\nNumber of duplicate apps:', len(duplicate_apps)) # \n also prints an extra line of space

print('\nExamples of duplicate apps:', duplicate_apps[:20])

Number of unique apps: 9659

Number of duplicate apps: 1181

Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


I don't want to count certain apps more than once when I analyze the data, so I need to remove the duplicate entries and keep only one entry per app. One thing I could do is remove the duplicate rows randomly, but I could probably find a better way.

If you examine the rows I printed for the Instagram app, you can see that the main difference between these duplicates is in the fourth position (number of reviews) of each row. The different numbers show that the data was collected at different times.

![Different values for number of reviews in duplicates](https://s3.amazonaws.com/dq-content/350/py1m8_fourth_col.png)

I can use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, I'll only keep the row with the highest number of reviews and remove the other entries for any given app.

### Part Two

When I looped through the Google Play data set, I found that there are 1,181 duplicates. After I remove these duplicates, the Google Play data set should be left with 9,659 rows.

In [7]:
print('Expected length:', len(android) - len(duplicate_apps))

Expected length: 9659


To remove the duplicates, I will:

- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
- Use the information stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, I'll only select the entry with the highest number of reviews).

I'll start by building the dictionary and displaying its length to be sure that it doesn't account for any duplicate entries.

In [8]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Actual length:', len(reviews_max))

Actual length: 9659


Now, I'll use the `reviews_max` dictionary to remove the duplicates. For the duplicate cases, I'll only keep the entries with the highest number of reviews. In the code cell below:

- I start by initializing two empty lists, `android_clean` and `already_added`.
- I loop through the `android` data set, and for every iteration:
    - I isolate the name of the app and the number of reviews.
    - I add the current row (`app`) to the `android_clean` list, and the app name (`name`) to the `already_added` list if:
        - The number of reviews of the current app matches the number of reviews of that app as described in the `reviews_max` dictionary.
        - The name of the app is not already in the `already_added` list &mdash; I need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If I just check for `reviews_max[name] == n_reviews`, I'll still end up with duplicate entries for some apps.

In [9]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
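
The two passes above can also be merged into one. The following is a sketch of an alternative, not the approach used in this project: for each app name, it keeps the first row that has the highest review count. `dedupe_by_reviews` is a hypothetical name, and the rows are toy data with only two columns (the Box review count is made up). One difference to be aware of is ordering: the rows come out in the order in which each name first appears.

```python
def dedupe_by_reviews(data_set, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the most reviews."""
    best_rows = {}
    for row in data_set:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means a later row with an equal count doesn't replace the first
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

# Toy rows: app name, number of reviews
toy = [
    ['Instagram', '66577313'],
    ['Box', '159872'],
    ['Instagram', '66577446'],
    ['Box', '159872'],
]
print(dedupe_by_reviews(toy, reviews_index=1))
# [['Instagram', '66577446'], ['Box', '159872']]
```

A dictionary lookup also avoids the repeated `name not in already_added` scans over a growing list.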

Lastly, I'll quickly explore the new and improved data set and confirm that the number of rows is precisely 9,659.

In [10]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']

Number of rows: 9659
Number of columns: 13


## Removing Non-English Apps

### Part One

Recall that my company creates English-only apps, and I'd like to analyze only the apps that are directed toward an English-speaking audience. However, if you explore the data long enough, you'll find that both data sets have apps with names that suggest that they are not directed toward an English-speaking audience.

In [11]:
print('Android:')
print('\nExample 1:', android_clean[4412][0])
print('Example 2:', android_clean[7940][0])

print()

print('iOS:')
print('\nExample 1:', ios[813][1])
print('Example 2:', ios[6731][1])

Android:

Example 1: 中国語 AQリスニング
Example 2: لعبة تقدر تربح DZ

iOS:

Example 1: 爱奇艺PPS -《欢乐颂2》电视剧热播
Example 2: 【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


I'm not interested in keeping these apps, so I'll remove them. One way to go about this is to remove each app with a name containing a symbol that is not commonly used in English text &mdash; English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Behind the scenes, each character used in a string has a corresponding number associated with it. For instance, the corresponding number for character 'a' is 97, character 'A' is 65, and character '爱' is 29,233. I can get the corresponding number of each character using the [`ord()` built-in function](https://docs.python.org/3/library/functions.html#ord).

The numbers corresponding to the characters commonly used in English text are all in the range 0 to 127, according to the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system. Based on this number range, I can build a function that detects whether a character belongs to the set of common English characters or not. If the number is less than or equal to 127, then the character belongs to the set of common English characters.

If an app name contains a character whose corresponding number is greater than 127, then it probably means that the app has a non-English name. The app names are stored as strings; I can use indexing to select an individual character, and I can also iterate over the string using a `for` loop.
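
As a quick illustration of the numbers mentioned above, the snippet below prints the result of `ord()` for a few characters and checks each one against the ASCII limit of 127. The variable name `sample_chars` and the choice of characters are only for illustration.

```python
# Characters mentioned in the text, plus a digit and a punctuation mark
sample_chars = ['a', 'A', '9', '!', '爱']

for char in sample_chars:
    code_point = ord(char)
    in_ascii_range = code_point <= 127
    print(char, code_point, in_ascii_range)

# a 97 True
# A 65 True
# 9 57 True
# ! 33 True
# 爱 29233 False
```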

With all this information in mind, I can write the function I just described.

In [12]:
def is_english(string):
    for char in string:
        if ord(char) > 127:
            return False
    
    return True

Before I use this function on the data sets, I'll test it out on some example names.

In [13]:
print('English app name:', is_english('Instagram'))
print('English app name:', is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

English app name: True
English app name: False


The function seems to work perfectly fine, but what about app names with the trademark symbol (™) or emojis (😜)?

In [14]:
print('English app name:', is_english('Docs To Go™ Free Office Suite'))
print('English app name:', is_english('Instachat 😜'))

English app name: False
English app name: False


These special symbols fall outside the ASCII range (they have corresponding numbers over 127). As a result, symbols like the ones above will cause useful apps to be removed if this function is used in its current form.

### Part Two

If I'm going to use the function I've created, I'll lose useful data since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, I'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English. This filter function is still not perfect, but it should be fairly effective.

Now, I'll edit the function I created, and then I'll use it to filter out the non-English apps.

In [15]:
def is_english(string):
    non_ascii = 0
    
    for char in string:
        if ord(char) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    
    return True

print('English app name:', is_english('Docs To Go™ Free Office Suite'))
print('English app name:', is_english('Instachat 😜'))
print('English app name:', is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

English app name: True
English app name: True
English app name: False


As you can see in the output above, the new `is_english()` function works fine. The final step is to use this new function to filter out the non-English apps from both data sets.

In [16]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)

print()

explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']

Number of rows: 9614
Number of columns: 13

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games

Now that all the non-English apps are out of the picture, there are 9,614 records left for the Google Play data set and 6,183 for the App Store data set.
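
The threshold-based check is a heuristic, and it can misjudge names in both directions. The sketch below restates the same rule under a different, hypothetical name (`looks_english`) so that it doesn't replace the function above, and runs it on two made-up names that show where the rule misfires.

```python
def count_non_ascii(string):
    """Count the characters whose number falls outside the ASCII range."""
    return sum(1 for char in string if ord(char) > 127)

def looks_english(string, max_non_ascii=3):
    # Same rule as in the text: more than three non-ASCII characters means non-English
    return count_non_ascii(string) <= max_non_ascii

# Made-up names chosen to show the limits of the heuristic
print(looks_english('Café Menü'))          # True, although the name isn't English
print(looks_english('Party Time 🎉🎉🎉🎉'))  # False, although the words are English
```

Cases like these are expected to be rare, which is why the simple rule is still a reasonable trade-off here.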

## Isolating the Free Apps

As I mentioned in the introduction, my company only builds apps that are free to download and install, and our main source of revenue consists of in-app ads. Currently, the data sets contain both free and non-free apps; I'll need to isolate only the free apps for my analysis.

Luckily, isolating the free apps will be the last step in the data cleaning process. After this, I'll start analyzing the data.

In [17]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    
    if price == '0.0':
        ios_final.append(app)
        
print('Remaining number of apps in Android data set:', len(android_final))
print('Remaining number of apps in iOS data set:', len(ios_final))

Remaining number of apps in Android data set: 8864
Remaining number of apps in iOS data set: 3222
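
The comparisons above depend on the exact strings `'0'` and `'0.0'`. A numeric comparison is one way to make the check less sensitive to formatting. This is a sketch only: `is_free` is a hypothetical helper, and the assumption that paid prices are plain numbers that may carry a leading dollar sign (for example `'$4.99'`) should be checked against the actual columns before relying on it.

```python
def is_free(price):
    """Return True when a price string represents zero."""
    # Assumption: prices are plain numbers, optionally prefixed with '$'
    cleaned = price.strip().lstrip('$')
    return float(cleaned) == 0.0

for price in ['0', '0.0', '$4.99', '1.99']:
    print(price, is_free(price))

# 0 True
# 0.0 True
# $4.99 False
# 1.99 False
```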


## Most Common Apps by Genre

### Part One

Again, my aim is to determine the kinds of apps that are likely to attract more users because my company's revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, my company's validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, I need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

I'll begin the analysis by getting a sense of what the most common genres for each market are. For this, I'll need to build frequency tables for a few columns in the data sets. But first, I'll inspect both data sets and identify the columns I could use to generate the frequency tables.

In [18]:
print(android_header)

print()

print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Evidently, the columns that I'll use for the Google Play data set will be `'Category'` and `'Genres'`, and I'll use the `'prime_genre'` column for the App Store data set.

### Part Two

With that objective in mind, I'll build two functions that I can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function that I can use to display the percentages in a descending order

In [19]:
def freq_table(data_set, index):
    table = {}
    
    for entry in data_set:
        key = entry[index]
        
        if key in table:
            table[key] += 1
        else:
            table[key] = 1
    
    table_percentages = {}
    
    for key in table:
        percentage = table[key] / len(data_set) * 100
        table_percentages[key] = percentage
    
    return table_percentages

def display_table(data_set, index, decimal_place=2): # By default, the data will be rounded to 2 decimal places
    table = freq_table(data_set, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print('{}: {}%'.format(entry[1], round(entry[0], decimal_place))) # Round data for readability
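
To make the behavior of these functions concrete, here is a small self-contained sketch that builds the same kind of percentage table with `collections.Counter` from the standard library and prints it in descending order. `freq_table_counter` is a hypothetical name for this variant, and the rows are toy data.

```python
from collections import Counter

def freq_table_counter(data_set, index):
    """Return {value: percentage} for the column at `index`."""
    counts = Counter(row[index] for row in data_set)
    total = len(data_set)
    return {key: count / total * 100 for key, count in counts.items()}

# Toy rows (made up): app name, category
toy = [
    ['App A', 'GAME'],
    ['App B', 'TOOLS'],
    ['App C', 'GAME'],
    ['App D', 'FAMILY'],
]

table = freq_table_counter(toy, 1)

# Sort by percentage, highest first
for key, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print('{}: {}%'.format(key, round(percentage, 2)))

# GAME: 50.0%
# TOOLS: 25.0%
# FAMILY: 25.0%
```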

### Part Three

Now that those functions are in place, I can focus on analyzing the frequency tables for both data sets.

I'll start with displaying the frequency table for the `'Category'` column in the Google Play data set.

In [20]:
display_table(android_final, 1)

FAMILY: 18.91%
GAME: 9.72%
TOOLS: 8.46%
BUSINESS: 4.59%
LIFESTYLE: 3.9%
PRODUCTIVITY: 3.89%
FINANCE: 3.7%
MEDICAL: 3.53%
SPORTS: 3.4%
PERSONALIZATION: 3.32%
COMMUNICATION: 3.24%
HEALTH_AND_FITNESS: 3.08%
PHOTOGRAPHY: 2.94%
NEWS_AND_MAGAZINES: 2.8%
SOCIAL: 2.66%
TRAVEL_AND_LOCAL: 2.34%
SHOPPING: 2.25%
BOOKS_AND_REFERENCE: 2.14%
DATING: 1.86%
VIDEO_PLAYERS: 1.79%
MAPS_AND_NAVIGATION: 1.4%
FOOD_AND_DRINK: 1.24%
EDUCATION: 1.16%
ENTERTAINMENT: 0.96%
LIBRARIES_AND_DEMO: 0.94%
AUTO_AND_VEHICLES: 0.93%
HOUSE_AND_HOME: 0.82%
WEATHER: 0.8%
EVENTS: 0.71%
PARENTING: 0.65%
ART_AND_DESIGN: 0.64%
COMICS: 0.62%
BEAUTY: 0.6%


Among the many apps, the most common ones are family apps, gaming apps, and apps designed as tools. However, upon further inspection, I realized that the family apps consist mostly of games for kids. Overall, it seems common for apps on Google Play to be designed more for practical purposes (family, tools, business, lifestyle, productivity, finance, etc.) than fun.

Next, I'll generate the frequency table for the `'Genres'` column in the Google Play data set.

In [21]:
display_table(android_final, -4)

Tools: 8.45%
Entertainment: 6.07%
Education: 5.35%
Business: 4.59%
Productivity: 3.89%
Lifestyle: 3.89%
Finance: 3.7%
Medical: 3.53%
Sports: 3.46%
Personalization: 3.32%
Communication: 3.24%
Action: 3.1%
Health & Fitness: 3.08%
Photography: 2.94%
News & Magazines: 2.8%
Social: 2.66%
Travel & Local: 2.32%
Shopping: 2.25%
Books & Reference: 2.14%
Simulation: 2.04%
Dating: 1.86%
Arcade: 1.85%
Video Players & Editors: 1.77%
Casual: 1.76%
Maps & Navigation: 1.4%
Food & Drink: 1.24%
Puzzle: 1.13%
Racing: 0.99%
Role Playing: 0.94%
Libraries & Demo: 0.94%
Auto & Vehicles: 0.93%
Strategy: 0.91%
House & Home: 0.82%
Weather: 0.8%
Events: 0.71%
Adventure: 0.68%
Comics: 0.61%
Beauty: 0.6%
Art & Design: 0.6%
Parenting: 0.5%
Card: 0.45%
Casino: 0.43%
Trivia: 0.42%
Educational;Education: 0.39%
Board: 0.38%
Educational: 0.37%
Education;Education: 0.34%
Word: 0.26%
Casual;Pretend Play: 0.24%
Music: 0.2%
Racing;Action & Adventure: 0.17%
Puzzle;Brain Games: 0.17%
Entertainment;Music & Video: 0.17%
Casual;

As you can see in the results above, tool apps, entertainment apps, and educational apps are frequent on Google Play. Just like the most common categories, the app genres on Google Play are made up of more hands-on content (besides the decent amount of entertainment apps). Obviously, there aren't very many differences between Google Play's categories and genres. However, the genres are more granular than the categories (there are a lot more genres than categories). Because I'm looking for the bigger picture at the moment, I'll stick to working with just the `'Category'` column moving forward.

Finally, I'll output the frequency table for the `'prime_genre'` column in the App Store data set.

In [22]:
display_table(ios_final, -5)

Games: 58.16%
Entertainment: 7.88%
Photo & Video: 4.97%
Education: 3.66%
Social Networking: 3.29%
Shopping: 2.61%
Utilities: 2.51%
Sports: 2.14%
Music: 2.05%
Health & Fitness: 2.02%
Productivity: 1.74%
Lifestyle: 1.58%
News: 1.33%
Travel: 1.24%
Finance: 1.12%
Weather: 0.87%
Food & Drink: 0.81%
Reference: 0.56%
Business: 0.53%
Book: 0.43%
Navigation: 0.19%
Medical: 0.19%
Catalogs: 0.12%


According to the data generated above, the most common apps reside in the `'Games'` genre by a long shot. In fact, almost 60% of the apps in the data set are gaming apps. The runner-up is `'Entertainment'` at close to 8%, which is far behind the gaming apps. As you can probably tell, this greatly contrasts with the popularity of practical categories and genres on Google Play.

Up to this point, I found that the App Store is dominated by apps designed for fun. Yet the Google Play market shows a more balanced landscape of both practical and for-fun apps (but mostly practical apps).

## Most Popular Apps by Genre on the App Store

Now, I'd like to get an idea about the kinds of apps with the most users.

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, I can find this information in the `'Installs'` column, but this information is missing for the App Store data set. As a workaround, I'll take the total number of user ratings as a proxy, which can be found in the `'rating_count_tot'` column.

I'll start with calculating the average number of user ratings per app genre on the App Store.

In [23]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    
    for app in ios_final:
        genre_app = app[-5]
        
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    
    avg_n_ratings = total / len_genre
    
    print('{}: {}'.format(genre, round(avg_n_ratings, 2)))

Social Networking: 71548.35
Photo & Video: 28441.54
Games: 22788.67
Music: 57326.53
Reference: 74942.11
Health & Fitness: 23298.02
Weather: 52279.89
Utilities: 18684.46
Travel: 28243.8
Shopping: 26919.69
News: 21248.02
Navigation: 86090.33
Lifestyle: 16485.76
Entertainment: 14029.83
Food & Drink: 33333.92
Sports: 23008.9
Book: 39758.5
Finance: 31467.94
Education: 7003.98
Productivity: 21028.41
Business: 7491.12
Catalogs: 4004.0
Medical: 612.0
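
The nested loop above walks through the whole data set once per genre. The same averages can be computed in a single pass by keeping a running total and a running count for each genre. The sketch below shows that idea on toy rows; `average_by_group` is a hypothetical helper and isn't used elsewhere in this project.

```python
def average_by_group(data_set, group_index, value_index):
    """Return {group: mean of the values in that group} in a single pass."""
    totals = {}
    counts = {}
    for row in data_set:
        group = row[group_index]
        totals[group] = totals.get(group, 0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows (made up): name, rating count, genre
toy = [
    ['App A', '100', 'Games'],
    ['App B', '300', 'Games'],
    ['App C', '50', 'Weather'],
]
print(average_by_group(toy, group_index=2, value_index=1))
# {'Games': 200.0, 'Weather': 50.0}
```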


On average, apps in the `'Navigation'` genre have the highest number of ratings at a little over 86,000 ratings. However, as you can see in the output below, this large number is heavily influenced by Waze and Google Maps. In fact, these two apps have a combined rating count of almost half a million ratings.

In [24]:
for app in ios_final:
    name = app[1]
    genre = app[-5]
    n_ratings = app[5]
    
    if genre == 'Navigation':
        print('{}: {}'.format(name, n_ratings))

Waze - GPS Navigation, Maps & Real-time Traffic: 345046
Google Maps - Navigation & Transit: 154911
Geocaching®: 12811
CoPilot GPS – Car Navigation & Offline Maps: 3582
ImmobilienScout24: Real Estate Search in Germany: 187
Railway Route Search: 5


This doesn't make for a very good app profile recommendation for the App Store. Unsurprisingly, the same pattern applies to social networking apps where giants like Facebook, Pinterest, and Skype greatly affect the average number of ratings. This is also true when it comes to music apps; there are a few big players like Pandora, Spotify, and Shazam.

These popular apps make their genres seem more popular than they really are. Most apps in these genres have barely more than 10,000 ratings, or fewer.
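
One way to see how much a few giants pull the average up is to compare the mean with the median, which is less sensitive to extreme values. The sketch below uses the six rating counts from the `'Navigation'` output above; the median is an addition for illustration and isn't part of the original analysis.

```python
from statistics import mean, median

# Rating counts copied from the 'Navigation' output above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print(round(mean(navigation_ratings), 2))  # 86090.33
print(median(navigation_ratings))          # 8196.5
```

The mean matches the genre average reported earlier, while the median is roughly ten times smaller.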

Similarly, reference apps have a high average number of ratings, but giants like the Bible and Dictionary.com skew this average, so it doesn't represent the other apps in the genre.

In [25]:
for app in ios_final:
    name = app[1]
    genre = app[-5]
    n_ratings = app[5]
    
    if genre == 'Reference':
        print('{}: {}'.format(name, n_ratings))

Bible: 985920
Dictionary.com Dictionary & Thesaurus: 200047
Dictionary.com Dictionary & Thesaurus for iPad: 54175
Google Translate: 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran: 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition: 17588
Merriam-Webster Dictionary: 16849
Night Sky: 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE): 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools: 4693
GUNS MODS for Minecraft PC Edition - Mods Tools: 1497
Guides for Pokémon GO - Pokemon GO News and Cheats: 826
WWDC: 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free: 718
VPN Express: 14
Real Bike Traffic Rider Virtual Reality Glasses: 8
教えて!goo: 0
Jishokun-Japanese English Dictionary & Translator: 0


However, compared to the other options, an app based on books actually has a lot of potential. One or more popular books could be turned into an app where the user could do more than just read the books. They could experience many special and interactive features: daily quotes from the books, audio versions of the books, quizzes about the books, and even an embedded dictionary (so that the user could look up words right within the app).

Another reason that this idea stands out is that it differs from the saturated market of for-fun apps within the App Store. Perhaps a more practical app would be just the thing to compete in a market dominated by apps designed for amusement. Additionally, imagine that one of the books in the app is about business: the app could then overlap with a variety of other genres.

The `'Book'` genre seems to overlap a bit with the app I described above. As for the other genres, although some may be popular, they either require a lot of resources, don't support in-app ads very well, aren't used very often, or fall outside the scope of this company.

## Most Popular Apps by Genre on Google Play

I have data about the number of installs for the Google Play market, so I should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — most values are open-ended (100+, 1,000+, 5,000+, etc.).

In [26]:
display_table(android_final, 5)

1,000,000+: 15.73%
100,000+: 11.55%
10,000,000+: 10.55%
10,000+: 10.2%
1,000+: 8.39%
100+: 6.92%
5,000,000+: 6.83%
500,000+: 5.56%
50,000+: 4.77%
5,000+: 4.51%
10+: 3.54%
500+: 3.25%
50,000,000+: 2.3%
100,000,000+: 2.13%
50+: 1.92%
5+: 0.79%
1+: 0.51%
500,000,000+: 0.27%
1,000,000,000+: 0.23%
0+: 0.05%
0: 0.01%


For instance, I don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, I don't need very precise data for my purposes — I only want to find out which app genres attract the most users, and I don't need perfect precision with respect to the number of users.

I'm going to leave the numbers as they are. This means that, for example, I'll treat an app with 100,000+ installs as having 100,000 installs, and an app with 1,000,000+ installs as having 1,000,000 installs. To perform computations, however, I'll need to convert each install number from a string to a float. This means that I need to remove the commas and plus characters; otherwise, the conversion will fail and raise an error. I'll do this now, and I'll also compute the average number of installs for each category (genre).
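To make the cleaning step concrete, here is a minimal sketch of the conversion on a few bucket labels from the table above. The helper name `parse_installs` is hypothetical and is not used elsewhere in this notebook; it only isolates the two `replace` calls and the `float` conversion.

```python
def parse_installs(raw):
    """Convert an install bucket such as '1,000,000+' to its lower bound as a float."""
    # Remove the thousands separators and the trailing plus sign
    cleaned = raw.replace(',', '').replace('+', '')
    return float(cleaned)


# Bucket labels copied from the frequency table above
samples = ['100,000+', '1,000,000+', '0']
parsed = [parse_installs(s) for s in samples]
print(parsed)
```

Calling `float('100,000+')` directly raises a `ValueError`, which is why both characters have to be stripped first.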

In [27]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0          # sum of installs for this category
    len_category = 0   # number of apps in this category
    
    for app in android_final:
        category_app = app[1]
        
        if category_app == category:
            # Strip ',' and '+' so that '1,000,000+' becomes '1000000'
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            
            total += float(n_installs)
            len_category += 1
    
    avg_n_installs = total / len_category
    
    print('{}: {}'.format(category, round(avg_n_installs, 2)))

ART_AND_DESIGN: 1986335.09
AUTO_AND_VEHICLES: 647317.82
BEAUTY: 513151.89
BOOKS_AND_REFERENCE: 8767811.89
BUSINESS: 1712290.15
COMICS: 817657.27
COMMUNICATION: 38456119.17
DATING: 854028.83
EDUCATION: 1833495.15
ENTERTAINMENT: 11640705.88
EVENTS: 253542.22
FINANCE: 1387692.48
FOOD_AND_DRINK: 1924897.74
HEALTH_AND_FITNESS: 4188821.99
HOUSE_AND_HOME: 1331540.56
LIBRARIES_AND_DEMO: 638503.73
LIFESTYLE: 1437816.27
GAME: 15588015.6
FAMILY: 3695641.82
MEDICAL: 120550.62
SOCIAL: 23253652.13
SHOPPING: 7036877.31
PHOTOGRAPHY: 17840110.4
SPORTS: 3638640.14
TRAVEL_AND_LOCAL: 13984077.71
TOOLS: 10801391.3
PERSONALIZATION: 5201482.61
PRODUCTIVITY: 16787331.34
PARENTING: 542603.62
WEATHER: 5074486.2
VIDEO_PLAYERS: 24727872.45
NEWS_AND_MAGAZINES: 9549178.47
MAPS_AND_NAVIGATION: 4056941.77

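The cell above scans the whole data set once per category. As an alternative, the same averages could be accumulated in a single pass. The sketch below assumes rows shaped like `android_final` (category at index 1, installs at index 5); the function name and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def average_installs_by_category(dataset, category_idx=1, installs_idx=5):
    """Return {category: average installs} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)

    for row in dataset:
        # Same cleaning as above: drop ',' and '+' before converting
        installs = float(row[installs_idx].replace(',', '').replace('+', ''))
        totals[row[category_idx]] += installs
        counts[row[category_idx]] += 1

    return {category: totals[category] / counts[category] for category in totals}


# Hypothetical rows (not real data) with the same column positions
sample_rows = [
    ['App A', 'TOOLS', '', '', '', '1,000+'],
    ['App B', 'TOOLS', '', '', '', '3,000+'],
    ['App C', 'WEATHER', '', '', '', '500+'],
]

averages = average_installs_by_category(sample_rows)
for category, avg in averages.items():
    print('{}: {}'.format(category, round(avg, 2)))
```

For a data set of this size either approach runs quickly, so the choice is mostly a matter of readability.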

Once again, my aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play. So, I'll try to come to a conclusion on at least one app profile recommendation that works for both the App Store and Google Play.

On average, communication apps have the most installs. However, a closer look shows that most of the category's installs come from just a handful of apps (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, Hangouts, etc.).

In [28]:
for app in android_final:
    category = app[1]
    
    if category == 'COMMUNICATION':
        installs = app[5]
        
        if installs == '1,000,000,000+' or installs == '500,000,000+' or installs == '100,000,000+':
            name = app[0]
            print('{}: {}'.format(name, installs))

WhatsApp Messenger: 1,000,000,000+
imo beta free calls and text: 100,000,000+
Android Messages: 100,000,000+
Google Duo - High Quality Video Calls: 500,000,000+
Messenger – Text and Video Chat for Free: 1,000,000,000+
imo free video calls and chat: 500,000,000+
Skype - free IM & video calls: 1,000,000,000+
Who: 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji: 100,000,000+
LINE: Free Calls & Messages: 500,000,000+
Google Chrome: Fast & Secure: 1,000,000,000+
Firefox Browser fast & private: 100,000,000+
UC Browser - Fast Download Private & Secure: 500,000,000+
Gmail: 1,000,000,000+
Hangouts: 1,000,000,000+
Messenger Lite: Free Calls & Messages: 100,000,000+
Kik: 100,000,000+
KakaoTalk: Free Calls & Text: 100,000,000+
Opera Mini - fast web browser: 100,000,000+
Opera Browser: Fast and Secure: 100,000,000+
Telegram: 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer: 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure: 100,000,000+
Viber Messenger: 500,000,000+
WeC

If I remove all the communication apps that have 100,000,000 or more installs, the average drops roughly tenfold.

In [29]:
under_100_m = []

for app in android_final:
    category = app[1]
    
    if category == 'COMMUNICATION':
        n_installs = app[5]
        n_installs = n_installs.replace(',', '')
        n_installs = n_installs.replace('+', '')
        n_installs = float(n_installs)
        
        if n_installs < 100000000:
            under_100_m.append(n_installs)
            
print('COMMUNICATION:', round(sum(under_100_m) / len(under_100_m), 2))

COMMUNICATION: 3603485.39


This same pattern occurs for the category of video player apps (the runner-up, with an average of 24,727,872 installs). The market is dominated by apps like YouTube, Google Play Movies & TV, and MX Player. This pattern can also be seen when it comes to social apps (where there are giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), and productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).
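To repeat the communication check for other categories without copying the cell, the filtering could be wrapped in a small function. This is only a sketch: `average_installs_below` is a hypothetical helper, the column positions (category at index 1, installs at index 5) follow `android_final`, and the sample rows are made up.

```python
def average_installs_below(dataset, category, cap, category_idx=1, installs_idx=5):
    """Average installs for one category, ignoring apps at or above `cap`."""
    kept = []

    for row in dataset:
        if row[category_idx] != category:
            continue
        installs = float(row[installs_idx].replace(',', '').replace('+', ''))
        if installs < cap:
            kept.append(installs)

    # Return None when no app in the category falls under the cap
    return sum(kept) / len(kept) if kept else None


# Hypothetical rows: one giant and two mid-sized apps
sample_rows = [
    ['Giant player', 'VIDEO_PLAYERS', '', '', '', '1,000,000,000+'],
    ['Player B', 'VIDEO_PLAYERS', '', '', '', '1,000,000+'],
    ['Player C', 'VIDEO_PLAYERS', '', '', '', '3,000,000+'],
]

capped_avg = average_installs_below(sample_rows, 'VIDEO_PLAYERS', 100000000)
print('VIDEO_PLAYERS:', capped_avg)
```

With the real data, a call such as `average_installs_below(android_final, 'VIDEO_PLAYERS', 100000000)` would show how much of that category's average depends on its largest apps.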

Again, the problem is that these app categories appear to be more popular than they actually are. Furthermore, these niches are dominated by a few giants that are nearly impossible to compete against.

The game category is clearly popular, but I found earlier that this part of the market is a bit saturated, so it would probably be better to come up with a different app recommendation if possible.

The `'BOOKS_AND_REFERENCE'` category looks fairly popular as well, with an average of 8,767,811 installs. It's worth exploring in more depth, since this category (much like its counterpart on the App Store) has the potential to be the best candidate.

Now, I'll take a look at some of the apps from this category and their number of installs.

In [30]:
for app in android_final:
    category = app[1]
    
    if category == 'BOOKS_AND_REFERENCE':
        name = app[0]
        n_installs = app[5]
        
        print('{}: {}'.format(name, n_installs))

E-Book Read - Read Book for free: 50,000+
Download free book with green book: 100,000+
Wikipedia: 10,000,000+
Cool Reader: 10,000,000+
Free Panda Radio Music: 100,000+
Book store: 1,000,000+
FBReader: Favorite Book Reader: 10,000,000+
English Grammar Complete Handbook: 500,000+
Free Books - Spirit Fanfiction and Stories: 1,000,000+
Google Play Books: 1,000,000,000+
AlReader -any text book reader: 5,000,000+
Offline English Dictionary: 100,000+
Offline: English to Tagalog Dictionary: 500,000+
FamilySearch Tree: 1,000,000+
Cloud of Books: 1,000,000+
Recipes of Prophetic Medicine for free: 500,000+
ReadEra – free ebook reader: 1,000,000+
Anonymous caller detection: 10,000+
Ebook Reader: 5,000,000+
Litnet - E-books: 100,000+
Read books online: 5,000,000+
English to Urdu Dictionary: 500,000+
eBoox: book reader fb2 epub zip: 1,000,000+
English Persian Dictionary: 500,000+
Flybook: 500,000+
All Maths Formulas: 1,000,000+
Ancestry: 5,000,000+
HTC Help: 10,000,000+
English translation from Beng

This category includes a large variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, and more. However, there's still a small number of extremely popular apps that skew the average.
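One way to see how a few very popular apps skew an average is to compare the mean with the median, which is far less sensitive to extreme values. The sketch below uses made-up install counts, not values from the data set, and relies only on Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical install counts: four mid-sized apps and one giant
installs = [100000.0, 500000.0, 1000000.0, 5000000.0, 1000000000.0]

print('mean:', mean(installs))
print('median:', median(installs))
```

Here the single giant pulls the mean to roughly 201 million installs, while the median stays at 1 million, which is much closer to what a typical app in the list looks like.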

In [31]:
for app in android_final:
    category = app[1]
    
    if category == 'BOOKS_AND_REFERENCE':
        n_installs = app[5]
        
        if n_installs == '1,000,000,000+' or n_installs == '500,000,000+' or n_installs == '100,000,000+':
            name = app[0]
            print('{}: {}'.format(name, n_installs))

Google Play Books: 1,000,000,000+
Bible: 100,000,000+
Amazon Kindle: 100,000,000+
Wattpad 📖 Free Books: 100,000,000+
Audiobooks from Audible: 100,000,000+


Since only a small number of apps are very popular, this market still shows potential. I'll explore a little further by looking for app ideas among the apps in the middle of the popularity range (the sweet spot between 1,000,000 and 100,000,000 downloads).

In [32]:
for app in android_final:
    category = app[1]
    
    if category == 'BOOKS_AND_REFERENCE':
        # Keep the original string for display and clean a copy for comparison
        n_installs_string = app[5]
        n_installs = n_installs_string.replace(',', '').replace('+', '')
        
        # Both bounds are inclusive, so apps listed as 100,000,000+ are printed too
        if 1000000 <= int(n_installs) <= 100000000:
            name = app[0]
            print('{}: {}'.format(name, n_installs_string))

Wikipedia: 10,000,000+
Cool Reader: 10,000,000+
Book store: 1,000,000+
FBReader: Favorite Book Reader: 10,000,000+
Free Books - Spirit Fanfiction and Stories: 1,000,000+
AlReader -any text book reader: 5,000,000+
FamilySearch Tree: 1,000,000+
Cloud of Books: 1,000,000+
ReadEra – free ebook reader: 1,000,000+
Ebook Reader: 5,000,000+
Read books online: 5,000,000+
eBoox: book reader fb2 epub zip: 1,000,000+
All Maths Formulas: 1,000,000+
Ancestry: 5,000,000+
HTC Help: 10,000,000+
Moon+ Reader: 10,000,000+
English-Myanmar Dictionary: 1,000,000+
Golden Dictionary (EN-AR): 1,000,000+
All Language Translator Free: 1,000,000+
Bible: 100,000,000+
Amazon Kindle: 100,000,000+
Aldiko Book Reader: 10,000,000+
Wattpad 📖 Free Books: 100,000,000+
Dictionary - WordWeb: 5,000,000+
50000 Free eBooks & Free AudioBooks: 5,000,000+
Al-Quran (Free): 10,000,000+
Al Quran Indonesia: 10,000,000+
Al'Quran Bahasa Indonesia: 10,000,000+
Al Quran Al karim: 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline: 

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries. Because of this, it's probably not a good idea to build similar apps since there'll be some significant competition.

There are also quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. Taking a popular book (perhaps a more recent one) and turning it into an app could therefore work for both the Google Play and App Store markets.

However, since the market is already full of libraries, our company will need to add some special features besides the raw version of the book: daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, and more.

## Conclusion

In this project, I analyzed data about the App Store and Google Play markets. I processed data on thousands of apps and inspected hundreds of entries from each data set. Overall, my goal was to come up with a data-driven app profile recommendation that could reasonably be applied to both Google Play and the App Store.

In the end, I came to the conclusion that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the App Store and Google Play markets. Yet since the markets are already full of libraries, our company will need to implement some special features besides the raw version of the book to make this app stand out.