# Identifying a Profitable Free iOS and Android App Profile

We are tasked with identifying which market segment of free iOS and Android apps has the most potential for a profitable app. We are interested in identifying an ad-driven app profile successful in both markets, so maximizing the number of users is key.

This project analyzes a dataset from Kaggle containing information about the top 7200 apps on the Apple iOS app store as of July 2017 and the top 10000 apps on the Google Play app store as of September 2018.

The dataset documentation can be found at the following links:

* [Apple iOS App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
* [Google Play App Store](https://www.kaggle.com/lava18/google-play-store-apps)

We begin by establishing `explore_data()`, a simple function to examine the datasets conveniently.

In [1]:
from typing import List, Optional
def explore_data(dataset: List[List], start: Optional[int] = None,
                 end: Optional[int] = None, rows_and_columns: bool = False) -> None:
    # Default to the whole dataset; slice ends are exclusive, so use len(dataset)
    start = start if start is not None else 0
    end = end if end is not None else len(dataset)

    print(*dataset[start:end], sep="\n\n")
    
    if rows_and_columns:
        print("Number of rows:", len(dataset))
        print("Number of cols:", len(dataset[0]), end="\n\n")

The CSV files are read and split into headers and data.

In [2]:
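from csv import reader
# Context managers close the files once they have been read
with open("datasets/AppleStore.csv", encoding="utf8", newline="") as f:
    apple_full = list(reader(f))
with open("datasets/googleplaystore.csv", encoding="utf8", newline="") as f:
    gplay_full = list(reader(f))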

apple_header, apple = apple_full[0], apple_full[1:]
gplay_header, gplay = gplay_full[0], gplay_full[1:]

print(apple_header, gplay_header, sep="\n")

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


## Data Cleaning

Some of the data in the datasets is erroneous and must be cleaned. First, there is a malformed row in the Google Play dataset which has the incorrect number of columns due to a missing `'Category'` value. Second, because the same app was scraped multiple times, there are several duplicate names among the data. We remove the malformed row first, then remove duplicate rows using the review count as a stand-in for a timestamp, so that only the entry with the highest review count, presumably the most recent, is kept.

### Removing malformed rows

The following function will return the given dataset with any rows of the incorrect length removed.

In [3]:
from typing import List
def remove_malformed_rows(dataset: List[List], header: List) -> List[List]:
    # Build a new list rather than deleting from the list being iterated over,
    # which would skip the row that follows each deletion.
    return [row for row in dataset if len(row) == len(header)]

In [4]:
print("Rows before removing malformed rows:", len(gplay))
gplay = remove_malformed_rows(gplay, gplay_header)
print("Rows after removing malformed rows: ", len(gplay))

Rows before removing malformed rows: 10841
Rows after removing malformed rows:  10840


One malformed row was removed from the Google Play data.

In [5]:
print("Rows before removing malformed rows:", len(apple))
apple = remove_malformed_rows(apple, apple_header)
print("Rows after removing malformed rows: ", len(apple))

Rows before removing malformed rows: 7197
Rows after removing malformed rows:  7197


The Apple data was not modified as no malformed rows exist in the dataset.
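
The length check can be tried on a toy dataset. This is a minimal sketch: the helper name `keep_well_formed` and the sample rows are hypothetical. It builds a new list instead of deleting from the list being iterated over, because deleting during iteration can skip the row that follows a deletion.

```python
from typing import List


def keep_well_formed(dataset: List[List[str]], header: List[str]) -> List[List[str]]:
    """Return only the rows whose length matches the header (hypothetical helper)."""
    return [row for row in dataset if len(row) == len(header)]


# Made-up sample: the second and third rows are each missing a column.
header = ["App", "Category", "Rating"]
rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "4.5"],
    ["App C", "3.9"],
    ["App D", "GAME", "4.7"],
]

clean = keep_well_formed(rows, header)
print(len(rows), "->", len(clean))  # 4 -> 2
```

Two adjacent malformed rows are used on purpose: a loop that deletes while iterating would remove the first and step over the second.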

### Removing duplicates

In [6]:
apple_unique_names = list()
apple_duplicate_names = list()
for row in apple:
    name = row[1]
    if name not in apple_unique_names:
        apple_unique_names.append(name)
    else:
        apple_duplicate_names.append(name)

apple_expected_len = len(apple) - len(apple_duplicate_names)
print(f"Apple dataset contains {len(apple_unique_names)} unique names "
      f"and {len(apple_duplicate_names)} duplicate names.")
        
gplay_unique_names = list()
gplay_duplicate_names = list()
for row in gplay:
    name = row[0]
    if name not in gplay_unique_names:
        gplay_unique_names.append(name)
    else:
        gplay_duplicate_names.append(name)

print(f"Google Play dataset contains {len(gplay_unique_names)} unique names "
      f"and {len(gplay_duplicate_names)} duplicate names.")

Apple dataset contains 7195 unique names and 2 duplicate names.
Google Play dataset contains 9659 unique names and 1181 duplicate names.


As seen above, the Apple and Google Play datasets contain 2 and 1181 duplicate apps, respectively. The following function will return the given dataset with any duplicate apps merged, using the review count as a pseudo-timestamp, leaving only the entry with the highest number of reviews, which should be the most recent.

In [7]:
from typing import List
def remove_duplicates_keyed(dataset: List[List], key_col: int, data_col: int) -> List[List]:
    """Currently unfinished function - could have a comparison
       function argument to support custom data types."""
    # Create a map of names to rows
    tracker = dict()
    for row in dataset:
        # Collect the name and review count
        key = row[key_col]
        new_data = int(row[data_col])

        # If the name does not exist in the tracker or the new review count
        # is higher, map the name to the current row.
        current_entry = tracker.get(key)
        if (current_entry is None
                or int(current_entry[data_col]) < new_data):
            tracker[key] = row

    return list(tracker.values())

The following snippet applies this function to each dataset, providing the name and review count column numbers.

In [8]:
apple_len_dup = len(apple)
apple = remove_duplicates_keyed(apple, key_col=1, data_col=5)
apple_len_nodup = len(apple)
apple_num_dup = apple_len_dup - apple_len_nodup


gplay_len_dup = len(gplay)
gplay = remove_duplicates_keyed(gplay, key_col=0, data_col=3)
gplay_len_nodup = len(gplay)
gplay_num_dup = gplay_len_dup - gplay_len_nodup

print(f"{apple_num_dup} duplicate apps removed from Apple dataset. New row count: {len(apple)}")
print(f"{gplay_num_dup} duplicate apps removed from Google Play dataset. New row count: {len(gplay)}")

2 duplicate apps removed from Apple dataset. New row count: 7195
1181 duplicate apps removed from Google Play dataset. New row count: 9659


The snippet reports that the correct number of duplicates were removed. The row counts match the number of unique names above - 7195 for Apple and 9659 for Google Play.
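
The keep-the-highest-review-count rule can be checked by hand on a small example. The sketch below uses made-up rows and a hypothetical helper name, `keep_most_reviewed`; it follows the same idea as `remove_duplicates_keyed()` but is not a replacement for it.

```python
from typing import Dict, List


def keep_most_reviewed(dataset: List[List[str]], name_col: int,
                       reviews_col: int) -> List[List[str]]:
    """Keep one row per name: the row with the highest review count."""
    best: Dict[str, List[str]] = {}
    for row in dataset:
        name = row[name_col]
        current = best.get(name)
        # Replace the stored row only when the new one has more reviews
        if current is None or int(current[reviews_col]) < int(row[reviews_col]):
            best[name] = row
    return list(best.values())


# Made-up rows: [name, review count]
sample = [
    ["Chat App", "100"],
    ["Photo App", "50"],
    ["Chat App", "250"],
    ["Chat App", "180"],
]

deduped = keep_most_reviewed(sample, name_col=0, reviews_col=1)
print(deduped)  # [['Chat App', '250'], ['Photo App', '50']]
```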

### Removing non-English apps

The iOS and Google Play stores contain apps in a variety of languages, but we are currently interested only in those directed at an English-speaking audience. To account for this, we will check each app name for non-ASCII characters, which we treat as a sign of a non-English name. As some English apps still use non-ASCII characters such as emojis or symbols, a small allowance is made to reduce the number of English apps mistakenly removed. If a name has more than 3 non-ASCII characters, it is considered non-English.
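
To make the threshold concrete, the sketch below counts non-ASCII characters in a few made-up names. The helpers `count_non_ascii` and `looks_english` and the sample names are hypothetical and only illustrate the rule.

```python
def count_non_ascii(name: str) -> int:
    """Count characters whose code point lies outside the ASCII range (0-127)."""
    return sum(1 for c in name if ord(c) > 127)


def looks_english(name: str, allowance: int = 3) -> bool:
    """Treat a name as English if it has at most `allowance` non-ASCII characters."""
    return count_non_ascii(name) <= allowance


# Made-up sample names
samples = ["Weather Now", "Caf\u00e9 Finder\u2122", "\u5929\u6c14\u9884\u62a5\u52a9\u624b"]

for name in samples:
    print(name, count_non_ascii(name), looks_english(name))
```

The first name has no non-ASCII characters, the second has two (an accented letter and a trademark sign) and is kept, and the third has six and is rejected.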

In [9]:
from typing import List
def remove_nonenglish(dataset: List[List], name_col: int) -> List[List]:
    output = list()
    for row in dataset:
        name = row[name_col]
        noneng_count = sum(1 for c in name if ord(c) > 127)
        if noneng_count < 4:
            output.append(row)
    
    return output

The following snippet applies this function to each dataset.

In [10]:
apple_num_all_lang = len(apple)
apple = remove_nonenglish(apple, name_col=1)
apple_num_eng = len(apple)
apple_num_noneng = apple_num_all_lang - apple_num_eng

print(f"There are {apple_num_eng} English apps in the Apple dataset. "
      f"Removed {apple_num_noneng} non-English apps.")

gplay_num_all_lang = len(gplay)
gplay = remove_nonenglish(gplay, name_col=0)
gplay_num_eng = len(gplay)
gplay_num_noneng = gplay_num_all_lang - gplay_num_eng

print(f"There are {gplay_num_eng} English apps in the Google Play dataset. "
      f"Removed {gplay_num_noneng} non-English apps.")

There are 6181 English apps in the Apple dataset. Removed 1014 non-English apps.
There are 9614 English apps in the Google Play dataset. Removed 45 non-English apps.


### Removing non-free apps

As we are interested only in free apps, we must remove apps that are not free. The following function will remove any apps identified as non-free from a given dataset.

In [11]:
import string
from typing import List
def remove_nonfree(dataset: List[List], price_col: int) -> List[List]:
    output = list()
    for row in dataset:
        # Keep digits and the decimal point, dropping symbols such as "$"
        price = ''.join(c for c in str(row[price_col]) if c in string.digits + '.')
        if float(price) == 0:
            output.append(row)

    return output

The following snippet removes all non-free apps from the datasets.

In [12]:
apple_num_all_price = len(apple)
apple = remove_nonfree(apple, 4)
apple_num_free = len(apple)
apple_num_nonfree = apple_num_all_price - apple_num_free

print(f"There are {apple_num_free} free apps in the Apple dataset. "
      f"{apple_num_nonfree} non-free apps were removed.")

gplay_num_all_price = len(gplay)
gplay = remove_nonfree(gplay, 7)
gplay_num_free = len(gplay)
gplay_num_nonfree = gplay_num_all_price - gplay_num_free

print(f"There are {gplay_num_free} free apps in the Google Play dataset. "
      f"{gplay_num_nonfree} non-free apps were removed.")

There are 3220 free apps in the Apple dataset. 2961 non-free apps were removed.
There are 8864 free apps in the Google Play dataset. 750 non-free apps were removed.


## Identifying a top free English app profile

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

As our end goal is to make the app available on both Google Play and the iOS App Store, we need to find app profiles that are successful in both markets. We will begin by determining the most common genres for each market.

### Selecting relevant columns

The `prime_genre` column in the Apple dataset and the `Genres` and `Category` columns in the Google Play dataset contain the information we need. We will create a frequency table and sort the dictionary by values using the `frequency_table()` function. `print_frequency_table()` is used to print these tables in a readable fashion.
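
Before the full implementation, the core idea can be sketched with the standard library's `collections.Counter` on a toy column. The function name `counter_frequency_table` and the sample values are hypothetical. Unlike the notebook's own function, this sketch splits entries only when a separator is passed explicitly.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def counter_frequency_table(values: List[str],
                            sep: Optional[str] = None) -> Dict[str, Tuple[int, float]]:
    """Count labels and attach each label's share of the rows, in percent."""
    counts: Counter = Counter()
    for value in values:
        # Split only when a separator is given, so multi-word labels stay whole
        labels = value.split(sep) if sep is not None else [value]
        counts.update(labels)
    total = len(values)
    # most_common() yields labels from highest to lowest count
    return {label: (n, round(n / total * 100, 2)) for label, n in counts.most_common()}


# Made-up column: the last entry carries two labels separated by ';'
genres = ["Games", "Games", "Photo & Video", "Education;Games"]
table = counter_frequency_table(genres, sep=";")
print(table)
# {'Games': (3, 75.0), 'Photo & Video': (1, 25.0), 'Education': (1, 25.0)}
```

Percentages are taken relative to the number of rows, so they can sum to more than 100 when a row carries several labels.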

In [13]:
from typing import List, Dict, Tuple
def frequency_table(dataset: List[List], column: int,
                    sep: str = None, add_percent: bool = False) -> Dict[str, Tuple[int, float]]:
    # Construct the frequency table
    output = dict()
    for row in dataset:
        # The optional separator can be used to account for
        # multiple pieces of information in a single entry
        for key in str(row[column]).split(sep):
            try:
                output[key] += 1
            except KeyError:
                output[key] = 1

    # Sort
    output_sorted = dict(sorted(output.items(), key=lambda x: x[1], reverse=True))

    # Optionally, add percentage of total
    if add_percent:
        for key, count in output_sorted.items():
            output_sorted[key] = (count, round(count / len(dataset) * 100, 2))
    
    return output_sorted


from typing import Dict, Optional, Tuple
def print_frequency_table(table: Dict[str, Tuple[int, float]], key_title: str,
                          top: Optional[int] = None, show_percent: bool = False) -> None:
    key_pad = max(len(key_title), max(len(x) for x in table.keys())) + 1
    count_pad = max(len("Count"), max(len(str(x[0])) for x in table.values())) + 1

    if top is None:
        top = len(table)

    shown = 0
    if show_percent:
        print(f"{key_title:<{key_pad}}| {'Count':<{count_pad}}| Percentage")
        print(f"{'-'*key_pad}|{'-'*(count_pad+1)}|{'-'*12}")
        for key, info in table.items():
            if shown >= top:
                break
            else:
                shown += 1
                
            count, pct = info
            print(f"{key:<{key_pad}}| {count:<{count_pad}}| {pct}")
    else:
        print(f"{key_title:<{key_pad}}| {'Count':<{count_pad}}")
        print(f"{'-'*key_pad}|{'-'*(count_pad+1)}")
        for key, info in table.items():
            try:
                count = info[0]
            except TypeError:
                count = info
                
            if shown >= top:
                break
            else:
                shown += 1
            print(f"{key:<{key_pad}}| {count:<{count_pad}}")
    print()

In [14]:
apple_genre_freq = frequency_table(apple, 11, add_percent=True)

print("Apple genre frequency:")
print_frequency_table(apple_genre_freq, key_title="Genre", show_percent=True)

gplay_category_freq = frequency_table(gplay, 1, add_percent=True)
gplay_genre_freq = frequency_table(gplay, 9, sep=';', add_percent=True)

print("Google Play category frequency:")
print_frequency_table(gplay_category_freq, key_title="Category", show_percent=True)

print("Google Play genre frequency:")
print_frequency_table(gplay_genre_freq, key_title="Genre", show_percent=True)

Apple genre frequency:
Genre         | Count | Percentage
--------------|-------|------------
Games         | 1872  | 58.14
Entertainment | 254   | 7.89
&             | 251   | 7.8
Photo         | 160   | 4.97
Video         | 160   | 4.97
Education     | 118   | 3.66
Social        | 106   | 3.29
Networking    | 106   | 3.29
Shopping      | 84    | 2.61
Utilities     | 81    | 2.52
Sports        | 69    | 2.14
Music         | 66    | 2.05
Health        | 65    | 2.02
Fitness       | 65    | 2.02
Productivity  | 56    | 1.74
Lifestyle     | 51    | 1.58
News          | 43    | 1.34
Travel        | 40    | 1.24
Finance       | 36    | 1.12
Weather       | 28    | 0.87
Food          | 26    | 0.81
Drink         | 26    | 0.81
Reference     | 18    | 0.56
Business      | 17    | 0.53
Book          | 14    | 0.43
Navigation    | 6     | 0.19
Medical       | 6     | 0.19
Catalogs      | 4     | 0.12

Google Play category frequency:
Category            | Count | Percentage
--------------------

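That's a lot of information, so let's cut down our search to the top 5 or 10 in each column, which for these datasets will cut off at entries that make up about 3% of the column. We'll use the top 5 for the broad Apple genre and Google Play category columns, and the top 10 for the more precise Google Play genre column. Note that `frequency_table()` splits each entry on whitespace when `sep` is `None`, so multi-word Apple genres such as `Photo & Video` appear as separate `Photo`, `&`, and `Video` rows.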

In [15]:
print("Top 5 Apple genres:")
print_frequency_table(apple_genre_freq, key_title="Genre", top=5, show_percent=True)

print("Top 5 Google Play categories:")
print_frequency_table(gplay_category_freq, key_title="Category", top=5, show_percent=True)

print("Top 10 Google Play genres:")
print_frequency_table(gplay_genre_freq, key_title="Genre", top=10, show_percent=True)

Top 5 Apple genres:
Genre         | Count | Percentage
--------------|-------|------------
Games         | 1872  | 58.14
Entertainment | 254   | 7.89
&             | 251   | 7.8
Photo         | 160   | 4.97
Video         | 160   | 4.97

Top 5 Google Play categories:
Category            | Count | Percentage
--------------------|-------|------------
FAMILY              | 1676  | 18.91
GAME                | 862   | 9.72
TOOLS               | 750   | 8.46
BUSINESS            | 407   | 4.59
LIFESTYLE           | 346   | 3.9

Top 10 Google Play genres:
Genre                   | Count | Percentage
------------------------|-------|------------
Tools                   | 750   | 8.46
Education               | 606   | 6.84
Entertainment           | 569   | 6.42
Business                | 407   | 4.59
Lifestyle               | 347   | 3.91
Productivity            | 345   | 3.89
Finance                 | 328   | 3.7
Medical                 | 313   | 3.53
Sports                  | 309   | 3.49
Person

Based on this information, we can identify some significant trends. First, games and entertainment make up the majority of Apple applications, with just over 66% of all free English apps falling in these two genres. Second, the free English app categories on Android are more diverse: only about 29% are games (counting the Family category, which is composed mostly of games for a young audience), and about 17% are tools, business, and lifestyle apps. Looking at the Google Play genre information, about 20% fall under tools, education, and business, indicating a balance between productivity and entertainment, as opposed to the dominance of games and entertainment on the Apple store.
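
The combined shares quoted above can be reproduced from the percentages in the top-5 and top-10 tables. The sketch below simply adds the relevant rows; the helper name `share` is hypothetical.

```python
from typing import Dict, List

# Percentages copied from the top-5 and top-10 tables above
apple_genres = {"Games": 58.14, "Entertainment": 7.89}
gplay_categories = {"FAMILY": 18.91, "GAME": 9.72, "TOOLS": 8.46,
                    "BUSINESS": 4.59, "LIFESTYLE": 3.9}
gplay_genres = {"Tools": 8.46, "Education": 6.84, "Business": 4.59}


def share(table: Dict[str, float], keys: List[str]) -> float:
    """Sum the percentages of the given keys, rounded to two decimals."""
    return round(sum(table[k] for k in keys), 2)


print(share(apple_genres, ["Games", "Entertainment"]))                # 66.03
print(share(gplay_categories, ["FAMILY", "GAME"]))                    # 28.63
print(share(gplay_categories, ["TOOLS", "BUSINESS", "LIFESTYLE"]))    # 16.95
print(share(gplay_genres, ["Tools", "Education", "Business"]))        # 19.89
```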

### Identifying app genres with the most users

We have identified which types of apps the datasets are composed of. Now, we would like to know which genres are most popular. The Google Play dataset has an `Installs` column we can use directly. The Apple dataset has no install count, so we use the total number of ratings, `rating_count_tot`, as a proxy.
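
The per-genre average boils down to summing a numeric column per key and dividing by the number of apps under that key. The sketch below assumes install counts are stored as strings such as `"1,000,000+"`; the sample rows and the helper `parse_count` are hypothetical.

```python
import string
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_count(text: str) -> int:
    """Keep only the digits of a count such as '1,000,000+' (assumed format)."""
    return int("".join(c for c in text if c in string.digits))


# Made-up rows: (category, installs)
rows: List[Tuple[str, str]] = [
    ("TOOLS", "1,000,000+"),
    ("TOOLS", "500,000+"),
    ("GAME", "10,000,000+"),
]

totals: Dict[str, int] = defaultdict(int)
counts: Dict[str, int] = defaultdict(int)
for category, installs in rows:
    totals[category] += parse_count(installs)
    counts[category] += 1

# Average installs per app in each category
averages = {k: round(totals[k] / counts[k]) for k in totals}
print(averages)  # {'TOOLS': 750000, 'GAME': 10000000}
```

Because values like `"1,000,000+"` are lower bounds, averages built this way are best read as rough comparisons rather than exact install counts.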

In [16]:
import string
from typing import List, Dict, Tuple
def frequency_table_keyed_total(dataset: List[List], key_col: int, data_col: int,
                                sep: str = None) -> Dict[str, Tuple[int, int, int]]:
    """Given a dataset, add together entries in data_col, indexed by key_col."""
    total_counts_by_key = dict()
    for row in dataset:
        keys = str(row[key_col]).split(sep)
        data_str = str(row[data_col])
        data = int(''.join(c for c in data_str if c in string.digits ))

        for key in keys:
            # Add data and increment count
            if key not in total_counts_by_key:
                total_counts_by_key[key] = (data, 1)
            else:
                prev = total_counts_by_key[key]
                total_counts_by_key[key] = (
                    prev[0] + data,
                    prev[1] + 1
                )

    # Add average
    total_counts_by_key = { k: (v[0], v[1], round(v[0] / v[1])) for k, v in total_counts_by_key.items() }

    return dict(sorted(total_counts_by_key.items(), key=lambda x: x[1][1], reverse=True))


from typing import Dict, Optional, Tuple
def print_frequency_table_keyed_total(table: Dict[str, Tuple[int, int, int]], key_title: str,
                                      top: Optional[int] = None, sort_by_average: bool = False) -> None:
    key_pad = max(len(key_title), max(len(x) for x in table.keys())) + 1
    total_pad = max(len("Total"), max(len(str(x[0])) for x in table.values())) + 1

    if top is None:
        top = len(table)

    if sort_by_average:
        table = dict(sorted(table.items(), key=lambda x: x[1][2], reverse=True))

    shown = 0
    print(f"{key_title:<{key_pad}}| {'Total':<{total_pad}}| Average")
    print(f"{'-'*key_pad}|{'-'*(total_pad+1)}|{'-'*9}")
    for key, val in table.items():
        if shown >= top:
            break
        else:
            shown += 1

        total, count, avg = val
        print(f"{key:<{key_pad}}| {total:<{total_pad}}| {avg}")
    print()

In [17]:
apple_total_ratings_by_genre = frequency_table_keyed_total(apple, 11, 5)

print("Apple genres with highest average ratings:")
print_frequency_table_keyed_total(apple_total_ratings_by_genre, key_title="Genre", top=10, sort_by_average=True)

gplay_total_installs_by_genre = frequency_table_keyed_total(gplay, 9, 5, sep=';')
gplay_total_installs_by_category = frequency_table_keyed_total(gplay, 1, 5)

print("Google play categories with highest average installs:")
print_frequency_table_keyed_total(gplay_total_installs_by_category, key_title="Category", top=10, sort_by_average=True)
print("Google play genres with highest average installs:")
print_frequency_table_keyed_total(gplay_total_installs_by_genre, key_title="Genre", top=10, sort_by_average=True)

Apple genres with highest average ratings:
Genre         | Total    | Average
--------------|----------|---------
Navigation    | 516542   | 86090
Reference     | 1348958  | 74942
Social        | 7584125  | 71548
Networking    | 7584125  | 71548
Music         | 3783551  | 57327
Weather       | 1463837  | 52280
Book          | 556619   | 39758
Food          | 866682   | 33334
Drink         | 866682   | 33334
Finance       | 1132846  | 31468

Google play categories with highest average installs:
Category            | Total       | Average
--------------------|-------------|---------
COMMUNICATION       | 11036906201 | 38456119
VIDEO_PLAYERS       | 3931731720  | 24727872
SOCIAL              | 5487861902  | 23253652
PHOTOGRAPHY         | 4656268815  | 17840110
PRODUCTIVITY        | 5791629314  | 16787331
GAME                | 13436869450 | 15588016
TRAVEL_AND_LOCAL    | 2894704086  | 13984078
ENTERTAINMENT       | 989460000   | 11640706
TOOLS               | 8101043474  | 10801391
NEWS_AN

Based on these averages, there is no clear winner. Navigation, communication, and social media are top genres, but their averages are skewed by apps like Google Maps and Instagram, so it would be difficult to break into these markets. Reference is primarily composed of existing books and guides. Creating such an app would be relatively simple, and it could gain a large number of users through its utility.

## Conclusions

The analysis of free apps indicates that while games are the most common genre, they do not have as many users per app on average. Social media apps are popular but subject to the network effect, making it very difficult to break into this market. A free ad-driven reference app appears to be a strong candidate for profitability in both the iOS and Android free English app markets. A potential app profile could be a reference app that generates revenue via both ads and features that give it utility beyond the source material.