# Profitable App Profiles for the App Store and Google Play Markets

## Introduction
Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

In [None]:
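from csv import reader

# Open each file in a context manager so it is closed once it has been read.
with open('AppleStore.csv', encoding='utf8') as ios_file:
    ios_apps = list(reader(ios_file))
with open('googleplaystore.csv', encoding='utf8') as android_file:
    android_apps = list(reader(android_file))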

ios_apps_header = ios_apps[0]
ios_apps_dataset = ios_apps[1:]

android_apps_header = android_apps[0]
android_apps_dataset = android_apps[1:]

print(ios_apps_dataset[:6])
print('\n')
print(android_apps_dataset[:6])


The `explore_data()` function takes four parameters:

- `dataset`, which will be a list of lists
- `start` and `end`, which will both be integers and represent the starting and the ending indices of a slice from the dataset
- `rows_and_columns`, which will be a Boolean and has False as a default argument


In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Printing the first few rows of the App Store and Google Play datasets.

In [None]:
# explore_data() prints its own output and returns None, so it is called directly.
explore_data(ios_apps_dataset, 0, 5, rows_and_columns=True)
print('\n')
explore_data(android_apps_dataset, 0, 5, rows_and_columns=True)

Printing the column names of the App Store and Google Play datasets.

[App Store column description](http://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps)

[Google Play column description](https://www.kaggle.com/datasets/lava18/google-play-store-apps?resource=download)

In [None]:
print(ios_apps_header)
print('\n')
print(android_apps_header)

## Data Cleaning
This involves removing or correcting wrong data, removing duplicate data, and modifying the data to fit the purpose of our analysis.

1. Check for missing columns in the Google Play dataset

In [None]:
print(android_apps_header)

for row in android_apps_dataset:
    if len(row) != len(android_apps_header):
        print(row)
        print(android_apps_dataset.index(row))
        

We can see that the `Category` value is missing from this row and that there is no data in the `Genres` column, which is left blank.

As a result, we delete this row to get rid of the error.

In [None]:
del android_apps_dataset[10472]
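
The check and the deletion above can also be combined so that no index is hard-coded. Note that `list.index(row)` returns the position of the first matching row, and that re-running a cell with a fixed `del` removes a different row each time. Below is a minimal sketch, assuming a tiny made-up dataset; the helper name `find_malformed` is hypothetical and not part of the original notebook.

```python
# Hypothetical miniature dataset: the header has 3 columns, one row has only 2.
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # malformed: the Category value is missing
    ['App C', 'GAME', '4.5'],
]

def find_malformed(rows, n_columns):
    """Return the indices of rows whose length differs from n_columns."""
    return [i for i, row in enumerate(rows) if len(row) != n_columns]

bad_indices = find_malformed(rows, len(header))
print(bad_indices)

# Delete from the end so the earlier indices stay valid while removing.
for i in sorted(bad_indices, reverse=True):
    del rows[i]
print(len(rows))
```

Because the indices are recomputed on each run, running the cell a second time finds nothing to delete and leaves the data unchanged.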

2. Check for missing columns in the App Store dataset

In [None]:
print(ios_apps_header)

for row in ios_apps_dataset:
    if len(row) != len(ios_apps_header):
        print(row)
        print(ios_apps_dataset.index(row))
        

There are no missing column values in the App Store dataset.

3. Check for duplicate entries in the Google Play dataset

In [None]:
android_duplicate_apps = []
android_unique_apps = []

for row in android_apps_dataset:
    app_name = row[0]
    if app_name in android_unique_apps:
        android_duplicate_apps.append(app_name)
    else:
        android_unique_apps.append(app_name)
        
print('Number of duplicate apps:', len(android_duplicate_apps))
print('Number of unique apps:', len(android_unique_apps))
print('\n')
print('Examples of duplicate apps:', android_duplicate_apps[:15])

Next, we select the criterion for removing duplicates.

In [None]:
print(android_apps_header)

for row in android_apps_dataset:
    app_name = row[0]
    if app_name == 'Instagram':
        print(row)

We can see that the duplicates for `Instagram` differ in the `Reviews` column. The row with the highest number of reviews is most likely the most recent entry, so that is the row we will keep for each app.

Now, we will remove the duplicate entries from the Google Play dataset.

In [None]:
reviews_max = {}

for row in android_apps_dataset:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))
print('\n')
print(reviews_max)
    

From the output above, we see that 9659 unique apps remain once each app name is mapped to its highest review count.

Next, we will use the `reviews_max` dictionary created above to remove the duplicate rows:

In [None]:
android_clean = []
already_added = []

for row in android_apps_dataset:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
print(len(android_clean))
print(android_clean)
print(len(already_added))

In the code above, we created two lists: `android_clean`, which stores the cleaned Google Play dataset without duplicates, and `already_added`, which stores the names of the apps already placed in the cleaned dataset.

We needed the supplementary condition `name not in already_added` to handle cases where more than one entry for a duplicate app has the same highest number of reviews.
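
The two passes above can also be folded into one by keeping the best row seen so far for each name in a dictionary. This is only a sketch on made-up rows, not the notebook's method; the function name `dedupe_by_max_reviews` and the four-column layout are assumptions for illustration.

```python
# Hypothetical rows: [name, category, rating, reviews]
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Notes', 'TOOLS', '4.0', '120'],
    ['Notes', 'TOOLS', '4.0', '120'],   # tie on the review count
]

def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict ">" keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

deduped = dedupe_by_max_reviews(sample_rows)
print(len(deduped))
for row in deduped:
    print(row)
```

Dictionary lookups take constant time on average, whereas `name not in already_added` scans a list, so this version should scale better on larger datasets. One difference to keep in mind: the result is ordered by each name's first appearance rather than by the position of the retained row.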

Next, we explore the datasets to remove apps that are not in English.

In [None]:
def english(string):
    for character in string:
        if ord(character) > 127:
            return False
        
    return True

print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Docs To Go™ Free Office Suite'))
print(english('Instachat 😜'))

We can see from the code above that some English apps with emojis or other non-ASCII characters in their names, like `™`, will be incorrectly labeled as non-English apps.

To accommodate English apps that have these special characters, we relax the filter so that a name is rejected only if it contains more than three non-ASCII characters.


In [None]:
def english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
    # Allow up to three non-ASCII characters (emoji, ™, etc.).
    return non_ascii <= 3
        
print(english('Docs To Go™™™ Free Office Suite'))

In [None]:
android_clean_english = []

for row in android_clean:
    android_name = row[0]
    
    if english(android_name):
        android_clean_english.append(row)

print(len(android_clean_english))       
print(android_clean_english)

In [None]:
ios_clean_english = []

for row in ios_apps_dataset:
    ios_name = row[1]
    
    if english(ios_name):
        ios_clean_english.append(row)

print(len(ios_clean_english))       
print(ios_clean_english)

Now, we're going to isolate the free apps in both datasets.

In [None]:
print(android_apps_header)
print('\n')
print(ios_apps_header)

In [None]:
android_free = []

for row in android_clean_english:
    app_type = row[6]  # renamed to avoid shadowing the built-in type()
    
    if app_type == 'Free':
        android_free.append(row)
        
print(len(android_free))
print(android_free)

In [None]:
ios_free = []

for row in ios_clean_english:
    price = float(row[4])
    
    if price == 0.0:
        ios_free.append(row)
        
print(len(ios_free))
print(ios_free)

Our goal is to determine the kinds of apps that are likely to attract more users because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.


We'll begin the analysis by determining the most common genres for each market.

We will do this by building frequency tables for the `Category` and `Genres` columns of the Google Play dataset and the `prime_genre` column of the App Store dataset.

In [None]:
def freq_table(dataset, index):
    freq_dict = {}
    total = 0

    for row in dataset:
        total += 1
        column = row[index]
        if column not in freq_dict:
            freq_dict[column] = 1
        else:
            freq_dict[column] += 1
            
    for column in freq_dict:
        freq_dict[column] /= total
        freq_dict[column] *= 100
            
    return freq_dict

freq_table(android_free, 1)

We'll need to build a second function that can help us display the entries in the frequency table in descending order.

This function will transform the frequency table into a list of tuples, then sort the list in descending order.

In [None]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
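
As an alternative to the two functions above, the standard library's `collections.Counter` can count and sort in one step. The sketch below runs on a made-up four-row dataset; the name `freq_table_sorted` is hypothetical and the function returns a list instead of printing.

```python
from collections import Counter

def freq_table_sorted(rows, index):
    """Return (value, percentage) pairs for one column, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # most_common() yields the entries ordered by count, highest first.
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Hypothetical rows: [name, category]
sample = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]
for value, pct in freq_table_sorted(sample, 1):
    print(value, ':', pct)
```

Returning the pairs rather than printing them makes the table easier to reuse later, for example to loop over the genres when computing averages.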

## `prime_genre` frequency table analysis (App Store Dataset)

We can see that among the free English apps, more than half (58.16%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users; the demand might not match the supply.

In [None]:
print('APP STORE - PRIME_GENRE')
print('\n')
display_table(ios_free, 11)

## `Category` frequency table analysis (Google Play Store Dataset)

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) means mostly games for kids.

In [None]:
print('GOOGLE PLAY STORE - CATEGORY')
print('\n')
display_table(android_free, 1)

## `Genres` frequency table analysis (Google Play Store Dataset)

Practical apps seem to have a better representation on Google Play compared to the App Store. This is confirmed by the frequency table we see for the `Genres` column.

In [None]:
print('GOOGLE PLAY STORE - GENRES')
print('\n')
display_table(android_free, 9)

The difference between the `Genres` and the `Category` columns is not crystal clear, but one thing we can notice is that the `Genres` column is much more granular (it has more categories). 

We're only looking for the bigger picture at the moment, so we'll only work with the `Category` column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. 

Now we'd like to get an idea of the kinds of apps that have the most users.

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but for the App Store data set this information is missing. 
As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.
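
The per-genre average described above can be computed in a single pass over the data by accumulating a running total and a count for each group. This is a minimal sketch on made-up rows; the function name `average_by_group` and the three-column layout are assumptions for illustration only.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical rows: [name, genre, number of ratings]
sample = [
    ['Maps X', 'Navigation', '300'],
    ['Maps Y', 'Navigation', '100'],
    ['Chat Z', 'Social Networking', '50'],
]
averages = average_by_group(sample, 1, 2)
# Show the genres with the highest average first.
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

A nested loop over genres and rows gives the same numbers; the single pass simply avoids rescanning the dataset once per genre.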

## App Store

In [None]:
ios_genre_freq = freq_table(ios_free, 11)
print(ios_genre_freq)

In [None]:
for genre in ios_genre_freq:
    total = 0
    len_genre = 0
    
    for row in ios_free:
        genre_app = row[11]
        if genre_app == genre:
            total += float(row[5])
            len_genre += 1
            
    # Average number of ratings per app in this genre (not the rating score).
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
    
    

We observe that in the App Store, on average, `Navigation` apps have the highest number of user ratings (86090).

Further analysis shows that the `Navigation` figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings.

In [None]:
for row in ios_free:
    app_name = row[1]
    genre = row[11]
    user_ratings = row[5]
    if genre == 'Navigation':
        print(app_name, ':', user_ratings)

The same pattern applies to `Social Networking` apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc.

In [None]:
for row in ios_free:
    app_name = row[1]
    genre = row[11]
    user_ratings = row[5]
    if genre == 'Social Networking':
        print(app_name, ':', user_ratings)

The same applies to `Music` apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

In [None]:
for row in ios_free:
    app_name = row[1]
    genre = row[11]
    user_ratings = row[5]
    if genre == 'Music':
        print(app_name, ':', user_ratings)

This also applies to `Reference` apps, where the Bible and Dictionary.com skew the average number of ratings upward.

In [None]:
for row in ios_free:
    app_name = row[1]
    genre = row[11]
    user_ratings = row[5]
    if genre == 'Reference':
        print(app_name, ':', user_ratings)

However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

## Google Play Store

In [None]:
android_category_freq = freq_table(android_free, 1)
print(android_category_freq)

In [None]:
for category in android_category_freq:
    total = 0
    len_category = 0
    
    for row in android_free:
        category_app = row[1]
        installs = float(((row[5]).replace('+', '')).replace(',', ''))
        if category_app == category:
            total += installs
            len_category += 1
            
    android_avg_installs = total / len_category
    print(category, ':', android_avg_installs)
        

On average, `COMMUNICATION` apps have the most installs: 38,456,119.

In [None]:
for row in android_free:
    if row[1] == 'COMMUNICATION':
        print(row[0], ':', row[5])

The average number of installs for `COMMUNICATION` apps is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 million and 500 million installs:

In [None]:
for row in android_free:
    if row[1] == 'COMMUNICATION' and (row[5] == '1,000,000,000+'
                                      or row[5] == '500,000,000+'
                                      or row[5] == '100,000,000+'):
        print(row[0], ':', row[5])

If we removed all the communication apps that have 100 million or more installs, the average would drop by roughly a factor of ten:

COMMUNICATION: 38,456,119 vs 3,603,485

In [None]:
under_100_m = []

for row in android_free:
    n_installs = float(((row[5]).replace('+', '')).replace(',', ''))
    if row[1] == 'COMMUNICATION' and n_installs < 100000000:
        under_100_m.append(n_installs)
        
sum(under_100_m) / len(under_100_m)
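
Since the same capped average is needed for several categories, it could be wrapped in a small helper. The sketch below is one possible shape, run on made-up rows laid out like the Google Play data (category at index 1, installs at index 5); the names `parse_installs` and `average_installs` are hypothetical.

```python
def parse_installs(text):
    """Convert an Installs string such as '1,000,000+' to a float."""
    return float(text.replace('+', '').replace(',', ''))

def average_installs(rows, category, cap=None, category_index=1, installs_index=5):
    """Average installs for one category, optionally only for apps below cap."""
    values = []
    for row in rows:
        if row[category_index] != category:
            continue
        n_installs = parse_installs(row[installs_index])
        if cap is None or n_installs < cap:
            values.append(n_installs)
    # Guard against an empty selection instead of dividing by zero.
    return sum(values) / len(values) if values else None

# Hypothetical rows: name, category, three unused fields, installs
sample = [
    ['Chat A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Chat B', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Chat C', 'COMMUNICATION', '', '', '', '5,000,000+'],
    ['Game D', 'GAME', '', '', '', '10,000+'],
]
print(average_installs(sample, 'COMMUNICATION'))
print(average_installs(sample, 'COMMUNICATION', cap=100_000_000))
```

The trailing `+` in the `Installs` values suggests they are open-ended buckets, so any average built from them is best read as an approximate lower bound rather than an exact figure.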

We see the same pattern for the video players category, which is the runner-up.


In [None]:
for row in android_free:
    if row[1] == 'VIDEO_PLAYERS' and (row[5] == '1,000,000,000+'
                                      or row[5] == '500,000,000+'
                                      or row[5] == '100,000,000+'):
        print(row[0], ':', row[5])

The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player.

VIDEO_PLAYERS: 24,727,872 vs 5,544,878

In [None]:
under_100_m = []

for row in android_free:
    n_installs = float(((row[5]).replace('+', '')).replace(',', ''))
    if row[1] == 'VIDEO_PLAYERS' and n_installs < 100000000:
        under_100_m.append(n_installs)
        
sum(under_100_m) / len(under_100_m)

The pattern is repeated for `SOCIAL` apps where we have giants like Facebook, Instagram, Google+, etc.

In [None]:
for row in android_free:
    if row[1] == 'SOCIAL' and (row[5] == '1,000,000,000+'
                                      or row[5] == '500,000,000+'
                                      or row[5] == '100,000,000+'):
        print(row[0], ':', row[5])

In [None]:
under_100_m = []

for row in android_free:
    n_installs = float(((row[5]).replace('+', '')).replace(',', ''))
    if row[1] == 'SOCIAL' and n_installs < 100000000:
        under_100_m.append(n_installs)
        
sum(under_100_m) / len(under_100_m)

SOCIAL: 23,253,652 vs 3,084,582

**Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.**
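
One way to see how much a few giants distort an average is to compare the mean with the median, which is much less sensitive to extreme values. The install counts below are invented purely for illustration and do not come from the dataset.

```python
from statistics import mean, median

# Hypothetical install counts for one category: one giant and four smaller apps.
installs = [1_000_000_000, 5_000_000, 1_000_000, 1_000_000, 500_000]

print('mean:  ', mean(installs))
print('median:', median(installs))
```

In this made-up example the mean is about 200 times the median, which is the kind of gap that would make a category look more popular than a typical app in it really is.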

The `GAME` genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

In [None]:
for row in android_free:
    if row[1] == 'GAME' and (row[5] == '1,000,000,000+'
                                      or row[5] == '500,000,000+'
                                      or row[5] == '100,000,000+'):
        print(row[0], ':', row[5])

In [None]:
under_100_m = []

for row in android_free:
    n_installs = float(((row[5]).replace('+', '')).replace(',', ''))
    if row[1] == 'GAME' and n_installs < 100000000:
        under_100_m.append(n_installs)
        
sum(under_100_m) / len(under_100_m)

The `BOOKS_AND_REFERENCE` genre looks fairly popular as well, with an average number of installs of 8,767,811.

In [None]:
for row in android_free:
    if row[1] == 'BOOKS_AND_REFERENCE':
        print(row[0], ':', row[5])

The book and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there's still a small number of extremely popular apps that skew the average:

In [None]:
for row in android_free:
    if row[1] == 'BOOKS_AND_REFERENCE' and (row[5] == '1,000,000,000+'
                                      or row[5] == '500,000,000+'
                                      or row[5] == '100,000,000+'):
        print(row[0], ':', row[5])

However, it looks like there are only a few very popular apps, so this market still shows potential.

**We could get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):**

In [None]:
for row in android_free:
    if row[1] == 'BOOKS_AND_REFERENCE' and (row[5] == '1,000,000+'
                                      or row[5] == '5,000,000+' 
                                      or row[5] == '10,000,000+'
                                      or row[5] == '50,000,000+'):
        print(row[0], ':', row[5])
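
The same mid-range selection can be expressed with numeric bounds instead of listing every install bucket as a string, which avoids missing a bucket by accident. This is a sketch on made-up rows with the category at index 1 and installs at index 5; `parse_installs` and `apps_in_install_range` are hypothetical helper names.

```python
def parse_installs(text):
    """Convert an Installs string such as '1,000,000+' to a float."""
    return float(text.replace('+', '').replace(',', ''))

def apps_in_install_range(rows, category, low, high):
    """Return (name, installs) pairs with low <= installs < high for one category."""
    matches = []
    for row in rows:
        if row[1] == category and low <= parse_installs(row[5]) < high:
            matches.append((row[0], row[5]))
    return matches

# Hypothetical rows: name, category, three unused fields, installs
sample = [
    ['Reader A', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Reader B', 'BOOKS_AND_REFERENCE', '', '', '', '50,000,000+'],
    ['Reader C', 'BOOKS_AND_REFERENCE', '', '', '', '100,000,000+'],
    ['Reader D', 'BOOKS_AND_REFERENCE', '', '', '', '500,000+'],
]
mid_tier = apps_in_install_range(sample, 'BOOKS_AND_REFERENCE', 1_000_000, 100_000_000)
for name, installs in mid_tier:
    print(name, ':', installs)
```

The half-open range (lower bound included, upper bound excluded) matches the four buckets from `1,000,000+` through `50,000,000+`.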

## Insights

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

## Conclusion

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.