# Profitable App Profiles for the App Store and Google Play Markets Project

This project is dedicated to finding profitable apps for Android and iOS.

Imagine working as a data analyst for a company that builds Android and iOS mobile apps. The company only releases free apps, and its revenue comes from in-app ads, so our profit depends on the number of people who use our apps.

The goal is to clean and analyze the data in order to find out which kinds of apps are most popular among users.

## Opening and Exploring the Data

Because the App Store and Google Play together host a very large number of apps (~4 million in total), we will work with a sample of the data instead. Fortunately, we can avoid the cost of collecting it ourselves by using two available datasets: "googleplaystore.csv" and "AppleStore.csv". It is handy to write a function that helps us look through the datasets. Let's call it __explore_data()__.

In [None]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv', encoding='utf8')  # app names contain non-ASCII characters
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set ###
opened_file = open('AppleStore.csv', encoding='utf8')  # app names contain non-ASCII characters
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]


def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android_header)
print('\n')
explore_data(android, 0, 3, True)
print('\n')

Let's check the App Store file now.

In [None]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)
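
The file-reading steps used for both datasets could also be wrapped in a small helper. The sketch below is only one possible way to do it: the name `load_dataset` is hypothetical, and it assumes the files are UTF-8 encoded and that their first row is the header.

```python
from csv import reader


def load_dataset(path, encoding='utf8'):
    """Read a CSV file and return (header, rows).

    Assumes the first row of the file is the header row.
    """
    # The with-block closes the file even if reading fails part-way
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]
```

With this helper, `android_header, android = load_dataset('googleplaystore.csv')` would replace the five reading lines for the Google Play file, and the same call works for `AppleStore.csv`.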

## Deleting Wrong Data

The Google Play dataset has a dedicated discussion section, and one of the discussions reports an error in row 10472.

In [None]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # correct row

Row 10472 is the app **Life Made WI-Fi Touchscreen Photo Frame**, which has a rating of 19. The highest rating Google Play allows is 5, so the row is invalid and we can delete it.

In [None]:
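print(len(android))
# Delete the row only while it is still malformed (assumed here to mean a
# column count that differs from the header), so re-running the cell is safe
if len(android[10472]) != len(android_header):
    del android[10472]
print(len(android))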
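
Rather than relying on a discussion thread to spot bad rows, we could also scan for them. The sketch below assumes that a broken row shows up as a row whose number of columns differs from the header's; the source does not confirm that this is what is wrong with row 10472, and the helper name `find_malformed_rows` and the sample data are made up for illustration.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]


# Tiny made-up sample: the second row is missing a column
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Foo', 'TOOLS', '4.1'],
    ['Bar', '19'],
    ['Baz', 'GAME', '3.9'],
]
print(find_malformed_rows(sample_rows, sample_header))
```

Calling `find_malformed_rows(android, android_header)` on the real data would list every row worth inspecting by hand.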

## Removing Duplicate Entries

While exploring the datasets, we find that the Google Play dataset contains duplicate entries. For example, the app Instagram has 4 rows.

In [None]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

In [None]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

There are 1181 duplicate entries in total. There is no point in keeping more than one row per app, so we will keep only one. The difference between the rows is the number of reviews. We'll keep the row with the highest number of reviews because the higher the number of reviews, the more reliable the ratings.

In [None]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Remember that we found 1181 duplicates in the dataset. The **reviews_max** dictionary holds one entry per unique app, so its length should match the expected length. Let's remove the duplicates from our dataset now by:

 * Creating two empty lists, **android_clean** and **already_added**;
 * Looping through the Google Play dataset, and for each iteration:
     * Adding the current row (app) to the **android_clean** list, and the app name (name) to the **already_added** list, if:
         * The number of reviews of the current app is equal to the number of reviews of that app as recorded in the **reviews_max** dictionary; and
         * The app name is not already in the **already_added** list (in case an app has more than one row with the same highest number of reviews).

In [None]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name) # make sure this is inside the if block

In [None]:
explore_data(android_clean, 0, 3, True)

As we see, there are 9659 rows, meaning we managed to get rid of the duplicates.
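
The two-step approach above checks `name not in already_added` against a list, which gets slower as the list grows. As an alternative sketch (not the method used in this project), the same selection can be done in a single pass with a dictionary that maps each app name to its best row so far. The helper name `dedupe_by_max_reviews` and the sample rows are hypothetical.

```python
def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strictly greater, so on a tie the earlier row is kept
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up sample with the review count in column 3
sample = [
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Foo', 'TOOLS', '4.0', '10'],
    ['Instagram', 'SOCIAL', '4.5', '120'],
    ['Instagram', 'SOCIAL', '4.5', '120'],
]
for row in dedupe_by_max_reviews(sample):
    print(row)
```

One difference to keep in mind: the result is ordered by the first appearance of each app name, not by the position of the row that was kept, so the row order can differ from **android_clean**.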

## Removing Non-English Apps

After a closer look, we notice that some apps have names in languages other than English.

In [None]:
print(ios[813][1])
print(ios[6731][1])

print(android_clean[4412][0])
print(android_clean[7940][0])

We do not want to keep these apps, so we need a way to filter them out. We only want to keep apps whose names use English letters, digits, and common symbols.

All these characters that are specific to English texts are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127 associated with it, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.

We built this function below, and we use the built-in ord() function to find out the corresponding encoding number of each character.

In [None]:
def is_english(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

It seems to be working. However, some English apps have names that contain other characters, such as emojis or the ™ symbol, which our function does not allow.

In [None]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

print(ord('™'))
print(ord('😜'))

To avoid losing useful apps, we will relax the rule: we only remove an app if its name has more than three non-ASCII characters.

In [None]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
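
The same rule can be written more compactly and with the threshold exposed as a parameter, which makes it easier to experiment with stricter or looser limits. This is only a sketch: the name `is_probably_english` and the `max_non_ascii` parameter are not part of the original notebook.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters outside ASCII."""
    # ASCII characters have code points 0..127; everything above is counted
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

With the default limit of three, the first two names pass and the third does not. Passing `max_non_ascii=0` reproduces the strict first version of the function.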

That is good enough, so we can use this function to filter both the Google Play and the App Store datasets.

In [None]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

## Isolating the Free Apps

As we mentioned earlier, we deal only with free apps, so we need to filter both datasets to keep only the free ones.

In [None]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))
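
The filter above compares price strings exactly (`'0'` for Google Play, `'0.0'` for the App Store), so it depends on how each file happens to write a zero price. A slightly more tolerant sketch is to parse the price as a number first. The helper name `is_free` is hypothetical, and the `'$4.99'` format used in the example is an assumption about how paid prices might be written.

```python
def is_free(price):
    """Return True if a price string represents zero."""
    # Drop surrounding whitespace and an optional leading currency sign
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that cannot be parsed is treated as not free
        return False


for example in ['0', '0.0', '$4.99', 'Everyone']:
    print(example, ':', is_free(example))
```

Both loops could then use `if is_free(price):` regardless of which dataset the row comes from.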

## Most Common Apps by Genre

Let's begin the analysis by getting a sense of which genres are most common in each market. For this, we'll need to build frequency tables for a few columns in our datasets.

We'll build two functions to help us analyze the frequency tables:

 * One function to generate frequency tables that show percentages
 * Another function to display the percentages in descending order

In [None]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
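
For comparison, the counting and sorting steps can also be done with `collections.Counter` from the standard library. This is an alternative sketch, not the function used in the rest of the notebook; the name `freq_table_counter` and the sample rows are made up.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by descending count
    return {value: count / total * 100 for value, count in counts.most_common()}


# Made-up sample with the genre in column 1
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Books'], ['d', 'Games']]
for genre, percentage in freq_table_counter(sample, 1).items():
    print(genre, ':', percentage)
```

Because dictionaries keep insertion order, the returned table is already sorted, so a separate display function only needs to loop and print.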

We start by examining the frequency table for the prime_genre column of the App Store data set.

In [None]:
display_table(ios_final, -5)

So most of the apps on the App Store belong to entertainment-oriented genres (games, social networking, entertainment, photo and video, etc.). There are not that many educational or practical apps. Let's see how this compares with Google Play.

In [None]:
display_table(android_final, 1) 

On Google Play, we see a different picture: there are more practical apps. We can confirm that by investigating further.

In [None]:
display_table(android_final, -4)

## Most Popular Apps by Genre on the App Store

The most popular genres are the ones with the most installs, but the App Store dataset doesn't include install counts. As a proxy, we will use the total number of user ratings, which we can find in the rating_count_tot column.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [None]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
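
The nested loop above scans the whole dataset once per genre. A single-pass version that accumulates a running total and count per genre gives the same averages. The sketch below uses a hypothetical helper name, `average_by_group`, and made-up sample rows.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Return {group: average of the numeric column} in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up sample: genre in column 0, rating count in column 1
sample = [['Games', '100'], ['Books', '10'], ['Games', '300'], ['Books', '30']]
for genre, average in average_by_group(sample, 0, 1).items():
    print(genre, ':', average)
```

For the App Store data, the equivalent call would be `average_by_group(ios_final, -5, 5)`, using the same column indexes as the loop above.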

We see that navigation apps have the highest average number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together.

In [None]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

The same situation applies to social networking and music apps, where the averages are driven by a few big apps (Facebook and Skype; Shazam and Spotify). Let's take a look at the reference apps.

In [None]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

This one seems promising. We might use this niche to create an app with online books, comics, etc. The App Store is full of entertainment apps, so coming up with something more practical might be a good idea.

## Most Popular Apps by Genre on Google Play

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [None]:
display_table(android_final, 5) # the Installs column

We'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).

In [None]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
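
Since the same comma and plus-sign cleanup is needed every time we read the installs column, it can be pulled into a small helper. The name `parse_installs` is hypothetical; the logic is the same pair of `replace` calls used in the loop above.

```python
def parse_installs(text):
    """Convert an installs string such as '1,000,000+' to a float."""
    return float(text.replace(',', '').replace('+', ''))


print(parse_installs('1,000,000+'))
print(parse_installs('100+'))
```

Inside the loop, `total += parse_installs(app[5])` would then replace the three cleanup lines.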

Communication apps seem to have the most installs here.

In [None]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

If we removed all the communication apps that have 100 million or more installs, the average would be reduced roughly ten times:

In [None]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)
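
Another way to see how much a few giant apps distort the picture, without deciding on a cutoff, is to compare the mean with the median. This is a swapped-in technique that the analysis above does not use, and the install numbers below are made up purely for illustration.

```python
from statistics import mean, median

# Made-up install counts: four ordinary apps and one giant
sample_installs = [1_000, 5_000, 10_000, 50_000, 1_000_000_000]

print('Mean:', mean(sample_installs))
print('Median:', median(sample_installs))
```

The mean is pulled far above what a typical app in the sample reaches, while the median stays at the middle value. Applying `median` to the parsed install counts of a category would give a figure that is less sensitive to the largest apps.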

To sum up so far, genres such as communication, video, social networking, and music are dominated by a few giant apps (Facebook, Twitter, Skype, YouTube, etc.). The game genre is in a somewhat different position, with many popular games. Let's check the books genre.

In [None]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

We see a variety of apps: from a reading app and dictionaries to programming languages tutorials. Let's check the biggest ones here.

In [None]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

We see that there are not too many apps here. What about less popular apps?

In [None]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])
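
Listing every install bucket by hand works, but it is easy to miss one. A sketch of an alternative is to parse the installs value and filter on a numeric range. The helper name `apps_in_install_range`, its parameters, and the sample rows are hypothetical; the column indexes default to the ones used above (category in column 1, installs in column 5).

```python
def apps_in_install_range(dataset, category, low, high,
                          category_index=1, installs_index=5):
    """Return (name, installs) pairs for a category with low <= installs <= high."""
    matches = []
    for app in dataset:
        if app[category_index] != category:
            continue
        raw = app[installs_index]
        n_installs = float(raw.replace(',', '').replace('+', ''))
        if low <= n_installs <= high:
            matches.append((app[0], raw))
    return matches


# Made-up rows laid out like the Google Play data (installs in column 5)
sample = [
    ['Reader A', 'BOOKS_AND_REFERENCE', '4.2', '100', '5M', '1,000,000+'],
    ['Reader B', 'BOOKS_AND_REFERENCE', '4.0', '50', '3M', '10,000+'],
    ['Chat C', 'COMMUNICATION', '4.4', '900', '9M', '5,000,000+'],
    ['Library D', 'BOOKS_AND_REFERENCE', '4.6', '700', '8M', '50,000,000+'],
]
for name, installs in apps_in_install_range(
        sample, 'BOOKS_AND_REFERENCE', 1_000_000, 50_000_000):
    print(name, ':', installs)
```

The same call on `android_final` with bounds of 1,000,000 and 50,000,000 would select the same buckets as the cell above.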

We see that there are virtual library apps and ebook readers on the list. This might be a good place to try first if we want to create an app. However, we will need to add some additional and interesting features to stand out and draw users' interest.

## Conclusion

During this project we read and analyzed the App Store and Google Play apps datasets and investigated the current trends. In the end we came up with an idea to think more about the book genre in order to avoid the entertainment genres that have too many apps. 