# Profitable App Profiles
In this project I aim to find mobile app profiles that are profitable for the App Store and Google Play.

The company builds apps that are free to download and install, and its main source of revenue is in-app ads. This means that the revenue for any given app is mostly influenced by the number of people who use it. My goal for this project is to analyze data to help developers understand what kinds of apps are likely to attract more users.

First, we open the two data sets so we can begin exploring the data:

In [None]:
from csv import reader

# The Google Play data set
# (assumes the file is UTF-8 encoded; the with block closes it after reading)
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

# The App Store data set
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

To explore the data, we first create a function that prints rows in a more readable way.

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android_header)
print('\n')
explore_data(android, 0, 3, True)
print('\n')
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

# Removing inaccurate data
The Google Play data set has a dedicated discussion section, and we can see that one of the discussions outlines an error for row 10472. 

In [None]:
print(android[10472])  
print('\n')
print(android_header)  
print('\n')
print(android[0])

Row 10472 corresponds to the app Life Made WI-Fi Touchscreen Photo Frame, and its rating is 19. This is clearly wrong because the maximum rating for a Google Play app is 5. Therefore, we'll delete this row.

In [None]:
del android[10472]  # run this cell only once, or a second (valid) row gets deleted
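The faulty row above was found through the discussion section, but the same check can be done in code. The following is a minimal sketch that scans for ratings outside the valid range. The sample rows are made up for illustration, and `find_invalid_ratings`, `RATING_INDEX`, and `MAX_RATING` are hypothetical names; the rating is assumed to sit at index 2.

```python
# Hypothetical sample rows in the column order App, Category, Rating, Reviews.
# The values are made up for illustration only.
sample_rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1', '159'],
    ['Broken Row App', '1.9', '19', '3.0M'],
    ['Unrated App', 'TOOLS', 'NaN', '0'],
]

RATING_INDEX = 2   # assumed position of the rating column
MAX_RATING = 5.0   # maximum rating for a Google Play app


def find_invalid_ratings(dataset, rating_index=RATING_INDEX):
    """Return (row_index, row) pairs whose rating is outside 0..MAX_RATING."""
    invalid = []
    for i, row in enumerate(dataset):
        try:
            rating = float(row[rating_index])
        except (ValueError, IndexError):
            # Non-numeric or missing rating: flag the row for manual review.
            invalid.append((i, row))
            continue
        # NaN compares False with every number, so ratings stored as 'NaN'
        # are not flagged here.
        if rating < 0 or rating > MAX_RATING:
            invalid.append((i, row))
    return invalid


invalid_rows = find_invalid_ratings(sample_rows)
for index, row in invalid_rows:
    print(index, row)
```

If several rows are flagged, deleting them from the highest index to the lowest avoids shifting the positions of the rows still to be removed.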

# Removing duplicate entries
If we explore the Google Play data set, we'll find that some apps have more than one entry. For instance, the application Google Ads has four entries:

In [None]:
for app in android:
    name = app[0]
    if name == 'Google Ads':
        print(app)

In total, there are 1,181 cases where an app occurs more than once:

In [None]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

To avoid duplicate entries, we create a dictionary in which each key is a unique app name and the value is the highest number of reviews for that app:

In [None]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [None]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

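We use the dictionary to remove duplicate apps. For duplicate cases, we keep only the entry with the highest number of reviews. The `already_added` list covers apps whose duplicate entries share the same highest review count, so that only one of them is kept.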

In [None]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
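As an alternative to the two passes above, the same idea can be sketched in a single pass by storing the whole row in the dictionary instead of only the review count. This is an illustrative variant, not the approach used in this notebook: `keep_most_reviewed` and the sample rows are hypothetical, and the result is ordered by the first appearance of each app name rather than by the position of the kept row.

```python
# Hypothetical sample rows: App, Category, Rating, Reviews.
sample_rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.5', '2500'],
    ['App A', 'TOOLS', '4.1', '180'],
    ['App A', 'TOOLS', '4.1', '180'],
]

NAME_INDEX = 0
REVIEWS_INDEX = 3


def keep_most_reviewed(dataset):
    """Keep one row per app name: the first row with the highest review count."""
    best_rows = {}
    for row in dataset:
        name = row[NAME_INDEX]
        n_reviews = float(row[REVIEWS_INDEX])
        # The strict '>' keeps the earliest row when review counts tie.
        if name not in best_rows or n_reviews > float(best_rows[name][REVIEWS_INDEX]):
            best_rows[name] = row
    return list(best_rows.values())


deduplicated = keep_most_reviewed(sample_rows)
print('Number of rows after removing duplicates:', len(deduplicated))
```

A dictionary lookup also avoids the repeated list membership test, which gets slower as the list of names grows.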

Exploring the data shows the number of rows is now 9,659.

In [None]:
explore_data(android_clean, 0, 3, True)

In [None]:
def english(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Docs To Go™ Free Office Suite'))
print(english('Instachat 😜'))


# Removing Non-English Apps
When exploring the data sets, you'll notice the names of some of the apps suggest they are not directed toward an English-speaking audience. 

The characters commonly used in English text all have an ASCII code between 0 and 127, so a character with a code greater than 127 suggests a non-English name. Apps with such names can be removed, because we only need apps that are in English. The function above checks the code of every character in a string and returns False if any character falls outside the ASCII range.

In [None]:
def english(string):
    non_english = 0
    
    for character in string:
        if ord(character) > 127:
            non_english += 1
    
    if non_english > 3:
        return False
    else:
        return True

print(english('Docs To Go™ Free Office Suite'))
print(english('Instachat 😜'))

The first version of the function is not a reliable way to remove non-English apps, because some English app names contain characters with a code greater than 127, such as ™ or emoji. To solve this problem, the version above counts the non-ASCII characters. If there are more than 3 of them, we assume the app is not English and return False so it can be removed from the data set.
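To make the threshold concrete, the sketch below prints how many characters in each example name fall outside the ASCII range, along with the code points of two such characters. `count_non_ascii` is a hypothetical helper written for this illustration; `ord` is the built-in that returns a character's Unicode code point.

```python
def count_non_ascii(text):
    """Count characters whose code point is above the ASCII range (0-127)."""
    return sum(1 for character in text if ord(character) > 127)


names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in names:
    print(name, '->', count_non_ascii(name))

# Code points of two characters that appear in English app names.
print(ord('™'))
print(ord('😜'))
```

Names with only one or two such characters stay below the threshold of 3, while a name written mostly in Chinese characters exceeds it.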

In [None]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if english(name):  # english() is the function defined in the previous cell
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

# Isolating free apps

The company only builds apps that are free to download and install, so we need to isolate the free apps in both data sets.

In [None]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))
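The cell above compares prices as strings, which works only if free apps are written exactly as '0' in one data set and '0.0' in the other. A slightly more tolerant option is to compare the numeric value. The sketch below assumes that paid prices may carry a leading dollar sign; `is_free` is a hypothetical helper, not part of the original analysis.

```python
def is_free(price_text):
    """Return True when a price string represents zero.

    Assumes prices look like '0', '0.0', or '$4.99'. Anything that cannot
    be parsed as a number is treated as not free.
    """
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False


for price in ['0', '0.0', '$4.99', '2.99']:
    print(repr(price), '->', is_free(price))
```

With this helper, the same condition could be used for both data sets instead of two different string comparisons.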

In [None]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
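For comparison, the two functions above can be approximated with `collections.Counter` from the standard library. This is a sketch on made-up rows, and `freq_table_counter` is a hypothetical name; it returns the same kind of percentage dictionary as `freq_table`.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column of the data set."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Hypothetical rows: App, Category.
sample_rows = [
    ['App A', 'GAME'],
    ['App B', 'GAME'],
    ['App C', 'TOOLS'],
    ['App D', 'FAMILY'],
]

percentages = freq_table_counter(sample_rows, 1)

# Sort by percentage, largest first, as display_table does.
for category, percentage in sorted(percentages.items(),
                                   key=lambda item: item[1],
                                   reverse=True):
    print(category, ':', percentage)
```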

Frequency table for the prime genre column of the App Store data set:

In [None]:
display_table(ios_final, -5)

Frequency table for the category column of the Google Play data set:

In [None]:
display_table(android_final, 1)

# Popular apps by genre on the App Store

In [None]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

In [None]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

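Note: the rating counts printed above suggest checking whether a few apps dominate the Navigation average.

TODO: discuss the Bible app or another example here.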

# Popular apps by genre on Google Play

In [None]:
display_table(android_final, 5)

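The table above shows the percentage of apps for each value of the installs column. The values are open-ended buckets such as 100,000,000+, so they are not exact install counts. In the next cell, we strip the commas and plus signs and treat each bucket as a number.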

In [None]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 million and 500 million installs:

In [None]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])
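One way to gauge how strongly a few very large apps pull an average upward is to compare the mean with the median, and with the mean of the apps below a cut-off. The sketch below uses made-up install values, not the real data; `parse_installs` and the 100 million threshold are assumptions chosen for illustration.

```python
from statistics import mean, median


def parse_installs(installs_text):
    """Convert a string such as '1,000,000+' to a float."""
    return float(installs_text.replace(',', '').replace('+', ''))


# Hypothetical install values for one category (made up for illustration).
sample_installs = ['1,000,000,000+', '500,000,000+', '1,000,000+',
                   '100,000+', '10,000+', '5,000+']

installs = [parse_installs(value) for value in sample_installs]
threshold = 100_000_000  # assumed cut-off for "very large" apps

below_threshold = [n for n in installs if n < threshold]

print('Mean, all apps:', mean(installs))
print('Median, all apps:', median(installs))
print('Mean, apps below threshold:', mean(below_threshold))
```

When the mean is far above both the median and the below-threshold mean, the category average mostly reflects its largest apps rather than a typical app.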


# Conclusion
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.