# Profitable App Profiles for the App Store and Google Play Markets

The goal of this project is to analyze App Store and Google Play data to find the profiles of mobile apps that are most profitable.

The setting:
We are working as data analysts at a company that builds free-to-use mobile apps for Android and iOS smartphones.

The main source of revenue for free-to-use apps is ad revenue. This means that our revenue is influenced by the number of users.

Our goal is to help the developers decide which type of app to build to bring in the most ad revenue.

## Opening and Exploring the Data

The complete data sets of the App Store and Google Play markets are too big and costly to analyze, so we will use sample data sets instead.

The sample data consists of two data sets: one with approximately 10,000 Android apps from the Google Play store, collected in August 2018, and another with approximately 7,000 iOS apps from the App Store, collected in July 2017.

We will start by opening and exploring these two data sets.



In [2]:
# The Google Play data set
from csv import reader

# encoding='utf8' avoids a UnicodeDecodeError on platforms whose default codec is not UTF-8
file_and = open('googleplaystore.csv', 'r', encoding='utf8')
reader_and = reader(file_and)
android_data = list(reader_and)
file_and.close()
android_header = android_data[0]
android_data = android_data[1:]

# The Apple App Store data set
file_ios = open('AppleStore.csv', 'r', encoding='utf8')
reader_ios = reader(file_ios)
ios_data = list(reader_ios)
file_ios.close()
ios_header = ios_data[0]
ios_data = ios_data[1:]
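Both files are loaded with the same steps, so the pattern can be wrapped in a small helper. The sketch below is only an illustration: `load_csv` is a hypothetical name that does not appear elsewhere in this notebook, and it assumes the files are UTF-8 encoded.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file. Hypothetical helper, not part of the notebook."""
    # The with-block closes the file even if reading fails;
    # newline='' is what the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Possible usage with the files from this project:
# android_header, android_data = load_csv('googleplaystore.csv')
# ios_header, ios_data = load_csv('AppleStore.csv')
```

Passing the encoding explicitly keeps the behavior the same across operating systems, since the default codec used by `open()` depends on the platform.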

To make it easier to explore the two data sets, we will create a function that prints them in a more readable way.

This function will be called `explore_data()` and will have an option to show the number of rows and columns for any data set.

In [None]:
# This is the explore data function #

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new empty line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
                

In [None]:
print(android_header)
print('\n')
explore_data(android_data, 0,3, True)

The data set has 13 columns, of which 7 can be useful for our analysis: `App`, `Category`, `Reviews`, `Installs`, `Type`, `Price`, and `Genres`.

For more information about the data set and its columns, see the [Google Play Documentation](https://www.kaggle.com/lava18/google-play-store-apps).

Up next, the App Store Dataset.

In [None]:
print(ios_header)
print('\n')
explore_data(ios_data, 0,3, True)

This data set has 16 columns, with different column names. From this data set, we can use the following 6 columns: `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver`, and `prime_genre`.

For more information about the columns of this data set, see the [App Store Documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

## Cleaning the Data

We need to clean the data before we analyze it. We are looking for a profile that matches the type of app we want to build: free and aimed at English-speaking users. So we need to remove apps that aren't free or aren't in English.

Also, the Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error for row 10472.

Let's compare it to the header and another row that is correct.


In [None]:
print(android_header) #Header
print('\n')
print(android_data[1]) #Row with correct Data
print('\n')
print(android_data[10472]) #Row with incorrect Data

The app on row 10472 has a rating of 19. This is incorrect because the maximum rating on the Google Play store is 5.

So we remove this row from our data set by using the `del` statement.

In [None]:
print(len(android_data))
del android_data[10472]  # don't run this more than once: a second run would delete the row that now sits at index 10472

In [None]:
print(len(android_data))
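Deleting by a hardcoded index is fragile, because running the cell twice removes a valid row. As a hedged alternative, the sketch below looks for rows whose column count differs from the header's. It assumes that a malformed row shows up as a column-count mismatch, which may not hold for every kind of error; `find_malformed_rows` and the sample data are made up for illustration.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Tiny made-up sample: the second row is missing its category value,
# so the rating has shifted into the wrong column.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '19'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
print(malformed)

# Rebuilding the list instead of using del makes the cleanup safe to re-run
bad_indexes = {i for i, _ in malformed}
sample_clean = [row for i, row in enumerate(sample_rows) if i not in bad_indexes]
```

Because the filter rebuilds the list from a condition, running it a second time leaves the data unchanged.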

### Removing Duplicates

#### Part one: Identifying Duplicates

With large data sets, it's pretty common to have duplicate entries. A good example is *Instagram* in the Google Play data set.

In [None]:
insta = 0
for i in android_data:
    name = i[0]
    if name == 'Instagram':
        print(i)
        insta += 1
print('\n')
print('Instagram has', insta, 'entries')  # print() already separates its arguments with a space

Now we look for all the apps that have duplicates.

In [None]:
unique_apps = []
duplicate_apps = []

for app in android_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of unique apps:',len(unique_apps) )
print('\n')
print('Number of duplicates:', len(duplicate_apps))
print('\n')
print('Examples of duplicates:', duplicate_apps[:5])
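Checking `name in unique_apps` scans a list on every iteration, which gets slow as the list grows. A possible alternative, sketched here with `collections.Counter` from the standard library, counts every name in one pass. The function name `count_duplicates` and the sample rows are assumptions for illustration only.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return (number of unique names, number of extra duplicate rows)."""
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    # An app that appears k times contributes k - 1 duplicate rows
    n_duplicates = sum(count - 1 for count in counts.values())
    return n_unique, n_duplicates

# Made-up rows that only contain the app name
sample = [['Instagram'], ['Slack'], ['Instagram'], ['Instagram']]
print(count_duplicates(sample))
```

The two numbers correspond to `len(unique_apps)` and `len(duplicate_apps)` in the cell above.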
        

The next step is to remove the duplicates and keep only one entry per app. We want to keep the best entry, so we have the best data for our analysis.

We can assume that there are multiple entries because the data for the same app was collected at different times. The best data is the most recent. Although we have the app version, we may have multiple entries for the same version, so the best alternative is to look at the number of reviews: the more reviews, the more recent the data should be. The number of reviews is in the fourth column (index 3).

To clean the data, we will:
- Create a dictionary named `reviews_max` that maps each unique app to its highest number of reviews.
- Use `reviews_max` to find which entry has the highest number of reviews for each app, and put that entry in a list so it is the only entry for that app. We will also use an `already_added` list to handle apps that have multiple entries with the same number of reviews.

This list will be called `android_clean` and has to be as long as the `unique_apps` list.

In [None]:
reviews_max = {}

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Expected length:', len(android_data) - len(duplicate_apps) )
print('\n')
print('Dictionary length:', len(reviews_max))    

In [None]:
android_clean = []
already_added = []

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
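The two passes above can also be folded into one by storing the best row itself instead of only its review count. This is a sketch under the same assumptions as the notebook (name at index 0, reviews at index 3); `keep_most_reviewed` is a hypothetical name, and the result is ordered by the first appearance of each app name rather than by the position of the kept row.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # The strict ">" keeps the first row seen when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up rows; the last column only labels the rows for this example
sample = [
    ['A', 'c', '4.0', '10', 'old'],
    ['B', 'c', '4.0', '5', 'only'],
    ['A', 'c', '4.0', '30', 'new'],
    ['A', 'c', '4.0', '30', 'tie'],
]
deduplicated = keep_most_reviewed(sample)
print(deduplicated)
```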
        

        

Now we confirm that our data set has the same number of entries as the `unique_apps` list.

In [None]:
print('Expected Length for Clean Android Data:', len(unique_apps))
print('\n')
explore_data(android_clean, 0,3, True)

## Removing Non-English Apps

### Part One: Identifying Non-English Apps

Another criterion for our analysis is that our app is going to be designed for English-speaking users. Unlike the duplicates, non-English apps appear in both of our data sets, so we will identify those apps and remove them from both. For example:

In [None]:
print(ios_data[813][1])
print(ios_data[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

We identify these apps by checking whether the app name contains any letter or symbol that is not commonly used in English text. English text uses the English alphabet, the digits 0 to 9, punctuation marks, and other symbols (+, -, *, etc.).

These characters are encoded using the ASCII standard. Each ASCII character has a number between 0 and 127 associated with it. We can find this number in Python by using the `ord()` function, and we will use it to build our `eng_ver()` function.
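To make the 0 to 127 range concrete, the short example below prints the code point that `ord()` returns for a few characters. The sample characters are chosen only for illustration.

```python
samples = ['a', 'Z', '5', '+', '™', '爱']

# Map each character to the number returned by ord()
codes = {ch: ord(ch) for ch in samples}

for ch, code in codes.items():
    label = 'ASCII' if code <= 127 else 'non-ASCII'
    print(repr(ch), code, label)
```

Letters, digits, and common symbols stay at or below 127, while characters such as ™ and 爱 are well above it.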

In [None]:
def eng_ver(string):
    
    for i in string:
        if ord(i) > 127:
            return False
    return True

In [None]:
#Test of our eng_ver Function
print(eng_ver('Instagram'))
print(eng_ver('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(eng_ver('Docs To Go™ Free Office Suite'))
print(eng_ver('Instachat 😜'))

The function works fine, except when the app name contains an emoji or another non-ASCII symbol (for example, ™ and 😜).

In this form, our function would remove apps that are useful for our analysis.

### Part Two: Advanced Filter

To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:

**(Text from Solution notebook)**

In [None]:
def eng_ver(string):
    ver_counter = 0
    
    for i in string:     
        if ord(i) > 127:
            ver_counter +=1
            
    if ver_counter > 3:
        return False
            
    return True

print(eng_ver('Docs To Go™ Free Office Suite'))
print(eng_ver('Instachat 😜'))
print(eng_ver('爱奇艺PPS -《欢乐颂2》电视剧热播'))
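The threshold of three is a judgment call, so it can help to expose it as a parameter and to look at the raw count when tuning it. The sketch below is one way to do that; `count_non_ascii`, `is_mostly_ascii`, and `max_non_ascii` are hypothetical names that do not appear elsewhere in this notebook.

```python
def count_non_ascii(text):
    """Count the characters whose code point is above the ASCII range."""
    return sum(1 for ch in text if ord(ch) > 127)

def is_mostly_ascii(text, max_non_ascii=3):
    """Return True if the text has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

for name in ['Docs To Go™ Free Office Suite', 'Instachat 😜', '爱奇艺PPS -《欢乐颂2》电视剧热播']:
    print(name, count_non_ascii(name), is_mostly_ascii(name))
```

With the default threshold, this behaves like the second version of `eng_ver()`; a stricter or looser filter only needs a different `max_non_ascii` value.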

The function is still not perfect, and very few non-English apps might get past our filter, but this seems good enough at this point in our analysis — we shouldn't spend too much time on optimization at this point.

**(Text from Solution notebook)**

Below, we use the `eng_ver()` function to filter out the non-English apps for both data sets:

In [None]:
android_clean_eng = []
ios_clean_eng = []

for i in android_clean:
    name = i[0]
    if eng_ver(name):
        android_clean_eng.append(i)

for i in ios_data:
    name = i[1]
    if eng_ver(name):
        ios_clean_eng.append(i)
        
explore_data(android_clean_eng, 0,3,True)
print('\n')
explore_data(ios_clean_eng, 0, 3, True)


## Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.

**(Text from Solution notebook)**

In [None]:
android_final = []
ios_final = []

for app in android_clean_eng:
    price = app[7]
    if price == '0' :
        android_final.append(app)
        
for app in ios_clean_eng:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print('Android Apps left in Dataset:',len(android_final))
print('\n')
print('iOS Apps left in Dataset:', len(ios_final))

We're left with 8864 Android apps and 3222 iOS apps, which should be enough for our analysis.
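The comparisons against `'0'` and `'0.0'` depend on the exact text each data set uses for a free app. A slightly more tolerant check, sketched below, converts the price to a number first. It assumes that paid Google Play prices carry a leading dollar sign (for example `'$4.99'`); `is_free` is a hypothetical helper.

```python
def is_free(price):
    """Return True if a price string represents zero. Hypothetical helper."""
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number is treated as not free
        return False

for price in ['0', '0.0', '$4.99', '2.99', 'Everyone']:
    print(repr(price), is_free(price))
```

The same function could then be applied to the price column of both data sets, at index 7 for Google Play and index 4 for the App Store.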

## Most Common Apps by Genre
### Part One
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

### Part Two
We'll build two functions we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function that we can use to display the percentages in a descending order

**(Text From Solution Notebook)**

In [None]:
def freq_table(dataset, index):
    freq_table_dic = {}
    total_freq_number = 0
    
    for i in dataset:
        total_freq_number +=1
        freq_col = i[index]  
        if freq_col in freq_table_dic:
            freq_table_dic[freq_col] += 1
        
        else:
            freq_table_dic[freq_col] = 1
        
   
    freq_table_perc = {}
    for i in freq_table_dic:
        percentage = (freq_table_dic[i]/total_freq_number) * 100
        freq_table_perc[i] = percentage

    return freq_table_perc
        
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
display_table(ios_final, -5)
print('\n')
display_table(android_final, -4)
print('\n')
display_table(android_final, 1)
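For comparison, the same percentage table can be built with `collections.Counter` from the standard library, whose `most_common()` method already returns entries in descending order of frequency. This is only a sketch; `freq_table_pct` and the sample rows are made up for illustration.

```python
from collections import Counter

def freq_table_pct(rows, index):
    """Return {value: percentage} for one column, most frequent value first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # Dictionaries keep insertion order, so the result stays sorted by frequency
    return {key: count / total * 100 for key, count in counts.most_common()}

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]
table = freq_table_pct(sample, 1)
for key, pct in table.items():
    print(key, ':', pct)
```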

### Part Three

#### App Store: Prime Genre
We can see that among the free English apps, more than half (58.16%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users; the demand might not be the same as the offer.


####  Google Play: Category
Let's continue by examining the Genres and Category columns of the Google Play data set (two columns which seem to be related).

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) means mostly games for kids.

[![img](https://s3.amazonaws.com/dq-content/350/py1m8_family.png)](https://play.google.com/store/apps/category/FAMILY?hl=en)

#### Google Play: Genres
Even so, practical apps seem to have a better representation on Google Play compared to App Store. This picture is also confirmed by the frequency table we see for the Genres column.

The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea about the kind of apps that have the most users.

**(Text From Solution notebook)**


In [None]:
prime_genre_freq = freq_table(ios_final, -5)

for genre in prime_genre_freq:
    total = 0
    len_genre = 0
    
    for app in ios_final:
        genre_app = app[-5]
        if genre == genre_app:
            users_app = float(app[5])
            total += users_app
            len_genre += 1

    avg_user_rat_prime = total / len_genre

    print(genre, ":", avg_user_rat_prime)

On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together.

The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. Same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then rework the averages, but we'll leave this level of detail for later.
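One inexpensive way to see how much a few giants distort a genre is to print the median next to the mean, since the median is barely affected by a handful of extreme values. The sketch below uses the standard `statistics` module. `ratings_summary` is a hypothetical helper, its default indexes follow the App Store columns used above, and the sample rows are invented.

```python
from statistics import mean, median

def ratings_summary(rows, genre_index=-5, ratings_index=5):
    """Return {genre: (mean, median)} of the rating counts per genre."""
    by_genre = {}
    for row in rows:
        by_genre.setdefault(row[genre_index], []).append(float(row[ratings_index]))
    return {genre: (mean(values), median(values)) for genre, values in by_genre.items()}

# Invented two-column rows: genre, then number of ratings
sample = [
    ['Navigation', '300000'],
    ['Navigation', '1000'],
    ['Navigation', '2000'],
    ['Weather', '500'],
    ['Weather', '700'],
]
summary = ratings_summary(sample, genre_index=0, ratings_index=1)
for genre, (avg, med) in summary.items():
    print(genre, ':', avg, med)
```

A large gap between the two numbers for a genre suggests that a few very popular apps dominate its average.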

Reference apps have 74,942 user ratings on average, but it's actually the Bible and Dictionary.com that skew the average upward.

However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, or finance. The book genre seems to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:

- Weather apps: people generally don't spend too much time in-app, and the chances of making profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

- Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.

- Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

**(Text from Solutions Notebook)**

In [None]:
cat_freq = freq_table(android_final, 1)

for category in cat_freq:
    total = 0
    len_category = 0
    
    for app in android_final:
        category_app = app[1]
        
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
            
    avg_installs = total / len_category  # plain average; multiplying by 100 would inflate every figure
    print(category, ":" , avg_installs)
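The install counts are stored as text such as `1,000,000+`, so they have to be cleaned before any arithmetic. The small helper below isolates that step; `parse_installs` is a hypothetical name. The trailing plus sign suggests that each value is a lower bound rather than an exact count, so the resulting averages are best read as rough estimates.

```python
def parse_installs(text):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(text.replace(',', '').replace('+', ''))

for raw in ['1,000,000+', '100+', '0']:
    print(raw, '->', parse_installs(raw))
```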

The books and reference niche on Google Play seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

## Conclusions
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

**(Text from Solutions Notebook)**