# Profitable App Profiles for the App Store and Google Play Markets

Our aim in this project is to find app profiles that are profitable for the company. We work as data analysts for a company that builds apps for Android and iOS users, and we make these apps available on both the App Store and Google Play. 

Our apps are free to download and install, and the main source of revenue is in-app advertising. Revenue is therefore strongly influenced by the number of users who use our apps. Our goal in this project is to analyze data to help our app developers understand what kinds of apps attract more users. 


## Opening and Exploring the Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. Collecting data for over 4 million apps requires a significant amount of time and money, so we will instead analyze a sample of the data. We have the following two data sets at our disposal: 

- A data set containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018.
- A data set containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017.
    

Let us start by opening and exploring the two data sets. 
    



In [148]:
from collections import Counter, deque

In [2]:
from csv import reader
# we will first read the two files and create lists of rows
files = ['data_files/profitable_apps/AppleStore.csv', 'data_files/profitable_apps/googleplaystore.csv']

def read_csv(csvfile):
    # an explicit encoding avoids platform-dependent decoding errors
    # (this assumes the files are UTF-8 encoded)
    with open(csvfile, "r", encoding="utf8") as f:
        read_file = reader(f)
        return list(read_file)

ios = read_csv(files[0])
android = read_csv(files[1])

ios_header = ios[0]
android_header = android[0]

ios = ios[1:]
android = android[1:]

In [3]:
# create explore_data() function to print a slice of a data set and, optionally, its number of rows and columns
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print (row)
        print ("\n")
    
    if rows_and_columns:
        print ("Number of rows:", len(dataset))
        print ("Number of columns:", len(dataset[0]))

In [4]:
explore_data(android, 1, 5, False)

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']




#### Let us look at the data. 

In [5]:
## let us look at how many and what columns there are.
print ("There are a total of {} columns and {} rows in the android data set. And the columns are: \n".format(len(android[0]), len(android)))
print (android_header)
print ("{}".format(2*"\n"))
print ("There are a total of {} columns and {} rows in the ios data set. And the columns are: \n".format(len(ios[0]), len(ios)))
print (ios_header)

There are a total of 13 columns and 10841 rows in the android data set. And the columns are: 

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']



There are a total of 16 columns and 7197 rows in the ios data set. And the columns are: 

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


#### Columns to Consider:

- the number of users using the app
- the type of the app (free or paid): type
- the age group the app is aimed at: rating
- whether the app charges a one-time fee or a subscription
- how popular the app is: reviews for the app
- the language of the app: we will consider only English-language apps
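
When reading these columns from the row lists, one way to avoid hard-coded positions is a name-to-index lookup built from each header row. The following is a minimal sketch using the two header rows printed above; `column_index` is a helper name introduced here for illustration and is not part of the original notebook.

```python
# Header rows copied from the printed output above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

def column_index(header):
    """Map each column name to its position in a row (hypothetical helper)."""
    return {name: idx for idx, name in enumerate(header)}

android_cols = column_index(android_header)
ios_cols = column_index(ios_header)

print(android_cols['Type'], android_cols['Installs'])  # 6 5
print(ios_cols['price'], ios_cols['track_name'])       # 4 1
```

With such a lookup, a row's price could be read as `row[ios_cols['price']]` instead of `row[4]`, which makes it harder to mix up the two layouts.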
 

### Data Cleaning

In this step, we are going to clean the data a little bit to prepare it for further analysis. It is said that Data Scientists spend 80% of their time cleaning the bad data and only 20% of their time analyzing the data after it has been cleaned. We will be spending some time doing some cleaning. 

1. remove duplicate records - if there are any duplicates, we are going to remove them. 
2. remove non-free apps - since we will be analyzing free apps only, we will remove the non-free apps.
3. remove non-English apps - we are going to remove apps that are aimed at non-English-speaking users.
4. remove any other discrepancies in the data. 
5. check that there is a value for every column.

#### checking if there is a value for every column

In [40]:
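def compare_col_to_row(col, app_list):
    # return the first row (and its index) whose length differs from the header's;
    # returns None implicitly when every row has the expected number of columns
    for idx, app in enumerate(app_list):
        if len(col) != len(app):
            return app, idx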


In [None]:
# check that every app has a value for every column. 
# check android first
print (compare_col_to_row(android_header, android))
# it seems that we have one problematic record. We need to delete that record, as per Dataquest.

## let us find the index of the record. 
the_ind = 0
for idx, app in enumerate(android):
    if app[0].startswith('Life Made'):
        the_ind = idx

In [59]:
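## now let us delete the app at the index stored in the_ind.
## the length check keeps this cell safe to re-run: a second run would otherwise delete a valid row
if len(android[the_ind]) != len(android_header):
    del android[the_ind]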

In [60]:
# let us run this method again
print (compare_col_to_row(android_header, android))

None


In [61]:
# check the ios data 
print (compare_col_to_row(ios_header, ios))

None
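
`compare_col_to_row` stops at the first mismatch, so a data set with several malformed rows needs repeated runs. A variant that collects every mismatch in one pass could look like the sketch below. `find_malformed_rows` is a name introduced here, and the rows are toy data made up for illustration, not taken from the real files.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(idx, row) for idx, row in enumerate(rows) if len(row) != expected]

# Toy data for illustration only.
header = ['App', 'Category', 'Rating']
rows = [
    ['A', 'TOOLS', '4.1'],
    ['B', '1.9'],               # one field missing
    ['C', 'GAME', '4.5'],
    ['D', 'GAME', '3.0', 'x'],  # one field too many
]

bad_rows = find_malformed_rows(header, rows)
for idx, row in bad_rows:
    print(idx, row)

# Delete from the end so that earlier indices stay valid.
for idx, _ in reversed(bad_rows):
    del rows[idx]
print(len(rows))  # 2
```

Deleting in reverse index order matters: removing row 1 first would shift the later rows and make the stored index 3 point at the wrong place.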


#### removing duplicates, non-English-speaking apps and non-free apps

In [405]:
# the data sets have no column that states the app's language, so we have to build our own check.
# we use str.isascii() (Python 3.7+) as a rough proxy: a name made only of ASCII characters is treated as English.

## let us define a class that bundles the cleaning steps. 

class CleanUp:
    def __init__(self, arr):
        self.apps = arr
    
    def remove_duplicates(self):
        # keep the first row seen for each value in column 0.
        # note: column 0 is the app name for android but the `id` for ios.
        seen_names = set() # a set gives fast membership tests
        unique_apps = deque()
        duplicate_apps = []
        for app in self.apps:
            name = app[0]
            if name not in seen_names:
                seen_names.add(name)
                unique_apps.append(app)
            else:
                duplicate_apps.append(name)
        return unique_apps, duplicate_apps
    
    def remove_non_english(self):
        # side effect: self.apps is replaced by the de-duplicated apps
        self.apps = self.remove_duplicates()[0]
        english_only = deque()
        for app in self.apps:
            # note: for ios this tests the `id` column, so every row passes
            if app[0].isascii():
                english_only.append(app)
        return english_only
    
    def remove_not_free(self, android=True):
        # side effect: self.apps is replaced by the English-only apps
        self.apps = self.remove_non_english()
        free_apps = deque()
        for app in self.apps:
            if not android:
                # ios: column 4 is `price`
                if float(app[4]) <= 0:
                    free_apps.append(app)
            else:
                # android: column 6 is `Type`
                if app[6] == 'Free':
                    free_apps.append(app)
        return free_apps
            
    
android_ins = CleanUp(android)
ios_ins = CleanUp(ios)

In [406]:
# remove the duplicates first, then keep English-only apps, then keep free apps
and_unique, and_dupe_apps = android_ins.remove_duplicates()
ios_unique, ios_dupe_apps = ios_ins.remove_duplicates()
and_eng_only = android_ins.remove_non_english()
ios_eng_only = ios_ins.remove_non_english()
and_free = android_ins.remove_not_free()
ios_free = ios_ins.remove_not_free(False)
vars_list = [[android, and_dupe_apps, and_unique, and_eng_only, and_free], [ios, ios_dupe_apps, ios_unique, ios_eng_only, ios_free]]

In [407]:
dataset_strings = ('android', 'ios')
for i in range(2):
    print ("{} operations: \n".format(dataset_strings[i]))
    print ("Total number of apps:\t\t\t\t{}".format(len(vars_list[i][0])))
    print ("Number of total duplicates removed: \t\t{}".format(len(vars_list[i][1])))
    print ("Some duplicate app names: \t\t\t{}".format(vars_list[i][1][:3]))
    print ("Number of unique apps:\t\t\t\t{} ".format(len(vars_list[i][2])))
    print ("Number of English-only apps: \t\t\t{}".format(len(vars_list[i][3])))
    print ("Number of free apps: \t\t\t\t{}\n".format(len(vars_list[i][4])))



android operations: 

Total number of apps:				10840
Number of total duplicates removed: 		1181
Some duplicate app names: 			['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business']
Number of unique apps:				9659 
Number of English-only apps: 			9117
Number of free apps: 				8405

ios operations: 

Total number of apps:				7197
Number of total duplicates removed: 		0
Some duplicate app names: 			[]
Number of unique apps:				7197 
Number of English-only apps: 			7197
Number of free apps: 				4056
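
The duplicate removal used here keeps whichever row appears first for each name. If duplicate rows differ, for example in their review counts, keeping the row with the most reviews may be preferable, on the assumption that it is the most recent snapshot. The sketch below illustrates that alternative. `keep_most_reviewed` is a hypothetical helper, its default indices assume the Android layout (name in column 0, `Reviews` in column 3), it assumes the review field parses as a number, and the sample rows are made up.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    # dicts preserve insertion order, so apps stay in first-seen order
    return list(best.values())

# Made-up rows for illustration only.
sample = [
    ['App A', 'TOOLS', '4.2', '100'],
    ['App A', 'TOOLS', '4.2', '250'],
    ['App B', 'GAME', '4.0', '10'],
]

deduped = keep_most_reviewed(sample)
for row in deduped:
    print(row)
```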



#### Data Cleaning: Results

From the results of the cleanup, about 11% of the Android rows (1,181 of 10,840) were duplicates, about 6% of the remaining unique apps (542 of 9,659) had non-ASCII names and were treated as non-English, and about 8% of the remaining apps (712 of 9,117) charged some kind of fee. 
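
The `isascii()` test is strict: a single typographic character is enough to reject a name. The app 'U Launcher Lite – FREE Live Cool Themes, Hide Apps' shown earlier, for example, contains an en dash and therefore fails the test. A more tolerant heuristic, sketched here as an alternative and not used for the counts above, allows a few non-ASCII characters per name. The function name `is_probably_english` and the threshold of 3 are assumptions chosen for illustration.

```python
def is_probably_english(name, max_non_ascii=3):
    """Treat a name as English if it has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

# App name taken from the sample rows printed earlier; it contains an en dash.
launcher = 'U Launcher Lite – FREE Live Cool Themes, Hide Apps'
# Made-up name consisting only of non-ASCII characters.
non_english = '中文应用程序'

print(launcher.isascii())                # False: the strict test rejects it
print(is_probably_english(launcher))     # True: one non-ASCII character is tolerated
print(is_probably_english(non_english))  # False: six non-ASCII characters
```

Either test is only a proxy: an ASCII-only name can still belong to a non-English app, and the threshold trades false rejections for false acceptances.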

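The iOS data set, on the other hand, showed no duplicates and no non-English apps. Note, however, that both checks read column 0, which holds the app name in the Android data but the `id` in the iOS data, so neither check actually inspected the iOS app names (`track_name`). A large share of the iOS apps, almost 44%, charge a fee, so that data set has shrunk from a little over 7,000 apps to 4,056 apps. 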
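
The shares discussed in this summary can be recomputed from the counts printed above. In the sketch below, the `counts` dictionary simply restates those printed numbers, and `pct_dropped` is a helper introduced here. Each percentage is relative to the number of apps that entered the step in question.

```python
# Counts restated from the printed output above.
counts = {
    'android': {'total': 10840, 'unique': 9659, 'english': 9117, 'free': 8405},
    'ios': {'total': 7197, 'unique': 7197, 'english': 7197, 'free': 4056},
}

def pct_dropped(before, after):
    """Percentage of apps removed by a step, relative to the step's input."""
    return 100 * (before - after) / before

for name, c in counts.items():
    print(name)
    print("  duplicates removed:  {:.1f}%".format(pct_dropped(c['total'], c['unique'])))
    print("  non-English removed: {:.1f}%".format(pct_dropped(c['unique'], c['english'])))
    print("  non-free removed:    {:.1f}%".format(pct_dropped(c['english'], c['free'])))
```

For the Android data this gives 10.9%, 5.6% and 7.8%; for the iOS data it gives 0.0%, 0.0% and 43.6%.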
