# Analysis of Mobile App Data

In this project we analyze data on iOS and Android mobile apps available on Google Play and the App Store. We take the role of analysts at a company that builds iOS and Android apps. The apps our company builds are free to download and install, with all revenue generated by in-app advertisements.

Our goal is to analyze the data sets and gain insight into what type of app is likely to be the most profitable. Since revenue is driven entirely by the number of users who interact with the in-app advertisements, this means creating an app that attracts as many users as possible.

## Opening and Exploring the Data

We will be working with two data sets in this project which are both readily available on Kaggle:
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data on approximately 7,000 iOS apps from the App Store. The data set can be downloaded directly [here.](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)
- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data on approximately 10,000 Android apps from Google Play. The data set can be downloaded directly [here.](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)


To begin, we will open and explore the two data sets to get a better idea of what we are working with. To assist with this, we will define a function `explore_data()` that prints rows of a data set in a readable way:

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    """Print rows start..end-1 of dataset, optionally followed by its dimensions."""
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # blank lines between rows for readability

    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

Now that we have a function to assist in exploring our data, we will open the two data sets. We will also separate the header row from the rest of each data set, since we don't want to treat it as a data point:

In [2]:
from csv import reader

# Load the App Store data and split the header row from the data rows.
with open('AppleStore.csv', encoding='utf8') as open_file:
    ios_data = list(reader(open_file))
ios_header = ios_data[0]
ios_data = ios_data[1:]

# Load the Google Play data the same way.
with open('googleplaystore.csv', encoding='utf8') as open_file:
    android_data = list(reader(open_file))
android_header = android_data[0]
android_data = android_data[1:]

# Print rows 1 through 3 of each data set, followed by its dimensions.
explore_data(ios_data, 1, 4, rows_and_columns=True)
print('\n')
explore_data(android_data, 1, 4, rows_and_columns=True)

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows:  7197
Number of columns:  16


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.

Calling `explore_data()` on our two data sets prints a few rows of each, which gives us an idea of what the data looks like. We also see that the App Store data set has 7197 rows and 16 columns, while the Google Play data set has 10841 rows and 13 columns. To get a better idea of what each column represents, let's print the header row of each data set:

In [3]:
print(ios_header)
print('\n')
print(android_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


The App Store data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) explains what each column heading means, which helps us decide which columns are useful for our analysis. 'track_name' is the name of the app. 'rating_count_tot' and 'user_rating' hold the number of ratings the app has received and its average rating, which are valuable for judging how popular an app is. Finally, 'prime_genre' is the genre the app belongs to, which can help us determine which types of apps are more popular than others.

Likewise, looking at the Google Play data set [documentation](https://www.kaggle.com/lava18/google-play-store-apps) we see that 'App' is the name of the app. 'Category' and 'Genres' will be useful for grouping the apps by content. 'Rating', 'Reviews', and 'Installs' will help us determine which types of apps are the most popular.
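To avoid hard-coding column positions later on, one option is to look them up from the header rows. The sketch below copies the two header lists from the output above; the helper name `column_indices` is hypothetical and not part of the original notebook.

```python
# Header rows copied from the printed output above.
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']


def column_indices(header, names):
    """Return a dict mapping each requested column name to its position.

    Hypothetical helper; raises ValueError if a name is not in the header.
    """
    return {name: header.index(name) for name in names}


ios_cols = column_indices(
    ios_header, ['track_name', 'rating_count_tot', 'user_rating', 'prime_genre'])
android_cols = column_indices(
    android_header, ['App', 'Category', 'Genres', 'Rating', 'Reviews', 'Installs'])

print(ios_cols)
print(android_cols)
```

With these dictionaries, a cell can write `row[android_cols['Reviews']]` instead of `row[3]`, which makes a misspelled column name fail loudly rather than silently reading the wrong column.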

## Data Cleaning 

Before we progress further in our analysis, we will pause to clean our data sets. To do this we will remove or correct any inaccurate data points, remove any duplicate data points, and modify the data sets to better fit the purpose of our analysis. For instance, our company is only interested in developing *free* apps that are targeted toward an *English-speaking* audience. To that end, we will need to remove any non-free and non-English apps from our data sets.
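As a rough illustration of the "free apps only" filter, here is a minimal sketch. It assumes prices are stored as strings such as `'0'`, `'0.0'`, or `'$4.99'`; the helper `is_free` and the row named 'Paid Example' are hypothetical, and the real price columns should be inspected before relying on this.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' equals zero.

    Assumes the only non-numeric character is an optional dollar sign.
    """
    return float(price_text.replace('$', '')) == 0.0


# Reduced sample rows in the form [name, price]. The first and last prices
# come from the printed rows above; 'Paid Example' is made up for illustration.
sample = [
    ['Instagram', '0.0'],
    ['Paid Example', '$4.99'],
    ['Coloring book moana', '0'],
]

free_apps = [row for row in sample if is_free(row[1])]
print([row[0] for row in free_apps])
```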

### Deleting Wrong Data

We will begin our cleaning process by detecting and deleting any incorrect data in our data sets. First, let's take a look at the Google Play data set. If we explore the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section for the data set on Kaggle, we can see that there is an [error](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) in a row of the data set. Row 10472 is missing the 'Category' column, which caused all the columns after it to shift out of place. If we print the length of row 10472 we can see that it is 12 when we know that the number of columns for the Google Play data set is 13:

In [4]:
print(android_data[10472])
print(len(android_data[10472]))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12


To fix this error, we will go ahead and delete this row from the data set:

In [5]:
print(len(android_data))
del android_data[10472]  # run this cell only once; rerunning deletes a different row
print(len(android_data))

10841
10840
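Deleting by a hard-coded index works, but the cell removes a different row each time it is rerun. One alternative, sketched below under the assumption that a malformed row can be recognized by its length alone, is to compare every row's length to the header's. The helper `find_malformed_rows` and the small stand-in data set are hypothetical.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Tiny stand-in data set (made-up values) with one row missing a column.
header = ['App', 'Category', 'Rating']
data = [
    ['A', 'GAME', '4.1'],
    ['B', '1.9'],            # 'Category' is missing, so the row is too short
    ['C', 'SOCIAL', '4.5'],
]

bad = find_malformed_rows(data, header)
print(bad)

# Rebuild the list without the malformed rows. Unlike `del` with a fixed
# index, this can be run repeatedly without removing valid rows.
bad_indices = {i for i, _ in bad}
data = [row for i, row in enumerate(data) if i not in bad_indices]
print(len(data))
```

A length check only catches rows with missing or extra fields; a row with the right number of columns but wrong values would still need a separate check.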


Now the Google Play data set is one step closer to being clean. The [discussion](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion) section for the App Store data set doesn't report any obvious errors, so we will assume that its data is correct. We can now move on with the data cleaning process.

### Removing Duplicate Entries

For now we will continue cleaning the Google Play data set. When examining the data set or reading through the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, we see that some apps appear in the data set more than once. Take Instagram, for example: a quick check shows that it appears 4 separate times:

In [6]:
for app in android_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


If we do a deeper dive, we find that there are 1181 duplicate entries in the Google Play data set:

In [7]:
unique_entries = []
duplicate_entries = []

for app in android_data:
    name = app[0]
    if name in unique_entries:
        duplicate_entries.append(name)
    else:
        unique_entries.append(name)
        
print('Number of duplicate entries: ', len(duplicate_entries))
print('\n')
print('Examples of duplicate entries: ', duplicate_entries[:10])

Number of duplicate entries:  1181


Examples of duplicate entries:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
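The same count can also be obtained with `collections.Counter`, which additionally shows how many copies each app has. The sketch below runs on a short stand-in list built from names in the example output; with the real data, the list would be the first column of `android_data`.

```python
from collections import Counter

# Stand-in for the app names in the first column of the data set.
# The names come from the example output above; the repetition is made up.
names = ['Box', 'Slack', 'Box', 'Google Ads', 'Box', 'Slack', 'Zenefits']

counts = Counter(names)

# Apps that appear more than once, with their number of occurrences.
duplicates = {name: n for name, n in counts.items() if n > 1}

# Each app contributes (occurrences - 1) surplus rows, which matches how the
# loop above counts every repeat after the first occurrence.
extra_rows = sum(n - 1 for n in duplicates.values())

print(duplicates)
print(extra_rows)
```

Because `Counter` uses hashing, this avoids the repeated list scans of `name in unique_entries`, which may matter on larger data sets.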


Now we need to remove these duplicate entries so that there is only one entry per app. If we look at the printout of the four Instagram entries above, we see that they are nearly identical, with one important difference: the value in the 'Reviews' column. We can use this to decide which duplicate to keep. The entry with the highest number of reviews is most likely the most recent one, so we will keep that entry and delete the rest.
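One way to apply this rule is a two-pass approach: first record the highest review count seen for each app name, then keep only the first row that matches that maximum. The following is a minimal sketch on reduced `[name, reviews]` rows, not necessarily the implementation used in the rest of this project. The Instagram values come from the output above; the 'Box' row and the names `reviews_max`, `cleaned`, and `already_added` are hypothetical.

```python
# Reduced rows: [name, reviews]. In the full Google Play data set the
# 'Reviews' column is at index 3 rather than 1.
sample = [
    ['Instagram', '66577313'],
    ['Instagram', '66577446'],
    ['Instagram', '66577313'],
    ['Instagram', '66509917'],
    ['Box', '159872'],  # made-up review count for illustration
]
NAME, REVIEWS = 0, 1

# Pass 1: find the highest review count for each app name.
reviews_max = {}
for row in sample:
    name = row[NAME]
    n_reviews = int(row[REVIEWS])
    if name not in reviews_max or n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews

# Pass 2: keep the first row per app whose review count equals that maximum.
# The `already_added` set guards against ties, where several duplicates
# share the same highest review count.
cleaned = []
already_added = set()
for row in sample:
    name = row[NAME]
    n_reviews = int(row[REVIEWS])
    if n_reviews == reviews_max[name] and name not in already_added:
        cleaned.append(row)
        already_added.add(name)

print(cleaned)
```

As a sanity check on the real data, removing 1181 duplicates from the 10840 remaining rows should leave 10840 - 1181 = 9659 rows.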