# Introduction

This project analyzes app data from the Google Play and Apple App stores. We'll pretend we're data analysts working for a company that builds Android and iOS apps. Our company only builds apps that are free, and its revenue comes from in-app ads.

The goal is to use Python and Jupyter Notebook to profile the most profitable apps on the Google Play and Apple App stores. Going through the data will help our developers understand what types of apps users gravitate towards.

The data for [Google Play][1] and [Apple App Store][2] can be downloaded at Kaggle.

[1]:https://www.kaggle.com/lava18/google-play-store-apps
[2]:https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

## Opening and Exploring the Data

First we open the files and read them into variables with corresponding names. The header row is separated from the data for quick access. To explore the data we use the `explore_data()` function, which prints a slice of rows in a readable way and, if `rows_and_columns` is `True`, also prints the number of rows and columns. It assumes the input data set doesn't have a header row.

The first few rows of each data set are printed along with the number of rows and columns. We also try to identify some columns that could help with our analysis.

In [None]:
from csv import reader

# Open the .csv files and read each one into a list of lists.
# The with blocks close the files once they have been read.
with open("AppleStore.csv", encoding='utf8', newline='') as file1:
    apple_apps_data = list(reader(file1))

with open("googleplaystore.csv", encoding='utf8', newline='') as file2:
    google_apps_data = list(reader(file2))

In [None]:
#Separating the header from the data set 
apple_head = apple_apps_data[0]
apple_data = apple_apps_data[1:]

google_head = google_apps_data[0]
google_data = google_apps_data[1:]

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # new empty line for separation
        
    if rows_and_columns:
        print("Number of rows:", len(dataset))
        print("Number of columns:", len(dataset[0]))

# First few rows (the slice starts at 0 because the header is already removed)
print("Apple Rows")
explore_data(apple_data, 0, 3, True)
print('\n')
print("Google Rows")
explore_data(google_data, 0, 3, True)

In [None]:
#Exploring Columns
print(apple_head, '\n')
print(google_head, '\n')

The columns useful for our analysis are those related to price (we develop only free apps) and to user ratings. They are listed in the two tables below:

| Google Column Name | Description |
|:-----------:|:------------:|
| 'Rating' | User rating of the app |
| 'Installs' | Number of downloads |
| 'Price' | Price of the app |
| 'Type' | Whether an app is paid or free |

| Apple Column Name | Description |
|:---------:|:---------:|
| 'user_rating' | Average user rating (for all versions) |
| 'user_rating_ver' | Average user rating (for current version) |
| 'price' | Price of the app |
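
Because each row is a plain list, a column is reached by its position. The following is a minimal sketch of looking up that position by name. The header and row are hypothetical toy values, and `column_value` is an illustrative helper that is not part of the notebook.

```python
# Hypothetical toy header and row standing in for google_head / google_data.
sample_head = ['App', 'Category', 'Rating', 'Reviews']
sample_row = ['Example App', 'TOOLS', '4.1', '159']

def column_value(header, row, column_name):
    """Return the value stored under column_name in row."""
    index = header.index(column_name)  # raises ValueError if the name is absent
    return row[index]

# Values read by csv.reader are strings, so convert before doing arithmetic.
rating = float(column_value(sample_head, sample_row, 'Rating'))
print(rating)
```

Looking up the index by name avoids hard-coding positions such as `app[3]`, at the cost of one extra list search per lookup.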

## Wrong Data

The discussion section for the Google Play Store data set describes an error in row 10472 (counting from the data set without the header). Printing the header, row 10472, and the row after it for comparison shows that row 10472 has a rating of 19, which is impossible because the maximum rating is 5. We'll therefore delete this row.

In [None]:
print(google_head, '\n') #header
print(google_data[10472], '\n') # incorrect
print(google_data[10473]) #correct

In [None]:

del google_data[10472] # Running this more than once will delete more data.
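
As a possible alternative to a hard-coded index, a bad row could be located by checking field counts. The sketch below assumes that a malformed row has a different number of fields than the header, which has not been verified against this data set. The toy data and the `find_malformed_rows` helper are hypothetical.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose field count differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Hypothetical toy data, not taken from the real file.
toy_head = ['App', 'Category', 'Rating']
toy_rows = [
    ['A', 'TOOLS', '4.1'],
    ['B', '19'],            # one field is missing
    ['C', 'GAME', '3.9'],
]

bad = find_malformed_rows(toy_head, toy_rows)

# Building a new list instead of using del is safe to re-run.
cleaned = [row for i, row in enumerate(toy_rows) if i not in bad]
print(bad, len(cleaned))
```

Filtering into a new list also sidesteps the problem noted in the cell above, where running `del` a second time removes a valid row.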

## Duplicate Apps
### Part One

From the discussion section for the Google Play Store data, duplicate entries for the same applications have been found. An example is Instagram:

In [None]:
for app in google_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

Using a `for` loop, we find 1,181 duplicate entries. Removing them will make our analysis more accurate, but the duplicates shouldn't be removed at random. Take the Instagram duplicates: the data in each row is the same except for the fourth entry, which is the number of user reviews. The differing review counts suggest the data was collected at different times, and the higher the number of reviews, the more recent the data is likely to be. We will therefore keep the row with the highest number of user reviews.

In [None]:
duplicate = []
unique = []

for app in google_data:
    app_name = app[0]
    if app_name in unique:
        duplicate.append(app_name)
    else:
        unique.append(app_name)

print('Number of duplicate apps:', len(duplicate))
print('\n')
print('Examples of duplicate apps:', duplicate[:3])
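
The membership test `app_name in unique` scans a list on every iteration, which gets slower as the list grows. The sketch below shows the same counting idea with a `set` for the lookups. The toy rows and the `count_duplicates` helper are hypothetical and only illustrate the technique.

```python
def count_duplicates(rows, name_index=0):
    """Return (duplicates, seen): repeated names in order, and the set of unique names."""
    seen = set()
    duplicates = []
    for row in rows:
        name = row[name_index]
        if name in seen:          # set lookup takes constant time on average
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates, seen

# Hypothetical toy rows: only the app name matters here.
toy_rows = [['App A'], ['App B'], ['App A'], ['App A'], ['App C']]

toy_duplicates, toy_unique = count_duplicates(toy_rows)
print(len(toy_duplicates), len(toy_unique))
```

As in the cell above, the number of duplicates plus the number of unique names equals the total number of rows.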

### Part Two

After removing the duplicate Google Play Store apps we should be left with 9659 unique apps.

In [None]:
print('Length after duplicates:', len(google_data) - 1181)

To remove the duplicates, we'll create a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews for that particular app.

In [None]:
reviews_max = {} # maps each app name to its highest review count
for app in google_data:
    name = app[0] # app name
    n_reviews = float(app[3]) # number of reviews for the app
    if name in reviews_max and reviews_max[name] < n_reviews:
        # If the app name is already in the dictionary and the stored
        # review count is lower than this row's count, replace the stored
        # count with the higher one.
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        # if app not in dictionary, create a new key with reviews as the value
        reviews_max[name] = n_reviews

len(reviews_max) # is 9659 as expected
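
To make the "highest review count per app" idea concrete, here is a small worked sketch on made-up rows. The app names, review counts, and the `max_reviews_by_name` helper are hypothetical. It uses `dict.get` and `max` as a compact alternative to the two `if` statements.

```python
def max_reviews_by_name(rows, name_index=0, reviews_index=3):
    """Map each app name to the highest review count found among its rows."""
    result = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Use this row's count as the default when the name is new.
        result[name] = max(result.get(name, n_reviews), n_reviews)
    return result

# Hypothetical toy rows: name, category, rating, reviews.
toy_rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App A', 'TOOLS', '4.0', '250'],
    ['App B', 'GAME', '3.5', '40'],
    ['App A', 'TOOLS', '4.0', '180'],
]

toy_reviews_max = max_reviews_by_name(toy_rows)
print(toy_reviews_max)
```

The dictionary holds one entry per unique name, so its length matches the number of unique apps regardless of how many duplicate rows each app has.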

Here we use the `reviews_max` dictionary to remove the duplicate rows. The list `android_clean` will hold our cleaned data as a list of lists, while the `already_added` list keeps track of the apps that have already been added.

In [None]:
android_clean = [] #list for new cleaned data set
already_added = [] #app names

for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        # If the number of reviews equals the maximum stored in reviews_max
        # and the app is not yet in already_added, add the entire row to
        # android_clean and record the app's name in already_added.
        # The second condition handles duplicate rows that share the same
        # (maximum) number of reviews.
        android_clean.append(app)
        already_added.append(name)

print('Expected:', len(google_data) - 1181)
len(android_clean) #should be 9659
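
The role of the second condition is easiest to see on a case where two duplicate rows tie for the highest review count. The sketch below uses made-up rows and a hypothetical `keep_highest_review_rows` helper. It tracks names in a `set` rather than a list, which is a substitution for faster lookups and not the notebook's own code.

```python
def keep_highest_review_rows(rows, reviews_max, name_index=0, reviews_index=3):
    """Keep one row per app: the first row whose review count equals the maximum."""
    clean = []
    added = set()
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if n_reviews == reviews_max[name] and name not in added:
            clean.append(row)
            added.add(name)
    return clean

# Hypothetical toy rows; two 'App A' rows tie at 250 reviews.
toy_rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App A', 'TOOLS', '4.0', '250'],
    ['App A', 'TOOLS', '4.0', '250'],
    ['App B', 'GAME', '3.5', '40'],
]
toy_reviews_max = {'App A': 250.0, 'App B': 40.0}

toy_clean = keep_highest_review_rows(toy_rows, toy_reviews_max)
print(len(toy_clean))
```

Without the `name not in added` check, both tied rows for the same app would pass the first condition and the duplicate would survive.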