# Project: Profitable Apps for the Apple Store and Google Play

As a data analyst, I work for a company that builds both Android and iOS mobile applications. Its main source of revenue is in-app adverts, and it only builds apps that are free to download and install.

The goal of this project is to analyse the data to help our developers understand what type of applications are likely to attract more users.


## Opening and Exploring the Data

Below is a function that simplifies exploring the datasets.

It takes four parameters:
1. `dataset`: the dataset, as a list of rows
2. `start` and `end`: integers giving the indices used to slice the dataset
3. `rows_and_columns`: a flag, `False` by default, that prints the number of rows and columns when set to `True`

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

We will now open and explore both datasets and have a look at the number of rows and columns. We also want to try and identify the columns that could assist us with our analysis.


In [2]:
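from csv import reader

# Open both files with an explicit encoding so app names decode consistently
opened_file1 = open("AppleStore.csv", encoding="utf8")
opened_file2 = open("googleplaystore.csv", encoding="utf8")
read_file1 = reader(opened_file1)
read_file2 = reader(opened_file2)
appledata = list(read_file1)   # list of rows; row 0 is the header
googledata = list(read_file2)  # list of rows; row 0 is the header
opened_file1.close()
opened_file2.close()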
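
The loading step can also be written so that the header is kept apart from the data rows. The following is only a sketch: the `load_rows` helper is a hypothetical name, not part of the original notebook, and an in-memory `io.StringIO` sample stands in for the real CSV files so the snippet runs on its own.

```python
import io
from csv import reader


def load_rows(file_obj):
    """Return (header, rows) read from an open CSV file object.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    all_rows = list(reader(file_obj))
    return all_rows[0], all_rows[1:]


# Tiny in-memory stand-in for a real file such as AppleStore.csv (assumption).
sample_csv = io.StringIO(
    "id,track_name,price\n"
    "284882215,Facebook,0.0\n"
    "389801252,Instagram,0.0\n"
)

header, rows = load_rows(sample_csv)
print(header)
print(len(rows))
```

With a real file, the same helper could be called inside `with open("AppleStore.csv", encoding="utf8") as f:` so that the file is closed once the block ends. Note that the rest of this notebook keeps the header as row 0, so the indices used later assume the unsplit list.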

In [4]:
explore_data(appledata, 0, 5, True)



['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


We can see that the Apple Store dataset has 7198 rows, including the header, and 16 columns, so it describes 7197 apps. The `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver` and `prime_genre` columns could be beneficial in our analysis.

In [5]:
explore_data(googledata, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


We can see that the Google Play dataset has 10842 rows, including the header, and 13 columns, so it holds 10841 app entries. The `App`, `Category`, `Rating`, `Installs`, `Type`, `Price` and `Genres` columns could be beneficial in our analysis.

# Deleting Wrong Data

The Google Play dataset has a dedicated [discussion forum](https://www.kaggle.com/lava18/google-play-store-apps/discussion). In [one of the discussions](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) it has been flagged that there is missing data at row 10472. For me, this is row 10473 because I have not separated the header from the rest of the rows. Let us print the header, the row with the error and a row with correct data to investigate.

In [15]:
print(googledata[0]) # header
print('\n')
print(googledata[10473]) # incorrect row
print('\n')
print(googledata[10474]) # correct row

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


The incorrect row refers to an app called 'Life Made WI-Fi Touchscreen Photo Frame'.

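Upon review, we can highlight the following errors:

1. The category is missing from the row, so it has 12 values instead of 13 and every later value sits one column to the left of where it belongs
2. The genre is blank: it contains an empty string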

Due to the data being incorrect, we will delete the row.

In [16]:
del googledata[10473] # run once to prevent deleting correct data

In [18]:
print(len(googledata))

10841
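
A malformed row such as the one deleted above can also be found without knowing its index in advance, because a missing value leaves it with fewer columns than the header. The sketch below is an assumption about how such a check could look, run on a small made-up sample rather than the full dataset; `find_malformed_rows` is a hypothetical helper name.

```python
def find_malformed_rows(dataset):
    """Return the indices of rows whose length differs from the header's.

    Hypothetical helper; `dataset` is a list of rows with the header at index 0.
    """
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]


# Made-up sample: the row at index 2 is missing its category value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2'],
]

bad_indices = find_malformed_rows(sample)
print(bad_indices)  # [2]
```

If several rows were found this way, deleting them from the highest index down would keep the remaining indices valid.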


# Removing Duplicate Data

## Part One

Datasets often contain duplicate entries, and this is the case with the Google Play dataset.

We will print a few duplicate rows to confirm this.

In [4]:
for app in googledata[1:]:
    name = app[0]
    if name == 'Instagram':
        print(app)
        print('\n')
    

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




Above we can see that the `Instagram` app appears 4 times.
We will now count the number of duplicates in the dataset.

In [11]:
duplicate_apps = []
unique_apps = []

for app in googledata[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps :', len(duplicate_apps))
print('\n')
print('Example of duplicate apps :', duplicate_apps[:15])
        

Number of duplicate apps : 1181


Example of duplicate apps : ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
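
Checking `name in unique_apps` scans a list on every iteration, which gets slower as the list grows. As an alternative sketch (not the method used in this notebook), `collections.Counter` can tally the names in a single pass; the sample names below are made up for illustration.

```python
from collections import Counter

# Made-up sample of app names; 'Box' appears three times, 'Slack' twice.
names = ['Box', 'Slack', 'Box', 'Zenefits', 'Slack', 'Box']

counts = Counter(names)

# Every occurrence beyond the first counts as a duplicate,
# matching how the loop above fills duplicate_apps.
number_of_duplicates = sum(count - 1 for count in counts.values())
repeated_names = sorted(name for name, count in counts.items() if count > 1)

print(number_of_duplicates)  # 3
print(repeated_names)        # ['Box', 'Slack']
```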


The duplicate rows need to be removed, but not at random. In the Instagram example above, the rows differ in their fourth value, which belongs to the `Reviews` column and holds the number of reviews. This suggests that the data was collected at different times. We can use this as the criterion for removing duplicates: the more reviews an entry has, the more recent its data should be, so for each app we will keep the entry with the highest number of reviews.
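
To make the criterion concrete, the sketch below keeps the highest review count seen for each app name, using the four Instagram review counts printed earlier. The `reviews_max` dictionary is a name chosen for illustration, and this is one possible way to apply the rule rather than a definitive implementation.

```python
# (name, reviews) pairs taken from the four Instagram rows shown earlier.
entries = [
    ('Instagram', '66577313'),
    ('Instagram', '66577446'),
    ('Instagram', '66577313'),
    ('Instagram', '66509917'),
]

reviews_max = {}  # illustrative name: app name -> highest review count seen
for name, reviews in entries:
    n_reviews = int(reviews)  # values are strings when read with csv.reader
    if name not in reviews_max or n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews

print(reviews_max)  # {'Instagram': 66577446}
```

The entry whose review count matches this maximum is the one to keep; the other rows for the same app can then be dropped.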

## Part Two