# Mobile App Usage Analysis

Prepared by Amanda Morphew-Ulm  
Last Edited: 2020-03-15  
*An important note to the reader:
This report was created as a guided project during the [Dataquest.io](https://app.dataquest.io/) course **Python for Data Science: Fundamentals**.*

## Google Play and App Store

This project explores data on existing Google Play and App Store mobile apps, including content rating, genre, downloads, and user ratings. As our company focuses on free-to-install apps that generate revenue via ads, our business model relies on user ad views and engagement. Therefore, our focus in analyzing this dataset is to understand what types of apps attract more users, so our developers can design our products for high levels of user engagement.

There are over 2 million apps on each of the two stores our company develops for, and collecting data on all of them is not currently practical. We're using two publicly available data sets that comprise representative samples of the data we're looking for:
- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018.
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017.

First, we define a reusable function, explore_data, that prints a slice of a data set one row at a time and can optionally report the total number of rows and columns.

In [1]:
#Parameters:
#dataset - expected to be a list of lists; it should not include a header row
#start and end - expected to be integers and represent the starting and ending indices of a slice from the data set
#rows_and_columns - expected to be a Boolean and defaults to False
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') #\n is a special character that adds a new empty line
    if rows_and_columns:
        print('Number of rows:',len(dataset))
        print('Number of columns:',len(dataset[0]))
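
To make the function's behavior concrete, here is a minimal sketch that calls it on a small, hypothetical data set. The `toy_data` rows are invented for illustration and do not come from either store; the function is repeated so the example runs on its own.

```python
def explore_data(dataset, start, end, rows_and_columns=False):
    # Print each row in the requested slice, followed by a blank line
    for row in dataset[start:end]:
        print(row)
        print('\n')
    # Optionally report the size of the whole data set (not just the slice)
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Hypothetical toy data with no header row, invented for illustration only
toy_data = [
    ['App A', 'Free', '4.5'],
    ['App B', 'Paid', '3.9'],
    ['App C', 'Free', '4.1'],
]

# Prints the first two rows, then the row and column counts of toy_data
explore_data(toy_data, 0, 2, rows_and_columns=True)
```

Note that the counts describe the full list passed in, so slicing off the header before the call (as we do later) keeps the header out of the row count.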

Now we'll open the two data set CSV files so we have each CSV file assigned to a Python variable.

In [2]:
from csv import reader

#Open the App Store file, read every row into a list of lists, then close the file
opened_file=open('F:/Jupyter/datasets/AppleStore.csv',encoding='utf8')
read_file=reader(opened_file)
apple_apps_data=list(read_file)
opened_file.close()
#Repeat the same steps for the Google Play file
opened_file=open('F:/Jupyter/datasets/googleplaystore.csv',encoding='utf8')
read_file=reader(opened_file)
google_apps_data=list(read_file)
opened_file.close()
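
As an alternative sketch, the same loading steps could be wrapped in a small helper that uses a `with` block, so each file is closed automatically once it has been read. The name `load_csv` is hypothetical and not part of the original notebook.

```python
from csv import reader

def load_csv(path):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    # newline='' is the setting the csv module documentation recommends for file objects
    with open(path, encoding='utf8', newline='') as opened_file:
        return list(reader(opened_file))

# Example calls, using the same paths as the cell above:
# apple_apps_data = load_csv('F:/Jupyter/datasets/AppleStore.csv')
# google_apps_data = load_csv('F:/Jupyter/datasets/googleplaystore.csv')
```

The helper returns the header row as the first element, just like the variables defined above.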

We can now access each data set through its variable:

- apple_apps_data - iOS App Store data set
- google_apps_data - Google Play Store data set

Now we pass those variables to the function we defined above, with a few adjustments. We'll start by printing the header row to see our column names, and then use our function to see the first five rows of data, the number of rows (excluding headers), and the number of columns:

- Since the data has a header row, we use [1:] to exclude it
- We want to look at just a few rows of data at first, to get an idea of what we have, so we'll use a start of 0 (the first row) and end of 5 (stop before the sixth row)
- We set rows_and_columns to True so the function also prints the number of rows and columns

In [3]:
print(apple_apps_data[0])
print('\n') #Add an empty line for readability
explore_data(apple_apps_data[1:],0,5,rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
print(google_apps_data[0])
print('\n') #Add an empty line for readability
explore_data(google_apps_data[1:],0,5,rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

## Data Cleaning

Before we begin analysis, we need to make sure the data is accurate. In other words, we need to detect inaccurate data, as well as duplicate data, and remove it from our data sets.

Our company only produces free-to-install apps that are directed toward an English-speaking audience, so it makes sense to limit our data set to those parameters as well.

### Inaccurate Data

Since these data sets have been used extensively by other people completing similar analyses, we can check the discussion boards on the pages where we downloaded them for reports of errors.

The [App Store discussion board](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion) doesn't seem to indicate any data errors that other users have found.

In the discussion board for the Google Play Store data set, we see a few reports of incorrect data:

- [Wrong entry for Life Made WI-Fi Touchscreen Photo Frame](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) - We found that this row is at index 10473 in our data set, which still includes the header row; the poster most likely removed the header row before noting the index
- [Wrong Type value for Command & Conquer: Rivals](https://www.kaggle.com/lava18/google-play-store-apps/discussion/101414) - Checking the comments, we see that the index of this row is most likely 9149, as we have not removed any data from our data set at this point

While we could search the Play Store and correct these values, we have a large enough sample that it makes more sense to delete these two rows. We will delete the row with the higher index (10473) first: deleting row 9149 first would shift every later row up by one position, changing the index of the other row we want to delete.
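
The effect of deletion order can be seen in a minimal sketch on a hypothetical five-element list (the values are placeholders, not real rows):

```python
# Hypothetical stand-in for a data set: a header row plus four data rows
rows = ['header', 'row 1', 'row 2', 'row 3', 'row 4']
indices_to_delete = [2, 4]  # we want to drop 'row 2' and 'row 4'

# Deleting the highest index first leaves the lower indices untouched
safe = list(rows)
for index in sorted(indices_to_delete, reverse=True):
    del safe[index]
print(safe)  # ['header', 'row 1', 'row 3']

# Deleting the lowest index first shifts every later row up by one,
# so 'row 4' is no longer at index 4
shifted = list(rows)
del shifted[2]
print(shifted[3])  # row 4
```

Sorting the indices in descending order before deleting generalizes this to any number of rows.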

In [5]:
print(google_apps_data[10473])
print('Columns in this row: ' + str(len(google_apps_data[10473])))
print('\n')
print(google_apps_data[9149])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Columns in this row: 12


['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']
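
The first row above has 12 columns instead of 13, which suggests a check we could automate. The sketch below flags any row whose length differs from the header's; the function name `find_malformed_rows` and the toy rows are hypothetical.

```python
def find_malformed_rows(dataset):
    """Return (index, row length) pairs for rows whose length differs from the header's."""
    expected = len(dataset[0])  # the header row defines the expected column count
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Hypothetical toy data: the row at index 2 is missing its category value
toy = [
    ['App', 'Category', 'Rating'],
    ['A', 'GAME', '4.1'],
    ['B', '1.9'],
    ['C', 'FAMILY', 'NaN'],
]
print(find_malformed_rows(toy))  # [(2, 2)]
```

A length check like this would catch a missing column, but not a row such as the Command & Conquer: Rivals entry, which has the right number of columns and wrong values in some of them.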


In [6]:
#Do not re-run this cell on its own: a second run would delete two more rows. Re-run the entire notebook instead.
del google_apps_data[10473]
print('Row 10473 removed.')
del google_apps_data[9149]
print('Row 9149 removed.')

Row 10473 removed.
Row 9149 removed.


### Duplicate Data

While checking the discussion board for mentions of inaccurate data, we also see some users mention that duplicates occur in the Google Play Store data set. We will check for duplicates by app name, using the loop below:

In [7]:
#Create empty lists into which we can separate the unique values from the repeated values
duplicate_apps=[]
unique_apps=[]
#Loop through the Google Play data, minus the header row
for app in google_apps_data[1:]:
    #Assign this app's name to a 'name' variable
    name=app[0]
    #Check to see if we've already listed this app in the unique_apps list
    if name in unique_apps:
        #If we have, put this name in the duplicate_apps list instead
        duplicate_apps.append(name)
    else:
        #If this app isn't listed yet in unique_apps, add it to that list instead
        unique_apps.append(name)
#View our results
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
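
The same check can be written as a reusable function. The sketch below is one possible variant, not the notebook's method: it tracks seen values in a set, which avoids scanning a growing list on every lookup. The name `find_duplicates` and the toy rows are hypothetical.

```python
def find_duplicates(rows, column_index):
    """Return every repeated occurrence of the value found at column_index."""
    seen = set()       # values encountered so far
    duplicates = []    # each occurrence after the first
    for row in rows:
        value = row[column_index]
        if value in seen:
            duplicates.append(value)
        else:
            seen.add(value)
    return duplicates

# Hypothetical toy rows: 'Box' appears three times, so two occurrences are duplicates
toy = [['Box', '10'], ['Slack', '20'], ['Box', '12'], ['Box', '12']]
print(find_duplicates(toy, 0))  # ['Box', 'Box']
```

Because the column index is a parameter, the function can be pointed at whichever column identifies an app in a given data set.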


Now that we have the names of our duplicate apps identified, how should we remove the extras?

- If we choose a random one to keep, it may not contain the most recent data for that app
- We could keep the highest version number; however, some rows say 'Varies with device'
- The Installs column is written as interval strings, such as '1,000,000,000+', rather than integer values

With those things in mind, our most likely candidate for identifying the most recent row for a duplicated app is going to be the Reviews column, which is stored here as specific integers and should only increase or stay the same as time passes.

To remove the duplicates, we will:

- Create a dictionary in which each key is a unique app name and its value is the highest number of reviews recorded for that app
- Use this information to create a new data set, which will have only one entry per app - the entry with the highest number of reviews from our original data set

In [8]:
#We start by creating an empty dictionary
reviews_max={}
#Loop through the Google Play data, minus the header row
for app in google_apps_data[1:]:
    #Assign this app's name to a 'name' variable
    name=app[0]
    #Convert the number of reviews to a float and assign it to a variable
    n_reviews=float(app[3])
    #Check if name is already a key in reviews_max AND current n_reviews is larger than existing value
    if name in reviews_max and reviews_max[name]<n_reviews:
        #If yes, we update that existing value
        reviews_max[name]=n_reviews
    #Add name to reviews_max dictionary only if it's not already there
    elif name not in reviews_max:
        reviews_max[name]=n_reviews
    #If neither of the above are true, we don't want to update it, so no "else" here
#We expect a length of 9658 after removing duplicates; let's test this:
print(len(reviews_max))

9658
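
The expected length mentioned in the comment above follows from simple arithmetic: the number of unique names equals the number of data rows minus the number of duplicates counted earlier. A minimal sketch on hypothetical toy names (not taken from the data set):

```python
# Hypothetical app names, invented for illustration
toy_names = ['Box', 'Slack', 'Box', 'Zoom', 'Box']

unique_count = len(set(toy_names))               # distinct names
duplicate_count = len(toy_names) - unique_count  # extra occurrences beyond the first

print(unique_count, duplicate_count)  # 3 2

# In the notebook, the equivalent check would be:
# expected_length = len(google_apps_data[1:]) - len(duplicate_apps)
```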


Now that we have a dictionary holding the highest number of reviews for each app, we can use this to scrub our data set.

In [9]:
#This will become our newly cleaned data set
google_clean=[]
#This will just store app names
already_added=[]
#Loop through the Google Play data, minus the header row
for app in google_apps_data[1:]:
    #Assign this app's name to a 'name' variable
    name=app[0]
    #Convert the number of reviews to a float and assign it to a variable
    n_reviews=float(app[3])
    #If this row is the one with the highest number of reviews
    #We also need the second condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry
    if n_reviews==reviews_max[name] and name not in already_added:
        google_clean.append(app)
        already_added.append(name)
#Let's check the length to see if this went as planned; it should equal 9658
print(len(google_clean))
        

9658
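
For comparison, the two passes above could be collapsed into one by storing the whole row, not just the review count, in a dictionary. This is a sketch of an alternative, not the approach used in this notebook; the name `keep_most_reviewed` and the toy rows are hypothetical.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # A strict '>' keeps the earliest row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Hypothetical toy rows (name, category, rating, reviews)
toy = [
    ['Box', 'BUSINESS', '4.2', '10'],
    ['Slack', 'BUSINESS', '4.4', '5'],
    ['Box', 'BUSINESS', '4.3', '12'],
    ['Box', 'BUSINESS', '4.1', '12'],
]
print(keep_most_reviewed(toy))
# [['Box', 'BUSINESS', '4.3', '12'], ['Slack', 'BUSINESS', '4.4', '5']]
```

The result is ordered by when each name first appears, which may differ from the row order in google_clean; the set of rows kept should be the same, since both versions keep the first row that reaches the maximum.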


We now have a variable, google_clean, which contains our cleaned list of Google Play Store app data.

As mentioned above, the iOS App Store discussion board does not mention any data errors or duplicates found by other users. However, we can quickly run the same check on the App Store data set. Note that the first column of this data set is id rather than the app name, so the check below compares app IDs. As expected, we find none.

In [10]:
#Create empty lists into which we can separate the unique values from the repeated values
duplicate_ios_apps=[]
unique_ios_apps=[]
#Loop through the App Store data, minus the header row
for app in apple_apps_data[1:]:
    #The first column of the App Store data is 'id', not the app name (track_name is at index 1)
    app_id=app[0]
    #Check to see if we've already listed this id in the unique_ios_apps list
    if app_id in unique_ios_apps:
        #If we have, put this id in the duplicate_ios_apps list instead
        duplicate_ios_apps.append(app_id)
    else:
        #If this id isn't listed yet in unique_ios_apps, add it to that list instead
        unique_ios_apps.append(app_id)
#View our results
print('Number of duplicate apps:', len(duplicate_ios_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_ios_apps[:15])

Number of duplicate apps: 0


Examples of duplicate apps: []
