# Analyzing Mobile App Data

The goal of this project is to help our developers understand which types of apps are likely to attract more users on Google Play and the App Store.

In [2]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

The `explore_data()` function works as follows:

## Parameters:
- `dataset`, which will be a list of lists
- `start` and `end`, which will both be integers and represent the starting and ending indices of a slice from the dataset
- `rows_and_columns`, which will be a Boolean and has `False` as its default argument

## Functionality:
- Slices the dataset using `dataset[start:end]`
- Loops through the slice and, for each iteration, prints a row and adds a blank line after it using `print('\n')`
  * The `\n` in `print('\n')` is the newline character. It does not print a visible symbol; instead it starts a new line, and we use `print('\n')` to add some blank space between rows.

## Optional Row and Column Count:
- If `rows_and_columns` is set to `True`, it prints the number of rows and columns in the dataset.

  
## The dataset shouldn't include a header row; otherwise the function reports one more row than the dataset actually contains.
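As a quick illustration of the header-row caveat, the sketch below calls `explore_data()` on a small made-up dataset. The name `sample` and its values are hypothetical and are not taken from either store; the function is repeated only so that the snippet runs on its own.

```python
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Hypothetical data: one header row followed by two app rows.
sample = [
    ['App', 'Rating'],
    ['Example App A', '4.1'],
    ['Example App B', '3.9'],
]

sample_header = sample[0]   # keep the header separately
sample_rows = sample[1:]    # data rows only

# With the header removed, the reported row count matches the number of apps.
explore_data(sample_rows, 0, 2, True)
```

Passing `sample` instead of `sample_rows` would report three rows, even though only two of them describe apps.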

## The code below opens both the App Store (iOS) and Google Play data:

In [5]:
from csv import reader

### Google Play ###
# Read the file, then split the header row from the data rows.
# The with-statement closes the file once it has been read.
with open('googleplaystore.csv', encoding='utf-8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

### App Store ###
with open('AppleStore.csv', encoding='utf-8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

In [6]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


# The next stage is to clean the data:
- Remove inaccurate or corrupted data
- Remove duplicates
- Remove non-English apps
- Remove apps that aren't free

## `del android[10472]` was used to delete a row with corrupted data (missing information) that would otherwise break the code below.
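One way to locate such a row, sketched here on made-up data, is to compare each row's length with the header's. This assumes the corruption shows up as a missing column, which the text above describes only as "missing information". `find_malformed_rows` is a hypothetical helper name, not something from the original notebook.

```python
def find_malformed_rows(rows, header):
    """Return the indices of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Hypothetical data: the second row is missing its 'Category' value.
header = ['App', 'Category', 'Rating']
rows = [
    ['Example App A', 'TOOLS', '4.1'],
    ['Example App B', '1.9'],
    ['Example App C', 'GAME', '4.5'],
]

bad = find_malformed_rows(rows, header)
print(bad)  # [1]

# Delete from the highest index down so earlier deletions do not shift later ones.
for i in sorted(bad, reverse=True):
    del rows[i]
print(len(rows))  # 2
```

Because `del` shifts every later index, re-running a cell that contains a hard-coded `del android[10472]` would remove a different, valid row each time, so the deletion should be executed only once.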

In [9]:
uniq_apps = []
duplicate_apps = []

for app in android:
    name = app[0]
    if name in uniq_apps:
        duplicate_apps.append(name)
    else:
        uniq_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


As the output above shows, several apps appear more than once. The duplicate entries for an app differ mainly in the number of reviews, so we keep the entry with the highest number of reviews, on the assumption that it is the most recently collected one.
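To inspect how the duplicate entries of a single app differ, one could print every row that shares its name. The sketch below uses made-up rows (the names and review counts are hypothetical), and `rows_for` is an illustrative helper rather than part of the original notebook.

```python
def rows_for(dataset, app_name):
    """Return every row whose first column (the app name) equals app_name."""
    return [row for row in dataset if row[0] == app_name]

# Hypothetical rows: name, category, rating, number of reviews.
apps = [
    ['Example App A', 'TOOLS', '4.2', '1500'],
    ['Example App B', 'GAME', '4.0', '300'],
    ['Example App A', 'TOOLS', '4.2', '1520'],
]

matches = rows_for(apps, 'Example App A')
for row in matches:
    print(row)

# The entry to keep is the one with the most reviews.
best = max(matches, key=lambda row: float(row[3]))
print(best)
```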

Further steps:
- Create a dictionary that maps each unique app name to the highest number of reviews recorded for that app (taken to be the latest entry).

In [38]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


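The output above confirms that the dictionary holds one entry per unique app, matching the expected count of 9659. Next we build the cleaned list. The `already_added` list ensures that each app is added only once, even when several of its duplicate rows share the same maximum number of reviews.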
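The role of `already_added` is easiest to see when two duplicate rows share the same maximum number of reviews. In the sketch below, which uses hypothetical rows, the review-count check alone lets both copies through, while the name check keeps only the first. As a substitution made here for faster lookups, the sketch stores the names in a set rather than a list.

```python
# Hypothetical rows: name and number of reviews. The first app appears twice
# with an identical review count.
apps = [
    ['Example App A', '1520'],
    ['Example App A', '1520'],
    ['Example App B', '300'],
]

# Highest review count per app name.
reviews_max = {}
for name, reviews in apps:
    n_reviews = float(reviews)
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

# Without the name check, both tied rows pass the filter.
without_check = [app for app in apps if float(app[1]) == reviews_max[app[0]]]

# With the name check, each app is added exactly once.
clean = []
already_added = set()   # assumption: a set instead of the notebook's list
for app in apps:
    name = app[0]
    if float(app[1]) == reviews_max[name] and name not in already_added:
        clean.append(app)
        already_added.add(name)

print(len(without_check))  # 3
print(len(clean))          # 2
```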

In [36]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])

    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


# Now we have a list with no duplicate entries.

For each duplicated app, we kept only the entry with the highest number of reviews.

We use English for the apps we develop at our company, and we'd like to analyze only the apps that are designed for an English-speaking audience.

The next stage is therefore to remove the non-English apps.