#### Introduction
We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better. 

#### Anticipated Outcome
Our goal is to analyze data to help our developers understand what types of apps are likely to attract more users.

In [1]:
# Open CSV files and save each as a list of lists
from csv import reader

def open_dataset(file_name):
    # The context manager closes the file once it has been read.
    # utf8 encoding is an assumption about how the files were saved.
    with open(file_name, encoding='utf8', newline='') as opened_file:
        return list(reader(opened_file))

apple_data = open_dataset('resources/AppleStore.csv')
google_data = open_dataset('resources/googleplaystore.csv')

For additional documentation, use the following links:
1. Google dataset: [link](https://www.kaggle.com/lava18/google-play-store-apps)
2. Apple dataset: [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

In [3]:
# function slices dataset at designated indices to allow for exploration

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
# Print the first few rows of the Apple dataset, its row and column counts, and then the header
apple_header = apple_data[0]
explore_data(apple_data, 1, 6, True)  # prints directly and returns None, so the result is not stored

print("Header")
print(apple_header)

['284882215', 'Facebook', '389879808', 'USD', '0', '2974676', '212', '3.5', '3.5', '95', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0', '2161558', '1289', '4.5', '4', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0', '1724546', '3842', '4.5', '4', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0', '1126879', '3594', '4', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7198
Number of columns: 16
Header
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [34]:
# Print the first few rows of the Google dataset, its row and column counts, and then the header

google_header = google_data[0]
explore_data(google_data, 1, 6, True)  # prints directly and returns None, so the result is not stored

print("Header")
print(google_header)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', '15-Jan-18', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', '8-Jun-18', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', '20-Jun-18', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13
Header
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

## Deleting Wrong Data

The Google Play data set has a dedicated discussion section, and one of the discussions reports an error in row 10472. Because google_data still holds the header at index 0, that row sits at index 10473. Let's print it and compare it against the header and two rows that are known to be correct.
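
When the index of a bad row is not known in advance, one option is to compare each row's length with the header's. The sketch below runs on a small made-up dataset. The helper name find_malformed_rows is hypothetical, and it assumes the error shows up as a missing or extra column, which will not catch every kind of bad row.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose column count differs from the header's."""
    header = dataset[0]
    return [
        (index, row)
        for index, row in enumerate(dataset)
        if len(row) != len(header)
    ]

# Made-up sample: the last row is missing its 'Category' value
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```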

In [32]:
# Check for missing rating from google_data
print(google_data[10473])
print('\n')
print(google_header)
print('\n')
print(google_data[1:3])

# Delete the row with missing data. Run this only once: each rerun deletes whichever row has shifted into index 10473
# del google_data[10473]

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', '7-Aug-18', '6.06.14', '4.4 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up'], ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', '15-Jan-18', '2.0.0', '4.0.3 and up']]


In [8]:
# Check that row was deleted
len(google_data)

10841

## Removing Duplicate Entries
### Part One
#### Duplicate entries should not be included in the analysis. The following steps were taken to identify duplicate entries:
1. Loop through each dataset and read each app's name.
2. If the name is not yet in the unique apps list, append it there. If it is already in that list, append it to a duplicate apps list instead.
3. Check the length of each list, and display several examples of duplicate app names.
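
The steps above can be sketched as a reusable function. This is a minimal illustration on made-up names, not the code used in the notebook cells. The function name split_unique_and_duplicates is hypothetical, and it uses a set for the membership test, which stays fast as the number of names grows.

```python
def split_unique_and_duplicates(names):
    """Split names into first occurrences and repeated occurrences."""
    seen = set()
    unique, duplicates = [], []
    for name in names:
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
            unique.append(name)
    return unique, duplicates

# Made-up sample in which 'Box' appears three times
sample_names = ['Box', 'Slack', 'Box', 'Zenefits', 'Box']
unique, duplicates = split_unique_and_duplicates(sample_names)
print('Unique apps:', len(unique))
print('Duplicate apps:', len(duplicates))
```

Note that an app with three entries contributes two names to the duplicates list, so that list's length counts surplus rows rather than distinct duplicated apps.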

In [42]:
# Check for duplicate apps in Apple data
ios_unique_apps = []
ios_duplicate_apps = []

# Skip the header row so that 'track_name' is not counted as an app
for app in apple_data[1:]:
    app_name = app[1]

    if app_name not in ios_unique_apps:
        ios_unique_apps.append(app_name)
    else:
        ios_duplicate_apps.append(app_name)

print('unique apps:', len(ios_unique_apps))
print('duplicate apps: ', len(ios_duplicate_apps))
print('Names of duplicates:', ios_duplicate_apps)

unique apps: 7195
duplicate apps:  2
Names of duplicates: ['Mannequin Challenge', 'VR Roller Coaster']


In [10]:
# Check for duplicate apps in Google data
google_unique_apps = []
google_duplicate_apps = []

# Skip the header row so that it is not counted as an app
for app in google_data[1:]:
    app_name = app[0]

    if app_name not in google_unique_apps:
        google_unique_apps.append(app_name)
    else:
        google_duplicate_apps.append(app_name)

print('Unique apps:', len(google_unique_apps))
print('Duplicate apps: ', len(google_duplicate_apps))
print('Examples of duplicates:', google_duplicate_apps[:10])

Unique apps: 9659
Duplicate apps:  1181
Examples of duplicates: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


### Part Two
To choose which duplicates to remove, we will keep the entry with the highest number of reviews for each app. A higher review count suggests that the entry was collected more recently, so its data should be the most reliable. To complete this process, we will:
1. Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
2. Use the information stored in the dictionary to create a new data set, which will have only one entry per app.
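
As a small illustration of step 1, the sketch below builds such a dictionary from made-up (name, reviews) pairs. The sample values and the function name highest_reviews are hypothetical.

```python
def highest_reviews(rows):
    """Map each app name to the largest review count seen for it."""
    reviews_max = {}
    for name, n_reviews in rows:
        # dict.get returns the stored count, or n_reviews itself for a new name
        reviews_max[name] = max(reviews_max.get(name, n_reviews), n_reviews)
    return reviews_max

# Made-up sample: 'Slack' appears twice with different review counts
sample_rows = [('Slack', 51507.0), ('Box', 159872.0), ('Slack', 51510.0)]
print(highest_reviews(sample_rows))
```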

In [20]:
# Create a dictionary mapping each Google app name to its highest review count
google_reviews_max = {}

for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    # Store the count for a new name, or overwrite a smaller stored count
    if name not in google_reviews_max or google_reviews_max[name] < n_reviews:
        google_reviews_max[name] = n_reviews

print(len(google_reviews_max))


9659


Now, let's use the google_reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:

- We start by initializing two empty lists, android_clean and already_added.
- We loop through the Google data set, and for every iteration:
  - We isolate the name of the app and the number of reviews.
  - We add the current row (row) to the android_clean list, and the app name (name) to the already_added list if:
    - The number of reviews of the current app matches the number of reviews of that app as described in the google_reviews_max dictionary; and
    - The name of the app is not already in the already_added list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for google_reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps.
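
To see why the second condition matters, consider a made-up sample in which one app appears twice with the same highest review count. In the sketch below, the sample rows and variable names are illustrative only. Checking the review count alone keeps both tied rows, while also tracking the names already kept leaves one row per app.

```python
# Made-up rows of [name, reviews]
sample = [
    ['Box', '159872'],
    ['Box', '159872'],   # tie: same highest review count
    ['Slack', '51507'],
    ['Slack', '51510'],
]
sample_max = {'Box': 159872.0, 'Slack': 51510.0}

# Checking only the review count keeps both tied 'Box' rows
count_only = [row for row in sample if float(row[1]) == sample_max[row[0]]]

# Also tracking the names already kept removes the tie
both_checks = []
seen_names = set()
for row in sample:
    name, n_reviews = row[0], float(row[1])
    if n_reviews == sample_max[name] and name not in seen_names:
        both_checks.append(row)
        seen_names.add(name)

print(len(count_only))
print(len(both_checks))
```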

In [36]:
#create new dataset
android_clean=[]

# store app names
already_added=[]

for row in google_data[1:]:
    name = row[0]
    n_reviews=float(row[3])    
    if (n_reviews == google_reviews_max[name]) and (name not in already_added):
        # Append the entire row to the android_clean list 
        android_clean.append(row)
        # Append the name of the app name to the already_added list 
        already_added.append(name)
        
#Check to make sure that the # of rows in the list is correct
explore_data(android_clean, 0, 3, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', '8-Jun-18', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


## Removing Non-English Apps
### Part One

In [57]:
# Create a function that iterates over an input string.
# For each character, check whether its code point is greater than 127 (the end of the ASCII range).
# If every character is <= 127, return True; otherwise, return False
def apps(name):
    for character in name:
        if ord(character) > 127:
            return False 
    return True

# Check results with a few examples
print("1: ", apps('Instachat'))
print("2: ", apps('Docs To Go™ Free Office Suite'))
print("3: ", apps('Instachat 😜'))
print("4: ", apps('爱奇艺PPS -《欢乐颂2》电视剧热播'))

1:  True
2:  False
3:  False
4:  False
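
The second and third results are False because a single symbol outside the ASCII range is enough to fail the check. One way to inspect this, as a sketch, is to list the characters in a name whose code point exceeds 127. The helper name non_ascii_characters is hypothetical.

```python
def non_ascii_characters(name):
    """Return the characters in name whose code point is above 127."""
    return [character for character in name if ord(character) > 127]

print(non_ascii_characters('Instachat'))
print(non_ascii_characters('Docs To Go™ Free Office Suite'))
print(ord('™'))  # code point of the trademark sign
```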


In [11]:
# Redefine the function to allow up to 3 non-ASCII characters
def apps(name):
    non_ascii = 0
    for character in name:
        if ord(character) > 127:
            non_ascii += 1
    # Names with more than 3 non-ASCII characters are treated as non-English
    return non_ascii <= 3

print("1: ", apps('Instachat'))
print("2: ", apps('Docs To Go™ Free Office Suite'))
print("3: ", apps('Instachat 😜'))
print("4: ", apps('爱奇艺PPS -《欢乐颂2》电视剧热播'))

1:  True
2:  True
3:  True
4:  False
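
A natural next step, sketched here under assumptions, is to apply such a check to every row and keep only the apps that pass. The helper is_probably_english mirrors the redefined function above. The names filter_english and name_index, and the sample rows with their categories, are hypothetical and introduced only for illustration.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters above code point 127."""
    return sum(ord(character) > 127 for character in name) <= max_non_ascii

def filter_english(rows, name_index):
    """Keep only the rows whose app name passes the check."""
    return [row for row in rows if is_probably_english(row[name_index])]

# Made-up rows of [name, category]
sample_rows = [
    ['Instachat 😜', 'SOCIAL'],
    ['爱奇艺PPS -《欢乐颂2》电视剧热播', 'VIDEO_PLAYERS'],
    ['Docs To Go™ Free Office Suite', 'BUSINESS'],
]
english_rows = filter_english(sample_rows, name_index=0)
print([row[0] for row in english_rows])
```

The threshold of three is a heuristic: it can still admit some non-English names and reject English names that contain several emoji.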
