Note: This is a guided project from Dataquest. The data and the code in Part I were part of the course. Part II is my original work.

# Part I

### Loading Data

First, let's load and transform the data into a list of lists. 

In [1]:
from csv import reader
import pprint

In [2]:
opened_file = open('googleplaystore.csv', encoding='utf-8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

In [3]:
opened_file = open('AppleStore.csv', encoding='utf-8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
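
The two loading cells above repeat the same steps and leave the file handles open. As a minimal sketch (the helper name `load_csv` is hypothetical, not part of the course code), the same logic could be wrapped in a function that closes the file automatically:

```python
from csv import reader


def load_csv(path, encoding='utf-8'):
    """Return (header, rows) for a CSV file as lists of strings."""
    # newline='' is what the csv module recommends when opening files;
    # the with-block closes the file even if reading fails.
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]


# Hypothetical usage with the files from this project:
# android_header, android = load_csv('googleplaystore.csv')
# ios_header, ios = load_csv('AppleStore.csv')
```

The function assumes the file has at least a header row; an empty file would raise an `IndexError`.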

Now let's get a feel for the data and see what categories might be useful for our analysis.

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', '15-Jan-18', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up']


Number of rows: 10840
Number of columns: 13


In [5]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


### Removing Duplicate Entries

In the Google Play data set, some apps have more than one entry. The duplicate entries usually differ in their number of reviews, probably because the data was collected at different times.

In [6]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    # A name we have already seen is a duplicate; otherwise record it once.
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print(len(duplicate_apps))
print(duplicate_apps[:15])

1181
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
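
The duplicate check above looks each name up in a list, which gets slower as the list grows. A minimal sketch of the same idea using a set for the lookups is shown below; the function name `find_duplicates` and the sample names are illustrative only:

```python
def find_duplicates(names):
    """Return every name that repeats an earlier one, in order of appearance."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates


# Small made-up sample, not taken from the data set.
sample = ['Box', 'Slack', 'Box', 'Zenefits', 'Box', 'Slack']
print(find_duplicates(sample))  # ['Box', 'Box', 'Slack']
```

With the project data, this could be called as `find_duplicates(app[0] for app in android)`.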


We need to remove the duplicate entries and keep only one entry per app.
For each app, we'll keep the row with the highest number of reviews, on the assumption that it is the most recent one.

In [7]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [8]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


In [9]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
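    # Keep the row whose review count equals the maximum for this app.
    # The already_added check drops extra rows that tie on that maximum.
    if (reviews_max[name] == n_reviews) and (name not in already_added):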
        android_clean.append(app)
        already_added.append(name)
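
The two passes above (build `reviews_max`, then filter) can also be folded into one pass. The sketch below is an alternative, not the course's method: it stores the best row per app in a dictionary. The function name, the column defaults, and the sample rows are assumptions for illustration.

```python
def keep_most_reviewed(rows, name_col=0, reviews_col=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_col]
        n_reviews = float(row[reviews_col])
        # Strict '>' means a later row that only ties does not replace the earlier one.
        if name not in best or n_reviews > float(best[name][reviews_col]):
            best[name] = row
    return list(best.values())


# Made-up rows in the same column order as the Google Play data (name, ..., reviews).
sample = [
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '160'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
]
for row in keep_most_reviewed(sample):
    print(row)
```

One difference to be aware of: the result is ordered by the first appearance of each app name, whereas the two-pass version orders apps by the position of the row it keeps, so the same rows may come out in a different order.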

Now let's confirm that the number of rows in the new dataset is 9,659.

In [10]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', '8-Jun-18', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


### Removing Non-English Apps

In [11]:
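def is_English(string):
    # Count characters outside the ASCII range (code points above 127).
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    # Allow up to three such characters so that names containing an
    # emoji or a symbol like '™' still count as English.
    if non_ascii > 3:
        return False
    else:
        return True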

print(is_English('Docs To Go™ Free Office Suite'))
print(is_English('Instachat 😜'))

True
True


In [12]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_English(name):
        android_english.append(app)
        
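for app in ios:
    # Caution: this file has a leading index column, so app[1] is 'id',
    # not 'track_name' (index 2). Every id is ASCII, so every row passes.
    name = app[1]
    if is_English(name):
        ios_english.append(app)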
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', '8-Jun-18', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.

We now have 9,614 Android apps and 7,197 iOS apps to explore. The Android data is free of duplicate entries and of apps with non-English names; the iOS row count is unchanged from the raw file.

# Part II

### Finding Patterns Among Popular Apps 

Our end goal is to find out which genres users seem to like the most based on app ratings.

First, we'll take a look at what app genres we have.

In [13]:
genres = []
for row in android_english:
    genres.append(row[9])
genres[:10]

['Art & Design',
 'Art & Design',
 'Art & Design',
 'Art & Design;Creativity',
 'Art & Design',
 'Art & Design',
 'Art & Design',
 'Art & Design',
 'Art & Design;Creativity',
 'Art & Design']

In [14]:
unique_genres = []
for g in genres:
    if g not in unique_genres:
        unique_genres.append(g)
print(unique_genres[:15])
print(len(unique_genres))

['Art & Design', 'Art & Design;Creativity', 'Auto & Vehicles', 'Beauty', 'Books & Reference', 'Business', 'Comics', 'Comics;Creativity', 'Communication', 'Dating', 'Education', 'Education;Creativity', 'Education;Education', 'Education;Pretend Play', 'Education;Brain Games']
118


Let's count how many times each genre occurs.

In [15]:
genres_dict = {}
for g in genres:
    if g in genres_dict:
        genres_dict[g] += 1
    else:
        genres_dict[g] = 1
genres_dict

{'Action': 299,
 'Action;Action & Adventure': 12,
 'Adventure': 72,
 'Adventure;Action & Adventure': 5,
 'Adventure;Brain Games': 1,
 'Adventure;Education': 1,
 'Arcade': 184,
 'Arcade;Action & Adventure': 14,
 'Arcade;Pretend Play': 1,
 'Art & Design': 56,
 'Art & Design;Action & Adventure': 1,
 'Art & Design;Creativity': 6,
 'Art & Design;Pretend Play': 1,
 'Auto & Vehicles': 84,
 'Beauty': 53,
 'Board': 42,
 'Board;Action & Adventure': 3,
 'Board;Brain Games': 14,
 'Board;Pretend Play': 1,
 'Books & Reference': 218,
 'Books & Reference;Creativity': 1,
 'Books & Reference;Education': 2,
 'Business': 419,
 'Card': 47,
 'Card;Action & Adventure': 2,
 'Casino': 39,
 'Casual': 165,
 'Casual;Action & Adventure': 13,
 'Casual;Brain Games': 12,
 'Casual;Creativity': 6,
 'Casual;Education': 3,
 'Casual;Music & Video': 1,
 'Casual;Pretend Play': 25,
 'Comics': 54,
 'Comics;Creativity': 1,
 'Communication': 314,
 'Communication;Creativity': 1,
 'Dating': 170,
 'Education': 503,
 'Education;Act
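
The counting loop above is a frequency table built by hand. As a sketch of an alternative, the standard library's `collections.Counter` does the same counting and can also report the most frequent values; the sample list below is made up for illustration:

```python
from collections import Counter

# Made-up sample standing in for the `genres` list built earlier.
sample_genres = ['Art & Design', 'Art & Design;Creativity', 'Art & Design', 'Comics']

genre_counts = Counter(sample_genres)
print(genre_counts['Art & Design'])   # 2
print(genre_counts.most_common(1))    # [('Art & Design', 2)]
```

With the project data, `Counter(genres)` would give the same counts as `genres_dict`.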

Some apps have no rating, so we'll remove those from our dataset.

In [16]:
android_final = []
for app in android_english:
    rating = app[2]
    if rating != 'NaN':
        android_final.append(app)
print(android_final[:5])
print(len(android_final))

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', '7-Jan-18', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', '1-Aug-18', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', '8-Jun-18', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', '20-Jun-18', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', '26-Mar-17', '1', '2.3 and up']]
8166
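
The filter above compares the rating with the exact string `'NaN'`. A slightly more defensive sketch, assuming ratings might also be empty or spelled differently (the helper name `has_rating` is hypothetical), converts the value first and tests the result:

```python
import math


def has_rating(value):
    """Return True if the string holds a real number, False for NaN or junk."""
    try:
        return not math.isnan(float(value))
    except ValueError:
        # Raised for strings that are not numbers at all, such as ''.
        return False


print(has_rating('4.1'))  # True
print(has_rating('NaN'))  # False
print(has_rating(''))     # False
```

It could replace the comparison in the loop above, for example `if has_rating(app[2]):`.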


A new dictionary with genres as keys and average app ratings for that genre as values will help us answer our question.

In [17]:
genres_ratings = {}
for key in genres_dict:
    ratings = []
    for app in android_final:
        if key == app[9]:
            rating = float(app[2])
            ratings.append(rating)
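    # Caution: the +1 in the denominator avoids a ZeroDivisionError for genres
    # with no rated apps, but it also lowers every average. A genre with a
    # single app rated 4.6 is reported as 2.3, and an empty genre as 0.0.
    aver_rating = round(sum(ratings)/(len(ratings)+1), 5)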
    genres_ratings[key] = aver_rating
pprint.pprint(genres_ratings)

{'Action': 4.23038,
 'Action;Action & Adventure': 3.98462,
 'Adventure': 4.1169,
 'Adventure;Action & Adventure': 3.58333,
 'Adventure;Brain Games': 2.3,
 'Adventure;Education': 2.05,
 'Arcade': 4.24709,
 'Arcade;Action & Adventure': 4.02857,
 'Arcade;Pretend Play': 2.25,
 'Art & Design': 4.27818,
 'Art & Design;Action & Adventure': 0.0,
 'Art & Design;Creativity': 3.72857,
 'Art & Design;Pretend Play': 1.95,
 'Auto & Vehicles': 4.13378,
 'Beauty': 4.17907,
 'Board': 4.1725,
 'Board;Action & Adventure': 3.025,
 'Board;Brain Games': 4.05333,
 'Board;Pretend Play': 2.4,
 'Books & Reference': 4.31845,
 'Books & Reference;Creativity': 0.0,
 'Books & Reference;Education': 2.8,
 'Business': 4.08745,
 'Card': 3.98222,
 'Card;Action & Adventure': 2.86667,
 'Casino': 4.17368,
 'Casual': 4.07215,
 'Casual;Action & Adventure': 3.90714,
 'Casual;Brain Games': 4.13077,
 'Casual;Creativity': 3.72857,
 'Casual;Education': 3.2,
 'Casual;Music & Video': 2.05,
 'Casual;Pretend Play': 3.99231,
 'Comics':

Based on app ratings, users seem to like apps in the following genres the most: 'Events', 'Puzzle', 'Books & Reference', 'Personalization', 'Art & Design', 'Role Playing', and 'Parenting'.
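
As a cross-check on this ranking, note that the averaging cell divides by `len(ratings) + 1`, which pulls small genres down much more than large ones. The sketch below computes a plain mean per genre in a single pass and skips genres with too few rated apps. The function name, the `min_apps` parameter, and the sample rows are assumptions for illustration, and the results will differ from the numbers printed above.

```python
from collections import defaultdict


def mean_rating_by_genre(rows, genre_col=9, rating_col=2, min_apps=1):
    """Return {genre: mean rating}, skipping genres with fewer than min_apps rows."""
    ratings = defaultdict(list)
    for row in rows:
        ratings[row[genre_col]].append(float(row[rating_col]))
    return {
        genre: round(sum(values) / len(values), 5)
        for genre, values in ratings.items()
        if len(values) >= min_apps
    }


# Made-up two-column rows (genre, rating) to keep the example short.
sample = [['Puzzle', '4.6'], ['Puzzle', '4.2'], ['Events', '4.6']]
means = mean_rating_by_genre(sample, genre_col=0, rating_col=1)

# Sort genres from highest to lowest mean rating.
for genre, mean in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, mean)
```

With the project data, `mean_rating_by_genre(android_final)` uses the default column positions. Raising `min_apps` is one way to keep a genre with only one or two apps from topping the list.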