# DataQuest Guided Project 2: App Stores and Profitability


This is the first Guided Project from the DataQuest Data Scientist track. The goal is to explore a data set consisting of various Android and iOS mobile apps with their respective success indicators. 

The project works with two data sets. The first contains data on approximately 10,000 Android apps from Google Play, collected in August 2018. The second contains data on approximately 7,000 iOS apps from the App Store, collected in July 2017.

The primary objective is to practice Python fundamentals, so the project does not use NumPy or pandas. Specifically, the exploration focuses on what profile of app is likely to attract and retain users, thereby increasing the profit from ad revenue.



## 1. Import and Initial Data Exploration

First, we import the data using the reader function from the csv module so that each file can be worked with as a list of rows.

In [35]:
from csv import reader

# Apple App Store: read all rows, then split the header from the data
with open('AppleStore.csv', encoding='utf8') as open_file:
    apple_data = list(reader(open_file))
apple_headers = apple_data[0]
apple_data = apple_data[1:]

# Google Play: same steps
with open('googleplaystore.csv', encoding='utf8') as open_file:
    google_data = list(reader(open_file))
google_headers = google_data[0]
google_data = google_data[1:]


To make exploring the data easier, we'll define the function below for printing sections as well as the row / column information. 

In [36]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Below we explore both data sets using the explore_data function.

In [37]:
print('Sample Apple Data\n')
explore_data(apple_data, 0, 1, True)
print('\n')
print('Sample Google Data\n')
explore_data(google_data, 0, 1, True)

Sample Apple Data

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7197
Number of columns: 16


Sample Google Data

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Next let's take a look at what headers are in the data set.

In [38]:
print('Apple Headers\n')
for header in apple_headers:
    print(header, end = ' || ')
    
print('\n\nGoogle Headers\n')
for header in google_headers:
    print(header, end = ' || ')

Apple Headers

id || track_name || size_bytes || currency || price || rating_count_tot || rating_count_ver || user_rating || user_rating_ver || ver || cont_rating || prime_genre || sup_devices.num || ipadSc_urls.num || lang.num || vpp_lic || 

Google Headers

App || Category || Rating || Reviews || Size || Installs || Type || Price || Content Rating || Genres || Last Updated || Current Ver || Android Ver || 

Not all of the Apple column names are self-explanatory, so here is a link to the documentation:

https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home



## 2. Data Cleaning

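There is a known issue with row 10472 of the Google Play data, described in this discussion:
https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015

Below we explore the rows around it and remove it. The problem row has only 12 values instead of 13: the Category value is missing, so every later value is shifted one column to the left.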

In [39]:
explore_data(google_data, 10471, 10474)


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




In [40]:
del google_data[10472]

In [41]:
explore_data(google_data, 10471, 10474)


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']
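
Rows like the one removed above can also be found programmatically instead of relying on a known index. The following is a minimal sketch, assuming that a malformed row is simply one whose length differs from the header's. The helper name find_bad_rows and the toy data are hypothetical and only for illustration.

```python
def find_bad_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data standing in for google_headers / google_data (made-up values).
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],          # Category is missing, so the row is one value short
    ['App C', 'GAME', '4.5'],
]

bad_rows = find_bad_rows(toy_header, toy_rows)
print(bad_rows)
```

On the real data, the equivalent call would be find_bad_rows(google_headers, google_data), run before any row is deleted.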




The Google Play data also has duplicate entries, which we need to remove so that they do not distort our frequency tables later. First, let's count how many duplicates there are.

In [42]:
duplicate_apps = []; unique_apps = []

for app in google_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name) 
        
print('Number of Duplicates', len(duplicate_apps))

Number of Duplicates 1181


We'll handle duplicates by keeping only the record with the highest number of reviews, on the assumption that it reflects the most recent data.


In [43]:
reviews_max = {}
for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name]=n_reviews
    if name not in reviews_max:
        reviews_max[name]=n_reviews 

print(len(reviews_max))

9659


In [44]:
google_clean = []
already_added = []

for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        google_clean.append(app)
        already_added.append(name)
        
print(len(google_clean))

9659
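
The two passes above can also be collapsed into one by storing the whole row, rather than just the review count, in a dictionary keyed by app name. This is a sketch under that assumption, not the approach used in this notebook. The helper name dedupe_by_max_reviews and the toy rows are hypothetical. Ties keep the first row seen, which matches the effect of the already_added check, although the order of the resulting list may differ.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Replace the stored row only when this one has strictly more reviews.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Toy rows in the Google Play column order (made-up values).
toy_rows = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App A', 'SOCIAL', '4.5', '250'],
    ['App B', 'BUSINESS', '4.4', '50'],
    ['App A', 'SOCIAL', '4.5', '250'],
]

deduped = dedupe_by_max_reviews(toy_rows)
print(deduped)
```

Because dictionary lookups take roughly constant time, this avoids the repeated list scans done by the `name not in already_added` check.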


At this point we are left with 9659 records in the cleaned version of the Google Play data set. We also want to focus on the English-speaking market, so we define a function to flag app names that are not in English and use it to remove those records. We'll do so for both Google Play and the Apple App Store.

In [49]:
def english_char_check(a_string):
    count = 0
    for c in a_string:
        if ord(c) > 127:
            count += 1
            if count == 3:
                return False
    return True

In [50]:
english_apple = []
for app in apple_data:
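    name = app[0]  # caution: column 0 is id; the app name (track_name) is column 1, so this check filters out nothing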
    if english_char_check(name):
        english_apple.append(app)
        
print(len(english_apple))

english_google = []
for app in google_clean:
    name = app[0]
    if english_char_check(name):
        english_google.append(app)
        
print(len(english_google))


7197
9597


## 3. Isolate the Target Data - Free Apps

Now that we are left with 7197 apps from the App Store and 9597 from the Google Play Store, we will separate the free apps from the paid apps, since free apps are the population we are interested in.

In [54]:
free_apple = []
for app in english_apple:
    price = app[4]
    if price == '0.0':
        free_apple.append(app)

print('Number of Free App Store Apps: ', len(free_apple))

free_google = []
for app in english_google:
    app_type = app[6]
    if app_type == 'Free':
        free_google.append(app)
    
print('Number of Free Google Play Store Apps: ',len(free_google))

Number of Free App Store Apps:  4056
Number of Free Google Play Store Apps:  8847


## 4. Seek Out Indicators for Success Within the Free Apps

Since we are interested in what 'type' of app is successful, we'll need to look at the types of apps in the data set and choose a success indicator with a goal of sorting the types by the indicator and choosing a winner. 

Since not all types of apps will be evenly distributed, we need to determine how many of each there are. For this, we'll start with a frequency table function and a function to display the output nicely.

In [60]:
def freq_table(dataset, index):
    D = {}
    for row in dataset:
        item = row[index]
        if item in D:
            D[item] += 1
        else:
            D[item] = 1
    totalcount = 0
    for key in D:
        totalcount += D[key]
    
    for key in D:
        D[key] = round(D[key]/totalcount*100,2)
        
    return D 

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
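
For comparison, the counting step can also be delegated to collections.Counter from the standard library, which stays within the no-NumPy, no-pandas constraint. This is a sketch of an alternative, not the function used in the rest of this notebook. The name freq_table_counter and the toy rows are hypothetical.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, rounded to 2 decimals."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: round(count / total * 100, 2) for key, count in counts.items()}


# Toy rows: (name, category); made-up values.
toy = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]

table = freq_table_counter(toy, 1)
# Sort by percentage, highest first, as display_table does.
for key, pct in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(key, ':', pct)
```

Like freq_table, this version assumes a non-empty dataset; an empty one would raise a ZeroDivisionError.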


Using the display_table function (and the frequency table function within it), we can view the categories of apps.

In [59]:
print('Apple App Genres\n------\n')
display_table(free_apple,11)
print('\n\nGoogle Play Categories\n------\n')
display_table(free_google,1)
print('\n\nGoogle Play Genres\n------\n')
display_table(free_google,9)

Apple App Genres
------

Games : 55.65
Entertainment : 8.23
Photo & Video : 4.12
Social Networking : 3.53
Education : 3.25
Shopping : 2.98
Utilities : 2.69
Lifestyle : 2.32
Finance : 2.07
Sports : 1.95
Health & Fitness : 1.87
Music : 1.65
Book : 1.63
Productivity : 1.53
News : 1.43
Travel : 1.38
Food & Drink : 1.06
Weather : 0.76
Reference : 0.49
Navigation : 0.49
Business : 0.49
Catalogs : 0.22
Medical : 0.2


Google Play Categories
------

FAMILY : 18.93
GAME : 9.7
TOOLS : 8.45
BUSINESS : 4.6
PRODUCTIVITY : 3.9
LIFESTYLE : 3.89
FINANCE : 3.71
MEDICAL : 3.54
SPORTS : 3.39
PERSONALIZATION : 3.32
COMMUNICATION : 3.23
HEALTH_AND_FITNESS : 3.09
PHOTOGRAPHY : 2.95
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.67
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.87
VIDEO_PLAYERS : 1.8
MAPS_AND_NAVIGATION : 1.39
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.8
WEATHER : 0.79
EVENTS : 0.71
PAR

One potential success indicator in the data sets is the number of downloads for each type of app. For the Google Play data set, we can find this information in the Installs column, but it is missing from the App Store data set. As a workaround, we'll use the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

In [80]:
apple_genre_ratios = freq_table(free_apple,11)

apple_nratings_scaled = {}
for genre in apple_genre_ratios:
    total = 0
    len_genre = 0
    for row in free_apple:
        row_genre = row[11]
        if row_genre == genre:
            total += float(row[5])
            len_genre += 1
    avg = total /len_genre
    apple_nratings_scaled[genre] = round(avg*apple_genre_ratios[genre])

print(sorted(apple_nratings_scaled.items(), key=lambda p:p[1], reverse=True))


[('Games', 1053159), ('Social Networking', 187366), ('Photo & Video', 112270), ('Music', 93195), ('Entertainment', 89073), ('Shopping', 55865), ('Sports', 39252), ('Utilities', 37687), ('Health & Fitness', 37311), ('Weather', 35888), ('Reference', 33049), ('Productivity', 29152), ('Finance', 27991), ('Travel', 27898), ('News', 22727), ('Food & Drink', 21390), ('Lifestyle', 20830), ('Education', 20366), ('Book', 13852), ('Navigation', 12726), ('Business', 3120), ('Catalogs', 392), ('Medical', 92)]
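
It helps to unpack what the scaled value measures. Multiplying a genre's average rating count by its percentage share of apps gives 100 times the genre's total rating count divided by the number of all free apps, so the ranking is effectively a ranking by total ratings per genre. The sketch below checks this identity on made-up toy data. All names and values in it are hypothetical, and on the real data the two numbers agree only approximately because freq_table rounds the percentages.

```python
import math

# Toy (genre, rating_count) pairs; made-up values for illustration only.
toy_apps = [('Games', 100), ('Games', 300), ('Music', 1000), ('Games', 200)]
n_apps = len(toy_apps)

scaled = {}
total_based = {}
for genre in ('Games', 'Music'):
    counts = [n for g, n in toy_apps if g == genre]
    avg = sum(counts) / len(counts)          # average ratings per app in the genre
    share_pct = len(counts) / n_apps * 100   # genre's share of all apps, in percent
    scaled[genre] = avg * share_pct          # the quantity computed above
    total_based[genre] = 100 * sum(counts) / n_apps  # same quantity from the genre total

print(scaled)
print(total_based)
```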


We repeat the same calculation for Google Play, using the Installs column in place of the rating count.

In [79]:
google_category_ratios = freq_table(free_google,1)

google_installs_scaled = {}
for category in google_category_ratios:
    total=0
    len_cat = 0
    for row in free_google:
        row_cat = row[1]
        if row_cat == category: 
            installs = row[5]
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            total += float(installs)
            len_cat += 1
    avg = total /len_cat
    google_installs_scaled[category] = round(avg*google_category_ratios[category])
    

print(sorted(google_installs_scaled.items(), key=lambda p:p[1], reverse=True))


[('GAME', 150776941), ('COMMUNICATION', 124647577), ('TOOLS', 91515629), ('FAMILY', 70000266), ('PRODUCTIVITY', 65470592), ('SOCIAL', 62087251), ('PHOTOGRAPHY', 52628326), ('VIDEO_PLAYERS', 44510170), ('TRAVEL_AND_LOCAL', 32722742), ('NEWS_AND_MAGAZINES', 26737700), ('BOOKS_AND_REFERENCE', 18862388), ('PERSONALIZATION', 17268922), ('SHOPPING', 15832974), ('HEALTH_AND_FITNESS', 12943460), ('SPORTS', 12375542), ('ENTERTAINMENT', 11175078), ('BUSINESS', 7876535), ('MAPS_AND_NAVIGATION', 5628492), ('LIFESTYLE', 5625555), ('FINANCE', 5148339), ('WEATHER', 4064985), ('FOOD_AND_DRINK', 2386873), ('EDUCATION', 2126854), ('DATING', 1597034), ('ART_AND_DESIGN', 1271254), ('HOUSE_AND_HOME', 1088478), ('AUTO_AND_VEHICLES', 602006), ('LIBRARIES_AND_DEMO', 600194), ('COMICS', 507894), ('MEDICAL', 426749), ('PARENTING', 358118), ('BEAUTY', 307891), ('EVENTS', 180015)]
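
The string cleaning inside the loop above can be pulled into a small helper, which makes it easier to test on its own. This is a minimal sketch: parse_installs is a hypothetical name, and the sample strings are Installs values that appear in the rows printed earlier. Because each value ends in a plus sign, the parsed number is a lower bound on the install count rather than an exact figure.

```python
def parse_installs(raw):
    """Convert an Installs string such as '10,000+' to a float lower bound."""
    return float(raw.replace('+', '').replace(',', ''))


samples = ['10,000+', '1,000+', '10,000,000+']
parsed = [parse_installs(s) for s in samples]
print(parsed)
```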


For the Apple App Store, Games has the highest number of ratings when scaled by the share of apps in that genre.

For the Google Play Store, the GAME category comes out on top as well.

From this analysis, Games looks like the best choice to maximize downloads and therefore profitability.