<div>
    <b>Description:</b> Exploring Google Play and Apple Store Apps Markets<br>
    <b>Author:</b> Maika Carmelle Henry Northrop
</div>
<br>

In [19]:
# import modules
import csv
from csv import reader
import pprint

# Profitable App Profiles for the App Store and Google Play Markets

The objective of this data analysis project is to identify mobile apps that could potentially be profitable for the App store and Google Play markets.  As a Data Scientist and Full Stack Web Developer for the Bright Leaf Works startup company, my job is to facilitate the development of Android and IOS mobile apps and enable our team and stakeholders to make data-driven decisions with respect to the kind of apps they should build.

At Bright Leaf Works, we build apps that are free to download and install and take a user-centric approach to how we design the front and back end of the app.  Our primary source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app, which is why a user-friendly and UX approach to how we design coupled with the type of mobile apps we bring to market are critical elements in our business model. And so, the main goal for this project is to analyze data to help our team understand what kinds of apps are likely to attract more users.

## Collecting the Data

Presently, there are over 4 million iOS and Android apps available on the market. 

It would not be practical nor a sound business strategy to collect this amount of data as it would require a considerable amount of time and money to compile.  Therefore, we've decided to analyze a sample of the data in order to avoid spending company resources collecting new data ourselves.  The following two data sets will serve our purpose:

* A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately ten thousand Android apps from Google Play
* A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately seven thousand iOS apps from the App Store


![App Market Statistics 2018](images/apps_stats.png "App Market Statistics as of 2018")

## Let's explore the Google Play and Apple store datasets.

The following function was created following the DRY design method so that we can repeatedly print rows in a more readable way.  Also, an option has been added to our function to show the number of rows and columns for any data set.

In [20]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

## Let's begin by opening and reading both data sets.

In [21]:
### The Google Play data set ###
opened_file = open('datasets/googleplaystore.csv', encoding="utf8")
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set ###
opened_file = open('datasets/AppleStore.csv', encoding="utf8")
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

### Google Play Data Set

In [22]:
### Explore Android data set
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


## Data Set Summary

There are 10841 android apps and 13 columns in this data set.  The columns that may provide interesting insight are:
* app
* category
* reviews
* installs
* type
* price
* genres

### App Store Data Set

In [23]:
### Explore IOS data set
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


## Data Set Summary

There are 7197 iOS apps in this data set, and the columns that may provide interesting insight are:

* track_name
* currency
* price
* rating_count_tot
* rating_count_ver
* prime_genre

Not all columns are self-explanatory as the ones listed above, however further details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

## Deleting the Wrong Data

The Google Play data set has a forum where members engage in [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion) about the data set.  One of the members identified an error for row 10472. Let's print this row and compare it to the header and other rows.

In [24]:
print(android[10472]) # the incorrect row
print('\n')
print(android_header)
print('\n')
print(android[0:2]) # correct rows

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']]


Row 10472 corresponds to the 'Life Made WI-FI Touchscreen Photo Frame' app.  The rating in column 3 is 19.  The maximum rating in this data set is 5, therefore this row has inaccurate data.  As a consequence, we'll delete the entire row.

In [25]:
print(len(android))
del android[10472]
print(len(android))

10841
10840


## Removing Duplicate Entries

### Part One
Upon further review of the Google Play data set, we've discovered that some apps have more than one entry.  For example, the Instagram application has four entries:

In [26]:
for app_name in android:
    name = app_name[0]
    if name == 'Instagram':
        print('\n')
        print(app_name)  



['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Now let's determine how many duplicate apps exist in our data set:

In [27]:
unique_apps = []
duplicate_apps = []

for apps in android:
    name = apps[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('The number of duplicate apps are: ', len(duplicate_apps))
print('The number of unique apps are: ', len(unique_apps))
print('\n')
print('Examples of duplicate apps:')
pprint.pprint(duplicate_apps[567:577], indent=4)

The number of duplicate apps are:  1181
The number of unique apps are:  9659


Examples of duplicate apps:
[   'SKOUT - Meet, Chat, Go Live',
    'Badoo - Free Chat & Dating App',
    'Jaumo Dating, Flirt & Live Video',
    'SayHi Chat, Meet New People',
    'Couple - Relationship App',
    'Meetup',
    'Wish - Shopping Made Fun',
    'SnipSnap Coupon App',
    'Extreme Coupon Finder',
    'Checkout 51: Grocery coupons']


Our goal is to insure that our app data does not contain duplicate entries and represents that most up-to-date data.  Therefore it will be necessary to examine the duplicate entries and look for variances among the entries.  

Upon further examining Instagram's duplicate entries (as illustrated two cells above), we noticed that the fourth position within each row has different numerical values.  These values represent the number of reviews and when the data was collected.  The newest data has the highest number of reviews.

Therefore the criterion we will use to remove duplicate entries is to keep the row with the highest number of reviews.  This method will provide us with more reliable ratings.

To accomplish this we will:
* Build a dictionary that will store a unique app name as the key and the **highest number of reviews** as the value.
* Use the dictionary to create a new data set that will only have only one entry per app

### Part Two
#### Let's build a data dictionary for our android app data set.

In [28]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews       

In a previous code cell, we were able to identify 1181 cases where the apps occured more than once.  Therefore, the number of apps within our 'unique apps' dictionary should be the difference between the length of our data set and 1181, that is to say 9659.  Let's confirm:

In [29]:
print('Expected length: ', len(android) - 1181)
print('Actual length: ', len(reviews_max))

Expected length:  9659
Actual length:  9659


#### Let's remove duplicates using our reviews_max dictionary.  
For duplicate cases we will keep entries with the highest number of reviews  The steps are as follows:
* Initialize two empty lists variables called 'android_wrangled' and 'apps_already_added'
* We will loop through the data set and for every iteration we will:
  - Isolate the name of the app and the number of reviews.
  - Add the current row to the 'android_wrangled' list and the app name to the 'apps_already_added' list if:
    - The number of reviews of the current app matches the number of reviews found in the 'reviews_max' dictionary; and
    - The name of the app does not already exist in the 'apps_already_added' list.  We need to add this additional condition because there are a number of cases where the highest number of reviews of a duplicate app is the same for more than one entry. 

In [30]:
android_wrangled = []
apps_already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in apps_already_added):
        android_wrangled.append(app)
        apps_already_added.append(name)

#### Now let's explore the new data set.

In [31]:
explore_data(android_wrangled, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


## Removing Non-English Apps

### Part One

As we explore the data set further, it becomes apparent that many apps are in foreign languages.  See below example:

In [32]:
print(ios[814][2])
print(ios[6734][2])

print(android_wrangled[4412][0])
print(android_wrangled[7940][0])

搜狐新闻—新闻热点资讯掌上阅读软件
エレメンタル ファンタジー - 高精細３ＤアクションＲＰＧ
中国語 AQリスニング
لعبة تقدر تربح DZ


Let's remove these apps. One way to go about this is to remove app names that contain atypical symbols not common in English texts and keep the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

Many of the UTF-8 characters are encoded using the Unicode standard. Each Unicode character has a corresponding number between 0 and 127 associated with it.  Below is a function built to take advantage of Unicode's encoded characters.  The function checks the app's name and tells us whether it contains non-Unicode characters.

Python's built-in ord() function returns the number representing the unicode code of a specified character.

In [33]:
def is_english(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

print(is_english('Instagram'))
print(is_english('エレメンタル ファンタジー - 高精細３ＤアクションＲＰＧ'))

True
False


### Part Two

Some English apps use symbols that fall outside of the ASCII range.  In order to minimize data loss or inadvertently remove useful apps, we're going to have to change our function a bit and have it remove an app only if more than three non-ASCII characters are in it's name:

In [34]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:
        return True
    
print(is_english('Sketch - Draw & Paint'))
print(is_english('eHarmony™ Dating App - Meet Singles'))
print(is_english('Flashlight ◎'))

True
True
True


The function is not exactly perfect, however we were able to minimize the number of non-English apps.  This should be good enough for a simple preliminary analysis and will consider optimization at a later point. 

Below, the is_english function is also being applied to the Google Play data set.  Let's go ahead and filter both data sets.

In [36]:
android_english = []
ios_english = []

for app in android_wrangled:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188

## Isolating the Free Apps

Since we only build apps that are free to download and install and our main source of revenue consists of in-app ads, we will need to isolate the free apps for our analysis.

In [39]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[5]
    if price == '0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8864
4056


We are now officially left with 8864 Android apps and 4056 iOS apps, which should be enough for our analysis.

## Most Common Apps by Genre

### Part One

As previously mentioned, our goal is to determine the type of apps that are likely to attract more users because our revenue is predicated on the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three primary steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we would also build an iOS version and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we will need to find app profiles that are successful on both markets.  For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market.  For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.


### Part Two

We'll build two functions we can use to analyze the frequency tables:
* One function to generate frequency tables that show percentages
* Another function that we can use to display the percentages in a descending order

In [48]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key]/total) * 100
        table_percentages[key] = percentage
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0]) 

### Part Three
Now let's start by examining the frequency table for the prime_genre column of the App Store data set:

In [49]:
display_table(ios_final, -5)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032
