## Google Play and The App Store


#### Intro:
This analysis explores app data from Google Play and the App Store. We want to learn what types of free apps users typically download, so that we can recommend a direction that maximizes in-app ad revenue.

In [1]:
import pandas as pd
from csv import reader

def explore_data(dataset, start, end, rows_and_columns=False, header=False):
    """Prints a slice of a dataset."""
    
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        if header:
            print('Number of rows:', len(dataset)-1)
            print('Number of columns:', len(dataset[0]))
        else:
            print('Number of rows:', len(dataset))
            print('Number of columns:', len(dataset[0]))

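def dataset(csv) -> list:
    """Opens a csv file, reads it, and returns the contents as a list of rows."""
    # The context manager closes the file once the rows have been read.
    # encoding='utf8' assumes the CSV files are UTF-8 encoded.
    with open(csv, encoding='utf8') as csv_open:
        csv_dataset = list(reader(csv_open))
    return csv_dataset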


In [2]:

app_dataset = dataset('AppleStore.csv')
google_dataset = dataset('googleplaystore.csv')

In [3]:
# explore_data only prints and returns None, so there is nothing to assign.
explore_data(app_dataset, 0, 5, rows_and_columns=True, header=True)
explore_data(google_dataset, 0, 5, rows_and_columns=True, header=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Ed

In [4]:
explore_data(app_dataset, 0, 1)
explore_data(google_dataset, 0, 1)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']




#### Helpful columns:
Several columns are useful for this analysis:

|Dataset|Column|Explanation|
|---|---|---|
|The App Store|price|We are only interested in free apps.|
|The App Store|rating_count_tot|We want to check that an app has a reasonable number of ratings.|
|The App Store|user_rating|We are only interested in successful apps.|
|The App Store|prime_genre|We should know what type of app it is.|
|Google Play|Category|Same as prime_genre.|
|Google Play|Rating|Same as user_rating.|
|Google Play|Reviews|Same as rating_count_tot.|
|Google Play|Installs|We are interested in apps with a large number of installs.|
|Google Play|Type|Whether the app is free (`Free`) or paid.|
|Google Play|Price|Same as price.|
|Google Play|Content Rating|The intended audience (for example `Everyone`), comparable to cont_rating in the App Store data.|
|Google Play|Genres|Same as prime_genre, but more granular than Category.|
|Google Play|Current Ver|May help with determining the most recent trends.|


To look at the data first-hand, you can find both datasets at these two links:

https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

https://www.kaggle.com/lava18/google-play-store-apps

In [5]:
# Remove the row with the error. Run this cell only once: running it again would delete a valid row.
del google_dataset[10473]
print(google_dataset[0])
print(google_dataset[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
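
One way to locate a row like the one deleted above is to compare each row's length with the header's. This is only a sketch on made-up rows: `find_malformed_rows` is a hypothetical helper, not part of the notebook, and it assumes the error shows up as a missing or extra column.

```python
def find_malformed_rows(rows, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Made-up sample: the second data row is missing its 'Category' value.
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

bad_rows = find_malformed_rows(rows, header)
print(bad_rows)
```

With the real data, the rows would be `google_dataset[1:]`, so a reported index is one lower than the matching index in `google_dataset`, which still includes the header.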


In [6]:
all_apps = []
for row in google_dataset[1:]:
    name = row[0]
    all_apps.append(name)
print('Total Number of entries: ' + str(len(all_apps)))

app_count = {}
for app in all_apps:
    if app in app_count:
        app_count[app] += 1
    else:
        app_count[app] = 1

print('Number of unique entries: ' + str(len(app_count)))
duplicate_apps = []
for app in app_count:
    if app_count[app] > 1:
        duplicate_apps.append(app)
print('Number of apps with more than one entry: ' + str(len(duplicate_apps)))


Total Number of entries: 10840
Number of unique entries: 9659
Number of apps with more than one entry: 798


#### Deleting Replicated Entries
To ensure that we do not count an application twice, we will remove duplicates. For each app we keep the entry with the highest number of reviews, since it is the most likely to be the most recent and accurate record. There are two main steps:

1. Iterate through `google_dataset` and create a dictionary that maps the name of each app to its highest review count.
2. Iterate through `google_dataset` again and keep only the rows whose review count matches the dictionary value, collecting them into a list of lists (a new dataset).
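
The same idea can be sketched in a single pass by storing the whole row, rather than only the review count, as the dictionary value. This is an illustrative variant on made-up rows, not the approach used in the cells that follow; `keep_most_reviewed` and its parameters are hypothetical names.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Store the row if the name is new or this row has more reviews.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Made-up sample with two duplicated apps.
sample = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.5', '50'],
    ['App A', 'TOOLS', '4.1', '250'],
    ['App B', 'GAME', '4.5', '50'],
]

deduped = keep_most_reviewed(sample)
for row in deduped:
    print(row)
```

One difference to keep in mind: this variant returns rows in order of each name's first appearance, whereas the two-step approach keeps each surviving row at its original position.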

In [7]:
review_max = {}

for row in google_dataset[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in review_max and review_max[name] < n_reviews:
        review_max[name] = n_reviews
    elif name not in review_max:
        review_max[name] = n_reviews
print(len(review_max))

9659


In [8]:
android_no_duplicates = []
already_added = []

for row in google_dataset[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == review_max[name] and name not in already_added:
        android_no_duplicates.append(row)
        already_added.append(name)
print(android_no_duplicates[1])

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


In [9]:
def english_string(word: str) -> bool:
    """Returns False if a name contains more than 3 non-ASCII characters."""
    non_english_symbols = 0
    for letter in word:
        if ord(letter) > 127:
            non_english_symbols += 1
            # Check right after counting so that a name ending on its
            # fourth non-ASCII character is rejected as well.
            if non_english_symbols > 3:
                return False
    return True

In [10]:
android_no_foreign_names = []

for row in android_no_duplicates:
    name = row[0]
    if english_string(name):
        android_no_foreign_names.append(row)
print(len(android_no_foreign_names))

9616


In [11]:
android_free_apps = []
android_not_free_apps = []
for row in android_no_foreign_names:
    app_type = row[6]  # index 6 is the 'Type' column, not 'Price'
    if app_type == 'Free':
        android_free_apps.append(row)
    else:
        android_not_free_apps.append(row)
print('There are ' + str(len(android_free_apps)) + ' free apps.')
print('There are ' + str(len(android_not_free_apps)) + ' paid apps.')

There are 8865 free apps.
There are 751 paid apps.


In [12]:
apple_free_apps = []
apple_not_free_apps = []
for row in app_dataset[1:]:
    price = row[4]
    if price == '0.0':
        apple_free_apps.append(row)
    else:
        apple_not_free_apps.append(row)
print('There are ' + str(len(apple_free_apps)) + ' free apps.')
print('There are ' + str(len(apple_not_free_apps)) + ' paid apps.')

There are 4056 free apps.
There are 3141 paid apps.


In [13]:
print(google_dataset[0:2])
app_dataset[0:2]

[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']]


[['id',
  'track_name',
  'size_bytes',
  'currency',
  'price',
  'rating_count_tot',
  'rating_count_ver',
  'user_rating',
  'user_rating_ver',
  'ver',
  'cont_rating',
  'prime_genre',
  'sup_devices.num',
  'ipadSc_urls.num',
  'lang.num',
  'vpp_lic'],
 ['284882215',
  'Facebook',
  '389879808',
  'USD',
  '0.0',
  '2974676',
  '212',
  '3.5',
  '3.5',
  '95.0',
  '4+',
  'Social Networking',
  '37',
  '1',
  '29',
  '1']]

### App strategy
As mentioned at the beginning of the notebook, our goal with the analysis is to build popular apps so we can take advantage of in-app advertising. Our validation process will be to:
- Build a minimal Android version of an app
- If it has a good response, then we will develop it further
- If it becomes profitable, then we will build an iOS version

It is important to analyze both the Android and Apple app stores before deciding on an app because we want to release a successful app for all potential customers.

In [14]:
def freq_table(dataset, index: int) -> dict:
    """Returns a frequency table for any column of a dataset."""
    table = {}
    dataset_size = len(dataset)
    for row in dataset:
        genre = row[index]
        if genre in table:
            table[genre] += 1
        else:
            table[genre] = 1
    for entry in table:
        whole_num = table[entry]
        table[entry] = round(whole_num/dataset_size * 100, 2)   
    return table

def display_table(dataset, index):
    """Creates a frequency table and prints the sorted results."""
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
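
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library. This is an alternative, not the notebook's implementation; `freq_table_counter` is a hypothetical name, and the rows are made up.

```python
from collections import Counter


def freq_table_counter(rows, index):
    """Return {value: percentage} for one column, most frequent value first."""
    counts = Counter(row[index] for row in rows)
    total = len(rows)
    # most_common() yields (value, count) pairs sorted by count, descending.
    return {value: round(count / total * 100, 2)
            for value, count in counts.most_common()}


# Made-up sample rows: (name, category).
sample = [['a', 'GAME'], ['b', 'TOOLS'], ['c', 'GAME'], ['d', 'FAMILY']]

table = freq_table_counter(sample, 1)
for value, pct in table.items():
    print(value, ':', pct)
```

Because dictionaries preserve insertion order, the result is already sorted by frequency and needs no separate list of tuples for display. Ties may be ordered differently than in `display_table`.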

In [15]:
display_table(android_free_apps, 1)

FAMILY : 18.89
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.91
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.24
BOOKS_AND_REFERENCE : 2.15
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.92
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


In [16]:
display_table(android_free_apps, 9)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Lifestyle : 3.9
Productivity : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.24
Books & Reference : 2.15
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.92
Strategy : 0.9
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;Br

### Observations:
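- The most common categories are `FAMILY` (18.89%), `GAME` (9.72%), and `TOOLS` (8.46%). In the more granular `Genres` column, the leaders are `Tools`, `Entertainment`, and `Education`.
- The gaps are largest at the top of the category list: `FAMILY` is nearly double `GAME`. From `BUSINESS` (4.59%) downward, neighbouring categories differ by less than one percentage point.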

In [17]:
display_table(apple_free_apps, 11)

Games : 55.65
Entertainment : 8.23
Photo & Video : 4.12
Social Networking : 3.53
Education : 3.25
Shopping : 2.98
Utilities : 2.69
Lifestyle : 2.32
Finance : 2.07
Sports : 1.95
Health & Fitness : 1.87
Music : 1.65
Book : 1.63
Productivity : 1.53
News : 1.43
Travel : 1.38
Food & Drink : 1.06
Weather : 0.76
Reference : 0.49
Navigation : 0.49
Business : 0.49
Catalogs : 0.22
Medical : 0.2


### Observations
- In the Apple App Store, the most common genre among free apps is `Games` (55.65%), followed by `Entertainment` at a much lower percentage (8.23%).
- Although there are large gaps between `Games` and `Entertainment` and between `Entertainment` and `Photo & Video`, the differences between the genres after that are much smaller.
- The most common apps have to do with being entertained and socialising.
- Based on the results we can suggest either a Game, Entertainment, or Social app.
- I would argue that Photo & Video is closely connected to Social Networking.
- Games are common in both the Apple and Google stores. Beyond that, the genre labels differ so much that it is difficult to tell what the apps have in common.

In [18]:
apple_genre_freq = freq_table(apple_free_apps, 11)
all_genres = []
for genre in apple_genre_freq:
    total = 0
    len_genre = 0
    for row in apple_free_apps:
        genre_app = row[11]
        if genre_app == genre:
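            n_ratings = float(row[5])  # rating_count_tot
            total += n_ratings
            len_genre += 1
    # Caution: count / total is "apps per rating", the inverse of the average
    # rating count (total / len_genre), so a large value means few ratings per app.
    apps_per_rating = len_genre/total
    all_genres.append((apps_per_rating, genre))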
sorted(all_genres, reverse=True)

[(0.002175095160413268, 'Medical'),
 (0.000561938061938062, 'Catalogs'),
 (0.0001595829565402415, 'Education'),
 (0.00015704010804359434, 'Business'),
 (0.00011767013139831339, 'Book'),
 (0.00011137955426850293, 'Lifestyle'),
 (9.239615598794866e-05, 'Entertainment'),
 (7.395212480301443e-05, 'Finance'),
 (7.137707329115756e-05, 'Utilities'),
 (6.292187489829439e-05, 'News'),
 (5.334278514584182e-05, 'Shopping'),
 (5.2841026962666635e-05, 'Games'),
 (5.248272937280599e-05, 'Productivity'),
 (5.0119495428574446e-05, 'Health & Fitness'),
 (4.96796292767715e-05, 'Sports'),
 (4.9556241147584246e-05, 'Food & Drink'),
 (4.946572599344403e-05, 'Travel'),
 (3.850292911033207e-05, 'Navigation'),
 (3.6697392858994994e-05, 'Photo & Video'),
 (2.117704763264517e-05, 'Weather'),
 (1.8840127944231113e-05, 'Social Networking'),
 (1.7704746140365342e-05, 'Music'),
 (1.4826258489886268e-05, 'Reference')]

### Observations:
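`Medical` has the largest value in the results, but the cell divides the number of apps by the total rating count, so a large value means few ratings per app. `Medical` apps average only about 460 ratings each, the fewest of any genre, while `Reference` (about 67,400), `Music`, and `Social Networking` average the most. Rating counts are only a proxy for downloads, so these last genres, not `Medical`, are the likelier candidates for high download numbers.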
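
The average rating count per genre is the total divided by the number of apps. The sketch below shows that calculation on made-up rows. `average_by_group` is a hypothetical helper and the numbers are invented, so the output says nothing about the real dataset.

```python
from collections import defaultdict


def average_by_group(rows, group_idx, value_idx):
    """Return {group: mean of the numeric column} in a single pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    # Average = total / count (not count / total).
    return {group: totals[group] / counts[group] for group in totals}


# Made-up sample rows: (genre, rating count).
sample = [
    ('Games', '1000'),
    ('Games', '3000'),
    ('Medical', '10'),
    ('Medical', '30'),
]

averages = average_by_group(sample, 0, 1)
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

A single pass with two dictionaries also avoids re-scanning every row once per genre, which the nested loop in the cell above does.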

In [30]:
category_freq = freq_table(android_free_apps, 1)
all_categories = []
for category in category_freq:
    total = 0
    len_category = 0
    for row in android_free_apps:
        category_app = row[1]
        if category_app == category:
            number_of_installs = row[5]
            number_of_installs = float(number_of_installs.replace(',', '').replace('+', ''))
            total += number_of_installs
            len_category += 1
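    # Caution: count / total is "apps per install", the inverse of the average
    # installs (total / len_category), so a large value means few installs per app.
    apps_per_install = len_category/total
    all_categories.append((apps_per_install, category))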
sorted(all_categories, reverse=True)

[(8.295270497904928e-06, 'MEDICAL'),
 (3.944116255017792e-06, 'EVENTS'),
 (1.9487407641637604e-06, 'BEAUTY'),
 (1.8429659550170142e-06, 'PARENTING'),
 (1.5661615512622601e-06, 'LIBRARIES_AND_DEMO'),
 (1.5448362050676618e-06, 'AUTO_AND_VEHICLES'),
 (1.2230063051534151e-06, 'COMICS'),
 (1.1709206580826734e-06, 'DATING'),
 (7.510097918199829e-07, 'HOUSE_AND_HOME'),
 (7.206207553734822e-07, 'FINANCE'),
 (6.974952477307038e-07, 'LIFESTYLE'),
 (5.840131717785492e-07, 'BUSINESS'),
 (5.454064072014826e-07, 'EDUCATION'),
 (5.195081178125964e-07, 'FOOD_AND_DRINK'),
 (5.03439729873672e-07, 'ART_AND_DESIGN'),
 (2.7482794690842314e-07, 'SPORTS'),
 (2.704275441228814e-07, 'FAMILY'),
 (2.4649109000308073e-07, 'MAPS_AND_NAVIGATION'),
 (2.3873060337676903e-07, 'HEALTH_AND_FITNESS'),
 (1.9706428614489298e-07, 'WEATHER'),
 (1.9225287760183665e-07, 'PERSONALIZATION'),
 (1.4210848871239236e-07, 'SHOPPING'),
 (1.1465313530763951e-07, 'BOOKS_AND_REFERENCE'),
 (1.0472105044199336e-07, 'NEWS_AND_MAGAZINES'),
 

### Observations:
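Again `MEDICAL` has the largest value, but because the cell divides the number of apps by the total installs, this means `MEDICAL` has the fewest installs per app (about 120,551 on average), not the most. The output is cut off after `NEWS_AND_MAGAZINES` (about 9.5 million installs per app), so the categories with the highest average installs are not visible here.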
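
The `Installs` strings are open-ended buckets such as `10,000+`, so any number parsed from them is a lower bound rather than an exact count. A minimal sketch of the parsing and averaging on made-up rows (`parse_installs` is a hypothetical helper, and the values are invented):

```python
from statistics import mean


def parse_installs(text):
    """Convert a string such as '10,000,000+' to a float (the bucket's lower bound)."""
    return float(text.replace(',', '').replace('+', ''))


# Made-up sample rows: (category, installs bucket).
sample = [
    ('MEDICAL', '1,000+'),
    ('MEDICAL', '5,000+'),
    ('SOCIAL', '1,000,000+'),
    ('SOCIAL', '500,000+'),
]

# Group the parsed values by category, then average each group.
by_category = {}
for category, installs in sample:
    by_category.setdefault(category, []).append(parse_installs(installs))

avg_installs = {category: mean(values) for category, values in by_category.items()}
print(sorted(avg_installs.items(), key=lambda item: item[1], reverse=True))
```

Because the buckets are coarse, averages built this way are best read as rough rankings rather than precise install counts.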