# Profitable App Profiles for the App Store and Google Play Markets

This is a learning excersize from [Data Analyst in Python - Dataquest course](https://www.dataquest.io/path/data-analyst/). In this excersize we are imagining we are a mobile app development company. Our aim is to help our developers understand what type of apps are likely to attract more users on Google Play and the App Store. Our tools for this is pure Python.

## Data sources 

* [Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps)
* [Apple iOS App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

In [1]:
# row printer function

def explore_data(dataset, start=0, end=5, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset) - 1)
        print('Number of columns:', len(dataset[0]))
        print('\n')

In [2]:
from csv import reader

with open('googleplaystore.csv') as google_file, open('AppleStore.csv') as apple_file:
    google_apps = list(reader(google_file))
    apple_apps = list(reader(apple_file))

In [3]:
# read through data headers

explore_data(google_apps, end=3, rows_and_columns=True)
explore_data(apple_apps, end=3, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558

## Data cleaning

At our company, we only build apps that are free to download and install, and that are directed toward an English-speaking audience. 

Tasks for cleaning:

- [x] find errors in data (delete if any), e.g. there is a [reported error](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)
- [x] remove non-English apps
- [x] remove non-free apps

In [4]:
# finding rows with errors

# "this entry has missing 'Rating' and a column shift happened for next columns.."
# 10472 Life Made WI-Fi Touchscreen Photo Frame 1.9 19.0 3.0M 1,000+ Free 0 Everyone NaN February 11, 2018 1.0.19 4.0 and up NaN

reported_error_index = 10472
explore_data(google_apps, 10473, 10474)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




In [5]:
del google_apps[10473]

### Duplicates

There is also [another report for iOS apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409) which says that apps named "Mannequin Challenge" and "VR Roller Coaster" appear 2 times in the dataset. Although the names do appear two times each, the values in columns `id`, `size_in_bytes` and `rating_count` are so different that it's probably just 2 different apps with the same name. Also [doing a search in the App Store page](https://www.apple.com/us/search/Mannequin-Challenge?src=globalnav) gives a lots of results by this name.

Some duplicates are indeed found in Google Apps dataset:

In [6]:
# Find the number of duplicate android apps

unique_apps = []
duplicate_apps = []

for row in google_apps[1:]:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)

print("Number of duplicate apps:", len(duplicate_apps))
print("Examples:", duplicate_apps[:15])

Number of duplicate apps: 1181
Examples: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [7]:
# Finding how many unique names there are in duplicates

unique_duplicate_names = set(duplicate_apps)
len(unique_duplicate_names)

798

In [8]:
# Since 798 < 1181, some apps have not only duplicates but triplicates, etc
# Let's find the most frequent duplicates

frequency = {}

for app in duplicate_apps:
    if app in frequency:
        frequency[app] += 1
    else:
        frequency[app] = 1
        
max_count = 0
most_frequent_app = ''
for app, count in frequency.items():
    if count > max_count:
        max_count = count
        most_frequent_app = app

In [9]:
# Inspecting the most frequent app

print(google_apps[0])

for row_index, row in enumerate(google_apps):
    if row[0] == most_frequent_app:
        print(row_index,'\t', row)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
1654 	 ['ROBLOX', 'GAME', '4.5', '4447388', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
1702 	 ['ROBLOX', 'GAME', '4.5', '4447346', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
1749 	 ['ROBLOX', 'GAME', '4.5', '4448791', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
1842 	 ['ROBLOX', 'GAME', '4.5', '4449882', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
1871 	 ['ROBLOX', 'GAME', '4.5', '4449910', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.

Inspecting the rows of the duplicate we can notice that the only thing that really changes is 'Reviews' - which stores the number of total reviews. Since it can only grow with time, the most reasonable way to delete the duplicates would be to remove the old rows (i.e. keeping only rows with the highest number of reviews). To check if we remove rows correctly, calculate the expected number of rows that should be left:

In [10]:
expected_row_count = len(google_apps[1:]) - len(duplicate_apps)
print("Expected number of rows:", expected_row_count)

Expected number of rows: 9659


In [11]:
# map app to the max review count

review_counts = {}

for row in google_apps[1:]:
    app_name = row[0]
    current_count = int(row[3])
    if app_name in review_counts:
        if current_count > review_counts[app_name]:
            review_counts[app_name] = current_count
    else:
        review_counts[app_name] = current_count

In [12]:
# not include the rows where review count for a given app is lower than max
# note some rows are duplicates with same max reviews, thus keeiping track of already inserted apps

already_inserted = set()
google_apps_clean = google_apps[:1]

for row in google_apps[1:]:
    app_name = row[0]
    current_review_count = int(row[3])
    if (app_name not in already_inserted) and (current_review_count == review_counts[app_name]):
        google_apps_clean.append(row)
        already_inserted.add(app_name)

In [13]:
# check that we get the expected number of rows

expected_row_count == len(google_apps_clean[1:])

True

In [14]:
google_apps = google_apps_clean

### English locale

The easiest way to remove non-English apps is to look for non ascii symbols in the app names.

In [15]:
def is_ascii(a_string):
    for character in a_string:
        if ord(character) > 127:
            return False
    return True

assert is_ascii('Instagram')
assert not is_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播')

# but also notice that those two strings are non-ascii too
assert not is_ascii('Docs To Go™ Free Office Suite')
assert not is_ascii('Instachat 😜')

As shown in the tests above just judging by ascii is not good enough - many useful entries that contain non-ascii characters are actually English. To include those we can use a slightly better approach - mark as non-English only those strings containing at least 3 non-ascii characters in a row.

In [16]:
def is_eng(a_string):
    
    if len(a_string) < 3:
        return True
    
    for index, character in enumerate(a_string):
        # skip first and last indexes to avoid out of bound errors
        if (index == 0) or (index == len(a_string) - 1):
            continue
        if (ord(a_string[index - 1]) > 127) and (ord(character) > 127) and (ord(a_string[index + 1]) > 127):
            return False
        
    return True

assert is_eng('Instagram')
assert not is_eng('爱奇艺PPS -《欢乐颂2》电视剧热播')
assert is_eng('Docs To Go™ Free Office Suite')
assert is_eng('Instachat 😜')

In [17]:
# English Android Apps

en_google_apps = [app for app in google_apps[1:] if is_eng(app[0])]
print(len(en_google_apps))
print(en_google_apps[:3])

9615
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]


In [18]:
# English iOS Apps

en_apple_apps = [app for app in apple_apps[1:] if is_eng(app[1])]
print(len(en_apple_apps))
print(en_apple_apps[:3])

6167
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']]


### Free Apps Only

In [19]:
# For Android apps 'Price' column has index 7 but it also contains $ symbol
# find out if all rows contain the $ prefix
set([row[7][0] for row in en_google_apps])

{'$', '0'}

In [20]:
def price_to_float(a_strng):
    if '$' in a_strng:
        return float(a_strng[1:])
    else:
        return float(a_strng)

In [21]:
# we can simply omit the first simble when converting to float
google_apps_clean = [app for app in en_google_apps if price_to_float(app[7]) == 0.0]
len(google_apps_clean)

8865

In [22]:
# 'price' column has index 4
# check if it has only floats

set([row[4][0] for row in en_apple_apps])

{'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}

In [23]:
apple_apps_clean = [app for app in en_apple_apps if float(app[4]) == 0.0]
len(apple_apps_clean)

3208

In [24]:
google_app_columns = google_apps[0]
apple_app_columns = apple_apps[0]

### Data cleaning results

* `google_app_columns` - list with android app column names
* `google_apps_clean` - list of lists with clean adroid apps data set
* `apple_app_columns` - list with iOS app column namex
* `apple_apps_clean` - list of lists with clean iOS apps data set

## Analysis

The main creteria of profitability for the future app is how many users it has. To ensure cost effectiveness of the future app we want to follow this strategy:
1. build a minimal Android app
2. if the app has a good response from users - develop it further
3. if the app if profitable after 6 months - develop iOS version of it

According to this strategy what we should learn during our data analysis is:
* what are the most common genres for the mobile apps
* for the common jenres - what are the most popular apps

In [25]:
[i for i in enumerate(google_app_columns)]

[(0, 'App'),
 (1, 'Category'),
 (2, 'Rating'),
 (3, 'Reviews'),
 (4, 'Size'),
 (5, 'Installs'),
 (6, 'Type'),
 (7, 'Price'),
 (8, 'Content Rating'),
 (9, 'Genres'),
 (10, 'Last Updated'),
 (11, 'Current Ver'),
 (12, 'Android Ver')]

In [26]:
[i for i in enumerate(apple_app_columns)]

[(0, 'id'),
 (1, 'track_name'),
 (2, 'size_bytes'),
 (3, 'currency'),
 (4, 'price'),
 (5, 'rating_count_tot'),
 (6, 'rating_count_ver'),
 (7, 'user_rating'),
 (8, 'user_rating_ver'),
 (9, 'ver'),
 (10, 'cont_rating'),
 (11, 'prime_genre'),
 (12, 'sup_devices.num'),
 (13, 'ipadSc_urls.num'),
 (14, 'lang.num'),
 (15, 'vpp_lic')]

To find the most common app genres we will build frequency tables for:
* `Genres` and `Category` columns from the google apps data
* `prime_genre` from the apple apps data

In [27]:
# copied from course sources
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

### Task

Create a function named freq_table() that takes in two inputs: dataset (which is expected to be a list of lists) and index (which is expected to be an integer).

The function should return the frequency table (as a dictionary) for any column we want. The frequencies should also be expressed as percentages.

In [28]:
def freq_table(dataset, index):
    freq_table = {}
    
    # count ocurrencies
    for row in dataset:
        field_value = row[index]
        
        if field_value in freq_table:
            freq_table[field_value] += 1
        else:
            freq_table[field_value] = 1
            
    # convert to percentages
    total_apps = len(dataset)
    for key in freq_table:
        freq_table[key] = round((freq_table[key] / total_apps) * 100, 2)
    
    return freq_table

In [29]:
import math

def test_freq_table():
    mock_dataset = [
        ['three_quaters', 'second_column_one_half'],
        ['three_quaters', 'and the other column value'],
        ['three_quaters', 'another other column value'],
        ['one_quater', 'second_column_one_half'],
    ]
    
    assert math.isclose(75, freq_table(mock_dataset, 0)['three_quaters'])
    assert math.isclose(25, freq_table(mock_dataset, 0)['one_quater'])
    assert freq_table(mock_dataset, 0).get('should_not_exist', None) == None
    assert math.isclose(50, freq_table(mock_dataset, 1)['second_column_one_half'])
    
    
test_freq_table()

In [30]:
# Genres from google apps

display_table(google_apps_clean, 9)

Tools : 8.44
Entertainment : 6.07
Education : 5.36
Business : 4.59
Lifestyle : 3.9
Productivity : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.45
Personalization : 3.33
Communication : 3.24
Action : 3.09
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.24
Books & Reference : 2.17
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.41
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.92
Strategy : 0.91
House & Home : 0.81
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Trivia : 0.42
Casino : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual

In [31]:
# Category from goole apps

display_table(google_apps_clean, 1)

FAMILY : 18.92
GAME : 9.7
TOOLS : 8.45
BUSINESS : 4.59
LIFESTYLE : 3.91
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.38
PERSONALIZATION : 3.33
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.24
BOOKS_AND_REFERENCE : 2.17
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.41
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.92
HOUSE_AND_HOME : 0.81
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


In [32]:
# prime_genre from apple apps

display_table(apple_apps_clean, 11)

Games : 58.2
Entertainment : 7.89
Photo & Video : 4.99
Education : 3.68
Social Networking : 3.3
Shopping : 2.65
Utilities : 2.46
Sports : 2.15
Music : 2.06
Health & Fitness : 2.03
Productivity : 1.75
Lifestyle : 1.56
News : 1.34
Travel : 1.25
Finance : 1.09
Weather : 0.87
Food & Drink : 0.81
Reference : 0.53
Business : 0.53
Book : 0.37
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


### Apple Apps Analysis

From the free apps we see by far the most numerous is in the _Games_ ganre (58%), the closest follow up is _Entertainment_ with a little less than 8% and then _Photo and Video_ with almost 5%. The general feel is that the entertainment / leisure apps are more wide-spread in the app store than productivity / business apps (at least in the free sector). While it indicates that the game apps are very popular there, it also means that competetion should be quite stiff in this erea. An interesting further finding would be to see how many users the apps from these genre have.

### Google Apps Analysis

In the google apps there is no such clear outlier, especially in the `Genres` column (that might be due to the fact that several genres can be specified there, but we treated them as a single unit during the comparison, a better, although not so simple, way to found out true statistics by this column would be to split the strings by the separator (`;`) and somehow weight them. 

If we look at the `Category` column, where only one category may be specified for an app, it shows that _FAMILY_ is the most numerous one (with 18%), followed by _GAMES_ (9.7%) and _TOOLS_ (~8%) (tools, which are probably not entertainment oriented got quite a large share here in free Google apps). It's not very obvious though what is the _FAMILY_ category, browsing through examples of apps there it seems like games (especially for children, oriented on the child development or entertainment) are very common in it, so with a stretch we can count that category together with games which would sum up to almost 30% - a clear outlier again.

## Most popular apps

So far we have found what genres have more apps. That alone does not tell us how many users these apps have. To find it out we will calculate the average users count for the apps in a given genre. In google apps it is the `Intstalls` column (index 5). In apple apps there is unfortunately no such column, so we will use `rating_count_tot` as a proxy (index 5).

In [33]:
# to create frequency tables we can refactor the
# previously written code to have freq_table function
# to return the count instead of percentages and add another 
# function for percentages, but Python collections library already 
# has a class for it

from collections import Counter

### Most popular Apple apps

In [42]:
# get counts of apple store `prime_genre` (index 11)
apple_genres = [i[11] for i in apple_apps_clean]
apple_genres[:3]

['Social Networking', 'Photo & Video', 'Games']

In [43]:
apple_genres_counts = Counter(apple_genres)
apple_genres_counts.most_common(3)

[('Games', 1867), ('Entertainment', 253), ('Photo & Video', 160)]

In [48]:
# check the format of `rating_count_tot`
[i[5] for i in apple_apps_clean[:3]]

['2974676', '2161558', '2130805']

In [52]:
apple_average_genre_rating_counts = Counter()

for genre in apple_apps_genres_counts:
    rating_counts_total = 0
    
    for row in apple_apps_clean:
        current_genre = row[11]
        if current_genre == genre:
            rating_count = float(row[5])
            rating_counts_total += rating_count
    
    apple_average_genre_rating_counts[genre] = round(rating_counts_total / apple_apps_genres_counts[genre])
    
apple_average_genre_rating_counts.most_common(10)

[('Navigation', 86090),
 ('Reference', 79350),
 ('Social Networking', 71548),
 ('Music', 57327),
 ('Weather', 52280),
 ('Book', 46385),
 ('Food & Drink', 33334),
 ('Finance', 32367),
 ('Photo & Video', 28442),
 ('Travel', 28244)]

As we can see the most number of ratings i not in the entetainment of section, but rather more in utils. Navigation and Reference having the largest averages for rating count total. Reference apps are about religion, geography and the like data collections. Games do not make it in the first ten even. Among the most popular _Navigation_, _Reference_ and _Social Networking_ genres _Reference_ seem to be a good choice - not so large level of entry (but still need to specialize in a topic and be quite good at providing the information) and still has a lot of user response. 

### Most popular Google apps

In [53]:
# Indexes
# Category 1
# Installs 5

# checkout the format for Installs column
[i[5] for i in google_apps_clean[:3]]

['10,000+', '5,000,000+', '50,000,000+']

In [54]:
google_categories = [i[1] for i in google_apps_clean]
google_categories[:3]

['ART_AND_DESIGN', 'ART_AND_DESIGN', 'ART_AND_DESIGN']

In [57]:
google_categories_count = Counter(google_categories)
google_categories_count.most_common(10)

[('FAMILY', 1677),
 ('GAME', 860),
 ('TOOLS', 749),
 ('BUSINESS', 407),
 ('LIFESTYLE', 347),
 ('PRODUCTIVITY', 345),
 ('FINANCE', 328),
 ('MEDICAL', 313),
 ('SPORTS', 300),
 ('PERSONALIZATION', 295)]

In [58]:
google_average_installs = Counter()

for category in google_categories:
    total_installs = 0
    
    for row in google_apps_clean:
        current_category = row[1]
        if current_category == category:
            total_installs = float(row[5].replace(',', '').replace('+', ''))
    
    google_average_installs[category] = round(total_installs / google_categories_count[category])
    
google_average_installs.most_common(10)

[('LIBRARIES_AND_DEMO', 120482),
 ('ENTERTAINMENT', 117647),
 ('PHOTOGRAPHY', 38314),
 ('LIFESTYLE', 28818),
 ('SOCIAL', 21186),
 ('EDUCATION', 9709),
 ('SHOPPING', 5025),
 ('BEAUTY', 1887),
 ('WEATHER', 1408),
 ('DATING', 606)]

The most popular category turned out to be _LIBRARIES_AND_DEMO_ in which parts of apps exist (rather than the whole apps), because we are going to build a free app this category does not interest us. The next clear winner is the _ENTERTAINMENT_ which ssems to be very broad, but we can combine our knowledge of what we've learned while analysing apple apps data and find the first popular (and yet relatively simple to win audience) apple app in the entertainment category, which is __music__.