# Profitable App Profiles for the App Store and Google Play Markets

This is a learning exercise from the [Data Analyst in Python - Dataquest course](https://www.dataquest.io/path/data-analyst/). In this exercise we imagine that we are a mobile app development company. Our aim is to help our developers understand what types of apps are likely to attract more users on Google Play and the App Store. Our only tool for this is pure Python.

## Data sources 

* [Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps)
* [Apple iOS App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

In [1]:
# row printer function

def explore_data(dataset, start=0, end=5, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset) - 1)
        print('Number of columns:', len(dataset[0]))
        print('\n')

In [2]:
from csv import reader

with open('googleplaystore.csv') as google_file, open('AppleStore.csv') as apple_file:
    google_apps = list(reader(google_file))
    apple_apps = list(reader(apple_file))

In [22]:
# read through data headers

explore_data(google_apps, end=3, rows_and_columns=True)
explore_data(apple_apps, end=3, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10840
Number of columns: 13


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558

## Data cleaning

At our company, we only build apps that are free to download and install, and that are directed toward an English-speaking audience. 

Tasks for cleaning:

- [x] find errors in data (delete if any), e.g. there is a [reported error](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)
- [x] remove non-English apps
- [x] remove non-free apps

In [4]:
# finding rows with errors

# "this entry has missing 'Rating' and a column shift happened for next columns.."
# 10472 Life Made WI-Fi Touchscreen Photo Frame 1.9 19.0 3.0M 1,000+ Free 0 Everyone NaN February 11, 2018 1.0.19 4.0 and up NaN

reported_error_index = 10472  # index as reported, not counting the header row
explore_data(google_apps, reported_error_index + 1, reported_error_index + 2)

In [5]:
del google_apps[reported_error_index + 1]  # 10473 once the header row is counted; run this cell only once

### Duplicates

There is also [another report for iOS apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409) which says that the apps named "Mannequin Challenge" and "VR Roller Coaster" each appear twice in the dataset. Although each name does appear twice, the values in the columns `id`, `size_bytes` and `rating_count_tot` are so different that these are probably just different apps with the same name. Also, [a search on the App Store page](https://www.apple.com/us/search/Mannequin-Challenge?src=globalnav) returns many results with this name.
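
One way to check a report like this is to count how often each name occurs and then compare the rows that share a name. Below is a minimal sketch, assuming rows shaped like the App Store data (`id` at index 0, `track_name` at index 1). The helper `find_repeated_names` is hypothetical, and `sample_rows` holds made-up placeholder values, not values from the dataset.

```python
from collections import Counter

def find_repeated_names(rows, name_index=1):
    """Return {name: [rows...]} for every name that occurs more than once."""
    counts = Counter(row[name_index] for row in rows)
    repeated = {}
    for row in rows:
        name = row[name_index]
        if counts[name] > 1:
            repeated.setdefault(name, []).append(row)
    return repeated

# Made-up placeholder rows: [id, track_name, size_bytes]
sample_rows = [
    ['1', 'Some Game', '100'],
    ['2', 'Another App', '200'],
    ['3', 'Some Game', '999'],
]

# Print the ids of the rows that share a name, so they can be compared by hand
for name, rows in find_repeated_names(sample_rows).items():
    print(name, [row[0] for row in rows])
```

If rows that share a name differ in `id` and size, as in the placeholder data, they are more likely separate apps than true duplicates.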

Some duplicates are indeed found in the Google Play dataset:

In [6]:
# Find the number of duplicate android apps

unique_apps = []
duplicate_apps = []

for row in google_apps[1:]:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)

print("Number of duplicate apps:", len(duplicate_apps))
print("Examples:", duplicate_apps[:15])

Number of duplicate apps: 1181
Examples: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [7]:
# Finding how many unique names there are in duplicates

unique_duplicate_names = set(duplicate_apps)
len(unique_duplicate_names)

798

In [8]:
# Since 798 < 1181, some apps have not only duplicates but triplicates, etc
# Let's find the most frequent duplicates

frequency = {}

for app in duplicate_apps:
    if app in frequency:
        frequency[app] += 1
    else:
        frequency[app] = 1
        
max_count = 0
most_frequent_app = ''
for app, count in frequency.items():
    if count > max_count:
        max_count = count
        most_frequent_app = app

In [9]:
# Inspecting the most frequent app

print(google_apps[0])

for row_index, row in enumerate(google_apps):
    if row[0] == most_frequent_app:
        print(row_index,'\t', row)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
1654 	 ['ROBLOX', 'GAME', '4.5', '4447388', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
1702 	 ['ROBLOX', 'GAME', '4.5', '4447346', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
1749 	 ['ROBLOX', 'GAME', '4.5', '4448791', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
1842 	 ['ROBLOX', 'GAME', '4.5', '4449882', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
1871 	 ['ROBLOX', 'GAME', '4.5', '4449910', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.

Inspecting the duplicate rows, we can see that the only value that really changes is 'Reviews', which stores the total number of reviews. Since that number can only grow over time, the most reasonable way to remove the duplicates is to drop the older rows, i.e. keep only the row with the highest number of reviews for each app. To check that we remove rows correctly, we first calculate the expected number of rows that should be left:

In [10]:
expected_row_count = len(google_apps[1:]) - len(duplicate_apps)
print("Expected number of rows:", expected_row_count)

Expected number of rows: 9659


In [11]:
# map app to the max review count

review_counts = {}

for row in google_apps[1:]:
    app_name = row[0]
    current_count = int(row[3])
    if app_name in review_counts:
        if current_count > review_counts[app_name]:
            review_counts[app_name] = current_count
    else:
        review_counts[app_name] = current_count

In [12]:
# skip the rows where the review count for a given app is lower than the max
# note: some duplicate rows share the same max review count, so we keep track of already inserted apps

already_inserted = set()
google_apps_clean = google_apps[:1]

for row in google_apps[1:]:
    app_name = row[0]
    current_review_count = int(row[3])
    if (app_name not in already_inserted) and (current_review_count == review_counts[app_name]):
        google_apps_clean.append(row)
        already_inserted.add(app_name)

In [13]:
# check that we get the expected number of rows

expected_row_count == len(google_apps_clean[1:])

True

In [28]:
google_apps = google_apps_clean
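
The same "keep the row with the most reviews" rule can also be written as a single pass over the data. This is only a sketch of an alternative, not the approach used in the cells above. The helper name `keep_max_reviews` is hypothetical and `sample_rows` contains made-up placeholder values. It assumes the app name is at index 0 and the review count at index 3, as in the Google Play data.

```python
def keep_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_index]
        # Strict '>' means that on a tie the first row seen is kept
        if name not in best or int(row[reviews_index]) > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up placeholder rows: only the name and review columns matter here
sample_rows = [
    ['App A', 'X', '4.0', '10'],
    ['App B', 'X', '4.0', '7'],
    ['App A', 'X', '4.0', '25'],
    ['App A', 'X', '4.0', '25'],
]

deduplicated = keep_max_reviews(sample_rows)
print(len(deduplicated))
```

Note that the result is ordered by the first appearance of each name, so the row order can differ from the two-pass version above even though the same rows are kept.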

### English locale

The easiest way to remove non-English apps is to look for non-ASCII characters in the app names.

In [17]:
def is_ascii(a_string):
    for character in a_string:
        if ord(character) > 127:
            return False
    return True

assert is_ascii('Instagram')
assert not is_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播')
assert is_ascii('Docs To Go™ Free Office Suite')
assert is_ascii('Instachat 😜')

AssertionError: 

As the failing assertion above shows, checking for ASCII alone is not good enough: many entries that contain non-ASCII characters are actually English. To keep those, we can use a slightly better approach and mark as non-English only those strings that contain at least 3 non-ASCII characters in a row.

In [21]:
def is_eng(a_string):
    
    if len(a_string) < 3:
        return True
    
    for index, character in enumerate(a_string):
        # skip first and last indexes to avoid out of bound errors
        if (index == 0) or (index == len(a_string) - 1):
            continue
        if (ord(a_string[index - 1]) > 127) and (ord(character) > 127) and (ord(a_string[index + 1]) > 127):
            return False
        
    return True

assert is_eng('Instagram')
assert not is_eng('爱奇艺PPS -《欢乐颂2》电视剧热播')
assert is_eng('Docs To Go™ Free Office Suite')
assert is_eng('Instachat 😜')
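
The "3 non-ASCII characters in a row" rule can also be expressed as a regular expression. The sketch below is an alternative formulation under that same rule, not a replacement for `is_eng`. The names `NON_ASCII_RUN` and `is_eng_regex` are hypothetical.

```python
import re

# Three consecutive characters outside the ASCII range (code points above 127)
NON_ASCII_RUN = re.compile(r'[^\x00-\x7f]{3}')

def is_eng_regex(a_string):
    """Return False if the string has 3 or more non-ASCII characters in a row."""
    return NON_ASCII_RUN.search(a_string) is None

# Same checks as for is_eng above
assert is_eng_regex('Instagram')
assert not is_eng_regex('爱奇艺PPS -《欢乐颂2》电视剧热播')
assert is_eng_regex('Docs To Go™ Free Office Suite')
assert is_eng_regex('Instachat 😜')
```

Strings shorter than 3 characters can never match the pattern, so they are treated as English here as well.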

In [37]:
# English Android Apps

en_google_apps = [app for app in google_apps[1:] if is_eng(app[0])]
print(len(en_google_apps))
print(en_google_apps[:3])

9615
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]


In [38]:
# English iOS Apps

en_apple_apps = [app for app in apple_apps[1:] if is_eng(app[1])]
print(len(en_apple_apps))
print(en_apple_apps[:3])

6167
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']]


### Free Apps Only

In [39]:
# For Android apps the 'Price' column has index 7, but it may also contain a $ symbol
# check which first characters occur in that column
set([row[7][0] for row in en_google_apps])

{'$', '0'}

In [46]:
def price_to_float(a_string):
    # prices carry a leading $ symbol; the rest of the string is the number
    if '$' in a_string:
        return float(a_string[1:])
    else:
        return float(a_string)

In [47]:
# price_to_float drops the leading $ symbol before converting to float
google_apps_clean = [app for app in en_google_apps if price_to_float(app[7]) == 0.0]
len(google_apps_clean)

8865

In [41]:
# 'price' column has index 4
# check if it has only floats

set([row[4][0] for row in en_apple_apps])

{'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}

In [50]:
apple_apps_clean = [app for app in en_apple_apps if float(app[4]) == 0.0]
len(apple_apps_clean)

3208

In [51]:
google_app_columns = google_apps[0]
apple_app_columns = apple_apps[0]

### Data cleaning results

* `google_app_columns` - list with Android app column names
* `google_apps_clean` - list of lists with the clean Android apps data set
* `apple_app_columns` - list with iOS app column names
* `apple_apps_clean` - list of lists with the clean iOS apps data set
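
As a small usage sketch, the column-name lists can be used to look up values by name instead of by a hard-coded index. The header and the row below are copied from the Google Play output shown earlier so that the snippet is self-contained. The names `sample_row`, `category_index` and `record` are introduced here for illustration only.

```python
# Header and first data row, copied from the Google Play output earlier in this notebook
google_app_columns = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
                      'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
                      'Android Ver']
sample_row = ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1',
              '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design',
              'January 7, 2018', '1.0.0', '4.0.3 and up']

# Look up a column index by name
category_index = google_app_columns.index('Category')
print(sample_row[category_index])

# Or pair every column name with its value for one row
record = dict(zip(google_app_columns, sample_row))
print(record['Installs'])
```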