# Analysis of mobile app sales

The data files contain information on apps from the [Google Play](https://www.kaggle.com/lava18/google-play-store-apps) and [Apple](https://dq-content.s3.amazonaws.com/350/AppleStore.csv) app stores. The objective of this project is to identify the types of apps that generate the highest number of user downloads. To keep the analysis simple, only free apps are considered; the data on paid apps would need to be segmented separately. The results can help developers make better decisions about future free apps.

In [63]:
def explore_data(dataset, start, end, rows_and_columns=True):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds new line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

The app data are saved as lists of lists in the variables `apple` and `google`.  
Headers are saved separately as `apple_header` and `google_header`.

In [19]:
from csv import reader

# Apple file (the "with" block closes the file once it has been read)
with open('C:/Users/User/Documents/my_datasets/AppleStore.csv', encoding="utf8") as opened_file:
    apple = list(reader(opened_file))
apple_header = apple[0]
apple = apple[1:]

# Android file
with open('C:/Users/User/Documents/my_datasets/googleplaystore.csv', encoding="utf8") as opened_file:
    google = list(reader(opened_file))
google_header = google[0]
google = google[1:]
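
Both files are loaded with the same steps, so the pattern can be wrapped in a small helper. The sketch below is only an illustration: `load_csv` is a hypothetical name that does not appear in the notebook, and the demo reads an in-memory sample instead of the real files.

```python
import csv
import io

def load_csv(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# Demo on an in-memory sample; with a real file one would write
#   with open(path, encoding="utf8", newline="") as f:
#       header, rows = load_csv(f)
sample = io.StringIO(
    "App,Category\n"
    "Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN\n"
    "Sketch - Draw & Paint,ART_AND_DESIGN\n"
)
header, rows = load_csv(sample)
print(header)
print('Number of rows:', len(rows))
```

Passing an open file object rather than a path keeps the helper easy to try out without touching the disk.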

In [64]:
#Exploring the data from the apple store.
explore_data(apple,0,3)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


|Column name (index)|Description (Apple-store)|
|---|---|
| "id" (0) | App ID |
| "track_name" (1) | App Name |
| "size_bytes" (2) | Size (in Bytes)|
| "currency" (3) | Currency Type|
| "price" (4)| Price amount|
| "rating_count_tot" (5)| User rating count (for all versions)|
|"rating_count_ver" (6)| User rating count (for current version)|
|"user_rating" (7)| Average user rating value (for all versions)|
|"user_rating_ver" (8)| Average user rating value (for current version)|
|"ver" (9)| Latest version code|
|"cont_rating" (10)| Content Rating|
|"prime_genre" (11) | Primary Genre|
|"sup_devices.num" (12)| Number of supporting devices|
|"ipadSc_urls.num" (13)| Number of screenshots shown for display|
|"lang.num" (14)| Number of supported languages|
|"vpp_lic" (15)| Vpp Device Based Licensing Enabled|

In [27]:
#Exploring data from the google store.
explore_data(google,0,3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


|Column name (index)|Description (Google-store)|
|---|---|
|'App' (0)| Name of app|
|'Category' (1)| App category|
|'Rating' (2)| App rating (total)|
|'Reviews' (3)| Count of total reviews|
|'Size' (4)| App size (MB)|
|'Installs' (5)| Total number of installs|
|'Type' (6)| Paid vs. Free|
|'Price' (7)| Price of app|
|'Content Rating' (8)| Targeted age group|
|'Genres' (9)| Complete list of genres|
|'Last Updated' (10)| Date of last update|
|'Current Ver' (11)| Current version number|
|'Android Ver' (12)| Required android version|

## Data cleaning of apple and google store data

**Cleaning backlog:**
* False data
* Duplicate data
* Non-English app names
* Isolate free from paid apps

In [35]:
# Ratings must be between 0-5. Search for errors in google data.
error_index = []
for index in range(len(google)):
    rating = float(google[index][2])
    if rating < 0 or rating > 5:
        error_index.append(index)
print(error_index)

[10472]
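
An out-of-range rating can be a symptom of a row with a missing field, which shifts every later value one column to the left. Whether that is the cause here is an assumption and is not confirmed by the output above. A minimal sketch of a complementary check compares each row's length with the header's; `find_malformed_rows` and the second sample row are hypothetical.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose field count differs from the header's."""
    expected = len(header)
    return [index for index, row in enumerate(rows) if len(row) != expected]

header = ['App', 'Category', 'Rating']
rows = [
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Hypothetical app', '19'],  # made-up row with one field missing
]
malformed = find_malformed_rows(header, rows)
print(malformed)
```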


In [40]:
# Delete erroneous row
del google[10472]

In [39]:
# Ratings must be between 0-5. Search for errors in apple data.
error_index1 = []
error_index2 = []

for index in range(len(apple)):
    rating = float(apple[index][7])
    if rating < 0 or rating > 5:
        error_index1.append(index)
        
    rating = float(apple[index][8])
    if rating < 0 or rating > 5:
        error_index2.append(index)
        
print(error_index1, error_index2)

[] []


In [50]:
# Create list of unique app names and dictionary of duplicate frequency counts

#Apple store data 
unique_apple = []
duplicates_apple = {}

for row in apple:
    app = row[1]
    if app in unique_apple:
        if app in duplicates_apple:
            duplicates_apple[app] += 1
        else: duplicates_apple[app] = 2 # first count in unique
    else: unique_apple.append(app)
        
#Google store data
unique_google = []
duplicates_google = {}

for row in google:
    app = row[0]
    if app in unique_google:
        if app in duplicates_google:
            duplicates_google[app] += 1
        else: duplicates_google[app] = 2 # first count in unique
    else: unique_google.append(app)
        
#Print information on duplicates
print('Apple store: \n', 'Number of duplicated apps: ', len(duplicates_apple))
print('Duplicated apps: ', duplicates_apple, '\n')
print('Google store: \n', 'Number of duplicated apps: ', len(duplicates_google))
print('Duplicated apps: ', duplicates_google)


Apple store: 
 Number of duplicated apps:  2
Duplicated apps:  {'Mannequin Challenge': 2, 'VR Roller Coaster': 2} 

Google store: 
 Number of duplicated apps:  798
Duplicated apps:  {'Quick PDF Scanner + OCR FREE': 3, 'Box': 3, 'Google My Business': 3, 'ZOOM Cloud Meetings': 2, 'join.me - Simple Meetings': 3, 'Zenefits': 2, 'Google Ads': 3, 'Slack': 3, 'FreshBooks Classic': 2, 'Insightly CRM': 2, 'QuickBooks Accounting: Invoicing & Expenses': 3, 'HipChat - Chat Built for Teams': 2, 'Xero Accounting Software': 2, 'MailChimp - Email, Marketing Automation': 2, 'Crew - Free Messaging and Scheduling': 2, 'Asana: organize team projects': 2, 'Google Analytics': 2, 'AdWords Express': 2, 'Accounting App - Zoho Books': 2, 'Invoice & Time Tracking - Zoho': 2, 'Invoice 2go — Professional Invoices and Estimates': 2, 'SignEasy | Sign and Fill PDF and other Documents': 2, 'Genius Scan - PDF Scanner': 2, 'Tiny Scanner - PDF Scanner App': 2, 'Fast Scanner : Free PDF Scan': 2, 'Mobile Doc Scanner (MDSca

In [59]:
# Inspect duplicate data in original file

# 'VR Roller Coaster' and 'Mannequin Challenge' apps in apple store
print('Apple VR Roller Coaster app duplicates:')
for row in apple:
    if row[1] == 'VR Roller Coaster' or row[1] == 'Mannequin Challenge':
        print(row)
print('\n')

# 'Subway Surfers' app in google store 
print('Google Subway Surfers app duplicates:')
for row in google:
    if row[0] == 'Subway Surfers':
        print(row)

Apple VR Roller Coaster app duplicates:
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


Google Subway Surfers app duplicates:
['Subway Surfers', 'GAME', '4.5', '27722264', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']
['Subway Surfers', 'GAME', '4.5', '27723193', '76M', '1,000,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up']
['Subway Surfers', 'GAME', '4.5', '27724094', '76M', '1,000,000,000+', 'Free', 

In the google store data, the number of reviews varies between the duplicates. This suggests different versions, which is supported by the "VR Roller Coaster" duplicates in the apple store data: their 10th entry (version) and 3rd entry (size in bytes) differ. For the google data, the appropriate solution is to keep the entry with the highest number of reviews, which should correspond to the latest version. Since the apple data has only two duplicated apps, the outdated entries will be deleted manually.

In [60]:
# Dictionary with unique google app names and corresponding highest number of reviews (latest version)
reviews_max = {} # key: app name; value: highest number of reviews
for row in google:
    name = row[0] # app name
    n_reviews = float(row[3]) # number of reviews in current app object
    
    if name not in reviews_max:
        reviews_max[name] = n_reviews
    elif reviews_max[name] < n_reviews: # number of reviews in dictionary less than in current duplicate
        reviews_max[name] = n_reviews
print('Unique google app entries: ', len(reviews_max))

Unique google app entries:  9659


In [79]:
# Check whether app duplicates exist that have the same maximum number of reviews

max_duplicate_count = {} # app name: number of reviews_max duplicates
for row in google:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name]:
        if name in max_duplicate_count:
            max_duplicate_count[name] += 1
        else: max_duplicate_count[name] = 1
            
count = 0 # number of duplicate apps with same reviews_max value
for key in max_duplicate_count:
    if max_duplicate_count[key] > 1:
        count += 1
print('Number of duplications same maximum number of review: ', count)

Number of duplications same maximum number of review:  336


A total of 336 apps have more than one row with the same maximum number of reviews. This implies that `n_reviews` cannot be the only criterion for removing duplicates. We also need to keep a `name_record` of apps that have already been added.

In [70]:
google_clean = [] # removed duplicates
name_record = [] # list of apps already added

for row in google:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in name_record: # add only if n_reviews is the maximum number and the app hasn't been previously added
        google_clean.append(row)
        name_record.append(name)
        
explore_data(google_clean,0,3) 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


As expected, the number of rows is 9659, which matches the number of unique app names previously identified in the google app store data.
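
The same result can be reached in a single pass by storing the best row per app name in a dictionary, which avoids the repeated `name not in name_record` list scans. This is a sketch under assumptions rather than the notebook's method: `dedupe_by_max_reviews` is a hypothetical helper, and the sample rows are shortened to four fields.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, per app name, the first row that has the highest review count."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_index]
        # Strict '>' keeps the earliest row when review counts tie
        if name not in best or float(row[reviews_index]) > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Subway Surfers', 'GAME', '4.5', '27722264'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644'],
    ['Subway Surfers', 'GAME', '4.5', '27723193'],
    ['Subway Surfers', 'GAME', '4.5', '27724094'],
    ['Subway Surfers', 'GAME', '4.5', '27724094'],  # tie on the maximum
]
clean = dedupe_by_max_reviews(sample)
print(len(clean))
print(clean[0])
```

The output order follows the first appearance of each app name, so it can differ from the order produced by the two-pass approach even though the same rows are kept.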

In [83]:
# Manually clean apple data
# Keep version 1.4 of 'Mannequin Challenge' and version 2.0.0 of 'VR Roller Coaster'
remove_index = [] # indexes to remove from apple data
for index in range(len(apple)):
    if apple[index][1] == 'Mannequin Challenge' and apple[index][9] != '1.4': # not newest version (1.4)
        remove_index.append(index) # record index of old version
    if apple[index][1] == 'VR Roller Coaster' and apple[index][9] != '2.0.0': # not newest version (2.0.0)
        remove_index.append(index) # record index of old version

print('Number of old versions: ', len(remove_index)) # 2 expected prior to deletion
for index in sorted(remove_index, reverse=True): # delete from the end so the remaining indexes stay valid
    del apple[index]

2 expected old versions:  0
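
Deleting several rows by index is sensitive to the order of deletion, because every `del` shifts the rows that follow it. The toy example below uses single letters as stand-ins for app rows to show the difference between ascending and descending deletion.

```python
rows = ['a', 'b', 'c', 'd', 'e']  # stand-ins for app rows
remove_index = [1, 2]             # intended: drop 'b' and 'c'

# Ascending order: once 'b' is gone, 'c' moves to index 1,
# so index 2 now points at 'd' and the wrong row is deleted.
ascending = list(rows)
for index in remove_index:
    del ascending[index]
print(ascending)   # ['a', 'c', 'e']

# Descending order: deleting from the end leaves earlier indexes untouched.
descending = list(rows)
for index in sorted(remove_index, reverse=True):
    del descending[index]
print(descending)  # ['a', 'd', 'e']
```

Building a new list that skips the unwanted indexes is an equally valid alternative and avoids in-place deletion altogether.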


In [96]:
def is_ascii(string): # used to identify non-english apps
    '''Returns True if all characters in "string" correspond to the ASCII system (0-127), otherwise False'''
    for character in string:
        order = ord(character) # number order of string character
        if order > 127: # ASCII characters are in range 0-127
            return False # if non-ASCII character found return False
    return True # if all characters in ASCII return True

Previous data exploration revealed that some English app names contain non-ASCII characters such as emojis. The `is_ascii` logic is therefore rewritten as `count_non_ascii`, which returns the number of non-ASCII characters in a name. These counts serve to determine a suitable cut-off value for non-English app names.

In [119]:
def count_non_ascii(string): 
    '''Returns corresponding count of non-ASCII characters'''
    count = 0 # number of non-ASCII characters
    for character in string: 
        order = ord(character) # number order of string character
        if order > 127: # ASCII characters are in range 0-127
            count += 1 # if non-ASCII character found increase count
    return count 

In [132]:
non_ascii_google = [] # number of non-ASCII characters at each index of google data
for row in google_clean:
    non_ascii_google.append(count_non_ascii(row[0])) # index 0 = app name
    
non_ascii_apple = [] # number of non-ASCII characters at each index of apple data
for row in apple:
    non_ascii_apple.append(count_non_ascii(row[0])) # NOTE: index 0 is the numeric app ID in the apple data; app names ("track_name") are at index 1

In [133]:
print("Total non-ASCII characters in google data: ", sum(non_ascii_google))
print("Total non-ASCII characters in apple data: ", sum(non_ascii_apple))

Total non-ASCII characters in google data:  1045
Total non-ASCII characters in apple data:  0


In [134]:
for index in range(len(non_ascii_google)):
    if non_ascii_google[index] > 0:
        print(google_clean[index][0], ': ', non_ascii_google[index])

U Launcher Lite – FREE Live Cool Themes, Hide Apps :  1
CarMax – Cars for Sale: Search Used Car Inventory :  1
AutoScout24 Switzerland – Find your new car :  1
Zona Azul Digital Fácil SP CET - OFFICIAL São Paulo :  2
ReadEra – free ebook reader :  1
Docs To Go™ Free Office Suite :  1
USPS MOBILE® :  1
Invoice 2go — Professional Invoices and Estimates :  1
Röhrich Werner Soundboard :  1
Manga Net – Best Online Manga Reader :  1
Truyện Vui Tý Quậy :  3
Comic Es - Shojo manga / love comics free of charge ♪ ♪ :  2
漫咖 Comics - Manga,Novel and Stories :  2
Tapas – Comics, Novels, and Stories :  1
【Ranobbe complete free】 Novelba - Free app that you can read and write novels :  2
Call Free – Free Call :  1
Xperia Link™ :  1
Messenger – Text and Video Chat for Free :  1
Dolphin Browser - Fast, Private & Adblock🐬 :  1
Sync.ME – Caller ID & Block :  1
myMail – Email for Hotmail, Gmail and Outlook Mail :  1
Vonage Mobile® Call Video Text :  1
Match™ Dating - Meet Singles :  1
Find Real Love — YouL

From the above list it seems that app names with fewer than 4 non-ASCII characters can still be considered English. This cut-off is used to turn the `is_ascii` function into `is_english`.

In [130]:
def is_english(string): # used to identify non-english apps
    '''Returns True if "string" contains fewer than 4 non-ASCII characters (code points above 127), otherwise False'''
    count = 0 # number of non-ASCII characters
    for character in string: 
        order = ord(character) # number order of string character
        if order > 127: # ASCII characters are in range 0-127
            count += 1 # if non-ASCII character found increase count
    return count < 4 # up to 3 non-ASCII characters are tolerated

In [136]:
google_clean_en = []
for row in google_clean:
    name = row[0]
    if is_english(name):
        google_clean_en.append(row)
print('Number of unique, accurate, english apps: ', len(google_clean_en))

Number of unique, accurate, english apps:  9614
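
To judge how sensitive the filter is to the chosen cut-off, one could tally how many names fall at each non-ASCII count. The sketch below is an assumption-laden illustration: it redefines `count_non_ascii` compactly so the block is self-contained, uses a handful of names from the outputs above, and adds one made-up non-English name.

```python
from collections import Counter

def count_non_ascii(string):
    """Return the number of characters with a code point above 127."""
    return sum(1 for character in string if ord(character) > 127)

names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Dolphin Browser - Fast, Private & Adblock🐬',
    '漫咖 Comics - Manga,Novel and Stories',
    'Привет мир',  # hypothetical non-English name, not from the dataset
]

# Map each non-ASCII count to the number of names that have it
distribution = Counter(count_non_ascii(name) for name in names)
for n_chars in sorted(distribution):
    print(n_chars, 'non-ASCII characters:', distribution[n_chars], 'name(s)')
```

A gap in the distribution, such as many names with up to 3 non-ASCII characters and few just above, would support the chosen threshold.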


In [138]:
google_free = [] # clean data on free apps in the google store
for app in google_clean_en:
    if app[6] == "Free": # append only if type is free
        google_free.append(app)

apple_free = [] # clean data on free app in the apple store
for app in apple:
    price = float(app[4])
    if price == 0:
        apple_free.append(app)
        
print("Number of free apps in google data: ", len(google_free))
print("Number of free apps in apple data: ", len(apple_free))

Number of free apps in google data:  8863
Number of free apps in apple data:  4054


## Data analysis

The objective is to find app profiles that correlate with high download numbers. The information can be used by developers to maximize revenue on future app projects. Both the google and apple markets are important. Profiles should, therefore, be successful in both markets.

In [150]:
def freq_table(dataset, index):
    """Returns a dictionary of frequencies"""
    freq = {}
    for data in dataset:
        element = data[index]
        if element not in freq: # add new value
            freq[element] = 1
        else: freq[element] += 1 # increase count of already registered value
    return freq

In [141]:
def display_table(dataset, index):
    '''Prints a sorted frequency table of elements at the given "index"'''
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple) # transform dictionary into list of tuples for effective sorting
        
    table_sorted = sorted(table_display, reverse = True) # Sort tuples according to frequencies
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
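
Raw counts are hard to compare between two stores of different size, so shares can be easier to read. The following sketch builds a percentage table with `collections.Counter`; `percentage_table` is a hypothetical helper, and the sample rows keep only the app name and genre.

```python
from collections import Counter

def percentage_table(dataset, index):
    """Return (value, percent) pairs sorted by descending share."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, round(100 * n / total, 2)) for value, n in counts.most_common()]

sample = [
    ['Facebook', 'Social Networking'],
    ['Instagram', 'Photo & Video'],
    ['Clash of Clans', 'Games'],
    ['Mannequin Challenge', 'Games'],
]
table = percentage_table(sample, 1)
for value, percent in table:
    print(value, ':', percent)
```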

In [155]:
# Frequency tables for app genres/categories
print('Google categories:')
display_table(google_free, 1)
print('\nGoogle genres:')
display_table(google_free, 9)


Google categories:
FAMILY : 1675
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53

Google genres:
Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Play

From the google data it seems that **entertainment-related apps** are the most frequent in the app store. This is most apparent from the categories, which are dominated by `FAMILY` and `GAME`.

In [157]:
print('Apple genres:')
display_table(apple_free, 11)

Apple genres:
Games : 2255
Entertainment : 334
Photo & Video : 167
Social Networking : 143
Education : 132
Shopping : 121
Utilities : 109
Lifestyle : 94
Finance : 84
Sports : 79
Health & Fitness : 76
Music : 67
Book : 66
Productivity : 62
News : 58
Travel : 56
Food & Drink : 43
Weather : 31
Reference : 20
Navigation : 20
Business : 20
Catalogs : 9
Medical : 8


Similar to the google data, the apple store appears to be dominated by entertainment apps.  
High genre frequencies, however, do not necessarily imply high revenue or download numbers; further analysis is required.  
The average number of ratings for each genre will be computed to estimate the number of active users. This gives insight into which app genres are likely to have the highest download numbers.

### Most popular apps by genre in the Apple store

In [161]:
# Table of genres and corresponding average number of reviews
apple_genre_frequencies = freq_table(apple_free, 11)
for genre in apple_genre_frequencies:
    sum_ratings = 0 # sum of ratings for each genre
    len_genre = apple_genre_frequencies[genre] # number of apps in genre
    for app in apple_free:
        genre_app = app[11]
        if genre_app == genre: 
            rating_count = float(app[5]) # number of user ratings
            sum_ratings += rating_count # sum number for ratings for current category in loop
    mean = round(sum_ratings/len_genre)
    print(genre, ': ', mean)

Social Networking :  53078
Photo & Video :  27250
Games :  18941
Music :  56482
Reference :  67448
Health & Fitness :  19952
Weather :  47221
Utilities :  14010
Travel :  20216
Shopping :  18747
News :  15893
Navigation :  25972
Lifestyle :  8978
Entertainment :  10823
Food & Drink :  20179
Sports :  20129
Book :  8498
Finance :  13522
Education :  6266
Productivity :  19054
Business :  6368
Catalogs :  1780
Medical :  460


Genres with the largest average number of ratings, in descending order: `Reference`, `Music`, `Social Networking`, `Weather`.  
Next, we should look into the genres at higher resolution to learn more about skewness and similar properties of the distributions.
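
One way to check for skewness is to compare the mean with the median and the maximum per genre: a mean far above the median indicates that a few very large apps dominate the average. The sketch below assumes a hypothetical helper `rating_summary` and uses four rows from the outputs above, reduced to name, rating count and genre.

```python
from statistics import mean, median

def rating_summary(dataset, genre_index, count_index):
    """Return {genre: (mean, median, max)} of the rating counts."""
    by_genre = {}
    for row in dataset:
        by_genre.setdefault(row[genre_index], []).append(float(row[count_index]))
    return {
        genre: (mean(counts), median(counts), max(counts))
        for genre, counts in by_genre.items()
    }

sample = [
    ['Facebook', '2974676', 'Social Networking'],
    ['Clash of Clans', '2130805', 'Games'],
    ['Mannequin Challenge', '668', 'Games'],
    ['VR Roller Coaster', '107', 'Games'],
]
summary = rating_summary(sample, genre_index=2, count_index=1)
for genre, (avg, med, top) in summary.items():
    print(genre, ': mean', round(avg), '| median', round(med), '| max', round(top))
```

In this small sample the `Games` mean is driven almost entirely by a single app, which is the kind of distortion a per-genre inspection is meant to reveal.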

### Most popular apps by genre in Google store

In [165]:
# Table of genres and corresponding average number of reviews
google_genre_frequencies = freq_table(google_free, 1)
for genre in google_genre_frequencies:
    sum_ratings = 0 # sum of ratings for each genre
    len_genre = google_genre_frequencies[genre] # number of apps in genre
    for app in google_free:
        genre_app = app[1] # index 1 = Category
        if genre_app == genre: 
            rating_count = float(app[3]) # index 3 = number of total reviews
            sum_ratings += rating_count # sum number for ratings for current category in loop
    mean = round(sum_ratings/len_genre)
    print(genre, ': ', mean)

ART_AND_DESIGN :  24699
AUTO_AND_VEHICLES :  14140
BEAUTY :  7476
BOOKS_AND_REFERENCE :  87995
BUSINESS :  24240
COMICS :  42586
COMMUNICATION :  995608
DATING :  21953
EDUCATION :  56293
ENTERTAINMENT :  301752
EVENTS :  2556
FINANCE :  38536
FOOD_AND_DRINK :  57479
HEALTH_AND_FITNESS :  78095
HOUSE_AND_HOME :  26435
LIBRARIES_AND_DEMO :  10926
LIFESTYLE :  33922
GAME :  683524
FAMILY :  113211
MEDICAL :  3730
SOCIAL :  965831
SHOPPING :  223887
PHOTOGRAPHY :  404081
SPORTS :  116939
TRAVEL_AND_LOCAL :  129484
TOOLS :  305733
PERSONALIZATION :  181122
PRODUCTIVITY :  160635
PARENTING :  16379
WEATHER :  171251
VIDEO_PLAYERS :  425350
NEWS_AND_MAGAZINES :  93088
MAPS_AND_NAVIGATION :  142860
