## Analysis of AppStore and PlayStore Data

Profitable App Profiles for the App Store and Google Play Markets

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

In [94]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### Loading the data sets and first exploration

In [95]:
from csv import reader

# Import the App Store data set
f_1 = open("additional_files/AppleStore.csv",encoding="utf8")
f_1_read = reader(f_1)
ios_data = list(f_1_read)
ios_header = ios_data[0]
ios_data = ios_data[1:]

# Import the Google Play data set
f_2 = open("additional_files/googleplaystore.csv",encoding="utf8")
f_2_read = reader(f_2)
android_data = list(f_2_read)
android_header = android_data[0]
android_data = android_data[1:]
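
The two imports above follow the same pattern and leave the files open. As a minimal sketch, assuming a helper named `load_csv` (a hypothetical name, not part of the original notebook), the pattern could be wrapped in a function that closes the file automatically:

```python
import csv


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file.

    `load_csv` is a hypothetical helper; the `with` block closes the
    file as soon as all rows have been read.
    """
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]
```

With such a helper, a call like `ios_header, ios_data = load_csv("additional_files/AppleStore.csv")` would yield the same two variables as the cell above.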

First three rows of the App Store data set and the total number of rows and columns

In [96]:
explore_data(ios_data,0,3,True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


First three rows of the Google Play data set and the total number of rows and columns

In [97]:
explore_data(android_data,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


***
The following section shows the column names of the data sets:

In [98]:
print("Column names of the App Store data set:")
print(ios_header)
print("\n")
print("Column names of the Google Play data set:")
print(android_header)

Column names of the App Store data set:
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Column names of the Google Play data set:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


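### Deleting a row with inaccurate data

The row at index 10472 of `android_data` has only 12 values instead of 13: the `Category` value is missing, so all following values are shifted one column to the left.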

In [99]:
wrong_line = android_data[10472]
print(wrong_line)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [100]:
if wrong_line[0] == "Life Made WI-Fi Touchscreen Photo Frame":
    del android_data[10472]

In [101]:
print(android_data[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
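
A row with a missing value has fewer columns than the header, so rows like the one deleted above can be found by comparing lengths instead of knowing the index in advance. The following is a minimal sketch; the function name `find_malformed_rows` and the sample data are assumptions made for illustration:

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Small made-up sample: the second row lacks its category value.
header = ["App", "Category", "Rating"]
rows = [
    ["App A", "TOOLS", "4.2"],
    ["App B", "1.9"],
    ["App C", "GAME", "3.9"],
]

for i, row in find_malformed_rows(header, rows):
    print(i, row)  # 1 ['App B', '1.9']
```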


### Detecting duplicates in the data sets

In [102]:
# Duplicates in App Store data set
duplicate_apps = []
unique_apps = []
for app in ios_data:
    app_id = app[1]
    if app_id in unique_apps:
        duplicate_apps.append(app_id)
    else:
        unique_apps.append(app_id)

print("App Store:")
print("Number of duplicated apps:",len(duplicate_apps))
print("\n")
print("Example of duplicate apps:\n",duplicate_apps[0:15])

App Store:
Number of duplicated apps: 0


Example of duplicate apps:
 []


No duplicates in the App Store data set!
***

In [103]:
# Duplicates in Google Play data set
duplicate_apps = []
unique_apps = []
for app in android_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print("Google Play Store:")
print("Number of duplicated apps:",len(duplicate_apps))
print("\n")
print("Example of duplicate apps:\n",duplicate_apps[0:15])

Google Play Store:
Number of duplicated apps: 1181


Example of duplicate apps:
 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


The Google Play data set has 1181 duplicates!
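
The duplicate check above tests membership in a list, which gets slower as the list grows. A possible variant, sketched here with a hypothetical function name `find_duplicates`, keeps the names already seen in a `set`, where lookups take roughly constant time:

```python
def find_duplicates(rows, key_index):
    """Return every key that was already seen in an earlier row."""
    seen = set()
    duplicates = []
    for row in rows:
        key = row[key_index]
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)
    return duplicates


# Made-up sample: 'Box' appears three times, so it is reported twice.
sample = [["Box"], ["Slack"], ["Box"], ["Box"]]
print(find_duplicates(sample, 0))  # ['Box', 'Box']
```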
***

To decide which duplicates to remove, we use the total number of reviews.
The entry with the highest number of reviews is assumed to be the most recent one, so it is kept and the other entries of the same app are removed.

### Removing the duplicates in the Google Play data set

In [104]:
# Determine expected length of the data set after removal
# of duplicated data
print("Expected length:",len(android_data)- 1181)

Expected length: 9659


In [105]:
# Creating a dictionary of unique apps and their
# max reviews
reviews_max = {}
for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

if len(reviews_max) == (len(android_data)- 1181):
    print("Length of dictionary ("+str(len(reviews_max))+") is as expected ("+str(len(android_data)- 1181)+").")
    print("You can continue!")
elif len(reviews_max) > (len(android_data)- 1181):
    print("Length of dictionary ("+str(len(reviews_max))+") is larger than expected ("+str(len(android_data)- 1181)+").")
    print("Please review your code!")
elif len(reviews_max) < (len(android_data)- 1181):
    print("Length of dictionary ("+str(len(reviews_max))+") is smaller than expected ("+str(len(android_data)- 1181)+").")
    print("Please review your code!")

Length of dictionary (9659) is as expected (9659).
You can continue!


Now, the dictionary will be used to remove the duplicate rows:

In [106]:
android_clean = []
already_added = []
for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)


We now have a clean data set, `android_clean`, for further analysis.
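
The two steps above (finding the maximum number of reviews per app, then collecting the matching rows) can also be combined into one pass. This is only a sketch under the same assumption that the most reviewed entry is the one to keep; `keep_most_reviewed` is a hypothetical name, and the rows come out in order of each app's first appearance, which may differ from the order in `android_clean`:

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up sample with the review count in column 3.
sample = [
    ["Box", "BUSINESS", "4.2", "10"],
    ["Slack", "BUSINESS", "4.4", "5"],
    ["Box", "BUSINESS", "4.2", "30"],
    ["Box", "BUSINESS", "4.2", "30"],
]
for row in keep_most_reviewed(sample):
    print(row[0], row[3])
```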

### Removing non-English apps

We have to detect non-English characters in the app names.

Therefore we write the function `isEnglish`.

This function returns `False` if at least four characters of the input string are considered non-English (code point higher than 127, i.e. outside the ASCII range).
Otherwise, the function returns `True`.

In [107]:
def isEnglish(in_str):
    non_char_num = 0
    for char in in_str:
        char_num = ord(char)
        if char_num > 127:
            non_char_num += 1
        if non_char_num > 3:
            return False
    return True

Filtering of the Google Play data set:

In [108]:
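index = 0
non_english_apps = []
# NOTE: deleting from android_clean while iterating over it makes the loop
# skip the row that follows each deleted row, so two non-English apps in a
# row are not both removed.
for app in android_clean:
    name = app[0]
    if not isEnglish(name):
        non_english_apps.append(app)
        del android_clean[index]
    index += 1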

print("Number of remaining apps: "+str(len(android_clean)))
print("Number of filtered apps: "+str(len(non_english_apps)))

Number of remaining apps: 9615
Number of filtered apps: 44
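
Deleting rows from a list while looping over it makes the loop skip the row right after each deleted one, so two non-English names in a row can leave the second one in the data set. A minimal sketch of an alternative that builds two new lists and leaves the input untouched follows; `mostly_ascii` is a hypothetical stand-in for `isEnglish`, and on the real data the counts may differ slightly from the ones printed above:

```python
def mostly_ascii(text, max_non_ascii=3):
    """Hypothetical stand-in for isEnglish: at most three non-ASCII characters."""
    return sum(ord(char) > 127 for char in text) <= max_non_ascii


def split_by_language(rows, name_index=0):
    """Return (english_rows, non_english_rows) without modifying `rows`."""
    english, non_english = [], []
    for row in rows:
        if mostly_ascii(row[name_index]):
            english.append(row)
        else:
            non_english.append(row)
    return english, non_english


# Made-up sample with two non-English names next to each other.
sample = [["Maps"], ["中文字符测试"], ["Приложение"], ["Notes"]]
english, non_english = split_by_language(sample)
print(len(english), len(non_english))  # 2 2
```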


***
### Isolation of free apps

Free apps in App Store data set:

In [118]:
free_ios_apps = []
for app in ios_data:
    price = app[5]
    if price == '0':
        free_ios_apps.append(app)

Free apps in Google Play data set:

In [122]:
free_android_apps = []
for app in android_clean:
    price = app[7]
    if price == '0':
        free_android_apps.append(app)

***
### Creation of frequency tables to analyse the markets

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.

Let's begin the analysis by getting a sense of the most common genres in each market, using frequency tables.

First, we define some functions for this task:

In [126]:
def freq_table(dataset,index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    
    return table_percentages


def display_table(dataset,index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key],key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse = True)
    
    for entry in table_sorted:
        print(entry[1],":",entry[0])
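
As a small worked example of what such a frequency table contains, the sketch below computes percentages for a made-up four-row sample. It uses `collections.Counter` from the standard library instead of the hand-written counting loop; `freq_table_pct` is a hypothetical name chosen for this illustration:

```python
from collections import Counter


def freq_table_pct(rows, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # most_common() yields the values from most to least frequent.
    return {value: count / total * 100 for value, count in counts.most_common()}


sample = [["a", "Games"], ["b", "Games"], ["c", "Weather"], ["d", "Games"]]
for genre, pct in freq_table_pct(sample, 1).items():
    print(genre, ":", pct)
# Games : 75.0
# Weather : 25.0
```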

***
Frequency table for the App Store genres (prime_genre):

In [127]:
display_table(free_ios_apps,12)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


***
Frequency table for the Google Play genres (Category):

In [135]:
display_table(free_android_apps,1)

FAMILY : 18.917089678510997
GAME : 9.723632261703328
TOOLS : 8.460236886632826
BUSINESS : 4.591088550479413
LIFESTYLE : 3.902989283699944
PRODUCTIVITY : 3.8917089678511
FINANCE : 3.699943598420756
MEDICAL : 3.5307388606880994
SPORTS : 3.395375070501974
PERSONALIZATION : 3.3164128595600673
COMMUNICATION : 3.2374506486181613
HEALTH_AND_FITNESS : 3.0795262267343486
PHOTOGRAPHY : 2.9441624365482233
NEWS_AND_MAGAZINES : 2.7975183305132543
SOCIAL : 2.662154540327129
TRAVEL_AND_LOCAL : 2.33502538071066
SHOPPING : 2.2447828539199097
BOOKS_AND_REFERENCE : 2.143260011280316
DATING : 1.8612521150592216
VIDEO_PLAYERS : 1.793570219966159
MAPS_AND_NAVIGATION : 1.3987591652566271
FOOD_AND_DRINK : 1.2408347433728144
EDUCATION : 1.161872532430908
ENTERTAINMENT : 0.9588268471517203
LIBRARIES_AND_DEMO : 0.9362662154540328
AUTO_AND_VEHICLES : 0.924985899605189
HOUSE_AND_HOME : 0.8234630569655951
WEATHER : 0.8009024252679076
EVENTS : 0.7106598984771574
PARENTING : 0.6542583192329385
ART_AND_DESIGN : 0.6429

***
### Further analysis of the data sets

Mean number of user ratings (`rating_count_tot`) for each genre in the App Store data set:

In [132]:
ios_genres = freq_table(free_ios_apps,12)
for genre in ios_genres:
    total = 0
    len_genre = 0
    for row in free_ios_apps:
        genre_app = row[12]
        if genre_app == genre:
            num_ratings = float(row[6])
            total += num_ratings
            len_genre += 1
    avg_num = total / len_genre
    print(genre,":",avg_num)

Productivity : 19053.887096774193
Weather : 47220.93548387097
Shopping : 18746.677685950413
Reference : 67447.9
Finance : 13522.261904761905
Music : 56482.02985074627
Utilities : 14010.100917431193
Travel : 20216.01785714286
Social Networking : 53078.195804195806
Sports : 20128.974683544304
Health & Fitness : 19952.315789473683
Games : 18924.68896765618
Food & Drink : 20179.093023255813
News : 15892.724137931034
Book : 8498.333333333334
Photo & Video : 27249.892215568863
Entertainment : 10822.961077844311
Business : 6367.8
Lifestyle : 8978.308510638299
Education : 6266.333333333333
Navigation : 25972.05
Medical : 459.75
Catalogs : 1779.5555555555557


According to the analyzed data set, the best app profile for the App Store is a social networking app: the genre makes up 3.53% of the free apps and has a mean of 53078.20 ratings per app.
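
The per-genre means above are computed by scanning the whole data set once for every genre. As a hedged sketch, the same means can be collected in a single pass and then printed from highest to lowest, which makes the ranking easier to read; `mean_by_group` and the sample rows are assumptions made for this example:

```python
from collections import defaultdict


def mean_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up sample: genre in column 0, number of ratings in column 1.
sample = [["Games", "100"], ["Games", "300"], ["Weather", "50"]]
means = mean_by_group(sample, 0, 1)
for genre in sorted(means, key=means.get, reverse=True):
    print(genre, ":", means[genre])
# Games : 200.0
# Weather : 50.0
```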

***
Frequency table of installations for the Google Play data set:

In [134]:
display_table(android_clean,5)

1,000,000+ : 14.7061882475299
100,000+ : 11.513260530421217
10,000+ : 10.618824752990118
10,000,000+ : 9.745189807592304
1,000+ : 9.152366094643785
100+ : 7.321892875715029
5,000,000+ : 6.292251690067603
500,000+ : 5.241809672386895
5,000+ : 4.83619344773791
50,000+ : 4.815392615704628
10+ : 3.9937597503900157
500+ : 3.4113364534581385
50,000,000+ : 2.1216848673946958
50+ : 2.1216848673946958
100,000,000+ : 1.9656786271450857
5+ : 0.8528341133645346
1+ : 0.6864274570982839
500,000,000+ : 0.24960998439937598
1,000,000,000+ : 0.20800832033281333
0+ : 0.13520540821632865
0 : 0.010400416016640665
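
The install counts in this table are open-ended strings such as `10,000+`, so they have to be cleaned before any arithmetic is possible. A minimal sketch of that conversion, with a hypothetical helper name `parse_installs`, could look like this:

```python
def parse_installs(text):
    """Turn an install string such as '10,000+' into the integer 10000."""
    return int(text.replace(",", "").replace("+", ""))


for raw in ["0", "10,000+", "1,000,000,000+"]:
    print(raw, "->", parse_installs(raw))
# 0 -> 0
# 10,000+ -> 10000
# 1,000,000,000+ -> 1000000000
```

Each resulting number is a lower bound: an app listed with `10,000+` installs is treated as having exactly 10,000, so averages based on these values are rough estimates.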


Converting the install count strings to numbers and computing the mean number of installs for each category:

In [136]:
categories = freq_table(free_android_apps,1)
for category in categories:
    total = 0
    len_category = 0
    for row in free_android_apps:
        category_app = row[1]
        if category_app == category:
            installs = row[5]
            installs = installs.replace("+","")
            installs = installs.replace(",","")
            total += float(installs)
            len_category += 1
    avg_inst = total / len_category
    print(category,":",avg_inst)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3693497.7280858676
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_