# Profitable App Profiles for the App Store and Google Play Store

The aim of this project is to find mobile app profiles that are profitable for the App Store and Google Play Store. We're looking for app profiles that would allow us to make the right decisions when building a free app whose main source of revenue is in-app ads. This means that revenue is mostly influenced by the number of users of the app.

---

The `explore_data` function defined below takes four parameters:

- `dataset`: List of lists (similar to a matrix)
- `start`: Starting index for row selection
- `end`: Ending index for row selection (exclusive)
- `rows_and_columns`: Boolean that controls whether the number of rows and columns is displayed

In [19]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

We open the `AppleStore.csv` and `googleplaystore.csv` files, which are the datasets for iOS and Android apps respectively. We turn each dataset into a list of lists and separate the header row from the data rows.

In [20]:
from csv import reader

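# The encoding is set explicitly because some app names contain non-ASCII characters;
# the with-blocks make sure each file is closed after reading.
with open('AppleStore.csv', encoding='utf8') as ios_file:
    ios = list(reader(ios_file))
ios_header = ios[0]
ios = ios[1:]

with open('googleplaystore.csv', encoding='utf8') as android_file:
    android = list(reader(android_file))
android_header = android[0]
android = android[1:]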

In [21]:
# Displaying the header row of the iOS dataset
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [22]:
# Inspecting the first row in the iOS dataset
explore_data(ios, 0, 1, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7197
Number of columns: 16


In [23]:
# Displaying the header row of the Android dataset
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [24]:
# Inspecting the first row in the Android dataset
explore_data(android, 0, 1, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [25]:
print(android_header)  # header row
print('\n')
print(android[10472])  # incorrect row: 12 fields instead of 13 (the Category value is missing)
print('\n')
print(android[0])  # correct row

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


In [26]:
# Displaying length of dataset before removal
print(len(android))
# Removing faulty datapoint
del android[10472]
# Displaying length of dataset after removal
print(len(android))

10841
10840
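
The faulty row above has 12 fields, while the header has 13. A minimal sketch of how such rows could be located is to compare each row's length with the header's. The `find_malformed_rows` helper and the sample rows are hypothetical, made up for illustration only.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up rows for illustration; the second one is missing a field.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '3.9'],
    ['App C', 'GAME', '4.5'],
]

for index, row in find_malformed_rows(sample_rows, sample_header):
    print(index, row)
```

If several rows had to be deleted by index, deleting from the highest index down would keep earlier deletions from shifting the later indices.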


The dataset that we are using for the Google Play Store has some duplicate entries, so we'll need to identify those duplicates.

In [27]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print("Duplicate apps:", len(duplicate_apps))

Duplicate apps: 1181


After analyzing the duplicates, we realized that they were recorded at different times. We made this observation based on the number of reviews each duplicate entry had. We'll keep the entry with the most reviews and discard the rest, on the assumption that it's the most recent one. These duplicates were probably created when the dataset was being updated.

In [28]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


In [29]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))

9659
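
The two passes above can also be folded into one by storing the best row per name in a dictionary. This is only a sketch on made-up rows, with a hypothetical `keep_most_reviewed` helper; it assumes the app name is at index 0 and the review count at index 3, as in the Android dataset.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows for illustration: [name, category, rating, reviews]
sample = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.2', '50'],
    ['App A', 'TOOLS', '4.0', '120'],
    ['App B', 'GAME', '4.2', '50'],
]

for row in keep_most_reviewed(sample):
    print(row)
```

Dictionary lookups avoid the repeated list scans of `name not in already_added`. Note that the resulting order follows each name's first appearance, which can differ from the order of `android_clean`.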


Our target audience is English speakers, so we'll remove apps whose titles contain more than three non-ASCII characters (characters outside the range typically used in English text).

In [30]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

In [31]:
# Testing our function
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
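
To see why a threshold of three is useful, it helps to look at the raw counts. The sketch below uses a hypothetical `count_non_ascii` helper (not part of the notebook) to count the characters that fall outside the ASCII range in some of the test strings.

```python
def count_non_ascii(string):
    """Count the characters whose code point is outside the ASCII range (0-127)."""
    return sum(1 for character in string if ord(character) > 127)


for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, ':', count_non_ascii(name))
```

A trademark sign or a single emoji counts as one character, so such names stay under the limit, while a title written mostly in Chinese characters exceeds it.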


In [32]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)

for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

print(len(android_english))
print(len(ios_english))

9614
6183


As stated earlier, we're only interested in apps that are free to download, so we'll also remove any app that is not free.

In [33]:
android_free = []
ios_free = []

for app in android_english:
    price = app[7]
    if price == '0' or price == '0.0':
        android_free.append(app)

for app in ios_english:
    price = app[4]
    if price == '0' or price == '0.0':
        ios_free.append(app)

print(len(android_free))
print(len(ios_free))

8864
3222
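
The string comparison above depends on free apps being written exactly as `'0'` or `'0.0'`. A slightly more defensive sketch converts the price to a number first. The `is_free` helper is hypothetical, and the possibility that paid prices carry a leading `$` is an assumption that is not shown in the output above.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0' or '$0.00' represents zero."""
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number is treated as not free.
        return False


for price in ['0', '0.0', '$4.99', '1.99', 'Everyone']:
    print(repr(price), is_free(price))
```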


Remember that the aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

The validation strategy for an app idea consists of three steps:

1. Build an MVP Android app.
2. If it has good retention, it's further developed.
3. If it's profitable within six months, an iOS version is developed.

Because the end goal is to develop an app for both the App Store and Google Play Store, we need an app profile that's successful on both markets.

We'll analyse the data to find the most common genres.

In [16]:
def freq_table(dataset, index):
    table = {}
    total = len(dataset)
    
    for row in dataset:
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

In [17]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
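
For comparison, the same kind of percentage table can be built with `collections.Counter` from the standard library. This is a sketch on made-up rows, and `freq_table_counter` is a hypothetical name, not a function from the notebook.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at `index`, most common first."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.most_common()}


# Made-up rows for illustration: [name, genre]
sample = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]

for genre, percentage in freq_table_counter(sample, 1).items():
    print(genre, ':', percentage)
```

`most_common()` already returns the values in descending order of count, so no separate sorting step is needed.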

In [18]:
# Displaying prime_genre from ios
display_table(ios_free, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In [19]:
# Displaying Genres from android
display_table(android_free, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [20]:
# Displaying Category from android
display_table(android_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Now we will analyze the frequency table generated for the `prime_genre` column of the App Store data set.

What is the most common genre? What is the runner-up?
- Most common genre: Games
- Runner-up: Entertainment

What other patterns can you see?
- Medical, Navigation, and Catalogs apps have the lowest frequencies

What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) or more for fun (games, entertainment, photo and video, social networking, sports, music, etc.)?
- The general impression is that most apps are designed for fun

Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?
- Probably an app in the gaming category, based on how many such apps there are. However, a large number of apps does not imply a large number of users. Still, it raises the question of why there would be so many gaming apps otherwise.

Now we will analyze the frequency table we generated for the `Category` and `Genres` columns of the Google Play data set.

What are the most common genres?
- Tools : 8.45%
- Entertainment : 6.07%
- Education : 5.35%

What other patterns do you see?
- Genres that are complicated and have too much going on are the lowest
- `FAMILY`, `GAME`, and `TOOLS` have the highest frequencies among the categories.
- `ART_AND_DESIGN`, `COMICS`, `BEAUTY` are the lowest among the categories.

Now let's compare the patterns we see for the Google Play market with those we saw for the App Store market.

Apps geared towards entertainment rank high in frequency in both app stores. Business apps, however, are rare in the App Store (about 0.5% of the free English apps) but fairly common in the Play Store (about 4.6%).

Educational apps score very high in both app stores as well.

Based on the genre frequencies in both app stores, I think that an app in either education or entertainment would fit the profile. However, the frequency tables only tell us how many apps exist in each genre, not how many users download or use them. At the same time, it seems unlikely that there would be so many apps in a genre if it weren't working out for their developers.

In [21]:
genres_ios = freq_table(ios_free, 11)
for genre in genres_ios:
    total = 0
    len_genre = 0
    for data in ios_free:
        genre_app = data[11]
        if genre_app == genre:
            user_ratings = float(data[5])
            total += user_ratings
            len_genre+=1
    avg = total / len_genre
    print(genre, ":", avg)

Business : 7491.117647058823
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
News : 21248.023255813954
Sports : 23008.898550724636
Reference : 74942.11111111111
Navigation : 86090.33333333333
Music : 57326.530303030304
Productivity : 21028.410714285714
Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Medical : 612.0
Utilities : 18684.456790123455
Weather : 52279.892857142855
Catalogs : 4004.0
Entertainment : 14029.830708661417
Lifestyle : 16485.764705882353
Travel : 28243.8
Food & Drink : 33333.92307692308
Shopping : 26919.690476190477
Finance : 31467.944444444445
Education : 7003.983050847458
Book : 39758.5


A good app profile recommendation for the App Store would be something in the `Social Networking` genre. Not only does it average a large number of user ratings (about 71,548 per app), it also ranks fifth in the frequency table.
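
The averages above are printed in dictionary order, which makes them hard to rank by eye. A sketch of a one-pass version that also sorts the result follows. `average_by_group` and the sample rows are hypothetical, and the column positions are passed in as parameters.

```python
def average_by_group(dataset, group_index, value_index):
    """Return (group, average) pairs sorted from highest to lowest average."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    averages = [(group, totals[group] / counts[group]) for group in totals]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


# Made-up rows for illustration: [name, genre, rating_count]
sample = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Music', '500'],
    ['App D', 'Weather', '50'],
]

for genre, average in average_by_group(sample, 1, 2):
    print(genre, ':', average)
```

With the notebook's data, this could be called as `average_by_group(ios_free, 11, 5)`, since `prime_genre` is at index 11 and `rating_count_tot` at index 5.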

In [22]:
categories_android = freq_table(android_free, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)


SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
ART_AND_DESIGN : 1986335.0877192982
BEAUTY : 513151.88679245283
PRODUCTIVITY : 16787331.344927534
SHOPPING : 7036877.311557789
FINANCE : 1387692.475609756
BOOKS_AND_REFERENCE : 8767811.894736841
TOOLS : 10801391.298666667
COMMUNICATION : 38456119.167247385
PHOTOGRAPHY : 17840110.40229885
ENTERTAINMENT : 11640705.88235294
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
SOCIAL : 23253652.127118643
FAMILY : 3695641.8198090694
PARENTING : 542603.6206896552
LIFESTYLE : 1437816.2687861272
NEWS_AND_MAGAZINES : 9549178.467741935
BUSINESS : 1712290.1474201474
EVENTS : 253542.22222222222
MAPS_AND_NAVIGATION : 4056941.7741935486
COMICS : 817657.2727272727
EDUCATION : 1833495.145631068
MEDICAL : 120550.61980830671
GAME : 15588015.603248259
VIDEO_PLAYERS : 24727872.452830188
PERSONALIZATION : 5201482.6122448975
AUTO_AND_VEHICLES : 647317.8170731707
FOOD_AND_DRINK : 1924897.73636