## Profitable app profiles in Google Playstore and Apple Store


I plan to develop a free app (English only) that earns its income from ad revenue. The goal of this project is to get an overview of app profiles that could be profitable.


As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. On Kaggle I found two datasets, each containing a sample of the available apps, to start the investigation.

Direct download of the datasets:
* [Google Playstore](https://www.kaggle.com/lava18/google-play-store-apps) containing ~ 10 000 apps
* [Apple iOS](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing ~ 7 000 apps



__Read and explore the datasets__

Create a list of lists for both.

In [1]:
from csv import reader

In [2]:
# Read each CSV into a list of lists; the with-blocks close the files afterwards
with open('googleplaystore.csv', encoding='utf8') as open_goog:
    goog_data = list(reader(open_goog))

with open('AppleStore.csv', encoding='utf8') as open_appl:
    appl_data = list(reader(open_appl))

print("Google columns")
print(goog_data[0])
print('\nApple columns')
print(appl_data[0])

Google columns
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

Apple columns
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Not all columns in the Apple dataset are self-explanatory. A summary of their meaning can be found in the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

In [3]:
# Map each column name to its index, numbering each dataset from 0
goog_head = goog_data[0]
appl_head = appl_data[0]
goog_index = {column: i for i, column in enumerate(goog_head)}
appl_index = {column: i for i, column in enumerate(appl_head)}
print("Google column and index number:")
print(goog_index)
print("\nApple column and index number:")
print(appl_index)

Google column and index number:
{'App': 0, 'Category': 1, 'Rating': 2, 'Reviews': 3, 'Size': 4, 'Installs': 5, 'Type': 6, 'Price': 7, 'Content Rating': 8, 'Genres': 9, 'Last Updated': 10, 'Current Ver': 11, 'Android Ver': 12}

Apple column and index number:
{'id': 0, 'track_name': 1, 'size_bytes': 2, 'currency': 3, 'price': 4, 'rating_count_tot': 5, 'rating_count_ver': 6, 'user_rating': 7, 'user_rating_ver': 8, 'ver': 9, 'cont_rating': 10, 'prime_genre': 11, 'sup_devices.num': 12, 'ipadSc_urls.num': 13, 'lang.num': 14, 'vpp_lic': 15}


#### Create a function to explore the data

In [4]:
def explore_data (dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # Empty line for readability        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [5]:
explore_data(goog_data, 0,5, 1)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


In [6]:
explore_data(appl_data, 1,5, 1)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


### Cleaning the data

The Kaggle discussion for the Google dataset reports a faulty entry at row 10473: its Category value is missing, so every later column is shifted and the rating reads 19 on a scale of 5. Because a single app has little impact on the dataset, I delete this row.

In [7]:
print(goog_data[10472:10474])

[['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up'], ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']]


In [8]:
del(goog_data[10473])
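
The faulty row above has one column fewer than the header, so rows like it can also be found without relying on the Kaggle discussion. A minimal sketch, assuming the same list-of-lists layout with the header in the first row; `find_malformed_rows` and the tiny `sample` dataset are hypothetical and only serve as illustration:

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs for rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset[1:], start=1)
            if len(row) != expected]


# Hypothetical miniature dataset: the second data row misses its Category value
sample = [
    ['App', 'Category', 'Rating'],
    ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]

malformed = find_malformed_rows(sample)
for index, row in malformed:
    print(index, row)
```

Running a check like this before deleting anything also guards against removing the wrong row when the cell is executed twice.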

### Check dataset for duplicate apps
I create two lists and check whether each app name already exists in the first list. If it does, I add it to the list of duplicates.

In [9]:
duplicate_apps = []
unique_apps = []
for app in goog_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else :
        unique_apps.append(name)
        
print("Number of duplicate apps:", len (duplicate_apps))
print("\nExamples:\n", duplicate_apps[:10])

Number of duplicate apps: 1181

Examples:
 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
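
Checking `name in unique_apps` scans a list on every iteration, which gets slower as the list grows. A possible variant, sketched here on a hypothetical `names` list, keeps the names already seen in a set so each lookup is fast:

```python
def split_duplicates(names):
    """Return (unique, duplicates); duplicates holds every repeat after the first sighting."""
    seen = set()
    unique = []
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
            unique.append(name)
    return unique, duplicates


names = ['Box', 'Slack', 'Box', 'ESPN', 'Box']  # hypothetical sample
unique, duplicates = split_duplicates(names)
print(unique)
print(duplicates)
```

As in the cell above, an app that appears three times contributes two entries to the duplicates list.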


#### To check how many extra copies each duplicated app has, import Counter and list the most common

In [10]:
from collections import Counter
Counter(duplicate_apps).most_common(10)

[('ROBLOX', 8),
 ('CBS Sports App - Scores, News, Stats & Watch Live', 7),
 ('Duolingo: Learn Languages Free', 6),
 ('8 Ball Pool', 6),
 ('Candy Crush Saga', 6),
 ('ESPN', 6),
 ('Nick', 5),
 ('Subway Surfers', 5),
 ('Bubble Shooter', 5),
 ('Sniper 3D Gun Shooter: Free Shooting Games - FPS', 5)]

In [11]:
for app in goog_data:
    name = app[0]
    if name == "ESPN":
        print(app)

['ESPN', 'SPORTS', '4.2', '521138', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Sports', 'July 19, 2018', 'Varies with device', '5.0 and up']
['ESPN', 'SPORTS', '4.2', '521138', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Sports', 'July 19, 2018', 'Varies with device', '5.0 and up']
['ESPN', 'SPORTS', '4.2', '521138', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Sports', 'July 19, 2018', 'Varies with device', '5.0 and up']
['ESPN', 'SPORTS', '4.2', '521140', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Sports', 'July 19, 2018', 'Varies with device', '5.0 and up']
['ESPN', 'SPORTS', '4.2', '521140', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Sports', 'July 19, 2018', 'Varies with device', '5.0 and up']
['ESPN', 'SPORTS', '4.2', '521140', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Sports', 'July 19, 2018', 'Varies with device', '5.0 and up']
['ESPN', '

#### After inspecting the duplicates I keep the most recent entry, using the highest number of reviews as the criterion. I create a dictionary that stores the maximum number of reviews for every app.

In [12]:
max_reviews = {}
for row in goog_data[1:]:  # skip the header row
    name = row[0]
    n_reviews = float(row[3])
    # keep the highest review count seen so far for this app
    if name not in max_reviews or n_reviews > max_reviews[name]:
        max_reviews[name] = n_reviews

In [13]:
print('Expected length:', len(goog_data) - 1181 - 1)  # subtract the 1181 duplicates and the header row; the faulty row is already deleted
print('Actual length:', len(max_reviews))

Expected length: 9659
Actual length: 9659


#### The length of the max_reviews dictionary matches the original list minus the duplicates and the header row. Now I can clean the dataset and keep, for each app, the row whose number of reviews equals max_reviews

In [14]:
goog_clean = []
already_added = []

for row in goog_data[1:]:
    name = row[0]
    n_reviews = float(row[3])  # note: convert to float
    if (n_reviews == max_reviews[name]) and (name not in already_added):  # note: several rows can share the max value
        goog_clean.append(row)
        already_added.append(name)

In [15]:
explore_data(goog_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
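
The two passes above (find the maximum, then filter) can also be folded into one. This is a sketch under the assumption that the app name is in column 0 and the review count in column 3, as in the Google dataset. `dedupe_by_max_reviews` is a hypothetical helper, the Slack row is made up, and unlike the cells above the result is ordered by the first appearance of each name:

```python
def dedupe_by_max_reviews(rows, name_col=0, reviews_col=3):
    """Keep one row per app name: the first row that has the highest review count."""
    best = {}
    for row in rows:
        name = row[name_col]
        # Strict '>' keeps the earlier row when review counts tie
        if name not in best or float(row[reviews_col]) > float(best[name][reviews_col]):
            best[name] = row
    return list(best.values())


# Hypothetical rows in the column layout App, Category, Rating, Reviews
rows = [
    ['ESPN', 'SPORTS', '4.2', '521138'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['ESPN', 'SPORTS', '4.2', '521140'],
    ['ESPN', 'SPORTS', '4.2', '521140'],
]
cleaned = dedupe_by_max_reviews(rows)
print(cleaned)
```

Storing the winning row per name in a dictionary removes the need for a separate `already_added` list.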


### Removing non-English and non-free apps

My goal is to find a profitable app for the English-speaking market, but the datasets contain apps that are obviously not English. To remove them I use the code point of each character (`ord`): anything above 127 falls outside the ASCII range and is not a common English character. To keep names that contain emoticons or trademark symbols, I only filter out a name when it has more than 3 such characters.

In [16]:
print(goog_clean[4412][0])
print(goog_clean[7940][0])

中国語 AQリスニング
لعبة تقدر تربح DZ


In [17]:
def is_english(string):
    n = 0
    for char in string:
        if ord(char) > 127:
            n+=1
    if n > 3:        
        return False
    else:       
        return True        
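
To see how the threshold behaves, the check can be run on a few names. The sketch below restates the function in compact form so it runs on its own; the `max_non_ascii` parameter is an addition for illustration, and the first two sample names are made up:

```python
def is_english(string, max_non_ascii=3):
    """Return True if the string has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for char in string if ord(char) > 127)
    return non_ascii <= max_non_ascii


samples = [
    'Instachat 😜',                    # one emoji
    'Docs To Go™ Free Office Suite',  # one trademark sign
    '中国語 AQリスニング',               # eight non-ASCII characters
]
for name in samples:
    print(name, '->', is_english(name))
```

The first two names stay in the dataset because they contain a single non-ASCII character, while the third is rejected.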

In [18]:
android_english = []
ios_english = []

for app in goog_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in appl_data:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagr

Create new lists that contain only the free English apps. The number of iOS apps drops sharply at this step because many of them are paid.

In [19]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print("Number of apps in final Android set:", len(android_final))
print("Number of apps in final iOS set:",len(ios_final))

Number of apps in final Android set: 8864
Number of apps in final iOS set: 3222
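
Comparing against the exact strings '0' and '0.0' works for the rows shown here, but it depends on how each store formats the price. A more tolerant check could parse the number first. This is only a sketch: `is_free` is a hypothetical helper, and the assumption that paid prices may carry a leading '$' is mine, not something shown in the output above:

```python
def is_free(price_text):
    """Return True when the price string represents zero, e.g. '0', '0.0' or '$0.00'."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all (for example a shifted column): treat as not free
        return False


for text in ['0', '0.0', '$4.99', 'Everyone']:
    print(repr(text), is_free(text))
```

With a helper like this the same condition could be used for both datasets.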


### Explore popular genres on Android and iOS

Create a function that takes a dataset and the index number of a column and builds a frequency table with percentages. A second function sorts the percentages in descending order and prints them.

In [20]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        genre = row[index]
        if genre in table:
            table[genre] += 1
            total +=1
        else :
            table[genre]=1
            total +=1    
    for key in table:
        table[key] /= total
        table[key] *= 100
        table[key] = round(table[key],2)         
    return table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    # the (percentage, key) tuples were collected in a list so that sorted() orders on percentage
    for entry in table_sorted:
        print(entry[1],",",entry[0])  
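
The same frequency table can be built with `collections.Counter`, which this notebook already imports for the duplicate check. A minimal sketch with a hypothetical `freq_table_counter` function and made-up sample rows:

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return (value, percentage) pairs for one column, most frequent first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common()]


# Hypothetical rows: app name in column 0, category in column 1
sample = [
    ['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME'],
]
for value, pct in freq_table_counter(sample, 1):
    print(value, ',', pct)
```

One difference from the functions above: values with equal counts are listed in order of first appearance rather than in reverse alphabetical order.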

In [21]:
print("Android category and percentage")
display_table(android_final, 1)

Android category and percentage
FAMILY , 18.91
GAME , 9.72
TOOLS , 8.46
BUSINESS , 4.59
LIFESTYLE , 3.9
PRODUCTIVITY , 3.89
FINANCE , 3.7
MEDICAL , 3.53
SPORTS , 3.4
PERSONALIZATION , 3.32
COMMUNICATION , 3.24
HEALTH_AND_FITNESS , 3.08
PHOTOGRAPHY , 2.94
NEWS_AND_MAGAZINES , 2.8
SOCIAL , 2.66
TRAVEL_AND_LOCAL , 2.34
SHOPPING , 2.25
BOOKS_AND_REFERENCE , 2.14
DATING , 1.86
VIDEO_PLAYERS , 1.79
MAPS_AND_NAVIGATION , 1.4
FOOD_AND_DRINK , 1.24
EDUCATION , 1.16
ENTERTAINMENT , 0.96
LIBRARIES_AND_DEMO , 0.94
AUTO_AND_VEHICLES , 0.93
HOUSE_AND_HOME , 0.82
WEATHER , 0.8
EVENTS , 0.71
PARENTING , 0.65
ART_AND_DESIGN , 0.64
COMICS , 0.62
BEAUTY , 0.6


In [22]:
print("iOS genre and percentage")
display_table(ios_final, -5)  # index -5 is the prime_genre column (index 11 of 16)

iOS genre and percentage
Games , 58.16
Entertainment , 7.88
Photo & Video , 4.97
Education , 3.66
Social Networking , 3.29
Shopping , 2.61
Utilities , 2.51
Sports , 2.14
Music , 2.05
Health & Fitness , 2.02
Productivity , 1.74
Lifestyle , 1.58
News , 1.33
Travel , 1.24
Finance , 1.12
Weather , 0.87
Food & Drink , 0.81
Reference , 0.56
Business , 0.53
Book , 0.43
Navigation , 0.19
Medical , 0.19
Catalogs , 0.12


The genre and category distributions on Google Play and iOS look different. iOS is skewed towards Games and Entertainment, while Android has more practical apps. Because I will develop an Android version first, I will investigate Android further.

First, create a dictionary with the average number of reviews per category to see which category is most popular.

In [23]:
android_cat = freq_table(android_final, 1)
reviews_genre = {}

for genre in android_cat:
    total = 0
    len_genre = 0
    for app in android_final:
        genre_app = app[1]
        if genre_app == genre:
            len_genre += 1
            num_ratings = float(app[3])
            total += num_ratings
    avg_n_ratings = round(total / len_genre)
    reviews_genre[genre] = avg_n_ratings

# sort on the average number of reviews (descending), not on the category name
genre_sorted = sorted(reviews_genre.items(), key=lambda item: item[1], reverse=True)
print(genre_sorted)

[('COMMUNICATION', 995608), ('SOCIAL', 965831), ('GAME', 683524), ('VIDEO_PLAYERS', 425350), ('PHOTOGRAPHY', 404081), ('TOOLS', 305733), ('ENTERTAINMENT', 301752), ('SHOPPING', 223887), ('PERSONALIZATION', 181122), ('WEATHER', 171251), ('PRODUCTIVITY', 160635), ('MAPS_AND_NAVIGATION', 142860), ('TRAVEL_AND_LOCAL', 129484), ('SPORTS', 116939), ('FAMILY', 113143), ('NEWS_AND_MAGAZINES', 93088), ('BOOKS_AND_REFERENCE', 87995), ('HEALTH_AND_FITNESS', 78095), ('FOOD_AND_DRINK', 57479), ('EDUCATION', 56293), ('COMICS', 42586), ('FINANCE', 38536), ('LIFESTYLE', 33922), ('HOUSE_AND_HOME', 26435), ('ART_AND_DESIGN', 24699), ('BUSINESS', 24240), ('DATING', 21953), ('PARENTING', 16379), ('AUTO_AND_VEHICLES', 14140), ('LIBRARIES_AND_DEMO', 10926), ('BEAUTY', 7476), ('MEDICAL', 3730), ('EVENTS', 2556)]
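
The nested loop above walks through the whole dataset once per category. The averages can also be collected in a single pass. A sketch, assuming the category sits in column 1 and the review count in column 3; `average_by_category` and the sample rows are hypothetical:

```python
from collections import defaultdict


def average_by_category(dataset, cat_col=1, value_col=3):
    """Return {category: rounded average of value_col} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        totals[row[cat_col]] += float(row[value_col])
        counts[row[cat_col]] += 1
    return {cat: round(totals[cat] / counts[cat]) for cat in totals}


# Hypothetical rows in the column layout App, Category, Rating, Reviews
sample = [
    ['a', 'GAME', '4.0', '100'],
    ['b', 'GAME', '4.5', '300'],
    ['c', 'TOOLS', '4.1', '50'],
]
averages = average_by_category(sample)
for cat, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(cat, avg)
```

Sorting on `item[1]` orders the categories by their average rather than by name.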


In [24]:
print("Category and number of installs \n")

for cat in android_cat:  # frequency table of categories created in the previous cell
    total = 0
    len_category = 0
    for app in android_final:  # android_final has no header row, so no row is skipped
        cat_app = app[1]
        if cat_app == cat:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(cat, ':', int(avg_n_installs))
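
The Installs column holds open-ended strings such as '10,000+', so the values must be cleaned before they can be averaged. A small sketch of that conversion, with `parse_installs` as a hypothetical helper name:

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to the integer 10000."""
    return int(text.replace(',', '').replace('+', ''))


for text in ['10,000+', '500,000+', '50,000,000+']:
    print(text, '->', parse_installs(text))
```

These figures are lower bounds: '10,000+' only says that the app has at least 10,000 installs, so averages built on them are rough estimates.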