# Profitable App Profiles for the Apple Store and Google Play Market 

Many of the apps available on the Apple Store and Google Play Market are free to download. This means that the revenue stream for such apps is largely based on ad-revenue. In the code below, we will examine which app types provide the greatest opportunity for in-app revenue based on the information pulled from the Apple Store and Google Play data sets. 

We will begin by taking a look at the available data for both the Apple Store and the Google Play Market. We import `reader` from the csv library, open both files, and read each one into a list of rows, which is easier to work with in the rest of the analysis. We then print the first few rows of each data set to get an idea of what the data looks like.
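As a minimal sketch of the loading step described above, the helper below reads a CSV file into a header row and a list of data rows, and closes the file when it is done. The name `load_csv` and the default `utf8` encoding are assumptions made for illustration; they are not part of the notebook.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return (header, rows) as lists of strings."""
    # The csv documentation recommends newline='' for files passed to reader.
    with open(path, encoding=encoding, newline='') as opened_file:
        data = list(reader(opened_file))
    # The first row is treated as the header, the rest as data rows.
    return data[0], data[1:]
```

Under these assumptions, a call such as `header, rows = load_csv('AppleStore.csv')` would keep the header separate from the data, so later code would not need to slice with `[1:]`.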

First we will take a look at the Apple Store data set. We will print the first six rows to see how the data is formatted, then the size of the data set to get a feel for how much data we are working with, and finally the header row to see what each column represents. Detailed information on what the column names represent can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

In [1]:
from csv import reader
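# An explicit encoding avoids platform-dependent decoding of non-ASCII app names
opened_file_apple = open('AppleStore.csv', encoding='utf8')
opened_file_gp = open('googleplaystore.csv', encoding='utf8')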
read_file_apple = reader(opened_file_apple)
read_file_gp = reader(opened_file_gp)

apple_data = list(read_file_apple)
gp_data = list(read_file_gp)

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(apple_data[1:], 0, 6)
nrow_apple = len(apple_data)
ncol_apple = len(apple_data[0])
print('Number of Rows: ' + str(nrow_apple) + ', Number of Columns: ' + str(ncol_apple))

print('\n')
print(apple_data[0])

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1']


Number of Rows: 7198, Number of Columns: 16


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 

Looking at this output, we can see that our data set is stored as a list of 7198 rows (including the header row), each with 16 columns. The data set contains a number of metrics that may be useful to our goal. For example, we should only be dealing with apps that are free, so apps with a price value not equal to 0.0 should not be considered in our analysis. Furthermore, we can see both the total number of user ratings and the number of ratings for the current version of each app, which hints at the app's historical user base as well as its current active user base. This may give us some idea of whether the user base is consistent, growing or shrinking, and whether such an app has a sustainable future for in-app ad revenue.
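To illustrate the comparison between historical and current ratings, here is a small sketch that computes the share of an app's ratings that belong to its current version. The function name `current_version_share` is hypothetical; the column positions (5 for `rating_count_tot`, 6 for `rating_count_ver`) follow the header row printed above, and the sample row is the Facebook row from the output.

```python
def current_version_share(row, total_index=5, current_index=6):
    """Return the fraction of all ratings that belong to the current version."""
    total = float(row[total_index])
    current = float(row[current_index])
    if total == 0:
        return 0.0  # avoid dividing by zero for apps with no ratings
    return current / total

# Facebook row copied from the output above
facebook = ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676',
            '212', '3.5', '3.5', '95.0', '4+', 'Social Networking',
            '37', '1', '29', '1']

print(current_version_share(facebook))
```

A low share does not necessarily mean a shrinking user base; it may simply mean the current version was released recently, so this ratio is only a rough signal.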

Now, let us do the same for the information from the Google Play Market. A more thorough description of the column names can be found [here](https://www.kaggle.com/lava18/google-play-store-apps/version/6).

In [2]:
explore_data(gp_data[1:], 0, 6)
nrow_gp = len(gp_data)
ncol_gp = len(gp_data[0])
print('Number of Rows: ' + str(nrow_gp) + ', Number of Columns: ' + str(ncol_gp))
print('\n')
print(gp_data[0])

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+

We can see that we have roughly similarly sized data sets, with the Google Play Market data set coming in at 10842 x 13 (header row included). There is slightly less historical data available; however, the key information for our assessment appears to be present here as well.

# Cleaning the Data

Before we can perform any real analysis, we need to clean our data by eliminating incorrect or irrelevant entries. For us, irrelevant data falls into two categories:

1. Paid apps. Remember, we are looking at apps for in-app advertising, and we assume paid apps will not carry such advertisements.
2. Apps for non-English speakers. Our target market is English speakers, and thus we are not concerned with apps aimed at other audiences.


In [3]:
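# Row 10472 (not counting the header) is missing its Category value,
# so every later column is shifted one position to the left
print(gp_data[10473])
del gp_data[10473]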

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
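The row printed above has one value fewer than the header row, which is what makes it stand out. As a hedged sketch of how such rows could be found automatically rather than by index, the hypothetical helper `find_malformed_rows` below compares each row's length with the header's. The sample data is a shortened, made-up three-column version of the data set.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Shortened illustrative data: the header plus two rows, one with a missing column
sample = [
    ['App', 'Category', 'Rating'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # Category is missing
]

print(find_malformed_rows(sample))
```

This only catches rows with a wrong number of columns; a row with the right length but wrong values would need a different check.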


The Google Play Market data set contains duplicate entries for some apps. We can see this by looping over the rows and sorting each app name into one of two lists: names seen for the first time and names seen before. The length of the second list is the number of duplicate rows. We do the same for the Apple Store data set.


In [4]:
duplicate_apps_gp = []
unique_apps_gp = []

for row in gp_data[1:]:
    name = row[0]
    if name in unique_apps_gp:
        duplicate_apps_gp.append(name)
    else:
        unique_apps_gp.append(name)

duplicate_apps_apple = []
unique_apps_apple = []

for row in apple_data[1:]:
    name = row[1]
    if name in unique_apps_apple:
        duplicate_apps_apple.append(name)
    else:
        unique_apps_apple.append(name)

        

print('Number of duplicate apps in Google Play Market:' + str(len(duplicate_apps_gp)))
print('\n')
print('Number of duplicate apps in Apple Store:' + str(len(duplicate_apps_apple)))

Number of duplicate apps in Google Play Market:1181


Number of duplicate apps in Apple Store:2


We want to remove duplicate entries. There are a few different ways that we could do this. One option is to keep the entry with the most reviews: review counts generally grow over time, so that entry is most likely the most recent one and best reflects the current state of the app.

In [5]:
reviews_max_gp = {}

for row in gp_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max_gp and reviews_max_gp[name] < n_reviews:
        reviews_max_gp[name] = n_reviews
    elif name not in reviews_max_gp:
        reviews_max_gp[name] = n_reviews

print(len(reviews_max_gp))

android_clean = []
already_added_gp = []

for row in gp_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if n_reviews == reviews_max_gp[name] and name not in already_added_gp:
        android_clean.append(row)
        already_added_gp.append(name)
        
print(android_clean[0:4])

print(len(android_clean))


reviews_max_apple = {}

for row in apple_data[1:]:
    name = row[1]
    n_reviews = float(row[5])
    
    if name in reviews_max_apple and reviews_max_apple[name] < n_reviews:
        reviews_max_apple[name] = n_reviews
    elif name not in reviews_max_apple:
        reviews_max_apple[name] = n_reviews

print(len(reviews_max_apple))

apple_clean = []
already_added_apple = []

for row in apple_data[1:]:
    name = row[1]
    n_reviews = float(row[5])
    
    if n_reviews == reviews_max_apple[name] and name not in already_added_apple:
        apple_clean.append(row)
        already_added_apple.append(name)

print(len(apple_clean))

9659
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']]
9659
7195
7195


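To remove the duplicates, we first build a dictionary that maps each app name to the highest review count seen for that name. We then loop over the data a second time and keep a row only if its review count matches that maximum and the name has not already been added, which also handles duplicates that share the same review count. This leaves 9659 Google Play apps and 7195 Apple Store apps.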
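As an alternative sketch, the same "keep the most-reviewed entry" rule can be applied in a single pass by storing the whole row in the dictionary instead of only the count. The function name `keep_most_reviewed` and the sample rows are hypothetical. Note that the result is ordered by each name's first appearance, which can differ from the order produced by the two-pass cell above.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up rows of the form [name, review count]
sample = [['App A', '100'], ['App B', '5'], ['App A', '250'], ['App A', '250']]
print(keep_most_reviewed(sample, 0, 1))
```

For the data sets here, this would be called with indexes 0 and 3 for Google Play and 1 and 5 for the Apple Store, matching the columns used in the cell above.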

In [6]:
def englishSpeaking(string):
    char_count = 0
    for letter in string:
        if ord(letter) > 127:
            char_count += 1
            
    # Allow up to three non-ASCII characters (emoji, trademark signs, etc.)
    return char_count <= 3

print(englishSpeaking('Instagram'))
print(englishSpeaking('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(englishSpeaking('Docs To Go™ Free Office Suite'))
print(englishSpeaking('Instachat 😜'))

True
False
True
True
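The check above treats any character with a code point above 127 as non-ASCII and accepts a name as English if it contains at most three such characters, so that names with an emoji or a trademark sign are not thrown out. A more compact sketch of the same heuristic follows; the name `is_probably_english` and the `max_non_ascii` parameter are assumptions for illustration.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if at most max_non_ascii characters fall outside ASCII."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii

for name in ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播',
             'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, is_probably_english(name))
```

The heuristic is imperfect in both directions: an English name with more than three emoji is rejected, and a non-English name written mostly in ASCII letters is accepted.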


In [7]:
android_english = []
apple_english = []

for row in android_clean:
    name = row[0]
    
    if englishSpeaking(name):
        android_english.append(row)

for row in apple_clean:
    name = row[1]
    
    if englishSpeaking(name):
        apple_english.append(row)
        
print(len(android_english))
print(len(apple_english))
      

9614
6181


In [8]:
android_clean_free = []
apple_clean_free = []

for row in android_english:
    price = row[7]
    
    if price == '0':
        android_clean_free.append(row)
    
    
for row in apple_english:
    price = float(row[4])
    
    if price == 0.0:
        apple_clean_free.append(row)
    

print(len(android_clean_free))
print(len(apple_clean_free))

8864
3220
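The cell above isolates free apps by comparing the Google Play price to the string `'0'` and the Apple Store price to the number `0.0`. As a hedged sketch, a single helper could handle both representations. The name `is_free` is hypothetical, and the `'$4.99'` style for paid Google Play prices is an assumption about the data rather than something shown in the output above.

```python
def is_free(price):
    """Return True when a price string represents zero."""
    # Strip whitespace and a leading dollar sign (assumed format for paid apps).
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # prices that cannot be parsed are treated as not free

print(is_free('0'), is_free('0.0'), is_free('$4.99'))
```

Treating unparseable values as not free is a conservative design choice: a malformed row is excluded from the analysis instead of being counted as a free app.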


To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres in each market. For this, we'll need to build frequency tables for a few columns in our data sets.

In [9]:
gp_genres = {}
gp_genres_list = []

for row in android_clean_free:
    genre = row[9]
    
    if genre in gp_genres:
        gp_genres[genre] += 1
    else: 
        gp_genres[genre] = 1
    
    gp_genres_list.append(genre)
    
# print(gp_genres)

apple_genres = {}
apple_genres_list = []

for row in apple_clean_free:
    genre = row[11]
    
    apple_genres_list.append(genre)
    
    if genre in apple_genres:
        apple_genres[genre] += 1
    else:
        apple_genres[genre] = 1

# print(apple_genres_list[0:4])

In [10]:
def freq_table(dataset, index):
    table = {}
    total = len(dataset)
    for row in dataset:
        entry = row[index]
        
        if entry in table:
            table[entry] += 1
        else: 
            table[entry] = 1
            
    for key in table:
        table[key] /= total
        table[key] *= 100
        
    return table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
   
print('Android breakdown: \n')
display_table(android_clean_free, 1)
print('\n')
display_table(android_clean_free, 9)
print('\n')
print('Apple Breakdown: \n')
display_table(apple_clean_free, 11)
        

Android breakdown: 

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638
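The frequency tables above are built by counting values by hand and then sorting tuples. As a sketch of an alternative, `collections.Counter` from the standard library can do the counting and the ordering. The function name `freq_table_pct` and the sample rows are made up for illustration.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return [(value, count / total * 100)
            for value, count in counts.most_common()]

# Made-up rows of the form [name, category]
sample = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'FAMILY']]
for value, pct in freq_table_pct(sample, 1):
    print(value, ':', pct)
```

Unlike `display_table`, this version returns the sorted pairs instead of printing them, which makes the result easier to reuse in later cells.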

Analyze the frequency table you generated for the prime_genre column of the App Store data set.

What is the most common genre? What is the runner-up?

Most common genre is games
The runner-up is Entertainment

What other patterns can you see?

Apps related to entertainment or media make up a majority of the listed free apps, while more pragmatic apps tend to be rare

What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) or more for fun (games, entertainment, photo and video, social networking, sports, music, etc.)?

Most apps are designed for fun; they make up over 75% of the total available free listings on the App Store.

Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?

We cannot make a recommendation on this table alone. Unfortunately, a large number of games would mean that our game is also less likely to be noticed. Though a genre is very common, it does not necessarily mean that all the apps in that genre are receiving a lot of attention

Analyze the frequency tables you generated for the Category and Genres columns of the Google Play data set.

What are the most common genres?

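In the Genres column: Tools, Entertainment and Education. In the Category column, the table above shows Family (18.9%), Game (9.7%) and Tools (8.5%) at the top.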

What other patterns can you see?

There is a much more balanced spread of free apps available (by percentage) in the Google Play data set. Many categories that deal with practical apps (education, productivity, business) are much more common than in the App Store.

Compare the patterns you see for the Google Play market with those you saw for the App Store market.

The Google Play market has a much more balanced mix of free apps across categories, whereas the App Store is dominated by free entertainment apps.

Can you recommend an app profile based on what you found so far? 

We cannot make an app profile recommendation as of yet because we still need to examine how much traffic each type of app actually receives in order to make a sound recommendation

Do the frequency tables you generated tell you what are the most frequent app genres or what genres have the most users?

They tell us what the most frequent app genres are but they do not reveal to us what genres have the most users as they do not show the total review count

In [15]:
# print(gp_genres)
# print('\n')
# print(apple_genres)

print('Average number of reviews by genre: \n')
for genre in apple_genres:
    
    total = 0
    len_genre = 0
    
    for row in apple_clean_free:
        genre_app = row[11]
        
        if genre_app == genre:
            ratings = float(row[5])
            total += ratings
            len_genre += 1
            
    avg_ratings = total/len_genre
    print(genre, avg_ratings)

Average number of reviews by genre: 

Weather 52279.892857142855
Music 57326.530303030304
Reference 74942.11111111111
Sports 23008.898550724636
Photo & Video 28441.54375
Finance 31467.944444444445
Medical 612.0
Navigation 86090.33333333333
Social Networking 71548.34905660378
Travel 28243.8
Health & Fitness 23298.015384615384
Education 7003.983050847458
Productivity 21028.410714285714
Business 7491.117647058823
Games 22812.92467948718
Catalogs 4004.0
News 21248.023255813954
Utilities 18684.456790123455
Food & Drink 33333.92307692308
Entertainment 14029.830708661417
Lifestyle 16485.764705882353
Book 39758.5
Shopping 26919.690476190477


Make a recommendation for at least one app profile based on the information here

In [17]:
gp_categories = {}
gp_categories_list = []

for row in android_clean_free:
    genre = row[1]
    
    if genre in gp_categories:
        gp_categories[genre] += 1
    else: 
        gp_categories[genre] = 1
    
    gp_categories_list.append(genre)
    
print('Average number of installs by category: \n')
for category in gp_categories:
    total = 0
    len_category = 0
    
    for row in android_clean_free:
        category_app = row[1]
        
        if category_app == category:
            installs = row[5]
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            installs = float(installs)
            total += installs
            len_category += 1
            
    avg_installs = total/len_category
    print(category, avg_installs)
                                        

Average number of installs by category: 

SPORTS 3638640.1428571427
SHOPPING 7036877.311557789
BUSINESS 1712290.1474201474
TRAVEL_AND_LOCAL 13984077.710144928
FINANCE 1387692.475609756
ENTERTAINMENT 11640705.88235294
FAMILY 3695641.8198090694
MEDICAL 120550.61980830671
VIDEO_PLAYERS 24727872.452830188
COMMUNICATION 38456119.167247385
EDUCATION 1833495.145631068
BOOKS_AND_REFERENCE 8767811.894736841
HOUSE_AND_HOME 1331540.5616438356
GAME 15588015.603248259
MAPS_AND_NAVIGATION 4056941.7741935486
PERSONALIZATION 5201482.6122448975
ART_AND_DESIGN 1986335.0877192982
LIBRARIES_AND_DEMO 638503.734939759
COMICS 817657.2727272727
HEALTH_AND_FITNESS 4188821.9853479853
NEWS_AND_MAGAZINES 9549178.467741935
DATING 854028.8303030303
WEATHER 5074486.197183099
BEAUTY 513151.88679245283
FOOD_AND_DRINK 1924897.7363636363
EVENTS 253542.22222222222
PRODUCTIVITY 16787331.344927534
SOCIAL 23253652.127118643
PARENTING 542603.6206896552
TOOLS 10801391.298666667
LIFESTYLE 1437816.2687861272
PHOTOGRAPHY 1784011
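The install counts in the Google Play data are strings such as `'10,000+'`, which the cell above cleans inline before converting. As a sketch, the cleaning can be pulled into a helper, and the averages can be sorted so the largest categories appear first. The names `parse_installs` and `sort_by_average` are hypothetical; the three averages are copied from the output above.

```python
def parse_installs(installs):
    """Convert an install bucket such as '10,000+' to a float."""
    return float(installs.replace('+', '').replace(',', ''))

def sort_by_average(averages):
    """Return (category, average) pairs from highest to lowest average."""
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

print(parse_installs('10,000+'))

# Three averages copied from the output above
averages = {
    'MEDICAL': 120550.61980830671,
    'COMMUNICATION': 38456119.167247385,
    'GAME': 15588015.603248259,
}
for category, avg in sort_by_average(averages):
    print(category, avg)
```

Because each value is the lower bound of a bucket ('10,000+' means at least 10,000 installs), these averages are approximations, not exact install counts.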

Make at least one app profile recommendation for Google Play based on these results