# Profitable App Profiles for the Apple Store and Google Play Markets

In this project, we are determining what types of apps users will be more likely to download. 

This project involves working with f2p apps on the Apple Store and Google Play markets. The revenue generated from f2p apps primarily comes from users clicking on ads for the apps and subsequently downloading and using them.


In [None]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print("\n") #Adds a new line after each row
    
    if rows_and_columns:
        print("Number of rows: ", len(dataset))
        print("Number of columns: ", len(dataset[0]))

In [44]:
from csv import reader

data = open("AppleStore.csv")
apple = reader(data)
apple_data = list(data)

data2 = open("googleplaystore.csv")
google = reader(data)
google_data = list(data2)

NameError: name 'reader' is not defined

In [None]:
explore_data(apple_data, 1, len(apple_data), True)
explore_data(google_data, 1, len(apple_data), True)

### Details of the data (Apple Store and Google Play):

[Google Play Data](https://www.kaggle.com/lava18/google-play-store-apps)

[Apple Store Data](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

In [29]:
print(apple_data[0])
print(google_data[0])

id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic

App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver



In [31]:
print(google_data[10473])
print(google_data[0])

s
App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver



In [19]:
del google_data[10473]

The following code will determine what duplicate app entries (if any) the google play store data has. If it does not have any, the resulting duplicate array will be empty.

In [43]:
duplicate_apps = []
unique_apps = []

for app in google_data[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    

#### Part Two
Let's start by building the dictionary.

In [148]:
reviews_max = {}

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:

- We start by initializing two empty lists, android_clean and already_added.
- We loop through the android data set, and for every iteration:
    - We isolate the name of the app and the number of reviews.
    - We add the current row (app) to the android_clean list, and the app name    (name) to the already_added list if:
        - The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary; and
        - The name of the app is not already in the already_added list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps.

In [158]:
google_clean = []
already_added = []

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        google_clean.append(app)
        already_added.append(name) # make sure this is inside the if block

### Removing Non-English Apps

We will follow similar steps like we did with removing duplicates; we find the english apps and put them in their separate list

This function will check for app names that have english in them

In [160]:
def is_english(string):
    count = 0
    for character in string:
        if ord(character) > 127 and count != 3:
            count += 1
        elif ord(character) > 127 and count == 3:
            count = 0
            return False
    
    return True

In [161]:
google_english = []
apple_english = []

for app in google_clean:
    name = app[0]
    if is_english(name):
        google_english.append(app)
        
for app in apple_data[1:]:
    name = app[1]
    if is_english(name):
        apple_english.append(app)
        
explore_data(google_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of columns:  13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+'

### Isolating the Free Apps
As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.

In [162]:
google_final = []
apple_final = []

for app in google_english:
    price = app[7]
    if price == '0':
        google_final.append(app)
        
for app in apple_english:
    price = app[4]
    if price == '0.0':
        apple_final.append(app)


## Most Common Apps by Genre
### Part One
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

### Part Two
We'll build two functions we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function that we can use to display the percentages in a descending order

In [163]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [164]:
display_table(apple_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


As we can see, on the App Store, **Games, Entertainment, and Photo & Video**, are the genres of apps most likely to be marketed. Apps that are considered "Social" apps and ones that you would expect younger audiences to have tend to take up more spots on the store while apps you wouldn't expect your general audience to download take up less spots in the store. 

It looks like, for the Apple Store at least, creating a Game might be the way to go. However, that does not imply that apps of the Game genre generally have a large number of users, especially if the game is a Candy Crush ripoff. 

In [165]:
display_table(google_final, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The most common genres for the Google Play Store are **Family, Game, and Tools**. One major difference between this and the Apple Store genre frequency table is that there isn't one genre that dominates the others. Although Family is sitting at a nice 18.9% of the apps on the Google Play Store, the others are not far behind. 

I would say that the number of apps on the Google Play Store are near to normally distributed. Because of this thought, using this frequency table does not provide sufficient evidence for us to build a certain genre of app for the Google Play Store, at least at a glance.

In [51]:
apple_genre_freq = freq_table(apple_final, -5)

for genre in apple_genre_freq:
    total = 0 
    len_genre = 0
    
    for app in apple_final:
        genre_app = app[-5]
        if genre_app == genre:
            user_ratings = float(app[5])
            total += user_ratings
            len_genre += 1
    
    average_num_ratings = total / len_genre
    print(genre + ": " + str(average_num_ratings))


        

Lifestyle: 16485.764705882353
Book: 39758.5
Social Networking: 71548.34905660378
Health & Fitness: 23298.015384615384
Navigation: 86090.33333333333
Music: 57326.530303030304
Travel: 28243.8
Medical: 612.0
Utilities: 18684.456790123455
Weather: 52279.892857142855
Food & Drink: 33333.92307692308
Productivity: 21028.410714285714
Catalogs: 4004.0
Shopping: 26919.690476190477
News: 21248.023255813954
Games: 22788.6696905016
Photo & Video: 28441.54375
Business: 7491.117647058823
Education: 7003.983050847458
Finance: 31467.944444444445
Sports: 23008.898550724636
Entertainment: 14029.830708661417
Reference: 74942.11111111111


On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [52]:
for app in apple_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


However, this could work for us.

We could create a navigation app that allows for more touristy destinations with accurate information, since Waze and Google Maps are known to lead us in the wrong direction at times. 

## Most Popular Apps by Genre on Google Play
For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [54]:
display_table(google_final, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


One problem with this data is that is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes — we only want to get an idea which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).

In [56]:
google_genres_freq = freq_table(google_final, 1)

for genre in google_genres_freq:
    total = 0
    len_genre = 0
    for app in google_final:
        genre_app = app[1]
        if genre_app == genre:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_genre += 1
    avg_n_installs = total / len_genre
    print(genre, ':', avg_n_installs)

BEAUTY : 513151.88679245283
ENTERTAINMENT : 11640705.88235294
EDUCATION : 1833495.145631068
AUTO_AND_VEHICLES : 647317.8170731707
TOOLS : 10801391.298666667
BUSINESS : 1712290.1474201474
VIDEO_PLAYERS : 24727872.452830188
PERSONALIZATION : 5201482.6122448975
DATING : 854028.8303030303
LIFESTYLE : 1437816.2687861272
TRAVEL_AND_LOCAL : 13984077.710144928
MEDICAL : 120550.61980830671
MAPS_AND_NAVIGATION : 4056941.7741935486
GAME : 15588015.603248259
FOOD_AND_DRINK : 1924897.7363636363
ART_AND_DESIGN : 1986335.0877192982
EVENTS : 253542.22222222222
HOUSE_AND_HOME : 1331540.5616438356
WEATHER : 5074486.197183099
NEWS_AND_MAGAZINES : 9549178.467741935
COMICS : 817657.2727272727
PARENTING : 542603.6206896552
LIBRARIES_AND_DEMO : 638503.734939759
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
FINANCE : 1387692.475609756
HEALTH_AND_FITNESS : 4188821.9853479853
PRODUCTIVITY : 16787331.344927534
COMMUNICATION : 38456119.16724738

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed up by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs:

In [58]:
for app in google_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

A lot of the Communication apps are made or bought by the big tech companies, which in turn gets them a ton of installs. 

We could still make a Communication app since that genre is very integral to most people's phones, and coincides with the fact that Lifestyle apps take up most of the Apple Store.

# Conclusion

I have surmised that a Communication app would be the best route to take, or a Navigation app where it is use more for touristy purposes. 

I could have looked more into Books & Reference, but the number of installs and reviews for these apps are heavily skewed. Most good Book apps are paid, as well, and the free ones sometimes require subscriptions.

Game apps would have also been a great choice, but the market is so saturated with them, that it would take much longer than the time the company has in order to properly evaluate its performance.

Thank you for reading my Jupyer Notebook!