__Profitable App Profiles for the App Store and Google Play Markets__

This project investigates the Apple App Store (iOS) and Google Play (Android) markets to find out which kinds of apps are most profitable. The parameters for this research are cost, language and user interaction. For cost, apps are split into free apps and paid apps, each with and without ads. For language, we isolate and analyze apps aimed at the English-speaking demographic. User interaction is measured through the number of downloads and ratings of the apps themselves.

My goal is to determine the most profitable app profile, that is, which combination of these parameters offers a promising chance of success. To start, I'll load the iOS and Android store data and explore how each data set is structured.

In [2]:
from csv import reader

# Load each CSV into a list of rows; row 0 of each list is the header.
# The with-blocks close the files once they have been read.
with open('AppleStore.csv', encoding='utf8') as opened_file:
    apple_data = list(reader(opened_file))
with open('googleplaystore.csv', encoding='utf8') as opened_file2:
    google_data = list(reader(opened_file2))
##
##
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') #adds a new (empty) line after each row
    if rows_and_columns:
        print('Numbers of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
##
##
# explore_data prints the rows itself and returns None, so it is called without print()
explore_data(apple_data, 0, 4)
explore_data(google_data, 0, 4)
##
print(len(apple_data))
print(len(google_data))

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and u

After testing the explore function, identifying the column names and finding the length of each data set, we can begin cleaning the data. This means leaving out apps that do not apply to our study, specifically apps that are not free and apps that are not primarily in English.

We are told there is incorrect information in entry 10472 of the Google Play data (row 10473 once the header row is counted), which we will correct here. Printing the header next to the faulty row shows that the app's category is missing altogether, so every later value is shifted one column to the left. I delete this row below, since incomplete data would only add noise to the overall statistics.

In [3]:
print(google_data[0])
print('\n')
print(google_data[10473])
##
##
del google_data[10473]  # run this cell only once; a second run would delete a valid row

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
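
The faulty row above has 12 fields where the header has 13. A minimal sketch of a check that could surface such rows without knowing the index in advance is shown here; `find_malformed_rows` and the sample rows are hypothetical and not part of the original notebook:

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])  # dataset[0] is assumed to be the header row
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Tiny illustrative sample (hypothetical data with three columns)
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],  # category missing, so the row is one field short
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

Run against google_data before the deletion, a check like this would be expected to report row 10473, assuming no other rows have a column mismatch.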


In this step I'll create a loop that combs through the data searching for duplicate entries. If any are found, I'll remove the duplicates, leaving only the most current version in the data set. The most current entry is logically the one with the largest number of reviews, since review counts generally grow over time.

Below I've printed the rows for Snapchat to confirm that apps do appear multiple times in the Google Play data. As you can see, the number of reviews varies between entries, which is what lets the selection loop work as intended.

In [22]:
duplicate_entries = []
unique_entries = []

for app in google_data:
    name = app[0]
    if name == 'Snapchat':
        print(app)

for app in google_data:
    name = app[0]
    if name in unique_entries:
        duplicate_entries.append(name)
    else:
        unique_entries.append(name)

print('\n')
print('Number of duplicate apps:', len(duplicate_entries))
print('\n')
print('Names of duplicate apps:', duplicate_entries[:10])

['Snapchat', 'SOCIAL', '4.0', '17014787', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4.0', '17014705', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4.0', '17015352', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Snapchat', 'SOCIAL', '4.0', '17000166', 'Varies with device', '500,000,000+', 'Free', '0', 'Teen', 'Social', 'July 30, 2018', 'Varies with device', 'Varies with device']


Number of duplicate apps: 1181


Names of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


__Erasing Duplicates__

By creating a dictionary called reviews_max, we can store the largest number of reviews recorded for each app name. Here I create the empty dictionary, then loop through the data set (skipping the header row) and compare the number of reviews in each row against the value stored so far. The highest review count per app is kept in the dictionary while lower counts are ignored for now.

In [5]:
reviews_max = {}
for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))

9659


__Erasing Duplicates Part 2__

Here I've created two empty lists: android_clean for full app rows and already_added for app names. A row is added to android_clean, and its name to already_added, only if its number of reviews equals the maximum stored in reviews_max and the name has not been added yet. The second condition handles duplicates that share the same maximum review count.

In [6]:
android_clean = []
already_added = []
for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
print(len(android_clean))

9659
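
The two passes above can also be folded into one. The following is a minimal sketch under the same rule (keep the row with the most reviews, and the first such row on a tie); `dedupe_by_max_reviews` and the two-column sample are hypothetical, and the real data would use the default column indexes 0 and 3:

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' means the first row wins when review counts are tied
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Illustrative sample: [name, reviews] only, so reviews_index is 1 here
sample = [
    ['Snapchat', '17014787'],
    ['Box', '159'],
    ['Snapchat', '17015352'],
    ['Snapchat', '17000166'],
]
print(dedupe_by_max_reviews(sample, reviews_index=1))
```

One difference from the two-pass version: the result is ordered by the first appearance of each name rather than by the position of the kept row, which does not matter for the counts used later.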


__Adjusting the Audience__

This project is directed towards the English-speaking market, which means we can assume that apps whose names are not in English do not suit our needs. In this next section I'll create a function that detects whether an app name is in English or not.

ASCII assigns the characters commonly used in English text the code points 0 to 127. Combined with the ord() function, this lets us tell whether an app name is likely to be in English. Unfortunately, emojis and special characters such as ™ have code points above 127, so a name containing even one of them is flagged as non-English, as the last two examples in the next cell show.

In [7]:
def english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True
print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Instachat 😜'))
print(english('Docs To Go™ Free Office Suite'))

True
False
False
False


In [8]:
def english(string):
    non_english = 0
    for character in string:
        if ord(character) > 127:
            non_english += 1
    if non_english > 3:
        return False
    return True
print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Instachat 😜'))
print(english('Docs To Go™ Free Office Suite'))
print('\n')

True
False
True
True




I've amended the previous function so that it allows up to 3 characters outside the ASCII range, for the sake of emojis and special characters. Some non-English apps may still get past the filter, but it removes the majority of them. I've now applied it to both the Android data set and the iOS data set below.
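
The cut-off of 3 is a judgment call, so it can help to make it a parameter and see how the result changes. A minimal sketch follows; `is_mostly_ascii` and `max_non_ascii` are hypothetical names, not part of the original notebook:

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """True if at most `max_non_ascii` characters fall outside ASCII (0-127)."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Instachat 😜',
    'Docs To Go™ Free Office Suite',
]
for name in names:
    # Compare the strict filter (0 allowed) with the tolerant one (3 allowed)
    print(name, is_mostly_ascii(name, max_non_ascii=0), is_mostly_ascii(name))
```

With the default of 3 this should agree with the amended english() function on the four test names, while max_non_ascii=0 reproduces the stricter first version.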

In [9]:
android_english = []
apple_english = []

for app in android_clean:
    name = app[0]
    if english(name):
        android_english.append(app)

for app in apple_data:
    name = app[1]
    if english(name):
        apple_english.append(app)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Numbers of rows: 9614
Number of columns: 13


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instag

__Free Apps__

Now that I've removed the data that is irrelevant to this project (inaccurate entries, duplicate entries and non-English apps), I can clear the last hurdle, which is isolating the free apps. For the Google Play data I keep rows whose Type column is 'Free', and for the App Store data I keep rows whose price is '0.0'.

In [10]:
final_android = []
final_apple = []

for app in android_english:
    app_type = app[6]  # 'Type' column ('Free' or paid), not the price itself
    if 'Free' in app_type:
        final_android.append(app)
        
for app in apple_english:
    price = app[4]
    if price == '0.0':
        final_apple.append(app)
        
print(len(final_android))
print(len(final_apple))

8863
3222
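
An alternative to matching on the Type column is to compare prices numerically. The sketch below assumes that prices are stored as strings such as '0', '0.0' or '$4.99'; only '0' and '0.0' appear in the rows shown above, so the dollar-sign format is an assumption about the rest of the data. `parse_price` and `is_free` are hypothetical helpers:

```python
def parse_price(price_text):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(price_text.replace('$', '').strip())


def is_free(price_text):
    """True when the parsed price is exactly zero."""
    return parse_price(price_text) == 0.0


for text in ['0', '0.0', '$4.99']:
    print(text, parse_price(text), is_free(text))
```

Filtering on price instead of Type may give a slightly different count if any row has a zero price but a Type other than 'Free'.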


The goal of this research is to find an app profile that succeeds in both markets, giving us a path to a release strategy that covers both. Our strategy is to build and launch a base Android app, track its progress over a period of 6 months, and release it to the iOS market if it proves viable.

We can use the price, genre, review and download columns to gauge which apps have the most promise on either store.

Here I've created two functions: freq_table builds a frequency table, expressed as percentages, for any column, and display_table prints it in descending order, giving a usable list of which app genres and categories are most frequent in each store.

In [11]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percent = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percent[key] = percentage
    return table_percent
##
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
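
For comparison, the same percentage table can be built with collections.Counter from the standard library. This is a sketch of an equivalent approach, not the notebook's method; `percent_table` and the sample rows are hypothetical:

```python
from collections import Counter


def percent_table(rows, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # most_common() yields values from most to least frequent
    return {value: count / total * 100 for value, count in counts.most_common()}


# Illustrative sample: [name, genre]
sample = [
    ['a', 'Games'],
    ['b', 'Games'],
    ['c', 'Music'],
    ['d', 'Games'],
]
for genre, share in percent_table(sample, 1).items():
    print(genre, ':', share)
```

Because dictionaries keep insertion order, the result is already sorted from largest to smallest share and needs no separate display step.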

__Analytics__

Now that the functions are built, we can apply them to the prime_genre column of the iOS data and to the Category and Genres columns of the Google Play data. Printed below are the prime_genre shares of the free English apps on the App Store. Games takes up 58%, followed by Entertainment at nearly 8% and Photo & Video at nearly 5%.

With this information we know that the most common genre by a wide margin is Games. However, just because the App Store is flooded with games designed for fun doesn't mean they are in the most demand. It's hard to tell from this data alone whether an app profile for any of these genres would be recommended. There is a clear skew towards the Games genre, but additional data such as downloads, daily users and other trends would be needed to make a decision.

In [12]:
display_table(final_apple, -5) #prime_genres | iOS

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Below I've printed the Google Play Store equivalent. Here we can see a much more even distribution, with the Family category holding the largest share (18.90%), followed by Game at 9.73% and Tools at 8.46%.

In [13]:
display_table(final_android, 1)  #Categories / Google

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

Printed below are the genres of the Google Play Store.

In [14]:
display_table(final_android, -4) #Genres

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

A considerable difference between this and the categories is that the apps are spread over many more values, so the distribution is flatter, with the top 3 genres at 8.45% (Tools), 6.07% (Entertainment) and 5.35% (Education). Another large difference is that the Game and Family categories, which together made up about 28.6% of the categories, have been split into various sub-categories such as Arcade, Racing, etc. Of the game genres the most prolific sub-category is Action at 3.10%, 2nd is

As with the iOS app genres, this data shows where app developers are concentrating their content and app profiles, but it doesn't indicate which profiles are successful in the long run and which are fads.

__App Analysis Part 2__

Since the frequency tables give only a partial picture, we'd like to bring in a second metric. Ideally that would be the number of installs, but the App Store data set has no installs column. Instead, we can use the total number of ratings per app (rating_count_tot) as a rough proxy for the number of installs.

In [15]:
apple_genre = freq_table(final_apple, -5) #frequency table for prime genre of apple store
for genre in apple_genre:
    total = 0
    len_genre = 0
    for app in final_apple:
        genre_app = app[-5]
        if genre == genre_app:
            ratings = float(app[5])
            total += ratings
            len_genre += 1
            
    average_ratings = (total/len_genre)
    print(genre, ':', average_ratings)
        
        

Sports : 23008.898550724636
Finance : 31467.944444444445
Music : 57326.530303030304
Food & Drink : 33333.92307692308
Photo & Video : 28441.54375
Utilities : 18684.456790123455
Games : 22788.6696905016
Catalogs : 4004.0
Social Networking : 71548.34905660378
Shopping : 26919.690476190477
Reference : 74942.11111111111
Education : 7003.983050847458
Weather : 52279.892857142855
Productivity : 21028.410714285714
Lifestyle : 16485.764705882353
Book : 39758.5
Health & Fitness : 23298.015384615384
Navigation : 86090.33333333333
Travel : 28243.8
Entertainment : 14029.830708661417
Business : 7491.117647058823
Medical : 612.0
News : 21248.023255813954
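
The nested loop above scans the whole data set once per genre. A minimal sketch of a single-pass alternative is shown here; `average_by_group` and the sample rows are hypothetical, and for the iOS data the indexes would be -5 (prime_genre) and 5 (rating_count_tot):

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` within each group, in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Illustrative sample: [name, genre, rating count]
sample = [
    ['A', 'Games', '100'],
    ['B', 'Games', '300'],
    ['C', 'Music', '50'],
]
for genre, average in average_by_group(sample, 1, 2).items():
    print(genre, ':', average)
```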


One drawback of averaging this way is that a few genres are dominated by tech giants. For example, Facebook, Snapchat and Reddit presumably account for a large portion of the Social Networking ratings, and the same could be said of the effect Waze and Google Maps have on the Navigation genre.
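
One way to reduce the pull of a few very large apps is to compare medians instead of means. The sketch below is not part of the original analysis; `median_by_group` and the sample figures are hypothetical and chosen only to show the effect:

```python
from collections import defaultdict
from statistics import mean, median


def median_by_group(rows, group_index, value_index):
    """Median of the numeric column `value_index` within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_index]].append(float(row[value_index]))
    return {group: median(values) for group, values in groups.items()}


# Illustrative sample: one giant app and two small ones in the same genre
sample = [
    ['Giant', 'Social Networking', '1000000'],
    ['Small 1', 'Social Networking', '200'],
    ['Small 2', 'Social Networking', '300'],
]
values = [float(row[2]) for row in sample]
print('mean  :', mean(values))                    # pulled up by the giant app
print('median:', median_by_group(sample, 1, 2))   # reflects the typical app
```

In this sample the mean is 333500.0 while the median is 300.0, which illustrates how a single dominant app can make a genre look far busier than it is for a typical entrant.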

An app profile I'd recommend for the Apple market based on this project is one in the Photo & Video genre, as it sits in the middle of the road with an average of about 28 thousand ratings per app and comprises only 4.97% of the free English iOS apps. Taken together, these two statistics suggest a market with healthy demand and relatively few competitors to overcome.

In [16]:
android_category = freq_table(final_android, 1)
for category in android_category:
    total = 0
    len_category = 0
    for app in final_android:
        category_app = app[1]
        if category == category_app:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    average_installs = (total/len_category)
    print(category, ':', average_installs)
   


SHOPPING : 7036877.311557789
FOOD_AND_DRINK : 1924897.7363636363
TRAVEL_AND_LOCAL : 13984077.710144928
FAMILY : 3697848.1731343283
EVENTS : 253542.22222222222
VIDEO_PLAYERS : 24727872.452830188
HOUSE_AND_HOME : 1331540.5616438356
AUTO_AND_VEHICLES : 647317.8170731707
COMICS : 817657.2727272727
DATING : 854028.8303030303
HEALTH_AND_FITNESS : 4188821.9853479853
COMMUNICATION : 38456119.167247385
SOCIAL : 23253652.127118643
ART_AND_DESIGN : 1986335.0877192982
PHOTOGRAPHY : 17840110.40229885
WEATHER : 5074486.197183099
PERSONALIZATION : 5201482.6122448975
NEWS_AND_MAGAZINES : 9549178.467741935
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
MEDICAL : 120550.61980830671
SPORTS : 3638640.1428571427
GAME : 15588015.603248259
FINANCE : 1387692.475609756
EDUCATION : 1833495.145631068
BEAUTY : 513151.88679245283
MAPS_AND_NAVIGATION : 4056941.7741935486
PRODUCTIVITY : 16787331.344927534
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
BOOKS_AND_REFERENCE : 8767811
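
The values in the Installs column are open-ended buckets such as '10,000+', so the averages above are lower bounds rather than exact counts. The string clean-up used in the loop can be pulled out into a small helper, sketched here; `parse_installs` is a hypothetical name, not part of the original notebook:

```python
def parse_installs(text):
    """Convert an installs bucket such as '10,000+' to its lower bound as an int."""
    return int(text.replace(',', '').replace('+', ''))


for bucket in ['10,000+', '5,000,000+', '500,000,000+']:
    print(bucket, '->', parse_installs(bucket))
```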

__Android App Recommendation__

Ideally I'd recommend an app in a category with a large number of installs that makes up only a relatively small part of the overall Google Play Store. Again avoiding categories dominated by large tech names such as Facebook or Waze, I'd look for mid-range numbers on both measures.

I'd recommend an app targeted at the Health and Fitness category, as it comprises only 3.08% of the free English apps on the Google Play Store and averages approximately 4.19 million installs per app. There aren't many prominent apps dominating this sector, and with quarantine a reality in America an up-and-coming app could see promising revenue.
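
The selection rule described in this section (a small share of the store combined with mid-range installs) can be written down as a filter. The sketch below uses figures rounded from the outputs earlier in this notebook for a handful of categories; the function `shortlist` and all three thresholds are arbitrary assumptions chosen for illustration:

```python
# category -> (share of free English apps in %, average installs per app),
# rounded from the frequency table and average-installs outputs above
categories = {
    'FAMILY': (18.90, 3697848),
    'COMMUNICATION': (3.24, 38456119),
    'HEALTH_AND_FITNESS': (3.08, 4188822),
    'MEDICAL': (3.53, 120551),
    'WEATHER': (0.80, 5074486),
}


def shortlist(categories, max_share=5.0, min_installs=1_000_000,
              max_installs=10_000_000):
    """Categories with a small market share and mid-range average installs."""
    picks = [
        name for name, (share, installs) in categories.items()
        if share <= max_share and min_installs <= installs <= max_installs
    ]
    # Highest average installs first
    return sorted(picks, key=lambda name: categories[name][1], reverse=True)


print(shortlist(categories))
```

With these thresholds both WEATHER and HEALTH_AND_FITNESS pass the filter, so the final choice between such candidates still rests on judgment about the market rather than on the numbers alone.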