<h3>**Market Share by Genre in Apple Store & Google Play Store**</h3>

The purpose of this project is to examine prevailing trends in the app stores to decide where to focus our development next in order to capture the largest market share for our app.

Our goal is to narrow the proposed app down to a specific App Store/Play Store category based on market size and popularity.

Due to the size and scope of the number of apps in the stores, we'll focus on using data sets that contain [10,000 Play Store apps](https://www.kaggle.com/lava18/google-play-store-apps/home) and [7,000 iOS apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

The following function displays a slice of a data table, so we can check that later steps processed the data correctly:

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Now we will import the datasets and slice off the header rows for easier reference.

In [2]:
import csv
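# Import both store datasets, parse them with the CSV reader,
# then convert them to lists of lists.
# encoding="utf8" avoids platform-dependent default encodings,
# and the with-blocks close each file once it has been read.
with open("googleplaystore.csv", encoding="utf8") as raw_google_read:
    android = list(csv.reader(raw_google_read))
with open("AppleStore.csv", encoding="utf8") as raw_apple_read:
    apple = list(csv.reader(raw_apple_read))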

#remove Header rows
android_header=android[0]
android=android[1:]
apple_header=apple[0]
apple=apple[1:]



Let's first explore the Google Play dataset:


In [3]:
print(android_header)
explore_data(android, 0,5, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Every

It looks like the Google Play dataset has 10,841 apps with 13 columns of data. We'll focus our analysis on the columns labeled 'App', 'Category', 'Rating', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.


In [4]:
print(apple_header)
explore_data(apple, 0,5, rows_and_columns=True)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


Looking at the iOS dataset, it looks like we've got 7,197 apps with 16 columns of data. The columns 'track_name', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', and 'prime_genre' may be most useful for analysis.


<H2>**Cleaning Data**</h2>

According to the dataset's discussion page, one of the rows (index 10472) is malformed: the entry for "Life Made WI-Fi Touchscreen Photo Frame". Below we'll confirm this, and delete the row if so.

In [5]:
i=0
for rows in android:
    i+=1  
    if rows[0].startswith("Life Made"):
        print(i)        
print(android_header, "\n", android[10472], "\n")


10473
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 
 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 



According to this, the Life Made Wi-Fi Touchscreen Photoframe App is 10473 iterations through the list (index 10472), as expected. 

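If we compare the header to the row data, we'll see the row has only 12 values instead of 13: the category is missing, so every later value is shifted one column to the left (the 'Category' position holds '1.9' and the 'Rating' position holds '19'). This makes it useless for our analysis, so we will delete it from the dataset.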
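
A row like this can also be found without knowing its name in advance, by comparing each row's length to the header's. The following is a minimal sketch on toy data; the function name `find_malformed` and the sample rows are hypothetical stand-ins, not part of the original notebook.

```python
# Toy data standing in for the real header and rows (hypothetical values).
header = ['App', 'Category', 'Rating']
rows = [
    ['Sketch', 'ART_AND_DESIGN', '4.5'],
    ['Broken Frame', '1.9'],            # category missing, values shifted left
    ['Box', 'BUSINESS', '4.2'],
]

def find_malformed(rows, header):
    """Return (index, row) pairs whose length differs from the header's."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != len(header)]

malformed = find_malformed(rows, header)
print(malformed)
```

This only catches rows with a missing or extra field; a row with the right number of fields but wrong values would still need a separate check.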

In [6]:
print("current length of android file: ",len(android))

del (android[10472])

# Check to make sure its gone:
print("updated length of android file: ", len(android))

current length of android file:  10841
updated length of android file:  10840


**<h2>De-Duplicating Google Play Data</h2>**

It turns out, upon further examination, that the Google Play dataset contains duplicate apps (e.g., Instagram appears multiple times). First we'll print the Instagram rows, then count how many entries are duplicates:

In [7]:
for rows in android:
    if rows[0] == "Instagram":
        print(rows)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [8]:
duplicate_apps=[]
unique_apps=[]

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print("Duplicate Apps: ", len(duplicate_apps))
print("Unique Apps: ", len(android)-len(duplicate_apps))
print("Examples of duplicate apps: ")
print(duplicate_apps[0:15])


Duplicate Apps:  1181
Unique Apps:  9659
Examples of duplicate apps: 
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


We can see based on this code, that of the 10,840 remaining apps, there are 9,659 unique apps and 1,181 duplicate apps. 

Since these duplicates can be older snapshots or spoofs, we'll keep the entry with the highest number of reviews for each app, since that one is likely to be the original and the most recent.

To do this, we'll build a dictionary, "reviews_max", that maps each app name to its highest review count. The first time an app name appears, it is added to the dictionary with its review count. If the same name appears again, the two review counts are compared: the stored value is kept if it is higher, and replaced if the new row has more reviews.

In [9]:
reviews_max = {}

for apps in android:
    name = apps[0]
    n_reviews = float(apps[3])
    
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    if (name not in reviews_max):
        reviews_max[name]= n_reviews
        
#print the length to make sure it matches the expected
#"Unique Apps" count derived above:
print("length of individual apps: ", len(reviews_max))

length of individual apps:  9659


Next, we'll use the dictionary of highest review counts to filter the main list. A row is appended only if its review count equals the highest count recorded for that app and the app has not already been added:

In [10]:
android_clean = []
already_added = []

for apps in android:
    name = apps[0]
    n_reviews = float(apps[3])
    
    if (n_reviews == reviews_max[name] and (name not in already_added)):
        android_clean.append(apps)
        already_added.append(name)
        
print("android clean length: ",len(android_clean), "\n")

android clean length:  9659 



We can see that the clean android list now has 9,659 rows, matching the unique apps count above.
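
The two passes above can also be folded into a single pass by storing the whole row in a dictionary keyed by app name. This is a sketch on toy data, not the notebook's method; the function name `dedupe_by_reviews`, its parameters, and the 'Box' row are hypothetical.

```python
def dedupe_by_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Replace only on a strictly higher count, so ties keep the first row seen.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows with the same column positions as the Play Store data
# (name at index 0, review count at index 3).
rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
clean = dedupe_by_reviews(rows)
print(clean)
```

One difference to be aware of: the result is ordered by the first appearance of each name, which may differ from the row order produced by the two-pass version.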

<h2>Removing Non-English language apps</h2>

Now that we have a dataset of android apps without duplicates, we need to remove any apps that are clearly not in English. This includes apps whose names are written in Cyrillic, Kanji, Arabic, etc.

Because all the characters commonly used in English text fall within the ASCII range (code points 0 through 127), a character with a code point above 127 is not part of standard English text. To let names with a few special characters through, such as the trademark symbol ™, we'll allow up to 3 characters outside the ASCII range.

Although this method is not perfect, it should provide a good "Close Enough" list. 

In [11]:
def lang_check(check_string):
    threshold = 0
    for letter in check_string:
        if ord(letter)>127:
            threshold+=1
    if threshold > 3:
        return False
    else:
        return True
    
android_english = []
apple_english = []

for apps in android_clean:
    name = apps[0]
        
    if lang_check(name) == True:
        android_english.append(apps)

for apps in apple:
    name = apps[1]
        
    if lang_check(name) == True:
        apple_english.append(apps)
        
        
print("android english length: ",len(android_english), "\n")
print("apple english length: ",len(apple_english), "\n")


android english length:  9614 

apple english length:  6183 



Now that we've removed inaccurate data, duplicate apps and non-English apps, we are left with 9,614 android apps and 6,183 iOS apps. Next, we will narrow these down to just the free apps.

<h2>Isolating the Free Apps</h2>

Next, we will isolate only apps where the cost is free from the subset of the previous apps. 

In [12]:
android_final=[]
apple_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in apple_english:
    price = app[4]
    if price == '0.0':
        apple_final.append(app)
        
print("Number of free, English-Only android Apps: ",len(android_final))
print("Number of free, English-Only Apple iOS Apps: ", len(apple_final))

Number of free, English-Only android Apps:  8864
Number of free, English-Only Apple iOS Apps:  3222
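
The string comparisons above rely on each store's exact formatting ('0' for Google Play, '0.0' for iOS). A slightly more tolerant check could convert the price to a number first. This is a sketch only; the helper name `is_free` is hypothetical, and it assumes every price is numeric once a leading dollar sign is stripped.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' is zero."""
    # Assumption: prices are plain numbers, optionally prefixed with '$'.
    return float(price_text.strip().lstrip('$')) == 0.0

for price in ['0', '0.0', '$4.99', '2.99']:
    print(price, '->', is_free(price))
```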


<H2>Determining App Frequency By Genre</H2>

Now that we've cleaned the data and narrowed the apps to only the free apps, it's time to analyze the data by looking at the number of apps in each category in these datasets. To do this, we'll create two functions: one that takes a column from a table and turns it into a frequency table (a dictionary of percentages), and another that displays those percentages sorted from highest to lowest.

Using that, we'll be able to look at which genres/types of apps in App store are the most popular.
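
For reference, the same frequency-table idea can be sketched with the standard library's `collections.Counter`, which counts and sorts in one step. The function name `freq_percentages` and the toy rows are hypothetical; the notebook's own hand-written functions are what the analysis actually uses.

```python
from collections import Counter

def freq_percentages(rows, index):
    """Return (value, percentage) pairs for one column, most common first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Toy rows: app name at index 0, genre at index 1.
rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_percentages(rows, 1)
print(table)
```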

In [13]:
#returns a frequency table based on
#user input of a Dataset and a Index column:
def freq_table(dataset, index):
    output_table= {}
    length=len(dataset)
   
    for info in dataset:
        item = info[index]
        if item in output_table:
            output_table[item]+=1
        else: 
            output_table[item]=1    
    
    table_percents = {}
    for key in output_table:
        percent = (output_table[key]/length)*100
        table_percents[key]= percent
    return table_percents   


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


First, we'll look at the prime_genre column (index 11) in the iOS data:

In [14]:
display_table(apple_final, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Based on the above, 58.16% of the free English-language iOS apps are games, followed by Entertainment (7.88%) and Photo & Video (4.97%).

Based on the number of apps alone, though, we can't tell which genres have the most users, only which genres have the most apps. A genre with fewer apps might still be more marketable.


In [15]:
display_table(android_final, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Based on the above data from the Play Store apps, Family is the leading category with 18.91% of the share, followed by Game and Tools (9.72% and 8.46% respectively).

In [16]:
display_table(android_final, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The above shows that Tools (8.45%), Entertainment (6.07%), and Education (5.35%) are the most common genres. Based on the above two frequency tables, the Play Store market appears far more diverse and fragmented than the App Store's. It is also less dominated by gaming apps, indicating there might be more potential for a variety of successful apps.

To analyze further, we should look at how many users each genre attracts. The iOS dataset has no install counts, so we'll use the total number of ratings (rating_count_tot, index 5) as a proxy and average it per genre:


In [17]:
apple_prime_genre = freq_table(apple_final, 11)

for genre in apple_prime_genre:
    total=0
    len_genre=0

    for apps in apple_final:
        genre_app = apps[11]
        if genre_app == genre:
            total += float(apps[5])
            len_genre +=1
    print( genre, ": ", total/len_genre)                    
            
            

Shopping :  26919.690476190477
Business :  7491.117647058823
Navigation :  86090.33333333333
Social Networking :  71548.34905660378
Education :  7003.983050847458
Games :  22788.6696905016
Lifestyle :  16485.764705882353
Utilities :  18684.456790123455
News :  21248.023255813954
Food & Drink :  33333.92307692308
Weather :  52279.892857142855
Travel :  28243.8
Photo & Video :  28441.54375
Health & Fitness :  23298.015384615384
Sports :  23008.898550724636
Music :  57326.530303030304
Reference :  74942.11111111111
Book :  39758.5
Medical :  612.0
Finance :  31467.944444444445
Entertainment :  14029.830708661417
Productivity :  21028.410714285714
Catalogs :  4004.0


From the above, the genre with the largest average user base (measured by rating count) is actually Navigation (86,090), followed by Reference (74,942), Social Networking (71,548), and Music (57,326).

I would suggest avoiding social networking and music apps, because such apps are usually built around a platform: a Facebook app, for instance, is only useful for interacting with the Facebook service. Unless we were to start a service of our own, that demand would be hard to capitalize on. The same applies to music.

Navigation apps can use open-source data, and thus would be a good area for further investigation.

**Google Play**

Below, we actually have install data from the Play Store, so we can examine installs directly. However, the install counts are not precise: they are open-ended buckets, so "1,000,000+" could mean 1,000,001 installs or nearly 5 million.

As a rough estimate, though, it should provide enough direction. First we'll strip the extra characters (the plus signs and the commas), then convert the values to numbers.
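
The clean-up step on its own can be sketched as a small helper. The name `parse_installs` is hypothetical; it treats each bucket as its lower bound, which is an assumption rather than a true install count.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to its lower bound as an int."""
    return int(text.replace(',', '').replace('+', ''))

for bucket in ['10,000+', '1,000,000+', '0']:
    print(bucket, '->', parse_installs(bucket))
```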

In [42]:
categories = freq_table(android_final, 1)

for category in categories:
    total=0
    len_genre=0

    for apps in android_final:
        category_app = apps[1]
        if category_app == category:
            installs = apps[5]
            installs = installs.replace(',', '').replace('+','')
            total += float(installs)
            len_genre +=1
    print( category , ": ", total/len_genre)                 



LIBRARIES_AND_DEMO :  638503.734939759
PHOTOGRAPHY :  17840110.40229885
HOUSE_AND_HOME :  1331540.5616438356
TRAVEL_AND_LOCAL :  13984077.710144928
EDUCATION :  1833495.145631068
SHOPPING :  7036877.311557789
MEDICAL :  120550.61980830671
ENTERTAINMENT :  11640705.88235294
COMMUNICATION :  38456119.167247385
AUTO_AND_VEHICLES :  647317.8170731707
VIDEO_PLAYERS :  24727872.452830188
PRODUCTIVITY :  16787331.344927534
SPORTS :  3638640.1428571427
TOOLS :  10801391.298666667
BUSINESS :  1712290.1474201474
PERSONALIZATION :  5201482.6122448975
LIFESTYLE :  1437816.2687861272
FINANCE :  1387692.475609756
ART_AND_DESIGN :  1986335.0877192982
GAME :  15588015.603248259
NEWS_AND_MAGAZINES :  9549178.467741935
BOOKS_AND_REFERENCE :  8767811.894736841
WEATHER :  5074486.197183099
HEALTH_AND_FITNESS :  4188821.9853479853
BEAUTY :  513151.88679245283
MAPS_AND_NAVIGATION :  4056941.7741935486
PARENTING :  542603.6206896552
COMICS :  817657.2727272727
SOCIAL :  23253652.127118643
FAMILY :  3695641.8

It looks like the travel and local and productivity categories both rank in the top ten by average installs, so these are likely to be a good, profitable cross-section to pursue for our project.
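
To read a ranking off without scanning the output by eye, the averages can be collected in a single pass and sorted from highest to lowest. This is a minimal sketch on toy data; the function name `average_by_key` and the sample rows are hypothetical.

```python
from collections import defaultdict

def average_by_key(rows, key_idx, value_idx):
    """Average a numeric column per key and return (key, average) pairs, highest first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_idx]
        # Strip the commas and plus sign used in install buckets like '1,000+'.
        value = float(row[value_idx].replace(',', '').replace('+', ''))
        totals[key] += value
        counts[key] += 1
    averages = {key: totals[key] / counts[key] for key in totals}
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)

# Toy rows: name, category, installs.
rows = [
    ['A', 'TOOLS', '1,000+'],
    ['B', 'TOOLS', '3,000+'],
    ['C', 'GAME', '10,000+'],
]
ranking = average_by_key(rows, 1, 2)
print(ranking[:10])  # top ten categories (fewer if the data has fewer)
```

Averages like these are sensitive to a handful of very large apps in a category, so a high average does not by itself show that a typical app in that category is widely installed.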
