# Popularity Analysis of Free-to-Play App Categories

The goal of this project is to do a basic data analysis of the summary data available from the Google Play Store and the Apple iOS Mobile App Store.

These data sets are available for download here:

 * [Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps/home)
 * [Apple iOS Store Apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)
 
This project will provide basic insight into the characteristics that are correlated with higher download rates of popular apps. 

### Data Exploration

In [33]:
from csv import reader

# Load the iOS data; the first row is the header
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios_data = list(reader(opened_file))
ios_header = ios_data[0]
ios_data = ios_data[1:]

# Load the Google Play data the same way
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    ggl_data = list(reader(opened_file))
ggl_header = ggl_data[0]
ggl_data = ggl_data[1:]

In [34]:
# creates a function to view selected rows of the data set
# or to print the length of the data set 
# function assumes the data set does not have a header

def explore_data(dataset, start, end, print_count=False):
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n') # creates space between rows
    if print_count:
        print('Number of rows:', len(dataset))
        print('Number of columns', len(dataset[0]))

In [35]:
print(ggl_header)
print(ggl_data[1:5])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
[['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']]


In [36]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [37]:
explore_data(ggl_data,0,2,print_count=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns 13


In [38]:
explore_data(ios_data,0,2,print_count=True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns 16


We need to remove one bad data entry in the Google data set that the discussion page lists as line number 10,472.

### Data Cleaning

The code below deletes the corrupted Google Play Store entry at index 10472 of `ggl_data` and prints the length before and after to confirm that exactly one row was removed.

In [39]:
print(len(ggl_data))
del ggl_data[10472]
print(len(ggl_data))

10841
10840
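
Rather than relying only on the row number quoted on the discussion page, a malformed row can often be located by comparing each row's length with the header's length. The sketch below assumes the corrupted entry has a different number of fields than the header; the data and the `find_malformed_rows` helper are hypothetical and only illustrate the idea.

```python
# Toy header and rows (hypothetical data, not taken from the real files)
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # a field is missing, so the row is too short
    ['App C', 'GAME', '4.5'],
]

def find_malformed_rows(dataset, expected_len):
    """Return the indices of rows whose length differs from expected_len."""
    return [i for i, row in enumerate(dataset) if len(row) != expected_len]

bad_indices = find_malformed_rows(rows, len(header))
print(bad_indices)

# Delete from the end so that earlier indices stay valid
for i in reversed(bad_indices):
    del rows[i]
print(len(rows))
```

Deleting by index is not idempotent: re-running a cell like `del ggl_data[10472]` removes a second, valid row, which a length check of this kind would avoid.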


The discussion page for the Google Play Store data reports some duplicate entries in that data set. We need to check for duplicates and remove every duplicate entry except the most recent one. On the assumption that review counts grow over time, we can treat the entry with the highest number of reviews as the most recent. We should also check the iOS data, even though its discussion page does not mention duplicates.

In [40]:
unique_apps_ggl = []
duplicate_apps_ggl = []

for i in ggl_data:
    name = i[0]
    if name in unique_apps_ggl:
        duplicate_apps_ggl.append(name)
    else:
        unique_apps_ggl.append(name)
        
print('# of unique google apps: ', len(unique_apps_ggl))
print('# of duplicate google apps: ', len(duplicate_apps_ggl))

unique_apps_ios = []
duplicate_apps_ios = []

for i in ios_data:
    name = i[0]
    if name in unique_apps_ios:
        duplicate_apps_ios.append(name)
    else:
        unique_apps_ios.append(name)
        
print('# of unique apple apps: ', len(unique_apps_ios))
print('# of duplicate apple apps: ', len(duplicate_apps_ios))

# of unique google apps:  9659
# of duplicate google apps:  1181
# of unique apple apps:  7197
# of duplicate apple apps:  0


In [41]:
for i in ggl_data:
    name = i[0]
    if name == 'Facebook':
        print(i)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


It appears only the Google data set contains duplicate entries. Printing a likely duplicate shows that the only column that differs is the review count: 78,158,306 versus 78,128,208 in this example.

Because there are 1,181 duplicate entries in the Google data set, after their removal the length of the data set should be reduced from 10,840 to 9,659.
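
The arithmetic above (10,840 rows = 9,659 unique names + 1,181 duplicates) can be cross-checked with `collections.Counter`, which avoids the slow `name in list` lookups. This is a minimal sketch on toy rows: the two Facebook review counts come from the output above, and the other rows are hypothetical.

```python
from collections import Counter

# Toy rows in the form [name, reviews]; only the Facebook rows are real values
rows = [
    ['Facebook', '78158306'],
    ['Facebook', '78128208'],
    ['App X', '120'],
    ['App X', '125'],
    ['App X', '119'],
    ['App Y', '40'],
]

name_counts = Counter(row[0] for row in rows)

n_unique = len(name_counts)          # distinct app names
n_duplicates = len(rows) - n_unique  # extra rows beyond the first per name
print(n_unique, n_duplicates)

# Names that occur more than once
repeated = [name for name, count in name_counts.items() if count > 1]
print(repeated)
```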

In [42]:
reviews_max = {}

for i in ggl_data:
    name = i[0]
    n_reviews = float(i[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max.update({name:n_reviews})
        
print(len(reviews_max))

9659


Above, we built the reviews_max dictionary, which maps each app name to its highest review count. If a name was already in the dictionary, its review count was updated only when the new count was higher. If it was not in the dictionary, we added the name and review count.

Next, we can create a new, clean data set using an empty list and a ggl_added list to check if we've already added the data. The two lists should be the same length to make sure we've added everything properly, and they should match our expected length.

In [43]:
ggl_clean = []
ggl_added = []

for i in ggl_data:
    name = i[0]
    n_reviews = float(i[3])
    if n_reviews == reviews_max[name] and name not in ggl_added:
        ggl_clean.append(i)
        ggl_added.append(name)

print(len(ggl_clean))
print(len(ggl_added))

9659
9659
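
The two passes above (find the maximum, then collect matching rows) can also be folded into one pass by storing the best row per name in a dictionary. The sketch below is one possible alternative, not the method used in this notebook; the Facebook rows are abbreviated from the output above and the third row is hypothetical.

```python
# Toy rows in the form [name, category, rating, reviews]
rows = [
    ['Facebook', 'SOCIAL', '4.1', '78158306'],
    ['Facebook', 'SOCIAL', '4.1', '78128208'],
    ['Other App', 'TOOLS', '4.0', '100'],   # hypothetical row
]

best_rows = {}
for row in rows:
    name = row[0]
    n_reviews = int(row[3])
    # Keep the row with the highest review count; on ties keep the first seen
    if name not in best_rows or n_reviews > int(best_rows[name][3]):
        best_rows[name] = row

clean = list(best_rows.values())
print(len(clean))
print(clean[0][3])
```

Dictionary lookups are constant time on average, whereas `name not in ggl_added` scans a list on every iteration.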


Because there were no errors or duplicates in the iOS data, we can create a copy of that data set as our ios_clean variable for further use.

In [44]:
ios_clean = list(ios_data)  # shallow copy; plain assignment would only create an alias

Both data sets also contain apps that are not built primarily for English-speaking users. We want to remove these entries before doing further analysis. The easiest way to do this is to remove any row whose app name contains characters not used in English text.

ASCII assigns the characters commonly used in English text to code points 0 through 127. Some legitimate English apps have a few non-ASCII characters or emoji (🙄) in their names, so we only want to remove an entry if its name contains more than three such characters.

In [45]:
def char_check(string):
    error_count = 0
    for char in string:
        if ord(char) > 127:
            error_count +=1
            if error_count > 3:
                return False
    return True

In [46]:
print(len(ggl_clean))
print(len(ios_clean))

9659
7197


In [47]:
ios_english = []
ggl_english = []

for row in ggl_clean:
    name = row[0]
    if char_check(name):
        ggl_english.append(row)
        
for row in ios_clean:
    name = row[1]
    if char_check(name):
        ios_english.append(row)
        
print(len(ios_english))
print(len(ggl_english))

6183
9614


Lastly, we want to isolate all the free apps to be our final data set.

In [48]:
ios_free = []
ggl_free = []

for row in ggl_english:
    price = row[6]
    if price == 'Free':
        ggl_free.append(row)
        
for row in ios_english:
    price = row[4]
    if price == '0.0':
        ios_free.append(row)
        
print(len(ios_free))
print(len(ggl_free))

3222
8863


We should now have two data sets, one for iOS and one for Google Play, each free of errors, duplicates, and non-English apps, and each containing only free apps.

### App Optimization - Frequency Tables

To optimize an app, we need to figure out what works in the app stores. Below, we build frequency tables for a few data categories, specifically the 'Category' column from the Google Play data set and the 'prime_genre' column from the iOS data set.

In [49]:
def freq_table(dataset, index):
    freq_dict = {}
    percent_dict = {}
    num_entries = 0

    for row in dataset:
        num_entries += 1
        if row[index] in freq_dict:
            freq_dict[row[index]] += 1
        else:
            freq_dict[row[index]] = 1
            
    for i in freq_dict:
        percent = (float(freq_dict[i]) / num_entries)
        percent_dict[i] = percent * 100
    return percent_dict

Next, we need to view the frequency table in descending order.

In [53]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_disp = []
    for key in table:
        new_tuple = (table[key], key)
        table_disp.append(new_tuple)
        
    table_sorted = sorted(table_disp, reverse = True)
    for i in table_sorted:
        print(i)
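
As a quick illustration of what these two functions produce, the sketch below builds a percentage table for four hypothetical rows and sorts it in descending order. `freq_percentages` is a condensed stand-in written for this example, not the notebook's own `freq_table`.

```python
def freq_percentages(dataset, index):
    """Map each value in the given column to its share of rows, in percent."""
    counts = {}
    for row in dataset:
        counts[row[index]] = counts.get(row[index], 0) + 1
    total = len(dataset)
    return {key: count / total * 100 for key, count in counts.items()}

# Hypothetical rows in the form [name, category]
rows = [
    ['App A', 'GAME'],
    ['App B', 'GAME'],
    ['App C', 'TOOLS'],
    ['App D', 'FAMILY'],
]

table = freq_percentages(rows, 1)

# Sort by percentage, largest first; ties keep their original order
ranked = sorted(table.items(), key=lambda item: item[1], reverse=True)
for category, percent in ranked:
    print(category, percent)
```

Sorting with a `key` function avoids having to swap each pair into `(value, key)` order before calling `sorted`.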

Let's look at some summary data:

In [64]:
display_table(ggl_free, 1)
# Google Play 'Category' column

(18.898792733837304, 'FAMILY')
(9.725826469592688, 'GAME')
(8.462146000225657, 'TOOLS')
(4.592124562789123, 'BUSINESS')
(3.9038700214374367, 'LIFESTYLE')
(3.8925871601038025, 'PRODUCTIVITY')
(3.7007785174320205, 'FINANCE')
(3.5315355974275078, 'MEDICAL')
(3.396141261423897, 'SPORTS')
(3.317161232088458, 'PERSONALIZATION')
(3.2381812027530184, 'COMMUNICATION')
(3.0802211440821394, 'HEALTH_AND_FITNESS')
(2.944826808078529, 'PHOTOGRAPHY')
(2.798149610741284, 'NEWS_AND_MAGAZINES')
(2.6627552747376737, 'SOCIAL')
(2.335552296062281, 'TRAVEL_AND_LOCAL')
(2.245289405393208, 'SHOPPING')
(2.1437436533904997, 'BOOKS_AND_REFERENCE')
(1.8616721200496444, 'DATING')
(1.7939749520478394, 'VIDEO_PLAYERS')
(1.399074805370642, 'MAPS_AND_NAVIGATION')
(1.241114746699763, 'FOOD_AND_DRINK')
(1.1621347173643235, 'EDUCATION')
(0.9590432133589079, 'ENTERTAINMENT')
(0.9364774906916393, 'LIBRARIES_AND_DEMO')
(0.9251946293580051, 'AUTO_AND_VEHICLES')
(0.8236488773552973, 'HOUSE_AND_HOME')
(0.8010831546880289, 'WEA

In [63]:
display_table(ios_free, -5)
#iOS 'prime_genre' column

(58.16263190564867, 'Games')
(7.883302296710118, 'Entertainment')
(4.9658597144630665, 'Photo & Video')
(3.662321539416512, 'Education')
(3.2898820608317814, 'Social Networking')
(2.60707635009311, 'Shopping')
(2.5139664804469275, 'Utilities')
(2.1415270018621975, 'Sports')
(2.0484171322160147, 'Music')
(2.0173805090006205, 'Health & Fitness')
(1.7380509000620732, 'Productivity')
(1.5828677839851024, 'Lifestyle')
(1.3345747982619491, 'News')
(1.2414649286157666, 'Travel')
(1.1173184357541899, 'Finance')
(0.8690254500310366, 'Weather')
(0.8069522036002483, 'Food & Drink')
(0.5586592178770949, 'Reference')
(0.5276225946617008, 'Business')
(0.4345127250155183, 'Book')
(0.186219739292365, 'Navigation')
(0.186219739292365, 'Medical')
(0.12414649286157665, 'Catalogs')


Games dominate the iOS store, followed by entertainment apps, while on Google Play the FAMILY and GAME categories lead and practical categories such as TOOLS and BUSINESS are better represented. However, all we've done is reveal the most commonly published kinds of apps, not the ones with the most users. We can find the average number of downloads per category to draw a better conclusion.

The iOS store doesn't list a download count, and the Google Play store only lists ranges, so we're going to use the number of ratings as a proxy.
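
The per-genre average can be sketched on toy data as a single pass that accumulates a running total and a count for each genre. The rows below are hypothetical, with fields standing in for `track_name`, `rating_count_tot`, and `prime_genre`.

```python
from collections import defaultdict

# Hypothetical rows in the form [track_name, rating_count_tot, prime_genre]
rows = [
    ['App A', '100', 'Games'],
    ['App B', '300', 'Games'],
    ['App C', '50', 'Weather'],
]

totals = defaultdict(float)  # sum of rating counts per genre
counts = defaultdict(int)    # number of apps per genre

for name, rating_count, genre in rows:
    totals[genre] += float(rating_count)
    counts[genre] += 1

averages = {genre: totals[genre] / counts[genre] for genre in totals}
for genre, avg in averages.items():
    print(genre, ':', round(avg))
```

A single pass touches each row once, while looping over every row for every genre repeats the scan once per genre.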

In [72]:
ios_genres = freq_table(ios_free, -5)

for genre in ios_genres:
    sum_installs = 0
    num_apps = 0
    for row in ios_free:
        if row[-5] == genre:
            installs = float(row[5])
            sum_installs += installs
            num_apps += 1
    avg_installs = (sum_installs / num_apps)
    print(genre,":",round(avg_installs))

News : 21248
Medical : 612
Shopping : 26920
Travel : 28244
Utilities : 18684
Education : 7004
Weather : 52280
Catalogs : 4004
Health & Fitness : 23298
Book : 39758
Music : 57327
Food & Drink : 33334
Finance : 31468
Social Networking : 71548
Productivity : 21028
Business : 7491
Photo & Video : 28442
Lifestyle : 16486
Reference : 74942
Navigation : 86090
Entertainment : 14030
Sports : 23009
Games : 22789


In [76]:
ggl_genres = freq_table(ggl_free, 1)

for genre in ggl_genres:
    sum_installs = 0
    num_apps = 0
    for row in ggl_free:
        if row[1] == genre:
            installs = float(row[3])
            sum_installs += installs
            num_apps += 1
    avg_installs = (sum_installs / num_apps)
    print(genre,":",round(avg_installs))

EDUCATION : 56293
PARENTING : 16379
HOUSE_AND_HOME : 26435
COMMUNICATION : 995608
ART_AND_DESIGN : 24699
SHOPPING : 223887
PHOTOGRAPHY : 404081
BEAUTY : 7476
TRAVEL_AND_LOCAL : 129484
PRODUCTIVITY : 160635
LIFESTYLE : 33922
FAMILY : 113211
MAPS_AND_NAVIGATION : 142860
DATING : 21953
LIBRARIES_AND_DEMO : 10926
VIDEO_PLAYERS : 425350
GAME : 683524
AUTO_AND_VEHICLES : 14140
FINANCE : 38536
SOCIAL : 965831
COMICS : 42586
HEALTH_AND_FITNESS : 78095
TOOLS : 305733
BUSINESS : 24240
MEDICAL : 3730
NEWS_AND_MAGAZINES : 93088
ENTERTAINMENT : 301752
FOOD_AND_DRINK : 57479
EVENTS : 2556
PERSONALIZATION : 181122
BOOKS_AND_REFERENCE : 87995
SPORTS : 116939
WEATHER : 171251


Another approach for the Google Play Store is to use the install counts as they are and check the result against the previous one, which was based on review counts.
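
The 'Installs' column stores strings such as '1,000,000+', so each value has to be stripped of its punctuation before it can be summed. A minimal sketch of that conversion follows; `parse_installs` is a hypothetical helper name used only for this example.

```python
def parse_installs(text):
    """Convert an installs string such as '1,000,000+' to an int."""
    return int(text.replace('+', '').replace(',', ''))

samples = ['1,000,000,000+', '500,000+', '10,000+']
parsed = [parse_installs(s) for s in samples]
print(parsed)
```

Because each value is the lower edge of a range, averages computed this way are lower bounds rather than exact install counts.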

In [77]:
ggl_genres = freq_table(ggl_free, 1)

for genre in ggl_genres:
    sum_installs = 0
    num_apps = 0
    for row in ggl_free:
        if row[1] == genre:
            installs = row[5]
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            sum_installs += int(installs)
            num_apps += 1
    avg_installs = (sum_installs / num_apps)
    print(genre,":",round(avg_installs))

EDUCATION : 1833495
PARENTING : 542604
HOUSE_AND_HOME : 1331541
COMMUNICATION : 38456119
ART_AND_DESIGN : 1986335
SHOPPING : 7036877
PHOTOGRAPHY : 17840110
BEAUTY : 513152
TRAVEL_AND_LOCAL : 13984078
PRODUCTIVITY : 16787331
LIFESTYLE : 1437816
FAMILY : 3697848
MAPS_AND_NAVIGATION : 4056942
DATING : 854029
LIBRARIES_AND_DEMO : 638504
VIDEO_PLAYERS : 24727872
GAME : 15588016
AUTO_AND_VEHICLES : 647318
FINANCE : 1387692
SOCIAL : 23253652
COMICS : 817657
HEALTH_AND_FITNESS : 4188822
TOOLS : 10801391
BUSINESS : 1712290
MEDICAL : 120551
NEWS_AND_MAGAZINES : 9549178
ENTERTAINMENT : 11640706
FOOD_AND_DRINK : 1924898
EVENTS : 253542
PERSONALIZATION : 5201483
BOOKS_AND_REFERENCE : 8767812
SPORTS : 3638640
WEATHER : 5074486


This generally confirms the previous analysis. The absolute numbers differ, since install counts are far larger than review counts, but the general ordering of the categories is broadly similar.
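
One way to make the comparison concrete is to rank the same categories under both measures. The sketch below uses the five categories with the highest average review counts and copies their averages from the two outputs above.

```python
# Average review counts per category (from the first Google Play output)
avg_reviews = {
    'COMMUNICATION': 995608,
    'SOCIAL': 965831,
    'GAME': 683524,
    'VIDEO_PLAYERS': 425350,
    'PHOTOGRAPHY': 404081,
}

# Average install counts per category (from the second Google Play output)
avg_installs = {
    'COMMUNICATION': 38456119,
    'SOCIAL': 23253652,
    'GAME': 15588016,
    'VIDEO_PLAYERS': 24727872,
    'PHOTOGRAPHY': 17840110,
}

by_reviews = sorted(avg_reviews, key=avg_reviews.get, reverse=True)
by_installs = sorted(avg_installs, key=avg_installs.get, reverse=True)

print(by_reviews)
print(by_installs)
```

COMMUNICATION ranks first under both measures, while GAME drops from third by reviews to fifth by installs among these five categories.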

Let's take a look at the most popular apps in a couple categories.

### Category Analysis

Again, for the Google Play Store there are two ways to go about finding the most popular apps in each category.

In [79]:
for app in ggl_free:
    # Installs values end with '+', so only the top bracket is matched here
    if app[1] == 'COMMUNICATION' and app[5] == '1,000,000,000+':
        print(app[0],':',app[5])

WhatsApp Messenger : 1,000,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
Skype - free IM & video calls : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+


In [82]:
for app in ggl_free:
    if app[1] == 'COMMUNICATION' and int(app[3]) > 10000000:
        print(app[0],':',app[3])
    

WhatsApp Messenger : 69119316
Messenger – Text and Video Chat for Free : 56646578
Skype - free IM & video calls : 10484169
LINE: Free Calls & Messages : 10790289
UC Browser - Fast Download Private & Secure : 17714850
Viber Messenger : 11335481
BBM - Free Calls & Messages : 12843436


The two cells above attempt to pull the most popular apps in the Communication category. Some Google apps have a very large number of downloads but fewer total reviews than some other apps. This could be because they come pre-installed by the phone manufacturer or retailer.

Either way, the Communication category seems like a crowded and highly competitive market.
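
Instead of filtering by a fixed threshold, the apps can be ranked directly by review count. The sketch below sorts the seven Communication apps and review counts printed above and keeps the top three.

```python
# (name, review count) pairs copied from the output above
apps = [
    ('WhatsApp Messenger', 69119316),
    ('Messenger – Text and Video Chat for Free', 56646578),
    ('Skype - free IM & video calls', 10484169),
    ('LINE: Free Calls & Messages', 10790289),
    ('UC Browser - Fast Download Private & Secure', 17714850),
    ('Viber Messenger', 11335481),
    ('BBM - Free Calls & Messages', 12843436),
]

# Sort by review count, largest first, and keep the first three
top_three = sorted(apps, key=lambda app: app[1], reverse=True)[:3]
for name, reviews in top_three:
    print(name, ':', reviews)
```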

In [96]:
for app in ggl_free:
    if app[1] == 'FINANCE' and (app[5] == '100,000,000+'
                                or app[5] == '50,000,000+'
                                or app[5] == '10,000,000+'):
        print(app[0],':',app[5])

print("===================================")

for app in ggl_free:
    if app[1] == 'FINANCE' and int(app[3]) > 100000:
        print(app[0],':',app[3])

K PLUS : 10,000,000+
Mobile Bancomer : 10,000,000+
CASHIER : 10,000,000+
Itau bank : 10,000,000+
Cash App : 10,000,000+
İşCep : 10,000,000+
Bank of Brazil : 10,000,000+
PayPal : 50,000,000+
Bank of America Mobile Banking : 10,000,000+
Wells Fargo Mobile : 10,000,000+
Capital One® Mobile : 10,000,000+
Chase Mobile : 10,000,000+
HDFC Bank MobileBanking : 10,000,000+
Google Pay : 100,000,000+
Credit Karma : 10,000,000+
K PLUS : 124424
Mobile Bancomer : 278082
SCB EASY : 112656
CASHIER : 335738
Itau bank : 957973
Nubank : 130582
IKO : 167168
VTB-Online : 138371
Banorte Movil : 111632
İşCep : 381788
TrueMoney Wallet : 199684
Bank of Brazil : 1336246
Money Manager Expense & Budget : 134564
Monefy - Money Manager : 111254
Mobills: Budget Planner : 161440
MetaTrader 4 : 260547
Stocks, Forex, Bitcoin, Ethereum: Portfolio & News : 157505
Yahoo Finance : 135952
Money Lover: Expense Tracker, Budget Planner : 126447
PayPal : 659760
USAA Mobile : 100997
Bank of America Mobile Banking : 341090
Wells 

Finance apps seem to be mostly banking apps of similar popularity. Without a connected banking operation, it seems unlikely that we could develop a finance app (such as a retirement planner) that would be widely downloaded.

In [97]:
for app in ggl_free:
    if app[1] == 'NEWS_AND_MAGAZINES' and (app[5] == '100,000,000+'
                                or app[5] == '50,000,000+'
                                or app[5] == '10,000,000+'):
        print(app[0],':',app[5])

print("===================================")

for app in ggl_free:
    if app[1] == 'NEWS_AND_MAGAZINES' and int(app[3]) > 100000:
        print(app[0],':',app[3])

Fox News – Breaking News, Live Video & News Alerts : 10,000,000+
NEW - Read Newspaper, News 24h : 10,000,000+
BBC News : 10,000,000+
CNN Breaking US & World News : 10,000,000+
BaBe - Read News : 10,000,000+
detikcom - Latest & Most Complete News : 10,000,000+
Dailyhunt (Newshunt) - Latest News, Viral Videos : 50,000,000+
Read- Latest News, Information, Gossip and Politics : 10,000,000+
Reddit: Social News, Trending Memes & Funny Videos : 10,000,000+
Opera News - Trending news and videos : 10,000,000+
Topbuzz: Breaking News, Videos & Funny GIFs : 10,000,000+
Pulse Nabd - World News, Urgent : 10,000,000+
NYTimes - Latest News : 10,000,000+
Bloomberg: Market & Financial News : 10,000,000+
News Republic : 10,000,000+
Newsroom: News Worth Sharing : 10,000,000+
SmartNews: Breaking News Headlines : 10,000,000+
Updates for Samsung - Android Update Versions : 10,000,000+
Pocket : 10,000,000+
NewsDog - Latest News, Breaking News, Local News : 10,000,000+
News by The Times of India Newspaper - La

The interesting part of the 'News and Magazines' category is that it is not currently dominated by automatic downloads such as Google News. The most reviewed and most downloaded app appears to be Fox News, a non-technology company. This is closely followed by BBC and CNN.

### Conclusion

Given that no major tech firm dominates this category with pre-installed apps, it could prove to be a good opportunity to develop a low-cost news aggregation app.

By providing an app that simply links to other sites and streams a list of headlines and AP photos, we could limit development costs and have the opportunity for substantial downloads in the future.