# Analyzing App Popularity

This project aims to analyze apps in order to understand what attracts users. This will help our super awesome company increase its revenue.

In [2]:
from csv import reader

### The Google Play data set ###
openedFile = open('googleplaystore.csv')
readFile = reader(openedFile)
androidApps = list(readFile)
androidHeader = androidApps[0]
androidApps = androidApps[1:]

### The App Store data set ###
openedFile = open('AppleStore.csv')
readFile = reader(openedFile)
iOSApps = list(readFile)
iOSHeader = iOSApps[0]
iOSApps = iOSApps[1:]
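
The two loading blocks above repeat the same steps, so they could be folded into one helper. This is only a sketch: `load_csv` is a hypothetical name that is not part of the notebook, and the `utf8` encoding is an assumption about the Kaggle files. The demo writes a tiny throwaway CSV so it runs without the real data.

```python
import os
import tempfile
from csv import reader


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file (hypothetical helper)."""
    # The with-block closes the file even if reading fails.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(reader(f))
    return rows[0], rows[1:]


# Demo on a temporary file instead of googleplaystore.csv / AppleStore.csv.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf8", newline=""
) as tmp:
    tmp.write("App,Category\nInstagram,SOCIAL\n")
    demo_path = tmp.name

demo_header, demo_rows = load_csv(demo_path)
os.remove(demo_path)

print(demo_header)
print(demo_rows)
```

With the real files this would be called as `androidHeader, androidApps = load_csv('googleplaystore.csv')`, assuming the file sits next to the notebook.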

**Print first few rows of each data set**

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print("iOS\n")
print(iOSHeader)
print("\n")
explore_data(iOSApps, 1, 4)
print("\nANDROID\n")
print(androidHeader)
explore_data(androidApps,1,4)

iOS

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']



ANDROID

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Laun

In [4]:

print("iOS Apps: " + str(len(iOSApps)))
print("iOS Columns: " + str(len(iOSHeader)))

iOS Apps: 7197
iOS Columns: 16


In [5]:
print("Android Apps: " + str(len(androidApps)))
print("Android Columns: " + str(len(androidHeader)))

Android Apps: 10841
Android Columns: 13


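Documentation for the App Store data set can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home), and for the Google Play data set [here](https://www.kaggle.com/lava18/google-play-store-apps/home).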

# Deleting Bad Data

One of the entries in the Google Play data set has incorrect data, which is described in the data set's discussion forum. We must first confirm that the entry is bad, and then remove it.

In [6]:
print(androidHeader)
print(androidApps[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


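We can see (before I run this whole notebook) that this row is missing its Category value: it has 12 fields against the header's 13, so everything after the app name is shifted one column to the left. As a result the Rating slot holds 19, when the actual rating is 1.9.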
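
A more general way to catch this kind of entry is to compare each row's length with the header's. The sketch below assumes a malformed row always shows up as a wrong column count; `find_malformed_rows` and the toy data are hypothetical and only illustrate the idea.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data: the second row has lost its Category value.
toy_header = ["App", "Category", "Rating"]
toy_rows = [
    ["Good App", "TOOLS", "4.2"],
    ["Broken App", "19"],
]

malformed = find_malformed_rows(toy_header, toy_rows)
print(malformed)
```

Run against `androidHeader` and `androidApps`, this would list the index of every row that needs a closer look, rather than relying on an index found in a forum post.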

In [7]:
print(androidApps[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [8]:
del androidApps[10472]  # run this cell only once; running it again deletes a valid row

In [9]:
print(androidApps[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


In [10]:
# Checking for Duplicate Entries

duplicateApps = []
uniqueApps    = []

for app in androidApps:
    name = app[0]
    if name in uniqueApps:
        duplicateApps.append(name)
    else:
        uniqueApps.append(name)
  
print("The number of duplicate apps is:")
print(len(duplicateApps))

print("\nSome duplicate apps include:\n")
print(duplicateApps[0:5])

The number of duplicate apps is:
1181

Some duplicate apps include:

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


Entries are listed multiple times because the data was added multiple times. Rather than removing duplicates at random, it makes more sense to keep only the most recent entry for each app. We can find it by checking the number of reviews: the entry with the highest number of reviews should be the most recent one.

To remove duplicates, we will create a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews for that app. We then create a new data set with only one entry per app.

In [11]:
maxReviews = {}

for app in androidApps:
    name = app[0]
    numReviews = float(app[3])
    
    if name in maxReviews and maxReviews[name] < numReviews:
        maxReviews[name] = numReviews
        # We only want it to override if more recent, OR to be defined if it's unique/not in there already.
        
    elif name not in maxReviews:
        maxReviews[name] = numReviews
        
print(len(maxReviews))

9659


In [12]:
# At this point, we want to have a nice clean dataset that meets
# the above criterion that there are no duplicate entries and only the
# most recent entry is kept.

androidClean = []
alreadyAdded = []

for app in androidApps:
    name = app[0]
    numReviews = float(app[3])
    
    # If the entry is the most recent one, and we haven't already added it,
    # add it to the nice, clean list and keep track of its name. 
    # This is because duplicate entries might have the same number of reviews.
    
    if (numReviews == maxReviews[name]) and (name not in alreadyAdded):
        androidClean.append(app)
        alreadyAdded.append(name)
        
print(len(androidClean))
print(len(alreadyAdded))

9659
9659
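
The same "keep the entry with the most reviews" rule can also be sketched in a single pass by storing the whole row in the dictionary instead of only the review count. This is an alternative under assumptions, not the notebook's method: `keep_most_reviewed` is a hypothetical name, and the result is ordered by each name's first appearance, which may differ from the order of `androidClean`.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict ">" means the first row wins when review counts tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy data: "Box" appears three times, twice with the same top count.
toy_apps = [
    ["Box", "10"],
    ["Box", "25"],
    ["Zoom", "7"],
    ["Box", "25"],
]

deduplicated = keep_most_reviewed(toy_apps, 0, 1)
print(deduplicated)
```

Dictionary lookups are also faster than `name in alreadyAdded` on a list, which scans the list on every check.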


In [13]:
maxiOSReviews = {}

for app in iOSApps:
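    name = app[0]   # NOTE: index 0 is the numeric "id", not the app name ("track_name" is index 1), so every key here is unique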
    numReviews = float(app[5])
    
    if name in maxiOSReviews and maxiOSReviews[name] < numReviews:
        maxiOSReviews[name] = numReviews
        # We only want it to override if more recent, OR to be defined if it's unique/not in there already.
        
    elif name not in maxiOSReviews:
        maxiOSReviews[name] = numReviews
        
print(len(maxiOSReviews))

7197


In [14]:
iOSClean = []
existingNames = []

for app in iOSApps:
    name = app[1]   # Second value "track_name" is the app name
    
    # Keep the first entry seen for each app name and skip later duplicates.
    # Unlike the Android clean-up, rating counts are not compared here.
    
    if (name not in existingNames):
        iOSClean.append(app)
        existingNames.append(name)
        
print(len(iOSClean))
print(len(existingNames))

7195
7195


# Removing non-English characters

At this point, we have a nice, clean data set with no duplicate entries.
But.... it's not as clean as we want, because we only want to analyze apps directed towards English-speaking folk. To do this, we check whether the name contains characters outside the set commonly used in English text: letters, numbers, punctuation marks, and other common symbols.

Each character has an associated number (its code point, which matches ASCII for English text). You can check the number of each character using the built-in `ord()` function, and then see if it fits within a certain range. In our case, if the number is between 0 and 127 (inclusive), the character is in the common set of English characters; anything above 127 is not.

In Python, strings are indexable and iterable, so you can select an individual character or iterate on the string using a `for` loop.
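
As a quick illustration of `ord()` and string iteration, the snippet below walks over one of the app names used later in this notebook and collects the characters whose number is above 127. The variable names are made up for this example.

```python
name = "Docs To Go™"

# Iterating over a string yields one character at a time.
for char in name:
    print(repr(char), ord(char))

# Collect every character outside the 0-127 range.
non_ascii = [char for char in name if ord(char) > 127]
print(non_ascii)
```

Only the trademark sign falls outside the range here, which is why a name like this should still count as English.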

We'll write a function to do this below:

In [15]:
def isEnglish(string):
    
    numNonAscii = 0
    isEnglish = True
    
    for char in string:
        if ord(char) > 127:    # character not within common English set.
            numNonAscii += 1
    
    if numNonAscii > 3:         # We are allowing up to 3 non-ASCII char.
        isEnglish = False
    
    return isEnglish
           

In [16]:
print(isEnglish("Instagram"))

True


In [17]:
print(isEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))

False


In [18]:
print(isEnglish('Docs To Go™ Free Office Suite'))

True


At this point, we should use this function to clean both data sets.

In [19]:
englishAndroidApps = []
englishiOSApps     = []

for app in androidClean:
    name = app[0]
    if isEnglish(name):
        englishAndroidApps.append(app)

for app in iOSApps:   # NOTE: this filters the original iOSApps, not the de-duplicated iOSClean
    name = app[1]
    if isEnglish(name):
        englishiOSApps.append(app)       

In [20]:
print(len(englishAndroidApps))

9614


In [21]:
print(len(englishiOSApps))

6183


# Isolating Free Apps

At this point, we have:
* Removed inaccurate data
* Removed duplicate app entries, leaving only the most recent entry
* Removed non-English apps (essentially, by removing every app whose name has more than 3 non-ASCII characters.)

But.... now we need to isolate free apps for our analysis, as we only build apps that are free to download and install.

In [22]:
# Loop through each data set (android and iOS) to isolate
# the free apps in separate lists. Make sure that the correct
# index is used to refer to the price.

# In the iOS data [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)
# price is the 5th element (4th index in Python.)

# NOTE: "Free" is the string 0.0 here, not 0.

iOSFreeApps = []

for app in englishiOSApps:
    price = app[4]
    if price == "0.0":
        iOSFreeApps.append(app)
  
print("Free iOS Apps: ")
print(len(iOSFreeApps))

# In the android data [here](https://www.kaggle.com/lava18/google-play-store-apps/home),
# the 8th element (7th index) corresponds with price.

# NOTE: Free is the string 0 here, not 0.0

androidFreeApps = []
for app in englishAndroidApps:
    price = app[7]
    if price == "0":
        androidFreeApps.append(app)
  
print("Free android Apps: ")
print(len(androidFreeApps))


Free iOS Apps: 
3222
Free android Apps: 
8864
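
The two string comparisons above depend on each store's exact spelling of zero ("0.0" versus "0"). A more forgiving check converts the price to a number first. This is a sketch: `is_free` is a hypothetical helper, and the leading "$" on paid Google Play prices is an assumption about the data, not something shown in this notebook.

```python
def is_free(price_text):
    """Return True when a price string such as "0", "0.0" or "$0.00" is zero."""
    # Assumption: a paid price may carry a leading dollar sign, e.g. "$4.99".
    cleaned = price_text.strip().lstrip("$")
    return float(cleaned) == 0.0


for sample in ["0", "0.0", "$4.99"]:
    print(sample, is_free(sample))
```

With this helper, the same condition `if is_free(price):` could be used in both loops.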


# Analyzing Common Genres for Each Market

As mentioned at the beginning, our aim is to determine the kinds of apps that are likely to attract more users, because more people = more users = more money.

To validate our app ideas, we follow these steps:

1. Build a minimal Android version and add to Google Play.
2. If app has a good response, develop further.
3. If profitable after 6 months, also build iOS version and add to app store.

Although we are testing with android apps first, the app needs to be successful in both markets. 

To analyze, we will build a frequency table for some columns in our data sets.

The iOS data set includes:

"id" : App ID   
"track_name": App Name  
"size_bytes": Size (in Bytes)  
"currency": Currency Type  
"price": Price amount   
"rating_count_tot": User Rating counts (for all version)  
"rating_count_ver": User Rating counts (for current version)  
"user_rating" : Average User Rating value (for all version)  
"user_rating_ver": Average User Rating value (for current version)  
"ver" : Latest version code  
"cont_rating": Content Rating  
"prime_genre": Primary Genre  
"sup_devices.num": Number of supporting devices  
"ipadSc_urls.num": Number of screenshots showed for display  
"lang.num": Number of supported languages  
"vpp_lic": Vpp Device Based Licensing Enabled  

The android data set includes:

"App": Application name  
"Category": Category the app belongs to  
"Rating": Overall user rating of the app (as when scraped)  
"Reviews": Number of user reviews for the app (as when scraped)  
"Size": Size of the app (as when scraped)  
"Installs": Number of user downloads/installs for the app (as when scraped)  
"Type": Paid or Free  
"Price": Price of the app (as when scraped)  
"Content Rating": Age group the app is targeted at - Children / Mature 21+ / Adult  
"Genres": An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to Music, Game, Family genres.  
"Last Updated": Date when the app was last updated on Play Store (as when scraped)  
"Current Ver": Current version of the app available on Play Store (as when scraped)  
"Android Ver": Min required Android version (as when scraped)  

At this point, we have already isolated free apps. To determine the most common genre, we probably care about overall rating, number of reviews, and the "genre" field.

**For iOS apps, this corresponds to:**

Index 5: "rating_count_tot": User Rating counts (for all version)

Index 7: "user_rating": Average User Rating value (for all version)

Index 11: "prime_genre": Primary Genre

**For android apps, this corresponds to:**

Index 1: "Category": apps can have multiple genres, so this is a better field to look at.

Index 2: "Rating": Overall user rating of the app (as when scraped)

Index 3: "Reviews": Number of user reviews for the app (as when scraped)

Index 9: "Genres": An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to Music, Game, Family genres.
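
Rather than counting positions by hand, the indices listed above can be looked up from the header rows with `list.index()`. The sketch below copies the two headers printed earlier in this notebook; the constant names are made up for illustration.

```python
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

# list.index() returns the position of the first matching element.
IOS_GENRE = ios_header.index('prime_genre')
IOS_RATING_COUNT = ios_header.index('rating_count_tot')
ANDROID_CATEGORY = android_header.index('Category')
ANDROID_GENRES = android_header.index('Genres')

print(IOS_GENRE, IOS_RATING_COUNT, ANDROID_CATEGORY, ANDROID_GENRES)
```

Named positions like these make later calls easier to read than bare numbers such as `11` or `-4`.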

In [55]:
# Generate frequency tables for iOS

iOSFreqTable = {}

for app in iOSFreeApps:
    numRatings = int(app[5])
    userRating = float(app[7])
    primeGenre = app[11]
    
    if primeGenre in iOSFreqTable:
        iOSFreqTable[primeGenre] += 1
    else:
        iOSFreqTable[primeGenre] = 1

# That's fine and dandy for one value, but what I really need
# is a function for generating frequency tables for a particular index

def createFreqTable(dataSet, index):
    total = 0
    freqTable = {}
    for row in dataSet:
        total += 1
        value = row[index]
        if value in freqTable:
            freqTable[value] += 1
        else:
            freqTable[value] = 1    
        
    return freqTable, total

iOSFreqTable, total = createFreqTable(iOSFreeApps, 11)

#print(iOSFreqTable)
#print(total)

# But... welp don't have a way of visualizing this yet, so it's better
# that we make a frequency table of percentages.

def convertToProportion(freqTable, total):
    propTable = {}
    for key in freqTable:
        count = freqTable[key]
        proportion = (count/total)*100
        propTable[key] = proportion
        
    return propTable
        
convertToProportion(iOSFreqTable, total)

 
def displayTable(table):
    
    displayTable = []
    for key in table:
        valAsTuple = (table[key], key)
        displayTable.append(valAsTuple)
        
    # Sort once, after the loop has collected every entry
    sortedTable = sorted(displayTable, reverse = True)
    for entry in sortedTable:
        print(entry[1], ':', entry[0])
        
# So, I have a bunch of functions:
#    freqTable, total = createFreqTable(dataSet, index)
#    propTable = convertToProportion(table, total)
#    displayTable(table)


Great, now I have functions! Now to actually split them into useful things. First, iOS apps.

# iOS Genre Analysis

In [56]:
# Create iOS frequency table by count
iOSFreqTable, total = createFreqTable(iOSFreeApps, 11)

# print(iOSFreqTable)
# Convert to table of percentages
iOSPropTable = convertToProportion(iOSFreqTable, total)
#print(iOSPropTable)

# Sort and display the table
displayTable(iOSPropTable)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


More than half (about 58%) of the free, English apps in the iOS App Store are games. (Games and entertainment together make up about 66% of these apps.)

But.... just because there are more apps that are for fun things doesn't mean that people use them the most. It's then worth investigating the number of users that these apps have.

And... now to use the same functions but for android.

# Android App Analysis


In [57]:
# Create Android frequency table by count (index 1 is "Category")
androidFreqTable, total = createFreqTable(androidFreeApps, 1)

# print(androidFreqTable)
# Convert to table of percentages
androidPropTable = convertToProportion(androidFreqTable, total)
#print(androidPropTable)

# Sort and display the table
displayTable(androidPropTable)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Welp... It's unfortunate that the categories don't line up neatly with the iOS genres... This "family" shenanigans is top at about 19%, with "game" at 9.72%. But, family actually means kids games? Which is weird, but whatever. We'll also look into the "Genres" column.

In [58]:
# Create Android frequency table by count (index -4 is "Genres")
androidFreqTable, total = createFreqTable(androidFreeApps, -4)

# print(androidFreqTable)
# Convert to table of percentages
androidPropTable = convertToProportion(androidFreqTable, total)
#print(androidPropTable)

# Sort and display the table
displayTable(androidPropTable)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

There are a crazy amount of genres, so that's probably not super useful for us.

Overall, the iOS App Store has fun things, but Google Play has some fun things and some practical things. 

But again, this is purely by number of apps, not how many people use them, which we'll analyze next.

Some other helpful notes that I didn't read before I just did all of that...

Dictionaries aren't ordered by value, so we can't use one directly to display a ranked frequency table. You can technically use the "sorted()" function on any iterable data type (lists, dictionaries, and tuples), but it's not much use on a dictionary by itself, because it only considers and returns the keys, not the values, so the result is sorted by key rather than by frequency.
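
A quick check of what `sorted()` does with a dictionary, using a small made-up table of counts. The second call sorts the `(key, value)` pairs from `dict.items()` by their value, which is one way to get a ranked list.

```python
counts = {"Games": 3, "Weather": 1, "Music": 2}

# sorted() on a dict iterates over its keys, so only keys come back.
keys_only = sorted(counts)
print(keys_only)

# Sorting the (key, value) pairs by the value gives a ranking.
by_value = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(by_value)
```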

But..... if we turn the dictionary into a list of tuples where the tuple contains the key and value, we can do something with that. They actually give us a displayTable function, so my cleaned up version of their functions is below.

In [78]:
def genFreqTable(dataSet, index):
    total = 0
    freqTable = {}
    for row in dataSet:
        total += 1
        value = row[index]
        if value in freqTable:
            freqTable[value] += 1
        else:
            freqTable[value] = 1    
        
    percentageTable = {}
    for key in freqTable:
        percentage = (freqTable[key]/total)*100
        percentageTable[key] = percentage
        
    return percentageTable

def displayTable(dataSet, index):
    table = genFreqTable(dataSet, index)
    displayTable = []
    for key in table:
        kvPair = (table[key], key)
        displayTable.append(kvPair)
        
    sortedTable = sorted(displayTable, reverse = True)
    for entry in sortedTable:
        print(entry[1], " : " , entry[0])
        
displayTable(iOSFreeApps, -5)

Games  :  58.16263190564867
Entertainment  :  7.883302296710118
Photo & Video  :  4.9658597144630665
Education  :  3.662321539416512
Social Networking  :  3.2898820608317814
Shopping  :  2.60707635009311
Utilities  :  2.5139664804469275
Sports  :  2.1415270018621975
Music  :  2.0484171322160147
Health & Fitness  :  2.0173805090006205
Productivity  :  1.7380509000620732
Lifestyle  :  1.5828677839851024
News  :  1.3345747982619491
Travel  :  1.2414649286157666
Finance  :  1.1173184357541899
Weather  :  0.8690254500310366
Food & Drink  :  0.8069522036002483
Reference  :  0.5586592178770949
Business  :  0.5276225946617008
Book  :  0.4345127250155183
Navigation  :  0.186219739292365
Medical  :  0.186219739292365
Catalogs  :  0.12414649286157665


At this point, we now want to figure out which genres are the most popular/have the most users. For the Google Play data, we can use the "Installs" column, but for the App Store, we'll use rating_count_tot as a proxy.

We'll begin by calculating the average number of user ratings per app genre. To do that, we'll need to:

1. Isolate the apps of each genre
2. Sum up the user ratings for the apps of that genre
3. Divide by the number of apps belonging to that genre
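
The three steps above can be sketched in a single pass over the data by keeping a running total and a running count per genre. `average_by_group` and the toy rows are hypothetical; the real call would pass `iOSFreeApps` with the genre and rating-count indices.

```python
def average_by_group(rows, group_index, value_index):
    """Return {group: average of the values in that group}."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        # Values are read from CSV as strings, so convert before adding.
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows: [name, genre, rating count]
toy_rows = [
    ["A", "Games", "100"],
    ["B", "Games", "300"],
    ["C", "Weather", "50"],
]

averages = average_by_group(toy_rows, 1, 2)
print(averages)
```

This avoids looping over every app once per genre, which matters little at this size but keeps the logic in one place.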

In [91]:
iOSGenres = genFreqTable(iOSFreeApps, -5)

displayedTable = [] 
for genre in iOSGenres:
    total = 0
    totalForGenre = 0
    for app in iOSFreeApps:
        appGenre = app[-5]
        if appGenre == genre:
            # app[5] is a string, so convert it with int() before adding;
            # calling round() directly on a str raises a TypeError
            numRatings = int(app[5])
            total += numRatings
            totalForGenre += 1
    avgRating = total/totalForGenre
    kvPair =  (genre, avgRating)  
    #print(kvPair)
    displayedTable.append(kvPair)
    
# To sort a list of tuples by value, use a lambda function
# (or import itemgetter but I didn't do that.)
# key is a function that returns the value to sort each entry by
sortedTable = sorted(displayedTable, key=lambda x: x[1], reverse = True)
for entry in sortedTable:
    print(entry[0], ':', entry[1])