# Profitable App Profiles for the App Store and Google Play Markets

Our goal is to determine which kinds of apps are likely to attract the most users. This matters because our revenue comes from in-app purchases.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.

2. If the app has a good response from users, we develop it further.

3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.



## Opening and Exploring the Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

![img](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png) Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)

Collecting data for over four million apps requires a significant amount of time and money, so we'll analyze a sample of data instead. To avoid spending resources on collecting new data ourselves, we should first check whether any relevant existing data is available at no cost. Luckily, there are two data sets that seem suitable for our purpose:

* A [dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* A [dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

## Define global methods

### openFile()
Returns the rows of the CSV file at the given path as a list of lists.

In [2]:
from csv import reader

def openFile(withPath):
    
    # utf8 is assumed here; the with block closes the file once it has been read
    with open(withPath, encoding='utf8') as openedFile:
        return list(reader(openedFile))

### exploreData()
Use this function to print a slice of the dataset and, optionally, the number of rows and columns.

In [3]:
def exploreData(dataset, start, end, rowsAndColumns=False):
    
    datasetSlice = dataset[start:end]    
    
    for row in datasetSlice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rowsAndColumns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### findDuplicates()
Finds duplicates in the given dataset. A row counts as a duplicate when its app name has already been seen, so the function needs the index of the app name column.

In [4]:
def findDuplicates(inDataset, appNameIndex):
    
    print("Finding duplicates and printing the first rows if applicable")
    
    uniqueApps = []
    duplicateApps = []

    for row in inDataset:
        appName = row[appNameIndex]

        if appName in uniqueApps:
            duplicateApps.append(row)
        else:
            uniqueApps.append(appName)
        
    
    # print at most the first three duplicate rows
    for row in duplicateApps[:3]:
        print(row)
        print("---")
    
    print("Out of " + str(len(uniqueApps)) + " apps")
    print("We found " + str(len(duplicateApps)) + " duplicates")
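
The check `appName in uniqueApps` scans a list, so each lookup gets slower as the list grows. Below is a minimal sketch of the same idea using a set, which offers constant-time membership tests. The function `countDuplicates` and the sample rows are hypothetical and not part of the notebook.

```python
def countDuplicates(rows, appNameIndex):
    """Return (number of unique names, list of duplicate rows)."""
    seenNames = set()      # names encountered so far
    duplicateRows = []     # every row whose name was already seen

    for row in rows:
        appName = row[appNameIndex]
        if appName in seenNames:
            duplicateRows.append(row)
        else:
            seenNames.add(appName)

    return len(seenNames), duplicateRows


# Hypothetical sample rows: [name, reviews]
sampleRows = [["Box", "159872"], ["Slack", "51507"], ["Box", "159872"]]
uniqueCount, duplicateRows = countDuplicates(sampleRows, 0)
print(uniqueCount, len(duplicateRows))
```

The counting logic is unchanged; only the container used for the lookup differs.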

### dictionaryWithAppnamesAndReviewsCount()
This function generates a dictionary with app names as keys and the highest review count found for each app as values.
It needs the column index of the app name and of the review count.

```
{
    appname_1_string : review_amount_x_integer,
    appname_2_string : review_amount_x_integer,
    ...
}
```

In [5]:
def dictionaryWithAppnamesAndReviewsCount(fromDataset, appNameIndex, reviewsCountIndex):
    
    dictionary = {}
    
    for row in fromDataset:
    
        appName = row[appNameIndex]
        reviewsCount = int(row[reviewsCountIndex])
        
        if appName not in dictionary:
            dictionary[appName] = reviewsCount
        elif dictionary[appName] < reviewsCount:
            dictionary[appName] = reviewsCount
            
    return dictionary

### generateCleanDataset()
Creates a new dataset from an existing dataset.

Internally calls dictionaryWithAppnamesAndReviewsCount() to look up the highest number of reviews for each app. This is the criterion for removing duplicates: we keep the row with the highest count.

In [6]:
def generateCleanDataset(fromDataset, appNameIndex, reviewsCountIndex):
    
    cleanDataset = []
    alreadyAdded = []
    maxReviewDictionary = dictionaryWithAppnamesAndReviewsCount(fromDataset, appNameIndex, reviewsCountIndex)
    
    for row in fromDataset:
        
        appName = row[appNameIndex]
        reviewsCount = int(row[reviewsCountIndex])
    
        if appName not in alreadyAdded and reviewsCount == maxReviewDictionary[appName]:
            cleanDataset.append(row)
            alreadyAdded.append(appName)
            
    return cleanDataset
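
To see the rule at work on a small scale, here is a sketch that keeps one row per app: the first row that reaches the highest review count. It folds both helpers into a single hypothetical function, `keepMostReviewed`, and the rows are made up for the example. Row order may differ from what generateCleanDataset() returns.

```python
def keepMostReviewed(rows, appNameIndex, reviewsCountIndex):
    """Keep one row per app name: the first row with the highest review count."""
    bestRows = {}  # app name -> best row seen so far

    for row in rows:
        appName = row[appNameIndex]
        reviewsCount = int(row[reviewsCountIndex])
        best = bestRows.get(appName)
        # a strict > keeps the earlier row when two rows tie
        if best is None or reviewsCount > int(best[reviewsCountIndex]):
            bestRows[appName] = row

    # dicts preserve insertion order, so apps come back in first-seen order
    return list(bestRows.values())


# Hypothetical sample rows: [name, reviews]
sampleRows = [
    ["Instagram", "66577313"],
    ["Box", "159872"],
    ["Instagram", "66577446"],
]
print(keepMostReviewed(sampleRows, 0, 1))
```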

### isEnglish()
Returns True if the provided string contains no more than three non-ASCII characters (characters whose code point is above 127).

In [7]:
def isEnglish(string):
    
    foreignASCIIcount = 0
    
    for c in string:
        
        if ord(c) > 127:
            
            foreignASCIIcount += 1
            if foreignASCIIcount > 3:
                return False
    
    return True

### frequencyTable()
Generates a frequency table: the number of apps for each value in the given column (for example, each genre).

In [8]:
def frequencyTable(fromDataset, genreIndex):
    
    dictionary = {}
    
    for row in fromDataset:
    
        genre = row[genreIndex]
        
        if genre in dictionary:
            dictionary[genre] += 1
        else:
            dictionary[genre] = 1
            
    return dictionary

### displayTable()
Converts a frequency table (dictionary) into a list of (count, value) tuples. This way we can use the sorted() function to order the entries by count before printing them.

In [9]:
def displayTable(fromDataset, index):
    
    table = frequencyTable(fromDataset, index)
    tableDisplay = []
    
    for key in table:
        keyAsTuple = (table[key], key)
        tableDisplay.append(keyAsTuple)

    tableSorted = sorted(tableDisplay, reverse = True)
    for entry in tableSorted:
        print(entry[1], ':', entry[0])
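
The standard library's `collections.Counter` can build the same kind of frequency table. The sketch below also converts the counts to percentages. `percentageTable` and the sample rows are assumptions made for illustration.

```python
from collections import Counter


def percentageTable(rows, index):
    """Return (value, percentage) pairs, most frequent value first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Hypothetical sample rows: [name, genre]
sampleRows = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]

for genre, share in percentageTable(sampleRows, 1):
    print(genre, ':', round(share, 2))
```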

## Apple Store

This section explores the Apple Store dataset. We print a few rows along with the number of rows and columns.

In [10]:
appleStoreData = openFile("datasets/AppleStore.csv")

### Printing the first 2 rows (without header)

In [11]:
exploreData(appleStoreData[1:], 0, 2, True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


Number of rows: 7197
Number of columns: 17


### All column names for the Apple Store dataset
doc: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

In [12]:
# the first row of the dataset holds the column names
print(appleStoreData[0])

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


## Google Play Store

This section explores the Google Play dataset. We print a few rows along with the number of rows and columns.

We also remove the duplicate applications from the dataset and recreate a new dataset with unique apps.

In [13]:
googlePlayStoreData = openFile("datasets/GooglePlayStore.csv")

### Printing the first 3 rows (without header)

In [14]:
exploreData(googlePlayStoreData[1:], 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


### All column names for the Google Play Store dataset

doc: https://www.kaggle.com/datasets/lava18/google-play-store-apps

In [15]:
# the first row of the dataset holds the column names
print(googlePlayStoreData[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


## Cleanup datasets

* Remove inaccurate data
* Remove duplicates
* Filter out non-English names
* Isolate free apps

### Remove inaccurate data

#### Removing row 10473 (index counts the header): the Category value is missing, so the columns are shifted and Reviews holds '3.0M'

In [16]:
print(googlePlayStoreData[10473])
del googlePlayStoreData[10473]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
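
The row printed above has 12 values, while the header has 13. A length check is one way to surface rows like this. The sketch below uses a hypothetical `findMalformedRows` helper on made-up data.

```python
def findMalformedRows(dataset):
    """Return (index, row) pairs whose length differs from the header row's."""
    expectedLength = len(dataset[0])
    return [
        (index, row)
        for index, row in enumerate(dataset)
        if len(row) != expectedLength
    ]


# Hypothetical dataset: header row followed by two data rows
sampleDataset = [
    ["App", "Category", "Rating"],
    ["Box", "BUSINESS", "4.2"],
    ["Photo Frame", "1.9"],  # Category is missing
]
print(findMalformedRows(sampleDataset))
```

The reported index counts the header row, which matches how `del googlePlayStoreData[10473]` addresses the full list.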


#### Removing rows whose Type is 'NaN' instead of 'Free' or 'Paid'

In [17]:
# iterate over a copy so that removing rows does not skip any element
for row in googlePlayStoreData[:]:
    
    if row[6] == 'NaN':
        print(row)
        googlePlayStoreData.remove(row)

['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']
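
Removing items from a list while looping over that same list shifts the remaining elements, which can skip the row right after a removed one. Here is a sketch of a filter that builds a new list instead. `dropRowsWithValue` and the rows are hypothetical.

```python
def dropRowsWithValue(rows, index, unwantedValue):
    """Return a new list without the rows that hold unwantedValue at index."""
    return [row for row in rows if row[index] != unwantedValue]


# Hypothetical sample rows: [name, type]; two unwanted rows sit next to each other
sampleRows = [["A", "Free"], ["B", "NaN"], ["C", "NaN"], ["D", "Paid"]]
print(dropRowsWithValue(sampleRows, 1, "NaN"))
```

Because the input list is left untouched, adjacent unwanted rows are both filtered out.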


### Search for duplicate application entries

In [18]:
findDuplicates(googlePlayStoreData[1:], 0)

Finding duplicates and printing the first rows if applicable
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
---
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
---
['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']
---
Out of 9658 apps
We found 1181 duplicates


### Example of a duplicate application - Google My Business

In [19]:
for row in googlePlayStoreData:
    appName = row[0]
    
    if appName == "Google My Business":
        print(row)

['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']
['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']
['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']


### Removing the duplicate applications

Index 3 holds the review count. We use this integer to decide which row to keep: the highest count most likely belongs to the most recent entry.

We store our clean dataset in a new list (`cleanGoogleDataset`).

In [20]:
cleanGoogleDataset = generateCleanDataset(fromDataset=googlePlayStoreData[1:], appNameIndex=0, reviewsCountIndex=3)

`generateCleanDataset()` loops through the original dataset and appends a row to the clean dataset only if both conditions hold:

1. The app has not been added yet
<br/>`appName not in alreadyAdded`

2. The row has the highest review count for that app
<br/>`reviewsCount == maxReviewDictionary[appName]`

### Check for duplicates in the Apple Store dataset

In [21]:
findDuplicates(appleStoreData[1:], 2)

Finding duplicates and printing the first rows if applicable
['7579', '1089824278', 'VR Roller Coaster', '240964608', 'USD', '0', '67', '44', '3.5', '4', '0.81', '4+', 'Games', '38', '0', '1', '1']
---
['10885', '1178454060', 'Mannequin Challenge', '59572224', 'USD', '0', '105', '58', '4', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
---
Out of 7195 apps
We found 2 duplicates


### Removing duplicates again
We use the same criteria and select the row with the highest review count.

```
app name index = 2
reviews count index = 6
```

In [22]:
cleanAppleStoreDataset = generateCleanDataset(fromDataset=appleStoreData[1:], appNameIndex=2, reviewsCountIndex=6)

### Filter out non-English app names

In [23]:
englishAppleStoreDataset = []

for row in cleanAppleStoreDataset:
    
    if isEnglish(row[2]):
        englishAppleStoreDataset.append(row)
        
print("Remaining rows: " + str(len(englishAppleStoreDataset)))

Remaining rows: 6181


In [24]:
englishGoogleStoreDataset = []

for row in cleanGoogleDataset:
    
    if isEnglish(row[0]):
        englishGoogleStoreDataset.append(row)
        
print("Remaining rows: " + str(len(englishGoogleStoreDataset)))

Remaining rows: 9613


### Isolate free apps

In [25]:
freeApplePlayStoreDataset = []

for row in englishAppleStoreDataset:
    
    if row[5] == "0":
        freeApplePlayStoreDataset.append(row)
    
print("Remaining rows: " + str(len(freeApplePlayStoreDataset)))

Remaining rows: 3220


In [26]:
freeGoogleStoreDataset = []

for row in englishGoogleStoreDataset:
    
    if row[6] == "Free":
        freeGoogleStoreDataset.append(row)
        
print("Remaining rows: " + str(len(freeGoogleStoreDataset)))

Remaining rows: 8863


## Highlight the most popular categories

Column indexes of the genre-related columns in each dataset:

```
# Apple
prime_genre = 12

# Google
category = 1
genres = 9
```

We start by examining the frequency table for the prime_genre column of the App Store data set.

In [38]:
displayTable(freeApplePlayStoreDataset, 12)

Games : 1872
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4


We can see that among the free English apps, more than half (58.14%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.
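
These shares can be recomputed by dividing each genre's count by the 3,220 free English apps. Here is a short check using counts copied from the table above:

```python
# Counts copied from the prime_genre frequency table
genreCounts = {
    "Games": 1872,
    "Entertainment": 254,
    "Photo & Video": 160,
    "Education": 118,
    "Social Networking": 106,
}
totalApps = 3220  # rows remaining after isolating the free English apps

genreShares = {
    genre: round(count / totalApps * 100, 2)
    for genre, count in genreCounts.items()
}

for genre, share in genreShares.items():
    print(genre, ':', share)
```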

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users: demand might not match supply.

Let's continue by examining the Genres and Category columns of the Google Play data set (two columns which seem to be related).

In [39]:
displayTable(freeGoogleStoreDataset, 1)

FAMILY : 1675
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53


The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) consists mostly of games for kids.

In [40]:
displayTable(freeGoogleStoreDataset, 9)

Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 80
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea of the kinds of apps that have the most users.

### Most Popular Apps by Genre on the App Store
One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

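Below, the cell averages the value at index 8 for each genre on the App Store. That column is `user_rating`, so the output lists the average star rating per genre rather than the average number of ratings. Reading index 6 (`rating_count_tot`) instead would give the proxy described above.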

In [42]:
primeGenreFreqTable = frequencyTable(freeApplePlayStoreDataset, 12)
printList = []

for genre in primeGenreFreqTable:
    
    total = 0
    genreCount = 0
    
    for row in freeApplePlayStoreDataset:
        
        appGenre = row[12]
        
        if appGenre == genre:
            
            # index 8 is user_rating (the star rating), not rating_count_tot (index 6)
            userRating = float(row[8])
            total += userRating
            genreCount += 1
            
    averageRating = total / genreCount
    printList.append((averageRating, genre))

# sort once, after every genre has been processed
listSorted = sorted(printList, reverse = True)

for entry in listSorted:
    print(entry[1], ':', entry[0])

Catalogs : 4.125
Games : 4.037393162393163
Productivity : 4.0
Business : 3.9705882352941178
Shopping : 3.9702380952380953
Music : 3.946969696969697
Photo & Video : 3.903125
Navigation : 3.8333333333333335
Health & Fitness : 3.769230769230769
Reference : 3.6666666666666665
Education : 3.635593220338983
Food & Drink : 3.6346153846153846
Social Networking : 3.5943396226415096
Entertainment : 3.5393700787401574
Utilities : 3.5308641975308643
Travel : 3.4875
Weather : 3.482142857142857
Lifestyle : 3.411764705882353
Finance : 3.375
News : 3.244186046511628
Book : 3.0714285714285716
Sports : 3.0652173913043477
Medical : 3.0
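
The nested loop above rescans the whole data set once per genre. A one-pass alternative keeps a running total and a count for each genre. The sketch below is only an illustration: `averageByGroup` is hypothetical and the rows are made up. Because the value index is a parameter, it could point at `user_rating` (8) or `rating_count_tot` (6).

```python
def averageByGroup(rows, groupIndex, valueIndex):
    """Average the numeric column valueIndex for each distinct value of groupIndex."""
    totals = {}  # group -> running sum
    counts = {}  # group -> number of rows

    for row in rows:
        group = row[groupIndex]
        totals[group] = totals.get(group, 0.0) + float(row[valueIndex])
        counts[group] = counts.get(group, 0) + 1

    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical sample rows: [genre, value]
sampleRows = [["Games", "100"], ["Games", "300"], ["Music", "50"]]
print(averageByGroup(sampleRows, 0, 1))
```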


### Show the average number of installs for each category
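Below we list the average number of installs per category on Google Play. The Installs column stores open-ended strings such as `10,000+`, so the code strips the `+` and the commas before converting each value to a number.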

In [31]:
categoryFreqTable = frequencyTable(freeGoogleStoreDataset, 1)
printList = []

for category in categoryFreqTable:
    
    total = 0
    categoryCount = 0
    
    for row in freeGoogleStoreDataset:
        
        appCategory = row[1]
        
        if appCategory == category:
            
            # turn a string such as '10,000+' into the number 10000.0
            installs = row[5]
            installs = installs.replace("+", "")
            installs = installs.replace(",", "")
            total += float(installs)
            categoryCount += 1
      
    averageInstalls = total / categoryCount
    printList.append((averageInstalls, category))

# sort once, after every category has been processed
listSorted = sorted(printList, reverse = True)

for entry in listSorted:
    print(entry[1], ':', entry[0])

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3697848.1731343283
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315
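
The install counts are open-ended buckets rather than exact figures, so the averages above are best read as rough lower bounds. The string clean-up can be sketched as a small reusable helper. `parseInstalls` is a hypothetical name.

```python
def parseInstalls(installs):
    """Convert an Installs string such as '10,000+' to an integer lower bound."""
    return int(installs.replace(",", "").replace("+", ""))


for raw in ["10,000+", "5,000,000+", "0"]:
    print(raw, '->', parseInstalls(raw))
```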