# Profitable app profiles for the App Store and Google Play markets

In this project, we analyse data on free apps in the App Store and Google Play to find out which types of free apps attract the most users.

We are working for a company that makes free apps and whose main revenue comes from in-app ads. Popular apps have more users, so more in-app ads are viewed and more revenue is generated. By looking at successful apps, our developers will be able to understand what type of app to develop to attract more users.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # we add a new (empty) line after each row, it starts to look like a table this way

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Above, we defined a new function which makes it easier to inspect the rows of data we have. The function assumes the argument for the dataset parameter doesn't have a header row. Below, we open and save 2 data sets on Android and iOS apps as lists of lists, to help us achieve our task:
- you can download the data about apps from Google Play from this [link](https://www.kaggle.com/datasets/lava18/google-play-store-apps)
- you can download the data about apps from the App Store from this [link](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps)



In [4]:
from csv import reader

# The Google Play data set 
# encoding='utf8' because some app names contain non-ASCII characters
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
gplay = list(read_file)
opened_file.close()
gplay_header = gplay[0]
gplay = gplay[1:]

# The App Store data set 
opened_file2 = open('AppleStore.csv', encoding='utf8')
read_file2 = reader(opened_file2)
apple = list(read_file2)
opened_file2.close()
apple_header = apple[0]
apple = apple[1:]
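
The two loading blocks above repeat the same steps, so they could be wrapped in a small helper. The following is only a sketch: the name `load_csv` is hypothetical and not part of this notebook, and it assumes both files are UTF-8 encoded and start with a header row.

```python
import csv

def load_csv(path):
    """Return (header, rows) for a CSV file, with rows as a list of lists."""
    # A 'with' block closes the file automatically, even if reading fails.
    # newline='' is the setting the csv module documentation recommends.
    with open(path, encoding='utf8', newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]
```

With this helper, the loading step would reduce to `gplay_header, gplay = load_csv('googleplaystore.csv')` and the equivalent call for `'AppleStore.csv'`.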

## Opening and Exploring our Data 
**Let's explore the datasets with our newly defined function, starting with the App Store data:**

In [6]:

# These are the column names for the App Store data; each name briefly describes what the column holds.
print(apple_header)
print('\n')
# exploring our datasets' first few rows
explore_data(apple, 0, 3, True)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


**We have data on 7197 apps from Apple's App Store.**

Some of the column names are not self-explanatory, so if a name is unclear, please have a look at the data's documentation [here](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

**From the column names of this dataset, the columns that seem helpful in our analysis are: 'track_name', 'price', 'rating_count_tot' and 'prime_genre'**

In [8]:
print(gplay_header)
print('\n')
explore_data(gplay, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


**We have data on 10,841 apps from the Google Play Store.**

**From the names of the columns for our datasets, the types of data that seem helpful in our analysis are: 'App', 'Installs', 'Content Rating', 'Rating', 'Price', 'Reviews'**

### Deleting Wrong Data

From the dataset's discussion forum, we notice that there is an error in the row at index 10472 (not counting the header). To identify it, we compare this row to our header row and to a correct row of data, to see if there are any differences.

In [11]:
print(gplay_header)
print(gplay[10472])
print(gplay[20])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['Logo Maker - Small Business', 'ART_AND_DESIGN', '4.0', '450', '14M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 20, 2018', '4.0', '4.1 and up']


Comparing the row at index 10472 with the header row and a correct row, we see that the 'Category' entry is missing for 'Life Made WI-Fi Touchscreen Photo Frame'. Every later value is therefore shifted one column to the left, so the 'Rating' column holds 19, which is impossible since the maximum rating is 5.0.

We should therefore delete this row.

In [13]:
# let's compare the length of our dataset before and after we delete this app information, to make sure we have deleted it.
print(len(gplay))
# use del to delete an element from our dataset
del gplay[10472]
print(len(gplay))

10841
10840
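
Rather than relying on the forum alone, a shifted row like this one could also be detected by comparing each row's length with the header's length. The sketch below uses a hypothetical helper name, `find_malformed_rows`, and a tiny sample trimmed to three columns for illustration.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Tiny sample trimmed to three columns: the second row has no 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Logo Maker - Small Business', 'ART_AND_DESIGN', '4.0'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]
print(find_malformed_rows(sample_header, sample_rows))  # [1]
```

Calling `find_malformed_rows(gplay_header, gplay)` on the full data before the deletion should flag index 10472, since that row has 12 values against 13 column names.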


### Removing Duplicate Entries: 

Looking at the Google Play data, we see that some apps have multiple entries. 
An example would be Instagram:

In [15]:
for app in gplay:
    name = app[0]
    if name == 'Instagram':
        print(app)
        # this will show us all Instagram entries 

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We do not want to have duplicate entries in our data. It is best to find all the apps with multiple entries and then remove the duplicates, keeping one entry per app.


In [17]:
# separate the unique apps with 1 entry in our data and the apps with duplicate entries
duplicate_apps = []
unique_apps = []
# identify the apps with duplicate entries and update our lists
for app in gplay:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
# count how many duplicate entries there are (an app with three entries contributes two)
print(f'There are {len(duplicate_apps)} duplicate entries')
print('\n')
print('Some examples of duplicate apps are: ', duplicate_apps[:10]) 

There are 1181 duplicate entries


Some examples of duplicate apps are:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


We want to remove the duplicate entries and keep only one entry per app. We could delete duplicates randomly until one is left, but this is not ideal. Ideally, we want the most recent data for each app so that our analysis is as accurate as possible. We can estimate which entry is the most recent by looking at the fourth element of each row (index 3), the number of reviews the app has. Since review counts grow over time, the row with the highest number of reviews for a given app is most likely the most recent one, and that is the row we want to keep.

To do this, we can use a dictionary:
- The keys will be unique app names, and each associated value will be the highest review count found for that app. Because dictionary keys are unique, we end up with exactly one value per app.
- We then use this dictionary to build our cleaned dataset.

In [19]:
reviews_max = {}

for app in gplay:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        # update to the most recent value for number of reviews of the app
    
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        # add a key for this app as it is our first entry for it

        

From our previous results, we know there are 1181 duplicate entries. Based on this, our dictionary's length should be the difference between the length of our dataset and 1181, which is equal to the number of unique entries of apps. 
Let us verify to make sure our dictionary is correct:

In [21]:
print('Expected length:', len(gplay) - 1181)
print('Actual Length:', len(reviews_max))

Expected length: 9659
Actual Length: 9659


Now that we know the dictionary is the right length, let's use it to remove the duplicate rows. In the code below:

- We first make two empty lists: gplay_clean and already_added. The first holds our new cleaned dataset, the second keeps track of which apps we have already added to it.
- We loop through gplay, and in each iteration:
    - We save the name and number of reviews for the current app to separate variables
    - We add the row of data to gplay_clean and the app's name to already_added if both conditions hold:
        - The number of reviews for the app is equal to the value saved for this app in reviews_max. Remember, our dictionary maps each app name (the key) to its highest number of reviews.
        - The name of the app is not already in the already_added list. This guarantees one entry per app, including in cases where more than one entry of a duplicate app shares the same highest number of reviews.

In [23]:
# stores our new cleaned data with no duplicates
gplay_clean = []
# for app names
already_added = []

for app in gplay:
    name = app[0]
    n_reviews = float(app[3])

    if (n_reviews == reviews_max[name]) and (name not in already_added):
        gplay_clean.append(app)
        already_added.append(name)


Let's now explore our new cleaned dataset for the apps from Google Play using explore_data(). We do this to ensure everything has gone correctly and that we have the correct number of rows in our dataset, 9659.

In [25]:
explore_data(gplay_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


As expected from our dictionary earlier, we have 9659 rows.
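
As an aside, the two passes above (build reviews_max, then filter) could be merged into one pass by storing the whole row in the dictionary instead of just the review count. This is a sketch of that alternative, not the method used in this notebook; the function name `dedupe_by_max_reviews` is hypothetical, and the sample rows are trimmed to four columns.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        # Strict '>' keeps the earliest row when review counts tie
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
print(dedupe_by_max_reviews(sample))  # [['Instagram', 'SOCIAL', '4.5', '66577446']]
```

The result should contain the same rows as gplay_clean, although their order may differ because the dictionary keeps the position of each app's first appearance.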

### Removing Non-English Apps: 

After a deeper look at the datasets, we find some non-English apps in our data too. We can see some examples below:

In [28]:
print(apple[813][1])

print(gplay_clean[4412][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
中国語 AQリスニング


Since the apps developed at our company are in English, we want to analyse only English apps and remove non-English apps.

One way to do this is to remove apps whose name contains a character that is not normally used in English text. Each character has a corresponding number (its code point), and the characters commonly used in English text fall in the ASCII range, 0 to 127.

Since we are only interested in characters in the range 0 to 127, we can define a function which uses the built-in function ord() to identify which app names have non-English characters.

In [30]:
def is_english(string):
    for character in string:
        # iterate over each character in the string
        if ord(character) > 127:
            return False
            # the name is labelled Non-English and the function ends here, skipping the code below
    return True
# let's test it out with some strings

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


Our function works fine, except when English apps have some symbols and emojis which are outside of our ASCII range of 0 to 127, which means they incorrectly get labelled as Non-English.
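
To see why these names fail the check, one could list the characters that fall outside the ASCII range together with the numbers ord() returns for them. This is a small illustrative sketch; the helper name `non_ascii_chars` is hypothetical.

```python
def non_ascii_chars(name):
    """Map each character outside the ASCII range (0 to 127) to its ord() value."""
    return {ch: ord(ch) for ch in name if ord(ch) > 127}

for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, '->', non_ascii_chars(name))
```

Each of the two mislabelled names contains only a single character above 127, which suggests that counting such characters is a gentler test than rejecting a name on the first one.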

We cannot let this happen as we would be removing useful data, so we should redefine our function to fix this.

We will only count an app name as Non-English if it contains more than 3 characters outside the ASCII range (that is, with an ord() value greater than 127). This way English apps with a few special symbols or emojis will still be classed as English apps. This is not a perfect filter but it should be good enough.

In [32]:
def is_english(string):
    # have a variable to keep track of a string's non-english characters
    not_eng = 0
    for character in string:
        
        if ord(character) > 127:
            not_eng +=  1
    if not_eng > 3:
        return False
        # strings with more than 3 non-english characters will be labelled as Non-English
    else:
        return True
        # the app name is English

# let's test it out with some strings


print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False


Now that our function is defined correctly, let us use it to remove Non-English app names from both datasets:

In [34]:
# lists for English apps from Apple App Store dataset
eng_apple = []
# lists for English apps from Google Play Store dataset
eng_gplay = []
# Filtering App store data for English apps
for app in apple:
    name = app[1]
    if is_english(name):
        eng_apple.append(app)

# Filtering Google Play data for English apps 
for app in gplay_clean:
    name = app[0]
    if is_english(name):
        eng_gplay.append(app)
# let us see how many English apps we have
explore_data(eng_apple,0 ,3, True)
print('\n')
explore_data(eng_gplay,0 ,3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Vari

We are left with 6183 iOS apps and 9614 Android apps.

### Isolating the Free Apps

Our company only builds free apps, and its main source of revenue is in-app ads. Our datasets have both free and paid apps, so we want to isolate the free apps from the datasets.

In [37]:
# Google play english app data
final_gplay = []
for app in eng_gplay:
    price = app[7]
    if price == '0':
        final_gplay.append(app)

# Apple app store english apps
final_apple = []
for app in eng_apple:
    price = app[4]
    if price == '0.0':
        final_apple.append(app)

# check the lengths of our data
print(len(final_gplay))
print(len(final_apple))

8864
3222


We are left with 8864 Android apps and 3222 iOS apps, which should be enough for our analysis.
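
The string comparisons above depend on the exact formatting of each store's price column ('0' for Google Play, '0.0' for the App Store). A slightly more tolerant check could convert the price to a number first. The sketch below is hypothetical (the helper `is_free` is not part of this notebook) and assumes that paid prices may carry a leading '$' sign.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    # Assumption: paid prices may look like '$4.99', which float() cannot parse directly
    return float(price_text.strip().lstrip('$')) == 0.0

print(is_free('0'), is_free('0.0'), is_free('$4.99'))  # True True False
```

The same helper could then be used for both datasets, whatever column index holds the price.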

## Most Common Apps by Genre: 

### Part One

Our strategy for validating an app idea has three steps:
- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Since we ultimately want to release the app on both platforms, we want an app profile that works on both. A good starting point is to look at the most common genres on Google Play and on the App Store.

To do this, we will need to construct frequency tables for the 'prime_genre' column of the App Store data set, and the 'Genres' and 'Category' columns of the Google Play data set.



### Part Two 

We are going to build 2 functions to analyse the frequency tables:
- one generates a frequency table that shows percentages
- the other displays those percentages in descending order


In [41]:

def freq_table(dataset, index):
    freq_dict = {}
    total = 0 # to keep track of the total number of entries in our specified column
    for app in dataset:
        total += 1
        # saving the app's data point for the column we want
        value = app[index]
        
        if value in freq_dict:
            freq_dict[value] += 1
        else:
            freq_dict[value] = 1

    freq_percentages = {}
    for key in freq_dict:
        percentage = (freq_dict[key]/total) * 100
        freq_percentages[key] = percentage

    return freq_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
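
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library, which does the counting and the sorting in one place. This is an alternative illustration rather than the approach used in this notebook; the name `freq_table_counter` and the sample rows are made up.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs in descending order of count
    return {value: count / total * 100 for value, count in counts.most_common()}

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_counter(sample, 1))  # {'Games': 75.0, 'Music': 25.0}
```
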

            

Let's begin analysing the App Store's most popular genres:

In [43]:
# prime_genre column
display_table(final_apple,11) 

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


From above, we see that 'Games' is by far the most common genre: more than half of the free English apps are games, followed by 'Entertainment' at around 8%. Most free English apps on the App Store are designed for entertainment and leisure, for example: 
- games
- entertainment
- photo & video

Apps with practical uses (education, productivity, lifestyle, etc.) are rarer.
Based on this data alone, we cannot recommend an App Profile for our company. A genre with many apps does not necessarily have many users; supply in a genre could be much greater than the demand from users. 

Let's continue by looking at the Genres and Category columns of the Google Play data set:

In [45]:
# generate frequency tables for the columns we are interested in using the display_table function

display_table(final_gplay,1) # category column


FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Family and Game are the most common categories. The Family category mostly consists of games too, so we have similar results to the App Store data. However, Tools and other practical categories are better represented among Google Play's free English apps. Overall, Google Play is much more balanced between entertainment and practical apps.

Looking at the Genres column next:

In [47]:
display_table(final_gplay,9 ) # genre column

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

As the Category column suggested, practical apps are better represented on Google Play than on the App Store, where games and entertainment apps dominate.

Our frequency tables reveal the most common genres, but to recommend an App Profile, we next want to know which types of apps have the most users.

## Most Popular Apps by Genre on the App Store

One way to find the most popular genres (those with the most users) is to calculate the average number of installs per app for each genre. For the Google Play data set, we can find this in the Installs column, but for the App Store data set this information is missing. As a substitute, we'll use the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Below, we start by calculating the average number of user ratings per app genre on the App Store:



In [49]:
# generate a frequency table for prime_genres, to know the app genres
apple_genres = freq_table(final_apple, 11)


for genre in apple_genres:
    # store the total number of ratings for this genre
    total = 0 
    # stores the number of apps for this genre
    len_genre = 0

    for app in final_apple:
        genre_app = app[11]
        if genre_app == genre:
            # update the total number of ratings and number of apps for the genre
            tot_ratings = float(app[5])
            total = total + tot_ratings
            len_genre += 1
    # calculate the average number of ratings for the genre
    average_ratings = total/len_genre

    print(genre, ': ', average_ratings) 



Social Networking :  71548.34905660378
Photo & Video :  28441.54375
Games :  22788.6696905016
Music :  57326.530303030304
Reference :  74942.11111111111
Health & Fitness :  23298.015384615384
Weather :  52279.892857142855
Utilities :  18684.456790123455
Travel :  28243.8
Shopping :  26919.690476190477
News :  21248.023255813954
Navigation :  86090.33333333333
Lifestyle :  16485.764705882353
Entertainment :  14029.830708661417
Food & Drink :  33333.92307692308
Sports :  23008.898550724636
Book :  39758.5
Finance :  31467.944444444445
Education :  7003.983050847458
Productivity :  21028.410714285714
Business :  7491.117647058823
Catalogs :  4004.0
Medical :  612.0


From the above data, we see that Navigation apps have the highest average number of ratings, but this figure is heavily influenced by Waze and Google Maps:


In [51]:
for app in final_apple:
    if app[11] == 'Navigation':  # prime_genre column
        # print name and number of ratings
        print(app[1], ':', app[5]) 

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The same can be said for the Social Networking genre (Instagram, X, etc.) and the Music genre (Spotify, Apple Music): both are dominated by a few apps.

We want to find popular genres whose rating counts are not skewed by a few apps. That way the new app we make will actually be able to compete more easily with the other apps.
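
One simple way to check for this kind of skew is to compare the mean with the median, since the median is much less sensitive to a few very large values. The sketch below uses the six Navigation rating counts printed above and the standard `statistics` module.

```python
from statistics import mean, median

# Rating counts of the six free English Navigation apps printed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print(mean(navigation_ratings))    # about 86090.33, the genre average shown earlier
print(median(navigation_ratings))  # 8196.5
```

The mean is roughly ten times the median here, which is consistent with two apps accounting for most of the ratings in this genre.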

Another genre with a high average number of ratings is Reference. Let's have a look at the number of ratings for the apps in this genre:

In [53]:
for app in final_apple:
    if app[11] == 'Reference':  # prime_genre column
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Although the Bible and Dictionary.com apps have far more ratings than the others, there are still a handful of apps with a decent number of ratings. This genre could therefore be a good choice for our App Profile.

Based on our previous findings, we know that there are a lot more for-fun apps than practical apps on the App Store, so we should try to make our app more practical to stand out. 

Maybe we could have an app that would summarise textbooks or non-fiction books, and these summaries could be listened to as audio too.

Let's look at the Google Play data next:

## Most Popular Apps by Genre on Google Play

Our Google Play dataset has information on the number of installs. However, these numbers are not precise because they are open-ended buckets (e.g. '100,000+'). Looking at the 'Installs' column:

In [56]:
display_table(final_gplay, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


Although the values are open-ended, we only need a rough idea of which genres are most popular, so we do not need perfect precision. We will therefore treat each value as exact, i.e. we assume an app with 1,000+ installs has 1,000 installs.
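
A minimal sketch of this convention, using a hypothetical helper name `parse_installs` that is not part of this notebook's own code:

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000+' to its lower bound as a float."""
    return float(text.replace(',', '').replace('+', ''))

print(parse_installs('1,000+'))          # 1000.0
print(parse_installs('1,000,000,000+'))  # 1000000000.0
print(parse_installs('0'))               # 0.0
```
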

To calculate the average number of installs per category, we need to convert these strings into floats, which means removing the comma and plus characters first. This is done below within our loops:

In [112]:
categories_gplay = freq_table(final_gplay, 1)

for category in categories_gplay:
    # total number of installs for this category
    total = 0
    # number of apps in this category
    len_category = 0
    
    for app in final_gplay:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            # update values
            total += float(n_installs)
            len_category += 1
    # compute the average installs per app for this category
    avg_installs = total / len_category
    print(category, ':', avg_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

We see above that communication apps have the most installs on average: 38,456,119 per app. However, this number is heavily skewed by apps with over one billion installs (e.g. WhatsApp, Facebook Messenger, Skype), and by some apps with over 100 and 500 million installs:

In [115]:
for app in final_gplay:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+' or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

The same pattern appears in the video players category, which is dominated by apps such as YouTube and Google Play Movies. Photography, productivity and social apps are also dominated by a handful of apps, which would make it hard for us to compete if we entered these categories.
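To gauge how much the largest apps skew a category's average, one option is to recompute the average after leaving out apps at or above 100 million installs. The sketch below is only an illustration: `sample_gplay`, `parse_installs` and `average_installs` are hypothetical names, and the rows are made-up data that only mirror the column positions used in this notebook (category at index 1, installs at index 5).

```python
# Hypothetical sample rows; only index 1 (category) and index 5 (installs)
# follow the layout used in this notebook. The values are invented.
sample_gplay = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '10,000,000+'],
    ['App C', 'COMMUNICATION', '', '', '', '500,000+'],
    ['App D', 'TOOLS', '', '', '', '100,000,000+'],
]


def parse_installs(installs):
    """Convert a string such as '1,000,000+' to the float 1000000.0."""
    return float(installs.replace(',', '').replace('+', ''))


def average_installs(dataset, category, cap=None):
    """Average installs for a category, skipping apps at or above cap if given."""
    counts = []
    for app in dataset:
        if app[1] != category:
            continue
        n_installs = parse_installs(app[5])
        if cap is None or n_installs < cap:
            counts.append(n_installs)
    # Return 0.0 when no app matches, to avoid dividing by zero
    return sum(counts) / len(counts) if counts else 0.0


print('All apps:', average_installs(sample_gplay, 'COMMUNICATION'))
print('Under 100M:', average_installs(sample_gplay, 'COMMUNICATION', cap=100_000_000))
```

Running the same two calls on `final_gplay` would show how far the communication average drops once the giants are excluded.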

Previously we found that the games category makes up the largest proportion of apps on the Google Play store, so it is best to avoid this saturated category.

The books and reference genre looks fairly popular, with an average number of installs of 8,767,811. We want to take a closer look at this category, since it was our recommended app profile for the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [119]:
for app in final_gplay:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

We should check whether this category's average is heavily skewed by a few apps, or whether many apps have a decent number of installs (between 1,000,000 and 100,000,000). First, we list the apps with 100,000,000 or more installs:

In [122]:
# Books and reference apps with at least 100 million installs
top_installs = ('1,000,000,000+', '500,000,000+', '100,000,000+')
for app in final_gplay:
    if app[1] == 'BOOKS_AND_REFERENCE' and app[5] in top_installs:
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


Only a few apps are extremely popular, so this category could be promising. Next, we list the apps with between 1,000,000 and 100,000,000 installs:

In [129]:
# Books and reference apps with between 1 million and 100 million installs
mid_installs = ('1,000,000+', '5,000,000+', '10,000,000+', '50,000,000+')
for app in final_gplay:
    if app[1] == 'BOOKS_AND_REFERENCE' and app[5] in mid_installs:
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

A good number of apps in this category have decent levels of popularity. The genre already contains many ebook readers, library collections and dictionaries, so we should avoid building similar apps, where competition will be strong.
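Instead of reading through the printed lists, one could tally how many apps fall into each installs bracket. The following is a minimal sketch under stated assumptions: `sample_gplay`, `installs_distribution` and `parse_installs` are hypothetical names, and the rows are invented data that only follow this notebook's column layout (category at index 1, installs at index 5).

```python
from collections import Counter

# Hypothetical sample rows; the app names and install values are invented.
sample_gplay = [
    ['Reader A', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Reader B', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Dictionary C', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Dictionary D', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Chat E', 'COMMUNICATION', '', '', '', '1,000,000+'],
]


def parse_installs(installs):
    """Convert a string such as '1,000,000+' to the float 1000000.0."""
    return float(installs.replace(',', '').replace('+', ''))


def installs_distribution(dataset, category):
    """Count how many apps of a category fall into each installs bracket."""
    return Counter(app[5] for app in dataset if app[1] == category)


distribution = installs_distribution(sample_gplay, 'BOOKS_AND_REFERENCE')

# Print the brackets from the largest to the smallest number of installs
for bracket in sorted(distribution, key=parse_installs, reverse=True):
    print(bracket, ':', distribution[bracket])
```

Applied to `final_gplay`, the same tally would show at a glance whether most books and reference apps sit in the middle brackets or at the extremes.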

## Conclusion

In this project, we analysed data about App Store and Google Play mobile apps in order to recommend an app profile that can be profitable in both markets.

Taking a popular book and transforming it into an app could be profitable for both the Google Play and the App Store markets.
Since there are already a lot of library apps, our app needs unique features beyond the raw text of the book. These could include:
- daily quotes or reminders
- audio format
- a discussion forum related to the book for readers 
