# Profitable App Profile Study for App Store and Google Play Marketplaces

The aim of this project is to develop an understanding of the profitable app profiles for the App Store and Google Play marketplaces. This will be achieved through identifying the types of apps that are likely to attract the most users.

For individual developers and companies that build free apps, the main source of revenue is in-app advertising. The revenue of any given app is therefore highly dependent on the size of its user base. Herein, we will analyse market data to infer the types of apps that are likely to attract the most users.

As of September 2020, there are approximately 3 million Android apps on Google Play alone (Statista, 2020). The total number of apps roughly doubles with the addition of iOS apps from the App Store.

Collecting data for 6 million apps requires a significant amount of time and resources. To minimise resource expenditure, we will begin the study with existing data available for public access.

The data used in this project are from the following sources:

> A [data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018.

> A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017.

In [1]:
# Define an explore_data function that prints a slice of the dataset and, when required, reports the shape of the dataset.

def explore_data(dataset, start, end, rows_and_columns = False):
    ds_slice = dataset[start:end]
    for row in ds_slice:
        print(row) # prints each row within the sliced dataset
        print('\n') # adds blank lines after each row
    if rows_and_columns:
        print('The dataset has {} rows.'.format(len(dataset)))
        print('The dataset has {} columns.'.format(len(dataset[0])))

In [2]:
from csv import reader

# Load Apple Store Data
with open('AppleStore.csv', mode = 'r', encoding = 'utf8') as f:
    as_data = list(reader(f))

# Load Google Play Data
with open('googleplaystore.csv', mode = 'r', encoding = 'utf8') as f:
    gp_data = list(reader(f))


In [3]:
# Print first three rows of Apple Store Data
explore_data(as_data, 0, 3, True)

# Links to the dataset documentation are available in the synopsis for clarifications on the column names.


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


The dataset has 7198 rows.
The dataset has 16 columns.


In [4]:
# Print first three rows of Google Play Store Data
explore_data(gp_data, 0, 3, True)

# Links to the dataset documentation are available in the synopsis for clarifications on the column names.

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


The dataset has 10842 rows.
The dataset has 13 columns.


This project focuses on the profitability profile of apps that are *free* to download and install, and are directed towards an *English-speaking* audience.

Preliminary data cleaning will be performed to remove the irrelevant data (i.e. paid apps and non-English apps).

Note we are not using the numpy or pandas modules yet - the detection of erroneous data/structure will be performed using basic Python loops.

In [5]:
# Define an error detection function that checks for rows missing columns
def detect_error(dataset):
    ds_col_len = len(dataset[0]) # take the length of the header row, assuming the header row has the correct number of columns
    for index, row in enumerate(dataset):
        if len(row)!= ds_col_len:
            print(row)
            print('The index for the erroneous row is {}'.format(index))
    print('Search completed.')

In [6]:
# Check for data integrity errors in Apple Store Data
detect_error(as_data)

# Completed without error

Search completed.


In [7]:
# Check for data integrity errors in Google Play Data
detect_error(gp_data)

# Found a row with a missing column. The index of the row is recorded for ease of tracking.

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
The index for the erroneous row is 10473
Search completed.


In [8]:
# A Google search indicates that the app in the erroneous row is a lifestyle app; its Category value is missing, which shifts the remaining columns. Additional tags may, however, have been present at the time of data collection, so for the purpose of this project the row is excluded.

# Check row to confirm the printed index is correct.
gp_data[10473]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [9]:
# Remove row
print('The original length is {}'.format(len(gp_data)))
del gp_data[10473]
print('The length post modification is {}'.format(len(gp_data)))

The original length is 10842
The length post modification is 10841


It has also been noted in the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section for the Google Play Store dataset that the dataset contains duplicate entries. We will first identify the apps with duplicate entries in the following cell.

In [10]:
# Create an empty dictionary to check the number of times an app has appeared in the dataset

app_names = dict()

for index, row in enumerate(gp_data):
    # As the dictionary is created empty, the first count of the app_names will be initialised with the 'not in' check.
    if row[0] not in app_names:
        app_names[row[0]] = [1,[index]]
    else:
        app_names[row[0]][0] += 1 # Add to the number of times the app has appeared in the dataset
        app_names[row[0]][1].append(index) # Record the index of the entries

print(app_names)

del app_names['App'] # Remove the heading row wrongfully counted as an app.
# Note the heading could also have been excluded from gp_data entirely during extraction. Either approach will work.


{'App': [1, [0]], 'Photo Editor & Candy Camera & Grid & ScrapBook': [1, [1]], 'Coloring book moana': [2, [2, 2034]], 'U Launcher Lite – FREE Live Cool Themes, Hide Apps': [1, [3]], 'Sketch - Draw & Paint': [1, [4]], 'Pixel Draw - Number Art Coloring Book': [1, [5]], 'Paper flowers instructions': [1, [6]], 'Smoke Effect Photo Maker - Smoke Editor': [1, [7]], 'Infinite Painter': [1, [8]], 'Garden Coloring Book': [1, [9]], 'Kids Paint Free - Drawing Fun': [1, [10]], 'Text on Photo - Fonteee': [1, [11]], 'Name Art Photo Editor - Focus n Filters': [1, [12]], 'Tattoo Name On My Photo Editor': [1, [13]], 'Mandala Coloring Book': [1, [14]], '3D Color Pixel by Number - Sandbox Art Coloring': [1, [15]], 'Learn To Draw Kawaii Characters': [1, [16]], 'Photo Designer - Write your name with shapes': [1, [17]], '350 Diy Room Decor Ideas': [1, [18]], 'FlipaClip - Cartoon animation': [1, [19]], 'ibis Paint X': [1, [20]], 'Logo Maker - Small Business': [1, [21]], "Boys Photo Editor - Six Pack & Men's Su

Now that we have a collection of the app names with the number of times they have appeared in the dataset, we can focus on the repeated entries and decide on how to clean the data.
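As an aside, the same tally can be sketched with `collections.Counter` from the standard library. This is only an illustrative alternative on made-up rows (`toy_rows` is hypothetical); unlike the `app_names` dictionary above, it does not record the row indices, so it suits a quick count rather than the clean-up that follows.

```python
from collections import Counter

# Toy rows in the same layout as gp_data (header first, app name at index 0).
# The values are made up for illustration only.
toy_rows = [
    ['App', 'Category'],
    ['Alpha', 'TOOLS'],
    ['Beta', 'GAME'],
    ['Alpha', 'TOOLS'],
    ['Alpha', 'TOOLS'],
]

# Skip the header row with a slice so it is never counted as an app.
name_counts = Counter(row[0] for row in toy_rows[1:])

# Keep only the names that occur more than once.
repeated = {name: n for name, n in name_counts.items() if n > 1}
print(repeated)
```

Slicing off the header up front avoids the later `del` of the header key.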

In [11]:
app_repeated = []

for key in list(app_names):
    if app_names[key][0] > 1:
        app_repeated.append(key)
        
print(len(app_repeated)) # A quick check shows there are 798 apps with repeated entries.

798


More popular apps may have multiple entries. Let's check whether this is the case for Facebook, Messenger and Twitter.

In [12]:
print('Facebook' in app_repeated)
print('Messenger' in app_repeated)
print('Twitter' in app_repeated)

True
False
True


Let's take a more detailed look at Twitter entries.

In [13]:
for tw_index in app_names['Twitter'][1]:
    print(gp_data[tw_index])

['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11657972', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'July 30, 2018', 'Varies with device', 'Varies with device']


Notice **Entry 1** and **Entry 2** are identical - either can be removed while preserving the other entry.

**Column 4** of each entry holds the number of reviews the app has received. Since review counts accumulate over time, the entry with the highest number of reviews is the most recent data point. This is confirmed by cross-referencing the Last Updated date of the app (**Column 11**).

We will now create a cleaned dataset *gp_data_clean* by removing the duplicate entries. In the case where there are multiple entries for the same app, only the entries with the highest review numbers will be retained.

In [14]:
gp_data_clean = []
for key in list(app_names):
    # Looping through each key of the dictionary
    
    if app_names[key][0] == 1:
        # Check if it's an app with only a single entry
        gp_index = app_names[key][1][0]
        gp_data_clean.append(gp_data[gp_index])
    else:
        dup_instances = app_names[key][1]
        
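        gp_index = dup_instances[0] # initialise selection with the first duplicate, so the header row (index 0) can never be picked when every duplicate has zero reviews
        hist_rev_count = 0 # initialise review count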
        
        for index in dup_instances: # looping through all duplicate entries to find the entry with the highest review count
            current_rev_count = float(gp_data[index][3])
            if current_rev_count > hist_rev_count:
                hist_rev_count = current_rev_count
                gp_index = index
        gp_data_clean.append(gp_data[gp_index])
        

Check the length of the cleaned data matches that of the unique app names.

In [15]:
len(gp_data_clean) == len(list(app_names))

True

Using Twitter as a spot check, confirm if only the entries with the highest number of reviews have been retained.

In [16]:
for row in gp_data_clean:
    if row[0] == 'Twitter':
        print(row)

['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']


This is indeed the entry with the most reviews. The cleaned dataset *gp_data_clean* now only contains entries with unique app names.

In [17]:
print(gp_data_clean[6407])

['Company Kitchen', 'LIFESTYLE', '2.8', '81', '7.7M', '10,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'March 6, 2018', '2.0.0 (18.03.01)', '2.3.3 and up']


Note there are still non-English apps in our cleaned list. The next step is to remove these entries as well. The key is the built-in *ord()* function, which returns the Unicode code point of a character; the ASCII characters commonly used in English text occupy code points 0 to 127. We will begin by creating a function that checks whether a string contains any character outside this range. If the code point of a character is less than or equal to 127, the character belongs to the set of common English characters.
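To make the range test concrete, the short sketch below prints the code points of a few sample characters (the sample list is illustrative only). It also shows `str.isascii()`, available from Python 3.7, which applies the same test to a whole string.

```python
# ord() returns the Unicode code point of a single character.
samples = ['A', 'z', ' ', 'ほ', '😜']

for ch in samples:
    code = ord(ch)
    # Code points 0-127 make up the ASCII range.
    print(repr(ch), code, code <= 127)

# Flags recording which sample characters fall inside the ASCII range.
ascii_flags = [ord(ch) <= 127 for ch in samples]

# str.isascii() (Python 3.7+) performs the same range test on a whole string.
print('Crackers'.isascii())
print('Instachat 😜'.isascii())
```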

In [18]:
def check_eng(text):
    for letter in text:
        if ord(letter) > 127:
            return False # note the return here also breaks the function
    else: # the for-else loop enables the return of boolean True only if the for loop has been completed with no break
        return True

In [19]:
# Testing the check function
print(check_eng('Crackers'))
print(check_eng('ほっともっと'))
print(check_eng('荞麦天妇罗'))
print(check_eng('Instachat 😜'))

True
False
False
False


Notice that the function output for 'Instachat 😜' is also False. This is because the emoji lies outside the ASCII range (code points 0 to 127) that we have allowed for. Filtering the dataset with the original function could therefore discard useful data, as some English apps would be labelled as non-English. To minimise this data loss, we will set an arbitrary threshold: an app is marked as non-English only when its name contains three or more non-ASCII characters. The filter is still not perfect (e.g. it cannot detect non-English names with only one or two non-ASCII characters), but it should be adequate for our purpose.

In [20]:
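def check_eng(text):
    non_ascii_count = 0 # number of characters outside the ASCII range
    for letter in text:
        if ord(letter) > 127:
            non_ascii_count += 1
    return non_ascii_count < 3 # three or more non-ASCII characters marks the name as non-English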

In [21]:
# Testing the check function
print(check_eng('Crackers'))
print(check_eng('ほっともっと'))
print(check_eng('荞麦天妇罗'))
print(check_eng('Instachat 😜'))

True
False
False
True


In [22]:
# Apply the function to filter non-English apps from gp_data_clean
gp_data_eng = []
gp_data_non_eng = []

for row in gp_data_clean:
    if not check_eng(row[0]):
        gp_data_non_eng.append(row)
    else:
        gp_data_eng.append(row)

print(len(gp_data_eng))
print(len(gp_data_non_eng))

9597
62


At this stage, we have carried out the following:

- Removed entries with missing data
- Removed duplicate entries
- Removed non-English apps

Recall the focus of the project is on free apps. We will now create a separate list with free apps only.

In [23]:
gp_data_final = []

for row in gp_data_eng:
    if row[6] == 'Free':
        gp_data_final.append(row)
        
print(len(gp_data_final))

8844


We will now perform the same cleaning process on the App Store data to: 1. remove duplicate entries; 2. remove non-English apps; and 3. remove paid apps. Note we have already checked that App Store data does not contain entries with missing columns.
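Because these steps mirror the Google Play ones, the duplicate-removal rule could in principle be factored into a single helper that takes the column positions as arguments. The sketch below is one hypothetical way to do so; `keep_most_reviewed` and `toy_rows` are illustrative names, not part of the notebook, and the rows passed in are assumed to exclude the header.

```python
def keep_most_reviewed(rows, name_idx, count_idx):
    """Return one row per app name, keeping the row with the highest count.

    rows is assumed to exclude the header row.
    """
    best = {}
    for row in rows:
        name = row[name_idx]
        count = float(row[count_idx])
        # Keep the first row seen for a name, then replace it only when a
        # strictly larger count appears.
        if name not in best or count > float(best[name][count_idx]):
            best[name] = row
    return list(best.values())


# Toy rows laid out like as_data: name at index 1, rating count at index 5.
toy_rows = [
    ['1', 'Alpha', '0', 'USD', '0.0', '10'],
    ['2', 'Beta', '0', 'USD', '0.0', '7'],
    ['3', 'Alpha', '0', 'USD', '0.0', '25'],
]

deduped = keep_most_reviewed(toy_rows, name_idx=1, count_idx=5)
for row in deduped:
    print(row)
```

With the Google Play layout the call would use `name_idx=0` and `count_idx=3` instead.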

In [24]:
app_names = dict()

for index, row in enumerate(as_data):
    # We have modified the code slightly as the entry with Index 1 is the app name for as_data
    if row[1] not in app_names:
        app_names[row[1]] = [1,[index]]
    else:
        app_names[row[1]][0] += 1 # Add to the number of times the app has appeared in the dataset
        app_names[row[1]][1].append(index) # Record the index of the entries

print(app_names)

del app_names['track_name']



In [25]:
app_repeated = []

for key in list(app_names):
    if app_names[key][0] > 1:
        app_repeated.append(key)
        
print(len(app_repeated))
print(app_repeated)

2
['Mannequin Challenge', 'VR Roller Coaster']


There are only two apps with repeated names - Mannequin Challenge and VR Roller Coaster.

In [26]:
print(app_names['Mannequin Challenge'])
print(as_data[2949])
print(as_data[4464])

[2, [2949, 4464]]
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']


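The column with Index 5 holds the number of ratings received. We can see that entry as_data\[2949\] is likely the more recent entry as it has more ratings. Note that the two entries carry different ids, so they may in fact be two distinct apps sharing a name; for simplicity, we keep only the one with more ratings.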

In [27]:
as_data_clean = []
for key in list(app_names):
    # Looping through each key of the dictionary
    
    if app_names[key][0] == 1:
        # Check if it's an app with only a single entry
        as_index = app_names[key][1][0]
        as_data_clean.append(as_data[as_index])
    else:
        dup_instances = app_names[key][1]
        
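        as_index = dup_instances[0] # initialise selection with the first duplicate, so the header row (index 0) can never be picked when every duplicate has zero ratings
        hist_rev_count = 0 # initialise review count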
        
        for index in dup_instances: # looping through all duplicate entries to find the entry with the highest review count
            current_rev_count = float(as_data[index][5]) # modified for as_data
            if current_rev_count > hist_rev_count:
                hist_rev_count = current_rev_count
                as_index = index
        as_data_clean.append(as_data[as_index])
        

In [28]:
len(as_data_clean) == len(list(app_names))

True

In [29]:
as_data_eng = []
as_data_non_eng = []

for row in as_data_clean:
    if not check_eng(row[1]):
        as_data_non_eng.append(row)
    else:
        as_data_eng.append(row)

print(len(as_data_eng))
print(len(as_data_non_eng))

6153
1042


In [30]:
as_data_final = []

for row in as_data_eng:
    if float(row[4]) == 0:
        as_data_final.append(row)
        
print(len(as_data_final))

3201


In [31]:
print(len(gp_data_final))
print(len(as_data_final))

8844
3201


Now we have cleaned the datasets *gp_data_final* and *as_data_final* representing the free, English apps on Google Play and App Store respectively.

Recall the aim of this project is to determine the types of free apps that will attract the most users, which in turn will maximise the ad revenue generated.
Assuming the company is more experienced in developing Android apps, one potential validation strategy can be as follows:

1. Build the base version of an app and add it to Google Play.
2. If the app has generated enough interest, develop the app further by adding user requested features, runtime improvement, etc.
3. If the app is deemed profitable after six months (determined via OPEX and ad revenue comparison), develop an iOS version of the app and add it to App Store.

The end objective of our strategy involves having the app on both Google Play and App Store. A good starting point therefore can be establishing app genres that are most common on each platform.

- The column identifier for genre in as_data_final would be **Index 11 prime_genre**
- The column identifier for genre in gp_data_final would be **Index 1 Category** and **Index 9 Genres**

We will proceed to build frequency tables that show the proportion of apps in each genre. Two functions will be built to standardise the process - one that generates a frequency table expressed in percentages, the other to display the percentages in ascending/descending order.

In [32]:
def freq_table(dataset, index):
    genre_dict = dict()
    # First, generate a dictionary with number of apps for each genre
    for row in dataset:
        genre_app = row[index]
        if genre_app in genre_dict:
            genre_dict[genre_app] += 1
        else:
            genre_dict[genre_app] = 1
    # We will then convert the numbers into percentages
    app_num = len(dataset)
    for entry in genre_dict:
        genre_dict[entry] /= app_num
        genre_dict[entry] *= 100
    return genre_dict

In [33]:
as_genre = freq_table(as_data_final, 11)

In [34]:
print(as_genre)

{'Social Networking': 3.3114651671352706, 'Photo & Video': 4.99843798812871, 'Games': 58.23180256169947, 'Music': 2.0618556701030926, 'Reference': 0.5310840362386754, 'Health & Fitness': 2.0306154326772883, 'Weather': 0.8747266479225243, 'Utilities': 2.4679787566385505, 'Travel': 1.2496094970321776, 'Shopping': 2.592939706341768, 'News': 1.3433302093095907, 'Navigation': 0.18744142455482662, 'Lifestyle': 1.5620118712902218, 'Entertainment': 7.841299593876913, 'Food & Drink': 0.8122461730709154, 'Sports': 2.1555763823805063, 'Book': 0.37488284910965325, 'Finance': 1.0934083099031553, 'Education': 3.6863480162449234, 'Productivity': 1.7494532958450486, 'Business': 0.5310840362386754, 'Catalogs': 0.12496094970321774, 'Medical': 0.18744142455482662}


In [35]:
gp_genre = freq_table(gp_data_final, 1)
print(gp_genre)

{'ART_AND_DESIGN': 0.644504748982361, 'FAMILY': 18.939393939393938, 'AUTO_AND_VEHICLES': 0.9271822704658526, 'BEAUTY': 0.5992763455450022, 'BOOKS_AND_REFERENCE': 2.1370420624151967, 'BUSINESS': 4.601990049751244, 'COMICS': 0.6105834464043419, 'COMMUNICATION': 3.233830845771144, 'TOOLS': 8.45771144278607, 'DATING': 1.8656716417910446, 'EDUCATION': 1.1646313885119857, 'ENTERTAINMENT': 0.9611035730438715, 'EVENTS': 0.7123473541383989, 'FINANCE': 3.70872908186341, 'FOOD_AND_DRINK': 1.2437810945273633, 'HEALTH_AND_FITNESS': 3.0868385345997282, 'HOUSE_AND_HOME': 0.8028041610131162, 'LIBRARIES_AND_DEMO': 0.9384893713251923, 'LIFESTYLE': 3.8896426956128454, 'GAME': 9.701492537313433, 'VIDEO_PLAYERS': 1.797829036635007, 'MEDICAL': 3.505201266395296, 'SOCIAL': 2.668475802804161, 'SHOPPING': 2.2501130710085935, 'PHOTOGRAPHY': 2.951153324287653, 'SPORTS': 3.3921302578018993, 'TRAVEL_AND_LOCAL': 2.340569877883311, 'PERSONALIZATION': 3.324287652645862, 'PRODUCTIVITY': 3.900949796472185, 'PARENTING':

The next step is to write a function that displays the frequency table in ascending/descending order as we desire.
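One common way to order a dictionary by its values is to sort its `(key, value)` pairs with a `key` function. The sketch below illustrates the idea on made-up percentages; `toy_freq` and `sort_freq` are hypothetical names used only for this example.

```python
# Made-up percentages standing in for the output of freq_table().
toy_freq = {'Games': 58.2, 'Entertainment': 7.8, 'Education': 3.7}


def sort_freq(freq, descending=True):
    """Return (genre, percentage) pairs ordered by percentage."""
    return sorted(freq.items(), key=lambda item: item[1], reverse=descending)


# Ascending order: smallest share first.
for genre, pct in sort_freq(toy_freq, descending=False):
    print(f'{genre}: {pct:.2f}')
```

Sorting on the value directly avoids having to place the percentage first in each row just to make the default ordering work.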

In [36]:
# Note this function only DISPLAYS the frequency table in sorted order; it does not return a new object.
def display_table(dataset, index, descending = True):
    table_temp = []
    genre_dict = freq_table(dataset, index) # A more elegant approach may be using a function wrapper
    for entry in genre_dict:
        table_temp.append([genre_dict[entry], entry])
    table_sorted = sorted(table_temp, reverse = descending) # honour the descending flag instead of always sorting in reverse
    for row in table_sorted:
        print(f'{row[1]}:{row[0]}')

In [37]:
# Frequency table for prime_genre on App Store
display_table(as_data_final, 11, descending = True)

Games:58.23180256169947
Entertainment:7.841299593876913
Photo & Video:4.99843798812871
Education:3.6863480162449234
Social Networking:3.3114651671352706
Shopping:2.592939706341768
Utilities:2.4679787566385505
Sports:2.1555763823805063
Music:2.0618556701030926
Health & Fitness:2.0306154326772883
Productivity:1.7494532958450486
Lifestyle:1.5620118712902218
News:1.3433302093095907
Travel:1.2496094970321776
Finance:1.0934083099031553
Weather:0.8747266479225243
Food & Drink:0.8122461730709154
Reference:0.5310840362386754
Business:0.5310840362386754
Book:0.37488284910965325
Navigation:0.18744142455482662
Medical:0.18744142455482662
Catalogs:0.12496094970321774


In [38]:
# Frequency table for Category on Google Play
display_table(gp_data_final, 1, descending = True)

FAMILY:18.939393939393938
GAME:9.701492537313433
TOOLS:8.45771144278607
BUSINESS:4.601990049751244
PRODUCTIVITY:3.900949796472185
LIFESTYLE:3.8896426956128454
FINANCE:3.70872908186341
MEDICAL:3.505201266395296
SPORTS:3.3921302578018993
PERSONALIZATION:3.324287652645862
COMMUNICATION:3.233830845771144
HEALTH_AND_FITNESS:3.0868385345997282
PHOTOGRAPHY:2.951153324287653
NEWS_AND_MAGAZINES:2.804161013116237
SOCIAL:2.668475802804161
TRAVEL_AND_LOCAL:2.340569877883311
SHOPPING:2.2501130710085935
BOOKS_AND_REFERENCE:2.1370420624151967
DATING:1.8656716417910446
VIDEO_PLAYERS:1.797829036635007
MAPS_AND_NAVIGATION:1.3907734056987788
FOOD_AND_DRINK:1.2437810945273633
EDUCATION:1.1646313885119857
ENTERTAINMENT:0.9611035730438715
LIBRARIES_AND_DEMO:0.9384893713251923
AUTO_AND_VEHICLES:0.9271822704658526
HOUSE_AND_HOME:0.8028041610131162
WEATHER:0.7914970601537766
EVENTS:0.7123473541383989
PARENTING:0.6558118498417006
ART_AND_DESIGN:0.644504748982361
COMICS:0.6105834464043419
BEAUTY:0.59927634554500

In [39]:
# Frequency table for Genres on Google Play
display_table(gp_data_final, 9, descending = True)

Tools:8.44640434192673
Entertainment:6.0832202623247404
Education:5.359565807327002
Business:4.601990049751244
Productivity:3.900949796472185
Lifestyle:3.878335594753505
Finance:3.70872908186341
Medical:3.505201266395296
Sports:3.4599728629579376
Personalization:3.324287652645862
Communication:3.233830845771144
Action:3.0981456354590686
Health & Fitness:3.0868385345997282
Photography:2.951153324287653
News & Magazines:2.804161013116237
Social:2.668475802804161
Travel & Local:2.3292627770239713
Shopping:2.2501130710085935
Books & Reference:2.1370420624151967
Simulation:2.0465852555404793
Dating:1.8656716417910446
Arcade:1.8430574400723654
Video Players & Editors:1.7752148349163273
Casual:1.7639077340569878
Maps & Navigation:1.3907734056987788
Food & Drink:1.2437810945273633
Puzzle:1.1307100859339665
Racing:0.9950248756218906
Role Playing:0.9384893713251923
Libraries & Demo:0.9384893713251923
Auto & Vehicles:0.9271822704658526
Strategy:0.9045680687471733
House & Home:0.8028041610131162
W

Based on the generated frequency table for the App Store, we can infer that this platform is dominated by apps designed for recreational purposes (e.g. Games at 58.23%, followed by Entertainment at 7.84%). The Google Play Store, on the other hand, exhibits a more evenly distributed landscape where practical apps and recreational apps both have their fair share of the market.

Note our cleaned datasets only include free, English apps. The results of our observation cannot be extended beyond this restricted scope.

Now that we have a better understanding of the app distribution on the App Store and Google Play, we would like to find the most popular types of apps. As a proxy for the number of users, we can take the average number of installs, or the average number of ratings received, per app in each genre (i.e. the genre total divided by the number of apps in that genre).
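The per-genre average described here can be sketched in a single pass over the rows by accumulating a running total and a count for each genre. The helper below is a hypothetical illustration on made-up rows (`mean_by_genre` and `toy_rows` are not part of the notebook).

```python
from collections import defaultdict


def mean_by_genre(rows, genre_idx, value_idx):
    """Average a numeric column per genre in a single pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[genre_idx]] += float(row[value_idx])
        counts[row[genre_idx]] += 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


# Toy rows: genre at index 0, rating count at index 1 (made-up values).
toy_rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Music', '50'],
]

means = mean_by_genre(toy_rows, genre_idx=0, value_idx=1)
print(means)
```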

We will focus on App Store apps first:

In [40]:
as_genre = freq_table(as_data_final, 11)
print(as_genre)

{'Social Networking': 3.3114651671352706, 'Photo & Video': 4.99843798812871, 'Games': 58.23180256169947, 'Music': 2.0618556701030926, 'Reference': 0.5310840362386754, 'Health & Fitness': 2.0306154326772883, 'Weather': 0.8747266479225243, 'Utilities': 2.4679787566385505, 'Travel': 1.2496094970321776, 'Shopping': 2.592939706341768, 'News': 1.3433302093095907, 'Navigation': 0.18744142455482662, 'Lifestyle': 1.5620118712902218, 'Entertainment': 7.841299593876913, 'Food & Drink': 0.8122461730709154, 'Sports': 2.1555763823805063, 'Book': 0.37488284910965325, 'Finance': 1.0934083099031553, 'Education': 3.6863480162449234, 'Productivity': 1.7494532958450486, 'Business': 0.5310840362386754, 'Catalogs': 0.12496094970321774, 'Medical': 0.18744142455482662}


In [41]:
def mean_genre_rating(dataset,genres,ig,ir): # ig = index genre in dataset; ir = index rating in dataset
    all_rating = []
    for genre in genres:
        total_count = 0
        len_genre = 0
        for row in dataset:
            if row[ig] == genre:
                total_count += float(row[ir])
                len_genre += 1
        mean_rating_num = total_count/len_genre
        all_rating.append([mean_rating_num,genre])
    all_rating_sorted = sorted(all_rating, reverse = True)
        
    for entry in all_rating_sorted:
        genre = entry[1]
        mean_rating = entry[0]
        print(f'For Genre {genre}, the mean rating number is {mean_rating:.2f}.')

In [42]:
mean_genre_rating(as_data_final,as_genre,11,5)

For Genre Navigation, the mean rating number is 86090.33.
For Genre Reference, the mean rating number is 79350.47.
For Genre Social Networking, the mean rating number is 71548.35.
For Genre Music, the mean rating number is 57326.53.
For Genre Weather, the mean rating number is 52279.89.
For Genre Book, the mean rating number is 46384.92.
For Genre Food & Drink, the mean rating number is 33333.92.
For Genre Finance, the mean rating number is 32367.03.
For Genre Photo & Video, the mean rating number is 28441.54.
For Genre Travel, the mean rating number is 28243.80.
For Genre Shopping, the mean rating number is 27230.73.
For Genre Health & Fitness, the mean rating number is 23298.02.
For Genre Sports, the mean rating number is 23008.90.
For Genre Games, the mean rating number is 22910.83.
For Genre News, the mean rating number is 21248.02.
For Genre Productivity, the mean rating number is 21028.41.
For Genre Utilities, the mean rating number is 19156.49.
For Genre Lifestyle, the mean rati

Based on the printed results above, the Top 3 most popular app genres on the App Store, measured by the mean number of ratings per app, are Navigation and Reference followed by Social Networking - when developing free, English apps on the App Store, these are the types of apps that will likely gain the most traction.

The next step is to examine app popularity trends on Google Play. By analysing the number of installs for apps on Google Play, we can develop a better understanding of genre popularity in this market. Note, however, that the install numbers are provided in ranges such as 0+, 10+, 100+ instead of precise values.
We will first pass the Installs column (Index 5) to the display_table() function to create a clearer picture of the value ranges.

In [43]:
display_table(gp_data_final, 5, descending = True)

1,000,000+:15.762098597919493
100,000+:11.544549977385799
10,000,000+:10.572139303482588
10,000+:10.199004975124378
1,000+:8.401175938489372
100+:6.919945725915875
5,000,000+:6.829488919041157
500,000+:5.563093622795115
50,000+:4.771596562641339
5,000+:4.488919041157847
10+:3.5278154681139755
500+:3.245137946630484
50,000,000+:2.2840343735866124
100,000,000+:2.1370420624151967
50+:1.9109000452284035
5+:0.7914970601537766
1+:0.508819538670285
500,000,000+:0.27137042062415195
1,000,000,000+:0.22614201718679333
0+:0.045228403437358664


The use of ranges in place of precise numbers introduces uncertainty into our analysis, e.g. apps with 500,000+ installs may actually have 500,001 installs or 999,999 installs. For our purpose of identifying the app genres that attract the most users, this level of uncertainty is tolerable. We will use the floor values of the intervals, i.e. 500,000+ will be transformed to 500,000.

We have already obtained a dictionary containing all the unique genres on the Google Play market through the *freq_table()* function previously.
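The conversion from a range label to its floor value amounts to stripping the comma separators and the trailing plus sign. A minimal sketch follows; `installs_floor` is a hypothetical helper name, and it returns an `int` here purely for readability.

```python
def installs_floor(label):
    """Convert an install range label such as '500,000+' to its lower bound."""
    return int(label.replace(',', '').replace('+', ''))


# Labels taken from the range of values shown in the frequency table.
for label in ['0+', '10+', '500,000+', '1,000,000,000+']:
    print(label, '->', installs_floor(label))
```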

In [44]:
gp_genre

{'ART_AND_DESIGN': 0.644504748982361,
 'FAMILY': 18.939393939393938,
 'AUTO_AND_VEHICLES': 0.9271822704658526,
 'BEAUTY': 0.5992763455450022,
 'BOOKS_AND_REFERENCE': 2.1370420624151967,
 'BUSINESS': 4.601990049751244,
 'COMICS': 0.6105834464043419,
 'COMMUNICATION': 3.233830845771144,
 'TOOLS': 8.45771144278607,
 'DATING': 1.8656716417910446,
 'EDUCATION': 1.1646313885119857,
 'ENTERTAINMENT': 0.9611035730438715,
 'EVENTS': 0.7123473541383989,
 'FINANCE': 3.70872908186341,
 'FOOD_AND_DRINK': 1.2437810945273633,
 'HEALTH_AND_FITNESS': 3.0868385345997282,
 'HOUSE_AND_HOME': 0.8028041610131162,
 'LIBRARIES_AND_DEMO': 0.9384893713251923,
 'LIFESTYLE': 3.8896426956128454,
 'GAME': 9.701492537313433,
 'VIDEO_PLAYERS': 1.797829036635007,
 'MEDICAL': 3.505201266395296,
 'SOCIAL': 2.668475802804161,
 'SHOPPING': 2.2501130710085935,
 'PHOTOGRAPHY': 2.951153324287653,
 'SPORTS': 3.3921302578018993,
 'TRAVEL_AND_LOCAL': 2.340569877883311,
 'PERSONALIZATION': 3.324287652645862,
 'PRODUCTIVITY': 3.9

In [45]:
all_installs = []
for genre in gp_genre:  
    install_total = 0
    len_genre = 0
    for app in gp_data_final:
        if app[1] == genre:
            app_install = float(app[5].replace('+','').replace(',',''))
            install_total += app_install
            len_genre += 1
    mean_installs = install_total / len_genre
    all_installs.append([mean_installs,genre])
    
all_installs_sorted = sorted(all_installs, reverse = True)
        
for entry in all_installs_sorted:
    genre = entry[1]
    mean_installs = entry[0]
    print(f'For Genre {genre}, the mean install number is {mean_installs:.2f}.')    

For Genre COMMUNICATION, the mean install number is 38590581.09.
For Genre VIDEO_PLAYERS, the mean install number is 24727872.45.
For Genre SOCIAL, the mean install number is 23253652.13.
For Genre PHOTOGRAPHY, the mean install number is 17840110.40.
For Genre PRODUCTIVITY, the mean install number is 16787331.34.
For Genre GAME, the mean install number is 15544014.51.
For Genre TRAVEL_AND_LOCAL, the mean install number is 13984077.71.
For Genre ENTERTAINMENT, the mean install number is 11640705.88.
For Genre TOOLS, the mean install number is 10830251.97.
For Genre NEWS_AND_MAGAZINES, the mean install number is 9549178.47.
For Genre BOOKS_AND_REFERENCE, the mean install number is 8814199.79.
For Genre SHOPPING, the mean install number is 7036877.31.
For Genre PERSONALIZATION, the mean install number is 5201482.61.
For Genre WEATHER, the mean install number is 5145550.29.
For Genre HEALTH_AND_FITNESS, the mean install number is 4188821.99.
For Genre MAPS_AND_NAVIGATION, the mean install 

The above result shows the Top 3 most popular app genres on Google Play are Communication, Video Players, followed by Social. Cross-referencing this finding with the app popularity results from the App Store indicates that social networking is the most promising genre for developing a user base and, in turn, becoming profitable. It is worth noting that the App Store genres do not have an individual entry for Communication; such apps have likely been included in the Social Networking genre.

Recall our proposed development cycle of establishing a base app on Google Play, refining the app if it becomes popular, and porting the app to the App Store if profitable. Although Social Networking is among the most popular genres, it may not be suitable for our approach because of the initial investment required for apps in this genre.

We can narrow down the development direction by establishing selection criteria: the app needs to belong to a popular genre (Top 10 by mean rating count on the App Store and by mean installs on Google Play), and the initial development complexity needs to be reasonably low. Based on these requirements, one potential approach may be developing a photo editing app that can crop photos and apply preset filters to the photos stored on the phone. The Photography genre ranked 9th on the App Store (as Photo & Video) and 4th on Google Play. Photography apps represent 2.95% and 5.00% of the total app populations on Google Play and the App Store respectively. To make the app stand out, special features such as daily photo filtering tips and the ability to apply custom overlays may also need to be added.
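The popularity criterion can be sketched as an intersection of the two Top 10 lists. The lists below are copied in rank order from the outputs printed earlier; the `as_to_gp` mapping between the two stores' genre labels is a hand-made, partial assumption for illustration only, since the datasets define no such correspondence.

```python
# Top 10 genres by mean rating count on the App Store and by mean installs
# on Google Play, copied in rank order from the outputs printed earlier.
as_top10 = ['Navigation', 'Reference', 'Social Networking', 'Music', 'Weather',
            'Book', 'Food & Drink', 'Finance', 'Photo & Video', 'Travel']
gp_top10 = ['COMMUNICATION', 'VIDEO_PLAYERS', 'SOCIAL', 'PHOTOGRAPHY',
            'PRODUCTIVITY', 'GAME', 'TRAVEL_AND_LOCAL', 'ENTERTAINMENT',
            'TOOLS', 'NEWS_AND_MAGAZINES']

# ASSUMPTION: hypothetical, partial mapping from App Store genres to
# Google Play categories. Genres without an entry are treated as unmatched.
as_to_gp = {
    'Navigation': 'MAPS_AND_NAVIGATION',
    'Reference': 'BOOKS_AND_REFERENCE',
    'Social Networking': 'SOCIAL',
    'Weather': 'WEATHER',
    'Book': 'BOOKS_AND_REFERENCE',
    'Food & Drink': 'FOOD_AND_DRINK',
    'Finance': 'FINANCE',
    'Photo & Video': 'PHOTOGRAPHY',
    'Travel': 'TRAVEL_AND_LOCAL',
}

# Collect genres that appear in both Top 10 lists, with their ranks.
shared = []
for as_rank, as_name in enumerate(as_top10, start=1):
    gp_name = as_to_gp.get(as_name)
    if gp_name in gp_top10:
        gp_rank = gp_top10.index(gp_name) + 1
        shared.append((as_name, as_rank, gp_name, gp_rank))

for entry in shared:
    print(entry)
```

The development-complexity criterion is a judgment call and is not captured by this sketch.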