# Profitable Mobile App Profile for Android and Apple Market

The main goal of this project is to identify key characteristics of profitable mobile applications on the Google Play Store (Android) and the App Store (Apple). I work as a data analyst at a company that builds Android and iOS apps. My job is to provide data analysis on the apps the company plans to build.

At the moment, the developer team is building a free app with in-app ads. This means our revenue depends on the number of users: the more users who use and engage with the app, the better. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

## Opening and Exploring the Data

By September 2018, there were around 2 million iOS apps on the App Store and 2.1 million Android apps on the Google Play Store. Collecting data for more than 4 million apps would be time-consuming and require a lot of resources, so we will analyze samples of the data instead. Fortunately, two data sets can help us:
- __[A data set](https://www.kaggle.com/lava18/google-play-store-apps/home)__ containing approx. ten thousand Android apps from Google Play
- __[A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)__ containing approx. seven thousand iOS apps from App Store

To make the data easier to explore, we will write a function named `explore_dataset()` that takes the data set, the start and end row indexes, and a Boolean parameter that, if true, prints the number of rows and columns.

In [33]:
def explore_dataset(data_set, start, end, show_rowcol = False):    
    data_slices = data_set[start:end]
    for row in data_slices:
        print(row)
        print('\n')
        
    if show_rowcol:
        print('Number of rows: ', len(data_set))
        print('Number of columns: ', len(data_set[0]))
        print('\n')

Now, we can open the data sets.

In [34]:
from csv import reader

## Open AppleStore.csv ##
# Forward slashes avoid invalid escape sequences such as '\A' or '\g'.
opened_file_apple = open('Dataset/AppleStore.csv', encoding='utf-8')
read_file_apple = reader(opened_file_apple)
apps_data_apple = list(read_file_apple)
opened_file_apple.close()
apple_header = apps_data_apple[0]  # first row holds the column names
apple = apps_data_apple[1:]        # remaining rows hold the app data

## Open googleplaystore.csv ##
opened_file_google = open('Dataset/googleplaystore.csv', encoding='utf-8')
read_file_google = reader(opened_file_google)
apps_data_google = list(read_file_google)
opened_file_google.close()
google_header = apps_data_google[0]
google = apps_data_google[1:]
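
The loading steps above repeat the same pattern for both files. A minimal sketch of how they could be wrapped in a helper is shown here. The names `parse_csv` and `load_csv` and the in-memory sample are illustrative assumptions, not part of the original notebook. Using a `with` block also closes the file once it has been read.

```python
import io
from csv import reader


def parse_csv(file_obj):
    """Return (header, rows) from an already opened CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]


def load_csv(path):
    """Open a CSV file, parse it, and close it again (hypothetical helper)."""
    # newline='' is what the csv module documentation recommends for files.
    with open(path, encoding='utf-8', newline='') as f:
        return parse_csv(f)


# Demo on a small in-memory sample instead of the real files.
sample = io.StringIO('App,Category,Rating\n"Notes, Lists & More",TOOLS,4.2\n')
sample_header, sample_rows = parse_csv(sample)
print(sample_header)
print(sample_rows)
```

With the real files, the helper could be called as `apple_header, apple = load_csv('Dataset/AppleStore.csv')`.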

Then, we use the `explore_dataset()` function to get a brief overview of these data sets.

In [35]:
print(apple_header,'\n')
explore_dataset(apple, 0, 3, True)

print(google_header,'\n')
explore_dataset(google, 0, 3, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows:  7197
Number of columns:  17


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '1

The App Store data set has 7197 iOS apps. Columns that may be useful for our analysis include 'price', 'user_rating', and 'prime_genre'. The Google Play data set has 10841 Android apps. Columns that we may be able to use for further analysis include 'Category', 'Rating', 'Genres', and 'Price'.

## Data Cleaning

After exploring the data sets and getting a general idea of what they look like and what information they contain, we can start cleaning them before moving on to the analysis stage. Data cleaning is **very important**: it makes sure that the data meets our criteria and that the conclusions drawn from it are correct.

Our company only builds free apps for an English-speaking audience, so we will only analyze apps that are free and aimed at English speakers, and delete the app data that doesn't meet these requirements. We will clean the data in several steps, as described below:

### Deleting Wrong Data

If we look at the discussion forum of the Google Play data set, we find that the data set contains a wrong row, as discussed __[here](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)__. First, we check the row by printing it. According to the discussion, the wrong row is at index 10472 (not counting the header row).

In [36]:
print(google_header,'\n')
print(google[10472],'\n')

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 



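The data in row 10472 is wrong, as we can see from the result above. The row has only 12 values while the header has 13: the 'Category' value is missing, so every later value is shifted one column to the left. As a result, the rating reads as **19**, while the maximum app rating on Google Play is 5. So, we need to delete this row.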

In [37]:
print(len(google))
del google[10472]
print(len(google))

10841
10840


We delete the row using the **del** statement and check that it has been deleted by comparing the number of rows in our Google Play data set before and after. As we can see above, the number of rows decreased by one, meaning that one row has been deleted. If we print the row at index 10472 again, it shows different data than before:

In [38]:
print(google[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
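
A row like the one deleted above can also be found without knowing its index in advance, because a row with a missing value has fewer fields than the header. The sketch below illustrates the idea on a made-up three-column sample. `find_malformed_rows` is a hypothetical helper name, not something from the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]


# Made-up sample: the second row is missing its 'Category' value.
demo_header = ['App', 'Category', 'Rating']
demo_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.8'],
]

bad_indexes = find_malformed_rows(demo_header, demo_rows)
print(bad_indexes)

# Delete from the end so that earlier indexes stay valid while deleting.
for i in reversed(bad_indexes):
    del demo_rows[i]
print(len(demo_rows))
```

On the notebook's own variables, the check would be `find_malformed_rows(google_header, google)`, run on the list as originally loaded.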


### Removing Duplicate Entries

If we go through the discussion forum of the Google Play data set ([here](https://www.kaggle.com/lava18/google-play-store-apps/discussion)), we'll find that the data set has some duplicate entries. So, first we need to check which applications have duplicate entries.

In [39]:
duplicate_apps = []
unique_apps = []

for app in google:
    if app[0] in unique_apps:
        duplicate_apps.append(app[0])
    else:
        unique_apps.append(app[0])
        
print('Number of unique apps : ', len(unique_apps))
print('Number of duplicate apps : ', len(duplicate_apps))

Number of unique apps :  9659
Number of duplicate apps :  1181


As we can see above, there are 1181 duplicate entries. Next, we need to check how the duplicate entries of an app differ from each other.

In [40]:
print(duplicate_apps[:15])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [41]:
for app in google:
    if app[0] == 'Google Ads':
        print(app)

['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29331', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']


From this example, we can conclude that the main difference between the duplicate entries is at index 3, which corresponds to the number of reviews. We can use this as the criterion for removing duplicates. We don't want to analyze the same app more than once, so we need to remove the duplicates, but not randomly.

In the next step, we remove duplicates by checking the number of reviews (index 3) and keeping, for each app, the entry with the highest number of reviews, which is presumably the most recent record. We will implement this with a dictionary, using the app name as the key and the number of reviews as the value.

In [42]:
app_dict = {}

for app in google:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in app_dict and app_dict[name] < n_reviews:
        app_dict[name] = n_reviews
    elif name not in app_dict:
        app_dict[name] = n_reviews
        
print(len(app_dict))

9659


In the cell above, we created a dictionary named **app_dict** that stores each unique app name with its number of reviews (keeping the highest number when there are duplicate entries). We then check the length of app_dict and expect it to equal the number of unique apps we counted before, **9659**, which it does. So we now have the list of unique app names along with the highest number of reviews for each. Next, we use this dictionary to remove the duplicate rows.

In [43]:
google_clean = []
already_added = []

for app in google:
    name = app[0]
    n_reviews = float(app[3])
    
    if (name not in already_added) and (app_dict[name] == n_reviews):
        google_clean.append(app)
        already_added.append(name)
        
explore_dataset(google_clean,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9659
Number of columns:  13




In the code above, we create 2 empty lists named `google_clean` and `already_added`. `google_clean` stores the app rows without duplicates, while `already_added` keeps track of the app names that have already been added. The name check is needed because some duplicate rows have exactly the same number of reviews, so the review count alone cannot tell them apart. The code iterates through every row and appends it only if its number of reviews matches the value in `app_dict` and its name is not yet in `already_added`. As a result, `google_clean` contains *9659* rows without duplicates (keeping, for each app, the row with the highest number of reviews).
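
The two passes over the data can also be combined into one. As a sketch of an alternative, not the approach used in this notebook, the dictionary can store the whole row with the highest review count seen so far. `dedupe_by_reviews` and the sample rows are assumptions made for illustration.

```python
def dedupe_by_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_idx]
        # Strictly greater, so the first row wins when counts are tied.
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    # Dictionaries preserve insertion order (Python 3.7+), so apps come
    # back in the order in which their names first appeared.
    return list(best.values())


# Made-up rows shaped like the Google Play data (reviews at index 3).
demo = [
    ['Google Ads', 'BUSINESS', '4.3', '29313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Google Ads', 'BUSINESS', '4.3', '29331'],
    ['Google Ads', 'BUSINESS', '4.3', '29313'],
]
deduped = dedupe_by_reviews(demo)
for row in deduped:
    print(row)
```

The rows kept are the same kind as in the two-pass version, but their order can differ: here apps are ordered by the first appearance of their name, not by the position of the row that was kept.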

### Removing Non-English Apps

Our team is developing an application in English, and we want it to be used by an English-speaking audience. So, in our analysis, we want to explore only the apps on the Google Play Store and App Store that are aimed at an English-speaking audience. We can separate English apps from non-English apps by looking at the app name.

English text is normally written with characters from the ASCII set: the letters a to z and A to Z, the digits, and common punctuation. All of these have code points in the range 0 to 127. So, we take each app name in our data set and check whether each character's code point is at most 127, using Python's built-in `ord` function, which returns the Unicode code point of a character (for ASCII characters, this equals the ASCII number).

Another problem is that some apps have English names that also contain a few characters with code points greater than 127 (such as emoji). To avoid losing such data, we categorize an app as an English app if its name contains at most 3 non-ASCII characters.

First, we write a function `check_alphabet` that returns *True* if the app counts as an English app by the criterion above and *False* otherwise.

In [44]:
def check_alphabet(text):
    non_alphabet = 0
    for char in text:
        if ord(char) > 127:
            non_alphabet += 1
    return non_alphabet <= 3
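
To see how the threshold behaves, the function can be tried on a few names. The sample names below are chosen for illustration and are not taken from the output above. The function is repeated so that the snippet runs on its own.

```python
def check_alphabet(text):
    """Return True if text has at most three characters outside ASCII."""
    non_ascii = 0
    for char in text:
        if ord(char) > 127:
            non_ascii += 1
    return non_ascii <= 3


samples = [
    'Instagram',                      # plain ASCII
    'Docs To Go™ Free Office Suite',  # one non-ASCII character (™)
    'Instachat 😜',                   # one non-ASCII character (emoji)
    '爱奇艺PPS -《欢乐颂2》电视剧热播',  # many non-ASCII characters
]
for name in samples:
    print(check_alphabet(name), name)
```

The filter is only a heuristic: a name written in another language that happens to use only ASCII letters still passes.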

Next, we iterate through each data set and remove the non-English apps.

In [45]:
google_final = []
apple_final = []

for app in google_clean:
    if check_alphabet(app[0]):
        google_final.append(app)
        
for app in apple:
    if check_alphabet(app[2]):
        apple_final.append(app)

After getting the new data sets without non-English apps, we can explore them to check sample rows and the number of rows using the `explore_dataset` function.

In [46]:
explore_dataset(google_final, 0, 3, True)
explore_dataset(apple_final, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of columns:  13


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '1

Based on the output of the `explore_dataset` function, we have *9614* Android apps from Google Play and *6183* iOS apps from the App Store.

### Removing Non-Free Apps

As stated at the beginning, our team is developing a free app. So, we are only interested in free apps and need to remove the non-free apps. We can do this by iterating through each data set, checking whether the price of each app is zero, and keeping only the free apps.

In [47]:
google_dataset = []
apple_dataset = []

for app in google_final:
    price = app[7]
    if price == '0':
        google_dataset.append(app)
        
for app in apple_final:
    price = float(app[5])
    if price == 0:
        apple_dataset.append(app)        
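
The two loops above compare prices differently: a string comparison for Google Play and a numeric one for the App Store. A hedged sketch of a single helper that handles both is shown below. It assumes that a paid price may carry a leading `$`, which should be checked against the actual data, and `is_free` is a hypothetical name.

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    # Assumption: a paid price may look like '$4.99' or '4.99'.
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0
    except ValueError:
        # Prices that cannot be parsed are treated as not free.
        return False


for price in ['0', '0.0', '3.99', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```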

Then, we can check our data sets after they have been through all the data cleaning steps. We are left with *8864* Android apps and *3222* iOS apps.

In [48]:
explore_dataset(google_dataset, 0, 3, True)
explore_dataset(apple_dataset, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  8864
Number of columns:  13


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online 

## Data Analysis

### Most Common Apps by Genre

As mentioned before, our team is developing free apps that need to attract as many users as possible, because our revenue is based on user engagement. To minimize risk, we will first build an MVP (Minimum Viable Product) of our application for Android, since it has the bigger user population, and add it to Google Play. If it is successful, we can develop it further, and after several months we can develop an iOS version. First of all, we need to find the characteristics of applications that are successful on both Google Play and the App Store. We can start by analyzing which genres or categories of application are favored by users on Google Play and the App Store.

In our data sets, the genre or category information is in the *prime_genre* column of the App Store data set and in the *Genres* and *Category* columns of the Google Play data set. To analyze these columns, we can write a function that builds a frequency table for each genre or category.

In [49]:
def freq_table(dataset, index):
    dict_freqtable = {}
    for app in dataset:
        if app[index] not in dict_freqtable:
            dict_freqtable[app[index]] = (1 / len(dataset)) * 100
        else:
            dict_freqtable[app[index]] += (1 / len(dataset)) * 100
    return dict_freqtable

In [50]:
def display_table(dataset, index):
    dict_freqtable = freq_table(dataset, index)
    display_table = []
    for key in dict_freqtable:
        tuple_value = (dict_freqtable[key], key)
        display_table.append(tuple_value)
    
    for data in sorted(display_table, reverse=True):
        print(data[1], ' : ', data[0])
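
As a cross-check, the same percentages can be computed with `collections.Counter` from the standard library. This is a sketch on made-up rows, not a replacement for the functions above, and `freq_table_counter` is a hypothetical name.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column of a list-of-lists dataset."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # Dividing once per value avoids summing many small floating-point steps.
    return {value: count / total * 100 for value, count in counts.items()}


demo = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
table = freq_table_counter(demo, 1)

# Sort by percentage, largest first, similar to display_table.
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', round(pct, 2))
```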

Let's continue our analysis by exploring the `prime_genre` column of the App Store data set.

In [51]:
display_table(apple_dataset, -5)

Games  :  58.1626319056464
Entertainment  :  7.883302296710134
Photo & Video  :  4.965859714463075
Education  :  3.6623215394165176
Social Networking  :  3.2898820608317867
Shopping  :  2.6070763500931133
Utilities  :  2.5139664804469306
Sports  :  2.1415270018621997
Music  :  2.048417132216017
Health & Fitness  :  2.0173805090006227
Productivity  :  1.7380509000620747
Lifestyle  :  1.5828677839851035
News  :  1.3345747982619496
Travel  :  1.2414649286157668
Finance  :  1.1173184357541899
Weather  :  0.8690254500310364
Food & Drink  :  0.8069522036002481
Reference  :  0.558659217877095
Business  :  0.5276225946617009
Book  :  0.4345127250155184
Navigation  :  0.186219739292365
Medical  :  0.186219739292365
Catalogs  :  0.12414649286157665


Based on the data above, free English apps on the iOS App Store are dominated by Games (58.16%), followed by Entertainment (7.88%), Photo & Video (4.97%), Education (3.66%), and so on. From this list, we can conclude that free English apps on the App Store are mostly for entertainment, while apps with a practical purpose are rarer. However, the fact that entertainment apps are more numerous than other genres doesn't mean that demand for them is also high. We will explore that question later.

But first, let's take a look at the genre distribution in the Google Play data set by looking at the `Category` and `Genres` columns.

In [52]:
display_table(google_dataset, 1) # Category column

FAMILY  :  18.907942238266926
GAME  :  9.724729241877363
TOOLS  :  8.46119133574016
BUSINESS  :  4.591606498194979
LIFESTYLE  :  3.90342960288811
PRODUCTIVITY  :  3.8921480144404565
FINANCE  :  3.7003610108303455
MEDICAL  :  3.5311371841155417
SPORTS  :  3.3957581227436986
PERSONALIZATION  :  3.3167870036101235
COMMUNICATION  :  3.2378158844765483
HEALTH_AND_FITNESS  :  3.079873646209398
PHOTOGRAPHY  :  2.944494584837555
NEWS_AND_MAGAZINES  :  2.7978339350180583
SOCIAL  :  2.6624548736462152
TRAVEL_AND_LOCAL  :  2.335288808664261
SHOPPING  :  2.2450361010830324
BOOKS_AND_REFERENCE  :  2.14350180505415
DATING  :  1.861462093862813
VIDEO_PLAYERS  :  1.7937725631768928
MAPS_AND_NAVIGATION  :  1.398916967509025
FOOD_AND_DRINK  :  1.2409747292418778
EDUCATION  :  1.1620036101083042
ENTERTAINMENT  :  0.9589350180505433
LIBRARIES_AND_DEMO  :  0.9363718411552363
AUTO_AND_VEHICLES  :  0.9250902527075828
HOUSE_AND_HOME  :  0.8235559566787015
WEATHER  :  0.8009927797833946
EVENTS  :  0.7107400722

The genre distribution is quite different in the Google Play data set (for free English apps). It is mostly dominated by practical-purpose apps (Family: 18.91%, Tools: 8.46%, Business: 4.59%, etc.). However, if we explore the data further, we'll see that the Family category is dominated by applications for kids. A similar distribution also shows up in the `Genres` column.

In [53]:
display_table(google_dataset, -4) # Genres column

Tools  :  8.449909747292507
Entertainment  :  6.069494584837599
Education  :  5.34747292418777
Business  :  4.591606498194979
Productivity  :  3.8921480144404565
Lifestyle  :  3.8921480144404565
Finance  :  3.7003610108303455
Medical  :  3.5311371841155417
Sports  :  3.46344765342962
Personalization  :  3.3167870036101235
Communication  :  3.2378158844765483
Action  :  3.1024368231047053
Health & Fitness  :  3.079873646209398
Photography  :  2.944494584837555
News & Magazines  :  2.7978339350180583
Social  :  2.6624548736462152
Travel & Local  :  2.3240072202166075
Shopping  :  2.2450361010830324
Books & Reference  :  2.14350180505415
Simulation  :  2.041967509025268
Dating  :  1.861462093862813
Arcade  :  1.8501805054151597
Video Players & Editors  :  1.771209386281586
Casual  :  1.7599277978339327
Maps & Navigation  :  1.398916967509025
Food & Drink  :  1.2409747292418778
Puzzle  :  1.1281588447653441
Racing  :  0.9927797833935037
Role Playing  :  0.9363718411552363
Libraries & Demo 

Based on the frequency table for the `Genres` column, we can see that free English apps on Google Play are mostly practical-purpose apps rather than entertainment apps. While the difference between the `Genres` and `Category` columns of the Google Play data set is not entirely clear, `Genres` appears to be the more granular categorization. Since we want the bigger picture of the most common genres on the Google Play Store, and since both columns show a similar landscape, we'll stick to the `Category` column.

So far, we can conclude that, among free English apps, the iOS App Store is dominated by entertainment apps, while Google Play is dominated by practical-purpose apps. Next, we want to explore which kinds of apps have more users, to see which genres are in high demand, as mentioned before.

### Most Popular Apps by Genre

One way to find out which apps have more users is to check the average number of installs for each genre. For the Google Play data set, we can find this in the `Installs` column. The App Store data set has no install counts, so we use `rating_count_tot` instead, which shows the total number of user ratings across all versions.

Below, we create the function `display_common_genre` to display the most popular genres based on the average number of user ratings (for the App Store) or the average number of installs (for Google Play). The function takes 4 inputs: the data set, the column index of the genre (for Google Play, we use the `Category` column, as stated above), the column index of the install or rating count, and a Boolean parameter that indicates whether the data is from Google Play. This flag is needed because the `Installs` values in the Google Play data are not precise numbers. Instead, they are open-ended thresholds such as '5,000+' and '10,000+'. For the Google Play data set, we therefore need to remove the ',' and '+' characters and convert the value to a float before we can analyze it.

In [54]:
def display_common_genre(dataset, genre_idx, cont_user_idx, google_dataset=False):
    # The keys of the frequency table are the distinct genres.
    genres = freq_table(dataset, genre_idx)
    common_genre_list = []
    for app_genre in genres:
        total = 0
        count = 0

        for app in dataset:
            if app[genre_idx] == app_genre:
                count += 1
                if google_dataset:
                    # Installs look like '10,000+': strip ',' and '+' first.
                    # Note: this overwrites the value stored in the data set.
                    app[cont_user_idx] = app[cont_user_idx].replace(',', '').replace('+', '')
                total += float(app[cont_user_idx])

        # Average number of ratings (or installs) per app in this genre.
        common_genre_list.append((total/count, app_genre))

    # Print the genres from the highest average to the lowest.
    for app_genre in sorted(common_genre_list, reverse=True):
        print(app_genre[1], ":", app_genre[0])
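
The function above scans the whole data set once per genre. A single pass is enough if the totals and counts are collected in dictionaries. The sketch below shows that variant on made-up rows. `parse_installs` and `average_by_genre` are hypothetical names, and in this variant the rows are not modified.

```python
from collections import defaultdict


def parse_installs(text):
    """Turn a threshold string such as '10,000+' into the float 10000.0."""
    return float(text.replace(',', '').replace('+', ''))


def average_by_genre(dataset, genre_idx, count_idx, parse=float):
    """Return {genre: average of the count column}, computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        genre = row[genre_idx]
        totals[genre] += parse(row[count_idx])
        counts[genre] += 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


# Made-up rows: name, category, installs.
demo = [
    ['App A', 'TOOLS', '10,000+'],
    ['App B', 'TOOLS', '5,000+'],
    ['App C', 'GAME', '1,000,000+'],
]
averages = average_by_genre(demo, 1, 2, parse=parse_installs)
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```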

First, we'll analyze the iOS App Store data set. The genres, sorted by average number of user ratings per app in descending order, are displayed below:

In [55]:
display_common_genre(apple_dataset, -5, 6)

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


As we can see from the result above, the most popular genre among free English apps on the App Store (based on the average number of user ratings) is Navigation, followed by Reference, Social Networking, Music, and Weather. These top 5 genres look like promising choices for our application. Now, let's take a deeper look to decide which genre we should choose. To make this easier, we will create a function that displays a sorted list of app names with their number of user ratings (App Store data set) or number of installs (Google Play data set). The function is named `display_app_usrcount` and is implemented below:

In [56]:
def display_app_usrcount(dataset, genre, genre_idx, app_name_idx, user_count_idx, googledataset=False):
    app_list = []
    for app in dataset:
        if app[genre_idx] == genre:
            if googledataset:
                # Installs look like '10,000+': strip ',' and '+' first.
                # Note: this overwrites the value stored in the data set.
                app[user_count_idx] = app[user_count_idx].replace(',', '').replace('+', '')
            # Store (count, name) so that sorting orders by count.
            app_list.append((int(app[user_count_idx]), app[app_name_idx]))

    # Print the apps from the highest count to the lowest.
    for app in sorted(app_list, reverse=True):
        print(app[1], ':', app[0])

First, we'll explore the Navigation category on the App Store, since it has the highest average number of user ratings among free English apps.

In [57]:
display_app_usrcount(apple_dataset, 'Navigation', -5, 2, 6)

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


From this data, we can see that the Navigation figure is heavily influenced by Waze and Google Maps, which together account for 96.79% of the user ratings in this category (close to half a million). We can find the same pattern in the Social Networking and Music categories.
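
The share of Waze and Google Maps follows directly from the counts in the Navigation list:

```python
# Rating counts for the Navigation genre, copied from the output above.
navigation = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 345046,
    'Google Maps - Navigation & Transit': 154911,
    'Geocaching®': 12811,
    'CoPilot GPS – Car Navigation & Offline Maps': 3582,
    'ImmobilienScout24: Real Estate Search in Germany': 187,
    'Railway Route Search': 5,
}

total = sum(navigation.values())
top_two = sum(sorted(navigation.values(), reverse=True)[:2])
share = top_two / total * 100

print(total)            # 516542
print(top_two)          # 499957
print(round(share, 2))  # 96.79
```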

In [58]:
display_app_usrcount(apple_dataset, 'Social Networking', -5, 2, 6)

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

In [59]:
display_app_usrcount(apple_dataset, 'Music', -5, 2, 6)

Pandora - Music & Radio : 1126879
Spotify Music : 878563
Shazam - Discover music, artists, videos & lyrics : 402925
iHeartRadio – Free Music & Radio Stations : 293228
SoundCloud - Music & Audio : 135744
Magic Piano by Smule : 131695
Smule Sing! : 119316
TuneIn Radio - MLB NBA Audiobooks Podcasts Music : 110420
Amazon Music : 106235
SoundHound Song Search & Music Player : 82602
Sonos Controller : 48905
Bandsintown Concerts : 30845
Karaoke - Sing Karaoke, Unlimited Songs! : 28606
My Mixtapez Music : 26286
Sing Karaoke Songs Unlimited with StarMaker : 26227
Ringtones for iPhone & Ringtone Maker : 25403
Musi - Unlimited Music For YouTube : 25193
AutoRap by Smule : 18202
Spinrilla - Mixtapes For Free : 15053
Napster - Top Music & Radio : 14268
edjing Mix:DJ turntable to remix and scratch music : 13580
Free Music - MP3 Streamer & Playlist Manager Pro : 13443
Free Piano app by Yokee : 13016
Google Play Music : 10118
Certified Mixtapes - Hip Hop Albums & Mixtapes : 9975
TIDAL : 7398
YouTube Mu

From these lists, we can see that these 3 genres on the App Store are heavily influenced by big players such as Waze, Google Maps, Facebook, Pinterest, Skype, Pandora, Spotify, and Shazam. The average number of ratings for these categories is therefore probably skewed by a few well-known and widely used apps. This pattern also hints that these three categories may not be as popular as they seem, because most of the user ratings come from those big players. Therefore, we remove these three categories from our list of candidates. Besides, if we built an app in one of those categories, we might struggle to compete with those well-known apps.

Now, let's take a look at the other categories: Reference and Weather.
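
One way to make this skew visible, offered here as a sketch and not as part of the original analysis, is to compare the mean with the median, which is far less sensitive to a few very large values. The counts are the Navigation figures from the output earlier in this section.

```python
from statistics import mean, median

# Navigation rating counts, copied from the earlier output.
navigation_counts = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = mean(navigation_counts)
nav_median = median(navigation_counts)

print(round(nav_mean, 2))  # 86090.33
print(nav_median)          # 8196.5
```

The mean is roughly ten times the median, which is what we would expect when two apps hold almost all of the ratings.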

In [60]:
display_app_usrcount(apple_dataset, 'Reference', -5, 2, 6)

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


In [61]:
display_app_usrcount(apple_dataset, 'Weather', -5, 2, 6)

The Weather Channel: Forecast, Radar & Alerts : 495626
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking : 208648
WeatherBug - Local Weather, Radar, Maps, Alerts : 188583
MyRadar NOAA Weather Radar Forecast : 150158
AccuWeather - Weather for Life : 144214
Yahoo Weather : 112603
Weather Underground: Custom Forecast & Local Radar : 49192
NOAA Weather Radar - Weather Forecast & HD Radar : 45696
Weather Live Free - Weather Forecast & Alerts : 35702
Storm Radar : 22792
QuakeFeed Earthquake Map, Alerts, and News : 6081
Moji Weather - Free Weather Forecast : 2333
Hurricane by American Red Cross : 1158
Forecast Bar : 375
Hurricane Tracker WESH 2 Orlando, Central Florida : 203
FEMA : 128
iWeather - World weather forecast : 80
Weather - Radar - Storm with Morecast App : 78
Yurekuru Call : 53
Weather & Radar : 37
WRAL Weather Alert : 25
Météo-France : 24
JaxReady : 22
Freddy the Frogcaster's Weather Station : 14
Almanac Long-Range Weather Forecast : 12
wetter.c

For the Reference genre, we can see that Bible and Dictionary.com receive far more user ratings than the other apps, so these two apps strongly influence the genre's average. Aside from that, the genre still looks promising, since those two apps appear to be must-have apps for users, which suggests real demand. We could build an app that stores several books and adds other features, such as quotes, a dictionary, and quizzes. These features would make our app more interesting and encourage users to spend more time in it, which matters because our revenue depends on user engagement. This idea also fits the fact that the free English apps on the App Store are dominated by for-fun apps, which may indicate high demand for that kind of app.
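To gauge how strongly the two leading apps drive this genre, a rough check can be run on the rating counts printed above. This is only a sketch: the list `reference_ratings` is copied by hand from that output rather than read from `apple_dataset`, and the helper name `top_share` is made up for illustration and is not part of the notebook.

```python
# Rating counts copied by hand from the Reference output above
# (an assumption: the printed list is taken to be complete).
reference_ratings = [
    985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122, 8535,
    4693, 1497, 826, 762, 718, 14, 8, 0, 0,
]


def top_share(counts, n_top):
    """Return the fraction of all ratings held by the n_top largest counts."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    return sum(ordered[:n_top]) / total if total else 0.0


# Share of ratings held by the two largest apps (Bible and Dictionary.com).
share = top_share(reference_ratings, 2)

# Average rating count with and without those two apps.
mean_all = sum(reference_ratings) / len(reference_ratings)
rest = sorted(reference_ratings, reverse=True)[2:]
mean_rest = sum(rest) / len(rest)

print('Share of ratings held by the top two apps: {:.1%}'.format(share))
print('Average ratings, all apps: {:.0f}'.format(mean_all))
print('Average ratings, without the top two: {:.0f}'.format(mean_rest))
```

If the average drops sharply once the top two apps are left out, the genre average mostly reflects those two apps rather than the typical Reference app.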

The Weather genre also seems promising. However, users typically spend very little time in this kind of app, so the time spent in our app, and therefore our ad revenue, would be low. So far, based on the free English apps on the App Store, the Reference genre looks the most promising, and we could create an e-book reader app with extra features such as a dictionary, quizzes, and quotations.

Now, let's take a look at the Google Play Store dataset.

## Most Popular Apps by Genre on Google Play

In [62]:
display_common_genre(google_dataset, 1, 5, True)

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

For the Play Store dataset, we can see that Communication has the highest average number of installs, with Video Players second, Social third, and so on. Books and Reference is also fairly popular, in 11th place. Now, let's take a deeper look at several categories.

In [63]:
display_app_usrcount(google_dataset, 'COMMUNICATION', 1, 0, 5)

WhatsApp Messenger : 1000000000
Skype - free IM & video calls : 1000000000
Messenger – Text and Video Chat for Free : 1000000000
Hangouts : 1000000000
Google Chrome: Fast & Secure : 1000000000
Gmail : 1000000000
imo free video calls and chat : 500000000
Viber Messenger : 500000000
UC Browser - Fast Download Private & Secure : 500000000
LINE: Free Calls & Messages : 500000000
Google Duo - High Quality Video Calls : 500000000
imo beta free calls and text : 100000000
Yahoo Mail – Stay Organized : 100000000
Who : 100000000
WeChat : 100000000
UC Browser Mini -Tiny Fast Private & Secure : 100000000
Truecaller: Caller ID, SMS spam blocking & Dialer : 100000000
Telegram : 100000000
Opera Mini - fast web browser : 100000000
Opera Browser: Fast and Secure : 100000000
Messenger Lite: Free Calls & Messages : 100000000
Kik : 100000000
KakaoTalk: Free Calls & Text : 100000000
GO SMS Pro - Messenger, Free Themes, Emoji : 100000000
Firefox Browser fast & private : 100000000
BBM - Free Calls & Messages

In the Communication category, we can see the same pattern as in the App Store dataset. This category is dominated by a few giant apps, such as WhatsApp Messenger, Skype, and Hangouts. If we explore the other categories further, we see the same pattern: the average number of installs in each category is skewed by a handful of apps.
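One way to see how much the giants pull the average up is to recompute it while ignoring apps above some install threshold. The sketch below is an illustration under several assumptions: the helper `average_installs` is not part of the notebook, the rows in `toy_dataset` are invented, the column positions (category at index 1, installs at index 5) simply mirror the arguments passed to `display_app_usrcount`, the installs column is assumed to be numeric already, and the cap of 100 million is an arbitrary choice.

```python
def average_installs(dataset, category, category_idx, installs_idx, cap=None):
    """Average installs for one category, optionally ignoring apps at or above `cap`."""
    values = []
    for row in dataset:
        if row[category_idx] != category:
            continue
        installs = float(row[installs_idx])
        # Keep the app if no cap is set, or if it is below the cap.
        if cap is None or installs < cap:
            values.append(installs)
    return sum(values) / len(values) if values else 0.0


# Toy rows with hypothetical values: name at index 0, category at 1, installs at 5.
toy_dataset = [
    ['Giant Chat', 'COMMUNICATION', '', '', '', 1000000000],
    ['Mid Chat', 'COMMUNICATION', '', '', '', 10000000],
    ['Small Chat', 'COMMUNICATION', '', '', '', 100000],
    ['Some Reader', 'BOOKS_AND_REFERENCE', '', '', '', 5000000],
]

with_giants = average_installs(toy_dataset, 'COMMUNICATION', 1, 5)
without_giants = average_installs(toy_dataset, 'COMMUNICATION', 1, 5, cap=100000000)

print('Average with giants:', with_giants)
print('Average without giants:', without_giants)
```

Calling the same helper on `google_dataset` with and without a cap would show how far each category's average depends on its largest apps, provided the column layout matches these assumptions.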

In [65]:
display_app_usrcount(google_dataset, 'VIDEO_PLAYERS', 1, 0, 5)

YouTube : 1000000000
Google Play Movies & TV : 1000000000
MX Player : 500000000
VivaVideo - Video Editor & Photo Movie : 100000000
VideoShow-Video Editor, Video Maker, Beauty Camera : 100000000
VLC for Android : 100000000
Motorola Gallery : 100000000
Motorola FM Radio : 100000000
Dubsmash : 100000000
Vote for : 50000000
Vigo Video : 50000000
VMate : 50000000
Samsung Video Library : 50000000
Ringdroid : 50000000
MiniMovie - Free Video and Slideshow Editor : 50000000
LIKE – Magic Video Maker & Community : 50000000
KineMaster – Pro Video Editor : 50000000
HD Video Downloader : 2018 Best video mate : 50000000
DU Recorder – Screen Recorder, Video Editor, Live : 50000000
video player for android : 10000000
iMediaShare – Photos & Music : 10000000
YouTube Studio : 10000000
Video Player All Format : 10000000
Video Downloader - for Instagram Repost App : 10000000
Video Downloader : 10000000
Ustream : 10000000
Quik – Free Video Editor for photos, clips, music : 10000000
PowerDirector Video Editor

In [67]:
display_app_usrcount(google_dataset, 'SOCIAL', 1, 0, 5)

Instagram : 1000000000
Google+ : 1000000000
Facebook : 1000000000
Snapchat : 500000000
Facebook Lite : 500000000
VK : 100000000
Tumblr : 100000000
Tik Tok - including musical.ly : 100000000
Tango - Live Video Broadcast : 100000000
Pinterest : 100000000
LinkedIn : 100000000
Badoo - Free Chat & Dating App : 100000000
BIGO LIVE - Live Stream : 100000000
ooVoo Video Calls, Messaging & Stories : 50000000
Zello PTT Walkie Talkie : 50000000
SKOUT - Meet, Chat, Go Live : 50000000
POF Free Dating App : 50000000
MeetMe: Chat & Meet New People : 50000000
textPlus: Free Text & Calls : 10000000
magicApp Calling & Messaging : 10000000
YouNow: Live Stream Video Chat : 10000000
We Heart It : 10000000
Waplog - Free Chat, Dating App, Meet Singles : 10000000
TextNow - free text + calls : 10000000
Text free - Free Text + Call : 10000000
Text Me: Text Free, Call Free, Second Phone Number : 10000000
Tapatalk - 100,000+ Forums : 10000000
Tagged - Meet, Chat & Dating : 10000000
SayHi Chat, Meet New People : 1

In [69]:
display_app_usrcount(google_dataset, 'PHOTOGRAPHY', 1, 0, 5)

Google Photos : 1000000000
Z Camera - Photo Editor, Beauty Selfie, Collage : 100000000
YouCam Perfect - Selfie Photo Editor : 100000000
YouCam Makeup - Magic Selfie Makeovers : 100000000
Sweet Selfie - selfie camera, beauty cam, photo edit : 100000000
S Photo Editor - Collage Maker , Photo Collage : 100000000
Retrica : 100000000
PicsArt Photo Studio: Collage Maker & Pic Editor : 100000000
PhotoGrid: Video & Pic Collage Maker, Photo Editor : 100000000
Photo Editor Pro : 100000000
Photo Editor Collage Maker Pro : 100000000
Photo Collage Editor : 100000000
LINE Camera - Photo editor : 100000000
Cymera Camera- Photo Editor, Filter,Collage,Layout : 100000000
Candy Camera - selfie, beauty camera, photo editor : 100000000
Camera360: Selfie Photo Editor with Funny Sticker : 100000000
BeautyPlus - Easy Photo Editor & Selfie Camera : 100000000
B612 - Beauty & Filter Camera : 100000000
AR effect : 100000000
Video Editor Music,Cut,No Crop : 50000000
VSCO : 50000000
Square InPic - Photo Editor & Co

In [70]:
display_app_usrcount(google_dataset, 'BOOKS_AND_REFERENCE', 1, 0, 5)

Google Play Books : 1000000000
Wattpad 📖 Free Books : 100000000
Bible : 100000000
Audiobooks from Audible : 100000000
Amazon Kindle : 100000000
Wikipedia : 10000000
Spanish English Translator : 10000000
Quran for Android : 10000000
Oxford Dictionary of English : Free : 10000000
NOOK: Read eBooks & Magazines : 10000000
Moon+ Reader : 10000000
JW Library : 10000000
HTC Help : 10000000
FBReader: Favorite Book Reader : 10000000
English Hindi Dictionary : 10000000
English Dictionary - Offline : 10000000
Dictionary.com: Find Definitions for English Words : 10000000
Dictionary - Merriam-Webster : 10000000
Dictionary : 10000000
Cool Reader : 10000000
Aldiko Book Reader : 10000000
Al-Quran (Free) : 10000000
Al'Quran Bahasa Indonesia : 10000000
Al Quran Indonesia : 10000000
Read books online : 5000000
English to Hindi Dictionary : 5000000
Ebook Reader : 5000000
Dictionary - WordWeb : 5000000
Bible KJV : 5000000
Ancestry : 5000000
AlReader -any text book reader : 5000000
Al Quran : EAlim - Transl

Based on these data, we can consider Books and Reference as the genre for our app, since it is fairly popular among users and has few big players. Looking at the apps in this category, most of them appear to be e-book readers, library collections, and dictionaries. We will avoid building yet another app of this kind, since we would likely face significant competitors.

The list also shows several apps built around a single popular book, such as the Bible or the Al-Qur'an. So, we can take several popular books and turn each of them into an app. However, since there are already many library-like apps, we also need to offer extra features, such as discussion forums, quizzes, and daily quotes.
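The impression that readers, dictionaries, and single-book apps make up much of this category can be checked roughly by counting keywords in the app names. The sketch below is only an illustration: the names in `book_apps` are copied by hand from the Books and Reference output above (the last, truncated entry is left out), and the keyword list and the helper `count_keywords` are assumptions rather than part of the notebook.

```python
# App names copied by hand from the Books and Reference output above.
book_apps = [
    'Google Play Books', 'Wattpad 📖 Free Books', 'Bible',
    'Audiobooks from Audible', 'Amazon Kindle', 'Wikipedia',
    'Spanish English Translator', 'Quran for Android',
    'Oxford Dictionary of English : Free', 'NOOK: Read eBooks & Magazines',
    'Moon+ Reader', 'JW Library', 'HTC Help',
    'FBReader: Favorite Book Reader', 'English Hindi Dictionary',
    'English Dictionary - Offline',
    'Dictionary.com: Find Definitions for English Words',
    'Dictionary - Merriam-Webster', 'Dictionary', 'Cool Reader',
    'Aldiko Book Reader', 'Al-Quran (Free)', "Al'Quran Bahasa Indonesia",
    'Al Quran Indonesia', 'Read books online', 'English to Hindi Dictionary',
    'Ebook Reader', 'Dictionary - WordWeb', 'Bible KJV', 'Ancestry',
    'AlReader -any text book reader',
]


def count_keywords(names, keywords):
    """Count how many names contain each keyword (case-insensitive substring match)."""
    counts = {}
    for keyword in keywords:
        counts[keyword] = sum(1 for name in names if keyword in name.lower())
    return counts


# Hypothetical keyword choice; a name can match more than one keyword.
keyword_counts = count_keywords(book_apps, ['dictionary', 'reader', 'quran', 'bible'])

for keyword, count in keyword_counts.items():
    print(keyword, ':', count, 'of', len(book_apps))
```

Substring matching is crude, so the counts are only a rough guide to how crowded each kind of app is.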

## Conclusion

Our main goal for this project was to recommend an app genre that is likely to be profitable on both the Android and Apple markets. The app will be free, will target English-speaking users, and will rely on user engagement as its main source of revenue by displaying in-app advertisements. For our analysis, we used Google Play Store and Apple App Store datasets containing records of apps available in both markets.

Based on the analysis, we conclude that the Books and Reference category looks promising for our app. Since both markets are already full of library apps, we will take several popular books and add features such as daily quotes, quizzes, and a discussion forum. These features can increase the time users spend in our app, which in turn can increase our revenue.