# Profitable App Profiles for App Store and Google Play Markets
**L & L Co.** builds Android and iOS mobile apps. These apps are available on Google Play and the App Store and are free to download and install. The main revenue stream for **L & L Co.** is in-app ads.

The aim of this project is therefore to find out which kinds of apps attract more users, since more users means more engagement with the ads. The developers can then be advised on which apps to build in order to maximize revenue.

## Opening and Exploring the Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

To save on time and resources, a sample of this data is obtained from Kaggle for analysis:
- A [data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately ten thousand Android apps from Google Play
- A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately seven thousand iOS apps from the App Store

**Opening the two data sets:**

In [2]:
from csv import reader

###Google Play data set:###
open_file = open('C:/Users/Luci/Desktop/Data Science/DataQuest/my_datasets/googleplaystore.csv', encoding = 'utf8')
read_file = reader(open_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

###App Store data set###
open_file = open('C:/Users/Luci/Desktop/Data Science/DataQuest/my_datasets/AppleStore.csv', encoding = 'utf8')
read_file = reader(open_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
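
The two loading steps above repeat the same pattern, and neither file handle is closed. A minimal sketch of a reusable loader follows; the helper name `load_csv` and the tiny temporary file are illustrative assumptions, not part of the original notebook.

```python
import csv
import os
import tempfile

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file; the file is closed on exit."""
    with open(path, encoding=encoding, newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Demonstration on a tiny temporary file standing in for the real CSVs.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write('App,Category\nInstagram,SOCIAL\n')
    demo_path = tmp.name

demo_header, demo_rows = load_csv(demo_path)
os.remove(demo_path)  # clean up the temporary file
print(demo_header)
print(demo_rows)
```

With the real files, the call would look like `android_header, android = load_csv('googleplaystore.csv')`, assuming the file sits in the working directory.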


For simplicity in exploring the data, a function `explore_data()` is created. When called, it prints the selected rows in a more readable form and, optionally, the number of rows and columns in the given data set.

In [3]:
def explore_data(data_set, start, end, rows_and_columns = False):
    data_slice = data_set[start:end]
    for each_row in data_slice:
        print(each_row)
        print('\n')
    if rows_and_columns:
        print('Number of columns:', len(data_set[0]))
        print('Number of rows:', len(data_set))
        print('\n')
        
print(android_header)
explore_data(android, 0, 1, True)
print(ios_header)
explore_data(ios, 2, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of columns: 13
Number of rows: 10841


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of columns: 17
Number of rows: 7197




There are 10841 Android apps and 7197 iOS apps.
The columns of interest are:
- Android apps: `Category`, `Rating`, `Price`, `Genres`
- iOS apps: `track_name`, `currency`, `price`, `rating_count_tot`, `user_rating`

# 1. Cleaning the Data
## Deleting Wrong Data
The Google Play data set has a [dedicated discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, and [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) outlines an error for row 10472. 

Printing this row and comparing it against the header and another row that is correct:

In [4]:
print(android_header) #header
print('\n')
print(android[10472]) #incorrect row
print('\n')
print(android[0]) #correct row

print('\n Incorrect row length:', len(android[10472]))
print('\n Correct row length:', len(android[0]))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

 Incorrect row length: 12

 Correct row length: 13


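Row 10472 corresponds to the app `Life Made WI-Fi Touchscreen Photo Frame`. The row has only 12 fields instead of 13: the `Category` value is missing, so every later value is shifted one column to the left. As a result the `Rating` column reads 19, which is clearly off because the maximum rating for a Google Play app is 5.
Therefore this row is deleted so as not to cause inconsistencies in the analysis: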

In [5]:
print('Original length of Android data set:', len(android))
del android[10472] # run only once; re-running would delete a valid row
print('Length after deleting wrong data:', len(android))

Original length of Android data set: 10841
Length after deleting wrong data: 10840
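
The malformed row above was found through the Kaggle discussion, but the same kind of error can be searched for directly by comparing each row's length with the header's. The sketch below is one possible check; `find_malformed_rows` and the sample rows are hypothetical and only illustrate the idea.

```python
def find_malformed_rows(header, rows):
    """Return (index, length) for every row whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]

# Made-up sample data: the second row is missing one field.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5'],
    ['Broken App', '1.9'],
    ['Facebook', 'SOCIAL', '4.1'],
]

bad = find_malformed_rows(sample_header, sample_rows)
print(bad)
```

Calling `find_malformed_rows(android_header, android)` before the deletion would be expected to report index 10472 with length 12, in line with the cell above.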


## Removing Duplicate Entries
Exploring the Google Play data set shows that some apps have duplicate entries. For instance, Instagram has four entries:

In [6]:
count = 0
for each_row in android:
    app_name = each_row[0]
    if app_name == 'Instagram':
        print(each_row)
        print('\n')
        count += 1
print('Number of repetitions:', count)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Number of repetitions: 4


On further inspection, there are 1181 cases where an app occurs more than once:

In [7]:
unique_apps = []
duplicate_apps = []

for each_row in android:
    app_name = each_row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print('Number of unique apps:', len(unique_apps))
print('Number of duplicate apps:', len(duplicate_apps))

Number of unique apps: 9659
Number of duplicate apps: 1181
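
Checking membership in a list rescans the list for every row, which gets slow on larger data sets. As an alternative sketch (a swapped-in technique, not the notebook's method), `collections.Counter` produces the same two numbers in one pass; the sample names are made up for illustration.

```python
from collections import Counter

# Made-up sample of app names with repeats.
sample_names = ['Instagram', 'Box', 'Instagram', 'Slack', 'Box', 'Instagram']

name_counts = Counter(sample_names)

# Unique apps: one per distinct name.
n_unique = len(name_counts)
# Duplicates: every occurrence after the first one, as in the cell above.
n_duplicates = sum(count - 1 for count in name_counts.values())

print('Number of unique apps:', n_unique)
print('Number of duplicate apps:', n_duplicates)
```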


We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app.

Examining the rows printed two cells above for the Instagram app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. This gives a criterion for choosing which row to keep: the row with the highest number of reviews, because the more reviews there are, the more reliable the rating.

To do that:

1. A dictionary is created where each key is a unique app name, and the value is the highest number of reviews of that app
2. The dictionary is used to create a new data set, which will have only one entry per app (and only the apps with the highest number of reviews will be selected)

In [8]:
uapps_max_reviews = {}

for each_row in android:
    app_name = each_row[0]
    n_reviews = each_row[3]
    
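    # compare as numbers: the review counts are strings, and '9' < '10' is False for strings
    if app_name in uapps_max_reviews and float(uapps_max_reviews[app_name]) < float(n_reviews):
        uapps_max_reviews[app_name] = n_reviews #record replaced with the highest review record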
    elif app_name not in uapps_max_reviews:
        uapps_max_reviews[app_name] = n_reviews
        
print('Length of unique apps with maximum reviews:', len(uapps_max_reviews))


Length of unique apps with maximum reviews: 9659


Now, let's use the `uapps_max_reviews` dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. 

Steps:
1. Two empty lists are initialized: `android_clean` and `already_added`.
2. The `android` data set is looped over, and for every iteration:
    - The name of the app and the number of reviews are isolated.
    - The current row (`each_app`) is added to the `android_clean` list, and the app name (`app_name`) to the `already_added` list if:
        - the number of reviews of the current app matches the number of reviews of that app as described in the `uapps_max_reviews` dictionary, and;
        - the name of the app is not already in the `already_added` list. This supplementary condition accounts for those cases where the highest number of reviews of a duplicate app is the same for more than one entry. For example, the Box app has three entries, and the number of reviews is the same. If we just check for `uapps_max_reviews[app_name] == num_reviews`, we'll still end up with duplicate entries for some apps.

In [9]:
android_clean = []
already_added = []

for each_app in android:
    app_name = each_app[0]
    num_reviews = each_app[3]
    
    if uapps_max_reviews[app_name] == num_reviews and app_name not in already_added:
        android_clean.append(each_app)
        already_added.append(app_name)
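
The two-step procedure (build a dictionary, then filter) can also be folded into a single pass that stores the best row seen so far for each app. This is only a sketch under stated assumptions: `dedupe_by_max_reviews` is a hypothetical helper, the sample rows are shortened to four columns, and the Box review count is made up.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

deduped = dedupe_by_max_reviews(sample_rows)
for row in deduped:
    print(row)
```

One design difference: the result is ordered by the first appearance of each app name, which may differ from the row order that the two-step version produces.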
    

Exploring the data to see cases of duplicates where only the record with the highest reviews is maintained and the rest discarded:

In [10]:
# Before cleaning
print('Before cleaning:')
for each_row in android:
    app_name = each_row[0]
    if app_name == 'Instagram':
        print('\n')
        print(each_row)
        
# After cleaning 
print('\nAfter cleaning:')
for each_row in android_clean:
    app_name = each_row[0]
    if app_name == 'Instagram':
        print('\n')
        print(each_row)
        
#Size after cleaning
print('\nNumber of columns:', len(android_header))
print('Number of rows:', len(android_clean))

Before cleaning:


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

After cleaning:


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

Number of columns: 13
Number of rows: 9659


## Removing non-English Apps
On exploring the data set further, several non-English apps were found:

In [11]:
print(android_clean[7941][0])

لعبة تقدر تربح DZ


However, the target audience for **L & L Co.** is English-speaking. Therefore, all the non-English apps have to be filtered out for the analysis to be more accurate.

In ASCII, the characters commonly used in English text are encoded with numbers from 0 to 127. Based on this range, a function is created that detects whether a character belongs to the set of common English characters: if its number is less than or equal to 127, it does.

The app names are stored as strings in these data sets. Because strings in Python are *indexable* and *iterable*, it is possible to loop through an app name and check whether each character is allowed in English text.

A function `is_english()` is defined. For every character in the string, its corresponding number is obtained using the `ord()` function. If any character's number falls outside the ASCII range, `is_english()` returns `False`; otherwise it returns `True`.

In [12]:
def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

#Testing the function:
print(is_english('Instagram'))
print(is_english('لعبة تقدر تربح'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


From the cell above, English app names like `'Docs To Go™ Free Office Suite'` and `'Instachat 😜'` are wrongly flagged as non-English (`False`). This is because emojis and characters like ™ fall outside the ASCII range and have corresponding numbers above 127.
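
To make the point concrete, the short sketch below prints the number `ord()` returns for a few characters and collects the characters in one app name that fall outside the 0 to 127 range. The variable names are illustrative only.

```python
# Print each character, its code point, and whether it is within ASCII.
for ch in ['a', 'Z', '™', '😜']:
    print(repr(ch), ord(ch), ord(ch) <= 127)

# Characters in an English app name that still fall outside ASCII.
app_name = 'Docs To Go™ Free Office Suite'
non_ascii = [ch for ch in app_name if ord(ch) > 127]
print(non_ascii)
```

A single non-ASCII character is enough to make the strict version of the filter reject an otherwise English name.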

To fine-tune this filter and avoid losing valuable data, a condition is added so that an app is only removed if its name has more than three characters falling outside the ASCII range:

In [13]:
def is_english(string):
    count = 0
    for character in string:
        if ord(character) > 127:
            count += 1
            if count > 3:
                return False
    return True

#Testing the function:
print(is_english('Instagram'))
print(is_english('لعبة تقدر تربح'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


Using this filter on the two datasets:

In [14]:
# android data set
english_android = []
non_english_android = []

for each_app in android_clean:
    app_name = each_app[0]
    if is_english(app_name):
        english_android.append(each_app)
    else:
        non_english_android.append(each_app)

explore_data(english_android, 0, 2, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of columns: 13
Number of rows: 9614




In [15]:
#ios mobile data set
english_ios = []
non_english_ios = []

for each_app in ios:
    app_name = each_app[2]
    if is_english(app_name):
        english_ios.append(each_app)
    else:
        non_english_ios.append(each_app)

explore_data(english_ios, 0, 2, True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


Number of columns: 17
Number of rows: 6183




## Isolating Free Apps
**L & L Co.** is only interested in free apps, since it builds apps that are free to download and install. However, the data sets contain both free and non-free apps. The next step is therefore to isolate the free apps.

In [16]:
final_android = []
for each_app in english_android:
    price = each_app[7]
    if price == '0':
        final_android.append(each_app)
        
final_ios = []
for each_app in english_ios:
    price = each_app[5]
    if price == '0':
        final_ios.append(each_app)
    
print('Android:', len(final_android))
print('ios:', len(final_ios))

Android: 8862
ios: 3222


At this stage the data cleaning process has entailed:
- *removing inaccurate data.*
- *removing duplicate app entries.*
- *removing non-English apps.*
- *isolating free apps.*

These are taken as the only data cleaning steps for these data sets. The final data sets therefore consist of 8862 Android apps and 3222 iOS apps, which is sufficient for analysis.

# 2. Analysis of the Clean Data
As stated earlier, the main aim of this project is to determine the kinds of apps that are likely to attract more users, because revenue is highly influenced by the number of people using the apps.

The validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, it is developed further.
3. If the app is profitable after six months, an iOS version of the app is built and added to the App Store.

Because the end goal is to add the app on both Google Play and the App Store, the app profiles that are successful on both markets have to be determined. 

## Most Common Apps on Google Play and App Store by Genre
To get a sense of the most common genres in each market, a frequency table is built for the `prime_genre` column of the App Store data set, and for the `Genres` and `Category` columns of the Google Play data set.

Two functions are built:
1. One function to generate frequency tables that show percentages
2. Another function that's used to display the percentages in a descending order

In [17]:
#Generating the frequency tables
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for each_row in dataset:
        total += 1
        key = each_row[index]
        
        if key in table:
            table[key] += 1 #increase the key value by 1
        else:
            table[key] = 1 #assign the key value = 1
    
    table_percentages = {}
    
    for each_key in table:
        percentage = (table[each_key]/total) * 100
        table_percentages[each_key] = percentage
        
    return table_percentages

#displaying the percentages in a descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    display = []
    
    for each_key in table:
        key_val_as_tuple = (table[each_key], each_key)
        display.append(key_val_as_tuple)
        
    table_sorted = sorted(display, reverse = True)
    for each_entry in table_sorted:
        print(each_entry[1], ':' ,each_entry[0])
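
For comparison, the same percentage table can be sketched with `collections.Counter`, whose `most_common()` method already returns entries in descending order of count. This is an alternative technique rather than the notebook's method; `freq_table_pct` and the sample rows are hypothetical.

```python
from collections import Counter

def freq_table_pct(rows, index):
    """Return {value: percentage} for one column, most frequent value first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.most_common()}

# Made-up sample rows: (app name, genre).
sample_rows = [
    ['a', 'Games'],
    ['b', 'Games'],
    ['c', 'Music'],
    ['d', 'Games'],
]

pct = freq_table_pct(sample_rows, 1)
for genre, share in pct.items():
    print(genre, ':', share)
```

Because dictionaries preserve insertion order in Python 3.7 and later, iterating over the result prints the genres from most to least common without a separate sorting step.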

The frequency table for the `prime_genre` column of the App Store data set is examined first.

In [18]:
display_table(final_ios, 12)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


From the above cell, more than half (58.16%) of the free English apps are games. Entertainment apps are close to 8%, followed by photo and video apps at close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.

Apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer.

However, the fact that a genre is the most numerous does not imply that it also has the greatest number of users: the demand might not be the same as the offer.

Examining the `Genres` and `Category` columns of the Google Play data set:

In [19]:
display_table(final_android, 1) #Category

FAMILY : 18.934777702550214
GAME : 9.693071541412774
TOOLS : 8.451816745655607
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.7941773865944481
MAPS_AND_NAVIGATION : 1.399232678853532
FOOD_AND_DRINK : 1.2412547957571656
EDUCATION : 1.1735499887158656
ENTERTAINMENT : 0.9591514330850823
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8237418190024826
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
PARENTING : 0.6544798013992327
ART_AND_DESIGN : 0.

On Google Play, there are not that many apps designed for fun; a good number of the apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.).

This picture is also confirmed by the frequency table for the `Genres` column:

In [20]:
display_table(final_android, 9) #Genres

Tools : 8.440532611148726
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5206499661475967
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7490408485669149
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.92529902956443

The difference between the `Genres` and `Category` columns is not very clear-cut. However, the `Genres` column is much more granular (it has more categories).

The conclusion at this point is that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps.

## Most Popular Apps by Genre
As mentioned earlier, just because an app is common on Google Play or App Store, it doesn't directly imply that the app is popular among users.

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, this information is in the `Installs` column, but it is missing for the App Store data set. As a workaround, the total number of user ratings, found in the `rating_count_tot` column, is used instead.

## App Store Most Popular Apps
Calculating the average number of user ratings per app genre on the App Store. 

Procedure:
1. Isolate the apps of each genre.
2. Sum up the user ratings for the apps of that genre.
3. Divide the sum by the number of apps belonging to that genre (not by the total number of apps).

In [21]:
#App Store:
genres_ios = freq_table(final_ios, 12) #returns dictionary with key as genre and value as percentage of occurrence
avg_genre_rating = {}

for genre in genres_ios:
    total = 0 #total rating of particular genre
    len_genre = 0
    
    #nested loop to deal with all the recordings of a particular genre before moving to the next type of genre.
    for each_app in final_ios: 
        genre_app = each_app[12]
        
        #if it belongs to the genre of interest then: sum the total number of ratings 
        if genre_app == genre:        
            n_ratings = float(each_app[6])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    avg_genre_rating[genre] = avg_n_ratings
    #print(genre, ':', avg_n_ratings) #prints without ordering in ascending or descending order
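
The nested loop above rescans the whole data set once per genre. A single-pass version can be sketched by accumulating a running total and count per genre; `average_by_key` is a hypothetical helper, and the sample rows are shortened to three columns using rating counts printed elsewhere in this notebook.

```python
from collections import defaultdict

def average_by_key(rows, key_idx, value_idx):
    """Return {key: mean of the value column} computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_idx]
        totals[key] += float(row[value_idx])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Shortened sample rows: (app name, genre, total rating count).
sample_rows = [
    ['Waze', 'Navigation', '345046'],
    ['Geocaching', 'Navigation', '12811'],
    ['Facebook', 'Social Networking', '2974676'],
]

averages = average_by_key(sample_rows, 1, 2)
for genre, avg in averages.items():
    print(genre, ':', avg)
```

On the full data set the equivalent call would be `average_by_key(final_ios, 12, 6)`, which avoids needing the frequency table just to list the genres.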

In [22]:
#Function desc_display, Takes a dictionary as its input parameter and displays the dictionary contents in descending order.
def desc_display(dictionary):
    display = []
    
    for each_key in dictionary:
        key_val_as_tuple = (dictionary[each_key], each_key)
        display.append(key_val_as_tuple)
        
    table_sorted = sorted(display, reverse = True)
    
    for each_entry in table_sorted:
        print(each_entry[1], ':' ,each_entry[0])

#calling the function to display the most popular apps on App Store in descending order.      
desc_display(avg_genre_rating)    

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


On average, `Navigation` apps have the highest number of user ratings, followed by `Reference` and `Social Networking` apps.
Note that these high averages could be driven by a few very popular apps with hundreds of thousands of ratings, as demonstrated in the cell below:

In [23]:
for each_app in final_ios:
    if each_app[12] == 'Navigation':
        print(each_app[2], ':', each_app[6])
        print('\n')

Waze - GPS Navigation, Maps & Real-time Traffic : 345046


Geocaching® : 12811


ImmobilienScout24: Real Estate Search in Germany : 187


Railway Route Search : 5


CoPilot GPS – Car Navigation & Offline Maps : 3582


Google Maps - Navigation & Transit : 154911




***Waze*** and ***Google Maps*** have a very high number of user ratings compared with the other `Navigation` apps. These two very popular apps dominate the Navigation app market, so a high genre average does not mean that a new navigation app would be as popular. To get a more general picture, such very popular apps could be filtered out so that they do not skew the averages.
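
The effect of the two dominant apps can be illustrated with the six `Navigation` rating counts printed above. The sketch compares the plain mean with the median and with a mean that leaves out very popular apps. The median is a swapped-in technique, and the 100,000-rating cutoff is an arbitrary assumption chosen only for illustration.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps listed in the cell above.
nav_ratings = [345046, 12811, 187, 5, 3582, 154911]

plain_mean = mean(nav_ratings)
mid_value = median(nav_ratings)

# Hypothetical cutoff: ignore apps with more than 100,000 ratings.
CUTOFF = 100_000
filtered = [n for n in nav_ratings if n <= CUTOFF]
filtered_mean = mean(filtered)

print('Mean:', plain_mean)
print('Median:', mid_value)
print('Mean without very popular apps:', filtered_mean)
```

The plain mean sits far above what a typical app in the genre receives, which is the skew described above.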

## Google Play Most Popular Apps
In Google Play the `Installs` column will be used to analyse the popular apps.

In [24]:
display_table(final_android, 5) #Installs column

1,000,000+ : 15.741367637102236
100,000+ : 11.554953735048521
10,000,000+ : 10.516813360415256
10,000+ : 10.200857594222523
1,000+ : 8.395396073121193
100+ : 6.917174452719477
5,000,000+ : 6.838185511171294
500,000+ : 5.574362446400361
50,000+ : 4.773188896411646
5,000+ : 4.513653802753328
10+ : 3.5432182351613632
500+ : 3.2498307379823967
50,000,000+ : 2.2906793048973144
100,000,000+ : 2.1214172872940646
50+ : 1.9183028661701647
5+ : 0.7898894154818324
1+ : 0.5077860528097494
500,000,000+ : 0.2708192281651997
1,000,000,000+ : 0.22568269013766643
0+ : 0.045136538027533285
0 : 0.011284134506883321


The install numbers are not precise: they are open-ended (100+, 1,000+, 5,000+, etc.), as seen in the cell above. However, for this purpose (finding which app genres attract the most users), very precise data is not needed.

To perform computations, each install number is converted from string to float. This means the commas and the plus characters are removed, otherwise the conversion will fail and raise an error.
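
The string clean-up described above can be isolated in a small helper so that it is written once and reused. `parse_installs` is a hypothetical name used here for illustration.

```python
def parse_installs(text):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(text.replace(',', '').replace('+', ''))

# A few of the values that appear in the Installs column.
for raw in ['1,000,000+', '500+', '0+', '0']:
    print(raw, '->', parse_installs(raw))
```

Note that the result is a lower bound: an app listed as `1,000,000+` is counted as having exactly one million installs.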

In [25]:
category = freq_table(final_android, 1) #returns dictionary with key as category and value as the percentage of occurrence.
diff_categories = {}

for each_entry in category:
    category_name = each_entry
    len_category = 0
    total = 0
    
    for each_app in final_android:
        if each_app[1] == category_name:
            n_installs = each_app[5]
            n_installs = n_installs.replace(',','')
            n_installs = n_installs.replace('+','')
            total += float(n_installs)
            len_category += 1
            
    avg_num_installs = total / len_category
    #print(category_name, ':', avg_num_installs)
    diff_categories[category_name] = avg_num_installs
    
desc_display(diff_categories)

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17805627.643678162
PRODUCTIVITY : 16787331.344927534
GAME : 15560965.599534342
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10682301.033377837
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3694276.334922527
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1820673.076923077
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

On average, in the Google Play market `COMMUNICATION` apps have the most installs: 38,456,119, followed by `VIDEO_PLAYERS` and `SOCIAL`. 

In [26]:
for each_app in final_android:
    if each_app[1] == 'COMMUNICATION':
        print(each_app[0], ':', each_app[5])

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 

On further inspection, this average is heavily skewed upward by a few apps that have over one billion installs (***WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail,*** and ***Hangouts***), and a few others with over 100 and 500 million installs.

Removing the `COMMUNICATION` apps with 100 million or more installs gives:

In [48]:
under_100_m = []

for each_app in final_android:
    n_installs = each_app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    num_installs = float(n_installs)

    if (each_app[1] == 'COMMUNICATION') and (num_installs < 100000000.0):
        under_100_m.append(num_installs)
      
print('Average number of installs of Communication apps with below 100 million installs:',sum(under_100_m) / len(under_100_m))

Average number of installs of Communication apps with below 100 million installs: 3603485.3884615386


As observed, the average drops by roughly a factor of 10, which shows the effect of the skewed data.

In both the App Store and Google Play, apps in the social category are among the top three most popular genres.
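
As a quick arithmetic check of the size of the drop, the two averages printed in the cells above can be divided directly. The variable names are illustrative.

```python
# Averages copied from the notebook output above.
avg_all = 38456119.167247385          # all COMMUNICATION apps
avg_under_100m = 3603485.3884615386   # apps with fewer than 100 million installs

ratio = avg_all / avg_under_100m
print('The average is about', round(ratio, 1), 'times smaller after filtering')
```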

In [50]:
for each_app in final_ios:
    if each_app[12] == 'Social Networking':
        print(each_app[2], ':', each_app[6])

Facebook : 2974676
LinkedIn : 71856
Skype for iPhone : 373519
Tumblr : 334293
Match™ - #1 Dating App. : 60659
WhatsApp Messenger : 287589
TextNow - Unlimited Text + Calls : 164963
Grindr - Gay and same sex guys chat, meet and date : 23201
imo video calls and chat : 18841
Ameba : 269
Weibo : 7265
Badoo - Meet New People, Chat, Socialize. : 34428
Kik : 260965
Qzone : 1649
Fake-A-Location Free ™ : 354
Tango - Free Video Call, Voice and Chat : 75412
MeetMe - Chat and Meet New People : 97072
SimSimi : 23530
Viber Messenger – Text & Call : 164249
Find My Family, Friends & iPhone - Life360 Locator : 43877
Weibo HD : 16772
POF - Best Dating App for Conversations : 52642
GroupMe : 28260
Lobi : 36
WeChat : 34584
ooVoo – Free Video Call, Text and Voice : 177501
Pinterest : 1061624
知乎 : 397
Qzone HD : 458
Skype for iPad : 60163
LINE : 11437
QQ : 9109
LOVOO - Dating Chat : 1985
QQ HD : 5058
Messenger : 351466
eHarmony™ Dating App - Meet Singles : 11124
YouNow: Live Stream Video Chat : 12079
Cougar 

In [51]:
for each_app in final_android:
    if each_app[1] == 'SOCIAL':
        print(each_app[0], ':', each_app[5])

Facebook : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Social network all in one 2018 : 100,000+
Pinterest : 100,000,000+
TextNow - free text + calls : 10,000,000+
Google+ : 1,000,000,000+
The Messenger App : 1,000,000+
Messenger Pro : 1,000,000+
Free Messages, Video, Chat,Text for Messenger Plus : 1,000,000+
Telegram X : 5,000,000+
The Video Messenger App : 100,000+
Jodel - The Hyperlocal App : 1,000,000+
Hide Something - Photo, Video : 5,000,000+
Love Sticker : 1,000,000+
Web Browser & Fast Explorer : 5,000,000+
LiveMe - Video chat, new friends, and make money : 10,000,000+
VidStatus app - Status Videos & Status Downloader : 5,000,000+
Love Images : 1,000,000+
Web Browser ( Fast & Secure Web Explorer) : 500,000+
SPARK - Live random video chat & meet new people : 5,000,000+
Golden telegram : 50,000+
Facebook Local : 1,000,000+
Meet – Talk to Strangers Using Random Video Chat : 5,000,000+
MobilePatrol Public Safety App : 1,000,000+
💘 WhatsLov: Smileys of love, sti

Even though giants like ***Facebook*** still dominate the social category, narrowing the category down to dating apps reveals a market gap.

No single dating app dominates the market, yet dating apps are very popular in both the App Store and Google Play markets, with users distributed among the different dating apps.

However, since dating apps already exist in the market, a more specific dating app that meets the needs of our users should be developed in order to stand out against the competition. For example, in Nairobi, an area with many university students and working-class people, a dating app specifically connecting the two groups could be developed.

# Conclusions
In this project, data about the App Store and Google Play mobile apps was analyzed, with the goal of recommending an app profile that can be profitable for both markets.

After analysis, a social app is found to be potentially profitable on both the App Store and Google Play. More specifically, a dating app could do well in both markets. No single dating app dominates, and several dating apps are doing well, with users distributed among them. But since the market is still very competitive, the proposed dating app needs to meet the specific needs of the targeted market. These specific needs could be:
- financial; connecting university students with the working class
- interracial matching
- security; assuring the safety of its users
