# Profiling Profitable Apps on the App Store and Google Play

The aim of this project is to determine the types of apps that are most likely to be profitable (that is, to generate the highest revenue), in order to inform the decision-making of a hypothetical app developer. This developer creates apps that are free and primarily in English.

As the apps are free, revenue comes from in-app advertising and is a function of ad engagement, which in turn depends on the number of app users. The goal of this analysis is to profile an app that can attract a large number of users on both the Playstore and the App Store.

*NB: This analysis has been completed as a Guided Project in Dataquest.io's Python for Data Science: Fundamental course*

# Exploring The Data

This analysis will be conducted using two existing datasets: [googleplaystore.csv](https://www.kaggle.com/lava18/google-play-store-apps/home) by Lavanya Gupta, which stores 2018 data on about 10,000 Android apps, and [appleStore.csv](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home#AppleStore.csv) by Ramanathan Perumal, which stores 2018 data on about 7,000 iOS apps.

In [50]:
# As mentioned above, the two data sets to be used are stored in CSV files, AppleStore.csv and googleplaystore.csv respectively.
#They will be opened with the following code.

from csv import reader
#Open the Play Store dataset
google_opened= open('googleplaystore.csv')
google_read= reader(google_opened)
android_apps= list(google_read)
android_apps_header= android_apps[0]
android_apps= android_apps[1:]

#Open the App Store dataset
apple_opened= open('AppleStore.csv')
apple_read= reader(apple_opened)
ios_apps= list(apple_read)
ios_apps_header= ios_apps[0]
ios_apps= ios_apps[1:]
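
As a hedged alternative to the cell above, the loading steps could be wrapped in one small helper so that each file is closed after reading. This is only a sketch: `load_csv` is a hypothetical name, and `encoding='utf8'` is an assumption about how the two CSV files were saved.

```python
from csv import reader

def load_csv(path):
    """Return (header, rows) for the CSV file at `path`."""
    # The `with` block closes the file even if reading fails.
    # encoding='utf8' is an assumption about the files, not a documented fact.
    with open(path, encoding='utf8', newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]

# Hypothetical usage, assuming both files sit next to the notebook:
# android_apps_header, android_apps = load_csv('googleplaystore.csv')
# ios_apps_header, ios_apps = load_csv('AppleStore.csv')
```

The returned pair mirrors the header/data split used in the rest of this analysis.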

In order to efficiently explore these large datasets and extract useful information such as the content of each row and the number of rows & columns, a new function has been defined:

In [51]:
def explore_data(dataset, start,end, rows_and_columns=True):
    dataset_slice= dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') #adds a new empty line after each row to improve readability
     
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
#Exploring the App Store apps dataset:
print('ios Apps Explored:')
print(ios_apps_header)
print('\n')
explore_data(ios_apps,0,5, rows_and_columns=True) #Shows the data in the first 5 rows of ios_apps, excluding the header
print('\n')
#Exploring the Play Store dataset:
print('Play Store Apps Explored:')
print (android_apps_header)
print('\n')
explore_data(android_apps,0,5,rows_and_columns=True)    

ios Apps Explored:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16

From the details above, it can be seen that the App Store dataset has 7197 rows and 16 columns, while the Play Store dataset has 10841 rows and 13 columns.

The column names for these two datasets are contained in the extracted headers. Descriptions for the App Store column names can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home#AppleStore.csv), while those for the Playstore can be found [here](https://www.kaggle.com/lava18/google-play-store-apps/home).

For the App Store, the `user_rating` and `rating_count_tot` columns seem like good proxies for the popularity of an app; the Playstore dataset offers the analogous `Rating`, `Reviews` and `Installs` columns. More generally, it would also be interesting to see whether popularity differs between paid and free apps, whether certain genres dominate, and whether the size of an app affects its popularity, as this could inform the strategy of app developers.

# Data Cleaning

### Deleting Unfilled/Wrong Data- Android Data

It is possible to identify if data is missing in the rows by running a for-loop over the rows and comparing their length to the length of the header row:

In [52]:
for index, row in enumerate(android_apps):
    if len(row)!=len(android_apps_header): #The header has 13 column names, so every row should have 13 fields
        print(index) #Print the index of the row with the missing data

10472


In [53]:
#Having a look at the data 
print(android_apps_header)
print('\n')
print(android_apps[10472])
print('\n')
print(android_apps[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


It can be seen that row 10472, corresponding to the app "Life Made WI-Fi Touchscreen Photo Frame", has a length of 12 as opposed to 13, indicating that data is missing. Comparing it with the header shows that the Category value is absent, so every later value has shifted one column to the left. As a result, the Rating column holds 19, which is erroneous as the maximum rating for an app on the Playstore is 5. For these reasons, this row will be deleted.

In [54]:
del android_apps[10472]
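
Note that deleting by index is not repeatable: if the cell above is run a second time, a different (valid) row is removed. A minimal sketch of a repeatable alternative is to keep only the rows whose length matches the header. The function name `drop_malformed_rows` and the three-column sample data are made up for illustration.

```python
def drop_malformed_rows(rows, header):
    """Return only the rows that have one field per header column."""
    expected = len(header)
    return [row for row in rows if len(row) == expected]

# Made-up miniature dataset, for illustration only.
header = ['App', 'Category', 'Rating']
rows = [
    ['Good App', 'TOOLS', '4.1'],
    ['Broken App', '19'],          # one field is missing, so values shifted left
    ['Other App', 'GAME', '4.5'],
]

cleaned = drop_malformed_rows(rows, header)
print(len(cleaned))  # 2

# Running the filter again changes nothing, unlike `del` by index.
print(drop_malformed_rows(cleaned, header) == cleaned)  # True
```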

### Deleting Unfilled/Wrong Data- App Store Data

A similar method can be employed to check if there is any unfilled data in the App Store Data:

In [55]:
ios_row_check=0 #Counter that is increased by 1 for every row whose length is not equal to 16
for row in ios_apps:
    if len(row)!=16: #There are 16 columns in this data
        ios_row_check+=1
print(ios_row_check)
    

0


As the value of `ios_row_check` is zero, every row in the App Store data has the expected 16 fields, so no row appears to be missing a column.

### Removing Duplicated Entries- Android Data

Additionally, it is important to check that data for the same app has not been entered multiple times, that is to say, that there is no duplication. It is possible to check for duplication with the following code:

In [56]:
unique_entry=[]
duplicated_entry=[]

for row in android_apps:
    name=row[0]
    if name in unique_entry:
        duplicated_entry.append(name)
    else:
        unique_entry.append(name)

print('Number of duplicated entries:', len(duplicated_entry))
print('\n')
print('Number of unique entries', len(unique_entry))

Number of duplicated entries: 1181


Number of unique entries 9659


As can be seen above, there are 1181 duplicated entries. In order to continue with the analysis, it is important to develop a criterion for removing these duplicated entries.

In [57]:
print(duplicated_entry[0:15]) #Taking a look at some of the duplicated entries

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [58]:
#Taking a look at an example of a duplicated app
print(android_apps_header)
print('\n')
for app in android_apps:
    name=app[0]
    if name=='Quick PDF Scanner + OCR FREE':
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


From observing the duplicated entries for the app "Quick PDF Scanner + OCR FREE", it can be seen that the only difference is in the Reviews column. Since the number of reviews only grows over time, the entry with the highest number of reviews should be the most recent and therefore the most reliable. This gives a selection criterion: keep the entry with the highest number of reviews and delete the other duplicates. This criterion is implemented in the following code.

In [59]:
reviews_max={} #Create an empty dictionary mapping each app name to its highest number of reviews
for app in android_apps:
    name=app[0]
    n_reviews= float(app[3]) #Convert to float, as the number of reviews is stored as a string
    if name in reviews_max and reviews_max[name]<n_reviews: 
        reviews_max[name]=n_reviews #If the number of reviews in the current iteration is higher than that stored in the dictionary, update it with the current value
    elif name not in reviews_max:
        reviews_max[name]=n_reviews #Else if the name is not in the dictionary yet, store the current number of reviews

From earlier lines of code, it was shown that the number of unique entries is 9659. If the code for determining `reviews_max` is correct, the length of this dictionary should be equal to 9659. This has been checked in the following code.

In [60]:
len(reviews_max)==9659

True

The code below will create a list that stores the cleaned data (named `android_clean`). If this code has been correctly constructed, the length of `android_clean` should be 9659. The code that checks for duplicated entries can also be reused to ensure that the number of duplicated entries is zero.

In [61]:
android_clean=[] #Create an empty list to store the new cleaned data
already_added=[] #Store the app names that have been added to the list above
for app in android_apps:
    name= app[0]
    n_reviews= float(app[3])
    if n_reviews==reviews_max[name] and (name not in already_added): #This second not in condition is needed as multiple entries of the same app might have the same number of reviews 
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))

9659


In [62]:
#Double-check that there are no duplicated entries in android_clean
unique_entry=[]
duplicated_entry=[]

for row in android_clean:
    name=row[0]
    if name in unique_entry:
        duplicated_entry.append(name)
    else:
        unique_entry.append(name)

print('Number of duplicated entries:', len(duplicated_entry))
print('\n')
print('Number of unique entries', len(unique_entry))

Number of duplicated entries: 0


Number of unique entries 9659
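
As a hedged sketch, the two passes used above (building `reviews_max`, then filtering) could be combined into a single pass that stores the best row per app name in a dictionary. The function name `keep_most_reviewed` is hypothetical, and the sample rows are shortened and reordered versions of the entries shown earlier plus one made-up app. Unlike the original approach, the result is ordered by the first appearance of each name rather than by the position of the kept row.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when review counts are tied.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Shortened sample rows (only the first four columns), for illustration.
sample_rows = [
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804'],
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805'],
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805'],
    ['Example App', 'TOOLS', '4.0', '12'],  # made-up entry
]

deduplicated = keep_most_reviewed(sample_rows)
print(len(deduplicated))    # 2
print(deduplicated[0][3])   # 80805
```

Dictionary lookups by name also avoid the repeated list scans performed by `name not in already_added`, which may matter on larger datasets.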


### Removing Duplicated Entries- App Store Data

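The first step is to check if there are any duplicated entries in the `ios_apps` dataset. Note that the check below compares the first column, which for this dataset is the app `id` rather than the app name.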

In [63]:
unique_entry=[]
duplicated_entry=[]

for row in ios_apps:
    name=row[0]
    if name in unique_entry:
        duplicated_entry.append(name)
    else:
        unique_entry.append(name)

print('Number of duplicated entries:', len(duplicated_entry))
print('\n')
print('Number of unique entries', len(unique_entry))

Number of duplicated entries: 0


Number of unique entries 7197


It can be seen that there are no duplicated entries in the `ios_apps` dataset.

### Removing Non-English Apps - App Store & Android Apps

From exploring both of the datasets (ios and Android), it can be seen that there are some app names that are not in English. This is exemplified in the following code:

In [64]:
print(ios_apps[813][1])
print(ios_apps[6731][1])

print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


As this analysis is focused on apps that are suitable for an English-speaking audience, apps that are not in English should be removed. According to the ASCII (American Standard Code for Information Interchange) system, the characters commonly used in English text correspond to numbers in the range 0-127, and this fact has been used to help with the task. The built-in `ord()` function returns this number for a given character.

In [65]:
def is_english(string):
    for character in string:
        if ord(character)>127:
            return False 
    return True 

The code can be checked against the following app names to ensure that it is working:

In [66]:
print(is_english('Instagram')) #Example 1
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))#Example 2
print(is_english('Docs To Go™ Free Office Suite')) #Example 3
print(is_english('Instachat 😜')) #Example 4

True
False
False
False


From the code above, `is_english` correctly identifies that Example 2 is not in English. However, it incorrectly classifies Examples 3 & 4 as not in English, because they include characters (™ and an emoji) that fall outside the ASCII range. To reduce the loss of potentially valuable data, the function should be updated with a compromise: only remove an app if its name contains more than three characters with numbers outside the 0-127 ASCII range. Although this update is not perfect, it is a sufficient improvement on the previous version.

In [67]:
def is_english(string):
    characters_above_127=0
    for character in string:
        if ord(character)>127:
            characters_above_127+=1
    if characters_above_127>3:
        return False
    else:
        return True 

In [68]:
#Retesting this updated code:
print(is_english('Docs To Go™ Free Office Suite')) #Example 1
print(is_english('Instachat 😜')) #Example 2

True
True
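
To see why a particular name passes or fails the threshold, it can help to list the characters that fall outside the ASCII range together with their `ord()` values. The helper below is a small illustrative sketch; `non_ascii_characters` is a hypothetical name that is not part of the original analysis.

```python
def non_ascii_characters(string):
    """Return (character, code point) pairs for characters outside 0-127."""
    return [(character, ord(character))
            for character in string
            if ord(character) > 127]

print(non_ascii_characters('Docs To Go™ Free Office Suite'))
print(non_ascii_characters('Instachat 😜'))

# A name written mostly in Chinese has many more than three such characters.
print(len(non_ascii_characters('爱奇艺PPS -《欢乐颂2》电视剧热播')) > 3)
```

Each of the first two names contains a single non-ASCII character, which is why the updated `is_english` accepts them.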


The following code will be used to separate the English apps from the non-English apps in both datasets.

In [69]:
english_ios_apps=[]
english_android_apps=[]

for app in android_clean:
    if is_english(app[0]): #Check if app name is in english
        english_android_apps.append(app)
        
for app in ios_apps:
    if is_english(app[1]):
        english_ios_apps.append(app)

print('English Android Apps')
explore_data(english_android_apps, 0, 3, True)
print('\n')
print('English ios Apps')
explore_data(english_ios_apps, 0, 3, True)

English Android Apps
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


English ios Apps
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', 

It can be seen that after applying the `is_english` function, the number of Android apps in English is 9614 and the number of iOS apps in English is 6183.

### Isolating Free Apps - App Store & Android Apps

In [70]:
print(ios_apps_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [71]:
print(android_apps_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


As mentioned at the beginning, this analysis is focused on free apps on the App Store and the Playstore. Therefore, the free apps should be extracted from the data above. This can be done with the following code.

In [72]:
android_final = [] #This list will store the final Android apps, i.e. those that are free and are in english
ios_final = [] #This list will store the final ios apps, i.e. those that are free and in english

for app in english_android_apps:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in english_ios_apps:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8864
3222


As can be seen above, there are 3222 free English apps in the iOS data and 8864 in the Android data.
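
The comparison above relies on the exact strings `'0'` and `'0.0'`. A slightly more tolerant check, sketched below, converts the price to a number first. The helper name `is_free` is hypothetical, and the leading `$` on paid Playstore prices is an assumption about the data that is not shown in the outputs above.

```python
def is_free(price_string):
    """Return True if the price string represents a price of zero."""
    # Assumption: paid Playstore prices may carry a leading '$' sign.
    cleaned = price_string.replace('$', '').strip()
    return float(cleaned) == 0.0

print(is_free('0'))      # True  (Playstore format)
print(is_free('0.0'))    # True  (App Store format)
print(is_free('$4.99'))  # False (assumed paid Playstore format)
```

A value that cannot be parsed as a number raises a `ValueError`, which would surface malformed rows rather than silently treating them as paid.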

# Determining Most Common Apps

As stated at the beginning, the aim of this analysis is to determine the types of apps that are most likely to attract users. This is because more users lead to more in-app ad engagement and thus higher revenue potential. The end goal is to release the app on the Playstore and the App Store, therefore a three-step validation strategy has been proposed to minimise risk and overhead:
- Build an MVP version of the Android app to be added to the Playstore
- Develop this further upon positive responses from users
- If the app has been profitable on the Playstore for 6 months, build an iOS version of the app and add it to the App Store.

A natural starting point would be to find out which genres of Apps are popular on both stores. This will be done in the next section.

## Most Common Apps By Genre- Playstore & App Store

To find the most common apps by genre, a frequency table will be created and this will be done in two parts:
- Create a frequency table with percentages
- Display the percentages in descending order

In [73]:
# Creating a function to create a frequency table
def freq_table(dataset,index):
    table={}
    total_count=0
    for row in dataset:
        total_count+=1
        value=row[index]
        if value in table:
            table[value]+=1
        else:
            table[value]=1
    #Storing the percentages in a new dictionary
    table_percentages={}
    for key in table:
        percentage= (table[key]/total_count)*100
        table_percentages[key]=percentage 
    return table_percentages

#Show percentages in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
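
For comparison, the same frequency table can be sketched with `collections.Counter` from the standard library. This is an alternative formulation, not the method used in this analysis: `freq_table_counter` and `display_sorted` are hypothetical names, the sample rows are made up, and entries with equal percentages may be printed in a different order than by `display_table`.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

def display_sorted(table):
    """Print the table entries from highest to lowest percentage."""
    ordered = sorted(table.items(), key=lambda item: item[1], reverse=True)
    for value, percentage in ordered:
        print(value, ':', percentage)

# Made-up rows: [app name, genre]
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(sample, 1)
display_sorted(table)
# Games : 75.0
# Music : 25.0
```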

### Applied to iOS Data

The `display_table` function defined above can be applied to the ios_final dataset, on the prime_genre column, in order to generate a frequency table in descending order. This will be done with the following code.

In [74]:
display_table(ios_final,11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


From the data above, it can be seen that almost 60% of the free English apps on the App Store are Games. This is followed by the Entertainment genre (~8%) and the Photo & Video genre (~5%). It appears that an overwhelming majority of the apps are designed for entertainment purposes, with the remainder (~27% of free English apps) angled towards more functional/lifestyle roles (this includes the Sports, Utilities and Shopping genres). 

Based on this data, it is possible to hypothesise that the reason so many gaming/entertainment apps are built is that they have historically been the most profitable, attracting more and more developers to these genres. Recalling that profitability for free apps (through revenue) depends on the number of users and on ad engagement, this hypothesis can be tested by looking at how the number of users is distributed by genre. 

### Applied to Android Data

The display_table function defined above can be applied to the android_final dataset, on the category and genre columns, in order to generate frequency tables in descending order. This will be done with the following code.

In [75]:
print('Frequency Table for Category Column:')
display_table(android_final,1) #For Category Column
print('\n')
print('Frequency Table for Genre Column:')
display_table(android_final,9) #For Genre Column

Frequency Table for Category Column:
FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING :

It appears that the biggest difference between the Category and Genre Columns is that the Genre column is more finely segmented. For this analysis, the level of granularity provided by the Category column is sufficient.

From the data above, for free English apps on the Playstore, the most common category is Family at around 19%, followed by Game at about 10%. This differs from the trend seen for iOS apps, where gaming/entertainment apps dominated. In the case of Android apps, there is no overwhelming leader, and the spread seems a lot more even between apps that are functional and those designed for entertainment. 

## Most Common Apps By Users- Playstore & App Store

To validate the hypothesis above and get a better idea of the ideal profile of a free app in English, the distribution of users by genre needs to be calculated. 

For the Playstore data, the average number of installs per category will be calculated using the Installs column. The App Store data has no analogous column, so the total number of user ratings (the `rating_count_tot` column) will be used as a proxy.

### User Ratings per Genre - App Store

The average number of user ratings per app genre on the App Store will be calculated using the following code.

In [76]:
genre_ios=freq_table(ios_final,11)
for genre in genre_ios:
    number_of_ratings=0 #This variable will contain the total number of user ratings
    len_genre=0 #This variable will contain the total number of apps for each genre
    for app in ios_final:
        genre_app=app[11]
        if genre_app==genre:
            number_of_ratings+=float(app[5])
            len_genre+=1
    avg_user_ratings=number_of_ratings/len_genre
    print('Average User Ratings in','',genre,'','is','',avg_user_ratings)
    print ('Number of apps in', '', genre, ':', len_genre)
    print('\n')

Average User Ratings in  Catalogs  is  4004.0
Number of apps in  Catalogs : 4


Average User Ratings in  Health & Fitness  is  23298.015384615384
Number of apps in  Health & Fitness : 65


Average User Ratings in  Book  is  39758.5
Number of apps in  Book : 14


Average User Ratings in  News  is  21248.023255813954
Number of apps in  News : 43


Average User Ratings in  Social Networking  is  71548.34905660378
Number of apps in  Social Networking : 106


Average User Ratings in  Sports  is  23008.898550724636
Number of apps in  Sports : 69


Average User Ratings in  Lifestyle  is  16485.764705882353
Number of apps in  Lifestyle : 51


Average User Ratings in  Weather  is  52279.892857142855
Number of apps in  Weather : 28


Average User Ratings in  Finance  is  31467.944444444445
Number of apps in  Finance : 36


Average User Ratings in  Photo & Video  is  28441.54375
Number of apps in  Photo & Video : 160


Average User Ratings in  Navigation  is  86090.33333333333
Number of apps in  

From the data above, it appears that the most popular genre of apps, measured by the average number of user ratings, is the Navigation genre with about 86,100 ratings per app. However, with only 6 apps in this genre, it seems that a few very popular apps are dominating and making the genre appear more popular than it really is. This can be confirmed with the following code. 

In [77]:
for apps in ios_final:
    if apps[11]== 'Navigation':
        print(apps[1],':',apps[5]) #Print app name and number of ratings for that app

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The above code confirms this, with Google Maps and Waze having ~97% of the total user ratings of this genre. These appear to be two huge brands in Navigation and entering the Navigation space might prove to be unprofitable due to the entry barrier created by their respective brand power.
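
The ~97% figure can be reproduced from the printed output. The sketch below hard-codes the six rating counts shown above; `top_share` is a hypothetical helper name.

```python
# Rating counts copied from the Navigation output above.
navigation_ratings = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 345046,
    'Google Maps - Navigation & Transit': 154911,
    'Geocaching®': 12811,
    'CoPilot GPS – Car Navigation & Offline Maps': 3582,
    'ImmobilienScout24: Real Estate Search in Germany': 187,
    'Railway Route Search': 5,
}

def top_share(ratings, top_n=2):
    """Percentage of all ratings held by the `top_n` most-rated apps."""
    counts = sorted(ratings.values(), reverse=True)
    return sum(counts[:top_n]) / sum(counts) * 100

print(round(top_share(navigation_ratings), 1))  # 96.8
```

The same helper could be applied to any other genre to gauge how concentrated it is.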

In determining which genre would be the most profitable to enter, the level of competitive rivalry is very important. In a genre with many players (i.e. a fragmented market), competition is very high and may be characterised by a lot of switching by users, preventing the build-up of users required to access higher ad revenues. Conversely, a genre with low competition that is dominated by a few players (like Navigation) might not be profitable because brand strength acts as an entry barrier. 

The ideal genre to enter seems to be one with moderate competition (low enough that a new app can still establish a "brand" and attract users) and with sufficient scale/popularity to generate meaningful revenue. From this point of view, the Reference genre seems to be the most promising. 

In [78]:
for apps in ios_final:
    if apps[11]== 'Reference':
        print(apps[1],':',apps[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


While Reference is dominated mainly by the Bible and dictionary apps, it can be seen that there is a niche for guides/cheat codes for popular games. If this company has any popular games, creating cheat codes/guides for them could be a way of leveraging that popularity in another genre/segment. 
As Games is the largest genre on the App Store, a good strategy might be to create an app that functions as a library of cheats for many of these games, as a way of improving the value proposition for users and limiting the fragmentation of the game-guide sub-segment of Reference.
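
The per-genre averages in this section were computed with a nested loop that rescans `ios_final` once for every genre. As a hedged sketch, the same averages can be accumulated in a single pass using two dictionaries. `average_by_key` is a hypothetical helper, and the sample rows are made up.

```python
def average_by_key(dataset, key_index, value_index):
    """Return {key: average of the numeric column} in one pass."""
    totals = {}
    counts = {}
    for row in dataset:
        key = row[key_index]
        totals[key] = totals.get(key, 0.0) + float(row[value_index])
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}

# Made-up rows: [genre, number of ratings]
sample = [['Navigation', '100'], ['Navigation', '400'], ['Games', '10']]
averages = average_by_key(sample, 0, 1)
print(averages)  # {'Navigation': 250.0, 'Games': 10.0}
```

With the real data, the call would presumably be `average_by_key(ios_final, 11, 5)`, matching the column indices used above.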

### Average Installs per Category - Playstore

A similar analysis will be performed for the Playstore apps. This time, the number of installations of each app is available via the Installs column, so it will be used directly.

In [79]:
display_table(android_final,5) #Produces distribution across number of installations

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


From the output above, it can be seen that the installs are given as open-ended ranges and are not very precise. For the sake of this analysis, it has been assumed that each app has exactly the number of installs named by the range it is in. That is, all apps in the 1,000+ range are treated as having 1000 installs, all apps in the 100,000+ range as having 100000 installs, and so on. 
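
The assumption can be expressed as a small conversion helper, sketched below. `parse_installs` is a hypothetical name; it returns the lower bound of each range, so any average built on it is an underestimate of the true number of installs.

```python
def parse_installs(installs_string):
    """Convert a string such as '1,000,000+' to its lower bound as an int."""
    cleaned = installs_string.replace(',', '').replace('+', '')
    return int(cleaned)

print(parse_installs('1,000+'))          # 1000
print(parse_installs('1,000,000,000+'))  # 1000000000
print(parse_installs('0'))               # 0
```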

In [80]:
print(android_apps_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [81]:
print(android_final[2])

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


In [82]:
category_android=freq_table(android_final,1) #This dictionary stores the unique categories
for category in category_android:
    total_category_installs=0 #This variable will store the total number of installs for each category
    len_category=0 #This variable will store the number of apps in each category
    for app in android_final:
        category_app=app[1]
        if category_app==category:
            n_installs=app[5]
            n_installs=n_installs.replace('+','') #Remove + in string
            n_installs=n_installs.replace(',','') #Remove , in string 
            total_category_installs+=float(n_installs) #Save as float so it can be used in addition
            len_category+=1
    average_installs= total_category_installs/len_category
    print('Average Installs in','',category, '','is','',average_installs)
    print('Number of apps in','',category,'','is','',len_category)
    print('\n')

Average Installs in  TOOLS  is  10801391.298666667
Number of apps in  TOOLS  is  750


Average Installs in  LIFESTYLE  is  1437816.2687861272
Number of apps in  LIFESTYLE  is  346


Average Installs in  COMICS  is  817657.2727272727
Number of apps in  COMICS  is  55


Average Installs in  EVENTS  is  253542.22222222222
Number of apps in  EVENTS  is  63


Average Installs in  SPORTS  is  3638640.1428571427
Number of apps in  SPORTS  is  301


Average Installs in  SHOPPING  is  7036877.311557789
Number of apps in  SHOPPING  is  199


Average Installs in  BUSINESS  is  1712290.1474201474
Number of apps in  BUSINESS  is  407


Average Installs in  FAMILY  is  3695641.8198090694
Number of apps in  FAMILY  is  1676


Average Installs in  ENTERTAINMENT  is  11640705.88235294
Number of apps in  ENTERTAINMENT  is  85


Average Installs in  PRODUCTIVITY  is  16787331.344927534
Number of apps in  PRODUCTIVITY  is  345


Average Installs in  EDUCATION  is  1833495.145631068
Number of apps in  EDUC

From the above, it can be seen that the category with the highest average number of installs is Communication, with about 38.5m installs per app across 287 apps. The following code explores the apps in the Communication category further.

In [83]:
for app in android_final:
    if app[1]=='COMMUNICATION':
        name=app[0]
        installs=app[5]
        print(name,'', 'has','', installs)

WhatsApp Messenger  has  1,000,000,000+
Messenger for SMS  has  10,000,000+
My Tele2  has  5,000,000+
imo beta free calls and text  has  100,000,000+
Contacts  has  50,000,000+
Call Free – Free Call  has  5,000,000+
Web Browser & Explorer  has  5,000,000+
Browser 4G  has  10,000,000+
MegaFon Dashboard  has  10,000,000+
ZenUI Dialer & Contacts  has  10,000,000+
Cricket Visual Voicemail  has  10,000,000+
TracFone My Account  has  1,000,000+
Xperia Link™  has  10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard  has  10,000,000+
Skype Lite - Free Video Call & Chat  has  5,000,000+
My magenta  has  1,000,000+
Android Messages  has  100,000,000+
Google Duo - High Quality Video Calls  has  500,000,000+
Seznam.cz  has  1,000,000+
Antillean Gold Telegram (original version)  has  100,000+
AT&T Visual Voicemail  has  10,000,000+
GMX Mail  has  10,000,000+
Omlet Chat  has  10,000,000+
My Vodacom SA  has  5,000,000+
Microsoft Edge  has  5,000,000+
Messenger – Text and Video Chat for Free 

It appears that a few giants in the Communication category make it look far more popular than it is. For example, Skype, WhatsApp Messenger and Google Duo have 1,000,000,000+, 1,000,000,000+ and 500,000,000+ downloads respectively. Removing these apps, which skew the average installs, allows a fairer assessment.

In [84]:
#Exclude Communication apps in the 100,000,000+ install range or higher
communication_updated=[] #Store apps with <100m installs
for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        communication_updated.append(float(n_installs))
new_communication_avg_installs=sum(communication_updated)/len(communication_updated)
print(new_communication_avg_installs)

3603485.3884615386


The new average for the Communication category is about 3.6m installs, down from the 38.5m originally calculated.
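
To illustrate why a handful of giants inflates the mean, the sketch below compares the mean and the median on a small made-up sample (the numbers are illustrative and not taken from the dataset). The median is not used in the analysis above; it is shown only as an alternative summary that is less sensitive to outliers.

```python
from statistics import mean, median

# Hypothetical install counts: four mid-sized apps and one giant
sample_installs = [1000000.0, 5000000.0, 5000000.0, 10000000.0, 1000000000.0]

print(mean(sample_installs))    # 204200000.0
print(median(sample_installs))  # 5000000.0

# Apply the same cut-off as above (fewer than 100,000,000 installs)
trimmed = [n for n in sample_installs if n < 100000000]
print(mean(trimmed))            # 5250000.0
```

In this toy sample a single giant pulls the mean up by a factor of roughly 40, while the median and the trimmed mean stay close to the typical app.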

As the strategy is to release the app on both the Playstore and the App Store, an interesting category to explore is Books and Reference. The analysis of the iOS apps recommended an app functioning as a library of guides and cheats for games.

The Books and Reference category averages about 8.8m installs across 190 apps.

In [85]:
#Explore the Books and Reference category further
for app in android_final:
    if app[1]=='BOOKS_AND_REFERENCE':
        name=app[0]
        installs=app[5]
        print(name,'', 'has','', installs)

E-Book Read - Read Book for free  has  50,000+
Download free book with green book  has  100,000+
Wikipedia  has  10,000,000+
Cool Reader  has  10,000,000+
Free Panda Radio Music  has  100,000+
Book store  has  1,000,000+
FBReader: Favorite Book Reader  has  10,000,000+
English Grammar Complete Handbook  has  500,000+
Free Books - Spirit Fanfiction and Stories  has  1,000,000+
Google Play Books  has  1,000,000,000+
AlReader -any text book reader  has  5,000,000+
Offline English Dictionary  has  100,000+
Offline: English to Tagalog Dictionary  has  500,000+
FamilySearch Tree  has  1,000,000+
Cloud of Books  has  1,000,000+
Recipes of Prophetic Medicine for free  has  500,000+
ReadEra – free ebook reader  has  1,000,000+
Anonymous caller detection  has  10,000+
Ebook Reader  has  5,000,000+
Litnet - E-books  has  100,000+
Read books online  has  5,000,000+
English to Urdu Dictionary  has  500,000+
eBoox: book reader fb2 epub zip  has  1,000,000+
English Persian Dictionary  has  500,000+
F

There are also giants in this category, such as the Bible app and Amazon Kindle, with 100,000,000+ downloads each. These apps can be removed using the same threshold of 100,000,000+ downloads, and a new average can be calculated.

In [88]:
#Exclude Books and Reference apps in the 100,000,000+ install range or higher
books_and_reference_updated=[] #Store apps with <100m installs
for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'BOOKS_AND_REFERENCE') and (float(n_installs) < 100000000):
        books_and_reference_updated.append(float(n_installs))
new_books_and_reference_updated=sum(books_and_reference_updated)/len(books_and_reference_updated)
print(new_books_and_reference_updated)

1437212.2162162163


The new average for the Books and Reference category is about 1.4m installs. From the exploration of the whole category above, it can be seen that religious books are quite popular, albeit with some fragmentation. There are many different versions of the Quran, with a combined total of about 48m+ downloads. A new strategy could be to create an app that consolidates these different versions of the Quran and adds new features (such as audio in various languages, discussion forums, etc.) to encourage users to switch.
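
The combined reach of a fragmented group of apps could be estimated by summing the installs of apps whose name contains a keyword. This is a minimal sketch: the function name `total_installs_for_keyword` and the sample rows are hypothetical, and it assumes rows follow the Play Store layout with the app name at index 0, the category at index 1 and the installs at index 5.

```python
def total_installs_for_keyword(dataset, category, keyword):
    """Sum the lower-bound installs of apps in `category` whose name contains `keyword`."""
    total = 0.0
    for app in dataset:
        if app[1] == category and keyword.lower() in app[0].lower():
            total += float(app[5].replace('+', '').replace(',', ''))
    return total

# Made-up rows for illustration only; just indexes 0, 1 and 5 matter here
sample_rows = [
    ['Quran Reader A', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Quran Reader B', 'BOOKS_AND_REFERENCE', '', '', '', '5,000,000+'],
    ['Dictionary C', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
]
print(total_installs_for_keyword(sample_rows, 'BOOKS_AND_REFERENCE', 'quran'))  # 6000000.0
```

In the notebook, the same function could be called with `android_final` in place of `sample_rows`. A simple substring match may miss apps with alternative spellings in their names.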

## Recommendation

The aim of this analysis was to identify popular types of apps across the Playstore and the App Store that are free and in English, in order to inform the strategy of an app development company. The final recommendation of this analysis is as follows:
- Across the iOS and Android apps, the Reference and Books & Reference genres (respectively) were found to be the most likely to be profitable for app developers, as they strike a balance between scale (number of ratings/installations) and degree of competition (neither so competitive that standing out is hard, nor so uncompetitive that a few brands/apps dominate).
- The recommended app is a Quran app that consolidates the features (availability in different languages, availability in various lines per page, etc.) seen across the various Playstore apps to limit fragmentation, and that also adds new features such as audio and discussion forums to encourage users to switch.
- This approach is supported by the fact that the Quran app is the 5th most popular app in the Reference genre of iOS apps, with 18,418 user ratings.