# Maximizing Profits: Analyzing Top Earning Apps in the App Store and Google Play Markets

The aim of this project is to find the mobile app profiles that are profitable for the App Store and Google Play markets. Working as a data analyst for a company that builds Android and iOS mobile apps, my job is to enable my team of developers to make data-driven decisions with respect to the kind of apps they build.

The company only builds apps that are free to download and install, relying primarily on in-app ads for revenue. Therefore, the success of our apps is highly dependent on the number of users. This project helps our developers understand which kinds of apps are most likely to attract more users.

## Opening and Exploring the Data

Two data sets are explored in this project:
* [A data set](https://www.kaggle.com/datasets/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play.
* [A data set](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store.

The data sets are opened below

In [2]:
#Links to the data sets are included above
from csv import reader

def open_dataset(file_path):
    """Open and read a CSV data set. Return the header row and the data rows as a tuple."""
    with open(file_path, encoding="utf-8") as opened_file: #The file is closed automatically after reading
        data = list(reader(opened_file))
    data_header = data[0]
    data = data[1:]
    return data_header, data
    
## The Google Play Data set ##
android_header, android = open_dataset("C:/Users/okwuo/OneDrive/Desktop/My_Datasets/googleplaystore.csv")

## The App Store data set ## 
ios_header, ios = open_dataset("C:/Users/okwuo/OneDrive/Desktop/My_Datasets/AppleStore.csv")

print(android_header)
print('\n')
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The function below makes it easier to explore the data sets by making rows more readable. It also includes an option to show the number of rows and columns of any data set.

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') #adds a new line between rows

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

print('\n')


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13




The output shows the Google Play data set has 10,841 apps and 13 columns. The useful columns for this project are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

The App Store data set is explored below.

In [6]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


The output shows the App Store data set has 7,197 apps and 16 columns. The useful columns are 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. Details about each column can be found in the data set [documentation](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).
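The cells in this notebook refer to columns by position (for example `app[3]` or `app[-5]`). As a hedged aside, the positions of the useful columns can be looked up from the header rows instead of being counted by hand. The helper name `column_indexes` is hypothetical and is not part of the original notebook; the header lists are copied from the output above.

```python
# Header rows copied from the printed output above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


def column_indexes(header, names):
    """Map each requested column name to its position in the header row."""
    return {name: header.index(name) for name in names}


# Hypothetical helper usage: look up the columns used later in the analysis.
android_cols = column_indexes(android_header, ['App', 'Category', 'Reviews', 'Installs', 'Price'])
ios_cols = column_indexes(ios_header, ['track_name', 'price', 'rating_count_tot', 'prime_genre'])

print(android_cols)
print(ios_cols)
```

With such a mapping, `row[ios_cols['prime_genre']]` reads the same value as `row[11]` or `row[-5]`, which may make later cells easier to follow.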

# Cleaning the Data Sets

## Deleting Wrong Data
[One of the discussions](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) in the Google Play data set's [discussion section](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion?sort=undefined) describes an error in row 10472.
Printing this row returns

`'Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'`

Compared with other rows, this row is missing a value in the 'Category' column, so every later value is shifted one column to the left. The row is therefore deleted from the data set.

In [9]:
print(len(android))
del android[10472]
print(len(android))

10841
10840
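
A row like this could also be found programmatically rather than through the discussion thread. The sketch below assumes that a malformed row shows up as a column-count mismatch, which is consistent with the row above having 12 values against a 13-column header. The function name `find_malformed_rows` and the sample rows are hypothetical.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data (hypothetical rows, not taken from the data set).
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '3.9'],           # 'Category' is missing, so the row is one value short
    ['App C', 'GAME', '4.5'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
print(malformed)
```

If several rows were flagged, deleting them from the highest index downward would keep the remaining indexes valid.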


## Removing Duplicate Entries

After exploring the Google Play data set, it becomes apparent that some of the apps have more than one entry. For example, the Instagram application has four entries:

In [11]:
count = 0
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)
        count += 1

print(count)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
4


The function below counts the total number of duplicate entries:

In [13]:
def count_duplicate(dataset, index):
    duplicate_apps =[]
    unique_apps = []
    
    for row in dataset:
        name = row[index]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)

    print("Number of duplicate apps:", len(duplicate_apps))
    print('\n')
    print("Examples of duplicate apps:", '\n', duplicate_apps[:15])

count_duplicate(android, 0)

Number of duplicate apps: 1181


Examples of duplicate apps: 
 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


From the code above, we see there are 1,181 cases where an app has been entered more than once.
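
As a hedged alternative to the list-based count above, the same figure can be obtained with `collections.Counter`, which avoids the repeated `name in unique_apps` list scans. The function name `count_duplicates` and the sample names are illustrative only.

```python
from collections import Counter


def count_duplicates(names):
    """Return (number of surplus entries, names that occur more than once)."""
    counts = Counter(names)
    # Every occurrence beyond the first counts as one duplicate entry.
    surplus = sum(n - 1 for n in counts.values() if n > 1)
    repeated = [name for name, n in counts.items() if n > 1]
    return surplus, repeated


# Toy list of app names (illustrative, not the full data set).
sample_names = ['Instagram', 'Box', 'Instagram', 'Slack', 'Box', 'Instagram', 'Instagram']
surplus, repeated = count_duplicates(sample_names)
print('Number of duplicate entries:', surplus)
print('Apps with duplicates:', repeated)
```

On the real data this would be called as `count_duplicates([row[0] for row in android])`; the surplus count corresponds to the length of `duplicate_apps` in the cell above.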

Examining the rows, the main difference between duplicate entries is in the fourth column (the 'Reviews' column). The differing numbers show that the data was collected at different times. For more reliable ratings, the duplicate entry with the highest number of reviews will be kept.
 
To do that, we:
* Create a dictionary where each key is a unique app name, and the value is the highest number of reviews for said app
* Use the dictionary to create a new data set that has only one entry per app: the one with the most reviews

In [15]:
max_review = {}

for app in android:
    name = app[0]
    n_review = float(app[3])

    if name in max_review and max_review[name] < n_review:
        max_review[name] = n_review

    elif name not in max_review:
        max_review[name] = n_review

print(len(max_review))

9659


Earlier on, we found out there are 1,181 cases of duplicated apps. Therefore, the expected length of the dictionary should equal the difference between the length of the data set and 1,181 (i.e. 10840 - 1181 = 9659).

In the code below:

* Two empty lists, `android_clean` and `already_added`, are initialised.
* `android` data set is looped through, and for each iteration:
  * extract app name and number of reviews.
  * add current row (app) to the `android_clean` list and the app name (name) to the `already_added` list if:
     * The number of reviews of the current app matches the highest review count recorded for that app in the `max_review` dictionary, and
     * The name of the app is not in the `already_added` list. 

In [17]:
android_clean =[]
already_added =[]

for app in android:
    name = app[0]
    n_review = float(app[3])

    if (max_review[name] == n_review) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name) #Keeps track of app names to avoid duplicates.

explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


The cleaned data set has `9659 rows`, just as expected.


# Removing Non-English Apps

The names of some of the apps in the data set suggest said apps are not directed towards an English-speaking audience.


In [19]:
print(ios[813][1])
print(ios[6731][1])

print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


The characters commonly used in English text all belong to the ASCII standard. Each ASCII character has a corresponding number between 0 and 127. Using the built-in function `ord()`, a function `is_english()` that checks an app name for non-ASCII characters is built below.

**Note: Some English app names use emojis or other symbols (™, —, etc.) that fall outside of the ASCII range. To minimize data loss, only app names with more than three non-ASCII characters are removed.**
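
To illustrate why a threshold of three is used rather than rejecting any non-ASCII character, the short sketch below counts non-ASCII characters in a few app names that appear elsewhere in this notebook. The helper name `count_non_ascii` is hypothetical; it only demonstrates the `ord()` check and is not the filter itself.

```python
def count_non_ascii(text):
    """Count characters whose code point lies outside the ASCII range (0-127)."""
    return sum(1 for character in text if ord(character) > 127)


# App names taken from outputs shown in this notebook.
names = [
    'Instagram',
    'Geocaching®',
    'U Launcher Lite – FREE Live Cool Themes, Hide Apps',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in names:
    print(name, '->', count_non_ascii(name))
```

English names with a single symbol such as `®` or `–` stay under the threshold, while the Chinese title exceeds it by a wide margin.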

In [21]:
def is_english(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii+=1
    
    if non_ascii >3:    
        return False
    else:
        return True


Below, we use the `is_english()` function to filter out the non-English apps for both data sets

In [23]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)

for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
print('English Apps for the Android Dataset')
explore_data(android_english, 0, 0, True)
print('\n')
print('English Apps for the iOS Dataset')
explore_data(ios_english, 0, 0, True)

English Apps for the Android Dataset
Number of rows: 9614
Number of columns: 13


English Apps for the iOS Dataset
Number of rows: 6183
Number of columns: 16


# Isolating Free Apps

Both data sets contain free and non-free apps. Because the company only builds free apps, the free apps need to be isolated for analysis. This is the final step in the data cleaning process.
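
The cell that follows compares price strings literally (`'0'` for Google Play, `'0.0'` for the App Store). As a hedged aside, a numeric check would tolerate both spellings. The sketch assumes that paid prices may carry a leading currency symbol such as `'$4.99'`, and it treats unparseable values as not free; both are assumptions, and the helper name `is_free` is hypothetical.

```python
def is_free(price_text):
    """Return True when a price string represents zero cost."""
    # Assumption: a paid price may be written with a leading '$'.
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Assumption: anything that is not a number is treated as not free.
        return False


for sample in ['0', '0.0', '$4.99', 'Everyone']:
    print(repr(sample), '->', is_free(sample))
```

The same helper could then be applied to `row[7]` in the Google Play data and `row[4]` in the App Store data.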

In [25]:
android_final = []
ios_final = []

for row in android_english:
    price = row[7]
    if price == '0':
        android_final.append(row)

for row in ios_english:
    price = row[4]
    if price == '0.0':
        ios_final.append(row)

print('Free Apps for the Android Dataset')
explore_data(android_final, 0, 3, True)
print('\n')
print('Free Apps for the iOS Dataset')
explore_data(ios_final, 0, 3, True)

Free Apps for the Android Dataset
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


Free Apps for the iOS Dataset
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928',


# Most Common Apps by Genre

Because the company's revenue is highly influenced by the number of people using its apps, the goal is to determine the kinds of apps likely to attract more users.

To minimize risks and overhead, our validation strategy for an app idea comprises three steps:

* Build a minimal Android app version and add it to Google Play.
* If the app has a good response from users, it will be developed further.
* If the app is profitable after six months, an iOS version will be built and added to the App Store.
  
Because the goal is to add the app to the App Store and Google Play, it is important to identify app profiles that perform well on both platforms.

The analysis will begin by examining the most common genres for each market. To achieve this, a frequency table will be built for the `prime_genre` column in the App Store data set and the `'Category'` column in the Google Play data set.
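
For reference, a percentage frequency table can also be sketched with `collections.Counter`, whose `most_common()` method already returns entries in descending order of count. This is an illustrative alternative to the notebook's own functions in the next cell; the name `genre_percentages` and the sample rows are hypothetical.

```python
from collections import Counter


def genre_percentages(rows, index):
    """Return (genre, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(genre, count / total * 100) for genre, count in counts.most_common()]


# Toy rows (hypothetical): app name in column 0, genre in column 1.
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]
table = genre_percentages(sample_rows, 1)

for genre, percentage in table:
    print(genre, ':', percentage)
```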

In [27]:
def freq_table(dataset, index):
    table = {}
    total = 0

    for row in dataset:
        total += 1
        genre = row[index]
        if genre in table:
            table[genre] += 1
        else:
            table[genre] = 1

    table_percentages = {}
    for key in table:
        percentage = (table[key]/total) * 100
        table_percentages[key] = percentage

    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        value_key_pair = (table[key], key)
        table_display.append(value_key_pair)

    table_sorted = sorted(table_display, reverse = True) #Sort table in descending order
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

display_table(ios_final, 11) #prime_genre column iOS data set

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Among the free English apps, more than half (`58.16`%) fall into the games category. Entertainment apps account for approximately 8%, followed by photo and video apps at around 5%. Educational apps constitute only 3.66%, with social networking apps making up 3.29% of the data set.

The distribution of app types suggests that the App Store, at least within the category of free English apps, is primarily composed of apps designed for entertainment purposes (games, entertainment, photo and video, social networking, sports, music, etc.). In contrast, apps with practical functions (education, shopping, utilities, productivity, lifestyle, etc.) are less common.  However, more entertainment-oriented apps do not necessarily indicate a greater number of users, as demand may not align with supply.


An analysis of the `'Category'` column in the Google Play data set is conducted below.

In [29]:
display_table(android_final, 1) # Category column Android data set

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

# Most Popular Apps by Genre

Determining the most popular app genres (those with the highest number of users) can be achieved by calculating the average number of installs for each genre. In the Google Play data set, this information is available in the `Installs` column. However, the App Store data set lacks this information. As an alternative, the total number of user ratings serves as a proxy, with relevant data found in the `rating_count_tot` column.

Below, the average number of user ratings per app genre on the App Store is calculated and sorted in decreasing order.
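
As an aside, the same per-genre average can be computed in a single pass over the rows instead of looping over the whole data set once per genre. The sketch below is an illustrative alternative to the notebook cell that follows; the function name `average_by_group` and the sample rows are hypothetical.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return (average, group) pairs sorted in decreasing order of average."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    averages = [(totals[group] / counts[group], group) for group in totals]
    return sorted(averages, reverse=True)


# Toy rows (hypothetical): name, genre, number of ratings.
sample = [['A', 'Games', '100'], ['B', 'Games', '300'], ['C', 'Music', '50']]
result = average_by_group(sample, 1, 2)

for average, genre in result:
    print(genre, ':', average)
```

For the App Store data this would correspond to `average_by_group(ios_final, -5, 5)`.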

In [31]:
genre_ios = freq_table(ios_final, -5)
genre_ratings_list = []
for genre in genre_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            num_ratings = float(app[5])
            total += num_ratings
            len_genre += 1 
            
    avg_num_ratings = total / len_genre
    genre_ratings_list.append((avg_num_ratings, genre))

genre_ratings_sorted = sorted(genre_ratings_list, reverse = True)
for entry in genre_ratings_sorted:
    print(entry[1], ':', entry[0])

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


# A Deep Dive into The Popular Genres

In [33]:
#The function shows the name and total users of the top 5 apps of a given genre in a given dataset.
def top_apps_by_genre(dataset, genre, app_name_index, genre_index, tot_users_index, top_num = 5, percent = False):
    genre_apps = []
    total_users = 0
    for app in dataset:
        app_name = app[app_name_index]
        app_genre = app[genre_index]
        app_users = int(app[tot_users_index].replace(',','').replace('+','')) #Converts string with special characters to integers (e.g. 100,000+ to 100000)
        
        if app_genre == genre:
            app_users_and_name = (app_users, app_name)
            genre_apps.append(app_users_and_name)
            total_users +=app_users

    count = 0
    print('*'*5, 'Top', top_num, genre, 'apps', '*'*5)

    for app in sorted(genre_apps, reverse = True):
        app_name = app[1]
        app_users = app[0]
        count += 1
        if count > top_num:
            print('\n')
            break
        if percent == True:
            if total_users != 0:
                app_user_percentage = round((app_users / total_users) * 100,2)
            else:
                app_user_percentage = 0 #Fallback when the genre has no recorded users (avoids division by zero)
                
            print(app_name,':',app_users, '(' + str(app_user_percentage)+'%)') #Prints app name and total number of users as a count and percentage
        else:
            print(app_name,':',app_users)

top_apps_by_genre(dataset = ios_final, genre="Navigation", genre_index=-5, app_name_index=1, tot_users_index=5, percent=True)
top_apps_by_genre(dataset = ios_final, genre="Reference", genre_index=-5, app_name_index=1, tot_users_index=5, percent=True)
top_apps_by_genre(dataset = ios_final, genre="Social Networking", genre_index=-5, app_name_index=1, tot_users_index=5, percent=True)
top_apps_by_genre(dataset = ios_final, genre="Music", genre_index=-5, app_name_index=1, tot_users_index=5, percent=True)

***** Top 5 Navigation apps *****
Waze - GPS Navigation, Maps & Real-time Traffic : 345046 (66.8%)
Google Maps - Navigation & Transit : 154911 (29.99%)
Geocaching® : 12811 (2.48%)
CoPilot GPS – Car Navigation & Offline Maps : 3582 (0.69%)
ImmobilienScout24: Real Estate Search in Germany : 187 (0.04%)


***** Top 5 Reference apps *****
Bible : 985920 (73.09%)
Dictionary.com Dictionary & Thesaurus : 200047 (14.83%)
Dictionary.com Dictionary & Thesaurus for iPad : 54175 (4.02%)
Google Translate : 26786 (1.99%)
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418 (1.37%)


***** Top 5 Social Networking apps *****
Facebook : 2974676 (39.22%)
Pinterest : 1061624 (14.0%)
Skype for iPhone : 373519 (4.93%)
Messenger : 351466 (4.63%)
Tumblr : 334293 (4.41%)


***** Top 5 Music apps *****
Pandora - Music & Radio : 1126879 (29.78%)
Spotify Music : 878563 (23.22%)
Shazam - Discover music, artists, videos & lyrics : 402925 (10.65%)
iHeartRadio – Free Music & Radio Stations : 293228 (7.75%)
Sou

The previous analysis identified navigation apps as having the highest average number of user ratings. However, closer examination reveals that this figure is heavily influenced by Waze and Google Maps, which together account for nearly `97%` of the total ratings in the Navigation genre.

The same pattern applies to social networking apps, where giants such as Facebook, Pinterest, and Skype heavily influence the average, and to music apps, where big players such as Pandora, Spotify, and Shazam do the same.

`Reference` apps have 74,942 user ratings on average, but the Bible and Dictionary.com skew this average upward.
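
One way to see this skew is to compare the mean with the median, which is far less sensitive to a few very large values. The sketch below uses only the five Navigation rating counts printed above, so its figures are for illustration and differ slightly from the genre-wide average.

```python
from statistics import mean, median

# Rating counts of the five Navigation apps listed in the output above.
navigation_top5 = [345046, 154911, 12811, 3582, 187]

avg = mean(navigation_top5)
mid = median(navigation_top5)
# Share of the ratings held by the two largest apps (Waze and Google Maps).
top_two_share = (navigation_top5[0] + navigation_top5[1]) / sum(navigation_top5) * 100

print('Mean:', avg)
print('Median:', mid)
print('Share of top two apps (%):', round(top_two_share, 2))
```

A mean several times larger than the median suggests that the average says more about a couple of dominant apps than about a typical app in the genre.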

Below are the top 5 `Book` apps:

In [35]:
top_apps_by_genre(dataset = ios_final, genre="Book", genre_index=-5, app_name_index=1, tot_users_index=5, percent=True)

***** Top 5 Book apps *****
Kindle – Read eBooks, Magazines & Textbooks : 252076 (45.29%)
Audible – audio books, original series & podcasts : 105274 (18.91%)
Color Therapy Adult Coloring Book for Adults : 84062 (15.1%)
OverDrive – Library eBooks and Audiobooks : 65450 (11.76%)
HOOKED - Chat Stories : 47829 (8.59%)




This niche appears to have potential. Although Amazon (Kindle and Audible) accounts for 64.20% of the ratings in the `Book` genre, the combined share (35.45%) of the next three apps shows promise. One approach involves creating an app that combines characteristics of the `Book` and `Reference` genres. This could be an eLibrary app with additional features beyond the basic text to enhance user retention. Possible enhancements include daily quotes, an audio version, and interactive quizzes related to in-app books. Additionally, integrating an in-app dictionary would allow users to look up words without needing to switch to an external application, keeping engagement within the app.

Other popular genres include `Weather`, `Food & Drink`, and `Finance`.


In [37]:
top_apps_by_genre(dataset = ios_final, genre="Weather", genre_index=-5, app_name_index=1, tot_users_index=5, percent=True)
top_apps_by_genre(dataset = ios_final, genre="Food & Drink", genre_index=-5, app_name_index=1, tot_users_index=5, percent=True)
top_apps_by_genre(dataset = ios_final, genre="Finance", genre_index=-5, app_name_index=1, tot_users_index=5, percent=True)

***** Top 5 Weather apps *****
The Weather Channel: Forecast, Radar & Alerts : 495626 (33.86%)
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking : 208648 (14.25%)
WeatherBug - Local Weather, Radar, Maps, Alerts : 188583 (12.88%)
MyRadar NOAA Weather Radar Forecast : 150158 (10.26%)
AccuWeather - Weather for Life : 144214 (9.85%)


***** Top 5 Food & Drink apps *****
Starbucks : 303856 (35.06%)
Domino's Pizza USA : 258624 (29.84%)
OpenTable - Restaurant Reservations : 113936 (13.15%)
Allrecipes Dinner Spinner : 109349 (12.62%)
DoorDash - Food Delivery : 25947 (2.99%)


***** Top 5 Finance apps *****
Chase Mobile℠ : 233270 (20.59%)
Mint: Personal Finance, Budget, Bills & Money : 232940 (20.56%)
Bank of America - Mobile Banking : 119773 (10.57%)
PayPal - Send and request money safely : 119487 (10.55%)
Credit Karma: Free Credit Scores, Reports & Alerts : 101679 (8.98%)




These genres are less suitable for the company, for the following reasons:

* Weather apps – Users typically spend minimal time within these apps, reducing the potential for generating revenue through in-app ads. Additionally, accessing reliable live weather data may require integration with paid APIs, increasing costs.

* Food and drink apps – Popular examples in this category include Starbucks, Domino's Pizza, and OpenTable. Developing a widely used food and drink app often involves aspects such as food preparation and delivery services, which fall outside the company's scope.

* Finance apps – These apps facilitate banking, bill payments, and money transfers. Developing a finance app requires specialized domain knowledge, and hiring a financial expert solely for app development is not a viable option.


# Most Popular Apps by Genre on Google Play

For the Google Play market, data on the number of installs is available, which should provide a clearer picture of genre popularity. However, the install numbers appear to lack precision, as most values are open-ended (e.g., 100+, 1,000+, 5,000+, etc.).

In [40]:
display_table(android_final, 5) # the Installs column

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


A limitation of this data is its lack of precision. For example, an app listed with 100,000+ installs could actually have 100,000, 200,000, or 350,000 installs. However, precise data is not required for the current analysis, as the objective is to determine which app genres attract the most users, rather than achieving perfect accuracy regarding user numbers.

The numbers will remain as they are, meaning that an app with 100,000+ installs will be treated as having 100,000 installs, and an app with 1,000,000+ installs will be considered to have 1,000,000 installs, and so on.

To perform the necessary computations, each install number will be converted to a float.
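
The conversion itself amounts to removing the commas and the plus sign before calling `float()`. A minimal sketch, using a hypothetical helper named `parse_installs`:

```python
def parse_installs(text):
    """Convert an install string such as '1,000,000+' to a float (1000000.0)."""
    return float(text.replace(',', '').replace('+', ''))


for sample in ['100+', '1,000,000+', '0']:
    print(sample, '->', parse_installs(sample))
```

Because the open-ended values are taken at face value, the result is a lower bound on the true number of installs.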

The conversion and subsequent calculation are carried out below.

In [42]:
category_and_install_list = []
category_android = freq_table(android_final, 1)
for category in category_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1 
    avg_n_installs = total / len_category
    category_and_install_list.append((avg_n_installs, category))

print('\n')
cat_install_sorted = sorted(category_and_install_list, reverse = True)
for entry in cat_install_sorted:
    print(entry[1], ':', entry[0])



COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 513

# Deep Dive into Some of the Most Installed App Categories

From the analysis above, communication apps have the highest installs on average: 38,456,119. 

As with the social networking apps in the iOS data set, this number is inflated by a few apps.

In [44]:
num_communication_apps = 0 #Counts number of apps in the Communication category
count = 0 #Counts number of Communication apps with over 100 million installs
total_comm_installs = 0 # Tracks total number of installs of Communication apps
approx_installs = 0  #Tracks total installs of apps with over 100 million downloads
        
for app in android_final:
    if app[1] == 'COMMUNICATION':
        num_communication_apps+=1
        n_comm_installs = app[5]
        n_comm_installs = n_comm_installs.replace('+', '').replace(',', '')
        total_comm_installs += float(n_comm_installs)
        
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':',app[5])
        approx_installs += int(app[5].replace('+', '').replace(',', ''))
        count+=1
        
print('\n')
print('There are', num_communication_apps, 'communication apps.', count, 'of those apps have over one hundred million installs')        

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

In [45]:
remaining_installs = total_comm_installs - approx_installs
avg_before = total_comm_installs / num_communication_apps
avg_after = remaining_installs / (num_communication_apps - count)
reduction_factor = avg_before / avg_after
print('Reduction factor:', round(reduction_factor))


Reduction factor: 11


If these 27 apps were removed, the average number of installs would drop by a factor of roughly 11.
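The same calculation is repeated later for another category, so it can be wrapped in one helper. The sketch below is only an illustration: `parse_installs` and `reduction_factor` are hypothetical names that do not appear in the notebook, the default column indices assume the Google Play layout used above (category at index 1, installs at index 5), and the toy rows are made up.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to an int."""
    return int(raw.replace('+', '').replace(',', ''))


def reduction_factor(dataset, category, threshold=100_000_000,
                     category_index=1, installs_index=5):
    """Return (n_apps, n_large, factor) for one category.

    factor = average installs over all apps in the category divided by
    the average after dropping apps at or above `threshold`.
    """
    installs = [parse_installs(row[installs_index])
                for row in dataset if row[category_index] == category]
    small = [n for n in installs if n < threshold]
    if not installs or sum(small) == 0:
        raise ValueError('need installs below the threshold to compare against')
    avg_before = sum(installs) / len(installs)
    avg_after = sum(small) / len(small)
    return len(installs), len(installs) - len(small), avg_before / avg_after


# Made-up rows for illustration only: installs sit at index 5
toy_rows = [
    ['Big Chat', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Tiny Chat', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Reader', 'BOOKS_AND_REFERENCE', '', '', '', '500,000+'],
]
n_apps, n_large, factor = reduction_factor(toy_rows, 'COMMUNICATION')
print(n_apps, 'apps,', n_large, 'large, reduction factor:', round(factor))
```

With the real data, a call such as `reduction_factor(android_final, 'COMMUNICATION')` should reproduce the figures computed in the cells above, assuming `android_final` keeps the same column order.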

The subsequent categories follow a similar pattern:

top_apps_by_genre(dataset=android_final, genre="VIDEO_PLAYERS", genre_index=1, app_name_index=0, tot_users_index=5, percent=True)
top_apps_by_genre(dataset=android_final, genre="SOCIAL", genre_index=1, app_name_index=0, tot_users_index=5)
top_apps_by_genre(dataset=android_final, genre="PHOTOGRAPHY", genre_index=1, app_name_index=0, tot_users_index=5)
top_apps_by_genre(dataset=android_final, genre="PRODUCTIVITY", genre_index=1, app_name_index=0, tot_users_index=5)

* The video player category is dominated by apps like YouTube and Google Play Movies & TV.
* The social category is dominated by apps like Instagram, Google+ and Facebook.
* The photography category is dominated by apps like Google Photos and various photo editors.
* The productivity category is dominated by apps like Google Drive, MS Word and Google Calendar.
  
The averages make these categories look more popular than they are for a typical app, and each is dominated by large companies that would be hard to compete with.
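One way to check how far a few giants pull the average is to compare the mean with the median, which is much less sensitive to outliers. The notebook does not do this; the sketch below is an assumption-laden illustration in which `install_summary` is a hypothetical helper, the column indices follow the Google Play layout (category at index 1, installs at index 5), and the rows are made up.

```python
from statistics import mean, median


def install_summary(dataset, category, category_index=1, installs_index=5):
    """Return (mean, median) installs for the apps in one category."""
    installs = [int(row[installs_index].replace('+', '').replace(',', ''))
                for row in dataset if row[category_index] == category]
    return mean(installs), median(installs)


# Made-up rows for illustration only
toy_rows = [
    ['Video Giant', 'VIDEO_PLAYERS', '', '', '', '1,000,000,000+'],
    ['Player A', 'VIDEO_PLAYERS', '', '', '', '1,000,000+'],
    ['Player B', 'VIDEO_PLAYERS', '', '', '', '1,000,000+'],
]
avg_installs, median_installs = install_summary(toy_rows, 'VIDEO_PLAYERS')
print('mean:', avg_installs, 'median:', median_installs)
```

In the toy data a single billion-install app lifts the mean to 334 million while the median stays at one million. A gap of that kind in the real data would point to the same skew.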


As with the iOS data set, the books and reference category looks popular, with an average of about 8,767,811 installs per app.

Below are some of the apps from this category and their number of installs:

In [49]:
limit = 100 #limits the number of apps printed
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and limit != 0:
        print(app[0], app[5])
        limit -= 1

E-Book Read - Read Book for free 50,000+
Download free book with green book 100,000+
Wikipedia 10,000,000+
Cool Reader 10,000,000+
Free Panda Radio Music 100,000+
Book store 1,000,000+
FBReader: Favorite Book Reader 10,000,000+
English Grammar Complete Handbook 500,000+
Free Books - Spirit Fanfiction and Stories 1,000,000+
Google Play Books 1,000,000,000+
AlReader -any text book reader 5,000,000+
Offline English Dictionary 100,000+
Offline: English to Tagalog Dictionary 500,000+
FamilySearch Tree 1,000,000+
Cloud of Books 1,000,000+
Recipes of Prophetic Medicine for free 500,000+
ReadEra – free ebook reader 1,000,000+
Anonymous caller detection 10,000+
Ebook Reader 5,000,000+
Litnet - E-books 100,000+
Read books online 5,000,000+
English to Urdu Dictionary 500,000+
eBoox: book reader fb2 epub zip 1,000,000+
English Persian Dictionary 500,000+
Flybook 500,000+
All Maths Formulas 1,000,000+
Ancestry 5,000,000+
HTC Help 10,000,000+
English translation from Bengali 100,000+
Pdf Book Downlo

In [50]:
num_book_apps = 0        # Number of apps in the Books and Reference category
count = 0                # Book apps with 100 million or more installs
total_book_installs = 0  # Total installs across all book apps
approx_installs = 0      # Total installs of the apps counted in `count`
large_install_bands = ('1,000,000,000+', '500,000,000+', '100,000,000+')

for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        num_book_apps += 1
        # Installs are stored as strings such as '1,000,000+'
        n_book_installs = int(app[5].replace('+', '').replace(',', ''))
        total_book_installs += n_book_installs

        if app[5] in large_install_bands:
            print(app[0], ':', app[5])
            approx_installs += n_book_installs
            count += 1

print('\n')
print('There are', num_book_apps, 'book & reference apps.', count, 'of those apps have over one hundred million installs')

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


There are 190 book & reference apps. 5 of those apps have over one hundred million installs


In [51]:
remaining_installs = total_book_installs - approx_installs
avg_before = total_book_installs / num_book_apps
avg_after = remaining_installs / (num_book_apps - count)
reduction_factor = avg_before / avg_after
print('Reduction factor:', round(reduction_factor))

Reduction factor: 6


The apps in the book and reference genre cover a range of themes, from religious texts to tutorials to multiple versions of dictionaries and libraries. Although a few apps are extremely popular (5 of the 190 apps have over one hundred million installs), the genre still shows potential.
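To see where the remaining apps sit, the genre can be broken down by install band. The sketch below is not part of the original analysis: `install_band_counts` is a hypothetical helper, the indices assume the Google Play column order (category at index 1, installs at index 5), and the rows are invented for the example.

```python
from collections import Counter


def install_band_counts(dataset, category, category_index=1, installs_index=5):
    """Count apps per install band (e.g. '1,000,000+') within one category.

    Returns (band, count) pairs sorted from the largest band to the smallest.
    """
    bands = Counter(row[installs_index] for row in dataset
                    if row[category_index] == category)

    def band_size(item):
        # '1,000,000+' -> 1000000, so bands sort numerically, not alphabetically
        return int(item[0].replace('+', '').replace(',', ''))

    return sorted(bands.items(), key=band_size, reverse=True)


# Made-up rows for illustration only
toy_rows = [
    ['Huge Library', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Dictionary A', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Dictionary B', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Reader', 'BOOKS_AND_REFERENCE', '', '', '', '500,000+'],
    ['Chat', 'COMMUNICATION', '', '', '', '100,000,000+'],
]
band_table = install_band_counts(toy_rows, 'BOOKS_AND_REFERENCE')
for band, n_apps in band_table:
    print(band, ':', n_apps)
```

A table like this makes it easier to judge how many apps fall in the middle bands, which is the range a new app would more realistically compete in.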


# Conclusion

This project examined free, English-language apps from the App Store and Google Play to recommend an app profile that could be profitable in both markets.

The top 5 genres with the most apps on the iOS store were:
* Games (58.16%)
* Entertainment (7.88%)
* Photo & Video (4.97%)
* Education (3.66%)
* Social Networking (3.29%)

The top 5 on the Android store were:
* Family (18.91%)
* Game (9.72%)
* Tools (8.46%)
* Business (4.59%)
* Lifestyle (3.90%)

The top 5 most installed genres on the iOS store were:
* Navigation (≈ 86,090.33 average installs)
* Reference (≈ 74,942.11 average installs)
* Social Networking (≈ 71,548.35 average installs)
* Music (≈ 57,326.53 average installs)
* Weather (≈ 52,279.89 average installs)

The top 5 on the Android store were:
* Communication (≈ 38,456,119.16 average installs)
* Video Players (≈ 24,727,872.45 average installs)
* Social (≈ 23,253,652.13 average installs)
* Photography (≈ 17,840,110.40 average installs)
* Productivity (≈ 16,787,331.34 average installs)

Limitations for some of the high-ranking categories include:
* Communication apps: Often dominated by a few major players (e.g., WhatsApp, Messenger, Telegram) with strong network effects, making it hard for new entrants to gain traction.
  
* Navigation apps: Require constant access to GPS and mapping services, often through third-party APIs like Google Maps, which can be costly at scale.
   
* Music and video player apps: Content relies on expensive and legally complex licensing deals.

* Weather apps: Typically have low user engagement, which limits revenue opportunities from in-app ads.

* Food & Drink apps: Successful apps in this category (e.g., Starbucks, McDonald's) rely on complex logistics and real-world services like food preparation and delivery.


Overall, the analysis suggests that an e-library app could be profitable in both the Google Play and App Store markets (the genre ranked 2nd in the App Store and 11th on Google Play). Given the abundance of library apps, a new app would need features beyond the raw text of a book to stand out. Potential enhancements include:
* tutorials and lore guides for popular games and media content
* daily quotes
* audio versions in different languages
* interactive quizzes
* a discussion forum where users can engage with the book's content and community

The book/reference genre avoids the high technical, legal, and operational barriers of the categories above and lends itself to educational content with scalable engagement features, which makes it a promising opportunity. It is well positioned to attract a broad user base across both platforms while aligning with the company’s lean, low-risk development strategy.