# Project Overview: Profitable App Profiles for the App Store and Google Play Markets
This project aims to assist app developers in identifying profitable app profiles by analysing mobile application data from the **Apple App Store** and **Google Play Market**. The focus is limited to **free apps**, as this business model dominates both stores. This project uses publicly available datasets and applies data cleaning, filtering, and exploratory data analysis (EDA) to extract actionable insights.

## 1. Dataset Exploration

The datasets analyzed in this project are:
- `AppleStore.csv`: Contains 7,197 iOS apps (7,198 rows including the header) with 16 columns, including app name, price, size, rating counts, and genre.
- `googleplaystore.csv`: Contains 10,000+ Android apps with 13 columns, including app name, category, rating, reviews, installs, type and genre.

`AppleStore.csv`

---
| Column name        | Description                                     |
| :----------------: | :---------------------------------------------: |
| `id`               | App ID                                          |
| `track_name`       | App Name                                        |
| `size_bytes`       | Size (in Bytes)                                 |
| `currency`         | Currency Type                                   |
| `price`            | Price amount                                    |
| `rating_count_tot` | User Rating counts (for all versions)           |
| `rating_count_ver` | User Rating counts (for current version)        |
| `user_rating`      | Average User Rating value (for all versions)    |
| `user_rating_ver`  | Average User Rating value (for current version) |
| `ver`              | Latest version code                             |
| `cont_rating`      | Content Rating                                  |
| `prime_genre`      | Primary Genre                                   |
| `sup_devices.num`  | Number of supporting devices                    |
| `ipadSc_urls.num`  | Number of screenshots shown for display         |
| `lang.num`         | Number of supported languages                   |
| `vpp_lic`          | VPP Device Based Licensing Enabled              |

Dataset source: https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps

`googleplaystore.csv`

---
| Column name      | Description                                                  |
| :--------------: | :----------------------------------------------------------: |
| `App`            | Application name                                             |
| `Category`       | Category the app belongs to                                  |
| `Rating`         | Overall user rating of the app                               |
| `Reviews`        | Number of user reviews for the app                           |
| `Size`           | Size of the app                                              |
| `Installs`       | Number of user downloads/installs for the app                |
| `Type`           | Paid or Free                                                 |
| `Price`          | Price of the app                                             |
| `Content Rating` | Age group the app is targeted at - Children/Mature 21+/Adult |
| `Genres`         | An app can belong to multiple genres                         |
| `Last Updated`   | Date when the app was last updated on Play Store             |
| `Current Ver`    | Current version of the app available on Play Store           |
| `Android Ver`    | Min required Android version                                 |

Dataset source: https://www.kaggle.com/datasets/lava18/google-play-store-apps

A custom function `explore_data()` is defined to inspect slices of each dataset and verify its structure and content. The header and a few sample rows are printed to understand the format.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
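from csv import reader

# Read each CSV into a list of rows; the first row of each list is the header.
# encoding='utf8' is set explicitly because app names contain non-ASCII characters.
with open('AppleStore.csv', encoding='utf8') as opened_file:
    apple_store_data = list(reader(opened_file))
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    google_play_store_data = list(reader(opened_file))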
explore_data(apple_store_data, 0, 5, True)
explore_data(google_play_store_data, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Ed

## 2. Data Cleaning

Before conducting any analysis, the data is cleaned to ensure reliability.

### 2.1. Removing Erroneous Entries
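- A known problematic row in `googleplaystore.csv` (the app "Life Made WI-Fi Touchscreen Photo Frame", at index 10473 of the loaded list) is removed: its `Category` value is missing, so every later value is shifted one column to the left and the row has 12 fields instead of 13.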
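
Rows like this one can also be located programmatically instead of by a known index. The following is a minimal sketch, assuming the first row of the list is the header; `find_misaligned_rows` and `sample` are illustrative names that do not appear in the notebook, and the sample rows are made up.

```python
def find_misaligned_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])  # assumption: the first row is the header
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Tiny made-up sample (not the real file): the last row lacks its category.
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.2'],
    ['Broken App', '1.9'],
]
print(find_misaligned_rows(sample))
```

Calling `find_misaligned_rows(google_play_store_data)` before the deletion would be one way to confirm which index to remove.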

### 2.2. Handling Duplicates
- Duplicate entries in the Google Play dataset are identified using app names.
- A dictionary-based method is used to retain only the row with the highest number of reviews for each duplicate app, under the assumption that more reviews imply more up-to-date or complete data.
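
The "keep the most-reviewed row" rule can be sketched in a single pass. This is only an illustration under stated assumptions: `keep_most_reviewed` and `sample_rows` are hypothetical names, the rows are shortened to four columns, and the `Box` values are invented for the example.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row seen with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' means that on a tie the earlier row is kept.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Shortened illustrative rows: name, category, rating, reviews.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
for row in keep_most_reviewed(sample_rows):
    print(row)
```

Because a dictionary keeps the position of a key's first insertion, the result is ordered by each app's first appearance, which may differ from the order produced by a two-pass approach.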

In [3]:
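explore_data(google_play_store_data, 10473, 10474)  # the misaligned row (index counts the header)
del google_play_store_data[10473]  # run this cell only once, or a valid row is deleted
explore_data(google_play_store_data, 10473, 10474)  # the row that now occupies that index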

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




In [4]:
duplicate_apps = []
unique_apps = []

for app in google_play_store_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [5]:
for app in google_play_store_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [6]:
reviews_max = {}
for app in google_play_store_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
print(list(reviews_max.items())[:15])
len(reviews_max)

[('Photo Editor & Candy Camera & Grid & ScrapBook', 159.0), ('Coloring book moana', 974.0), ('U Launcher Lite – FREE Live Cool Themes, Hide Apps', 87510.0), ('Sketch - Draw & Paint', 215644.0), ('Pixel Draw - Number Art Coloring Book', 967.0), ('Paper flowers instructions', 167.0), ('Smoke Effect Photo Maker - Smoke Editor', 178.0), ('Infinite Painter', 36815.0), ('Garden Coloring Book', 13791.0), ('Kids Paint Free - Drawing Fun', 121.0), ('Text on Photo - Fonteee', 13880.0), ('Name Art Photo Editor - Focus n Filters', 8788.0), ('Tattoo Name On My Photo Editor', 44829.0), ('Mandala Coloring Book', 4326.0), ('3D Color Pixel by Number - Sandbox Art Coloring', 1518.0)]


9659

In [7]:
android_clean = []
already_added = []
for app in google_play_store_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
print(android_clean[:15])
len(android_clean)

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', '

9659

## 3. Filtering for Analysis

### 3.1. English-Language Apps Only

To focus on a global market and reduce noise from non-English listings:
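- A filtering function flags characters outside the ASCII range, i.e. those whose Unicode code point (`ord()`) is greater than 127.
- Apps whose names contain more than three such characters are removed; the tolerance keeps English names that include a few symbols or emoji (e.g. `™`, `😜`).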

### 3.2. Free Apps Only

Since the objective is to recommend profitable strategies for **free apps**, paid apps are excluded using the `price` column in the iOS data and the `Price` column in the Google Play data.
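
A free-app check that compares exact strings (`'0'` or `'0.0'`) depends on how each store writes a zero price. A slightly more tolerant check parses the price as a number instead. This is a sketch, not the notebook's method: `is_free` is a hypothetical helper, and the `$` prefix on paid prices is an assumption about the data format.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip('$')  # assumption: paid prices may start with '$'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable values (for example a header cell) are treated as not free.
        return False


for text in ['0', '0.0', '$4.99', 'price']:
    print(repr(text), is_free(text))
```

With such a helper, a filter like `[app for app in android_english_names if is_free(app[7])]` would work unchanged for both stores, given the right column index.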

In [8]:
def is_name_English(text):
    for element in text:
        if ord(element) > 127:
            return False
    return True

print(is_name_English('Instagram'))
print(is_name_English('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_name_English('Docs To Go™ Free Office Suite'))
print(is_name_English('Instachat 😜'))

True
False
False
False


In [9]:
def is_name_English(text):
    count = 0
    for element in text:
        if ord(element) > 127:  # ord() is never negative, so only the upper bound matters
            count += 1
    if count > 3:
        return False
    return True
print(is_name_English('Instagram'))
print(is_name_English('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_name_English('Docs To Go™ Free Office Suite'))
print(is_name_English('Instachat 😜'))

True
False
True
True


In [10]:
android_english_names = []
ios_english_names = []
for app in android_clean:
    if is_name_English(app[0]):
        android_english_names.append(app)
for app in apple_store_data[1:]:  # skip the header row
    if is_name_English(app[1]):
        ios_english_names.append(app)
print(android_english_names[:15])
print(ios_english_names[:15])

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', '

In [11]:
android_free_apps = []
ios_free_apps = []
for app in android_english_names:
    if app[7] == '0':
        android_free_apps.append(app)
for app in ios_english_names:
    if app[4] == '0.0':
        ios_free_apps.append(app)
print(android_free_apps[:5])
print(ios_free_apps[:5])

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '

## 4. Frequency Analysis

To identify the most common types of apps, **frequency tables** are generated using helper functions:
- `freq_table()`: Computes frequency percentages for a given column (e.g., `prime_genre` in iOS or `Category` in Google Play).
- `display_table()`: Sorts and prints the frequency table in descending order of popularity.
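
The same two steps can be expressed compactly with `collections.Counter` from the standard library. This is an alternative sketch rather than the notebook's implementation; `freq_table_pct`, `sorted_table`, and the sample rows are illustrative names and data.

```python
from collections import Counter


def freq_table_pct(dataset, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


def sorted_table(dataset, index):
    """Return (value, percentage) pairs, most frequent first."""
    table = freq_table_pct(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Made-up rows: name, genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for value, pct in sorted_table(sample, 1):
    print(value, ':', pct)
```

Sorting on the percentage alone (via `key=`) avoids comparing the labels themselves, at the cost of leaving ties in insertion order.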

Key findings:
- In the App Store, the most common genre is **Games** (~58%), followed by **Entertainment** and **Photo & Video**.
- In Google Play, the most common category is **Family** followed by **Game** and **Tools**.

In [17]:
def freq_table(dataset, index):
    freq_dictionary = {}
    total = 0
    for app in dataset:
        total += 1
        value = app[index]
        if value in freq_dictionary:
            freq_dictionary[value] += 1
        else:
            freq_dictionary[value] = 1
    freq_dictionary_percentage = {}
    for key in freq_dictionary:
        freq_dictionary_percentage[key] = (freq_dictionary[key] / total) * 100
    return freq_dictionary_percentage

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

display_table(ios_free_apps, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In [19]:
display_table(android_free_apps, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [21]:
display_table(android_free_apps, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

## 5. Profitability Insights

The project assesses **user engagement** as a proxy for profitability:

### 5.1. iOS (App Store)

- The average number of user ratings per app (`rating_count_tot`) is computed for each genre.
- Although Games is the most common genre, genres like **Social Networking**, **Music**, and **Reference** have more ratings per app on average (roughly 72,000, 57,000, and 75,000 versus about 23,000 for Games), suggesting higher engagement and monetization potential.
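
A per-genre average can be computed in one pass over the rows by keeping running totals and counts. The sketch below assumes a simplified row layout (name, genre, rating count) built from the four sample apps shown earlier; `mean_by_group` and `sample_rows` are hypothetical names.

```python
from collections import defaultdict


def mean_by_group(rows, group_idx, value_idx):
    """Return {group: mean of the numeric column}, computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Simplified layout for illustration: name, genre, rating_count_tot.
sample_rows = [
    ['Facebook', 'Social Networking', '2974676'],
    ['Instagram', 'Photo & Video', '2161558'],
    ['Clash of Clans', 'Games', '2130805'],
    ['Temple Run', 'Games', '1724546'],
]
for genre, avg in mean_by_group(sample_rows, 1, 2).items():
    print(genre, avg)
```

On the full data the equivalent call would be `mean_by_group(ios_free_apps, 11, 5)`, using the column indexes of `prime_genre` and `rating_count_tot`.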

### 5.2. Android (Google Play)

- Install counts (parsed from the `Installs` column) are used to estimate popularity.
- Categories like **Communication**, **Video Players**, **Productivity**, and **Health & Fitness** show strong user engagement.
- However, some categories (e.g., Communication) are skewed by a few very large apps (e.g., WhatsApp, Facebook), so their averages should be interpreted with care.
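
One way to see how much a single very large app pulls the average is to compare the mean with the median. The values below are invented for illustration and are not taken from the dataset; `parse_installs` is a hypothetical helper that treats `'1,000+'` as its lower bound of 1000.

```python
from statistics import mean, median


def parse_installs(text):
    """Turn a string such as '1,000,000+' into the integer 1000000."""
    return int(text.replace('+', '').replace(',', ''))


# Invented values: one giant app among much smaller ones.
installs = ['1,000,000,000+', '10,000+', '50,000+', '100,000+', '5,000+']
values = [parse_installs(s) for s in installs]
print('mean:  ', mean(values))
print('median:', median(values))
```

When the mean is several orders of magnitude above the median, as in this toy sample, the category average mostly reflects its largest app rather than a typical one.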
 

In [23]:
genres_ios = freq_table(ios_free_apps, 11)
for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_free_apps:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
    print(str(genre) + ' ' + str(total / len_genre))

Social Networking 71548.34905660378
Photo & Video 28441.54375
Games 22788.6696905016
Music 57326.530303030304
Reference 74942.11111111111
Health & Fitness 23298.015384615384
Weather 52279.892857142855
Utilities 18684.456790123455
Travel 28243.8
Shopping 26919.690476190477
News 21248.023255813954
Navigation 86090.33333333333
Lifestyle 16485.764705882353
Entertainment 14029.830708661417
Food & Drink 33333.92307692308
Sports 23008.898550724636
Book 39758.5
Finance 31467.944444444445
Education 7003.983050847458
Productivity 21028.410714285714
Business 7491.117647058823
Catalogs 4004.0
Medical 612.0


In [27]:
category_android = freq_table(android_free_apps, 1)
for category in category_android:
    total = 0
    len_category = 0
    for app in android_free_apps:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category +=1
    print(category + ' ' + str(total / len_category))

ART_AND_DESIGN 1986335.0877192982
AUTO_AND_VEHICLES 647317.8170731707
BEAUTY 513151.88679245283
BOOKS_AND_REFERENCE 8767811.894736841
BUSINESS 1712290.1474201474
COMICS 817657.2727272727
COMMUNICATION 38456119.167247385
DATING 854028.8303030303
EDUCATION 1833495.145631068
ENTERTAINMENT 11640705.88235294
EVENTS 253542.22222222222
FINANCE 1387692.475609756
FOOD_AND_DRINK 1924897.7363636363
HEALTH_AND_FITNESS 4188821.9853479853
HOUSE_AND_HOME 1331540.5616438356
LIBRARIES_AND_DEMO 638503.734939759
LIFESTYLE 1437816.2687861272
GAME 15588015.603248259
FAMILY 3695641.8198090694
MEDICAL 120550.61980830671
SOCIAL 23253652.127118643
SHOPPING 7036877.311557789
PHOTOGRAPHY 17840110.40229885
SPORTS 3638640.1428571427
TRAVEL_AND_LOCAL 13984077.710144928
TOOLS 10801391.298666667
PERSONALIZATION 5201482.6122448975
PRODUCTIVITY 16787331.344927534
PARENTING 542603.6206896552
WEATHER 5074486.197183099
VIDEO_PLAYERS 24727872.452830188
NEWS_AND_MAGAZINES 9549178.467741935
MAPS_AND_NAVIGATION 4056941.774193

## 6. Conclusion

The project concludes that:

- The iOS App Store is saturated with games and entertainment apps, but **niches like Education or Productivity** may offer better returns.
- On Google Play, while apps in the Communication and Social categories dominate in installs, **Health & Fitness, Education, and Productivity** are high-potential areas with less saturation.

Developers aiming to launch free apps with potential for high user engagement should consider these niche markets rather than competing in overcrowded categories.