
## Unsupervised Machine Learning: App Market Segmentation

### Objective
This notebook applies unsupervised machine learning to group mobile apps into market segments based on intrinsic app characteristics (category, pricing, size, ratings, and content type), without using popularity labels.

The goal is to understand **how apps compare structurally**, not how popular they already are.


## Step 1: Data Loading and Initial Verification

In this step, we load the cleaned Google Play Store and Apple App Store datasets.
Since data cleaning and preprocessing were completed in a previous notebook,
we only perform basic sanity checks to confirm:

- The datasets load correctly
- The expected number of rows and columns is present
- The data structure matches what will be used for clustering

No data modification is performed in this step.
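
The three checks listed above can also be written as explicit, read-only assertions. The following is a minimal sketch rather than part of the original pipeline: the helper name `sanity_check` and the small `demo` frame are hypothetical, and the expected column list would need to match whatever the cleaning notebook actually produced.

```python
import pandas as pd


def sanity_check(df, expected_columns, min_rows=1):
    """Return a list of problems found; an empty list means the frame looks fine.

    The frame is only inspected, never modified.
    """
    problems = []
    missing = [c for c in expected_columns if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if len(df) < min_rows:
        problems.append(f"expected at least {min_rows} rows, found {len(df)}")
    return problems


# Tiny stand-in frame with made-up values, so the sketch runs without network access
demo = pd.DataFrame(
    {"App": ["a", "b"], "Category": ["TOOLS", "GAME"], "Price": [0.0, 1.99]}
)

print(sanity_check(demo, ["App", "Category", "Price"]))  # []
print(sanity_check(demo, ["App", "Size"], min_rows=5))   # reports two problems
```

In the notebook itself, the same helper could be called on `df_gp` and `df_as` right after loading, with the column names taken from the cleaned CSV files.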


In [3]:
# Core data manipulation libraries
import pandas as pd
import numpy as np

# Visualization libraries (used later for cluster analysis)
import matplotlib.pyplot as plt
import seaborn as sns

# Interactive visualization (used later for cluster exploration)
import plotly.express as px

In [4]:
# URLs for cleaned datasets stored in the project repository
gp_url = "https://raw.githubusercontent.com/Dhairyaxshah/Appfluence/main/data/google_play_cleaned.csv"
as_url = "https://raw.githubusercontent.com/Dhairyaxshah/Appfluence/main/data/apple_store_cleaned.csv"

# Load datasets into DataFrames
df_gp = pd.read_csv(gp_url)
df_as = pd.read_csv(as_url)

# Confirm dataset dimensions (rows, columns)
print("Google Play:", df_gp.shape)
print("App Store:", df_as.shape)

# Preview a few rows to verify successful loading
display(df_gp.head(3))
display(df_as.head(3))


Google Play: (8196, 10)
App Store: (7195, 10)


| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19.0 | 10000 | Free | 0.0 | Everyone | Art & Design |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7 | 5000000 | Free | 0.0 | Everyone | Art & Design |


| | track_name | price | rating_count_tot | rating_count_ver | user_rating | user_rating_ver | cont_rating | prime_genre | lang.num | size_mb |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAC-MAN Premium | 3.99 | 21292 | 26 | 4.0 | 4.5 | 4+ | Games | 10 | 96.119141 |
| 1 | Evernote - stay organized | 0.0 | 161065 | 26 | 4.0 | 3.5 | 4+ | Productivity | 23 | 151.232422 |
| 2 | WeatherBug - Local Weather, Radar, Maps, Alerts | 0.0 | 188583 | 2822 | 3.5 | 4.5 | 4+ | Weather | 3 | 95.867188 |


In [5]:
# Check missing values in Google Play dataset
print("Missing values (Google Play) - top 10:")
display(df_gp.isna().sum().sort_values(ascending=False).head(10))

# Check missing values in App Store dataset
print("\nMissing values (App Store) - top 10:")
display(df_as.isna().sum().sort_values(ascending=False).head(10))

Missing values (Google Play) - top 10:


| Column | Missing values |
|---|---|
| App | 0 |
| Category | 0 |
| Rating | 0 |
| Reviews | 0 |
| Size | 0 |
| Installs | 0 |
| Type | 0 |
| Price | 0 |
| Content Rating | 0 |
| Genres | 0 |



Missing values (App Store) - top 10:


| Column | Missing values |
|---|---|
| track_name | 0 |
| price | 0 |
| rating_count_tot | 0 |
| rating_count_ver | 0 |
| user_rating | 0 |
| user_rating_ver | 0 |
| cont_rating | 0 |
| prime_genre | 0 |
| lang.num | 0 |
| size_mb | 0 |


In [6]:
# Inspect available columns to guide feature selection for clustering
# This helps identify intrinsic app attributes and exclude popularity labels

print("Google Play columns:")
print(list(df_gp.columns))

print("\nApp Store columns:")
print(list(df_as.columns))


Google Play columns:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres']

App Store columns:
['track_name', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'cont_rating', 'prime_genre', 'lang.num', 'size_mb']
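
Column names alone do not show which attributes are numeric and which are categorical, and that distinction matters when choosing clustering features. The sketch below shows one way to split a frame by dtype with `select_dtypes`. The helper name `split_by_dtype` and the `demo` frame are hypothetical, and the approach assumes the cleaned CSVs store numbers as numeric dtypes rather than as strings.

```python
import pandas as pd


def split_by_dtype(df):
    """Return (numeric_columns, non_numeric_columns), preserving column order."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return numeric, categorical


# Stand-in frame with made-up values that mimics a few Google Play columns
demo = pd.DataFrame(
    {
        "Category": ["TOOLS", "GAME"],
        "Price": [0.0, 1.99],
        "Size": [19.0, 8.7],
        "Type": ["Free", "Paid"],
    }
)

numeric_cols, categorical_cols = split_by_dtype(demo)
print(numeric_cols)      # ['Price', 'Size']
print(categorical_cols)  # ['Category', 'Type']
```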


## Step 2: Feature Selection for Clustering

In this step, we select intrinsic app characteristics to be used for
unsupervised learning.

Only structural features describing an app’s category, pricing, size,
rating, and content type (plus language support on the App Store) are
included. Popularity-related variables (e.g., installs, review and rating
counts, popularity labels) are intentionally excluded so that the clusters
do not simply reproduce popularity and leak it into later comparisons.

Feature selection is performed separately for:
- Google Play Store
- Apple App Store

This ensures that clustering reflects how apps compare structurally
within each marketplace.
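
The exclusion rule can be enforced in code as well as by convention. The following is a minimal sketch under stated assumptions: the set `POPULARITY_COLS` is our own grouping of the count-style columns seen in the column lists, and `select_intrinsic` is a hypothetical helper, not a function from the project.

```python
import pandas as pd

# Assumption: these columns measure popularity or engagement rather than app structure
POPULARITY_COLS = {"Installs", "Reviews", "rating_count_tot", "rating_count_ver"}


def select_intrinsic(df, features, popularity_cols=POPULARITY_COLS):
    """Return a copy of df[features], refusing any popularity column."""
    leaked = sorted(set(features) & set(popularity_cols))
    if leaked:
        raise ValueError(f"popularity columns in feature list: {leaked}")
    return df[features].copy()


# Stand-in frame with made-up values
demo = pd.DataFrame({"Category": ["TOOLS"], "Price": [0.0], "Installs": [10000]})

X_demo = select_intrinsic(demo, ["Category", "Price"])
print(list(X_demo.columns))  # ['Category', 'Price']
```

Failing with an error, rather than silently dropping the column, makes an accidental inclusion visible as soon as the feature list is edited.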


In [7]:
# -----------------------------
# Google Play: Feature Selection
# -----------------------------

# Select intrinsic app characteristics only
# Exclude popularity indicators such as Installs
gp_features = [
    'Category',        # App category (market positioning)
    'Type',            # Free or Paid
    'Price',           # Monetization strategy
    'Size',            # App size (resource footprint)
    'Rating',          # User quality perception
    'Content Rating'   # Target age group
]

# Take an explicit copy so later transformations never modify df_gp
X_gp_cluster = df_gp[gp_features].copy()

# Preview selected features
display(X_gp_cluster.head())


| | Category | Type | Price | Size | Rating | Content Rating |
|---|---|---|---|---|---|---|
| 0 | ART_AND_DESIGN | Free | 0.0 | 19.0 | 4.1 | Everyone |
| 1 | ART_AND_DESIGN | Free | 0.0 | 14.0 | 3.9 | Everyone |
| 2 | ART_AND_DESIGN | Free | 0.0 | 8.7 | 4.7 | Everyone |
| 3 | ART_AND_DESIGN | Free | 0.0 | 25.0 | 4.5 | Teen |
| 4 | ART_AND_DESIGN | Free | 0.0 | 2.8 | 4.3 | Everyone |


In [8]:
# -----------------------------
# App Store: Feature Selection
# -----------------------------

as_features = [
    'prime_genre',     # Market category
    'price',           # Monetization
    'size_mb',         # App size
    'user_rating',     # User quality perception
    'cont_rating',     # Age suitability
    'lang.num'         # Language support breadth
]

# Take an explicit copy so later transformations never modify df_as
X_as_cluster = df_as[as_features].copy()

# Preview selected features
display(X_as_cluster.head())


| | prime_genre | price | size_mb | user_rating | cont_rating | lang.num |
|---|---|---|---|---|---|---|
| 0 | Games | 3.99 | 96.119141 | 4.0 | 4+ | 10 |
| 1 | Productivity | 0.0 | 151.232422 | 4.0 | 4+ | 23 |
| 2 | Weather | 0.0 | 95.867188 | 3.5 | 4+ | 3 |
| 3 | Shopping | 0.0 | 122.558594 | 4.0 | 12+ | 9 |
| 4 | Reference | 0.0 | 88.476562 | 4.5 | 4+ | 45 |


✔ Feature selection completed.

Only intrinsic app attributes are retained.
No popularity labels or engagement-based variables (installs, review
counts, rating counts) are used, so the resulting clusters describe
app structure rather than existing popularity.
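
The selected features mix text categories with numbers on very different scales, while distance-based clustering methods generally need numeric, comparably scaled inputs. How this notebook handles that is not shown in this section, so the following is only a sketch of one common approach: one-hot encoding for categories and z-scoring for numbers. The helper `encode_for_clustering` and the `demo` frame are hypothetical.

```python
import pandas as pd


def encode_for_clustering(X):
    """One-hot encode non-numeric columns and z-score numeric ones."""
    numeric = X.select_dtypes(include="number")
    categorical = X.select_dtypes(exclude="number")

    # Population standard deviation; constant columns get std 1 to avoid division by zero
    std = numeric.std(ddof=0).replace(0, 1.0)
    scaled = (numeric - numeric.mean()) / std

    if categorical.shape[1] == 0:
        return scaled

    dummies = pd.get_dummies(categorical, dtype=float)
    return pd.concat([scaled, dummies], axis=1)


# Stand-in frame with made-up values
demo = pd.DataFrame(
    {
        "Type": ["Free", "Paid", "Free"],
        "Price": [0.0, 2.0, 1.0],
        "Rating": [4.0, 4.0, 4.0],
    }
)

encoded = encode_for_clustering(demo)
print(list(encoded.columns))  # ['Price', 'Rating', 'Type_Free', 'Type_Paid']
```

One-hot encoding a high-cardinality column such as `Category` or `prime_genre` adds one dimension per category, which can dominate the distance computation. Whether to encode, group, or weight such columns is a design choice this sketch does not settle.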
