# CIS 9660 – Project #2, Question #3
**Unsupervised Learning: Customer Segmentation + Association Rule Mining**

**Dataset:** TidyTuesday — Hotels (bookings) (public CSV)

**Pipeline:**
1. Load public dataset directly from URL
2. Clean & select features
3. **K-Means** clustering with **Elbow** & **Silhouette** to choose K
4. **Gaussian Mixture** (GMM) as alternative clustering with **BIC**
5. **PCA (2D)** visualization of K-Means clusters
6. **Apriori** association rules over categorical booking attributes
7. Save key outputs as CSV (segment profiles, PCA coordinates, association rules)


## 0) Setup (Installs for Colab)

In [None]:

# If running in Google Colab, run this cell once
!pip -q install mlxtend plotly scikit-learn pandas matplotlib seaborn


## 1) Imports & Config

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

from mlxtend.frequent_patterns import apriori, association_rules

sns.set_theme(style="whitegrid", font_scale=1.1)
RANDOM_STATE = 42
plt.rcParams['figure.figsize'] = (8,5)


## 2) Load Dataset (TidyTuesday Hotels CSV)

In [None]:

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv"
df = pd.read_csv(url)
print("Raw shape:", df.shape)
df.head()


## 3) Basic Cleaning & Feature Selection

In [None]:

# Numeric booking features used for clustering
num_cols = [
    'lead_time', 'arrival_date_week_number', 'arrival_date_day_of_month',
    'stays_in_weekend_nights', 'stays_in_week_nights',
    'adults', 'children', 'babies', 'previous_cancellations',
    'booking_changes', 'days_in_waiting_list', 'adr', 'total_of_special_requests'
]

# 'children' may be stored as float with NaNs; fill those with 0 first so that
# only rows missing other fields are dropped below
if 'children' in df.columns:
    df['children'] = df['children'].fillna(0)

# Drop rows with *any* NaNs across selected numeric columns
df = df.dropna(subset=num_cols)

# Keep a clean numeric matrix
X = df[num_cols].copy()

print("Shape after cleaning:", df.shape)
df[num_cols].describe().T


## 4) Standardize Features

In [None]:

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled[:3]
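
To make the standardization step concrete, here is a minimal sketch of the same transformation written out by hand with NumPy. The toy values and the names `X_toy`, `mu`, `sigma`, and `Z` are illustrative assumptions, not part of the notebook. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

# Toy feature matrix (hypothetical values): columns stand in for lead_time and adr
X_toy = np.array([
    [10.0, 80.0],
    [50.0, 120.0],
    [200.0, 95.0],
    [5.0, 150.0],
])

# Column-wise z-score: subtract the column mean, divide by the population std (ddof=0)
mu = X_toy.mean(axis=0)
sigma = X_toy.std(axis=0)
Z = (X_toy - mu) / sigma

print("Column means after scaling:", Z.mean(axis=0).round(6))
print("Column stds after scaling: ", Z.std(axis=0).round(6))
```

After scaling, every feature contributes on a comparable scale to the Euclidean distances that K-Means relies on, so a wide-ranging column such as `lead_time` no longer dominates.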


## 5) Choose K: Elbow (Inertia) & Silhouette

In [None]:

K_range = range(2, 11)

# Elbow
inertias = []
for k in K_range:
    km = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init='auto')
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.figure()
plt.plot(list(K_range), inertias, marker='o')
plt.title("Elbow Method (K-Means)")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia (Within-Cluster Sum of Squares)")
plt.grid(True)
plt.show()

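# Silhouette (scored on a random subsample, since the full computation
# scales quadratically with the number of rows)
sil_scores = []
for k in K_range:
    km = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init='auto')
    labels = km.fit_predict(X_scaled)
    sil_scores.append(
        silhouette_score(X_scaled, labels, sample_size=10000, random_state=RANDOM_STATE)
    )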

plt.figure()
plt.plot(list(K_range), sil_scores, marker='o')
plt.title("Silhouette Score vs K (K-Means)")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Silhouette Score")
plt.grid(True)
plt.show()

best_k = list(K_range)[int(np.argmax(sil_scores))]
print("Best K by silhouette:", best_k)
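
As a sanity check of the selection logic, the same silhouette-based choice can be tried on synthetic data where the number of groups is known in advance. This is only a sketch: the blob centers, the sample count, and the names `X_demo`, `scores`, and `best_k_demo` are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (hypothetical, for illustration only)
centers = [(0, 0), (10, 10), (-10, 10)]
X_demo, _ = make_blobs(n_samples=600, centers=centers, cluster_std=0.5, random_state=42)

# Score each candidate K by its mean silhouette coefficient
scores = {}
for k in range(2, 7):
    labels_demo = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels_demo)

# Pick the K with the highest silhouette, mirroring the argmax used in the notebook
best_k_demo = max(scores, key=scores.get)
print("Silhouette by K:", {k: round(v, 3) for k, v in scores.items()})
print("Best K on the synthetic data:", best_k_demo)
```

On real booking data the curve is usually much flatter than on blobs like these, so it is worth reading the silhouette plot alongside the elbow plot rather than relying on the argmax alone.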


## 6) Fit K-Means & Profile Segments

In [None]:

kmeans = KMeans(n_clusters=best_k, random_state=RANDOM_STATE, n_init='auto')
df['kmeans_cluster'] = kmeans.fit_predict(X_scaled)

profile_km = df.groupby('kmeans_cluster')[num_cols].mean().round(2)
print("K-Means Segment Profile (means):")
display(profile_km)

# Save profile for appendix
profile_km.to_csv('kmeans_segment_profile.csv')
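
Cluster means are easier to interpret next to cluster sizes, since a striking profile may describe only a handful of bookings. The sketch below shows one possible way to add counts and shares to a profile table. The toy DataFrame and the names `toy` and `profile_toy` are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Hypothetical toy bookings with an already-assigned cluster label
toy = pd.DataFrame({
    "kmeans_cluster": [0, 0, 0, 1, 1, 2],
    "lead_time":      [10, 20, 30, 200, 220, 5],
    "adr":            [90.0, 110.0, 100.0, 60.0, 70.0, 180.0],
})

# Per-cluster booking count and feature means via named aggregation
profile_toy = toy.groupby("kmeans_cluster").agg(
    n_bookings=("lead_time", "size"),
    lead_time_mean=("lead_time", "mean"),
    adr_mean=("adr", "mean"),
)

# Share of all bookings that falls into each cluster
profile_toy["share"] = (profile_toy["n_bookings"] / len(toy)).round(3)
print(profile_toy)
```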


## 7) Alternative Clustering: Gaussian Mixture (BIC Selection)

In [None]:

bics = []
for k in K_range:
    gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=RANDOM_STATE)
    gmm.fit(X_scaled)
    bics.append(gmm.bic(X_scaled))

plt.figure()
plt.plot(list(K_range), bics, marker='o')
plt.title("GMM Model Selection (BIC)")
plt.xlabel("Number of components")
plt.ylabel("BIC (lower is better)")
plt.grid(True)
plt.show()

best_gmm_k = list(K_range)[int(np.argmin(bics))]
print("Best GMM components by BIC:", best_gmm_k)

gmm = GaussianMixture(n_components=best_gmm_k, covariance_type='full', random_state=RANDOM_STATE)
df['gmm_cluster'] = gmm.fit_predict(X_scaled)

profile_gmm = df.groupby('gmm_cluster')[num_cols].mean().round(2)
print("GMM Segment Profile (means):")
display(profile_gmm)

# Save profile for appendix
profile_gmm.to_csv('gmm_segment_profile.csv')
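
One practical difference from K-Means is that a Gaussian Mixture gives each point a membership probability per component rather than only a hard label. The sketch below illustrates this on synthetic data. The two overlapping groups, the 0.9 cutoff, and the names `gmm_demo`, `proba`, and `ambiguous` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Two synthetic 2-D groups (hypothetical data) that partially overlap
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(200, 2)),
])

gmm_demo = GaussianMixture(n_components=2, covariance_type="full", random_state=42)
gmm_demo.fit(X_demo)

# Each row holds the posterior probability of belonging to each component
proba = gmm_demo.predict_proba(X_demo)

# Flag points whose strongest membership is below an arbitrary 0.9 cutoff
ambiguous = proba.max(axis=1) < 0.9
print("Share of ambiguous points:", round(float(ambiguous.mean()), 3))
```

In a segmentation setting, such ambiguous bookings could be reported separately instead of being forced into a single persona.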


## 8) PCA (2D) Visualization of K-Means Clusters

In [None]:

pca = PCA(n_components=2, random_state=RANDOM_STATE)
X_pca = pca.fit_transform(X_scaled)

df['pca1'] = X_pca[:, 0]
df['pca2'] = X_pca[:, 1]

plt.figure(figsize=(8,6))
sns.scatterplot(
    x='pca1', y='pca2',
    hue='kmeans_cluster',
    palette='viridis',
    data=df, alpha=0.7, s=30, edgecolor=None
)
plt.title('Customer Segmentation (K-Means) in PCA space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Cluster', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

# Save PCA coordinates (optional)
df[['pca1','pca2','kmeans_cluster']].to_csv('pca_kmeans_points.csv', index=False)
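
A 2D PCA plot shows only part of the variation in the standardized features, so it helps to report how much variance the two components retain. The sketch below uses synthetic data to show how to read `explained_variance_ratio_`. The data and the names `X_demo`, `pca_demo`, and `ratios` are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical data: 5 features, with the second made strongly correlated with the first
X_demo = rng.normal(size=(500, 5))
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=500)

X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2, random_state=42).fit(X_demo_scaled)

# Fraction of total variance captured by each retained component
ratios = pca_demo.explained_variance_ratio_
print("Variance explained per component:", ratios.round(3))
print("Total variance kept in 2D:", round(float(ratios.sum()), 3))
```

If the two components keep only a small share of the variance, clusters that overlap in the scatter plot may still be well separated in the full feature space.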


## 9) Market Basket / Association Rules (Apriori)

In [None]:

# We'll use categorical booking attributes as 'items' for one-hot transactions
cat_cols = ['hotel', 'meal', 'market_segment', 'distribution_channel', 'reserved_room_type']

# Drop rows with missing values in these categorical columns (rare), then one-hot encode.
# Boolean columns are the input type mlxtend's apriori expects.
df_rules = df.dropna(subset=cat_cols).copy()
basket = pd.get_dummies(df_rules[cat_cols], drop_first=False).astype(bool)

# Apriori frequent itemsets & association rules
freq_itemsets = apriori(basket, min_support=0.05, use_colnames=True)
rules = association_rules(freq_itemsets, metric='lift', min_threshold=1.0)

# Sort by lift and select key columns
rules = rules.sort_values('lift', ascending=False)
rules_view = rules[['antecedents','consequents','support','confidence','lift','leverage','conviction']].head(25)
print("Top association rules (by lift):")
display(rules_view)

# Save to CSV for appendix
rules.to_csv('association_rules_full.csv', index=False)
rules_view.to_csv('association_rules_top25.csv', index=False)
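
To clarify what the `support`, `confidence`, and `lift` columns mean, the sketch below computes them by hand for a single rule on a tiny one-hot table. The eight toy bookings are hypothetical values invented for illustration; only the column naming follows the `pd.get_dummies` pattern used above.

```python
import pandas as pd

# Hypothetical one-hot transactions: each row is a booking, each column an attribute value
toy_basket = pd.DataFrame({
    "hotel_Resort Hotel": [1, 1, 1, 0, 0, 0, 1, 0],
    "meal_HB":            [1, 1, 0, 0, 0, 0, 1, 0],
}).astype(bool)

A = toy_basket["hotel_Resort Hotel"]  # antecedent
B = toy_basket["meal_HB"]             # consequent

support_a = A.mean()                  # P(A): share of bookings containing A
support_b = B.mean()                  # P(B)
support_ab = (A & B).mean()           # P(A and B)
confidence = support_ab / support_a   # P(B | A)
lift = confidence / support_b         # above 1: A and B co-occur more often than chance

print(f"support={support_ab:.3f}, confidence={confidence:.3f}, lift={lift:.3f}")
```

Because each one-hot column comes from a categorical field with exactly one value per booking, two values of the same field can never appear together in an itemset. The interesting rules are therefore always across different fields.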


## 10) Business Intelligence Notes (Copy into your 1-page report)

In [None]:

print("""
Suggested talking points:

- **K selection**: Used Elbow + Silhouette (K-Means) and BIC (GMM). Report the chosen K and the rationale.
- **Segment personas**: Summarize notable differences by cluster (e.g., high lead_time + low adr vs. short stays with many special requests).
- **Actions**: Tailor offers per cluster (advance bookers vs. last-minute bookers), upsell packages for long weekday stays, etc.
- **Association rules**: High-lift pairs of booking attributes (e.g., a specific meal plan + distribution channel) co-occur more often than chance and can be targeted with promos.
- **Limitations**: No price per person or room inventory; adr (average daily rate) may be skewed; consider seasonality and event calendars.
- **Future work**: Add time-based features (season/month), incorporate review sentiment, test alternative clustering (DBSCAN/Hierarchical).
""")