Paris case study

In this notebook, we focus **exclusively on Paris Airbnb listings**. We apply, compare, and interpret the core methods covered in the course:
1. **Exploratory analysis and dimensionality reduction:** PCA (course content) and UMAP (extra method).
2. **Baseline classification models:** LDA and CART (Decision Trees).
3. **Ensemble methods:** Bootstrap, Bagging, and Random Forest.

In [None]:
from pathlib import Path
import warnings
warnings.filterwarnings('ignore') # keep the notebook clean without red alerts

# Data manipulation
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay

from sklearn.utils import resample

from sklearn.decomposition import PCA
import umap.umap_ as umap  # provided by the 'umap-learn' package, not the unrelated 'umap' package

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier


sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
pd.set_option('display.max_columns', None)

Data loading and train/test split

In [2]:
DATA_PATH = "data/paris_dataset_final_ready.csv.gz" 

df_paris = pd.read_csv(DATA_PATH, compression='gzip', low_memory=False)
print(f"{df_paris.shape[1]} columns.\n")

y = df_paris['target_class']
cols_to_drop = ['target_class', 'price_clean', 'city_label']
X = df_paris.drop(columns=[col for col in cols_to_drop if col in df_paris.columns])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    stratify=y, 
    random_state=123
)   # stratify to preserve class distribution in both train and test sets

print(f"\nFeatures (X) : {X.shape[1]} columns")
print(f"Target (y) : {len(y)} values")
print("\nClass distribution:")
print(y.value_counts().sort_index())
print(f"Train dimensions: X={X_train.shape}, y={y_train.shape}")
print(f"Test dimensions:  X={X_test.shape}, y={y_test.shape}\n")

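# STANDARDIZATION
# The scaler is fit on the training set only, then applied to the test set,
# so no test-set statistics leak into training.
# Trees will use the raw X_train: splits do not depend on feature scale,
# and thresholds stay in the original units, which keeps them interpretable.
# PCA and UMAP will use X_train_scaled, since both are sensitive to feature scale.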
scaler = StandardScaler()

X_train_scaled_array = scaler.fit_transform(X_train)
X_test_scaled_array = scaler.transform(X_test)

X_train_scaled = pd.DataFrame(X_train_scaled_array, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled_array, columns=X_test.columns, index=X_test.index)

30 columns.


Features (X) : 27 columns
Target (y) : 73111 values

Class distribution:
target_class
0    19114
1    17449
2    18299
3    18249
Name: count, dtype: int64
Train dimensions: X=(58488, 27), y=(58488,)
Test dimensions:  X=(14623, 27), y=(14623,)
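
As a sanity check on the split, the numbers printed above can be reproduced with plain arithmetic. The sketch below copies the class counts from the output and assumes the rounding rule scikit-learn applies to a float `test_size` (the test count is rounded up, the remainder goes to train). It also computes the accuracy of a classifier that always predicts the most frequent class, which is a useful reference point when reading the model scores later. The variable names are illustrative and not part of the notebook.

```python
import math

# Class counts copied from the notebook output above
class_counts = {0: 19114, 1: 17449, 2: 18299, 3: 18249}
n_samples = sum(class_counts.values())

# Assumption: for a float test_size, the test count is rounded up
# and the training set receives the remaining rows
test_size = 0.2
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Share of each class in the full dataset; a stratified split keeps
# these shares (up to rounding) in both the train and test subsets
class_shares = {label: count / n_samples for label, count in class_counts.items()}

# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(class_counts.values()) / n_samples

print(f"n_samples={n_samples}, n_train={n_train}, n_test={n_test}")
for label, share in class_shares.items():
    print(f"class {label}: {share:.1%} of listings, about {share * n_test:.0f} in the test set")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

The computed sizes match the printed train and test dimensions. Because the four classes are nearly balanced, the majority-class baseline sits at roughly 26%, so any model should be judged against that level rather than against 50%.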


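
The standardization step relies on fitting the scaler on the training data only. The dependency-free sketch below illustrates what `fit_transform` followed by `transform` amounts to for a single feature: the mean and standard deviation are estimated on the training values, then reused unchanged on the test values. The data are synthetic and all names are hypothetical; this is an illustration of the principle, not the notebook's actual pipeline.

```python
import random
import statistics

random.seed(0)

# Synthetic stand-in for one numeric feature (not taken from the Paris dataset)
train_values = [random.gauss(100.0, 30.0) for _ in range(800)]
test_values = [random.gauss(100.0, 30.0) for _ in range(200)]

# "fit": the statistics are estimated on the training values only
train_mean = statistics.fmean(train_values)
train_std = statistics.pstdev(train_values)  # population std (ddof=0), as StandardScaler uses

# "transform": the same training statistics are applied to both subsets
train_scaled = [(v - train_mean) / train_std for v in train_values]
test_scaled = [(v - train_mean) / train_std for v in test_values]

print(f"train: mean={statistics.fmean(train_scaled):.3f}, std={statistics.pstdev(train_scaled):.3f}")
print(f"test:  mean={statistics.fmean(test_scaled):.3f}, std={statistics.pstdev(test_scaled):.3f}")
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values are only close to those targets, which is expected: the test set is treated as unseen data and contributes nothing to the scaling statistics.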

PCA: to be completed

UMAP: to be completed

LDA: to be completed

CART: to be completed

Bootstrapping and Bagging: to be completed