# Breast Cancer — Modern Analysis: t-SNE, UMAP, KDE, 2D & 3D visualizations, and Model Comparison

This notebook:
- loads `data.csv` (expects column `diagnosis` with 'M'/'B' or a numeric `target`),
- preprocesses (encodes categorical columns, scales features),
- computes PCA, t-SNE (2D & 3D) and UMAP (2D, if `umap-learn` is installed),
- estimates densities (KDE) and visualizes them,
- trains models with GridSearchCV and shows metrics,
- includes interactive Plotly visualizations (2D + 3D) and an analysis section.

**Instructions**: place `data.csv` next to this notebook and run all cells. Plotly is required for the interactive plots and `umap-learn` is optional (the UMAP plot is skipped without it); install both with `pip install plotly umap-learn`.
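
As a quick illustration of the expected input, the sketch below builds a tiny hypothetical frame, maps `diagnosis` to a numeric `target`, and keeps both the identifier and the label out of the feature matrix. The column names `radius_mean` and `texture_mean` and all values are invented for the example; they are assumptions, not a description of the actual `data.csv`.

```python
import pandas as pd

# Hypothetical toy rows standing in for data.csv; all values are invented.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.9, 12.4, 11.8, 20.1],
    "texture_mean": [10.4, 15.7, 18.2, 21.3],
})

# Map the label to 0/1, then remove identifier and label columns from the features.
toy["target"] = toy["diagnosis"].map({"M": 1, "B": 0})
X_toy = toy.drop(columns=["id", "diagnosis", "target"])
y_toy = toy["target"]

print(X_toy.columns.tolist())
print(y_toy.tolist())
```

Dropping `diagnosis` before any one-hot encoding matters: otherwise the encoded label would end up among the features.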

In [8]:
# Imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
try:
    import umap
    _umap_available = True
except Exception:
    _umap_available = False
from sklearn.neighbors import KernelDensity
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix
sns.set(style='whitegrid')

In [9]:
# Load & Preprocess
DATA_PATH = 'data.csv'
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"File {DATA_PATH} not found. Place data.csv next to the notebook.")

df = pd.read_csv(DATA_PATH)
# if diagnosis present, map to target
if 'diagnosis' in df.columns and 'target' not in df.columns:
    df['target'] = df['diagnosis'].map({'M':1, 'B':0})

# drop obvious id/unnamed
df = df.drop(columns=[c for c in df.columns if c.lower().startswith('id') or c.lower().startswith('unnamed')], errors='ignore')

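# Fill missing values in numeric columns with the column median
# (numeric_only avoids an error on string columns such as `diagnosis`)
if df.isna().sum().sum() > 0:
    df = df.fillna(df.median(numeric_only=True))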

# Separate target, encode categorical features if any
if 'target' not in df.columns:
    raise ValueError('No `target` column found. Make sure `diagnosis` or `target` is present.')

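# Keep the label out of the features: if `diagnosis` stayed in X_raw,
# get_dummies would emit `diagnosis_M`, an exact copy of the target.
X_raw = df.drop(columns=['target', 'diagnosis'], errors='ignore').copy()
y = df['target'].astype(int).copy()

# One-hot encode any remaining non-numeric feature columns
X = pd.get_dummies(X_raw, drop_first=True)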

# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print('X shape (after encoding):', X.shape)
print('Classes distribution:\n', y.value_counts())

X shape (after encoding): (569, 31)
Classes distribution:
 target
0    357
1    212
Name: count, dtype: int64


In [10]:
# PCA + t-SNE (2D) + UMAP (2D if available)
pca = PCA(n_components=min(30, X_scaled.shape[1]), random_state=42)
X_pca = pca.fit_transform(X_scaled)

print('Explained variance (first 10):', np.round(pca.explained_variance_ratio_[:10],3))

# t-SNE 2D (using PCA pre-reduction)
tsne2 = TSNE(n_components=2, perplexity=30, max_iter=1000, init='pca', random_state=42)
X_tsne2 = tsne2.fit_transform(X_pca)

# UMAP 2D
if _umap_available:
    reducer = umap.UMAP(n_components=2, random_state=42)
    X_umap2 = reducer.fit_transform(X_scaled)
else:
    X_umap2 = None

# Plotly interactive 2D t-SNE
df_tsne2 = pd.DataFrame(X_tsne2, columns=['x','y'])
df_tsne2['target'] = y.values
fig = px.scatter(df_tsne2, x='x', y='y', color='target', title='t-SNE 2D (interactive)', labels={'color':'target'}, width=800, height=600)
fig.update_traces(marker=dict(size=6, opacity=0.8))
fig.show()

# If UMAP available, show side-by-side using two figures
if X_umap2 is not None:
    df_umap2 = pd.DataFrame(X_umap2, columns=['x','y'])
    df_umap2['target'] = y.values
    fig2 = px.scatter(df_umap2, x='x', y='y', color='target', title='UMAP 2D (interactive)', width=800, height=600)
    fig2.update_traces(marker=dict(size=6, opacity=0.8))
    fig2.show()

Explained variance (first 10): [0.449 0.185 0.092 0.064 0.054 0.039 0.022 0.016 0.013 0.011]
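
The ratios printed above can be turned into a cumulative curve to decide how many components to keep. The sketch below reuses the ten recorded values; the 90% threshold is an assumption chosen only for illustration.

```python
import numpy as np

# Explained-variance ratios of the first 10 components, copied from the output above.
ratios = np.array([0.449, 0.185, 0.092, 0.064, 0.054, 0.039, 0.022, 0.016, 0.013, 0.011])
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.90  # assumed cut-off, for illustration only
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative, 3))
print("components needed:", n_keep)
```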


In [11]:
# t-SNE 3D and KDE-based coloring (color points by estimated local density)
tsne3 = TSNE(n_components=3, perplexity=30, max_iter=1000, init='pca', random_state=42)
X_tsne3 = tsne3.fit_transform(X_pca)

# Estimate density at each sample using KDE in 3D
kde3 = KernelDensity(bandwidth=1.0)
kde3.fit(X_tsne3)
sample_log_dens = kde3.score_samples(X_tsne3)
sample_dens = np.exp(sample_log_dens)

df_tsne3 = pd.DataFrame(X_tsne3, columns=['x','y','z'])
df_tsne3['density'] = sample_dens
df_tsne3['target'] = y.values

# 3D scatter colored by density
fig3d = px.scatter_3d(df_tsne3, x='x', y='y', z='z', color='density', color_continuous_scale='Viridis',
                      title='t-SNE 3D colored by KDE density', width=900, height=700)
fig3d.update_traces(marker=dict(size=3))
fig3d.show()

# 3D scatter colored by class
fig3d_cls = px.scatter_3d(df_tsne3, x='x', y='y', z='z', color='target', title='t-SNE 3D colored by class', width=900, height=700)
fig3d_cls.update_traces(marker=dict(size=3))
fig3d_cls.show()

In [12]:
# KDE 2D heatmap overlay on t-SNE 2D (using grid evaluation)
kde = KernelDensity(bandwidth=0.5).fit(X_tsne2)
xmin, ymin = X_tsne2.min(axis=0) - 1
xmax, ymax = X_tsne2.max(axis=0) + 1
xx, yy = np.meshgrid(np.linspace(xmin, xmax, 200), np.linspace(ymin, ymax, 200))
grid = np.vstack([xx.ravel(), yy.ravel()]).T
log_density = kde.score_samples(grid)
density = np.exp(log_density).reshape(xx.shape)

# Plotly heatmap + scatter
fig_kde = go.Figure()
fig_kde.add_trace(go.Heatmap(x=np.linspace(xmin, xmax, 200), y=np.linspace(ymin, ymax, 200), z=density,
                             colorscale='Viridis', showscale=True, opacity=0.8))
fig_kde.add_trace(go.Scatter(x=df_tsne2['x'], y=df_tsne2['y'], mode='markers',
                             marker=dict(color=df_tsne2['target'], showscale=False, size=5, opacity=0.6)))
fig_kde.update_layout(title='KDE density on t-SNE 2D', width=800, height=600)
fig_kde.show()

In [13]:
# Models tuned with GridSearchCV
param_grids = {
    'LogisticRegression': {'C': [0.01, 0.1, 1, 10]},
    'SVC': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    'RandomForest': {'n_estimators': [100, 200], 'max_depth': [None, 5, 10]},
    'MLP': {'hidden_layer_sizes': [(50,), (100,)], 'learning_rate_init': [0.001, 0.01]}
}
models = {
    'LogisticRegression': LogisticRegression(max_iter=2000, solver='lbfgs'),
    'SVC': SVC(probability=True),
    'RandomForest': RandomForestClassifier(random_state=42),
    'MLP': MLPClassifier(max_iter=1000, random_state=42)
}

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, stratify=y, random_state=42)

results = []
for name, model in models.items():
    print(f'GridSearch for {name} ...')
    grid = GridSearchCV(model, param_grids[name], cv=3, scoring='f1', n_jobs=-1)
    grid.fit(X_train, y_train)
    best = grid.best_estimator_
    preds = best.predict(X_test)
    probs = best.predict_proba(X_test)[:,1] if hasattr(best, 'predict_proba') else None
    res = {
        'model': name,
        'best_params': grid.best_params_,
        'accuracy': accuracy_score(y_test, preds),
        'f1': f1_score(y_test, preds),
        'roc_auc': roc_auc_score(y_test, probs) if probs is not None else None
    }
    results.append(res)

res_df = pd.DataFrame(results).set_index('model')
res_df

GridSearch for LogisticRegression ...
GridSearch for SVC ...
GridSearch for RandomForest ...
GridSearch for MLP ...


| model | best_params | accuracy | f1 | roc_auc |
|---|---|---|---|---|
| LogisticRegression | {'C': 10} | 1.0 | 1.0 | 1.0 |
| SVC | {'C': 0.1, 'kernel': 'linear'} | 1.0 | 1.0 | 1.0 |
| RandomForest | {'max_depth': None, 'n_estimators': 100} | 1.0 | 1.0 | 1.0 |
| MLP | {'hidden_layer_sizes': (100,), 'learning_rate_... | 1.0 | 1.0 | 1.0 |


In [14]:
# Plot results using Plotly bar charts
fig_res = px.bar(res_df.reset_index(), x='model', y='f1', text='f1', title='Model F1-scores (test set)', width=800, height=450)
fig_res.update_traces(texttemplate='%{text:.3f}', textposition='outside')
fig_res.update_layout(yaxis=dict(range=[0,1.05]))
fig_res.show()

# Display table of results
res_df

| model | best_params | accuracy | f1 | roc_auc |
|---|---|---|---|---|
| LogisticRegression | {'C': 10} | 1.0 | 1.0 | 1.0 |
| SVC | {'C': 0.1, 'kernel': 'linear'} | 1.0 | 1.0 | 1.0 |
| RandomForest | {'max_depth': None, 'n_estimators': 100} | 1.0 | 1.0 | 1.0 |
| MLP | {'hidden_layer_sizes': (100,), 'learning_rate_... | 1.0 | 1.0 | 1.0 |


## Automatic interpretation of the results

**Observed separation (t-SNE / UMAP)**
- The t-SNE projection (and UMAP, if available) shows a **clear separation** between the two classes, which indicates that the features carry strong discriminative signal.
- t-SNE mainly preserves local neighbourhoods: tight groups in the projection are regions where many samples have similar feature values.
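
The visual impression of separation can be backed by numbers. The sketch below is one possible check, not part of the original analysis: it uses scikit-learn's bundled breast cancer dataset as a stand-in for `data.csv` (an assumption), then computes the trustworthiness of the embedding and the silhouette score of the class labels in 2D.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE, trustworthiness
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset.
# Its labels are 0 = malignant, 1 = benign, the reverse of this notebook's mapping.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=42).fit_transform(X_demo)

# Trustworthiness in [0, 1]: how well local neighbourhoods survive the projection.
trust = trustworthiness(X_demo, emb, n_neighbors=10)
# Silhouette in [-1, 1]: how compact and well separated the two classes are in 2D.
sil = silhouette_score(emb, y_demo)

print(f"trustworthiness={trust:.3f}, silhouette={sil:.3f}")
```

A high trustworthiness supports the claim about local neighbourhoods; the silhouette score describes the embedding only and says nothing about how a classifier would generalize.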

**Density (KDE)**
- The density maps reveal **hot spots** (high-density regions) that correspond to regions of the embedding space where samples are very similar. These regions can inform segmentation or anomaly detection. Because t-SNE distorts distances and densities, read them as properties of the embedding rather than of the original feature space.
- In the 3D space, colouring points by local density (KDE) helps spot population cores that are not visible in a plain 2D view.
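
The density maps depend strongly on the KDE bandwidth, which the cells above fix by hand. A minimal sketch of choosing it by cross-validated log-likelihood is shown below; the synthetic `points` array is a hypothetical stand-in for an embedding such as `X_tsne2`, and the bandwidth grid is an assumption.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Hypothetical 2D points standing in for an embedding such as X_tsne2.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[-3.0, 0.0], scale=0.7, size=(150, 2)),
    rng.normal(loc=[3.0, 1.0], scale=1.2, size=(150, 2)),
])
rng.shuffle(points)  # mix the two groups so every CV fold sees both

# KernelDensity.score returns the total log-likelihood, so GridSearchCV
# picks the bandwidth with the best held-out log-likelihood.
search = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": np.logspace(-1, 1, 20)},
    cv=5,
)
search.fit(points)

best_kde = search.best_estimator_
density = np.exp(best_kde.score_samples(points))
print("selected bandwidth:", round(search.best_params_["bandwidth"], 3))
```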

**Model performance**
- The models tested (LogisticRegression, SVC, RandomForest, MLP) are tuned with GridSearchCV (3-fold CV, F1 scoring) and evaluated on a held-out 25% test split; the results table reports each model's best hyperparameters.
- In the recorded run every model scores exactly 1.0 on accuracy, F1 and ROC AUC, and the feature matrix has 31 columns. That pattern is consistent with `diagnosis` having been one-hot encoded into the features (as `diagnosis_M`), which leaks the label. When scores are this close to 1, check:
  - correct stratification of the split (already used here),
  - data leakage (features that contain the label or information derived from it),
  - excessive model complexity for a small dataset (overfitting).
- For robustness, consider more thorough cross-validation (CV=5 or stratified K-fold) and test the models on an external dataset if possible.
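
The leakage and cross-validation checks listed above can be sketched as follows. This is an illustration under assumptions: scikit-learn's bundled breast cancer dataset stands in for `data.csv`, the helper `suspicious_columns` and its 0.99 threshold are hypothetical, and the `diagnosis_M` column is added on purpose to simulate a leak.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
# (labels there are 0 = malignant, 1 = benign).
data = load_breast_cancer(as_frame=True)
X_demo, y_demo = data.data, data.target

# Simulate the leak: a column that is just the label re-encoded.
X_leaky = X_demo.assign(diagnosis_M=(y_demo == 0).astype(int))

def suspicious_columns(X, y, threshold=0.99):
    """Return columns whose absolute correlation with y exceeds threshold."""
    corr = X.corrwith(y).abs()
    return corr[corr > threshold].index.tolist()

print("leaky frame:", suspicious_columns(X_leaky, y_demo))
print("clean frame:", suspicious_columns(X_demo, y_demo))

# Stratified 5-fold CV; the scaler lives inside the pipeline so it is refit per fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
print(f"F1 per fold: {np.round(scores, 3)}, mean={scores.mean():.3f}")
```

A correlation filter only catches direct copies of the label; features derived from it in a nonlinear way need a manual review of how each column was produced.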

**Presentation tips**
- Show t-SNE and KDE side by side to illustrate structure and density.
- Explain why t-SNE/UMAP are useful for exploration but **not suited** to training a classifier directly on the embedding (they do not necessarily preserve the global geometry that matters for generalization).
- Add a slide explaining the methodology: preprocessing → reduction → KDE → training → evaluation.
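
One concrete way to make the point about t-SNE on a slide: scikit-learn's `TSNE` exposes no `transform` method, so unseen samples cannot be projected into an existing embedding, whereas PCA can sit inside a training pipeline. The sketch below shows this with scikit-learn's bundled breast cancer dataset as a stand-in for `data.csv` (an assumption); the choice of 10 components is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# TSNE only offers fit / fit_transform; PCA can also project new samples.
print("TSNE has transform:", hasattr(TSNE(), "transform"))
print("PCA has transform:", hasattr(PCA(), "transform"))

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# preprocessing -> reduction -> training, all fit on the training split only.
clf = make_pipeline(StandardScaler(), PCA(n_components=10), LogisticRegression(max_iter=2000))
clf.fit(X_tr, y_tr)
test_accuracy = clf.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.3f}")
```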