# 07 - Evaluating the Anti-Data-Leakage Pipeline

This notebook demonstrates the correct workflow for training and evaluating a resume classification model **without data leakage**.

## Key Principle

```
❌ WRONG: Preprocessing → Split → Train/Test
✅ RIGHT: Split → Preprocessing → Train/Test
```

The split must be made on the **RAW** data, before any transformation is applied.
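
To make the principle concrete, here is a minimal sketch of what leaks when a vectorizer is fitted before the split. The toy texts are invented for illustration, and `TfidfVectorizer` stands in for whatever preprocessing the project actually uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical data, for illustration only)
train_texts = ["java spring developer", "python data analyst"]
test_texts = ["kubernetes java engineer"]

# WRONG: fitted on train + test, so the vocabulary and IDF weights
# already contain information taken from the test set.
leaky = TfidfVectorizer().fit(train_texts + test_texts)

# RIGHT: fitted on the training texts only.
clean = TfidfVectorizer().fit(train_texts)

# Terms the leaky vectorizer learned from the test set alone
leaked_terms = sorted(set(leaky.vocabulary_) - set(clean.vocabulary_))
print(leaked_terms)
```

The terms printed are those that only occur in the test text. A vectorizer fitted after the split simply ignores them at transform time, which is the behavior a real deployment would see on unseen resumes.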

## 1. Imports et Configuration

In [None]:
import sys
from pathlib import Path

# Add the src folder to the import path
PROJECT_ROOT = Path().absolute().parent
sys.path.insert(0, str(PROJECT_ROOT / 'src'))

print(f"Project: {PROJECT_ROOT}")

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Project modules
from training.data_splitter import DataSplitter
from training.pipeline_builder import CVClassifierPipelineBuilder
from training.trainer import CVClassifierTrainer
from training.evaluator import PipelineEvaluator

print("✓ Modules imported")

## 2. Loading the Raw Data

In [None]:
# Load the RAW (unpreprocessed) dataset
DATA_PATH = PROJECT_ROOT / 'data' / 'raw' / 'resume_dataset.csv'

df = pd.read_csv(DATA_PATH)

print(f"Dataset loaded: {len(df)} resumes")
print(f"Columns: {list(df.columns)}")
df.head(2)

In [None]:
# Category distribution
print("Category distribution:")
df['Category'].value_counts()

## 3. Splitting the RAW Data (Critical Step)

⚠️ **This is the make-or-break step!**

The data is split **BEFORE** any preprocessing, so that no information from the test set leaks into training.

In [None]:
# Create the splitter
splitter = DataSplitter(
    test_size=0.2,      # 20% held out for the test set
    random_state=42,    # Reproducibility
    stratify=True       # Preserve the class distribution
)

# Directory where the split indices are saved
SPLIT_DIR = PROJECT_ROOT / 'data' / 'splits'

# Check whether a split already exists
if splitter.split_exists(SPLIT_DIR):
    print("Existing split found, loading...")
    train_df, test_df = splitter.load_split(df, SPLIT_DIR)
else:
    print("Creating a new split...")
    train_df, test_df = splitter.split_and_save(df, 'Category', SPLIT_DIR)
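
The implementation of `DataSplitter` is not shown in this notebook. As a rough sketch of what a stratified split with persisted indices could look like, the following uses only the standard library. The function `stratified_indices`, the toy labels, and the JSON file name are illustrative assumptions, not the project's actual API.

```python
import json
import random
import tempfile
from collections import defaultdict
from pathlib import Path


def stratified_indices(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling the test share within each class."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        # At least one test sample per class
        n_test = max(1, round(len(idx) * test_size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical): 10 resumes in one class, 5 in another
labels = ["Java Developer"] * 10 + ["HR"] * 5
train_idx, test_idx = stratified_indices(labels)

# Persist the indices so that later runs reuse exactly the same split
split_file = Path(tempfile.mkdtemp()) / "split_indices.json"
split_file.write_text(json.dumps({"train": train_idx, "test": test_idx}))

reloaded = json.loads(split_file.read_text())
print(len(reloaded["train"]), len(reloaded["test"]))
```

Saving row indices rather than the split DataFrames keeps the stored artifact small and guarantees that every experiment evaluates on the same held-out rows.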

In [None]:
# Check the split
print(f"\n📊 Split Summary:")
print(f"   Train: {len(train_df)} resumes ({len(train_df)/len(df)*100:.1f}%)")
print(f"   Test:  {len(test_df)} resumes ({len(test_df)/len(df)*100:.1f}%)")
print(f"   Total: {len(df)} resumes")

In [None]:
# Extract X and y
X_train = train_df['Resume'].values  # RAW text
y_train = train_df['Category'].values

X_test = test_df['Resume'].values    # RAW text
y_test = test_df['Category'].values

print(f"X_train: {len(X_train)} raw texts")
print(f"X_test:  {len(X_test)} raw texts")

## 4. Label Encoding

In [None]:
# Encode the categories (fit on the training labels, then transform the test labels)
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

print(f"Number of classes: {len(label_encoder.classes_)}")
print(f"Classes: {list(label_encoder.classes_)}")
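
One consequence of fitting the encoder on the training labels only is worth a small illustration: `LabelEncoder.transform` raises a `ValueError` for any label it did not see during `fit`. The category names below are invented for the example. With a stratified split, every class should appear on both sides, so this is a safeguard rather than an expected failure.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical training labels
encoder = LabelEncoder().fit(["HR", "Java Developer", "Data Science"])

# Classes are stored in sorted order; the integer code is the position in this list
print(list(encoder.classes_))

try:
    encoder.transform(["Blockchain"])  # label absent from the training set
    unseen_rejected = False
except ValueError:
    unseen_rejected = True

print(unseen_rejected)
```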

## 5. Building the Pipeline

The pipeline encapsulates all transformations:

```
Raw Text → TextCleanerTransformer → TfidfVectorizer → Classifier
```

As a result, TF-IDF is **fitted only on the training data**.

In [None]:
# Build the pipeline
builder = CVClassifierPipelineBuilder(
    classifier_name='random_forest',
    tfidf_params={'max_features': 5000, 'ngram_range': (1, 2)},
    classifier_params={'n_estimators': 200, 'random_state': 42}
)

pipeline = builder.build()

print("Pipeline built:")
for name, step in pipeline.steps:
    print(f"  → {name}: {type(step).__name__}")
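
The internals of `CVClassifierPipelineBuilder` are not shown here. A pipeline with the three-stage shape described above might be assembled roughly as follows. The `clean_texts` function is a hypothetical stand-in for `TextCleanerTransformer`, and the step names and toy data are assumptions made for the example.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def clean_texts(texts):
    """Hypothetical cleaner: lowercase and replace non-letters with spaces."""
    return [re.sub(r"[^a-z ]+", " ", text.lower()) for text in texts]


sketch_pipeline = Pipeline([
    ("cleaner", FunctionTransformer(clean_texts)),
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("classifier", RandomForestClassifier(n_estimators=200, random_state=42)),
])

# Toy training data (hypothetical): 1 = technical profile, 0 = HR profile
toy_texts = ["Java, Spring Boot!", "Recruiting & payroll",
             "Java microservices", "HR onboarding"]
toy_labels = [1, 0, 1, 0]

# fit() runs every step on the training texts only
sketch_pipeline.fit(toy_texts, toy_labels)

toy_prediction = sketch_pipeline.predict(["Spring microservices in Java"])
print(toy_prediction)
```

Because cleaning and vectorization live inside the pipeline object, `predict` applies the already fitted steps to new text and never refits them.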

## 6. Cross-Validation (on Train Only)

Cross-validation is performed **only on the training data**.

The test set remains completely isolated.

In [None]:
# Create the trainer
trainer = CVClassifierTrainer(
    classifier_name='random_forest',
    n_folds=5,
    random_state=42
)

# Run cross-validation
cv_results = trainer.cross_validate(X_train, y_train_encoded)

In [None]:
# Display the CV results
print("\n📈 Cross-Validation Results (5-fold):")
print("=" * 50)

for metric, scores in cv_results['scores'].items():
    print(f"\n{metric}:")
    print(f"  CV Mean:  {scores['cv_mean']:.4f} (+/- {scores['cv_std']:.4f})")
    print(f"  Train:    {scores['train_mean']:.4f}")
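
How `CVClassifierTrainer.cross_validate` computes these numbers is not visible in the notebook. One plausible way to obtain the same kind of summary with scikit-learn directly is sketched below. The toy data, the two-step pipeline, and the `toy_summary` keys (chosen to mirror `cv_mean`, `cv_std`, and `train_mean`) are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

# Toy training set (hypothetical data): 5 texts per class
toy_X = [
    "java spring backend", "java microservices api", "java jvm tuning",
    "spring boot java", "java rest services",
    "payroll and recruiting", "hr onboarding process", "employee relations hr",
    "recruiting and hiring", "hr benefits payroll",
]
toy_y = [1] * 5 + [0] * 5

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# The whole pipeline is refitted in each fold, so TF-IDF only ever
# sees the training part of that fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
raw_scores = cross_validate(
    toy_pipeline, toy_X, toy_y,
    cv=folds,
    scoring=["accuracy", "f1_macro"],
    return_train_score=True,
)

toy_summary = {
    "cv_mean": raw_scores["test_accuracy"].mean(),
    "cv_std": raw_scores["test_accuracy"].std(),
    "train_mean": raw_scores["train_accuracy"].mean(),
}
print(toy_summary)
```

A training mean far above the CV mean points to overfitting. That comparison is made entirely within the training set and does not touch the held-out test data.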

## 7. Training the Final Model

The pipeline is now trained on the **entire** training set.

In [None]:
# Train the final model
pipeline = trainer.train(X_train, y_train_encoded, label_encoder)

print(f"\n✓ Model trained in {trainer.training_time:.2f}s")

## 8. Final Evaluation on the Test Set

⚠️ **This step must be run only ONCE**, at the very end.

The test set has never been seen during training or validation.

In [None]:
# Create the evaluator
evaluator = PipelineEvaluator(pipeline, label_encoder)

# Evaluate on the test set
test_results = evaluator.evaluate(X_test, y_test_encoded)

In [None]:
# Compare CV vs. test
comparison = evaluator.compare_with_cv(cv_results)

## 9. Results Analysis

In [None]:
# Summary table
import pandas as pd

summary = pd.DataFrame({
    'Metric': ['Accuracy', 'F1 Macro', 'Precision', 'Recall'],
    'CV (mean)': [
        cv_results['scores']['accuracy']['cv_mean'],
        cv_results['scores']['f1_macro']['cv_mean'],
        cv_results['scores']['precision_macro']['cv_mean'],
        cv_results['scores']['recall_macro']['cv_mean']
    ],
    'CV (std)': [
        cv_results['scores']['accuracy']['cv_std'],
        cv_results['scores']['f1_macro']['cv_std'],
        cv_results['scores']['precision_macro']['cv_std'],
        cv_results['scores']['recall_macro']['cv_std']
    ],
    'Test': [
        test_results['metrics']['accuracy'],
        test_results['metrics']['f1_macro'],
        test_results['metrics']['precision_macro'],
        test_results['metrics']['recall_macro']
    ]
})

summary['CV (mean)'] = summary['CV (mean)'].apply(lambda x: f"{x:.4f}")
summary['CV (std)'] = summary['CV (std)'].apply(lambda x: f"±{x:.4f}")
summary['Test'] = summary['Test'].apply(lambda x: f"{x:.4f}")

print("\n📊 Summary Table:")
print("=" * 60)
print(summary.to_string(index=False))
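
The macro-averaged metrics in this table give every class the same weight, regardless of how many resumes it contains. The following pure-Python sketch works through that averaging on a tiny invented example. `macro_scores` is an illustrative helper, not part of `PipelineEvaluator`.

```python
def macro_scores(y_true, y_pred):
    """Return (precision, recall, f1), each averaged with equal weight per class."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Imbalanced toy example: four resumes of class 0, two of class 1
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1]

accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
precision_toy, recall_toy, f1_toy = macro_scores(y_true_toy, y_pred_toy)

print(f"accuracy={accuracy_toy:.4f}  precision={precision_toy:.4f}  "
      f"recall={recall_toy:.4f}  f1={f1_toy:.4f}")
```

Here the single error on the minority class lowers macro recall to 0.75, while accuracy stays at about 0.83. This is why macro metrics are informative when categories are unevenly represented.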

In [None]:
# Visualize the confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns

cm = np.array(test_results['confusion_matrix']['matrix'])
labels = test_results['confusion_matrix']['labels']

plt.figure(figsize=(14, 12))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=labels, yticklabels=labels)
plt.title('Confusion Matrix - Test Set')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

## 10. Testing with a New Resume

In [None]:
# Prediction example
exemple_cv = """
John Doe
Senior Software Developer

EXPERIENCE:
- 5 years of Java development
- Spring Boot, Microservices, REST APIs
- Docker, Kubernetes, AWS
- Agile methodologies, Scrum

SKILLS:
Java, Python, SQL, MongoDB, Git, Jenkins, CI/CD

EDUCATION:
Master's in Computer Science
"""

# Prediction (the pipeline handles everything: cleaning + vectorization + prediction)
prediction = pipeline.predict([exemple_cv])[0]
probabilities = pipeline.predict_proba([exemple_cv])[0]

# Decode the predicted label
category = label_encoder.inverse_transform([prediction])[0]
confidence = probabilities.max()

print(f"Predicted category: {category}")
print(f"Confidence: {confidence:.2%}")

In [None]:
# Top 5 categories
top5_idx = np.argsort(probabilities)[::-1][:5]

print("\nTop 5 categories:")
for i, idx in enumerate(top5_idx, 1):
    cat = label_encoder.inverse_transform([idx])[0]
    prob = probabilities[idx]
    print(f"  {i}. {cat}: {prob:.2%}")

## 11. Conclusion

### Key Points

1. **Split BEFORE preprocessing**: The data is split into train/test on the RAW text
2. **sklearn Pipeline**: All transformations are encapsulated, so they are fitted on training data only
3. **Cross-validation on train only**: The test set remains isolated
4. **Single evaluation**: The test set is used only once, at the end

### Anti-Leakage Check

If the CV and test scores are **close**, that is a good sign that there is no data leakage, although a small gap alone does not prove it.

- ✅ Difference < 5 points (0.05): OK
- ⚠️ Difference > 10 points (0.10): possible overfitting or leakage

In [None]:
# Final check
diff = abs(comparison['cv_accuracy'] - comparison['test_accuracy'])

print("\n🔍 Anti-Leakage Check:")
print(f"   CV Accuracy:   {comparison['cv_accuracy']:.4f}")
print(f"   Test Accuracy: {comparison['test_accuracy']:.4f}")
print(f"   Difference:    {diff:.4f}")

if diff < 0.05:
    print("\n✅ No data leakage detected!")
else:
    print("\n⚠️ Warning: large gap between CV and test")
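
The two thresholds listed in the check (0.05 and 0.10) leave a band in between that the cell above folds into the warning branch. A small helper that makes all three zones explicit could look like this. The function name `assess_gap` and the returned labels are illustrative choices, not part of the project's modules.

```python
def assess_gap(cv_score, test_score, ok_below=0.05, warn_above=0.10):
    """Classify the absolute CV/test gap into one of three zones."""
    gap = abs(cv_score - test_score)
    if gap < ok_below:
        return "ok"
    if gap > warn_above:
        return "warning"
    # Between the two thresholds: not alarming, but worth a closer look
    return "inconclusive"


# Hypothetical scores, for illustration only
print(assess_gap(0.95, 0.93))
print(assess_gap(0.95, 0.88))
print(assess_gap(0.95, 0.80))
```

In the notebook, this could be called as `assess_gap(comparison['cv_accuracy'], comparison['test_accuracy'])`, assuming `comparison` exposes those two keys as in the final cell.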