# Phase 3 – Modeling & ML Enrichment

Objective: enrich the cleaned dataset from Phase 2 with:
- **NMF topics** that group job postings by theme
- **Salary class** (high/low) for analysis and the dashboard
- **Enriched dataset** ready for Phase 4

## Step 1 – Import the libraries

- pandas / numpy
- scikit-learn (TF-IDF, NMF, LogisticRegression, train/test split, metrics)
- pathlib (`Path`)
- Warnings are filtered to keep the notebook readable

In [22]:
# --- Imports ---
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

print("✅ Imports OK")

✅ Imports OK

## Step 2 – Load the cleaned dataset (Phase 2)

We start from the file produced in Phase 2: `data/processed/hellowork_cleaned.csv`.


In [23]:
# --- Load the cleaned dataset ---
clean_path = Path("data/processed/hellowork_cleaned.csv")
assert clean_path.exists(), "The cleaned Phase 2 file was not found."

df = pd.read_csv(clean_path, encoding="utf-8")
print(f"📊 Shape: {df.shape}")
print(f"📋 Columns: {df.columns.tolist()}")

df.head(3)

📊 Shape: (1239, 15)
📋 Columns: ['Sector', 'Job_Title', 'Company', 'Location', 'Contract', 'Salary', 'Description', 'URL', 'Salary_Monthly', 'Description_Clean', 'Top_Keywords', 'Sector_enc', 'Location_enc', 'Contract_enc', 'Company_enc']

| | Sector | Job_Title | Company | Location | Contract | Salary | Description | URL | Salary_Monthly | Description_Clean | Top_Keywords | Sector_enc | Location_enc | Contract_enc | Company_enc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Agriculture • Pêche | Alternance - Chargé·e de Formation H/F | Remy Cointreau | Paris - 75 | Alternance | 486,49 - 1 801,80 € / mois | Nous recherchons un·e candidat·e : Alternance... | https://www.hellowork.com/fr-fr/emplois/642118... | 1144.145 | recherchons un·e candidat·e alternance chargé·... | formation,formations,groupe,aider,plan,créatio... | 0 | 0 | 0 | 0 |
| 1 | BTP | Electricien H/F | Samsic Emploi | Rennes - 35 | Intérim | 12 - 15 € / heure | Nous recherchons activement un/une electricien... | https://www.hellowork.com/fr-fr/emplois/729658... | 2160.0 | recherchons activement unune electriciennne ca... | travail,samsic,sengage,lun,passion,lexpérience... | 1 | 1 | 1 | 1 |
| 2 | BTP | Ouvrier Polyvalent en Menuiserie H/F | Groupe Actual | Auterive - 31 | Intérim | Estimation → 12,36 - 13,50 € / heure | Nous recherchons un(e) menuisier(e) expériment... | https://www.hellowork.com/fr-fr/emplois/732798... | 2068.8 | recherchons menuisiere expérimentée rejoindre ... | recherchons,dexpérience,connaissance,candidats... | 1 | 2 | 1 | 2 |

## Step 3 – Harmonize the useful columns

We create or normalize the columns the ML steps expect:
- `description_clean` (text ready for TF-IDF)
- `salary_monthly` (numeric monthly value)
- encoded categorical columns when present (`sector_enc`, `location_enc`, etc.)
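
The three preview rows are consistent with one way of deriving a monthly value from the raw `Salary` text: take the midpoint of the range and multiply hourly rates by 160 hours (for example, `12 - 15 € / heure` gives 2160.0). The sketch below reproduces those three values under that assumption. `parse_salary_monthly` and `HOURS_PER_MONTH` are hypothetical names, the actual Phase 2 code is not shown in this notebook, and other units (such as yearly salaries) are deliberately left as NaN.

```python
import math
import re

HOURS_PER_MONTH = 160  # assumption inferred from the preview rows, not from Phase 2 code

def parse_salary_monthly(raw):
    """Return an approximate monthly salary from a raw salary string, or NaN."""
    if not isinstance(raw, str):
        return math.nan
    # Drop regular, no-break and narrow no-break spaces (thousands separators)
    compact = re.sub(r"[ \u00a0\u202f]", "", raw)
    # French decimals use a comma: "486,49" -> 486.49
    numbers = [float(n.replace(",", ".")) for n in re.findall(r"\d+(?:,\d+)?", compact)]
    if not numbers:
        return math.nan
    midpoint = sum(numbers) / len(numbers)
    if "/mois" in compact:
        return midpoint
    if "/heure" in compact:
        return midpoint * HOURS_PER_MONTH
    return math.nan  # unknown unit: left unparsed on purpose

print(parse_salary_monthly("486,49 - 1\u202f801,80 € / mois"))
print(parse_salary_monthly("12 - 15 € / heure"))
print(parse_salary_monthly("Estimation → 12,36 - 13,50 € / heure"))
```

The three calls should come out close to the `Salary_Monthly` values shown in the preview (1144.145, 2160.0 and 2068.8), up to floating-point rounding.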


In [24]:
# --- Harmonize column names ---
# Rename the Phase 2 columns to the lowercase names the ML steps expect

col_map = {
    "Description_Clean": "description_clean",
    "Salary_Monthly": "salary_monthly",
    "Sector_enc": "sector_enc",
    "Location_enc": "location_enc",
    "Contract_enc": "contract_enc",
    "Company_enc": "company_enc",
}

df = df.rename(columns={k: v for k, v in col_map.items() if k in df.columns})

# Guarantee the minimal required columns
if "description_clean" not in df.columns:
    if "Description" in df.columns:
        df["description_clean"] = df["Description"].fillna("").astype(str)
    else:
        df["description_clean"] = ""

if "salary_monthly" not in df.columns:
    if "Salary" in df.columns:
        # Fallback only: raw strings such as "12 - 15 € / heure" are not numeric
        # and are coerced to NaN
        df["salary_monthly"] = pd.to_numeric(df["Salary"], errors="coerce")
    else:
        df["salary_monthly"] = np.nan

print("✅ Columns harmonized")
print(f"📊 Available columns: {df.columns.tolist()}")
print("\n📋 Preview of key columns:")
print(df[[c for c in ["description_clean", "salary_monthly"] if c in df.columns]].head(3))

✅ Columns harmonized
📊 Available columns: ['Sector', 'Job_Title', 'Company', 'Location', 'Contract', 'Salary', 'Description', 'URL', 'salary_monthly', 'description_clean', 'Top_Keywords', 'sector_enc', 'location_enc', 'contract_enc', 'company_enc']

📋 Preview of key columns:
                                   description_clean  salary_monthly
0  recherchons un·e candidat·e alternance chargé·...        1144.145
1  recherchons activement unune electriciennne ca...        2160.000
2  recherchons menuisiere expérimentée rejoindre ...        2068.800

## Step 4 – Create the salary target (high vs. low)

The median monthly salary defines `high_salary` (1 = strictly above the median, 0 otherwise).
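
When a median-based target is built with a plain comparison, rows whose `salary_monthly` is missing end up labeled 0, because `NaN > median` evaluates to `False`. A minimal sketch of one alternative, assuming a nullable integer label is acceptable downstream, keeps the label missing wherever the salary is missing. The helper name `make_salary_target` and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

def make_salary_target(salary: pd.Series) -> pd.Series:
    """Label 1 above the median, 0 at or below it, <NA> when the salary is missing."""
    median = salary.median(skipna=True)
    target = (salary > median).astype("Int64")  # nullable integer dtype
    return target.mask(salary.isna())           # put the missing labels back

toy_salary = pd.Series([1144.145, 2160.0, 2068.8, np.nan, 2240.0])
toy_target = make_salary_target(toy_salary)

print(toy_target.tolist())
print("labeled rows:", int(toy_target.notna().sum()), "of", len(toy_target))
```

With a target built this way, a `notna()` filter on `high_salary` would actually exclude the unlabeled postings from training instead of treating them as low salaries.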


In [25]:
# --- Binary salary target (high vs. low) ---
median_salary = df["salary_monthly"].median(skipna=True)
# Note: NaN > median is False, so rows without a salary are labeled 0 here
df["high_salary"] = (df["salary_monthly"] > median_salary).astype(int)

print(f"💰 Median monthly salary: €{median_salary:.2f}")
print("\n📊 Target distribution:")
print(df["high_salary"].value_counts())
print(f"\n✓ Valid salaries: {df['salary_monthly'].notna().sum()}/{len(df)}")

💰 Median monthly salary: €2116.90

📊 Target distribution:
high_salary
0    811
1    428
Name: count, dtype: int64

✓ Valid salaries: 1073/1239

## Step 5 – Thematic clustering with NMF

Pipeline:
1) TF-IDF on `description_clean`
2) NMF (7 topics) to extract the themes
3) Assign the dominant topic to `job_cluster`
4) Print the top words per topic for interpretation


In [26]:
# --- TF-IDF + NMF clustering ---
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

text_col = "description_clean"
texts = df[text_col].fillna("").astype(str)

print("🔄 Building the TF-IDF matrix...")
vect = TfidfVectorizer(max_features=1000, min_df=2, max_df=0.9, ngram_range=(1, 2))
X_text = vect.fit_transform(texts)
print(f"✓ TF-IDF matrix: {X_text.shape}")

print("\n🔄 Applying NMF (7 topics)...")
nmf = NMF(n_components=7, random_state=42, max_iter=500, init="nndsvda")
W = nmf.fit_transform(X_text)  # document-topic weights, shape (n_docs, 7)

# Assign each job to its dominant topic
df["job_cluster"] = W.argmax(axis=1)
print("✓ NMF clustering done")
print("\n📊 Cluster distribution:")
print(df["job_cluster"].value_counts().sort_index())

# Interpret the topics
def print_topics(nmf_model, feature_names, n_top_words=10):
    print(f"\n🔑 Topic interpretation (top {n_top_words} words):")
    for i, comp in enumerate(nmf_model.components_):
        # argsort is ascending, so the heaviest term is printed last
        terms = [feature_names[j] for j in comp.argsort()[-n_top_words:]]
        print(f"  Topic {i}: {', '.join(terms)}")

print_topics(nmf, vect.get_feature_names_out())

🔄 Building the TF-IDF matrix...
✓ TF-IDF matrix: (1239, 1000)

🔄 Applying NMF (7 topics)...
✓ NMF clustering done

📊 Cluster distribution:
job_cluster
0    209
1    149
2    210
3    103
4    404
5     70
6     94
Name: count, dtype: int64

🔑 Topic interpretation (top 10 words):
  Topic 0: journée, heure, pourrez, quelques, brut heure, collège, élèves niveau, cours, niveau, élèves
  Topic 1: rejoins, déplacement, engagés, frais déplacement, laventure, rejoins laventure, laventure esallia, produits engagés, produits, esallia
  Topic 2: kids, jeux, intelligente, domicile, mission, activités, garde denfants, garde, denfants, enfants
  Topic 3: chez, première, première année, année, diplômes, salaire, 1418, missions, ouihelp, aide
  Topic 4: suivi, missions, sein, recrutement, formation, clients, participer, poste, gestion, développement
  Topic 5: opérationnelle, indicateurs, gestion opérationnelle, gestion, vente relation, clien

## Step 6 – Salary classification (Logistic Regression)

- Input data: TF-IDF on `description_clean`
- Target: `high_salary`
- Metrics: classification report + ROC AUC
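
One point worth noting about this setup: the TF-IDF vocabulary and IDF weights are fitted on every row before the train/test split, so the test rows influence the features. A hedged alternative is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` and fit it on the training texts only. The helper `build_salary_pipeline` and the toy texts and labels below are made up for illustration; only the classifier settings mirror the ones used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

def build_salary_pipeline(**tfidf_kwargs):
    """TF-IDF + logistic regression; the vectorizer only sees what fit() receives."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(**tfidf_kwargs)),
        ("clf", LogisticRegression(max_iter=1000, C=10, class_weight="balanced")),
    ])

# Made-up toy data: 8 "low" and 8 "high" postings
toy_low = ["garde enfants domicile", "aide ménage domicile",
           "cours élèves collège", "garde enfants sortie école"]
toy_high = ["ingénieur développement logiciel", "responsable gestion clients",
            "chef projet développement", "directeur commercial clients"]
toy_texts = toy_low * 2 + toy_high * 2
toy_labels = [0] * 8 + [1] * 8

# Split the raw texts first, then fit the whole pipeline on the training part
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)
toy_pipe = build_salary_pipeline()
toy_pipe.fit(txt_train, lab_train)

toy_probs = toy_pipe.predict_proba(txt_test)[:, 1]
toy_auc = roc_auc_score(lab_test, toy_probs)
print(f"held-out rows: {len(txt_test)}, AUC on toy data: {toy_auc:.3f}")
```

On the real data, the same helper could be called as `build_salary_pipeline(max_features=1000, min_df=2, max_df=0.9, ngram_range=(1, 2))` and fitted on the training descriptions only.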

### b) Model training

In [27]:
# --- Salary classification (Logistic Regression) ---
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Keep rows that have a label and a text. high_salary was built with astype(int),
# so it is never NaN and every row passes this filter (see the sample count below).
df_train = df[df["high_salary"].notna() & df[text_col].notna()].copy()
print(f"📊 Training data: {len(df_train)} samples")

if len(df_train) > 10:
    # Reuse the TF-IDF matrix (X_text), restricted to the training rows.
    # The boolean mask must be a NumPy array to index a sparse matrix.
    mask = (df["high_salary"].notna() & df[text_col].notna()).values
    X = X_text[mask]
    y = df_train["high_salary"].values

    print("\n🔄 Train/test split (80/20)...")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    print(f"✓ Train: {X_train.shape[0]} | Test: {X_test.shape[0]}")

    print("\n🔄 Training the model (Logistic Regression)...")
    clf = LogisticRegression(max_iter=1000, C=10, solver="lbfgs", class_weight="balanced")
    clf.fit(X_train, y_train)

    print("✓ Model trained")

    # Evaluate on the held-out test set
    preds = clf.predict(X_test)
    probs = clf.predict_proba(X_test)[:, 1]

    print("\n📊 Classification report:")
    print(classification_report(y_test, preds, digits=3))

    auc = roc_auc_score(y_test, probs)
    print(f"\n🎯 AUC-ROC: {auc:.4f}")
else:
    print("⚠️  Not enough data for classification")
    clf = None

📊 Training data: 1239 samples

🔄 Train/test split (80/20)...
✓ Train: 991 | Test: 248

🔄 Training the model (Logistic Regression)...
✓ Model trained

📊 Classification report:
              precision    recall  f1-score   support

           0      0.925     0.994     0.958       162
           1      0.986     0.849     0.912        86

    accuracy                          0.944       248
   macro avg      0.956     0.921     0.935       248
weighted avg      0.947     0.944     0.942       248


🎯 AUC-ROC: 0.9710


## Step 7 – Attach the predictions

We generate predictions for the whole dataset and add:
- `predicted_high_salary` (predicted class)
- `pred_proba_high_salary` (predicted probability of the high-salary class)
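
Predictions made on the whole dataset include the rows the classifier was trained on, so agreement with `high_salary` on those rows tends to look better than it would on unseen postings. If the dashboard needs, for every row, a probability produced by a model that never saw that row, out-of-fold predictions are one option. The sketch below uses `cross_val_predict` on made-up toy data; the column name `oof_proba_high_salary` is hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline

toy_df = pd.DataFrame({
    "description_clean": [
        "garde enfants domicile", "aide ménage domicile",
        "cours élèves collège", "garde enfants école",
        "ingénieur développement logiciel", "responsable gestion clients",
        "chef projet développement", "directeur commercial clients",
    ],
    "high_salary": [0, 0, 0, 0, 1, 1, 1, 1],
})

oof_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
oof_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

# Each row is scored by a model fitted on the other folds only
oof_proba = cross_val_predict(
    oof_model,
    toy_df["description_clean"],
    toy_df["high_salary"],
    cv=oof_cv,
    method="predict_proba",
)
toy_df["oof_proba_high_salary"] = oof_proba[:, 1]
print(toy_df[["high_salary", "oof_proba_high_salary"]])
```

The trade-off is that no single fitted model produces these scores, so a final model fitted on all labeled rows is still needed to score new postings.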

In [28]:
# --- Predictions on the full dataset ---
if clf is not None:
    print("🔄 Generating predictions on the full dataset...")
    df["predicted_high_salary"] = clf.predict(X_text)
    df["pred_proba_high_salary"] = clf.predict_proba(X_text)[:, 1]
    print("✓ Predictions added")

    print("\n📊 Prediction preview:")
    display_cols = ["salary_monthly", "high_salary", "predicted_high_salary", "pred_proba_high_salary"]
    print(df[display_cols].head(10))

    # Accuracy over all labeled rows; this includes the training rows,
    # so it is more optimistic than the held-out test metrics
    labeled_df = df[df["high_salary"].notna()]
    accuracy = (labeled_df["high_salary"] == labeled_df["predicted_high_salary"]).mean()
    print(f"\n✓ Overall accuracy on labeled data: {accuracy:.2%}")
else:
    df["predicted_high_salary"] = -1
    df["pred_proba_high_salary"] = 0.0
    print("⚠️  No predictions (model not trained)")

🔄 Generating predictions on the full dataset...
✓ Predictions added

📊 Prediction preview:
   salary_monthly  high_salary  predicted_high_salary  pred_proba_high_salary
0        1144.145            0                      0                0.018902
1        2160.000            1                      0                0.212084
2        2068.800            0                      0                0.039824
3             NaN            0                      0                0.043275
4        2080.000            0                      0                0.012916
5        1144.145            0                      0                0.025276
6        2240.000            1                      1                0.708790
7        1900.800            0                      0                0.084087
8        2200.000            1                      0                0.428402
9             NaN            0                      0                0.089906

✓ Overall accuracy

## Step 8 – Save the enriched dataset

We save the result for the dashboard (Phase 4):
- File: `data/enriched/hellowork_ml_enriched.csv`
- Contents: cleaned text, clusters, salary labels, predictions
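
Since the Phase 4 dashboard depends on this file, a small guard around the write can catch a missing column early. This is a sketch under assumptions: `REQUIRED_COLUMNS` lists the four columns added in this phase, and `save_enriched` is a hypothetical helper, demonstrated here on a temporary directory rather than the real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

REQUIRED_COLUMNS = [
    "job_cluster", "high_salary",
    "predicted_high_salary", "pred_proba_high_salary",
]

def save_enriched(frame: pd.DataFrame, output_path: Path) -> pd.DataFrame:
    """Check required columns, write the CSV, then read it back as a sanity check."""
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    output_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(output_path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(output_path, encoding="utf-8")
    if reloaded.shape != frame.shape:
        raise ValueError(f"Round trip changed the shape: {frame.shape} -> {reloaded.shape}")
    return reloaded

toy_enriched = pd.DataFrame({
    "job_cluster": [4, 0],
    "high_salary": [0, 1],
    "predicted_high_salary": [0, 1],
    "pred_proba_high_salary": [0.02, 0.71],
})
with tempfile.TemporaryDirectory() as tmp:
    toy_check = save_enriched(toy_enriched, Path(tmp) / "enriched" / "toy.csv")
print(toy_check.columns.tolist())
```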

In [29]:
# --- Save the enriched dataset ---
from pathlib import Path

output_path = Path("data/enriched/hellowork_ml_enriched.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(output_path, index=False, encoding="utf-8")

print("=" * 70)
print("✅ PHASE 3 COMPLETE")
print("=" * 70)
print(f"\n📁 Enriched file saved: {output_path}")
print(f"📊 Rows: {len(df)}")
print(f"📋 Columns: {len(df.columns)}")
print("\n🆕 New columns added:")
print("   • job_cluster (0-6): NMF topics")
print("   • high_salary (0/1): salary label (median)")
print("   • predicted_high_salary (0/1): model prediction")
print("   • pred_proba_high_salary: predicted probability")
print("\n✓ Ready for Phase 4 (Dashboard)")
print("=" * 70)

✅ PHASE 3 COMPLETE

📁 Enriched file saved: data\enriched\hellowork_ml_enriched.csv
📊 Rows: 1239
📋 Columns: 19

🆕 New columns added:
   • job_cluster (0-6): NMF topics
   • high_salary (0/1): salary label (median)
   • predicted_high_salary (0/1): model prediction
   • pred_proba_high_salary: predicted probability

✓ Ready for Phase 4 (Dashboard)
