# Phase 3 – Modeling & ML Enrichment

Goal: enrich the cleaned dataset (Phase 2) with
- **NMF topics** to group job postings by theme
- **Salary class** (high/low) for analysis and the dashboard
- **Enriched dataset** ready for Phase 4

## Step 1 – Import the libraries

- pandas / numpy
- scikit-learn (TF-IDF, NMF, LogisticRegression, metrics)
- pathlib (Path)
- tqdm for nicer progress display
- Warnings filtered to keep the notebook readable

## Step 2 – Load the cleaned dataset (Phase 2)

We start from the file produced in Phase 2: `data/processed/hellowork_cleaned.csv`.

In [1]:
# --- Imports ---
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Helpers for cross-cell analysis (Pylance)
df: pd.DataFrame | None = None
X_text = None
clf: LogisticRegression | None = None
X = None
median_salary = 0.0
y = None

print("✅ Imports OK")


✅ Imports OK


## Step 3 – Harmonize the useful columns

We create or normalize the columns the ML steps expect:
- `description_clean` (text ready for TF-IDF)
- `salary_monthly` (numeric monthly value)
- encoded categorical columns when present (`sector_enc`, `location_enc`, etc.)

In [2]:
# --- Load the cleaned dataset ---
clean_path = Path("data/processed/hellowork_cleaned.csv")
assert clean_path.exists(), "The cleaned file from Phase 2 was not found."

df = pd.read_csv(clean_path, encoding="utf-8")
print(f"📊 Shape: {df.shape}")
print(f"📋 Columns: {df.columns.tolist()}")

df.head(3)

📊 Shape: (1219, 14)
📋 Columns: ['Sector', 'Job_Title', 'Company', 'Location', 'Contract', 'Salary', 'Description', 'Publication_Date', 'URL', 'Top_Keywords', 'Sector_enc', 'Location_enc', 'Contract_enc', 'Company_enc']


| | Sector | Job_Title | Company | Location | Contract | Salary | Description | Publication_Date | URL | Top_Keywords | Sector_enc | Location_enc | Contract_enc | Company_enc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Agriculture • Pêche | Alternance - Chargé·e de Formation H/F | Remy Cointreau | Paris - 75 | Alternance | 486,49 - 1 801,80 € / mois | Nous recherchons un·e candidat·e : Alternance... | | https://www.hellowork.com/fr-fr/emplois/642118... | formation,formations,des,de,groupe,aider,créat... | 0 | 0 | 0 | 0 |
| 1 | BTP | Alternance-Gestionnaire Paie H/F | Lafarge France | Issy-les-Moulineaux - 92 | Alternance | 486,49 - 1 801,80 € / mois | Pourquoi nous rejoindre ? > Participer à la t... | | https://www.hellowork.com/fr-fr/emplois/729761... | paie,de,et,la,des,groupe,processus,ses | 1 | 1 | 0 | 1 |
| 2 | BTP | Ouvrier Polyvalent en Menuiserie H/F | Groupe Actual | Auterive - 31 | Intérim | Estimation → 12,36 - 13,50 € / heure | Nous recherchons un(e) menuisier(e) expériment... | | https://www.hellowork.com/fr-fr/emplois/735245... | recherchons,ayant,un,nous,avons,connaissance,c... | 1 | 2 | 1 | 2 |


## Step 4 – Create the salary target (High vs Low)

We use the median monthly salary to define `high_salary` (1 = above the median).
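
The median rule can be sketched as follows. This is a minimal illustration, assuming the salary column is numeric or can be coerced to numbers; the helper name `make_high_salary` is hypothetical and not part of the notebook. It keeps rows with a missing salary as `<NA>`, whereas a plain `>` comparison would label them 0.

```python
import numpy as np
import pandas as pd


def make_high_salary(salary: pd.Series) -> pd.Series:
    """Label each row 1 if its salary is strictly above the median, else 0.

    Rows without a usable salary stay <NA> (hypothetical helper, for illustration).
    """
    # Non-numeric entries become NaN instead of raising
    numeric = pd.to_numeric(salary, errors="coerce")
    # Series.median() skips NaN by default
    median = numeric.median()
    # Nullable integer dtype so that missing labels can be represented
    labels = (numeric > median).astype("Int64")
    labels[numeric.isna()] = pd.NA
    return labels


toy = pd.Series([1000.0, 2000.0, np.nan, 3000.0])
labels = make_high_salary(toy)
print(labels)
```

Rows labelled `<NA>` can then be excluded from training rather than counted as low salaries.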

In [3]:
# --- Harmonize column names ---
# rename() returns a new DataFrame, which avoids SettingWithCopyWarning

col_map = {
    "Description_Clean": "description_clean",
    "Salary_Monthly": "salary_monthly",
    "Sector_enc": "sector_enc",
    "Location_enc": "location_enc",
    "Contract_enc": "contract_enc",
    "Company_enc": "company_enc",
}

df = df.rename(columns={k: v for k, v in col_map.items() if k in df.columns})

# Guarantee the minimal columns
if "description_clean" not in df.columns:
    # Fall back to the raw Description column, or to empty strings if it is absent
    raw_desc = df["Description"] if "Description" in df.columns else pd.Series("", index=df.index)
    df["description_clean"] = raw_desc.fillna("").astype(str)
if "salary_monthly" not in df.columns:
    # Fallback: copy Salary if present. These are raw strings, not numbers,
    # so they still need parsing before any numeric operation.
    df["salary_monthly"] = df.get("Salary", np.nan)

print(df[[c for c in df.columns if "description" in c.lower() or "salary" in c.lower()]].head(3))

                                 Salary  \
0            486,49 - 1 801,80 € / mois   
1            486,49 - 1 801,80 € / mois   
2  Estimation → 12,36 - 13,50 € / heure   

                                         Description  \
0  Nous recherchons un·e candidat·e :  Alternance...   
1  Pourquoi nous rejoindre ?  > Participer à la t...   
2  Nous recherchons un(e) menuisier(e) expériment...   

                                   description_clean  \
0  Nous recherchons un·e candidat·e :  Alternance...   
1  Pourquoi nous rejoindre ?  > Participer à la t...   
2  Nous recherchons un(e) menuisier(e) expériment...   

                         salary_monthly  
0            486,49 - 1 801,80 € / mois  
1            486,49 - 1 801,80 € / mois  
2  Estimation → 12,36 - 13,50 € / heure  


## Step 5 – Topic clustering with NMF

Pipeline:
1) TF-IDF on `description_clean`
2) NMF (5 topics) to extract the themes
3) Assign the dominant topic to `job_cluster`
4) Top words per topic for interpretation
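
Steps 3 and 4 come down to two array operations, sketched here on toy matrices. The values of `W` (document-topic weights), `H` (topic-term weights) and the three-word vocabulary are made up for illustration; in the notebook they would come from the fitted NMF model and the TF-IDF vectorizer.

```python
import numpy as np

# Toy document-topic weights: 3 documents x 2 topics (made-up values)
W = np.array([[0.9, 0.1],
              [0.2, 0.7],
              [0.0, 0.3]])

# Toy topic-term weights: 2 topics x 3 terms (made-up values)
H = np.array([[0.5, 0.0, 0.2],
              [0.1, 0.8, 0.3]])
terms = np.array(["cours", "garde", "clients"])  # hypothetical vocabulary

# Step 3: the dominant topic of a document is the column with the largest weight
job_cluster = W.argmax(axis=1)

# Step 4: argsort is ascending, so reverse it to list the strongest terms first
top_terms = [terms[np.argsort(row)[::-1][:2]].tolist() for row in H]

print(job_cluster)
print(top_terms)
```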

In [4]:
# --- Binary salary target ---
# Requires a numeric salary_monthly; raw salary strings raise a TypeError here
median_salary = df["salary_monthly"].median(skipna=True)
df["high_salary"] = (df["salary_monthly"] > median_salary).astype(int)

print(f"Median monthly salary: {median_salary:.2f}")
print(df["high_salary"].value_counts())

TypeError: Cannot convert ['486,49 - 1\u202f801,80 € / mois' '486,49 - 1\u202f801,80 € / mois'
 'Estimation → 12,36 - 13,50 € / heure' ... '11,88 - 12 € / heure'
 '11,89 € / heure' 'Pas de salaire renseigné'] to numeric
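
The `TypeError` occurs because `salary_monthly` still holds raw strings such as `'11,89 € / heure'`. One possible way to convert them is sketched below. It rests on several assumptions that are not in the notebook: a range is reduced to its midpoint, an hourly rate is scaled by 151.67 hours per month (a 35-hour week), a yearly amount is divided by 12, and the names `parse_salary_monthly`, `HOURS_PER_MONTH` and `PERIOD_TO_MONTHLY` are hypothetical.

```python
import math
import re

HOURS_PER_MONTH = 151.67  # assumption: 35-hour week, 35 * 52 / 12

# Assumed multipliers from the quoted period to a monthly amount
PERIOD_TO_MONTHLY = {"heure": HOURS_PER_MONTH, "mois": 1.0, "an": 1.0 / 12.0}

# French-formatted numbers: optional space-separated thousands, decimal comma
_NUMBER = re.compile(r"\d{1,3}(?:\s\d{3})+(?:,\d+)?|\d+(?:,\d+)?")
_PERIOD = re.compile(r"/\s*(heure|mois|an)\b")


def parse_salary_monthly(raw):
    """Return an estimated monthly salary as a float, or NaN if nothing usable is found."""
    if not isinstance(raw, str):
        return math.nan
    period = _PERIOD.search(raw)
    if period is None:
        return math.nan
    # Only read amounts that appear before the "/ period" suffix
    amounts = _NUMBER.findall(raw[: period.start()])
    if not amounts:
        return math.nan
    # \s also matches the narrow no-break space (U+202F) used as thousands separator
    values = [float(re.sub(r"\s", "", a).replace(",", ".")) for a in amounts]
    midpoint = (min(values) + max(values)) / 2
    return midpoint * PERIOD_TO_MONTHLY[period.group(1)]


for example in [
    "486,49 - 1\u202f801,80 € / mois",
    "Estimation → 12,36 - 13,50 € / heure",
    "Pas de salaire renseigné",
]:
    print(example, "->", parse_salary_monthly(example))
```

Applied with something like `df["Salary"].map(parse_salary_monthly)`, this would give a numeric column on which `median()` and the comparison can run. Whether the midpoint and the hours-per-month factor suit the analysis is a modeling choice.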

## Step 6 – Salary classification (Logistic Regression)

- Input data: TF-IDF on `description_clean`
- Target: `high_salary`
- Metrics: classification report + ROC AUC
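
As a reminder of what the ROC AUC measures: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The pure-Python sketch below computes it that way on toy data. The function name `roc_auc` is hypothetical; the notebook itself relies on `roc_auc_score` from scikit-learn.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        # With a single class there are no pairs to compare
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

This pairwise version is quadratic in the number of rows, so it is meant for understanding the metric rather than for large datasets.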

In [5]:
# --- TF-IDF + NMF ---
text_col = "description_clean"
texts = df[text_col].fillna("").astype(str)

vect = TfidfVectorizer(max_features=500, min_df=2, max_df=0.9)
X_text = vect.fit_transform(texts)

nmf = NMF(n_components=5, random_state=42, max_iter=400)
W = nmf.fit_transform(X_text)

df["job_cluster"] = W.argmax(axis=1)
print("NMF clustering done. Cluster distribution:")
print(df["job_cluster"].value_counts().sort_index())

# Topic interpretation
def print_topics(nmf_model, feature_names, n_top_words=10):
    for i, comp in enumerate(nmf_model.components_):
        # argsort is ascending, so the highest-weighted term is printed last
        terms = [feature_names[j] for j in comp.argsort()[-n_top_words:]]
        print(f"Topic {i}: {', '.join(terms)}")

print("\nTop words per topic:")
print_topics(nmf, vect.get_feature_names_out())

NMF clustering done. Cluster distribution:
job_cluster
0    193
1    188
2     99
3    227
4    512
Name: count, dtype: int64

Top words per topic:
Topic 0: pourrez, quelques, collège, lycée, ou, cours, niveau, vos, vous, élèves
Topic 1: aventure, déplacement, engagés, 14h, semaine, tes, tu, produits, ton, esallia
Topic 2: un, 18, salaire, diplômes, vos, 14, missions, ouihelp, vous, aide
Topic 3: intelligente, école, domicile, une, ou, mission, vous, activités, garde, enfants
Topic 4: au, clients, votre, gestion, une, dans, un, du, le, vous


### b) Model training

In [6]:
# --- Salary classification ---
X = X_text  # TF-IDF matrix computed earlier
# Recompute y, just in case
median_salary = df["salary_monthly"].median(skipna=True)
y = (df["salary_monthly"] > median_salary).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=500)
clf.fit(X_train, y_train)

preds = clf.predict(X_test)
probs = clf.predict_proba(X_test)[:, 1]
report = classification_report(y_test, preds, digits=3)
auc = roc_auc_score(y_test, probs)

print(report)
print(f"AUC: {auc:.4f}")

TypeError: Cannot convert ['486,49 - 1\u202f801,80 € / mois' '486,49 - 1\u202f801,80 € / mois'
 'Estimation → 12,36 - 13,50 € / heure' ... '11,88 - 12 € / heure'
 '11,89 € / heure' 'Pas de salaire renseigné'] to numeric

## Step 7 – Attach the predictions

We generate predictions for the whole dataset and add:
- `pred_high_salary` (predicted class)
- `pred_proba_high_salary` (predicted probability of the high class)

In [7]:
# --- Full-dataset predictions ---
df["pred_high_salary"] = clf.predict(X)
df["pred_proba_high_salary"] = clf.predict_proba(X)[:, 1]

print(df[["salary_monthly", "high_salary", "pred_high_salary", "pred_proba_high_salary"]].head(5))

NameError: name 'clf' is not defined

## Step 8 – Save the enriched dataset

We save the result for the dashboard (Phase 4):
- File: `data/enriched/hellowork_ml_enriched.csv`
- Contains: cleaned text, clusters, salary labels, predictions

In [None]:
# --- Save ---
output_path = Path("data/enriched/hellowork_ml_enriched.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(output_path, index=False, encoding="utf-8")

print("="*60)
print("✓ PHASE 3 COMPLETE")
print(f"Enriched dataset saved: {output_path}")
print(f"Rows: {len(df)}")
print("="*60)

✓ PHASE 3 COMPLETE
Enriched dataset saved: data\enriched\hellowork_ml_enriched.csv
Rows: 1219

