# Part 1: Phishing Email/Message Detection Model

Objectives:
- Train a classification model (phishing vs. safe) on text data
- Evaluate its performance (precision, recall, F1, etc.)
- Save the artifacts (vectorizer + model)
- Generate `ref_data.csv` (vectorized reference data) for the later stages (monitoring / drift / retraining)


In [1]:
# Imports & configuration
import os
import re
import json
import joblib
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
    accuracy_score
)

RANDOM_SEED = 42

# Data paths
VALIDATION_DATA_PATH = "../data/Phishing_validation_emails.csv"
PROD_DATA_PATH = "../data/prod_data.csv"

# Output paths
ARTIFACT_DIR = "../artifacts"
DATA_OUT_DIR = "../data"
os.makedirs(ARTIFACT_DIR, exist_ok=True)
os.makedirs(DATA_OUT_DIR, exist_ok=True)

print("OK: dirs ready")

OK: dirs ready


In [2]:
# Load the validation data
df_validation = pd.read_csv(VALIDATION_DATA_PATH)
print("📊 Validation data:", df_validation.shape)

# Load the production data (user feedback)
if os.path.exists(PROD_DATA_PATH):
    df_prod = pd.read_csv(PROD_DATA_PATH, header=None, names=["Email Text", "model_prediction", "Email Type"])
    # Keep only the columns we need
    df_prod = df_prod[["Email Text", "Email Type"]]
    print("📊 Production data:", df_prod.shape)

    # Concatenate the two sources
    df = pd.concat([df_validation, df_prod], ignore_index=True)
    print("✅ Combined data:", df.shape)
else:
    print("⚠️  No prod_data.csv found, using the validation data only")
    df = df_validation

display(df.head())

print("\nColumns:", df.columns.tolist())
print("\nMissing values:\n", df.isna().sum())

# Label distribution
print("\nLabel distribution:\n", df["Email Type"].value_counts())

📊 Validation data: (2000, 2)
📊 Production data: (6, 2)
✅ Combined data: (2006, 2)


Unnamed: 0,Email Text,Email Type
0,"Dear Jordan, your subscription has been succes...",Safe Email
1,"Dear Casey, thank you for your purchase. Your ...",Safe Email
2,Congratulations! You've won a $3000 gift card....,Phishing Email
3,You have a new secure message from your bank. ...,Phishing Email
4,Your package delivery is pending. Please provi...,Phishing Email



Columns: ['Email Text', 'Email Type']

Missing values:
 Email Text    0
Email Type    0
dtype: int64

Label distribution:
 Email Type
Safe Email        1004
Phishing Email    1002
Name: count, dtype: int64


## Dataset overview: phishing detection

The dataset used in this project targets the **detection of fraudulent emails/messages (phishing)** from text data.

The validation file contains **2000 emails** split into two balanced classes:
- **Safe Email**: 1000 legitimate messages
- **Phishing Email**: 1000 fraudulent messages

After the 6 production feedback rows are merged in, the combined set holds 2006 emails (1004 Safe, 1002 Phishing).

### Data structure
The CSV file has **2 columns**:
- **Email Text**: the text content of the email
- **Email Type**: the label attached to the message (`Safe Email` or `Phishing Email`)

### Data quality
- The dataset contains no missing values
- The classes are almost perfectly balanced, which makes training and evaluating classification models easier
- The texts are short to medium in length and representative of real messages (notifications, bank alerts, confirmations, etc.)
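
The quality checks listed above can be reproduced with a few lines of standard-library Python. The sketch below is only illustrative: `check_quality` and the four-row `rows` sample are hypothetical and stand in for the real CSV content.

```python
from collections import Counter

def check_quality(rows):
    """Return (number of rows with a missing field, label counts)."""
    missing = sum(1 for text, label in rows if not text or not label)
    counts = Counter(label for _, label in rows if label)
    return missing, counts

# Hypothetical mini-sample in the same (Email Text, Email Type) layout.
rows = [
    ("Your package delivery is pending.", "Phishing Email"),
    ("Thank you for your purchase.", "Safe Email"),
    ("Click here to update your account.", "Phishing Email"),
    ("Please find the meeting notes attached.", "Safe Email"),
]

missing, counts = check_quality(rows)
ratio = min(counts.values()) / max(counts.values())  # 1.0 means perfectly balanced
print(missing, dict(counts), ratio)
```

A ratio close to 1.0 and a missing count of 0 correspond to the two properties claimed for this dataset.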

### Use in the project
This dataset serves as:
- the **training and validation dataset** for the phishing detection model
- the **reference dataset (`ref_data`)** for the monitoring, drift detection, and automatic retraining phase of the model in an MLOps approach

It is a suitable basis for a **functional prototype** and a complete demonstration of the life cycle of a fraud detection model built on text data.


In [3]:
# Basic cleaning (text normalization + label mapping)
def normalize_text(s: str) -> str:
    if not isinstance(s, str):
        return ""
    s = s.strip().lower()
    # Collapse runs of whitespace into a single space
    s = re.sub(r"\s+", " ", s)
    return s

# Standardize the column names (optional)
df = df.rename(columns={"Email Text": "text", "Email Type": "label_str"})

df["text"] = df["text"].apply(normalize_text)

# Label mapping: Phishing=1, Safe=0
label_map = {
    "Phishing Email": 1,
    "Safe Email": 0
}
df["label"] = df["label_str"].map(label_map)

# Check for unmapped labels
bad = df[df["label"].isna()]
print("Unmapped labels rows:", len(bad))
if len(bad) > 0:
    display(bad.head())

df["label"] = df["label"].astype(int)

display(df.head())


Unmapped labels rows: 0


Unnamed: 0,text,label_str,label
0,"dear jordan, your subscription has been succes...",Safe Email,0
1,"dear casey, thank you for your purchase. your ...",Safe Email,0
2,congratulations! you've won a $3000 gift card....,Phishing Email,1
3,you have a new secure message from your bank. ...,Phishing Email,1
4,your package delivery is pending. please provi...,Phishing Email,1


## Splitting the data into Train / Validation / Test (70% / 15% / 15%)
- **Train**: used to fit the model
- **Validation**: used for hyperparameter tuning, model selection, and early stopping
- **Test**: used only at the very end to estimate how well the model actually generalizes
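
The arithmetic behind a 70% / 15% / 15% split can be checked by hand. The sketch below assumes the rounding rule scikit-learn applies to a fractional `test_size` (the held-out part is rounded up and the rest goes to the other side). `split_sizes` is a hypothetical helper, not a library function.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of a fractional split: the held-out part is rounded up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 2006                                 # size of the combined dataset
n_train, n_tmp = split_sizes(n_total, 0.30)    # 70% train, 30% temporary
n_val, n_test = split_sizes(n_tmp, 0.50)       # halve the temporary part
print(n_train, n_val, n_test)                  # 1404 301 301
```

Splitting in two stages is what allows a three-way partition with a function that only produces two parts per call.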


In [4]:
# Split the data into Train / Validation / Test (70% / 15% / 15%)

X = df["text"].values      # Email texts (features)
y = df["label"].values     # Numeric labels (0 = safe, 1 = phishing)

# First split: 70% training, 30% temporary (validation + test)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X,
    y,
    test_size=0.30,              # 30% for validation + test
    random_state=RANDOM_SEED, 
    stratify=y                   # Preserve the class proportions
)

# Second split: 15% validation, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp,
    y_tmp,
    test_size=0.50,              # Split the 30% into two equal parts
    random_state=RANDOM_SEED,
    stratify=y_tmp               # Same label distribution
)

# Print the size of each set
print("Train:", len(X_train), "Val:", len(X_val), "Test:", len(X_test))


Train: 1404 Val: 301 Test: 301


In [5]:
# Build a baseline model (TF-IDF + logistic regression)
# A light, fast, interpretable model, well suited to a PoC and to deployment

model = Pipeline(steps=[
    (
        "tfidf",
        TfidfVectorizer(
            ngram_range=(1, 2),     # unigrams + bigrams
            min_df=2,               # ignore terms that are too rare
            max_features=30000      # cap the vocabulary size
        )
    ),
    (
        "clf",
        LogisticRegression(
            max_iter=1000,          # give the solver room to converge
            class_weight="balanced",# guard against class imbalance
            random_state=RANDOM_SEED
        )
    )
])

# Display the model object to check its structure
model




In [6]:
# Train the model
model.fit(X_train, y_train)
print("Training done")

Training done


## Model evaluation (reusable)

This step defines a generic evaluation function for binary classification models.
It measures model performance consistently across different datasets
(validation and test).

The metrics computed are:
- Accuracy
- Precision
- Recall
- F1-score
- Confusion matrix
- Detailed classification report
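
To make these metrics concrete, the sketch below derives them from the four cells of a binary confusion matrix, read as `[[tn, fp], [fn, tp]]` with phishing as the positive class. `binary_metrics` is a hypothetical helper and the counts are invented for illustration. They are not results from this notebook.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive (phishing) class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Invented counts: 140 safe emails kept, 10 false alarms,
# 6 phishing emails missed, 145 phishing emails caught.
example = binary_metrics(tn=140, fp=10, fn=6, tp=145)
print({name: round(value, 4) for name, value in example.items()})
```

Precision penalizes false alarms on legitimate mail, while recall penalizes missed phishing. F1 is the harmonic mean of the two.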

In [7]:
def eval_model(pipeline, X_eval, y_eval, name="Eval"):
    # Predict the classes (0 / 1)
    y_pred = pipeline.predict(X_eval)

    # Predict the probabilities
    # LogisticRegression supports predict_proba
    # [:, 1] is the probability of the positive class (phishing)
    y_proba = pipeline.predict_proba(X_eval)[:, 1]

    # Compute the accuracy
    acc = accuracy_score(y_eval, y_pred)

    # Compute precision, recall and F1 (binary classification)
    p, r, f1, _ = precision_recall_fscore_support(
        y_eval, y_pred, average="binary"
    )

    # Print the main results
    print(f"== {name} dataset ==")
    print(f"Accuracy : {acc:.4f}")
    print(f"Precision: {p:.4f}")
    print(f"Recall   : {r:.4f}")
    print(f"F1       : {f1:.4f}")

    # Print the confusion matrix
    print("\nConfusion matrix:\n", confusion_matrix(y_eval, y_pred))

    # Detailed per-class classification report
    print(
        "\nClassification report:\n",
        classification_report(
            y_eval,
            y_pred,
            target_names=["Safe", "Phishing"]
        )
    )

    # Return the metrics as a dictionary
    return {
        "accuracy": acc,
        "precision": p,
        "recall": r,
        "f1": f1
    }


# Evaluation on the validation set
val_metrics = eval_model(model, X_val, y_val, name="Validation")

# Final evaluation on the test set
test_metrics = eval_model(model, X_test, y_test, name="Test")


== Validation dataset ==
Accuracy : 1.0000
Precision: 1.0000
Recall   : 1.0000
F1       : 1.0000

Confusion matrix:
 [[150   0]
 [  0 151]]

Classification report:
               precision    recall  f1-score   support

        Safe       1.00      1.00      1.00       150
    Phishing       1.00      1.00      1.00       151

    accuracy                           1.00       301
   macro avg       1.00      1.00      1.00       301
weighted avg       1.00      1.00      1.00       301

== Test dataset ==
Accuracy : 1.0000
Precision: 1.0000
Recall   : 1.0000
F1       : 1.0000

Confusion matrix:
 [[151   0]
 [  0 150]]

Classification report:
               precision    recall  f1-score   support

        Safe       1.00      1.00      1.00       151
    Phishing       1.00      1.00      1.00       150

    accuracy                           1.00       301
   macro avg       1.00      1.00      1.00       301
weighted avg       1.00      1.00      1.00       301



### Comments on the results

The perfect scores (accuracy, precision, recall, and F1 = 1.00) on the validation and test sets are explained mainly by the nature of the dataset.

The dataset is heavily cleaned, balanced (about 50% Safe / 50% Phishing), and contains lexical signals that discriminate very strongly between the two classes. In this setting, a simple linear model (TF-IDF + logistic regression) separates the classes without difficulty.

These results do not reflect real-world conditions and should not be read as evidence that the model will generalize in production.
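
One way to probe this explanation is to measure how much of the held-out text the model has already seen during training. The sketch below is an illustration and not part of the original analysis: `overlap_ratio` is a hypothetical helper, and the two short lists stand in for `X_train` and `X_test`.

```python
def overlap_ratio(train_texts, test_texts):
    """Fraction of test texts that also appear verbatim in the training texts."""
    if not test_texts:
        return 0.0
    seen = set(train_texts)
    return sum(1 for text in test_texts if text in seen) / len(test_texts)

# Invented stand-ins for X_train and X_test (already normalized).
train_texts = [
    "your package delivery is pending. please provide your details.",
    "thank you for your purchase.",
    "click here to update your account.",
]
test_texts = [
    "thank you for your purchase.",
    "please find the project updates attached.",
]

print(overlap_ratio(train_texts, test_texts))  # 0.5: one of the two test texts was seen
```

A ratio close to 1 on the real splits would suggest templated or duplicated messages, which would support reading the perfect scores as a property of the dataset rather than of the model.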


### Interpreting the phishing features

This analysis identifies the terms (words or n-grams) that contribute most to predicting the *Phishing* class in the logistic regression model.

In a linear model, a large positive coefficient means that the presence of the term strongly increases the probability that an email is classified as phishing.

The 20 terms displayed are therefore the most discriminating lexical signals the model learned from the training data.

This step is useful for:
- understanding the model's behavior,
- detecting excessive reliance on certain keywords,
- designing harder datasets (hard negatives),
- assessing how robust the model is on more realistic emails.
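
The role of a coefficient is easier to see on a toy example. The sketch below applies the logistic regression formula, the sigmoid of the intercept plus the weighted sum of TF-IDF values, to two invented emails. The weights, the TF-IDF values, and the `phishing_probability` helper are all hypothetical and chosen only to show the mechanics.

```python
import math

def phishing_probability(tfidf_values, weights, intercept=0.0):
    """Logistic regression: sigmoid of intercept + sum(weight * feature value)."""
    z = intercept + sum(weights.get(term, 0.0) * value
                        for term, value in tfidf_values.items())
    return 1.0 / (1.0 + math.exp(-z))

# Invented weights: positive pushes toward phishing, negative toward safe.
weights = {"click": 1.7, "account": 1.7, "meeting": -1.1}
email_a = {"click": 0.5, "account": 0.5}   # two phishing-leaning terms
email_b = {"meeting": 0.8}                 # one safe-leaning term

p_a = phishing_probability(email_a, weights)
p_b = phishing_probability(email_b, weights)
print(round(p_a, 3), round(p_b, 3))
```

Terms absent from the vocabulary contribute nothing, so an email made only of unseen words gets a score driven by the intercept alone.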

In [8]:
# Extract the weights learned by the logistic regression
# Each weight corresponds to one TF-IDF term
coef = model.named_steps["clf"].coef_[0]

# Get the list of terms (vocabulary) from the TF-IDF step
features = model.named_steps["tfidf"].get_feature_names_out()

# Pair each term with its weight,
# then sort by decreasing weight (strongest phishing signal first)
top_phishing = sorted(
    zip(features, coef),
    key=lambda x: x[1],
    reverse=True
)[:20]

# Display the 20 features most indicative of phishing
top_phishing


[('to', np.float64(2.0029217671225186)),
 ('account', np.float64(1.6886033400394567)),
 ('click', np.float64(1.670517643134975)),
 ('your account', np.float64(1.4097469907875233)),
 ('your', np.float64(1.3757531088808912)),
 ('delivery', np.float64(1.2665123488232362)),
 ('avoid', np.float64(1.224151686750376)),
 ('to avoid', np.float64(1.224151686750376)),
 ('click here', np.float64(1.1739157673372602)),
 ('here to', np.float64(1.1739157673372602)),
 ('continue', np.float64(1.1616019880148083)),
 ('to continue', np.float64(1.1616019880148083)),
 ('in', np.float64(1.1126328223179331)),
 ('information', np.float64(1.0281240365267137)),
 ('information to', np.float64(1.0281240365267137)),
 ('service', np.float64(1.0184463143315718)),
 ('update your', np.float64(1.0184463143315718)),
 ('unusual', np.float64(1.0017196348015858)),
 ('here', np.float64(0.9008919469304524)),
 ('click the', np.float64(0.8854151469887244))]

In [9]:
top_safe = sorted(
    zip(features, coef),
    key=lambda x: x[1]
)[:20]

# Display the 20 features most indicative of safe emails
top_safe

[('for', np.float64(-2.016023477035834)),
 ('the', np.float64(-1.3496909500376615)),
 ('hi', np.float64(-1.1960305625736416)),
 ('meeting', np.float64(-1.130525898569252)),
 ('for your', np.float64(-1.1186181255230823)),
 ('thank', np.float64(-1.1186181255230823)),
 ('thank you', np.float64(-1.1186181255230823)),
 ('you for', np.float64(-1.1186181255230823)),
 ('let', np.float64(-1.106445356816927)),
 ('on', np.float64(-1.0199201737541774)),
 ('project', np.float64(-1.0199201737541774)),
 ('the project', np.float64(-1.0199201737541774)),
 ('find', np.float64(-0.9830573758273625)),
 ('please find', np.float64(-0.9830573758273625)),
 ('be', np.float64(-0.9360075794941729)),
 ('updates', np.float64(-0.9317683359497109)),
 ('dear', np.float64(-0.8923966375197644)),
 ('reminder', np.float64(-0.859621885043868)),
 ('at', np.float64(-0.8504949108711034)),
 ('you', np.float64(-0.8504768374972324))]

In [10]:
hard_samples = [
    # Phishing content, with the strong keywords deliberately removed
    "Please review the attached document regarding your recent request.",
    "We noticed an unusual activity related to your profile. More details inside.",
    
    # Legitimate (safe) content, with typical phishing keywords added
    "Thank you for your order, please click here to see the invoice.",
    "Your account information has been updated successfully."
]

# Predict the classes (0 = safe, 1 = phishing)
hard_pred = model.predict(hard_samples)

# Predict the probabilities of the phishing class
hard_proba = model.predict_proba(hard_samples)[:, 1]

# Print the result for each email
for t, y, p in zip(hard_samples, hard_pred, hard_proba):
    print("-" * 60)
    print(t)
    print(
        "Pred:",
        "phishing" if y == 1 else "safe",
        "| proba:",
        round(float(p), 3)
    )


------------------------------------------------------------
Please review the attached document regarding your recent request.
Pred: safe | proba: 0.365
------------------------------------------------------------
We noticed an unusual activity related to your profile. More details inside.
Pred: phishing | proba: 0.822
------------------------------------------------------------
Thank you for your order, please click here to see the invoice.
Pred: safe | proba: 0.313
------------------------------------------------------------
Your account information has been updated successfully.
Pred: phishing | proba: 0.823


### Analysis of the hard samples
**These results expose the model's limits rather than confirm its effectiveness: two of the four hard samples are misclassified.**
The first phishing-style message, stripped of strong keywords, is predicted safe (0.365), and the legitimate account-update notice is predicted phishing (0.823). The model does not rely solely on the explicit presence of keywords, but it remains strongly influenced by certain lexical signals.

The model coefficients show that detection rests mainly on specific keywords (*click*, *account*, *update*, etc.).

This dependence makes the model vulnerable to fraudulent emails that deliberately avoid these terms, and to legitimate emails that use similar wording.

These artificially constructed examples show that the model can be fooled, which justifies monitoring and continuous retraining in production.

#### Feature analysis and link to the rest of the project

The logistic regression weights show that phishing detection rests mainly on **specific keywords and expressions** such as *click*, *account*, *update*, or *click here*.
These terms are strong signals and explain the excellent performance on the dataset studied.

However, this reliance on **surface-level lexical features** leaves the model potentially vulnerable as phishing strategies evolve and attackers avoid or alter these keywords.

This is why, with production use in mind, the project includes a **user feedback mechanism and automatic retraining of the model via an AI agent**, so that it can adapt progressively to new forms of fraudulent messages and depend less on fixed patterns.
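
As a rough sketch of the feedback side, the snippet below appends user feedback rows to a headerless CSV with three columns, matching the way `prod_data.csv` is read at the start of this notebook (`Email Text`, `model_prediction`, `Email Type`). The `append_feedback` helper is hypothetical, and storing the prediction as a label string is an assumption, since the notebook only shows the column names.

```python
import csv
import os
import tempfile

def append_feedback(path, text, model_prediction, user_label):
    """Append one feedback row (no header) to the production feedback CSV."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([text, model_prediction, user_label])

# Demo in a temporary directory so that no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "prod_data.csv")
    append_feedback(demo_path,
                    "Your account information has been updated successfully.",
                    "Phishing Email", "Safe Email")   # user corrects a false alarm
    append_feedback(demo_path,
                    "Please review the attached document regarding your recent request.",
                    "Safe Email", "Phishing Email")   # user reports a missed phishing
    with open(demo_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

print(len(rows), rows[0][2])  # 2 Safe Email
```

Rows where the prediction and the user label disagree are the most valuable for retraining, because they are the cases the current model gets wrong.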


In [11]:
# Save the artifacts (model + metrics)
# Since we use a Pipeline, it is simpler to save the whole pipeline
# Pipeline = TF-IDF + Logistic Regression
artifact_path = os.path.join(ARTIFACT_DIR, "phishing_tfidf_logreg.joblib")
joblib.dump(model, artifact_path)
print("Saved:", artifact_path)

# Save the evaluation metrics as JSON for experiment tracking
metrics_path = os.path.join(ARTIFACT_DIR, "metrics.json")
with open(metrics_path, "w", encoding="utf-8") as f:
    json.dump(
        {"val": val_metrics, "test": test_metrics},
        f,
        ensure_ascii=False,
        indent=2
    )

print("Saved:", metrics_path)


Saved: ../artifacts/phishing_tfidf_logreg.joblib


Saved: ../artifacts/metrics.json


### Generating the reference data (ref_data.csv)

This step transforms the dataset texts into TF-IDF vectors with the fitted vectorizer, then converts these representations into a feature table.

The resulting `ref_data.csv` file contains:
- the explanatory variables derived from TF-IDF,
- the target variable (`target`).
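
Each TF-IDF column corresponds to one word or word pair. The sketch below mimics, in plain Python, how word unigrams and bigrams are extracted under the vectorizer settings used here (lowercasing, tokens of two or more word characters, `ngram_range=(1, 2)`). It is a simplified illustration: `word_ngrams` is a hypothetical helper, and it leaves out the `min_df` and `max_features` filtering and the TF-IDF weighting itself.

```python
import re

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # words of two or more characters

def word_ngrams(text, max_n=2):
    """Lowercase, tokenize, then return all word n-grams for n = 1..max_n."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    ngrams = []
    for n in range(1, max_n + 1):
        ngrams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return ngrams

print(word_ngrams("Click here to verify"))
# ['click', 'here', 'to', 'verify', 'click here', 'here to', 'to verify']
```

Because bigrams are built from adjacent tokens, a pair such as `click here` becomes its own feature alongside `click` and `here`.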

This data serves as the **reference** for monitoring the model after deployment, in particular for detecting *data drift* and *model drift*.
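
The notebook does not say which drift statistic is used downstream, so the sketch below shows only one simple possibility: comparing the mean value of each TF-IDF feature between the reference data and a recent batch. The helpers `feature_means` and `mean_shift`, and the sample rows, are hypothetical.

```python
def feature_means(rows, names):
    """Mean TF-IDF value of each feature over a list of {feature: value} rows."""
    return {name: sum(row.get(name, 0.0) for row in rows) / len(rows)
            for name in names}

def mean_shift(reference, current):
    """Absolute change in the mean of every feature, largest shift first."""
    names = sorted({name for row in reference + current for name in row})
    ref_means = feature_means(reference, names)
    cur_means = feature_means(current, names)
    shifts = {name: abs(cur_means[name] - ref_means[name]) for name in names}
    return sorted(shifts.items(), key=lambda item: item[1], reverse=True)

# Invented sparse rows: only non-zero TF-IDF values are stored.
reference = [{"click": 0.6, "account": 0.4}, {"meeting": 0.7}]
current = [{"invoice": 0.8}, {"invoice": 0.6, "account": 0.2}]

for name, shift in mean_shift(reference, current):
    print(f"{name}: {shift:.2f}")
```

Features whose mean moves the most are natural candidates for closer inspection. A production setup would more likely rely on a proper statistical test or a dedicated monitoring library.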

In [12]:
# Extract the TF-IDF component from the pipeline
tfidf: TfidfVectorizer = model.named_steps["tfidf"]

# Use the whole dataset as reference data
# (using only the training set is also possible)
X_ref_text = df["text"].values
y_ref = df["label"].values

# Transform the texts into TF-IDF vectors (sparse matrix)
X_ref_vec = tfidf.transform(X_ref_text)
feature_names = tfidf.get_feature_names_out()

# Convert to a sparse DataFrame
# Caution: the DataFrame can be wide and large
ref_df = pd.DataFrame.sparse.from_spmatrix(
    X_ref_vec,
    columns=feature_names
)

# Add the target variable (label)
ref_df["target"] = y_ref

# Save the reference data for monitoring / drift
ref_path = os.path.join(DATA_OUT_DIR, "ref_data.csv")
ref_df.to_csv(ref_path, index=False)

print("Saved ref_data:", ref_path)
print("ref_data shape:", ref_df.shape)


Saved ref_data: ../data/ref_data.csv
ref_data shape: (2006, 501)


In [13]:
# Quick check: load the model and predict on a sample
loaded = joblib.load(os.path.join(ARTIFACT_DIR, "phishing_tfidf_logreg.joblib"))

sample_texts = [
    "Dear user, your account will be suspended, click here to verify immediately.",
    "Bonjour, voici le compte rendu de la réunion de demain. Merci."
]

pred = loaded.predict(sample_texts)
proba = loaded.predict_proba(sample_texts)[:, 1]

for t, yhat, p in zip(sample_texts, pred, proba):
    print("-" * 60)
    print("TEXT:", t)
    print("PRED:", "phishing(1)" if yhat == 1 else "safe(0)")
    print("PROBA(phishing):", round(float(p), 4))


------------------------------------------------------------
TEXT: Dear user, your account will be suspended, click here to verify immediately.
PRED: phishing(1)
PROBA(phishing): 0.8573
------------------------------------------------------------
TEXT: Bonjour, voici le compte rendu de la réunion de demain. Merci.
PRED: safe(0)
PROBA(phishing): 0.4464


## Summary: Part 1, detecting fraudulent emails/messages

In this first part, we built a **phishing detection model based on text data**.
The dataset contains about 2000 emails (2006 once production feedback is merged in), balanced between legitimate and fraudulent messages, with no missing values.

A **TF-IDF + logistic regression** model was trained after cleaning and normalizing the texts.
Performance on the validation and test sets is very high, which is explained by the presence of **strong lexical signals** (keywords typical of phishing).

An analysis of the model coefficients shows that the decision rests mainly on these expressions, which highlights the **need for monitoring and continuous retraining** in a real-world setting.
The model artifacts and a reference dataset (`ref_data`) were generated as the basis for the next stages (API, monitoring, retraining agent).
