## Deliverable 1

Set of scripts implementing the three approaches (classical, advanced custom model, advanced BERT model).
This deliverable also covers experiment management with MLflow (experiment tracking and model saving).

## Loading the data

In [None]:
import kagglehub
import pandas as pd

# Download latest version of the Sentiment140 dataset
path = kagglehub.dataset_download("kazanova/sentiment140")

print("Path to dataset files:", path)

# The dataset typically includes a CSV file; specify its path
# Adjust the file name if it's different in your downloaded dataset
csv_file = f"{path}/training.1600000.processed.noemoticon.csv"

# Load the CSV file into a pandas DataFrame
# The Sentiment140 dataset has no header, so we specify the column names
columns = ['target', 'id', 'date', 'flag', 'user', 'text']
df_raw = pd.read_csv(csv_file, encoding='ISO-8859-1', names=columns)
df = df_raw.copy()



Path to dataset files: /kaggle/input/sentiment140
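
The CSV has no header row, which is why the column names are passed explicitly above. As a small illustration of that idea using only the standard library, the sketch below reads two made-up rows (not taken from the dataset) in the same six-column layout; with the real file one would presumably open it with `encoding="ISO-8859-1"` as in the pandas call.

```python
import csv
import io

# Column names supplied by hand, because the file itself carries no header.
COLUMNS = ["target", "id", "date", "flag", "user", "text"]

# Two hypothetical rows in the same header-less layout (illustrative only).
sample = io.StringIO(
    '"0","1","Mon Apr 06 2009","NO_QUERY","user_a","feeling tired today"\n'
    '"4","2","Mon Apr 06 2009","NO_QUERY","user_b","what a great morning"\n'
)

# fieldnames= tells DictReader not to consume the first row as a header.
rows = list(csv.DictReader(sample, fieldnames=COLUMNS))
targets = [int(row["target"]) for row in rows]
print(targets)
```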


# Classical approach 1: Bag of words

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import spacy

In [None]:
import spacy
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Load the spaCy model
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])



def lemmatize_and_clean(texts, use_stemming=True, use_lemmatization=True, batch_size=1000):
    """
    Process texts with spaCy (lemmatization) and/or NLTK (stemming),
    removing punctuation and stop words.

    Parameters:
    - texts: list of texts to process
    - use_stemming: if True, apply stemming (default: True)
    - use_lemmatization: if True, apply lemmatization (default: True)
    - batch_size: number of texts passed to spaCy per batch (default: 1000)

    Returns:
    - List of cleaned texts
    """
    # Initialize the PorterStemmer
    stemmer = PorterStemmer()
    cleaned_texts = []
    for doc in nlp.pipe(texts, batch_size=batch_size, disable=['parser', 'ner']):
        tokens = []
        for token in doc:
            if token.is_punct or token.is_stop:
                continue
            word = token.text
            if use_lemmatization:
                word = token.lemma_
            if use_stemming:
                word = stemmer.stem(word)
            tokens.append(word)
        cleaned_text = ' '.join(tokens)
        cleaned_texts.append(cleaned_text)

    return cleaned_texts

# Example usage
texts = ["The quick brown foxes are running!", "This is another test sentence."]
result = lemmatize_and_clean(texts)
print(result)
# Output: ['quick brown fox run', 'test sentenc'] (the stemmer truncates "sentence")

['quick brown fox run', 'test sentenc']


In [None]:
#cleaned_text=lemmatize_and_clean(df['text'])

In [None]:
'''
import pandas as pd
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Create a DataFrame with cleaned text and target
output_df = pd.DataFrame({
    'cleaned_text': cleaned_text,
    'target': df['target']
})

# Save to CSV
output_path = '/content/drive/MyDrive/Colab_Notebooks/PROJET7/cleandata.csv'
output_df.to_csv(output_path, index=False)

print(f"Data saved to {output_path}")
'''


## The text was preprocessed and saved to a separate file (the preprocessing takes a long time to run)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

df_cleaned = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/PROJET7/cleandata.csv')

display(df_cleaned.head(5))

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,cleaned_text,target
0,@switchfoot http://twitpic.com/2y1zl awww bumm...,0
1,upset updat facebook text cri result school ...,0
2,@kenichan dive time ball manag save 50 rest ...,0
3,bodi feel itchi like fire,0
4,@nationwideclass behav mad,0


In [None]:
# Step 1: Split the data
X_train, X_test, y_train, y_test = train_test_split(
    df_cleaned['cleaned_text'], df_cleaned['target'], test_size=0.2, random_state=42
)

In [None]:
proportion_nan_target = df_cleaned['target'].isnull().sum() / len(df_cleaned['target'])
print(f"Proportion of NaN values in 'target' column: {proportion_nan_target:.4f}")

Proportion of NaN values in 'target' column: 0.0000


In [None]:
proportion_nan_text = df_cleaned['cleaned_text'].isnull().sum() / len(df_cleaned['cleaned_text'])
print(f"Proportion of NaN values in 'text' column: {proportion_nan_text:.4f}")

Proportion of NaN values in 'text' column: 0.0007


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from google.colab import drive





# Step 1: Check for NaN or empty values in 'text' column
print("Number of NaN in 'text':", df_cleaned['cleaned_text'].isna().sum())
print("Number of empty strings in 'text':", (df_cleaned['cleaned_text'] == '').sum())

# Step 2: Handle NaN values by dropping the affected rows
df_cleaned = df_cleaned.dropna(subset=['cleaned_text'])
# (Alternatively, replace NaN with empty strings instead of dropping rows)
#df_cleaned['cleaned_text'] = df_cleaned['cleaned_text'].fillna('')

# Step 3: Verify no NaN values remain
print("Number of NaN after cleaning:", df_cleaned['cleaned_text'].isna().sum())

# Step 4: Perform train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df_cleaned['cleaned_text'],
    df_cleaned['target'],
    test_size=0.2,
    random_state=42
)

# Step 5: Vectorize the text data
vectorizer = CountVectorizer()
# Fit and transform X_train
X_train_vec = vectorizer.fit_transform(X_train)
# Transform X_test (no fit to avoid data leakage)
X_test_vec = vectorizer.transform(X_test)

print("X_train_vec shape:", X_train_vec.shape)
print("X_test_vec shape:", X_test_vec.shape)

Number of NaN in 'text': 1152
Number of empty strings in 'text': 0
Number of NaN after cleaning: 0
X_train_vec shape: (1279078, 550064)
X_test_vec shape: (319770, 550064)
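
Both matrices above share the same 550064 columns because the vocabulary is learned on the training split only and then reused for the test split. The sketch below illustrates that fit/transform separation in plain Python. The helpers `fit_vocabulary` and `transform` are hypothetical names, and whitespace tokenization is a simplification: `CountVectorizer` applies its own tokenization rules and returns a sparse matrix.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Assign a column index to every token seen in the training documents."""
    vocab = sorted({tok for doc in train_docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(vocab)}

def transform(docs, vocabulary):
    """Count known tokens per document; unseen tokens are silently ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocabulary)
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            row[vocabulary[tok]] = n
        rows.append(row)
    return rows

train_docs = ["quick brown fox run", "test sentenc test"]
test_docs = ["brown fox jump"]  # "jump" never appears in training

vocabulary = fit_vocabulary(train_docs)          # learned on train only
X_train_counts = transform(train_docs, vocabulary)
X_test_counts = transform(test_docs, vocabulary)  # same columns, no refit
print(X_train_counts, X_test_counts)
```

Fitting on the training split alone keeps test-set words from influencing the feature space, which is the leakage concern mentioned in the cell's comments.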


In [None]:
# Step 3: Encode the labels
# Initialize a LabelEncoder to map the original labels (0 and 4) to consecutive integers
encoder = LabelEncoder()
# Fit on y_train and transform it to integers
y_train_enc = encoder.fit_transform(y_train)
# Transform y_test with the same encoding (no fit, for consistency)
y_test_enc = encoder.transform(y_test)

In [None]:
print(encoder.classes_)      # should output: array([0, 4])
print(set(y_train_enc))      # should be {0, 1}
print(set(y_test_enc))       # should also be {0, 1}


[0 4]
{np.int64(0), np.int64(1)}
{np.int64(0), np.int64(1)}
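
The output above shows the original labels 0 and 4 being mapped to 0 and 1. A minimal sketch of that mapping, assuming classes are simply sorted and indexed, could look as follows. `fit_classes`, `encode` and `decode` are illustrative names, not part of scikit-learn.

```python
def fit_classes(labels):
    """Collect the distinct labels in sorted order, one index per class."""
    return sorted(set(labels))

def encode(labels, classes):
    """Replace each label by the position of its class."""
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in labels]

def decode(codes, classes):
    """Map encoded integers back to the original labels."""
    return [classes[i] for i in codes]

classes = fit_classes([0, 4, 4, 0])   # fitted on training labels
encoded = encode([4, 0, 4], classes)  # applied to other labels without refitting
decoded = decode(encoded, classes)
print(classes, encoded, decoded)
```

Keeping the fitted classes around matters later: predictions come back as 0/1 and need the inverse mapping to be read as the dataset's 0/4 labels.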


In [None]:
!pip install -q mlflow

## XGBOOST BAG OF WORDS

In [None]:
# ────────────────────────────────────────────────
# ①  MOUNT GOOGLE DRIVE
# ────────────────────────────────────────────────
from google.colab import drive
import os

if not os.path.exists('/content/drive'):
    drive.mount('/content/drive')
else:
    print("Google Drive is already mounted.")

# ────────────────────────────────────────────────
# ②  START MLFLOW (persistent backend on Drive)
# ────────────────────────────────────────────────
import shlex, os, time, mlflow, pathlib

PORT = 5000
!fuser -k {PORT}/tcp || true          # free the port if it was in use

BACKEND = "/content/drive/MyDrive/Colab_Notebooks/PROJET7/MLflowStore"  # ✅ no spaces in the path
os.makedirs(BACKEND, exist_ok=True)   # create the folder if needed

quoted = shlex.quote(BACKEND)         # shell-quote the path (safety)

# redirect output to a log file for debugging if needed
get_ipython().system_raw(
    f"mlflow server "
    f"--backend-store-uri {quoted} "
    f"--default-artifact-root {quoted} "
    f"--host 0.0.0.0 --port {PORT} --workers 1 "
    f"> mlflow.log 2>&1 &"
)

time.sleep(4)                         # give the server time to start

# Colab proxy URL
from google.colab import output, widgets
ui_url = output.eval_js(f"google.colab.kernel.proxyPort({PORT})")
print("🖥️  Interface MLflow :", ui_url)

# ────────────────────────────────────────────────
# ③  CLIENT & EXPERIMENT CONFIGURATION
# ────────────────────────────────────────────────
mlflow.set_tracking_uri(f"http://127.0.0.1:{PORT}")
mlflow.set_experiment("BoW_XGB_Binary")

# ────────────────────────────────────────────────
# ④  DATA PREPARATION (TF-IDF or CountVectorizer)
#     — this assumes X_train_vec, X_test_vec, y_train_enc, y_test_enc,
#       vectorizer and encoder already exist in the environment.
#     Otherwise, adapt the section below.
# ────────────────────────────────────────────────
# E.g.:
# from sklearn.feature_extraction.text import TfidfVectorizer
# vectorizer = TfidfVectorizer(ngram_range=(1,2))
# X_train_vec = vectorizer.fit_transform(X_train)
# X_test_vec  = vectorizer.transform(X_test)

# ────────────────────────────────────────────────
# ⑤  SINGLE TRAINING RUN + MINIMAL LOGGING
# ────────────────────────────────────────────────
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report
import pickle, pathlib, mlflow

# final hyperparameters
best_params = {"n_estimators": 400, "learning_rate": 0.20, "max_depth": 6}

class_names = [str(c) for c in encoder.classes_]   # e.g. ['0', '4']

# folder for pickles
pkl_dir = pathlib.Path("/content/drive/MyDrive/Colab_Notebooks/PROJET7/pickles")
pkl_dir.mkdir(parents=True, exist_ok=True)

with mlflow.start_run(run_name="final_BoW_XGB"):

    # — model
    model = XGBClassifier(
        objective="binary:logistic",
        eval_metric="logloss",
        **best_params
    ).fit(X_train_vec, y_train_enc)


    # — predictions & metrics
    y_proba = model.predict_proba(X_test_vec)[:, 1]   # probability of class “1” (i.e. original “4”)
    y_pred  = (y_proba > 0.5).astype(int)

    acc = accuracy_score(y_test_enc, y_pred)
    rpt = classification_report(
        y_test_enc, y_pred,
        target_names=class_names,
        output_dict=True, zero_division=0
    )

    # — log to MLflow
    mlflow.log_params(best_params)
    mlflow.log_metric("accuracy", acc)
    for lbl in class_names:
        mlflow.log_metric(f"precision_{lbl}", rpt[lbl]["precision"])
        mlflow.log_metric(f"recall_{lbl}",    rpt[lbl]["recall"])
        mlflow.log_metric(f"f1_{lbl}",        rpt[lbl]["f1-score"])

    # — save the model
    pkl_path = pkl_dir / "BoW_XGB_final.pkl"
    with open(pkl_path, "wb") as f:
        pickle.dump(
            {"model": model, "vectorizer": vectorizer, "encoder": encoder},
            f
        )
    mlflow.log_artifact(str(pkl_path))

    print(f"✅  BoW_XGB_final — acc = {acc:.4f}")

print("\n🏁  Entraînement terminé — consulte le run « final_BoW_XGB » dans l’UI MLflow :", ui_url)


Google Drive is already mounted.
5000/tcp:             8316  8317
🖥️  Interface MLflow : https://5000-m-hm-1nk77asvti4v9-a.asia-east1-0.prod.colab.dev
✅  BoW_XGB_final — acc = 0.7534
🏃 View run final_BoW_XGB at: http://127.0.0.1:5000/#/experiments/547837938429167033/runs/e9ad18e1367e476e96e62dc28efc449a
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/547837938429167033

🏁  Entraînement terminé — consulte le run « final_BoW_XGB » dans l’UI MLflow : https://5000-m-hm-1nk77asvti4v9-a.asia-east1-0.prod.colab.dev
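
The run above thresholds the positive-class probability at 0.5 and logs accuracy together with per-class precision, recall and F1. As a rough illustration of what those numbers measure, here is a dependency-free sketch. The helpers `threshold` and `per_class_scores` and the toy labels are made up for the example; the notebook itself relies on scikit-learn's `classification_report`.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn positive-class probabilities into hard 0/1 predictions."""
    return [1 if p > cutoff else 0 for p in probabilities]

def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class, returning 0.0 on empty denominators."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy data, purely illustrative.
y_true = [0, 0, 1, 1, 1]
y_proba = [0.2, 0.7, 0.9, 0.4, 0.6]

y_pred = threshold(y_proba)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores_pos = per_class_scores(y_true, y_pred, label=1)
print(y_pred, accuracy, scores_pos)
```

The zero fallback plays the same role as `zero_division=0` in the cell above: a class that is never predicted gets a score of 0 instead of raising an error.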


In [None]:
import shlex, os, time, mlflow, pathlib

PORT = 5000
!fuser -k {PORT}/tcp || true          # free the port if it was in use

BACKEND = "/content/drive/MyDrive/Colab_Notebooks/PROJET7/MLflowStore"  # ✅ no spaces in the path
os.makedirs(BACKEND, exist_ok=True)   # create the folder if needed

quoted = shlex.quote(BACKEND)         # shell-quote the path (safety)

# redirect output to a log file for debugging if needed
get_ipython().system_raw(
    f"mlflow server "
    f"--backend-store-uri {quoted} "
    f"--default-artifact-root {quoted} "
    f"--host 0.0.0.0 --port {PORT} --workers 1 "
    f"> mlflow.log 2>&1 &"
)

time.sleep(4)
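
A fixed `time.sleep(4)` is only a guess at how long the server needs to come up. One possible alternative, sketched here with the standard library only, is to poll the port until it accepts a TCP connection. `wait_for_port` is a hypothetical helper, not an MLflow or Colab function, and the timeout values are arbitrary.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True as soon as host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or refused): wait a little and retry.
            time.sleep(interval)
    return False
```

In this notebook, a call such as `wait_for_port("127.0.0.1", PORT)` could stand in for the fixed sleep before the tracking URI is set. An open port does not guarantee the server is fully ready, so this remains a heuristic.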

## TF-IDF approach

In [None]:
# ────────────────────────────────────────────────
# 0. Assumptions
# ────────────────────────────────────────────────
# • the MLflow server is already running (http://127.0.0.1:5000)
# • X_train, X_test, y_train_enc, y_test_enc, encoder exist

# ────────────────────────────────────────────────
# 1. TF-IDF vectorization (1–2-grams)
# ────────────────────────────────────────────────
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2))        # + stop_words='english' if needed
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec  = vectorizer.transform(X_test)

# ────────────────────────────────────────────────
# 2. MLflow client configuration
# ────────────────────────────────────────────────
import mlflow, pathlib, pickle
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("TFIDF_XGBoost_simplified")

# ────────────────────────────────────────────────
# 3. Training + logging
# ────────────────────────────────────────────────
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

params = dict(
    n_estimators = 100,
    learning_rate= 0.05,
    max_depth    = 3,
    reg_lambda   = 2.0,
    reg_alpha    = 1.0
)

class_names = [str(c) for c in encoder.classes_]        # e.g. ['0', '4']

with mlflow.start_run(run_name="TFIDF_fixed"):

    # — model
    model = XGBClassifier(
        objective="binary:logistic",
        eval_metric="logloss",
        **params
    ).fit(X_train_vec, y_train_enc)


    # — predictions & metrics
    y_proba = model.predict_proba(X_test_vec)[:, 1]   # probability of class “1” (i.e. original “4”)
    y_pred  = (y_proba > 0.5).astype(int)

    acc  = accuracy_score(y_test_enc, y_pred)
    rpt  = classification_report(
              y_test_enc, y_pred,
              target_names=class_names,
              output_dict=True,
              zero_division=0
           )

    # — log to MLflow
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", acc)
    for lbl in class_names:
        mlflow.log_metric(f"precision_{lbl}", rpt[lbl]["precision"])
        mlflow.log_metric(f"recall_{lbl}",    rpt[lbl]["recall"])
        mlflow.log_metric(f"f1_{lbl}",        rpt[lbl]["f1-score"])

    # — save pickle (Drive + artifact)
    pkl_path = pathlib.Path(
        "/content/drive/MyDrive/Colab_Notebooks/PROJET7/TFIDF_fixed.pkl"
    )
    with open(pkl_path, "wb") as f:
        pickle.dump(
            {"model": model, "vectorizer": vectorizer, "encoder": encoder}, f
        )
    mlflow.log_artifact(str(pkl_path))

    print(f"✅  TF-IDF fixed — acc={acc:.4f} | f1_0={rpt['0']['f1-score']:.3f} | f1_4={rpt['4']['f1-score']:.3f}")


2025/07/20 16:05:37 INFO mlflow.tracking.fluent: Experiment with name 'TFIDF_XGBoost_simplified' does not exist. Creating a new experiment.


✅  TF-IDF fixed — acc=0.6628 | f1_0=0.566 | f1_4=0.724
🏃 View run TFIDF_fixed at: http://127.0.0.1:5000/#/experiments/642038648378918568/runs/d7bf45aac1d0440b86f496161dbc1dd2
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/642038648378918568
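
To make the TF-IDF weighting used in this run more concrete, the sketch below computes it by hand on two tiny documents. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which to my understanding matches scikit-learn's defaults. It handles unigrams only, whereas the run above also includes bigrams, and the `tfidf` helper is an illustrative name.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return the sorted vocabulary and one L2-normalised TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf(["good day", "bad day"])
for tok, weight in zip(vocab, rows[0]):
    print(f"{tok}: {weight:.3f}")
```

In the first document, "day" occurs in both texts and therefore receives a lower weight than "good", which occurs in only one. That down-weighting of common terms is the main difference from the raw counts of the bag-of-words run.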


## Word2vec approach

In [None]:
#!pip uninstall gensim numpy -y
#!pip install numpy
#!pip install gensim


In [None]:
import shlex, os, time, mlflow, pathlib

PORT = 5000
!fuser -k {PORT}/tcp || true          # free the port if it was in use

BACKEND = "/content/drive/MyDrive/Colab_Notebooks/PROJET7/MLflowStore"  # ✅ no spaces in the path
os.makedirs(BACKEND, exist_ok=True)   # create the folder if needed

quoted = shlex.quote(BACKEND)         # shell-quote the path (safety)

# redirect output to a log file for debugging if needed
get_ipython().system_raw(
    f"mlflow server "
    f"--backend-store-uri {quoted} "
    f"--default-artifact-root {quoted} "
    f"--host 0.0.0.0 --port {PORT} --workers 1 "
    f"> mlflow.log 2>&1 &"
)

time.sleep(4)

In [None]:
# ──────────────────────────────────────────────────────────────
# 0. INSTALL gensim / kagglehub (if needed) – only once
# ──────────────────────────────────────────────────────────────
!pip install -q gensim kagglehub

# ──────────────────────────────────────────────────────────────
# 1. DATASET Sentiment140
# ──────────────────────────────────────────────────────────────
import kagglehub, pandas as pd, gensim, numpy as np, mlflow, pathlib, pickle, os, shlex, time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier

# Download
path = kagglehub.dataset_download("kazanova/sentiment140")
csv_file = f"{path}/training.1600000.processed.noemoticon.csv"

cols = ['target','id','date','flag','user','text']
df = pd.read_csv(csv_file, encoding='ISO-8859-1', names=cols)

# Simple pre-tokenization (lowercasing, punctuation removal, etc.)
tokens = df['text'].apply(gensim.utils.simple_preprocess)

# Split
X_train, X_test, y_train, y_test = train_test_split(tokens, df['target'],
                                                    test_size=0.2,
                                                    random_state=42)

# Encode labels (0→0, 4→1)
encoder   = LabelEncoder()
y_train_e = encoder.fit_transform(y_train)
y_test_e  = encoder.transform(y_test)

# ──────────────────────────────────────────────────────────────
# 2. Word2Vec 300-d
# ──────────────────────────────────────────────────────────────
w2v_model = gensim.models.Word2Vec(
    sentences=X_train,
    vector_size=300,
    window=5,
    min_count=5,
    workers=4,
    epochs=20
)

def doc_vector(tokens, model):
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_train_vec = np.vstack([doc_vector(doc, w2v_model) for doc in X_train])
X_test_vec  = np.vstack([doc_vector(doc, w2v_model) for doc in X_test])

# ──────────────────────────────────────────────────────────────
# 3. MLflow: Word2Vec experiment
# ──────────────────────────────────────────────────────────────
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("W2V_XGBoost")

with mlflow.start_run(run_name="W2V_fixed"):

    params = dict(
        n_estimators = 161,
        learning_rate= 0.05,
        max_depth    = 4,
        subsample    = 0.6,
        colsample_bytree=0.8,
        reg_alpha    = 0.0,
        reg_lambda   = 2.0,
        n_jobs       = -1
    )

    model = XGBClassifier(
        objective="binary:logistic",
        eval_metric="logloss",
        **params
    ).fit(X_train_vec, y_train_e)



    # — predictions & metrics
    y_proba = model.predict_proba(X_test_vec)[:, 1]   # probability of class “1” (i.e. original “4”)
    y_pred  = (y_proba > 0.5).astype(int)


    acc = accuracy_score(y_test_e, y_pred)
    rpt = classification_report(
            y_test_e, y_pred,
            target_names=[str(c) for c in encoder.classes_],
            output_dict=True, zero_division=0
          )

    # ---- log to MLflow
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", acc)
    for lbl in ['0','4']:
        mlflow.log_metric(f"precision_{lbl}", rpt[lbl]['precision'])
        mlflow.log_metric(f"recall_{lbl}",    rpt[lbl]['recall'])
        mlflow.log_metric(f"f1_{lbl}",        rpt[lbl]['f1-score'])

    # ---- artifact & save to Drive
    pkl_path = pathlib.Path(
        "/content/drive/MyDrive/Colab_Notebooks/PROJET7/W2V_fixed.pkl"
    )
    with open(pkl_path, "wb") as f:
        pickle.dump(
            {"model": model,
             "w2v"  : w2v_model,
             "encoder": encoder}, f
        )
    mlflow.log_artifact(str(pkl_path))

    print(f"✅  W2V_fixed — acc={acc:.4f} | f1_0={rpt['0']['f1-score']:.3f} | f1_4={rpt['4']['f1-score']:.3f}")

print("\n🏁  Run terminé — retrouvez-le dans l’UI MLflow (onglet *W2V_XGBoost*).")


2025/07/20 16:30:01 INFO mlflow.tracking.fluent: Experiment with name 'W2V_XGBoost' does not exist. Creating a new experiment.


✅  W2V_fixed — acc=0.7482 | f1_0=0.752 | f1_4=0.745
🏃 View run W2V_fixed at: http://127.0.0.1:5000/#/experiments/140404627774248766/runs/9891fe88bbf0470abd8e5f220264fd58
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/140404627774248766

🏁  Run terminé — retrouvez-le dans l’UI MLflow (onglet *W2V_XGBoost*).
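
In this run each tweet is represented by the mean of its word vectors, with a zero vector when none of its tokens is in the Word2Vec vocabulary. A small pure-Python sketch of that averaging, using a made-up two-dimensional embedding table instead of a trained model, is shown below. The dictionary and the explicit `dim` argument are assumptions for the example.

```python
def doc_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; fall back to zeros if none is known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    # zip(*vecs) groups the values of each dimension across all vectors.
    return [sum(component) / len(vecs) for component in zip(*vecs)]

# Hypothetical 2-d embeddings, purely illustrative.
embeddings = {"good": [1.0, 0.0], "day": [0.0, 1.0]}

known = doc_vector(["good", "day", "zzz"], embeddings, dim=2)
unknown = doc_vector(["zzz"], embeddings, dim=2)
print(known, unknown)
```

Averaging discards word order, so two tweets containing the same words in a different order get the same vector. That is one limitation to keep in mind when comparing this run with the sentence-level encoder used later.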


In [None]:
!fuser -k 5000/tcp # KILL THE SERVER

In [None]:
from google.colab import drive, output
import shlex, os, time, subprocess, textwrap

# 0) (Re)mount Drive
drive.mount('/content/drive', force_remount=True)

# 1) Kill any existing instance
!fuser -k 5000/tcp || true

# 2) Backend folder
BACKEND = "/content/drive/MyDrive/Colab_Notebooks/PROJET7/MLflowStore"
os.makedirs(BACKEND, exist_ok=True)

# 3) Launch with detailed logging
get_ipython().system_raw(
    f"mlflow server "
    f"--backend-store-uri {shlex.quote(BACKEND)} "
    f"--default-artifact-root {shlex.quote(BACKEND)} "
    f"--host 0.0.0.0 --port 5000 --workers 1 "
    f"> mlflow.log 2>&1 &"
)

time.sleep(6)                        # give the server ample time to start

# 4) Quick check (capture_output=True is required, otherwise .stdout is always None)
print(subprocess.run('lsof -i:5000', shell=True, text=True, capture_output=True).stdout or "Port 5000 fermé")

# 5) Proxy URL
print("UI MLflow :", output.eval_js("google.colab.kernel.proxyPort(5000)"))



Mounted at /content/drive
Port 5000 fermé
UI MLflow : https://5000-m-hm-1nk77asvti4v9-a.asia-east1-0.prod.colab.dev
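
The "Port 5000 fermé" line recorded above should be read with caution. `subprocess.run` only fills in `.stdout` when output is captured; without capturing, `.stdout` is `None` and an `or "..."` fallback always wins, whatever the command printed. The sketch below shows the difference with a harmless stand-in command (a Python one-liner) rather than `lsof`.

```python
import subprocess
import sys

# Stand-in for the real check: a child process that prints one line.
cmd = [sys.executable, "-c", "print('listening')"]

# Without capturing, the child's output goes straight to the console
# and the returned object's .stdout attribute is None.
not_captured = subprocess.run(cmd, text=True)
status_without = not_captured.stdout or "closed"

# With capture_output=True, the text is available on .stdout.
captured = subprocess.run(cmd, text=True, capture_output=True)
status_with = captured.stdout.strip() or "closed"

print(status_without, "|", status_with)
```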


## UNIVERSAL SENTENCE ENCODER

In [None]:
#!pip uninstall -y tensorflow tensorflow-text tensorflow-hub
#!pip install "tensorflow==2.15.*" "tensorflow-text==2.15.*" "tensorflow-hub==0.16.1"



In [None]:
# ──────────────────────────────────────────────────────────────
# 0. PREREQUISITES
#    • MLflow server running (http://127.0.0.1:5000 ➜ MLflowStore)
#    • TensorFlow ≥ 2.15 + tensorflow-hub installed
# ──────────────────────────────────────────────────────────────
# !pip install -q --upgrade tensorflow tensorflow-hub # Already installed in previous cells

# ──────────────────────────────────────────────────────────────
# 1. DATASET Sentiment140
# ──────────────────────────────────────────────────────────────
import kagglehub, pandas as pd, mlflow, json, pathlib, pickle, os, time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score, precision_recall_fscore_support
import tensorflow as tf
import tensorflow_hub as hub

path = kagglehub.dataset_download("kazanova/sentiment140")
csv_file = f"{path}/training.1600000.processed.noemoticon.csv"

cols = ['target','id','date','flag','user','text']
df = pd.read_csv(csv_file, encoding='ISO-8859-1', names=cols)

texts  = df['text'].astype(str)
labels = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc  = le.transform(y_test)

# ──────────────────────────────────────────────────────────────
# 2. UNIVERSAL SENTENCE ENCODER & KERAS MODEL
# ──────────────────────────────────────────────────────────────

class USELayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(USELayer, self).__init__(**kwargs)
        self.use_layer = hub.KerasLayer(
            "https://tfhub.dev/google/universal-sentence-encoder/4",
            input_shape=[], dtype=tf.string, trainable=False)

    def call(self, inputs):
        return self.use_layer(inputs)

inputs  = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
x       = USELayer(name="USE")(inputs) # Use the custom layer
x       = tf.keras.layers.Dropout(0.3)(x)
outputs = tf.keras.layers.Dense(len(le.classes_), activation="softmax", name="classifier")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=["accuracy"]
)

# ──────────────────────────────────────────────────────────────
# 3. MLflow — USE experiment
# ──────────────────────────────────────────────────────────────
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("USE_Keras")

with mlflow.start_run(run_name="USE_fixed"):

    # — Notable parameters
    mlflow.log_param("dropout", 0.3)
    mlflow.log_param("epochs", 5)
    mlflow.log_param("batch_size", 16)
    mlflow.log_param("encoder", "USE/4")

    # — Training
    history = model.fit(
        X_train, y_train_enc,
        validation_data=(X_test, y_test_enc),
        epochs=5,
        batch_size=16,
        verbose=2
    )

    # — Final evaluation
    y_pred_prob = model.predict(X_test, batch_size=64)
    y_pred      = y_pred_prob.argmax(axis=1)



    acc = accuracy_score(y_test_enc, y_pred)
    rpt = classification_report(
            y_test_enc, y_pred,
            target_names=[str(c) for c in le.classes_],
            output_dict=True, zero_division=0
          )

    mlflow.log_metric("accuracy", acc)
    for lbl in ['0','4']:
        mlflow.log_metric(f"precision_{lbl}", rpt[lbl]['precision'])
        mlflow.log_metric(f"recall_{lbl}",    rpt[lbl]['recall'])
        mlflow.log_metric(f"f1_{lbl}",        rpt[lbl]['f1-score'])

    # — Save the model (SavedModel) + artifact
    saved_path = "/content/use_model_saved"
    model.save(saved_path, include_optimizer=False)
    mlflow.log_artifact(saved_path)            # the whole folder is archived

    # — Save the label encoder (context manager closes the file properly)
    enc_path = pathlib.Path("/content/label_encoder.pkl")
    with open(enc_path, "wb") as f:
        pickle.dump(le, f)
    mlflow.log_artifact(str(enc_path))

    print(f"✅  USE_fixed — acc={acc:.4f} | f1_0={rpt['0']['f1-score']:.3f} | f1_4={rpt['4']['f1-score']:.3f}")

    # Convenient copy to Drive
    drive_pkl = "/content/drive/MyDrive/Colab_Notebooks/PROJET7/USE_fixed_encoder.pkl"
    !cp "{enc_path}" "{drive_pkl}"
    drive_model = "/content/drive/MyDrive/Colab_Notebooks/PROJET7/USE_fixed_model"
    !cp -r "{saved_path}" "{drive_model}"


print("\n🏁  Run terminé — retrouvez-le dans l’interface MLflow (expérience **USE_Keras**).")

2025/07/20 16:51:43 INFO mlflow.tracking.fluent: Experiment with name 'USE_Keras' does not exist. Creating a new experiment.


Epoch 1/5
80000/80000 - 794s - loss: 0.5011 - accuracy: 0.7639 - val_loss: 0.4682 - val_accuracy: 0.7782 - 794s/epoch - 10ms/step
Epoch 2/5
80000/80000 - 776s - loss: 0.4817 - accuracy: 0.7701 - val_loss: 0.4658 - val_accuracy: 0.7793 - 776s/epoch - 10ms/step
Epoch 3/5
80000/80000 - 774s - loss: 0.4812 - accuracy: 0.7701 - val_loss: 0.4652 - val_accuracy: 0.7794 - 774s/epoch - 10ms/step
Epoch 4/5
80000/80000 - 771s - loss: 0.4813 - accuracy: 0.7700 - val_loss: 0.4649 - val_accuracy: 0.7795 - 771s/epoch - 10ms/step
Epoch 5/5
80000/80000 - 766s - loss: 0.4813 - accuracy: 0.7703 - val_loss: 0.4648 - val_accuracy: 0.7797 - 766s/epoch - 10ms/step
✅  USE_fixed — acc=0.7797 | f1_0=0.778 | f1_4=0.781
🏃 View run USE_fixed at: http://127.0.0.1:5000/#/experiments/691042665796839504/runs/9937cbcd67804ae4a7ae7bc8c8e0cad2
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/691042665796839504

🏁  Run terminé — retrouvez-le dans l’interface MLflow (expérience **USE_Keras**).
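
The Keras classifier above ends in a two-unit softmax layer, and the predicted class is the index of the largest probability, mapped back to the original label through the encoder's classes. A minimal sketch of that last step, with made-up logits and hand-written `softmax` and `argmax` helpers, might look like this:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1 (shifted for stability)."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

classes = [0, 4]               # same order as the fitted label encoder
probs = softmax([0.2, 1.3])    # hypothetical scores for one tweet
predicted = classes[argmax(probs)]
print(probs, predicted)
```

With only two classes, taking the argmax of the softmax output is equivalent to thresholding the second probability at 0.5, which is how the XGBoost runs earlier in the notebook produce their predictions.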


# Start the MLflow server without re-running the experiments

In [None]:
from google.colab import drive, output
import shlex, os, time, subprocess, textwrap

# 0) (Re)mount Drive
drive.mount('/content/drive', force_remount=True)

# 1) Kill any existing instance
!fuser -k 5000/tcp || true

# 2) Backend folder
BACKEND = "/content/drive/MyDrive/Colab_Notebooks/PROJET7/MLflowStore"
os.makedirs(BACKEND, exist_ok=True)

# 3) Launch with detailed logging
get_ipython().system_raw(
    f"mlflow server "
    f"--backend-store-uri {shlex.quote(BACKEND)} "
    f"--default-artifact-root {shlex.quote(BACKEND)} "
    f"--host 0.0.0.0 --port 5000 --workers 1 "
    f"> mlflow.log 2>&1 &"
)

time.sleep(6)                        # give the server ample time to start

# 4) Quick check (capture_output=True is required, otherwise .stdout is always None)
print(subprocess.run('lsof -i:5000', shell=True, text=True, capture_output=True).stdout or "Port 5000 fermé")

# 5) Proxy URL
print("UI MLflow :", output.eval_js("google.colab.kernel.proxyPort(5000)"))



Mounted at /content/drive
Port 5000 fermé
UI MLflow : https://5000-m-s-3inedjgsm6flj-a.us-central1-1.prod.colab.dev


In [None]:
!fuser -k 5000/tcp # KILL THE SERVER

# Tracking via MLflow

In [None]:
!pip install -q mlflow



In [None]:
!pip uninstall -y tensorflow tensorflow-text tensorflow-hub
!pip install "tensorflow==2.15.*" "tensorflow-text==2.15.*" "tensorflow-hub==0.16.1"


Collecting tensorflow==2.15.*
  Downloading tensorflow-2.15.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting tensorflow-text==2.15.*
  Downloading tensorflow_text-2.15.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Collecting ml-dtypes~=0.3.1 (from tensorflow==2.15.*)
  Downloading ml_dtypes-0.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Collecting numpy<2.0.0,>=1.23.5 (from tensorflow==2.15.*)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 (from tensorflow==2.15.*)
  Downloading protobuf-4.25

In [None]:
!pip install plot-keras-history

Collecting plot-keras-history
  Downloading plot_keras_history-1.1.39.tar.gz (12 kB)
  Preparing metadata (setup.py) ... done
Collecting sanitize_ml_labels>=1.0.48 (from plot-keras-history)
  Downloading sanitize_ml_labels-1.1.4.tar.gz (324 kB)
  Preparing metadata (setup.py) ... done
Collecting compress-json (from sanitize_ml_labels>=1.0.48->plot-keras-history)
  Downloading compress_json-1.1.1.tar.gz (6.6 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: plot-keras-history, sanitize_ml_labels, compress-json
  Building wheel for plot-keras-history (setup.py) ... done
  Created wheel for plot-keras-history: filename=plot_keras_history-1.1.39-py3-none-any.whl size=10667 sha256=4e7fe754b78b07a9142cafb38b76e6e7f65fb36d6a070d458313070776400e28
  Stored in directory: /root/.cache/pip/

# DistilBERT: version 1

In [None]:
# ╭──────────────────────────────────────────────╮
# │ 0. Imports & helpers                        │
# ╰──────────────────────────────────────────────╯


from google.colab import drive, output
import os, shlex, time, pickle, pathlib
import pandas as pd, numpy as np, mlflow, kagglehub
import tensorflow as tf
from transformers import TFAutoModel, AutoTokenizer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# ── small helper to plot the training curves ────────────
def save_history(history, path, title):
    fig, ax1 = plt.subplots(figsize=(6,4))
    ax1.set_title(title)
    ax1.plot(history.history["loss"],        label="loss")
    ax1.plot(history.history["val_loss"],    label="val_loss")
    ax2 = ax1.twinx()
    ax2.plot(history.history["accuracy"],     label="acc",      c="g", ls="--")
    ax2.plot(history.history["val_accuracy"], label="val_acc",  c="r", ls="--")
    ax1.set_xlabel("epoch"); ax1.set_ylabel("loss"); ax2.set_ylabel("acc")
    lines = ax1.get_lines()+ax2.get_lines()
    fig.legend(lines, [l.get_label() for l in lines], loc="lower right")
    fig.tight_layout(); fig.savefig(path); plt.close(fig)

# ╭──────────────────────────────────────────────╮
# │ 1. Drive + MLflow                           │
# ╰──────────────────────────────────────────────╯
drive.mount("/content/drive", force_remount=True)

PORT, BACKEND = 5000, "/content/drive/MyDrive/Colab_Notebooks/PROJET7/MLflowStore"
!fuser -k {PORT}/tcp || true
os.makedirs(BACKEND, exist_ok=True)
get_ipython().system_raw(
    f"mlflow server --backend-store-uri {shlex.quote(BACKEND)} "
    f"--default-artifact-root {shlex.quote(BACKEND)} "
    f"--host 0.0.0.0 --port {PORT} --workers 1 &")
time.sleep(3)

mlflow.set_tracking_uri(f"http://127.0.0.1:{PORT}")
mlflow.set_experiment("BERT_curriculum_distil_HF")
print("MLflow UI:", output.eval_js(f"google.colab.kernel.proxyPort({PORT})"))

# ╭──────────────────────────────────────────────╮
# │ 2. Dataset                                  │
# ╰──────────────────────────────────────────────╯
df = pd.read_csv(
    kagglehub.dataset_download("kazanova/sentiment140") +
    "/training.1600000.processed.noemoticon.csv",
    names=["target","id","date","flag","user","text"],
    encoding="ISO-8859-1"
)
texts, labels = df["text"].astype(str).values, df["target"].values
le = LabelEncoder(); labels_enc = le.fit_transform(labels)

VAL_SIZE = 10_000
val_split = StratifiedShuffleSplit(n_splits=1, test_size=VAL_SIZE, random_state=123)
train_pool_idx, val_idx = next(val_split.split(texts, labels_enc))
X_val, y_val = texts[val_idx], labels_enc[val_idx]
train_pool   = train_pool_idx                     # indices still available for training

# ╭──────────────────────────────────────────────╮
# │ 3. DistilBERT (TF + HF)                     │
# ╰──────────────────────────────────────────────╯
MODEL = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
bert      = TFAutoModel.from_pretrained(MODEL)

def encode(batch):
    toks = tokenizer(list(batch), truncation=True, padding="max_length",
                     max_length=128, return_tensors="tf")
    return toks["input_ids"], toks["attention_mask"]

def make_ds(x, y, shuffle=False):
    ids, mask = encode(x)
    ds = tf.data.Dataset.from_tensor_slices(((ids, mask), y.astype(np.float32)))
    if shuffle: ds = ds.shuffle(len(x), reshuffle_each_iteration=True)
    return ds.batch(16).prefetch(tf.data.AUTOTUNE)

val_ds = make_ds(X_val, y_val)

def build_model():
    ids_in  = tf.keras.Input(shape=(128,), dtype=tf.int32, name="ids")
    mask_in = tf.keras.Input(shape=(128,), dtype=tf.int32, name="mask")
    x       = bert(ids_in, attention_mask=mask_in).last_hidden_state[:,0,:]
    x       = tf.keras.layers.Dropout(0.1)(x)
    out     = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model   = tf.keras.Model([ids_in, mask_in], out)
    model.compile(optimizer=tf.keras.optimizers.Adam(3e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_model()            # built ONCE and reused across all stages

# ╭──────────────────────────────────────────────╮
# │ 4. Curriculum (same model, no reload)        │
# ╰──────────────────────────────────────────────╯
STAGES = [10_000, 25_000, 50_000, 100_000, 125_000, 150_000]
EPOCHS_PER_STAGE = 2
ckpt_root = pathlib.Path("/content/drive/MyDrive/Colab_Notebooks/PROJET7/bert_curriculum_HF")
ckpt_root.mkdir(parents=True, exist_ok=True)

total_epochs, prev_run_id, prev_size = 0, None, 0

for size in STAGES:
    add_needed = size - prev_size
    add_split  = StratifiedShuffleSplit(n_splits=1, train_size=add_needed,
                                        random_state=42+size)
    add_rel, _ = next(add_split.split(texts[train_pool], labels_enc[train_pool]))
    add_idx    = train_pool[add_rel]
    train_pool = np.setdiff1d(train_pool, add_idx)

    X_add, y_add = texts[add_idx], labels_enc[add_idx]
    train_ds = make_ds(X_add, y_add, shuffle=True)

    with mlflow.start_run(run_name=f"DistilHF_{size//1000}k",
                          nested=True,
                          tags={"parent": prev_run_id} if prev_run_id else None) as run:

        mlflow.log_params({"cumulative_train": size,
                           "added_this_stage": add_needed,
                           "epochs_this_stage": EPOCHS_PER_STAGE})

        hist = model.fit(
            train_ds,
            validation_data=val_ds,
            epochs=total_epochs + EPOCHS_PER_STAGE,
            initial_epoch=total_epochs,
            verbose=2,
            callbacks=[tf.keras.callbacks.LambdaCallback(
                on_epoch_end=lambda e,l: mlflow.log_metric(
                    "val_accuracy", float(l["val_accuracy"]), step=e))]
        )

        total_epochs += EPOCHS_PER_STAGE
        prev_size     = size

        png = f"/tmp/hist_{size}.png"
        save_history(hist, png, title=f"DistilBERT {size//1000}k")
        mlflow.log_artifact(png)

        ckpt_path = ckpt_root / f"distilbert_HF_{size}k"
        model.save(ckpt_path, include_optimizer=True)
        mlflow.log_artifact(str(ckpt_path))

        if size == STAGES[0]:
            enc_pkl = ckpt_root / "label_encoder.pkl"
            with open(enc_pkl, "wb") as fh:      # close the file handle explicitly
                pickle.dump(le, fh)
            mlflow.log_artifact(str(enc_pkl))

        prev_run_id = run.info.run_id
        print(f"✅  Stage {size//1000}k — val_acc={hist.history['val_accuracy'][-1]:.4f}")

print("\n🏁  Curriculum terminé — tout est dans MLflow.")


Mounted at /content/drive
MLflow UI: https://5000-gpu-l4-s-x38m43v3qpyy-c.us-west4-0.prod.colab.dev


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g.

Epoch 1/2
625/625 - 1007s - loss: 0.4797 - accuracy: 0.7724 - val_loss: 0.3957 - val_accuracy: 0.8187 - 1007s/epoch - 2s/step
Epoch 2/2
625/625 - 991s - loss: 0.3077 - accuracy: 0.8694 - val_loss: 0.4437 - val_accuracy: 0.8022 - 991s/epoch - 2s/step




✅  Stage 10k — val_acc=0.8022
🏃 View run DistilHF_10k at: http://127.0.0.1:5000/#/experiments/281729971456783880/runs/9f2d7de07cc743b8aa81de28e02593ca
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/281729971456783880
Epoch 3/4
938/938 - 1374s - loss: 0.4193 - accuracy: 0.8113 - val_loss: 0.3905 - val_accuracy: 0.8277 - 1374s/epoch - 1s/step
Epoch 4/4
938/938 - 1369s - loss: 0.2701 - accuracy: 0.8887 - val_loss: 0.4471 - val_accuracy: 0.8154 - 1369s/epoch - 1s/step




✅  Stage 25k — val_acc=0.8154
🏃 View run DistilHF_25k at: http://127.0.0.1:5000/#/experiments/281729971456783880/runs/c18c65e885864b52b881ca028fc87cc1
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/281729971456783880
Epoch 5/6
1563/1563 - 2132s - loss: 0.3999 - accuracy: 0.8204 - val_loss: 0.3741 - val_accuracy: 0.8310 - 2132s/epoch - 1s/step
Epoch 6/6
1563/1563 - 2129s - loss: 0.2725 - accuracy: 0.8862 - val_loss: 0.4196 - val_accuracy: 0.8242 - 2129s/epoch - 1s/step




✅  Stage 50k — val_acc=0.8242
🏃 View run DistilHF_50k at: http://127.0.0.1:5000/#/experiments/281729971456783880/runs/d4193a2d16124d1585c329970d1d6b5c
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/281729971456783880
Epoch 7/8
3125/3125 - 4031s - loss: 0.3810 - accuracy: 0.8310 - val_loss: 0.3879 - val_accuracy: 0.8282 - 4031s/epoch - 1s/step
Epoch 8/8
3125/3125 - 4036s - loss: 0.2690 - accuracy: 0.8900 - val_loss: 0.3900 - val_accuracy: 0.8349 - 4036s/epoch - 1s/step




✅  Stage 100k — val_acc=0.8349
🏃 View run DistilHF_100k at: http://127.0.0.1:5000/#/experiments/281729971456783880/runs/996c0636489c44b7af62c8ccb25a4242
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/281729971456783880
Epoch 9/10
1563/1563 - 2134s - loss: 0.3821 - accuracy: 0.8280 - val_loss: 0.3551 - val_accuracy: 0.8419 - 2134s/epoch - 1s/step
Epoch 10/10
1563/1563 - 2136s - loss: 0.2474 - accuracy: 0.8989 - val_loss: 0.3952 - val_accuracy: 0.8330 - 2136s/epoch - 1s/step




✅  Stage 125k — val_acc=0.8330
🏃 View run DistilHF_125k at: http://127.0.0.1:5000/#/experiments/281729971456783880/runs/05fc8948e6fe4c659391fd9c9d8f5d4d
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/281729971456783880
Epoch 11/12
1563/1563 - 2135s - loss: 0.3809 - accuracy: 0.8297 - val_loss: 0.3627 - val_accuracy: 0.8399 - 2135s/epoch - 1s/step
Epoch 12/12
1563/1563 - 2133s - loss: 0.2373 - accuracy: 0.9051 - val_loss: 0.4049 - val_accuracy: 0.8353 - 2133s/epoch - 1s/step




✅  Stage 150k — val_acc=0.8353
🏃 View run DistilHF_150k at: http://127.0.0.1:5000/#/experiments/281729971456783880/runs/f0cac6b2e42549b0b762e4681c239d5e
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/281729971456783880

🏁  Curriculum terminé — tout est dans MLflow.
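
Each stage in the cell above fits the model only on the examples added at that stage (`X_add`, `y_add`), not on the cumulative set. The short sketch below reproduces the arithmetic behind the step counts shown in the log (625, 938, 1563, 3125, 1563, 1563). The function name `stage_plan` is an assumption for illustration; the stage sizes and the batch size of 16 come from the cell.

```python
import math

STAGES = [10_000, 25_000, 50_000, 100_000, 125_000, 150_000]  # cumulative sizes from the cell above
BATCH_SIZE = 16                                               # batch size used in make_ds


def stage_plan(stages, batch_size):
    """Return (added_examples, steps_per_epoch) for each cumulative stage size."""
    plan, prev = [], 0
    for size in stages:
        added = size - prev                                   # only the new examples are trained on
        plan.append((added, math.ceil(added / batch_size)))   # the last batch may be partial
        prev = size
    return plan


for size, (added, steps) in zip(STAGES, stage_plan(STAGES, BATCH_SIZE)):
    print(f"{size:>7} cumulative: +{added} examples, {steps} steps per epoch")
```

This also explains why the 100k stage takes about twice as long per epoch as its neighbours: it adds 50,000 examples where the stages around it add 25,000.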


# DistilBERT: version 2

Version used in production.
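
Since the model in this notebook ends with a single sigmoid unit, a consumer of the model has to map each score back to a Sentiment140 label. The following is a minimal sketch under stated assumptions: the training file contains only the raw targets 0 (negative) and 4 (positive), so the fitted `LabelEncoder` maps 0 to index 0 and 4 to index 1. The helper name `decode_scores` and the 0.5 threshold are hypothetical; the notebook does not show how production decodes predictions.

```python
# Hypothetical helper: maps sigmoid scores back to the raw Sentiment140 targets.
# Assumption: LabelEncoder was fitted on targets {0, 4}, so index 0 -> 0 and index 1 -> 4.
CLASSES = (0, 4)
NAMES = {0: "negative", 4: "positive"}


def decode_scores(scores, threshold=0.5):
    """Turn sigmoid outputs in [0, 1] into (raw_target, name, score) tuples."""
    decoded = []
    for score in scores:
        raw = CLASSES[1] if score >= threshold else CLASSES[0]
        decoded.append((raw, NAMES[raw], float(score)))
    return decoded


print(decode_scores([0.12, 0.91]))
```

With the pickled encoder at hand, `le.inverse_transform` on the thresholded indices would give the same raw targets without hard-coding `CLASSES`.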

In [None]:
!pip install -U "numpy<2.0,>=1.26" \
               "tensorflow==2.16.*" \
               "keras==3.*" \
               "transformers>=4.41" \
               "mlflow>=3.1" \
               kagglehub \
               scikit-learn \
               matplotlib \
               pandas \
               "fsspec[s3]" \
               plot-keras-history

In [None]:
# ╭──────────────────────────────────────────────╮
# │ 0‑b. Imports & utilities                     │
# ╰──────────────────────────────────────────────╯
from google.colab import drive, output
import os, shlex, time, pickle, pathlib
import numpy as np, pandas as pd, matplotlib.pyplot as plt

import tensorflow as tf
from transformers import TFAutoModel, AutoTokenizer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import LabelEncoder
import mlflow, kagglehub
from plot_keras_history import plot_history


# ── custom helper (kept in case it is still needed) ───
def save_history(history, path, title):
    fig, ax1 = plt.subplots(figsize=(6,4))
    ax1.set_title(title)
    ax1.plot(history.history["loss"],        label="loss")
    ax1.plot(history.history["val_loss"],    label="val_loss")
    ax2 = ax1.twinx()
    ax2.plot(history.history["accuracy"],     label="acc",      c="g", ls="--")
    ax2.plot(history.history["val_accuracy"], label="val_acc",  c="r", ls="--")
    ax1.set_xlabel("epoch"); ax1.set_ylabel("loss"); ax2.set_ylabel("acc")
    lines = ax1.get_lines()+ax2.get_lines()
    fig.legend(lines, [l.get_label() for l in lines], loc="lower right")
    fig.tight_layout(); fig.savefig(path); plt.close(fig)


In [None]:
# ╭──────────────────────────────────────────────╮
# │ 1. Google Drive + MLflow                     │
# ╰──────────────────────────────────────────────╯
#drive.mount("/content/drive", force_remount=True)

from google.colab import drive
drive.flush_and_unmount()           # in case a stale mount is lingering

!rm -rf /content/drive              # clean up the mount point
drive.mount("/content/drive")

PORT, BACKEND = 5000, "/content/drive/MyDrive/Colab_Notebooks/PROJET7/MLflowStore"
!fuser -k {PORT}/tcp || true
os.makedirs(BACKEND, exist_ok=True)
get_ipython().system_raw(
    f"mlflow server --backend-store-uri {shlex.quote(BACKEND)} "
    f"--default-artifact-root {shlex.quote(BACKEND)} "
    f"--host 0.0.0.0 --port {PORT} --workers 1 &")
time.sleep(3)

mlflow.set_tracking_uri(f"http://127.0.0.1:{PORT}")
mlflow.set_experiment("BERT_curriculum_distil_HF_Last_v")
print("MLflow UI:", output.eval_js(f"google.colab.kernel.proxyPort({PORT})"))


In [None]:
# ╭──────────────────────────────────────────────╮
# │ 2. Sentiment140 dataset                      │
# ╰──────────────────────────────────────────────╯
df = pd.read_csv(
    kagglehub.dataset_download("kazanova/sentiment140") +
    "/training.1600000.processed.noemoticon.csv",
    names=["target","id","date","flag","user","text"],
    encoding="ISO-8859-1"
)
texts, labels  = df["text"].astype(str).values, df["target"].values
le             = LabelEncoder()
labels_enc     = le.fit_transform(labels)

VAL_SIZE = 10_000
val_split      = StratifiedShuffleSplit(n_splits=1, test_size=VAL_SIZE, random_state=123)
train_pool_idx, val_idx = next(val_split.split(texts, labels_enc))
X_val, y_val   = texts[val_idx], labels_enc[val_idx]
train_pool     = train_pool_idx.copy()                # remaining samples


In [None]:
# ╭──────────────────────────────────────────────╮
# │ 3. DistilBERT model (TF + HF)                │
# ╰──────────────────────────────────────────────╯
MODEL     = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
bert      = TFAutoModel.from_pretrained(MODEL)

def encode(batch):
    toks = tokenizer(list(batch),
                     truncation=True, padding="max_length",
                     max_length=128, return_tensors="tf")
    return toks["input_ids"], toks["attention_mask"]

def make_ds(x, y, shuffle=False):
    ids, mask = encode(x)
    ds = tf.data.Dataset.from_tensor_slices(((ids, mask), y.astype(np.float32)))
    if shuffle:
        ds = ds.shuffle(len(x), reshuffle_each_iteration=True)
    return ds.batch(16).prefetch(tf.data.AUTOTUNE)

val_ds = make_ds(X_val, y_val)

def build_model():
    ids_in  = tf.keras.Input(shape=(128,), dtype=tf.int32, name="ids")
    mask_in = tf.keras.Input(shape=(128,), dtype=tf.int32, name="mask")
    x       = bert(ids_in, attention_mask=mask_in).last_hidden_state[:, 0, :]
    x       = tf.keras.layers.Dropout(0.1)(x)
    out     = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model   = tf.keras.Model([ids_in, mask_in], out)
    model.compile(optimizer=tf.keras.optimizers.Adam(3e-5),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()


In [None]:
# ╭──────────────────────────────────────────────╮
# │ 4. Curriculum learning + plot‑keras‑history  │
# ╰──────────────────────────────────────────────╯
STAGES            = [25_000, 50_000, 100_000, 125_000, 150_000, 175_000, 200_000]  # cumulative example counts (adjust as needed)
EPOCHS_PER_STAGE  = 2
ckpt_root         = pathlib.Path("/content/drive/MyDrive/Colab_Notebooks/PROJET7/bert_curriculum_HF_last_version")
ckpt_root.mkdir(parents=True, exist_ok=True)

total_epochs, prev_run_id, prev_size = 0, None, 0

for size in STAGES:
    # -- select the new examples -----------------------------------
    add_needed = size - prev_size
    add_split  = StratifiedShuffleSplit(
        n_splits=1, train_size=add_needed, random_state=42 + size
    )
    add_rel, _ = next(add_split.split(texts[train_pool], labels_enc[train_pool]))
    add_idx    = train_pool[add_rel]
    train_pool = np.setdiff1d(train_pool, add_idx)

    X_add, y_add = texts[add_idx], labels_enc[add_idx]
    train_ds     = make_ds(X_add, y_add, shuffle=True)

    # -- MLflow run --------------------------------------------------
    with mlflow.start_run(
        run_name=f"DistilHF_{size//1000}k",
        nested=True,
        tags={"parent": prev_run_id} if prev_run_id else None,
    ) as run:

        mlflow.log_params({
            "cumulative_train":  size,
            "added_this_stage":  add_needed,
            "epochs_this_stage": EPOCHS_PER_STAGE,
        })

        hist = model.fit(
            train_ds,
            validation_data=val_ds,
            epochs=total_epochs + EPOCHS_PER_STAGE,
            initial_epoch=total_epochs,
            verbose=2,
            callbacks=[tf.keras.callbacks.LambdaCallback(
                on_epoch_end=lambda e, l: mlflow.log_metric(
                    "val_accuracy", float(l["val_accuracy"]), step=e)
            )],
        )

        total_epochs += EPOCHS_PER_STAGE
        prev_size     = size

        # -- plot with plot‑keras‑history ----------------------------
        png = f"/tmp/hist_{size}.png"
        fig, _ = plot_history(          # ← unpack the returned tuple
            hist.history,
            path  = png,
            title = f"DistilBERT {size//1000}k"
        )
        mlflow.log_artifact(png)
        plt.close(fig)                  # fig is now a proper Figure


        # -- save the model -----------------------------------------
        ckpt_path = ckpt_root / f"distilbert_HF_{size}k"
        model.save(ckpt_path, save_format="tf")
        mlflow.log_artifact(str(ckpt_path))

        # -- pickle the label encoder at the first stage -------------
        if size == STAGES[0]:
            enc_pkl = ckpt_root / "label_encoder.pkl"
            with open(enc_pkl, "wb") as fh:      # close the file handle explicitly
                pickle.dump(le, fh)
            mlflow.log_artifact(str(enc_pkl))

        prev_run_id = run.info.run_id
        print(f"✅  Stage {size//1000}k — val_acc={hist.history['val_accuracy'][-1]:.4f}")

print("\n🏁  Curriculum terminé — tout est dans MLflow.")


Epoch 1/2
1563/1563 - 2158s - loss: 0.4436 - accuracy: 0.7934 - val_loss: 0.4037 - val_accuracy: 0.8142 - 2158s/epoch - 1s/step
Epoch 2/2
1563/1563 - 2128s - loss: 0.2994 - accuracy: 0.8738 - val_loss: 0.3978 - val_accuracy: 0.8276 - 2128s/epoch - 1s/step
WARNING:tensorflow:Skipping full serialization of TF-Keras layer <tf_keras.src.layers.regularization.dropout.Dropout object at 0x7d6f086f1550>, because it is not built.
WARNING:tensorflow:Skipping full serialization of TF-Keras layer <tf_keras.src.layers.regularization.dropout.Dropout object at 0x7d6f94105e90>, because it is not built.
WARNING:tensorflow:Skipping full serialization of TF-Keras layer <tf_keras.src.layers.regularization.dropout.Dropout object at 0x7d6f67f28590>, because it is not built.
WARNING:tensorflow:Skipping full serialization of TF-Keras layer <tf_keras.src.layers.regularization.dropout.Dropout object at 0x7d703468dd90>, because it is not built.
WARNING:tensorflow:Skipping full serialization of TF-Keras layer <tf_keras.src.layers.regularization.dropout.Dropout object at 0x7d6f08406110>, because it is not built.
WARNING:tensorflow:Skipping full serialization of TF-Keras layer <tf_keras.src.layers.regularization.dropout.Dropout object at 0x7d6f0849c250>, because it is not built.
✅  Stage 25k — val_acc=0.8276
🏃 View run DistilHF_25k at: http://127.0.0.1:5000/#/experiments/591348536328332306/runs/9c6dcfe3209b4d9f8c5b6ec149322996
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/591348536328332306
Epoch 3/4
1563/1563 - 2132s - loss: 0.3925 - accuracy: 0.8251 - val_loss: 0.3752 - val_accuracy: 0.8310 - 2132s/epoch - 1s/step
Epoch 4/4
1563/1563 - 2132s - loss: 0.2594 - accuracy: 0.8936 - val_loss: 0.3969 - val_accuracy: 0.8337 - 2132s/epoch - 1s/step
✅  Stage 50k — val_acc=0.8337
🏃 View run DistilHF_50k at: http://127.0.0.1:5000/#/experiments/591348536328332306/runs/6e17d4dd719d4d70ad97aa4922c77d0e
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/591348536328332306
Epoch 5/6
3125/3125 - 4051s - loss: 0.3845 - accuracy: 0.8271 - val_loss: 0.3563 - val_accuracy: 0.8402 - 4051s/epoch - 1s/step
Epoch 6/6
3125/3125 - 4011s - loss: 0.2727 - accuracy: 0.8860 - val_loss: 0.3742 - val_accuracy: 0.8404 - 4011s/epoch - 1s/step
✅  Stage 100k — val_acc=0.8404
🏃 View run DistilHF_100k at: http://127.0.0.1:5000/#/experiments/591348536328332306/runs/5b3a8078dd324e858de98981e2c5dbe1
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/591348536328332306
Epoch 7/8
1563/1563 - 2114s - loss: 0.3708 - accuracy: 0.8317 - val_loss: 0.3547 - val_accuracy: 0.8398 - 2114s/epoch - 1s/step
Epoch 8/8
1563/1563 - 2109s - loss: 0.2322 - accuracy: 0.9042 - val_loss: 0.4206 - val_accuracy: 0.8389 - 2109s/epoch - 1s/step
✅  Stage 125k — val_acc=0.8389
🏃 View run DistilHF_125k at: http://127.0.0.1:5000/#/experiments/591348536328332306/runs/6341c06483c240aaa2c3c60c71053b31
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/591348536328332306
Epoch 9/10
1563/1563 - 2110s - loss: 0.3804 - accuracy: 0.8303 - val_loss: 0.3589 - val_accuracy: 0.8426 - 2110s/epoch - 1s/step
Epoch 10/10
1563/1563 - 2109s - loss: 0.2433 - accuracy: 0.9010 - val_loss: 0.4132 - val_accuracy: 0.8372 - 2109s/epoch - 1s/step
✅  Stage 150k — val_acc=0.8372
🏃 View run DistilHF_150k at: http://127.0.0.1:5000/#/experiments/591348536328332306/runs/a47798ac6bb443d78aac0b8f33f221ed
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/591348536328332306
Epoch 11/12
1563/1563 - 2111s - loss: 0.3708 - accuracy: 0.8332 - val_loss: 0.3504 - val_accuracy: 0.8438 - 2111s/epoch - 1s/step
Epoch 12/12
1563/1563 - 2110s - loss: 0.2286 - accuracy: 0.9070 - val_loss: 0.4246 - val_accuracy: 0.8356 - 2110s/epoch - 1s/step
✅  Stage 175k — val_acc=0.8356
🏃 View run DistilHF_175k at: http://127.0.0.1:5000/#/experiments/591348536328332306/runs/e4c0a174350f4406ad72bf32569d530b
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/591348536328332306
Epoch 13/14
1563/1563 - 2110s - loss: 0.3758 - accuracy: 0.8330 - val_loss: 0.3440 - val_accuracy: 0.8494 - 2110s/epoch - 1s/step
Epoch 14/14
1563/1563 - 2109s - loss: 0.2352 - accuracy: 0.9043 - val_loss: 0.3930 - val_accuracy: 0.8371 - 2109s/epoch - 1s/step
✅  Stage 200k — val_acc=0.8371
🏃 View run DistilHF_200k at: http://127.0.0.1:5000/#/experiments/591348536328332306/runs/26a8db95e5b34c2789dc5b32d03a6b81
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/591348536328332306

🏁  Curriculum terminé — tout est dans MLflow.
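
In the last four stages of the log above, validation accuracy is higher after the first epoch than after the second, while the checkpoint is written after the second epoch. A minimal sketch for locating the best epoch in such a log is given below. It assumes the Keras `verbose=2` line format seen above; the names `parse_keras_log`, `EPOCH_RE` and `METRIC_RE` are illustrative, and the sample lines are copied from the final stage.

```python
import re

# Lines copied from the final stage of the training log above (Keras verbose=2 format).
LOG = """\
Epoch 13/14
1563/1563 - 2110s - loss: 0.3758 - accuracy: 0.8330 - val_loss: 0.3440 - val_accuracy: 0.8494 - 2110s/epoch - 1s/step
Epoch 14/14
1563/1563 - 2109s - loss: 0.2352 - accuracy: 0.9043 - val_loss: 0.3930 - val_accuracy: 0.8371 - 2109s/epoch - 1s/step
"""

EPOCH_RE = re.compile(r"^Epoch (\d+)/\d+$")
METRIC_RE = re.compile(r"(\w+): ([0-9.]+)")


def parse_keras_log(text):
    """Return one dict per epoch: {'epoch': int, 'loss': float, 'val_accuracy': float, ...}."""
    rows, epoch = [], None
    for line in text.splitlines():
        header = EPOCH_RE.match(line.strip())
        if header:
            epoch = int(header.group(1))          # remember which epoch the next metrics line belongs to
        elif epoch is not None and "val_accuracy" in line:
            metrics = {name: float(value) for name, value in METRIC_RE.findall(line)}
            rows.append({"epoch": epoch, **metrics})
            epoch = None
    return rows


rows = parse_keras_log(LOG)
best = max(rows, key=lambda row: row["val_accuracy"])
print(best["epoch"], best["val_accuracy"])
```

Run over the whole log, the same function could help decide which stage's checkpoint to keep, or whether a second epoch per stage is worth its cost.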