<a href="https://colab.research.google.com/github/CrisMcode111/DI_Bootcamp/blob/main/w6_mini_projet_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import os
# Make tf.keras resolve to the legacy Keras 2 package (tf_keras), which the
# TensorFlow model classes in transformers rely on. Set this before importing TensorFlow.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

# Mini Project: Sentiment Assistant with BERT Fine-Tuning

## Prerequisites & Story Setup

In [None]:
!pip -q install "tensorflow==2.19.0" \
                "tensorflow-text==2.19.*" \
                "transformers==4.46.*" \
                tensorflow-datasets \
                accelerate \
                evaluate

# tensorflow: core deep learning framework for building and training models.
# tensorflow-text: text-processing ops for TensorFlow.
# transformers: pre-trained models (such as BERT) and fine-tuning tools.
# tensorflow-datasets: loaders for common datasets (used here for IMDB reviews).
# accelerate: Hugging Face utility for device placement and distributed training.
# evaluate: Hugging Face library of evaluation metrics.

In [None]:
import platform, tensorflow as tf, keras, transformers
print("Python:", platform.python_version())   # 3.12.x acceptable
print("TF    :", tf.__version__)              # 2.19.0
print("Keras :", keras.__version__)           # 3.x (standalone Keras 3 package)
print("HF    :", transformers.__version__)    # 4.46.x
print("GPU   :", tf.config.list_physical_devices('GPU'))


Python: 3.12.12
TF    : 2.19.0
Keras : 3.10.0
HF    : 4.46.3
GPU   : [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Imports & Hardware Check

In [None]:
import platform
import tensorflow as tf
import tensorflow_datasets as tfds
from transformers import BertTokenizer, TFBertForSequenceClassification

print("Python version      :", platform.python_version())
print("TensorFlow version  :", tf.__version__)
print("GPU devices detected:", tf.config.list_physical_devices('GPU'))

Python version      : 3.12.12
TensorFlow version  : 2.19.0
GPU devices detected: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Load the IMDB Reviews Dataset

In [None]:
(train_raw, val_raw, test_raw), ds_info = tfds.load(
    "imdb_reviews",
    split=["train[:80%]", "train[80%:]", "test"],
    as_supervised=True,
    with_info=True,
    shuffle_files=True
)

print(ds_info)
print("Train samples:", tf.data.experimental.cardinality(train_raw).numpy())
print("Val samples  :", tf.data.experimental.cardinality(val_raw).numpy())
print("Test samples :", tf.data.experimental.cardinality(test_raw).numpy())



Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.
tfds.core.DatasetInfo(
    name='imdb_reviews',
    full_name='imdb_reviews/plain_text/1.0.0',
    description="""
    Large Movie Review Dataset. This is a dataset for binary sentiment
    classification containing substantially more data than previous benchmark
    datasets. We provide a set of 25,000 highly polar movie reviews for training,
    and 25,000 for testing. There is additional unlabeled data for use as well.
    """,
    config_description="""
    Plain text
    """,
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    data_dir='/root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0',
    file_format=tfrecord,
    download_size=80.23 MiB,
    dataset_size=129.83 MiB,
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
        'text': Text(shape=(), dtype=string),
    }),
   

In [None]:
for text, label in train_raw.take(4):
    print("Label:", "Positive" if label.numpy() else "Negative")
    print(text.numpy().decode()[:250], "...\n")

Label: Negative
This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline ...

Label: Negative
I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film wa ...

Label: Negative
Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Mountie telling the people of Dawson City, Yukon to el ...

Label: Positive
This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a

## Tokenizer Setup & Data Pipeline

In [None]:

MAX_LENGTH = 128
BATCH_SIZE = 4  # if it crashes again, set this to 2

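# With TF_USE_LEGACY_KERAS=1 the model is built on tf.keras (legacy Keras 2),
# so the policy is set through tf.keras rather than the standalone keras package.
tf.keras.mixed_precision.set_global_policy("mixed_float16")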

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
print("Tokenizer loaded:", tokenizer.name_or_path)


Tokenizer loaded: bert-base-uncased


In [1]:
def encode_review(review_input):  # Convert a text review into the numeric format BERT expects.
    if isinstance(review_input, bytes):
        review_text = review_input.decode("utf-8")
    elif hasattr(review_input, "numpy"):
        review_text = review_input.numpy().decode("utf-8")
    else:
        review_text = str(review_input)

    return tokenizer.encode_plus(
        review_text,
        add_special_tokens=True,
        max_length=MAX_LENGTH,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        return_token_type_ids=True,
    )

def tf_encode(text, label):  # Wrap the Python function 'encode_review' so it can run inside a tf.data pipeline.
    def _encode_py(t):
        e = encode_review(t)  # returns a Hugging Face encoding (dict-like)
        return (e["input_ids"], e["attention_mask"], e["token_type_ids"])

    input_ids, attention_mask, token_type_ids = tf.py_function(
        func=_encode_py,
        inp=[text],
        Tout=[tf.int32, tf.int32, tf.int32]
    )

    # ⚠️ Set static shapes explicitly (otherwise OperatorNotAllowedInGraphError and similar errors)
    input_ids.set_shape([MAX_LENGTH])
    attention_mask.set_shape([MAX_LENGTH])
    token_type_ids.set_shape([MAX_LENGTH])

    features = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids
    }
    return features, label


def to_tuple(features, label):  # Rearrange the features into the positional form the Keras model accepts.
    return (features["input_ids"],
            features["attention_mask"],
            features["token_type_ids"]), label


def prepare_dataset(dataset):  # Apply every preparation step (tokenization, batching, etc.) to the dataset.
    return (dataset
            .map(tf_encode, num_parallel_calls=tf.data.AUTOTUNE)   # fixes the shapes to [MAX_LENGTH]
            .map(to_tuple, num_parallel_calls=tf.data.AUTOTUNE)    # positional tuple
            .shuffle(2000)
            .batch(BATCH_SIZE, drop_remainder=True)                 # batch exactly once
            .prefetch(tf.data.AUTOTUNE))

train_ds = prepare_dataset(train_raw)
val_ds   = prepare_dataset(val_raw)
test_ds  = prepare_dataset(test_raw)
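
To make the effect of `max_length`, `padding="max_length"`, and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper and the body token IDs are arbitrary. The special-token IDs are the ones used by the `bert-base-uncased` vocabulary (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Toy stand-in for padding="max_length" combined with truncation=True."""
    body = token_ids[: max_length - 2]        # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 marks real tokens
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad             # right-pad to the fixed length
    attention_mask += [0] * n_pad             # 0 marks padding
    token_type_ids = [0] * max_length         # single-segment input: all zeros
    return input_ids, attention_mask, token_type_ids


short = pad_or_truncate([2023, 3185], max_length=6)              # gets padded
long = pad_or_truncate(list(range(1000, 1010)), max_length=6)    # gets truncated
print(short)  # ([101, 2023, 3185, 102, 0, 0], [1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 0, 0])
print(long)   # ([101, 1000, 1001, 1002, 1003, 102], [1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0])
```

Every example ends up with exactly `max_length` entries per field, which is why the pipeline can declare static shapes of `[MAX_LENGTH]`.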



## Initialize the Fine-Tuning Model

In [None]:
from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf
import keras  # standalone Keras 3 package; tf.keras is the legacy Keras 2 API in this notebook

In [None]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

print("TF    :", tf.__version__)
print("GPU   :", tf.config.list_physical_devices('GPU'))

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    use_safetensors=False
)

for layer in model.bert.encoder.layer[:6]:
    layer.trainable = False

optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-8)
loss_fn   = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics   = [tf.keras.metrics.SparseCategoricalAccuracy(name="accuracy")]

model.compile(optimizer=optimizer, loss=loss_fn, metrics=metrics)
model.summary()


TF    : 2.19.0
GPU   : [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_75 (Dropout)        multiple                  0 (unused)
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109483778 (417.65 MB)
Trainable params: 66956546 (255.42 MB)
Non-trainable params: 42527232 (162.23 MB)
_________________________________________________________________
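
The non-trainable count in the summary can be cross-checked by hand. The sketch below assumes the standard `bert-base` dimensions (hidden size 768, feed-forward size 3072) and counts the parameters of one encoder layer; the constant and function names are illustrative, not part of any library.

```python
HIDDEN = 768          # hidden size of bert-base
INTERMEDIATE = 3072   # feed-forward inner size of bert-base
NUM_LABELS = 2


def dense(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

per_encoder_layer = (
    4 * dense(HIDDEN, HIDDEN)        # query, key, value, attention output
    + layer_norm                     # LayerNorm after attention
    + dense(HIDDEN, INTERMEDIATE)    # feed-forward expansion
    + dense(INTERMEDIATE, HIDDEN)    # feed-forward projection
    + layer_norm                     # LayerNorm after feed-forward
)
frozen = 6 * per_encoder_layer       # the six layers with trainable = False
classifier = dense(HIDDEN, NUM_LABELS)

print("per encoder layer:", per_encoder_layer)  # 7087872
print("frozen (6 layers):", frozen)             # 42527232
print("classifier       :", classifier)         # 1538
```

The frozen total matches the `Non-trainable params` line, and the classifier count matches the `classifier (Dense)` row.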


## Train and Monitor

In [None]:
# Warm-up: run one batch through the model (inputs are already in tuple form)
(x_ids, x_mask, x_type), _ = next(iter(train_ds))
_ = model((x_ids, x_mask, x_type), training=False)

In [None]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
callbacks = [
    EarlyStopping(monitor="val_accuracy", mode="max", patience=1, restore_best_weights=True),
    ModelCheckpoint("bert_imdb_best.weights.h5", monitor="val_accuracy", mode="max",
                    save_best_only=True, save_weights_only=True),
]
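
As a rough illustration of what `patience=1` means, the following simplified sketch replays a list of validation accuracies and reports where training would stop. It is not the Keras implementation (it ignores `min_delta`, `baseline`, and weight restoration), `early_stopping_epoch` is a hypothetical helper, and the accuracy values are made up.

```python
def early_stopping_epoch(val_accuracies, patience=1):
    """Return (stop_epoch, best_epoch), both 0-based, for per-epoch scores."""
    best, best_epoch, wait = float("-inf"), -1, 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:                  # improvement: remember it, reset the counter
            best, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:        # no improvement for `patience` epochs
                return epoch, best_epoch
    return len(val_accuracies) - 1, best_epoch  # ran through every epoch


print(early_stopping_epoch([0.86, 0.88, 0.87, 0.90]))  # (2, 1)
print(early_stopping_epoch([0.86, 0.88]))              # (1, 1)
```

In the first case training stops after the third epoch and the second epoch is the best one, which is the epoch `restore_best_weights=True` would roll back to. With only two epochs, the callback can stop early only if the second epoch fails to improve on the first.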

In [None]:
EPOCHS = 2
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS,
    verbose=1,
    callbacks=callbacks
)


Epoch 1/2



Epoch 2/2


In [None]:
# 1) Expected number of steps per epoch (with drop_remainder=True)
train_n = tf.data.experimental.cardinality(train_raw).numpy()
val_n   = tf.data.experimental.cardinality(val_raw).numpy()
print("steps/epoch train ~", train_n // BATCH_SIZE, " | val steps ~", val_n // BATCH_SIZE)

# 2) Shapes of one batch
(x_ids, x_mask, x_type), y = next(iter(train_ds))
print(x_ids.shape, x_mask.shape, x_type.shape, y.shape)  # (BATCH_SIZE, MAX_LENGTH) ... (BATCH_SIZE,)


steps/epoch train ~ 2500  | val steps ~ 625
(8, 128) (8, 128) (8, 128) (8,)
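
The step counts follow directly from the split sizes and the batch size. The sketch below redoes the arithmetic for a few batch sizes, assuming the 25,000 training reviews stated in the dataset description; the names are illustrative.

```python
TRAIN_EXAMPLES = 25_000 * 80 // 100      # train[:80%] of the 25,000 training reviews
VAL_EXAMPLES = 25_000 - TRAIN_EXAMPLES   # train[80%:]


def steps_per_epoch(n_examples, batch_size):
    """Number of full batches per epoch when drop_remainder=True."""
    return n_examples // batch_size


for batch_size in (2, 4, 8):
    print(batch_size,
          steps_per_epoch(TRAIN_EXAMPLES, batch_size),
          steps_per_epoch(VAL_EXAMPLES, batch_size))
# 2 10000 2500
# 4 5000 1250
# 8 2500 625
```

The recorded output above (2500 and 625 steps, batches of shape `(8, 128)`) corresponds to a batch size of 8, so it was presumably produced before `BATCH_SIZE` was lowered to 4 in the pipeline cell.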


## Evaluate on the Held-Out Test Set

In [None]:
eval_metrics = model.evaluate(test_ds, return_dict=True)
print({k: round(v, 4) for k, v in eval_metrics.items()})

{'loss': 0.2938, 'accuracy': 0.887}


## Build a Reusable Inference Helper

In [None]:
def predict_sentiment(text: str):
    """Return (label, confidence) for a single review string."""
    id2label = {0: "Negative", 1: "Positive"}
    enc = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=MAX_LENGTH,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        return_token_type_ids=True,
        return_tensors="tf"
    )
    outputs = model(
        enc["input_ids"],
        attention_mask=enc["attention_mask"],
        token_type_ids=enc["token_type_ids"],
        training=False
    )
    # Logits have shape (1, 2); cast to float32 in case mixed precision yields float16.
    logits = tf.cast(outputs.logits, tf.float32)
    probs = tf.nn.softmax(logits, axis=-1).numpy()[0]
    pred = int(probs.argmax())  # index of the most probable class
    return id2label[pred], float(probs[pred])

label, confidence = predict_sentiment("The onboarding emails were confusing, but the agent fixed everything politely.")
print(f"Prediction: {label} (confidence={confidence:.3f})")


Prediction: Positive (confidence=0.861)
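
The reported confidence is the softmax probability of the predicted class. As a small worked example in plain Python, with made-up logits rather than real model output:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "Negative", 1: "Positive"}
logits = [-0.5, 1.5]   # hypothetical values, not produced by the model
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(id2label[pred], round(probs[pred], 3))  # Positive 0.881
```

Only the difference between the two logits matters here: a gap of 2.0 gives a probability of about 0.88 for the larger one.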
