We fine-tuned DistilBERT on the IMDB dataset to adapt a pre-trained model to a sentiment classification task, then deployed it with Flask and ngrok.

**Install the required libraries**


In [1]:
!pip install -q transformers torch scikit-learn pandas pyngrok

**Import the libraries**

In [2]:
import pandas as pd
import torch
# Splits the data into train and test sets
from sklearn.model_selection import train_test_split
# Used to measure the model's performance
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (
    AutoTokenizer,                       # resolves to the DistilBERT tokenizer
    AutoModelForSequenceClassification,  # resolves to the DistilBERT classifier
    Trainer,  # handles the training loop automatically
    TrainingArguments
)


**Load the IMDB dataset**

In [None]:
csv_path = "../data/IMDB_Dataset.csv"

df = pd.read_csv(
    csv_path,
    engine="python",
    encoding="utf-8",
    on_bad_lines="skip"
)

print("Total number of reviews:", len(df))
df.head()


Total number of reviews: 50000


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


**Encode the labels**

In [4]:
df["label"] = df["sentiment"].map({
    "positive": 1,
    "negative": 0
})

df = df.dropna(subset=["review", "label"])


**Train/test split: 80% training, 20% test**

In [5]:
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["review"].tolist(),
    df["label"].tolist(),
    test_size=0.2,
    random_state=42
)

print("Train size:", len(train_texts))
print("Test size :", len(test_texts))


Train size: 40000
Test size : 10000


**Load the DistilBERT tokenizer and model**

In [6]:
# "uncased": the text is lowercased automatically
# Loads an already pre-trained DistilBERT model
# Distillation compresses a large model into a smaller one; DistilBERT is an encoder, with faster training and inference
# DistilBERT provides the text understanding, and the final classification layer makes the prediction.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)





Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**PyTorch dataset**

In [7]:
# PyTorch is a deep learning library (building neural networks, training AI models, using the GPU to speed up computation)
# PyTorch is the engine that runs DistilBERT
# This class turns the text into numbers the model can work with
class IMDBDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )

        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "labels": torch.tensor(self.labels[idx], dtype=torch.long)
        }
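
The dataset above relies on the tokenizer's `truncation=True` and `padding="max_length"` options so that every example has exactly `max_length` positions. The following is a minimal sketch of that idea on made-up token ids, written in plain Python so it runs without downloading a tokenizer. `pad_or_truncate` and `PAD_ID` are hypothetical names, not part of `transformers`, and the real tokenizer also keeps the closing special token when it truncates, which this toy version does not.

```python
# Toy illustration of truncation plus "max_length" padding.
# pad_or_truncate and PAD_ID are hypothetical helpers, not transformers APIs.
PAD_ID = 0  # assumption: the padding token id is 0


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop extra tokens
    n_pad = max_length - len(kept)           # how many pad slots are needed
    input_ids = kept + [pad_id] * n_pad      # padding: fill up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = pad
    return input_ids, attention_mask


# A short sequence gets padded, a long one gets cut (ids are made up).
short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the dataset returns it alongside `input_ids`.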


**Create the datasets**

In [8]:
train_dataset = IMDBDataset(train_texts, train_labels, tokenizer, max_length=128)
test_dataset  = IMDBDataset(test_texts,  test_labels,  tokenizer, max_length=128)


**Metrics**

In [9]:
# Evaluates the model by computing accuracy, precision, recall, and F1 score from the predictions and the true labels.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=1)

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="binary"
    )
    acc = accuracy_score(labels, predictions)

    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }
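
To make the four metrics concrete, here is a small sketch that computes them by hand from confusion-matrix counts, with label 1 (positive) as the target class, as `average="binary"` does by default. `binary_metrics` and the toy label lists are illustrative assumptions, not part of the notebook, and returning 0.0 on an empty denominator is a choice made for this sketch.

```python
def binary_metrics(labels, predictions):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data: 4 positive and 4 negative reviews (made up for illustration).
toy_labels = [1, 1, 1, 0, 0, 0, 1, 0]
toy_predictions = [1, 0, 1, 0, 1, 1, 1, 0]

metrics = binary_metrics(toy_labels, toy_predictions)
print(metrics)
```

Here the classifier finds 3 of the 4 positive reviews (recall 0.75) but also flags 2 negative reviews as positive (precision 0.6), which is the kind of trade-off the F1 score summarizes.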


**TrainingArguments**

In [10]:
training_args = TrainingArguments(
    output_dir="./results",
    report_to="none",
    eval_strategy="epoch",
    save_strategy="no",          # skip per-epoch checkpoints (faster)
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,          # 2 epochs
    weight_decay=0.01,
    fp16=torch.cuda.is_available(),  # mixed precision speeds up GPU training; requires CUDA
)


**Trainer**

In [11]:
# Updates DistilBERT's weights so that it specializes in IMDB sentiment classification
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)




**Train DistilBERT**

In [12]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.3025,0.308017,0.8642,0.931131,0.788847,0.854104
2,0.2067,0.305351,0.8913,0.888823,0.896408,0.8926


TrainOutput(global_step=5000, training_loss=0.27113096618652344, metrics={'train_runtime': 483.4811, 'train_samples_per_second': 165.467, 'train_steps_per_second': 10.342, 'total_flos': 2649347973120000.0, 'train_loss': 0.27113096618652344, 'epoch': 2.0})
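
The figures in `TrainOutput` can be cross-checked against the settings used above. The sketch below redoes that arithmetic, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook); the variable names are illustrative.

```python
import math

# Values taken from the notebook: split sizes, TrainingArguments, TrainOutput.
train_size = 40000          # training examples after the 80/20 split
batch_size = 16             # per_device_train_batch_size
epochs = 2                  # num_train_epochs
train_runtime_s = 483.4811  # 'train_runtime' reported by the Trainer

# Assumption: one device, no gradient accumulation.
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_size * epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print("steps per epoch   :", steps_per_epoch)
print("total steps       :", total_steps)
print("samples per second:", round(samples_per_second, 3))
print("steps per second  :", round(steps_per_second, 3))
```

Under those assumptions the totals line up with the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`.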

**Test**

In [13]:
text = "This movie was absolutely fantastic and inspiring!"

device = model.device

inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

prediction = torch.argmax(outputs.logits, dim=1).item()
print("Positive" if prediction == 1 else "Negative")


Positive


In [14]:
text = "This movie was a complete waste of time. The story was boring and the acting was terrible."

device = model.device

inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

prediction = torch.argmax(outputs.logits, dim=1).item()
print("Positive" if prediction == 1 else "Negative")


Negative


In [15]:
text = "A masterpiece! Visually stunning and emotionally touching."

device = model.device

inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

prediction = torch.argmax(outputs.logits, dim=1).item()
print("Positive" if prediction == 1 else "Negative")


Positive
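
The three cells above keep only the `argmax` of the logits. If a confidence score is also wanted, the logits can be passed through a softmax first. The sketch below shows that step in plain Python on made-up logit values; `softmax` and `example_logits` are illustrative, and in the notebook itself `torch.softmax(outputs.logits, dim=1)` would play the same role.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits in the order [negative, positive]; not real model output.
example_logits = [-1.2, 2.3]

probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # same as argmax
label = "Positive" if predicted == 1 else "Negative"

print(label, round(probs[predicted], 3))
```

Softmax does not change which class wins, so the predicted label is the same as with `argmax` alone; it only adds a probability that can be shown to the user or thresholded.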


In [16]:

# Evaluate performance on the full test set
results = trainer.evaluate()
print(results)  # Accuracy, F1, Precision, Recall


{'eval_loss': 0.3053514063358307, 'eval_accuracy': 0.8913, 'eval_precision': 0.888823297914207, 'eval_recall': 0.8964080174637825, 'eval_f1': 0.8925995454994565, 'eval_runtime': 23.3706, 'eval_samples_per_second': 427.888, 'eval_steps_per_second': 26.743, 'epoch': 2.0}


In [17]:
# Save the fine-tuned model and its tokenizer
model.save_pretrained("imdb_bert_model")
tokenizer.save_pretrained("imdb_bert_model")
print("Model and tokenizer saved! ✅")

Model and tokenizer saved! ✅


In [18]:
import shutil

shutil.make_archive(
    "imdb_bert_model",  # name of the zip archive (without extension)
    'zip',
    "imdb_bert_model"   # folder to compress
)

print("Folder compressed to imdb_bert_model.zip ✅")


Folder compressed to imdb_bert_model.zip ✅


In [19]:
from google.colab import files

files.download("imdb_bert_model.zip")



**Flask + ngrok deployment (DistilBERT / PyTorch)**


In [None]:
# ==================== 9️⃣ FLASK + NGROK (DistilBERT) ====================
!pip -q install flask pyngrok

import os
import torch
from flask import Flask, request, render_template_string
from pyngrok import ngrok
import socket

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Kill any existing ngrok processes to avoid tunnel conflicts
os.system("pkill -f ngrok")

app = Flask(__name__)

# ✅ Load the saved model and tokenizer
MODEL_DIR = "imdb_bert_model"  # folder created by model.save_pretrained(...)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)

# ✅ Select the device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

HTML_TEMPLATE = """
<!DOCTYPE html>
<html>
<head>
    <title>IMDB Sentiment Analyzer</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            text-align: center;
            margin-top: 60px;
            background: #0b1220;
            color: white;
        }
        h1 { color: #22d3ee; }
        input[type=text] {
            width: 520px;
            padding: 14px;
            border-radius: 12px;
            border: 1px solid #334155;
            background: #111a2e;
            color: white;
        }
        input[type=submit] {
            padding: 14px 28px;
            background: linear-gradient(135deg, #22d3ee, #7c3aed);
            color: #06111f;
            border-radius: 12px;
            border: none;
            font-weight: bold;
            cursor: pointer;
        }
        .result {
            font-size: 22px;
            margin-top: 30px;
            padding: 18px;
            border-radius: 14px;
            display: inline-block;
            border: 1px solid #334155;
            background: rgba(255,255,255,0.06);
        }
        .footer {
            margin-top: 40px;
            font-size: 13px;
            color: #94a3b8;
        }
        a { color: #22d3ee; }
    </style>
</head>
<body>

<h1>IMDB Movie Review</h1>

<form method="post" action="/predict">
    <input type="text" name="review" placeholder="Write a review here..." required>
    <input type="submit" value="Predict">
</form>

{% if prediction %}
<div class="result">
    {{ prediction }}
</div>
{% endif %}

<div class="footer">
🔗 <a href="{{ ngrok_url }}" target="_blank">{{ ngrok_url }}</a><br>
Device: {{ device_name }}
</div>

</body>
</html>
"""

@app.route('/')
def home():
    return render_template_string(
        HTML_TEMPLATE,
        prediction=None,
        ngrok_url=public_url,
        device_name=str(device)
    )

@app.route('/predict', methods=['POST'])
def predict():
    review = request.form['review']

    # ✅ DistilBERT tokenization
    inputs = tokenizer(
        review,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=128
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # ✅ Inference
    with torch.no_grad():
        outputs = model(**inputs)
        pred = torch.argmax(outputs.logits, dim=1).item()

    result = "Positive 😊" if pred == 1 else "Negative 😞"
    return render_template_string(
        HTML_TEMPLATE,
        prediction=f"Detected Sentiment → {result}",
        ngrok_url=public_url,
        device_name=str(device)
    )

# ✅ ngrok auth token: read it from the NGROK_AUTHTOKEN environment variable instead of hard-coding the secret
ngrok.set_auth_token(os.environ["NGROK_AUTHTOKEN"])

# ✅ Dynamic port: try base_port, then the next 10 ports
base_port = 5000
flask_port = base_port
while True:
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(('127.0.0.1', flask_port))
        s.close()
        break
    except OSError:
        s.close()  # release the socket before trying the next port
        flask_port += 1
        if flask_port > base_port + 10:
            raise RuntimeError("Could not find a free port.")

tunnel = ngrok.connect(flask_port)
public_url = tunnel.public_url  # plain URL string, used for the link in the HTML template
print("🚀 Public URL:", tunnel)

app.run(port=flask_port, use_reloader=False, threaded=True)


🚀 Public URL: NgrokTunnel: "https://sibyllic-vanesa-listlessly.ngrok-free.dev" -> "http://localhost:5000"
 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:127.0.0.1 - - [17/Dec/2025 20:04:58] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [17/Dec/2025 20:04:58] "GET /favicon.ico HTTP/1.1" 404 -
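
The port-selection loop in the deployment cell can also be written as a small reusable function. The following is a sketch of the same probing idea using only the standard library; `find_free_port` and its parameters are hypothetical names introduced here. As in the original loop, there is a short window between the check and the moment Flask binds the port, so this is a convenience, not a guarantee.

```python
import socket


def find_free_port(base_port=5000, max_tries=10, host="127.0.0.1"):
    """Return the first port in [base_port, base_port + max_tries] that binds."""
    last_port = min(base_port + max_tries, 65535)
    for port in range(base_port, last_port + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))   # succeeds only if the port is free
            except OSError:
                continue               # port in use: try the next one
            return port                # socket is closed on leaving `with`
    raise RuntimeError("Could not find a free port.")


# Demo: occupy an OS-assigned port, then ask for the next free one.
blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
blocker.bind(("127.0.0.1", 0))         # port 0 lets the OS pick a free port
busy_port = blocker.getsockname()[1]

free_port = find_free_port(busy_port)  # must skip busy_port itself
blocker.close()

print("busy:", busy_port, "free:", free_port)
```

Wrapping the probe in a function keeps the retry limit and error message in one place, and the `with` block ensures each probing socket is closed whether or not the bind succeeds.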
