# Week 4 Lab: Building the AMU Chatbot Brain

Your mission today, should you choose to accept it, is to build the "brain" for a new AMU chatbot. 

The goal is **Intent Classification**: We need to take a student's query (e.g., *"Où est mon emploi du temps?"*) and map it to a specific service from the AMU portal (e.g., `get_schedule`).

Today, we will build and compare **four** different "brains" to see how they perform. We'll compare them on two key metrics:
1.  **Accuracy**: Does it get the right answer?
2.  **Latency**: How fast is it?
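
Both metrics can be measured with one small helper. The sketch below is only an illustration: `evaluate_classifier`, `toy_examples`, and `always_schedule` are hypothetical names, not part of the lab's code. It assumes a classifier is any function that maps a query string to a label.

```python
import time

def evaluate_classifier(classify_fn, examples):
    """Return (accuracy, mean latency in milliseconds) for a query -> label function."""
    correct = 0
    latencies_ms = []
    for example in examples:
        start = time.perf_counter()
        prediction = classify_fn(example["text"])
        latencies_ms.append((time.perf_counter() - start) * 1000)
        correct += prediction == example["label"]
    accuracy = correct / len(examples)
    mean_latency_ms = sum(latencies_ms) / len(latencies_ms)
    return accuracy, mean_latency_ms

# Toy stand-ins (hypothetical): two labelled queries and a trivial "classifier"
toy_examples = [
    {"text": "Où est mon emploi du temps?", "label": "get_schedule"},
    {"text": "Ouvrir ma messagerie", "label": "check_email"},
]

def always_schedule(query):
    # Ignores the query and always answers with the same intent
    return "get_schedule"

accuracy, latency_ms = evaluate_classifier(always_schedule, toy_examples)
print(f"Accuracy: {accuracy:.0%}, mean latency: {latency_ms:.4f} ms")
```

Timing each call separately with `time.perf_counter()` keeps model loading out of the measurement, so slow and fast classifiers are compared on inference alone.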

## Module 0: Setup

First, let's install the libraries we'll need. We'll use `transformers`, `datasets` (for later), `scikit-learn` for our classic ML model, `pandas` for our final analysis, and `sentence-transformers` (which is built on top of `transformers`) for high-quality embeddings.

In [14]:
%pip install -U transformers datasets scikit-learn pandas sentence-transformers accelerate torch


Note: you may need to restart the kernel to use updated packages.


In [15]:
import time
import numpy as np
import pandas as pd
import re
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
import torch

# A helper to print things nicely
from IPython.display import display, Markdown

## Module 1: The Dataset - Our "Ground Truth"

Before we can build a classifier, we need data! Based on the AMU portal, we'll define 5 key intents. We'll then create a small, synthetic dataset of *training* queries and *testing* queries.

**Our 5 Intents:**
1.  `get_schedule` (for "Planning des cours (ADE)")
2.  `check_email` (for "Ma messagerie")
3.  `register_classes` (for "Inscriptions pédagogiques (IP)")
4.  `get_student_card` (for "Ma carte AMU")
5.  `find_library_info` (for "BU" / "Compte Lecteur BU")

In [17]:
# Here is our training data. We'll use this to train Classifier 2.
train_data = [
    {"text": "Où est mon emploi du temps?", "label": "get_schedule"},
    {"text": "Je veux voir mes cours de demain", "label": "get_schedule"},
    {"text": "Afficher mon planning de la semaine", "label": "get_schedule"},
    {"text": "J'ai un nouveau mail?", "label": "check_email"},
    {"text": "Ouvrir ma messagerie", "label": "check_email"},
    {"text": "Boite de réception", "label": "check_email"},
    {"text": "Comment je m'inscris à un cours?", "label": "register_classes"},
    {"text": "Où sont les inscriptions pédas?", "label": "register_classes"},
    {"text": "Je dois faire mon IP", "label": "register_classes"},
    {"text": "J'ai perdu ma carte étudiante", "label": "get_student_card"},
    {"text": "Refaire ma carte AMU", "label": "get_student_card"},
    {"text": "La BU est ouverte?", "label": "find_library_info"},
    {"text": "Quels sont les horaires de la bibliothèque?", "label": "find_library_info"},
    {"text": "Je veux emprunter un livre", "label": "find_library_info"}
]

# Here is our test data. We'll use this to evaluate ALL classifiers.
test_data = [
    {"text": "C'est quand mon prochain TD?", "label": "get_schedule"},
    {"text": "Ouvrir la boite de réception", "label": "check_email"},
    {"text": "Je veux m'inscrire en L3", "label": "register_classes"},
    {"text": "Ma carte est cassée", "label": "get_student_card"},
    {"text": "Les horaires de la BU St Charles", "label": "find_library_info"},
    {"text": "Quelle salle pour mon cours de 10h?", "label": "get_schedule"},
    {"text": "J'ai reçu un email important?", "label": "check_email"},
    {"text": "C'est quand les IP?", "label": "register_classes"},
    {"text": "Où est-ce que je peux imprimer avec ma carte?", "label": "get_student_card"},
    {"text": "Comment réserver un livre à la BU?", "label": "find_library_info"}
]

# Let's put them in a DataFrame to visualize
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

display(Markdown("### Training Data"))
display(train_df)
display(Markdown("### Test Data"))
display(test_df)

### Training Data

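|    | text | label |
|----|------|-------|
| 0 | Où est mon emploi du temps? | get_schedule |
| 1 | Je veux voir mes cours de demain | get_schedule |
| 2 | Afficher mon planning de la semaine | get_schedule |
| 3 | J'ai un nouveau mail? | check_email |
| 4 | Ouvrir ma messagerie | check_email |
| 5 | Boite de réception | check_email |
| 6 | Comment je m'inscris à un cours? | register_classes |
| 7 | Où sont les inscriptions pédas? | register_classes |
| 8 | Je dois faire mon IP | register_classes |
| 9 | J'ai perdu ma carte étudiante | get_student_card |
| 10 | Refaire ma carte AMU | get_student_card |
| 11 | La BU est ouverte? | find_library_info |
| 12 | Quels sont les horaires de la bibliothèque? | find_library_info |
| 13 | Je veux emprunter un livre | find_library_info |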


### Test Data

|   | text | label |
|---|------|-------|
| 0 | C'est quand mon prochain TD? | get_schedule |
| 1 | Ouvrir la boite de réception | check_email |
| 2 | Je veux m'inscrire en L3 | register_classes |
| 3 | Ma carte est cassée | get_student_card |
| 4 | Les horaires de la BU St Charles | find_library_info |
| 5 | Quelle salle pour mon cours de 10h? | get_schedule |
| 6 | J'ai reçu un email important? | check_email |
| 7 | C'est quand les IP? | register_classes |
| 8 | Où est-ce que je peux imprimer avec ma carte? | get_student_card |
| 9 | Comment réserver un livre à la BU? | find_library_info |


--- 
## Module 2: Classifier 1 - The Regex Baseline

Our first model isn't a model at all! It's a simple function using Regular Expressions (or just `if/in` statements) to check for keywords.

**Why?** It's extremely fast, easy to understand, and a perfect baseline. Never underestimate the power of a simple, robust baseline.

### ✏️ Your Exercise:

Complete the `classify_regex` function below. We've given you a `keywords` dictionary to start with. Your function should take a `query`, convert it to lowercase, and check if any keywords for an intent are present. It should return the *first* intent it matches.

In [18]:
keywords = {
    "get_schedule": ["planning", "emploi du temps", "cours", "salle", "td", "cm"],
    "check_email": ["mail", "messagerie", "email", "boite de réception"],
    "register_classes": ["inscription", "ip", "péda", "inscrire"],
    "get_student_card": ["carte", "amu"],
    "find_library_info": ["bu", "bibliothèque", "livre", "emprunter"]
}

def classify_regex(query):
    query_low = query.lower()
    for intent, kws in keywords.items():
        for kw in kws:
            if kw in query_low:
                return intent
    return "unknown" # Default if no keyword is found

# --- Test your function ---
test_query = test_data[0]['text']
prediction = classify_regex(test_query)

print(f"Query: '{test_query}'")
print(f"Prediction: {prediction}")
print(f"Correct: {prediction == test_data[0]['label']}")

Query: 'C'est quand mon prochain TD?'
Prediction: get_schedule
Correct: True


### Exploration:
Look at the `keywords` for `get_student_card`. The keyword `"amu"` correctly matches *"Refaire ma carte AMU"*, but it also matches *"Comment contacter le secrétariat d'AMU?"*, which is a false positive! How would you make your rules more specific to avoid this? Keyword matching sees strings, not meaning, and that is its fundamental limit.
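
One possible direction is sketched below, as a starting point rather than a reference solution. It assumes two changes that are not in the original `keywords` dictionary: each intent gets a compiled pattern with `\b` word boundaries, and the ambiguous `"amu"` keyword is dropped. The names `intent_patterns` and `classify_regex_strict` are hypothetical.

```python
import re

# Hypothetical stricter variant of the keyword table: \b anchors each keyword
# to word boundaries, so "ip" no longer matches inside "équipe".
intent_patterns = {
    "get_schedule": re.compile(r"\b(?:planning|emploi du temps|cours|salle|td|cm)\b"),
    "check_email": re.compile(r"\b(?:(?:e-?)?mails?|messagerie|boite de réception)\b"),
    "register_classes": re.compile(r"\b(?:inscri\w*|ip|péda\w*)\b"),
    "get_student_card": re.compile(r"\bcarte\b"),  # "amu" removed: too ambiguous
    "find_library_info": re.compile(r"\b(?:bu|bibliothèque|livres?|emprunter)\b"),
}

def classify_regex_strict(query):
    query_low = query.lower()
    # Dictionary order still decides ties: the first matching intent wins
    for intent, pattern in intent_patterns.items():
        if pattern.search(query_low):
            return intent
    return "unknown"

for q in [
    "Comment contacter le secrétariat d'AMU?",
    "L'équipe arrive au début du semestre",
    "C'est quand les IP?",
]:
    print(f"{q!r} -> {classify_regex_strict(q)}")
```

Word boundaries remove accidental substring hits, but the order-dependence remains: *"Comment je m'inscris à un cours?"* still hits `get_schedule` first because of `cours`.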

---
## Module 3: Classifier 2 - Embeddings + Logistic Regression

This is our "classic ML" approach. We will use a powerful pretrained model from Hugging Face to **extract features** (embeddings) from our text. Then, we'll feed these numerical features into a very simple and fast `scikit-learn` classifier, `LogisticRegression`.

This is a perfect example of mixing the "low-level" `transformers` world with the `scikit-learn` ecosystem.

### Step 3.1: Load the Embedding Model

We'll use a `sentence-transformers` model. These models are specifically fine-tuned so that texts with similar meaning get similar embeddings, which makes them well suited to comparison and classification. We'll use a multilingual one since our queries are in French.
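
To make "comparison" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embeddings. The three-dimensional vectors are made up for illustration; real embeddings from this model have 384 dimensions, and the variable names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy "embeddings" (NOT real model output)
schedule_query = [0.9, 0.1, 0.0]   # e.g. a timetable question
timetable_query = [0.8, 0.2, 0.1]  # a paraphrase of the same question
email_query = [0.0, 0.2, 0.9]      # an unrelated email question

sim_related = cosine_similarity(schedule_query, timetable_query)
sim_unrelated = cosine_similarity(schedule_query, email_query)
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

A classifier trained on such vectors works because queries with the same intent tend to point in similar directions.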

### ✏️ Your Exercise:

Load the `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` model. We'll create a helper function `get_embeddings` that takes a list of texts and returns their embeddings.

In [19]:
# Load the embedding model
# This will download the model the first time you run it
print("Loading embedding model...")
embed_model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
print("Model loaded!")

def get_embeddings(text_list):
    # .encode() is the main function of SentenceTransformer
    # It handles tokenization, model forward pass, and pooling all in one!
    return embed_model.encode(text_list)

# --- Test your function ---
example_texts = [train_data[0]['text'], train_data[3]['text']]
example_embeddings = get_embeddings(example_texts)

print(f"\nSuccessfully converted {len(example_texts)} texts into embeddings.")
print(f"Shape of embeddings: {example_embeddings.shape}")

Loading embedding model...
Model loaded!

Successfully converted 2 texts into embeddings.
Shape of embeddings: (2, 384)


### Step 3.2: Train the Classifier

Now we'll use our `train_data` to train a `LogisticRegression` model.

### ✏️ Your Exercise:
1.  Create `X_train` by getting the embeddings for all texts in `train_data`.
2.  Create `y_train` by getting the corresponding labels from `train_data`.
3.  Initialize and `fit` a `LogisticRegression` classifier.

In [20]:
# 1. Create X_train (the features)
print("Creating training embeddings...")
train_texts = [item['text'] for item in train_data]
X_train = get_embeddings(train_texts)

# 2. Create y_train (the labels)
y_train = [item['label'] for item in train_data]

print(f"Created X_train with shape {X_train.shape} and y_train with {len(y_train)} labels")

# 3. Train the classifier
print("Training Logistic Regression classifier...")
clf_logreg = LogisticRegression(max_iter=1000) # Use more iterations for convergence
clf_logreg.fit(X_train, y_train)
print("Classifier trained!")

# --- Test your classifier ---
test_query = test_data[0]['text']
test_embedding = get_embeddings([test_query]) # Note: must be a list!

prediction = clf_logreg.predict(test_embedding)[0]
print(f"\nQuery: '{test_query}'")
print(f"Prediction: {prediction}")
print(f"Correct: {prediction == test_data[0]['label']}")

Creating training embeddings...
Created X_train with shape (14, 384) and y_train with 14 labels
Training Logistic Regression classifier...
Classifier trained!

Query: 'C'est quand mon prochain TD?'
Prediction: get_schedule
Correct: True


### Exploration:
How much better does this model get with more data? Try adding 5-10 more examples to `train_data` and re-run this module. Does the accuracy on the test set improve? What other `scikit-learn` classifiers could you try instead of `LogisticRegression`? (e.g., `SVC`, `RandomForestClassifier`).
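
One way to approach the last question is sketched below. Because the real embeddings require the model download, this sketch uses synthetic clusters from `make_blobs` as a stand-in. That is an assumption for illustration, not the lab's data. In the notebook you would fit on `X_train`/`y_train` and score on embeddings of the test queries instead. The `candidates` and `results` names are hypothetical.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for sentence embeddings: 5 clusters ("intents") in 16 dimensions
X, y = make_blobs(n_samples=100, centers=5, n_features=16, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Same fit/predict interface, so swapping classifiers is a one-line change
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC (linear kernel)": SVC(kernel="linear"),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, clf.predict(X_te))
    print(f"{name}: accuracy = {results[name]:.2f}")
```

On well-separated synthetic clusters all three models score similarly. Differences are more likely to show up on the real embeddings, where only 14 training examples are available.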

---
## Module 4: Classifier 3 - The Zero-Shot Pipeline

This is the **high-level API** from your lecture. We'll use a `zero-shot-classification` pipeline. The underlying model was trained on Natural Language Inference (NLI): given a "premise" (our query) and a "hypothesis" (a sentence built from one candidate label), it scores whether the premise entails the hypothesis. The label whose hypothesis gets the highest entailment score wins.

**The best part?** No training data required! 
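
The mechanics can be sketched without loading any model. The code below is a simplified illustration, not the pipeline's actual implementation. It assumes the hypothesis template `"This example is {}."` (to our knowledge the default in the `transformers` pipeline) and uses made-up entailment logits in place of real model output. All function names are hypothetical.

```python
import math

# Assumed default template of the transformers zero-shot pipeline
HYPOTHESIS_TEMPLATE = "This example is {}."

def build_nli_pairs(query, candidate_labels, template=HYPOTHESIS_TEMPLATE):
    """Pair the query (premise) with one hypothesis per candidate label."""
    return [(query, template.format(label)) for label in candidate_labels]

def softmax(values):
    largest = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(candidate_labels, entailment_logits):
    """Normalize entailment logits across labels and sort best-first."""
    scores = softmax(entailment_logits)
    return sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)

labels = ["student's class schedule", "check student email"]
pairs = build_nli_pairs("C'est quand mon prochain TD?", labels)
for premise, hypothesis in pairs:
    print(f"premise={premise!r}  hypothesis={hypothesis!r}")

# Made-up logits standing in for the NLI model's entailment outputs
fake_entailment_logits = [2.0, -1.0]
ranking = rank_labels(labels, fake_entailment_logits)
print(ranking[0])
```

This also explains why descriptive labels help: each label is read as part of a sentence, so `"check student email"` gives the model more to work with than `check_email`.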

### ✏️ Your Exercise:
1.  Load a `zero-shot-classification` pipeline. We'll use `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli`, which is a strong multilingual model.
2.  Define your `candidate_labels`. **Pro-tip**: These are *descriptions*, not just the short names. This helps the model.
3.  Call the classifier on a test query.

In [22]:
# 1. Load the pipeline
print("Loading zero-shot pipeline... (This may take a moment)")
clf_zero_shot = pipeline(
    "zero-shot-classification", 
    model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli",
    device=0 if torch.cuda.is_available() else -1 # Use GPU if available
)
print("Pipeline loaded!")

# 2. Define candidate labels (we use English for fun, the model is multilingual!)
candidate_labels = [
    "student's class schedule",
    "check student email",
    "register for new classes",
    "manage student ID card",
    "find library information"
]

# This maps the descriptive labels back to our short ones
label_map_zero_shot = {
    "student's class schedule": "get_schedule",
    "check student email": "check_email",
    "register for new classes": "register_classes",
    "manage student ID card": "get_student_card",
    "find library information": "find_library_info"
}

# 3. Call the classifier
test_query = test_data[0]['text']
result = clf_zero_shot(test_query, candidate_labels)

# The result is a dictionary, the top label is the first one
top_label_desc = result['labels'][0]
prediction = label_map_zero_shot[top_label_desc]

print(f"\nQuery: '{test_query}'")
print(f"Model's top choice: '{top_label_desc}' (Score: {result['scores'][0]:.2f})")
print(f"Prediction: {prediction}")
print(f"Correct: {prediction == test_data[0]['label']}")

Loading zero-shot pipeline... (This may take a moment)
  2025-10-20T09:13:37.454271Z  WARN  Status Code: 500. Retrying..., request_id: "01K80EAHWMT34PCY766SDYKJK5"
    at /Users/runner/work/xet-core/xet-core/cas_client/src/http_client.rs:227

  2025-10-20T09:13:37.454289Z  WARN  Retry attempt #0. Sleeping 384.375198ms before the next attempt
    at /Users/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/reqwest-retry-0.7.0/src/middleware.rs:171

  2025-10-20T09:13:37.944870Z  WARN  Status Code: 500. Retrying..., request_id: "01K80EAJBZ274FX0QDSWJN1F5V"
    at /Users/runner/work/xet-core/xet-core/cas_client/src/http_client.rs:227

  2025-10-20T09:13:37.944925Z  WARN  Retry attempt #1. Sleeping 2.88131406s before the next attempt
    at /Users/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/r

ValueError: Could not load model MoritzLaurer/mDeBERTa-v3-base-mnli-xnli with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSequenceClassification'>, <class 'transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2ForSequenceClassification'>). See the original errors:

while loading with AutoModelForSequenceClassification, an error is thrown:
Traceback (most recent call last):
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/transformers/modeling_utils.py", line 1037, in _get_resolved_checkpoint_files
    resolved_archive_file = cached_file(pretrained_model_name_or_path, filename, **cached_file_kwargs)
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/transformers/utils/hub.py", line 322, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/transformers/utils/hub.py", line 567, in cached_files
    raise e
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/transformers/utils/hub.py", line 479, in cached_files
    hf_hub_download(
    ~~~~~~~~~~~~~~~^
        path_or_repo_id,
        ^^^^^^^^^^^^^^^^
    ...<10 lines>...
        local_files_only=local_files_only,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/huggingface_hub/file_download.py", line 1010, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
        # Destination
    ...<14 lines>...
        force_download=force_download,
    )
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/huggingface_hub/file_download.py", line 1171, in _hf_hub_download_to_cache_dir
    _download_to_tmp_and_move(
    ~~~~~~~~~~~~~~~~~~~~~~~~~^
        incomplete_path=Path(blob_path + ".incomplete"),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<8 lines>...
        xet_file_data=xet_file_data,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/huggingface_hub/file_download.py", line 1723, in _download_to_tmp_and_move
    xet_get(
    ~~~~~~~^
        incomplete_path=incomplete_path,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<3 lines>...
        displayed_filename=filename,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/huggingface_hub/file_download.py", line 629, in xet_get
    download_files(
    ~~~~~~~~~~~~~~^
        xet_download_info,
        ^^^^^^^^^^^^^^^^^^
    ...<3 lines>...
        progress_updater=[progress_updater],
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
RuntimeError: Data processing error: CAS service error : Reqwest Error: HTTP status server error (500 Internal Server Error), domain: https://cas-server.xethub.hf.co/reconstructions/8bded11c3b90feb4aa05526dcb7665950899379207b117a01166aeaa4abf5cfa

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/transformers/pipelines/base.py", line 293, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/transformers/models/auto/auto_factory.py", line 604, in from_pretrained
    return model_class.from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/transformers/modeling_utils.py", line 277, in _wrapper
    return func(*args, **kwargs)
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/transformers/modeling_utils.py", line 4900, in from_pretrained
    checkpoint_files, sharded_metadata = _get_resolved_checkpoint_files(
                                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<15 lines>...
        transformers_explicit_filename=transformers_explicit_filename,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/transformers/modeling_utils.py", line 1160, in _get_resolved_checkpoint_files
    raise OSError(
    ...<5 lines>...
    ) from e
OSError: Can't load the model for 'MoritzLaurer/mDeBERTa-v3-base-mnli-xnli'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'MoritzLaurer/mDeBERTa-v3-base-mnli-xnli' is the correct path to a directory containing a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/transformers/modeling_utils.py", line 4900, in from_pretrained
    checkpoint_files, sharded_metadata = _get_resolved_checkpoint_files(
                                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<15 lines>...
        transformers_explicit_filename=transformers_explicit_filename,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/muellersebastian/1math/1teaching/llms-data-science-course/venvLLMDS/lib/python3.13/site-packages/transformers/modeling_utils.py", line 1160, in _get_resolved_checkpoint_files
    raise OSError(
    ...<5 lines>...
    ) from e
OSError: Can't load the model for 'MoritzLaurer/mDeBERTa-v3-base-mnli-xnli'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'MoritzLaurer/mDeBERTa-v3-base-mnli-xnli' is the correct path to a directory containing a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.




### Exploration:
This model's performance is *highly* dependent on the `candidate_labels`. What happens if you use our short labels (e.g., `"get_schedule"`)? What if you use French descriptions (e.g., `"emploi du temps"`)? Which works best?
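One way to run this exploration systematically is to score each label set with the same small function. The sketch below assumes the classifier returns a dict whose `'labels'` list is sorted best-first, as `res_zero_shot['labels'][0]` is used elsewhere in this lab. `accuracy_for_label_set`, `toy_classifier`, and the two demo items are hypothetical names and data for illustration only; the toy classifier merely counts shared words so the cell runs without downloading a model, and its scores say nothing about the real model.

```python
def accuracy_for_label_set(classify_fn, label_map, items):
    """Accuracy when the keys of label_map are used as candidate labels.

    classify_fn(query, candidate_labels) must return a dict whose 'labels'
    list is sorted best-first. label_map maps each candidate label back to
    an intent name such as 'get_schedule'.
    """
    candidates = list(label_map)
    hits = 0
    for item in items:
        top = classify_fn(item["text"], candidates)["labels"][0]
        hits += label_map[top] == item["label"]
    return hits / len(items)


def toy_classifier(query, candidate_labels):
    # Stand-in for clf_zero_shot (assumption): rank labels by word overlap
    words = set(query.lower().split())
    ranked = sorted(candidate_labels,
                    key=lambda lab: -len(words & set(lab.lower().split())))
    return {"labels": ranked}


# Hypothetical mini test set and two candidate label sets to compare
items = [
    {"text": "où est mon emploi du temps", "label": "get_schedule"},
    {"text": "ouvrir ma messagerie", "label": "check_email"},
]
short_labels = {"get_schedule": "get_schedule", "check_email": "check_email"}
french_labels = {"emploi du temps": "get_schedule", "messagerie": "check_email"}

for name, label_map in [("short", short_labels), ("french", french_labels)]:
    print(name, accuracy_for_label_set(toy_classifier, label_map, items))
```

To test the real model, pass `clf_zero_shot` in place of `toy_classifier` and `test_data` in place of `items`.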

---
## Module 5: Classifier 4 - The LLM Prompt

Time for the SOTA (State-of-the-Art) approach. We'll use a powerful LLM. Instead of *training* it, we will *prompt* it.

This is **In-Context Learning**. We'll tell the model what its job is, give it the list of possible labels, and ask it to classify our query. This is the most flexible approach, but often the slowest.

### Your Exercise:
1.  Create a prompt using our template.
2.  Call the model and **parse the output** to get just the label.

In [None]:
# First, we need to install the Google AI client library
%pip install -q google-generativeai

In [13]:
import google.generativeai as genai
import os

# --- 1. Securely get your API key ---
try:
    # Used in Google Colab
    from google.colab import userdata
    GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
except ImportError:
    # Fallback for local Jupyter/VSCode
    GEMINI_API_KEY = os.environ.get('GEMINI_API_KEY')

if not GEMINI_API_KEY:
    raise ValueError("API Key not found. Please follow the instructions in the markdown cell above to set 'GEMINI_API_KEY'.")

genai.configure(api_key=GEMINI_API_KEY)
print("Gemini API configured successfully!")

# --- 2. Create the model and prompt --- 

# Get the list of labels from our dataframe
possible_labels = train_df['label'].unique().tolist()
labels_list_str = ", ".join(possible_labels)

MODEL_NAME = os.getenv('GEMINI_MODEL', 'gemini-2.5-flash')

# With the Gemini API, we use a 'system_instruction' to set the model's behavior
SYSTEM_PROMPT = f"""You are an AMU chatbot assistant. Classify the student's request into exactly one of the following categories: {labels_list_str}
Return ONLY the category name and nothing else."""

# We also set the 'generation_config' to control the output
generation_config = genai.types.GenerationConfig(
    temperature=0,      # We want deterministic, not creative, answers
    max_output_tokens=100 # We only need one or two words for the label
)

# --- 3. Initialize the model ---
model_gemini = genai.GenerativeModel(
    MODEL_NAME,
    system_instruction=SYSTEM_PROMPT,
    generation_config=generation_config
)

print(f"Gemini 2.5 Flash model loaded. Ready to classify!")


# --- 4. Call the API and parse --- 
test_index = 0
test_query = test_data[test_index]['text']
print(f"\nRunning Gemini inference for query: '{test_query}'...")

try:
    # This is the actual API call!
    response = model_gemini.generate_content(test_query)
    
    # The API response is clean, no more .split() needed!
    prediction = response.text.strip()

    print(f"Gemini Raw Output: {response.text}")
    print(f"Prediction: {prediction}")
    print(f"Correct: {prediction == test_data[test_index]['label']}")

except Exception as e:
    print(f"An error occurred: {e}")
    print("This may be due to a missing API key or safety settings.")

Gemini API configured successfully!
Gemini 2.5 Flash model loaded. Ready to classify!

Running Gemini inference for query: 'C'est quand mon prochain TD?'...


Gemini Raw Output: get_schedule
Prediction: get_schedule
Correct: True
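
The cell above trusts `response.text.strip()` to be exactly one label. LLMs sometimes wrap the answer in quotes, backticks, or a full sentence, so a slightly more defensive parser can help. The following is a minimal sketch, not part of the lab code: `parse_label` and the `"unknown"` fallback are hypothetical names, and `demo_labels` is a stand-in for `possible_labels`.

```python
import re

def parse_label(raw_output, valid_labels, fallback="unknown"):
    """Map raw LLM text to one of valid_labels, or to fallback."""
    # Drop surrounding whitespace, quotes, backticks, and a trailing period
    cleaned = raw_output.strip().strip("`'\".").lower()
    if cleaned in valid_labels:
        return cleaned
    # Otherwise look for a known label mentioned anywhere in the text
    found = [label for label in valid_labels
             if re.search(rf"\b{re.escape(label)}\b", cleaned)]
    # Accept only an unambiguous match
    return found[0] if len(found) == 1 else fallback

demo_labels = ["get_schedule", "check_email"]  # stand-in for possible_labels
print(parse_label("`Get_Schedule`", demo_labels))
print(parse_label("The category is check_email.", demo_labels))
print(parse_label("get_schedule or check_email", demo_labels))
```

Returning an explicit fallback instead of the raw text makes invalid answers easy to count later.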


### Exploration:
Prompt engineering is an art. How does the model's accuracy change if you change the system prompt? What if you give it two examples in the prompt (this is called "few-shot" prompting)?
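A few-shot prompt can be sketched as the original instructions followed by a handful of worked examples. The helper below is an illustration under assumptions: `build_few_shot_prompt` is a hypothetical name, `demo_labels` stands in for `possible_labels`, and the second example query is made up for the demo.

```python
def build_few_shot_prompt(labels, examples):
    """Build a system prompt with the label list plus worked examples."""
    lines = [
        "You are an AMU chatbot assistant. Classify the student's request "
        "into exactly one of the following categories: " + ", ".join(labels),
        "Return ONLY the category name and nothing else.",
        "",
        "Examples:",
    ]
    for text, label in examples:
        lines.append(f"Request: {text}")
        lines.append(f"Category: {label}")
    return "\n".join(lines)

demo_labels = ["get_schedule", "check_email"]  # stand-in for possible_labels
demo_examples = [
    ("Où est mon emploi du temps?", "get_schedule"),
    ("Je veux lire mes mails", "check_email"),  # hypothetical example query
]
few_shot_prompt = build_few_shot_prompt(demo_labels, demo_examples)
print(few_shot_prompt)
```

To try it, pass `few_shot_prompt` as `system_instruction` when constructing `genai.GenerativeModel`, and draw the examples from `train_df` rather than `test_data` so the benchmark stays honest.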

---
## Module 6: The Showdown!

It's time to compare our four champions. We will loop through our entire `test_data` and run all four classifiers on each query. We'll record their `prediction` and their `latency` (speed).

### Your Exercise:
Fill in the loop below. We've provided the structure. You need to call each of your classifiers and time them using `time.perf_counter()`.

**Note:** For a fair *latency* comparison, we should test on a CPU. But for the *accuracy* part, using a GPU for the slow models is fine. For simplicity, we'll just test on whatever device you have. Be aware that the Zero-Shot latency will be *much* lower on a GPU, while the Gemini API latency is dominated by the network round trip and server-side processing rather than by your local device.
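A single `time.perf_counter()` measurement can be noisy, especially on the first call to a model. One possible refinement, sketched here with the hypothetical helper `time_call`, is to run an untimed warm-up call and report the median of several timed runs. The demo times `str.upper` only so the cell runs without any model.

```python
import time
import statistics

def time_call(fn, *args, warmup=1, repeats=5):
    """Return (last result, median latency in ms) for fn(*args)."""
    for _ in range(warmup):
        fn(*args)  # warm-up runs are not timed
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        timings.append((time.perf_counter() - start) * 1000)
    return result, statistics.median(timings)

result, latency_ms = time_call(str.upper, "c'est quand mon prochain td?")
print(result, f"{latency_ms:.4f} ms")
```

For the Gemini API, every repeat is a billable request that counts against your quota, so `warmup=0, repeats=1` may be the sensible setting there.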

In [None]:
results = []
print(f"Running benchmarks on {len(test_data)} test items...")

for item in test_data:
    query = item['text']
    correct_label = item['label']

    # --- 1. Regex ---
    start_time = time.perf_counter()
    pred_regex = classify_regex(query)
    end_time = time.perf_counter()
    results.append({
        "classifier": "Regex",
        "query": query,
        "prediction": pred_regex,
        "correct": pred_regex == correct_label,
        "latency_ms": (end_time - start_time) * 1000
    })

    # --- 2. Embed + LogReg ---
    start_time = time.perf_counter()
    query_embedding = get_embeddings([query])
    pred_logreg = clf_logreg.predict(query_embedding)[0]
    end_time = time.perf_counter()
    results.append({
        "classifier": "Embed + LogReg",
        "query": query,
        "prediction": pred_logreg,
        "correct": pred_logreg == correct_label,
        "latency_ms": (end_time - start_time) * 1000
    })

    # --- 3. Zero-Shot Pipeline ---
    start_time = time.perf_counter()
    res_zero_shot = clf_zero_shot(query, candidate_labels)
    pred_zero_shot = label_map_zero_shot[res_zero_shot['labels'][0]]
    end_time = time.perf_counter()
    results.append({
        "classifier": "Zero-Shot Pipe",
        "query": query,
        "prediction": pred_zero_shot,
        "correct": pred_zero_shot == correct_label,
        "latency_ms": (end_time - start_time) * 1000
    })

    # --- 4. Gemini API ---
    start_time = time.perf_counter()
    try:
        # We re-use the 'model_gemini' we configured in Module 5
        response = model_gemini.generate_content(query)
        pred_llm = response.text.strip()
    except Exception as e:
        print(f"Gemini API error on query '{query}': {e}")
        pred_llm = "API_ERROR" # So we can see failures
    end_time = time.perf_counter()
    
    results.append({
        "classifier": "Gemini API",
        "query": query,
        "prediction": pred_llm,
        "correct": pred_llm == correct_label,
        "latency_ms": (end_time - start_time) * 1000
    })

print("Benchmarks complete!")

# Convert to a DataFrame for analysis
results_df = pd.DataFrame(results)

display(results_df)

### Final Analysis: Accuracy and Speed

Now for the final step. Let's group by our classifiers and calculate two things:
1.  **Accuracy**: The mean of the `correct` column (True=1, False=0).
2.  **Avg. Latency**: The mean of the `latency_ms` column.

In [None]:
# We have to fix any LLM predictions that didn't give a valid label
valid_labels = set(possible_labels)
def validate_label(row):
    if row['classifier'] == 'Gemini API' and row['prediction'] not in valid_labels:
        return False # Mark as incorrect if the label isn't in our list
    return row['correct']

results_df['correct'] = results_df.apply(validate_label, axis=1)

# Now, let's calculate our final metrics
final_report = results_df.groupby('classifier').agg(
    Accuracy=pd.NamedAgg(column='correct', aggfunc='mean'),
    Avg_Latency_ms=pd.NamedAgg(column='latency_ms', aggfunc='mean')
).sort_values(by='Accuracy', ascending=False)

# Format for nice printing
final_report['Accuracy'] = final_report['Accuracy'].apply(lambda x: f"{x*100:.2f}%")
final_report['Avg_Latency_ms'] = final_report['Avg_Latency_ms'].apply(lambda x: f"{x:.2f} ms")

display(Markdown("## Final Report Card"))
display(final_report)

## Conclusion & Your Turn

Look at your final report. What do you see?

* **Regex** is by far the **fastest**, but likely the least accurate (and will get worse as queries get more complex).
* **Embed + LogReg** is the perfect balance: **very fast** (once trained) and **very accurate**. Its downsides are that it needs good training data and that adding a new intent requires retraining.
* **Zero-Shot Pipe** is amazing for a **prototype** (good accuracy, no training!), but it's much slower.
* **LLM Prompt** (the `Gemini API` row) is likely the **most accurate** and flexible (it can handle typos and complex phrasing!), but it is by far the **slowest**.

### Final Question:

If you were building the *real* AMU chatbot, which would you choose? 

*(Hint: There's no single right answer. A great system might use a **hybrid**! Try Regex first for simple keywords, and if it doesn't find a match, pass the query to the `Embed + LogReg` model. This gives you the speed of Regex and the power of ML!)*
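
The hybrid idea from the hint can be sketched as a small wrapper that tries keyword rules first and only calls the ML model when no rule matches. Everything named here is an assumption for illustration: `make_hybrid_classifier`, the two regex rules, and the stub fallback are hypothetical, and the rules are not the ones behind `classify_regex`.

```python
import re

def make_hybrid_classifier(regex_rules, fallback_fn):
    """Return classify(query) -> (label, source): regex first, then fallback."""
    compiled = [(re.compile(pattern, re.IGNORECASE), label)
                for pattern, label in regex_rules]

    def classify(query):
        for pattern, label in compiled:
            if pattern.search(query):
                return label, "regex"
        # No keyword rule matched: defer to the (slower) ML model
        return fallback_fn(query), "fallback"

    return classify

# Hypothetical keyword rules for two of the intents
demo_rules = [
    (r"\bemploi du temps\b|\bplanning\b", "get_schedule"),
    (r"\bmails?\b|\bmessagerie\b", "check_email"),
]

# Stub fallback so the sketch runs on its own; in the lab this could be
# lambda q: clf_logreg.predict(get_embeddings([q]))[0]
classify = make_hybrid_classifier(demo_rules, lambda q: "ml_prediction")

print(classify("Où est mon emploi du temps?"))
print(classify("C'est quand mon prochain TD?"))
```

Returning the source alongside the label makes it easy to measure how often the fast path is taken.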