# Travel Order Resolver — Notebook
This notebook implements an intent classifier + NER (Departure/Destination) for the **Travel Order Resolver** project.
It uses **CamemBERT** (transfer learning) via Hugging Face Transformers.

**What is included**
- Installation of dependencies
- Loading dataset (from Google Drive or upload)
- Preprocessing and tokenization
- Fine-tuning (intent classification) and token-classification (NER)
- Inference pipeline that reads `sentenceID,sentence` and writes outputs in the required format

**Important notes**
- The GPU notebook is optimized to run on Google Colab with a GPU runtime.
- The CPU notebook is lighter and intended for local execution without a GPU (slower).

---


## CPU notebook (local execution)

This cell verifies whether required dependencies are already installed in the current Python environment (for example a `venv`). If any package is missing it installs packages from `requirements.txt` and ensures a CPU-only build of `torch` is installed. Using the current Python executable (`sys.executable`) avoids modifying a different environment.

In [18]:
import sys
import subprocess

print(f"Python executable: {sys.executable}")

# Try to import key packages to avoid unnecessary reinstalls
missing = []
for pkg in ("transformers","datasets","evaluate","seqeval","tokenizers","sacrebleu"):
    try:
        __import__(pkg)
    except Exception:
        missing.append(pkg)

if missing:
    print("Missing packages detected:", missing)
    print("Installing from requirements.txt using the current Python environment...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])
else:
    print("All listed Python packages are already installed.")

# Ensure CPU-only torch is present (install if missing)
try:
    import torch
    print("torch is installed (version: ", torch.__version__, ")")
except Exception:
    print("Installing CPU-only torch from PyTorch index...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "torch", "--index-url", "https://download.pytorch.org/whl/cpu"]) 

print("Dependencies ensured for CPU run.")

Python executable: /home/stanley-honkpehedji/Téléchargements/nlp_miniproject/.venv/bin/python
All listed Python packages are already installed.
torch is installed (version:  2.9.0+cpu )
Dependencies ensured for CPU run.


This cell loads the training and test CSV files into pandas DataFrames. If the files are not found under `dataset/` (relative to the notebook), adjust the paths to point to your local folder or Google Drive.

In [19]:

# Load dataset (adjust paths as needed)
import os
import pandas as pd

# Paths relative to the notebook's directory
train_csv = "dataset/train_set.csv"
test_csv = "dataset/test_set.csv"

# If using Drive, point these paths at the mounted Drive folder instead
if not os.path.exists(train_csv) or not os.path.exists(test_csv):
    print("Train or test CSV not found.")
    print(f"Looking for: {os.path.abspath(train_csv)}")
    print(f"Looking for: {os.path.abspath(test_csv)}")
    print("Please adjust the paths to match your file locations.")
else:
    train_df = pd.read_csv(train_csv, encoding="utf-8")
    test_df = pd.read_csv(test_csv, encoding="utf-8")
    print("Train shape:", train_df.shape, "Test shape:", test_df.shape)
    print("Data loaded successfully!")

Train shape: (7724, 3) Test shape: (1827, 3)
Data loaded successfully!


Loading the data: `train_set.csv` and `test_set.csv` are read with pandas. The code looks in `dataset/` by default and prints the shape of each DataFrame when both files are found.
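When the CSV files may live in different places (the notebook folder, another local directory, a mounted Google Drive), one option is to probe a list of candidate directories and use the first one that contains both files. The sketch below is only an illustration: the helper name `find_dataset_dir` and the candidate paths are assumptions, not part of the project.

```python
from pathlib import Path

def find_dataset_dir(candidates, required=("train_set.csv", "test_set.csv")):
    """Return the first candidate directory containing every required file, else None."""
    for candidate in candidates:
        directory = Path(candidate)
        if all((directory / name).is_file() for name in required):
            return directory
    return None

# Hypothetical candidate locations; adjust them to your own layout.
candidates = ["dataset", "/content", "/content/drive/MyDrive/nlp_miniprojects"]
dataset_dir = find_dataset_dir(candidates)
if dataset_dir is None:
    print("Dataset not found in:", candidates)
else:
    print("Using dataset directory:", dataset_dir.resolve())
```

The returned directory can then be joined with the file names before calling `pd.read_csv`.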

In [20]:

# Inspect and parse 'entities' JSON field (if present)
import json
def parse_entities_field(row):
    try:
        ents = json.loads(row['entities'])
    except Exception:
        ents = []
    valid = []
    for ent in ents:
        if 'start' in ent and 'end' in ent and 0 <= ent['start'] < ent['end'] <= len(row['text']):
            valid.append(ent)
    return valid

train_df['parsed_entities'] = train_df.apply(parse_entities_field, axis=1)
test_df['parsed_entities'] = test_df.apply(parse_entities_field, axis=1)
print('Parsed entities (example):')
train_df.head(3)

Parsed entities (example):


| Unnamed: 0 | text | intent | entities | parsed_entities |
| --- | --- | --- | --- | --- |
| 0 | Comment aller de Grenoble à Nogent-le-Rotrou ? | TRIP | [{"start": 17, "end": 25, "label": "Departure"... | [{'start': 17, 'end': 25, 'label': 'Departure'... |
| 1 | Bonsoir, je veux un voyage de Gex vers Lille. | TRIP | [{"start": 30, "end": 33, "label": "Departure"... | [{'start': 30, 'end': 33, 'label': 'Departure'... |
| 2 | Nous vous suis reconnaissant pour votre collab... | NOT_TRIP | [] | [] |


This cell parses and validates the `entities` JSON field (if present) for each row, and builds a `parsed_entities` column containing only the valid spans (start/end).
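To make the validation rule concrete, here is a small standalone sketch of the same check (`0 <= start < end <= len(text)`) applied to one sentence. The function name `valid_spans` is hypothetical, and the second span is deliberately made up to show a rejected entry; only the first span is taken from the training row shown above.

```python
import json

def valid_spans(text, entities_json):
    """Parse a JSON list of spans and keep those with 0 <= start < end <= len(text)."""
    try:
        entities = json.loads(entities_json)
    except (TypeError, ValueError):
        return []
    if not isinstance(entities, list):
        return []
    return [
        ent for ent in entities
        if isinstance(ent, dict)
        and "start" in ent and "end" in ent
        and 0 <= ent["start"] < ent["end"] <= len(text)
    ]

text = "Comment aller de Grenoble à Nogent-le-Rotrou ?"
raw = json.dumps([
    {"start": 17, "end": 25, "label": "Departure"},      # span from the first training row
    {"start": 40, "end": 400, "label": "Destination"},   # made-up span that overruns the text
])

for ent in valid_spans(text, raw):
    # Slicing the text with the offsets recovers the entity surface form.
    print(ent["label"], "->", text[ent["start"]:ent["end"]])
```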

In [21]:

# Prepare Hugging Face datasets for intent classification
from datasets import Dataset
from sklearn.preprocessing import LabelEncoder
from transformers import AutoTokenizer

label_encoder = LabelEncoder()
label_encoder.fit(train_df['intent'].unique())
num_labels = len(label_encoder.classes_)
print("Intent classes:", list(label_encoder.classes_))

hf_train = Dataset.from_pandas(train_df[['text','intent']].rename(columns={'intent':'label'}))
hf_test  = Dataset.from_pandas(test_df[['text','intent']].rename(columns={'intent':'label'}))

def encode_label(example):
    example['label'] = int(label_encoder.transform([example['label']])[0])
    return example

hf_train = hf_train.map(encode_label)
hf_test = hf_test.map(encode_label)

model_name = "camembert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_classification(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)

hf_train = hf_train.map(tokenize_classification, batched=True, remove_columns=['text'])
hf_test = hf_test.map(tokenize_classification, batched=True, remove_columns=['text'])

hf_train = hf_train.rename_column("label", "labels")
hf_test = hf_test.rename_column("label", "labels")
hf_train.set_format(type="torch")
hf_test.set_format(type="torch")

print(hf_train[0])

Intent classes: ['NOT_FRENCH', 'NOT_TRIP', 'TRIP', 'UNKNOWN']


Map: 100%|██████████| 7724/7724 [00:01<00:00, 7402.80 examples/s]
Map: 100%|██████████| 1827/1827 [00:00<00:00, 9042.27 examples/s]

Map: 100%|██████████| 7724/7724 [00:00<00:00, 14818.69 examples/s]
Map: 100%|██████████| 1827/1827 [00:00<00:00, 14754.94 examples/s]

{'labels': tensor(2), 'input_ids': tensor([    5,   841,   632,     8,  7069,    15, 29291,    26,   185,    26,
         5952, 21149,   106,     6,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1, 




This cell converts the DataFrames into Hugging Face datasets for intent classification: label encoding, tokenization (padding/truncation), and formatting for PyTorch.
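scikit-learn's `LabelEncoder` stores its classes in sorted order, so each intent id is simply the position of the intent name in the alphabetically sorted class list. The pure-Python sketch below mirrors that behavior on a few sample values (the sample list is illustrative) and shows why `TRIP` is encoded as `2` in the example printed above.

```python
# Sample intent values in arbitrary order (illustrative, not read from the CSV).
intents = ["TRIP", "NOT_TRIP", "UNKNOWN", "NOT_FRENCH", "TRIP"]

# Sorted unique classes, mirroring the order of LabelEncoder.classes_.
classes = sorted(set(intents))
intent2id = {name: idx for idx, name in enumerate(classes)}
id2intent = {idx: name for name, idx in intent2id.items()}

encoded = [intent2id[name] for name in intents]
print(intent2id)
print(encoded)
```

Keeping both `intent2id` and `id2intent` around makes it easy to turn predicted ids back into intent names at inference time.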

In [22]:

# Fine-tune CamemBERT for Intent Classification (CPU-friendly args)
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate, numpy as np

model_cls = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
metric = evaluate.load("f1")

def compute_metrics_intent(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    f1 = metric.compute(predictions=preds, references=labels, average="macro")['f1']
    acc = (preds == labels).mean()
    return {"accuracy": acc, "f1_macro": f1}

training_args = TrainingArguments(
    output_dir="./camembert-intent",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    num_train_epochs=2,
    save_strategy="epoch",
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    greater_is_better=True,
    fp16=False
)

trainer = Trainer(
    model=model_cls,
    args=training_args,
    train_dataset=hf_train,
    eval_dataset=hf_test,
    compute_metrics=compute_metrics_intent,
    tokenizer=tokenizer
)

trainer.train()
trainer.save_model("./camembert-intent-best")

Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at camembert-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

This cell fine-tunes CamemBERT for intent classification using the Hugging Face `Trainer` class. The arguments are adapted to a CPU environment (small batch size, fp16 disabled).
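The keyword that sets the evaluation schedule on `TrainingArguments` was renamed from `evaluation_strategy` to `eval_strategy` in newer transformers releases, and recent releases no longer accept the old name. If the notebook has to run across several installed versions, one possible approach is to inspect the constructor signature and pick whichever name it accepts. The helper `eval_strategy_kwarg` and the two stand-in classes below are hypothetical; they exist only so the sketch runs without transformers.

```python
import inspect

def eval_strategy_kwarg(args_cls):
    """Return the evaluation-schedule keyword accepted by args_cls.__init__."""
    params = inspect.signature(args_cls.__init__).parameters
    return "eval_strategy" if "eval_strategy" in params else "evaluation_strategy"

# Stand-in classes imitating the old and new constructor signatures.
class OldStyleArgs:
    def __init__(self, output_dir, evaluation_strategy="no"):
        self.output_dir = output_dir
        self.evaluation_strategy = evaluation_strategy

class NewStyleArgs:
    def __init__(self, output_dir, eval_strategy="no"):
        self.output_dir = output_dir
        self.eval_strategy = eval_strategy

for cls in (OldStyleArgs, NewStyleArgs):
    key = eval_strategy_kwarg(cls)
    args = cls(output_dir="./out", **{key: "epoch"})
    print(cls.__name__, "accepts", key)
```

With the real class, the same pattern would read `TrainingArguments(output_dir=..., **{eval_strategy_kwarg(TrainingArguments): "epoch"})`.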

In [None]:

# Prepare NER dataset (token-classification) - convert spans to BIO using fast tokenizer offsets
from transformers import AutoTokenizer
tokenizer_fast = AutoTokenizer.from_pretrained(model_name, use_fast=True)
ner_labels = ['O', 'B-Departure', 'I-Departure', 'B-Destination', 'I-Destination']
label2id = {l:i for i,l in enumerate(ner_labels)}
id2label = {i:l for l,i in label2id.items()}

def spans_to_bio(text, spans):
    encoding = tokenizer_fast(text, return_offsets_mapping=True, truncation=True, max_length=128)
    offsets = encoding['offset_mapping']
    labels = ['O'] * len(offsets)
    for ent in spans:
        st, ed = ent['start'], ent['end']
        token_indices = []
        for i,(a,b) in enumerate(offsets):
            if a==b==0:
                continue
            if not (b <= st or a >= ed):
                token_indices.append(i)
        if not token_indices:
            continue
        labels[token_indices[0]] = 'B-' + ent['label']
        for idx in token_indices[1:]:
            labels[idx] = 'I-' + ent['label']
    label_ids = [label2id.get(l,0) for l in labels]
    return encoding, label_ids

# Build HF datasets (train/test)
from datasets import Dataset
ner_rows = []
for _, r in train_df.iterrows():
    text = r['text']
    spans = r['parsed_entities'] if r['intent']=='TRIP' else []
    _, label_ids = spans_to_bio(text, spans)
    ner_rows.append({'text': text, 'labels': label_ids})

ner_train_ds = Dataset.from_list(ner_rows)

ner_rows_test = []
for _, r in test_df.iterrows():
    text = r['text']
    spans = r['parsed_entities'] if r['intent']=='TRIP' else []
    _, label_ids = spans_to_bio(text, spans)
    ner_rows_test.append({'text': text, 'labels': label_ids})

ner_test_ds = Dataset.from_list(ner_rows_test)
print('NER datasets prepared (counts):', len(ner_train_ds), len(ner_test_ds))

Preparing the data for the NER task (token classification): the code above converts the character spans into BIO labels aligned with the tokens produced by the fast tokenizer.
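The span-to-BIO conversion can be illustrated without loading a model. The sketch below uses a whitespace splitter as a stand-in for the fast tokenizer's offset mapping (an assumption; CamemBERT produces subword tokens, so real offsets differ) and applies the same overlap rule: a token is part of an entity when its character range intersects the span. The sentence and its spans are made up for the example.

```python
import re

def whitespace_offsets(text):
    """Stand-in for a tokenizer offset mapping: (start, end) of each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def spans_to_bio_tags(offsets, spans):
    """Tag each token O, B-<label> or I-<label> according to character-span overlap."""
    tags = ["O"] * len(offsets)
    for span in spans:
        # A token belongs to the span when the two character ranges overlap.
        inside = [i for i, (a, b) in enumerate(offsets)
                  if a < span["end"] and b > span["start"]]
        for position, token_index in enumerate(inside):
            prefix = "B-" if position == 0 else "I-"
            tags[token_index] = prefix + span["label"]
    return tags

# Made-up example sentence and spans.
text = "Je pars de La Rochelle pour Lyon"
spans = [
    {"start": 11, "end": 22, "label": "Departure"},    # "La Rochelle"
    {"start": 28, "end": 32, "label": "Destination"},  # "Lyon"
]

offsets = whitespace_offsets(text)
tags = spans_to_bio_tags(offsets, spans)
for (a, b), tag in zip(offsets, tags):
    print(f"{text[a:b]:<10}{tag}")
```

The first overlapping token receives the `B-` prefix and the remaining ones `I-`, which is how a multi-token city name such as "La Rochelle" stays a single entity.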

In [None]:

# Fine-tune CamemBERT for NER (CPU-friendly)
from transformers import AutoModelForTokenClassification, DataCollatorForTokenClassification, TrainingArguments, Trainer
import numpy as np, seqeval.metrics as seq_metrics

model_ner = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(ner_labels), id2label=id2label, label2id=label2id)
data_collator = DataCollatorForTokenClassification(tokenizer_fast)

def tokenize_and_align_labels(examples):
    # Pad every example to 128 tokens; labels stay unpadded here and are padded by the data collator
    tokenized_inputs = tokenizer_fast(examples['text'], truncation=True, padding='max_length', max_length=128)
    return {'input_ids': tokenized_inputs['input_ids'], 'attention_mask': tokenized_inputs['attention_mask'], 'labels': examples['labels']}

ner_train_tok = ner_train_ds.map(tokenize_and_align_labels, batched=True)
ner_test_tok = ner_test_ds.map(tokenize_and_align_labels, batched=True)

ner_train_tok.set_format(type='torch', columns=['input_ids','attention_mask','labels'])
ner_test_tok.set_format(type='torch', columns=['input_ids','attention_mask','labels'])

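def align_preds(predictions, label_ids):
    preds = np.argmax(predictions, axis=-1)
    preds_list, labels_list = [], []
    for pred, lab in zip(preds, label_ids):
        # Drop positions padded with -100 by the data collator (they have no entry in id2label)
        keep = lab != -100
        preds_list.append([id2label[int(p)] for p in pred[keep]])
        labels_list.append([id2label[int(l)] for l in lab[keep]])
    return preds_list, labels_list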

def compute_metrics_ner(p):
    preds, labels = p
    preds_list, labels_list = align_preds(preds, labels)
    return {
        'precision': seq_metrics.precision_score(labels_list, preds_list),
        'recall': seq_metrics.recall_score(labels_list, preds_list),
        'f1': seq_metrics.f1_score(labels_list, preds_list)
    }

training_args = TrainingArguments(
    output_dir='./camembert-ner',
    per_device_train_batch_size=2,
    per_device_eval_batch_size=4,
    eval_strategy='epoch',  # named `evaluation_strategy` in older transformers releases
    num_train_epochs=2,
    save_strategy='epoch',
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    greater_is_better=True
)

trainer_ner = Trainer(
    model=model_ner,
    args=training_args,
    train_dataset=ner_train_tok,
    eval_dataset=ner_test_tok,
    data_collator=data_collator,
    tokenizer=tokenizer_fast,
    compute_metrics=compute_metrics_ner
)

trainer_ner.train()
trainer_ner.save_model('./camembert-ner-best')

This cell fine-tunes CamemBERT for the token-classification (NER) task. Metrics are computed with `seqeval`, and the training arguments are configured for CPU.
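`DataCollatorForTokenClassification` pads label sequences with `-100` by default, and those positions have to be dropped before ids are converted to tag strings for `seqeval`. The following sketch shows that filtering step on plain Python lists; the function name `ids_to_tags` and the sample ids are illustrative assumptions.

```python
IGNORE_INDEX = -100  # label id assigned to padded positions

def ids_to_tags(pred_ids, label_ids, id2label):
    """Convert id sequences to tag strings, skipping positions whose gold label is IGNORE_INDEX."""
    pred_tags, gold_tags = [], []
    for preds, golds in zip(pred_ids, label_ids):
        kept = [(p, g) for p, g in zip(preds, golds) if g != IGNORE_INDEX]
        pred_tags.append([id2label[p] for p, _ in kept])
        gold_tags.append([id2label[g] for _, g in kept])
    return pred_tags, gold_tags

id2label = {0: "O", 1: "B-Departure", 2: "I-Departure", 3: "B-Destination", 4: "I-Destination"}

# One made-up sequence: the last two positions are padding.
pred_ids = [[0, 1, 0, 3, 0, 0]]
label_ids = [[0, 1, 0, 3, -100, -100]]

pred_tags, gold_tags = ids_to_tags(pred_ids, label_ids, id2label)
print(pred_tags)
print(gold_tags)
```

The two returned lists have matching lengths per sentence, which is what `seqeval`'s `precision_score`, `recall_score` and `f1_score` expect.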

In [None]:

# Inference pipeline (load saved models and run on new lines)
from transformers import pipeline, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoTokenizer

# Load intent model
intent_tokenizer = AutoTokenizer.from_pretrained('./camembert-intent-best')
intent_model = AutoModelForSequenceClassification.from_pretrained('./camembert-intent-best')

# Load ner model
ner_tokenizer = AutoTokenizer.from_pretrained('./camembert-ner-best', use_fast=True)
ner_model = AutoModelForTokenClassification.from_pretrained('./camembert-ner-best')

intent_pipe = pipeline('text-classification', model=intent_model, tokenizer=intent_tokenizer)
ner_pipe = pipeline('token-classification', model=ner_model, tokenizer=ner_tokenizer, aggregation_strategy='simple')

# Example inference function
def predict_line(sentenceID, sentence):
    # Intent: argmax over the classifier logits
    inputs = intent_tokenizer(sentence, return_tensors='pt', truncation=True, max_length=128)
    logits = intent_model(**inputs).logits.detach().cpu().numpy()[0]
    pred_id = int(logits.argmax())
    # NER: aggregated entity spans from the token-classification pipeline
    ner_res = ner_pipe(sentence)
    # Return the predictions instead of printing them so callers can format the output
    return {'sentenceID': sentenceID, 'intent_id': pred_id, 'entities': ner_res}

# Example usage
print(predict_line('1', 'Je voudrais un billet Toulouse Paris.'))

This cell sets up the inference pipeline: it reloads the saved models (intent + NER) and provides a `predict_line` function that predicts the intent and extracts the entities for a given sentence.
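Turning predictions into output lines could look like the sketch below. The notebook does not spell out the required output format, so the layout used here (`sentenceID,Departure,Destination` for trips, `sentenceID,INTENT` otherwise, and a fallback to `UNKNOWN` when a city is missing) is purely an assumption, as are the helper names. The entity dictionaries follow the keys returned by a token-classification pipeline with `aggregation_strategy='simple'` (`entity_group`, `word`, `score`, `start`, `end`); the values are hand-written, not model output.

```python
def parse_input_line(line):
    """Split 'sentenceID,sentence' on the first comma only, since sentences may contain commas."""
    sentence_id, sentence = line.rstrip("\n").split(",", 1)
    return sentence_id.strip(), sentence.strip()

def format_output_line(sentence_id, intent, entities):
    """Build one output line; the line layout is an assumed format, not the official one."""
    if intent != "TRIP":
        return f"{sentence_id},{intent}"
    best = {}
    for ent in entities:
        # Keep the highest-scoring mention for each entity label.
        label = ent["entity_group"]
        if label not in best or ent["score"] > best[label]["score"]:
            best[label] = ent
    if "Departure" not in best or "Destination" not in best:
        return f"{sentence_id},UNKNOWN"  # assumed fallback when a city is missing
    return f"{sentence_id},{best['Departure']['word']},{best['Destination']['word']}"

# Hand-written stand-ins for model predictions.
sentence_id, sentence = parse_input_line("2,Bonsoir, je veux un voyage de Gex vers Lille.")
entities = [
    {"entity_group": "Departure", "word": "Gex", "score": 0.99, "start": 30, "end": 33},
    {"entity_group": "Destination", "word": "Lille", "score": 0.98, "start": 39, "end": 44},
]
print(format_output_line(sentence_id, "TRIP", entities))
print(format_output_line("3", "NOT_TRIP", []))
```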


# Final notes
- The GPU notebook is for Colab with a GPU runtime (recommended). Use Runtime -> Change runtime type -> GPU.
- The CPU notebook is slower; training on a CPU may take a long time.
- Save your models to Google Drive if you want to keep them across sessions (example path: /content/drive/MyDrive/nlp_miniprojects/).
- Save the tokenizer and label mappings alongside the model for correct inference later.
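As a minimal sketch of the last point, the label mappings can be written as a small JSON file next to the saved model and read back at inference time. The file name `label_mappings.json` and both helper functions are hypothetical; the class lists are the ones printed and defined earlier in this notebook.

```python
import json
import tempfile
from pathlib import Path

MAPPING_FILE = "label_mappings.json"  # hypothetical file name

def save_label_mappings(model_dir, intent_classes, ner_labels):
    """Write the intent classes and NER labels (in id order) next to the model files."""
    path = Path(model_dir) / MAPPING_FILE
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"intent_classes": list(intent_classes), "ner_labels": list(ner_labels)}
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    return path

def load_label_mappings(model_dir):
    """Read the mappings back as id -> name dictionaries."""
    payload = json.loads((Path(model_dir) / MAPPING_FILE).read_text(encoding="utf-8"))
    return dict(enumerate(payload["intent_classes"])), dict(enumerate(payload["ner_labels"]))

# A temporary directory stands in for the model folder.
with tempfile.TemporaryDirectory() as model_dir:
    save_label_mappings(
        model_dir,
        ["NOT_FRENCH", "NOT_TRIP", "TRIP", "UNKNOWN"],
        ["O", "B-Departure", "I-Departure", "B-Destination", "I-Destination"],
    )
    id2intent, id2ner = load_label_mappings(model_dir)
    print(id2intent[2], id2ner[1])
```

In the notebook, the first list would come from `label_encoder.classes_` and the second from `ner_labels`.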
