<a href="https://colab.research.google.com/github/TSION2121/pragma-SpeechActNLI/blob/master/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pragmatic Analysis Pipeline (Project 3)

- Speech act classification (statement, question, directive)
- NLI over a simple knowledge base (ENTAILMENT, CONTRADICTION, NEUTRAL)


In [None]:
!pip install transformers datasets torch scikit-learn accelerate




## 1. Data and label mapping


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

data = [
    ("Can you open the window?", "directive"),
    ("What time is the meeting?", "question"),
    ("The meeting is at 3 pm.", "statement"),
    ("Please send me the report.", "directive"),
    ("Is this the right room?", "question"),
    ("The window is already open.", "statement"),
]

df = pd.DataFrame(data, columns=["text", "label"])

label2id = {"statement": 0, "question": 1, "directive": 2}
id2label = {v: k for k, v in label2id.items()}

df["label_id"] = df["label"].map(label2id)
train_df, test_df = train_test_split(df, test_size=0.5, stratify=df["label_id"], random_state=42)
len(train_df), len(test_df)

(3, 3)
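A quick sanity check on the label mapping and class balance can catch labelling mistakes before any training. The sketch below uses only the standard library and repeats the six toy examples; the helper name `check_label_mapping` is an assumption and not part of the notebook.

```python
from collections import Counter

label2id = {"statement": 0, "question": 1, "directive": 2}
id2label = {v: k for k, v in label2id.items()}

data = [
    ("Can you open the window?", "directive"),
    ("What time is the meeting?", "question"),
    ("The meeting is at 3 pm.", "statement"),
    ("Please send me the report.", "directive"),
    ("Is this the right room?", "question"),
    ("The window is already open.", "statement"),
]

def check_label_mapping(rows, label2id, id2label):
    """Verify that every label round-trips through both maps; return class counts.

    Hypothetical helper: an unknown label raises KeyError, a broken inverse
    mapping raises ValueError.
    """
    for _, label in rows:
        if id2label[label2id[label]] != label:
            raise ValueError(f"label {label!r} does not round-trip")
    return Counter(label for _, label in rows)

counts = check_label_mapping(data, label2id, id2label)
print(dict(counts))
```

With two examples per class, a stratified 50% split leaves one example of each class on each side, which matches the `(3, 3)` sizes above.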

## 2. Speech-act classifier (DistilBERT)


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch
from datasets import Dataset

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_batch(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=64)

train_ds = Dataset.from_pandas(train_df[["text", "label_id"]].rename(columns={"label_id": "labels"}))
test_ds  = Dataset.from_pandas(test_df[["text", "label_id"]].rename(columns={"label_id": "labels"}))

train_ds = train_ds.map(tokenize_batch, batched=True)
test_ds  = test_ds.map(tokenize_batch, batched=True)

train_ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
test_ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3,
    id2label=id2label,
    label2id=label2id,
)

training_args = TrainingArguments(
    output_dir="./speechact-checkpoints",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=10,
    report_to="none",  # skip W&B and other logging integrations (no interactive prompt)
)

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    # zero_division=0 scores a class that is never predicted as 0 without a warning
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

{'eval_loss': 1.0933858156204224,
 'eval_accuracy': 0.6666666666666666,
 'eval_precision': 0.5,
 'eval_recall': 0.6666666666666666,
 'eval_f1': 0.5555555555555555,
 'eval_runtime': 0.0486,
 'eval_samples_per_second': 61.725,
 'eval_steps_per_second': 20.575,
 'epoch': 3.0}

## 3. NLI stage and knowledge base


In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Define device globally for use across functions
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

nli_name = "roberta-large-mnli"
nli_tokenizer = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name).to(device) # Move NLI model to the detected device

kb_facts = [
    "Dolphins are marine mammals.",
    "Windows can be opened and closed.",
    "Meetings usually have a scheduled time.",
]

nli_id2label = {0: "CONTRADICTION", 1: "NEUTRAL", 2: "ENTAILMENT"}

def nli_check(premise, hypothesis):
    inputs = nli_tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()} # Move input tensors to the detected device
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    label_id = torch.argmax(logits, dim=-1).item()
    return nli_id2label[label_id]

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
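`nli_check` returns only the arg-max label. If a confidence is also wanted, the three logits can be passed through a softmax first. The sketch below does this in plain Python so the arithmetic is visible; in the notebook the same step would be `torch.softmax(logits, dim=-1)`, as `classify_speech_act` does later. The function names and the example logits are hypothetical, not model output.

```python
import math

NLI_ID2LABEL = {0: "CONTRADICTION", 1: "NEUTRAL", 2: "ENTAILMENT"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits, id2label=NLI_ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

example_logits = [-2.0, 0.5, 3.0]  # hypothetical values for illustration
label, conf = label_with_confidence(example_logits)
print(label, round(conf, 3))
```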


## 4. End-to-end pipeline examples



In [None]:
def classify_speech_act(text):
    """Return (label, confidence) for a single utterance."""
    model.eval()  # disable dropout for inference
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64, padding="max_length")
    # Use the classifier's own device: Trainer may have moved the model during training
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    pred_id = torch.argmax(logits, dim=-1).item()
    return id2label[pred_id], torch.softmax(logits, dim=-1)[0, pred_id].item()

def run_pipeline(utterance):
    speech_act, sa_conf = classify_speech_act(utterance)

    if speech_act != "statement":
        return {
            "utterance": utterance,
            "speech_act": speech_act,
            "speech_act_confidence": sa_conf,
            "nli_applicable": False,
            "kb_fact": None,
            "nli_result": None,
        }

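    # Scan the KB with each fact as premise and the utterance as hypothesis.
    # Only ENTAILMENT is recorded; a CONTRADICTION is treated like NEUTRAL below.
    best_fact = None
    best_result = None
    for fact in kb_facts:
        result = nli_check(fact, utterance)
        if result == "ENTAILMENT":
            best_fact, best_result = fact, result
            break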

    return {
        "utterance": utterance,
        "speech_act": speech_act,
        "speech_act_confidence": sa_conf,
        "nli_applicable": True,
        "kb_fact": best_fact,
        "nli_result": best_result or "NEUTRAL",
    }

examples = [
    "Can you open the window?",
    "Dolphins are marine mammals.",
    "The meeting is at 3 pm.",
]

for e in examples:
    print(run_pipeline(e))

{'utterance': 'Can you open the window?', 'speech_act': 'directive', 'speech_act_confidence': 0.33613285422325134, 'nli_applicable': False, 'kb_fact': None, 'nli_result': None}
{'utterance': 'Dolphins are marine mammals.', 'speech_act': 'statement', 'speech_act_confidence': 0.34822356700897217, 'nli_applicable': True, 'kb_fact': 'Dolphins are marine mammals.', 'nli_result': 'ENTAILMENT'}
{'utterance': 'The meeting is at 3 pm.', 'speech_act': 'statement', 'speech_act_confidence': 0.33999547362327576, 'nli_applicable': True, 'kb_fact': None, 'nli_result': 'NEUTRAL'}
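The speech-act confidences printed above (about 0.34) sit just above the 1/3 that a uniform guess over three classes would give. A small check like the one below can flag such predictions; the function name `near_chance` and the 0.1 margin are assumptions chosen for illustration.

```python
def near_chance(confidence, num_labels=3, margin=0.1):
    """True when the top-class probability is within `margin` of a uniform guess."""
    chance = 1.0 / num_labels
    return confidence - chance < margin

# Confidences copied from the pipeline output above
confidences = {
    "Can you open the window?": 0.33613285422325134,
    "Dolphins are marine mammals.": 0.34822356700897217,
    "The meeting is at 3 pm.": 0.33999547362327576,
}

flags = {text: near_chance(conf) for text, conf in confidences.items()}
print(flags)
```

All three predictions are flagged, which suggests the labels should be read as close to guesses rather than as reliable classifications.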


## 5. Additional end-to-end examples and results table


In [None]:
# Extra examples for testing the pipeline
more_examples = [
    "Dolphins are fish.",
    "The window is closed.",
    "Please close the window.",
    "The meeting was yesterday at 5 pm.",
]

import pandas as pd

all_utterances = examples + more_examples  # 'examples' is from the previous cell

results = []
for utt in all_utterances:
    out = run_pipeline(utt)
    results.append(out)
    print(out)

results_df = pd.DataFrame(results)
results_df


{'utterance': 'Can you open the window?', 'speech_act': 'directive', 'speech_act_confidence': 0.33613285422325134, 'nli_applicable': False, 'kb_fact': None, 'nli_result': None}
{'utterance': 'Dolphins are marine mammals.', 'speech_act': 'statement', 'speech_act_confidence': 0.34822356700897217, 'nli_applicable': True, 'kb_fact': 'Dolphins are marine mammals.', 'nli_result': 'ENTAILMENT'}
{'utterance': 'The meeting is at 3 pm.', 'speech_act': 'statement', 'speech_act_confidence': 0.33999547362327576, 'nli_applicable': True, 'kb_fact': None, 'nli_result': 'NEUTRAL'}
{'utterance': 'Dolphins are fish.', 'speech_act': 'statement', 'speech_act_confidence': 0.3460741639137268, 'nli_applicable': True, 'kb_fact': None, 'nli_result': 'NEUTRAL'}
{'utterance': 'The window is closed.', 'speech_act': 'statement', 'speech_act_confidence': 0.3423449695110321, 'nli_applicable': True, 'kb_fact': None, 'nli_result': 'NEUTRAL'}
{'utterance': 'Please close the window.', 'speech_act': 'directive', 'speech_a

| | utterance | speech_act | speech_act_confidence | nli_applicable | kb_fact | nli_result |
|---|---|---|---|---|---|---|
| 0 | Can you open the window? | directive | 0.336133 | False | | |
| 1 | Dolphins are marine mammals. | statement | 0.348224 | True | Dolphins are marine mammals. | ENTAILMENT |
| 2 | The meeting is at 3 pm. | statement | 0.339995 | True | | NEUTRAL |
| 3 | Dolphins are fish. | statement | 0.346074 | True | | NEUTRAL |
| 4 | The window is closed. | statement | 0.342345 | True | | NEUTRAL |
| 5 | Please close the window. | directive | 0.343313 | False | | |
| 6 | The meeting was yesterday at 5 pm. | statement | 0.338498 | True | | NEUTRAL |


## 6. Save results for report (optional)


In [None]:
results_df.to_csv("pragmatic_pipeline_examples.csv", index=False)
results_df.head()


| | utterance | speech_act | speech_act_confidence | nli_applicable | kb_fact | nli_result |
|---|---|---|---|---|---|---|
| 0 | Can you open the window? | directive | 0.336133 | False | | |
| 1 | Dolphins are marine mammals. | statement | 0.348224 | True | Dolphins are marine mammals. | ENTAILMENT |
| 2 | The meeting is at 3 pm. | statement | 0.339995 | True | | NEUTRAL |
| 3 | Dolphins are fish. | statement | 0.346074 | True | | NEUTRAL |
| 4 | The window is closed. | statement | 0.342345 | True | | NEUTRAL |


## 7. Evaluation summary and error analysis


In [None]:
metrics = trainer.evaluate()
metrics



{'eval_loss': 1.0933858156204224,
 'eval_accuracy': 0.6666666666666666,
 'eval_precision': 0.5,
 'eval_recall': 0.6666666666666666,
 'eval_f1': 0.5555555555555555,
 'eval_runtime': 0.0251,
 'eval_samples_per_second': 119.669,
 'eval_steps_per_second': 39.89,
 'epoch': 3.0}
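The macro-averaged numbers can be reproduced by hand. The sketch below assumes the test split holds one example per class (consistent with a stratified three-example split) and that the only mistake is a question predicted as a statement, as `error_df` reports. A class that is never predicted gets precision 0, which is the case scikit-learn warns about. The helper name `macro_prf` is hypothetical.

```python
def macro_prf(y_true, y_pred, labels):
    """Macro-averaged precision, recall, and F1; undefined ratios count as 0."""
    precisions, recalls, f1s = [], [], []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        n_pred = sum(p == lab for p in y_pred)   # times the class was predicted
        n_true = sum(t == lab for t in y_true)   # times the class actually occurs
        prec = tp / n_pred if n_pred else 0.0
        rec = tp / n_true if n_true else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

labels = ["statement", "question", "directive"]
y_true = ["statement", "question", "directive"]
y_pred = ["statement", "statement", "directive"]  # assumed prediction pattern

precision, recall, f1 = macro_prf(y_true, y_pred, labels)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Per class, "statement" scores precision 1/2 and recall 1, "question" scores 0 on both, and "directive" scores 1 on both; averaging the three gives the reported 0.5, 0.667, and 0.556.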

In [None]:
import numpy as np

# Get predictions for the tiny test set
preds_output = trainer.predict(test_ds)
logits = preds_output.predictions
labels = preds_output.label_ids
preds = np.argmax(logits, axis=-1)

test_texts = test_df["text"].tolist()
test_labels = test_df["label"].tolist()

error_rows = []
for text, true_id, pred_id, true_label in zip(test_texts, labels, preds, test_labels):
    if true_id != pred_id:
        error_rows.append({
            "text": text,
            "true_label": true_label,
            "pred_label": id2label[pred_id],
        })

error_df = pd.DataFrame(error_rows)
error_df



| | text | true_label | pred_label |
|---|---|---|---|
| 0 | Is this the right room? | question | statement |


In [None]:
# Statements for which no KB fact entailed the utterance. run_pipeline records only
# ENTAILMENT, so contradictions and missing facts both appear here as NEUTRAL.
nli_errors = results_df[
    (results_df["speech_act"] == "statement")
    & results_df["nli_applicable"]
    & (results_df["nli_result"] != "ENTAILMENT")
]
nli_errors


| | utterance | speech_act | speech_act_confidence | nli_applicable | kb_fact | nli_result |
|---|---|---|---|---|---|---|
| 2 | The meeting is at 3 pm. | statement | 0.339995 | True | | NEUTRAL |
| 3 | Dolphins are fish. | statement | 0.346074 | True | | NEUTRAL |
| 4 | The window is closed. | statement | 0.342345 | True | | NEUTRAL |
| 6 | The meeting was yesterday at 5 pm. | statement | 0.338498 | True | | NEUTRAL |


## 8. Methodology (for report)

- Fine-tuned `distilbert-base-uncased` for three epochs as a speech-act classifier with three labels: statement, question, directive.
- Used `roberta-large-mnli` without further training as the NLI model, taking each fact from a small hand-crafted knowledge base as the premise and the utterance as the hypothesis.
- Built an end-to-end pipeline that first predicts the speech act and runs NLI only when the utterance is classified as a statement.


## 9. Results (for report)

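- Speech-act classifier metrics on the three-example test split (from `trainer.evaluate()`): accuracy 0.67, macro precision 0.50, macro recall 0.67, macro F1 0.56, eval loss 1.09.
- Example pipeline outputs (from `results_df`): of the five utterances classified as statements, one ("Dolphins are marine mammals.") matched a KB fact with ENTAILMENT, and the other four were NEUTRAL with no matching KB fact.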
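The counts behind these bullets can be tallied directly from the rows of `results_df`. The sketch below copies the relevant columns from the results table into plain tuples so it runs on its own; in the notebook, `results_df["nli_result"].value_counts()` would give the same NLI breakdown.

```python
from collections import Counter

# (utterance, speech_act, nli_result) copied from the results table
rows = [
    ("Can you open the window?", "directive", None),
    ("Dolphins are marine mammals.", "statement", "ENTAILMENT"),
    ("The meeting is at 3 pm.", "statement", "NEUTRAL"),
    ("Dolphins are fish.", "statement", "NEUTRAL"),
    ("The window is closed.", "statement", "NEUTRAL"),
    ("Please close the window.", "directive", None),
    ("The meeting was yesterday at 5 pm.", "statement", "NEUTRAL"),
]

# How often each speech act was predicted
speech_acts = Counter(act for _, act, _ in rows)

# NLI verdicts, restricted to utterances that reached the NLI stage
nli_results = Counter(res for _, act, res in rows if act == "statement")

print(dict(speech_acts))
print(dict(nli_results))
```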


## 10. Error analysis and limitations

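- Speech-act errors: the one test error is a question predicted as a statement ("Is this the right room?", see `error_df`). Confidences of about 0.34 are barely above the 1/3 chance level for three classes, so the classifier has learned little from three training examples.
- NLI errors: ENTAILMENT is missed when the utterance wording differs from the KB facts or when no relevant fact exists. Because `run_pipeline` records only ENTAILMENT, any CONTRADICTION verdict is reported as NEUTRAL, so an utterance such as "Dolphins are fish." cannot be flagged as conflicting with the KB.
- Overall performance is limited by the very small toy dataset (six sentences); a real Switchboard subset would improve robustness but require more compute.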
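One way to address the NLI limitation is to let the KB scan report contradictions as well as entailments. The sketch below is a possible variant, not the notebook's `run_pipeline`: the name `scan_kb` is hypothetical, and the NLI function is passed in so the logic can be exercised with a stub. The stub's canned answers are illustrative only and are not claimed to be what `roberta-large-mnli` returns.

```python
def scan_kb(utterance, kb_facts, nli_fn):
    """Return (fact, label) for an utterance checked against every KB fact.

    Preference order: the first entailing fact, then the first contradicting
    fact, otherwise (None, "NEUTRAL").
    """
    first_contradiction = None
    for fact in kb_facts:
        label = nli_fn(fact, utterance)  # fact is the premise, utterance the hypothesis
        if label == "ENTAILMENT":
            return fact, label
        if label == "CONTRADICTION" and first_contradiction is None:
            first_contradiction = fact
    if first_contradiction is not None:
        return first_contradiction, "CONTRADICTION"
    return None, "NEUTRAL"

def stub_nli(premise, hypothesis):
    """Stand-in for nli_check with hand-written answers (illustration only)."""
    canned = {
        ("Dolphins are marine mammals.", "Dolphins are marine mammals."): "ENTAILMENT",
        ("Dolphins are marine mammals.", "Dolphins are fish."): "CONTRADICTION",
    }
    return canned.get((premise, hypothesis), "NEUTRAL")

kb_facts = [
    "Dolphins are marine mammals.",
    "Windows can be opened and closed.",
    "Meetings usually have a scheduled time.",
]

for utt in ["Dolphins are marine mammals.", "Dolphins are fish.", "The meeting is at 3 pm."]:
    print(utt, "->", scan_kb(utt, kb_facts, stub_nli))
```

In the notebook, `nli_check` could be passed as `nli_fn` in place of the stub. Scanning every fact before settling on CONTRADICTION means one NLI call per fact in the worst case, which is acceptable for a three-fact KB.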


In [None]:
# Save important artifacts
model.save_pretrained("speechact_model")
tokenizer.save_pretrained("speechact_tokenizer")
results_df.to_csv("pragmatic_pipeline_examples.csv", index=False)
error_df.to_csv("speechact_errors.csv", index=False)
