# Deep learning in Human Language Technology Project

- Student names: Momina Iffat Iftikhar, Muhammad Junaid Raza
- Date: 12th Nov, 2025
- Chosen Corpus: Rotten Tomatoes
- Contributions (if group project):

### Corpus information

- Description of the chosen corpus: Movie Review Dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews for binary sentiment classification.
- Paper(s) and other published materials related to the corpus: Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL, 2005.
- Random baseline performance and expected performance for recent machine-learned models: Random baseline: 50% accuracy (balanced binary classification). Expected SOTA performance: ~85-90% accuracy with transformer models.

---

## 1. Setup

In [1]:
# Install required libraries
!pip install -q transformers datasets accelerate evaluate scikit-learn

# Import libraries
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from datasets import load_dataset
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
import evaluate
import warnings
warnings.filterwarnings('ignore')

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

Using device: cuda
GPU: Tesla T4
Memory: 15.83 GB


---

## 2. Data download, sampling and preprocessing

### 2.1. Download the corpus

In [2]:
# Load the Rotten Tomatoes dataset
dataset = load_dataset("rotten_tomatoes")

# Display dataset structure
print("Dataset structure:")
print(dataset)
print("\n" + "="*50)

# Display split sizes
print("\nSplit sizes:")
for split in dataset.keys():
    print(f"{split}: {len(dataset[split])} examples")

# Display label distribution for each split
print("\n" + "="*50)
print("\nLabel distribution:")
for split in dataset.keys():
    labels = dataset[split]['label']
    unique, counts = np.unique(labels, return_counts=True)
    print(f"\n{split}:")
    for label, count in zip(unique, counts):
        label_name = "negative" if label == 0 else "positive"
        print(f"  {label_name} ({label}): {count} ({count/len(labels)*100:.1f}%)")

# Display a few examples
print("\n" + "="*50)
print("\nExample texts from training set:")
for i in range(3):
    label_name = "negative" if dataset['train'][i]['label'] == 0 else "positive"
    print(f"\nExample {i+1} [{label_name}]:")
    print(f"  {dataset['train'][i]['text']}")

Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})


Split sizes:
train: 8530 examples
validation: 1066 examples
test: 1066 examples


Label distribution:

train:
  negative (0): 4265 (50.0%)
  positive (1): 4265 (50.0%)

validation:
  negative (0): 533 (50.0%)
  positive (1): 533 (50.0%)

test:
  negative (0): 533 (50.0%)
  positive (1): 533 (50.0%)


Example texts from training set:

Example 1 [positive]:
  the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .

Example 2 [positive]:
  the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately desc

### 2.2. Sampling and preprocessing

In [3]:
# Your code for any necessary sampling and preprocessing here
# Since the dataset is small and balanced, we won't downsample
# Let's just verify our final dataset statistics

print("Final dataset statistics (no sampling applied):")
print("="*60)
print(f"Training set: {len(dataset['train'])} examples")
print(f"Validation set: {len(dataset['validation'])} examples")
print(f"Test set: {len(dataset['test'])} examples")
print(f"\nLabel distribution: 50% negative, 50% positive (balanced)")
print(f"Random baseline accuracy: 50.0%")

# Store label mapping for later use
label_names = ["negative", "positive"]
id2label = {0: "negative", 1: "positive"}
label2id = {"negative": 0, "positive": 1}

print(f"\nLabel mapping: {id2label}")
print("\nDataset is ready for model training!")

Final dataset statistics (no sampling applied):
Training set: 8530 examples
Validation set: 1066 examples
Test set: 1066 examples

Label distribution: 50% negative, 50% positive (balanced)
Random baseline accuracy: 50.0%

Label mapping: {0: 'negative', 1: 'positive'}

Dataset is ready for model training!
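
The 50% baseline printed above is hard-coded in the cell. As a minimal sketch, the same figure can be derived from label counts. `label_baselines` is a hypothetical helper (not part of the notebook), and the counts are the ones reported for the training split in Section 2.1.

```python
from collections import Counter


def label_baselines(labels):
    """Return (uniform_random_accuracy, majority_class_accuracy) for a label list.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    # Guessing uniformly at random is correct with probability 1/num_classes,
    # regardless of how the labels are distributed.
    uniform_random = 1.0 / len(counts)
    # Always predicting the most frequent class is correct for that class's share.
    majority = max(counts.values()) / total
    return uniform_random, majority


# Label counts reported for the training split: 4265 negative, 4265 positive.
train_labels = [0] * 4265 + [1] * 4265
random_acc, majority_acc = label_baselines(train_labels)
print(f"Uniform random baseline: {random_acc:.1%}")
print(f"Majority-class baseline: {majority_acc:.1%}")
```

Because the splits are perfectly balanced, both baselines coincide at 50%; on an imbalanced corpus the majority-class baseline would be the stronger reference point.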


---

## 3. Prompting a generative model

### 3.1 Prompt optimization

In [4]:
# Your code and experiments relating to the prompt optimization here
# We'll start with Qwen/Qwen2.5-0.5B-Instruct for prompting
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

print(f"Loading model: {model_name}")
print("This may take a few minutes...")

# Load tokenizer and model
tokenizer_qwen = AutoTokenizer.from_pretrained(model_name)
model_qwen = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # newer transformers releases rename this argument to `dtype`
    device_map="auto"
)

# Set padding token if not set
if tokenizer_qwen.pad_token is None:
    tokenizer_qwen.pad_token = tokenizer_qwen.eos_token

print(f"\n✓ Model loaded successfully!")
print(f"Model size: {sum(p.numel() for p in model_qwen.parameters()) / 1e6:.1f}M parameters")

Loading model: Qwen/Qwen2.5-0.5B-Instruct
This may take a few minutes...


`torch_dtype` is deprecated! Use `dtype` instead!



✓ Model loaded successfully!
Model size: 494.0M parameters


In [5]:
# Helper function to generate predictions using prompts
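def generate_with_prompt(texts, prompt_template, model, tokenizer, few_shot_examples=None, max_new_tokens=10):
    """
    Generate sentiment predictions using a prompt template.

    Returns one prediction per text: 1 (positive), 0 (negative), or -1 when
    the response cannot be parsed. When few_shot_examples is given (a list of
    (text, label) pairs), prompt_template is ignored and each example and the
    query are sent as fixed "Review: ... Sentiment:" chat turns.
    """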
    predictions = []

    for text in texts:
        # Build the prompt
        if few_shot_examples:
            # Few-shot: add examples before the query
            messages = [{"role": "system", "content": "You are a sentiment analysis assistant. Classify movie reviews as either 'positive' or 'negative'."}]
            for ex_text, ex_label in few_shot_examples:
                ex_label_name = "positive" if ex_label == 1 else "negative"
                messages.append({"role": "user", "content": f"Review: {ex_text}\nSentiment:"})
                messages.append({"role": "assistant", "content": ex_label_name})
            messages.append({"role": "user", "content": f"Review: {text}\nSentiment:"})
        else:
            # Zero-shot
            messages = [
                {"role": "system", "content": "You are a sentiment analysis assistant. Classify movie reviews as either 'positive' or 'negative'."},
                {"role": "user", "content": prompt_template.format(text=text)}
            ]

        # Apply chat template
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

        # Tokenize and generate
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id
            )

        # Decode output
        generated_text = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True).strip().lower()

        # Extract sentiment (look for "positive" or "negative" in response)
        if "positive" in generated_text and "negative" not in generated_text:
            predictions.append(1)
        elif "negative" in generated_text and "positive" not in generated_text:
            predictions.append(0)
        else:
            # If unclear, try to parse first word
            first_word = generated_text.split()[0] if generated_text.split() else ""
            if first_word == "positive":
                predictions.append(1)
            elif first_word == "negative":
                predictions.append(0)
            else:
                predictions.append(-1)  # Failed to parse

    return predictions

# Confirm that the cell ran
print("Helper function defined successfully!")

Helper function defined successfully!
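
The parsing step inside `generate_with_prompt` can be checked without loading a model. The sketch below isolates that logic in a hypothetical `parse_sentiment` function. It differs from the notebook in one respect: the first-word fallback ignores punctuation, which is an assumption made here for illustration.

```python
import re


def parse_sentiment(generated_text):
    """Map a model response to 1 (positive), 0 (negative), or -1 (unparseable).

    Hypothetical standalone variant of the parsing step, for illustration only.
    """
    text = generated_text.strip().lower()
    has_pos = "positive" in text
    has_neg = "negative" in text
    # Unambiguous case: exactly one of the two labels is mentioned.
    if has_pos and not has_neg:
        return 1
    if has_neg and not has_pos:
        return 0
    # Ambiguous or empty: fall back to the first word, ignoring punctuation
    # (an assumption; the notebook compares the raw first token).
    words = re.findall(r"[a-z]+", text)
    first_word = words[0] if words else ""
    if first_word == "positive":
        return 1
    if first_word == "negative":
        return 0
    return -1


for response in ["Positive", "negative.", "Positive, not negative", "I am not sure", ""]:
    print(repr(response), "->", parse_sentiment(response))
```

Keeping the parser separate from generation makes it easy to see which kinds of responses end up counted as failed parses.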


In [6]:
# Take a small subset of validation data for prompt optimization
val_subset_size = 100
val_subset = dataset['validation'].select(range(val_subset_size))

print(f"Testing prompts on {val_subset_size} validation examples...")
print("="*60)

# Define different prompt templates to test
prompt_templates = {
    "simple": "Review: {text}\nSentiment:",
    "instruction": "Classify the following movie review as 'positive' or 'negative'.\n\nReview: {text}\n\nSentiment:",
    "detailed": "Read the following movie review carefully and determine if the sentiment is 'positive' or 'negative'. Answer with only one word.\n\nReview: {text}\n\nSentiment:",
}

# Test zero-shot prompts
results = {}
for name, template in prompt_templates.items():
    print(f"\nTesting prompt: {name}")
    preds = generate_with_prompt(
        val_subset['text'],
        template,
        model_qwen,
        tokenizer_qwen
    )

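    # Calculate accuracy over parsed predictions only. Failed parses are excluded
    # from the denominator, so accuracies with different failure counts are not
    # directly comparable.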
    valid_preds = [(p, l) for p, l in zip(preds, val_subset['label']) if p != -1]
    if valid_preds:
        accuracy = sum(p == l for p, l in valid_preds) / len(valid_preds)
        failed = preds.count(-1)
        results[name] = {"accuracy": accuracy, "failed": failed}
        print(f"  Accuracy: {accuracy:.3f} ({len(valid_preds)}/{val_subset_size} valid predictions, {failed} failed)")
    else:
        print(f"  All predictions failed to parse")

print("\n" + "="*60)
print("Prompt optimization results:")
for name, res in sorted(results.items(), key=lambda x: x[1]['accuracy'], reverse=True):
    print(f"  {name}: {res['accuracy']:.3f} accuracy, {res['failed']} failed")

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Testing prompts on 100 validation examples...

Testing prompt: simple
  Accuracy: 0.830 (100/100 valid predictions, 0 failed)

Testing prompt: instruction
  Accuracy: 0.790 (100/100 valid predictions, 0 failed)

Testing prompt: detailed
  Accuracy: 0.656 (96/100 valid predictions, 4 failed)

Prompt optimization results:
  simple: 0.830 accuracy, 0 failed
  instruction: 0.790 accuracy, 0 failed
  detailed: 0.656 accuracy, 4 failed
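
Few-shot prompting needs a handful of labelled demonstrations. As a rough sketch, they can be picked by label rather than by fixed index, which avoids relying on how the split is ordered. `pick_balanced_examples` and the toy data are hypothetical stand-ins for `dataset['train']['text']` and `dataset['train']['label']`.

```python
def pick_balanced_examples(texts, labels, per_class):
    """Return the first `per_class` (text, label) pairs per label, interleaved by class.

    Hypothetical helper for illustration; the notebook indexes the training
    split directly instead.
    """
    by_label = {}
    for text, label in zip(texts, labels):
        bucket = by_label.setdefault(label, [])
        if len(bucket) < per_class:
            bucket.append((text, label))
    # Interleave classes so the prompt does not end with a run of one label.
    picked = []
    for i in range(per_class):
        for label in sorted(by_label):
            if i < len(by_label[label]):
                picked.append(by_label[label][i])
    return picked


# Toy data standing in for the training texts and labels.
toy_texts = ["great fun", "a joy", "dull", "a mess", "charming"]
toy_labels = [1, 1, 0, 0, 1]
print(pick_balanced_examples(toy_texts, toy_labels, per_class=2))
```

The returned list has the same `(text, label)` shape that `generate_with_prompt` accepts through its `few_shot_examples` argument.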


In [7]:
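# Now test few-shot prompting. Note: the few-shot path in generate_with_prompt
# ignores the template argument and always uses "Review: ...\nSentiment:" turns,
# which matches the "simple" zero-shot template.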
best_prompt = max(results.items(), key=lambda x: x[1]['accuracy'])[0]
print(f"Best zero-shot prompt: {best_prompt}")
print("\nNow testing few-shot prompting (2-shot and 4-shot)...")
print("="*60)

# Select few-shot examples from the training set (balanced).
# Labels are read from the dataset itself. The inline notes follow the printed
# examples in Section 2.1 (indices 0 and 1 are positive) and assume the split
# is sorted by label, so that index 4265 starts the negative half.
few_shot_examples_2 = [
    (dataset['train'][0]['text'], dataset['train'][0]['label']),  # positive
    (dataset['train'][4265]['text'], dataset['train'][4265]['label'])  # expected negative
]

few_shot_examples_4 = [
    (dataset['train'][0]['text'], dataset['train'][0]['label']),  # positive
    (dataset['train'][1]['text'], dataset['train'][1]['label']),  # positive
    (dataset['train'][4265]['text'], dataset['train'][4265]['label']),  # expected negative
    (dataset['train'][4266]['text'], dataset['train'][4266]['label'])  # expected negative
]

# Test 2-shot
print("\nTesting 2-shot prompting...")
preds_2shot = generate_with_prompt(
    val_subset['text'],
    None,
    model_qwen,
    tokenizer_qwen,
    few_shot_examples=few_shot_examples_2
)
valid_preds_2shot = [(p, l) for p, l in zip(preds_2shot, val_subset['label']) if p != -1]
acc_2shot = sum(p == l for p, l in valid_preds_2shot) / len(valid_preds_2shot) if valid_preds_2shot else 0
failed_2shot = preds_2shot.count(-1)
print(f"2-shot accuracy: {acc_2shot:.3f} ({len(valid_preds_2shot)}/{val_subset_size} valid, {failed_2shot} failed)")

# Test 4-shot
print("\nTesting 4-shot prompting...")
preds_4shot = generate_with_prompt(
    val_subset['text'],
    None,
    model_qwen,
    tokenizer_qwen,
    few_shot_examples=few_shot_examples_4
)
valid_preds_4shot = [(p, l) for p, l in zip(preds_4shot, val_subset['label']) if p != -1]
acc_4shot = sum(p == l for p, l in valid_preds_4shot) / len(valid_preds_4shot) if valid_preds_4shot else 0
failed_4shot = preds_4shot.count(-1)
print(f"4-shot accuracy: {acc_4shot:.3f} ({len(valid_preds_4shot)}/{val_subset_size} valid, {failed_4shot} failed)")

# Compare all approaches
print("\n" + "="*60)
print("Summary of all approaches:")
print(f"  Best zero-shot ({best_prompt}): {results[best_prompt]['accuracy']:.3f}")
print(f"  2-shot: {acc_2shot:.3f}")
print(f"  4-shot: {acc_4shot:.3f}")

Best zero-shot prompt: simple

Now testing few-shot prompting (2-shot and 4-shot)...

Testing 2-shot prompting...
2-shot accuracy: 0.873 (71/100 valid, 29 failed)

Testing 4-shot prompting...
4-shot accuracy: 0.875 (80/100 valid, 20 failed)

Summary of all approaches:
  Best zero-shot (simple): 0.830
  2-shot: 0.873
  4-shot: 0.875
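
The accuracies above exclude responses that failed to parse, so the zero-shot and few-shot figures use different denominators (100, 71, and 80 examples). A rough sketch of a stricter comparison counts every failed parse as an error. The correct counts are reconstructed by rounding accuracy × valid predictions from the printed output, and `strict_accuracy` is a hypothetical helper rather than part of the notebook.

```python
def strict_accuracy(valid_accuracy, n_valid, n_total):
    """Accuracy over all examples, treating unparseable responses as errors.

    Hypothetical helper: reconstructs the number of correct predictions from
    the reported accuracy on valid predictions, then divides by the full count.
    """
    n_correct = round(valid_accuracy * n_valid)
    return n_correct / n_total


# (accuracy on valid predictions, valid count) as printed in the cell outputs.
reported = {
    "zero-shot simple": (0.830, 100),
    "zero-shot instruction": (0.790, 100),
    "zero-shot detailed": (0.656, 96),
    "2-shot": (0.873, 71),
    "4-shot": (0.875, 80),
}

for name, (acc, n_valid) in reported.items():
    print(f"{name}: {strict_accuracy(acc, n_valid, 100):.3f}")
```

Under this stricter reading the zero-shot `simple` prompt (0.830) ranks ahead of 4-shot (0.700) and 2-shot (0.620), which suggests that the choice of best approach depends on how parse failures are scored.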


### 3.2 Evaluation on test set

In [None]:
# Your code to evaluate the best-performing approach on the test set here

---

## 4. Fine-tuning a generative model

### 4.1. Model training

In [None]:
# Your code to train the transformer-based model on the training set and evaluate the performance on the validation set here

### 4.2 Hyperparameter optimization

In [None]:
# Your code for hyperparameter optimization here

### 4.3. Evaluation on test set

In [None]:
# Your code to evaluate the final model on the test set here

---

## 5. Fine-tuning a bidirectional model

### 5.1. Model training

In [None]:
# Your code to train the transformer-based model on the training set and evaluate the performance on the validation set here

### 5.2 Hyperparameter optimization

In [None]:
# Your code for hyperparameter optimization here

### 5.3 Evaluation on test set

In [None]:
# Your code to evaluate the final model on the test set here

---

## 6. Bonus Task (optional)

Repeat Sections 3 through 5 here for a second generative model and a second bidirectional model. When summarizing your results below (Section 7), also include a comparison of the two generative models and of the two bidirectional models.

---

## 7. Results and summary

### 7.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 7.2 Results

(Briefly summarize your results)

### 7.3 Relation to random baseline / expected performance / state of the art

(Compare your results with the random and state-of-the-art performance)

---

## 8. Error analysis (group projects only)

(Present the error analysis results here)