<a href="https://colab.research.google.com/github/DVORA-AZARKOVICH/Narrative-Similarity/blob/main/NarrativeSimilarity_story_emb_prompt_fixed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Narrative Similarity Fine-Tuning with QLoRA

This notebook fine-tunes the `uhhlt/story-emb` model to detect narrative similarity between stories. We use **QLoRA** (Quantized Low-Rank Adaptation) for memory-efficient training and **Triplet Loss** to optimize the embedding space.

**Key Steps:**
1. **Triplet Dataset:** formatting data into (Anchor, Positive, Negative) examples.
2. **QLoRA Training:** fine-tuning adapters on top of a 4-bit quantized base model.

In [None]:
!pip install -q transformers==4.36.2 peft==0.7.1 bitsandbytes==0.41.3 accelerate==0.25.0 spacy
!python -m spacy download en_core_web_sm

import torch
import os
print(f"CUDA Available: {torch.cuda.is_available()}")

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 5.2.0 requires transformers<6.0.0,>=4.41.0, but you have transformers 4.36.2 which is incompatible.
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
CUDA Available: True

In [None]:
!pip uninstall -y transformers accelerate bitsandbytes peft

!pip install -U transformers accelerate bitsandbytes peft

Found existing installation: transformers 4.36.2
Uninstalling transformers-4.36.2:
  Successfully uninstalled transformers-4.36.2
Found existing installation: accelerate 0.25.0
Uninstalling accelerate-0.25.0:
  Successfully uninstalled accelerate-0.25.0
Found existing installation: bitsandbytes 0.41.3
Uninstalling bitsandbytes-0.41.3:
  Successfully uninstalled bitsandbytes-0.41.3
Found existing installation: peft 0.7.1
Uninstalling peft-0.7.1:
  Successfully uninstalled peft-0.7.1
Collecting transformers
  Using cached transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Collecting accelerate
  Using cached accelerate-1.12.0-py3-none-any.whl.metadata (19 kB)
Collecting bitsandbytes
  Using cached bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting peft
  Using cached peft-0.18.0-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Using cached tokenizers-0.22.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_6

## 1. Data Loading
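We load the synthetic training data and the development (dev) set from Google Drive. Each record holds an anchor story, two candidate stories (`text_a` and `text_b`), and a label `text_a_is_closer` that says which candidate is narratively closer to the anchor.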
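
For orientation, the snippet below sketches the record layout that the later cells rely on. The field names `anchor_text`, `text_a`, `text_b`, and `text_a_is_closer` come from the code in this notebook; the two sample stories and the `REQUIRED_FIELDS` helper are made up for illustration and are not real dataset rows.

```python
import io
import json

# Two made-up records in JSON Lines layout (illustrative placeholders only).
sample_jsonl = io.StringIO(
    '{"anchor_text": "A knight frees a village.", "text_a": "A pilot saves a colony.", '
    '"text_b": "A baker opens a shop.", "text_a_is_closer": true}\n'
    '{"anchor_text": "Two rivals reconcile.", "text_a": "A storm sinks a ship.", '
    '"text_b": "Feuding sisters make peace.", "text_a_is_closer": false}\n'
)

# Hypothetical helper: the fields the training and evaluation cells read.
REQUIRED_FIELDS = {"anchor_text", "text_a", "text_b", "text_a_is_closer"}

records = [json.loads(line) for line in sample_jsonl if line.strip()]
for record in records:
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"record is missing fields: {sorted(missing)}")

print(len(records), [r["text_a_is_closer"] for r in records])
```

A check like this can catch a renamed or missing column before a long training run starts.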

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import json

BASE_PATH = '/content/drive/MyDrive/Narrative Similarity Data/'

SYNTHETIC_FILE = BASE_PATH + 'synthetic_data_for_classification.jsonl'
df_synthetic = pd.read_json(SYNTHETIC_FILE, lines=True)
print("--- Synthetic Data Loaded ---")
print(f"Shape: {df_synthetic.shape}")

DEV_FILE = BASE_PATH + 'SemEval2026-Task_4-dev-v1/dev_track_a.jsonl'
df_dev = pd.read_json(DEV_FILE, lines=True)
print("\n--- Development Data Loaded ---")
print(f"Shape: {df_dev.shape}")

--- Synthetic Data Loaded ---
Shape: (1900, 5)

--- Development Data Loaded ---
Shape: (200, 4)


In [None]:
df_synthetic = df_synthetic.drop(columns=['model_name'])

df_train = df_synthetic.copy()

## 2. Dataset & Prompt Engineering
We define a custom `Dataset` class that prepares **Triplets**:
1. **Anchor:** The reference story (prefixed with a system prompt defining narrative similarity).
2. **Positive:** The story narratively closer to the anchor.
3. **Negative:** The story narratively further from the anchor.

The prompt explicitly instructs the model to consider **Theme**, **Course of Action**, and **Outcomes**, while ignoring style and setting.
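
The triplet construction described above can be sketched without any framework code. This is a minimal illustration, not the notebook's dataset class: `build_triplet` and its `separator` parameter are hypothetical names, and the blank-line separator between prompt and story is an assumption.

```python
def build_triplet(record, prompt, separator="\n\n"):
    """Return (anchor, positive, negative) for one labelled record.

    `separator` is an assumed delimiter between the prompt and the story.
    """
    anchor = prompt + separator + str(record["anchor_text"])
    text_a, text_b = str(record["text_a"]), str(record["text_b"])
    # The label decides which candidate plays the positive role.
    if record["text_a_is_closer"]:
        return anchor, text_a, text_b
    return anchor, text_b, text_a


example = {
    "anchor_text": "Two rivals reconcile.",
    "text_a": "A storm sinks a ship.",
    "text_b": "Feuding sisters make peace.",
    "text_a_is_closer": False,
}
anchor, positive, negative = build_triplet(example, "PROMPT")
print(positive)
```

Only the anchor carries the prompt; the positive and negative texts are embedded as plain stories.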

In [None]:
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer

class NarrativeTripletDataset(Dataset):
    def __init__(self, df, tokenizer, max_length=1024):
        self.df = df
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.prompt = '''You are an expert annotator tasked with identifying narrative similarity between stories.
Your goal is to determine which of two candidate stories (Text A or Text B) is narratively closer to an Anchor story.

### DEFINITIONS OF NARRATIVE SIMILARITY
You must evaluate similarity based ONLY on the following three core aspects:

1. **Abstract Theme**: The defining constellation of problems, central ideas, and core motifs. This explicitly DOES NOT cover the concrete setting (time/place).
2. **Course of Action**: The sequence of events, actions, conflicts, turning points, and the order in which they happen.
3. **Outcomes**: The results of the plot at the end of the text (e.g., conflict resolution, characters' fates, moral lessons). This refers to the final state, not intermediate states.

### WHAT TO IGNORE
Do NOT consider the following factors as part of the narrative similarity:
* The style of writing.
* The concrete setting (e.g., Sci-Fi vs. Western, 19th century vs. future).
* Names of characters and locations.
* Length of text or level of detail.

### INSTRUCTIONS
1.  Analyze the **Anchor** to identify its Abstract Theme, Course of Action, and Outcome.
2.  Compare **Text A** to the Anchor based on the three core aspects.
3.  Compare **Text B** to the Anchor based on the three core aspects.
4.  Weigh the factors. Note that stories sharing a "Course of Action" often share a "Theme", but "Outcomes" can be distinct (e.g., similar events leading to opposite endings).
5.  Decide which text is narratively closer overall.'''

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]

        anchor_text = str(row['anchor_text'])
        text_a = str(row['text_a'])
        text_b = str(row['text_b'])

        # Separate the instructions from the story so they do not run together
        anchor = self.prompt + "\n\n" + anchor_text

        if row['text_a_is_closer']:
            positive = text_a
            negative = text_b
        else:
            positive = text_b
            negative = text_a

        return anchor, positive, negative

model_name = "uhhlt/story-emb"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

train_dataset = NarrativeTripletDataset(df_train, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## 3. QLoRA Model Setup
We load the `story-emb` model in **4-bit precision** using `bitsandbytes` (configured through `BitsAndBytesConfig`). We then attach Low-Rank Adapters (**LoRA**) to the attention and MLP projection layers. Only the adapter weights are trained while the quantized base model stays frozen, which lets us fine-tune on a consumer GPU.

In [None]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

base_model = prepare_model_for_kbit_training(base_model)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.FEATURE_EXTRACTION
)

model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()


trainable params: 41,943,040 || all params: 7,283,675,136 || trainable%: 0.5758
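
The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes the layer shapes of a Mistral-7B-style decoder (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers); these dimensions are not stated in the notebook and are an assumption, while `r=16` and the module names come from the `LoraConfig` cell.

```python
# Assumed layer shapes of a Mistral-7B-style decoder (not stated in the notebook).
HIDDEN = 4096          # model width
KV_DIM = 1024          # key/value projection width (grouped-query attention)
INTERMEDIATE = 14336   # MLP width
NUM_LAYERS = 32
RANK = 16              # r from the LoraConfig cell

# (in_features, out_features) for each module listed in target_modules.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) per adapted matrix: r * (in + out) weights.
per_layer = sum(RANK * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
trainable = per_layer * NUM_LAYERS
print(f"{trainable:,}")  # 41,943,040
```

Under these assumed shapes the arithmetic matches the printed figure, which suggests that all seven projection types in every layer received an adapter.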


## 4. Fine-Tuning with Triplet Loss
We train with **Triplet Margin Loss**. The objective is for the *Anchor* to end up closer to the *Positive* than to the *Negative* by at least `margin=0.5`, where distance is the Euclidean distance between L2-normalized embeddings.

$$\text{Loss} = \max\bigl(0,\ d(\text{Anchor}, \text{Positive}) - d(\text{Anchor}, \text{Negative}) + \text{margin}\bigr)$$
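
As a worked example of the formula, the following pure-Python sketch evaluates the loss for one triplet of toy 2-D vectors. It is a simplified illustration: the helper names are hypothetical, the vectors are made up, and it omits the small epsilon and the batch averaging that `nn.TripletMarginLoss` applies.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def euclidean(u, v):
    """Euclidean (p=2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = l2_normalize([1.0, 0.0])
near = l2_normalize([1.0, 0.1])   # almost the same direction as the anchor
far = l2_normalize([0.0, 1.0])    # orthogonal to the anchor

satisfied = triplet_margin_loss(anchor, near, far)   # margin already met
violated = triplet_margin_loss(anchor, far, near)    # roles swapped
print(satisfied, round(violated, 3))
```

When the margin is already satisfied the loss is exactly zero, so that triplet contributes no gradient; only violating triplets move the embeddings.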

We take each text's embedding from the final-layer hidden state of its **last non-padding token**.
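
Last-token pooling can be illustrated with plain nested lists. This sketch assumes right padding (as configured on the tokenizer earlier); `last_token_indices` and `pool_last_token` are hypothetical names, and the tiny 1-dimensional "hidden states" are made up.

```python
def last_token_indices(attention_mask):
    """Index of the last non-padding token per row, assuming right padding."""
    return [sum(row) - 1 for row in attention_mask]


def pool_last_token(hidden_states, attention_mask):
    """Pick one vector per sequence from [batch][seq_len][dim] nested lists."""
    indices = last_token_indices(attention_mask)
    return [sequence[index] for sequence, index in zip(hidden_states, indices)]


# Row 0 has three real tokens and one pad; row 1 has four real tokens.
mask = [[1, 1, 1, 0], [1, 1, 1, 1]]
hidden = [
    [[0.0], [0.1], [0.2], [9.9]],   # 9.9 sits on a padding position
    [[1.0], [1.1], [1.2], [1.3]],
]
pooled = pool_last_token(hidden, mask)
print(pooled)
```

With left padding the same index arithmetic would select the wrong position, which is why the padding side matters for this pooling strategy.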

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm.autonotebook import tqdm
from sklearn.metrics import accuracy_score, classification_report
tqdm.pandas()

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# Loss function: aims for a gap of at least 0.5 between the positive and negative distances
criterion = nn.TripletMarginLoss(margin=0.5, p=2)

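def get_last_token_embeddings(outputs, attention_mask):
    """Extract the embedding of the last real (non-padding) token; assumes right padding."""
    hidden_states = outputs.hidden_states[-1]
    last_token_indices = attention_mask.sum(1) - 1
    batch_size = hidden_states.shape[0]
    # Build the row index on the same device as the hidden states
    row_indices = torch.arange(batch_size, device=hidden_states.device)
    return hidden_states[row_indices, last_token_indices]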

def evaluate_model(model, tokenizer, df_dev, device):
    model.eval()
    predictions = []
    true_labels = df_dev['text_a_is_closer'].tolist()

    print("\nRunning evaluation on Dev set...")
    for idx, row in tqdm(df_dev.iterrows(), total=len(df_dev), desc="Evaluating"):
        anchor = "Retrieve stories with a similar narrative to the given story: " + str(row['anchor_text'])
        txt_a = str(row['text_a'])
        txt_b = str(row['text_b'])

        with torch.no_grad():
            inputs = tokenizer(
                [anchor, txt_a, txt_b],
                return_tensors='pt',
                padding=True,
                truncation=True,
                max_length=1024
            ).to(device)

            outputs = model(**inputs, output_hidden_states=True)
            embeddings = get_last_token_embeddings(outputs, inputs.attention_mask)
            embeddings = F.normalize(embeddings, p=2, dim=1)

            score_a = torch.dot(embeddings[0], embeddings[1]).item()
            score_b = torch.dot(embeddings[0], embeddings[2]).item()

            predictions.append(score_a > score_b)

    y_true = [int(x) for x in true_labels]
    y_pred = [int(x) for x in predictions]

    accuracy = accuracy_score(y_true, y_pred)
    report = classification_report(y_true, y_pred, output_dict=True)
    model.train()
    return accuracy, report

EPOCHS = 5
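# NOTE: the original value is missing from the source; 4 is an assumed placeholder, adjust as needed
accumulation_steps = 4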
model.train()

print("Starting Fine-Tuning...")

for epoch in range(EPOCHS):
    total_loss = 0
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}")

    for i, batch in enumerate(progress_bar):
        anchor_txt, pos_txt, neg_txt = batch

        all_texts = list(anchor_txt) + list(pos_txt) + list(neg_txt)

        inputs = tokenizer(
            all_texts,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=1024
        ).to(model.device)

        outputs = model(**inputs, output_hidden_states=True)

        embeddings = get_last_token_embeddings(outputs, inputs.attention_mask)
        embeddings = F.normalize(embeddings, p=2, dim=1)  # L2 normalization

        batch_size = len(anchor_txt)
        anchor_emb = embeddings[:batch_size]
        pos_emb = embeddings[batch_size:2*batch_size]
        neg_emb = embeddings[2*batch_size:]

        loss = criterion(anchor_emb, pos_emb, neg_emb)
        loss = loss / accumulation_steps
        loss.backward()

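        # Also step on the final batch so leftover gradients are not carried into the next epoch
        if (i + 1) % accumulation_steps == 0 or (i + 1) == len(train_loader):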
            optimizer.step()
            optimizer.zero_grad()

        total_loss += loss.item() * accumulation_steps
        progress_bar.set_postfix({'loss': loss.item() * accumulation_steps})

    print(f"Epoch {epoch+1} finished. Avg Loss: {total_loss / len(train_loader)}")

    accuracy, report = evaluate_model(model, tokenizer, df_dev, model.device)
    print(f"Epoch {epoch+1} Dev Accuracy: {accuracy:.4f}")

# Save the model (LoRA adapter weights)
save_path = BASE_PATH + "fine_tuned_story_emb_final"
model.save_pretrained(save_path)
print(f"Model saved to {save_path}")

Starting Fine-Tuning...
Epoch 1 finished. Avg Loss: 0.025107477479859402

Running evaluation on Dev set...
Epoch 1 Dev Accuracy: 0.6500
Epoch 2 finished. Avg Loss: 0.0029780854676899155

Running evaluation on Dev set...
Epoch 2 Dev Accuracy: 0.6450
Epoch 3 finished. Avg Loss: 0.001162603968068173

Running evaluation on Dev set...
Epoch 3 Dev Accuracy: 0.6700
Model saved to /content/drive/MyDrive/Narrative Similarity Data/fine_tuned_story_emb_final


## 5. Evaluation and Submission
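Finally, we run the fine-tuned model on the development set. For each row we compute the dot product between the Anchor embedding and each candidate embedding (A and B); because the embeddings are L2-normalized, this equals cosine similarity. The candidate with the higher score is predicted as closer, and the predictions are written to a submission file.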
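
The decision rule can be sketched in a few lines of plain Python. The vectors below are toy values and the function names are hypothetical; the point is only that, for unit-length vectors, the dot product and cosine similarity coincide, so comparing dot products is enough.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def text_a_is_closer(anchor_emb, emb_a, emb_b):
    """Predict True when candidate A scores higher than candidate B."""
    return dot(anchor_emb, emb_a) > dot(anchor_emb, emb_b)


# Toy embeddings, normalized the same way the notebook normalizes model outputs.
anchor_emb = l2_normalize([3.0, 4.0])
emb_a = l2_normalize([4.0, 3.0])
emb_b = l2_normalize([-1.0, 2.0])

prediction = text_a_is_closer(anchor_emb, emb_a, emb_b)
print(prediction)
```

Ties resolve to `False` (candidate B) under the strict `>` comparison, which matches the comparison used in the prediction cell.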

In [None]:
import torch
from tqdm.auto import tqdm
tqdm.pandas()

model.eval()

def predict_row(row):
    anchor = "Retrieve stories with a similar narrative to the given story: " + row['anchor_text']
    txt_a = row['text_a']
    txt_b = row['text_b']

    with torch.no_grad():
        inputs = tokenizer(
            [anchor, txt_a, txt_b],
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=1024
        ).to(model.device)

        outputs = model(**inputs, output_hidden_states=True)
        embeddings = get_last_token_embeddings(outputs, inputs.attention_mask)
        embeddings = F.normalize(embeddings, p=2, dim=1)

        score_a = torch.dot(embeddings[0], embeddings[1]).item()
        score_b = torch.dot(embeddings[0], embeddings[2]).item()

    return score_a > score_b

print("Running prediction on Dev set...")
df_dev['prediction'] = df_dev.progress_apply(predict_row, axis=1)

from sklearn.metrics import accuracy_score, classification_report

y_true = df_dev['text_a_is_closer'].astype(int)
y_pred = df_dev['prediction'].astype(int)

print(f"Accuracy: {accuracy_score(y_true, y_pred)}")
print(classification_report(y_true, y_pred))

Running prediction on Dev set...


Accuracy: 0.67
              precision    recall  f1-score   support

           0       0.65      0.71      0.68        99
           1       0.69      0.63      0.66       101

    accuracy                           0.67       200
   macro avg       0.67      0.67      0.67       200
weighted avg       0.67      0.67      0.67       200



In [None]:
import zipfile
from google.colab import files

submission_df = df_dev[['prediction']].copy()
submission_df = submission_df.rename(columns={'prediction': 'text_a_is_closer'})

jsonl_filename = 'track_a.jsonl'
submission_df.to_json(jsonl_filename, orient='records', lines=True)

zip_filename = 'submission_model.zip'
with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
    zipf.write(jsonl_filename)

print(f"Created {zip_filename}")
files.download(zip_filename)

Created submission_zeroshot.zip





@inproceedings{hatzel-biemann-2024-story-embeddings,
    title = "Story Embeddings -- Narrative-Focused Representations of Fictional Stories",
    author = "Hatzel, Hans Ole and Biemann, Chris",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    year = "2024",
    address = "Miami, Florida",
    publisher = "Association for Computational Linguistics",
}
