<a href="https://colab.research.google.com/github/DVORA-AZARKOVICH/Narrative-Similarity/blob/main/NarrativeSimilarityGemini_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Narrative Similarity: Gemini Distillation & Embeddings (Test Phase)
**SemEval-2026 Task 4: Narrative Similarity**

This notebook executes the **Distill-then-Embed** pipeline.
1.  **Distill:** Extracts the "Narrative Core" (Theme, Action, Outcomes) using **Gemini 1.5 Flash**.
2.  **Embed:** Converts the distilled text into vectors using the **`text-embedding-004`** model.
3.  **Predict:** Computes Cosine Similarity to determine if the Anchor story is closer to Text A or Text B.

## 1. Setup Environment
This cell installs the `google-generativeai` SDK, mounts Google Drive, and reads the API key from the Colab secrets.
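When the notebook runs outside Colab, the secrets store is not available. The sketch below shows one possible fallback: try the Colab secrets first, then an environment variable. The helper name `resolve_api_key` and the environment-variable fallback are assumptions for illustration; they are not part of the original notebook.

```python
import os


def resolve_api_key(name="GOOGLE_API_KEY"):
    """Return the API key from Colab secrets if possible, else from the environment.

    Hypothetical helper: the notebook itself only reads Colab secrets.
    """
    key = None
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        # Not in Colab, or the secret is missing or not shared with this notebook.
        key = None

    # Fall back to an environment variable of the same name.
    key = key or os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in Colab secrets or the environment.")
    return key
```

Failing early here avoids discovering a missing key only after the long distillation loop has started.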

In [None]:
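!pip install -q -U google-generativeai

# The later cells use the google.generativeai API
# (genai.configure, genai.GenerativeModel, genai.embed_content),
# so the import must match the package installed above.
import google.generativeai as genai
import pandas as pd
from tqdm import tqdm
import time
import os
from google.colab import drive, userdata

drive.mount('/content/drive')

# Read the API key from Colab secrets.
try:
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
except Exception:
    GOOGLE_API_KEY = None  # replace None with your key if Colab secrets are unavailable
    print("Please set your GOOGLE_API_KEY in the Colab secrets or manually here.")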


Mounted at /content/drive


## 2. Load Data
This cell reads the synthetic, development, and test sets (JSON Lines files) from Google Drive.
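Before spending API calls on a file, it can help to confirm that every record carries the fields the later cells rely on. The following is a minimal sketch using only the standard library rather than pandas; the helper `read_jsonl` and the sample records are hypothetical, and the field names are taken from the columns used later in this notebook.

```python
import io
import json

# Columns that the distillation and embedding cells read.
REQUIRED_FIELDS = ("anchor_text", "text_a", "text_b")


def read_jsonl(handle, required=REQUIRED_FIELDS):
    """Parse a JSON Lines stream and check that each record has the required fields."""
    records = []
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [field for field in required if field not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        records.append(record)
    return records


# Toy input standing in for a real file; the stories are placeholders.
sample = io.StringIO(
    '{"anchor_text": "x", "text_a": "y", "text_b": "z"}\n'
    "\n"
    '{"anchor_text": "p", "text_a": "q", "text_b": "r"}\n'
)
rows = read_jsonl(sample)
print(len(rows))
```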

In [None]:

BASE_PATH = '/content/drive/MyDrive/Narrative Similarity Data/'

SYNTHETIC_FILE = BASE_PATH + 'synthetic_data_for_classification.jsonl'
df_synthetic = pd.read_json(SYNTHETIC_FILE, lines=True)
print("--- Synthetic Data Loaded ---")
print(f"Shape: {df_synthetic.shape}")

DEV_FILE = BASE_PATH + 'SemEval2026-Task_4-dev-v1/dev_track_a.jsonl'
df_dev = pd.read_json(DEV_FILE, lines=True)
print("\n--- Development Data Loaded ---")
print(f"Shape: {df_dev.shape}")

TEST_FILE = BASE_PATH + 'SemEval2026-Task_4-test-v1/test_track_a.jsonl'
df_test = pd.read_json(TEST_FILE, lines=True)

--- Synthetic Data Loaded ---
Shape: (1900, 5)

--- Development Data Loaded ---
Shape: (200, 4)


In [None]:
df_synthetic = df_synthetic.drop(columns=['model_name'])

df_train = df_synthetic.copy()

## 3. Narrative Distillation (Gemini 1.5 Flash)
We define a system prompt that directs the LLM to strip away superficial details (style, names, specific settings) and rewrite the story focusing only on the **abstract theme, course of action, and outcomes**.

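The distillation function defined below is then applied to the `anchor_text`, `text_a`, and `text_b` columns of the test set, and each result is stored in a matching `*_core` column.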
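The distillation function in this notebook returns the original story when an API call fails, so a transient error leaves an undistilled text in the `*_core` column without any signal. One possible mitigation is to retry with exponential backoff and report whether the call succeeded. The sketch below is an assumption, not the notebook's method; `call_with_retries` and its parameters are hypothetical, and `fn` stands in for any function that maps a story to its distilled form.

```python
import time


def call_with_retries(fn, text, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(text), retrying on any exception with exponential backoff.

    Returns (result, True) on success, or (text, False) if every attempt fails,
    so the caller can tell a distilled story from a fallback.
    """
    for attempt in range(max_attempts):
        try:
            return fn(text), True
        except Exception:
            # Wait 1x, 2x, 4x ... base_delay between attempts, but not after the last one.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return text, False


# Usage with a stand-in function; a real run would pass the distillation function.
result, ok = call_with_retries(str.upper, "a short story", sleep=lambda s: None)
print(result, ok)
```

Keeping the success flag next to the text makes it possible to count fallbacks after the run instead of assuming every row was distilled.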

In [None]:
import google.generativeai as genai
from tqdm import tqdm
import pandas as pd

# Configure the SDK with the key loaded in the setup cell.
genai.configure(api_key=GOOGLE_API_KEY)

distillation_system_prompt = """
You are an expert annotator tasked with extracting the 'Narrative Core' of a story.

### DEFINITIONS
Extract the core based ONLY on:
1. **Abstract Theme**: The defining constellation of problems and central ideas.
2. **Course of Action**: The sequence of events, actions, conflicts, and turning points.
3. **Outcomes**: The results of the plot (resolution, fates).

### WHAT TO IGNORE
Explicitly REMOVE and IGNORE:
* The concrete setting (e.g., Sci-Fi, Western, dates, locations).
* Names of characters.
* The style of writing.

### OUTPUT INSTRUCTION
Rewrite the story summary to include ONLY the Abstract Theme, Course of Action, and Outcomes. Do not analyze, just describe the narrative core neutrally.
"""

model_distill = genai.GenerativeModel('gemini-1.5-flash',
                                      system_instruction=distillation_system_prompt)

def distill_narrative(text):
    """Return the narrative core of `text`; fall back to the original text on failure."""
    if not isinstance(text, str):
        return ""
    try:
        response = model_distill.generate_content(text)
        return response.text.strip()
    except Exception:
        # API errors and responses without text land here. The raw story is kept
        # so the row can still be embedded downstream.
        return text

# Register progress_apply on pandas objects.
tqdm.pandas(desc="Distilling Stories")
cols_to_distill = ['anchor_text', 'text_a', 'text_b']

print("Starting distillation...")
for col in cols_to_distill:
    print(f"Distilling column: {col}...")
    df_test[f'{col}_core'] = df_test[col].progress_apply(distill_narrative)



All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md

  loader.exec_module(module)


Starting distillation...
Distilling column: anchor_text...


Distilling Stories: 100%|██████████| 400/400 [01:33<00:00,  4.28it/s]


Distilling column: text_a...


Distilling Stories: 100%|██████████| 400/400 [01:41<00:00,  3.96it/s]


Distilling column: text_b...


Distilling Stories: 100%|██████████| 400/400 [01:50<00:00,  3.61it/s]


## 4. Embeddings & Similarity Scoring
We generate vector embeddings for the *distilled* narrative cores using `text-embedding-004`.
Then, we calculate the **Cosine Similarity** between the Anchor and the two candidates.

* **Logic:** `predicted_label = True` if Similarity(Anchor, A) > Similarity(Anchor, B), else `False`. Ties therefore resolve to `False` (Text B).
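The decision rule can be checked on toy vectors without calling the embedding API. The sketch below uses only the standard library instead of NumPy; the function names and the two-dimensional vectors are illustrative assumptions. Cosine similarity is the dot product divided by the product of the vector norms, and a zero vector is treated as having similarity 0, matching the behavior of the cell that follows.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_a_is_closer(anchor, vec_a, vec_b):
    """Strict comparison: a tie counts as 'Text B is closer'."""
    return cosine(anchor, vec_a) > cosine(anchor, vec_b)


# Toy embeddings: A points halfway toward the anchor, B is orthogonal to it.
anchor = [1.0, 0.0]
vec_a = [1.0, 1.0]
vec_b = [0.0, 1.0]
print(text_a_is_closer(anchor, vec_a, vec_b))
```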

In [None]:
import numpy as np
from numpy.linalg import norm
import google.generativeai as genai
from tqdm import tqdm
import pandas as pd

# Pass the key by keyword, as in the distillation cell.
genai.configure(api_key=GOOGLE_API_KEY)


EMBEDDING_MODEL = 'models/text-embedding-004'

print(f"\nAttempting to connect to embedding model: {EMBEDDING_MODEL}...")

try:
    test_emb = genai.embed_content(
        model=EMBEDDING_MODEL,
        content="test",
        task_type="semantic_similarity"
    )
    print("✅ SUCCESS! Embedding model found and working.")

except Exception as e:
    print(f"\n❌ Error connecting to {EMBEDDING_MODEL}: {e}")
    print("\n--- Debugging: Available Models for your Key ---")
    try:
        # List only the embedding models visible to this API key.
        for m in genai.list_models():
            if 'embed' in m.name:
                print(f"- {m.name}")
    except Exception as list_err:
        print(f"Could not list models: {list_err}")

    print("\nIf you see other embedding models above (like 'models/embedding-001'), replace EMBEDDING_MODEL in the code with one of them.")
    raise  # re-raise the original error with its traceback


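def get_embedding(text):
    """Embed `text`; return a zero vector for empty input or on API failure."""
    # 768 is the vector length this notebook assumes for EMBEDDING_MODEL.
    if not isinstance(text, str) or not text.strip():
        return np.zeros(768)

    try:
        result = genai.embed_content(
            model=EMBEDDING_MODEL,
            content=text,
            task_type="semantic_similarity"
        )
        return np.array(result['embedding'])
    except Exception:
        # A zero vector yields a similarity of 0.0 in cosine_similarity below.
        return np.zeros(768)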

def cosine_similarity(vec_a, vec_b):
    if np.all(vec_a == 0) or np.all(vec_b == 0):
        return 0.0
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))

tqdm.pandas(desc="Embeddings")

print("\n--- Starting Full Embedding Process ---")

cols = ['anchor_text', 'text_a', 'text_b']
suffix = '_core' if 'anchor_text_core' in df_test.columns else ''

print(f"Using columns with suffix: '{suffix}'")

df_test['anchor_emb'] = df_test[f'anchor_text{suffix}'].progress_apply(get_embedding)
df_test['text_a_emb'] = df_test[f'text_a{suffix}'].progress_apply(get_embedding)
df_test['text_b_emb'] = df_test[f'text_b{suffix}'].progress_apply(get_embedding)

print("Calculating Scores...")
df_test['score_a'] = df_test.apply(lambda row: cosine_similarity(row['anchor_emb'], row['text_a_emb']), axis=1)
df_test['score_b'] = df_test.apply(lambda row: cosine_similarity(row['anchor_emb'], row['text_b_emb']), axis=1)

df_test['predicted_label'] = (df_test['score_a'] > df_test['score_b']).astype(bool)

print("\n--- Done! Results Sample ---")
print(df_test[['score_a', 'score_b', 'predicted_label']].head())


Attempting to connect to embedding model: models/text-embedding-004...
✅ SUCCESS! Embedding model found and working.

--- Starting Full Embedding Process ---
Using columns with suffix: '_core'


Embeddings: 100%|██████████| 400/400 [03:02<00:00,  2.20it/s]
Embeddings: 100%|██████████| 400/400 [02:45<00:00,  2.42it/s]
Embeddings: 100%|██████████| 400/400 [02:46<00:00,  2.41it/s]

Calculating Scores...

--- Done! Results Sample ---
    score_a   score_b  predicted_label
0  0.607960  0.661398            False
1  0.649949  0.688102            False
2  0.567343  0.679646            False
3  0.688693  0.620238             True
4  0.624853  0.576523             True





## 5. Export Results
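This cell writes the predictions to `track_a.jsonl` under the `text_a_is_closer` field, zips the file, and downloads the archive for submission. The embeddings and similarity scores are not exported.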
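A quick read-back of the archive can catch an empty file or a wrongly typed field before upload. The sketch below is a hedged example: `validate_submission` is a hypothetical helper, and the checks mirror only what the export cell in this notebook writes (a `track_a.jsonl` member with one boolean `text_a_is_closer` per line), not an official format specification.

```python
import io
import json
import zipfile


def validate_submission(zip_bytes, member="track_a.jsonl", field="text_a_is_closer"):
    """Return the number of prediction lines, raising ValueError on malformed content."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        if member not in archive.namelist():
            raise ValueError(f"{member} not found in archive")
        lines = archive.read(member).decode("utf-8").splitlines()

    count = 0
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        # The export cell writes JSON booleans, so anything else is suspicious.
        if not isinstance(record.get(field), bool):
            raise ValueError(f"line {line_no}: '{field}' is missing or not a boolean")
        count += 1
    return count


# Build a tiny archive in memory to exercise the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr(
        "track_a.jsonl",
        '{"text_a_is_closer": true}\n{"text_a_is_closer": false}\n',
    )
n_predictions = validate_submission(buffer.getvalue())
print(n_predictions)
```

For a real run, the bytes would come from reading the zip file created by the export cell, and the count can be compared with the number of test rows.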

In [None]:
import zipfile
from google.colab import files

submission_df = df_test[['predicted_label']].copy()

submission_df = submission_df.rename(columns={'predicted_label': 'text_a_is_closer'})

jsonl_filename = 'track_a.jsonl'
submission_df.to_json(jsonl_filename, orient='records', lines=True)
print(f"Saved intermediate JSONL: {jsonl_filename}")

zip_filename = 'submission_zeroshot.zip'
with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
    zipf.write(jsonl_filename)

print(f"Successfully created: {zip_filename}")

files.download(zip_filename)

Saved intermediate JSONL: track_a.jsonl
Successfully created: submission_zeroshot.zip

