# Reconstructing Exam Questions from Model-Generated Answers  
### SOTA-AI December Task 3 — David’s Archive Issue

This notebook presents a complete, phase-wise solution to the inverse question–answering task posed in the SOTA-AI December Challenge.

The objective is to reconstruct plausible, exam-style questions given answers generated by a large language model. Since the original questions are unavailable and evaluation is based on semantic similarity rather than exact string matching, the task requires reasoning about intent, meaning, and the inversion of the question–answer relationship rather than surface-level text matching.

The solution is organized into clearly defined stages:
1. Understanding the structure and characteristics of the dataset  
2. Constructing a synthetic supervision signal from unlabeled answers  
3. Fine-tuning a language model using parameter-efficient methods  
4. Generating questions for the test set  
5. Performing selective semantic repair using cycle consistency  
6. Auditing outputs and preparing the final submission  

All stages are designed to be conceptually complete and reproducible. For practical execution under constrained compute environments, the same logic can be applied in batches by operating on subsets of the data. Since all processing steps are independent per example, batch-wise execution yields identical results to a single end-to-end run, and final outputs are obtained by aggregating and ordering these batches deterministically.
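
The batching argument can be illustrated with a minimal sketch. The helper names `make_batches` and `run_in_batches` are hypothetical and do not appear elsewhere in this notebook, and `str.upper` merely stands in for any deterministic per-example step:

```python
def make_batches(n_rows, batch_size):
    """Return [start, end) index pairs that cover range(n_rows) exactly once."""
    return [(s, min(s + batch_size, n_rows)) for s in range(0, n_rows, batch_size)]


def run_in_batches(items, batch_size, process_one):
    """Process each batch independently, then stitch results back in row order."""
    parts = {}
    for start, end in make_batches(len(items), batch_size):
        parts[start] = [process_one(x) for x in items[start:end]]
    # Sorting by the numeric start index makes the aggregation order deterministic
    return [y for start in sorted(parts) for y in parts[start]]


answers = [f"answer {i}" for i in range(7)]
batched = run_in_batches(answers, batch_size=3, process_one=str.upper)
single_run = [a.upper() for a in answers]
print(batched == single_run)
```

The equivalence only holds when the per-example step is deterministic, which is why greedy decoding is used at test time.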

The emphasis throughout this notebook is on methodological clarity, semantic correctness, and principled decision-making under realistic resource constraints.


## 1. Understanding the Dataset

The task is an inverse question-answering problem: given an answer generated by a large language model, the goal is to reconstruct a plausible exam-style question that could have produced that answer.

The dataset consists of:
- A **training set** containing only model-generated answers (no questions).
- A **test set** containing unseen answers for which questions must be generated.

Key observations from exploratory analysis:
- Answers are long, explanatory, and often multi-paragraph.
- Many answers implicitly encode the question they are responding to.
- There is no direct supervision signal (question → answer pairs), making this a fundamentally ill-posed inverse problem.

Because evaluation is based on **semantic similarity** rather than exact string matching, the challenge is not surface-level phrasing, but capturing the *intent* of the original question.


In [None]:
# Importing necessary libraries

import pandas as pd
import numpy as np
import re
from collections import Counter
import matplotlib.pyplot as plt

plt.style.use("seaborn-v0_8")
pd.set_option("display.max_colwidth", 300)


In [None]:
! pip install kagglehub

In [None]:
! unzip /content/sota-ai-december-task-3-davids-archive-issue.zip

In [None]:
# Reading the train and test CSV files

train_path = "/content/kaggle_dataset/train.csv"
test_path  = "/content/kaggle_dataset/test.csv"

train_df = pd.read_csv(train_path)
test_df  = pd.read_csv(test_path)

print("Train shape:", train_df.shape)
print("Test shape :", test_df.shape)

train_df.tail(10)

In [None]:
# Checking for null values in the datasets

print(train_df.info())
print("\nNulls in train:")
print(train_df.isnull().sum())

print("\nNulls in test:")
print(test_df.isnull().sum())


In [None]:
# Analyzing the word count distribution in the answers

def word_count(text):
    return len(str(text).split())

train_df["word_count"] = train_df["ans"].apply(word_count)
test_df["word_count"]  = test_df["ans"].apply(word_count)

train_df["word_count"].describe(), test_df["word_count"].describe()


In [None]:
# Visualizing the word count distribution

plt.figure(figsize=(10,5))
plt.hist(train_df["word_count"], bins=50, alpha=0.7, label="train")
plt.hist(test_df["word_count"], bins=50, alpha=0.7, label="test")
plt.legend()
plt.title("Answer Length Distribution (words)")
plt.xlabel("Word count")
plt.ylabel("Frequency")
plt.show()


In [None]:
# Extracting features: presence of lists and whether the answer starts with a definition phrase

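def has_list(text):
    # Matches numbered ("1.") or bulleted ("- ", "* ") items at the start of any line
    return bool(re.search(r"^\s*(?:\d+\.|[-*])\s", str(text), flags=re.MULTILINE))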

def starts_definition(text):
    return text.lower().strip().startswith(
        ("in the", "the term", "the concept", "abduction", "predictive")
    )

train_df["has_list"] = train_df["ans"].apply(has_list)
train_df["starts_definition"] = train_df["ans"].apply(starts_definition)

train_df[["has_list", "starts_definition"]].mean()


In [None]:
# Analyzing sentence and paragraph counts in the answers

def sentence_count(text):
    return len(re.findall(r"[.!?]", text))

def paragraph_count(text):
    # Paragraphs are separated by blank lines; a single "\n" is only a line break
    return len([p for p in re.split(r"\n\s*\n", text) if p.strip()])

train_df["sentences"] = train_df["ans"].apply(sentence_count)
train_df["paragraphs"] = train_df["ans"].apply(paragraph_count)

train_df[["sentences", "paragraphs"]].describe()


In [None]:
# Keyword presence analysis in the answers and their frequencies

keywords = [
    "define", "definition", "example", "compare", "difference",
    "advantage", "disadvantage", "bias", "limitation",
    "ethics", "alignment", "reasoning", "inductive", "abductive"
]

def keyword_hits(text):
    text = text.lower()
    return {k: (k in text) for k in keywords}

kw_df = train_df["ans"].apply(keyword_hits).apply(pd.Series)
kw_df.mean().sort_values(ascending=False)


In [None]:
# Displaying sample answers from the training set

for i in np.random.choice(len(train_df), 5, replace=False):
    print("="*80)
    print(train_df.loc[i, "quesid"])
    print(train_df.loc[i, "ans"][:1200])


In [None]:
# Cleaning the answers: normalizing line breaks and removing extra spaces

def clean_answer(text):
    text = str(text)
    text = re.sub(r"\r\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = text.strip()
    return text

train_df["ans_clean"] = train_df["ans"].apply(clean_answer)
test_df["ans_clean"]  = test_df["ans"].apply(clean_answer)


In [None]:
# Summarizing key statistics about the datasets

PHASE_1_SUMMARY = {
    "train_size": len(train_df),
    "test_size": len(test_df),
    "median_word_count_train": train_df["word_count"].median(),
    "median_word_count_test": test_df["word_count"].median(),
    "multi_paragraph_fraction": float((train_df["paragraphs"] > 1).mean()),
    "list_fraction": float(train_df["has_list"].mean()),
}

PHASE_1_SUMMARY

## 2. Generating Synthetic Question–Answer Pairs

Since the training data does not include ground-truth questions, a synthetic supervision signal is required.

To construct this signal:
- Multiple candidate questions are generated for each training answer using a strong prompt.
- A cycle-consistency heuristic is applied:
  - Generated question → regenerated answer
  - Semantic similarity is measured between the regenerated answer and the original answer.
- Each candidate is stored with its cycle score, so that only the highest-scoring question per answer is retained for fine-tuning.

This process prioritizes **quality over quantity**, producing a small but reliable dataset of synthetic QA pairs that reflect the structure and intent of the original answers.
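
The cycle-consistency heuristic can be sketched without loading any model. Everything in this sketch is an illustrative assumption: `jaccard` is a toy token-overlap score standing in for embedding cosine similarity, `fake_llm` is a hypothetical lookup table standing in for the answer-regeneration call, and the threshold of 0.5 is arbitrary.

```python
def jaccard(a, b):
    """Toy stand-in for embedding cosine similarity: token-set overlap in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def cycle_filter(answer, candidates, regenerate, similarity, threshold):
    """Keep candidate questions whose regenerated answer resembles the original."""
    scored = [(q, similarity(answer, regenerate(q))) for q in candidates]
    kept = [(q, s) for q, s in scored if s >= threshold]
    # Highest cycle score first
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical lookup table standing in for an LLM call (question -> answer)
fake_llm = {
    "What is induction?": "induction generalizes from observed cases",
    "What is ethics?": "ethics studies moral conduct",
}

answer = "induction generalizes from many observed cases"
kept = cycle_filter(answer, list(fake_llm), fake_llm.get, jaccard, threshold=0.5)
print(kept)
```

In the cells that follow, the same roles are played by `regenerate_answer` and `semantic_similarity`.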


In [None]:
# Installing necessary libraries

!pip install -q transformers accelerate bitsandbytes peft sentence-transformers tqdm


In [None]:
# Importing necessary libraries

import torch
import pandas as pd
import numpy as np
import re
from tqdm import tqdm

from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer, util
import wandb

In [None]:
# Initializing Weights & Biases for experiment tracking

wandb.init(
    project="sota-ai-task3-phase2",
    name="phase2-debug-275-samples",
    config={
        "model": "Llama-3.1-8B-Instruct",
        "num_samples": 275,  # matches N_SAMPLES used for subsampling below
        "num_candidates": 2,
        "temperatures": [0.0, 0.6],
        "max_q_tokens": 64,
        "max_a_tokens": 256
    }
)


In [None]:
! unzip /content/sota-ai-december-task-3-davids-archive-issue.zip

In [None]:
# Reading the train dataset

train_df = pd.read_csv("/content/kaggle_dataset/train.csv")

print("Train size:", len(train_df))
train_df.head(2)


In [None]:
# Cleaning the answers: normalizing line breaks and removing extra spaces

def clean_answer(text):
    text = str(text)
    text = re.sub(r"\r\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

train_df["ans_clean"] = train_df["ans"].apply(clean_answer)


In [None]:
# Subsampling the training data to ensure complete run in Colab's constrained resources

N_SAMPLES = 275
train_subset = train_df.sample(n=N_SAMPLES, random_state=42).reset_index(drop=True)

print("Subset size:", len(train_subset))


In [None]:
# Loading the Llama 3.1 8B Instruct model and tokenizer.
# Llama 3.1 8B Instruct is used because it follows instructions well and is the
# model mentioned in the problem statement. 4-bit quantization keeps memory usage low.

from transformers import BitsAndBytesConfig

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Passing load_in_4bit directly to from_pretrained is deprecated;
# the quantization settings go through BitsAndBytesConfig instead.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.float16
)

model.eval()


In [None]:
# Loading the sentence transformer for embedding generation

embedder = SentenceTransformer("all-MiniLM-L6-v2")

BASE_PROMPT = """You are an academic exam setter.

Given the following answer written in a formal, philosophical or analytical style, write a single well-formed exam question that this answer would plausibly respond to.

Guidelines:
- Use an interrogative form (What / How / Why / Explain / Discuss).
- Keep the question to ONE sentence.
- Do not introduce facts not present in the answer.
- End with a question mark.

Answer:
{answer}

Question:
"""


In [None]:
# Defining helper functions:
# - extract_question: returns the first plausible question in the model output, tolerating noisy formatting.
# - generate_2_questions: generates two candidate questions at different temperatures.
# - regenerate_answer: generates an answer for a given question.
# - semantic_similarity: cosine similarity between two texts using sentence embeddings.

def extract_question(text):
    """
    Extract the FIRST plausible question from model output.
    Handles noisy LLM formatting.
    """
    lines = [l.strip() for l in text.split("\n") if l.strip()]

    for line in lines:
        # strip common prefixes
        line = re.sub(r"^(Question:|Q:|Here is.*?:|Possible question:)", "", line).strip()

        if "?" in line:
            q = line.split("?")[0].strip() + "?"
            if len(q.split()) >= 4:
                return q

    return None


@torch.no_grad()
def generate_2_questions(answer, log=False):
    temperatures = [0.0, 0.6]
    questions = []

    for temp in temperatures:
        prompt = BASE_PROMPT.format(answer=answer)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        output = model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=(temp > 0),
            temperature=temp if temp > 0 else None,
            top_p=0.95,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id
        )

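        # Decode only the newly generated tokens; the prompt contains the answer,
        # which may itself include a "?" and be mistaken for the question.
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        decoded = tokenizer.decode(new_tokens, skip_special_tokens=True)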

        q = extract_question(decoded)

        if log:
            print("\n--- RAW MODEL OUTPUT ---")
            print(decoded[:500])
            print("--- EXTRACTED QUESTION ---")
            print(q)

        if q:
            questions.append(q)

    # dedupe while preserving order
    return list(dict.fromkeys(questions))


@torch.no_grad()
def regenerate_answer(question):
    prompt = f"""You are answering an academic exam question.

Question:
{question}

Answer:
"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )

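    # Decode only the generated continuation so that an "Answer:" inside the
    # regenerated text cannot truncate it.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()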


def semantic_similarity(a, b):
    ea = embedder.encode(a, convert_to_tensor=True)
    eb = embedder.encode(b, convert_to_tensor=True)
    return float(util.cos_sim(ea, eb))


In [None]:
# Main loop: generating questions, regenerating answers, computing similarity, and logging results.

synthetic_rows = []

DEBUG_FIRST_N = 275   # log everything

for idx, row in tqdm(train_subset.iterrows(), total=len(train_subset)):
    answer = row["ans_clean"]

    log = idx < DEBUG_FIRST_N
    questions = generate_2_questions(answer, log=log)

    # log number of questions generated
    wandb.log({
        "num_questions_generated": len(questions)
    })

    if log:
        print("\nGenerated questions:", questions)

    for q in questions:
        regen_ans = regenerate_answer(q)
        sim = semantic_similarity(answer, regen_ans)

        if log:
            print("\nQUESTION:", q)
            print("REGENERATED ANSWER (truncated):", regen_ans[:400])
            print("SEMANTIC SIMILARITY:", sim)

        synthetic_rows.append({
            "quesid": row["quesid"],
            "answer": answer,
            "question": q,
            "cycle_score": sim
        })

        # per-question W&B logging
        wandb.log({
            "cycle_similarity": sim,
            "question_length": len(q.split()),
            "answer_length": len(answer.split())
        })

    # occasional text logging (every 25 samples)
    if idx % 25 == 0 and questions:
        wandb.log({
            "sample_answer": wandb.Html(answer[:600]),
            "sample_question": questions[0]
        })


In [None]:
# Creating a DataFrame from the synthetic rows and displaying summary statistics

synthetic_df = pd.DataFrame(synthetic_rows)

print("Total QA pairs:", len(synthetic_df))
synthetic_df["cycle_score"].describe()


In [None]:
wandb.log({
    "total_QA_pairs": len(synthetic_df),
    "mean_cycle_score": synthetic_df["cycle_score"].mean(),
    "median_cycle_score": synthetic_df["cycle_score"].median()
})


In [None]:
# Displaying random samples from the synthetic dataset

for i in np.random.choice(len(synthetic_df), 5, replace=False):
    print("="*80)
    print("ANSWER:\n", synthetic_df.iloc[i]["answer"][:600])
    print("\nQUESTION:\n", synthetic_df.iloc[i]["question"])
    print("SIM:", synthetic_df.iloc[i]["cycle_score"])


In [None]:
# Saving the synthetic dataset to a CSV file and finishing the W&B run. The synthetic dataset is used for training in the next phase.

synthetic_df.to_csv("/content/synthetic_AQ_debug.csv", index=False)
wandb.finish()

In [None]:
from google.colab import files

files.download("/content/synthetic_AQ_debug.csv")

## 3. Fine-Tuning the Language Model (QLoRA)

To bias the model toward clean, exam-style inverse question generation, the synthetic QA pairs are used to fine-tune a LLaMA-based model.

Key design choices:
- **QLoRA** is used for efficiency, allowing fine-tuning on limited compute.
- Only a small number of parameters are trained, reducing overfitting risk.
- Training focuses on:
  - Learning the answer → question inversion pattern
  - Producing a single, well-formed question
  - Stopping generation cleanly without explanations or options

Despite the small dataset size, validation loss trends suggest stable learning, although a validation split this small gives only weak evidence of generalization.
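
To make "a small number of parameters" concrete, the arithmetic below estimates the adapter size for the rank-16 LoRA configuration on `q_proj` and `v_proj` used later in this section. The layer dimensions are assumptions taken from the published Llama 3.1 8B configuration (hidden size 4096, 32 layers, 8 key/value heads of dimension 128) and should be checked against the output of `model.print_trainable_parameters()`.

```python
def lora_param_count(d_in, d_out, r):
    """A LoRA pair A (r x d_in) and B (d_out x r) adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)


# Assumed dimensions for Llama 3.1 8B (verify against the model config)
HIDDEN = 4096   # input size of q_proj and v_proj, output size of q_proj
KV_DIM = 1024   # output size of v_proj: 8 key/value heads x 128
LAYERS = 32
RANK = 16

per_layer = (
    lora_param_count(HIDDEN, HIDDEN, RANK)    # q_proj adapter
    + lora_param_count(HIDDEN, KV_DIM, RANK)  # v_proj adapter
)
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the adapters amount to well under 0.1% of the roughly 8 billion base weights, which stay frozen in 4-bit form.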


In [None]:
!pip install -q transformers accelerate bitsandbytes peft datasets sentence-transformers wandb


In [None]:
# Importing necessary libraries

import torch
import pandas as pd
import numpy as np
import re

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model
import wandb


In [None]:
# Initializing Weights & Biases for experiment tracking

wandb.init(
    project="sota-ai-task3",
    name="qlora-finetune-275",
    config={
        "base_model": "Llama-3.1-8B-Instruct",
        "method": "QLoRA",
        "epochs": 3,
        "samples": 275
    }
)


In [None]:
# Reading the synthetic dataset and displaying its head

synthetic_df = pd.read_csv("/content/synthetic_AQ_debug.csv")

print("Raw rows:", len(synthetic_df))
synthetic_df.head(3)


In [None]:
# Selecting the best question-answer pair per original question based on cycle score

best_df = (
    synthetic_df
    .sort_values("cycle_score", ascending=False)
    .groupby("quesid")
    .first()
    .reset_index()
)

print("After grouping:", len(best_df))
best_df.head(3)


In [None]:
# Analyzing the lengths of answers and questions in the best dataset

best_df["answer_len"] = best_df["answer"].apply(lambda x: len(x.split()))
best_df["question_len"] = best_df["question"].apply(lambda x: len(x.split()))

best_df[["answer_len", "question_len"]].describe()


In [None]:
# Visualizing the question length distribution

import matplotlib.pyplot as plt

plt.hist(best_df["question_len"], bins=20)
plt.title("Question Length Distribution")
plt.show()


In [None]:
# Displaying random samples from the best dataset

for i in np.random.choice(len(best_df), 5, replace=False):
    print("="*80)
    print("ANSWER:\n", best_df.iloc[i]["answer"][:600])
    print("\nQUESTION:\n", best_df.iloc[i]["question"])
    print("CYCLE SCORE:", best_df.iloc[i]["cycle_score"])


In [None]:
# Splitting the best dataset into training and validation sets

train_df = best_df.sample(frac=0.9, random_state=42)
val_df   = best_df.drop(train_df.index)

print("Train:", len(train_df))
print("Val  :", len(val_df))


In [None]:
# Formatting the dataset for Model Training and applying it to training and validation sets

def format_example(row):
    return {
        "text": f"""You are an academic exam setter.

TASK:
Given the answer below, write ONE exam question.

STRICT RULES:
- Output ONLY the question.
- Do NOT include multiple-choice options.
- Do NOT include an answer.
- Do NOT include explanations.
- End with a question mark and STOP.

Answer:
{row['answer']}

Question:
{row['question']}
"""
    }

train_ds = Dataset.from_pandas(train_df.apply(format_example, axis=1, result_type="expand"))
val_ds   = Dataset.from_pandas(val_df.apply(format_example, axis=1, result_type="expand"))


In [None]:
# Loading the tokenizer and the Llama 3.1 8B Instruct model in 4-bit quantized form for QLoRA fine-tuning

from transformers import BitsAndBytesConfig

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# load_in_4bit as a direct keyword is deprecated; use BitsAndBytesConfig instead
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    torch_dtype=torch.float16
)


In [None]:
# Setting up LoRA configuration and applying it to the model

lora_config = LoraConfig(
    r=16, # rank 16: a common trade-off between adapter capacity and memory
    lora_alpha=32, # scaling factor set to 2x the rank, a widely used heuristic
    target_modules=["q_proj", "v_proj"], # target attention projection layers
    lora_dropout=0.05, # small dropout for regularization
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


In [None]:
# Tokenizing the dataset and applying it to training and validation sets

def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=1024,
        padding=False
    )

train_ds = train_ds.map(tokenize, batched=True, remove_columns=["text"])
val_ds   = val_ds.map(tokenize, batched=True, remove_columns=["text"])


In [None]:
# Setting up training arguments for the Trainer

training_args = TrainingArguments(
    output_dir="./qlora-out",
    num_train_epochs=3, # 3 epochs for sufficient fine-tuning under Colab constraints
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8, # effective batch size of 8 to avoid OOM
    per_device_eval_batch_size=1, # 1 for evaluation to manage memory
    eval_strategy="epoch",
    logging_steps=10, # Transparent logging every 10 steps
    save_strategy="epoch",
    learning_rate=2e-4, # Balanced learning rate for stable convergence
    bf16=False,
    fp16=True,
    report_to="wandb",
    run_name="qlora-275",
    load_best_model_at_end=True, # load best model after training
    save_total_limit=1
)


In [None]:
# Initializing the Trainer and starting the training process

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()

In [None]:
# generate_question_strict generates a single exam question from a given answer.
# The prompt closely follows the fine-tuning format and enforces strict output rules
# to steer the fine-tuned model toward one clean question.

def generate_question_strict(answer, model):
    prompt = f"""You are an academic exam setter.

TASK:
Given the answer below, write ONE exam question.

STRICT RULES:
- Output ONLY the question.
- Do NOT include options.
- Do NOT include answers.
- Do NOT include explanations.
- End with a question mark and STOP.

Answer:
{answer}

Question:
"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    output = model.generate(
        **inputs,
        max_new_tokens=40,           # HARD CAP
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )

    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    q = decoded.split("Question:")[-1].strip()

    # HARD POST-PROCESSING SAFETY
    q = q.split("?")[0].strip() + "?"

    return q


In [None]:
# Displaying random samples of generated questions from the validation set

for i in np.random.choice(len(val_df), 5, replace=False):
    ans = val_df.iloc[i]["answer"]
    print("="*80)
    print("ANSWER:\n", ans[:400])
    print("\nGENERATED QUESTION:\n", generate_question_strict(ans, model))


In [None]:
# Saving the fine-tuned model and tokenizer, and finishing the W&B run

model.save_pretrained("/content/qlora-strict-adapters")
tokenizer.save_pretrained("/content/qlora-strict-adapters")
wandb.finish()


In [None]:
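import shutil
from google.colab import files

# files.download works on single files, so the adapter directory saved above is zipped first
archive = shutil.make_archive("/content/qlora-strict-adapters", "zip", "/content/qlora-strict-adapters")
files.download(archive)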

## 4. Test-Time Question Generation

The fine-tuned model is used to generate questions for the test set answers.

For each answer:
- A single exam-style question is generated.
- The generation process is deterministic to ensure consistency.
- The output is constrained to produce exactly one question ending with a question mark.

For practical execution in limited compute environments, inference can be performed in batches. However, the logic shown here generalizes to the full test set.
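
The "exactly one question" constraint is enforced by a small post-processing step. The sketch below isolates that step; `first_question` is a hypothetical helper name, and the returned completeness flag is an addition for illustration that the notebook's own code does not compute.

```python
def first_question(generated):
    """Trim model output to the text up to and including the first '?'."""
    head, mark, _ = generated.strip().partition("?")
    head = " ".join(head.split())  # collapse stray newlines and repeated spaces
    # If no "?" was generated (e.g. the token cap was hit), one is appended anyway,
    # which mirrors the notebook's behaviour but may yield a truncated question.
    return head + "?", bool(mark)


q1, complete1 = first_question("What is abductive reasoning?\nA) ...\nB) ...")
q2, complete2 = first_question("Explain how inductive bias")
print(q1, complete1)
print(q2, complete2)
```

Tracking the flag would make it possible to route truncated generations to the repair stage, though whether that helps is untested here.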


In [None]:
!pip install -q transformers accelerate bitsandbytes peft sentence-transformers tqdm wandb


In [None]:
# Importing necessary libraries

import torch
import pandas as pd
import numpy as np
import re
from tqdm import tqdm

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from sentence_transformers import SentenceTransformer, util


In [None]:
! unzip /content/SOTA-3-FT.zip

In [None]:
# Importing and initializing Weights & Biases for experiment tracking.
# START and END set the slice of test rows handled in this run; adjust them
# depending on how many examples you want to process in a single session.

import wandb

START = 2825
END   = 3325   # adjust per session

wandb.init(
    project="sota-ai-task3",
    name=f"test-inference-batch-{START}-{END}",
    config={
        "base_model": "Llama-3.1-8B-Instruct",
        "adapters": "qlora-strict",
        "generation_mode": "single_question_strict",
        "start_idx": START,
        "end_idx": END,
        "batch_size": END - START
    }
)


In [None]:
# Loading the test dataset and displaying its head

test_df = pd.read_csv("/content/test.csv")
print("Test size:", len(test_df))
test_df.head(3)


In [None]:
# Creating a slice of the test dataset for batch inference

test_slice = test_df.iloc[START:END].reset_index(drop=True)
print(f"Processing rows [{START}:{END}) → {len(test_slice)} samples")


In [None]:
# Loading the fine-tuned model with adapters for inference. Also loading the tokenizer.

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
ADAPTER_PATH = "/content/SOTA-3-FT"  # upload & unzip here

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

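from transformers import BitsAndBytesConfig

# load_in_4bit as a direct keyword is deprecated; use BitsAndBytesConfig instead
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    torch_dtype=torch.float16
)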

model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
model.eval()

print("Model + adapters loaded.")


In [None]:
# Loading the sentence transformer for embedding generation.
# generate_question_strict wraps a test answer in the strict prompt format and
# generates a question with the fine-tuned model.

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def generate_question_strict(answer):
    prompt = f"""You are an academic exam setter.

TASK:
Given the answer below, write ONE exam question.

STRICT RULES:
- Output ONLY the question.
- Do NOT include options.
- Do NOT include answers.
- Do NOT include explanations.
- End with a question mark and STOP.

Answer:
{answer}

Question:
"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    output = model.generate(
        **inputs,
        max_new_tokens=40,      # HARD CAP
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )

    decoded = tokenizer.decode(output[0], skip_special_tokens=True)

    # Post-processing safety net
    q = decoded.split("Question:")[-1].strip()
    q = q.split("?")[0].strip() + "?"

    return q


In [None]:
# Defining the cycle_score function to compute semantic similarity between original and regenerated answers

def cycle_score(original_answer, generated_question):
    prompt = f"""Answer the following academic exam question.

Question:
{generated_question}

Answer:
"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )

    regen = tokenizer.decode(output[0], skip_special_tokens=True)
    regen = regen.split("Answer:")[-1].strip()

    emb1 = embedder.encode(original_answer, convert_to_tensor=True)
    emb2 = embedder.encode(regen, convert_to_tensor=True)

    return float(util.cos_sim(emb1, emb2))


In [None]:
# Main inference loop: generating questions, computing cycle scores, and logging results.

results = []

cycle_scores = []
question_lengths = []

for i, row in tqdm(test_slice.iterrows(), total=len(test_slice)):
    ans = row["ans"]
    qid = row["quesid"]

    question = generate_question_strict(ans)

    try:
        cs = cycle_score(ans, question) 
    except Exception:
        cs = None

    results.append({
        "quesid": qid,
        "question": question,
        "cycle_score": cs
    })

    # collect stats
    if cs is not None:
        cycle_scores.append(cs)
    question_lengths.append(len(question.split()))

    # log per-sample (lightweight)
    wandb.log({
        "question_length": len(question.split()),
        "cycle_score": cs if cs is not None else -1
    })

    # rich logging every 25 samples
    if i % 25 == 0:
        wandb.log({
            "sample_quesid": qid,
            "sample_question": question,
            "sample_cycle_score": cs
        })

        print("="*70)
        print("QID:", qid)
        print("QUESTION:", question)
        print("CYCLE SCORE:", cs)


In [None]:
wandb.log({
    "avg_cycle_score": np.mean(cycle_scores) if cycle_scores else None,
    "median_cycle_score": np.median(cycle_scores) if cycle_scores else None,
    "avg_question_length": np.mean(question_lengths),
    "num_samples": len(results)
})


In [None]:
# Creating a DataFrame from the results and saving it to a CSV file. Also saving the file to W&B for tracking.

out_df = pd.DataFrame(results)

OUT_PATH = f"/content/submission_part_{START}_{END}.csv"
out_df.to_csv(OUT_PATH, index=False)

wandb.save(OUT_PATH)
print("Saved:", OUT_PATH)

wandb.finish()


In [None]:
from google.colab import files

files.download(OUT_PATH)

## 5. Semantic Repair Using Cycle Consistency

After initial generation, questions are evaluated using a cycle-consistency metric:
- Question → regenerated answer
- Cosine similarity between the regenerated answer and the original answer

While most questions achieve high semantic alignment, a small fraction score poorly due to underspecification or incomplete generation.

For these low-scoring cases:
- A base instruction-tuned language model is used with a strong semantic prompt.
- A new question is generated and evaluated.
- The original question is replaced **only if** semantic similarity improves.

This selective regeneration strategy guarantees that no example's cycle-consistency score decreases, while leaving already high-scoring questions untouched.
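
The replace-only-if-better rule can be sketched independently of any model. Here `propose` and `score` are hypothetical stand-ins for the LLM question generator and the cycle-score computation, and the two rows are made-up examples:

```python
def selective_repair(rows, threshold, propose, score):
    """Return a repaired copy of rows; a question is replaced only if its score improves."""
    repaired = []
    for row in rows:
        best = dict(row)
        if row["cycle_score"] < threshold:
            new_q = propose(row["answer"])
            new_s = score(row["answer"], new_q)
            if new_s > row["cycle_score"]:
                best.update(question=new_q, cycle_score=new_s)
        repaired.append(best)
    return repaired


# Hypothetical stand-ins for question generation and cycle scoring
def propose(answer):
    return f"What does it mean that {answer}?"


def score(answer, question):
    return 0.9 if answer in question else 0.1


rows = [
    {"answer": "entropy measures uncertainty", "question": "Explain?", "cycle_score": 0.3},
    {"answer": "bias skews estimates", "question": "What is bias?", "cycle_score": 0.8},
]
repaired = selective_repair(rows, threshold=0.6, propose=propose, score=score)
for old, new in zip(rows, repaired):
    print(old["cycle_score"], "->", new["cycle_score"], "|", new["question"])
```

Note that the guarantee applies to the cycle score only; the competition's own similarity metric may not move in step with it.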


In [None]:
!pip install -q transformers accelerate bitsandbytes sentence-transformers wandb tqdm


In [None]:
# Importing necessary libraries

import torch
import pandas as pd
import numpy as np
from tqdm import tqdm

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer, util

import wandb


In [None]:
# Initializing Weights & Biases for experiment tracking

wandb.init(
    project="sota-ai-task3",
    name="base-model-semantic-repair",
    config={
        "model": "Llama-3.1-8B-Instruct (base)",
        "strategy": "single-shot-semantic-prompt",
        "replace_only_if_better": True
    }
)


In [None]:
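# Combining all submission CSV files into a single CSV

import glob
import re

def part_start(path):
    # Numeric START index from "submission_part_{START}_{END}.csv"
    return int(re.search(r"submission_part_(\d+)_", path).group(1))

# A plain sorted() is lexicographic ("1000" < "500"); sort by the numeric START instead
paths = sorted(glob.glob("/content/submission_part_*.csv"), key=part_start)
print("Found CSVs:", len(paths))

df = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

print("Total rows:", len(df))

FINAL_PATH = "/content/combined_all.csv"
df[["quesid", "question", "cycle_score"]].to_csv(FINAL_PATH, index=False)

df.head()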


In [None]:
# Loading the combined CSV and the test dataset to verify total rows

df = pd.read_csv(FINAL_PATH)
test_df = pd.read_csv("/content/test.csv")

print("Total rows:", len(df), "| Test rows:", len(test_df))


In [None]:
# Identifying rows with cycle_score below the target threshold for potential repair

TARGET_THRESHOLD = 0.6
repair_df = df[df["cycle_score"] < TARGET_THRESHOLD].copy()

print("Rows to repair:", len(repair_df))


In [None]:
# Loading the base Llama 3.1 8B Instruct model for regeneration of questions and answers in 4-bit quantized form for efficiency

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16
)

model.eval()


In [None]:
# Loading the sentence transformer for embedding generation and defining helpers
# to generate questions and compute cycle scores. The question-generation prompt
# asks the model to reconstruct the most likely question for a given answer.

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def generate_question_base(answer):
    prompt = f"""You are reconstructing an exam question from its answer.

Your goal is to infer the MOST LIKELY question that would have caused the answer below.

Requirements:
- The question must target the central claim or concept of the answer.
- The question must be specific enough that this answer directly responds to it.
- Avoid generic questions.
- Write exactly ONE question.
- End with a question mark.
- Do not include explanations, answers, or options.

Answer:
{answer}

Question:
"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    out = model.generate(
        **inputs,
        max_new_tokens=48,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )

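    # Decode only the newly generated tokens, so prompt text (including the answer) cannot interfere with parsing
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    q = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    # Keep the text up to the first question mark
    q = q.split("?")[0].strip() + "?"
    return q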


def cycle_score(answer, question):
    prompt = f"""Answer the following exam question.

Question:
{question}

Answer:
"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )
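    # Decode only the generated continuation; splitting on "Answer:" would truncate answers that contain that string
    regen = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()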

    e1 = embedder.encode(answer, convert_to_tensor=True)
    e2 = embedder.encode(regen, convert_to_tensor=True)
    return float(util.cos_sim(e1, e2))


In [None]:
# Main repair loop: regenerating questions and computing new cycle scores. If the new score is better, update the DataFrame and log the improvement.

fixed = 0

for idx, row in tqdm(repair_df.iterrows(), total=len(repair_df)):
    qid = row["quesid"]
    old_q = row["question"]
    old_score = row["cycle_score"]

    answer = test_df.loc[test_df["quesid"] == qid, "ans"].values[0]

    new_q = generate_question_base(answer)
    new_score = cycle_score(answer, new_q)

    wandb.log({
        "old_score": old_score,
        "new_score": new_score,
        "question_length": len(new_q.split())
    })

    print("="*80)
    print("OLD Q:", old_q)
    print("OLD SCORE:", old_score)
    print("NEW Q:", new_q)
    print("NEW SCORE:", new_score)

    if new_score > old_score: # only replace if better
        df.loc[df["quesid"] == qid, "question"] = new_q
        df.loc[df["quesid"] == qid, "cycle_score"] = new_score
        fixed += 1
        wandb.log({"repair_success": 1})


In [None]:
# Logging final statistics after repair and saving the final combined CSV file

wandb.log({
    "total_fixed": fixed,
    "mean_cycle_score_after": df["cycle_score"].mean(),
    "median_cycle_score_after": df["cycle_score"].median()
})

print("Fixed:", fixed)

FINAL_PATH = "/content/final_combined.csv"
df[["quesid", "question", "cycle_score"]].to_csv(FINAL_PATH, index=False)


In [None]:
from google.colab import files

files.download(FINAL_PATH)

wandb.finish()

## 6. Final Analysis and Submission Preparation

After completing generation and selective semantic repair, we obtain a final CSV containing the following columns:
- `quesid`: unique identifier for each example
- `question`: the reconstructed exam question
- `cycle_score`: a semantic alignment score used internally for quality control

### Qualitative Observations
- The majority of generated questions are concise, specific, and directly target the central claim of the corresponding answer.
- Questions with initially poor semantic alignment were selectively regenerated, leading to a strong overall distribution of semantic scores.
- The final average cycle score is approximately **0.80** (cosine similarity), indicating high semantic consistency between the original answers and the answers regenerated from the reconstructed questions.

It is important to note that `cycle_score` is **not part of the official evaluation**. It is used purely as a diagnostic and quality-assurance signal during development to:
- identify degenerate or underspecified questions,
- guide selective regeneration,
- ensure monotonic improvement without manual inspection.
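
To make the diagnostic concrete, here is a minimal, library-free sketch of the cosine similarity that underlies `cycle_score`. The three-dimensional vectors are hypothetical stand-ins for sentence embeddings and are chosen only for illustration; the notebook itself obtains embeddings from `SentenceTransformer` and compares them with `util.cos_sim`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_u * norm_v)

# Hypothetical low-dimensional vectors standing in for sentence embeddings.
original = [0.9, 0.1, 0.0]         # embedding of the original answer
faithful_regen = [0.8, 0.2, 0.1]   # answer regenerated from a well-targeted question
off_topic_regen = [0.0, 0.2, 0.9]  # answer regenerated from a generic question

print(round(cosine_similarity(original, faithful_regen), 3))
print(round(cosine_similarity(original, off_topic_regen), 3))
```

A question that leads the model back to an answer close to the original yields a score near 1, while a generic question that leads elsewhere yields a score near 0, which is why low scores are treated as candidates for regeneration.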

### Preparing the Final Submission
For the Kaggle submission, only the required columns are retained:
- `quesid`
- `question`

The `cycle_score` column is dropped prior to submission, as it is not expected by the evaluation system.

The resulting CSV contains exactly one well-formed question per test example and constitutes the final submission artifact.
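
The "exactly one well-formed question per test example" property can be checked mechanically before uploading. The sketch below is an assumption about what such a check might look like, not part of the original pipeline: `validate_submission` is a hypothetical helper that works on plain dictionaries, so it can be fed `submission_df.to_dict("records")` together with the `quesid` values from the test set.

```python
def validate_submission(rows, expected_ids):
    """Return a list of human-readable problems; an empty list means the rows pass."""
    problems = []
    seen = set()
    for row in rows:
        qid = row.get("quesid")
        question = row.get("question")
        # Missing values (for example NaN read from a CSV) are treated as empty
        question = question.strip() if isinstance(question, str) else ""
        if qid in seen:
            problems.append(f"duplicate quesid: {qid}")
        seen.add(qid)
        if not question:
            problems.append(f"empty question: {qid}")
        elif not question.endswith("?"):
            problems.append(f"missing question mark: {qid}")
    expected = set(expected_ids)
    for qid in sorted(expected - seen):
        problems.append(f"missing quesid: {qid}")
    for qid in sorted(seen - expected):
        problems.append(f"unexpected quesid: {qid}")
    return problems

# Hypothetical rows for illustration only
rows = [
    {"quesid": "test_0", "question": "What is the categorical imperative?"},
    {"quesid": "test_1", "question": "Explain utilitarianism"},
]
for problem in validate_submission(rows, ["test_0", "test_1", "test_2"]):
    print(problem)
```

An empty result means every expected `quesid` appears once with a non-empty question that ends in a question mark.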


In [None]:
# Loading final combined CSV and preparing submission file

import pandas as pd

final_df = pd.read_csv("/content/final_combined.csv")

print("Final dataset size:", len(final_df))
print("Average cycle score:", final_df["cycle_score"].mean())

submission_df = final_df[["quesid", "question"]]
submission_df.to_csv("submission.csv", index=False)

submission_df.head()


In [None]:
from google.colab import files

files.download('submission.csv')

In [None]:
# Preparing the final submission by sorting questions in numeric order based on quesid

import pandas as pd

# Load the submission CSV (quesid and question only; cycle_score was already dropped)
df = pd.read_csv("submission.csv")

# Extract numeric index from quesid (e.g., test_123 -> 123)
df["qid_num"] = df["quesid"].str.extract(r"(\d+)", expand=False).astype(int)

# Sort by numeric order
df = df.sort_values("qid_num").reset_index(drop=True)

# Drop helper column
df = df.drop(columns=["qid_num"])

# Sanity checks
assert df["quesid"].iloc[0] == "test_0"
assert df["quesid"].iloc[-1].startswith("test_")

# Prepare final submission (keep only the required columns)
submission_df = df[["quesid", "question"]]

# Save
submission_df.to_csv("final_submission.csv", index=False)

print("Final submission saved.")
print("Rows:", len(submission_df))
submission_df.head()


In [None]:
from google.colab import files

files.download('final_submission.csv')

## 7. Final Tail Repair for Low-Scoring Questions (≤ 0.67)

After initial generation and selective semantic repair, the majority of reconstructed questions achieved strong semantic alignment with their corresponding answers. However, a small tail of examples still exhibited relatively low cycle-consistency scores (≤ 0.67).

Rather than retraining models or applying broad changes, a targeted final repair phase was introduced to address only these low-confidence cases.

### Motivation
- Low cycle scores typically arise from underspecified or overly generic questions.
- These cases often require stronger intent inference rather than additional training.
- Since evaluation is semantic, improving the weakest tail can significantly increase robustness and overall performance.

### Approach
For each example with a cycle score ≤ 0.67:
- A **strong inverse-QA prompt** is used with the off-the-shelf instruction-tuned model (Llama-3.1-8B-Instruct, without task-specific fine-tuning).
- The prompt explicitly emphasizes reconstructing the *original* exam question that directly caused the answer.
- Exactly one new candidate question is generated.
- The new question is evaluated using the same cycle-consistency metric.
- The original question is replaced **only if** the new cycle score is strictly higher.

### Key Properties
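- The process is deterministic (greedy decoding with `do_sample=False`) and independent per example.
- High-quality questions are never overwritten, because only rows at or below the threshold are revisited.
- Improvements are monotonic by construction: a row's cycle score can only stay the same or increase.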
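
The approach and properties above can be sketched independently of any model. In the minimal example below, `repair_tail` is a hypothetical helper, and `generate_fn` and `score_fn` are placeholders for the notebook's question generator and cycle-score function; the stub scores are invented purely to exercise the replacement rule.

```python
def repair_tail(records, answers, generate_fn, score_fn, threshold=0.67):
    """Return (new_records, n_replaced) without mutating the input.

    records: list of dicts with keys quesid, question, cycle_score
    answers: dict mapping quesid -> answer text
    """
    repaired = []
    replaced = 0
    for rec in records:
        rec = dict(rec)  # copy so the caller's data stays untouched
        if rec["cycle_score"] <= threshold:
            answer = answers[rec["quesid"]]
            candidate = generate_fn(answer)          # exactly one candidate
            new_score = score_fn(answer, candidate)  # same metric as before
            if new_score > rec["cycle_score"]:       # strictly better only
                rec["question"] = candidate
                rec["cycle_score"] = new_score
                replaced += 1
        repaired.append(rec)
    return repaired, replaced

# Hypothetical data and stubs for illustration only
records = [
    {"quesid": "test_0", "question": "Q0?", "cycle_score": 0.90},
    {"quesid": "test_1", "question": "Q1?", "cycle_score": 0.50},
    {"quesid": "test_2", "question": "Q2?", "cycle_score": 0.60},
]
answers = {"test_0": "a0", "test_1": "a1", "test_2": "a2"}
stub_scores = {"a1": 0.80, "a2": 0.40}

repaired, n_replaced = repair_tail(
    records,
    answers,
    generate_fn=lambda answer: f"New question for {answer}?",
    score_fn=lambda answer, question: stub_scores[answer],
)
print(n_replaced, [r["cycle_score"] for r in repaired])
```

Because a replacement happens only when the new score is strictly higher, no row's score decreases, and rows above the threshold are never passed to the generator at all.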

This final tail-focused refinement substantially improves semantic alignment in difficult cases, leading to a cleaner distribution of scores and a stronger overall submission.


In [None]:
!pip install -q transformers accelerate bitsandbytes sentence-transformers wandb tqdm


In [None]:
# Importing necessary libraries

import torch
import pandas as pd
import numpy as np
from tqdm import tqdm
import re

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer, util

import wandb


In [None]:
# Initializing Weights & Biases for experiment tracking

wandb.init(
    project="sota-ai-task3",
    name="final-strong-tail-repair",
    config={
        "threshold": 0.67,
        "model": "Llama-3.1-8B-Instruct (base)",
        "strategy": "single-shot-strong-inverse-prompt",
        "replace_only_if_better": True
    }
)


In [None]:
# CSV with quesid, question, cycle_score
df = pd.read_csv("/content/final_combined.csv")

# Test answers (quesid, ans)
test_df = pd.read_csv("/content/test.csv")

print("Total rows:", len(df))
print("Mean cycle score (before):", df["cycle_score"].mean())


In [None]:
# Identifying rows with cycle_score below the stronger threshold for potential repair

THRESHOLD = 0.67
repair_df = df[df["cycle_score"] <= THRESHOLD].copy()

print("Rows to repair:", len(repair_df))


In [None]:
# Loading the base Llama 3.1 8B Instruct model in 4-bit quantized format for efficient inference. Also loading the tokenizer.

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16
)

model.eval()


In [None]:
# Loading the sentence transformer for embedding generation.

embedder = SentenceTransformer("all-MiniLM-L6-v2")


In [None]:
# Defining the generate_strong_question function to reconstruct the original exam question from a given answer using a strong inverse prompt. The focus is on inferring the exact question that elicited the provided answer by providing a strong prompt with clear instructions.

def generate_strong_question(answer):
    prompt = f"""You are reconstructing the ORIGINAL exam question that directly caused the answer below.

Important:
- This answer was written in response to ONE specific academic question.
- Your job is to infer that exact intent.
- The question must be specific enough that this answer is a direct and complete response.
- Generic or vague questions are incorrect.

Rules:
- Write exactly ONE question.
- Do NOT explain.
- Do NOT add options.
- Do NOT repeat phrases unnecessarily.
- End with a single question mark.

Think carefully about:
- the main claim being defended or explained
- any named philosopher, theory, or concept
- whether the answer is explaining a definition, a distinction, a criticism, or an implication

Answer:
{answer}

Original Question:
"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    out = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )

    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    q = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # Hard sanitation
    q = q.split("?")[0].strip() + "?"
    q = re.sub(r"\s+", " ", q)

    return q


In [None]:
# Defining the compute_cycle_score function to evaluate the semantic similarity between the original answer and the regenerated answer based on the newly generated question.

def compute_cycle_score(original_answer, question):
    prompt = f"""Answer the following academic exam question.

Question:
{question}

Answer:
"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )

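    # Decode only the generated continuation; splitting on "Answer:" would truncate answers that contain that string
    regen = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()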

    e1 = embedder.encode(original_answer, convert_to_tensor=True)
    e2 = embedder.encode(regen, convert_to_tensor=True)

    return float(util.cos_sim(e1, e2))


In [None]:
# Main repair loop: regenerating questions and computing new cycle scores. If the new score is better, update the DataFrame and log the improvement.

fixed = 0
skipped_short = 0

for idx, row in tqdm(repair_df.iterrows(), total=len(repair_df)):
    qid = row["quesid"]
    old_q = row["question"]
    old_score = row["cycle_score"]

    answer = test_df.loc[test_df["quesid"] == qid, "ans"].values[0]

    new_q = generate_strong_question(answer)

    # Sanity check: avoid garbage
    if len(new_q.split()) < 8:
        skipped_short += 1
        wandb.log({"skipped_short_question": 1})
        continue

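    try:
        new_score = compute_cycle_score(answer, new_q)
    except Exception as exc:
        # Keep the old question, but report the failure instead of hiding it
        print(f"Scoring failed for {qid}: {exc}")
        continue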

    # Logging
    wandb.log({
        "old_score": old_score,
        "new_score": new_score,
        "delta": new_score - old_score,
        "question_len": len(new_q.split())
    })

    print("=" * 90)
    print("QUESID:", qid)
    print("OLD Q:", old_q)
    print("OLD SCORE:", round(old_score, 4))
    print("NEW Q:", new_q)
    print("NEW SCORE:", round(new_score, 4))

    # Replace only if strictly better
    if new_score > old_score:
        df.loc[df["quesid"] == qid, "question"] = new_q
        df.loc[df["quesid"] == qid, "cycle_score"] = new_score
        fixed += 1
        wandb.log({"repair_success": 1})


In [None]:
# Final logging after repair and saving the updated CSV file

print("Repaired rows:", fixed)
print("Skipped (too short):", skipped_short)

wandb.log({
    "total_repaired": fixed,
    "mean_cycle_score_after": df["cycle_score"].mean(),
    "median_cycle_score_after": df["cycle_score"].median()
})

# Save updated CSV (keep cycle_score for audit)
df.to_csv("final_after_strong_repair.csv", index=False)
wandb.save("final_after_strong_repair.csv")


In [None]:
# Sanity checks on the final DataFrame and finishing the W&B run

assert df["question"].str.endswith("?").all()
assert df["quesid"].nunique() == len(df)

wandb.finish()

# Last expression in the cell, so the sample is actually displayed
df.sample(5)

In [None]:
from google.colab import files

files.download('final_after_strong_repair.csv')

In [None]:
# Preparing the final submission by sorting questions in numeric order based on quesid after strong repair

import pandas as pd
import re

# Load final CSV (with cycle_score still present)
df = pd.read_csv("/content/final_after_strong_repair.csv")

# Extract numeric index from quesid (e.g., test_123 -> 123)
df["qid_num"] = df["quesid"].str.extract(r"(\d+)", expand=False).astype(int)

# Sort by numeric order
df = df.sort_values("qid_num").reset_index(drop=True)

# Drop helper column
df = df.drop(columns=["qid_num"])

# Sanity checks
assert df["quesid"].iloc[0] == "test_0"
assert df["quesid"].iloc[-1].startswith("test_")

# Prepare final submission (drop cycle_score)
submission_df = df[["quesid", "question"]]

# Save
submission_df.to_csv("strong_repairs_submission.csv", index=False)

print("Final submission saved.")
print("Rows:", len(submission_df))
submission_df.head()


## Conclusion

This notebook presents a complete, principled solution to the inverse question–answering task in the SOTA-AI December Challenge. Starting from an unlabeled dataset of model-generated answers, the approach progressively builds structure through synthetic supervision, parameter-efficient fine-tuning, deterministic inference, and targeted semantic repair.

Key takeaways from this work include:
- Inverse QA is fundamentally a semantic reasoning problem rather than a surface-form generation task.
- Small, high-quality synthetic datasets can be more effective than large noisy ones when combined with careful filtering.
- Cycle consistency serves as a practical proxy for semantic alignment and enables safe, monotonic post-processing.
- Selective tail-focused refinement can substantially improve robustness without risking regression.

All stages of the pipeline are designed to be modular, interpretable, and reproducible. While practical execution may involve batch-wise processing under constrained resources, the methodology itself is conceptually complete and generalizes to the full dataset.

Thank you to the SOTA-AI community for this challenge.