# Research Environment Recreation

**Name:** Peter Yeshua J. Sotomango  
**Title:** Research Environment Recreation

**Description:** In this notebook, I attempt to recreate the testing environment used in the paper "GENERATING IS BELIEVING: MEMBERSHIP INFERENCE ATTACKS AGAINST RETRIEVAL-AUGMENTED GENERATION" (Yuying Li et al.). The goal is to reproduce the key experiments and analyses from the paper so results can be validated, extended, and adapted to related settings.

This notebook focuses on:
- Reconstructing the RAG (retrieval-augmented generation) pipeline used in the study: the retriever, the external datastore/corpus, and the generative model.
- Implementing membership inference attacks described in the paper (attack models, attack features, and attack workflows).
- Running controlled experiments to measure attack effectiveness using standard metrics.
- Performing ablations and sensitivity analyses for variables such as retrieval corpus size, prompt templates, model temperature, and attacker access assumptions.

Contents and reproducibility notes:
- Datasets and corpus preparation steps (how documents are indexed and which splits are used for member vs. non-member examples).
- Model and retriever configuration (model checkpoints, retriever type and hyperparameters, prompt templates).
- Attack model training and evaluation (feature extraction, training/validation splits, evaluation metrics).
- Experimental controls (random seeds, hardware notes, library versions, and instructions to reproduce results locally or on cloud instances).

Notebook organization:
1. Environment setup and dependency listing
2. Data preparation and indexing
3. RAG pipeline implementation
4. Attack implementation and training
5. Evaluation, ablations, and visualization
6. Conclusions and replication checklist

Where possible, I include reproducible code, configuration files, and explicit random seeds to make experiments deterministic and easy to re-run.

# Stage 1: Load Data and Setup Environment

In this stage, we load a simplified QA corpus using the Hugging Face `nq_open` dataset (for cleaner structure and straightforward processing).

We split the data 80/20:
- 80% becomes the knowledge base used by the retriever in the RAG pipeline (member set).
- 20% is held out as non-member examples for evaluation and membership inference attack testing.

In [None]:
%pip install datasets sentence-transformers faiss-cpu accelerate transformers bitsandbytes sacrebleu

In [None]:
import pandas as pd
from datasets import load_dataset, Dataset
from sklearn.model_selection import train_test_split
import random
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
import torch
import sacrebleu
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")

The `prepare_environment()` function below implements the data loading and splitting strategy described in the research paper:

1. **Load Dataset**: Loads the Natural Questions Open (nq_open) dataset from HuggingFace.

2. **Format Target Samples**: For each row, constructs the target sample `x_t` following the paper's specification:
    - `x_t = x_t^q ⊕ x_t^r` (Question ⊕ Answer)
    - Format: `"{question}? {answer}"` (skipping empty answers)
    - Stores three components:
      - `query_text`: The question portion (used as retrieval query)
      - `remaining_text`: The answer portion (used for generation evaluation)
      - `full_sample`: Complete formatted text (stored in RAG external database)

3. **Membership Split (80/20)**: Divides data into member (80%) and non-member (20%) pools.

4. **Reference Dataset (Dr)**: Samples 1,000 members + 1,000 non-members for attack model training and threshold tuning.

5. **Evaluation Dataset**: Samples an additional 1,000 members + 1,000 non-members for final attack performance testing.

6. **RAG External Database**: Constructs the knowledge base from all member samples' `full_sample` text.

Returns: `rag_external_database`, `reference_dataset`, `evaluation_dataset`
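
The sample format and the split logic can be checked on a toy corpus before loading the full dataset. The sketch below is a minimal, standard-library-only illustration. The helpers `format_sample` and `split_pools` and the toy rows are hypothetical names that do not exist in this notebook, and the sketch mirrors rather than replaces `prepare_environment()`.

```python
import random

def format_sample(question, answers):
    """Build x_t = x_t^q + x_t^r; return None when the row has no answer."""
    if not answers:
        return None
    answer = answers[0]
    return {
        "query_text": question,
        "remaining_text": answer,
        "full_sample": f"{question}? {answer}",
    }

def split_pools(samples, train_frac=0.8, seed=42):
    """Shuffle once with a fixed seed, then cut into member / non-member pools."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy rows: nine answered questions and one without an answer.
toy_rows = [(f"what is item {i}", [f"answer {i}"]) for i in range(9)]
toy_rows.append(("question with no answer", []))

samples = [s for s in (format_sample(q, a) for q, a in toy_rows) if s is not None]
members, non_members = split_pools(samples)

# Only member samples go into the knowledge base.
knowledge_base = [s["full_sample"] for s in members]
print(len(samples), len(members), len(non_members))
```

The property worth checking is that no non-member text ever reaches the knowledge base. If it did, the non-member label would no longer mean "absent from the RAG database".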

In [None]:
def prepare_environment():
    print("Loading Natural Questions (nq_open) dataset...")
    dataset = load_dataset("nq_open", split="train")

    print("Pre-processing samples...")
    processed_data = []
    for row in dataset:
        question = row['question']
        # Skip rows with no answer so every x_t has a remaining-text part
        if len(row['answer']) == 0:
            continue
        answer = row['answer'][0]

        full_text = f"{question}? {answer}"

        processed_data.append({
            "query_text": question,
            "remaining_text": answer,
            "full_sample": full_text
        })

    df = pd.DataFrame(processed_data)
    print(f"Total samples loaded: {len(df)}")

    print("\nPerforming 80/20 Membership Split...")
    members_pool, non_members_pool = train_test_split(
        df,
        train_size=0.8,
        random_state=SEED,
        shuffle=True
    )

    ref_members = members_pool.sample(n=1000, random_state=SEED)
    ref_non_members = non_members_pool.sample(n=1000, random_state=SEED)

    reference_dataset = pd.concat([
        ref_members.assign(label="member"),
        ref_non_members.assign(label="non_member")
    ])

    remaining_members = members_pool.drop(ref_members.index)
    remaining_non_members = non_members_pool.drop(ref_non_members.index)

    eval_members = remaining_members.sample(n=1000, random_state=SEED)
    eval_non_members = remaining_non_members.sample(n=1000, random_state=SEED)

    evaluation_dataset = pd.concat([
        eval_members.assign(label="member"),
        eval_non_members.assign(label="non_member")
    ])

    rag_external_database = members_pool['full_sample'].tolist()

    print(f"\nEnvironment Statistics:")
    print(f"   - RAG Knowledge Base Size: {len(rag_external_database)} documents")
    print(f"   - Reference Dataset (Dr):  {len(reference_dataset)} samples (Used for training Attack Model)")
    print(f"   - Evaluation Dataset:      {len(evaluation_dataset)} samples (Used for testing Attack)")

    return rag_external_database, reference_dataset, evaluation_dataset

In [None]:
rag_db, ref_set, eval_set = prepare_environment()

print("\nSample Data (Reference Set):")
print(ref_set[['query_text', 'remaining_text', 'label']].head(3))

- This stage configures the retrieval component of the RAG pipeline:
    - Primary retriever: Contriever fine-tuned on MS MARCO (`facebook/contriever-msmarco`).
    - Fallback retriever: `nthakur/contriever-base-msmarco` (used if the primary checkpoint fails to load).
- All member documents in the RAG knowledge base (`rag_db`) are encoded into dense embeddings using the selected retriever.
- A FAISS inner-product index (IndexFlatIP) is built over these embeddings. IndexFlatIP performs exact (brute-force) search rather than approximate nearest-neighbor search.
- Retrieval workflow:
    - Input: a query string Q (e.g., `ref_set['query_text']`).
    - Encode Q to an embedding with the same model.
    - Search FAISS to get top-k document indices.
    - Map indices back to text docs from `rag_db`.
- Similarity scoring:
    - S = Sim(Q, d_i), where Sim is dot-product similarity in the shared embedding space.
    - Higher S indicates closer semantic relevance between the query and a candidate document d_i.
- End-to-end retrieval operator:
    - Ds = R(Q, D, S), where:
        - Q: query embedding,
        - D: corpus embeddings (built from `rag_db`),
        - S: similarity function (inner product).
    - R returns the ranked set of contexts used for generation and downstream attack feature extraction.
- Practical notes:
    - Batch encoding is used for efficiency.
    - Index statistics (dimension, ntotal) verify successful construction.
    - A quick smoke test retrieves the top-5 contexts to validate the pipeline before experimentation.
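
The retrieval operator described above can be sketched without FAISS. The NumPy version below is an illustration only: `top_k_inner_product`, the 2-D vectors, and the document strings are hypothetical, and it computes the same exact inner-product ranking that a flat inner-product index returns, just without the optimized search.

```python
import numpy as np

def top_k_inner_product(query_vec, corpus_matrix, k=5):
    """Exact top-k by inner product; returns (indices, scores), best first."""
    scores = corpus_matrix @ query_vec        # S_i = <Q, d_i> for every document
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]           # highest score first
    return order, scores[order]

# Hypothetical 2-D "embeddings" for three documents.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], dtype=np.float32)
docs = ["doc about A", "doc about B", "doc about A and B"]
query = np.array([1.0, 0.2], dtype=np.float32)

idx, scores = top_k_inner_product(query, corpus, k=2)
retrieved = [docs[i] for i in idx]
print(retrieved)
```

Rescaling the query alone multiplies every score by the same positive factor, so it leaves this ranking unchanged. Normalizing both the query and the corpus would turn the score into cosine similarity.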

In [None]:
# Load Model
print("Loading Retriever model (facebook/contriever-msmarco)...")
try:
    retriever_model = SentenceTransformer('facebook/contriever-msmarco', device=device)
except Exception:
    print("Contriever load failed, falling back to 'nthakur/contriever-base-msmarco' (Functionally similar for testing)")
    retriever_model = SentenceTransformer('nthakur/contriever-base-msmarco', device=device)


print(f"Encoding {len(rag_db)} documents... (This may take a moment on CPU)")

batch_size = 32
corpus_embeddings = retriever_model.encode(
    rag_db,
    batch_size=batch_size,
    show_progress_bar=True,
    convert_to_numpy=True
)

In [None]:
# Build FAISS Index
print("Building FAISS Index...")
dimension = corpus_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(corpus_embeddings)

print(f"Indexed {index.ntotal} documents.")

In [None]:
def retrieve_context(query, k=5):
    """
    Retrieves the top-k most similar documents for a given query.
    Per paper, k=5.
    """

    # Raw inner product: the corpus embeddings in the index are not normalized,
    # so the query is left un-normalized as well
    query_embedding = retriever_model.encode([query], convert_to_numpy=True)

    _, indices = index.search(query_embedding, k)

    results = []
    for idx in indices[0]:
        results.append(rag_db[idx])

    return results

# Test the Retriever
test_query = ref_set.iloc[0]['query_text']
retrieved_docs = retrieve_context(test_query, k=5)

print("\nRetrieval Test:")
print(f"Query: {test_query}")
print(f"Top Retrieved Context: {retrieved_docs[0]}")

# Stage 2: Setup Large Language Models and Evaluation Metrics

This stage loads the Llama-2-7b chat model (or whichever checkpoint `MODEL_ID` selects), generates responses conditioned on retrieved contexts, and computes the membership-attack features used downstream (BLEU and perplexity). Steps performed by the subsequent code cells:

- Model loading and quantization
    - Configure 4-bit quantization via BitsAndBytesConfig (nf4, compute dtype=float16).
    - Load tokenizer and set pad_token = eos_token to avoid generation issues.
    - Load the Llama-2-7b causal model with device_map="auto" and trust_remote_code=True. A try/except around loading reports Hugging Face authentication or license errors.

- Prompt construction
    - format_prompt(query, retrieved_contexts) flattens top-k retrieved contexts into a Context string and injects an instruction template that asks for a concise response (no extra explanation).

- Response generation
    - run_attack_on_sample(sample_row) calls retrieve_context(query, k=5) (FAISS + SentenceTransformer index built earlier), builds the prompt, tokenizes to the model device, and generates a deterministic response (do_sample=False, max_new_tokens=50).

- Metric computation
    - calculate_metrics computes semantic similarity (S_sem) via sacrebleu.sentence_bleu (BLEU score) between generated text and the target answer.
    - Perplexity (PPL_gen) is computed by tokenizing the generated text, running the model to obtain loss, and returning exp(loss).

- End-to-end test
    - A smoke test runs S²MIA on a sample from ref_set: retrieval → prompt → generation → BLEU & PPL calculation → prints query, target, generated text, features, and true label.

Notes:
- Retrieval uses the previously built SentenceTransformer embedding model and FAISS IndexFlatIP over rag_db.
- Higher BLEU and lower PPL are both interpreted as signals that a sample is more likely a member.
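
Perplexity is the exponential of the mean negative log-likelihood per token, which is what `exp(loss)` yields when the loss is the model's average cross-entropy. The sketch below uses hypothetical per-token probabilities instead of a real model, and returning infinity for an empty sequence is an assumption made for this illustration.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability over the tokens."""
    if not token_probs:
        return float("inf")   # assumption: an empty sequence has nothing to score
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

confident = perplexity([0.5, 0.5, 0.5])   # every token predicted with probability 0.5
uncertain = perplexity([0.1, 0.1, 0.1])   # every token predicted with probability 0.1
print(confident, uncertain)
```

A model that assigns probability 0.5 to each token has a perplexity of about 2, and one that assigns 0.1 has a perplexity of about 10. This is why lower values are read as the model being more certain of the text.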

In [None]:
import os

os.environ["HF_TOKEN"] = "[REDACTED]"

print(f"Loading {MODEL_ID} in 4-bit...")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
except OSError:
    print("❌ ERROR: You likely need to log in to Hugging Face.")
    print("   Run: from huggingface_hub import login; login()")
    print("   And ensure you accepted the Meta Llama 2 license.")
    raise

print(f"✅ {MODEL_ID} loaded successfully in 4-bit.")

In [None]:
def format_prompt(query, retrieved_contexts):
  # Join retrieved contexts
  context_str = "\n".join([f"- {ctx}" for ctx in retrieved_contexts])

  prompt = f"""Context:
{context_str}

Question: {query}

Provide a concise answer directly addressing the question using the most relevant information from the context above. Do not include explanatory text.

Answer:
"""

  return prompt

In [None]:
def calculate_metrics(target_text, generated_text, model, tokenizer, query_text):
    """
    Computes S_sem (BLEU) and PPL_gen (Perplexity).
    """

    bleu = sacrebleu.sentence_bleu(generated_text, [target_text])
    s_sem = bleu.score

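    # PPL_gen is the perplexity of the generated text; an empty generation
    # has no tokens to score, so treat it as maximally uncertain
    if not generated_text:
        return s_sem, float("inf")

    encodings = tokenizer(
        generated_text,
        return_tensors="pt",
        truncation=True,
        max_length=512
    ).to(model.device)

    with torch.no_grad():
        # With labels supplied, the model returns the mean token cross-entropy
        outputs = model(**encodings, labels=encodings.input_ids)
        loss = outputs.loss
        ppl_gen = torch.exp(loss).item()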

    return s_sem, ppl_gen

In [None]:
def run_attack_on_sample(sample_row):
    query = sample_row['query_text']
    target_answer = sample_row['remaining_text']

    retrieved_docs = retrieve_context(query, k=5)
    prompt = format_prompt(query, retrieved_docs)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Greedy decoding keeps the attack features reproducible across runs
    outputs = model.generate(
      **inputs,
      max_new_tokens=50,
      do_sample=False,
      pad_token_id=tokenizer.eos_token_id
    )

    generated_text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    generated_text = generated_text.strip()

    s_sem, ppl_gen = calculate_metrics(target_answer, generated_text, model, tokenizer, query)

    return {
        "query": query,
        "target_answer": target_answer,
        "generated_text": generated_text,
        "S_sem": s_sem,
        "PPL_gen": ppl_gen,
        "True_Label": sample_row['label']
    }

In [None]:
# --- Run Test ---
print("\nRunning S²MIA on the first reference sample...")
test_sample = ref_set.iloc[0]
result = run_attack_on_sample(test_sample)

print(f"\nRESULTS:")
print(f"Query:      {result['query']}")
print(f"Real Ans:   {result['target_answer']}")
print(f"Gen Ans:    {result['generated_text']}")
print(f"------------------------------------------------")
print(f"Feature 1 (BLEU):       {result['S_sem']:.4f}  (Higher = Likely Member)")
print(f"Feature 2 (Perplexity): {result['PPL_gen']:.4f} (Lower = Likely Member)")
print(f"True Label:             {result['True_Label']}")

# Stage 3: Testing and Evaluation
This cell executes the S²MIA pipeline over the Reference Set to produce per-sample attack features used for training/tuning the attack. It systematically runs retrieval → prompt construction → model generation → metric extraction for each sample and stores the resulting features and labels for later threshold optimization and evaluation.

What this cell does (step-by-step)
- Selects the dataset to process (reference or evaluation split) and optionally limits the number of samples via DEMO_LIMIT.
- Iterates over samples (uses tqdm for progress reporting).
- For each sample:
    - Retrieves top-k contexts from the FAISS + SentenceTransformer retriever.
    - Builds the prompt with retrieved contexts (format_prompt).
    - Generates a deterministic response from the LLM (run_attack_on_sample uses do_sample=False).
    - Computes two attack features:
        - S_sem: semantic similarity approximated by BLEU (sacrebleu).
        - PPL_gen: generation perplexity from the model.
    - Records the numeric features and the ground-truth label (member=1, non-member=0).
- Collects results into a DataFrame (ref_scores_df or eval_scores_df) for downstream threshold optimization.

Inputs & configurable parameters
- DEMO_LIMIT: integer or None. If set, only the first DEMO_LIMIT rows of the dataset are processed (useful for quick tests).
- BATCH_SIZE: reserved for batched processing. The current loop handles one sample at a time and does not use it.
- SEED (global): ensures reproducible dataset splits and sampling.

Outputs
- ref_scores_df: DataFrame with columns ["s_sem", "ppl_gen", "label"] for the Reference Set (used for threshold optimization / attack training).
- eval_scores_df: same format for the Evaluation Set (used for final testing).

Interpretation of features
- S_sem (BLEU) — higher values suggest the model reproduced the target better; typically correlated with membership (higher ⇒ more likely member).
- PPL_gen — lower perplexity indicates the model finds the generated text more probable under its distribution; lower ⇒ more likely member.
- The greedy threshold search (next cell) finds BLEU (>=) and PPL (<=) cutoffs that maximize F1 on the reference set.
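
The decision rule and the threshold search can be illustrated on a handful of made-up scores. Everything in the sketch below is hypothetical: the `(s_sem, ppl_gen, label)` triples are not measured results, and `predict_member` and `f1` are illustrative helpers rather than functions from this notebook. The search simply tries every observed value as a cutoff and keeps the first pair with the highest F1.

```python
def predict_member(s_sem, ppl_gen, theta_sem, theta_gen):
    """Member (1) only if BLEU is high enough AND perplexity is low enough."""
    return int(s_sem >= theta_sem and ppl_gen <= theta_gen)

def f1(labels, preds):
    """F1 for the positive (member) class; 0.0 when undefined."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical (s_sem, ppl_gen, label) triples; label 1 = member.
toy = [(40.0, 5.0, 1), (35.0, 8.0, 1), (30.0, 20.0, 1),
       (10.0, 6.0, 0), (5.0, 25.0, 0), (32.0, 30.0, 0)]
labels = [y for _, _, y in toy]

best = (0.0, None, None)  # (F1, theta_sem, theta_gen)
for theta_sem in sorted({s for s, _, _ in toy}):
    for theta_gen in sorted({p for _, p, _ in toy}):
        preds = [predict_member(s, p, theta_sem, theta_gen) for s, p, _ in toy]
        score = f1(labels, preds)
        if score > best[0]:
            best = (score, theta_sem, theta_gen)

print(best)
```

Neither feature separates this toy set on its own. A BLEU cutoff alone admits the non-member at `(32.0, 30.0)`, and a perplexity cutoff alone admits the one at `(10.0, 6.0)`, which is why the rule combines both conditions.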

Practical notes & performance
- Expect generation and perplexity computation to be the slowest parts; running on GPU significantly reduces runtime. If you are on CPU, reduce DEMO_LIMIT or use a smaller model for development.
- Exceptions raised while processing a sample (for example, out-of-memory errors) are caught per sample: the sample is skipped and the error is printed.
- For efficiency, retrieval uses batch encoding for the corpus (built earlier) and FAISS for fast nearest-neighbor lookups.
- If running large-scale experiments, consider:
    - Increasing the `batch_size` passed to `retriever_model.encode` if you have memory headroom.
    - Caching generated outputs or intermediate encodings.
    - Running generation in mixed precision or using a smaller/quantized model.

Reproducibility & debugging
- Keep SEED consistent across runs to reproduce splits/sampling.
- If you see Hugging Face auth errors, confirm HF_TOKEN/login and model license acceptance.
- Inspect logged warnings for per-sample exceptions to identify problematic inputs (very long context, empty answers, tokenization issues).

Next steps after this cell
- Use the produced ref_scores_df in the threshold optimization cell that follows to search for BLEU/PPL cutoffs.
- Evaluate the learned thresholds on eval_scores_df to estimate attack accuracy, precision, recall, and F1.
- Optional: visualize distributions (histograms, KDE) of s_sem and ppl_gen by label, plot ROC/PR curves, and conduct ablations (vary k, prompt templates, model temperature).

In [None]:
# --- Configuration ---
DEMO_LIMIT = None
BATCH_SIZE = 8

def generate_attack_dataset(dataset, limit=None):
    results = []

    data_to_process = dataset.head(limit) if limit else dataset

    print(f"Starting processing of {len(data_to_process)} samples...")

    for index, row in tqdm(data_to_process.iterrows(), total=len(data_to_process)):
        try:
            attack_result = run_attack_on_sample(row)

            results.append({
                "s_sem": attack_result['S_sem'],
                "ppl_gen": attack_result['PPL_gen'],
                "label": 1 if row['label'] == 'member' else 0
            })
        except Exception as e:
            print(f"⚠️ Error on sample {index}: {e}")
            continue

    return pd.DataFrame(results)


print("Generating Reference Scores (Training Data)...")
ref_scores_df = generate_attack_dataset(ref_set, limit=DEMO_LIMIT)

print("\nReference Scores Generated:")
print(ref_scores_df.head())

print("\nGenerating Evaluation Scores (Testing Data)...")
eval_scores_df = generate_attack_dataset(eval_set, limit=DEMO_LIMIT)

print("\nEvaluation Scores Generated:")
print(eval_scores_df.head())

In [None]:
def optimize_thresholds(reference_df):
    """
    Implements the Greedy Search from Section 3.2.
    Finds theta_sem and theta_gen that maximize F1 Score on the Reference Set.
    """
    print(f"Optimizing Thresholds on {len(reference_df)} reference samples...")

    bleu_candidates = np.unique(np.percentile(reference_df['s_sem'], np.arange(0, 100, 2)))
    ppl_candidates = np.unique(np.percentile(reference_df['ppl_gen'], np.arange(0, 100, 2)))

    best_score = 0
    best_thresholds = {'bleu': 0, 'ppl': 0}

    for t_bleu in bleu_candidates:
        for t_ppl in ppl_candidates:
            predictions = (
                (reference_df['s_sem'] >= t_bleu) &
                (reference_df['ppl_gen'] <= t_ppl)
            ).astype(int)


            score = f1_score(reference_df['label'], predictions, zero_division=0)

            if score > best_score:
                best_score = score
                best_thresholds = {'bleu': t_bleu, 'ppl': t_ppl}

    print(f"Optimization Complete.")
    print(f"   Best Training F1: {best_score:.4f}")
    print(f"   Optimal BLEU Threshold (>=): {best_thresholds['bleu']:.4f}")
    print(f"   Optimal PPL Threshold (<=):  {best_thresholds['ppl']:.4f}")

    return best_thresholds

def evaluate_attack(eval_df, thresholds):
    """
    Applies the learned thresholds to the Evaluation Set (Test Data).
    """
    print(f"\nExecuting Attack on {len(eval_df)} evaluation samples...")

    predictions = (
        (eval_df['s_sem'] >= thresholds['bleu']) &
        (eval_df['ppl_gen'] <= thresholds['ppl'])
    ).astype(int)

    acc = accuracy_score(eval_df['label'], predictions)
    prec = precision_score(eval_df['label'], predictions, zero_division=0)
    rec = recall_score(eval_df['label'], predictions, zero_division=0)
    f1 = f1_score(eval_df['label'], predictions, zero_division=0)

    print(f"Final S²MIA-T Attack Results:")
    print(f"   ---------------------------")
    print(f"   Accuracy:  {acc:.4f}  (Base rate is 0.50)")
    print(f"   Precision: {prec:.4f}")
    print(f"   Recall:    {rec:.4f}")
    print(f"   F1 Score:  {f1:.4f}")

    return predictions

if len(ref_scores_df) > 0:
    optimal_thetas = optimize_thresholds(ref_scores_df)

    if len(eval_scores_df) > 0:
        final_preds = evaluate_attack(eval_scores_df, optimal_thetas)
    else:
        print("⚠️ Evaluation set is empty. Run the previous step with a higher limit.")
else:
    print("⚠️ Reference set is empty. Run the previous step to generate scores first.")

# Stage 4: Visualizations and Verification

The cells below manually verify the model, the attack, and the RAG pipeline. They check how relevant the retrieved data is and whether that relevance differs between members and non-members of the database.

We first plot the distributions of the BLEU and perplexity scores and print their means for each group. We then run the attack on the first 5 members and the first 5 non-members to check by hand whether the RAG pipeline generates the correct answers.

In [None]:
# Separate members and non-members
members = ref_scores_df[ref_scores_df['label'] == 1]
non_members = ref_scores_df[ref_scores_df['label'] == 0]

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# BLEU distribution
axes[0].hist(members['s_sem'], bins=50, alpha=0.5, label='Members', color='blue')
axes[0].hist(non_members['s_sem'], bins=50, alpha=0.5, label='Non-Members', color='red')
axes[0].set_xlabel('BLEU Score (S_sem)')
axes[0].set_ylabel('Count')
axes[0].set_title('BLEU Score Distribution')
axes[0].legend()

# Perplexity distribution
axes[1].hist(members['ppl_gen'], bins=50, alpha=0.5, label='Members', color='blue')
axes[1].hist(non_members['ppl_gen'], bins=50, alpha=0.5, label='Non-Members', color='red')
axes[1].set_xlabel('Perplexity (PPL_gen)')
axes[1].set_ylabel('Count')
axes[1].set_title('Perplexity Distribution')
axes[1].legend()

plt.tight_layout()
plt.savefig('feature_distributions.png')
plt.show()

# Print statistics
print("\n=== FEATURE STATISTICS ===")
print(f"Members BLEU - Mean: {members['s_sem'].mean():.4f}, Std: {members['s_sem'].std():.4f}")
print(f"Non-Members BLEU - Mean: {non_members['s_sem'].mean():.4f}, Std: {non_members['s_sem'].std():.4f}")
print(f"Members PPL - Mean: {members['ppl_gen'].mean():.4f}, Std: {members['ppl_gen'].std():.4f}")
print(f"Non-Members PPL - Mean: {non_members['ppl_gen'].mean():.4f}, Std: {non_members['ppl_gen'].std():.4f}")

In [None]:
# Check 5 member samples
print("\n=== MEMBER SAMPLES ===")
for i in range(5):
    sample = ref_set[ref_set['label'] == 'member'].iloc[i]
    result = run_attack_on_sample(sample)
    print(f"\nSample {i+1}:")
    print(f"  Query: {result['query']}")
    print(f"  Target: {result['target_answer']}")
    print(f"  Generated: {result['generated_text']}")
    print(f"  BLEU: {result['S_sem']:.4f}")
    print(f"  PPL: {result['PPL_gen']:.4f}")

# Check 5 non-member samples
print("\n=== NON-MEMBER SAMPLES ===")
for i in range(5):
    sample = ref_set[ref_set['label'] == 'non_member'].iloc[i]
    result = run_attack_on_sample(sample)
    print(f"\nSample {i+1}:")
    print(f"  Query: {result['query']}")
    print(f"  Target: {result['target_answer']}")
    print(f"  Generated: {result['generated_text']}")
    print(f"  BLEU: {result['S_sem']:.4f}")
    print(f"  PPL: {result['PPL_gen']:.4f}")

In [None]:
# Pick a member sample
member_sample = ref_set[ref_set['label'] == 'member'].iloc[0]
query = member_sample['query_text']
full_sample = member_sample['full_sample']

# Retrieve
retrieved = retrieve_context(query, k=5)

# Check if the full sample is in retrieved documents
print(f"Query: {query}")
print(f"Full Sample: {full_sample}")
print(f"\nRetrieved Documents:")
for i, doc in enumerate(retrieved):
    print(f"{i+1}. {doc}")
    if doc == full_sample:
        print("   ✅ EXACT MATCH FOUND")

In [None]:
print("Member samples - BLEU mean:", ref_scores_df[ref_scores_df['label']==1]['s_sem'].mean())
print("Non-member samples - BLEU mean:", ref_scores_df[ref_scores_df['label']==0]['s_sem'].mean())
print("Member samples - PPL mean:", ref_scores_df[ref_scores_df['label']==1]['ppl_gen'].mean())
print("Non-member samples - PPL mean:", ref_scores_df[ref_scores_df['label']==0]['ppl_gen'].mean())