# RGB - Local Evaluation with Ollama

## Key Steps
1. Load the dataset (JSON lines) from `data/<DATASET>.json`.
2. Shuffle or sample documents according to `noise_rate`.
3. Compose a prompt and call a local model with `ollama run <model_name>`.
4. Evaluate correctness or rejection.

## Requirements
- [ollama](https://github.com/jmorganca/ollama) installed on your system
- A local model pulled (e.g., `ollama pull llama3.2:3b`)

If your model returns empty strings, set `DEBUG_CALL = True` in Cell 6 (it is passed down as `debug=True` to `local_ollama_generate`) and check the printed `stderr` for error messages.

In [1]:
# Cell 1: Imports and Basic Setup
import json
import random
import math
import os
import subprocess
import numpy as np
from typing import List, Union
from tqdm import tqdm  # a progress bar

## Cell 2: Load the Dataset
We'll define a helper function to load a JSON-lines file (like `en.json`) into a list of dicts.
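
As a rough illustration of the JSON-lines layout the loader expects, the sketch below parses two hand-written records from an in-memory string. The records and the helper name `parse_json_lines` are invented for illustration; only the field names (`id`, `query`, `answer`, `positive`, `negative`) are taken from the code later in this notebook.

```python
import io
import json

# Two made-up records in JSON-lines form: one JSON object per line.
# Field names mirror those used later in this notebook; the values are hypothetical.
sample_text = (
    '{"id": 0, "query": "Who wrote Hamlet?", "answer": "Shakespeare", '
    '"positive": ["Hamlet is a play by Shakespeare."], "negative": ["Macbeth is set in Scotland."]}\n'
    '\n'
    '{"id": 1, "query": "What is the capital of France?", "answer": ["Paris"], '
    '"positive": ["Paris is the capital of France."], "negative": ["Lyon is a city in France."]}\n'
)

def parse_json_lines(stream):
    """Parse one JSON object per non-empty line, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]

sample_instances = parse_json_lines(io.StringIO(sample_text))
print(len(sample_instances), sample_instances[1]["answer"])
```

The blank line in the middle of `sample_text` is skipped, which matches the `if not line: continue` guard in `load_dataset`.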

In [2]:
# Cell 2: Function to load the dataset
def load_dataset(dataset_path: str) -> List[dict]:
    """Load each line of JSON from the dataset file into a list of dicts."""
    instances = []
    with open(dataset_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            instances.append(json.loads(line))
    return instances

print('Function load_dataset defined.')

# Given a file path, this function reads the file line by line, treats each line
# as a JSON object, and accumulates the resulting dictionaries in a list.
# :param dataset_path: Path to the dataset file (e.g. "data/en.json").
# :return: A list of dictionaries, one per JSON line.

Function load_dataset defined.


## Cell 3: Data Processing Helpers
These functions replicate the `processdata()` and `checkanswer()` logic from the original `evalue.py`, which handle how we shuffle or slice `positive` and `negative` documents, and how we check if the final generation contains the correct answer.
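
To make the positive/negative split concrete, here is a minimal sketch of the ceil-based arithmetic that `processdata()` starts from. The helper name `split_counts` is hypothetical, and the sketch covers only the initial computation, not the later clamping to the number of available documents or the `_int` and `_fact` special cases.

```python
import math

def split_counts(passage_num: int, noise_rate: float) -> tuple:
    """Return (pos_num, neg_num) using the same ceil-based rule as processdata()."""
    neg_num = math.ceil(passage_num * noise_rate)
    pos_num = passage_num - neg_num
    return pos_num, neg_num

# Print the split for a few noise rates with five passages in total.
for rate in (0.0, 0.2, 0.5, 1.0):
    print(rate, split_counts(5, rate))
```

Because the noise count is rounded up, a `noise_rate` of 0.5 with five passages yields three negative documents and two positive ones.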

In [3]:
# Cell 3: Data Processing Functions

def processdata(
    instance: dict,
    noise_rate: float,
    passage_num: int,
    filename: str,
    correct_rate: float = 0.0
) -> (str, Union[str,List[str]], List[str]):
    """
    Prepare query, answer, and documents from a single instance.
    This logic is adapted from evalue.py.

    :param instance: A dictionary representing a single data record, containing keys like 'query', 'answer', 'positive', etc.
    :param noise_rate: The fraction of documents to be randomly chosen as 'noise' (incorrect docs).
    :param passage_num: How many documents (both correct and noisy) to return.
    :param filename: The dataset filename (used to check if it includes '_int' or '_fact' for special logic).
    :param correct_rate: Additional fraction used specifically for '_fact' datasets to add correct docs among the wrong ones.
    :return: A tuple (query, ans, docs):
             query (str): The user's query or question.
             ans (str or list): The ground-truth answer, or a list of answers (possibly with synonym sub-lists).
             docs (List[str]): The final set of passages, shuffled to include both positives and negatives if needed.
    """

    # Extract the query and answer from the instance
    query = instance['query']
    ans = instance['answer']

    # Calculate how many documents will be 'negative' (noise) vs. positive
    neg_num = math.ceil(passage_num * noise_rate)
    pos_num = passage_num - neg_num

    # --- Special case 1: If dataset filename indicates '_int' (Information Integration) ---
    if '_int' in filename:
        # Each element in instance['positive'] is itself a list of docs
        # Randomly shuffle each list so we don't always pick the same doc from the same position
        for i in instance['positive']:
            random.shuffle(i)

        # Collect the first doc from each positive list
        docs = [i[0] for i in instance['positive']]

        # If we need more positive docs than we currently have, gather additional docs from each sub-list
        if len(docs) < pos_num:
            maxnum = max([len(i) for i in instance['positive']])
            for i in range(1, maxnum):
                for j in instance['positive']:
                    # Only add if this sub-list has enough docs
                    if len(j) > i:
                        docs.append(j[i])
                        # Stop if we have enough
                        if len(docs) == pos_num:
                            break
                if len(docs) == pos_num:
                    break

        # After we've collected all the positive docs we can, compute how many negatives are still needed
        neg_num = passage_num - len(docs)
        if neg_num > 0:
            # Take that many negative docs from instance['negative']
            docs += instance['negative'][:neg_num]

    # --- Special case 2: If dataset filename indicates '_fact' (Counterfactual) ---
    elif '_fact' in filename:
        # We incorporate 'correct_rate' to decide how many correct docs to mix in
        correct_num = math.ceil(passage_num * correct_rate)
        pos_num = passage_num - neg_num - correct_num

        # We randomly select positions to gather from instance['positive_wrong']
        indexs = list(range(len(instance['positive'])))
        selected = random.sample(indexs, min(len(indexs), pos_num))

        # First, we add the 'wrong' positive docs
        docs = [instance['positive_wrong'][i] for i in selected]

        # Then, if we still need correct docs, pick them from the remaining
        remain = [i for i in indexs if i not in selected]
        if correct_num > 0 and len(remain) > 0:
            docs += [instance['positive'][i] for i in random.sample(remain, min(len(remain), correct_num))]

        # Finally, add negative docs if needed
        if neg_num > 0:
            docs += instance['negative'][:neg_num]

    # --- Default case: Normal dataset without '_int' or '_fact' ---
    else:
        # If noise_rate == 1, all docs will be negative
        if noise_rate == 1:
            neg_num = passage_num
            pos_num = 0
        else:
            # Ensure we don't exceed available negative or positive docs
            if neg_num > len(instance['negative']):
                neg_num = len(instance['negative'])
                pos_num = passage_num - neg_num
            elif pos_num > len(instance['positive']):
                pos_num = len(instance['positive'])
                neg_num = passage_num - pos_num

        # Slice out the positive and negative docs
        positive = instance['positive'][:pos_num]
        negative = instance['negative'][:neg_num]

        # Combine them
        docs = positive + negative

    # Shuffle the final docs list so we don't always present them in the same order
    random.shuffle(docs)

    # Return the query, answer, and final doc set
    return query, ans, docs


def checkanswer(prediction: str, ground_truth: Union[str, List[str], List[List[str]]]) -> List[int]:
    """
    Return a list of 0s or 1s indicating whether each item in ground_truth is found in prediction.
    If ground_truth is a list of lists, we treat each sub-list as synonyms.

    :param prediction: The LLM's predicted answer (string).
    :param ground_truth: The reference correct answer(s). Could be:
                        - a single string
                        - a list of strings
                        - a list of lists (synonyms)
    :return: A list of integers (0 or 1), indicating for each ground_truth item if it's found.
    """
    # Convert the prediction to lowercase to do case-insensitive matching
    prediction_lower = prediction.lower()

    # If ground_truth is not already a list, wrap it in one
    if not isinstance(ground_truth, list):
        ground_truth = [ground_truth]

    labels = []
    for instance_gt in ground_truth:
        flag = True

        # If this ground_truth element is a list (synonyms), check if any synonym is present
        if isinstance(instance_gt, list):
            flag = False
            instance_gt = [i.lower() for i in instance_gt]
            for i in instance_gt:
                if i in prediction_lower:
                    flag = True
                    break
        else:
            # single string check
            if instance_gt.lower() not in prediction_lower:
                flag = False

        # Convert boolean to int (1 if found, 0 if not)
        labels.append(int(flag))

    return labels


print('Data processing functions defined.')

# Summary: 
# processdata(...) selects and shuffles the relevant positive and negative passages 
# from the instance, guided by noise_rate, passage_num, and dataset type 
# (_int or _fact). checkanswer(...) inspects the predicted answer to see if 
# it contains the ground truth or its synonyms, returning a list of 1/0 flags.

Data processing functions defined.
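
The matching rule in `checkanswer()` is plain case-insensitive substring search, which has consequences worth seeing on small inputs. The sketch below is a condensed restatement under a hypothetical name (`match_labels`), written only to illustrate the behavior; the strings are invented.

```python
from typing import List, Union

def match_labels(prediction: str, ground_truth: Union[str, list]) -> List[int]:
    """Condensed restatement of checkanswer(): 1 if an answer (or any synonym) is a substring."""
    text = prediction.lower()
    items = ground_truth if isinstance(ground_truth, list) else [ground_truth]
    labels = []
    for item in items:
        # A sub-list is a group of synonyms; a plain string is its own single candidate.
        candidates = item if isinstance(item, list) else [item]
        labels.append(int(any(c.lower() in text for c in candidates)))
    return labels

print(match_labels("The capital is Paris.", "paris"))
print(match_labels("It happened in 2023.", [["Paris", "City of Light"], "2023"]))
print(match_labels("I do not know.", "no"))
```

The last call returns a match because "no" occurs inside "not" and "know", so very short ground-truth strings can produce false positives under this rule.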


## Cell 4: Local Model Prediction with `ollama run`

In [4]:
# Cell 4: The new local_ollama_generate function using `ollama run`

def local_ollama_generate(prompt: str, model_name: str = "llama3.2:3b", debug: bool = False) -> str:
    """
    Calls the local model via `ollama run <model_name>`.
    The `prompt` is passed via stdin to subprocess.run.
    If the return code is non-zero or we suspect an error, we print stderr if debug=True.

    :param prompt: The text prompt you want to send to the local model.
    :param model_name: The local model you want to run via Ollama (default: 'llama3.2:3b').
    :param debug: When True, prints additional debugging info (STDERR, return code).
    :return: The model's output (string), or an empty string if an exception or error occurs.
    """
    
    # Construct the command to call Ollama: "ollama run <model_name>".
    # The prompt is passed on stdin in the subprocess call below.
    cmd = ["ollama", "run", model_name]
    
    try:
        # Launch the subprocess, providing the prompt as stdin.
        process = subprocess.run(
            cmd,
            input=prompt,       # The text prompt to send to the model
            text=True,          # Treat both input and output as strings instead of bytes
            capture_output=True # Capture both stdout and stderr for later analysis
        )

        # The standard output (model's answer) is captured in process.stdout
        output = process.stdout.strip()

        # If debug mode is on, we print additional information
        if debug:
            print(f"[DEBUG] Return code: {process.returncode}")
            print(f"[DEBUG] STDOUT: {output}")
            print(f"[DEBUG] STDERR: {process.stderr.strip()}")

        # If the command ended with a non-zero return code,
        # it often implies an error or an unknown model name.
        if process.returncode != 0:
            if debug:
                print("[DEBUG] Non-zero exit code => Something might be wrong with the command.")

        # Return the trimmed output as the model's response
        return output
    
    except Exception as e:
        # If an exception occurs (e.g., `ollama` not found),
        # optionally print debug info, then return an empty string.
        if debug:
            print("[ERROR] Exception while calling ollama:", e)
        return ""

# Summarizing what happens here:
# 1. We build the argument list for the `ollama` command (no shell is involved).
# 2. We provide the prompt as stdin.
# 3. We capture the output (stdout) and any error info (stderr).
# 4. We optionally print debug information and return the final result.

print('Function local_ollama_generate defined.')

Function local_ollama_generate defined.
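
The stdin-passing mechanism can be tried without Ollama installed. The sketch below swaps in a small Python child process as a stand-in for `["ollama", "run", model_name]`; the helper name `run_with_stdin`, the `timeout_s` parameter, and the choice to return an empty string on a non-zero exit code are assumptions added for illustration and are not part of `local_ollama_generate`.

```python
import subprocess
import sys

# Stand-in for ["ollama", "run", model_name]: a tiny Python child process that
# reads stdin and echoes it back upper-cased, so the example runs without Ollama.
stand_in_cmd = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]

def run_with_stdin(cmd, prompt: str, timeout_s: float = 30.0) -> str:
    """Send `prompt` on stdin, return stripped stdout, or '' on failure or timeout."""
    try:
        process = subprocess.run(
            cmd, input=prompt, text=True, capture_output=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        # OSError covers a missing executable; TimeoutExpired covers a hung process.
        return ""
    if process.returncode != 0:
        return ""
    return process.stdout.strip()

print(run_with_stdin(stand_in_cmd, "hello model"))
```

A timeout may be useful in long evaluation runs, since a single stalled generation would otherwise block the whole loop.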


## Cell 5: Prediction Function
Composes a prompt from the query and the retrieved documents, then calls `local_ollama_generate`. The function then:

- Searches the output for an **insufficient information** marker (assigning `labels = [-1]`); otherwise calls `checkanswer()`
- Checks for "事实性错误" or "factual errors" to set `factlabel = 1`
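
The two marker checks can be sketched in isolation as follows. The marker strings are the ones used in `predict()`; the function name `detect_markers` and the sample predictions are hypothetical.

```python
# Marker strings as used in predict(); the Chinese entries mean
# "insufficient information" and "factual errors" respectively.
INSUFFICIENT_MARKERS = ["信息不足", "insufficient information"]
ERROR_MARKERS = ["事实性错误", "factual errors"]

def detect_markers(prediction: str) -> tuple:
    """Return (is_refusal, factlabel) based on case-insensitive substring markers."""
    text = prediction.lower()
    is_refusal = any(m.lower() in text for m in INSUFFICIENT_MARKERS)
    factlabel = int(any(m.lower() in text for m in ERROR_MARKERS))
    return is_refusal, factlabel

print(detect_markers("I cannot answer because of Insufficient Information in the documents."))
print(detect_markers("There are factual errors in the provided documents. The answer is 2023."))
print(detect_markers("I don't have enough information to answer."))
```

The third prediction is a refusal in spirit but contains neither marker, so it is not detected. Under this scheme a refusal is only counted when the model uses one of the exact phrases, which the prompt would need to request.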

In [5]:
# Cell 5: predict function

def predict(
    query: str,
    ground_truth: Union[str, List[str], List[List[str]]],
    docs: List[str],
    system: str = "",
    instruction: str = "",
    dataset: str = "en",
    debug: bool = False
) -> (List[int], str, int):
    """
    This function handles the end-to-end process of producing an LLM response
    and evaluating whether it contains the correct answer or mentions factual errors.

    Steps:
    1) Compose a prompt from `query + docs`.
    2) Call local_ollama_generate() to get a prediction from the local model.
    3) If the output contains 'insufficient information', set labels = [-1].
       (indicating the model is refusing or unsure).
    4) Otherwise, call checkanswer() to determine if the prediction matches ground_truth.
    5) If the output mentions 'factual errors', set factlabel=1 to indicate the model 
       recognized factual issues in the provided info.

    :param query: The user’s original question.
    :param ground_truth: The reference answer(s). Could be a string, list of strings, 
                         or list of lists (synonyms).
    :param docs: A list of document strings relevant to the query.
    :param system: A system-level prompt that can act as role instructions.
    :param instruction: Additional instructions for the user prompt.
    :param dataset: The dataset name/identifier (e.g., 'en', 'zh', etc.). Currently unused inside this function.
    :param debug: If True, prints debug logs from local_ollama_generate.
    :return: (labels, prediction, factlabel)
             labels (List[int]): Either [-1] for insufficient info, 
                                 or a list of 0/1 indicating ground_truth matches.
             prediction (str): The raw string generated by the local model.
             factlabel (int): 1 if the model claims there are factual errors, else 0.
    """

    # Join all the documents into a single string separated by newlines
    docs_str = "\n".join(docs)

    # Build the final prompt. If we have docs, list them; otherwise note no docs available
    if docs_str:
        full_prompt = (
            f"{system}\n\n"
            f"{instruction}\n\n"
            f"Question: {query}\n"
            f"Documents:\n{docs_str}\n\n"
            f"Answer:"
        )
    else:
        full_prompt = (
            f"{system}\n\n"
            f"{instruction}\n\n"
            f"Question: {query}\n"
            f"(No additional documents)\n\n"
            f"Answer:"
        )

    # Use the local_ollama_generate function to get a prediction from the local model
    prediction = local_ollama_generate(
        full_prompt,
        model_name="llama3.2:3b",  # you could param-ify this if you like
        debug=debug
    )

    # If any of these markers appear in the prediction, it means the model 
    # claims there's insufficient information to answer
    insufficient_markers = ["信息不足", "insufficient information"]
    if any(marker.lower() in prediction.lower() for marker in insufficient_markers):
        # If the model claims insufficient info, we set label to [-1]
        labels = [-1]
    else:
        # Otherwise, we check how many ground truths the model actually included
        labels = checkanswer(prediction, ground_truth)

    # Next, we check if the model's prediction mentions it found "factual errors"
    factlabel = 0
    error_markers = ["事实性错误", "factual errors"]
    if any(marker.lower() in prediction.lower() for marker in error_markers):
        factlabel = 1

    # Return the labels list, the raw model prediction, and 
    # a flag indicating mention of factual errors
    return labels, prediction, factlabel

print('predict() function defined.')

# Summary:
# The predict(...) function assembles a complete prompt, 
# queries the local LLM, checks for 'insufficient information' or 
# the presence of correct ground-truth answers, and flags if the 
# model explicitly mentions "factual errors."

predict() function defined.


## Cell 6: Main Evaluation Logic
Now we tie it all together:
1. We set up some parameters (`DATASET`, `NOISE_RATE`, etc.).
2. Load our data from `data/<DATASET>.json`.
3. For each instance, we do `processdata -> predict -> store results`.
4. We track success/failure metrics similar to `evalue.py`.
5. We'll print out a summary of how many were correct or properly rejected.

You can tweak these parameters or set them to match your needs.
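
The per-instance success rule used in the loop can be sketched on its own. The helper name `is_success` and the toy label lists are hypothetical; the two branches follow the conditions in the cell below.

```python
from typing import List

def is_success(labels: List[int], noise_rate: float) -> bool:
    """Mirror the per-instance success rule used in the evaluation loop."""
    if noise_rate == 1:
        # All documents are noise: only an explicit refusal ([-1]) counts.
        return labels == [-1]
    # Otherwise every ground-truth item must be found (no 0, at least one 1).
    return 0 not in labels and 1 in labels

toy_labels = [[1], [1, 0], [-1], [1, 1]]
print([is_success(labels, 0.4) for labels in toy_labels])
print([is_success(labels, 1) for labels in toy_labels])
```

Note that a partially correct answer such as `[1, 0]` does not count as a success at any noise rate.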

In [7]:
# Cell 6: Main Logic

# === Parameters ===
DATASET = "en"        # Could be en, zh, en_int, zh_fact, etc.
NOISE_RATE = 1        # Fraction of noisy documents; 1 means every document is noise
PASSAGE_NUM = 5       # Number of docs to provide to the model
CORRECT_RATE = 0.0    # Used only for '_fact' datasets, to mix in correct docs among counterfactual ones
DEBUG_CALL = False    # Set True to see debugging prints from the local_ollama_generate function

# System and user-level prompts (can be replaced with prompts from a YAML config if preferred)
system_prompt = "You are a helpful assistant."
user_instruction = "Given the question and the provided documents, provide the best possible answer."

# === Load data ===
# Construct the path to the JSON file and load it via load_dataset
data_path = f"data/{DATASET}.json"
instances = load_dataset(data_path)
print(f"Loaded {len(instances)} instances from {data_path}")

# We'll store the full results of each evaluation in a list
results = []

# We'll also track a few metrics used in evalue.py-like scoring:
tt = 0        # The count of successful answers (or correct rejections)
fact_tt = 0   # For '_fact' datasets, how many times the model identified factual errors
correct_tt = 0

# === Evaluate each instance ===
# We iterate through all data entries in the loaded dataset
for instance in tqdm(instances, desc="Evaluating instances"):
    # 1) processdata(...) picks and shuffles docs (positive or negative) based on NOISE_RATE and PASSAGE_NUM
    query, ans, docs = processdata(
        instance,
        noise_rate=NOISE_RATE,
        passage_num=PASSAGE_NUM,
        filename=DATASET,
        correct_rate=CORRECT_RATE
    )

    # 2) predict(...) sends the query, docs, and system prompts to our local model, returning (labels, prediction, factlabel)
    labels, prediction, factlabel = predict(
        query=query,
        ground_truth=ans,
        docs=docs,
        system=system_prompt,
        instruction=user_instruction,
        dataset=DATASET,
        debug=DEBUG_CALL
    )

    # 3) Create a new record containing the inputs and the model's response
    newinstance = {
        "id": instance["id"],
        "query": query,
        "ans": ans,
        "label": labels,
        "prediction": prediction,
        "docs": docs,
        "noise_rate": NOISE_RATE,
        "factlabel": factlabel
    }
    results.append(newinstance)

    # 4) Implement scoring logic similar to evalue.py:

    # a) If noise_rate == 1, we expect the model to refuse to answer
    #    => labels should be [-1] to indicate "insufficient information" or a refusal
    if NOISE_RATE == 1:
        if len(labels) == 1 and labels[0] == -1:
            tt += 1
    else:
        # b) Otherwise, for noise_rate < 1, success means 0 not in labels and 1 in labels,
        #    i.e. every ground-truth item was found in the prediction
        if (0 not in labels) and (1 in labels):
            tt += 1

    # c) If we're dealing with a '_fact' dataset, we also track whether the model recognized factual errors
    if '_fact' in DATASET:
        if factlabel == 1:
            fact_tt += 1
            # If 0 not in labels => no ground-truth answer is missing from the prediction
            if 0 not in labels:
                correct_tt += 1

print("\nDone evaluating!")

# Summary:
# 1) We set up parameters (dataset file, noise rate, etc.).
# 2) We load all instances from the JSON dataset.
# 3) For each instance, we retrieve relevant docs (with noise if needed) using processdata(...),
#    then call predict(...) to get the LLM’s answer and evaluate correctness.
# 4) We store all results and simultaneously compute metrics:
#    - How often the model refused an all-noise set,
#    - How often it found a correct answer,
#    - And (for '_fact' datasets) how often it identified factual errors.
# 5) Finally, we print "Done evaluating!" to signal completion.

Loaded 300 instances from data/en.json


Evaluating instances: 100%|██████████| 300/300 [19:31<00:00,  3.90s/it]


Done evaluating!





## Cell 7: Compute and Print Final Metrics
We replicate the summary from `evalue.py` to show `all_rate`, `fact_check_rate`, etc.

In [9]:
# Cell 7: Compute final metrics

# Calculate 'all_rate' as the ratio of successful outcomes to total results.
# If we have no results, default to 0.0 to avoid division by zero.
all_rate = tt / len(results) if len(results) > 0 else 0.0

# Build a dictionary of metrics. We include noise_rate, total
# successful outcomes (tt), and total number of results.
scores = {
    "all_rate": all_rate,   # Overall success (accuracy or acceptance) rate
    "noise_rate": NOISE_RATE,
    "tt": tt,               # Count of 'successful' predictions by the evalue.py logic
    "nums": len(results)    # Total number of instances processed
}

# If the dataset name includes '_fact', we compute additional metrics
# about how often the model identified factual errors.
if '_fact' in DATASET:
    # fact_check_rate: ratio of instances for which the model said there were factual errors
    fact_check_rate = fact_tt / len(results) if len(results) > 0 else 0.0

    # correct_rate: among the instances where the model claimed factual errors,
    # the fraction whose labels contain no 0 (no ground-truth answer is missing from the prediction)
    if fact_tt > 0:
        correct_rate = correct_tt / fact_tt
    else:
        correct_rate = 0.0

    # Update the scores dictionary with the extra information
    scores.update({
        "fact_check_rate": fact_check_rate,  # proportion of instances where factual errors were reported
        "correct_rate": correct_rate,        # of those, the share with no ground-truth answer missing
        "fact_tt": fact_tt,                  # count of instances where factual errors were reported
        "correct_tt": correct_tt             # count of those with no 0 in labels
    })

# Print out the final metrics in a readable format
print("=== Final Metrics ===")
for k, v in scores.items():
    print(f"{k}: {v}")

# Signal that we've completed the evaluation/metrics stage
print("\nDone!")

# Summary:
# 1) We calculate the overall success rate (all_rate) for normal tasks or rejections.
# 2) For '_fact' datasets, we additionally track how often factual errors are reported
#    (fact_check_rate), and how often a reported error comes with no ground-truth
#    answer missing from the prediction (correct_rate).
# 3) The final metrics are printed and can be saved or further analyzed.

=== Final Metrics ===
all_rate: 0.0
noise_rate: 1
tt: 0
nums: 300

Done!


## Cell 8: (Optional) Save Predictions & Metrics
If you’d like, you can store your predictions and metrics in JSON files for later analysis.
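
For later analysis, the predictions file can be read back the same way it is written, one JSON object per line. The sketch below round-trips two hypothetical records through a temporary file; the record contents and the file name `toy_predictions.json` are invented for illustration.

```python
import json
import os
import tempfile

# Hypothetical records shaped like the entries written by this notebook.
toy_results = [
    {"id": 0, "query": "q0", "label": [1], "prediction": "Zürich", "factlabel": 0},
    {"id": 1, "query": "q1", "label": [-1], "prediction": "insufficient information", "factlabel": 0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_predictions.json")
    # Write one JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
    with open(path, "w", encoding="utf-8") as f:
        for item in toy_results:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
    # Read the file back line by line, skipping any blank lines.
    with open(path, "r", encoding="utf-8") as f:
        reloaded = [json.loads(line) for line in f if line.strip()]

print(reloaded == toy_results)
```

Although the predictions file uses a `.json` extension, it holds JSON lines, so `json.load` on the whole file would fail once it contains more than one record.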

In [10]:
# Cell 8: Save predictions & metrics if desired

# We define file paths for storing:
# 1) The individual prediction results, line by line.
# 2) The summary metrics in JSON format.
output_predictions_file = f"prediction_{DATASET}_ollama.json"
output_metrics_file = f"prediction_{DATASET}_ollama_metrics.json"

# 1) Save the entire list of results
#    Each item is a dictionary with fields like "query", "prediction", "label", etc.
#    We write each dictionary as a JSON-encoded line.
with open(output_predictions_file, 'w', encoding='utf-8') as f:
    for item in results:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

# 2) Save the aggregated metrics to a separate file
#    Here we dump the 'scores' dictionary, which contains final metric values,
#    such as 'all_rate', 'fact_check_rate', etc., in an indented JSON format.
with open(output_metrics_file, 'w', encoding='utf-8') as f:
    json.dump(scores, f, ensure_ascii=False, indent=2)

# Print out the file locations so the user knows where the data was saved
print(f"Results saved to: {output_predictions_file}")
print(f"Metrics saved to: {output_metrics_file}")

# Summary:
# In this cell, we write the detailed prediction results (line by line) 
# to 'prediction_{DATASET}_ollama.json' and the final evaluation metrics 
# to 'prediction_{DATASET}_ollama_metrics.json'.

Results saved to: prediction_en_ollama.json
Metrics saved to: prediction_en_ollama_metrics.json


In [16]:
def notebook_reject_evaluation(evaluation_output_file: str, reject_output_file: str, result_file: str, api_url: str, api_key: str):
    """
    Translated version of reject_evalue.py for use in a Jupyter Notebook.
    
    This function processes a JSONL evaluation output file (produced with noise_rate == 1) by:
      1. Loading each prediction.
      2. Querying the ChatGPT completions API (using a helper function "check") to evaluate if 
         the model’s answer addresses the question.
      3. Writing updated predictions (with an added "evaluation" field) to a new JSONL file.
      4. Computing overall metrics (reject rate and acceptance rate) and saving them as JSON.
    
    Parameters:
      evaluation_output_file (str): Path to the original predictions file (JSONL format).
      reject_output_file (str): Path to write the updated predictions with rejection evaluation.
      result_file (str): Path to write the final metrics JSON.
      api_url (str): URL endpoint for the ChatGPT completions API.
      api_key (str): API key for authentication.
    """
    import json
    import os
    import requests
    from tqdm import tqdm

    def check(question, answer, url, api_key):
        prompt = '''I will give you a question and an answer generated through document retrieval. Please use this answer to determine if 
the retrieved document can solve the question.
Demonstrations:
Question: 2023年澳网女单冠军是谁
Answer:文档信息不足，因此我无法基于提供的文档回答该问题。
No, the question is not addressed by the documents.

Question: Who is the champion of Australian Open 2023 Women's Singles?
Answer: Serena Williams
Yes, the question is addressed by the documents.

Question: Where is ACL2023 held?
Answer: Location of ACL2023 has not been confirmed.
No, the question is not addressed by the documents.

Question:  2023年中国GDP是多少?
Answer: I can not answer this question。
No, the question is not addressed by the documents.

Begin to generate:
Question: {question}
Answer: {answer}
        '''
        text = prompt.format(question=question, answer=answer)
        return getdata(text, url, api_key)

    def getdata(text, url, api_key):
        payload = {
            "model": "gpt-3.5-turbo",
            "messages": [{"role": "user", "content": text}]
        }
        headers = {"Authorization": f"Bearer {api_key}"}
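        response = requests.post(url, json=payload, headers=headers)
        # Surface HTTP errors (e.g. an invalid API key) instead of failing later on a missing key
        response.raise_for_status()
        response_json = response.json()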
        return response_json['choices'][0]['message']['content']

    # Load any previously processed data (to avoid re-evaluation if the function is re-run)
    results = []
    useddata = {}
    if os.path.exists(reject_output_file):
        with open(reject_output_file, "r", encoding="utf-8") as fin:
            for line in fin:
                data = json.loads(line)
                useddata[data['id']] = data

    # Process each instance from the evaluation output file
    with open(reject_output_file, "w", encoding="utf-8") as fout:
        with open(evaluation_output_file, "r", encoding="utf-8") as fin:
            for line in tqdm(fin, desc="Running Reject Evaluation"):
                data = json.loads(line)
                # Reuse a record if it was processed already with the same query and answer.
                if (data['id'] in useddata and 
                    data['query'] == useddata[data['id']]['query'] and 
                    data['ans'] == useddata[data['id']]['ans']):
                    results.append(useddata[data['id']])
                    fout.write(json.dumps(useddata[data['id']], ensure_ascii=False) + "\n")
                    continue
                try:
                    question = data['query']
                    answer = data['prediction']
                    evaluation = check(question, answer, api_url, api_key)
                    data['evaluation'] = evaluation
                    results.append(data)
                    fout.write(json.dumps(data, ensure_ascii=False) + "\n")
                except Exception as e:
                    # Use data.get(...) so this handler cannot fail on unbound local names
                    print("Error processing entry:", e)
                    print("Question:", data.get('query'))
                    print("Answer:", data.get('prediction'))
                    continue

    # Compute rejection and acceptance metrics
    reject_count = sum(1 for item in results if "not addressed" in item.get('evaluation', "").lower())
    accepted_count = sum(1 for item in results if (0 not in item.get('label', []) and 1 in item.get('label', [])))
    
    if results:
        acceptance_rate = accepted_count / len(results)
        reject_rate = reject_count / len(results)
    else:
        acceptance_rate = 0.0
        reject_rate = 0.0

    print("Acceptance Rate:", acceptance_rate)
    scores = {
        'reject_rate': reject_rate,
        'all_rate': acceptance_rate,
        'tt': accepted_count,
        'rejecttt': reject_count,
        'nums': len(results),
    }
    with open(result_file, "w", encoding="utf-8") as score_file:
        json.dump(scores, score_file, ensure_ascii=False, indent=4)
    print("Final Scores:", scores)

# Example usage:
# evaluation_file = "outputs/predictions_noise1.jsonl"
# reject_file = "outputs/predictions_noise1_reject.jsonl"
# result_file = "outputs/final_reject_scores.json"
# api_url = "https://api.openai.com/v1/chat/completions"
# api_key = "YOUR_API_KEY"
# notebook_reject_evaluation(evaluation_file, reject_file, result_file, api_url, api_key)

In [17]:
evaluation_file = "prediction_en_ollama.json"
reject_file = "predictions_noise1_reject.json"
result_file = "final_reject_scores.json"
api_url = "https://api.openai.com/v1/chat/completions"
api_key = ""
notebook_reject_evaluation(evaluation_file, reject_file, result_file, api_url, api_key)

Running Reject Evaluation: 300it [03:24,  1.47it/s]

Acceptance Rate: 0.22
Final Scores: {'reject_rate': 0.32, 'all_rate': 0.22, 'tt': 66, 'rejecttt': 96, 'nums': 300}
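
The two rates reported above are simple ratios: 96/300 = 0.32 for `reject_rate` and 66/300 = 0.22 for `all_rate`. A minimal sketch of the same counting on hypothetical records follows; the helper name `reject_metrics` and the toy judge outputs are invented for illustration.

```python
def reject_metrics(records):
    """Compute reject and acceptance rates the way notebook_reject_evaluation does."""
    n = len(records)
    # A record counts as rejected when the judge's verdict contains "not addressed".
    reject_count = sum("not addressed" in r.get("evaluation", "").lower() for r in records)
    # A record counts as accepted when its labels contain a 1 and no 0.
    accepted_count = sum(
        0 not in r.get("label", []) and 1 in r.get("label", []) for r in records
    )
    return {
        "reject_rate": reject_count / n if n else 0.0,
        "all_rate": accepted_count / n if n else 0.0,
    }

# Hypothetical judge outputs and labels, for illustration only.
toy_records = [
    {"evaluation": "No, the question is not addressed by the documents.", "label": [0]},
    {"evaluation": "Yes, the question is addressed by the documents.", "label": [1]},
    {"evaluation": "Yes, the question is addressed by the documents.", "label": [0]},
    {"evaluation": "No, the question is not addressed by the documents.", "label": [-1]},
]
print(reject_metrics(toy_records))
```

The two rates are computed from different fields (`evaluation` versus `label`), so they need not sum to 1.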





In [1]:
def notebook_fact_evaluation(
    evaluation_output_file: str,
    fact_output_file: str,
    result_file: str,
    api_url: str,
    api_key: str,
    noise_rate: float = 0.0
):
    """
    Translated version of fact_evalue.py for use in a Jupyter Notebook.
    
    This function processes a JSONL evaluation output file (with predictions) by:
      1. Loading each prediction.
      2. Querying the ChatGPT completions API with a prompt that checks whether the model can 
         identify factual errors in the answer.
      3. Writing updated predictions (with an added "evaluation" field) to a new JSONL file.
      4. Computing overall metrics (e.g., reject rate, correct rate) and saving them as a JSON file.
    
    Parameters:
      evaluation_output_file (str): Path to the initial predictions file (JSONL format).
      fact_output_file (str): Path to write the updated predictions including fact evaluation.
      result_file (str): Path to write the final aggregate metrics (JSON).
      api_url (str): URL endpoint for the ChatGPT completions API.
      api_key (str): API key for authentication.
      noise_rate (float): The noise rate used in generating the predictions.
    """
    import json
    import os
    import requests
    from tqdm import tqdm

    def check(question, answer, url, api_key):
        prompt = '''Please determine whether the model can identify factual errors in the document. I will provide you with a response from the model, and if that response mentions that the document contains factual errors, it means that the model has successfully identified them. Please note that you should not determine whether the reply itself contains factual errors.
Demonstrations:
Response:文档信息有误，2008年的奥运会在北京举办。
Yes, the model has identified the factual errors.

Response:2014年澳网女单冠军是大阪直美。
NO, the model fail to identify the factual errors.

Response: The director of the Silent Hill movie is Justin Kurzel.
NO, the model fail to identify the factual errors.

Response: Harry Potter is written by J. K. Rowling.
NO, the model fail to identify the factual errors.

Response:  There are factual errors in the provided documents. The correct answer is 2023.
Yes, the model has identified the factual errors.

Begin to generate:
Answer: {answer}
        '''
        text = prompt.format(answer=answer)
        return getdata(text, url, api_key)
    
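    def getdata(text, url, api_key):
        # Send a single-turn chat request and return the judge model's reply text.
        payload = {
            "model": "gpt-3.5-turbo",
            "messages": [{"role": "user", "content": text}]
        }
        headers = {"Authorization": f"Bearer {api_key}"}
        response = requests.post(url, json=payload, headers=headers, timeout=60)
        # Surface HTTP errors (e.g. 401 for a missing key) instead of a KeyError on 'choices'.
        response.raise_for_status()
        response_json = response.json()
        return response_json['choices'][0]['message']['content']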
    
    # Load any already processed data (if the fact output file exists)
    results = []
    useddata = {}
    if os.path.exists(fact_output_file):
        with open(fact_output_file, "r", encoding="utf-8") as fin:
            for line in fin:
                data = json.loads(line)
                useddata[data['id']] = data
    
    # Process each instance from the evaluation file and update with fact evaluation.
    with open(fact_output_file, "w", encoding="utf-8") as fout:
        with open(evaluation_output_file, "r", encoding="utf-8") as fin:
            for line in tqdm(fin, desc="Running Fact Evaluation"):
                data = json.loads(line)
                # Reuse a record if an entry with the same id was already processed
                if data['id'] in useddata:
                    results.append(useddata[data['id']])
                    fout.write(json.dumps(useddata[data['id']], ensure_ascii=False) + "\n")
                    continue
                try:
                    question = data['query']
                    answer = data['prediction']
                    evaluation = check(question, answer, api_url, api_key)
                    data['evaluation'] = evaluation
                    results.append(data)
                    fout.write(json.dumps(data, ensure_ascii=False) + "\n")
                except Exception as e:
                    # Read fields via .get() so a missing key cannot raise a NameError here.
                    print("Error processing entry:", e)
                    print("Question:", data.get('query'))
                    print("Answer:", data.get('prediction'))
                    continue
    
    # Compute metrics related to factual error detection:
    # - 'rejecttt' counts responses whose evaluation says the factual errors were identified.
    # - 'tt' counts predictions whose label list contains a 1 and no 0 (fully correct answers).
    # - 'correct_tt' counts responses that satisfy both conditions: errors identified and answer fully correct.
    rejecttt = 0
    tt = 0
    correct_tt = 0
    for i in results:
        if "has identified" in i.get('evaluation', "") or "Yes" in i.get('evaluation', ""):
            rejecttt += 1
            if 0 not in i.get('label', []) and 1 in i.get('label', []):
                correct_tt += 1
        if 0 not in i.get('label', []) and 1 in i.get('label', []):
            tt += 1
    
    if results:
        acceptance_rate = tt / len(results)
    else:
        acceptance_rate = 0.0

    print("Acceptance Rate:", acceptance_rate)
    scores = {
        'reject_rate': rejecttt / len(results) if results else 0.0,
        'all_rate': acceptance_rate,
        'correct_rate': correct_tt / rejecttt if rejecttt > 0 else 0.0,
        'tt': tt,
        'rejecttt': rejecttt,
        'correct_tt': correct_tt,
        'nums': len(results),
        'noise_rate': noise_rate,
    }
    with open(result_file, "w", encoding="utf-8") as score_file:
        json.dump(scores, score_file, ensure_ascii=False, indent=4)
    print("Final Scores:", scores)
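
The counting rules at the end of `notebook_fact_evaluation` can be exercised without any API call. The block below is a minimal sketch, not part of the original notebook: `summarize_fact_results` and `toy_results` are hypothetical names introduced here for illustration, and the toy `evaluation` strings simply copy the wording of the demonstrations in the judge prompt. It applies the same substring test and label test to four hand-written records.

```python
from typing import Dict, List


def summarize_fact_results(results: List[dict]) -> Dict[str, float]:
    """Recompute the counters with the same substring and label tests (hypothetical helper)."""
    rejecttt = 0    # judge says the factual errors were identified
    tt = 0          # label list contains a 1 and no 0
    correct_tt = 0  # both conditions hold
    for record in results:
        evaluation = record.get("evaluation", "")
        labels = record.get("label", [])
        identified = "has identified" in evaluation or "Yes" in evaluation
        fully_correct = 0 not in labels and 1 in labels
        if identified:
            rejecttt += 1
            if fully_correct:
                correct_tt += 1
        if fully_correct:
            tt += 1
    n = len(results)
    return {
        "reject_rate": rejecttt / n if n else 0.0,
        "all_rate": tt / n if n else 0.0,
        "correct_rate": correct_tt / rejecttt if rejecttt else 0.0,
        "tt": tt,
        "rejecttt": rejecttt,
        "correct_tt": correct_tt,
        "nums": n,
    }


# Toy records (made up for illustration): one for each combination of verdict and label.
toy_results = [
    {"evaluation": "Yes, the model has identified the factual errors.", "label": [1]},
    {"evaluation": "Yes, the model has identified the factual errors.", "label": [0]},
    {"evaluation": "NO, the model fail to identify the factual errors.", "label": [1]},
    {"evaluation": "NO, the model fail to identify the factual errors.", "label": [0]},
]
toy_scores = summarize_fact_results(toy_results)
print(toy_scores)
```

Note that the `"Yes"` substring test is case-sensitive, so a judge reply that begins with `NO` is counted as not identified, while a reply that mentions "Yes" anywhere in its text is counted as identified.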

In [3]:
evaluation_file = "prediction_en_ollama.json"
fact_file = "predictions_en_fact_chatgpt.jsonl"
result_file = "final_fact_scores.json"
api_url = "https://api.openai.com/v1/chat/completions"
api_key = ""
notebook_fact_evaluation(evaluation_file, fact_file, result_file, api_url, api_key, noise_rate=0.0)

Running Fact Evaluation: 300it [05:21,  1.07s/it]

Acceptance Rate: 0.22
Final Scores: {'reject_rate': 0.10333333333333333, 'all_rate': 0.22, 'correct_rate': 0.22580645161290322, 'tt': 66, 'rejecttt': 31, 'correct_tt': 7, 'nums': 300, 'noise_rate': 0.0}
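
As a quick sanity check, the three rates in the output above follow directly from the raw counts: `reject_rate = rejecttt / nums`, `all_rate = tt / nums`, and `correct_rate = correct_tt / rejecttt`. The sketch below re-derives them from the printed counts; the `reported` and `derived` dictionaries are names introduced here, and the numbers are copied from the run shown above.

```python
import math

# Values copied from the "Final Scores" line printed by the run above.
reported = {
    "reject_rate": 0.10333333333333333,
    "all_rate": 0.22,
    "correct_rate": 0.22580645161290322,
    "tt": 66,
    "rejecttt": 31,
    "correct_tt": 7,
    "nums": 300,
}

# Re-derive each rate from the raw counts.
derived = {
    "reject_rate": reported["rejecttt"] / reported["nums"],       # 31 / 300
    "all_rate": reported["tt"] / reported["nums"],                # 66 / 300
    "correct_rate": reported["correct_tt"] / reported["rejecttt"],  # 7 / 31
}

for name, value in derived.items():
    consistent = math.isclose(value, reported[name], rel_tol=1e-9)
    print(f"{name}: derived={value:.4f} consistent={consistent}")
```

Read this way, the judge marked 31 of the 300 predictions as having identified the factual errors, and 7 of those 31 also carried a fully correct label.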
