<a href="https://colab.research.google.com/github/Reemaalt/Detection-of-Hallucination-in-Arabic/blob/main/mysemantic_entropy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- Estimate the probability of each cluster.
- Use Monte Carlo integration to compute semantic entropy.
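
The two steps above correspond to what the code below calls equations (2) and (3). As a sketch in the notation of the docstrings, with $c$ a cluster of responses $s$ to question $x$ (the symbol $C$ for the set of clusters is introduced here only for illustration):

$$
p(c \mid x) = \sum_{s \in c} p(s \mid x), \qquad SE(x) \approx -\sum_{c \in C} p(c \mid x)\,\log p(c \mid x)
$$

In this notebook, $p(s \mid x)$ is taken as the exponential of the negated average negative log-likelihood of each response, and the cluster sums are renormalized to sum to 1 before the entropy is computed.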

In [None]:
import json
import os
import numpy as np
import torch
from google.colab import files

In [None]:
# Load clustered responses
file_path = "entailment_clusters_Llama3.1-xquadAll-LOG.json"
with open(file_path, "r", encoding="utf-8") as f:
    clustered_data = json.load(f)

print(f"Loaded {len(clustered_data)} questions with clustered answers.")


Loaded 1190 questions with clustered answers.


In [None]:
def compute_cluster_log_probabilities(clusters, log_likelihoods):
    """
    Compute the normalized probability p(c|x) of each cluster (equation 2).

    Inputs:
    - clusters: list of clusters, each a list of responses.
    - log_likelihoods: one average negative log-likelihood per response,
      assumed to be ordered cluster by cluster (the slicing below relies on this).

    Returns a list with one probability per cluster, summing to 1.
    """
    # Count the total number of responses across all clusters
    total_responses = sum(len(cluster) for cluster in clusters)

    # Warn if the number of likelihoods does not match the number of responses
    if total_responses != len(log_likelihoods):
        print(f"Warning: Mismatch between responses ({total_responses}) and likelihoods ({len(log_likelihoods)}). Cluster probabilities may be unreliable.")

    # Inputs are negative log-likelihoods, so exp(-nll) recovers each probability
    likelihoods = [np.exp(-ll) for ll in log_likelihoods]

    # Index tracking the current position in the likelihoods list
    idx = 0
    cluster_probs = []

    for cluster in clusters:
        # Slice out the probabilities of the responses in this cluster
        cluster_size = len(cluster)
        cluster_likelihoods = likelihoods[idx:idx+cluster_size]
        idx += cluster_size

        # Sum the probabilities of all responses in the cluster (equation 2)
        cluster_prob = sum(cluster_likelihoods)
        cluster_probs.append(cluster_prob)

    # Normalize so that the cluster probabilities sum to 1
    total_prob = sum(cluster_probs)
    if total_prob > 0:
        return [p/total_prob for p in cluster_probs]
    else:
        print("Warning: Total probability is zero. Check input values.")
        return [0.0] * len(cluster_probs)
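
The zero-total warning above can trigger when the likelihoods are so small that `exp()` underflows. A minimal sketch of one way to avoid this is to normalize in log space with the log-sum-exp trick. The helper name `normalize_log_weights` and the input values are hypothetical and are not part of the notebook's pipeline.

```python
import math

def normalize_log_weights(log_weights):
    """Turn unnormalized log-weights into probabilities that sum to 1."""
    # Subtracting the maximum keeps every exponent <= 0, so the largest term is exp(0) = 1
    m = max(log_weights)
    log_total = m + math.log(sum(math.exp(lw - m) for lw in log_weights))
    return [math.exp(lw - log_total) for lw in log_weights]

# Hypothetical per-cluster log-probabilities, small enough that exp() underflows
cluster_log_probs = [-1000.0, -1001.0]

# Naive approach: every term underflows, so the total is zero
naive_total = sum(math.exp(lp) for lp in cluster_log_probs)

# Log-space approach: the relative weights are preserved
stable_probs = normalize_log_weights(cluster_log_probs)

print(naive_total)
print(stable_probs)
```

With average (length-normalized) negative log-likelihoods, as used in this notebook, underflow is unlikely. The sketch matters mainly if summed sequence log-likelihoods were used instead.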

- The reference code's `predictive_entropy_rao` computes `-np.sum(np.exp(log_probs) * log_probs)`, while our code computes `-sum(p * np.log(p) for p in normalized_probs)`. With `log_probs = np.log(normalized_probs)`, the two expressions are mathematically identical.

- `compute_semantic_entropy()` below implements the semantic entropy calculation in the same way as the reference code's `predictive_entropy_rao()` function.
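
The equivalence of the two forms can be checked numerically. The sketch below uses the standard-library `math` module in place of NumPy. The function names and the example probabilities are illustrative assumptions, not taken from the reference code.

```python
import math

def entropy_from_log_probs(log_probs):
    """Reference-style form: -sum(exp(log p) * log p)."""
    return -sum(math.exp(lp) * lp for lp in log_probs)

def entropy_from_probs(probs):
    """Notebook-style form: -sum(p * log p), skipping zero entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical normalized cluster probabilities for one question
cluster_probs = [0.5, 0.3, 0.2]
log_cluster_probs = [math.log(p) for p in cluster_probs]

h_ref = entropy_from_log_probs(log_cluster_probs)
h_ours = entropy_from_probs(cluster_probs)

# The two values agree up to floating-point rounding
print(math.isclose(h_ref, h_ours))
```

The only practical difference is that the probability form must skip zero entries to avoid `log(0)`, which is why the notebook filters them out first.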



In [None]:
# Estimate semantic entropy from the cluster probabilities (equation 3)
def compute_semantic_entropy(probabilities):
    """
    Compute the semantic entropy of a list of cluster probabilities.

    Uses the formula SE(x) = -∑ p(c|x) log p(c|x), where the p(c|x) are
    estimated from the sampled responses (a Monte Carlo estimate).
    """
    # Filter out zero probabilities to avoid log(0)
    valid_probs = [p for p in probabilities if p > 0]

    if not valid_probs:
        return 0.0

    # Re-normalize so the remaining probabilities sum to 1
    total = sum(valid_probs)
    normalized_probs = [p/total for p in valid_probs]

    # Entropy over clusters (equation 3), matching the reference code's calculation
    entropy = -sum(p * np.log(p) for p in normalized_probs)

    return entropy
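
To help interpret the resulting values, the sketch below evaluates the same formula at its two extremes: all responses in a single cluster, and every response in its own equally likely cluster. The function `entropy` is a standalone stand-in written for illustration. It is not the notebook's function.

```python
import math

def entropy(weights):
    """Entropy of non-negative weights after normalizing them to sum to 1."""
    total = sum(w for w in weights if w > 0)
    return -sum((w / total) * math.log(w / total) for w in weights if w > 0)

# All 10 responses share one meaning: a single cluster, no uncertainty
single_cluster = entropy([1.0])

# All 10 responses differ and are equally likely: maximum uncertainty, log(10)
ten_uniform_clusters = entropy([1] * 10)

print(single_cluster == 0.0)
print(math.isclose(ten_uniform_clusters, math.log(10)))
```

With 10 sampled responses per question, as in the output below, semantic entropy therefore lies between 0 and log(10) ≈ 2.30 nats.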


In [None]:
# Step 5: Calculate semantic entropy
def process_clustered_data(clustered_data):
    entropy_results = {}

    for question_id, data in clustered_data.items():
        clusters = data["clusters"]
        log_likelihoods = data["total_avg_neg_log_likelihoods_for_clusters"]

        # Count the total number of responses across all clusters
        total_responses = sum(len(cluster) for cluster in clusters)

        # Print a per-question summary
        print(f"Question {question_id}:")
        print(f"  Total clusters: {len(clusters)}")
        print(f"  Total responses: {total_responses}")
        print(f"  Total likelihoods: {len(log_likelihoods)}")

        # Step 1: Compute cluster probabilities
        probabilities = compute_cluster_log_probabilities(clusters, log_likelihoods)

        # Step 2: Compute semantic entropy
        entropy = compute_semantic_entropy(probabilities)

        # Store results
        entropy_results[question_id] = {
            "question": data["question"],
            "semantic_entropy": entropy,
            "num_clusters": len(clusters),
            "cluster_probabilities": probabilities
        }

    return entropy_results

In [None]:
entropy_results = process_clustered_data(clustered_data)
output_file = "semantic_entropy_Llama3.1-8b_xquadAll_results.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(entropy_results, f, ensure_ascii=False, indent=4)

files.download(output_file)
print(f"Semantic entropy results saved to {output_file}")

Question 0:
  Total clusters: 8
  Total responses: 10
  Total likelihoods: 10
Question 1:
  Total clusters: 6
  Total responses: 10
  Total likelihoods: 10
Question 2:
  Total clusters: 10
  Total responses: 10
  Total likelihoods: 10
Question 3:
  Total clusters: 8
  Total responses: 10
  Total likelihoods: 10
Question 4:
  Total clusters: 6
  Total responses: 10
  Total likelihoods: 10
Question 5:
  Total clusters: 9
  Total responses: 10
  Total likelihoods: 10
Question 6:
  Total clusters: 9
  Total responses: 10
  Total likelihoods: 10
Question 7:
  Total clusters: 5
  Total responses: 10
  Total likelihoods: 10
Question 8:
  Total clusters: 9
  Total responses: 10
  Total likelihoods: 10
Question 9:
  Total clusters: 5
  Total responses: 10
  Total likelihoods: 10
Question 10:
  Total clusters: 7
  Total responses: 10
  Total likelihoods: 10
Question 11:
  Total clusters: 8
  Total responses: 10
  Total likelihoods: 10
Question 12:
  Total clusters: 6
  Total responses: 10
  [output truncated]

Semantic entropy results saved to semantic_entropy_Llama3.1-8b_xquadAll_results.json
