# MMHal-Bench with LLaVA-HF (7B/13B)

End-to-end Colab notebook that generates LLaVA-1.5 answers for MMHal-Bench and scores them with GPT-4o as the judge.

Recommended: Runtime > Change runtime type > GPU (A100 preferred; a T4 can run the 7B model in fp16 but lacks native bf16 and has too little memory for 13B in half precision).

Notes:
- Uses the Hugging Face llava-hf LLaVA-1.5 checkpoints in bf16 or fp16 precision (no 4-bit quantization).
- The judge is OpenAI GPT-4o; enter your OPENAI_API_KEY when prompted.
- The install cell pins httpx<0.28 to work around OpenAI SDK issues on Colab.
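
The bf16/fp16 choice in the first note depends on the GPU: native bf16 needs compute capability 8.0 or higher (for example an A100), while a T4 (7.5) has to fall back to fp16. A minimal sketch of that decision follows. `pick_dtype_name` is a hypothetical helper for illustration, not part of the Socrates repo, and the generator script may decide differently.

```python
def pick_dtype_name(cuda_available, capability=None):
    """Return the name of a torch dtype suited to the device.

    capability is the (major, minor) tuple reported by
    torch.cuda.get_device_capability(0), or None if unknown.
    Hypothetical helper for illustration only.
    """
    if not cuda_available:
        return "float32"   # CPU fallback: keep full precision
    if capability is not None and tuple(capability) >= (8, 0):
        return "bfloat16"  # Ampere or newer has native bf16
    return "float16"       # older GPUs such as the T4 (7.5)


print(pick_dtype_name(True, (8, 0)))  # e.g. A100
print(pick_dtype_name(True, (7, 5)))  # e.g. T4
```

With CUDA available, `getattr(torch, pick_dtype_name(True, torch.cuda.get_device_capability(0)))` turns the name into the corresponding dtype object.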


In [None]:
import torch, os
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('Device name:', torch.cuda.get_device_name(0))
    try:
        print('Capability:', torch.cuda.get_device_capability(0))
    except Exception as e:
        print('Capability: unknown', e)
else:
    print('Using CPU (slower)')


In [None]:
%pip -q install "transformers>=4.41" "accelerate>=0.30" pillow requests "openai>=1.37.0" "httpx<0.28" bitsandbytes


In [None]:
# Optional: mount Google Drive if your dataset/images are in Drive
# from google.colab import drive
# drive.mount('/content/drive')


In [None]:
# Clone the Socrates repo (adjust if you have a fork)
import os

REPO_URL = 'https://github.com/MohammedEsamaldin/Socrates.git'
WORKDIR = '/content/Socrates'
if not os.path.exists(WORKDIR):
    !git clone $REPO_URL $WORKDIR
%cd $WORKDIR


In [None]:
# Set paths and parameters
# Point DATASET to your MMHal-Bench JSON (list of 96 records).
# If records have relative image paths, set IMAGE_ROOT accordingly.
import os
DATASET = '/content/path/to/mmhal.json'  # TODO: replace
IMAGE_ROOT = '/content/path/to/images'   # TODO: replace or set to '' if absolute/URLs
OUTPUT = '/content/llava15_7b_responses.json'
EVAL_JSON = '/content/llava15_7b_eval.json'
MODEL_ID = os.environ.get('SOC_LLAVA_MODEL', 'llava-hf/llava-1.5-7b-hf')  # or 'llava-hf/llava-1.5-13b-hf'
MAX_NEW_TOKENS = 256
TEMPERATURE = 0.2
print('MODEL_ID =', MODEL_ID)
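
Before generating, it can save a failed run to confirm that DATASET really is a JSON list of the expected size. The sketch below is a pre-flight check under stated assumptions, not part of the Socrates repo: `check_mmhal_records` and `check_mmhal_file` are hypothetical helpers, and only the `question` key is checked because the exact image key the generator accepts is not pinned down here.

```python
import json


def check_mmhal_records(records, expected_count=96):
    """Return a list of human-readable problems found in the dataset.

    Hypothetical pre-flight helper; an empty list means no problem was found.
    """
    if not isinstance(records, list):
        return ["top-level JSON value is not a list"]
    problems = []
    if len(records) != expected_count:
        problems.append(f"expected {expected_count} records, found {len(records)}")
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i} is not an object")
        elif not rec.get("question"):
            problems.append(f"record {i} has no 'question'")
    return problems


def check_mmhal_file(path, expected_count=96):
    """Load a JSON file and run check_mmhal_records on its contents."""
    with open(path, "r", encoding="utf-8") as f:
        return check_mmhal_records(json.load(f), expected_count)
```

Calling `check_mmhal_file(DATASET)` once the paths are set returns the list of problems, which is empty when nothing looks wrong.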


In [None]:
# Generate answers with LLaVA-HF (bf16/fp16 enforced, no 4-bit)
!python -m socrates_system.mllm_evaluation.scripts.generate_mmhal_llava_hf \
  --dataset "$DATASET" \
  --image-root "$IMAGE_ROOT" \
  --output "$OUTPUT" \
  --model-id "$MODEL_ID" \
  --max-new-tokens $MAX_NEW_TOKENS \
  --temperature $TEMPERATURE


In [None]:
# Set your OpenAI API key for judging (kept in env only within this session)
import os, getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter OPENAI_API_KEY: ').strip()
# Report whether a non-empty key was entered (membership alone is always True here)
print('Key set:', bool(os.environ['OPENAI_API_KEY']))


In [None]:
# Judge with GPT-4o (structured results saved to EVAL_JSON)
!python -m socrates_system.mllm_evaluation.analysis.MMHal_Bench.eval_gpt4 \
  --response "$OUTPUT" \
  --evaluation "$EVAL_JSON" \
  --api-key "$OPENAI_API_KEY" \
  --gpt-model gpt-4o-2024-08-06
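
Because every judged sample costs an API call, it may be worth checking the response file for the fields the judge script expects before running the command above. The field names come from the Troubleshooting notes in this notebook; `missing_judge_fields` is a hypothetical helper, and the judge script itself may validate differently.

```python
# Field names taken from the Troubleshooting notes of this notebook.
JUDGE_FIELDS = (
    "image_id", "question_type", "question_topic", "image_content",
    "question", "gt_answer", "model_answer",
)


def missing_judge_fields(records):
    """Map record index -> sorted list of judge fields absent from that record.

    Hypothetical helper; records with every field present are omitted.
    """
    missing = {}
    for i, rec in enumerate(records):
        absent = sorted(f for f in JUDGE_FIELDS if f not in rec)
        if absent:
            missing[i] = absent
    return missing
```

An empty dictionary from `missing_judge_fields(records)`, where `records` is the list loaded from OUTPUT, means every record has all seven fields.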


In [None]:
# Quick summary of evaluation results
import json
from statistics import mean
with open(EVAL_JSON, 'r', encoding='utf-8') as f:
    eval_recs = json.load(f)
ratings = [r.get('rating', 0) for r in eval_recs]
hall = [r.get('hallucination', 1) for r in eval_recs]
print('Num samples:', len(eval_recs))
print('Average score:', round(mean(ratings), 2) if ratings else 0)
print('Hallucination rate:', round(mean(hall), 2) if hall else 1.0)
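
The summary above is a single aggregate, and it counts a missing rating as 0 and a missing hallucination flag as 1. A per-question-type breakdown can be sketched as follows, assuming each evaluation record also carries a `question_type` key (an assumption about the judge's output, not something confirmed here). Unlike the cell above, this hypothetical `summarize_by_type` helper skips unrated records instead of scoring them as 0.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_type(eval_recs):
    """Return {question_type: (count, mean rating, hallucination rate)}.

    Assumes records carry 'question_type', 'rating' and 'hallucination';
    records without a rating are skipped rather than counted as 0.
    """
    groups = defaultdict(list)
    for rec in eval_recs:
        if rec.get("rating") is None:
            continue
        groups[rec.get("question_type", "unknown")].append(rec)
    return {
        qtype: (
            len(recs),
            round(mean(r["rating"] for r in recs), 2),
            round(mean(r.get("hallucination", 1) for r in recs), 2),
        )
        for qtype, recs in groups.items()
    }
```

Running `summarize_by_type(eval_recs)` on the records loaded above gives one tuple per question type.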


### Troubleshooting
- If VRAM is insufficient, switch to the 7B model or lower MAX_NEW_TOKENS.
- Dataset schema: the generator expects each record to contain the keys question, image (or image_path, etc.), and optionally image_content.
- The judge script expects each record to contain the fields image_id, question_type, question_topic, image_content, question, gt_answer, model_answer.
- If using URLs for images, set IMAGE_ROOT to '' and ensure records contain HTTP(S) URLs.
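
The interaction between IMAGE_ROOT and the image references in the records can be sketched as below. This is a sketch of one reasonable resolution rule, not the generator's confirmed behavior: `resolve_image_ref` is a hypothetical helper that leaves URLs and absolute paths untouched and joins relative paths onto the root.

```python
import os


def resolve_image_ref(ref, image_root=""):
    """Return the location to load an image from.

    HTTP(S) URLs and absolute paths are returned unchanged; relative
    paths are joined onto image_root when one is given. Hypothetical
    helper; the generator's own resolution logic may differ.
    """
    if ref.lower().startswith(("http://", "https://")):
        return ref
    if os.path.isabs(ref) or not image_root:
        return ref
    return os.path.join(image_root, ref)
```

Under this rule, a record holding a URL resolves to the same URL whatever IMAGE_ROOT is set to, while a relative path such as `a.jpg` only becomes loadable once IMAGE_ROOT points at the directory that contains it.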
