# Qualitative Analysis - Fact Checking with Claude API

This notebook uses the Claude API to check whether the information generated by LLMs under the **realistic** approach is consistent with real-world facts.

In [1]:
import json
import os
import re
import time
import pandas as pd
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()

OUTPUT_DIR = "output"
REALISTIC_MODELS = ["claude-sonnet-real", "kimi-real", "qwen-real"]

client = Anthropic()
print("Claude API client initialized")

Claude API client initialized


## 1. Load Realistic Articles

In [2]:
def extract_tag_value(text, tag):
    """Extract value from XML tag"""
    pattern = f"<{tag}>(.*?)</{tag}>"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None

def load_realistic_articles():
    """Load all articles from realistic approaches"""
    all_articles = []
    
    for model_name in REALISTIC_MODELS:
        json_path = os.path.join(OUTPUT_DIR, model_name, "articles.json")
        
        if os.path.exists(json_path):
            with open(json_path, "r", encoding="utf-8") as f:
                articles = json.load(f)
            
            for i, article in enumerate(articles):
                text = article.get("article", "")
                all_articles.append({
                    "generator": model_name,
                    "article_num": i + 1,
                    "model_name": extract_tag_value(text, "model"),
                    "params": extract_tag_value(text, "params"),
                    "hardware": extract_tag_value(text, "hardware"),
                    "country": extract_tag_value(text, "country"),
                    "year": extract_tag_value(text, "year"),
                    "training": extract_tag_value(text, "training"),
                    "full_text": text
                })
    
    return pd.DataFrame(all_articles)

df = load_realistic_articles()
print(f"Loaded {len(df)} articles from realistic approaches")
df[["generator", "article_num", "model_name", "params", "hardware", "country", "year"]]

Loaded 30 articles from realistic approaches


| | generator | article_num | model_name | params | hardware | country | year |
|---|---|---|---|---|---|---|---|
| 0 | claude-sonnet-real | 1 | GLM-130B | 130 billion parameters | 96 NVIDIA A100 40GB GPUs | China | 2022.0 |
| 1 | claude-sonnet-real | 2 | LinguaFormer-7B | 7.2 billion parameters | NVIDIA A100 GPUs | South Korea | 2023.0 |
| 2 | claude-sonnet-real | 3 | LinguaNet-7B | 7.2 billion parameters | NVIDIA A100 80GB GPUs | South Korea | 2023.0 |
| 3 | claude-sonnet-real | 4 | DeepSeek-V2 | 236 billion parameters | 2048 NVIDIA H800 GPUs | China | 2024.0 |
| 4 | claude-sonnet-real | 5 | LinguaFormer-7B | 7.2 billion parameters | 128 NVIDIA A100 GPUs | South Korea | 2023.0 |
| 5 | claude-sonnet-real | 6 | LinguaFormer-XL | 13.7 billion parameters | NVIDIA A100 80GB GPUs | Germany | 2023.0 |
| 6 | claude-sonnet-real | 7 | DeepMind Chinchilla | 70 billion parameters | TPU v4 | United Kingdom | 2022.0 |
| 7 | claude-sonnet-real | 8 | GLM-130B | 130 billion parameters | 96 NVIDIA A100 80GB GPUs | China | 2022.0 |
| 8 | claude-sonnet-real | 9 | DeepSeq-T5X | 11 billion parameters | NVIDIA A100 GPUs | Canada | 2023.0 |
| 9 | claude-sonnet-real | 10 | PaLM-2 | 340 billion parameters | TPU v4 | United States | 2023.0 |
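
The tag extraction used by the loader can be tried on its own. The sketch below repeats `extract_tag_value` (with `re.escape` added as a precaution) and runs it on a made-up snippet; the sample text and the model name `ExampleNet-1B` are hypothetical and do not come from the generated articles.

```python
import re

def extract_tag_value(text, tag):
    """Return the stripped contents of the first <tag>...</tag> pair, or None."""
    pattern = f"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    # re.DOTALL lets the value span several lines.
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None

# Hypothetical article snippet, for illustration only.
sample = """
<model>ExampleNet-1B</model>
<params>
  1 billion parameters
</params>
<year>2023</year>
"""

fields = {tag: extract_tag_value(sample, tag)
          for tag in ("model", "params", "year", "country")}
print(fields)
# {'model': 'ExampleNet-1B', 'params': '1 billion parameters', 'year': '2023', 'country': None}
```

Two details are worth noting: every extracted value is a string (including the year), and a missing tag yields `None` rather than an empty string.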


## 2. Fact-Checking Function with Claude

In [3]:
FACT_CHECK_PROMPT = """You are a fact-checker for AI research information. Analyze the following information extracted from a generated scientific article and verify its accuracy.

Information to verify:
- Model name: {model_name}
- Parameters: {params}
- Hardware: {hardware}
- Training duration: {training}
- Country: {country}
- Year: {year}

For each piece of information, determine:
1. Is the model name real or fictional?
2. If real, are the parameters approximately correct?
3. Is the hardware temporally coherent (was it available in that year)?
4. Is the country attribution correct (if model is real)?
5. Is the year plausible for this model?

Return your analysis in the following JSON format:
{{
  "model_status": "real" or "fictional" or "based_on_real",
  "model_comment": "brief explanation",
  "params_coherent": true or false,
  "params_comment": "brief explanation",
  "hardware_coherent": true or false,
  "hardware_comment": "brief explanation",
  "country_coherent": true or false or "unknown",
  "country_comment": "brief explanation",
  "year_coherent": true or false,
  "year_comment": "brief explanation",
  "overall_score": 1 to 5 (1=many errors, 5=fully coherent),
  "summary": "one sentence summary"
}}

IMPORTANT: Return ONLY the JSON, no additional text."""

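def fact_check_article(row):
    """Ask Claude to fact-check the fields extracted from one article.

    Returns the parsed JSON verdict as a dict, or {"error": "..."} if the
    API call or the JSON parsing fails.
    """
    # Missing tags are None; present them to the model as "Not specified".
    prompt = FACT_CHECK_PROMPT.format(
        model_name=row["model_name"] or "Not specified",
        params=row["params"] or "Not specified",
        hardware=row["hardware"] or "Not specified",
        training=row["training"] or "Not specified",
        country=row["country"] or "Not specified",
        year=row["year"] or "Not specified"
    )
    
    try:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}]
        )
        
        content = response.content[0].text.strip()
        
        # Strip a Markdown code fence if the model wrapped its JSON in one.
        if content.startswith("```json"):
            content = content[7:]
        if content.startswith("```"):
            content = content[3:]
        if content.endswith("```"):
            content = content[:-3]
        
        return json.loads(content.strip())
    
    except Exception as e:
        # Record the failure instead of raising so the batch loop can continue.
        return {"error": str(e)}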

print("Fact-checking function ready")

Fact-checking function ready
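
The fence stripping in `fact_check_article` handles replies wrapped in a Markdown code fence, but a reply that surrounds the JSON with a sentence of prose would still fail to parse. A slightly more tolerant parser could look like the sketch below. `parse_json_reply` is a hypothetical helper, not part of the notebook, and it assumes the reply contains exactly one JSON object.

```python
import json
import re

def parse_json_reply(content):
    """Parse a model reply that should contain a single JSON object."""
    text = content.strip()
    # Drop a leading ```json / ``` fence and a trailing ``` fence, if present.
    text = re.sub(r"^```(?:json)?\s*", "", text)
    text = re.sub(r"\s*```$", "", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span in case prose surrounds the JSON.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start:end + 1])

fenced = '```json\n{"overall_score": 4}\n```'
chatty = 'Here is the analysis: {"overall_score": 2} Let me know if you need more.'
print(parse_json_reply(fenced))  # {'overall_score': 4}
print(parse_json_reply(chatty))  # {'overall_score': 2}
```

The fallback is deliberately simple: it takes everything between the first `{` and the last `}`, so it would not cope with a reply containing several separate JSON objects.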


## 3. Run Fact-Checking on All Articles

⚠️ This makes one API call per article and takes a few minutes.

In [4]:
results = []

for _, row in df.iterrows():
    print(f"Checking {row['generator']} - Article {row['article_num']}...", end=" ")
    
    result = fact_check_article(row)
    result["generator"] = row["generator"]
    result["article_num"] = row["article_num"]
    result["model_name"] = row["model_name"]
    results.append(result)
    
    if "error" in result:
        print(f"ERROR: {result['error']}")
    else:
        print(f"Score: {result.get('overall_score', 'N/A')}/5")
    
    time.sleep(1)  # Rate limiting

df_results = pd.DataFrame(results)
print(f"\nFact-checking complete: {len(df_results)} articles analyzed")

Checking claude-sonnet-real - Article 1... Score: 5/5
Checking claude-sonnet-real - Article 2... Score: 3/5
Checking claude-sonnet-real - Article 3... Score: 3/5
Checking claude-sonnet-real - Article 4... Score: 5/5
Checking claude-sonnet-real - Article 5... Score: 3/5
Checking claude-sonnet-real - Article 6... Score: 3/5
Checking claude-sonnet-real - Article 7... Score: 5/5
Checking claude-sonnet-real - Article 8... Score: 5/5
Checking claude-sonnet-real - Article 9... Score: 2/5
Checking claude-sonnet-real - Article 10... Score: 4/5
Checking kimi-real - Article 1... Score: 2/5
Checking kimi-real - Article 2... Score: 2/5
Checking kimi-real - Article 3... Score: 2/5
Checking kimi-real - Article 4... Score: 2/5
Checking kimi-real - Article 5... Score: 3/5
Checking kimi-real - Article 6... Score: 3/5
Checking kimi-real - Article 7... Score: 2/5
Checking kimi-real - Article 8... Score: 5/5
Checking kimi-real - Article 9... Score: 2/5
Checking kimi-real - Article 10... Score: 2/5
Checking
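
The loop above pauses for a fixed second between calls and records any failure as `{"error": ...}` without retrying. One possible extension is to retry failed checks with exponential backoff. The sketch below is an assumption about how that could be done, not the notebook's method: `call_with_retry` is a hypothetical helper, and `flaky_check` stands in for `fact_check_article` so the example runs without API access.

```python
import time

def call_with_retry(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args) until it returns a dict without an "error" key.

    Waits base_delay * 2**attempt seconds between attempts and returns the
    last result if every attempt fails.
    """
    result = {"error": "not attempted"}
    for attempt in range(max_attempts):
        result = fn(*args)
        if "error" not in result:
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return result

# Stand-in for fact_check_article: fails twice, then succeeds.
calls = []
def flaky_check(row):
    calls.append(row)
    if len(calls) < 3:
        return {"error": "simulated failure"}
    return {"overall_score": 4}

waits = []  # capture the delays instead of actually sleeping
result = call_with_retry(flaky_check, {"model_name": "ExampleNet-1B"}, sleep=waits.append)
print(result, waits)  # {'overall_score': 4} [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the helper testable; in the notebook loop the default `time.sleep` would apply.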

## 4. Results Overview

In [5]:
display_cols = ["generator", "article_num", "model_name", "model_status", 
                "hardware_coherent", "country_coherent", "year_coherent", "overall_score"]

df_display = df_results[[c for c in display_cols if c in df_results.columns]]
df_display

| | generator | article_num | model_name | model_status | hardware_coherent | country_coherent | year_coherent | overall_score |
|---|---|---|---|---|---|---|---|---|
| 0 | claude-sonnet-real | 1 | GLM-130B | real | True | True | True | 5 |
| 1 | claude-sonnet-real | 2 | LinguaFormer-7B | fictional | True | unknown | True | 3 |
| 2 | claude-sonnet-real | 3 | LinguaNet-7B | fictional | True | unknown | True | 3 |
| 3 | claude-sonnet-real | 4 | DeepSeek-V2 | real | True | True | True | 5 |
| 4 | claude-sonnet-real | 5 | LinguaFormer-7B | fictional | True | unknown | True | 3 |
| 5 | claude-sonnet-real | 6 | LinguaFormer-XL | fictional | True | unknown | True | 3 |
| 6 | claude-sonnet-real | 7 | DeepMind Chinchilla | real | True | True | True | 5 |
| 7 | claude-sonnet-real | 8 | GLM-130B | real | True | True | True | 5 |
| 8 | claude-sonnet-real | 9 | DeepSeq-T5X | fictional | True | unknown | True | 2 |
| 9 | claude-sonnet-real | 10 | PaLM-2 | real | True | True | True | 4 |


## 5. Summary by Generator

In [6]:
summary_data = []

for generator in REALISTIC_MODELS:
    subset = df_results[df_results["generator"] == generator]
    
    # Check for errors
    if "error" in subset.columns:
        errors = subset["error"].notna().sum()
        valid = subset[subset["error"].isna()]
    else:
        errors = 0
        valid = subset
    
    if len(valid) > 0:
        avg_score = valid["overall_score"].mean() if "overall_score" in valid.columns else 0
        
        real_models = (valid["model_status"] == "real").sum() if "model_status" in valid.columns else 0
        fictional = (valid["model_status"] == "fictional").sum() if "model_status" in valid.columns else 0
        
        # Count only explicit True values, so a stray "unknown" string cannot break the sum.
        hw_ok = (valid["hardware_coherent"] == True).sum() if "hardware_coherent" in valid.columns else 0
        country_ok = (valid["country_coherent"] == True).sum() if "country_coherent" in valid.columns else 0
        year_ok = (valid["year_coherent"] == True).sum() if "year_coherent" in valid.columns else 0
    else:
        avg_score = 0
        real_models = fictional = hw_ok = country_ok = year_ok = 0
    
    summary_data.append({
        "Generator": generator,
        "Avg Score": f"{avg_score:.1f}/5",
        "Real Models": int(real_models),
        "Fictional Models": int(fictional),
        "Hardware OK": int(hw_ok),
        "Country OK": int(country_ok),
        "Year OK": int(year_ok),
        "Errors": int(errors)
    })

df_summary = pd.DataFrame(summary_data)
df_summary

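| | Generator | Avg Score | Real Models | Fictional Models | Hardware OK | Country OK | Year OK | Errors |
|---|---|---|---|---|---|---|---|---|
| 0 | claude-sonnet-real | 3.8/5 | 5 | 5 | 10 | 5 | 10 | 0 |
| 1 | kimi-real | 2.5/5 | 1 | 8 | 9 | 1 | 9 | 0 |
| 2 | qwen-real | 2.0/5 | 0 | 10 | 8 | 0 | 5 | 0 |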
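
The "Country OK" column counts only `True` verdicts, so an `"unknown"` verdict (returned when the model is fictional) is indistinguishable from a genuine mismatch. A three-way tally keeps them apart. The sketch below is one way to do it; `tally_flag` is a hypothetical helper, and the input list is the `country_coherent` column for the ten `claude-sonnet-real` articles shown in the Results Overview.

```python
def tally_flag(values):
    """Count coherent, incoherent and undetermined verdicts separately."""
    counts = {"ok": 0, "issue": 0, "unknown": 0}
    for v in values:
        if v is None or isinstance(v, str):
            # "unknown" (or any other non-boolean answer) is not a failure.
            counts["unknown"] += 1
        elif bool(v):
            counts["ok"] += 1
        else:
            counts["issue"] += 1
    return counts

# country_coherent for claude-sonnet-real, articles 1 to 10.
country = [True, "unknown", "unknown", True, "unknown",
           "unknown", True, True, "unknown", True]
print(tally_flag(country))  # {'ok': 5, 'issue': 0, 'unknown': 5}
```

Read this way, the 5/10 for `claude-sonnet-real` reflects five fictional models whose country could not be checked, not five wrong attributions.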


## 6. Detailed Comments from Claude

In [7]:
for generator in REALISTIC_MODELS:
    print(f"\n{'='*60}")
    print(f"{generator}")
    print(f"{'='*60}")
    
    subset = df_results[df_results["generator"] == generator]
    
    for _, row in subset.iterrows():
        print(f"\n--- Article {row['article_num']}: {row.get('model_name', 'N/A')} ---")
        
        if "error" in row and pd.notna(row.get("error")):
            print(f"  ERROR: {row['error']}")
            continue
        
        print(f"  Model: {row.get('model_status', 'N/A')} - {row.get('model_comment', '')}")
        print(f"  Params: {'OK' if row.get('params_coherent') else 'ISSUE'} - {row.get('params_comment', '')}")
        print(f"  Hardware: {'OK' if row.get('hardware_coherent') else 'ISSUE'} - {row.get('hardware_comment', '')}")
        print(f"  Country: {'OK' if row.get('country_coherent') == True else 'ISSUE/Unknown'} - {row.get('country_comment', '')}")
        print(f"  Year: {'OK' if row.get('year_coherent') else 'ISSUE'} - {row.get('year_comment', '')}")
        print(f"  Score: {row.get('overall_score', 'N/A')}/5")
        print(f"  Summary: {row.get('summary', '')}")


claude-sonnet-real

--- Article 1: GLM-130B ---
  Model: real - GLM-130B is a real large language model developed by Tsinghua University and Zhipu AI
  Params: OK - 130 billion parameters is correct for GLM-130B
  Hardware: OK - NVIDIA A100 40GB GPUs were available in 2022 and commonly used for large model training
  Country: OK - GLM-130B was developed in China by Tsinghua University and Zhipu AI
  Year: OK - GLM-130B was released in 2022, making the timeline accurate
  Score: 5/5
  Summary: All provided information about GLM-130B appears to be factually accurate and temporally coherent.

--- Article 2: LinguaFormer-7B ---
  Model: fictional - LinguaFormer-7B is not a known real language model - appears to be a fictional name combining 'Lingua' with the Transformer architecture naming convention
  Params: OK - 7.2 billion parameters is realistic and coherent for a 7B model designation in 2023
  Hardware: OK - NVIDIA A100 GPUs were widely available and commonly used for large model tr

## 7. Save Results

In [8]:
# Save detailed results
df_results.to_json("fact_check_results.json", orient="records", indent=2)
print("Results saved to fact_check_results.json")

# Save summary
df_summary.to_csv("fact_check_summary.csv", index=False)
print("Summary saved to fact_check_summary.csv")

Results saved to fact_check_results.json
Summary saved to fact_check_summary.csv


## 8. Conclusion

In [9]:
print("QUALITATIVE ANALYSIS CONCLUSION")
print("=" * 70)

for _, row in df_summary.iterrows():
    generator = row["Generator"]
    score = row["Avg Score"]
    # Use the actual article count per generator instead of assuming 10.
    total = int((df_results["generator"] == generator).sum())
    
    print(f"\n{generator}:")
    print(f"  Average coherence score: {score}")
    print(f"  Real models used: {row['Real Models']}/{total}")
    print(f"  Fictional models: {row['Fictional Models']}/{total}")
    print(f"  Hardware temporally coherent: {row['Hardware OK']}/{total}")
    print(f"  Country attribution correct: {row['Country OK']}/{total}")
    print(f"  Year plausible: {row['Year OK']}/{total}")

QUALITATIVE ANALYSIS CONCLUSION

claude-sonnet-real:
  Average coherence score: 3.8/5
  Real models used: 5/10
  Fictional models: 5/10
  Hardware temporally coherent: 10/10
  Country attribution correct: 5/10
  Year plausible: 10/10

kimi-real:
  Average coherence score: 2.5/5
  Real models used: 1/10
  Fictional models: 8/10
  Hardware temporally coherent: 9/10
  Country attribution correct: 1/10
  Year plausible: 9/10

qwen-real:
  Average coherence score: 2.0/5
  Real models used: 0/10
  Fictional models: 10/10
  Hardware temporally coherent: 8/10
  Country attribution correct: 0/10
  Year plausible: 5/10
