# Chain-of-Thought Enhanced Fake News Detection with Web-Based Evidence Retrieval


This notebook builds a fake-news detection pipeline that combines a Large Language Model (LLM) with
web-based evidence retrieval. We will: (1) retrieve relevant evidence via the Google Custom Search API,
(2) summarize and rank the evidence, (3) prompt an OpenAI GPT model with a chain-of-thought prompt that includes the evidence, and (4) output a final true/false label along with the
model's reasoning. The solution is split into modules to keep the code organized. Relevant
references are cited throughout.


## Setup and Dependencies
We install and import necessary libraries. We use the OpenAI API for chain-of-thought reasoning, the
Google API client for search, Hugging Face Transformers for summarization, and SentenceTransformers for semantic similarity ranking. We also configure API keys (placeholders shown below).


In [None]:
!pip install openai tiktoken google-api-python-client transformers sentence-transformers





In [None]:

import os
import openai
import tiktoken
from googleapiclient.discovery import build
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

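# Read API keys from environment variables rather than hard-coding them in the notebook
openai.api_key = os.environ.get("OPENAI_API_KEY", "")
google_api_key = os.environ.get("GOOGLE_API_KEY", "")
google_cse_id = os.environ.get("GOOGLE_CSE_ID", "")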

## Tools Used
* OpenAI API: We use openai.chat.completions.create() with models like "gpt-4o-mini".
* Google Custom Search API: We use googleapiclient.discovery.build as in [StackOverflow] to perform web searches.
* Hugging Face Transformers: A pipeline("summarization") can quickly summarize text snippets (the summarization cell in this notebook calls the OpenAI API instead).
* Sentence-Transformers: We use a model like all-MiniLM-L6-v2 to embed text and compute cosine similarity for ranking evidence.

# 1: Make a Google Search Query from the Claim

In [57]:
def generate_search_query(claim: str, 
                          model: str = "gpt-4o-mini", 
                          temperature: float = 0.0, 
                          max_tokens: int = 50) -> str:
    system_prompt = (
        "You are an expert web‑search query generator. "
        "Your goal is to convert a factual claim into a succinct, "
        "keyword‑rich search query (5–10 words max), "
        "using quoted phrases and Boolean operators if useful. "
        "Only output the query text—no explanations or extra text."
    )
    user_prompt = f"Claim: “{claim}”\nGenerate the best search query for fact‑checking this."

    response = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        max_tokens=max_tokens,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user",   "content": user_prompt}
        ],
    )
    return response.choices[0].message.content.strip().strip("\"")

# 2: Evidence Retrieval from the Web

In [9]:
def google_search(query, api_key, cse_id, num=5):
    """Run a Google Custom Search query and return the raw result items."""
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=query, cx=cse_id, num=num).execute()
    return res.get("items", [])


def retrieve_evidence(query, top_k=5):
    """Return (title, snippet, link) tuples for the top_k search results."""
    results = google_search(query, google_api_key, google_cse_id, num=top_k)
    evidence = []
    for item in results:
        title = item.get('title')
        snippet = item.get('snippet')
        link = item.get('link')
        # Skip results without a text snippet; they carry no usable evidence
        if snippet:
            evidence.append((title, snippet, link))
    return evidence

* We search the entire web, but results can be biased toward reliable news and fact-checking sites by adding filters to the query (e.g. site:politifact.com or site:snopes.com).
* The retrieve_evidence function returns the raw text snippets from the search results, which the rest of the pipeline uses as evidence.
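
The site filters mentioned above could be attached to a query with a small helper. This is only a sketch: `add_site_filters` is a hypothetical name that does not appear elsewhere in this notebook, and how strictly the search engine honors combined `site:` operators has not been verified here.

```python
def add_site_filters(query, sites):
    """Append site: restrictions to a search query (hypothetical helper)."""
    # Drop empty entries so we never emit a bare "site:" operator
    sites = [s.strip() for s in sites if s and s.strip()]
    if not sites:
        return query
    # Join the domains with OR so that a hit on any one of them qualifies
    site_clause = " OR ".join(f"site:{s}" for s in sites)
    return f"{query} {site_clause}"


print(add_site_filters("Great Wall visible from Moon",
                       ["politifact.com", "snopes.com"]))
```

The filtered string can then be passed to retrieve_evidence in place of the plain query.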

In [10]:
retrieve_evidence("trump war")

[('Trump calls for deal on Israel war in Gaza amid signs of progress ...',
  '22 hours ago ... President Trump pleaded for progress in ceasefire talks in the war in Gaza, as Israel and Hamas appeared to be inching closer to an\xa0...',
  'https://www.npr.org/2025/06/29/nx-s1-5450133/trump-gaza-netanyahu'),
 ('HEALTH, EDUCATION, LABOR AND PENSIONS COMMITTEE ...',
  "May 13, 2025 ... Trump's war on science is not making America healthy again. It is making Americans and people throughout the world sicker. 4 Interview with\xa0...",
  'https://www.sanders.senate.gov/wp-content/uploads/HELP-Committee-Minority-Report-Trumps-War-on-Science.pdf'),
 ('Trump, Congress, and the War Powers Resolution | The New Yorker',
  "2 days ago ... These worries have intensified in debates about the legality of President Trump's decision to bomb Iranian nuclear facilities more than a week\xa0...",
  'https://www.newyorker.com/magazine/2025/07/07/trump-congress-and-the-war-powers-resolution'),
 ('Senate votes d

# 3: Summarization and Ranking of Evidence

In [30]:
def count_tokens(text, model="gpt-4o-mini"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

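def chunk_text(text, max_tokens=3500, model="gpt-4o-mini"):
    """Split text on whitespace into chunks of roughly max_tokens tokens each."""
    enc = tiktoken.encoding_for_model(model)
    words = text.split()
    chunks, current = [], []
    count = 0
    for word in words:
        n = len(enc.encode(word))
        # Start a new chunk when adding this word would exceed the budget.
        # The `current` check avoids emitting an empty chunk for an oversized first word.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(word)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks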

def openai_summarize(text, max_tokens=150):
    chunks = chunk_text(text)
    summaries = []
    for chunk in chunks:
        resp = openai.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.3,
            max_tokens=max_tokens,
            messages=[
                {"role": "system", "content": "You are a helpful assistant that summarizes text."},
                {"role": "user", "content": f"Summarize this in 2‑3 sentences:\n\n{chunk}"}
            ]
        )
        summaries.append(resp.choices[0].message.content.strip())
    return " ".join(summaries)

def summarize_snippets(evidence):
    summaries = []
    for title, snippet, link in evidence:
        summary = openai_summarize(snippet)
        summaries.append((title, summary, link))
    return summaries

## Semantic Similarity Ranking
Next we rank the evidence by semantic similarity to the claim using SentenceTransformers. We encode both the claim and each summary into embeddings and compute cosine similarity. Higher scores mean more relevant evidence.
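
A minimal sketch of this ranking step follows. The function names (`cosine`, `rank_evidence`, `sentence_transformer_encoder`) are assumptions introduced for illustration and do not appear elsewhere in the notebook. The encoder is passed in as a callable so the ranking logic can be tried without downloading a model, and cosine similarity is computed in plain Python for clarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_evidence(claim, summaries, encode):
    """Sort (title, summary, link) tuples by similarity to the claim.

    `encode` maps a list of strings to a list of vectors.
    Returns (score, item) pairs, most similar first.
    """
    vectors = encode([claim] + [summary for _, summary, _ in summaries])
    claim_vec, summary_vecs = vectors[0], vectors[1:]
    scored = [(cosine(claim_vec, vec), item)
              for vec, item in zip(summary_vecs, summaries)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def sentence_transformer_encoder(model_name="all-MiniLM-L6-v2"):
    """Build an `encode` callable backed by SentenceTransformers."""
    # Imported lazily so the ranking code above works without the library
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(texts).tolist()


# Possible usage inside the pipeline (not executed here):
# encode = sentence_transformer_encoder()
# ranked = rank_evidence(claim, summarized, encode)
# top_summaries = [item for _, item in ranked[:3]]
```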

# 4: Chain-of-Thought Reasoning with GPT

With evidence in hand, we now prompt an OpenAI model (`gpt-4o-mini` in the cell below) to reason about the claim.
We use a chain-of-thought prompt, asking the model to reason step by step so that its reasoning is
explicit. We include the top three evidence summaries in the prompt to ground the reasoning.


In [31]:
def ask_gpt_chain_of_thought(claim, top_summaries):
    # Prepare evidence text block
    evidence_block = "\n".join(f"- \"{summary}\" (Source: {title})" for title, summary, link in top_summaries[:3])
    prompt = (
        f"You are a fact-checking assistant. Evaluate the following claim: \n\n"
        f"\"{claim}\"\n\n"
        f"Use the evidence below to decide if the claim is TRUE or FALSE. "
        f"Provide your reasoning step by step, citing the evidence as needed.\n\n"
        f"Evidence:\n{evidence_block}\n\n"
        f"Question: Is the claim true or false? Explain carefully.\n"
        f"Answer in this format: TRUE/FALSE: REASON"
    )
    messages = [
        {"role": "system", "content": "You are a helpful and precise fact-checking assistant."},
        {"role": "user", "content": prompt}
    ]
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.2
    )
    answer = response.choices[0].message.content.strip()
    return answer


# 5: Final Pipeline

In [65]:
def fact_check_pipeline(claim, top_k=5):
    # 1. Search Query from claim
    query = generate_search_query(claim)
    # 2. Retrieve evidence
    evidence = retrieve_evidence(query, top_k=top_k)
    if not evidence:
        return {"claim": claim, "query": query, "label": "Unknown", "reasoning":"No evidence found."}
    # 3. Summarize snippets
    summarized = summarize_snippets(evidence)
    # 4. Ask GPT to reason
    reasoning = ask_gpt_chain_of_thought(claim, summarized)
    # For simplicity, assume the reply starts with "TRUE:" or "FALSE:" and take the text before the first colon as the label
    label = reasoning.split(":")[0].strip().rstrip(".")
    return {"claim": claim, "query": query, "label": label, "reasoning": reasoning}

# fact_check_pipeline("The Great Wall of China is visible from the Moon.")

# fact_check_pipeline("cristiano ronaldo is brother of ronaldo nazario.")
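
Taking the text before the first colon as the label breaks when the model wraps the label in markdown or starts with a sentence. A slightly more defensive parser might look like the sketch below; `parse_label` is a hypothetical helper, not part of the pipeline above, and the "UNKNOWN" fallback is an assumption about how unparseable replies should be handled.

```python
import re


def parse_label(reasoning):
    """Return 'TRUE', 'FALSE', or 'UNKNOWN' from the start of a model reply."""
    # Skip leading punctuation or whitespace (e.g. "**"), then expect the label word
    match = re.match(r"\W*(true|false)\b", reasoning, flags=re.IGNORECASE)
    return match.group(1).upper() if match else "UNKNOWN"


print(parse_label("FALSE: the evidence contradicts the claim."))
print(parse_label("**True**: supported by two sources."))
print(parse_label("The claim cannot be verified."))
```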


# Evaluation with LIAR Dataset

LIAR Dataset: Contains ~12.8K human-labeled claims from PolitiFact (labels like True, Mostly False, Pants on Fire, etc.). To evaluate the pipeline, LIAR labels can be mapped to binary true/false, and we then measure accuracy or F1. The cells below take the simplest route and keep only the claims labeled exactly true or false.

* The LIAR dataset is available at: https://www.cs.ucsb.edu/~william/data/liar_dataset.zip
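
If the full six-way label set were used instead of only the true/false rows, a mapping to binary labels would be needed. The sketch below uses the six label strings found in the dataset's TSV files. Where to draw the line is an assumption: here `half-true` counts as true, which is a common but not universal choice. `LIAR_TO_BINARY` and `to_binary_label` are hypothetical names.

```python
# Assumed split: the three "truer" labels map to "true", the rest to "false"
LIAR_TO_BINARY = {
    "true": "true",
    "mostly-true": "true",
    "half-true": "true",
    "barely-true": "false",
    "false": "false",
    "pants-fire": "false",
}


def to_binary_label(liar_label):
    """Map a six-way LIAR label to 'true' or 'false'."""
    return LIAR_TO_BINARY[liar_label.strip().lower()]


print(to_binary_label("mostly-true"))
print(to_binary_label("pants-fire"))
```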

In [44]:
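import csv

# Each row of the LIAR TSV is: id, label, statement, followed by metadata columns
liar_dataset = []
with open("datasets/liar/test.tsv", encoding="utf-8", newline="") as fd:
    rd = csv.reader(fd, delimiter="\t", quotechar='"')
    for row in rd:
        liar_dataset.append(row)
# Keep only claims whose label is exactly "true" or "false"
liar_dataset = [data for data in liar_dataset if data[1] in ("true", "false")]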

In [77]:
def evaluate_binary(results, positive_label='true'):
    tp = fp = tn = fn = 0
    for result, y_true in results:
        y_pred = result['label'].strip().lower()
        y_true = y_true.strip().lower()

        if y_true == positive_label:
            if y_pred == positive_label:
                tp += 1
            else:
                fn += 1
        else:
            if y_pred == positive_label:
                fp += 1
            else:
                tn += 1

    total = tp + tn + fp + fn
    accuracy  = (tp + tn) / total if total else 0
    precision = tp / (tp + fp)   if (tp + fp) else 0
    recall    = tp / (tp + fn)   if (tp + fn) else 0
    f1_score  = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0)

    return {
        'accuracy':  accuracy,
        'precision': precision,
        'recall':    recall,
        'f1_score':  f1_score
    }

In [None]:
results = []
for d in liar_dataset:
    results.append((fact_check_pipeline(d[2]), d[1]))

In [None]:
for result, y_true in results[0:10]:
    if y_true != result['label'].lower():
        print(result)

{'claim': 'Over the past five years the federal government has paid out $601 million in retirement and disability benefits to deceased former federal employees.', 'query': 'federal government retirement disability benefits deceased former employees $601 million past five years', 'label': 'FALSE', 'reasoning': "FALSE: The claim states that the federal government has paid out $601 million in retirement and disability benefits to deceased former federal employees over the past five years. However, the evidence indicates that the Federal Government's Civil Service Retirement and Disability program has made payments totaling over $601 million annually, not specifically to deceased employees. This suggests that the total payments include benefits to living retirees and possibly other categories, not exclusively to deceased former employees. Therefore, the claim is misleading and not accurate as stated."}
{'claim': 'Says that Tennessee law requires that schools receive half of proceeds -- $31

0.6

In [78]:
evaluate_binary(results)

{'accuracy': 0.6197183098591549,
 'precision': 0.68,
 'recall': 0.4722222222222222,
 'f1_score': 0.5573770491803278}
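
As a sanity check, the printed metrics are mutually consistent with a confusion matrix of 17 true positives, 8 false positives, 19 false negatives, and 27 true negatives (71 claims). These counts are inferred from the printed ratios as the smallest integers that reproduce them. They are not taken from the notebook output, so treat them as an assumption.

```python
# Inferred counts (assumption): 0.68 = 17/25, 0.4722... = 17/36, 0.6197... = 44/71
tp, fp, fn, tn = 17, 8, 19, 27

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
# F1 is the harmonic mean of precision and recall
f1_score = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1_score)
```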

* As the results show, the pipeline performs poorly on claims that involve numbers.
* Its accuracy falls somewhere between CRH and UFD, but at a much higher cost.


<img src="src/Performance-comparison-on-LIAR-dataset.png">