## **Machine Learning Engineer – LLM Task Assignment**
### **Company:** Artikate Studio, Artikate Private Limited
### **Task Title:** LLM-Powered Fact Checker with Custom Embedding-Based Retrieval
### **Submission Deadline:** 24 November 2025, 7:00 pm
### **Candidate name:** Mitul Srivastava


#### **Summary**

This notebook builds a complete automated fact-checking pipeline powered by:

* RSS Scraping (Press Information Bureau of India)

* Text Cleaning & Chunking

* Sentence Embeddings (SentenceTransformers)

* FAISS Vector Search

* spaCy-based Claim Extraction

* Groq Llama 70B for Final Verdict

* Gradio UI for interactive fact checking

#### **🔍 How It Works (High-Level Workflow)**

1. Data Collection
Scrapes official PIB RSS feeds → extracts titles → stores them as trusted factual statements.

2. Fact Preparation
Cleans, normalizes, chunks statements → encodes them into embeddings → builds a FAISS index for fast retrieval.

3. Claim Extraction
Given a user query, spaCy splits it into sentences, and the first sentence is treated as the main claim.

4. Similarity Search (FAISS)
Retrieves the most relevant government facts based on embedding similarity.

5. LLM Verification (Groq + Llama 70B)
The model receives:

* The claim
* The retrieved factual evidence

It returns a structured JSON verdict (True, False, or Unverifiable) with a confidence score and reasoning.

6. Final User Output
A formatted result with:

* Verdict
* Confidence
* Reasoning
* Top evidence snippets

7. Gradio App
A simple UI lets anyone enter claims and instantly verify them.

### **1. Install Dependencies**

In [None]:
# Install required Python libraries for the project
# - sentence-transformers: For generating sentence embeddings
# - faiss-cpu: For efficient vector search
# - spacy + en_core_web_md: For NLP preprocessing and word vectors
# - gradio: For building an interactive UI
# - pandas: For data manipulation
# - groq: For Groq API integration (LLM inference)
# - huggingface_hub: For downloading models/data from Hugging Face
# - feedparser, beautifulsoup4, requests, lxml: For web scraping & RSS parsing

!pip install -q sentence-transformers faiss-cpu spacy gradio pandas groq huggingface_hub feedparser beautifulsoup4 requests lxml

# Download the medium-sized English model for spaCy
!python -m spacy download en_core_web_md


### **2. Import Libraries**

In [None]:
# Standard library imports
import os
import json
from typing import List, Dict, Tuple
from datetime import datetime
from urllib.parse import urljoin
import re
import time
import textwrap

# Data handling
import pandas as pd
import numpy as np

# NLP and Embeddings
import spacy
from sentence_transformers import SentenceTransformer

# Vector search
import faiss

# LLM / API clients
from groq import Groq
from huggingface_hub import login

# Web parsing & requests
import requests
import feedparser
from bs4 import BeautifulSoup

print("✅ Libraries imported successfully")


### **3. Configure API Keys & Authentication**

In [None]:
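from google.colab import userdata

# Copy Colab secrets into environment variables. Assigning the values (instead of
# leaving a bare expression on the last line) keeps the keys out of the cell output.
for _name in ("HF_TOKEN", "GROQ_API_KEY"):
    os.environ[_name] = userdata.get(_name)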

In [None]:
# -------------------------------------------------------------
# Hugging Face Authentication
# -------------------------------------------------------------

# IMPORTANT:
# Do NOT hardcode private tokens in notebooks.
# Use environment variables or secrets instead.
# "YOUR_HF_TOKEN_HERE" is only a placeholder fallback; supply the real token via HF_TOKEN.
HF_TOKEN = os.getenv("HF_TOKEN", "YOUR_HF_TOKEN_HERE")
os.environ["HF_TOKEN"] = HF_TOKEN
os.environ["HUGGINGFACE_TOKEN"] = HF_TOKEN

print("✅ Hugging Face token set (environment-based)")


# -------------------------------------------------------------
# Groq API Authentication
# -------------------------------------------------------------

# Securely fetch API key from environment or Colab userdata
GROQ_API_KEY = os.getenv("GROQ_API_KEY", "YOUR_GROQ_API_KEY_HERE")

# Google Colab secret support (fallback method)
try:
    from google.colab import userdata
    GROQ_API_KEY = userdata.get("GROQ_API_KEY") or GROQ_API_KEY
except ImportError:
    pass

# Initialize Groq client
client = Groq(api_key=GROQ_API_KEY)

print("✅ Groq API configured successfully")
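
The cell above reports success even when a key is still the placeholder string. A minimal sketch of a fail-fast check follows; `require_secret`, `PLACEHOLDER_PREFIX`, and the `DEMO_API_KEY` variable are hypothetical names introduced here for illustration, and the check assumes placeholders follow the `YOUR_..._HERE` pattern used in this notebook.

```python
import os

# Assumption: placeholder values look like "YOUR_..._HERE", as in the cell above.
PLACEHOLDER_PREFIX = "YOUR_"


def require_secret(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unusable."""
    value = os.getenv(name, "").strip()
    if not value or value.startswith(PLACEHOLDER_PREFIX):
        raise RuntimeError(f"{name} is missing or still set to a placeholder")
    return value


# Demonstration with a throwaway variable so the real keys are not touched.
os.environ["DEMO_API_KEY"] = "YOUR_DEMO_KEY_HERE"
try:
    require_secret("DEMO_API_KEY")
except RuntimeError as err:
    print(f"Configuration problem: {err}")
```

Calling something like `require_secret("GROQ_API_KEY")` before constructing the client would surface a missing key at setup time rather than on the first API request.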


### **4. RSS Fetching Function**

In [None]:
def fetch_pib_rss():
    """
    Fetch RSS entries from multiple PIB (Press Information Bureau) RSS feeds.

    Returns:
        list: A combined list of RSS feed entries from all specified PIB categories.
    """

    # PIB RSS feed URLs (categorized by domain)
    urls = [
        "https://www.pib.gov.in/RssMain.aspx?ModId=6&Lang=1&Regid=3",  # Press Releases
        "https://www.pib.gov.in/RssMain.aspx?ModId=6&Lang=1&Regid=1",  # Finance
        "https://www.pib.gov.in/RssMain.aspx?ModId=6&Lang=1&Regid=2",  # Cabinet
        "https://www.pib.gov.in/RssMain.aspx?ModId=6&Lang=1&Regid=4",  # Health
        "https://www.pib.gov.in/RssMain.aspx?ModId=6&Lang=1&Regid=7",  # Education
        "https://www.pib.gov.in/RssMain.aspx?ModId=6&Lang=1&Regid=9",  # Environment
    ]

    # Custom headers to avoid request blocking by the PIB server
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ),
        "Accept": "application/xml,text/xml,application/xhtml+xml,text/html",
        "Referer": "https://pib.gov.in/",
        "Accept-Language": "en-US,en;q=0.9",
    }

    all_entries = []

    # Iterate through all RSS feed URLs
    for url in urls:
        try:
            print(f"\n🔎 Fetching RSS: {url}")
            response = requests.get(url, headers=headers, timeout=10)
            print("   ↳ Status Code:", response.status_code)

            # Skip error pages (4xx/5xx) instead of trying to parse them as RSS
            response.raise_for_status()

            # Parse RSS content
            feed = feedparser.parse(response.text)

            if feed.entries:
                print(f"   ✓ Found {len(feed.entries)} entries")
                all_entries.extend(feed.entries)
            else:
                print("   ⚠ No entries found in this feed")

        except requests.exceptions.Timeout:
            print(f"   ❌ Timeout error for URL: {url}")
        except requests.exceptions.RequestException as e:
            print(f"   ❌ Request failed for {url}: {e}")
        except Exception as e:
            print(f"   ❌ Unexpected error: {e}")

    return all_entries
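
The function above gives each feed a single attempt and moves on after a timeout. If transient failures turn out to be common, a small retry wrapper with exponential backoff is one option. The sketch below uses only the standard library; `with_retries` and its parameters are hypothetical names, not part of `requests` or `feedparser`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Call `fn` up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller handle the error
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")  # the loop always returns or raises


# Possible use inside fetch_pib_rss (an assumption, not tested against PIB):
#   response = with_retries(
#       lambda: requests.get(url, headers=headers, timeout=10),
#       retry_on=(requests.exceptions.Timeout,),
#   )
```

Restricting `retry_on` to timeout-style errors avoids retrying requests that fail for permanent reasons such as a malformed URL.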


In [None]:
sample_facts = [
    "The Indian government launched PM-KISAN scheme providing ₹6000 annual income support to farmer families in December 2018.",
    "Goods and Services Tax (GST) was implemented in India on July 1, 2017.",
    "India achieved 100 crore COVID-19 vaccinations on October 21, 2021.",
    "The National Education Policy (NEP) 2020 was approved by the Union Cabinet on July 29, 2020.",
    "Ayushman Bharat scheme provides health insurance coverage up to ₹5 lakh per family per year.",
    "India's unemployment rate was 7.8% in November 2024 according to CMIE data.",
    "The Reserve Bank of India kept the repo rate unchanged at 6.5% in December 2024.",
    "India's GDP growth rate for Q2 FY 2024-25 was 6.7% according to NSO data.",
    "The Pradhan Mantri Awas Yojana aims to provide housing for all by 2024.",
    "India launched Chandrayaan-3 successfully on July 14, 2023.",
    "The Indian government announced production-linked incentive scheme for semiconductor manufacturing in December 2021.",
    "Aadhaar has over 134 crore enrollments as of September 2024.",
    "The Swachh Bharat Mission achieved 100% village ODF status in October 2019.",
    "India's foreign exchange reserves stood at $622 billion as of November 2024.",
    "The government launched ONDC (Open Network for Digital Commerce) in April 2022.",
    "India's renewable energy capacity reached 180 GW in November 2024.",
    "The National Hydrogen Mission was launched in August 2021.",
    "PM Fasal Bima Yojana provides crop insurance to farmers with subsidized premiums.",
    "India's EV sales grew by 50% in 2024 compared to 2023.",
    "The government reduced corporate tax rate to 22% for domestic companies in September 2019.",
    "Digital transactions in India crossed 13,462 crore in value during FY 2023-24.",
    "India's edible oil imports were 165 lakh tonnes in FY 2023-24.",
    "The National Rail Plan aims to create a future-ready railway system by 2030.",
    "India's per capita income increased to ₹1,72,000 in FY 2023-24.",
    "The government launched PM SVANidhi scheme for street vendors in June 2020.",
    "India exported pharmaceuticals worth $27.9 billion in FY 2023-24.",
    "The Production Linked Incentive (PLI) scheme covers 14 sectors.",
    "India's solar power capacity reached 81 GW as of October 2024.",
    "The Smart Cities Mission was launched in June 2015 covering 100 cities.",
    "India's startup ecosystem is the third-largest globally.",
    "The government extended free food grain scheme (PMGKAY) till December 2024.",
    "India's merchandise exports reached $437 billion in FY 2023-24.",
    "The National Logistics Policy was launched in September 2022.",
    "India's installed power generation capacity reached 442 GW in November 2024.",
    "The government launched e-Shram portal for unorganized workers in August 2021.",
    "India's automobile production was 255 lakh vehicles in FY 2023-24.",
    "The PM-KUSUM scheme promotes solar pumps for farmers.",
    "India's services sector accounts for 55% of GDP as of 2024.",
    "The government approved National Green Hydrogen Mission with ₹19,744 crore outlay.",
    "India's internet users crossed 90 crore in 2024.",
    "The National Infrastructure Pipeline envisages ₹111 lakh crore investment by 2025.",
    "India's defense budget for FY 2024-25 is ₹6.21 lakh crore.",
    "The government launched Skill India Digital platform in February 2023.",
    "India's coal production reached 997 million tonnes in FY 2023-24.",
    "The PM Vishwakarma scheme provides support to traditional artisans.",
    "India's crude oil production was 29.7 million tonnes in FY 2023-24.",
    "The National Medical Commission replaced Medical Council of India in September 2020.",
    "India's diamond exports were $23.7 billion in FY 2023-24.",
    "The government launched Atal Innovation Mission to promote innovation.",
    "India's urban population is expected to reach 60 crore by 2031."
]

### **5. RSS Scraping & Fact Extraction**

In [None]:
def scrape_pib_rss(num_facts=100):
    """
    Extracts titles and links from PIB RSS feeds,
    and appends manually created sample facts.

    Args:
        num_facts (int): Maximum number of facts to return

    Returns:
        Tuple[List[str], List[str]]:
            - List of fact titles
            - List of corresponding links
    """

    all_facts = []
    all_links = []

    # 1. Fetch entries from all PIB RSS feeds
    entries = fetch_pib_rss()

    # 2. Extract title and link from each RSS entry
    for entry in entries:
        try:
            title = entry.title.strip()
            link = entry.link.strip()
        except AttributeError:
            # Skip entries missing a title or link
            continue

        all_facts.append(title)
        all_links.append(link)

    # 3. Append manually created sample facts (must exist in notebook)
    try:
        for fact in sample_facts:
            all_facts.append(fact)
            all_links.append("N/A")
    except NameError:
        print("⚠ sample_facts is not defined. Skipping manual facts.")

    # 4. Return limited number of results
    return all_facts[:num_facts], all_links[:num_facts]
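
Because several feeds are merged, the same press release title may appear more than once (an assumption; the notebook does not check for this). A minimal sketch of order-preserving deduplication, using a hypothetical helper `dedupe_facts` that compares titles case-insensitively after collapsing whitespace:

```python
from typing import List, Tuple


def dedupe_facts(facts: List[str], links: List[str]) -> Tuple[List[str], List[str]]:
    """Drop repeated facts while keeping the first occurrence and its link."""
    seen = set()
    unique_facts: List[str] = []
    unique_links: List[str] = []
    for fact, link in zip(facts, links):
        # Normalise case and whitespace so trivially different copies match
        key = " ".join(fact.lower().split())
        if key in seen:
            continue
        seen.add(key)
        unique_facts.append(fact)
        unique_links.append(link)
    return unique_facts, unique_links


demo_facts = ["GST launched", "gst  Launched ", "NEP approved"]
demo_links = ["a", "b", "c"]
print(dedupe_facts(demo_facts, demo_links))
```

Applying such a step before the `[:num_facts]` slice would also keep duplicates from crowding the manual sample facts out of the returned list.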


### **6. Create DataFrame**

In [None]:
# ==========================================
# Cell 6: Convert Facts to DataFrame
# ==========================================

facts, links = scrape_pib_rss(num_facts=100)

df_facts = pd.DataFrame({
    "id": range(len(facts)),
    "statement": facts,
    "link": links,  # source URL for RSS items, "N/A" for the manual sample facts
    "source": "PIB/Government",
    "date_added": datetime.now().strftime("%Y-%m-%d")
})

df_facts.head()


### **7. Save & Reload Dataset**

In [None]:
# -------------------------------------------------------------
# Save extracted facts to CSV & reload to verify
# -------------------------------------------------------------

csv_filename = 'verified_facts_database.csv'
df_facts.to_csv(csv_filename, index=False, encoding='utf-8')

print("\n📊 Dataset Statistics:")
print("   Total records saved:", len(df_facts))

# Verify by reloading
df_loaded = pd.read_csv(csv_filename)
print(f"✅ Loaded {len(df_loaded)} facts from CSV")


### **8. Embedding + FAISS Indexing**

In [None]:

# ------------------------------------------------------------
# 1. Load Sentence Transformer embedding model
# ------------------------------------------------------------

# NOTE:
# The newer SentenceTransformer versions no longer use `use_auth_token`
# If HF token is required, use: login(os.environ["HF_TOKEN"])
embedding_model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2"
)

print("✅ Loaded embedding model")


# ------------------------------------------------------------
# 2. Text Chunking Utility
# ------------------------------------------------------------
def chunk_text(text, max_len=300):
    """
    Split text into smaller chunks to ensure better semantic embedding.

    Args:
        text (str): Input full text.
        max_len (int): Maximum chunk length.

    Returns:
        list[str]: List of chunked strings.
    """
    text = text.replace("\n", " ").strip()
    return textwrap.wrap(text, max_len)


# ------------------------------------------------------------
# 3. Extract & Chunk Text From DataFrame
# ------------------------------------------------------------
fact_texts = []
for statement in df_loaded["statement"].tolist():
    chunks = chunk_text(statement)
    fact_texts.extend(chunks)

print(f"📄 Total chunks after splitting: {len(fact_texts)}")


# ------------------------------------------------------------
# 4. Create Embeddings
# ------------------------------------------------------------
print("🔄 Encoding chunks into embeddings...")
embeddings = embedding_model.encode(
    fact_texts,
    show_progress_bar=True
)

embeddings = np.array(embeddings).astype("float32")
print(f"🧠 Embedding shape: {embeddings.shape}")


# ------------------------------------------------------------
# 5. Build FAISS Index
# ------------------------------------------------------------
dimension = embeddings.shape[1]
faiss_index = faiss.IndexFlatL2(dimension)
faiss_index.add(embeddings)

print(f"✅ FAISS index created with {faiss_index.ntotal} vectors of {dimension} dims")
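
Once statements are split into chunks, the position of a chunk in `fact_texts` no longer matches a row in the DataFrame, so a retrieved chunk cannot be traced back to its source record. One way to keep that link is to store a parallel list of row indices. The sketch below is standard-library only; `chunk_with_ids` is a hypothetical helper that mirrors the `textwrap.wrap` chunking used above.

```python
import textwrap
from typing import List, Tuple


def chunk_with_ids(statements: List[str], max_len: int = 300) -> Tuple[List[str], List[int]]:
    """Chunk each statement and record which input row every chunk came from."""
    chunks: List[str] = []
    row_ids: List[int] = []
    for row_id, statement in enumerate(statements):
        # Collapse newlines and repeated spaces before wrapping
        text = " ".join(str(statement).split())
        for piece in textwrap.wrap(text, max_len):
            chunks.append(piece)
            row_ids.append(row_id)
    return chunks, row_ids


demo_statements = ["A short fact.", "word " * 100]
demo_chunks, demo_row_ids = chunk_with_ids(demo_statements, max_len=120)
print(len(demo_chunks), demo_row_ids)
```

With such a mapping, a FAISS hit at position `i` could be resolved to its original row through `row_ids[i]`, for example to show the source link next to the evidence.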


### **9. Relevant Fact Retrieval**

In [None]:
# --------------------------------------------------------
# Retrieve Top-K Relevant Facts Using FAISS Vector Search
# --------------------------------------------------------
def retrieve_relevant_facts(query: str, top_k: int = 5):
    """
    Encodes a user query into an embedding, searches the FAISS index,
    and returns the top-k most similar fact chunks.

    Parameters
    ----------
    query : str
        The user query / claim to match against the knowledge base.
    top_k : int
        Number of top relevant facts to return.

    Returns
    -------
    List[Tuple[str, float]]
        A list of (fact_text, similarity_score).
    """

    # 1. Encode the input query using the same embedding model
    query_embedding = embedding_model.encode([query]).astype("float32")

    # 2. Perform similarity search using FAISS (IndexFlatL2 returns squared L2 distances)
    distances, indices = faiss_index.search(query_embedding, top_k)

    results = []
    for idx, dist in zip(indices[0], distances[0]):
        # FAISS pads with -1 when fewer than top_k vectors exist; skip those
        if 0 <= idx < len(fact_texts):
            # Map squared L2 distance to a bounded score in (0, 1]
            similarity = float(1 / (1 + dist))
            results.append((fact_texts[idx], similarity))

    return results


print("✅ Retrieval system ready")
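
The `1 / (1 + dist)` score above is monotonic in distance but has no direct geometric meaning. If the embeddings are unit-length (an assumption here; it can be requested with `normalize_embeddings=True` in `SentenceTransformer.encode`), the squared L2 distance that `IndexFlatL2` reports relates to cosine similarity by `d² = 2 - 2·cos`. The NumPy-only sketch below checks that identity on random vectors; `normalize_rows` and `squared_l2_to_cosine` are hypothetical helper names.

```python
import numpy as np


def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (guarding against zero vectors)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def squared_l2_to_cosine(sq_dist) -> np.ndarray:
    """For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), so cos = 1 - d^2 / 2."""
    return 1.0 - np.asarray(sq_dist, dtype=np.float64) / 2.0


rng = np.random.default_rng(0)
demo_facts = normalize_rows(rng.normal(size=(4, 8)))   # stand-in for fact embeddings
demo_query = normalize_rows(rng.normal(size=(1, 8)))   # stand-in for a query embedding

sq_dist = ((demo_facts - demo_query) ** 2).sum(axis=1)  # what IndexFlatL2 would report
cosine_direct = demo_facts @ demo_query[0]
cosine_from_dist = squared_l2_to_cosine(sq_dist)

print(np.allclose(cosine_direct, cosine_from_dist))
```

A cosine score in [-1, 1] may be easier to threshold and to explain in the evidence list than the bounded inverse-distance score, though either ranks results in the same order.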


### **10. Claim Extraction (spaCy)**

In [None]:
# Load spaCy medium English model
nlp = spacy.load("en_core_web_md")


def extract_claims(text: str) -> List[Dict]:
    """
    Extracts meaningful claims from a given text.
    For each sentence, the function returns:
        - the sentence text
        - named entities (grouped by type)
        - key nouns, verbs, and proper nouns (lemmatized)
    """

    doc = nlp(text)
    claims = []

    for sent in doc.sents:
        # Reuse the parsed sentence span instead of running the pipeline again

        # Extract named entities
        entities = {}
        for ent in sent.ents:
            entities.setdefault(ent.label_, []).append(ent.text)

        # Extract keywords (nouns, proper nouns, verbs)
        keywords = [
            token.lemma_
            for token in sent
            if token.pos_ in {"NOUN", "PROPN", "VERB"} and not token.is_stop
        ]

        claims.append({
            "text": sent.text.strip(),
            "entities": entities,
            "keywords": keywords
        })

    return claims


### **11. LLM Verification**

In [None]:
def verify_claim_with_llm(claim: str, retrieved_facts: List[Tuple[str, float]]) -> Dict:
    """
    Verify a claim using Groq's Llama models.
    Returns a structured JSON verdict with reasoning.
    """

    # Format retrieved evidence as numbered list
    evidence_text = "\n".join([
        f"{i+1}. {fact} (Relevance: {score:.2f})"
        for i, (fact, score) in enumerate(retrieved_facts)
    ])

    # Fact-checking prompt (strict JSON output)
    prompt = f"""
You are a strict fact-checking assistant. Evaluate the claim using ONLY the verified evidence provided.

CLAIM:
"{claim}"

VERIFIED EVIDENCE:
{evidence_text}

TASK:
Determine whether the claim is True, False, or Unverifiable using ONLY the verified evidence.
Respond in **valid JSON only**, no natural language, no formatting:

{{
  "verdict": "True" | "False" | "Unverifiable",
  "confidence": float between 0 and 1,
  "reasoning": "Short explanation",
  "evidence_used": [list of evidence numbers]
}}
""".strip()

    try:
        # Query Groq model
        response = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=[
                {"role": "system", "content": "Return only valid JSON. No prose, no markdown."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1,
            max_tokens=400
        )

        raw_output = response.choices[0].message.content.strip()

        # Clean accidental markdown fences (common in LLM responses)
        if raw_output.startswith("```"):
            raw_output = raw_output.replace("```json", "").replace("```", "").strip()

        # Parse JSON output safely
        result = json.loads(raw_output)

        # Attach actual retrieved evidence text
        result["evidence"] = [fact for fact, _ in retrieved_facts]

        return result

    except Exception as e:
        # Fallback in case model outputs invalid or unparsable JSON
        return {
            "verdict": "Unverifiable",
            "confidence": 0.0,
            "reasoning": f"Model error: {str(e)}",
            "evidence": [fact for fact, _ in retrieved_facts]
        }

print("✅ LLM verification ready (Groq + Llama 3.x)")
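
The fence-stripping step above handles one common failure mode, but a model may also wrap the JSON in prose, return a lowercase verdict, or send the confidence as a string. A minimal sketch of a more defensive parser follows; `parse_verdict` and `ALLOWED_VERDICTS` are hypothetical names, and the normalisation rules are assumptions rather than part of the Groq API.

```python
import json
import math

ALLOWED_VERDICTS = {"True", "False", "Unverifiable"}


def parse_verdict(raw_output: str) -> dict:
    """Extract the outermost JSON object from model output and normalise its fields."""
    start = raw_output.find("{")
    end = raw_output.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")

    data = json.loads(raw_output[start:end + 1])
    if not isinstance(data, dict):
        raise ValueError("model output is not a JSON object")

    # Accept "true", "FALSE", or a JSON boolean; anything else becomes Unverifiable
    verdict = str(data.get("verdict", "")).strip().capitalize()
    data["verdict"] = verdict if verdict in ALLOWED_VERDICTS else "Unverifiable"

    # Coerce confidence to a finite float clamped to [0, 1]
    try:
        confidence = float(data.get("confidence"))
    except (TypeError, ValueError):
        confidence = 0.0
    if not math.isfinite(confidence):
        confidence = 0.0
    data["confidence"] = min(max(confidence, 0.0), 1.0)
    return data


demo_output = 'Here is the result: {"verdict": "true", "confidence": "0.9", "reasoning": "Matches evidence 1"}'
print(parse_verdict(demo_output))
```

Normalising the confidence here would also protect the percentage formatting in the output step, which expects a number.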


### **12. FactChecker Class**

In [None]:
class FactChecker:
    """
    A class that performs claim extraction, retrieval of relevant facts,
    and verification using an LLM (Groq Llama models).
    """

    def __init__(self):
        # Reuse loaded global models and indexes
        self.nlp = nlp
        self.embedding_model = embedding_model
        self.faiss_index = faiss_index

        # Load facts from the CSV-loaded dataframe
        self.facts = df_loaded["statement"].tolist()

    def check_fact(self, input_text: str, top_k: int = 5) -> Dict:
        """
        Extract the main claim from text, retrieve supporting evidence,
        and verify the claim using the LLM.
        """

        # Extract claims using NLP
        claims = extract_claims(input_text)

        # If multiple sentences, check the first claim (main claim heuristic)
        main_claim = claims[0]["text"] if claims else input_text

        # Retrieve relevant facts
        retrieved_facts = retrieve_relevant_facts(main_claim, top_k)

        # Verify the claim using LLM
        result = verify_claim_with_llm(main_claim, retrieved_facts)

        # Add useful metadata
        result["input_text"] = input_text
        result["extracted_claim"] = main_claim
        result["timestamp"] = datetime.now().isoformat()

        return result

    def format_output(self, result: Dict) -> str:
        """
        Format the LLM verification output into a clean, readable message.
        """

        verdict_emoji = {
            "True": "✅",
            "False": "❌",
            "Unverifiable": "❓"
        }

        verdict = result.get("verdict", "Unverifiable")
        emoji = verdict_emoji.get(verdict, "❓")

        # Build formatted output
        output = f"""
{emoji} VERDICT: {verdict}
Confidence: {result.get('confidence', 0):.0%}

📝 REASONING:
{result.get('reasoning', 'No reasoning provided')}

📚 EVIDENCE REVIEWED:
"""
        # Limit displayed evidence to first 3 items
        for i, evidence in enumerate(result.get("evidence", [])[:3], 1):
            output += f"\n{i}. {evidence}"

        return output


# Initialize fact-checker instance
fact_checker = FactChecker()
print("✅ Fact Checker initialized")


### **13. Test Cases**

In [None]:
# ---------------------------------------------
# TESTING FACT CHECKER PIPELINE
# ---------------------------------------------

test_inputs = [
    "The Indian government has announced free electricity to all farmers starting July 2025.",
    "India achieved 100 crore COVID-19 vaccinations in October 2021.",
    "The moon is made of cheese according to NASA."
]

print("\n" + "=" * 80)
print("TESTING FACT CHECKER")
print("=" * 80)

for test_input in test_inputs:
    print(f"\nINPUT: {test_input}")
    print("-" * 80)

    # Run fact-checking pipeline
    result = fact_checker.check_fact(test_input)

    # Print formatted verdict
    print(fact_checker.format_output(result))


### **14. Gradio UI**

In [None]:
import gradio as gr

def check_fact_ui(input_text, num_sources):
    """
    Wrapper for the Gradio UI.
    Takes user input, performs fact checking, and returns formatted output.
    """
    if not input_text.strip():
        return "⚠️ Please enter a claim to verify."

    # Gradio sliders may deliver a float; FAISS expects an integer k
    result = fact_checker.check_fact(input_text, top_k=int(num_sources))
    return fact_checker.format_output(result)


# -----------------------------
# Gradio UI
# -----------------------------
with gr.Blocks(
    theme=gr.themes.Soft(),
    title="LLM Fact Checker"
) as demo:

    gr.Markdown("""
    # 🔍 LLM-Powered Fact Checker
    ### Verify claims against trusted government sources

    **Tech Stack**
    - **Claim Extraction:** spaCy
    - **Embeddings:** all-MiniLM-L6-v2
    - **Vector Search:** FAISS
    - **LLM:** Groq API (Llama 3.x)

    Enter a claim below and the system will extract the main claim, retrieve
    relevant government-verified facts, and generate a structured verdict.
    """)

    with gr.Row():

        # Input Section
        with gr.Column(scale=2):
            input_text = gr.Textbox(
                label="Enter a claim",
                placeholder=(
                    "Example: The Indian government has announced free electricity "
                    "to all farmers starting July 2025."
                ),
                lines=4
            )

            num_sources = gr.Slider(
                minimum=3,
                maximum=10,
                value=5,
                step=1,
                label="Number of evidence sources to retrieve"
            )

            check_btn = gr.Button("🔍 Verify Fact", variant="primary")

        # Output Section
        with gr.Column(scale=2):
            output = gr.Textbox(
                label="Verification Result",
                lines=15
            )

    # Example inputs
    gr.Examples(
        examples=[
            ["The Indian government has announced free electricity to all farmers starting July 2025.", 5],
            ["India achieved 100 crore COVID-19 vaccinations in October 2021.", 5],
            ["GST was implemented in India in July 2017.", 5],
            ["India's GDP growth is 20% in 2024.", 5]
        ],
        inputs=[input_text, num_sources]
    )

    # Button -> Action
    check_btn.click(
        fn=check_fact_ui,
        inputs=[input_text, num_sources],
        outputs=output
    )

demo.launch(share=True, debug=True)


### Thank you