<div style="text-align: justify">

## Section 1. Introduction to the Problem/Task

**The Problem**
Navigating extensive legal and technical documents, such as the Philippine DOLE Occupational Safety and Health Standards (OSHS), presents a significant "information bottleneck." Finding specific compliance metrics, hazard guidelines, or equipment specifications via manual search is inefficient and prone to human error. Furthermore, while standard Large Language Models (LLMs) are highly capable conversational agents, they cannot be trusted with critical safety queries out-of-the-box because they are prone to "hallucinating" technical facts and lack innate knowledge of localized policy documents.

**Purpose and Domain Use Case**
The purpose of this project is to develop an LLM-powered chatbot tailored specifically to the domain of workplace safety policies and manuals. The intended use case is to serve as an interactive safety assistant for safety officers, employers, and workers. Users can query the system in natural language (e.g., "What are the required dimensions for a machine guard?") and the chatbot will instantly retrieve and synthesize the exact procedural guidelines and compliance protocols from the official DOLE OSHS text.

**Real-World Significance**
Building a retrieval-grounded conversational system (utilizing a Retrieval-Augmented Generation, or RAG, pipeline) is critical for this application. By anchoring the LLM's responses to retrieved chunks of the official OSHS document, the system substantially reduces the risk of hallucination and makes each answer traceable to a citable source passage. In a real-world setting, this system accelerates regulatory compliance, democratizes access to dense safety protocols, and ultimately helps mitigate workplace hazards by making accurate safety knowledge quickly accessible.

</div>

## Section 2. Dataset Description

**Knowledge Source and Collection**
The primary knowledge source for this chatbot is the **Occupational Safety and Health Standards (OSHS) As Amended** handbook, issued by the Department of Labor and Employment (DOLE) of the Philippines. The document was acquired as a digital PDF (closed-corpus) and serves as the definitive legal and regulatory baseline for occupational safety in the country. 

**Dataset Structure**
* **Format:** Single PDF document (`Osh-Handbook.pdf`)
* **Domain:** Legal, Regulatory, and Occupational Health & Safety
* **Contents:** The document is highly structured, consisting of hierarchical legal frameworks (Rules, Sections, Sub-sections) alongside dense technical matrices (e.g., Threshold Limit Values for airborne contaminants, medical supply requirements).

**Preprocessing and Data Pipeline**
To ensure the LLM accurately retrieves and contextualizes the legal statutes without hallucination, standard naive chunking was discarded in favor of a **Structure-Aware Processing Pipeline**:

1.  **Document Cleaning:**
    * **Artifact Removal:** Page numbers, headers, and extraneous source tags (e.g., `--- PAGE X ---`) were stripped using Regular Expressions to reduce embedding noise.
    * **Hyphenation Merging:** Words split across line breaks by hyphens (e.g., "equip-ment") were systematically rejoined to maintain semantic integrity during vector search.
2.  **Handling Tables:**
    * Complex tables embedded within the PDF are extracted independently using `pdfplumber`. These tables are converted into Markdown format before embedding to preserve their row-column relationships, ensuring that specific numerical limits and chemical properties remain explicitly linked to their respective entities.
3.  **Structure-Aware Chunking & Metadata Tagging:**
    * The text is strictly partitioned using **Rule Numbers** (e.g., "Rule 1040") as the primary delimiters.
    * **Context Injection:** To prevent orphaned text chunks from losing their legal context, the specific Rule Number and Title are prepended as metadata to every sub-chunk generated from that section.
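
The rule-based partitioning and context injection in step 3 can be sketched in a few lines. This is a minimal illustration on an invented two-rule text, not the notebook's full pipeline; the sample text, the `split_by_rule` helper name, and the bracketed prefix format are assumptions made for the example.

```python
import re

# Hypothetical sample standing in for a cleaned page (titles and text are invented).
SAMPLE = (
    "Rule 1000 Example Title A\n"
    "Sample provision text A.\n"
    "Rule 1040 Example Title B\n"
    "Sample provision text B."
)

def split_by_rule(text):
    """Split text at 'Rule NNNN' headers and prefix each body with its header."""
    # The lookahead keeps each header at the start of its own section.
    sections = re.split(r"(?i)\n(?=Rule\s\d{4})", "\n" + text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        header, _, body = section.partition("\n")
        match = re.match(r"(?i)Rule\s(\d{4})", header)
        rule_id = match.group(1) if match else "General"
        # Context injection: the rule header travels with the chunk text.
        chunks.append({"rule_id": rule_id, "text": f"[{header}]\n{body.strip()}"})
    return chunks

chunks = split_by_rule(SAMPLE)
for chunk in chunks:
    print(chunk["rule_id"], "->", chunk["text"].replace("\n", " | "))
```

Because the header is embedded in the chunk text itself, the rule identity is present in both the embedding and the text later shown to the LLM, even when the chunk comes from the middle of a long rule.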

In [None]:
# Install all required libraries once (run this after Section 2)
!pip install -U --force-reinstall \
    numpy==1.26.4 protobuf==4.25.3 \
    transformers==4.46.3 sentence-transformers==3.3.1 peft==0.12.0 \
    accelerate==0.34.2 bitsandbytes==0.49.2 \
    langchain==0.3.11 langchain-core==0.3.24 langchain-community==0.3.11 \
    langchain-huggingface==0.1.2 langchain-text-splitters==0.3.2 \
    langchain-chroma==0.1.4 \
    chromadb==0.5.23 pdfplumber==0.11.4 pandas==2.2.3 tabulate==0.9.0 gradio==5.9.1

## Section 2.1 Dataset Cleaning

#### Environment Setup and Imports
Run this first to install the required libraries and import the modules. tabulate is required for pandas to convert tables to Markdown.

In [None]:
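# Install required libraries (pinned to the same versions as the setup cell above,
# so this cell does not silently upgrade them)
!pip install pdfplumber==0.11.4 langchain==0.3.11 langchain-text-splitters==0.3.2 pandas==2.2.3 tabulate==0.9.0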

# Import modules
import re
import pdfplumber
import pandas as pd
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Define the file path (Ensure your PDF is uploaded to the Colab files section)
PDF_PATH = "Osh-Handbook.pdf"


#### Document Cleaning Utility
This cell defines the function used to strip out page numbers, source tags, and fix broken words.

In [None]:
def clean_text(text):
    """Removes PDF artifacts and merges hyphenated words."""
    if not text:
        return ""
    
    # Remove page artifacts like "--- PAGE 1 ---"
    text = re.sub(r'--- PAGE \d+ ---', '', text)
    
    # Merge hyphenated words across newlines (e.g., "work-\nplace" -> "workplace")
    text = re.sub(r'(\w+)-\n(\w+)', r'\1\2', text)
    
    # Clean up excessive newlines
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    return text.strip()

#### Table Extraction
This cell handles extracting complex tables and converting them into Markdown so the LLM can understand the rows and columns.

In [None]:
def extract_tables_to_documents(pdf_path):
    print("Extracting tables...")
    table_documents = []
    
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            for i, table in enumerate(tables):
                # Minimum content filter for tables
                if not table or len(table) < 2: 
                    continue 
                
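                # Harden headers: replace missing (None) or blank cells with a placeholder name
                raw_headers = [
                    str(col).strip() if col is not None and str(col).strip() else f"Col_{j}"
                    for j, col in enumerate(table[0])
                ]

                # Deduplicate headers if PDF parsing produced repeats (e.g., two columns named "Limit");
                # a running count keeps names unique even if a header appears three or more times
                seen_headers = {}
                headers = []
                for name in raw_headers:
                    seen_headers[name] = seen_headers.get(name, 0) + 1
                    headers.append(name if seen_headers[name] == 1 else f"{name}_dup{seen_headers[name] - 1}")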
                
                try:
                    df = pd.DataFrame(table[1:], columns=headers).dropna(how='all')
                    df = df.fillna("") 
                    md_table = df.to_markdown(index=False)
                    
                    # Deduplication/Noise filter: Skip tiny or empty tables
                    if len(md_table.strip()) < 50:
                        continue
                        
                    # Create structured LangChain Document
                    doc = Document(
                        page_content=f"[Table Extracted from Page {page_num + 1}]\n{md_table}",
                        metadata={
                            "source": "Osh-Handbook.pdf",
                            "page": page_num + 1,
                            "type": "table",
                            "table_index": i
                        }
                    )
                    table_documents.append(doc)
                except Exception as e:
                    print(f"Skipped broken table on page {page_num + 1}: {e}")
                
    print(f"Successfully extracted {len(table_documents)} table documents.")
    return table_documents

#### Text Extraction & Structure-Aware Chunking
This is the core logic. It reads the text, splits it by DOLE Rules, and prepends the Rule Title to every sub-chunk so context is never lost.

In [None]:
def process_dole_rules_to_documents(pdf_path):
    print("Extracting and cleaning text...")
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=150,
        separators=["\n\n", "\n", ".", " ", ""]
    )
    
    text_documents = []
    seen_chunks = set()
    
    # --- STATE PERSISTENCE VARIABLES ---
    # Defined OUTSIDE the page loop so they survive page transitions
    current_rule_id = "General"
    current_rule_title = "General OSHS Provision"
    
    print("Chunking rules and assigning metadata...")
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            extracted = page.extract_text()
            if not extracted:
                continue
            
            # Apply your cleaning function
            cleaned_page_text = clean_text(extracted)
            if not cleaned_page_text:
                continue
            
            # Prepend a newline to ensure regex catches a Rule if it starts at the very top of the page
            cleaned_page_text = "\n" + cleaned_page_text
            
            # Split by Rule headers
            # logic: (?i) case-insensitive, \n matches newline before Rule
            rule_splits = re.split(r'(?i)\n(?=Rule\s\d{4})', cleaned_page_text)
            
            for section in rule_splits:
                section = section.strip()
                if not section:
                    continue

                # CHECK: Does this section START with a new Rule Header?
                first_line = section.split('\n')[0]
                rule_match = re.match(r'(?i)Rule\s(\d{4})', first_line)

                if rule_match:
                    # YES: We found a new rule. Update the "State" before any length
                    # filtering, so a header stranded at the bottom of a page still
                    # applies to the continuation text on the next page.
                    current_rule_id = rule_match.group(1)
                    current_rule_title = first_line.strip()
                # NO: continuation text keeps the existing 'current_rule_id' and 'current_rule_title'.

                # Skip fragments too short to be useful on their own
                if len(section) < 50:
                    continue
                
                # Now chunk using the correct context (whether new or inherited)
                sub_chunks = text_splitter.split_text(section)
                
                for chunk in sub_chunks:
                    normalized_chunk = chunk.strip()
                    if len(normalized_chunk.split()) < 10:
                        continue
                    
                    dedup_key = (current_rule_id, normalized_chunk)
                    if dedup_key in seen_chunks:
                        continue
                    seen_chunks.add(dedup_key)
                    
                    doc = Document(
                        page_content=f"[{current_rule_title}]\n{normalized_chunk}",
                        metadata={
                            "source": "Osh-Handbook.pdf",
                            "rule_id": current_rule_id, # Uses the persisted state
                            "rule_title": current_rule_title,
                            "type": "text",
                            "page": page_num
                        }
                    )
                    text_documents.append(doc)
            
    print(f"Generated {len(text_documents)} structured text documents.")
    return text_documents

# Execution
tables_docs = extract_tables_to_documents(PDF_PATH)
text_docs = process_dole_rules_to_documents(PDF_PATH)
all_knowledge_base_docs = text_docs + tables_docs

# Preview the rich metadata
# --- FINAL PREVIEW BLOCK ---

print("\n--- Document Object Preview ---")

# 1. The "Empty-List Guard": Check if the list actually has items
if not all_knowledge_base_docs:
    print("Warning: No documents were generated. Please check your PDF path and extraction logic.")
else:
    # 2. The Dynamic Index: Use index 50, OR the very last index if the list is smaller than 51
    preview_index = min(50, len(all_knowledge_base_docs) - 1)
    
    print(f"Previewing Document at Index {preview_index}:")
    print(f"Content: {all_knowledge_base_docs[preview_index].page_content[:150]}...")
    print(f"Metadata: {all_knowledge_base_docs[preview_index].metadata}")

#### Combine & Final Check
Run this cell to combine your extracted tables and text chunks into one unified knowledge base list, ready to be embedded and stored in ChromaDB in your next steps.

In [None]:
# Combine text and table documents
all_knowledge_base_docs = text_docs + tables_docs

print(f"Total Text Docs: {len(text_docs)}")
print(f"Total Table Docs: {len(tables_docs)}")
print(f"Total Combined Docs ready for Vector DB: {len(all_knowledge_base_docs)}")

# This list 'all_knowledge_base_docs' is what you will pass to your embedding model!

## Section 2.2 Comparing Embedding Models

Before finalizing the system architecture, an evaluation of different embedding models is necessary to determine which performs best on the DOLE OSHS legal text. The goal is to find a model that balances technical language comprehension, semantic accuracy, and cross-lingual (Taglish) capabilities.

The following 4 models are evaluated in this Embedding Evaluation:
1. **`all-MiniLM-L6-v2`**: A widely used lightweight model for fast semantic search. Serves as our baseline.
2. **`all-mpnet-base-v2`**: A larger, English-only model from Sentence Transformers that trades speed for accuracy.
3. **`BAAI/bge-small-en-v1.5`**: A compact open-source English model from BAAI, included for its reported strength on retrieval benchmarks such as MTEB and for dense technical retrieval.
4. **`paraphrase-multilingual-MiniLM-L12-v2`**: A multilingual model tested specifically for its ability to map Tagalog/Taglish queries to English regulatory text.

**Methodology:**
The cleaned, structure-aware `Document` objects are embedded into temporary ChromaDB vector stores. A mini test-suite of 5 diverse queries (covering English technical, Taglish, and Table lookups) is passed to each model. We evaluate them **qualitatively** (by reviewing the retrieved context) and **quantitatively** by comparing the L2 Distance scores (where a lower score indicates higher mathematical similarity between the query and the retrieved document).
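
Since the embeddings in this notebook are normalized to unit length (`normalize_embeddings=True`), an L2-style distance carries the same ranking information as cosine similarity: for unit vectors, the squared Euclidean distance equals `2 - 2*cos`. The sketch below checks that identity on toy 3-dimensional vectors. The vectors and helper names are illustrative assumptions, and whether a vector store reports the squared or the plain L2 distance depends on its configured metric.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_unit(a, b):
    """Cosine similarity; a plain dot product is enough for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy vectors (invented): one document close to the query, one far from it.
query = normalize([1.0, 2.0, 2.0])
near = normalize([1.0, 2.0, 1.5])
far = normalize([-2.0, 0.5, 1.0])

for name, doc in [("near", near), ("far", far)]:
    d2 = squared_l2(query, doc)
    cos = cosine_unit(query, doc)
    print(f"{name}: squared L2 = {d2:.4f}, 2 - 2*cos = {2 - 2 * cos:.4f}")
```

A lower distance therefore always corresponds to a higher cosine similarity within a single model's embedding space.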

#### Setup and Model Definition
Run this cell to define the four models under test, as listed above.

In [None]:
# Install required libraries for Vector Store and Embeddings
# (pinned to the same versions as the setup cell after Section 2)
!pip install chromadb==0.5.23 sentence-transformers==3.3.1 langchain-huggingface==0.1.2 langchain-community==0.3.11 pandas==2.2.3 tabulate==0.9.0

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
import pandas as pd

# 1. Define the 4 models for the embedding comparison
models_to_test = {
    "MiniLM (Baseline)": "sentence-transformers/all-MiniLM-L6-v2",
    "MPNet (Heavy English)": "sentence-transformers/all-mpnet-base-v2",
    "BGE-Small (Technical)": "BAAI/bge-small-en-v1.5",
    "Multilingual (Taglish)": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
}

vector_stores_arena = {}

print("Models loaded into the Arena configuration:")
for name in models_to_test.keys():
    print(f"- {name}")

#### Building the Vector Databases (The Heavy Lifting)
This cell will download each model and embed your documents. Note: Depending on your Colab GPU, this might take a few minutes to complete since it is processing 4 different models back-to-back.

In [None]:
import torch

print("=== Starting Embedding Model Evaluation ===")

# Fall back to CPU if no GPU is attached to the runtime
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Ensure we actually have documents to test
if 'all_knowledge_base_docs' not in locals() or not all_knowledge_base_docs:
    print("Error: all_knowledge_base_docs is empty or not defined. Please run Section 2.1 first.")
else:
    # 2. Build an in-memory ChromaDB for each model
    for model_index, (model_nickname, model_path) in enumerate(models_to_test.items()):
        print(f"\nInitializing {model_nickname} on {device}...")

        # Load the embedding model (GPU if available, e.g. a Colab T4)
        embeddings = HuggingFaceEmbeddings(
            model_name=model_path,
            model_kwargs={'device': device},
            encode_kwargs={'normalize_embeddings': True}
        )

        # Create an in-memory vector store (no persist_directory).
        # Each model gets its own collection so that vectors from different models
        # (which can differ in dimensionality) never share one default collection.
        print("Embedding chunks into temporary Vector Store...")
        vectorstore = Chroma.from_documents(
            documents=all_knowledge_base_docs,
            embedding=embeddings,
            collection_name=f"oshs_eval_{model_index}"
        )

        vector_stores_arena[model_nickname] = vectorstore

    print("\n=== All Temporary Databases Built Successfully! ===")

#### Quantitative & Qualitative Evaluation
This cell uses `similarity_search_with_score` to obtain the distance of each query's top result, then collects the scores in a pandas DataFrame to print a side-by-side comparison table.

In [None]:
# Define a mini test-suite representing different user intents
test_queries = [
    "What are the requirements for a Safety Committee in a high-risk workplace?", # Standard English
    "What is the threshold limit value for Lead and Arsenic?", # Table/Chemical lookup
    "Who is responsible for providing personal protective equipment?", # Policy/Responsibility
    "Ilang safety officer ang kailangan sa construction site na may 300 workers?", # Taglish/Filipino
    "Ano ang parusa sa hindi pagsunod sa OSH standards?" # Taglish/Penalties
]

# We will store the L2 Distance scores (lower is better) to quantitatively compare models
results_data = []

print("=== Quantitative & Qualitative Embedding Evaluation ===")

if not vector_stores_arena:
    print("Error: Vector stores not built. Run the previous cells.")
else:
    for query in test_queries:
        print(f"\n\n--- TEST QUERY: '{query}' ---")
        
        # Initialize a dictionary for our DataFrame row
        query_row = {"Query": query[:35] + "..."}
        
        for model_nickname, vectorstore in vector_stores_arena.items():
            # Get top 1 result and its distance score (ChromaDB defaults to L2 distance)
            results = vectorstore.similarity_search_with_score(query, k=1)
            
            if results:
                top_doc, score = results[0]
                
                # Print Qualitative Result
                # (build the preview first: a backslash inside an f-string expression
                # is a SyntaxError before Python 3.12)
                preview = top_doc.page_content.replace('\n', ' ')[:100]
                print(f"\n>>> [{model_nickname}] (L2 Distance: {score:.4f})")
                print(f"Rule: {top_doc.metadata.get('rule_id')} - {top_doc.metadata.get('rule_title')} (Page {top_doc.metadata.get('page')})")
                print(f"Preview: {preview}...")
                
                # Save Quantitative Result (Score) for the table
                query_row[model_nickname] = round(score, 4)
        
        results_data.append(query_row)
    
    # --- DISPLAY QUANTITATIVE SUMMARY TABLE ---
    print("\n\n" + "="*70)
    print("=== QUANTITATIVE SUMMARY (L2 Distance - Lower is Better) ===")
    print("="*70)
    df_results = pd.DataFrame(results_data)
    
    # Print as a clean Markdown table
    print(df_results.to_markdown(index=False))

#### Evaluation Metric
The quantitative results indicated that `BAAI/bge-small-en-v1.5` is the strongest embedding model for the DOLE OSHS dataset. It achieved the lowest L2 Distance scores on all five test queries, with a clear margin on the dense table lookup (0.5236). Notably, it also outperformed the dedicated multilingual model on the Taglish queries, likely due to its better handling of the English technical loan words embedded within the Filipino syntax.

Therefore, `bge-small-en-v1.5` is selected as the permanent embedding model for the final RAG pipeline in Section 3.

## Section 3. Requirements

To construct a reliable, source-grounded Retrieval-Augmented Generation (RAG) pipeline for the DOLE OSHS handbook, the following frameworks and libraries were selected based on performance, open-source availability, and hardware constraints (Google Colab Pro T4 GPU):

**1. Large Language Model (LLM):**
* **`Llama 3.1 8B Instruct`**: Selected as the primary reasoning engine. It is highly optimized for instruction-following and "closed-corpus" tasks, ensuring that the model adheres strictly to the provided OSHS context and minimizes the risk of hallucination.

**2. Embedding Model:**
* **`BAAI/bge-small-en-v1.5`**: Chosen following the embedding comparison in Section 2.2. It produced the lowest L2 distances when retrieving dense technical English and PDF table contents, outperforming the baseline models.

**3. Vector Database:**
* **`ChromaDB`**: An open-source, locally hosted vector database. It allows for efficient storage and similarity searching of high-dimensional vectors with integrated metadata filtering.

**4. Backend and UI Tool:**
* **`Gradio`**: Selected for the final web deployment. Gradio offers native "notebook-first" support for Google Colab, allowing for a stable, interactive chat interface without the need for complex external tunneling.

**5. Additional Utilities:**
* **`pdfplumber` & `pandas`**: Used for high-fidelity extraction of structured rules and tabular matrices from the OSHS PDF.
* **`LangChain`**: The orchestration framework used to link the retriever, the prompt template, and the LLM into a unified RAG chain.

In [None]:
# --- SECTION 3: MASTER REQUIREMENTS & SETUP ---

# 1. Install final required libraries
# Packages are already installed in the setup cell below Section 2.

# 2. Import core components
import gradio as gr
# LangChain core components (import paths for LangChain v0.3+)
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Vector store and embedding wrappers (community and Hugging Face integration packages)
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

print("All OSHS Chatbot requirements successfully installed and imported.")

## Section 4. System Architecture

The chatbot uses a Retrieval-Augmented Generation (RAG) architecture to keep outputs grounded in the DOLE OSHS corpus.

**Overall Architecture**
- **Retriever:** LangChain retriever configured on the final vector store (`k=3`) to fetch top relevant chunks.
- **Vector Store:** Persistent `Chroma` collection stored in Google Drive for reuse across sessions.
- **Embedding Model:** `BAAI/bge-small-en-v1.5` used for document and query embeddings.
- **LLM:** `Llama 3.1 8B Instruct` used for final answer generation.
- **Prompt Template:** A strict template that injects retrieved context and constrains answers to source-grounded information.

**Pipeline (query → embedding → similarity search → context injection)**
1. User submits a safety/compliance query.
2. Query is embedded with `BAAI/bge-small-en-v1.5`.
3. Similarity search in `Chroma` retrieves top-k relevant chunks.
4. Retrieved chunks + metadata are injected into the prompt template.
5. LLM generates an answer using grounded context.

**Prompt Design and Grounding Strategy**
- The prompt explicitly instructs the model to answer from retrieved context only.
- Retrieved chunks preserve rule/page metadata to improve traceability.
- Low temperature (`0.1`) reduces generative variance and hallucination risk.
- If evidence is insufficient, the response should avoid unsupported claims.
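
As a rough illustration of this grounding strategy, the sketch below assembles a context-only prompt from retrieved chunks using plain string formatting. The template wording, the `build_prompt` helper, the refusal sentence, and the sample chunk are all assumptions made for the example; they are not the notebook's actual prompt.

```python
# Hypothetical grounded prompt; the wording is an assumption, not the project's template.
GROUNDED_TEMPLATE = """You are a workplace safety assistant.
Answer using ONLY the context below. If the context does not contain the answer,
reply exactly: "I could not find this in the provided OSHS excerpts."

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, chunks):
    """Join retrieved chunks (dicts with 'text', 'rule_id', 'page') into one prompt."""
    context = "\n\n".join(
        f"(Rule {c['rule_id']}, page {c['page']})\n{c['text']}" for c in chunks
    )
    return GROUNDED_TEMPLATE.format(context=context, question=question)

# Placeholder chunk standing in for a retrieved document (content is invented).
sample_chunks = [{"rule_id": "1040", "page": 12, "text": "Sample provision text."}]
prompt = build_prompt("Who chairs the committee?", sample_chunks)
print(prompt)
```

Keeping the rule and page label next to each chunk gives the model something concrete to cite, and the explicit refusal sentence gives it a permitted way out when the retrieved evidence is insufficient.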

#### System Flow Diagram and Pseudocode

```text
function answer_query(user_query):
    q_vec = embed(user_query, model='BAAI/bge-small-en-v1.5')
    docs = chroma.similarity_search(q_vec, k=3)
    prompt = build_prompt(user_query, docs)
    answer = llama_generate(prompt, temperature=0.1)
    return answer
```

```mermaid
flowchart LR
    A[User Query] --> B[Embed Query\nBGE-Small]
    B --> C[Chroma Similarity Search\nTop-k Chunks]
    C --> D[Context Injection\nPrompt Template]
    D --> E[Llama 3.1 8B Instruct]
    E --> F[Grounded Answer]
```

#### Lock in the Retriever
This cell builds your final, permanent database using the winning model.

In [None]:
import os
import torch
from google.colab import drive
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# --- Google Drive persistence and device-agnostic fallback ---
print("Mounting Google Drive for persistent database storage...")
drive.mount('/content/drive')
persist_directory = "/content/drive/MyDrive/OSHS_ChromaDB_Final"

# Dynamically set device to avoid crashing if GPU is detached
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Initializing final embedding model (BGE-Small) on {device}...")

final_embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    model_kwargs={'device': device}, 
    encode_kwargs={'normalize_embeddings': True}
)

# --- Prevent duplication by checking whether the DB already exists ---
if os.path.exists(persist_directory) and os.listdir(persist_directory):
    print("Loading existing Vector Database from Google Drive (preventing duplication)...")
    final_vectorstore = Chroma(
        persist_directory=persist_directory, 
        embedding_function=final_embeddings
    )
else:
    print("Building NEW Vector Database and saving to Google Drive...")
    final_vectorstore = Chroma.from_documents(
        documents=all_knowledge_base_docs,
        embedding=final_embeddings,
        persist_directory=persist_directory
    )

# Convert the database into a LangChain "Retriever"
retriever = final_vectorstore.as_retriever(search_kwargs={"k": 3})
print("Retriever successfully built and ready!")

#### Load Llama 3.1 8B (The LLM)
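This is the most crucial step. Llama 3.1 8B has roughly 8 billion parameters, so its weights alone take about 16 GB at 16-bit precision, which leaves no headroom on a 16 GB T4 once activations and the KV cache are counted. Loading the model with 4-bit quantization shrinks the weights to roughly a quarter of that size so that it fits on your Colab Pro GPU.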
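
The reason 4-bit quantization is needed can be checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 8 billion parameters and counts weight storage only; activations, the KV cache, and quantization metadata add overhead on top, so the figures are rough lower bounds rather than measurements.

```python
PARAMS = 8.0e9  # assumed approximate parameter count for an 8B model

def weight_gb(params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

# Compare weight storage at common precisions.
for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: ~{weight_gb(PARAMS, bits):.0f} GB of weights")
```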

CRITICAL PREREQUISITE: Llama 3.1 is a "gated" model. To download it, you must have a free Hugging Face account, accept Meta's terms on the Llama 3.1 page, and create an Access Token. You need to put your token in the Colab "Secrets" tab (the little key icon on the left sidebar) and name it HF_TOKEN.

In [None]:
from google.colab import userdata
import requests

try:
    # 1. Check if the secret exists in Colab
    token = userdata.get('HF_TOKEN')
    print("✅ Success: 'HF_TOKEN' found in Colab Secrets.")
    
    # 2. Check if the token is valid by pinging Hugging Face API
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.get("https://huggingface.co/api/whoami-v2", headers=headers, timeout=10)
    
    if response.status_code == 200:
        user_info = response.json()
        print(f"✅ Success: Token is valid. Authenticated as: {user_info.get('name')}")
    else:
        print(f"❌ Error: Token invalid or expired (Status Code: {response.status_code})")
        print("Response:", response.text)

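except Exception as e:
    # Covers both a missing or disabled secret and a failed network request
    print(f"❌ Error: Could not read 'HF_TOKEN' or reach Hugging Face ({e}). Ensure the toggle is 'ON' in the Secrets tab.")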

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain_huggingface import HuggingFacePipeline
from google.colab import userdata

# --- LOAD THE LLM (LLAMA 3.1 8B) ---
print("Downloading and quantizing Llama 3.1 8B (This may take 2-3 minutes)...")

# Retrieve token
hf_token = userdata.get('HF_TOKEN')
model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Configure 4-bit Quantization (Shrinks the model to fit on the T4 GPU)
# The T4 has no native bfloat16 support, so float16 is used as the compute dtype
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Load Tokenizer and Model
tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    token=hf_token
)

# Create the HuggingFace Pipeline
text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,      # Limit answer length to prevent rambling
    do_sample=True,          # Sampling must be enabled for temperature to take effect
    temperature=0.1,         # Keep temperature low for factual, legal answers
    repetition_penalty=1.1,
    return_full_text=False   # Only return the generated answer, not the prompt
)

# Wrap it in LangChain so it can be used in the LCEL chain
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

print("Llama 3.1 8B Instruct loaded successfully and ready for RAG!")