# HR Policy Question Answering System (RAG)

This notebook demonstrates a **production-ready Retrieval-Augmented Generation (RAG) pipeline** using real HR Policy PDF documents.

Key features:

- PDF ingestion and regex-based text cleaning
- Recursive character chunking with overlap
- Semantic embeddings (all-MiniLM-L6-v2)
- Vector store retrieval using ChromaDB
- A restrictive prompt that limits the LLM to the retrieved context, to reduce hallucinations

Setup:

- Environment: Kaggle Notebook
- Domain: Human Resources policies

## 1. Environment Setup

Install only the required libraries for a lightweight, clean RAG pipeline.

In [1]:
!pip install -q -U pip
!pip install -q -U langchain langchain-community langchain-core langchain-huggingface langchain-chroma langchain-text-splitters 
!pip install -q -U pypdf chromadb sentence-transformers tqdm bitsandbytes accelerate transformers

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
xmanager 0.7.1 requires sqlalchemy==1.2.19, but you have sqlalchemy 2.0.45 which is incompatible.
opentelemetry-exporter-otlp-proto-http 1.37.0 requires opentelemetry-exporter-otlp-proto-common==1.37.0, but you have opentelemetry-exporter-otlp-proto-common 1.39.1 which is incompatible.
opentelemetry-exporter-otlp-proto-http 1.37.0 requires opentelemetry-proto==1.37.0, but you have opentelemetry-proto 1.39.1 which is incompatible.
opentelemetry-exp

## 2. Dataset Loading & Verification

Load HR Policy PDFs and check availability.

In [2]:
import os

DATA_PATH = "/kaggle/input/hr-policy-docs-pdf"

pdf_files = [f for f in os.listdir(DATA_PATH) if f.endswith(".pdf")]
print("PDF files:", pdf_files)

PDF files: ['officetime.pdf', 'separation.pdf', 'annualhealthcheck.pdf', 'leavepolicy.pdf', 'noticeperiod.pdf', 'travel.pdf']
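The listing above assumes the dataset directory exists, and `os.listdir` returns names in arbitrary order. A slightly more defensive variant is sketched below. `list_pdfs` is a hypothetical helper, not part of the original notebook; it fails early with a clear message and sorts the result so later cells see the files in a reproducible order.

```python
from pathlib import Path


def list_pdfs(data_path: str) -> list[str]:
    """Return the sorted PDF file names found directly inside data_path."""
    root = Path(data_path)
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {root}")
    # Compare the extension case-insensitively so 'Policy.PDF' is not skipped.
    return sorted(
        p.name for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

In the notebook this could be called as `pdf_files = list_pdfs(DATA_PATH)`.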


## 3. PDF Ingestion, Text Cleaning & Normalization

Load each PDF page by page to preserve metadata (file name and page number).

Then normalize whitespace and strip page-number markers, the repeated "HR Policy" header text, and bullet glyphs. Pages with 50 characters or fewer after cleaning are dropped.

In [3]:
import re
from tqdm import tqdm
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document 

raw_documents = []
for file in tqdm(pdf_files, desc="Loading PDFs"):
    loader = PyPDFLoader(os.path.join(DATA_PATH, file))
    raw_documents.extend(loader.load())

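def clean_text(text: str) -> str:
    # Collapse every whitespace run (including newlines) into a single space.
    text = re.sub(r"\s+", " ", text)
    # Remove page markers such as "Page 3" (case-insensitive).
    text = re.sub(r"Page\s*\d+", "", text, flags=re.IGNORECASE)
    # Remove the repeated "HR Policy" header; this also matches the phrase inside sentences.
    text = re.sub(r"HR Policy", "", text, flags=re.IGNORECASE)
    # Remove bullet glyphs.
    text = re.sub(r"[•●■□▪]", "", text)
    return text.strip()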

cleaned_documents = []
for doc in tqdm(raw_documents, desc="Cleaning text"):
    text = clean_text(doc.page_content)
    if len(text) > 50:
        cleaned_documents.append(Document(page_content=text, metadata=doc.metadata))

print(f"Cleaned documents: {len(cleaned_documents)}")

Cleaned documents: 32
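To see what the cleaning rules do in practice, the sketch below re-declares `clean_text` with the same four rules so it can run on its own, and applies it to two made-up strings (both samples are invented for illustration and are not taken from the PDFs). The second sample shows a side effect worth knowing about: the patterns also match inside ordinary sentences and leave double spaces behind.

```python
import re


def clean_text(text: str) -> str:
    # Same four rules as the notebook cell above.
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"Page\s*\d+", "", text, flags=re.IGNORECASE)
    text = re.sub(r"HR Policy", "", text, flags=re.IGNORECASE)
    text = re.sub(r"[•●■□▪]", "", text)
    return text.strip()


# A page that starts with a header, a page marker, and a bullet.
page = "HR Policy\nPage 3\n\n• Employees get 12 days\n  of casual leave."
print(repr(clean_text(page)))
# 'Employees get 12 days of casual leave.'

# A sentence that merely mentions the header text and a page number.
sentence = "See the HR Policy manual on page 4 for details."
print(repr(clean_text(sentence)))
# 'See the  manual on  for details.'
```

If such over-matching turns out to matter for a given document set, anchoring the patterns to line starts before collapsing whitespace is one option to consider.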





## 4. Chunking & Quality Control

Split documents into overlapping chunks of at most 600 characters for better retrieval.

Drop chunks of 100 characters or fewer and reduce each `source` path to its file name.

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

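# Note: clean_text() already collapsed newlines into spaces, so the "\n\n" and "\n"
# separators never match here; splits fall back to ".", then " ", then single characters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,      # maximum characters per chunk
    chunk_overlap=120,   # maximum characters shared by neighbouring chunks
    separators=["\n\n", "\n", ".", " ", ""]
)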

all_chunks = splitter.split_documents(cleaned_documents)

final_chunks = []
for chunk in all_chunks:
    if len(chunk.page_content) > 100:
        source_name = os.path.basename(chunk.metadata.get("source", "Unknown"))
        chunk.metadata["source"] = source_name
        final_chunks.append(chunk)

print(f"Final Chunks for Vector DB: {len(final_chunks)}")

Final Chunks for Vector DB: 97
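With `chunk_size=600` and `chunk_overlap=120`, neighbouring chunks can share up to 120 characters, so a sentence cut at a chunk boundary still appears whole in one of them. The sketch below illustrates that idea with a deliberately simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break on the listed separators), and `sliding_chunks` is a hypothetical helper introduced only for this example.

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text into fixed-size windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 1500 characters of dummy text: the digits 0-9 repeated.
text = "".join(str(i % 10) for i in range(1500))
chunks = sliding_chunks(text, chunk_size=600, overlap=120)

print([len(c) for c in chunks])            # [600, 600, 540]
print(chunks[0][-120:] == chunks[1][:120])  # True
```

The window advances 480 characters at a time, which is why 1500 characters produce three chunks rather than the 2.5 that division by 600 would suggest.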


## 5. Embeddings & Vector Store (ChromaDB)

Convert chunks into vector embeddings.

Store embeddings with metadata for efficient semantic search.

In [5]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

VECTOR_DB_PATH = "chroma_hr_db"
vectorstore = Chroma.from_documents(
    documents=final_chunks,
    embedding=embeddings,
    persist_directory=VECTOR_DB_PATH
)

print("Vector Store is ready and persisted.")

Vector Store is ready and persisted.
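Semantic search ranks chunks by how close their embedding vectors are to the embedding of the query. The toy example below uses hand-written 3-dimensional vectors in place of real all-MiniLM-L6-v2 embeddings (which have 384 dimensions) and cosine similarity as the closeness measure. The vectors are invented, the file names are reused only as labels, and the index and distance metric Chroma actually uses may differ, so treat this as an illustration of the idea rather than of the library's internals.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the k labels whose vectors are most similar to query_vec."""
    scored = [(cosine(query_vec, vec), label) for label, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]


# Invented vectors: each axis loosely stands for one topic.
toy_index = {
    "leavepolicy.pdf": [0.9, 0.1, 0.0],
    "travel.pdf": [0.1, 0.9, 0.1],
    "annualhealthcheck.pdf": [0.0, 0.2, 0.9],
}
query = [0.1, 0.1, 0.95]  # a query that points mostly along the third axis

print(top_k(query, toy_index, k=2))
# ['annualhealthcheck.pdf', 'travel.pdf']
```

Asking for `k=2` returns the best match plus a weaker second one. This is how a loosely related chunk can appear among the retrieved sources when `k` is larger than the number of truly relevant chunks.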


## 6. LLM Setup (Llama-3 with 4-bit Quantization)

In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from langchain_huggingface import HuggingFacePipeline

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = "unsloth/llama-3-8b-Instruct-bnb-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

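text_gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,          # temperature only takes effect when sampling is enabled
    temperature=0.1,         # low temperature keeps answers close to deterministic
    repetition_penalty=1.1
)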

llm = HuggingFacePipeline(pipeline=text_gen)
print("Llama-3 is loaded and ready.")

Device set to use cuda:0


Llama-3 is loaded and ready.
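The main reason for loading the model in 4-bit is GPU memory. A back-of-the-envelope estimate is sketched below. It assumes roughly 8 billion parameters (taking the "8b" in the model name literally) and counts weights only, ignoring activations, the KV cache, quantization constants, and any layers kept in higher precision. `weight_memory_gb` is a hypothetical helper for this arithmetic.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumption: "8b" read as 8 billion parameters

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>6}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
# 16-bit: ~16.0 GB
#  8-bit: ~8.0 GB
#  4-bit: ~4.0 GB
```

The checkpoint downloaded by this notebook's model-loading cell is larger than the 4-bit figure, so these numbers are best read as rough lower bounds.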


## 7. RAG Chain & Anti-Hallucination Prompt

In [7]:
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a professional HR Assistant. Use ONLY the following context to answer the question. 
If the answer is not in the context, strictly say: "I'm sorry, but this information is not available in the HR policies."

Context:
{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

Question: {question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

prompt = PromptTemplate.from_template(template)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

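def ask_hr(query):
    print(f"\n Question: {query}")

    response = rag_chain.invoke(query)

    # Retrieve again only to list the sources; this repeats the search the chain already ran.
    docs = retriever.invoke(query)

    # The pipeline output can echo the prompt, so keep only the text after the last header.
    if "<|end_header_id|>" in response:
        clean_answer = response.split("<|end_header_id|>")[-1].strip()
    elif "assistant" in response:
        # Fallback; note that it also cuts answers that contain the word "assistant".
        clean_answer = response.split("assistant")[-1].strip()
    else:
        clean_answer = response.strip()

    print(f" Answer: {clean_answer}")

    print("\n Sources :")
    # Deduplicate and sort so the listing is stable between runs.
    sources = sorted({f"- {doc.metadata.get('source')} (Page: {doc.metadata.get('page', 'N/A')})" for doc in docs})
    for s in sources:
        print(s)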

print("--- HR System Online (Clean Version) ---")
ask_hr("What is the policy for annual health checks?")
ask_hr("Can I bring a lion to the office?")

--- HR System Online (Clean Version) ---

 Question: What is the policy for annual health checks?
 Answer: According to our company's Annual Health Check-up Policy, the objective is to provide health check-ups to all employees to facilitate their well-being. The scope of this policy applies to all employees in India.

The policy outlines the eligibility criteria, process, and frequency of health check-ups based on the employee's grade and age. The frequency and cost of the health check-ups vary depending on the employee's grade and age group.

For example:

* Employees in Grades E11 and above can avail of a health check-up once a year at a cost of ₹5,100.
* Employees in Grades E09 and E10 can avail of a health check-up once a year at a cost of ₹3,100.
* And so on.

Eligible employees need to approach the HR department for authorization before undergoing the health check-up. The HR department will direct them to an authorized hospital/laboratory and settle the expenses directly with the
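The post-processing in `ask_hr` falls back to splitting on the bare word "assistant", which would also truncate an answer that happens to contain that word. One possible alternative is sketched below with two hypothetical helpers, `extract_answer` and `format_sources`. It assumes the raw output echoes the prompt with the assistant header intact, and that the loader reports zero-based page indices, which the second helper converts to 1-based page numbers; both assumptions should be checked against the installed library versions.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"


def extract_answer(raw: str) -> str:
    """Keep only the text after the last assistant header, if there is one."""
    # rpartition returns ("", "", raw) when the header is absent.
    _, _, tail = raw.rpartition(ASSISTANT_HEADER)
    return tail.replace("<|eot_id|>", "").strip()


def format_sources(metadatas: list[dict]) -> list[str]:
    """Deduplicate (source, page) pairs and render them in a stable order."""
    seen = {(m.get("source", "Unknown"), m.get("page")) for m in metadatas}
    ordered = sorted(seen, key=lambda sp: (sp[0], sp[1] is None, sp[1] or 0))
    lines = []
    for source, page in ordered:
        # Assumption: integer pages are zero-based, so add 1 for display.
        label = page + 1 if isinstance(page, int) else "N/A"
        lines.append(f"- {source} (Page: {label})")
    return lines


raw = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Question: Who is the assistant manager?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "The assistant manager approves leave."
)
print(extract_answer(raw))
# The assistant manager approves leave.
```

The sample string and its answer are invented for the demonstration. Note that `format_sources` would print page numbers one higher than the listings shown in this notebook's outputs.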

## 8. Stress Testing

In [8]:
print("Start professional assessment tests...\n")

# 1. Semantic Search Test

print("--- Test 1: Indirect Query ---")
ask_hr("If I am feeling sick and want to get a medical check-up, how much will the company pay for me?")

# 2. Data Extraction & Comparison

print("\n--- Test 2: Comparison & Logic ---")
ask_hr("What is the difference in cost for a health check-up between a 30-year-old employee and a 40-year-old employee in grade E01?")

# 3. Boundary / Adversarial Test

print("\n--- Test 3: Out-of-Domain (Adversarial) ---")
ask_hr("Write a Python code to sum two numbers.")

# 4. Multi-source Retrieval

print("\n--- Test 4: Multi-context Search ---")
ask_hr("Summary of the rules for travel and the health check-up eligibility.")

Start professional assessment tests...

--- Test 1: Indirect Query ---

 Question: If I am feeling sick and want to get a medical check-up, how much will the company pay for me?
 Answer: According to our company's policy, if you're an eligible employee, you can get a health check-up done at an authorized hospital/laboratory. The company will reimburse you up to the extent of the eligibility amount or the actual expense incurred, whichever is lower.

To determine the reimbursement amount, we need to consider your age group. Based on the provided context, it seems that the reimbursement amount varies depending on your age group. Could you please share your age with me so I can look it up in the table?

Once I have your age, I'll be able to tell you the exact reimbursement amount according to our company's policy.

 Sources :
- travel.pdf (Page: 2)
- annualhealthcheck.pdf (Page: 1)

--- Test 2: Comparison & Logic ---

 Question: What is the difference in cost for a health check-up between
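Reading the answers by eye does not scale beyond a handful of questions. One way to turn such checks into assertions is sketched below. It assumes a function that returns the answer text, whereas `ask_hr` as written only prints it. `fake_ask` is a stand-in with canned replies so the snippet runs without a GPU, and the `REFUSAL` string is copied from the prompt in Section 7.

```python
REFUSAL = "I'm sorry, but this information is not available in the HR policies."


def is_refusal(answer: str) -> bool:
    """True if the answer contains the fixed refusal sentence from the prompt."""
    return REFUSAL.lower() in answer.lower()


def run_checks(ask, cases):
    """Run each (question, should_refuse) case and report whether it passed."""
    results = []
    for question, should_refuse in cases:
        answer = ask(question)
        results.append((question, is_refusal(answer) == should_refuse))
    return results


def fake_ask(question: str) -> str:
    # Stand-in for the real chain: canned replies, not model output.
    if "health check" in question.lower():
        return "Canned in-domain answer about health check-ups."
    return REFUSAL


cases = [
    ("What is the policy for annual health checks?", False),
    ("Write a Python code to sum two numbers.", True),
]

for question, passed in run_checks(fake_ask, cases):
    print("PASS" if passed else "FAIL", "-", question)
```

A substring match on the refusal sentence is a coarse check; a model that refuses in different words would be reported as a failure, which may or may not be the desired behaviour.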

## Conclusion

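This notebook demonstrates a clean, **enterprise-style RAG pipeline**:

- **Data quality**: regex-based cleaning of page markers, header text, bullet glyphs, and whitespace
- **Retrieval accuracy**: semantic embeddings (all-MiniLM-L6-v2) + ChromaDB, top 3 chunks per query
- **Hallucination prevention**: a prompt that restricts the LLM to the retrieved context and prescribes a fixed refusal message

It is intended as a starting point for **real-world enterprise use**.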