# Nestlé HR Policy Chatbot


> **Crafting an AI-Powered HR Assistant: A Use Case for Nestlé’s HR Policy Documents**

1. Set up the environment and configure the OpenAI API key.
2. Load Nestlé’s HR policy PDF using `PyPDFLoader`.
3. Split the text into manageable chunks.
4. Create vector embeddings using OpenAI embeddings and store them in **Chroma**.
5. Build a **question-answering (QA) system** using `ChatOpenAI` and `RetrievalQA`.
6. Wrap everything in a **Gradio chatbot UI** so users can ask questions about the HR policy.


In [1]:
# 3. Imports

from pathlib import Path
import os

from dotenv import load_dotenv

# LangChain PDF loading & text splitting
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Try new-style OpenAI integration first, then fall back to legacy imports
try:
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
except ImportError:
    # Legacy fallback (for older LangChain versions)
    from langchain.chat_models import ChatOpenAI 
    from langchain.embeddings import OpenAIEmbeddings  

# Vector store
from langchain_community.vectorstores import Chroma

# Prompting & QA chain
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA


# UI
import gradio as gr

In [2]:
# 4. Environment Setup - Load API Key from .env

# Load environment variables from .env 
load_dotenv(r"C:\Users\kgjam\OneDrive\Desktop\Chatbot\.env")

openai_api_key = os.getenv("OPENAI_API_KEY")

if not openai_api_key:
    raise ValueError(
        "OPENAI_API_KEY is not set. Please create a .env file with your OpenAI key "
        "or export it as an environment variable."
    )

print("OPENAI_API_KEY found and loaded.")

OPENAI_API_KEY found and loaded.


In [3]:
from pathlib import Path

PDF_PATH = Path(r"C:\Users\kgjam\Downloads\the_nestle_hr_policy_pdf_2012.pdf")

if not PDF_PATH.exists():
    raise FileNotFoundError(
        f"HR Policy PDF not found at: {PDF_PATH}.\n"
        "Please place the Nestlé HR policy PDF in the correct path "
        "and update PDF_PATH if needed."
    )

CHROMA_DIR = Path("chroma_db_nestle_hr")


In [4]:
# 5. Configuration - Model Names, Paths, and Parameters

# Update this if your PDF has a different name or location
PDF_PATH = Path(r"C:\Users\kgjam\Downloads\the_nestle_hr_policy_pdf_2012.pdf")


assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY in .env or as an environment variable!"



# Vector DB folder (same directory name as in the other cells)
CHROMA_DIR    = Path("chroma_db_nestle_hr")

# Model names (legacy defaults; the embeddings and chat model below use
# EMBED_MODEL_NAME and CHAT_MODEL_NAME instead)
MODEL_NAME    = "gpt-3.5-turbo"
EMBED_MODEL   = "text-embedding-3-small"

# Text splitting
CHUNK_SIZE    = 900
CHUNK_OVERLAP = 120
TOP_K         = 4  # how many chunks to retrieve

print("Config + API key OK.")


Config + API key OK.


In [5]:
# 5.1 Model Names Used by the Embeddings and Chat Model Below

EMBED_MODEL_NAME = "text-embedding-3-small"
CHAT_MODEL_NAME  = "gpt-4o-mini"


## 6. Load and Split the Nestlé HR Policy PDF

In this step we:

1. Load the PDF using `PyPDFLoader`.
2. Convert it into a list of `Document` objects (one per page).
3. Split each page into **chunks** using `RecursiveCharacterTextSplitter` so that:
   - Each chunk is ~900 characters,
   - Overlap is 120 characters (to preserve context across chunks).
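
To make the effect of chunk size and overlap concrete, here is a simplified fixed-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to break at separators such as paragraph and sentence boundaries, so its chunks will not line up exactly with these windows. The helper name `sliding_chunks` is hypothetical and not part of the notebook.

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, overlap=3)
for c in chunks:
    print(c)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each window starts with the last three characters of the previous one. With a chunk size of 900 and an overlap of 120, this simplified splitter would advance 780 characters per window.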

In [6]:
# 6.1 Load PDF with PyPDFLoader

loader = PyPDFLoader(str(PDF_PATH))
docs = loader.load()

print(f"Loaded {len(docs)} pages from the HR policy PDF.")

Loaded 8 pages from the HR policy PDF.


In [7]:
# 6.2 Split Documents into Chunks

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", ". ", " ", ""],
)

split_docs = text_splitter.split_documents(docs)

print(f"Created {len(split_docs)} text chunks from the HR policy.")

Created 21 text chunks from the HR policy.


## 7. Create Embeddings and Build Chroma Vector Store

Next we:

1. Initialize **OpenAI embeddings** using the `text-embedding-3-small` model.
2. Create a **Chroma** database from our text chunks.
3. Persist the vector store to disk, so it can be reused later without recomputing embeddings.
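
Conceptually, retrieval from a vector store is a nearest-neighbour search over embedding vectors. The sketch below uses invented 3-dimensional vectors and plain cosine similarity. Real embedding vectors have far more dimensions, and Chroma uses its own index and distance settings, so `cosine` and `top_k` are hypothetical teaching helpers rather than what the library runs.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int) -> list[str]:
    """Return the k chunk ids whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda cid: cosine(query, store[cid]), reverse=True)
    return ranked[:k]


# Toy "embeddings" (invented numbers, for illustration only)
toy_store = {
    "chunk_leave": [0.9, 0.1, 0.0],
    "chunk_training": [0.1, 0.9, 0.1],
    "chunk_pay": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # pretend this encodes a question about leave

best = top_k(toy_query, toy_store, k=2)
print(best)  # ['chunk_leave', 'chunk_training']
```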

In [8]:
# 7.1 Initialize Embeddings

embeddings = OpenAIEmbeddings(model=EMBED_MODEL_NAME)

print("OpenAI embeddings initialized.")

OpenAI embeddings initialized.


In [9]:
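# 7.2 Create or Load Chroma Vector Store

CHROMA_DIR = Path("chroma_db_nestle_hr")  # must be a Path, not a string

if CHROMA_DIR.exists() and any(CHROMA_DIR.iterdir()):
    # Reuse the persisted store. Calling from_documents() again on the same
    # directory would add every chunk a second time and duplicate results.
    vectorstore = Chroma(
        persist_directory=str(CHROMA_DIR),
        embedding_function=embeddings,
    )
    print("Chroma vector store loaded from disk.")
else:
    CHROMA_DIR.mkdir(exist_ok=True)
    vectorstore = Chroma.from_documents(
        documents=split_docs,
        embedding=embeddings,
        persist_directory=str(CHROMA_DIR),
    )
    # Only needed for chromadb < 0.4; newer versions persist automatically
    vectorstore.persist()
    print("Chroma vector store created and persisted.")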


Chroma vector store created and persisted.




In [10]:
# 7.3 Create a Retriever Interface

retriever = vectorstore.as_retriever(search_kwargs={"k": TOP_K})

print("Retriever is ready.")

Retriever is ready.


## 8. Build the Question-Answering Chain (GPT + Retrieval)

We now:

1. Create a **prompt template** that clearly instructs the model to:
   - Use **only** the provided context.
   - Admit when the answer is not found in the policy.
2. Initialize a `ChatOpenAI` model (GPT).
3. Combine these with the `retriever` into a **`RetrievalQA` chain**.
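
The chain in this section is built with `chain_type="stuff"`, which places all retrieved chunks into the `{context}` slot of a single prompt. A minimal sketch of that assembly step follows, using plain strings instead of LangChain objects. The shortened template, the placeholder chunk texts, and the `build_stuffed_prompt` helper are illustrative assumptions, not the chain's internal code.

```python
TEMPLATE = (
    "Answer strictly from the context below.\n"
    "----------------\n"
    "Context:\n{context}\n"
    "----------------\n"
    "Question: {question}\n"
)


def build_stuffed_prompt(chunks: list[str], question: str) -> str:
    """Join every retrieved chunk and fill the template in one pass."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return TEMPLATE.format(context=context, question=question)


retrieved_chunks = [
    "Placeholder text standing in for the first retrieved chunk.",
    "Placeholder text standing in for the second retrieved chunk.",
]
stuffed = build_stuffed_prompt(
    retrieved_chunks, "What does the policy say about training?"
)
print(stuffed)
```

Because every chunk travels in one request, the prompt grows with the number of retrieved chunks and their size. Four chunks of roughly 900 characters keep the context small.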


In [11]:
# 8.1 Define Prompt Template

qa_prompt_template = """You are an AI HR assistant for Nestlé. 
You answer questions strictly based on the provided HR policy context below.

If the answer is not contained in the context, say:
"I’m sorry, but I could not find that information in the Nestlé HR policy document."

Use clear, concise language and, where appropriate, bullet points.

----------------
Context:
{context}
----------------

Question: {question}

Answer as the Nestlé HR assistant:
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=qa_prompt_template,
)

print("Prompt template created.")

Prompt template created.


In [12]:
# 8.2 Initialize Chat Model

llm = ChatOpenAI(
    model_name=CHAT_MODEL_NAME,
    temperature=0.0,  # minimize randomness so repeated runs give near-identical answers
)

print("ChatOpenAI model initialized.")

ChatOpenAI model initialized.


In [13]:
# 8.3 Build RetrievalQA Chain

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)

print("RetrievalQA chain is ready.")

RetrievalQA chain is ready.


## 9. Quick Test Query 

In [14]:
# 9.1 Test the QA Chain with a Sample Question

sample_question = "What does the policy say about working hours and overtime?"

response = qa_chain.invoke({"query": sample_question})

print("Question:", sample_question)
print("\nAnswer:\n", response["result"])

print("\nSources used:")
for i, doc in enumerate(response["source_documents"], start=1):
    page = doc.metadata.get("page", "N/A")
    print(f"Source {i}: page {page}")



Question: What does the policy say about working hours and overtime?

Answer:
 I’m sorry, but I could not find that information in the Nestlé HR policy document.

Sources used:
Source 1: page 4
Source 2: page 4
Source 3: page 4
Source 4: page 4


## 10. Build the Gradio Chatbot Interface

Now we create a simple chatbot interface using **Gradio**:

- Users type HR-related questions.
- The bot responds using the `qa_chain`.
- Each answer ends with a list of the **policy pages** used as sources.
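
When several retrieved chunks come from the same page, listing one line per chunk repeats the same page number. One possible way to tidy the source footer is to deduplicate the page numbers first. The sketch assumes the `page` metadata is zero-based, as `PyPDFLoader` produces it, and converts it to a one-based label. `format_sources` is a hypothetical helper, not part of the notebook.

```python
def format_sources(pages: list) -> str:
    """Build a one-line footer from page metadata values.

    Integer pages are treated as zero-based and shown one-based;
    anything else (for example a missing page) is ignored.
    """
    unique = sorted({p + 1 for p in pages if isinstance(p, int)})
    if not unique:
        return ""
    label = "page" if len(unique) == 1 else "pages"
    return f"Sources: {label} " + ", ".join(str(p) for p in unique)


print(format_sources([4, 4, 4, 4]))      # Sources: page 5
print(format_sources([2, 0, 2, "N/A"]))  # Sources: pages 1, 3
```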


In [15]:
# 10.1 Define Chat Function for Gradio

def hr_chatbot(history, user_message):
    """Answer one user message and return (updated history, cleared textbox)."""
    if not user_message.strip():
        return history, ""

    # Call the RetrievalQA chain ("query" is its input key)
    res = qa_chain.invoke({"query": user_message})
    answer = res["result"]
    
    # Append source page info to the answer
    source_lines = []
    for i, doc in enumerate(res.get("source_documents", []), start=1):
        page_num = doc.metadata.get("page", "N/A")
        source_lines.append(f"Source {i}: page {page_num}")
    if source_lines:
        answer += "\n\n" + "\n".join(source_lines)

    # Update chat history (tolerate an empty/None history on the first turn)
    history = (history or []) + [(user_message, answer)]
    return history, ""

In [18]:
def debug_retrieval(question: str):
    """Print the chunks the retriever returns for a question."""
    docs = retriever.invoke(question)
    print(f"🔎 Question: {question}")
    print(f"Retrieved {len(docs)} chunks:\n")
    for i, d in enumerate(docs, start=1):
        print(f"---- Chunk {i} | page {d.metadata.get('page', 'NA')} ----")
        print(d.page_content[:800])
        print()

# Example:
debug_retrieval("What does the policy say about working hours?")


🔎 Question: What does the policy say about working hours?
Retrieved 4 chunks:

---- Chunk 1 | page 4 ----
working inside or outside our premises under 
contractual obligations with service providers 
and we insist that they also take steps so that 
adequate working conditions are made available 
to them.
We believe that it is essential to build a 
relationship based on trust and respect of 
employees at all levels. We do not tolerate any 
form of harassment or discrimination.
Therefore, managers are committed to build 
and sustain, with their teams, an environment 
of mutual trust. HR ensures that a respectful 
dialogue is present and the voice of the 
employees is heard.
Corporate policy: 
Policy on Conditions of Work and Employment
 Employment and working conditions

---- Chunk 2 | page 4 ----
working inside or outside our premises under 
contractual obligations with service providers 
and we insist that they also take steps so that 
adequate working conditions are made available 

In [21]:
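def is_context_relevant(docs, threshold=25):
    """Length heuristic: True if the chunks hold at least `threshold` characters.

    This only detects empty or near-empty retrieval; it does not measure
    how well the chunks match the question.
    """
    total_len = sum(len(d.page_content.strip()) for d in docs)
    return total_len >= threshold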
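
A character count only confirms that some text came back; it says nothing about whether that text matches the question. A stricter guard could filter on similarity scores instead. The sketch below works on `(text, score)` pairs with scores between 0 and 1, which is the shape LangChain vector stores return from `similarity_search_with_relevance_scores` (with `Document` objects in place of the strings). The 0.5 cutoff, the invented scores, and the helper name `filter_by_score` are assumptions that would need tuning on real queries.

```python
def filter_by_score(scored_chunks, min_score=0.5):
    """Keep only chunks whose relevance score reaches min_score."""
    return [chunk for chunk, score in scored_chunks if score >= min_score]


# Invented scores, for illustration only
scored = [
    ("chunk about working conditions", 0.71),
    ("chunk about training", 0.42),
    ("chunk about remuneration", 0.55),
]
kept = filter_by_score(scored)
print(kept)  # ['chunk about working conditions', 'chunk about remuneration']
```

If `kept` comes back empty, the chatbot can return the fallback message without calling the model.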


In [22]:
def chat_answer(history, message):
    """Like hr_chatbot, but skips the LLM call when retrieval returns almost no text."""
    history = history or []
    retrieved = retriever.invoke(message)

    if not is_context_relevant(retrieved):
        answer = "I’m sorry, but I could not find that information in the Nestlé HR policy document."
        history.append((message, answer))
        return history, ""

    # Normal RAG pipeline
    res = qa_chain.invoke({"query": message})
    answer = res["result"]
    
    pages = [str(d.metadata.get("page", "NA")) for d in res["source_documents"]]
    if pages:
        answer += "\n\nSources: " + ", ".join(pages)

    history.append((message, answer))
    return history, ""


In [23]:
# 10.2 Build and Launch Gradio Interface

with gr.Blocks(title="Nestlé HR Policy Chatbot") as demo:
    gr.Markdown("""# Nestlé HR Policy Chatbot
Ask questions about the Nestlé HR policy document.  
The assistant will answer **only** using information found in the policy PDF.
""")
    
    chat = gr.Chatbot(height=420, label="Chat History")
    user_input = gr.Textbox(label="Your Question", placeholder="Type an HR-related question here...")
    clear_btn = gr.ClearButton([chat, user_input])

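    # hr_chatbot and chat_answer take the same arguments; pass fn=chat_answer
    # instead to use the empty-retrieval guard defined above.
    user_input.submit(
        fn=hr_chatbot,
        inputs=[chat, user_input],
        outputs=[chat, user_input],
    )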



print("Gradio app is defined. Run demo.launch() in an interactive environment to start the chatbot.")

Gradio app is defined. Run demo.launch() in an interactive environment to start the chatbot.




In [25]:
demo.launch()

Rerunning server... use `close()` to stop if you need to change `launch()` parameters.
----
* To create a public link, set `share=True` in `launch()`.


