Launching the HR Assistant, which augments a fine-tuned LLM with an additional knowledge base, takes 4 steps:


1. Installing necessary libraries for fine-tuning HR Assistant model with Company Policy documents
2. Preprocessing and Chunking Company Policy Documents for Performing RAG
3. Use the fine-tuned LLM to answer HR queries using additional document chunks as a knowledge base.
4. Launch Daily HR Chatbot App using Gradio


# 1. Installing necessary libraries for fine-tuning HR Assistant model with Company Policy documents



In [None]:
%%capture
!pip install langchain  # Framework for building language model applications
!pip install -U langchain-community  # Community integrations (Hugging Face embeddings, FAISS, PDF loaders)
!pip install -U sentence-transformers  # Embedding models for sentence-level representations
!pip install gradio  # To build a demo site for chatbots
!pip install langchain_google_genai --upgrade  # Install or upgrade the langchain_google_genai package for Google GenAI support
!pip install faiss-cpu  # Vector database for efficient similarity search
!pip install pypdf  # For handling PDF documents
!pip install datasets  # For accessing various datasets
!pip install transformers accelerate  # Hugging Face Transformers and Accelerate library for efficient model deployment
!pip install bitsandbytes  # Quantization and optimization for transformers
!pip install huggingface_hub  # For accessing models from Hugging Face Hub
!pip install torch  # For deep learning framework
!pip install peft  # Parameter-Efficient Fine-Tuning for transformers

# 2. Preprocessing and Chunking Company Policy Documents for Performing RAG
**Documentation:** Preprocessing and Chunking Company Policy Documents for Performing RAG in HR AI Assistant

**Usage Context:**
This preprocessing pipeline prepares HR policy documents for a Retrieval-Augmented Generation (RAG) system, enabling an HR AI Assistant to answer employee queries effectively. By chunking policy documents and structuring them into a retrievable format, the system improves contextual accuracy and search efficiency when responding to HR-related queries.

1. Suppress Warnings & Import Required Libraries
   - Disables unnecessary warnings to improve readability.
   - Imports libraries for document loading, text chunking, embeddings, vector storage, and retrieval-based question-answering (RAG).
2. Load and Extract Text from Company Policy Documents
   - Mounts Google Drive to access policy documents stored in a specific folder.
   - Retrieves all PDF files from the folder and extracts text using PyPDFLoader, which returns one document per PDF page.
   - Stores extracted text as structured documents for further processing.
3. Text Splitting for RAG Processing
   - Uses RecursiveCharacterTextSplitter to divide long documents into overlapping text chunks (up to 500 characters per chunk with a 20-character overlap; by default the splitter measures length in characters, not tokens).
   - The overlap helps retain context across chunk boundaries for retrieval-augmented generation (RAG) tasks in the HR AI Assistant.
4. Dataset Creation for Retrieval-Augmented Generation
   - Saves processed text chunks into a JSONL file (company_policy_dataset.jsonl).
   - The same chunks, kept in memory as text_chunks, are later indexed in a FAISS vector database to support LLM-based question-answering.
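To make the effect of the chunk size and overlap settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not the library's algorithm: RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraph and line breaks. The function name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=20):
    """Split text into character windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Hypothetical 1200-character "document" made of repeating digits
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

Each chunk starts 480 characters after the previous one, so the last 20 characters of one chunk reappear at the start of the next.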



In [None]:
import warnings
warnings.filterwarnings("ignore")  # Suppress unnecessary warnings

# Import the libraries needed for chunking, embedding, and retrieval
from langchain.text_splitter import RecursiveCharacterTextSplitter  # For chunking text
from langchain_community.embeddings import HuggingFaceEmbeddings  # For embedding models (langchain.embeddings is deprecated)
from langchain_community.vectorstores import FAISS  # For vector database support (langchain.vectorstores is deprecated)
from langchain_community.document_loaders import PyPDFLoader  # For loading PDF documents
from langchain.prompts import PromptTemplate  # For prompt templating
from langchain.chains import RetrievalQA  # For Question-Answering tasks over documents

import os
import json

# Mount Google Drive and set up directories
from google.colab import drive  # For mounting Google Drive in Colab
drive.mount('/content/drive')
folder_path = '/content/drive/MyDrive/AI Project/CompanyPolicyDocuments'  # Folder containing PDFs

# Get list of all PDF files from the folder
pdf_files = sorted(f for f in os.listdir(folder_path) if f.lower().endswith('.pdf'))  # Also matches ".PDF"

# Load all PDF documents
documents = []
for pdf_file in pdf_files:
    pdf_path = os.path.join(folder_path, pdf_file)
    loader = PyPDFLoader(pdf_path)  # PDF loader for extracting text
    documents.extend(loader.load())

print(f"Loaded {len(documents)} documents from {folder_path}")

# Split documents into smaller chunks using RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
text_chunks = text_splitter.split_documents(documents)

print(f"Number of text chunks created: {len(text_chunks)}")

# Save text chunks into a JSONL file (one JSON object per line)
dataset_path = "company_policy_dataset.jsonl"
with open(dataset_path, "w", encoding="utf-8") as f:
    for chunk in text_chunks:
        json.dump({"text": chunk.page_content}, f)
        f.write("\n")

print(f"Dataset saved to {dataset_path}")

Mounted at /content/drive
Loaded 44 documents from /content/drive/MyDrive/AI Project/CompanyPolicyDocuments
Number of text chunks created: 212
Dataset saved to company_policy_dataset.jsonl
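As a quick sanity check, the JSONL file can be read back line by line. The sketch below shows the round trip using only the standard library; the helper names `write_jsonl` and `read_jsonl` and the sample records are hypothetical, and a temporary directory is used so the real dataset is not touched.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical chunk texts standing in for chunk.page_content
sample_records = [
    {"text": "Employees must notify their Line Manager of any absence."},
    {"text": "A Return to Work discussion follows every episode of sickness."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "company_policy_dataset.jsonl")
    write_jsonl(sample_records, path)
    loaded = read_jsonl(path)

print(len(loaded), "records read back")
```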


# 3. Use the fine-tuned LLM to answer HR queries using additional document chunks as a knowledge base.


**Documentation:** Implementing Conversational RAG for HR AI Assistant

**Usage Context:**
This system powers an HR AI Assistant that enables employees to query company policies using RAG-based conversational AI. The assistant efficiently retrieves relevant policy sections and provides concise, well-structured answers while retaining context-awareness during multi-turn interactions.

1. Import Required Libraries
   - Loads key libraries for retrieval-augmented generation (RAG), LLM inference, and conversational memory.
   - Uses LangChain, Hugging Face Transformers, and FAISS for processing and searching company policy documents.
2. Set Up Memory and Embeddings for RAG
   - ConversationBufferMemory enables context retention in chat history.
   - HuggingFaceEmbeddings converts text chunks into dense vectors for similarity search.
   - FAISS vector database is built from pre-processed policy documents to allow efficient document retrieval.
3. Load Fine-Tuned LLaMA Model
   - Retrieves the trained LLaMA model stored in Google Drive.
   - Applies 4-bit quantization (BitsAndBytesConfig) for efficient inference on limited GPU resources.
   - Uses Hugging Face pipeline for text generation, configuring key hyperparameters like temperature, top-p, and repetition penalty.
4. Define Custom Prompt for HR Query Handling
   - Implements a structured prompt that ensures:
     - Concise responses (4-5 lines only).
     - No unnecessary repetition of context.
     - Fallback response ("Don't have information") if the model lacks relevant data.
5. Create Conversational Retrieval Chain
   - Uses ConversationalRetrievalChain to integrate:
     - Retrieval-based search over company policy documents.
     - Custom response formatting using a predefined prompt.
     - Conversational memory to track chat history.
6. Deploy Chatbot Interface
   - Implements a simple chat loop that continuously accepts user queries.
   - Calls the QA chain to generate responses based on retrieved context.
   - Exits when the user types "exit".
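The retrieval step can be pictured as a nearest-neighbour search over embedding vectors. The sketch below is a toy illustration in plain Python, not the FAISS implementation: it assumes Euclidean (L2) distance, which LangChain's FAISS wrapper typically uses by default, and it uses made-up 3-dimensional vectors, whereas all-MiniLM-L6-v2 produces 384-dimensional embeddings. The names `l2_distance`, `top_k`, and `toy_index` are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=2):
    """Return the k (text, distance) pairs whose vectors lie closest to query_vec."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up 3-dimensional "embeddings" for three policy chunks
toy_index = [
    ("Sick leave must be reported to the Line Manager.", [0.9, 0.1, 0.0]),
    ("Annual leave requests need two weeks' notice.", [0.6, 0.4, 0.1]),
    ("Expense claims are submitted monthly.", [0.0, 0.2, 0.9]),
]

query_vector = [1.0, 0.0, 0.0]  # pretend embedding of a sick-leave question
results = top_k(query_vector, toy_index, k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```

The two closest chunks are what the chain would pass to the LLM as context; a smaller `k` keeps the prompt shorter at the cost of possibly missing relevant text.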


In [None]:
import os
import json
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_community.llms import HuggingFacePipeline  # langchain.llms is deprecated
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

# Log in to the Hugging Face Hub
# notebook_login()
os.environ["HF_TOKEN"] = ""  # Set your Hugging Face token here

# Embed the policy chunks and index them in FAISS for similarity search
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_db = FAISS.from_documents(text_chunks, embeddings)
# vector_db.similarity_search("Procedure to avail a sick leave")

MODEL_PATH = "/content/drive/My Drive/trained_llama_model_gen_AI"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Load the model (ensure you use the correct quantization settings)
bnb_config = BitsAndBytesConfig(load_in_4bit=True)  # Use 8-bit if needed
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", quantization_config=bnb_config)

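# Create a text generation pipeline
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,  # Cap the length of the generated answer
    do_sample=True,  # Sampling must be enabled for temperature and top_p to take effect
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.1,
    return_full_text=False,  # Return only the completion, not the echoed prompt
)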

# Wrap in LangChain LLM
llm = HuggingFacePipeline(pipeline=text_pipeline)

# Define your custom prompt
# template = """use the context to provide a concise answer and if you don't know just say don't have information.
# {context}
# Question: {question}
# Helpful Answer:"""


retriever = vector_db.as_retriever(search_kwargs={"k": 2})
prompt_template = """You are an AI assistant helping users with HR policy questions. Use the given context to answer concisely
and Do not mention or repeat the context in your response and provide your summarized response in 4 to 5 lines only
and if you don't know just say don't have information.

Context:
{context}

User Question: {question}

Helpful Answer:"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(prompt_template)

# Enable conversational memory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Create Conversational Retrieval Chain with custom prompt
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,  # Reuse the k=2 retriever defined above to keep the prompt short
    memory=memory,
    combine_docs_chain_kwargs={"prompt": QA_CHAIN_PROMPT}  # Adding the custom prompt here
)

# Chat loop
print("Chatbot is ready! Type 'exit' to stop.")

while True:
    query = input("You: ")
    if query.strip().lower() == "exit":
        break
    response = qa_chain.invoke({"question": query})  # Pass the question as a dictionary
    answer = response["answer"]  # Access the 'answer' key
    print(f"Bot: {answer}")  # Print only the answer


In [None]:
# Sample prompts:
# what's the responsibility of the employee irrespective of Attendance
# what's the responsibility of the Line Manager irrespective of Attendance
qa_chain.invoke({"question": "what's the responsibility of the employee irrespective of Attendance"})["answer"]

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


"You are an AI assistant helping users with HR policy questions. Use the given context to answer concisely \nand Do not mention or repeat the context in your response and provide your summarized response in 4 to 5 lines only\nand if you don't know just say don't have information.\n\nContext:\nunderstand and comply with the policy and procedures.   \nIt is the responsibility of the Line Manager:  \n\uf0b7 To deal with absence in a fair and consistent way, in line with the procedures outlined in \nthis policy. \n\uf0b7 To conduct a Return to Work discussion with employees after all episodes of sickness \n(including those of 1 day) confirming the dates and reason for absence.  \n(N.B. this helps to demonstrate that attendance is managed and can also serve as the\n\nResponsibilities \n\uf0b7 It is the responsibility of the employee to attend work regularly and to notify their \nLine Manager or nominated deputy that they are unable to work due to sickness \n\uf0b7 It is the responsibility o

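The output above begins with the prompt itself, which suggests the pipeline returned the prompt together with the completion. One way to keep only the answer is to cut everything up to the final `Helpful Answer:` marker from the prompt template. The sketch below assumes that marker appears verbatim in the generated text; the helper name `extract_answer` and the sample string are hypothetical.

```python
ANSWER_MARKER = "Helpful Answer:"  # last line of the prompt template


def extract_answer(generated_text, marker=ANSWER_MARKER):
    """Return the text after the last occurrence of marker, or the whole text if absent."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


# Hypothetical raw model output that echoes the prompt before answering
raw = (
    "You are an AI assistant helping users with HR policy questions.\n\n"
    "User Question: Who approves leave?\n\n"
    "Helpful Answer: The Line Manager approves leave."
)
print(extract_answer(raw))
```

Splitting on the last occurrence matters because retrieved context could, in principle, contain the same phrase.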
# 4. Launch Daily HR Chatbot App using Gradio
**Documentation:** Deploying HR AI Assistant Using Gradio Chat Interface

**Usage Context:**
This implementation creates an interactive HR AI chatbot for employee queries on HR policies. Users can enter workplace-related questions, and the system provides concise, expert-guided responses using Conversational RAG.

1. Import Required Library
   - Uses Gradio (gr), a Python library for building interactive AI-powered web applications.
2. Define Chatbot Response Function
   - Accepts user input and chat history.
   - Passes the user's query to the Retrieval-Augmented Generation (RAG) pipeline (qa_chain).
   - Appends the generated response to chat history for conversational continuity.
3. Create a Gradio Chat UI
   - Uses Gradio Blocks to structure the interface.
   - Displays a Markdown header introducing the chatbot.
   - Implements a chatbox (gr.Chatbot) to hold conversation history.
   - Adds an input text field (gr.Textbox) for user messages.
   - Provides a send button (gr.Button) to submit queries.
4. Handle User Interactions
   - When the send button is clicked, the chatbot:
     - Processes the query using chatbot_response().
     - Updates the chat window with the new response.
     - Clears the input field for the next query.
5. Launch the Gradio Chatbot
   - Calls demo.launch() to start the chatbot server, making it accessible via a web interface.
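The response function can be exercised without Gradio or the model by swapping the QA chain for a stub. The sketch below follows the same contract as the steps above (take the message and history, return an empty string plus the updated history); `make_chat_handler` and `fake_answer` are hypothetical names, and skipping blank input is an added assumption rather than something the notebook does.

```python
def make_chat_handler(answer_fn):
    """Build a Gradio-style handler around any function that maps a question to an answer."""
    def handle(user_input, chat_history):
        chat_history = list(chat_history or [])  # copy so the caller's list is not mutated
        if not user_input.strip():
            return "", chat_history  # ignore blank messages (assumed behaviour)
        chat_history.append((user_input, answer_fn(user_input)))
        return "", chat_history  # the empty string clears the input box
    return handle


def fake_answer(question):
    """Stub standing in for the RAG chain."""
    return f"(stub answer to: {question})"


handle = make_chat_handler(fake_answer)
cleared, history = handle("How do I report sick leave?", [])
cleared, history = handle("Who approves it?", history)
print(history)
```

In the real app, `answer_fn` would wrap the call to `qa_chain`, and `handle` would be the function passed to the send button's `click` event.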


In [None]:
import gradio as gr

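def chatbot_response(user_input, chat_history):
    """Answer the user's query and append the (question, answer) pair to the chat history."""
    chat_history = chat_history or []  # Guard against an empty (None) history
    response = qa_chain.invoke({"question": user_input})["answer"]
    chat_history.append((user_input, response))
    return "", chat_history  # The empty string clears the input box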

with gr.Blocks() as demo:
    gr.Markdown("## Your daily GenAI-powered HR companion: expert guidance for your workplace needs.")

    chatbot = gr.Chatbot()
    msg = gr.Textbox(label="Your Message")
    send_btn = gr.Button("Send")

    send_btn.click(chatbot_response, inputs=[msg, chatbot], outputs=[msg, chatbot])

demo.launch()


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://47c508296366b477cd.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


