# What is the problem?
- Companies have lots of internal documents: HR policies, SOPs, product manuals.
- Employees waste time reading through these documents or asking HR and managers.
- A general-purpose assistant such as ChatGPT cannot help on its own, because it has no access to your company’s internal documents.

## What are we trying to achieve?
Build a chatbot (or AI agent) that answers employee questions using your company’s documents.

### How does it work (concept)?
We use RAG (Retrieval-Augmented Generation):
1. We store the documents in a special searchable database (a vector database).
2. When a user asks a question, we search the documents to find the most relevant parts (retrieval).
3. We send those relevant parts plus the question to the LLM (GPT-4, Claude, etc.) to generate an answer grounded in them.
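
The flow described above can be sketched offline in plain Python. This is only an illustration under loud assumptions: the helper names (`tokenize`, `retrieve`, `build_prompt`, `answer`, `fake_llm`) are hypothetical, simple word overlap stands in for a real vector search, and `fake_llm` just echoes the prompt instead of calling a model.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, k=1):
    """Return the k chunks that share the most words with the question.

    Assumption: word overlap is a stand-in for embedding similarity.
    """
    q_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & tokenize(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_chunks):
    """Place the retrieved chunks in front of the question (the 'augment' step)."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def answer(question, chunks, llm, k=1):
    """Retrieve, augment, then hand the prompt to any callable LLM."""
    return llm(build_prompt(question, retrieve(question, chunks, k)))

# Sentences taken from the sample documents used later in this notebook.
chunks = [
    "The maternity leave policy allows for 26 weeks of paid leave, and paternity leave is 2 weeks.",
    "The dress code is business casual from Monday to Thursday and casuals on Friday.",
    "Remote work is allowed up to 2 days per week with manager approval.",
]

def fake_llm(prompt):
    # Hypothetical stand-in for a real model: returns the prompt unchanged
    # so the retrieved context can be inspected without an API call.
    return prompt

result = answer("What is the maternity leave policy?", chunks, fake_llm)
print(result)
```

Only the chunk about leave reaches the prompt, which is the point of retrieval: the model sees a small, relevant slice of the documents rather than everything.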

# Step-by-step:
## Step 1: Prepare documents
- Collect PDFs or text files of company documents.
- Split them into small chunks (e.g., 200–500 words) for easier search later.
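
To make the idea of chunking concrete, here is a minimal word-based splitter with overlap. It is a sketch, not the splitter used later in this notebook: `chunk_words` and its parameters are hypothetical names, and the tiny sizes are chosen only so the result is easy to read.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of the chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Ten placeholder words, split into chunks of 4 words with 1 word of overlap.
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=1)
print(pieces)
```

Each chunk starts with the last word of the previous one, which is what the overlap buys you.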

## What is the purpose of this code?
You want to take a folder full of documents (PDF, TXT, Word, etc.) and turn them into a list of documents that your GenAI app can later split into text chunks.
LangChain provides a `DirectoryLoader`, which helps you:
- Open a directory (folder) on your computer/server.
- Find all supported files inside.
- Load their contents as a list of `Document` objects.
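
Conceptually, the three steps above amount to something like the following standard-library sketch. It is not how LangChain implements `DirectoryLoader`: `load_text_files` is a hypothetical helper, it handles plain-text files only, and it returns plain dictionaries that merely mimic the `metadata` and `page_content` fields of a `Document`.

```python
from pathlib import Path

def load_text_files(folder, pattern="*.txt"):
    """Read every file in `folder` matching `pattern`.

    Returns a list of dicts shaped like the Document objects shown in this
    notebook: a `metadata` dict with the source path, plus the file text.
    """
    docs = []
    for path in sorted(Path(folder).glob(pattern)):
        if path.is_file():
            docs.append({
                "metadata": {"source": str(path)},
                "page_content": path.read_text(encoding="utf-8"),
            })
    return docs

# Usage (the folder name is a placeholder):
# docs = load_text_files("path/to/your/documents")
```

Keeping the source path in the metadata is what later lets an answer be traced back to the file it came from.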


# Import DirectoryLoader

In [7]:
from langchain_community.document_loaders import DirectoryLoader


# Create a loader and load the documents

In [10]:
loader = DirectoryLoader("D:\\desktop\\LLM_Project folder\\DATA_1")

In [11]:
docs = loader.load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


In [12]:
docs

[Document(metadata={'source': 'D:\\desktop\\LLM_Project folder\\DATA_1\\employee_handbook.txt'}, page_content='Welcome to ACME Corp!\n\nOur company provides flexible working hours from 9 AM to 6 PM with a one-hour lunch break. Employees are entitled to 24 paid leaves per year, including casual and sick leaves.\n\nThe maternity leave policy allows for 26 weeks of paid leave, and paternity leave is 2 weeks.\n\nThe dress code is business casual from Monday to Thursday and casuals on Friday.\n\nWe believe in equal opportunities and have a strict anti-harassment policy.'),
 Document(metadata={'source': 'D:\\desktop\\LLM_Project folder\\DATA_1\\faq.txt'}, page_content="Q: How do I reset my password? A: Please visit the internal portal and click on 'Forgot Password'.\n\nQ: What is the company holiday calendar? A: The calendar is published every December and is available on the intranet.\n\nQ: Can I work remotely? A: Remote work is allowed up to 2 days per week with manager approval."),
 Docum

In [13]:
docs[0].metadata


{'source': 'D:\\desktop\\LLM_Project folder\\DATA_1\\employee_handbook.txt'}

In [14]:
print(docs[0].page_content)


Welcome to ACME Corp!

Our company provides flexible working hours from 9 AM to 6 PM with a one-hour lunch break. Employees are entitled to 24 paid leaves per year, including casual and sick leaves.

The maternity leave policy allows for 26 weeks of paid leave, and paternity leave is 2 weeks.

The dress code is business casual from Monday to Thursday and casuals on Friday.

We believe in equal opportunities and have a strict anti-harassment policy.


## Step 2: Split & embed
Split the text into chunks and turn each chunk into a vector (a list of numbers) using an embedding model.
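
To show what "turning text into a vector" means, here is a toy embedding based on word counts, compared with cosine similarity. This is a deliberately crude sketch: real embedding models produce dense learned vectors, and `embed`, `cosine_similarity`, and the five-word `vocabulary` are hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter

def embed(text, vocabulary):
    """Toy embedding: count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return [float(counts[word]) for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Assumed, hand-picked vocabulary; a real model needs no such list.
vocabulary = ["leave", "maternity", "dress", "code", "friday"]

leave_chunk = embed("The maternity leave policy allows for 26 weeks of paid leave.", vocabulary)
dress_chunk = embed("The dress code is business casual; casuals on Friday.", vocabulary)
query = embed("How long is maternity leave?", vocabulary)

print(cosine_similarity(query, leave_chunk))  # high: shares 'maternity' and 'leave'
print(cosine_similarity(query, dress_chunk))  # zero: no vocabulary words in common
```

Texts about the same topic end up with vectors pointing in similar directions, and that is the property the vector search relies on.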

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


In [17]:
# chunk_size and chunk_overlap are measured in characters by default, not words
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)
chunks


[Document(metadata={'source': 'D:\\desktop\\LLM_Project folder\\DATA_1\\employee_handbook.txt'}, page_content='Welcome to ACME Corp!\n\nOur company provides flexible working hours from 9 AM to 6 PM with a one-hour lunch break. Employees are entitled to 24 paid leaves per year, including casual and sick leaves.\n\nThe maternity leave policy allows for 26 weeks of paid leave, and paternity leave is 2 weeks.\n\nThe dress code is business casual from Monday to Thursday and casuals on Friday.\n\nWe believe in equal opportunities and have a strict anti-harassment policy.'),
 Document(metadata={'source': 'D:\\desktop\\LLM_Project folder\\DATA_1\\faq.txt'}, page_content="Q: How do I reset my password? A: Please visit the internal portal and click on 'Forgot Password'.\n\nQ: What is the company holiday calendar? A: The calendar is published every December and is available on the intranet.\n\nQ: Can I work remotely? A: Remote work is allowed up to 2 days per week with manager approval."),
 Docum

In [18]:
print(len(chunks))

3


In [19]:
#pip install -U langchain-openai


In [20]:
from langchain_openai import OpenAIEmbeddings
with open(r"D:\desktop\LLM_Project folder\Key_GEN_AI.txt","r") as f:
    Api_key = f.read().strip()

embeddings = OpenAIEmbeddings(openai_api_key=Api_key)
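
Reading the key from a text file works, but it ties the notebook to one machine. As an optional alternative, the sketch below checks an environment variable first and falls back to a file. `load_api_key` and its `key_file` parameter are hypothetical; `OPENAI_API_KEY` is the variable name the OpenAI client libraries look for by default.

```python
import os
from pathlib import Path

def load_api_key(env_var="OPENAI_API_KEY", key_file=None):
    """Return an API key from the environment, or from `key_file` as a fallback."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if key_file is not None and Path(key_file).is_file():
        key = Path(key_file).read_text(encoding="utf-8").strip()
        if key:
            return key
    raise RuntimeError(f"No API key found in ${env_var} or in {key_file!r}")

# Usage (the file path is a placeholder):
# Api_key = load_api_key(key_file="path/to/key.txt")
```

Either way, keep the key out of the notebook itself so it does not end up in version control.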


In [21]:
embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x000001F7E0273150>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x000001F7E0280C10>, model='text-embedding-ada-002', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

## Step 3: Store vectors
Store all vectors in a searchable vector database (e.g., FAISS).
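
What "searchable" means here can be shown with a tiny in-memory store. This is a sketch assuming a brute-force search by Euclidean distance; `TinyVectorStore` is a hypothetical class, the two-dimensional vectors are made up, and a real library such as FAISS does the same job with far more efficient data structures.

```python
import math

class TinyVectorStore:
    """Keeps (vector, text) pairs and returns the texts nearest to a query vector."""

    def __init__(self):
        self._items = []  # list of (vector, text) tuples

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query_vector, k=1):
        # Brute force: measure the distance to every stored vector, keep the k closest.
        ranked = sorted(self._items, key=lambda item: math.dist(item[0], query_vector))
        return [text for _, text in ranked[:k]]

store = TinyVectorStore()
store.add([1.0, 0.0], "maternity leave chunk")
store.add([0.0, 1.0], "dress code chunk")
store.add([0.9, 0.1], "paternity leave chunk")

nearest = store.search([1.0, 0.0], k=2)
print(nearest)
```

The query vector lands next to the two leave-related entries, so those are returned and the dress code entry is left out.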




In [22]:
from langchain_community.vectorstores import FAISS


In [23]:
vectorstore = FAISS.from_documents(chunks, embeddings)



In [24]:
vectorstore.save_local("./faiss_index")
print("Vectorstore saved to ./faiss_index")

Vectorstore saved to ./faiss_index


## Step 4: Build Q&A chain

What is RAG?
RAG = Retrieval-Augmented Generation

Combine retrieval from external knowledge + generation from an LLM.

Why?
LLMs (like GPT-3.5/4) don’t know your private documents. So you first search your documents for relevant pieces, and then pass those to the LLM to generate the answer.



| RAG step        | Code                                                                 |
| --------------- | -------------------------------------------------------------------- |
| 🔍 **Retrieve** | `retriever = vectorstore.as_retriever()` finds the relevant chunks   |
| 🔷 **Augment**  | The retrieved chunks are added to the prompt as context              |
| 📝 **Generate** | `llm = ChatOpenAI()` uses the retrieved context to answer            |
| ✅ **Combine**  | `qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)`   |


In [49]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(openai_api_key=Api_key)


In [53]:
retriever = vectorstore.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)

In [55]:
response = qa_chain.run("What is the maternity leave policy?")
print(response)



The maternity leave policy at ACME Corp allows for 26 weeks of paid leave.


In [59]:
from IPython.display import Markdown


In [61]:
Markdown(response)

The maternity leave policy at ACME Corp allows for 26 weeks of paid leave.

Sample questions about general HR policies:

- What are the official working hours?
- How many paid leaves do employees get?
- What is the maternity leave policy?
- What is the paternity leave policy?
- What is the company dress code on Fridays?
- Does the company have an anti-harassment policy?
- How does the company ensure equal opportunities?



In [64]:
response = qa_chain.run("What are the general HR policies?")
Markdown(response)

ACME Corp has the following general HR policies:

- Flexible working hours from 9 AM to 6 PM with a one-hour lunch break.
- 24 paid leaves per year, including casual and sick leaves.
- 26 weeks of paid maternity leave and 2 weeks of paternity leave.
- Business casual dress code from Monday to Thursday, and casual attire on Fridays.
- Equal opportunities policy and strict anti-harassment policy.
- Remote work is allowed up to 2 days per week with manager approval.

In [66]:
response = qa_chain.run("What are the official working hours?")
Markdown(response)

The official working hours at ACME Corp are from 9 AM to 6 PM with a one-hour lunch break.

In [68]:
response = qa_chain.run("How many paid leaves do employees get?")
Markdown(response)

Employees are entitled to 24 paid leaves per year, including casual and sick leaves.