## Data Preprocessing

1. Load the **PDF or Markdown** files from the data directory
2. Extract the **content** from each file
3. Split the content into chunks, *because the context window of an LLM is limited*
4. Pass the chunks to the **embedding model**

In [2]:
# 1. Loading of the PDFs

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document


In [None]:
import os

FILE_PATH = "C:/Users/sayan/OneDrive/Desktop/internship-project/enterprise-knowledge-copilot/data"

# Function to load PDF files from a specified directory
def load_pdf_file(file_path=FILE_PATH):
    if not os.path.isdir(file_path):
        print(f"Error: The provided path '{file_path}' is not a directory.")
        return []

    loader = DirectoryLoader(
        file_path,
        glob="*.pdf",
        loader_cls=PyPDFLoader,
        recursive=True,
    )

    documents = loader.load()
    return documents


# Function to load Markdown files from a specified directory
# Note: TextLoader returns the raw file text, so YAML front matter
# (title, description) is not parsed into metadata.
def load_markdown_file(file_path=FILE_PATH):
    if not os.path.isdir(file_path):
        print(f"Error: The provided path '{file_path}' is not a directory.")
        return []

    loader = DirectoryLoader(
        file_path,
        glob="**/*.md",           
        loader_cls=TextLoader,
        loader_kwargs={"encoding": "utf-8"}
    )

    docs = loader.load()
    return docs

In [29]:
import re
import markdown
from bs4 import BeautifulSoup

# Function to clean markdown text
def clean_markdown(md_text: str) -> str:
    # 1. Remove YAML front matter
    md_text = re.sub(r"^---.*?---", "", md_text, flags=re.DOTALL)

    # 2. Convert markdown → HTML
    html = markdown.markdown(md_text)

    # 3. Parse HTML
    soup = BeautifulSoup(html, "html.parser")

    # 4. Remove unwanted tags
    for tag in soup(["script", "style", "iframe", "img", "table"]):
        tag.decompose()

    # 5. Get text
    text = soup.get_text(separator="\n")

    # 6. Remove markdown links but keep text
    text = re.sub(r"\[(.*?)\]\(.*?\)", r"\1", text)

    # 7. Normalize whitespace
    text = re.sub(r"\n{2,}", "\n\n", text)
    text = re.sub(r"[ \t]+", " ", text)

    return text.strip()
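
The front matter step in `clean_markdown` can be checked in isolation. The sketch below reuses the same pattern on two made-up strings (both are hypothetical samples, not files from the dataset) and shows one caveat: a file without front matter that opens with a `---` rule loses everything up to the next `---`.

```python
import re

# Same pattern and flag as step 1 of clean_markdown
FRONT_MATTER = r"^---.*?---"

# Hypothetical sample with YAML front matter
with_front_matter = "---\ntitle: Handbook\n---\n# Intro\nBody text"
stripped = re.sub(FRONT_MATTER, "", with_front_matter, flags=re.DOTALL)
print(repr(stripped))  # '\n# Intro\nBody text'

# Hypothetical sample with no front matter, opening with a horizontal rule:
# the lazy match runs to the next '---', so real content is removed.
no_front_matter = "---\nIntro paragraph\n\n---\nSecond section"
damaged = re.sub(FRONT_MATTER, "", no_front_matter, flags=re.DOTALL)
print(repr(damaged))  # '\nSecond section'
```

If such files exist in the data directory, parsing the front matter with a dedicated parser is likely safer than stripping it with a regex.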

In [30]:
# Keeps only page_content plus the title, description, and source (file name) metadata.
def filter_to_minimal_docs(docs):
    minimal_docs = []

    for doc in docs:
        full_path = doc.metadata.get("source", "")
        file_name = os.path.basename(full_path)

        minimal_doc = Document(
            page_content=doc.page_content,
            metadata={
                "title": doc.metadata.get("title"),
                "description": doc.metadata.get("description"),
                "source": file_name
            }
        )

        minimal_docs.append(minimal_doc)

    return minimal_docs

In [31]:
# Split the documents into smaller chunks
def text_split(minimal_docs):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100
    )
    split_docs = text_splitter.split_documents(minimal_docs)
    return split_docs
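
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified illustration using a plain sliding window. This is not how `RecursiveCharacterTextSplitter` works internally (it splits on separators such as paragraphs, lines, and spaces before falling back to characters); the helper `sliding_window_chunks` is a hypothetical name used only for this sketch.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.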

In [32]:
import frontmatter
from pathlib import Path

# Loads Markdown files and parses their YAML front matter into metadata
def load_md_with_metadata(path):
    docs = []

    for file in Path(path).rglob("*.md"):
        post = frontmatter.load(file)

        docs.append(
            Document(
                page_content=post.content,
                metadata={
                    "title": post.get("title"),
                    "description": post.get("description"),
                    "source": str(file)
                }
            )
        )

    return docs

In [33]:
# Loading the Markdown files
docs = load_md_with_metadata(FILE_PATH)

# Keep only page_content plus the title, description, and source metadata
minimal_docs = filter_to_minimal_docs(docs)

# Split the minimized documents into text chunks
text_chunks = text_split(minimal_docs)

## Performing the Vector Embedding on Text Data:

The embedding model is `all-MiniLM-L6-v2` from **Sentence-Transformers**

1. The chunks are processed and converted to vectors
2. The embedded vectors are then sent to the Endee Vector Database through the Endee API
3. Once the data is stored in the vector DB, it is ready to serve as a knowledge base for the LLM

In [40]:
from sentence_transformers import SentenceTransformer
embeddingModel = SentenceTransformer('all-MiniLM-L6-v2')

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [8]:
last_index = len(text_chunks)

### Storing the vector embeddings for each text chunk

In [46]:
vectorEmbeddings = []

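# `idx` avoids shadowing the built-in `id`; IDs are 1-based
for idx, chunk in enumerate(text_chunks, start=1):
    # `or ""` also covers keys that exist but hold None (missing front matter fields)
    source = chunk.metadata.get("source") or ""
    title = chunk.metadata.get("title") or ""
    description = chunk.metadata.get("description") or ""
    text = chunk.page_content
    embedding = embeddingModel.encode(text).tolist()

    data = {
        "id": idx,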
        "vector": embedding,
        "meta": {
            "title": title,
            "description": description,
            "source": source,
            "text": text
        }
    }
    vectorEmbeddings.append(data)

In [47]:
print(vectorEmbeddings[0])

{'id': 1, 'vector': [-0.07971559464931488, -0.04249269515275955, 0.034429214894771576, -0.004940586164593697, 0.00379744078963995, -0.11942380666732788, 0.03763573616743088, -0.032276514917612076, 0.03974507749080658, -0.08302801847457886, -0.07007479667663574, 0.07063862681388855, 0.08201108127832413, -0.004002273548394442, 0.04556172341108322, -0.05214153602719307, 0.01700224168598652, 0.04116419702768326, 0.03911973908543587, -0.08725880831480026, -0.0489850677549839, 0.011393359862267971, 0.023980718106031418, 0.03097580559551716, 0.022394699975848198, -0.06662602722644806, 0.003506705164909363, -0.016434794291853905, -0.07027978450059891, -0.09300784021615982, -0.035802412778139114, 0.015175309032201767, 0.017852244898676872, 0.072708860039711, -0.10882662236690521, 0.15073974430561066, -0.011714130640029907, -0.026711083948612213, -0.07094957679510117, -0.021012822166085243, -0.046953000128269196, -0.05267389491200447, -0.02817157655954361, -0.01358864177018404, 0.003454051911830

### Setting up the connection to the Endee API (served by Flask)

In [48]:
import requests

# Index name for Endee Vector Database
INDEX_NAME = "enterprise_knowledge_base"

# URL for Endee API service
ENDEE_URL = "http://127.0.0.1:8000" 

### Creating an Index in Endee Vector Database

In [43]:
# Payload for creating an index in Endee Vector DB
payload_for_create_index = {
    "index_name": INDEX_NAME,
    "dimension": len(vectorEmbeddings[0]["vector"]),
    "precision": "INT16D"
}

In [44]:
response_for_create_index = requests.post(
    f"{ENDEE_URL}/index/create",
    json=payload_for_create_index
)
print(f"Message for index creation: {response_for_create_index.json()}")

Message for index creation: {'index_name': 'enterprise_knowledge_base', 'status': 'index created'}


### Checking for the Existence of the Index in the Endee Vector Database

In [45]:
response_for_get_index = requests.post(
    f"{ENDEE_URL}/index/get",
    json={
        "index_name": INDEX_NAME
    })
print(response_for_get_index.json())

{'index_name': 'enterprise_knowledge_base', 'status': 'index loaded'}


### Inserting the Embedded Vectors into Endee Vector DB 

In [None]:
payload_to_insert_multiple_data = {
    "index_name": INDEX_NAME,
    "embedded_vectors": vectorEmbeddings
}

payload_to_insert_single_data = {
    "index_name": INDEX_NAME,
    "embedded_vectors": [vectorEmbeddings[0]]
}

In [50]:
response_for_multiple_insert = requests.post(
    f"{ENDEE_URL}/index/upsert",
    json=payload_to_insert_multiple_data
)

# response_for_single_insert = requests.post(
#     f"{ENDEE_URL}/index/upsert",
#     json=payload_to_insert_single_data
# )

print(response_for_multiple_insert.json())
# print(response_for_single_insert.json())

{'error': 'Cannot insert more than 1000 vectors at a time'}


In [None]:
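# Insert one slice at a time, since a single request may hold at most 1000 vectors;
# this run sends the slice starting at index 1740.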

payload = {
    "index_name": INDEX_NAME,
    "embedded_vectors": vectorEmbeddings[1740:]
}

response = requests.post(
    f"{ENDEE_URL}/index/upsert",
    json=payload
)

print(response.json())

{'count': 805, 'status': 'vectors upserted'}
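
Rather than slicing by hand, the 1000-vector limit reported above can be handled with a small batching helper. The sketch below is a minimal version under stated assumptions: `batched` and `upsert_in_batches` are hypothetical helper names, and the sender is passed in as a callable so the logic can be dry-run without a running Endee service. The count of 2545 comes from the slice above (1740 + 805).

```python
from typing import Callable, Iterator

MAX_BATCH = 1000  # limit taken from the error message returned by the upsert call


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upsert_in_batches(vectors: list, send: Callable[[dict], None],
                      index_name: str, batch_size: int = MAX_BATCH) -> int:
    """Send the vectors batch by batch and return how many were sent."""
    total = 0
    for batch in batched(vectors, batch_size):
        send({"index_name": index_name, "embedded_vectors": batch})
        total += len(batch)
    return total


# Dry run with a stand-in sender that only records the batch sizes
sent_sizes = []
fake_vectors = [{"id": i + 1} for i in range(2545)]
total = upsert_in_batches(
    fake_vectors,
    lambda payload: sent_sizes.append(len(payload["embedded_vectors"])),
    "demo_index",
)
print(total, sent_sizes)  # 2545 [1000, 1000, 545]
```

In the notebook, `send` could be something like `lambda p: requests.post(f"{ENDEE_URL}/index/upsert", json=p).raise_for_status()`, so that a failed batch stops the loop instead of being skipped silently.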


### Retrieving the Top K most relevant data

In [92]:
query = "What is GitLab?"
embedding_for_query = embeddingModel.encode(query).tolist()

In [93]:
payload = {
    "index_name": INDEX_NAME,
    "vector": embedding_for_query,
    "top_k": 10
}

In [94]:
# Sends a query to the Endee API
response = requests.post(
    f"{ENDEE_URL}/index/query",
    json=payload
)

data = response.json()

In [None]:
data["results"]
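
Conceptually, a top-K query ranks the stored vectors by their similarity to the query vector and keeps the best K. The sketch below does this by brute force with cosine similarity on toy two-dimensional vectors. Both choices are assumptions made for illustration: the notebook does not state which metric or index structure Endee uses, and a real vector database would normally avoid scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], stored: list[dict], k: int) -> list[tuple[float, int]]:
    """Return the k best (similarity, id) pairs, most similar first."""
    scored = [(cosine_similarity(query, item["vector"]), item["id"]) for item in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data in the same {"id", "vector"} shape used for the upsert payload
stored = [
    {"id": 1, "vector": [1.0, 0.0]},
    {"id": 2, "vector": [0.0, 1.0]},
    {"id": 3, "vector": [1.0, 1.0]},
]
ranked = top_k([1.0, 0.0], stored, k=2)
print([item_id for _, item_id in ranked])  # [1, 3]
```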

### Build the complete RAG Pipeline:

1. Using GROQ Cloud for its free APIs and fast responses
2. Using LangChain as a wrapper

### How does this work:

1. `retriever`: returns `List[Document]` for a query (Find relevant information)
2. `prompt`: a prompt template with `{context}` and `{input}` (Frame the question for the AI)
3. `chatModel`: an LLM runnable (Generate the answer)
4. `RunnablePassthrough`: Carry the user’s question forward unchanged
5. `StrOutputParser`: Clean the output
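
The data flow through these five pieces can be sketched with plain functions. This is only an illustration of the order of operations, not LangChain code: every function below is a hypothetical stand-in, and the "LLM" simply echoes its prompt so the result is deterministic.

```python
def retrieve(question: str) -> list[str]:
    # Stand-in for `retriever`: returns the text of the matching chunks
    return ["chunk one", "chunk two"]


def build_prompt(inputs: dict) -> str:
    # Stand-in for `prompt`: fills {context} and {input} into a template
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['input']}"


def fake_llm(prompt_text: str) -> dict:
    # Stand-in for `chatModel`: echoes the prompt inside a message-like dict
    return {"content": "echo: " + prompt_text}


def parse(message: dict) -> str:
    # Stand-in for `StrOutputParser`: extracts the plain string
    return message["content"]


def run_chain(question: str) -> str:
    # The same question feeds both branches: one goes through retrieval,
    # the other is passed through unchanged (the RunnablePassthrough role).
    inputs = {
        "context": "\n\n".join(retrieve(question)),
        "input": question,
    }
    return parse(fake_llm(build_prompt(inputs)))


result = run_chain("What is X?")
print(result)
```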

In [59]:
# Loading the environment variables from the .env file
from dotenv import load_dotenv
load_dotenv()

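GROQ_API_KEY = os.getenv("GROQ_API_KEY")
# os.environ only accepts strings, so fail early with a clear message if the key is missing
if not GROQ_API_KEY:
    raise RuntimeError("GROQ_API_KEY is not set; add it to the .env file")
os.environ["GROQ_API_KEY"] = GROQ_API_KEY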

In [60]:
from langchain_groq import ChatGroq

llm = ChatGroq(
    model_name="llama-3.3-70b-versatile",
    temperature=0.5
)

In [68]:
system_prompt = """
You are a professional and helpful assistant for a company named GitLab.
Answer general questions in three to five sentences.
If an answer needs more elaboration, add detail such as step-by-step instructions, but only when needed.
Use the context given below as your reference, and keep the answer concise and to the point.
Respond in detail only when the user asks for it.

Context:
{context}

If a user asks whether they can upload a document, respond with "Yes, you can upload a PDF Document only."
"""

In [62]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

In [87]:
# Wrapper that queries Endee and returns the matches as LangChain Documents
def endee_retriever(query: str):
    embedding_for_query = embeddingModel.encode(query).tolist()

    payload = {
        "index_name": INDEX_NAME,
        "vector": embedding_for_query,
        "top_k": 10
    }

    response = requests.post(
        f"{ENDEE_URL}/index/query",
        json=payload
    )

    response.raise_for_status()
    data = response.json()

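    # The response shape is not confirmed here: an earlier cell reads
    # data["results"], so accept either key instead of silently returning no docs.
    matches = data.get("matches") or data.get("results") or []

    docs = []
    for d in matches:
        # The fields were upserted under "meta"; fall back to the top level if it is absent.
        meta = d.get("meta") or d
        text = meta.get("text", "")

        docs.append(
            Document(
                page_content=text,
                metadata={
                    "similarity": d.get("similarity"),
                    "source": meta.get("source"),
                    "title": meta.get("title"),
                    "description": meta.get("description")
                }
            )
        )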

    return docs

retriever = RunnableLambda(endee_retriever)

In [69]:
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}")
])

In [88]:
# RAG pipeline:

rag_chain = (
    {
        "context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
        "input": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

In [86]:
# Sample test:
rag_chain.invoke("If I want to prepare for an Interview, what all should I need to learn, can you guide me in this")

"To prepare for an interview, you should learn about the company, the role you're applying for, and the required skills. Review the job description and requirements, and be ready to provide examples of your experience and skills. You should also practice common interview questions and be prepared to ask questions to the interviewer. Additionally, learn about GitLab's products and services, such as Git version control, CI/CD pipelines, and Agile project planning."

In [91]:
# Sample test:
print(rag_chain.invoke("If I want to prepare for an Interview, what all should I need to learn, can you guide me in this. Tell me step by step in detail"))

To prepare for an interview, here's a step-by-step guide:

1. **Research the Company**: Learn about GitLab's products, mission, values, and culture to understand our vision and goals.
2. **Review the Job Description**: Study the job requirements, responsibilities, and skills needed for the position you're applying for.
3. **Update Your Resume and Online Profiles**: Ensure your resume, LinkedIn profile, and other social media platforms are up-to-date and highlight your relevant skills and experiences.
4. **Practice Common Interview Questions**: Prepare answers to common interview questions, such as "Why do you want to work at GitLab?" or "What are your strengths and weaknesses?"
5. **Develop Your Technical Skills**: If you're applying for a technical role, review and practice your coding skills, and be prepared to answer technical questions related to your field.
6. **Prepare Questions to Ask**: Come up with a list of questions to ask the interviewer, such as "What are the biggest chall

In [85]:
rag_chain.invoke("Tell me about yourself")

"I am a professional and helpful assistant for GitLab, a company that provides a platform for version control and collaborative software development. My role is to assist users with their queries and provide support. I can help with general questions, provide step-by-step instructions when needed, and offer guidance on using GitLab's features and services. I'm here to help with any questions you may have."

In [67]:
# Testing the RAG pipeline with user input in a loop
while True:
    user_input = input("User: ")
    if user_input.lower() in ["exit", "quit"]:
        print("Exiting the chat. Goodbye!")
        break

    response = rag_chain.invoke(user_input)
    print(f"Assistant: {response}")

Assistant: You can check the company's website for employee referral programs or ask current employees if they can refer you. You can also utilize professional networking platforms like LinkedIn to connect with current or former employees and ask for referrals. Additionally, you can reach out to the company's HR department to inquire about their referral process.
Assistant: The company has various job roles, but the specific roles are not mentioned in the given context. If you need more information, please provide more context or clarify which company you are referring to. However, some common job roles in general include administrative, technical, marketing, and sales positions. For more details, please provide additional context.
Assistant: You can check the available openings on the official GitLab website, specifically on their careers or jobs page. They list various positions, including engineering, marketing, sales, and more. For the most up-to-date information, I recommend visit