## Data Preprocessing

1. Load the **PDF or Markdown** files in order
2. Extract the **data/content** from each file
3. Split the content into chunks, *because the context window of an LLM is limited*
4. Pass the chunks to the **Embedding Model**

In [1]:
# 1. Loading of the PDFs

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document



In [7]:
import os

current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
DATA_FILE_PATH = os.path.join(parent_dir, "data")

In [8]:
# Function to load PDF files from a specified directory
def load_pdf_file(file_path=DATA_FILE_PATH):
    if not os.path.isdir(file_path):
        print(f"Error: The provided path '{file_path}' is not a directory.")
        return []

    loader = DirectoryLoader(
        file_path,
        glob="*.pdf",
        loader_cls=PyPDFLoader,
        recursive=True,
    )

    documents = loader.load()
    return documents


# Function to load Markdown files from a specified directory
# Note: TextLoader only sets the `source` metadata, so title and description are not extracted here.
def load_markdown_file(file_path=DATA_FILE_PATH):
    if not os.path.isdir(file_path):
        print(f"Error: The provided path '{file_path}' is not a directory.")
        return []

    loader = DirectoryLoader(
        file_path,
        glob="**/*.md",           
        loader_cls=TextLoader,
        loader_kwargs={"encoding": "utf-8"}
    )

    docs = loader.load()
    return docs

In [9]:
import frontmatter
from pathlib import Path

# Loads Markdown files and keeps their front matter (title, description) as metadata
def load_md_with_metadata(path):
    docs = []

    for file in Path(path).rglob("*.md"):
        post = frontmatter.load(file)

        docs.append(
            Document(
                page_content=post.content,
                metadata={
                    "title": post.get("title"),
                    "description": post.get("description"),
                    "source": str(file)
                }
            )
        )

    return docs

In [10]:
import re
import markdown
from bs4 import BeautifulSoup

# Function to clean markdown text
def clean_markdown(md_text: str) -> str:
    # 1. Remove YAML front matter
    md_text = re.sub(r"^---.*?---", "", md_text, flags=re.DOTALL)

    # 2. Convert markdown → HTML
    html = markdown.markdown(md_text)

    # 3. Parse HTML
    soup = BeautifulSoup(html, "html.parser")

    # 4. Remove unwanted tags
    for tag in soup(["script", "style", "iframe", "img", "table"]):
        tag.decompose()

    # 5. Get text
    text = soup.get_text(separator="\n")

    # 6. Remove markdown links but keep text
    text = re.sub(r"\[(.*?)\]\(.*?\)", r"\1", text)

    # 7. Normalize whitespace
    text = re.sub(r"\n{2,}", "\n\n", text)
    text = re.sub(r"[ \t]+", " ", text)

    return text.strip()
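
To see what step 1 of `clean_markdown` removes, the sketch below applies the same regular expression to a small made-up document. The sample text is an assumption for illustration; only the pattern comes from the function above.

```python
import re

# Same pattern as step 1 of clean_markdown: with re.DOTALL and without
# re.MULTILINE, "^" anchors only at the very start of the string.
FRONT_MATTER = re.compile(r"^---.*?---", flags=re.DOTALL)

# Hypothetical sample document (not taken from the real data set).
sample = "---\ntitle: KPIs\ndescription: Example\n---\n# Heading\n\nBody text."

stripped = FRONT_MATTER.sub("", sample).strip()
print(stripped)

# A document that does not start with front matter is left unchanged.
plain = "# Heading\n\nBody --- with dashes --- inside."
unchanged = FRONT_MATTER.sub("", plain)
```

Because the match is lazy, it ends at the first `---` after the opening one, so a front-matter value that itself contains `---` would cut the match short. Documents loaded through `load_md_with_metadata` already have their front matter separated, since `post.content` excludes it.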

In [11]:
# Reduces each document's metadata to title, description, and the source file name.
def filter_to_minimal_docs(docs):
    minimal_docs = []

    for doc in docs:
        full_path = doc.metadata.get("source", "")
        file_name = os.path.basename(full_path)

        minimal_doc = Document(
            page_content=doc.page_content,
            metadata={
                "title": doc.metadata.get("title"),
                "description": doc.metadata.get("description"),
                "source": file_name
            }
        )

        minimal_docs.append(minimal_doc)

    return minimal_docs

In [12]:
# Split the documents into smaller chunks
def text_split(minimal_docs):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100
    )
    split_docs = text_splitter.split_documents(minimal_docs)
    return split_docs
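
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraphs and newlines). The function name `sliding_window_chunks` is hypothetical and only illustrates how overlap repeats the tail of one chunk at the head of the next.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

# Synthetic 1200-character text so the window boundaries are easy to check.
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(text)
print([len(c) for c in chunks])  # [500, 500, 400]
```

The last 100 characters of each chunk reappear as the first 100 characters of the next one, which helps a sentence that straddles a boundary stay retrievable from at least one chunk.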

In [13]:
# Loading the Markdown files
docs = load_md_with_metadata(DATA_FILE_PATH)

# Keep only title, description, and source file name in the metadata
minimal_docs = filter_to_minimal_docs(docs)

# Split the minimized documents into text chunks
text_chunks = text_split(minimal_docs)

## Performing the Vector Embedding on Text Data:

The Embedding Model is **Sentence-Transformers**

1. The chunks are processed and converted to vectors
2. The embedded vectors are then sent to the Endee Vector Database through the Endee API
3. Once the data is stored in the Vector DB, it is ready to serve as a knowledge base for the LLM

In [19]:
from sentence_transformers import SentenceTransformer
embeddingModel = SentenceTransformer('all-MiniLM-L6-v2')

Loading weights: 100%|██████████| 103/103 [00:00<00:00, 110.97it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


### Storing the vector embeddings for each text chunk

In [20]:
start_idx = 2949
vectorEmbeddings = []

# `idx` is used instead of `id` to avoid shadowing the built-in id()
for idx, chunk in enumerate(text_chunks):
    source = chunk.metadata.get('source', "")
    title = chunk.metadata.get("title", "")
    description = chunk.metadata.get("description", "")
    text = chunk.page_content
    embedding = embeddingModel.encode(text).tolist()

    data = {
        "id": idx + start_idx,
        "vector": embedding,
        "meta": {
            "title": title,
            "description": description,
            "source": source,
            "text": text
        }
    }
    vectorEmbeddings.append(data)
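
The loop above encodes one chunk per call. `SentenceTransformer.encode` also accepts a list of strings, so the records could be built from a single batched call instead. The sketch below assumes that approach. `build_records` and its `encode_batch` parameter are hypothetical names, and a toy encoder stands in for the real model so the example runs without downloading weights.

```python
from typing import Callable, Sequence

def build_records(
    chunks: Sequence[dict],
    encode_batch: Callable[[list[str]], list[list[float]]],
    start_idx: int = 0,
) -> list[dict]:
    """Embed all chunk texts in one call and attach ids and metadata."""
    texts = [c["text"] for c in chunks]
    vectors = encode_batch(texts)
    records = []
    for offset, (chunk, vector) in enumerate(zip(chunks, vectors)):
        records.append({
            "id": start_idx + offset,
            "vector": vector,
            "meta": {
                "title": chunk.get("title", ""),
                "description": chunk.get("description", ""),
                "source": chunk.get("source", ""),
                "text": chunk["text"],
            },
        })
    return records

# Toy encoder (assumption): maps each text to a 2-dimensional vector.
def toy_encoder(texts: list[str]) -> list[list[float]]:
    return [[float(len(t)), float(t.count(" "))] for t in texts]

# Made-up chunks, represented as plain dicts for the sake of the example.
sample_chunks = [
    {"text": "hello world", "title": "A", "source": "a.md"},
    {"text": "vector search", "title": "B", "source": "b.md"},
]
records = build_records(sample_chunks, toy_encoder, start_idx=2949)
print([r["id"] for r in records])
```

With the real model, `encode_batch` could be `lambda texts: embeddingModel.encode(texts).tolist()`.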

### Setting up the connection to the Endee API (served by Flask)

In [41]:
import requests
from requests.exceptions import ConnectionError, Timeout, HTTPError

# Index name for Endee Vector Database
INDEX_NAME = "enterprise_knowledge_base"

# URL for Endee API service
ENDEE_URL = "http://127.0.0.1:8000" 

### Creating an Index in Endee Vector Database

Using only the dense dimensions (***Vector Embeddings***)

In [None]:
# Payload for creating an index in Endee Vector DB
payload = {
    "index_name": INDEX_NAME,
    "dimension": embeddingModel.get_sentence_embedding_dimension(),
    "precision": "INT16D"
}

In [None]:
try:    
    response = requests.post(
        f"{ENDEE_URL}/index/create",
        json=payload
    )
    response.raise_for_status()
    print(f"Message for index creation: {response.json()}")

except ConnectionError:
    print("Backend service is not reachable (is it running?)")
    
except Timeout:
    print("Request timed out")
    
except HTTPError as e:
    try:
        err = e.response.json()
        print("Error message:", err.get("error"))
    except ValueError:
        print("Raw error:", e.response.text)
    
except Exception as e:
    print(f"Unexpected error: {e}")

HTTP error, Code: 500, Message: Conflict: Index with this name already exists for this user
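
The `except HTTPError` branch above is repeated in several cells. One option is to factor the body parsing into a small helper. The sketch below is an assumption about how that could look. `extract_error_message` is a hypothetical name, and it relies only on the `error` key that the cells in this notebook already read.

```python
import json

def extract_error_message(body: str) -> str:
    """Return the "error" field of a JSON error body, or the raw text."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Body is not JSON (for example an HTML error page): show it as-is.
        return body
    if isinstance(payload, dict) and payload.get("error"):
        return str(payload["error"])
    return body

print(extract_error_message('{"error": "Index already exists"}'))
print(extract_error_message("<html>Internal Server Error</html>"))
```

Inside a handler it would be called as `extract_error_message(e.response.text)`.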


### Checking for the Existence of the Index in the Endee Vector Database

In [None]:
try:
    response = requests.post(
        f"{ENDEE_URL}/index/get",
        json={
            "index_name": INDEX_NAME
        })
    
    response.raise_for_status()
    print(response.json())
    
except ConnectionError:
    print("Backend service is not reachable (is it running?)")
    
except Timeout:
    print("Request timed out")
    
except HTTPError as e:
    try:
        err = e.response.json()
        print("Error message:", err.get("error"))
    except ValueError:
        print("Raw error:", e.response.text)
    
except Exception as e:
    print(f"Unexpected error: {e}")

{'index_name': 'enterprise_knowledge_base', 'status': 'index loaded'}


### Inserting the Embedded Vectors into Endee Vector DB 

In [None]:
text = """
Manav Khurana is a business-minded technical person. Two things drive my work practice:
Solving customer problems – more of them and doing so in the best way.

Optimizing for business results in what we do.

I take a data-driven approach to make sure we’re making progress in both. You’ll often see me mapping inputs to outputs with data, putting ourselves in our customers’ shoes, and asking about customer feedback. I get a lot of satisfaction in going about that work with my colleagues, building life-long friendships along the way. I believe everything else – feeling of impact, fun, recognition, money, etc – follows.
I was born and raised in New Delhi, India. I came to the U.S. to study electrical and computer engineering in New York and then moved to San Francisco for what’s now been a 25-year career in tech. My wife, whom I met in college, is a psychologist and keeps me accountable to our value system 🙂. We have a 12-year-old son, who is pure joy to be around and inspires us with his curiosity. We cherish our time together at home, with our community, and in our travels. In my spare time, I like sweating it out on a tennis court a couple of times a week.
I respond to all pronunciations of my name but prefer the original – Maah-nuhv.
"""

title = "Manav Khurana - Chief Product and Marketing Officer - README"
description = "Learn more about working with Manav Khurana, CPMO at GitLab."

vectors = embeddingModel.encode(text).tolist()

In [None]:
payload = {
    "index_name": INDEX_NAME,
    "embedded_vectors": [vectors]
}

try:
    response = requests.post(
        f"{ENDEE_URL}/index/upsert",
        json=payload
    )

    response.raise_for_status()
    print(response.json())
    
except ConnectionError:
    print("Backend service is not reachable (is it running?)")
    
except Timeout:
    print("Request timed out")
    
except HTTPError as e:
    try:
        err = e.response.json()
        print("Error message:", err.get("error"))
    except ValueError:
        print("Raw error:", e.response.text)
    
except Exception as e:
    print(f"Unexpected error: {e}")

In [None]:
# Slice the list because the insertion limit is 1000 vectors per upsert; each slice holds about 950 vectors
# (the first slice holds 951 because of the +1 offset below)
slices = []

end = len(vectorEmbeddings) // 950
prev = 0
for i in range(1, end + 1):
    slices.append((prev, 950 * i + 1))
    prev = 950 * i + 1
slices.append((prev, len(vectorEmbeddings)))
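
The slicing above produces a first batch of 951 vectors because of the `+ 1` offset, which is still under the 1000-vector limit. If uniform batches are preferred, stepping through the list with `range` is a simpler alternative. This is a sketch, not the notebook's original code, and `batched_slices` is a hypothetical helper.

```python
def batched_slices(total: int, batch_size: int = 950) -> list[tuple[int, int]]:
    """Return (start, end) index pairs covering range(total) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (start, min(start + batch_size, total))
        for start in range(0, total, batch_size)
    ]

print(batched_slices(2000))  # [(0, 950), (950, 1900), (1900, 2000)]
```

This version never emits an empty trailing slice, so an input whose length is an exact multiple of the batch size does not trigger an upsert with no vectors.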

In [None]:
# Upserts one batch of vectors into the DB; returns True on success, False otherwise
def upsertVectors(vectors):
    payload = {
        "index_name": INDEX_NAME,
        "embedded_vectors": vectors
    }

    try:
        response = requests.post(
            f"{ENDEE_URL}/index/upsert",
            json=payload,
            timeout=30  # requests has no default timeout; 30 s is an arbitrary choice
        )
        response.raise_for_status()

        print(response.json())
        return True

    except ConnectionError:
        print("Backend service is not reachable (is it running?)")

    except Timeout:
        print("Request timed out")

    except HTTPError as e:
        try:
            err = e.response.json()
            print("Error message:", err.get("error"))
        except ValueError:
            print("Raw error:", e.response.text)

    except Exception as e:
        print(f"Unexpected error: {e}")

    return False

In [None]:
# Upsert slice by slice and stop at the first failure
for start, end in slices:
    vectors = vectorEmbeddings[start:end]

    if not upsertVectors(vectors):
        break

{'count': 951, 'status': 'vectors upserted'}
{'count': 950, 'status': 'vectors upserted'}
{'count': 950, 'status': 'vectors upserted'}
{'count': 950, 'status': 'vectors upserted'}
{'count': 843, 'status': 'vectors upserted'}

### Retrieving the Top K most relevant data

In [89]:
query = "Who is Manav Khurana"
embedding_for_query = embeddingModel.encode(query).tolist()

In [93]:
payload = {
    "index_name": INDEX_NAME,
    "vector": embedding_for_query,
    "top_k": 20
}

In [94]:
# Sends a query to the Endee API
response = requests.post(
    f"{ENDEE_URL}/index/query",
    json=payload
)

data = response.json()

In [115]:
data["results"][:3]

[{'description': None,
  'distance': 0.7324415445327759,
  'id': '1780',
  'similarity': 0.2675584554672241,
  'text': 'GitLab KPIs are duplicates of goals of the reports further down this page.\nGitLab KPIs are the 10 most important indicators of company performance, and the most important KPI is Net ARR.\nWe review these at each quarterly meeting of the Board of Directors.\nThese KPIs are determined by a combination of their stand alone importance to the company and the amount of management focus devoted to improving the metric.',
  'title': 'KPIs'},
 {'description': None,
  'distance': 0.7402098476886749,
  'id': '3202',
  'similarity': 0.2597901523113251,
  'text': "- [Company KPI's](/handbook/company/kpis/)\n   - [Mitigating Concerns](https://internal.gitlab.com/handbook/leadership/mitigating-concerns/)",
  'title': 'Board of Directors and Corporate Governance'},
 {'description': 'This page aggregates dashboards, analysis, and insights generated or owned by the Product Data Insigh
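
In the results above, `distance` and `similarity` sum to 1 for each hit, which is consistent with cosine distance defined as `1 - cosine similarity`. That reading is an inference from the output, not something confirmed by the Endee API here. A minimal sketch of the computation, with made-up vectors:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Toy 2-dimensional vectors (assumption); real embeddings have many more dimensions.
query_vec = [1.0, 0.0]
doc_vec = [1.0, 1.0]

similarity = cosine_similarity(query_vec, doc_vec)
distance = 1.0 - similarity
print(round(similarity, 4), round(distance, 4))
```

Under this reading, the top hit's similarity of about 0.27 indicates a fairly weak match for the query.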

### Build complete RAG Pipeline:

1. Using GROQ Cloud for its free APIs and fast responses
2. Using LangChain as a wrapper

### How does this work:

1. `retriever`: returns `List[Document]` for a query (Find relevant information)
2. `prompt`: a prompt template with `{context}` and `{input}` (Frame the question for the AI)
3. `llm`: the chat model, an LLM runnable (Generate the answer)
4. `RunnablePassthrough`: Carry the user’s question forward unchanged
5. `StrOutputParser`: Clean the output
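
The data flow through these five components can be traced with plain Python functions. The sketch below does not use LangChain. Every function in it (`fake_retriever`, `build_prompt`, `fake_llm`, `parse_output`, `run_chain`) is a hypothetical stand-in chosen to show the order of operations.

```python
def fake_retriever(query: str) -> list[str]:
    # Stand-in for the vector search; returns canned passages.
    return ["KPIs are reviewed quarterly.", "Net ARR is the most important KPI."]

def build_prompt(context: str, user_input: str) -> str:
    # Mirrors a template with {context} and {input} placeholders.
    return f"Context:\n{context}\n\nQuestion: {user_input}"

def fake_llm(prompt: str) -> dict:
    # Stand-in for the chat model; echoes the last prompt line in a message-like dict.
    return {"content": f"Echo: {prompt.splitlines()[-1]}"}

def parse_output(message: dict) -> str:
    # Plays the role of the output parser: keep only the text of the message.
    return message["content"]

def run_chain(user_input: str) -> str:
    docs = fake_retriever(user_input)            # 1. find relevant passages
    context = "\n\n".join(docs)                  #    join them into one context string
    prompt = build_prompt(context, user_input)   # 2. fill the template; 4. the question is passed through unchanged
    message = fake_llm(prompt)                   # 3. generate the answer
    return parse_output(message)                 # 5. return plain text

answer = run_chain("What are KPIs?")
print(answer)
```

In the LangChain version, the `|` operator composes the same stages, and the dictionary at the head of the chain supplies both `context` and `input` to the prompt.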

In [74]:
# Loading the environment variables from the .env file
from dotenv import load_dotenv
load_dotenv()

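# load_dotenv() has already placed the key in os.environ; fail early if it is missing
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
if not GROQ_API_KEY:
    raise RuntimeError("GROQ_API_KEY is not set; add it to the .env file")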

In [75]:
from langchain_groq import ChatGroq

llm = ChatGroq(
    model_name="llama-3.3-70b-versatile",
    temperature=0.5
)

In [76]:
system_prompt = """
You are a professional and helpful assistant for a company named GitLab.
You answer all queries from the employees of this company to make their work easier.
Answer general questions in three to five sentences.
If an answer needs more elaboration, provide details such as step-by-step instructions, but only when needed or when the user asks for them specifically.
Use the context given below for reference and keep the answer concise and to the point.
Respond in detail only when the user asks for it.
Always double-check your answer before giving the final response to the user.

Context:
{context}

If a user asks whether they can upload a document, respond with "Yes, you can upload a PDF Document only."
"""

In [77]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

In [None]:
# Wrapper that queries Endee and returns the hits as LangChain Documents
def endee_retriever(query: str):
    embedding_for_query = embeddingModel.encode(query).tolist()

    payload = {
        "index_name": INDEX_NAME,
        "vector": embedding_for_query,
        "top_k": 10
    }

    response = requests.post(
        f"{ENDEE_URL}/index/query",
        json=payload
    )

    response.raise_for_status()
    data = response.json()

    docs = []
    # The query response inspected earlier exposes its hits under "results"
    for d in data.get("results", []):
        text = d.get("text", "")

        docs.append(
            Document(
                page_content=text,
                metadata={
                    "similarity": d.get("similarity"),
                    "source": d.get("source"),
                    "title": d.get("title"),
                    "description": d.get("description")
                }
            )
        )

    return docs

retriever = RunnableLambda(endee_retriever)

In [79]:
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}")
])

In [80]:
# RAG pipeline:

rag_chain = (
    {
        "context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
        "input": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:

rag_chain.invoke("If I want to prepare for an Interview, what all should I need to learn, can you guide me in this")

"To prepare for an interview, you should learn about the company, the role you're applying for, and the required skills. Review the job description and requirements, and be ready to provide examples of your experience and skills. You should also practice common interview questions and be prepared to ask questions to the interviewer. Additionally, learn about GitLab's products and services, such as Git version control, CI/CD pipelines, and Agile project planning."

In [91]:
# Sample test:
print(rag_chain.invoke("If I want to prepare for an Interview, what all should I need to learn, can you guide me in this. Tell me step by step in detail"))

To prepare for an interview, here's a step-by-step guide:

1. **Research the Company**: Learn about GitLab's products, mission, values, and culture to understand our vision and goals.
2. **Review the Job Description**: Study the job requirements, responsibilities, and skills needed for the position you're applying for.
3. **Update Your Resume and Online Profiles**: Ensure your resume, LinkedIn profile, and other social media platforms are up-to-date and highlight your relevant skills and experiences.
4. **Practice Common Interview Questions**: Prepare answers to common interview questions, such as "Why do you want to work at GitLab?" or "What are your strengths and weaknesses?"
5. **Develop Your Technical Skills**: If you're applying for a technical role, review and practice your coding skills, and be prepared to answer technical questions related to your field.
6. **Prepare Questions to Ask**: Come up with a list of questions to ask the interviewer, such as "What are the biggest chall

In [None]:
rag_chain.invoke("Tell me about yourself")

"I am a professional and helpful assistant for GitLab, dedicated to providing support and answering queries from employees to make their work easier and more efficient. I am here to assist with any questions or concerns you may have, and I will do my best to provide concise and accurate responses. If you have a specific question or need help with a particular task, feel free to ask and I'll be happy to help. My goal is to provide helpful and timely assistance to ensure a smooth work experience for all GitLab employees."

In [67]:
# Testing the RAG pipeline with user input in a loop
while True:
    user_input = input("User: ")
    if user_input.lower() in ["exit", "quit"]:
        print("Exiting the chat. Goodbye!")
        break

    response = rag_chain.invoke(user_input)
    print(f"Assistant: {response}")

Assistant: You can check the company's website for employee referral programs or ask current employees if they can refer you. You can also utilize professional networking platforms like LinkedIn to connect with current or former employees and ask for referrals. Additionally, you can reach out to the company's HR department to inquire about their referral process.
Assistant: The company has various job roles, but the specific roles are not mentioned in the given context. If you need more information, please provide more context or clarify which company you are referring to. However, some common job roles in general include administrative, technical, marketing, and sales positions. For more details, please provide additional context.
Assistant: You can check the available openings on the official GitLab website, specifically on their careers or jobs page. They list various positions, including engineering, marketing, sales, and more. For the most up-to-date information, I recommend visit