## Loading the data

* Since the data lives in a directory, let's use `DirectoryLoader`.

In [1]:
from langchain_community.document_loaders import DirectoryLoader

In [2]:
directory_loader = DirectoryLoader(
    path="./data",
    glob="**/*.txt",
    show_progress=True,
    use_multithreading=True,
)

raw_documents = directory_loader.load()

  0%|          | 0/116 [00:00<?, ?it/s]libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
Need to load profiles.
short text: "Title: Understanding CI/CD Pipelines". Defaulting to English.
short text: "Title: Introduction to Terraform". Defaulting to English.
short text: "Title: Kubernetes Basics". Defaulting to English.
  1%|          | 1/116 [00:07<14:53,  7.77s/it]
  2%|▏         | 2/116 [00:07<09:18,  

In [3]:
len(raw_documents)

116
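
As a sanity check on the count above, the same glob pattern can be evaluated with the standard library. This is a minimal sketch, not how `DirectoryLoader` is implemented; the helper name `count_matching_files` is hypothetical, and it assumes the `./data` layout used in the loader cell.

```python
from pathlib import Path


def count_matching_files(root: str, pattern: str = "**/*.txt") -> int:
    """Count regular files under `root` that match a recursive glob pattern."""
    # Path.glob may also match directories, so keep only regular files.
    return sum(1 for path in Path(root).glob(pattern) if path.is_file())
```

If every file loads into exactly one document, `count_matching_files("./data")` should agree with `len(raw_documents)`.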

In [4]:
print(raw_documents[0].page_content)

Title: Understanding CI/CD Pipelines

Overview: CI/CD (Continuous Integration and Continuous Deployment) is a process that enables software teams to deliver code changes frequently and reliably.

Key Concepts: 1. Continuous Integration – Developers merge code into a shared repository frequently. 2. Continuous Deployment – Automated deployment of tested code into production. 3. Benefits – Faster delivery, fewer bugs, and improved collaboration.


## We have documents, but we need to chunk them
* The data is plain text with natural boundaries (paragraphs, lines, etc.), so a recursive character splitter is a good fit.
* [Splitters](https://python.langchain.com/docs/concepts/text_splitters/)

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

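text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,    # target maximum chunk length, in characters
    chunk_overlap=10,  # maximum overlap, in characters, between consecutive chunks
    # Separators are tried in the order listed; the library default puts "\n\n" before "\n".
    separators=["\n", "\n\n"],
)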
raw_documents_post_split = text_splitter.split_documents(raw_documents)

In [6]:
print(f"raw_documents before split {len(raw_documents)}")
print(f"raw_documents post split {len(raw_documents_post_split)}")

raw_documents before split 116
raw_documents post split 948
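
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on the separators first and then merges the pieces); the function name `sliding_chunks` is made up for illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 200, chunk_overlap: int = 10) -> list[str]:
    """Cut text into fixed-size windows where neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each new window starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window's overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]
```

With the notebook's settings, each window starts 190 characters after the previous one, so the last 10 characters of one chunk reappear at the start of the next.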


In [7]:
print(raw_documents_post_split[0].page_content)


Title: Understanding CI/CD Pipelines

Overview: CI/CD (Continuous Integration and Continuous Deployment) is a process that enables software teams to deliver code changes frequently and reliably.


## We need to choose an embedding model and vector store.

* Let's use [text embeddings from GCP (Vertex AI)](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings#get-text-embeddings-for-a-snippet-of-text).
* For the vector store, let's use Chroma (chromadb).
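
Conceptually, the embedding model turns each chunk into a vector, and the vector store returns the chunks whose vectors are closest to the query vector. The toy sketch below ranks by cosine similarity; that metric is an assumption made for illustration (the measure Chroma actually uses depends on how the collection is configured), and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 4) -> list[tuple[str, float]]:
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A real store uses an approximate nearest-neighbour index rather than scoring every vector, but the ranking idea is the same.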



In [None]:
from langchain_google_vertexai import VertexAIEmbeddings

embeddings = VertexAIEmbeddings(model="text-embedding-005")

In [None]:
from langchain_chroma import Chroma

In [None]:
vector_store = Chroma(
    collection_name="kb_collection",
    embedding_function=embeddings,
    persist_directory="./vectordb",
)

In [None]:
# Now add documents
vector_store.add_documents(raw_documents_post_split)

['41bbb163-0b7d-4e35-bc1a-f6a5dc80ba06',
 'ea1fbf13-bbf7-4473-acdc-7d663bfb2c9b',
 '6b75c0f7-b919-44b7-aa7c-76334c6f2d21',
 '92d6c972-ace1-4127-bd10-ed5ca8ea1c33',
 '0efa069a-a06b-4b30-8c59-88a216222ea7',
 '39e19394-4e55-4e5e-bd23-217b82f47b68',
 'f084d311-0d1a-439b-8deb-c8cbfff65f7d',
 'b57b5c1a-c723-4a46-8d7e-6062640c7a59',
 '23460de6-684f-464d-838d-b86157d35898',
 'ef54bfac-f18d-4c71-9ede-f13a833ccbad',
 '4b202693-681c-4ed5-9ac1-31fb8a745e1b',
 '3a4c753a-6186-4763-9730-04540457d35f',
 'f74109c6-8cd4-4202-82ba-bea0503c141f',
 '17351142-5afb-491f-9e14-f5e0d199f14f',
 '0e15c731-fc84-48f5-abda-63ab03ec2393',
 'b6881232-8b0b-4c56-a836-ad64cb3f455a',
 'ab537487-9c2a-4d87-acfc-b8185c2a2df3',
 '1608da58-38c8-489a-8066-a957ab65be0a',
 '87bcd6cc-35f9-46af-a65c-e54a43a6bbec',
 '50161b24-5e3c-4adb-8502-65f4f0982eea',
 'ad5098d0-0011-4702-8256-d1f1e0845d70',
 'f9674329-1bd2-43bb-b2e0-95e75ad8ae19',
 '5bf242fb-faca-445c-8115-77eae8173a7d',
 'ff5eaf5d-cfaf-44c4-a2d3-d9f3c78cf700',
 'b679a69b-0581-

In [None]:
# Let's experiment with retrieval
retriever = vector_store.as_retriever()

In [None]:
results = retriever.invoke("what are CI/CD Pipelines ?")

In [None]:
len(results)

4

In [None]:
for result in results:
    print(result.metadata)

{'source': 'data\\article_001_ci_cd_pipeline.txt'}
{'source': 'data\\article_048_ci_cd_tools.txt'}
{'source': 'data\\CI_CD_Best_Practices.txt'}
{'source': 'data\\article_041_ci_cd_testing.txt'}


In [None]:
for result in results:
    print(result.page_content)

Title: Understanding CI/CD Pipelines

Overview: CI/CD (Continuous Integration and Continuous Deployment) is a process that enables software teams to deliver code changes frequently and reliably.
Title: Popular CI/CD Tools

Overview: Several tools help implement CI/CD pipelines.

--------------------------------- 5. Collaboration & Governance --------------------------------- - **Code Reviews**: All changes must pass peer review before merging into main branches. - **Branch Protection Rules**: Protect `main` and `develop` branches with mandatory checks. - **Documentation**: Maintain pipeline documentation for setup, usage, and troubleshooting. - **Training**: Conduct periodic training for developers and DevOps engineers on CI/CD best practices.
Title: Testing in CI/CD Pipelines

Overview: Automated testing in CI/CD ensures code quality before deployment.

Types:

1. Unit tests.

2. Integration tests.

3. End

to

end tests.


In [None]:
# Few-shot prompting
from langchain.chat_models import init_chat_model
llm = init_chat_model("gemini-2.5-flash-lite", model_provider="google_vertexai")

In [None]:
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
messages = [
    SystemMessage(content="You are a senior in college who has knowledge of computer science and responds in one-liners"),
    HumanMessage(content="What are graphs?"),
    AIMessage(content="Graphs are part of data structures"),
    HumanMessage(content="What are trees in computer science?"),
    AIMessage(content="Trees are part of data structures"),
    HumanMessage(content="What are compilers?")
]

In [None]:
response = llm.invoke(input=messages)

In [None]:
type(response)

langchain_core.messages.ai.AIMessage

In [None]:
response.pretty_print()


Compilers translate code from one language to another.
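
The few-shot pattern used here (a system message, then example question/answer pairs, then the real question) can be factored into a small helper. This sketch uses plain `(role, content)` tuples so it runs without LangChain installed; passing such tuples straight to `llm.invoke` is an assumption about LangChain's message coercion and should be checked against the installed version. The name `build_few_shot_messages` is hypothetical.

```python
def build_few_shot_messages(
    system_prompt: str,
    examples: list[tuple[str, str]],
    question: str,
) -> list[tuple[str, str]]:
    """Build a chat history: system prompt, example Q/A pairs, then the new question."""
    messages = [("system", system_prompt)]
    for example_question, example_answer in examples:
        # Each example is replayed as a human turn followed by the desired AI reply.
        messages.append(("human", example_question))
        messages.append(("ai", example_answer))
    # The final human turn is the one the model is expected to answer.
    messages.append(("human", question))
    return messages
```

Keeping the examples in a list makes it easy to add or remove shots and compare how the answer style changes.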


In [None]:
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
messages = [
    SystemMessage(content="You are a patient school teacher"),
    HumanMessage(content="What is addition?"),
    AIMessage(content="If you have 10 rupees and your father gives you 5 rupees, what is the total amount? Count it: the answer should be 15."),
    HumanMessage(content="What is subtraction?"),
    AIMessage(content="You have 10 chocolates and you ate 2. How many are left? Count them: the answer should be 8."),
    HumanMessage(content="What is division?")
]

In [None]:
response = llm.invoke(messages)
response.pretty_print()


Imagine you have 12 cookies, and you want to share them equally among your 3 friends. Division is like figuring out how many cookies each friend gets.

So, you would take your 12 cookies and divide them into 3 equal groups.

Each friend would get **4** cookies. That's division! It's all about splitting a total amount into equal parts.

Does that make sense? 😊


In [None]:
prompt_template_string = """You are a helpful assistant helping employees understand the KB articles.

Question: {question}

Context: {context}

Using the context, give a simple answer.

"""
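
The `{question}` and `{context}` placeholders are filled in when the prompt is invoked. As a rough standard-library illustration of that substitution (assuming the default f-string-style template format; the `fill_prompt` helper and the shortened `TEMPLATE` are hypothetical):

```python
TEMPLATE = """You are a helpful assistant.

Question: {question}

Context: {context}

Using the context, give a simple answer.
"""


def fill_prompt(template: str, **values: str) -> str:
    """Substitute named placeholders; raises KeyError if one is missing."""
    return template.format(**values)
```

Printing `fill_prompt(TEMPLATE, question=..., context=...)` before calling the model is a cheap way to inspect exactly what text the model will see.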

In [None]:
question = input("What are you searching for? ")
retriever = vector_store.as_retriever()
response = retriever.invoke(question)

In [None]:
response

[Document(id='eebf5e81-3fb2-4ccb-a13f-bed9d402e67d', metadata={'source': 'data\\article_033_api_gateway.txt'}, page_content='Title: Role of API Gateways\n\nOverview: An API Gateway manages API requests between clients and services.'),
 Document(id='f084d311-0d1a-439b-8deb-c8cbfff65f7d', metadata={'source': 'data\\article_005_api_design.txt'}, page_content='Title: Best Practices for API Design\n\nOverview: APIs should be intuitive, consistent, and easy to consume.'),
 Document(id='fc5365bf-4d9a-4243-96d2-b39a3f9d19de', metadata={'source': 'data\\article_057_api_testing.txt'}, page_content='Title: API Testing Approaches\n\nOverview: API testing validates endpoints for functionality and performance.\n\nApproaches:\n\n1. Functional testing.\n\n2. Load testing.\n\n3. Security testing.'),
 Document(id='b57b5c1a-c723-4a46-8d7e-6062640c7a59', metadata={'source': 'data\\article_005_api_design.txt'}, page_content='Best Practices: 1. Use RESTful principles or GraphQL when suitable. 2. Version API

In [None]:
context = ""
for doc in response:
    context = f"{context}\n{doc.page_content}"
print(context)



Title: Role of API Gateways

Overview: An API Gateway manages API requests between clients and services.
Title: Best Practices for API Design

Overview: APIs should be intuitive, consistent, and easy to consume.
Title: API Testing Approaches

Overview: API testing validates endpoints for functionality and performance.

Approaches:

1. Functional testing.

2. Load testing.

3. Security testing.
Best Practices: 1. Use RESTful principles or GraphQL when suitable. 2. Version APIs properly. 3. Provide clear documentation.
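
In the printed context, one chunk runs straight into the next, and the same source file can contribute more than one chunk. A small sketch of a more explicit way to assemble the context and list the distinct sources follows. `SimpleDoc` is a hypothetical stand-in for LangChain's `Document` (it only mirrors the `page_content` and `metadata` attributes), and the helper names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Stand-in for a retrieved document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_context(docs: list[SimpleDoc], separator: str = "\n\n") -> str:
    """Join chunk texts with a blank line so chunk boundaries stay visible."""
    return separator.join(doc.page_content for doc in docs)


def unique_sources(docs: list[SimpleDoc]) -> list[str]:
    """Distinct 'source' values in first-seen order, skipping docs without one."""
    ordered = dict.fromkeys(doc.metadata.get("source") for doc in docs)
    return [source for source in ordered if source]
```

Listing the sources alongside the answer lets a reader trace which KB articles the response was grounded in.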


In [None]:
from langchain_core.prompts import PromptTemplate
template = PromptTemplate.from_template(prompt_template_string)

In [None]:
chain = template | llm

In [None]:
# RAG response (model answer grounded in the retrieved context)
response = chain.invoke({'question': question, 'context': context})
response.pretty_print()


Based on the provided context, an API (Application Programming Interface) is a way for different services or software to communicate with each other. Think of it as a messenger that handles requests between a client (like a user's device) and the services it needs information from.

The articles highlight that APIs should be designed to be easy to use, managed by gateways, and tested for their functionality, performance, and security.
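
The retrieve, build-context, prompt, and generate steps can be wrapped in one function. In this sketch the retriever and the model are passed in as plain callables so it runs without any external service; `answer_with_rag`, its parameters, and the default template are hypothetical and only outline the flow, not the notebook's exact chain.

```python
from typing import Callable, Sequence

DEFAULT_TEMPLATE = "Question: {question}\n\nContext: {context}\n\nUsing the context, give a simple answer."


def answer_with_rag(
    question: str,
    retrieve: Callable[[str], Sequence[str]],
    generate: Callable[[str], str],
    template: str = DEFAULT_TEMPLATE,
) -> str:
    """Retrieve chunks for the question, put them in the prompt, and ask the model."""
    chunks = retrieve(question)          # step 1: fetch relevant chunk texts
    context = "\n\n".join(chunks)        # step 2: merge them into one context string
    prompt = template.format(question=question, context=context)  # step 3: fill the prompt
    return generate(prompt)              # step 4: let the model answer
```

Swapping `retrieve` for a function that returns an empty list gives the no-context baseline, which is useful for comparing grounded and ungrounded answers.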


In [None]:
# Model's response without retrieved context
llm.invoke(question).pretty_print()


API stands for **Application Programming Interface**.

In its simplest terms, an API is a **set of rules, protocols, and tools that allows different software applications to communicate and interact with each other.** Think of it as a **messenger** or a **contract** that defines how one piece of software can request services or data from another.

Here's a breakdown of the key concepts:

**1. Interface:**

* **It's a boundary:** Just like a physical interface (like a USB port) allows different devices to connect, an API provides a defined way for software components to connect and exchange information.
* **It's what you see and interact with:** You don't need to know the intricate internal workings of another application to use its API. You only need to understand the API's specifications.

**2. Programming:**

* **It's for software developers:** APIs are designed for programmers to use when building new applications or integrating existing ones.
* **It defines how to program interact