## Loading the data

* Since the data lives in a directory, let's use the directory loader.

In [1]:
from langchain_community.document_loaders import DirectoryLoader

In [2]:
directory_loader = DirectoryLoader(
    path="./data",
    glob="**/*.txt",  # match .txt files in ./data and its subdirectories
    show_progress=True,
    use_multithreading=True,
)

raw_documents = directory_loader.load()

  0%|          | 0/116 [00:00<?, ?it/s]libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
Need to load profiles.
  1%|          | 1/116 [00:02<05:23,  2.81s/it]short text: "Title: Introduction to Terraform". Defaulting to English.
short text: "Title: Kubernetes Basics". Defaulting to English.

In [3]:
len(raw_documents)

116
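As a sanity check on the count above, the same glob can be reproduced with the standard library. This is only a sketch: it assumes the loader's `glob` behaves like `pathlib.Path.glob`, and the helper `list_text_files` and the demo file names are hypothetical, not part of the notebook.

```python
import tempfile
from pathlib import Path


def list_text_files(root: str, pattern: str = "**/*.txt") -> list[Path]:
    """Return the files under root that match a recursive glob, sorted by path."""
    return sorted(p for p in Path(root).glob(pattern) if p.is_file())


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "a.txt").write_text("Title: A", encoding="utf-8")
    (Path(tmp) / "nested" / "b.txt").write_text("Title: B", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("not matched", encoding="utf-8")
    demo_names = [p.name for p in list_text_files(tmp)]
    print(demo_names)
```

Running `len(list_text_files("./data"))` next to `len(raw_documents)` should give the same number if every matched file loaded successfully.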

In [5]:
print(raw_documents[0].page_content)

Title: Understanding CI/CD Pipelines

Overview: CI/CD (Continuous Integration and Continuous Deployment) is a process that enables software teams to deliver code changes frequently and reliably.

Key Concepts: 1. Continuous Integration – Developers merge code into a shared repository frequently. 2. Continuous Deployment – Automated deployment of tested code into production. 3. Benefits – Faster delivery, fewer bugs, and improved collaboration.


## We have documents, but we need to chunk them
* The data is plain text made up of paragraphs and lines, so we can split on those natural boundaries.
* [Splitters](https://python.langchain.com/docs/concepts/text_splitters/)

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

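text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,    # target maximum number of characters per chunk
    chunk_overlap=10,  # characters shared between consecutive chunks
    # Separators are tried in list order. The conventional ordering is coarse
    # to fine (["\n\n", "\n"]); this list tries single newlines first.
    separators=["\n", "\n\n"],
)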
raw_documents_post_split = text_splitter.split_documents(raw_documents)

In [13]:
print(f"raw_documents before split {len(raw_documents)}")
print(f"raw_documents post split {len(raw_documents_post_split)}")

raw_documents before split 116
raw_documents post split 948
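To build some intuition for why 116 documents become many more chunks, here is a simplified sketch of recursive splitting. It is not LangChain's implementation: it ignores overlap, the function name `recursive_split` is hypothetical, and it uses the coarse-to-fine separator order (`"\n\n"` before `"\n"`) as an assumption.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text on the coarsest separator first, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
        else:
            if current:
                chunks.extend(recursive_split(current, finer, chunk_size))
            current = piece      # start a new chunk with this piece
    if current:
        chunks.extend(recursive_split(current, finer, chunk_size))
    return chunks


sample = "para one line a\npara one line b\n\npara two"
chunks = recursive_split(sample, ["\n\n", "\n"], chunk_size=20)
print(chunks)
```

The first paragraph is longer than 20 characters, so it is split again on single newlines, while the short second paragraph is kept whole.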


In [14]:
print(raw_documents_post_split[0].page_content)


Title: Understanding CI/CD Pipelines

Overview: CI/CD (Continuous Integration and Continuous Deployment) is a process that enables software teams to deliver code changes frequently and reliably.


## We need to choose an embedding model and vector store.

* For embeddings, let's use [text embeddings from GCP (Vertex AI)](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings#get-text-embeddings-for-a-snippet-of-text).
* For the vector store, let's use Chroma (chromadb).



In [15]:
from langchain_google_vertexai import VertexAIEmbeddings

embeddings = VertexAIEmbeddings(model="text-embedding-005")



In [17]:
from langchain_chroma import Chroma

In [18]:
vector_store = Chroma(
    collection_name="kb_collection",
    embedding_function=embeddings,
    persist_directory="./vectordb",
)

In [19]:
# Now add the chunked documents to the vector store
vector_store.add_documents(raw_documents_post_split)

['2535a1d7-3678-4e92-8e15-73f45fa3a328',
 'f690ed60-1b7f-4746-b926-0cdb36e6d31a',
 'bd2cd00c-026b-4b06-8725-3c1daae7959a',
 '4226c6ab-5b36-4748-9ac4-7d3be4ac7eb3',
 '9a380eeb-4ab6-4df5-8d4b-e726024452b6',
 'e8cd391d-d8e1-4300-b23a-5d845f83f10f',
 '251b680c-da03-4e62-aa5c-07c2d9775581',
 'f3185289-cf2f-4e8c-95fe-28575a05c0fc',
 '41f337cd-ec76-4c47-a893-a092b0432529',
 'ec40ac49-87f0-42f0-95f7-9aa7fb2e98e3',
 'c9d92ec5-0667-49ec-a6d8-f51673fc77df',
 '5be10978-cb7a-4044-a9a4-cb727471798a',
 'ca60a82f-d63a-4b03-b081-f28f2a2d4f08',
 'e7a7ee12-24b0-41e9-8777-2d44f64425d2',
 '2a2e4060-3226-46d8-8a19-873b068a1f70',
 '03dbb541-e5b3-49a5-912f-69a341e3d2a6',
 'e4e9fe64-ba9c-4e48-973d-f06ff6030ab0',
 '83b72322-103e-429a-b647-c9ab4626993f',
 '052f514e-d8f6-4728-b2e6-3153937155da',
 '28ccf8b2-94ae-4a7a-b410-834eabd7868a',
 '76f0fc39-cbbb-4673-a397-ee7a9e55c5e6',
 '89390512-c3ff-4c4f-bb66-dc393755cb2d',
 '1b66afae-0059-461b-b3f2-cf6517bdaf9f',
 '67d551d9-9d5f-4d2f-983e-1d729b597a42',
 '97e1e270-a2af-
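The IDs above look like randomly generated UUIDs, so re-running the cell would most likely store the same chunks again under new IDs. One possible way around that is to derive a deterministic ID per chunk. The helper below is a hypothetical sketch using only the standard library, and the example path and text are made up for illustration.

```python
import hashlib


def stable_chunk_id(source: str, content: str) -> str:
    """Derive a deterministic ID from a chunk's source path and its text."""
    digest = hashlib.sha256(f"{source}\n{content}".encode("utf-8")).hexdigest()
    return digest[:32]  # shortened for readability; still deterministic


chunk_id = stable_chunk_id("data/example.txt", "Title: Example")
same_id = stable_chunk_id("data/example.txt", "Title: Example")
other_id = stable_chunk_id("data/example.txt", "Title: Another")
print(chunk_id == same_id, chunk_id == other_id)
```

As far as I know, `add_documents` accepts an `ids` argument in recent LangChain versions, so a list built from `doc.metadata["source"]` and `doc.page_content` could be passed there; check this against your installed version. Note that two identical chunks from the same file would share one ID under this scheme.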

In [21]:
# Let's experiment with retrieval
retriever = vector_store.as_retriever()

In [22]:
results = retriever.invoke("what are CI/CD Pipelines ?")

In [23]:
len(results)

4
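The retriever returned four documents here without `k` being set anywhere, which suggests a default of 4. Conceptually, retrieval ranks stored vectors by similarity to the query vector and keeps the top `k`. The sketch below illustrates that idea with toy 2-dimensional vectors and cosine similarity; both are assumptions for illustration, and the metric Chroma actually uses depends on the collection's configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], docs: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
    return ranked[:k]


# Toy vectors with hypothetical names, not real embeddings.
toy_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
best_two = top_k([1.0, 0.1], toy_docs, k=2)
print(best_two)
```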

In [24]:
for result in results:
    print(result.metadata)

{'source': 'data\\article_001_ci_cd_pipeline.txt'}
{'source': 'data\\article_048_ci_cd_tools.txt'}
{'source': 'data\\CI_CD_Best_Practices.txt'}
{'source': 'data\\article_041_ci_cd_testing.txt'}
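The `source` values use Windows-style backslashes, so splitting them on `/` will not work. A small sketch for pulling out just the file name, using the paths printed above; the helper `source_name` is a hypothetical name.

```python
from collections import Counter
from pathlib import PureWindowsPath

# Source paths copied from the metadata printed above.
sources = [
    "data\\article_001_ci_cd_pipeline.txt",
    "data\\article_048_ci_cd_tools.txt",
    "data\\CI_CD_Best_Practices.txt",
    "data\\article_041_ci_cd_testing.txt",
]


def source_name(source: str) -> str:
    """Return the file name from a Windows-style (or forward-slash) path."""
    return PureWindowsPath(source).name


# Count how many retrieved chunks came from each file.
hits_per_file = Counter(source_name(s) for s in sources)
print(hits_per_file)
```

Here each of the four results comes from a different file, so every count is 1.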


In [25]:
for result in results:
    print(result.page_content)

Title: Understanding CI/CD Pipelines

Overview: CI/CD (Continuous Integration and Continuous Deployment) is a process that enables software teams to deliver code changes frequently and reliably.
Title: Popular CI/CD Tools

Overview: Several tools help implement CI/CD pipelines.

--------------------------------- 5. Collaboration & Governance --------------------------------- - **Code Reviews**: All changes must pass peer review before merging into main branches. - **Branch Protection Rules**: Protect `main` and `develop` branches with mandatory checks. - **Documentation**: Maintain pipeline documentation for setup, usage, and troubleshooting. - **Training**: Conduct periodic training for developers and DevOps engineers on CI/CD best practices.
Title: Testing in CI/CD Pipelines

Overview: Automated testing in CI/CD ensures code quality before deployment.

Types:

1. Unit tests.

2. Integration tests.

3. End

to

end tests.
