## Loading the data

* Since the data lives in a directory, let's use LangChain's `DirectoryLoader`.

In [1]:
from langchain_community.document_loaders import DirectoryLoader

In [2]:
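# Load every .txt file under ./data, including files in subdirectories.
directory_loader = DirectoryLoader(
    path="./data",
    glob="**/*.txt",
    show_progress=True,
    use_multithreading=True)

# Returns one Document per file.
raw_documents = directory_loader.load()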

  0%|          | 0/116 [00:00<?, ?it/s]libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
Need to load profiles.
  1%|          | 1/116 [00:02<05:23,  2.81s/it]short text: "Title: Introduction to Terraform". Defaulting to English.
short text: "Title: Kubernetes Basics". Defaulting to English.

In [3]:
len(raw_documents)

116
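
To see which files a pattern like `**/*.txt` would pick up, here is a minimal sketch using only the standard library. It assumes the loader resolves the pattern the way `pathlib` does; the helper `list_text_files` and the file names are hypothetical and only serve the illustration.

```python
import tempfile
from pathlib import Path


def list_text_files(root: str, pattern: str = "**/*.txt") -> list[str]:
    """Return sorted relative paths of files under root that match pattern."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix()
        for p in base.glob(pattern)
        if p.is_file()
    )


# Build a throwaway directory tree (hypothetical file names) to try it out.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ci").mkdir()
    (Path(tmp) / "intro.txt").write_text("Title: Kubernetes Basics")
    (Path(tmp) / "ci" / "pipelines.txt").write_text("Title: Understanding CI/CD Pipelines")
    (Path(tmp) / "ci" / "notes.md").write_text("not matched")
    matched = list_text_files(tmp)

print(matched)  # ['ci/pipelines.txt', 'intro.txt']
```

The `**` part matches zero or more directory levels, so files directly in the root and files in nested folders are both included, while the `.md` file is skipped.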

In [5]:
print(raw_documents[0].page_content)

Title: Understanding CI/CD Pipelines

Overview: CI/CD (Continuous Integration and Continuous Deployment) is a process that enables software teams to deliver code changes frequently and reliably.

Key Concepts: 1. Continuous Integration – Developers merge code into a shared repository frequently. 2. Continuous Deployment – Automated deployment of tested code into production. 3. Benefits – Faster delivery, fewer bugs, and improved collaboration.


## We have documents, but we need to chunk them
* The data is plain text with natural boundaries such as paragraphs and lines, so a splitter that respects those boundaries is a good fit.
* [Splitters](https://python.langchain.com/docs/concepts/text_splitters/)
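
The idea behind separator-based chunking can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the function `split_and_merge` is hypothetical, it handles a single separator, and it does not implement overlap between chunks.

```python
def split_and_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split text on separator, then greedily pack the pieces into chunks
    of at most chunk_size characters. A single oversized piece is kept whole."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits (or it is the first piece of a new chunk).
            current = candidate
        else:
            # Adding the piece would exceed the limit: close the chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Title: Understanding CI/CD Pipelines\n\n"
    "Overview: CI/CD is a process for delivering code changes.\n\n"
    "Key Concepts: Continuous Integration and Continuous Deployment."
)
chunks = split_and_merge(sample, separator="\n\n", chunk_size=100)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

With a limit of 100 characters, the title and overview paragraphs fit together in the first chunk, and the key-concepts paragraph starts a second one.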

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

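text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,     # target maximum characters per chunk
    chunk_overlap=10,   # characters shared between neighbouring chunks
    # Separators are tried in list order, so "\n" takes priority over "\n\n" here.
    separators=["\n", "\n\n"],
)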
raw_documents_post_split = text_splitter.split_documents(raw_documents)

In [13]:
print(f"raw_documents before split {len(raw_documents)}")
print(f"raw_documents post split {len(raw_documents_post_split)}")

raw_documents before split 116
raw_documents post split 948


In [14]:
print(raw_documents_post_split[0].page_content)


Title: Understanding CI/CD Pipelines

Overview: CI/CD (Continuous Integration and Continuous Deployment) is a process that enables software teams to deliver code changes frequently and reliably.


## We need to choose an embedding model and vector store.

* Let's use [text embeddings from GCP](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings#get-text-embeddings-for-a-snippet-of-text) as the embedding model.
* For the vector store, let's use Chroma (chromadb).



In [15]:
from langchain_google_vertexai import VertexAIEmbeddings

embeddings = VertexAIEmbeddings(model="text-embedding-005")



In [17]:
from langchain_chroma import Chroma

In [18]:
vector_store = Chroma(
    collection_name="kb_collection",
    embedding_function=embeddings,
    persist_directory="./vectordb",
)
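
To make the role of the embedding model and vector store concrete, here is a toy sketch of similarity search in plain Python. Everything in it is an assumption for illustration: the three-dimensional vectors are made up (they are not produced by `text-embedding-005`), `toy_store` and `top_k` are hypothetical names, and cosine similarity is just one common choice of metric that may differ from what the Chroma collection is configured to use.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          store: list[tuple[str, list[float]]],
          k: int = 1) -> list[str]:
    """Return the k stored texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up vectors standing in for real embeddings of three chunks.
toy_store = [
    ("Title: Understanding CI/CD Pipelines", [0.9, 0.1, 0.0]),
    ("Title: Kubernetes Basics", [0.1, 0.9, 0.1]),
    ("Title: Introduction to Terraform", [0.0, 0.2, 0.9]),
]

# Made-up vector standing in for an embedded user question about CI/CD.
query_vector = [0.8, 0.2, 0.1]
best = top_k(query_vector, toy_store, k=1)
print(best)  # ['Title: Understanding CI/CD Pipelines']
```

A real vector store does the same kind of ranking, but over high-dimensional embeddings and with an index so that it does not have to compare the query against every stored vector.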