### How to work with ChromaDB and store documents in it

- Insert the documents into the database
- Run a similarity search over them
- Retrieve the matching chunks as context
- Pass that context, together with its scores, to the LLM
- Generate the response
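
The steps above can be sketched end to end with the standard library only. This is a toy illustration under stated assumptions, not the notebook's implementation: the bag-of-words `embed` function, the `ToyStore` class, and the prompt layout in `build_prompt` are all hypothetical stand-ins for the HuggingFace embeddings and the Chroma store used in the cells below.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs in a list."""
    def __init__(self):
        self.items = []

    def add(self, texts):
        for t in texts:
            self.items.append((t, embed(t)))

    def search(self, query, k=2):
        # Score every stored text against the query and keep the top k.
        q = embed(query)
        scored = [(text, cosine(q, vec)) for text, vec in self.items]
        return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

def build_prompt(query, hits):
    """Format the retrieved texts and their scores as context for an LLM."""
    context = "\n".join(f"[score={score:.2f}] {text}" for text, score in hits)
    return f"Context:\n{context}\n\nQuestion: {query}"

store = ToyStore()
store.add([
    "node.js uses an event loop for non-blocking io",
    "docker containers package application dependencies",
    "python lists are ordered collections",
])
question = "how does the node.js event loop work"
hits = store.search(question, k=2)
prompt = build_prompt(question, hits)
print(prompt)
```

The final step, sending `prompt` to an LLM, is left out because it depends on which model client is used.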

In [2]:
from langchain_community.document_loaders import PyMuPDFLoader
import os
from dotenv import load_dotenv
load_dotenv()

False

In [2]:
# LangChain + ChromaDB
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document

# for ChromaDB
from langchain_community.vectorstores import Chroma
from typing import List
import os





In [10]:
strings_array = [
    "Python is renowned for its flexible data structures, which include lists, tuples, sets, and dictionaries. Lists are ordered collections that support dynamic resizing and a variety of methods for adding, removing, or searching elements. Tuples are similar to lists but immutable, ensuring that data remains unchanged once assigned and making them suitable for fixed data groupings. Dictionaries use key-value pairs allowing fast access, manipulation, and association of data by unique keys. Sets are collections of unordered, unique elements, great for removing duplicates and performing mathematical operations like unions and intersections. These data structures form the backbone for efficient algorithm development, making Python popular for data engineering, scientific computing, and rapid prototyping in diverse software projects."
    ,
    "Docker containers have revolutionized application deployment by encapsulating all dependencies within lightweight, portable units. The container lifecycle runs through image creation, build, run, and destruction, allowing consistency across environments from a developer’s laptop to cloud production servers. Networking and persistent storage are handled through Docker’s bridge networks and mounted volumes, often configured in a YAML file for orchestration. Security features, resource limits, and automated health checks help maintain uptime and isolation. Command-line tools and APIs provide granular control, while platforms like Kubernetes extend management to large-scale clusters. Docker’s architecture enables microservices, CI/CD pipelines, and efficient scaling for modern software infrastructure."
    ,
    "Vector databases have emerged as critical infrastructure for AI-driven applications by enabling the fast, approximate search of high-dimensional embeddings. Unlike traditional relational stores, vector DBs represent data points as mathematical vectors, supporting similarity queries, KNN search, and clustering. This approach underpins semantic retrieval in RAG systems, recommendation engines, and fraud detection. Technologies such as Pinecone, Weaviate, and ChromaDB offer APIs to store, update, and query embeddings generated by models like BERT or CLIP. They optimize for speed and scalability with techniques including approximate neighbors, distributed indexing, and GPU acceleration. Advanced filtering and metadata support enable hybrid retrieval for context-aware generative AI solutions."
    ,
    "SQL query optimization is crucial for scalable database operations. Indexing frequently searched columns is an essential strategy, but too many indexes can degrade write performance. Queries should avoid ‘SELECT *’, minimize joins to essential tables, and use WHERE clauses that leverage indexed columns. Tools like EXPLAIN PLAN visualize execution steps, guiding developers to restructure queries for efficiency. Partitioning large tables can improve access speed while reducing locking contention. Regularly updating table statistics ensures the optimizer selects the best execution path. Avoiding correlated subqueries and using batch processing techniques can reduce resource consumption. These approaches together lead to faster, more reliable database systems."
    ,
    "In Node.js, the event loop is a key mechanism that allows non-blocking I/O operations on a single thread. Each incoming request is delegated to the system kernel, freeing the JavaScript runtime to handle other events. Async callbacks are queued and executed when the kernel signals completion, drastically improving throughput. Promises and async/await syntax further simplify asynchronous code management, reducing callback hell and enhancing maintainability. Node.js excels in microservices, web servers, and real-time applications like chat or streaming services. Its event-driven model, combined with fast V8 execution, supports horizontal scaling and resource-efficient concurrency on modest hardware, making Node.js immensely popular for backend services."
]


In [11]:
# Store each sample in its own file inside a temporary directory
import tempfile
temp_dir = tempfile.mkdtemp()

for i,doc in enumerate(strings_array):
    file_path = os.path.join(temp_dir, f"doc_{i}.txt")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(doc)

print("Doc created")

Doc created
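
`tempfile.mkdtemp()` leaves the directory on disk until it is deleted explicitly. A minimal sketch of an alternative, assuming the files are only needed while a block is running: `tempfile.TemporaryDirectory` removes the directory when the `with` block exits. The `sample_docs` list is a hypothetical stand-in for `strings_array`.

```python
import os
import tempfile

sample_docs = ["first sample document", "second sample document"]  # hypothetical stand-in

with tempfile.TemporaryDirectory() as tmp:
    for i, doc in enumerate(sample_docs):
        with open(os.path.join(tmp, f"doc_{i}.txt"), "w", encoding="utf-8") as f:
            f.write(doc)
    # Any loading step would have to run here, while the files still exist.
    written = sorted(os.listdir(tmp))
    print(written)

# The directory and its files are gone once the block has exited.
still_exists = os.path.exists(tmp)
print(still_exists)
```

With this pattern the document-loading step has to run inside the `with` block, which is the trade-off against `mkdtemp()`.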


In [None]:

# Document loading
from langchain_community.document_loaders import DirectoryLoader,TextLoader
load = DirectoryLoader(
    temp_dir,
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"}
)

documents=load.load()
for i,doc in enumerate(documents):
    print(f"Document {i}:\n{doc.page_content}")
    print(f"Metadata: {doc.metadata}")

### Text Splitting from docs


In [14]:
# Text Splitting
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter=RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=20,
    length_function=len,
    separators=[" "]
)
chunks=splitter.split_documents(documents)
print(f"Total chunks: {len(chunks)} of {len(documents)}")
print(f"Content: {chunks[0]}")
print(f"Metadata: {chunks[0].metadata}")

Total chunks: 15 of 5
Content: page_content='Python is renowned for its flexible data structures, which include lists, tuples, sets, and dictionaries. Lists are ordered collections that support dynamic resizing and a variety of methods for adding, removing, or searching elements. Tuples are similar to lists but immutable, ensuring that data' metadata={'source': 'C:\\Users\\ps19j\\AppData\\Local\\Temp\\tmppjdesij8\\doc_0.txt'}
Metadata: {'source': 'C:\\Users\\ps19j\\AppData\\Local\\Temp\\tmppjdesij8\\doc_0.txt'}
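
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified word packer. It is an illustration only and not LangChain's algorithm: `split_words` is a hypothetical helper that splits on spaces, fills each chunk up to `chunk_size` characters, and carries trailing words into the next chunk as long as they fit within `chunk_overlap` characters.

```python
def split_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Greedy word packer: a simplified illustration, not LangChain's algorithm."""
    words = text.split(" ")
    chunks, current = [], []
    for word in words:
        candidate = " ".join(current + [word])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(" ".join(current))
            # Carry trailing words forward while they fit in the overlap budget.
            carried = []
            for w in reversed(current):
                if len(" ".join([w] + carried)) > chunk_overlap:
                    break
                carried.insert(0, w)
            current = carried
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "one two three four five six seven eight nine ten"
chunks = split_words(text, chunk_size=15, chunk_overlap=5)
for c in chunks:
    print(repr(c))
```

Each chunk after the first starts with the last word of the previous one, which is the overlap at work. A larger overlap preserves more context across chunk boundaries at the cost of storing more duplicated text.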


### Embedding

In [None]:
sample_text = "The quick brown fox jumps over the lazy dog."
embeddings=HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"

)
embeddings.embed_query(sample_text)
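
`embed_query` returns a plain list of floats, and similarity search works by comparing such vectors. Cosine similarity is one common comparison; the sketch below applies it to tiny hypothetical 3-dimensional vectors, whereas real vectors from `all-MiniLM-L6-v2` have 384 dimensions. Which distance a given vector store uses depends on its configuration, so treat this as an illustration of the idea, not of Chroma's internals.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not real model output.
query_vec = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query
doc_far = [0.0, 1.0, 0.0]    # orthogonal to the query

sim_close = cosine_similarity(query_vec, doc_close)
sim_far = cosine_similarity(query_vec, doc_far)
print(sim_close, sim_far)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.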

#### Storing into ChromaDB using HuggingFace embeddings

In [19]:
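# Directory where ChromaDB persists its data
persist_directory = "./chromaDB"

VECTOR_STORE = Chroma(
    persist_directory=persist_directory,
    embedding_function=HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    ),
    collection_name="Rag_collection"
)
# Note: without explicit ids, re-running this cell against the same
# directory adds the chunks again instead of replacing them.
VECTOR_STORE.add_documents(chunks)
print(f"Vector store created at {persist_directory}")
# _collection is a private attribute of the wrapper; used here only to inspect the count
print(f"Number of vectors: {VECTOR_STORE._collection.count()}")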

Vector store created at ./chromaDB
Number of vectors: 20
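
The count of 20 is higher than the 15 chunks produced by the splitter. That is consistent with the cell having been run more than once against the same persisted directory, though the output alone does not confirm it. One possible way to make re-runs idempotent is to derive a stable ID for each chunk and pass the list through the `ids` argument, assuming the store upserts on matching IDs. The `chunk_id` helper and the `sample_chunks` data below are hypothetical.

```python
import hashlib

def chunk_id(source: str, text: str) -> str:
    """Hypothetical helper: stable ID derived from a chunk's source path and text."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:32]

sample_chunks = [  # hypothetical stand-ins for (metadata["source"], page_content)
    ("doc_0.txt", "Python is renowned for its flexible data structures"),
    ("doc_4.txt", "In Node.js, the event loop is a key mechanism"),
]
ids = [chunk_id(src, text) for src, text in sample_chunks]
print(ids)

# With the notebook's objects this might look like (not executed here):
# ids = [chunk_id(c.metadata["source"], c.page_content) for c in chunks]
# VECTOR_STORE.add_documents(chunks, ids=ids)
```

Because the ID depends only on the source and the text, the same chunk always maps to the same ID, while a chunk whose text changes (for example after a different `chunk_size`) gets a new one.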


In [22]:
query="How to work with Node js?"
similar_chunks=VECTOR_STORE.similarity_search(query, k=3)
print(similar_chunks)

[Document(metadata={'source': 'C:\\Users\\ps19j\\AppData\\Local\\Temp\\tmppjdesij8\\doc_4.txt'}, page_content='signals completion, drastically improving throughput. Promises and async/await syntax further simplify asynchronous code management, reducing callback hell and enhancing maintainability. Node.js excels in microservices, web servers, and real-time applications like chat or streaming services. Its'), Document(metadata={'source': 'C:\\Users\\ps19j\\AppData\\Local\\Temp\\tmppjdesij8\\doc_4.txt'}, page_content='In Node.js, the event loop is a key mechanism that allows non-blocking I/O operations on a single thread. Each incoming request is delegated to the system kernel, freeing the JavaScript runtime to handle other events. Async callbacks are queued and executed when the kernel signals completion, drastically improving throughput. Promises and async/await syntax further simplify asynchronous code management, reducing callback hell and enhancing maintainability. Node.js excels in 