### Building a RAG Model with LangChain and ChromaDB

In [3]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [4]:
#langchain imports
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma

#utility imports
import numpy as np
from typing import List

In [5]:
#create sample documents

sample_data = [
    '''Machine Learning
    
    Machine learning is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed. It works by feeding large datasets into algorithms that identify patterns and relationships, which can then be used to make predictions or decisions on new, unseen data. Unlike traditional programming, where rules must be manually defined, machine learning systems adapt and evolve as they encounter more information. This makes them especially powerful for complex tasks such as image recognition, speech processing, natural language translation, fraud detection, and personalized recommendations. By leveraging techniques like supervised learning, unsupervised learning, and reinforcement learning, machine learning has become a cornerstone of modern technology, powering applications from healthcare diagnostics to self-driving cars''',

    '''Deep Learning and Neural Networks
    
    Deep learning is a specialized branch of machine learning that uses multilayered neural networks to process and learn from complex data. It mimics the way the human brain works by stacking artificial neurons into layers, allowing computers to automatically discover intricate patterns and representations without manual feature engineering. The term “deep” refers to the use of many hidden layers—sometimes hundreds or thousands—that transform input data step by step until a meaningful output is produced. These architectures can be trained in supervised, semi-supervised, or unsupervised ways, making them highly versatile.

Deep learning has revolutionized fields such as computer vision, natural language processing, speech recognition, and autonomous systems. For example, convolutional neural networks (CNNs) excel at image classification, while recurrent and transformer-based networks dominate language tasks. By leveraging vast datasets and powerful GPUs, deep learning models achieve state-of-the-art performance in tasks ranging from facial recognition to self-driving cars. Its ability to handle unstructured data—like images, audio, and text—sets it apart from traditional machine learning approaches, making deep learning a cornerstone of modern artificial intelligence''',

'''Natural Language Processing

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. It combines linguistics, computer science, and machine learning to bridge the gap between human communication and digital systems. NLP techniques allow machines to process text and speech, making it possible to perform tasks such as sentiment analysis, language translation, speech recognition, chatbots, and information retrieval. By breaking down language into components like syntax, semantics, and context, NLP systems can extract meaning and respond intelligently. Modern NLP relies heavily on deep learning models, such as transformers, which power applications like voice assistants, automated customer support, and large-scale text analysis. This technology is central to how humans interact naturally with machines today, making it one of the most impactful areas of AI.'''
]

sample_data

['Machine Learning\n\n    Machine learning is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed. It works by feeding large datasets into algorithms that identify patterns and relationships, which can then be used to make predictions or decisions on new, unseen data. Unlike traditional programming, where rules must be manually defined, machine learning systems adapt and evolve as they encounter more information. This makes them especially powerful for complex tasks such as image recognition, speech processing, natural language translation, fraud detection, and personalized recommendations. By leveraging techniques like supervised learning, unsupervised learning, and reinforcement learning, machine learning has become a cornerstone of modern technology, powering applications from healthcare diagnostics to self-driving cars',
 'Deep Learning and Neural Networks\n\n    Deep learning is a specialized bra

In [6]:
#saving sample_data in txt file

# import tempfile
# temp_dir = tempfile.mkdtemp()  # would create the docs under a temp path such as "C:\Users\mouni\AppData\Local\Temp\tmpedlpm0wi"

for i, doc in enumerate(sample_data):
    #write with an explicit encoding so it matches the utf-8 encoding the loader uses
    with open(f"doc_{i}.txt", "w", encoding="utf-8") as f:
        f.write(doc)

print("Sample documents are created")

Sample documents are created
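
The commented-out `tempfile` lines above hint at writing the files somewhere other than the working directory. A minimal sketch of that idea follows, assuming a hypothetical helper `write_sample_docs` (not part of the notebook) that writes into a dedicated folder, so that a `**/*.txt` glob only picks up these files. The demo uses a throwaway temp directory and placeholder texts.

```python
from pathlib import Path
import tempfile


def write_sample_docs(texts, target_dir):
    """Write each text to doc_<i>.txt inside target_dir and return the paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(texts):
        path = target / f"doc_{i}.txt"
        # Explicit utf-8 so the files can be read back with encoding="utf-8".
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


# Demo with a throwaway directory; a fixed project folder would work the same way.
demo_dir = tempfile.mkdtemp()
demo_paths = write_sample_docs(["first doc", "second doc"], demo_dir)
print([p.name for p in demo_paths])
```

In the notebook, `sample_data` would be passed as `texts` and the same folder would then be given to the directory loader.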


In [8]:
#document loading

from langchain_community.document_loaders import DirectoryLoader

#Load documents from directory
loader = DirectoryLoader(path="D:\\RAGs In Depth\\Vectorstore_and_Databases",
                         glob='**/*.txt',
                         loader_cls=TextLoader,
                         loader_kwargs={"encoding":"utf-8"})

documents = loader.load()
len(documents)

3

In [9]:
documents

[Document(metadata={'source': 'D:\\RAGs In Depth\\Vectorstore_and_Databases\\doc_0.txt'}, page_content='Machine Learning\n\nMachine learning is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed. It works by feeding large datasets into algorithms that identify patterns and relationships, which can then be used to make predictions or decisions on new, unseen data. Unlike traditional programming, where rules must be manually defined, machine learning systems adapt and evolve as they encounter more information. This makes them especially powerful for complex tasks such as image recognition, speech processing, natural language translation, fraud detection, and personalized recommendations. By leveraging techniques like supervised learning, unsupervised learning, and reinforcement learning, machine learning has become a cornerstone of modern technology, powering applications from healthcare diagnostics to

In [10]:
#Document Splitting

splitter = RecursiveCharacterTextSplitter(chunk_size=300,
                                        chunk_overlap=50, 
                                        length_function=len,
                                        separators=["\n\n", "\n", " ", ". ", ""])  #" " is listed before ". ", so ". " is effectively never reached

chunks = splitter.split_documents(documents)
chunks

[Document(metadata={'source': 'D:\\RAGs In Depth\\Vectorstore_and_Databases\\doc_0.txt'}, page_content='Machine Learning'),
 Document(metadata={'source': 'D:\\RAGs In Depth\\Vectorstore_and_Databases\\doc_0.txt'}, page_content='Machine learning is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed. It works by feeding large datasets into algorithms that identify patterns and relationships, which can then be used to make predictions or'),
 Document(metadata={'source': 'D:\\RAGs In Depth\\Vectorstore_and_Databases\\doc_0.txt'}, page_content='which can then be used to make predictions or decisions on new, unseen data. Unlike traditional programming, where rules must be manually defined, machine learning systems adapt and evolve as they encounter more information. This makes them especially powerful for complex tasks such as image'),
 Document(metadata={'source': 'D:\\RAGs In Depth\\Vectorstore_and_Datab
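
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping windows. It is a plain fixed-width sliding window, not the algorithm `RecursiveCharacterTextSplitter` uses (which splits on separators first and then merges pieces), and the helper name `fixed_window_chunks` is made up for illustration.

```python
def fixed_window_chunks(text, chunk_size=300, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# 700 characters of dummy text, so the window boundaries are easy to check.
demo = "".join(str(i % 10) for i in range(700))
pieces = fixed_window_chunks(demo)
print([len(p) for p in pieces])
print(pieces[0][-50:] == pieces[1][:50])  # the tail of one chunk starts the next
```

The overlap is what makes the end of one chunk in the output above reappear at the start of the next ("which can then be used to make predictions or").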

In [11]:
#defining embedding models

embeddings = HuggingFaceEmbeddings(model="sentence-transformers/all-MiniLM-L6-v2")
embeddings

HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, query_encode_kwargs={}, multi_process=False, show_progress=False)

In [12]:
#Initialize the ChromaDB vectorstore and store the chunk vectors in it
persist_directory="./chroma_db"
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory= persist_directory,
    collection_name="rag_collection")

print(f"Vector store created with {vectorstore._collection.count()} vectors")
print(f"Persisted to: {persist_directory}")

Vector store created with 51 vectors
Persisted to: ./chroma_db
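
The similarity searches further down return the same chunk twice. One plausible explanation, which the notebook does not confirm, is that this cell was run more than once against the same `persist_directory`, so each chunk was stored several times. A possible guard is to derive a stable ID from each chunk's source and text and to drop repeats before adding. The sketch below uses plain `(source, text)` pairs and hypothetical helper names (`chunk_id`, `unique_chunks_with_ids`).

```python
import hashlib


def chunk_id(source, text):
    """Stable ID derived from the chunk's source and content."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:32]


def unique_chunks_with_ids(chunks):
    """chunks: iterable of (source, text) pairs. Returns (ids, unique_chunks)."""
    seen = {}
    for source, text in chunks:
        # setdefault keeps the first occurrence and ignores later repeats
        seen.setdefault(chunk_id(source, text), (source, text))
    return list(seen.keys()), list(seen.values())


demo_chunks = [
    ("doc_0.txt", "Machine Learning"),
    ("doc_0.txt", "Machine learning is a branch of artificial intelligence"),
    ("doc_0.txt", "Machine Learning"),  # repeated chunk
]
ids, unique = unique_chunks_with_ids(demo_chunks)
print(len(ids), len(unique))
```

With the notebook's objects, the IDs could be built as `[chunk_id(c.metadata["source"], c.page_content) for c in chunks]` and passed through the `ids` argument of `Chroma.from_documents`. Whether re-adding an existing ID overwrites the entry instead of duplicating it depends on the installed Chroma integration, so that behaviour is worth verifying.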


In [13]:
#Testing similarity search

query="what is machine learning?"

similar_docs= vectorstore.similarity_search(query, k=2)
similar_docs

[Document(metadata={'source': 'D:\\RAGs In Depth\\Vectorstore_and_Databases\\doc_0.txt'}, page_content='Machine learning is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed. It works by feeding large datasets into algorithms that identify patterns and relationships, which can then be used to make predictions or'),
 Document(metadata={'source': 'D:\\RAGs In Depth\\Vectorstore_and_Databases\\doc_0.txt'}, page_content='Machine learning is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed. It works by feeding large datasets into algorithms that identify patterns and relationships, which can then be used to make predictions or')]

In [14]:
#similarity search with score

similarity_search_with_score= vectorstore.similarity_search_with_score(query, k=1)
similarity_search_with_score

[(Document(metadata={'source': 'D:\\RAGs In Depth\\Vectorstore_and_Databases\\doc_0.txt'}, page_content='Machine learning is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed. It works by feeding large datasets into algorithms that identify patterns and relationships, which can then be used to make predictions or'),
  0.3144099712371826)]
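
The number returned next to the document is a distance, so lower means closer. As a hedged illustration: if the collection uses Chroma's default squared-L2 space and the embeddings are unit-length (both are assumptions about this setup, not something the notebook shows), the distance `d` and the cosine similarity `c` are related by `d = 2 - 2c`. The sketch checks that relation on two made-up vectors with the standard library only.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_from_squared_l2(d):
    """Valid only for unit-length vectors: d = 2 - 2 * cos."""
    return 1 - d / 2


# Two arbitrary example vectors, normalized to unit length.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 3.5])
d = squared_l2(a, b)
c = cosine(a, b)
print(abs(d - (2 - 2 * c)) < 1e-9)
print(cosine_from_squared_l2(0.3144099712371826))
```

Under those assumptions, the score of about 0.314 above would correspond to a cosine similarity of roughly 0.84.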

In [15]:
#initialize LLM, RAG Chain, Augmented Prompt, Query over RAG Model

from langchain_groq import ChatGroq

groq_api_key= os.getenv("GROQ_API_KEY")
model = ChatGroq(model="llama-3.1-8b-instant", api_key=groq_api_key)
model



ChatGroq(profile={'max_input_tokens': 131072, 'max_output_tokens': 8192, 'image_inputs': False, 'audio_inputs': False, 'video_inputs': False, 'image_outputs': False, 'audio_outputs': False, 'video_outputs': False, 'reasoning_output': False, 'tool_calling': True}, client=<groq.resources.chat.completions.Completions object at 0x000002113DAEF790>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x000002113DC5CBD0>, model_name='llama-3.1-8b-instant', model_kwargs={}, groq_api_key=SecretStr('**********'))

In [16]:
#another way of initializing model

from langchain.chat_models import init_chat_model

llm = init_chat_model(model="groq:llama-3.1-8b-instant")
llm.invoke("what is ai?")

AIMessage(content='Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. The term can also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.\n\nAI technology is based on the principle of developing algorithms and statistical models that enable computers to process and analyze data, make decisions, and take actions. These algorithms are designed to enable machines to perform tasks that typically require human intelligence, such as:\n\n1. **Learning**: AI systems can learn from data, identify patterns, and improve their performance over time.\n2. **Problem-solving**: AI can analyze complex problems, identify potential solutions, and make decisions.\n3. **Reasoning**: AI can draw conclusions based on available data and make logical decisions.\n4. **Perception**: AI can interpret and understand data from sensors, such as images, speech, and t

In [17]:
#Modern RAG chain

from langchain_classic.chains import create_retrieval_chain
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

#converting vectorstore into a retriever

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={'k':2} #retrieve top 2 relevant chunks
)

retriever

VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x000002113A917FD0>, search_kwargs={'k': 2})

In [18]:
prompt = ChatPromptTemplate.from_template(
    '''You are an assistant for question-answering tasks.
    Answer the question only based on the context information provided.
    If you don't have context information for the question, 
    mention as you are not sure about the answer.
    give answer not more than two sentences.
    context : {context}
    question : {input}
'''
)

prompt

ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks.\n    Answer the question only based on the context information provided.\n    If you don't have context information for the question, \n    mention as you are not sure about the answer.\n    give answer not more than two sentences.\n    context : {context}\n    question : {input}\n"), additional_kwargs={})])

In [19]:
#creating stuff document chain

documents_chain = create_stuff_documents_chain(llm=model,
                                               prompt=prompt)
documents_chain

RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableLambda(format_docs)
}), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
| ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks.\n    Answer the question only based on the context information provided.\n    If you don't have context information for the question, \n    mention as you are not sure about the answer.\n    give answer not more than two sentences.\n    context : {context}\n    question : {input}\n"), additional_kwargs={})])
| ChatGroq(profile={'max_input_tokens': 131072, 'max_output_tokens': 8192, 'image_inputs': False, 'audio_inputs': False, 'video_inputs': False, 'image_outputs': False, 'audio_outputs': False, 'video_outputs': False, 

In [20]:
#creating retrieval chain

rag_chain = create_retrieval_chain(retriever,
    documents_chain
    
)
rag_chain

RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableBinding(bound=RunnableLambda(lambda x: x['input'])
           | VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x000002113A917FD0>, search_kwargs={'k': 2}), kwargs={}, config={'run_name': 'retrieve_documents'}, config_factories=[])
})
| RunnableAssign(mapper={
    answer: RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
              context: RunnableLambda(format_docs)
            }), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
            | ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks.\n    Answer the question only based on the context information pro

In [21]:
#invoking chain with query

response = rag_chain.invoke({"input":"what is neural network"})
response['answer']

'A neural network is a type of network composed of artificial neurons that are stacked into layers, allowing computers to discover intricate patterns in data. It is a key component of deep learning that mimics the way the human brain works.'

In [22]:
response['context']

[Document(metadata={'source': 'D:\\RAGs In Depth\\Vectorstore_and_Databases\\doc_1.txt'}, page_content='Deep learning is a specialized branch of machine learning that uses multilayered neural networks to process and learn from complex data. It mimics the way the human brain works by stacking artificial neurons into layers, allowing computers to automatically discover intricate patterns and'),
 Document(metadata={'source': 'D:\\RAGs In Depth\\Vectorstore_and_Databases\\doc_1.txt'}, page_content='Deep learning is a specialized branch of machine learning that uses multilayered neural networks to process and learn from complex data. It mimics the way the human brain works by stacking artificial neurons into layers, allowing computers to automatically discover intricate patterns and')]
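
Conceptually, the "stuff" step joins the retrieved chunks into one string and places it in the `{context}` slot of the prompt, next to the question in `{input}`. The sketch below is a simplified stand-in using plain strings and `str.format`, not the library's implementation. `stuff_prompt` and the shortened template are illustrative names and text.

```python
def stuff_prompt(template, docs, question, separator="\n\n"):
    """Join the document texts and fill the template's {context} and {input} slots."""
    context = separator.join(docs)
    return template.format(context=context, input=question)


# Shortened stand-in for the notebook's prompt template.
template = (
    "Answer the question only based on the context provided.\n"
    "context : {context}\n"
    "question : {input}\n"
)

filled = stuff_prompt(
    template,
    ["first chunk", "second chunk"],
    "what is x?",
)
print(filled)
```

The filled prompt is what the chat model ultimately receives. Because every retrieved chunk is pasted in verbatim, duplicate chunks take up prompt space without adding information.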

#### Creating RAG Chain Alternative - Using LCEL (LangChain Expression Language)

In [23]:
#even more flexible approach using LCEL

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

In [24]:
#creating a custom prompt

custom_prompt = ChatPromptTemplate.from_template('''
Use the following context to answer the question.
If you don't know the answer based on the context, say you don't know.
Provide specific details from the context to support your answer.

context: {context}
question: {question} 
answer:   ''')

custom_prompt

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="\nUse the following context to answer the question.\nIf you don't know the answer based on the context, say you don't know.\nProvide specific details from the context to support your answer.\n\ncontext: {context}\nquestion: {question} \nanswer:   "), additional_kwargs={})])

In [25]:
#format the output documents for the prompt

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
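
Since the retriever in this notebook can return the same chunk more than once, a variant of `format_docs` that skips repeated text may be useful. This is a sketch only: `format_unique_docs` is a hypothetical name, and `Doc` is a minimal stand-in for a LangChain `Document` that carries just the `page_content` attribute the function needs.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a document object with a page_content attribute."""
    page_content: str


def format_unique_docs(docs):
    """Join page contents with blank lines, keeping only the first copy of each text."""
    seen = set()
    parts = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            parts.append(doc.page_content)
    return "\n\n".join(parts)


print(format_unique_docs([Doc("a"), Doc("a"), Doc("b")]))
```

It could be dropped into the chain wherever `format_docs` is used, since it accepts the same kind of input and also returns a single string.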


In [31]:
#Build the chain using LCEL

rag_chain_lcel = (
   { "context": retriever | format_docs,
    "question": RunnablePassthrough()} |
    custom_prompt |
    model |
    StrOutputParser()
)

rag_chain_lcel

{
  context: VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x000002113A917FD0>, search_kwargs={'k': 2})
           | RunnableLambda(format_docs),
  question: RunnablePassthrough()
}
| ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="\nUse the following context to answer the question.\nIf you don't know the answer based on the context, say you don't know.\nProvide specific details from the context to support your answer.\n\ncontext: {context}\nquestion: {question} \nanswer:   "), additional_kwargs={})])
| ChatGroq(profile={'max_input_tokens': 131072, 'max_output_tokens': 8192, 'image_inputs': False, 'audio_inputs': False, 'video_inputs': False, 'image_outputs': False, 'audio_outputs': False, 'video_outputs':

In [32]:
rag_chain_lcel.invoke("what is machine learning")

'Machine learning is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed.'

In [35]:
#inspect what the retriever returns for a query, using the public invoke() method
retriever.invoke("what is machine learning")

[Document(metadata={'source': 'D:\\RAGs In Depth\\Vectorstore_and_Databases\\doc_0.txt'}, page_content='Machine learning is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed. It works by feeding large datasets into algorithms that identify patterns and relationships, which can then be used to make predictions or'),
 Document(metadata={'source': 'D:\\RAGs In Depth\\Vectorstore_and_Databases\\doc_0.txt'}, page_content='Machine learning is a branch of artificial intelligence that enables computers to learn from data and improve performance without being explicitly programmed. It works by feeding large datasets into algorithms that identify patterns and relationships, which can then be used to make predictions or')]

#### Adding new documents to an existing vectorstore

In [None]:
vectorstore

<langchain_community.vectorstores.chroma.Chroma at 0x2113a917fd0>

In [None]:
new_data = '''Generative AI
Generative AI is a field of artificial intelligence that focuses on creating new content—such as text, images, music, video, or even computer code—by learning patterns from large datasets and then producing outputs that resemble human creativity. Powered by advanced neural networks, especially transformer models, generative AI can write essays, design artwork, compose songs, or simulate conversations, making it a powerful tool for innovation across industries. Its ability to automate creativity is transforming areas like education, healthcare, entertainment, and business, while also raising important questions about ethics, originality, and responsible use.'''

new_doc= Document(
    page_content=new_data,
    metadata={"source":"manual_addition", "topic":"generative ai"}
)

In [43]:
#splitting the new document into chunks

new_chunks=splitter.split_documents([new_doc])
new_chunks

[Document(metadata={'source': 'manual_addition', 'topic': 'generative ai'}, page_content='Generative AI'),
 Document(metadata={'source': 'manual_addition', 'topic': 'generative ai'}, page_content='Generative AI is a field of artificial intelligence that focuses on creating new content—such as text, images, music, video, or even computer code—by learning patterns from large datasets and then producing outputs that resemble human creativity. Powered by advanced neural networks, especially'),
 Document(metadata={'source': 'manual_addition', 'topic': 'generative ai'}, page_content='Powered by advanced neural networks, especially transformer models, generative AI can write essays, design artwork, compose songs, or simulate conversations, making it a powerful tool for innovation across industries. Its ability to automate creativity is transforming areas like education,'),
 Document(metadata={'source': 'manual_addition', 'topic': 'generative ai'}, page_content='creativity is transforming area

In [44]:
#add new documents into vectorstore
vectorstore.add_documents(new_chunks)

['4721eec3-e436-44e6-88ef-981bbec3238f',
 '64622d7d-c19d-4607-8370-15305d34b3ad',
 '9ce111dc-aa94-4a12-868b-260239456458',
 '28f020c2-c3b2-4035-9412-4ccbba074dd9']

In [45]:
print(f"Total vectors now: {vectorstore._collection.count()}")

Total vectors now: 55


In [47]:
#query with updated vectorstore
query="what is generative ai?"
updated_retriever = vectorstore.as_retriever()
response= updated_retriever.invoke(query)
response

[Document(metadata={'source': 'manual_addition', 'topic': 'generative ai'}, page_content='Generative AI'),
 Document(metadata={'source': 'manual_addition', 'topic': 'generative ai'}, page_content='Generative AI is a field of artificial intelligence that focuses on creating new content—such as text, images, music, video, or even computer code—by learning patterns from large datasets and then producing outputs that resemble human creativity. Powered by advanced neural networks, especially'),
 Document(metadata={'source': 'manual_addition', 'topic': 'generative ai'}, page_content='Powered by advanced neural networks, especially transformer models, generative AI can write essays, design artwork, compose songs, or simulate conversations, making it a powerful tool for innovation across industries. Its ability to automate creativity is transforming areas like education,'),
 Document(metadata={'source': 'D:\\RAGs In Depth\\Vectorstore_and_Databases\\doc_1.txt'}, page_content='output is produce