<a href="https://colab.research.google.com/github/Suriyanand/GEN_AI_PROJECTS/blob/main/langchain_step_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Loaders, text splitter, embedding wrappers, and the Chroma vector store
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceBgeEmbeddings
from langchain.vectorstores import Chroma  # the class, not the lowercase `chroma` module
import os

In [None]:
text_loader = TextLoader("/content/example.txt")
text_docs = text_loader.load()
print(text_docs)

[Document(metadata={'source': '/content/example.txt'}, page_content='LangChain is a framework for developing applications powered by language models. It enables the chaining of components like prompt templates, language model calls, and output parsers in a flexible and easy-to-use way.\n\nLangChain supports document loading, splitting, embedding, and storage using vector databases. This makes it suitable for building powerful retrieval-based applications like chatbots, Q&A systems, and search engines.\n\nDevelopers can use embeddings from providers like HuggingFace or OpenAI and persist data into vector stores like Chroma or FAISS for efficient retrieval.\n')]


In [None]:
pdf_loader = PyPDFLoader("https://arxiv.org/pdf/2305.12675.pdf")
pdf_docs = pdf_loader.load()
print(pdf_docs)



In [None]:
all_docs = text_docs + pdf_docs

In [None]:
# Split every document into chunks of at most 500 characters, with 100 characters of overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
split_docs = splitter.split_documents(all_docs)
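
The effect of `chunk_size` and `chunk_overlap` in the cell above can be illustrated with a minimal sketch. This is not how `RecursiveCharacterTextSplitter` works internally: the library first tries to split on separators such as paragraph breaks, newlines, and spaces, whereas the hypothetical helper `sliding_window_chunks` below simply slides a fixed character window. It only shows how consecutive chunks share `chunk_overlap` characters.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size character windows that overlap (illustrative helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, which is the same idea as the 100-character overlap used above: text near a chunk boundary appears in both neighbouring chunks.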

In [None]:
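# Embed each chunk; embed_documents returns one vector per input string.
# Note: all-MiniLM-L6-v2 is a sentence-transformers model, not a BGE model;
# HuggingFaceEmbeddings is the usual wrapper for it.
hf_embedder = HuggingFaceBgeEmbeddings(model_name="all-MiniLM-L6-v2")
hf_embeddings = hf_embedder.embed_documents([doc.page_content for doc in split_docs])


In [None]:
print("Hugging Face embedding count:", len(hf_embeddings))

Hugging Face embedding count: 229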


In [None]:
from langchain.vectorstores import Chroma
chroma_store = Chroma.from_documents(
    documents=split_docs,
    embedding=hf_embedder,
    persist_directory="chroma_db"
)

In [None]:
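query = "what is Langchain?"
# similarity_search returns the k most similar chunks (k defaults to 4)
result = chroma_store.similarity_search(query)
for i, res in enumerate(result):
    # Show only the first 300 characters of each hit
    print(f"\nResult {i+1}:\n{res.page_content[:300]}")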


Result 1:
LangChain is a framework for developing applications powered by language models. It enables the chaining of components like prompt templates, language model calls, and output parsers in a flexible and easy-to-use way.

LangChain supports document loading, splitting, embedding, and storage using vect

Result 2:
3. Background
3.1. Language Models
An LM is a probability distribution over token se-
quences. Given a sequence x1:t = x1, x2, . . . , xt
of length t, LM assigns a probability p(x1:t) to the
sequence, which is usually decomposed in an au-
toregressive fashion: p(x1:t) =Qt
i=1 p(xi|x<i).
N-gram Langu

Result 3:
7. References
BigScience. 2023. Bloom: A 176b-parameter open-
access multilingual language model.
Tom B. Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish
Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretche

Result 4:
First, all the operations (i.e., construction, predic
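
The ranking behind the results above can be sketched in plain Python. This is an assumption-laden illustration, not Chroma's implementation: Chroma uses an approximate nearest-neighbour index and its distance metric is configurable, while the hypothetical helpers `cosine_similarity` and `top_k` below do an exact scan with cosine similarity over toy 2-dimensional vectors.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 4) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar vectors."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:k]]


# Toy vectors standing in for real chunk embeddings (made up for illustration)
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]

ranking = top_k(toy_query, toy_docs, k=2)
for idx, score in ranking:
    print(f"doc {idx}: similarity {score:.3f}")
```

With real embeddings the query string is embedded by the same model as the chunks, and the store returns the chunks whose vectors lie closest to the query vector.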

In [None]:
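# Explicitly flush the store to disk. With Chroma 0.4 and later, a store created
# with persist_directory is saved automatically and this call is deprecated.
chroma_store.persist()
print("Chroma DB persisted!")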

Chroma DB persisted!



In [None]:
loaded_chroma = Chroma(persist_directory="chroma_db", embedding_function=hf_embedder)
print("Chroma DB loaded with", len(loaded_chroma.get()["documents"]), "documents.")

Chroma DB loaded with 229 documents.


