## Chain and Retriever with LangChain

In [1]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("Colon.pdf")
docs = loader.load()
docs

[Document(page_content='PMID: 28106826 PMCID: PMC5297828 DOI: 10.3390/ijms18010197 \nAbstract \nColorectal cancer (CRC) is the third most common cancer and the fourth most common cause of \ncancer-related death. Most cases of CRC are detected in Western countries, with its incidence \nincreasing year by year. The probability of suffering from colorectal cancer is about 4%-5% and the \nrisk for developing CRC is associated with personal features or habits such as age, chronic disease \nhistory and lifestyle. In this context, the gut microbiota has a relevant role, and dysbiosis situations \ncan induce colonic carcinogenesis through a chronic inflammation mechanism. Some of the bacteria \nresponsible for this multiphase process include Fusobacterium spp, Bacteroides fragilis and \nenteropathogenic Escherichia coli. CRC is caused by mutations that target oncogenes, tumour \nsuppressor genes and genes related to DNA repair mechanisms. Depending on the origin of the \nmutation, colorectal c

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
text_splitter.split_documents(docs)[:5]

[Document(page_content='PMID: 28106826 PMCID: PMC5297828 DOI: 10.3390/ijms18010197 \nAbstract \nColorectal cancer (CRC) is the third most common cancer and the fourth most common cause of \ncancer-related death. Most cases of CRC are detected in Western countries, with its incidence \nincreasing year by year. The probability of suffering from colorectal cancer is about 4%-5% and the \nrisk for developing CRC is associated with personal features or habits such as age, chronic disease \nhistory and lifestyle. In this context, the gut microbiota has a relevant role, and dysbiosis situations \ncan induce colonic carcinogenesis through a chronic inflammation mechanism. Some of the bacteria \nresponsible for this multiphase process include Fusobacterium spp, Bacteroides fragilis and \nenteropathogenic Escherichia coli. CRC is caused by mutations that target oncogenes, tumour \nsuppressor genes and genes related to DNA repair mechanisms. Depending on the origin of the', metadata={'source': 'C

In [3]:
documents=text_splitter.split_documents(docs)
documents

[Document(page_content='PMID: 28106826 PMCID: PMC5297828 DOI: 10.3390/ijms18010197 \nAbstract \nColorectal cancer (CRC) is the third most common cancer and the fourth most common cause of \ncancer-related death. Most cases of CRC are detected in Western countries, with its incidence \nincreasing year by year. The probability of suffering from colorectal cancer is about 4%-5% and the \nrisk for developing CRC is associated with personal features or habits such as age, chronic disease \nhistory and lifestyle. In this context, the gut microbiota has a relevant role, and dysbiosis situations \ncan induce colonic carcinogenesis through a chronic inflammation mechanism. Some of the bacteria \nresponsible for this multiphase process include Fusobacterium spp, Bacteroides fragilis and \nenteropathogenic Escherichia coli. CRC is caused by mutations that target oncogenes, tumour \nsuppressor genes and genes related to DNA repair mechanisms. Depending on the origin of the', metadata={'source': 'C
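
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a much simpler fixed-width splitter. This is only a sketch of the overlap idea, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on separators such as paragraph breaks, newlines and spaces, so its chunks are not all the same width. The function name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-width windows that share chunk_overlap characters.

    Simplified illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk repeats the last two characters of the previous one, which is the role the 20-character overlap plays in the cells above: text that straddles a chunk boundary still appears intact in at least one chunk.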

In [4]:
# Alternative embedding backends (not used here):
# from langchain_community.embeddings import OpenAIEmbeddings
# from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Embed the chunks with a local sentence-transformers model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Build the FAISS index from the first 30 chunks only
db = FAISS.from_documents(documents[:30], embedding_model)


In [5]:
db

<langchain_community.vectorstores.faiss.FAISS at 0x20dbef8fbd0>

In [6]:
query="The choice of first-line treatment in CRC follows"
result=db.similarity_search(query)
result[0].page_content

'treatment in CRC follows a multimodal approach based on tumour-related characteristics and usually \ncomprises surgical resection followed by chemotherapy combined with monoclonal antibodies or \nproteins against vascular endothelial growth factor (VEGF) and epidermal growth receptor (EGFR). \nBesides traditional chemotherapy, alternative therapies (such as agarose tumour macrobeads, anti-\ninflammatory drugs, probiotics, and gold-based drugs) are currently being studied to increase \ntreatment effectiveness and reduce side effects. \n \nKeywords: agarose macrobeads; anti-inflammatories; biomarkers; colorectal carcinoma; functional \nfood; gene-expression profiling; metal-based drugs; microbiota; ncRNA; probiotics.'
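
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below reproduces that idea with hand-written three-dimensional vectors and a brute-force scan. It assumes Euclidean (L2) distance, which to my understanding is the default for the LangChain FAISS wrapper, and it ignores how FAISS actually organises its index. The names `l2_distance`, `top_k` and `toy_index`, and all the vectors, are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, index, k=4):
    """Return the texts of the k entries closest to query_vec (smallest distance first)."""
    scored = sorted(index, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in scored[:k]]


# Toy "vector store": (chunk text, embedding) pairs with made-up embeddings
toy_index = [
    ("chunk about treatment", [0.9, 0.1, 0.0]),
    ("chunk about microbiota", [0.1, 0.9, 0.0]),
    ("chunk about keywords", [0.0, 0.2, 0.9]),
]

query_vec = [0.8, 0.2, 0.1]  # stand-in for the embedded query
nearest = top_k(query_vec, toy_index, k=2)
print(nearest)  # ['chunk about treatment', 'chunk about microbiota']
```

In the notebook, `result[0]` plays the role of the first entry in `nearest`: the chunk whose embedding is closest to the embedded query.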

In [7]:
from langchain_community.llms import Ollama
## Load Ollama gemma3:1b LLM model
llm=Ollama(model="gemma3:1b")
llm

Ollama(model='gemma3:1b')

In [8]:
## Design ChatPrompt Template
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context. 
Think step by step before providing a detailed answer. 
I will rate you 5 out of 5 if the user finds the answer helpful.
<context>
{context}
</context>
Question: {input}""")

In [9]:
## Create Stuff Document Chain

from langchain.chains.combine_documents import create_stuff_documents_chain
document_chain = create_stuff_documents_chain(llm,prompt)
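
A "stuff" chain takes its name from what it does with the retrieved documents: it joins their text into one string and places that string in the prompt's `{context}` slot. The following sketch shows that step with plain string formatting. It is an illustration under assumptions, not LangChain's implementation: `stuff_prompt` and `TEMPLATE` are hypothetical names, the template is a shortened copy of the one defined above, and the blank-line separator is what I understand the library default to be.

```python
# Shortened copy of the notebook's prompt template (illustrative only)
TEMPLATE = """Answer the following question based only on the provided context.
<context>
{context}
</context>
Question: {input}"""


def stuff_prompt(chunks, question, separator="\n\n"):
    """Join all chunk texts and substitute them into the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, input=question)


filled = stuff_prompt(
    ["Chunk one text.", "Chunk two text."],
    "Which type of cancer is discussed?",
)
print(filled)
```

Because every retrieved chunk is placed in a single prompt, this approach only works while the combined text fits in the model's context window.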

In [10]:
retriever=db.as_retriever()
retriever

VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x0000020DBEF8FBD0>)

In [11]:
from langchain.chains import create_retrieval_chain
retrieval_chain=create_retrieval_chain(retriever,document_chain)

In [12]:
response=retrieval_chain.invoke({"input":"Which type of Cancer is discussed in the document?"})

In [13]:
response['answer']

'According to the document, colorectal cancer is discussed.'
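
The data flow behind `retrieval_chain.invoke` can be sketched with stub functions: the retriever turns the question into a list of chunks, the document chain turns question plus chunks into an answer, and the result is returned as a dictionary. Everything below is a stand-in written for illustration; `fake_retriever`, `fake_document_chain` and `retrieval_chain_sketch` are hypothetical and call neither FAISS nor an LLM. The three keys mirror what I understand the real chain to return (`input`, `context`, `answer`).

```python
def fake_retriever(question):
    """Stand-in for the vector store retriever; always returns one fixed chunk."""
    return ["Colorectal cancer (CRC) is the third most common cancer."]


def fake_document_chain(inputs):
    """Stand-in for the stuff-documents chain; reports how many chunks it received."""
    return f"(stub answer built from {len(inputs['context'])} chunk(s))"


def retrieval_chain_sketch(inputs):
    """Retrieve context for the question, then pass both to the document chain."""
    context = fake_retriever(inputs["input"])
    answer = fake_document_chain({"input": inputs["input"], "context": context})
    return {"input": inputs["input"], "context": context, "answer": answer}


sketch_response = retrieval_chain_sketch(
    {"input": "Which type of cancer is discussed in the document?"}
)
print(sorted(sketch_response))  # ['answer', 'context', 'input']
```

With the real chain, inspecting `response['context']` alongside `response['answer']` is a quick way to check which chunks the answer was based on.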