Building with RAG and LangChain

In [46]:
#Requirements
#Common way: !pip install langchain pinecone-client huggingface_hub langchain-community sentence_transformers
!pip install -r "/content/Requirments.txt" # install the libraries listed in the file



1. LangChain (langchain)
Purpose: A framework for building applications powered by large language models (LLMs).
2. Pinecone Client (pinecone-client)
Purpose: The Python client for Pinecone, a vector database for storing and searching embeddings efficiently.
3. Hugging Face Hub (huggingface_hub)
Purpose: Provides access to Hugging Face models, datasets, and spaces.
4. LangChain Community (langchain-community)
Purpose: A package for community-driven integrations in LangChain.
5. Sentence Transformers (sentence_transformers)
Purpose: A library for computing sentence embeddings for NLP tasks.

Import Libraries

In [47]:
#access to operating-system features; used later to set an environment variable
import os
#used later to pause while waiting for the index to become ready
import time
#suppress warnings to keep the output readable
import warnings
warnings.filterwarnings('ignore')
#userdata reads secrets (API keys, passwords) saved in the Colab Secrets panel,
#so they do not have to be hardcoded in the notebook.
#Notebook access to each secret must be enabled in that panel.
from google.colab import userdata
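
The idea of keeping keys out of the notebook can be sketched with a small helper. This is only an illustration: `get_secret` is a hypothetical name, and the fallback to an environment variable outside Colab is an assumption, not something the notebook does.

```python
import os


def get_secret(name):
    """Return a secret from Colab if available, else from an environment variable."""
    try:
        # google.colab only exists inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Assumed fallback for running the same code outside Colab.
        value = os.environ.get(name)
    else:
        value = userdata.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} is not set")
    return value
```

With this helper, `get_secret("Chat")` would stand in for the `userdata.get("Chat")` call used later in the notebook.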

LangChain and its components

In [48]:
#loads plain-text files as LangChain documents
from langchain.document_loaders import TextLoader
#splits documents into smaller chunks
from langchain.text_splitter import CharacterTextSplitter
#converts text into embeddings (numeric vectors)
from langchain.embeddings import HuggingFaceEmbeddings

Data Ingestion

Text Loader - loads a text file into the environment so that other LangChain components can use it.
First, we create a TextLoader object and pass it the path of our .txt document.


In [49]:
loader=TextLoader("/content/Human Nerves.txt")

The LangChain Pinecone vector store accepts data as Document objects. Calling load() below creates the document that can later be loaded into Pinecone.

In [50]:
document=loader.load()

In [51]:
document

[Document(metadata={'source': '/content/Human Nerves.txt'}, page_content='Importance of Human Nerves – Summary\nThe human nervous system is essential for controlling and coordinating bodily functions. It is responsible for transmitting signals between different parts of the body, allowing for movement, sensation, thought, and homeostasis.\n\n1. Structure of the Nervous System\nThe nervous system is divided into:\n\nCentral Nervous System (CNS): Includes the brain and spinal cord, which process and send out signals.\nPeripheral Nervous System (PNS): Consists of nerves outside the CNS that connect the brain and spinal cord to the rest of the body.\nSomatic Nervous System: Controls voluntary movements.\nAutonomic Nervous System: Regulates involuntary functions (heartbeat, digestion, etc.).\n2. Functions of Nerves\nSensory Function: Detects stimuli (e.g., pain, temperature) and sends signals to the CNS.\nMotor Function: Transmits signals from the CNS to muscles, causing movement.\nAutonomi

Context Window - the maximum number of tokens an LLM can take in at once

Split Data - the model cannot take in all the information at once, so we need to break the document into chunks.
1. Create a splitter object and set its parameters.
2. Use the object to split the document.

In [52]:
textSplitter=CharacterTextSplitter(
    separator=" ",
    chunk_size=150,
    chunk_overlap=4)

In [53]:
splittedDocuments=textSplitter.split_documents(documents=document)

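Chunk overlap repeats the last few characters of one chunk at the start of the next, so text cut at a chunk boundary keeps a little context (in the output below, "for", "out" and "the" end one chunk and begin the next). With chunk_overlap=4 the extra memory used is negligible compared with the raw text.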
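
To make the effect of overlap concrete, here is a simplified pure-Python sketch of splitting on a separator with a small overlap. It is not LangChain's implementation; `split_with_overlap` is a hypothetical helper, and the tiny sizes in the example are chosen only so the result is easy to read.

```python
def split_with_overlap(text, chunk_size=150, chunk_overlap=4, separator=" "):
    """Greedily pack words into chunks of at most chunk_size characters.

    When a chunk is full, the trailing words that fit within chunk_overlap
    characters are carried over to the start of the next chunk.
    """
    chunks, current = [], []
    for word in text.split(separator):
        candidate = separator.join(current + [word])
        if current and len(candidate) > chunk_size:
            chunks.append(separator.join(current))
            # Collect trailing words that fit in the overlap budget.
            tail = []
            for previous in reversed(current):
                if len(separator.join([previous] + tail)) > chunk_overlap:
                    break
                tail.insert(0, previous)
            current = tail
        current.append(word)
    if current:
        chunks.append(separator.join(current))
    return chunks


chunks = split_with_overlap("aa bb cc dd ee ff", chunk_size=8, chunk_overlap=2)
print(chunks)  # ['aa bb cc', 'cc dd ee', 'ee ff']
```

Each chunk starts with the last word of the previous one, which is the same pattern visible in the split documents of this notebook.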

In [54]:
splittedDocuments

[Document(metadata={'source': '/content/Human Nerves.txt'}, page_content='Importance of Human Nerves – Summary\nThe human nervous system is essential for controlling and coordinating bodily functions. It is responsible for'),
 Document(metadata={'source': '/content/Human Nerves.txt'}, page_content='for transmitting signals between different parts of the body, allowing for movement, sensation, thought, and homeostasis.\n\n1. Structure of the Nervous'),
 Document(metadata={'source': '/content/Human Nerves.txt'}, page_content='System\nThe nervous system is divided into:\n\nCentral Nervous System (CNS): Includes the brain and spinal cord, which process and send out'),
 Document(metadata={'source': '/content/Human Nerves.txt'}, page_content='out signals.\nPeripheral Nervous System (PNS): Consists of nerves outside the CNS that connect the brain and spinal cord to the rest of the'),
 Document(metadata={'source': '/content/Human Nerves.txt'}, page_content='the body.\nSomatic Nervous System: C

Embeddings

In [55]:
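#no model name is given, so the default sentence-transformers model is used;
#it produces 768-dimensional vectors, which matches dimension=768 used for the index
embedding=HuggingFaceEmbeddings()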
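
An embedding turns each chunk into a vector, and texts with similar meaning end up with vectors that point in similar directions. The index in this notebook is created with metric='cosine', so a minimal sketch of cosine similarity may help. The 3-dimensional vectors here are made up for illustration; the real embeddings have 768 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
close = [2.0, 0.0, 2.0]  # same direction as query, different length
far = [0.0, 1.0, 0.0]    # perpendicular to query

print(cosine_similarity(query, close))
print(cosine_similarity(query, far))
```

The first pair scores close to 1 because only direction matters, not length; the second pair scores 0.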

Database setup - after creating the embeddings, we store them in a vector database: Pinecone.


We need to initialize Pinecone with the Pinecone API key,
check whether the index already exists in the database,
and create a new index if it does not.

In [56]:
#import the Pinecone client and the serverless spec
from pinecone import Pinecone,ServerlessSpec

In [57]:
#Start the connection with Pinecone; "Chat" is the name of the Colab secret that holds the Pinecone API key
pineconeDataBase=Pinecone(api_key=userdata.get("Chat"))

In [58]:
#configuration for a serverless index
#Pinecone is cloud-hosted; here we use the AWS platform
mycloud='aws'
#region
myregion='us-east-1'
#serverless spec passed to create_index
serv=ServerlessSpec(cloud=mycloud,region=myregion)

In [59]:
#Index (DB) name
#no upper case or underscore, only lower case and dash (-)
index_name='priya-llm-human-nerve-db'
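
The naming rule in the comment above can be checked before calling the API. This is a sketch under assumptions: `is_valid_index_name` is a hypothetical helper, and allowing digits and requiring the name to start and end with a letter or digit are assumptions beyond what the notebook states.

```python
import re

# Assumed rule: lowercase letters, digits and hyphens only,
# starting and ending with a letter or digit.
_INDEX_NAME = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def is_valid_index_name(name):
    """Return True if name follows the assumed index naming rule."""
    return _INDEX_NAME.fullmatch(name) is not None


print(is_valid_index_name("priya-llm-human-nerve-db"))
print(is_valid_index_name("Human_Nerve_DB"))
```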

In [60]:
#check whether an index with this name already exists; if not, create it
if index_name not in pineconeDataBase.list_indexes().names():
  pineconeDataBase.create_index(
      name=index_name,
      dimension=768,#must match the size of the embedding vectors
      metric='cosine',#similarity metric used to compare vectors
      spec=serv)
#wait until the index is created and ready
while not pineconeDataBase.describe_index(index_name).status["ready"]:
  time.sleep(1)
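
The wait loop above keeps polling until the index reports ready, so it would run forever if that never happens. A hedged variant with a timeout could look like the sketch below; `wait_until` and its default values are hypothetical, not part of the Pinecone client.

```python
import time


def wait_until(is_ready, timeout_seconds=120.0, poll_seconds=1.0):
    """Call is_ready() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Not ready after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Possible use with the objects defined in this notebook:
# wait_until(lambda: pineconeDataBase.describe_index(index_name).status["ready"])
```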



In [61]:
print(pineconeDataBase.Index(index_name).describe_index_stats())
#confirm that the index exists and see how many vectors it holds

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 57}},
 'total_vector_count': 57}


Add data to the database

In [62]:
PINECONE_API_KEY=userdata.get("Chat")

In [63]:
os.environ["PINECONE_API_KEY"]=PINECONE_API_KEY

In [64]:
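#import LangChain's Pinecone vector store wrapper
#note: this name shadows the Pinecone client class imported earlier; pineconeDataBase was already created, so it still works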
from langchain.vectorstores import Pinecone

In [65]:
#the index was already created above, so check whether it holds any vectors yet
index=pineconeDataBase.Index(index_name)
if index.describe_index_stats().total_vector_count==0:
  #empty index: embed the chunks and upload them
  docsearch=Pinecone.from_documents(splittedDocuments,embedding,index_name=index_name)
else:
  #index already has data: connect to it with the same embedding model
  docsearch=Pinecone.from_existing_index(index_name,embedding)
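
The final output of this notebook retrieves the same chunk four times, which suggests the same chunks were uploaded more than once. One common way to make uploads repeatable is to derive each vector ID from the chunk itself, so that a second upload overwrites instead of adding a copy. The sketch below only shows the ID generation; `chunk_id` is a hypothetical helper, and how explicit IDs are passed to the vector store should be checked against its documentation.

```python
import hashlib


def chunk_id(source, text):
    """Deterministic ID for a chunk, based on its source path and content."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:32]


first = chunk_id("/content/Human Nerves.txt", "out signals.")
second = chunk_id("/content/Human Nerves.txt", "out signals.")
other = chunk_id("/content/Human Nerves.txt", "the body.")
print(first == second, first == other)  # True False
```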

In [66]:
docsearch

<langchain_community.vectorstores.pinecone.Pinecone at 0x7e3066d639d0>

Model setup
Now that the text is embedded and stored in the vector database, we can set up a model.

In [67]:
repository_id= "mistralai/Mixtral-8x7B-Instruct-v0.1"

In [68]:
from langchain.llms import HuggingFaceHub

Model Initialization

In [69]:
llm=HuggingFaceHub(
    repo_id=repository_id, #model
    huggingfacehub_api_token=userdata.get("Huggingllm"))

Prompt

In [70]:
from langchain.prompts import PromptTemplate #template for the instructions given to the model

In [71]:
myTemplate = """
You are doctor , users will ask doubts about nervous system.
Use the following context to answer the question.
If you dont know the answer , just say I don't know.
Keep the answer brief

Context:{context}
Question:{question}
Answer:

"""

In [72]:
prompt=PromptTemplate(
    template=myTemplate,
    input_variables=["context","question"] #must match the {context} and {question} placeholders
)

In [73]:
prompt

PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="\nYou are doctor , users will ask doubts about nervous system.\nUse the following context to answer the question.\nIf you dont know the answer , just say I don't know.\nKeep the answer brief\n\nContext:{context}\nQuestion:{question}\nAnswer:\n\n")
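
A prompt template is a string with named placeholders that are filled in at query time. Plain Python string formatting shows the same idea; the shortened template and the context sentence below are made up for illustration.

```python
template = "Context:{context}\nQuestion:{question}\nAnswer:"

filled = template.format(
    context="Nerves carry signals between the body and the CNS.",
    question="what is nerve",
)
print(filled)
```

The placeholder names are case-sensitive, so a template written with {context} cannot be filled with a variable named Context.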

Chain them together

Database - Pinecone, Model - LLM (Mixtral), Prompt - prompt template

The process begins with a document search: the retriever looks up the chunks in the database that are most relevant to the query. Those chunks are placed into the prompt as context, together with the question, and the LLM produces the answer. Query --> DB --> relevant chunks --> Prompt --> LLM --> contextualized output
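
The flow can be sketched end to end in plain Python. Everything here is a toy stand-in: `retrieve` ranks chunks by shared words instead of vector similarity, `fake_llm` is a placeholder for the real model, and the three chunks are made-up sentences.

```python
def retrieve(query, chunks, k=2):
    """Toy retriever: rank chunks by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Put all retrieved chunks into one prompt (the "stuff" approach)."""
    context = "\n\n".join(context_chunks)
    return f"Context:{context}\nQuestion:{query}\nAnswer:"


def fake_llm(prompt):
    """Placeholder for a real model call."""
    return f"(model reply to a prompt of {len(prompt)} characters)"


def answer(query, chunks, llm=fake_llm, k=2):
    return llm(build_prompt(query, retrieve(query, chunks, k)))


chunks = [
    "Motor nerves carry signals to muscles.",
    "The brain and spinal cord form the CNS.",
    "Sensory nerves detect pain and temperature.",
]
print(retrieve("what do sensory nerves detect", chunks, k=1))
print(answer("what do sensory nerves detect", chunks))
```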

In [74]:
from langchain.chains import RetrievalQA

In [75]:
questionanswerChain=RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever()
)
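
The chain above is built without the prompt defined earlier, so RetrievalQA uses its own default prompt. That is why the output below starts with "Use the following pieces of context to answer the question at the end." A sketch of passing the custom prompt through `chain_type_kwargs` follows; `build_qa_chain` is a hypothetical wrapper name, and the import sits inside the function only so the snippet loads without LangChain installed.

```python
def build_qa_chain(llm, retriever, prompt):
    """Build a RetrievalQA chain that uses a custom prompt for the "stuff" step."""
    # Imported here so that defining the function does not require LangChain.
    from langchain.chains import RetrievalQA

    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        # The prompt must expose the {context} and {question} variables.
        chain_type_kwargs={"prompt": prompt},
    )


# Possible use with the objects defined in this notebook:
# questionanswerChain = build_qa_chain(llm, docsearch.as_retriever(), prompt)
```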

In [76]:
query='what is nerve'

In [77]:
results=questionanswerChain.run(query)

In [78]:
print(results)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

from injuries through neuroplasticity.
Human nerves are fundamental for sensation, movement, and survival. Without them, essential body functions

from injuries through neuroplasticity.
Human nerves are fundamental for sensation, movement, and survival. Without them, essential body functions

from injuries through neuroplasticity.
Human nerves are fundamental for sensation, movement, and survival. Without them, essential body functions

from injuries through neuroplasticity.
Human nerves are fundamental for sensation, movement, and survival. Without them, essential body functions

Question: what is nerve
Helpful Answer: Nerves are the part of the body that allow for the transmission of signals between different parts of the body. Neuroplasticity is the brain's ability to form new neural pathways, allowing for the reorganizat
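
The printed result contains the whole prompt (instructions, retrieved context and question) followed by the model's reply. If only the reply is wanted, a small post-processing step can keep the text after the last "Helpful Answer:" marker. This is a sketch; `extract_answer` is a hypothetical helper and the sample string is shortened and made up.

```python
def extract_answer(generated_text, marker="Helpful Answer:"):
    """Return the text after the last marker, or the whole text if the marker is absent."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


sample = "Question: what is nerve\nHelpful Answer: Nerves transmit signals."
print(extract_answer(sample))  # Nerves transmit signals.
```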