<a href="https://colab.research.google.com/github/ShanAliZaidi/RAG_Project/blob/main/Rag_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%pip install -qU langchain-google-genai langchain_pinecone langchain-community pypdf

In [31]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone, ServerlessSpec
import os
from google.colab import userdata

GEMINI_API_KEY = userdata.get("GEMINI_KEY")
pinecone_key = userdata.get("pinecone_key")
os.environ["GOOGLE_API_KEY"] = GEMINI_API_KEY
os.environ["PINECONE_API_KEY"] = pinecone_key

#initialize chat model
llm_model = ChatGoogleGenerativeAI(
    model = "gemini-2.0-flash-exp",
    temperature = 0
)

#initialize embeddings
embeddings_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

#initialize vector store
pc = Pinecone(api_key=pinecone_key)
index_name = "web-index"
existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]
if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=768,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index(index_name)
vector_store = PineconeVectorStore(embedding=embeddings_model, index=index)

In [32]:
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import WebBaseLoader

page_url = "https://python.langchain.com/docs/tutorials/rag/"

loader = WebBaseLoader(web_paths=[page_url])


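#alternative: load a local file instead of the web page
#upload the file in the Files menu on the left, then uncomment
#file_path and ONE of the two loaders below

#file_path = "/content/imambaqeras_lifesketchzhj.pdf"
#loader = PyPDFLoader(file_path)  # for PDF files
#loader = TextLoader(file_path)   # for plain-text files

#load the document(s) with whichever loader is active
documents = loader.load()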



In [33]:

print(f"Total characters: {len(documents[0].page_content)}")

Total characters: 42732


In [36]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

#splitting document

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
docs = text_splitter.split_documents(documents)
document_ids = vector_store.add_documents(documents=docs)

In [37]:
print(f"Split blog post into {len(docs)} sub-documents.")

Split blog post into 67 sub-documents.
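
To make `chunk_size` and `chunk_overlap` more concrete, here is a minimal sketch of a fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that one prefers to break on separators such as newlines, which is probably why the run above produced more, shorter chunks than a fixed window would). The function name `sliding_window_chunks` and the sample text are made up for illustration.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: a simplified stand-in for a real text splitter.
    Returns a list of (start_index, chunk_text) tuples.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Toy input: 2500 characters of repeating digits (assumed sample data).
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)

for start, chunk in chunks:
    print(f"start={start:5d} length={len(chunk)}")
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the tail of one chunk is repeated at the head of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.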


In [38]:
from tqdm import tqdm

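# Sanity check: embed every chunk and show progress.
# Note: this loop does NOT upload anything; the vectors are discarded.
# The actual upload to Pinecone already happened in
# vector_store.add_documents() above.
for doc in tqdm(docs):
    vector = embeddings_model.embed_query(doc.page_content)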

100%|██████████| 67/67 [00:08<00:00,  7.60it/s]


In [49]:
retriever = vector_store.as_retriever()
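
Conceptually, the retriever embeds the question and asks the index for the stored chunks whose vectors are closest under the index metric (`cosine` in this notebook). The sketch below imitates that ranking in plain Python on toy 3-dimensional vectors. The names `cosine_similarity`, `top_k`, and `toy_store` are hypothetical, and the real search runs inside Pinecone on 768-dimensional vectors, not in a Python loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "vector store": (text, embedding) pairs with made-up embeddings.
toy_store = [
    ("chunk about splitting", [0.9, 0.1, 0.0]),
    ("chunk about embeddings", [0.1, 0.9, 0.0]),
    ("chunk about retrieval", [0.0, 0.2, 0.9]),
]

query = [1.0, 0.0, 0.0]  # pretend this is the embedded question
results = top_k(query, toy_store, k=2)

for score, text in results:
    print(f"{score:.3f}  {text}")
```

If the default number of returned chunks is too small or too large for your questions, `as_retriever` accepts `search_kwargs={"k": ...}` to change it.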

In [40]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm = llm_model,
    chain_type = "stuff",
    retriever = retriever,  # Pass the retriever here
    return_source_documents = True  # Optional: to get the source documents used in the response
)
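
The `"stuff"` chain type simply puts all retrieved chunks into a single prompt and sends it to the model in one call. A rough sketch of that idea follows; the prompt wording and the helper `build_stuff_prompt` are assumptions for illustration, not the template LangChain actually uses.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one prompt (hypothetical helper)."""
    # Blank lines between chunks keep them visually separate for the model.
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for what the retriever would return.
example_chunks = [
    "Splitting divides a long document into smaller chunks.",
    "RecursiveCharacterTextSplitter splits on common separators.",
]

prompt = build_stuff_prompt("What is document splitting?", example_chunks)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit inside the model's context window, which is one more reason to keep `chunk_size` and the number of retrieved chunks modest.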

In [45]:
res = qa_chain.invoke("what is splitting the document and how we can do it")
print(res["result"])

Splitting a document refers to dividing a large document into smaller chunks. This is often necessary because many models have context window limitations, and even if they don't, they can struggle to find information in very long inputs. By splitting the document into smaller, more manageable pieces, it becomes easier to process and retrieve relevant information.

You can split a document using a `RecursiveCharacterTextSplitter`. This method recursively splits the document using common separators like new lines until each chunk is the appropriate size. It is the recommended text splitter for generic text use cases.

Here's how you can do it:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
all_splits = text_splitter.split_documents(docs)
print(f"