# Building a Question and Answer System with RAG


In [50]:
# pip install google-generativeai

In [51]:
pip show google-generativeai

Name: google-generativeai
Version: 0.5.2
Summary: Google Generative AI High level API client library and tools.
Home-page: https://github.com/google/generative-ai-python
Author: Google LLC
Author-email: googleapis-packages@google.com
License: Apache 2.0
Location: c:\users\kalag\anaconda3\lib\site-packages
Requires: google-ai-generativelanguage, google-api-core, google-api-python-client, google-auth, protobuf, pydantic, tqdm, typing-extensions
Required-by: langchain-google-genai
Note: you may need to restart the kernel to use updated packages.




## Initializing the Gemini API Key

In [52]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(),override=True)
GEMINI_API_KEY = os.environ.get('GEMINI_API_KEY')
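If `GEMINI_API_KEY` is missing from the environment, `os.environ.get` returns `None`, and the problem only surfaces later, when the first API call fails. A minimal sketch of an early check follows; `require_env` is a hypothetical helper name, not part of the original notebook or of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or environment")
    return value
```

With this in place, the assignment above could be written as `GEMINI_API_KEY = require_env("GEMINI_API_KEY")`, so a missing key fails immediately with a readable message.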

## 1. Load the Document

In [53]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.prompts.chat import ChatPromptTemplate
from langchain_core.messages import SystemMessage
from langchain_core.prompts.chat import HumanMessagePromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.vectorstores import Chroma
from langchain_core.runnables import RunnablePassthrough
from IPython.display import Markdown as md

In [54]:
loader = PyPDFLoader('Data/leaveNoContextBehind.pdf')
pages = loader.load_and_split()

In [55]:
pages[0]

Document(page_content='Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into the vanilla attention mechanism and builds\nin both masked local attention and long-term linear attention mechanisms\nin a single Transformer block. We demonstrate the effectiveness of our\napproach on long-context language modeling benchmarks, 1M sequence\nlength passkey context block retrieval and 500K length book summarization\ntasks with 1B and 8B LLMs. Our approach introduces minimal bounded\nmemory parameters and enables fast stream

## 2. Splitting the Document

In [56]:
splt_text = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500,chunk_overlap=100)
chunk_texts = splt_text.split_documents(pages)
print(len(chunk_texts))
print(type(chunk_texts))
print(type(chunk_texts[0]))

13
<class 'list'>
<class 'langchain_core.documents.base.Document'>
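To see what `chunk_size` and `chunk_overlap` mean, the sketch below cuts a token list into fixed-size windows that share `chunk_overlap` tokens with their predecessor. This is a simplification for illustration: `CharacterTextSplitter` first splits on a separator and then merges pieces, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of up to `chunk_size` items.

    Consecutive windows share `chunk_overlap` items. Illustrative only;
    this is not how the LangChain splitter is implemented.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks
```

With `chunk_size=500` and `chunk_overlap=100`, each window advances by 400 tokens, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.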


In [75]:
chunk_texts[0:3]

[Document(page_content='Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into the vanilla attention mechanism and builds\nin both masked local attention and long-term linear attention mechanisms\nin a single Transformer block. We demonstrate the effectiveness of our\napproach on long-context language modeling benchmarks, 1M sequence\nlength passkey context block retrieval and 500K length book summarization\ntasks with 1B and 8B LLMs. Our approach introduces minimal bounded\nmemory parameters and enables fast strea

## 3. Create Chunk Embeddings

In [58]:
embeddings = GoogleGenerativeAIEmbeddings(google_api_key=GEMINI_API_KEY,
                                          model="models/embedding-001")

## 4. Store the Chunks in a Vector Store

In [59]:
db = Chroma.from_documents(chunk_texts, embeddings, persist_directory="vectordb")
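Conceptually, a vector store keeps each chunk next to its embedding and answers a query by returning the chunks whose vectors are closest to the query vector. The toy version below uses cosine similarity over hand-written 3-dimensional vectors. It is a sketch of the idea only: Chroma uses its own index and distance settings, and the names `cosine_similarity`, `top_k`, and `toy_store` are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k entries in `store` most similar to `query_vec`.

    `store` is a list of (text, vector) pairs.
    """
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hand-written stand-ins for real embeddings.
toy_store = [
    ("chunk about attention", [1.0, 0.0, 0.0]),
    ("chunk about memory", [0.0, 1.0, 0.0]),
    ("chunk about benchmarks", [0.0, 0.0, 1.0]),
]
```

Calling `top_k([0.9, 0.1, 0.0], toy_store, k=2)` ranks the attention chunk first and the memory chunk second, because the query vector points mostly along the first axis.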

In [93]:
chat_model = ChatGoogleGenerativeAI(google_api_key=GEMINI_API_KEY, 
                                   model="gemini-1.5-pro-latest")

output_parser = StrOutputParser()

## 5. Set Up the Prompt Template

In [94]:
chat_template = ChatPromptTemplate.from_messages([
 
    SystemMessage(content="""Please answer the following question as thoroughly as possible using the provided context.
    If the context does not contain the answer, simply state,
    'The answer is not available in the context.' Avoid guessing or providing incorrect answers."""),
    
    HumanMessagePromptTemplate.from_template("""Answer the question based on the given context.
    Context:
    {context}
    
    Question: 
    {question}
    
    Answer: """)
])
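The `{context}` and `{question}` placeholders are filled at call time. The plain-Python sketch below mimics that substitution for the human message, joining retrieved chunks with blank lines the way a helper such as `format_docs` would. It does not use LangChain; `HUMAN_TEMPLATE` and `build_human_message` are hypothetical names, and the whitespace is simplified relative to the template above.

```python
# Simplified copy of the human-message template, for illustration only.
HUMAN_TEMPLATE = (
    "Answer the question based on the given context.\n"
    "Context:\n{context}\n\n"
    "Question:\n{question}\n\n"
    "Answer: "
)


def build_human_message(chunks, question):
    """Fill the template with retrieved chunk texts and the user's question."""
    context = "\n\n".join(chunks)  # one blank line between chunks
    return HUMAN_TEMPLATE.format(context=context, question=question)
```

Printing `build_human_message(["first chunk", "second chunk"], "What is Infini-attention?")` shows the exact text the model would receive for this simplified template, which is a quick way to check that the context is being inserted where expected.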


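## 6. Retrieve Context for the User's Query

In [104]:
def format_docs(docs):
    # Join the retrieved chunks into one context string, separated by blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


# The cell that defined `retriever` is missing from this export; the line below
# assumes the default retriever over the Chroma store built in step 4.
retriever = db.as_retriever()

rag_chain = ({"context": retriever | format_docs, "question": RunnablePassthrough()}
             | chat_template | chat_model | output_parser)

## 7. Pass the Context and Question to the LLM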

In [105]:
response = rag_chain.invoke("Write Summary in 5 lines about transformers")

response

'## Transformer Summary:\n\n* **Attention Mechanism:** Transformers utilize attention mechanisms, allowing them to focus on relevant parts of the input sequence when making predictions. This is in contrast to traditional recurrent neural networks that process sequences sequentially.\n* **Long-Context Challenges:**  While powerful, transformers face challenges with long sequences due to limitations in memory and computational efficiency. Researchers are exploring techniques like compressive memory, efficient attention mechanisms, and long-context continual pre-training to address these issues.\n* **Scaling Transformers:**  Efforts are underway to scale transformers to handle even longer sequences, such as millions or even billions of tokens. This involves developing new architectures, data engineering techniques, and efficient attention mechanisms.\n* **Applications:**  Transformers have revolutionized natural language processing tasks, excelling in areas like machine translation, text 

In [106]:
response = rag_chain.invoke("What is the key contribution of the Infini-attention mechanism proposed in the paper?")

md(response)

## Key Contribution of Infini-attention

The key contribution of the Infini-attention mechanism, as described in the context, is its ability to enable Transformer LLMs (Large Language Models) to efficiently process infinitely long inputs while maintaining a bounded memory footprint and computational cost. This is achieved through a combination of:

* **Compressive Memory:** Infini-attention incorporates a compressive memory that stores old key-value (KV) attention states, unlike standard attention mechanisms that discard them. This allows the model to retain and access long-term context history.
* **Local and Global Context:** The mechanism combines both masked local attention for focusing on recent context within a segment and long-term linear attention for retrieving information from the compressive memory. This enables efficient modeling of both short-range and long-range dependencies.

Essentially, Infini-attention allows LLMs to scale to infinitely long contexts without the typical memory and computational constraints, making it a significant advancement in the field. 


In [107]:
response = rag_chain.invoke("Write Summary in 5 lines about diabetes")

md(response)

The answer is not available in the context. 
    The provided text focuses on research in natural language processing, specifically attention mechanisms and memory networks, and does not contain information about diabetes. 
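Because the system prompt asks the model to reply with a fixed sentence when the context lacks the answer, downstream code can look for that sentence to decide whether a question was answered. The check below is a heuristic sketch: the model may paraphrase the sentence, in which case the match fails. `FALLBACK` and `is_unanswered` are hypothetical names.

```python
# The refusal sentence requested in the system prompt.
FALLBACK = "The answer is not available in the context."


def is_unanswered(response: str) -> bool:
    """Return True if the response contains the fallback sentence (case-insensitive)."""
    return FALLBACK.lower() in response.lower()
```

For the diabetes question above, `is_unanswered(response)` would return `True`, since the response begins with the fallback sentence.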
