# Langchain :
## 6. RAG :
#### - Combining search + generation
### why RAG :
#### - Limitation of standalone LLM
#### - Need for external knowledge

| Chatgpt and other models                        | RAG                                                 |
|-------------------------------------------------|-----------------------------------------------------|
| Pure generation                                 | Retrival + Generation                               |

### Components of RAG :
#### 1. REtriver
#### 2. Generator
#### 4. Knowledge

### Flow:
#### input query -> Retriver -> Generator -> Final output

![Embedding](images/RAG.png)

### Types : 
#### 1. Self RAG
#### 2. Corrective RAG 
#### 3. Fusion RAG
#### 4. Advance RAG
#### 5. Speculation RAG

### 1. Self RAG:
#### - It checks the own answer, if unsure, retrives more information and tries again
![Self Rag](images/self_RAG.png)

In [None]:
import os
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_pinecone import Pinecone as LangchainPinecone

# Step 1: Set your Gemini API key
os.environ["GOOGLE_API_KEY"] = "xxxxxxxxxxxxxxxxxxxx"  # 🔒 Replace with your key
os.environ["PINECONE_API_KEY"] = "xxxxxxxxxxxxxxxxxxx"  # 🔒 Replace with your key

# Step 2: Initialize the Gemini Embedding Model
embedding_model = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001"  # or use 'text-embedding-3-large' if needed
)

# Step 3: Connect to existing Pinecone vector store
vectorstore = LangchainPinecone(
    embedding=embedding_model,
    index_name="langchainpdf",  # ✅ Your existing index name
    namespace="llmpdf"          # ✅ The namespace you used earlier
)



In [31]:
# STEP 4: Run a test query (optional)
query = "What is LangChain?"
results = vectorstore.similarity_search(query, k=3)

# STEP 5: Show results
for i, doc in enumerate(results):
    print(f"\nResult #{i+1}:\n{doc.page_content}")


Result #1:
What is LangChain?

Result #2:
How does LangChain work?

Result #3:
LangChain comes with many extensions and a larger ecosystem that is developing around it.


#### Adding Self Rag :

In [32]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_pinecone import Pinecone as LangchainPinecone
from langchain.chains import RetrievalQA

In [33]:
retriever = vectorstore.as_retriever()

In [34]:
# ✅ Step 1: Initialize Gemini 2.5 Pro LLM
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-pro",
    temperature=0.3,
    convert_system_message_to_human=True
)

# ✅ Step 2: Connect LLM with your existing retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,  # Use your Pinecone or FAISS retriever here
    return_source_documents=True
)

In [39]:
def self_rag_query(query):
    """
    Perform a self-RAG query using the initialized LLM and retriever.
    """
    print("First attempt without retriever:")
    
    # ✅ Use invoke instead of predict
    first_response_msg = llm.invoke(f"Q: {query}\nA: Please provide a detailed answer based on your knowledge.")
    first_response = first_response_msg.content

    if (
        "I'm not sure" in first_response
        or "I don't know" in first_response
        or len(first_response) < 30
    ):
        print("Low confidence response detected. Using RAG for better results...")
        improved_response = qa_chain.invoke({"query": query})  # Can also use qa_chain.run(query)
        return improved_response
    else:
        return {
            "result": first_response,
            "source_documents": []
        }


In [40]:
response = self_rag_query("Explain Transformer architecture in detail")

print("\nFinal Response:\n", response["result"])

if response["source_documents"]:
    print("\nSource Documents:")
    for doc in response["source_documents"]:
        print("📄", doc.metadata.get("source", "Unknown source"))
        print("📝", doc.page_content[:300], "...\n")
else:
    print("\nNo source documents used.")



First attempt without retriever:
Low confidence response detected. Using RAG for better results...





Final Response:
 Based on the provided context, I cannot explain the Transformer architecture in detail. The text fragments are incomplete and do not contain enough information for a detailed explanation.

From the context, I can only tell you that:

*   A Transformer is a Deep Learning (DL) architecture.
*   It was first introduced in 2017 by researchers at Google.
*   The model architecture has an encoder-decoder structure.

The provided text does not explain what the encoder and decoder do, nor does it describe the other architectural features that are essential to understanding how a Transformer works.

Source Documents:
📄 documents/LangChain.pdf
📝 Wikimedia Commons):
Figure 1.6: The Transformer architecture ...

📄 documents/LangChain.pdf
📝 The architectural features that have contributed to the success of transformers are: ...

📄 documents/LangChain.pdf
📝 The transformer model architecture has an encoder-decoder structure, where the encoder maps ...

📄 documents/LangChain.pdf
📝 A

### 2. Corrective RAG:
#### - It identifies if the generated answer is wrong or misleading and correct it by fetching better sources
![Corrective Rag](images/CRAG_working.png)
![Corrective Rag](images/CRAG_graph.png)

In [70]:
import re

def corrective_rag_query(query: str):
    """
    Try answering with Gemini LLM. If unsure, fallback to RAG.
    Always return a clean string answer.
    """
    print(f"\n🧠 Query: {query}")

    # Step 1: Get initial Gemini LLM response
    first_response = llm.invoke(query)

    # Handle case where response is a list of messages
    if isinstance(first_response, list):
        first_response = first_response[0]

    # Extract message content
    try:
        first_response_text = first_response.content.strip()
    except AttributeError:
        first_response_text = str(first_response).strip()

    print("\n🔸 Initial LLM Response:\n", first_response_text)

    # Step 2: Check for weak response
    if "I don't know" in first_response_text or "I'm not sure" in first_response_text or len(first_response_text) < 50:
        print("\n⚠️ Low confidence — using RAG for better result...")

        improved_response = qa_chain.invoke(query)
        raw_answer = str(improved_response["result"])

        # Step 3: Clean up content=[...] format
        match = re.search(r"content=\[(.*?)\]", raw_answer, re.DOTALL)
        if match:
            try:
                content_list = eval(f"[{match.group(1)}]")
                cleaned_answer = " ".join(s.strip() for s in content_list)
            except Exception as e:
                cleaned_answer = "⚠️ Error parsing content: " + str(e)
        else:
            cleaned_answer = raw_answer.strip()

        return {"result": cleaned_answer}
    
    # LLM confident — return its answer directly
    return {"result": first_response_text}


In [71]:
query = "Explain Transformer architecture in detail"
response = corrective_rag_query(query)

# Raw string (your response)
raw_answer = str(response["result"])  # Already a string based on your output

# Step 1: Extract the part inside content=[ ... ]
match = re.search(r"content=\[(.*?)\]", raw_answer, re.DOTALL)

if match:
    # Step 2: Join the parts into one string, removing any leftover quote marks
    content_list = eval(f"[{match.group(1)}]")  # safely convert the inner list
    cleaned_answer = " ".join(s.strip() for s in content_list)
else:
    # fallback if pattern not found
    cleaned_answer = raw_answer

# Output result
print("\n✅ Final Answer:\n")
print(cleaned_answer)



🧠 Query: Explain Transformer architecture in detail





🔸 Initial LLM Response:
 content=['Of course. Let\'s break down the Transformer architecture in detail. It\'s a powerful and complex model, so we\'ll go step-by-step, from the big picture to the nitty-gritty components.\n\n### The Big Picture: What Problem Did Transformers Solve?\n\nBefore the Transformer, the state-of-the-art for sequence tasks (like machine translation) were **Recurrent Neural Networks (RNNs)**, including LSTMs and GRUs.\n\nRNNs had two major limitations:\n1.  **Sequential Processing:** They process data one word at a time. This makes them slow and prevents parallelization on modern hardware (like GPUs).\n2.  **Long-Range Dependencies:** While LSTMs improved this, it was still difficult for an RNN to connect a word at the end of a long paragraph to a word at the beginning. The "memory" would fade.\n\nThe Transformer, introduced in the 2017 paper **"Attention Is All You Need,"** solved these problems by completely removing recurrence and relying entirely on a mechani

### 3. Fusion Rag :
#### - It retrievers multiple documents and merges their information to generate a comprehensive, non-redundant answer.
![Fusion RaG](images/Fusion_rag.png)

In [2]:
import os
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_pinecone import Pinecone as LangchainPinecone
from langchain.chains import RetrievalQA

load_dotenv()

PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PINECONE_ENV = os.getenv("PINECONE_ENV")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
# Step 2: Initialize the Gemini Embedding Model
embedding_model = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001"  # or use 'text-embedding-3-large' if needed
)

# Step 3: Connect to existing Pinecone vector store
vectorstore = LangchainPinecone(
    embedding=embedding_model,
    index_name="langchainpdf",  # ✅ Your existing index name
    namespace="llmpdf"          # ✅ The namespace you used earlier
)

  vectorstore = LangchainPinecone(


In [5]:
# ✅ Step 4: Initialize Gemini 2.5 Pro LLM
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-pro",
    temperature=0.3,
    convert_system_message_to_human=True
)

retriever = vectorstore.as_retriever()

# ✅ Step 5: Connect LLM with your existing retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,  # Use your Pinecone or FAISS retriever here
    return_source_documents=True,
    chain_type="stuff"  # Use "stuff" for simple concatenation of documents
)

In [6]:
# ✅ Step 6: Perform a query
query = "What is the summary of this PDF?"
result = qa_chain({"query": query})

# Show answer
print("📌 Answer:")
print(result["result"])

# Show sources
print("\n📄 Source Documents:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\n--- Document {i+1} ---")
    print(doc.page_content[:500], "...")


  result = qa_chain({"query": query})


📌 Answer:
I cannot provide a summary because the text of the PDF was not included in the context. The provided text is a set of instructions and examples about how to summarize a document, not the content of the document itself.

📄 Source Documents:

--- Document 1 ---
Write a concise summary of the following:
{text}
CONCISE SUMMARY: ...

--- Document 2 ---
their outputs.
Here’s a simple example of loading a PDF document and summarizing it: ...

--- Document 3 ---
in a more concise and simplified manner. It can also answer specific questions about the paper, ...

--- Document 4 ---
look at summarization in much more detail in Chapter 4, Building Capable Assistants. Let’s move on. ...


### Advance RAG :
#### - It uses the external tools, multi-step reasoning or planing agents to complex queries.
#### - It can retriever, process and combine information on in stages.

![Advance_Rag](images/advance_RAG.png)

In [7]:
retriever = vectorstore.as_retriever(search_type="similarity_score_threshold", search_kwargs={
    "k": 10,
    "score_threshold": 0.7  # filters weak matches
})


In [8]:

# Use map_reduce for summarization + reasoning
qa_chain_arag = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    chain_type="map_reduce"
)

In [9]:
query = "Summarize the key transformer architecture concepts explained in this PDF."
result = qa_chain({"query": query})

print("📌 Answer:")
print(result["result"])

print("\n📄 Sources:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\n--- Source {i+1} (page {doc.metadata.get('page', '?')}) ---")
    print(doc.page_content[:300], "...")




📌 Answer:
Based on the provided text, the key transformer architecture concept explained is that it has an **encoder-decoder structure**. The text also begins to explain that the encoder's function is to map an input, but the sentence is cut off.

📄 Sources:

--- Source 1 (page 43.0) ---
The architectural features that have contributed to the success of transformers are: ...

--- Source 2 (page 43.0) ---
Wikimedia Commons):
Figure 1.6: The Transformer architecture ...

--- Source 3 (page 57.0) ---
7. What is a transformer and what does it consist of?
8. What does GPT stand for? ...

--- Source 4 (page 42.0) ---
The transformer model architecture has an encoder-decoder structure, where the encoder maps ...


### Speculative RAG :
#### - Instead of retrieying first, speculative rag start by guessing answer based on query, then retrievers document to refine or verity
#### - Helps when intial retrieval is weak.
#### flow :
#### Input -> Guess topic -> vector DB -> documents -> LLM -> Answer

![Speculative_RAG](images/speculative_rag.png)

In [10]:
retriever = vectorstore.as_retriever(search_type="similarity", k=5)

In [11]:
# Draft (fast) and Target (accurate) can be same if you don’t have GPT-4 etc.
draft_llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro", google_api_key=GOOGLE_API_KEY)
target_llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro", google_api_key=GOOGLE_API_KEY)

In [16]:
def speculative_rag(query: str, retriever, draft_llm, final_llm):
    # 1. Retrieve context
    context_docs = retriever.get_relevant_documents(query)
    context_text = "\n---\n".join([doc.page_content for doc in context_docs])

    # 2. Draft answer from fast LLM
    draft_prompt = f"""Context:\n{context_text}\n\nQuestion:\n{query}\n\nDraft Answer:"""
    draft_answer = draft_llm.invoke(draft_prompt)

    # 3. Refine using final LLM
    refine_prompt = f"""You're a helpful assistant. Please refine the following draft based on context.\n\nDraft:\n\"\"\"\n{draft_answer}\n\"\"\"\n\nRefined Answer:"""
    refined_answer = final_llm.invoke(refine_prompt)

    return refined_answer


In [24]:
from IPython.display import Markdown, display

In [25]:
query = "What are the key principles behind the transformer model?"

refined = speculative_rag(
    query=query,
    retriever=retriever,
    draft_llm=draft_llm,
    final_llm=target_llm
)

refined_text = refined.content

markdown_ready = f"## 🤖 Final Answer\n\n{refined_text}"

display(Markdown(markdown_ready))



## 🤖 Final Answer

Of course. Here is a refined version of the draft, edited for clarity, flow, and impact.

***

### **Key Principles of the Transformer Architecture**

The transformer model revolutionized how machines process sequential data like text, largely replacing traditional recurrent architectures. Its power and efficiency are built on the following core principles:

1.  **Self-Attention Mechanism**
    At the heart of the transformer is the **self-attention mechanism**. Unlike recurrent models that process words one by one, self-attention allows the model to weigh the importance of all other words in a sequence when encoding a specific word. This enables it to capture complex, **long-range dependencies** and contextual relationships, regardless of their distance from each other. Crucially, these calculations can be performed for all tokens simultaneously, making the process highly **parallelizable** and significantly speeding up training.

2.  **Multi-Head Attention**
    Instead of calculating attention just once, the model employs **multi-head attention**. This means it runs the self-attention process multiple times in parallel. Each "head" can learn different types of contextual relationships—for instance, one head might focus on syntactic links while another tracks semantic connections. The outputs from all heads are then combined to create a richer, more nuanced representation of the input.

3.  **Positional Encodings**
    Because the self-attention mechanism is inherently order-agnostic, the model needs a way to understand the sequence of the input. This is solved by adding **positional encodings** to the input embeddings. These are vectors that provide information about the absolute or relative position of each token, ensuring that the model can factor word order into its calculations.

4.  **Encoder-Decoder Architecture**
    The original transformer model featured a two-part structure:
    *   **The Encoder:** A stack of layers that processes the entire input sequence at once to build a rich, contextualized representation. Each encoder layer contains a multi-head self-attention mechanism and a feed-forward network.
    *   **The Decoder:** A stack of layers that generates the output sequence one token at a time. In addition to the components found in the encoder, the decoder includes a third sub-layer that performs "encoder-decoder attention," allowing it to focus on the most relevant parts of the encoded input.

    *(Note: This two-part structure is characteristic of the original model. Many modern architectures use only one part, such as encoder-only models (e.g., BERT) or decoder-only models (e.g., GPT).)*

5.  **Feed-Forward Networks & Layer Normalization**
    To support the attention mechanisms, each layer also contains two key components:
    *   A simple, position-wise **feed-forward network** that adds computational depth and non-linearity, further processing the output of the attention layer.
    *   **Residual connections** and **layer normalization** are applied around each sub-layer. These techniques are critical for stabilizing the training of very deep networks by preventing gradients from vanishing or exploding.