# **LangChain_HuggingFace_Tool-Routed Agent Based Web RAG Engine**

# Overall Goal

This pipeline answers user questions by first consulting an internal knowledge base. If the internal retrieval is insufficient, an LLM-based router expands the search to external sources: Wikipedia for general knowledge and Arxiv for academic literature. All retrieved information is then synthesized into a single, coherent answer with inline source citations.

# Core Components and Modular Functions

## Document Loaders

The system uses modular document loaders to ingest content from multiple sources, each tagged for traceability.

- **load_pdf(file_path: str)**  
  Reads content from a local file such as `/content/LLM.pdf` and tags the extracted text as `[PDF]`. Load errors are caught and reported, and an empty list is returned so the rest of the pipeline can continue.

- **load_web_docs(url: str)**  
  Fetches and parses content from a specified web URL, such as `https://docs.smith.langchain.com/`. All extracted text is tagged as `[Web]`.

- **load_wikipedia(query: str)**  
  Executes a Wikipedia search based on the query and tags the retrieved content as `[Wikipedia]`.

- **load_arxiv(query: str)**  
  Searches Arxiv for relevant academic papers and tags the retrieved content as `[Arxiv]`.
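
The tagging convention can be sketched without any loader dependencies. The `Doc` dataclass and the `tag_documents` helper below are hypothetical stand-ins for LangChain's `Document` and the loader functions; they only illustrate how a `source_tag` metadata field might be attached to every loaded document.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def tag_documents(docs: List[Doc], source_tag: str) -> List[Doc]:
    """Attach the same source tag to every document, in place."""
    for d in docs:
        d.metadata["source_tag"] = source_tag
    return docs


# Placeholder texts, not real loader output.
wiki_docs = tag_documents([Doc("stub encyclopedia text")], "[Wikipedia]")
arxiv_docs = tag_documents([Doc("stub paper abstract")], "[Arxiv]")
all_docs = wiki_docs + arxiv_docs
print([d.metadata["source_tag"] for d in all_docs])
```

Because the tag lives in metadata, it survives chunking and can be re-attached when retrieved chunks are formatted for the prompt.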

## Document Processing

- **RecursiveCharacterTextSplitter**  
  All loaded documents are segmented into smaller, semantically meaningful chunks to improve retrieval accuracy and embedding quality.
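
To illustrate why chunk overlap matters, here is a simplified fixed-size chunker. It is not the recursive algorithm used by `RecursiveCharacterTextSplitter`, which also prefers to break on separators such as paragraph and line boundaries. The function name `chunk_text` is hypothetical, and the defaults mirror the `chunk_size=600` and `chunk_overlap=50` values used later in the notebook.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 600, chunk_overlap: int = 50) -> List[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 1500 characters of cycling letters, so overlapping regions are easy to compare.
text = "".join(chr(97 + i % 26) for i in range(1500))
chunks = chunk_text(text)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which tends to help retrieval.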

## Vector Store and Retrieval

- **Embeddings**  
  HuggingFaceEmbeddings using the `BAAI/bge-small-en-v1.5` model are used to generate dense vector representations for each document chunk.

- **Vector Store**  
  All embeddings are stored in a FAISS vector database.

- **Retriever**  
  A retriever queries the FAISS index to return the top `k` most relevant document chunks (for example, `k = 3`) for a given user question.
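
Conceptually, retrieval ranks chunk vectors by similarity to the query vector and keeps the best `k`. The sketch below uses brute-force cosine similarity over tiny hand-written vectors. It is a stand-in for the idea only, not FAISS's API; the metric (cosine), the vectors, and the `top_k` helper are assumptions chosen for readability.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], index: List[Tuple[str, Sequence[float]]], k: int = 3) -> List[str]:
    """Return the texts of the k entries most similar to the query."""
    scored = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]


# Made-up 3-dimensional "embeddings" for illustration.
index = [
    ("chunk about tracing", [1.0, 0.0, 0.0]),
    ("chunk about datasets", [0.0, 1.0, 0.0]),
    ("chunk about tracing and datasets", [0.7, 0.7, 0.0]),
    ("unrelated chunk", [0.0, 0.0, 1.0]),
]
results = top_k([1.0, 0.1, 0.0], index, k=2)
print(results)
```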

## Document Formatting

- **format_docs(docs: List[Document])**  
  A helper function that formats retrieved document chunks into a structured text block, preserving inline source tags such as `[PDF]`, `[Web]`, `[Wikipedia]`, and `[Arxiv]`.

## Language Model

- **LLM**  
  `ChatGroq` with the model `llama-3.1-8b-instant` serves as the primary language model. It is responsible for routing decisions, intermediate reasoning, and final answer generation.

## Tools

The system exposes tool functions using decorators, enabling the agent to dynamically invoke them during execution.

- **internal_knowledge_base**  
  Retrieves and formats relevant content from the FAISS vector store.

- **wikipedia_search**  
  Queries Wikipedia using the Wikipedia loader and returns formatted results.

- **arxiv_search**  
  Queries Arxiv using the Arxiv loader and returns formatted academic content.

## Orchestration with LangGraph

LangGraph manages the entire workflow using a state-machine-based execution model.

# Step-by-Step Pipeline Flow

## 1. Initialization and State Management

The workflow begins when `app.invoke()` is called with an initial state containing the user’s question and an empty message history.

- **AgentState** is defined as a TypedDict that maintains:
  - The user question
  - The conversation history
  - Intermediate tool outputs (`internal_rag_result`, `external_wiki_result`, `external_arxiv_result`)
  - The final synthesized answer (`final_answer_text`)
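
In the notebook, `AgentState` annotates `messages` with LangGraph's `add_messages` reducer, so message lists returned by nodes are appended rather than overwritten, while the other fields are simply replaced. The sketch below imitates that merge behaviour in plain Python. `AgentStateSketch`, `make_initial_state`, and `apply_update` are hypothetical names, and messages are plain strings here rather than message objects.

```python
from typing import Any, Dict, List, TypedDict


class AgentStateSketch(TypedDict, total=False):
    # Simplified mirror of AgentState; messages are plain strings here.
    question: str
    messages: List[str]
    internal_rag_result: str
    external_wiki_result: str
    external_arxiv_result: str
    final_answer_text: str


def make_initial_state(question: str) -> AgentStateSketch:
    """Initial state: the question plus an empty message history."""
    return {"question": question, "messages": []}


def apply_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a node's partial update: append messages, overwrite other keys."""
    merged = dict(state)
    for key, value in update.items():
        if key == "messages":
            merged["messages"] = list(state.get("messages", [])) + list(value)
        else:
            merged[key] = value
    return merged


state = make_initial_state("What is LangSmith?")
state = apply_update(state, {
    "messages": ["Internal RAG search completed."],
    "internal_rag_result": "stub answer (Source: [Web])",
})
print(state["messages"], sorted(state))
```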

## 2. Entry Point: Internal Knowledge Base Node

The graph starts at the internal knowledge base node.

- The `rag_node` function is executed.
- It calls `run_internal_rag` with the user question.
- The retriever queries the FAISS vector store, which is built from the PDF, the web documentation, and the Wikipedia and Arxiv documents loaded at startup.
- Retrieved chunks are combined with the question and passed to a RAG prompt.
- The language model generates an initial answer.
- The state is updated with a status message and the result is stored as `internal_rag_result`.

## 3. Routing Decision

After the internal RAG step, the workflow reaches a conditional routing function.

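- The `should_continue` function evaluates the sufficiency of `internal_rag_result`.
- A routing prompt is sent to the language model, which must reply with exactly one of the following keywords:
  - `synthesize`
  - `wikipedia`
  - `arxiv`
  - `math_calculator`

### Decision Logic

- If the internal result sufficiently answers the question, the router replies `synthesize` and the graph moves to the final answer node.
- If additional general knowledge is required, it replies `wikipedia` and the graph moves to the Wikipedia node.
- If academic or research-oriented information is needed, it replies `arxiv` and the graph moves to the Arxiv node.
- If a mathematical expression needs to be evaluated, it replies `math_calculator`.
- If the reply matches none of these keywords, `should_continue` falls back to Wikipedia when the internal result reports insufficient information, and otherwise to synthesis.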

The routing decision determines the next node in the graph.
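
A minimal sketch of how a free-text router reply might be mapped onto graph node names, including a fallback for unexpected replies. The node labels and the insufficiency phrases follow those used by `should_continue` later in the notebook. The helper name `route_decision` and the `INSUFFICIENT_MARKERS` tuple are illustrative, and the final fallback to synthesis is inferred from that function's log message.

```python
INSUFFICIENT_MARKERS = (
    "i don't have enough information",
    "no results found",
    "cannot answer the question",
)


def route_decision(decision: str, internal_rag_result: str) -> str:
    """Map the router LLM's raw reply to the name of the next graph node."""
    decision = decision.lower().strip()
    if "synthesize" in decision:
        return "Final Answer"
    if "wikipedia" in decision:
        return "Wikipedia"
    if "arxiv" in decision:
        return "Arxiv"
    if "math_calculator" in decision:
        return "Math_Calculator"
    # Unexpected reply: search Wikipedia only if the internal answer looks insufficient.
    lowered = internal_rag_result.lower()
    if any(marker in lowered for marker in INSUFFICIENT_MARKERS):
        return "Wikipedia"
    return "Final Answer"


print(route_decision("  Arxiv\n", "partial answer"))
print(route_decision("maybe?", "No results found."))
print(route_decision("maybe?", "stub answer (Source: [Web])"))
```

Substring matching tolerates replies such as "I would choose arxiv.", at the cost of picking the first keyword checked when a reply mentions several.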

## 4. External Tool Invocation

Depending on the routing decision, the workflow may invoke an external tool.

### Wikipedia Path

- The workflow transitions to the Wikipedia node.
- The `wikipedia_search` tool is invoked.
- Wikipedia content is loaded, formatted, and stored in `external_wiki_result`.
- The state is updated with a completion message.

### Arxiv Path

- The workflow transitions to the Arxiv node.
- The `arxiv_search` tool is invoked.
- Academic papers are retrieved and formatted.
- The result is stored in `external_arxiv_result`.
- The state is updated with a completion message.

## 5. Final Answer Synthesis

Regardless of whether external tools were used, the workflow proceeds to the final answer node.

- The `final_answer_node` combines:
  - The original question
  - The internal RAG result
  - Any external Wikipedia or Arxiv results
- A final synthesis prompt instructs the language model to:
  - Produce a comprehensive and coherent answer
  - Clearly explain the reasoning
  - Include inline source citations such as `[PDF]`, `[Web]`, `[Wikipedia]`, and `[Arxiv]`
- The final response is stored in `final_answer_text`.
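
The input assembly for the synthesis step can be sketched as follows. Steps that did not run contribute a placeholder string, so the prompt template can always be filled. The template here is a shortened, illustrative version of the notebook's synthesis prompt, and `build_synthesis_prompt` is a hypothetical helper name.

```python
from typing import Dict

SYNTHESIS_TEMPLATE = (
    "Original Question: {question}\n"
    "Internal RAG Result: {internal_rag_result}\n"
    "Wikipedia Result: {external_wiki_result}\n"
    "Arxiv Result: {external_arxiv_result}"
)

DEFAULTS = {
    "internal_rag_result": "No results from internal RAG.",
    "external_wiki_result": "No results from Wikipedia.",
    "external_arxiv_result": "No results from Arxiv.",
}


def build_synthesis_prompt(state: Dict[str, str]) -> str:
    """Fill the template, substituting a placeholder for any step that did not run."""
    inputs = {"question": state["question"]}
    for key, fallback in DEFAULTS.items():
        inputs[key] = state.get(key, fallback)
    return SYNTHESIS_TEMPLATE.format(**inputs)


prompt = build_synthesis_prompt({
    "question": "What is LangChain?",
    "internal_rag_result": "Stub internal answer (Source: [Web])",
})
print(prompt)
```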

## 6. End of Workflow

After the final answer is generated, the graph transitions to the end state, completing the execution.
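
The whole flow can be pictured as a small state machine. The sketch below is plain Python with stub nodes, not LangGraph's API. It only shows the control flow described above: one entry node, one conditional edge, fixed edges into the final node, then termination. All node bodies and names are placeholders.

```python
from typing import Callable, Dict

State = Dict[str, str]


def rag_node(state: State) -> State:
    # Stub: pretend the internal knowledge base could not answer.
    return {"internal_rag_result": "I don't have enough information."}


def wiki_node(state: State) -> State:
    return {"external_wiki_result": "Stub Wikipedia text (Source: [Wikipedia])"}


def final_node(state: State) -> State:
    parts = [state.get("internal_rag_result", ""), state.get("external_wiki_result", "")]
    return {"final_answer_text": " | ".join(p for p in parts if p)}


def router(state: State) -> str:
    # Stub router: go external only when the internal answer admits insufficiency.
    insufficient = "don't have enough information" in state["internal_rag_result"]
    return "wikipedia" if insufficient else "final"


NODES: Dict[str, Callable[[State], State]] = {
    "rag": rag_node,
    "wikipedia": wiki_node,
    "final": final_node,
}
# Fixed edges; the edge out of "rag" is conditional and handled by router().
EDGES = {"wikipedia": "final", "final": "END"}


def run(question: str) -> State:
    state: State = {"question": question}
    current, visited = "rag", []
    while current != "END":
        visited.append(current)
        state.update(NODES[current](state))  # each node returns a partial update
        current = router(state) if current == "rag" else EDGES[current]
    state["visited"] = " -> ".join(visited)
    return state


result = run("What is LangChain?")
print(result["visited"])
```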

# Output Clarity and Structure

Throughout execution, structured status messages provide clear visibility into which node is running and what decisions are being made. This improves transparency, traceability, and debuggability.

The final output is intended to be a well-structured, comprehensive answer that cites the sources used and explains how the information was selected and combined. This modular, state-driven architecture lets the RAG system adjust its retrieval strategy to the query and the information available.


In [1]:
pip install -q --upgrade langchain langchain-core langchain-community

In [2]:
!pip install -q uvicorn langserve

In [3]:
import langchain
print(langchain.__version__)


1.2.6


In [4]:
pip install -q fastapi


In [5]:
pip install --q langchain-groq

In [6]:
pip install -q pymupdf

In [7]:
pip install -q streamlit

## RAG dependencies

In [9]:
pip install -q pypdf arxiv wikipedia faiss-cpu sentence-transformers


In [8]:
# Google Colab-compatible environment setup with sanity checks

import os
from google.colab import userdata
from google.colab.userdata import SecretNotFoundError # Import SecretNotFoundError

# Fetch secrets from Colab userdata
LANGCHAIN_API_KEY = userdata.get("LANGCHAIN_API_KEY")
try:
    LANGCHAIN_PROJECT = userdata.get("LANGCHAIN_PROJECT")
except SecretNotFoundError:
    print("Warning: LANGCHAIN_PROJECT secret not found in Colab userdata.")
    print("Please add 'LANGCHAIN_PROJECT' to your Colab secrets if you intend to use Langsmith project tracking.")
    LANGCHAIN_PROJECT = None # Set to None if not found

# Set environment variables
if LANGCHAIN_API_KEY:
    os.environ["LANGCHAIN_API_KEY"] = LANGCHAIN_API_KEY

os.environ["LANGCHAIN_TRACING_V2"] = "true"

if LANGCHAIN_PROJECT:
    os.environ["LANGCHAIN_PROJECT"] = LANGCHAIN_PROJECT

# -------- Sanity Checks --------
def sanity_check():
    checks = {
        "LANGCHAIN_API_KEY": os.environ.get("LANGCHAIN_API_KEY"),
        "LANGCHAIN_TRACING_V2": os.environ.get("LANGCHAIN_TRACING_V2"),
        "LANGCHAIN_PROJECT": os.environ.get("LANGCHAIN_PROJECT"), # Check if it's set in env
    }

    print("\n--- Sanity Checks ---")
    for key, value in checks.items():
        if value:
            print(f"[OK] {key} is set")
        else:
            print(f"[MISSING] {key} is NOT set")

sanity_check()

Please add 'LANGCHAIN_PROJECT' to your Colab secrets if you intend to use Langsmith project tracking.

--- Sanity Checks ---
[OK] LANGCHAIN_API_KEY is set
[OK] LANGCHAIN_TRACING_V2 is set
[MISSING] LANGCHAIN_PROJECT is NOT set


# **All models available on Groq**

In [10]:
import requests
import os
import json
from google.colab import userdata

# Ensure GROQ_API_KEY is fetched directly from Colab secrets or environment
api_key = userdata.get("GROQ_API_KEY")

# If the API key is still not found, raise an error or inform the user
if not api_key:
    raise ValueError("GROQ_API_KEY not found in Colab secrets. Please ensure it is added.")

url = "https://api.groq.com/openai/v1/models"

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

response = requests.get(url, headers=headers)
response.raise_for_status() # This will raise an HTTPError for bad responses (4xx or 5xx)

print(json.dumps(response.json(), indent=2))


{
  "object": "list",
  "data": [
    {
      "id": "canopylabs/orpheus-arabic-saudi",
      "object": "model",
      "created": 1765926439,
      "owned_by": "Canopy Labs",
      "active": true,
      "context_window": 4000,
      "public_apps": null,
      "max_completion_tokens": 50000
    },
    {
      "id": "llama-3.3-70b-versatile",
      "object": "model",
      "created": 1733447754,
      "owned_by": "Meta",
      "active": true,
      "context_window": 131072,
      "public_apps": null,
      "max_completion_tokens": 32768
    },
    {
      "id": "meta-llama/llama-prompt-guard-2-22m",
      "object": "model",
      "created": 1748632101,
      "owned_by": "Meta",
      "active": true,
      "context_window": 512,
      "public_apps": null,
      "max_completion_tokens": 512
    },
    {
      "id": "meta-llama/llama-guard-4-12b",
      "object": "model",
      "created": 1746743847,
      "owned_by": "Meta",
      "active": true,
      "context_window": 131072,
      "publi

# Model Selection Guide (Purpose-Based)

This guide maps each available model to its best use case so you can quickly choose the right one.
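
Part of this selection can also be done programmatically from the `/models` response. The sketch below filters a few entries copied from the output shown earlier (fields trimmed) by context window. `models_with_context` is a hypothetical helper. The context window alone says nothing about a model's purpose, so guard models can still appear in the result.

```python
from typing import Any, Dict, List

# Entries copied from the /models response shown earlier, with fields trimmed.
models_response: Dict[str, Any] = {
    "object": "list",
    "data": [
        {"id": "canopylabs/orpheus-arabic-saudi", "active": True, "context_window": 4000},
        {"id": "llama-3.3-70b-versatile", "active": True, "context_window": 131072},
        {"id": "meta-llama/llama-prompt-guard-2-22m", "active": True, "context_window": 512},
        {"id": "meta-llama/llama-guard-4-12b", "active": True, "context_window": 131072},
    ],
}


def models_with_context(response: Dict[str, Any], min_context: int) -> List[str]:
    """Return ids of active models whose context window is at least min_context tokens."""
    return sorted(
        m["id"]
        for m in response.get("data", [])
        if m.get("active") and m.get("context_window", 0) >= min_context
    )


long_context = models_with_context(models_response, 100_000)
print(long_context)
```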

---

## General Natural Language Generation / Chat

Suitable for chatbots, summaries, reasoning, coding help, and general text generation.

| Model | Notes | Best For |
|-----|-----|-----|
| **llama-3.3-70b-versatile** | Large, high-quality | Deep reasoning, complex tasks, long contexts |
| **llama-3.1-8b-instant** | Small, very fast | General chat, Q&A, lightweight apps |
| **openai/gpt-oss-20b** | Open-source GPT-style | Strong general text generation |
| **openai/gpt-oss-120b** | Very large OSS model | Highest-quality OSS reasoning & generation |

---

## Lightweight / Fast / Cost-Efficient

Optimized for speed and lower resource usage.

| Model | Notes | Best For |
|-----|-----|-----|
| **groq/compound-mini** | Lightweight | Fast throughput, low cost |
| **groq/compound** | Balanced | Speed + quality |
| **allam-2-7b** | 7B model | Very lightweight text generation |
| **moonshotai/kimi-k2-instruct** | Instruction-tuned | Fast assistant-style tasks |

---

## Long-Context Processing

Designed for very large documents and multi-file inputs.

| Model | Context Size | Best For |
|-----|-----|-----|
| **moonshotai/kimi-k2-instruct-0905** | 262k tokens | Books, long documents, multi-doc reasoning |
| **llama-3.1 / 3.3 variants** | 131k tokens | Long-context chat and analysis |

---

## Speech-to-Text (Not Text Generation)

| Model | Best For |
|-----|-----|
| **whisper-large-v3** | High-quality transcription |
| **whisper-large-v3-turbo** | Faster speech-to-text |

---

## Safety / Guard Models (Not for Generation)

Used only for moderation, safety checks, or filtering.

| Model | Purpose |
|-----|-----|
| **meta-llama/llama-guard-4-12b** | Safety classification |
| **meta-llama/llama-prompt-guard-2-22m / 86m** | Prompt risk detection |

---

## Language / Region-Specific

| Model | Best For |
|-----|-----|
| **canopylabs/orpheus-v1-english** | English text-to-speech |
| **canopylabs/orpheus-arabic-saudi** | Arabic (Saudi dialect) text-to-speech |
| **allam-2-7b** | Arabic-centric lightweight tasks |

---

## Quick Recommendations

- **Best overall (small + free):** `llama-3.1-8b-instant`
- **Best quality:** `llama-3.3-70b-versatile`
- **Fastest / cheapest:** `groq/compound-mini`
- **Very long documents:** `moonshotai/kimi-k2-instruct-0905`
- **Speech recognition:** `whisper-large-v3`

---


In [11]:
from langchain_groq import ChatGroq
from google.colab import userdata
import os

# Set Groq API key (must exist in Colab secrets)
os.environ["GROQ_API_KEY"] = userdata.get("GROQ_API_KEY")

# Initialize Groq LLM
llm = ChatGroq(
    model="llama-3.1-8b-instant",
    temperature=0
)

print(llm)


profile={'max_input_tokens': 131072, 'max_output_tokens': 8192, 'image_inputs': False, 'audio_inputs': False, 'video_inputs': False, 'image_outputs': False, 'audio_outputs': False, 'video_outputs': False, 'reasoning_output': False, 'tool_calling': True} client=<groq.resources.chat.completions.Completions object at 0x7eb88bf985f0> async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x7eb88b6724e0> model_name='llama-3.1-8b-instant' temperature=1e-08 model_kwargs={} groq_api_key=SecretStr('**********')


## **Sanity check: verify the Groq LLM is working**

In [12]:
from langchain_core.messages import HumanMessage

response = llm.invoke([HumanMessage(content="Reply with the single word: OK")])

print("LLM response:", response.content)


LLM response: OK


In [13]:
## Send an input and get a response from the LLM

result=llm.invoke("What is generative AI?")

In [14]:
result

AIMessage(content='Generative AI refers to a subset of artificial intelligence (AI) that focuses on generating new, original content, such as text, images, music, or videos, based on patterns and structures learned from existing data. This type of AI uses algorithms and machine learning techniques to create new, unique outputs that are often indistinguishable from those created by humans.\n\nGenerative AI models are trained on large datasets, which allows them to learn the underlying patterns, styles, and structures of the data. Once trained, these models can generate new content that is similar in style and quality to the original data.\n\nSome common applications of generative AI include:\n\n1. **Text generation**: Generating text, such as articles, stories, or chatbot responses, that are coherent and engaging.\n2. **Image generation**: Creating new images, such as artwork, landscapes, or product designs, that are realistic and visually appealing.\n3. **Music generation**: Composing 

## **Sanity Check 2**

In [15]:
from langchain_core.messages import HumanMessage
llm.invoke([HumanMessage(content="Hi , My name is Prithu and I am a Learner AI Engineer")])

AIMessage(content="Nice to meet you, Prithu. As a learner AI Engineer, you're likely exploring the exciting world of artificial intelligence and machine learning. What specific areas of AI are you interested in or currently learning about? Are you working on any projects or looking for resources to help you improve your skills? I'm here to help and provide any guidance I can.", additional_kwargs={}, response_metadata={'token_usage': {'completion_tokens': 73, 'prompt_tokens': 51, 'total_tokens': 124, 'completion_time': 0.127391292, 'completion_tokens_details': None, 'prompt_time': 0.004806736, 'prompt_tokens_details': None, 'queue_time': 0.078197526, 'total_time': 0.132198028}, 'model_name': 'llama-3.1-8b-instant', 'system_fingerprint': 'fp_6b5c123dd9', 'service_tier': 'on_demand', 'finish_reason': 'stop', 'logprobs': None, 'model_provider': 'groq'}, id='lc_run--019bdeb5-575b-7811-9e7f-73bb93ba4efb-0', tool_calls=[], invalid_tool_calls=[], usage_metadata={'input_tokens': 51, 'output_tok

# **Langchain_RAG_AGENT_ConversationalQA_with_Memory_History**

This section uses LangChain's RAG components: loaders for PDF, Arxiv, and Wikipedia, the FAISS vector store, and Sentence Transformers for embeddings.

## Load and Process Documents


In [16]:
## Arxiv--Research
## Tools creation
from langchain_community.tools import ArxivQueryRun,WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper,ArxivAPIWrapper

In [17]:
## Use the built-in Wikipedia tool
api_wrapper_wiki=WikipediaAPIWrapper(top_k_results=1,doc_content_chars_max=250)
wiki=WikipediaQueryRun(api_wrapper=api_wrapper_wiki)
wiki.name

'wikipedia'

In [18]:
api_wrapper_arxiv=ArxivAPIWrapper(top_k_results=1,doc_content_chars_max=250)
arxiv=ArxivQueryRun(api_wrapper=api_wrapper_arxiv)
print(arxiv.name)

arxiv


In [19]:
tools=[wiki,arxiv]

## Custom tools[RAG Tool]

In [20]:
## Custom tools[RAG Tool]
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter



In [21]:
pip install -q --upgrade langchain-experimental

In [21]:
# Install the missing library
!pip install -q langchain-huggingface

In [22]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# 1. Load web docs
loader = WebBaseLoader("https://docs.smith.langchain.com/")
docs = loader.load()

# 2. Chunk documents
documents = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=50
).split_documents(docs)

# 3. Hugging Face embeddings (recommended lightweight model)
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5"
)

# 4. Vector store
vectordb = FAISS.from_documents(documents, embeddings)

# 5. Retriever
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

retriever

VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x7eb7f04ccfe0>, search_kwargs={'k': 3})

In [23]:
# Step 0: Fix environment for Arxiv loader
# !pip install "PyMuPDF<1.22" langchain langchain_community langchain_groq faiss-cpu transformers

from typing import List, TypedDict, Annotated, Literal
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document
from langchain_core.runnables import RunnablePassthrough
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from IPython.display import Image # Added for graph visualization

from langchain_groq import ChatGroq
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WikipediaLoader, ArxivLoader, WebBaseLoader
from langchain_core.tools import tool

## **LLM Setup and Load Document**

In [27]:
# -------------------------------
# 1. LLM Setup
# -------------------------------
llm = ChatGroq(
    model="llama-3.1-8b-instant",
    temperature=0,
    max_tokens=1000  # Increased max_tokens for a more comprehensive summary
)

# -------------------------------
# 2. Load Sources
# -------------------------------
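def load_pdf(file_path: str) -> List[Document]:
    try:
        # Real PDFs start with the "%PDF" magic bytes; parse those with PyPDFLoader (needs pypdf).
        with open(file_path, "rb") as f:
            is_binary_pdf = f.read(4) == b"%PDF"
        if is_binary_pdf:
            from langchain_community.document_loaders import PyPDFLoader
            docs = PyPDFLoader(file_path).load()
            for d in docs:
                d.metadata["source_tag"] = "[PDF]"
            return docs
        # Otherwise treat the file as plain text, ignoring undecodable bytes.
        with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
            content = f.read()
        return [Document(page_content=content, metadata={"source": file_path, "source_tag": "[PDF]"})]
    except Exception as e:
        print(f"Error loading PDF {file_path}: {e}")
        return []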

def load_web_docs(url: str) -> List[Document]:
    try:
        loader = WebBaseLoader(url)
        docs = loader.load()
        for d in docs:
            d.metadata["source_tag"] = "[Web]"
        return docs
    except Exception as e:
        print(f"Error loading web docs from {url}: {e}")
        return []

def load_wikipedia(query: str, max_docs: int = 2) -> List[Document]:
    try:
        loader = WikipediaLoader(query=query, load_max_docs=max_docs)
        docs = loader.load()
        for d in docs:
            d.metadata["source_tag"] = "[Wikipedia]"
        return docs
    except Exception as e:
        print(f"Wikipedia load error: {e}")
        return []

def load_arxiv(query: str, max_docs: int = 2) -> List[Document]:
    try:
        loader = ArxivLoader(query=query, load_max_docs=max_docs)
        docs = loader.load()
        for d in docs:
            d.metadata["source_tag"] = "[Arxiv]"
        return docs
    except Exception as e:
        print(f"Arxiv load error: {e}")
        return []

pdf_docs = load_pdf("/content/LLM.pdf")
web_docs = load_web_docs("https://docs.smith.langchain.com/")
wiki_docs = load_wikipedia("LangChain")
arxiv_docs = load_arxiv("LangChain")
all_docs = pdf_docs + web_docs + wiki_docs + arxiv_docs

## **Chunk Documents**

In [28]:
# -------------------------------
# 3. Chunk Documents
# -------------------------------
text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=50)
chunks = text_splitter.split_documents(all_docs)

# -------------------------------
# 4. Embeddings & Vector Store
# -------------------------------
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    show_progress=True # Added to show progress during embedding generation
)
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

In [29]:
# -------------------------------
# 5. Format Documents
# -------------------------------
def format_docs(docs: List[Document]) -> str:
    if not docs:
        return "No results found."
    formatted_content = []
    for d in docs:
        source_tag = d.metadata.get('source_tag','[Unknown]')
        # Use a consistent structure for each document and its source
        formatted_content.append(f"Content: {d.page_content[:500]} (Source: {source_tag})")
    return "\n\n".join(formatted_content)

## **Tools and RAG Prompt**

In [30]:
# -------------------------------
# 6. Tools
# -------------------------------
@tool(description="Searches the internal knowledge base (vector store) for relevant documents. Returns content with inline citations from PDF, Web, Wikipedia, Arxiv.")
def internal_knowledge_base(query: str) -> str:
    print(f"\n Calling Internal Knowledge Base for: {query}")
    docs = retriever.invoke(query)
    result = format_docs(docs)
    print(f" Internal KB result: {result[:100]}...")
    return result

@tool(description="Searches Wikipedia for the given query and returns summarized text with source tags.")
def wikipedia_search(query: str) -> str:
    print(f"\n Calling Wikipedia Search for: {query}")
    docs = load_wikipedia(query)
    result = format_docs(docs)
    print(f" Wikipedia result: {result[:100]}...")
    return result

@tool(description="Searches Arxiv for papers matching the query and returns summarized text with source tags.")
def arxiv_search(query: str) -> str:
    print(f"\n Calling Arxiv Search for: {query}")
    docs = load_arxiv(query)
    result = format_docs(docs)
    print(f" Arxiv result: {result[:100]}...")
    return result

# -------------------------------
# 7. RAG Prompt
# -------------------------------
rag_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question based ONLY on the following context. Cite sources inline using their tags (e.g., [PDF], [Web], [Wikipedia], [Arxiv]). If the context is insufficient, state that you cannot answer from the provided information."),
    ("human", "Context:\n{context}\n\nQuestion:\n{question}")
])



## **RAG Chain**

In [31]:
# -------------------------------
# 8. RAG Chain
# -------------------------------
def run_internal_rag(question: str) -> str:
    # Directly invoke the retriever and format docs within this function
    # to ensure the context passed to the LLM is correctly formatted.
    retrieved_docs = retriever.invoke(question)
    context = format_docs(retrieved_docs)

    # Check if context explicitly indicates insufficiency
    if "No results found." in context:
        return "Based on the internal knowledge base, I don't have enough information to answer this question."

    chain = rag_prompt | llm | StrOutputParser()
    return chain.invoke({"context": context, "question": question})

# -------------------------------
# 9. Agent State
# -------------------------------
class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], add_messages]
    question: str
    internal_rag_result: str
    external_wiki_result: str
    external_arxiv_result: str
    final_answer_text: str
    math_result: str  # Read by final_answer_node; stays unset unless a math step writes it

# -------------------------------
# 10. Nodes
# -------------------------------
def rag_node(state: AgentState):
    print("\n Entering Internal KB Node")
    answer = run_internal_rag(state["question"])
    return {
        "messages": [AIMessage(content="Internal RAG search completed.")],
        "internal_rag_result": answer
    }

def wiki_node(state: AgentState):
    print("\n Entering Wikipedia Node")
    # Invoke the tool directly, as it handles its own context loading/formatting
    answer = wikipedia_search.invoke({"query": state["question"]})
    return {
        "messages": [AIMessage(content="Wikipedia search completed.")],
        "external_wiki_result": answer
    }

def arxiv_node(state: AgentState):
    print("\n Entering Arxiv Node")
    # Invoke the tool directly, as it handles its own context loading/formatting
    answer = arxiv_search.invoke({"query": state["question"]})
    return {
        "messages": [AIMessage(content="Arxiv search completed.")],
        "external_arxiv_result": answer
    }



## **Final Answer Synthesis**

In [32]:
# -------------------------------
# 11. Final Answer Synthesis
# -------------------------------
final_answer_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an intelligent assistant. Synthesize the following information to provide a comprehensive answer to the user's question.
               Clearly cite sources like (Source: [PDF]), (Source: [Web]), (Source: [Wikipedia]), or (Source: [Arxiv]) when information is used.
               If a source isn't explicitly mentioned, state it as 'General knowledge' or 'Synthesized from multiple sources'.
               If no relevant information is found across all sources, state that you cannot answer the question.
               Explain your reasoning for using specific information.

               Original Question: {question}
               Internal RAG Result: {internal_rag_result}
               Wikipedia Result: {external_wiki_result}
               Arxiv Result: {external_arxiv_result}
               Math Result: {math_result}"""),
    ("human", "Generate final comprehensive answer with proper citations and explain your reasoning.")
])

def final_answer_node(state: AgentState):
    print("\n Entering Final Answer Node")
    context = {
        "question": state["question"],
        "internal_rag_result": state.get("internal_rag_result", "No results from internal RAG."),
        "external_wiki_result": state.get("external_wiki_result", "No results from Wikipedia."),
        "external_arxiv_result": state.get("external_arxiv_result", "No results from Arxiv."),
        "math_result": state.get("math_result", "No math calculation performed.")
    }
    final_answer = (final_answer_prompt | llm | StrOutputParser()).invoke(context)
    print(f" Final Answer generated: {final_answer[:100]}...")
    return {
        "messages": [AIMessage(content=final_answer)],
        "final_answer_text": final_answer
    }

# -------------------------------
# 12. Router LLM
# -------------------------------
llm_router_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a routing assistant. Based on the user's question and the initial internal RAG result, decide the next step.

               Respond with ONLY ONE of these exact words (lowercase):
               - synthesize (if internal RAG has sufficient information and no further search is needed)
               - wikipedia (if you need general encyclopedia information to supplement or clarify)
               - arxiv (if you need academic research papers to supplement or clarify)
               - math_calculator (if a mathematical expression needs to be evaluated)

               Question: {question}
               Internal RAG Result: {internal_rag_result}"""),
    ("human", "What should be the next step?")
])

llm_router = llm_router_prompt | llm | StrOutputParser()

def should_continue(state: AgentState) -> Literal["Final Answer", "Wikipedia", "Arxiv", "Math_Calculator"]:
    question = state["question"]
    internal_rag_result = state.get("internal_rag_result","\nI cannot answer the question based on the provided context.") # Default to 'cannot answer' if empty

    print(f"\n Deciding next step with Router LLM for question: {question[:50]}...")
    print(f"  Internal RAG snippet: {internal_rag_result[:100]}...")

    decision = llm_router.invoke({
        "question": question,
        "internal_rag_result": internal_rag_result
    }).lower().strip()

    print(f"  Router decision: {decision}")

    if "synthesize" in decision:
        return "Final Answer"
    elif "wikipedia" in decision:
        return "Wikipedia"
    elif "arxiv" in decision:
        return "Arxiv"
    elif "math_calculator" in decision:
        return "Math_Calculator"
    else:
        # Fallback if LLM gives unexpected output (e.g., if it hallucinates a tool name)
        print("  Router produced unexpected decision, defaulting to Wikipedia if RAG insufficient, else synthesize.")
        if "i don't have enough information" in internal_rag_result.lower() or "no results found" in internal_rag_result.lower() or "cannot answer the question" in internal_rag_result.lower():
            return "Wikipedia"
        return "Final Answer"

# -------------------------------
# 9. Agent State (UPDATED)
# -------------------------------
class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], add_messages]
    question: str
    internal_rag_result: str
    external_wiki_result: str
    external_arxiv_result: str
    math_result: str # Added for math calculations
    final_answer_text: str

# -------------------------------
# 10. Nodes (UPDATED)
# -------------------------------
# existing nodes (rag_node, wiki_node, arxiv_node) remain the same

def math_node(state: AgentState):
    print("\n Entering Math Node")
    # The math_calculator tool expects an expression string. The raw question is passed through
    # unchanged, so the result depends on the tool accepting that text as an expression.
    answer = math_calculator.invoke({"expression": state["question"]})
    print(f" Math result: {answer[:100]}...")
    return {
        "messages": [AIMessage(content="Math calculation completed.")],
        "math_result": answer
    }

# -------------------------------
# 13. Build LangGraph (UPDATED)
# -------------------------------
graph = StateGraph(AgentState)
graph.add_node("Internal KB", rag_node)
graph.add_node("Wikipedia", wiki_node)
graph.add_node("Arxiv", arxiv_node)
graph.add_node("Math_Calculator", math_node) # Added math node
graph.add_node("Final Answer", final_answer_node)

graph.set_entry_point("Internal KB")

graph.add_conditional_edges("Internal KB", should_continue, {
    "Final Answer": "Final Answer",
    "Wikipedia": "Wikipedia",
    "Arxiv": "Arxiv",
    "Math_Calculator": "Math_Calculator" # Added conditional edge for math
})

graph.add_edge("Wikipedia", "Final Answer")
graph.add_edge("Arxiv", "Final Answer")
graph.add_edge("Math_Calculator", "Final Answer") # Added edge from math to final answer
graph.add_edge("Final Answer", END)

app = graph.compile()

print(" LangGraph agent compiled successfully!")

# --- Display the graph visual ---
print("\n--- LangGraph Visual ---")
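# draw_png() needs the pygraphviz Python package, which builds against the Graphviz system libraries,
# e.g., !apt-get install -y graphviz libgraphviz-dev && pip install pygraphviz
# If you encounter issues, this step might require additional troubleshooting for your environment.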
try:
    display(Image(app.get_graph().draw_png()))
except Exception as e:
    print(f"Error displaying graph visual: {e}")
    print("Please ensure 'graphviz' is installed (e.g., !apt-get install -y graphviz) and its dependencies are met.")


# -------------------------------
# 14. Test Queries (already existing tests below)
# -------------------------------

 LangGraph agent compiled successfully!

--- LangGraph Visual ---
Error displaying graph visual: Install pygraphviz to draw graphs: `pip install pygraphviz`.
Please ensure 'graphviz' is installed (e.g., !apt-get install -y graphviz) and its dependencies are met.


In [33]:
# -------------------------------
# 11. Final Answer Synthesis
# -------------------------------
final_answer_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an intelligent assistant. Synthesize the following information to provide a comprehensive answer to the user's question.
               Clearly cite sources like (Source: [PDF]), (Source: [Web]), (Source: [Wikipedia]), or (Source: [Arxiv]) when information is used.
               If a source isn't explicitly mentioned, state it as 'General knowledge' or 'Synthesized from multiple sources'.
               If no relevant information is found across all sources, state that you cannot answer the question.
               Explain your reasoning for using specific information.

               Original Question: {question}
               Internal RAG Result: {internal_rag_result}
               Wikipedia Result: {external_wiki_result}
               Arxiv Result: {external_arxiv_result}"""),
    ("human", "Generate final comprehensive answer with proper citations and explain your reasoning.")
])

def final_answer_node(state: AgentState):
    print("\n Entering Final Answer Node")
    context = {
        "question": state["question"],
        "internal_rag_result": state.get("internal_rag_result", "No results from internal RAG."),
        "external_wiki_result": state.get("external_wiki_result", "No results from Wikipedia."),
        "external_arxiv_result": state.get("external_arxiv_result", "No results from Arxiv.")
    }
    final_answer = (final_answer_prompt | llm | StrOutputParser()).invoke(context)
    print(f" Final Answer generated: {final_answer[:100]}...")
    return {
        "messages": [AIMessage(content=final_answer)],
        "final_answer_text": final_answer
    }

# -------------------------------
# 12. Router LLM
# -------------------------------
llm_router_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a routing assistant. Based on the user's question and the initial internal RAG result, decide the next step.

               Respond with ONLY ONE of these exact words (lowercase):
               - synthesize (if internal RAG has sufficient information and no further search is needed)
               - wikipedia (if you need general encyclopedia information to supplement or clarify)
               - arxiv (if you need academic research papers to supplement or clarify)

               Question: {question}
               Internal RAG Result: {internal_rag_result}"""),
    ("human", "What should be the next step?")
])

llm_router = llm_router_prompt | llm | StrOutputParser()

def should_continue(state: AgentState) -> Literal["Final Answer", "Wikipedia", "Arxiv"]:
    question = state["question"]
    internal_rag_result = state.get("internal_rag_result","")

    print(f"\n Deciding next step with Router LLM for question: {question[:50]}...")
    print(f"  Internal RAG snippet: {internal_rag_result[:100]}...")

    decision = llm_router.invoke({
        "question": question,
        "internal_rag_result": internal_rag_result
    }).lower().strip()

    print(f"  Router decision: {decision}")

    if "synthesize" in decision:
        return "Final Answer"
    elif "wikipedia" in decision:
        return "Wikipedia"
    elif "arxiv" in decision:
        return "Arxiv"
    else:
        # Fallback if LLM gives unexpected output (e.g., if it hallucinates a tool name)
        print("  Router produced unexpected decision, defaulting to Wikipedia if RAG insufficient, else synthesize.")
        if "i don't have enough information" in internal_rag_result.lower() or "no results found" in internal_rag_result.lower():
            return "Wikipedia"
        return "Final Answer"

# -------------------------------
# 13. Build LangGraph (FIXED - removed color parameter)
# -------------------------------
graph = StateGraph(AgentState)
graph.add_node("Internal KB", rag_node)
graph.add_node("Wikipedia", wiki_node)
graph.add_node("Arxiv", arxiv_node)
graph.add_node("Final Answer", final_answer_node)

graph.set_entry_point("Internal KB")

graph.add_conditional_edges("Internal KB", should_continue, {
    "Final Answer": "Final Answer",
    "Wikipedia": "Wikipedia",
    "Arxiv": "Arxiv"
})

graph.add_edge("Wikipedia", "Final Answer")
graph.add_edge("Arxiv", "Final Answer")
graph.add_edge("Final Answer", END)

app = graph.compile()

print(" LangGraph agent compiled successfully!")

# --- Display the graph visual ---
print("\n--- LangGraph Visual ---")
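# draw_png() needs the pygraphviz Python package, which builds against the Graphviz system libraries,
# e.g., !apt-get install -y graphviz libgraphviz-dev && pip install pygraphviz
# If you encounter issues, this step might require additional troubleshooting for your environment.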
try:
    display(Image(app.get_graph().draw_png()))
except Exception as e:
    print(f"Error displaying graph visual: {e}")
    print("Please ensure 'graphviz' is installed (e.g., !apt-get install -y graphviz) and its dependencies are met.")


# -------------------------------
# 14. Test Queries
# -------------------------------
print("\n--- Test Query 1: Internal RAG should be sufficient ---")
query1 = "What are Large Language Models used for?"
initial_state1 = {"messages": [HumanMessage(content=query1)], "question": query1}
result1 = app.invoke(initial_state1)
print("Final Synthesized Answer:")
print(result1["final_answer_text"])

print("\n--- Test Query 2: Requires external search (e.g., Wikipedia) ---")
query2 = "Who developed LangChain and what is its primary use?"
initial_state2 = {"messages": [HumanMessage(content=query2)], "question": query2}
result2 = app.invoke(initial_state2)
print("Final Synthesized Answer:")
print(result2["final_answer_text"])

print("\n--- Test Query 3: Requires external search (e.g., Arxiv for technical details) ---")
query3 = "What is the transformer architecture and what are its key components?"
initial_state3 = {"messages": [HumanMessage(content=query3)], "question": query3}
result3 = app.invoke(initial_state3)
print("Final Synthesized Answer:")
print(result3["final_answer_text"])

print("\n--- Test Query 4: More complex, potentially needing multiple tools ---")
query4 = "Tell me about the recent advancements in Large Language Models."
initial_state4 = {"messages": [HumanMessage(content=query4)], "question": query4}
result4 = app.invoke(initial_state4)
print("Final Synthesized Answer:")
print(result4["final_answer_text"])


 LangGraph agent compiled successfully!

--- LangGraph Visual ---
Error displaying graph visual: Install pygraphviz to draw graphs: `pip install pygraphviz`.
Please ensure 'graphviz' is installed (e.g., !apt-get install -y graphviz) and its dependencies are met.

--- Test Query 1: Internal RAG should be sufficient ---

 Entering Internal KB Node




 Deciding next step with Router LLM for question: What are Large Language Models used for?...
  Internal RAG snippet: According to the provided context, Large Language Models (LLMs) are used in various sectors, includi...
  Router decision: synthesize

 Entering Final Answer Node
 Final Answer generated: Based on the provided information, Large Language Models (LLMs) are used in various sectors, includi...
Final Synthesized Answer:
Based on the provided information, Large Language Models (LLMs) are used in various sectors, including education, industry, and decision-making. 

According to the internal RAG result, LLMs are used in education [1], industry [2], and decision-making [3, 4]. This suggests that LLMs are versatile and can be applied in different fields to deliver precise and seamless interactions.

In education, LLMs can be used to create personalized learning experiences, provide real-time feedback, and assist teachers with grading and lesson planning [1]. This is supported 



 Deciding next step with Router LLM for question: Who developed LangChain and what is its primary us...
  Internal RAG snippet: According to the provided context, LangChain was developed by Harrison Chase [Wikipedia]. 

As for i...
  Router decision: wikipedia

 Entering Wikipedia Node

 Calling Wikipedia Search for: Who developed LangChain and what is its primary use?
 Wikipedia result: Content: Generation Z, often shortened to Gen Z and informally known as Zoomers, is the demographic ...

 Entering Final Answer Node
 Final Answer generated: Based on the provided information, I can answer the user's question about LangChain.

**Who develope...
Final Synthesized Answer:
Based on the provided information, I can answer the user's question about LangChain.

**Who developed LangChain?**
LangChain was developed by Harrison Chase (Source: Wikipedia).

**What is its primary use?**
LangChain is a software framework that helps facilitate the integration of large language models (LLMs) into app



 Deciding next step with Router LLM for question: What is the transformer architecture and what are ...
  Internal RAG snippet: Unfortunately, the provided context does not contain information about the transformer architecture....
  Router decision: wikipedia

 Entering Wikipedia Node

 Calling Wikipedia Search for: What is the transformer architecture and what are its key components?
 Wikipedia result: Content: In deep learning, the transformer is an artificial neural network architecture based on the...

 Entering Final Answer Node
 Final Answer generated: The transformer architecture is a type of neural network architecture primarily used for natural lan...
Final Synthesized Answer:
The transformer architecture is a type of neural network architecture primarily used for natural language processing (NLP) tasks, such as language translation, text summarization, and question answering [1]. It was introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017 [1].

The 



 Deciding next step with Router LLM for question: Tell me about the recent advancements in Large Lan...
  Internal RAG snippet: Based on the provided context, I can tell you that there have been recent advancements in Large Lang...
  Router decision: arxiv

 Entering Arxiv Node

 Calling Arxiv Search for: Tell me about the recent advancements in Large Language Models.
 Arxiv result: Content: Reinforcement Learning Meets Large Language Models: A Survey of
Advancements and Applicatio...

 Entering Final Answer Node
 Final Answer generated: Recent advancements in Large Language Models (LLMs) have been significant, with applications across ...
Final Synthesized Answer:
Recent advancements in Large Language Models (LLMs) have been significant, with applications across various domains. According to a preprint on arXiv titled "vision-language understanding with advanced large language models" (arXiv:2304.10592), there have been advancements in LLMs for vision-language understanding. However,

# **Text Summarizer RAG Pipeline – Workflow Overview**

The text summarizer RAG pipeline, as implemented in the provided code, follows these main steps:

### 1. Load Sources
- Documents from various sources (PDF, Web, Wikipedia, Arxiv) are loaded.
- For the summarizer specifically, text content is loaded from local PDF files:
  - `/content/LLM.pdf`
  - `/content/apjspeech.pdf`
- PDF loading is performed using **PyMuPDF (`fitz`)**.

### 2. Combine and Truncate Content
- Content from all loaded sources is combined into a single string.
- To manage LLM token limits, the combined content is truncated to a maximum length (e.g., **20,000 characters**).
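
The combine-and-truncate step can be sketched as a small pure function. This is a minimal illustration, not the notebook's exact code: the helper name `combine_and_truncate` and the placeholder texts are assumptions, and the cap is counted in characters, which only approximates a token budget.

```python
def combine_and_truncate(sections: dict[str, str], max_chars: int = 20_000) -> str:
    """Join non-empty sections under labelled headers, then cap the total length."""
    parts = [
        f"--- Content from {name} ---\n{text}\n\n"
        for name, text in sections.items()
        if text  # skip sources that produced no text
    ]
    combined = "".join(parts)
    # Character-based cap: a rough proxy for a token limit, not an exact one.
    return combined[:max_chars]


# Placeholder texts stand in for the extracted PDF content.
combined = combine_and_truncate(
    {"/content/LLM.pdf": "a" * 15_000, "/content/apjspeech.pdf": "b" * 15_000}
)
print(len(combined))
```

Because the cut is applied to the combined string, later sources are the first to lose content. Capping each source separately would be one way to balance them.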

### 3. LLM Setup
- A **ChatGroq** model (`llama-3.1-8b-instant`) is initialized.
- Key parameters include:
  - `temperature`
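  - `max_tokens` (set to **1000** in `summarizer_app.py` for more comprehensive summaries; the notebook cell reuses the existing `llm`, whose code comment notes `max_tokens=300`)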

### 4. Summarization Prompt
- A `ChatPromptTemplate` is defined to guide the summarization process.
- It includes:
  - A **system instruction** for expert-level summarization
  - A **human input template** that injects the combined document content into the prompt

### 5. Summarization Chain
- A `Runnable` chain is created by piping:
  - `summarization_prompt`
  - `llm`
  - `StrOutputParser`
- This chain:
  - Accepts document content as input
  - Formats it using the prompt
  - Sends it to the LLM for summarization
  - Parses the LLM output into a string
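
To make the piping idea concrete, the sketch below mimics how `a | b | c` can build a single callable pipeline. The `Step` class and the three placeholder stages are hypothetical and written only for illustration. They are not LangChain's `Runnable` implementation, and no model is called.

```python
class Step:
    """A tiny stand-in for a pipeable stage (hypothetical, not LangChain's Runnable)."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.fn(self.fn(value)))

    def invoke(self, value):
        return self.fn(value)


# Three placeholder stages mirroring prompt -> model -> parser.
format_prompt = Step(lambda d: f"Summarize:\n{d['document_content']}")
fake_llm = Step(lambda prompt: {"content": prompt.upper()})  # placeholder, no API call
parse_output = Step(lambda message: message["content"])

chain = format_prompt | fake_llm | parse_output
result = chain.invoke({"document_content": "hello"})
print(result)
```

The input to `invoke` is a dictionary keyed by the prompt variable, which matches how the summarization chain is called with `{"document_content": ...}`.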

### 6. Invoke and Display Summary
- The summarization chain is invoked with the truncated combined content.
- The generated summary is printed as the final output.


In [35]:
import fitz # PyMuPDF for handling PDF files
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Define a helper function to load text content from a PDF file using PyMuPDF
def load_pdf_content(file_path: str) -> str:
    """Loads text content from a PDF file using PyMuPDF."""
    text_content = ""
    try:
        doc = fitz.open(file_path)
        for page_num in range(doc.page_count):
            page = doc.load_page(page_num)
            text_content += page.get_text()
        doc.close()
        print(f"Successfully loaded content from {file_path}")
    except FileNotFoundError:
        print(f"Error: PDF file not found at {file_path}. Skipping this file.")
    except Exception as e:
        print(f"Error processing PDF {file_path}: {e}")
    return text_content

# Paths to the PDF files specified by the user
pdf_path_llm = "/content/LLM.pdf"
pdf_path_apjspeech = "/content/apjspeech.pdf"

# Load content from both PDFs
llm_content = load_pdf_content(pdf_path_llm)
apjspeech_content = load_pdf_content(pdf_path_apjspeech)

# Combine the content for summarization
combined_content = ""
if llm_content:
    combined_content += f"--- Content from {pdf_path_llm} ---\n{llm_content}\n\n"
if apjspeech_content:
    combined_content += f"--- Content from {pdf_path_apjspeech} ---\n{apjspeech_content}\n\n"

# Truncate combined_content to a manageable size (e.g., 20000 characters)
# to avoid exceeding model token limits.
MAX_CONTENT_LENGTH = 20000
if len(combined_content) > MAX_CONTENT_LENGTH:
    combined_content = combined_content[:MAX_CONTENT_LENGTH]
    print(f"Truncated combined_content to {MAX_CONTENT_LENGTH} characters.")

if not combined_content:
    print("No readable content found from the specified PDF files. Cannot summarize.")
else:
    # Define a summarization prompt
    summarization_prompt = ChatPromptTemplate.from_messages([
        ("system", "You are an expert summarizer. Summarize the following document(s) concisely and informatively. Focus on key concepts and main points."),
        ("human", "Document(s) to summarize:\n\n{document_content}\n\nProvide a comprehensive summary:")
    ])

    # Create a summarization chain using the existing LLM
    # Note: The 'llm' is configured with max_tokens=300, which might result in short summaries
    # for very long documents. Consider adjusting llm.max_tokens if a longer summary is desired.
    summarization_chain = summarization_prompt | llm | StrOutputParser()

    print("\n--- Generating Summary of Local Knowledge Base ---")
    summary = summarization_chain.invoke({"document_content": combined_content})

    print("\n--- Summary of PDF Documents ---")
    print(summary)


Successfully loaded content from /content/LLM.pdf
Successfully loaded content from /content/apjspeech.pdf
Truncated combined_content to 20000 characters.

--- Generating Summary of Local Knowledge Base ---

--- Summary of PDF Documents ---
**Comprehensive Overview of Large Language Models**

**Introduction**

Large Language Models (LLMs) have revolutionized natural language processing (NLP) tasks, demonstrating remarkable capabilities in text generation, understanding, and reasoning. This article provides a comprehensive overview of the recent developments in LLM research, covering various aspects, including architectures, training strategies, fine-tuning, and applications.

**Background**

LLMs are built upon the transformer architecture, which processes input sequences in parallel and independently. Tokenization, encoding positions, attention, and activation functions are essential components of LLMs. Tokenization involves parsing text into non-decomposing units called tokens, while 

In [36]:
# New follow-up question based on the summary
new_question = "The summary briefly touches upon challenges like 'Explainability' and 'Adversarial Attacks' in LLMs. Could you elaborate on these specific challenges and how researchers are currently trying to mitigate them?"

# Summarize the new question using the LLM
question_summarization_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise summarizer. Summarize the following question in a single sentence."),
    ("human", "{question_text}")
])

summarized_question_chain = question_summarization_prompt | llm | StrOutputParser()

summarized_new_question = summarized_question_chain.invoke({"question_text": new_question})

print("Original Follow-up Question:", new_question)
print("\nSummarized Follow-up Question:", summarized_new_question)

Original Follow-up Question: The summary briefly touches upon challenges like 'Explainability' and 'Adversarial Attacks' in LLMs. Could you elaborate on these specific challenges and how researchers are currently trying to mitigate them?

Summarized Follow-up Question: Researchers are currently addressing the challenges of "Explainability" and "Adversarial Attacks" in Large Language Models (LLMs) by developing techniques such as model interpretability methods, adversarial training, and robustness testing to improve transparency and security.


In [None]:
%%writefile summarizer_app.py

import os
import fitz # PyMuPDF for handling PDF files

from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# --- API Key Setup ---
# Prefer an existing GROQ_API_KEY environment variable so the script also runs outside Colab;
# fall back to Colab secrets only when google.colab can be imported.
api_key = os.environ.get("GROQ_API_KEY")
if not api_key:
    try:
        from google.colab import userdata
        api_key = userdata.get("GROQ_API_KEY")
    except Exception:
        api_key = None

# If the API key is still not found, raise an error to inform the user
if not api_key:
    raise ValueError("GROQ_API_KEY not found. Please ensure it is set as a Colab secret or environment variable.")
os.environ["GROQ_API_KEY"] = api_key

# --- LLM Setup ---
llm = ChatGroq(
    model="llama-3.1-8b-instant",
    temperature=0,
    max_tokens=1000
)

# --- PDF Content Loader Function ---
def load_pdf_content(file_path: str) -> str:
    """Loads text content from a PDF file using PyMuPDF."""
    text_content = ""
    try:
        doc = fitz.open(file_path)
        for page_num in range(doc.page_count):
            page = doc.load_page(page_num)
            text_content += page.get_text()
        doc.close()
        print(f"Successfully loaded content from {file_path}")
    except FileNotFoundError:
        print(f"Error: PDF file not found at {file_path}. Skipping this file.")
    except Exception as e:
        print(f"Error processing PDF {file_path}: {e}")
    return text_content

# --- Paths to PDF Files ---
pdf_path_llm = "/content/LLM.pdf"
pdf_path_apjspeech = "/content/apjspeech.pdf"

# --- Load and Combine Content ---
llm_content = load_pdf_content(pdf_path_llm)
apjspeech_content = load_pdf_content(pdf_path_apjspeech)

combined_content = ""
if llm_content:
    combined_content += f"--- Content from {pdf_path_llm} ---\n{llm_content}\n\n"
if apjspeech_content:
    combined_content += f"--- Content from {pdf_path_apjspeech} ---\n{apjspeech_content}\n\n"

# Truncate combined_content to avoid exceeding model token limits
MAX_CONTENT_LENGTH = 20000
if len(combined_content) > MAX_CONTENT_LENGTH:
    combined_content = combined_content[:MAX_CONTENT_LENGTH]
    print(f"Truncated combined_content to {MAX_CONTENT_LENGTH} characters.")

# --- Summarization Logic ---
if not combined_content:
    print("No readable content found from the specified PDF files. Cannot summarize.")
else:
    summarization_prompt = ChatPromptTemplate.from_messages([
        ("system", "You are an expert summarizer. Summarize the following document(s) concisely and informatively. Focus on key concepts and main points."),
        ("human", "Document(s) to summarize:\n\n{document_content}\n\nProvide a comprehensive summary:")
    ])

    summarization_chain = summarization_prompt | llm | StrOutputParser()

    print("\n--- Generating Summary of Local Knowledge Base ---")
    summary = summarization_chain.invoke({"document_content": combined_content})

    print("\n--- Summary of PDF Documents ---")
    print(summary)


Writing summarizer_app.py


## LangChain RAG Pipeline and Text Summarization Project – Overview

This project demonstrates the construction and application of a versatile **LangChain-based Retrieval-Augmented Generation (RAG) pipeline** alongside a dedicated **text summarization tool**.

### 1. RAG Agent Functionality
- Implements a dynamic RAG agent capable of intelligent query routing.
- Queries are directed to:
  - An internal knowledge base
  - Wikipedia
  - Arxiv
- Ensures comprehensive, accurate responses with proper **source citations**.
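
The routing decision can be sketched as a pure function that maps the router's free-form reply onto one graph node name. This is an illustrative variant, not the notebook's code: `ROUTES` and `parse_route` are hypothetical names, and it matches whole words, whereas the notebook's `should_continue` uses substring checks.

```python
ROUTES = {
    "synthesize": "Final Answer",
    "wikipedia": "Wikipedia",
    "arxiv": "Arxiv",
}


def parse_route(raw_decision: str, default: str = "Final Answer") -> str:
    """Map a free-form router reply onto exactly one graph node name."""
    words = raw_decision.lower().replace(",", " ").replace(".", " ").split()
    for word in words:
        if word in ROUTES:
            return ROUTES[word]  # first recognised keyword wins
    return default  # unrecognised reply: fall back instead of raising


print(parse_route("  Wikipedia.\n"))
print(parse_route("I am not sure"))
```

Keeping the parsing separate from the LLM call makes the fallback behaviour easy to unit-test without any API access.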

### 2. Modular Architecture
- Uses **HuggingFaceEmbeddings** and **FAISS** for efficient document embedding and retrieval.
- Integrates **ChatGroq** for:
  - Intelligent query routing
  - Response generation
- The modular design enables adaptive and scalable information retrieval.
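
The retrieval idea behind the embedding and vector-store pairing can be illustrated without either library. The sketch below ranks toy two-dimensional vectors by cosine similarity and keeps the top `k`. The vectors, chunk names, and the choice of cosine similarity are assumptions made for illustration. It does not reproduce FAISS or the actual embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the names of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunk_vecs,
        key=lambda name: cosine(query_vec, chunk_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for real embeddings.
chunks = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.7, 0.7],
    "chunk_c": [0.0, 1.0],
}
print(top_k([1.0, 0.1], chunks, k=2))
```

A vector index replaces the full sort with a faster search over the stored vectors, but the input and output shape stay the same: a query vector in, the nearest chunks out.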

### 3. PDF Text Summarization Component
- Implements a robust PDF summarizer within the same notebook.
- Extracts text from local PDF files using **PyMuPDF**.
- Combines extracted content into a single text corpus.
- Generates concise, informative summaries using the **ChatGroq LLM**.

### 4. Token Management and Reliability
- Applies content truncation (20,000 characters) to keep prompts within LLM token limits.
- Reduces the risk of `APIStatusError` responses when processing large documents.
- Keeps summarization runs stable, at the cost of ignoring content beyond the cutoff.

### 5. Overall System Capabilities
- Supports:
  - In-depth, citation-backed question answering
  - Efficient content distillation through summarization
- Highlights the flexibility of **LangChain** for building LLM applications that combine retrieval, routing, and summarization.


# **LangChain RAG_YouTube_Summarizer**

In [37]:
# First, reinstall the correct version
!pip uninstall -y youtube-transcript-api
!pip install youtube-transcript-api

Collecting youtube-transcript-api
  Downloading youtube_transcript_api-1.2.3-py3-none-any.whl.metadata (24 kB)
Downloading youtube_transcript_api-1.2.3-py3-none-any.whl (485 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 485.1/485.1 kB 8.7 MB/s eta 0:00:00
Installing collected packages: youtube-transcript-api
Successfully installed youtube-transcript-api-1.2.3


In [38]:
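# Step 1: Uninstall and reinstall the package (done in the previous cell)

# Step 2: Drop cached copies of the module and its submodules so the fresh install is imported
import sys
for name in list(sys.modules):
    if name == 'youtube_transcript_api' or name.startswith('youtube_transcript_api.'):
        del sys.modules[name]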

# Step 3: Fresh import
from youtube_transcript_api import YouTubeTranscriptApi

# Step 4: Verify installation
print(f" Import successful")
print(f"Available attributes: {[attr for attr in dir(YouTubeTranscriptApi) if not attr.startswith('_')]}")

 Import successful
Available attributes: ['fetch', 'list']


# **Sanity Check**

In [39]:
# Manual installation check
!python -c "from youtube_transcript_api import YouTubeTranscriptApi; print('Import OK'); print(dir(YouTubeTranscriptApi))"

# If the above shows the import works but no methods, try this workaround:
import requests
import re

def get_transcript_manual(video_id: str) -> str:
    """Manual caption-availability check as a fallback (does not parse the transcript)."""
    try:
        # Fetch the public watch page HTML (this is not an official YouTube API endpoint)
        url = f"https://www.youtube.com/watch?v={video_id}"
        response = requests.get(url, timeout=10)

        # Extract captions URL from page source (simplified)
        # Note: This is a basic approach and may not always work
        if "captionTracks" in response.text:
            print(" Captions found in page")
            # For a more robust solution, you'd need to parse the JSON
            # This is just to verify the video has captions
            return "Captions available but need proper parsing"
        else:
            return ""
    except Exception as e:
        print(f" Manual method error: {e}")
        return ""

# Test manual method
result = get_transcript_manual("AOQyRiwydyo")
print(result)

Import OK
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'fetch', 'list']
 Captions found in page
Captions available but need proper parsing


In [40]:
# Import API to fetch YouTube video transcripts
from youtube_transcript_api import YouTubeTranscriptApi

# Import prompt template for structured LLM prompting
from langchain_core.prompts import ChatPromptTemplate

# Import output parser to convert LLM output to string
from langchain_core.output_parsers import StrOutputParser


# Define the summarization prompt with system and human roles
# The system message sets the behavior of the LLM as an expert summarizer
# The human message injects the document content dynamically
summarization_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are an expert summarizer. Summarize the following document(s) concisely and informatively. Focus on key concepts and main points."
    ),
    (
        "human",
        "Document(s) to summarize:\n\n{document_content}\n\nProvide a comprehensive summary:"
    )
])

# Create a summarization chain by connecting the prompt, LLM, and output parser
summarization_chain = summarization_prompt | llm | StrOutputParser()


def get_youtube_transcript(video_url: str) -> str:
    """
    Fetches and returns the transcript text of a YouTube video
    given its URL.
    """
    try:
        # Extract the YouTube video ID from standard or short URLs
        if "v=" in video_url:
            video_id = video_url.split("v=")[1].split("&")[0]
        elif "youtu.be/" in video_url:
            video_id = video_url.split("youtu.be/")[1].split("?")[0]
        else:
            # Raise an error if the URL format is not recognized
            raise ValueError("Invalid YouTube URL format")

        # Initialize the YouTube Transcript API
        api = YouTubeTranscriptApi()

        # Fetch the transcript object using the video ID
        transcript_obj = api.fetch(video_id)

        # Handle different transcript object structures
        if hasattr(transcript_obj, 'snippets'):
            # If transcript contains snippets, extract text from each snippet
            snippets = transcript_obj.snippets
            transcript = " ".join([snippet.text for snippet in snippets])
        elif hasattr(transcript_obj, '__iter__'):
            # If transcript is iterable, extract text from each item
            transcript = " ".join(
                [getattr(item, 'text', str(item)) for item in transcript_obj]
            )
        else:
            # Fallback: convert the transcript object directly to string
            transcript = str(transcript_obj)

        # Return the extracted transcript text
        return transcript

    except Exception as e:
        # Report the failure, then return an empty string so the caller can skip summarization
        print(f"Error fetching transcript: {e}")
        return ""


# YouTube video URL to be summarized
youtube_video_url = "https://www.youtube.com/watch?v=AOQyRiwydyo&t=6323s"

# Fetch the transcript for the given YouTube video
video_transcript = get_youtube_transcript(youtube_video_url)

# Proceed only if transcript extraction was successful
if video_transcript:
    # Define maximum allowed transcript length (in characters) to control token usage
    MAX_TRANSCRIPT_LENGTH = 20000
    original_length = len(video_transcript)

    # Truncate transcript if it exceeds the maximum length
    if original_length > MAX_TRANSCRIPT_LENGTH:
        video_transcript = video_transcript[:MAX_TRANSCRIPT_LENGTH]

    # Generate a summary using the summarization chain
    try:
        youtube_summary = summarization_chain.invoke(
            {"document_content": video_transcript}
        )

        # Print the generated summary
        print(youtube_summary)

    except Exception as e:
        # Handle errors during summary generation
        print("Error generating summary:", e)
else:
    # Handle the case where transcript could not be retrieved
    print("Could not retrieve YouTube transcript")


The document is a promotional video script for a LangChain course, which is a framework for building AI orchestration pipelines. The speaker aims to teach data engineers how to learn LangChain and become proficient in building AI agents.

**Key Points:**

1. **Why learn LangChain?** LangChain is a popular and in-demand framework, and knowing it can give data engineers an edge in the job market.
2. **Prerequisites:** Basic Python understanding is required, including functions, loops, conditions, and basic object-oriented programming (OOP) concepts.
3. **Course Overview:** The course will cover LangChain from scratch, with dedicated chapters, notebooks, and sessions. The speaker will provide live illustrations and notes, and all code examples will be available.
4. **What is LangChain?** LangChain is a framework for building AI agents, which are essentially software programs that perform specific tasks. AI agents are created by integrating tools with Large Language Models (LLMs).
5. **Wha

## LangChain YouTube Summarizer – Comprehensive Overview

This notebook’s YouTube summarizer fetches the transcript of a YouTube video and summarizes it with a **LangChain** chain. It consists of the following components:

### Transcript Extraction
- Uses the `youtube-transcript-api` library to fetch the transcript for a given YouTube video URL.
- Parses the video ID from standard (`watch?v=`) and short (`youtu.be/`) URLs, and returns an empty string if transcript retrieval fails.
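The ID parsing described above relies on substring checks. As a minimal sketch of a stricter alternative, the same two URL shapes can be parsed with the standard library's `urllib.parse`. The helper name `extract_video_id` is hypothetical and does not appear in the notebook.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(video_url: str) -> str:
    """Return the video ID from a standard or short YouTube URL.

    Hypothetical helper for illustration; not part of the notebook.
    """
    parsed = urlparse(video_url)
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>?t=...
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/").split("/")[0]
        if video_id:
            return video_id

    # Standard links: https://www.youtube.com/watch?v=<id>&t=...
    if host == "youtube.com" or host.endswith(".youtube.com"):
        values = parse_qs(parsed.query).get("v")
        if values:
            return values[0]

    # Same error message as the notebook's parser
    raise ValueError("Invalid YouTube URL format")


print(extract_video_id("https://www.youtube.com/watch?v=AOQyRiwydyo&t=6323s"))
```

Checking the hostname first means a URL from another site that happens to contain `v=` in its query string is rejected rather than passed to the transcript API.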

### LLM Integration
- Uses the pre-configured **ChatGroq LLM** (`llama-3.1-8b-instant`).
- Uses an increased `max_tokens` setting to enable more detailed and comprehensive summaries.

### Context Management
- Truncates the transcript so that the prompt stays within the LLM's context limit.
- Limits raw transcript content to a predefined `MAX_TRANSCRIPT_LENGTH` (for example, 20,000 characters) before passing it to the model. The cap is measured in characters, not tokens, so it only approximates the token budget.
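A plain slice can cut the transcript in the middle of a word. The following is a minimal sketch of a truncation step that prefers a word boundary; `truncate_transcript` is a hypothetical helper, and the default of 20,000 characters simply mirrors `MAX_TRANSCRIPT_LENGTH`.

```python
def truncate_transcript(text: str, max_chars: int = 20000) -> str:
    """Cut text to at most max_chars characters, preferring a word boundary.

    Hypothetical helper for illustration; not part of the notebook.
    """
    if len(text) <= max_chars:
        return text

    cut = text[:max_chars]

    # If the cut already ends exactly at the end of a word, keep it as is
    if text[max_chars].isspace():
        return cut

    # Otherwise drop the partial trailing word when there is a space to cut at
    last_space = cut.rfind(" ")
    if last_space > 0:
        cut = cut[:last_space]
    return cut


sample = "word " * 10  # 50 characters
short = truncate_transcript(sample, max_chars=12)
print(repr(short))  # 'word word'
```

This is still a character-based limit. Counting tokens with the model's tokenizer would be more precise, but that is outside what the notebook does.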

### Prompt Engineering
- Uses a `ChatPromptTemplate` to guide the summarization process.
- Instructs the LLM to behave as an expert summarizer, emphasizing concise, informative output focused on key concepts.

### Summarization Chain
- Builds a `Runnable` chain combining:
  - The summarization prompt
  - The LLM
  - A `StrOutputParser`
- The chain takes a `document_content` input and returns the summary as a plain string.

### Overall Purpose
- Automates the conversion of lengthy YouTube video content into clear, digestible summaries.
- Enables rapid understanding of video content while keeping the input within the model's context limit.
