<a href="https://colab.research.google.com/github/Praveengovianalytics/10MinAIAgentChallenge/blob/main/Summarizar_Agent_with_MCP_Day1_Challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤖 Day 1 Challenge – RAG Summarizer Agent with MCP Integration

Welcome to Day 1 of the **#10MinAIAgentChallenge**!  
Today, you’ll build an intelligent AI agent that summarizes content across various document formats using RAG (Retrieval-Augmented Generation) and exposes it as a FastAPI + MCP service.

---

## 🧩 Problem Statement

Organizations often deal with a large volume of documents in formats like PDF, Word, PowerPoint, TXT, and CSV. Manually extracting summaries from them is time-consuming.

**Your challenge:**  
Build an AI agent that:
- Loads unstructured documents
- Chunks and embeds their content
- Indexes them using FAISS
- Responds to summarization queries via a FastAPI + MCP service
- Uses OpenAI Agent SDK to interact in natural language

---

## 🛠️ What You'll Build

✅ A document loader that handles `.pdf`, `.docx`, `.pptx`, `.txt`, and `.csv`  
✅ A chunking logic for RAG-friendly text input  
✅ Embedding using OpenAI’s `text-embedding-ada-002`  
✅ A FAISS-based vector store  
✅ A FastAPI service for RAG summarization  
✅ An OpenAI Agent that queries the API and returns a concise summary

---

## 🚀 Workflow Summary

```mermaid
graph LR
A[📄 Multi-format Docs] --> B[🧩 Chunk + Embed]
B --> C[🗃️ FAISS Index]
C --> D[🔍 RAG Inference Engine]
D --> E[🧪 FastAPI + MCP Service]
E --> F[🤖 OpenAI Agent Query]
F --> G[📝 Final Summary Output]


In [1]:
!pip install fastapi fastapi_mcp uvicorn mcp-proxy phoenix-ai nest-asyncio -q

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/236.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m235.5/236.0 kB[0m [31m11.5 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m236.0/236.0 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[?25h

# Setup sample data

In [6]:
!pip install python-docx python-pptx -q

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/472.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m153.6/472.8 kB[0m [31m4.6 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m472.8/472.8 kB[0m [31m6.9 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/172.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m172.3/172.3 kB[0m [31m10.2 MB/s[0m eta [36m0:00:00[0m
[?25h

In [7]:
import os
import pandas as pd
from typing import List
from PyPDF2 import PdfReader
from docx import Document as DocxDocument
from pptx import Presentation
from langchain_community.document_loaders import (
    TextLoader, PyPDFLoader, UnstructuredWordDocumentLoader,
    UnstructuredPowerPointLoader, UnstructuredExcelLoader
)

def ensure_folder_exists(folder_path: str):
    os.makedirs(folder_path, exist_ok=True)

def _read_pdf(file_path: str) -> str:
    reader = PdfReader(file_path)
    return "".join(page.extract_text() or "" for page in reader.pages)

def _split_text(text: str, max_chars: int = 1000, overlap: int = 100) -> List[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            while end > start and text[end] not in ' \n\t':
                end -= 1
            if end == start:
                end = start + max_chars
        chunks.append(text[start:end])
        start = end - overlap if end - overlap > start else start + max_chars
    return chunks

def _read_docx(file_path: str) -> str:
    doc = DocxDocument(file_path)
    return "\n".join(paragraph.text for paragraph in doc.paragraphs)

def _read_pptx(file_path: str) -> str:
    prs = Presentation(file_path)
    lines = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text") and shape.text:
                lines.append(shape.text)
    return "\n".join(lines)

def load_and_process_single_document(folder_path: str, filename: str) -> pd.DataFrame:
    ensure_folder_exists(folder_path)
    file_path = os.path.join(folder_path, filename)
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")

    ext = filename.lower().rsplit(".", 1)[-1]
    try:
        if ext == "pdf":
            text = _read_pdf(file_path)
        elif ext == "txt":
            text = open(file_path, "r", encoding="utf-8").read()
        elif ext == "csv":
            df = pd.read_csv(file_path)
            text = "\n".join(" | ".join(str(v) for v in row if pd.notna(v))
                             for _, row in df.iterrows())
        elif ext == "docx":
            text = _read_docx(file_path)
        elif ext == "pptx":
            text = _read_pptx(file_path)
        else:
            raise ValueError(f"Unsupported type: .{ext}")
    except Exception as e:
        raise ValueError(f"Error reading {filename}: {e}")

    chunks = _split_text(text)
    return pd.DataFrame({
        "filename": [filename] * len(chunks),
        "chunk_id": list(range(len(chunks))),
        "content": chunks
    })



import os
import pandas as pd

# Assume your functions are in document_loader.py
#from document_loader import load_and_process_single_document, load_documents_to_dataframe, ensure_folder_exists

def setup_example_files(folder: str):
    ensure_folder_exists(folder)
    # Create sample .txt
    with open(os.path.join(folder, "hello.txt"), "w", encoding="utf-8") as f:
        f.write("Hello world! This is a test document to demonstrate chunking logic.")
    # Create sample .docx
    from docx import Document
    doc = Document()
    doc.add_paragraph("This is a test Word document.\n" * 5)
    doc.save(os.path.join(folder, "sample.docx"))
    # Create sample .pptx
    from pptx import Presentation
    prs = Presentation()
    slide = prs.slides.add_slide(prs.slide_layouts[5])  # blank layout
    tx_box = slide.shapes.add_textbox(left=100, top=100, width=400, height=100)
    tx = tx_box.text_frame
    tx.text = "Slide 1: Hello from PPTX!"
    prs.save(os.path.join(folder, "presentation.pptx"))

def main():
    folder = "data"
    setup_example_files(folder)

    # Process a single doc to see results
    df_single = load_and_process_single_document(folder, "sample.docx")
    print("Single DOCX:", df_single)

    # Load and process all supported documents
    #df_all = load_documents_to_dataframe(folder)
    print("All loaded documents:")
    #print(df_all.head(), "\nTotal records:", len(df_all))

if __name__ == "__main__":
    main()

Single DOCX:       filename  chunk_id                                            content
0  sample.docx         0  This is a test Word document.\nThis is a test ...
1  sample.docx         1  document.\nThis is a test Word document.\nThis...
All loaded documents:


# Vector Pipeline

In [11]:
from phoenix_ai.loaders import load_and_process_single_document,load_documents_to_dataframe
from phoenix_ai.vector_embedding_pipeline import VectorEmbedding
from phoenix_ai.utils import GenAIEmbeddingClient ,GenAIChatClient
from google.colab import userdata


# Step 1: Load documents
df = load_documents_to_dataframe(folder_path="data/")

# Step 2: Setup embedding client
api_key = userdata.get('openai_api_key')  # For testing, you may replace with actual key
embedding_model = "text-embedding-ada-002"

embedding_client = GenAIEmbeddingClient(
    provider="openai",
    model=embedding_model,
    api_key=api_key
)

# Step 3: Create 'index' folder if it doesn't exist
index_dir = "index"
os.makedirs(index_dir, exist_ok=True)

# Step 4: Generate FAISS index
vector = VectorEmbedding(embedding_client, chunk_size=500, overlap=50)
index_path, chunks = vector.generate_index(
    df=df,
    text_column="content",
    index_path=os.path.join(index_dir, "policy.index")
)

📘 Loading with LangChain loader: presentation.pptx
📘 Loading with LangChain loader: sample.docx
📘 Loading with LangChain loader: hello.txt
FAISS index saved with 1 chunks at index/policy.index


# Set up Summarizer Service ( MCP )

In [18]:
import nest_asyncio
from fastapi import FastAPI, HTTPException, Query
from fastapi_mcp import FastApiMCP
import uvicorn
import threading
from phoenix_ai.rag_inference import RAGInferencer
from phoenix_ai.config_param import Param
from phoenix_ai.utils import GenAIEmbeddingClient, GenAIChatClient
from google.colab import userdata  # For Colab use

# Enable nested event loop (Colab/Jupyter-safe)
nest_asyncio.apply()

# Initialize FastAPI app
app = FastAPI(title="RAG Summarizer API")

# === Setup RAG Components ===
api_key = userdata.get('openai_api_key')  # Or hardcode for local: api_key = "sk-..."

chat_model = "gpt-4o"
embedding_model = "text-embedding-ada-002"

embedding_client = GenAIEmbeddingClient(
    provider="openai",
    model=embedding_model,
    api_key=api_key
)

chat_client = GenAIChatClient(
    provider="openai",
    model=chat_model,
    api_key=api_key
)

rag = RAGInferencer(embedding_client, chat_client)

# === API Endpoint ===
@app.get("/rag/summarize")
async def summarize(
    question: str = Query(..., description="Your question or topic to summarize"),
    mode: str = Query("standard", description="RAG mode to use (e.g., 'standard')"),
    top_k: int = Query(5, description="Top K documents to retrieve")
):
    """
    RAG-based summarization using FAISS index
    """
    try:
        df = rag.infer(
            system_prompt=Param.get_rag_prompt(),
            index_path="index/policy.index",
            question=question,
            mode=mode,
            top_k=top_k
        )

        # Debug print
        print("🧠 RAG Inference Output:\n", df)

        # Check for 'response' column and last row
        if "response" not in df.columns or df.empty:
            raise ValueError("No valid response found in RAG output.")

        return {
            "question": question,
            "answer": df.iloc[-1].get("response", "No answer generated."),
            "sources": df.to_dict(orient="records")
        }

    except Exception as e:
        raise HTTPException(status_code=500, detail=f"RAG summarization failed: {str(e)}")

# === MCP Mount ===
mcp = FastApiMCP(
    app,
    name="RAG Summarizer API",
    description="MCP API for answering questions using RAG over FAISS index"
)
mcp.mount()

# === Launch Uvicorn in background thread (for Colab)
def run_server():
    print("🚀 Starting RAG Summarizer API at http://localhost:8003")
    uvicorn.run(app, host="0.0.0.0", port=8003)

thread = threading.Thread(target=run_server)
thread.start()

🚀 Starting RAG Summarizer API at http://localhost:8003


In [19]:
!curl "http://localhost:8003/rag/summarize?question=20the%20general%20summary%3F"

RAG Answer:
 The provided slides contain repetitive text from a PowerPoint presentation that describes a test Word document to demonstrate chunking logic. Each slide repeatedly states, "Hello from PPTX! This is a test Word document. Hello world! This is a test document to demonstrate chunking logic."
🧠 RAG Inference Output:
                                       retrieved_docs                question  \
0  [Slide 1: Hello from PPTX!\nThis is a test Wor...  20the general summary?   

                                              answer  
0  The provided slides contain repetitive text fr...  
INFO:     127.0.0.1:59632 - "GET /rag/summarize?question=20the%20general%20summary%3F HTTP/1.1" 500 Internal Server Error
{"detail":"RAG summarization failed: No valid response found in RAG output."}

# Setup Agent

In [20]:
!pip install openai-agents requests -q

In [21]:
import os, requests
from agents import Agent, Runner, function_tool

os.environ["OPENAI_API_KEY"] = userdata.get('openai_api_key')

In [22]:
@function_tool
def rag_summarizer(question: str) -> str:
    """
    Query the RAG summarizer FastAPI service.
    Args:
        question: natural language query (e.g. "overall summary")
    Returns:
        A concise summary response.
    """
    resp = requests.get("http://localhost:8003/rag/summarize", params={"question": question})
    resp.raise_for_status()
    return resp.json().get("answer", "No summary.")

In [23]:
agent = Agent(
    name="RAG Agent",
    instructions="Use the RAG summarizer tool to answer user questions.",
    tools=[rag_summarizer],
    # Optionally specify model, e.g. model="gpt-4o"
)

In [None]:
from agents import Runner

result = Runner.run_sync(agent, "Please summarize the documents")
print(result.final_output)