### Question 5: Build a Summary Agent Using Wikipedia Pages
Given the Wikipedia page https://en.wikipedia.org/wiki/Capybara, use the Fetch Web Page tool (the `fetch_url` function) to retrieve the page content, then use the Save Summary tool (the `save_summary` function) to save the summary.

In [1]:
from agents import Agent, function_tool, Runner # from OpenAI Agents SDK 
import requests
from requests.exceptions import RequestException
from typing import Any, Dict, List, Optional
from pathlib import Path
import re

In [2]:
def fetch_url(url: str) -> Optional[str]:
    """
    Fetch the textual content of a webpage.

    Args:
        url (str): The target URL to fetch content from.

    Returns:
        Optional[str]: The decoded HTML/text content of the fetched page if successful,
        or None if an error occurred.

    Raises:
        ValueError: If the provided URL is empty or invalid.
    """
    if not url or not isinstance(url, str):
        raise ValueError("The 'url' parameter must be a non-empty string.")

    jina_reader_base_url = "https://r.jina.ai/"
    jina_reader_url = jina_reader_base_url + url.lstrip("/")

    try:
        response = requests.get(jina_reader_url, timeout=10)
        response.raise_for_status()  # Raises HTTPError for bad status codes
        return response.content.decode("utf-8")
    except RequestException as e:
        # Catch all network-related errors (e.g., ConnectionError, Timeout, HTTPError)
        print(f"Error fetching URL '{jina_reader_url}': {e}")
        return None
    except UnicodeDecodeError:
        print(f"Error decoding response from '{jina_reader_url}'.")
        return None

In [3]:
def save_summary(filename: str, text: str, directory: str = "summaries") -> str:
    """
    Save a summary to a UTF-8 .txt file inside a local 'summaries/' folder.

    Args:
        filename (str): Desired filename; '.txt' is appended if missing.
        text (str): The summary to write.
        directory (str): Target folder (created if needed).

    Returns:
        str: The absolute path to the saved file.

    Raises:
        ValueError: If filename or text is empty/invalid.
    """
    if not filename or not isinstance(filename, str):
        raise ValueError("filename must be a non-empty string.")
    if not text or not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string.")

    # Basic sanitization: keep letters, numbers, dots, dashes, underscores
    safe_name = re.sub(r"[^A-Za-z0-9._-]", "_", filename.strip())
    if not safe_name.endswith(".txt"):
        safe_name += ".txt"

    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = (out_dir / safe_name).resolve()

    # Write summary
    out_path.write_text(text, encoding="utf-8")
    return str(out_path)
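
To see what the sanitization step in `save_summary` does to awkward filenames, the snippet below is a minimal sketch that applies the same regular expression in isolation. The helper name `sanitize_filename` and the sample inputs are assumptions made for illustration; they are not part of the notebook.

```python
import re


def sanitize_filename(filename: str) -> str:
    """Hypothetical helper mirroring the sanitization inside save_summary."""
    # Replace every character outside letters, digits, '.', '_' and '-' with '_'
    safe_name = re.sub(r"[^A-Za-z0-9._-]", "_", filename.strip())
    # Append the .txt extension when it is missing
    if not safe_name.endswith(".txt"):
        safe_name += ".txt"
    return safe_name


# Sample inputs (assumed): an already-clean name, a name with a space,
# and a name that tries to climb out of the target directory.
for raw in ["capybara_summary.txt", "capybara summary", "../etc/passwd"]:
    print(f"{raw!r} -> {sanitize_filename(raw)!r}")
```

Because path separators are replaced with underscores, the sanitized name contains no directory components, so the file should always land inside the target folder.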

In [4]:
# Test webpage retrieval function
#content = fetch_url('https://en.wikipedia.org/wiki/Capybara')

In [5]:
web_agent = Agent(
    name="web_agent",
    model="gpt-4o-mini",
    tools=[function_tool(fetch_url), function_tool(save_summary)],
    instructions=(
        "You are a helpful assistant that summarizes web pages. "
        "When the user asks to summarize a URL, first call fetch_url(url=<the URL>) "
        "to get the page text, then write a concise summary. "
        "After you have the final summary text, call save_summary with "
        "filename (end it with .txt) and text (the summary you just wrote). "
        "In your final message to the user, include the saved file path and also show the summary."
    ),
)

In [6]:
runner = Runner()

# Ask the model to summarize *and* save:
question = (
    "Summarize this page in ~5 bullet points and save it as 'capybara_summary.txt': "
    "https://en.wikipedia.org/wiki/Capybara"
)

results = await runner.run(web_agent, input=question)
print(results)

RunResult:
- Last agent: Agent(name="web_agent", ...)
- Final output (str):
    ### Summary of Capybara
    
    - The capybara (_Hydrochoerus hydrochaeris_) is the largest rodent in the world, native to South America.
    - Known for its social behavior, capybaras often live in groups of 10-20, sometimes up to 100.
    - They are semi-aquatic, inhabiting areas near rivers, lakes, and marshes, and are excellent swimmers.
    - Capybaras primarily graze on grasses and aquatic plants, and practice coprophagy to aid their digestion.
    - They are not considered endangered, but face threats from hunting and habitat destruction; they have also adapted well to urban environments.
    
    You can download the summary file [here](sandbox:/workspaces/Agentic-AI-course/Homework2/summaries/capybara_summary.txt).
- 7 new item(s)
- 4 raw response(s)
- 0 input guardrail result(s)
- 0 output guardrail result(s)
(See `RunResult` for more details)


In [7]:
# Inspect webpage summary
print(results.final_output)

### Summary of Capybara

- The capybara (_Hydrochoerus hydrochaeris_) is the largest rodent in the world, native to South America.
- Known for its social behavior, capybaras often live in groups of 10-20, sometimes up to 100.
- They are semi-aquatic, inhabiting areas near rivers, lakes, and marshes, and are excellent swimmers.
- Capybaras primarily graze on grasses and aquatic plants, and practice coprophagy to aid their digestion.
- They are not considered endangered, but face threats from hunting and habitat destruction; they have also adapted well to urban environments.

You can download the summary file [here](sandbox:/workspaces/Agentic-AI-course/Homework2/summaries/capybara_summary.txt).


In [8]:
# Inspect run items (for debugging purposes)
items = results.new_items
#print(items)

### Question 6: Give the Agent a Search Tool 
Ask the agent to index multiple webpages. Use the agent's search tool to answer the following question: "What are threats to capybara populations?"

In [9]:
from minsearch import AppendableIndex, VectorSearch
from sentence_transformers import SentenceTransformer # necessary for semantic search
from urllib.parse import urlparse, unquote
from pathlib import PurePosixPath
import numpy as np
import time
import docs

In [10]:
urls = [
    "https://en.wikipedia.org/wiki/Lesser_capybara",
    "https://en.wikipedia.org/wiki/Hydrochoerus",
    "https://en.wikipedia.org/wiki/Neochoerus", 
    "https://en.wikipedia.org/wiki/Caviodon", 
    "https://en.wikipedia.org/wiki/Neochoerus_aesopi"
]

def make_filename_from_url(url: str) -> str:
    """
    Use the last non-empty path segment of the URL as the base,
    sanitize it, lowercase it, and append _summary.txt
    """
    path = urlparse(url).path                 # e.g., '/wiki/Capybara'
    parts = [p for p in PurePosixPath(path).parts if p not in ("/", "")]
    base = parts[-1] if parts else "page"     # e.g., 'Capybara'
    base = unquote(base) 
    base = re.sub(r"[^A-Za-z0-9._-]+", "_", base).strip("_")
    if not base:
        base = "page"
    return f"{base.lower()}_summary.txt"

# Run serially
for url in urls:
    filename = make_filename_from_url(url)
    question = f"Summarize this page in ~5 bullet points and save it as '{filename}': {url}"
    results = await runner.run(web_agent, input=question)

In [11]:
def derive_wiki_url_from_filename(filename: str) -> Optional[str]:
    """
    From 'lesser_capybara_summary.txt' -> 'https://en.wikipedia.org/wiki/Lesser_capybara'
    From 'hydrochoerus_summary.txt'    -> 'https://en.wikipedia.org/wiki/Hydrochoerus'
    Returns None if the filename does not match the '*_summary.txt' pattern;
    the caller supplies any fallback (e.g., a file:// path).
    """
    name = Path(filename).name
    if not name.endswith("_summary.txt"):
        return None
    base = name[: -len("_summary.txt")]  # e.g., 'lesser_capybara'
    if not base:
        return None
    # Capitalize first letter only, keep underscores as-is (matches your earlier file naming)
    slug = base[0].upper() + base[1:]
    return f"https://en.wikipedia.org/wiki/{slug}"
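
The filename-to-URL mapping is lossy, because `make_filename_from_url` lowercases the slug and `derive_wiki_url_from_filename` restores only the first capital letter. The sketch below repeats both helpers from the notebook so it runs on its own and checks whether a URL survives the round trip. The `round_trips` helper and the `South_America` example URL are assumptions added for illustration.

```python
import re
from pathlib import Path, PurePosixPath
from typing import Optional
from urllib.parse import unquote, urlparse


def make_filename_from_url(url: str) -> str:
    """Same logic as the notebook: last path segment, sanitized and lowercased."""
    path = urlparse(url).path
    parts = [p for p in PurePosixPath(path).parts if p not in ("/", "")]
    base = unquote(parts[-1]) if parts else "page"
    base = re.sub(r"[^A-Za-z0-9._-]+", "_", base).strip("_") or "page"
    return f"{base.lower()}_summary.txt"


def derive_wiki_url_from_filename(filename: str) -> Optional[str]:
    """Same logic as the notebook: capitalize only the first letter of the slug."""
    name = Path(filename).name
    if not name.endswith("_summary.txt"):
        return None
    base = name[: -len("_summary.txt")]
    if not base:
        return None
    return f"https://en.wikipedia.org/wiki/{base[0].upper() + base[1:]}"


def round_trips(url: str) -> bool:
    """Hypothetical check: does URL -> filename -> URL give back the original?"""
    return derive_wiki_url_from_filename(make_filename_from_url(url)) == url


indexed_urls = [
    "https://en.wikipedia.org/wiki/Lesser_capybara",
    "https://en.wikipedia.org/wiki/Hydrochoerus",
    "https://en.wikipedia.org/wiki/Neochoerus",
    "https://en.wikipedia.org/wiki/Caviodon",
    "https://en.wikipedia.org/wiki/Neochoerus_aesopi",
]

# The five pages indexed in this notebook only have a leading capital letter
print(all(round_trips(u) for u in indexed_urls))

# A slug with an inner capital letter (assumed example) is not recovered exactly
print(round_trips("https://en.wikipedia.org/wiki/South_America"))
```

For the pages used here the reconstruction is exact. For titles with more than one capital letter, storing the original URL alongside each summary would be a safer design than guessing it from the filename.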

In [12]:
def build_lexical_index_from_summaries(dir_path: str = "./summaries") -> AppendableIndex:
    """
    Build an AppendableIndex over all .txt files in ./summaries.
    Uses 'text' as the searchable field, and stores url/filename for filtering.
    """
    index = AppendableIndex(
        text_fields=["text"],
        keyword_fields=["url", "filename"]
    )

    for p in Path(dir_path).glob("*.txt"):
        try:
            content = p.read_text(encoding="utf-8")
            if not content.strip():
                continue

            url_guess = derive_wiki_url_from_filename(p.name)
            doc = {
                "id": f"{p.stem}-{int(time.time())}",  # unique id
                "text": content,
                "url": url_guess or f"file://{p.resolve()}",
                "filename": p.name,
                "path": str(p.resolve()),
            }
            index.append(doc)
        except Exception as e:
            print(f"Skipped {p}: {e}")

    return index

In [13]:
# Build once when this cell runs; re-run the cell after adding more summary files.
LEXICAL_INDEX = build_lexical_index_from_summaries("./summaries")

# Keep a simple log of queries for your assignment submission
SEARCH_LOG: List[Dict[str, Any]] = []

In [24]:
SEARCH_LOG

[{'query': 'threats to capybara populations',
  'filter_url': None,
  'num_results': 5,
  'ts': 1760955641.6864622},
 {'query': 'threats to capybara populations',
  'filter_url': None,
  'num_results': 5,
  'ts': 1760956052.2308307},
 {'query': 'threats to capybara populations',
  'filter_url': None,
  'num_results': 5,
  'ts': 1760956092.121735}]
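
The log above records the same query three times, which is what one would expect if the agent cells were re-run (an inference from the timestamps, not something the notebook states). A minimal sketch for condensing such a log before submission is shown below. The helper name `summarize_search_log` is hypothetical, and the entries are copied from the output above.

```python
from collections import Counter
from typing import Any, Dict, List

# Entries copied from the SEARCH_LOG output shown above
search_log: List[Dict[str, Any]] = [
    {"query": "threats to capybara populations", "filter_url": None,
     "num_results": 5, "ts": 1760955641.6864622},
    {"query": "threats to capybara populations", "filter_url": None,
     "num_results": 5, "ts": 1760956052.2308307},
    {"query": "threats to capybara populations", "filter_url": None,
     "num_results": 5, "ts": 1760956092.121735},
]


def summarize_search_log(log: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: count calls per query and the time span covered."""
    per_query = Counter(entry["query"] for entry in log)
    timestamps = [entry["ts"] for entry in log]
    span = max(timestamps) - min(timestamps) if timestamps else 0.0
    return {
        "total_calls": len(log),
        "per_query": dict(per_query),
        "span_seconds": span,
    }


summary = summarize_search_log(search_log)
print(summary)
```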

In [16]:
def search_corpus(query: str, filter_url: Optional[str] = None, num_results: int = 5):
    """
    Simple lexical search over ./summaries using minsearch.AppendableIndex.
    Always call this tool before answering questions about the summaries.
    """
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string.")

    # record for your submission
    SEARCH_LOG.append({
        "query": query,
        "filter_url": filter_url,
        "num_results": num_results,
        "ts": time.time(),
    })

    filter_dict = {"url": filter_url} if filter_url else None
    results = LEXICAL_INDEX.search(query, filter_dict=filter_dict, num_results=num_results)
    return results  # list of dicts: each includes 'text', 'url', 'filename', etc.

In [22]:
# Specialized search Agent (one task only, simple instructions)
search_agent = Agent(
    name="search_agent",
    model="gpt-4o-mini",
    tools=[function_tool(search_corpus)],
    instructions=(
        "You answer questions ONLY by searching the local summaries corpus via the `search_corpus` tool.\n"
        "Process:\n"
        "1) Break the user's question into 1–3 concrete search queries.\n"
        "2) Call `search_corpus` for each query (optionally pass filter_url to narrow to a page).\n"
        "3) Read the returned snippets and synthesize a concise answer grounded in them.\n"
        "Rules:\n"
        "- Do not rely on prior knowledge; if nothing is found, say so and suggest indexing more docs.\n"
        "- When helpful, mention the source filenames/URLs from the results.\n"
        "- In your final message, include a 'Search queries used:' list showing the exact queries you issued."
    ),
)

runner = Runner()

# Your assignment question:
question = "What are threats to capybara populations?"
results = await runner.run(search_agent, input=question)

print(results.final_output)

Capybara populations face several threats, including:

1. **Hunting**: Capybaras are hunted for their meat and skin, impacting their numbers.
2. **Habitat Destruction**: Their natural habitats, primarily near rivers, lakes, and marshes, are increasingly being destroyed for agricultural development and urban expansion.
3. **Urban Adaptation**: While they adapt well to urban areas, this can lead to conflicts with humans, further threatening their populations.

Despite these challenges, capybaras are not currently considered endangered. 

**Source:** The summaries highlight the threats as hunting and habitat destruction, indicating that although capybaras thrive in some environments, they still face significant risks.

**Search queries used:**
- threats to capybara populations


In [21]:
# Generic Agent (choice between two tasks, complex instructions)
multi_task_agent = Agent(
    name="multi_task_agent",
    model="gpt-4o-mini",
    tools=[
        function_tool(fetch_url),
        function_tool(save_summary),
        function_tool(search_corpus),
    ],
    instructions=(
        "ROLE: You can do two tasks:\n"
        "A) Summarize a web page from a URL and (optionally) save it to disk.\n"
        "B) Answer questions using the local summaries corpus via the `search_corpus` tool.\n"
        "\n"
        "TOOL POLICY:\n"
        "- If the user's message contains a URL or asks to summarize: "
        "  1) Call fetch_url(url=<the URL>), 2) write a concise summary, "
        "  3) If the user asked you to save (or gave a filename), call save_summary(filename=<name>, text=<summary>). "
        "  4) In your final message, include the saved file path (if saved) and show the summary.\n"
        "- If the user's message is a knowledge question without a URL: "
        "  1) Break the question into 1–3 concrete search queries, "
        "  2) Call search_corpus for each query (optionally use filter_url to narrow), "
        "  3) Synthesize an answer strictly from the returned hits. Do not rely on prior knowledge.\n"
        "- If a single message asks for both (e.g., summarize *and* then answer a follow-up): "
        "  do the summarize flow first, then run the search flow.\n"
        "- Do not call save_summary during search mode unless the user explicitly asks to save your answer.\n"
        "\n"
        "OUTPUT FORMAT:\n"
        "- Summarize mode: 3–6 bullet points + one-sentence takeaway. If saved, add: 'Saved to: <path>'.\n"
        "- Search mode: 3–6 bullet points grounded in retrieved snippets. "
        "  When useful, cite sources with [filename] or the URL slug. "
        "  End with a short 'Search queries used:' list showing the exact queries you sent to search_corpus.\n"
        "\n"
        "TOOL SIGNATURES (for your reference):\n"
        "- fetch_url(url: str) -> str\n"
        "- save_summary(filename: str, text: str, directory?: str='summaries') -> str\n"
        "- search_corpus(query: str, filter_url?: str, num_results?: int=5) -> List[Result]\n"
    ),
)

runner = Runner()

# Your assignment question:
question = "What are threats to capybara populations?"
results = await runner.run(multi_task_agent, input=question)

print(results.final_output)

Capybara populations face several threats, including:

- **Hunting:** Capybaras are often hunted for their meat and hides, which can significantly reduce local populations.
- **Habitat Destruction:** Urbanization and agricultural expansion lead to fragmentation and loss of their natural habitats, particularly near rivers and lakes.
- **Human-Wildlife Conflict:** As humans expand into capybara territories, conflicts can arise, leading to further pressures on these animals.
- **Climate Change:** Changes in climate can affect their aquatic habitats and the availability of food sources.
- **Predation:** While adults have few natural predators, young capybaras are vulnerable to birds of prey, large mammals, and reptiles.

Despite these threats, capybaras are not currently considered endangered and have shown adaptability to urban environments.

**Search queries used:**
- threats to capybara populations
