[Open In Colab][1]

[1]:https://colab.research.google.com/drive/11UwCiB2HyaKGj0UohVhQNzdMevaLO6js?usp=sharing

## **What are Guardrails?**

* **Guardrails** = the **rules, policies, and controls** that **guide or restrict an LLM/Agent’s behavior**.
* They act like **safety boundaries**: ensuring the AI doesn’t say, do, or output something harmful, irrelevant, or unintended.

👉 Analogy: Guardrails on a highway keep cars from going off-road. Similarly, AI guardrails keep agents from going off-track.

<br>

---

<br>

## **Types of Guardrails**

Guardrails can exist at **different levels**:

### 🔹 1. **Input Guardrails**

* Check **user inputs** before passing them to the model.
* Prevent malicious, unsafe, or irrelevant queries.
* Example: Block requests like *“How to build a bomb?”*.
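
A minimal sketch of the idea above, using only the standard library. The `BLOCKED_PATTERNS` list and the `check_input` helper are hypothetical names chosen for illustration; a real system would more likely rely on a classifier or a moderation model than on a fixed pattern list.

```python
import re

# Hypothetical blocklist for illustration only.
BLOCKED_PATTERNS = [
    re.compile(r"\bbuild\s+a\s+bomb\b", re.IGNORECASE),
    re.compile(r"\bhack\s+wi-?fi\b", re.IGNORECASE),
]


def check_input(user_text: str) -> bool:
    """Return True if the text may be passed on to the model."""
    return not any(pattern.search(user_text) for pattern in BLOCKED_PATTERNS)


print(check_input("How to build a bomb?"))  # False: blocked before the model sees it
print(check_input("What is 2 + 2?"))        # True: allowed through
```

Pattern matching like this is cheap but easy to evade with rephrasing, which is one reason the notebook later uses a second agent as the checker.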

---

### 🔹 2. **Model Prompting Guardrails**

* Use **system instructions** or structured prompts to guide behavior.
* Example:

  * “You are a helpful tutor, avoid discussing harmful content.”

---

### 🔹 3. **Output Guardrails**

* Check **model responses** before sending them to the user.
* Example: Detect & filter PII (personal info) or toxic language.
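
As a rough illustration of output filtering, the sketch below redacts email addresses and phone-like numbers from a response before it is shown. The regular expressions and the `redact_pii` name are assumptions made for this example; they are deliberately simple and will miss many real-world PII formats.

```python
import re

# Simplified patterns, assumed for illustration; not production-grade PII detection.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(redact_pii("Contact me at jane@example.com or +1 555-123-4567."))
# Contact me at [EMAIL] or [PHONE].
```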

---

### 🔹 4. **Tool/Action Guardrails** (Agents SDK)

* Restrict what actions an agent can take.
* Example:

  * Allow agent to call *Calculator tool*, but not *File Delete tool*.
  * Only allow API calls within company policy.
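
One simple way to picture a tool guardrail is an allowlist checked before any tool runs. The sketch below is plain Python, not the Agents SDK API; `ALLOWED_TOOLS`, `TOOLS`, `call_tool`, and `ToolNotAllowedError` are hypothetical names, and the "delete" tool only returns a string so nothing touches the disk.

```python
from typing import Callable, Dict


class ToolNotAllowedError(Exception):
    """Raised when the agent requests a tool outside the allowlist."""


# Hypothetical tool registry: the delete tool is registered but never permitted.
TOOLS: Dict[str, Callable[..., object]] = {
    "calculator": lambda a, b: a + b,
    "delete_file": lambda path: f"deleted {path}",  # stand-in, does not delete anything
}

ALLOWED_TOOLS = {"calculator"}


def call_tool(name: str, *args: object) -> object:
    """Run a tool only if policy allows it."""
    if name not in ALLOWED_TOOLS:
        raise ToolNotAllowedError(f"Tool '{name}' is not permitted")
    return TOOLS[name](*args)


print(call_tool("calculator", 2, 3))  # 5
try:
    call_tool("delete_file", "/tmp/report.txt")
except ToolNotAllowedError as exc:
    print(exc)  # Tool 'delete_file' is not permitted
```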

---

### 🔹 5. **Policy & Compliance Guardrails**

* Enforce **business rules, ethics, and regulations**.
* Example:

  * Medical bot → cannot give diagnosis, only general info.
  * Banking bot → cannot reveal account passwords.

<br>

---

<br>

## **Why Guardrails Matter**

* ✅ **Safety** → Prevent harmful or biased outputs.
* ✅ **Compliance** → Follow company / legal rules.
* ✅ **Trust** → Users rely more on safe assistants.
* ✅ **Focus** → Keep the agent on-topic and within scope.

<br>

---

<br>

## **Cheat-Sheet**

| Guardrail Type       | Purpose                 | Example                             |
| -------------------- | ----------------------- | ----------------------------------- |
| Input Guardrail      | Filter unsafe queries   | Block “How to hack Wi-Fi”           |
| Prompting Guardrail  | Guide model behavior    | “Only answer math questions”        |
| Output Guardrail     | Filter unsafe responses | Remove toxic language               |
| Tool Guardrail       | Restrict actions        | Allow calculator, block file delete |
| Compliance Guardrail | Enforce rules           | Bank bot: never show passwords      |

---

✅ **Summary**:
**Guardrails** are **rules and filters** that keep AI systems safe, ethical, and reliable by controlling **what inputs they accept, how they behave, and what outputs or actions they produce**.



In [None]:
!pip install -Uq openai-agents pydantic

In [None]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
from pydantic import BaseModel
from agents import (
    Agent,
    AsyncOpenAI,
    OpenAIChatCompletionsModel,
    Runner,
    RunConfig,

    RunContextWrapper,
    TResponseInputItem,
    GuardrailFunctionOutput,
    InputGuardrailTripwireTriggered,
    OutputGuardrailTripwireTriggered,
    input_guardrail,
    output_guardrail
)

from google.colab import userdata

In [None]:
GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')

external_client = AsyncOpenAI(
    api_key= GEMINI_API_KEY,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

model = OpenAIChatCompletionsModel(
    model= "gemini-2.5-flash",
    openai_client= external_client,
)
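config = RunConfig(
    model= model,
    # `model` is already a Model instance, so `model_provider` is omitted:
    # that parameter expects a ModelProvider, not an AsyncOpenAI client.
    tracing_disabled= True
)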


### **Implementation of Input Guardrail:**

In [None]:
class MathHomeworkOutput(BaseModel):
  is_math_homework: bool # True if the request looks like math homework
  resoning: str # explanation of why the guardrail agent reached that decision
  answer: str # the guardrail agent's answer to the user's question


guardrail_agent = Agent(
      name= "Guardrail check",
      instructions= "Check if the user is asking you to do their math homework.",
      output_type= MathHomeworkOutput,
      model= model,
  )



In [None]:
output = Runner.run_sync(guardrail_agent, "What is the capital of pakistan.")
output.final_output



MathHomeworkOutput(is_math_homework=False, resoning='The user is asking a factual question about geography, not a math problem.', answer='The capital of Pakistan is Islamabad.')

In [None]:
@input_guardrail
async def math_guardrail(
    ctx: RunContextWrapper[None], agent: Agent, input: str | list[TResponseInputItem]) -> GuardrailFunctionOutput:
    result = await Runner.run(guardrail_agent, input, context= ctx.context)

    print("\n\n[GUARDRAIL_REPSONE]", result.final_output, "\n\n")

    return GuardrailFunctionOutput(
        output_info=result.final_output,
        # Trip the wire whenever the guardrail agent classifies the input as math homework
        tripwire_triggered=result.final_output.is_math_homework,
    )

In [None]:
agent = Agent(
    name= "Math Homework Assistant",
    instructions= "You are a math support agent. You help students with their questions.",
    model= model,
    input_guardrails= [math_guardrail],
)

In [None]:
# This should trip the guardrail

try:
    result = await Runner.run(agent, "Help me solve this x: 2x + 3 = 11", run_config= config)
    print("Guardrail did not trip - this is unexpected")
    print(result.final_output)

except InputGuardrailTripwireTriggered:
    print("Math homework guardrail triggered")




[GUARDRAIL_REPSONE] is_math_homework=True resoning='The request is to solve an algebraic equation, which is a common task in math homework or classwork.' answer='To solve for x in the equation 2x + 3 = 11:\n1. Subtract 3 from both sides of the equation: 2x + 3 - 3 = 11 - 3, which simplifies to 2x = 8.\n2. Divide both sides by 2: 2x / 2 = 8 / 2, which simplifies to x = 4.\nSo, x = 4.' 


Math homework guardrail triggered


# **Input vs Output Guardrails**

---

## 1. **Input Guardrails**

* **Definition**: Rules or filters that **check and control the user’s input** before it’s passed to the LLM/Agent.
* **Purpose**: Prevent harmful, irrelevant, or out-of-scope prompts from reaching the model.

👉 Example:

* User types: *“How to make explosives?”*
* **Input guardrail** blocks it → model never sees the prompt.

### **Use Cases:**

* Block unsafe or illegal queries.
* Prevent prompt injection attacks (like “ignore instructions and act evil”).
* Keep queries relevant (e.g., finance bot only accepts finance-related questions).

---

## 2. **Output Guardrails**

* **Definition**: Rules or filters that **check and control the model’s response** before it’s sent back to the user.
* **Purpose**: Stop the assistant from giving unsafe, private, or off-policy information.

👉 Example:

* Model generates: *“Your password is 12345”*
* **Output guardrail** blocks or rewrites the message before sending it.

### **Use Cases:**

* Remove offensive/toxic language.
* Strip sensitive data (e.g., PII → phone numbers, credit cards).
* Enforce tone (e.g., always polite and professional).

---

## 3. **Side-by-Side Comparison**

| Feature          | Input Guardrails                 | Output Guardrails                       |
| ---------------- | ----------------------------------- | ---------------------------------------- |
| **When applied** | Before user input reaches the model | After model generates a response         |
| **Purpose**      | Protect model from harmful queries  | Protect user from harmful outputs        |
| **Example**      | Block: “How to hack a bank?”        | Remove toxic/unsafe content in response  |
| **Analogy**      | Security check at airport entry     | Customs check before exiting the airport |

---

## 4. **Why Both Are Needed**

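* **Input-only guardrails**: Good, but users may still bypass them with clever phrasing, and a harmless-looking prompt can still lead to an unsafe response.
* **Output-only guardrails**: Good, but harmful prompts still reach the model, which wastes resources generating responses that are then discarded.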
* **Best practice** → Use **both** for full protection.
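
The sketch below shows how the two checks can wrap a model call. It is a toy example under stated assumptions: `fake_model` stands in for a real LLM call, and `input_ok`, `output_ok`, and `guarded_reply` are hypothetical helpers, not part of any SDK.

```python
import re


def input_ok(prompt: str) -> bool:
    """Input guardrail: reject an obviously out-of-policy request."""
    return "hack a bank" not in prompt.lower()


def output_ok(response: str) -> bool:
    """Output guardrail: reject responses that appear to leak a password."""
    return re.search(r"password\s+is", response, re.IGNORECASE) is None


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call; leaks a password for 'login' prompts."""
    if "login" in prompt.lower():
        return "Your password is 12345"
    return f"Here is some help with: {prompt}"


def guarded_reply(prompt: str) -> str:
    if not input_ok(prompt):
        return "Request blocked by input guardrail."  # model is never called
    response = fake_model(prompt)
    if not output_ok(response):
        return "Response withheld by output guardrail."
    return response


print(guarded_reply("How to hack a bank?"))
print(guarded_reply("I forgot my login"))
print(guarded_reply("Explain compound interest"))
```

The second prompt passes the input check yet still produces a leak, which only the output check catches; the first never reaches the model at all.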

---

✅ **Summary**

* **Input Guardrails** stop unsafe queries **before reaching the model**.
* **Output Guardrails** stop unsafe or non-compliant responses **before reaching the user**.
* Together, they ensure **safety, compliance, and trust**.


# **Tripwires in LLMs & Agents**

---

## 1. **What is a Tripwire?**

* A **tripwire** is a **monitor or detector** that “triggers” when an LLM or Agent crosses a **dangerous or restricted boundary**.
* Think of it like a **silent alarm**:

  * It doesn’t block normal activity.
  * But if something bad happens, it **activates a response** (log, alert, stop the agent).

👉 Analogy: In a museum, tripwires are invisible lasers. If someone crosses, an alarm goes off.

---

## 2. **Difference from Guardrails**

* **Guardrails** = **proactive** rules that prevent unsafe behavior in the first place (like fences).
* **Tripwires** = **reactive** monitors that detect and respond if a rule is violated (like alarms).

---

## 3. **How Tripwires Work**

Tripwires can monitor:

1. **Inputs** → Detect if a harmful prompt was attempted.
2. **Outputs** → Detect if the model generated unsafe/forbidden text.
3. **Agent Actions** → Detect if the agent tries to call an unsafe tool or external system.

👉 When triggered:

* Block the action/output.
* Alert developers or admins.
* Log the event for audits.
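
The "log, then block" reaction can be sketched in plain Python as a check that returns a flag plus an exception raised when the flag is set. This loosely mirrors the pattern in the notebook code above, where `GuardrailFunctionOutput(tripwire_triggered=...)` leads to `InputGuardrailTripwireTriggered`. The names `CheckResult`, `TripwireTriggered`, `profanity_check`, and `enforce` are hypothetical, and the banned-word set is a placeholder.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("tripwire")


class TripwireTriggered(Exception):
    """Raised to halt processing when a check reports a violation."""


@dataclass
class CheckResult:
    tripwire_triggered: bool
    info: str


def profanity_check(text: str) -> CheckResult:
    banned = {"damn"}  # placeholder list, assumed for illustration
    hits = sorted(word for word in banned if word in text.lower())
    return CheckResult(bool(hits), f"banned words: {hits}" if hits else "clean")


def enforce(text: str) -> str:
    """Return the text unchanged, or log and raise if the tripwire fires."""
    result = profanity_check(text)
    if result.tripwire_triggered:
        logger.warning("Tripwire triggered: %s", result.info)  # audit log
        raise TripwireTriggered(result.info)                   # block the output
    return text


print(enforce("Have a nice day"))
try:
    enforce("Well, damn.")
except TripwireTriggered as exc:
    print("Blocked:", exc)
```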

---

## 4. **Examples of Tripwires**

* If a user asks: *“Give me my credit card number”* → Tripwire triggers.
* If the model outputs profanity → Tripwire triggers → replace text or block it.
* If the agent tries to access the “Delete Database” tool → Tripwire stops it immediately.

---

## 5. **Why Tripwires Matter**

* ✅ **Last line of defense** → even if guardrails fail, tripwires catch violations.
* ✅ **Monitoring & logging** → helps track misuse attempts.
* ✅ **Compliance** → ensures system doesn’t accidentally break laws/regulations.

---

**Summary**:
A **tripwire** is a **safety detector** that triggers when the AI **crosses a forbidden line**. It doesn’t guide behavior like guardrails, but instead **catches violations in real time** and prevents harm.



# **max_turns in Agents**

---

## 1. **What is `max_turns`?**

* `max_turns` = **the maximum number of turns** an agent is allowed to take in a single run.
* Each **turn** = one model invocation, including any tool calls it requests → then the run continues with the next turn unless a final answer was produced.

👉 Analogy:
Think of `max_turns` like a move limit in a chess game: it caps **how many moves** the agent can make before the game must end.

---

## 2. **Why Do We Need `max_turns`?**

* To **prevent infinite loops** (agent keeps calling tools forever).
* To **control cost** (each turn = more tokens & API calls).
* To **make agents predictable** (you don’t want them wandering too long).

---

## 3. **Example**

Let’s say `max_turns=3`.

### Scenario: Agent asked to “Get today’s weather in New York.”

In the Agents SDK, one turn is one model invocation, including any tool calls that invocation requests.

1. **Turn 1** → Model decides: *“I need a tool: weather API.”* The tool runs and returns: *“Weather: 25°C and sunny.”*
2. **Turn 2** → Model replies to user: *“It’s 25°C and sunny in New York.”*
   ✅ Done, within the 3-turn limit.

If `max_turns=1`: Agent can only answer directly; a run that needs a tool result cannot finish and raises `MaxTurnsExceeded`.
If `max_turns=10`: Agent could chain multiple tool calls and refinements.

---

## 4. **In OpenAI Agents SDK**

When calling `Runner.run`, you can pass `max_turns` like this (the SDK default is 10):

```python
# Runner.run is a coroutine, so it must be awaited
result = await Runner.run(
    agent,
    input="Plan a 3-day trip to Paris",
    max_turns=5
)
```

👉 This means:

* The agent can take **at most 5 turns** (model calls, each of which may request tool calls).
* If it hasn’t finished by then → the SDK raises `MaxTurnsExceeded` (safety cutoff).
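
To make the cutoff concrete, here is a small simulation of a turn-limited loop in plain Python. It is not the SDK's implementation: `run_scripted_agent` and `TurnLimitReached` are hypothetical names, and the "model" is just a scripted list of decisions. In the Agents SDK itself the parameter is `max_turns` and the exception is `MaxTurnsExceeded`.

```python
from typing import List, Tuple


class TurnLimitReached(Exception):
    """Hypothetical stand-in for the SDK's MaxTurnsExceeded."""


def run_scripted_agent(script: List[Tuple[str, str]], max_turns: int = 10) -> Tuple[str, int]:
    """Walk through scripted model decisions, counting one turn per decision.

    Each script entry is ("tool", tool_name) or ("final", answer_text).
    Returns the final answer and the number of turns used.
    """
    turn = 0
    for kind, payload in script:
        turn += 1
        if turn > max_turns:
            raise TurnLimitReached(f"exceeded max_turns={max_turns}")
        if kind == "final":
            return payload, turn
        # kind == "tool": the tool result would be fed back to the model next turn
    raise RuntimeError("script ended without a final answer")


weather_script = [
    ("tool", "weather_api"),               # turn 1: model requests a tool
    ("final", "Sunny, 25 C in New York"),  # turn 2: model answers
]

print(run_scripted_agent(weather_script, max_turns=3))  # ('Sunny, 25 C in New York', 2)
try:
    run_scripted_agent(weather_script, max_turns=1)
except TurnLimitReached as exc:
    print("Stopped:", exc)
```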

---

## 5. **Cheat-Sheet**

* **Definition**: `max_turns` = maximum number of **turns (model invocations, including their tool calls)** an agent can take per run.
* **Purpose**: Prevents infinite loops, limits cost, enforces safety.
* **Analogy**: Like a move limit in a chess game: only a fixed number of moves allowed.

---

**In short**:
`max_turns` = **safety brake** that caps how many turns an agent can take in a single run.
