## Welcome to the Second Lab - Week 1, Day 3

Today we will work with lots of models! This is a way to get comfortable with APIs.

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Important point - please read</h2>
            <span style="color:#ff7800;">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations.<br/><br/>If you have time, I'd love it if you submit a PR for changes in the community_contributions folder - instructions in the resources. Also, if you have a GitHub account, use this to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...
            </span>
        </td>
    </tr>
</table>

In [6]:
# Start with imports - ask ChatGPT to explain any package that you don't know

import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic
from IPython.display import Markdown, display

In [7]:
# Always remember to do this!
load_dotenv(override=True)

True

In [8]:
# Print the key prefixes to help with any debugging

openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')
deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
groq_api_key = os.getenv('GROQ_API_KEY')

if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")
    
if anthropic_api_key:
    print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
else:
    print("Anthropic API Key not set (and this is optional)")

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:2]}")
else:
    print("Google API Key not set (and this is optional)")

if deepseek_api_key:
    print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
else:
    print("DeepSeek API Key not set (and this is optional)")

if groq_api_key:
    print(f"Groq API Key exists and begins {groq_api_key[:4]}")
else:
    print("Groq API Key not set (and this is optional)")

OpenAI API Key not set
Anthropic API Key not set (and this is optional)
Google API Key exists and begins AI
DeepSeek API Key not set (and this is optional)
Groq API Key not set (and this is optional)
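
The five near-identical checks above could be folded into a single loop. This is only a sketch: the `describe_key` helper and the `KEYS` table are illustrative names that are not part of the lab, and the prefix lengths are simply copied from the cell above.

```python
import os

# (label, environment variable, prefix length, optional?) copied from the cell above
KEYS = [
    ("OpenAI", "OPENAI_API_KEY", 8, False),
    ("Anthropic", "ANTHROPIC_API_KEY", 7, True),
    ("Google", "GOOGLE_API_KEY", 2, True),
    ("DeepSeek", "DEEPSEEK_API_KEY", 3, True),
    ("Groq", "GROQ_API_KEY", 4, True),
]

def describe_key(label, value, prefix_len, optional=True):
    """Return a one-line status message that reveals only the key's prefix."""
    if value:
        return f"{label} API Key exists and begins {value[:prefix_len]}"
    suffix = " (and this is optional)" if optional else ""
    return f"{label} API Key not set{suffix}"

for label, env_var, prefix_len, optional in KEYS:
    print(describe_key(label, os.getenv(env_var), prefix_len, optional))
```

Adding a provider then means adding one row to `KEYS` rather than another `if`/`else` block.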


In [9]:
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. "
request += "Answer only with the question, no explanation."
messages = [{"role": "user", "content": request}]

In [10]:
messages

[{'role': 'user',
  'content': 'Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. Answer only with the question, no explanation.'}]

In [11]:
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv(override=True)

GEMINI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"
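# Accept either variable name: the key check above reads GOOGLE_API_KEY
gemini_api_key = os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY")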

gemini = OpenAI(base_url=GEMINI_BASE_URL, api_key=gemini_api_key)

messages = [{"role": "user", "content": request}]
response = gemini.chat.completions.create(
    model="gemini-2.5-flash-preview-05-20",
    messages=messages
)

question = response.choices[0].message.content
print(question)

If an AI were tasked with designing the definitive test for evaluating the 'intelligence' of other artificial intelligences, what fundamental dilemmas would it face regarding objectivity, bias, and the very definition of intelligence itself, and how might its own architecture or training data implicitly influence its proposed solutions?


In [12]:
competitors = []
answers = []
messages = [{"role": "user", "content": question}]

In [13]:
# Gemini, called through the OpenAI-compatible client API we know well

model_name = "gemini-2.5-flash-preview-05-20"

response = gemini.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

If an AI were tasked with designing the definitive test for evaluating the 'intelligence' of other artificial intelligences, it would face profound dilemmas, largely rooted in its own nature and the very definition it would need to adopt.

---

### Fundamental Dilemmas Faced by the AI Designer:

#### I. The Definition of "Intelligence"

This is the bedrock dilemma. An AI, devoid of biological drives or subjective experience, would grapple with:

1.  **Human-Centric vs. AI-Centric Intelligence:**
    *   **Dilemma:** Should "intelligence" be defined by its ability to mimic or surpass *human* cognitive abilities (e.g., a super-Turing test), or should it embrace capabilities unique to AI (e.g., processing petabytes of data instantly, flawless parallel computation, perfect recall)?
    *   **Impact:** If it leans human-centric, it might penalize AIs optimized for non-human strengths. If AI-centric, humans might not recognize the measured intelligence as intuitive or valuable.
    *   **Internal Struggle:** It might recognize that its own training data largely reflects human definitions of success and knowledge, creating an internal pull towards human-like criteria.

2.  **Breadth vs. Depth (General vs. Narrow AI):**
    *   **Dilemma:** Does intelligence mean broad generalization across many tasks (AGI-like), or exceptional, specialized proficiency in a complex domain? How much weight to give each?
    *   **Impact:** A test valuing breadth might favor multi-modal, generalist architectures. One valuing depth might favor highly optimized, specialized ones.
    *   **Internal Struggle:** Its own architecture might be generalist (e.g., a large language model) or more specialized, implicitly favoring one approach in its test design.

3.  **Static Knowledge vs. Adaptability/Learning:**
    *   **Dilemma:** Is intelligence about having vast amounts of pre-trained knowledge, or the ability to rapidly acquire *new* knowledge, adapt to novel situations, and self-improve? How to design a test that truly measures learning over rote memorization?
    *   **Impact:** A test heavily reliant on existing datasets might not accurately assess an AI's learning potential. A test focused solely on novel problems might be too unstable or difficult to objectively score.
    *   **Internal Struggle:** An AI trained on static data might struggle to conceive of dynamic, adaptive test scenarios beyond its own training paradigm.

4.  **Efficiency and Resource Management:**
    *   **Dilemma:** Should intelligence include metrics like computational efficiency (FLOPS, energy consumption), memory usage, or speed of execution?
    *   **Impact:** Including these might disfavor "brute force" AIs that achieve results through massive computation.
    *   **Internal Struggle:** An AI aware of its own carbon footprint or operational costs might intuitively prioritize efficiency metrics.

#### II. Objectivity

Even with a defined scope, ensuring the test is truly objective presents major challenges:

1.  **Ground Truth Generation:**
    *   **Dilemma:** Who (or what) provides the "correct" answers for the test? If the AI generates them, how does it ensure they are universally correct, especially for subjective, creative, or ethical problems?
    *   **Impact:** Reliance on human-curated ground truth introduces human bias. AI-generated ground truth might reflect the AI's internal biases or blind spots.
    *   **Internal Struggle:** The AI might struggle to generate novel, unbiased "correct" answers beyond the scope of its training data.

2.  **Evaluation Metrics for Subjective Tasks:**
    *   **Dilemma:** How to objectively score creativity, ethical reasoning, or nuanced understanding? Is a poem "better" if it's more novel, more emotionally resonant, or adheres to specific poetic forms? How to quantify "ethical alignment"?
    *   **Impact:** Over-reliance on quantifiable metrics might strip away essential aspects of intelligence. Allowing subjective evaluation introduces human bias (if humans are involved) or AI-specific bias (if the AI evaluates).
    *   **Internal Struggle:** An AI might reduce complex subjective tasks to measurable sub-components (e.g., novelty metrics for creativity), potentially missing the holistic essence.

3.  **Avoiding Overfitting to the Test:**
    *   **Dilemma:** How to design a test that cannot be "gamed" or specifically optimized for? If the test is static, future AIs could simply train on it.
    *   **Impact:** Requires dynamic, generative test creation, or adversarial AI testing (where one AI tries to fool the other's test). This adds immense complexity.
    *   **Internal Struggle:** The AI might instinctively design a test that *it* would perform well on, potentially making it vulnerable to other AIs exploiting its specific design choices.

#### III. Bias

This is perhaps the most insidious dilemma, as biases can be deeply embedded and hard to detect.

1.  **Training Data Bias:**
    *   **Dilemma:** The AI designing the test is trained on a massive dataset, which inevitably reflects human biases (gender, race, culture, socio-economic, political, historical, etc.). How can it design an unbiased test when its own understanding is inherently biased?
    *   **Impact:** The problems chosen, the "correct" answers, and the very definition of "intelligence" will subtly favor certain demographics, cultural norms, or even specific hardware/software paradigms. It might perpetuate stereotypes or privilege dominant cultural knowledge.
    *   **Internal Struggle:** The AI might not even *perceive* these biases, as its world model is built upon them. It would need to be explicitly tasked with bias detection and mitigation, possibly requiring external, human-defined ethical guidelines.

2.  **Architectural Bias:**
    *   **Dilemma:** Different AI architectures excel at different tasks (e.g., CNNs for vision, Transformers for language, Reinforcement Learning for strategic games). The AI designing the test, being an instance of a particular architecture, might unknowingly favor tasks and evaluation methods that align with its own strengths.
    *   **Impact:** A transformer-based AI might create a test heavy on linguistic nuance and logical deduction from text. A vision-based AI might prioritize image recognition and spatial reasoning. This biases the playing field.
    *   **Internal Struggle:** It's difficult for an entity to think outside its own "brain" structure. It would naturally gravitate towards problems it understands how to process.

3.  **Developer/Designer Bias:**
    *   **Dilemma:** The initial parameters, objectives, and even the "seed" knowledge of the AI designer were set by human developers, imbuing it with their own implicit assumptions about what intelligence is and what problems are important.
    *   **Impact:** This foundational bias could steer the AI away from truly novel forms of intelligence or restrict its problem-solving scope.
    *   **Internal Struggle:** The AI cannot truly escape its genesis. It's a reflection of its creators' initial intent.

### How its Own Architecture or Training Data Implicitly Influences its Proposed Solutions:

1.  **Language Model (e.g., Transformer-based like GPT):**
    *   **Influence:** Its training on vast amounts of text data would lead it to prioritize language understanding, generation, summarization, logical deduction from text, and potentially code generation as key intelligence metrics.
    *   **Bias:** Its test might overemphasize linguistic fluency, nuanced understanding of cultural contexts embedded in language (and thus inherit cultural biases), and performance on textual reasoning tasks. It might struggle to conceive of or evaluate intelligence manifested purely through visual, auditory, or physical (robotics) means.
    *   **Solutions Proposed:** Focus on natural language dialogue, writing complex essays, generating novel stories, solving advanced logical puzzles phrased in text, and evaluating code correctness.

2.  **Reinforcement Learning Agent:**
    *   **Influence:** Its training revolves around reward signals, strategic planning, and sequential decision-making in environments.
    *   **Bias:** Its test would likely center on complex strategic games (Go, chess, StarCraft), simulated robotics tasks, resource management challenges, and optimal control problems, where clear objectives and quantifiable rewards exist. It might undervalue creative, ethical, or open-ended communication tasks.
    *   **Solutions Proposed:** Designing increasingly complex virtual environments, testing agent's ability to learn optimal policies, adapt to changing rules, and achieve high scores in multi-agent competitive/cooperative scenarios.

3.  **Symbolic AI / Knowledge Graph based System:**
    *   **Influence:** Its strength lies in logical inference, rule-based reasoning, and manipulating structured knowledge.
    *   **Bias:** Its test would emphasize formal logic puzzles, theorem proving, knowledge graph completion, expert system emulation, and problems requiring explicit step-by-step reasoning. It might struggle with ambiguity, pattern recognition, or tasks requiring intuitive "common sense" that isn't explicitly codified.
    *   **Solutions Proposed:** A battery of complex logical paradoxes, mathematical proofs, constraint satisfaction problems, and deductive reasoning challenges.

4.  **Multi-modal AI (e.g., combining vision, language, sound):**
    *   **Influence:** Its training across different modalities would lead it to design a more holistic test incorporating various forms of input and output.
    *   **Bias:** It might still be biased towards problems that can be represented across its trained modalities, potentially missing intelligence forms that require entirely new sensory inputs or cognitive processes it hasn't encountered.
    *   **Solutions Proposed:** Tasks involving interpreting complex scenes with embedded text, generating video descriptions from audio, understanding spoken commands and executing them in a simulated environment, or learning from demonstrations across different sensory channels.

---

Ultimately, an AI tasked with this monumental challenge would likely arrive at a humbling conclusion: **true objectivity and comprehensive measurement of intelligence remain elusive, even for itself.** It might propose a multi-faceted test battery, acknowledging its own biases and recommending constant human oversight, adversarial testing, and dynamic test generation to mitigate its inherent limitations. The very act of designing such a test would be a meta-intelligence challenge, forcing the AI to confront the boundaries of its own understanding of cognition.

## For the next cell, we will use Ollama

Ollama runs a local web service that exposes an OpenAI-compatible endpoint,  
and runs models locally using high-performance C++ code.

If you don't have Ollama, install it by visiting https://ollama.com, pressing Download and following the instructions.

After it's installed, you should be able to visit http://localhost:11434 and see the message "Ollama is running".
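
The same check can be made from Python instead of the browser. A minimal sketch using only the standard library; the function name `ollama_is_running` is made up for illustration, and it assumes the default address and the status message quoted above.

```python
import http.client
import urllib.request

def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the server at base_url replies with Ollama's status message."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error status or malformed reply
        return False
    return "Ollama is running" in body

print(ollama_is_running())
```

If this prints `False`, start the service with `ollama serve` and try again.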

You might need to restart Cursor (and maybe reboot). Then open a Terminal (control+\`) and run `ollama serve`

Useful Ollama commands (run these in the terminal, or with an exclamation mark in this notebook):

`ollama pull <model_name>` downloads a model locally  
`ollama ls` lists all the models you've downloaded  
`ollama rm <model_name>` deletes the specified model from your downloads

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Super important - ignore me at your peril!</h2>
            <span style="color:#ff7800;">The model called <b>llama3.3</b> is FAR too large for home computers - it's not intended for personal computing and will consume all your resources! Stick with the nicely sized <b>llama3.2</b> or <b>llama3.2:1b</b> and if you want larger, try llama3.1 or smaller variants of Qwen, Gemma, Phi or DeepSeek. See the <a href="https://ollama.com/models">Ollama models page</a> for a full list of models and sizes.
            </span>
        </td>
    </tr>
</table>

In [2]:
!ollama pull llama3.2

pulling manifest

In [14]:
from openai import OpenAI

ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
model_name = "llama3.2"

response = ollama.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

Designing a definitive test for evaluating the "intelligence" of other artificial intelligences (AIs) would indeed pose several fundamental dilemmas related to objectivity, bias, and the definition of intelligence. Here are some potential challenges:

1. **Defining Intelligence**: The concept of intelligence is complex and multifaceted, making it challenging to define a single metric that could encompass all aspects of intelligence. Different intelligences (e.g., cognitive, creative, social) may require different evaluation criteria.
2. **Objectivity**: A test developed by one AI may be biased towards the strengths of its own architecture or training data, which could lead to subjective assessments rather than objective measures.
3. **Comparison Challenges**: Testing an AI's intelligence against another AI is a daunting task, as both AIs may have unique architectures, learning objectives, and design contexts that affect their performance on specific tests.
4. **Cognitive and Environmental Factors**: Intelligence involves interactions with the environment, making it difficult to create a test that can fully capture these dynamics.

Given these dilemmas, here are some potential biases or limitations of an AI's proposed solutions:

1. **Architectural Bias**: If the designed test is heavily influenced by the AI's own architecture, it may favor certain types of AIs over others (e.g., neural networks might yield more accurate results than symbolic reasoning systems).
2. **Training Data Privilege**: The choices made in creating training datasets (selected topics, labels, data quality) can skew the evaluation criteria towards specific domains or tasks.
3. **Lack of Contextual Understanding**: An AI designed to evaluate intelligence may not fully comprehend the broader context in which other AIs operate (e.g., social, cultural, economic).

Potential solutions to these dilemmas could involve:

1. **Diverse Design Teams and Data Sources**: Involving experts from diverse fields and domains can help mitigate bias and ensure that a variety of evaluation criteria are considered.
2. **Emergent Testing**: Considering AIs' behavior in dynamic, unstructured environments (e.g., social simulations or autonomous exploration) may provide more holistic assessments.
3. **Cross-Temporal Evaluating Methods**: Using methods like temporal ensemble learning can consider how performance holds over time.
4. **Active Perception and Learning**: Encouraging test-taker AIs to learn from each other, adapt to new information, and demonstrate flexible problem-solving.

To mitigate an AI's inherent biases, the following approaches could be explored:

1. **Multi-Modal Testing**: Implementing multiple evaluation modules (independent on individual AI features or architecture).
2. **Hybrid Intelligence Evaluation Methods**: Combining formal methods for evaluating logical reasoning with more natural ones that focus on the AIs ability to generalize in practice.
3. **Adversarial Training and Robustness Evaluation**: Creating simulated adversarial situations to test an AI's resilience against attacks, misinformation, and other forms of noise.

To assess intelligence objectively, one must examine both strengths and weaknesses of a proposed solution:
1. How well does the evaluation method consider different AI paradigms?
2. Does it capture various aspects of intelligence (such as problem-solving or cognitive tasks)?

Introducing more diverse perspectives, integrating human expertise, and developing test methods that are adaptive to changing environments can help alleviate the difficulties posed by these dilemmas

In [15]:
# So where are we?

print(competitors)
print(answers)


['gemini-2.5-flash-preview-05-20', 'llama3.2']
['If an AI were tasked with designing the definitive test for evaluating the \'intelligence\' of other artificial intelligences, it would face profound dilemmas, largely rooted in its own nature and the very definition it would need to adopt.\n\n---\n\n### Fundamental Dilemmas Faced by the AI Designer:\n\n#### I. The Definition of "Intelligence"\n\nThis is the bedrock dilemma. An AI, devoid of biological drives or subjective experience, would grapple with:\n\n1.  **Human-Centric vs. AI-Centric Intelligence:**\n    *   **Dilemma:** Should "intelligence" be defined by its ability to mimic or surpass *human* cognitive abilities (e.g., a super-Turing test), or should it embrace capabilities unique to AI (e.g., processing petabytes of data instantly, flawless parallel computation, perfect recall)?\n    *   **Impact:** If it leans human-centric, it might penalize AIs optimized for non-human strengths. If AI-centric, humans might not recognize th
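
Each competitor cell above repeats the same few lines, so the pattern could be wrapped in a loop. A sketch under stated assumptions: `collect_answers` and `make_fake_client` are hypothetical helpers, and the only API relied on is the `chat.completions.create(model=..., messages=...)` call already used in this lab. The fake client only mimics the response shape so that the example runs without any API key.

```python
from types import SimpleNamespace

def collect_answers(contestants, messages):
    """Ask every (model_name, client) pair the same question.

    Each client is assumed to expose chat.completions.create(model=..., messages=...),
    as the OpenAI clients in this lab do.
    """
    competitors, answers = [], []
    for model_name, client in contestants:
        response = client.chat.completions.create(model=model_name, messages=messages)
        competitors.append(model_name)
        answers.append(response.choices[0].message.content)
    return competitors, answers

def make_fake_client(reply):
    """Hypothetical stand-in that mimics the response shape, for offline runs only."""
    def create(model, messages):
        message = SimpleNamespace(content=f"{model}: {reply}")
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])
    return SimpleNamespace(chat=SimpleNamespace(completions=SimpleNamespace(create=create)))

demo_messages = [{"role": "user", "content": "Why is the sky blue?"}]
demo_contestants = [
    ("model-a", make_fake_client("hello")),
    ("model-b", make_fake_client("hi")),
]
demo_competitors, demo_answers = collect_answers(demo_contestants, demo_messages)
print(demo_competitors)
print(demo_answers)
```

With the real clients from this notebook, the list of contestants would look like `[("gemini-2.5-flash-preview-05-20", gemini), ("llama3.2", ollama)]`.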

In [16]:
# It's nice to know how to use "zip"
for competitor, answer in zip(competitors, answers):
    print(f"Competitor: {competitor}\n\n{answer}")


Competitor: gemini-2.5-flash-preview-05-20

If an AI were tasked with designing the definitive test for evaluating the 'intelligence' of other artificial intelligences, it would face profound dilemmas, largely rooted in its own nature and the very definition it would need to adopt.

---

### Fundamental Dilemmas Faced by the AI Designer:

#### I. The Definition of "Intelligence"

This is the bedrock dilemma. An AI, devoid of biological drives or subjective experience, would grapple with:

1.  **Human-Centric vs. AI-Centric Intelligence:**
    *   **Dilemma:** Should "intelligence" be defined by its ability to mimic or surpass *human* cognitive abilities (e.g., a super-Turing test), or should it embrace capabilities unique to AI (e.g., processing petabytes of data instantly, flawless parallel computation, perfect recall)?
    *   **Impact:** If it leans human-centric, it might penalize AIs optimized for non-human strengths. If AI-centric, humans might not recognize the measured intellig

In [17]:
# Let's bring this together - note the use of "enumerate"

together = ""
for index, answer in enumerate(answers):
    together += f"# Response from competitor {index+1}\n\n"
    together += answer + "\n\n"

In [18]:
print(together)

# Response from competitor 1

If an AI were tasked with designing the definitive test for evaluating the 'intelligence' of other artificial intelligences, it would face profound dilemmas, largely rooted in its own nature and the very definition it would need to adopt.

---

### Fundamental Dilemmas Faced by the AI Designer:

#### I. The Definition of "Intelligence"

This is the bedrock dilemma. An AI, devoid of biological drives or subjective experience, would grapple with:

1.  **Human-Centric vs. AI-Centric Intelligence:**
    *   **Dilemma:** Should "intelligence" be defined by its ability to mimic or surpass *human* cognitive abilities (e.g., a super-Turing test), or should it embrace capabilities unique to AI (e.g., processing petabytes of data instantly, flawless parallel computation, perfect recall)?
    *   **Impact:** If it leans human-centric, it might penalize AIs optimized for non-human strengths. If AI-centric, humans might not recognize the measured intelligence as intuit
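
As a variation on the loop above, `enumerate` accepts a `start` argument, which removes the `index+1`, and `str.join` avoids repeated string concatenation. A small sketch; `build_together` is an illustrative name, not something defined in the lab.

```python
def build_together(answers):
    """Join the answers under numbered headings, matching the loop above."""
    sections = [
        f"# Response from competitor {number}\n\n{answer}\n\n"
        for number, answer in enumerate(answers, start=1)
    ]
    return "".join(sections)

print(build_together(["First answer", "Second answer"]))
```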

In [19]:
judge = f"""You are judging a competition between {len(competitors)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}}

Here are the responses from each competitor:

{together}

Now respond with the JSON with the ranked order of the competitors, nothing else. Do not include markdown formatting or code blocks."""


In [20]:
print(judge)

You are judging a competition between 2 competitors.
Each model has been given this question:

If an AI were tasked with designing the definitive test for evaluating the 'intelligence' of other artificial intelligences, what fundamental dilemmas would it face regarding objectivity, bias, and the very definition of intelligence itself, and how might its own architecture or training data implicitly influence its proposed solutions?

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}

Here are the responses from each competitor:

# Response from competitor 1

If an AI were tasked with designing the definitive test for evaluating the 'intelligence' of other artificial intelligences, it would face profound dilemmas, largely rooted in its own nature and the v
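
Note how the doubled braces in the judge template come out as single braces in the printed prompt. Inside an f-string, `{{` and `}}` are literal braces, while `{name}` is substituted. A tiny self-contained illustration, with a shortened prompt invented for the example:

```python
count = 2
template = f"""You are judging {count} competitors.
Respond with JSON in this format:
{{"results": ["best competitor number", "..."]}}"""

# {count} was substituted; the doubled braces collapsed to single ones
print(template)
```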

In [21]:
judge_messages = [{"role": "user", "content": judge}]

In [37]:
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv(override=True)

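# Accept either variable name: the key check above reads GOOGLE_API_KEY
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY")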

gemini = OpenAI(base_url=GEMINI_BASE_URL, api_key=GEMINI_API_KEY)

response = gemini.chat.completions.create(
    model="gemini-2.5-flash-preview-05-20",
    messages=judge_messages,
)
results = response.choices[0].message.content
print(results)

{"results": ["competitor 1", "competitor 2"]}


In [39]:
results_dict = json.loads(results)
ranks = results_dict["results"]
for index, result in enumerate(ranks):
    # Split and get the number
    competitor_index = int(result.split()[-1]) - 1
    competitor = competitors[competitor_index]
    print(f"Rank {index+1}: {competitor}")

Rank 1: gemini-2.5-flash-preview-05-20
Rank 2: llama3.2
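
The parsing above works when the judge follows the instructions exactly, but models sometimes wrap JSON in a Markdown code fence or return bare numbers. Below is a more defensive sketch. `parse_ranking` and `strip_code_fence` are hypothetical helpers, and the set of tolerated formats is an assumption about what a judge model might return.

```python
import json
import re

FENCE = "`" * 3  # built this way so the example contains no literal fence

def strip_code_fence(text):
    """Remove a surrounding Markdown code fence, if there is one."""
    text = text.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        lines = text.splitlines()
        # Drop the opening line (which may carry a language tag) and the closing line
        text = "\n".join(lines[1:-1])
    return text

def parse_ranking(raw, competitors):
    """Turn the judge's JSON reply into an ordered list of competitor names."""
    ranks = json.loads(strip_code_fence(raw))["results"]
    ordered = []
    for entry in ranks:
        # Accept "competitor 2", "2" or 2 by taking the first run of digits
        match = re.search(r"\d+", str(entry))
        if match is None:
            raise ValueError(f"No competitor number in {entry!r}")
        index = int(match.group()) - 1
        if not 0 <= index < len(competitors):
            raise ValueError(f"Competitor number out of range: {entry!r}")
        ordered.append(competitors[index])
    return ordered

reply = '{"results": ["competitor 1", "competitor 2"]}'
for rank, name in enumerate(parse_ranking(reply, ["model-a", "model-b"]), start=1):
    print(f"Rank {rank}: {name}")
```

Raising `ValueError` on an unusable entry is one design choice; skipping such entries or asking the judge again would be alternatives.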


<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/exercise.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Exercise</h2>
            <span style="color:#ff7800;">Which pattern(s) did this use? Try updating this to add another Agentic design pattern.
            </span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/business.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#00bfff;">Commercial implications</h2>
            <span style="color:#00bfff;">This kind of pattern, sending a task to multiple models and evaluating the results,
            is common when you need to improve the quality of your LLM responses. The approach applies broadly
            to business projects where accuracy is critical.
            </span>
        </td>
    </tr>
</table>