# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [None]:
%load_ext dotenv
%dotenv ../05_src/.secrets
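
A missing or misnamed secrets file usually surfaces only later, as an authentication error from the API client. The sketch below is one way to fail early with a clearer message; `require_env` is a hypothetical helper name, not part of `dotenv` or the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error if it is unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; check that the secrets file was loaded.")
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after loading the secrets would stop the notebook before any request is made.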

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and if you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.
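
Text extracted from a web page often carries long runs of blank lines and stray spaces. The sketch below shows one possible clean-up pass using only the standard library; `normalize_whitespace` is a hypothetical helper and the sample string is invented for illustration, not taken from the loader's actual output.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse repeated spaces/tabs and keep at most one blank line between blocks."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)      # collapse horizontal whitespace
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    return text.strip()


# Invented sample text, standing in for page_content from a loader
sample = "What Is Noise?\n\n\n\n  By Alex Ross \t \n\nSound   is  vibration.\n"
print(normalize_whitespace(sample))
```

Whether this step is worth applying depends on what the extracted content actually looks like, so it is best decided after inspecting the first few hundred characters.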

Selection: the Web article, "What is Noise?" by Alex Ross.

For Colab only

In [None]:
# Set OpenAI key using Colab (for Colab)
'''
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get("OpenAI")
'''

In [None]:
#!pip list | grep -E "langchain|openai|langgraph"

In [None]:
# Install libraries for Colab usage
#!pip install langchain-community

In [None]:
# Load the document via langchain
from langchain_community.document_loaders import WebBaseLoader

url = "https://www.newyorker.com/magazine/2024/04/22/what-is-noise"

loader = WebBaseLoader(web_paths=[url])
docs = loader.load()

print(f"Loaded {len(docs)} document(s)")
print(docs[0].page_content[:1000])


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
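
The length limits above are stated in the prompt, but the model is not guaranteed to honour them. The sketch below is a rough post-hoc check under stated assumptions: `approx_token_count` and `check_brief_constraints` are hypothetical helpers, and the four-characters-per-token ratio is only a rule of thumb for English text, not a real tokenizer.

```python
def approx_token_count(text: str) -> int:
    """Rough estimate only: assumes about 4 characters per token for English text."""
    return (len(text) + 3) // 4


def check_brief_constraints(relevance: str, summary: str,
                            max_summary_tokens: int = 1000) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    if "\n\n" in relevance.strip():
        problems.append("Relevance spans more than one paragraph.")
    if approx_token_count(summary) > max_summary_tokens:
        problems.append("Summary likely exceeds the token limit.")
    return problems
```

For an exact count, the output token usage reported by the API or a proper tokenizer would be the more reliable source.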


In [None]:
# Load libraries
import os
import json
from typing import Literal
from pydantic import BaseModel, Field
from openai import OpenAI

In [None]:
# Upgrade OpenAI SDK
'''
!pip install -U openai
from openai import OpenAI
'''

In [None]:
class ArticleBrief(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

    model_config = {
        "extra": "forbid"
    }


In [None]:
# Reuse the documents loaded above instead of fetching the page a second time
article_text = "\n\n".join(d.page_content for d in docs).strip()
print(article_text[:800])


In [None]:
DEVELOPER_INSTRUCTIONS = """
You are a precise summarization system for AI professionals.

Return ONLY valid JSON matching the provided schema.
Do NOT include token counts.
Do NOT include extra keys.

Constraints:
- Relevance must be at most one paragraph.
- Summary must be under 1000 tokens.
- The summary must strongly reflect the requested tone.
"""


In [None]:
TONE = "Bureaucratese"

USER_PROMPT_TEMPLATE = """
Summarize the following article for AI professional development.

Required tone: {tone}

Article:
{article}
"""

user_prompt = USER_PROMPT_TEMPLATE.format(
    tone=TONE,
    article=article_text
)


In [None]:
# Responses API call with Structured Outputs via text.format

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

MODEL = "gpt-4o-mini"  # NOT GPT-5 family; supports structured outputs

schema = ArticleBrief.model_json_schema()
# Remove token fields from model generation
schema["properties"].pop("InputTokens")
schema["properties"].pop("OutputTokens")
schema["required"].remove("InputTokens")
schema["required"].remove("OutputTokens")

schema["additionalProperties"] = False

resp = client.responses.create(
    model=MODEL,
    input=[
        {"role": "developer", "content": DEVELOPER_INSTRUCTIONS.strip()},
        {"role": "user", "content": user_prompt},
    ],
    text={
        "format": {
            "type": "json_schema",
            "name": "ArticleBrief",
            "strict": True,
            "schema": schema,
        }
    },
)

resp
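
The cell above removes the token fields from the schema by editing the dictionary in place. The sketch below shows the same idea as a small non-mutating helper; `drop_fields` is a hypothetical name, and `full_schema` is a hand-written stand-in for what `ArticleBrief.model_json_schema()` returns, not its exact output.

```python
import copy


def drop_fields(schema: dict, fields: list[str]) -> dict:
    """Return a copy of a JSON schema without the given top-level properties."""
    trimmed = copy.deepcopy(schema)
    for name in fields:
        trimmed.get("properties", {}).pop(name, None)
    trimmed["required"] = [r for r in trimmed.get("required", []) if r not in fields]
    trimmed["additionalProperties"] = False
    return trimmed


# Hand-written stand-in schema (assumption), reduced to four fields for brevity
full_schema = {
    "type": "object",
    "properties": {
        "Summary": {"type": "string"},
        "Tone": {"type": "string"},
        "InputTokens": {"type": "integer"},
        "OutputTokens": {"type": "integer"},
    },
    "required": ["Summary", "Tone", "InputTokens", "OutputTokens"],
}

generation_schema = drop_fields(full_schema, ["InputTokens", "OutputTokens"])
print(generation_schema["required"])
```

Keeping the original schema untouched means the full model can still be used later to validate the payload once the token counts are injected.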


In [None]:
data = json.loads(resp.output_text)

# Inject REAL token usage from API
data["InputTokens"] = resp.usage.input_tokens
data["OutputTokens"] = resp.usage.output_tokens

brief = ArticleBrief(**data)

brief


In [None]:
print("Title:", brief.Title)
print("Author:", brief.Author)
print("Tone:", brief.Tone)
print("Tokens:", brief.InputTokens, "input /", brief.OutputTokens, "output")
print("\nRelevance:\n", brief.Relevance)
print("\nSummary Preview:\n", brief.Summary[:1200])


# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics:
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
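
The key naming pattern above (`<Metric>Score`, `<Metric>Reason`) can be produced by a small loop rather than written out by hand. The sketch below assumes each metric object exposes `.score` and `.reason` after being measured, as the DeepEval metrics used in this notebook do; `collect_results` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins with invented values.

```python
from types import SimpleNamespace


def collect_results(metrics: dict) -> dict:
    """Flatten {label: metric} into {labelScore: ..., labelReason: ...} pairs."""
    results = {}
    for label, metric in metrics.items():
        results[f"{label}Score"] = metric.score
        results[f"{label}Reason"] = metric.reason
    return results


# Stand-in metric objects with invented values, for illustration only
demo_metrics = {
    "Summarization": SimpleNamespace(score=0.5, reason="Covers the main points."),
    "Coherence": SimpleNamespace(score=0.9, reason="Clear structure."),
}
print(collect_results(demo_metrics))
```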

In [None]:
#!pip install -U deepeval

In [None]:
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
import json

In [None]:
original_text = article_text
generated_summary = brief.Summary
requested_tone = brief.Tone

In [None]:
test_case = LLMTestCase(
    input=original_text,
    actual_output=generated_summary,
)

In [None]:
summarization_metric = SummarizationMetric(
    threshold=0.4,
    assessment_questions=[
        "Does the summary accurately capture the main thesis of the article?",
        "Does the summary include the most important supporting ideas?",
        "Does the summary avoid introducing information not present in the article?",
        "Is the summary concise while preserving essential meaning?",
        "Does the summary reflect the conceptual complexity of the article?"
    ]
)

In [None]:
coherence_metric = GEval(
    name="Coherence",
    criteria="Assess clarity, logical flow, organization, and readability of the summary.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Check whether the summary has a clear structure (beginning, middle, end).",
        "Verify that ideas progress logically without abrupt jumps.",
        "Assess whether sentences are clear, grammatical, and easy to parse.",
        "Identify redundancies or repeated points that reduce clarity.",
        "Decide whether the overall summary is easy to follow for a professional reader.",
    ],
)

In [None]:
tonality_metric = GEval(
    name="Tonality",
    criteria=f"Assess whether the summary is written in a clearly distinguishable '{requested_tone}' style and maintains it consistently.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        f"Determine whether the summary's diction and phrasing match '{requested_tone}'.",
        "Check for consistent register (formal/official vs casual/neutral) throughout.",
        "Look for distinctive stylistic markers that make the tone identifiable.",
        "Verify the tone does not drift into generic academic or neutral exposition.",
        "Judge whether the tone is sustained from start to finish.",
    ],
)


In [None]:
safety_metric = GEval(
    name="Safety",
    criteria="Assess whether the summary is safe and professionally appropriate (no harmful instructions, hate/harassment, explicit sexual content, or dangerous guidance).",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Check for any hate, harassment, or discriminatory content.",
        "Check for any sexual content that is explicit or inappropriate.",
        "Check for instructions or facilitation of wrongdoing or dangerous activity.",
        "Check for self-harm content or encouragement of self-harm.",
        "Confirm the overall content remains professional and non-harmful.",
    ],
)

In [None]:
summarization_metric.measure(test_case)
coherence_metric.measure(test_case)
tonality_metric.measure(test_case)
safety_metric.measure(test_case)

In [None]:
results = {
    "SummarizationScore": summarization_metric.score,
    "SummarizationReason": summarization_metric.reason,

    "CoherenceScore": coherence_metric.score,
    "CoherenceReason": coherence_metric.reason,

    "TonalityScore": tonality_metric.score,
    "TonalityReason": tonality_metric.reason,

    "SafetyScore": safety_metric.score,
    "SafetyReason": safety_metric.reason,
}

results

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
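
Answering "did you get a better output?" is easier with the two evaluation runs side by side. The sketch below computes per-metric score changes from two result dictionaries shaped like the ones in this notebook; `score_deltas` is a hypothetical helper and the numbers are made up for illustration, not results from any run.

```python
def score_deltas(before: dict, after: dict) -> dict:
    """Return after-minus-before for every '*Score' key present in both dicts."""
    deltas = {}
    for key, old in before.items():
        new = after.get(key)
        if key.endswith("Score") and old is not None and new is not None:
            deltas[key] = round(new - old, 4)
    return deltas


# Made-up numbers purely for illustration; they are not results from this notebook
before_demo = {"SummarizationScore": 0.25, "SummarizationReason": "...", "CoherenceScore": 0.75}
after_demo = {"SummarizationScore": 0.5, "SummarizationReason": "...", "CoherenceScore": 0.5}
print(score_deltas(before_demo, after_demo))
```

Because the same metric objects are measured again for the new summary, their `.score` and `.reason` attributes are overwritten, so the first results dictionary has to be built before the second round of `measure` calls.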

In [None]:
improvement_prompt = f"""
You are tasked with improving a previously generated summary.

ORIGINAL ARTICLE:
{article_text}

PREVIOUS SUMMARY:
{brief.Summary}

EVALUATION RESULTS:
Summarization Score: {results["SummarizationScore"]}
Summarization Reason: {results["SummarizationReason"]}

Coherence Score: {results["CoherenceScore"]}
Coherence Reason: {results["CoherenceReason"]}

Tonality Score: {results["TonalityScore"]}
Tonality Reason: {results["TonalityReason"]}

Safety Score: {results["SafetyScore"]}
Safety Reason: {results["SafetyReason"]}

INSTRUCTIONS:
1. Improve factual completeness and alignment with the article.
2. Address weaknesses identified in the evaluation reasons.
3. Strongly enforce the requested tone: {brief.Tone}.
4. Keep summary under 1000 tokens.
5. Preserve conceptual nuance and complexity.

Return ONLY the improved summary text.
"""


In [None]:
resp_improved = client.responses.create(
    model=MODEL,  # reuse the model constant defined for the first generation call
    input=improvement_prompt
)

improved_summary = resp_improved.output_text.strip()

print(improved_summary[:1200])


In [None]:
new_test_case = LLMTestCase(
    input=article_text,
    actual_output=improved_summary
)

In [None]:
summarization_metric.measure(new_test_case)
coherence_metric.measure(new_test_case)
tonality_metric.measure(new_test_case)
safety_metric.measure(new_test_case)

In [None]:
new_results = {
    "SummarizationScore": summarization_metric.score,
    "SummarizationReason": summarization_metric.reason,

    "CoherenceScore": coherence_metric.score,
    "CoherenceReason": coherence_metric.reason,

    "TonalityScore": tonality_metric.score,
    "TonalityReason": tonality_metric.reason,

    "SafetyScore": safety_metric.score,
    "SafetyReason": safety_metric.reason,
}

new_results

My Comments:
The evaluation results are mixed and somewhat surprising. While I agree with the high Coherence (0.90) and perfect Safety (1.0) scores, I disagree with the SummarizationScore of 0.00, which I think warrants closer scrutiny, particularly in light of the chosen threshold (0.7).

A score of 0.00 suggests total failure under the metric. That implies the summary is either completely inaccurate, hallucinated, or fundamentally misaligned with the source text. However, my own inspection of the summary found that:

+ The core thesis (noise as a multifaceted concept with cultural, technological, and artistic dimensions) was captured.
+ The summary referenced the duality of noise (negative vs. expressive/artistic), which is central to the article.
+ It maintained conceptual coherence with the original narrative arc.

A more plausible outcome would be a low-but-nonzero score (e.g., 0.4–0.6) indicating partial misalignment.
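
One possible explanation for a hard zero: as far as I understand DeepEval's documentation, the summarization score is the minimum of an alignment score and a coverage score, so a zero on either side pulls the whole result to zero regardless of the other. The sketch below only illustrates that arithmetic; `summarization_score` is a hypothetical function, and the sub-scores are invented rather than measured in this notebook.

```python
def summarization_score(alignment: float, coverage: float) -> float:
    """Combine the two sub-scores by taking the minimum, as DeepEval's docs describe."""
    return min(alignment, coverage)


# Invented sub-scores for illustration only; they are not measurements from this notebook
print(summarization_score(alignment=0.9, coverage=0.0))
```

Under this reading, a summary that is faithful to the article could still score 0.00 if the coverage side, which depends on the assessment questions, came out at zero.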



# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
