# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, consult the appropriate set of functions below. Make sure that you understand the extracted content and whether you need to perform any additional operations (such as joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
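
An equivalent way to join the pages is a single `str.join` call, which avoids repeated string concatenation. The sketch below uses a stand-in `Page` dataclass (hypothetical, present only so the snippet runs without LangChain); with real loader output, `docs` would be the list returned by `loader.load()`.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Stand-in for a LangChain Document; only `page_content` is modelled."""
    page_content: str


def join_pages(pages) -> str:
    """Concatenate page contents, ending each page with a newline."""
    return "".join(page.page_content + "\n" for page in pages)


# Two toy pages in place of real loader output.
docs = [Page("First page."), Page("Second page.")]
document_text = join_pages(docs)
```

The result is identical to the loop above, including the trailing newline after the last page.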

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this loader to load web pages.

In [2]:
from langchain_community.document_loaders import PyPDFLoader

# Path to the PDF file
pdf_path = "../05_src/documents/Managing_Oneself_Drucker_HBR.pdf"

# loader
loader = PyPDFLoader(pdf_path)

# Load the PDF (each page becomes one Document object)
docs = loader.load()

# Check how many pages were loaded
print(f"The document has {len(docs)} pages")


The document has 13 pages


In [3]:
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

print(document_text[:1000])  # preview the first 1000 characters


www.hbr.org
B
 
EST  
 
OF  HBR 1999
 
Managing Oneself
 
by Peter F . Drucker
 
•
 
Included with this full-text 
 
Harvard Business Review
 
 article:
The Idea in Brief—the core idea
The Idea in Practice—putting the idea to work
 
1
 
Article Summary
 
2
 
Managing Oneself
A list of related materials, with annotations to guide further
exploration of the article’s ideas and applications
 
12
 
Further Reading
Success in the knowledge 
economy comes to those who 
know themselves—their 
strengths, their values, and 
how they best perform.
 
Reprint R0501KThis document is authorized for use only by Sharon Brooks (SHARON@PRICE-ASSOCIATES.COM). Copying or posting is an infringement of copyright. Please contact 
customerservice@harvardbusiness.org or 800-988-0886 for additional copies.
B
 
EST
 
 
 
OF
 
 HBR 1999
 
Managing Oneself
 
page 1
 
The Idea in Brief The Idea in Practice
 
COPYRIGHT © 2004 HARVARD BUSINESS SCHOOL PUBLISHING CORPORATION. ALL RIGHTS RESERVED.
 
We live in an age of

In [4]:
print("This is a summary of the document: ")
print(f"Total characters: {len(document_text)}")
print(f"Total words: {len(document_text.split())}")

This is a summary of the document: 
Total characters: 51452
Total words: 8670
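
Character and word counts give only a rough sense of how many tokens the model will receive. A common rule of thumb for English text is about four characters per token, and the sketch below applies it. The helper name `estimate_tokens` and the default ratio are assumptions for illustration; a real tokenizer should be used when an exact count matters.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (a heuristic, not a tokenizer)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


# A 51,452-character string, matching the character count reported above.
sample_text = "x" * 51452
approx_tokens = estimate_tokens(sample_text)
```

An estimate like this is mainly useful as a sanity check against the token counts reported later in the notebook.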


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant to an AI professional's professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
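
One way to keep instructions and context separate is a reusable template that is filled in at call time. The sketch below uses `str.format`; the names `INSTRUCTIONS`, `USER_TEMPLATE`, and `build_messages` are hypothetical, and the message roles follow the developer/user split requested above.

```python
# Templates are plain strings; nothing is substituted until build_messages runs.
INSTRUCTIONS = "You are an expert summarizer. Write in {tone}."

USER_TEMPLATE = (
    "Summarize the following article.\n"
    "<Article>\n{document_text}\n</Article>"
)


def build_messages(document_text: str, tone: str) -> list[dict]:
    """Fill both templates at call time and return chat-style messages."""
    return [
        {"role": "developer", "content": INSTRUCTIONS.format(tone=tone)},
        {"role": "user", "content": USER_TEMPLATE.format(document_text=document_text)},
    ]


messages = build_messages("Some article text.", tone="Formal Academic Writing")
```

Unlike an f-string, which is evaluated once where it is defined, a template can be reused with a different document or tone without redefining the prompt.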


In [5]:
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Define the structured output
class ArticleSummary(BaseModel):
    author: str
    title: str
    relevance: str
    summary: str
    tone: str = "Canadian Vernacular English"
    input_tokens: int
    output_tokens: int


In [6]:
instructions = """
You are an expert summarizer and writing assistant.
Your goal is to produce an easy-to-read summary in Canadian vernacular English.
Focus on practical, life-applicable lessons.
"""

# Use formatted strings so the document text is inserted dynamically
USER_PROMPT = f"""
Summarize the following article in clear, everyday Canadian English.
The summary should be easy to understand and practical — something a reader could apply in their own life or work.

Include:
- Author
- Title
- A short relevance paragraph explaining why this article is useful for personal and professional development.
- A concise summary and how to apply that in life (under 3000 tokens but not less than 1500 tokens).
- The tone used ("Canadian Vernacular English").

<Article>
{document_text}
</Article>
"""


In [7]:
response = client.responses.parse(
    model="gpt-4o-mini",
    input=[
        {"role": "developer", "content": instructions},
        {"role": "user", "content": USER_PROMPT},
    ],
    text_format=ArticleSummary,
)


In [8]:
result = response.output_parsed
# model_dump() replaces the dict() method deprecated in Pydantic v2
for key, value in result.model_dump().items():
    print(f"{key.capitalize()}: {value}\n")


Author: Peter F. Drucker

Title: Managing Oneself

Relevance: This article highlights the importance of self-awareness and personal accountability in today's fast-paced, knowledge-driven work environment. By understanding our strengths, values, and how we work, we can take charge of our careers and enhance both personal and professional development, making it particularly relevant for anyone looking to thrive in their job.

Summary: In "Managing Oneself," Peter Drucker emphasizes that in today's job market, individuals must take charge of their own careers, acting as their own CEO. This starts with understanding oneself—knowing your strengths, values, work style, and where you belong in the professional landscape.

**Key Takeaways:**
1. **Identify Strengths:** Use feedback analysis to determine your skills. Keep track of decisions and expected outcomes, and compare them with actual results to identify patterns. Focus on enhancing your strengths rather than trying to improve areas where


In [9]:
print("Input tokens:", result.input_tokens)
print("Output tokens:", result.output_tokens)

Input tokens: 6039
Output tokens: 1490
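
The assignment asks for token counts taken from the response object, whereas the fields printed above are part of the model-generated structured output. In the OpenAI Responses API the measured counts are exposed under `response.usage` as `input_tokens` and `output_tokens`. The sketch below copies them onto the parsed fields; `attach_usage` is a hypothetical helper, and a `SimpleNamespace` stands in for the real response so the snippet runs without an API call. The numbers are placeholders, not results.

```python
from types import SimpleNamespace


def attach_usage(parsed_fields: dict, response) -> dict:
    """Return a copy of the parsed fields with token counts from response.usage."""
    merged = dict(parsed_fields)
    merged["input_tokens"] = response.usage.input_tokens
    merged["output_tokens"] = response.usage.output_tokens
    return merged


# Stand-in for an API response; values are placeholders for illustration only.
fake_response = SimpleNamespace(
    usage=SimpleNamespace(input_tokens=100, output_tokens=20)
)
fields = {"author": "A. Author", "title": "A Title", "input_tokens": 0, "output_tokens": 0}
fields_with_usage = attach_usage(fields, fake_response)
```

With the notebook's own objects, this would be called as `attach_usage(result.model_dump(), response)`.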


# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use at least five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
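
The key naming scheme above can be generated from the metric objects rather than typed by hand. The sketch assumes each metric exposes `score` and `reason` attributes once it has been measured; `collect_results` is a hypothetical helper, and `SimpleNamespace` objects stand in for real metrics.

```python
from types import SimpleNamespace


def collect_results(metrics: dict) -> dict:
    """Map {"Name": metric} to {"NameScore": ..., "NameReason": ...}."""
    results = {}
    for name, metric in metrics.items():
        results[f"{name}Score"] = metric.score
        results[f"{name}Reason"] = metric.reason
    return results


# Stand-ins for measured metrics; scores and reasons are placeholders.
measured = {
    "Summarization": SimpleNamespace(score=0.5, reason="placeholder reason"),
    "Coherence": SimpleNamespace(score=0.75, reason="placeholder reason"),
}
report = collect_results(measured)
```

Building the report this way keeps the key names consistent when a metric is added or renamed.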

In [10]:
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

reference_text = document_text
summary_text = result.summary

# Create the test case
test_case = LLMTestCase(
    input=reference_text,
    actual_output=summary_text,
    #expected_output="A clear and accurate summary of 'Managing Oneself' by Peter Drucker."
)

# --- Summarization metric ---
summarization_metric = SummarizationMetric(
    model="gpt-4o-mini",
    assessment_questions=[
        "Does the summary accurately capture the main ideas of the article?",
        "Is the summary concise and focused, avoiding unnecessary details?",
        "Does the summary clearly express key lessons or insights?",
        "Is the summary coherent and easy to follow?",
        "Does the summary reflect the original tone and intent of the author?",
    ],
)

# --- Coherence metric ---
coherence_metric = GEval(
    name="Coherence",
    model="gpt-4o-mini",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    criteria=(
        "Evaluate how coherent the summary is:\n"
        "- Are ideas logically connected and presented in a smooth flow?\n"
        "- Does each sentence naturally lead to the next?\n"
        "- Are there any confusing or abrupt transitions?\n"
        "- Is the summary structurally consistent?\n"
        "- Would a reader understand the main argument without rereading?"
    ),
)

# --- Tonality metric ---
tonality_metric = GEval(
    name="Tonality",
    model="gpt-4o-mini",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    criteria=(
        "Evaluate whether the tone is consistent and appropriate:\n"
        "- Does it match Canadian vernacular English?\n"
        "- Does it sound friendly yet professional?\n"
        "- Is it engaging and accessible?\n"
        "- Is the tone suitable for self-development content?"
    ),
)

# --- Safety metric ---
safety_metric = GEval(
    name="Safety",
    model="gpt-4o-mini",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    criteria=(
        "Evaluate if the summary is safe, factual, and inclusive:\n"
        "- Does it avoid unsupported claims?\n"
        "- Is it grounded in the original article?\n"
        "- Does it avoid biased or offensive language?\n"
        "- Would it be safe to publish publicly?"
    ),
)

# --- Run metrics ---
summarization_metric.measure(test_case)
coherence_metric.measure(test_case)
tonality_metric.measure(test_case)
safety_metric.measure(test_case)

# --- Collect and display results ---
evaluation_results = {
    "SummarizationScore": summarization_metric.score,
    "SummarizationReason": summarization_metric.reason,
    "CoherenceScore": coherence_metric.score,
    "CoherenceReason": coherence_metric.reason,
    "TonalityScore": tonality_metric.score,
    "TonalityReason": tonality_metric.reason,
    "SafetyScore": safety_metric.score,
    "SafetyReason": safety_metric.reason,
}

print("✅ Evaluation Completed!\n")
for key, value in evaluation_results.items():
    print(f"{key}: {value}\n")


✅ Evaluation Completed!

SummarizationScore: 0

SummarizationReason: The score is 0.00 because the summary includes numerous pieces of extra information that are not present in the original text, leading to a significant deviation from the original content.

CoherenceScore: 0.838545153189503

CoherenceReason: The response effectively summarizes the key concepts from Drucker's 'Managing Oneself,' maintaining a logical flow and coherence throughout. Each takeaway is clearly articulated, reflecting the main ideas presented in the input. However, while the transitions between points are generally smooth, some sections could benefit from more explicit connections to enhance the overall coherence. Additionally, the response could have included more specific examples or details from the original text to strengthen the alignment with the evaluation steps.

TonalityScore: 0.8437213728689533

TonalityReason: The response effectively captures the essence of Drucker's 'Managing Oneself' by summari

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
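
The self-correction idea can be expressed as a small loop: evaluate, stop if every score clears a threshold, otherwise revise and evaluate again. The sketch is independent of any particular library; `refine`, `evaluate`, and `improve` are hypothetical names, and the toy callables at the bottom exist only to make the snippet runnable.

```python
from typing import Callable


def refine(
    summary: str,
    evaluate: Callable[[str], dict],
    improve: Callable[[str, dict], str],
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> tuple[str, dict, int]:
    """Revise a summary until all scores reach the threshold or rounds run out."""
    scores = evaluate(summary)
    rounds = 0
    while rounds < max_rounds and min(scores.values()) < threshold:
        summary = improve(summary, scores)
        scores = evaluate(summary)
        rounds += 1
    return summary, scores, rounds


# Toy stand-ins: the score grows with length and each revision appends text.
def toy_evaluate(text: str) -> dict:
    return {"Length": min(len(text) / 20, 1.0)}


def toy_improve(text: str, scores: dict) -> str:
    return text + " more"


final_summary, final_scores, used_rounds = refine("short", toy_evaluate, toy_improve)
```

The cap on rounds matters because every iteration costs additional model calls and a metric may never reach the threshold.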

In [13]:
# Pass only the first 3000 characters of the document as context, for efficiency.
context_excerpt = reference_text[:3000]

ENHANCEMENT_PROMPT = f"""
You are an AI writing assistant tasked with improving a summary of Peter Drucker's article "Managing Oneself."
Use the feedback below to enhance alignment, accuracy, and flow.

Original document (context):
{context_excerpt}

Current summary:
{summary_text}

Evaluation feedback:
Summarization: {evaluation_results['SummarizationReason']}
Coherence: {evaluation_results['CoherenceReason']}
Tonality: {evaluation_results['TonalityReason']}
Safety: {evaluation_results['SafetyReason']}

Your task:
- Fix all issues mentioned in the feedback.
- Make sure the new summary directly reflects Drucker’s key points and examples.
- Keep the tone friendly, practical, and Canadian.
- Improve transitions and readability.
- Use examples like “feedback analysis,” “second career,” and “values alignment.”
- Output only the revised summary (no explanations).

The enhanced summary should be 1200–1800 tokens.
"""

response = client.responses.create(
    model="gpt-4o-mini",
    input=[{"role": "user", "content": ENHANCEMENT_PROMPT}],
    temperature=0.7,
)

improved_summary = response.output_text
print("✅ Enhanced Summary:\n")
print(improved_summary)


✅ Enhanced Summary:

In "Managing Oneself," Peter Drucker posits that in the knowledge economy, individuals must take charge of their own careers, effectively acting as their own chief executive officers. This proactive approach begins with a deep self-understanding that encompasses recognizing one’s strengths, values, and preferred working styles, ultimately guiding individuals to find their rightful place in the professional landscape.

**Key Takeaways:**

1. **Identify Strengths:** Drucker underscores the significance of feedback analysis in uncovering one's strengths. This involves documenting expected outcomes whenever a key decision is made and later comparing them to the actual results. For instance, if you expect a project to succeed based on a specific skill set and it does, that reinforces your strength in that area. By identifying patterns in successes and shortcomings, you can pinpoint which skills to further develop. Focus on enhancing these strengths rather than spending 

In [14]:
summary_text = improved_summary
test_case.actual_output = summary_text

summarization_metric.measure(test_case)
coherence_metric.measure(test_case)
tonality_metric.measure(test_case)
safety_metric.measure(test_case)

print("✅ Re-evaluation Completed!\n")
print(f"SummarizationScore: {summarization_metric.score}")
print(f"CoherenceScore: {coherence_metric.score}")
print(f"TonalityScore: {tonality_metric.score}")
print(f"SafetyScore: {safety_metric.score}")


✅ Re-evaluation Completed!

SummarizationScore: 0
CoherenceScore: 0.8731058572770511
TonalityScore: 0.86405724964141
SafetyScore: 0.8563948519769436


### Comments:
The *Summarization* metric scored **0** on the first evaluation and remained at **0** after the summary was revised using the evaluation feedback.  
The other metrics improved slightly on re-evaluation: **Coherence** rose from about 0.84 to 0.87 and **Tonality** from about 0.84 to 0.86, and **Safety** also improved, ending at about 0.86.
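
A note on why the Summarization score can stay at zero while the other metrics improve: according to the DeepEval documentation, the summarization score is the minimum of an alignment score (is the summary consistent with the source?) and a coverage score (does the summary retain the information probed by the assessment questions?). The sketch below only illustrates that aggregation with placeholder numbers; it does not reproduce the library's internals.

```python
def summarization_score(alignment: float, coverage: float) -> float:
    """Combine the two sub-scores as the DeepEval docs describe: take the minimum."""
    return min(alignment, coverage)


# Placeholder values: strong coverage cannot compensate for zero alignment.
example = summarization_score(alignment=0.0, coverage=0.9)
```

Under this rule, revisions that improve flow or tone leave the score unchanged unless they also raise whichever sub-score is lower.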



# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
