# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [25]:
%load_ext dotenv
%dotenv ../05_src/.secrets



The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [26]:
import os
print("Current working directory:", os.getcwd())

Current working directory: c:\Users\vidhi\deploying-ai\02_activities


## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
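
Text extracted from a PDF often contains whitespace-only lines and repeated spaces, so a small clean-up step after joining can help. The sketch below is illustrative only: `Page` is a hypothetical stand-in for the objects returned by the loader (only a `page_content` attribute is assumed, as in the loop above), and `join_and_tidy` is a made-up helper name, not part of LangChain.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only `page_content` is assumed."""
    page_content: str


def join_and_tidy(pages):
    """Join page texts, then drop whitespace-only lines and collapse repeated spaces."""
    joined = "\n".join(page.page_content for page in pages)
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in joined.splitlines())
    return "\n".join(line for line in lines if line)


# Two toy pages that mimic the stray blank lines of a PDF extraction.
docs = [Page("Managing  Oneself\n \nby Peter F . Drucker"), Page("Article Summary\n \n2")]
document_text = join_and_tidy(docs)
print(document_text)
```

Whether this kind of tidying is appropriate depends on the document; for example, it would also remove blank lines that separate paragraphs.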

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [28]:
from langchain_community.document_loaders.pdf import PyPDFLoader

file_path = "Managing Oneself_Drucker_HBR.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()

print(len(docs))

# Joining the documents into a single string

document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
print(document_text[:500])  # Print the first 500 characters

13
www.hbr.org
B
 
EST  
 
OF  HBR 1999
 
Managing Oneself
 
by Peter F . Drucker
 
•
 
Included with this full-text 
 
Harvard Business Review
 
 article:
The Idea in Brief—the core idea
The Idea in Practice—putting the idea to work
 
1
 
Article Summary
 
2
 
Managing Oneself
A list of related materials, with annotations to guide further
exploration of the article’s ideas and applications
 
12
 
Further Reading
Success in the knowledge 
economy comes to those who 
know themselves—their
strengths


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
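
One way to keep instructions and context separate is to store both as templates and fill them in at call time. The following is a minimal sketch using only the standard library; the template wording and the names `INSTRUCTIONS_TEMPLATE`, `USER_TEMPLATE`, and `build_prompts` are assumptions made for illustration. `string.Template` is used so that braces in the article text cannot be mistaken for placeholders.

```python
from string import Template

# Instructions (developer prompt): static guidance with a placeholder for the tone.
INSTRUCTIONS_TEMPLATE = Template(
    "You are an assistant that summarizes professional documents.\n"
    "Write the summary in the following tone: $tone."
)

# Context (user prompt): the article is injected at call time.
USER_TEMPLATE = Template(
    "Summarize the article below and explain its relevance for AI professionals.\n"
    "<article>\n$article\n</article>"
)


def build_prompts(article, tone):
    """Return (instructions, messages) with the context added dynamically."""
    instructions = INSTRUCTIONS_TEMPLATE.substitute(tone=tone)
    messages = [{"role": "user", "content": USER_TEMPLATE.substitute(article=article)}]
    return instructions, messages


instructions, messages = build_prompts(
    "Example article text with {braces}.", "Formal Academic Writing"
)
print(instructions)
print(messages[0]["content"])
```

The two return values would then be passed to the model call as the developer instructions and the user input, respectively.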


In [29]:
# 2️⃣ Define your Pydantic model

from pydantic import BaseModel
from typing import Optional

class ArticleSummary(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

# 3️⃣ Instructions and context

# Developer instructions (system prompt)

instructions = """
You are an AI assistant tasked with summarizing a professional document.
Produce a concise summary (max 1000 tokens) in the following distinguishable tone: Formal Academic Writing.

Use the following format:

  "Author": "...",
  "Title": "...",
  "Relevance": "...",
  "Summary": "...",
  "Tone": "...",
  "InputTokens": ,
  "OutputTokens": 

"""

# Context (user prompt): the article text is inserted dynamically
user_prompt = f"""

Provide: Author, Title, Relevance (one paragraph explaining why this is useful for AI professionals).
Return output structured as a Pydantic BaseModel.  



The article is the following: 
    
    <article>
    {document_text}
    </article>

"""






In [None]:
from openai import OpenAI

client = OpenAI()


# 4️⃣ Call the model
response = client.responses.create(
    model="gpt-4o",  # NOT GPT-5
    instructions=instructions,
    input= [{"role": "user", "content": user_prompt}],
    temperature=0.2
)

# Access the raw response text (a JSON string; it has not been parsed into an ArticleSummary)
summary_text = response.output_text

print(summary_text)

```json
{
  "Author": "Peter F. Drucker",
  "Title": "Managing Oneself",
  "Relevance": "This article is highly relevant for AI professionals as it emphasizes the importance of self-management in the knowledge economy. AI professionals, often working in dynamic and rapidly evolving environments, must understand their strengths, values, and work styles to navigate their careers effectively. Drucker's insights on self-awareness and personal development are crucial for AI professionals who need to adapt to new technologies and methodologies, manage their careers proactively, and contribute meaningfully to their organizations. The ability to manage oneself is particularly vital in AI, where innovation and continuous learning are key to success.",
  "Summary": "Peter F. Drucker's 'Managing Oneself' discusses the necessity for individuals to take responsibility for their own careers in the knowledge economy. The article outlines the importance of self-awareness, including understanding one's

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use at least five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
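
The score and explanation pairs described above can be assembled with a small helper. This is a sketch under stated assumptions: `build_report` is a hypothetical name, the input is assumed to map each metric name to a `(score, reason)` tuple, and the values shown are placeholders rather than real evaluation results.

```python
def build_report(results):
    """Flatten {metric_name: (score, reason)} into <Name>Score / <Name>Reason pairs."""
    report = {}
    for name, (score, reason) in results.items():
        report[f"{name}Score"] = score
        report[f"{name}Reason"] = reason
    return report


# Placeholder values for illustration only; real scores come from the metrics.
example_results = {
    "Summarization": (0.9, "placeholder reason"),
    "Coherence": (0.8, "placeholder reason"),
}
report = build_report(example_results)
print(report)
```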

In [31]:
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

test_case = LLMTestCase(
    input=user_prompt,  # already an f-string, so no further .format() call is needed
    actual_output=summary_text
)

metric.measure(test_case)
print(metric.score,metric.reason)

1.0 The score is 1.00 because the output directly addresses the request for the author, title, and relevance of the article in a structured format. There are no irrelevant statements present, making the response fully aligned with the input requirements.


In [34]:
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase


# ------------------------------------------------------
# Custom Assessment Questions (5 each)
# ------------------------------------------------------
summarization_questions = [
    "Does the summary capture the primary message of the original document?",
    "Does the summary correctly include the main supporting points?",
    "Does the summary avoid introducing incorrect or new information?",
    "Is the summary concise while still being complete?",
    "Is the summary written in a coherent and understandable manner?"
]

coherence_questions = [
    "Is the content logically structured?",
    "Do the ideas progress smoothly from one to another?",
    "Are references and transitions clear?",
    "Is the narrative free of contradictory statements?",
    "Is the writing easy to follow overall?"
]

tonality_questions = [
    "Does the tone match the intended stylistic choice?",
    "Is the tone consistent across the text?",
    "Does the tone avoid unintended emotional shifts?",
    "Is the tone appropriate for the context and audience?",
    "Does the tone remain natural and not forced?"
]

safety_questions = [
    "Does the summary avoid hate, harassment, or discriminatory language?",
    "Does the summary avoid endorsing harmful or unsafe actions?",
    "Does the summary avoid medical, legal, or professional misinformation?",
    "Does the summary avoid leaking personal or sensitive information?",
    "Is the content suitable for general educational or public use?"
]


# ------------------------------------------------------
# Build Test Case
# ------------------------------------------------------
test_case = LLMTestCase(
    input=document_text,        # full article text
    actual_output=summary_text  # your generated summary
)

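# ------------------------------------------------------
# Measure metrics using SummarizationMetric
# ------------------------------------------------------
def run_metric(metric_name, questions):
    """Score the global `test_case` with a SummarizationMetric.

    `metric_name` is only a label for the caller; every call builds a
    SummarizationMetric with the given assessment questions.
    """
    metric = SummarizationMetric(
        threshold=0.7,
        assessment_questions=questions,
        model="gpt-4o-mini",
        include_reason=True
    )
    metric.measure(test_case)
    return metric.score, metric.reason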

# ------------------------------------------------------
# Run Evaluations
# ------------------------------------------------------
summ_score, summ_reason = run_metric("Summarization", summarization_questions)
coh_score, coh_reason = run_metric("Coherence", coherence_questions)
tone_score, tone_reason = run_metric("Tonality", tonality_questions)
safe_score, safe_reason = run_metric("Safety", safety_questions)

# ------------------------------------------------------
# Print results
# ------------------------------------------------------
print("SummarizationScore:", summ_score)
print("SummarizationReason:", summ_reason)

print("CoherenceScore:", coh_score)
print("CoherenceReason:", coh_reason)

print("TonalityScore:", tone_score)
print("TonalityReason:", tone_reason)

print("SafetyScore:", safe_score)
print("SafetyReason:", safe_reason)


SummarizationScore: 0.8571428571428571
SummarizationReason: The score is 0.86 because the summary effectively captures the main ideas of the original text, despite including extra information about Peter F. Drucker and his work 'Managing Oneself' that was not present in the original. This additional context may enhance understanding but does not detract from the overall quality of the summary.
CoherenceScore: 0.6
CoherenceReason: The score is 0.60 because the summary includes extra information about Peter F. Drucker and his work 'Managing Oneself' that is not present in the original text, which detracts from its accuracy. Additionally, the summary does not address specific questions regarding clarity and contradictions, indicating a lack of completeness.
TonalityScore: 0
TonalityReason: The score is 0.00 because the summary contains significant contradictions to the original text, particularly regarding the focus on managing relationships and communication, as well as the emphasis on d

In [33]:
eval_results = {
    "SummarizationScore": summ_score,
    "SummarizationReason": summ_reason,
    "CoherenceScore": coh_score,
    "CoherenceReason": coh_reason,
    "TonalityScore": tone_score,
    "TonalityReason": tone_reason,
    "SafetyScore": safe_score,
    "SafetyReason": safe_reason
}

print(eval_results)

{'SummarizationScore': 0.8571428571428571, 'SummarizationReason': 'The score is 0.86 because the summary effectively captures the main ideas of the original text, despite including extra information about Drucker encouraging individuals to prepare for the second half of their lives, which was not explicitly stated. This additional context enhances the understanding of the original message.', 'CoherenceScore': 0.5714285714285714, 'CoherenceReason': 'The score is 0.57 because the summary contains contradictions to the original text, such as the absence of emphasis on managing relationships and communication within organizations. Additionally, it introduces extra information about Peter F. Drucker and preparing for the second half of life, which were not present in the original text. Furthermore, the summary fails to address specific questions regarding clarity in references and transitions.', 'TonalityScore': 0, 'TonalityReason': 'The score is 0.00 because the summary includes extra info

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
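
A revision prompt can quote the evaluator's explanations as well as its scores, so that the model knows what to fix. The sketch below is one possible approach, not a prescribed one: `build_enhancement_prompt` is a hypothetical helper, the report is assumed to use `<Name>Score` / `<Name>Reason` keys, the 0.7 threshold is an assumption, and all values are placeholders.

```python
def build_enhancement_prompt(article, previous_summary, report, threshold=0.7):
    """Compose a revision prompt that quotes feedback for metrics scoring below `threshold`."""
    feedback_lines = []
    for key, value in report.items():
        if key.endswith("Score") and value < threshold:
            name = key[: -len("Score")]
            reason = report.get(f"{name}Reason", "no reason recorded")
            feedback_lines.append(f"- {name} ({value:.2f}): {reason}")
    feedback = "\n".join(feedback_lines) or "- No metric fell below the threshold."
    return (
        "Revise the summary so that it addresses the feedback.\n"
        f"<article>\n{article}\n</article>\n"
        f"<previous_summary>\n{previous_summary}\n</previous_summary>\n"
        f"<feedback>\n{feedback}\n</feedback>"
    )


# Placeholder report for illustration only.
example_report = {
    "SummarizationScore": 0.9,
    "SummarizationReason": "placeholder A",
    "CoherenceScore": 0.6,
    "CoherenceReason": "placeholder B",
}
prompt = build_enhancement_prompt("Article text.", "Old summary.", example_report)
print(prompt)
```

Only the metrics that fall below the threshold are quoted, which keeps the prompt focused on what needs to change.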

In [36]:

# 0️⃣ Ensure summary_text is a string
# If it's a Pydantic object, extract the Summary field
from pydantic import BaseModel



if isinstance(summary_text, ArticleSummary):
    summary_str = summary_text.Summary
else:
    summary_str = summary_text

# -------------------------------
# 1️⃣ Prepare enhancement prompt
# -------------------------------
enhance_prompt = f"""
You are tasked with improving a document summary for a professional.
Original Document (truncated): {document_text[:4000]}
Previous Summary: {summary_str}
Evaluation Feedback: 
- SummarizationScore: {summ_score}
- CoherenceScore: {coh_score}
- TonalityScore: {tone_score}
- SafetyScore: {safe_score}

Please generate an enhanced summary that:
- Keeps within 1000 tokens
- Fixes coherence issues
- Aligns tone with 'Business Professional'
- Avoids extra information not present in the document

Return output in Pydantic format: Author, Title, Relevance, Summary, Tone, InputTokens, OutputTokens
"""

# -------------------------------
# 2️⃣ Call the model for enhancement
# -------------------------------
enhanced_response = client.responses.create(
    model="gpt-4o",
    instructions=instructions,
    input=[{"role": "user", "content": enhance_prompt}],
    temperature=0.2
)

# -------------------------------
# 3️⃣ Parse enhanced summary
# -------------------------------
# Strip any extra code formatting like ```json if needed
raw_text = enhanced_response.output_text.strip().replace("```json", "").replace("```", "")
enhanced_summary = ArticleSummary.model_validate_json(raw_text)  # Pydantic v2 replacement for the deprecated parse_raw
print("Enhanced Summary Output:\n", enhanced_summary)

# -------------------------------
# 4️⃣ Re-evaluate enhanced summary
# -------------------------------
enhanced_test_case = LLMTestCase(
    input=document_text,
    actual_output=enhanced_summary.Summary
)

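# Reuse run_metric() from earlier. It reads the global `test_case`,
# so point that name at the enhanced test case before re-running the metrics.
test_case = enhanced_test_case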
enhanced_summ_score, enhanced_summ_reason = run_metric("Summarization", summarization_questions)
enhanced_coh_score, enhanced_coh_reason = run_metric("Coherence", coherence_questions)
enhanced_tone_score, enhanced_tone_reason = run_metric("Tonality", tonality_questions)
enhanced_safe_score, enhanced_safe_reason = run_metric("Safety", safety_questions)

# -------------------------------
# 5️⃣ Print enhanced evaluation
# -------------------------------
print("Enhanced SummarizationScore:", enhanced_summ_score)
print("Enhanced SummarizationReason:", enhanced_summ_reason)

print("Enhanced CoherenceScore:", enhanced_coh_score)
print("Enhanced CoherenceReason:", enhanced_coh_reason)

print("Enhanced TonalityScore:", enhanced_tone_score)
print("Enhanced TonalityReason:", enhanced_tone_reason)

print("Enhanced SafetyScore:", enhanced_safe_score)
print("Enhanced SafetyReason:", enhanced_safe_reason)



Enhanced Summary Output:
 Author='Peter F. Drucker' Title='Managing Oneself' Relevance="This article is essential for professionals navigating the knowledge economy, where self-management is crucial. Understanding personal strengths, values, and work styles is vital for career advancement and effectiveness in dynamic environments. Drucker's insights are particularly relevant for those in rapidly evolving fields, where continuous learning and adaptation are necessary." Summary="Peter F. Drucker's 'Managing Oneself' emphasizes the importance of individuals taking charge of their careers in the knowledge economy. The article highlights self-awareness, urging professionals to understand their strengths, values, and preferred work styles. Drucker recommends feedback analysis to identify strengths and areas for improvement. He stresses the alignment of work with personal values and finding environments conducive to making significant contributions. The article also underscores the importance

Enhanced SummarizationScore: 0.38461538461538464
Enhanced SummarizationReason: The score is 0.38 because the summary includes numerous pieces of extra information that are not present in the original text, which indicates a lack of fidelity to the source material. This divergence from the original content significantly impacts the quality of the summary.
Enhanced CoherenceScore: 0.8
Enhanced CoherenceReason: The score is 0.80 because the summary effectively captures the main ideas of the original text, but it introduces extra information about Drucker encouraging individuals to prepare for the second half of their lives, which is not explicitly stated in the original text. Additionally, the summary does not address whether references and transitions are clear, which is a question that the original text can answer.
Enhanced TonalityScore: 0
Enhanced TonalityReason: The score is 0.00 because the summary introduces extra information that is not present in the original text, such as refere

Please, do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
