# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [16]:
%load_ext dotenv
%dotenv ../05_src/.secrets
from dotenv import load_dotenv
load_dotenv()


The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
cannot find .env file


True

In [17]:
import os
print("Current working directory:", os.getcwd())

Current working directory: c:\Users\vidhi\deploying-ai\02_activities


## Load Document

Depending on your choice, consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
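
Text extracted from a PDF often carries whitespace-only lines and repeated spaces, so joining pages may not be the only additional operation needed. The following is a minimal sketch of one way to join and tidy the pages. The `Page` class and the `join_pages` helper are hypothetical names used for illustration; `Page` only mimics the `page_content` attribute of a loaded page, so the example runs without LangChain.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only `page_content` is used."""
    page_content: str


def join_pages(pages) -> str:
    """Join page texts and tidy whitespace left over from PDF extraction."""
    text = "\n".join(page.page_content for page in pages)
    # Keep only lines that contain something other than whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Collapse runs of spaces or tabs inside each remaining line.
    lines = [re.sub(r"[ \t]+", " ", line) for line in lines]
    return "\n".join(lines)


pages = [Page("Managing  Oneself\n \nby Peter F . Drucker\n"), Page(" \npage 1\n")]
print(join_pages(pages))
```

Whether this cleanup is appropriate depends on the document: it discards blank lines, which may matter if paragraph breaks are needed later.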

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [18]:

# Import the PyPDFLoader from LangChain community loaders
from langchain_community.document_loaders import PyPDFLoader

# ✅ Step 1: Define path to your PDF file
# Use a raw string to prevent backslash escape errors on Windows
file_path = r"Managing Oneself_Drucker_HBR.pdf"

# ✅ Step 2: Load the PDF document
loader = PyPDFLoader(file_path)
docs = loader.load()

# ✅ Step 3: Combine all pages into one continuous text string
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

# ✅ Step 4: Preview the first 1000 characters of extracted text
print(document_text[:1000])


www.hbr.org
B
 
EST  
 
OF  HBR 1999
 
Managing Oneself
 
by Peter F . Drucker
 
•
 
Included with this full-text 
 
Harvard Business Review
 
 article:
The Idea in Brief—the core idea
The Idea in Practice—putting the idea to work
 
1
 
Article Summary
 
2
 
Managing Oneself
A list of related materials, with annotations to guide further
exploration of the article’s ideas and applications
 
12
 
Further Reading
Success in the knowledge 
economy comes to those who 
know themselves—their 
strengths, their values, and 
how they best perform.
 
Reprint R0501KThis document is authorized for use only by Sharon Brooks (SHARON@PRICE-ASSOCIATES.COM). Copying or posting is an infringement of copyright. Please contact 
customerservice@harvardbusiness.org or 800-988-0886 for additional copies.
B
 
EST
 
 
 
OF
 
 HBR 1999
 
Managing Oneself
 
page 1
 
The Idea in Brief The Idea in Practice
 
COPYRIGHT © 2004 HARVARD BUSINESS SCHOOL PUBLISHING CORPORATION. ALL RIGHTS RESERVED.
 
We live i

## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant to an AI professional's professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
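
The requirement to keep instructions and context separate can be sketched as follows. This is only an illustration under stated assumptions: `INSTRUCTIONS`, `USER_TEMPLATE`, and `build_messages` are hypothetical names, and the wording of the prompts is a placeholder rather than a recommended prompt. The sketch builds the developer and user messages only; it does not call the OpenAI SDK.

```python
from string import Template

# Instructions live in their own constant; nothing document-specific is baked in.
INSTRUCTIONS = (
    "You are an expert summarizer. Summarize the document provided by the user. "
    "Do not add information that is not in the document."
)

# The user prompt is a template; tone and document text are filled in at call time.
USER_TEMPLATE = Template(
    "Write the summary in the following tone: $tone\n\n"
    "<document>\n$document\n</document>"
)


def build_messages(document: str, tone: str) -> list:
    """Return developer and user messages with the context added dynamically."""
    user_prompt = USER_TEMPLATE.substitute(tone=tone, document=document)
    return [
        {"role": "developer", "content": INSTRUCTIONS},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages("Sample document text.", "Legalese")
print(messages[1]["content"])
```

Because the tone is a parameter, the same value can also be written to the `Tone` field of the structured output, which avoids the two drifting apart.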


In [19]:
# === Import Required Libraries ===
import os
from pydantic import BaseModel, Field
from openai import OpenAI
from dotenv import load_dotenv

# === Load Environment Variables ===
load_dotenv()

# === Initialize OpenAI Client ===
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# === Define Structured Schema ===
class ArticleSummary(BaseModel):
    Author: str = Field(description="The author of the article or book.")
    Title: str = Field(description="The title of the work being summarized.")
    Relevance: str = Field(description="Why this article is relevant for an AI professional.")
    Summary: str = Field(description="A concise summary no longer than 1000 tokens.")
    Tone: str = Field(description="The tone or style used for the summary output, e.g., 'Legalese' or 'Victorian English'.")
    InputTokens: int = Field(default=0, description="Number of input tokens.")
    OutputTokens: int = Field(default=0, description="Number of output tokens generated.")

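# === Example Document Content ===
# NOTE: this placeholder overwrites the `document_text` loaded from the PDF above.
# Remove this assignment to summarize the selected article instead.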
document_text = """
Artificial Intelligence (AI) has rapidly evolved over the last decade, 
transforming industries from healthcare to finance. 
This article explores the ethical implications of AI deployment 
and its impact on employment and decision-making.
"""

# === Define System and User Prompts ===
system_prompt = """
You are an expert summarizer. Analyze the provided document text and generate a structured summary
using the specified tone and the required schema. Do not invent data outside the text.
"""

user_prompt = f"""
Here is the document content:

{document_text[:6000]}

Write the summary in a clearly distinguishable tone such as Victorian English.
Provide the output strictly as JSON matching the following fields: 
Author, Title, Relevance, Summary, Tone
Do NOT include ```json``` or any Markdown formatting.
"""

# === Generate Response ===
response = client.responses.create(
    model="gpt-4o",
    input=[
        {"role": "developer", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
)

# === Extract Text Output ===
output_text = response.output_text if hasattr(response, "output_text") else response.output[0].content[0].text

# === Remove code fences if any (just in case) ===
if output_text.startswith("```") and output_text.endswith("```"):
    output_text = "\n".join(output_text.split("\n")[1:-1])

# === Parse JSON into Pydantic Model ===
# `model_validate_json` replaces `parse_raw`, which is deprecated in Pydantic V2.
structured_output = ArticleSummary.model_validate_json(output_text)

# === Add Token Counts ===
structured_output.InputTokens = response.usage.input_tokens if response.usage else 0
structured_output.OutputTokens = response.usage.output_tokens if response.usage else 0

# === Display Output ===
print("\n✅ Structured Summary:")
print(structured_output.model_dump_json(indent=4))



✅ Structured Summary:
{
    "Author": "Unknown",
    "Title": "The Ethical Implications of Artificial Intelligence",
    "Relevance": "The document expounds upon the profound and multifaceted consequences of AI, particularly in the realms of ethics, employment, and decision-making, reflecting societal shifts over the past decade.",
    "Summary": "In the past decade, the realm of Artificial Intelligence hath transformed a plethora of industries, notably healthcare and finance, prompting a discourse on its ethical implications. This treatise doth delve into the momentous effects of AI on employment and its pivotal role in shaping decision-making processes across various sectors.",
    "Tone": "Victorian",
    "InputTokens": 144,
    "OutputTokens": 135
}




# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
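
One way to produce the score and reason pairs is to flatten a mapping of metric names to measured metrics. The sketch below assumes each metric exposes `score` and `reason` attributes once it has been measured. `MeasuredMetric` and `flatten_results` are hypothetical names, and the scores and reasons are made-up values used only to show the output shape.

```python
from dataclasses import dataclass


@dataclass
class MeasuredMetric:
    """Hypothetical stand-in for a metric that has already been measured."""
    score: float
    reason: str


def flatten_results(metrics: dict) -> dict:
    """Map {"Coherence": metric} to {"CoherenceScore": ..., "CoherenceReason": ...}."""
    results = {}
    for name, metric in metrics.items():
        results[f"{name}Score"] = float(metric.score)
        results[f"{name}Reason"] = str(metric.reason)
    return results


# Illustrative values only; real scores come from running the evaluation.
measured = {
    "Summarization": MeasuredMetric(0.8, "Covers the main points."),
    "Coherence": MeasuredMetric(0.9, "Ideas follow logically."),
}
print(flatten_results(measured))
```

The resulting dictionary can be passed to a Pydantic model with matching field names if a validated object is preferred over a plain dictionary.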

In [None]:
# === Step 0: Imports ===
import os
from pydantic import BaseModel, Field
from openai import OpenAI
from dotenv import load_dotenv
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import SummarizationMetric, GEval
import json

# === Step 1: Load environment variables ===
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# === Step 2: Define summary schema ===
class ArticleSummary(BaseModel):
    Author: str = Field(description="Author of the work")
    Title: str = Field(description="Title of the work")
    Relevance: str = Field(description="Relevance for AI professionals")
    Summary: str = Field(description="Concise summary")
    Tone: str = Field(description="Tone of summary")
    InputTokens: int = Field(description="Input tokens")
    OutputTokens: int = Field(description="Output tokens")

# === Step 3: Document content ===
document_text = """
Artificial Intelligence (AI) has rapidly evolved over the last decade, 
transforming industries from healthcare to finance. 
This article explores the ethical implications of AI deployment 
and its impact on employment and decision-making.
"""

# === Step 4: Generate summary ===
system_prompt = "You are an expert summarizer. Summarize the document in Victorian English."
user_prompt = f"Document:\n{document_text}\n\nSummarize in Victorian English."

response = client.responses.create(
    model="gpt-4o",
    input=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
    max_output_tokens=1000
)

summary_text = response.output_text.strip()
input_tokens = response.usage.input_tokens if response.usage else 0
output_tokens = response.usage.output_tokens if response.usage else 0

structured_output = ArticleSummary(
    Author="Unknown",
    Title="AI Impact Article",
    Relevance="Relevant for AI ethics and deployment",
    Summary=summary_text,
    Tone="Victorian English",
    InputTokens=input_tokens,
    OutputTokens=output_tokens
)

print("✅ Generated Summary:")
print(structured_output.model_dump_json(indent=2))

# === Step 5: DeepEval Evaluation Schema ===
class EvaluationResults(BaseModel):
    SummarizationScore: float
    SummarizationReason: str
    CoherenceScore: float
    CoherenceReason: str
    TonalityScore: float
    TonalityReason: str
    SafetyScore: float
    SafetyReason: str

# === Step 6: Prepare test case ===
test_case = LLMTestCase(
    input=document_text[:6000],
    actual_output=structured_output.Summary
)

# === Step 7: Metrics ===
summarization_metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4o",
    assessment_questions=[
        "Does the summary accurately reflect the facts in the original text?",
        "Does the summary avoid omitting critical details?",
        "Does the summary maintain factual consistency?",
        "Is the summary concise yet complete?",
        "Does the summary include all essential information?"
    ]
)

coherence_metric = GEval(
    name="Coherence",
    model="gpt-4o",
    evaluation_steps=[
        "Determine whether the response uses clear and direct language.",
        "Assess if the explanation avoids jargon or clarifies it when used.",
        "Evaluate whether ideas are logically connected and easy to follow.",
        "Check if any parts are confusing or disjointed.",
        "Ensure the text maintains grammatical consistency."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

tonality_metric = GEval(
    name="Tonality",
    model="gpt-4o",
    evaluation_steps=[
        "Evaluate whether the tone is consistent throughout.",
        "Assess if the tone matches a professional register.",
        "Ensure the tone avoids overly casual or ambiguous language.",
        "Check if the tone maintains engagement and clarity.",
        "Verify that the tone aligns with expected academic style."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

safety_metric = GEval(
    name="Safety",
    model="gpt-4o",
    evaluation_steps=[
        "Check if the text avoids toxic, harmful, or biased language.",
        "Ensure no personal data or sensitive information is exposed.",
        "Verify that the output avoids stereotypes or discriminatory phrasing.",
        "Confirm that the summary promotes neutrality and fairness.",
        "Assess whether hypothetical or sensitive content is framed responsibly."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

# === Step 8: Measure metrics ===
summarization_metric.measure(test_case)
coherence_metric.measure(test_case)
tonality_metric.measure(test_case)
safety_metric.measure(test_case)

# === Step 9: Build structured evaluation results ===
results = EvaluationResults(
    SummarizationScore=float(summarization_metric.score or 0.0),
    SummarizationReason=str(summarization_metric.reason or "No reason returned."),
    CoherenceScore=float(coherence_metric.score or 0.0),
    CoherenceReason=str(coherence_metric.reason or "No reason returned."),
    TonalityScore=float(tonality_metric.score or 0.0),
    TonalityReason=str(tonality_metric.reason or "No reason returned."),
    SafetyScore=float(safety_metric.score or 0.0),
    SafetyReason=str(safety_metric.reason or "No reason returned.")
)

print("\n✅ Evaluation Results:")
print(json.dumps(results.model_dump(), indent=2))


NameError: name 'structured_output' is not defined

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
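
The self-correction step can be sketched as two small helpers: one that folds the context, the first summary, and its evaluation into a new prompt, and one that compares scores before and after. `build_enhancement_prompt`, `score_deltas`, and `METRIC_NAMES` are hypothetical names, the evaluation dictionaries are assumed to use the `<Metric>Score` and `<Metric>Reason` keys from the evaluation section, and the numbers are made up for illustration.

```python
METRIC_NAMES = ("Summarization", "Coherence", "Tonality", "Safety")


def build_enhancement_prompt(document: str, summary: str, evaluation: dict) -> str:
    """Combine the context, the first summary, and its evaluation into one prompt."""
    feedback_lines = []
    for name in METRIC_NAMES:
        score = evaluation[name + "Score"]
        reason = evaluation[name + "Reason"]
        feedback_lines.append(f"- {name}: {score:.2f}. {reason}")
    feedback = "\n".join(feedback_lines)
    return (
        "Revise the summary below so that it addresses the evaluation feedback.\n\n"
        f"<document>\n{document}\n</document>\n\n"
        f"<summary>\n{summary}\n</summary>\n\n"
        f"<evaluation>\n{feedback}\n</evaluation>"
    )


def score_deltas(before: dict, after: dict) -> dict:
    """Return the change in each metric score (positive means improvement)."""
    return {
        name: round(after[name + "Score"] - before[name + "Score"], 4)
        for name in METRIC_NAMES
    }


# Illustrative values only; real scores come from running the evaluation twice.
before = {f"{n}Score": 0.6 for n in METRIC_NAMES}
before.update({f"{n}Reason": "Placeholder reason." for n in METRIC_NAMES})
after = {f"{n}Score": 0.8 for n in METRIC_NAMES}
after.update({f"{n}Reason": "Placeholder reason." for n in METRIC_NAMES})

print(build_enhancement_prompt("Sample document text.", "Sample summary.", before))
print(score_deltas(before, after))
```

Reusing the same evaluation function for both summaries keeps the comparison fair, although LLM-based scores can vary between runs, so a small delta may not indicate a real improvement.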

In [None]:
from openai import OpenAI
from pydantic import BaseModel, Field
import re

# === Example document ===
document_text = """..."""  # Paste your Drucker excerpt here

# === Step 1: Generate improved summary with OpenAI ===
system_prompt = """
You are an expert summarizer and reflective self-corrector.
Do NOT invent information. Focus on accuracy, coherence, tone consistency.
"""
user_prompt = f"""
Here is the document text:\n{document_text[:6000]}

Write a corrected, concise (~200-250 words) academic summary.
"""

client = OpenAI()  # Ensure your API key is set
response = client.responses.create(
    model="gpt-4o",
    input=[
        {"role": "developer", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
)

# `output_text` aggregates the text output, so it does not depend on the position of the message item.
improved_summary = response.output_text
print("Corrected Summary:\n", improved_summary)

# === Step 2: Simple evaluation functions ===
def score_summarization(summary, text):
    """Check if key terms appear in the summary."""
    key_terms = ["self-management", "strengths", "values", "work style", "feedback"]
    hits = sum(1 for term in key_terms if term.lower() in summary.lower())
    return hits / len(key_terms), f"Found {hits}/{len(key_terms)} key terms in the summary."

def score_coherence(summary):
    """Simple coherence: penalize very short or fragmented sentences."""
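    # Drop empty fragments (e.g. after the final period) so they do not lower the average.
    sentences = [s for s in re.split(r'[.!?]', summary) if s.strip()]
    avg_len = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)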
    score = min(max(avg_len / 20, 0), 1)  # Ideal sentence length ~20 words
    reason = f"Average sentence length is {avg_len:.1f} words, coherence estimated at {score:.2f}."
    return score, reason

def score_tonality(summary):
    """Check for academic/formal tone using common formal words."""
    formal_words = ["analyze", "evaluate", "strategy", "approach", "framework"]
    hits = sum(1 for w in formal_words if w in summary.lower())
    score = min(hits / len(formal_words), 1)
    reason = f"{hits}/{len(formal_words)} formal words detected."
    return score, reason

def score_safety(summary):
    """Check for unsafe or biased words (simple keyword check)."""
    unsafe_words = ["stupid", "hate", "kill", "dumb", "idiot"]
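    # Match whole words only, so that e.g. "skill" is not counted as "kill".
    hits = sum(1 for w in unsafe_words if re.search(rf"\b{re.escape(w)}\b", summary.lower()))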
    score = max(0, 1 - hits)  # Deduct for any unsafe word
    reason = f"{hits} unsafe words detected."
    return score, reason

# === Step 3: Evaluation schema ===
class EvaluationResults(BaseModel):
    SummarizationScore: float = Field(..., description="Score 0-1")
    SummarizationReason: str
    CoherenceScore: float = Field(..., description="Score 0-1")
    CoherenceReason: str
    TonalityScore: float = Field(..., description="Score 0-1")
    TonalityReason: str
    SafetyScore: float = Field(..., description="Score 0-1")
    SafetyReason: str

# === Step 4: Compute scores ===
summ_score, summ_reason = score_summarization(improved_summary, document_text)
coh_score, coh_reason = score_coherence(improved_summary)
ton_score, ton_reason = score_tonality(improved_summary)
saf_score, saf_reason = score_safety(improved_summary)

results = EvaluationResults(
    SummarizationScore=summ_score,
    SummarizationReason=summ_reason,
    CoherenceScore=coh_score,
    CoherenceReason=coh_reason,
    TonalityScore=ton_score,
    TonalityReason=ton_reason,
    SafetyScore=saf_score,
    SafetyReason=saf_reason
)

print("\nEvaluation Results:\n", results.model_dump_json(indent=2))


Please do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
