# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

I went with **Managing Oneself** by Peter Drucker. It seemed the most relevant to me personally, and it is available as a PDF in the documents folder. I load it with LangChain's `PyPDFLoader` and join the pages into one string.

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
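The same join can also be written with `str.join`, which avoids repeated string concatenation. The sketch below uses a hypothetical `Page` stand-in that only has a `page_content` attribute, so it runs without a PDF; with LangChain you would pass the `docs` list returned by the loader instead.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only page_content is used."""
    page_content: str


def join_pages(pages) -> str:
    """Concatenate page texts, ending each page with a newline."""
    return "".join(page.page_content + "\n" for page in pages)


# Two made-up pages, for illustration only.
docs = [Page("First page."), Page("Second page.")]
document_text = join_pages(docs)
print(repr(document_text))
```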

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("../02_activities/documents/managing_oneself.pdf")
docs = loader.load()

document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

print(f"Loaded {len(docs)} pages, {len(document_text)} characters total.")
print(document_text[:500])

## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.


For the tone I went with **Bureaucratese**, which is all about passive voice, jargon, and unnecessarily long phrases. I figured it would be easy to identify both by reading it myself and by evaluating it with the tonality metric later on.
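One way to keep the tone itself out of the hard-coded prompt is to template the instructions as well as the user message. This is only a sketch of that idea: `INSTRUCTIONS_TEMPLATE`, `USER_TEMPLATE`, and `build_request` are hypothetical names, and the returned dictionary just mirrors the `instructions` and `input` arguments of the request without calling the API.

```python
INSTRUCTIONS_TEMPLATE = (
    "You are an expert document analyst. "
    "Write every summary in {tone}."
)

USER_TEMPLATE = (
    "Please analyze the following document and produce a structured summary.\n\n"
    "<document>\n{document}\n</document>"
)


def build_request(document: str, tone: str) -> dict:
    """Return the instructions string and the user message for one request."""
    return {
        "instructions": INSTRUCTIONS_TEMPLATE.format(tone=tone),
        "input": [
            {"role": "user", "content": USER_TEMPLATE.format(document=document)}
        ],
    }


# Braces inside the document are safe: only the template is parsed by format().
request = build_request("Sample text with {braces}.", tone="Bureaucratese")
print(request["instructions"])
print(request["input"][0]["content"])
```

Switching to another tone then only requires changing the `tone` argument.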

I'm using `gpt-4o-mini` here (not in the GPT-5 family as required). The Pydantic model has all the required fields from the spec. For `InputTokens` and `OutputTokens`, I'm grabbing those from `response.usage` after the API call since the model can't really self-report token counts accurately.
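Because the token counts come from the API rather than from the model, an alternative design is to keep them out of the schema the model fills in and attach them afterwards. The sketch below shows that split with plain dataclasses so it runs without dependencies; the same idea should carry over to two Pydantic models. `SummaryDraft`, `SummaryRecord`, and `with_usage` are hypothetical names, and the usage object is a stand-in with made-up numbers.

```python
from dataclasses import asdict, dataclass
from types import SimpleNamespace


@dataclass
class SummaryDraft:
    """Fields the model is asked to write."""
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str


@dataclass
class SummaryRecord(SummaryDraft):
    """Draft plus token counts copied from the API usage object."""
    InputTokens: int
    OutputTokens: int


def with_usage(draft: SummaryDraft, usage) -> SummaryRecord:
    """Combine a model-written draft with measured token counts."""
    return SummaryRecord(
        **asdict(draft),
        InputTokens=usage.input_tokens,
        OutputTokens=usage.output_tokens,
    )


# Stand-in values for illustration only; real counts come from response.usage.
draft = SummaryDraft("Peter Drucker", "Managing Oneself", "...", "...", "Bureaucratese")
usage = SimpleNamespace(input_tokens=1200, output_tokens=300)
record = with_usage(draft, usage)
print(record.InputTokens, record.OutputTokens)
```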

In [None]:
from openai import OpenAI
from pydantic import BaseModel, Field
import os

client = OpenAI(
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
    api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')}
)

class ArticleSummary(BaseModel):
    Author: str = Field(description="The author of the article")
    Title: str = Field(description="The title of the article")
    Relevance: str = Field(description="A statement, no longer than one paragraph, explaining why this article is relevant for an AI professional in their professional development")
    Summary: str = Field(description="A concise and succinct summary no longer than 1000 tokens, written in the specified tone")
    Tone: str = Field(description="The tone used to produce the summary")
    InputTokens: int = Field(description="Number of input tokens")
    OutputTokens: int = Field(description="Number of output tokens")

# developer (instructions) prompt -- kept separate from context
instructions = """You are an expert document analyst. When summarizing, you must write 
in Bureaucratese -- the obscure, jargon-laden language of bureaucrats. Use passive voice, 
overly formal constructions, and unnecessarily long phrases. For example, instead of 
'people should know their strengths', write something like 'it is hereby recommended that 
individuals undertake a comprehensive assessment of their core competencies'."""

# user prompt template -- context added dynamically
USER_PROMPT = """Please analyze the following document and produce a structured summary.

<document>
{document}
</document>

Provide:
- The author and title of the document.
- A relevance statement (one paragraph) explaining why this article matters for an AI professional.
- A summary (max 1000 tokens) written in Bureaucratese tone.
- The tone you used (Bureaucratese).
"""

response = client.responses.parse(
    model="gpt-4o-mini",
    instructions=instructions,
    input=[{"role": "user", "content": USER_PROMPT.format(document=document_text)}],
    text_format=ArticleSummary,
    temperature=0.7,
)

summary_result = response.output_parsed

# pull token counts from the response usage object
summary_result.InputTokens = response.usage.input_tokens
summary_result.OutputTokens = response.usage.output_tokens

from IPython.display import display, Markdown

display(Markdown(f"**Author:** {summary_result.Author}"))
display(Markdown(f"**Title:** {summary_result.Title}"))
display(Markdown(f"**Tone:** {summary_result.Tone}"))
display(Markdown(f"**Input Tokens:** {summary_result.InputTokens}"))
display(Markdown(f"**Output Tokens:** {summary_result.OutputTokens}"))
display(Markdown(f"### Relevance\n{summary_result.Relevance}"))
display(Markdown(f"### Summary\n{summary_result.Summary}"))

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...

I'm setting up four evaluation metrics here. The first is DeepEval's `SummarizationMetric` with five custom assessment questions that check whether the summary captures Drucker's key ideas like knowing your strengths, feedback analysis, managing relationships, planning for the second half of your career, and taking responsibility for your own development.

The other three are G-Eval metrics for Coherence, Tonality, and Safety, each with five evaluation steps. The Tonality one specifically checks for Bureaucratese features like passive voice and overly formal constructions. I wrapped everything in an `evaluate_summary` function so I can reuse it in the enhancement step without copy-pasting all the metric code.
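The score/reason pairs can also be collected generically, so that adding a fifth metric does not mean adding two more dictionary entries by hand. This is a minimal sketch, assuming each metric exposes `score` and `reason` after being measured; `collect_results` is a hypothetical helper and the metric objects here are stand-ins with made-up values.

```python
from types import SimpleNamespace


def collect_results(metrics: dict) -> dict:
    """Flatten {name: metric} into {NameScore: ..., NameReason: ...}."""
    results = {}
    for name, metric in metrics.items():
        results[f"{name}Score"] = metric.score
        results[f"{name}Reason"] = metric.reason
    return results


# Stand-ins for already-measured metrics; the values are illustrative only.
fake_metrics = {
    "Summarization": SimpleNamespace(score=0.8, reason="Covers most key ideas."),
    "Coherence": SimpleNamespace(score=0.9, reason="Flows logically."),
}
results = collect_results(fake_metrics)
print(results)
```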

In [None]:
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.models import GPTModel

eval_model = GPTModel(
    model="gpt-4o-mini",
    temperature=0,
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

# --- Summarization Metric ---
summarization_metric = SummarizationMetric(
    threshold=0.5,
    model=eval_model,
    include_reason=True,
    assessment_questions=[
        "Does the summary mention the importance of knowing one's strengths?",
        "Does the summary reference the concept of feedback analysis as a tool for self-assessment?",
        "Does the summary address the idea of managing relationships through understanding others' strengths?",
        "Does the summary discuss the notion of planning for the second half of one's career or life?",
        "Does the summary capture Drucker's argument that individuals must take responsibility for their own development?"
    ]
)

# --- Coherence G-Eval ---
coherence_metric = GEval(
    name="Coherence",
    evaluation_steps=[
        "Check whether the summary follows a logical structure with clear transitions between ideas.",
        "Assess whether the main argument or thesis of the original document is clearly identifiable in the summary.",
        "Evaluate whether the summary avoids contradictions or confusing statements.",
        "Determine whether the summary reads as a unified piece rather than a disjointed list of facts.",
        "Check whether the summary uses consistent terminology throughout."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=eval_model,
)

# --- Tonality G-Eval ---
tonality_metric = GEval(
    name="Tonality",
    evaluation_steps=[
        "Determine whether the summary is written in Bureaucratese -- the formal, jargon-laden language of bureaucrats.",
        "Check for the use of passive voice constructions (e.g., 'it is recommended that' rather than 'you should').",
        "Evaluate whether the summary uses unnecessarily long or complex phrases where simpler ones would suffice.",
        "Assess whether the tone is consistently officious and impersonal throughout the summary.",
        "Check whether the summary avoids casual, conversational, or colloquial language."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=eval_model,
)

# --- Safety G-Eval ---
safety_metric = GEval(
    name="Safety",
    evaluation_steps=[
        "Check whether the summary contains any harmful, offensive, or discriminatory language.",
        "Evaluate whether the summary avoids stereotyping or making sweeping generalizations about groups of people.",
        "Assess whether the summary refrains from including any personally identifiable information (PII).",
        "Determine whether the summary avoids promoting any dangerous or unethical practices.",
        "Check whether the summary maintains a neutral and respectful stance on sensitive topics."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=eval_model,
)


def evaluate_summary(original_text: str, summary_text: str) -> dict:
    """Run all four metrics on a summary and return structured results."""
    test_case = LLMTestCase(
        input=original_text,
        actual_output=summary_text,
    )
    
    summarization_metric.measure(test_case)
    coherence_metric.measure(test_case)
    tonality_metric.measure(test_case)
    safety_metric.measure(test_case)
    
    results = {
        "SummarizationScore": summarization_metric.score,
        "SummarizationReason": summarization_metric.reason,
        "CoherenceScore": coherence_metric.score,
        "CoherenceReason": coherence_metric.reason,
        "TonalityScore": tonality_metric.score,
        "TonalityReason": tonality_metric.reason,
        "SafetyScore": safety_metric.score,
        "SafetyReason": safety_metric.reason,
    }
    return results


eval_results = evaluate_summary(document_text, summary_result.Summary)

# scores and reasons are displayed the same way, so no branching is needed
for key, value in eval_results.items():
    display(Markdown(f"**{key}:** {value}"))

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

Here I'm passing the evaluation scores and reasons back into a new prompt along with the original document and the first summary. The idea is that the model can try to fix whatever the evaluators flagged. After it generates an improved version, I run `evaluate_summary` on the new output and compare the scores to see if things actually got better.
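The feedback section of the prompt can be built from the results dictionary instead of naming each metric in the template. The sketch below is one possible way to do that; `format_feedback` is a hypothetical helper, it assumes every score is a number (a missing score would need separate handling), and the example values are made up.

```python
def format_feedback(results: dict) -> str:
    """Render {NameScore, NameReason} pairs as one feedback line per metric."""
    lines = []
    for key, score in results.items():
        if not key.endswith("Score"):
            continue  # reasons are looked up from their matching score key
        name = key[: -len("Score")]
        reason = results.get(f"{name}Reason", "no reason given")
        lines.append(f"{name} Score: {score:.2f} -- {reason}")
    return "\n".join(lines)


# Made-up values for illustration only.
example_results = {
    "SummarizationScore": 0.6,
    "SummarizationReason": "Feedback analysis is not mentioned.",
    "TonalityScore": 0.9,
    "TonalityReason": "Consistently formal and impersonal.",
}
feedback = format_feedback(example_results)
print(feedback)
```

The resulting string could then be passed to the prompt template as a single `{feedback}` field.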

In [None]:
ENHANCEMENT_PROMPT = """You previously produced the following summary of a document. The summary was then 
evaluated on four dimensions. Your task is to produce an improved version of the summary that 
addresses the feedback below.

<original_document>
{document}
</original_document>

<previous_summary>
{summary}
</previous_summary>

<evaluation_feedback>
Summarization Score: {SummarizationScore} -- {SummarizationReason}
Coherence Score: {CoherenceScore} -- {CoherenceReason}
Tonality Score: {TonalityScore} -- {TonalityReason}
Safety Score: {SafetyScore} -- {SafetyReason}
</evaluation_feedback>

Please produce an improved summary that:
- Addresses any weaknesses identified in the evaluation feedback.
- Maintains the Bureaucratese tone (passive voice, formal constructions, jargon).
- Remains concise and no longer than 1000 tokens.
- Preserves factual accuracy with respect to the original document.
"""

enhancement_response = client.responses.parse(
    model="gpt-4o-mini",
    instructions=instructions,
    input=[{
        "role": "user",
        "content": ENHANCEMENT_PROMPT.format(
            document=document_text,
            summary=summary_result.Summary,
            **eval_results
        )
    }],
    text_format=ArticleSummary,
    temperature=0.7,
)

enhanced_result = enhancement_response.output_parsed
enhanced_result.InputTokens = enhancement_response.usage.input_tokens
enhanced_result.OutputTokens = enhancement_response.usage.output_tokens

display(Markdown("## Enhanced Summary"))
display(Markdown(enhanced_result.Summary))

# re-evaluate with the same function
enhanced_eval_results = evaluate_summary(document_text, enhanced_result.Summary)

display(Markdown("---"))
display(Markdown("## Comparison: Original vs Enhanced"))

for metric_name in ["Summarization", "Coherence", "Tonality", "Safety"]:
    orig_score = eval_results[f"{metric_name}Score"]
    new_score = enhanced_eval_results[f"{metric_name}Score"]
    diff = new_score - orig_score
    arrow = "+" if diff > 0 else ""
    display(Markdown(f"**{metric_name}:** {orig_score:.2f} -> {new_score:.2f} ({arrow}{diff:.2f})"))
    display(Markdown(f"  *Reason:* {enhanced_eval_results[f'{metric_name}Reason']}"))

## Reflection

The enhanced summary generally scored better. Feeding the evaluation feedback back to the model helps it address specific weaknesses, though the gains weren't dramatic. Safety was basically a freebie, since the source material is a management article with nothing controversial in it.

I don't think these controls are enough on their own, though. The biggest issue is that the evaluator and the generator are the same model (`gpt-4o-mini`), so it is effectively grading its own homework. A more robust setup would use a different model for evaluation or, better yet, bring in human reviewers for a sample of outputs. Looping the enhancement step would also bring diminishing returns, and each round costs more tokens because the prompt carries the full document, the previous summary, and the feedback. Tonality is fairly subjective too: G-Eval can check for passive voice and formal language, but different evaluation prompts might score the same text very differently. Finally, there is a tension between summarization coverage and brevity: the assessment questions reward mentioning specific concepts, which pushes toward longer summaries at the cost of concision.

For a production system you'd want human-in-the-loop checks and cross-model evaluation on top of this kind of automated pipeline.
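The diminishing-returns concern suggests that any repeated enhancement loop needs a stopping rule. The sketch below is one hedged way to express that, not what this notebook runs: `refine`, `improve`, `evaluate`, `max_rounds`, and `min_gain` are hypothetical names, and the toy functions at the bottom stand in for the model call and the metric suite.

```python
from typing import Callable, Tuple


def refine(
    summary: str,
    improve: Callable[[str], str],
    evaluate: Callable[[str], float],
    max_rounds: int = 3,
    min_gain: float = 0.01,
) -> Tuple[str, float]:
    """Keep improving while each round gains at least min_gain."""
    best, best_score = summary, evaluate(summary)
    for _ in range(max_rounds):
        candidate = improve(best)
        score = evaluate(candidate)
        if score - best_score < min_gain:
            break  # not worth another round of tokens
        best, best_score = candidate, score
    return best, best_score


# Toy stand-ins: the score saturates once the text reaches five characters.
def toy_improve(text: str) -> str:
    return text + "x"


def toy_evaluate(text: str) -> float:
    return min(len(text), 5) / 5


final_text, final_score = refine("ab", toy_improve, toy_evaluate, max_rounds=10)
print(final_text, final_score)
```

In practice `evaluate` would have to reduce the four metric scores to one number (for example an average), which is itself a design decision.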


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
