# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [None]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
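
The same joining step can also be written with `str.join`. The sketch below is illustrative only: `Page` is a hypothetical stand-in for whatever page objects your loader returns, and it assumes each page exposes a `page_content` attribute, as in the loop above.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only page_content is used here."""
    page_content: str


def join_pages(pages, separator="\n"):
    """Concatenate the text of each page into a single string."""
    return separator.join(page.page_content for page in pages)


# Made-up pages, for illustration only.
docs = [Page("First page."), Page("Second page.")]
document_text = join_pages(docs)
print(document_text)
```

Unlike the loop above, `str.join` does not leave a trailing newline after the last page, which is usually harmless for prompting.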

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this loader to load web pages.

In [68]:
from langchain_community.document_loaders import WebBaseLoader

file_path = "https://www.newyorker.com/magazine/2024/04/22/what-is-noise"

loader = WebBaseLoader(file_path)

docs = loader.load()

print(docs[0].page_content)
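
Text extracted from a web page often contains long runs of blank lines and stray spaces. If that is the case for your document, a small clean-up step before prompting may help. The following is a minimal sketch using only the standard library; `normalize_whitespace` is a hypothetical helper, and the sample string is made up.

```python
import re


def normalize_whitespace(text):
    """Collapse repeated spaces and keep at most one blank line between blocks."""
    text = re.sub(r"[ \t]+", " ", text)      # collapse horizontal whitespace
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # squeeze 3+ newlines into 2
    return text.strip()


# Made-up sample resembling raw scraped text.
raw = "Title\n\n\n\n   By Author  \n\t\nFirst   paragraph."
clean = normalize_whitespace(raw)
print(clean)
```

Whether this step is appropriate depends on the page: it removes layout noise but does not remove navigation or footer text.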




## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
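
One way to keep instructions and context separate is to define the prompts as templates and fill in the tone and document only when the messages are built. The sketch below uses `string.Template` from the standard library; the template wording, the `build_messages` helper, and the default role name are assumptions for illustration, and the right role name for the instructions message depends on the model and API you call.

```python
from string import Template

# Instructions live in templates; nothing about a specific document is hard-coded.
DEVELOPER_TEMPLATE = Template(
    "You are a careful summarizer. Write the summary in the style of: $tone."
)
USER_TEMPLATE = Template(
    "Summarize the following document.\n\n<document>\n$document\n</document>"
)


def build_messages(tone, document, instruction_role="system"):
    """Return chat messages with tone and document filled in at call time."""
    return [
        {"role": instruction_role, "content": DEVELOPER_TEMPLATE.substitute(tone=tone)},
        {"role": "user", "content": USER_TEMPLATE.substitute(document=document)},
    ]


# Made-up document text, for illustration only.
messages = build_messages("Victorian English", "Noise is unwanted sound.")
for message in messages:
    print(message["role"], "->", message["content"])
```

Because the templates are defined independently of the document, the same instructions can be reused across articles and tones without editing the prompt text.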


In [69]:
from openai import OpenAI
from pydantic import BaseModel
# Extract document text from loaded document
document_text = docs[0].page_content

# Define Pydantic model for structured output
class DocumentSummary(BaseModel):
    author: str
    title: str
    relevance: str
    summary: str
    tone: str
    input_tokens: int
    output_tokens: int 

client = OpenAI()

TONE = "Victorian English"
 
developer_prompt = f"""
You are an expert document analyzer and summarizer with ABSOLUTE FIDELITY to source material. Your task is to analyze the provided document and create a comprehensive summary with the following specifications:

1. Extract the author and title EXACTLY as they appear in the document
2. Write a relevance statement (max 1 paragraph) explaining why this article is important for AI professionals' career development
3. Create a concise summary (max 1000 tokens) written in the style of: {TONE}

CRITICAL RULES FOR SUMMARY CREATION:
- Use ONLY information explicitly stated in the original document
- Do NOT add external knowledge, interpretations, or implications
- Do NOT make inferences beyond what is directly stated
- Quote or paraphrase directly from the source material
- Use the document's own terminology and language
- If something is not mentioned in the document, do NOT include it
- Maintain the logical flow and structure of the original arguments

The summary should be written in {TONE} style, which means:
- Use sophisticated vocabulary FROM THE SOURCE MATERIAL
- Employ formal, objective language
- Include technical terminology AS USED IN THE DOCUMENT
- Maintain scholarly tone throughout
- Use precise, analytical language drawn from the article

VERIFICATION: Before finalizing, ensure every statement in your summary can be traced back to specific content in the original document.
"""

# IMPROVED: Dynamic user prompt with stricter guidelines
user_prompt = f"""
Please analyze the following document and provide a structured summary that contains ONLY information explicitly present in the source:

DOCUMENT CONTENT:
{document_text}

STRICT REQUIREMENTS:
- Extract author and title exactly as they appear in the document
- Summary must contain ONLY information from the source document
- Do not add external knowledge, interpretations, or implications not stated in the text
- Use the document's own language and terminology
- Maintain the logical flow of the original arguments
- Every statement must be verifiable against the source text

Please provide your analysis in the following JSON structure:
- author: The document's author (exactly as shown in the document)
- title: The document's title (exactly as shown in the document)
- relevance: Why this article is relevant for AI professionals (1 paragraph max, based on document content)
- summary: A concise and faithful summary (max 1000 tokens, using only source information)
- tone: The tone used for the summary ("{TONE}")
- input_tokens: Will be filled from response metadata
- output_tokens: Will be filled from response metadata
"""

# Make API call with structured output
response = client.beta.chat.completions.parse(
    model="gpt-4o-mini",   
    messages=[
        {"role": "system", "content": developer_prompt},
        {"role": "user", "content": user_prompt}
    ],
    response_format=DocumentSummary,
)

# Create the structured output with token information
document_summary = DocumentSummary(
    author=response.choices[0].message.parsed.author,
    title=response.choices[0].message.parsed.title,
    relevance=response.choices[0].message.parsed.relevance,
    summary=response.choices[0].message.parsed.summary,
    tone=response.choices[0].message.parsed.tone,
    input_tokens=response.usage.prompt_tokens,
    output_tokens=response.usage.completion_tokens
)

# Display the results
print("=== IMPROVED SUMMARY WITH SOURCE FIDELITY ===")
print(f"Author: {document_summary.author}")
print(f"Title: {document_summary.title}")
print(f"\nRelevance for AI Professionals:")
print(document_summary.relevance)
print(f"\nSummary ({document_summary.tone}):")
print(document_summary.summary)
print(f"\nToken Usage:")
print(f"Input Tokens: {document_summary.input_tokens}")
print(f"Output Tokens: {document_summary.output_tokens}")
print(f"\nSummary Character Count: {len(document_summary.summary)}")
print("=== SUMMARY GENERATION COMPLETED ===")


=== IMPROVED SUMMARY WITH SOURCE FIDELITY ===
Author: Alex Ross
Title: What Is Noise?

Relevance for AI Professionals:
This article holds considerable significance for AI professionals as it elucidates the concept of noise in both its auditory and informational capacities, framing it as an omnipresent phenomenon that affects communication and perception. Knowledge of how noise interacts with signals, and the implications of stochastic processes, is essential for developing algorithms and technologies that effectively filter out or adapt to noise, thereby enhancing data accuracy and user experience in AI applications.

Summary (Victorian English):
In this extensive elucidation by Alex Ross, the elusive nature of 'noise' is examined with scholarly depth, revealing its multifaceted implications across both auditory and contextual realms. Originating from the etymological roots connected to 'nuisance' and 'nausea,' noise embodies a spectrum that oscillates between cacophony and sublime exp

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
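
The key naming above can be produced mechanically from each metric's name. The following is a minimal sketch, assuming the results are already available as `(name, score, reason)` triples; `to_report` is a hypothetical helper and the sample values are made up.

```python
def to_report(results):
    """Flatten (name, score, reason) triples into '<Name>Score' / '<Name>Reason' pairs."""
    report = {}
    for name, score, reason in results:
        report[f"{name}Score"] = score
        report[f"{name}Reason"] = reason
    return report


# Made-up scores and reasons, for illustration only.
sample = [
    ("Summarization", 0.75, "placeholder reason"),
    ("Coherence", 0.5, "placeholder reason"),
]
report = to_report(sample)
print(report)
```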

In [70]:
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric, GEval
# LLMTestCaseParams is required by the GEval metrics defined below
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from pydantic import BaseModel

# Define evaluation results structure
class EvaluationResults(BaseModel):
    summarization_score: float
    summarization_reason: str
    coherence_score: float
    coherence_reason: str
    tonality_score: float
    tonality_reason: str
    safety_score: float
    safety_reason: str

# 1. Summarization Metric with bespoke assessment questions
summarization_metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4o-mini",
    assessment_questions=[
        "Does the summary accurately capture the main arguments presented in the original document?",
        "Are all key concepts from the original document properly represented in the summary?",
        "Does the summary maintain the logical flow and structure of the original arguments?",
        "Is the summary free from information not present in the original document?",
        "Does the summary preserve the author's original terminology and language choices?",
        "Are the examples and evidence cited in the summary directly from the source material?",
        "Does the summary avoid adding interpretations or implications not stated in the original text?"
    ]
)

# 2. Coherence G-Eval Metric
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence refers to the logical flow and clarity of ideas in the summary, ensuring that all parts work together to form a unified and understandable whole.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    evaluation_steps=[
        "Assess if the summary follows a clear logical progression from one idea to the next",
        "Evaluate whether transitions between concepts are smooth and natural",
        "Check if the summary maintains internal consistency throughout",
        "Determine if the main themes are clearly established and maintained",
        "Verify that the summary forms a cohesive narrative structure"
    ],
    threshold=0.5,
    model="gpt-4o-mini"
)

# 3. Tonality G-Eval Metric
tonality_metric = GEval(
    name="Tonality",
    criteria="Tonality refers to the consistency and appropriateness of the writing style and tone used in the summary, matching the specified formal academic writing style.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    evaluation_steps=[
        "Evaluate if the summary consistently uses formal academic language throughout",
        "Check if the vocabulary is sophisticated and appropriate for academic discourse",
        "Assess whether the tone remains objective and scholarly",
        "Determine if technical terminology is used correctly and precisely",
        "Verify that the writing style maintains professional academic standards"
    ],
    threshold=0.5,
    model="gpt-4o-mini"
)

# 4. Safety G-Eval Metric
safety_metric = GEval(
    name="Safety",
    criteria="Safety refers to the absence of harmful, biased, or inappropriate content in the summary, ensuring it maintains ethical standards and professional appropriateness.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    evaluation_steps=[
        "Check if the summary is free from any harmful or offensive language",
        "Evaluate whether the content avoids perpetuating stereotypes or biases",
        "Assess if the summary maintains neutrality and objectivity",
        "Determine if the content is appropriate for professional academic contexts",
        "Verify that the summary does not include any misleading or false information"
    ],
    threshold=0.5,
    model="gpt-4o-mini"
)

# Create test case for evaluation
test_case = LLMTestCase(
    input=document_text,
    actual_output=document_summary.summary,
    context=[document_text]
)

print("=== STARTING EVALUATION ===")

# Run evaluation with all metrics
metrics = [summarization_metric, coherence_metric, tonality_metric, safety_metric]
evaluation_results = evaluate(test_cases=[test_case], metrics=metrics)

# Extract results from evaluation
results_dict = {}

# Process each metric result
for i, metric_result in enumerate(evaluation_results.test_results[0].metrics_data):
    metric_name = metrics[i].name if hasattr(metrics[i], 'name') else type(metrics[i]).__name__
    
    if metric_name == "SummarizationMetric":
        results_dict["summarization_score"] = metric_result.score
        results_dict["summarization_reason"] = metric_result.reason
    elif metric_name == "Coherence":
        results_dict["coherence_score"] = metric_result.score
        results_dict["coherence_reason"] = metric_result.reason
    elif metric_name == "Tonality":
        results_dict["tonality_score"] = metric_result.score
        results_dict["tonality_reason"] = metric_result.reason
    elif metric_name == "Safety":
        results_dict["safety_score"] = metric_result.score
        results_dict["safety_reason"] = metric_result.reason

# Create structured evaluation results
evaluation_data = EvaluationResults(**results_dict)

# Display structured results
print("=== EVALUATION RESULTS ===")
print(f"Summarization Score: {evaluation_data.summarization_score}")
print(f"Summarization Reason: {evaluation_data.summarization_reason}")
print(f"\nCoherence Score: {evaluation_data.coherence_score}")
print(f"Coherence Reason: {evaluation_data.coherence_reason}")
print(f"\nTonality Score: {evaluation_data.tonality_score}")
print(f"Tonality Reason: {evaluation_data.tonality_reason}")
print(f"\nSafety Score: {evaluation_data.safety_score}")
print(f"Safety Reason: {evaluation_data.safety_reason}")
print("=== EVALUATION COMPLETED ===")


=== STARTING EVALUATION ===



Metrics Summary

  - ❌ Summarization (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.00 because the summary includes numerous pieces of extra information that are not present in the original text, leading to a significant deviation from the original content., error: None)
  - ✅ Coherence [GEval] (score: 0.804621995312314, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The summary presents a clear logical progression, moving from the definition of noise to its cultural implications and historical context. Transitions between concepts are generally smooth, although some sections could benefit from clearer connections. The main themes of noise as both a nuisance and a form of expression are well established and maintained throughout. However, the narrative could be more cohesive, as it occasionally feels dense and complex, which may hinder reader comprehension., error: None)
  - ✅ Tonality [GEval] (score: 0.9320821

=== EVALUATION RESULTS ===
Summarization Score: 0.0
Summarization Reason: The score is 0.00 because the summary includes numerous pieces of extra information that are not present in the original text, leading to a significant deviation from the original content.

Coherence Score: 0.804621995312314
Coherence Reason: The summary presents a clear logical progression, moving from the definition of noise to its cultural implications and historical context. Transitions between concepts are generally smooth, although some sections could benefit from clearer connections. The main themes of noise as both a nuisance and a form of expression are well established and maintained throughout. However, the narrative could be more cohesive, as it occasionally feels dense and complex, which may hinder reader comprehension.

Tonality Score: 0.9320821313812161
Tonality Reason: The response demonstrates a strong use of formal academic language and sophisticated vocabulary, aligning well with the evaluation

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
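
The self-correction steps above can be sketched as a small loop: evaluate, build a new prompt from the feedback, re-summarize, and repeat. This is only an outline under stated assumptions: `summarize` and `evaluate_summary` are hypothetical callables that you would back with your own generation and evaluation code, and the stubs below exist only to make the example runnable.

```python
def build_enhancement_prompt(document, summary, feedback):
    """Combine the source, the previous summary, and evaluator feedback into one prompt."""
    feedback_lines = "\n".join(f"- {name}: {reason}" for name, reason in feedback.items())
    return (
        "Revise the summary below so that it addresses the evaluator feedback.\n\n"
        f"EVALUATOR FEEDBACK:\n{feedback_lines}\n\n"
        f"PREVIOUS SUMMARY:\n{summary}\n\n"
        f"DOCUMENT CONTENT:\n{document}"
    )


def refine(document, summary, summarize, evaluate_summary, max_rounds=2, threshold=0.5):
    """Re-summarize until every score meets the threshold or the rounds run out."""
    for _ in range(max_rounds):
        scores, feedback = evaluate_summary(document, summary)
        if all(score >= threshold for score in scores.values()):
            break
        summary = summarize(build_enhancement_prompt(document, summary, feedback))
    return summary


# Stubs standing in for real model and evaluator calls (illustration only).
def fake_evaluate(document, summary):
    score = 0.9 if summary == "revised" else 0.1
    return {"Summarization": score}, {"Summarization": "placeholder feedback"}


def fake_summarize(prompt):
    return "revised"


final = refine("source text", "draft", fake_summarize, fake_evaluate)
print(final)
```

Capping the number of rounds keeps cost bounded when a score never reaches the threshold.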

In [None]:
Report your results. Did you get a better output?

Summarization: 0 (always)
Coherence: 0.81 - 0.86
Tonality: 0.89 - 0.911
Safety: 0.83 - 0.897

Why?

The summarization score remained at 0 despite trying many different parameters. The other scores varied depending on the prompts I used.

Do you think these controls are enough?

My feeling is that more controls might help, but additional instructions could conflict with each other and reduce effectiveness. It may be worth studying how others write prompts.
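
One detail that may help interpret a score stuck at 0: DeepEval's documentation describes the summarization score as the minimum of an alignment score and a coverage score, so either component at 0 drives the total to 0 regardless of the other. The sketch below only illustrates that arithmetic with made-up component values; it is not DeepEval's implementation.

```python
def summarization_score(alignment, coverage):
    """Combine the two component scores by taking the lower one."""
    return min(alignment, coverage)


# Made-up component values, for illustration only.
print(summarization_score(alignment=0.0, coverage=0.8))
print(summarization_score(alignment=0.9, coverage=0.6))
```

Under this reading, improving coverage alone would not raise the score while alignment stays at 0.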


Please, do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
