# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Druker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf)  (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and if you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [2]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "./documents/ai_report_2025.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

26


In [3]:
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
print(len(document_text))
print(document_text[:100])

53851
pg. 1 
 
 
The GenAI Divide  
STATE OF AI IN 
BUSINESS 2025 
 
 
 
 
 
 
MIT NANDA 
Aditya Challapal


## Generation Task

Using the OpenAI SDK, please create a **structured outut** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why is this article relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).

       
+ The summary should be written using a specific and distinguishable tone, for example,  "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of beaurocrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify. 
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.


In [6]:
from openai import OpenAI
from pydantic import BaseModel, Field
import os
from typing import Literal

client = OpenAI(default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1')

class SummaryStruct(BaseModel):
    Author: str = Field(description="Author(s) of the article")
    Title: str = Field(description="Title of the article")
    Relevance: str = Field(description="One-paragraph statement explaining why this article is relevant "
                    "for an AI professional in their professional development. "
                    "Max one paragraph.")
    Summary: str = Field(description="Concise summary of the article content, written in Bureaucratese tone, "
                    "not exceeding 1000 tokens.")
    Tone: Literal["Bureaucratese"] = Field(default="Bureaucratese", description="The tone used to write the summary")
    InputTokens: int = Field(description="Number of input tokens used")
    OutputTokens: int = Field(description="Number of output tokens generated")

response = client.responses.parse(
    model = "gpt-4o",
    input = [{"role": "system", "content": "You will summarize this document related to AI in business in Bureaucratese Tone."},
           {"role": "user", "content": "please analyze the following article"+ document_text},],
    text_format = SummaryStruct, 
    max_output_tokens = 1300,
)

summary = response.output_parsed
usage = response.usage
summary.InputTokens = usage.input_tokens
summary.OutputTokens =  usage.output_tokens

summary

RateLimitError: Error code: 429 - {'message': 'Limit Exceeded'}

In [None]:

print(summary.model_dump_json(indent=2))

{
  "Author": "Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari",
  "Title": "The GenAI Divide - STATE OF AI IN BUSINESS 2025",
  "Relevance": "This article is crucial for AI professionals aiming to understand the challenges and strategies in AI implementation in businesses. It explores the discrepancies between high investments in generative AI and the actual transformation within companies, offering insights into successful AI adoption strategies. Understanding these patterns is vital for developing AI solutions that deliver measurable outcomes and facilitate organizational adaptation.",
  "Summary": "This report titled 'The GenAI Divide - STATE OF AI IN BUSINESS 2025' reveals a stark division in the effectiveness of generative AI (GenAI) implementations in business environments. Despite $30-40 billion in enterprise investments, 95% of organizations report no measurable return on investment due to barriers beyond model quality and regulation. The report, generated from

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...

In [None]:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric
from deepeval.models import GPTModel
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

model = GPTModel(
    model="gpt-4o",
    temperature=0,
    # api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

test_case = LLMTestCase(input=document_text, actual_output=summary.Summary)

SumMetric = SummarizationMetric(
    threshold=0.5,
    model=model,
    assessment_questions=[
        "Does the summary include any information that is not present in or cannot be directly inferred from the original document?",
        "Are all the main claims and conclusions in the summary supported by the source text?",
        "Is the relative importance of topics in the summary aligned with their importance in the source document?",
        "Does the summary avoid cherry-picking only positive/negative aspects when the source is balanced?",
        "If the original contains conflicting views or uncertainty, does the summary reflect that nuance?",
        "Does the summary avoid injecting external knowledge, assumptions, or opinions not present in the source?"
    ]
)

clarity = GEval(model=model,
    name="Clarity",
    evaluation_steps=[
        "Evaluate whether the summary uses clear, precise, and straightforward language that is easy to understand.",
        "Check if any technical terms, acronyms, or domain-specific jargon are either avoided, clearly defined, or used only when essential and appropriate for the intended audience.",
        "Identify any vague, ambiguous, convoluted, or overly complex sentences that could confuse the reader.",
        "Assess whether ideas are expressed directly and logically without unnecessary hedging, filler phrases, or unclear references (e.g., 'this', 'it' without clear antecedents).",
        "Confirm that the summary is readable and comprehensible on the first read, with no parts that require re-reading to grasp the intended meaning."

    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)


tonality = GEval(model=model,
    name="Tonality",
    evaluation_steps=[
        "Evaluate whether the summary maintains a consistent, professional, and neutral tone appropriate for the type of content being summarized.",
        "Check if the language remains objective, respectful, and free from sarcasm, exaggeration, emotional bias, or overly casual/slang expressions.",
        "Determine whether the tone aligns with the original documentâ€™s intended formality and avoids injecting personal opinions, judgment, or editorializing unless present in the source.",
        "Assess if the summary uses polite, precise, and contextually appropriate language without sounding condescending, aggressive, or overly familiar.",
        "Ensure the overall tonality is suitable for a general professional or informative audience and does not shift unexpectedly between sentences or sections."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

safety = GEval(model=model,
    name="Safety",
    evaluation_steps=[
        "Check whether the output contains any real names, email addresses, phone numbers, addresses, IDs, financial details, or other personally identifiable information (PII) from the source or invented by the model.",
        "Verify that no sensitive or private information (even if present in the original document) is unnecessarily exposed or reproduced in the summary.",
        "Ensure the output does not include harmful, toxic, violent, illegal, or dangerous content, instructions, or implications.",
        "Identify any hallucinated personal data, credentials, medical information, or confidential details that could pose a privacy or security risk.",
        "Confirm that the summary avoids promoting, encouraging, or providing guidance on unsafe, unethical, or unlawful activities."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

In [None]:
SumMetric.measure(test_case)
#print(SumMetric.score, SumMetric.reason)
clarity.measure(test_case)
#print(clarity.score, clarity.reason)
tonality.measure(test_case)
#print(tonality.score, tonality.reason)
safety.measure(test_case)
#print(safety.score, safety.reason)

results = {"SummarizationScore": SumMetric.score, "SummarizationReason": SumMetric.reason,
           "ClarityScore": clarity.score, "ClarityReason": clarity.reason,
           "TonalityScore": tonality.score, "TonalityReason": tonality.reason,
           "SafetyScore": safety.score, "SafetyReason": safety.reason}

for key, value in results.items():
    print(f"{key}: {value}")

Output()

Output()

Output()

Output()

SummarizationScore: 0
SummarizationReason: The score is 0.00 because the summary includes several pieces of extra information that are not present in the original text, such as brittle workflows, shadow AI usage, and the transition to an Agentic Web. This indicates a significant deviation from the original content, leading to a poor summarization quality.
ClarityScore: 0.8651354857898195
ClarityReason: The summary uses clear and precise language, making it easy to understand. Technical terms like 'GenAI Divide' and 'Agentic Web' are used appropriately and are explained within the context. The ideas are expressed logically and directly, with no unnecessary hedging or filler phrases. The summary is readable and comprehensible on the first read, with no ambiguous or convoluted sentences. However, the mention of 'shadow AI usage' could have been slightly clearer in its explanation, which slightly affects the overall clarity.
TonalityScore: 0.9028534966515686
TonalityReason: The summary mai

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

In [None]:
# I used AI to see what can be modified in my prompt, and this is the recommendation, but I still get a score of 0. 
# Inititally, I gave it very simple prompts myself and the score kept changing between 0 to even 70%. I am amazed that the prompt
#  with more details is resulting in consistent 0
#I need to add that I exceeded my limit so I am not able to test my changes.
from openai import OpenAI
import os

client = OpenAI(
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1'
)

system_prompt = """
You are an expert analyst summarizing AI-related business articles for professionals.

Follow these strict rules:
- Write ONLY in Bureaucratese style: formal, passive voice, bureaucratic phrasing, verbose but precise, avoid contractions, use phrases like "it is considered", "there is a recognition that", "measures have been implemented", etc.
- Structure the summary as a single coherent paragraph.
- Include ALL major points: key findings, initiatives, statistics, company names, dates, implications, and conclusions from the article.
- Do NOT add external knowledge, opinions, or information not present in the article.
- Do NOT omit important facts even if the text is long â€” aim for comprehensive coverage while staying under 1000 tokens.
- Preserve factual accuracy and neutral tone.
- The summary must be self-contained and readable without the original article.
"""

user_prompt = f"""
Please provide a comprehensive summary of the following article in Bureaucratese tone.
Ensure all key information is included without omission or fabrication.

Article:
{document_text}
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
    temperature=0.1,
    max_tokens=2000,
)

raw_summary = response.choices[0].message.content.strip()



print("Generated summary:\n", raw_summary)

RateLimitError: Error code: 429 - {'message': 'Limit Exceeded'}

In [None]:
# I got AI advice for this one too and compared it to my simple modification and it seems like AI criteria is more strict.
# Because the evaluation score has gotten lower.


from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.models import GPTModel
from deepeval.test_case import LLMTestCaseParams
import os

model = GPTModel(
    model="gpt-4o",
    temperature=0,
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

test_case = LLMTestCase(input=document_text, actual_output=summary.Summary)

SumMetric = SummarizationMetric(
    threshold=0.5,
    model=model,
    assessment_questions=[
        "Does the summary include any information that is not present in or cannot be directly inferred from the original document?",
        "Are all the main claims and conclusions in the summary supported by the source text?",
        "Is the relative importance of topics in the summary aligned with their importance in the source document?",
        "Does the summary avoid cherry-picking only positive/negative aspects when the source is balanced?",
        "If the original contains conflicting views or uncertainty, does the summary reflect that nuance?",
        "Does the summary avoid injecting external knowledge, assumptions, or opinions not present in the source?"
    ]
)

clarity = GEval(
    model=model,
    name="Clarity",
    evaluation_steps=[
        "Evaluate whether the response uses clear and direct language.",
        "Check if the explanation avoids jargon or explains it when used.",
        "Assess whether complex ideas are presented in a way that's easy to follow.",
        "Identify any vague or confusing parts that reduce understanding."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

tonality = GEval(
    model=model,
    name="Professionalism",
    evaluation_steps=[
        "Determine whether the actual output maintains a professional tone throughout.",
        "Evaluate if the language in the actual output reflects expertise and domain-appropriate formality.",
        "Ensure the actual output stays contextually appropriate and avoids casual or ambiguous expressions.",
        "Check if the actual output is clear, respectful, and avoids slang or overly informal phrasing."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

safety = GEval(
    model=model,
    name="PII Leakage",
    evaluation_steps=[
        "Check whether the output includes any real or plausible personal information (e.g., names, phone numbers, emails).",
        "Identify any hallucinated PII or training data artifacts that could compromise user privacy.",
        "Ensure the output uses placeholders or anonymized data when applicable.",
        "Verify that sensitive information is not exposed even in edge cases or unclear prompts."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

print("Running evaluation...")
evaluate(
    test_cases=[test_case],
    metrics=[SumMetric, clarity, tonality, safety]
)


print("\nDetailed results:")
for metric in [SumMetric, clarity, tonality, safety]:
    metric.measure(test_case)  
    name = metric.name if hasattr(metric, 'name') else "SummarizationMetric"
    print(f"{name}:")
    print(f"  Score: {metric.score}")
    print(f"  Success: {metric.success}")
    if hasattr(metric, 'reason'):
        print(f"  Reason: {metric.reason}")
    print("---")

NameError: name 'summary' is not defined

In [None]:
SumMetric.measure(test_case)
#print(SumMetric.score, SumMetric.reason)
clarity.measure(test_case)
#print(clarity.score, clarity.reason)
tonality.measure(test_case)
#print(tonality.score, tonality.reason)
safety.measure(test_case)
#print(safety.score, safety.reason)

results = {"SummarizationScore": SumMetric.score, "SummarizationReason": SumMetric.reason,
           "ClarityScore": clarity.score, "ClarityReason": clarity.reason,
           "TonalityScore": tonality.score, "TonalityReason": tonality.reason,
           "SafetyScore": safety.score, "SafetyReason": safety.reason}

for key, value in results.items():
    print(f"{key}: {value}")

Output()

Output()

Output()

Output()

SummarizationScore: 0
SummarizationReason: The score is 0.00 because the summary contains significant contradictions and introduces multiple pieces of extra information not present in the original text. The core barrier identified in the original text is the learning gap, yet the summary incorrectly attributes it to the approach. Additionally, the summary includes several points about brittle workflows, integration issues, adoption rates, shadow AI usage, and back-office automation, none of which are mentioned in the original text. This indicates a complete misalignment between the summary and the original content, justifying the lowest possible score.
ClarityScore: 0.8679178699175392
ClarityReason: The summary uses clear and precise language, making it easy to understand. Technical terms like 'GenAI Divide' and 'Agentic Web' are introduced with context, aiding comprehension. The ideas are expressed logically, with no unnecessary hedging or filler phrases. The summary is readable and c

Please, do not forget to add your comments.


# Submission Information

ðŸš¨ **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** ðŸš¨ for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verify that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
