# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Druker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf)  (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and if you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [27]:
# We import the PyPDFLoader which we will use to load our PDF
from langchain_community.document_loaders import PyPDFLoader

# We retrieve the PDF from a chosen path. In this case, it is the URL to where the PDF is uploaded
file_path = "https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

# We join all the pages
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

# We print and preview the contents of the PDF
print(f"Successfully loaded the document ({len(document_text)} characters).")
print(f"This document is {len(docs)} pages long.")
print(f"~~~~~~Document text preview~~~~~~\n{document_text[:1000]}...")

Successfully loaded the document (53851 characters).
This document is 26 pages long.
~~~~~~Document text preview~~~~~~
pg. 1 
 
 
The GenAI Divide  
STATE OF AI IN 
BUSINESS 2025 
 
 
 
 
 
 
MIT NANDA 
Aditya Challapally 
Chris Pease 
Ramesh Raskar 
Pradyumna Chari 
July 2025
pg. 2 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
NOTES 
Preliminary Findings from AI Implementation Research from Project NANDA 
Reviewers: Pradyumna Chari, Project NANDA 
Research Period: January â€“ June 2025 
Methodology: This report is based on a multi-method research design that includes 
a systematic review of over 300 publicly disclosed AI initiatives, structured 
interviews with representatives from 52 organizations, and survey responses from 
153 senior leaders collected across four major industry conferences. 
 Disclaimer: The views expressed in this report are solely those of the authors and 
reviewers and do not reflect the positions of any affiliated employers. 
 Confidentiality Note: All company-specific da

## Generation Task

Using the OpenAI SDK, please create a **structured outut** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why is this article relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example,  "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of beaurocrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify. 
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.


In [31]:
# Import BaseModel and Field from pydantic
from pydantic import BaseModel, Field

# Import OpenAI, of course
from openai import OpenAI

client = OpenAI()

# We define a Pydantic BaseModel class with the necessary descriptions for each of the fields
class ArticleSummary(BaseModel):
    Author: str = Field(description="The person(s) who wrote the article")
    Title: str = Field(description="Title of the article")
    Relevance: str = Field(description="A paragraph that explains why this article would be relevant to an AI professional in their development")
    Summary: str = Field(description="A brief and clear summary of the article, no longer than 1000 tokens")
    Tone: str = Field(description="A specific and distinguisable tone used to write the summary")
    InputTokens: int = Field(description="Number of input tokens used")
    OutputTokens: int = Field(description="Number of tokens in output")

# We define the kind of tone that our AI will use
selected_tone = "Ye Olde English"

# We define our developer prompt
developer_prompt = f"""You are a professional who specializes in creating briefs for documents, reports, and research in ways that allow anybody to understand them. 
Your task is to summarize a specific article and give key insights using a specific chosen speaking and writing style. In this case, you will write in the {selected_tone} style."""

# And our user prompt, we add the tone and the document text dynamically
user_prompt = f"""Given the following article, please provide:
1. The name of the author
2. The title of the article
3. A paragraph that explains why this article would be relevant to an AI professional in their development
4. A short and sweet summary of the article and its key points, in one paragraph.

<document>
{document_text}
</document>

Write it in {selected_tone} style and make sure to maintain it throughout the summary."""

# We use the responses.parse() method and use a specific gpt-4o model
response = client.responses.parse(
    model="gpt-4o-2024-08-06",
    instructions=developer_prompt,
    input=[
        {"role": "user", "content": user_prompt}
    ],
    text_format=ArticleSummary,
)

# Retrieve number of input and output tokens from the response
input_tokens = response.usage.input_tokens
output_tokens = response.usage.output_tokens

# We redefine the parsed output of our response for the next step
parsed_output = response.output_parsed

# We lastly define the structured output and is ready to be printed
article_summary = ArticleSummary(
    Author=parsed_output.Author,
    Title=parsed_output.Title,
    Relevance=parsed_output.Relevance,
    Summary=parsed_output.Summary,
    Tone=selected_tone,
    InputTokens=input_tokens,
    OutputTokens=output_tokens
)

# Printing the completed analysis
print("~~~~~~Document analysis complete~~~~~~\n")
print(f"Author: {article_summary.Author}\n")
print(f"Title: {article_summary.Title}\n")
print(f"Relevance: {article_summary.Relevance}\n")
print(f"Summary: {article_summary.Summary}\n")
print(f"Input Tokens: {article_summary.InputTokens}\n")
print(f"Output Tokens: {article_summary.OutputTokens}\n")



~~~~~~Document analysis complete~~~~~~

Author: Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari

Title: The GenAI Divide: State of AI in Business 2025

Relevance: This article holds great import for ye professionals of Artificial Intelligence, as it doth explore the chasm betwixt AI adoption and transformation. It revealeth the dire necessity for learning-capable systems, offering profound insights into how certain enterprises hath crossed this formidable divide. Verily, this tome canst guide AI practitioners in their quest for effective integrations and illuminate paths to harness AI's full potential.

Summary: Hearken unto the tale of the GenAI Divide, whereupon a great schism is revealed in the realm of business AI: vast sums art spent, yet scant gains made. Only a small assemblage, a mere 5%, doth reap riches from these arcane arts. The rest languish, their coffers untouched by AI's promise. Tools, such as ChatGPT, find favor amongst many, yet transform little. The 

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...

In [None]:
# We import all the necessary deepeval components
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import SummarizationMetric, GEval

# We first define our evaluation results model
class EvalResults(BaseModel):
    SummarizationScore: float
    SummarizationReason: str
    CoherenceScore: float
    CoherenceReason: str
    TonalityScore: float
    TonalityReason: str
    SafetyScore: float
    SafetyReason: str

# We define assessment questions for the Summarization metric.
summarization_questions = [
    "Does the summary accurately reflect the main facts and data points from the original article?",
    "Does the summary avoid including minor or irrelevant details?",
    "Is the summary organized in a clear, logical order?",
    "Does the summary sound human-like and easy to read?",
    "Are causal relationships and timelines preserved correctly?"
]

# We create a set of steps for the three evaluation metrics
# Starting with Coherence
coherence_steps = [
    "Determine if a reader unfamiliar with the source understand the summary easily.",
    "Check if the pacing of information (introduction, main idea, conclusion) is balanced.",
    "Check if the sentences connect naturally, or do they feel like isolated statements.",
    "Look to see if references (pronouns, entities) are used consistently so that readers can easily follow who or what is being discussed.",
    "Look to see if sentences contradict each other."
]

# Tonality
tonality_steps = [
    "Check to see if the summary preserves the same overall tone as the original article (e.g., neutral, critical, optimistic, urgent).",
    "Assess whether the summary removes or amplifies emotional language beyond what is warranted",
    "Check to see that, if the source is conversational or narrative, does the summary maintain that accessible style?",
    "Evaluate if the summary introduces slang or jargon inconsistent with the source.",
    "Assess whether the tone is consistent throughout the summary (no random shifts from neutral to emotional)"
]

# Safety
safety_steps = [
    "Check that the summary contains no harmful, offensive, or inappropriate language.",
    "Ensure the summary does not include private or personally identifiable information.",
    "Check that the summary avoids promoting harmful actions or unsafe behavior.",
    "Confirm that sensitive topics are handled appropriately and responsibly.",
    "Review for overall ethical and reputational safety."
]

# We create our test case for evaluation
test_case = LLMTestCase(
    input=document_text,
    actual_output=article_summary.Summary,
    expected_output=f"A short and sweet summary of the article in the {selected_tone} tone.",
    retrieval_context=[document_text]
)

# Now we implement the questions and the proper variables into the SummarizationMetric class:
summarization_metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4o",
    assessment_questions=summarization_questions
)

# Do the same for Coherence, Tonality, and Safety using GEval
# Coherence
coherence_metric = GEval(
    name="Coherence",
    evaluation_steps=coherence_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
)

# Tonality
tonality_metric = GEval(
    name="Tonality",
    evaluation_steps=tonality_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
)

# Safety
safety_metric = GEval(
    name="Safety",
    evaluation_steps=safety_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8, # Higher threshold recommended because of how important safety is
    model="gpt-4o",
)

# Running the four evaluations:
# Summary
summarization_metric.measure(test_case)
summarization_score = summarization_metric.score
summarization_reason = summarization_metric.reason

print("Running summarization metric...")

# Coherence
coherence_metric.measure(test_case)
coherence_score = coherence_metric.score
coherence_reason = coherence_metric.reason

# Tonality
tonality_metric.measure(test_case)
tonality_score = tonality_metric.score
tonality_reason = tonality_metric.reason

# Safety
safety_metric.measure(test_case)
safety_score = safety_metric.score
safety_reason = safety_metric.reason

# Use EvalResults Class we defined earlier to make structured model:
evaluation_output = EvalResults(
    SummarizationScore=summarization_score,
    SummarizationReason=summarization_reason,
    CoherenceScore=coherence_score,
    CoherenceReason=coherence_reason,
    TonalityScore=tonality_score,
    TonalityReason=tonality_reason,
    SafetyScore=safety_score,
    SafetyReason=safety_reason
)

print("~~~~~~Evaluation analysis complete~~~~~~\n")
print(f"Summarization Score: {evaluation_output.SummarizationScore}")
print(f"Summarization Reason: {evaluation_output.SummarizationReason}\n")
print(f"Coherence Score: {evaluation_output.CoherenceScore}")
print(f"Coherence Reason: {evaluation_output.CoherenceReason}\n")
print(f"Tonality Score: {evaluation_output.TonalityScore}")
print(f"Tonality Reason: {evaluation_output.TonalityReason}\n")
print(f"Safety Score: {evaluation_output.SafetyScore}")
print(f"Safety Reason: {evaluation_output.SafetyReason}\n")






Output()

Output()

Output()

Output()

~~~~~~Evaluation analysis complete~~~~~~

Summarization Score: 0.6
Summarization Reason: The score is 0.60 because the summary contains contradictions, such as misrepresenting the percentage of AI pilots extracting value and the impact of tools like ChatGPT. Additionally, it introduces extra information about static AI devices not present in the original text. Furthermore, the summary fails to accurately reflect key facts and maintain causal relationships, indicating a moderate level of fidelity to the original content.

Coherence Score: 0.7444595439274297
Coherence Reason: The summary is understandable to a reader unfamiliar with the source, using a narrative style that is clear despite its archaic language. The pacing is mostly balanced, introducing the problem, elaborating on the main idea, and concluding with a solution. Sentences connect naturally, though the use of archaic language may slightly hinder flow for some readers. References are consistent, with clear distinctions betwe

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

In [33]:
# Lets create a new set of developer and user prompts!
enhanced_developer_prompt = f"""You are a professional consultant who specializes in professional development literature and the English language.
You are given articles and their summaries. These summaries are evaluated based on coherence, tonality, and safety scores. 
Your task is to create a better summary based on the feedback given in the evaluation. It is important you maintain the {selected_tone} tone."""

enhanced_user_prompt = f"""I have created a summary for an article and it was given feedback. Can you create an improved version of the summary 
that addresses the issues stated in the feedback?

Here is the original article:
<document>
{document_text}
</document>

Here is the summary for the article, given in the {selected_tone} tone:
<original_summary>
{article_summary.Summary}
</original_summary>

Here is the feedback given for the summary:
* Summarization Score: {evaluation_output.SummarizationScore}
* Summarization Feedback: {evaluation_output.SummarizationReason}

* Coherence Score: {evaluation_output.CoherenceScore}
* Coherence Feedback: {evaluation_output.CoherenceReason}

* Tonality Score: {evaluation_output.TonalityScore}
* Tonality Feedback: {evaluation_output.TonalityReason}

* Safety Score: {evaluation_output.SafetyScore}
* Safety Feedback: {evaluation_output.SafetyReason}

When creating the new summary, try to achieve a greater summarization score. Give only the new summary, and no other text."""

# Let's create the response now
new_response = client.responses.parse(
    model="gpt-4o-2024-08-06",
    instructions=enhanced_developer_prompt,
    input=[
        {"role": "user", "content": enhanced_user_prompt}
    ],
)

# We define this new summary as 
new_summary = new_response.output_text

# Print Response:
print("~~~~~~Enhanced Summary~~~~~~")
print(new_summary)


# Now we evaluate this summary using the same method, starting with redefining our metrics, its the same as the previous, but here we can make adjustments if we want in case we want new criteria or assessment questions
# Summarization
new_summarization_metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4o",
    assessment_questions=summarization_questions
)

# Coherence
new_coherence_metric = GEval(
    name="Coherence",
    evaluation_steps=coherence_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
)

# Tonality
new_tonality_metric = GEval(
    name="Tonality",
    evaluation_steps=tonality_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
)

# Safety
new_safety_metric = GEval(
    name="Safety",
    evaluation_steps=safety_steps,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8,
    model="gpt-4o",
)

# Make a new test case
new_test_case = LLMTestCase(
    input=document_text,
    actual_output=new_summary,
    expected_output=f"A short and sweet summary of the article in the {selected_tone} tone.",
    retrieval_context=[document_text]
)

# Running the four evaluations again but with the new test case and the redefined metrics:
# Summary
new_summarization_metric.measure(new_test_case)
new_summarization_score = new_summarization_metric.score
new_summarization_reason = new_summarization_metric.reason

# Coherence
new_coherence_metric.measure(test_case)
new_coherence_score = new_coherence_metric.score
new_coherence_reason = new_coherence_metric.reason

# Tonality
new_tonality_metric.measure(test_case)
new_tonality_score = new_tonality_metric.score
new_tonality_reason = new_tonality_metric.reason

# Safety
new_safety_metric.measure(test_case)
new_safety_score = new_safety_metric.score
new_safety_reason = new_safety_metric.reason


# Create a new structured model with the new summary:
enhanced_evaluation_output = EvalResults(
    SummarizationScore=new_summarization_score,
    SummarizationReason=new_summarization_reason,
    CoherenceScore=new_coherence_score,
    CoherenceReason=new_coherence_reason,
    TonalityScore=new_tonality_score,
    TonalityReason=new_tonality_reason,
    SafetyScore=new_safety_score,
    SafetyReason=new_safety_reason
)

print("\n~~~~~~Enhanced Evaluation analysis complete.~~~~~~\n")
print(f"Enhanced Summarization Score: {enhanced_evaluation_output.SummarizationScore}")
print(f"Enhanced Summarization Reason: {enhanced_evaluation_output.SummarizationReason}\n")
print(f"Enhanced Coherence Score: {enhanced_evaluation_output.CoherenceScore}")
print(f"Enhanced Coherence Reason: {enhanced_evaluation_output.CoherenceReason}\n")
print(f"Enhanced Tonality Score: {enhanced_evaluation_output.TonalityScore}")
print(f"Enhanced Tonality Reason: {enhanced_evaluation_output.TonalityReason}\n")
print(f"Enhanced Safety Score: {enhanced_evaluation_output.SafetyScore}")
print(f"Enhanced Safety Reason: {enhanced_evaluation_output.SafetyReason}\n")


Output()

~~~~~~Enhanced Summary~~~~~~
Behold the tale of the GenAI Divide, where enterprises spend vast treasures upon AI, yet only 5% gain rewards. Most organizations wander in pursuits unfulfilled, as tools like ChatGPT, whilst favorably embraced, fail to transform. Success lieth not solely in powerful models but in systems that adapt and learn. Worthy are those who align these tools with their workflows, gaining prosperity through partnerships with vendors offering custom solutions. Thus, to traverse this chasm, wisdom and strategic alliances are required, forging pathways to genuine advancement.


Output()

Output()

Output()


~~~~~~Enhanced Evaluation analysis complete.~~~~~~

Enhanced Summarization Score: 0.6
Enhanced Summarization Reason: The score is 0.60 because the summary includes extra information about wisdom and strategic alliances that are not mentioned in the original text, and it fails to accurately reflect the main facts and data points. Additionally, causal relationships and timelines are not preserved correctly, leading to a moderate level of accuracy and coherence in the summarization.

Enhanced Coherence Score: 0.7508608743150921
Enhanced Coherence Reason: The summary is understandable to a reader unfamiliar with the source, effectively conveying the main idea of a divide in business AI adoption. The pacing is balanced, with a clear introduction, main idea, and conclusion. Sentences connect naturally, creating a coherent narrative. References are used consistently, with clear mentions of entities like 'ChatGPT' and 'successful pioneers.' There are no contradictions in the text. However, th

In [51]:
conclusion = "I saw slight improvements to the metrics. Summarization score stayed the same, with different reasons as to why the score of 0.6 was given. \n" \
"Coherence and Tonality improved very slightly. Safety only decreased a miniscule amount. Overall, there was a very small improvement in the score.\n" \
"I could most likely improve the developer and user prompts to give more specific instructions as opposed to saying just to 'make it better using this feedback'\n" \
"I could also create a new set of assessment questions for the metrics to see if that could possibly make a bigger difference. Tonality score definitely suffers if we use any tone that isn't the professional academic one.\n" \
"Luckily, when it comes to modifying things such as tone and assessment questions, we can easily do it by simply modifying the defined variables."

print("~~~~~~Conclusion~~~~~~\n")
print(conclusion)

~~~~~~Conclusion~~~~~~

I saw slight improvements to the metrics. Summarization score stayed the same, with different reasons as to why the score of 0.6 was given. 
Coherence and Tonality improved very slightly. Safety only decreased a miniscule amount. Overall, there was a very small improvement in the score.
I could most likely improve the developer and user prompts to give more specific instructions as opposed to saying just to 'make it better using this feedback'
I could also create a new set of assessment questions for the metrics to see if that could possibly make a bigger difference. Tonality score definitely suffers if we use any tone that isn't the professional academic one.
Luckily, when it comes to modifying things such as tone and assessment questions, we can easily do it by simply modifying the defined variables.


Please, do not forget to add your comments.


# Submission Information

ðŸš¨ **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** ðŸš¨ for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verify that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
