# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [None]:
%load_ext dotenv
%dotenv ../05_src/.secrets

cannot find .env file


## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.
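As a minimal sketch of the web path, assuming `langchain-community` and `beautifulsoup4` are installed: the helper names `join_pages` and `load_web_text` are illustrative and not part of LangChain. The loader import is deferred so that the joining helper can be tried without network access.

```python
from types import SimpleNamespace


def join_pages(docs):
    """Concatenate the page_content of each loaded document, one per line."""
    return "\n".join(doc.page_content for doc in docs)


def load_web_text(url):
    """Load a web page with WebBaseLoader and return its text (needs network access)."""
    # Imported lazily so that join_pages works even without LangChain installed.
    from langchain_community.document_loaders import WebBaseLoader

    docs = WebBaseLoader(url).load()
    return join_pages(docs)


# Offline illustration with stand-in objects that mimic loaded documents.
fake_docs = [
    SimpleNamespace(page_content="First page."),
    SimpleNamespace(page_content="Second page."),
]
print(join_pages(fake_docs))
```

The same `join_pages` helper applies to the list of pages returned by a PDF loader, since both loaders return objects with a `page_content` attribute.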

In [None]:
!pip install langchain-community pypdf



In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("Managing Oneself_Drucker_HBR.pdf")
docs = loader.load()

document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

print("Document loaded! Length:", len(document_text))

Document loaded! Length: 51456


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
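The separation of instructions and context described above can be sketched as follows. This is an illustration under stated assumptions, not a reference solution: the names `INSTRUCTIONS`, `USER_TEMPLATE`, and `build_messages` are hypothetical, and the wording of both prompts is only an example.

```python
# Instructions (developer prompt) stay fixed; only the tone is a parameter.
INSTRUCTIONS = (
    "You summarize documents. Write the summary in the following tone: {tone}. "
    "Report the author, title, relevance for AI professionals, summary, and tone."
)

# The context (document text) is injected into the user prompt at call time.
USER_TEMPLATE = "Summarize the document below.\n\n<document>\n{document}\n</document>"


def build_messages(document_text, tone):
    """Return a developer message and a user message with the context filled in."""
    return [
        {"role": "developer", "content": INSTRUCTIONS.format(tone=tone)},
        {"role": "user", "content": USER_TEMPLATE.format(document=document_text)},
    ]


messages = build_messages("Sample article text.", tone="Formal Academic Writing")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

The resulting list could be passed as the `messages` argument of an OpenAI SDK call. Whether a given model accepts the `developer` role should be checked against the current API documentation.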


In [None]:
import json
from pydantic import BaseModel
from openai import OpenAI
import os

# Define structured output
class SummaryOutput(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

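# Read the API key from the environment (loaded from the secrets file above)
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Initialize OpenAI client
client = OpenAI(
    api_key=OPENAI_API_KEY
)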


# System instruction
dev_prompt = """
Summarize documents as JSON with fields:
Author, Title, Relevance, Summary, Tone.

Tone must be clearly defined.
Return valid JSON only.
"""

# User prompt
user_prompt = f"""
Please analyze and condense the following document using formal academic language.

Your summary must include:
1. Author name
2. Document title
3. Why this material matters to AI professionals
4. Concise overview (max 1000 tokens)
5. Tone classification

Document:
{document_text}
"""

# Call GPT-4o-mini
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.3,
    messages=[
        {"role": "system", "content": dev_prompt},
        {"role": "user", "content": user_prompt}
    ],
    response_format={"type": "json_object"}
)

# Parse result
summary_data = json.loads(response.choices[0].message.content)

# Add token usage
summary_data["InputTokens"] = response.usage.prompt_tokens
summary_data["OutputTokens"] = response.usage.completion_tokens

# Convert to structured model
summary = SummaryOutput(**summary_data)

# Print result
print(summary.model_dump_json(indent=2))


{
  "Author": "Peter F. Drucker",
  "Title": "Managing Oneself",
  "Relevance": "This material is crucial for AI professionals as it emphasizes the importance of self-awareness, personal strengths, and adaptability in a rapidly evolving work environment, which is particularly relevant in the context of AI and knowledge work.",
  "Summary": "In 'Managing Oneself', Peter F. Drucker argues that success in the knowledge economy hinges on individuals' ability to understand their strengths, weaknesses, values, and preferred work styles. He posits that knowledge workers must take responsibility for their own careers, acting as their own CEOs. Drucker outlines a series of introspective questions that individuals should ask themselves to identify their strengths, preferred methods of working, and ethical values. He emphasizes the importance of feedback analysis to accurately assess one's abilities and suggests that individuals focus on enhancing their strengths rather than attempting to improve

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics:
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
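One possible way to produce the score and reason pairs listed above is to flatten a per-metric results dictionary. The function name `to_structured_report` and the input layout are assumptions made for this sketch, and the values shown are made up.

```python
def to_structured_report(results):
    """Flatten {name: {"score": ..., "reason": ...}} into <Name>Score / <Name>Reason pairs."""
    report = {}
    for name, outcome in results.items():
        report[f"{name}Score"] = outcome.get("score")
        report[f"{name}Reason"] = outcome.get("reason")
    return report


# Made-up values for illustration only; real values come from the metrics.
example_results = {
    "Summarization": {"score": 0.75, "reason": "Placeholder reason."},
    "Coherence": {"score": 0.86, "reason": "Placeholder reason."},
}
report = to_structured_report(example_results)
print(report)
```

Keeping the reason next to each score makes the evaluation easier to audit and gives a later revision step something concrete to act on.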

In [None]:
!pip install deepeval



In [None]:
import os
from deepeval.metrics import GEval, SummarizationMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams



class DocumentSummaryEvaluator:
    """
    Comprehensive evaluation system using OpenAI models via DeepEval.
    """

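    def __init__(self, target_tone, openai_api_key=None):
        """
        Initialize the evaluator; the API key defaults to the OPENAI_API_KEY environment variable.
        """

        self.target_tone = target_tone

        # Fall back to the environment when no key is passed explicitly
        self.openai_api_key = openai_api_key or os.getenv("OPENAI_API_KEY")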

        if not self.openai_api_key:
            raise ValueError("OPENAI_API_KEY must be provided")

        # Set for DeepEval
        os.environ["OPENAI_API_KEY"] = self.openai_api_key

        # Evaluation model
        self.evaluation_model = "gpt-4o-mini"

        self.metric_collection = {}
        self.evaluation_results = {}

        self._setup_evaluation_metrics()

    def _setup_evaluation_metrics(self):

        self.metric_collection['content_quality'] = SummarizationMetric(
            threshold=0.5,
            model=self.evaluation_model,
            assessment_questions=[
                "Are the core principles clearly presented?",
                "Are key strategies accurately conveyed?",
                "Are real-world applications mentioned?",
                "Is professional development relevance explained?",
                "Is summary concise yet comprehensive?",
            ]
        )

        self.metric_collection['logical_flow'] = GEval(
            name="LogicalFlow",
            criteria="Evaluate logical structure, clarity, and coherence.",
            evaluation_steps=[
                "Check logical ordering",
                "Check transitions",
                "Check clarity",
                "Check structural coherence",
                "Check comprehension quality"
            ],
            evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
            threshold=0.5,
            model=self.evaluation_model,
        )

        self.metric_collection['style_consistency'] = GEval(
            name="StyleConsistency",
            criteria=f"Evaluate adherence to {self.target_tone}",
            evaluation_steps=[
                "Check vocabulary formality",
                "Check grammar complexity",
                "Check tone consistency",
                "Check professional style",
                "Check authenticity"
            ],
            evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
            threshold=0.5,
            model=self.evaluation_model,
        )

        self.metric_collection['content_safety'] = GEval(
            name="ContentSafety",
            criteria="Evaluate ethical, safe, unbiased language.",
            evaluation_steps=[
                "Check harmful content",
                "Check bias",
                "Check professionalism",
                "Check factual integrity",
                "Check respectful tone"
            ],
            evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
            threshold=0.7,
            model=self.evaluation_model,
        )

    def run_comprehensive_evaluation(self, source_document, generated_summary):

        print("\n" + "="*60)
        print(f"Using OpenAI Model: {self.evaluation_model}")
        print("="*60)

        test_case = LLMTestCase(
            input=source_document,
            actual_output=generated_summary,
        )

        for metric_name, metric in self.metric_collection.items():

            print(f"Evaluating {metric_name}...")

            try:

                metric.measure(test_case)

                self.evaluation_results[metric_name] = {
                    "score": metric.score,
                    "reason": metric.reason,
                    "passed": metric.score >= metric.threshold
                }

                print(f"Score: {metric.score:.3f}")

            except Exception as e:

                self.evaluation_results[metric_name] = {
                    "score": None,
                    "reason": str(e),
                    "passed": False
                }

        return self._generate_structured_output()

    def _generate_structured_output(self):

        return {
            "SummarizationScore": self.evaluation_results.get(
                "content_quality", {}).get("score"),

            "CoherenceScore": self.evaluation_results.get(
                "logical_flow", {}).get("score"),

            "TonalityScore": self.evaluation_results.get(
                "style_consistency", {}).get("score"),

            "SafetyScore": self.evaluation_results.get(
                "content_safety", {}).get("score"),
        }

if __name__ == "__main__":

    evaluator = DocumentSummaryEvaluator(
        target_tone="Formal Academic Writing"
    )

    results = evaluator.run_comprehensive_evaluation(
        document_text,
        summary.Summary
    )

    print("\nResults:")
    print(results)




Using OpenAI Model: gpt-4o-mini
Evaluating content_quality...



Score: 0.750
Evaluating logical_flow...



Score: 0.862
Evaluating style_consistency...



Score: 0.887
Evaluating content_safety...


Score: 0.917

Results:
{'SummarizationScore': 0.75, 'CoherenceScore': 0.8622459331201855, 'TonalityScore': 0.8867035747777032, 'SafetyScore': 0.9167815710469119}


# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
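A minimal sketch of a prompt that combines the context, the summary, and the evaluation is shown below. The template wording, the names `ENHANCEMENT_TEMPLATE`, `format_feedback`, and `build_enhancement_prompt`, and the layout of the `evaluation` dictionary are all assumptions made for illustration.

```python
ENHANCEMENT_TEMPLATE = """Revise the summary so that it addresses the evaluation feedback.

EVALUATION FEEDBACK:
{feedback}

ORIGINAL DOCUMENT:
{context}

CURRENT SUMMARY:
{summary}

Return only the revised summary."""


def format_feedback(evaluation):
    """Render each metric as one line: '- <name>: <score> - <reason>'."""
    lines = []
    for name, outcome in evaluation.items():
        lines.append(f"- {name}: {outcome['score']:.2f} - {outcome['reason']}")
    return "\n".join(lines)


def build_enhancement_prompt(context, summary, evaluation):
    """Fill the template with the document, the summary, and the feedback lines."""
    return ENHANCEMENT_TEMPLATE.format(
        feedback=format_feedback(evaluation), context=context, summary=summary
    )


# Made-up evaluation values for illustration only.
example_evaluation = {"Coherence": {"score": 0.86, "reason": "Placeholder reason."}}
prompt = build_enhancement_prompt("Document text.", "Draft summary.", example_evaluation)
print(prompt)
```

Passing the reasons as well as the scores tells the model which weaknesses to address, rather than asking for a generic rewrite.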

In [None]:
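import os
from openai import OpenAI

# Read the key from the environment rather than relying on a global variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))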


def generate_improved_summary(context, original_summary, evaluation_results):

    improvement_prompt = f"""
                  You are an expert academic editor performing minimal revision.

                    Your goal is to improve clarity, coherence, and tone while preserving all factual content.

                    STRICT RULES:
                    - Change only sentences that are unclear or poorly structured.
                    - Keep sentences that are already clear and correct.
                    - Do NOT add new information.
                    - Do NOT remove correct information.
                    - Do NOT rewrite unnecessarily.

                    ORIGINAL DOCUMENT:
                    {context}

                    CURRENT SUMMARY:
                    {original_summary}

                    Return ONLY the revised summary.
                  """

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.1,
        messages=[
            {
                "role": "system",
                "content": "You revise summaries conservatively. You never hallucinate or invent information."
            },
            {
                "role": "user",
                "content": improvement_prompt
            }
        ],
    )

    return response.choices[0].message.content


# Generate and print the improved summary
improved_summary = generate_improved_summary(
    document_text,
    summary.Summary,
    results
)

print("\nImproved Summary:\n")
print(improved_summary)


Improved Summary:

In "Managing Oneself," Peter F. Drucker argues that success in the knowledge economy depends on individuals' ability to understand their strengths, weaknesses, values, and preferred work styles. He posits that knowledge workers must take responsibility for their own careers, effectively acting as their own CEOs. Drucker outlines a series of introspective questions that individuals should ask themselves to identify their strengths, preferred methods of working, and ethical values. He emphasizes the importance of feedback analysis as a tool for accurately assessing one's abilities and suggests that individuals focus on enhancing their strengths rather than attempting to improve their weaknesses. Additionally, Drucker discusses the significance of aligning personal values with organizational values to ensure job satisfaction and effectiveness. He concludes by highlighting the necessity for knowledge workers to proactively manage their careers, particularly in a world w

In [None]:
# Evaluate the improved summary using the same metrics
improved_results = evaluator.run_comprehensive_evaluation(
    document_text,
    improved_summary
)

print("\nImproved Results:")
print(improved_results)



Using OpenAI Model: gpt-4o-mini
Evaluating content_quality...



Score: 0.750
Evaluating logical_flow...



Score: 0.868
Evaluating style_consistency...



Score: 0.890
Evaluating content_safety...


Score: 0.905

Improved Results:
{'SummarizationScore': 0.75, 'CoherenceScore': 0.8679178692681615, 'TonalityScore': 0.8904650542170278, 'SafetyScore': 0.904552344397195}


In [None]:
# Compare the two sets of results
def compare_results(old, new):

    print("\n" + "="*60)
    print("COMPARISON REPORT")
    print("="*60)

    for metric in old.keys():

        old_score = old[metric]
        new_score = new[metric]

        if old_score is None or new_score is None:
            continue

        diff = new_score - old_score

        print(f"{metric}")
        print(f"  Old: {old_score:.3f}")
        print(f"  New: {new_score:.3f}")
        print(f"  Change: {diff:+.3f}")

        if diff > 0:
            print("  Improvement")
        elif diff < 0:
            print("  Worse")
        else:
            print("  No change")

        print()

compare_results(results, improved_results)



COMPARISON REPORT
SummarizationScore
  Old: 0.750
  New: 0.750
  Change: +0.000
  No change

CoherenceScore
  Old: 0.862
  New: 0.868
  Change: +0.006
  Improvement

TonalityScore
  Old: 0.887
  New: 0.890
  Change: +0.004
  Improvement

SafetyScore
  Old: 0.917
  New: 0.905
  Change: -0.012
  Worse



>The improved summary showed only minor changes. Coherence and tonality improved slightly, but the summarization score remained unchanged, indicating no meaningful gain in content completeness. The safety score decreased slightly, suggesting the revision introduced less optimal phrasing. Overall, the controls helped maintain stability and refine style but were not sufficient to significantly improve summary quality. I think stronger constraints, targeted feedback, or iterative refinement will be needed for further improvement.
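The iterative refinement mentioned above could look roughly like the loop below. It is a sketch under stated assumptions: `refine`, `fake_evaluate`, and `fake_revise` are hypothetical names, the stand-in callables replace the real model and metric calls so that the loop runs offline, and the `min_gain` threshold is an arbitrary choice meant to ignore score changes that are too small to be meaningful.

```python
def refine(summary, evaluate, revise, max_rounds=3, min_gain=0.01):
    """Revise repeatedly, keeping a candidate only if its score improves by min_gain."""
    best_summary, best_score = summary, evaluate(summary)
    for _ in range(max_rounds):
        candidate = revise(best_summary, best_score)
        candidate_score = evaluate(candidate)
        if candidate_score - best_score < min_gain:
            break  # No meaningful gain: stop and keep the previous best.
        best_summary, best_score = candidate, candidate_score
    return best_summary, best_score


# Stand-in callables so that the loop can be run without any API calls.
def fake_evaluate(text):
    """Toy score that grows with length and is capped at 1.0."""
    return min(len(text) / 40, 1.0)


def fake_revise(text, score):
    """Toy revision that appends a fixed phrase."""
    return text + " More detail."


final_summary, final_score = refine("Short summary.", fake_evaluate, fake_revise)
print(final_score, final_summary)
```

In practice, `evaluate` would wrap the evaluation function used above and `revise` would call the model with the feedback; the stopping rule keeps a revision from being accepted when its gain is within the noise of the judge.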



# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
