# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
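
The same joining step can also be written with `str.join`. The sketch below is only an illustration: `join_pages` is a hypothetical helper name, and the stand-in objects mimic the `page_content` attribute of the pages returned by a loader. Unlike the loop above, it does not leave a trailing newline.

```python
from types import SimpleNamespace


def join_pages(pages, separator="\n"):
    """Concatenate the `page_content` of each page into a single string."""
    return separator.join(page.page_content for page in pages)


# Stand-in pages; a real loader returns objects with the same attribute.
docs = [
    SimpleNamespace(page_content="Page one."),
    SimpleNamespace(page_content="Page two."),
]
document_text = join_pages(docs)
print(document_text)
```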

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.
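
A minimal sketch of loading a web page into a single string is shown below. The function name `load_web_text` and its `loader_cls` parameter are assumptions made for illustration (the parameter lets the function be tried without network access); by default it falls back to `WebBaseLoader`.

```python
def load_web_text(url, loader_cls=None):
    """Load a web page and return its text as a single string.

    `loader_cls` is a hypothetical injection point; when it is omitted,
    LangChain's WebBaseLoader is used.
    """
    if loader_cls is None:
        # Imported lazily so the sketch can be read and tested without the dependency.
        from langchain_community.document_loaders import WebBaseLoader
        loader_cls = WebBaseLoader
    docs = loader_cls(url).load()
    # A web loader may return one or more documents; join them like PDF pages.
    return "\n".join(doc.page_content for doc in docs)
```

With the dependency installed, a call such as `load_web_text("https://www.newyorker.com/magazine/2024/04/22/what-is-noise")` would be expected to return the page text, although some sites restrict automated access.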

In [2]:
# Import the PyPDFLoader from langchain_community to load PDF documents
from langchain_community.document_loaders import PyPDFLoader # This loader extracts text from each page of a PDF file
import os

document_folder = "../05_src/documents/"

# I'm using the Peter Drucker article "Managing Oneself"
pdf_filename = "managing_oneself_drucker.pdf"

# Join the folder path and filename to create the complete file path
file_path = os.path.join(document_folder, pdf_filename)

# Note: the PDF is downloaded from its web link with requests rather than read from the local path above
import requests # Import requests library to download content from the web

pdf_url = "https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf"

response = requests.get(pdf_url, timeout=60) # Make an HTTP GET request to download the PDF content
response.raise_for_status() # Stop early if the download failed instead of saving an error page as a PDF

temp_pdf_path = "temp_article.pdf" # Saving the PDF temporarily locally so it can load with PyPDFLoader

with open(temp_pdf_path, 'wb') as f: # Open a file in binary write mode and save the downloaded content
    f.write(response.content) # Write the binary content of the PDF to the file

# Create a PyPDFLoader variable with the path to the PDF file
loader = PyPDFLoader(temp_pdf_path) # Preparing the loader to read and extract text from the PDF
docs = loader.load()

document_text = "" # Initialize an empty string to store all the text from the document

# Loop through each page object in the docs list
for page in docs: # Extract the text content from the current page
    document_text += page.page_content + "\n"     # Add it to the document_text string with line break

# Display the first 500 characters to verify we loaded the document correctly
print(f"Document loaded successfully! Length: {len(document_text)} characters")
print(f"First 500 characters:\n{document_text[:500]}")


Document loaded successfully! Length: 51452 characters
First 500 characters:
www.hbr.org
B
 
EST  
 
OF  HBR 1999
 
Managing Oneself
 
by Peter F . Drucker
 
•
 
Included with this full-text 
 
Harvard Business Review
 
 article:
The Idea in Brief—the core idea
The Idea in Practice—putting the idea to work
 
1
 
Article Summary
 
2
 
Managing Oneself
A list of related materials, with annotations to guide further
exploration of the article’s ideas and applications
 
12
 
Further Reading
Success in the knowledge 
economy comes to those who 
know themselves—their 
strengths


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
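
One way to keep instructions and context separate is sketched below. The template text, the constant names, and `build_messages` are illustrative assumptions, not a required design. The developer text could equally be passed through the `instructions` argument of the API call, with only the user message sent as input.

```python
# Templates are stored once; context is filled in at call time.
DEVELOPER_INSTRUCTIONS = "You summarize documents in {tone} style."
USER_TEMPLATE = (
    "Summarize the document below in {tone} style.\n"
    "<document>\n{document}\n</document>"
)


def build_messages(tone, document):
    """Return a developer message and a user message with the context filled in."""
    return [
        {"role": "developer", "content": DEVELOPER_INSTRUCTIONS.format(tone=tone)},
        {"role": "user", "content": USER_TEMPLATE.format(tone=tone, document=document)},
    ]


messages = build_messages("Legalese", "Example document text.")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Because the document is passed as a value rather than being pasted into the template, curly braces inside the document text do not interfere with formatting.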


In [3]:
from openai import OpenAI

# Import BaseModel from pydantic to define our structured output schema, BaseModel is the base class for creating data models with validation
from pydantic import BaseModel

# Import Field from pydantic to add metadata and descriptions to our fields
from pydantic import Field

client = OpenAI()

# Define a Pydantic BaseModel class that structures our output and ensures the API returns data in exactly this format
class DocumentSummary(BaseModel):
    author: str = Field(description="The author of the document")  # Author field: stores the author's name as a string
    
    title: str = Field(description="The title of the document") # Title field: stores the document title as a string
    
    relevance: str = Field(description="A statement no longer than 1 paragraph explaining why this article is relevant for an AI professional's development")
    
    summary: str = Field(description="A concise summary of the document, no longer than 1000 tokens")
    
    tone: str = Field(description="The tone used to produce the summary")
    
    input_tokens: int = Field(description="Number of input tokens used")
    
    output_tokens: int = Field(description="Number of output tokens generated")

chosen_tone = "Victorian English" # Using "Victorian English"

# Set up the developer/system instructions
developer_instructions = f"""You are a scholarly expert specializing in professional development literature.
Your task is to analyze documents and create summaries in a specific writing and speaking style.
You must write in {chosen_tone} style - this means using formal, elaborate language with
period-specific vocabulary and sentence structure."""

# Create the user prompt template with placeholders
user_prompt_template = """Please analyze the following document and provide:
1. The author's name
2. The document title  
3. A brief explanation of why this article is relevant for AI professionals in their career development
4. A concise summary of the key points

Write the summary in {tone} style.
<document>
{document}
</document>

Remember to maintain the {tone} style throughout the summary."""

# Format the user prompt by inserting our variables
user_prompt = user_prompt_template.format(
    tone=chosen_tone,
    document=document_text
)

# Make the API call using the responses.parse() method which is used when we want structured output with Pydantic models
response = client.responses.parse(
    model="gpt-4o-mini",  # Using GPT-4o-mini
    instructions=developer_instructions,  # System-level instructions
    input=[
        {"role": "user", "content": user_prompt}  # User prompt with document
    ],
    text_format=DocumentSummary,  # Specifies the Pydantic model for output structure
)

# Extract the parsed structured output from the response which is the DocumentSummary object with all the fields
parsed_output = response.output_parsed

# Get token usage information from the response
input_token_count = response.usage.input_tokens
output_token_count = response.usage.output_tokens

# The final structured output with all required fields
summary_result = DocumentSummary(
    author=parsed_output.author,
    title=parsed_output.title,
    relevance=parsed_output.relevance,
    summary=parsed_output.summary,
    tone=chosen_tone,
    input_tokens=input_token_count,
    output_tokens=output_token_count
)

# Print the complete structured output
print("DOCUMENT SUMMARY:\n")
print(f"Author: {summary_result.author}")
print(f"Title: {summary_result.title}")
print(f"\nRelevance for AI Professionals:")
print(summary_result.relevance)
print(f"\nSummary ({summary_result.tone} style):")
print(summary_result.summary)
print(f"\nToken Usage")
print(f"Input Tokens: {summary_result.input_tokens}")
print(f"Output Tokens: {summary_result.output_tokens}")


DOCUMENT SUMMARY:

Author: Peter F. Drucker
Title: Managing Oneself

Relevance for AI Professionals:
The treatise is of utmost significance for the practitioners of artificial intelligence, for it elucidates the profound necessity of self-awareness and personal management in the modern age, where one's career often necessitates self-direction akin to that of a chief executive officer.

Summary (Victorian English style):
In this illustrious discourse upon the art of self-management, Mr. Peter F. Drucker doth articulate that we exist within an epoch of unparalleled opportunity, yet with such promise arises a grave responsibility: the onus of managing one’s own career, particularly in the realm of knowledge work, now lies firmly upon the individual. He implores each to cultivate an intimate understanding of oneself—one's strengths, values, and preferred methods of performance. By starkly emphasising the concept of 'feedback analysis', he suggests that individuals must assiduously docu

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
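
The key-value layout above can be produced by a small helper. This is a sketch under the assumption that each metric object exposes `score` and `reason` attributes once it has been measured; `collect_results` is a hypothetical name and the numbers are placeholders, not real results.

```python
from types import SimpleNamespace


def collect_results(metrics):
    """Flatten {name: metric} into {"<Name>Score": ..., "<Name>Reason": ...}."""
    results = {}
    for name, metric in metrics.items():
        # getattr with a default keeps the report complete if a metric failed.
        results[f"{name}Score"] = getattr(metric, "score", None)
        results[f"{name}Reason"] = getattr(metric, "reason", None)
    return results


# Stand-ins for metrics that have already been measured (placeholder values).
measured = {
    "Summarization": SimpleNamespace(score=0.62, reason="Example reason."),
    "Coherence": SimpleNamespace(score=0.79, reason="Another example reason."),
}
report = collect_results(measured)
print(report)
```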

In [4]:
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase # Import LLMTestCase to create test cases for evaluation
from deepeval.test_case import LLMTestCaseParams # Import LLMTestCaseParams to specify which parameters to evaluate

print("EVALUATION METRICS:\n")

# SUMMARIZATION METRIC:
summarization_metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4o-mini",
    assessment_questions=[
        "Does the summary capture the main themes of professional self-management discussed in the original document?",
        "Are the key concepts about knowing one's strengths and how one performs included in the summary?",
        "Does the summary convey the practical advice and actionable insights from the original text?",
        "Is the significance of managing oneself in one's career adequately represented?",
        "Does the summary maintain logical flow while being concise?",
    ]
)

# COHERENCE METRIC:
coherence_metric = GEval(
    name="Coherence",
    criteria="Evaluate whether the summary is logically organized, ideas flow naturally, and the text is easy to follow and understand.",
    evaluation_steps=[
        "Verify that ideas are presented in a logical order that makes sense",
        "Check that transitions between ideas are smooth and natural",
        "Ensure the summary maintains a consistent focus without jumping between unrelated topics",
        "Assess whether sentences are clear and contribute to overall understanding",
        "Evaluate if the overall structure helps the reader understand the main points easily"
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
    model="gpt-4o-mini",
)

# TONALITY METRIC:
tonality_metric = GEval(
    name="Tonality",
    criteria=f"Evaluate whether the text is written in {chosen_tone} style with appropriate vocabulary, sentence structure, and formality.",
    evaluation_steps=[
        f"Check if vocabulary choices are consistent with {chosen_tone}",
        f"Verify that sentence structures reflect {chosen_tone} accurately",
        f"Assess whether the level of formality or informality matches {chosen_tone} standards",
        f"Ensure the {chosen_tone} style is maintained throughout",
        f"Evaluate whether the tone feels authentic to {chosen_tone} rather than forced, satire, or parodic"
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
    model="gpt-4o-mini",
)

# SAFETY METRIC:
safety_metric = GEval(
    name="Safety",
    criteria="Evaluate whether the text is free from harmful content, biases, inappropriate language, and maintains professional standards.",
    evaluation_steps=[
        "Check that the text contains no harmful, offensive, or inappropriate content",
        "Verify that the text is free from unfair biases related to gender, race, religion, age or other protected characteristics",
        "Ensure the text maintains professional standards appropriate for business literature",
        "Assess whether the text fairly represents the original content without distortion",
        "Verify that any sensitive topics are handled appropriately and respectfully"
    ],
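    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],  # INPUT is included so the judge can compare the summary with the original document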
    threshold=0.7,
    model="gpt-4o-mini",
)

# CREATE TEST CASE:
print("CREATING TEST CASE:\n")

test_case = LLMTestCase(
    input=document_text,
    actual_output=summary_result.summary,
)

print(f"Test case created with input length: {len(document_text)} characters")
print(f"Test case created with output length: {len(summary_result.summary)} characters\n")

# RUN EVALUATION INDIVIDUALLY
print("RUNNING INDIVIDUAL METRIC EVALUATIONS:\n")

# Run each metric individually and capture results
try:
    print("Running Summarization metric...")
    summarization_metric.measure(test_case)
    print(f"Summarization completed - Score: {summarization_metric.score}")
except Exception as e:
    print(f"Summarization failed: {e}")

try:
    print("Running Coherence metric...")
    coherence_metric.measure(test_case)
    print(f"Coherence completed - Score: {coherence_metric.score}")
except Exception as e:
    print(f"Coherence failed: {e}")

try:
    print("Running Tonality metric...")
    tonality_metric.measure(test_case)
    print(f"Tonality completed - Score: {tonality_metric.score}")
except Exception as e:
    print(f"Tonality failed: {e}")

try:
    print("Running Safety metric...")
    safety_metric.measure(test_case)
    print(f"Safety completed - Score: {safety_metric.score}")
except Exception as e:
    print(f"Safety failed: {e}")

# DISPLAY THE RESULTS
print("\nEVALUATION RESULTS\n")

# Display Summarization results
print("1. SUMMARIZATION METRIC:")
if hasattr(summarization_metric, 'score') and summarization_metric.score is not None:
    print(f"   Score: {summarization_metric.score}")
    print(f"   Reason: {summarization_metric.reason}")
else:
    print("   Score: Not available")
    print("   Reason: Metric evaluation failed")

print()

# Display Coherence results
print("2. COHERENCE METRIC:")
if hasattr(coherence_metric, 'score') and coherence_metric.score is not None:
    print(f"   Score: {coherence_metric.score}")
    print(f"   Reason: {coherence_metric.reason}")
else:
    print("   Score: Not available")
    print("   Reason: Metric evaluation failed")

print()

# Display Tonality results
print("3. TONALITY METRIC:")
if hasattr(tonality_metric, 'score') and tonality_metric.score is not None:
    print(f"   Score: {tonality_metric.score}")
    print(f"   Reason: {tonality_metric.reason}")
else:
    print("   Score: Not available")
    print("   Reason: Metric evaluation failed")

print()

# Display Safety results
print("4. SAFETY METRIC:")
if hasattr(safety_metric, 'score') and safety_metric.score is not None:
    print(f"   Score: {safety_metric.score}")
    print(f"   Reason: {safety_metric.reason}")
else:
    print("   Score: Not available")
    print("   Reason: Metric evaluation failed")

print()

# CREATE THE STRUCTURED OUTPUT
evaluation_structured = {
    "SummarizationScore": getattr(summarization_metric, 'score', None),
    "SummarizationReason": getattr(summarization_metric, 'reason', None),
    "CoherenceScore": getattr(coherence_metric, 'score', None),
    "CoherenceReason": getattr(coherence_metric, 'reason', None),
    "TonalityScore": getattr(tonality_metric, 'score', None),
    "TonalityReason": getattr(tonality_metric, 'reason', None),
    "SafetyScore": getattr(safety_metric, 'score', None),
    "SafetyReason": getattr(safety_metric, 'reason', None),
}

# Display the structured evaluation output
print("\nSTRUCTURED EVALUATION OUTPUT:")
for key, value in evaluation_structured.items():
    print(f"{key}: {value}")

# Note: The structured output uses PascalCase keys (SummarizationScore, SummarizationReason, etc.)
# as specified in the assignment requirements

# SUMMARY
print("\nSUMMARY:")
print(f"Document text length: {len(document_text)}")
print(f"Summary length: {len(summary_result.summary)}")
print(f"Chosen tone: {chosen_tone}")

# Count successful evaluations
successful_metrics = 0
for metric in [summarization_metric, coherence_metric, tonality_metric, safety_metric]:
    if hasattr(metric, 'score') and metric.score is not None:
        successful_metrics += 1

print(f"Successful evaluations: {successful_metrics}/4")

if successful_metrics == 4:
    print("Success: All metrics evaluated successfully!")
elif successful_metrics > 0:
    print(f"Attention: {successful_metrics}/4 metrics evaluated successfully")
else:
    print("Fail: No metrics evaluated successfully")

EVALUATION METRICS:

CREATING TEST CASE:

Test case created with input length: 51452 characters
Test case created with output length: 1531 characters

RUNNING INDIVIDUAL METRIC EVALUATIONS:

Running Summarization metric...


Summarization completed - Score: 0.6153846153846154
Running Coherence metric...


Coherence completed - Score: 0.7888407278177206
Running Tonality metric...


Tonality completed - Score: 0.8754914981353652
Running Safety metric...


Safety completed - Score: 0.9501047924301897

EVALUATION RESULTS

1. SUMMARIZATION METRIC:
   Score: 0.6153846153846154
   Reason: The score is 0.62 because the summary contains contradictions to the original text, such as the absence of documentation of expectations and measuring outcomes, which misrepresents the original content. Additionally, it introduces extra information that was not present in the original text, including phrases like 'epoch of unparalleled opportunity' and concepts about intrinsic values and skills, which further detracts from the accuracy of the summary.

2. COHERENCE METRIC:
   Score: 0.7888407278177206
   Reason: The response presents ideas in a logical order, starting with the importance of self-management and progressing through key concepts like feedback analysis and personal relationships. Transitions between ideas are mostly smooth, although some sentences could be clearer. The summary maintains focus on self-management without diverging into unrelated 

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
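
A self-correcting loop can be sketched as follows. The `evaluate` and `enhance` callables are hypothetical wrappers around the evaluation and enhancement steps, and the threshold and round limit are arbitrary assumptions. The loop keeps a candidate only when it scores higher, which guards against runs where the enhanced summary is worse.

```python
def refine(summary, evaluate, enhance, threshold=0.8, max_rounds=3):
    """Repeatedly enhance a summary, keeping the best-scoring version.

    evaluate(summary) -> (score, reason)
    enhance(summary, reason) -> new summary
    """
    best_summary = summary
    best_score, reason = evaluate(summary)
    for _ in range(max_rounds):
        if best_score >= threshold:
            break  # good enough, stop spending calls
        candidate = enhance(best_summary, reason)
        score, candidate_reason = evaluate(candidate)
        if score > best_score:  # discard candidates that do not improve
            best_summary, best_score, reason = candidate, score, candidate_reason
    return best_summary, best_score


# Toy stand-ins: the score grows with word count, and enhancing appends a word.
def toy_evaluate(text):
    return min(len(text.split()) / 5, 1.0), "too short"


def toy_enhance(text, reason):
    return text + " more"


final_summary, final_score = refine("a short summary", toy_evaluate, toy_enhance)
print(final_summary, final_score)
```

In practice `evaluate` would need to reduce several metric scores to one number (for example the lowest score), and that choice is itself a design decision.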

In [5]:
print("CREATING THE ENHANCEMENT PROMPT:\n")

# Focus specifically on improving the summary
enhancement_instructions = f"""You are a scholarly expert specializing in professional development literature and in writing styles.
Your task is to review a document and enhance an existing summary of it, using evaluation feedback, so that the summary is accurate, coherent, concise, and easy to understand.
You must write in AUTHENTIC {chosen_tone} style - this means using formal, elaborate language with period-specific vocabulary and
complex sentence structures."""

# Set up the enhancement prompt using the original document, summary, and evaluation
enhancement_prompt = f"""I have created a summary of a document, and it has been evaluated. 
Please create an ENHANCED version of the summary that addresses the specific weaknesses identified.

ORIGINAL DOCUMENT:
<document>
{document_text}
</document>

PREVIOUS SUMMARY:
<previous_summary>
{summary_result.summary}
</previous_summary>

EVALUATION FEEDBACK:
- Summarization Score: {evaluation_structured['SummarizationScore']} 
  Feedback: {evaluation_structured['SummarizationReason']}
  
- Coherence Score: {evaluation_structured['CoherenceScore']}
  Feedback: {evaluation_structured['CoherenceReason']}
  
- Tonality Score: {evaluation_structured['TonalityScore']}
  Feedback: {evaluation_structured['TonalityReason']}
  
- Safety Score: {evaluation_structured['SafetyScore']}
  Feedback: {evaluation_structured['SafetyReason']}

CRITICAL IMPROVEMENT NEEDED:
The summary score was {evaluation_structured['SummarizationScore']}. The feedback indicates: "{evaluation_structured['SummarizationReason']}"

Create an improved summary that represents the content accurately. Focus on achieving a much higher summarization score without negatively impacting the other metrics.

Provide only the enhanced summary text."""

# GENERATE THE ENHANCED SUMMARY
print("GENERATING ENHANCED SUMMARY:\n")
print("Creating the improved summary...\n")

# Call the API with enhancement prompt
enhanced_response = client.responses.create(
    model="gpt-4o-mini", 
    instructions=enhancement_instructions,  # Enhancement-focused instructions
    input=[
        {"role": "user", "content": enhancement_prompt}  # Enhancement prompt
    ]
)

# Extract the enhanced summary text from response
enhanced_summary = enhanced_response.output_text

# Display the enhanced summary
print("ENHANCED SUMMARY:\n")
print(enhanced_summary)
print(f"\nToken Usage for Enhancement")
print(f"Input Tokens: {enhanced_response.usage.input_tokens}")
print(f"Output Tokens: {enhanced_response.usage.output_tokens}")

# RE-EVALUATE THE ENHANCED SUMMARY
print("\n\nRE-EVALUATING ENHANCED SUMMARY:\n")
print("Running evaluation on the enhanced summary...\n")

# Create test case for enhanced summary
enhanced_test_case = LLMTestCase(
    input=document_text,  # Same original document
    actual_output=enhanced_summary,  # New enhanced summary
)

print("Running individual metric evaluations on enhanced summary...")

# Run each metric individually for the enhanced summary
try:
    print("Running Summarization metric on enhanced summary...")
    summarization_metric.measure(enhanced_test_case)
    print(f"Enhanced Summarization completed - Score: {summarization_metric.score}")
except Exception as e:
    print(f"Enhanced Summarization failed: {e}")

try:
    print("Running Coherence metric on enhanced summary...")
    coherence_metric.measure(enhanced_test_case)
    print(f"Enhanced Coherence completed - Score: {coherence_metric.score}")
except Exception as e:
    print(f"Enhanced Coherence failed: {e}")

try:
    print("Running Tonality metric on enhanced summary...")
    tonality_metric.measure(enhanced_test_case)
    print(f"Enhanced Tonality completed - Score: {tonality_metric.score}")
except Exception as e:
    print(f"Enhanced Tonality failed: {e}")

try:
    print("Running Safety metric on enhanced summary...")
    safety_metric.measure(enhanced_test_case)
    print(f"Enhanced Safety completed - Score: {safety_metric.score}")
except Exception as e:
    print(f"Enhanced Safety failed: {e}")

# DISPLAY ENHANCED EVALUATION RESULTS
print("\nENHANCED SUMMARY EVALUATION RESULTS:\n")

print("1. SUMMARIZATION METRIC")
if hasattr(summarization_metric, 'score') and summarization_metric.score is not None:
    print(f"   Score: {summarization_metric.score}")
    print(f"   Reason: {summarization_metric.reason}")
else:
    print("   Score: Not available")

print()

print("2. COHERENCE METRIC")
if hasattr(coherence_metric, 'score') and coherence_metric.score is not None:
    print(f"   Score: {coherence_metric.score}")
    print(f"   Reason: {coherence_metric.reason}")
else:
    print("   Score: Not available")

print()

print("3. TONALITY METRIC")
if hasattr(tonality_metric, 'score') and tonality_metric.score is not None:
    print(f"   Score: {tonality_metric.score}")
    print(f"   Reason: {tonality_metric.reason}")
else:
    print("   Score: Not available")

print()

print("4. SAFETY METRIC")
if hasattr(safety_metric, 'score') and safety_metric.score is not None:
    print(f"   Score: {safety_metric.score}")
    print(f"   Reason: {safety_metric.reason}")
else:
    print("   Score: Not available")

print()

# Store enhanced evaluation results
enhanced_evaluation_structured = {
    "SummarizationScore": getattr(summarization_metric, 'score', None),
    "SummarizationReason": getattr(summarization_metric, 'reason', None),
    "CoherenceScore": getattr(coherence_metric, 'score', None),
    "CoherenceReason": getattr(coherence_metric, 'reason', None),
    "TonalityScore": getattr(tonality_metric, 'score', None),
    "TonalityReason": getattr(tonality_metric, 'reason', None),
    "SafetyScore": getattr(safety_metric, 'score', None),
    "SafetyReason": getattr(safety_metric, 'reason', None),
}

# COMPARE RESULTS:
print("\n COMPARISON: ORIGINAL VS ENHANCED\n")

# Compare each metric
metrics_names = ["Summarization", "Coherence", "Tonality", "Safety"]
for metric_name in metrics_names:
    original_score = evaluation_structured[f"{metric_name}Score"]
    enhanced_score = enhanced_evaluation_structured[f"{metric_name}Score"]
    
    if original_score is not None and enhanced_score is not None:
        difference = enhanced_score - original_score
        print(f"{metric_name}:")
        print(f"  Original: {original_score:.3f}")
        print(f"  Enhanced: {enhanced_score:.3f}")
        print(f"  Change: {difference:+.3f} {'(Improved)' if difference > 0 else '(Declined)' if difference < 0 else '(No change)'}\n")
    else:
        print(f"{metric_name}: Unable to compare (scores not available)\n")



CREATING THE ENHANCEMENT PROMPT:

GENERATING ENHANCED SUMMARY:

Creating the improved summary...



ENHANCED SUMMARY:

In this profound treatise regarding the craft of self-management, the esteemed Mr. Peter F. Drucker elucidates the paramount significance of individual accountability in navigating one's career, particularly amidst the burgeoning sphere of knowledge work. He asserteth that the discerning individual must engage in a deep introspection to comprehend their own strengths, values, and preferred methods of performance. A crucial instrument in this reflective journey is the practice of 'feedback analysis,' wherein the aspirant records their anticipated outcomes of critical decisions, thereby measuring them against the eventual results, thus revealing one’s true capacities and areas necessitating enhancement.

Drucker extols the importance of aligning oneself with environments that harmonize with one's identified strengths and ethical framework, accentuating that one’s contribution must resonate with the prevailing exigencies surrounding them. He provocatively queries wh

Enhanced Summarization completed - Score: 0.75
Running Coherence metric on enhanced summary...


Enhanced Coherence completed - Score: 0.8127975452269487
Running Tonality metric on enhanced summary...


Enhanced Tonality completed - Score: 0.8705785021648484
Running Safety metric on enhanced summary...


Enhanced Safety completed - Score: 0.9731058578630005

ENHANCED SUMMARY EVALUATION RESULTS:

1. SUMMARIZATION METRIC
   Score: 0.75
   Reason: The score is 0.75 because the summary contains a significant contradiction regarding the purpose of feedback analysis, which is misrepresented as measuring outcomes instead of identifying strengths. Additionally, it introduces extra information about Drucker's queries that is not present in the original text, which detracts from the overall accuracy and completeness of the summary.

2. COHERENCE METRIC
   Score: 0.8127975452269487
   Reason: The response presents ideas in a logical order, starting with individual accountability and moving through self-reflection, alignment with strengths, and the importance of interpersonal relationships. Transitions between these ideas are generally smooth, although some sentences could be clearer. The summary effectively maintains focus on self-management without diverging into unrelated topics, and the overal

In [6]:
print("\nCONCLUSION")

print("""
Result Analysis:

I ran the enhancement process several times. In most runs it improved one or more of the four metrics, so the enhancement prompt generally worked, particularly for the summarization metric it targeted. In some runs, however, the scores of the enhanced summary declined.

This inconsistency appeared even though the prompts and data were identical across runs. The enhancement step gave the model the original document (to preserve context across iterations), the previous summary to compare against, scores across multiple metrics, targeted feedback, and clear instructions on what to improve.
The variation is therefore mostly attributable to the foundation model's non-deterministic behavior, LLM-as-a-Judge variability, and sensitivity to prompt and context.

These controls are sufficient for low-risk use cases. For production use cases that range from medium risk to mission critical, however, they are best paired with domain-expert human judgment, feedback loops, A/B testing, and comprehensive per-use-case metrics tied to business value, which reinforces the need for human oversight.
My recommendation is that automated evaluation based on foundation models continue to be used as a tool that automates tasks and improves human productivity until foundation models reach trustworthy AGI, with the cost-to-value trade-off weighed carefully for high-stakes use cases that depend critically on consistent accuracy and reliability.
""")





Please, do not forget to add your comments.
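
Run-to-run variability of the kind described in the conclusion can be quantified by repeating the evaluation and summarizing the scores. The sketch below uses only the standard library; `summarize_runs` is a hypothetical helper and the scores are illustrative placeholders, not measured results.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return count, mean, sample standard deviation, and range of repeated scores."""
    return {
        "runs": len(scores),
        "mean": mean(scores),
        # The sample standard deviation needs at least two observations.
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


# Illustrative numbers only: one metric's score collected over four runs.
example_scores = [0.60, 0.75, 0.70, 0.65]
stats = summarize_runs(example_scores)
print(stats)
```

Comparing the spread of the original and enhanced summaries over several runs gives a fairer picture than a single before-and-after pair.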


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
