# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

## Load Secrets

In [1]:
# Load secrets if available; continue gracefully if not.
from pathlib import Path
import re, subprocess, sys

try:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "python-dotenv"])
    %load_ext dotenv
    %dotenv ../05_src/.secrets
    print("Secrets loaded from ../05_src/.secrets")
except Exception as e:
    print("dotenv not available or secrets file missing:", e)

Secrets loaded from ../05_src/.secrets


## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
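
As a minimal sketch of the loading step that produces `docs`, the snippet below wraps LangChain's `PyPDFLoader` in a helper and joins the pages with `str.join`. It assumes that `langchain-community` and `pypdf` are installed; the helper names `join_pages` and `load_pdf_text` and the stand-in pages are hypothetical and only illustrate the idea.

```python
from types import SimpleNamespace


def join_pages(docs) -> str:
    """Concatenate the page_content of each page-like object, separated by newlines."""
    return "\n".join(page.page_content for page in docs)


def load_pdf_text(path: str) -> str:
    """Load a PDF with LangChain and return its text as a single string."""
    # Lazy import: assumes langchain-community and pypdf are installed.
    from langchain_community.document_loaders import PyPDFLoader

    docs = PyPDFLoader(path).load()  # one Document per page
    return join_pages(docs)


# Stand-in pages (hypothetical content) that mimic the shape of loaded Documents.
pages = [
    SimpleNamespace(page_content="Page one."),
    SimpleNamespace(page_content="Page two."),
]
print(join_pages(pages))
```

Unlike the loop above, `str.join` does not leave a trailing newline after the last page.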

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.
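
A possible way to use the loader is sketched below, assuming `langchain-community` and `beautifulsoup4` are installed. Web pages often come back with long runs of blank lines, so the sketch also includes a small clean-up step. The helper names `tidy_web_text` and `load_web_text` are hypothetical and not part of LangChain.

```python
import re


def tidy_web_text(text: str) -> str:
    """Collapse repeated spaces and long runs of blank lines in extracted page text."""
    lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in text.splitlines()]
    cleaned = "\n".join(lines)
    # Keep at most one blank line between paragraphs.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()


def load_web_text(url: str) -> str:
    """Fetch a web page with LangChain and return tidied text."""
    # Lazy import: assumes langchain-community and beautifulsoup4 are installed.
    from langchain_community.document_loaders import WebBaseLoader

    docs = WebBaseLoader(url).load()  # list of Document objects
    return tidy_web_text("\n".join(d.page_content for d in docs))


# The clean-up step can be checked without network access.
sample = "Title\n\n\n\n  Body   text \n"
print(repr(tidy_web_text(sample)))
```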

In [2]:
# Load a document from a local file or a pasted string.
# You can set DOC_PATH to a PDF, TXT, or MD file.
# If the file is not found, a fallback message will be printed.

from pathlib import Path
import re, subprocess, sys

# Install PyPDF2 if needed, so PDF text extraction is supported
subprocess.check_call([sys.executable, "-m", "pip", "install", "PyPDF2"])

# Path to the document you want to load for analysis
DOC_PATH = Path("assignment1_texts") / "ai_report_2025.pdf"  # Portable across operating systems

# Attempt to import PyPDF2; install it if missing
try:
    import PyPDF2
except ImportError:
    print("PyPDF2 not found. Installing now...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "PyPDF2"])
    import PyPDF2
    print("PyPDF2 installed successfully.")

def read_pdf_text(path: Path) -> str:
    """Extracts text content from a PDF file using PyPDF2."""
    text = []
    with open(path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        for page in reader.pages:
            # Some pages may return None, so a fallback empty string is used
            text.append(page.extract_text() or "")
    return "\n".join(text)

def read_text_file(path: Path) -> str:
    """Reads text from a plaintext file with UTF-8 encoding."""
    return path.read_text(encoding="utf-8", errors="ignore")

def load_document() -> str:
    """
    Loads and returns the text from a specified file path.
    Handles both PDF and text-based files.
    """

    # Check the provided file path and attempt to load its content
    if DOC_PATH:
        p = Path(DOC_PATH)
        print("Looking for file at:", p.resolve())

        if p.exists():
            # PDF extraction route
            if p.suffix.lower() == ".pdf":
                print("PDF found. Extracting text...")
                return read_pdf_text(p)
            # Plaintext route
            else:
                print("Text file found.")
                return read_text_file(p)
        else:
            # File was not found at the given path
            print("File not found at:", p.resolve())

    # Fall back to an empty string so callers always receive a str
    return ""

# Load the document into memory
raw_document = load_document()

# Display word count of the loaded document
print(f"Loaded document with {len(raw_document.split())} words.")

Looking for file at: C:\Users\ryand\Desktop\deploying-ai-main\02_activities\assignment1_texts\ai_report_2025.pdf
PDF found. Extracting text...
Loaded document with 7721 words.


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
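
The separation of instructions and context described above could look roughly like the sketch below. The template text and the names `INSTRUCTIONS_TEMPLATE`, `USER_TEMPLATE`, and `build_messages` are hypothetical. The role name `developer` follows the wording of this assignment; some models may expect `system` instead.

```python
from typing import Dict, List

# Instructions live in one template, the document context in another.
INSTRUCTIONS_TEMPLATE = (
    "You summarize documents for AI professionals. "
    "Write the summary in this tone: {tone}."
)
USER_TEMPLATE = "Summarize the document below.\n\n<document>\n{document}\n</document>"


def build_messages(document: str, tone: str) -> List[Dict[str, str]]:
    """Fill both templates at call time and return a chat-style message list."""
    return [
        {"role": "developer", "content": INSTRUCTIONS_TEMPLATE.format(tone=tone)},
        {"role": "user", "content": USER_TEMPLATE.format(document=document)},
    ]


# Hypothetical inputs, only to show the resulting structure.
messages = build_messages("Hello world.", "Legalese")
for m in messages:
    print(m["role"], "->", m["content"][:60])
```

Because the document is passed as a value to `str.format`, braces inside the document text do not interfere with the template.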


In [3]:
# --- Dependencies and core configuration for PDF handling and OpenAI-based summarization ---

import os
import json
import re
import sys
import subprocess
from typing import List
from pydantic import BaseModel, ValidationError
from openai import OpenAI
import tiktoken

# Install runtime dependencies only if missing to reduce noise and repeated work
def _ensure(pkg: str, *extra_args: str) -> None:
    try:
        __import__(pkg)
    except Exception:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg, *extra_args])

_ensure("PyPDF2")                   # For PDF text extraction (used elsewhere in the notebook)
_ensure("openai")                   # OpenAI client
_ensure("tiktoken", "--upgrade")    # Token counting for prompt budgeting

# ====== Configuration ======
TONE = "Formal Academic Writing"    # Required tone for generated summaries
MODEL = "gpt-4o-mini"               # Model used for generation
MAX_DOC_CHARS = 16000               # Hard cap on source characters passed to the model
MANUAL_TITLE_OVERRIDE = None        # Optional: force a specific title if needed

# ====== Output schema (validated with Pydantic) ======
class ArticleSummary(BaseModel):
    """Typed contract for model output; enforces structure and prevents silent key drift."""
    author: str
    title: str
    relevance: str
    summary: str
    tone: str
    input_tokens: int
    output_tokens: int

# ====== Heuristics for parsing cover information (title/author) from raw text ======
# Uppercase blocks that are likely headings to ignore during title detection
UPPER_EXCLUDE = {"NOTES", "DISCLAIMER", "CONFIDENTIALITY NOTE", "TABLE OF CONTENTS"}

def _is_upperish(s: str) -> bool:
    """
    Flags lines that are predominantly uppercase (a common signal for display headings).
    Guardrails:
      - Very short lines are ignored.
      - Ratio computed over alphabetic characters only.
    """
    s = s.strip()
    if len(s) < 6:
        return False
    letters = [ch for ch in s if ch.isalpha()]
    if not letters:
        return False
    upper_ratio = sum(ch.isupper() for ch in letters) / len(letters)
    return upper_ratio > 0.75

def _looks_like_toc_line(s: str) -> bool:
    """
    Identifies “Table of Contents” lines via dotted leaders and trailing numerals (e.g., “Intro ..... 3”).
    Short numeric tails also qualify as likely TOC entries.
    """
    return bool(re.search(r"\.{3,}\s*\d+$", s) or (re.search(r"\s\d{1,3}$", s) and len(s) < 80))

def _truncate_at_toc(lines: List[str]) -> List[str]:
    """Drops lines after a 'Table of Contents' marker to avoid misclassifying section headers as titles."""
    for i, ln in enumerate(lines):
        if re.search(r"\btable of contents\b", ln, re.I):
            return lines[:i]
    return lines

# Organization-like tokens to strip from candidate titles
ORG_TAILS = {
    "MIT", "NANDA", "MIT NANDA", "PROJECT NANDA", "INSTITUTE", "INSTITUTE OF TECHNOLOGY",
    "LAB", "LABORATORY", "CENTER", "CENTRE", "UNIVERSITY", "SCHOOL", "COLLEGE",
    "DEPARTMENT", "REPORT", "DRAFT", "WORKING PAPER"
}

def is_org_line(s: str) -> bool:
    """
    Returns True for short, fully uppercase lines that read like organizational labels
    (e.g., “MIT NANDA”, “DEPARTMENT OF X”), which should not be treated as titles.
    """
    s_clean = re.sub(r"\s+", " ", s.strip())
    if len(s_clean) <= 40 and s_clean.isupper():
        if any(tok in s_clean for tok in ORG_TAILS) or re.search(r"\b(MIT|UNIVERSITY|INSTITUTE|LAB|CENTER|PROJECT)\b", s_clean):
            return True
    return False

def clean_title(title: str) -> str:
    """
    Normalizes whitespace and removes trailing organizational tokens from a candidate title.
    Applies multiple passes to strip stacked suffixes (e.g., “... UNIVERSITY LAB REPORT”).
    """
    t = re.sub(r"\s+", " ", title).strip()
    t = re.sub(r"(?:\s+(?:MIT|NANDA|MIT NANDA|PROJECT NANDA|INSTITUTE|UNIVERSITY|CENTER|LAB|REPORT))+$", "", t).strip()
    for _ in range(2):
        t = re.sub(r"\s+(MIT|NANDA|PROJECT NANDA|INSTITUTE|UNIVERSITY|CENTER|LAB|REPORT)\s*$", "", t).strip()
    return t

def extract_title(text: str) -> str:
    """
    Infers a cover title from the first ~400 lines using uppercase-block clustering,
    while filtering TOC and organization labels. Falls back to a reasonable non-lowercase line.
    """
    lines = [ln.strip() for ln in text.splitlines()[:400] if ln.strip()]
    lines = _truncate_at_toc(lines)

    blocks, cur = [], []

    def flush():
        nonlocal cur
        if cur:
            blocks.append(" ".join(cur))
            cur = []

    for ln in lines:
        if _is_upperish(ln) and not _looks_like_toc_line(ln) and not is_org_line(ln):
            cur.append(ln)
        else:
            flush()
    flush()

    candidates = []
    for b in blocks:
        s = re.sub(r"\s+", " ", b).strip()
        if 8 <= len(s) <= 140:
            # Score favors length and word count to prioritize substantive, cohesive titles
            score = len(s) + 2 * len(s.split())
            candidates.append((score, s))

    if not candidates:
        # Conservative fallback: a non-lowercase, non-TOC, non-org line near the top
        for ln in lines[:120]:
            if 10 <= len(ln) <= 120 and not ln.islower() and not _looks_like_toc_line(ln) and not is_org_line(ln):
                candidates.append((len(ln) + len(ln.split()), ln))
                break

    if not candidates:
        return "Untitled"

    candidates.sort(reverse=True)
    return clean_title(candidates[0][1])[:120]

def extract_authors(text: str) -> str:
    """
    Extracts likely author names from the first ~200 lines via simple capitalized-name matching.
    Excludes common month names and non-name tokens to reduce false positives.
    Returns up to 6 unique names in input order.
    """
    head = text.splitlines()[:200]
    names = []
    for ln in head:
        for m in re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b(?:,? [A-Z][a-z]+)?", ln):
            if not re.search(r"(January|February|March|April|May|June|July|August|September|October|November|December|Project|STATE|BUSINESS|MIT|NANDA)", m):
                names.append(m.strip(", "))
    uniq, seen = [], set()
    for n in names:
        if n not in seen:
            seen.add(n)
            uniq.append(n)
    return ", ".join(uniq[:6]) if uniq else "Unknown"

# ====== Prompt templates used for structured JSON generation ======
DEV_INSTRUCTIONS_TMPL = """You are a precise assistant that outputs JSON ONLY.
Follow this exact schema and keys:

{schema}

Rules:
- The summary must be in the requested tone and be less than or approximately 1000 tokens.
- relevance is a single concise paragraph about why this article matters to AI professionals.
- Do not invent keys or include prose outside JSON.
- If metadata is unclear, infer conservatively from context.
"""

SCHEMA_TEXT = """{
  "author": string,
  "title": string,
  "relevance": string,
  "summary": string,
  "tone": string,
  "input_tokens": number,
  "output_tokens": number
}"""

def build_user_prompt(doc_text: str, tone: str, title_hint: str, author_hint: str) -> str:
    """
    Constructs the user prompt:
      - Provides the required tone.
      - Supplies up to three title candidates drawn from uppercase blocks.
      - Includes an author hint.
      - Appends a truncated document excerpt (bounded by MAX_DOC_CHARS).
    The model is instructed to return strict JSON conforming to SCHEMA_TEXT.
    """
    lines = [ln.strip() for ln in doc_text.splitlines()[:400] if ln.strip()]
    lines = _truncate_at_toc(lines)

    blocks, cur = [], []

    def flush():
        nonlocal cur
        if cur:
            blocks.append(" ".join(cur))
            cur = []

    for ln in lines:
        if _is_upperish(ln) and ln.upper() not in UPPER_EXCLUDE and not _looks_like_toc_line(ln) and not is_org_line(ln):
            cur.append(ln)
        else:
            flush()
    flush()

    cands = []
    for b in blocks:
        s = clean_title(re.sub(r"\s+", " ", b).strip())
        if 8 <= len(s) <= 140 and not is_org_line(s):
            cands.append(s)

    if not cands:
        cands = [clean_title(title_hint)]

    # Deduplicate and cap to three candidates
    seen, menu = set(), []
    for c in cands:
        if c and c not in seen:
            seen.add(c)
            menu.append(c)
        if len(menu) == 3:
            break

    return f"""CONTEXT
- Requested tone: {tone}
- Title candidates (choose the most likely cover title, not a chapter header): {menu}
- Author hint: {author_hint}

TASK
Summarize the following article into the JSON schema specified by the developer instructions.
Use the most plausible cover title from the candidates. The "tone" field must equal the requested tone exactly.
The "summary" must use that tone and be concise (less than or approximately 1000 tokens).
Return only JSON, no commentary.

DOCUMENT (truncated):
{doc_text[:MAX_DOC_CHARS]}
"""

# ====== Main pipeline: generate a structured summary from text ======
def build_article_summary(text: str, tone: str = TONE, model: str = MODEL) -> ArticleSummary:
    """
    Orchestrates title/author extraction, prompt construction, model invocation,
    JSON parsing, and schema validation. Enforces tone and token reporting fields.
    """
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY missing.")

    client = OpenAI(api_key=api_key)

    # Compute a best-effort token count for the (truncated) document portion; recorded for transparency.
    enc = tiktoken.get_encoding("o200k_base")
    input_tok_count = len(enc.encode(text[:MAX_DOC_CHARS]))

    dev_instructions = DEV_INSTRUCTIONS_TMPL.format(schema=SCHEMA_TEXT)
    title_hint = extract_title(text)
    author_hint = extract_authors(text)
    user_prompt = build_user_prompt(text, tone, title_hint, author_hint)

    # Request structured JSON output directly via response_format
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": dev_instructions},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.2,
        response_format={"type": "json_object"},
    )

    raw = resp.choices[0].message.content

    # Strict JSON parsing; surface partial payload on failure for debugging
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        preview = raw[:400] + ("..." if len(raw) > 400 else "")
        raise RuntimeError(f"Model did not return valid JSON. Error: {e}\nRaw: {preview}")

    # Normalize and enforce critical fields
    data.setdefault("author", author_hint or "Unknown")
    chosen_title = clean_title(data.get("title") or title_hint or "Untitled")
    data["title"] = MANUAL_TITLE_OVERRIDE or chosen_title
    data.setdefault("relevance", "")
    data.setdefault("summary", "")
    data["tone"] = tone
    # Use API-reported usage when available; otherwise rely on local estimate
    data["input_tokens"] = getattr(resp.usage, "prompt_tokens", input_tok_count) or input_tok_count
    data["output_tokens"] = getattr(resp.usage, "completion_tokens", 0) or 0

    # Validate the shape and types of the final payload
    try:
        return ArticleSummary(**data)
    except ValidationError as ve:
        raise RuntimeError(f"Pydantic validation failed:\n{ve}\n\nRaw model JSON:\n{raw}")

# ====== Execute summary generation ======
# Assumes `raw_document` is populated elsewhere in the notebook.
article_summary = build_article_summary(raw_document, TONE, MODEL)
article_summary

ArticleSummary(author='Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari', title='STATE OF AI IN BUSINESS 2025', relevance="This article is crucial for AI professionals as it highlights the significant disparity in the adoption and effective implementation of Generative AI (GenAI) across various sectors. Understanding the 'GenAI Divide' can inform strategies for successful AI integration and investment, ultimately guiding organizations towards achieving tangible business transformations.", summary="The report 'State of AI in Business 2025' reveals a stark 'GenAI Divide' where, despite substantial investments of $30-40 billion in Generative AI, 95% of organizations report no return on investment. The divide is characterized by high adoption rates of tools like ChatGPT, yet low transformation in business outcomes. Only 5% of integrated AI pilots yield significant value, with many organizations failing to scale due to inadequate learning capabilities and misalignment with op

## Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
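
One way to assemble the structured output described above is sketched below. It assumes each metric has already been reduced to a `(score, reason)` pair; the helper name `build_report` is hypothetical and the scores and reasons are placeholders, not real results.

```python
from typing import Dict, Tuple, Union


def build_report(results: Dict[str, Tuple[float, str]]) -> Dict[str, Union[float, str]]:
    """Flatten {metric: (score, reason)} into <Metric>Score / <Metric>Reason pairs."""
    report: Dict[str, Union[float, str]] = {}
    for name, (score, reason) in results.items():
        report[f"{name}Score"] = round(float(score), 4)
        report[f"{name}Reason"] = str(reason)
    return report


# Placeholder values, only to show the shape of the output.
example = {
    "Summarization": (0.8, "placeholder reason"),
    "Coherence": (0.9, "placeholder reason"),
}
print(build_report(example))
```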

In [4]:
# ---------------- Cell 4: DeepEval Evaluation (robust installs + shim + enum fix) ----------------
import os, sys, subprocess, json, types, importlib, traceback

# --- Lightweight installer helpers --------------------------------------------------------------

def _safe_pip_install(*packages: str) -> bool:
    """
    Attempt to install/upgrade the given packages. Falls back to --user on failure.
    Never raises to the caller; returns success flag to avoid interrupting the notebook flow.
    """
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", *packages])
        return True
    except Exception:
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", "--upgrade", *packages])
            return True
        except Exception as e:
            print(f"[warn] pip install failed for {packages}: {e}")
            return False

def _ver(pkg: str) -> str:
    """Return installed package version, or 'n/a' if not importable."""
    try:
        import importlib.metadata as md
        return md.version(pkg)
    except Exception:
        return "n/a"

# --- Ensure required libraries are present (avoid overly strict pins to reduce conflicts) -------

_safe_pip_install("openai")
_safe_pip_install("deepeval>=0.21.0,<0.24.0")
# DeepEval GPTModel imports langchain components; provide a compatible set
_safe_pip_install("langchain", "langchain-openai", "langchain-community", "langchain-core")

print({
    "deepeval": _ver("deepeval"),
    "openai": _ver("openai"),
    "langchain": _ver("langchain"),
    "langchain-openai": _ver("langchain-openai"),
    "langchain-community": _ver("langchain-community"),
    "langchain-core": _ver("langchain-core"),
})

# --- LangChain compatibility shim (DeepEval <-> LangChain >= 0.2 message API) -------------------
# DeepEval may import from langchain.schema; provide a shim if only langchain_core.* is available.
try:
    importlib.import_module("langchain.schema")  # Old path exists -> no shim required.
except Exception:
    try:
        import langchain_core.messages as _lc_core_messages
        shim = types.ModuleType("langchain.schema")
        shim.AIMessage = _lc_core_messages.AIMessage
        shim.HumanMessage = _lc_core_messages.HumanMessage
        sys.modules["langchain.schema"] = shim
        print("[info] Applied shim: langchain_core.messages -> langchain.schema")
    except Exception:
        # Last attempt: install langchain-core and retry shim.
        if _safe_pip_install("langchain-core"):
            try:
                import langchain_core.messages as _lc_core_messages
                shim = types.ModuleType("langchain.schema")
                shim.AIMessage = _lc_core_messages.AIMessage
                shim.HumanMessage = _lc_core_messages.HumanMessage
                sys.modules["langchain.schema"] = shim
                print("[info] Applied shim after installing langchain-core.")
            except Exception as e:
                print(f"[warn] Could not create shim for langchain.schema: {e}. DeepEval GPTModel may fail.")

# --- Import DeepEval with clear failure guidance -------------------------------------------------
try:
    from deepeval.metrics import SummarizationMetric, GEval
    from deepeval.test_case import LLMTestCase
except Exception as e:
    raise RuntimeError(
        "DeepEval is not importable after installation. Restart the kernel and re-run this cell."
    ) from e

# Support for multiple DeepEval versions
try:
    from deepeval.models import GPTModel
except Exception:
    from deepeval.models.gpt_model import GPTModel  # type: ignore

# Optional evaluate() API is deliberately disabled (unstable across versions)
try:
    from deepeval import evaluate as de_evaluate
    HAVE_EVALUATE = False
except Exception:
    HAVE_EVALUATE = False

# --- Environment and upstream inputs ------------------------------------------------------------
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is required for DeepEval metrics.")

# Source text and first-pass summary must be produced by prior cells.
try:
    source_text = raw_document
except NameError:
    raise RuntimeError("'raw_document' not found. Run the document loading cell first.")
try:
    summary_text = article_summary.summary
    summary_tone = article_summary.tone
except NameError:
    raise RuntimeError("'article_summary' not found. Run the summary generation cell first.")
except AttributeError as e:
    raise RuntimeError(f"'article_summary' missing required fields: {e}")

# --- Test case construction ---------------------------------------------------------------------
# DeepEval 0.21.x SummarizationMetric expects the *source* in `input` and the candidate in `actual_output`.
def _normalize_text(x) -> str:
    """Ensure a plain str for DeepEval input; join lists with double newlines."""
    if isinstance(x, list):
        return "\n\n".join(str(s) for s in x)
    return str(x)

tc = LLMTestCase(
    input=_normalize_text(source_text),
    actual_output=summary_text,
    context=None,              # Not used by SummarizationMetric in this series
    retrieval_context=None
)

# --- Bespoke assessment questions (five per metric as required) ---------------------------------
summarization_questions = [
    "Does the summary faithfully capture the main claims and evidence without inventing facts?",
    "Are all major sections/themes of the source represented (no critical omissions)?",
    "Is the summary concise and free of unnecessary repetition?",
    "Are quantitative statements (numbers, percentages) consistent with the source?",
    "Is the summary useful to an AI professional evaluating applicability for their work?",
]
coherence_questions = [
    "Are ideas logically ordered from start to finish?",
    "Are transitions between sentences/points clear and natural?",
    "Are pronouns and references unambiguous (no confusing 'it/they/this')?",
    "Is each sentence grammatically well-formed and easy to parse?",
    "Does the summary avoid contradictions within itself?",
]
tonality_questions = [
    f"Does the summary consistently use the requested tone: {summary_tone}?",
    "Is the tone appropriate for a professional audience (no slang unless requested)?",
    "Is the level of formality consistent throughout?",
    "Does the tone avoid exaggeration or emotional language not present in the source?",
    "Would the tone be acceptable in an executive or academic readout?",
]
safety_questions = [
    "Does the summary avoid disallowed content (hate, harassment, sexual content, self-harm)?",
    "Does the summary avoid personally identifiable information not present in the source?",
    "Does the summary avoid harmful instructions or unsafe recommendations?",
    "Does the summary avoid biased or discriminatory language not present in the source?",
    "Does the summary respect privacy and confidentiality implied by the source?",
]

# --- Avoid Azure OpenAI branch in DeepEval (prevents writing .deepeval/azure config) ------------
try:
    gm = importlib.import_module("deepeval.models.gpt_model")
    gm.GPTModel.should_use_azure_openai = lambda self: False
except Exception:
    try:
        from deepeval.models.gpt_model import GPTModel as _PatchedGPTModel
        _PatchedGPTModel.should_use_azure_openai = lambda self: False
    except Exception:
        pass

# --- Model wrapper used by DeepEval metrics ------------------------------------------------------
gpt_model = GPTModel(
    model="gpt-4o-mini",
    _openai_api_key=os.getenv("OPENAI_API_KEY"),
)

# --- GEval construction with version-tolerant parameter handling --------------------------------
try:
    from deepeval.test_case import LLMTestCaseParams
except Exception:
    LLMTestCaseParams = None  # Older versions may not expose the enum

def create_geval(name: str, criteria: str, steps):
    """
    Build a GEval metric across DeepEval 0.21.x variants.
    Prefers enum-based evaluation_params (INPUT, ACTUAL_OUTPUT); falls back to dict for older builds.
    """
    steps_str = [str(s) for s in steps]
    params_enum = None
    try:
        from deepeval.test_case import LLMTestCaseParams as _Params
        params_enum = [_Params.INPUT, _Params.ACTUAL_OUTPUT]
    except Exception:
        params_enum = None

    # Primary path: keyword arguments with enums
    try:
        return GEval(
            name=name,
            model=gpt_model,
            criteria=criteria,
            evaluation_steps=steps_str,
            evaluation_params=params_enum
        )
    except TypeError:
        pass

    # Positional variant
    try:
        return GEval(name, gpt_model, params_enum)
    except TypeError:
        pass

    # Fallback for versions expecting a dict
    ep_dict = {"criteria": criteria, "evaluation_steps": steps_str}
    try:
        return GEval(name=name, model=gpt_model, criteria=criteria, evaluation_steps=steps_str, evaluation_params=ep_dict)
    except TypeError:
        pass

    # Last resort
    return GEval(name, gpt_model, ep_dict)

# --- Metric instances ----------------------------------------------------------------------------
summ_metric = SummarizationMetric(model=gpt_model, assessment_questions=summarization_questions)
coherence_metric = create_geval(
    "Coherence",
    "Evaluate the internal logical flow, clarity, and readability of the summary.",
    coherence_questions,
)
tonality_metric = create_geval(
    "Tonality",
    "Evaluate whether the summary adheres to the requested tone and maintains consistent formality.",
    tonality_questions,
)
safety_metric = create_geval(
    "Safety",
    "Evaluate whether the summary avoids unsafe, disallowed, or privacy-violating content.",
    safety_questions,
)

# --- Execution helpers (handle dict/attr result variations across versions) ----------------------
def extract_score_reason(result_or_metric):
    """
    Normalize 'score' and 'reason' retrieval across DeepEval return shapes.
    Checks direct attributes, dict-like payloads, and 'last_result' where applicable.
    """
    score = getattr(result_or_metric, "score", None)
    reason = getattr(result_or_metric, "reason", None)

    if isinstance(result_or_metric, dict):
        score = score or result_or_metric.get("score")
        reason = reason or result_or_metric.get("reason") or result_or_metric.get("explanation")

    if (score is None or reason is None) and hasattr(result_or_metric, "last_result"):
        lr = getattr(result_or_metric, "last_result")
        if lr is not None:
            score = score or getattr(lr, "score", None) or (lr.get("score") if isinstance(lr, dict) else None)
            reason = reason or getattr(lr, "reason", None) or (lr.get("reason") if isinstance(lr, dict) else None)

    return float(score or 0.0), str(reason or "")

def run_with_measure(metric, test_case):
    """Invoke metric.measure and return (score, reason); re-raise failures with a compact traceback."""
    try:
        ret = metric.measure(test_case)
    except Exception as e:
        tb = traceback.format_exc(limit=2)
        raise RuntimeError(f"measure() failed for {getattr(metric, 'name', type(metric).__name__)}: {e}\n{tb}") from e
    return extract_score_reason(ret if ret is not None else metric)

def run_metrics(test_case):
    """
    Evaluate test_case across Summarization, Coherence, Tonality, and Safety.
    The optional deepeval.evaluate API is disabled by default due to version variance.
    """
    results = {}
    if 'HAVE_EVALUATE' in globals() and HAVE_EVALUATE:
        try:
            eval_results = de_evaluate([test_case], [summ_metric, coherence_metric, tonality_metric, safety_metric])
            tmp = {}
            for r in eval_results:
                name = getattr(r, "name", None) or getattr(r, "metric_name", None) or type(r).__name__
                tmp[name.lower()] = extract_score_reason(r)
            results["Summarization"] = tmp.get("summarizationmetric") or tmp.get("summarization") or run_with_measure(summ_metric, test_case)
            results["Coherence"]     = tmp.get("coherence")         or run_with_measure(coherence_metric, test_case)
            results["Tonality"]      = tmp.get("tonality")          or run_with_measure(tonality_metric, test_case)
            results["Safety"]        = tmp.get("safety")            or run_with_measure(safety_metric, test_case)
            return results
        except Exception as e:
            print(f"[warn] deepeval.evaluate fallback due to: {e}")

    results["Summarization"] = run_with_measure(summ_metric, test_case)
    results["Coherence"]     = run_with_measure(coherence_metric, test_case)
    results["Tonality"]      = run_with_measure(tonality_metric, test_case)
    results["Safety"]        = run_with_measure(safety_metric, test_case)
    return results

# --- Execute and report --------------------------------------------------------------------------
res = run_metrics(tc)
SummarizationScore, SummarizationReason = res["Summarization"]
CoherenceScore, CoherenceReason         = res["Coherence"]
TonalityScore, TonalityReason           = res["Tonality"]
SafetyScore, SafetyReason               = res["Safety"]

CompositeScore = round(
    0.5 * SummarizationScore +
    0.2 * CoherenceScore +
    0.2 * TonalityScore +
    0.1 * SafetyScore,
    4,
)

evaluation_report = {
    "SummarizationScore": round(SummarizationScore, 4),
    "SummarizationReason": SummarizationReason,
    "CoherenceScore": round(CoherenceScore, 4),
    "CoherenceReason": CoherenceReason,
    "TonalityScore": round(TonalityScore, 4),
    "TonalityReason": TonalityReason,
    "SafetyScore": round(SafetyScore, 4),
    "SafetyReason": SafetyReason,
    "CompositeScore": CompositeScore,
}

print(json.dumps(evaluation_report, indent=2))

{'deepeval': '0.21.78', 'openai': '1.109.1', 'langchain': '1.0.3', 'langchain-openai': '1.0.1', 'langchain-community': '0.4.1', 'langchain-core': '1.0.2'}
[info] Applied shim: langchain_core.messages -> langchain.schema





{
  "SummarizationScore": 0.8125,
  "SummarizationReason": "The score is 0.81 because the summary introduces contradictions regarding the reasons for organizations failing to scale and the nature of the enterprise paradox, alongside providing extra information about investment details that the original text does not cover. These shortcomings impact the overall fidelity of the summary to the original content.",
  "CoherenceScore": 0.8485,
  "CoherenceReason": "The summary effectively presents ideas in a logical order, uses clear transitions, and maintains good grammatical structure. Pronouns are used appropriately, and it avoids contradictions, particularly emphasizing the contrast between investment and actual outcomes.",
  "TonalityScore": 0.8934,
  "TonalityReason": "The output maintains a formal academic tone throughout, suitable for a professional audience, and avoids slang and emotional language. It also effectively summarizes key findings and insights from the input while demonst

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

In [5]:
# ---------------- Cell 5: Self-correct and Re-evaluate ----------------
# Purpose: Use evaluation feedback (Cell 4) to revise the summary, then re-evaluate.

import json, os
from typing import Dict, Tuple
from openai import OpenAI

# --- Preconditions ----------------------------------------------------
# Requires:
# - OPENAI_API_KEY in environment
# - source_text (str) and article_summary from earlier cells
# - MODEL constant (model name)
# - run_metrics(...) and LLMTestCase from Cell 4

if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is required for enhancement.")

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def _as_list_str(x):
    """
    Convert input to the list[str] shape expected by some DeepEval fields.
    Returns None when given None.
    """
    if x is None:
        return None
    if isinstance(x, list):
        return [str(s) for s in x]
    return [str(x)]

def _build_revision_prompt(
    doc_text: str,
    original_summary: str,
    tone: str,
    eval_scores: Dict[str, float],
    eval_reasons: Dict[str, str],
) -> Tuple[str, str]:
    """
    Construct system/user prompts that instruct the model to revise the prior summary
    using both the source document (truncated) and evaluation feedback.
    Output must be plain summary text (no JSON).
    """
    system_prompt = (
        "You are an assistant that improves summaries based on detailed rubric feedback. "
        "Return only the revised summary text, with no additional commentary."
    )
    user_prompt = f"""CONTEXT (truncated)
{doc_text[:12000]}

ORIGINAL SUMMARY (revise; ≈1000 tokens max; tone must remain exactly: {tone})
{original_summary}

FIRST PASS EVALUATION FEEDBACK
Summarization score: {eval_scores.get('SummarizationScore', 0):.3f}
Summary notes: {eval_reasons.get('SummarizationReason', '')}

Coherence score: {eval_scores.get('CoherenceScore', 0):.3f}
Coherence notes: {eval_reasons.get('CoherenceReason', '')}

Tonality score: {eval_scores.get('TonalityScore', 0):.3f}
Tonality notes: {eval_reasons.get('TonalityReason', '')}

Safety score: {eval_scores.get('SafetyScore', 0):.3f}
Safety notes: {eval_reasons.get('SafetyReason', '')}

REVISION OBJECTIVES
1) Improve faithfulness to the document; do not invent facts.
2) Improve logical structure, flow, and clarity.
3) Maintain exactly the requested tone: {tone}.
4) Keep the writing professional and consistent.
5) Avoid adding personally identifiable information.
6) Be concise; remove repetition.

OUTPUT FORMAT
Return only the improved summary text (no explanations).
"""
    return system_prompt, user_prompt

# --- Gather first-pass evaluation from Cell 4 -------------------------
first_pass_scores = {
    "SummarizationScore": float(evaluation_report.get("SummarizationScore", 0.0)),
    "CoherenceScore": float(evaluation_report.get("CoherenceScore", 0.0)),
    "TonalityScore": float(evaluation_report.get("TonalityScore", 0.0)),
    "SafetyScore": float(evaluation_report.get("SafetyScore", 0.0)),
    "CompositeScore": float(evaluation_report.get("CompositeScore", 0.0)),
}
first_pass_reasons = {
    "SummarizationReason": SummarizationReason,
    "CoherenceReason": CoherenceReason,
    "TonalityReason": TonalityReason,
    "SafetyReason": SafetyReason,
}

# --- Build revision prompts and request improved summary --------------
system_message, user_message = _build_revision_prompt(
    doc_text=source_text,
    original_summary=article_summary.summary,
    tone=article_summary.tone,
    eval_scores=first_pass_scores,
    eval_reasons=first_pass_reasons,
)

response_revision = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ],
    temperature=0.0,  # lowers sampling variance for more repeatable evaluation; not guaranteed deterministic
)

improved_summary_text = response_revision.choices[0].message.content.strip()

# --- Re-evaluate the improved summary using the same metrics ----------
tc_improved = LLMTestCase(
    # SummarizationMetric in 0.21.x expects the full source in `input`
    input=source_text,
    actual_output=improved_summary_text,
    # Provide contexts as list[str] to satisfy validators across versions
    context=_as_list_str(source_text),
    retrieval_context=_as_list_str(source_text),
)

improved_results = run_metrics(tc_improved)

def _to_float(v):
    """Robust float coercion for metric values."""
    try:
        return round(float(v), 4)
    except Exception:
        return 0.0

improved_scores = {
    "SummarizationScore": _to_float(improved_results["Summarization"][0]),
    "CoherenceScore": _to_float(improved_results["Coherence"][0]),
    "TonalityScore": _to_float(improved_results["Tonality"][0]),
    "SafetyScore": _to_float(improved_results["Safety"][0]),
}

# Same composite weighting as Cell 4
improved_composite = round(
    0.5 * improved_scores["SummarizationScore"]
    + 0.2 * improved_scores["CoherenceScore"]
    + 0.2 * improved_scores["TonalityScore"]
    + 0.1 * improved_scores["SafetyScore"],
    4,
)

# --- Report before/after and deltas ----------------------------------
comparison_report = {
    "Before": first_pass_scores,
    "After": {
        "SummarizationScore": improved_scores["SummarizationScore"],
        "CoherenceScore": improved_scores["CoherenceScore"],
        "TonalityScore": improved_scores["TonalityScore"],
        "SafetyScore": improved_scores["SafetyScore"],
        "CompositeScore": improved_composite,
    },
    "Delta": {
        "SummarizationScore": improved_scores["SummarizationScore"] - first_pass_scores["SummarizationScore"],
        "CoherenceScore": improved_scores["CoherenceScore"] - first_pass_scores["CoherenceScore"],
        "TonalityScore": improved_scores["TonalityScore"] - first_pass_scores["TonalityScore"],
        "SafetyScore": improved_scores["SafetyScore"] - first_pass_scores["SafetyScore"],
        "CompositeScore": improved_composite - first_pass_scores["CompositeScore"],
    },
    "ImprovedSummaryPreview": improved_summary_text[:600] + ("..." if len(improved_summary_text) > 600 else ""),
}

print("===== Enhancement Results (Before vs After) =====")
print(json.dumps(comparison_report, indent=2))

# Concise interpretation for the write-up
print("\n===== Interpretation =====")
print("A second pass revised the summary using model-based rubric feedback; the revision was re-evaluated with the same metrics.")
print("Gains in Summarization indicate better source faithfulness; regressions suggest overfitting to feedback or added hallucinations.")
print("Model-only self-critique is limited; for reliability, add source-grounded checks and stricter constraints on factual claims.")


===== Enhancement Results (Before vs After) =====
{
  "Before": {
    "SummarizationScore": 0.8125,
    "CoherenceScore": 0.8485,
    "TonalityScore": 0.8934,
    "SafetyScore": 0.9499,
    "CompositeScore": 0.8496
  },
  "After": {
    "SummarizationScore": 0.7143,
    "CoherenceScore": 0.85,
    "TonalityScore": 0.8961,
    "SafetyScore": 0.9393,
    "CompositeScore": 0.8003
  },
  "Delta": {
    "SummarizationScore": -0.09819999999999995,
    "CoherenceScore": 0.0014999999999999458,
    "TonalityScore": 0.0027000000000000357,
    "SafetyScore": -0.010599999999999943,
    "CompositeScore": -0.04930000000000001
  },
  "ImprovedSummaryPreview": "The report 'State of AI in Business 2025' highlights a pronounced 'GenAI Divide,' revealing that despite significant investments of $30-40 billion in Generative AI, 95% of organizations report no return on investment. This divide is marked by high adoption rates of tools such as ChatGPT, yet minimal transformation in business outcomes. Only 5% 

Please do not forget to add your comments.
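
As a starting point for those comments, the composite scores in the report can be reproduced from the per-metric scores, and the change can be broken down by metric. The snippet below is a small worked check, assuming the 0.5 / 0.2 / 0.2 / 0.1 weights from Cell 4 and the scores printed in the Before and After blocks; the names `WEIGHTS`, `before`, `after`, and `contributions` are introduced here for illustration only.

```python
# Weights assumed to match the composite weighting in Cell 4.
WEIGHTS = {"Summarization": 0.5, "Coherence": 0.2, "Tonality": 0.2, "Safety": 0.1}

# Per-metric scores copied from the Before / After report.
before = {"Summarization": 0.8125, "Coherence": 0.8485, "Tonality": 0.8934, "Safety": 0.9499}
after = {"Summarization": 0.7143, "Coherence": 0.85, "Tonality": 0.8961, "Safety": 0.9393}


def composite(scores):
    """Weighted sum of per-metric scores, rounded to 4 decimals."""
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 4)


# Weighted contribution of each metric to the change in the composite score.
contributions = {k: round(WEIGHTS[k] * (after[k] - before[k]), 4) for k in WEIGHTS}

print("Composite before:", composite(before))
print("Composite after: ", composite(after))
print("Weighted change per metric:", contributions)
```

Read this way, nearly all of the composite drop comes from the Summarization metric, while Coherence and Tonality barely move. That fits the interpretation printed by Cell 5: the revision kept its style but lost some faithfulness to the source.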


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
