# Live Session

## Student question:

> Can you suggest analytical frameworks, beyond artificial analysts to assess how an AI project’s token price correlates with its actual performance, usage, and fundamentals? In short it worth the investment?

### Task/Eval Score/Cost/Model Options

We will build a simple evaluation pipeline:
1) Summarize a paper
2) Score the summary with a judge model
3) Normalize scores into a single metric
4) Compare models and estimate token cost


In [1]:
import os
import getpass
from dotenv import load_dotenv

# Load environment variables from .env file (if it exists)
load_dotenv()

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("OPENAI_API_KEY")
print("OpenAI API key loaded")

OpenAI API key loaded


## Task Definition

Summarizing papers.

We'll define the task in code so later cells can reuse it reliably.


In [3]:
task_definition = (
    "Evaluate how well a model can summarize a paper using an LLM-as-a-judge."
)
print(f"Task definition set: {task_definition}")


Task definition set: Evaluate how well a model can summarize a paper using an LLM-as-a-judge.


How do we evaluate a summary of a paper? 

More importantly is, how do we quantify a decent score for this evaluation?

The hack will be instead of using traditional metrics, we'll use LLM-AS-A-JUDGE.

Next we define the judge prompt and load the local PDF so we can compare summaries
against the original paper content.


In [7]:
PROMPT_EVAL_SUMMARY = f'''
    You are an expert evaluator. Your task is to assess the quality of the provided summary of a document using the following four criteria. For each, please think step by step before giving a score.

        Evaluation Criteria:

        1. Coherence (1-5) - the collective quality of all sentences in the summary. The summary should be well-structured and well-organized, building a coherent body of information about the topic.

        2. Consistency (1-5) - factual alignment between the summary and the source document. The summary should contain only statements that are logically entailed by the source document.

        3. Fluency (1-5) - the quality of individual sentences. The summary should be free of grammatical errors, awkward phrasing, or formatting issues that impede readability.

        4. Relevance (1-5) - inclusion of the most important content from the source. The summary should focus on key information and avoid including irrelevant or redundant details.

        Instructions:
        - First, read both the source document and all the context of the task and the summary carefully.
        - For each criterion:
        1. Reflect step by step (e.g., "Step-by-step reasoning: ...").
        2. Then provide a score from 1 (poor) to 5 (excellent) for each criterion.

    '''


def eval_score(summary_output, context, model_judge_openai=None, model_judge_anthropic=None):
    # Use defaults set elsewhere if not provided.
    if model_judge_openai is None:
        model_judge_openai = MODEL_JUDGE_OPENAI
    if model_judge_anthropic is None:
        model_judge_anthropic = MODEL_JUDGE_ANTHROPIC

    # Use Claude when evaluating Claude outputs; otherwise use OpenAI.
    if model_judge_anthropic and model_judge_anthropic.startswith("claude-"):
        response = anthropic_client.messages.create(
            model=model_judge_anthropic,
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"{PROMPT_EVAL_SUMMARY}\n\nSummary: {summary_output}\nContext: {context}"
            }]
        )
        return _extract_anthropic_text(response.content)

    response = openai_client.chat.completions.create(
        model=model_judge_openai,
        messages=[
            {"role": "system", "content": PROMPT_EVAL_SUMMARY},
            {"role": "user", "content": f"Summary: {summary_output}\nContext: {context}"}
        ]
    )

    return response.choices[0].message.content


In [None]:
# Use the repo copy of paper.pdf; avoid network downloads
from pathlib import Path
paper_path = Path("paper.pdf")
if not paper_path.exists():
    raise FileNotFoundError("Expected paper.pdf in notebooks/; add it to the repo.")
print(f"Using paper at: {paper_path.resolve()}")

Using paper at: /Users/greatmaster/Desktop/projects/oreilly-live-trainings/oreilly-reasoning-models/notebooks/paper.pdf


In [8]:
from docling.document_converter import DocumentConverter

print("Converting paper.pdf to markdown with docling...")

def convert_pdf_to_markdown(pdf_path):
    'Converts a local PDF to markdown using docling and returns the result.'
    try:
        converter = DocumentConverter()
        result = converter.convert(pdf_path)
        markdown = result.document.export_to_markdown()
        return markdown
    except Exception as e:
        print(f"Error converting document: {e}")
        return None

paper_path = "./paper.pdf"

paper_contents = convert_pdf_to_markdown(paper_path)
if not paper_contents:
    raise ValueError("Failed to convert paper.pdf to markdown")
print("Paper converted successfully.")


Converting paper.pdf to markdown with docling...


  from .autonotebook import tqdm as notebook_tqdm


Paper converted successfully.


We now display the paper contents so you can skim the source text before
summarizing it.


In [7]:
from IPython.display import Markdown

print("Displaying paper contents (markdown):")
Markdown(paper_contents[:500])


Displaying paper contents (markdown):


<!-- image -->

## Chain-of-Thought Reasoning without Prompting

Xuezhi Wang 1 and Denny Zhou 1

1 Google DeepMind, 1 {xuezhiw, dennyzhou}@google.com

In enhancing the reasoning capabilities of large language models (LLMs), prior research primarily focuses on specific prompting techniques such as few-shot or zero-shot chain-of-thought (CoT) prompting. These methods, while effective, often involve manually intensive prompt engineering. Our study takes a novel approach by asking: Can LLMs reason e

In [12]:
MODEL_JUDGE_OPENAI = "gpt-4o"
MODEL_JUDGE_ANTHROPIC = "claude-sonnet-4-5-20250929"

# Choose a default judge model for this notebook
MODEL_JUDGE = MODEL_JUDGE_OPENAI


In [9]:
PROMPT_SUMMARIZE = """
You are an expert in summarizing papers. 
You take in full markdown paper content and return a comprehensive summary of it.
"""

In [10]:
from openai import OpenAI
import anthropic

openai_client = OpenAI()
# Assumes ANTHROPIC_API_KEY is set in the environment
anthropic_client = anthropic.Anthropic()

print("Clients initialized (OpenAI + Anthropic)")


def _extract_anthropic_text(content_blocks):
    text = ""
    for block in content_blocks:
        if getattr(block, "type", None) == "text":
            text += block.text
    return text


def summarize_paper(paper_contents, model):
    if model.startswith("claude-"):
        response = anthropic_client.messages.create(
            model=model,
            max_tokens=2000,
            messages=[
                {"role": "user", "content": f"{PROMPT_SUMMARIZE}\n\nSummarize this paper: {paper_contents}"}
            ]
        )
        text = _extract_anthropic_text(response.content)
        tokens = (response.usage.input_tokens or 0) + (response.usage.output_tokens or 0)
        return text, tokens

    response = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPT_SUMMARIZE},
            {"role": "user", "content": f"Summarize this paper: {paper_contents}"}
        ]
    )
    return response.choices[0].message.content, response.usage.total_tokens

summary, tokens = summarize_paper(paper_contents, "gpt-4o")
print("Summary generated (gpt-4o).")
output_score = eval_score(summary, paper_contents, MODEL_JUDGE)
print("Summary evaluated by judge model.")


Clients initialized (OpenAI + Anthropic)
Summary generated (gpt-4o).


NameError: name 'MODEL_JUDGE' is not defined

In [11]:
print("\n=== SUMMARY (Markdown) ===")
Markdown(summary)



=== SUMMARY (Markdown) ===


The paper "Chain-of-Thought Reasoning without Prompting" explores the intrinsic reasoning abilities of large language models (LLMs) without relying on the traditional prompting methods that require manual tinkering and task-specific human inputs. The authors, Xuezhi Wang and Denny Zhou from Google DeepMind, propose a novel approach called CoT-decoding to elicit reasoning paths from LLMs. This approach involves altering the decoding process by inspecting the top k alternative tokens rather than relying solely on greedy decoding. Their findings suggest that large language models inherently possess reasoning capabilities that can be brought to light by considering these alternative decoding paths.

Key contributions and findings include:
1. **Intrinsic Reasoning without Human Prompts**: The study demonstrates that decoding changes can uncover reasoning capabilities in LLMs without the need for external prompting. This challenges the assumption that LLMs lack effective reasoning without specific prompts and highlights that the perceived inability arises from neglecting non-greedy paths in decoding.

2. **Correlation between CoT Paths and Model Confidence**: The presence of a chain-of-thought reasoning path is correlated with higher confidence in the final answer, suggesting that LLMs are more certain of their solutions when following a reasoning path.

3. **CoT-Decoding Methodology**: The authors introduce CoT-decoding, a method that uses answer confidence to select reliable reasoning paths during the decoding process, leading to improved reasoning performance across various benchmarks without additional training data or model tuning.

4. **Applicability across Models and Scales**: The approach effectively enhances reasoning in multiple LLM families, such as PaLM-2, Mistral, and Gemma, and is applicable across different model scales, providing accuracy gains on both math and commonsense reasoning tasks.

5. **Task Difficulty and Pre-training Influence**: The study notes that the discovery of correct CoT paths is influenced by task difficulty and pre-training distribution, with simpler tasks revealing inherent reasoning paths more readily.

6. **Combining CoT-Decoding and Prompting**: While CoT-decoding alone demonstrates significant improvements, its combination with traditional CoT-prompting methods yields even larger reasoning gains, showcasing the versatility and effectiveness of the approach.

The paper concludes that LLMs have untapped reasoning potential that can be harnessed through clever decoding strategies rather than extensive prompt engineering, paving the way for further advancements in understanding and leveraging the intrinsic capabilities of these models. Additionally, they propose areas for future exploration, including further fine-tuning using discovered CoT paths and branching beyond the first decoding token to optimize reasoning paths.

Overall, this research offers a transformative perspective on enhancing LLM reasoning by simplifying the decoding process and exploiting existing model capabilities rather than relying on extensive external inputs.

In [12]:
print("\n=== EVALUATION OUTPUT (Markdown) ===")
Markdown(output_score)



=== EVALUATION OUTPUT (Markdown) ===


# Evaluation of Summary Quality

## 1. Coherence (1-5)

**Step-by-step reasoning:**
- The summary follows a clear logical structure, starting with an introduction to the paper's main contribution
- It systematically covers the key findings numbered 1-6, creating a coherent progression
- Each section builds upon the previous one, from intrinsic reasoning abilities to methodology to applications
- The conclusion naturally synthesizes the main points
- Transitions between sections are smooth and logical
- The organization mirrors the paper's structure while maintaining readability

**Score: 5/5**

The summary is exceptionally well-structured with clear organization and logical flow throughout.

## 2. Consistency (1-5)

**Step-by-step reasoning:**
- Checking key claims against source document:
  - "Chain-of-Thought Reasoning without Prompting" - ✓ Accurate title and authors
  - "Altering the decoding process by inspecting top k alternative tokens" - ✓ Matches source
  - "CoT paths correlated with higher confidence" - ✓ Supported by source (Figure 1, Table 1)
  - "Works across PaLM-2, Mistral, and Gemma" - ✓ Confirmed in experiments
  - "Task difficulty and pre-training influence" - ✓ Discussed in Section 3.2
  - "Combining with prompting yields larger gains" - ✓ Shown in Table 7
- No fabricated information detected
- All numerical claims and model names match the source
- The summary accurately represents the paper's methodology and findings

**Score: 5/5**

The summary is completely consistent with the source document, with no factual errors or misrepresentations.

## 3. Fluency (1-5)

**Step-by-step reasoning:**
- Grammar: All sentences are grammatically correct
- Sentence structure: Varied and sophisticated without being convoluted
- Technical terminology: Used appropriately and consistently (e.g., "LLMs," "CoT-decoding," "greedy decoding")
- Readability: Clear and accessible despite technical subject matter
- No awkward phrasing detected
- Punctuation and formatting are correct throughout
- Professional academic tone maintained consistently

**Score: 5/5**

The summary is exceptionally well-written with no grammatical errors or fluency issues.

## 4. Relevance (1-5)

**Step-by-step reasoning:**
- Captures the main contribution: Novel CoT-decoding approach - ✓
- Includes key methodology: Top k alternative tokens, confidence-based selection - ✓
- Reports important empirical results: Performance across models and tasks - ✓
- Mentions critical insights: Intrinsic reasoning capabilities, correlation with confidence - ✓
- Covers practical applications: Combining with prompting, applicability across scales - ✓
- Discusses limitations appropriately: Task difficulty, pre-training influence - ✓
- Avoids excessive technical details while maintaining substance
- No significant irrelevant information included
- Could have mentioned computational costs briefly (minor omission from Discussion section)

**Score: 5/5**

The summary captures all essential content from the paper, focusing on the most important contributions and findings without including irrelevant details.

## Final Scores:
- **Coherence: 5/5**
- **Consistency: 5/5**
- **Fluency: 5/5**
- **Relevance: 5/5**

This is an exemplary summary that accurately, clearly, and comprehensively represents the source document's key contributions and findings.

In [None]:
from pydantic import BaseModel, Field

class SummaryOutputScore(BaseModel):
    coherence_score: float = Field(description="The score of the coherence of the summary")
    consistency_score: float = Field(description="The score of the consistency of the summary")
    fluency_score: float = Field(description="The score of the fluency of the summary")
    relevance_score: float = Field(description="The score of the relevance of the summary")


def extract_summary_output_score(eval_output):
    response = openai_client.beta.chat.completions.parse(
        model=MODEL_JUDGE,
        messages=[
            {"role": "system", "content": "You extract these scores from the evaluation scores output: coherence_score, consistency_score, fluency_score, relevance_score."},
            {"role": "user", "content": f"Extract the scores from this evaluation scores output: {eval_output}"}
        ],
        response_format=SummaryOutputScore
    )

    return response.choices[0].message.parsed


summary_output_score = extract_summary_output_score(output_score)

print(f"Coherence score: {summary_output_score.coherence_score}")
print(f"Consistency score: {summary_output_score.consistency_score}")
print(f"Fluency score: {summary_output_score.fluency_score}")
print(f"Relevance score: {summary_output_score.relevance_score}")


NameError: name 'output_score' is not defined

In [19]:
def calculate_final_score(scores_structured: SummaryOutputScore) -> float:
    'Calculate the final score as the average of the four criteria (1 to 5).'
    return (scores_structured.coherence_score + scores_structured.consistency_score +         scores_structured.fluency_score + scores_structured.relevance_score) / 4


# final_score = calculate_final_score(summary_output_score)
# print(f"Final average score: {final_score:.2f}/5")


In [14]:
token_cost_dict = {
    "gpt-5.2": {
        "input": 1.75 / 1_000_000,
        "cached_input": 0.175 / 1_000_000,
        "output": 14.00 / 1_000_000,
    },
    "gpt-4o": {
        "input": 2.50 / 1_000_000,
        "cached_input": 1.25 / 1_000_000,
        "output": 10.00 / 1_000_000,
    },
    "claude-sonnet-4-5": {
        "input": 3.00 / 1_000_000,
        "cached_input": 0.30 / 1_000_000,
        "output": 15.00 / 1_000_000,
    },
    "gpt-5-mini": {
        "input": 0.15 / 1_000_000,
        "cached_input": 0.015 / 1_000_000,
        "output": 1.20 / 1_000_000,
    },
    "gpt-4.1-mini": {
        "input": 0.03 / 1_000_000,
        "cached_input": 0.003 / 1_000_000,
        "output": 0.24 / 1_000_000,
    },
    "gpt-4.1": {
        "input": 0.15 / 1_000_000,
        "cached_input": 0.015 / 1_000_000,
        "output": 1.20 / 1_000_000,
    },
}

def get_token_cost(model: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    'Calculate the total cost for a model run.'
    if model not in token_cost_dict:
        raise ValueError(f"Model {model} not found in token_cost_dict")
    prices = token_cost_dict[model]

    input_cost = (input_tokens * (prices["input"] or 0)) if prices["input"] else 0
    cached_cost = (cached_tokens * (prices["cached_input"] or 0)) if prices["cached_input"] else 0
    output_cost = (output_tokens * (prices["output"] or 0)) if prices["output"] else 0

    return input_cost + cached_cost + output_cost


example_cost = get_token_cost("gpt-4o", 1000, 1000, 0)
print(f"Example cost for 1k input/1k output (gpt-4o): ${example_cost:.6f}")


Example cost for 1k input/1k output (gpt-4o): $0.012500


In [16]:
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": 'tell jokes'},
        {"role": "user", "content": "Tell me a funny joke"}
    ]
)

token_count = response.usage.total_tokens
print(f"Sample run token count: {token_count}")


Sample run token count: 32


In [15]:
import pandas as pd

num_generations = 3
results = []
model_options = ['gpt-5-mini', 'gpt-5.2', 'gpt-4.1']

print(f"Running {num_generations} summaries per model...")

for model in model_options:
    for n in range(num_generations):
        summary, tokens = summarize_paper(paper_contents, model)
        output_score = eval_score(summary, paper_contents, model)
        token_cost = get_token_cost(model, tokens, tokens, 0)
        results.append({
            "model": model,
            "generation": n + 1,
            "summary": summary,
            "output_score": output_score,
            "tokens": tokens,
            "token_cost": token_cost
        })


Running 3 summaries per model...


In [16]:
results_df = pd.DataFrame(results)
results_df

Unnamed: 0,model,generation,summary,output_score,tokens,token_cost
0,gpt-5-mini,1,"Summary — ""Chain-of-Thought Reasoning without ...",## Evaluation of Summary Quality\n\nI will eva...,22223,0.030001
1,gpt-5-mini,2,"Here’s a concise, comprehensive summary of the...",# SUMMARY EVALUATION\n\nI'll evaluate this sum...,22020,0.029727
2,gpt-5-mini,3,Paper title: Chain-of-Thought Reasoning withou...,## Evaluation of Summary Quality\n\nI'll evalu...,22744,0.030704
3,gpt-5.2,1,## High-level idea\n\nThe paper asks whether *...,Looking at this summary against the source doc...,21954,0.345776
4,gpt-5.2,2,## High-level idea\n\nThe paper argues that la...,# Step-by-Step Evaluation of the Summary\n\n##...,21682,0.341492
5,gpt-5.2,3,## Paper in one sentence\nThe paper shows that...,I'll evaluate this summary using the four crit...,21655,0.341066
6,gpt-4.1,1,Certainly! Here is a comprehensive summary of ...,I'll evaluate this summary across the four cri...,21615,0.02918
7,gpt-4.1,2,Here's a comprehensive summary of the paper **...,# Evaluation of Summary\n\nLet me evaluate thi...,21899,0.029564
8,gpt-4.1,3,Here is a comprehensive summary of the paper *...,# Evaluation of the Summary\n\nLet me evaluate...,21655,0.029234


This table includes the raw summaries and judge outputs for each run.


Now we extract numeric scores and build a compact comparison table.


In [20]:
# Extract summaries and structured output_scores, and create a new DataFrame

# Assuming the structured output function is called `extract_structured_output`
# and it takes the output_score (which may be a string or object) and returns a dict

# Extract summaries
summaries = [result['summary'] for result in results]

# # Extract structured output scores
structured_scores = [calculate_final_score(extract_summary_output_score(result['output_score'])) for result in results]

In [21]:
# Combine into a new DataFrame
summary_table = pd.DataFrame(structured_scores)
summary_table['model'] = [result['model'] for result in results]
summary_table['generation'] = [result['generation'] for result in results]
summary_table['summary'] = summaries
summary_table['final_score'] = structured_scores
summary_table['token_cost'] = [result['token_cost'] for result in results]
# Reorder columns for clarity if desired
cols = ['model', 'generation', 'summary', 'final_score', 'token_cost'] + [col for col in summary_table.columns if col not in ['model', 'generation', 'summary', 'final_score', 'token_cost']]
summary_table = summary_table[cols]

summary_table


Unnamed: 0,model,generation,summary,final_score,token_cost,0
0,gpt-5-mini,1,"Summary — ""Chain-of-Thought Reasoning without ...",5.0,0.030001,5.0
1,gpt-5-mini,2,"Here’s a concise, comprehensive summary of the...",5.0,0.029727,5.0
2,gpt-5-mini,3,Paper title: Chain-of-Thought Reasoning withou...,5.0,0.030704,5.0
3,gpt-5.2,1,## High-level idea\n\nThe paper asks whether *...,5.0,0.345776,5.0
4,gpt-5.2,2,## High-level idea\n\nThe paper argues that la...,5.0,0.341492,5.0
5,gpt-5.2,3,## Paper in one sentence\nThe paper shows that...,5.0,0.341066,5.0
6,gpt-4.1,1,Certainly! Here is a comprehensive summary of ...,4.75,0.02918,4.75
7,gpt-4.1,2,Here's a comprehensive summary of the paper **...,5.0,0.029564,5.0
8,gpt-4.1,3,Here is a comprehensive summary of the paper *...,5.0,0.029234,5.0
