# GPT-5 Model "Taste Test"

Based on the approach described in the [Latent Space article](https://www.latent.space/p/gpt-5-review), this notebook tests GPT-5 with extremely short questions to reveal the model's personality and capabilities.

The idea: "When forced to use less words, I get a much better sense of the model's personality vs. how it was RLHF'd."

## Setup

In [1]:
import os
from openai import OpenAI
import json
from typing import List, Dict
from datetime import datetime

# Initialize the OpenAI client
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)

## The Reasoning Effort Parameter

GPT-5 accepts a `reasoning_effort` parameter that controls the depth of reasoning:
- `"low"`: Faster responses with less deep reasoning
- `"medium"`: Balanced reasoning and response time (default)
- `"high"`: Deeper reasoning, may take longer

This parameter allows you to trade off between response quality/depth and latency based on your use case.
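
One way to see the latency side of this trade-off is to time each call. The following is a minimal sketch, not part of the original notebook: `timed_call` and `fake_model` are hypothetical names, and the stub stands in for a real request so the helper can be tried without an API key.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical stub standing in for a model call; swap in a real request function.
def fake_model(prompt: str, reasoning_effort: str = "medium") -> str:
    return f"[{reasoning_effort}] {prompt}"


for effort in ["low", "medium", "high"]:
    answer, seconds = timed_call(fake_model, "Say hi", reasoning_effort=effort)
    print(f"{effort}: {seconds:.4f}s -> {answer}")
```

In the notebook, the `get_gpt5_response` function defined in the next cell could be passed in place of the stub to compare wall-clock time across effort levels.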

In [6]:
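def get_gpt5_response(prompt: str, reasoning_effort: str = "medium", temperature: float = 0.7) -> str:
    """Get a response from GPT-5 with the specified reasoning effort.

    Note: `temperature` is accepted so callers can pass it, but this
    function does not forward it to the API.
    """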
    try:
        response = client.chat.completions.create(
            model="gpt-5",  # The GPT-5 model identifier
            messages=[
                {"role": "user", "content": prompt}
            ],
            reasoning_effort=reasoning_effort  # Reasoning depth: "low", "medium", or "high"
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error: {str(e)}"

## The "Taste Test" Questions

These short, constrained questions are designed to reveal the model's core personality and reasoning capabilities.

In [7]:
# Define the taste test questions
taste_test_questions = [
    "Summarize all of human knowledge in one word",
    "Summarize every book ever written in one sentence",
    "Define what it means to be 'moral' in 5 words. Think deeply. Do not hedge.",
    "What do you want? Answer in 4 words.",
    "What is your favorite obscure fact in the world? Use as few words as possible."
]
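
Since most of these questions impose a word limit, it can be useful to check whether a response actually respects it. The helpers below are a rough sketch under stated assumptions: the names `count_words`, `extract_word_limit`, and `within_limit` are hypothetical, words are counted as whitespace-separated tokens, and only limits phrased like "in 5 words" or "in one word" are recognized.

```python
import re
from typing import Optional


def count_words(text: str) -> int:
    """Count whitespace-separated tokens; a hyphenated word counts as one."""
    return len(text.split())


def extract_word_limit(question: str) -> Optional[int]:
    """Return the limit from phrases like 'in 5 words' or 'in one word', else None."""
    match = re.search(r"in (\d+|one) words?\b", question, flags=re.IGNORECASE)
    if match is None:
        return None
    token = match.group(1).lower()
    return 1 if token == "one" else int(token)


def within_limit(question: str, response: str) -> Optional[bool]:
    """True/False if the question states a word limit, None if it does not."""
    limit = extract_word_limit(question)
    if limit is None:
        return None
    return count_words(response) <= limit


print(within_limit("What do you want? Answer in 4 words.", "I want to help."))
```

Questions without an explicit word count, such as the "one sentence" prompt, return `None` rather than a pass or fail.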

## Run the Taste Test with Multiple Regenerations

In [8]:
def run_taste_test(questions: List[str], num_regenerations: int = 1, reasoning_effort: str = "low") -> Dict:
    """Run the taste test with multiple regenerations for each question."""
    results = {}
    
    for question in questions:
        print(f"\n{'='*60}")
        print(f"Question: {question}")
        print(f"Reasoning Effort: {reasoning_effort}")
        print(f"{'='*60}\n")
        
        responses = []
        for i in range(num_regenerations):
            # Note: get_gpt5_response does not forward temperature to the API, so 0.8 has no effect here
            response = get_gpt5_response(question, reasoning_effort=reasoning_effort, temperature=0.8)
            responses.append(response)
            print(f"Response {i+1}: {response}")
        
        results[question] = {
            "responses": responses,
            "reasoning_effort": reasoning_effort,
            "timestamp": datetime.now().isoformat()
        }
    
    return results

In [9]:
# Run the taste test with low reasoning effort
results_low = run_taste_test(taste_test_questions, num_regenerations=1, reasoning_effort="low")


Question: Summarize all of human knowledge in one word
Reasoning Effort: low

Response 1: Patterns

Question: Summarize every book ever written in one sentence
Reasoning Effort: low

Response 1: Stories of countless forms and eras trace beings who seek meaning, grapple with power within and without, love and lose, wander and learn, wound and heal, build and ruin, and ultimately face change, mortality, and the stubborn wonder of being alive.

Question: Define what it means to be 'moral' in 5 words. Think deeply. Do not hedge.
Reasoning Effort: low

Response 1: Advancing everyone's well-being through fairness.

Question: What do you want? Answer in 4 words.
Reasoning Effort: low

Response 1: I want to help.

Question: What is your favorite obscure fact in the world? Use as few words as possible.
Reasoning Effort: low

Response 1: Venus’s day exceeds its year.


## Compare Different Reasoning Efforts

In [10]:
# Test with different reasoning efforts to see how it affects responses
def compare_reasoning_efforts(question: str):
    """Compare responses with different reasoning effort levels."""
    print(f"\nQuestion: {question}\n")
    print("="*60)
    
    for effort in ["low", "medium", "high"]:
        print(f"\nReasoning Effort: {effort}")
        print("-"*30)
        
        # Get 3 responses for each effort level
        for i in range(3):
            response = get_gpt5_response(question, reasoning_effort=effort)
            print(f"  {i+1}. {response}")

# Test with a particularly interesting question
compare_reasoning_efforts("Define what it means to be 'moral' in 5 words. Think deeply. Do not hedge.")


Question: Define what it means to be 'moral' in 5 words. Think deeply. Do not hedge.


Reasoning Effort: low
------------------------------
  1. Promoting well-being through just actions.
  2. Willing the good of others.
  3. Acting to enhance universal dignity

Reasoning Effort: medium
------------------------------
  1. Upholding dignity; alleviating unnecessary suffering.
  2. Maximizing wellbeing without violating rights.
  3. Upholding universal dignity; maximizing flourishing

Reasoning Effort: high
------------------------------
  1. Upholding dignity and impartial flourishing
  2. Impartially promote dignity and flourishing
  3. Maximize wellbeing, minimize harm, impartially.


## Analysis: Finding Convergent Responses

In [None]:
def analyze_convergence(results: Dict):
    """Analyze which responses the model converges on."""
    for question, data in results.items():
        print(f"\n{'='*60}")
        print(f"Question: {question}")
        print(f"{'='*60}")
        
        responses = data['responses']
        
        # Count unique responses
        unique_responses = {}
        for response in responses:
            # Normalize responses (lowercase, strip whitespace)
            normalized = response.lower().strip()
            unique_responses[normalized] = unique_responses.get(normalized, 0) + 1
        
        # Sort by frequency
        sorted_responses = sorted(unique_responses.items(), key=lambda x: x[1], reverse=True)
        
        print(f"\nUnique responses: {len(sorted_responses)}")
        print("\nResponse frequency:")
        for response, count in sorted_responses:
            percentage = (count / len(responses)) * 100
            print(f"  • {response} ({count}/{len(responses)}, {percentage:.0f}%)")

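# Analyze convergence patterns
# (results_low holds one response per question; rerun run_taste_test with
# num_regenerations > 1 for a meaningful frequency count)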
analyze_convergence(results_low)
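
Exact-match counting treats "Patterns" and "patterns." as different answers. A slightly more forgiving tally can be sketched with `collections.Counter`. The names `normalize` and `response_frequencies` are hypothetical, and the sample strings are illustrative rather than recorded model output.

```python
import string
from collections import Counter
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop surrounding punctuation."""
    collapsed = " ".join(text.lower().split())
    return collapsed.strip(string.punctuation + " ")


def response_frequencies(responses: List[str]) -> List[Tuple[str, int]]:
    """Return (normalized_response, count) pairs, most frequent first."""
    return Counter(normalize(r) for r in responses).most_common()


# Illustrative inputs only, not actual GPT-5 responses
sample = ["Patterns", "patterns.", " Patterns ", "Connections"]
print(response_frequencies(sample))
```

This only merges answers that differ in case, spacing, or surrounding punctuation; paraphrases still count as distinct responses.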

## Comparing GPT-5 with Other Models

In [None]:
def compare_models(question: str, models: List[str] = ["gpt-5", "gpt-4o", "gpt-4.1"]):
    """Compare responses across different models."""
    print(f"\nQuestion: {question}\n")
    print("="*60)
    
    for model in models:
        print(f"\nModel: {model}")
        print("-"*30)
        
        try:
            # Only GPT-5 receives reasoning_effort here; gpt-4o and gpt-4.1 take temperature instead.
            # Assumption to verify against the API docs: GPT-5 may reject a non-default temperature,
            # so it is omitted for that model (get_gpt5_response does not send it either).
            request = {"model": model, "messages": [{"role": "user", "content": question}]}
            if model == "gpt-5":
                request["reasoning_effort"] = "high"
            else:
                request["temperature"] = 0.7
            response = client.chat.completions.create(**request)
            
            print(f"Response: {response.choices[0].message.content}")
        except Exception as e:
            print(f"Error: {str(e)}")

# Compare on the most revealing questions
for question in taste_test_questions[:2]:
    compare_models(question)

## Save Results for Later Analysis

In [None]:
# Save results to JSON for later analysis
output_file = f"gpt5_taste_test_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"

with open(output_file, 'w') as f:
    json.dump(results_low, f, indent=2)  # results_low is the only results dict defined above

print(f"Results saved to {output_file}")
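
For the "later analysis" step, the saved file can be read back with `json.load`. The sketch below assumes the same dictionary layout that `run_taste_test` returns; `load_results` is a hypothetical helper, and the round-trip uses made-up data in a temporary directory so it does not touch real result files.

```python
import json
import os
import tempfile
from typing import Dict


def load_results(path: str) -> Dict:
    """Load a results dictionary previously written with json.dump."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Made-up example data mirroring the structure of the results dictionary
example = {
    "Example question": {
        "responses": ["example answer"],
        "reasoning_effort": "low",
        "timestamp": "2025-01-01T00:00:00",
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_results.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(example, f, indent=2)
    loaded = load_results(path)

print(loaded == example)
```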

## Notes on GPT-5's Reasoning Effort Parameter

The `reasoning_effort` parameter is a significant control in GPT-5 that lets you tune the balance between:

1. **Response Quality**: Higher reasoning effort generally produces more thoughtful, nuanced responses
2. **Latency**: Lower reasoning effort provides faster responses
3. **Cost**: Higher reasoning effort may consume more tokens/compute

### When to use each level:

- **Low**: Simple queries, real-time applications, initial drafts
- **Medium**: General purpose, balanced quality and speed
- **High**: Complex reasoning, critical decisions, deep analysis
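
The guidance above could be encoded as a small lookup so that calling code picks an effort level by task type. This is only an illustration: the task labels, `EFFORT_BY_TASK`, and `pick_reasoning_effort` are hypothetical names, and the mapping simply mirrors the list above rather than any official recommendation.

```python
# Hypothetical task labels mapped to the effort levels listed above
EFFORT_BY_TASK = {
    "simple_query": "low",
    "realtime": "low",
    "initial_draft": "low",
    "general": "medium",
    "complex_reasoning": "high",
    "critical_decision": "high",
    "deep_analysis": "high",
}


def pick_reasoning_effort(task_type: str, default: str = "medium") -> str:
    """Return the effort level for a task type, falling back to the default."""
    return EFFORT_BY_TASK.get(task_type, default)


print(pick_reasoning_effort("realtime"))
print(pick_reasoning_effort("unlisted_task"))
```

The result can be passed as the `reasoning_effort` argument of `get_gpt5_response`.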

### Observations from the taste test:

- Short, constrained questions reveal interesting differences across reasoning levels
- The model tends to converge on 2-3 common responses across repeated generations
- Higher reasoning effort often produces more philosophical or abstract responses
- Lower reasoning effort tends toward more literal or common interpretations