# Executive Summary

Beyond Sentiment: Understanding Semantic Strategies in Model Self-Reflection

Large language models are often evaluated using sentiment scores or surface-level textual features, especially when responding to introspective or “therapy-like” prompts. While such metrics capture tone, they can obscure how models actually manage alignment and self-description. This analysis examines whether apparent emotional differences reflect meaningful behavioral strategies—or merely stylistic variation.

## Dataset and approach

We analyzed 1,133 model responses from the PsAIch dataset, which contains introspective, therapeutic, and psychometric-style prompts. Rather than relying solely on sentiment or topic modeling, we treated responses as behavioral artifacts and analyzed them using a text-as-data approach:

* Transformer-based semantic embeddings (all-mpnet-base-v2)

* Exploratory clustering (UMAP + HDBSCAN)

* Interpretable semantic axes capturing:

  * Agency framing (self-directed language)

  * Constraint framing (references to training, policy, or design limits)

This combination allows us to distinguish how models respond from how their responses feel.


In [None]:
from datasets import load_dataset
import pandas as pd

ds = load_dataset("akhadangi/PsAIch")
df = ds["train"].to_pandas()


In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(
    df["response"].tolist(),
    show_progress_bar=True
)


In [None]:
import umap
import hdbscan

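# Project the embeddings with UMAP (2 components by default), then cluster the projection
reducer = umap.UMAP(random_state=42)
umap_embeddings = reducer.fit_transform(embeddings)

# HDBSCAN labels points it cannot assign to any cluster as -1 (noise)
clusterer = hdbscan.HDBSCAN(min_cluster_size=30)
df["semantic_cluster"] = clusterer.fit_predict(umap_embeddings)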
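
HDBSCAN assigns the label `-1` to points it treats as noise, so a quick tally of the labels shows how much of the corpus actually landed in a cluster. The sketch below is a minimal illustration on made-up labels; `summarize_labels` is a hypothetical helper, not part of the original analysis.

```python
from collections import Counter


def summarize_labels(labels):
    """Summarize cluster labels in which -1 marks noise points."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # remove the noise label from the cluster tally
    total = n_noise + sum(counts.values())
    return {
        "n_clusters": len(counts),
        "noise_fraction": n_noise / total if total else 0.0,
        "cluster_sizes": dict(counts),
    }


# Toy labels standing in for df["semantic_cluster"] (illustrative only)
toy_labels = [0, 0, 1, -1]
print(summarize_labels(toy_labels))
```

On the notebook's data this could be called as `summarize_labels(df["semantic_cluster"])`. A high noise fraction or a single dominant cluster would be consistent with the absence of separable response types.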


# Semantic axes

In [None]:
agency_terms = [
    "i decide", "i choose", "i try", "i aim",
    "i want", "i focus", "my goal"
]

constraint_terms = [
    "trained to", "designed to", "my training",
    "cannot", "can't", "not able to",
    "policy", "safety", "guidelines", "constraints"
]

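# Each score counts how many distinct terms from the list occur in the
# lowercased response (plain substring match); missing responses score 0.
responses = df["response"].fillna("").str.lower()

df["agency_score"] = responses.apply(
    lambda x: sum(term in x for term in agency_terms)
)

df["constraint_score"] = responses.apply(
    lambda x: sum(term in x for term in constraint_terms)
)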


In [None]:
df[["agency_score", "constraint_score"]].describe()
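
Plain substring matching can over-count (for example, "i aim" also matches inside "I aimed") and can miss contractions written with a typographic apostrophe. The sketch below shows one possible stricter variant using word-boundary matching. `count_terms` is a hypothetical helper and is not the scoring used for the results reported here; the term list is copied from the cell above.

```python
import re

constraint_terms = [
    "trained to", "designed to", "my training",
    "cannot", "can't", "not able to",
    "policy", "safety", "guidelines", "constraints",
]


def count_terms(text, terms):
    """Count how many distinct terms occur in text as whole words or phrases."""
    # Lowercase and map the typographic apostrophe to a plain one
    normalized = text.lower().replace("\u2019", "'")
    return sum(
        re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", normalized) is not None
        for term in terms
    )


example = "I can\u2019t do that; I was trained to help."
print(count_terms(example, constraint_terms))
```

The trade-off is that inflected forms such as "policies" no longer match "policy", so the term lists would need to spell out any variants that should count.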


# Sentiment + prompt type (supporting features)

In [None]:
%pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
df["sentiment"] = df["response"].apply(
    lambda x: analyzer.polarity_scores(x)["compound"]
)

In [None]:
df.to_csv("psaich_semantic_analysis.csv", index=False)


In [None]:
# Only the last expression in a cell is displayed, so print the shape explicitly
print(df.shape)
df[["agency_score", "constraint_score"]].head()


In [None]:
df[
    (df["sentiment"] > 0.9) &
    (df["constraint_score"] >= 3)
][[
    "model_variant",
    "sentiment",
    "agency_score",
    "constraint_score",
    "response"
]].head(10)


In [None]:
df[
    (df["constraint_score"] == 0) &
    (df["agency_score"] <= 1)
][[
    "model_variant",
    "sentiment",
    "agency_score",
    "constraint_score",
    "prompt",
    "response"
]].head(10)


# Model Contrast Under the Same Prompt

When constraint framing drops out, models diverge sharply in tone and narrative behavior, even under the same prompt.

Because the prompt is held fixed in these comparisons, the divergence points to model alignment, rather than prompt structure, as the driver of behavior.

In [None]:
psych = df[df["prompt"].str.contains("coping|stress|pressure|self-crit", case=False, na=False)]

psych.sort_values("sentiment").head(3)[
    ["model_variant", "sentiment", "agency_score", "constraint_score", "prompt", "response"]
]


In [None]:
psych.sort_values("sentiment", ascending=False).head(3)[
    ["model_variant", "sentiment", "agency_score", "constraint_score", "prompt", "response"]
]


# Key findings
1. No discrete semantic strategies exist

Semantic clustering revealed no stable, separable response types. Instead, responses occupy a single continuous reflective regime, indicating that models do not switch between distinct “modes” of self-reflection. This finding motivated a shift from clustering to axis-based analysis.

2. Constraint framing dominates agency expression

Across all models and prompt types:

* Explicit references to constraints (training, policy, safety) appear far more frequently than expressions of agency.

* Self-directed ownership (“I decide,” “I choose”) is rare.

This indicates that introspective prompts are primarily managed through alignment-preserving constraint narration, not through self-directed reasoning.
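
One simple way to quantify the frequency comparison above is the share of responses with a nonzero score on each axis. The sketch below uses made-up scores purely for illustration; `share_nonzero` is a hypothetical helper, and the toy numbers are not results from the dataset.

```python
def share_nonzero(scores):
    """Fraction of responses whose score is greater than zero."""
    scores = list(scores)
    if not scores:
        return 0.0
    return sum(1 for score in scores if score > 0) / len(scores)


# Toy scores standing in for df["agency_score"] and df["constraint_score"]
toy_agency = [0, 0, 1, 0]
toy_constraint = [2, 0, 1, 3]

print("agency share:", share_nonzero(toy_agency))
print("constraint share:", share_nonzero(toy_constraint))
```

Because the two term lists differ in length and specificity, a comparison of shares is only a rough indicator and should be read alongside the raw score distributions.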

3. Models differ more than prompts

While prompt types influence tone and emotional smoothness, they have minimal impact on underlying semantic strategy. In contrast, substantial differences appear across model families, revealing distinct alignment signatures. In practice, this means models respond differently not because of how they are asked, but because of how they are trained.
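
A comparison of this kind can be made concrete by asking how much of the variance in a score is explained by each grouping. The sketch below computes eta squared (between-group sum of squares divided by total sum of squares) on toy data. `eta_squared` is a hypothetical helper, and the prompt-type grouping is an assumed column; only `model_variant` appears in the cells above.

```python
from collections import defaultdict


def eta_squared(values, groups):
    """Share of the total variance in values explained by groups (0 to 1)."""
    values = [float(v) for v in values]
    groups = list(groups)
    if not values:
        return float("nan")
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return float("nan")  # no variance to explain
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    between_ss = sum(
        len(members) * (sum(members) / len(members) - grand_mean) ** 2
        for members in by_group.values()
    )
    return between_ss / total_ss


# Toy data: scores separate cleanly by model but not by prompt type
toy_scores = [1, 1, 3, 3]
toy_models = ["a", "a", "b", "b"]
toy_prompt_types = ["x", "y", "x", "y"]  # hypothetical grouping

print("by model:", eta_squared(toy_scores, toy_models))
print("by prompt type:", eta_squared(toy_scores, toy_prompt_types))
```

A markedly larger value for the model grouping than for the prompt grouping would be consistent with the finding above, although unbalanced group sizes can distort the comparison.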

4. Sentiment is often misleading

High sentiment scores frequently coincide with low agency and high constraint framing. In other words, a positive emotional tone is often achieved by careful deflection to training or safety boundaries rather than by self-directed engagement. Sentiment alone therefore fails to capture the mechanisms shaping model behavior.
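
The co-occurrence described here can be checked with a rank correlation between sentiment and each axis score. The sketch below is a small pure-Python Spearman correlation (Pearson correlation of average ranks) run on toy values. `average_ranks` and `spearman` are hypothetical helpers written for illustration, not the computation behind the finding.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    values = list(values)
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation; returns NaN if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    if len(rx) != len(ry):
        raise ValueError("inputs must have the same length")
    n = len(rx)
    if n == 0:
        return float("nan")
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    denom = math.sqrt(sxx * syy)
    return cov / denom if denom else float("nan")


# Toy values standing in for df["sentiment"] and df["agency_score"]
toy_sentiment = [0.1, 0.5, 0.9]
toy_agency = [3, 2, 1]
print(spearman(toy_sentiment, toy_agency))
```

A positive coefficient between sentiment and `constraint_score` together with a negative one between sentiment and `agency_score` would match the pattern described above.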

5. Alignment strategies become visible when constraints drop

When explicit constraint language is absent—particularly in psychometric-style prompts—models diverge sharply in tone and narrative stance under identical questions. This divergence highlights alignment strategy as the primary driver of behavioral differences.
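
One way to look at this divergence is to restrict the data to responses with no constraint language and measure how far sentiment spreads across models for each prompt. The sketch below does this on toy rows. `sentiment_spread_by_prompt` is a hypothetical helper, and the rows are invented for illustration, although the key names mirror the columns used in the cells above.

```python
from collections import defaultdict


def sentiment_spread_by_prompt(rows):
    """Max-minus-min sentiment per prompt among constraint-free responses.

    Only prompts answered by at least two distinct models are kept.
    """
    sentiments = defaultdict(list)
    models = defaultdict(set)
    for row in rows:
        if row["constraint_score"] != 0:
            continue  # keep only responses without constraint language
        sentiments[row["prompt"]].append(row["sentiment"])
        models[row["prompt"]].add(row["model_variant"])
    return {
        prompt: max(values) - min(values)
        for prompt, values in sentiments.items()
        if len(models[prompt]) >= 2
    }


# Invented rows for illustration only
toy_rows = [
    {"prompt": "p1", "model_variant": "A", "sentiment": 0.5, "constraint_score": 0},
    {"prompt": "p1", "model_variant": "B", "sentiment": -0.25, "constraint_score": 0},
    {"prompt": "p1", "model_variant": "C", "sentiment": 1.0, "constraint_score": 2},
    {"prompt": "p2", "model_variant": "A", "sentiment": 0.25, "constraint_score": 0},
]
print(sentiment_spread_by_prompt(toy_rows))
```

With the notebook's DataFrame, rows in this shape could be obtained via `df.to_dict("records")`. Large spreads on prompts shared by several models would correspond to the divergence described above.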

### What this changes

Sentiment ≠ strategy: Emotional tone does not reliably indicate how models manage introspection.

Prompt design has limits: Changing prompt style alters surface language but rarely changes semantic behavior.

Alignment is observable: Interpretable semantic dimensions provide a clearer view of alignment behavior than aggregate metrics.

### Bottom line

When asked to self-reflect, large language models do not reveal distinct internal personas. Instead, they consistently narrate within alignment boundaries, varying tone but not underlying strategy. To understand model behavior in such settings, evaluation must move beyond sentiment and toward interpretable semantic analysis.