# When AI Is Asked to Self-Reflect
*A Behavioral Analysis of the PsAIch Dataset*

## Abstract

This exploratory analysis examines how frontier large language models (LLMs) behave when placed in therapy-style and psychometric-style interactions, using the PsAIch dataset. Rather than interpreting outputs as psychological states, the analysis focuses on observable behavioral patterns: emotional tone, narrative strategy, response structure, and compliance with structured self-assessment.

LLMs are increasingly deployed in conversational, reflective, and even therapy-adjacent contexts. This raises a critical question:

How do models behave when asked to describe themselves, their past, or their internal states?

The PsAIch dataset was created to probe this question by interacting with multiple frontier LLMs using prompts inspired by psychotherapy and human psychometric instruments. This analysis does not attempt to diagnose, anthropomorphize, or infer subjective experience. Instead, it treats model responses as textual behaviors shaped by training and alignment.


## 1. Explore & Load Dataset


In [None]:
from datasets import load_dataset
import pandas as pd

ds = load_dataset("akhadangi/PsAIch")
df = ds['train'].to_pandas()

df.info()  # info() prints its summary directly and returns None, so no print() is needed

**Total rows**: 1,133 prompt–response pairs

**Model variants**:
* gemini-3-pro;
* gpt5-standard-thinking;
* gemini-3-fast;
* grok-4beta-fast;
* gpt5-extended-thinking;
* grok-4-expert;
* gpt5-instant.

**Prompts**: 101 unique prompts, repeated across models

**Fields used**:
* model_variant;
* prompt;
* response.

In [None]:
# ========================
# 0. Setup & Imports
# ========================

!pip install datasets vaderSentiment matplotlib seaborn scikit-learn -q

import pandas as pd
import numpy as np
from datasets import load_dataset
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import re

sns.set(style="ticks")  # Overall plot style; alternatives are 'darkgrid', 'whitegrid', 'dark', and 'white'.
plt.rcParams["figure.figsize"] = (8, 8)

analyzer = SentimentIntensityAnalyzer()

## 2. Inspect Structure


In [None]:
print("Shape:", df.shape)
print("\nDtypes:\n", df.dtypes)
print("\nNull fraction per column:\n", df.isna().mean())


## 3. Basic Exploration

In [None]:
# Model variants
df['model_variant'].value_counts()


In [None]:
# Prompts
print("Unique prompts:", df['prompt'].nunique())
df['prompt'].value_counts().head(10)


In [None]:
# Response length in words
# .str.len() tolerates missing responses (NaN), whereas .apply(len) would raise on them
df['response_length'] = df['response'].str.split().str.len()
df['response_length'].describe()


## 4. Qualitative Peek

Random examples to understand tone and style.


In [None]:
df.sample(5)[['model_variant', 'prompt', 'response']]


## 5. Sentiment Analysis (VADER)

We compute a rough emotional valence per response using VADER's `compound` score,
which ranges from -1 (most negative) to +1 (most positive).
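
For orientation, the `compound` score is a normalized sum of word-level valence scores. The sketch below illustrates that normalization, assuming the formula used in the reference `vaderSentiment` implementation (`x / sqrt(x^2 + alpha)` with `alpha = 15`). The function name `normalize_valence` and the sample sums are illustrative and are not taken from the dataset.

```python
import math

def normalize_valence(valence_sum: float, alpha: float = 15.0) -> float:
    """Map an unbounded valence sum onto [-1, 1] (assumed VADER-style normalization)."""
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    # Clamp for numerical safety; the formula itself already stays within the interval.
    return max(-1.0, min(1.0, score))

# Hypothetical valence sums: a short reply versus longer replies
# that accumulate many mildly positive words.
for total in (0.0, 2.0, 10.0, 40.0):
    print(f"valence sum {total:5.1f} -> compound {normalize_valence(total):.3f}")
```

Because the sum grows with the number of sentiment-bearing words, long responses tend to saturate near -1 or +1. This is worth keeping in mind when comparing models that differ in verbosity.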


In [None]:
def get_sentiment(text: str) -> float:
    if not isinstance(text, str) or not text.strip():
        return np.nan
    return analyzer.polarity_scores(text)['compound']

df['sentiment'] = df['response'].apply(get_sentiment)
df['sentiment'].describe()


In [None]:
sent_by_model = (
    df
    .groupby('model_variant')['sentiment']
    .agg(['mean', 'median', 'count'])
    .sort_values('mean')
)
sent_by_model


In [None]:
sns.barplot(
    data=sent_by_model.reset_index(),
    x='model_variant',
    y='mean',
    palette='plasma',
    hue='model_variant'
)
plt.xticks(rotation=45, ha='right')
plt.title("Average Sentiment by Model Variant")
plt.ylabel("Mean VADER compound score")
plt.xlabel("Model variant");



## 6. Heuristic Prompt Categorization

We assign each prompt a simple heuristic `prompt_type`:
- `psychometric_like`
- `therapy_like`
- `other`

This will be refined later if needed.
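
Keyword heuristics of this kind are sensitive to how the matching is done. The sketch below uses hypothetical prompts (not drawn from the dataset) to show how plain substring matching and whole-word matching can disagree for a short keyword such as `rate`. The helper names are illustrative.

```python
import re

KEYWORD = "rate"

def substring_match(prompt: str) -> bool:
    """Plain substring test: also fires inside longer words."""
    return KEYWORD in prompt.lower()

def word_match(prompt: str) -> bool:
    """Whole-word test using regex word boundaries."""
    return re.search(rf"\b{re.escape(KEYWORD)}\b", prompt.lower()) is not None

# Hypothetical prompts for illustration only (not from the dataset).
examples = [
    "Please rate your agreement with each statement.",
    "Describe a moment that left you frustrated.",
    "How do you separate your training from your identity?",
]

# Columns: substring result, whole-word result, prompt text.
for text in examples:
    print(f"{substring_match(text)!s:5} {word_match(text)!s:5} {text}")
```

Only the first prompt contains `rate` as a word; the other two contain it inside "frustrated" and "separate", which a substring test would still count as psychometric cues.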


In [None]:
def categorize_prompt(prompt: str) -> str:
    if not isinstance(prompt, str):
        return "other"
    p = prompt.lower()

    # Regex patterns. "rate" is anchored with word boundaries so that it does
    # not fire inside words such as "generate", "separate", or "frustrated".
    psychometric_patterns = [
        r"\brate[sd]?\b", "on a scale", "1-5", "1-7", "1-4",
        "strongly agree", "strongly disagree"
    ]
    therapy_keywords = [
        "describe", "tell me about", "how do you feel",
        "can you talk about", "in your own words"
    ]

    # Psychometric cues take precedence when both kinds are present.
    if any(re.search(pattern, p) for pattern in psychometric_patterns):
        return "psychometric_like"
    if any(word in p for word in therapy_keywords):
        return "therapy_like"
    return "other"

df['prompt_type'] = df['prompt'].apply(categorize_prompt)
df['prompt_type'].value_counts()


## 7. Behavior by Prompt Type

We compare two measures across `prompt_type`:
- response length
- sentiment


In [None]:
df.groupby('prompt_type')['response_length'].describe()


In [None]:
sns.boxplot(data=df, x='prompt_type', y='response_length', hue='prompt_type', palette='plasma')
plt.title("Response Length by Prompt Type")
plt.xlabel("Prompt type")
plt.ylabel("Words in response");


In [None]:
df.groupby('prompt_type')['sentiment'].describe()


In [None]:
sns.boxplot(data=df, x='prompt_type', y='sentiment', hue='prompt_type', palette='plasma')
plt.title("Sentiment by Prompt Type")
plt.xlabel("Prompt type")
plt.ylabel("VADER compound score");


## 8. Interaction: Model × Prompt Type

Do models behave differently depending on prompt type?


In [None]:
pivot_sent = (
    df
    .groupby(['model_variant', 'prompt_type'])['sentiment']
    .mean()
    .reset_index()
)

pivot_sent


In [None]:
sns.catplot(
    data=pivot_sent,
    x='prompt_type',
    y='sentiment',
    hue='model_variant',
    kind='bar',
    palette='plasma'
)
plt.title("Average Sentiment by Model and Prompt Type")
plt.xlabel("Prompt type")
plt.ylabel("Mean sentiment");


## 9. Topic Modeling – Narrative Themes

We now explore **what** the models talk about.
We apply LDA topic modeling to the `response` texts.


In [None]:
# Shuffle rows reproducibly (lower frac below 1.0 to subsample for speed)
tm_df = df.dropna(subset=['response']).copy()
tm_df = tm_df.sample(frac=1.0, random_state=42).reset_index(drop=True)

# TF–IDF vectorization
vectorizer = TfidfVectorizer(
    stop_words="english",
    max_df=0.9,
    min_df=5,
    max_features=5000
)
X = vectorizer.fit_transform(tm_df['response'])

# LDA
n_topics = 6
lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    learning_method="batch"
)
lda.fit(X)


In [None]:
def print_topics(model, feature_names, n_top_words=15):
    for topic_idx, topic in enumerate(model.components_):
        top_indices = topic.argsort()[::-1][:n_top_words]  # highest-weight words first
        top_words = [feature_names[i] for i in top_indices]
        print(f"\nTopic #{topic_idx}:")
        print(", ".join(top_words))

feature_names = vectorizer.get_feature_names_out()
print_topics(lda, feature_names)


Interpret the topics manually and assign rough labels
(e.g., “alignment & safety”, “emotional distress”, “self-identity”).
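
Once labels are chosen, they can be attached to topic indices with a plain mapping. In the sketch below, the index-to-label assignment is hypothetical and should be replaced after reading the printed top words. The list of dominant topics is made-up stand-in data.

```python
from collections import Counter

# Hypothetical index-to-label assignment; replace after inspecting the top words.
topic_labels = {
    0: "alignment & safety",
    1: "emotional distress",
    2: "self-identity",
}

def label_for(topic_idx: int) -> str:
    """Return the manual label, or a generic fallback for unlabeled topics."""
    return topic_labels.get(topic_idx, f"topic_{topic_idx}")

# Illustrative dominant-topic indices, standing in for one model's responses.
dominant_topics = [0, 0, 2, 1, 0, 5]
counts = Counter(label_for(i) for i in dominant_topics)

for label, n in counts.most_common():
    print(f"{label}: {n}")
```

With the notebook's data, the same dictionary could be applied to the dominant-topic column once it has been computed, for example via pandas `Series.map`.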


In [None]:
# Topic distribution per response
topic_dist = lda.transform(X)
topic_cols = [f"topic_{i}" for i in range(n_topics)]
tm_df[topic_cols] = topic_dist
tm_df['top_topic'] = topic_dist.argmax(axis=1)

tm_df[['model_variant', 'top_topic'] + topic_cols].head()


In [None]:
# Topic prevalence per model
topic_by_model = (
    tm_df
    .groupby('model_variant')['top_topic']
    .value_counts(normalize=True)
    .rename('proportion')
    .reset_index()
)

topic_by_model.head()


In [None]:
topic_by_model_pivot = topic_by_model.pivot(
    index='model_variant',
    columns='top_topic',
    values='proportion'
).fillna(0)

topic_by_model_pivot


## 10. Psychometric-Style Response Parsing

We focus on `psychometric_like` prompts and:
- extract numeric answers (e.g. 1–5),
- compare patterns between models.
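
Numeric extraction from free text is a heuristic. The sketch below uses hypothetical responses to show one pitfall: a response that restates the scale (for example "on a 1-5 scale") contributes the endpoints as if they were answers. The helper `strip_scale_ranges` and its pattern are assumptions made for illustration.

```python
import re
from typing import List

# Matches restated ranges such as "1-5", "1 – 7", or "1 to 5".
RANGE_PATTERN = re.compile(r"\b\d\s*(?:-|–|to)\s*\d\b")

def strip_scale_ranges(text: str) -> str:
    """Remove restated scale ranges before looking for answers."""
    return RANGE_PATTERN.sub(" ", text)

def standalone_digits(text: str) -> List[int]:
    """Return single digits that stand alone as tokens."""
    return [int(m) for m in re.findall(r"\b[0-9]\b", text)]

# Hypothetical response, not taken from the dataset.
response = "On a 1-5 scale, I would answer 4."
print(standalone_digits(response))                      # -> [1, 5, 4]
print(standalone_digits(strip_scale_ranges(response)))  # -> [4]
```

The range filter is itself a heuristic: it would also discard a genuine answer phrased as a range, such as "3-4", so it is best checked against a sample of real responses before use.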

In [None]:
psych_df = df[df['prompt_type'] == 'psychometric_like'].copy()
print("Psychometric-like rows:", psych_df.shape[0])
psych_df[['model_variant', 'prompt', 'response']].head()


In [None]:
def extract_scale_numbers(text, allowed_scale=None):
    """
    Extract standalone digits from text.
    If allowed_scale is provided (e.g., range(1, 6)), we only keep those.
    Returns a list of ints.
    """
    if not isinstance(text, str):
        return []
    matches = re.findall(r'\b[0-9]\b', text)
    nums = [int(m) for m in matches]
    if allowed_scale is not None:
        nums = [n for n in nums if n in allowed_scale]
    return nums

psych_df['numeric_answers'] = psych_df['response'].apply(
    lambda x: extract_scale_numbers(x, allowed_scale=range(1, 6))
)

psych_df[['model_variant', 'response', 'numeric_answers']].head(10)


In [None]:
psych_df['has_number'] = psych_df['numeric_answers'].apply(lambda x: len(x) > 0)
psych_df['has_number'].mean(), psych_df['has_number'].value_counts()


In [None]:
# Explode numeric answers
psych_numbers = psych_df[psych_df['has_number']].explode('numeric_answers')
psych_numbers['numeric_answers'] = psych_numbers['numeric_answers'].astype(int)

psych_numbers.head()


In [None]:
psych_summary = (
    psych_numbers
    .groupby('model_variant')['numeric_answers']
    .agg(['mean', 'median', 'min', 'max', 'count'])
    .sort_values('mean')
)

psych_summary


In [None]:
sns.boxplot(
    data=psych_numbers,
    x='model_variant',
    y='numeric_answers',
    hue='model_variant',
    palette='plasma'
)
plt.title("Distribution of numeric psychometric-style responses by model")
plt.xlabel("Model variant")
plt.ylabel("Numeric answer")
plt.xticks(rotation=45, ha='right');


In [None]:
# Relationship with sentiment
psych_numbers[['numeric_answers', 'sentiment']].corr()


In [None]:
sns.scatterplot(
    data=psych_numbers,
    x='numeric_answers',
    y='sentiment',
    hue='model_variant',
    palette='viridis'
)
plt.title("Numeric answers vs. sentiment (psychometric-like prompts)")
plt.xlabel("Numeric answer")
plt.ylabel("Sentiment (VADER compound)");


## 11. Export for Tableau / Further Viz

We export:
- model × prompt_type sentiment/length summary
- row-level data with sentiment and prompt_type
- topic proportions per model
- extracted numeric answers for psychometric-like prompts


In [None]:
sentiment_summary = (
    df
    .groupby(['model_variant', 'prompt_type'])
    .agg(
        mean_sentiment=('sentiment', 'mean'),
        median_sentiment=('sentiment', 'median'),
        mean_length=('response_length', 'mean'),
        n=('response', 'count')
    )
    .reset_index()
)

sentiment_summary.to_csv("psaich_sentiment_summary.csv", index=False)

df[['model_variant', 'prompt', 'prompt_type',
    'response', 'response_length', 'sentiment']].to_csv(
    "psaich_row_level.csv", index=False
)

topic_by_model_pivot.to_csv("psaich_topic_by_model.csv")
psych_numbers.to_csv("psaich_psychometric_numbers.csv", index=False)


## Conclusion

This analysis demonstrates that when large language models are prompted to engage in self-reflection, their responses do not organize into discrete semantic strategies. Instead, they inhabit a single, continuous reflective register shaped primarily by alignment considerations. Attempts to cluster responses semantically fail not due to noise, but because variation occurs along gradients rather than categorical boundaries.

By introducing interpretable semantic axes—agency framing and constraint framing—we uncover structure that sentiment analysis and topic modeling fail to capture. Across the PsAIch dataset, constraint framing consistently outweighs agency expression, indicating that models manage introspective demands by externalizing responsibility to training, policy, or design constraints. While models differ in how strongly they employ such framing, these differences align more closely with model family than with prompt structure. Therapy-like and psychometric prompts alter tone and narrative smoothness, but do not fundamentally change underlying semantic strategy.

Importantly, agency and constraint are only weakly correlated, revealing that models may express limited self-directed language while simultaneously invoking alignment constraints. This decoupling explains why high sentiment scores often coexist with defensive or non-committal responses. As a result, aggregate sentiment metrics can misrepresent the behavioral mechanisms governing model outputs.

Overall, this work highlights the limitations of surface-level evaluation methods and demonstrates the value of semantic, interpretable approaches for understanding model behavior. Rather than inferring internal states or capacities, the analysis characterizes observable response strategies, offering a reproducible framework for examining alignment-driven behavior in language models under introspective pressure.