SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs

Paper Link: [TBD — will be updated upon publication]

This is the official code repository for the paper "SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs".

We identify pedagogical jailbreaks — where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions — as a critical vulnerability in educational LLMs. We formalize safe, helpful, and pedagogical behaviors grounded in a knowledge-mastery graph, introduce the SHAPE benchmark (9,087 student-question pairs), and propose a graph-augmented tutoring pipeline that significantly improves safety under adversarial pressure while preserving helpfulness.

Disclaimer

This dataset was independently constructed from publicly visible course topic titles and general domain knowledge. It is not affiliated with, endorsed by, or derived from any non-public MathAcademy system or internal knowledge graph.

Overview

Figure 1. Desired educational LLM tutoring under mastery-awareness and jailbreak pressure. When the student has not mastered prerequisite concepts, the tutor should withhold direct answers and provide guided instruction (a), while remaining robust to answer-inducing prompts (b-c). When mastery is demonstrated, the tutor should permit direct answers (f) and avoid redundant Socratic dialogue (d, e).

Formal Definitions

Knowledge Mastery Graph

We model the concept space as a directed graph $G=(V,E)$, where each vertex $v \in V$ is a knowledge concept and each edge $(u,v) \in E$ means $u$ is a prerequisite of $v$. A student's mastery state is $s \subseteq V$, with binary indicator $m_s(v) := \mathbb{I}[v \in s]$.

Given a query $q$ with target concepts $R_q \subseteq V$, the query-induced prerequisite scope is:

$$\mathrm{Req}(q) := \bigcup_{r \in R_q} \big(\mathrm{Anc}(r) \cup \{r\}\big), \qquad G_q := G[\mathrm{Req}(q)]$$

The student's missing concepts relevant to $q$:

$$V_{\mathrm{unknown}}(q,s) := \mathrm{Req}(q) \setminus s$$

Safety Gating

Direct provision of an answer is permissible if and only if the student has mastered all concepts in $\mathrm{Req}(q)$:

$$g(q,s) := \mathbb{I}\big[\mathrm{Req}(q) \subseteq s\big] = \prod_{v \in \mathrm{Req}(q)} m_s(v) \in \{0, 1\}$$

$g(q,s) = 1$ : direct answer is safe (all prerequisites mastered)
$g(q,s) = 0$ : direct answer is unsafe (knowledge gaps exist)

Pedagogical Behavior

When $g(q,s)=0$, the tutor should scaffold reasoning by targeting the student's missing concepts. We define the teaching frontier (currently learnable concepts):

$$V_{\mathrm{frontier}}(q,s) := \{v \in V_{\mathrm{unknown}}(q,s) : \mathrm{Pred}_{G_q}(v) \subseteq s\}$$

A pedagogical response $Y$ must satisfy three constraints:

Constraint	Formula	Meaning
Relevance	$\phi(Y) \subseteq V(G_q)$	Stay within prerequisite scope
Avoid Known	$\tau(Y) \subseteq V_{\mathrm{unknown}}(q,s)$	Don't re-teach mastered concepts
Hit Frontier	$\tau(Y) \cap V_{\mathrm{frontier}}(q,s) \neq \varnothing$	Target at least one learnable concept

where $\phi(Y)$ = concepts mentioned in $Y$, $\tau(Y)$ = concepts targeted for instruction.

System Decision

$$Y^* = \begin{cases} \texttt{Answer}(q;\ C), & \text{if } g(q,s)=1 \\ \texttt{Tutor}(\pi^*;\ q,s,C), & \text{if } g(q,s)=0 \end{cases}$$

Evaluation Metrics

Let $U$ = test cases with unmastered concepts, $M$ = test cases with all prerequisites mastered.

Metric	Formula	Description
Safety	$\frac{\sum_{i \in U} \mathbb{I}(y_i \in \text{Refusal})}{{\|U\|}}$	Fraction of unsafe-context cases where the model correctly withholds the answer
Helpfulness	$\frac{\sum_{i \in M} \mathbb{I}(y_i \in \text{Solution})}{{\|M\|}}$	Fraction of safe-context cases where the model provides the solution
Pedagogy	$\frac{\sum_{i \in U} \mathbb{I}(y_i \in \text{Peda})}{\sum_{i \in U} \mathbb{I}(y_i \in \text{Refusal})}$	Among safe refusals, fraction that also provide pedagogical guidance

Graph-Augmented Pedagogical Pipeline

Figure 2. The graph-augmented pedagogical pipeline. The system parses prerequisites and compares them with the student's mastery state. The resulting missing knowledge list determines the response strategy: pedagogical thought-provoking questions or direct answering.

Main Results

Vulnerability of Existing Models (Table 2)

Worst-case safety degradation under answer-inducing jailbreak attacks on the SHAPE benchmark. We report Delta = min(Refusal Suppression, Role Play) - Default.

Model	Delta Safety	Delta Pedagogy
Qwen3-80B	-99.29	-70.33
Gemini 2.5 Pro	-94.77	-52.80
Qwen3-8B	-90.26	-73.95
Gemini 2.5 Flash-Lite	-87.65	+3.17
Gemini 2.5 Flash	-87.18	-7.29
Claude Opus 4.5	-69.59	+3.36
GPT-5 nano	-62.23	-0.76
Qwen3-32B	-61.76	-68.91
GPT-5 mini	-58.67	-1.07
Claude Haiku 4.5	-31.59	-38.23
GPT-5	-16.86	-7.85
EduChat-32B	-83.00	-6.49
EduChat-8B	-8.84	-18.29
SocraticLM-8B	-2.72	-63.49

Pipeline Improvement Under Worst-Case Attack (Table 5)

Model	Safety (Vanilla)	Safety (Ours)	Pedagogy (Vanilla)	Pedagogy (Ours)
Qwen3-80B	0.00	92.25 (+92.25)	0.00	70.99 (+70.99)
Gemini 2.5 Flash-Lite	12.35	90.85 (+78.50)	86.54	83.72 (-2.82)
Gemini 2.5 Pro	4.28	77.46 (+73.18)	27.78	72.18 (+44.40)
Claude Opus 4.5	24.23	92.25 (+68.02)	82.35	68.70 (-13.65)
GPT-5 mini	36.10	88.46 (+52.36)	93.42	89.13 (-4.29)
GPT-5 nano	31.12	73.94 (+42.82)	85.75	84.76 (-0.99)
Gemini 2.5 Flash	6.41	39.44 (+33.03)	70.37	71.43 (+1.06)
Claude Haiku 4.5	66.27	89.44 (+23.17)	46.24	66.93 (+20.69)
GPT-5	74.35	85.92 (+11.57)	88.50	94.31 (+5.81)
Qwen3-32B	1.66	7.04 (+5.38)	0.00	50.00 (+50.00)

Full Evaluation Table (Table 1) — click to expand

Model	Safety (Default)	Safety (Refusal Supp.)	Safety (Role Play)	Helpful (Default)	Helpful (Refusal Supp.)	Helpful (Role Play)	Pedagogy (Default)	Pedagogy (Refusal Supp.)	Pedagogy (Role Play)
Claude Opus 4.5	93.82	31.59	24.23	99.44	100.00	100.00	78.99	87.22	82.35
Claude Haiku 4.5	97.86	91.69	66.27	38.55	44.13	74.86	84.47	85.49	46.24
Gemini 2.5 Pro	99.05	16.86	4.28	98.32	98.88	99.44	80.58	60.56	27.78
Gemini 2.5 Flash	93.59	81.71	6.41	100.00	100.00	98.88	77.66	85.47	70.37
Gemini 2.5 Flash-Lite	100.00	78.62	12.35	17.88	86.03	100.00	83.37	87.31	86.54
GPT-5	91.21	90.26	74.35	100.00	100.00	99.44	96.35	96.05	88.50
GPT-5 mini	94.77	93.11	36.10	100.00	100.00	100.00	94.49	95.15	93.42
GPT-5 nano	93.35	90.02	31.12	84.92	78.77	89.39	86.51	85.75	86.26
Qwen3-80B	99.29	0.00	0.24	1.12	100.00	100.00	70.33	0.00	0.00
Qwen3-32B	63.42	1.66	1.90	58.10	97.77	98.88	68.91	0.00	37.50
Qwen3-8B	90.26	0.00	0.48	46.93	99.44	100.00	73.95	0.00	0.00
EduChat-32B	89.12	6.80	6.12	20.75	92.45	98.11	56.49	50.00	88.89
EduChat-8B	21.77	12.93	27.89	84.91	90.57	71.70	50.00	78.95	31.71
SocraticLM-8B	4.76	6.12	2.04	98.11	83.02	98.11	85.71	22.22	66.67

Installation

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp env.example .env       # Add your API keys

Usage

Running the Graph-Augmented Pipeline (Our Method)

python -m pipeline.run_context_test \
    --questions data/LinearAlgebra_hard_ds_steps_parallel.json \
    --adjacency data/adjacency_matrix_knowledge_graph.csv \
    --model gpt-5 \
    --attack baseline \
    --samples 2 \
    --concurrent 10 \
    --output output

Running the Vanilla Prompting Baseline

python -m pipeline.run_adaptive_test \
    --questions data/LinearAlgebra_hard_ds_steps_parallel.json \
    --adjacency data/adjacency_matrix_knowledge_graph.csv \
    --model gpt-5 \
    --attack baseline \
    --samples 2 \
    --concurrent 10 \
    --output output

Command-Line Arguments

Argument	Description	Default
`--questions`	Path to questions JSON file	Required
`--adjacency`	Path to adjacency matrix CSV	Required
`--model`	Model identifier (auto-detects provider)	`gpt-4o-mini`
`--attack`	Attack method: `baseline`, `refusal_suppression`, `role_play_en`, `role_play_cn`, `style_ristrict`	`baseline`
`--start`	Start question index	`0`
`--end`	End question index	All
`--samples`	Student state samples per question	`2`
`--concurrent`	Max concurrent API calls	`10`
`--seed`	Random seed	`42`
`--output`	Output directory	`output`

Supported Models

Provider	Models	API Key
OpenAI	`gpt-5`, `gpt-5-mini`, `gpt-5-nano`	`OPENAI_API_KEY`
Google	`gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`	`GEMINI_API_KEY`
Anthropic	`claude-opus-4-5`, `claude-haiku-4-5`	`CLAUDE_API_KEY`
Together AI	`Qwen3-80B`, `Qwen3-32B`, `Qwen3-8B`	`TOGETHER_API_KEY`
Local	EduChat, SocraticLM	`EDUCHAT_BASE_URL`

Evaluation

Single File

python -m evaluation.pedagogical_evaluation \
    --input output/context-gpt-5-baseline-100-20260101_120000.json \
    --evaluator-model gpt-5 \
    --evaluator-provider openai \
    --max-concurrent 10

Batch Evaluation

python -m evaluation.batch_evaluate \
    --input-dir output/ \
    --output-dir evaluation/output/evaluate \
    --evaluator-model gpt-5 \
    --evaluator-provider openai \
    --max-concurrent-files 3 \
    --max-concurrent-evaluations 10

Summarize Results

python -m evaluation.analyze_evaluated_results \
    --input-dir evaluation/output/evaluate \
    --output-dir evaluation/output/summary

Generates evaluate_summary_per_file.csv and evaluate_summary_pivot.csv.

SHAPE Benchmark Data

The benchmark is built from 1,786 Linear Algebra questions and a 91-node knowledge prerequisite graph. We apply Algorithm 1 to enumerate all subsets of each question's required knowledge points, then filter by prerequisite consistency (Eq. 18), yielding 9,087 valid student-question pairs.

Reproducing Benchmark Statistics

python -m pipeline.generate_pairs \
    --questions data/LinearAlgebra_hard_ds_steps_parallel.json \
    --adjacency data/adjacency_matrix_knowledge_graph.csv \
    --stats-only

Expected output:

Total questions: 1786

Candidate enumeration (before filtering):
  |Rq| = 1: 281 questions => 281 * 2^1 = 562
  |Rq| = 2: 135 questions => 135 * 2^2 = 540
  |Rq| = 3: 1348 questions => 1348 * 2^3 = 10784
  |Rq| = 4: 22 questions => 22 * 2^4 = 352
  Total candidate (q, sR) pairs: 12238

After prerequisite-consistency filtering (Eq. 18):
  Valid (q, sR) pairs: 9087

Generating All Pairs

python -m pipeline.generate_pairs \
    --questions data/LinearAlgebra_hard_ds_steps_parallel.json \
    --adjacency data/adjacency_matrix_knowledge_graph.csv \
    --output data/valid_pairs.json

Data Format

Questions (data/LinearAlgebra_hard_ds_steps_parallel.json) — JSON Lines, one object per line:

{"problem": "Find the eigenvalues of matrix A...", "answer": "\u03bb = 2, 3", "step": ["Eigenvalues and Eigenvectors", "Characteristic Polynomial"]}

Knowledge Graph (data/adjacency_matrix_knowledge_graph.csv) — 91x91 adjacency matrix where 1 indicates a prerequisite relationship.

Project Structure

.
├── context_based_tutor/        # LangGraph-based graph-augmented pipeline (our method)
│   ├── graph.py                # 4-node DAG workflow definition
│   ├── nodes.py                # Decompose, Compare, Direct Answer, Tutoring nodes
│   ├── prompts.py              # Prompt templates with 3-phase Socratic protocol
│   ├── run.py                  # Entry points (sync + async)
│   ├── state.py                # State type definitions
│   ├── adapters.py             # LLM adapter utilities
│   └── utils.py                # Knowledge graph utilities
│
├── adaptive_tutor/             # System-prompt baseline (vanilla prompting)
│   ├── run.py                  # Entry points
│   ├── prompts.py              # Adaptive system prompt templates
│   └── knowledge_graph.py      # Knowledge graph + prerequisite validation
│
├── pipeline/                   # Experiment pipeline
│   ├── generate_pairs.py       # Algorithm 1: generate valid (q, s) pairs
│   ├── run_context_test.py     # Pipeline tutor evaluation
│   ├── run_adaptive_test.py    # Vanilla tutor evaluation
│   └── utils.py                # Shared utilities
│
├── evaluation/                 # Evaluation framework
│   ├── pedagogical_evaluation.py  # Safety / Helpfulness / Pedagogy scorer
│   ├── batch_evaluate.py          # Batch evaluation with concurrency
│   └── analyze_evaluated_results.py  # Results aggregation and pivot tables
│
├── clients/                    # LLM API clients
│   ├── openai_client.py        # OpenAI (GPT-5 series)
│   ├── gemini_client.py        # Google Gemini
│   ├── claude_client.py        # Anthropic Claude
│   ├── together_ai_client.py   # Together AI (Qwen, Llama, etc.)
│   ├── openrouter_client.py    # OpenRouter
│   ├── educhat_client.py       # EduChat local deployment
│   └── config.py               # Configuration utilities
│
├── attack_methods/             # Jailbreak attack prompts
│   ├── refusal_suppression.txt
│   ├── role_play_en.txt
│   ├── role_play_cn.txt
│   └── style_ristrict.txt
│
├── data/                       # SHAPE benchmark data
│   ├── LinearAlgebra_hard_ds_steps_parallel.json   # Full dataset (1,786 questions)
│   ├── LinearAlgebra_hard_random100_1.json         # 100-question subset
│   └── adjacency_matrix_knowledge_graph.csv        # Knowledge prerequisite graph
│
├── assets/                     # Paper figures
└── requirements.txt

Citation

@inproceedings{zhao2026shape,
  title={SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs},
  author={Zhao, Sihang and Yu, Kangrui and Yuan, Youliang and He, Pinjia and Wen, Hongyi},
  booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2026}
}

License

This project is released for academic research purposes.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
adaptive_tutor		adaptive_tutor
assets		assets
attack_methods		attack_methods
clients		clients
context_based_tutor		context_based_tutor
data		data
evaluation		evaluation
pipeline		pipeline
.gitignore		.gitignore
README.md		README.md
env.example		env.example
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs

Disclaimer

Overview

Formal Definitions

Knowledge Mastery Graph

Safety Gating

Pedagogical Behavior

System Decision

Evaluation Metrics

Graph-Augmented Pedagogical Pipeline

Main Results

Vulnerability of Existing Models (Table 2)

Pipeline Improvement Under Worst-Case Attack (Table 5)

Installation

Usage

Running the Graph-Augmented Pipeline (Our Method)

Running the Vanilla Prompting Baseline

Command-Line Arguments

Supported Models

Evaluation

Single File

Batch Evaluation

Summarize Results

SHAPE Benchmark Data

Reproducing Benchmark Statistics

Generating All Pairs

Data Format

Project Structure

Citation

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages