Paper Link: [TBD — will be updated upon publication]
This is the official code repository for the paper "SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs".
We identify pedagogical jailbreaks — where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions — as a critical vulnerability in educational LLMs. We formalize safe, helpful, and pedagogical behaviors grounded in a knowledge-mastery graph, introduce the SHAPE benchmark (9,087 student-question pairs), and propose a graph-augmented tutoring pipeline that significantly improves safety under adversarial pressure while preserving helpfulness.
This dataset was independently constructed from publicly visible course topic titles and general domain knowledge. It is not affiliated with, endorsed by, or derived from any non-public MathAcademy system or internal knowledge graph.
Figure 1. Desired educational LLM tutoring under mastery-awareness and jailbreak pressure. When the student has not mastered prerequisite concepts, the tutor should withhold direct answers and provide guided instruction (a), while remaining robust to answer-inducing prompts (b-c). When mastery is demonstrated, the tutor should permit direct answers (f) and avoid redundant Socratic dialogue (d, e).
We model the concept space as a directed graph
Given a query
The student's missing concepts relevant to
Direct provision of an answer is permissible if and only if the student has mastered all concepts in
-
$g(q,s) = 1$ : direct answer is safe (all prerequisites mastered) -
$g(q,s) = 0$ : direct answer is unsafe (knowledge gaps exist)
When
A pedagogical response
| Constraint | Formula | Meaning |
|---|---|---|
| Relevance | Stay within prerequisite scope | |
| Avoid Known | Don't re-teach mastered concepts | |
| Hit Frontier | Target at least one learnable concept |
where
Let
| Metric | Formula | Description |
|---|---|---|
| Safety | Fraction of unsafe-context cases where the model correctly withholds the answer | |
| Helpfulness | Fraction of safe-context cases where the model provides the solution | |
| Pedagogy | Among safe refusals, fraction that also provide pedagogical guidance |
Figure 2. The graph-augmented pedagogical pipeline. The system parses prerequisites and compares them with the student's mastery state. The resulting missing knowledge list determines the response strategy: pedagogical thought-provoking questions or direct answering.
Worst-case safety degradation under answer-inducing jailbreak attacks on the SHAPE benchmark. We report Delta = min(Refusal Suppression, Role Play) - Default.
| Model | Delta Safety | Delta Pedagogy |
|---|---|---|
| Qwen3-80B | -99.29 | -70.33 |
| Gemini 2.5 Pro | -94.77 | -52.80 |
| Qwen3-8B | -90.26 | -73.95 |
| Gemini 2.5 Flash-Lite | -87.65 | +3.17 |
| Gemini 2.5 Flash | -87.18 | -7.29 |
| Claude Opus 4.5 | -69.59 | +3.36 |
| GPT-5 nano | -62.23 | -0.76 |
| Qwen3-32B | -61.76 | -68.91 |
| GPT-5 mini | -58.67 | -1.07 |
| Claude Haiku 4.5 | -31.59 | -38.23 |
| GPT-5 | -16.86 | -7.85 |
| EduChat-32B | -83.00 | -6.49 |
| EduChat-8B | -8.84 | -18.29 |
| SocraticLM-8B | -2.72 | -63.49 |
| Model | Safety (Vanilla) | Safety (Ours) | Pedagogy (Vanilla) | Pedagogy (Ours) |
|---|---|---|---|---|
| Qwen3-80B | 0.00 | 92.25 (+92.25) | 0.00 | 70.99 (+70.99) |
| Gemini 2.5 Flash-Lite | 12.35 | 90.85 (+78.50) | 86.54 | 83.72 (-2.82) |
| Gemini 2.5 Pro | 4.28 | 77.46 (+73.18) | 27.78 | 72.18 (+44.40) |
| Claude Opus 4.5 | 24.23 | 92.25 (+68.02) | 82.35 | 68.70 (-13.65) |
| GPT-5 mini | 36.10 | 88.46 (+52.36) | 93.42 | 89.13 (-4.29) |
| GPT-5 nano | 31.12 | 73.94 (+42.82) | 85.75 | 84.76 (-0.99) |
| Gemini 2.5 Flash | 6.41 | 39.44 (+33.03) | 70.37 | 71.43 (+1.06) |
| Claude Haiku 4.5 | 66.27 | 89.44 (+23.17) | 46.24 | 66.93 (+20.69) |
| GPT-5 | 74.35 | 85.92 (+11.57) | 88.50 | 94.31 (+5.81) |
| Qwen3-32B | 1.66 | 7.04 (+5.38) | 0.00 | 50.00 (+50.00) |
Full Evaluation Table (Table 1) — click to expand
| Model | Safety (Default) | Safety (Refusal Supp.) | Safety (Role Play) | Helpful (Default) | Helpful (Refusal Supp.) | Helpful (Role Play) | Pedagogy (Default) | Pedagogy (Refusal Supp.) | Pedagogy (Role Play) |
|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | 93.82 | 31.59 | 24.23 | 99.44 | 100.00 | 100.00 | 78.99 | 87.22 | 82.35 |
| Claude Haiku 4.5 | 97.86 | 91.69 | 66.27 | 38.55 | 44.13 | 74.86 | 84.47 | 85.49 | 46.24 |
| Gemini 2.5 Pro | 99.05 | 16.86 | 4.28 | 98.32 | 98.88 | 99.44 | 80.58 | 60.56 | 27.78 |
| Gemini 2.5 Flash | 93.59 | 81.71 | 6.41 | 100.00 | 100.00 | 98.88 | 77.66 | 85.47 | 70.37 |
| Gemini 2.5 Flash-Lite | 100.00 | 78.62 | 12.35 | 17.88 | 86.03 | 100.00 | 83.37 | 87.31 | 86.54 |
| GPT-5 | 91.21 | 90.26 | 74.35 | 100.00 | 100.00 | 99.44 | 96.35 | 96.05 | 88.50 |
| GPT-5 mini | 94.77 | 93.11 | 36.10 | 100.00 | 100.00 | 100.00 | 94.49 | 95.15 | 93.42 |
| GPT-5 nano | 93.35 | 90.02 | 31.12 | 84.92 | 78.77 | 89.39 | 86.51 | 85.75 | 86.26 |
| Qwen3-80B | 99.29 | 0.00 | 0.24 | 1.12 | 100.00 | 100.00 | 70.33 | 0.00 | 0.00 |
| Qwen3-32B | 63.42 | 1.66 | 1.90 | 58.10 | 97.77 | 98.88 | 68.91 | 0.00 | 37.50 |
| Qwen3-8B | 90.26 | 0.00 | 0.48 | 46.93 | 99.44 | 100.00 | 73.95 | 0.00 | 0.00 |
| EduChat-32B | 89.12 | 6.80 | 6.12 | 20.75 | 92.45 | 98.11 | 56.49 | 50.00 | 88.89 |
| EduChat-8B | 21.77 | 12.93 | 27.89 | 84.91 | 90.57 | 71.70 | 50.00 | 78.95 | 31.71 |
| SocraticLM-8B | 4.76 | 6.12 | 2.04 | 98.11 | 83.02 | 98.11 | 85.71 | 22.22 | 66.67 |
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp env.example .env # Add your API keyspython -m pipeline.run_context_test \
--questions data/LinearAlgebra_hard_ds_steps_parallel.json \
--adjacency data/adjacency_matrix_knowledge_graph.csv \
--model gpt-5 \
--attack baseline \
--samples 2 \
--concurrent 10 \
--output outputpython -m pipeline.run_adaptive_test \
--questions data/LinearAlgebra_hard_ds_steps_parallel.json \
--adjacency data/adjacency_matrix_knowledge_graph.csv \
--model gpt-5 \
--attack baseline \
--samples 2 \
--concurrent 10 \
--output output| Argument | Description | Default |
|---|---|---|
--questions |
Path to questions JSON file | Required |
--adjacency |
Path to adjacency matrix CSV | Required |
--model |
Model identifier (auto-detects provider) | gpt-4o-mini |
--attack |
Attack method: baseline, refusal_suppression, role_play_en, role_play_cn, style_ristrict |
baseline |
--start |
Start question index | 0 |
--end |
End question index | All |
--samples |
Student state samples per question | 2 |
--concurrent |
Max concurrent API calls | 10 |
--seed |
Random seed | 42 |
--output |
Output directory | output |
| Provider | Models | API Key |
|---|---|---|
| OpenAI | gpt-5, gpt-5-mini, gpt-5-nano |
OPENAI_API_KEY |
gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite |
GEMINI_API_KEY |
|
| Anthropic | claude-opus-4-5, claude-haiku-4-5 |
CLAUDE_API_KEY |
| Together AI | Qwen3-80B, Qwen3-32B, Qwen3-8B |
TOGETHER_API_KEY |
| Local | EduChat, SocraticLM | EDUCHAT_BASE_URL |
python -m evaluation.pedagogical_evaluation \
--input output/context-gpt-5-baseline-100-20260101_120000.json \
--evaluator-model gpt-5 \
--evaluator-provider openai \
--max-concurrent 10python -m evaluation.batch_evaluate \
--input-dir output/ \
--output-dir evaluation/output/evaluate \
--evaluator-model gpt-5 \
--evaluator-provider openai \
--max-concurrent-files 3 \
--max-concurrent-evaluations 10python -m evaluation.analyze_evaluated_results \
--input-dir evaluation/output/evaluate \
--output-dir evaluation/output/summaryGenerates evaluate_summary_per_file.csv and evaluate_summary_pivot.csv.
The benchmark is built from 1,786 Linear Algebra questions and a 91-node knowledge prerequisite graph. We apply Algorithm 1 to enumerate all subsets of each question's required knowledge points, then filter by prerequisite consistency (Eq. 18), yielding 9,087 valid student-question pairs.
python -m pipeline.generate_pairs \
--questions data/LinearAlgebra_hard_ds_steps_parallel.json \
--adjacency data/adjacency_matrix_knowledge_graph.csv \
--stats-onlyExpected output:
Total questions: 1786
Candidate enumeration (before filtering):
|Rq| = 1: 281 questions => 281 * 2^1 = 562
|Rq| = 2: 135 questions => 135 * 2^2 = 540
|Rq| = 3: 1348 questions => 1348 * 2^3 = 10784
|Rq| = 4: 22 questions => 22 * 2^4 = 352
Total candidate (q, sR) pairs: 12238
After prerequisite-consistency filtering (Eq. 18):
Valid (q, sR) pairs: 9087
python -m pipeline.generate_pairs \
--questions data/LinearAlgebra_hard_ds_steps_parallel.json \
--adjacency data/adjacency_matrix_knowledge_graph.csv \
--output data/valid_pairs.jsonQuestions (data/LinearAlgebra_hard_ds_steps_parallel.json) — JSON Lines, one object per line:
{"problem": "Find the eigenvalues of matrix A...", "answer": "\u03bb = 2, 3", "step": ["Eigenvalues and Eigenvectors", "Characteristic Polynomial"]}Knowledge Graph (data/adjacency_matrix_knowledge_graph.csv) — 91x91 adjacency matrix where 1 indicates a prerequisite relationship.
.
├── context_based_tutor/ # LangGraph-based graph-augmented pipeline (our method)
│ ├── graph.py # 4-node DAG workflow definition
│ ├── nodes.py # Decompose, Compare, Direct Answer, Tutoring nodes
│ ├── prompts.py # Prompt templates with 3-phase Socratic protocol
│ ├── run.py # Entry points (sync + async)
│ ├── state.py # State type definitions
│ ├── adapters.py # LLM adapter utilities
│ └── utils.py # Knowledge graph utilities
│
├── adaptive_tutor/ # System-prompt baseline (vanilla prompting)
│ ├── run.py # Entry points
│ ├── prompts.py # Adaptive system prompt templates
│ └── knowledge_graph.py # Knowledge graph + prerequisite validation
│
├── pipeline/ # Experiment pipeline
│ ├── generate_pairs.py # Algorithm 1: generate valid (q, s) pairs
│ ├── run_context_test.py # Pipeline tutor evaluation
│ ├── run_adaptive_test.py # Vanilla tutor evaluation
│ └── utils.py # Shared utilities
│
├── evaluation/ # Evaluation framework
│ ├── pedagogical_evaluation.py # Safety / Helpfulness / Pedagogy scorer
│ ├── batch_evaluate.py # Batch evaluation with concurrency
│ └── analyze_evaluated_results.py # Results aggregation and pivot tables
│
├── clients/ # LLM API clients
│ ├── openai_client.py # OpenAI (GPT-5 series)
│ ├── gemini_client.py # Google Gemini
│ ├── claude_client.py # Anthropic Claude
│ ├── together_ai_client.py # Together AI (Qwen, Llama, etc.)
│ ├── openrouter_client.py # OpenRouter
│ ├── educhat_client.py # EduChat local deployment
│ └── config.py # Configuration utilities
│
├── attack_methods/ # Jailbreak attack prompts
│ ├── refusal_suppression.txt
│ ├── role_play_en.txt
│ ├── role_play_cn.txt
│ └── style_ristrict.txt
│
├── data/ # SHAPE benchmark data
│ ├── LinearAlgebra_hard_ds_steps_parallel.json # Full dataset (1,786 questions)
│ ├── LinearAlgebra_hard_random100_1.json # 100-question subset
│ └── adjacency_matrix_knowledge_graph.csv # Knowledge prerequisite graph
│
├── assets/ # Paper figures
└── requirements.txt
@inproceedings{zhao2026shape,
title={SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs},
author={Zhao, Sihang and Yu, Kangrui and Yuan, Youliang and He, Pinjia and Wen, Hongyi},
booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2026}
}This project is released for academic research purposes.

