diff --git a/README.md b/README.md
index ef910b0..f305e09 100644
--- a/README.md
+++ b/README.md
@@ -40,12 +40,12 @@ Agent-R1 v0.1.0 is the first official release of the new architecture. It is bui
- **Retokenization drift in text-based pipelines**: if rollout data is collected as text and later tokenized again for training, the `Token -> Text -> Token` conversion is not reversible.
- **Rigid token-only trajectory construction**: if the whole interaction is represented as a single growing token list, context handling becomes hard-wired to simple append-only logic.
-Agent-R1 addresses these issues with a **step-level trajectory representation**:
+Agent-R1 addresses these issues with a coordinated step-level training perspective:
-- each step stores its own prompt and response
-- the environment, not raw token concatenation, controls the next observation
-- context can be **truncated**, **summarized**, **rewritten**, or **augmented** between steps
-- standard RL loops such as `obs -> action -> step -> next_obs` map naturally onto agent training
+- **Step-level MDP** treats each interaction round as a proper RL transition.
+- **Step-level trajectory representation** preserves step boundaries during replay instead of collapsing everything into one flat text reconstruction.
+- **Step-level credit assignment** propagates reward across interaction steps rather than only across tokens or whole trajectories.
+- **Layered abstractions** map those ideas into practical programming interfaces such as `AgentEnvLoop`, `ToolEnv`, and `BaseTool`.
This makes Agent-R1 a better fit for real multi-step agent tasks with tool use, environment feedback, and flexible context management.
@@ -92,6 +92,8 @@ This is the main Agent-R1 path, where `AgentEnvLoop` drives multi-step rollout a
Core concepts:
- [Step-level MDP](https://agentr1.github.io/Agent-R1/core-concepts/step-level-mdp/)
+- [Step-Level Trajectory Representation](https://agentr1.github.io/Agent-R1/core-concepts/step-level-trajectory-representation/)
+- [Step-Level Credit Assignment](https://agentr1.github.io/Agent-R1/core-concepts/step-level-credit-assignment/)
- [Layered Abstractions](https://agentr1.github.io/Agent-R1/core-concepts/layered-abstractions/)
## Awesome Projects Using Agent-R1
diff --git a/docs/README.md b/docs/README.md
index b3c4c9a..7354f80 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -37,6 +37,9 @@ docs/
├── README.md # maintenance notes for the docs directory
├── requirements.txt # documentation dependencies
├── index.md # documentation homepage
+├── assets/ # rendered figures reused from the paper
+├── background/ # paper-style conceptual background
+│ └── step-level-training-logic.md
├── getting-started/ # minimal setup and sanity-check flow
│ ├── index.md
│ ├── installation-guide.md
@@ -44,6 +47,8 @@ docs/
├── core-concepts/ # key framework concepts
│ ├── index.md
│ ├── step-level-mdp.md
+│ ├── step-level-trajectory-representation.md
+│ ├── step-level-credit-assignment.md
│ └── layered-abstractions.md
└── tutorials/ # task-oriented tutorials
├── index.md
diff --git a/docs/assets/step-level-credit-assignment.png b/docs/assets/step-level-credit-assignment.png
new file mode 100644
index 0000000..103c5d5
Binary files /dev/null and b/docs/assets/step-level-credit-assignment.png differ
diff --git a/docs/assets/step-level-mdp.png b/docs/assets/step-level-mdp.png
new file mode 100644
index 0000000..f4f6f1d
Binary files /dev/null and b/docs/assets/step-level-mdp.png differ
diff --git a/docs/assets/step-level-trajectory-representation.png b/docs/assets/step-level-trajectory-representation.png
new file mode 100644
index 0000000..92bcae4
Binary files /dev/null and b/docs/assets/step-level-trajectory-representation.png differ
diff --git a/docs/background/step-level-training-logic.md b/docs/background/step-level-training-logic.md
new file mode 100644
index 0000000..fc291de
--- /dev/null
+++ b/docs/background/step-level-training-logic.md
@@ -0,0 +1,59 @@
+# Step-Level Training Logic
+
+As large language models evolve from single-turn assistants into multi-step agents, reinforcement learning is no longer optimizing only a final response or a short reasoning trace. In agent settings, the model repeatedly receives observations, emits actions, calls tools, and accumulates consequences over multiple rounds of interaction. Once optimization targets this regime, the central question is no longer only how to score an answer, but how to define the decision process itself, how to represent the resulting trajectory, and how to propagate delayed reward across it.
+
+The position taken in Agent-R1 is that these questions should be answered at the same semantic level: the interaction step. The step, rather than the individual token, is the natural unit at which agent behavior becomes legible as decision making. This view leads to a coordinated shift in three places. First, the Markov decision process should be formulated at the step level rather than the token level. Second, trajectory representation should preserve step boundaries rather than collapse the whole interaction into reconstructed messages or a single append-only token stream. Third, credit assignment should propagate reward across interaction steps rather than only across tokens or whole trajectories. The three parts are conceptually distinct, but they support one another. Step-level MDP without step-aware replay remains empirically fragile, while step-level MDP without step-level credit assignment still attributes reward through the wrong unit.
+
+## From Token-Level MDP to Step-Level MDP
+
+Token-level Markov decision process formulation is a natural extension of autoregressive language modeling. If a model generates a response one token at a time, then each partial prefix can be treated as a state and each next token as an action. This formulation remains clean and effective for single-turn post-training settings in which the environment is static or only weakly coupled to intermediate generations.
+
+However, long-horizon agents do not interact with the world merely by appending one more token to a growing sequence. They call tools, receive observations, revise plans, restructure context, and branch on external outcomes. In such settings, the semantically meaningful transition is no longer "emit one token," but "complete one interaction round and receive new environment feedback." When optimization remains purely token-centric, high-level decisions are fragmented across many low-level actions, and the role of the environment is obscured inside one long flat trace.
+
+Agent-R1 therefore adopts a step-level MDP view. At step $t$, the state $s_t$ is the observation presented to the policy, the action $a_t$ is the complete response or interaction action emitted at that step, and the environment returns reward $r_t$ together with the next observation $s_{t+1}$. What changes is not whether tokens still exist inside the model, but the unit at which the RL transition is defined. The transition is one interaction step rather than one appended token.
+
+
+
+
+![Comparison between token-level and step-level MDP](../assets/step-level-mdp.png)
+
+Comparison between token-level MDP formulation and step-level MDP formulation. The key shift is that the atomic action changes from a single token to a complete agent-environment interaction step.
+
+
+This reformulation clarifies the division of responsibility between policy and environment. The policy chooses an action conditioned on the current observation. The environment is then responsible for executing that action, producing feedback, and constructing the next observation. Once the interaction round is treated as the transition unit, context summarization, truncation, tool execution, and other environment-mediated operations become part of the transition function rather than awkward exceptions to an append-only token view.
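The loop described above can be sketched in a few lines. `ToyCalcEnv` and `stub_policy` below are invented stand-ins for illustration, not Agent-R1 interfaces:

```python
class ToyCalcEnv:
    """Toy environment: it, not token concatenation, owns observations and rewards."""

    def reset(self):
        # State s_t: the observation presented to the policy.
        return "What is 2 + 2? Answer with a number."

    def step(self, action):
        # The environment executes the action and produces feedback.
        reward = 1.0 if "4" in action else 0.0
        next_obs = None  # single-step episode: no further observation
        return next_obs, reward, True


def stub_policy(obs):
    # Stand-in for the LLM: one complete response is one step-level action.
    return "The answer is 4."


env = ToyCalcEnv()
obs = env.reset()                          # state s_t
action = stub_policy(obs)                  # action a_t: the whole response
next_obs, reward, done = env.step(action)  # environment returns r_t and s_{t+1}
```

The point of the sketch is the division of labor: the policy only maps `obs` to `action`; everything between `action` and the next observation belongs to the environment's transition function.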
+
+## Step-Level Trajectory Representation
+
+Establishing a step-level MDP mathematically is not enough on its own. The empirical training pipeline must also record and replay trajectories in a way that respects the same step boundaries. This is the representation problem.
+
+One common representation for multi-turn agents is a sequence of chat-style messages. This format is simple and interoperable with standard chat interfaces, but it hides a serious inconsistency. Rollout takes place in token space, whereas replay may reconstruct text and tokenize it again during optimization. Because the mapping from token sequence to text and back is not reversible in general, the replayed sequence may differ from the one that originally produced the trajectory. Once this retokenization drift occurs, masks, log-probabilities, and reward annotations can no longer be aligned reliably with the original rollout.
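The drift can be reproduced with a deliberately tiny tokenizer. The vocabulary and greedy merge rule below are invented for illustration; real BPE-style tokenizers exhibit the same effect at scale:

```python
# Toy greedy tokenizer illustrating why Token -> Text -> Token is not reversible.
VOCAB = {"ab": 0, "a": 1, "b": 2}
INV = {i: s for s, i in VOCAB.items()}


def encode(text):
    # Greedy longest-match, as BPE-style tokenizers do.
    ids, i = [], 0
    while i < len(text):
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(VOCAB[piece])
                i += len(piece)
                break
    return ids


def decode(ids):
    return "".join(INV[i] for i in ids)


# Suppose rollout produced "a" and "b" as separate tokens,
# e.g. because they were generated across a step boundary.
rollout_ids = [VOCAB["a"], VOCAB["b"]]      # [1, 2]
replayed_ids = encode(decode(rollout_ids))  # greedy match merges them into [0]

# Same text, different token sequence: masks and log-probs no longer line up.
```

Here `decode` then `encode` maps `[1, 2]` to `[0]`, so any per-token annotation recorded at rollout time points at positions that no longer exist after replay.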
+
+A stronger alternative is flat token-space storage, where prompts and responses are preserved directly as token IDs. This restores rollout-training consistency, but it still treats the whole interaction as one monolithic append-only sequence. That structure is workable for some training pipelines, yet it remains too rigid for long-horizon agents whose interaction history may need to be reconstructed, truncated, or reorganized at step boundaries.
+
+The representation that best matches Agent-R1's perspective is therefore a structured step-level trajectory. Each interaction round is stored as a distinct unit containing the observation shown to the policy, the action produced at that step, and the reward or metadata attached to that interaction. This preserves token-level information inside each action while keeping the step itself explicit as the unit of replay and analysis.
+
+
+
+
+![Evolution of trajectory representation](../assets/step-level-trajectory-representation.png)
+
+The evolution of trajectory representation from message-based traces to token-space-consistent records and finally to step-based sequences.
+
+
+The distinction matters because trajectory representation and MDP formulation answer different questions. MDP formulation defines what the RL transition is. Representation defines how the interaction history is stored and replayed for optimization. The two should not be conflated, but they must remain compatible. If the MDP is step-level while the replay format obscures or corrupts step boundaries, optimization is still misaligned with the underlying decision process.
+
+## Step-Level Credit Assignment
+
+Once the decision process is formulated at the step level and the trajectory is represented in a step-native form, reward propagation should also move to the same granularity. Otherwise, a mismatch remains between the unit at which decisions are modeled and the unit at which responsibility is assigned.
+
+Trajectory-level credit assignment is too coarse for this purpose. Assigning one scalar signal to the whole rollout may be simple and stable, but it cannot distinguish productive intermediate actions from harmful ones when an episode contains many interaction rounds. Token-level credit assignment lies at the opposite extreme. It reuses the standard machinery of language-model RL, yet in agent settings it is often too fine. The strategically decisive event may be a retrieval call, a decomposition step, a context-management choice, or a tool invocation, while the reward arrives only later. If delayed return is attributed directly through surface tokens, the learning signal becomes diluted relative to the actual interaction choice.
+
+The natural counterpart of a step-level MDP is therefore step-level credit assignment. In this view, value estimation, temporal-difference residuals, generalized advantage estimation, and PPO-style optimization are all organized around the interaction step. The policy may still factor internally over tokens, but the unit that receives advantage and responsibility is the complete interaction action rather than an isolated token.
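A worked example makes the step-level propagation concrete. The reward and value numbers below are invented, and the helper is the textbook GAE recursion rather than Agent-R1's own code:

```python
def step_level_gae(rewards, values, gamma=1.0, lam=1.0):
    # Textbook GAE recursion, indexed by interaction step rather than token.
    advantages = [0.0] * len(rewards)
    lastgaelam = 0.0
    for t in reversed(range(len(rewards))):
        nextvalue = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * nextvalue - values[t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages[t] = lastgaelam
    return advantages


# Only the final step is rewarded, as with delayed task success.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8]

advs = step_level_gae(rewards, values)
# advs is approximately [0.5, 0.4, 0.2]: the earliest step, e.g. a tool choice,
# still receives the largest share of credit through the step chain.
```

Even though the reward arrives only at the last step, the recursion assigns a nonzero advantage to every earlier interaction decision, which is exactly what trajectory-level and token-level schemes struggle to do cleanly.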
+
+
+
+
+![Token-level vs trajectory-level vs step-level credit assignment](../assets/step-level-credit-assignment.png)
+
+Comparison of token-level, trajectory-level, and step-level credit assignment. The main change is not how actions are tokenized, but where delayed rewards are attributed and propagated.
+
+
+This shift is especially important under delayed reward. In many agent tasks, the final outcome depends on an earlier decision that changes the entire later trajectory: choosing the right tool, retrieving the right evidence, or preserving the right context for subsequent turns. Step-level credit assignment makes it possible to attribute success or failure to that earlier interaction decision without collapsing the signal into one trajectory-level scalar or dispersing it across many locally meaningless token choices.
+
+## Conclusion
+
+The step-level perspective in Agent-R1 is not a single isolated design choice. It is a coordinated training logic. Once agent behavior is understood as multi-step interaction, the MDP transition, the trajectory representation, and the credit-assignment unit should all be aligned around the same object: the interaction step. This is the conceptual thread that connects the framework's modeling choices to its optimization logic, and it is the main reason the discussion naturally proceeds in the order of MDP, trajectory representation, and credit assignment.
diff --git a/docs/core-concepts/index.md b/docs/core-concepts/index.md
index 9790a19..3fc8f85 100644
--- a/docs/core-concepts/index.md
+++ b/docs/core-concepts/index.md
@@ -5,8 +5,10 @@ This section introduces the ideas that shape Agent-R1 as a framework for agent t
## In This Section
- [`Step-level MDP`](step-level-mdp.md): why Agent-R1 models agent training as multi-step interaction instead of a single growing token stream.
+- [`Step-Level Trajectory Representation`](step-level-trajectory-representation.md): how Agent-R1 stores and replays interaction history without collapsing everything into one growing token stream.
+- [`Step-Level Credit Assignment`](step-level-credit-assignment.md): why reward propagation should follow interaction steps rather than only tokens or whole trajectories.
- [`Layered Abstractions`](layered-abstractions.md): how `AgentFlowBase`, `AgentEnvLoop`, `AgentEnv`, `ToolEnv`, and `BaseTool` fit together.
## Why These Concepts Matter
-Agent-R1 is designed for agent tasks where an LLM interacts with an environment, receives new observations, and improves through reinforcement learning over trajectories. These two pages explain the core formulation and the programming model that support that workflow.
+Agent-R1 is designed for agent tasks where an LLM interacts with an environment, receives new observations, and improves through reinforcement learning over trajectories. Together, these pages explain the framework's step-level MDP, trajectory representation, credit assignment, and programming model.
diff --git a/docs/core-concepts/step-level-credit-assignment.md b/docs/core-concepts/step-level-credit-assignment.md
new file mode 100644
index 0000000..64af556
--- /dev/null
+++ b/docs/core-concepts/step-level-credit-assignment.md
@@ -0,0 +1,54 @@
+# Step-Level Credit Assignment
+
+Once the decision process is formulated at the step level and the trajectory is represented in a step-native form, reward propagation should also move to the same granularity. Otherwise, a mismatch remains between the unit at which decisions are modeled and the unit at which responsibility is assigned.
+
+
+
+
+![Token-level vs trajectory-level vs step-level credit assignment](../assets/step-level-credit-assignment.png)
+
+Comparison of token-level, trajectory-level, and step-level credit assignment. The main change is not how actions are tokenized, but where delayed rewards are attributed and propagated.
+
+
+## Granularity Mismatch
+
+Trajectory-level credit assignment is too coarse for this purpose. Assigning one scalar signal to the whole rollout may be simple and stable, but it cannot distinguish productive intermediate actions from harmful ones when an episode contains many interaction rounds.
+
+Token-level credit assignment lies at the opposite extreme. It reuses the standard machinery of language-model RL, yet in agent settings it is often too fine. The strategically decisive event may be a retrieval call, a decomposition step, a context-management choice, or a tool invocation, while the reward arrives only later. If delayed return is attributed directly through surface tokens, the learning signal becomes diluted relative to the actual interaction choice.
+
+## Step-Level Objective
+
+The natural counterpart of a step-level MDP is therefore step-level credit assignment. In this view, value estimation, temporal-difference residuals, generalized advantage estimation, and PPO-style optimization are all organized around the interaction step. The policy may still factor internally over tokens, but the unit that receives advantage and responsibility is the complete interaction action rather than an isolated token.
+
+This distinction matters especially under delayed reward. In many agent tasks, the final outcome depends on an earlier decision that changes the later trajectory: choosing the right tool, retrieving the right evidence, or preserving the right context for subsequent turns. Step-level credit assignment makes it possible to attribute success or failure to that earlier interaction decision without collapsing the signal into one trajectory-level scalar or dispersing it across many locally meaningless token choices.
+
+## How This Appears in Code
+
+Agent-R1's GAE implementation first aggregates token rewards into a step reward, then computes advantages over step indices. The listing below is a condensed sketch of that logic; the tensor bookkeeping is simplified here:
+
+```python
+import torch
+
+
+def compute_gae_advantage_return(
+    token_level_rewards,
+    values,
+    response_mask,
+    trajectory_uids,
+    step_indices,
+    gamma,
+    lam,
+):
+    # Condensed sketch; tensor bookkeeping is simplified here.
+    # Step-level reward: sum of token rewards inside each step.
+    rewards = (token_level_rewards * response_mask).sum(dim=1)
+
+    # Scatter per-step rewards and values onto a (trajectory, step) grid.
+    _, traj_inv = torch.unique(trajectory_uids, return_inverse=True)
+    num_traj = int(traj_inv.max().item()) + 1
+    max_step = int(step_indices.max().item()) + 1
+    rewards_map = torch.zeros(num_traj, max_step)
+    values_map = torch.zeros(num_traj, max_step)
+    rewards_map[traj_inv, step_indices] = rewards
+    values_map[traj_inv, step_indices] = values
+
+    # GAE backward pass over the step timeline, not over individual tokens.
+    advantages_map = torch.zeros_like(rewards_map)
+    lastgaelam = torch.zeros(num_traj)
+    for t in reversed(range(max_step)):
+        nextvalues = values_map[:, t + 1] if t < max_step - 1 else 0.0
+        delta = rewards_map[:, t] + gamma * nextvalues - values_map[:, t]
+        lastgaelam = delta + gamma * lam * lastgaelam
+        advantages_map[:, t] = lastgaelam
+
+    # Gather step-level advantages back into flat batch order.
+    advantages = advantages_map[traj_inv, step_indices]
+    returns = advantages + values
+    return advantages, returns
+```
+
+The important point is that the code does not propagate advantage over one flat token stream. It first builds step-level rewards and values, computes GAE over the step timeline, and only then broadcasts the result back to token positions when needed by PPO training.
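The broadcast-back step can be sketched with plain lists. The shapes and numbers below are illustrative, not taken from the trainer:

```python
# Schematic: expand one advantage per step onto that step's response tokens.
step_advantages = [0.5, -0.2]   # one scalar advantage per step in the batch
response_mask = [
    [1, 1, 1, 0],               # step 0: three response tokens, one pad
    [1, 1, 0, 0],               # step 1: two response tokens, two pads
]

token_advantages = [
    [adv * m for m in mask]
    for adv, mask in zip(step_advantages, response_mask)
]
# Every response token inside a step shares that step's advantage;
# padding positions stay zero.
```

The direction of information flow is the key design choice: credit is computed once per interaction step and only then spread over tokens, never the reverse.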
+
+The relevant implementation lives in:
+
+- `agent_r1/core_algos.py`
+- `agent_r1/ray_agent_trainer.py`
diff --git a/docs/core-concepts/step-level-mdp.md b/docs/core-concepts/step-level-mdp.md
index e28aced..e759cd0 100644
--- a/docs/core-concepts/step-level-mdp.md
+++ b/docs/core-concepts/step-level-mdp.md
@@ -1,46 +1,57 @@
# Step-level MDP
-## A Principled Foundation for RL Agent Training
+This page focuses on the modeling layer of the problem. The central claim is that long-horizon agent optimization should not be formulated only as a token-level decision process. Instead, the natural transition unit for agent training is the interaction step.
-Most existing frameworks treat the LLM agent as a token-level process: the "state" is the ever-growing concatenation of all past tokens, and the "action" is the next token. This token-level view forces context to grow monotonically and makes it hard to apply standard RL algorithms at a meaningful granularity.
+## Token-Level Formulation
-Agent-R1 adopts a **step-level MDP** that models the LLM as an agent acting inside an environment:
+Token-level Markov decision process formulation is a natural extension of autoregressive language modeling. Given prompt \(x\) and response \(y = (y_1, \dots, y_L)\), the policy factorizes as
-| MDP Element | Definition |
-|---|---|
-| **State** \(s_t\) | The prompt presented to the LLM at step \(t\), determined entirely by the environment |
-| **Action** \(a_t\) | The LLM's complete response at step \(t\) |
-| **Transition** \(T(s_{t+1} \mid s_t, a_t)\) | The environment produces the next observation given the current state and the LLM's response |
-| **Reward** \(r_t\) | A per-step reward signal from the environment |
-| **Policy** \(\pi(a_t \mid s_t)\) | The LLM itself |
+\[
+\pi_\theta(y \mid x) = \prod_{i=1}^{L} \pi_\theta(y_i \mid x, y_{<i}).
+\]
-```mermaid
-graph LR
-    state_t["State s_t"] -->|"Policy π (LLM)"| action_t["Action a_t"]
- action_t -->|"Environment"| state_t1["State s_{t+1}"]
- action_t -->|"Environment"| reward_t["Reward r_t"]
- state_t1 -->|"Policy π (LLM)"| action_t1["Action a_{t+1}"]
- action_t1 -->|"..."| more_steps["..."]
-```
-This formulation leads to three key insights:
+
+This factorization induces a token-level decision process almost for free. At token position \(i\), the state and action can be written as
+
+\[
+s_i^{\mathrm{tok}} = (x, y_{<i}), \qquad a_i^{\mathrm{tok}} = y_i,
+\]
+
+so each appended token counts as one RL action.
+
+## Step-Level Formulation
+
+Agent-R1 instead treats the interaction round as the transition unit. At step \(t\), the state \(s_t\) is the observation presented to the policy, the action \(a_t\) is the complete response emitted at that step, and the environment returns reward \(r_t\) together with the next observation \(s_{t+1}\).
+
+![Comparison between token-level and step-level MDP](../assets/step-level-mdp.png)
+
+Comparison between token-level MDP formulation and step-level MDP formulation. The key shift is that the atomic action changes from a single token to a complete agent-environment interaction step.
+
+
+The move to step-level MDP does not imply that token information is discarded. On the contrary, token-space consistency remains important for stable optimization inside each action. What changes is the unit at which the decision process is defined. Step-level formulation better reflects the causal structure of agent behavior: a step begins with an observation, produces an action, triggers an external transition, and only then exposes the next observation.
+
+Step-level MDP also clarifies which parts of the loop belong to the policy and which belong to the environment. The policy is responsible for choosing \(a_t\) conditioned on \(s_t\). The environment is responsible for turning that interaction action into \(s_{t+1}\), possibly through tool execution, response parsing, external feedback, or context rewriting. This separation is difficult to maintain in a pure token-append abstraction, but it becomes explicit once the interaction round is treated as the transition unit.
+
+## What This Leads To
+
+However, establishing step-level MDP formulation mathematically is only half the battle. To actually optimize a policy over these \((s_t, a_t, r_t, s_{t+1})\) transitions, the training pipeline must also record and replay the interaction history in a way that honors the same step boundaries. If the empirical trajectory representation misaligns with this theoretical MDP, optimization remains fragile. That broader logic is summarized in [`Step-Level Training Logic`](../background/step-level-training-logic.md) and developed in the next two concept pages on trajectory representation and credit assignment.
diff --git a/docs/core-concepts/step-level-trajectory-representation.md b/docs/core-concepts/step-level-trajectory-representation.md
new file mode 100644
index 0000000..1804c1f
--- /dev/null
+++ b/docs/core-concepts/step-level-trajectory-representation.md
@@ -0,0 +1,49 @@
+# Step-Level Trajectory Representation
+
+Trajectory representation answers a different question from MDP formulation. It does not define the RL transition directly. Instead, it defines how an interaction history is stored and replayed for optimization. The two layers are related, but they should not be conflated. A framework may adopt step-level decision making while still storing trajectories in a way that weakens replay fidelity or obscures the boundaries required for optimization.
+
+
+
+
+![Evolution of trajectory representation](../assets/step-level-trajectory-representation.png)
+
+The evolution of trajectory representation from message-based traces to token-space-consistent records and finally to step-based sequences. This figure is intended as background and concept setup rather than the main technical claim.
+
+
+## Text-Space Representation
+
+One common representation for multi-turn agents is a sequence of chat-style messages. This format is simple and interoperable with standard chat interfaces, but it hides a serious inconsistency. Rollout takes place in token space, whereas replay may reconstruct text and tokenize it again during optimization. Because the mapping from token sequence to text and back is not reversible in general, the replayed sequence may differ from the one that originally produced the trajectory.
+
+Once this retokenization drift occurs, masks, log-probabilities, and reward annotations can no longer be aligned reliably with the original rollout. This is why message-space convenience is not enough for stable step-level optimization.
+
+## Flat Token-Space Representation
+
+A stronger alternative is flat token-space storage, where prompts and responses are preserved directly as token IDs. This restores rollout-training consistency, but it still treats the whole interaction as one monolithic append-only sequence. That structure is workable for some training pipelines, yet it remains too rigid for long-horizon agents whose interaction history may need to be reconstructed, truncated, or reorganized at step boundaries.
+
+## Structured Step-Level Representation
+
+The representation that best matches Agent-R1's perspective is a structured step-level trajectory. Each interaction round is stored as a distinct unit containing the observation shown to the policy, the action produced at that step, and the reward or metadata attached to that interaction. This preserves token-level information inside each action while keeping the step itself explicit as the unit of replay and analysis.
+
+The distinction matters because MDP formulation defines what the RL transition is, while representation defines how the interaction history is stored and replayed for optimization. If the MDP is step-level while the replay format obscures or corrupts step boundaries, optimization remains misaligned with the underlying decision process.
+
+## How This Appears in Code
+
+In Agent-R1, the trajectory is explicitly represented as a list of steps rather than a single monolithic sample:
+
+```python
+from typing import Any, Optional
+
+from pydantic import BaseModel
+
+
+class AgentFlowStep(BaseModel):
+    prompt_ids: list[int]
+    response_ids: list[int]
+    reward_score: Optional[float] = None
+    extra_fields: dict[str, Any] = {}
+
+
+class AgentFlowOutput(BaseModel):
+    steps: list[AgentFlowStep]
+    metrics: AgentFlowMetrics  # aggregate rollout statistics, defined alongside these models
+```
+
+This is the core implementation idea behind step-level trajectory representation: each rollout is organized as a sequence of step records, and each step keeps its own prompt ids, response ids, and reward signal.
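A schematic replay loop shows what this structure buys. Plain dictionaries stand in for the step model here, and the mask construction is illustrative rather than the framework's exact collation code:

```python
# Schematic step records; plain dicts stand in for the AgentFlowStep model.
trajectory = [
    {"prompt_ids": [101, 7, 8], "response_ids": [21, 22], "reward_score": 0.0},
    {"prompt_ids": [101, 9], "response_ids": [23, 24, 25], "reward_score": 1.0},
]

# Replay concatenates stored token IDs per step. Nothing is re-tokenized,
# and the step boundary tells us exactly where the response region begins.
batch = []
for step in trajectory:
    input_ids = step["prompt_ids"] + step["response_ids"]
    loss_mask = [0] * len(step["prompt_ids"]) + [1] * len(step["response_ids"])
    batch.append((input_ids, loss_mask, step["reward_score"]))
```

Because each step carries its own prompt and response IDs, replay stays in token space end to end, and per-step rewards remain attached to the step that earned them.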
+
+The relevant implementation lives in:
+
+- `agent_r1/agent_flow/agent_flow.py`
+- `agent_r1/agent_flow/agent_env_loop.py`
diff --git a/docs/getting-started/quick-start.md b/docs/getting-started/quick-start.md
index 2ca419b..b8c5607 100644
--- a/docs/getting-started/quick-start.md
+++ b/docs/getting-started/quick-start.md
@@ -34,5 +34,6 @@ The script entrypoint is [`examples/run_qwen2.5-3b.sh`](https://github.com/Agent
## 3. What to Do Next
- Read [`Step-level MDP`](../core-concepts/step-level-mdp.md) to understand the main training abstraction.
+- Read [`Step-Level Trajectory Representation`](../core-concepts/step-level-trajectory-representation.md) and [`Step-Level Credit Assignment`](../core-concepts/step-level-credit-assignment.md) if you want to see how replay and reward propagation align with the same step-level view.
- Read [`Layered Abstractions`](../core-concepts/layered-abstractions.md) to see how `AgentFlowBase`, `AgentEnvLoop`, and `ToolEnv` fit together.
- Continue to the [`Agent Task Tutorial`](../tutorials/agent-task.md) for the main Agent-R1 workflow based on multi-step interaction.
diff --git a/docs/index.md b/docs/index.md
index a4aa9d6..0d26580 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -14,6 +14,22 @@ Agent-R1 is an open-source framework for training powerful language agents with
[:octicons-arrow-down-24: Learn more](core-concepts/step-level-mdp.md)
+- :material-source-branch:{ .lg .middle } **Step-Level Trajectory Representation**
+
+ ---
+
+ See how Agent-R1 represents trajectories at the same semantic level as multi-step interaction.
+
+ [:octicons-arrow-down-24: Learn more](core-concepts/step-level-trajectory-representation.md)
+
+- :material-chart-timeline-variant:{ .lg .middle } **Step-Level Credit Assignment**
+
+ ---
+
+ See why Agent-R1 propagates reward at the level of interaction steps rather than only tokens.
+
+ [:octicons-arrow-down-24: Learn more](core-concepts/step-level-credit-assignment.md)
+
- :material-layers-outline:{ .lg .middle } **Layered Abstractions**
---
@@ -29,12 +45,13 @@ Agent-R1 is an open-source framework for training powerful language agents with
## Reading Guide
- Start with [`Getting Started`](getting-started/index.md) if you want the minimal path: use the same environment as `verl`, run a sanity check, and confirm the repository is ready.
-- Read [`Step-level MDP`](core-concepts/step-level-mdp.md) and [`Layered Abstractions`](core-concepts/layered-abstractions.md) if you want to understand the framework design before touching code.
+- Read [`Step-Level Training Logic`](background/step-level-training-logic.md) if you want the full conceptual argument behind Agent-R1's step-level perspective.
+- Read [`Step-level MDP`](core-concepts/step-level-mdp.md), [`Step-Level Trajectory Representation`](core-concepts/step-level-trajectory-representation.md), [`Step-Level Credit Assignment`](core-concepts/step-level-credit-assignment.md), and [`Layered Abstractions`](core-concepts/layered-abstractions.md) if you want the framework ideas broken into concrete pieces.
- Follow [`Agent Task Tutorial`](tutorials/agent-task.md) if you want to see the main Agent-R1 workflow: multi-step interaction through `AgentEnvLoop` and `ToolEnv`.
## Scope of This Documentation
-This version of the documentation is intentionally compact. It focuses on the parts that are already central to Agent-R1 today and leaves room for future tutorials as more environments and tools are added.
+This version of the documentation is intentionally compact. It focuses on the parts that are already central to Agent-R1 today while making the core design logic more explicit: step-level MDP, step-level trajectory representation, step-level credit assignment, and the layered abstractions used to build agent tasks.
---
diff --git a/docs/tutorials/agent-task.md b/docs/tutorials/agent-task.md
index 4a048c5..da0cd07 100644
--- a/docs/tutorials/agent-task.md
+++ b/docs/tutorials/agent-task.md
@@ -101,4 +101,5 @@ The single-step GSM8K script is still useful, but only as a setup check. This tu
## 6. Where to Look Next
- Read [`Step-level MDP`](../core-concepts/step-level-mdp.md) to connect this tutorial to the core RL formulation.
+- Read [`Step-Level Trajectory Representation`](../core-concepts/step-level-trajectory-representation.md) and [`Step-Level Credit Assignment`](../core-concepts/step-level-credit-assignment.md) to connect this workflow to replay structure and step-level optimization.
- Read [`Layered Abstractions`](../core-concepts/layered-abstractions.md) to see why this example maps naturally to `AgentEnvLoop + ToolEnv`.
diff --git a/mkdocs.yml b/mkdocs.yml
index 56e0cf4..c135c64 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -37,6 +37,8 @@ theme:
nav:
- Home: index.md
+ - Background:
+ - Step-Level Training Logic: background/step-level-training-logic.md
- Getting Started:
- getting-started/index.md
- Installation Guide: getting-started/installation-guide.md
@@ -44,6 +46,8 @@ nav:
- Core Concepts:
- core-concepts/index.md
- Step-level MDP: core-concepts/step-level-mdp.md
+ - Step-Level Trajectory Representation: core-concepts/step-level-trajectory-representation.md
+ - Step-Level Credit Assignment: core-concepts/step-level-credit-assignment.md
- Layered Abstractions: core-concepts/layered-abstractions.md
- Tutorials:
- tutorials/index.md