# What Is an MDP?

At its core, a Markov Decision Process (MDP) is a mathematical framework for sequential decision-making under uncertainty, where outcomes are partly random and partly under the control of an agent.

"A game where at each step, you're in a certain situation, you take an action, something random happens, and then you end up in a new situation with a reward. The goal is to make a sequence of smart choices to earn the most reward over time."

5 Key ingredients of an MDP:
1. States (S) - "Where you are"
    - Each state represents a situation the agent can be in
    - Example: In a maze, a state could be your current location

2. Actions (A) - "What you can do"
    - From any state, the agent can choose from a set of actions
    - Example: In the maze, actions could be "go left", "go right", "go up", or "go down". 

3. Transition probabilities (P) - "What happens when you act"
    - These define the rules of the world: If you take an action in a state, what is the chance you'll end up in a particular next state?
    - Example: You press "go up" from cell A2, but there's a 10% chance you slip and end up in A1 instead of A3. 

4. Rewards (R) - "How good was that move?"
    - After each action, you get a reward - a number indicating how good or bad the result is.
    - Example: Reaching the goal gives you +10, stepping into lava gives you -100, and walking on empty tiles gives you 0. 

5. Discount factor ($\gamma$) - "How much do you care about the future?"
    - This is a number between 0 and 1 that controls how much the agent values future rewards relative to immediate ones: a reward received $k$ steps from now is weighted by $\gamma^k$.
    - If $\gamma$ is close to 1, you care about long-term rewards.
    - If $\gamma$ is close to 0, you care almost only about immediate rewards.
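
The five ingredients map naturally onto a few lines of code. The sketch below is a minimal, hypothetical example: a four-cell corridor stands in for the maze, and the names (`transition_probs`, `reward`, `step`, `discounted_return`), the 10% slip chance, and $\gamma = 0.9$ are assumptions chosen for illustration, not part of any library.

```python
import random

# Hypothetical 1-D "maze": cells 0..3 in a row; cell 3 is the goal.
STATES = [0, 1, 2, 3]          # S: where you are
ACTIONS = ["left", "right"]    # A: what you can do
GOAL = 3
SLIP_PROB = 0.1                # assumed chance of moving the opposite way
GAMMA = 0.9                    # assumed discount factor


def transition_probs(state, action):
    """P: return {next_state: probability} for taking `action` in `state`."""
    if state == GOAL:                       # the goal is absorbing
        return {GOAL: 1.0}
    move = 1 if action == "right" else -1
    intended = min(max(state + move, 0), GOAL)
    slipped = min(max(state - move, 0), GOAL)
    probs = {}
    probs[intended] = probs.get(intended, 0.0) + (1.0 - SLIP_PROB)
    probs[slipped] = probs.get(slipped, 0.0) + SLIP_PROB
    return probs


def reward(state, action, next_state):
    """R: +10 for entering the goal, 0 for every other move."""
    return 10.0 if next_state == GOAL and state != GOAL else 0.0


def step(state, action, rng):
    """Sample one transition and return (next_state, reward)."""
    probs = transition_probs(state, action)
    candidates = list(probs)
    weights = [probs[s] for s in candidates]
    next_state = rng.choices(candidates, weights=weights)[0]
    return next_state, reward(state, action, next_state)


def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**t * r_t over a list of rewards collected in order."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For example, `discounted_return([0, 0, 10])` weights the final reward by $\gamma^2$, giving about 8.1 when $\gamma = 0.9$: the same +10 is worth less the later it arrives.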


What's the "Markov" part?
The Markov property means:
"The future only depends on the current state and action, not on the history of how you got there."
So, the agent doesn't need to remember the full path - just where it is now and what it can do.
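
In symbols, and assuming the usual notation where $s_t$ and $a_t$ are the state and action at step $t$ and $r_{t+1}$ is the reward that follows (this indexing is an assumption, since these notes do not fix one), the Markov property can be written as:

$$ P(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1}, r_{t+1} \mid s_t, a_t) $$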

What's the Agent Trying to Learn?
The agent wants to learn a policy:
"A rule that says: 'If I'm in state X, I should take action Y.'"
The ultimate goal is to find the optimal policy - one that gives the highest expected total (discounted) reward over time.
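
When the transition probabilities and rewards are known, one standard way to find such a policy is value iteration. The sketch below is a minimal version under stated assumptions: the three-state problem (`A`, `B`, `goal`), its numbers, and the function names are all made up for illustration. `P[state][action]` is a list of `(probability, next_state, reward)` tuples.

```python
# Hypothetical toy problem: from A, "short" pays 1 right away, while
# "long" pays nothing now but leads to B, where "go" usually pays 10.
P = {
    "A": {
        "short": [(1.0, "goal", 1.0)],
        "long": [(1.0, "B", 0.0)],
    },
    "B": {
        "go": [(0.9, "goal", 10.0), (0.1, "B", 0.0)],
    },
    "goal": {
        "stay": [(1.0, "goal", 0.0)],
    },
}


def q_value(P, V, state, action, gamma):
    """Expected reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[state][action])


def value_iteration(P, gamma, tol=1e-9, max_iters=10_000):
    """Return (V, policy): state values and the greedy action per state."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        delta = 0.0
        for s in P:
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:          # values have stopped changing
            break
    policy = {
        s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
        for s in P
    }
    return V, policy


patient_values, patient_policy = value_iteration(P, gamma=0.9)
impatient_values, impatient_policy = value_iteration(P, gamma=0.05)
```

In this toy problem the discount factor changes the answer: with `gamma=0.9` the greedy action in `A` is `"long"`, while with `gamma=0.05` it is `"short"`, because the delayed +10 is discounted too heavily to beat the immediate +1.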



# Breaking the Markov Assumption: When History Matters

In many real-world problems, "where you've been" does matter. Experience shapes decisions. Just as in life, the "memoryless" assumption (the Markov property) can be too limiting for smart agents.

The standard MDP assumes: 
"The next state and reward depend only on the current state and action - not on the path taken to get there"

But in many problems:
 - The state isn't fully observable
 - There's hidden information, delayed effects, or ambiguous situations
 - A single observation isn't enough to act optimally - you need context
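
A tiny example makes the last point concrete. The sketch below is a hypothetical two-step task (the task, observation names, and policies are all assumptions for illustration): the agent first sees a cue saying which side the goal is on, then reaches a junction that looks identical in both cases. A policy that looks only at the latest observation cannot do better than guessing, while one that uses the history can.

```python
import random


def run_episode(policy, rng):
    """Return 1.0 if the action taken at the junction is correct, else 0.0."""
    goal_side = rng.choice(["left", "right"])        # hidden part of the state
    observations = ["cue_" + goal_side, "junction"]  # cue first, then the junction
    history = []
    action = None
    for obs in observations:
        history.append(obs)
        action = policy(history)                     # the last action is the one scored
    return 1.0 if action == goal_side else 0.0


def memoryless_policy(history):
    """Acts on the latest observation only; 'junction' is ambiguous."""
    table = {"cue_left": "wait", "cue_right": "wait", "junction": "left"}
    return table[history[-1]]


def history_policy(history):
    """Uses the earlier cue to disambiguate the junction."""
    if history[-1] != "junction":
        return "wait"
    return "left" if "cue_left" in history else "right"


def success_rate(policy, episodes=1000, seed=0):
    """Fraction of episodes in which the junction action was correct."""
    rng = random.Random(seed)
    return sum(run_episode(policy, rng) for _ in range(episodes)) / episodes
```

Here `success_rate(history_policy)` is 1.0 by construction, while `success_rate(memoryless_policy)` hovers around one half, since always answering `"left"` is right only when the goal happens to be on the left.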

Solutions
1. Recurrent Neural Networks (RNNs) / LSTMs

    In Deep RL: 
    * Use an RNN or LSTM as part of the policy or value network
    * This allows the agent to summarize past observations into a hidden state - essentially giving it a memory
    * Now the policy is a function of the observation history, not just the current observation:
        $$ \pi (a_t | o_{1:t}) \text{ instead of } \pi (a_t | o_t) $$

2. Experience Replay

    This doesn't give the agent memory at decision time; it is about learning from past experience
    * The agent stores a buffer of transitions (s, a, r, s') and reuses them during training
    * Like journaling your past mistakes and learning from them in retrospect

3. Learning from Trajectories / Behavior Cloning

    Train agents not just on individual transitions but on entire sequences of states and actions
    * For example, imitation learning or inverse reinforcement learning often uses trajectory-level data to infer goals or strategies
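
The first two ideas above can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not how a deep RL library implements them: `rnn_step` is a scalar recurrent cell with hand-picked, untrained weights (`w_h`, `w_o` are made-up values), and `ReplayBuffer` is a hypothetical minimal class for storing (s, a, r, s') transitions.

```python
import math
import random
from collections import deque


def rnn_step(hidden, obs, w_h=0.5, w_o=1.0):
    """One recurrent update: h_t = tanh(w_h * h_{t-1} + w_o * o_t)."""
    return math.tanh(w_h * hidden + w_o * obs)


def summarize(observations):
    """Fold a sequence of numeric observations into one hidden state."""
    hidden = 0.0
    for obs in observations:
        hidden = rnn_step(hidden, obs)
    return hidden


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s') transitions; oldest entries drop out."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw up to `batch_size` stored transitions without replacement."""
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


# Same final observation (0.0), different histories -> different memory.
h_after_positive = summarize([1.0, 0.0])
h_after_negative = summarize([-1.0, 0.0])
```

The two hidden states differ in sign even though the last observation is identical, which is the sense in which a recurrent state carries history. A real agent would learn the weights and feed the hidden state into the policy or value network; the buffer would be sampled in mini-batches during training.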

Beyond Memory: Identity, Strategy, Growth
* Meta-RL: where the agent learns how to learn from its past episodes across different tasks
* Curriculum learning: building up competence through progressively harder challenges - much like gaining wisdom through life. 
* Lifelong / continual learning: retaining knowledge across tasks and avoiding forgetting old skills



# A World-Modeling Agent

Humans don't just react. We anticipate, generalize, and simulate the future based on our internal understanding of how the world works. In RL terms, I am aiming for:
"Agents that learn not just policies, but internal models of the environment. These models are then used to reason, simulate, and plan - like we do in our heads."

_"Model-Based RL" - The First Step Toward Understanding_

There are two broad types of RL:
1. Model-Free RL:
    * Learns policy/value directly from experience
    * Doesn't build any understanding of the environment
    * "I don't know why, but when I do this, I get rewarded"

2. Model-Based RL:
    * Tries to learn a model of the environment 
    $$ f(s,a) \rightarrow s', r $$
    * Then uses this learned model to simulate outcomes, plan ahead, or optimize actions
    * Often more data-efficient, and potentially more interpretable and closer to how humans operate
    * Challenge: learned models are often imperfect, and compounding errors can sabotage planning
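
The model-based idea can be sketched in its simplest tabular form: estimate $f(s,a) \rightarrow s', r$ by counting observed transitions, then plan by looking ahead inside the learned model. Everything below is an assumption made for illustration (the `TabularModel` class, the `plan` function, and the handful of logged transitions); real systems typically use learned function approximators instead of counts.

```python
from collections import defaultdict


class TabularModel:
    """Count-based estimate of f(s, a) -> s', r from observed transitions."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': n}
        self.reward_sum = defaultdict(float)                 # (s, a) -> total r

    def update(self, s, a, r, s2):
        self.counts[(s, a)][s2] += 1
        self.reward_sum[(s, a)] += r

    def predict(self, s, a):
        """Return ({s': prob}, mean reward), or None if (s, a) was never seen."""
        seen = self.counts.get((s, a))
        if not seen:
            return None
        n = sum(seen.values())
        probs = {s2: c / n for s2, c in seen.items()}
        return probs, self.reward_sum[(s, a)] / n


def plan(model, state, actions, depth, gamma=0.9):
    """Exhaustive lookahead in the learned model: (best value, best action)."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        prediction = model.predict(state, a)
        if prediction is None:
            continue                      # no data for this action here
        probs, mean_r = prediction
        future = sum(p * plan(model, s2, actions, depth - 1, gamma)[0]
                     for s2, p in probs.items())
        value = mean_r + gamma * future
        if value > best_value:
            best_value, best_action = value, a
    if best_action is None:               # nothing known about this state
        return 0.0, None
    return best_value, best_action


# Hypothetical logged experience: (state, action, reward, next_state).
experience = [
    ("A", "short", 1.0, "goal"),
    ("A", "long", 0.0, "B"),
    ("B", "go", 10.0, "goal"),
    ("B", "go", 10.0, "goal"),
    ("B", "go", 0.0, "B"),
    ("B", "go", 10.0, "goal"),
]
model = TabularModel()
for s, a, r, s2 in experience:
    model.update(s, a, r, s2)

ACTIONS = ["short", "long", "go"]
one_step = plan(model, "A", ACTIONS, depth=1)
two_step = plan(model, "A", ACTIONS, depth=2)
```

With one step of lookahead the planner prefers `"short"` (value 1.0); with two steps it prefers `"long"` (value about 6.75), because the model predicts the larger reward available from `B`. The estimate for `B` rests on only four samples, which hints at the challenge noted above: planning is only as good as the learned model.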


_The Frontier: Agents that Understand the "Why"_

What we actually want are structured world models and abstract reasoning:
* Not just learning to predict, but learning why things happen
* Not just optimizing a reward, but understanding the structure and rules of the environment
* Not just reacting, but explaining - being able to say:
    "If I do X, Y will likely happen because that's how this world tends to behave"

This connects to areas like:
1. World Models 
* The agent learns a latent-space simulator of the environment (like a mental imagination engine)
* Then it uses this simulator for planning actions without needing the real environment
* Very aligned with the idea: learn how the world works, not just what actions are best

2. Causal Reinforcement Learning
* Learn the causal structure of the environment: "What causes what?"
* Allows better generalization, transfer learning, and counterfactual reasoning
* Example: Knowing that turning off the stove causes the fire to go out - rather than just observing correlation
* Strong overlap with real-world reasoning

3. Self-Reflective and Meta-Cognitive Agents
* Agents that don't just act, but also reflect on their experiences
* Like humans journaling and extracting lessons over time. 
* In RL, this is an area that blends meta-learning with world modeling
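
The difference between observing a correlation and knowing a cause, raised under Causal Reinforcement Learning above, can be shown with a toy simulation. The sketch below is purely illustrative and every name in it is an assumption: a hidden factor `z` drives both the logged action and the outcome, so the action looks helpful in logged data even though it has no effect when the agent sets it itself.

```python
import random


def sample_outcome(rng, action=None):
    """One draw from a toy world; returns (action, outcome).

    If `action` is None, a confounded logging behavior chooses it.
    """
    z = rng.random() < 0.5            # hidden confounder
    if action is None:
        action = 1 if z else 0        # logged behavior copies the confounder
    outcome = 1.0 if z else 0.0       # outcome depends on z only, never on action
    return action, outcome


def mean_outcome(samples, action):
    """Average outcome among samples in which `action` was taken."""
    values = [y for a, y in samples if a == action]
    return sum(values) / len(values)


def observational_gap(n=2000, seed=0):
    """Difference in mean outcome between actions in logged data."""
    rng = random.Random(seed)
    logged = [sample_outcome(rng) for _ in range(n)]
    return mean_outcome(logged, 1) - mean_outcome(logged, 0)


def interventional_gap(n=2000, seed=0):
    """Same difference when the agent sets the action itself."""
    rng = random.Random(seed)
    forced_1 = [sample_outcome(rng, action=1) for _ in range(n)]
    forced_0 = [sample_outcome(rng, action=0) for _ in range(n)]
    return mean_outcome(forced_1, 1) - mean_outcome(forced_0, 0)
```

In logged data the gap is exactly 1.0, which suggests that action 1 "works"; under intervention the gap is close to zero. An agent that only fits correlations in the log would learn the wrong rule, while one that models the causal structure would not.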


_A Practical Vision We Could Build Toward_:

1. Experience: Agent explores an environment (e.g., a factory process, a game, or a simulated chemistry system)
2. Model Learning: It builds an internal model of the system's dynamics. 
3. Abstraction: Over time, it starts clustering patterns, learning rules, and predicting outcomes beyond raw memorization
4. Reasoning: Agent begins to simulate potential futures, even ones it hasn't directly experienced
5. Intuition: Given a new state, the agent doesn't just act - it can say:
    - "Given what I know about this kind of situation, the best action is likely X"
6. Explanation: Optionally, it could output:
    - "I chose X because last time this condition led to a delayed failure, and this structure often signals that."


_Research Concepts to Explore_:

* Model-based RL: Learning dynamics and using them to plan
* World Models: Compress high-dimensional observations into internal simulations
* Causal Inference in RL: Learning cause-effect relationships
* Structured Learning: Learning with graphs, rules, or symbolic representations
* Meta-RL: Learning how to learn across many tasks
* Explainable RL: Agents that can articulate why they chose actions
* Theory of Mind (ToM): Agents that model others' beliefs, intentions, and learning processes (multi-agent)
