# Math Behind Reinforcement Learning (Not for Dummies)

RL : An agent learns to make decisions by interacting with an environment; the agent's goal is to maximize cumulative reward over time.

Basic Components :

1. Agent, the learner/decision maker.
2. Environment, where the agent operates.
3. State (s), the current situation of the environment.
4. Action (a), what the agent can do in that state.
5. Reward (r), feedback received after an action (+ or -).
6. Policy (π), the strategy the agent follows to choose actions.
7. Value Function, predicts how good a state is in terms of expected future reward.
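The components above interact in a loop: the agent observes a state, the policy picks an action, and the environment answers with a reward and a next state. The sketch below illustrates that loop under stated assumptions: `ParityEnv` and `policy` are hypothetical names invented for this example and do not come from any library.

```python
class ParityEnv:
    """Hypothetical toy environment, invented for illustration only."""

    def reset(self):
        self.steps = 0
        return 0  # initial state

    def step(self, action):
        # Reward +1 when the action matches the parity of the step count, else -1.
        reward = 1 if action == self.steps % 2 else -1
        self.steps += 1
        next_state = self.steps % 2
        done = self.steps >= 10  # episode ends after 10 steps
        return next_state, reward, done


def policy(state):
    """A fixed policy: choose the action equal to the current state."""
    return state


env = ParityEnv()
state = env.reset()
total_reward = 0
done = False
while not done:
    action = policy(state)                   # agent decides
    state, reward, done = env.step(action)   # environment responds
    total_reward += reward                   # accumulate feedback

print(total_reward)
```

Here the policy happens to match the environment's reward rule at every step, so the agent collects the maximum possible reward for the episode.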

Markov Decision Process (MDP)

![](https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Bellman-Equation.png?ssl=1)

An MDP models the environment with these elements:

1. State (S)
2. Action (A)
3. Transition Probabilities (P) = P(s'|s,a) : The probability of moving to state s' after taking action a in state s.
4. Reward Function (R) = R(s,a) : The expected reward for each state and action.
5. Discount Factor (γ) : How much future rewards are worth relative to immediate ones (0 < γ < 1).
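As a concrete illustration, the five elements can be written down as plain Python data. This is a minimal sketch: the two-state MDP, its state and action names, and the helper `is_valid_mdp` are all made up for this example.

```python
# Hypothetical 2-state, 2-action MDP, invented for illustration.
states = ["cool", "hot"]
actions = ["slow", "fast"]

# P[(s, a)] maps each next state s' to the probability P(s' | s, a).
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}

# R[(s, a)] is the expected immediate reward R(s, a).
R = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("hot", "slow"): 1.0,
    ("hot", "fast"): -10.0,
}

gamma = 0.9  # discount factor


def is_valid_mdp(states, actions, P, R):
    """Check that every (s, a) pair has a reward and a distribution summing to 1."""
    for s in states:
        for a in actions:
            if (s, a) not in P or (s, a) not in R:
                return False
            if abs(sum(P[(s, a)].values()) - 1.0) > 1e-9:
                return False
    return True


print(is_valid_mdp(states, actions, P, R))
```

A dictionary keyed by `(state, action)` keeps the example readable; for larger problems an array of shape `(n_states, n_actions, n_states)` is a common alternative.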

The Markov Property : The next state and reward depend only on the current state and action, not on the full history.
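Written as an equation, the property says that conditioning on the whole history gives the same distribution as conditioning on the current state and action alone. The time-indexing convention below (reward $R_{t+1}$ following action $A_t$) is an assumption; some texts index the reward as $R_t$ instead.

```latex
P(S_{t+1} = s',\, R_{t+1} = r \mid S_t, A_t)
  = P(S_{t+1} = s',\, R_{t+1} = r \mid S_0, A_0, S_1, A_1, \dots, S_t, A_t)
```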

Goal of RL : Find the policy π(a|s) that maximizes the expected cumulative reward over time. The cumulative reward is also called the return:

Transition Probability 

![](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*VmV-tIr2e1eX24Y_0KMi5w.png)

Random Variable (Return without discounting)

![](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*FdQcldGubZNbfJRrh1GO8g.png)

Discounted Return (return with discounting)

![](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*ZEtDC9eBwSVsQ8_jtnuudg.png)
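To make the two kinds of return concrete, the sketch below computes both for a short reward sequence. It assumes the standard definition, in which the k-th future reward is weighted by γ^k; the function name `discounted_return` and the sample rewards are chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k], accumulated from the last reward back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]

undiscounted = discounted_return(rewards, 1.0)  # gamma = 1: plain sum, 1 + 0 + 2
discounted = discounted_return(rewards, 0.5)    # 1 + 0.5*0 + 0.25*2

print(undiscounted, discounted)  # 3.0 1.5
```

Accumulating backwards uses the recursion G_t = r + γ·G_{t+1}, which is the same identity the Bellman equation builds on.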




```shell
pip install streamlit gym numpy
```

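```python
# Save this as a script and launch it with `streamlit run <script>.py`.
import streamlit as st
import gym
import numpy as np

st.title("RL Demo: Frozen Lake Q-Learning")

# Set up the environment (deterministic transitions)
env = gym.make("FrozenLake-v1", is_slippery=False)
n_states = env.observation_space.n
n_actions = env.action_space.n
Q = np.zeros((n_states, n_actions))  # Q-table: one row per state, one column per action

# Hyperparameters
alpha = 0.8    # learning rate
gamma = 0.95   # discount factor
epsilon = 0.1  # exploration rate

# Training: epsilon-greedy Q-learning
episodes = st.slider("Episodes", 100, 5000, 1000)
for _ in range(episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # Explore with probability epsilon, otherwise act greedily
        if np.random.rand() < epsilon:
            action = np.random.choice(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
        # Stop on a terminal state or when the step limit truncates the episode;
        # checking only `terminated` can loop forever
        done = terminated or truncated

# Test: follow the greedy policy and show the result
state, _ = env.reset()
st.write("### Agent's Path")
path = [state]
done = False
while not done:
    action = int(np.argmax(Q[state]))
    state, _, terminated, truncated, _ = env.step(action)
    path.append(state)
    done = terminated or truncated

st.write(path)
```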
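The core of the training loop is the single Q-learning update `Q[s, a] += alpha * (r + gamma * max(Q[s']) - Q[s, a])`. The sketch below isolates that step in plain Python and works through one update by hand. The helper `q_update` and the 3-state, 2-action table are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def q_update(Q, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error."""
    td_target = reward + gamma * max(Q[next_state])  # bootstrap from the best next action
    td_error = td_target - Q[state][action]
    Q[state][action] += alpha * td_error
    return td_error


# Hypothetical Q-table: 3 states x 2 actions, all zeros except state 2.
Q = [[0.0, 0.0] for _ in range(3)]
Q[2] = [0.0, 1.0]

# One transition: in state 1, take action 0, receive reward 0, land in state 2.
td_error = q_update(Q, state=1, action=0, reward=0.0, next_state=2,
                    alpha=0.8, gamma=0.95)

# td_target = 0 + 0.95 * 1.0 = 0.95, so Q[1][0] moves from 0 to 0.8 * 0.95 = 0.76
print(td_error, Q[1][0])
```

Even though the immediate reward is zero, the value of state 1 rises because the next state already has a positive estimate. This is how reward at the goal propagates backwards through the table over many episodes.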
