## Part 1: Least-to-Most Prompting

### Basic Prompt
**Prompt:** "Explain what a Markov Decision Process is in Pong."  
**Output (GPT):**  
A Markov Decision Process (MDP) is a framework used in reinforcement learning to model decision-making where outcomes are partly random and partly under the control of an agent.

**Reflection:** This is correct but very general. It never mentions Pong and omits the roles of states, actions, and rewards.  

### Intermediate Prompt
**Prompt:** "In the context of Pong, explain how rewards work when the paddle hits or misses the ball. Include specific reward values."  
**Output (GPT):**  
In reinforcement learning, the reward function tells the agent (the paddle) whether it is doing well or poorly. In Pong, rewards are typically assigned based on whether the agent successfully hits the ball or misses it.

1. Hitting the Ball

When the paddle successfully hits the ball back toward the opponent, it is considered a positive action.

Reward value: 0 or sometimes a small positive reward like +0.1 for encouraging good paddle behavior before actually scoring.

Why 0 sometimes? Because the main reward is usually given for scoring a point, not just hitting the ball. However, giving a small positive reward can speed up learning.

2. Scoring a Point

When the ball passes the opponent’s paddle, the agent scores a point.

Reward value: +1

This is the primary positive reinforcement.

3. Missing the Ball

When the paddle fails to hit the ball and it passes the agent, the opponent scores.

Reward value: -1

This is a penalty that discourages the agent from missing the ball.

4. No Event

While the ball is moving and nothing significant happens (no score, no miss), the reward is usually:

Reward value: 0

| Event                  | Reward    |
| ---------------------- | --------- |
| Paddle hits ball       | 0 or +0.1 |
| Paddle misses ball     | -1        |
| Agent scores a point   | +1        |
| Ball in play, no score | 0         |

**Summary**

- Positive reward (+1): scoring a point
- Negative reward (-1): missing the ball
- Optional small reward (+0.1): hitting the ball
- Neutral reward (0): ball in play without scoring

This reward structure teaches the agent to maximize its points while minimizing misses, which is the goal of Pong.

**Reflection:** More useful, since it ties rewards to concrete game events and gives specific values.  
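The reward table above can be sketched as a small lookup. This is only an illustration: the event names and the `make_reward_table` helper are hypothetical, and the `hit_bonus` parameter stands in for the optional +0.1 shaping reward.

```python
def make_reward_table(hit_bonus=0.0):
    """Map each (hypothetical) event name to its reward from the table."""
    return {
        "hit": hit_bonus,   # 0 by default, or a small bonus such as +0.1
        "score": 1.0,       # ball passes the opponent's paddle
        "miss": -1.0,       # ball passes the agent's paddle
        "in_play": 0.0,     # nothing significant happened
    }

# A made-up sequence of events from one short rally
episode = ["in_play", "hit", "in_play", "score", "in_play", "miss"]

sparse = make_reward_table()
shaped = make_reward_table(hit_bonus=0.1)

sparse_total = sum(sparse[event] for event in episode)
shaped_total = sum(shaped[event] for event in episode)

print(sparse_total)            # 0.0
print(round(shaped_total, 2))  # 0.1
```

With the bonus switched off, only scoring and missing change the total, which is why the hit reward is often left at 0.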


### Advanced Prompt
**Prompt:** "Act as a reinforcement learning expert. Explain how MDPs, rewards, and returns work together in Pong, using paddle position and ball trajectory as state examples."  
**Output (GPT):**  
In reinforcement learning, the MDP defines the interaction between the agent (paddle) and environment (ball).

Example State (s): Paddle at y=5, Ball at (10, 5) moving left-down.

Example Action (a): Move paddle down by 1 unit.

The Transition updates the state: ball moves left and down, paddle moves to y=6.

The Reward (R): If the paddle aligns with the ball and bounces it, reward = +1. Otherwise, –1.

The Return (G): The total discounted sum of rewards over future steps, e.g.,


\[
G_0 = R_1 + \gamma R_2 + \gamma^2 R_3 + \dots
\]


Thus, the paddle learns to maximize returns by repeatedly moving to states that give higher chances of bouncing the ball back.



**Reflection:** This ties concepts together and starts showing how the theory maps to the game.  
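The example state, action, and transition from the output above can be written out as a minimal sketch. The `PongState` class and `step` function are hypothetical names introduced here, and the sketch assumes y grows downward and ignores wall bounces and paddle collisions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PongState:
    paddle_y: int
    ball_x: int
    ball_y: int
    ball_vx: int  # -1 = moving left, +1 = moving right
    ball_vy: int  # +1 = moving down (y grows downward), -1 = moving up


def step(state, action):
    """Apply a paddle move (-1, 0, or +1) and advance the ball by its velocity."""
    return PongState(
        paddle_y=state.paddle_y + action,
        ball_x=state.ball_x + state.ball_vx,
        ball_y=state.ball_y + state.ball_vy,
        ball_vx=state.ball_vx,
        ball_vy=state.ball_vy,
    )


# Paddle at y=5, ball at (10, 5) moving left-down
s0 = PongState(paddle_y=5, ball_x=10, ball_y=5, ball_vx=-1, ball_vy=1)
s1 = step(s0, action=+1)  # move the paddle down by 1 unit
print(s1)
# PongState(paddle_y=6, ball_x=9, ball_y=6, ball_vx=-1, ball_vy=1)
```

Because the next state depends only on the current state and the action, this transition has the Markov property that the MDP framing relies on.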


### Expert Prompt
**Prompt:** "First explain MDPs, then show how rewards accumulate into returns in Pong. Verify your explanation by describing how the paddle would learn optimal positioning."  
**Output (GPT):**  
An MDP is defined by states (paddle and ball positions), actions (paddle movements), transitions (ball trajectory and bounce), and rewards (+1 for hit, –1 for miss).

In Pong:

Suppose at time step 0, the paddle moves to intercept the ball. Reward = +1.

At time step 1, the paddle continues tracking the ball. Reward = +1.

The Return (G) is calculated using a discount factor γ, such as:

\[
G_0 = 1 + 0.9(1) + 0.9^2(1) + \dots
\]

Truncated to five steps, the returns G_0, G_1, G_2, … form a decreasing sequence (4.10, 3.44, 2.71, …).

The paddle learns optimal positioning by trial and error. Over many games, it sees that moving towards the ball maximizes the total return, while ignoring the ball leads to negative returns. Eventually, it develops the policy: “Track the ball’s y-position to maximize returns.”

**Reflection:** This is the deepest level. It connects MDP → rewards → return → learning strategy.  
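As a worked check of the first return value quoted in the output above, assume every step yields a reward of +1, γ = 0.9, and the episode is cut off after five steps. The finite geometric sum then gives:

\[
G_0 = \sum_{k=0}^{4} 0.9^k = \frac{1 - 0.9^5}{1 - 0.9} = \frac{0.40951}{0.1} = 4.0951 \approx 4.10
\]

If the series were not truncated, the same assumptions would instead give:

\[
G_0 = \sum_{k=0}^{\infty} 0.9^k = \frac{1}{1 - 0.9} = 10
\]

So the value 4.10 reflects the five-step horizon as much as the discount factor.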


## Part 2: Define the MRP Tuple in Pong

- **S (States):** 
  - Ball position (x, y)
  - Ball velocity (up, down, left, right)
  - Paddle position (y-axis)

- **P (Transition Probabilities):**
  - Given the current state, where will the ball go next?
  - Example: If the ball is moving downwards, 80% chance it continues down, 20% chance it bounces.

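- **R (Rewards):**
  - +1 if the paddle hits the ball
  - –1 if the paddle misses the ball
  - 0 otherwise

- **γ (Discount factor):**
  - A value in [0, 1] that weights future rewards; the sample run below uses γ = 0.9.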
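To make the tuple concrete, here is a minimal sketch of a two-state MRP solved with the Bellman equation. The two-state abstraction (aligned / misaligned), the transition probabilities, and the discount factor of 0.9 are all hypothetical choices for illustration, not values measured from Pong.

```python
import numpy as np

# Hypothetical two-state abstraction: 0 = paddle aligned, 1 = paddle misaligned
P = np.array([
    [0.8, 0.2],   # from aligned: stay aligned 80%, drift out of line 20%
    [0.5, 0.5],   # from misaligned: recover 50%, stay misaligned 50%
])
R = np.array([1.0, -1.0])  # reward collected in each state (hit / miss)
gamma = 0.9                # assumed discount factor

# Bellman equation for an MRP: v = R + gamma * P v
# Rearranged into a linear system: (I - gamma * P) v = R
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)  # state values; the aligned state should be worth more
```

Each row of `P` sums to 1, and there is no action set in the tuple: the paddle's behavior is already folded into the transition probabilities.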


```python
# Sample run: state, action, reward, and return
import numpy as np

# Parameters
n_steps = 5
gamma = 0.9  # discount factor

# State variables
ball_y = 0.5       # initial ball position
paddle_y = 0.5     # paddle starts in the middle
ball_dir = 1       # 1 = moving down (y increasing), -1 = moving up

states, actions, rewards, returns = [], [], [], []

for t in range(n_steps):
    # Action: move the paddle 0.1 toward the ball (simple fixed policy);
    # np.sign returns 0 when paddle and ball are already level
    action = np.sign(ball_y - paddle_y) * 0.1
    paddle_y += action

    # Ball moves
    ball_y += ball_dir * 0.1
    if ball_y >= 1 or ball_y <= 0:
        ball_dir *= -1   # bounce off the top or bottom wall

    # Reward: +1 if the paddle is within 0.1 of the ball, otherwise -1
    # (every step is scored as a hit or a miss; the 0 "otherwise" case is not modeled)
    if abs(ball_y - paddle_y) < 0.1:
        reward = 1
    else:
        reward = -1

    # Record the state reached after this step's transition
    states.append((ball_y, paddle_y))
    actions.append(action)
    rewards.append(reward)

# Compute returns G_t = sum over k >= t of gamma^(k - t) * rewards[k]
for t in range(n_steps):
    G = 0
    discount = 1
    for k in range(t, n_steps):
        G += discount * rewards[k]
        discount *= gamma
    returns.append(G)

# Display results
for t in range(n_steps):
    print(f"Step {t}: State={states[t]}, Action={actions[t]:.2f}, "
          f"Reward={rewards[t]}, Return={returns[t]:.2f}")
```


```
Step 0: State=(0.6, np.float64(0.5)), Action=0.00, Reward=1, Return=4.10
Step 1: State=(0.7, np.float64(0.6)), Action=0.10, Reward=1, Return=3.44
Step 2: State=(0.7999999999999999, np.float64(0.7)), Action=0.10, Reward=1, Return=2.71
Step 3: State=(0.8999999999999999, np.float64(0.7999999999999999)), Action=0.10, Reward=1, Return=1.90
Step 4: State=(0.9999999999999999, np.float64(0.8999999999999999)), Action=0.10, Reward=1, Return=1.00
```



### Interpretation of Results
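- **State:** The state is represented by the ball’s y-position and the paddle’s y-position, recorded after each step's transition.
- **Action:** The action is the signed change in the paddle’s y-position (positive = toward larger y, which is "down" under the code's `ball_dir = 1` convention). In this run the action was 0.00 at step 0, because the paddle and ball started level, and +0.10 at every later step as the paddle followed the ball.
- **Reward:** Each step gave a reward of +1, meaning the paddle stayed within the alignment threshold. Note that the paddle trails the ball by one 0.1 step, so the strict `< 0.1` check passes only because of floating-point rounding (for example, `0.6 - 0.5` evaluates to `0.09999999999999998`).
- **Return:** Returns decreased over time (4.10 → 1.00) mainly because the run stops after five steps, so each later step has fewer remaining +1 rewards to accumulate. The discount factor (γ = 0.9) further shrinks the weight of later rewards, which is why the first return is 4.10 rather than 5.
- **Connection to Theory:** This matches the definition of the discounted return, where the code's `rewards[t]` plays the role of R_{t+1}:
  G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ...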
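
The same returns can be computed in a single backward pass using the recursion G_t = R_{t+1} + γG_{t+1}. The sketch below is illustrative: `discounted_returns` is a hypothetical helper, not part of the notebook code, and it assumes the five +1 rewards and γ = 0.9 from the sample run.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{n-1}] for a finite reward sequence."""
    returns = [0.0] * len(rewards)
    running = 0.0  # return from the following step; 0 past the end of the episode
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


sample_returns = discounted_returns([1, 1, 1, 1, 1], gamma=0.9)
print([f"{g:.2f}" for g in sample_returns])
# ['4.10', '3.44', '2.71', '1.90', '1.00']
```

The backward pass visits each reward once, whereas the nested loop in the sample run re-sums the tail of the list for every step.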




## 6. Reflection & Talking Points

- This exercise demonstrated how an **MDP (Markov Decision Process)** reduces to an **MRP (Markov Reward Process)** once the agent’s policy is fixed. Here the policy was the simple rule "move the paddle toward the ball."  
- In the Pong example, the **state** was represented by the paddle and ball positions, while the **action** was the paddle movement.  
- The simulation confirmed the **reward mechanism**: every correct paddle alignment gave a reward of +1, and the discounted return decreased step by step (4.10 → 3.44 → 2.71 → 1.90 → 1.00).  
- This sequence matches the theoretical formula for returns:  
  \[
  G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
  \]  
- In the run, the paddle consistently tracked the ball, so every reward was +1 and the return stayed high. A missed step would instead contribute –1, and a run of misses would push the return negative.  
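
To illustrate the contrast between tracking and missing, the sketch below compares two fixed policies. It makes several assumptions that are not in the notebook: positions are integers in tenths of the court height (to avoid floating-point rounding), a hit means the paddle is within one grid unit of the ball, and `simulate`, `track`, and `stay` are hypothetical names.

```python
def simulate(policy, n_steps=5, gamma=0.9):
    """Run a short episode on an integer grid and return the discounted return G_0."""
    ball_y, paddle_y, ball_dir = 5, 5, 1   # positions in tenths of the court height
    g0, discount = 0.0, 1.0
    for _ in range(n_steps):
        paddle_y += policy(ball_y, paddle_y)   # action chosen from the current state
        ball_y += ball_dir
        if ball_y >= 10 or ball_y <= 0:
            ball_dir *= -1                     # bounce off the wall
        reward = 1 if abs(ball_y - paddle_y) <= 1 else -1
        g0 += discount * reward
        discount *= gamma
    return g0


def track(ball_y, paddle_y):
    """Move one unit toward the ball, or stay put if already level."""
    return (ball_y > paddle_y) - (ball_y < paddle_y)


def stay(ball_y, paddle_y):
    """Never move the paddle."""
    return 0


print(f"track: {simulate(track):.2f}")  # track: 4.10
print(f"stay:  {simulate(stay):.2f}")   # stay:  -2.10
```

Under these assumptions the stationary paddle still scores +1 on the first step, because the ball has moved only one unit away, and then misses on every later step.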

