# Reinforcement Learning Practical: Weighted Blackjack (MDP)
This notebook demonstrates how to use a modified Blackjack environment to apply Markov Decision Process (MDP) concepts and Bellman equations.
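
For reference, the two Bellman equations for action values can be written as below. The notation is the standard textbook one and is an assumption here, not something this notebook defines: $\gamma$ is a discount factor (commonly set to 1 for an episodic game such as Blackjack), $\pi$ is a policy that selects $A_{t+1}$, and $R_{t+1}$, $S_{t+1}$ are the reward and state that follow action $a$ in state $s$.

```latex
\begin{align}
Q^{\pi}(s,a) &= \mathbb{E}\left[ R_{t+1} + \gamma\, Q^{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s,\ A_t = a \right] \\
Q^{*}(s,a)   &= \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^{*}(S_{t+1}, a') \mid S_t = s,\ A_t = a \right]
\end{align}
```

The first line is the Bellman expectation equation for a fixed policy; the second is the Bellman optimality equation, in which the next action is chosen greedily.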

## Step 1: Import Required Libraries

In [None]:
import gym  # env.step() returning five values (terminated, truncated) requires gym >= 0.26
import random
from gym.envs.toy_text.blackjack import BlackjackEnv


## Step 2: Create a Weighted Blackjack Environment
We modify card drawing so that each card's value (1 to 10, chosen uniformly) is given a random sign:
- with probability 2/3 it is a black card (positive value)
- with probability 1/3 it is a red card (negative value)
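
To see what this weighting does to the deck on average, the expected value of a single card can be computed exactly. The sketch below is a standalone calculation, not part of the environment; the names `CARD_VALUES`, `P_BLACK`, `P_RED`, and `expected_card_value` are hypothetical, and it assumes the ten values 1 to 10 are equally likely.

```python
from fractions import Fraction

CARD_VALUES = range(1, 11)   # assumed: values 1..10, each equally likely
P_BLACK = Fraction(2, 3)     # probability of a positive (black) card
P_RED = Fraction(1, 3)       # probability of a negative (red) card


def expected_card_value():
    """Exact expected value of one draw under the weighted rule."""
    mean_face = Fraction(sum(CARD_VALUES), len(CARD_VALUES))  # 55/10 = 11/2
    # A black card contributes +value, a red card contributes -value.
    return (P_BLACK - P_RED) * mean_face


print(expected_card_value())  # 11/6
```

The mean is positive (11/6, about 1.83), so hands still tend to grow as cards are drawn, only more slowly than with an all-positive deck.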

In [None]:
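import gym.envs.toy_text.blackjack as blackjack_module

def weighted_draw_card(np_random):
    """Draw a value from 1 to 10; black (+) with probability 2/3, red (-) with 1/3."""
    card = random.choice([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    return card if random.random() < 2/3 else -card

class WeightedBlackjackEnv(BlackjackEnv):
    """Blackjack whose cards come from weighted_draw_card."""

# BlackjackEnv draws cards through the module-level function draw_card(np_random),
# so a draw_card method defined on the subclass would never be called.
# Replacing the module function affects every BlackjackEnv in this session.
blackjack_module.draw_card = weighted_draw_card

env = WeightedBlackjackEnv()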


## Step 3: Expected Value Function (Bellman Expectation)
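This function estimates the action value by Monte Carlo sampling, averaging the reward over `trials` simulated steps:
Q(s,a) ≈ E[R | s,a]
Each trial simulates a single step, so the estimate covers only the immediate reward: a hit that does not bust contributes 0, and rewards from later actions are not included.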
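
As a cross-check on a sampled one-step estimate, the value of hitting can also be worked out exactly under the card rules from Step 2. The sketch below does not use gym and rests on stated assumptions: the hand is treated as a hard total (no usable ace), a bust pays -1, and any other hit pays 0 for that step. The function names `exact_hit_value` and `sampled_hit_value` are hypothetical.

```python
import random
from fractions import Fraction


def exact_hit_value(player_sum):
    """Exact expected one-step reward of hitting from a hard total."""
    # Only a positive (black) card can bust; count the values that do.
    busting = sum(1 for value in range(1, 11) if player_sum + value > 21)
    return -Fraction(2, 3) * Fraction(busting, 10)


def sampled_hit_value(player_sum, trials=10_000, seed=0):
    """Monte Carlo estimate of the same quantity, for comparison."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        card = rng.choice(range(1, 11))
        if rng.random() >= 2 / 3:   # red card with probability 1/3
            card = -card
        if player_sum + card > 21:  # bust: reward -1, otherwise 0
            total -= 1
    return total / trials


print(exact_hit_value(20))  # -3/5
```

With a total of 20, nine of the ten values bust when the card is black, giving -(2/3)(9/10) = -3/5. A sampled estimate should land close to that figure, and its spread shrinks as `trials` grows.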

In [None]:
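# Separate environment for simulation, so that sampling never disturbs
# the hands of an episode that is in progress on `env`.
sim_env = WeightedBlackjackEnv()

def expected_value(state, action, trials=100):
    """Monte Carlo estimate of the expected one-step reward for (state, action)."""
    total_reward = 0.0
    for _ in range(trials):
        sim_env.reset()
        # Approximate the state: a single card equal to the player's sum,
        # and the dealer's showing card. The usable-ace flag is ignored.
        sim_env.player = [state[0]]
        sim_env.dealer = [state[1]]
        _, reward, terminated, truncated, _ = sim_env.step(action)
        # A non-terminal step returns reward 0, so it can be added unconditionally.
        total_reward += reward
    return total_reward / trials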


## Step 4: Optimal Policy Using Bellman Optimality
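The policy is greedy with respect to the estimates from Step 3: it returns the action with the larger estimated value, π(s) = argmax_a Q(s,a), and chooses STICK on a tie. Because the estimates are sample averages, repeated calls on the same state can disagree when the two values are close.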
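
The selection rule itself can be isolated from the simulation. The following is a minimal sketch, assuming the action encoding 0 = stick and 1 = hit used in this notebook; `greedy_action` and `example_q` are hypothetical names, and the numbers are illustrative, not measured results.

```python
STICK, HIT = 0, 1  # action encoding used in this notebook


def greedy_action(q_values):
    """Return the action with the largest estimated value; ties go to STICK."""
    # max() keeps the first maximal item, and STICK is listed first,
    # so equal estimates resolve to STICK.
    return max((STICK, HIT), key=lambda action: q_values[action])


# Illustrative numbers only.
example_q = {STICK: -0.20, HIT: -0.60}
print(greedy_action(example_q))  # 0
```

Resolving ties toward STICK is a design choice: when the estimates give no reason to prefer either action, the hand ends without drawing another card.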

In [None]:
def optimal_policy(state):
    hit_value = expected_value(state, 1)
    stick_value = expected_value(state, 0)
    print("  Bellman Evaluation:")
    print(f"    Q(s, Hit)   ≈ {hit_value:.3f}")
    print(f"    Q(s, Stick) ≈ {stick_value:.3f}")
    if hit_value > stick_value:
        print("    -> Optimal Action: HIT\n")
        return 1
    else:
        print("    -> Optimal Action: STICK\n")
        return 0


## Step 5: Agent–Environment Interaction Loop
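The agent plays five episodes. At each step it selects an action with the greedy policy from Step 4, applies it to the environment, and prints the outcome once a terminal state is reached.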

In [None]:
for i_episode in range(5):
    state, _ = env.reset()  # reset() returns (observation, info) in the API used here
    print("\n================ NEW EPISODE ================")
    while True:
        print(f"MDP State s = {state}")
        action = optimal_policy(state)
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        if done:
            print("Terminal State Reached")
            print(f"Reward r = {reward}")
            if reward > 0:
                print("Outcome: WIN")
            elif reward < 0:
                print("Outcome: LOSS")
            else:
                print("Outcome: DRAW")
            break


## Conclusion
This practical demonstrates:
- MDP modeling using Blackjack
- Bellman expectation through simulation
- Bellman optimality for policy selection
- Agent–environment interaction cycle