# **Reinforcement Learning Assignment**

Your task is to complete the implementation of the value iteration algorithm in the code below.

Steps:
- Read the code carefully to get an idea of how it works.
- Run all the cells. You should get an output that shows more losses than wins like this:

```
Results after 1000 simulations: {'win': 310, 'loss': 690, 'draw': 0}

```

- Implement the `value_iteration` function without changing its signature or its return type.

- Once value iteration is implemented correctly, most runs should show **more wins than losses**. You should get output like this:

```
Results after 1000 simulations: {'win': 478, 'loss': 448, 'draw': 74}

Results after 1000 simulations: {'win': 502, 'loss': 425, 'draw': 73}

```
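
For reference, the standard value iteration update (the Bellman optimality backup) can be written as below. The mapping to the notebook's names is an assumption based on the function signature: $P$ corresponds to `transition_probabilities[(state, action, next_state)]`, $R$ to `rewards[(state, action, next_state)]`, and $\gamma$ to `discount_factor`.

$$
V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V(s')\bigr]
$$

The sweep over all states is repeated until the largest change in any $V(s)$ during one sweep falls below `theta`.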

**Notes:**
- You may change the default values of `theta` and `discount_factor` (gamma) in the `value_iteration` function, but do not change the signature or return type.
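
To build intuition for how the discount factor and `theta` interact, here is a minimal sketch on a three-state toy problem. Everything in it (`toy_states`, `toy_P`, `toy_R`, `toy_value_iteration`) is hypothetical and exists only for illustration; it reuses the notebook's `(state, action, next_state)` dictionary format but is not the blackjack model and not the required solution.

```python
# Toy MDP in the same dictionary format as the notebook:
# keys are (state, action, next_state) tuples.
toy_states = ["A", "B", "T"]   # "T" is an absorbing terminal state
toy_actions = ["go", "stay"]

# Deterministic transitions; any triple not listed has probability 0.
toy_P = {
    ("A", "go", "B"): 1.0, ("A", "stay", "A"): 1.0,
    ("B", "go", "T"): 1.0, ("B", "stay", "B"): 1.0,
    ("T", "go", "T"): 1.0, ("T", "stay", "T"): 1.0,
}
# Only the move from B into T is rewarded; every other reward is 0.
toy_R = {("B", "go", "T"): 1.0}


def toy_value_iteration(states, actions, P, R, discount_factor=0.9, theta=1e-6):
    """In-place value iteration; returns the values and the number of sweeps."""
    V = {s: 0.0 for s in states}
    sweeps = 0
    while True:
        delta = 0.0
        for s in states:
            # Expected return of each action, then keep the best one.
            best = max(
                sum(
                    P.get((s, a, s2), 0.0)
                    * (R.get((s, a, s2), 0.0) + discount_factor * V[s2])
                    for s2 in states
                )
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        sweeps += 1
        # Stop once no state changed by more than theta in a full sweep.
        if delta < theta:
            return V, sweeps


toy_V, sweeps = toy_value_iteration(toy_states, toy_actions, toy_P, toy_R)
toy_V_half, _ = toy_value_iteration(
    toy_states, toy_actions, toy_P, toy_R, discount_factor=0.5
)
print(toy_V, sweeps)
print(toy_V_half)
```

In this toy problem the value of `"B"` is the immediate reward of 1, and the value of `"A"` equals the discount factor, because the reward is one step further away. A smaller `theta` only matters when values are still changing by small amounts between sweeps.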

**Submission:**
- Submit your notebook link and the number of wins and losses via Gradescope.


**AI Policy:**
- The use of any artificial intelligence (AI) or code generation tools in this assignment is **strictly prohibited**. Violating this rule may result in your disenrollment from this course. You may be required to orally defend your submission. Failure to satisfactorily explain your work will result in a zero for the assignment, and you may fail the course.

In [None]:
!pip install matplotlib
!pip install numpy
!pip install seaborn
!pip install tqdm
!pip install gymnasium
%matplotlib inline

In [None]:
HIT = 1 # deal another card / draw
STICK = 0 # stop dealing cards

# each state is a tuple of (player_value, dealer_facing_card_value, usable_ace)
actions = [HIT, STICK]

states = []
for player_value in range(4, 22):
    for dealer_facing_card_value in range(1, 11):
        for usable_ace in [True, False]:
            states.append((player_value, dealer_facing_card_value, usable_ace))

transition_probabilities = {}
for state in states:
    for action in actions:
        for next_state in states:
            if state[0] >= next_state[0]:
                transition_probabilities[(state, action, next_state)] = 0
            else:
                transition_probabilities[(state, action, next_state)] = 0.0025
rewards = {}
for state in states:
    for action in actions:
        for next_state in states:
            if state[0] == 21:  # Simplified condition for winning
                rewards[(state, STICK, next_state)] = 1
                rewards[(state, HIT, next_state)] = -1
            elif next_state[0] == 21:
                rewards[(state, action, next_state)] = 1
            else:
                rewards[(state, action, next_state)] = -0.5

def value_iteration(states, actions, transition_probabilities, rewards, discount_factor=0.99, theta=0.001):
    # Initialize V-values of the states
    V = {state: 0 for state in states}

    while True:
        delta = 0
        # For each state, calculate the new value based on the Bellman equation
        for state in states:
            previous_value = V[state]
            max_expected_value = float('-inf')

            # Evaluate each possible action
            for action in actions:
                value = 0
                # Calculate the expected value for each possible next state
                for next_state in states:
                    value += transition_probabilities[(state, action, next_state)] * \
                             (rewards[(state, action, next_state)] + discount_factor * V[next_state])
                max_expected_value = max(max_expected_value, value)
            V[state] = max_expected_value
            delta = max(delta, abs(previous_value - V[state]))
        if delta < theta:
            break

    return V



# Run value iteration
V = value_iteration(states, actions, transition_probabilities, rewards)

# Display some of the results
for state in [(12, 1, True), (21, 10, False), (15, 5, True)]:  # Example states
    print(f"Value of state {state}: {V[state]}")


# for state in states:
#     if V[state] > 0:
#         print(f"Value of state {state}: {V[state]}")

import random
def draw_card():
    """Draw a card with values between 1 and 10, simulating a simplified deck."""
    return min(random.randint(1, 13), 10)

def use_policy(state, V):
    """Decide action based on the policy derived from V."""
    hit_value = V.get((state[0] + draw_card(), state[1], state[2]), 0)
    stick_value = V.get((state[0], state[1], state[2]), 0)
    return HIT if hit_value > stick_value else STICK

def simulate_game(V):
    """Simulate a single game of blackjack based on the policy derived from V."""
    player_value, dealer_value = draw_card(), draw_card()
    usable_ace = player_value == 1
    if usable_ace: player_value += 10  # Simplify ace handling: always count as 11 when first drawn

    # Player's turn
    while True:
        action = use_policy((player_value, dealer_value, usable_ace), V)
        if action == STICK or player_value >= 21:
            break
        card = draw_card()
        if card == 1 and player_value <= 10:
            player_value += 11  # Simplify ace handling: count as 11 if beneficial
            usable_ace = True
        else:
            player_value += card

    # Dealer's turn
    while dealer_value < 17:
        dealer_value += draw_card()
        if dealer_value > 21:  # Dealer busts
            return 'win'

    # Determine outcome
    if player_value > 21 or (player_value < dealer_value and dealer_value <= 21):
        return 'loss'
    elif player_value == dealer_value:
        return 'draw'
    else:
        return 'win'

# Run simulation 1000 times
results = {'win': 0, 'loss': 0, 'draw': 0}
for _ in range(1000):
    result = simulate_game(V)
    results[result] += 1

print(f"Results after 1000 simulations: {results}")


Value of state (12, 1, True): -0.16471004043361878
Value of state (21, 10, False): 0.0
Value of state (15, 5, True): -0.07434060389623591
Results after 1000 simulations: {'win': 524, 'loss': 405, 'draw': 71}
