# Team information

| Team member 1     | Details  | Team member 2     | Details  |
| :---------------- | :------: | :---------------- | :------: |
| Name              | Nadia Victoria Aritonang         | Name              | Reiner Anggriawan Jasin |
| NUSNet (Exxxxxxx) | E1505949         | NUSNet (Exxxxxxx) | E1503344 |
| Matric (AxxxxxxxZ)| A0314698N         | Matric (AxxxxxxxZ)| A0314502W |


In [1]:
# Connect to Google Drive to save your model, etc.

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Installation and setup

The gym environment requires an older version of numpy (and correspondingly older versions of related packages). <br>
The following cell writes the `requirements.txt` used to set up the Python environment for the rest of this notebook.


In [2]:
%%writefile requirements.txt

cloudpickle==3.1.1
contourpy==1.3.0
cycler==0.12.1
filelock==3.18.0
fonttools==4.56.0
fsspec==2025.3.0
gym==0.26.2
gym-notices==0.0.8
importlib_metadata==8.6.1
importlib_resources==6.5.2
Jinja2==3.1.6
kiwisolver==1.4.7
MarkupSafe==3.0.2
matplotlib==3.9.4
mpmath==1.3.0
networkx==3.2.1
numpy==1.24.2
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
packaging==24.2
pillow==11.1.0
ply==3.11
pygame==2.6.1
pyparsing==3.2.1
python-dateutil==2.9.0.post0
six==1.17.0
sympy==1.13.1
torch==2.6.0
tqdm==4.67.1
triton==3.2.0
zipp==3.21.0

Overwriting requirements.txt


Now install the requirements.

You may be asked to restart the session to load the installed package versions. If so, restart the session and continue with the notebook.

In [3]:
!pip install -r requirements.txt

Collecting contourpy==1.3.0 (from -r requirements.txt (line 3))
  Using cached contourpy-1.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.4 kB)
Collecting kiwisolver==1.4.7 (from -r requirements.txt (line 13))
  Using cached kiwisolver-1.4.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB)
Collecting matplotlib==3.9.4 (from -r requirements.txt (line 15))
  Using cached matplotlib-3.9.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting numpy==1.24.2 (from -r requirements.txt (line 18))
  Using cached numpy-1.24.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting pillow==11.1.0 (from -r requirements.txt (line 33))
  Downloading pillow-11.1.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (9.1 kB)
Collecting pyparsing==3.2.1 (from -r requirements.txt (line 36))
  Using cached pyparsing-3.2.1-py3-none-any.whl.metadata (5.0 kB)
Using cached contourpy-1.3.0-cp311
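
Once the install has finished (and after any session restart), it can be worth confirming that the pinned versions are the ones actually loaded, since a later `pip install` may replace them. The following is a minimal sketch using only the standard library; the helper names `parse_pins` and `find_mismatches` are illustrative and not part of the original notebook.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for every `name==version` line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an exact pin
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed)} wherever the two differ.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Two of the pins from requirements.txt, used here as a small sample
sample = "numpy==1.24.2\ngym==0.26.2\n"
for name, (pinned, installed) in find_mismatches(parse_pins(sample)).items():
    print(f"{name}: pinned {pinned}, installed {installed}")
```

In the notebook itself, the text passed to `parse_pins` could be read from the `requirements.txt` file written earlier.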

We will use a discretized version of the [elevator domain](https://ataitler.github.io/IPPC2023/elevator.html) from the International Planning Competition, 2023.

Install the pyRDDL gym environment using the given repository.

In [2]:
!pip install -q git+https://github.com/tasbolat1/pyRDDLGym.git --force-reinstall

## Install other packages if needed

  Preparing metadata (setup.py) ... done
  Building wheel for pyRDDLGym (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.18.0 requires numpy<2.1.0,>=1.26.0, but you have numpy 2.2.4 which is incompatible.
dopamine-rl 4.1.2 requires gym<=0.25.2, but you have gym 0.26.2 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.2.4 which is incompatible.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import copy
import itertools
import numpy as np
import random
import tqdm
import matplotlib.pyplot as plt
from collections import deque

from pyRDDLGym.Visualizer.MovieGenerator import MovieGenerator # loads visualizer utilities
from IPython.display import Image, display, clear_output # for displaying gifs in colab
from pyRDDLGym.Elevator import Elevator # imports Discrete Elevator

## Add more imports here as required
from collections import namedtuple



# Environment Initialization

In [2]:
## IMPORTANT: Do not change the instance of the environment.
env = Elevator(instance = 5)

print('Discrete environment actions:')
print(env.disc_actions)
print('Continuous environment actions:')
print(env.base_env.action_space)
print(f"Observation space size for the discrete Elevator Environment: {len(env.disc_states)}")

<op> is one of {<=, <, >=, >}
<rhs> is a deterministic function of non-fluents or constants only.
>> ( sum_{?f: floor} [ elevator-at-floor(?e, ?f) ] ) == 1


/home/reiner/github/elevator_problem/venv/lib/python3.12/site-packages/pyRDDLGym/Examples /home/reiner/github/elevator_problem/venv/lib/python3.12/site-packages/pyRDDLGym/Examples/manifest.csv
Available example environment(s):
PropDBN -> Simple propositional DBN.
MarsRover -> Multi Rover Navigation, where a group of agent needs to harvest mineral.
NewLanguage -> Example with new language features.
NewtonZero -> Example with Newton root-finding method.
HVAC -> Multi-zone and multi-heater HVAC control problem
SupplyChain -> A supply chain with factory and multiple warehouses.
SupplyChainNet -> A supply chain network with factory and multiple warehouses.
RaceCar -> A simple continuous MDP for the racecar problem.
Traffic -> BLX/QTM traffic model.
Wildfire -> A boolean version of the wildfire fighting domain.
Reservoir_continuous -> Continuous action version of management of the water level in interconnected reservoirs.
Reservoir_discrete -> Discrete version of management of the water leve

# Hyperparameters

In [3]:
# Define hyperparameters

## IMPORTANT: <BEGIN> DO NOT CHANGE THIS CODE!
## GENERAL HYPERPARAMS
num_episodes = 3000
## IMPORTANT: <END> DO NOT CHANGE THIS CODE!

# learning_rate = 3e-4
learning_rate = 0.005
batch_size = 64
clip_value = 1.0  # Gradient clipping value

## ALGO SPECIFIC HYPERPARAMS
# Update the hyperparams as necessary for your implementation



# Model Definition

Define your model here. You can rename the class `YourModel` appropriately and use it later in the code.
Note: In case of actor-critic or other models, all components must subclass `nn.Module`

- Your model should take in 13 inputs (one per entry of `env_features`), which will be derived from the `convert_state_to_list` function.
- Your model should return 6 values corresponding to action logits or probabilities.

In [4]:
class YourModel(nn.Module):
    def __init__(self, input_size=13, hidden_size=128, output_size=6):
        super(YourModel, self).__init__()
        # Your model layers and initializations here
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x will be a tensor with shape [batch_size, 13]
        # Your forward pass logic here
        # Ensure the output has shape [batch_size, 6]
        x = self.fc1(x)
        x = F.relu(x)
        output = self.fc2(x)

        return output
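
For a sense of scale, the parameter count of this two-layer network can be worked out by hand from the layer sizes used in this notebook (13 inputs, 128 hidden units, 6 outputs). The helper `mlp_param_count` is a hypothetical name introduced only for this sketch.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total


# 13*128 + 128 (fc1) plus 128*6 + 6 (fc2)
print(mlp_param_count([13, 128, 6]))  # 2566
```

The same figure should be what `sum(p.numel() for p in your_network.parameters())` reports once the network is built.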

# Feature Extraction

In [5]:
## IMPORTANT: DO NOT CHANGE THIS CODE!
env_features = list(env.observation_space.keys())

def convert_state_to_list(state, env_features):
    out = []
    for i in env_features:
        out.append(state[i])
    return out
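
As a quick illustration of what `convert_state_to_list` does: it reads the state dictionary in the fixed order given by `env_features`, so the network always sees the same feature at the same input position. The feature names in this sketch are made up for illustration and are not the environment's real keys.

```python
def convert_state_to_list(state, env_features):
    # Same behaviour as the notebook's helper, written as a comprehension
    return [state[name] for name in env_features]


# Hypothetical feature names, for illustration only
env_features = ["elevator_floor", "door_open", "waiting_f0"]
state = {"door_open": 1, "waiting_f0": 3, "elevator_floor": 2}

# The output follows env_features, not the dictionary's insertion order
print(convert_state_to_list(state, env_features))  # [2, 1, 3]
```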

# Neural Net Initialization

In [6]:
# Initialize the network and optimizer
input_size = len(env_features)
output_size = 6

# INITIALIZE OTHER NETWORK PARAMS HERE
hidden_size = 128

# INITIALIZE YOUR NETWORK HERE
your_network = YourModel(input_size, hidden_size, output_size)

# INIT OPTIMIZER - Adam is a good start, but you can try changing this as well
# optimizer = optim.Adam(
#     your_network.parameters(), lr=learning_rate
# )

optimizer = optim.RMSprop(
    your_network.parameters(), lr=learning_rate,
)


In [7]:
import torch.optim.lr_scheduler as lr_scheduler

# step_size is a number of scheduler steps, so use integer division
scheduler = lr_scheduler.StepLR(optimizer, step_size=num_episodes // 100, gamma=0.9)
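
To see what this schedule implies over a full run, the step decay can be evaluated in closed form: the rate is multiplied by `gamma` once every `step_size` scheduler steps. The sketch below uses the values from this notebook (initial rate 0.005, a step every 30 episodes, `gamma=0.9`) and assumes the scheduler is stepped once per episode; `step_lr` is an illustrative helper, not a PyTorch function.

```python
def step_lr(initial_lr, episode, step_size, gamma):
    """Learning rate after `episode` scheduler steps under a step decay."""
    return initial_lr * gamma ** (episode // step_size)


initial_lr, step_size, gamma = 0.005, 30, 0.9  # values from this notebook
for episode in (0, 29, 30, 300, 1500, 3000):
    print(episode, step_lr(initial_lr, episode, step_size, gamma))
```

By episode 3000 the rate has been multiplied by `0.9 ** 100`, roughly 2.7e-5, so updates late in training are very small. Whether that is desirable depends on how quickly the agent converges.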

In [8]:
# Convert networks to CUDA if available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
your_network.to(device)

# Define other constructs (replay buffers, etc) as necessary

YourModel(
  (fc1): Linear(in_features=13, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=6, bias=True)
)

## Gradient Clipping (Optional, you can use torch's version as well)

In [9]:
# Define a function for gradient clipping
def clip_grads(model, clip_value):
    for param in model.parameters():
        if param.grad is not None:
            param.grad.data = torch.clamp(param.grad.data, -clip_value, clip_value)
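
`clip_grads` clamps every gradient element independently, which can change the direction of the update. Clipping by norm rescales the whole gradient instead. The pure-Python sketch below contrasts the two on a toy gradient; it is only an illustration of the arithmetic. In PyTorch the corresponding utilities are `torch.nn.utils.clip_grad_value_` and `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_value(grads, clip_value):
    """Clamp each element into [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]


def clip_by_norm(grads, max_norm):
    """Rescale the whole vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


grads = [3.0, -4.0]                 # L2 norm is 5
print(clip_by_value(grads, 1.0))    # [1.0, -1.0]: direction changes
print(clip_by_norm(grads, 1.0))     # approximately [0.6, -0.8]: direction kept
```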

# Live Plotting Setup

In [10]:
# Create a figure for plotting
plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(10, 6))
plt.ion()

# Lists to store rewards and episode numbers
rewards_list = []
episodes = []

def exponential_smoothing(data, alpha=0.1):
    """Compute exponential smoothing."""
    smoothed = [data[0]]  # Initialize with the first data point
    for i in range(1, len(data)):
        st = alpha * data[i] + (1 - alpha) * smoothed[-1]
        smoothed.append(st)
    return smoothed

def live_plot(data_dict, figure, ylabel="Total Rewards"):
    """Plot the live graph."""
    clear_output(wait=True)
    ax.clear()
    for label, data in data_dict.items():
        if label == "Total Reward":
            ax.plot(data, label=label, color="yellow", linestyle='--')

            # Compute and plot moving average for total reward
            ma = exponential_smoothing(data)
            ma_idx_start = len(data) - len(ma)
            ax.plot(range(ma_idx_start, len(data)), ma, label="Smoothed Value", linestyle="-", color="purple", linewidth=2)
        else:
            ax.plot(data, label=label)
    ax.set_ylabel(ylabel)
    ax.legend(loc='upper left')
    display(figure)
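
The smoothed curve follows the recurrence `s_t = alpha * x_t + (1 - alpha) * s_{t-1}`, seeded with the first data point. A small worked example, using a larger `alpha` than the plot's default of 0.1 so the numbers are easy to check by hand:

```python
def exponential_smoothing(data, alpha=0.1):
    """Same recurrence as the plotting helper in this notebook."""
    smoothed = [data[0]]
    for value in data[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# 0.5*10 + 0.5*0 = 5.0, then 0.5*10 + 0.5*5.0 = 7.5
print(exponential_smoothing([0.0, 10.0, 10.0], alpha=0.5))  # [0.0, 5.0, 7.5]
```

With `alpha=0.1`, each new episode contributes only 10% to the smoothed value, so the purple curve reacts slowly to changes in reward.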


# RL Algorithm

In [11]:
# Define the loss calculation function
def calculate_loss(batch):
    ## TODO - CALCULATE LOSS VALUE & RETURN IT
    state_tensor, action, reward, next_state_tensor, done = zip(*batch)

    state_tensor = torch.tensor(state_tensor, dtype=torch.float32, device=device)
    action = torch.tensor(action, dtype=torch.long, device=device)
    reward = torch.tensor(reward, dtype=torch.float32, device=device)
    next_state_tensor = torch.tensor(next_state_tensor, dtype=torch.float32, device=device)
    # Keep the terminal flag in float32 so it matches the other tensors
    done = torch.tensor(done, dtype=torch.float32, device=device)

    q_values = your_network(state_tensor)
    q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_q_values = your_network(next_state_tensor)
        next_q_value = next_q_values.max(1)[0]
        target_q_value = reward + (1 - done) * 0.99 * next_q_value  # Discounted reward

    loss = F.mse_loss(q_value, target_q_value.detach())

    return loss
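
The target used in `calculate_loss` is the one-step Q-learning target: the observed reward plus the discounted best next-state value, with the bootstrap term dropped on terminal transitions. A scalar sketch of that arithmetic, with made-up numbers and an illustrative helper name `td_target`:

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target for a single transition."""
    # On terminal transitions there is no next state to bootstrap from
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    return reward + bootstrap


next_q = [0.5, 2.0, -1.0]  # made-up Q-values for the next state
print(td_target(-1.0, next_q, done=False))  # about 0.98, i.e. -1 + 0.99 * 2.0
print(td_target(-1.0, next_q, done=True))   # -1.0
```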

In [12]:
def choose_action(state_tensor, epsilon=0.1):
    ## TODO - RETURN AN INTEGER FROM 0 - 5 (both inclusive) based on your model training/testing strategy

    if random.random() < epsilon:
      return random.randint(0, output_size - 1)
    else:
      state_dimension = state_tensor.unsqueeze(0)

      with torch.no_grad():
        q_values = your_network(state_dimension)

      return torch.argmax(q_values).item()
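
The training loop calls `choose_action` with `epsilon=max(0.01, 0.1 * (0.99 ** episode))`. A short sketch of how quickly that schedule reaches its floor; `epsilon_at` is a hypothetical helper that only wraps the same expression.

```python
def epsilon_at(episode, start=0.1, decay=0.99, floor=0.01):
    """Exploration rate used for a given episode index."""
    return max(floor, start * decay ** episode)


# First episode at which the schedule is pinned to the floor value
first_floor_episode = next(e for e in range(3000) if epsilon_at(e) == 0.01)
print(first_floor_episode)  # 230
```

Exploration therefore starts at 10% random actions and stays at 1% for most of the 3000 episodes.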

In [13]:

Transition = namedtuple('Transition',
                        ('state', 'action', 'reward', 'next_state', 'done'))

class ReplayMemory(object):

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        """Save a transition"""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

memory = ReplayMemory(10000)
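
The buffer relies on `deque(maxlen=...)`, which silently discards the oldest transition once capacity is reached. The sketch below shows that eviction with a tiny capacity and dummy transitions, and how a sampled batch is unzipped into per-field tuples the way `calculate_loss` does.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple(
    "Transition", ("state", "action", "reward", "next_state", "done")
)

buffer = deque(maxlen=3)  # tiny capacity to make eviction visible
for step in range(5):
    # Dummy transition: the state is just the step index
    buffer.append(Transition([step], 0, -1.0, [step + 1], False))

print(len(buffer))                # 3
print([t.state for t in buffer])  # [[2], [3], [4]]

# Sample without replacement, then regroup by field
batch = random.sample(buffer, 2)
states, actions, rewards, next_states, dones = zip(*batch)
```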


## Training loop with live plotting

Use the graph generated here in your pdf submission.

In [14]:
plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(10, 6))
plt.ion()

# Create a tqdm progress bar
progress_bar = tqdm.tqdm(range(num_episodes), postfix={'Total Reward': 0, 'Loss': 0})

# RL algorithm training loop
for episode in progress_bar:
    total_reward = 0
    state = env.reset()

    while True:
        # Convert the original state to the suitable format for the network
        state_desc = env.disc2state(state)
        state_list = convert_state_to_list(state_desc, env_features)
        state_tensor = torch.tensor(state_list, dtype=torch.float32, device=device)

        action = choose_action(state_tensor, epsilon=max(0.01, 0.1 * (0.99 ** episode)))

        # Take the chosen action and observe the next state and reward
        next_state, reward, done, _ = env.step((action))

        # Convert the next state to the suitable format for the network
        next_state_desc = env.disc2state(next_state)
        next_state_list = convert_state_to_list(next_state_desc, env_features)
        next_state_tensor = torch.tensor(next_state_list, dtype=torch.float32, device=device)


        # Hint: You may want to collect experiences from the environment to update the agent in batches!

        memory.push(state_list, action, reward, next_state_list, done)

        if len(memory) > batch_size:
            batch = memory.sample(batch_size)
            # print(f'batch {batch}')
            loss = calculate_loss(batch)

            optimizer.zero_grad()
            loss.backward()

            # Clip after backward() so the freshly computed gradients are the ones clamped
            clip_grads(your_network, clip_value)
            optimizer.step()

        state = next_state
        total_reward += reward

        if done:
            break


    rewards_list.append(total_reward)
    episodes.append(episode)

    live_plot({'Total Reward': rewards_list}, fig)

    scheduler.step()

    # Saving the model
    if episode%500 == 0:
      torch.save(your_network, f'model.pt')

    # get_last_lr() returns one value per parameter group; this optimizer has a single group
    progress_bar.set_postfix({'Total Reward': total_reward, 'Loss': loss.item(), 'lr': scheduler.get_last_lr()[0]})

  0%|          | 0/3000 [00:00<?, ?it/s, Loss=0, Total Reward=0]

  0%|          | 0/3000 [00:00<?, ?it/s, Loss=0, Total Reward=0]


AttributeError: 'FigureCanvasAgg' object has no attribute 'tostring_rgb'

## Compute the mean rewards

Report the mean rewards obtained in your pdf submission

In [None]:
print(f"\nMean Rewards: {np.mean(rewards_list)}")

# close the environment
env.close()

# HMM


In [None]:
# Connect to Google drive to save your model, etc.,

from google.colab import drive
drive.mount('/content/drive')

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import copy
import itertools
import numpy as np
import random
import tqdm
import matplotlib.pyplot as plt
from collections import deque

from pyRDDLGym.Visualizer.MovieGenerator import MovieGenerator # loads visualizer utilities
from IPython.display import Image, display, clear_output # for displaying gifs in colab
from pyRDDLGym.Elevator import Elevator # imports Discrete Elevator

## Add more imports here as required

## IMPORTANT: Do not change the instance of the environment.
env = Elevator(instance = 5)

print('Discrete environment actions:')
print(env.disc_actions)
print('Continuous environment actions:')
print(env.base_env.action_space)
print(f"Observation space size for the discrete Elevator Environment: {len(env.disc_states)}")

# Define hyperparameters

## IMPORTANT: <BEGIN> DO NOT CHANGE THIS CODE!
## GENERAL HYPERPARAMS
num_episodes = 3000
## IMPORTANT: <END> DO NOT CHANGE THIS CODE!

learning_rate = 3e-4
batch_size = 64
clip_value = 1.0  # Gradient clipping value

## ALGO SPECIFIC HYPERPARAMS
# Update the hyperparams as necessary for your implementation

class YourModel(nn.Module):
    def __init__(self):
        super(YourModel, self).__init__()
        # Your model layers and initializations here
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x will be a tensor with shape [batch_size, 13]
        # Your forward pass logic here
        # Ensure the output has shape [batch_size, 6]
        x = F.relu(self.fc1(x))
        output = self.fc2(x)
        return output

## IMPORTANT: DO NOT CHANGE THIS CODE!
env_features = list(env.observation_space.keys())

def convert_state_to_list(state, env_features):
    out = []
    for i in env_features:
        out.append(state[i])
    return out

for feature in env_features:
  print(f"{feature}: {env.observation_space[feature]}")

# Initialize the network and optimizer
input_size = len(env_features)
output_size = 6

# INITIALIZE OTHER NETWORK PARAMS HERE
hidden_size = 128

# INITIALIZE YOUR NETWORK HERE
your_network = YourModel()

# INIT OPTIMIZER - Adam is a good start, but you can try changing this as well
optimizer = optim.Adam(
    your_network.parameters(), lr=learning_rate
)

# Convert networks to CUDA if available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
your_network.to(device)

# Define other constructs (replay buffers, etc) as necessary

# Define a function for gradient clipping
def clip_grads(model, clip_value):
    for param in model.parameters():
        if param.grad is not None:
            param.grad.data = torch.clamp(param.grad.data, -clip_value, clip_value)

# Create a figure for plotting
plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(10, 6))
plt.ion()

# Lists to store rewards and episode numbers
rewards_list = []
episodes = []

def exponential_smoothing(data, alpha=0.1):
    """Compute exponential smoothing."""
    smoothed = [data[0]]  # Initialize with the first data point
    for i in range(1, len(data)):
        st = alpha * data[i] + (1 - alpha) * smoothed[-1]
        smoothed.append(st)
    return smoothed

def live_plot(data_dict, figure, ylabel="Total Rewards"):
    """Plot the live graph."""
    clear_output(wait=True)
    ax.clear()
    for label, data in data_dict.items():
        if label == "Total Reward":
            ax.plot(data, label=label, color="yellow", linestyle='--')

            # Compute and plot moving average for total reward
            ma = exponential_smoothing(data)
            ma_idx_start = len(data) - len(ma)
            ax.plot(range(ma_idx_start, len(data)), ma, label="Smoothed Value", linestyle="-", color="purple", linewidth=2)
        else:
            ax.plot(data, label=label)
    ax.set_ylabel(ylabel)
    ax.legend(loc='upper left')
    display(figure)

# Define the loss calculation function
def calculate_loss(
    ## INCLUDE PARAMS YOU NEED HERE
    state_tensor, action, reward, next_state_tensor, done
    ):
    ## TODO - CALCULATE LOSS VALUE & RETURN IT
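    # Single-transition update: add a batch dimension so shapes are [1, 13] and [1, 1]
    q_values = your_network(state_tensor.unsqueeze(0))
    q_value = q_values.gather(1, action.view(1, 1)).squeeze(1)

    with torch.no_grad():
      next_q_values = your_network(next_state_tensor.unsqueeze(0))
      next_q_value = next_q_values.max(1)[0]
      # float(done) turns the boolean terminal flag into 0.0 / 1.0
      target_q_value = float(reward) + (1 - float(done)) * 0.99 * next_q_value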

    loss = F.mse_loss(q_value, target_q_value)

    return loss

def choose_action(
    ## INCLUDE PARAMS YOU NEED HERE
    state_tensor, epsilon=0.1
    ):
    ## TODO - RETURN AN INTEGER FROM 0 - 5 (both inclusive) based on your model training/testing strategy

    if random.random() < epsilon:
      return torch.tensor(random.choice(range(output_size)), dtype=torch.long, device=device)
    else:
      q_values = your_network(state_tensor)
      return q_values.argmax().unsqueeze(0)

plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(10, 6))
plt.ion()

# Create a tqdm progress bar
progress_bar = tqdm.tqdm(range(num_episodes), postfix={'Total Reward': 0, 'Loss': 0})

# RL algorithm training loop
for episode in progress_bar:
    total_reward = 0
    state = env.reset()

    while True:
        # Convert the original state to the suitable format for the network
        state_desc = env.disc2state(state)
        state_list = convert_state_to_list(state_desc, env_features)
        state_tensor = torch.tensor(state_list, dtype=torch.float32, device=device)

        action = choose_action(
            ## TODO: FILL IN PARAMS FOR CALLING choose_action
            state_tensor, epsilon=0.1
        )

        # Take the chosen action and observe the next state and reward
        # choose_action returns a tensor here, so pass its integer value to the environment
        next_state, reward, done, _ = env.step(action.item())

        # Convert the next state to the suitable format for the network
        next_state_desc = env.disc2state(next_state)
        next_state_list = convert_state_to_list(next_state_desc, env_features)
        next_state_tensor = torch.tensor(next_state_list, dtype=torch.float32, device=device)


        # Hint: You may want to collect experiences from the environment to update the agent in batches!

        loss = calculate_loss(
            ## TODO: FILL IN PARAMS FOR CALLING calculate_loss
            state_tensor, action, reward, next_state_tensor, done
        )

        optimizer.zero_grad()
        loss.backward()

        clip_grads(your_network, clip_value)
        optimizer.step()

        state = next_state
        total_reward += reward

        if done:
            break


    rewards_list.append(total_reward)
    episodes.append(episode)

    live_plot({'Total Reward': rewards_list}, fig)

    # Saving the model
    if episode%500 == 0:
      torch.save(your_network, f'model.pt')

    progress_bar.set_postfix({'Total Reward': total_reward, 'Loss': loss.item()})

print(f"\nMean Rewards: {np.mean(rewards_list)}")

# close the environment
env.close()

In [None]:
# Connect to Google drive to save your model, etc.,

from google.colab import drive
drive.mount('/content/drive')

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import copy
import itertools
import numpy as np
import random
import tqdm
import matplotlib.pyplot as plt
from collections import deque

from pyRDDLGym.Visualizer.MovieGenerator import MovieGenerator # loads visualizer utilities
from IPython.display import Image, display, clear_output # for displaying gifs in colab
from pyRDDLGym.Elevator import Elevator # imports Discrete Elevator

## Add more imports here as required

## IMPORTANT: Do not change the instance of the environment.
env = Elevator(instance = 5)

print('Discrete environment actions:')
print(env.disc_actions)
print('Continuous environment actions:')
print(env.base_env.action_space)
print(f"Observation space size for the discrete Elevator Environment: {len(env.disc_states)}")

# Define hyperparameters

## IMPORTANT: <BEGIN> DO NOT CHANGE THIS CODE!
## GENERAL HYPERPARAMS
num_episodes = 3000
## IMPORTANT: <END> DO NOT CHANGE THIS CODE!

learning_rate = 3e-4
batch_size = 64
clip_value = 1.0  # Gradient clipping value

## ALGO SPECIFIC HYPERPARAMS
# Update the hyperparams as necessary for your implementation

class YourModel(nn.Module):
    def __init__(self):
        super(YourModel, self).__init__()
        # Your model layers and initializations here
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x will be a tensor with shape [batch_size, 13]
        # Your forward pass logic here
        # Ensure the output has shape [batch_size, 6]
        x = F.relu(self.fc1(x))
        output = self.fc2(x)
        return output

## IMPORTANT: DO NOT CHANGE THIS CODE!
env_features = list(env.observation_space.keys())

def convert_state_to_list(state, env_features):
    out = []
    for i in env_features:
        out.append(state[i])
    return out

for feature in env_features:
  print(f"{feature}: {env.observation_space[feature]}")

# Initialize the network and optimizer
input_size = len(env_features)
output_size = 6

# INITIALIZE OTHER NETWORK PARAMS HERE
hidden_size = 128

# INITIALIZE YOUR NETWORK HERE
your_network = YourModel()

# INIT OPTIMIZER - Adam is a good start, but you can try changing this as well
optimizer = optim.Adam(
    your_network.parameters(), lr=learning_rate
)

# Convert networks to CUDA if available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
your_network.to(device)

# Define other constructs (replay buffers, etc) as necessary

# Define a function for gradient clipping
def clip_grads(model, clip_value):
    for param in model.parameters():
        if param.grad is not None:
            param.grad.data = torch.clamp(param.grad.data, -clip_value, clip_value)

# Create a figure for plotting
plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(10, 6))
plt.ion()

# Lists to store rewards and episode numbers
rewards_list = []
episodes = []

def exponential_smoothing(data, alpha=0.1):
    """Compute exponential smoothing."""
    smoothed = [data[0]]  # Initialize with the first data point
    for i in range(1, len(data)):
        st = alpha * data[i] + (1 - alpha) * smoothed[-1]
        smoothed.append(st)
    return smoothed

def live_plot(data_dict, figure, ylabel="Total Rewards"):
    """Plot the live graph."""
    clear_output(wait=True)
    ax.clear()
    for label, data in data_dict.items():
        if label == "Total Reward":
            ax.plot(data, label=label, color="yellow", linestyle='--')

            # Compute and plot moving average for total reward
            ma = exponential_smoothing(data)
            ma_idx_start = len(data) - len(ma)
            ax.plot(range(ma_idx_start, len(data)), ma, label="Smoothed Value", linestyle="-", color="purple", linewidth=2)
        else:
            ax.plot(data, label=label)
    ax.set_ylabel(ylabel)
    ax.legend(loc='upper left')
    display(figure)

# Define the loss calculation function
def calculate_loss(
    ## INCLUDE PARAMS YOU NEED HERE
    state_tensor, action, reward, next_state_tensor, done
    ):
    ## TODO - CALCULATE LOSS VALUE & RETURN IT
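    # Single-transition update: add a batch dimension so shapes are [1, 13] and [1, 1]
    q_values = your_network(state_tensor.unsqueeze(0))
    q_value = q_values.gather(1, action.view(1, 1)).squeeze(1)

    with torch.no_grad():
      next_q_values = your_network(next_state_tensor.unsqueeze(0))
      next_q_value = next_q_values.max(1)[0]
      # float(done) turns the boolean terminal flag into 0.0 / 1.0
      target_q_value = float(reward) + (1 - float(done)) * 0.99 * next_q_value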

    loss = F.mse_loss(q_value, target_q_value)

    return loss

def choose_action(
    ## INCLUDE PARAMS YOU NEED HERE
    state_tensor, epsilon=0.1
    ):
    ## TODO - RETURN AN INTEGER FROM 0 - 5 (both inclusive) based on your model training/testing strategy

    if random.random() < epsilon:
      return torch.tensor(random.choice(range(output_size)), dtype=torch.long, device=device)
    else:
      q_values = your_network(state_tensor)
      return q_values.argmax().unsqueeze(0)

plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(10, 6))
plt.ion()

# Create a tqdm progress bar
progress_bar = tqdm.tqdm(range(num_episodes), postfix={'Total Reward': 0, 'Loss': 0})

# RL algorithm training loop
for episode in progress_bar:
    total_reward = 0
    state = env.reset()

    while True:
        # Convert the original state to the suitable format for the network
        state_desc = env.disc2state(state)
        state_list = convert_state_to_list(state_desc, env_features)
        state_tensor = torch.tensor(state_list, dtype=torch.float32, device=device)

        action = choose_action(
            ## TODO: FILL IN PARAMS FOR CALLING choose_action
            state_tensor, epsilon=0.1
        )

        # Take the chosen action and observe the next state and reward
        next_state, reward, done, _ = env.step((action.item()))

        # Convert the next state to the suitable format for the network
        next_state_desc = env.disc2state(next_state)
        next_state_list = convert_state_to_list(next_state_desc, env_features)
        next_state_tensor = torch.tensor(next_state_list, dtype=torch.float32, device=device)


        # Hint: You may want to collect experiences from the environment to update the agent in batches!

        loss = calculate_loss(
            ## TODO: FILL IN PARAMS FOR CALLING calculate_loss
            state_tensor, action, reward, next_state_tensor, done
        )

        optimizer.zero_grad()
        loss.backward()

        clip_grads(your_network, clip_value)
        optimizer.step()

        state = next_state
        total_reward += reward

        if done:
            break


    rewards_list.append(total_reward)
    episodes.append(episode)

    live_plot({'Total Reward': rewards_list}, fig)

    # Saving the model
    if episode%500 == 0:
      torch.save(your_network, f'model.pt')

    progress_bar.set_postfix({'Total Reward': total_reward, 'Loss': loss.item()})

print(f"\nMean Rewards: {np.mean(rewards_list)}")

# close the environment
env.close()