## Solving CartPole with DQNs
In this assignment you will make an RL agent capable of achieving 150+ average reward in the CartPole environment

In [14]:
# Make all necessary imports here
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import gym
import matplotlib.pyplot as plt
import numpy as np
import imageio
from tqdm import tqdm
from collections import deque
from IPython.display import display, Image

Regarding the CartPoleAgent class:
- The constructor (\_\___init__\_\_) should initialize __gamma__ and __epsilon__ as class variables. It initializes online network, saves it and loads it again in target network (We do this so that both our target and online network are same during initialization)
- The __choose_action()__ function should take the __Q(s, a)__ values vector for a state s as input, for example if __Q_s__ is the given input, __Q_s[0]__ represents __Q(s, 0)__, __Q_s[1]__ represents __Q(s, 1)__ and so on, and the function should output the chosen action (an integer) according to the current exploration strategy (For example choose random action with probability ε and choose action with highest Q(s, a) value with probability 1-ε)
- The __train()__ function runs for a specific number of loops, in each loop:
    - It generates training data using __generate_training_data()__ function and passes it to train_instance function of the online network (which trains the online network)
    - It then saves the online network and loads that same saved function as target network
    - Calls the __evaluate_performace()__ function
    - Updates the value of epsilon as required
- The __generate_training_data()__ function:
    - Simulates lots of episodes/games/trajectories, it uses the online network for chossing actions, and the target netowrk for determining targets, it then stores all such states in an list/array/tensor and corresponding labels (i.e. targets) in another list/array/tensor.
    - It then makes a __CustomDataset__ variable with these state and labels and returns it
    - The CartPole environment terminates after 500 steps truncates itself after 500 steps in a single episode, you have to check this yourself and terminate the episode if it's length becomes >= 500
    - The number of data and targets in the dataset returned should be large enough (around 5000-10000), so that when we choose any random datapoints, they satisy the iid condition
- The __evaluate_performance()__ function calculates the average achieved reward with the current online network by simulating atleast 5 episodes (without any exploration as we are just calculating average reward), it then prints the average reward

Generally you should see a rising trend in your average obtained reward

Now some recommendations:
- You need a good exploratory strategy, exponentially decaying exploration is prefered, you can start with ε=0.5 and then divide it by a constant after each training loop, so that it finally reaches a value of ε = 0.01
- Whenever you use forward function of the DQN class in __generate_training_data()__ or __evaluate()__, make sure to detach the tensor so that it does not calculate gradients. You can detach any tensor "__a__" like:
```
    a = a.detach()
```
- 0.99 is a good value for Gamma

Some more things you can do (Optional):
- You can load an already saved PyTorch model with name "model.pth" into any variable network as follows:
```
    network = torch.load("model.pth)
```
- In the __evaluate()__ function, you can use __imageio__ library to make gifs of your agent playing the game (Google How!), but you have to initialize your environment as:
```
    env = gym.make("CartPole-v1", render_mode="rgb_array")
```
- In the __evaluate()__ function, you can calculate the Mean-Square Error of the model and store these values for each iterations and finally plot it to get an idea of how is your training going.

In [15]:
'''running perfectly fine on google colab'''
class CartPoleAgent:
    def __init__(self, epsilon, gamma=0.99) -> None:
        self.gamma = gamma
        self.epsilon = epsilon
        self.online_network = DQN(input_size=4, output_size=2)  # Corrected input_size to 4
        self.target_network = DQN(input_size=4, output_size=2)  # Corrected input_size to 4
        self.target_network.load_state_dict(self.online_network.state_dict())
        self.env = gym.make('CartPole-v1')  # Initialize environment

    def choose_action(self, Q_s) -> int:
        if np.random.rand() < self.epsilon:
            return np.random.choice(len(Q_s))
        else:
            return np.argmax(Q_s)

    def generate_training_data(self) -> CustomDataset:
        data = []
        labels = []

        for _ in range(100):
            state = self.env.reset()
            state = np.append(state, self.choose_action(self.online_network(torch.tensor(state, dtype=torch.float32)).detach().numpy()))
            done = False
            total_reward = 0

            while not done:
                action = self.choose_action(self.online_network(torch.tensor(state[:4], dtype=torch.float32)).detach().numpy())
                next_state, env_reward, done, _ = self.env.step(action)
                next_state = np.append(next_state, self.choose_action(self.online_network(torch.tensor(next_state, dtype=torch.float32)).detach().numpy()))

                reward = env_reward - (abs(state[0]) + abs(state[2])) / 2.5
                total_reward += reward
                target = np.zeros(2)  # One value for each action
                target[action] = reward + self.gamma * np.max(self.target_network(torch.tensor(next_state[:4], dtype=torch.float32)).detach().numpy())

                data.append(state[:4])
                labels.append(target)

                state = next_state

                if done or len(data) >= 5000:
                    break

        return CustomDataset(np.array(data), np.array(labels))

    def train_agent(self, num_loops=100):
        for _ in range(num_loops):
            train_dataset = self.generate_training_data()
            self.online_network.train_instance(train_dataset)

            self.online_network.save_model('model.pth')
            self.target_network.load_state_dict(self.online_network.state_dict())

            self.evaluate_performance(_)
            self.epsilon *= 0.99

    def evaluate_performance(self, iter) -> None:
        total_reward = 0
        for _ in range(5):
            state = self.env.reset()
            done = False
            while not done:
                # Choose action using the online network
                action = np.argmax(self.online_network(torch.tensor(state, dtype=torch.float32)).detach().numpy())

                state, env_reward, done, _ = self.env.step(action)
                total_reward += env_reward

        print(f'Iteration {iter + 1}, Average Reward: {total_reward / 5}')

    def play_and_save_gif(self, num_episodes=5):
        frames = []  # List to store frames for GIF

        for _ in range(num_episodes):
            state = self.env.reset()
            done = False
            while not done:
                # Choose action using the online network
                action = np.argmax(self.online_network(torch.tensor(state, dtype=torch.float32)).detach().numpy())

                state, _, done, _ = self.env.step(action)

                # Append the current frame to the frames list
                frames.append(self.env.render(mode='rgb_array'))

        # Save the frames as a GIF
        gif_path = 'cartpole_play.gif'
        imageio.mimsave(gif_path, frames, fps=30)

        # Display the GIF in Colab
        with open(gif_path, 'rb') as f:
            display(Image(data=f.read(), format='png'))

        print(f'GIF saved at: {gif_path}')

You should run the below cell to start training

In [17]:
'''running perfectly fine on google colab'''
# This cell should not be changed
Agent = CartPoleAgent(epsilon=0.5)
Agent.train_agent()
Agent.play_and_save_gif(num_episodes=100)

ValueError: expected sequence of length 4 at dim 1 (got 0)