# NEURAL NETWORKS AND DEEP LEARNING
> M.Sc. ICT FOR LIFE AND HEALTH
> 
> Department of Information Engineering

> M.Sc. COMPUTER ENGINEERING
>
> Department of Information Engineering

> M.Sc. AUTOMATION ENGINEERING
>
> Department of Information Engineering
 
> M.Sc. PHYSICS OF DATA
>
> Department of Physics and Astronomy
 
> M.Sc. COGNITIVE NEUROSCIENCE AND CLINICAL NEUROPSYCHOLOGY
>
> Department of General Psychology

---
Academic Year 2020/21 (6 CFU) - Dr. Alberto Testolin, Dr. Matteo Gadaleta
---


# Homework 3 - Deep Reinforcement Learning
# 1. Cartpole

## General overview
In this homework you will learn how to implement and test neural network models for solving reinforcement learning problems. The basic tasks require you to implement some extensions to the code that you have seen in the Lab. More advanced tasks require you to train and test your learning agent on a different environment. Given the higher computational complexity of RL, in this homework you don’t need to tune learning hyperparameters using search procedures and cross-validation; however, you are encouraged to experiment with model hyperparameters in order to find a satisfactory configuration.


## Technical notes
The homework should be implemented in Python using the PyTorch framework. You can explore additional libraries and tools to implement the models; however, please make sure you understand the code you are writing, because during the exam you might receive specific questions about your implementation. The entire source code required to run the homework must be uploaded as a compressed archive in the Moodle section dedicated to the homework. If your code is entirely contained in a single Python notebook, just upload the notebook file.

As an example of more advanced libraries that can be used to implement deep RL agents, you can check this website:

https://stable-baselines.readthedocs.io/en/master/



## Final report
Along with the source code, you must separately upload a PDF file containing a brief report of your homework. The report should include a brief Introduction in which you explain the homework goals and the main implementation strategies you chose, a brief Method section where you describe your model architectures and hyperparameters, and a Results section where you present the simulation results. Total length must not exceed 6 pages, though you can include additional tables and figures in a final Appendix (optional). Given the dynamic nature of RL problems, you can explore richer media for showing the results of your model (e.g., animated GIFs or short movies).




## Grade
The maximum grade for this homework will be **8 points**. Points will be assigned based on the correct implementation of the following items:
*	2 pt: extend the notebook used in Lab 07 in order to study how the exploration profile (using either eps-greedy or softmax) impacts the learning curve. Try to tune the model hyperparameters or tweak the reward function in order to speed up learning convergence (i.e., reach the same accuracy with fewer training episodes).
*	3 pt: extend the notebook used in Lab 07 in order to learn to control the CartPole environment directly from the screen pixels, rather than from the compact state representation used during the Lab (cart position, cart velocity, pole angle, pole angular velocity). This requires changing the “observation_space”.
*	3 pt: train a deep RL agent on a different Gym environment. You are free to choose whatever Gym environment you like from the available list, or even explore other simulation platforms:
https://gym.openai.com/envs 



## Deadline
The complete homework (source code + report) must be submitted through Moodle at least 10 days before the chosen exam date.

The following **Ubuntu** packages are needed: `ffmpeg`, `python-opengl`, `xvfb`

In [1]:
import sys

#Install all the required packages with the correct versions in the current environment.
#Note: this notebook has been run with Python 3.9.5 on a 64-bit **Ubuntu 20.04** machine, with an AMD Ryzen 7 1700 8-core CPU, GTX 970 GPU, and 32GB of DDR4 RAM. 

#[UNCOMMENT] the following line to install the packages.
# !{sys.executable} -m pip install numpy~=1.20.1 pandas~=1.2.3 matplotlib~=3.3.4 hiplot~=0.1.24 scipy~=1.6.0 tqdm~=4.59.0 torch~=1.8.0 torchvision~=0.9.0 optuna~=2.7.0 pytorch_lightning~=1.3.4 torchmetrics~=0.3.2 gym~=0.18.3 atari_py~=0.2.9 ipympl~=0.7.0 pyvirtualdisplay~=2.2 piglet~=1.0.0

In [2]:
#Autoreload imported functions
%load_ext autoreload
%autoreload 2

%matplotlib widget 
#Or use %matplotlib notebook
#I'm running jupyter inside of Visual Studio Code, so %matplotlib widget is needed for me.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl2latex import mpl2latex, latex_figsize
from plotting import COLUMNWIDTH

from pathlib import Path
import json
import random

import logging 
logging.basicConfig(filename='01_Cartpole.log', encoding='utf-8', level=logging.INFO, filemode='w')

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset, random_split
import torchvision

import pytorch_lightning as pl

print("Torch version:", torch.__version__)

#Select device for training
#device = "cpu" #[UNCOMMENT] to force the CPU: for this very simple problem it is actually faster
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") #Use the GPU when available

logging.info(f"Using {device} for training")
print(f'Using "{device}" for training')

Torch version: 1.9.0
Using "cuda" for training


In [5]:
import gym
import os
from glob import glob
from pyvirtualdisplay import Display

display = Display(visible=0, size=(700, 450))
display.start()

<pyvirtualdisplay.display.Display at 0x7fa5a7f0b3d0>

# Exploration Profile

2 pt: extend the notebook used in Lab 07 in order to study how the exploration profile (using either eps-greedy or softmax) impacts the learning curve. Try to tune the model hyperparameters or tweak the reward function in order to speed up learning convergence (i.e., reach the same accuracy with fewer training episodes).
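
A minimal sketch of the two exploration strategies named above, together with one possible temperature profile. This is not the code from Lab 07 or from the `agent` module: the function names, the linear decay, and the numeric values are assumptions chosen for illustration only.

```python
import numpy as np

def choose_action_softmax(q_values, temperature, rng):
    """Sample an action from softmax(q_values / temperature).

    A temperature of 0 (or below) falls back to the greedy action.
    """
    q_values = np.asarray(q_values, dtype=np.float64)
    if temperature <= 0:
        return int(np.argmax(q_values))
    # Floor the temperature to avoid overflow, then shift the logits by
    # their maximum so that the exponentials stay numerically stable
    logits = q_values / max(temperature, 1e-8)
    logits -= logits.max()
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

def choose_action_eps_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def linear_profile(initial_value, reach_zero_after, n_episodes):
    """Hypothetical profile: decay linearly to 0 within `reach_zero_after` episodes."""
    episodes = np.arange(n_episodes)
    return np.maximum(0.0, initial_value * (1.0 - episodes / reach_zero_after))

rng = np.random.default_rng(0)
profile = linear_profile(initial_value=5.0, reach_zero_after=400, n_episodes=1000)
q_values = [1.0, 1.5]  # made-up Q-values for the two CartPole actions

# Compare the behavior at the start (high temperature) and at the end (zero temperature)
early = [choose_action_softmax(q_values, profile[0], rng) for _ in range(1000)]
late = [choose_action_softmax(q_values, profile[-1], rng) for _ in range(1000)]
print("fraction of action 1 at high temperature:", np.mean(early))
print("fraction of action 1 at zero temperature:", np.mean(late))
```

At high temperature both actions are sampled at similar rates (exploration); once the temperature reaches zero only the action with the largest Q-value is chosen (exploitation). Changing the shape of the profile changes how quickly the agent moves from one regime to the other, which is what the learning curves are meant to reveal.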

In [6]:
#Quick test of all the modules
from data import ReplayMemory
from model import DQN
from agent import Agent

env = gym.make("CartPole-v1")
memory = ReplayMemory(capacity=10)
net = DQN(state_space_dim=(4,), action_space_dim=2)

agent = Agent(env, memory)

done = False
i = 0
while not done:
    reward, done = agent.play_step(net)
    logging.debug("state=%s, done=%s", agent.state, done)
    i += 1

print(f"Simulation run for {i} steps")

Simulation run for 30 steps
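
The `data` module is not shown in this notebook, so the following is only a rough sketch of what a fixed-capacity replay memory usually looks like. The class name `SimpleReplayMemory` and its `push`/`sample` methods are assumptions and may differ from the actual `ReplayMemory` interface.

```python
import random
from collections import deque

class SimpleReplayMemory:
    """Fixed-capacity FIFO buffer of transitions (hypothetical stand-in for `ReplayMemory`)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when it is full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        """Store one transition."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a uniform random batch without replacement."""
        # Never ask for more samples than are currently stored
        batch_size = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = SimpleReplayMemory(capacity=3)
for step in range(5):
    memory.push(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

print(len(memory))                    # 3: only the latest transitions are kept
print([t[0] for t in memory.buffer])  # [2, 3, 4]
batch = memory.sample(2)              # two transitions drawn at random
```

Sampling uniformly from a buffer of past transitions breaks the correlation between consecutive steps of the same episode, which is the usual motivation for using a replay memory in DQN-style training.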


In [7]:
from model import DeepQLearner, DQN
from callbacks import NotebookProgressBar, LearningRateAdjust, StopAfterNEpisodes


bar = NotebookProgressBar()
lr_adj = LearningRateAdjust()
stop = StopAfterNEpisodes(1000)

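#Reward shaping: penalize the distance of the cart from the center of the track (state[0] is the cart position)
add_to_reward = lambda state, reward : -np.abs(state[0])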

# RL_net = DeepQLearner(env="CartPole-v1", Network=DQN, reach_zero_temperature_after_n_episodes=400, batch_size=256, gamma=.98, target_net_update_steps=300, learning_rate=1e-1, min_samples_for_training=10000, add_to_reward=add_to_reward, update_target_every_frame=True) 
# [UNCOMMENT] the previous line to initialize the model to be trained. This involves stepping through 10k frames, which takes a bit of time.
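
To make the effect of `add_to_reward` concrete, here is a small sketch that assumes the returned value is simply added to the environment reward at every step. How `DeepQLearner` actually combines the two is not shown in this notebook, so the helper `shaped_reward` is hypothetical.

```python
import numpy as np

# Same shaping term as in the cell above: penalize the distance from the track center
add_to_reward = lambda state, reward: -np.abs(state[0])

def shaped_reward(state, reward, add_to_reward=None):
    """Hypothetical helper: environment reward plus an optional shaping term."""
    if add_to_reward is None:
        return float(reward)
    return float(reward + add_to_reward(state, reward))

# Made-up states: (cart position, cart velocity, pole angle, pole angular velocity)
centered = np.array([0.0, 0.0, 0.01, 0.0])
off_center = np.array([1.5, 0.0, 0.01, 0.0])

print(shaped_reward(centered, 1.0, add_to_reward))    # 1.0
print(shaped_reward(off_center, 1.0, add_to_reward))  # -0.5
```

Under this assumption, a step taken with the cart in the middle of the track keeps the full reward of 1, while a step taken far from the center is penalized, which nudges the agent away from drifting toward the edges.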

In [8]:
trainer = pl.Trainer(gpus=0, max_epochs=1000000, callbacks=[bar, stop, lr_adj], gradient_clip_val=2) 
#max_epochs is a very large number, since the callback "StopAfterNEpisodes" is used to stop training

# trainer.fit(RL_net) 
# [UNCOMMENT] the previous line to re-run training. Otherwise, the next cell will load a saved checkpoint.
# During training, scores of episodes are saved in `01_Cartpole.log`. 

# NOTE: I encountered a weird error "CUDA error: CUBLAS_STATUS_EXECUTION_FAILED", which I solved by reinstalling PyTorch with conda (previously it was installed with pip).
# The error is probably due to the specific environment used for running the code, so I'll leave this note here in case something similar happens.



In [9]:
#---Save the model and the learning stats---#

import pickle
from copy import deepcopy
from datetime import datetime

save = False
# [SET] save to True to save the previously trained model
# (not necessary if training is not re-executed)
if save:
    all_info = deepcopy(RL_net.hyper_parameters)
    all_info["score"] = deepcopy(RL_net.episode_history)
    all_info["temp"]  = deepcopy(RL_net.temp_history)

    DATE_FMT = "%d_%m_%y-%Hh%Mm%S"
    now = datetime.now()
    date = now.strftime(DATE_FMT) #Current time

    os.makedirs("SavedModels/1", exist_ok=True)

    with open(f"SavedModels/1/{date}.result", 'wb') as file:
        file.write(pickle.dumps(all_info))

    trainer.save_checkpoint(f"SavedModels/1/{date}.ckpt")


In [32]:
from glob import glob
import pickle

def moving_average(x : "np.ndarray", w_size : int):
    """Compute the rolling average of a 1D array `x`, averaging the values
    within a window of length `w_size`.

    Returns
    -------
    xs : np.ndarray
        Indices of the original array representing the "centers" of convolved points
    ys : np.ndarray
        Convolved points
    """

    return (np.arange(w_size // 2, len(x) - (w_size - w_size // 2) + 1), 
           np.convolve(x, np.ones(w_size), 'valid') / w_size)

#---Plot the learning curve of all the attempted trials---#

os.makedirs("Plots/1", exist_ok=True) #Make sure the output folder exists before saving figures

for file in glob("SavedModels/1/*.result"):
    with open(file, 'rb') as f:
        new_var = pickle.load(f) #Dictionary with the hyperparameters and the score/temperature histories

    with mpl2latex(True):
        fig, ax1 = plt.subplots(figsize=latex_figsize(wf=1., columnwidth=COLUMNWIDTH))

        color = 'tab:red'
        ax1.plot(new_var['score'], color=color, label='Score (Raw)', alpha=.4)
        ax1.plot(*moving_average(new_var['score'], 10), '--', color=color, label='Score (Avg. 10)')
        ax1.set_xlabel('Episode')
        ax1.set_ylabel('Score', color=color)
        ax1.tick_params(axis='y', labelcolor=color)

        ax2 = ax1.twinx()
        
        color = 'tab:blue'
        ax2.set_ylabel('Temperature', color=color, rotation=270, labelpad=15)
        ax2.plot(new_var['temp'], color=color, label="Temperature")
        ax2.tick_params(axis='y', labelcolor=color)

        ax1.set_title("CartPole-v1 - Training")

        #Underscores are escaped for LaTeX; the doubled backslash avoids an invalid Python escape sequence
        textstr = f"""replay\\_capacity: {new_var['replay_memory_capacity']}
gamma: {new_var['gamma']}
lr: {new_var['learning_rate']}
batch: {new_var['batch_size']}
update\\_every: {new_var['target_net_update_steps']}
init\\_temp: {new_var['initial_temperature']}
zero\\_temp\\_at: {new_var['reach_zero_temperature_after_n_episodes']}"""

        props = dict(boxstyle='round, pad=.3', facecolor='gray', alpha=.1)
        ax1.text(0.15, 0.95, textstr, transform=ax1.transAxes, fontsize=10, verticalalignment='top', bbox=props)
        fig.tight_layout()

        ax1.legend(loc=(0.45, 0.7))
        ax2.legend(loc=(0.45, 0.9))

        filename = os.path.splitext(os.path.basename(file))[0]
        fig.savefig(f"Plots/1/{filename}.pdf", transparent=True, bbox_inches='tight')
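
A quick check of the `moving_average` helper on made-up data. The function is repeated here (with the same computation) only to keep the example self-contained; the input is a linear ramp, not real training scores.

```python
import numpy as np

def moving_average(x, w_size):
    """Same computation as the helper above: window centers and window means."""
    xs = np.arange(w_size // 2, len(x) - (w_size - w_size // 2) + 1)
    ys = np.convolve(x, np.ones(w_size), 'valid') / w_size
    return xs, ys

scores = np.arange(10, dtype=float)  # made-up "scores": a linear ramp 0, 1, ..., 9
xs, ys = moving_average(scores, 3)
print(xs)  # [1 2 3 4 5 6 7 8]
print(ys)  # [1. 2. 3. 4. 5. 6. 7. 8.]
```

For a linear ramp and an odd window, each average coincides with the value at the window center, so `xs` and `ys` match. With an even `w_size` the window has no central sample, and the returned index is the one just to the right of the midpoint.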


[Figures: 10 learning-curve plots titled "CartPole-v1 - Training", one per saved trial. x-axis: Episode; left y-axis: Score (raw and 10-episode moving average); right y-axis: Temperature.]

In [11]:
from model import DeepQLearner
from utils import wrap_env
from tqdm.notebook import tqdm

#---Test the final training---#

model = DeepQLearner.load_from_checkpoint("SavedModels/24_06_21-00h43m46.ckpt", min_samples_for_training=0) 

#NOTE: I have modified DeepQLearner over time by adding new parameters,
#so loading older checkpoints may fail due to breaking changes
#in its interface (the architecture is the same, though)

# Initialize the Gym environment
env = gym.make('CartPole-v1') 
env.seed(1) # Set a random seed for the environment (reproducible results)

# This is for creating the output video in Colab, not required outside Colab
env = wrap_env(env, video_callable=lambda episode_id: True) # Save a video every episode

model.eval()
# Let's try for a total of 10 episodes
for num_episode in tqdm(range(10)): 
    # Reset the environment and get the initial state
    state = env.reset()
    # Reset the score. The final score will be the total amount of steps before the pole falls
    score = 0
    done = False
    # Go on until the pole falls or the episode reaches the 500-step limit of CartPole-v1
    while not done:
        # Choose the greedy action (highest Q-value) from the policy network
        with torch.no_grad():
            action = int(model.policy_net(torch.tensor(state, dtype=torch.float32)).argmax())

        # Apply the action and get the next state, the reward and a flag "done" that is True if the episode has ended
        next_state, reward, done, info = env.step(action)
        # Visually render the environment
        env.render()
        # Update the final score (+1 for each step)
        score += reward
        # Set the current state for the next iteration
        state = next_state
    # Print the final score
    print(f"EPISODE {num_episode + 1} - FINAL SCORE: {score}") 
env.close() 

EPISODE 1 - FINAL SCORE: 500.0
EPISODE 2 - FINAL SCORE: 500.0
EPISODE 3 - FINAL SCORE: 500.0
EPISODE 4 - FINAL SCORE: 500.0
EPISODE 5 - FINAL SCORE: 500.0
EPISODE 6 - FINAL SCORE: 500.0
EPISODE 7 - FINAL SCORE: 500.0
EPISODE 8 - FINAL SCORE: 500.0
EPISODE 9 - FINAL SCORE: 500.0
EPISODE 10 - FINAL SCORE: 500.0
