<h1 style="color:#333333; text-align:center; line-height: 0;">Reinforcement Learning | Assignment 3</h1>

<br/><br/>

This notebook covers the Deep **Critic** approach.

Complete the code snippets given in the Section 3: there is a places
to insert your code and string fields for your first and last name. The latter are needed to automatically save the results of the algorithms deployment in .json file. After you did that, please upload the notebook (.ipynb) and .json via https://forms.gle/wzqF43ma7yaDuzUc8.

* Problem 3.1 - NN Critic (30 points)

***

<h2 style="color:#A7BD3F;">Section 1: Theory recap</h2>

### Problem

Let us reformulate the Pendulum problem from the previous assignment in terms of Actor-Critic.

The learning (or, more formally, gradient ascent) here is organized as in the following pseudocode. Generally, the parameters are updated after collecting experience from multiple trajectories of a certain length.

The expected reward will be evaluated by a Neural Network (NN), taking concatenated observation and action vectors as input.

### Neural Net training

The NN is learned in accordance with the assumption that for the perfectly trained model the following equation holds: Q(observation_curr, action_curr) = reward_curr + gamma * Q(observation_old, action_old). So, with introducing Temporal Difference

$TD = reward + \gamma Q_{old}(x_{old}) - Q_{new}(x_{new})$

the loss function for the net could be formulated as

$loss = \frac{1}{2}TD^2$

Accordingly, the HJB equation reads like:

$Q(obs\_new, act\_new) = reward + \gamma * Q(obs\_old, act\_old)$

With the necessary modifications to the algorithm of the previous assignment, the pseudocode is

<img src="nn_critic.png" alt="REINFORCE" width=75% height=75% />

***

<h2 style="color:#A7BD3F;">Section 2: NN recap</h2>

First, let us recall the process of NN training.

This network (which is a single Linear layer) should learn how to multiply its input by 2. The train loop, that starts on 33 line, consists of the following:
* data (input and output) generation
* zeroing gradients of the model(ltdr for the learning to work correctly)
* running the net
* applying criterion to judge how far the correct answer is from the model output
* calculating gradients of the loss by model parameters
* performing the weights modification step

If you are not very much familiar with neural networks, examine the code below and familiarize yourself with the methods that are used there. Feel free to play around with the code, uncomment some strings, print random things in order to find out what they are.

In [7]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super().__init__()
                
        self.fc1 = nn.Linear(1, 1)

    def forward(self, x):
        x = self.fc1(x)
        
        return x

net = Net()

lr = 0.01
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=lr, momentum=0.9)

#let us print the parameters of the model
for param in net.parameters():
    print(param)

print("")

episodes_num = 200

for i in range(episodes_num):
    x = torch.tensor([np.random.random_sample()])
    y = x * 2

    optimizer.zero_grad()

    outputs = net(x)
    loss = criterion(outputs, y)

    loss.backward()
    optimizer.step()
    
    loss_val = loss.item()
    
    if (i % 30 == 0):
        print(loss_val)

# for param in net.parameters():
#     print(param.grad * lr)
#     print(param)
#     print("")

print('Finished Training')
print("")
print("test")

for param in net.parameters():
    print(param)

for i in range(5):
    x = torch.tensor([np.random.random_sample()])
    
    outputs = net(x)
    
    print(x, outputs)

Parameter containing:
tensor([[-0.0838]], requires_grad=True)
Parameter containing:
tensor([0.6586], requires_grad=True)

0.11534121632575989
0.049824558198451996
0.3921926021575928
0.0640149787068367
0.014210513792932034
0.007354531902819872
0.0035941866226494312
Finished Training

test
Parameter containing:
tensor([[1.9060]], requires_grad=True)
Parameter containing:
tensor([0.0458], requires_grad=True)
tensor([0.0920]) tensor([0.2212], grad_fn=<AddBackward0>)
tensor([0.7881]) tensor([1.5480], grad_fn=<AddBackward0>)
tensor([0.6091]) tensor([1.2068], grad_fn=<AddBackward0>)
tensor([0.4776]) tensor([0.9561], grad_fn=<AddBackward0>)
tensor([0.7087]) tensor([1.3966], grad_fn=<AddBackward0>)


***

<h2 style="color:#A7BD3F;">Section 3: Problems</h2>

### <font color="blue">Problem 3.1 - NN Critic </font>

Complete the code snippet below: add NN critic into the gradient ascent according to the pseudocode above.

* Note the way in which the Pendulum environment is created. It is necessary to manually set the state of the pendulum during the ascent.
* Class Q_net is added, feel free to modify it if you need.

The output is (as always) a .json file, but in this particular assignment please feel free to modify the length of the training, the length of the episode, to add your custom methods. Generally, the comments ### YOUR SOLUTION BELOW mark those places where the biggest effort is required, but be prepared to modify the code not only there.

* You could modify Q_net: widen the hidden layer, add more hidden layers, etc., whatever needed to make it work
* Implement custom TD-based loss function
* 

In [2]:
import gym
import numpy as np
import collections
import sys
from tqdm import tqdm
from IPython.display import clear_output
import time
import matplotlib.pyplot as plt
import math

from gym.envs.classic_control import PendulumEnv

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

def parametrized_swing_up_policy(obs, vartheta, s):
    #normal random variable
    nrv = np.random.normal(0, s, 1)[0]
    
    if (obs[0] > 0.8):
        torque = vartheta[0] * (obs[2] + obs [1]) + nrv
        
        return [torque], nrv
    
    else:
        return [vartheta[1] * obs[2] + nrv], nrv

#x - state
#u - action
#s - sigma of the normal distribution
#nrv - the specific value of the random variable
def param_policy_grad(x, u, s, nrv):
    if (x[0] > 0.8):
        by_0 = nrv * (x[2] + x[1]) / s**2
        
        return np.array([by_0, 0])

    else:
        by_1 = nrv * x[2] / s**2
        
        return np.array([0, by_1])

### YOUR SOLUTION BELOW
class Q_net(nn.Module):
    def __init__(self, inp_dim):
        super().__init__()
        
        self.inp_dim = inp_dim
        
        self.fc1 = nn.Linear(self.inp_dim, self.inp_dim)
        self.fc2 = nn.Linear(self.inp_dim, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = F.leaky_relu(x)
        x = self.fc2(x)
        
        return x
### YOUR SOLUTION ABOVE

ep_len = 340

#env = gym.make('Pendulum-v0')
env = PendulumEnv()

env._max_episode_steps = ep_len

def NN_critic(env, update_params, visualize = False):
    observation = env.reset()
    
    ### YOUR SOLUTION BELOW
    vartheta = np.array([-10.0, 0.08])

    PG_updates_num = 1
    episodes_num   = 3
    episode_length = 20
    
    policy = parametrized_swing_up_policy
    
    alpha = 0.001
    sigma = 0.3
    gamma = 0.9
    ### YOUR SOLUTION ABOVE

    reward_history = []
        
    observation_dim, action_dim = 3, 1
    
    q_net      = Q_net(observation_dim + action_dim)
    q_net_copy = Q_net(observation_dim + action_dim)
    
    def run_q_net(q_net, observation, action, w):
        q_net.load_state_dict(w)
                
        net_input = torch.tensor([observation[0], observation[1],
                                  observation[2], action[0]]).float()
        
        return q_net(net_input)
    
    #let us define custom loss function
    def TD_loss(q_net, q_net_copy, reward, observation, action,
                w_prev, obcervation_curr, action_curr, w):
        ### YOUR SOLUTION BELOW
        
        return loss
        ### YOUR SOLUTION ABOVE
    
    optimizer = optim.SGD(q_net.parameters(), lr=alpha, momentum=0.9)
    
    start_state = env.state
    print(start_state)
    
    w      = q_net.state_dict()
    w_prev = q_net.state_dict()
    
    for PG_step in range(PG_updates_num):
        Grad = np.array([0.0, 0.0])
        Sum_Grad_over_episodes = np.array([0.0, 0.0])

        sum_acc_rewards_over_episodes = 0
        sum_param_policy_PDF_grad_acc_over_episodes = 0
        
        skip_loss_calculation = True
        
        for ep in range(episodes_num):
            print("ep", ep)
            
            acc_reward = 0
            policy_PDF_grad_acc = np.array([0.0, 0.0])
            
            ####env.reset_state_into_init_state
            env.state = start_state
            print(env.state)
            
            #run_q_net(q_net_copy, observation_curr, action_curr, w)

            for time_step in range(episode_length):
                if (visualize == True):
                    env.render()
                
                #print(env.state)
                
                action, nrv = policy(observation, vartheta, sigma)
                observation, reward, done, info = env.step(action)
                reward_history.append(reward)

                #HJB: Q(observation_curr, action_curr) = reward_curr + gamma * Q(observation, action)
                
                if (not skip_loss_calculation):
                    ### YOUR SOLUTION ABOVE

                    ### YOUR SOLUTION ABOVE
                
                else:
                    skip_loss_calculation = False
                
                time.sleep(0.01)
                
                observation_curr, reward_curr, action_curr = observation, reward, action
                
                #It's a standard backprop on loss = 1/2 TD^2!
                #w -= - alpha_w * TD * grad_Q_NN(observation_curr, action_curr, w)
                
                #w_prev = w
                w_prev = q_net_copy.state_dict()
                q_net_copy.load_state_dict(q_net.state_dict())

                ppg = param_policy_grad(observation, action, sigma, nrv)
                                
                Q = run_q_net(q_net, observation, action, w).detach().numpy()
                
                Grad = ppg * Q

                #print(q_net.state_dict())
                #print("ppg, Q", ppg, Q)
                
                Sum_Grad_over_episodes += Grad

                sum_acc_rewards_over_episodes += acc_reward
                sum_param_policy_PDF_grad_acc_over_episodes += policy_PDF_grad_acc

        Grad = 1 / episodes_num * Sum_Grad_over_episodes
        
        #print("grad", Sum_Grad_over_episodes)
        
        if (update_params):
            vartheta += alpha * Grad
    
    print(vartheta)
    
    return reward_history

nn_critic_reward_history = REINFORCE(env, update_params = True, visualize = True)

env.close()

[0.41171236 0.88141551]
ep 0
[0.41171236 0.88141551]
OrderedDict([('fc1.weight', tensor([[-0.2579,  0.3150, -0.0277,  0.0497],
        [ 0.1754, -0.3301,  0.3262, -0.4465],
        [ 0.2348,  0.2748, -0.1653, -0.4886],
        [-0.4261,  0.2213,  0.4468,  0.3506]])), ('fc1.bias', tensor([-0.3771, -0.2103, -0.4741, -0.2946])), ('fc2.weight', tensor([[-0.0742,  0.1239, -0.3416,  0.3812]])), ('fc2.bias', tensor([0.3154]))])
ppg, Q [-0.65112494  0.        ] [-1.0269294]
OrderedDict([('fc1.weight', tensor([[-0.2579,  0.3150, -0.0277,  0.0497],
        [ 0.1755, -0.3301,  0.3262, -0.4468],
        [ 0.2347,  0.2748, -0.1654, -0.4876],
        [-0.4261,  0.2213,  0.4468,  0.3506]])), ('fc1.bias', tensor([-0.3771, -0.2102, -0.4741, -0.2946])), ('fc2.weight', tensor([[-0.0742,  0.1252, -0.3402,  0.3811]])), ('fc2.bias', tensor([0.3156]))])
ppg, Q [-2.62037274  0.        ] [-1.0656366]
OrderedDict([('fc1.weight', tensor([[-0.2579,  0.3150, -0.0277,  0.0497],
        [ 0.1755, -0.3300,  0.3263, -

OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [13.65588404  0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [-5.35709596  0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [-0.80138043  0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
      

OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.30154701 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.16527797 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [-0.00304346  0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [n

OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.01475781 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [-0.03188273  0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.21411321 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [n

OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.21086257 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.06743262 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.00420505 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan

OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.01415619 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.1245273 0.       ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.11270053 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, 

OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.0375446 0.       ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.29351741 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.0076538 0.       ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, na

OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.06518585 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.07887437 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [-0.00837502  0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [n

OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.58515154 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.01444234 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.01702506 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan

OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.20725109 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.36061442 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]])), ('fc1.bias', tensor([nan, nan, nan, nan])), ('fc2.weight', tensor([[nan, nan, nan, nan]])), ('fc2.bias', tensor([nan]))])
ppg, Q [0.23886999 0.        ] [nan]
OrderedDict([('fc1.weight', tensor([[nan, nan, nan, nan],
        [nan

### <font color="orange">Auto-grading</font>
Run this cell to track your answers and to save your answer for problem 3.1. Make sure you defined the necessary variable above to avoid a `NameError` below.

In [None]:
### GRADING DO NOT MODIFY
from grading_utilities import AnswerTracker
asgn1_answers = AnswerTracker()
asgn1_answers.record('problem_3-1', {'reward_history': nn_critic_reward_history})

### <font color="orange">Auto-grading: Submit your answers</font>
Enter your first and last name in the cell below and then run it to save your answers for this assumption to a JSON file. The file is saved next to this notebook. After the file is created, upload the JSON file and the notebook via the form provided in the beginning of the assumption.

In [33]:
assignment_name = "asgn_3"
first_name = ""
last_name = ""

asgn2_answers.save_to_json(assignment_name, first_name, last_name)

## Questions?

Reach out to Ilya Osokin (@elijahmipt) on Telegram.

## Sources

***

<sup>[1]</sup> Ng, A. Stanford University, CS229 Notes: Reinforcement Learning and Control.

<sup>[2]</sup> Barnabás Póczos, Carnegie Mellon, Introduction To Machine Learning: Reinforcement Learning (Course).

<sup>[3]</sup> **Sutton, R. S., Barto, A. G. (2018 ). Reinforcement Learning: An Introduction. The MIT Press.** 

<sup>[4]</sup> OpenAI: Spinning Up. Retrieved from https://spinningup.openai.com/en/latest/spinningup/rl_intro.html