<h1 style="color:#333333; text-align:center; line-height: 0;"> <img style="right;" src="logo.png" width=18% height=18%> Reinforcement Learning | Assignment 3 
</h1>
<br/><br/>


The goal of this assignment is to implement:
- Critic
- PyTorch optimizer
- Actor-Critic algorithm

___Total points:___ 100

###  <font color="blue"> A brief introduction </font>
Read it carefully: it covers most of what you need to complete the assignment.

***

### About Rcognita
The platform for this and all subsequent work is [Rcognita](https://gitflic.ru/project/aidynamicaction/rcognita), a framework for applying control theory and machine learning algorithms to control problems. At its core is the closed-loop interaction between the controlled agent and an environment that evolves over time. In the Rcognita paradigm, the `pipeline` holds all the classes and variables needed to run the simulation. 

The main parts of `pipeline` are: 
* `simulator`, defined in the module `simulators.py`, which simulates the evolution of the environment
* `actor`, defined in the module `actors.py`, which is responsible for producing the action
* `critic`, defined in the module `critics.py`, which is responsible for learning the reward function and computing its value 
* `controller`, defined in the module `controllers.py`, which puts everything together into an RL (or other) controller
* `system`, defined in the module `systems.py`.

Other minor components are also declared in the pipeline, which is assembled module by module up to its execution. 
To stay on the same page and avoid confusion later, we fix some terminology.
* `weights` is the general name both for the weights of a neural network and for the values in the tables of a value function or policy. This convention keeps us consistent with classical RL, where the critic and the actor are implemented as neural networks with some **weights**. This leads to the second term:
* `model`. Parameters make something specific, while the general form itself is called the `model`. There are many models of different types and forms (such as NNs). The critic, the actor and even the running cost always have a model.
* `predictor`. Despite its cryptic name, this object performs an important function: it carries the law by which the future dynamics of our system are predicted. For example, suppose we have a differential equation
$
\begin{cases}
\dot{\boldsymbol x} = \boldsymbol f(\boldsymbol x, \boldsymbol u)\\
\boldsymbol y = h(\boldsymbol x) \\
\boldsymbol x(0)=\boldsymbol x_{0}\\
\end{cases}
$
where $\boldsymbol x_{0}$ is the **initial state**.
In general, there are several ways to predict: 
> - **Analytical**, when we have an exact formula for the solution $\boldsymbol x(t)$ of the ODE and can compute it at any given time. This is convenient but rarely possible in practice. In that case the predictor can be written as $\text{predictor}(\boldsymbol x(\tau),dt) = \boldsymbol x(\tau + dt)$
> - **Numerical**, which is the usual case. The simplest prediction scheme is then the Euler method:
$\boldsymbol x_{k+1}= \text{predictor}(\boldsymbol x_k, \delta)=\boldsymbol x_{k}+\delta \boldsymbol f\left(\boldsymbol x_{k}, \boldsymbol u_{k}\right)$
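
The Euler predictor can be sketched in a few lines of plain Python. This is only an illustration, not Rcognita's implementation: the function `euler_predictor` and the example right-hand side `rhs` (the scalar system $\dot x = -x + u$) are hypothetical names chosen here.

```python
def euler_predictor(f, state, action, delta):
    """One explicit Euler step: x_{k+1} = x_k + delta * f(x_k, u_k)."""
    derivative = f(state, action)
    return [x + delta * dx for x, dx in zip(state, derivative)]


def rhs(state, action):
    """Hypothetical right-hand side of the scalar system dx/dt = -x + u."""
    return [-state[0] + action[0]]


# Predict one step ahead from x = 1 with zero action and step size 0.1.
next_state = euler_predictor(rhs, [1.0], [0.0], 0.1)

# Chaining steps gives a multi-step prediction over a horizon.
trajectory = [[1.0]]
for _ in range(3):
    trajectory.append(euler_predictor(rhs, trajectory[-1], [0.0], 0.1))
```

A smaller `delta` brings the prediction closer to the analytical solution at the price of more steps per unit of simulated time.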

In this assignment we meet a new object, the **scenario**. A scenario is a module that builds and executes the main loop for a particular mode of operation, such as online or episodic. In this assignment we will use an episodic scenario.


<a id='Notation'></a>
### Notation summary
From now on we will use the following notation:

| Notation &nbsp; &nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;| &nbsp;&nbsp;Description |
|:-----------------------:|-------------|
| $\boldsymbol f(\cdot, \cdot, \cdot) : \mathbb{R}^{n+1}\times \mathbb{R}^{m} \rightarrow \mathbb{R}^{n}$ | A **state dynamics function** or, more informally, the **right-hand side** of a system <br /> of ordinary differential equations $\dot{\boldsymbol x} = \boldsymbol f(t, \boldsymbol x, \boldsymbol u)$|
| $\boldsymbol x \in \mathbb{R}^{n} $ | An element of the **state space** of a controlled system of dimensionality $n$ |
| $\boldsymbol u \in \mathbb{R}^{m}$ | An element of the **action space** of a controlled system of dimensionality $m$ |
| $\boldsymbol y \in \mathbb{R}^{k}$ | An **observation**|
| $\mathbb{X}\subset \mathbb{R}^{n} $| **State constraint set**|
| $\mathbb{U}\subset \mathbb{R}^{m} $| **Action constraint set**|
| $\boldsymbol h(\cdot): \mathbb{R}^{n} \rightarrow \mathbb{R}^{k}$ | **Observation function**  |
| $\boldsymbol\rho(\cdot) : \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$ | **Policy** function |
| $r(\cdot) : \mathbb{R}^n \times \mathbb{R}^m \rightarrow \mathbb{R}$ | **Running cost** function  |


### Goal
Our main goal here is to implement the Actor-Critic algorithm for tuning the coefficients of a [PID regulator](https://en.wikipedia.org/wiki/PID_controller).

###  <font color="blue"> Algorithm description </font>

In this setup we will replace the trivial critic with an "action-value" one. The purpose of this critic is to learn the $Q$-function, so the algorithm is constructed as follows:

I. **Initialization**:
- set iterations number **N_iterations**
- set episodes number **N_episodes**
- set **discount factor** $\gamma$
- initialize some **policy** parameters $\boldsymbol w_0$, learning rate $\eta$
- initialize some **critic** parameters $\boldsymbol \vartheta_0$, learning rate $\hat{\eta}$

II. **Main loop**:<br/>
(Run the episodic scenario)
>**for** i in range(**N_iterations**):
>>**for** j in range(**N_episodes**):
>>> **while** **time** < **t1**:
(the Rcognita components used at each step are given in bold in parentheses)
>>>> - simulate environment evolution (**simulator**, **system**)
>>>> - obtain observation (**system**) $\boldsymbol y_k = \boldsymbol h(\boldsymbol x_k)$
>>>> - obtain action (**actor**) $\boldsymbol u_k \sim  \mathcal{N}(\mu,\,\sigma^{2})$
>>>> - compute and store new gradient (**actor.model**)
>>>  - **reset episode:**
>>>> - compute and store REINFORCE objective gradient (**scenario**): $\sum_{k=0}^N \nabla_w \ln \rho^w(\boldsymbol u_k \vert \boldsymbol y_k) \cdot Q\left(\boldsymbol y_k, \boldsymbol u_k\right)$
>>>> - compute and store sum of squared Temporal Difference terms (**scenario**): $\sum_{k=0}^N \text{TD}_k^2$
>> - **iteration update:**
>>> - **critic update**
>>>> - compute **mean**(`squared_TD_sums_of_episodes`) = $\mathbb{E}[\sum_{k=0}^N \text{TD}_k^2]$, the mean over episodes of the sums of squared TDs
>>>> - $\boldsymbol \vartheta_{i+1} = \boldsymbol \vartheta_{i} - \hat{\eta}\nabla_{\boldsymbol \vartheta} \mathbb{E}[\sum_{k=0}^N \text{TD}_k^2]$ (1 step)
>>> - **actor update**
>>>> - compute the **mean** over all stored REINFORCE objective gradients (**scenario**): $\mathbb{E}\left[\sum_{k=0}^N \nabla_{\boldsymbol w} \ln \rho^{\boldsymbol w}(\boldsymbol u_k \vert \boldsymbol y_k) \cdot Q\left(\boldsymbol y_k, \boldsymbol u_k\right)\right]$
>>>> - perform a gradient step (**scenario**): $\boldsymbol w_{i+1}=\boldsymbol w_i+\eta \mathbb{E}\left[\sum_{k=0}^N \nabla_{\boldsymbol w} \ln \rho^{\boldsymbol w}(\boldsymbol u_k \vert \boldsymbol y_k) \cdot Q\left(\boldsymbol y_k, \boldsymbol u_k\right)\right]$
>> - **reset iteration** ...
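
The actor update can be illustrated on a toy problem. The sketch below assumes a hypothetical Gaussian policy with mean $\boldsymbol w^\top \boldsymbol y$ and fixed standard deviation $\sigma$, for which $\nabla_{\boldsymbol w} \ln \rho^{\boldsymbol w}(u \vert \boldsymbol y) = \frac{u - \boldsymbol w^\top \boldsymbol y}{\sigma^2}\,\boldsymbol y$. All function names and numbers are made up for illustration and are not part of Rcognita.

```python
def log_policy_gradient(weights, observation, action, sigma):
    """Gradient of ln N(action; weights . observation, sigma^2) w.r.t. weights."""
    mean = sum(w * y for w, y in zip(weights, observation))
    scale = (action - mean) / sigma ** 2
    return [scale * y for y in observation]


def episode_gradient(weights, trajectory, sigma):
    """Sum over steps of grad ln rho(u_k | y_k) * Q(y_k, u_k)."""
    total = [0.0] * len(weights)
    for observation, action, q_value in trajectory:
        grad = log_policy_gradient(weights, observation, action, sigma)
        total = [t + g * q_value for t, g in zip(total, grad)]
    return total


def actor_step(weights, episode_gradients, learning_rate):
    """One gradient step along the mean of the stored episode gradients."""
    n = len(episode_gradients)
    mean_grad = [sum(column) / n for column in zip(*episode_gradients)]
    return [w + learning_rate * g for w, g in zip(weights, mean_grad)]


weights = [1.0, 0.0]
# Each step is (observation, action, Q-value); the Q-values are invented.
episode_a = [([2.0, 3.0], 2.5, -1.5)]
episode_b = [([1.0, 1.0], 1.0, -2.0)]
gradients = [episode_gradient(weights, ep, sigma=0.5) for ep in (episode_a, episode_b)]
new_weights = actor_step(weights, gradients, learning_rate=0.01)
```

In the second episode the sampled action coincides with the policy mean, so its score-function gradient is zero and it contributes nothing to the step regardless of its Q-value.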

***

In [32]:
%%capture
"""
Just importing all the necessary stuff here.
DO NOT CHANGE
"""
%matplotlib qt
%load_ext autoreload
%autoreload 2

from rcognita_framework.pipelines.pipeline_inverted_pendulum import PipelineInvertedPendulum
from rcognita_framework.rcognita.actors import ActorProbabilisticEpisodic
from rcognita_framework.rcognita.critics import CriticActionValue
from rcognita_framework.rcognita.systems import SysInvertedPendulum
from rcognita_framework.rcognita.models import ModelGaussianConditional, ModelNN
from rcognita_framework.rcognita.scenarios import EpisodicScenario
from rcognita_framework.rcognita.utilities import rc
from rcognita_framework.rcognita.optimizers import BaseOptimizer
import numpy as np
from torch import nn
import torch
import matplotlib.animation as animation
import matplotlib.pyplot as plt
import warnings

<h2 style="color:#A7BD3F;"> Section 1: Critic implementation </h2>

***
Contents:
* Model implementation
* Critic implementation

***

Implement your topology here. You can try squaring the signal, for example passing `input_tensor ** 2` into a linear layer. Why? Because in our case the critic should be negative semi-definite. In other words, `critic([0,0,0],[0])` should be zero and `critic(y, u)` ≤ 0 everywhere else.
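
The sign constraint can be checked on a small pure-Python stand-in for such a topology. This is a sketch under assumptions: it uses one bias-free linear layer followed by a negated sum of squares, and the weight matrix `toy_weights` is arbitrary and hypothetical.

```python
def toy_critic(weights, observation, action):
    """Negated sum of squares of a bias-free linear layer: always <= 0."""
    z = list(observation) + list(action)
    hidden = [sum(w * v for w, v in zip(row, z)) for row in weights]
    return -sum(h * h for h in hidden)


# Arbitrary 2x4 weight matrix for a 3-dimensional observation and 1-dimensional action.
toy_weights = [
    [0.5, -1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

value_at_origin = toy_critic(toy_weights, [0.0, 0.0, 0.0], [0.0])
value_elsewhere = toy_critic(toy_weights, [1.0, 2.0, 3.0], [1.0])
```

Because the layer has no bias, a zero input always maps to zero, and squaring makes every other output non-positive whatever the weights are.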

In [26]:


class ModelNNStudent(ModelNN):

    model_name = "NN"

    def __init__(self, dim_observation, dim_action, *args, weights = None, **kwargs):
        super().__init__(dim_observation, dim_action, *args, weights=weights, **kwargs)
        
        #############################################
        # YOUR CODE BELOW
        #############################################

        self.fc1 = nn.Linear(
            dim_observation + dim_action, dim_observation + dim_action, bias=False
        )

        if weights is not None:
            self.load_state_dict(weights)
            
        #############################################
        # YOUR CODE ABOVE
        #############################################

        self.double()
        self.cache_weights()

    def forward(self, input_tensor, weights=None):
        if weights is not None:
            self.update(weights)
        
        #############################################
        # YOUR CODE BELOW
        #############################################

        x = input_tensor
        x = self.fc1(x)

        x = -(x ** 2)
        x = torch.sum(x)
        
        #############################################
        # YOUR CODE ABOVE
        #############################################

        return x


### Q-critic (Action-Value-critic) implementation

Here we will implement a temporal difference (TD).
FYI:
* Vectors in the data buffer are stored in order from the oldest (top, indexed as `[0,:]`) to the most recent (bottom, `[-1, :]`)
* The data buffer length is given by `self.data_buffer_size`
* Important reminder: every model can cache its own weights. You can use this to evaluate a TD correctly: just set `use_stored_weights=True` when you call the critic's model. The call then uses the cached weights, which **were detached automatically**, so you can construct a temporal difference without using Torch functionality directly. Here are some examples:

In [27]:
model = ModelNNStudent(3, 1)
model([1.,2.,3.], [1.])

tensor(-2.6767, dtype=torch.float64, grad_fn=<SumBackward0>)

In [28]:
model([1.,2.,3.], [1.], use_stored_weights=True) #### gradient won't flow through this tensor

tensor(-2.6767, dtype=torch.float64)

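Note how the first output differs from the second one: the first carries a `grad_fn`, while the second, computed with the cached weights, does not. Okay, let's move on!
* Last, but not least! $\text{TD}(\boldsymbol y_{\text{old}},\boldsymbol y_{\text{next}},\boldsymbol a_{\text{old}},\boldsymbol a_{\text{next}})= \text{critic}(\boldsymbol y_{\text{old}}, \boldsymbol a_{\text{old}}) - \gamma\,\text{critic}^*(\boldsymbol y_{\text{next}}, \boldsymbol a_{\text{next}}) - r(\boldsymbol y_{\text{old}}, \boldsymbol a_{\text{old}})$, where $\text{critic}^*$ is the critic with fixed weights, $\gamma$ is the discount factor and $r$ is the running cost.

* The function `objective` should return the sum of the squared TDs over all pairs of consecutive entries in the current data buffer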
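
A worked toy example may help to check the arithmetic. The sketch below uses the form $\text{TD} = Q_\theta(\text{old}) - \gamma\, Q_{\theta^*}(\text{next}) - r(\text{old})$ with a hypothetical one-parameter critic $Q_\theta(y, a) = -\theta (y^2 + a^2)$ and a hypothetical running cost $r(y, a) = -(y^2 + a^2)$. The names, the sign convention of `running_cost` and the factor 1/2 (which only rescales the gradient) are assumptions of this example.

```python
def q_value(theta, y, a):
    """Toy critic: Q(y, a) = -theta * (y^2 + a^2)."""
    return -theta * (y * y + a * a)


def running_cost(y, a):
    """Toy running cost with the same sign convention as the critic."""
    return -(y * y + a * a)


def td_objective(theta, theta_fixed, observations, actions, discount):
    """Sum of halved squared TDs over consecutive buffer entries (oldest first)."""
    total = 0.0
    for k in range(1, len(observations)):
        y_old, a_old = observations[k - 1], actions[k - 1]
        y_next, a_next = observations[k], actions[k]
        td = (
            q_value(theta, y_old, a_old)
            - discount * q_value(theta_fixed, y_next, a_next)  # fixed weights
            - running_cost(y_old, a_old)
        )
        total += 0.5 * td ** 2
    return total


observations = [1.0, 0.5, 0.25]  # the observation halves at every step
actions = [0.0, 0.0, 0.0]
loss = td_objective(1.0, 1.0, observations, actions, discount=0.9)

# For this buffer the TD vanishes when theta * (1 - discount / 4) = 1.
theta_star = 1.0 / (1.0 - 0.9 / 4.0)
loss_at_theta_star = td_objective(theta_star, theta_star, observations, actions, discount=0.9)
```

With `theta = 1` the two TD terms are 0.225 and 0.05625, while at `theta_star` both vanish, which is what a well-trained critic should achieve on data it explains exactly.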

In [29]:
class CriticActionValue(CriticActionValue):
    def objective(self, data_buffer=None, weights=None):
        """
        Objective of the critic, say, a squared temporal difference.

        """
        if data_buffer is None:
            observation_buffer = self.observation_buffer
            action_buffer = self.action_buffer
        else:
            observation_buffer = data_buffer["observation_buffer"]
            action_buffer = data_buffer["action_buffer"]

        critic_objective = 0
        
        ####### At this point the data buffer is available, just use it #######
        
        #############################################
        # YOUR CODE BELOW
        #############################################

        for k in range(self.data_buffer_size - 1, 0, -1):
            observation_old = observation_buffer[k - 1, :]
            observation_next = observation_buffer[k, :]
            action_old = action_buffer[k - 1, :]
            action_next = action_buffer[k, :]

            # Temporal difference

            critic_old = self.model(observation_old, action_old, weights=weights)
            critic_next = self.model(
                observation_next, action_next, use_stored_weights=True
            )

            temporal_difference = (
                critic_old
                - self.discount_factor * critic_next
                - self.running_objective(observation_old, action_old)
            )

            critic_objective += 1 / 2 * temporal_difference ** 2
            
        #############################################
        # YOUR CODE ABOVE
        #############################################

        return critic_objective

<h2 style="color:#A7BD3F;"> Section 2: Actor modification </h2>

As you may remember, in the last assignment you implemented the evaluation of the policy distribution gradient in the actor's `update` method. Now we only need to change it slightly. 

Take the `update` function from your last solution, insert it here, and multiply your gradient by the Q-function value given by your critic.


In [31]:
class ActorProbabilisticEpisodicAC(ActorProbabilisticEpisodic):
    def update(self, observation):
        #############################################
        # YOUR CODE BELOW
        #############################################
        action_sample = self.model.sample_from_distribution(observation)
        self.action = np.array(
            np.clip(action_sample, self.action_bounds[0], self.action_bounds[1])
        )
        self.action_old = self.action

        Q_value = self.critic(observation, action_sample).detach().numpy()
        current_gradient = self.model.compute_gradient(action_sample) * Q_value

        self.store_gradient(current_gradient)
        #############################################
        # YOUR CODE ABOVE
        #############################################

<h2 style="color:#A7BD3F;"> Section 3: Optimizer implementation </h2>

The Torch optimization loop should be implemented here:
1. Zero all gradients
2. Compute the loss with the objective function you've passed
3. Invoke gradient evaluation with `.backward()`
4. Perform an optimization step with `.step()`
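
The four steps can be seen in isolation on a toy problem. This is a minimal sketch, not the assignment solution: the scalar parameter `w` and the function `toy_objective` are hypothetical, and plain `torch.optim.SGD` is used so that the arithmetic is easy to follow.

```python
import torch

# Hypothetical toy problem: find w such that 2 * w is close to 6.
w = torch.nn.Parameter(torch.tensor(0.0))
optimizer = torch.optim.SGD([w], lr=0.1)


def toy_objective():
    """Squared error; rebuilt on every call so each backward has a fresh graph."""
    return (2.0 * w - 6.0) ** 2


losses = []
for _ in range(50):
    optimizer.zero_grad()     # 1. zero all gradients
    loss = toy_objective()    # 2. compute the loss
    loss.backward()           # 3. evaluate gradients
    optimizer.step()          # 4. perform an optimization step
    losses.append(loss.item())
```

Note that the loss is recomputed inside the loop. Calling `.backward()` twice on the same already-computed loss tensor raises an error unless the graph was retained, so an objective built from tensors computed beforehand supports only a single step.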

In [33]:
class TorchOptimizer(BaseOptimizer):
    engine = "Torch"

    def __init__(
        self, opt_options, iterations=1, opt_method=torch.optim.Adam, verbose=False
    ):
        self.opt_method = opt_method
        self.opt_options = opt_options
        self.iterations = iterations
        self.verbose = verbose
        self.loss_history = []

    def optimize(self, objective, model, objective_input):
        optimizer = self.opt_method(
            model.parameters(), **self.opt_options, weight_decay=0
        )
        #############################################
        # YOUR CODE BELOW
        #############################################
        for _ in range(self.iterations):
            optimizer.zero_grad()
            loss = objective(objective_input)
            loss_before = loss.detach().numpy()
            loss.backward()
            optimizer.step()
        #############################################
        # YOUR CODE ABOVE
        #############################################

<h2 style="color:#A7BD3F;"> Section 4: Scenario implementation</h2>

***

In this task you will just modify the scenario you worked with in the previous assignment. This is an elementary task, meant to make you think about what happens at the end of an episode and of an iteration. 
* At the end of the episode, append the critic objective value to the `squared_TD_sums_of_episodes` list
* In the iteration update phase, launch the optimizer
* When you reset the iteration, just empty `squared_TD_sums_of_episodes`


In [6]:
class EpisodicScenarioAsyncAC(EpisodicScenario):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.critic_optimizer = TorchOptimizer({"lr": 0.01})
        self.squared_TD_sums_of_episodes = []
        self.square_TD_means = []

    def reset_episode(self):
        #############################################
        # YOUR CODE BELOW
        #############################################
        self.squared_TD_sums_of_episodes.append(self.critic.objective())
        #############################################
        # YOUR CODE ABOVE
        #############################################
        super().reset_episode()

    def iteration_update(self):
        mean_sum_of_squared_TD = self.get_mean(self.squared_TD_sums_of_episodes) #just for visualization purposes
        self.square_TD_means.append(mean_sum_of_squared_TD.detach().numpy()) #just for visualization purposes
        #############################################
        # YOUR CODE BELOW
        #############################################
        self.critic_optimizer.optimize(
            objective=self.get_mean,
            model=self.critic.model,
            objective_input=self.squared_TD_sums_of_episodes,
        )
        #############################################
        # YOUR CODE ABOVE
        #############################################
        super().iteration_update()

    def reset_iteration(self):
        #############################################
        # YOUR CODE BELOW
        #############################################
        self.squared_TD_sums_of_episodes = []
        #############################################
        # YOUR CODE ABOVE
        #############################################
        super().reset_iteration()

<h2 style="color:#A7BD3F;"> Section 5: Testing</h2>

***

Here you have full freedom of choice: you can tune and change whatever you want. Your goal is to beat a baseline. If you get an outcome greater than -350, you earn 100 points. If your result is lower than -700, you earn nothing.
Play with the hyperparameters, choose the number of episodes and the number of iterations. You may also vary the length of one episode. There is plenty of work. Applying the method is not that straightforward: we have prepared the necessary scaffolding, but your main objective is to use your ML and mathematical intuition to obtain the best result you can. You will definitely have some questions, so do not hesitate to DM me on Telegram 😊 -> @odinmaniac

Some addenda:
* The problem is quite stochastic, so you may occasionally obtain good results by chance. But the point here is to achieve convergence! If the parameters and the loss have not stabilized, the result doesn't count (you can see this on the 3rd and 4th subplots). So, if you think you have obtained solid results, please make plots, or ask me to launch your notebook if you struggle with hardware or software issues
* If you're desperate, the results are bad and you don't know what to do, you could try the following heuristics:
    * Disable the optimization of parameter I (the second one, $\theta_2$)
    * Change the learning rate (sometimes you need to change it really dramatically, and that's okay)
    * Bound your weights

In [None]:
class PipelineInvertedPendulumStudent(PipelineInvertedPendulum):

    def initialize_system(self):
        self.system = SysInvertedPendulumStudent(
            sys_type="diff_eqn",
            dim_state=self.dim_state,
            dim_input=self.dim_input,
            dim_output=self.dim_output,
            dim_disturb=self.dim_disturb,
            pars=[self.m, self.g, self.l],
            is_dynamic_controller=self.is_dynamic_controller,
            is_disturb=self.is_disturb,
            pars_disturb=[],
        )
        self.observation_init = self.system.out(self.state_init, time=0)

    def initialize_models(self):
        super().initialize_models()
        self.actor_model = ModelGaussianConditionalStudent(
            expectation_function=self.safe_controller,
            arg_condition=self.observation_init,
            weights=self.initial_weights,
        )

    def initialize_actor_critic(self):
        self.critic = CriticTrivialStudent(
            running_objective=self.running_objective, sampling_time=self.sampling_time
        )
        self.actor = ActorProbabilisticEpisodicStudent(
            self.prediction_horizon,
            self.dim_input,
            self.dim_output,
            self.control_mode,
            self.action_bounds,
            action_init=self.action_init,
            predictor=self.predictor,
            optimizer=self.actor_optimizer,
            critic=self.critic,
            running_objective=self.running_objective,
            model=self.actor_model,
        )

    def initialize_scenario(self):
        self.scenario = EpisodicScenarioStudent(
            system=self.system,
            simulator=self.simulator,
            controller=self.controller,
            actor=self.actor,
            critic=self.critic,
            logger=self.logger,
            datafiles=self.datafiles,
            time_final=self.time_final,
            running_objective=self.running_objective,
            no_print=self.no_print,
            is_log=self.is_log,
            is_playback=self.is_playback,
            N_episodes=self.N_episodes,
            N_iterations=self.N_iterations,
            state_init=self.state_init,
            action_init=self.action_init,
        )

    def execute_pipeline(self, **kwargs):
        """
        Full execution routine
        """
        np.random.seed(42)
        self.load_config()
        self.setup_env()
        self.__dict__.update(kwargs)
        self.initialize_system()
        self.initialize_predictor()
        self.initialize_safe_controller()
        self.initialize_models()
        self.initialize_objectives()
        self.initialize_optimizers()
        self.initialize_actor_critic()
        self.initialize_controller()
        self.initialize_simulator()
        self.initialize_logger()
        self.initialize_scenario()
        if not self.no_visual and not self.save_trajectory:
            self.initialize_visualizer()
            self.main_loop_visual()
        else:
            self.scenario.run()
            if self.is_playback:
                self.playback()
                
    #def playback(self):
    #    self.initialize_visualizer()
    #    anm = animation.FuncAnimation(
    #        self.animator.fig_sim,
    #        self.animator.playback,
    #        init_func=self.animator.init_anim,
    #        blit=False,
    #        interval=self.sampling_time / 1e6,
    #        repeat=False,
    #    )
#
    #    self.animator.get_anm(anm)
    #    self.animator.speedup = self.speedup
#
    #    cId = self.animator.fig_sim.canvas.mpl_connect(
    #        "key_press_event", lambda event: on_key_press(event, anm)
    #    )
#
    #    anm.running = True
#
    #    self.animator.fig_sim.tight_layout()
    #    plt.show()
    
##### Execution here!!! Full list of kwargs can be seen at 
##### rcognita_framework.pipelines.config_blueprints in the ConfigInvertedPendulum
pipeline = PipelineInvertedPendulumStudent()
pipeline.execute_pipeline(
    no_visual=True, 
    time_final=10, 
    speedup=50,
    is_playback=True, 
    N_episodes=3, 
    N_iterations=8, 
    learning_rate=0.01,
    initial_weights=[1., 0., 1.],
    sampling_time=0.1, # Do not change it!
    no_print=True)

### Grading!

In [None]:
pipeline = PipelineInvertedPendulumStudent()
pipeline.execute_pipeline(
    no_visual=True, 
    time_final=10, 
    is_playback=False, 
    N_episodes= , ##### set your episodes number 
    N_iterations=1,##### set your iterations number 
    initial_weights=[1., 0., 1.],
    no_print=True)

mean_episodic = pipeline.scenario.outcome_episodic_means[0]
grade = np.clip(0.28 * mean_episodic + 200, 0, 100)
print(f"Your grade: {grade}")
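
To see how an outcome maps to points, the grading formula from the cell above can be evaluated on a few sample values. The helper `grade_from_outcome` is a hypothetical name introduced for this illustration, and the sample outcomes are arbitrary.

```python
def grade_from_outcome(mean_episodic):
    """Linear grading rule 0.28 * outcome + 200, clipped to the range [0, 100]."""
    return min(max(0.28 * mean_episodic + 200.0, 0.0), 100.0)


sample_outcomes = (-300.0, -350.0, -500.0, -700.0, -750.0)
grades = {outcome: grade_from_outcome(outcome) for outcome in sample_outcomes}

for outcome, grade in grades.items():
    print(f"outcome {outcome:7.1f} -> grade {grade:5.1f}")
```

Under this formula the full 100 points are reached for outcomes above roughly -357, and the grade drops to zero for outcomes below roughly -714.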