<h1 style="color:#333333; text-align:center; line-height: 0;"> <img style="right;" src="logo.png" width=18% height=18%> Reinforcement Learning | Assignment 2 
</h1>
<br/><br/>


The goal of this assignment is to implement:
- system 
- conditional model 
- actor 
- critic
- REINFORCE

___Total points:___ 100

###  <font color="blue"> A brief introduction </font>
Read it carefully: it covers most of what you will need to complete the assignment.

***

### About Rcognita
The platform for this assignment (and all subsequent work) is [Rcognita](https://gitflic.ru/project/aidynamicaction/rcognita), a framework for applying control theory and machine learning algorithms to control problems. An integral part of such problems is the closed-loop interaction between the controlled agent and an environment that evolves over time. In the Rcognita paradigm, the `pipeline` is the main container of all the classes and variables needed to run a simulation.

The main parts of the `pipeline` are:
* `simulator`, defined in the module `simulators.py`, which is responsible for simulating the evolution of the environment
* `actor`, defined in the module `actors.py`, which is responsible for obtaining the action
* `critic`, defined in the module `critics.py`, which is responsible for learning the reward function and obtaining its value
* `controller`, defined in the module `controllers.py`, which puts everything together into an RL (or other) controller
* `system`, defined in the module `systems.py`.

Other minor objects are also declared in the pipeline, which is assembled module by module up to its execution.
To be on the same page, we fix some terminology to prevent confusion.
* `weights` is the general name for both the weights of a neural network and the values in the tables of a value function or a policy. This convention keeps us consistent with classical RL, where the critic and the actor are implemented as neural networks with some **weights**. This brings us to the second term.
* `model`. Parameters make something specific, while the general form itself is called a `model`. There are many models of different types and forms (such as neural networks). The critic, the actor and even the running cost always have a model.
* `predictor`. In spite of its cryptic name, this object performs an important function: it carries the law by which the future dynamics of our system are predicted. For example, suppose we have a differential equation
$
\begin{cases}
\dot{\boldsymbol x} = \boldsymbol f(\boldsymbol x, \boldsymbol u)\\
\boldsymbol y = h(\boldsymbol x) \\
\boldsymbol x(0)=\boldsymbol x_{0}\\
\end{cases}
$
where $\boldsymbol x_{0}$ is the **initial state**.
In general, there are several ways to predict:
> - **Analytical**: we have a closed-form solution $\boldsymbol x(t)$ of the ODE and can evaluate it at any given time. This is convenient but rarely possible in practice. In this case the predictor can be expressed as $\text{predictor}(\boldsymbol x(\tau),dt) = \boldsymbol x(\tau + dt)$
> - **Numerical**: this is the most common case. The simplest numerical predictor is the Euler method:
$\boldsymbol x_{k+1}= \text{predictor}(\boldsymbol x_k, \delta)=\boldsymbol x_{k}+\delta \boldsymbol f\left(\boldsymbol x_{k}, \boldsymbol u_{k}\right)$
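
As a minimal sketch of the Euler predictor above: the function name `euler_predictor` and the toy right-hand side `toy_rhs` are illustrative assumptions and are not part of Rcognita.

```python
import numpy as np


def euler_predictor(state, action, rhs, delta):
    """One explicit Euler step: x_{k+1} = x_k + delta * f(x_k, u_k)."""
    return state + delta * rhs(state, action)


def toy_rhs(state, action):
    # Hypothetical linear system x' = -x + u, used only for illustration
    return -state + action


state = np.array([1.0])
action = np.array([0.0])
next_state = euler_predictor(state, action, toy_rhs, delta=0.1)
print(next_state)
```

A smaller step `delta` brings the prediction closer to the exact solution, at the cost of more steps per unit of simulated time.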

In this assignment we meet a new object, the **scenario**. A scenario is a module that forms and executes the main loop for a particular setting, such as an online or an episodic one. In this assignment we will use an episodic scenario.


<a id='Notation'></a>
### Notation summary
From now on we will use the following notation:

| Notation &nbsp; &nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;| &nbsp;&nbsp;Description |
|:-----------------------:|-------------|
| $\boldsymbol f(\cdot, \cdot, \cdot) : \mathbb{R}^{n+1}\times \mathbb{R}^{m} \rightarrow \mathbb{R}^{n}$ |A **state dynamic function** or, more informally, the **right-hand side** of a system <br /> of ordinary differential equations $\dot{\boldsymbol x} = \boldsymbol f(t, \boldsymbol x, \boldsymbol u)$|
| $\boldsymbol x \in \mathbb{R}^{n} $ | An element of the **state space** of a controlled system of dimensionality $n$ |
| $\boldsymbol u \in \mathbb{R}^{m}$ | An element of the **action space** of a controlled system of dimensionality $m$ |
| $\boldsymbol y \in \mathbb{R}^{k}$ | An **observation**|
| $\mathbb{X}\subset \mathbb{R}^{n} $| **State constraint set**|
| $\mathbb{U}\subset \mathbb{R}^{m} $| **Action constraint set**|
| $\boldsymbol h(\cdot): \mathbb{R}^{n} \rightarrow \mathbb{R}^{k}$ | **Observation function**  |
| $\boldsymbol\rho(\cdot) : \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$ | **Policy** function |
| $r(\cdot) : \mathbb{R}^n \times \mathbb{R}^m \rightarrow \mathbb{R}$ | **Running cost** function  |


### Goal
Our main goal here is to implement the whole system almost from scratch and to apply a policy gradient algorithm to tune the coefficients of a [PID regulator](https://en.wikipedia.org/wiki/PID_controller) (more precisely, its P and D coefficients).
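
To make the role of the P and D coefficients concrete, here is a rough sketch of a PD law $u = -(k_p \varphi + k_d \dot\varphi)$ stabilizing an inverted pendulum with explicit Euler steps. The pendulum dynamics follow the form used later in this assignment. The parameter values (`m`, `g`, `l`), the gains and the helper names are assumptions chosen for illustration only.

```python
import numpy as np


def pendulum_rhs(state, u, m=1.0, g=9.81, l=1.0):
    """Right-hand side of the pendulum: (phi', theta') with theta = phi'."""
    phi, theta = state
    return np.array([theta, g / l * np.sin(phi) + u / (m * l ** 2)])


def simulate_pd(k_p, k_d, state_init, dt=0.01, n_steps=1000):
    """Simulate the closed loop under a PD law using explicit Euler steps."""
    state = np.array(state_init, dtype=float)
    for _ in range(n_steps):
        phi, phi_dot = state
        u = -(k_p * phi + k_d * phi_dot)  # PD control law
        state = state + dt * pendulum_rhs(state, u)
    return state


# Hand-picked gains (an assumption); policy gradient is meant to find such values
final_state = simulate_pd(k_p=30.0, k_d=10.0, state_init=[0.5, 0.0])
print(final_state)
```

With these hand-picked gains the angle and the angular velocity decay towards zero. The assignment replaces this manual choice with weights learned by policy gradient.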

###  <font color="blue"> Algorithm description </font>

I. **Initialization**:
- set iterations number **N_iterations**
- set episodes number **N_episodes**
- set **discount factor** $\gamma$
- initialize some **policy** parameters $\theta_0$, learning rate $\eta$

II. **Main loop**:<br/>
(Run the episodic scenario; the Rcognita parts used at each step are given in bold inside parentheses)
>**for** i in range(**N_iterations**):
>>**for** j in range(**N_episodes**):
>>> **while** **time** < **t1**:
>>>> - simulate environment evolution (**simulator**, **system**)
>>>> - obtain observation (**system**) $\boldsymbol y_i = \boldsymbol h(\boldsymbol x_i)$
>>>> - obtain action (**actor**) $u_i \sim  \mathcal{N}(\mu,\,\sigma^{2})$
>>>> - compute and store new gradient (**actor.model**)
>>>> - update accumulated outcome (**critic**): $\sum_{i=0}^N \gamma^i r\left(y_i, u_i\right)$
>>> - compute and store REINFORCE objective gradient (**scenario**): $\sum_{k=0}^N \nabla_\theta \ln \rho^\theta(u_k \vert y_k) \cdot \sum_{i=0}^N \gamma^i r\left(y_i, u_i\right)$
>> - compute the mean over all stored REINFORCE objective gradients (**scenario**): $\mathbb{E}\left[\sum_{k=0}^N \nabla_\theta \ln \rho^\theta(u_k \vert y_k) \cdot \sum_{i=0}^N \gamma^i r\left(y_i, u_i\right)\right]$
>> - perform a gradient step (**scenario**): $\theta_{i+1}=\theta_i+\eta \mathbb{E}\left[\sum_{k=0}^N \nabla_\theta \ln \rho^\theta(u_k \vert y_k) \cdot \sum_{i=0}^N \gamma^i r\left(y_i, u_i\right)\right]$

***
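
The last three steps of the main loop can be sketched in plain NumPy as follows. This is only an illustration under assumptions: the function `reinforce_step` and its arguments are hypothetical names, and the log-policy gradients and episode outcomes are assumed to be collected already.

```python
import numpy as np


def reinforce_step(theta, gradients_per_episode, outcomes, learning_rate):
    """One REINFORCE update from a batch of episodes.

    gradients_per_episode: list of arrays of shape (n_steps, dim_theta)
        holding the log-policy gradients stored during each episode.
    outcomes: list of accumulated outcomes, one per episode.
    """
    # Per-episode term: (sum of log-policy gradients) * (episode outcome)
    episode_terms = [
        outcome * np.sum(grads, axis=0)
        for grads, outcome in zip(gradients_per_episode, outcomes)
    ]
    # Monte Carlo estimate of the expectation over episodes
    mean_gradient = np.mean(episode_terms, axis=0)
    # Gradient step as written in the algorithm description
    return theta + learning_rate * mean_gradient


theta = np.zeros(2)
grads = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([[1.0, 1.0]])]
outcomes = [2.0, -1.0]
new_theta = reinforce_step(theta, grads, outcomes, learning_rate=0.1)
print(new_theta)
```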

In [2]:
!pip install icecream


In [3]:
%%capture
"""
Just importing all the necessary stuff here.
DO NOT CHANGE
"""
%matplotlib notebook
%load_ext autoreload
%autoreload 2

from rcognita_framework.pipelines.pipeline_inverted_pendulum import PipelineInvertedPendulum
from rcognita_framework.rcognita.actors import ActorProbabilisticEpisodic
from rcognita_framework.rcognita.critics import CriticTrivial
from rcognita_framework.rcognita.systems import SysInvertedPendulum
from rcognita_framework.rcognita.models import ModelGaussianConditional
from rcognita_framework.rcognita.scenarios import EpisodicScenario
from rcognita_framework.rcognita.utilities import rc
import numpy as np
import matplotlib.animation as animation
import matplotlib.pyplot as plt
import warnings
from icecream import ic

<h2 style="color:#A7BD3F;"> Section 1: System implementation </h2>

***

<img style="left;" src="n_pendulum.png" width=18% height=18%>
In our case the system has the following form:
\begin{equation}
\begin{cases}
\dot{\varphi} = \theta \\
\dot{\theta} = \frac{g}{l}\sin{\varphi} + \frac{u}{ml^2} \\
\end{cases}
\end{equation}

Your task is to implement this system. More precisely, there are two crucial methods:
* `_compute_state_dynamics(self, time, state, action, disturb)`, which computes and returns $\boldsymbol f(t, \boldsymbol x, \boldsymbol u)$, the right-hand side of the system. (You should fill `Dstate` with the correct values $(\dot{\varphi}, \dot{\theta})$.)
* `out(self, state, time=None, action=None)`, which yields an observation $\boldsymbol y = \boldsymbol h(\boldsymbol x)$, where $\boldsymbol y = (\varphi, \int\limits_0^t \varphi dt, \dot{\varphi})$.

***


In [4]:
class SysInvertedPendulumStudent(SysInvertedPendulum):
    """
    System class: mathematical inverted pendulum

    """

    def _compute_state_dynamics(self, time, state, action, disturb=[]):
        """
        Method computes state dynamics function of the system
        """
        m, g, l = self.pars[0], self.pars[1], self.pars[2]

        #############################################
        # YOUR CODE BELOW
        #############################################

        Dstate = np.zeros(self.dim_state)
        
        u = action[0]
        phi, theta = state
        
        Dstate[0] = theta
        Dstate[1] = g / l * np.sin(phi) + (u / (m * l ** 2))
        # ic(Dstate, time)

        #############################################
        # YOUR CODE ABOVE
        #############################################

        return Dstate

    def out(self, state, time=None, action=None):
        """
        Method computes observation
        """

        
        #############################################
        # YOUR CODE BELOW
        #############################################

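        # Time elapsed since the previous call (assumes `time` is not None
        # and that `self.time_old` is maintained elsewhere)
        delta_time = time - self.time_old
        phi, theta = state[0], state[1]
        # Rectangle-rule update of the running integral of phi
        self.integral_alpha += delta_time * phi
        # Observation y = (phi, integral of phi, dphi/dt)
        observation = np.array([phi, 
                                self.integral_alpha,
                                theta])
        ic(observation, time)
        return observation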
    
        #############################################
        # YOUR CODE ABOVE
        #############################################

<h2 style="color:#A7BD3F;"> Section 2: Conditional model implementation</h2>

***

In our setting we have a stochastic policy, which is modeled by the conditional distribution $\rho^w(u \mid \boldsymbol y) = \frac{1}{\sqrt{\pi}}\exp\left(-(u-\mu)^2\right)$,
where $\boldsymbol w:=(w_1, 0, w_2)$ and $\mu:=-\langle \boldsymbol w,\boldsymbol y\rangle$.

Implement the following methods:
* `update_expectation(self, arg_condition)`: computes the expectation parameter of the distribution given the passed `arg_condition`
* `compute_gradient(self, argin)`: computes the gradient of the log-density with respect to the weights. Derive it on paper first.
* `update(self, new_weights)`: this method is invoked after each gradient update, so it basically resets the model. Note that you can use `self.arg_condition_init` for this purpose

***

`arg_condition` in this setting is an observation $\boldsymbol y$
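
A hand-derived gradient can be checked numerically. The sketch below assumes exactly the density stated above, for which $\nabla_w \ln \rho^w(u \mid \boldsymbol y) = -2(u-\mu)\,\boldsymbol y$ with $\mu = -\langle \boldsymbol w, \boldsymbol y\rangle$, and compares this with central finite differences. All function names here are illustrative. The sign and scaling conventions used inside the framework may differ.

```python
import numpy as np


def log_density(w, y, u):
    """ln rho^w(u | y) for rho = exp(-(u - mu)^2) / sqrt(pi), mu = -<w, y>."""
    mu = -np.dot(w, y)
    return -0.5 * np.log(np.pi) - (u - mu) ** 2


def grad_log_density(w, y, u):
    """Analytical gradient of the log-density with respect to w."""
    mu = -np.dot(w, y)
    return -2.0 * (u - mu) * y


def numerical_grad(w, y, u, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = np.zeros_like(w)
    for k in range(len(w)):
        step = np.zeros_like(w)
        step[k] = eps
        grad[k] = (log_density(w + step, y, u) - log_density(w - step, y, u)) / (2 * eps)
    return grad


w = np.array([1.0, 0.0, 1.0])
y = np.array([0.5, 0.1, 0.2])
u = 0.3
analytic = grad_log_density(w, y, u)
numeric = numerical_grad(w, y, u)
print(analytic, numeric)
```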


In [5]:
class ModelGaussianConditionalStudent(ModelGaussianConditional):
    
    """
    Gaussian probability distribution model with `weights[0]` being an expectation vector
    and `weights[1]` being a covariance matrix.
    The expectation vector can optionally be generated 
    """
    

    model_name = "model-gaussian"

    def update_expectation(self, arg_condition):
        """
        update expectation (mu) based on arg_condition
        """
        #############################################
        # YOUR CODE BELOW
        #############################################
        
        self.arg_condition = arg_condition
        self.expectation = -np.dot(arg_condition, self.weights)
        #############################################
        # YOUR CODE ABOVE
        #############################################

    def compute_gradient(self, argin):
        """
        Compute grad manually
        """
        #############################################
        # YOUR CODE BELOW
        #############################################

        grad = -2 * self.arg_condition * (self.expectation - argin[0]) / self.covariance

        ic(self.arg_condition, argin, self.expectation )
        
        # grad = -self.arg_condition
        
        return grad
        
        #############################################
        # YOUR CODE ABOVE
        #############################################

    def update(self, new_weights):
        """
        transform the new_weights into expectation 
        and apply update_expectation method
        """
        #############################################
        # YOUR CODE BELOW
        #############################################

        self.weights = np.clip(new_weights, 0, 100)
        self.update_expectation(self.arg_condition_init)
        
        #self.update_covariance()
        
        #############################################
        # YOUR CODE ABOVE
        #############################################

    def sample_from_distribution(self, argin):
        self.update_expectation(argin)
        self.update_covariance()
    
        return np.array([np.random.normal(self.expectation, self.covariance)])

<h2 style="color:#A7BD3F;"> Section 3: Actor implementation</h2>

***

As you remember from the introduction, the actor is responsible for obtaining the action. During the episode simulation it samples an action according to its model. By default, the `update(self, observation)` method performs this operation.
We also have to clip the action sampled from the distribution to the action bounds.
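
The sampling and clipping step can be sketched independently of the framework. In this illustration the helper `sample_clipped_action`, the standard deviation `std` and the bounds are assumptions. The mean follows $\mu = -\langle \boldsymbol w, \boldsymbol y\rangle$ from Section 2.

```python
import numpy as np


def sample_clipped_action(weights, observation, std, action_bounds, rng):
    """Sample u ~ N(mu, std^2) with mu = -<w, y>, then clip it to the bounds."""
    mean = -np.dot(weights, observation)
    raw_action = rng.normal(mean, std)
    clipped_action = np.clip(raw_action, action_bounds[0], action_bounds[1])
    return raw_action, clipped_action


rng = np.random.default_rng(0)
weights = np.array([100.0, 0.0, 0.0])
observation = np.array([1.0, 0.0, 0.0])
raw, clipped = sample_clipped_action(weights, observation, 1.0, (-20.0, 20.0), rng)
print(raw, clipped)
```

The raw sample is returned as well because the actor cell in this notebook evaluates the log-density gradient at the unclipped sample, while the clipped value is what is applied to the system.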

In [6]:
class ActorProbabilisticEpisodicStudent(ActorProbabilisticEpisodic):

    def update(self, observation):
        """
        obtain and store the action
        """
        action_sample = self.model.sample_from_distribution(observation) 
        ### use here sample_from_distribution from model
        self.action = np.array(
            np.clip(action_sample, self.action_bounds[0], self.action_bounds[1])
        )
        self.action_old = self.action
        current_gradient = self.model.compute_gradient(action_sample)
        ### compute gradient here using the corresponding model's method you've just implemented
        self.store_gradient(current_gradient)


    def update_weights_by_gradient(self, gradient, learning_rate):
        """
        Perform a step towards the gradient with some learning rate
        """
        model_weights = self.model.weights
        new_model_weights = np.array(
            model_weights - learning_rate * gradient * np.array([1, 0, 1])
        )
        self.model.update(new_model_weights)
        ic.enable()
        ic(new_model_weights)
        ic.disable()
        ### A gradient step should be performed here

<h2 style="color:#A7BD3F;"> Section 4: Critic implementation</h2>

***
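
The critic in this assignment only accumulates the running objective over an episode. A minimal sketch of such an accumulation is given below. It assumes a rectangle-rule (Euler) sum with a fixed sampling time and an optional discount factor. The function name and the way the discount is combined with the sampling time are illustrative assumptions.

```python
def accumulate_outcome(running_objectives, sampling_time, discount=1.0):
    """Accumulate sum_i discount^i * r_i * sampling_time over one episode."""
    outcome = 0.0
    for i, r in enumerate(running_objectives):
        outcome += discount ** i * r * sampling_time
    return outcome


# Hypothetical running objective values collected during an episode
values = [-1.0, -2.0, -3.0]
print(accumulate_outcome(values, sampling_time=0.1))
print(accumulate_outcome(values, sampling_time=0.1, discount=0.5))
```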

In [7]:
class CriticTrivialStudent(CriticTrivial):
    """
    This is a dummy to calculate outcome (accumulated running objective).
    Use an Euler method for that

    """

    def __init__(self, running_objective, sampling_time=0.01):
        self.running_objective = running_objective
        self.sampling_time = sampling_time
        self.outcome = 0

    def update_outcome(self, observation, action):
        #############################################
        # YOUR CODE BELOW
        #############################################
        
        # self.outcome += ... ### old += new * sampling_time
        
        self.outcome += self.running_objective(observation, action) * self.sampling_time
        # ic.enable()
        # ic(self.running_objective(observation, action), self.outcome)
        # ic.disable()
        #############################################
        # YOUR CODE ABOVE
        #############################################

<h2 style="color:#A7BD3F;"> Section 5: REINFORCE</h2>

***

* The actor stores gradients in `self.actor.gradients`
* The total episodic outcome can be accessed through `self.critic.outcome`

In [8]:
class EpisodicScenarioStudent(EpisodicScenario):

    def store_REINFORCE_objective_gradient(self):
        """
        This method should compute and then append reinforce objective gradient 
        to the `self.episode_REINFORCE_objective_gradients` variable
        """
        #############################################
        # YOUR CODE BELOW
        #############################################
        ic()
        # Record the episode outcome exactly once per episode
        self.outcomes_of_episodes.append(self.critic.outcome)
        # Episode outcome times the sum of the stored log-policy gradients
        episode_REINFORCE_objective_gradient = self.critic.outcome * sum(
            self.actor.gradients
        )
        self.episode_REINFORCE_objective_gradients.append(
            episode_REINFORCE_objective_gradient
        )

        #############################################
        # YOUR CODE ABOVE
        #############################################


<h2 style="color:#A7BD3F;"> Section 6: Testing</h2>

***

Here you have full freedom of choice: you can tune and change whatever you want. Your goal is to beat a baseline. If you get an outcome greater than -350, you earn 100 points. If your result is lower than -700, you earn nothing.
Play with the hyperparameters and choose the number of episodes and the number of iterations. You may also vary the length of one episode. There is plenty of work. Applying the algorithm is not that straightforward: we prepared the necessary infrastructure, but your main objective is to apply your ML and mathematical intuition to obtain the best result you can. You will definitely have some questions, so do not hesitate to DM me on Telegram 😊 -> @odinmaniac

Some additional notes:
* The problem is highly stochastic, so you can occasionally obtain good results by chance. But the point here is to achieve convergence! If the parameters and the loss did not stabilize, the result does not count (you will be able to see this on the 3rd and 4th subplots). So, if you think that you obtained solid results, please make plots, or ask me to launch your notebook if you struggle with hardware or software issues
* If you're desperate, the results are bad and you don't know what to do, you could try the following heuristics:
    * Disable the optimization of the I parameter (the second one, $\theta_2$)
    * Change the learning rate (sometimes you need to change it really dramatically, and that's okay)
    * Bound your weights
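
The grading cell of this notebook maps the final mean episodic outcome to a grade with a clipped linear function, `np.clip(0.28 * mean_episodic + 200, 0, 100)`. The small sketch below evaluates that formula for a few outcomes. The helper name `grade_from_outcome` is an assumption made for illustration.

```python
import numpy as np


def grade_from_outcome(mean_episodic):
    """Clipped linear mapping from the mean episodic outcome to a grade."""
    return float(np.clip(0.28 * mean_episodic + 200, 0, 100))


for outcome in (-350.0, -500.0, -900.0):
    print(outcome, grade_from_outcome(outcome))
```

Under this formula the grade saturates at 100 for outcomes around -350 and above, and drops to 0 for sufficiently low outcomes.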

In [19]:
class PipelineInvertedPendulumStudent(PipelineInvertedPendulum):

    def initialize_system(self):
        self.system = SysInvertedPendulumStudent(
            sys_type="diff_eqn",
            dim_state=self.dim_state,
            dim_input=self.dim_input,
            dim_output=self.dim_output,
            dim_disturb=self.dim_disturb,
            pars=[self.m, self.g, self.l],
            is_dynamic_controller=self.is_dynamic_controller,
            is_disturb=self.is_disturb,
            pars_disturb=[],
        )
        self.observation_init = self.system.out(self.state_init, time=0)

    def initialize_models(self):
        super().initialize_models()
        self.actor_model = ModelGaussianConditionalStudent(
            expectation_function=self.safe_controller,
            arg_condition=self.observation_init,
            weights=self.initial_weights,
        )

    def initialize_actor_critic(self):
        self.critic = CriticTrivialStudent(
            running_objective=self.running_objective, sampling_time=self.sampling_time
        )
        self.actor = ActorProbabilisticEpisodicStudent(
            self.prediction_horizon,
            self.dim_input,
            self.dim_output,
            self.control_mode,
            self.action_bounds,
            action_init=self.action_init,
            predictor=self.predictor,
            optimizer=self.actor_optimizer,
            critic=self.critic,
            running_objective=self.running_objective,
            model=self.actor_model,
        )

    def initialize_scenario(self):
        self.scenario = EpisodicScenarioStudent(
            system=self.system,
            simulator=self.simulator,
            controller=self.controller,
            actor=self.actor,
            critic=self.critic,
            logger=self.logger,
            datafiles=self.datafiles,
            time_final=self.time_final,
            running_objective=self.running_objective,
            no_print=self.no_print,
            is_log=self.is_log,
            is_playback=self.is_playback,
            N_episodes=self.N_episodes,
            N_iterations=self.N_iterations,
            state_init=self.state_init,
            action_init=self.action_init,
            learning_rate=self.learning_rate
        )
        self.scenario.is_plot_critic=False

    def execute_pipeline(self, **kwargs):
        """
        Full execution routine
        """
        np.random.seed(42)
        self.load_config()
        self.setup_env()
        self.__dict__.update(kwargs)
        self.initialize_system()
        self.initialize_predictor()
        self.initialize_safe_controller()
        self.initialize_models()
        self.initialize_objectives()
        self.initialize_optimizers()
        self.initialize_actor_critic()
        self.initialize_controller()
        self.initialize_simulator()
        self.initialize_logger()
        self.initialize_scenario()
        if not self.no_visual and not self.save_trajectory:
            self.initialize_visualizer()
            self.main_loop_visual()
        else:
            self.scenario.run()
            if self.is_playback:
                self.playback()
ic.disable()
##### Execution here!!! Full list of kwargs can be seen at 
##### rcognita_framework.pipelines.config_blueprints in the ConfigInvertedPendulum
pipeline = PipelineInvertedPendulumStudent()
pipeline.execute_pipeline(
    no_visual=True, 
    time_final=10, 
    speedup=100,
    is_playback=False, 
    N_episodes=2, 
    N_iterations=2, 
    learning_rate=0.0001,
    initial_weights=[1, 0., 1],
    sampling_time=0.1, # Do not change it!
    no_print=True
)

End of simulation episode


ic| new_model_weights: array([-0.78899064,  0.        ,  0.52893148])


End of simulation episode
End of simulation episode


ic| new_model_weights: array([4.72894563, 0.        , 0.57638671])


End of simulation episode


In [None]:
mean_episodic = pipeline.scenario.outcome_episodic_means[-1]

In [17]:
mean_episodic

-1211.9310745547182

### Grading!

In [18]:
pipeline = PipelineInvertedPendulumStudent()
pipeline.execute_pipeline(
    no_visual=True, 
    t1=10, 
    is_playback=False, 
    N_episodes= 10, ##### set your episodes number 
    N_iterations= 20,##### set your iterations number
    learning_rate=0.0001,
    initial_weights=[1., 0., 1.],
    no_print=True)

mean_episodic = pipeline.scenario.outcome_episodic_means[-1]
grade = np.clip(0.28 * mean_episodic + 200, 0, 100)
print(f"Your grade: {grade}")

End of simulation episode
(this line is printed after every episode; repeats are omitted here)

ic| new_model_weights: array([-0.27836498,  0.        ,  1.16773803])
ic| new_model_weights: array([8.17931563, 0.        , 1.24372018])
ic| new_model_weights: array([13.03049434,  0.        , -0.2941316 ])
ic| new_model_weights: array([-89.48318237,   0.        ,  93.94790187])
ic| new_model_weights: array([ 4.33448843,  0.        , 93.95601846])
ic| new_model_weights: array([ 4.38394001,  0.        , 93.94441419])
ic| new_model_weights: array([11.93359753,  0.        , 93.71357599])
ic| new_model_weights: array([10.27934762,  0.        , 93.81927265])
ic| new_model_weights: array([13.79900397,  0.        , 93.51319966])
ic| new_model_weights: array([13.74931383,  0.        , 93.30286898])
ic| new_model_weights: array([19.32945064,  0.        , 92.64998203])
ic| new_model_weights: array([20.59273637,  0.        , 92.52828048])
ic| new_model_weights: array([20.85401482,  0.        , 92.42687053])
ic| new_model_weights: array([22.86345644,  0.        , 91.97867536])
ic| new_model_weights: array([22.30290267,  0.        , 92.09815965])
ic| new_model_weights: array([23.74951527,  0.        , 91.78620667])
ic| new_model_weights: array([22.22906658,  0.        , 92.05908554])
ic| new_model_weights: array([26.59165915,  0.        , 91.11877354])
ic| new_model_weights: array([24.81623814,  0.        , 91.47696748])
ic| new_model_weights: array([27.43090916,  0.        , 90.85169088])

AttributeError: 'EpisodicScenarioStudent' object has no attribute 'square_TD_means'

In [None]:
pipeline.scenario.outcome_episodic_means

In [11]:
pipeline.scenario.outcome_episodic_means[-1]

-897.5062063067613

In [13]:
mean_episodic = pipeline.scenario.outcome_episodic_means[-1]
0.28 * mean_episodic + 200

-51.30173776589319

In [14]:
np.clip(0.28 * mean_episodic + 200, 0, 100)

0.0