# Deep Reinforcement Learning course at Moscow School of AI

<img src="IntroImages/Event Cover.png" alt="Drawing" style="width: 1200px;" />

The lectures are designed so that you can understand the basics of RL, formulated as Markov Decision Processes (MDPs) and Markov games, train an intelligent agent with algorithms such as DQN, PPO, A3C and MADDPG, and ultimately apply that knowledge to problems of your own choosing.

## Application of RL

Reinforcement learning (RL) is about building programs that learn complex tasks by themselves. In deep reinforcement learning, the agent's decision-making is represented by a deep neural network. A computer can now play Go at a level beyond the human world champions: AlphaZero starts with nothing but the rules of the game, with no human strategy built in, and improves its play based only on whether it wins or loses. RL lets AlphaZero learn from experience, gradually refining its decisions until it wins consistently. RL is not limited to board games. It can be applied in many fields, for example teaching a robot to walk and manipulate objects, self-driving cars, and stock market prediction. RL is a step towards artificial general intelligence, but we are still quite far from it.

* Autonomous driving
* Flying cars
* Stock market prediction
* Generally any decision-making kind of job

![Car](IntroImages/intro.jpg)


<img src="IntroImages/soccer.gif" alt="Drawing" style="width: 900px;" />
<img src="IntroImages/agent_example.PNG" alt="Drawing" style="width: 900px;" />

### Chess - MCTS
**Elo rating**: a number that measures the relative skill level of a player.

<img src="IntroImages/top_chess_players.PNG" alt="Drawing" style="width: 900px;" />

<img src="IntroImages/carlsen_magnus.png" alt="Drawing" style="width: 900px;" />

* | *
- | -
<img src="IntroImages/stockfish_1.PNG" alt="Drawing" style="height: 400px;" /> | <img src="IntroImages/stockfish_3.PNG" alt="Drawing" style="height: 400px;" />

<img src="IntroImages/alpha_zero.PNG" alt="Drawing" style="width: 700px;" />

### Dota 2 - MARL

<img src="IntroImages/dota2.gif" alt="Drawing" style="width: 900px;" />

## RL framework


In general, the RL setting consists of an agent and an environment. 
At the initial time-step, the agent observes the environment and must select an appropriate action in response. The environment then answers that action with a new observation and a reward, which evaluates the action the agent took.
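
The interaction loop described above can be sketched in a few lines of plain Python. This is only an illustration: `CoinFlipEnv` and `run_episode` are hypothetical names invented for this sketch, not part of Gym or of this course's code, and the three-value `step` return is a simplification.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the next coin flip for `horizon` steps."""

    def __init__(self, horizon=10, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return self.t  # the observation is just the current time-step

    def step(self, action):
        # reward evaluates the action that was just taken
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)
        done = self.t >= self.horizon
        return self.t, reward, done


def run_episode(env, policy):
    """Run one episode and return the cumulative reward."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)                   # agent acts
        observation, reward, done = env.step(action)   # environment responds
        total_reward += reward
    return total_reward


env = CoinFlipEnv(horizon=10, seed=0)
episode_return = run_episode(env, policy=lambda obs: 0)  # always guess 0
print("cumulative reward:", episode_return)
```

The agent's goal is to choose `policy` so that the value returned by `run_episode` is as large as possible on average.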

* | *
- | -
<img src="IntroImages/agent_env_berkeley.PNG" alt="Drawing" style="height: 400px;" /> | <img src="IntroImages/agent_env.png" alt="Drawing" style="height: 400px;" />

The goal of the agent: maximize the expected cumulative reward.

## Learning Environment

* Gym Environment
* Unity ML

<img src="IntroImages/envs.gif" alt="Drawing" style="width: 900px;" />

## Pong from pixels  (PyTorch)
* Compute device selection
* Gym environment initialization

In [None]:
# custom utilities for displaying animation, collecting rollouts and more
import pong_utils
import torch

%matplotlib inline

# check which device is being used. 
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("using device: ",device)
# render ai gym environment
import gym
import time

# PongDeterministic does not contain random frameskip
# so is faster to train than the vanilla Pong-v4 environment
env = gym.make('PongDeterministic-v4')

print("List of available actions: ", env.unwrapped.get_action_meanings())

# we will only use the actions 'RIGHTFIRE' = 4 and 'LEFTFIRE' = 5
# the 'FIRE' part ensures that the game starts again after losing a life
# the actions are hard-coded in pong_utils.py

### Preprocessing
* Downsampling to 80 by 80
* Removing background

On the low level, the game works as follows: we receive an image frame (a 210x160x3 byte array (integers from 0 to 255 giving pixel values)), and we get to decide if we want to move the paddle UP or DOWN (i.e., a binary choice). After every single choice, the game simulator executes the action and gives us a reward: Either a +1 reward if the ball went past the opponent, a -1 reward if we missed the ball, or 0 otherwise. And of course, our goal is to move the paddle so that we get lots of rewards. [Andrej Karpathy blog](http://karpathy.github.io/2016/05/31/rl/)
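
As a rough sketch of the two preprocessing steps listed above (downsampling to 80 by 80 and removing the background), the NumPy snippet below crops the score area, keeps every second pixel and subtracts a background colour. The function name `preprocess_frame`, the crop indices and the background colour `(144, 72, 17)` are assumptions made for this illustration; the actual `pong_utils.preprocess_single` may differ.

```python
import numpy as np

# Assumed RGB colour of the Pong playing field (an assumption for this sketch).
BACKGROUND_COLOR = np.array([144, 72, 17])


def preprocess_frame(frame, bkg_color=BACKGROUND_COLOR):
    """Crop, downsample by 2 and subtract the background: 210x160x3 -> 80x80."""
    # rows 34..193 with step 2 give 80 rows; every second column gives 80 columns
    cropped = frame[34:-16:2, ::2]
    diff = cropped.astype(np.float32) - bkg_color
    # average over the colour channels and scale to roughly [-1, 1]
    return np.mean(diff, axis=-1) / 255.0


# Synthetic frame: pure background with a single white "ball" pixel.
frame = np.tile(BACKGROUND_COLOR, (210, 160, 1)).astype(np.uint8)
frame[100, 80] = (255, 255, 255)

small = preprocess_frame(frame)
print(small.shape)
```

After this step the background is exactly zero, so only the paddles and the ball carry signal into the policy network.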

In [None]:
import matplotlib
import matplotlib.pyplot as plt

# show what a preprocessed image looks like
env.reset()
_, _, _, _ = env.step(0)
# get a frame after 20 steps
for _ in range(20):
    frame, _, _, _ = env.step(1)

plt.subplot(1,2,1)
plt.imshow(frame)
plt.title('original image')

plt.subplot(1,2,2)
plt.title('preprocessed image')

# 80 x 80 black and white image
plt.imshow(pong_utils.preprocess_single(frame), cmap='Greys')
plt.show()
print ("Original shape is: {}".format(frame.shape))
print ("Preprocess shape is: {}".format(pong_utils.preprocess_single(frame).shape))

### POLICY

The policy network takes the state of the game and decides what we should do (move UP or DOWN). As a simple building block, we use a small neural network that takes the image pixels as input and outputs the probability of one of the two moves.

<img src="IntroImages/state_to_action_example.PNG" alt="Drawing" style="width: 800px;" />
 
Here, we define our policy. The input is a stack of two consecutive frames (which captures the movement), and the output is a single number $P_{\rm right}$, the probability of moving right. Note that $P_{\rm left}= 1-P_{\rm right}$.
We use the sigmoid non-linearity at the end, which squashes the output to the range [0,1] so that it can be read as a probability.
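
To make the role of $P_{\rm right}$ concrete, here is a minimal sketch of how an action could be sampled from the network output. The helper `sample_action` and the raw score `0.8` are hypothetical; the action codes 4 and 5 are the ones used for RIGHT and LEFT in this notebook.

```python
import math
import random

RIGHT, LEFT = 4, 5  # action codes used in this notebook


def sigmoid(x):
    """Squash a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_action(p_right, rng):
    """Return (action, probability of that action) for a given P_right."""
    if rng.random() < p_right:
        return RIGHT, p_right
    return LEFT, 1.0 - p_right  # P_left = 1 - P_right


rng = random.Random(0)
p_right = sigmoid(0.8)  # pretend 0.8 is the raw output of the last layer
samples = [sample_action(p_right, rng)[0] for _ in range(10_000)]
frac_right = samples.count(RIGHT) / len(samples)
print(f"P_right = {p_right:.3f}, empirical fraction of RIGHT = {frac_right:.3f}")
```

Sampling, rather than always picking the more likely action, is what gives the agent a chance to explore.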

**Credit assignment problem**

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):

    def __init__(self):
        super(Policy, self).__init__()
        # 80x80x2 to 38x38x4
        # 2 channel from the stacked frame
        self.conv1 = nn.Conv2d(2, 4, kernel_size=6, stride=2, bias=False)
        # 38x38x4 to 9x9x16
        self.conv2 = nn.Conv2d(4, 16, kernel_size=6, stride=4)
        self.size=9*9*16
        
        # two fully connected layers
        self.fc1 = nn.Linear(self.size, 256)
        self.fc2 = nn.Linear(256, 1)

        # sigmoid squashes the output to a probability in [0, 1]
        self.sig = nn.Sigmoid()
        
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = x.view(-1,self.size)
        x = F.relu(self.fc1(x))
        return self.sig(self.fc2(x))

policy=Policy().to(device)
# we use the Adam optimizer with learning rate 1e-4
import torch.optim as optim
optimizer = optim.Adam(policy.parameters(), lr=1e-4)
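
The layer sizes quoted in the comments of the `Policy` class can be checked with the usual output-size formula for a convolution without padding, `(size - kernel_size) // stride + 1`. The helper `conv_out` below is a hypothetical name used only for this check.

```python
def conv_out(size, kernel_size, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


after_conv1 = conv_out(80, kernel_size=6, stride=2)           # 80 -> 38
after_conv2 = conv_out(after_conv1, kernel_size=6, stride=4)  # 38 -> 9
flat = after_conv2 * after_conv2 * 16                         # 16 output channels
print(after_conv1, after_conv2, flat)
```

The flattened size `9*9*16` is the value stored in `self.size` and used as the input width of the first fully connected layer.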

## RL connection to Supervised Learning

<img src="IntroImages/supervised_learning_example.PNG" alt="Drawing" style="width: 800px;" />

If agent learning is framed as a supervised learning problem, the agent can never become better than the data it was trained on. With RL, by contrast, we assume that a **super-human** level of performance can be achieved.

Policy-gradient RL is very similar to supervised learning.

<img src="IntroImages/connection_to_sl.png" alt="Drawing" style="width: 800px;" />

In the supervised learning setup, we have a set of labeled data that is fed into a neural network. Learning here means adjusting the weights of the network (via back-propagation) so that it identifies a given picture correctly: changing the weights increases the probability of producing the right label. In the RL setup, we collect many episodes by following some policy. Each episode is labeled at its end by whether the game was won or lost, and the (state, action) pairs from the episode then play the role of the labeled pictures in supervised learning.

## Policy Gradients

* Run the policy network on the current input to get a probability distribution over actions, then sample an action from this distribution. Suppose we sample DOWN and execute it in the game.
* Wait until the end of the game, then take the reward we get (+1 if we won, -1 if we lost) and use that scalar as the gradient for the action we took (DOWN in this case).
* If we lost, filling in -1 for the log probability of DOWN and running backprop gives a gradient that discourages the network from taking the DOWN action for that input in the future.
[Andrej Karpathy blog](http://karpathy.github.io/2016/05/31/rl/)
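
The steps above can be illustrated on the smallest possible "network": a single logit `z` passed through a sigmoid. The sketch below is an assumption-laden toy (the action encoding `UP = 1`, `DOWN = 0`, the starting logit and the learning rate are all made up), but it shows that scaling the gradient of the log probability by a reward of -1 makes the sampled action less likely.

```python
import math

UP, DOWN = 1, 0  # hypothetical action encoding for this sketch


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_prob(z, action):
    """Log-probability of `action` when P(UP) = sigmoid(z)."""
    p_up = sigmoid(z)
    return math.log(p_up if action == UP else 1.0 - p_up)


def grad_log_prob(z, action):
    """Analytic derivative of log_prob with respect to z."""
    return action - sigmoid(z)


z, learning_rate = 0.3, 0.5
action, reward = DOWN, -1.0  # we sampled DOWN and later lost the game

p_down_before = 1.0 - sigmoid(z)
# gradient ascent on reward * log pi(action)
z_new = z + learning_rate * reward * grad_log_prob(z, action)
p_down_after = 1.0 - sigmoid(z_new)

print(f"P(DOWN) before: {p_down_before:.3f}, after: {p_down_after:.3f}")
```

With a reward of +1 the same update would push the probability of DOWN up instead.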

<img src="IntroImages/forward_pass.PNG" alt="Drawing" style="width: 800px;" />

### Reflection

**Policy Gradients**: Run a policy for a while. See what actions led to high rewards. Increase their probability.

<img src="IntroImages/agent_learning.PNG" alt="Drawing" style="width: 800px;" />

### Optimization

<img src="IntroImages/score_function.PNG" alt="Drawing" style="width: 800px;" />

## REINFORCE algorithm

REINFORCE is an algorithm for finding the weights $\theta$ of a policy network that maximize the expected return $U(\theta)$.

1. Use the policy $\pi_{\theta}$ to collect $m$ trajectories $\tau^{1}, \tau^{2}, ..., \tau^{m}$ with horizon $H$. We refer to the $i$-th trajectory as
$$\tau^{i} = (s_0^{i}, a_0^{i}, ..., s_H^{i}, a_H^{i}, s_{H+1}^{i})$$
2. REINFORCE uses the log-probability of each action to increase or decrease how likely that action becomes, depending on the return of the trajectory it occurred in. Use the trajectories to estimate the gradient $\nabla_{\theta}U(\theta)$:

$$\nabla_{\theta}U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta (a_t^{i}|s_t^{i}) R(\tau^{i})$$
3. Update the weights of the policy: 
$$\theta \leftarrow \theta + \alpha \hat{g}$$
4. Loop over steps 1-3
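
The four steps above can be sketched on a toy problem without any deep learning library. The example below is a hypothetical two-action bandit (one-step trajectories, action 1 pays +1 and action 0 pays -1) with a single policy parameter `theta`; it is a minimal illustration of the update rule, not the Pong implementation used later in this notebook.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def collect_trajectory(theta, rng):
    """One-step trajectory; returns (grad of log pi, return R(tau))."""
    p1 = sigmoid(theta)                     # pi_theta(action = 1)
    action = 1 if rng.random() < p1 else 0
    reward = 1.0 if action == 1 else -1.0   # hypothetical bandit payoff
    grad_log_pi = action - p1               # d/dtheta log pi_theta(action)
    return grad_log_pi, reward


def reinforce(theta=0.0, alpha=0.5, m=16, iterations=200, seed=0):
    rng = random.Random(seed)
    for _ in range(iterations):             # step 4: loop over steps 1-3
        # step 1: collect m trajectories with the current policy
        batch = [collect_trajectory(theta, rng) for _ in range(m)]
        # step 2: estimate the gradient g_hat
        g_hat = sum(grad * ret for grad, ret in batch) / m
        # step 3: gradient ascent on the expected return
        theta += alpha * g_hat
    return theta


theta = reinforce()
print(f"theta = {theta:.2f}, P(action = 1) = {sigmoid(theta):.3f}")
```

After training, the policy puts almost all of its probability on the action that pays +1.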


### REINFORCE
You have two choices for the surrogate objective (it is usually useful to divide by the number of time-steps $T$, since the rewards are normalized and every trajectory has the same fixed length):

1. $\frac{1}{T}\sum^T_t R_{t}^{\rm future}\log(\pi_{\theta'}(a_t|s_t))$
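2. $\frac{1}{T}\sum^T_t R_{t}^{\rm future}\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}$ where $\theta'=\theta$. Make sure the denominator $\pi_{\theta}$ is treated as a constant, i.e. computed without gradient tracking (for example under `torch.no_grad()`), so that only the numerator contributes to the gradient.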
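
The quantity $R_{t}^{\rm future}$ in both expressions is the discounted sum of rewards from time-step $t$ onwards. The sketch below computes it in plain Python and then normalizes it across parallel environments at each time-step. The function names are hypothetical; the discount exponent is counted from the start of the episode, which is an assumption chosen to mirror the NumPy code in this notebook.

```python
import statistics


def future_rewards(rewards, discount=0.995):
    """Reward-to-go: R_t = sum over k >= t of discount**k * r_k."""
    discounted = [r * discount**k for k, r in enumerate(rewards)]
    out, running = [], 0.0
    for r in reversed(discounted):   # accumulate from the end of the episode
        running += r
        out.append(running)
    return out[::-1]


def normalize_per_step(future_by_env):
    """Zero mean and unit std across environments at each time-step."""
    steps = list(zip(*future_by_env))  # steps[t] = values of all envs at time t
    normalized = []
    for values in steps:
        mean = statistics.mean(values)
        std = statistics.pstdev(values) + 1e-10  # avoid division by zero
        normalized.append([(v - mean) / std for v in values])
    return normalized


# two parallel environments, three time-steps each, no discounting
rewards_by_env = [[0, 0, 1], [0, -1, 0]]
future_by_env = [future_rewards(r, discount=1.0) for r in rewards_by_env]
normalized = normalize_per_step(future_by_env)
print(future_by_env)
print(normalized)
```

Using future rewards means an action is only credited with what happened after it, which reduces the noise in the gradient estimate.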

In [None]:
import numpy as np
RIGHT=4
LEFT=5

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# convert states to probability, passing through the policy
def states_to_prob(policy, states):
    states = torch.stack(states)
    policy_input = states.view(-1,*states.shape[-3:])
    return policy(policy_input).view(states.shape[:-3])
# return sum of log-prob divided by T
# same thing as -policy_loss
def surrogate(policy, old_probs, states, actions, rewards,
              discount = 0.995, beta=0.01):

    discount = discount**np.arange(len(rewards))
    rewards = np.asarray(rewards)*discount[:,np.newaxis]
    
    # convert rewards to future rewards
    rewards_future = rewards[::-1].cumsum(axis=0)[::-1]
    
    mean = np.mean(rewards_future, axis=1)
    std = np.std(rewards_future, axis=1) + 1.0e-10

    rewards_normalized = (rewards_future - mean[:,np.newaxis])/std[:,np.newaxis]
    
    # convert everything into pytorch tensors and move to gpu if available
    actions = torch.tensor(actions, dtype=torch.int8, device=device)
    old_probs = torch.tensor(old_probs, dtype=torch.float, device=device)
    rewards = torch.tensor(rewards_normalized, dtype=torch.float, device=device)

    # convert states to policy (or probability)
    new_probs = states_to_prob(policy, states)
    new_probs = torch.where(actions == RIGHT, new_probs, 1.0-new_probs)

    ratio = new_probs/old_probs

    # include a regularization term
    # this steers new_policy towards 0.5
    # add in 1.e-10 to avoid log(0) which gives nan
    entropy = -(new_probs*torch.log(old_probs+1.e-10)+ \
        (1.0-new_probs)*torch.log(1.0-old_probs+1.e-10))

    return torch.mean(ratio*rewards + beta*entropy)


### TRAINING
We are now ready to train our policy!

In [None]:
from parallelEnv import parallelEnv
import numpy as np
# WARNING: running through all 800 episodes will take 30-45 minutes

# training loop max iterations
episode = 800

# widget bar to display progress
!pip install progressbar
import progressbar as pb
widget = ['training loop: ', pb.Percentage(), ' ', 
          pb.Bar(), ' ', pb.ETA() ]
timer = pb.ProgressBar(widgets=widget, maxval=episode).start()

# initialize environment
envs = parallelEnv('PongDeterministic-v4', n=8, seed=1234)

discount_rate = .99
beta = .01
tmax = 100

# keep track of progress
mean_rewards = []

for e in range(episode):

    # collect trajectories
    old_probs, states, actions, rewards = \
        pong_utils.collect_trajectories(envs, policy, tmax=tmax)
        
    total_rewards = np.sum(rewards, axis=0)
  
    L = -surrogate(policy, old_probs, states, actions, rewards,
                   discount=discount_rate, beta=beta)
    optimizer.zero_grad()
    L.backward()
    optimizer.step()
    del L
        
    # the regularization coefficient beta decays over time
    # this reduces exploration in later runs
    beta*=.995
    
    # get the average reward of the parallel environments
    mean_rewards.append(np.mean(total_rewards))
    
    # display some progress every 20 iterations
    if (e+1)%20 ==0 :
        print("Episode: {0:d}, score: {1:f}".format(e+1,np.mean(total_rewards)))
        print(total_rewards)
        
    # update progress widget bar
    timer.update(e+1)
    
timer.finish()

In [None]:
# play game after training!
pong_utils.play(env, policy, time=2000) 
torch.save(policy, 'REINFORCE.policy')

### Watch a smart agent

In [None]:
import torch
policy_solution = torch.load('REINFORCE_solution.policy', map_location='cpu')
pong_utils.play(env, policy_solution, time=2000) 

## Content of this course 
The content is divided into four parts.

1. Intro to Deep Reinforcement Learning
     * The RL Framework: MDP
     * Monte-Carlo Methods 
     * Temporal-Difference Methods (SARSA and Q-learning)
     * Project 1
1. Value-Based Methods
    * Deep Q-Networks
    * Project 2
1. Policy-Based Methods
    * Intro to Policy-Based Methods
    * Black-Box optimization
    * REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility (REINFORCE)
    * Proximal Policy Optimization (PPO)
    * Actor-Critic Methods (A3C, DDPG)
    * Project 3
1. Multi-Agent Reinforcement Learning
    * Intro to Multi-Agent RL
    * AlphaZero
    * Project 4

* | *
- | -
<img src="IntroImages/LunarLander.gif" alt="Drawing" style="height: 300px;" /> | <img src="IntroImages/Banana.gif" alt="Drawing" style="height: 300px;" />
<img src="IntroImages/Reacher.gif" alt="Drawing" style="height: 300px;" /> | <img src="IntroImages/Tennis.gif" alt="Drawing" style="height: 300px;" />

<img src="IntroImages/multi_agent_rl.png" alt="Drawing" style="height: 500px;" />