# Reinforcement learning

Reinforcement learning komt uit de studie van Markov Chains of Processen voor.
Dit is een random opeenvolging van states waarbij elke transisitie een mogelijke kans heeft.
Door een reward te koppelen aan elke state waarin je komt kan je een functie opstellen die de de totale reward maximaliseert.
Dit is het basisidee achter reinforcement learning.

Een aantal belangrijke termen/concepten hierbij zijn:
* De agent
* Het environment
* De state space
* De action space
* De reward en return
* Exploration vs exploitation

## Q-learning

Een eerste algoritme dat we bekijken voor reinforcement learning uit te voeren is Q-learning.
Dit algoritme maakt gebruik van de Q-functie of action-value function.
Hiervoor houdt het Q-learning algoritme een matrix bij dat de reward van actie in een state bepaald.
In een verkenningsfase laten we toe dat er sub-optimale keuzes genomen worden.
Nadat dit lang genoeg gerund heeft, gaan we over naar een exploitation fase waarbij enkel de beste keuzes genomen worden.

Om te tonen hoe je het Q-learning algoritme kan implementeren, kan je gebruik maken van het gymnasium package.
Dit bevat heel wat eenvoudige environments van spelletjes in python die hiervoor gebruikt kunnen worden.
De bron voor onderstaande code komt van een [tutorial van de library](https://gymnasium.farama.org/tutorials/training_agents/blackjack_tutorial/)

In [None]:
%pip install gymnasium
%pip install gymnasium[classic-control]

In [1]:
%matplotlib inline

from __future__ import annotations

from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from matplotlib.patches import Patch
from tqdm import tqdm

import gymnasium as gym
from gymnasium.utils.play import play

env = gym.make("Blackjack-v1", sab=True, render_mode='rgb_array')


In [None]:
# reset the environment to get the first observation
done = False
observation, info = env.reset()
print(observation, info)

Note that our observation is a 3-tuple consisting of 3 values:

-  The players current sum
-  Value of the dealers face-up card
-  Boolean whether the player holds a usable ace (An ace is usable if it
   counts as 11 without busting)

In [None]:
# sample a random action from all valid actions
action = env.action_space.sample()
# action=1

# execute the action in our environment and receive infos from the environment
next_state, reward, terminated, truncated, info = env.step(action)
print(observation)
print(reward)
print(terminated)   # geeft aan of het spel over is
print(truncated)    # ook gedaan als dit true is
print(info)


# observation=(24, 10, False)
# reward=-1.0
# terminated=True
# truncated=False
# info={}

In [None]:
# dit is de AI/bot
class BlackjackAgent:
    def __init__(
        self,
        learning_rate: float,
        initial_epsilon: float,
        epsilon_decay: float,
        final_epsilon: float,
        discount_factor: float = 0.95,
    ):
        """Initialize a Reinforcement Learning agent with an empty dictionary
        of state-action values (q_values), a learning rate and an epsilon.

        Args:
            learning_rate: The learning rate
            initial_epsilon: The initial epsilon value
            epsilon_decay: The decay for epsilon
            final_epsilon: The final epsilon value
            discount_factor: The discount factor for computing the Q-value
        """
        self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))

        self.lr = learning_rate
        self.discount_factor = discount_factor

        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon

        self.training_error = []

    def get_action(self, obs: tuple[int, int, bool]) -> int:
        """
        Returns the best action with probability (1 - epsilon)
        otherwise a random action with probability epsilon to ensure exploration.
        """
        # with probability epsilon return a random action to explore the environment
        if np.random.random() < self.epsilon:
            return env.action_space.sample()

        # with probability (1 - epsilon) act greedily (exploit)
        else:
            return int(np.argmax(self.q_values[obs]))

    def update(
        self,
        obs: tuple[int, int, bool],
        action: int,
        reward: float,
        terminated: bool,
        next_obs: tuple[int, int, bool],
    ):
        """Updates the Q-value of an action."""
        future_q_value = (not terminated) * np.max(self.q_values[next_obs])
        temporal_difference = (
            reward + self.discount_factor * future_q_value - self.q_values[obs][action]
        )

        self.q_values[obs][action] = (
            self.q_values[obs][action] + self.lr * temporal_difference
        )
        self.training_error.append(temporal_difference)

    def decay_epsilon(self):            # in het begin grote epsilon -> neem veel random zetten, later kleinere epsilon zodat er vooral de beste zet genomen wordt
        self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)

In [None]:
# hyperparameters
learning_rate = 0.01
n_episodes = 100000
start_epsilon = 1.0
epsilon_decay = start_epsilon / (n_episodes / 2)  # reduce the exploration over time
final_epsilon = 0.1

agent = BlackjackAgent(
    learning_rate=learning_rate,
    initial_epsilon=start_epsilon,
    epsilon_decay=epsilon_decay,
    final_epsilon=final_epsilon,
)

In [None]:
env = gym.wrappers.RecordEpisodeStatistics(env, deque_size=n_episodes)
for episode in tqdm(range(n_episodes)):
    obs, info = env.reset()
    done = False

    # play one episode
    while not done:
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)

        # update the agent
        agent.update(obs, action, reward, terminated, next_obs)     # fit

        # update if the environment is done and the current obs
        done = terminated or truncated
        obs = next_obs

    agent.decay_epsilon()

In [None]:
rolling_length = 500
fig, axs = plt.subplots(ncols=3, figsize=(12, 5))
axs[0].set_title("Episode rewards")
# compute and assign a rolling average of the data to provide a smoother graph
reward_moving_average = (
    np.convolve(
        np.array(env.return_queue).flatten(), np.ones(rolling_length), mode="valid"
    )
    / rolling_length
)
axs[0].plot(range(len(reward_moving_average)), reward_moving_average)
axs[1].set_title("Episode lengths")
length_moving_average = (
    np.convolve(
        np.array(env.length_queue).flatten(), np.ones(rolling_length), mode="same"
    )
    / rolling_length
)
axs[1].plot(range(len(length_moving_average)), length_moving_average)
axs[2].set_title("Training Error")
training_error_moving_average = (
    np.convolve(np.array(agent.training_error), np.ones(rolling_length), mode="same")
    / rolling_length
)
axs[2].plot(range(len(training_error_moving_average)), training_error_moving_average)
plt.tight_layout()
plt.show()

In [None]:
def create_grids(agent, usable_ace=False):
    """Create value and policy grid given an agent."""
    # convert our state-action values to state values
    # and build a policy dictionary that maps observations to actions
    state_value = defaultdict(float)
    policy = defaultdict(int)
    for obs, action_values in agent.q_values.items():
        state_value[obs] = float(np.max(action_values))
        policy[obs] = int(np.argmax(action_values))

    player_count, dealer_count = np.meshgrid(
        # players count, dealers face-up card
        np.arange(12, 22),
        np.arange(1, 11),
    )

    # create the value grid for plotting
    value = np.apply_along_axis(
        lambda obs: state_value[(obs[0], obs[1], usable_ace)],
        axis=2,
        arr=np.dstack([player_count, dealer_count]),
    )
    value_grid = player_count, dealer_count, value

    # create the policy grid for plotting
    policy_grid = np.apply_along_axis(
        lambda obs: policy[(obs[0], obs[1], usable_ace)],
        axis=2,
        arr=np.dstack([player_count, dealer_count]),
    )
    return value_grid, policy_grid


def create_plots(value_grid, policy_grid, title: str):
    """Creates a plot using a value and policy grid."""
    # create a new figure with 2 subplots (left: state values, right: policy)
    player_count, dealer_count, value = value_grid
    fig = plt.figure(figsize=plt.figaspect(0.4))
    fig.suptitle(title, fontsize=16)

    # plot the state values
    ax1 = fig.add_subplot(1, 2, 1, projection="3d")
    ax1.plot_surface(
        player_count,
        dealer_count,
        value,
        rstride=1,
        cstride=1,
        cmap="viridis",
        edgecolor="none",
    )
    plt.xticks(range(12, 22), range(12, 22))
    plt.yticks(range(1, 11), ["A"] + list(range(2, 11)))
    ax1.set_title(f"State values: {title}")
    ax1.set_xlabel("Player sum")
    ax1.set_ylabel("Dealer showing")
    ax1.zaxis.set_rotate_label(False)
    ax1.set_zlabel("Value", fontsize=14, rotation=90)
    ax1.view_init(20, 220)

    # plot the policy
    fig.add_subplot(1, 2, 2)
    ax2 = sns.heatmap(policy_grid, linewidth=0, annot=True, cmap="Accent_r", cbar=False)
    ax2.set_title(f"Policy: {title}")
    ax2.set_xlabel("Player sum")
    ax2.set_ylabel("Dealer showing")
    ax2.set_xticklabels(range(12, 22))
    ax2.set_yticklabels(["A"] + list(range(2, 11)), fontsize=12)

    # add a legend
    legend_elements = [
        Patch(facecolor="lightgreen", edgecolor="black", label="Hit"),
        Patch(facecolor="grey", edgecolor="black", label="Stick"),
    ]
    ax2.legend(handles=legend_elements, bbox_to_anchor=(1.3, 1))
    return fig


# state values & policy with usable ace (ace counts as 11)
value_grid, policy_grid = create_grids(agent, usable_ace=True)
fig1 = create_plots(value_grid, policy_grid, title="With usable ace")
plt.show()

In [None]:
# state values & policy without usable ace (ace counts as 1)
value_grid, policy_grid = create_grids(agent, usable_ace=False)
fig2 = create_plots(value_grid, policy_grid, title="Without usable ace")
plt.show()

In [None]:
gym.utils.play.play(env, keys_to_action={"w":0, "s":1})

## RL in neural networks

Het gebruik van Q-learning werkt goed als het aantal states en acties beperkt zijn.
Dit is echter zelden het geval, denk bijvoorbeeld aan een continue variabele zoals snelheid of locatie.

Een oplossing hiervoor is om de action-value functie die in Q-learning geoptimaliseerd wordt te benaderen ipv exact te berekenen.
Dit kan bijvoorbeeld door middel van een neuraal netwerk te gebruiken.
Er zijn verschillende model-structuren die hiervoor ontwikkeld zijn zoals:
- DQN (onderwerp van onderstaande demo)
- REINFORCE
- DDPG
- TD3
- PPO
- SAC

Voor we beginnen met het uitwerken van een model.
Bekijk [deze tutorial](https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial) en beantwoord de volgende vragen:
- Wat is de state en wat zijn de mogelijke acties?
- Wat is de structuur van het gebruikte DQN?
- Zijn er nieuwe hyperparameters gebruikt?
- Welke metriek wordt er gebruikt en waar wordt deze berekend?
- Hoe worden de gewichten aangepast?
- Waarvoor wordt de ReplayBuffer gebruikt?

**Antwoord:**
- Vraag 1:
    - State: de positie en snelheid van het karretje en de hoek/hoeksnelheid van de staaf.
    - Acties: Beweeg naar links en beweeg naar rechts
- Vraag 2: Er zijn drie lagen met respectievelijk 100, 50 en 2 neuronen. Het is belangrijk dat het aantal neuronen in de laatste laag overeenkomt met het aantal acties.
- Vraag 3: De enige nieuwe hyperparameter bij het aanmaken van het neuraal netwerk is de initialiser. De hidden lagen gebruiken een VarianceScaler als kernel-initalisator wat inhoudt dat ze gesampled worden uit een Normaalverdeling. De outputlayer gebruikt een RandomUniform kernel-initializer (sample de gewichten uit een uniforme verdeling) en een constante waarde als bias-initializer
- Vraag 4: De average return wordt hiervoor gebruikt en deze wordt [hier](https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial#metrics_and_evaluation) berekend. De return is de tijd dat de staaf omhoog blijft (1 voor elke tijdstap)
- Vraag 5 en 6: Je laadt het netwerk wat lopen, de uitgevoerde acties en bekomen rewards worden opgeslaan in de ReplayBuffer. Batches of data worden uit de replaybuffer gehaald om het netwerk te trainen op basis van de gemiddelde return

Schrijf nu zelf de nodige code om het DQN-model toe te passen op het "Mountain Car" environment van gymnasium.

In [1]:
import gymnasium as gym
from gymnasium.utils.play import play
env = gym.make('MountainCar-v0', render_mode="rgb_array")
play(env, keys_to_action={"a": 0, "s": 1, "d": 2})

In [None]:
# wat is de activatiefunctie van de laatste laag en waarom

In [4]:
import argparse
import os

import gymnasium as gym
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
from time import time
import tensorflow as tf

import imageio

In [5]:
# deze klasse bevat het neuraal netwerk
class DQNetwork:

    def __init__(self, state_size, action_size, layer_sizes=(100, 50),
                 learning_rate=0.0001):
        self.state_size = state_size
        self.action_size = action_size
        self.layer_sizes = layer_sizes
        self.learning_rate = learning_rate

        self.build_model()

    def build_model(self):
        states = tf.keras.layers.Input(shape=(self.state_size,), name='states')
        net = states
        # hidden layers

        for layer_count in range(len(self.layer_sizes)):
            net = tf.keras.layers.Dense(units=self.layer_sizes[layer_count], activation="relu")(net)
            
        # dit is lineair want we willen de Q-functie gaan benaderen - deze waarden komen rewards/returns en deze kunnen negatief, positief zijn en groter dan 1
        # dit lijkt dus op regressie dus nemen we een lineaire activatie functie
        actions = tf.keras.layers.Dense(units=self.action_size, activation='linear',
                                 name='raw_actions')(net)

        self.model = tf.keras.models.Model(inputs=states, outputs=actions)

        self.optimizer = tf.keras.optimizers.Adam(lr=self.learning_rate)
        self.model.compile(loss='mse', optimizer=self.optimizer)

In [10]:

import random
from collections import namedtuple, deque

# dit is de agent of de bot, deze bepaald hoe er geleerd moet worden
class DDQNAgent:

    def __init__(self, env, buffer_size=int(1e5), batch_size=64, gamma=0.99, tau=1e-3, lr=5e-4, callbacks=()):
        self.env = env
        #self.env.seed(1024)      # dit zal niet meer werken met gymnasium versie
        # neurale netwerken trainen efficienter als je met grotere batches werkt -> dus update de gewichten maar na dit aantal acties
        self.batch_size = batch_size
        self.gamma = gamma
        self.tau = tau
        self.Q_targets = 0.0
        self.state_size = env.observation_space.shape[0]
        self.action_size = env.action_space.n
        self.callbacks = callbacks

        layer_sizes = [20, 5]

        print("Initialising DDQN Agent with params : {}".format(self.__dict__))

        # Make local & target model
        # we maken er 2 om tijdens het trainen nog steeds toegang te hebben tot de oude gewichten (target is de backup van local)
        print("Initialising Local DQNetwork")
        self.local_network = DQNetwork(self.state_size, self.action_size,
                                       layer_sizes=layer_sizes,
                                       learning_rate=lr)

        print("Initialising Target DQNetwork")
        self.target_network = DQNetwork(self.state_size, self.action_size,
                                        layer_sizes=layer_sizes,
                                        learning_rate=lr)

        # dit is een trucje om neurale netwerken beter te laten trainen in Reinforcement learning
        self.memory = ReplayBuffer(buffer_size=buffer_size, batch_size=batch_size)

    def reset_episode(self):
        state = self.env.reset()
        self.last_state = state
        return state

    def step(self, action, reward, next_state, done):
        self.memory.add(self.last_state, action, reward, next_state, done)

        # leer in batches
        if len(self.memory) > self.batch_size:
            experiences = self.memory.sample()
            # update de gewichten van het neuraal netwerk
            self.learn(experiences, self.gamma)

        self.last_state = next_state

    def act(self, state, eps=0.):
        state = np.reshape(state, [-1, self.state_size])
        action = self.local_network.model.predict(state)

        if random.random() > eps:
            # dit is niet random -> greedy approach
            # de beste actie volgens het neuraal netwerk
            return np.argmax(action)
        else:
            # verkennend, exploring, random acties
            return random.choice(np.arange(self.action_size))

    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones = experiences

        # voor elke actie in de batch
        for itr in range(len(states)):
            # inputs goed zetten
            state, action, reward, next_state, done = states[itr], actions[itr], rewards[itr], next_states[itr], dones[
                itr]
            state = np.reshape(state, [-1, self.state_size])
            next_state = np.reshape(next_state, [-1, self.state_size])

            # dit berekend de nieuwe Q's
            self.Q_targets = self.local_network.model.predict(state, verbose=0)
            if done:
                self.Q_targets[0][action] = reward
            else:
                next_Q_target = self.target_network.model.predict(next_state, verbose=0)[0]
                # next_Q_target zijn de toekomstige Q-waarden
                self.Q_targets[0][action] = (reward + gamma * np.max(next_Q_target))

            self.local_network.model.fit(state, self.Q_targets, epochs=1, verbose=0, callbacks=self.callbacks)

    # dit is om het target te updaten met de nieuwe gewichten van het local
    def update_target_model(self):
        self.target_network.model.set_weights(self.local_network.model.get_weights())

# double ended queue (deque)
# (ChatGPT) Here's why a replay buffer is beneficial in DQN:
# Stability: Training on a randomly sampled batch of experiences helps to break the temporal correlation between consecutive experiences. This reduces the risk of the learning process being influenced by the order in which experiences are encountered.
# Data Efficiency: Reusing past experiences allows the agent to learn more from its limited set of interactions with the environment. This is particularly useful when data is expensive or time-consuming to collect.
# Sample Efficiency: The replay buffer helps the agent learn from a more diverse set of experiences, potentially preventing the network from getting stuck in local minima.
class ReplayBuffer:

    def __init__(self, buffer_size, batch_size):
        self.memory = deque(maxlen=buffer_size)  # internal memory (deque)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])

    # sla de gegevens op
    def add(self, state, action, reward, next_state, done):
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    # geef me een lijst van tijdstippen
    def sample(self):
        experiences = random.sample(self.memory, k=self.batch_size)

        states = np.vstack([e.state for e in experiences if e is not None])
        actions = np.vstack([e.action for e in experiences if e is not None])
        rewards = np.vstack([e.reward for e in experiences if e is not None])
        next_states = np.vstack([e.next_state for e in experiences if e is not None])
        dones = np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.memory)


In [11]:
import numpy as np
import matplotlib.pyplot as plt
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display

def train_model(n_episodes=200, eps_start=1.0, eps_end=0.001, eps_decay=0.9, target_reward=1000):
    scores = []
    scores_window = deque(maxlen=100)
    eps = eps_start
    print("Starting model training for {} episodes.".format(n_episodes))
    consolidation_counter = 0
    # for lusje om het aantal spelletjes te simuleren
    for i_episode in range(1, n_episodes + 1):
        init_time = time()
        state = agent.reset_episode()
        score = 0
        done = False
        # simuleer het spelletje hier
        while not done:
            action = agent.act(state, eps)
            next_state, reward, done, _ = env.step(action)
            agent.step(action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                # als het spel gedaan is, gebruik de nieuwe gewichten voor het volgend spel
                agent.update_target_model()
                break
        time_taken = time() - init_time
        scores_window.append(score)
        scores.append(score)
        eps = max(eps_end, eps_decay * eps)

        # debugging output / progress reports / checkpoints bijhouden
        print('Episode {}\tAverage Score: {:.2f}\tScore: {:.2f}\tState: {}\tMean Q-Target: {:.4f}'
                     '\tEffective Epsilon: {:.3f}\tTime Taken: {:.2f} sec'.format(
            i_episode, np.mean(scores_window), score, state[0], np.mean(agent.Q_targets), eps, time_taken))
        if i_episode % 100 == 0:
            print(
                'Episode {}\tAverage Score: {:.2f}\tScore: {:.2f}\tState: {}\tMean Q-Target: {:.4f}\tTime Taken: {:.2f} sec '.format(
                    i_episode, np.mean(scores_window), score, state[0], np.mean(agent.Q_targets), time_taken))
            agent.local_network.model.save('save/{}_local_model_{}.h5'.format(env_name, initial_timestamp))
            agent.target_network.model.save('save/{}_target_model_{}.h5'.format(env_name, initial_timestamp))
        if np.mean(scores_window) >= target_reward:
            consolidation_counter += 1
            if consolidation_counter >= 5:
                print("Completed model training with avg reward {} over last {} episodes."
                                    " Training ran for total of {} epsiodes".format(
                    np.mean(scores_window), 100, i_episode))
                return scores
        else:
            consolidation_counter = 0
    print("Completed model training with avg reward {} over last {} episodes."
                        " Training ran for total of {} epsiodes".format(
        np.mean(scores_window), 100, n_episodes))
    return scores


def play_model(actor, env_render=False, return_render_img=False):
    state = env.reset()
    print("Start state : {}".format(state))
    score = 0
    done = False
    images = []
    R = 0
    t = 0
    while not done:
        if env_render:
            if return_render_img:
                images.append(env.render("rgb_array"))
            else:
                env.render()
        state = np.reshape(state, [-1, env.observation_space.shape[0]])
        action = actor.predict(state, verbose=0)
        next_state, reward, done, _ = env.step(np.argmax(action))
        R += reward
        t += 1
        state = next_state
        score += reward
        if done:
            return score, images
    return 0, images
     

In [12]:
def display_frames_as_gif(frames):
    """
    Displays a list of frames as a gif, with controls
    """
    plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi = 72)
    patch = plt.imshow(frames[0])
    plt.axis('off')

    def animate(i):
        patch.set_data(frames[i])

    anim = animation.FuncAnimation(plt.gcf(), animate, frames = len(frames), interval=50)
    display(display_animation(anim, default_mode='loop'))

In [13]:
#train
env_name = "MountainCar-v0"
env = gym.make(env_name)
agent = DDQNAgent(env, buffer_size=100000, gamma=0.99, batch_size=64, lr=0.0001, callbacks=[])
scores = train_model(n_episodes=1, target_reward=-110, eps_decay=0.9)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

Initialising DDQN Agent with params : {'env': <TimeLimit<OrderEnforcing<PassiveEnvChecker<MountainCarEnv<MountainCar-v0>>>>>, 'batch_size': 64, 'gamma': 0.99, 'tau': 0.001, 'Q_targets': 0.0, 'state_size': 2, 'action_size': 3, 'callbacks': []}
Initialising Local DQNetwork




Initialising Target DQNetwork
Starting model training for 1 episodes.


ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

In [None]:
model = "MountainCar-v0_local_model.h5"
total_iterations = 100
expected_reward = -110

#Test
test_scores = []
print("Loading the saved model from '{}'".format(model))
actor = tf.keras.models.load_model('{}'.format(model))
print("Now running model test for {} iterations with expected reward >= {}".format(
    total_iterations, expected_reward))
frames = play_model(actor, True, True)[1]
for itr in range(1, total_iterations + 1):
    score = play_model(actor, True)[0]
    test_scores.append(score)
    print("Iteration: {} Score: {}".format(itr, score))
avg_reward = np.mean(test_scores)
print("Total Avg. Score over {} consecutive iterations : {}".format(total_iterations,
                                                                                 avg_reward))
if avg_reward >= expected_reward:
    print("Env. solved successfully.")
else:
    print("Agent failed to solve the env.")

In [None]:
import matplotlib.pyplot as plt
import matplotlib.animation
import numpy as np
from IPython.display import HTML

plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi = 72)
patch = plt.imshow(frames[0])
plt.axis('off')
animate = lambda i: patch.set_data(frames[i])
ani = matplotlib.animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval = 50)
HTML(ani.to_jshtml())