#### Eric Parsons, Brandon Lavigne, Jordan Rayfield

* Note: Before running the models, run `pip install gym` and `pip install gym[atari]` to install the environments. When a model runs, click on its game window, as it may start out minimized in the taskbar.

## DDQN Models
Below are models trained with the DDQN architecture. To change how long each network plays, modify the `seconds_to_play` parameter. Each model is currently set to play for 5 seconds as a quick demonstration. Run the cell for each model to load a pretrained model and play the game, or view the GIFs below, which show the same models.
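
The internals of `play.play` are not shown in this notebook. As a minimal sketch of how a time limit like `seconds_to_play` could work, assuming the game is advanced by a per-frame callback, a loop can compare elapsed time against the limit before each step. The names `run_for`, `step_fn`, and `clock` are hypothetical and are not part of `ddqn.play`.

```python
import time


def run_for(seconds_to_play, step_fn, clock=time.monotonic):
    """Call step_fn repeatedly until seconds_to_play have elapsed.

    Hypothetical helper: step_fn stands in for "render and advance the
    game by one frame". Returns the number of steps that were taken.
    """
    start = clock()
    steps = 0
    while clock() - start < seconds_to_play:
        step_fn()
        steps += 1
    return steps
```

Passing the clock as a parameter keeps the loop easy to check with a fake clock, without actually waiting.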

In [None]:
import ddqn.play as play

### DDQN Pong
This model was trained for 3,000,000 frames on Pong with a reward function that penalizes changes in momentum. As a result, the agent only moves when necessary to hit the ball.
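
The exact reward function is not shown in this notebook. As a rough sketch of the idea, assuming that a "change in momentum" can be approximated by the agent switching actions between consecutive frames, a shaped reward might look like the following. The function name `shaped_reward` and the `penalty` value are hypothetical.

```python
def shaped_reward(env_reward, prev_action, action, penalty=0.01):
    """Return the environment reward minus a penalty for switching actions.

    Assumption: switching actions is used here as a stand-in for a change
    in paddle momentum. The penalty size is illustrative only.
    """
    if prev_action is not None and action != prev_action:
        return env_reward - penalty
    return env_reward
```

With a penalty like this, holding the same action costs nothing, so an agent is nudged toward moving only when a move is worth more than the penalty.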

In [None]:
game_name = 'pong'
env = play.create_environment(game_name)
play.play(env, game_name, seconds_to_play=5)

![PongUrl](https://i.imgur.com/lTvAUV4.gif)

### DDQN Space Invaders
This model was trained for 7,000,000 frames on Space Invaders. The model plays like a very skilled human player, and you can see it clearly attempting to shoot the mothership.

In [None]:
game_name = 'space_invaders'
env = play.create_environment(game_name)
play.play(env, game_name, seconds_to_play=5)

![DDQNSpaceInvaders](https://i.imgur.com/IhbWQp1.gif)

### DDQN Breakout
This model was trained for 7,000,000 frames on Breakout. It plays quite well, but it needs more frames to fully converge. You can see some instances of tunneling, where it sends the ball behind the blocks in order to score rapidly.

In [None]:
game_name = 'breakout'
env = play.create_environment(game_name)
play.play(env, game_name, seconds_to_play=5)

![DDQNBreakout](https://i.imgur.com/DLVzyDs.gif)

## A2C Models
Running a model with the implemented A2C architecture plays one game session and then stops on its own. In some games, though, the agent never loses, so to stop it, press the `stop` button for the cell to interrupt the kernel. NOTE: IF YOU DO THIS, YOU NEED TO RESTART THE KERNEL (a TensorFlow-related issue). Below are GIFs of the latest models, but feel free to run and modify the code to play the models natively.
##### A2C Breakout,
always manages to beat each generation. It learned that the best strategy is to dig a tunnel, which it almost always does. Near the end, when there are few blocks left, it tries new things and sometimes loses a life. It's never a fatal mistake, though.
##### A2C Space Invaders,
has amazing accuracy, but performs similarly to the Breakout agent when there are few enemies left.
##### A2C Star Gunner,
seems to never die. It always dodges incoming enemy bullets and plays with amazing reflexes. I've run this model for hours on end, and it never seems to lose a single life.
##### A2C James Bond,
has trouble getting far. This game took a lot of training to ensure it passes the volcano obstacle. It eventually passes it, and sometimes gets to the second level.
##### A2C Pong,
wins flawlessly every single time. Not only does it always win, but it also beats the opponent with a single swing every time.

The number of generations for each model is listed below.

Breakout 105,000|Space Invaders 820,000|Star Gunner 835,000|James Bond 800,000|Pong 1,515,000
- | - | - | - | - |
![A2CBreakout](https://i.imgur.com/4hyRox7.gif) | ![A2CSpaceInvaders](https://i.imgur.com/r8lyu6n.gif) | ![A2CStargunner](https://i.imgur.com/zmobGTs.gif) | ![A2CJamesBond](https://i.imgur.com/lvix1lW.gif) | ![A2CPong](https://i.imgur.com/2U0v5ly.gif)

The commands below may be used to play each custom model. Refer to the `A2C Models` section above for how to stop them.

In [None]:
import a2c.play as play

In [None]:
play.main('BreakoutNoFrameskip-v4', 455000, 'a2c/models', 'a2c/gifs')

In [None]:
play.main('JamesbondNoFrameskip-v4', 1305000, 'a2c/models', 'a2c/gifs')

In [None]:
play.main('PongNoFrameskip-v4', 1515000, 'a2c/models', 'a2c/gifs')

In [None]:
play.main('SpaceInvadersNoFrameskip-v4', 1335000, 'a2c/models', 'a2c/gifs')

In [None]:
play.main('StarGunnerNoFrameskip-v4', 1360000, 'a2c/models', 'a2c/gifs')

## MCPG Models

This is the more fully trained implementation of the Monte Carlo Policy Gradient (MCPG) method. Although it did not finish converging, it reached a peak win rate of 32%. The model will run for 10 seconds, and you can observe that it generally does a good job of trying to knock the ball back in the other direction, even though it does not always reach it.

In [None]:
import time
import math
import random
from itertools import count
from torch.distributions import Categorical
import matplotlib.pyplot as plt
from pdb import set_trace
from collections import deque
import gym
import numpy as np
import _pickle as pickle
import torch.optim as optim
import torch.nn.functional as F
import torch.nn as nn
import torch

# hyperparameters
resume = True  # resume from previous checkpoint?
render = False
MAX_FRAMES = 2000000
eps = np.finfo(np.float32).eps.item()


class Flatten(nn.Module):
    def forward(self, input):
        return input.view(input.size(0), -1)


class CnnPGN(nn.Module):
    def __init__(self, input_shape, num_actions):
        super(CnnPGN, self).__init__()

        self.input_shape = input_shape
        self.num_actions = num_actions

        self.conv_layer = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            Flatten()
        )

        self.fc = nn.Sequential(
            nn.Linear(2304, 512),
            nn.ReLU(),
            nn.Linear(512, num_actions),
            nn.Softmax(dim=1)
        )

    def forward(self, x):
        x = self.conv_layer(x)
        #print("Feeding forward:")
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x


def prepro(I):
    """ Crop, downsample and binarize a 210x160x3 uint8 frame into an 80x80 uint8 array """
    I = I[35:195]  # crop
    I = I[::2, ::2, 0]  # downsample by factor of 2
    I[I == 144] = 0  # erase background (background type 1)
    I[I == 109] = 0  # erase background (background type 2)
    I[I != 0] = 1  # everything else (paddles, ball) just set to 1
    return I


def select_action(policy, state, device):
    state = torch.from_numpy(state).float().unsqueeze(0)
    state = state.view((1, 1, 80, 80)).to(device)
    probs = policy(state)
    m = Categorical(probs)
    action = m.sample()
    return action.item()


def play_game():
    render = True
    env = gym.make("Pong-v0")

    observation = env.reset()

    previous_frame = None
    episode_num = 0
    frame_count = 1
    current_frame = prepro(observation)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    policy = CnnPGN(current_frame.shape, 4).to(device)
    policy.load_state_dict(torch.load('mcpg/policy_1550000_Final'))

    print("Preparing Game:")
    start_time = time.time()

    while episode_num < 2:
        if render:
            env.render()

        if time.time() - start_time >= 10:
            env.close()
            return
        
        time.sleep(.005)
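        # The policy input is the change between consecutive frames
        # (all zeros on the first frame of an episode).
        if previous_frame is not None:
            difference_image = current_frame - previous_frame
        else:
            difference_image = np.zeros_like(current_frame)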
        previous_frame = current_frame

        action = select_action(policy, difference_image, device)
        current_frame, reward, done, _ = env.step(action)

        current_frame = prepro(current_frame)

        if done:
            episode_num += 1
            previous_frame = None
            current_frame = prepro(env.reset())

        frame_count += 1

    print("Finished!!!")


play_game()
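
The cell above only plays back a trained policy; the training loop is not shown. As a rough illustration of the "Monte Carlo" part of the method, the sketch below computes discounted returns for one finished episode and then standardizes them, which is a common way to weight the log-probabilities of the sampled actions. The function names, the `gamma` value, and the normalization step are assumptions and may differ from what the training code actually did.

```python
import statistics


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so each return can reuse the one that follows it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def normalize(returns, eps=1e-8):
    """Standardize returns; eps guards against division by zero."""
    mean = statistics.mean(returns)
    std = statistics.pstdev(returns)
    return [(g - mean) / (std + eps) for g in returns]
```

In a policy-gradient update, each normalized return would multiply the negative log-probability of the action taken at that step, so actions followed by above-average returns become more likely.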