<div style="text-align: center">
  <img src="https://github.com/KarolisRam/MineRL2021-Intro-baselines/blob/main/img/colab_banner.png?raw=true">
</div>

# Introduction
This notebook contains the Behavioural Cloning baselines for the Research track of the [MineRL 2021](https://minerl.io/) competition. To run it, enable a GPU by going to `Runtime -> Change runtime type` and selecting GPU from the drop-down list.

These baselines differ slightly from the standalone version on GitHub: the DATA_SAMPLES parameter is set to 400,000 instead of the default 1,000,000 so that the data fits within Colab's RAM limits.

To train the agent on the obfuscated action space, we first discretize the action space using KMeans clustering and then train the agent with Behavioural Cloning. Training takes 10-15 minutes.

You can find more details about the obfuscation here:  
[K-means exploration](https://minerl.io/docs/tutorials/k-means.html)

Also see the in-depth analysis of the obfuscation and the KMeans approach done by one of the teams in the 2020 competition:

[Obfuscation and KMeans analysis](https://github.com/GJuceviciute/MineRL-2020)

Please note that any attempt to work with the obfuscated state and action spaces should be general and work with a different dataset or even a completely new environment.
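
The discretization idea described in this introduction can be illustrated with a small sketch. It uses hypothetical stand-in data (5 random centroids and 8 noisy action vectors of dimension 64) rather than the real dataset, and the helper name `nearest_centroid` is an assumption made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 5 centroids and 8 action vectors of dimension 64.
centroids = rng.normal(size=(5, 64))
true_ids = [0, 3, 3, 1, 4, 2, 0, 1]
actions = centroids[true_ids] + 0.01 * rng.normal(size=(8, 64))


def nearest_centroid(actions, centroids):
    """Return, for each action vector, the index of its closest centroid."""
    # (1, N, D) - (K, 1, D) broadcasts to (K, N, D); summing over D gives (K, N).
    distances = np.sum((actions[None] - centroids[:, None]) ** 2, axis=2)
    # For every action (column), pick the centroid (row) with the smallest distance.
    return np.argmin(distances, axis=0)


labels = nearest_centroid(actions, centroids)  # discrete targets for training
decoded = centroids[labels]                    # vectors an agent would send back
print(labels)
```

The network only ever predicts an index into the centroid table. At test time that index is turned back into a 64-dimensional vector by a plain array lookup.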

# Setup

In [1]:
skip_install = True

In [2]:
#%%capture
if not skip_install:
    !add-apt-repository -y ppa:openjdk-r/ppa
    !apt-get -y purge openjdk-*
    !apt-get -y install openjdk-8-jdk
    !apt-get -y install xvfb xserver-xephyr vnc4server python-opengl ffmpeg

In [3]:
%%capture
if not skip_install:
    !pip3 install --upgrade minerl
    !pip3 install pyvirtualdisplay
    !pip3 install torch
    !pip3 install scikit-learn
    !pip3 install -U colabgymrender

# Import Libraries

In [6]:
import random
import numpy as np
import torch as th
from torch import nn
import gym
import minerl
from tqdm.notebook import tqdm
from colabgymrender.recorder import Recorder  # used below to record episode videos
from pyvirtualdisplay import Display  # used below to start a virtual display
from sklearn.cluster import KMeans
import logging
logging.disable(logging.ERROR) # reduce clutter, remove if something doesn't work to see the error logs.

In [7]:
print(th.cuda.is_available())
dev = th.device("cuda:0" if th.cuda.is_available() else "cpu")

False



# Neural network

In [8]:
class NatureCNN(nn.Module):
    """
    CNN from DQN nature paper:
        Mnih, Volodymyr, et al.
        "Human-level control through deep reinforcement learning."
        Nature 518.7540 (2015): 529-533.

    Nicked from stable-baselines3:
        https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/torch_layers.py

    :param input_shape: A three-item tuple telling image dimensions in (C, H, W)
    :param output_dim: Dimensionality of the output vector
    """

    def __init__(self, input_shape, output_dim):
        super().__init__()
        n_input_channels = input_shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4, padding=0),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=0),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(th.zeros(1, *input_shape)).shape[1]

        self.linear = nn.Sequential(
            nn.Linear(n_flatten, 512),
            nn.ReLU(),
            nn.Linear(512, output_dim)
        )

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.cnn(observations))
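
The network above finds the flattened feature size by running one dummy forward pass. As a cross-check, the same number can be worked out by hand from the standard convolution output-size formula. The sketch below assumes the 64x64 input used later in this notebook; the helper `conv_out` is introduced here for illustration.

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel_size, stride) of the three convolutions in NatureCNN, all with padding=0.
layers = [(8, 4), (4, 2), (3, 1)]

size = 64  # input height and width
for kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    print(size)  # 15, then 6, then 4

# The last convolution has 64 output channels.
n_flatten = 64 * size * size
print(n_flatten)  # 1024
```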

# Setup training

In [10]:
def filter_actions_ix(actions, rewards, min_reward=2, before_count=40, after_count=40):
    """
    Return the indices of samples whose surrounding reward window
    (before_count steps back, after_count steps ahead) sums to more than min_reward.

    Crude, extremely slow...
    Does not take into account overlapping ...
    """
    filtered_ix = []
    for i in tqdm(range(len(actions))):
        # Clamp the window to the bounds of the data.
        before_ix = max(i - before_count, 0)
        after_ix = min(i + after_count, len(actions))
        if sum(rewards[before_ix:after_ix]) > min_reward:
            filtered_ix.append(i)

    print(f'Using {len(filtered_ix)} of {len(actions)} samples!')
    return filtered_ix

def cluster(actions):
    print("Running KMeans on the action vectors")
    kmeans = KMeans(n_clusters=NUM_ACTION_CENTROIDS, verbose=1)
    kmeans.fit(actions)
    action_centroids = kmeans.cluster_centers_
    print("KMeans done")
    return action_centroids
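
The reward-window filter above re-sums a slice of rewards for every sample. A faster variant, sketched here under the assumption that each window is simply clamped to the array bounds, computes all window sums at once from a prefix sum. The function name `filter_indices_cumsum` is hypothetical and not part of the baseline.

```python
import numpy as np


def filter_indices_cumsum(rewards, min_reward=2, before_count=40, after_count=40):
    """Indices i whose window rewards[i-before_count : i+after_count] sums to > min_reward."""
    rewards = np.asarray(rewards, dtype=np.float64)
    n = len(rewards)
    # prefix[k] holds the sum of rewards[:k], so any window sum is a difference of two entries.
    prefix = np.concatenate(([0.0], np.cumsum(rewards)))
    idx = np.arange(n)
    lo = np.maximum(idx - before_count, 0)  # clamp window start
    hi = np.minimum(idx + after_count, n)   # clamp window end (exclusive)
    window_sums = prefix[hi] - prefix[lo]
    return np.flatnonzero(window_sums > min_reward)


demo = filter_indices_cumsum(
    [0, 0, 1, 0, 0, 1, 1, 0, 0, 0], min_reward=1, before_count=1, after_count=2
)
print(demo)
```

The result is a NumPy integer array, so it can be used directly to index the action and observation arrays.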

In [11]:
def train():
    # For demonstration purposes, we only use the Treechop data, which is smaller
    # than ObtainDiamond but covers steps similar to the beginning of that task.
    # "VectorObf" stands for vectorized (vector observation and action), where there is no
    # clear mapping between original actions and the vectors (i.e. you need to learn it)
    data = minerl.data.make("MineRLTreechopVectorObf-v0",  data_dir='data', num_workers=1)

    # First, use k-means to find actions that represent most of them.
    # This proved to be a strong approach in the MineRL 2020 competition.
    # See the following for more analysis:
    # https://github.com/GJuceviciute/MineRL-2020

    # Go over the dataset once and collect all actions and the observations (the "pov" image).
    # We do this to later on have uniform sampling of the dataset and to avoid high memory use spikes.
    all_actions = []
    all_pov_obs = []
    all_rewards = []

    print("Loading data")
    trajectory_names = data.get_trajectory_names()
    random.shuffle(trajectory_names)

    # Add trajectories to the data until we reach the required DATA_SAMPLES.
    for trajectory_name in trajectory_names:
        trajectory = data.load_data(trajectory_name, skip_interval=0, include_metadata=False)
        
        for dataset_observation, dataset_action, dataset_reward, _, _ in trajectory:
            all_actions.append(dataset_action["vector"])
            all_pov_obs.append(dataset_observation["pov"])
            all_rewards.append(dataset_reward)
        if len(all_actions) >= DATA_SAMPLES:
            break

    all_actions = np.array(all_actions)
    all_pov_obs = np.array(all_pov_obs)

    # apply filtering of 'low reward' actions

    ix = filter_actions_ix(all_actions,all_rewards)


    filtered_actions = all_actions[ix]
    filtered_pov_obs = all_pov_obs[ix]

    print(len(ix),max(ix))
    print(all_actions.shape)
    print(filtered_actions.shape)

    # Run k-means clustering using scikit-learn.  
    action_centroids = cluster(filtered_actions)


    # Now onto behavioural cloning itself.
    # Much like in the Intro track, we do behavioural cloning on the discrete actions,
    # where we turn the original vectors into discrete choices by mapping them to the closest
    # centroid (based on Euclidean distance).

    network = NatureCNN((3, 64, 64), NUM_ACTION_CENTROIDS).to(dev)
    optimizer = th.optim.Adam(network.parameters(), lr=LEARNING_RATE)
    loss_function = nn.CrossEntropyLoss()

    num_samples = filtered_actions.shape[0]
    update_count = 0
    losses = []
    # We have the data loaded up already in the filtered_actions and filtered_pov_obs arrays.
    # Let's do a manual training loop
    print("Training")
    for _ in range(EPOCHS):
        # Randomize the order in which we go over the samples
        epoch_indices = np.arange(num_samples)
        np.random.shuffle(epoch_indices)
        for batch_i in range(0, num_samples, BATCH_SIZE):
            # NOTE: the last batch of an epoch may be smaller than BATCH_SIZE
            batch_indices = epoch_indices[batch_i:batch_i + BATCH_SIZE]

            # Load the inputs and preprocess
            obs = filtered_pov_obs[batch_indices].astype(np.float32)
            # Transpose observations to be channel-first (BCHW instead of BHWC)
            obs = obs.transpose(0, 3, 1, 2)
            # Normalize observations. Do this here to avoid using too much memory (images are uint8 by default)
            obs /= 255.0

            # Map actions to their closest centroids
            action_vectors = filtered_actions[batch_indices]
            # Use numpy broadcasting to compute the distance between all
            # actions and centroids at once.
            # "None" in indexing adds a new dimension that allows the broadcasting
            distances = np.sum((action_vectors - action_centroids[:, None]) ** 2, axis=2)
            # Get the index of the closest centroid to each action.
            # This is an array of (batch_size,)
            actions = np.argmin(distances, axis=0)

            # Obtain logits of each action
            logits = network(th.from_numpy(obs).float().to(dev))

            # Minimize cross-entropy with target labels.
            # We could also compute the probability of demonstration actions and
            # maximize them.
            loss = loss_function(logits, th.from_numpy(actions).long().to(dev))

            # Standard PyTorch update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            update_count += 1
            losses.append(loss.item())
            if (update_count % 1000) == 0:
                mean_loss = sum(losses) / len(losses)
                tqdm.write("Iteration {}. Loss {:<10.3f}".format(update_count, mean_loss))
                losses.clear()
    print("Training done")

    # Save network and the centroids into separate files
    np.save(TRAIN_KMEANS_MODEL_NAME, action_centroids)
    th.save(network.state_dict(), TRAIN_MODEL_NAME)
    del data
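
The training loop's comment notes that, instead of calling the cross-entropy loss, one could compute the probability of the demonstrated actions and maximize it. The two views give the same number, as the NumPy sketch below illustrates with made-up logits and targets (assumptions for illustration, not data from this notebook).

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis of a (batch, classes) array."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


# Hypothetical logits for a batch of 2 samples over 3 discrete actions.
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = np.array([0, 2])  # index of the demonstrated action for each sample

log_probs = log_softmax(logits)
probs = np.exp(log_probs)

# Cross-entropy with mean reduction: average negative log-probability of the targets.
cross_entropy = -log_probs[np.arange(len(targets)), targets].mean()

# The same quantity, written via the probabilities of the demonstrated actions.
neg_log_likelihood = -np.log(probs[np.arange(len(targets)), targets]).mean()

print(cross_entropy, neg_log_likelihood)
```

Minimizing the cross-entropy is therefore the same as maximizing the average log-probability the network assigns to the demonstrated actions.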

   

# Parameters

In [12]:
# Parameters:
EPOCHS = 2  # how many times we train over dataset.
LEARNING_RATE = 0.0001  # Learning rate for the neural network.
BATCH_SIZE = 32
NUM_ACTION_CENTROIDS = 100  # Number of KMeans centroids used to cluster the data.

DATA_SAMPLES = 400000  # how many samples to use from the dataset. Impacts RAM usage

TRAIN_MODEL_NAME = 'research_potato.pth'  # name to use when saving the trained agent.
TEST_MODEL_NAME = 'research_potato.pth'  # name to use when loading the trained agent.
TRAIN_KMEANS_MODEL_NAME = 'centroids_for_research_potato.npy'  # name to use when saving the KMeans model.
TEST_KMEANS_MODEL_NAME = 'centroids_for_research_potato.npy'  # name to use when loading the KMeans model.

TEST_EPISODES = 10  # number of episodes to test the agent for.
MAX_TEST_EPISODE_LEN = 2000  # 18k is the default for MineRLObtainDiamondVectorObf.
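
Since DATA_SAMPLES drives RAM usage, a rough back-of-the-envelope estimate can help when changing it. The sketch below assumes 64x64x3 `uint8` images and 64-dimensional `float32` action vectors; the `float32` action dtype is an assumption here. It only counts the final arrays, so peak usage during loading will be higher: building the arrays from Python lists and the fancy-indexed filtering step each create additional copies.

```python
DATA_SAMPLES = 400_000

POV_SHAPE = (64, 64, 3)   # height, width, channels
POV_BYTES_PER_VALUE = 1   # uint8
ACTION_DIM = 64
ACTION_BYTES_PER_VALUE = 4  # float32 (assumed)

values_per_image = POV_SHAPE[0] * POV_SHAPE[1] * POV_SHAPE[2]
pov_bytes = DATA_SAMPLES * values_per_image * POV_BYTES_PER_VALUE
action_bytes = DATA_SAMPLES * ACTION_DIM * ACTION_BYTES_PER_VALUE

GIB = 2 ** 30
print(f"Images:  {pov_bytes / GIB:.2f} GiB")
print(f"Actions: {action_bytes / GIB:.2f} GiB")
```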

# Download the data

In [14]:
minerl.data.download(directory='data', experiment='MineRLTreechopVectorObf-v0');

Download: https://minerl.s3.amazonaws.com/v3/MineRLTreechopVectorObf-v0.tar:   4%|▎        | 62.0/1617.2544 [00:05<01:46, 14.63MB/s]

KeyboardInterrupt: 

# Train

In [10]:
display = Display(visible=0, size=(400, 300))
display.start();

In [11]:
train()  # only need to run this once.

Loading data


100%|██████████| 2149/2149 [00:00<00:00, 103405.64it/s]
100%|██████████| 2661/2661 [00:00<00:00, 102837.37it/s]
100%|██████████| 2146/2146 [00:00<00:00, 101239.22it/s]
100%|██████████| 1655/1655 [00:00<00:00, 96001.40it/s]
100%|██████████| 1603/1603 [00:00<00:00, 91928.54it/s]
100%|██████████| 2094/2094 [00:00<00:00, 98645.17it/s]
100%|██████████| 2075/2075 [00:00<00:00, 106275.03it/s]
100%|██████████| 3531/3531 [00:00<00:00, 101064.46it/s]
100%|██████████| 2159/2159 [00:00<00:00, 97936.50it/s]
100%|██████████| 1703/1703 [00:00<00:00, 102965.17it/s]
100%|██████████| 1877/1877 [00:00<00:00, 99082.62it/s]
100%|██████████| 2215/2215 [00:00<00:00, 95104.55it/s]
100%|██████████| 2063/2063 [00:00<00:00, 91480.32it/s]
100%|██████████| 1384/1384 [00:00<00:00, 91210.61it/s]
100%|██████████| 1792/1792 [00:00<00:00, 100793.79it/s]
100%|██████████| 3017/3017 [00:00<00:00, 93840.58it/s]
100%|██████████| 1926/1926 [00:00<00:00, 102991.35it/s]
100%|██████████| 1787/1787 [00:00<00:00, 104922.19it/s]
1

0it [00:00, ?it/s]

Using 202918 of 401945 samples!
202918 401905
(401945, 64)
(202918, 64)
Running KMeans on the action vectors
Initialization complete
Iteration 0, inertia 56.12639985200622
Iteration 1, inertia 50.19432912885528
Iteration 2, inertia 49.28466345385982
Iteration 3, inertia 48.8603591021657
Iteration 4, inertia 48.54603814636172
Iteration 5, inertia 48.282585564741964
Iteration 6, inertia 48.0471121283338
Iteration 7, inertia 47.87071457342507
Iteration 8, inertia 47.72120158616393
Iteration 9, inertia 47.601882076434414
Iteration 10, inertia 47.49718509032849
Iteration 11, inertia 47.405863722233725
Iteration 12, inertia 47.33018503850763
Iteration 13, inertia 47.282757667074115
Iteration 14, inertia 47.24288286227256
Iteration 15, inertia 47.20249597596789
Iteration 16, inertia 47.16292021649529
Iteration 17, inertia 47.1259832857079
Iteration 18, inertia 47.088909298031815
Iteration 19, inertia 47.04986528578334
Iteration 20, inertia 47.019998939070405
Iteration 21, inertia 46.997637999

# Start Minecraft

In [12]:
env = gym.make('MineRLTreechopVectorObf-v0')
env = Recorder(env, './video', fps=24)

# Run your agent
As the code below runs, you should see episode videos and rewards show up. You can run the cell below multiple times to see different episodes.

In [None]:
from tqdm import trange
action_centroids = np.load(TEST_KMEANS_MODEL_NAME)
network = NatureCNN((3, 64, 64), NUM_ACTION_CENTROIDS).to(dev)
# map_location lets a model trained on GPU load on a CPU-only machine
network.load_state_dict(th.load(TEST_MODEL_NAME, map_location=dev))


num_actions = action_centroids.shape[0]
action_list = np.arange(num_actions)

for episode in trange(15):
    obs = env.reset()
    done = False
    total_reward = 0
    steps = 0

    while not done:
        # Process the observation:
        #   - Add a batch dimension
        #   - Transpose image (needs to be channels-first)
        #   - Normalize image
        obs = th.from_numpy(obs['pov'].transpose(2, 0, 1)[None].astype(np.float32) / 255).to(dev)
        # Turn logits into probabilities
        probabilities = th.softmax(network(obs), dim=1)[0]
        # Into numpy
        probabilities = probabilities.detach().cpu().numpy()
        # Sample action according to the probabilities
        discrete_action = np.random.choice(action_list, p=probabilities)

        # Map the discrete action to the corresponding action centroid (vector)
        action = action_centroids[discrete_action]
        minerl_action = {"vector": action}

        obs, reward, done, info = env.step(minerl_action)
        total_reward += reward
        steps += 1
        if steps >= MAX_TEST_EPISODE_LEN:
            break

    env.release()
    env.play()
    print(f'Episode #{episode + 1} reward: {total_reward}\t\t episode length: {steps}\n')

  0%|          | 0/15 [00:00<?, ?it/s]
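
The test loop above samples each action from the softmax distribution rather than always taking the most likely one. The sketch below contrasts the two choices on made-up logits (an assumption for illustration). Greedy selection is shown only for comparison and is not what the cell above does.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax for a 1-D array of logits."""
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()


rng = np.random.default_rng(0)

# Hypothetical network output for 4 discrete actions.
logits = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(logits)

# Stochastic policy: draw an action index in proportion to its probability.
sampled = rng.choice(len(probs), p=probs)

# Greedy policy: always pick the most likely action.
greedy = int(np.argmax(probs))

print(probs, sampled, greedy)
```

Sampling keeps some of the variety present in the demonstrations, which can help an agent avoid repeating one action indefinitely, at the cost of occasionally picking unlikely actions.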

In [None]:
! zip -r video1.zip video/
from google.colab import files
files.download("/content/video1.zip")