<div style="text-align: center">
  <img src="https://github.com/KarolisRam/MineRL2021-Intro-baselines/blob/main/img/colab_banner.png?raw=true">
</div>

# Introduction
This notebook contains the Behavioural Cloning baselines for the Research track of the [MineRL 2021](https://minerl.io/) competition. To run it you will need to enable GPU by going to `Runtime -> Change runtime type` and selecting GPU from the drop down list.

These baselines differ slightly from the standalone version of these baselines on github - the DATA_SAMPLES parameter is set to 400,000 instead of the default 1,000,000. This is done to fit into the RAM limits of Colab.

To train the agent using the obfuscated action space we first discretize the action space using KMeans clustering. We then train the agent using Behavioural cloning. The training takes 10-15 mins.

You can find more details about the obfuscation here:  
[K-means exploration](https://minerl.io/docs/tutorials/k-means.html)

Also see the in-depth analysis of the obfuscation and the KMeans approach done by one of the teams in the 2020 competition:

[Obfuscation and KMeans analysis](https://github.com/GJuceviciute/MineRL-2020)

Please note that any attempt to work with the obfuscated state and action spaces should be general and work with a different dataset or even a completely new environment.

# Setup

In [1]:
skip_install = True

In [2]:

if not skip_install:
    #%%capture
    !add-apt-repository -y ppa:openjdk-r/ppa
    !apt-get -y purge openjdk-*
    !apt-get -y install openjdk-8-jdk
    !apt-get -y install xvfb xserver-xephyr vnc4server python-opengl ffmpeg

In [3]:
%%capture
if not skip_install:
    !pip3 install --upgrade minerl
    !pip3 install pyvirtualdisplay
    !pip3 install torch
    !pip3 install scikit-learn
    !pip3 install -U colabgymrender

# Import Libraries

In [4]:
import random
import numpy as np
import torch as th
from torch import nn
import gym
import minerl
from tqdm.notebook import tqdm
from colabgymrender.recorder import Recorder
#from pyvirtualdisplay import Display
from sklearn.cluster import KMeans
import logging
logging.disable(logging.ERROR) # reduce clutter, remove if something doesn't work to see the error logs.



In [19]:
from os.path import join
from os import makedirs
# Parameters:
EPOCHS = 3  # how many times we train over dataset.
LEARNING_RATE = 0.0001  # Learning rate for the neural network.
BATCH_SIZE = 32
NUM_ACTION_CENTROIDS = 100  # Number of KMeans centroids used to cluster the data.

DATA_SAMPLES = 400000  # how many samples to use from the dataset. Impacts RAM usage



TEST_EPISODES = 25  # number of episodes to test the agent for.
MAX_TEST_EPISODE_LEN = 2000  # 18k is the default for MineRLObtainDiamondVectorObf.

# Bypass filter and use whole dataset
BYPASS_FILTER = False

# Filter frames. Use window-before and window-after frames before and after
# current frame to count present rewards
WINDOW_BEFORE=40
WINDOW_AFTER=40

# Latent dimension of pov. Use 1,4 or 8
LATENT_PIC_DIMENSION=1

## Use small Dataset for debugging
DEBUG=False

## EVAL is not implemented correctly
EVAL=False

MODEL_NAME = f'window-before={WINDOW_BEFORE}_window-after={WINDOW_AFTER}_latent-pic-dimension={LATENT_PIC_DIMENSION}_epochs={EPOCHS}'
try:
    makedirs(MODEL_NAME,exist_ok=False)
except:
    print("WARINING! MODEL ALREADY PRESENT. OLD MODEL WILL BE OVERWRITTEN!")
TRAIN_MODEL_NAME=join(MODEL_NAME,'research_potato.pth')
TRAIN_KMEANS_MODEL_NAME= join(MODEL_NAME,'centroids_for_research_potato.npy')
TEST_MODEL_NAME = TRAIN_MODEL_NAME
TEST_KMEANS_MODEL_NAME = TRAIN_KMEANS_MODEL_NAME

In [6]:
print(th.cuda.is_available())
dev = th.device("cuda:0" if th.cuda.is_available() else "cpu")

False


# Neural network

In [7]:
class NatureCNN(nn.Module):
    """
    CNN from DQN nature paper:
        Mnih, Volodymyr, et al.
        "Human-level control through deep reinforcement learning."
        Nature 518.7540 (2015): 529-533.

    Nicked from stable-baselines3:
        https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/torch_layers.py

    :param input_shape: A three-item tuple telling image dimensions in (C, H, W)
    :param output_dim: Dimensionality of the output vector
    """

    def __init__(self, input_shape, output_dim,latent_pic_dim=4):
        super().__init__()
        n_input_channels = input_shape[0]
        
        if latent_pic_dim ==8:
            
            self.cnn = nn.Sequential(
                nn.Conv2d(n_input_channels, 64, kernel_size=8, stride=4, padding=0),
                nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=4, stride=1, padding=0),
                nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=0),
                nn.ReLU(),
                nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=0),
                nn.ReLU(),
                nn.Flatten()

            )

        elif latent_pic_dim == 4:
            self.cnn = nn.Sequential(
                nn.Conv2d(n_input_channels, 64, kernel_size=8, stride=4, padding=0),
                nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=4, stride=2, padding=0),
                nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=0),
                nn.ReLU(),
                nn.Conv2d(128, 128, kernel_size=1, stride=1, padding=0),
                nn.ReLU(),
                nn.Flatten()

            )
        elif latent_pic_dim ==1:
            self.cnn = nn.Sequential(
                nn.Conv2d(n_input_channels, 64, kernel_size=8, stride=4, padding=0),
                nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=4, stride=2, padding=0),
                nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=4, stride=1, padding=0),
                nn.ReLU(),
                nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=0),
                nn.ReLU(),
                nn.Flatten()

            )
        else:
            print("You can only use 8,4, or 1 as latent pic dim!")
            exit(-1)
            
            
            
        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(th.zeros(1, *input_shape)).shape[1]

        self.linear = nn.Sequential(
            nn.Linear(n_flatten, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, output_dim)
        )

    def forward(self, observations: th.Tensor) -> th.Tensor:
        
        return self.linear(self.cnn(observations))
    
net = NatureCNN((3, 64, 64),100,8)    


# Setup training

In [8]:
def filter_actions_ix(actions,trajectory_ix ,rewards,min_reward=2,before_count=40,after_count=40,bypass=False):
    """
    Crude, extremely slow...
    Does account for overlapping
    """
    # add 0 as starting point.
    trajectory_ix.insert(0,0)
    
    if bypass : return range(actions)
    filtered_ix = []
    for i,current_trajectory in tqdm(enumerate(trajectory_ix),total=len(trajectory_ix)-1):
        if i+1>= len(trajectory_ix):
            next_trajectory = len(actions)
        else:    
            next_trajectory = trajectory_ix[i+1]
        for i,act in (enumerate(actions[current_trajectory:next_trajectory])):

          if i-before_count<0:
            before_ix = 0
          else:
            before_ix = i-before_count
          if i+after_count>len(actions):
            after_ix = len(actions)-1
          else:
            after_ix = i+after_count
          if sum(rewards[before_ix:after_ix]) > min_reward:
            filtered_ix.append(current_trajectory+i)

    print(f'Using {len(filtered_ix)} of {len(actions)} samples!')    
    assert len(np.unique(filtered_ix) == len(filtered_ix)), 'WARNING: Duplicate samples!'
    return filtered_ix

def cluster(actions):
    print("Running KMeans on the action vectors")
    kmeans = KMeans(n_clusters=NUM_ACTION_CENTROIDS,verbose=1)
    kmeans.fit(actions)
    action_centroids = kmeans.cluster_centers_
    return action_centroids
    print("KMeans done")

In [9]:
def train():
    # For demonstration purposes, we will only use ObtainPickaxe data which is smaller,
    # but has the similar steps as ObtainDiamond in the beginning.
    # "VectorObf" stands for vectorized (vector observation and action), where there is no
    # clear mapping between original actions and the vectors (i.e. you need to learn it)
    data = minerl.data.make("MineRLTreechopVectorObf-v0",  data_dir='data', num_workers=1)

    # First, use k-means to find actions that represent most of them.
    # This proved to be a strong approach in the MineRL 2020 competition.
    # See the following for more analysis:
    # https://github.com/GJuceviciute/MineRL-2020

    # Go over the dataset once and collect all actions and the observations (the "pov" image).
    # We do this to later on have uniform sampling of the dataset and to avoid high memory use spikes.
    all_actions = []
    all_pov_obs = []
    all_rewards = []
    
    trajectory_lens = []

    print("Loading data")
    trajectory_names = data.get_trajectory_names()
    random.shuffle(trajectory_names)

    if DEBUG:
        trajectory_names = trajectory_names[0:5]
    
    # Add trajectories to the data until we reach the required DATA_SAMPLES
    for trajectory_name in trajectory_names:
        trajectory = data.load_data(trajectory_name, skip_interval=0, include_metadata=False)
        trajectory = list(trajectory)
        trajectory_lens.append(len((trajectory)))
        
        for dataset_observation, dataset_action, dataset_reward, _, _ in trajectory:
            all_actions.append(dataset_action["vector"])
            all_pov_obs.append(dataset_observation["pov"])
            all_rewards.append(dataset_reward)
        if len(all_actions) >= DATA_SAMPLES:
            break
        del trajectory     

    all_actions = np.array(all_actions)
    all_pov_obs = np.array(all_pov_obs)
    print(all_actions.shape)
    # apply filtering of 'low reward' actions

    trajectory_ix = np.cumsum(trajectory_lens)
    ix = filter_actions_ix(all_actions,list(trajectory_ix),all_rewards,bypass=BYPASS_FILTER)


    filtered_actions = all_actions[ix]
    filtered_pov_obs = all_pov_obs[ix]



    # Run k-means clustering using scikit-learn.  
    action_centroids = cluster(filtered_actions)


    # Now onto behavioural cloning itself.
    # Much like with intro track, we do behavioural cloning on the discrete actions,
    # where we turn the original vectors into discrete choices by mapping them to the closest
    # centroid (based on Euclidian distance).

    network = NatureCNN((3, 64, 64), NUM_ACTION_CENTROIDS,LATENT_PIC_DIMENSION).to(dev)
    optimizer = th.optim.Adam(network.parameters(), lr=LEARNING_RATE)
    loss_function = nn.CrossEntropyLoss()

    num_samples = filtered_actions.shape[0]
    update_count = 0
    losses = []
    # We have the data loaded up already in all_actions and all_pov_obs arrays.
    # Let's do a manual training loop
    print("Training")
    for _ in range(EPOCHS):
        # Randomize the order in which we go over the samples
        epoch_indices = np.arange(num_samples)
        np.random.shuffle(epoch_indices)
        for batch_i in range(0, num_samples, BATCH_SIZE):
            # NOTE: this will cut off incomplete batches from end of the random indices
            batch_indices = epoch_indices[batch_i:batch_i + BATCH_SIZE]

            # Load the inputs and preprocess
            obs = filtered_pov_obs[batch_indices].astype(np.float32)
            # Transpose observations to be channel-first (BCHW instead of BHWC)
            obs = obs.transpose(0, 3, 1, 2)
            # Normalize observations. Do this here to avoid using too much memory (images are uint8 by default)
            obs /= 255.0

            # Map actions to their closest centroids
            action_vectors = filtered_actions[batch_indices]
            # Use numpy broadcasting to compute the distance between all
            # actions and centroids at once.
            # "None" in indexing adds a new dimension that allows the broadcasting
            distances = np.sum((action_vectors - action_centroids[:, None]) ** 2, axis=2)
            # Get the index of the closest centroid to each action.
            # This is an array of (batch_size,)
            actions = np.argmin(distances, axis=0)

            # Obtain logits of each action
            logits = network(th.from_numpy(obs).float().to(dev))

            # Minimize cross-entropy with target labels.
            # We could also compute the probability of demonstration actions and
            # maximize them.
            loss = loss_function(logits, th.from_numpy(actions).long().to(dev))

            # Standard PyTorch update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            update_count += 1
            losses.append(loss.item())
            if (update_count % 1000) == 0:
                mean_loss = sum(losses) / len(losses)
                tqdm.write("Iteration {}. Loss {:<10.3f}".format(update_count, mean_loss))
                losses.clear()
    print("Training done")

    # Save network and the centroids into separate files
    np.save(TRAIN_KMEANS_MODEL_NAME, action_centroids)
    th.save(network.state_dict(), TRAIN_MODEL_NAME)
    del data

   

# Download the data

In [10]:
# uncomment to download 
#minerl.data.download(directory='data', environment='MineRLTreechopVectorObf-v0');

# Train

In [11]:
# inetia should be ~60-70. Depends on random state i sthink
train()  # only need to run this once.


Loading data


100%|███████████████████████████████████████████████████████████████████████████████████████| 1950/1950 [00:00<00:00, 154973.72it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████| 1674/1674 [00:00<00:00, 65249.75it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1949/1949 [00:00<00:00, 120240.91it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1677/1677 [00:00<00:00, 166730.22it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1573/1573 [00:00<00:00, 146614.23it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 2853/2853 [00:00<00:00, 116621.99it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 3097/3097 [00:00<00:00, 142763.44it/s]
100%|████████████████████████████████████████████████████████████████

100%|████████████████████████████████████████████████████████████████████████████████████████| 1655/1655 [00:00<00:00, 97675.09it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1909/1909 [00:00<00:00, 168790.74it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1603/1603 [00:00<00:00, 167229.68it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1945/1945 [00:00<00:00, 144518.44it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 2175/2175 [00:00<00:00, 156016.75it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████| 1853/1853 [00:00<00:00, 22731.06it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1739/1739 [00:00<00:00, 138572.36it/s]
100%|████████████████████████████████████████████████████████████████

100%|███████████████████████████████████████████████████████████████████████████████████████| 3577/3577 [00:00<00:00, 164823.13it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1465/1465 [00:00<00:00, 165691.12it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████| 3257/3257 [00:00<00:00, 39115.60it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████| 1792/1792 [00:00<00:00, 93807.01it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 2122/2122 [00:00<00:00, 164118.55it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 3017/3017 [00:00<00:00, 161928.36it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1761/1761 [00:00<00:00, 154541.77it/s]
100%|████████████████████████████████████████████████████████████████

100%|███████████████████████████████████████████████████████████████████████████████████████| 1535/1535 [00:00<00:00, 124329.07it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1593/1593 [00:00<00:00, 173056.19it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 2586/2586 [00:00<00:00, 169341.15it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████| 2461/2461 [00:00<00:00, 25439.95it/s]


(401718, 64)


  0%|          | 0/187 [00:00<?, ?it/s]

Using 225742 of 401718 samples!
Running KMeans on the action vectors
Initialization complete
Iteration 0, inertia 87.77852666242653
Iteration 1, inertia 78.54154358455617
Iteration 2, inertia 76.77709282783735
Iteration 3, inertia 75.87175224732009
Iteration 4, inertia 75.33545012891116
Iteration 5, inertia 75.03226652181628
Iteration 6, inertia 74.78896321029723
Iteration 7, inertia 74.61937781827355
Iteration 8, inertia 74.5024606147548
Iteration 9, inertia 74.41975982644446
Iteration 10, inertia 74.357229337749
Iteration 11, inertia 74.31417750123757
Iteration 12, inertia 74.27872765669811
Iteration 13, inertia 74.25252230037951
Iteration 14, inertia 74.23180588020487
Iteration 15, inertia 74.21651030675
Iteration 16, inertia 74.20322466116176
Iteration 17, inertia 74.19098292126604
Iteration 18, inertia 74.1796982859807
Iteration 19, inertia 74.16901437624414
Iteration 20, inertia 74.1586270544202
Iteration 21, inertia 74.15009810065847
Iteration 22, inertia 74.14331195128229
Itera

Iteration 36, inertia 74.73461306647637
Iteration 37, inertia 74.7324475173715
Iteration 38, inertia 74.73033936410496
Iteration 39, inertia 74.72810741881376
Iteration 40, inertia 74.72656066684753
Iteration 41, inertia 74.72602974657711
Iteration 42, inertia 74.72551390786654
Iteration 43, inertia 74.72523386900511
Iteration 44, inertia 74.7250084772377
Iteration 45, inertia 74.72475758746717
Iteration 46, inertia 74.72465082179295
Converged at iteration 46: center shift 2.730462134048219e-08 within tolerance 6.282995007382065e-08.
Initialization complete
Iteration 0, inertia 85.3801806175837
Iteration 1, inertia 74.50690802835975
Iteration 2, inertia 73.07761134885162
Iteration 3, inertia 72.51995107486404
Iteration 4, inertia 72.1771743406693
Iteration 5, inertia 71.96985521993102
Iteration 6, inertia 71.83712904795348
Iteration 7, inertia 71.68997608770857
Iteration 8, inertia 71.61797809330525
Iteration 9, inertia 71.56219806538029
Iteration 10, inertia 71.50311184517592
Iteratio

Iteration 47, inertia 75.0568616107837
Converged at iteration 47: center shift 4.079068048819865e-09 within tolerance 6.282995007382065e-08.
Initialization complete
Iteration 0, inertia 86.59603980049769
Iteration 1, inertia 77.05315225779356
Iteration 2, inertia 75.33720604421733
Iteration 3, inertia 74.58785022813166
Iteration 4, inertia 74.24630862788761
Iteration 5, inertia 74.05015877675798
Iteration 6, inertia 73.8838245922843
Iteration 7, inertia 73.7339714639423
Iteration 8, inertia 73.58621866288199
Iteration 9, inertia 73.4587142595641
Iteration 10, inertia 73.36579597620721
Iteration 11, inertia 73.29374490001746
Iteration 12, inertia 73.24142219178809
Iteration 13, inertia 73.17223182585506
Iteration 14, inertia 73.10626825407037
Iteration 15, inertia 73.04627502907003
Iteration 16, inertia 72.99215161925767
Iteration 17, inertia 72.96603714760538
Iteration 18, inertia 72.9435123080788
Iteration 19, inertia 72.92715457254161
Iteration 20, inertia 72.91850969244955
Iteration

# Start Minecraft

In [21]:

if EVAL:
    env = gym.make('MineRLTreechopVectorObf-v0')
    env = Recorder(env, './video', fps=24)



# Run your agent
As the code below runs you should see episode videos and rewards show up. You can run the below cell multiple times to see different episodes.

In [22]:
if EVAL:
    from tqdm import trange


    action_centroids = np.load(TEST_KMEANS_MODEL_NAME)
    network = NatureCNN((3, 64, 64), NUM_ACTION_CENTROIDS).to(dev)
    print(th.load(TEST_MODEL_NAME))
    network.load_state_dict(th.load(TEST_MODEL_NAME))


    num_actions = action_centroids.shape[0]
    action_list = np.arange(num_actions)

    print(action_list)

    for episode in trange(TEST_EPISODES):
        obs = env.reset()
        done = False
        total_reward = 0
        steps = 0

        while not done:
            # Process the action:
            #   - Add/remove batch dimensions
            #   - Transpose image (needs to be channels-last)
            #   - Normalize image
            obs = th.from_numpy(obs['pov'].transpose(2, 0, 1)[None].astype(np.float32) / 255).to(dev)
            # Turn logits into probabilities
            probabilities = th.softmax(network(obs), dim=1)[0]
            # Into numpy
            probabilities = probabilities.detach().cpu().numpy()
            # Sample action according to the probabilities
            discrete_action = np.random.choice(action_list, p=probabilities)

            # Map the discrete action to the corresponding action centroid (vector)
            action = action_centroids[discrete_action]
            minerl_action = {"vector": action}

            obs, reward, done, info = env.step(minerl_action)
            total_reward += reward
            steps += 1
            if steps >= MAX_TEST_EPISODE_LEN:
                break

        env.release()
        #env.play()
        print(f'Episode #{episode + 1} reward: {total_reward}\t\t episode length: {steps}\n')

FileNotFoundError: [Errno 2] No such file or directory: 'window-before=40_window-after=40_latent-pic-dimension=1_epochs=3/centroids_for_research_potato.npy'