<div style="text-align: center">
  <img src="https://github.com/KarolisRam/MineRL2021-Intro-baselines/blob/main/img/colab_banner.png?raw=true">
</div>

# Introduction
This notebook contains the Behavioural Cloning baselines for the Research track of the [MineRL 2021](https://minerl.io/) competition. To run it you will need to enable GPU by going to `Runtime -> Change runtime type` and selecting GPU from the drop down list.

These baselines differ slightly from the standalone version of these baselines on github - the DATA_SAMPLES parameter is set to 400,000 instead of the default 1,000,000. This is done to fit into the RAM limits of Colab.

To train the agent using the obfuscated action space we first discretize the action space using KMeans clustering. We then train the agent using Behavioural cloning. The training takes 10-15 mins.

You can find more details about the obfuscation here:  
[K-means exploration](https://minerl.io/docs/tutorials/k-means.html)

Also see the in-depth analysis of the obfuscation and the KMeans approach done by one of the teams in the 2020 competition:

[Obfuscation and KMeans analysis](https://github.com/GJuceviciute/MineRL-2020)

Please note that any attempt to work with the obfuscated state and action spaces should be general and work with a different dataset or even a completely new environment.

# Setup

In [1]:
skip_install = True

In [2]:

if not skip_install:
    #%%capture
    !add-apt-repository -y ppa:openjdk-r/ppa
    !apt-get -y purge openjdk-*
    !apt-get -y install openjdk-8-jdk
    !apt-get -y install xvfb xserver-xephyr vnc4server python-opengl ffmpeg

In [3]:
%%capture
if not skip_install:
    !pip3 install --upgrade minerl
    !pip3 install pyvirtualdisplay
    !pip3 install torch
    !pip3 install scikit-learn
    !pip3 install -U colabgymrender

# Import Libraries

In [4]:
import random
import numpy as np
import torch as th
from torch import nn
import gym
import minerl
from tqdm.notebook import tqdm
from colabgymrender.recorder import Recorder
#from pyvirtualdisplay import Display
from sklearn.cluster import KMeans
import logging
logging.disable(logging.ERROR) # reduce clutter, remove if something doesn't work to see the error logs.



In [5]:
from os.path import join
from os import makedirs
# Parameters:
EPOCHS = 8  # how many times we train over dataset.
LEARNING_RATE = 0.0001  # Learning rate for the neural network.

# i didnt change this. seems to work fine.
BATCH_SIZE = 32
NUM_ACTION_CENTROIDS = 70  # Number of KMeans centroids used to cluster the data.

DATA_SAMPLES = 400000  # how many samples to use from the dataset. Impacts RAM usage

# you can save checkpoints after x epochs. Put them in a list.
# They will be saved in the model folder
checkpoints=[4,6]

# Not used
TEST_EPISODES = 25  # number of episodes to test the agent for.
MAX_TEST_EPISODE_LEN = 2000  # 18k is the default for MineRLObtainDiamondVectorObf.

# Bypass filter and use whole dataset
BYPASS_FILTER = False

# Filter frames. Use window-before and window-after frames before and after
# current frame to count present rewards
WINDOW_BEFORE=40
WINDOW_AFTER=20

# Latent dimension of pov. Use 1,4 or 8
LATENT_PIC_DIMENSION=4

## Use small Dataset for debugging
DEBUG=False

## EVAL is not implemented correctly
EVAL=False

# define paths etc.
MODEL_NAME = f'window-before={WINDOW_BEFORE}_window-after={WINDOW_AFTER}_latent-pic-dimension={LATENT_PIC_DIMENSION}_epochs={EPOCHS}_clusters={NUM_ACTION_CENTROIDS}'
try:
    makedirs(MODEL_NAME,exist_ok=False)
except:
    print("WARINING! MODEL ALREADY PRESENT. OLD MODEL WILL BE OVERWRITTEN!")
TRAIN_MODEL_NAME=join(MODEL_NAME,'research_potato.pth')
TRAIN_KMEANS_MODEL_NAME= join(MODEL_NAME,'centroids_for_research_potato.npy')
TEST_MODEL_NAME = TRAIN_MODEL_NAME
TEST_KMEANS_MODEL_NAME = TRAIN_KMEANS_MODEL_NAME

# Neural network

In [7]:
class NatureCNN(nn.Module):
    """
    
    I changed the net slightly to make it use 1x1, 4x4 and 8x8 latend picture dimension.
    I also doubled channel count.
    
    CNN from DQN nature paper:
        Mnih, Volodymyr, et al.
        "Human-level control through deep reinforcement learning."
        Nature 518.7540 (2015): 529-533.

    Nicked from stable-baselines3:
        https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/torch_layers.py

    :param input_shape: A three-item tuple telling image dimensions in (C, H, W)
    :param output_dim: Dimensionality of the output vector
    :param latent_pic_dim: choose one of 3 models with values 1,4,8
    """

    def __init__(self, input_shape, output_dim,latent_pic_dim=4):
        super().__init__()
        n_input_channels = input_shape[0]
        
        if latent_pic_dim ==8:
            
            self.cnn = nn.Sequential(
                nn.Conv2d(n_input_channels, 64, kernel_size=8, stride=4, padding=0),
                nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=4, stride=1, padding=0),
                nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=0),
                nn.ReLU(),
                nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=0),
                nn.ReLU(),
                nn.Flatten()

            )

        elif latent_pic_dim == 4:
            self.cnn = nn.Sequential(
                nn.Conv2d(n_input_channels, 64, kernel_size=8, stride=4, padding=0),
                nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=4, stride=2, padding=0),
                nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=0),
                nn.ReLU(),
                nn.Conv2d(128, 128, kernel_size=1, stride=1, padding=0),
                nn.ReLU(),
                nn.Flatten()

            )
        elif latent_pic_dim ==1:
            self.cnn = nn.Sequential(
                nn.Conv2d(n_input_channels, 64, kernel_size=8, stride=4, padding=0),
                nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=4, stride=2, padding=0),
                nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=4, stride=1, padding=0),
                nn.ReLU(),
                nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=0),
                nn.ReLU(),
                nn.Flatten()

            )
        else:
            print("You can only use 8,4, or 1 as latent pic dim!")
            exit(-1)
            
            
            
        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(th.zeros(1, *input_shape)).shape[1]

        self.linear = nn.Sequential(
            nn.Linear(n_flatten, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, output_dim)
        )

    def forward(self, observations: th.Tensor) -> th.Tensor:
        
        return self.linear(self.cnn(observations))

#test code
#net = NatureCNN((3, 64, 64),100,8)    


# Setup training

In [8]:
def filter_actions_ix(actions,trajectory_ix ,rewards,min_reward=2,before_count=40,after_count=40,bypass=False):
    """
    Filter out frames that dont have ${min_rewards} in the window vicinity.
    Window = [beforecount:i:afterafter_count]
    
    param actions: list of all actions to filter
    param trajectory_ix: the start index of each episode. used to account for overlapping windows
    param rewards: list of rewards, matching actions
    param min_reward: min reward that should be present in the window
    param before_count: window to the left(past)
    param before_count: window to the right(future)
    param bypass: set to True to not filter

    """
    # add 0 as starting point.
    trajectory_ix.insert(0,0)
    print(bypass)
    if bypass :
        return list(range(len(actions)))
    filtered_ix = []
    for i,current_trajectory in tqdm(enumerate(trajectory_ix),total=len(trajectory_ix)-1):
        if i+1>= len(trajectory_ix):
            next_trajectory = len(actions)
        else:    
            next_trajectory = trajectory_ix[i+1]
        for i,act in (enumerate(actions[current_trajectory:next_trajectory])):

          if i-before_count<0:
            before_ix = 0
          else:
            before_ix = i-before_count
          if i+after_count>len(actions):
            after_ix = len(actions)-1
          else:
            after_ix = i+after_count
          if sum(rewards[before_ix:after_ix]) > min_reward:
            filtered_ix.append(current_trajectory+i)

    print(f'Using {len(filtered_ix)} of {len(actions)} samples!')    
    assert len(np.unique(filtered_ix) == len(filtered_ix)), 'WARNING: Duplicate samples!'
    return filtered_ix

def cluster(actions):
    """
    
    run sklearn.KMEANS() on actions. Uses NUM_ACTION_CENTROIDS global.
    
    """
    
    print("Running KMeans on the action vectors")
    kmeans = KMeans(n_clusters=NUM_ACTION_CENTROIDS,verbose=1)
    kmeans.fit(actions)
    action_centroids = kmeans.cluster_centers_
    return action_centroids
    print("KMeans done")

In [9]:
def train():
    
    data = minerl.data.make("MineRLTreechopVectorObf-v0",  data_dir='data', num_workers=1)

    # First, use k-means to find actions that represent most of them.
    # This proved to be a strong approach in the MineRL 2020 competition.
    # See the following for more analysis:
    # https://github.com/GJuceviciute/MineRL-2020

    # Go over the dataset once and collect all actions and the observations (the "pov" image).
    # We do this to later on have uniform sampling of the dataset and to avoid high memory use spikes.
    all_actions = []
    all_pov_obs = []
    all_rewards = []
    
    trajectory_lens = []

    print("Loading data")
    trajectory_names = data.get_trajectory_names()
    random.shuffle(trajectory_names)

    if DEBUG:
        trajectory_names = trajectory_names[0:5]
    
    # Add trajectories to the data until we reach the required DATA_SAMPLES
    for trajectory_name in trajectory_names:
        trajectory = data.load_data(trajectory_name, skip_interval=0, include_metadata=False)
        trajectory = list(trajectory)
        trajectory_lens.append(len((trajectory)))
        
        for dataset_observation, dataset_action, dataset_reward, _, _ in trajectory:
            all_actions.append(dataset_action["vector"])
            all_pov_obs.append(dataset_observation["pov"])
            all_rewards.append(dataset_reward)
        if len(all_actions) >= DATA_SAMPLES:
            break
        del trajectory     

    all_actions = np.array(all_actions)
    all_pov_obs = np.array(all_pov_obs)
    
    
    # apply filtering of 'low reward' actions

    trajectory_ix = np.cumsum(trajectory_lens)
    ix = filter_actions_ix(all_actions,list(trajectory_ix),all_rewards,bypass=BYPASS_FILTER)


    filtered_actions = all_actions[ix]
    filtered_pov_obs = all_pov_obs[ix]



    # Run k-means clustering using scikit-learn.  
    action_centroids = cluster(filtered_actions)


    # Now onto behavioural cloning itself.
    # Much like with intro track, we do behavioural cloning on the discrete actions,
    # where we turn the original vectors into discrete choices by mapping them to the closest
    # centroid (based on Euclidian distance).

    network = NatureCNN((3, 64, 64), NUM_ACTION_CENTROIDS,LATENT_PIC_DIMENSION).to(dev)
    optimizer = th.optim.Adam(network.parameters(), lr=LEARNING_RATE)
    loss_function = nn.CrossEntropyLoss()

    num_samples = filtered_actions.shape[0]
    update_count = 0
    losses = []
    # We have the data loaded up already in all_actions and all_pov_obs arrays.
    # Let's do a manual training loop
    print("Training")
    for e in range(EPOCHS):
        print(f"starting epoch {e+1}")
        # Randomize the order in which we go over the samples
        epoch_indices = np.arange(num_samples)
        np.random.shuffle(epoch_indices)
        for batch_i in range(0, num_samples, BATCH_SIZE):
            # NOTE: this will cut off incomplete batches from end of the random indices
            batch_indices = epoch_indices[batch_i:batch_i + BATCH_SIZE]

            # Load the inputs and preprocess
            obs = filtered_pov_obs[batch_indices].astype(np.float32)
            # Transpose observations to be channel-first (BCHW instead of BHWC)
            obs = obs.transpose(0, 3, 1, 2)
            # Normalize observations. Do this here to avoid using too much memory (images are uint8 by default)
            obs /= 255.0

            # Map actions to their closest centroids
            action_vectors = filtered_actions[batch_indices]
            # Use numpy broadcasting to compute the distance between all
            # actions and centroids at once.
            # "None" in indexing adds a new dimension that allows the broadcasting
            distances = np.sum((action_vectors - action_centroids[:, None]) ** 2, axis=2)
            # Get the index of the closest centroid to each action.
            # This is an array of (batch_size,)
            actions = np.argmin(distances, axis=0)

            # Obtain logits of each action
            logits = network(th.from_numpy(obs).float().to(dev))

            # Minimize cross-entropy with target labels.
            # We could also compute the probability of demonstration actions and
            # maximize them.
            loss = loss_function(logits, th.from_numpy(actions).long().to(dev))

            # Standard PyTorch update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            update_count += 1
            losses.append(loss.item())
            if (update_count % 1000) == 0:
                mean_loss = sum(losses) / len(losses)
                tqdm.write("Iteration {}. Loss {:<10.3f}".format(update_count, mean_loss))
                losses.clear()
        
        # save checkpoints.
        if (e+1) in checkpoints:
            print(f'saving_checkpoint_epoch{e+1}')
            th.save(network.state_dict(), f"{TRAIN_MODEL_NAME}_checkpoint_epoch={e+1}")
            
    print("Training done")

    # Save network and the centroids into separate files
    np.save(TRAIN_KMEANS_MODEL_NAME, action_centroids)
    th.save(network.state_dict(), TRAIN_MODEL_NAME)
    del data

   

# Download the data

In [10]:
# uncomment to download 
#minerl.data.download(directory='data', environment='MineRLTreechopVectorObf-v0');

# Train

In [11]:
# inertia should be ~60-70 for 100 centroids, ~90-100 for 70, ~200 for 40. Depends on random state i think
# filtered dataset get lower inertia :) so it seems to be good
train()  # only need to run this once.


Loading data


100%|███████████████████████████████████████████████████████████████████████████████████████| 1549/1549 [00:00<00:00, 155608.76it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1945/1945 [00:00<00:00, 126432.36it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1949/1949 [00:00<00:00, 158772.09it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1918/1918 [00:00<00:00, 164502.69it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████| 1889/1889 [00:00<00:00, 90449.80it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 2175/2175 [00:00<00:00, 130036.94it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1538/1538 [00:00<00:00, 162793.10it/s]
100%|████████████████████████████████████████████████████████████████

100%|███████████████████████████████████████████████████████████████████████████████████████| 2050/2050 [00:00<00:00, 167119.98it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 2686/2686 [00:00<00:00, 156773.50it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1975/1975 [00:00<00:00, 129488.23it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 2168/2168 [00:00<00:00, 150975.45it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1581/1581 [00:00<00:00, 161736.45it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1672/1672 [00:00<00:00, 144111.06it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 2651/2651 [00:00<00:00, 138503.98it/s]
100%|████████████████████████████████████████████████████████████████

100%|███████████████████████████████████████████████████████████████████████████████████████| 1795/1795 [00:00<00:00, 166924.06it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 2853/2853 [00:00<00:00, 175174.56it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████| 1963/1963 [00:00<00:00, 22818.70it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1779/1779 [00:00<00:00, 168214.68it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 2544/2544 [00:00<00:00, 149681.00it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1885/1885 [00:00<00:00, 146114.64it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1830/1830 [00:00<00:00, 138929.49it/s]
100%|████████████████████████████████████████████████████████████████

100%|███████████████████████████████████████████████████████████████████████████████████████| 2531/2531 [00:00<00:00, 160845.20it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 1806/1806 [00:00<00:00, 164657.70it/s]


(401196, 64)
False


  0%|          | 0/185 [00:00<?, ?it/s]

Using 285612 of 401196 samples!
Running KMeans on the action vectors
Initialization complete
Iteration 0, inertia 157.09396525434045
Iteration 1, inertia 141.37720159380456
Iteration 2, inertia 138.86985702651253
Iteration 3, inertia 137.43716766816948
Iteration 4, inertia 136.4791720373049
Iteration 5, inertia 136.0431700812798
Iteration 6, inertia 135.8152431417297
Iteration 7, inertia 135.64537428777726
Iteration 8, inertia 135.49223970518156
Iteration 9, inertia 135.3607541563796
Iteration 10, inertia 135.2822508742166
Iteration 11, inertia 135.1998612062331
Iteration 12, inertia 135.10557635448558
Iteration 13, inertia 135.03225024334054
Iteration 14, inertia 134.97598317280364
Iteration 15, inertia 134.94295896077602
Iteration 16, inertia 134.92065413029135
Iteration 17, inertia 134.90554889555946
Iteration 18, inertia 134.8955425146052
Iteration 19, inertia 134.88873338321406
Iteration 20, inertia 134.88475942607604
Iteration 21, inertia 134.88235302477003
Iteration 22, inertia 

Initialization complete
Iteration 0, inertia 154.1029620477193
Iteration 1, inertia 139.83152755874292
Iteration 2, inertia 137.4780757544344
Iteration 3, inertia 136.399189325832
Iteration 4, inertia 135.68322026906097
Iteration 5, inertia 135.16171208689943
Iteration 6, inertia 134.75499926221377
Iteration 7, inertia 134.5742089089087
Iteration 8, inertia 134.48881450719293
Iteration 9, inertia 134.41659069590924
Iteration 10, inertia 134.36292250455972
Iteration 11, inertia 134.2962860421949
Iteration 12, inertia 134.24350146600491
Iteration 13, inertia 134.212452774506
Iteration 14, inertia 134.18866359177485
Iteration 15, inertia 134.16801479668874
Iteration 16, inertia 134.1501744749083
Iteration 17, inertia 134.1388438628905
Iteration 18, inertia 134.12916047616648
Iteration 19, inertia 134.1219393255688
Iteration 20, inertia 134.11698591062026
Iteration 21, inertia 134.1113886735879
Iteration 22, inertia 134.1036279487972
Iteration 23, inertia 134.09901812077425
Iteration 24, i

Iteration 30, inertia 137.74133587605164
Iteration 31, inertia 137.70469142216857
Iteration 32, inertia 137.66985283036055
Iteration 33, inertia 137.63372321810593
Iteration 34, inertia 137.60141693321305
Iteration 35, inertia 137.57395122331323
Iteration 36, inertia 137.55042614765847
Iteration 37, inertia 137.53063488398084
Iteration 38, inertia 137.50814068888178
Iteration 39, inertia 137.4809267945604
Iteration 40, inertia 137.4511342798778
Iteration 41, inertia 137.417373452028
Iteration 42, inertia 137.37887694968586
Iteration 43, inertia 137.33956994699636
Iteration 44, inertia 137.30780713371166
Iteration 45, inertia 137.2749356637783
Iteration 46, inertia 137.2537011010074
Iteration 47, inertia 137.2377304533821
Iteration 48, inertia 137.21968643872972
Iteration 49, inertia 137.20487966399773
Iteration 50, inertia 137.18826880621327
Iteration 51, inertia 137.1719102325568
Iteration 52, inertia 137.15510077624123
Iteration 53, inertia 137.1390114644288
Iteration 54, inertia 137

Iteration 56000. Loss 1.401     
Iteration 57000. Loss 1.408     
Iteration 58000. Loss 1.401     
Iteration 59000. Loss 1.402     
Iteration 60000. Loss 1.407     
Iteration 61000. Loss 1.391     
Iteration 62000. Loss 1.400     
6 True
Iteration 63000. Loss 1.380     
Iteration 64000. Loss 1.369     
Iteration 65000. Loss 1.388     
Iteration 66000. Loss 1.359     
Iteration 67000. Loss 1.348     
Iteration 68000. Loss 1.381     
Iteration 69000. Loss 1.365     
Iteration 70000. Loss 1.352     
Iteration 71000. Loss 1.343     
7 False
Training done


# Start Minecraft

In [12]:

if EVAL:
    env = gym.make('MineRLTreechopVectorObf-v0')
    env = Recorder(env, './video', fps=24)

# Run your agent
As the code below runs you should see episode videos and rewards show up. You can run the below cell multiple times to see different episodes.

In [13]:
if EVAL:
    from tqdm import trange


    action_centroids = np.load(TEST_KMEANS_MODEL_NAME)
    network = NatureCNN((3, 64, 64), NUM_ACTION_CENTROIDS).to(dev)
    print(th.load(TEST_MODEL_NAME))
    network.load_state_dict(th.load(TEST_MODEL_NAME))


    num_actions = action_centroids.shape[0]
    action_list = np.arange(num_actions)

    print(action_list)

    for episode in trange(TEST_EPISODES):
        obs = env.reset()
        done = False
        total_reward = 0
        steps = 0

        while not done:
            # Process the action:
            #   - Add/remove batch dimensions
            #   - Transpose image (needs to be channels-last)
            #   - Normalize image
            obs = th.from_numpy(obs['pov'].transpose(2, 0, 1)[None].astype(np.float32) / 255).to(dev)
            # Turn logits into probabilities
            probabilities = th.softmax(network(obs), dim=1)[0]
            # Into numpy
            probabilities = probabilities.detach().cpu().numpy()
            # Sample action according to the probabilities
            discrete_action = np.random.choice(action_list, p=probabilities)

            # Map the discrete action to the corresponding action centroid (vector)
            action = action_centroids[discrete_action]
            minerl_action = {"vector": action}

            obs, reward, done, info = env.step(minerl_action)
            total_reward += reward
            steps += 1
            if steps >= MAX_TEST_EPISODE_LEN:
                break

        env.release()
        #env.play()
        print(f'Episode #{episode + 1} reward: {total_reward}\t\t episode length: {steps}\n')