
CLIP4MC: An RL-Friendly Vision-Language Model for Minecraft

CLIP4MC is a vision-language model for Minecraft that aligns not only entities but also the actions implicitly contained in video clips and their transcripts. We construct and release a clean vision-language dataset for Minecraft based on the YouTube dataset from MineDojo, and we train our CLIP4MC model on this constructed dataset. Empirically, our method provides a friendlier reward signal for the RL training procedure.

Demonstrations

Here are some demonstrations of agents trained with CLIP4MC.

Demo videos: Harvest a leaf · Milk a cow · Shear a sheep

Requirements

Packages

Install the Python packages listed in requirements.txt.

Note that we require PyTorch>=1.10.0 and x-transformers==0.27.1.
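For example, with pip:

pip install -r requirements.txt
python -c "import torch; print(torch.__version__)"  # should print a version >= 1.10.0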

Data

The dataset must be prepared before training. Information about each data piece is available in our released dataset.

This project provides a naive implementation of the dataloader and dataset. To use them, the data should be organized in the following structure:

data_dir_0
├── text_input.pkl
└── video_input.pkl
data_dir_1
├── text_input.pkl
└── video_input.pkl
...
data_dir_n
├── text_input.pkl
└── video_input.pkl

Tokenized and padded text (via the CLIP tokenizer) and equidistantly sampled frames are stored in text_input.pkl and video_input.pkl, respectively, and are meant to be loaded with the pickle.load function.
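As a minimal sketch of how such a directory could be consumed (the NaiveMCDataset class below is a hypothetical illustration, not the repository's actual loader):

import os
import pickle

import torch
from torch.utils.data import Dataset

class NaiveMCDataset(Dataset):
    """Illustrative loader for the directory layout above."""

    def __init__(self, data_dirs):
        self.data_dirs = data_dirs

    def __len__(self):
        return len(self.data_dirs)

    def __getitem__(self, idx):
        d = self.data_dirs[idx]
        # Tokenized, padded transcript (CLIP tokenizer output).
        with open(os.path.join(d, "text_input.pkl"), "rb") as f:
            text = pickle.load(f)
        # Equidistantly sampled video frames.
        with open(os.path.join(d, "video_input.pkl"), "rb") as f:
            video = pickle.load(f)
        return torch.as_tensor(text), torch.as_tensor(video)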

A log file for each dataset is also required. It should be a JSON file with the following structure:

{
  "train": ["data_dir_0", "data_dir_1", ..., "data_dir_n"],
  "test" : ["data_dir_0", "data_dir_1", ..., "data_dir_n"]
}

Both the train and test keys are required: the train key lists the data directories used for training, and the test key lists those used for testing.
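Putting the pieces together, the split could be consumed like this (reusing the hypothetical NaiveMCDataset from above; the path and batch size are illustrative):

import json
from torch.utils.data import DataLoader

with open("dataset_log.json") as f:
    split = json.load(f)

train_loader = DataLoader(NaiveMCDataset(split["train"]), batch_size=32, shuffle=True)
test_loader = DataLoader(NaiveMCDataset(split["test"]), batch_size=32)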

Pretrained Models

  • A pretrained ViT-B-16 CLIP checkpoint is required for training from scratch. You can download it from here.

  • The weights of our pretrained CLIP4MC model are available here for fine-tuning or evaluation.

Usage

You can use the command pattern below to train the model: XXX denotes a path to fill in, [...] marks optional arguments, and {...} marks a choice among arguments.

torchrun --nproc_per_node=4 train_ddp.py --dataset_log_file XXX \
                                         [--use_pretrained_CLIP --pretrain_CLIP_path XXX] \
                                         [--use_pretrained_model --pretrain_model_path XXX] \
                                         --model_type {CLIP4MC,CLIP4MC_simple,MineCLIP}
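
For instance, a from-scratch training run on 4 GPUs starting from the pretrained CLIP checkpoint might look like this (the paths are placeholders):

torchrun --nproc_per_node=4 train_ddp.py --dataset_log_file ./dataset_log.json \
                                         --use_pretrained_CLIP --pretrain_CLIP_path ./ViT-B-16.pt \
                                         --model_type CLIP4MC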

Citation

@article{ding2023clip4mc,
  title={CLIP4MC: An RL-Friendly Vision-Language Model for Minecraft},
  author={Ding, Ziluo and Luo, Hao and Li, Ke and Yue, Junpeng and Huang, Tiejun and Lu, Zongqing},
  journal={arXiv preprint arXiv:2303.10571},
  year={2023}
}
