# Training your own model

This notebook will walk you through training your own model using [DeCLUTR](https://github.com/JohnGiorgi/DeCLUTR).

## 🔧 Install the prerequisites

In [None]:
!git clone https://github.com/JohnGiorgi/DeCLUTR.git
!pip install --editable DeCLUTR

For the time being, you will need to install a specific commit of [AllenNLP](https://allennlp.org/).

In [None]:
!git clone https://github.com/allenai/allennlp.git
%cd allennlp
!git checkout 9766eb4
!pip install -e .
%cd ../

## 📖 Preparing a dataset


A dataset is simply a file containing one item of text (a document, a scientific paper, etc.) per line. For convenience, we have provided a script that will download the [WikiText-103](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/) dataset and format it for training with our method.
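If you would rather train on your own corpus, the one-document-per-line format is straightforward to produce yourself. The following is a minimal sketch, not part of DeCLUTR: the helper name `write_one_document_per_line` is hypothetical, and it assumes that collapsing all internal whitespace (including newlines) into single spaces is acceptable for your text.

```python
from pathlib import Path
from typing import Iterable


def write_one_document_per_line(documents: Iterable[str], output_path: str) -> int:
    """Write each document on its own line and return how many were written.

    Hypothetical helper (not part of DeCLUTR). Newlines and runs of whitespace
    inside a document are collapsed to single spaces so that one line holds
    exactly one document. Documents that are empty after collapsing are skipped.
    """
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with path.open("w", encoding="utf-8") as f:
        for document in documents:
            flattened = " ".join(document.split())
            if flattened:
                f.write(flattened + "\n")
                written += 1
    return written
```

You could call it with any iterable of strings, for example `write_one_document_per_line(my_documents, "my_corpus/train.txt")`, where both `my_documents` and the path are placeholders for your own data.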

The only "gotcha" is that each piece of text needs to be long enough for us to sample spans from it. In general, you should collect documents that meet the following minimum length:

```python
min_length = num_anchors * max_span_len * 2
```
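To make the arithmetic concrete, the formula can be wrapped in a small function. This is only an illustrative sketch: the function name `minimum_document_length` is hypothetical, and the example values are the `num_anchors=2` and `max_span_len=512` settings used in this notebook.

```python
def minimum_document_length(num_anchors: int, max_span_len: int) -> int:
    """Apply the rule of thumb above: num_anchors * max_span_len * 2.

    Hypothetical helper for illustration; it is not part of DeCLUTR.
    """
    if num_anchors < 1 or max_span_len < 1:
        raise ValueError("num_anchors and max_span_len must be positive")
    return num_anchors * max_span_len * 2


print(minimum_document_length(num_anchors=2, max_span_len=512))  # 2048
```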

In our paper, we set `num_anchors=2` and `max_span_len=512`, so we require documents of `min_length=2048`. We simply need to provide this value as an argument when running the script:

In [None]:
train_data_path = "/content/wikitext_103/train.txt"
min_length = 2048

!python DeCLUTR/scripts/preprocess_wikitext_103.py $train_data_path --min-length $min_length

By default, [`allennlp train`](https://docs.allennlp.org/master/api/commands/train/) will create a vocabulary for our dataset. Because our model comes with a pretrained vocabulary, we can skip this step by creating the following file under our dataset folder:

In [None]:
vocabulary_directory = "/content/wikitext_103/vocabulary"
!mkdir -p $vocabulary_directory
!echo -e "*tags\n*labels" > "$vocabulary_directory/non_padded_namespaces.txt"
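If you are not running in a shell-enabled notebook, the same file can be written from Python. This is a minimal sketch that mirrors the `mkdir` and `echo` commands above; the helper name `write_non_padded_namespaces` is hypothetical and not part of DeCLUTR or AllenNLP.

```python
from pathlib import Path


def write_non_padded_namespaces(vocabulary_directory: str) -> Path:
    """Create the vocabulary directory and write non_padded_namespaces.txt.

    Hypothetical helper. The file contents match the shell command above:
    one namespace pattern per line ("*tags" and "*labels").
    """
    directory = Path(vocabulary_directory)
    directory.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    target = directory / "non_padded_namespaces.txt"
    target.write_text("*tags\n*labels\n", encoding="utf-8")
    return target
```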

Let's confirm that our dataset looks as expected.

In [None]:
!wc -l $train_data_path  # This should be approximately 17.8K lines

In [None]:
!head -n 1 $train_data_path  # This should be a single Wikipedia entry
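For a custom dataset, it can also be useful to check the length requirement directly. The sketch below flags lines that look too short. It is an approximation under a stated assumption: it counts whitespace-separated tokens, which may differ from how the preprocessing script or the model's tokenizer measures length, so treat borderline results with caution. The helper name `find_short_documents` is hypothetical.

```python
from typing import List


def find_short_documents(path: str, min_length: int) -> List[int]:
    """Return the 1-based line numbers of documents that look too short.

    Hypothetical helper. Length is measured in whitespace-separated tokens,
    which is an assumption and only approximates other tokenization schemes.
    """
    short_lines = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if len(line.split()) < min_length:
                short_lines.append(line_number)
    return short_lines
```

An empty list from `find_short_documents(train_data_path, min_length)` suggests that every document passes this approximate check.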

## 🏃 Training the model

Once you have collected the dataset, you can easily initiate a training session with the `allennlp train` command. An experiment is configured using a [Jsonnet](https://jsonnet.org/) config file. DeCLUTR provides a handful of these config files with sensible defaults. Let's look at a simplified config:

In [None]:
with open("DeCLUTR/configs/contrastive_simple.jsonnet", "r") as f:
    print(f.read())


The only thing you need to configure is the `train_data_path` and, optionally, the `vocabulary`. Because our vocabulary is pretrained, specifying it here will prevent AllenNLP from trying to construct it again. Here, we will pass both arguments to `allennlp train` via the `--overrides` argument, but you can also provide them in your config file directly:

In [None]:
overrides = (
    f"{{'train_data_path': '{train_data_path}', "
    f"'vocabulary': {{'type': 'from_files', 'directory': '{vocabulary_directory}'}}}}"
)
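The doubled braces in the f-strings are easy to get wrong, so it can help to inspect the resulting string before launching training. The following sketch rebuilds the same string in a hypothetical function, `build_overrides`, and parses it with `ast.literal_eval`. That check works only because this particular string also happens to be a valid Python literal; it is a quick sanity check on quoting and nesting, not a reproduction of how AllenNLP parses overrides.

```python
import ast


def build_overrides(train_data_path: str, vocabulary_directory: str) -> str:
    """Build the same overrides string as the cell above (hypothetical helper)."""
    return (
        f"{{'train_data_path': '{train_data_path}', "
        f"'vocabulary': {{'type': 'from_files', 'directory': '{vocabulary_directory}'}}}}"
    )


# A separate name is used so the notebook's `overrides` variable is untouched.
overrides_preview = build_overrides(
    "/content/wikitext_103/train.txt", "/content/wikitext_103/vocabulary"
)
print(overrides_preview)

# Parse the string to confirm the braces and quotes are balanced.
parsed = ast.literal_eval(overrides_preview)
assert parsed["vocabulary"]["type"] == "from_files"
```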

In [None]:
!allennlp train DeCLUTR/configs/contrastive_simple.jsonnet \
    --serialization-dir output \
    --overrides "$overrides" \
    --include-package declutr \
    -f
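When training completes, AllenNLP archives the trained model as `model.tar.gz` inside the serialization directory (`output` in the command above). A small check like the one below can confirm that the run finished and produced an archive. The helper name `find_model_archive` is hypothetical; this is a sketch, not part of DeCLUTR or AllenNLP.

```python
from pathlib import Path
from typing import Optional


def find_model_archive(serialization_dir: str) -> Optional[Path]:
    """Return the path to model.tar.gz if it exists, otherwise None.

    Hypothetical helper. A missing archive usually means training has not
    finished yet or was interrupted.
    """
    archive = Path(serialization_dir) / "model.tar.gz"
    return archive if archive.is_file() else None
```

For example, `find_model_archive("output")` returns `None` until the training run above has completed.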

## ♻️ Conclusion

That's it! In this notebook, we covered how to collect data for training the model, and specifically how _long_ that text needs to be. We then briefly covered configuring and running a training session. Please see [our paper](https://arxiv.org/abs/2006.03659) and [repo](https://github.com/JohnGiorgi/DeCLUTR) for more details, and don't hesitate to open an issue if you have any trouble!