# Training your own model

This notebook will walk you through training your own model using [DeCLUTR](https://github.com/JohnGiorgi/DeCLUTR).

## 🔧 Install the prerequisites

In [1]:
# !pip install git+https://github.com/JohnGiorgi/DeCLUTR.git

# Alternatively, from the root of a local DeCLUTR clone, run "pip install --editable ."

In [3]:
# Check that a multiprocessing pool can be created with the "spawn" start method

from multiprocessing import get_context
num_processes = 8
pool = get_context("spawn").Pool(num_processes)
pool

<multiprocessing.pool.Pool at 0x1a914b8ed08>

## 📖 Preparing a dataset


A dataset is simply a file containing one item of text (a document, a scientific paper, etc.) per line. For demonstration purposes, we have provided a script that will download the [WikiText-103](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/) dataset and format it for training with our method.

The only "gotcha" is that each piece of text needs to be long enough so that we can sample spans from it. In general, you should collect documents of a minimum length according to the following:

```python
min_length = num_anchors * max_span_len * 2
```

In our paper, we set `num_anchors=2` and `max_span_len=512`, so we require documents of `min_length=2048`. We simply need to provide this value as an argument when running the script:

In [1]:
import os

train_data_path = "E:/wiki_text/wikitext-103/train.txt"

# run this to download and preprocess data

min_length = 2048

# !python ../scripts/preprocess_wikitext_103.py $train_data_path --min-length $min_length --max-instances 500
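
As a quick sanity check on the length requirement, the sketch below computes `min_length` from the formula above and drops documents that are too short. The helper names (`min_document_length`, `filter_by_length`) are hypothetical, and length is counted in whitespace-separated tokens as a simplifying assumption; the preprocessing script may count tokens differently.

```python
def min_document_length(num_anchors: int, max_span_len: int) -> int:
    """Minimum document length implied by min_length = num_anchors * max_span_len * 2."""
    return num_anchors * max_span_len * 2


def filter_by_length(documents, min_length):
    """Keep only documents with at least `min_length` whitespace tokens (an assumption)."""
    return [doc for doc in documents if len(doc.split()) >= min_length]


# Values used in the paper: num_anchors=2 and max_span_len=512.
required = min_document_length(num_anchors=2, max_span_len=512)

# Two toy documents: one far too short, one exactly long enough.
docs = ["short document", " ".join(["token"] * required)]
kept = filter_by_length(docs, required)

print(required, len(kept))
```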

Let's confirm that our dataset looks as expected.

In [7]:
# The full dataset is approximately 17.8K lines; with --max-instances 500 we expect 500.
# Keep this comment on its own line: here, a trailing comment after the command is passed to wc as extra file names.
!wc -l $train_data_path

     500 E:/wiki_text/wikitext-103/train.txt


In [4]:
# !head -n 1 $train_data_path  # This should be a single Wikipedia entry
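
Because `wc` and `head` are not available in every shell, the same two checks can be done in plain Python. This is a minimal sketch; `count_lines` and `first_line` are hypothetical helpers, and the demo runs on a small temporary file rather than the real training set.

```python
import tempfile
from pathlib import Path


def count_lines(path):
    """Return the number of lines in the file (one document per line)."""
    with Path(path).open("r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def first_line(path, max_chars=200):
    """Return the first line of the file, truncated to `max_chars` characters."""
    with Path(path).open("r", encoding="utf-8") as f:
        return f.readline().rstrip("\n")[:max_chars]


# Demonstrate on a small temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "train.txt"
    demo_path.write_text("first document\nsecond document\n", encoding="utf-8")
    num_docs = count_lines(demo_path)
    preview = first_line(demo_path)

print(num_docs, preview)
```

On the real data, `count_lines(train_data_path)` should match the number of instances written by the preprocessing script.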

### Look at sampling technique

This will help get an idea of how anchor and positive spans are sampled from a piece of text.

In [2]:
from declutr.common.contrastive_utils import sample_anchor_positive_pairs
from declutr.losses import NTXentLoss
import torch
from transformers import AutoTokenizer



In [4]:
text = "this is just an example sentence to test out some sampling and loss calculation from DeCLUTR. We want to see exactly how it works in order to implement it for our own use case"
len_text = len(text.split())

In [16]:
# just go with one anchor for now

num_anchors = 1
max_span_len = int((len_text/2)/num_anchors) 
min_span_len = 5
num_positives = 5
sampling_strat = "adjacent"

In [18]:
anchor_spans, positive_spans = sample_anchor_positive_pairs(
    text = text,
    num_anchors = num_anchors,
    num_positives = num_positives,
    max_span_len = max_span_len,
    min_span_len = min_span_len,
    sampling_strategy = sampling_strat
)

In [19]:
anchor_spans

['to see exactly how it works in order to implement it for our own use']

In [20]:
positive_spans

['loss calculation from DeCLUTR. We want',
 'and loss calculation from DeCLUTR. We want',
 'sampling and loss calculation from DeCLUTR. We want',
 'sampling and loss calculation from DeCLUTR. We want',
 'some sampling and loss calculation from DeCLUTR. We want']
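
To build intuition for the output above, here is a deliberately simplified sketch of "adjacent" sampling: it picks one anchor span and one positive span that sits directly before or after it. This is not the implementation behind `sample_anchor_positive_pairs`; the function name `sample_adjacent_pair`, the uniform span lengths, and the whitespace tokenization are all assumptions made for illustration.

```python
import random


def sample_adjacent_pair(text, min_span_len, max_span_len, rng):
    """Sample an anchor span and a positive span that directly borders it."""
    tokens = text.split()
    anchor_len = rng.randint(min_span_len, max_span_len)
    pos_len = rng.randint(min_span_len, max_span_len)
    if anchor_len + pos_len > len(tokens):
        raise ValueError("Text is too short for the requested span lengths.")

    if rng.random() < 0.5:
        # Positive ends exactly where the anchor begins.
        anchor_start = rng.randint(pos_len, len(tokens) - anchor_len)
        pos_start = anchor_start - pos_len
    else:
        # Positive begins exactly where the anchor ends.
        anchor_start = rng.randint(0, len(tokens) - anchor_len - pos_len)
        pos_start = anchor_start + anchor_len

    anchor = " ".join(tokens[anchor_start:anchor_start + anchor_len])
    positive = " ".join(tokens[pos_start:pos_start + pos_len])
    return anchor, positive


demo_text = (
    "this is just an example sentence to test out some sampling and loss "
    "calculation from DeCLUTR. We want to see exactly how it works in order "
    "to implement it for our own use case"
)
demo_max_span_len = len(demo_text.split()) // 2
anchor, positive = sample_adjacent_pair(
    demo_text, min_span_len=5, max_span_len=demo_max_span_len, rng=random.Random(0)
)
print("anchor:  ", anchor)
print("positive:", positive)
```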

In [None]:
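# Test the loss function on random 64-dimensional embeddings, each with a leading batch dimension of 1
anchor_emb = torch.rand(64).unsqueeze(0)
pos_emb = torch.rand(64).unsqueeze(0)
neg_emb = torch.rand(64).unsqueeze(0)

In [None]:
# Stack the anchor and positive embeddings into a single (2, 64) tensor
anchor_pos_embs = torch.cat((anchor_emb, pos_emb))
# Note: this binds the loss class itself; it is not instantiated here
loss_func = NTXentLoss
# Format the anchor and positive embeddings as an (embeddings, labels) pair
embs, labels = NTXentLoss.get_embeddings_and_label(anchor_emb, pos_emb)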
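
For reference, the idea behind the normalized temperature-scaled cross-entropy (NT-Xent) loss can be sketched in plain Python for a single anchor. This is not the `NTXentLoss` class imported above, which works on batched tensors; `cosine_similarity` and `nt_xent_single` are hypothetical helpers, and the two-dimensional vectors are toy values.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_single(anchor, positive, negatives, temperature=0.5):
    """NT-Xent for one anchor: -log of the softmax probability of the positive."""
    pos_logit = cosine_similarity(anchor, positive) / temperature
    neg_logits = [cosine_similarity(anchor, n) / temperature for n in negatives]
    denominator = math.exp(pos_logit) + sum(math.exp(logit) for logit in neg_logits)
    return -math.log(math.exp(pos_logit) / denominator)


anchor = [1.0, 0.0]
# Positive identical to the anchor, negative orthogonal to it: low loss.
close_loss = nt_xent_single(anchor, [1.0, 0.0], [[0.0, 1.0]], temperature=1.0)
# Positive orthogonal to the anchor, negative identical to it: high loss.
far_loss = nt_xent_single(anchor, [0.0, 1.0], [[1.0, 0.0]], temperature=1.0)

print(close_loss, far_loss)
```

The loss is small when the positive is more similar to the anchor than the negatives are, which is the signal used to pull anchor and positive embeddings together.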


## 🏃 Training the model

Once you have collected the dataset, you can easily initiate a training session with the `allennlp train` command. An experiment is configured using a [Jsonnet](https://jsonnet.org/) config file. Let's take a look at the config for the DeCLUTR-small model presented in [our paper](https://arxiv.org/abs/2006.03659):

In [22]:
# with open("../training_config/declutr_small.jsonnet", "r") as f:
#     print(f.read())


The only thing to configure is the path to the training set (`train_data_path`), which can be passed to `allennlp train` via the `--overrides` argument (but you can also provide it in your config file directly, if you prefer):

In [2]:
# overrides = (
#     f"{{'train_data_path': '{train_data_path}', "
#     # lower the batch size to be able to train on Colab GPUs
#     "'data_loader.batch_size': 2, "
#     # training examples / batch size. Not required, but gives us a more informative progress bar during training
#     "'data_loader.batches_per_epoch': None}"
# )


overrides = (
    f"{{'train_data_path': '{train_data_path}', "
    # lower the batch size to be able to train on Colab GPUs
    "'data_loader.batch_size': 2,}"
)
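
As an alternative to assembling the overrides string by hand, it can be built from a dictionary with `json.dumps`, which takes care of quoting and escaping (useful for Windows paths containing backslashes). This is a sketch that assumes `--overrides` accepts a plain JSON string with the same dotted keys used above; the path is the one from this notebook.

```python
import json

train_data_path = "E:/wiki_text/wikitext-103/train.txt"

# Same keys as the hand-written string above, expressed as a dictionary.
overrides_dict = {
    "train_data_path": train_data_path,
    # Lower the batch size to be able to train on smaller GPUs.
    "data_loader.batch_size": 2,
}
overrides_json = json.dumps(overrides_dict)

print(overrides_json)
```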

In [4]:
overrides

"{'train_data_path': 'E:/wiki_text/wikitext-103/train.txt', 'data_loader.batch_size': 2,}"

In [3]:
!allennlp train "../training_config/declutr_small_v2.jsonnet" \
    --serialization-dir "E:/saved_models/declutr/wiki/output" \
    --overrides "$overrides" \
    --include-package "declutr" \
    -f

2022-09-21 13:12:47,625 - INFO - allennlp.common.params - random_seed = 13370
2022-09-21 13:12:47,625 - INFO - allennlp.common.params - numpy_seed = 1337
2022-09-21 13:12:47,626 - INFO - allennlp.common.params - pytorch_seed = 133
2022-09-21 13:12:47,773 - INFO - allennlp.common.checks - Pytorch version: 1.6.0
2022-09-21 13:12:47,773 - INFO - allennlp.common.params - type = default
2022-09-21 13:12:47,774 - INFO - allennlp.common.params - dataset_reader.type = declutr
2022-09-21 13:12:47,774 - INFO - allennlp.common.params - dataset_reader.lazy = False
2022-09-21 13:12:47,774 - INFO - allennlp.common.params - dataset_reader.cache_directory = None
2022-09-21 13:12:47,774 - INFO - allennlp.common.params - dataset_reader.max_instances = None
2022-09-21 13:12:47,775 - INFO - allennlp.common.params - dataset_reader.manual_distributed_sharding = False
2022-09-21 13:12:47,775 - INFO - allennlp.common.params - dataset_reader.manual_multi_process_sharding = False
2022-09-21 13:12:47,775 - INFO 


reading instances: 79it [00:02, 30.30it/s]

2022-09-21 13:15:02,156 - INFO - allennlp.models.archival - archiving weights and vocabulary to E:/saved_models/declutr/wiki/output\model.tar.gz


### 🤗 Exporting a trained model to Hugging Face Transformers

We have provided a simple script to export a trained model so that it can be loaded with [Hugging Face Transformers](https://github.com/huggingface/transformers).

In [9]:
archive_file = "E:/saved_models/declutr/wiki/output/"
save_directory = "E:/saved_models/declutr/wiki/output/transformers_format/"

!python ../scripts/save_pretrained_hf.py --archive_file $archive_file --save_directory $save_directory

💾 🤗 Transformers compatible model saved to: E:\saved_models\declutr\wiki\output\transformers_format. See https://huggingface.co/transformers/model_sharing.html for instructions on hosting the model with 🤗 Transformers.


Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
# !python ../scripts/save_pretrained_hf.py --help

The model, saved to `--save_directory`, can then be loaded using the Hugging Face Transformers library.

> See the [embedding notebook](https://colab.research.google.com/github/JohnGiorgi/DeCLUTR/blob/master/notebooks/embedding.ipynb) for more details on using trained models.

In [11]:
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModel.from_pretrained(save_directory)

In [13]:
model

RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inpl

> If you would like to upload your model to the Hugging Face model repository, follow the instructions [here](https://huggingface.co/transformers/model_sharing.html).

## ♻️ Conclusion

That's it! In this notebook, we covered how to collect data for training the model, and specifically how _long_ that text needs to be. We then briefly covered configuring and running a training session. Please see [our paper](https://arxiv.org/abs/2006.03659) and [repo](https://github.com/JohnGiorgi/DeCLUTR) for more details, and don't hesitate to open an issue if you have any trouble!