## Training
Since CondBERT relies on pre-trained models, namely [BERT](https://huggingface.co/docs/transformers/model_doc/bert) (for masking), [FastText](https://pypi.org/project/fasttext/) (for word similarity) and [sentence transformers](https://pypi.org/project/sentence-transformers/) (for sentence similarity), no additional training is required.

However, you may want to build the vocabulary from a custom dataset. The dataset must be located in *[data/interim](../data/interim)* and follow the format described in the [previous notebook](1.0-initial-data-exploration.ipynb). The vocabulary can then be built in either of the following ways:

Python
```python
from src.data.build_vocab import build_vocab
build_vocab()
```

CLI
```shell
cd src/data
python make_dataset.py
```

In [1]:
import sys
sys.path.append("..")

### Vocab with custom Masked Language Model
It is also possible to build the vocabulary with a different masked language model, for example:

In [2]:
from src.data.build_vocab import build_vocab

# model_name can be any fill-mask model from https://huggingface.co/models?pipeline_tag=fill-mask
model_name = 'bert-large-uncased'
build_vocab(model_name)

100%|██████████| 135390/135390 [00:43<00:00, 3122.01it/s]
100%|██████████| 135390/135390 [00:42<00:00, 3192.81it/s]
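
Before moving on to inference, it can help to confirm that the vocabulary files are in place. The sketch below is a minimal check, not part of the project: the helper name `missing_vocab_files` is hypothetical, the two file names are the ones loaded in the inference cell of this notebook, and it is only an assumption that `build_vocab` writes them to `../data/interim/vocab`.

```python
from pathlib import Path

# File names are the ones read by the inference cell in this notebook.
# That build_vocab() writes them to this directory is an assumption.
EXPECTED_VOCAB_FILES = ("token_toxicities.txt", "word2coef.pkl")


def missing_vocab_files(vocab_dir, expected=EXPECTED_VOCAB_FILES):
    """Return the expected vocab file names that are absent from vocab_dir."""
    vocab_dir = Path(vocab_dir)
    return [name for name in expected if not (vocab_dir / name).is_file()]


# Hypothetical usage from the notebooks directory.
missing = missing_vocab_files("../data/interim/vocab")
if missing:
    print("Missing vocab files:", ", ".join(missing))
else:
    print("All vocab files found.")
```

If anything is reported missing, rebuilding the vocabulary is probably the first thing to try.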


### Inference with custom parameters
Inference can also be run with custom components (model, tokenizer, vocabulary files and FastText model) instead of the defaults:

In [5]:
import fasttext
import fasttext.util
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

from pathlib import Path
from transformers import BertTokenizer, BertForMaskedLM

from src.models.cond_BERT import CondBERT
from src.data.load_vocab import load_toxicities, load_word2coef

vocab_dirname = Path('../data/interim/vocab')
# instead of using the default loader:
#     condBERT = load_condBERT()
# build each component manually:
model = BertForMaskedLM.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
# array of token toxicities
tok_toxicities = load_toxicities(vocab_dirname / 'token_toxicities.txt')
# word-to-coefficient mapping
word2coef = load_word2coef(vocab_dirname / 'word2coef.pkl')
# a custom fastText model for word similarity can be loaded here
fasttext.util.download_model('en', if_exists='ignore')
ft = fasttext.load_model('cc.en.300.bin')
condBERT = CondBERT(model, tokenizer, device, tok_toxicities, word2coef, ft)

toxic_example = "I am not stupid!"
print(condBERT(toxic_example))

i am not crazy!
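
The cell above calls `condBERT` on a single string. To process several sentences, a thin wrapper such as the following may be enough. This is a sketch, not part of the project: `detoxify_all` and `fake_detoxifier` are hypothetical names, and the stand-in function exists only so the example runs without loading any model. In the notebook, the `condBERT` object built above would be passed instead.

```python
def detoxify_all(detoxifier, texts):
    """Apply a single-sentence detoxifier to each text, skipping blank entries.

    Returns a list of (original, rewritten) pairs in input order.
    """
    results = []
    for text in texts:
        text = text.strip()
        if not text:
            continue  # nothing to rewrite
        results.append((text, detoxifier(text)))
    return results


# Stand-in callable (hypothetical); replace with condBERT in the notebook.
def fake_detoxifier(text):
    return text.lower()


for original, rewritten in detoxify_all(fake_detoxifier, ["I am not stupid!", "  "]):
    print(f"{original} -> {rewritten}")
```

Keeping the wrapper independent of the model makes it easy to swap in a differently configured `CondBERT` instance and compare outputs on the same inputs.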


## Visualization
As for visualization, I believe the information provided in the **docs, reports, README and notebooks** is sufficient, so no additional visualization is required.

![Important graph](figures/main.png)