## CLIMATEBERT: A Pretrained Language Model for Climate-Related Text
#### by Nicolas Webersinke, Mathias Kraus, Julia Anna Bingler, and Markus Leippold
#### Link to paper: [arxiv.org/abs/2110.12010](https://arxiv.org/abs/2110.12010)
#### Code Part 2: Language model training

Import libraries and empty GPU cache (if applicable)

In [None]:
from transformers import AutoTokenizer
from transformers import Trainer, TrainingArguments
from transformers import AutoModelForPreTraining
from transformers import DataCollatorForLanguageModeling

from datasets import load_dataset

import torch
# torch.cuda.empty_cache()  # Uncomment to release cached GPU memory before training

Load dataset via Hugging Face datasets

In [None]:
datasets = load_dataset("text", data_files={"train": 'corpus/train_corpus.txt',         # Path to txt file with training corpus (selected or not)
                                            "validation": 'corpus/val_corpus.txt'})     # Path to txt file with validation corpus

Print size of dataset

In [None]:
print(len(datasets['train']))
print(len(datasets['validation']))

Load the language model and the tokenizer saved by the augmentation step

In [2]:
card = "model/distilroberta-base-augmented"
tokenizer = AutoTokenizer.from_pretrained(card, use_fast=True)
model = AutoModelForPreTraining.from_pretrained(card)

Make sure the model's token embeddings are resized to match the tokenizer vocabulary

In [None]:
model.resize_token_embeddings(len(tokenizer))
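
As a quick sanity check after resizing, the input embedding matrix should have one row per tokenizer entry. The helper below is a minimal sketch: `embedding_rows_match` is a hypothetical name that does not appear in the original notebook, and it assumes the default resize behaviour (the vocabulary size is not padded up to a multiple).

```python
def embedding_rows_match(model, tokenizer):
    """Return True if the input embedding matrix has one row per tokenizer entry."""
    # get_input_embeddings() returns the token embedding layer of a Hugging Face model
    n_rows = model.get_input_embeddings().weight.shape[0]
    return n_rows == len(tokenizer)
```

In the notebook, `assert embedding_rows_match(model, tokenizer)` could follow the resize call.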

Define tokenize function

In [16]:
def tokenize_function(samples):
    return tokenizer(samples["text"], truncation=True)

Perform tokenization

In [17]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=16, remove_columns=["text"]) # Adjust num_proc depending on machine
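
The tokenizer call above truncates but does not pad, so the tokenized sequences have different lengths. Padding is typically handled later, per batch, by the data collator, which pads to the longest sequence in the batch. The pure-Python sketch below only illustrates that idea; `pad_batch` and the toy token ids (including `pad_id=1`) are assumptions for illustration, not code from the original notebook.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence in `batch` to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks real tokens, 0 marks padding positions
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

# Toy token ids; pad_id=1 is a placeholder value for illustration
ids, mask = pad_batch([[0, 11, 12, 2], [0, 13, 2]], pad_id=1)
print(ids)   # [[0, 11, 12, 2], [0, 13, 2, 1]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```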

Init data collator for masked language modeling

In [18]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
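
To make the effect of `mlm=True` and `mlm_probability=0.15` more concrete, here is a pure-Python sketch of BERT-style dynamic masking on a single sequence: selected positions are replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise, while all other positions get the label `-100` so the loss ignores them. `mask_tokens` and its parameters are hypothetical names for illustration; the library's collator works on padded tensors and may differ in details.

```python
import random

def mask_tokens(input_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=random):
    """Pure-Python sketch of BERT-style dynamic masking for one sequence."""
    masked = list(input_ids)
    labels = [-100] * len(input_ids)       # -100 is ignored by the cross-entropy loss
    for i, tok in enumerate(input_ids):
        if tok in special_ids or rng.random() >= mlm_probability:
            continue                        # position not selected for prediction
        labels[i] = tok                     # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id             # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

# Toy ids; mask_id, vocab_size and special_ids are placeholder values
masked, labels = mask_tokens([0, 11, 12, 13, 14, 2], mask_id=4,
                             vocab_size=100, special_ids={0, 2},
                             rng=random.Random(0))
```

Because masking is applied when a batch is built, the same sentence can receive different masks in different epochs.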

Define training args

In [None]:
# For further arguments, see the Hugging Face TrainingArguments documentation

training_args = TrainingArguments(
    output_dir="model/xyz",
    overwrite_output_dir=False,     # Attention! Set to True only if the contents of output_dir may be overwritten
    per_device_train_batch_size=24, # Adjust depending on machine
    per_device_eval_batch_size=24,  # Adjust depending on machine
    evaluation_strategy="epoch",    # Named eval_strategy in recent transformers releases
    save_strategy="epoch",
    fp16=True,                      # Adjust depending on machine
    dataloader_num_workers=8,       # Adjust depending on machine
    load_best_model_at_end=True,
    gradient_accumulation_steps=42, # Adjust depending on machine
    num_train_epochs=12,            # After 12 epochs, the models did not improve any further on our corpus
    learning_rate=0.0005,
    weight_decay=0.01,
)
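
When adjusting `per_device_train_batch_size` and `gradient_accumulation_steps` for a different machine, it can help to keep their product in view, since it determines how many sequences contribute to each optimizer update. The sketch below does that arithmetic; `effective_batch_size` is a hypothetical helper, and the device count is an assumption because the notebook does not state how many GPUs were used.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, n_devices=1):
    """Number of sequences contributing to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * n_devices

print(effective_batch_size(24, 42))               # 1008 on a single device
print(effective_batch_size(24, 42, n_devices=2))  # 2016 on two devices
```

Halving the per-device batch size to fit a smaller GPU and doubling the accumulation steps leaves this number unchanged.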

Init trainer

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer
)

Start training, then optionally evaluate and save the model

In [None]:
trainer.train()

In [None]:
trainer.evaluate()
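
`trainer.evaluate()` returns a dictionary of metrics that includes `eval_loss`, the mean cross-entropy over the masked positions. A common way to read this number is to exponentiate it into a perplexity-style score. The sketch below assumes that interpretation; the loss value is a placeholder, not a result from the paper.

```python
import math

def perplexity_from_loss(eval_loss):
    """Convert a mean cross-entropy loss (in nats) into a perplexity-style score."""
    return math.exp(eval_loss)

metrics = {"eval_loss": 2.0}  # placeholder value, not a result from the paper
print(round(perplexity_from_loss(metrics["eval_loss"]), 2))  # 7.39
```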

In [None]:
trainer.save_model("model/xyz")