<a href="https://colab.research.google.com/github/DatNguyen2084/DLDH-Metaphor-detection/blob/main/DLDH_BERT_Continual_pretraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Seminar: DHDL - Novel metaphor detection approaches
Continue pre-training BERT on a specific corpus with the masked language modelling (MLM) and next sentence prediction (NSP) tasks.

## Setup: install packages

In [None]:
!pip install -q sentence_transformers
!pip install -q datasets

In [None]:
import gc
from pathlib import Path

import nltk
import torch
from transformers import (
    AutoTokenizer,
    BertForPreTraining,
    TextDatasetForNextSentencePrediction,
    Trainer,
    TrainingArguments,
)

# Sentence tokenizer data used by nltk.sent_tokenize and nltk.word_tokenize
nltk.download('punkt')

MODEL_NAME = "redewiedergabe/bert-base-historical-german-rw-cased"
# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Load BERT with both pre-training heads (MLM and NSP)
model = BertForPreTraining.from_pretrained(MODEL_NAME)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Some weights of BertForPreTraining were not initialized from the model checkpoint at redewiedergabe/bert-base-historical-german-rw-cased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
## Do we have a GPU?
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 4792729211916454309
 xla_global_id: -1, name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 10843127808
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 3255694642148650877
 physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7"
 xla_global_id: 416903419]
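
Training below runs through PyTorch, so the same check can also be made without TensorFlow. A minimal sketch, assuming `torch` is importable as in the import cell above; the helper name `pick_device` is hypothetical, and it falls back to `"cpu"` when PyTorch is missing:

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch  # already imported earlier in this notebook
    except ImportError:
        # Assumption for this sketch: without PyTorch we report CPU only
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Using device:", device)
```

The `Trainer` used later places the model on the GPU by itself when one is available, so this value is only informational here.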

## Utils

### Clear gpu cache

In [None]:
gc.collect()
torch.cuda.empty_cache()

### Limit each sentence to max_words = 64
We truncate every sentence to at most 64 words.

In [None]:
def cut_off_string(s, max_words=64):
    """
    If a sentence has more than max_words words, cut it off
    :param s: the sentence, which might be too long
    :param max_words: maximum number of words to keep
    :return: the sentence after cutting off
    """
    words = nltk.word_tokenize(s)
    if len(words) > max_words:
        s = " ".join(words[:max_words])
    return s
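
To see the effect of the cut-off without downloading the `punkt` data, the same idea can be sketched with plain whitespace splitting. This is an assumption for illustration only: `truncate_words` is a hypothetical helper, and unlike `nltk.word_tokenize` it leaves punctuation attached to the neighbouring word.

```python
def truncate_words(sentence: str, max_words: int = 64) -> str:
    """Keep at most max_words whitespace-separated words of a sentence."""
    words = sentence.split()
    if len(words) > max_words:
        return " ".join(words[:max_words])
    return sentence


long_sentence = " ".join(f"w{i}" for i in range(100))  # 100 dummy words
short = truncate_words(long_sentence)
print(len(short.split()))
```

Sentences at or below the limit are returned unchanged, so the helper only touches the over-long ones.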

## Preprocessing file
Bring each file into the standard format required by BERT:

*   remove all newline characters
*   split the text into sentences and normalise whitespace



In [None]:
def preprocessing_file_to_sentences(path: str):
    """
    Read a file, remove all newline characters, and split the text into sentences.
    Sentences are returned at full length; apply cut_off_string to truncate them.
    :param path: path of a file
    :return: list of whitespace-normalised sentences
    """
    # Assumption: the corpus files are UTF-8 encoded
    with open(path, 'r', encoding='utf-8') as file:
        text = file.read()
    text = text.replace("\n\n", " ")
    text = text.replace("\n", " ")
    sents = nltk.sent_tokenize(text)
    # Collapse runs of whitespace inside each sentence (this also strips both ends)
    return [" ".join(sent.split()) for sent in sents]

## Delete last line

In [None]:
import os
def delete_last_line(path):
    """
    Remove the last line of a file in place.
    :param path: path of the file to truncate
    """
    # Open in binary mode: seeking to arbitrary byte offsets is only well defined
    # there, and the corpus contains multi-byte UTF-8 characters
    with open(path, "rb+") as file:

        # Move the pointer (similar to a cursor in a text editor) to the end of the file
        file.seek(0, os.SEEK_END)

        # Skip the very last byte of the file, so that if the last line is
        # empty we delete both it and the newline of the penultimate line
        pos = file.tell() - 1

        # Read one byte at a time from the penultimate byte going backwards,
        # and stop as soon as a newline is found
        while pos > 0 and file.read(1) != b"\n":
            pos -= 1
            file.seek(pos, os.SEEK_SET)

        # So long as we're not at the start of the file, delete everything
        # from this position onwards
        if pos > 0:
            file.seek(pos, os.SEEK_SET)
            file.truncate()
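
The effect of the truncation can be checked on a small temporary file. The sketch below restates the backwards search in binary mode, where byte offsets passed to `seek` are well defined; `strip_last_line` and the sample content are hypothetical and only serve the demonstration.

```python
import os
import tempfile


def strip_last_line(path: str) -> None:
    """Truncate the file at the last newline that precedes its final byte."""
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell() - 1
        # Walk backwards from the end until a newline byte is read
        while pos > 0 and f.read(1) != b"\n":
            pos -= 1
            f.seek(pos, os.SEEK_SET)
        if pos > 0:
            f.seek(pos, os.SEEK_SET)
            f.truncate()


# Two one-sentence "documents", followed by a trailing blank line
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8", newline="\n"
) as tmp:
    tmp.write("Erster Satz.\n\nZweiter Satz.\n\n")
    tmp_path = tmp.name

strip_last_line(tmp_path)
with open(tmp_path, encoding="utf-8", newline="") as f:
    result = f.read()
os.remove(tmp_path)
print(repr(result))
```

The file ends directly after the last sentence afterwards: both the trailing blank line and the newline before it are gone, while the blank line between the two documents is untouched.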

## Load Datasets

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Paths to OCR datasets.
# Make sure to execute the "Mount Google Drive" cell above to make this work.
ROOT_PATH = '/content/drive/My Drive'
OCR_PATH_DATA = '/content/drive/My Drive/OCR-Korrekturen'
OCR_SUB_PATH_DATA_1 = '/1. Natur und Staat raw Scans'
OCR_SUB_PATH_DATA_2 = '/2. Durchgeführte Korrekturen'
OCR_SUB_PATH_DATA_3 = '/3. Automatisch erstellte Endversion'
INPUT_FOLDER = '/input_file/'

## Continue pre-training BERT


*   Loop through all files
*   Bring each file into the standard form required by BERT
*   Continue training BERT with the NSP and MLM tasks
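
The input layout that these steps produce (one sentence per line, documents separated by a blank line) can be sketched with the standard library alone. The helper names `write_corpus` and `read_corpus` and the sample sentences are hypothetical; the notebook itself builds the file from the OCR texts.

```python
import os
import tempfile


def write_corpus(documents, path):
    """Write one sentence per line and a blank line between documents."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n\n".join("\n".join(sents) for sents in documents))


def read_corpus(path):
    """Parse the file back into a list of documents (lists of sentences)."""
    with open(path, encoding="utf-8") as f:
        blocks = f.read().split("\n\n")
    return [block.split("\n") for block in blocks if block.strip()]


# Made-up sample sentences, two documents of two sentences each
docs = [
    ["Die Natur kennt keinen Stillstand.", "Alles ist in Bewegung."],
    ["Der Staat ist ein Organismus.", "Er wächst und vergeht."],
]
path = os.path.join(tempfile.mkdtemp(), "one.txt")
write_corpus(docs, path)
restored = read_corpus(path)
print(restored == docs)
```

Joining with `"\n\n"` leaves no trailing blank line, which is the state the notebook reaches by calling `delete_last_line` after writing the file.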



In [None]:
# Concatenate all documents into one file.
# Each line is a sentence.
# Documents are separated by a blank line.
input_file = ROOT_PATH + INPUT_FOLDER + "one.txt"
with open(input_file, 'w', encoding='utf-8') as f:
    for p in Path(OCR_PATH_DATA + OCR_SUB_PATH_DATA_3).glob('*'):
        print("processing_file: " + p.name)
        sentences = preprocessing_file_to_sentences(p)
        # Skip fragments shorter than 20 characters
        f.writelines([sentence + "\n" for sentence in sentences if len(sentence) >= 20])
        f.write("\n")

# Remove the blank line that follows the last document
delete_last_line(input_file)

# Prepare input for the NSP task using the preprocessed file
dataset = TextDatasetForNextSentencePrediction(
    file_path=input_file,
    tokenizer=tokenizer,
    block_size=128,
)

# Data collator for the MLM task: selects 15% of the tokens for prediction
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)


training_args = TrainingArguments(
    output_dir=ROOT_PATH + "/output/arguments",
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=8,
    save_steps=20000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
# Save trained model and tokenizer
trainer.save_model(ROOT_PATH + "/output/model")
tokenizer.save_pretrained(ROOT_PATH + "/output/model")

processing_file: R_1914_13_LP_Band_292_6601-6640_corrected.txt
processing_file: R_1873_1_LP_Band_2_719-759_corrected.txt
processing_file: nus8_2_Methner_bereinigt.txt
processing_file: nus7_2_Schalk_bereinigt.txt
processing_file: nus6_2_Eleutheropulos_bereinigt.txt
processing_file: nus9_2_Haecker_bereinigt.txt
processing_file: nus5_2_Michaelis_bereinigt
processing_file: nus2_2_ruppin_bereinigt.txt
processing_file: nus4_2_Hesse_bereinigt
processing_file: nus10_2_Ziegler_bereinigt.txt
processing_file: nus3_2_schallmeyer_bereinigt
processing_file: nus1_2_matzat_bereinigt.txt


***** Running training *****
  Num examples = 15285
  Num Epochs = 10
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 19110
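
The step count in this log follows from the example count, batch size, and epoch count. A short check, assuming a single device and no gradient accumulation, as the log reports:

```python
import math

num_examples = 15285  # "Num examples" from the log
batch_size = 8        # per_device_train_batch_size
num_epochs = 10       # num_train_epochs

# The last batch of each epoch is only partially filled, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 1911 19110
```

Since `save_steps=20000` is larger than the 19110 total steps, this run would not be expected to write an intermediate checkpoint; the weights are stored by the explicit `trainer.save_model` call.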


| Step | Training Loss |
|------|---------------|
| 500  | 2.6583 |
| 1000 | 2.5453 |
| 1500 | 2.4686 |
| 2000 | 2.3892 |
| 2500 | 2.2185 |
| 3000 | 2.2131 |
| 3500 | 2.1764 |
| 4000 | 2.1209 |
| 4500 | 2.0361 |
| 5000 | 2.0118 |




Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to /content/drive/My Drive/output/model
Configuration saved in /content/drive/My Drive/output/model/config.json
Model weights saved in /content/drive/My Drive/output/model/pytorch_model.bin
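
For readers unfamiliar with the MLM objective configured above (`mlm_probability=0.15`), the masking scheme from the original BERT recipe can be sketched in plain Python: about 15% of the tokens are selected; of those, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged, and only selected positions carry a label. This is a simplified illustration, not the `DataCollatorForLanguageModeling` implementation (for instance, it does not exclude special tokens); `mask_tokens`, the toy token ids, the `mask_id`, and the seed are hypothetical.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(token_ids, vocab_size, mask_id, mlm_probability=0.15, rng=None):
    """Return (inputs, labels) after BERT-style 80/10/10 masking."""
    rng = rng or random.Random()
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            labels.append(IGNORE_INDEX)  # not selected: no loss at this position
            continue
        labels.append(tok)  # selected: the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


tokens = list(range(5, 1005))  # toy token ids, hypothetical
inputs, labels = mask_tokens(tokens, vocab_size=30000, mask_id=4, rng=random.Random(0))
selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{selected} of {len(tokens)} tokens selected for prediction")
```

Positions that were not selected keep their original token and get the ignore label, so they contribute nothing to the MLM loss.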
