# PART IIa (Training BERT)



# 1. Introduction


This notebook contains the training procedure for BERT for the Kaggle challenge "Coleridge Initiative: Show US the Data" (https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/). As a recap, this challenge is about recognizing the public datasets used in scientific papers. In particular, we want to extract the datasets mentioned in each scientific paper, using several NLP approaches. In this notebook, we test both BERT and SciBERT. BERT was introduced by Devlin, J., Chang, M. W., Lee, K., and Toutanova, K. in 2018 [1]. The source code of BERT can be found [here](https://github.com/google-research/bert). SciBERT was introduced by Beltagy, I., Lo, K., and Cohan, A. in 2019 [2]. The source code of SciBERT can be found [here](https://github.com/allenai/scibert).

Furthermore, we extend the existing data with a specialized corpus for dataset tagging. TDMSci is a corpus of scientific text annotated for tasks, metrics and datasets [3]. In this corpus, B-DATASET and I-DATASET are the NER labels indicating that a word is the beginning of, or inside, a dataset mention. The source code (and annotated data) of TDMSci can be found [here](https://github.com/IBM/science-result-extractor).
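
To make the B-DATASET / I-DATASET labelling concrete, here is a minimal sketch of BIO tagging by exact phrase matching. The function name `bio_tag`, the example sentence, and the matching strategy are illustrative assumptions; they are not taken from TDMSci or from our dataset-creation notebook.

```python
def bio_tag(tokens, dataset_mentions):
    """Label each token as O, B-DATASET, or I-DATASET (hypothetical helper)."""
    tags = ["O"] * len(tokens)
    for mention in dataset_mentions:
        mention_tokens = mention.split()
        if not mention_tokens:
            continue
        n = len(mention_tokens)
        for start in range(len(tokens) - n + 1):
            span_is_free = all(t == "O" for t in tags[start:start + n])
            if tokens[start:start + n] == mention_tokens and span_is_free:
                # First word of the mention gets B-, the remaining words get I-.
                tags[start] = "B-DATASET"
                for i in range(start + 1, start + n):
                    tags[i] = "I-DATASET"
    return tags


# Made-up example sentence, for illustration only.
tokens = "We evaluate on the Penn Treebank corpus".split()
tags = bio_tag(tokens, ["Penn Treebank"])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Words outside any dataset mention keep the label O, so a model trained on such data only has to separate three classes.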


We have created three sets of notebooks: one for **dataset creation** ([Part I](https://github.com/Josien94/MLiP/blob/main/Challenge2%20-%20Coleridge%20Initiative%20-%20Show%20US%20the%20Data/Part%20I_Creating_Dataset.ipynb)), one for **training** (Part IIa, this notebook, and [Part IIb](https://github.com/Josien94/MLiP/blob/main/Challenge2%20-%20Coleridge%20Initiative%20-%20Show%20US%20the%20Data/Part_IIb_SciBERT_Training.ipynb)) and one for **testing** ([Part III](https://github.com/Josien94/MLiP/blob/main/Challenge2%20-%20Coleridge%20Initiative%20-%20Show%20US%20the%20Data/Part%20III_With_LiteralMatching_SciBERT.ipynb)). This notebook trains a BERT model on the dataset we created in Part I.


[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.  
[2] Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.  
[3] Hou, Y., Jochim, C., Gleize, M., Bonin, F., & Ganguly, D. (2021). TDMSci: A specialized corpus for scientific literature entity tagging of tasks, datasets and metrics. arXiv preprint arXiv:2101.10273.

# 2. Preparing Notebook

In [1]:
!pip install datasets --no-index --find-links=file:///kaggle/input/coleridge-packages/packages/datasets 
!pip install ../input/coleridge-packages/seqeval-1.2.2-py3-none-any.whl 
!pip install ../input/coleridge-packages/tokenizers-0.10.1-cp37-cp37m-manylinux1_x86_64.whl 
!pip install ../input/coleridge-packages/transformers-4.5.0.dev0-py3-none-any.whl 
!pip install datasets 

Looking in links: file:///kaggle/input/coleridge-packages/packages/datasets
Processing /kaggle/input/coleridge-packages/packages/datasets/datasets-1.5.0-py3-none-any.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/tqdm-4.49.0-py2.py3-none-any.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/xxhash-2.0.0-cp37-cp37m-manylinux2010_x86_64.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/huggingface_hub-0.0.7-py3-none-any.whl
Installing collected packages: tqdm, xxhash, huggingface-hub, datasets
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.56.2
    Uninstalling tqdm-4.56.2:
      Successfully uninstalled tqdm-4.56.2
Successfully installed datasets-1.5.0 huggingface-hub-0.0.7 tqdm-4.49.0 xxhash-2.0.0
Processing /kaggle/input/coleridge-packages/seqeval-1.2.2-py3-none-any.whl
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2
Processing /kaggle/input/coleridge-packages/tokenizers-

In [2]:
# Import necessary libraries
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
import re
import os
from os import listdir
from os.path import isfile, join
import json
import time
import datetime
import random
import glob
import importlib
import allennlp
import numpy as np
from transformers import *
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import torch


  '"sox" backend is being deprecated. '


In [3]:
model_name = 'bert-base-cased'  # alternative: 'allenai/scibert_scivocab_cased'
#Initialize paths for data
path_abs = '/kaggle/input/coleridgeinitiative-show-us-the-data/'
path_train = os.path.join(path_abs,'train/')
path_train_metadata = os.path.join(path_abs, 'train.csv')
path_test = os.path.join(path_abs, 'test/')
path_sample_submission = os.path.join(path_abs, 'sample_submission.csv')

path_abs_tdmsci = '/kaggle/input/tdmsci/'
path_test_tdmsci = os.path.join(path_abs_tdmsci, 'test_500_v2.txt')
path_train_tdmsci = os.path.join(path_abs_tdmsci,'train_1500_v2.txt')
path_train_nerjson = '../input/fork-of-mlip-group25-scibert-dataset/train_ner.json'


# 3. Train the BERT model
We first apply NER in combination with BERT. For this purpose, we load the pretrained BERT weights through [AutoModelForTokenClassification](https://huggingface.co/transformers/model_doc/auto.html#automodelfortokenclassification). The source code of BERT can be found [here](https://github.com/google-research/bert).
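
A token-classification model predicts one label per subword, while NER labels are given per word, so the two have to be aligned before training. The sketch below illustrates the common convention of labelling only the first subword of each word and masking the rest with `-100`, the default `ignore_index` of `torch.nn.CrossEntropyLoss`. It assumes, without confirmation from the source, that `kaggle_run_ner.py` follows this convention; `toy_wordpiece` is a hypothetical stand-in for a real tokenizer and `align_labels` is an illustrative name.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def toy_wordpiece(word, max_len=4):
    """Hypothetical tokenizer: cut a word into chunks of at most max_len characters."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(words, word_labels, label2id):
    """Give the first subword of each word its label id; mask the other subwords."""
    subwords, label_ids = [], []
    for word, label in zip(words, word_labels):
        pieces = toy_wordpiece(word)
        subwords.extend(pieces)
        label_ids.append(label2id[label])
        label_ids.extend([IGNORE_INDEX] * (len(pieces) - 1))
    return subwords, label_ids


label2id = {"O": 0, "B-DATASET": 1, "I-DATASET": 2}
words = ["the", "Treebank", "corpus"]          # made-up example
word_labels = ["O", "B-DATASET", "O"]
subwords, label_ids = align_labels(words, word_labels, label2id)
print(list(zip(subwords, label_ids)))
```

Positions marked `-100` do not contribute to the loss, so each word is scored exactly once regardless of how many subwords it is split into.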

In [7]:
# copy my_seqeval.py to the working directory because the input directory is non-writable
!cp /kaggle/input/coleridge-packages/my_seqeval.py ./


In [9]:
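def train_scibert_ner(batch_size):
    """Fine-tune the model selected by `model_name` for NER (despite the function name, this is BERT here)."""
    # Pass the settings to the shell command below through environment variables.
    os.environ["MODEL_NAME"] = f"{model_name}"
    os.environ["TRAIN_FILE"] = f"{path_train_nerjson}"
    # Note: the validation file is the same file as the training file.
    os.environ["VALIDATION_FILE"] = f"{path_train_nerjson}"
    os.environ["BATCH_SIZE"] = f"{batch_size}"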
    
    acc = 0
    with open(path_train_nerjson) as f:
        print("open ")
        for row in f:
            acc += 1
    
    print("There are {} training samples!".format(acc))
    
    !python ../input/kaggle-ner-utils/kaggle_run_ner.py \
    --model_name_or_path "$MODEL_NAME" \
    --train_file "$TRAIN_FILE" \
    --validation_file "$VALIDATION_FILE" \
    --num_train_epochs 4 \
    --per_device_train_batch_size "$BATCH_SIZE" \
    --per_device_eval_batch_size "$BATCH_SIZE" \
    --save_steps 15000 \
    --pad_to_max_length \
    --output_dir './output' \
    --report_to 'none' \
    --seed 123 \
    --do_train
!rm -r "./output"

rm: cannot remove './output': No such file or directory
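
The row count in `train_scibert_ner` treats every line of the training file as one sample, which suggests a JSON Lines layout. The sketch below shows how such a file could be inspected before training. It assumes each line is a JSON object with `tokens` and `tags` keys; these field names, the helper `summarize_ner_jsonl`, and the demo records are assumptions for illustration, not confirmed by the source.

```python
import json
import os
import tempfile
from collections import Counter


def summarize_ner_jsonl(path):
    """Count samples and tag frequencies in a JSON Lines NER file (assumed layout)."""
    n_samples = 0
    tag_counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            n_samples += 1
            tag_counts.update(record["tags"])
    return n_samples, tag_counts


# Made-up records written to a temporary file, for illustration only.
records = [
    {"tokens": ["We", "use", "ADNI", "data"], "tags": ["O", "O", "B-DATASET", "O"]},
    {"tokens": ["No", "dataset", "here"], "tags": ["O", "O", "O"]},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_ner.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    n_samples, tag_counts = summarize_ner_jsonl(demo_path)

print(n_samples, dict(tag_counts))
```

Checking the tag distribution this way gives a quick view of how rare the dataset labels are relative to O before a multi-hour training run starts.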


In [10]:
#Train BERT model
batch_size = 8
train_scibert_ner(batch_size)

open 
There are 95394 training samples!
2021-06-06 17:59:34.445681: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-48014b4fcd0722e5/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02...
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-48014b4fcd0722e5/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02. Subsequent calls will reuse this data.
[INFO|file_utils.py:1402] 2021-06-06 17:59:43,391 >> https://huggingface.co/bert-base-cased/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpiqnir5ms
Downloading: 100%|██████████████████████████████| 570/570 [00:00<00:00, 49