<a href="https://colab.research.google.com/github/Ardevop-sk/sk-bert-ner/blob/master/SK_NER_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training BERT for Named Entity Recognition in the Slovak language

In this experiment, we will train a state-of-the-art Natural Language Understanding model, [BERT](https://arxiv.org/abs/1810.04805), on a manually annotated corpus of court decisions from https://ru.justice.sk/ using Google Cloud infrastructure.

This guide covers all stages of the procedure, including:

1. Setting up the training environment
2. Getting the data
3. Preparing models
4. Training model on cloud GPU
5. Downloading model to GCS
6. Serving BERT model for NER
7. Testing the services

For persistent storage of the training data and the model, you will need a Google Cloud Storage (GCS) bucket.
Please follow the [Google Cloud TPU quickstart](https://cloud.google.com/tpu/docs/quickstart) to create a GCP account and a GCS bucket. New Google Cloud users have [$300 free credit](https://cloud.google.com/free/) to get started with any GCP product.
This document is based on: https://towardsdatascience.com/pre-training-bert-from-scratch-with-cloud-tpu-6e2f71028379

MIT License

Copyright (c) [2020] [Filip Bednárik]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Step 1: Setting up the training environment
In Colab, open Runtime -> Change runtime type and set the hardware accelerator to GPU.

We need to install BERT and download the Python scripts for training.
The Jupyter environment allows executing bash commands directly from the notebook by using an exclamation mark ‘!’. We will be exploiting this approach to make use of several other bash commands throughout the experiment.

In [0]:
!pip install bert-tensorflow
!git clone https://github.com/Ardevop-sk/BERT-NER.git

We need to authenticate with Google to use GCS later.

In [0]:
import os
import sys
import json
import nltk
import random
import logging
import tensorflow as tf

from glob import glob
from google.colab import auth, drive
from tensorflow.keras.utils import Progbar

auth.authenticate_user()
  
# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s :  %(message)s')
sh = logging.StreamHandler()
sh.setLevel(logging.INFO)
sh.setFormatter(formatter)
log.handlers = [sh]

In [0]:
if 'COLAB_TPU_ADDR' in os.environ:
  log.info("Using TPU runtime")
  USE_TPU = True
  TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']

  with tf.Session(TPU_ADDRESS) as session:
    log.info('TPU address is ' + TPU_ADDRESS)
    # Upload credentials to TPU.
    with open('/content/adc.json', 'r') as f:
      auth_info = json.load(f)
    tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
    
else:
  log.warning('Not connected to TPU runtime')
  USE_TPU = False

## Step 2: Getting the data


We either upload the data, or crawl and transform it, so that it ends up in the following format:

```
My O
name O
is O
Filip I-PERSON
. O

This O
is O
second O
sentence O
. O
```

Note that the code accepts columns separated by a space " " or a tab "\t", and, as in the example above, a blank line separates sentences. You don't need to remove extra columns if your corpus has more than two: the only requirement is that the first column is the input token and the last column is the NER label.
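
To make the format concrete, here is a minimal sketch of a reader for it. The function name `parse_ner_text` is hypothetical and is not part of the BERT-NER repository; it only illustrates the "first column is the token, last column is the label" rule and the blank-line sentence separator, and it assumes every non-empty line has at least two columns.

```python
def parse_ner_text(text):
    """Parse token-per-line NER text into a list of sentences.

    Each sentence is a list of (token, label) pairs. Columns may be
    separated by spaces or tabs; only the first and last are used.
    """
    sentences, current = [], []
    for line in text.splitlines():
        columns = line.split()  # splits on any run of spaces or tabs
        if not columns:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# The second sentence has a hypothetical extra middle column, which is ignored.
sample = "My O\nname O\nis O\nFilip I-PERSON\n. O\n\nThis\tDT\tO\nis\tVBZ\tO\n"
sentences = parse_ner_text(sample)
print(len(sentences))   # 2
print(sentences[0][3])  # ('Filip', 'I-PERSON')
```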


### 2.1a Uploading the data

For now, you will need to ask [me](https://www.linkedin.com/in/filipbednarik/) for the manually annotated dataset and handle the personal data it contains in accordance with the applicable legislation. Alternatively, use your own dataset in the format described above. I am working on a fully automated solution that uses a crawler and delegates the GDPR problem to you :)

In [0]:
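!ls -la
# `!cd` runs in a subshell and does not change the notebook's working
# directory, so it is omitted here. The uploaded files are saved to the
# current directory and moved to BERT-NER/data in step 2.2.
from google.colab import files
uploadedTrain = files.upload()  # select train.txt
uploadedDev = files.upload()    # select dev.txt
uploadedTest = files.upload()   # select test.txt
!ls -la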

### 2.1b Crawl and transform the data

TBD Crawler

### 2.2 Copy data to the right place

Finally, we move the files to the right place:

In [0]:
!mv -f train.txt BERT-NER/data/train.txt
!mv -f dev.txt BERT-NER/data/dev.txt
!mv -f test.txt BERT-NER/data/test.txt
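
Before training, it can help to check that the three files are in place and look plausible. The sketch below is only an illustration: `summarize_split` is a hypothetical helper, not something shipped with BERT-NER, and it assumes the token-per-line format described in Step 2 with UTF-8 encoded files.

```python
import os
from collections import Counter


def summarize_split(path):
    """Return (sentence_count, label_counts) for one token-per-line file."""
    sentence_count = 0
    label_counts = Counter()
    in_sentence = False
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            columns = line.split()
            if not columns:
                in_sentence = False  # a blank line closes the sentence
                continue
            if not in_sentence:
                sentence_count += 1
                in_sentence = True
            label_counts[columns[-1]] += 1  # the last column is the label
    return sentence_count, label_counts


data_dir = "BERT-NER/data"
for name in ("train.txt", "dev.txt", "test.txt"):
    path = os.path.join(data_dir, name)
    if not os.path.exists(path):
        print(f"{name}: missing")
        continue
    count, labels = summarize_split(path)
    print(f"{name}: {count} sentences, labels: {dict(labels)}")
```

A split that is reported as missing, or that contains only the `O` label, is worth investigating before spending GPU time.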

## Step 3: Preparing models

We either download the official multilingual BERT model trained on Wikipedia, or use our own pre-trained model (for example, one pre-trained on Slovak OpenSubtitles: https://towardsdatascience.com/pre-training-bert-from-scratch-with-cloud-tpu-6e2f71028379).

In [0]:
!wget https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip


In [0]:
!unzip multi_cased_L-12_H-768_A-12.zip -d BERT-NER/
!ls -la BERT-NER/

## Step 4: Training model on cloud GPU

Training may take more than half an hour, depending on your dataset and the GPU currently available.
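
For a rough sense of the amount of work involved, the number of optimizer steps can be estimated from the dataset size. The sketch below uses the formula from the reference BERT fine-tuning scripts (examples / batch size × epochs); that `BERT_NER.py` computes its step count the same way is an assumption. The batch size of 32 and 3 epochs are the values used in this step's training command, while 8000 training sentences is a made-up figure.

```python
def estimate_train_steps(num_examples, batch_size, num_epochs):
    """Estimate the number of optimizer steps for a fine-tuning run."""
    return int(num_examples / batch_size * num_epochs)


# 8000 is a hypothetical number of training sentences.
print(estimate_train_steps(8000, batch_size=32, num_epochs=3.0))  # 750
```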

In [0]:
!python BERT-NER/BERT_NER.py \
 --task_name="NER" \
 --do_lower_case=False \
 --crf=False \
 --do_train=True \
 --do_eval=True \
 --do_predict=True \
 --data_dir=BERT-NER/data \
 --middle_output=BERT-NER/middle_data \
 --vocab_file=BERT-NER/multi_cased_L-12_H-768_A-12/vocab.txt \
 --bert_config_file=BERT-NER/multi_cased_L-12_H-768_A-12/bert_config.json \
 --init_checkpoint=BERT-NER/multi_cased_L-12_H-768_A-12/bert_model.ckpt \
 --max_seq_length=128 \
 --train_batch_size=32 \
 --learning_rate=2e-5 \
 --num_train_epochs=3.0 \
 --output_dir=BERT-NER/output/result_dir \
 --labels="[PAD],O,Sud,Osoba,Adresa,Organizacia,ICO,Narodenie,X,[CLS],[SEP]"
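
The `--labels` flag has to list every label that occurs in the data files. As a minimal sketch, the hypothetical helper `unknown_labels` below (not part of BERT-NER) reports labels that appear in the data but are missing from the flag. How the training script itself reacts to such labels is not covered here.

```python
# Copied from the --labels flag of the training command.
LABELS_FLAG = "[PAD],O,Sud,Osoba,Adresa,Organizacia,ICO,Narodenie,X,[CLS],[SEP]"


def unknown_labels(labels_in_data, labels_flag=LABELS_FLAG):
    """Return the sorted labels that occur in the data but not in the flag."""
    known = set(labels_flag.split(","))
    return sorted(set(labels_in_data) - known)


print(unknown_labels(["O", "Osoba", "Sud"]))       # []
print(unknown_labels(["O", "I-PERSON", "Osoba"]))  # ['I-PERSON']
```

A non-empty result means either the data or the flag needs to be adjusted before training.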

Alternatively, you can train on a TPU. You will need to configure it with GCS, though, and replace the `--tpu_name` value below with the address of your own TPU.

In [0]:
!python BERT-NER/BERT_NER.py \
 --task_name="NER" \
 --do_lower_case=False \
 --crf=False \
 --do_train=True \
 --do_eval=True \
 --do_predict=True \
 --data_dir=BERT-NER/data \
 --middle_output=BERT-NER/middle_data \
 --vocab_file=BERT-NER/multi_cased_L-12_H-768_A-12/vocab.txt \
 --bert_config_file=BERT-NER/multi_cased_L-12_H-768_A-12/bert_config.json \
 --init_checkpoint=BERT-NER/multi_cased_L-12_H-768_A-12/bert_model.ckpt \
 --max_seq_length=128 \
 --train_batch_size=32 \
 --learning_rate=2e-5 \
 --num_train_epochs=3.0 \
 --output_dir=BERT-NER/output/result_dir \
 --use_tpu=True \
 --tpu_name=grpc://10.110.157.90:8470

Next, we list the output directory to check the trained model files:

In [0]:
!ls -la BERT-NER/output/result_dir
!ls -la BERT-NER/output/result_dir/eval

We will evaluate the results:

In [0]:
TBD eval results
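
As a stand-in for this step, here is a minimal sketch of token-level precision, recall and F1 per label, computed from two aligned lists of gold and predicted labels. It is not the evaluation used by `BERT_NER.py`. The function name and the choice to ignore `O`, `X` and the special tokens are assumptions, and loading predictions from the output directory is left open because the script's output format is not described here.

```python
from collections import Counter

# Labels that do not denote an entity (assumption based on the --labels flag).
IGNORED = {"O", "X", "[PAD]", "[CLS]", "[SEP]"}


def token_level_scores(gold, predicted):
    """Compute per-label (precision, recall, F1) over aligned label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    true_pos, pred_count, gold_count = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g not in IGNORED:
            gold_count[g] += 1
        if p not in IGNORED:
            pred_count[p] += 1
            if g == p:
                true_pos[p] += 1
    scores = {}
    for label in sorted(set(gold_count) | set(pred_count)):
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / gold_count[label] if gold_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy labels for illustration only, not real results.
gold = ["O", "Osoba", "Osoba", "O", "Sud"]
pred = ["O", "Osoba", "O", "Sud", "Sud"]
for label, (p, r, f1) in token_level_scores(gold, pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```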

## Step 5: Downloading model to GCS

In [0]:
BUCKET_NAME = "ardevop-sk-bert-sk" #@param {type:"string"}
MODEL_DIR = "model_ner" #@param {type:"string"}
SRC_DIR = "BERT-NER/output/result_dir" #@param {type:"string"}
tf.gfile.MkDir(MODEL_DIR)

if not BUCKET_NAME:
  log.warning("WARNING: BUCKET_NAME is not set. "
              "You will not be able to upload the model.")

In [0]:
if BUCKET_NAME:
  !gsutil cp -r $SRC_DIR gs://$BUCKET_NAME/$MODEL_DIR

## Step 6: Serving BERT model for NER

TBD: https://github.com/Ardevop-sk/bert-as-service

## Step 7: Testing the services

TBD: Swagger UI