<a href="https://colab.research.google.com/github/Ardevop-sk/sk-bert-ner/blob/master/SK_NER_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training BERT for Named Entity Recognition in the Slovak language

In this experiment, we train a state-of-the-art Natural Language Understanding model, [BERT](https://arxiv.org/abs/1810.04805), on a manually annotated corpus of court decisions from https://ru.justice.sk/ using Google Cloud infrastructure.

This guide covers all stages of the procedure, including:

1. Setting up the training environment
2. Getting the data
3. Preparing models
4. Training model on cloud GPU
5. Evaluating the results
6. Uploading model to GCS (optional)
7. Serving BERT model for NER as a Service
8. Testing the services

For persistent storage of the training data and the model, you will need a Google Cloud Storage (GCS) bucket.
Please follow the [Google Cloud TPU quickstart](https://cloud.google.com/tpu/docs/quickstart) to create a GCP account and a GCS bucket. New Google Cloud users have [$300 free credit](https://cloud.google.com/free/) to get started with any GCP product.

This guide is based on the article at https://towardsdatascience.com/pre-training-bert-from-scratch-with-cloud-tpu-6e2f71028379

MIT License

Copyright (c) [2020] [Filip Bednárik]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Step 1: Setting up the training environment
In Colab, open Runtime -> Change runtime type and set the hardware accelerator to GPU or TPU.

We need to install BERT and download the Python scripts for training.
The Jupyter environment allows executing bash commands directly from the notebook by prefixing them with an exclamation mark `!`. We will use this approach for several other bash commands throughout the experiment.

In [0]:
!pip install bert-tensorflow
!git clone https://github.com/Ardevop-sk/BERT-NER.git

In [0]:
import os
import sys
import json
import nltk
import random
import logging
import tensorflow as tf

from glob import glob
from google.colab import auth, drive
from tensorflow.keras.utils import Progbar
  
# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s :  %(message)s')
sh = logging.StreamHandler()
sh.setLevel(logging.INFO)
sh.setFormatter(formatter)
log.handlers = [sh]

We need to authenticate with Google to use GCS later.

In [0]:
auth.authenticate_user()

Optionally, initialize the TPU. This cell uses `tf.Session` and `tf.contrib`, so it requires a TensorFlow 1.x runtime.

In [0]:
if 'COLAB_TPU_ADDR' in os.environ:
  log.info("Using TPU runtime")
  USE_TPU = True
  TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']

  with tf.Session(TPU_ADDRESS) as session:
    log.info('TPU address is ' + TPU_ADDRESS)
    # Upload credentials to TPU.
    with open('/content/adc.json', 'r') as f:
      auth_info = json.load(f)
    tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
    
else:
  log.warning('Not connected to TPU runtime')
  USE_TPU = False

## Step 2: Getting the data


We either upload the data, or crawl and transform it, into the following format, with one token and its label per line and an empty line between sentences:

```
My O
name O
is O
Filip I-PERSON
. O

This O
is O
second O
sentence O
. O
```

The code accepts columns separated by a space " " or a tab "\t". You don't need to filter columns if your corpus has more of them: the only requirement is that the first column is the input token and the last column is the NER label.
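
As a rough illustration of this format, the sketch below parses such data into sentences of (token, label) pairs. The function name `read_conll` is hypothetical and is not part of the BERT-NER repository; it only mirrors the rule described above (first column is the token, last column is the label, an empty line ends a sentence).

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]

def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse token-per-line NER data into sentences of (token, label) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        columns = line.split()  # splits on spaces and tabs alike
        if not columns:
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # First column is the token, last column is the label;
        # any columns in between are ignored.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences

sample = "My O\nname O\nis O\nFilip I-PERSON\n. O\n\nThis O\nis O\nsecond O\nsentence O\n. O\n"
parsed = read_conll(sample.splitlines())
print(len(parsed))   # 2
print(parsed[0][3])  # ('Filip', 'I-PERSON')
```

To check a real file, the same function can be called on an open file handle, for example `read_conll(open("train.txt", encoding="utf8"))`.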


### 2.1a Uploading the data

In [0]:
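!ls -la
from google.colab import files
# `!cd` would not change the notebook's working directory (it runs in a subshell),
# so the files are uploaded to the current directory and moved in step 2.2.
# Upload train.txt, dev.txt and test.txt, one upload dialog each.
uploadedTrain = files.upload()
uploadedDev = files.upload()
uploadedTest = files.upload()
!ls -la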

### 2.1b Crawl and transform the data

In [31]:
!pip install beautifulsoup4
!pip install zeep
!mkdir data



In [32]:
import requests
from bs4 import BeautifulSoup
from zeep import Client, Settings
import math

settings = Settings(strict=False, xml_huge_tree=True)

ruWsClient = Client('https://ru-ws.justice.sk/ru-verejnost-ws/konanieService.wsdl', settings=settings)

pageCountResult = ruWsClient.service.getKonaniePreObdobie(DatumOd='1990-01-01',DatumDo='2020-01-01',VysledkovNaStranku=100,Stranka=0)
print(pageCountResult.VysledkovCelkom)
for i in range(math.ceil(pageCountResult.VysledkovCelkom/100)):
  print("page %s" % i)
  pageResult = ruWsClient.service.getKonaniePreObdobie(DatumOd='1990-01-01',DatumDo='2020-01-01',VysledkovNaStranku=100,Stranka=i)
  for konanieInfo in pageResult.KonanieInfoList.KonanieInfo:
    oznamyUrl = "https://ru.justice.sk/ru-verejnost-web/pages/obchodnyVestnik/oznamyList.xhtml?konanieId=" + str(konanieInfo.Id)
    print(oznamyUrl)
    oznamyData = requests.get(oznamyUrl)
    oznamyDOM = BeautifulSoup(oznamyData.content, 'html.parser')
    oznamyTable = oznamyDOM.body.find(id='vestnikForm:oVTable_data')
    oznamyRows = oznamyTable.find_all('tr')
    for oznamyRow in oznamyRows:
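      # Skip rows that do not have the expected "published by" and "detail" cells.
      cells = oznamyRow.find_all('td')
      if len(cells) < 3:
        continue
      zverejnil = cells[1]
      detail = cells[2]
      # Keep only notices published by a court ("súd").
      if "súd" in zverejnil.text: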
        uznesenieUrl = detail.a['href']
        print(uznesenieUrl)
        uznesenieData = requests.get('https://ru.justice.sk'+uznesenieUrl)
        uznesenieDOM = BeautifulSoup(uznesenieData.content, 'html.parser')
        uznesenieDataRows = uznesenieDOM.select('div.blok div.row')
        ics = None
        datumVydania = None
        textRozhodnutia = ""
        druh = None
        for uznesenieDataRow in uznesenieDataRows:
          uznesenieKeyValue = uznesenieDataRow.find_all('div')
          label = uznesenieKeyValue[0]
          value = uznesenieKeyValue[1]
          if('Druh:' in label.text):
            druh = value.text.strip()
          if('ICS:' in label.text):
            ics = value.text.strip()
          elif('Dátum vydania:' in label.text):
            datumVydania = value.text.strip()
          elif('Hlavička:' in label.text or 'Rozhodnutie:' in label.text or 'Poučenie:' in label.text):
            textRozhodnutia += value.text.strip() + " "

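        # Save only decisions ("Uznesenie") for which all required fields were found.
        if ics is not None and datumVydania is not None and textRozhodnutia != "" and druh is not None and "Uznesenie" in druh:
          with open("data/" + ics + ".txt", "w", encoding="utf8") as text_file:
            text_file.write(textRozhodnutia)
  # Stop after the first page of results; remove this `break` to crawl all pages.
  break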

43953
page 0
https://ru.justice.sk/ru-verejnost-web/pages/obchodnyVestnik/oznamyList.xhtml?konanieId=3
/ru-verejnost-web/pages/obchodnyVestnik/detailOV.xhtml?vestnikPodanieId=162252
https://ru.justice.sk/ru-verejnost-web/pages/obchodnyVestnik/oznamyList.xhtml?konanieId=4
/ru-verejnost-web/pages/obchodnyVestnik/detailOV.xhtml?vestnikPodanieId=250526
/ru-verejnost-web/pages/obchodnyVestnik/detailOV.xhtml?vestnikPodanieId=249437
/ru-verejnost-web/pages/obchodnyVestnik/detailOV.xhtml?vestnikPodanieId=246105
/ru-verejnost-web/pages/obchodnyVestnik/detailOV.xhtml?vestnikPodanieId=223229
/ru-verejnost-web/pages/obchodnyVestnik/detailOV.xhtml?vestnikPodanieId=219840
/ru-verejnost-web/pages/obchodnyVestnik/detailOV.xhtml?vestnikPodanieId=193763
/ru-verejnost-web/pages/obchodnyVestnik/detailOV.xhtml?vestnikPodanieId=160438
/ru-verejnost-web/pages/obchodnyVestnik/detailOV.xhtml?vestnikPodanieId=145791
/ru-verejnost-web/pages/obchodnyVestnik/detailOV.xhtml?vestnikPodanieId=139922
/ru-verejnost-web

KeyboardInterrupt: ignored

In [33]:
!ls -la data

total 24
drwxr-xr-x 2 root root 4096 Jan 30 23:03 .
drwxr-xr-x 1 root root 4096 Jan 30 22:47 ..
-rw-r--r-- 1 root root 1542 Jan 30 23:03 8110202389.txt
-rw-r--r-- 1 root root 1691 Jan 30 23:03 8110233299.txt
-rw-r--r-- 1 root root 1568 Jan 30 23:03 8111210037.txt
-rw-r--r-- 1 root root  481 Jan 30 23:03 8111210869.txt
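
The crawled files contain raw text, while the training script expects one token and label per line. A minimal sketch of that transformation, assuming a simple regular-expression tokenizer and a placeholder `O` label for every token, could look like the following. The function name `text_to_conll` and the tokenization rules are illustrative assumptions, not the method used to build the original corpus; the real labels still have to be added by manual annotation, and the naive sentence split will break on abbreviations.

```python
import re
from typing import List

# Words (including letters with diacritics) or single non-space symbols.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
# Assumption: a sentence ends after ".", "!" or "?" followed by whitespace.
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")

def text_to_conll(text: str, default_label: str = "O") -> str:
    """Convert raw text to token-per-line format with a placeholder label."""
    blocks: List[str] = []
    for sentence in SENTENCE_END_RE.split(text.strip()):
        tokens = TOKEN_RE.findall(sentence)
        if tokens:
            blocks.append("\n".join(f"{tok} {default_label}" for tok in tokens))
    # Sentences are separated by one empty line.
    return "\n\n".join(blocks) + "\n"

print(text_to_conll("Ahoj svet. Druhá veta!"))
```

The output can then be opened in an annotation tool or a text editor, where the `O` placeholders are replaced with entity labels.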


### 2.2 Copy data to the right place

Finally, we move the files to the directory that the training script reads from.

In [0]:
!mv -f train.txt BERT-NER/data/train.txt
!mv -f dev.txt BERT-NER/data/dev.txt
!mv -f test.txt BERT-NER/data/test.txt

## Step 3: Preparing models

We either download the official multilingual BERT model trained on Wikipedia, or use our own pre-trained model (for example, one pre-trained on Slovak OpenSubtitles: https://towardsdatascience.com/pre-training-bert-from-scratch-with-cloud-tpu-6e2f71028379).

In [0]:
!wget https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip


In [0]:
!unzip multi_cased_L-12_H-768_A-12.zip -d BERT-NER/
!ls -la BERT-NER/
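
Before training, it can help to confirm that the unzipped directory contains the files the training command refers to (`vocab.txt`, `bert_config.json`, and a checkpoint with the `bert_model.ckpt` prefix). The helper below is a small sketch of such a check; `missing_model_files` is a hypothetical name, not something provided by BERT or BERT-NER.

```python
import os
from glob import glob
from typing import List

def missing_model_files(model_dir: str) -> List[str]:
    """Return the names of expected BERT files that are absent from model_dir."""
    missing = []
    for name in ("vocab.txt", "bert_config.json"):
        if not os.path.isfile(os.path.join(model_dir, name)):
            missing.append(name)
    # A TensorFlow checkpoint is stored as several files sharing one prefix,
    # so we look for anything that starts with "bert_model.ckpt".
    if not glob(os.path.join(model_dir, "bert_model.ckpt*")):
        missing.append("bert_model.ckpt*")
    return missing

# An empty list means everything the training command needs is in place.
print(missing_model_files("BERT-NER/multi_cased_L-12_H-768_A-12"))
```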

## Step 4: Training model on cloud GPU

Training may take more than half an hour, depending on your dataset and the GPU currently available.

In [0]:
!python BERT-NER/BERT_NER.py \
 --task_name="NER" \
 --do_lower_case=False \
 --crf=False \
 --do_train=True \
 --do_eval=True \
 --do_predict=True \
 --data_dir=BERT-NER/data \
 --middle_output=BERT-NER/middle_data \
 --vocab_file=BERT-NER/multi_cased_L-12_H-768_A-12/vocab.txt \
 --bert_config_file=BERT-NER/multi_cased_L-12_H-768_A-12/bert_config.json \
 --init_checkpoint=BERT-NER/multi_cased_L-12_H-768_A-12/bert_model.ckpt \
 --max_seq_length=128 \
 --train_batch_size=32 \
 --learning_rate=2e-5 \
 --num_train_epochs=3.0 \
 --output_dir=BERT-NER/output/result_dir \
 --labels="[PAD],O,Sud,Osoba,Adresa,Organizacia,ICO,Narodenie,X,[CLS],[SEP]"
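
The `X`, `[CLS]`, and `[SEP]` entries in `--labels` are not entity types. BERT splits words into WordPiece sub-tokens, and NER implementations built on BERT commonly keep a word's label on its first sub-token and give the remaining pieces a filler label such as `X`. The sketch below illustrates that idea under this assumption; `align_labels` and the toy tokenizer `toy_wordpiece` are hypothetical stand-ins, not functions from the BERT-NER repository or the `bert-tensorflow` package.

```python
from typing import Callable, List, Tuple

def align_labels(
    words: List[str],
    labels: List[str],
    tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[str]]:
    """Expand word-level labels to sub-token level, using "X" as filler."""
    out_tokens = ["[CLS]"]
    out_labels = ["[CLS]"]
    for word, label in zip(words, labels):
        for i, piece in enumerate(tokenize(word)):
            out_tokens.append(piece)
            # Only the first piece carries the real label.
            out_labels.append(label if i == 0 else "X")
    out_tokens.append("[SEP]")
    out_labels.append("[SEP]")
    return out_tokens, out_labels

def toy_wordpiece(word: str) -> List[str]:
    """Toy stand-in for a WordPiece tokenizer: cut words into 4-character pieces."""
    if len(word) <= 4:
        return [word]
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]

tokens, tags = align_labels(["Okresný", "súd", "Nitra"], ["Sud", "Sud", "Sud"], toy_wordpiece)
print(list(zip(tokens, tags)))
```

With a real tokenizer, the `tokenize` argument would be the tokenizer's own word-splitting method; the toy version here only makes the example runnable without downloading a vocabulary.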

Alternatively, you can train on a TPU, which needs to be configured with GCS. Replace the `--tpu_name` value below with the `TPU_ADDRESS` obtained in Step 1.

In [0]:
!python BERT-NER/BERT_NER.py \
 --task_name="NER" \
 --do_lower_case=False \
 --crf=False \
 --do_train=True \
 --do_eval=True \
 --do_predict=True \
 --data_dir=BERT-NER/data \
 --middle_output=BERT-NER/middle_data \
 --vocab_file=BERT-NER/multi_cased_L-12_H-768_A-12/vocab.txt \
 --bert_config_file=BERT-NER/multi_cased_L-12_H-768_A-12/bert_config.json \
 --init_checkpoint=BERT-NER/multi_cased_L-12_H-768_A-12/bert_model.ckpt \
 --max_seq_length=128 \
 --train_batch_size=32 \
 --learning_rate=2e-5 \
 --num_train_epochs=3.0 \
 --output_dir=BERT-NER/output/result_dir \
 --use_tpu=True \
 --tpu_name=grpc://10.110.157.90:8470

Next, we check the files that training wrote to the output directory:

In [0]:
!ls -la BERT-NER/output/result_dir
!ls -la BERT-NER/output/result_dir/eval

## Step 5: Evaluate the results

Optionally, upload a results file, or use the one produced by this run.



In [0]:
from google.colab import files
uploadedLabels = files.upload()

We can evaluate the results:

In [0]:
from collections import Counter

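EVAL_RESULT = "BERT-NER/output/result_dir/label_test.txt" #@param {type:"string"}

# Read the predictions; each line is expected to be "token<TAB>gold<TAB>predicted".
with open(EVAL_RESULT, 'r', encoding='utf8') as rf:
  result_lines = rf.readlines()

# Per-label, token-level counts of true positives, false positives and false negatives.
entityTP = Counter()
entityFP = Counter()
entityFN = Counter()
labels = set()
for line in result_lines: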
  tokenInfo = line.split('\t')
  if len(tokenInfo) == 3:
    gold = tokenInfo[1].strip()
    answer = tokenInfo[2].strip()
    labels.add(gold)
    labels.add(answer)
    if answer == gold:
      entityTP[answer] += 1
    else:
      if answer == 'O':
        entityFN[gold] += 1
      elif gold == 'O':
        entityFP[answer] += 1
      else:
        entityFN[gold] += 1
        entityFP[answer] += 1

for label in labels:
  if entityTP[label]+entityFP[label] == 0:
    precision = 0
  else:
    precision = entityTP[label]/(entityTP[label]+entityFP[label])
  
  if entityTP[label]+entityFN[label] == 0:
    recall = 0
  else:
    recall = entityTP[label]/(entityTP[label]+entityFN[label])
  
  if precision+recall == 0:
    f1 = 0
  else:
    f1 = 2*(precision*recall)/(precision+recall)
  log.info("-------------------------")
  log.info("Entity %s" % label)
  log.info("TP: %s FP: %s FN: %s" % (entityTP[label], entityFP[label], entityFN[label]))
  log.info("Precision: %s" % precision)
  log.info("Recall: %s" % recall)
  log.info("F1: %s" % f1)
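
The loop above reports token-level scores for each label separately, including the `O` label. For a single summary number, one option is to micro-average over the entity labels only, that is, to sum TP, FP, and FN across labels before computing precision, recall, and F1. The helper below is a sketch of that idea; the names `prf` and `micro_average` and the choice to exclude `O` are assumptions, not part of the original notebook, and the counts in the example are made up.

```python
from collections import Counter
from typing import Iterable, Tuple

def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from raw counts, with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def micro_average(tp: Counter, fp: Counter, fn: Counter,
                  labels: Iterable[str], ignore=("O",)) -> Tuple[float, float, float]:
    """Sum the counts over all non-ignored labels, then compute P/R/F1 once."""
    kept = [label for label in labels if label not in ignore]
    return prf(sum(tp[l] for l in kept),
               sum(fp[l] for l in kept),
               sum(fn[l] for l in kept))

# Made-up counts, purely for illustration.
example_tp = Counter({"Osoba": 8, "Sud": 2, "O": 90})
example_fp = Counter({"Osoba": 2, "Sud": 0})
example_fn = Counter({"Osoba": 0, "Sud": 2})
print(micro_average(example_tp, example_fp, example_fn, ["Osoba", "Sud", "O"]))
```

In the notebook, the same function could be called as `micro_average(entityTP, entityFP, entityFN, labels)`, optionally passing a longer `ignore` tuple to leave out helper labels such as `X`, `[CLS]`, and `[SEP]`.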

## Step 6: Uploading model to GCS (optional)

You do not need to upload the model to GCS: you can download the files directly from the Files panel on the left.

In [0]:
BUCKET_NAME = "ardevop-sk-bert-sk" #@param {type:"string"}
MODEL_DIR = "model_ner" #@param {type:"string"}
SRC_DIR = "BERT-NER/output/result_dir" #@param {type:"string"}
tf.gfile.MkDir(MODEL_DIR)

if not BUCKET_NAME:
  log.warning("WARNING: BUCKET_NAME is not set. "
              "You will not be able to upload the model.")

In [0]:
if BUCKET_NAME:
  !gsutil cp -r $SRC_DIR gs://$BUCKET_NAME/$MODEL_DIR

## Step 7: Serving BERT model for NER as a Service

TBD: https://github.com/Ardevop-sk/bert-as-service

## Step 8: Testing the services

TBD: Swagger UI