# **Fine-tuning a Transformer for Legal Entity Recognition with spaCy** <br>

Named-entity recognition (NER) is an information extraction task: the automatic identification and classification of entities such as company names or locations. In our use case, the goal is to extract legal entities from German legal documents, for example references to paragraphs or law books. Pre-trained models do not cover such specialized use cases out of the box, so in this notebook we fine-tune a first BERT model with spaCy v3.0 to extract legal entities. <br>
spaCy’s CLI provides a set of commands that allow us to convert the data, debug it, and train and evaluate our model using only CLI commands. 

**Model:** <br>
    spaCy de_dep_news_trf <br>
**Dataset:** <br>
    Leitner, E. (2019). Eigennamen- und Zitaterkennung in Rechtstexten. Bachelor’s thesis, Universität Potsdam, Potsdam, 2. <br>
    https://github.com/elenanereiss/Legal-Entity-Recognition 

# 1. Import spaCy and check CUDA version





In [14]:
import spacy

In [11]:
# check CUDA Version

!nvidia-smi

Tue Oct 12 12:50:32 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           On   | 00000001:00:00.0 Off |                    0 |
| N/A   47C    P0    72W / 149W |   1021MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# 2. Data transformation from CoNLL-format into spaCy binary training format

Since spaCy v3.0, the JSON format for training data has been deprecated and replaced by spaCy's binary training format, which uses the extension .spacy. <br>
In the following cell, the training and test data are converted to this format.

In [12]:
# Convert train and test data from .conll to .spacy and group every 10 sentences into a document

# Training data
!python -m spacy convert data/train/bag.conll -c conll -n 10 data/train

!python -m spacy convert data/train/bfh.conll -c conll -n 10 data/train

!python -m spacy convert data/train/bgh.conll -c conll -n 10 data/train

!python -m spacy convert data/train/bpatg.conll -c conll -n 10 data/train

!python -m spacy convert data/train/bsg.conll -c conll -n 10 data/train

!python -m spacy convert data/train/bverfg.conll -c conll -n 10 data/train
# Test data
!python -m spacy convert data/val/bverwg.conll -c conll -n 10 data/val

ℹ Grouping every 10 sentences into a document.
✔ Generated output file (1280 documents): data/train/bag.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (853 documents): data/train/bfh.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (586 documents): data/train/bgh.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (1202 documents): data/train/bpatg.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (809 documents): data/train/bsg.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (924 documents): data/train/bverfg.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (1022 documents): data/val/bverwg.spacy
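
To make the effect of the -n 10 option more concrete, here is a minimal pure-Python sketch of the grouping idea. It is not spaCy's converter: it assumes a simple two-column CoNLL layout (token, then BIO tag, with blank lines between sentences), and the sample text, labels, and function names are hand-written for illustration rather than taken from the dataset.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, BIO tag) pairs

# Hand-written sample in an assumed "token tag" layout (not from the dataset).
SAMPLE = """\
Die O
Revision O
wird O

§ B-GS
27 I-GS
TV-L I-GS

Mikosch B-PER
"""


def read_conll_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Split CoNLL-style lines into sentences at blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        parts = line.split()
        if not parts:  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # first column is the token, last column is the tag
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def group_sentences(sentences: List[Sentence], n_sents: int = 10) -> List[List[Sentence]]:
    """Group consecutive sentences into documents of at most n_sents each."""
    return [sentences[i:i + n_sents] for i in range(0, len(sentences), n_sents)]


sentences = read_conll_sentences(SAMPLE.splitlines())
documents = group_sentences(sentences, n_sents=2)
print(f"{len(sentences)} sentences grouped into {len(documents)} documents")
```

Grouping several sentences into one document gives the transformer more context per example than single sentences would; the value 10 is the one used in the cell above.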


# 3. Debug config file

The base config file contains all settings and hyperparameters for the spaCy training pipeline. <br>
With the 'debug data' command, spaCy reads this config and automatically validates our training and test data.

In [13]:
!python -m spacy debug data config/base_config_spacy.cfg

Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
✔ Pipeline can be initialized with data
✔ Corpus is loadable
Language: de
Training pipeline: transformer, ner
9725

# 4. Training 
After checking the data, we can start the training process with a single command. The option -g 0 runs the training on the GPU with ID 0. 

In [15]:
# init pipeline and start the training process based on the config document
!python -m spacy train -g 0 config/base_config_spacy.cfg --output models/

ℹ Saving to output directory: models
ℹ Using GPU: 0
[2021-10-09 19:21:55,427] [INFO] Set up nlp object from config
[2021-10-09 19:21:55,438] [INFO] Pipeline: ['transformer', 'ner']
[2021-10-09 19:21:55,441] [INFO] Created vocabulary
[2021-10-09 19:21:55,442] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializin

# 5. Load and test trained model
Once the model has been trained successfully, we can evaluate its performance on our validation data with the 'evaluate' command.

In [24]:
! python -m spacy evaluate -g 0 models/model-best data/val

ℹ Using GPU: 0

TOK     -    
NER P   86.28
NER R   83.24
NER F   84.73
SPEED   3224 


          P       R       F
LD    83.06   87.34   85.15
VT    36.45   19.80   25.66
GS    92.60   96.86   94.68
GRT   97.40   95.81   96.60
UN    59.09   76.47   66.67
RS    90.86   93.46   92.14
VO    79.37   69.01   73.83
VS    47.44   19.27   27.41
LIT   85.77   87.50   86.63
STR   35.85   33.33   34.55
ST    87.03   77.04   81.73
ORG   47.79   39.63   43.33
PER   87.78   69.53   77.60
INN   72.18   56.78   63.56
EUN   83.39   81.31   82.33
LDS   49.02   78.12   60.24
AN     0.00    0.00    0.00
RR     0.00    0.00    0.00
MRK    0.00    0.00    0.00
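
The F column is the harmonic mean of precision and recall, F = 2PR / (P + R). As a quick sanity check, the following sketch recomputes a few F-scores from the rounded P and R values in the output above; the function name is our own, and small deviations are possible because spaCy computes the scores from raw counts before rounding.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the evaluation output above
scores = {
    "overall": (86.28, 83.24),
    "LD": (83.06, 87.34),
    "VT": (36.45, 19.80),
    "AN": (0.00, 0.00),
}

for label, (p, r) in scores.items():
    print(f"{label:8s} F = {f_score(p, r):.2f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a label such as VT with low recall ends up with a low F-score even though its precision is noticeably higher.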



Furthermore, the model can be loaded and tested on some example queries.

In [17]:
# make spaCy use the GPU (raises an error if no GPU is available)
spacy.require_gpu()
# load best model from training as nlp
nlp = spacy.load("models/model-best")

In [37]:
# Test NER model on document sample

doc = nlp(
    
'''
Tenor
1. Die Revision des Klägers gegen das Urteil des Landesarbeitsgerichts Berlin-Brandenburg vom 2. April 2013 - 11 Sa 2346/12 - wird zurückgewiesen.
2. Der Kläger hat die Kosten der Revision zu tragen.Tatbestand1Die Parteien streiten über den Umfang des Zusatzurlaubs bei Wechselschichtarbeit.
2. Der Kläger ist seit dem 1. September 1994 für das beklagte Land tätig. Auf das Arbeitsverhältnis findet der Tarifvertrag zur Angleichung des Tarifrechts des Landes Berlin an das Tarifrecht der Tarifgemeinschaft deutscher Länder vom 14. Oktober 2010 (Angleichungs-TV Land Berlin) und danach grundsätzlich der TV-L Anwendung.
3. Der Kläger arbeitet als Polizeiangestellter in Wechselschicht in der Zeit von 05:45 Uhr bis 18:00 Uhr und von 17:45 Uhr bis 06:00 Uhr. Seine Arbeitszeit beträgt pro Schicht 12,25 Stunden, im Durchschnitt arbeitet er 3,5 Dienste pro Woche. Das beklagte Land gewährt dem Kläger nach § 27 Abs. 2 Buchst. a TV-L unter Anwendung der Kürzungsregel des § 26 Abs. 1 Satz 4 TV-L (bis zum 31. Dezember 2012: § 26 Abs. 1 Satz 5 TV-L) im Jahr vier Tage Zusatzurlaub für Wechselschichtarbeit á 12,25 Stunden.4Der Kläger vertritt die Auffassung, ihm stünden sechs Tage Zusatzurlaub im Jahr bei einer anzurechnenden Arbeitszeit von 12,25 Stunden zu. 
            Mikosch                Schmitz-Scholemann                Mestwerdt                                Maurer                Klein                    
'''
)

In [38]:
# visualize NER results with displacy 

from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)
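
Besides the visual rendering, it can be handy to collect the recognized entities in a plain data structure. The sketch below groups entity texts by label using the `ents`, `text`, and `label_` attributes that spaCy documents and spans expose. The helper name is our own, and the stand-in document with hand-picked labels is only there so the snippet runs without the trained model; it is not model output.

```python
from collections import defaultdict
from types import SimpleNamespace


def entities_by_label(doc):
    """Group entity texts by label for any object with spaCy-style `.ents`."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-in for a processed document (labels chosen by hand, NOT predictions).
# With the notebook's `doc`, call entities_by_label(doc) instead.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Mikosch", label_="PER"),
    SimpleNamespace(text="Klein", label_="PER"),
    SimpleNamespace(text="Landesarbeitsgerichts Berlin-Brandenburg", label_="GRT"),
])

for label, texts in entities_by_label(fake_doc).items():
    print(label, texts)
```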

In [None]:
# save best transformer model on disk
nlp.to_disk("./model-best")

# 6. What next?

spaCy allows us to train a custom model for legal entity extraction in documents with just a few CLI commands. <br>
Unfortunately, some of the labels are not classified correctly: AN, RR, and MRK reach an F-score of 0.00, and VT and VS stay below 30. In the next steps, our goal is to improve the model's performance and take a deeper look at the hyperparameters and the training data. 
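
One possible starting point for looking at the training data is to count how often each label occurs, since labels with very few examples are natural candidates for closer inspection. The following is a minimal sketch under the assumption that the tags follow a BIO scheme in which every entity starts with a B- tag; the function name and the tiny tag sequences are made up for illustration and are not taken from the dataset.

```python
from collections import Counter
from typing import Iterable, List


def count_entity_spans(tagged_sentences: Iterable[List[str]]) -> Counter:
    """Count entity spans per label by counting B- tags (BIO scheme assumed)."""
    counts: Counter = Counter()
    for tags in tagged_sentences:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


# Made-up tag sequences, one list per sentence
example_tags = [
    ["O", "B-GS", "I-GS", "O"],
    ["B-PER", "O", "B-GS"],
    ["O", "O"],
]

counts = count_entity_spans(example_tags)
# List the rarest labels first
for label, n in sorted(counts.items(), key=lambda item: item[1]):
    print(label, n)
```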