# Objective

This notebook uses data collected during the Getuigenissen project (https://www.getuigenissen.org). The data consist of transcribed police records from the 18th and 19th centuries, manually enriched with named-entity tags and with relationships between the entities.

- Build a Named Entity Recognition model
- Evaluate how well such a model performs

Final objective: apply the named entity recognition model to transcribed images.

# Data

- About 6500 images of 3200 police interrogations, covering more than 260 police cases, were each transcribed twice by volunteers.
- Some of these transcriptions were manually checked by researchers; for the others, the transcription by the best transcriber was used.
- For each of these 260 police records, one text was sampled and manually annotated with named entities.

# Data analysis setup

1. Fine-tune BERT models (multilingual BERT, RobBERT https://github.com/iPieter/RobBERT, BERTje https://github.com/wietsedv/bertje) on the data and measure their accuracy on the named entity recognition task
2. Compare with building and tuning a Conditional Random Field
3. Score the model on unannotated data

Notes

*   Because of the nature of 18th-19th century police reports, some small parts of the texts are in French.
*   Unfortunately, the data are currently not available as open data, but this might change in the future.
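
For comparing the BERT models with the CRF, NER quality is often reported at the entity level rather than per token. The sketch below shows one way to compute entity-level precision, recall and F1. It assumes the annotations use BIO-style tags (`B-PER`, `I-PER`, `O`), which the source does not confirm; the tag names and the helper functions `bio_to_spans` and `entity_prf` are hypothetical.

```python
from typing import List, Set, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence (assumed tag scheme)."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == etype
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        # Lenient: an "I-" tag without a matching open span also starts one.
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, label
    return spans


def entity_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1; a span counts only on exact match."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Toy example: the prediction finds the person but misses the location.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(entity_prf(gold, pred))
```

Exact-match scoring is strict: a predicted span that is off by one token counts as both a false positive and a false negative, which may matter for long historical names.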




# Software installations


In [None]:
%cd /content
!git clone https://github.com/UniversalDependencies/UD_Dutch-LassySmall
%cd /content
!git clone https://github.com/wietsedv/bertje
%cd bertje

/content
Cloning into 'UD_Dutch-LassySmall'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 224 (delta 12), reused 11 (delta 4), pack-reused 200
Receiving objects: 100% (224/224), 7.39 MiB | 24.17 MiB/s, done.
Resolving deltas: 100% (134/134), done.
/content
Cloning into 'bertje'...
remote: Enumerating objects: 166, done.
remote: Counting objects: 100% (166/166), done.
remote: Compressing objects: 100% (140/140), done.
remote: Total 166 (delta 37), reused 106 (delta 11), pack-reused 0
Receiving objects: 100% (166/166), 215.08 KiB | 7.17 MiB/s, done.
Resolving deltas: 100% (37/37), done.
/content/bertje


## NER based on bertje



- Check the BERTje setup using UD Lassy Small
- The input requires three files: `train.tsv`, `dev.tsv` and `test.tsv`
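
A minimal sketch of producing the three input files from annotated sentences. It assumes the common CoNLL-style layout of one `token<TAB>label` pair per line with a blank line between sentences; this layout is an assumption and should be checked against the files that BERTje's own prepare scripts generate. The helpers `write_tsv` and `split_and_write`, the split fractions and the seed are all hypothetical.

```python
import random
from pathlib import Path
from typing import Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, label) pairs


def write_tsv(sentences: List[Sentence], path: Path) -> None:
    """Write one token<TAB>label per line, with a blank line after each sentence."""
    with path.open("w", encoding="utf-8") as f:
        for sentence in sentences:
            for token, label in sentence:
                f.write(f"{token}\t{label}\n")
            f.write("\n")


def split_and_write(
    sentences: List[Sentence],
    out_dir: str,
    dev_frac: float = 0.1,
    test_frac: float = 0.1,
    seed: int = 42,
) -> Dict[str, int]:
    """Shuffle, split into train/dev/test and write the three TSV files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    parts = {
        "dev": shuffled[:n_dev],
        "test": shuffled[n_dev:n_dev + n_test],
        "train": shuffled[n_dev + n_test:],
    }
    for name, part in parts.items():
        write_tsv(part, out / f"{name}.tsv")
    # Return the number of sentences per split as a quick sanity check.
    return {name: len(part) for name, part in parts.items()}
```

Calling `split_and_write(sentences, "getuigenissen")` would create the three files in a local `getuigenissen` folder. Because each annotated text comes from a different police record, splitting per text rather than per sentence may give a more honest test score.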

In [None]:
!python --version
!pip install transformers
!pip install pyyaml
!pip install scikit-learn
!pip install scipy
!pip install tqdm
!pip install tensorboard

Python 3.6.9
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/50/0c/7d5950fcd80b029be0a8891727ba21e0cd27692c407c51261c3c921f6da3/transformers-4.1.1-py3-none-any.whl (1.5MB)
     |████████████████████████████████| 1.5MB 8.0MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
     |████████████████████████████████| 890kB 30.4MB/s 
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
     |████████████████████████████████| 2.9MB 19.6MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=

In [None]:
!python /content/bertje/finetuning/prepare/prepare-ud.py -i "/content/UD_Dutch-LassySmall" -o "data"

 > Preparing NER data
Labels in data/pos/train.tsv (16 labels):
noun          12325 (16.40%)
punct         11295 (15.03%)
propn         10559 (14.05%)
adp            9293 (12.36%)
det            8130 (10.82%)
adj            5361 (7.13%)
verb           5170 (6.88%)
adv            2703 (3.60%)
num            2586 (3.44%)
pron           2368 (3.15%)
cconj          2010 (2.67%)
aux            1949 (2.59%)
sym             545 (0.73%)
sconj           486 (0.65%)
x               379 (0.50%)
intj              6 (0.01%)

Labels in data/pos/dev.tsv (16 labels):
noun           1830 (16.03%)
punct          1810 (15.85%)
adp            1374 (12.03%)
propn          1207 (10.57%)
det            1173 (10.27%)
verb            881 (7.72%)
adj             752 (6.59%)
pron            565 (4.95%)
adv             535 (4.69%)
num             386 (3.38%)
aux             352 (3.08%)
cconj           332 (2.91%)
sconj           111 (0.97%)
sym              64 (0.56%)
x                43 (0.38%)
intj             

In [None]:
!mkdir /content/getuigenissen

- Build the three files (`train.tsv`, `test.tsv`, `dev.tsv`) locally and upload them to the `/content/getuigenissen` folder on Google Colab
- Upload `getuigenissen-ner.yaml` to the `bertje/finetuning/v2/configs/data` folder, with the following content:

```
data:
  name: "getuigenissen-ner"
  input: "/content/getuigenissen"
  num_labels: 25

model:
  shortname: "bertje"
  name: "wietsedv/bert-base-dutch-cased"
  type: "bert"

train:
  max_epochs: 200
```
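
The `num_labels: 25` setting presumably has to match the number of distinct labels in the training data. Here is a small sketch of checking that before starting a long run, assuming every non-blank line of `train.tsv` ends with a tab-separated label (an assumption about the file layout, not confirmed by the source). `count_labels` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def count_labels(tsv_path: str) -> Counter:
    """Count how often each label occurs in a token<TAB>label file."""
    counts: Counter = Counter()
    for line in Path(tsv_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # blank line: assumed sentence boundary
        label = line.split("\t")[-1]  # assumed: the label is the last column
        counts[label] += 1
    return counts
```

In the notebook, `len(count_labels("/content/getuigenissen/train.tsv"))` could then be compared with the `num_labels` value in the YAML file, and `most_common()` on the result gives a label distribution similar to the one the prepare script prints.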

In [None]:
%cd /content/bertje/finetuning/v2
!python main.py data/getuigenissen-ner

/content/bertje/finetuning/v2
2020-12-22 10:55:50.030758: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
importing config from "configs/default.yaml"
importing config from "configs/data/getuigenissen-ner.yaml"
data:
  cache: cache/{}-{}
  cfgs: [data/udlassy-pos, data/lassysmall-pos, data/conll2002-ner, data/sonar-ner,
    data/udlassy-ner, data/110kdbrd, data/110kdbrd-2, data/twisty, data/twisty2, data/twisty3,
    data/twisty-merge-4, data/twisty4-merge-4]
  clip_start: false
  dev: true
  input: /content/getuigenissen
  logs: logs/{}-{}
  merge: null
  name: getuigenissen-ner
  num_labels: 25
  num_sents: 1
  output: output/{}-{}
  token_level: true
  verify: false
eval: {batch_size: 64}
force: false
model:
  cfgs: [models/bertje, models/multi, models/bertnl, models/robbert]
  checkpoint: -1
  device: cuda
  do_export: true
  do_train: true
  lower_case: false
  name: wietsedv/bert-base-dutch-cased
  shortname: b

## RobBERT

In [None]:
%cd /content
!git clone https://github.com/iPieter/RobBERT
%cd RobBERT
#!git checkout v2.0

/content
fatal: destination path 'RobBERT' already exists and is not an empty directory.
/content/RobBERT


In [None]:
#!pip install fairseq
#!pip install nltk
#!pip install numpy
#!pip install torch==1.6.0
#!pip install tokenizers==0.4.2
#!pip install transformers==3.1.0
#!pip install tensorboardx
#!pip install nltk
#!pip install pytorch-lightning
#!pip install -e git+git@github.com:iPieter/kiwi.ml.git@76b66872fce68873809a0dea112e2ed552ae5b63#egg=kiwi

TODO

In [None]:
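from transformers import RobertaTokenizer, RobertaForTokenClassification

# Load the pretrained RobBERT v2 tokenizer and weights from the Hugging Face hub.
tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
# The token-classification head is newly initialised (2 labels by default),
# so its predictions are not meaningful until the model is fine-tuned.
model = RobertaForTokenClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base")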

In [None]:
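txt = "dag wereld"
# Token ids, including the special start and end tokens
print(tokenizer(txt)['input_ids'])
# Same encoding as PyTorch tensors, ready to feed to the model
inputs = tokenizer(txt, return_tensors = "pt")
print(inputs)
# Forward pass: one vector of label scores (logits) per subword token
outputs = model(**inputs)
print(outputs)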
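
The model scores every subword token, while NER labels are wanted per word. One common convention, sketched below in plain Python, is to keep the prediction of the first subword of each word. This is not necessarily what the fine-tuning scripts used here do. The sketch assumes `word_ids` maps each token position to a word index (or `None` for special tokens), as returned by the `word_ids()` method of a fast tokenizer's encoding; the slow `RobertaTokenizer` loaded above does not provide this. The function `word_level_labels` and the `id2label` mapping are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def word_level_labels(
    logits: Sequence[Sequence[float]],
    word_ids: Sequence[Optional[int]],
    id2label: Dict[int, str],
) -> List[str]:
    """Return one label per word, taken from the word's first subword token."""
    labels: List[str] = []
    previous: Optional[int] = None
    for scores, word_id in zip(logits, word_ids):
        # Skip special tokens (None) and non-initial subwords of a word.
        if word_id is not None and word_id != previous:
            best = max(range(len(scores)), key=lambda j: scores[j])  # argmax
            labels.append(id2label[best])
        previous = word_id
    return labels


# Toy example: 5 token positions, 2 words, the second word split in 2 subwords.
toy_logits = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9], [0.5, 0.5]]
toy_word_ids = [None, 0, 1, 1, None]
print(word_level_labels(toy_logits, toy_word_ids, {0: "O", 1: "B-PER"}))
```

With the objects from the cell above, the per-token scores would come from `outputs.logits[0].tolist()`, assuming the model returns an output object with a `logits` attribute as recent `transformers` versions do.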