# Train a German named entity recognition model with spaCy



### Install spaCy and download the German transformer pipeline https://spacy.io/models/de#de_dep_news_trf

In [None]:
!pip install -U pip setuptools wheel
# cuda111, transformers and lookups are extras of the spaCy package, not standalone packages
!pip install -U "spacy[cuda111,transformers,lookups]==3.3.*"
!pip install -U spacy-transformers
!python -m spacy download de_dep_news_trf

In [None]:
!pip install wandb -qqq
import wandb

In [None]:
# Log in to your W&B account
wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

True

In [None]:
# If the import below fails, check which CUDA version is installed:
# !nvcc --version

# Test cupy
import cupy
a = cupy.zeros((1,1))

In [None]:
!python -m spacy validate

✔ Loaded compatibility table

ℹ spaCy installation: /usr/local/lib/python3.7/dist-packages/spacy

NAME              SPACY                 VERSION
de_dep_news_trf   >=3.3.0.dev0,<3.4.0   3.3.0   ✔
en_core_web_sm    >=3.3.0.dev0,<3.4.0   3.3.0   ✔



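### Upload train.spacy, valid.spacy and base_config_trf_spacy32.cfg to the data folder and check the config

Manually set the train and dev paths in config_trf.cfg to the uploaded files, e.g. /content/train.spacy and /content/dev.spacy. The paths must match the names and location of the files you actually uploaded.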
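
Before training, it can help to confirm that the files are where the config expects them. The following is a minimal sketch, assuming the example paths named above; `missing_files` is a hypothetical helper for illustration and not part of spaCy.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(paths: Iterable[str]) -> List[str]:
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


# Example paths from the instructions above; adjust them to your upload location.
expected = ["/content/train.spacy", "/content/dev.spacy"]
missing = missing_files(expected)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All training files found.")
```

If anything is reported as missing, fix the upload or the paths in the config before running `spacy debug data`.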

To track the experiment with Weights & Biases (W&B) and upload versioned copies of your dataset, add this to the config:

```
[training.logger]
@loggers = "spacy.WandbLogger.v4"
project_name = "ner_lm_trf"
remove_config_values = []
log_dataset_dir = "./assets"
```

See https://pypi.org/project/spacy-loggers/ for the available logger settings.
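
A config can contain only one `[training.logger]` section, so an existing one has to be replaced rather than duplicated. The sketch below shows one way to do that on the config text with plain string handling. `set_wandb_logger` and `is_section_header` are hypothetical helpers, and the header detection is a simple heuristic that assumes section headers sit on their own line.

```python
WANDB_LOGGER_SECTION = """[training.logger]
@loggers = "spacy.WandbLogger.v4"
project_name = "ner_lm_trf"
remove_config_values = []
log_dataset_dir = "./assets"
"""


def is_section_header(line: str) -> bool:
    """True for lines like "[training.logger]", False for values like "x = []"."""
    stripped = line.strip()
    return stripped.startswith("[") and stripped.endswith("]") and "=" not in stripped


def set_wandb_logger(config_text: str) -> str:
    """Drop any existing [training.logger] section and append the W&B one."""
    kept = []
    skipping = False
    for line in config_text.splitlines():
        if is_section_header(line):
            # Skip the old logger section until the next section header starts.
            skipping = line.strip() == "[training.logger]"
        if not skipping:
            kept.append(line)
    body = "\n".join(kept).rstrip("\n")
    return body + "\n\n" + WANDB_LOGGER_SECTION
```

To apply it, read the config file as text, pass it through `set_wandb_logger`, and write the result back before running `spacy init fill-config`.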

In [None]:
### Create config.cfg https://spacy.io/usage/training 
!python -m spacy init fill-config base_config_trf_spacy32.cfg config_trf_32.cfg

✔ Auto-filled config with all values
✔ Saved config
config_trf_32.cfg
You can now add your data and train your pipeline:
python -m spacy train config_trf_32.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
!python -m spacy debug data config_trf_32.cfg

Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
✔ Pipeline can be initialized with data
✔ Corpus is loadable

Language: de
Training pipeline: transformer, ner
5159

### Activate the GPU in Colab

In [None]:
import spacy
spacy.require_gpu()

True

### Train the model

In [None]:
!python -m spacy train config_trf_32.cfg --output ./ner_lm_de_trf --gpu-id 0

ℹ Saving to output directory: ner_lm_de_trf
ℹ Using GPU: 0
[2022-07-22 08:11:08,378] [INFO] Set up nlp object from config
[2022-07-22 08:11:08,937] [INFO] Pipeline: ['transformer', 'ner']
[2022-07-22 08:11:08,942] [INFO] Created vocabulary
[2022-07-22 08:11:08,943] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are init

### Zip the folder with the best model and download it

In [None]:
!zip -r /content/file.zip /content/ner_lm_de_trf/model-best

  adding: content/ner_lm_de_trf/model-best/ (stored 0%)
  adding: content/ner_lm_de_trf/model-best/tokenizer (deflated 84%)
  adding: content/ner_lm_de_trf/model-best/meta.json (deflated 57%)
  adding: content/ner_lm_de_trf/model-best/vocab/ (stored 0%)
  adding: content/ner_lm_de_trf/model-best/vocab/vectors (deflated 45%)
  adding: content/ner_lm_de_trf/model-best/vocab/strings.json (deflated 74%)
  adding: content/ner_lm_de_trf/model-best/vocab/lookups.bin (stored 0%)
  adding: content/ner_lm_de_trf/model-best/vocab/vectors.cfg (stored 0%)
  adding: content/ner_lm_de_trf/model-best/vocab/key2row (stored 0%)
  adding: content/ner_lm_de_trf/model-best/config.cfg (deflated 61%)
  adding: content/ner_lm_de_trf/model-best/transformer/ (stored 0%)
  adding: content/ner_lm_de_trf/model-best/transformer/cfg (stored 0%)
  adding: content/ner_lm_de_trf/model-best/transformer/model (deflated 7%)
  adding: content/ner_lm_de_trf/model-best/ner/ (stored 0%)
  adding: content/ner_lm_de_trf/model-b
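
As the output shows, `zip -r` with an absolute path stores every entry under `content/ner_lm_de_trf/...`. If you would rather have `model-best` as the top-level folder in the archive, Python's standard library can build it. This is a minimal sketch; `zip_folder` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def zip_folder(folder: str, archive_stem: str) -> str:
    """Zip `folder` so the archive contains it as a top-level directory.

    Returns the path of the created .zip file.
    """
    source = Path(folder)
    return shutil.make_archive(
        archive_stem,                 # output path without the ".zip" suffix
        "zip",
        root_dir=str(source.parent),  # paths in the archive are relative to this
        base_dir=source.name,         # only this folder is archived
    )


# Paths from this notebook; run only after training has written model-best.
# zip_folder("/content/ner_lm_de_trf/model-best", "/content/file")
```

Unzipping the result yields a `model-best` directory that can be passed directly to `spacy.load`.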

In [None]:
from google.colab import files
files.download("/content/file.zip")


### Upload test.spacy and evaluate model performance on unseen data

In [None]:
# Evaluate the best checkpoint on the held-out test set
!python -m spacy evaluate /content/ner_lm_de_trf/model-best /content/test.spacy --gold-preproc --gpu-id 0
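
`spacy evaluate` can also write its scores to a JSON file via `--output`, which makes it easier to compare runs. The sketch below formats the NER part of such a file. It assumes the keys `ents_p`, `ents_r`, `ents_f` and `ents_per_type` with scores on a 0 to 1 scale, so check this against your own output file. `format_ner_scores` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List


def format_ner_scores(metrics: Dict[str, Any]) -> List[str]:
    """Format overall and per-label NER scores as percentage strings."""
    lines = []
    overall = [metrics.get(key) for key in ("ents_p", "ents_r", "ents_f")]
    if all(score is not None for score in overall):
        p, r, f = (100 * score for score in overall)
        lines.append(f"ALL  P={p:.1f} R={r:.1f} F={f:.1f}")
    # Per-label scores are assumed to be dicts with "p", "r" and "f" entries.
    for label, scores in sorted((metrics.get("ents_per_type") or {}).items()):
        lines.append(
            f"{label}  P={100 * scores['p']:.1f} "
            f"R={100 * scores['r']:.1f} F={100 * scores['f']:.1f}"
        )
    return lines


# Usage, assuming the evaluate command was run with "--output metrics.json"
# (a hypothetical file name):
# with open("metrics.json", encoding="utf8") as fh:
#     print("\n".join(format_ner_scores(json.load(fh))))
```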