# EXAMPLE - 5

**Tasks :- NER tagging, POS tagging**

**Tasks Description**

``NER`` :- This is a Named Entity Recognition task, where each word of the sentence is tagged with the entity label it belongs to. Words that don't belong to any entity label are simply labeled as "O".

``POS`` :- This is a Part of Speech tagging task. A part of speech is a category of words that have similar grammatical properties. Each word of the sentence is tagged with the part of speech label it belongs to. Words that don't belong to any part of speech label are simply labeled as "O".
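
To make the two labelling schemes concrete, the sketch below pairs the tokens of one sentence with POS and NER tags in the style of the coNLL 2003 data. The sentence, the tags and the variable names are illustrative assumptions, not output of the library.

```python
# Illustrative sentence with one POS tag and one NER tag per token.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
pos_tags = ["NNP", "VBZ", "JJ", "NN", "TO", "VB", "JJ", "NN", "."]
ner_tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]

# Every token carries exactly one label per task.
assert len(tokens) == len(pos_tags) == len(ner_tags)

# Tokens outside any entity are labelled "O"; keep only the tagged ones.
entities = [(tok, tag) for tok, tag in zip(tokens, ner_tags) if tag != "O"]
print(entities)
```

In the BIO scheme, a ``B-`` prefix marks the first token of an entity and an ``I-`` prefix marks a token that continues it.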

**Conversational Utility** :- In a conversational AI context, determining the syntactic parts of a sentence can help in extracting noun phrases or important keyphrases from it.
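
As a rough illustration of how POS tags can drive keyphrase extraction, the sketch below groups runs of adjective and noun tokens into candidate noun phrases. The helper ``noun_phrases`` is hypothetical and assumes Penn Treebank style tags (``NN*`` for nouns, ``JJ*`` for adjectives); it is a simple heuristic, not something the library provides.

```python
def noun_phrases(tokens, pos_tags):
    """Group maximal runs of adjective/noun tokens that contain at least one noun."""
    phrases, current, has_noun = [], [], False
    for token, tag in zip(tokens, pos_tags):
        if tag.startswith("NN") or tag.startswith("JJ"):
            current.append(token)
            has_noun = has_noun or tag.startswith("NN")
        else:
            # Any other tag ends the current run; keep it only if it has a noun.
            if current and has_noun:
                phrases.append(" ".join(current))
            current, has_noun = [], False
    if current and has_noun:
        phrases.append(" ".join(current))
    return phrases


tokens = ["Book", "a", "cheap", "flight", "to", "New", "York"]
pos_tags = ["VB", "DT", "JJ", "NN", "TO", "NNP", "NNP"]
phrases = noun_phrases(tokens, pos_tags)
print(phrases)
```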

**Data** :- In this example, we use the <a href="https://www.clips.uantwerpen.be/conll2003/ner/">coNLL 2003</a> data, which is in BIO-tagged format with the POS and NER tags separated by spaces.

The data is already present in the ``coNLL_data`` directory.

# Step - 1: Transforming data

The raw data is in BIO-tagged format, with the POS and NER tags separated by spaces.
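
The sketch below shows one way such space-separated lines could be read into sentences. It is not the library's ``coNLL_ner_pos_to_tsv`` function: the helper ``read_conll`` is hypothetical, and the column layout (word first, POS tag second, NER tag last, blank lines between sentences, ``-DOCSTART-`` lines between documents) is an assumption about the raw files.

```python
def read_conll(lines):
    """Parse CoNLL-style lines into sentences of (word, pos, ner) triples."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # Blank lines and document markers close the sentence in progress.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[1], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative input, not copied from the data files.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll(sample)
print(sentences)
```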

A sample transformation function, ``coNLL_ner_pos_to_tsv``, is already provided to convert this data to the required tsv format.

Running the data transformation saves the required train, dev and test tsv files under the ``data`` directory in the root of the library. For more details on the data transformation process, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/data_transformations.html">data transformations</a> in the documentation.

The transformation file needs the following details. It is already created as ``transform_file_conll.yml``.

```
transform1:
  transform_func: coNLL_ner_pos_to_tsv
  read_file_names:
    - coNLL_train.txt
    - coNLL_testa.txt
    - coNLL_testb.txt
  read_dir: coNLL_data
  save_dir: ../../data
```

The following command can be used to run the data transformation for the tasks.

In [1]:
!python ../data_transformations.py \
    --transform_file 'transform_file_conll.yml'



# Step - 2: Data Preparation

For more details on the data preparation process, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-data-preparation">data preparation</a> in the documentation.

Define the tasks file for training a single model on the NER and POS tasks. The file is already created at ``tasks_file_conll.yml``.
```
conllner:
  model_type: BERT
  config_name: bert-base-uncased
  dropout_prob: 0.2
  label_map_or_file: ../../data/ner_coNLL_train_label_map.joblib
  metrics:
  - seqeval_f1_score
  - seqeval_precision
  - seqeval_recall
  loss_type: NERLoss
  task_type: NER
  file_names:
  - ner_coNLL_train.tsv
  - ner_coNLL_testa.tsv
  - ner_coNLL_testb.tsv

conllpos:
    model_type: BERT
    config_name: bert-base-uncased
    dropout_prob: 0.2
    label_map_or_file: ../../data/pos_coNLL_train_label_map.joblib
    metrics:
    - seqeval_f1_score
    - seqeval_precision
    - seqeval_recall
    loss_type: NERLoss
    task_type: NER
    file_names:
    - pos_coNLL_train.tsv
    - pos_coNLL_testa.tsv
    - pos_coNLL_testb.tsv
```
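
Each task above points ``label_map_or_file`` at a saved label map. As a hedged illustration of what such a map could contain, the sketch below builds a label-to-index dictionary from tag sequences. The helper ``build_label_map`` and the special labels ``[CLS]``, ``[SEP]`` and ``X`` are assumptions (``X`` does appear in the prediction output later in this example); the actual file written by the transformation may be laid out differently.

```python
def build_label_map(tag_sequences, special_labels=("[CLS]", "[SEP]", "X")):
    """Map every distinct tag, plus assumed special labels, to an integer index."""
    labels = sorted({tag for seq in tag_sequences for tag in seq})
    for special in special_labels:
        if special not in labels:
            labels.append(special)
    return {label: index for index, label in enumerate(labels)}


label_map = build_label_map([["B-ORG", "O", "B-MISC"], ["B-PER", "I-PER", "O"]])
print(label_map)
```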

In [None]:
# Alternative data directory: --data_dir '../../data'
!python ../data_preparation.py \
    --task_file 'tasks_file_SRL.yml' \
    --data_dir 'content/data/' \
    --max_seq_len 50

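
Data preparation tokenizes each sentence and fits it to ``--max_seq_len``. The sketch below shows, under stated assumptions, how word-level labels could be aligned to sub-word pieces: the first piece keeps the word's label, later pieces get the filler label ``X``, and the sequence is cut so that ``[CLS]`` and ``[SEP]`` still fit. The names ``align_labels`` and ``toy_tokenize`` are hypothetical, and ``toy_tokenize`` is only a stand-in for a real WordPiece tokenizer; the library's own preparation code may differ.

```python
def align_labels(words, labels, tokenize, max_seq_len):
    """Expand word-level labels to sub-word pieces and fit them to max_seq_len."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        word_pieces = tokenize(word)
        pieces.extend(word_pieces)
        # The first piece keeps the word's label; the rest get the filler "X".
        piece_labels.extend([label] + ["X"] * (len(word_pieces) - 1))
    # Reserve two positions for the [CLS] and [SEP] markers.
    pieces = pieces[: max_seq_len - 2]
    piece_labels = piece_labels[: max_seq_len - 2]
    return ["[CLS]"] + pieces + ["[SEP]"], ["[CLS]"] + piece_labels + ["[SEP]"]


def toy_tokenize(word):
    """Hypothetical tokenizer: lowercase and split words longer than 4 characters."""
    word = word.lower()
    return [word] if len(word) <= 4 else [word[:4], "##" + word[4:]]


pieces, piece_labels = align_labels(
    ["German", "call", "rejected"], ["B-MISC", "O", "O"], toy_tokenize, max_seq_len=50
)
print(pieces)
print(piece_labels)
```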


# Step - 3: Running Training

In [None]:
!python ../train.py \
    --data_dir 'content/data/bert-base-uncased_prepared_data' \
    --task_file 'tasks_file_SRL.yml' \
    --out_dir 'conll_ner_pos_bert_base' \
    --epochs 5 \
    --train_batch_size 4 \
    --eval_batch_size 8 \
    --grad_accumulation_steps 2 \
    --log_per_updates 25 \
    --save_per_updates 100 \
    --eval_while_train \
    --test_while_train \
    --max_seq_len 50 \
    --silent  
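
The command above combines ``--train_batch_size 4`` with ``--grad_accumulation_steps 2``. The arithmetic below sketches what that implies, assuming one optimizer update is made after every ``grad_accumulation_steps`` batches. The dataset size is a made-up number used only for illustration.

```python
import math

train_batch_size = 4          # --train_batch_size
grad_accumulation_steps = 2   # --grad_accumulation_steps
epochs = 5                    # --epochs
num_train_examples = 1000     # hypothetical dataset size, for illustration only

# Gradients from several batches are summed before each optimizer update.
effective_batch_size = train_batch_size * grad_accumulation_steps
batches_per_epoch = math.ceil(num_train_examples / train_batch_size)
updates_per_epoch = math.ceil(batches_per_epoch / grad_accumulation_steps)
total_updates = updates_per_epoch * epochs

print(effective_batch_size, updates_per_epoch, total_updates)
```

Under the same assumption, ``--log_per_updates 25`` and ``--save_per_updates 100`` would be counted in these optimizer updates rather than in individual batches.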

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
loading file vocab.txt from cache at /home/tiendat/.cache/huggingface/hub/models--dmis-lab--biobert-base-cased-v1.2/snapshots/67c9c25b46986521ca33df05d8540da1210b3256/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at /home/tiendat/.cache/huggingface/hub/models--dmis-lab--biobert-base-cased-v1.2/snapshots/67c9c25b46986521ca33df05d8540da1210b3256/config.json
Model config BertConfig {
  "_name_or_path": "dmis-lab/biobert-base-cased-v1.2",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range

# Step - 4: Inferring

You can import and use the ``inferPipeline`` to get predictions for the required tasks.
The trained model and the maximum sequence length to be used need to be specified.

For more details on inferring, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/infering.html">infer pipeline</a> in the documentation.

In [1]:
import sys
sys.path.insert(1, '../')
from infer_pipeline import inferPipeline


In [2]:
%ls

SRL_model.ipynb*          content/              probability.ipynb*
coNLL_data/               multi_task_logs.log*  tasks_file_SRL.yml*
conll_ner_pos_bert_base/  proba/                transform_file_conll.yml*


In [3]:
pipe = inferPipeline(modelPath='conll_ner_pos_bert_base/pt_30_epoch/multi_task_model_19_408.pt')

loading file vocab.txt from cache at /home/tiendat/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /home/tiendat/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/tokenizer_config.json
loading configuration file config.json from cache at /home/tiendat/.cache/huggingface/hub/models--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps

In [6]:
import pandas as pd
file_names = [
    "ner_coNLL_testa_abolish.tsv",
    "ner_coNLL_testa.tsv",
    "ner_coNLL_testb.tsv",
    "ner_coNLL_testb_abolish.tsv",
    "ner_coNLL_testa_alter.tsv",
    "ner_coNLL_testb_alter.tsv",
    "ner_coNLL_testa_begin.tsv",
    "ner_coNLL_testb_begin.tsv",
    "ner_coNLL_testa_block.tsv",
    "ner_coNLL_testb_block.tsv",
    "ner_coNLL_testa_catalyse.tsv",
    "ner_coNLL_testb_catalyse.tsv",
    "ner_coNLL_testa_confer.tsv",
    "ner_coNLL_testb_confer.tsv",
    "ner_coNLL_testa_decrease.tsv",
    "ner_coNLL_testb_decrease.tsv",
    "ner_coNLL_testa_delete.tsv",
    "ner_coNLL_testb_delete.tsv",
    "ner_coNLL_testa_develop.tsv",
    "ner_coNLL_testb_develop.tsv",
    "ner_coNLL_testa_disrupt.tsv",
    "ner_coNLL_testb_disrupt.tsv",
    "ner_coNLL_testa_eliminate.tsv",
    "ner_coNLL_testb_eliminate.tsv",
    "ner_coNLL_testa_encode.tsv",
    "ner_coNLL_testb_encode.tsv",
    "ner_coNLL_testa_express.tsv",
    "ner_coNLL_testb_express.tsv",
    "ner_coNLL_testa_generate.tsv",
    "ner_coNLL_testb_generate.tsv",
    "ner_coNLL_testa_inhibit.tsv",
    "ner_coNLL_testb_inhibit.tsv",
    "ner_coNLL_testa_initiate.tsv",
    "ner_coNLL_testb_initiate.tsv",
    "ner_coNLL_testa_lead.tsv",
    "ner_coNLL_testb_lead.tsv",
    "ner_coNLL_testa_lose.tsv",
    "ner_coNLL_testb_lose.tsv",
    "ner_coNLL_testa_modify.tsv",
    "ner_coNLL_testb_modify.tsv",
    "ner_coNLL_testa_mutate.tsv",
    "ner_coNLL_testb_mutate.tsv",
    "ner_coNLL_testa_proliferate.tsv",
    "ner_coNLL_testb_proliferate.tsv",
    "ner_coNLL_testa_recognize.tsv",
    "ner_coNLL_testb_recognize.tsv",
    "ner_coNLL_testa_result.tsv",
    "ner_coNLL_testb_result.tsv",
    "ner_coNLL_testa_skip.tsv",
    "ner_coNLL_testb_skip.tsv",
    "ner_coNLL_testa_splice.tsv",
    "ner_coNLL_testb_splice.tsv",
    "ner_coNLL_testa_transcribe.tsv",
    "ner_coNLL_testb_transcribe.tsv",
    "ner_coNLL_testa_transform.tsv",
    "ner_coNLL_testb_transform.tsv",
    "ner_coNLL_testa_translate.tsv",
    "ner_coNLL_testb_translate.tsv",
    "ner_coNLL_testa_truncate.tsv",
    "ner_coNLL_testb_truncate.tsv",
    "ner_coNLL_train.tsv"
]

# Run inference on every file; each row of a tsv file becomes one input sample.
for file_name in file_names:
    print(file_name)
    data = pd.read_csv(f'proba/{file_name}', sep='\t', header=None)
    data_list = data.values.tolist()
    pipe.infer(data_list, ['conllsrl'])

ner_coNLL_testa_abolish.tsv


Eval: 100%|██████████| 36/36 [01:16<00:00,  2.12s/it]


Predictions: ['B-A0', 'B-A1', 'X', 'O', 'O', 'O', 'O', 'X', 'O', 'B-V', 'O', 'B-A1', 'O', 'X', 'I-A1', 'O', 'I-A1', 'O', 'I-A1', 'I-A1', 'X', 'O']
Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Scores: [0.9926633834838867, 0.9665530920028687, 0.9793591499328613, 0.991998016834259, 0.9969558715820312, 0.9988555908203125, 0.9961336851119995, 0.9798697233200073, 0.9993796348571777, 0.9996517896652222, 0.9976978898048401, 0.9544804692268372, 0.9698025584220886, 0.9981369972229004, 0.9924359917640686, 0.9632140398025513, 0.9920843839645386, 0.9946407079696655, 0.9828525185585022, 0.9920355677604675, 0.9958076477050781, 0.9983377456665039]
Predictions: ['B-A0', 'I-A0', 'I-A0', 'I-A0', 'I-A0', 'I-A0', 'I-A0', 'I-A0', 'I-A0', 'I-A0', 'I-A0', 'I-A0', 'I-A0', 'O', 'O', 'O', 'B-V', 'B-A1', 'I-A0', 'I-A0', 'I-A1', 'O', 'I-A1', 'I-A1', 'O', 'B-A1', 'O', 'O', 'O', 'O']
Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '

Eval:   3%|▎         | 39/1151 [02:04<59:14,  3.20s/it]  


KeyboardInterrupt: 
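
The output above prints one score per predicted label. As a small sketch of how these could be used, the helper below keeps only predictions that are not the filler label ``X`` and whose score reaches a threshold. The helper name and the threshold of 0.97 are assumptions chosen for illustration; the sample values are the first four predictions above, with scores rounded.

```python
def confident_predictions(predictions, scores, threshold=0.97):
    """Return (index, label, score) for non-filler predictions at or above threshold."""
    return [
        (i, label, score)
        for i, (label, score) in enumerate(zip(predictions, scores))
        if label != "X" and score >= threshold
    ]


predictions = ["B-A0", "B-A1", "X", "O"]
scores = [0.9927, 0.9666, 0.9794, 0.9920]
kept = confident_predictions(predictions, scores)
print(kept)
```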

In [7]:
# samples = [['A G-to-A transition at the first nucleotide of intron 2 of patient 1 abolished normal splicing.'], ['Formation of liver metastases in mice was abolished by linomide and antibody-targeted superantigen therapy.']]
# tasks = ['conllsrl']
# pipe.infer(samples, tasks)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Eval: 100%|██████████| 1/1 [00:02<00:00,  2.36s/it]

Predictions: ['B-A0', 'O', 'X', 'I-A1', 'I-A0', 'I-A1', 'O', 'I-A0', 'I-A1', 'X', 'I-A1', 'X', 'X', 'X', 'O', 'I-A1']
Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Scores: [0.9921221137046814, 0.9697643518447876, 0.9951223731040955, 0.9938925504684448, 0.9754364490509033, 0.9904526472091675, 0.9940263032913208, 0.9925557971000671, 0.9918782711029053, 0.9933215975761414, 0.9890933036804199, 0.973213255405426, 0.9982563853263855, 0.9996533393859863, 0.9845981597900391, 0.9851794838905334]
Predictions: ['B-A1', 'I-A0', 'I-A1', 'I-A1', 'I-A0', 'I-A0', 'O', 'X', 'O', 'B-A0', 'I-A1', 'I-A0', 'I-A0', 'I-A0']
Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Scores: [0.9926537275314331, 0.9971977472305298, 0.9971291422843933, 0.9857472777366638, 0.9912330508232117, 0.9826040863990784, 0.9992601275444031, 0.9997195601463318, 0.99580317735672, 0.9956263303756714, 0.9892043471336365, 0.992823600769043, 0.9933537244796753, 0.9




[{'Query': ['A G-to-A transition at the first nucleotide of intron 2 of patient 1 abolished normal splicing.'],
  'conllsrl': [('A0', 'A'),
   ('A1', 'at'),
   ('A0', 'the'),
   ('A1', 'first'),
   ('A0', 'of'),
   ('A1', 'intron of splicing.')]},
 {'Query': ['Formation of liver metastases in mice was abolished by linomide and antibody-targeted superantigen therapy.'],
  'conllsrl': [('A1', 'Formation'),
   ('A0', 'of'),
   ('A1', 'liver metastases'),
   ('A0', 'in mice linomide'),
   ('A1', 'and'),
   ('A0', 'antibody-targeted superantigen therapy.')]}]
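
The result above groups tagged words into ``(label, text)`` pairs. The sketch below shows one straightforward way to merge BIO tags into such spans. It is not the library's own post-processing, whose grouping evidently differs in places; the helper ``group_bio`` is hypothetical and assumes word-level tokens with any ``X`` filler predictions already dropped.

```python
def group_bio(tokens, tags):
    """Merge B-/I- tagged tokens into (label, text) spans, skipping other tags."""
    spans, label, words = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # A new span starts here; flush the one in progress.
            if words:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-"):
            # Same label as the open span, so extend it.
            words.append(token)
        else:
            # "O" (or any unknown tag) closes the open span.
            if words:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if words:
        spans.append((label, " ".join(words)))
    return spans


tokens = ["Formation", "of", "liver", "metastases", "was", "abolished"]
tags = ["B-A1", "O", "B-A1", "I-A1", "O", "B-V"]
spans = group_bio(tokens, tags)
print(spans)
```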

In [None]:
# print task name
# pipe.taskParams.taskIdNameMap

OrderedDict([(0, 'conllsrl')])
