# Named Entity Recognition (NER)

So far, we have used transformers to do basic classification for things such as sentiment analysis.
But what about doing more with that data?

<table align=left>
    <tr>
        <td>
            <ul style="text-align:left">
                <li><h1>Collection of Reviews</h1></li>
                <li><h1>Determine Sentiment</h1></li>
                <li><h1>But what are they talking about?</h1></li>
            </ul>
        </td>
        <td>
            <img src="images/sentimentexample.png">
        </td>
    </tr>
    </table>


Named Entity Recognition is the task of identifying and tagging entities in unstructured text.

- Who
- What
- Where
- When
- Which

Factual information and knowledge are normally expressed by or about named entities. Our task is to find them.

NER is at the core of automatic information extraction systems.

### Information Sources

<div>
<img align=left style="height:400px;" src="images/NERInformationSources.png">
</div>

### Tagging

Typically we use the BIO (beginning, inside, outside) notation for tagging.

Often, we will also add a class to each tag, such as `B-PER` or `I-LOC`.

<img align=left src="images/nertaggingtable.png">
<img align=left src="images/nertaggingexample.png">
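To make the BIO scheme concrete, here is a minimal sketch that turns entity spans into BIO tags with a class suffix. The helper name `spans_to_bio`, the span format (start index, exclusive end index, class), and the hand-labeled spans are assumptions made for illustration only.

```python
def spans_to_bio(num_tokens, spans):
    """Convert (start, end, class) token spans into BIO tags; end is exclusive."""
    tags = ["O"] * num_tokens              # O = outside any entity
    for start, end, cls in spans:
        tags[start] = f"B-{cls}"           # B = beginning of an entity
        for i in range(start + 1, end):
            tags[i] = f"I-{cls}"           # I = inside the same entity
    return tags

tokens = ["Jack", "Sparrow", "loves", "New", "York", "!"]
spans = [(0, 2, "PER"), (3, 5, "LOC")]     # hand-labeled for illustration
bio_tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token:10s}{tag}")
```

Two adjacent entities of the same class stay distinguishable because each one starts with its own `B-` tag.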

## So let's start playing and experimenting with NER

In [1]:
from transformers import pipeline
import pandas as pd
import torch



#### There are many pretrained transformers for basic NLP tasks
#### We can instantiate one for a given task with the `pipeline` function
#### It is called a pipeline because it chains a tokenizer, a model, and post-processing of the model's output into a single callable
<pre>
 pipeline(task-name)
    
 pipeline(task-name, model = model-name, tokenizer = tokenizer-name)
</pre>

In [2]:
# Simple NER
nerpipe = pipeline('ner')            

# Try it with some aggregation
#nerpipe = pipeline('ner', aggregation_strategy="simple")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [3]:
text = "Freddy Fox the quick brown fox jumps over the lazy dog, Fido The Wonder Dog."
outputs = nerpipe(text)

In [4]:
pd.DataFrame.from_records(outputs)

Unnamed: 0,entity,score,index,word,start,end
0,I-PER,0.998655,1,Freddy,0,6
1,I-PER,0.998346,2,Fox,7,10
2,I-PER,0.98128,13,Fi,56,58
3,I-PER,0.961059,14,##do,58,60
4,I-PER,0.734078,15,The,61,64
5,I-PER,0.841098,16,Wonder,65,71
6,I-PER,0.731714,17,Dog,72,75
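The table above has one row per wordpiece, so "Fido" shows up as `Fi` and `##do`. The commented-out `aggregation_strategy="simple"` option asks the pipeline to group such pieces. The sketch below shows the rough idea, not the library's actual implementation: merge rows whose token indices are consecutive and whose entity class matches, then slice the original text with the character offsets. `group_entities` is a hypothetical helper, and the records are copied from the table (scores omitted).

```python
def group_entities(text, records):
    """Merge rows with consecutive token indices and the same entity class."""
    groups = []
    for rec in records:
        cls = rec["entity"].split("-")[-1]           # "I-PER" -> "PER"
        last = groups[-1] if groups else None
        if last and last["cls"] == cls and rec["index"] == last["index"] + 1:
            last["end"] = rec["end"]                 # extend the current span
            last["index"] = rec["index"]
        else:
            groups.append({"cls": cls, "index": rec["index"],
                           "start": rec["start"], "end": rec["end"]})
    return [(g["cls"], text[g["start"]:g["end"]]) for g in groups]

text = "Freddy Fox the quick brown fox jumps over the lazy dog, Fido The Wonder Dog."
records = [
    {"entity": "I-PER", "index": 1, "start": 0, "end": 6},
    {"entity": "I-PER", "index": 2, "start": 7, "end": 10},
    {"entity": "I-PER", "index": 13, "start": 56, "end": 58},
    {"entity": "I-PER", "index": 14, "start": 58, "end": 60},
    {"entity": "I-PER", "index": 15, "start": 61, "end": 64},
    {"entity": "I-PER", "index": 16, "start": 65, "end": 71},
    {"entity": "I-PER", "index": 17, "start": 72, "end": 75},
]
entities = group_entities(text, records)
print(entities)
```

Slicing the original text by offsets avoids having to strip the `##` prefixes by hand.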


# What if we want to fine-tune our own model?

In [5]:
# First we need to load up the dataset
# Let's create a multi-lingual dataset similar to the distribution of languages in Switzerland

In [6]:
from datasets import load_dataset
from collections import defaultdict
from datasets import DatasetDict

In [7]:
langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=0)
            .select(range(int(frac * ds[split].num_rows))))



In [8]:
# What does our dataset look like?

In [9]:
pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs},
             index=["Number of training examples"])

Unnamed: 0,de,fr,it,en
Number of training examples,12580,4580,1680,1180
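The subset sizes come from the shuffle-then-select step: each split is shuffled with a fixed seed and the first `int(frac * num_rows)` rows are kept. Here is a minimal sketch of that idea, with a plain Python list standing in for a `Dataset`. The `downsample` helper and the toy corpus are assumptions for illustration.

```python
import random

def downsample(rows, frac, seed=0):
    """Shuffle deterministically, then keep the first int(frac * len(rows)) rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[: int(frac * len(rows))]

corpus = list(range(1000))             # stand-in for one split of a corpus
subset = downsample(corpus, frac=0.25)
print(len(subset))                     # 250 rows kept
```

The counts in the table are what this rule would give if each monolingual training split had 20,000 rows (for example, 12580 / 0.629 = 20000). That figure is inferred from the numbers above rather than stated in the notebook.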


In [10]:
# German has the most examples, so build a German-only subset and see how a model trained on it does on everything

In [11]:
tags = panx_ch["de"]["train"].features["ner_tags"].feature
print(tags)

ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None)


In [12]:
def create_tag_names(batch):
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

panx_de = panx_ch["de"].map(create_tag_names)



In [13]:
# Let's look at one

In [14]:
de_example = panx_de["train"][0]
pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]], ['Tokens', 'Tags'])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
Tokens,2.000,Einwohnern,an,der,Danziger,Bucht,in,der,polnischen,Woiwodschaft,Pommern,.
Tags,O,O,O,O,B-LOC,I-LOC,O,O,B-LOC,B-LOC,I-LOC,O


In [15]:
# What are the frequencies of the tags?

In [16]:
from collections import Counter

split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items():
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")

Unnamed: 0,LOC,ORG,PER
train,6186,5366,5810
validation,3172,2683,2893
test,3180,2573,3071


In [17]:
# Looks good. Now we have our dataset

In [18]:
panx_de

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'ner_tags_str'],
        num_rows: 12580
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'ner_tags_str'],
        num_rows: 6290
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'ner_tags_str'],
        num_rows: 6290
    })
})

In [19]:
panx_de['train'][0]

{'tokens': ['2.000',
  'Einwohnern',
  'an',
  'der',
  'Danziger',
  'Bucht',
  'in',
  'der',
  'polnischen',
  'Woiwodschaft',
  'Pommern',
  '.'],
 'ner_tags': [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0],
 'langs': ['de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de'],
 'ner_tags_str': ['O',
  'O',
  'O',
  'O',
  'B-LOC',
  'I-LOC',
  'O',
  'O',
  'B-LOC',
  'B-LOC',
  'I-LOC',
  'O']}

# Now let's look at transformers

## Architecture of a transformer encoder for classification

<img alt="Architecture of a transformer encoder for classification." caption="Fine-tuning an encoder-based transformer for sequence classification" src="notebooks/images/chapter04_clf-architecture.png" id="clf-arch"/>

Fine-tuning an encoder-based transformer for sequence classification

## Architecture of a transformer encoder for token classification

The wide linear layer shows that the same linear layer is applied to all hidden states.

<img alt="Architecture of a transformer encoder for named entity recognition. The wide linear layer shows that the same linear layer is applied to all hidden states." caption="Fine-tuning an encoder-based transformer for named entity recognition" src="notebooks/images/chapter04_ner-architecture.png" id="ner-arch"/>

Fine-tuning an encoder-based transformer for named entity recognition
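The difference between the two architectures is only where the linear head is applied. The toy sketch below uses plain Python lists in place of tensors; the hidden states, weights, and bias are made-up numbers, and `linear` is a hypothetical helper. A sequence classification head typically reads only the first token's hidden state, while a token classification head applies the same weights to every hidden state.

```python
def linear(hidden, weights, bias):
    """Apply one linear layer to a single hidden-state vector."""
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(weights, bias)]

# Toy encoder output: 3 tokens, hidden size 2 (values are made up)
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy head with 2 labels: weights has shape [num_labels, hidden_size]
weights = [[1.0, -1.0], [0.5, 2.0]]
bias = [0.0, 0.1]

# Sequence classification: only the first token's hidden state feeds the head
sequence_logits = linear(hidden_states[0], weights, bias)

# Token classification: the same head is applied to every hidden state
token_logits = [linear(h, weights, bias) for h in hidden_states]

print(len(sequence_logits))                       # num_labels
print(len(token_logits), len(token_logits[0]))    # seq_len, num_labels
```

With a batch dimension added, the token classification output has shape `[batch, seq_len, num_labels]`.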

## The Anatomy of the Transformers Model Class

### Bodies and Heads

<img alt="bert-body-head" caption="The `BertModel` class only contains the body of the model, while the `BertFor&lt;Task&gt;` classes combine the body with a dedicated head for a given task" src="notebooks/images/chapter04_bert-body-head.png" id="bert-body-head"/>

The `BertModel` class only contains the body of the model, while the `BertFor<Task>` classes combine the body with a dedicated head for a given task

## Now we can build and fine-tune our transformer 

In [20]:
from transformers import AutoTokenizer

# Candidate checkpoints; uncomment one to switch (only the last assignment takes effect)
# model_name = "xlm-roberta-base"
# model_name = "bert-base-cased"
model_name = "distilbert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)


In [21]:
# Let's see how the tokenizer works

text = "Jack Sparrow loves New York!"
tokens = tokenizer(text).tokens()
df = pd.DataFrame([tokens], index=[model_name])
df

Unnamed: 0,0,1,2,3,4,5,6,7,8
distilbert-base-cased,[CLS],Jack,Spa,##rrow,loves,New,York,!,[SEP]
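The `##` prefix marks a piece that continues the previous one. BERT-style WordPiece tokenizers split a word by greedy longest-match-first lookup against the vocabulary. The sketch below illustrates that lookup with a tiny toy vocabulary, which is an assumption for illustration and not the real `distilbert-base-cased` vocabulary; `wordpiece` is a hypothetical helper.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into wordpieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate    # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                            # try a shorter prefix
        if piece is None:
            return [unk]                        # nothing matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"Jack", "Spa", "##rrow", "loves", "New", "York", "!"}
print(wordpiece("Sparrow", toy_vocab))
print(wordpiece("Jack", toy_vocab))
```

This is why a rare name can be split into several pieces while a common word stays whole.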


#### Load the transformer

In [22]:
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}

In [23]:
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_name, 
                                    num_labels=tags.num_classes,
                                    id2label=index2tag, label2id=tag2index)

In [24]:
from transformers import AutoModelForTokenClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Pass the config built above so the model carries the tag names (id2label / label2id)
model = AutoModelForTokenClassification.from_pretrained(model_name, config=config).to(device)


Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForTokenClassification: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this 

In [25]:
input_ids = tokenizer.encode(text, return_tensors="pt")
pd.DataFrame([tokens, input_ids[0].numpy()], index=["Tokens", "Input IDs"])

Unnamed: 0,0,1,2,3,4,5,6,7,8
Tokens,[CLS],Jack,Spa,##rrow,loves,New,York,!,[SEP]
Input IDs,101,2132,23665,8674,7871,1203,1365,106,102


In [26]:
outputs = model(input_ids.to(device)).logits
predictions = torch.argmax(outputs, dim=-1)
print(f"Number of tokens in sequence: {len(tokens)}")
print(f"Shape of outputs: {outputs.shape}")

Number of tokens in sequence: 9
Shape of outputs: torch.Size([1, 9, 7])


#### Let's tokenize our dataset. We will only tokenize the German (de) data and see how things go

In [27]:
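def tokenize_and_align_labels(examples):
    # The input is already split into words, so tell the tokenizer not to split it again
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True,
                                 is_split_into_words=True)
    labels = []
    for idx, label in enumerate(examples["ner_tags"]):
        # word_ids maps each wordpiece to the index of its source word (None for special tokens)
        word_ids = tokenized_inputs.word_ids(batch_index=idx)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Mask special tokens and continuation wordpieces with -100 so the loss ignores them
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)
            else:
                # The first wordpiece of a word takes the word's label
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs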

In [28]:
def encode_panx_dataset(corpus):
    return corpus.map(tokenize_and_align_labels, batched=True)

In [29]:
panx_de_encoded = encode_panx_dataset(panx_ch["de"])


In [30]:
panx_de_encoded

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 12580
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 6290
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 6290
    })
})

In [31]:
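# Note: 'tokens' holds the 12 original words, while 'labels' has one entry per wordpiece
# (32 here, including special tokens), so the Tokens row does not line up with the label rows.
# tokenizer.convert_ids_to_tokens(panx_de_encoded['train'][0]['input_ids']) would give aligned wordpieces.
tokens = panx_de_encoded['train']['tokens'][0]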
label_ids = panx_de_encoded['train']['labels'][0]
labels = [index2tag[l] if l != -100 else "IGN" for l in label_ids]
index = ["Tokens", "Label IDs", "Labels"]

pd.DataFrame([tokens,  label_ids, labels], index=index)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
Tokens,2.000,Einwohnern,an,der,Danziger,Bucht,in,der,polnischen,Woiwodschaft,...,,,,,,,,,,
Label IDs,-100,0,-100,-100,0,-100,-100,-100,0,0,...,5,-100,-100,-100,-100,6,-100,-100,0,-100
Labels,IGN,O,IGN,IGN,O,IGN,IGN,IGN,O,O,...,B-LOC,IGN,IGN,IGN,IGN,I-LOC,IGN,IGN,O,IGN
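The alignment rule is easier to see on a short input. The sketch below applies the same logic to hand-written word IDs, so it runs without a tokenizer. The `align_labels` helper and the example word IDs are assumptions for illustration; the label IDs follow the `ClassLabel` order shown earlier (`O`=0, `B-PER`=1, `I-PER`=2).

```python
IGNORE = -100  # label ID that the loss skips

def align_labels(word_ids, word_labels):
    """Label the first wordpiece of each word; mask everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE)                 # special token or continuation piece
        else:
            aligned.append(word_labels[word_id])   # first piece of a new word
        previous = word_id
    return aligned

# "Jack Sparrow loves" -> [CLS] Jack Spa ##rrow loves [SEP]
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 2, 0]                            # B-PER, I-PER, O
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Only the first piece of "Sparrow" keeps the `I-PER` label, so a word counts once in the loss regardless of how many pieces it is split into.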


#### Now let's fine-tune the transformer

In [32]:
import numpy as np

def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=2)
    batch_size, seq_len = preds.shape
    labels_list, preds_list = [], []

    for batch_idx in range(batch_size):
        example_labels, example_preds = [], []
        for seq_idx in range(seq_len):
            # Ignore label IDs = -100
            if label_ids[batch_idx, seq_idx] != -100:
                example_labels.append(index2tag[label_ids[batch_idx][seq_idx]])
                example_preds.append(index2tag[preds[batch_idx][seq_idx]])

        labels_list.append(example_labels)
        preds_list.append(example_preds)

    return preds_list, labels_list

In [33]:
from seqeval.metrics import f1_score

def compute_metrics(eval_pred):
    y_pred, y_true = align_predictions(eval_pred.predictions, 
                                       eval_pred.label_ids)
    return {"f1": f1_score(y_true, y_pred)}
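The F1 reported here is computed over whole entities rather than individual tokens: a prediction counts only if its span and class both match exactly. The sketch below is a simplified stand-in for that idea and not seqeval's implementation. For example, it ignores stray `I-` tags that have no preceding `B-`, which the real library may handle differently. `extract_entities` and `entity_f1` are hypothetical helpers.

```python
def extract_entities(tags):
    """Return a set of (class, start, end) spans from one BIO tag sequence."""
    spans, start, cls = set(), None, None
    for i, tag in enumerate(tags + ["O"]):         # sentinel closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == cls
        if start is not None and not inside:
            spans.add((cls, start, i))             # close the open span
            start, cls = None, None
        if tag.startswith("B-"):
            start, cls = i, tag[2:]                # open a new span
    return spans

def entity_f1(y_true, y_pred):
    """Micro-averaged F1 over exact span matches."""
    true_spans, pred_spans = set(), set()
    for sent_idx, (t, p) in enumerate(zip(y_true, y_pred)):
        true_spans |= {(sent_idx, *s) for s in extract_entities(t)}
        pred_spans |= {(sent_idx, *s) for s in extract_entities(p)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(y_true, y_pred)
print(round(score, 3))
```

In this example one of two true entities is found and the single prediction is correct, so precision is 1.0 and recall is 0.5.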

In [34]:
from transformers import TrainingArguments

num_epochs = 3
batch_size = 24
logging_steps = len(panx_de_encoded["train"]) // batch_size

training_args = TrainingArguments(
                    output_dir=f"{model_name}-fine-tuned", 
                    log_level="error", 
                    num_train_epochs=num_epochs, 
                    per_device_train_batch_size=batch_size, 
                    per_device_eval_batch_size=batch_size, 
                    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
                    save_steps=1e6, 
                    weight_decay=0.01,  
                    disable_tqdm=False, 
                    logging_steps=logging_steps, 
                    push_to_hub=False)

In [35]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

In [36]:
# Tokenizers throws the warning "The current process just got forked, Disabling parallelism to avoid deadlocks."
# To disable this warning, explicitly set TOKENIZERS_PARALLELISM=(true | false)
%env TOKENIZERS_PARALLELISM=false

env: TOKENIZERS_PARALLELISM=false


In [37]:
from transformers import Trainer

trainer = Trainer(model,
                  args=training_args, 
                  data_collator=data_collator, 
                  compute_metrics=compute_metrics,
                  train_dataset=panx_de_encoded["train"],
                  eval_dataset=panx_de_encoded["validation"], 
                  tokenizer=tokenizer)

In [38]:
trainer.train()





Epoch,Training Loss,Validation Loss,F1
1,No log,0.208481,0.749097
2,0.248900,0.181159,0.783522
3,0.248900,0.181369,0.801287


TrainOutput(global_step=789, training_loss=0.19849839107769676, metrics={'train_runtime': 116.8292, 'train_samples_per_second': 323.036, 'train_steps_per_second': 6.753, 'total_flos': 684209457084120.0, 'train_loss': 0.19849839107769676, 'epoch': 3.0})
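The "No log" entry in epoch 1 and the identical training loss in epochs 2 and 3 follow from step arithmetic. `logging_steps` was computed from the per-device batch size of 24, but `global_step=789` over 3 epochs means 263 optimizer steps per epoch. That matches an effective batch size of 48, for instance two devices with 24 examples each. The two-device figure is an inference from the numbers, not something the notebook states.

```python
import math

num_rows = 12580              # training examples, from the dataset summary
per_device_batch = 24
num_epochs = 3
global_step = 789             # reported by TrainOutput

logging_steps = num_rows // per_device_batch          # 524
steps_per_epoch = global_step // num_epochs           # 263

# Assumption: 2 devices x 24 examples gives an effective batch size of 48
effective_batch = per_device_batch * 2
assert math.ceil(num_rows / effective_batch) == steps_per_epoch

# Steps at which the training loss gets logged during the run
log_points = list(range(logging_steps, global_step + 1, logging_steps))
first_log_epoch = math.ceil(log_points[0] / steps_per_epoch)
print(logging_steps, steps_per_epoch, log_points, first_log_epoch)
```

Under that assumption the loss is logged once, at step 524 in the second epoch, so epoch 1 has nothing to show and epoch 3 repeats the epoch 2 value. Dividing by the effective batch size when computing `logging_steps` would give one log per epoch.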

In [39]:
def tag_text(text, tags, model, tokenizer):
    # Get tokens, including special tokens such as [CLS] and [SEP]
    tokens = tokenizer(text).tokens()
    # Encode the sequence into IDs
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    # Get logits over the 7 possible classes for each token
    outputs = model(input_ids)[0]
    # Take argmax to get most likely class per token
    predictions = torch.argmax(outputs, dim=2)
    # Convert to DataFrame
    preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
    return pd.DataFrame([tokens, preds], index=["Tokens", "Tags"])

In [40]:
text_de = "Jeff Dean ist ein Informatiker bei Google in Kalifornien"
tag_text(text_de, tags, trainer.model, tokenizer)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
Tokens,[CLS],Jeff,Dean,is,##t,e,##in,In,##form,##ati,##ker,be,##i,Google,in,Kali,##fo,##rn,##ien,[SEP]
Tags,O,B-PER,I-PER,O,O,O,O,O,O,O,O,O,O,B-ORG,O,B-LOC,I-ORG,I-LOC,I-LOC,O
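The output has one tag per wordpiece, and the tags on continuation pieces (such as `I-ORG` on `##fo`) are unreliable because those positions were masked with -100 during training. One common way to read the result is to rebuild the words and keep only the tag of each word's first piece. The sketch below does that for the tokens and tags in the table above; `merge_wordpieces` is a hypothetical helper, not part of the notebook or the library.

```python
def merge_wordpieces(tokens, tags, special=("[CLS]", "[SEP]")):
    """Rebuild words from wordpieces and keep the tag of each word's first piece."""
    words, word_tags = [], []
    for token, tag in zip(tokens, tags):
        if token in special:
            continue                        # drop special tokens
        if token.startswith("##") and words:
            words[-1] += token[2:]          # continuation piece: extend the word, drop its tag
        else:
            words.append(token)
            word_tags.append(tag)
    return list(zip(words, word_tags))

tokens = ["[CLS]", "Jeff", "Dean", "is", "##t", "e", "##in", "In", "##form", "##ati",
          "##ker", "be", "##i", "Google", "in", "Kali", "##fo", "##rn", "##ien", "[SEP]"]
tags = ["O", "B-PER", "I-PER", "O", "O", "O", "O", "O", "O", "O",
        "O", "O", "O", "B-ORG", "O", "B-LOC", "I-ORG", "I-LOC", "I-LOC", "O"]
word_level = merge_wordpieces(tokens, tags)
for word, tag in word_level:
    print(f"{word:15s}{tag}")
```

Read this way, the model tags "Jeff Dean" as a person, "Google" as an organization, and "Kalifornien" as a location.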
