<a href="https://colab.research.google.com/github/Kira1108/huggingface-examples/blob/main/NerSimplified.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Install packages and import**

In [1]:
from IPython.display import clear_output

!pip install transformers datasets
!pip install git+https://github.com/Kira1108/huggingface_utils.git
!pip install seqeval
!pip install evaluate

clear_output()

In [16]:
from transformers import AutoTokenizer
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification
from transformers import TrainingArguments
from transformers import Trainer
from transformers import pipeline

from datasets import load_dataset
from huggingface_utils.labels import LabelAligner
from huggingface_utils.metrics.ner import NerMetric

**Load Dataset**

In [5]:
data = load_dataset("conll2003")
clear_output()

**Tokenization**

In [8]:
checkpoint = 'distilbert-base-cased'

In [9]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
label_names = data['train'].features['ner_tags'].feature.names
aligner = LabelAligner(label_names = label_names, use_iob = True)

def tokenize_fn(batch):

    # tokenize every example in the batch (inputs are already split into words)
    tokenized_inputs = tokenizer(
        batch['tokens'], truncation = True, is_split_into_words = True
    )

    # get true labels
    labels_batch = batch['ner_tags']

    aligned_labels_batch = []

    # for each example, align the word-level labels with the sub-word tokens
    for i, labels in enumerate(labels_batch):
        # get word_ids of the ith example
        word_ids = tokenized_inputs.word_ids(i)

        # align labels for ith example
        aligned_labels = aligner(labels = labels, word_ids = word_ids)
        aligned_labels_batch.append(aligned_labels)
    
    # attach the aligned labels to the BatchEncoding object
    tokenized_inputs['labels'] = aligned_labels_batch
    return tokenized_inputs

tokenized_dataset = data.map(
    tokenize_fn, 
    batched = True, 
    remove_columns = data['train'].column_names)

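
`LabelAligner` comes from the `huggingface_utils` helper package, and its internals are not shown in this notebook. As a rough illustration of what label alignment usually involves, here is a minimal pure-Python sketch of the common convention: special tokens (word id `None`) get `-100` so the loss ignores them, the first sub-token of a word keeps the word's label, and continuation sub-tokens of a `B-` word switch to the matching `I-` label. The function name `align_labels` and the example `word_ids` are hypothetical, and the real `LabelAligner` may behave differently.

```python
IGNORE_INDEX = -100  # default ignore index of PyTorch's cross-entropy loss

LABEL_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
               "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def align_labels(labels, word_ids, label_names):
    """Expand word-level label ids to token-level label ids (hypothetical helper)."""
    # Map every B-XXX id to its I-XXX id, e.g. B-PER (1) -> I-PER (2).
    b_to_i = {}
    for idx, name in enumerate(label_names):
        inside = "I-" + name[2:]
        if name.startswith("B-") and inside in label_names:
            b_to_i[idx] = label_names.index(inside)

    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP]: ignored by the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word:
            # First sub-token of a word keeps the word's label.
            aligned.append(labels[word_id])
        else:
            # Continuation sub-token: B-XXX becomes I-XXX, other labels stay.
            label = labels[word_id]
            aligned.append(b_to_i.get(label, label))
        previous_word = word_id
    return aligned


# Two words labelled B-LOC (5) and I-LOC (6); the first word is assumed
# to be split into two sub-tokens by the tokenizer.
example_word_ids = [None, 0, 0, 1, None]
aligned_example = align_labels([5, 6], example_word_ids, LABEL_NAMES)
print(aligned_example)
```

Some setups instead assign `-100` to every continuation sub-token so that only the first piece of each word contributes to the loss; which variant `use_iob = True` selects is not confirmed by the source.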

**DataCollator**

In [12]:
data_collator = DataCollatorForTokenClassification(tokenizer = tokenizer)
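
The collator pads every example in a batch to the length of the longest one, and it pads the `labels` column with `-100` (its default `label_pad_token_id`) so padded positions do not contribute to the loss. The sketch below imitates that behaviour in plain Python to make the shapes visible. The helper `pad_batch` and the token ids are illustrative assumptions; the real collator returns tensors and relies on the tokenizer for padding.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad input_ids, attention_mask and labels to a common length."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padded label positions are ignored by the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Made-up token ids; only the lengths matter for this illustration.
features = [
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 1, 2, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 0, -100]},
]
padded = pad_batch(features)
for key, rows in padded.items():
    print(key, rows)
```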

**Compute Metrics**

In [13]:
compute_metrics = NerMetric(label_names = label_names)

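
`NerMetric` is another helper from `huggingface_utils`, so what it does internally is an assumption here. A metric callback for token classification typically has to turn the model's logits into tag strings and drop the positions labelled `-100` before scoring. The sketch below shows only that decoding step; `decode_predictions` and the toy logits are hypothetical. The resulting lists of tag strings are in the shape that `seqeval` expects.

```python
def decode_predictions(logits, labels, label_names, ignore_index=-100):
    """Convert per-token logits and label ids into lists of tag strings."""
    true_tags, pred_tags = [], []
    for seq_logits, seq_labels in zip(logits, labels):
        seq_true, seq_pred = [], []
        for token_logits, label in zip(seq_logits, seq_labels):
            if label == ignore_index:
                # Special tokens and padding are excluded from scoring.
                continue
            scores = list(token_logits)
            best = scores.index(max(scores))  # argmax over the label axis
            seq_true.append(label_names[label])
            seq_pred.append(label_names[best])
        true_tags.append(seq_true)
        pred_tags.append(seq_pred)
    return true_tags, pred_tags


# Toy example with three labels and one sequence of four tokens.
toy_names = ["O", "B-PER", "I-PER"]
toy_logits = [[[0.1, 0.2, 0.0],   # special token, label -100
               [0.0, 2.0, 0.1],   # predicted B-PER
               [0.0, 1.5, 0.3],   # predicted B-PER (gold is I-PER)
               [3.0, 0.1, 0.2]]]  # predicted O
toy_labels = [[-100, 1, 2, 0]]
toy_true, toy_pred = decode_predictions(toy_logits, toy_labels, toy_names)
print(toy_true)
print(toy_pred)
```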

**Model, TrainingArguments and Trainer**

In [15]:
id2label = {k:v for k,v in enumerate(label_names)}
label2id = {v:k for k,v in id2label.items()}

model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, 
    id2label = id2label, 
    label2id = label2id
)

training_args = TrainingArguments(
    "distilbert_finetuned-ner",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate = 2e-5,
    num_train_epochs = 3,
    weight_decay = 0.01
)

trainer = Trainer(
    model = model,
    tokenizer = tokenizer,
    args = training_args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['validation'],
    data_collator = data_collator,
    compute_metrics = compute_metrics,
)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForTokenClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this 

**Model Training**

In [17]:
trainer.train()

***** Running training *****
  Num examples = 14041
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 5268
  Number of trainable parameters = 65197833
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.1046 | 0.084464 | 0.882823 | 0.911646 | 0.897003 | 0.975864 |
| 2 | 0.0447 | 0.070633 | 0.905213 | 0.932178 | 0.918498 | 0.981972 |
| 3 | 0.0276 | 0.069317 | 0.910028 | 0.936217 | 0.922937 | 0.982707 |
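
As a quick sanity check on the reported metrics, F1 is the harmonic mean of precision and recall, so the F1 column can be recomputed from the other two. The snippet below does this for the three epochs above; the variable names are illustrative, and small differences in the last digit can come from rounding in the printed values.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (epoch, precision, recall, reported F1) copied from the results above.
reported = [
    (1, 0.882823, 0.911646, 0.897003),
    (2, 0.905213, 0.932178, 0.918498),
    (3, 0.910028, 0.936217, 0.922937),
]

recomputed = {}
for epoch, precision, recall, f1 in reported:
    recomputed[epoch] = f1_from(precision, recall)
    print(f"epoch {epoch}: recomputed F1 = {recomputed[epoch]:.6f}, reported = {f1:.6f}")
```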


***** Running Evaluation *****
  Num examples = 3250
  Batch size = 8
Saving model checkpoint to distilbert_finetuned-ner/checkpoint-1756
Configuration saved in distilbert_finetuned-ner/checkpoint-1756/config.json
Model weights saved in distilbert_finetuned-ner/checkpoint-1756/pytorch_model.bin
tokenizer config file saved in distilbert_finetuned-ner/checkpoint-1756/tokenizer_config.json
Special tokens file saved in distilbert_finetuned-ner/checkpoint-1756/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3250
  Batch size = 8
Saving model checkpoint to distilbert_finetuned-ner/checkpoint-3512
Configuration saved in distilbert_finetuned-ner/checkpoint-3512/config.json
Model weights saved in distilbert_finetuned-ner/checkpoint-3512/pytorch_model.bin
tokenizer config file saved in distilbert_finetuned-ner/checkpoint-3512/tokenizer_config.json
Special tokens file saved in distilbert_finetuned-ner/checkpoint-3512/special_tokens_map.json
***** Running Evaluation *****
 

TrainOutput(global_step=5268, training_loss=0.08016827148113091, metrics={'train_runtime': 228.1631, 'train_samples_per_second': 184.618, 'train_steps_per_second': 23.089, 'total_flos': 462023079274890.0, 'train_loss': 0.08016827148113091, 'epoch': 3.0})
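
The step counts and throughput figures in the log follow from the dataset size, batch size, and number of epochs. The short calculation below reproduces them from the logged values, assuming a single device and no gradient accumulation, as the log reports. It also shows why checkpoints appear at steps 1756 and 3512: one is saved at the end of each epoch.

```python
import math

num_examples = 14041      # "Num examples" in the training log
batch_size = 8            # total train batch size
num_epochs = 3
train_runtime = 228.1631  # seconds, from TrainOutput

# The last partial batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# With save_strategy = "epoch", a checkpoint is written after every epoch.
checkpoint_steps = [steps_per_epoch * e for e in range(1, num_epochs + 1)]

steps_per_second = total_steps / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("checkpoint steps:", checkpoint_steps)
print(f"steps/s: {steps_per_second:.3f}, samples/s: {samples_per_second:.3f}")
```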

**Model Persistence**

In [18]:
trainer.save_model("my_saved_model")

Saving model checkpoint to my_saved_model
Configuration saved in my_saved_model/config.json
Model weights saved in my_saved_model/pytorch_model.bin
tokenizer config file saved in my_saved_model/tokenizer_config.json
Special tokens file saved in my_saved_model/special_tokens_map.json


**Load and use model**

In [19]:
ner = pipeline("token-classification", model = 'my_saved_model', aggregation_strategy = 'simple', device = 0)
ner("Bill and John went to Washonton DC yesterday.")

loading configuration file my_saved_model/config.json
Model config DistilBertConfig {
  "_name_or_path": "my_saved_model",
  "activation": "gelu",
  "architectures": [
    "DistilBertForTokenClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "O",
    "1": "B-PER",
    "2": "I-PER",
    "3": "B-ORG",
    "4": "I-ORG",
    "5": "B-LOC",
    "6": "I-LOC",
    "7": "B-MISC",
    "8": "I-MISC"
  },
  "initializer_range": 0.02,
  "label2id": {
    "B-LOC": 5,
    "B-MISC": 7,
    "B-ORG": 3,
    "B-PER": 1,
    "I-LOC": 6,
    "I-MISC": 8,
    "I-ORG": 4,
    "I-PER": 2,
    "O": 0
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "vocab_size": 28996
}

[{'entity_group': 'PER',
  'score': 0.9967519,
  'word': 'Bill',
  'start': 0,
  'end': 4},
 {'entity_group': 'PER',
  'score': 0.9981799,
  'word': 'John',
  'start': 9,
  'end': 13},
 {'entity_group': 'LOC',
  'score': 0.99729055,
  'word': 'Washonton DC',
  'start': 22,
  'end': 34}]
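
The `start` and `end` values in each result are character offsets into the input string, so they can be used to slice the original text or to group entities by type. The sketch below works directly on the output shown above, copied in as plain Python data; the variable names are illustrative.

```python
text = "Bill and John went to Washonton DC yesterday."

# Pipeline output copied from the cell above.
entities = [
    {"entity_group": "PER", "score": 0.9967519, "word": "Bill", "start": 0, "end": 4},
    {"entity_group": "PER", "score": 0.9981799, "word": "John", "start": 9, "end": 13},
    {"entity_group": "LOC", "score": 0.99729055, "word": "Washonton DC", "start": 22, "end": 34},
]

# Recover each entity's surface form from the character offsets.
spans = [text[ent["start"]:ent["end"]] for ent in entities]

# Group the surface forms by entity type.
by_type = {}
for ent, span in zip(entities, spans):
    by_type.setdefault(ent["entity_group"], []).append(span)

print(spans)
print(by_type)
```

Slicing by offsets is generally safer than relying on the `word` field, because `word` is rebuilt from sub-word tokens and may not always match the original spelling or spacing exactly.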