# Fine-tune a pretrained 🤗 model for SoftSkill NER

This notebook shows how to fine-tune a custom NER model for soft skills using the 🤗 Hugging Face pretrained model [distilbert](https://huggingface.co/distilbert-base-uncased).

We will use the 🤗 Transformers [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) for this, as it is the simplest way to fine-tune a 🤗 Transformers model. You can instead fine-tune models in native PyTorch or TensorFlow, which gives you the flexibility to write your own custom training loop.

In [None]:
# Transformers installation
! pip install transformers datasets
! pip install seqeval

## Load custom softskills dataset

We will use a custom-built, tokenized and annotated dataset for soft skill NER, as no such dataset is available in the open domain.

There are only 119 sentences, each annotated with one or more named entities. This is enough to run training and get decent results; for production use cases, however, it is advisable to compile more data, depending on the type and number of entities the model should recognize.

The training data covers typical soft skills such as "positive attitude", "leadership" and "customer focus". Take a look at `raw_training_sentences.csv` and `train_ner.json` in this repo to get a good idea of how the custom NER training data was prepared.
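
For orientation, a single record in such a file might look like the sketch below. The field names `id`, `tokens` and `ner_tags` match the columns used later in this notebook; the sentence, the tags and the `id` value are invented for illustration and are not taken from `train_ner.json`.

```python
# Hypothetical record: the sentence, tags and id are made up for illustration.
label_names = ["O", "SoftSkill"]

record = {
    "id": 0,
    "tokens": ["She", "shows", "strong", "leadership", "and", "customer", "focus"],
    "ner_tags": [0, 0, 0, 1, 0, 1, 1],  # indices into label_names
}

# Every word needs exactly one tag.
assert len(record["tokens"]) == len(record["ner_tags"])

# Pair each word with its label name and keep the soft skill words.
tagged = [(tok, label_names[tag]) for tok, tag in zip(record["tokens"], record["ner_tags"])]
skills = [tok for tok, name in tagged if name == "SoftSkill"]
print(skills)
```

Because the tags are per word, a multi-word skill such as "customer focus" is simply two consecutive words tagged `SoftSkill`.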

In [2]:
from datasets import load_dataset
import torch
from tqdm.notebook import tqdm

#Separate into train and test datasets
data_files = {"train": "./data/train_ner.json", "test": "./data/test_ner.json"}

#Load the custom NER dataset using the Hugging Face datasets library
skillner = load_dataset('json', data_files=data_files)

#Let's define the label names
label_names = ['O','SoftSkill']

#Let's set the device to cpu
device = torch.device('cpu')

  from .autonotebook import tqdm as notebook_tqdm
Using custom data configuration default-e2cfc8a42dad81ca
Reusing dataset json (C:\Users\ashut\.cache\huggingface\datasets\json\default-e2cfc8a42dad81ca\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)
100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 336.00it/s]


We need a tokenizer to process the text, with a truncation strategy to handle overly long sequences; padding to a common length is handled per batch by the data collator created at the end of the next cell. To process the dataset in one step, use the 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/process.html#map) method to apply a preprocessing function over the entire dataset:

In [3]:
from transformers import AutoTokenizer

#load distilbert tokenizer 
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Our training data is already split into words and labeled. We still need to run the
# tokenizer to add the special start and end tokens and to split words into the sub-tokens the pretrained model was trained on,
# and then realign the labels with the expanded tokens
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

#Apply the tokenization and alignment to all the rows of training data in one go
tokenized_skillner = skillner.map(tokenize_and_align_labels, batched=True)


from transformers import DataCollatorForTokenClassification

#Batch the data for training 
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

Loading cached processed dataset at C:\Users\ashut\.cache\huggingface\datasets\json\default-e2cfc8a42dad81ca\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b\cache-d0401a394c2ae949.arrow
Loading cached processed dataset at C:\Users\ashut\.cache\huggingface\datasets\json\default-e2cfc8a42dad81ca\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b\cache-8995eebec30ed675.arrow
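
To make the alignment in `tokenize_and_align_labels` concrete, here is a minimal sketch that runs the same rule on a hand-written `word_ids` list, so no tokenizer download is needed. The sub-token split shown for "teamwork" is an assumption for illustration; the real tokenizer may split it differently. The helper name `align_labels` is hypothetical.

```python
# Hand-written stand-in for tokenized_inputs.word_ids(); the split is hypothetical.
sub_tokens = ["[CLS]", "strong", "team", "##work", "skills", "[SEP]"]
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 1, 0]  # one label per original word: "strong", "teamwork", "skills"


def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first sub-token of each word; mask special tokens and continuations."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_idx])
        previous = word_idx
    return aligned


aligned = align_labels(word_ids, word_labels)
for token, label in zip(sub_tokens, aligned):
    print(f"{token:10} {label}")
```

Positions marked `-100` are skipped by the loss, since `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so each word contributes exactly once regardless of how many sub-tokens it is split into.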


## Train

🤗 Transformers provides a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.

Let's load the model and set the number of labels. Since each token is classified as either 1) a soft skill or 2) anything else, we have only two labels.

In [4]:
from transformers import AutoModelForTokenClassification

#load distilbert base uncased model to be used for token classification. Number of labels = 2 as we are only recognizing Softskill token as 1 and rest all as 0
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=2).to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForTokenClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN t

<Tip>

You will see a warning about some of the pretrained weights not being used and some weights being randomly
initialized. Don't worry, this is completely normal! The pretrained head of the DistilBERT model is discarded and replaced with a randomly initialized classification head. We will fine-tune this new model head on the token classification task, transferring the knowledge of the pretrained model to it.

</Tip>

### Training hyperparameters

Next, let's create a [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) object, which contains all the hyperparameters we can tune as well as flags for activating different training options. For this tutorial we set a handful of values explicitly and leave the remaining [hyperparameters](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) at their defaults, but feel free to experiment with these to find your optimal settings.

Specify where to save the checkpoints from this training:

In [5]:
from transformers import TrainingArguments

#training args for this tutorial; anything not set here keeps its default value
training_args = TrainingArguments(
    output_dir="./skillner_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,    
)

### Metrics config

[Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) does not automatically evaluate model performance during training. We need to pass [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) a function to compute and report metrics. The 🤗 Datasets library provides the `load_metric` function (see this [tutorial](https://huggingface.co/docs/datasets/metrics.html) for more information), which we use here to load the `seqeval` metric for sequence labeling.

Call `compute` on `metric` to calculate precision, recall, F1 and accuracy for your predictions. Before passing your predictions to `compute`, you need to convert the logits to predictions (remember all 🤗 Transformers models return logits):

In [6]:
import numpy as np
from datasets import load_metric

metric = load_metric("seqeval")


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
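
To see what the argmax and masking steps in `compute_metrics` do, here is a small sketch on toy values. It uses plain Python lists instead of NumPy so it runs without extra packages; the logits and labels are made up, and the `argmax` helper is a hypothetical stand-in for `np.argmax(logits, axis=-1)`.

```python
label_names = ["O", "SoftSkill"]

# Toy values for one sequence of four positions: [CLS], two words, [SEP].
logits = [[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.9, 0.8]]
labels = [-100, 1, 0, -100]


def argmax(row):
    """Index of the largest score in one row of logits."""
    return max(range(len(row)), key=row.__getitem__)


predictions = [argmax(row) for row in logits]

# Keep only positions whose gold label is not the ignore index (-100).
true_labels = [label_names[l] for l in labels if l != -100]
true_predictions = [label_names[p] for p, l in zip(predictions, labels) if l != -100]

print(true_labels)
print(true_predictions)
```

The model still produces a prediction for the special tokens, but those positions are dropped before scoring, so only real words count towards the metrics.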


### Trainer

Create a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) object with your model, training arguments, training and test datasets, and evaluation function:

In [7]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_skillner["train"],
    eval_dataset=tokenized_skillner["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Then fine-tune your model by calling [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train):

In [8]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: ner_tags, tokens, id. If ner_tags, tokens, id are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 101
  Num Epochs = 10
  Instantaneous batch size per device = 10
  Total train batch size (w. parallel, distributed & accumulation) = 10
  Gradient Accumulation steps = 1
  Total optimization steps = 110


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.5499 | 0.449371 | 0.0 | 0.0 | 0.0 | 0.820628 |
| 2 | 0.3994 | 0.361132 | 0.0 | 0.0 | 0.0 | 0.820628 |
| 3 | 0.3194 | 0.301087 | 0.8 | 0.1 | 0.177778 | 0.834081 |
| 4 | 0.2397 | 0.243414 | 0.785714 | 0.55 | 0.647059 | 0.892377 |
| 5 | 0.1836 | 0.211946 | 0.828571 | 0.725 | 0.773333 | 0.923767 |
| 6 | 0.1383 | 0.193942 | 0.810811 | 0.75 | 0.779221 | 0.923767 |
| 7 | 0.1021 | 0.183426 | 0.810811 | 0.75 | 0.779221 | 0.923767 |
| 8 | 0.087 | 0.176324 | 0.794872 | 0.775 | 0.78481 | 0.923767 |
| 9 | 0.0773 | 0.178077 | 0.810811 | 0.75 | 0.779221 | 0.923767 |
| 10 | 0.0683 | 0.176858 | 0.789474 | 0.75 | 0.769231 | 0.919283 |
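
The F1 column is the harmonic mean of precision and recall, which can be checked against a few rows of the results above. The sketch below is a plain re-computation, not the `seqeval` implementation; the helper name `f1_score` is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from epochs 1, 3 and 5 of the results.
rows = [(0.0, 0.0, 0.0), (0.8, 0.1, 0.177778), (0.828571, 0.725, 0.773333)]
for precision, recall, reported in rows:
    print(round(f1_score(precision, recall), 6), reported)
```

Note that accuracy is already about 0.82 in the first two epochs while F1 is 0.0. This is consistent with the model labeling every token as `O` early in training, which is why F1 is the more informative number for this imbalanced task.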


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: ner_tags, tokens, id. If ner_tags, tokens, id are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 19
  Batch size = 10
  _warn_prf(average, modifier, msg_start, len(result))

TrainOutput(global_step=110, training_loss=0.20243914885954423, metrics={'train_runtime': 17.7708, 'train_samples_per_second': 56.835, 'train_steps_per_second': 6.19, 'total_flos': 5427972833556.0, 'train_loss': 0.20243914885954423, 'epoch': 10.0})
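
The step count and throughput figures in the log can be reproduced with a little arithmetic. This is a sketch assuming a single device and no gradient accumulation, as reported in the log above; all numbers are copied from the training output.

```python
import math

num_examples = 101        # "Num examples" in the training log
batch_size = 10           # per_device_train_batch_size on a single device
num_epochs = 10           # num_train_epochs
train_runtime = 17.7708   # seconds, from TrainOutput

# The last batch of each epoch is smaller (1 example), hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

steps_per_second = total_steps / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime

print(steps_per_epoch, total_steps)
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

The result matches `global_step=110` as well as the `train_steps_per_second` and `train_samples_per_second` values in `TrainOutput`.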

Let's save the checkpoint for the fine-tuned NER model:

In [9]:
#save the model in "skillner_model" directory
trainer.save_model() 

Saving model checkpoint to ./skillner_model
Configuration saved in ./skillner_model\config.json
Model weights saved in ./skillner_model\pytorch_model.bin
tokenizer config file saved in ./skillner_model\tokenizer_config.json
Special tokens file saved in ./skillner_model\special_tokens_map.json


Let's check inference with our fine-tuned model on a sample sentence:

In [10]:
from transformers import pipeline

getNER = pipeline("ner", model=model.to(device), tokenizer=tokenizer)
example = "John Doe is known to be composed and confident professional at work, he also has strong leadership qualities."

ner_results = getNER(example)

ner_results

[{'entity': 'LABEL_0',
  'score': 0.9943357,
  'index': 1,
  'word': 'john',
  'start': 0,
  'end': 4},
 {'entity': 'LABEL_0',
  'score': 0.9899693,
  'index': 2,
  'word': 'doe',
  'start': 5,
  'end': 8},
 {'entity': 'LABEL_0',
  'score': 0.9952485,
  'index': 3,
  'word': 'is',
  'start': 9,
  'end': 11},
 {'entity': 'LABEL_0',
  'score': 0.99341005,
  'index': 4,
  'word': 'known',
  'start': 12,
  'end': 17},
 {'entity': 'LABEL_0',
  'score': 0.99029845,
  'index': 5,
  'word': 'to',
  'start': 18,
  'end': 20},
 {'entity': 'LABEL_0',
  'score': 0.9671793,
  'index': 6,
  'word': 'be',
  'start': 21,
  'end': 23},
 {'entity': 'LABEL_1',
  'score': 0.9086044,
  'index': 7,
  'word': 'composed',
  'start': 24,
  'end': 32},
 {'entity': 'LABEL_0',
  'score': 0.98399967,
  'index': 8,
  'word': 'and',
  'start': 33,
  'end': 36},
 {'entity': 'LABEL_1',
  'score': 0.8795789,
  'index': 9,
  'word': 'confident',
  'start': 37,
  'end': 46},
 {'entity': 'LABEL_1',
  'score': 0.5955358,
 

Let's parse the results and check whether the model has correctly identified the soft skills. There could be a better way to map the label ids to the text labels, which is left as a ToDo.

In [11]:
for result in ner_results:
    if result['entity'] == "LABEL_1":
        print("NER = ", result['word'])


NER =  composed
NER =  confident
NER =  professional
NER =  leadership
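
One possible way to address the ToDo is to store the label names in the model config through the `id2label` and `label2id` arguments of `from_pretrained`, so that the pipeline reports `O` and `SoftSkill` instead of `LABEL_0` and `LABEL_1`. The sketch below only builds the mappings and applies them to a hand-copied, abbreviated slice of the pipeline output; the model-loading call is shown as a comment because it downloads weights and would require retraining. The name `default_to_name` is hypothetical.

```python
label_names = ["O", "SoftSkill"]

# Mappings between integer ids and readable label names.
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

# Assumed usage when loading the model before training (not executed here):
# model = AutoModelForTokenClassification.from_pretrained(
#     "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
# )

# For a model trained without these mappings, translate the default names afterwards.
default_to_name = {f"LABEL_{idx}": name for idx, name in id2label.items()}

# Two entries abbreviated from the pipeline output shown earlier.
ner_results = [
    {"entity": "LABEL_0", "word": "be"},
    {"entity": "LABEL_1", "word": "composed"},
]
skills = [r["word"] for r in ner_results if default_to_name[r["entity"]] == "SoftSkill"]
print(skills)
```

Keeping the names in the config has the advantage that they are saved with the model, so anyone loading `./skillner_model` later sees the readable labels without needing `label_names` from this notebook.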
