Fine-tuning a Named Entity Recognition (NER) model means adapting a pre-trained model to your own dataset and label set. This step-by-step guide fine-tunes `bert-base-cased` for NER with Hugging Face's transformers and datasets libraries, and logs the run to Weights & Biases (wandb).

**1) Install Necessary Libraries:**

In [1]:
!pip install transformers datasets wandb accelerate



**2) Setup and Initialization:**

In [2]:
import wandb
from transformers import AutoModelForTokenClassification, AutoTokenizer, Trainer, TrainingArguments, DataCollatorForTokenClassification
from datasets import load_dataset

# Initialize wandb
wandb.init(project="negativeNER")



[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


**3) Load and Prepare Dataset:**

In [3]:
!pip install datasets huggingface-hub







In [4]:
from huggingface_hub import notebook_login

notebook_login()



In [5]:
dataset = load_dataset("procit002/conll2003AndNameStreetCitySep18_and_negative_words_ConfirmationAnswer")


In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 98151
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 13764
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 13969
    })
})

In [7]:
%%capture

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_and_align_labels(examples):
    # Inputs are already split into words. Padding is left to the data collator,
    # which pads each batch to its longest sequence.
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        # word_ids maps each sub-token to its source word (None for special tokens)
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens are ignored by the loss
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # The first sub-token of a word carries the word's label
                label_ids.append(label[word_idx])
            else:
                # Remaining sub-tokens of the same word are ignored
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
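
The alignment rule used in `tokenize_and_align_labels` can be checked on a small hand-made input without loading a tokenizer. The sketch below is illustrative only: `align_labels` is a hypothetical helper, and the `word_ids` list is made up to mimic a three-word sentence whose second word is split into two sub-tokens.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first sub-token and mask everything else."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous:
            # Special tokens and continuation sub-tokens are ignored by the loss
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_idx])
        previous = word_idx
    return aligned


# Made-up example: [CLS], word 0, word 1 (two sub-tokens), word 2, [SEP]
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 2, 0]  # B-PER, I-PER, O in this dataset's label list

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, 2, -100, 0, -100]
```

Only the first sub-token of each word keeps a real label, so the loss counts every word once regardless of how many pieces the tokenizer splits it into.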


In [8]:
print(dataset['train'].features['pos_tags'])

Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None)


In [9]:
print(dataset["train"].features["pos_tags"].feature.names)

['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']


In [10]:
print(dataset["train"].features["ner_tags"])

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)
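
The label list printed above can be turned into the two lookup tables that map between class ids and tag names. This is a minimal sketch with the names copied from the output above. Passing the resulting dictionaries as `id2label` and `label2id` when loading the model is a common convention rather than something this notebook does, so treat that step as an optional assumption.

```python
# Tag names copied from the ner_tags feature printed above
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Map class id -> tag name and tag name -> class id
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[5])        # B-LOC
print(label2id['I-MISC'])  # 8
```

With these mappings stored in the model config, predictions can be reported as tag names such as `B-LOC` instead of generic ids such as `LABEL_5`.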


**4) Load Pre-trained Model:**

In [11]:
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(dataset["train"].features["ner_tags"].feature.names))


model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**5) Define Training Arguments:**

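Set up the training configuration. The model is evaluated once per epoch, a checkpoint is written every 1,000 steps, and metrics are reported to wandb. Recent transformers releases rename `evaluation_strategy` to `eval_strategy`, so use whichever name your installed version accepts: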

In [13]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    save_steps=1000,
    weight_decay=0.01,
    report_to="wandb"
)
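
The training set size and the arguments above determine roughly how long the run is. The arithmetic below is a sketch under stated assumptions: a single device, no gradient accumulation, and the last partial batch of each epoch kept. The variable names are illustrative and not part of the notebook.

```python
import math

train_examples = 98151   # num_rows of the train split shown earlier
batch_size = 16          # per_device_train_batch_size
epochs = 5               # num_train_epochs
save_steps = 1000        # save_steps

# One optimizer step per batch; the final batch of an epoch may be smaller
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Checkpoints written on the save_steps schedule
periodic_checkpoints = total_steps // save_steps

print(steps_per_epoch, total_steps, periodic_checkpoints)  # 6135 30675 30
```

Under these assumptions the run takes about 30,675 steps and writes around 30 periodic checkpoints to `./results`, which is worth keeping in mind for disk space.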




**6) Initialize Data Collator:**

Define a data collator to handle dynamic padding:

In [14]:
data_collator = DataCollatorForTokenClassification(tokenizer)
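
Dynamic padding means each batch is padded only to the length of its longest sequence, with padded label positions set to -100 so the loss ignores them. The pure-Python sketch below illustrates the idea and is not the library's implementation: `pad_batch` is a hypothetical helper and the token ids are made up.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every feature to the longest sequence in this batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        length = len(f["input_ids"])
        gap = max_len - length
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * gap)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * length + [0] * gap)
        # Padded positions get the ignore index so they do not affect the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * gap)
    return batch


# Made-up features of different lengths
features = [
    {"input_ids": [101, 1287, 102], "labels": [-100, 1, -100]},
    {"input_ids": [101, 1287, 2001, 1303, 102], "labels": [-100, 1, 0, 0, -100]},
]

batch = pad_batch(features)
print(batch["input_ids"][0])  # [101, 1287, 102, 0, 0]
print(batch["labels"][0])     # [-100, 1, -100, -100, -100]
```

A batch of short sentences is therefore padded to only a few tokens rather than to the model's maximum length, which saves memory and compute.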


**7) Initialize the Trainer:**

Set up the Trainer class with the model, training arguments, data collator, and datasets:

In [15]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)
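
As configured, the Trainer reports only the loss on the validation set. A task metric has to skip the positions labelled -100. The sketch below shows that masking step for a simple token-level accuracy in plain Python. `masked_accuracy` is a hypothetical helper, the logits are made up, and entity-level F1 is the more usual NER metric, so treat this as an illustration of the masking only.

```python
def masked_accuracy(logits, labels, ignore_index=-100):
    """Token-level accuracy over positions whose label is not ignore_index."""
    correct = 0
    total = 0
    for seq_logits, seq_labels in zip(logits, labels):
        for token_logits, gold in zip(seq_logits, seq_labels):
            if gold == ignore_index:
                continue  # special tokens, continuation sub-tokens, padding
            # Index of the highest score is the predicted class
            predicted = max(range(len(token_logits)), key=token_logits.__getitem__)
            correct += int(predicted == gold)
            total += 1
    return correct / total if total else 0.0


# Made-up batch: one sequence, four positions, three classes
logits = [[[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [0.9, 0.8, 0.7]]]
labels = [[-100, 0, 1, 0]]

accuracy = masked_accuracy(logits, labels)
print(round(accuracy, 3))  # 0.667
```

The same logic could be wrapped in a function passed to the Trainer's `compute_metrics` argument, converting the arrays it receives to lists or using array operations instead.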


**8) Train the Model:**

Start the training process:

In [None]:
trainer.train()




**9) Save the Model:**

In [None]:
model.save_pretrained("./ner_model")
tokenizer.save_pretrained("./ner_model")


**10) Push to Hugging Face Hub:**

Optionally, push the fine-tuned model to the Hugging Face Hub:

In [None]:
try:
    model.push_to_hub('procit002/test_ner_sep_19_second')
    tokenizer.push_to_hub('procit002/test_ner_sep_19_second')
except Exception as e:
    print("Error pushing to hub:", str(e))


**11) Finish Wandb Run:**

In [None]:
wandb.finish()
