## Introduction:

Part-of-Speech (POS) tagging is an essential task in Natural Language Processing (NLP) that involves labeling words in a sentence with their corresponding grammatical roles, such as noun, verb, adjective, etc. Understanding the structure of a sentence through POS tags is crucial for various downstream tasks, including text analysis, syntactic parsing, and language modeling.

In this project, we build a model that labels the phrase structure of a sentence. The CoNLL-2003 dataset annotates every token with POS tags, chunk tags, and NER tags; here we fine-tune on the chunk tags, which group words into phrases such as noun phrases (NP) and verb phrases (VP). The goal is to demonstrate how token-level tagging can be leveraged to gain deeper insights into sentence structure and meaning.

This project is structured as follows:
1. **Data Preparation**: Loading and exploring the CoNLL-2003 dataset, then tokenizing it and aligning the chunk labels with the resulting subword tokens.
2. **Model Training**: Fine-tuning a pretrained DistilBERT token-classification model to predict phrase-structure (chunk) labels.
3. **Evaluation and Analysis**: Evaluating the model with precision, recall, F1, and accuracy, and inspecting its predictions on new sentences.

By the end of this project, you will have a clear understanding of how to implement a POS tagging model and how it can be applied to various NLP tasks.


### Preparing the data:

##### The CoNLL-2003 dataset:

In [22]:
from datasets import load_dataset

raw_datasets = load_dataset('conll2003')

In [23]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [24]:
raw_datasets['train'][0]['tokens']

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

In [31]:
raw_datasets['train'][0]['chunk_tags']

[11, 21, 11, 12, 21, 22, 11, 12, 0]

In [32]:
raw_datasets['train'].features['chunk_tags']

Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None)

In [35]:
chunk_feature = raw_datasets['train'].features['chunk_tags']
label_names = chunk_feature.feature.names
label_names

['O',
 'B-ADJP',
 'I-ADJP',
 'B-ADVP',
 'I-ADVP',
 'B-CONJP',
 'I-CONJP',
 'B-INTJ',
 'I-INTJ',
 'B-LST',
 'I-LST',
 'B-NP',
 'I-NP',
 'B-PP',
 'I-PP',
 'B-PRT',
 'I-PRT',
 'B-SBAR',
 'I-SBAR',
 'B-UCP',
 'I-UCP',
 'B-VP',
 'I-VP']

In [37]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["chunk_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

EU   rejects German call to   boycott British lamb . 
B-NP B-VP    B-NP   I-NP B-VP I-VP    B-NP    I-NP O 
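
The B-/I- prefixes printed above encode phrase boundaries: `B-XXX` opens a chunk of type `XXX`, `I-XXX` continues it, and `O` sits outside any chunk. As a minimal sketch (the helper name `bio_to_chunks` is hypothetical and not part of the notebook or of any library), the tags can be grouped back into phrases like this:

```python
def bio_to_chunks(words, tags):
    """Group (word, BIO tag) pairs into (chunk_type, text) spans."""
    chunks = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, chunk_type = tag.partition("-")
        # A chunk continues only on an I- tag of the same type as the open chunk.
        if prefix == "I" and chunk_type == current_type:
            current_words.append(word)
            continue
        # Otherwise, close whatever chunk is currently open.
        if current_type is not None:
            chunks.append((current_type, " ".join(current_words)))
        # B-XXX opens a new chunk; a stray I-XXX is treated as an opening too.
        if prefix in ("B", "I"):
            current_type, current_words = chunk_type, [word]
        else:  # "O": no chunk is open
            current_type, current_words = None, []
    if current_type is not None:
        chunks.append((current_type, " ".join(current_words)))
    return chunks


words = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "I-VP", "B-NP", "I-NP", "O"]
chunks = bio_to_chunks(words, tags)
for chunk_type, text in chunks:
    print(f"{chunk_type}: {text}")
```

For the first training sentence this yields five phrases, with the final period left out because it is tagged `O`.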


### Data Processing:

In [2]:
from transformers import AutoTokenizer

model_checkpoint = 'distilbert/distilbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



In [41]:
tokenizer.is_fast

True

In [45]:
inputs = tokenizer(raw_datasets['train'][0]['tokens'], is_split_into_words = True)
inputs.tokens()
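# The tokenizer adds special tokens ([CLS], [SEP]) and splits some words into subwords ('lamb' -> 'la', '##mb'),
# so there are now more tokens than labels. word_ids() maps each token back to its source word so the labels can be realigned.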

['[CLS]',
 'EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

In [44]:
inputs.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

In [50]:
# Expand the word-level labels to token-level labels using the word ids
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id] 
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            # (B- labels have odd ids in label_names and the matching I- label is id + 1)
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [48]:
labels = raw_datasets['train'][0]['chunk_tags']
word_ids = inputs.word_ids()
print(labels)
print(word_ids)
print(align_labels_with_tokens(labels,word_ids))

[11, 21, 11, 12, 21, 22, 11, 12, 0]
[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
[-100, 11, 21, 11, 12, 21, 22, 11, 12, 12, 0, -100]
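
The function above gives every subword of a word a label. A common alternative, shown here only as a sketch and not what this notebook uses, is to label just the first subword of each word and mask the rest with -100 so that each word contributes to the loss exactly once. The name `label_first_subword_only` is hypothetical:

```python
def label_first_subword_only(labels, word_ids, ignore_index=-100):
    """Keep the label on the first subword of each word; mask everything else."""
    new_labels = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP]
            new_labels.append(ignore_index)
        elif word_id != previous_word:
            # First subword of a new word keeps the word's label
            new_labels.append(labels[word_id])
        else:
            # Continuation subword is ignored by the loss
            new_labels.append(ignore_index)
        previous_word = word_id
    return new_labels


labels = [11, 21, 11, 12, 21, 22, 11, 12, 0]
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
aligned = label_first_subword_only(labels, word_ids)
print(aligned)
```

Compared with the output above, only the position of `##mb` differs: it becomes -100 instead of 12.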


In [55]:
#to tokenize the whole dataset 
def tokenize_and_align_labels(example):
    tokenized_inputs = tokenizer(example['tokens'], truncation = True, is_split_into_words = True)
    all_labels = example['chunk_tags'] #it is a list of lists because we used batched = True in map method
    new_labels = []
    for i,labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels,word_ids))

    tokenized_inputs['labels'] = new_labels
    return tokenized_inputs

In [58]:
tokenized_datasets = raw_datasets.map(tokenize_and_align_labels, batched = True, remove_columns=raw_datasets['train'].column_names)

In [59]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3453
    })
})

#### Fine-tuning the model with the Trainer API:

##### Data Collation:

In [62]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer = tokenizer)




In [69]:
batch = data_collator([tokenized_datasets['train'][i] for i in range(2)])
print(batch['labels'])
# The collator pads the shorter label sequence with -100 so that padded positions are ignored when computing the loss

tensor([[-100,   11,   21,   11,   12,   21,   22,   11,   12,   12,    0, -100],
        [-100,   11,   12, -100, -100, -100, -100, -100, -100, -100, -100, -100]])
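
To make the padding step concrete, here is a plain-Python sketch of what happens to the labels only. It is a simplification: the real collator also pads `input_ids` and `attention_mask` and returns tensors. The helper name `pad_labels` is hypothetical, and the unpadded second sequence is inferred from the output above:

```python
def pad_labels(batch_labels, pad_value=-100):
    """Right-pad every label list to the length of the longest one."""
    max_len = max(len(labels) for labels in batch_labels)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in batch_labels]


first = [-100, 11, 21, 11, 12, 21, 22, 11, 12, 12, 0, -100]
second = [-100, 11, 12, -100]  # assumed unpadded form of the second example
padded = pad_labels([first, second])
for row in padded:
    print(row)
```

Because -100 is also the label given to special tokens, padded positions and special tokens are skipped by the loss in the same way.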


##### Metrics:

In [72]:
import evaluate

metric = evaluate.load('seqeval') # seqeval expects label strings, not integer ids, to compute the metrics

In [73]:
labels = raw_datasets['train'][0]['chunk_tags']
labels = [label_names[i] for i in labels]
labels

['B-NP', 'B-VP', 'B-NP', 'I-NP', 'B-VP', 'I-VP', 'B-NP', 'I-NP', 'O']

In [83]:
predictions = ['O','O','O','O','O','O','O','O','O']
metric.compute(predictions=[predictions], references=[labels])



{'NP': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 2},
 'VP': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 2},
 'overall_precision': 0.0,
 'overall_recall': 0.0,
 'overall_f1': 0.0,
 'overall_accuracy': 0.2222222222222222}

In [85]:
import numpy as np

def compute_metrics(eval_preds):
    logits,labels = eval_preds
    predictions = np.argmax(logits,axis = -1)
    true_labels = [[label_names[l] for l in label if l!=-100] for label in labels]
    true_predictions = [[label_names[p] for (p,l) in zip(prediction,label) if l!=-100] for (prediction,label) in zip(predictions,labels)]
    all_metrics = metric.compute(predictions = true_predictions, references = true_labels)
    return {
    'precision': all_metrics['overall_precision'],
    'recall': all_metrics['overall_recall'],
    'f1': all_metrics['overall_f1'],
    'accuracy': all_metrics['overall_accuracy']}
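
The key step in `compute_metrics` is dropping every position whose label is -100 before scoring, so special tokens and padding never count. The sketch below reproduces that filtering in plain Python on a toy example; `toy_label_names`, `strip_ignored`, and the input ids are all made up for illustration:

```python
# Toy label set for illustration only; the notebook's label_names has 23 entries.
toy_label_names = ["O", "B-NP", "I-NP", "B-VP", "I-VP"]


def strip_ignored(predictions, labels, names, ignore_index=-100):
    """Drop positions labelled ignore_index and map the remaining ids to strings."""
    true_predictions, true_labels = [], []
    for pred_row, label_row in zip(predictions, labels):
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != ignore_index]
        true_predictions.append([names[p] for p, _ in kept])
        true_labels.append([names[l] for _, l in kept])
    return true_predictions, true_labels


predicted_ids = [[0, 1, 2, 3, 3, 0]]     # hypothetical argmax of the logits
label_ids = [[-100, 1, 2, 3, 4, -100]]   # first and last positions are special tokens
preds, refs = strip_ignored(predicted_ids, label_ids, toy_label_names)

# Token-level accuracy over the positions that were kept
correct = sum(p == r for ps, rs in zip(preds, refs) for p, r in zip(ps, rs))
total = sum(len(rs) for rs in refs)
token_accuracy = correct / total
print(preds, refs, token_accuracy)
```

Whatever the model predicts at the masked positions has no effect on the score, which is why the predictions there can be arbitrary.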

##### Defining the model:

In [89]:
id2label = {i:label for i,label in enumerate(label_names)}
label2id = {v:k for k,v in id2label.items()}

In [90]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, id2label=id2label, label2id = label2id)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [91]:
model.config.num_labels

23

##### Fine-tuning the model:

In [92]:
from huggingface_hub import notebook_login

notebook_login()

In [93]:
from transformers import TrainingArguments

args = TrainingArguments('bert-finetuned-ner',
                         evaluation_strategy = 'epoch',
                         save_strategy = 'epoch',
                         learning_rate = 2e-5,
                         num_train_epochs = 3,
                         weight_decay = 0.01,
                         push_to_hub = True,)



In [94]:
from transformers import Trainer

trainer = Trainer(model = model,
                  args = args,
                  train_dataset = tokenized_datasets['train'],
                  eval_dataset = tokenized_datasets['validation'],
                  data_collator = data_collator,
                  compute_metrics = compute_metrics,
                  tokenizer = tokenizer)

In [95]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.1926 | 0.180864 | 0.910433 | 0.905622 | 0.908021 | 0.954303 |
| 2 | 0.1318 | 0.162188 | 0.920037 | 0.915569 | 0.917798 | 0.95916 |
| 3 | 0.0933 | 0.163975 | 0.9223 | 0.919182 | 0.920739 | 0.960676 |




TrainOutput(global_step=5268, training_loss=0.17121758203933704, metrics={'train_runtime': 6553.4701, 'train_samples_per_second': 6.428, 'train_steps_per_second': 0.804, 'total_flos': 460548101514270.0, 'train_loss': 0.17121758203933704, 'epoch': 3.0})
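
The `global_step` and throughput figures in this output can be sanity-checked with a little arithmetic. The sketch assumes a single device and the `Trainer` default per-device batch size of 8, since the arguments above do not set one:

```python
import math

num_train_examples = 14041   # rows in the train split
per_device_batch_size = 8    # assumption: library default, one device
num_epochs = 3
train_runtime = 6553.4701    # seconds, from the TrainOutput above

# The last batch of each epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs

steps_per_second = total_steps / train_runtime
samples_per_second = num_train_examples * num_epochs / train_runtime

print(steps_per_epoch, total_steps)
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

Under those assumptions the computed values agree with the reported `global_step`, `train_steps_per_second`, and `train_samples_per_second`.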

In [96]:
trainer.push_to_hub(commit_message="Training complete")

CommitInfo(commit_url='https://huggingface.co/Tarun-1999M/bert-finetuned-ner/commit/49ae5c0909330944418a5c663e5fe89f70d3adc3', commit_message='Training complete', commit_description='', oid='49ae5c0909330944418a5c663e5fe89f70d3adc3', pr_url=None, pr_revision=None, pr_num=None)

### Using the fine-tuned model:

In [1]:
from transformers import pipeline

model_checkpoint = "Tarun-1999M/bert-finetuned-ner"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity_group': 'NP',
  'score': 0.9982133,
  'word': 'My name',
  'start': 0,
  'end': 7},
 {'entity_group': 'VP',
  'score': 0.9985273,
  'word': 'is',
  'start': 8,
  'end': 10},
 {'entity_group': 'NP',
  'score': 0.98740447,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'NP',
  'score': 0.9938041,
  'word': 'I',
  'start': 23,
  'end': 24},
 {'entity_group': 'VP',
  'score': 0.9956542,
  'word': 'work',
  'start': 25,
  'end': 29},
 {'entity_group': 'PP',
  'score': 0.9982097,
  'word': 'at',
  'start': 30,
  'end': 32},
 {'entity_group': 'NP',
  'score': 0.9252255,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'PP',
  'score': 0.99902964,
  'word': 'in',
  'start': 46,
  'end': 48},
 {'entity_group': 'NP',
  'score': 0.9994374,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
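
The `start` and `end` character offsets make it easy to render the pipeline output inline. The following sketch copies the offsets from the output above (scores omitted) and wraps each span in brackets; `bracket_chunks` is a hypothetical helper, not part of `transformers`:

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Spans copied from the pipeline output above, without the scores
spans = [
    {"entity_group": "NP", "start": 0, "end": 7},
    {"entity_group": "VP", "start": 8, "end": 10},
    {"entity_group": "NP", "start": 11, "end": 18},
    {"entity_group": "NP", "start": 23, "end": 24},
    {"entity_group": "VP", "start": 25, "end": 29},
    {"entity_group": "PP", "start": 30, "end": 32},
    {"entity_group": "NP", "start": 33, "end": 45},
    {"entity_group": "PP", "start": 46, "end": 48},
    {"entity_group": "NP", "start": 49, "end": 57},
]


def bracket_chunks(text, spans):
    """Wrap each labelled span as [TYPE words], leaving unlabelled text untouched."""
    pieces, cursor = [], 0
    for span in sorted(spans, key=lambda s: s["start"]):
        pieces.append(text[cursor:span["start"]])  # text between chunks
        chunk_text = text[span["start"]:span["end"]]
        pieces.append(f"[{span['entity_group']} {chunk_text}]")
        cursor = span["end"]
    pieces.append(text[cursor:])  # trailing text after the last chunk
    return "".join(pieces)


bracketed = bracket_chunks(text, spans)
print(bracketed)
```

The word "and" stays outside any bracket because the pipeline returned no span for it.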

In [6]:
#| default_exp NER_app

In [7]:
#| export
# Import necessary libraries
import gradio as gr
from transformers import pipeline

# Load the fine-tuned DistilBERT model for token classification
model_checkpoint = "Tarun-1999M/bert-finetuned-ner"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)

# Set up the Gradio interface
title = "Token Classification with Fine-tuned DistilBERT"
description = """
This application labels the phrase chunks (e.g., noun and verb phrases) in a given text using a DistilBERT model fine-tuned on the CoNLL-2003 chunk tags.
Input any text to see how the model labels the tokens.

### Explanation of Abbreviations:
- **O**: Outside of any chunk
- **ADJP**: Adjective Phrase
- **ADVP**: Adverb Phrase
- **CONJP**: Conjunction Phrase
- **INTJ**: Interjection
- **LST**: List Item Marker
- **NP**: Noun Phrase
- **PP**: Prepositional Phrase
- **PRT**: Particle
- **SBAR**: Subordinate Clause
- **UCP**: Unlike Coordinated Phrase
- **VP**: Verb Phrase

"""

article = "This demo uses a DistilBERT model fine-tuned on the CoNLL-2003 chunk tags for token classification."

# Define the prediction function
def predict(text):
    results = token_classifier(text)
    return results

# Gradio interface
gr.Interface(
    fn=predict,
    inputs="textbox",
    outputs="json",
    title=title,
    description=description,
    article=article,
    examples=[
        ["My name is Sylvain and I work at Hugging Face in Brooklyn."],
        ["Albert Einstein was a physicist and he developed the theory of relativity."],
        ["Python is a programming language that I use daily."]
    ],
).launch()


Running on local URL:  http://127.0.0.1:7863

To create a public link, set `share=True` in `launch()`.




In [2]:
import nbdev
notebook_name = "NER_POS_tagging.ipynb"
export_destination = "."
nbdev.export.nb_export(notebook_name, export_destination)