NLP libraries focused on practical and production ready task

*   Spacy- Tokenization (splitting text into words), Part-of-Speech (POS), tagging, Named Entity Recognition (NER), Dependency parsing Lemmatization (base form of a word)

*   NLTk- Tokenization, stemming, lemmatization, POS tagging, Text classification, Sentiment analysis


Named entity recognition helps in identifying and classifying the key information in text into predefined categories.



In [1]:
import pandas as pd
import numpy as np
import nltk
import spacy
from sklearn.preprocessing import OneHotEncoder

# Download and load the spaCy model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m55.0 MB/s[0m eta [36m0:00:00[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
file_path = "/content/drive/My Drive/Dataset/Laptop_Train_v2.csv"
data = pd.read_csv(file_path)
data.head()

Unnamed: 0,id,Sentence,Aspect Term,polarity,from,to
0,2339,I charge it at night and skip taking the cord ...,cord,neutral,41,45
1,2339,I charge it at night and skip taking the cord ...,battery life,positive,74,86
2,1316,The tech guy then said the service center does...,service center,negative,27,41
3,1316,The tech guy then said the service center does...,"""sales"" team",negative,109,121
4,1316,The tech guy then said the service center does...,tech guy,neutral,4,12


In [4]:
data = data.drop(columns = ['id', 'from', 'to'], axis = 1)
data.head()

Unnamed: 0,Sentence,Aspect Term,polarity
0,I charge it at night and skip taking the cord ...,cord,neutral
1,I charge it at night and skip taking the cord ...,battery life,positive
2,The tech guy then said the service center does...,service center,negative
3,The tech guy then said the service center does...,"""sales"" team",negative
4,The tech guy then said the service center does...,tech guy,neutral


In [5]:
def str_to_num(x):
    if x == 'psotive':
      return 1
    elif x == 'negative':
      return -1
    else:
      return 0

data['polarity'] = data['polarity'].apply(str_to_num)
data.head()


Unnamed: 0,Sentence,Aspect Term,polarity
0,I charge it at night and skip taking the cord ...,cord,0
1,I charge it at night and skip taking the cord ...,battery life,0
2,The tech guy then said the service center does...,service center,-1
3,The tech guy then said the service center does...,"""sales"" team",-1
4,The tech guy then said the service center does...,tech guy,0


Idea how to take input :
step1: convert the input setence into token of words with each token as 768 dimension vector **using distilberttokenizer**  from which we get from BERT embedding
step2: Append the upos(universal parts of speech) of each word to above vector
step3: Append also xpos(extended parts of speech) of each word to above appended vector
For step2: Converting a word into one-hot encoded upos vector using 'ntlk' library and it outputs with dimension value of 37. (Alternatively, if we use spacy library it will give with 17dimension)

**UPOS**- Language independent, use when you want consistency across multiple

languages (e.g., multilingual NLP)

**XPOS**- richer, more detailed POS tags, defined by each language’s traditional grammar, hence language dependent. It gives fine grained grammertical details.


NLTK supports only XPOS by default which has 17 different tags.
Spacy supports both UPOS and XPOS and has 37 different pos tags

'punkt' : Splits a paragrapgh of text into individual sentences. It is trained on large text corpus for specific languages - eng, german etc. It is used before applying tokenization, pos tagging or NER. It is unsupervised means it learns punctuation patterns and abbreviation usage from a raw corpus.

In [6]:
from nltk import pos_tag, word_tokenize
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')   # used for XPOS & downloads a pretrained POS tagger based on the averaged perceptron algorithm, which is used by NLTK’s pos_tag() function.
nltk.download('universal_tagset')             # it converts XPOS to UPOS when requested
nltk.download('averaged_perceptron_tagger_eng', download_dir='/root/nltk_data') # Download the English tagger to a specific directory

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [7]:
def splitSentence(sentence):
    split_sentence = sentence.split(",") # split by commas ex- ['my  name is khan', 'i live in mannat', 'i am the king']
    split_sentence = [word.split() for word in split_sentence] # split each word by space ex - [['my', 'name', 'is', 'khan'], ['i', 'live', 'in', 'mannat']]
    split_sentence = [word for sublist in split_sentence for word in sublist] # flatten the list of lists into a single list ex - ['my', 'name', 'is', 'khan','i', 'live', 'in', 'mannat']

    return split_sentence

In [8]:
# Create a one-hot encoder and fit it to the set of UPOS tags
#this below tags are used by spacy for giving pos_tags.
upos_tags = ['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X']
encoder = OneHotEncoder(sparse_output = False, categories=[upos_tags])
encoder.fit(np.array(upos_tags).reshape(-1, 1)) #fit


def upos_tagging_word(word):

    # Use the spacy tagger to tag the word and extract the UPOS tag
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(word) # example this will be [('author','NN')]
    upos_tag = doc[0].pos_ # "NN"
    if upos_tag==',' or upos_tag=='.' or upos_tag=='?' or upos_tag==':' or upos_tag==';' or upos_tag=='_':
        upos_tag = 'PUNCT'
    if upos_tag not in upos_tags:
        vector = np.zeros((1, len(upos_tags)))
        return vector

    # np.array([upos_tag]).reshape(-1, 1)) converts to [['NN']]
    # Transform the UPOS tag into a one-hot encoded vector
    onehot_vector = encoder.transform(np.array([upos_tag]).reshape(-1, 1))
    return onehot_vector

In [9]:
import torch
def upos_tagging(sentence):
    answer = []
    tokens = splitSentence(sentence)
    for i in tokens:
        word_tag = upos_tagging_word(i)
        answer.append(torch.tensor(word_tag[0]))
    return answer

In [10]:
print(upos_tagging("the author"))

[tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       dtype=torch.float64), tensor([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       dtype=torch.float64)]


We will also compare what is the difference in the performance when only Word2Vec is used and when W2V + Upos tagging is used

### **Aspect Extraction Using BERT**

In [11]:
data

Unnamed: 0,Sentence,Aspect Term,polarity
0,I charge it at night and skip taking the cord ...,cord,0
1,I charge it at night and skip taking the cord ...,battery life,0
2,The tech guy then said the service center does...,service center,-1
3,The tech guy then said the service center does...,"""sales"" team",-1
4,The tech guy then said the service center does...,tech guy,0
...,...,...,...
2353,We also use Paralles so we can run virtual mac...,Windows Server Enterprise 2003,0
2354,We also use Paralles so we can run virtual mac...,Windows Server 2008 Enterprise,0
2355,"How Toshiba handles the repair seems to vary, ...",repair,0
2356,"How Toshiba handles the repair seems to vary, ...",repair,0


In [12]:
import pandas as pd
import numpy as np
import torch
import transformers
from transformers import DistilBertTokenizer, DistilBertModel, TFDistilBertForTokenClassification, TFDistilBertForSequenceClassification
# from transformers import DistillBertTokenizer, BertForTokenClassification, BertForSequenceClassification , BertModel
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

**Working of this tokenizer**

Downloads the vocab and tokenizer config (if not cached).

Loads the WordPiece tokenizer with the vocabulary used during DistilBERT's training.

Returns a tokenizer object you can use to tokenize your input text.

In [13]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')  ## loads the pretrained tokenizer for the DistilBert model (uncased one: which doesnt differentiate between capital & small letters)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

In [14]:
df1 = df_grouped = data.groupby('Sentence').agg({'Aspect Term': lambda x: x.tolist(), 'polarity': lambda x: x.tolist()}).reset_index()
df1.head()

Unnamed: 0,Sentence,Aspect Term,polarity
0,""" This isn't a big deal, I haven't noticed the...",[USB output],[-1]
1,"""> iPhoto is probably the best program I have...",[iPhoto],[0]
2,( The iBook backup also uses a firewire connec...,"[iBook backup, firewire connection]","[0, 0]"
3,"(Beware, their staff could send you back makin...",[staff],[-1]
4,(I found a 2GB stick for a bit under $50) Nice...,[system],[0]


**Adding BIO label column in the dataframe**
BIO means: It labels each word/token in a sentence, enabling models to learn which spans of text belong to a certain labels.
W/O BIO: You might detect "battery" and "life" as aspects, but not know if they belong together.

B-TERM: Beginning of aspect term

I-TERM: Inside the aspect term

O: Outside (not part of an aspect term)

BIO is required only when- Named Entity Recognition (NER), Chunking,Slot filling, or when we do token level classification.
Not required when Sentence-level classification (e.g., sentiment analysis, topic classification), Question answering, Text generation or embedding extraction

In [15]:
def tokenize_with_whitespace(text):         ##split the sentence by whitespaces
    """
    Tokenize text using whitespace as a delimiter
    """
    tokens = splitSentence(text)
    return tokens

def get_token_offsets(text):               ## tokens and their offsets. example: "the rice is good", here this function returns ['the','rice','is','good'],[(0,3),(4,8),(9,11),(12,16)]
    tokens = []
    offsets = []
    for token in tokenize_with_whitespace(text):
        tokens.append(token)
        start_index = text.find(token)
        end_index = start_index + len(token)
        offsets.append((start_index, end_index))
    return tokens, offsets

def get_token_spans(offsets):
    spans = [(offsets[i][0], offsets[i+1][0] if i < len(offsets) - 1 else offsets[i][1]) for i in range(len(offsets))]
    return spans

# Convert sentence to B-I-O tags
def convert_to_b_i_o(sentence, aspect_terms):             ##this converts the each token in sentence to {'I','O','B'}
    # Get corresponding token offsets and spans for sentence
    tokens, offsets = get_token_offsets(sentence)            ## Number of  tokens in one sentence
    spans = get_token_spans(offsets)

    tags = ["O"] * len(tokens)                              # everytime tag will be the size of the number of tokens in each sentence.
    for aspect_term in aspect_terms:
        aspect_words = tokenize_with_whitespace(aspect_term)    ## If a aspect term has two terms then store the two termms separately in the list
        aspect_words_len = len(aspect_words)

        start_index = None                                  # Find the start and end indexes of the aspect term within the sentence
        end_index = None
        for i in range(len(tokens) - aspect_words_len + 1):
            if tokens[i:i+aspect_words_len] == aspect_words:      ## this slicing will return a subset of list & hence we can compare it with another list.
                start_index = i                                   ## this index is the index of that particular word from the list of token generated from the sentence.
                end_index = i + aspect_words_len
                break

        if start_index is not None and end_index is not None:
            # Find the start and end offsets of the aspect term within the sentence
            start_offset = spans[start_index][0]  ## Here is the answer of why spans is used and not directly the offset (because... see in documentation).
            end_offset = spans[end_index-1][1]

            # Update tags list with B-I-O tags for aspect term
            for i in range(start_index, end_index):
                if i == start_index:
                    tags[i] = "B"
                else:
                    tags[i] = "I"

    return tags

In [16]:
df1['labels'] = 1
for i in range(0,len(df1['Sentence'])):
    a = convert_to_b_i_o(df1['Sentence'][i], df1['Aspect Term'][i])
    df1['labels'][i] = a

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df1['labels'][i] = a
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['labels'][i] = a
  df1['labels'][i] = 

In [17]:
df1

Unnamed: 0,Sentence,Aspect Term,polarity,labels
0,""" This isn't a big deal, I haven't noticed the...",[USB output],[-1],"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,"""> iPhoto is probably the best program I have...",[iPhoto],[0],"[O, B, O, O, O, O, O, O, O, O, O, O, O, O, O]"
2,( The iBook backup also uses a firewire connec...,"[iBook backup, firewire connection]","[0, 0]","[O, O, B, I, O, O, O, O, O]"
3,"(Beware, their staff could send you back makin...",[staff],[-1],"[O, O, B, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,(I found a 2GB stick for a bit under $50) Nice...,[system],[0],"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
...,...,...,...,...
1477,"A coupla months later, they change my hard dr...",[hard drive],[-1],"[O, O, O, O, O, O, O, O, O]"
1478,"I actually had the hard drive replaced twice,...","[hard drive, mother board, dvd drive]","[-1, -1, -1]","[O, O, O, O, B, I, O, O, O, B, I, O, O, B, I, ..."
1479,One night I turned the freaking thing off aft...,"[GUI, screen, power light, hard drive light]","[-1, -1, 0, -1]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1480,THE MOTHERBOARD IS DEAD !,[MOTHERBOARD],[-1],"[O, B, O, O, O]"


Manually how can i convert sentence to tokens

In [18]:
## converting sentence into tokens of vectors. each vector is of 768 dimension
## DistilBertTokenizer helps in creating input_id and attention mask for each token of the given sentence
## input_id is the numerical representation of the text and each id is based on the model vocabulary
## These IDs are actually fed to thee  transformer.
## DistilBertModel helps in looking for corresponding token embedding using the token_id from its embedding matrix
## Attention mask is a binary mask, if 1: attend this token, 2: ignore this token(usually padding)
## It prevents the model from learning from or being distracted by padding tokens.

model = DistilBertModel.from_pretrained('bert-base-uncased')

# Load BERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('bert-base-uncased')
def bert_to_token(sentence):


  # Define the sentence to be tokenized

  # Tokenize the sentence
  tokens = splitSentence(sentence)
  # print(len(tokens))
  # Convert tokens to ids
  input_ids = tokenizer.convert_tokens_to_ids(tokens)
  # print(len(input_ids))
  # Add special tokens
  # Convert input_ids to tensor
  input_ids = torch.tensor(input_ids).unsqueeze(0)      # helps in converting dim from [4] to [1, 4] where 0 mean along rows and if 1 mean along columns
  # print(len(input_ids))
  # Get the 768 dimensional vectors for each token
  outputs = model(input_ids)
  # print(outputs,len(outputs))
  token_vectors = outputs.last_hidden_state.squeeze(0)        # Output of the last layer of the DistilBert
  return token_vectors

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

You are using a model of type bert to instantiate a model of type distilbert. This is not supported for all configurations of models and can yield errors.


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of DistilBertModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['embeddings.LayerNorm.bias', 'embeddings.LayerNorm.weight', 'embeddings.position_embeddings.weight', 'embeddings.word_embeddings.weight', 'transformer.layer.0.attention.k_lin.bias', 'transformer.layer.0.attention.k_lin.weight', 'transformer.layer.0.attention.out_lin.bias', 'transformer.layer.0.attention.out_lin.weight', 'transformer.layer.0.attention.q_lin.bias', 'transformer.layer.0.attention.q_lin.weight', 'transformer.layer.0.attention.v_lin.bias', 'transformer.layer.0.attention.v_lin.weight', 'transformer.layer.0.ffn.lin1.bias', 'transformer.layer.0.ffn.lin1.weight', 'transformer.layer.0.ffn.lin2.bias', 'transformer.layer.0.ffn.lin2.weight', 'transformer.layer.0.output_layer_norm.bias', 'transformer.layer.0.output_layer_norm.weight', 'transformer.layer.0.sa_layer_norm.bias', 'transformer.layer.0.sa_layer_norm.weight', 'transformer.layer.1.attention.k_lin.b

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'DistilBertTokenizer'.


Just to have an understanding of how can i label_map manually

In [19]:
label_map = {'O':0,'B':1,'I':2}
labels = []
for label_list in df1['labels']:

    # Convert labels to tensor
    label_tensor = torch.tensor([label_map[label] for label in label_list], dtype=torch.long)
    labels.append(list(label_tensor))
print(labels)


[[tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0)], [tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0)], [tensor(0), tensor(0), tensor(1), tensor(2), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0)], [tensor(0), tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0)], [tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), 

**Analyzing Class Imbalance within the Aspect's labels os a particular sentence**

## **BERT Finetuning**

DistilBertTokenizer	Converts raw text → token IDs	Input: text → Output: token IDs, masks

DistilBertForTokenClassification	Predicts label per token (NER, etc.)	Input: token IDs → Output: logits per token. A DistilBERT model with an added classification head (a linear layer on top).

With the DistilBertModel we get only the embeddings of the tokens.

outputs.loss      # CrossEntropyLoss between predicted labels and true labels

outputs.logits    # Shape: [1, sequence_length, num_labels]

In [20]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForTokenClassification,
    DataCollatorForTokenClassification,

)

In [21]:
label_map = {'O': 0, 'B': 1, 'I': 2}

def preprocess_data(sentences, label_lists):
    tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

    # First, split sentences into words using the existing splitSentence function
    word_split_sentences = [splitSentence(sentence) for sentence in sentences]

    tokenized_inputs = tokenizer(
        word_split_sentences,  # Pass the word-split sentences to the tokenizer
        is_split_into_words=True,
        truncation=True,
        padding=True,
        return_tensors="pt"
    )

    all_labels = []
    for i, label_list in enumerate(label_lists):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        aligned_labels = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                aligned_labels.append(-100)
            elif word_idx != previous_word_idx:
                # Check if word_idx is within the bounds of label_list
                if word_idx is not None and word_idx < len(label_list):
                    aligned_labels.append(label_map[label_list[word_idx]])
                else:
                    # If out of bounds, treat as 'O' or pad with -100
                    aligned_labels.append(-100) # Or label_map['O']
            else:
                 # Check if word_idx is within the bounds of label_list
                if word_idx is not None and word_idx < len(label_list):
                    aligned_labels.append(
                        label_map[label_list[word_idx]] if label_list[word_idx] != 'O' else 0
                    )
                else:
                    # If out of bounds, treat as 'O' or pad with -100
                    aligned_labels.append(-100) # Or label_map['O']
            previous_word_idx = word_idx
        all_labels.append(aligned_labels)

    tokenized_inputs["labels"] = torch.tensor(all_labels)
    return tokenized_inputs, tokenizer

In [22]:
## If you're padding during tokenization, technically you don't need a DataCollator
## Different samples in a batch might have very different lengths. Padding all samples globally wastes memory and computation.
## Data Collator performs dynamic padding per batch — only up to the longest sentence in that batch.


class TokenClassificationDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings                ## Stores the tokenized and aligned data in list of lists format (like input_ids, attention_mask, and labels) in self.encodings.

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encodings.items()}
    ## {
    ##    "input_ids": tensor of shape [seq_len],
#         "attention_mask": tensor of shape [seq_len],
#         "labels": tensor of shape [seq_len]
#       }

    def __len__(self):
        return len(self.encodings["input_ids"])

# Data collator for dynamic padding
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [23]:
training_params = {
    "epochs": 4,
    "batch_size": 1,
    "learning_rate": 2e-5
}

# Assuming df1['Sentence'] is list of token lists and df1['labels'] is list of label lists
sentences = df1['Sentence'][:1200].tolist()     # Each entry is a list of tokens
labels = df1['labels'][:1200].tolist()          # Each entry is a list of 'O', 'B', 'I'

tokenized_data, tokenizer = preprocess_data(sentences, labels)
data_collator.tokenizer = tokenizer  # Now assign it

dataset = TokenClassificationDataset(tokenized_data)
data_loader = DataLoader(dataset, batch_size=training_params["batch_size"], shuffle=True, collate_fn=data_collator)

In [24]:
model = DistilBertForTokenClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

optimizer = AdamW(model.parameters(), lr=training_params["learning_rate"])

model.train()

for epoch in range(training_params["epochs"]):
    total_loss = 0.0
    for batch in data_loader:
        optimizer.zero_grad()

        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"]
        )

        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()

    avg_loss = total_loss / len(data_loader)
    print(f"Epoch {epoch+1} | Average Loss: {avg_loss:.4f}")

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1 | Average Loss: 0.1868
Epoch 2 | Average Loss: 0.0757
Epoch 3 | Average Loss: 0.0334
Epoch 4 | Average Loss: 0.0169


In [25]:
tokenized_texts = [splitSentence(i) for i in df1['Sentence'][1200:].tolist()]
input_ids = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokenized_texts]
attention_masks = [[1] * len(tokens) for tokens in input_ids]
labels = list(df1['labels'][1200:])
model.eval()                        ## Switches the model to evaluation mode
with torch.no_grad():               ## It tells the pytorch not to track gradient to save memory and computations. All opertions inside the block wont build the computational graph
    predicted = []
    count = 0
    c = 0
    for i in range(len(input_ids)):
        outputs = model(torch.LongTensor(input_ids[i]).unsqueeze(0), attention_mask=torch.LongTensor(attention_masks[i]).unsqueeze(0))    ## As no true labels are passed so it just output the logits
        logits = outputs.logits                   ## For each token, gives raw scores for each label
        preds = torch.argmax(logits, dim=2)       ## Converts logits to predicted labels by picking the label with maximum score for each token along the dimension (done for each token from the list of token_ids)
        count = count + len(preds[0])             ## preds[0] gives you the predictions for just the sentence (without the batch dimension).
        c = c + len(labels[i])
        predicted.extend(preds[0])
print(predicted)

[tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0)

In [26]:
label_map = {'O':0,'B':1,'I':2}
labels_test = []
for label_list in df1['labels'][1200 :]:
    # Convert labels to tensor
    label_tensor = torch.tensor([label_map[label] for label in label_list], dtype=torch.long)
    labels_test.append(list(label_tensor))


actual = []
for i in labels_test:
    actual.extend(i)
# print(actual)

correct = 0
for i in range(len(actual)):
    if predicted[i]==actual[i]:
        correct = correct + 1
print(correct/count)

0.9151741783113561


average='weighted':

Computes precision per class

Then takes a weighted average based on the number of true instances in each class

Prevents rare classes from being ignored

In [27]:
from sklearn.metrics import precision_score, recall_score, f1_score

# calculate precision, recall, and F1 score
precision = precision_score(actual, predicted,average='weighted')
recall = recall_score(actual, predicted,average='weighted')
f1 = f1_score(actual, predicted,average='weighted')

# print results
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 Score: ", f1)

Precision:  0.9126668943817551
Recall:  0.9151741783113561
F1 Score:  0.8854817637006664


In [28]:
!pip install --upgrade transformers



Prediction is done again because:
it's not strictly "re-training" or a completely redundant prediction, but rather a localized prediction step within the loop to facilitate the correct reconstruction and grouping of aspect terms for the output DataFrame.

In [29]:
from transformers import DistilBertTokenizerFast, DistilBertForTokenClassification
import torch
import pandas as pd

# Inverse map for labels
inverse_label_map = {0: 'O', 1: 'B', 2: 'I'}

extracted_aspects = []

# Iterate through the test sentences
test_sentences = df1['Sentence'][1200:].tolist()

predicted_aspect_terms = []

# Load the fast version of the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Assuming the model is already loaded and in eval mode from the previous cell
# model.eval()

for sentence in test_sentences:
    # Tokenize the sentence and get offset mapping and word_ids
    encoded = tokenizer(
        sentence,
        add_special_tokens=True,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors='pt',
        return_attention_mask=True,
        return_offsets_mapping=True
    )
    input_ids = encoded['input_ids']
    attention_mask = encoded['attention_mask']
    offsets = encoded['offset_mapping'][0].tolist()
    word_ids = encoded.word_ids()                     ## word_ids will have "None" for special tokens ['cls'], ['end'], ['pad']

    # Get predictions for the current sentence
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=2).squeeze(0).tolist() # Token-level predictions (0, 1, or 2)

    # Extract aspect terms based on BIO tags and offsets
    current_aspect_span = None
    sentence_aspect_terms_set = set() # Use a set to store unique aspect terms

    for token_idx in range(len(input_ids[0])):
        word_id = word_ids[token_idx]
        predicted_label_id = predictions[token_idx] # Numerical prediction for the token
        predicted_label = inverse_label_map[predicted_label_id] # Convert to 'O', 'B', 'I'

        # Only process tokens that correspond to original words (not special tokens)
        if word_id is not None:
            start_offset, end_offset = offsets[token_idx]

            if predicted_label == 'B':
                # If a previous aspect span was being tracked, add it to the set
                if current_aspect_span is not None:
                    aspect_term = sentence[current_aspect_span[0]:current_aspect_span[1]].strip()
                    if aspect_term:
                        sentence_aspect_terms_set.add(aspect_term)

                # Start a new aspect term span
                current_aspect_span = (start_offset, end_offset)

            elif predicted_label == 'I':
                # If currently tracking an aspect span, extend it
                if current_aspect_span is not None:
                    current_aspect_span = (current_aspect_span[0], end_offset)
                # If 'I' without preceding 'B', ignore (do not start or extend span)

            else: # predicted_label == 'O'
                # If currently tracking an aspect span, it's now completed
                if current_aspect_span is not None:
                    aspect_term = sentence[current_aspect_span[0]:current_aspect_span[1]].strip()
                    if aspect_term:
                        sentence_aspect_terms_set.add(aspect_term)
                    current_aspect_span = None # Reset the span

    # After the loop, add the last aspect term if one was being tracked at the end of the sentence
    if current_aspect_span is not None:
         aspect_term = sentence[current_aspect_span[0]:current_aspect_span[1]].strip()
         if aspect_term:
            sentence_aspect_terms_set.add(aspect_term)


    # Convert the set back to a list for the DataFrame
    extracted_terms = list(sentence_aspect_terms_set)

    # Add a placeholder if no aspect terms were extracted
    if not extracted_terms:
        predicted_aspect_terms.append(['NO_ASPECT_DETECTED'])
    else:
        predicted_aspect_terms.append(extracted_terms)


# Create a DataFrame
predicted_aspects_df = pd.DataFrame({
    'Sentence': test_sentences,
    'Predicted Aspect Terms': predicted_aspect_terms
})

display(predicted_aspects_df.head())

Unnamed: 0,Sentence,Predicted Aspect Terms
0,This computer I used daily nice compact design.,[NO_ASPECT_DETECTED]
1,This computer doesn't do that well with certai...,[NO_ASPECT_DETECTED]
2,This computer had exactly the specifications I...,[specifications]
3,This computer is exceptionally thin for it's s...,[screen size]
4,This computer that I have has had issues with ...,[keyboard]


In [30]:
print(test_sentences[253])
predicted_aspects_df.sample(5)

  The company sent me a whole new cord overnight and apologized.


Unnamed: 0,Sentence,Predicted Aspect Terms
84,"While it was highly rated, would I like it? I ...",[keyboard]
135,but now i have realized its a problem with thi...,[NO_ASPECT_DETECTED]
32,"Took me 11 hours, 3 trips to different FedEx o...",[IT support technicians]
87,"Who couldn't love a DVD burner, 80-gigabyte HD...","[components, DVD burner]"
15,This is the complete opposite to an ergonomic ...,[NO_ASPECT_DETECTED]


In [31]:
# Explode the 'Predicted Aspect Terms' column
exploded_predicted_aspects_df = predicted_aspects_df.explode('Predicted Aspect Terms')

# Display the new DataFrame
display(exploded_predicted_aspects_df.head(10))

Unnamed: 0,Sentence,Predicted Aspect Terms
0,This computer I used daily nice compact design.,NO_ASPECT_DETECTED
1,This computer doesn't do that well with certai...,NO_ASPECT_DETECTED
2,This computer had exactly the specifications I...,specifications
3,This computer is exceptionally thin for it's s...,screen size
4,This computer that I have has had issues with ...,keyboard
5,This computer was so challenging to carry and ...,NO_ASPECT_DETECTED
6,This is a great little computer for the price.,NO_ASPECT_DETECTED
7,This is a great value for the money.,value
8,This is a nicely sized laptop with lots of pro...,NO_ASPECT_DETECTED
9,This is a review of windows vista system.,NO_ASPECT_DETECTED


In [32]:
# Filter the predicted DataFrame to show sentences with no detected aspects
no_aspect_detected_df = predicted_aspects_df[
    predicted_aspects_df['Predicted Aspect Terms'].apply(lambda x: x == ['NO_ASPECT_DETECTED'])
]

# Merge with the original df1 to see the actual aspect terms
# We'll merge on the 'Sentence' column
# Note: This assumes 'Sentence' is a reliable key for merging.
# If a sentence appears multiple times in df1 with different aspect terms,
# the merge will create multiple rows for that sentence.
merged_no_aspect_df = pd.merge(
    no_aspect_detected_df,
    df1[['Sentence', 'Aspect Term']],
    on='Sentence',
    how='left'
)

# Display the results
print("Sentences where the model predicted NO_ASPECT_DETECTED and their actual aspect terms:")
display(merged_no_aspect_df)

Sentences where the model predicted NO_ASPECT_DETECTED and their actual aspect terms:


Unnamed: 0,Sentence,Predicted Aspect Terms,Aspect Term
0,This computer I used daily nice compact design.,[NO_ASPECT_DETECTED],[design]
1,This computer doesn't do that well with certai...,[NO_ASPECT_DETECTED],[games]
2,This computer was so challenging to carry and ...,[NO_ASPECT_DETECTED],"[carry, handle]"
3,This is a great little computer for the price.,[NO_ASPECT_DETECTED],[price]
4,This is a nicely sized laptop with lots of pro...,[NO_ASPECT_DETECTED],"[processing power, battery life, sized]"
5,This is a review of windows vista system.,[NO_ASPECT_DETECTED],[windows vista system]
6,"This is an over-sized, 18-inch laptop.",[NO_ASPECT_DETECTED],[18-inch]
7,This is the complete opposite to an ergonomic ...,[NO_ASPECT_DETECTED],[design]
8,This is the first time that I tried and owning...,[NO_ASPECT_DETECTED],[screen size]
9,This is what I call a good after sales service.,[NO_ASPECT_DETECTED],[after sales service]


## **Comparison when only weights of  classificaion heads are trained vs When both classification head and last two layers are trained.**

In [48]:
model1 = DistilBertForTokenClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

# Freeze the encoder (DistilBERT base)
for param in model1.distilbert.parameters():
    param.requires_grad = False

for param in model1.classifier.parameters():
    param.requires_grad = True

model1.train()

for epoch in range(training_params["epochs"]):
    total_loss = 0.0
    for batch in data_loader:
        optimizer.zero_grad()

        outputs = model1(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"]
        )

        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()

    avg_loss = total_loss / len(data_loader)
    print(f"Epoch {epoch+1} | Average Loss: {avg_loss:.4f}")

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1 | Average Loss: 1.0595
Epoch 2 | Average Loss: 1.0602
Epoch 3 | Average Loss: 1.0607
Epoch 4 | Average Loss: 1.0597


In [49]:
tokenized_texts = [splitSentence(i) for i in df1['Sentence'][1200:].tolist()]
input_ids = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokenized_texts]
attention_masks = [[1] * len(tokens) for tokens in input_ids]
labels = list(df1['labels'][1200:])
model1.eval()                        ## Switches the model to evaluation mode
with torch.no_grad():               ## It tells the pytorch not to track gradient to save memory and computations. All opertions inside the block wont build the computational graph
    predicted = []
    count = 0
    c = 0
    for i in range(len(input_ids)):
        outputs = model1(torch.LongTensor(input_ids[i]).unsqueeze(0), attention_mask=torch.LongTensor(attention_masks[i]).unsqueeze(0))    ## As no true labels are passed so it just output the logits
        logits = outputs.logits                   ## For each token, gives raw scores for each label
        preds = torch.argmax(logits, dim=2)       ## Converts logits to predicted labels by picking the label with maximum score for each token along the dimension (done for each token from the list of token_ids)
        count = count + len(preds[0])             ## preds[0] gives you the predictions for just the sentence (without the batch dimension).
        c = c + len(labels[i])
        predicted.extend(preds[0])
print(predicted)

[tensor(2), tensor(2), tensor(2), tensor(2), tensor(1), tensor(1), tensor(1), tensor(2), tensor(2), tensor(2), tensor(2), tensor(2), tensor(0), tensor(0), tensor(2), tensor(2), tensor(2), tensor(2), tensor(2), tensor(2), tensor(2), tensor(2), tensor(2), tensor(1), tensor(1), tensor(2), tensor(1), tensor(2), tensor(2), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(0), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(2), tensor(2), tensor(2), tensor(0), tensor(1), tensor(1), tensor(2), tensor(2), tensor(2), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(2), tensor(2), tensor(2), tensor(2), tensor(1), tensor(1)

In [50]:
label_map = {'O':0,'B':1,'I':2}
labels_test = []
for label_list in df1['labels'][1200 :]:
    # Convert labels to tensor
    label_tensor = torch.tensor([label_map[label] for label in label_list], dtype=torch.long)
    labels_test.append(list(label_tensor))


actual = []
for i in labels_test:
    actual.extend(i)
# print(actual)

correct = 0
for i in range(len(actual)):
    if predicted[i]==actual[i]:
        correct = correct + 1
print(correct/count)

0.22692383389096635


In [51]:
from sklearn.metrics import precision_score, recall_score, f1_score

# calculate precision, recall, and F1 score
precision = precision_score(actual, predicted,average='weighted')
recall = recall_score(actual, predicted,average='weighted')
f1 = f1_score(actual, predicted,average='weighted')

# print results
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 Score: ", f1)

Precision:  0.8763034544845161
Recall:  0.22692383389096635
F1 Score:  0.29247902404108495


In [52]:
model2 = DistilBertForTokenClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

# Freeze all layers initially
for param in model2.distilbert.parameters():
    param.requires_grad = False

# Unfreeze the last 2 layers
def unfreeze_last_n_layers(model, n=2):
    total_layers = len(model2.distilbert.transformer.layer)
    for i in range(total_layers - n, total_layers):
        for param in model2.distilbert.transformer.layer[i].parameters():
            param.requires_grad = True

# Classifier should remain trainable
for param in model2.classifier.parameters():
    param.requires_grad = True

unfreeze_last_n_layers(model2, n=4)

model2.train()

for epoch in range(training_params["epochs"]):
    total_loss = 0.0
    for batch in data_loader:
        optimizer.zero_grad()

        outputs = model2(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"]
        )

        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()

    avg_loss = total_loss / len(data_loader)
    print(f"Epoch {epoch+1} | Average Loss: {avg_loss:.4f}")

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1 | Average Loss: 1.1490
Epoch 2 | Average Loss: 1.1503
Epoch 3 | Average Loss: 1.1485
Epoch 4 | Average Loss: 1.1503


In [53]:
tokenized_texts = [splitSentence(i) for i in df1['Sentence'][1200:].tolist()]
input_ids = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokenized_texts]
attention_masks = [[1] * len(tokens) for tokens in input_ids]
labels = list(df1['labels'][1200:])
model2.eval()                        ## Switches the model to evaluation mode
with torch.no_grad():               ## It tells the pytorch not to track gradient to save memory and computations. All opertions inside the block wont build the computational graph
    predicted = []
    count = 0
    c = 0
    for i in range(len(input_ids)):
        outputs = model2(torch.LongTensor(input_ids[i]).unsqueeze(0), attention_mask=torch.LongTensor(attention_masks[i]).unsqueeze(0))    ## As no true labels are passed so it just output the logits
        logits = outputs.logits                   ## For each token, gives raw scores for each label
        preds = torch.argmax(logits, dim=2)       ## Converts logits to predicted labels by picking the label with maximum score for each token along the dimension (done for each token from the list of token_ids)
        count = count + len(preds[0])             ## preds[0] gives you the predictions for just the sentence (without the batch dimension).
        c = c + len(labels[i])
        predicted.extend(preds[0])
print(predicted)

[tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(1), tensor(1), tensor(0), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(2), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(0), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(2), tensor(1), tensor(2), tensor(2), tensor(2), tensor(1), tensor(0), tensor(0), tensor(0), tensor(2), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(0), tensor(1), tensor(1), tensor(1)

In [54]:
label_map = {'O':0,'B':1,'I':2}
labels_test = []
for label_list in df1['labels'][1200 :]:
    # Convert labels to tensor
    label_tensor = torch.tensor([label_map[label] for label in label_list], dtype=torch.long)
    labels_test.append(list(label_tensor))


actual = []
for i in labels_test:
    actual.extend(i)
# print(actual)

correct = 0
for i in range(len(actual)):
    if predicted[i]==actual[i]:
        correct = correct + 1
print(correct/count)

0.1749655579610313


In [55]:
from sklearn.metrics import precision_score, recall_score, f1_score

# calculate precision, recall, and F1 score
precision = precision_score(actual, predicted,average='weighted')
recall = recall_score(actual, predicted,average='weighted')
f1 = f1_score(actual, predicted,average='weighted')

# print results
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 Score: ", f1)

Precision:  0.7780710883184311
Recall:  0.1749655579610313
F1 Score:  0.22153660486249965
