NLP libraries focused on practical and production ready task

*   Spacy- Tokenization (splitting text into words), Part-of-Speech (POS), tagging, Named Entity Recognition (NER), Dependency parsing Lemmatization (base form of a word)

*   NLTk- Tokenization, stemming, lemmatization, POS tagging, Text classification, Sentiment analysis


Named entity recognition helps in identifying and classifying the key information in text into predefined categories.



In [2]:
import pandas as pd
import numpy as np
import nltk
import spacy
from sklearn.preprocessing import OneHotEncoder

# Download and load the spaCy model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m26.1 MB/s[0m eta [36m0:00:00[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
file_path = "/content/drive/My Drive/Dataset/Laptop_Train_v2.csv"
data = pd.read_csv(file_path)
data.head()

Unnamed: 0,id,Sentence,Aspect Term,polarity,from,to
0,2339,I charge it at night and skip taking the cord ...,cord,neutral,41,45
1,2339,I charge it at night and skip taking the cord ...,battery life,positive,74,86
2,1316,The tech guy then said the service center does...,service center,negative,27,41
3,1316,The tech guy then said the service center does...,"""sales"" team",negative,109,121
4,1316,The tech guy then said the service center does...,tech guy,neutral,4,12


In [5]:
data = data.drop(columns = ['id', 'from', 'to'], axis = 1)
data.head()

Unnamed: 0,Sentence,Aspect Term,polarity
0,I charge it at night and skip taking the cord ...,cord,neutral
1,I charge it at night and skip taking the cord ...,battery life,positive
2,The tech guy then said the service center does...,service center,negative
3,The tech guy then said the service center does...,"""sales"" team",negative
4,The tech guy then said the service center does...,tech guy,neutral


In [6]:
def str_to_num(x):
    if x == 'psotive':
      return 1
    elif x == 'negative':
      return -1
    else:
      return 0

data['polarity'] = data['polarity'].apply(str_to_num)
data.head()


Unnamed: 0,Sentence,Aspect Term,polarity
0,I charge it at night and skip taking the cord ...,cord,0
1,I charge it at night and skip taking the cord ...,battery life,0
2,The tech guy then said the service center does...,service center,-1
3,The tech guy then said the service center does...,"""sales"" team",-1
4,The tech guy then said the service center does...,tech guy,0


Idea how to take input :
step1: convert the input setence into token of words with each token as 768 dimension vector **using distilberttokenizer**  from which we get from BERT embedding
step2: Append the upos(universal parts of speech) of each word to above vector
step3: Append also xpos(extended parts of speech) of each word to above appended vector
For step2: Converting a word into one-hot encoded upos vector using 'ntlk' library and it outputs with dimension value of 37. (Alternatively, if we use spacy library it will give with 17dimension)

**UPOS**- Language independent, use when you want consistency across multiple

languages (e.g., multilingual NLP)

**XPOS**- richer, more detailed POS tags, defined by each language’s traditional grammar, hence language dependent. It gives fine grained grammertical details.


NLTK supports only XPOS by default which has 17 different tags.
Spacy supports both UPOS and XPOS and has 37 different pos tags

'punkt' : Splits a paragrapgh of text into individual sentences. It is trained on large text corpus for specific languages - eng, german etc. It is used before applying tokenization, pos tagging or NER. It is unsupervised means it learns punctuation patterns and abbreviation usage from a raw corpus.

In [7]:
from nltk import pos_tag, word_tokenize
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')   # used for XPOS & downloads a pretrained POS tagger based on the averaged perceptron algorithm, which is used by NLTK’s pos_tag() function.
nltk.download('universal_tagset')             # it converts XPOS to UPOS when requested
nltk.download('averaged_perceptron_tagger_eng', download_dir='/root/nltk_data') # Download the English tagger to a specific directory

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [8]:
def splitSentence(sentence):
    split_sentence = sentence.split(",") # split by commas ex- ['my  name is khan', 'i live in mannat', 'i am the king']
    split_sentence = [word.split() for word in split_sentence] # split each word by space ex - [['my', 'name', 'is', 'khan'], ['i', 'live', 'in', 'mannat']]
    split_sentence = [word for sublist in split_sentence for word in sublist] # flatten the list of lists into a single list ex - ['my', 'name', 'is', 'khan','i', 'live', 'in', 'mannat']

    return split_sentence

In [9]:
# Create a one-hot encoder and fit it to the set of UPOS tags
#this below tags are used by spacy for giving pos_tags.
upos_tags = ['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X']
encoder = OneHotEncoder(sparse_output = False, categories=[upos_tags])
encoder.fit(np.array(upos_tags).reshape(-1, 1)) #fit


def upos_tagging_word(word):

    # Use the spacy tagger to tag the word and extract the UPOS tag
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(word) # example this will be [('author','NN')]
    upos_tag = doc[0].pos_ # "NN"
    if upos_tag==',' or upos_tag=='.' or upos_tag=='?' or upos_tag==':' or upos_tag==';' or upos_tag=='_':
        upos_tag = 'PUNCT'
    if upos_tag not in upos_tags:
        vector = np.zeros((1, len(upos_tags)))
        return vector

    # np.array([upos_tag]).reshape(-1, 1)) converts to [['NN']]
    # Transform the UPOS tag into a one-hot encoded vector
    onehot_vector = encoder.transform(np.array([upos_tag]).reshape(-1, 1))
    return onehot_vector

In [10]:
import torch
def upos_tagging(sentence):
    answer = []
    tokens = splitSentence(sentence)
    for i in tokens:
        word_tag = upos_tagging_word(i)
        answer.append(torch.tensor(word_tag[0]))
    return answer

In [11]:
print(upos_tagging("the author"))

[tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       dtype=torch.float64), tensor([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       dtype=torch.float64)]


We will also compare what is the difference in the performance when only Word2Vec is used and when W2V + Upos tagging is used

### **Aspect Extraction Using BERT**

In [12]:
data

Unnamed: 0,Sentence,Aspect Term,polarity
0,I charge it at night and skip taking the cord ...,cord,0
1,I charge it at night and skip taking the cord ...,battery life,0
2,The tech guy then said the service center does...,service center,-1
3,The tech guy then said the service center does...,"""sales"" team",-1
4,The tech guy then said the service center does...,tech guy,0
...,...,...,...
2353,We also use Paralles so we can run virtual mac...,Windows Server Enterprise 2003,0
2354,We also use Paralles so we can run virtual mac...,Windows Server 2008 Enterprise,0
2355,"How Toshiba handles the repair seems to vary, ...",repair,0
2356,"How Toshiba handles the repair seems to vary, ...",repair,0


In [13]:
import pandas as pd
import numpy as np
import torch
import transformers
from transformers import DistilBertTokenizer, DistilBertModel, TFDistilBertForTokenClassification, TFDistilBertForSequenceClassification
# from transformers import DistillBertTokenizer, BertForTokenClassification, BertForSequenceClassification , BertModel
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

**Working of this tokenizer**

Downloads the vocab and tokenizer config (if not cached).

Loads the WordPiece tokenizer with the vocabulary used during DistilBERT's training.

Returns a tokenizer object you can use to tokenize your input text.

In [14]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')  ## loads the pretrained tokenizer for the DistilBert model (uncased one: which doesnt differentiate between capital & small letters)

In [15]:
df1 = df_grouped = data.groupby('Sentence').agg({'Aspect Term': lambda x: x.tolist(), 'polarity': lambda x: x.tolist()}).reset_index()
df1.head()

Unnamed: 0,Sentence,Aspect Term,polarity
0,""" This isn't a big deal, I haven't noticed the...",[USB output],[-1]
1,"""> iPhoto is probably the best program I have...",[iPhoto],[0]
2,( The iBook backup also uses a firewire connec...,"[iBook backup, firewire connection]","[0, 0]"
3,"(Beware, their staff could send you back makin...",[staff],[-1]
4,(I found a 2GB stick for a bit under $50) Nice...,[system],[0]


**Adding BIO label column in the dataframe**
BIO means: It labels each word/token in a sentence, enabling models to learn which spans of text belong to a certain labels.
W/O BIO: You might detect "battery" and "life" as aspects, but not know if they belong together.

B-TERM: Beginning of aspect term

I-TERM: Inside the aspect term

O: Outside (not part of an aspect term)

BIO is required only when- Named Entity Recognition (NER), Chunking,Slot filling, or when we do token level classification.
Not required when Sentence-level classification (e.g., sentiment analysis, topic classification), Question answering, Text generation or embedding extraction

In [16]:
def tokenize_with_whitespace(text):         ##split the sentence by whitespaces
    """
    Tokenize text using whitespace as a delimiter
    """
    tokens = splitSentence(text)
    return tokens

def get_token_offsets(text):               ## tokens and their offsets. example: "the rice is good", here this function returns ['the','rice','is','good'],[(0,3),(4,8),(9,11),(12,16)]
    tokens = []
    offsets = []
    for token in tokenize_with_whitespace(text):
        tokens.append(token)
        start_index = text.find(token)
        end_index = start_index + len(token)
        offsets.append((start_index, end_index))
    return tokens, offsets

def get_token_spans(offsets):
    spans = [(offsets[i][0], offsets[i+1][0] if i < len(offsets) - 1 else offsets[i][1]) for i in range(len(offsets))]
    return spans

# Convert sentence to B-I-O tags
def convert_to_b_i_o(sentence, aspect_terms):             ##this converts the each token in sentence to {'I','O','B'}
    # Get corresponding token offsets and spans for sentence
    tokens, offsets = get_token_offsets(sentence)            ## Number of  tokens in one sentence
    spans = get_token_spans(offsets)

    tags = ["O"] * len(tokens)                              # everytime tag will be the size of the number of tokens in each sentence.
    for aspect_term in aspect_terms:
        aspect_words = tokenize_with_whitespace(aspect_term)    ## If a aspect term has two terms then store the two termms separately in the list
        aspect_words_len = len(aspect_words)

        start_index = None                                  # Find the start and end indexes of the aspect term within the sentence
        end_index = None
        for i in range(len(tokens) - aspect_words_len + 1):
            if tokens[i:i+aspect_words_len] == aspect_words:      ## this slicing will return a subset of list & hence we can compare it with another list.
                start_index = i                                   ## this index is the index of that particular word from the list of token generated from the sentence.
                end_index = i + aspect_words_len
                break

        if start_index is not None and end_index is not None:
            # Find the start and end offsets of the aspect term within the sentence
            start_offset = spans[start_index][0]  ## Here is the answer of why spans is used and not directly the offset (because... see in documentation).
            end_offset = spans[end_index-1][1]

            # Update tags list with B-I-O tags for aspect term
            for i in range(start_index, end_index):
                if i == start_index:
                    tags[i] = "B"
                else:
                    tags[i] = "I"

    return tags

In [17]:
df1['labels'] = 1
for i in range(0,len(df1['Sentence'])):
    a = convert_to_b_i_o(df1['Sentence'][i], df1['Aspect Term'][i])
    df1['labels'][i] = a

In [18]:
df1

Unnamed: 0,Sentence,Aspect Term,polarity,labels
0,""" This isn't a big deal, I haven't noticed the...",[USB output],[-1],"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,"""> iPhoto is probably the best program I have...",[iPhoto],[0],"[O, B, O, O, O, O, O, O, O, O, O, O, O, O, O]"
2,( The iBook backup also uses a firewire connec...,"[iBook backup, firewire connection]","[0, 0]","[O, O, B, I, O, O, O, O, O]"
3,"(Beware, their staff could send you back makin...",[staff],[-1],"[O, O, B, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,(I found a 2GB stick for a bit under $50) Nice...,[system],[0],"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
...,...,...,...,...
1477,"A coupla months later, they change my hard dr...",[hard drive],[-1],"[O, O, O, O, O, O, O, O, O]"
1478,"I actually had the hard drive replaced twice,...","[hard drive, mother board, dvd drive]","[-1, -1, -1]","[O, O, O, O, B, I, O, O, O, B, I, O, O, B, I, ..."
1479,One night I turned the freaking thing off aft...,"[GUI, screen, power light, hard drive light]","[-1, -1, 0, -1]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1480,THE MOTHERBOARD IS DEAD !,[MOTHERBOARD],[-1],"[O, B, O, O, O]"


In [19]:
## converting sentence into tokens of vectors. each vector is of 768 dimension
## DistilBertTokenizer helps in creating input_id and attention mask for each token of the given sentence
## input_id is the numerical representation of the text and each id is based on the model vocabulary
## These IDs are actually fed to thee  transformer.
## DistilBertModel helps in looking for corresponding token embedding using the token_id from its embedding matrix
## Attention mask is a binary mask, if 1: attend this token, 2: ignore this token(usually padding)
## It prevents the model from learning from or being distracted by padding tokens.

model = DistilBertModel.from_pretrained('bert-base-uncased')

# Load BERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('bert-base-uncased')
def bert_to_token(sentence):


  # Define the sentence to be tokenized

  # Tokenize the sentence
  tokens = splitSentence(sentence)
  # print(len(tokens))
  # Convert tokens to ids
  input_ids = tokenizer.convert_tokens_to_ids(tokens)
  # print(len(input_ids))
  # Add special tokens
  # Convert input_ids to tensor
  input_ids = torch.tensor(input_ids).unsqueeze(0)      # helps in converting dim from [4] to [1, 4] where 0 mean along rows and if 1 mean along columns
  # print(len(input_ids))
  # Get the 768 dimensional vectors for each token
  outputs = model(input_ids)
  # print(outputs,len(outputs))
  token_vectors = outputs.last_hidden_state.squeeze(0)        # Output of the last layer of the DistilBert
  return token_vectors

You are using a model of type bert to instantiate a model of type distilbert. This is not supported for all configurations of models and can yield errors.
Some weights of DistilBertModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['embeddings.LayerNorm.bias', 'embeddings.LayerNorm.weight', 'embeddings.position_embeddings.weight', 'embeddings.word_embeddings.weight', 'transformer.layer.0.attention.k_lin.bias', 'transformer.layer.0.attention.k_lin.weight', 'transformer.layer.0.attention.out_lin.bias', 'transformer.layer.0.attention.out_lin.weight', 'transformer.layer.0.attention.q_lin.bias', 'transformer.layer.0.attention.q_lin.weight', 'transformer.layer.0.attention.v_lin.bias', 'transformer.layer.0.attention.v_lin.weight', 'transformer.layer.0.ffn.lin1.bias', 'transformer.layer.0.ffn.lin1.weight', 'transformer.layer.0.ffn.lin2.bias', 'transformer.layer.0.ffn.lin2.weight', 'transformer.layer.0.output_layer_norm.bias', 'transformer.laye

In [20]:
label_map = {'O':0,'B':1,'I':2}
labels = []
for label_list in df1['labels']:

    # Convert labels to tensor
    label_tensor = torch.tensor([label_map[label] for label in label_list], dtype=torch.long)
    labels.append(list(label_tensor))
print(labels)


[[tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0)], [tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0)], [tensor(0), tensor(0), tensor(1), tensor(2), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0)], [tensor(0), tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0)], [tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), 

**Analyzing Class Imbalance within the Aspect's labels os a particular sentence**

In [21]:
C_0 = 0
C_1 = 1
C_2 = 2
for label in labels:
    for i in label:
        if i==0:
            C_0 = C_0 + 1
        if i==1:
            C_1 = C_1 + 1
        if i==2:
            C_2 = C_2 + 1

class_counts = [C_0, C_1, C_2]
class_weights = torch.tensor([1.0, sum(class_counts) / (2 * class_counts[1]), sum(class_counts) / (2 * class_counts[2])])   ## class 0 is assigned weight 0, class 1 & 2 is assigned weight based on its frequence of occurence.
print(class_weights)                                                                                                        ## 2 is mult in deno to normalize the weight so that the weight doesn't get too high

tensor([ 1.0000,  7.0073, 14.9843])


## **BERT Finetuning**

DistilBertTokenizer	Converts raw text → token IDs	Input: text → Output: token IDs, masks

DistilBertForTokenClassification	Predicts label per token (NER, etc.)	Input: token IDs → Output: logits per token. A DistilBERT model with an added classification head (a linear layer on top).

With the DistilBertModel we get only the embeddings of the tokens.

outputs.loss      # CrossEntropyLoss between predicted labels and true labels

outputs.logits    # Shape: [1, sequence_length, num_labels]

In [22]:
import torch
from transformers import DistilBertTokenizer, DistilBertForTokenClassification

# Load the pre-trained BERT model and tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForTokenClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

# Define your optimizer and loss function
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# Prepare your data
label_map = {'O': 0, 'B': 1, 'I': 2}
tokenized_texts = [splitSentence(i) for i in df1['Sentence'][:1000].tolist()]               ## Splitting each sentence into list of lists of tokens.
input_ids = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokenized_texts]         ## Each token from the list is converted to corresponding id using the tokenizer corpus( list of lists of token_ids)
attention_masks = [[1] * len(tokens) for tokens in input_ids]
label_map = {'O':0,'B':1,'I':2}
labels = []
for label_list in df1['labels'][:1000]:
    # Convert labels to tensor
    label_tensor = torch.tensor([label_map[label] for label in label_list], dtype=torch.long)
    labels.append(list(label_tensor))

## Labels are the list of lists of new labels.

num_epochs = 1
# Train your model
model.train()
for epoch in range(num_epochs):
    for i in range(len(input_ids)):
        optimizer.zero_grad()                 ## It’s used to clear (reset) gradients before computing them for the next batch to prevent exploding geadients
        # Ensure labels have the same sequence length as input_ids
        current_labels = labels[i][:len(input_ids[i])]
        outputs = model(torch.LongTensor(input_ids[i]).unsqueeze(0), attention_mask=torch.LongTensor(attention_masks[i]).unsqueeze(0), labels=torch.LongTensor(current_labels).unsqueeze(0))
        loss = outputs.loss       # This gives the scaler loss value
        loss.backward()           # it computes gradients of the loss with respect to all trainable model parameters and store in param.grid.
        optimizer.step()          # the optimizer (like Adam) to adjust the model’s weights using the gradients calculated in the previous step. updates the parameters to hopefully make the loss lower next time.
    print(f"Epoch {epoch+1} Loss: {loss.item()}")

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1 Loss: 0.02501624822616577


In [23]:
## saves the trained weights of your model, So that you can later reload the model weights without retraining it again.
## model.state_dict()- Returns a dictionary of all learnable parameters in the model (weights, biases, etc.) No architecture only parameters
## '.pt' extension is commonly used for PyTorch model files
## model.load_state_dict(torch.load('distilbert-finetune-aspect.pt'))
## model.eval()  # switch to inference mode
torch.save(model.state_dict(), 'distilbert-finetune-aspect.pt')

# **PREDICTIONS**

In [24]:
# model.load_state_dict(torch.load('/content/distilbert-finetune-aspect.pt'))
# model.eval()

tokenized_texts = [splitSentence(i) for i in df1['Sentence'][1200:].tolist()]
input_ids = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokenized_texts]
attention_masks = [[1] * len(tokens) for tokens in input_ids]
labels = list(df1['labels'][1200:])
model.eval()                        ## Switches the model to evaluation mode
with torch.no_grad():               ## It tells the pytorch not to track gradient to save memory and computations. All opertions inside the block wont build the computational graph
    predicted = []
    count = 0
    c = 0
    for i in range(len(input_ids)):
        outputs = model(torch.LongTensor(input_ids[i]).unsqueeze(0), attention_mask=torch.LongTensor(attention_masks[i]).unsqueeze(0))    ## As no true labels are passed so it just output the logits
        logits = outputs.logits                   ## For each token, gives raw scores for each label
        preds = torch.argmax(logits, dim=2)       ## Converts logits to predicted labels by picking the label with maximum score for each token along the dimension (done for each token from the list of token_ids)
        count = count + len(preds[0])             ## preds[0] gives you the predictions for just the sentence (without the batch dimension).
        c = c + len(labels[i])
        predicted.extend(preds[0])
print(predicted)

[tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(2), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(1), tensor(0)

In [25]:
label_map = {'O':0,'B':1,'I':2}
labels_test = []
for label_list in df1['labels'][1200 :]:
    # Convert labels to tensor
    label_tensor = torch.tensor([label_map[label] for label in label_list], dtype=torch.long)
    labels_test.append(list(label_tensor))


actual = []
for i in labels_test:
    actual.extend(i)
# print(actual)

correct = 0
for i in range(len(actual)):
    if predicted[i]==actual[i]:
        correct = correct + 1
print(correct/count)

0.9387915764613265


average='weighted':

Computes precision per class

Then takes a weighted average based on the number of true instances in each class

Prevents rare classes from being ignored

In [26]:
from sklearn.metrics import precision_score, recall_score, f1_score

# calculate precision, recall, and F1 score
precision = precision_score(actual, predicted,average='weighted')
recall = recall_score(actual, predicted,average='weighted')
f1 = f1_score(actual, predicted,average='weighted')

# print results
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 Score: ", f1)

Precision:  0.9330386232527655
Recall:  0.9387915764613265
F1 Score:  0.9345470502685594


In [27]:
!pip install --upgrade transformers



Prediction is done again because:
it's not strictly "re-training" or a completely redundant prediction, but rather a localized prediction step within the loop to facilitate the correct reconstruction and grouping of aspect terms for the output DataFrame.

In [None]:
from transformers import DistilBertTokenizerFast, DistilBertForTokenClassification
import torch
import pandas as pd

# Inverse map for labels
inverse_label_map = {0: 'O', 1: 'B', 2: 'I'}

extracted_aspects = []

# Iterate through the test sentences
test_sentences = df1['Sentence'][1200:].tolist()

predicted_aspect_terms = []

# Load the fast version of the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Assuming the model is already loaded and in eval mode from the previous cell
# model.load_state_dict(torch.load('/content/distilbert-finetune-aspect.pt'))
# model.eval()

for sentence in test_sentences:
    # Tokenize the sentence and get offset mapping and word_ids
    encoded = tokenizer(
        sentence,
        add_special_tokens=True,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors='pt',
        return_attention_mask=True,
        return_offsets_mapping=True
    )
    input_ids = encoded['input_ids']
    attention_mask = encoded['attention_mask']
    offsets = encoded['offset_mapping'][0].tolist()
    word_ids = encoded.word_ids()

    # Get predictions for the current sentence
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=2).squeeze(0).tolist() # Token-level predictions (0, 1, or 2)

    # Extract aspect terms based on BIO tags and offsets
    current_aspect_span = None
    sentence_aspect_terms_set = set() # Use a set to store unique aspect terms

    for token_idx in range(len(input_ids[0])):
        word_id = word_ids[token_idx]
        predicted_label_id = predictions[token_idx] # Numerical prediction for the token
        predicted_label = inverse_label_map[predicted_label_id] # Convert to 'O', 'B', 'I'

        # Only process tokens that correspond to original words (not special tokens)
        if word_id is not None:
            start_offset, end_offset = offsets[token_idx]

            if predicted_label == 'B':
                # If a previous aspect span was being tracked, add it to the set
                if current_aspect_span is not None:
                    aspect_term = sentence[current_aspect_span[0]:current_aspect_span[1]].strip()
                    if aspect_term:
                        sentence_aspect_terms_set.add(aspect_term)

                # Start a new aspect term span
                current_aspect_span = (start_offset, end_offset)

            elif predicted_label == 'I':
                # If currently tracking an aspect span, extend it
                if current_aspect_span is not None:
                    current_aspect_span = (current_aspect_span[0], end_offset)
                # If 'I' without preceding 'B', ignore (do not start or extend span)

            else: # predicted_label == 'O'
                # If currently tracking an aspect span, it's now completed
                if current_aspect_span is not None:
                    aspect_term = sentence[current_aspect_span[0]:current_aspect_span[1]].strip()
                    if aspect_term:
                        sentence_aspect_terms_set.add(aspect_term)
                    current_aspect_span = None # Reset the span

    # After the loop, add the last aspect term if one was being tracked at the end of the sentence
    if current_aspect_span is not None:
         aspect_term = sentence[current_aspect_span[0]:current_aspect_span[1]].strip()
         if aspect_term:
            sentence_aspect_terms_set.add(aspect_term)


    # Convert the set back to a list for the DataFrame
    predicted_aspect_terms.append(list(sentence_aspect_terms_set))


# Create a DataFrame
predicted_aspects_df = pd.DataFrame({
    'Sentence': test_sentences,
    'Predicted Aspect Terms': predicted_aspect_terms
})

display(predicted_aspects_df.head())

In [44]:
print(test_sentences[239])
predicted_aspects_df.head(5)

  It gets stuck all of the time you use it, and you have to keep tapping on it to get it to work.


Unnamed: 0,Sentence,Predicted Aspect Terms
0,This computer I used daily nice compact design.,[design]
1,This computer doesn't do that well with certai...,[]
2,This computer had exactly the specifications I...,[specifications]
3,This computer is exceptionally thin for it's s...,"[screen size, processing power]"
4,This computer that I have has had issues with ...,"[keyboard functions, keyboard]"


In [30]:
# Explode the 'Predicted Aspect Terms' column
exploded_predicted_aspects_df = predicted_aspects_df.explode('Predicted Aspect Terms')

# Display the new DataFrame
display(exploded_predicted_aspects_df.head(10))

Unnamed: 0,Sentence,Predicted Aspect Terms
0,This computer I used daily nice compact design.,design
1,This computer doesn't do that well with certai...,
2,This computer had exactly the specifications I...,specifications
3,This computer is exceptionally thin for it's s...,screen size
3,This computer is exceptionally thin for it's s...,processing power
4,This computer that I have has had issues with ...,keyboard
4,This computer that I have has had issues with ...,keyboard functions
5,This computer was so challenging to carry and ...,
6,This is a great little computer for the price.,price
7,This is a great value for the money.,value


In [45]:
from transformers import DistilBertTokenizerFast, DistilBertForTokenClassification
import torch
import pandas as pd

# Inverse map for labels
inverse_label_map = {0: 'O', 1: 'B', 2: 'I'}

extracted_aspects = []

# Iterate through the test sentences and their predicted labels
test_sentences = df1['Sentence'][1200:].tolist()  ## Sentences are present in list of lists

predicted_aspect_terms = []

# Load the fast version of the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Assuming the model is already loaded and in eval mode from the previous cell
# model.load_state_dict(torch.load('/content/distilbert-finetune-aspect.pt'))
# model.eval()

for i in range(len(test_sentences)):
    sentence = test_sentences[i]                    ## Chossing the ith list from the list of lists
    # Tokenize the sentence and get offset mapping
    ## using tokeniser as
    encoded = tokenizer(
        sentence,
        add_special_tokens=True, ## ex- [cls], [end]
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors='pt',
        return_attention_mask=True,
        return_offsets_mapping=True # Get character offsets for tokens
    )
    input_ids = encoded['input_ids']                    ## Input ids of the one sentence.
    attention_mask = encoded['attention_mask']          ## Attention Masks of one sentence.
    offsets = encoded['offset_mapping'][0].tolist() # Get the offset mapping for the first (and only) token in the batch

    # Let's get predictions sentence by sentence for clarity and correctness
    with torch.no_grad():
        sentence_outputs = model(input_ids, attention_mask=attention_mask)        # for one entire sentence
        sentence_logits = sentence_outputs.logits
        sentence_preds = torch.argmax(sentence_logits, dim=2).squeeze(0).tolist() # Get predictions for one particular sentence

    current_aspect_term = []
    sentence_aspect_terms = []

    # Reconstruct words from tokens and their offsets
    words = []
    current_word_tokens = []
    current_word_start_offset = None
    current_word_end_offset = None # Keep track of the end offset of the current word

    for token_idx in range(len(input_ids[0])):                                  ## input_ids[0] removes the batch dimension. this line means accessing each input_id one by one
        token_text = tokenizer.decode([input_ids[0][token_idx]])                ## Decoding using the input ids that we got from the tokeniser.
        start_offset, end_offset = offsets[token_idx]                           ## offset is a list of tuples and we are accessing each tuple one by one

        if start_offset == end_offset and token_text not in ['[CLS]', '[SEP]', '[PAD]']:
             # Handle special tokens that might have zero length offset but are not [CLS], [SEP], [PAD]
             continue

        if token_text.startswith("##"):
            # Continuation of a word
            if current_word_tokens:
                current_word_tokens.append(token_text[2:])
                current_word_end_offset = end_offset # Update end offset with the end offset of the current token
            else:
                # Handle cases where a word starts with ## (unlikely but for robustness)
                current_word_tokens.append(token_text[2:])
                current_word_start_offset = start_offset # Or should be previous token's start?
                current_word_end_offset = end_offset

        else:
            # Start of a new word or a whole word token
            if current_word_tokens:
                # Finish the previous word
                full_word = "".join(current_word_tokens)
                # Use the stored end offset for the previous word
                words.append((full_word, current_word_start_offset, current_word_end_offset))

            if token_text in ['[CLS]', '[SEP]', '[PAD]']:
                 current_word_tokens = []
                 current_word_start_offset = None
                 current_word_end_offset = None
                 continue # Skip special tokens for aspect extraction

            current_word_tokens = [token_text]
            current_word_start_offset = start_offset
            current_word_end_offset = end_offset # Initialize end offset with the end offset of the first token

    # Add the last word if the sentence doesn't end with a special token
    if current_word_tokens:
         full_word = "".join(current_word_tokens)
         words.append((full_word, current_word_start_offset, current_word_end_offset))


    # Now align predicted labels with the reconstructed words
    word_labels = []
    word_index = 0
    token_index_in_sentence_preds = 0 # Index to iterate through sentence_preds, excluding special tokens

    # A more reliable way to align is to use word_ids from the encoding
    word_ids = encoded.word_ids()
    previous_word_idx = None
    current_word_predicted_label = None

    for token_idx in range(len(input_ids[0])):
        word_idx = word_ids[token_idx]

        if word_idx is not None and token_idx < len(sentence_preds):
            predicted_label = inverse_label_map[sentence_preds[token_idx]]

            if word_idx != previous_word_idx:
                # New word starts - assign the predicted label of the first token to the word
                if current_word_predicted_label is not None:
                    word_labels.append(current_word_predicted_label)
                current_word_predicted_label = predicted_label
            # For subsequent tokens of the same word, we rely on the label of the first token
            # when reconstructing the aspect term.

            previous_word_idx = word_idx

    # Add the label for the last word
    if current_word_predicted_label is not None:
        word_labels.append(current_word_predicted_label)


    # Reconstruct aspect terms based on word labels and the 'words' list (with correct offsets)
    current_aspect_term_words = []
    sentence_aspect_terms = []

    for word_idx in range(len(words)):
        word_text, start_offset, end_offset = words[word_idx]
        # Need to align word_labels with words list. word_labels now has one label per word.
        # This alignment is tricky with the current approach.

        # Let's use the predicted labels (aligned to tokens) and the original sentence
        # to extract the aspect terms based on the offsets.

        current_aspect_span = None

        # Iterate through token predictions and their offsets
        for token_idx in range(len(input_ids[0])):
            word_idx_for_token = word_ids[token_idx]
            if word_idx_for_token is not None and token_idx < len(sentence_preds):
                predicted_label = inverse_label_map[sentence_preds[token_idx]]
                start_offset, end_offset = offsets[token_idx]

                if predicted_label == 'B':
                    if current_aspect_span is not None:
                        # Add previous aspect term if it exists
                        sentence_aspect_terms.append(sentence[current_aspect_span[0]:current_aspect_span[1]])
                    current_aspect_span = (start_offset, end_offset)
                elif predicted_label == 'I':
                    if current_aspect_span is not None:
                        # Extend the current aspect term span
                        current_aspect_span = (current_aspect_span[0], end_offset)
                    # If 'I' without preceding 'B', ignore
                else: # 'O'
                    if current_aspect_span is not None:
                        # Add the completed aspect term
                        sentence_aspect_terms.append(sentence[current_aspect_span[0]:current_aspect_span[1]])
                        current_aspect_span = None

        # Add the last aspect term if the sentence ends with one
        if current_aspect_span is not None:
             sentence_aspect_terms.append(sentence[current_aspect_span[0]:current_aspect_span[1]])


    # Clean up extracted aspect terms (remove leading/trailing spaces, etc.)
    cleaned_aspect_terms = [term.strip() for term in sentence_aspect_terms if term.strip()]

    predicted_aspect_terms.append(cleaned_aspect_terms)


# Create a DataFrame
predicted_aspects_df = pd.DataFrame({
    'Sentence': test_sentences,
    'Predicted Aspect Terms': predicted_aspect_terms
})

display(predicted_aspects_df.head())

Unnamed: 0,Sentence,Predicted Aspect Terms
0,This computer I used daily nice compact design.,"[design, design, design, design, design, desig..."
1,This computer doesn't do that well with certai...,[]
2,This computer had exactly the specifications I...,"[specifications, specifications, specification..."
3,This computer is exceptionally thin for it's s...,"[screen size, processing power, screen size, p..."
4,This computer that I have has had issues with ...,"[keyboard, keyboard functions, keyboard, keybo..."
