In [1]:
import pandas as pd
import torch
import os

In [2]:
MY_DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
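# get_device_name only accepts CUDA devices, so guard the CPU fallback.
torch.cuda.get_device_name(MY_DEVICE) if MY_DEVICE.type == "cuda" else "CPU"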

'NVIDIA GeForce RTX 3050 Ti Laptop GPU'

In [3]:
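# Explicitly select PyTorch's native caching allocator (the default backend).
# This must be set before the first CUDA allocation to take effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:native"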

In [4]:
requirement_relevancy_dataset = pd.read_csv(
    "../../Datasets/irrelevant_requirements_dataset/irrelevant_requirements_dataset.csv",
    engine="pyarrow",
)

requirement_relevancy_dataset.head()

Unnamed: 0,reqs_statement,action_part,actor_part,label
0,user submit job associate cost execution time ...,submit job associate cost execution time deadline,user,relevant
1,user establish cost unit time and submit job,establish cost unit time and submit job,user,relevant
2,user monitor job submit status,monitor job submit status,user,relevant
3,user cancel job submit,cancel job submit,user,relevant
4,user check credit balance,check credit balance,user,relevant


## Experiment With NLP Models

In this section, I experiment with several NLP models to see which one performs best: DistilBERT, RoBERTa, and XLNet. The models are implemented with the Hugging Face Transformers library and use the same data as the previous notebook.


## DistilBERT Model

DistilBERT is a smaller, faster, and more memory-efficient version of BERT. It is trained on the same data as BERT using knowledge distillation, a technique that compresses a large model (the teacher) into a smaller one (the student) by training the smaller model to mimic the behavior of the larger one.
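To make the idea of "mimicking the larger model" concrete, here is a minimal sketch of a soft-target distillation loss. It shows only the general technique, not DistilBERT's full training objective; the function name `distillation_loss`, the temperature value, and the random logits are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss: KL divergence between softened teacher and student outputs."""
    # Dividing the logits by the temperature softens both distributions.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # Scaling by temperature**2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature**2


# Hypothetical logits for a batch of 4 examples and 3 classes.
torch.manual_seed(0)
teacher_logits = torch.randn(4, 3)
student_logits = torch.randn(4, 3)

loss_random = distillation_loss(student_logits, teacher_logits)
loss_matched = distillation_loss(teacher_logits.clone(), teacher_logits)
```

The loss is positive when the student disagrees with the teacher and approaches zero as the student's output distribution matches the teacher's, which is the signal the smaller model is trained on.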


In [5]:
from transformers import (
    DistilBertModel,
    DistilBertTokenizer,
)
from sklearn.model_selection import train_test_split

In [6]:
text_data_X = requirement_relevancy_dataset["action_part"]
label_data_y = requirement_relevancy_dataset["label"]

In [7]:
bert_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

In [8]:
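# Tokenization builds no autograd graph, so no_grad is not strictly required here.
with torch.no_grad():
    # Tokenize the requirement texts, padding or truncating each to 64 tokens.
    tokenized_text_data_X = bert_tokenizer(
        text_data_X.tolist(),
        padding="max_length",
        return_tensors="pt",
        max_length=64,
        truncation=True,
    )

    # The labels are tokenized as text too: each becomes [CLS] <label> [SEP] plus padding.
    tokenized_text_data_y = bert_tokenizer(
        label_data_y.tolist(),
        padding="max_length",
        return_tensors="pt",
        max_length=64,
        truncation=True,
    )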

In [9]:
tokenized_text_data_y = {
    key: val.to(MY_DEVICE) for key, val in tokenized_text_data_y.items()
}

tokenized_text_data_X = {
    key: val.to(MY_DEVICE) for key, val in tokenized_text_data_X.items()
}

In [10]:
tokenized_text_data_X["input_ids"].shape, tokenized_text_data_X["attention_mask"].shape

(torch.Size([621, 64]), torch.Size([621, 64]))

In [11]:
tokenized_text_data_y

{'input_ids': tensor([[  101,  7882,   102,  ...,     0,     0,     0],
         [  101,  7882,   102,  ...,     0,     0,     0],
         [  101,  7882,   102,  ...,     0,     0,     0],
         ...,
         [  101,  7882,   102,  ...,     0,     0,     0],
         [  101, 22537,   102,  ...,     0,     0,     0],
         [  101, 22537,   102,  ...,     0,     0,     0]], device='cuda:0'),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')}

In [12]:
bert_model = DistilBertModel.from_pretrained(
    "distilbert-base-uncased",
    device_map=MY_DEVICE,
)

In [13]:
bert_model.save_pretrained(
    "../../Models/requirement_relevancy_experiment/NLP_models/my_bert_model"
)

### No Gradient Calculation


`torch.no_grad()` disables gradient tracking. The pretrained model is used only to extract embeddings and its parameters are not updated, so no autograd graph is needed. Skipping it reduces memory consumption during the forward pass.
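The effect of `torch.no_grad()` can be checked on a small example. The sketch below uses a single `torch.nn.Linear` layer as a stand-in for the pretrained encoder; the layer and its input sizes are hypothetical and chosen only to keep the example fast.

```python
import torch

# A small stand-in for a pretrained encoder (hypothetical, not DistilBERT itself).
layer = torch.nn.Linear(8, 4)
inputs = torch.randn(2, 8)

# Default mode: the output is attached to the autograd graph.
tracked = layer(inputs)

# Under no_grad no graph is recorded, so activations are not kept for a backward pass.
with torch.no_grad():
    untracked = layer(inputs)

print(tracked.requires_grad, untracked.requires_grad)
```

A tensor produced under `torch.no_grad()` has no `grad_fn`, so it can be converted to NumPy without calling `.detach()` first.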


In [14]:
# Disable gradient tracking and run the forward pass in mixed precision.
with torch.no_grad(), torch.cuda.amp.autocast():
    outputs = bert_model(**tokenized_text_data_X)
    last_hidden_states = outputs.last_hidden_state

In [15]:
last_hidden_states

tensor([[[-0.3534, -0.1981, -0.1656,  ..., -0.3108, -0.0599,  0.4349],
         [ 0.1703,  0.1611,  0.2477,  ..., -0.1779,  0.0796,  0.2220],
         [-0.0327, -0.2060,  0.1768,  ..., -0.1925, -0.0997,  0.1481],
         ...,
         [-0.0700,  0.0444,  0.3540,  ..., -0.1894, -0.3079,  0.1799],
         [-0.0373,  0.0923,  0.1830,  ..., -0.1250, -0.4743,  0.1674],
         [-0.0203,  0.0741,  0.1860,  ..., -0.1342, -0.4503,  0.2178]],

        [[-0.2828, -0.0325, -0.4504,  ..., -0.3080,  0.2259,  0.4418],
         [-0.0231,  0.4239, -0.1768,  ..., -0.4183,  0.1507,  0.1583],
         [ 0.3004, -0.2843,  0.0732,  ..., -0.2676,  0.0906,  0.0983],
         ...,
         [ 0.1640,  0.0182,  0.3581,  ..., -0.2948, -0.0981,  0.1500],
         [-0.0513,  0.0541, -0.0595,  ..., -0.1478, -0.3698,  0.0287],
         [-0.0303,  0.0941, -0.0675,  ..., -0.1470, -0.4249,  0.0290]],

        [[-0.2093, -0.1249, -0.3169,  ..., -0.2322,  0.0539,  0.3074],
         [ 0.4184,  0.3295,  0.0380,  ..., -0

In [16]:
torch.cuda.empty_cache()

In [29]:
reshaped_last_hidden_states_X = (
    last_hidden_states.reshape(last_hidden_states.shape[0], -1).detach().cpu().numpy()
)
reshaped_last_hidden_states_X.shape

(621, 49152)

## Oversampling Of Data

The dataset is imbalanced, so we oversample it to balance the classes. We are currently comparing several oversampling techniques and will use the one that works best for our model. For an overview of the available techniques, see the [smote-variants package](https://pypi.org/project/smote-variants/).


### SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling technique that generates synthetic samples for the minority class. Because it creates new points instead of duplicating existing ones, it helps reduce the overfitting that random oversampling can cause. It randomly picks a point from the minority class, finds that point's k nearest neighbors within the class, and places synthetic points on the line segments between the chosen point and its neighbors.
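The interpolation step described above can be sketched in a few lines of NumPy. This is an illustration of the idea only, not the implementation used by `smote_variants`; the function name `smote_sketch`, its parameters, and the toy data are assumptions.

```python
import numpy as np


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between all minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs**2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))      # random minority point
        neighbor = rng.choice(neighbors[base])  # one of its k nearest neighbors
        gap = rng.random()                      # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbor] - minority[base])
    return synthetic


# Toy 2-D minority class (hypothetical data, not the requirements dataset).
minority_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sketch(minority_points, n_new=5)
```

Each synthetic point lies on a segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.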


In [17]:
import smote_variants as sv

In [26]:
oversampler = sv.MulticlassOversampling(oversampler="SMOTE")

In [None]:
X_resampled, y_resampled = oversampler.sample(
    reshaped_last_hidden_states_X, label_data_y
)