In [11]:
import pandas as pd
import torch

In [14]:
MY_DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
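# get_device_name raises on CPU-only machines, so query it only when CUDA is selected
torch.cuda.get_device_name(MY_DEVICE) if MY_DEVICE.type == "cuda" else "cpu"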

'NVIDIA GeForce RTX 3050 Ti Laptop GPU'

In [5]:
requirement_relevancy_dataset = pd.read_csv(
    "../../Datasets/irrelevant_requirements_dataset/irrelevant_requirements_dataset.csv",
    engine="pyarrow",
)

requirement_relevancy_dataset.head()

Unnamed: 0,reqs_statement,action_part,actor_part,label
0,user submit job associate cost execution time ...,submit job associate cost execution time deadline,user,relevant
1,user establish cost unit time and submit job,establish cost unit time and submit job,user,relevant
2,user monitor job submit status,monitor job submit status,user,relevant
3,user cancel job submit,cancel job submit,user,relevant
4,user check credit balance,check credit balance,user,relevant


## Experiment With NLP Models

In this section, I compare several NLP models to see which one performs best on this task: BERT, RoBERTa, DistilBERT, and XLNet. All of them are implemented with the Hugging Face `transformers` library and use the same data as the previous notebook.

### BERT Model

BERT (Bidirectional Encoder Representations from Transformers) was proposed in the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). It uses the Transformer encoder architecture to build bidirectional representations of text, and its pre-trained weights can be fine-tuned for a variety of NLP tasks.
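
Before fine-tuning, each text has to be packed into a fixed-length sequence of token ids with an attention mask. The sketch below illustrates that packing on a toy scale. The vocabulary `TOY_VOCAB` and the helper `encode` are hypothetical, and whitespace splitting stands in for the WordPiece tokenization that a real BERT tokenizer performs.

```python
# Toy illustration of BERT-style input packing; the vocabulary is hypothetical.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "cancel": 4, "job": 5, "submit": 6}


def encode(text, max_length=8):
    """Return (input_ids, attention_mask) padded or truncated to max_length."""
    # Whitespace splitting stands in for WordPiece tokenization.
    tokens = text.lower().split()[: max_length - 2]  # leave room for [CLS]/[SEP]
    ids = [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    ids.append(TOY_VOCAB["[SEP]"])
    mask = [1] * len(ids)  # 1 marks real tokens, 0 marks padding
    padding = max_length - len(ids)
    return ids + [TOY_VOCAB["[PAD]"]] * padding, mask + [0] * padding


input_ids, attention_mask = encode("cancel job submit")
```

The attention mask lets the model ignore the padded positions, which is why short and long requirements can share one tensor shape.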

In [28]:
from transformers import (
    RobertaTokenizer,
    BertTokenizer,
    BertForSequenceClassification,
)
from sklearn.model_selection import train_test_split

In [29]:
text_data_X = requirement_relevancy_dataset["action_part"]
label_data_y = requirement_relevancy_dataset["label"]

In [30]:
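# Use the tokenizer that matches the BERT checkpoint loaded below; a RoBERTa
# vocabulary would produce token ids the BERT embedding table does not cover.
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [31]:
tokenized_text_data_X = bert_tokenizer(
    text_data_X.tolist(),
    padding="max_length",
    truncation=True,  # keep every sequence at exactly max_length tokens
    max_length=128,
    return_tensors="pt",
)
tokenized_text_data_X = {
    key: val.to(MY_DEVICE) for key, val in tokenized_text_data_X.items()
}

In [32]:
# Labels are class ids, not text: map each label string to an integer code
# instead of running it through the tokenizer. The variable name is kept so
# that later cells can still refer to it.
tokenized_text_data_y = torch.tensor(
    label_data_y.astype("category").cat.codes.to_numpy(), dtype=torch.long
).to(MY_DEVICE)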

In [33]:
# Passing `device_map` requires the optional `accelerate` package (without it,
# from_pretrained raises an ImportError), so load the model normally and move
# it to the device explicitly.
bert_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
).to(MY_DEVICE)

## Oversampling Of Data

The dataset is imbalanced, so we oversample the minority class to balance it. We are currently comparing several oversampling techniques and will use the best one for our model. For an overview of the available techniques, see the [smote-variants package page](https://pypi.org/project/smote-variants/).
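
A quick way to quantify the imbalance is to count the labels and compare the largest class to the smallest. The sketch below uses a hypothetical toy label list in place of the dataset's `label` column; the class name `"irrelevant"` and the helper `imbalance_ratio` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical label list standing in for the dataset's "label" column.
toy_labels = ["relevant"] * 8 + ["irrelevant"] * 2


def imbalance_ratio(labels):
    """Return (counts, majority/minority ratio) for a list of class labels."""
    counts = Counter(labels)
    return counts, max(counts.values()) / min(counts.values())


counts, ratio = imbalance_ratio(toy_labels)
```

On the real data, the same check could be run on `label_data_y.tolist()`; a ratio well above 1 indicates that oversampling is worth trying.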


### SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling technique that generates synthetic samples for the minority class. Because it creates new points rather than duplicating existing ones, it reduces the overfitting that random oversampling can cause. It randomly picks a point from the minority class, computes that point's k nearest neighbors within the minority class, and places synthetic points on the line segments between the chosen point and its neighbors.
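
The interpolation step described above can be sketched in a few lines of plain Python. This is a simplified illustration of basic SMOTE, not the `distance_SMOTE` variant from `smote_variants` used in the cells below; the function name `smote_sketch` and the toy points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=5):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_points, n_new=3)
```

Every synthetic point lies on a segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.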


In [None]:
import smote_variants as sv

In [None]:
smote_oversampler = sv.MulticlassOversampling(
    oversampler="distance_SMOTE", oversampler_params={"random_state": 5}
)

In [None]:
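# The oversampler expects a 2-D feature array and a 1-D array of class labels;
# passing the tokenizer dicts raised `TypeError: unhashable type: 'dict'`.
# Caveat: SMOTE interpolates between rows, so applying it to raw token ids
# yields fractional values that are no longer valid tokens.
features_X = tokenized_text_data_X["input_ids"].cpu().numpy()
labels_y = label_data_y.astype("category").cat.codes.to_numpy()

oversampled_X, oversampled_y = smote_oversampler.sample(features_X, labels_y)

2024-01-13 00:37:55,104:INFO:MulticlassOversampling: Running multiclass oversampling with strategy eq_1_vs_many_successive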