# SMOTEN Oversampling Experiment
In this notebook, we use the SMOTEN oversampling technique to balance the dataset. The dataset is the same one used in the previous notebook.

In [1]:
import pandas as pd
import torch
import os
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTEN

In [2]:
MY_DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
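# get_device_name() only accepts CUDA devices, so guard the CPU fallback
torch.cuda.get_device_name(MY_DEVICE) if MY_DEVICE.type == "cuda" else "cpu"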

'NVIDIA GeForce RTX 3050 Ti Laptop GPU'

In [3]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "caching_allocator"

In [4]:
requirement_relevancy_dataset = pd.read_csv(
    "../../../Datasets/irrelevant_requirements_dataset/irrelevant_requirements_dataset.csv",
    engine="pyarrow",
)

requirement_relevancy_dataset.head()

Unnamed: 0,reqs_statement,action_part,actor_part,label
0,user submit job associate cost execution time ...,submit job associate cost execution time deadline,user,relevant
1,user establish cost unit time and submit job,establish cost unit time and submit job,user,relevant
2,user monitor job submit status,monitor job submit status,user,relevant
3,user cancel job submit,cancel job submit,user,relevant
4,user check credit balance,check credit balance,user,relevant


### Making Train Test Split

In [5]:
requirements_X = requirement_relevancy_dataset["reqs_statement"]
label_y = requirement_relevancy_dataset["label"]

In [6]:
# Get the max length (in whitespace-separated words) of the requirements
max_len = max((len(req.split()) for req in requirements_X), default=0)

print("Max length of the requirements: ", max_len)

Max length of the requirements:  123


In [7]:
# Encode the labels as a single boolean column (True = relevant)
label_y = pd.get_dummies(label_y, drop_first=True)
label_y = label_y["relevant"]
label_y

0       True
1       True
2       True
3       True
4       True
       ...  
616     True
617     True
618     True
619    False
620    False
Name: relevant, Length: 621, dtype: bool

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    requirements_X, label_y, test_size=0.15, random_state=42, stratify=label_y
)

In [9]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((527,), (527,), (94,), (94,))

In [10]:
X_train_resampled, y_train_resampled = SMOTEN(
    random_state=42, sampling_strategy="all"
).fit_resample(X_train.values.reshape(-1, 1), y_train)

In [11]:
X_train_resampled.shape, y_train_resampled.shape

((946, 1), (946,))

Before Resampling

In [12]:
y_train.value_counts()

relevant
True     473
False     54
Name: count, dtype: int64

After Resampling

In [13]:
y_train_resampled.value_counts()

relevant
True     473
False    473
Name: count, dtype: int64
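
SMOTEN is designed for nominal features, and here each requirement statement is passed to it as a single categorical value. As far as I understand the algorithm, a synthetic sample takes the most common category among its nearest neighbours, so the new minority rows should be copies of existing statements rather than newly written text. The following is a minimal sketch of how that could be checked. `count_unseen` is a hypothetical helper, and the toy lists stand in for `X_train` and `X_train_resampled`.

```python
from collections import Counter


def count_unseen(original, resampled):
    """Return how many resampled items never appear in the original data."""
    seen = set(original)
    return sum(1 for item in resampled if item not in seen)


# Toy stand-ins for X_train and the flattened X_train_resampled (hypothetical data)
original = ["user cancel job", "user check balance", "system play music"]
resampled = original + ["system play music", "system play music"]

unseen = count_unseen(original, resampled)
duplicates = {text: n for text, n in Counter(resampled).items() if n > 1}

print("unseen statements:", unseen)
print("repeated statements:", duplicates)
```

In this notebook the same check could be run as `count_unseen(X_train, X_train_resampled.ravel())`. A result of zero would mean the oversampling only repeats existing minority statements.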

In [14]:
# Flatten the (n_samples, 1) array back to a 1-D array of strings
X_train_resampled = X_train_resampled.ravel()

In [27]:
y_test.value_counts()

relevant
True     84
False    10
Name: count, dtype: int64

In [15]:
print(X_train_resampled.shape, X_test.shape)
print(y_train_resampled.shape, y_test.shape)

(946,) (94,)
(946,) (94,)


## Experiment With NLP Models

In this section, I experiment with different NLP models to see which one performs best: DistilBERT, RoBERTa, and XLNet. The models are implemented with the Hugging Face Transformers library, and the data is the same as in the previous notebook.


## DistilBERT Model

DistilBERT is a smaller version of BERT, trained on the same data to be faster and more memory efficient. It is trained with knowledge distillation, a technique that compresses a large model (the teacher) into a smaller one (the student) by training the student to mimic the behavior of the teacher.
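
To make the idea of "mimicking" concrete, here is a minimal NumPy sketch of a generic soft-target distillation loss: the student is penalised by the KL divergence between the teacher's and its own temperature-softened output distributions. This is an illustration under my own assumptions (made-up logits, a temperature of 2.0), not DistilBERT's actual training code, whose published objective combines a term like this with additional losses.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return float(kl.mean())


teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])  # made-up logits
good_student = teacher * 0.9  # close to the teacher
poor_student = np.array([[0.5, 1.0, 4.0], [3.0, 0.2, 0.1]])  # disagrees

loss_good = distillation_loss(good_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
print(loss_good, loss_poor)
```

A student whose outputs track the teacher gets a loss near zero, while one that ranks the classes differently is penalised more heavily.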


In [16]:
from transformers import (
    DistilBertModel,
    DistilBertTokenizer,
)

In [17]:
bert_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

In [18]:
with torch.no_grad():
    tokenized_train_data_X = bert_tokenizer(
        X_train_resampled.tolist(),
        padding="max_length",
        return_tensors="pt",
        max_length=100,
        truncation=True,
    )

    tokenized_test_data_X = bert_tokenizer(
        X_test.tolist(),
        padding="max_length",
        return_tensors="pt",
        max_length=100,
        truncation=True,
    )
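
The longest requirement found earlier has 123 whitespace-separated words, while the tokenizer call truncates at `max_length=100` tokens. Two of those positions go to the special `[CLS]` and `[SEP]` tokens, and WordPiece usually produces at least one token per word, so a statement longer than about 98 words is very likely cut off. Below is a rough sketch for counting such statements. `count_over_budget` is a hypothetical helper and the sample texts are made up.

```python
def count_over_budget(texts, max_length=100, special_tokens=2):
    """Count texts whose word count alone exceeds the usable token budget.

    Word counts are only a lower bound on WordPiece token counts, so the
    true number of truncated texts may be higher.
    """
    budget = max_length - special_tokens
    return sum(1 for text in texts if len(text.split()) > budget)


# Made-up examples: one short statement and one with 123 words
samples = ["user cancel job submit", " ".join(["word"] * 123)]
n_truncated = count_over_budget(samples)
print(n_truncated)
```

Running it on `X_train_resampled` and `X_test` would give a lower bound on how many inputs lose text to truncation.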

In [19]:
bert_model = DistilBertModel.from_pretrained(
    "../../../Models/requirement_relevancy_experiment/NLP_models/my_distilbert_model/",
    # device_map=MY_DEVICE,
)

### Mixed Precision Calculation

Mixed precision is the use of both 16-bit and 32-bit floating-point types during training to reduce memory use and speed up computation.


torch.cuda.amp.autocast() is a context manager that runs eligible operations in 16-bit precision on CUDA devices.
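
The trade-off behind mixed precision can be illustrated without a GPU. The sketch below uses NumPy rather than PyTorch (my own choice, so that it runs anywhere) to show that 16-bit floats take half the memory of 32-bit floats but resolve fewer digits.

```python
import numpy as np

values32 = np.linspace(0.0, 1.0, 1000, dtype=np.float32)
values16 = values32.astype(np.float16)

# Half precision needs half the bytes per element
print(values32.nbytes, values16.nbytes)

# ...but it keeps fewer significant digits, so a rounding error appears
max_error = np.abs(values32 - values16.astype(np.float32)).max()
print(max_error)

# float16 cannot tell 2048 and 2049 apart
same = bool(np.float16(2049) == np.float16(2048))
print(same)
```

This is why mixed-precision schemes keep precision-sensitive operations in 32-bit while running the rest in 16-bit.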


In [20]:
torch.cuda.empty_cache()

In [21]:
# Inference only: disable gradient tracking to save memory
with torch.no_grad(), torch.cuda.amp.autocast():
    X_train_outputs = bert_model(**tokenized_train_data_X)
    X_test_outputs = bert_model(**tokenized_test_data_X)

# outputs = bert_model(**tokenized_text_data_X)

In [22]:
X_train_last_hidden_states = X_train_outputs.last_hidden_state
X_test_last_hidden_states = X_test_outputs.last_hidden_state

In [23]:
X_train_last_hidden_states.shape, X_test_last_hidden_states.shape

(torch.Size([946, 100, 768]), torch.Size([94, 100, 768]))

In [24]:
reshaped_X_train_last_hidden_states = X_train_last_hidden_states.reshape(
    X_train_last_hidden_states.shape[0], -1
).detach().numpy()

reshaped_X_test_last_hidden_states = X_test_last_hidden_states.reshape(
    X_test_last_hidden_states.shape[0], -1
).detach().numpy()

reshaped_X_train_last_hidden_states.shape, reshaped_X_test_last_hidden_states.shape

((946, 76800), (94, 76800))
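
The reshape above turns each sequence of 100 token vectors of size 768 into one feature vector of 100 × 768 = 76,800 values by laying the token vectors end to end. A small NumPy sketch of the same operation, on made-up dimensions:

```python
import numpy as np

# Toy stand-in for last_hidden_state: 2 sequences, 3 tokens, 4 hidden units
hidden = np.arange(2 * 3 * 4).reshape(2, 3, 4)

flat = hidden.reshape(hidden.shape[0], -1)
print(flat.shape)  # (2, 12)

# Each row is the token vectors concatenated in order
first_token = flat[0, :4]
second_token = flat[0, 4:8]
print(first_token, second_token)
```

Because the inputs are padded to `max_length`, the flattened vector also contains the hidden states at padding positions.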

In [25]:
reshaped_X_train_last_hidden_states.shape, y_train_resampled.shape

((946, 76800), (946,))

In [26]:
reshaped_X_test_last_hidden_states.shape, y_test.shape

((94, 76800), (94,))

In [28]:
# Save the reshaped_X_train_last_hidden_states and reshaped_X_test_last_hidden_states
np.savetxt(
    "../../../Datasets/irrelevant_requirements_dataset/model_state_outputs/distilbert/reshaped_X_train_last_hidden_states.csv",
    reshaped_X_train_last_hidden_states,
)

np.savetxt(
    "../../../Datasets/irrelevant_requirements_dataset/model_state_outputs/distilbert/reshaped_X_test_last_hidden_states.csv",
    reshaped_X_test_last_hidden_states,
)

# Save the y_train and y_test
np.savetxt(
    "../../../Datasets/irrelevant_requirements_dataset/model_state_outputs/distilbert/y_train.csv",
    y_train_resampled,
)

np.savetxt(
    "../../../Datasets/irrelevant_requirements_dataset/model_state_outputs/distilbert/y_test.csv",
    y_test,
)
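
One thing to keep in mind when reading these files back: with its default arguments, `np.savetxt` writes space-delimited text, so the files are not comma-separated despite the `.csv` extension. `np.loadtxt` splits on whitespace by default and reads them directly. A minimal round-trip sketch, using a temporary directory and a hypothetical file name:

```python
import os
import tempfile

import numpy as np

features = np.random.default_rng(0).random((3, 4)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.csv")  # hypothetical file name
    np.savetxt(path, features)  # default delimiter is a single space

    with open(path) as handle:
        first_line = handle.readline()

    loaded = np.loadtxt(path)  # default: split on any whitespace

print("," in first_line)
print(loaded.shape, loaded.dtype)
```

Passing `delimiter=","` to both calls would produce and read a true CSV file instead.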

In [26]:
# Utility function for drawing a confusion-matrix heatmap

import seaborn as sns
import matplotlib.pyplot as plt

def draw_heatmap(confusion_matrix, labels):
    ax = sns.heatmap(
        confusion_matrix,
        annot=True,
        fmt="d",
        cmap="Blues",
        xticklabels=labels,
        yticklabels=labels,
    )
    ax.set(xlabel="Predicted label", ylabel="True label")
    plt.show()

## Classification

The classification approach uses 5-fold cross-validation to get a better estimate of the model's performance. The model is trained on five different folds of the data and evaluated on the validation set of each fold. The model with the best validation score is then used to make predictions on the test set, and the score on those predictions is the final score of the model. The detailed results are in the one_hot_standalone_cross_validation_results.csv file.
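
The procedure described above can be sketched as follows. This is an illustration under assumptions, not the notebook's actual classifier: the data is synthetic, `LogisticRegression` is a placeholder model, and accuracy is used as the score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in for the embedding features (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, weights=[0.2, 0.8], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42, stratify=y_demo
)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
best_model, best_score, fold_scores = None, -np.inf, []

for train_idx, val_idx in folds.split(X_tr, y_tr):
    model = LogisticRegression(max_iter=1000)  # assumed classifier
    model.fit(X_tr[train_idx], y_tr[train_idx])
    score = model.score(X_tr[val_idx], y_tr[val_idx])  # validation accuracy
    fold_scores.append(score)
    if score > best_score:  # keep the model from the best fold
        best_model, best_score = model, score

# Final score: the best fold's model evaluated on the held-out test set
test_score = best_model.score(X_te, y_te)
print(fold_scores, test_score)
```

With a class imbalance like the one in this dataset, a metric such as F1 or balanced accuracy may be more informative than plain accuracy when comparing folds.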
