In [34]:
import pandas as pd
import torch
import os
import numpy as np

In [2]:
MY_DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.get_device_name(MY_DEVICE)

'NVIDIA GeForce RTX 3050 Ti Laptop GPU'

In [3]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "caching_allocator"

In [4]:
requirement_relevancy_dataset = pd.read_csv(
    "../../Datasets/irrelevant_requirements_dataset/irrelevant_requirements_dataset.csv",
    engine="pyarrow",
)

requirement_relevancy_dataset.head()

Unnamed: 0,reqs_statement,action_part,actor_part,label
0,user submit job associate cost execution time ...,submit job associate cost execution time deadline,user,relevant
1,user establish cost unit time and submit job,establish cost unit time and submit job,user,relevant
2,user monitor job submit status,monitor job submit status,user,relevant
3,user cancel job submit,cancel job submit,user,relevant
4,user check credit balance,check credit balance,user,relevant


## Experiment With NLP Models

In this section, I experiment with several NLP models to see which one performs best: DistilBERT, RoBERTa, and XLNet. The models are implemented with the Hugging Face Transformers library and use the same data as the previous notebook.


## DistilBERT Model

DistilBERT is a smaller version of BERT that is faster and more memory efficient. It is trained on the same data as BERT using a technique called knowledge distillation, which compresses a large model (the teacher) into a smaller one (the student): the smaller model is trained to mimic the behavior of the larger model.
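
As a rough illustration of knowledge distillation, the sketch below uses NumPy to compute a temperature-softened KL divergence between teacher and student logits. The function names, the temperature value, and the toy logits are assumptions made for illustration. The actual DistilBERT training objective combines a term of this kind with other losses.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits / temperature."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over the batch, on softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return float(kl.mean())


# Hypothetical logits for a batch of two examples and three classes.
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
close_student = teacher * 0.9          # almost agrees with the teacher
far_student = teacher[:, ::-1].copy()  # ranks the classes in reverse order

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close, loss_far)
```

A student that agrees with the teacher gets a loss near zero, and the loss grows as the two distributions diverge.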


In [5]:
from transformers import (
    DistilBertModel,
    DistilBertTokenizer,
)
from sklearn.model_selection import train_test_split

In [6]:
text_data_X = requirement_relevancy_dataset["action_part"]
label_data_y = requirement_relevancy_dataset["label"]

In [7]:
bert_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

In [8]:
with torch.no_grad():
    tokenized_text_data_X = bert_tokenizer(
        text_data_X.tolist(),
        padding="max_length",
        return_tensors="pt",
        max_length=64,
        truncation=True,
    )

In [63]:
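# Encode the labels as integers: 1 for "relevant", 0 for anything else.
# A list is built explicitly because np.array(map(...)) would wrap the
# map object itself instead of its values.
tokenized_text_data_y = np.array(
    [1 if label == "relevant" else 0 for label in label_data_y.tolist()]
)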

In [9]:
tokenized_text_data_X = {
    key: val.to(MY_DEVICE) for key, val in tokenized_text_data_X.items()
}

In [10]:
tokenized_text_data_X["input_ids"].shape, tokenized_text_data_X["attention_mask"].shape

(torch.Size([621, 64]), torch.Size([621, 64]))

In [11]:
tokenized_text_data_y

{'input_ids': tensor([[  101,  7882,   102,  ...,     0,     0,     0],
         [  101,  7882,   102,  ...,     0,     0,     0],
         [  101,  7882,   102,  ...,     0,     0,     0],
         ...,
         [  101,  7882,   102,  ...,     0,     0,     0],
         [  101, 22537,   102,  ...,     0,     0,     0],
         [  101, 22537,   102,  ...,     0,     0,     0]], device='cuda:0'),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')}

In [12]:
bert_model = DistilBertModel.from_pretrained(
    "distilbert-base-uncased",
    device_map=MY_DEVICE,
)

### Mixed Precision Calculation

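Mixed precision is the use of both 16-bit and 32-bit floating-point numbers in the same computation. It reduces memory use and usually makes the computation run faster on GPUs that support it.


`torch.cuda.amp.autocast()` is a context manager that runs eligible operations inside its block in 16-bit precision and keeps numerically sensitive operations in 32-bit precision.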
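
To give a feel for the memory side of 16-bit versus 32-bit floats, the sketch below compares the storage needed for a tensor with the shape of the DistilBERT output in this notebook (621 sentences, 64 tokens, 768 hidden units). It uses NumPy dtypes only and counts just this one tensor, not the model's intermediate activations, so it is an illustration rather than a measurement of GPU usage.

```python
import numpy as np

# Shape of the DistilBERT output in this notebook:
# 621 sentences x 64 tokens x 768 hidden units.
n_elements = 621 * 64 * 768

bytes_fp32 = n_elements * np.dtype(np.float32).itemsize
bytes_fp16 = n_elements * np.dtype(np.float16).itemsize

print(f"float32: {bytes_fp32 / 2**20:.1f} MiB")
print(f"float16: {bytes_fp16 / 2**20:.1f} MiB")

# The price of half precision is a narrower range and fewer significant digits.
fp16_max = float(np.finfo(np.float16).max)  # largest finite float16 value
fp16_of_0_1 = float(np.float16(0.1))        # 0.1 rounded to float16
print(fp16_max, fp16_of_0_1)
```

The reduced range and precision are why autocast keeps some operations in 32-bit precision.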


In [14]:
# Inference only: disable gradient tracking and run the forward pass in mixed precision
with torch.no_grad(), torch.cuda.amp.autocast():
    outputs = bert_model(**tokenized_text_data_X)
    last_hidden_states = outputs.last_hidden_state

In [38]:
bert_model.save_pretrained(
    "../../Models/requirement_relevancy_experiment/NLP_models/my_distilbert_model"
)

In [None]:
last_hidden_states

In [16]:
torch.cuda.empty_cache()

In [29]:
reshaped_last_hidden_states_X = (
    last_hidden_states.reshape(last_hidden_states.shape[0], -1).detach().cpu().numpy()
)
reshaped_last_hidden_states_X.shape

(621, 49152)
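
Flattening gives 64 x 768 = 49152 features per sentence, including the positions that hold only padding. A common alternative, sketched below on toy data, is to average the token vectors over the non-padding positions, which yields one vector of hidden size per sentence. The function name `masked_mean_pool` and the toy arrays are hypothetical. This is not what the notebook does in the following cells.

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over non-padding positions.

    hidden_states: (batch, seq_len, hidden) array
    attention_mask: (batch, seq_len) array of 0/1
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 3))            # toy stand-in for last_hidden_states
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])  # first row has two padding tokens

flat = hidden.reshape(hidden.shape[0], -1)     # flattening, as in the notebook
pooled = masked_mean_pool(hidden, mask)        # alternative: one vector per row
print(flat.shape, pooled.shape)
```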

In [35]:
np.savetxt(
    "../../Datasets/irrelevant_requirements_dataset/distilbert_X.csv",
    reshaped_last_hidden_states_X,
    delimiter=",",
)

In [None]:
# Uncomment and run this cell to load the saved DistilBERT model and the reshaped last hidden states

# reshaped_last_hidden_states_X = np.loadtxt(
#     "../../Datasets/irrelevant_requirements_dataset/distilbert_X.csv",
#     delimiter=",",
# )

# bert_model = DistilBertModel.from_pretrained(
#     "../../Models/requirement_relevancy_experiment/NLP_models/my_distilbert_model"
# )

## Oversampling Of Data

The dataset is imbalanced, so we oversample the minority class to balance it. We are currently comparing several oversampling techniques and will use the best one for our model. To learn more about the available techniques, see the [smote-variants package](https://pypi.org/project/smote-variants/).


### SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling technique that generates synthetic samples for the minority class. It helps reduce the overfitting caused by random oversampling, which only duplicates existing rows. It randomly picks a point from the minority class, computes the k nearest neighbors of that point within the class, and adds synthetic points between the chosen point and its neighbors.
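
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions (Euclidean distance, one neighbor per synthetic point, a hypothetical function name `smote_sketch`), not the implementation used by `smote_variants`.

```python
import numpy as np


def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest points

    base = rng.integers(0, n, size=n_new)                      # random minority points
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position on the segment
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])


# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6, k=2)
print(synthetic)
```

Every synthetic row lies on a line segment between two existing minority rows, so the new points stay inside the region the minority class already occupies.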


In [17]:
import smote_variants as sv

In [65]:
oversampler = sv.MulticlassOversampling(oversampler="SMOTE")

In [67]:
X_resampled, y_resampled = oversampler.sample(
    reshaped_last_hidden_states_X, tokenized_text_data_y
)

2024-01-13 04:35:22,337:INFO:MulticlassOversampling: Running multiclass oversampling with strategy eq_1_vs_many_successive
2024-01-13 04:35:22,389:INFO:MulticlassOversampling: Sampling minority class with label: 0
2024-01-13 04:35:22,422:INFO:SMOTE: Running sampling via ('SMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'nn_params': {}, 'n_jobs': 1, 'ss_params': {'n_dim': 2, 'simplex_sampling': 'random', 'within_simplex_sampling': 'random', 'gaussian_component': {}}, 'random_state': None, 'class_name': 'SMOTE'}")
2024-01-13 04:35:22,428:INFO:NearestNeighborsWithMetricTensor: NN fitting with metric minkowski
2024-01-13 04:35:22,432:INFO:NearestNeighborsWithMetricTensor: kneighbors query minkowski
2024-01-13 04:35:22,816:INFO:SMOTE: simplex sampling with n_dim 2


## Classification

In this section we use several classification models to classify the texts. The features are the flattened last hidden states of DistilBERT, and the training labels are the integer-encoded labels of the dataset. We use ensemble models because they are generally more robust than single models.


In [82]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
)
import joblib

In [75]:
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42
)
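
One point worth checking in this workflow: `X_resampled` was oversampled before the split, so some synthetic training rows may be interpolated from original rows that land in the test set, which can inflate the test scores. The sketch below shows the alternative ordering on toy data: split first, then oversample only the training portion. Random duplication stands in for SMOTE, and all names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced data: 40 rows of class 1, 10 rows of class 0.
X_toy = rng.normal(size=(50, 3))
y_toy = np.array([1] * 40 + [0] * 10)

# 1) Hold out a stratified test set from the ORIGINAL rows only
#    (2 rows of class 0 and 8 rows of class 1).
test_idx = np.concatenate([
    rng.choice(np.flatnonzero(y_toy == label), size=n_test, replace=False)
    for label, n_test in [(0, 2), (1, 8)]
])
train_mask = np.ones(len(y_toy), dtype=bool)
train_mask[test_idx] = False
X_tr, y_tr = X_toy[train_mask], y_toy[train_mask]
X_te, y_te = X_toy[test_idx], y_toy[test_idx]

# 2) Balance only the training portion (duplication stands in for SMOTE).
minority = np.flatnonzero(y_tr == 0)
majority = np.flatnonzero(y_tr == 1)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
X_tr_bal = np.vstack([X_tr, X_tr[extra]])
y_tr_bal = np.concatenate([y_tr, y_tr[extra]])

print(len(y_tr_bal), int((y_tr_bal == 0).sum()), len(y_te))
```

With this ordering the test set contains only original rows, so the reported scores reflect performance on data the oversampler never saw.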

### Random Forest Classifier

Random Forest Classifier is an ensemble model that uses decision trees to classify the data. It uses bagging: each tree is trained on a bootstrap sample of the training data, and the trees' predictions are combined by majority vote. It is a robust model that is less prone to overfitting than a single decision tree, and its trees can be trained in parallel.
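
The two ingredients named above, bagging and majority voting, can be illustrated with NumPy alone. The tree predictions below are made-up values for illustration, not output from the classifier trained in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_trees = 8, 5

# Bagging: each tree is trained on rows drawn with replacement.
bootstrap_rows = rng.integers(0, n_rows, size=(n_trees, n_rows))

# Hypothetical 0/1 predictions of 5 trees (rows) for 4 test samples (columns).
tree_predictions = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
])

# Majority vote: a sample is labelled 1 when more than half of the trees say 1.
votes_for_one = tree_predictions.sum(axis=0)
forest_prediction = (votes_for_one > n_trees / 2).astype(int)
print(forest_prediction)  # [1 0 1 0]
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead of counting hard votes. The outcome is usually similar.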


In [76]:
from sklearn.ensemble import RandomForestClassifier

In [77]:
random_forest_classifier = RandomForestClassifier(random_state=42)

In [78]:
random_forest_classifier.fit(X_train, y_train)

In [94]:
# Evaluate the model through various metrics: accuracy, precision, recall, f1-score

y_pred = random_forest_classifier.predict(X_test)

print(
    "Accuracy score:",
    accuracy_score(y_test, y_pred),
    "\nPrecision score:",
    precision_score(y_test, y_pred),
    "\nRecall score:",
    recall_score(y_test, y_pred),
    "\nf1 score:",
    f1_score(y_test, y_pred),
)

Accuracy score: 0.9910313901345291 
Precision score: 0.9917355371900827 
Recall score: 0.9917355371900827 
f1 score: 0.9917355371900827


In [95]:
print("Classification for Random Forest\n", classification_report(y_test, y_pred))

Classification for Random Forest
               precision    recall  f1-score   support

           0       0.99      0.99      0.99       102
           1       0.99      0.99      0.99       121

    accuracy                           0.99       223
   macro avg       0.99      0.99      0.99       223
weighted avg       0.99      0.99      0.99       223
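
The four scores are all ratios of confusion-matrix counts. The counts below are inferred from the printed scores together with the class supports (102 and 121) in the report, so treat them as an assumption rather than values read from the classifier.

```python
# Counts inferred from the reported scores and supports (an assumption).
tp, fp, fn, tn = 120, 1, 1, 101

accuracy = (tp + tn) / (tp + fp + fn + tn)  # correct predictions / all predictions
precision = tp / (tp + fp)                  # of the predicted positives, how many are right
recall = tp / (tp + fn)                     # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```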



In [84]:
joblib.dump(
    random_forest_classifier,
    "../../Models/requirement_relevancy_experiment/classifier_models/distilbert_random_forest_classifier.joblib",
)

['../../Models/requirement_relevancy_experiment/classifier_models/distilbert_random_forest_classifier.joblib']

### Gradient Boost Classifier

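Gradient Boost Classifier is an ensemble model that uses decision trees to classify the data. It uses boosting: the trees are built one after another, each fitted to the errors of the ensemble so far, and the final prediction is the sum of the trees' outputs rather than a majority vote. It is often very accurate, but it can overfit if too many trees or too high a learning rate are used. Because training is sequential, it is usually slower to train than a random forest.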
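
A minimal sketch of boosting is shown below, using one-split trees (stumps) and squared error on a toy 1-D regression problem. Each new stump is fitted to the residuals of the current ensemble, and its output is added to the running prediction. `GradientBoostingClassifier` applies the same idea to a classification loss. The helper names and the toy data are assumptions for illustration.

```python
import numpy as np


def fit_stump(x, residual):
    """Best single-threshold split of 1-D x for predicting residual (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left, right = residual[x <= threshold], residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


x = np.linspace(0.0, 1.0, 20)
y = (x > 0.5).astype(float) + 0.3 * x   # toy target

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # start from a constant model
errors = []
for _ in range(10):
    residual = y - prediction            # negative gradient of the squared loss
    stump = fit_stump(x, residual)       # each new tree fits what is still wrong
    prediction = prediction + learning_rate * predict_stump(stump, x)
    errors.append(float(((y - prediction) ** 2).mean()))

print(errors[0], errors[-1])
```

The training error shrinks with every added stump, which is also why boosting can overfit when the number of trees is not limited.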

In [86]:
from sklearn.ensemble import GradientBoostingClassifier


In [87]:
gradient_boosting_classifier = GradientBoostingClassifier(random_state=42)

In [88]:
gradient_boosting_classifier.fit(X_train, y_train)

In [89]:
# Evaluate the model through various metrics: accuracy, precision, recall, f1-score

y_pred = gradient_boosting_classifier.predict(X_test)

print(
    "Accuracy score:",
    accuracy_score(y_test, y_pred),
    "\nPrecision score:",
    precision_score(y_test, y_pred),
    "\nRecall score:",
    recall_score(y_test, y_pred),
    "\nf1 score:",
    f1_score(y_test, y_pred),
)

Accuracy score: 1.0 
Precision score: 1.0 
Recall score: 1.0 
f1 score: 1.0


In [93]:
print("Classification for Gradient Boosting\n", classification_report(y_test, y_pred))

Classification for Gradient Boosting
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       102
           1       1.00      1.00      1.00       121

    accuracy                           1.00       223
   macro avg       1.00      1.00      1.00       223
weighted avg       1.00      1.00      1.00       223



In [90]:
# Save the gradient boosting model (not the random forest) under this file name
joblib.dump(
    gradient_boosting_classifier,
    "../../Models/requirement_relevancy_experiment/classifier_models/distilbert_gradient_boost_classifier.joblib",
)

['../../Models/requirement_relevancy_experiment/classifier_models/distilbert_gradient_boost_classifier.joblib']