In [1]:
import pandas as pd
import torch
import os
import numpy as np

In [2]:
MY_DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
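# get_device_name only accepts CUDA devices, so fall back to a plain label on CPU
torch.cuda.get_device_name(MY_DEVICE) if MY_DEVICE.type == "cuda" else "cpu"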

'NVIDIA GeForce RTX 3050 Ti Laptop GPU'

In [3]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "caching_allocator"

In [4]:
requirement_relevancy_dataset = pd.read_csv(
    "../../Datasets/irrelevant_requirements_dataset/irrelevant_requirements_dataset.csv",
    engine="pyarrow",
)

requirement_relevancy_dataset.head()

Unnamed: 0,reqs_statement,action_part,actor_part,label
0,user submit job associate cost execution time ...,submit job associate cost execution time deadline,user,relevant
1,user establish cost unit time and submit job,establish cost unit time and submit job,user,relevant
2,user monitor job submit status,monitor job submit status,user,relevant
3,user cancel job submit,cancel job submit,user,relevant
4,user check credit balance,check credit balance,user,relevant


## Experiment With NLP Models

In this segment, I experiment with several NLP models to see which one performs best: DistilBERT, RoBERTa, and XLNet. I use the HuggingFace `transformers` library to implement them, with the same data as in the previous notebook.


## DistilBERT Model

DistilBERT is a smaller version of BERT that runs faster and uses less memory. It is trained on the same data as BERT using a technique called knowledge distillation, which compresses a large model (the teacher) into a smaller one (the student): the smaller model is trained to mimic the behavior of the larger model.
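To make the "mimic the larger model" idea concrete, here is a minimal NumPy sketch of a soft-target distillation loss: the KL divergence between the temperature-softened output distributions of teacher and student. The logits, the temperature value, and the function names are made up for illustration, and this shows only the soft-target term, not the full objective used to train DistilBERT.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return float(kl.mean())


# Toy logits (hypothetical): 2 examples, 3 classes.
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
close_student = teacher + 0.1      # same distribution: softmax ignores a constant shift
far_student = teacher[:, ::-1]     # class order reversed: a poor imitation

print("close student loss:", distillation_loss(close_student, teacher))
print("far student loss:  ", distillation_loss(far_student, teacher))
```

A student whose softened distribution matches the teacher's gets a loss near zero; the further its distribution drifts, the larger the loss.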


In [5]:
from transformers import (
    DistilBertModel,
    DistilBertTokenizer,
)
from sklearn.model_selection import train_test_split

In [6]:
text_data_X = requirement_relevancy_dataset["reqs_statement"]
label_data_y = requirement_relevancy_dataset["label"]

In [7]:
bert_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

In [8]:
with torch.no_grad():
    tokenized_text_data_X = bert_tokenizer(
        text_data_X.tolist(),
        padding="max_length",
        return_tensors="pt",
        max_length=64,
        truncation=True,
    )

In [10]:
tokenized_text_data_y = label_data_y.map({"relevant": 1, "irrelevant": 0}).to_numpy()
# tokenized_text_data_y

In [11]:
tokenized_text_data_X = {
    key: val.to(MY_DEVICE) for key, val in tokenized_text_data_X.items()
}

In [12]:
tokenized_text_data_X["input_ids"].shape, tokenized_text_data_X["attention_mask"].shape

(torch.Size([621, 64]), torch.Size([621, 64]))

In [14]:
bert_model = DistilBertModel.from_pretrained(
    "distilbert-base-uncased",
    device_map=MY_DEVICE,
)

### Mixed Precision Calculation

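Mixed precision is the use of both 16-bit and 32-bit floating-point types to reduce memory use and speed up computation. Here it is applied to the forward pass that extracts the embeddings.


`torch.cuda.amp.autocast()` is a context manager that runs eligible operations inside its scope in a lower-precision type (float16 by default on CUDA) while keeping precision-sensitive operations in float32.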
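The trade-off behind mixed precision can be seen without a GPU. The sketch below uses NumPy arrays as a stand-in for activations (the batch size of 8 is arbitrary; 64 tokens and 768 hidden units match the shapes in this notebook) and compares storage size and resolution of float32 and float16. It illustrates the number formats only, not what autocast does internally.

```python
import numpy as np

# Stand-in activation tensor: 8 sequences x 64 tokens x 768 hidden units.
activations_fp32 = np.ones((8, 64, 768), dtype=np.float32)
activations_fp16 = activations_fp32.astype(np.float16)

print("float32 bytes:", activations_fp32.nbytes)
print("float16 bytes:", activations_fp16.nbytes)  # half the storage

# float16 has a 10-bit mantissa, so values very close together collapse.
distinct_in_fp32 = np.float32(1.0001) != np.float32(1.0)
distinct_in_fp16 = np.float16(1.0001) != np.float16(1.0)
print("1.0001 distinguishable from 1.0 in float32:", bool(distinct_in_fp32))
print("1.0001 distinguishable from 1.0 in float16:", bool(distinct_in_fp16))
```

This is why autocast keeps some operations in float32: halving the memory is worthwhile only where the lost resolution does not matter.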


In [15]:
# Inference only: no_grad() avoids keeping an autograd graph in GPU memory.
with torch.no_grad(), torch.cuda.amp.autocast():
    outputs = bert_model(**tokenized_text_data_X)
    last_hidden_states = outputs.last_hidden_state
# outputs = bert_model(**tokenized_text_data_X)

In [32]:
bert_model.save_pretrained(
    "../../Models/requirement_relevancy_experiment/NLP_models/my_distilbert_model"
)

In [None]:
last_hidden_states

In [16]:
torch.cuda.empty_cache()

In [17]:
reshaped_last_hidden_states_X = (
    last_hidden_states.reshape(last_hidden_states.shape[0], -1).detach().cpu().numpy()
)
reshaped_last_hidden_states_X.shape

(621, 49152)

In [35]:
np.savetxt(
    "../../Datasets/irrelevant_requirements_dataset/distilbert_X.csv",
    reshaped_last_hidden_states_X,
    delimiter=",",
)

In [None]:
# Run this cell to load the saved DistilBERT model and the reshaped last hidden states

# reshaped_last_hidden_states_X = np.loadtxt(
#     "../../Datasets/irrelevant_requirements_dataset/distilbert_X.csv",
#     delimiter=",",
# )

# bert_model = DistilBertModel.from_pretrained(
#     "../../Models/requirement_relevancy_experiment/NLP_models/my_distilbert_model"
# )

## Oversampling Of Data

The dataset is heavily imbalanced: 64 irrelevant requirements against 557 relevant ones. So, we will oversample the minority class to balance it. We are currently analyzing various oversampling techniques and will use the best one for our model. To know more about the various oversampling techniques, please refer to the [smote-variants package](https://pypi.org/project/smote-variants/).


### SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic samples for the minority class. Because it creates new points instead of duplicating existing ones, it reduces the overfitting that random oversampling tends to cause. It picks a random point from the minority class, computes that point's k nearest neighbors within the class, and places a synthetic point on the line segment between the chosen point and one of those neighbors.
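The interpolation step can be sketched in a few lines of NumPy. This is an illustrative toy version, not the `smote_variants` implementation: the function name `smote_sketch`, its parameters, and the 2-D sample points are all hypothetical, and the full pairwise distance matrix it builds is only practical for small inputs.

```python
import numpy as np


def smote_sketch(minority_X, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating toward random k-nearest neighbours."""
    rng = np.random.default_rng(seed)
    minority_X = np.asarray(minority_X, dtype=np.float64)
    n = len(minority_X)
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = minority_X[:, None, :] - minority_X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base_idx = rng.integers(0, n, size=n_new)                          # chosen points
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                       # position on the segment
    return minority_X[base_idx] + gap * (minority_X[neigh_idx] - minority_X[base_idx])


# Toy minority class in 2-D (made-up points inside the unit square).
minority = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
)
synthetic = smote_sketch(minority, n_new=10, k=3)
print(synthetic.shape)
print(synthetic.round(3))
```

Every synthetic point is a convex combination of two real minority points, so it always lies between them and never outside the region the minority class already occupies.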


In [18]:
import smote_variants as sv

In [19]:
oversampler = sv.MulticlassOversampling(oversampler="SMOTE")

In [20]:
X_resampled, y_resampled = oversampler.sample(
    reshaped_last_hidden_states_X, tokenized_text_data_y
)

2024-01-14 20:14:51,586:INFO:MulticlassOversampling: Running multiclass oversampling with strategy eq_1_vs_many_successive
2024-01-14 20:14:51,617:INFO:MulticlassOversampling: Sampling minority class with label: 0
2024-01-14 20:14:51,648:INFO:SMOTE: Running sampling via ('SMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'nn_params': {}, 'n_jobs': 1, 'ss_params': {'n_dim': 2, 'simplex_sampling': 'random', 'within_simplex_sampling': 'random', 'gaussian_component': {}}, 'random_state': None, 'class_name': 'SMOTE'}")
2024-01-14 20:14:51,653:INFO:NearestNeighborsWithMetricTensor: NN fitting with metric minkowski
2024-01-14 20:14:51,656:INFO:NearestNeighborsWithMetricTensor: kneighbors query minkowski
2024-01-14 20:14:52,239:INFO:SMOTE: simplex sampling with n_dim 2


In [53]:
# Save the resampled data

np.savetxt(
    "../../Datasets/irrelevant_requirements_dataset/distilbert_X_resampled.csv",
    X_resampled,
    delimiter=",",
)

np.savetxt(
    "../../Datasets/irrelevant_requirements_dataset/distilbert_y_resampled.csv",
    y_resampled,
    delimiter=",",
)

Before Resampling:


In [22]:
# count the number of 1 and 0 in the total dataset
unique, counts = np.unique(tokenized_text_data_y, return_counts=True)
print(
    "Number of Irrelevant and Relevant in the total dataset:",
    dict(zip(["Irrelevant", "Relevant"], counts)),
)

Number of Irrelevant and Relevant in the total dataset: {'Irrelevant': 64, 'Relevant': 557}


After Resampling:


In [23]:
# count the number of 1 and 0 in the resampled dataset
unique, counts = np.unique(y_resampled, return_counts=True)
print(
    "Number of Irrelevant and Relevant in the total dataset:",
    dict(zip(["Irrelevant", "Relevant"], counts)),
)

Number of Irrelevant and Relevant in the total dataset: {'Irrelevant': 557, 'Relevant': 557}


## Classification

In this section we use several classification models to classify the texts. The features are the flattened last hidden states from DistilBERT, and the training labels are the encoded dataset labels (1 for relevant, 0 for irrelevant). We use ensemble models because they tend to be more robust than a single classifier.

Ensemble models are machine learning techniques that combine the predictions of multiple base models to improve overall performance. The key idea is that the base models make different errors, so combining them gives a more robust and accurate prediction. Bagging-style ensembles mainly reduce variance, while boosting-style ensembles mainly reduce bias.
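A small worked example shows why combining models helps when their errors differ. The three "models" below are hypothetical prediction vectors, each wrong on a different sample; a majority vote over them recovers every label. The function name `majority_vote` is an illustrative helper, not part of any library used in this notebook.

```python
import numpy as np


def majority_vote(predictions):
    """Combine binary predictions of shape (n_models, n_samples); ties go to class 1."""
    predictions = np.asarray(predictions)
    return (predictions.mean(axis=0) >= 0.5).astype(int)


# Made-up labels and predictions from three hypothetical base models.
true_labels = np.array([1, 0, 1, 1, 0])
model_preds = np.array([
    [1, 0, 0, 1, 0],  # wrong on sample 2
    [1, 1, 1, 1, 0],  # wrong on sample 1
    [0, 0, 1, 1, 0],  # wrong on sample 0
])

ensemble = majority_vote(model_preds)
individual_accuracy = (model_preds == true_labels).mean(axis=1)
ensemble_accuracy = (ensemble == true_labels).mean()

print("individual accuracies:", individual_accuracy)
print("ensemble accuracy:", ensemble_accuracy)
```

The gain disappears if all models make the same mistakes, which is why ensemble methods deliberately inject diversity (random subsets, reweighting, or sequential error correction).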


In [24]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
)
import joblib

In [25]:
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42
)

### Random Forest Classifier

**_How it works:_** A Random Forest is an ensemble of decision trees trained on random subsets of the features and the training data. Each tree independently makes a prediction, and the final prediction is obtained through voting or averaging.

**_Advantages_**: Compared with a single decision tree, it reduces overfitting, gives more stable predictions, and usually improves accuracy.
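The two sources of randomness can be illustrated with plain NumPy. The numbers are derived from this notebook's shapes (891 training rows, i.e. 1114 resampled rows minus the 223 test rows, and 49152 features); the square-root rule for candidate features is scikit-learn's default for classification, where the feature subset is redrawn at every split.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 891, 49152

# Bootstrap: each tree sees n_samples rows drawn with replacement.
bootstrap_rows = rng.integers(0, n_samples, size=n_samples)
unique_fraction = len(np.unique(bootstrap_rows)) / n_samples
print("fraction of distinct rows seen by one tree:", round(unique_fraction, 3))

# Feature subsampling: only about sqrt(n_features) candidates are examined per split.
n_candidates = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=n_candidates, replace=False)
print("candidate features per split:", n_candidates)
```

Each tree therefore sees roughly 63% of the distinct training rows and a small slice of the features at each split, which keeps the trees decorrelated.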


In [26]:
from sklearn.ensemble import RandomForestClassifier

In [27]:
random_forest_classifier = RandomForestClassifier(random_state=42)

In [28]:
random_forest_classifier.fit(X_train, y_train)

In [29]:
# Evaluate the model through various metrics: accuracy, precision, recall, f1-score

y_pred = random_forest_classifier.predict(X_test)

print(
    "Accuracy score:",
    accuracy_score(y_test, y_pred),
    "\nPrecision score:",
    precision_score(y_test, y_pred),
    "\nRecall score:",
    recall_score(y_test, y_pred),
    "\nf1 score:",
    f1_score(y_test, y_pred),
)

Accuracy score: 0.9820627802690582 
Precision score: 0.975609756097561 
Recall score: 0.9917355371900827 
f1 score: 0.9836065573770492


In [30]:
print("Classification for Random Forest\n", classification_report(y_test, y_pred, digits=4))

Classification for Random Forest
               precision    recall  f1-score   support

           0     0.9900    0.9706    0.9802       102
           1     0.9756    0.9917    0.9836       121

    accuracy                         0.9821       223
   macro avg     0.9828    0.9812    0.9819       223
weighted avg     0.9822    0.9821    0.9820       223
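As a sanity check, the four scores can be reproduced from confusion-matrix counts. The counts below are inferred from the reported figures (supports of 102 and 121, precision 120/123, recall 120/121); the notebook itself does not print a confusion matrix.

```python
# Counts inferred from the reported Random Forest scores, with class 1 as positive.
tp, fp, fn, tn = 120, 3, 1, 99

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```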



In [33]:
joblib.dump(
    random_forest_classifier,
    "../../Models/requirement_relevancy_experiment/classifier_models/distilbert_random_forest_classifier.joblib",
)

['../../Models/requirement_relevancy_experiment/classifier_models/distilbert_random_forest_classifier.joblib']

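### Gradient Boosting Classifier

**_How it works:_** Gradient boosting is an ensemble method that builds decision trees sequentially. Each new tree is fit to the negative gradient of the loss (the errors of the trees built so far), and the final prediction is the sum of all trees' outputs scaled by a learning rate, not a majority vote.

**_Advantages:_** Often very accurate. It can overfit if too many trees are added, and because the trees are built one after another, training is slower than for a Random Forest, whose trees can be trained in parallel.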
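The sequential error-correcting loop of gradient boosting can be sketched with one-split trees (stumps) on a toy 1-D problem. To keep it short, this sketch uses squared error, where the negative gradient is simply the residual; `GradientBoostingClassifier` applies the same loop to a classification loss. The data, learning rate, and helper names are made up for illustration.

```python
import numpy as np


def fit_stump(x, residual):
    """Pick the threshold whose two-constant fit has the lowest squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left_value = residual[x <= threshold].mean()
        right_value = residual[x > threshold].mean()
        pred = np.where(x <= threshold, left_value, right_value)
        error = ((residual - pred) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # start from a constant model
errors = []
for _ in range(20):
    residual = y - prediction            # negative gradient of squared error
    stump = fit_stump(x, residual)
    prediction += learning_rate * predict_stump(stump, x)
    errors.append(float(((y - prediction) ** 2).mean()))

print("MSE after first tree:", errors[0])
print("MSE after last tree: ", errors[-1])
```

The training error never increases from one round to the next, which also shows why too many rounds can overfit: the model keeps chasing whatever residual remains, including noise.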


In [34]:
from sklearn.ensemble import GradientBoostingClassifier

In [35]:
gradient_boosting_classifier = GradientBoostingClassifier(random_state=42)

In [36]:
gradient_boosting_classifier.fit(X_train, y_train)

In [37]:
# Evaluate the model through various metrics: accuracy, precision, recall, f1-score

y_pred = gradient_boosting_classifier.predict(X_test)

print(
    "Accuracy score:",
    accuracy_score(y_test, y_pred),
    "\nPrecision score:",
    precision_score(y_test, y_pred),
    "\nRecall score:",
    recall_score(y_test, y_pred),
    "\nf1 score:",
    f1_score(y_test, y_pred),
)

Accuracy score: 0.9910313901345291 
Precision score: 0.9917355371900827 
Recall score: 0.9917355371900827 
f1 score: 0.9917355371900827


In [50]:
# Recompute the predictions in this cell so the report cannot pick up a stale y_pred
y_pred = gradient_boosting_classifier.predict(X_test)
print(
    "Classification for Gradient Boosting\n",
    classification_report(y_test, y_pred, digits=4),
)

Classification for Gradient Boosting
               precision    recall  f1-score   support

           0     0.9902    0.9902    0.9902       102
           1     0.9917    0.9917    0.9917       121

    accuracy                         0.9910       223
   macro avg     0.9910    0.9910    0.9910       223
weighted avg     0.9910    0.9910    0.9910       223



In [39]:
joblib.dump(
    gradient_boosting_classifier,
    "../../Models/requirement_relevancy_experiment/classifier_models/distilbert_gradient_boost_classifier.joblib",
)

['../../Models/requirement_relevancy_experiment/classifier_models/distilbert_gradient_boost_classifier.joblib']

### AdaBoost Classifier

**_How it works:_** AdaBoost is an ensemble learning method that sequentially trains weak learners on weighted datasets, adjusting weights for misclassified instances in each iteration. The final prediction is made by combining the weak learners' predictions, weighted by their accuracy.

**_Advantages:_** AdaBoost is adaptive, focusing each new learner on previously misclassified instances. It has few hyperparameters to tune, works with various base learners, and is effective for binary classification. Because it keeps increasing the weight of misclassified instances, it can be sensitive to noisy labels and outliers.
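One round of the reweighting described above can be worked through by hand. The sketch below follows the classic discrete AdaBoost update with labels in {-1, +1}; the five labels and the weak learner's predictions are made up, and scikit-learn's `AdaBoostClassifier` uses a related variant, so treat this as an illustration of the idea only.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])        # labels in {-1, +1}
weak_pred = np.array([1, -1, -1, -1, 1])    # weak learner is wrong on sample 1
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error of the weak learner and its vote weight in the final ensemble.
error = weights[weak_pred != y_true].sum()
alpha = 0.5 * np.log((1 - error) / error)

# Misclassified samples are scaled up, correct ones down, then renormalized.
weights = weights * np.exp(-alpha * y_true * weak_pred)
weights /= weights.sum()

print("weighted error:", error)
print("learner weight (alpha):", alpha)
print("updated sample weights:", weights)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.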


In [40]:
from sklearn.ensemble import AdaBoostClassifier

In [41]:
adaboost_classifier = AdaBoostClassifier(random_state=42)

In [42]:
adaboost_classifier.fit(X_train, y_train)

In [51]:
y_pred = adaboost_classifier.predict(X_test)

print(
    "Classification result for AdaBoost\n",
    classification_report(y_test, y_pred, digits=4),
)

Classification result for AdaBoost
               precision    recall  f1-score   support

           0     0.9252    0.9706    0.9474       102
           1     0.9741    0.9339    0.9536       121

    accuracy                         0.9507       223
   macro avg     0.9497    0.9522    0.9505       223
weighted avg     0.9518    0.9507    0.9507       223



In [44]:
joblib.dump(
    adaboost_classifier,
    "../../Models/requirement_relevancy_experiment/classifier_models/distilbert_adaboost_classifier.joblib",
)

['../../Models/requirement_relevancy_experiment/classifier_models/distilbert_adaboost_classifier.joblib']

### XGBoost Classifier

**_How it works:_** XGBoost is a gradient boosting implementation that adds regularization to the boosting objective. It adds trees sequentially; each tree is chosen using the first and second derivatives of the loss with respect to the current predictions, and penalties on the number of leaves and the size of the leaf weights limit tree complexity.

**_Advantages:_** High accuracy, handles missing data, and provides feature importance.

More about XGBoost [here](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)
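The regularized objective leads to closed-form expressions for the best leaf weight and the gain of a split, as described in the tutorial linked above. The sketch below evaluates them for logistic loss on six made-up labels at an initial prediction of 0.5; the function names are illustrative helpers and not part of the `xgboost` API.

```python
import numpy as np


def leaf_weight(grad, hess, reg_lambda=1.0):
    """Optimal leaf value: -G / (H + lambda)."""
    return -grad.sum() / (hess.sum() + reg_lambda)


def split_gain(grad, hess, left_mask, reg_lambda=1.0, gamma=0.0):
    """Reduction in the regularized objective from splitting one leaf into two."""
    def score(g, h):
        return g.sum() ** 2 / (h.sum() + reg_lambda)

    right_mask = ~left_mask
    children = score(grad[left_mask], hess[left_mask]) + score(grad[right_mask], hess[right_mask])
    return 0.5 * (children - score(grad, hess)) - gamma


# Logistic loss at probability p: gradient = p - y, hessian = p * (1 - p).
y = np.array([1, 1, 1, 0, 0, 0])
p = np.full(len(y), 0.5)
grad = p - y
hess = p * (1 - p)

left_mask = np.array([True, True, True, False, False, False])  # a perfect split
print("left leaf weight: ", leaf_weight(grad[left_mask], hess[left_mask]))
print("right leaf weight:", leaf_weight(grad[~left_mask], hess[~left_mask]))
print("split gain:", split_gain(grad, hess, left_mask))
```

Larger values of `reg_lambda` shrink the leaf weights toward zero, and a positive `gamma` rejects splits whose gain is too small, which is how the regularization limits tree complexity.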

In [45]:
from xgboost import XGBClassifier

In [46]:
xgboost_classifier = XGBClassifier(random_state=42)

In [47]:
xgboost_classifier.fit(X_train, y_train)

In [52]:
# Evaluate the model through various metrics: accuracy, precision, recall, f1-score by printing the classification report

y_pred = xgboost_classifier.predict(X_test)
print(
    "Classification for XG Boosting\n", classification_report(y_test, y_pred, digits=4)
)

Classification for XG Boosting
               precision    recall  f1-score   support

           0     0.9612    0.9706    0.9659       102
           1     0.9750    0.9669    0.9710       121

    accuracy                         0.9686       223
   macro avg     0.9681    0.9688    0.9684       223
weighted avg     0.9687    0.9686    0.9686       223



In [49]:
joblib.dump(
    xgboost_classifier,
    "../../Models/requirement_relevancy_experiment/classifier_models/distilbert_xgboost_classifier.joblib",
)

['../../Models/requirement_relevancy_experiment/classifier_models/distilbert_xgboost_classifier.joblib']