# SMOTEN Oversampling Experiment
In this notebook we use the SMOTEN oversampling technique to balance the dataset. We use the same dataset as in the previous notebook.

In [1]:
import pandas as pd
import torch
import os
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTEN

In [2]:
MY_DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.get_device_name(MY_DEVICE)

'NVIDIA GeForce RTX 3050 Ti Laptop GPU'

In [3]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "caching_allocator"

In [4]:
requirement_relevancy_dataset = pd.read_csv(
    "../../../Datasets/irrelevant_requirements_dataset/irrelevant_requirements_dataset.csv",
    engine="pyarrow",
)

requirement_relevancy_dataset.head()

| Unnamed: 0 | reqs_statement | action_part | actor_part | label |
|---|---|---|---|---|
| 0 | user submit job associate cost execution time ... | submit job associate cost execution time deadline | user | relevant |
| 1 | user establish cost unit time and submit job | establish cost unit time and submit job | user | relevant |
| 2 | user monitor job submit status | monitor job submit status | user | relevant |
| 3 | user cancel job submit | cancel job submit | user | relevant |
| 4 | user check credit balance | check credit balance | user | relevant |


### Making Train Test Split

In [5]:
requirements_X = requirement_relevancy_dataset["reqs_statement"]
label_y = requirement_relevancy_dataset["label"]

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    requirements_X, label_y, test_size=0.4, random_state=42, stratify=label_y
)

In [12]:
reshaped_X_train = np.array(X_train).reshape(-1, 1)
reshaped_X_test = np.array(X_test).reshape(-1, 1)

In [14]:
reshaped_X_train.shape, y_train.shape

((372, 1), (372,))

## Experiment With NLP Models

In this segment, I experiment with different NLP models to see which one performs best: DistilBERT, RoBERTa, and XLNet. I use the Hugging Face Transformers library to implement these models, with the same data as in the previous notebook.


## DistilBERT Model

DistilBERT is a smaller version of BERT, designed to be faster and more memory efficient. It is trained on the same data as BERT using a technique called knowledge distillation, which compresses a large model into a smaller one by training the smaller model to mimic the behavior of the larger model.
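
To make the "mimic the larger model" idea concrete, here is a minimal NumPy sketch of a soft-target distillation loss: the KL divergence between temperature-softened teacher and student distributions. The logits, the temperature value, and the function names are illustrative assumptions, and this covers only the soft-target term, not DistilBERT's full training objective.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = np.sum(
        teacher_probs * (np.log(teacher_probs) - np.log(student_probs)), axis=-1
    )
    return float(kl.mean())


# Hypothetical logits for a batch of two examples and three classes.
teacher_logits = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
close_student = teacher_logits + np.array([0.1, -0.1, 0.0])
far_student = np.array([[0.5, 1.0, 4.0], [3.0, 0.2, 0.1]])

loss_close = distillation_loss(close_student, teacher_logits)
loss_far = distillation_loss(far_student, teacher_logits)
print(loss_close, loss_far)
```

A student whose outputs track the teacher's gets a loss near zero, while one that disagrees is penalized more heavily.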


In [7]:
from transformers import (
    DistilBertModel,
    DistilBertTokenizer,
)

In [8]:
bert_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

In [19]:
with torch.no_grad():
    tokenized_train_data_X = bert_tokenizer(
        X_train.tolist(),
        padding="max_length",
        return_tensors="pt",
        max_length=64,
        truncation=True,
    )

    tokenized_test_data_X = bert_tokenizer(
        X_test.tolist(),
        padding="max_length",
        return_tensors="pt",
        max_length=64,
        truncation=True,
    )

In [20]:
tokenized_train_data_y = y_train.map({"relevant": 1, "irrelevant": 0})
tokenized_test_data_y = y_test.map({"relevant": 1, "irrelevant": 0})

In [21]:
bert_model = DistilBertModel.from_pretrained(
    "../../../Models/requirement_relevancy_experiment/NLP_models/my_distilbert_model/",
    # device_map=MY_DEVICE,
)

### Mixed Precision Calculation

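Mixed precision is the use of both 16-bit and 32-bit floats to reduce memory use and make computation run faster.

`torch.cuda.amp.autocast()` is a context manager that runs eligible CUDA operations in 16-bit precision while keeping numerically sensitive operations in 32-bit.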
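
The trade-off behind mixed precision can be seen without a GPU. The NumPy sketch below is an illustration only, not what autocast does internally: it shows that a 16-bit array takes half the memory of a 32-bit one, and that 16-bit arithmetic can silently drop small increments, which is why sensitive operations stay in 32-bit. The array shape matches one sequence of 64 tokens with 768 hidden units; the increment value is an arbitrary choice.

```python
import numpy as np

# One sequence of hidden states: 64 tokens x 768 hidden units.
activations_fp32 = np.ones((64, 768), dtype=np.float32)
activations_fp16 = activations_fp32.astype(np.float16)

# Half the bytes per element means half the memory.
print(activations_fp32.nbytes, activations_fp16.nbytes)

# A small increment survives in float32 but is rounded away in float16,
# because float16 spacing near 1.0 is about 0.001.
small_update = 1e-4
sum_fp16 = np.float16(1.0) + np.float16(small_update)
sum_fp32 = np.float32(1.0) + np.float32(small_update)

print(sum_fp16 == np.float16(1.0))
print(sum_fp32 == np.float32(1.0))
```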


In [22]:
torch.cuda.empty_cache()

In [23]:
with torch.cuda.amp.autocast():
    X_train_outputs = bert_model(**tokenized_train_data_X)
    X_test_outputs = bert_model(**tokenized_test_data_X)

# outputs = bert_model(**tokenized_text_data_X)

In [24]:
X_train_last_hidden_states = X_train_outputs.last_hidden_state
X_test_last_hidden_states = X_test_outputs.last_hidden_state

In [25]:
reshaped_X_train_last_hidden_states = X_train_last_hidden_states.reshape(
    X_train_last_hidden_states.shape[0], -1
).detach().numpy()

reshaped_X_test_last_hidden_states = X_test_last_hidden_states.reshape(
    X_test_last_hidden_states.shape[0], -1
).detach().numpy()

reshaped_X_train_last_hidden_states.shape, reshaped_X_test_last_hidden_states.shape

((372, 49152), (249, 49152))

In [31]:
# Save the reshaped_X_train_last_hidden_states and reshaped_X_test_last_hidden_states
np.savetxt(
    "../../../Datasets/irrelevant_requirements_dataset/model_state_outputs/distilbert/reshaped_X_train_last_hidden_states.csv",
    reshaped_X_train_last_hidden_states,
)

np.savetxt(
    "../../../Datasets/irrelevant_requirements_dataset/model_state_outputs/distilbert/reshaped_X_test_last_hidden_states.csv",
    reshaped_X_test_last_hidden_states,
)

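# Save the numeric labels: the raw y_train and y_test still hold strings here,
# which np.savetxt's default float format cannot write
np.savetxt(
    "../../../Datasets/irrelevant_requirements_dataset/model_state_outputs/distilbert/y_train.csv",
    tokenized_train_data_y,
    fmt="%d",
)

np.savetxt(
    "../../../Datasets/irrelevant_requirements_dataset/model_state_outputs/distilbert/y_test.csv",
    tokenized_test_data_y,
    fmt="%d",
)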

## Classification

In this section we use various classification models to classify the texts. The features are the outputs of the last hidden layer, and the training labels are the numerically encoded dataset labels. We use ensemble models because combining several learners tends to give more robust predictions than a single one.

Ensemble models are machine learning techniques that combine the predictions of multiple base models to improve overall performance. The key idea is that combining the strengths of different models can lead to a more robust and accurate prediction. Depending on the method, an ensemble mainly reduces variance (bagging, as in random forests) or bias (boosting), which is why ensembles often outperform a single model.
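
As a small illustration of why combining models can help, the sketch below takes a majority vote over three hypothetical base models that each make one mistake on a different sample. The labels and predictions are made up for the example.

```python
import numpy as np

true_labels = np.array([1, 1, 0, 1, 0, 1])

# Each row holds one base model's predictions; each model errs on a different sample.
base_predictions = np.array(
    [
        [1, 1, 0, 0, 0, 1],  # model A: wrong on index 3
        [1, 0, 0, 1, 0, 1],  # model B: wrong on index 1
        [1, 1, 1, 1, 0, 1],  # model C: wrong on index 2
    ]
)

# Majority vote: predict 1 when more than half of the models vote 1.
votes_for_one = base_predictions.sum(axis=0)
ensemble_prediction = (votes_for_one * 2 > base_predictions.shape[0]).astype(int)

individual_accuracy = (base_predictions == true_labels).mean(axis=1)
ensemble_accuracy = (ensemble_prediction == true_labels).mean()
print(individual_accuracy, ensemble_accuracy)
```

The vote only helps when the base models' errors are not strongly correlated; if all models fail on the same samples, the ensemble fails there too.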


In [35]:
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
)
import joblib
from sklearn.utils.class_weight import compute_class_weight

In [36]:
# Compute balanced class weights from the encoded (0/1) training labels
class_weights = compute_class_weight(
    "balanced", classes=np.array([0, 1]), y=tokenized_train_data_y
)
class_weights

array([4.89473684, 0.55688623])
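
The "balanced" heuristic sets each weight to `n_samples / (n_classes * class_count)`. The sketch below reproduces the printed weights by hand. The per-class counts (38 and 334) are not printed in the notebook; they are inferred here from the weights and the 372 training rows, so treat them as an assumption.

```python
import numpy as np

# Assumed training counts for class 0 (irrelevant) and class 1 (relevant),
# inferred from the printed weights and the 372 training rows.
class_counts = np.array([38, 334])

n_samples = class_counts.sum()
n_classes = len(class_counts)

# Rarer classes get proportionally larger weights.
manual_weights = n_samples / (n_classes * class_counts)
print(manual_weights)
```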

### Random Forest Classifier

**_How it works:_** A Random Forest is an ensemble of decision trees trained on random subsets of the features and the training data. Each tree independently makes a prediction, and the final prediction is obtained through voting or averaging.

**_Advantages_**: Reduces overfitting, improves stability, and increases accuracy.
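
The mechanism described above can be sketched by hand with plain decision trees: each tree is fit on a bootstrap sample, considers a random subset of features at each split, and the forest prediction is a majority vote. This is a simplified illustration on synthetic data, not how `RandomForestClassifier` is implemented internally; `toy_X`, `toy_y`, and the tree count are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features.
toy_X, toy_y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=42
)
rng = np.random.default_rng(42)

trees = []
for seed in range(25):
    # Bootstrap: sample rows with replacement.
    bootstrap_idx = rng.integers(0, len(toy_X), size=len(toy_X))
    # max_features="sqrt" makes each split consider a random feature subset.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    tree.fit(toy_X[bootstrap_idx], toy_y[bootstrap_idx])
    trees.append(tree)

# Majority vote across the 25 trees (an odd count, so no ties).
all_votes = np.stack([tree.predict(toy_X) for tree in trees])
forest_prediction = (all_votes.mean(axis=0) > 0.5).astype(int)
train_accuracy = (forest_prediction == toy_y).mean()
print(train_accuracy)
```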


In [26]:
from sklearn.ensemble import RandomForestClassifier

In [41]:
random_forest_classifier = RandomForestClassifier(
    class_weight={0: class_weights[0], 1: class_weights[1]}, random_state=42
)

In [42]:
y_train = tokenized_train_data_y.to_numpy()
y_test = tokenized_test_data_y.to_numpy()

X_train = reshaped_X_train_last_hidden_states
X_test = reshaped_X_test_last_hidden_states

In [43]:
y_train.shape, reshaped_X_train_last_hidden_states.shape

((372,), (372, 49152))

In [44]:
random_forest_classifier.fit(reshaped_X_train_last_hidden_states, y_train)

In [47]:
# Evaluate the model through various metrics: accuracy, precision, recall, f1-score

y_pred = random_forest_classifier.predict(reshaped_X_test_last_hidden_states)
# y_test
print(
    "Accuracy score:",
    accuracy_score(y_test, y_pred),
    "\nPrecision score:",
    precision_score(y_test, y_pred),
    "\nRecall score:",
    recall_score(y_test, y_pred),
    "\nf1 score:",
    f1_score(y_test, y_pred),
)

Accuracy score: 0.8955823293172691 
Precision score: 0.8955823293172691 
Recall score: 1.0 
f1 score: 0.9449152542372882


In [49]:
print(
    "Classification for Random Forest\n",
    classification_report(y_test, y_pred, digits=4),
)

Classification for Random Forest
               precision    recall  f1-score   support

           0     0.0000    0.0000    0.0000        26
           1     0.8956    1.0000    0.9449       223

    accuracy                         0.8956       249
   macro avg     0.4478    0.5000    0.4725       249
weighted avg     0.8021    0.8956    0.8462       249





In [52]:
joblib.dump(
    random_forest_classifier,
    "../../../Models/requirement_relevancy_experiment/classifier_models/distilbert_random_forest_classifier.joblib",
)

['../../../Models/requirement_relevancy_experiment/classifier_models/distilbert_random_forest_classifier.joblib']

### Gradient Boost Classifier

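Gradient Boosting Classifier is an ensemble model built from decision trees. It uses boosting: trees are added one at a time, each fit to the errors (the negative gradient of the loss) of the ensemble so far, and the final prediction is the sum of all trees' contributions rather than a majority vote. It can overfit if too many trees are added or the learning rate is too high, and with tens of thousands of features, as here, training can be slow.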
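
The sequential nature of boosting is easiest to see on a regression toy problem with squared error, where the negative gradient is simply the residual. The sketch below is a hand-rolled illustration under those assumptions, not the internals of `GradientBoostingClassifier`, which applies the same idea to a classification loss. The data, learning rate, and tree depth are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
toy_X = rng.uniform(-3, 3, size=(200, 1))
toy_y = np.sin(toy_X[:, 0])

learning_rate = 0.1
# Start from a constant prediction (the mean of the targets).
prediction = np.full_like(toy_y, toy_y.mean())
mse_per_round = [np.mean((toy_y - prediction) ** 2)]

for _ in range(50):
    # For squared error, the negative gradient is the residual.
    residuals = toy_y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(toy_X, residuals)
    # Add a shrunken copy of the new tree's output to the running prediction.
    prediction = prediction + learning_rate * tree.predict(toy_X)
    mse_per_round.append(np.mean((toy_y - prediction) ** 2))

print(mse_per_round[0], mse_per_round[-1])
```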


In [46]:
from sklearn.ensemble import GradientBoostingClassifier

In [47]:
gradient_boosting_classifier = GradientBoostingClassifier(random_state=42)

In [49]:
gradient_boosting_classifier.fit(reshaped_X_train_last_hidden_states, y_train)

In [50]:
# Evaluate the model through various metrics: accuracy, precision, recall, f1-score

y_pred = gradient_boosting_classifier.predict(reshaped_X_test_last_hidden_states)

print(
    "Accuracy score:",
    accuracy_score(y_test, y_pred),
    "\nPrecision score:",
    precision_score(y_test, y_pred),
    "\nRecall score:",
    recall_score(y_test, y_pred),
    "\nf1 score:",
    f1_score(y_test, y_pred),
)

Accuracy score: 0.896 
Precision score: 0.896 
Recall score: 1.0 
f1 score: 0.9451476793248946


In [51]:
print(
    "Classification for Gradient Boosting\n",
    classification_report(y_test, y_pred, digits=4),
)

Classification for Gradient Boosting
               precision    recall  f1-score   support

           0     0.0000    0.0000    0.0000        13
           1     0.8960    1.0000    0.9451       112

    accuracy                         0.8960       125
   macro avg     0.4480    0.5000    0.4726       125
weighted avg     0.8028    0.8960    0.8469       125





In [55]:
joblib.dump(
    gradient_boosting_classifier,
    "../../../Models/requirement_relevancy_experiment/classifier_models/distilbert_gradient_boost_classifier.joblib",
)

['../../../Models/requirement_relevancy_experiment/classifier_models/distilbert_gradient_boost_classifier.joblib']

### Adaboost Classifier

**_How it works:_** AdaBoost is an ensemble learning method that sequentially trains weak learners on weighted datasets, adjusting weights for misclassified instances in each iteration. The final prediction is made by combining the weak learners' predictions, weighted by their accuracy.

**_Advantages:_** AdaBoost is adaptive, emphasizing misclassified instances, has few hyperparameters to tune, works with various base learners, and is effective for binary classification. Because it keeps increasing the weight of misclassified instances, it can be sensitive to noisy labels and outliers.
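
One round of the reweighting step can be worked through by hand. The sketch below follows the classic discrete AdaBoost update with labels in {-1, +1}; scikit-learn's `AdaBoostClassifier` uses a related variant, so this illustrates the idea rather than that implementation. The labels and stump predictions are made up.

```python
import numpy as np

true_labels = np.array([1, 1, -1, -1, 1])
stump_predictions = np.array([1, 1, -1, 1, 1])  # wrong on index 3 only

# Start with uniform sample weights.
sample_weights = np.full(len(true_labels), 1 / len(true_labels))

# Weighted error of the weak learner and its vote strength.
misclassified = stump_predictions != true_labels
weighted_error = sample_weights[misclassified].sum()
learner_weight = 0.5 * np.log((1 - weighted_error) / weighted_error)

# Increase weights of misclassified samples, decrease the rest, then renormalize.
updated_weights = sample_weights * np.exp(
    -learner_weight * true_labels * stump_predictions
)
updated_weights = updated_weights / updated_weights.sum()
print(learner_weight, updated_weights)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.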


In [56]:
from sklearn.ensemble import AdaBoostClassifier

In [57]:
adaboost_classifier = AdaBoostClassifier(random_state=42)

In [59]:
adaboost_classifier.fit(reshaped_X_train_last_hidden_states, y_train)

In [60]:
y_pred = adaboost_classifier.predict(reshaped_X_test_last_hidden_states)

print(
    "Classification result for AdaBoost\n",
    classification_report(y_test, y_pred, digits=4),
)

Classification result for AdaBoost
               precision    recall  f1-score   support

           0     0.0000    0.0000    0.0000        13
           1     0.8917    0.9554    0.9224       112

    accuracy                         0.8560       125
   macro avg     0.4458    0.4777    0.4612       125
weighted avg     0.7989    0.8560    0.8265       125



In [61]:
joblib.dump(
    adaboost_classifier,
    "../../../Models/requirement_relevancy_experiment/classifier_models/distilbert_adaboost_classifier.joblib",
)

['../../../Models/requirement_relevancy_experiment/classifier_models/distilbert_adaboost_classifier.joblib']

### XGBoost Classifier

**_How it works:_** XGBoost is a gradient boosting implementation that adds an explicit regularization term to the training objective. It minimizes the loss by adding trees sequentially, fitting each new tree using the first- and second-order gradients of the loss with respect to the current predictions.

**_Advantages:_** High accuracy, handles missing data, and provides feature importance.

More about XGBoost [here](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)
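
The effect of regularization can be seen in the closed-form leaf weight from the tutorial linked above, `w* = -G / (H + lambda)`, where `G` and `H` are the sums of first- and second-order gradients of the samples in a leaf. The sketch below evaluates it for a single leaf, assuming the loss `0.5 * (y - prediction)^2` so that each gradient is `prediction - y` and each Hessian is 1. The targets and the function name are hypothetical.

```python
import numpy as np


def optimal_leaf_weight(gradients, hessians, reg_lambda):
    """Closed-form leaf weight: -G / (H + lambda)."""
    return -gradients.sum() / (hessians.sum() + reg_lambda)


# Three samples falling into one leaf, with current predictions of zero.
targets = np.array([1.0, 1.2, 0.8])
current_prediction = np.zeros_like(targets)

# For 0.5 * (y - prediction)^2: gradient = prediction - y, Hessian = 1.
gradients = current_prediction - targets
hessians = np.ones_like(targets)

w_unregularized = optimal_leaf_weight(gradients, hessians, reg_lambda=0.0)
w_regularized = optimal_leaf_weight(gradients, hessians, reg_lambda=1.0)
print(w_unregularized, w_regularized)
```

Without regularization the leaf predicts the mean of its targets; a positive `lambda` shrinks the weight toward zero, which limits how much any single tree can change the prediction.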

In [62]:
from xgboost import XGBClassifier

In [63]:
xgboost_classifier = XGBClassifier(random_state=42)

In [64]:
xgboost_classifier.fit(reshaped_X_train_last_hidden_states, y_train)

In [65]:
# Evaluate the model through various metrics: accuracy, precision, recall, f1-score by printing the classification report

y_pred = xgboost_classifier.predict(reshaped_X_test_last_hidden_states)
print(
    "Classification for XG Boosting\n", classification_report(y_test, y_pred, digits=4)
)

Classification for XG Boosting
               precision    recall  f1-score   support

           0     0.0000    0.0000    0.0000        13
           1     0.8960    1.0000    0.9451       112

    accuracy                         0.8960       125
   macro avg     0.4480    0.5000    0.4726       125
weighted avg     0.8028    0.8960    0.8469       125





In [66]:
joblib.dump(
    xgboost_classifier,
    "../../../Models/requirement_relevancy_experiment/classifier_models/distilbert_xgboost_classifier.joblib",
)

['../../../Models/requirement_relevancy_experiment/classifier_models/distilbert_xgboost_classifier.joblib']