# RQ4: To what extent can Machine Learning/NLP models identify the sensemaking aspect of feedback?

## Part 3: Using DistilBERT To Identify Sensemaking

In Part 3 of RQ4, we fine-tune DistilBERT, a compact transformer-based deep learning model, to identify the sensemaking component in the feedback text. The uncased DistilBERT checkpoint is provided through the Hugging Face AI community and is customised here to serve the sensemaking classification needs of this study. Reference: [Text Classification](https://huggingface.co/docs/transformers/tasks/sequence_classification)

### 1. Loading the Initial Libraries and the Dataset

First, we need to load the initial set of libraries and the feedback data to be used in training the model.

In [None]:
# Installing the desired versions of the transformers and accelerate libraries
# Note: If using a Kaggle notebook, restart the kernel and clear the outputs after this step
! pip install -U git+https://github.com/huggingface/transformers.git
! pip install -U git+https://github.com/huggingface/accelerate.git

In [1]:
# Importing the initial libraries
import pandas as pd
import re
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

We can use the `Pandas` library to load the feedback data into a dataframe.

In [2]:
# Loading the data with the Pandas library
data = pd.read_csv('./LabelledFeedback/stage2.csv')

# Isolating the required columns
data = data[['SentenceScoreRem', 'Rubric']]

# Checking the loaded data
data.head()

| | SentenceScoreRem | Rubric |
|---|---|---|
| 0 | Yuejing more in depth analysis is required and... | Sensemaking 1&Impact 1 |
| 1 | Team 1 requested to re-do their workbook 3 to... | Impact 1 |
| 2 | The team submitted the workbook 23 days after ... | Sensemaking 2 |
| 3 | Risk assessment and report needs work as discu... | Sensemaking 1 |
| 4 | "Good effort, Please refer to detailed feedbac... | Agency 2&Agency 1 |


### 2. Cleaning the Text and Creating the Target Variable

Although the DistilBERT tokenizer performs its own tokenization, we can help it along with some preliminary cleaning using the `NLTK` library.

In [3]:
# Defining a function to clean the feedback text
def clean_text(text):

    # Converting the text to lowercase and replacing non-alphabetic characters with spaces
    text = re.sub(r'[^a-zA-Z]', ' ', text.lower())

    # Tokenizing the text
    tokens = nltk.word_tokenize(text)

    # Loading the stopwords from the NLTK corpus as a set for fast lookups
    stopwords = set(nltk.corpus.stopwords.words('english'))

    # Removing the stop words from the tokenized text
    filtered_tokens = [token for token in tokens if token not in stopwords]

    # Joining the tokens back into a single string
    cleaned = ' '.join(filtered_tokens)

    # Returning the cleaned text
    return cleaned

# Applying the text cleaning function to the data
data['text'] = data['SentenceScoreRem'].apply(clean_text)

# Split the preprocessed data into features and target variable
features = data['text']
target = data['Rubric']

# Checking the cleaned text
data[['text', 'Rubric']]

| | text | Rubric |
|---|---|---|
| 0 | yuejing depth analysis required see link key c... | Sensemaking 1&Impact 1 |
| 1 | team requested workbook better original mark | Impact 1 |
| 2 | team submitted workbook days submission date kv | Sensemaking 2 |
| 3 | risk assessment report needs work discussed tu... | Sensemaking 1 |
| 4 | good effort please refer detailed feedback fil... | Agency 2&Agency 1 |
| ... | ... | ... |
| 5754 | q need use english communicate partiularly par... | Impact 1 |
| 5755 | part b complicated needed explain rate change ... | Sensemaking 1 |
| 5756 | q english exposition required | Impact 2 |
| 5757 | made two errors finding determinant part b fin... | Sensemaking 1 |
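
To make the effect of the cleaning function concrete, the following is a minimal sketch that reproduces its three steps (lowercasing and stripping non-letters, tokenizing, removing stop words) on a single sentence using only the standard library. The stop-word set is a small hypothetical stand-in for NLTK's English list, and `str.split` stands in for `nltk.word_tokenize`, so it illustrates the idea rather than the exact notebook behaviour.

```python
import re

# Hypothetical mini stop-word set; NLTK's English list is much longer
STOPWORDS = {"to", "re", "do", "their", "the", "a", "is"}

def clean_text_sketch(text):
    # Lowercase the text, then replace every non-letter character with a space
    text = re.sub(r"[^a-z]", " ", text.lower())
    # Whitespace splitting stands in for nltk.word_tokenize
    tokens = text.split()
    # Keep only the tokens that are not stop words
    return " ".join(token for token in tokens if token not in STOPWORDS)

print(clean_text_sketch("Team 1 requested to re-do their workbook 3!"))
# team requested workbook
```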


As was the case in Part 1 of this research question, we will define a function to create a target variable called `label` that will contain the value 1 if the text contains the sensemaking component and 0 otherwise.

In [4]:
# Defining a function to create the target variable
def sensemaking(rub):

    # If statement to check whether the text contains the sensemaking component
    if 'Sensemaking' in rub:

        return 1

    else:

        return 0

# Applying the function to the data
data['label'] = data['Rubric'].apply(sensemaking)

# Checking the new column
data[['text', 'Rubric', 'label']]

| | text | Rubric | label |
|---|---|---|---|
| 0 | yuejing depth analysis required see link key c... | Sensemaking 1&Impact 1 | 1 |
| 1 | team requested workbook better original mark | Impact 1 | 0 |
| 2 | team submitted workbook days submission date kv | Sensemaking 2 | 1 |
| 3 | risk assessment report needs work discussed tu... | Sensemaking 1 | 1 |
| 4 | good effort please refer detailed feedback fil... | Agency 2&Agency 1 | 0 |
| ... | ... | ... | ... |
| 5754 | q need use english communicate partiularly par... | Impact 1 | 0 |
| 5755 | part b complicated needed explain rate change ... | Sensemaking 1 | 1 |
| 5756 | q english exposition required | Impact 2 | 0 |
| 5757 | made two errors finding determinant part b fin... | Sensemaking 1 | 1 |
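
A `Rubric` value can carry several components joined by `&`, such as `Sensemaking 1&Impact 1`. The substring check above handles this implicitly. The sketch below makes the parsing explicit by splitting on `&` first. The helper names `split_rubric` and `has_component` are hypothetical and are not part of the notebook.

```python
def split_rubric(rubric):
    # 'Sensemaking 1&Impact 1' -> ['Sensemaking 1', 'Impact 1']
    return [part.strip() for part in rubric.split("&")]

def has_component(rubric, component="Sensemaking"):
    # Return 1 if any part of the rubric starts with the component name, else 0
    return int(any(part.startswith(component) for part in split_rubric(rubric)))

examples = ["Sensemaking 1&Impact 1", "Impact 1", "Agency 2&Agency 1"]
print([has_component(rubric) for rubric in examples])
# [1, 0, 0]
```

The same helper could produce labels for the other components, for example `has_component(rubric, "Impact")`.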


### 3. Preparing the Data and Tokenizing it with DistilBERT

Hugging Face's uncased DistilBERT is designed to work with the Hugging Face dataset format, so we need to convert our `Pandas` dataframe into that format. We can also use the Hugging Face `evaluate` library to compute classification metrics such as accuracy and precision.

In [5]:
# Installing the datasets and evaluate libraries
! pip install datasets evaluate

Note: you may need to restart the kernel to use updated packages.


In [6]:
# Optional Step: To upload your model to the Hugging Face community
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
from huggingface_hub import notebook_login

# You will be asked to enter an access token generated on Hugging Face.
# A Hugging Face account is needed for this.
notebook_login()


Before transforming the data, let us split it into train and test sets using the `sklearn` library. As was the case in Part 1, we will use an 80-20 split.

In [7]:
# Defining the features
features = data['text']

# Defining the target variable
target = data['label']

# Splitting the data into train and test data sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
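
As a quick check on the 80-20 split, the sketch below works out the expected number of rows in each set. It assumes that scikit-learn rounds a fractional test size up to the next whole row and gives the remainder to the training set. The total of 5,759 rows is taken from the dataset summary printed later in this notebook.

```python
import math

n_samples = 5759   # total rows: 4607 train + 1152 test in the dataset summary
test_size = 0.2    # the fraction passed to train_test_split

# Assumption: the test set size is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
# 4607 1152
```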

Let us create two data sets called train and test that are a combination of the features and target variables we defined in the previous step.

Later, we will convert these two dataframes into Hugging Face datasets.

In [8]:
# Creating the combined train data set
train = pd.DataFrame({'text': X_train, 'label': y_train})

In [9]:
# Creating the combined test data set
test = pd.DataFrame({'text': X_test, 'label': y_test})

We need to import `AutoTokenizer` from the `transformers` library to load the DistilBERT tokenizer.

In [10]:
# Loading the Auto Tokenizer from the transformers library
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
from transformers import AutoTokenizer

# Loading a pre-trained DistilBERT tokenizer to preprocess the text field
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

DistilBERT has a maximum input length for text sequences (512 tokens for `distilbert-base-uncased`). Therefore, we must truncate sequences that exceed this length while tokenizing the text.

In [11]:
# Defining a function to tokenize the text and truncate sequences of text that are longer than the maximum input length
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
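
To illustrate what truncation does, here is a minimal sketch in plain Python rather than the tokenizer's actual implementation. It assumes a 512-token limit and reserves two positions for the `[CLS]` and `[SEP]` special tokens, whose ids are 101 and 102 in the BERT uncased vocabulary. The other token ids are made up for the example, and `truncate_sketch` is a hypothetical helper.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids in the BERT uncased vocabulary

def truncate_sketch(token_ids, max_length=512):
    # Two positions are reserved for the special tokens
    body = token_ids[: max_length - 2]
    # Wrap the (possibly shortened) sequence in [CLS] ... [SEP]
    return [CLS_ID] + body + [SEP_ID]

long_input = list(range(1000, 1600))  # 600 made-up token ids
short_input = [2204, 3947]            # made-up ids for a short sentence

print(len(truncate_sketch(long_input)))  # 512
print(truncate_sketch(short_input))      # [101, 2204, 3947, 102]
```

Short sequences pass through unchanged apart from the special tokens, so only unusually long feedback comments lose text.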

Our tokenizing function is ready. Now, we can use the `datasets` library to pack the train and test data sets into a DistilBERT-friendly dataset format.

In [12]:
# Importing the dataset library
import datasets
from datasets import Dataset, DatasetDict

# Converting the train and test datasets to a dataset format
train = Dataset.from_pandas(train)
test = Dataset.from_pandas(test)

# Initialising a dataset dictionary
ds = DatasetDict()

# Adding the train and test datasets to the dataset dictionary
ds['train'] = train
ds['test'] = test

# Checking the newly created dataset dictionary
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 4607
    })
    test: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 1152
    })
})

We will use the `map` function to tokenize the text. To process multiple examples of the dataset at once, we can set the `batched` flag to `True`.

In [13]:
# Tokenizing the text
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
tokenized_ds = ds.map(preprocess_function, batched=True)


In the next step, we will create a data collator that builds batches of examples and dynamically pads the shorter sequences in each batch to the length of the longest one.

In [14]:
# Dynamically padding sentences to the longest sequence in each batch
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
# Reference: https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/data_collator#transformers.DataCollatorWithPadding
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
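
The idea behind dynamic padding can be sketched in plain Python as follows. This is an illustration of the concept, not the collator's actual code: `pad_batch` is a hypothetical helper, the token ids are made up, and the padding id of 0 matches the `[PAD]` token in the BERT uncased vocabulary.

```python
def pad_batch(sequences, pad_id=0):
    # The target length is the longest sequence in this batch only
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 2204, 102], [101, 2204, 3947, 2147, 102]])
print(batch["input_ids"])
# [[101, 2204, 102, 0, 0], [101, 2204, 3947, 2147, 102]]
print(batch["attention_mask"])
# [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to a global maximum keeps batches of short comments small, which reduces wasted computation.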

### 4. Setting up the Metrics and Model Optimizer

As was the case with the machine learning models in Part 1, the DistilBERT model is also a classification model. Therefore, its performance can be measured with the same four metrics:

- Accuracy
- Precision
- Recall
- F1-Score

We can load these metrics from the `evaluate` library.

In [15]:
# Importing the evaluate library
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
import evaluate

# Loading the classification model performance metrics from the evaluate library
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")


After loading the four metrics, we can define functions that convert the model's raw predictions into class labels and compute each metric.

In [16]:
# Importing the numpy library
import numpy as np

# Defining functions to apply the metrics to the training process
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
# Defining a function to compute accuracy
def compute_accuracy(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

# Defining a function to compute precision
def compute_precision(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return precision.compute(predictions=predictions, references=labels)

# Defining a function to compute recall
def compute_recall(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return recall.compute(predictions=predictions, references=labels)

# Defining a function to compute f1-score
def compute_f1(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1.compute(predictions=predictions, references=labels)
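
To show what these four functions compute, the following sketch derives the same metrics by hand for a binary problem, using plain Python instead of the `evaluate` library. The logits and labels are made-up values, and `argmax_rows` and `binary_metrics` are hypothetical helpers written for this illustration.

```python
def argmax_rows(logits):
    # Pick the index of the largest score in each row (the predicted class)
    return [row.index(max(row)) for row in logits]

def binary_metrics(logits, labels):
    preds = argmax_rows(logits)
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)  # true positives
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)  # false positives
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)  # false negatives
    correct = sum(1 for p, y in pairs if p == y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "precision": precision,
            "recall": recall, "f1": f1}

# Made-up logits for five examples: column 0 = No Sensemaking, column 1 = Sensemaking
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.5], [1.2, 0.3], [0.2, 0.8]]
labels = [0, 1, 0, 1, 1]

print({name: round(value, 2) for name, value in binary_metrics(logits, labels).items()})
# {'accuracy': 0.6, 'precision': 0.67, 'recall': 0.67, 'f1': 0.67}
```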

As an additional step, we can map the numeric labels to meaningful names so that the model's results are easier to read. We will pass these label mappings to the sequence classification model after building the optimizer.

In [17]:
# Designating the labels to meaningful categories
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
id2label = {0: "No Sensemaking", 1: "Sensemaking"}
label2id = {"No Sensemaking": 0, "Sensemaking": 1}

This notebook uses the TensorFlow implementation of DistilBERT. To fine-tune it, we need an optimizer with a suitable learning rate schedule. We can use the `create_optimizer` function in the `transformers` library to set these up.

In [18]:
# Importing the optimizer from the transformers library
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
from transformers import create_optimizer
import tensorflow as tf

# Setting the batch size
batch_size = 16

# Setting the number of epochs
num_epochs = 5

# Setting the number of batches per epoch
batches_per_epoch = len(tokenized_ds["train"]) // batch_size

# Setting the total steps in training
total_train_steps = int(batches_per_epoch * num_epochs)

# Combining the optimizations together
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
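
The arithmetic behind these settings can be sketched as follows. The example assumes that, with no warmup, the schedule lowers the learning rate in a straight line from `init_lr` to zero over `total_train_steps`, which is how the default settings of `create_optimizer` are understood here. `linear_decay_lr` is a hypothetical helper, and the 4,607 training rows come from the dataset summary above.

```python
init_lr = 2e-5
batch_size = 16
num_epochs = 5
train_rows = 4607  # size of the training split

batches_per_epoch = train_rows // batch_size        # 287 full batches
total_train_steps = batches_per_epoch * num_epochs  # 1435 optimizer steps

def linear_decay_lr(step):
    # Assumed schedule: a straight line from init_lr down to 0
    remaining = max(0.0, 1.0 - step / total_train_steps)
    return init_lr * remaining

# Learning rate at the start, after three epochs, and after five epochs
for epoch in (0, 3, 5):
    print(epoch, linear_decay_lr(epoch * batches_per_epoch))
```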

Next, we will load the pre-trained DistilBERT model with `TFAutoModelForSequenceClassification`, which adds a classification head on top of the pre-trained weights. It is here that we specify the number of labels and the mappings for `Sensemaking` and `No Sensemaking`.

In [19]:
# Loading the Auto Model from the transformers library
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
from transformers import TFAutoModelForSequenceClassification

# Instantiating the auto model for sequence classification
model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

### 5. Preparing the Data and Callbacks

We are ready to make the final preparations for the model training process. Recall that we created a dataset dictionary for the train and test data and tokenized them. We now need to convert them into a TensorFlow train set and validation set respectively.

In [20]:
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
# Creating the tensorflow train set
tf_train_set = model.prepare_tf_dataset(
    tokenized_ds["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

# Creating the tensorflow validation set
tf_validation_set = model.prepare_tf_dataset(
    tokenized_ds["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


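We can now compile our model with the optimizer. No loss function is specified because Transformers models fall back to a default loss suited to the task.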

In [21]:
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
import tensorflow as tf

# Compiling the model with the optimizer
model.compile(optimizer=optimizer)

We also need to prepare a set of callbacks so that the model reports each metric at the end of every epoch. We will use the `KerasMetricCallback` class to wrap each of the metric computations.

In [22]:
# Loading the Keras Metric Callback function from the transformers library
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
from transformers.keras_callbacks import KerasMetricCallback

# Initialising the callback for Accuracy
metric_callback_acc = KerasMetricCallback(metric_fn=compute_accuracy, eval_dataset=tf_validation_set)

# Initialising the callback for Precision
metric_callback_pre = KerasMetricCallback(metric_fn=compute_precision, eval_dataset=tf_validation_set)

# Initialising the callback for Recall
metric_callback_re = KerasMetricCallback(metric_fn=compute_recall, eval_dataset=tf_validation_set)

# Initialising the callback for F1-Score
metric_callback_f1 = KerasMetricCallback(metric_fn=compute_f1, eval_dataset=tf_validation_set)

Additionally, the `transformers` library offers a `PushToHubCallback` to save our customised model and upload it to the Hugging Face Hub, which supports reproducibility. We need to specify an output directory to save the model.

In [23]:
# Loading the PushToHubCallback class from the transformers library
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
from transformers.keras_callbacks import PushToHubCallback

# Initialising the callback to push the model to an output directory
push_to_hub_callback = PushToHubCallback(
    output_dir="SensemakingDetectionModel",
    tokenizer=tokenizer,
)

Cloning https://huggingface.co/thefishtalepundit/SensemakingDetectionModel into local empty directory.


The final step before running the model is to save all our callbacks in a list to allow us to easily call them while fitting the model.

In [24]:
# Compiling the callbacks
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
callbacks = [metric_callback_acc, metric_callback_pre, metric_callback_re, metric_callback_f1, push_to_hub_callback]

### 6. Training the Model

We are finally ready to run our optimized uncased DistilBERT model with TensorFlow. We will train for 3 epochs and observe the results. If the results are not satisfactory, we can always increase the number of epochs in subsequent runs. Note that the optimizer's learning rate schedule above was sized for `num_epochs = 5`, so a 3-epoch run ends before the schedule completes.

In [25]:
# Fitting the model to the data
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fc038976d10>

The model appears to have produced good results, achieving an accuracy of 0.89 and a recall of 0.89. Because these scores are measured on the held-out test data, they suggest that the model is not simply overfitting to the training set. Let us try the model on a new piece of text.

In [26]:
# Creating a new feedback text sample
text = "Your first three answers were correct. There were a lot of grammatical mistakes throughout your interview section. Please look up English Connect to improve your language skills. Your clarification for Mendel's theory was correct but it missed a few key details."

We can use `pipeline` from the `transformers` library to load our saved model. The `sentiment-analysis` task is the pipeline's alias for text classification, so it can be used here to detect sensemaking instead.

In [27]:
# Importing pipeline from the transformers library
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
from transformers import pipeline

# Loading the pipeline model
classifier = pipeline("sentiment-analysis", model="SensemakingDetectionModel")

# Running the model on the text sample
classifier(text)

Some layers from the model checkpoint at /kaggle/working/SensemakingDetectionModel were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at /kaggle/working/SensemakingDetectionModel and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'label': 'Sensemaking', 'score': 0.9869995713233948}]

The model identified the presence of the sensemaking aspect in the text with a confidence score of about 0.99. Let us check whether it can also identify the absence of the sensemaking component with a new text sample.

In [28]:
# Creating a new text sample that does not contain the sensemaking element
text = "Well done! Just one change is required in the explaining of your teammates contribution"

In [30]:
# Loading the pipeline function in the transformers library
# Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification
from transformers import pipeline

# Loading the pipeline model
classifier = pipeline("sentiment-analysis", model="SensemakingDetectionModel")

# Applying the model to the new text sample
classifier(text)

Some layers from the model checkpoint at /kaggle/working/SensemakingDetectionModel were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at /kaggle/working/SensemakingDetectionModel and are newly initialized: ['dropout_59']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'label': 'No Sensemaking', 'score': 0.9835424423217773}]

We can see that the model identified the absence of the sensemaking component with a confidence score of about 0.98.
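
The `score` returned by the classifier can be read as the model's probability for the predicted label. The sketch below shows how such a score could be derived from two raw model outputs (logits), assuming the pipeline applies a softmax over them, which is its usual behaviour for single-label classification. The logits are hypothetical values, and `softmax` and `to_prediction` are helpers written for this illustration.

```python
import math

id2label = {0: "No Sensemaking", 1: "Sensemaking"}

def softmax(logits):
    # Subtract the maximum first for numerical stability
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits):
    # Convert logits to probabilities and report the most likely label
    probs = softmax(logits)
    best = probs.index(max(probs))
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for one feedback comment
print(to_prediction([-2.1, 2.2]))
```

A score close to 1 therefore reflects how strongly the model favours one label over the other. It is not a guarantee that the prediction is correct.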