# Classifying text into Requirement and Not-requirement using DistilBERT

Training and Fine-Tuning 


<br><br>

## **Import libraries and modules**

In [1]:
# To install the Hugging Face transformers library in Google Colab, run:
#!pip3 install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.4 tokenizers-0.13.3 transformers-4.28.1


In [2]:
from collections import defaultdict
import gdown
import gzip
import json
import random
import pickle

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import ticker
sns.set(style='ticks', font_scale=1.2)

<br><br>

## **Set parameters and file paths**

In [3]:
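model_name = 'distilbert-base-cased'   # pre-trained checkpoint to fine-tune
device_name = 'cuda'                   # requires a GPU runtime
max_length = 512                       # maximum sequence length in tokens
cached_model_directory_name = 'distilbert-ambiguity-classes'  # where the fine-tuned model is saved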

<br><br>

## **Load and sample Dronology data**

In [4]:
url = 'http://sarec.nd.edu/dronology/datasets/01/dronologydataset01.json'
gdown.download(url, 'dronology_dataset.json', quiet=False)

Downloading...
From: http://sarec.nd.edu/dronology/datasets/01/dronologydataset01.json
To: /content/dronology_dataset.json
100%|██████████| 374k/374k [00:00<00:00, 708kB/s]


'dronology_dataset.json'

In [5]:
with open('dronology_dataset.json') as file:
    data = json.load(file)

In [6]:
print(type(data))
print(data["entries"][0])
print(data["entries"][0]['attributes']['issuetype'])
print(data["entries"][0]['attributes']['description'])


<class 'dict'>
{'issueid': 'DD-768', 'attributes': {'issuetype': 'Design Definition', 'status': 'Closed', 'summary': 'UAV Configuration Command Types', 'description': 'Each movement command shall specify one of the following command types: {{SET_MONITORING_FREQUENCY  SET_STATE_FREQUENCY}}'}, 'children': {'refinedby': ['DD-727', 'DD-728']}, 'code': [{'status': '#manual-tagged', 'filename': 'edu.nd.dronology.core/src/edu/nd/dronology/core/vehicle/commands/CommandIds.java', 'timestamp': '2018-05-08T23:05:56Z'}, {'status': '#manual-tagged', 'filename': 'edu.nd.dronology.core/src/edu/nd/dronology/core/vehicle/commands/SetMonitoringFrequencyCommand.java', 'timestamp': '2018-05-08T23:05:56Z'}, {'status': '#manual-tagged', 'filename': 'edu.nd.dronology.core/src/edu/nd/dronology/core/vehicle/commands/SetStateFrequencyCommand.java', 'timestamp': '2018-05-08T23:05:56Z'}]}
Design Definition
Each movement command shall specify one of the following command types: {{SET_MONITORING_FREQUENCY  SET_STAT

In [7]:
texts = []
labels = []

for entry in data["entries"]:
  texts.append(entry['attributes']['description'])
  # Collapse every issue type other than 'Requirement' into a single 'Else' class.
  label = entry['attributes']['issuetype']
  labels.append(label if label == 'Requirement' else 'Else')

['Each movement command shall specify one of the following command types: {{SET_MONITORING_FREQUENCY  SET_STATE_FREQUENCY}}', 'The _RouteCreationUI_ shall provide capabilities to modify existing routes.', 'The _MapComponent_ shall support different types of map layers (e.g.  terrain  satellite)', '"Provides map functionality for displaying routes and UAVs  leveraging off-the-shelf map providers (such as OSM and google maps).   "', 'The _UIMiddleware_ accepts resend waypoint commands associated with a unique {{UAV_ID}} and forwards to Dronology', 'UAV State messages shall be formatted as JSON objects and contain the {{UAV_ID}} and the {{MODE}}', 'If a user attempts to create a route without providing a name then the system shall not save the route.', 'The display of active flight plans shall include time in flight  traveled distance  remaining distance  completed and total waypoints.', 'The active flight plan shall be displayed at the top of the list of pending flight plans.', 'The _Mis

In [None]:
print(texts[:10])
print(labels[:10])
print(len(texts))
print(len(labels))
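
Because every issue type other than `Requirement` is collapsed into `Else`, the two classes can end up heavily imbalanced, which matters when reading the accuracy later on. A minimal sketch of checking the balance with `collections.Counter`; the `example_labels` list is a made-up stand-in for `labels`, not the real data.

```python
from collections import Counter

# Hypothetical stand-in for the `labels` list built above.
example_labels = ["Else", "Requirement", "Else", "Else", "Else", "Requirement"]

label_counts = Counter(example_labels)
total = sum(label_counts.values())

# Print each class with its count and its share of the data.
for label, count in label_counts.most_common():
    print(f"{label}: {count} ({count / total:.1%})")
```

Running the same loop on the real `labels` list would show how rare the `Requirement` class is in this dataset.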

<br><br>

## **Split the data into training and test sets**

In [9]:
split_index = int(len(texts)/5*4)

train_texts = texts[:split_index]
train_labels = labels[:split_index]

test_texts = texts[split_index:]
test_labels = labels[split_index:]

In [None]:
len(train_texts), len(train_labels), len(test_texts), len(test_labels)

(318, 318, 80, 80)
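
The split above takes the first 80% of the entries in file order, so the class mix of the two sets depends on how the file happens to be sorted. One possible alternative, sketched here in plain Python, is a shuffled split that keeps the class proportions the same in both sets. The function name `stratified_split`, the seed, and the toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, train_fraction=0.8, seed=42):
    """Shuffle within each class, then split each class by train_fraction."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        cut = int(len(items) * train_fraction)
        train += [(text, label) for text in items[:cut]]
        test += [(text, label) for text in items[cut:]]

    # Shuffle again so that the classes are interleaved.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy data (hypothetical): 10 "Requirement" items and 40 "Else" items.
toy_texts = [f"text {i}" for i in range(50)]
toy_labels = ["Requirement"] * 10 + ["Else"] * 40

train_pairs, test_pairs = stratified_split(toy_texts, toy_labels)
train_texts_s, train_labels_s = zip(*train_pairs)
test_texts_s, test_labels_s = zip(*test_pairs)

print(len(train_texts_s), len(test_texts_s))
print(test_labels_s.count("Requirement"), test_labels_s.count("Else"))
```

With this approach both sets contain the minority class in the same proportion, which makes the per-class scores on the test set easier to interpret.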

## **Encode data for BERT**


| BERT special token | Explanation |
| ------------------ | ----------- |
| [CLS] | Start token of every input sequence |
| [SEP] | Separator marking the end of each sentence or segment |
| [PAD] | Padding appended so that all sequences in a batch have the same length (at most 512 tokens) |
| &#35;&#35; | Prefix of a "word piece" that continues the previous token |
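
The roles of these tokens can be shown with a toy example. This is only an illustration of the layout of an encoded sequence, not the tokenizer's actual implementation; the helper names `add_special_tokens` and `merge_word_pieces` are made up for this sketch.

```python
SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[PAD]")

def add_special_tokens(word_pieces, padded_length):
    """Wrap word pieces in [CLS]/[SEP] and pad with [PAD] up to padded_length."""
    kept = word_pieces[:padded_length - 2]  # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + kept + ["[SEP]"]
    return tokens + ["[PAD]"] * (padded_length - len(tokens))

def merge_word_pieces(tokens):
    """Glue '##' continuation pieces back onto the preceding piece."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        elif token not in SPECIAL_TOKENS:
            words.append(token)
    return words

# "SET" split into the pieces "SE" and "##T", as in the encoded sequences in this notebook.
pieces = ["shall", "specify", "SE", "##T"]
encoded = add_special_tokens(pieces, padded_length=8)
print(encoded)
print(merge_word_pieces(encoded))
```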




In [10]:
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name) # The model_name needs to match our pre-trained model.


In [11]:
unique_labels = set(train_labels)  # note: the iteration order of a set is not guaranteed to be the same across runs
label2id = {label: idx for idx, label in enumerate(unique_labels)}
id2label = {idx: label for label, idx in label2id.items()}

In [12]:
label2id.keys()

dict_keys(['Requirement', 'Else'])

In [13]:
id2label.keys()

dict_keys([0, 1])
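
Because the mapping is built from a `set`, which id each label receives can differ between Python sessions (string hashing is randomized by default). That becomes a problem if a saved model is later reloaded with a freshly built mapping. A minimal sketch of a deterministic variant, using a made-up `example_train_labels` list in place of `train_labels`:

```python
# Hypothetical stand-in for `train_labels`.
example_train_labels = ["Else", "Requirement", "Else", "Else"]

# Sorting the unique labels fixes the id assignment across runs.
sorted_labels = sorted(set(example_train_labels))
label2id_sorted = {label: idx for idx, label in enumerate(sorted_labels)}
id2label_sorted = {idx: label for label, idx in label2id_sorted.items()}

print(label2id_sorted)
print(id2label_sorted)
```

With sorted labels, `Else` always maps to 0 and `Requirement` to 1, regardless of the session.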

**Encode texts and labels**

In [14]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
test_encodings  = tokenizer(test_texts, truncation=True, padding=True, max_length=max_length)

train_labels_encoded = [label2id[y] for y in train_labels]
test_labels_encoded  = [label2id[y] for y in test_labels]




**Examine a sequence in the training set after encoding**

In [15]:
' '.join(train_encodings[0].tokens[0:100])

'[CLS] Each movement command shall specify one of the following command types : { { SE ##T _ M ##ON ##IT ##OR ##ING _ F ##RE ##Q ##UE ##NC ##Y SE ##T _ ST ##AT ##E _ F ##RE ##Q ##UE ##NC ##Y } } [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'

**Examine a sequence in the test set after encoding**

In [16]:
' '.join(test_encodings[0].tokens[0:100])

'[CLS] When a U ##AV in the T ##A ##K ##ING _ OF ##F state achieve ##s the target altitude it transitions to FL ##Y ##ING state . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'

**Examine the training labels after encoding**

In [17]:
set(train_labels_encoded)

{0, 1}

**Examine the test labels after encoding**

In [18]:
set(test_labels_encoded)

{0, 1}

<br><br>

## **Make a custom Torch dataset**

In [19]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [20]:
train_dataset = MyDataset(train_encodings, train_labels_encoded)
test_dataset = MyDataset(test_encodings, test_labels_encoded)

**Examine a sequence in the Torch `train_dataset` after encoding**

In [21]:
' '.join(train_dataset.encodings[0].tokens[0:100])

'[CLS] Each movement command shall specify one of the following command types : { { SE ##T _ M ##ON ##IT ##OR ##ING _ F ##RE ##Q ##UE ##NC ##Y SE ##T _ ST ##AT ##E _ F ##RE ##Q ##UE ##NC ##Y } } [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'

**Examine a sequence in the Torch `test_dataset` after encoding**

In [None]:
' '.join(test_dataset.encodings[1].tokens[0:100])


<br><br>

## **Load pre-trained BERT model**

Here we load a pre-trained DistilBERT model and move it to the GPU (CUDA).

**Note:** If you repeat fine-tuning after the following cells have already run, re-run this cell first to reload the original pre-trained model before fine-tuning again.

In [22]:
# The model_name needs to match the name used for the tokenizer above.
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=len(id2label)).to(device_name)


Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifi

<br><br>

## **Set the BERT fine-tuning parameters**


| Parameter | Explanation |
|-----------| ------------|
| num_train_epochs | total number of training epochs (passes through the entire training set; too many can cause overfitting) |
| per_device_train_batch_size | batch size per device during training |
| per_device_eval_batch_size | batch size per device during evaluation |
| learning_rate | initial learning rate for the AdamW optimizer |
| warmup_steps | number of warmup steps for the learning rate scheduler (set lower because of small dataset size) |
| weight_decay | strength of weight decay (shrinks the weights, a form of regularization) |
| output_dir | output directory for the fine-tuned model and configuration files |
| logging_dir | directory for storing logs |
| logging_steps | number of steps between logging outputs (so that we can stop training early if the loss isn't going down) |
| evaluation_strategy | when to evaluate during training (`'steps'` evaluates at regular step intervals, so that we can see the accuracy going up) |
In [23]:
training_args = TrainingArguments(
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    learning_rate=5e-5,              # initial learning rate for the AdamW optimizer
    warmup_steps=100,                # number of warmup steps for learning rate scheduler (set lower because of small dataset size)
    weight_decay=0.01,               # strength of weight decay
    output_dir='./results',          # output directory
    logging_dir='./logs',            # directory for storing logs
    logging_steps=100,               # log every 100 steps (set lower because of small dataset size)
    evaluation_strategy='steps',     # evaluate during fine-tuning so that we can see progress
)
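
It is worth checking these step-based settings against the number of optimizer steps the run actually takes. The sketch below does the arithmetic in plain Python. It assumes a single device and no gradient accumulation, and it uses roughly 318 training examples, a figure inferred from `train_samples_per_second` × `train_runtime` / 3 in the `TrainOutput` of this notebook. The linear warmup formula is a simplification of the default scheduler, not a copy of the library code.

```python
import math

# Assumed values: the training set size is inferred from the run's TrainOutput,
# the rest are copied from the TrainingArguments above.
num_train_examples = 318
batch_size = 16
num_epochs = 3
warmup_steps = 100
logging_steps = 100
base_lr = 5e-5

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print("total optimizer steps:", total_steps)

# Both settings only take full effect if they are smaller than total_steps.
print("warmup finishes during training:", warmup_steps < total_steps)
print("logging/evaluation is triggered:", logging_steps <= total_steps)

# Under a linear warmup, the learning rate at the last step is only a fraction of base_lr.
final_lr = base_lr * min(1.0, total_steps / warmup_steps)
print("learning rate reached at the last step:", final_lr)
```

The total of 60 steps matches the `global_step=60` reported by `trainer.train()`. If this reading is right, the warmup never completes and no intermediate loss or accuracy is logged, which would explain the empty `Step,Training Loss,Validation Loss` table. Smaller values such as `warmup_steps=10` and `logging_steps=10` would fit inside the run; those numbers are only a suggestion.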

<br><br>

## **Fine-tune the BERT model**

In [24]:
def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }

In [25]:
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset,           # evaluation dataset (usually a validation set; here we just send our test set)
    compute_metrics=compute_metrics      # our custom evaluation function 
)

In [26]:
trainer.train()



Step,Training Loss,Validation Loss


TrainOutput(global_step=60, training_loss=0.5589102427164714, metrics={'train_runtime': 8.7204, 'train_samples_per_second': 109.398, 'train_steps_per_second': 6.88, 'total_flos': 24188753974896.0, 'train_loss': 0.5589102427164714, 'epoch': 3.0})

<br><br>

## **Save fine-tuned model**


In [27]:
trainer.save_model(cached_model_directory_name)

(Optional) If you've already fine-tuned and saved the model, you can reload it using the following line. You don't have to run fine-tuning every time you want to evaluate.

In [None]:
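# model = DistilBertForSequenceClassification.from_pretrained(cached_model_directory_name).to(device_name)
# trainer = Trainer(model=model, args=training_args, eval_dataset=test_dataset, compute_metrics=compute_metrics)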

<br><br>

## **Evaluate fine-tuned model**

In [28]:
trainer.evaluate()

{'eval_loss': 0.18552416563034058,
 'eval_accuracy': 0.9625,
 'eval_runtime': 1.2326,
 'eval_samples_per_second': 64.904,
 'eval_steps_per_second': 3.245,
 'epoch': 3.0}

In [29]:
predicted_results = trainer.predict(test_dataset)

In [30]:
predicted_results.predictions.shape

(80, 2)

In [31]:
predicted_labels = predicted_results.predictions.argmax(-1) # Index of the highest-scoring class (the predictions are logits)
predicted_labels = predicted_labels.flatten().tolist()      # Convert the NumPy array into a plain Python list
predicted_labels = [id2label[label_id] for label_id in predicted_labels]  # Convert from integers back to strings for readability
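
The values in `predicted_results.predictions` are raw scores (logits), one per class. Taking the argmax of the logits gives the same label as taking the argmax of the probabilities, but a softmax is needed if the probabilities themselves are of interest. A minimal pure-Python sketch; the logits and the id-to-label order below are made up for illustration.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two test items, and a hypothetical id-to-label order.
example_logits = [[2.1, -1.3], [-0.4, 0.9]]
example_id2label = {0: "Else", 1: "Requirement"}

for row in example_logits:
    probs = softmax(row)
    best = row.index(max(row))  # same index as the argmax of probs
    print(example_id2label[best], [round(p, 3) for p in probs])
```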

In [32]:
len(predicted_labels)

80

In [33]:
print(classification_report(test_labels, 
                            predicted_labels))

              precision    recall  f1-score   support

        Else       0.96      1.00      0.98        76
 Requirement       1.00      0.25      0.40         4

    accuracy                           0.96        80
   macro avg       0.98      0.62      0.69        80
weighted avg       0.96      0.96      0.95        80
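
The numbers in this report can be reproduced by hand from the confusion counts it implies, which also shows why the 96% accuracy is less impressive than it looks. The counts below are derived from the support and recall columns above (76 `Else` items all predicted correctly, 1 of 4 `Requirement` items found); the variable names are chosen for this sketch.

```python
# Confusion counts implied by the classification report,
# with "Requirement" treated as the positive class.
true_positive = 1    # Requirement predicted as Requirement
false_negative = 3   # Requirement predicted as Else
false_positive = 0   # Else predicted as Requirement
true_negative = 76   # Else predicted as Else

total = true_positive + false_negative + false_positive + true_negative
accuracy = (true_positive + true_negative) / total
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.9625 precision=1.00 recall=0.25 f1=0.40

# A baseline that always answers "Else" already reaches:
baseline_accuracy = (false_positive + true_negative) / total
print(f"majority baseline accuracy={baseline_accuracy:.2f}")
# prints: majority baseline accuracy=0.95
```

The model therefore beats the always-`Else` baseline by a single test item, so the recall and F1 of the `Requirement` class are the more informative numbers here.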



<br><br>

## **Pull out correct and incorrect classifications for examination**

Some example predictions that were correct.

In [34]:
for _true_label, _predicted_label, _text in random.sample(list(zip(test_labels, predicted_labels, test_texts)), 20):
  if _true_label == _predicted_label:
    print('LABEL:', _true_label)
    print('REVIEW TEXT:', _text[:100], '...')
    print()

LABEL: Else
REVIEW TEXT: A client shall register with the _UIMiddleware_ to receive flight route event notifications whenever ...

LABEL: Requirement
REVIEW TEXT: The _SingleUAVFlightPlanScheduler_ shall only execute one flight plan at a time for each UAV. ...

LABEL: Else
REVIEW TEXT:  ...

LABEL: Else
REVIEW TEXT:  ...

LABEL: Else
REVIEW TEXT: In the map  the past flight path that the UAV has already covered (behind the UAV) is shown in dotte ...

LABEL: Else
REVIEW TEXT: Implement the edit and delete functions for the buttons in the route list (see attached). They shoul ...

LABEL: Else
REVIEW TEXT: A client shall register with the _UIMiddleware_ to receive UAV type specification events whenever a  ...

LABEL: Else
REVIEW TEXT: "NvecInterpolator does position reckoning in WGS-84 and a replacement simulator that uses it needs t ...

LABEL: Else
REVIEW TEXT: A client shall register with the _UIMiddleware_ to receive notifications whenever a new flight plan  ...

LABEL: Else
REVIEW TE




Some misclassifications.

In [35]:
for _true_label, _predicted_label, _text in random.sample(list(zip(test_labels, predicted_labels, test_texts)), 20):
  if _true_label != _predicted_label:
    print('TRUE LABEL:', _true_label)
    print('PREDICTED LABEL:', _predicted_label)
    print('REVIEW TEXT:', _text[:100], '...')
    print()

TRUE LABEL: Requirement
PREDICTED LABEL: Else
REVIEW TEXT: The _VehicleCore_ shall assign a unique ID to each activated UAV. ...
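
Sampling 20 items and then filtering can show only some of the errors, or none at all. With a test set this small, it may be simpler to list every misclassification. A minimal sketch with made-up stand-ins for `test_labels`, `predicted_labels` and `test_texts`:

```python
# Hypothetical stand-ins for test_labels, predicted_labels and test_texts.
example_true = ["Else", "Requirement", "Requirement", "Else"]
example_pred = ["Else", "Else", "Requirement", "Else"]
example_texts = ["text a", "text b", "text c", "text d"]

# Keep every item whose predicted label differs from its true label.
misclassified = [
    (true_label, predicted_label, text)
    for true_label, predicted_label, text in zip(example_true, example_pred, example_texts)
    if true_label != predicted_label
]

for true_label, predicted_label, text in misclassified:
    print('TRUE LABEL:', true_label)
    print('PREDICTED LABEL:', predicted_label)
    print('TEXT:', text[:100], '...')
    print()
```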

