<a href="https://colab.research.google.com/github/HaaLeo/vague-requirements-scripts/blob/master/colab-notebooks/BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classify requirements as vague or not using [ktrain](https://github.com/amaiya/ktrain) and tensorflow


## Install dependencies
*ktrain* requires TensorFlow 2.1. See [amaiya/ktrain#151](https://github.com/amaiya/ktrain/issues/151).

In [3]:
!pip3 install -q tensorflow_gpu==2.1.0 ktrain==0.17.3

In [4]:
import tensorflow as tf
import ktrain
assert tf.__version__ == '2.1.0'
assert ktrain.__version__ == '0.17.3'

Install the helper lib to process Amazon Mechanical result CSV files

In [5]:
!pip3 install -q -U git+https://github.com/HaaLeo/vague-requirements-scripts

  Building wheel for vaguerequirementslib (setup.py) ... [?25l[?25hdone


To get insights in the model's reasoning install the eli5 lib.
We will need a forked version of the **eli5** library that supportes TensorFlow Keras, so let's install it first.

In [None]:
!pip3 install -q git+https://github.com/amaiya/eli5@tfkeras_0_10_1

Enable logging

In [6]:
import logging
import sys

logging.basicConfig(
    format='%(asctime)s [%(name)-20.20s] [%(levelname)-5.5s]  %(message)s',
    stream=sys.stdout,
    level=logging.DEBUG)

## Set Parameters
Set the parameters for this run

In [19]:
random_state = 1 # for seeding
indices_to_read = [0,2,3,4] # indicate which MTurk files shall be read.

## Load Dataset

### Mount Google Drive
Mount the google drive to access the dataset

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


### Load Dataset Into Arrays

In [9]:
from vaguerequirementslib import read_csv_files, build_confusion_matrix, calc_majority_label
import pandas as pd

def read_drive_data(files_list: list, separator: str) -> pd.DataFrame:
    """
    Calculate the majority label for the given source file list

    Args:
        files_list (list): The CSV files to calculate the majority label for
        separator (str): The CSV separator
        drop_ties (bool): If there is a tie in votes (e.g.: One votes for vague one for not vague) then drop this entry from the confusion matrix.

    Returns:
        pd.DataFrame: The dataframe containing the majority label.
    """
    df = read_csv_files(files_list, separator)
    confusion_matrix = build_confusion_matrix(df, drop_ties=True)
    return calc_majority_label(confusion_matrix)

# Test it

df = read_drive_data(
    [f'/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-{i}-mturk.csv' for i in indices_to_read],
    ','
  )
df.head()


2020-07-04 07:39:37,825 [vaguerequirementslib] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-0-mturk.csv" with 200 rows.
2020-07-04 07:39:38,736 [vaguerequirementslib] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-2-mturk.csv" with 194 rows.
2020-07-04 07:39:40,597 [vaguerequirementslib] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-3-mturk.csv" with 198 rows.
2020-07-04 07:39:41,486 [vaguerequirementslib] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-4-mturk.csv" with 196 rows.
2020-07-04 07:39:41,499 [vaguerequirementslib] [INFO ]  Build confusion matrix.
2020-07-04 07:39:41,592 [vaguerequirementslib] [INFO ]  Dropped 180 requirements due to ties.
2020-07-04 07:39:41,593 [vaguerequirementslib] [INFO ]  Built confusion matrix including 214 of 394 requirements. 
2020-07-04 07:39:41,597 [vaguerequirementslib] [INFO ]  Overall "vague" votes count = 9

Unnamed: 0,requirement,vague_count,not_vague_count,majority_label
0,A fallback per band feature set resulting from...,2,0,1
1,Actuation of steering shall be possible regard...,0,2,0
2,"Additionally, the ZigBee end device shall then...",0,2,0
3,"Additionally, the plan provides traceability f...",2,0,1
4,"After completion of release of the resources, ...",0,2,0


Split data into train and test set

In [34]:
from sklearn.model_selection import train_test_split
from typing import Tuple, List
def split_dataset(dataframe: pd.DataFrame) -> Tuple[List[str], List[int], List[str], List[int]]:
    """
    Calculate the majority label for the given source file list

    Args:
        data_frame (pd.DataFrame): The data frame to generate the data sets from.

    Returns:
        Tuple[List[str], List[int], List[str], List[int]]: x_train, y_train, x_test, y_test
    """
    train_df, test_df = train_test_split(df, test_size=0.1, random_state=random_state)
    print(f'Training dataset: vague count="{train_df.sum()["majority_label"]}", not vague count="{train_df.shape[0] - train_df.sum()["majority_label"]}"')
    print(f'Test dataset: vague count="{test_df.sum()["majority_label"]}", not vague count="{test_df.shape[0] - test_df.sum()["majority_label"]}"')

    x_train = list(train_df['requirement'])
    y_train = list(train_df['majority_label'])
    x_test = list(test_df['requirement'])
    y_test = list(test_df['majority_label'])
    
    return x_train, y_train, x_test, y_test

x_train, y_train, x_test, y_test = split_dataset(df)


Training dataset: vague count="45", not vague count="147"
Test dataset: vague count="3", not vague count="19"


## STEP 1:  Preprocess Data and Create a Transformer Model

We will use [DistilBERT](https://arxiv.org/abs/1910.01108).

In [36]:
import ktrain
from ktrain import text
MODEL_NAME = 'distilbert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=['vague', 'not vague'])
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)

preprocessing train...
language: en
train sequence lengths:
	mean : 22
	95percentile : 40
	99percentile : 62
2020-07-04 08:17:37,363 [urllib3.connectionpo] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-04 08:17:38,266 [urllib3.connectionpo] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/bert-base-uncased-vocab.txt HTTP/1.1" 200 0


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 19
	95percentile : 27
	99percentile : 40


2020-07-04 08:17:38,467 [urllib3.connectionpo] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-04 08:17:39,355 [urllib3.connectionpo] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/distilbert-base-uncased-config.json HTTP/1.1" 200 0
2020-07-04 08:17:39,360 [urllib3.connectionpo] [DEBUG]  Starting new HTTPS connection (1): cdn.huggingface.co:443
2020-07-04 08:17:39,384 [urllib3.connectionpo] [DEBUG]  https://cdn.huggingface.co:443 "HEAD /distilbert-base-uncased-tf_model.h5 HTTP/1.1" 200 0


## STEP 2:  Train the Model

In [38]:
learner.fit_onecycle(5e-5, 4)



begin training using onecycle policy with max lr of 5e-05...
Train for 32 steps, validate for 1 steps
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7ff9505673c8>

## STEP 3: Evaluate and Inspect the Model

In [39]:
learner.validate(class_names=t.get_classes())

              precision    recall  f1-score   support

       vague       0.82      0.74      0.78        19
   not vague       0.00      0.00      0.00         3

    accuracy                           0.64        22
   macro avg       0.41      0.37      0.39        22
weighted avg       0.71      0.64      0.67        22



array([[14,  5],
       [ 3,  0]])

Let's examine the validation example about which we were the most wrong.

In [40]:
learner.view_top_losses(n=1, preproc=t)

----------
id:11 | loss:3.13 | true:not vague | pred:vague)



In [45]:
top_loss_req = x_test[11] # Requirement that produces top loss
print(top_loss_req)

For hardware processed with water the moisture content of the gas effluent through or over the dried components, parts or system at ambient temperature, shall be measured.


This post talks more about computing than `alt.atheism` (the true category), so our model placed it into the only computing category available to it: `comp.graphics`

## STEP 4: Making Predictions on New Data in Deployment

In [46]:
predictor = ktrain.get_predictor(learner.model, preproc=t)

In [47]:
predictor.predict(top_loss_req)

'vague'

In [48]:
# predicted probability scores for each category
predictor.predict_proba(top_loss_req)

array([0.9563229 , 0.04367708], dtype=float32)

In [49]:
predictor.get_classes()

['vague', 'not vague']

As expected, `soc.religion.christian` is assigned the highest probability.

Let's invoke the `explain` method to see which words contribute most to the classification.



In [51]:
predictor.explain(top_loss_req)

Contribution?,Feature
2.57,Highlighted in text (sum)
0.655,<BIAS>


The words in the darkest shade of green contribute most to the classification and agree with what you would expect for this example.

We can save and reload our predictor for later deployment.

In [None]:
predictor.save('/tmp/my_distilbert_predictor')

In [None]:
reloaded_predictor = ktrain.load_predictor('/tmp/my_distilbert_predictor')

In [None]:
reloaded_predictor.predict('My computer monitor is really blurry.')

'comp.graphics'