<a href="https://colab.research.google.com/github/HaaLeo/vague-requirements-scripts/blob/master/colab-notebooks/BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classify requirements as vague or not using [ktrain](https://github.com/amaiya/ktrain) and TensorFlow


## Install dependencies
*ktrain* requires TensorFlow 2.1; see [amaiya/ktrain#151](https://github.com/amaiya/ktrain/issues/151).
In addition, we install a forked version of eli5 to gain insight into the model's decision process, and `vaguerequirementslib`, a set of self-built helper functions for preprocessing the MTurk result files.

In [3]:
!pip3 install -q tensorflow_gpu==2.1.0 ktrain==0.17.3
!pip3 install -q -U git+https://github.com/HaaLeo/vague-requirements-scripts
!pip3 install -q git+https://github.com/amaiya/eli5@tfkeras_0_10_1

  Building wheel for vaguerequirementslib (setup.py) ... done
  Building wheel for eli5 (setup.py) ... done


Check versions and enable logging

In [4]:
import tensorflow as tf
import ktrain
assert tf.__version__ == '2.1.0'
assert ktrain.__version__ == '0.17.3'

import logging
import sys

logging.basicConfig(
    format='%(asctime)s [%(threadName)-20.20s] [%(levelname)-5.5s]  %(message)s',
    stream=sys.stdout,
    level=logging.DEBUG)
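
With the root level set to `DEBUG`, third-party libraries log at that level too, which is where the HTTPS connection and lock messages in the outputs further down come from. A minimal sketch for quieting them, assuming those messages are emitted by loggers named `urllib3` and `filelock` (inferred from the message text, not confirmed by this notebook):

```python
import logging

# Assumption: these logger names match the libraries that emit the
# connection and lock messages; adjust them if the messages persist.
NOISY_LOGGERS = ('urllib3', 'filelock')

for name in NOISY_LOGGERS:
    # Raise only the third-party loggers; the root logger keeps its own level.
    logging.getLogger(name).setLevel(logging.WARNING)
```

The notebook's own messages and those of `vaguerequirementslib` are unaffected, because only the named loggers are changed.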

## Set Parameters
Set the parameters for this run.
ktrain v0.17.3 ignores `max_features` and `ngram_range`; see [amaiya/ktrain#190](https://github.com/amaiya/ktrain/issues/190).

In [5]:
indices_to_read = [0,2,3,4] # indicate which MTurk files shall be read.
DATA_FILE_NAMES = [f'corpus-batch-{i}-mturk.csv' for i in indices_to_read]

RANDOM_STATE = 1 # for seeding

LEARNING_RATE = 5e-5
EPOCHS = 4
MODEL_NAME = 'distilbert-base-uncased'
MAX_LEN = 512
BATCH_SIZE = 6
MAX_FEATURES = 35_000
NGRAM_RANGE = 1

PREPROCESS_MODE = 'distilbert'

## Load Dataset

### Mount Google Drive
Mount Google Drive to access the dataset.

In [50]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


### Load Dataset Into Arrays

In [8]:
from vaguerequirementslib import read_csv_files, build_confusion_matrix, calc_majority_label
import pandas as pd

def read_drive_data(files_list: list, separator: str) -> pd.DataFrame:
    """
    Read the given MTurk CSV files and calculate the majority label for each requirement.

    Requirements whose votes are tied (e.g. one vote for vague, one for not vague) are dropped.

    Args:
        files_list (list): The CSV files to calculate the majority label for.
        separator (str): The CSV separator.

    Returns:
        pd.DataFrame: The data frame containing the majority label.
    """
    df = read_csv_files(files_list, separator)
    confusion_matrix = build_confusion_matrix(df, drop_ties=True)
    return calc_majority_label(confusion_matrix)

# Read all data
df = read_drive_data(
    [f'/content/drive/My Drive/datasets/corpus/labeled/{file_name}' for file_name in DATA_FILE_NAMES],
    ','
  )
df.head()

2020-07-04 20:50:53,325 [MainThread          ] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-0-mturk.csv" with 200 rows.
2020-07-04 20:50:53,676 [MainThread          ] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-2-mturk.csv" with 194 rows.
2020-07-04 20:50:54,113 [MainThread          ] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-3-mturk.csv" with 198 rows.
2020-07-04 20:50:54,473 [MainThread          ] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-4-mturk.csv" with 196 rows.
2020-07-04 20:50:54,502 [MainThread          ] [INFO ]  Build confusion matrix.
2020-07-04 20:50:54,656 [MainThread          ] [INFO ]  Dropped 180 requirements due to ties.
2020-07-04 20:50:54,659 [MainThread          ] [INFO ]  Built confusion matrix including 214 of 394 requirements. 
2020-07-04 20:50:54,664 [MainThread          ] [INFO ]  Overall "vague" votes count = 9

Unnamed: 0,requirement,vague_count,not_vague_count,majority_label
0,A fallback per band feature set resulting from...,2,0,1
1,Actuation of steering shall be possible regard...,0,2,0
2,"Additionally, the ZigBee end device shall then...",0,2,0
3,"Additionally, the plan provides traceability f...",2,0,1
4,"After completion of release of the resources, ...",0,2,0
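
The table shows the per-requirement vote counts and the derived `majority_label`. The internals of `build_confusion_matrix` and `calc_majority_label` are not shown in this notebook, so the following is only a sketch of how tie dropping and majority voting could be done with plain pandas. It reuses the column names from the output above; the vote counts are made up.

```python
import pandas as pd

# Made-up vote counts for illustration; not taken from the corpus.
votes = pd.DataFrame({
    'requirement': ['req A', 'req B', 'req C'],
    'vague_count': [2, 0, 1],
    'not_vague_count': [0, 2, 1],
})

# Drop ties: requirements with as many "vague" as "not vague" votes.
no_ties = votes[votes['vague_count'] != votes['not_vague_count']].copy()

# Majority label: 1 = vague, 0 = not vague.
no_ties['majority_label'] = (no_ties['vague_count'] > no_ties['not_vague_count']).astype(int)

print(no_ties)
```

In this toy example `req C` is dropped because its votes are tied, which mirrors the 180 requirements dropped in the log output above.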


### Split data set


In [9]:
from sklearn.model_selection import train_test_split
from typing import Tuple, List
from ktrain import text as txt


def split_dataset(dataframe: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Split the dataset into training (80%), validation (10%) and test (10%) sets, stratified by label.

    Args:
        dataframe (pd.DataFrame): The data frame to generate the data sets from.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: train_df, val_df, test_df
    """
    # Use the function argument rather than the global `df`.
    train_df, val_test_df = train_test_split(dataframe, test_size=0.2, random_state=RANDOM_STATE, stratify=dataframe['majority_label'])
    val_df, test_df = train_test_split(val_test_df, test_size=0.5, random_state=RANDOM_STATE, stratify=val_test_df['majority_label'])

    # The label is 1 for vague requirements, so its sum is the vague count.
    print(f'Training dataset: vague count="{train_df["majority_label"].sum()}", not vague count="{train_df.shape[0] - train_df["majority_label"].sum()}"')
    print(f'Validation dataset: vague count="{val_df["majority_label"].sum()}", not vague count="{val_df.shape[0] - val_df["majority_label"].sum()}"')
    print(f'Test dataset: vague count="{test_df["majority_label"].sum()}", not vague count="{test_df.shape[0] - test_df["majority_label"].sum()}"')

    return train_df, val_df, test_df


def preprocess_data(train_df: pd.DataFrame, val_df: pd.DataFrame, test_df: pd.DataFrame) -> Tuple:
    def _preprocess(my_df: pd.DataFrame) -> Tuple:
        # A two-row dummy frame is passed as `val_df`, presumably so that ktrain does not
        # split off part of `my_df` for validation. Its result (logged as "test") is discarded.
        dummy_df = pd.DataFrame.from_dict({'requirement': ['foo', 'bar'], 'majority_label': [0, 1]})
        return txt.texts_from_df(my_df, text_column='requirement', label_columns=['majority_label'], val_df=dummy_df,  max_features=MAX_FEATURES, maxlen=MAX_LEN,  ngram_range=NGRAM_RANGE, preprocess_mode=PREPROCESS_MODE, random_state=RANDOM_STATE)
    
    train_data, _, _ = _preprocess(train_df)
    val_data, _, _ = _preprocess(val_df)
    test_data, _, test_preproc = _preprocess(test_df)

    return train_data, val_data, test_data, test_preproc

# Split the data set
train_df, val_df, test_df = split_dataset(df)

# Preprocess for Transfer Learning
train_data, val_data, test_data, test_preproc = preprocess_data(train_df, val_df, test_df)


Training dataset: vague count="38", not vague count="133"
Validation dataset: vague count="5", not vague count="16"
Test dataset: vague count="5", not vague count="17"
preprocessing train...
language: en
train sequence lengths:
	mean : 21
	95percentile : 40
	99percentile : 63
2020-07-04 20:51:02,989 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-04 20:51:03,411 [MainThread          ] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/bert-base-uncased-vocab.txt HTTP/1.1" 200 0
2020-07-04 20:51:03,415 [MainThread          ] [DEBUG]  Attempting to acquire lock 139757452865376 on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock
2020-07-04 20:51:03,417 [MainThread          ] [INFO ]  Lock 139757452865376 acquired on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35



2020-07-04 20:51:04,143 [MainThread          ] [DEBUG]  Attempting to release lock 139757452865376 on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock
2020-07-04 20:51:04,146 [MainThread          ] [INFO ]  Lock 139757452865376 released on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 1
	95percentile : 1
	99percentile : 1




preprocessing train...
language: en
train sequence lengths:
	mean : 22
	95percentile : 38
	99percentile : 46
2020-07-04 20:51:04,506 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-04 20:51:04,912 [MainThread          ] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/bert-base-uncased-vocab.txt HTTP/1.1" 200 0


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 1
	95percentile : 1
	99percentile : 1


preprocessing train...
language: en
train sequence lengths:
	mean : 21
	95percentile : 36
	99percentile : 38
2020-07-04 20:51:05,117 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-04 20:51:05,509 [MainThread          ] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/bert-base-uncased-vocab.txt HTTP/1.1" 200 0


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 1
	95percentile : 1
	99percentile : 1
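
The split above is 80% training, 10% validation and 10% test. The following sketch reproduces the printed set sizes (171, 21 and 22 requirements) from the 214 requirements that remain after dropping ties. It assumes that `train_test_split` rounds the held-out share up to a whole number of samples and gives the remainder to the first set. `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples: int) -> tuple:
    """Return (train, val, test) sizes for an 80/10/10 split done in two steps."""
    # Step 1: hold out 20% (rounded up) for validation + test.
    n_val_test = math.ceil(0.2 * n_samples)
    n_train = n_samples - n_val_test
    # Step 2: split the held-out part in half; the test set gets the rounded-up half.
    n_test = math.ceil(0.5 * n_val_test)
    n_val = n_val_test - n_test
    return n_train, n_val, n_test

print(split_sizes(214))
```

The result agrees with the vague and not-vague counts printed by `split_dataset`: 38 + 133 = 171, 5 + 16 = 21 and 5 + 17 = 22.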


## STEP 1: Preprocess Data and Create a Transformer Model

We will use [DistilBERT](https://arxiv.org/abs/1910.01108).

In [10]:
import ktrain
t = txt.Transformer(MODEL_NAME, maxlen=MAX_LEN, class_names=['not-vague', 'vague']) # 0=not-vague 1=vague
t.preprocess_train_called = True # Simulate call to preprocess_train()
# t.preprocess_train(['foo', 'bar'], [0, 1]) 
# val_data = t.preprocess_test(x_val, y_val)
# test_data = t.preprocess_test(x_test, y_test)

model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=train_data, val_data=val_data, batch_size=BATCH_SIZE)

2020-07-04 20:51:18,338 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-04 20:51:18,733 [MainThread          ] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/distilbert-base-uncased-config.json HTTP/1.1" 200 0
2020-07-04 20:51:18,736 [MainThread          ] [DEBUG]  Attempting to acquire lock 139760576447824 on /root/.cache/torch/transformers/a41e817d5c0743e29e86ff85edc8c257e61bc8d88e4271bb1b243b6e7614c633.8949e27aafafa845a18d98a0e3a88bc2d248bbc32a1b75947366664658f23b1c.lock
2020-07-04 20:51:18,737 [MainThread          ] [INFO ]  Lock 139760576447824 acquired on /root/.cache/torch/transformers/a41e817d5c0743e29e86ff85edc8c257e61bc8d88e4271bb1b243b6e7614c633.8949e27aafafa845a18d98a0e3a88bc2d248bbc32a1b75947366664658f23b1c.lock
2020-07-04 20:51:18,741 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-04 20:51:19,139 [MainThread          ] [DEBUG]  https://s3.amazonaws.com



2020-07-04 20:51:19,174 [MainThread          ] [DEBUG]  Attempting to release lock 139760576447824 on /root/.cache/torch/transformers/a41e817d5c0743e29e86ff85edc8c257e61bc8d88e4271bb1b243b6e7614c633.8949e27aafafa845a18d98a0e3a88bc2d248bbc32a1b75947366664658f23b1c.lock
2020-07-04 20:51:19,176 [MainThread          ] [INFO ]  Lock 139760576447824 released on /root/.cache/torch/transformers/a41e817d5c0743e29e86ff85edc8c257e61bc8d88e4271bb1b243b6e7614c633.8949e27aafafa845a18d98a0e3a88bc2d248bbc32a1b75947366664658f23b1c.lock
2020-07-04 20:51:19,181 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): cdn.huggingface.co:443
2020-07-04 20:51:19,589 [MainThread          ] [DEBUG]  https://cdn.huggingface.co:443 "HEAD /distilbert-base-uncased-tf_model.h5 HTTP/1.1" 200 0
2020-07-04 20:51:19,591 [MainThread          ] [DEBUG]  Attempting to acquire lock 139757452919024 on /root/.cache/torch/transformers/cce28882467f298a29fc905b9dd1683695d96198a83432fe707089dccd71c019.e02bd57e9d85078



2020-07-04 20:51:27,556 [MainThread          ] [DEBUG]  Attempting to release lock 139757452919024 on /root/.cache/torch/transformers/cce28882467f298a29fc905b9dd1683695d96198a83432fe707089dccd71c019.e02bd57e9d8507853eccc7c04ac2e938a6cdaff4b9bf941c10e781b61ddb9bbd.h5.lock
2020-07-04 20:51:27,557 [MainThread          ] [INFO ]  Lock 139757452919024 released on /root/.cache/torch/transformers/cce28882467f298a29fc905b9dd1683695d96198a83432fe707089dccd71c019.e02bd57e9d8507853eccc7c04ac2e938a6cdaff4b9bf941c10e781b61ddb9bbd.h5.lock


## STEP 2: Train the Model

In [11]:
learner.fit_onecycle(LEARNING_RATE, EPOCHS)



begin training using onecycle policy with max lr of 5e-05...
Train for 29 steps, validate for 1 steps
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7f1bc025d5c0>
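
The 29 training steps per epoch reported above are consistent with 171 training requirements and `BATCH_SIZE = 6`, assuming the last, partially filled batch counts as a step:

```python
import math

n_train = 171    # size of the training set printed by split_dataset
batch_size = 6   # BATCH_SIZE from the parameter cell

# A final partial batch still counts as one step, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)
```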

## STEP 3: Evaluate and Inspect the Model

In [28]:
test_result = learner.validate(class_names=t.get_classes(), val_data=test_data)
print(test_result)

              precision    recall  f1-score   support

   not-vague       0.77      1.00      0.87        17
       vague       0.00      0.00      0.00         5

    accuracy                           0.77        22
   macro avg       0.39      0.50      0.44        22
weighted avg       0.60      0.77      0.67        22

[[17  0]
 [ 5  0]]




Let's examine the test examples on which the model was most wrong, i.e. those with the highest loss.

In [29]:
learner.view_top_losses(n=4, preproc=t, val_data=test_data)

----------
id:20 | loss:2.14 | true:vague | pred:not-vague)

----------
id:0 | loss:1.91 | true:vague | pred:not-vague)

----------
id:4 | loss:1.62 | true:vague | pred:not-vague)

----------
id:11 | loss:1.34 | true:vague | pred:not-vague)



In [30]:
top_loss_req = test_df.iloc[0]['requirement'] # Test example with id 0, one of the top-loss requirements listed above
print(top_loss_req)
print(test_df.iloc[0]['majority_label'])

For hardware processed with water the moisture content of the gas effluent through or over the dried components, parts or system at ambient temperature, shall be measured.
1


The majority label of this requirement is vague (`1`), but the model classified it as not-vague. According to the confusion matrix above, the model in fact predicted not-vague for all 22 test requirements.

## STEP 3.1: Gather Results

Gather results and write them to the drive.

In [55]:
from vaguerequirementslib import TP, TN, FP, FN, calc_all_metrics
result_data = {
    'metrics':{
        'vague': {
            TP: int(test_result[1][1]),
            FP: int(test_result[0][1]),
            TN: int(test_result[0][0]),
            FN: int(test_result[1][0])
        },
        'not_vague': {
            TP: int(test_result[0][0]),
            FP: int(test_result[1][0]),
            TN: int(test_result[1][1]),
            FN: int(test_result[0][1])
        }
    },
    'misc': {   
        'data_files': DATA_FILE_NAMES,
        'random_state': RANDOM_STATE
    },
    'hyperparameter': {
        'learning_rate': LEARNING_RATE,
        'epochs': EPOCHS,
        'model_name': MODEL_NAME,
        'max_len': MAX_LEN,
        'batch_size': BATCH_SIZE
        # 'max_features': MAX_FEATURES,
        # 'ngram_range': NGRAM_RANGE
    }
}
# result_data['metrics']['not_vague'].update(calc_all_metrics(**result_data['metrics']['not_vague']))
# result_data['metrics']['vague'].update(calc_all_metrics(**result_data['metrics']['vague']))

from datetime import datetime
from pytz import timezone
import json

tz = timezone('Europe/Berlin')
now = datetime.now(tz)
result_file_name = now.strftime('%Y-%m-%d_%H-%M-%S_evaluation_result.json')

with open(f'/content/drive/My Drive/runs/{result_file_name}', mode='w', encoding='utf-8') as json_file:
    json.dump(result_data, json_file, indent=4)
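
The counts stored above are read from the confusion matrix returned by `learner.validate`, with rows as true labels and columns as predicted labels. As a sketch of how precision, recall and F1 follow from such counts, the hypothetical helper `metrics_from_counts` below is applied to the test confusion matrix `[[17, 0], [5, 0]]`. It is not `calc_all_metrics` from `vaguerequirementslib`, whose implementation is not shown here.

```python
def metrics_from_counts(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall and F1; undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return {'precision': precision, 'recall': recall, 'f1': f1}

# Rows are true labels, columns are predicted labels (0 = not vague, 1 = vague).
confusion = [[17, 0], [5, 0]]

vague = metrics_from_counts(tp=confusion[1][1], fp=confusion[0][1], fn=confusion[1][0])
not_vague = metrics_from_counts(tp=confusion[0][0], fp=confusion[1][0], fn=confusion[0][1])

print({k: round(v, 2) for k, v in vague.items()})
print({k: round(v, 2) for k, v in not_vague.items()})
```

The rounded values match the classification report in STEP 3: the model never predicts vague, so precision, recall and F1 for that class are all zero.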

## STEP 4: Making Predictions on New Data in Deployment

In [31]:
predictor = ktrain.get_predictor(learner.model, preproc=test_preproc)

In [36]:
predictor.predict(top_loss_req)

'0'

In [33]:
# predicted probability scores for each category
predictor.predict_proba(top_loss_req)

array([0.8525825, 0.1474175], dtype=float32)

In [34]:
predictor.get_classes()

['0', '1']

The model assigns class `'0'` (not vague) a probability of about 0.85 for this requirement, although its majority label is vague.
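
The predictor labels the classes `'0'` and `'1'`. A small sketch of mapping the probabilities returned above to the readable names used for the `Transformer` (`not-vague`, `vague`), assuming index 0 corresponds to not vague and index 1 to vague as in the rest of the notebook:

```python
# Probabilities copied from the predict_proba output above.
probabilities = [0.8525825, 0.1474175]
class_names = ['not-vague', 'vague']  # index 0 = not vague, index 1 = vague

# Pick the index with the highest probability and look up its name.
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
print(class_names[best_index], round(probabilities[best_index], 2))
```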

Let's invoke the `explain` method to see which words contribute most to the classification.



In [35]:
predictor.explain(top_loss_req)

Contribution?,Feature
0.858,Highlighted in text (sum)
0.545,<BIAS>


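In the rendered notebook, `explain` highlights the words of the requirement; the words in the darkest shade of green contribute most to the predicted class. The table above lists the summed contribution of the highlighted words and the bias term.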

We can save and reload our predictor for later deployment.

In [None]:
predictor.save('/tmp/my_distilbert_predictor')

In [None]:
reloaded_predictor = ktrain.load_predictor('/tmp/my_distilbert_predictor')

In [None]:
reloaded_predictor.predict(top_loss_req)