<a href="https://colab.research.google.com/github/HaaLeo/vague-requirements-scripts/blob/master/colab-notebooks/BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classify requirements as vague or not using [ktrain](https://github.com/amaiya/ktrain) and TensorFlow


## Install dependencies
*ktrain* requires TensorFlow 2.1; see [amaiya/ktrain#151](https://github.com/amaiya/ktrain/issues/151).
We also install a forked version of eli5 to gain insight into the model's decision process, and a set of self-built helper functions to preprocess the MTurk result files.

In [2]:
!pip3 install -q tensorflow_gpu==2.1.0 ktrain==0.17.3
!pip3 install -q -U git+https://github.com/HaaLeo/vague-requirements-scripts
!pip3 install -q git+https://github.com/amaiya/eli5@tfkeras_0_10_1

  Building wheel for vaguerequirementslib (setup.py) ... done
  Building wheel for eli5 (setup.py) ... done


Check versions and enable logging

In [3]:
import tensorflow as tf
import ktrain
assert tf.__version__ == '2.1.0'
assert ktrain.__version__ == '0.17.3'

import logging
import sys

logging.basicConfig(
    format='%(asctime)s [%(threadName)-20.20s] [%(levelname)-5.5s]  %(message)s',
    stream=sys.stdout,
    level=logging.DEBUG)

## Set Parameters
Set the parameters for this run.
*ktrain* v0.17.3 ignores `max_features` and `ngram_range`; see [amaiya/ktrain#190](https://github.com/amaiya/ktrain/issues/190).

In [9]:
indices_to_read = [0,2,3,4] # indicate which MTurk files shall be read.
DATA_FILE_NAMES = [f'corpus-batch-{i}-mturk.csv' for i in indices_to_read]

RANDOM_STATE = 1 # for seeding

LEARNING_RATE = 5e-5
EPOCHS = 4
MODEL_NAME = 'distilbert-base-uncased'
MAX_LEN = 512
BATCH_SIZE = 6
MAX_FEATURES = 35_000
NGRAM_RANGE = 1

CLASS_NAMES = ['not-vague', 'vague'] # 0=not-vague 1=vague

PREPROCESS_MODE = 'distilbert'

## Load Dataset

### Mount Google Drive
Mount Google Drive to access the dataset.

In [5]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


### Load Dataset Into Arrays

In [6]:
from vaguerequirementslib import read_csv_files, build_confusion_matrix, calc_majority_label
import pandas as pd

def read_drive_data(files_list: list, separator: str) -> pd.DataFrame:
    """
    Read the given MTurk CSV files and calculate the majority label for each requirement.

    Requirements with tied votes (e.g. one vote for vague and one for not vague) are dropped.

    Args:
        files_list (list): The CSV files to calculate the majority label for
        separator (str): The CSV separator

    Returns:
        pd.DataFrame: The dataframe containing the majority label.
    """
    df = read_csv_files(files_list, separator)
    confusion_matrix = build_confusion_matrix(df, drop_ties=True)
    return calc_majority_label(confusion_matrix)

# Read all data
df = read_drive_data(
    [f'/content/drive/My Drive/datasets/corpus/labeled/{file_name}' for file_name in DATA_FILE_NAMES],
    ','
  )
df.head()

2020-07-05 08:38:12,869 [MainThread          ] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-0-mturk.csv" with 200 rows.
2020-07-05 08:38:12,882 [MainThread          ] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-2-mturk.csv" with 194 rows.
2020-07-05 08:38:12,895 [MainThread          ] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-3-mturk.csv" with 198 rows.
2020-07-05 08:38:12,908 [MainThread          ] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-4-mturk.csv" with 196 rows.
2020-07-05 08:38:12,921 [MainThread          ] [INFO ]  Build confusion matrix.
2020-07-05 08:38:13,027 [MainThread          ] [INFO ]  Dropped 180 requirements due to ties.
2020-07-05 08:38:13,029 [MainThread          ] [INFO ]  Built confusion matrix including 214 of 394 requirements. 
2020-07-05 08:38:13,032 [MainThread          ] [INFO ]  Overall "vague" votes count = 9

Unnamed: 0,requirement,vague_count,not_vague_count,majority_label
0,A fallback per band feature set resulting from...,2,0,1
1,Actuation of steering shall be possible regard...,0,2,0
2,"Additionally, the ZigBee end device shall then...",0,2,0
3,"Additionally, the plan provides traceability f...",2,0,1
4,"After completion of release of the resources, ...",0,2,0
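The tie handling can be illustrated with a minimal sketch. It does not reproduce `vaguerequirementslib`; the function `majority_label` and the sample votes are hypothetical, and the only assumption is the rule visible in the table above: the label with more votes wins (1 = vague, 0 = not vague) and tied requirements are dropped.

```python
from typing import Dict, Optional, Tuple


def majority_label(vague_count: int, not_vague_count: int) -> Optional[int]:
    """Return 1 (vague), 0 (not vague), or None when the votes are tied."""
    if vague_count == not_vague_count:
        return None
    return 1 if vague_count > not_vague_count else 0


# Hypothetical votes: requirement id -> (vague_count, not_vague_count)
votes: Dict[str, Tuple[int, int]] = {
    'req-a': (2, 0),
    'req-b': (0, 2),
    'req-c': (1, 1),  # tie, will be dropped
}

labels = {req: majority_label(v, nv) for req, (v, nv) in votes.items()}

# Keep only the requirements with a clear majority
labeled = {req: label for req, label in labels.items() if label is not None}
print(labeled)
```

With only two annotators per requirement, every disagreement is a tie, which is consistent with the large share of requirements dropped in the log above.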


### Split data set


In [7]:
from sklearn.model_selection import train_test_split
from typing import Tuple, List
from ktrain import text as txt


def split_dataset(dataframe: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Split the dataset into training, validation and test set, stratified by the majority label.

    Args:
        dataframe (pd.DataFrame): The data frame to generate the data sets from.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: train_df, val_df, test_df
    """
    # Use the function argument instead of the global `df`
    train_df, val_test_df = train_test_split(dataframe, test_size=0.2, random_state=RANDOM_STATE, stratify=dataframe['majority_label'])
    val_df, test_df = train_test_split(val_test_df, test_size=0.5, random_state=RANDOM_STATE, stratify=val_test_df['majority_label'])

    print(f'Training dataset: vague count="{train_df.sum()["majority_label"]}", not vague count="{train_df.shape[0] - train_df.sum()["majority_label"]}"')
    print(f'Validation dataset: vague count="{val_df.sum()["majority_label"]}", not vague count="{val_df.shape[0] - val_df.sum()["majority_label"]}"')
    print(f'Test dataset: vague count="{test_df.sum()["majority_label"]}", not vague count="{test_df.shape[0] - test_df.sum()["majority_label"]}"')

    return train_df, val_df, test_df


def preprocess_data(train_df: pd.DataFrame, val_df: pd.DataFrame, test_df: pd.DataFrame) -> Tuple:
    def _preprocess(my_df: pd.DataFrame) -> Tuple:
        # Pass a two-row dummy validation frame so that all rows of my_df end up in the
        # first returned data set; the preprocessed dummy data is discarded by the callers.
        dummy_df = pd.DataFrame.from_dict({'requirement': ['foo', 'bar'], 'majority_label': [0, 1]})
        return txt.texts_from_df(my_df, text_column='requirement', label_columns=['majority_label'], val_df=dummy_df,  max_features=MAX_FEATURES, maxlen=MAX_LEN,  ngram_range=NGRAM_RANGE, preprocess_mode=PREPROCESS_MODE, random_state=RANDOM_STATE)
    
    train_data, _, _ = _preprocess(train_df)
    val_data, _, _ = _preprocess(val_df)
    test_data, _, test_preproc = _preprocess(test_df)

    return train_data, val_data, test_data, test_preproc

# Split the data set
train_df, val_df, test_df = split_dataset(df)

# Preprocess for Transfer Learning
train_data, val_data, test_data, test_preproc = preprocess_data(train_df, val_df, test_df)


Training dataset: vague count="38", not vague count="133"
Validation dataset: vague count="5", not vague count="16"
Test dataset: vague count="5", not vague count="17"
preprocessing train...
language: en
train sequence lengths:
	mean : 21
	95percentile : 40
	99percentile : 63
2020-07-05 08:38:15,972 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-05 08:38:16,249 [MainThread          ] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/bert-base-uncased-vocab.txt HTTP/1.1" 200 0


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 1
	95percentile : 1
	99percentile : 1




preprocessing train...
language: en
train sequence lengths:
	mean : 22
	95percentile : 38
	99percentile : 46
2020-07-05 08:38:16,522 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-05 08:38:16,791 [MainThread          ] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/bert-base-uncased-vocab.txt HTTP/1.1" 200 0


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 1
	95percentile : 1
	99percentile : 1


preprocessing train...
language: en
train sequence lengths:
	mean : 21
	95percentile : 36
	99percentile : 38
2020-07-05 08:38:16,961 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-05 08:38:17,227 [MainThread          ] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/bert-base-uncased-vocab.txt HTTP/1.1" 200 0


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 1
	95percentile : 1
	99percentile : 1
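The set sizes reported above (171 / 21 / 22) follow from how scikit-learn resolves a fractional `test_size`: to our understanding the test part is rounded up and the remainder becomes the other part. The helper `split_sizes` below is a hypothetical illustration of that arithmetic, not a scikit-learn function.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_first_part, n_second_part) for a fractional test_size."""
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


n_total = 214  # requirements left after dropping ties

# First split: 80% training, 20% held out
n_train, n_val_test = split_sizes(n_total, 0.2)

# Second split: the held-out part is halved into validation and test
n_val, n_test = split_sizes(n_val_test, 0.5)

print(n_train, n_val, n_test)
```

The `mean : 1` sequence lengths in the "preprocessing test..." output above refer to the two one-word dummy rows (`foo`, `bar`), not to the real validation or test data.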


## STEP 1:  Create a Transformer Model and Train it

We will use [DistilBERT](https://arxiv.org/abs/1910.01108).

In [26]:
from datetime import datetime
from pytz import timezone

# Create the transformer
t = txt.Transformer(MODEL_NAME, maxlen=MAX_LEN, class_names=CLASS_NAMES)
t.preprocess_train_called = True # Simulate call to preprocess_train()

# Get the model and learner
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=train_data, val_data=val_data, batch_size=BATCH_SIZE)

# For every triggered fitting run create a new directory where the results will be saved
now = datetime.now(timezone('Europe/Berlin'))
result_dir = f'/content/drive/My Drive/runs/{now.strftime("%Y-%m-%d/%H-%M-%S")}'

# Fit the model
learner.fit_onecycle(LEARNING_RATE, EPOCHS)

2020-07-05 09:14:25,208 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-05 09:14:25,480 [MainThread          ] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/distilbert-base-uncased-config.json HTTP/1.1" 200 0
2020-07-05 09:14:25,484 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): cdn.huggingface.co:443
2020-07-05 09:14:25,704 [MainThread          ] [DEBUG]  https://cdn.huggingface.co:443 "HEAD /distilbert-base-uncased-tf_model.h5 HTTP/1.1" 200 0


begin training using onecycle policy with max lr of 5e-05...
Train for 29 steps, validate for 1 steps
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
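The `29 steps` in the log are the number of batches per epoch: 171 training examples at a batch size of 6. The sketch below reproduces that number and adds a simplified triangular one-cycle schedule to show the idea behind `fit_onecycle`: the learning rate rises to its maximum in the first half of training and falls back in the second half. The function `one_cycle_lr` and its `div_factor` starting point are assumptions for illustration; ktrain's internal schedule may differ in detail.

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


def one_cycle_lr(step: int, total_steps: int, max_lr: float, div_factor: float = 10.0) -> float:
    """Triangular schedule: linear warm-up to max_lr, then linear decay."""
    base_lr = max_lr / div_factor  # assumed starting learning rate
    half = total_steps / 2
    if step <= half:
        frac = step / half
    else:
        frac = (total_steps - step) / half
    return base_lr + (max_lr - base_lr) * frac


steps = steps_per_epoch(171, 6)
total_steps = steps * 4  # EPOCHS = 4

for step in (0, total_steps // 2, total_steps):
    print(step, one_cycle_lr(step, total_steps, max_lr=5e-5))
```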


## STEP 2: Evaluate the model
Evaluate the model using the `test_data`.

In [27]:
test_result = learner.validate(class_names=t.get_classes(), val_data=test_data)
print(test_result)

              precision    recall  f1-score   support

   not-vague       0.79      0.65      0.71        17
       vague       0.25      0.40      0.31         5

    accuracy                           0.59        22
   macro avg       0.52      0.52      0.51        22
weighted avg       0.66      0.59      0.62        22

[[11  6]
 [ 3  2]]
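The report above can be re-derived from the confusion matrix alone. The following sketch assumes the usual scikit-learn layout (rows are true classes, columns are predicted classes, in the order `not-vague`, `vague`); `per_class_metrics` is a hypothetical helper and does not guard against empty classes.

```python
from typing import Dict, List


def per_class_metrics(cm: List[List[int]], cls: int) -> Dict[str, float]:
    """Precision, recall and F1 for one class of a square confusion matrix."""
    n = len(cm)
    tp = cm[cls][cls]
    fp = sum(cm[row][cls] for row in range(n) if row != cls)  # predicted cls, but wrong
    fn = sum(cm[cls][col] for col in range(n) if col != cls)  # true cls, but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {'precision': precision, 'recall': recall, 'f1': f1}


cm = [[11, 6],
      [3, 2]]

not_vague = per_class_metrics(cm, 0)
vague = per_class_metrics(cm, 1)
accuracy = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)

print({k: round(v, 2) for k, v in not_vague.items()})
print({k: round(v, 2) for k, v in vague.items()})
print(round(accuracy, 2))
```

Only 2 of the 5 vague test requirements are found, and 6 of the 8 "vague" predictions are wrong, which is why the vague class scores much lower than the overall accuracy suggests.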


## STEP 3: Gather Results

Gather results, calculate metrics and write them to the drive.

In [44]:
import os
import json
from os import path

from vaguerequirementslib import TP, TN, FP, FN, calc_all_metrics

def build_result_data(test_result: List) -> dict:
    result_data = {
        'metrics':{
            'vague': {
                TP: int(test_result[1][1]),
                FP: int(test_result[0][1]),
                TN: int(test_result[0][0]),
                FN: int(test_result[1][0])
            },
            'not_vague': {
                TP: int(test_result[0][0]),
                FP: int(test_result[1][0]),
                TN: int(test_result[1][1]),
                FN: int(test_result[0][1])
            }
        },
        'misc': {   
            'data_files': DATA_FILE_NAMES,
            'random_state': RANDOM_STATE
        },
        'hyperparameter': {
            'learning_rate': LEARNING_RATE,
            'epochs': EPOCHS,
            'model_name': MODEL_NAME,
            'max_len': MAX_LEN,
            'batch_size': BATCH_SIZE,
            'max_features': MAX_FEATURES,
            'ngram_range': NGRAM_RANGE
        }
    }
    result_data['metrics']['not_vague'].update(calc_all_metrics(**result_data['metrics']['not_vague']))
    result_data['metrics']['vague'].update(calc_all_metrics(**result_data['metrics']['vague']))
    
    return result_data

result_data = build_result_data(test_result)
# Get the predictor
predictor = ktrain.get_predictor(learner.model, preproc=t)

### STEP 3.1: Save the Results
Check out the [FAQ](https://github.com/amaiya/ktrain/blob/master/FAQ.md#method-1-using-predictor-api-works-for-any-model) for how to load a model from a predictor.

In [45]:
# Make sure the result directory of this run exists before writing to it
os.makedirs(result_dir, exist_ok=True)

# Save the evaluation result (test_data results)
with open(path.join(result_dir, 'evaluation.json'), mode='w', encoding='utf-8') as json_file:
    json.dump(result_data, json_file, indent=4)

# Save the corresponding model (predictor)
predictor.save(path.join(result_dir, 'predictor'))
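Writing into a per-run directory only works once that directory exists. As a minimal, self-contained sketch of the pattern, the snippet below writes a JSON file into a timestamped folder and reads it back. The temporary base directory stands in for the Drive folder, and the `evaluation` payload is a hypothetical placeholder, not the real result data.

```python
import json
import os
import tempfile
from datetime import datetime

# Stand-in for '/content/drive/My Drive/runs' (assumption for this sketch)
base_dir = tempfile.mkdtemp()

# One sub-directory per day and one per run, as in the notebook
run_dir = os.path.join(base_dir, datetime.now().strftime('%Y-%m-%d/%H-%M-%S'))
os.makedirs(run_dir, exist_ok=True)

# Hypothetical payload with the same nesting as the evaluation result
evaluation = {'metrics': {'vague': {'precision': 0.25}}}

eval_path = os.path.join(run_dir, 'evaluation.json')
with open(eval_path, mode='w', encoding='utf-8') as json_file:
    json.dump(evaluation, json_file, indent=4)

# Read the file back to verify the round trip
with open(eval_path, encoding='utf-8') as json_file:
    restored = json.load(json_file)

print(restored == evaluation)
```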

## STEP 4: Inspect the Model and its Losses

Let's examine the four test examples with the highest loss, i.e. the ones about which the model was most wrong.

In [32]:
learner.view_top_losses(n=4, preproc=t, val_data=test_data)
top_losses = learner.top_losses(n=4, preproc=t, val_data=test_data)

----------
id:0 | loss:2.57 | true:vague | pred:not-vague)

----------
id:20 | loss:2.08 | true:vague | pred:not-vague)

----------
id:4 | loss:1.44 | true:vague | pred:not-vague)

----------
id:6 | loss:1.14 | true:not-vague | pred:vague)

[(0, 2.568476, 'vague', 'not-vague'), (20, 2.0769691, 'vague', 'not-vague'), (4, 1.4417335, 'vague', 'not-vague'), (6, 1.135456, 'not-vague', 'vague')]


In [43]:
top_loss_req = test_df.iloc[6]['requirement'] # Requirement with id 6 from the top losses (true: not-vague, predicted: vague)

print(predictor.predict(top_loss_req))

# predicted probability scores for each category
print(predictor.predict_proba(top_loss_req))

vague


[0.3212756 0.6787244]
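The probabilities above explain the loss reported for id 6. Assuming the loss is the categorical cross-entropy, i.e. the negative log of the probability assigned to the true class, the sketch below recovers the value from `predict_proba` and shows how a top-losses ranking can be built. `cross_entropy` and `top_n` are hypothetical helpers, not ktrain functions.

```python
import math
from typing import List, Tuple


def cross_entropy(probs: List[float], true_class: int) -> float:
    """Loss of a single example: -log(probability of the true class)."""
    return -math.log(probs[true_class])


def top_n(losses: List[Tuple[int, float]], n: int) -> List[Tuple[int, float]]:
    """Return the n (id, loss) pairs with the highest loss."""
    return sorted(losses, key=lambda item: item[1], reverse=True)[:n]


# id 6: true class is not-vague (index 0), but the model favours vague
loss = cross_entropy([0.3212756, 0.6787244], true_class=0)
print(round(loss, 2))

# (id, loss) pairs taken from the output above, in arbitrary order
losses = [(6, 1.135456), (0, 2.568476), (4, 1.4417335), (20, 2.0769691)]
print(top_n(losses, 3))
```

The more confident the model is in the wrong class, the smaller the probability of the true class and the larger the loss.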


Let's invoke the `explain` method to see which words contribute most to the classification.

In [38]:
from IPython.display import display

# Explain each of the top-loss requirements (idx avoids shadowing the built-in id)
for idx, _, _, _ in top_losses:
    top_loss_req = test_df.iloc[idx]['requirement'] # Requirement from the top losses
    display(predictor.explain(top_loss_req, n_samples=1_000))

Contribution?,Feature
1.201,Highlighted in text (sum)
0.362,<BIAS>


Contribution?,Feature
0.873,Highlighted in text (sum)
0.622,<BIAS>


Contribution?,Feature
0.828,Highlighted in text (sum)
0.349,<BIAS>


Contribution?,Feature
1.568,Highlighted in text (sum)
-0.298,<BIAS>


The words in the darkest shade of green contribute most to the classification and agree with what you would expect for this example.
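To build intuition for word-level contributions, here is a deliberately simplified sketch. To our understanding, `explain` relies on eli5, which fits a local surrogate model on many perturbed copies of the text (hence `n_samples`). The sketch swaps in a simpler technique, leave-one-out perturbation, and a toy scoring function. `VAGUE_HINTS`, `toy_vague_score` and the example sentence are hypothetical and unrelated to the trained model.

```python
from typing import Callable, List, Tuple

# Hypothetical list of words that the toy scorer treats as vague
VAGUE_HINTS = {'appropriate', 'sufficient', 'fast', 'easy'}


def toy_vague_score(text: str) -> float:
    """Toy stand-in for a classifier: more hint words give a higher score."""
    count = sum(word in VAGUE_HINTS for word in text.lower().split())
    return 1 - 0.5 ** count


def leave_one_out(text: str, score: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Contribution of each word = score drop when that word is removed."""
    words = text.split()
    base = score(text)
    contributions = []
    for i, word in enumerate(words):
        reduced = ' '.join(words[:i] + words[i + 1:])
        contributions.append((word, base - score(reduced)))
    return contributions


example = 'The system shall respond fast'
for word, contribution in leave_one_out(example, toy_vague_score):
    print(f'{word}: {contribution:+.2f}')
```

Words with a large positive contribution correspond to the darkest highlights in an explanation of this kind.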