<a href="https://colab.research.google.com/github/HaaLeo/vague-requirements-scripts/blob/engineering%2Fdifferent-preprocessing/colab-notebooks/BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classify requirements as vague or not using [ktrain](https://github.com/amaiya/ktrain) and TensorFlow


## Install dependencies
*ktrain* requires TensorFlow 2.1. See [amaiya/ktrain#151](https://github.com/amaiya/ktrain/issues/151).
In addition, we install a forked version of eli5 to gain insight into the model's decision process, and a package of self-built helper functions for preprocessing MTurk result files.

In [56]:
!pip3 install -q tensorflow_gpu==2.1.0 ktrain==0.17.3
!pip3 install -q -U git+https://github.com/HaaLeo/vague-requirements-scripts
!pip3 install -q git+https://github.com/amaiya/eli5@tfkeras_0_10_1

  Building wheel for vaguerequirementslib (setup.py) ... done
  Building wheel for eli5 (setup.py) ... done


Check versions and enable logging

In [1]:
import tensorflow as tf
import ktrain
assert tf.__version__ == '2.1.0'
assert ktrain.__version__ == '0.17.3'

import logging
import sys

logging.basicConfig(
    format='%(asctime)s [%(threadName)-20.20s] [%(levelname)-5.5s]  %(message)s',
    stream=sys.stdout,
    level=logging.DEBUG)

2020-07-04 16:07:24,878 [MainThread          ] [DEBUG]  Loaded backend module://ipykernel.pylab.backend_inline version unknown.
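The format string above relies on `%`-style width and precision specifiers. As a small illustration (standard library only, with example values chosen here), `-20.20s` left-aligns a value and both pads and truncates it to exactly 20 characters, which is why `MainThread` is followed by spaces in the log line:

```python
# Illustrate the width/precision specifiers used in the logging format string.
thread_field = '%-20.20s' % 'MainThread'  # shorter than 20 characters: padded with spaces
level_field = '%-5.5s' % 'WARNING'        # longer than 5 characters: truncated

print(f'[{thread_field}] [{level_field}]')
```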


## Set Parameters
Set the parameters for this run.
ktrain v0.17.3 ignores `max_features` and `ngram_range`; see [amaiya/ktrain#190](https://github.com/amaiya/ktrain/issues/190).

In [2]:
indices_to_read = [0,2,3,4] # indicate which MTurk files shall be read.
DATA_FILE_NAMES = [f'corpus-batch-{i}-mturk.csv' for i in indices_to_read]

RANDOM_STATE = 1 # for seeding

LEARNING_RATE = 5e-5
EPOCHS = 4
MODEL_NAME = 'distilbert-base-uncased'
MAX_LEN = 512
BATCH_SIZE = 6
MAX_FEATURES = 35_000
NGRAM_RANGE = 1

PREPROCESS_MODE = 'distilbert'

## Load Dataset

### Mount Google Drive
Mount Google Drive to access the dataset.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Load Dataset Into Arrays

In [25]:
from vaguerequirementslib import read_csv_files, build_confusion_matrix, calc_majority_label
import pandas as pd

def read_drive_data(files_list: list, separator: str) -> pd.DataFrame:
    """
    Read the given CSV files and calculate the majority label for each requirement.

    Requirements with a tie in votes (e.g. one vote for vague and one for not vague) are dropped.

    Args:
        files_list (list): The CSV files to calculate the majority label for
        separator (str): The CSV separator

    Returns:
        pd.DataFrame: The dataframe containing the majority label.
    """
    df = read_csv_files(files_list, separator)
    confusion_matrix = build_confusion_matrix(df, drop_ties=True)
    return calc_majority_label(confusion_matrix)

# Read all data
df = read_drive_data(
    [f'/content/drive/My Drive/datasets/corpus/labeled/{file_name}' for file_name in DATA_FILE_NAMES],
    ','
  )
df.head()

2020-07-04 16:17:42,999 [MainThread          ] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-0-mturk.csv" with 200 rows.
2020-07-04 16:17:43,013 [MainThread          ] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-2-mturk.csv" with 194 rows.
2020-07-04 16:17:43,026 [MainThread          ] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-3-mturk.csv" with 198 rows.
2020-07-04 16:17:43,039 [MainThread          ] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-4-mturk.csv" with 196 rows.
2020-07-04 16:17:43,057 [MainThread          ] [INFO ]  Build confusion matrix.
2020-07-04 16:17:43,236 [MainThread          ] [INFO ]  Dropped 180 requirements due to ties.
2020-07-04 16:17:43,240 [MainThread          ] [INFO ]  Built confusion matrix including 214 of 394 requirements. 
2020-07-04 16:17:43,245 [MainThread          ] [INFO ]  Overall "vague" votes count = 9

KeyError: ignored
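To make the majority-vote step concrete, here is a minimal sketch of majority labeling with ties dropped. It is not the implementation in `vaguerequirementslib`; the function names and the `example_votes` structure are hypothetical and only illustrate the idea described in the docstring above.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_label(votes: List[int]) -> Optional[int]:
    """Return the most common vote, or None if there is no clear majority."""
    counts = Counter(votes).most_common()
    if not counts:
        return None  # no votes at all
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # the two most common labels tie
    return counts[0][0]


def label_requirements(votes_by_requirement: Dict[str, List[int]]) -> Dict[str, int]:
    """Keep only requirements with a clear majority (ties are dropped)."""
    labels = {}
    for requirement, votes in votes_by_requirement.items():
        label = majority_label(votes)
        if label is not None:
            labels[requirement] = label
    return labels


# Hypothetical votes: 1 = vague, 0 = not vague
example_votes = {
    'req-a': [1, 1],
    'req-b': [1, 0],  # tie, dropped
    'req-c': [0, 0],
}
labels = label_requirements(example_votes)
print(labels)
```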

### Split data set


In [34]:
from sklearn.model_selection import train_test_split
from typing import Tuple, List
from ktrain import text as txt


def split_dataset(dataframe: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Split the dataset into training (80%), validation (10%) and test (10%) sets.

    Args:
        dataframe (pd.DataFrame): The data frame to generate the data sets from.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: train_df, val_df, test_df
    """
    # Stratify on the label so that each split keeps the vague / not vague ratio
    train_df, val_test_df = train_test_split(dataframe, test_size=0.2, random_state=RANDOM_STATE, stratify=dataframe['majority_label'])
    val_df, test_df = train_test_split(val_test_df, test_size=0.5, random_state=RANDOM_STATE, stratify=val_test_df['majority_label'])

    print(f'Training dataset: vague count="{train_df["majority_label"].sum()}", not vague count="{train_df.shape[0] - train_df["majority_label"].sum()}"')
    print(f'Validation dataset: vague count="{val_df["majority_label"].sum()}", not vague count="{val_df.shape[0] - val_df["majority_label"].sum()}"')
    print(f'Test dataset: vague count="{test_df["majority_label"].sum()}", not vague count="{test_df.shape[0] - test_df["majority_label"].sum()}"')

    return train_df, val_df, test_df


def preprocess_data(train_df: pd.DataFrame, val_df: pd.DataFrame, test_df: pd.DataFrame) -> Tuple:
    """Preprocess each split separately; return the three data sets and the test preprocessor."""
    def _preprocess(my_df: pd.DataFrame) -> Tuple:
        # Pass a dummy validation frame so that texts_from_df() does not
        # split off part of my_df for validation; its result is discarded.
        dummy_df = pd.DataFrame.from_dict({'requirement': ['foo', 'bar'], 'majority_label': [0, 1]})
        return txt.texts_from_df(
            my_df, text_column='requirement', label_columns=['majority_label'], val_df=dummy_df,
            max_features=MAX_FEATURES, maxlen=MAX_LEN, ngram_range=NGRAM_RANGE,
            preprocess_mode=PREPROCESS_MODE, random_state=RANDOM_STATE)

    train_data, _, _ = _preprocess(train_df)
    val_data, _, _ = _preprocess(val_df)
    test_data, _, test_preproc = _preprocess(test_df)

    return train_data, val_data, test_data, test_preproc

# Split the data set
train_df, val_df, test_df = split_dataset(df)

# Preprocess for Transfer Learning
train_data, val_data, test_data, test_preproc = preprocess_data(train_df, val_df, test_df)


Training dataset: vague count="38", not vague count="133"
Validation dataset: vague count="5", not vague count="16"
Test dataset: vague count="5", not vague count="17"
preprocessing train...
language: en
train sequence lengths:
	mean : 21
	95percentile : 40
	99percentile : 63
2020-07-04 16:27:08,838 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-04 16:27:09,013 [MainThread          ] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/bert-base-uncased-vocab.txt HTTP/1.1" 200 0


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 1
	95percentile : 1
	99percentile : 1


  'If this is a sentence pair classification task, please cast to tuple.')


preprocessing train...
language: en
train sequence lengths:
	mean : 22
	95percentile : 38
	99percentile : 46
2020-07-04 16:27:09,367 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-04 16:27:09,546 [MainThread          ] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/bert-base-uncased-vocab.txt HTTP/1.1" 200 0


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 1
	95percentile : 1
	99percentile : 1


preprocessing train...
language: en
train sequence lengths:
	mean : 21
	95percentile : 36
	99percentile : 38
2020-07-04 16:27:09,782 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-04 16:27:09,951 [MainThread          ] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/bert-base-uncased-vocab.txt HTTP/1.1" 200 0


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 1
	95percentile : 1
	99percentile : 1


## STEP 1: Preprocess Data and Create a Transformer Model

We will use [DistilBERT](https://arxiv.org/abs/1910.01108).

In [48]:
import ktrain
t = txt.Transformer(MODEL_NAME, maxlen=MAX_LEN, class_names=['not-vague', 'vague']) # 0=not-vague 1=vague
t.preprocess_train_called = True # Simulate call to preprocess_train()
# t.preprocess_train(['foo', 'bar'], [0, 1]) 
# val_data = t.preprocess_test(x_val, y_val)
# test_data = t.preprocess_test(x_test, y_test)

model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=train_data, val_data=val_data, batch_size=BATCH_SIZE)

2020-07-04 16:45:04,282 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-04 16:45:04,427 [MainThread          ] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/distilbert-base-uncased-config.json HTTP/1.1" 200 0
2020-07-04 16:45:04,434 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): cdn.huggingface.co:443
2020-07-04 16:45:04,697 [MainThread          ] [DEBUG]  https://cdn.huggingface.co:443 "HEAD /distilbert-base-uncased-tf_model.h5 HTTP/1.1" 200 0


## STEP 2: Train the Model

In [49]:
learner.fit_onecycle(LEARNING_RATE, EPOCHS)



begin training using onecycle policy with max lr of 5e-05...
Train for 29 steps, validate for 1 steps
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7f6dde9aa550>
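The "29 steps" in the training log follow from the training set size and `BATCH_SIZE`. A quick check, with the values copied from the cells above:

```python
import math

n_train = 38 + 133  # vague + not vague training examples
batch_size = 6      # BATCH_SIZE

# The last batch is only partially filled, so the division is rounded up.
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)
```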

## STEP 3: Evaluate and Inspect the Model

In [50]:
test_result = learner.validate(class_names=t.get_classes(), val_data=test_data)
print(test_result)

              precision    recall  f1-score   support

   not-vague       0.79      0.88      0.83        17
       vague       0.33      0.20      0.25         5

    accuracy                           0.73        22
   macro avg       0.56      0.54      0.54        22
weighted avg       0.69      0.73      0.70        22

[[15  2]
 [ 4  1]]
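The report values can be recomputed from the confusion matrix alone. The sketch below assumes the usual layout (rows are true classes, columns are predicted classes), which is consistent with the support column above; the helper `per_class_metrics` is written for this illustration.

```python
# Confusion matrix as printed above: rows = true class, columns = predicted class
confusion = [[15, 2],  # true not-vague
             [4, 1]]   # true vague
class_names = ['not-vague', 'vague']


def per_class_metrics(cm, index):
    """Return (precision, recall, f1) for the class at the given index."""
    tp = cm[index][index]
    fp = sum(row[index] for row in cm) - tp  # predicted as this class, but wrong
    fn = sum(cm[index]) - tp                 # this class, but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for i, name in enumerate(class_names):
    p, r, f = per_class_metrics(confusion, i)
    print(f'{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}')

accuracy = sum(confusion[i][i] for i in range(len(confusion))) / sum(map(sum, confusion))
print(f'accuracy={accuracy:.2f}')
```

Only one of the five vague test requirements is recognized, which is what the low recall for `vague` reflects.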


Let's examine the validation examples with the highest loss, i.e. the ones the model got most wrong.

In [43]:
learner.view_top_losses(preproc=t)

----------
id:6 | loss:3.12 | true:vague | pred:not-vague)

----------
id:0 | loss:2.69 | true:vague | pred:not-vague)

----------
id:9 | loss:2.48 | true:vague | pred:not-vague)

----------
id:11 | loss:1.67 | true:vague | pred:not-vague)
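To get a feeling for these loss values, we can convert them back into probabilities. This sketch assumes the reported loss is the per-example cross-entropy, i.e. `loss = -ln(p_true)`, where `p_true` is the probability the model assigned to the true class; the `(id, loss)` pairs are copied from the listing above.

```python
import math

# (id, loss) pairs copied from the listing above
top_losses = [(6, 3.12), (0, 2.69), (9, 2.48), (11, 1.67)]

# Invert loss = -ln(p_true) to recover the probability of the true class
implied_probability = {example_id: math.exp(-loss) for example_id, loss in top_losses}

for example_id, p_true in implied_probability.items():
    print(f'id {example_id}: p(true class) ~ {p_true:.3f}')
```

Under that assumption, the model gave the true class `vague` a probability below 5% for the worst example.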



In [47]:
top_loss_req = test_df.iloc[[6]]['requirement'] # Requirement that produces top loss
print(top_loss_req)
print(test_df.iloc[9]['majority_label'])

23    Conformance tests described in clause 5.3.6 sh...
Name: requirement, dtype: object
0


All four top-loss examples listed above are labeled `vague` but were predicted as `not-vague`: the model's largest errors are vague requirements that it misses.

## STEP 3.1: Gather Results

Gather results and write them to the drive.

In [24]:
from vaguerequirementslib import TP, TN, FP, FN, calc_all_metrics
result_data = {
    'metrics':{
        'vague': {
            TP: test_result[1][1],
            FP: test_result[0][1],
            TN: test_result[0][0],
            FN: test_result[1][0]
        },
        'not_vague': {
            TP: test_result[0][0],
            FP: test_result[1][0],
            TN: test_result[1][1],
            FN: test_result[0][1]
        }
    },
    'misc': {   
        'data_files': DATA_FILE_NAMES,
        'random_state': RANDOM_STATE
    },
    'hyperparameter': {
        'learning_rate': LEARNING_RATE,
        'epochs': EPOCHS,
        'model_name': MODEL_NAME,
        'max_len': MAX_LEN,
        'batch_size': BATCH_SIZE
        # 'max_features': MAX_FEATURES,
        # 'ngram_range': NGRAM_RANGE
    }
}
result_data['metrics']['not_vague'].update(calc_all_metrics(**result_data['metrics']['not_vague']))
result_data['metrics']['vague'].update(calc_all_metrics(**result_data['metrics']['vague']))
print(result_data)

{'metrics': {'vague': {'true_positive': 4, 'false_positive': 13, 'true_negative': 4, 'false_negative': 1, 'accuracy': 0.36363636363636365, 'precision': 0.23529411764705882, 'recall': 0.8, 'specificity': 0.23529411764705882, 'false_negative_rate': 0.2, 'false_positive_rate': 0.7647058823529411, 'f1_score': 0.3636363636363636}, 'not_vague': {'true_positive': 4, 'false_positive': 1, 'true_negative': 4, 'false_negative': 13, 'accuracy': 0.36363636363636365, 'precision': 0.8, 'recall': 0.23529411764705882, 'specificity': 0.8, 'false_negative_rate': 0.7647058823529411, 'false_positive_rate': 0.2, 'f1_score': 0.3636363636363636}}}
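The cell above only prints the results. One way to persist them is to dump the dictionary as JSON. The following is a minimal sketch: the helper `save_results` and the file name `result.json` are hypothetical, `example_result` is a small stand-in for `result_data`, and a temporary directory is used in place of a folder on the mounted drive.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for `result_data`; the counts are copied from the printed output above
example_result = {
    'metrics': {
        'vague': {'true_positive': 4, 'false_positive': 13,
                  'true_negative': 4, 'false_negative': 1},
    },
}


def save_results(data: dict, directory: str, file_name: str = 'result.json') -> Path:
    """Write the result dictionary as JSON and return the path of the file."""
    target = Path(directory) / file_name
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open('w', encoding='utf-8') as handle:
        json.dump(data, handle, indent=2)
    return target


# A temporary directory stands in for a folder on the mounted drive.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_results(example_result, tmp_dir)
    restored = json.loads(saved_path.read_text(encoding='utf-8'))

print(restored == example_result)
```

Counts taken directly from a NumPy confusion matrix are NumPy integers, which `json.dump` cannot serialize; converting them with `int()` first avoids that error.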


## STEP 4: Making Predictions on New Data in Deployment

In [35]:
predictor = ktrain.get_predictor(learner.model, preproc=test_preproc)

In [36]:
predictor.predict(top_loss_req)

'1'

In [27]:
# predicted probability scores for each category
predictor.predict_proba(top_loss_req)

array([0.9278868 , 0.07211317], dtype=float32)

In [28]:
predictor.get_classes()

['not-vague', 'vague']

The scores follow the order of the class list, so `not-vague` is assigned the highest probability (about 0.93).
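Pairing the class list from `get_classes()` with the scores from `predict_proba()` makes the mapping explicit. A small sketch using the values printed above:

```python
# Values copied from the outputs above
class_names = ['not-vague', 'vague']
probabilities = [0.9278868, 0.07211317]

# The i-th score belongs to the i-th class name
scores = dict(zip(class_names, probabilities))
predicted_class = max(scores, key=scores.get)

print(predicted_class, round(scores[predicted_class], 3))
```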

Let's invoke the `explain` method to see which words contribute most to the classification.



In [33]:
predictor.explain(top_loss_req)

| Contribution? | Feature |
|---|---|
| 1.519 | Highlighted in text (sum) |
| -0.312 | `<BIAS>` |


In the rendered explanation, the words in the darkest shade of green contribute most to the classification.

We can save and reload our predictor for later deployment.

In [None]:
predictor.save('/tmp/my_distilbert_predictor')

In [None]:
reloaded_predictor = ktrain.load_predictor('/tmp/my_distilbert_predictor')

In [None]:
reloaded_predictor.predict(top_loss_req)