<a href="https://colab.research.google.com/github/HaaLeo/vague-requirements-scripts/blob/master/colab-notebooks/BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classify requirements as vague or not vague using [ktrain](https://github.com/amaiya/ktrain) and TensorFlow


## Install dependencies
*ktrain* requires TensorFlow 2.1; see [amaiya/ktrain#151](https://github.com/amaiya/ktrain/issues/151).
We also install a forked version of eli5 to gain insight into the model's decision process, and a set of self-built helper functions to preprocess MTurk result files.

In [1]:
%load_ext google.colab.data_table
!pip3 install -q tensorflow_gpu==2.1.0 ktrain==0.18.5 imbalanced-learn==0.5.0 psutil==5.7.0
!pip3 install -q -U git+https://github.com/HaaLeo/vague-requirements-scripts
!pip3 install -q git+https://github.com/amaiya/eli5@tfkeras_0_10_1


Check versions and enable logging

In [2]:
import tensorflow as tf
import ktrain
import imblearn
import psutil

assert tf.__version__ == '2.1.0'
assert ktrain.__version__ == '0.18.5'
assert imblearn.__version__ == '0.5.0'
assert psutil.__version__ == '5.7.0'

import logging
import sys

logging.basicConfig(
    format='%(asctime)s [%(name)-20.20s] [%(levelname)-5.5s]  %(message)s',
    stream=sys.stdout,
    level=logging.INFO)

LOGGER = logging.getLogger('colab-notebook')
LOGGER.info('Hello from colab notebook')

2020-08-11 15:38:24,765 [colab-notebook      ] [INFO ]  Hello from colab notebook


## Set Parameters
Set the parameters for this run.
ktrain ignores `max_features` and `ngram_range` in v0.17.5; see [amaiya/ktrain/issues#190](https://github.com/amaiya/ktrain/issues/190).

In [3]:
#@title Set the parameters and hyperparameters
#@markdown Set the data files and the proportions of the train, validation and test sets in the source code

def set_parameters() -> dict:
    class_names = ['not-vague', 'vague'] # 0=not-vague 1=vague

    # The following parameters can be edited with the form fields
    random_state = 1  #@param {type:"integer"}

    resampling_strategy = 'random_downsampling' #@param ["random_downsampling", "random_upsampling"]

    kfold_splits = 10 #@param {type:"integer"}
    learning_rate = 1e-5 #@param {type:"number"}
    epochs = 2 #@param {type:"integer"}
    model_name = 'distilbert-base-uncased' #@param {type:"string"}
    max_len = 256 #@param {type:"integer"}
    batch_size = 6 #@param {type:"integer"}

    return {
        'class_names': class_names,

        'random_state': random_state,

        'resampling_strategy': resampling_strategy,

        'kfold_splits': kfold_splits,
        'learning_rate': learning_rate,
        'epochs': epochs,
        'model_name': model_name,
        'max_len': max_len,
        'batch_size': batch_size
    }

params = set_parameters()
LOGGER.info(f'Successfully set parameters="{params}".')

2020-08-11 15:38:24,785 [colab-notebook      ] [INFO ]  Successfully set parameters="{'class_names': ['not-vague', 'vague'], 'random_state': 1, 'resampling_strategy': 'random_downsampling', 'kfold_splits': 10, 'learning_rate': 1e-05, 'epochs': 2, 'model_name': 'distilbert-base-uncased', 'max_len': 256, 'batch_size': 6}".


## Load Dataset

### Mount Google Drive
Mount Google Drive to access the dataset.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Load Dataset

In [5]:
from vaguerequirementslib import read_csv_file

train_df = read_csv_file('/content/drive/My Drive/datasets/corpus/train_data.csv')
test_df = read_csv_file('/content/drive/My Drive/datasets/corpus/test_data.csv')

test_df_vague_count = int(test_df[test_df.majority_label == 1].majority_label.value_counts())
test_df_not_vague_count = int(test_df[test_df.majority_label == 0].majority_label.value_counts())
LOGGER.info(f'Test data frame consists of {test_df_vague_count} vague data points and {test_df_not_vague_count} not vague data points.')

df_vague_count = int(train_df[train_df.majority_label == 1].majority_label.value_counts())
df_not_vague_count = int(train_df[train_df.majority_label == 0].majority_label.value_counts())
LOGGER.info(f'Train data frame consists of {df_vague_count} vague data points and {df_not_vague_count} not vague data points.')
train_df.head()

2020-08-11 15:44:42,654 [colab-notebook      ] [INFO ]  Test data frame consists of 59 vague data points and 219 not vague data points.
2020-08-11 15:44:42,661 [colab-notebook      ] [INFO ]  Train data frame consists of 530 vague data points and 1968 not vague data points.


| Unnamed: 0 | requirement | vague_count | not_vague_count | majority_label |
|---|---|---|---|---|
| 0 | User selects a current date of the current all... | 1 | 2 | 0 |
| 1 | For complex systems, a series of PDRs for each... | 2 | 0 | 1 |
| 2 | All instances in which the microcircuit is not... | 0 | 2 | 0 |
| 3 | The cables, their routing and dressing should ... | 1 | 2 | 0 |
| 4 | The step frequencies Fstep,X are defined in ta... | 1 | 2 | 0 |


## Utility Functions

### Clean up disk
Remove temporary files and release memory

In [6]:
def cleanup_disk() -> None:
    # Try to release memory
    tf.keras.backend.clear_session()
    # Prevent disk overflow
    !rm -r /var/tmp/*
    !rm -r /tmp/*
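
The shell commands above only work inside IPython. As a minimal sketch of the same cleanup idea in plain Python, the hypothetical helper `clear_directory` below (not part of this notebook or of `vaguerequirementslib`) empties a directory with `pathlib` and `shutil`. It is demonstrated on a throwaway directory rather than on `/tmp`.

```python
import shutil
import tempfile
from pathlib import Path


def clear_directory(directory: str) -> int:
    """Delete every entry inside `directory` and return how many were removed."""
    removed = 0
    for entry in Path(directory).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            # Remove sub-directories recursively
            shutil.rmtree(entry, ignore_errors=True)
        else:
            # Remove plain files and symlinks
            entry.unlink()
        removed += 1
    return removed


# Demonstrate on a throwaway directory instead of /tmp.
with tempfile.TemporaryDirectory() as scratch:
    (Path(scratch) / 'checkpoint').mkdir()
    (Path(scratch) / 'checkpoint' / 'weights.bin').write_bytes(b'\x00')
    (Path(scratch) / 'log.txt').write_text('done')
    removed_count = clear_directory(scratch)
    remaining = list(Path(scratch).iterdir())

print(removed_count, remaining)  # 2 []
```

Unlike `rm -r /tmp/*`, this variant keeps the directory itself and reports how many entries it removed, which can be logged.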

### Resample and preprocess

In [7]:
from typing import Tuple, List

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from ktrain import text as txt
import pandas as pd

def resample(data_frame: pd.DataFrame, random_state: int, resampling_strategy: str = None) -> pd.DataFrame:
    """
    Resample the given data frame so that it contains equally many vague and not-vague requirements.

    Args:
        data_frame (pd.DataFrame): The data frame to resample.
        random_state (int): Seed that makes the resampling reproducible.
        resampling_strategy (str): The resampling strategy to use, either "random_downsampling", "random_upsampling" or None.

    Returns:
        pd.DataFrame: The resampled data frame.

    Raises:
        ValueError: If an unknown resampling strategy is provided.
    """
    if resampling_strategy:
        if resampling_strategy == 'random_downsampling':
            sampler = RandomUnderSampler(sampling_strategy=1., random_state=random_state)
        elif resampling_strategy == 'random_upsampling':
            sampler = RandomOverSampler(sampling_strategy=1., random_state=random_state)
        else:
            # Fail early instead of hitting an undefined sampler below
            raise ValueError(f'Unknown resampling strategy="{resampling_strategy}".')

        x, y = sampler.fit_resample(data_frame.requirement.to_numpy().reshape(-1, 1), data_frame.majority_label)
        result = pd.DataFrame({'requirement': x.flatten(), 'majority_label': y})

        LOGGER.info(f'Resampled dataset with strategy="{resampling_strategy}": vague count="{result.sum()["majority_label"]}", not vague count="{result.shape[0] - result.sum()["majority_label"]}"')

    else:
        LOGGER.warning('Data frame will not be resampled, because no strategy was provided.')
        result = data_frame
    return result

LOGGER.info('Defined resampling and preprocessing functions.')

2020-08-11 15:44:42,715 [colab-notebook      ] [INFO ]  Defined resampling and preprocessing functions.
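
To illustrate what random downsampling to a 1:1 ratio means, here is a minimal pure-Python sketch. It is not imbalanced-learn's implementation; the function `random_downsample` and the toy data are hypothetical and only show the idea of randomly dropping majority-class samples until both classes are equally frequent.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def random_downsample(samples: List[Tuple[str, int]], seed: int) -> List[Tuple[str, int]]:
    """Randomly drop samples until every label is as frequent as the rarest one."""
    by_label: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    # The rarest class determines how many samples each class keeps
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)

    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    return balanced


# Hypothetical toy data: 2 vague (1) and 6 not-vague (0) requirements.
toy = [(f'req {i}', 1 if i < 2 else 0) for i in range(8)]
balanced = random_downsample(toy, seed=1)
label_counts = {label: sum(1 for _, l in balanced if l == label) for label in (0, 1)}
print(label_counts)  # {0: 2, 1: 2}
```

Random upsampling works the other way round: minority-class samples are drawn with replacement until they match the majority class.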


### Create result object

Gather results and calculate metrics.

In [8]:
import os
import json
from os import path

from vaguerequirementslib import TP, TN, FP, FN, calc_all_metrics, calc_mean_average_precision

def calc_metrics(evaluation_result, map_df=None) -> dict:
    metrics_dict = {
        'vague': {
            TP: int(evaluation_result[1][1]),
            FP: int(evaluation_result[0][1]),
            TN: int(evaluation_result[0][0]),
            FN: int(evaluation_result[1][0])
        },
        'not_vague': {
            TP: int(evaluation_result[0][0]),
            FP: int(evaluation_result[1][0]),
            TN: int(evaluation_result[1][1]),
            FN: int(evaluation_result[0][1])
        }
    }

    metrics_dict['not_vague'].update(calc_all_metrics(**metrics_dict['not_vague']))
    metrics_dict['vague'].update(calc_all_metrics(**metrics_dict['vague']))

    if map_df is not None: # Mean average precision (a data frame has no unambiguous truth value)
        map_result = calc_mean_average_precision(map_df)
        metrics_dict['mean_average_precision'] = map_result[0]
        metrics_dict['not_vague']['average_precision'] = map_result[1]
        metrics_dict['vague']['average_precision'] = map_result[2]

    return metrics_dict

def build_fold_result(train_result, val_result, test_result, learning_history, train_map_df, val_map_df, test_map_df) -> dict:
    fold_result = {
        'metrics':{
        },
        'learning_history': learning_history
    }

    fold_result['metrics']['train'] = calc_metrics(train_result, train_map_df)
    fold_result['metrics']['validation'] = calc_metrics(val_result, val_map_df)
    fold_result['metrics']['test'] = calc_metrics(test_result, test_map_df)

    LOGGER.debug('Successfully built fold result.')

    return fold_result


def build_result_data(fold_results: List, df_vague_count, df_not_vague_count, test_df_vague_count, test_df_not_vague_count, **kwargs) -> dict:
    result_data = {
        'misc': {   
            'random_state': kwargs['random_state']
        },
        'data_set':{
            'summary': {
                'vague_data_points': df_vague_count,
                'not_vague_data_points': df_not_vague_count,
            },
            'test': {
                'vague_data_points': test_df_vague_count,
                'not_vague_data_points': test_df_not_vague_count
            },
            'resampling_strategy': kwargs['resampling_strategy']
        },
        'fold_results': fold_results,
        'hyperparameter': {
            'learning_rate': kwargs['learning_rate'],
            'epochs': kwargs['epochs'],
            'model_name': kwargs['model_name'],
            'max_len': kwargs['max_len'],
            'batch_size': kwargs['batch_size']
        }
    }

    LOGGER.debug('Successfully built result.')
    return result_data


def insert_probabilities(data_frame: pd.DataFrame, pred) -> None:
    predictions = pred.predict_proba(list(data_frame.loc[:, 'requirement'])).transpose()
    data_frame.loc[:, 'not_vague_prob'] = predictions[0] 
    data_frame.loc[:, 'vague_prob'] = predictions[1] 

LOGGER.info('Created functions for result object creation.')


2020-08-11 15:44:42,766 [colab-notebook      ] [INFO ]  Created functions for result object creation.
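
The indexing in `calc_metrics` is consistent with a 2x2 confusion matrix whose rows are true labels and whose columns are predicted labels. The sketch below shows, under that assumption, how precision, recall and F1 follow from such a matrix. It does not reproduce `calc_all_metrics` from `vaguerequirementslib`; the helpers `safe_divide` and `metrics_for_class`, the result keys and the toy matrix are hypothetical.

```python
from typing import Dict, List


def safe_divide(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics_for_class(confusion: List[List[int]], positive: int) -> Dict[str, float]:
    """Derive precision, recall and F1 for one class of a 2x2 confusion matrix.

    confusion[i][j] is the number of samples with true label i predicted as j.
    """
    negative = 1 - positive
    tp = confusion[positive][positive]
    fp = confusion[negative][positive]
    fn = confusion[positive][negative]

    precision = safe_divide(tp, tp + fp)
    recall = safe_divide(tp, tp + fn)
    f1 = safe_divide(2 * precision * recall, precision + recall)
    return {'precision': precision, 'recall': recall, 'f1_score': f1}


# Hypothetical matrix: rows are true labels, columns are predictions.
toy_confusion = [[8, 2],   # true not-vague: 8 correct, 2 predicted vague
                 [1, 4]]   # true vague: 1 predicted not-vague, 4 correct
vague_metrics = metrics_for_class(toy_confusion, positive=1)
not_vague_metrics = metrics_for_class(toy_confusion, positive=0)
print(round(vague_metrics['f1_score'], 3))  # 0.727
```

Swapping the positive class swaps the roles of the matrix cells, which is why `calc_metrics` fills the `vague` and `not_vague` dictionaries from mirrored indices.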


### Save evaluation result

In [9]:
import os
import numpy as np


def save_data(data: dict, file_path: str) -> None:
    class NumpyJSONEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, np.integer):
                return int(obj)
            elif isinstance(obj, np.floating):
                return float(obj)
            elif isinstance(obj, np.ndarray):
                return obj.tolist()
            else:
                return super(NumpyJSONEncoder, self).default(obj)

    # Save the evaluation result (test_data results)
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    with open(file_path, mode='w', encoding='utf-8') as json_file:
        json.dump(data, json_file, indent=4, cls=NumpyJSONEncoder)
    LOGGER.info(f'Successfully saved data to file="{file_path}".')

def load_data(file_path: str) -> dict:
    with open(file_path, mode='r', encoding='utf-8') as json_file:
        data = json.load(json_file)
    return data

### K-Fold Cross Validation

In [10]:
from sklearn.model_selection import StratifiedKFold

def perform_kfold_cross_validation(transformer, kfold_df, test_frame, **kwargs):
    fold_results = []
    fold_counter = 1

    # Preprocess test data
    test_data = transformer.preprocess_train(list(test_frame['requirement']), list(test_frame['majority_label']))


    # Create k folds
    kfold = StratifiedKFold(n_splits=kwargs['kfold_splits'], shuffle=True, random_state=kwargs['random_state'])
    for train_idx, val_idx in kfold.split(kfold_df.requirement, kfold_df.majority_label):
        LOGGER.info(f'\n######################\n### Process fold {fold_counter} / {kwargs["kfold_splits"]}\n######################')

        # Get training data and validation data for this fold
        # Copy the slices, because probability columns are added to them later
        curr_train_df = kfold_df.iloc[train_idx].copy()
        val_df = kfold_df.iloc[val_idx].copy()
        LOGGER.info(f'Training dataset: vague count="{int(curr_train_df[curr_train_df.majority_label == 1].majority_label.value_counts())}", not vague count="{int(curr_train_df[curr_train_df.majority_label == 0].majority_label.value_counts())}".')
        LOGGER.info(f'Validation dataset: vague count="{int(val_df[val_df.majority_label == 1].majority_label.value_counts())}", not vague count="{int(val_df[val_df.majority_label == 0].majority_label.value_counts())}".')
        
        # Resample train df 
        LOGGER.info('Resample training data set.')
        curr_train_df = resample(curr_train_df, kwargs['random_state'], kwargs['resampling_strategy'])

        LOGGER.info(f'Preprocess training and validation data for the model="{kwargs["model_name"]}".')
        train_data = transformer.preprocess_train(list(curr_train_df['requirement']), list(curr_train_df['majority_label']))
        val_data = transformer.preprocess_train(list(val_df['requirement']), list(val_df['majority_label']))

        # Get the learner for the new fold
        LOGGER.info('Fit the model for one cycle.')
        learner = ktrain.get_learner(transformer.get_classifier(), train_data=train_data, val_data=val_data, batch_size=kwargs['batch_size'])
        
        # Find a suitable learning rate
        # learner.lr_find(show_plot=True, start_lr=1e-09, max_epochs=kwargs['epochs'])
        # print(learner.lr_estimate())
        
        # Fit the model
        learning_history = learner.fit_onecycle(kwargs['learning_rate'], kwargs['epochs']).history
        # learner.plot('loss')

        # Evaluate the model
        train_result = learner.validate(class_names=kwargs['class_names'], val_data=train_data, print_report=False)
        val_result = learner.validate(class_names=kwargs['class_names'], val_data=val_data, print_report=False)
        test_result = learner.validate(class_names=kwargs['class_names'], val_data=test_data)

        # Prepare dfs for map calculation
        current_predictor = ktrain.get_predictor(learner.model, preproc=transformer)
        insert_probabilities(curr_train_df, current_predictor)
        insert_probabilities(val_df, current_predictor)
        insert_probabilities(test_frame, current_predictor)
        
        # Build fold result (use the test frame passed in, not the global one)
        fold_result = build_fold_result(train_result, val_result, test_result, learning_history, curr_train_df, val_df, test_frame)
        fold_results.append(fold_result)

        fold_counter += 1

    LOGGER.info('Successfully trained model.')

    
    return fold_results, learner

LOGGER.info('Defined k-fold cross validation.')

2020-08-11 15:44:42,842 [colab-notebook      ] [INFO ]  Defined k-fold cross validation.
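
Stratification keeps the ratio of vague to not-vague requirements roughly equal in every fold. The following sketch illustrates that idea in pure Python by dealing the indices of each class round-robin across the folds. It is a simplified illustration, not scikit-learn's `StratifiedKFold` algorithm, and it does not shuffle; `stratified_folds` and the toy labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_folds(labels: List[int], n_splits: int) -> List[Tuple[List[int], List[int]]]:
    """Split sample indices into n_splits (train, validation) pairs with similar label ratios."""
    fold_members: List[List[int]] = [[] for _ in range(n_splits)]
    by_label: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    # Deal the indices of each class round-robin across the folds.
    for indices in by_label.values():
        for position, index in enumerate(indices):
            fold_members[position % n_splits].append(index)

    splits = []
    for fold in range(n_splits):
        val_idx = sorted(fold_members[fold])
        train_idx = sorted(i for other in range(n_splits) if other != fold for i in fold_members[other])
        splits.append((train_idx, val_idx))
    return splits


# Hypothetical labels: 4 vague (1) and 8 not-vague (0) requirements.
toy_labels = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
splits = stratified_folds(toy_labels, n_splits=4)
vague_per_val_fold = [sum(toy_labels[i] for i in val_idx) for _, val_idx in splits]
print(vague_per_val_fold)  # [1, 1, 1, 1]
```

In the notebook, only the training part of each fold is resampled afterwards, so the validation part keeps the original class ratio.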


### Create a Transformer Model and Train it

In [11]:
from datetime import datetime
from pytz import timezone

def create_train_save(result_file_path, kfold_df, test_df, **kwargs):
    """Create and train a model. Afterwards save its evaluation results."""
    # Create the transformer
    t = txt.Transformer(kwargs['model_name'], maxlen=kwargs['max_len'], class_names=kwargs['class_names'])

    # Perform k fold cross validation
    fold_results, learner = perform_kfold_cross_validation(t, kfold_df, test_df, **kwargs)

    predictor = ktrain.get_predictor(learner.model, preproc=t)
    
    # Build and save the evaluation result
    eval_result = build_result_data(fold_results, df_vague_count, df_not_vague_count, test_df_vague_count, test_df_not_vague_count, **kwargs)
    save_data(eval_result, result_file_path)

    return learner, predictor

LOGGER.info('Defined method to create and train a model and save its results')

2020-08-11 15:44:42,856 [colab-notebook      ] [INFO ]  Defined method to create and train a model and save its results


## Main entry points

The cells below are the main entry points. They run a grid search, train a single model, or search for a classification threshold.


### Grid Search
Perform a grid search to find good hyperparameters.

In [None]:
from copy import deepcopy
from os import path
const_params = {
    'class_names': ['not-vague', 'vague'],
    'random_state': 1,
}

param_grid = {
    'resampling_strategy': ['random_upsampling', 'random_downsampling'],
    'kfold_splits': [4, 8],
    'learning_rate': [
        2e-5, 3e-5,
        4e-5, 5e-5, # e-5
    ],
    'epochs': [1, 2, 3],
    'model_name': [
        # 'distilbert-base-uncased',
        # 'bert-base-uncased',
        'nghuyong/ernie-2.0-en'
    ],
    'max_len': [64, 128],
    'batch_size': [16, 32]
}

LOGGER.info('Start grid search.')
LOGGER.info('Search for checkpoint.')

checkpoint_found = False
checkpoint_data = None
skipped_config_counter = 0

for model_name in param_grid['model_name']:
    for resampling_strategy in param_grid['resampling_strategy']:
        for kfold_splits in param_grid['kfold_splits']:
            for epochs in param_grid['epochs']:
                for max_len in  param_grid['max_len']:
                    for batch_size in  param_grid['batch_size']:
                        for learning_rate in param_grid['learning_rate']:
                            # For every triggered fitting run create a new directory where the results will be saved
                            now = datetime.now(timezone('Europe/Berlin'))

                            result_dir = f'/content/drive/My Drive/runs/grid-search/earnie2.0'
                            eval_file = now.strftime('%Y-%m-%d_%H-%M-%S-evaluation.json')
                            result_file_path = path.join(result_dir, eval_file)
                            checkpoint_file_path = path.join(result_dir, 'grid-search-checkpoint.json')

                            current_params = deepcopy(const_params)
                            current_params.update({
                                'resampling_strategy': resampling_strategy,

                                'kfold_splits': kfold_splits,
                                'learning_rate': learning_rate,
                                'epochs': epochs,
                                'model_name': model_name,
                                'max_len': max_len,
                                'batch_size': batch_size
                            })

                            if not checkpoint_found and not checkpoint_data:
                                # Load checkpoint once
                                try:
                                    checkpoint_data = load_data(checkpoint_file_path)
                                except FileNotFoundError:
                                    # Start fresh grid search if no checkpoint was found
                                    checkpoint_found = True
                                    LOGGER.info(f'No checkpoint exists at path="{checkpoint_file_path}". Start fresh grid search.')
                                    save_data(param_grid, path.join(result_dir, 'grid-search-param-grid.json'))

                            if not checkpoint_found:
                                # Skip every configuration up to and including the checkpointed one,
                                # then continue with the configuration that follows it.
                                if checkpoint_data == current_params:
                                    checkpoint_found = True
                                    LOGGER.info(f'Checkpoint was found. Skipped {skipped_config_counter} parameter configurations.')
                                else:
                                    skipped_config_counter += 1
                                continue

                            LOGGER.info(f'Consuming {psutil.cpu_percent()}% CPU and {psutil.virtual_memory().percent}% RAM.')
                            LOGGER.info(f'Grid search with parameters="{current_params}".')
                            create_train_save(result_file_path, train_df, test_df, **current_params)

                            # Save checkpoint configuration
                            LOGGER.info('Save grid-search checkpoint.')
                            save_data(current_params, checkpoint_file_path)

                            cleanup_disk()

              precision    recall  f1-score   support

   not-vague       0.89      0.76      0.82       126
       vague       0.36      0.59      0.45        29

    accuracy                           0.73       155
   macro avg       0.63      0.67      0.63       155
weighted avg       0.79      0.73      0.75       155
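
The grid search enumerates every combination of the grid's values and resumes after the last checkpointed configuration. As a compact sketch of that idea, the hypothetical helpers below use `itertools.product` instead of nested loops. The names `iter_configurations` and `remaining_configurations` and the toy grid are assumptions for illustration, and the enumeration order differs from the nested loops in the cell above.

```python
from itertools import product
from typing import Dict, Iterator, List, Optional


def iter_configurations(grid: Dict[str, List]) -> Iterator[Dict]:
    """Yield one dict per combination of the grid's values, in a stable order."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def remaining_configurations(grid: Dict[str, List], checkpoint: Optional[Dict]) -> List[Dict]:
    """Return the configurations that come after the checkpointed one (all if None)."""
    configurations = list(iter_configurations(grid))
    if checkpoint is None:
        return configurations
    if checkpoint not in configurations:
        raise ValueError('Checkpoint does not belong to this grid.')
    return configurations[configurations.index(checkpoint) + 1:]


# Hypothetical, much smaller grid than the one used in this notebook.
toy_grid = {'epochs': [1, 2], 'batch_size': [16, 32]}
last_finished = {'epochs': 1, 'batch_size': 32}
todo = remaining_configurations(toy_grid, last_finished)
print(todo)
```

With no checkpoint, all four configurations are returned, so the first configuration of a fresh run is never skipped.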



### Train single model
Create and train a single model

In [None]:
from os import path

single_train_params = {
    'class_names': ['not-vague', 'vague'],
    'random_state': None,

    # Parameters with the highest F-score for BERT
    'resampling_strategy': 'random_downsampling',
    'kfold_splits': 4,
    
    'learning_rate': 1e-05,
    'epochs': 2,
    'model_name': 'bert-base-uncased',
    'max_len': 128,
    'batch_size': 16
}

good_model_found = False
the_best_predictor = None
best_f1_score = 0.
the_best_pred_path = None

while not good_model_found:     
    # Create a new directory for each run
    now = datetime.now(timezone('Europe/Berlin'))
    result_dir = f'/content/drive/My Drive/runs/bert/2/{now.strftime("%Y-%m-%d_%H-%M-%S")}'
    result_file_path = path.join(result_dir, 'evaluation.json')

    # Create and train model, save its evaluation
    the_learner, the_predictor = create_train_save(result_file_path, train_df, test_df, **single_train_params)

    eval_data = load_data(result_file_path)

    current_f1_score = eval_data['fold_results'][-1]['metrics']['test']['vague']['f1_score']
    if current_f1_score > best_f1_score:
        # Remove the previous best predictor, if there is one
        if the_best_pred_path is not None:
            !rm -r '{the_best_pred_path}'

        # Update best predictor and save it
        best_f1_score = current_f1_score
        the_best_predictor = the_predictor
        the_best_pred_path = path.join(result_dir, 'predictor')
        
        LOGGER.info(f'Save new best predictor, which reached f1 score {best_f1_score}.')
        the_best_predictor.save(the_best_pred_path)

    good_model_found = current_f1_score >= 0.55

    if good_model_found:
        LOGGER.info('Found a good model. Saving its predictor and exit search.')
        # Save the corresponding model (predictor)
    else:
        LOGGER.info('Found no good model yet. Continue search.')

    cleanup_disk()


### Thresholding
Search for the best classification threshold.

In [None]:
from vaguerequirementslib import predict_with_threshold
from sklearn.metrics import confusion_matrix
import csv

def evaluate_threshold(ground_truth, probabilities, threshold) -> dict:
    predictions = predict_with_threshold(probabilities, threshold)

    cm = confusion_matrix(ground_truth, predictions)

    metrics = calc_metrics(cm)['vague']
    metrics['classification_threshold'] = threshold
    
    return metrics


predictors_to_evaluate = [
    # Predictors trained using all data
    '/content/drive/My Drive/runs/distilbert/2/2020-08-07_09-26-16/predictor',
    '/content/drive/My Drive/runs/ernie2.0/1/2020-08-03_20-38-52/predictor',
    '/content/drive/My Drive/runs/bert/2/2020-08-03_23-34-43/predictor',

    # Only using MTurk Data
    '/content/drive/My Drive/runs/bert/1/2020-07-26_03-25-44/predictor',
    '/content/drive/My Drive/runs/distilbert/1/2020-07-26_19-27-36/predictor'
]

thresholds = np.arange(0., 1., .01)

for predictor_path in predictors_to_evaluate:
    
    all_train_threshold_metrics = {}
    all_test_threshold_metrics = {}
    # Load the predictor
    curr_pred = ktrain.load_predictor(predictor_path)

    # Predict
    curr_train_probs = curr_pred.predict_proba(list(train_df.loc[:, 'requirement'])) # train + val
    curr_test_probs = curr_pred.predict_proba(list(test_df.loc[:, 'requirement']))

    for threshold in thresholds:
        train_metrics = evaluate_threshold(list(train_df.loc[:, 'majority_label']), curr_train_probs, threshold)
        test_metrics = evaluate_threshold(list(test_df.loc[:, 'majority_label']), curr_test_probs, threshold)
        
        if not all_train_threshold_metrics:
            all_train_threshold_metrics = {key: [] for key in train_metrics}
        if not all_test_threshold_metrics:
            all_test_threshold_metrics = {key: [] for key in test_metrics}

        for key in train_metrics.keys():
            all_train_threshold_metrics[key].append(train_metrics[key])
            all_test_threshold_metrics[key].append(test_metrics[key])

    pd.DataFrame.from_dict(all_train_threshold_metrics).to_csv(path.join(predictor_path, '../train-thresholds.csv'), sep=';', index=False, quoting=csv.QUOTE_NONNUMERIC)
    pd.DataFrame.from_dict(all_test_threshold_metrics).to_csv(path.join(predictor_path, '../test-thresholds.csv'), sep=';', index=False, quoting=csv.QUOTE_NONNUMERIC)


2020-08-08 10:17:15,049 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:17:15,063 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:17:15,064 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:17:15,065 
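
The warning above indicates that a metric's denominator was zero, which happens for example with precision when no sample was predicted as the positive class. A minimal sketch of such a guard follows. `safe_divide` and `precision_recall` are hypothetical helpers written for illustration; they are not taken from `vaguerequirementslib`, whose actual implementation may differ.

```python
import logging

_LOG = logging.getLogger("metrics_sketch")  # hypothetical logger name


def safe_divide(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    if denominator == 0:
        _LOG.warning("Denominator = 0. Skip metric calculation and set it to 0.")
        return 0.0
    return numerator / denominator


def precision_recall(true_positive, false_positive, false_negative):
    """Compute precision and recall, falling back to 0.0 for empty denominators."""
    precision = safe_divide(true_positive, true_positive + false_positive)
    recall = safe_divide(true_positive, true_positive + false_negative)
    return precision, recall


# No positive predictions at all: precision would be 0 / 0, so the guard applies.
print(precision_recall(true_positive=0, false_positive=0, false_negative=5))
```

A metric that silently falls back to 0 in this way is easy to misread, so it can be worth checking whether the model predicts only one class during the affected epochs.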

## STEP 2: Evaluate the model

Load the predictor, transformer and learner.

Check out the [FAQ](https://github.com/amaiya/ktrain/blob/master/FAQ.md#method-1-using-predictor-api-works-for-any-model) for how to load a model from a predictor.


In [None]:
eval_params = {
    'predictor_path': '/content/drive/My Drive/runs/bert/2/2020-08-03_23-34-43/predictor',
    'model_name': 'bert-base-uncased',
    'max_len': 128,
    'class_names': params['class_names']
}


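# Build the preprocessor for the evaluated model.
transformer = txt.Transformer(
    eval_params['model_name'],
    maxlen=eval_params['max_len'],
    class_names=eval_params['class_names'])
# `preprocess_train` is applied to the test set here, hence the "preprocessing train..." log below.
test_data = transformer.preprocess_train(list(test_df['requirement']), list(test_df['majority_label']))

# Load the saved predictor and wrap its model in a learner for evaluation.
predictor = ktrain.load_predictor(eval_params['predictor_path'])
learner = ktrain.get_learner(predictor.model)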

preprocessing train...
language: en
train sequence lengths:
	mean : 21
	95percentile : 39
	99percentile : 59
2020-08-11 15:45:26,951 [filelock            ] [INFO ]  Lock 140034357085240 acquired on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock


2020-08-11 15:45:27,462 [filelock            ] [INFO ]  Lock 140034357085240 released on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock


Is Multi-Label? False
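
The sequence-length statistics printed above (mean 21, 99th percentile 59) suggest that `max_len=128` leaves ample headroom. The sketch below shows one way such statistics could be computed. It assumes whitespace tokenisation and nearest-rank percentiles, which may differ from what ktrain does internally; `percentile`, `seqlen_stats` and the sample texts are hypothetical.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def seqlen_stats(texts):
    """Summarise the number of whitespace-separated tokens per text."""
    lengths = sorted(len(text.split()) for text in texts)
    return {
        "mean": round(sum(lengths) / len(lengths)),
        "95percentile": percentile(lengths, 95),
        "99percentile": percentile(lengths, 99),
    }


# Invented sample requirements, for illustration only.
sample_texts = [
    "The system shall respond quickly.",
    "The developer shall establish a test environment.",
    "Users may log in.",
]
stats = seqlen_stats(sample_texts)
print(stats)
```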


Evaluate the model using the `test_data`.

In [None]:
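# Evaluate on the held-out test set and print the returned result.
test_result = learner.validate(val_data=test_data)
print(test_result)
# Plotting the loss relies on a training history, which a freshly loaded learner may not have.
learner.plot('loss')
LOGGER.info('Successfully validated test set.')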
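
A validation step of this kind is commonly summarised as a confusion matrix. As a rough illustration, the sketch below tallies one from true and predicted labels in plain Python. The label lists are made-up example data, and `confusion_matrix` is a hypothetical helper rather than ktrain's implementation.

```python
def confusion_matrix(y_true, y_pred, class_names):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(class_names)}
    matrix = [[0] * len(class_names) for _ in class_names]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


class_names = ["not-vague", "vague"]
# Invented labels, for illustration only.
y_true = ["not-vague", "not-vague", "vague", "vague", "not-vague"]
y_pred = ["not-vague", "vague", "vague", "not-vague", "not-vague"]

matrix = confusion_matrix(y_true, y_pred, class_names)
# Accuracy is the share of samples on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(class_names))) / len(y_true)
print(matrix, accuracy)
```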

## STEP 4: Inspect the Model and its Losses

Let's examine the four test examples with the highest loss, i.e. the ones about which the model was most wrong.

In [None]:
learner.view_top_losses(n=4, preproc=transformer, val_data=test_data)
top_losses = learner.top_losses(n=4, preproc=transformer, val_data=test_data)

----------
id:16 | loss:1.26 | true:not-vague | pred:vague)

----------
id:22 | loss:1.2 | true:not-vague | pred:vague)

----------
id:12 | loss:1.18 | true:not-vague | pred:vague)

----------
id:8 | loss:0.99 | true:vague | pred:not-vague)



In [None]:
top_loss_req = test_df.iloc[16]['requirement'] # Requirement that produces top loss

print(predictor.predict(top_loss_req))

# predicted probability scores for each category
print(predictor.predict_proba(top_loss_req))
print(top_loss_req)

2020-07-07 07:56:25,775 [MainThread          ] [DEBUG]  Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-07-07 07:56:26,194 [MainThread          ] [DEBUG]  https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/bert-base-uncased-vocab.txt HTTP/1.1" 200 0


vague


[0.28304783 0.71695215]
The developer shall establish, control, and maintain a software test environment to perform integration and qualification testing of software.
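
The loss reported for id 16 can be related to the probabilities printed above. The sketch below assumes that the loss is the categorical cross-entropy of the true class and that the probability vector is ordered as `[not-vague, vague]`; both are assumptions, the latter inferred from the predicted label `vague` having the larger probability. Under these assumptions the negative log of the probability assigned to the true class `not-vague` comes out at about 1.26, matching the reported loss.

```python
import math

proba = [0.28304783, 0.71695215]      # output of predict_proba shown above
class_names = ["not-vague", "vague"]  # assumed class order
true_label = "not-vague"              # true label of id 16 in the top-loss listing

# Cross-entropy for a single example: -log(probability of the true class).
loss = -math.log(proba[class_names.index(true_label)])
print(round(loss, 2))
```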


Let's invoke the `explain` method to see which words contribute most to the classification.

In [None]:
from IPython.display import display

# Explain each of the top-loss requirements collected above.
for idx, _, _, _ in top_losses:
    top_loss_req = test_df.iloc[idx]['requirement'] # Requirement that produces top loss
    display(predictor.explain(top_loss_req, n_samples=1_000))

| Contribution? | Feature |
|---|---|
| 0.641 | Highlighted in text (sum) |
| -0.19 | `<BIAS>` |

| Contribution? | Feature |
|---|---|
| 1.088 | Highlighted in text (sum) |
| -0.149 | `<BIAS>` |

| Contribution? | Feature |
|---|---|
| 1.787 | Highlighted in text (sum) |
| -0.305 | `<BIAS>` |

| Contribution? | Feature |
|---|---|
| 0.294 | `<BIAS>` |
| -0.601 | Highlighted in text (sum) |


The words in the darkest shade of green contribute most to the classification and agree with what you would expect for this example.
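
Each explanation table lists two contributions, the summed weight of the highlighted words and a bias term, and their sum gives the overall score of the explained class. The sketch below adds them up for the four tables above. It assumes that all four tables explain the class `vague`, which is not visible in the output shown here; under that assumption a positive score points towards `vague` and a negative one towards `not-vague`.

```python
# (highlighted-text sum, bias) pairs copied from the four explanation tables.
contributions = [
    (0.641, -0.19),
    (1.088, -0.149),
    (1.787, -0.305),
    (-0.601, 0.294),
]

# The overall score is the sum of all listed contributions.
scores = [round(highlighted + bias, 3) for highlighted, bias in contributions]
print(scores)

# Assumed reading: a positive score favours the explained class.
leans_vague = [score > 0 for score in scores]
print(leans_vague)
```

The signs are consistent with the top-loss listing, where the first three examples were predicted as `vague` and the fourth as `not-vague`.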