<a href="https://colab.research.google.com/github/HaaLeo/vague-requirements-scripts/blob/master/colab-notebooks/BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classify requirements as vague or not using [ktrain](https://github.com/amaiya/ktrain) and tensorflow


## Install dependencies
*ktrain* requires TensorFlow 2.1. See [amaiya/ktrain#151](https://github.com/amaiya/ktrain/issues/151).
Further we install a forked version of eli5lib to gain insights in the model's decision process and some self built helper functions to preprocess MTurk result files.

In [5]:
%load_ext google.colab.data_table
!pip3 install -q -U tensorflow_gpu==2.2.0 ktrain==0.23.0 imbalanced-learn==0.7.0 psutil==5.7.0 transformers==3.1.0
!pip3 install -q -U git+https://github.com/HaaLeo/vague-requirements-scripts
!pip3 install -q git+https://github.com/amaiya/eli5@tfkeras_0_10_1

The google.colab.data_table extension is already loaded. To reload it, use:
  %reload_ext google.colab.data_table
  Building wheel for vaguerequirementslib (setup.py) ... [?25l[?25hdone
  Building wheel for eli5 (setup.py) ... [?25l[?25hdone


Check versions and enable logging

In [6]:
import tensorflow as tf
import tensorflow.keras
import ktrain
# import imblearn
import psutil
import transformers

assert tf.__version__ == '2.3.0'
assert ktrain.__version__ == '0.23.0'
# assert imblearn.__version__ == '0.5.0'
assert psutil.__version__ == '5.7.0'
assert transformers.__version__ == '3.1.0'
import logging
import sys

logging.basicConfig(
    format='%(asctime)s [%(name)-20.20s] [%(levelname)-5.5s]  %(message)s',
    stream=sys.stdout,
    level=logging.INFO)

LOGGER = logging.getLogger('colab-notebook')
LOGGER.info('Hello from colab notebook')

AttributeError: ignored

## Set Parameters
Set the parameters for this run.
Ktrain ignores `max_features` and `ngram_range` in v0.17.5, see [amaiya/ktrain/issues#190](https://github.com/amaiya/ktrain/issues/190)

In [23]:
#@title Set the parameter and hyperparameter 
#@markdown Set data files and proportion of train, val test set in source code

def set_parameters() -> dict:
    class_names = ['not-vague', 'vague'] # 0=not-vague 1=vague

    # The following parameter can be edited with the form fields
    random_state = 1  #@param {type:"integer"}

    resampling_strategy = 'random_downsampling'#@param ["random_downsampling", "random_upsampling"]

    kfold_splits = 10 #@param {type:"integer"}
    learning_rate = 1e-5 #@param {type:"number"}
    epochs =  2#@param {type:"integer"}
    model_name = 'distilbert-base-uncased' #@param {type:"string"}
    max_len = 256 #@param {type:"integer"}
    batch_size = 6 #@param {type:"integer"}

    return {
        'class_names': class_names,

        'random_state': random_state,

        'resampling_strategy': resampling_strategy,

        'kfold_splits': kfold_splits,
        'learning_rate': learning_rate,
        'epochs': epochs,
        'model_name': model_name,
        'max_len': max_len,
        'batch_size': batch_size
    }

params = set_parameters()
LOGGER.info(f'Sucessfully set parameters="{params}".')

2020-10-24 10:33:42,580 [colab-notebook      ] [INFO ]  Sucessfully set parameters="{'class_names': ['not-vague', 'vague'], 'random_state': 1, 'resampling_strategy': 'random_downsampling', 'kfold_splits': 10, 'learning_rate': 1e-05, 'epochs': 2, 'model_name': 'distilbert-base-uncased', 'max_len': 256, 'batch_size': 6}".


## Load Dataset

### Mount Google Drive
Mount the google drive to access the dataset

In [25]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Load Dataset

In [26]:
from vaguerequirementslib import read_csv_file

train_df = read_csv_file('/content/drive/My Drive/datasets/corpus/train_data.csv')
test_df = read_csv_file('/content/drive/My Drive/datasets/corpus/test_data.csv')

test_df_vague_count = int(test_df[test_df.majority_label == 1].majority_label.value_counts())
test_df_not_vague_count = int(test_df[test_df.majority_label == 0].majority_label.value_counts())
LOGGER.info(f'Test data frame consists of {test_df_vague_count} vague data points and {test_df_not_vague_count} not vague data points.')

df_vague_count = int(train_df[train_df.majority_label == 1].majority_label.value_counts())
df_not_vague_count = int(train_df[train_df.majority_label == 0].majority_label.value_counts())
LOGGER.info(f'Train data frame consists of {df_vague_count} vague data points and {df_not_vague_count} not vague data points.')
train_df.head()

2020-10-24 10:34:53,084 [colab-notebook      ] [INFO ]  Test data frame consists of 59 vague data points and 219 not vague data points.
2020-10-24 10:34:53,091 [colab-notebook      ] [INFO ]  Train data frame consists of 530 vague data points and 1968 not vague data points.


Unnamed: 0,requirement,vague_count,not_vague_count,majority_label
0,User selects a current date of the current all...,1,2,0
1,"For complex systems, a series of PDRs for each...",2,0,1
2,All instances in which the microcircuit is not...,0,2,0
3,"The cables, their routing and dressing should ...",1,2,0
4,"The step frequencies Fstep,X are defined in ta...",1,2,0


## Utility Functions

### Clean up disk
Remove temporary files and release memory

In [27]:
def cleanup_disk() -> None:
    # Try to release memory
    tf.keras.backend.clear_session()
    # Prevent disk overflow
    !rm -r /var/tmp/*
    !rm -r /tmp/*

### Resample and preprocess

In [28]:
from typing import Tuple, List

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from ktrain import text as txt
import pandas as pd

def resample(data_frame: pd.DataFrame, random_state: int, resampling_strategy: str = None) -> pd.DataFrame:
    """
    Re sample the given dataframe to contain equally much samples of vague and not-vague requirements.

    Args:
        data_frame (pd.DataFrame): The data frame to upsample.
        random_state (int): For seeding results
        strategy (str): The resampling strategy to use either "random_downsampling", "random_upsampling" or None.

    Returns:
        pd.DataFrame: The resampled data frame.
    """
    if resampling_strategy:
        if resampling_strategy == 'random_downsampling':
            sampler = RandomUnderSampler(sampling_strategy=1., random_state=random_state)
        elif resampling_strategy == 'random_upsampling':
            sampler = RandomOverSampler(sampling_strategy=1., random_state=random_state)

        x, y = sampler.fit_resample(data_frame.requirement.to_numpy().reshape(-1, 1), data_frame.majority_label)
        result = pd.DataFrame({'requirement': x.flatten(), 'majority_label': y})

        LOGGER.info(f'Resampled dataset with strategy"{resampling_strategy}": vague count="{result.sum()["majority_label"]}", not vague count="{result.shape[0] - result.sum()["majority_label"]}"')
    
    else: 
        LOGGER.warning('Data frame will not be resampled, because no strategy was provided.')
        result = data_frame
    return result

LOGGER.info('Defined resampling and preprocessing functions.')

2020-10-24 10:35:07,202 [colab-notebook      ] [INFO ]  Defined resampling and preprocessing functions.


### Create result object

Gather results, calulate metrics.

In [29]:
import os
import json
from os import path

from vaguerequirementslib import TP, TN, FP, FN, calc_all_metrics, calc_mean_average_precision

def calc_metrics(evaluation_result, map_df=None) -> dict:
    metrics_dict = {
        'vague': {
            TP: int(evaluation_result[1][1]),
            FP: int(evaluation_result[0][1]),
            TN: int(evaluation_result[0][0]),
            FN: int(evaluation_result[1][0])
        },
        'not_vague': {
            TP: int(evaluation_result[0][0]),
            FP: int(evaluation_result[1][0]),
            TN: int(evaluation_result[1][1]),
            FN: int(evaluation_result[0][1])
        }
    }

    metrics_dict['not_vague'].update(calc_all_metrics(**metrics_dict['not_vague']))
    metrics_dict['vague'].update(calc_all_metrics(**metrics_dict['vague']))

    if map_df is not None: # Mean average precision
        map = calc_mean_average_precision(map_df)
        metrics_dict['mean_average_precision'] = map[0]
        metrics_dict['not_vague']['average_precision'] = map[1]
        metrics_dict['vague']['average_precision'] = map[2]

    return metrics_dict

def build_fold_result(train_result, val_result, test_result, learning_history, train_map_df, val_map_df, test_map_df) -> dict:
    fold_result = {
        'metrics':{
        },
        'learning_history': learning_history
    }

    fold_result['metrics']['train'] = calc_metrics(train_result, train_map_df)
    fold_result['metrics']['validation'] = calc_metrics(val_result, val_map_df)
    fold_result['metrics']['test'] = calc_metrics(test_result, test_map_df)

    LOGGER.debug('Successfully built fold result.')

    return fold_result


def build_result_data(fold_results: List, df_vague_count, df_not_vague_count, test_df_vague_count, test_df_not_vague_count, **kwargs) -> dict:
    result_data = {
        'misc': {   
            'random_state': kwargs['random_state']
        },
        'data_set':{
            'summary': {
                'vague_data_points': df_vague_count,
                'not_vague_data_points': df_not_vague_count,
            },
            'test': {
                'vague_data_points': test_df_vague_count,
                'not_vague_data_points': test_df_not_vague_count
            },
            'resampling_strategy': kwargs['resampling_strategy']
        },
        'fold_results': fold_results,
        'hyperparameter': {
            'learning_rate': kwargs['learning_rate'],
            'epochs': kwargs['epochs'],
            'model_name': kwargs['model_name'],
            'max_len': kwargs['max_len'],
            'batch_size': kwargs['batch_size']
        }
    }

    LOGGER.debug('Successfully built result.')
    return result_data


def insert_probabilities(data_frame: pd.DataFrame, pred) -> None:
    predictions = pred.predict_proba(list(data_frame.loc[:, 'requirement'])).transpose()
    data_frame.loc[:, 'not_vague_prob'] = predictions[0] 
    data_frame.loc[:, 'vague_prob'] = predictions[1] 

LOGGER.info('Created functions for result object creation.')


2020-10-24 10:35:21,574 [colab-notebook      ] [INFO ]  Created functions for result object creation.


### Save evaluation result

In [30]:
import os
import numpy as np


def save_data(data: dict, file_path: str) -> None:
    class NumpyJSONEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, np.integer):
                return int(obj)
            elif isinstance(obj, np.floating):
                return float(obj)
            elif isinstance(obj, np.ndarray):
                return obj.tolist()
            else:
                return super(NumpyJSONEncoder, self).default(obj)

    # Save the evaluation result (test_data results)
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    with open(file_path, mode='w', encoding='utf-8') as json_file:
        json.dump(data, json_file, indent=4, cls=NumpyJSONEncoder)
    LOGGER.info(f'Successfully saved data to directory="{file_path}".')

def load_data(file_path: str) -> dict:
    with open(file_path, mode='r', encoding='utf-8') as json_file:
        data = json.load(json_file)
    return data

### K-Fold Cross Validation

In [31]:
from sklearn.model_selection import StratifiedKFold

def perform_kfold_cross_validation(transformer, kfold_df, test_frame, **kwargs):
    fold_results = []
    fold_counter = 1

    # Preprocess test data
    test_data = transformer.preprocess_train(list(test_frame['requirement']), list(test_frame['majority_label']))


    # Cerate k folds
    kfold = StratifiedKFold(n_splits=kwargs['kfold_splits'], shuffle=True, random_state=kwargs['random_state'])
    for train_idx, val_idx in kfold.split(kfold_df.requirement, kfold_df.majority_label):
        LOGGER.info(f'\n######################\n### Process fold {fold_counter} / {kwargs["kfold_splits"]}\n######################')

        # Get training data and validation data for this fold
        curr_train_df = kfold_df.iloc[train_idx]
        val_df = kfold_df.iloc[val_idx]
        LOGGER.info(f'Training dataset: vague count="{int(curr_train_df[curr_train_df.majority_label == 1].majority_label.value_counts())}", not vague count="{int(train_df[train_df.majority_label == 0].majority_label.value_counts())}".')
        LOGGER.info(f'Validation dataset: vague count="{int(val_df[val_df.majority_label == 1].majority_label.value_counts())}", not vague count="{int(val_df[val_df.majority_label == 0].majority_label.value_counts())}".')
        
        # Resample train df 
        LOGGER.info('Resample training data set.')
        curr_train_df = resample(curr_train_df, kwargs['random_state'], kwargs['resampling_strategy'])

        LOGGER.info(f'Preprocess training and validation data for the model="{kwargs["model_name"]}".')
        train_data = transformer.preprocess_train(list(curr_train_df['requirement']), list(curr_train_df['majority_label']))
        val_data = transformer.preprocess_train(list(val_df['requirement']), list(val_df['majority_label']))

        # Get the learner for the new fold
        LOGGER.info('Fit the model for one cycle.')
        learner = ktrain.get_learner(transformer.get_classifier(), train_data=train_data, val_data=val_data, batch_size=kwargs['batch_size'])
        learner.model.get_layer(index=0).trainable = False
        LOGGER.info('\n%s',learner.model.summary())
        # Find a suitable learning rate
        # learner.lr_find(show_plot=True, start_lr=1e-09, max_epochs=kwargs['epochs'])
        # print(learner.lr_estimate())
        
        # Fit the model
        learning_history = learner.fit_onecycle(kwargs['learning_rate'], kwargs['epochs']).history
        # learner.plot('loss')

        # Evaluate the model
        train_result = learner.validate(class_names=kwargs['class_names'], val_data=train_data, print_report=False)
        val_result = learner.validate(class_names=kwargs['class_names'], val_data=val_data, print_report=False)
        test_result = learner.validate(class_names=kwargs['class_names'], val_data=test_data)

        # Prepare dfs for map calculation
        current_predictor = ktrain.get_predictor(learner.model, preproc=transformer)
        insert_probabilities(curr_train_df, current_predictor)
        insert_probabilities(val_df, current_predictor)
        insert_probabilities(test_df, current_predictor)
        
        # build fold result
        fold_result = build_fold_result(train_result, val_result, test_result, learning_history, curr_train_df, val_df, test_df)
        fold_results.append(fold_result)

        fold_counter += 1

    LOGGER.info('Successfully trained model.')

    
    return fold_results, learner

LOGGER.info('Defined k-fold cross validation.')

2020-10-24 10:35:23,591 [colab-notebook      ] [INFO ]  Defined k-fold cross validation.


### Create a Transformer Model and Train it

In [32]:
from datetime import datetime
from pytz import timezone

def create_train_save(result_file_path, kfold_df, test_df, **kwargs):
    """Create and train a model. Afterwards save its evaluation results."""
    # Create the transformer
    t = txt.Transformer(kwargs['model_name'], maxlen=kwargs['max_len'], class_names=kwargs['class_names'])

    # Perform k fold cross validation
    fold_results, learner = perform_kfold_cross_validation(t, kfold_df, test_df, **kwargs)

    predictor = ktrain.get_predictor(learner.model, preproc=t)
    
    # Build and save the evaluation result
    eval_result = build_result_data(fold_results, df_vague_count, df_not_vague_count, test_df_vague_count, test_df_not_vague_count, **kwargs)
    save_data(eval_result, result_file_path)

    return learner, predictor

LOGGER.info('Defined method to create and train a model and save its results')

2020-10-24 10:35:33,956 [colab-notebook      ] [INFO ]  Defined method to create and train a model and save its results


## Main entry points

Here are some cells as main entry points, enabling training a single model, grid search or thresholding.


### Grid Search
Perform a grid search to find good hyperparameter

In [None]:
from copy import deepcopy
from os import path
const_params = {
    'class_names': ['not-vague', 'vague'],
    'random_state': 1,
}

param_grid = {
    'resampling_strategy': ['random_upsampling', 'random_downsampling'],
    'kfold_splits': [4, 8],
    'learning_rate': [
        2e-5, 3e-5,
        4e-5, 5e-5, # e-5
    ],
    'epochs': [1, 2, 3],
    'model_name': [
        # 'distilbert-base-uncased',
        # 'bert-base-uncased';
        'nghuyong/ernie-2.0-en'
    ],
    'max_len': [64, 128],
    'batch_size': [16, 32]
}

LOGGER.info('Start grid search.')
LOGGER.info('Search for checkpoint.')

checkpoint_found = False
checkpoint_data = None
skipped_config_counter = 0

for model_name in param_grid['model_name']:
    for resampling_strategy in param_grid['resampling_strategy']:
        for kfold_splits in param_grid['kfold_splits']:
            for epochs in param_grid['epochs']:
                for max_len in  param_grid['max_len']:
                    for batch_size in  param_grid['batch_size']:
                        for learning_rate in param_grid['learning_rate']:
                            # For every triggered fitting run create a new directory where the results will be saved
                            now = datetime.now(timezone('Europe/Berlin'))

                            result_dir = f'/content/drive/My Drive/runs/grid-search/earnie2.0'
                            eval_file = now.strftime('%Y-%m-%d_%H-%M-%S-evaluation.json')
                            result_file_path = path.join(result_dir, eval_file)
                            checkpoint_file_path = path.join(result_dir, 'grid-search-checkpoint.json')

                            current_params = deepcopy(const_params)
                            current_params.update({
                                'resampling_strategy': resampling_strategy,

                                'kfold_splits': kfold_splits,
                                'learning_rate': learning_rate,
                                'epochs': epochs,
                                'model_name': model_name,
                                'max_len': max_len,
                                'batch_size': batch_size
                            })

                            if not checkpoint_found:
                                # Load checkpoint once
                                if not checkpoint_data:
                                    try: 
                                        checkpoint_data = load_data(checkpoint_file_path)
                                    except FileNotFoundError:
                                        # Start fresh grid search if no checkpoint was found
                                        checkpoint_found = True
                                        LOGGER.info(f'No checkpoint exists at path="{checkpoint_file_path}". Start fresh grid search.')
                                        save_data(param_grid, path.join(result_dir, 'grid-search-param-grid.json'))
                                # Assert whether the checkpoint was reached / the current parameter set was handled.
                                # Then continue from the next checkpoint
                                if checkpoint_data == current_params:
                                    checkpoint_found = True
                                    LOGGER.info(f'Checkpoint was found. Skipped {skipped_config_counter} parameter configurations.')
                                    continue
                                else:
                                    skipped_config_counter += 1
                                    continue

                            elif checkpoint_found:
                                LOGGER.info(f'Consuming {psutil.cpu_percent()}% CPU and {psutil.virtual_memory().percent}% RAM.')
                                LOGGER.info(f'Grid search with parameters="{current_params}".')
                                create_train_save(result_file_path, train_df, test_df, **current_params)

                                # Save checkpoint configuration
                                LOGGER.info('Save grid-search checkpoint.')
                                save_data(current_params, checkpoint_file_path)

                                cleanup_disk()

              precision    recall  f1-score   support

   not-vague       0.89      0.76      0.82       126
       vague       0.36      0.59      0.45        29

    accuracy                           0.73       155
   macro avg       0.63      0.67      0.63       155
weighted avg       0.79      0.73      0.75       155



### Train single model
Create and train a single model

In [None]:
from os import path

single_train_params = {
    'class_names': ['not-vague', 'vague'],
    'random_state': None,

    # Paramater with highest f-score for BERT
    'resampling_strategy': 'random_downsampling',
    'kfold_splits': 4,
    
    'learning_rate': 1e-05,
    'epochs': 2,
    'model_name': 'bert-base-uncased',
    'max_len': 128,
    'batch_size': 16
}

good_model_found = False
the_best_predictor = None
best_f1_score = 0.
the_best_pred_path = None

while not good_model_found:     
    # Create a new directory for each run
    now = datetime.now(timezone('Europe/Berlin'))
    result_dir = f'/content/drive/My Drive/runs/bert/3/{now.strftime("%Y-%m-%d_%H-%M-%S")}'
    result_file_path = path.join(result_dir, 'evaluation.json')

    # Create and train model, save its evaluation
    the_learner, the_predictor = create_train_save(result_file_path, train_df, test_df, **single_train_params)

    eval_data = load_data(result_file_path)

    current_f1_score = eval_data['fold_results'][-1]['metrics']['test']['vague']['f1_score']
    if current_f1_score > best_f1_score:
        # Remove previous best pred
        !rm -r '{the_best_pred_path}'

        # Update best predictor and save it
        best_f1_score = current_f1_score
        the_best_predictor = the_predictor
        the_best_pred_path = path.join(result_dir, 'predictor')
        
        LOGGER.info(f'Save new best predictor who reached f1 score {best_f1_score}.')
        the_best_predictor.save(path.join(result_dir, 'predictor'))

    good_model_found = eval_data['fold_results'][-1]['metrics']['test']['vague']['f1_score'] >= 0.55 

    if good_model_found:
        LOGGER.info('Found a good model. Saving its predictor and exit search.')
        # Save the corresponding model (predictor)
    else:
        LOGGER.info('Found no good model yet. Continue search.')

    cleanup_disk()

preprocessing train...
language: en
train sequence lengths:
	mean : 21
	95percentile : 39
	99percentile : 59


Is Multi-Label? False
2020-08-19 08:35:04,159 [colab-notebook      ] [INFO ]  
######################
### Process fold 1 / 4
######################
2020-08-19 08:35:04,168 [colab-notebook      ] [INFO ]  Training dataset: vague count="397", not vague count="1968".
2020-08-19 08:35:04,173 [colab-notebook      ] [INFO ]  Validation dataset: vague count="133", not vague count="492".
2020-08-19 08:35:04,174 [colab-notebook      ] [INFO ]  Resample training data set.
2020-08-19 08:35:04,205 [colab-notebook      ] [INFO ]  Resampled dataset with strategy"random_downsampling": vague count="397", not vague count="397"
2020-08-19 08:35:04,206 [colab-notebook      ] [INFO ]  Preprocess training and validation data for the model="bert-base-uncased".
preprocessing train...
language: en
train sequence lengths:
	mean : 22
	95percentile : 42
	99percentile : 53


Is Multi-Label? False
preprocessing train...
language: en
train sequence lengths:
	mean : 21
	95percentile : 40
	99percentile : 52


Is Multi-Label? False
2020-08-19 08:35:05,222 [colab-notebook      ] [INFO ]  Fit the model for one cycle.
2020-08-19 08:35:05,588 [filelock            ] [INFO ]  Lock 140570251804008 acquired on /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517.lock


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=433.0, style=ProgressStyle(description_…


2020-08-19 08:35:06,058 [filelock            ] [INFO ]  Lock 140570251804008 released on /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517.lock
2020-08-19 08:35:06,140 [filelock            ] [INFO ]  Lock 140570251801936 acquired on /root/.cache/torch/transformers/336363d3718f8cc6432db4a768a053f96a9eae064c8c96aff2bc69fe73929770.4733ec82e81d40e9cf5fd04556267d8958fb150e9339390fc64206b7e5a79c83.h5.lock


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=536063208.0, style=ProgressStyle(descri…


2020-08-19 08:35:24,839 [filelock            ] [INFO ]  Lock 140570251801936 released on /root/.cache/torch/transformers/336363d3718f8cc6432db4a768a053f96a9eae064c8c96aff2bc69fe73929770.4733ec82e81d40e9cf5fd04556267d8958fb150e9339390fc64206b7e5a79c83.h5.lock
Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_75 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 1,538
Non-trainable params: 109,482,240
_________________________________________________________________
2020-08-19 08:35:27,600 [colab-notebook      ] [INFO ]  
None


begin training us

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


ValueError: ignored

### Thresholding
Search for the best threshold

In [None]:
from vaguerequirementslib import predict_with_threshold
from sklearn.metrics import confusion_matrix
import csv

def evaluate_threshold(ground_truth, probabilities, threshold) -> dict:
    predictions = predict_with_threshold(probabilities, threshold)

    cm = confusion_matrix(ground_truth, predictions)

    metrics = calc_metrics(cm)['vague']
    metrics['classification_threshold'] = threshold
    
    return metrics


predictors_to_evaluate = [
    # Predictors trained using all data
    '/content/drive/My Drive/runs/distilbert/2/2020-08-07_09-26-16/predictor',
    '/content/drive/My Drive/runs/ernie2.0/1/2020-08-03_20-38-52/predictor',
    '/content/drive/My Drive/runs/bert/2/2020-08-03_23-34-43/predictor',

    # Only using MTurk Data
    '/content/drive/My Drive/runs/bert/1/2020-07-26_03-25-44/predictor',
    '/content/drive/My Drive/runs/distilbert/1/2020-07-26_19-27-36/predictor'
]

thresholds = np.arange(0., 1., .01)

for predictor_path in predictors_to_evaluate:
    
    all_train_threshold_metrics = {}
    all_test_threshold_metrics = {}
    # Load the predictor
    curr_pred = ktrain.load_predictor(predictor_path)

    # Predict
    curr_train_probs = curr_pred.predict_proba(list(train_df.loc[:, 'requirement'])) # train + val
    curr_test_probs = curr_pred.predict_proba(list(test_df.loc[:, 'requirement']))

    for threshold in thresholds:
        train_metrics = evaluate_threshold(list(train_df.loc[:, 'majority_label']), curr_train_probs, threshold)
        test_metrics = evaluate_threshold(list(test_df.loc[:, 'majority_label']), curr_test_probs, threshold)
        
        if not all_train_threshold_metrics:
            all_train_threshold_metrics = {key: [] for key in train_metrics}
        if not all_test_threshold_metrics:
            all_test_threshold_metrics = {key: [] for key in test_metrics}

        for key in train_metrics.keys():
            all_train_threshold_metrics[key].append(train_metrics[key])
            all_test_threshold_metrics[key].append(test_metrics[key])

    pd.DataFrame.from_dict(all_train_threshold_metrics).to_csv(path.join(predictor_path, '../train-thresholds.csv'), sep=';', index=False, quoting=csv.QUOTE_NONNUMERIC)
    pd.DataFrame.from_dict(all_test_threshold_metrics).to_csv(path.join(predictor_path, '../test-thresholds.csv'), sep=';', index=False, quoting=csv.QUOTE_NONNUMERIC)


2020-08-08 10:14:45,223 [filelock            ] [INFO ]  Lock 139875285960856 acquired on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…


2020-08-08 10:14:45,797 [filelock            ] [INFO ]  Lock 139875285960856 released on /root/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock


2020-08-08 10:14:56,847 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:14:56,848 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:14:56,849 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:14:56,854 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:14:56,856 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:14:56,857 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:14:56,874 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:14:56,874 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:14:56,876 

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=479.0, style=ProgressStyle(description_…


2020-08-08 10:15:08,024 [filelock            ] [INFO ]  Lock 139875006397688 released on /root/.cache/torch/transformers/cca4e49c8196ae207c328e7466072f1471d445de206f05e5f75428d2d7a3f710.ec2bfa564ea3c54c926729c61f7100e6ea7aa4c3c04ee5543f799f5a25b7ef2e.lock
2020-08-08 10:15:08,341 [filelock            ] [INFO ]  Lock 139875286753064 acquired on /root/.cache/torch/transformers/9a207e6d9199c6b0288c252c8b081b828d246749d21b1ef3cc669f6474d5450a.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…


2020-08-08 10:15:09,002 [filelock            ] [INFO ]  Lock 139875286753064 released on /root/.cache/torch/transformers/9a207e6d9199c6b0288c252c8b081b828d246749d21b1ef3cc669f6474d5450a.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock
2020-08-08 10:15:09,584 [filelock            ] [INFO ]  Lock 139875006559232 acquired on /root/.cache/torch/transformers/c7622ad218643641d7d3ddea492c0cac7c626cb9a6da41eefb3cc11a4b7c60f1.275045728fbf41c11d3dae08b8742c054377e18d92cc7b72b6351152a99b64e4.lock


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=112.0, style=ProgressStyle(description_…


2020-08-08 10:15:09,931 [filelock            ] [INFO ]  Lock 139875006559232 released on /root/.cache/torch/transformers/c7622ad218643641d7d3ddea492c0cac7c626cb9a6da41eefb3cc11a4b7c60f1.275045728fbf41c11d3dae08b8742c054377e18d92cc7b72b6351152a99b64e4.lock
2020-08-08 10:15:10,226 [filelock            ] [INFO ]  Lock 139875324929024 acquired on /root/.cache/torch/transformers/4ba9893dbb70360f248414dea2a12722529644bebf6b1784bf97ff11cc9d3395.73a933aa27255ce576c445dcdb8155b6edb6e4c43cceb14b4b81f9e699a818b7.lock


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=62.0, style=ProgressStyle(description_w…


2020-08-08 10:15:10,570 [filelock            ] [INFO ]  Lock 139875324929024 released on /root/.cache/torch/transformers/4ba9893dbb70360f248414dea2a12722529644bebf6b1784bf97ff11cc9d3395.73a933aa27255ce576c445dcdb8155b6edb6e4c43cceb14b4b81f9e699a818b7.lock


2020-08-08 10:15:40,165 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:15:40,165 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:15:40,170 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:15:40,173 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:15:40,174 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:15:40,175 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:15:40,191 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:15:40,192 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:15:40,192 

2020-08-08 10:16:19,428 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:19,428 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:19,429 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:19,438 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:19,442 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:19,444 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:19,461 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:19,462 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:19,465 

2020-08-08 10:16:58,271 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:58,272 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:58,272 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:58,281 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:58,283 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:58,285 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:58,300 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:58,301 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:16:58,302 

2020-08-08 10:17:15,039 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:17:15,040 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:17:15,041 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:17:15,046 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:17:15,048 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:17:15,049 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:17:15,063 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:17:15,064 [vaguerequirementslib] [WARNI]  Denominator = 0. Skip metric calculation and set it to value="0".
2020-08-08 10:17:15,065 

## STEP 2: Evaluate the model

Load the predictor, transformer and learner.

Check out the [FAQ](https://github.com/amaiya/ktrain/blob/master/FAQ.md#method-1-using-predictor-api-works-for-any-model) for how to load a model from a predictor.


In [65]:
eval_params = {
    'predictor_path': '/content/drive/My Drive/runs/ernie2.0/2/2020-08-03_20-38-52/predictor',
    'model_name': 'nghuyong/ernie-2.0-en',
    'max_len': 128,
    'class_names': params['class_names']
}
# 'nghuyong/ernie-2.0-en', '/content/drive/My Drive/runs/ernie2.0/2/2020-08-03_20-38-52/predictor', 128
# 'distilbert-base-uncased', '/content/drive/My Drive/runs/distilbert/2/2020-08-07_09-26-16/predictor', 64
# 'bert-base-uncased', '/content/drive/My Drive/runs/bert/2/2020-08-03_23-34-43/predictor', 128
transformer = txt.Transformer(eval_params['model_name'], maxlen=eval_params['max_len'], class_names=eval_params['class_names'])
train_data = transformer.preprocess_train(list(train_df['requirement']), list(train_df['majority_label']))
test_data = transformer.preprocess_train(list(test_df['requirement']), list(test_df['majority_label']))

predictor = ktrain.load_predictor(eval_params['predictor_path'])
learner = ktrain.get_learner(predictor.model, train_data=train_data, val_data=test_data)

preprocessing train...
language: en
train sequence lengths:
	mean : 22
	95percentile : 42
	99percentile : 57


Is Multi-Label? False
preprocessing train...
language: en
train sequence lengths:
	mean : 21
	95percentile : 39
	99percentile : 59


Is Multi-Label? False


ValueError: ignored

Evaluate the model using the `test_data`.

In [None]:
test_result = learner.validate(val_data=test_data, class_names=eval_params['class_names'])
print(test_result)
LOGGER.info('Successfully validated test set.')

              precision    recall  f1-score   support

   not-vague       0.85      0.78      0.81       219
       vague       0.38      0.51      0.43        59

    accuracy                           0.72       278
   macro avg       0.62      0.64      0.62       278
weighted avg       0.75      0.72      0.73       278

[[170  49]
 [ 29  30]]
2020-08-13 08:38:55,868 [colab-notebook      ] [INFO ]  Successfully validated test set.


## Calculate Inter-Rater Agreement

In [62]:
from vaguerequirementslib import build_confusion_matrix, calculate_free_marginal_kappa
test_reqs = list(test_df.loc[:, 'requirement'])
predictions = predictor.predict(test_reqs)
predictions = [ 
    1 if label == 'vague'
    else 0 if label == 'not-vague'
    else 'foo'
    for label in predictions
]
if 'foo' in predictions:
    raise Error('Invalid predictions')
ira_df = pd.DataFrame({
    'requirements': test_reqs + test_reqs,
    'label': list(test_df.loc[:, 'majority_label']) + predictions
})
conf_mat = build_confusion_matrix(ira_df, requirement_column='requirements', answer_column='label', vague_answer_labels=[1],not_vague_answer_labels=[0])
LOGGER.info('%s achieved k_free: %s', eval_params['model_name'], calculate_free_marginal_kappa(conf_mat))
conf_mat.head()


2020-10-24 11:16:33,925 [vaguerequirementslib] [INFO ]  Build confusion matrix.
2020-10-24 11:16:33,965 [vaguerequirementslib] [INFO ]  Built confusion matrix including 278 of 278 requirements. 
2020-10-24 11:16:33,967 [vaguerequirementslib] [INFO ]  Overall "vague" votes count = 141. Overall "not vague" votes count = 415
2020-10-24 11:16:34,076 [colab-notebook      ] [INFO ]  bert-base-uncased achieved k_free: 0.4316546762589928


Unnamed: 0,requirement,vague_count,not_vague_count
0,A Charge-Through VCONN -Powered USB Device sha...,0,2
1,A Controller that supports Directed Advertisin...,1,1
2,"A MirrorLink Client shall support VC, if the M...",0,2
3,A list of students nominated shall be provided...,1,1
4,A push button with a graphic label does not co...,0,2


## STEP 4 Inspect the Model and its Losses

### Check best and worst predictions
Let's examine the validation example about which we were the most wrong.

In [39]:
import csv
insert_probabilities(test_df, predictor)


SyntaxError: ignored

In [None]:
# A df with the n most falesy classified requirements (majority_label==0 but high vague prob )
wrong_vague_df = test_df[(test_df.majority_label==0) & (test_df.vague_prob>=0.5)].nlargest(4, 'vague_prob')
# A df with the n most correctly classified requirements (majority_label==1 and high vague prob )
right_vague_df = test_df[(test_df.majority_label==1) & (test_df.vague_prob>=0.5)].nlargest(4, 'vague_prob')

# A df with the n most falesy classified requirements (majority_label==1 but high not vague prob )
wrong_not_vague_df = test_df[(test_df.majority_label==1) & (test_df.not_vague_prob>=0.5)].nlargest(4, 'not_vague_prob')
# A df with the n most correctly classified requirements (majority_label==0 and high not vague prob )
right_not_vague_df = test_df[(test_df.majority_label==0) & (test_df.not_vague_prob>=0.5)].nlargest(4, 'not_vague_prob')

Unnamed: 0,requirement,vague_count,not_vague_count,majority_label,not_vague_prob,vague_prob
148,If the NG-RAN node receives a PDU SESSION RESO...,2,1,1,0.722571,0.277429
194,The Drop-Down scroll State field shall have th...,2,0,1,0.672247,0.327753
211,The MirrorLink Server shall stop using the vid...,2,0,1,0.663532,0.336468
273,When the sequence selection input is set to sl...,2,0,1,0.661083,0.338917


In [None]:
def print_explanation(caption, data_frame, pred, n_samples=1_000) -> None:
    LOGGER.info('\n%s', caption)
    for _, row in data_frame.iterrows():
        display(pred.explain(row.requirement, n_samples=n_samples))

print_explanation('### Handle wrong vague classifications', wrong_vague_df, predictor)
print_explanation('### Handle right vague classifications', right_vague_df, predictor)
print_explanation('### Handle wrong not vague classifications', wrong_not_vague_df, predictor)
print_explanation('### Handle right not vague classifications', right_not_vague_df, predictor)

2020-08-13 09:02:03,386 [colab-notebook      ] [INFO ]  
### Handle wrong vague classifications


Contribution?,Feature
0.703,Highlighted in text (sum)
-0.007,<BIAS>


Contribution?,Feature
0.612,Highlighted in text (sum)
0.178,<BIAS>


### Check top losses

In [None]:
learner.view_top_losses(n=4, preproc=transformer, val_data=test_data)
top_losses = learner.top_losses(n=4, preproc=transformer, val_data=test_data)

In [None]:
top_loss_req = test_df.iloc[130]['requirement'] # Requirement that produces top loss

print(predictor.predict(top_loss_req))

# predicted probability scores for each category
print(predictor.predict_proba(top_loss_req))
print(top_loss_req)

Let's invoke the `explain` method to see which words contribute most to the classification.

In [None]:
from IPython.core.display import display

print('### Explain correct vague classifications')
for id, _, _, _ in top_losses:
    top_loss_req = test_df.iloc[id]['requirement'] # Requirement that produces top loss
    display(predictor.explain(top_loss_req, n_samples=1_000))

Contribution?,Feature
0.703,Highlighted in text (sum)
-0.007,<BIAS>


Contribution?,Feature
1.758,Highlighted in text (sum)
0.001,<BIAS>


Contribution?,Feature
0.612,Highlighted in text (sum)
0.178,<BIAS>


Contribution?,Feature
0.34,Highlighted in text (sum)
0.124,<BIAS>


The words in the darkest shade of green contribute most to the classification.