<a href="https://colab.research.google.com/github/HaaLeo/vague-requirements-scripts/blob/master/colab-notebooks/BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classify requirements as vague or not using [ktrain](https://github.com/amaiya/ktrain) and TensorFlow


## Install dependencies
*ktrain* requires TensorFlow 2.1. See [amaiya/ktrain#151](https://github.com/amaiya/ktrain/issues/151).

In [2]:
!pip3 install -q tensorflow_gpu==2.1.0 ktrain==0.17.3

In [3]:
import tensorflow as tf
import ktrain
assert tf.__version__ == '2.1.0'
assert ktrain.__version__ == '0.17.3'

Install the helper library for processing Amazon Mechanical Turk result CSV files.

In [4]:
!pip3 install -q -U git+https://github.com/HaaLeo/vague-requirements-scripts

  Building wheel for vaguerequirementslib (setup.py) ... done


Enable logging

In [5]:
import logging
import sys

logging.basicConfig(
    format='%(asctime)s [%(name)-20.20s] [%(levelname)-5.5s]  %(message)s',
    stream=sys.stdout,
    level=logging.DEBUG)
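
The format string pads and truncates two fields: `%(name)-20.20s` left-aligns the logger name in exactly 20 characters, and `%(levelname)-5.5s` does the same in 5 characters for the level, which is why `INFO` shows up as `INFO ` in the log output of later cells. A small sketch of that behaviour, using a hypothetical logger name and an in-memory stream instead of `sys.stdout`:

```python
import io
import logging

# Same field widths as the notebook's format string, minus the timestamp
# so that the output is reproducible.
FORMAT = '[%(name)-20.20s] [%(levelname)-5.5s]  %(message)s'

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(FORMAT))

# 'a_logger_name_longer_than_twenty_chars' is a hypothetical name for illustration.
demo_logger = logging.getLogger('a_logger_name_longer_than_twenty_chars')
demo_logger.setLevel(logging.DEBUG)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo out of the root logger's output

demo_logger.info('short level name is padded')
demo_logger.warning('long level name is truncated')

formatted_lines = stream.getvalue().splitlines()
for line in formatted_lines:
    print(line)
```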

## Load Dataset

### Mount Google Drive
Mount Google Drive to access the dataset.

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Load Dataset Into Arrays

In [9]:
from vaguerequirementslib import read_csv_files, build_confusion_matrix, calc_majority_label
import pandas as pd

def read_drive_data(files_list: list, separator: str) -> pd.DataFrame:
    """
    Read the given CSV files and calculate the majority label for each requirement.

    Requirements with tied votes (e.g. one vote for vague and one for not vague)
    are dropped before the majority label is calculated.

    Args:
        files_list (list): The CSV files to calculate the majority label for.
        separator (str): The CSV separator.

    Returns:
        pd.DataFrame: The dataframe containing the majority label.
    """
    df = read_csv_files(files_list, separator)
    confusion_matrix = build_confusion_matrix(df, drop_ties=True)
    return calc_majority_label(confusion_matrix)

# Test it
indices_to_read = [0, 2, 3, 4]
df = read_drive_data(
    [f'/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-{i}-mturk.csv' for i in indices_to_read],
    ','
)
df.head()

2020-07-03 11:09:03,739 [vaguerequirementslib] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-0-mturk.csv" with 200 rows.
2020-07-03 11:09:04,066 [vaguerequirementslib] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-2-mturk.csv" with 194 rows.
2020-07-03 11:09:04,356 [vaguerequirementslib] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-3-mturk.csv" with 198 rows.
2020-07-03 11:09:04,694 [vaguerequirementslib] [DEBUG]  Read file="/content/drive/My Drive/datasets/corpus/labeled/corpus-batch-4-mturk.csv" with 196 rows.
2020-07-03 11:09:04,713 [vaguerequirementslib] [INFO ]  Build confusion matrix.
2020-07-03 11:09:04,857 [vaguerequirementslib] [INFO ]  Dropped 180 requirements due to ties.
2020-07-03 11:09:04,863 [vaguerequirementslib] [INFO ]  Built confusion matrix including 214 of 394 requirements. 
2020-07-03 11:09:04,865 [vaguerequirementslib] [INFO ]  Overall "vague" votes count = 9

|   | requirement | vague_count | not_vague_count | majority_label |
|---|---|---|---|---|
| 0 | A fallback per band feature set resulting from... | 2 | 0 | 1 |
| 1 | Actuation of steering shall be possible regard... | 0 | 2 | 0 |
| 2 | Additionally, the ZigBee end device shall then... | 0 | 2 | 0 |
| 3 | Additionally, the plan provides traceability f... | 2 | 0 | 1 |
| 4 | After completion of release of the resources, ... | 0 | 2 | 0 |
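
The majority-label step can be pictured as follows. This is a minimal sketch, not the `vaguerequirementslib` implementation: it assumes each requirement carries a `vague_count` and a `not_vague_count` as in the preview above, that label `1` means vague, and that tied rows are dropped. The function name and the sample rows are hypothetical.

```python
from typing import Dict, List


def majority_labels(rows: List[Dict], drop_ties: bool = True) -> List[Dict]:
    """Attach a majority label (1 = vague, 0 = not vague) to each row."""
    labeled = []
    for row in rows:
        vague, not_vague = row['vague_count'], row['not_vague_count']
        if vague == not_vague:
            if drop_ties:
                continue  # no majority, so the requirement is discarded
            label = None  # keep the row but mark it as undecided
        else:
            label = 1 if vague > not_vague else 0
        labeled.append({**row, 'majority_label': label})
    return labeled


# Hypothetical votes from two annotators per requirement.
sample_rows = [
    {'requirement': 'The system shall respond quickly.', 'vague_count': 2, 'not_vague_count': 0},
    {'requirement': 'The system shall respond within 2 s.', 'vague_count': 0, 'not_vague_count': 2},
    {'requirement': 'The UI shall be user friendly.', 'vague_count': 1, 'not_vague_count': 1},
]

result = majority_labels(sample_rows)
for row in result:
    print(row['majority_label'], row['requirement'])
```

With only two votes per requirement, every split vote is a tie, which is consistent with the log above reporting 180 of 394 requirements dropped.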


In [None]:
from sklearn.datasets import fetch_20newsgroups

# Restrict the corpus to four of the 20 newsgroups
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
train_b = fetch_20newsgroups(subset='train',
    categories=categories, shuffle=True, random_state=42)
test_b = fetch_20newsgroups(subset='test',
    categories=categories, shuffle=True, random_state=42)

print('size of training set: %s' % (len(train_b['data'])))
print('size of validation set: %s' % (len(test_b['data'])))
print('classes: %s' % (train_b.target_names))

x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target

size of training set: 2257
size of validation set: 1502
classes: ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
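
The steps below expect `x_train`, `y_train`, `x_test`, and `y_test` as plain sequences. To run them on the labeled requirements instead, the `requirement` and `majority_label` columns would have to be split the same way. A dependency-free sketch, assuming a stratified 80/20 split; the helper name, ratio, seed, and sample data are all hypothetical and not part of the notebook:

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def stratified_split(texts: Sequence[str], labels: Sequence[int],
                     test_ratio: float = 0.2, seed: int = 42
                     ) -> Tuple[List[str], List[int], List[str], List[int]]:
    """Split texts/labels so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_ratio))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    x_train = [texts[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [texts[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, y_train, x_test, y_test


# Hypothetical stand-in for df['requirement'] and df['majority_label'].
demo_texts = [f'requirement {i}' for i in range(10)]
demo_labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

x_tr, y_tr, x_te, y_te = stratified_split(demo_texts, demo_labels)
print(len(x_tr), 'training and', len(x_te), 'test examples')
```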


## STEP 1: Preprocess Data and Create a Transformer Model

We will use [DistilBERT](https://arxiv.org/abs/1910.01108).

In [None]:
import ktrain
from ktrain import text
MODEL_NAME = 'distilbert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=500, classes=train_b.target_names)
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)



preprocessing train...
language: en
train sequence lengths:
	mean : 308
	95percentile : 837
	99percentile : 1938


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 343
	95percentile : 979
	99percentile : 2562
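
The length statistics help judge the `maxlen=500` setting: the mean training document (308) fits, but the 95th percentile (837) does not, so the longest documents are presumably truncated. A sketch of how such statistics could be computed, assuming lengths are counted in whitespace-separated tokens (ktrain may count differently) and using a hypothetical sample corpus:

```python
import math
from typing import List, Sequence


def percentile(sorted_values: Sequence[int], pct: int) -> int:
    """Nearest-rank percentile of an ascending sequence."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def length_stats(docs: List[str], maxlen: int) -> dict:
    """Summarise document lengths and the share exceeding maxlen."""
    lengths = sorted(len(doc.split()) for doc in docs)
    return {
        'mean': sum(lengths) / len(lengths),
        '95percentile': percentile(lengths, 95),
        '99percentile': percentile(lengths, 99),
        # share of documents longer than maxlen, i.e. candidates for truncation
        'share_over_maxlen': sum(n > maxlen for n in lengths) / len(lengths),
    }


# Hypothetical corpus: 100 documents with 1, 2, ..., 100 tokens.
docs = [' '.join(['tok'] * n) for n in range(1, 101)]
stats = length_stats(docs, maxlen=90)
print(stats)
```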



## STEP 2: Train the Model

In [None]:
learner.fit_onecycle(5e-5, 4)



begin training using onecycle policy with max lr of 5e-05...
Train for 377 steps, validate for 47 steps
Epoch 1/4
Epoch 2/4
Epoch 3/4
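
The training step count in the log follows from the data size: 2257 training examples at `batch_size=6` give ceil(2257 / 6) = 377 steps per epoch. The one-cycle policy then varies the learning rate over the course of training, peaking at the maximum of 5e-5. The sketch below uses a simple triangular schedule as an illustration; the base learning rate of `max_lr / 10` and the exact shape are assumptions and may differ from what ktrain does internally.

```python
import math


def steps_per_epoch(n_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example once."""
    return math.ceil(n_examples / batch_size)


def triangular_lr(step: int, total_steps: int, max_lr: float, base_lr: float) -> float:
    """Linear warm-up to max_lr at the midpoint, then linear decay back to base_lr."""
    half = total_steps / 2
    distance_from_peak = abs(step - half) / half  # 1.0 at both ends, 0.0 at the peak
    return base_lr + (max_lr - base_lr) * (1 - distance_from_peak)


train_steps = steps_per_epoch(2257, 6)   # values from the notebook
total_steps = train_steps * 4            # four epochs
max_lr = 5e-5
base_lr = max_lr / 10                    # assumption, see the note above

for step in (0, total_steps // 2, total_steps):
    print(step, triangular_lr(step, total_steps, max_lr, base_lr))
```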

## STEP 3: Evaluate and Inspect the Model

In [None]:
learner.validate(class_names=t.get_classes())

                        precision    recall  f1-score   support

           alt.atheism       0.94      0.90      0.92       319
         comp.graphics       0.96      0.97      0.96       389
               sci.med       0.98      0.96      0.97       396
soc.religion.christian       0.94      0.98      0.96       398

              accuracy                           0.96      1502
             macro avg       0.95      0.95      0.95      1502
          weighted avg       0.96      0.96      0.96      1502



array([[286,   8,   5,  20],
       [  9, 377,   2,   1],
       [  4,   7, 381,   4],
       [  5,   1,   1, 391]])
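
The report and the matrix describe the same predictions. Reading the rows as true classes and the columns as predicted classes (which matches the support column: the first row sums to 319), recall is the diagonal entry divided by its row sum and precision is the diagonal entry divided by its column sum. A short sketch that recomputes the figures from the matrix above:

```python
classes = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
matrix = [
    [286,   8,   5,  20],
    [  9, 377,   2,   1],
    [  4,   7, 381,   4],
    [  5,   1,   1, 391],
]

metrics = {}
for i, name in enumerate(classes):
    true_positives = matrix[i][i]
    support = sum(matrix[i])                    # all examples of this class
    predicted = sum(row[i] for row in matrix)   # all examples assigned to this class
    metrics[name] = {
        'precision': true_positives / predicted,
        'recall': true_positives / support,
        'support': support,
    }

# Accuracy is the share of all examples that lie on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(classes))) / sum(map(sum, matrix))

for name, m in metrics.items():
    print(f"{name:>22}  precision={m['precision']:.2f}  recall={m['recall']:.2f}  support={m['support']}")
print(f'accuracy={accuracy:.2f}')
```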

Let's examine the validation example with the highest loss, i.e. the one the model got most wrong.

In [None]:
learner.view_top_losses(n=1, preproc=t)

----------
id:371 | loss:7.01 | true:alt.atheism | pred:comp.graphics)



In [None]:
print(x_test[371])

From: kempmp@phoenix.oulu.fi (Petri Pihko)
Subject: Re: Consciousness part II - Kev Strikes Back!
Organization: University of Oulu, Finland
X-Newsreader: TIN [version 1.1 PL9]
Lines: 30

Scott D. Sauyet (SSAUYET@eagle.wesleyan.edu) wrote:
> In <1993Apr21.163848.8099@cs.nott.ac.uk> 
> Kevin Anthony (kax@cs.nott.ac.uk) writes:

> > Firstly, I'm not impressed with the ability of algorithms. They're
> > great at solving problems once the method has been worked out, but not
> > at working out the method itself.
>   [ .. crossword example deleted ... ]

> Have you heard of neural networks?  I've read a little about them, and
> they seems to overcome most of your objections.

I'm sure there are many people who work with neural networks and
read this newsgroup. Please tell Kevin what you've achieved, and
what you expect.

> I am not saying that NNs will solve all such problems, but I think
> they show that it is not as hard as you think to come up with
> mechanical models of consciousness.

In

This post is more about computing than about atheism (its true category is `alt.atheism`), so the model placed it in the only computing category available to it: `comp.graphics`.
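
The loss value gives a sense of how confident the wrong prediction was. Assuming the reported loss is the per-example categorical cross-entropy, i.e. the negative natural logarithm of the probability assigned to the true class (an assumption about `view_top_losses`, not something its output states), a loss of 7.01 corresponds to a probability of roughly 0.09% for `alt.atheism`:

```python
import math


def true_class_probability(cross_entropy_loss: float) -> float:
    """Invert loss = -ln(p) to recover the probability of the true class."""
    return math.exp(-cross_entropy_loss)


p_true = true_class_probability(7.01)
print(f'{p_true:.4%}')
```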

## STEP 4: Making Predictions on New Data in Deployment

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc=t)

In [None]:
predictor.predict('Jesus Christ is the central figure of Christianity.')

'soc.religion.christian'

In [None]:
# predicted probability scores for each category
predictor.predict_proba('Jesus Christ is the central figure of Christianity.')

array([2.9704000e-03, 5.0002872e-04, 6.5480877e-04, 9.9587470e-01],
      dtype=float32)

In [None]:
predictor.get_classes()

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

As expected, `soc.religion.christian` is assigned the highest probability.
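
The probabilities returned by `predict_proba` appear to be in the same order as the list from `predictor.get_classes()`, so the predicted label is the class at the position of the largest value. A small sketch using the numbers printed above:

```python
classes = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
probabilities = [2.9704000e-03, 5.0002872e-04, 6.5480877e-04, 9.9587470e-01]

# Pair each class with its score and sort from most to least likely.
ranked = sorted(zip(classes, probabilities), key=lambda pair: pair[1], reverse=True)
predicted_label, confidence = ranked[0]

for name, score in ranked:
    print(f'{name:>22}  {score:.4f}')
```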

Let's invoke the `explain` method to see which words contribute most to the classification.

We will need a forked version of the **eli5** library that supports TensorFlow Keras, so let's install it first.

In [None]:
!pip3 install -q git+https://github.com/amaiya/eli5@tfkeras_0_10_1


  Building wheel for eli5 (setup.py) ... done


In [None]:
predictor.explain('Jesus Christ is the central figure in Christianity.')

Contribution?,Feature
8.967,Highlighted in text (sum)
-0.101,<BIAS>


The words in the darkest shade of green contribute most to the classification and agree with what you would expect for this example.

We can save and reload our predictor for later deployment.

In [None]:
predictor.save('/tmp/my_distilbert_predictor')

In [None]:
reloaded_predictor = ktrain.load_predictor('/tmp/my_distilbert_predictor')

In [None]:
reloaded_predictor.predict('My computer monitor is really blurry.')

'comp.graphics'