#### Author
Victor Aleksandrin

#### Reference

#### Idea
Explore different ways to test model performance and find where the models make mistakes.

#### Data
4,500 crypto news titles labeled as positive, neutral, or negative, plus assessment data.

#### Result

In [1]:
import pandas as pd
import yaml

from sklearn.metrics import classification_report
from checklist_utils.train import train_logreg, train_bert, split_train_val, Predictor
from checklist_utils.data import read_data
from checklist_utils.tests import (
    get_coin_invariance_test, get_simple_negation_test, get_not_negative_test,
    get_punctuation_test, get_typos_test, get_contractions_test,
    get_change_names_test, get_change_locations_test
)

import checklist
from checklist.editor import Editor
from checklist.perturb import Perturb
from checklist.test_types import INV, MFT
from checklist.pred_wrapper import PredictorWrapper
from checklist.test_suite import TestSuite

import spacy
nlp = spacy.load("en_core_web_sm")

import warnings
warnings.filterwarnings('ignore')

import logging
# Silence log records at WARNING level and below (this also covers INFO).
logging.disable(logging.WARNING)

### Read data

In [2]:
DATA_PATH = "/artifacts/data"

In [3]:
label_mapping = {"Negative": 0, "Positive": 2, "Neutral": 1}

In [4]:
dataset = read_data(DATA_PATH)

### Config

In [5]:
bert_cfg_str = """
epochs: 3
train_batch_size: 32
val_batch_size: 64
seed: 42

model_name: &model_name distilbert-base-uncased

tokenizer:
    class: transformers.DistilBertTokenizer
    params:
        pretrained_model_name_or_path: *model_name
        model_max_length: 50

model:
    class: transformers.DistilBertForSequenceClassification
    params:
        pretrained_model_name_or_path: *model_name
        num_labels: 3

optimizer:
    class: transformers.AdamW
    params:
        lr: 0.000023
        weight_decay: 0.001

scheduler:
    params:
        name: polynomial
        num_warmup_steps: 0
"""

In [6]:
logreg_cfg_str = """
tfidf: # See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
    stop_words: english  # note: scikit-learn applies stop_words only when analyzer is 'word'
    ngram_range: '(1, 5)'
    analyzer: char
    min_df: 8
    lowercase: true
    max_features: 100000
logreg: # See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
    C: 2.7
    solver: lbfgs
    multi_class: multinomial
    random_state: 17
    max_iter: 500
    n_jobs: 4
    fit_intercept: false
"""

In [7]:
bert_cfg = yaml.safe_load(bert_cfg_str)

In [8]:
logreg_cfg = yaml.safe_load(logreg_cfg_str)
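
YAML has no tuple type, which is presumably why `ngram_range` is stored as the quoted string `'(1, 5)'`. The `train_logreg` helper is not shown in this notebook, so how it converts the value is unknown. A minimal sketch of one possible conversion, using the hypothetical helper `parse_ngram_range`:

```python
import ast


def parse_ngram_range(value):
    """Convert a string such as '(1, 5)' into a tuple of two ints.

    Hypothetical helper: the real train_logreg may handle this differently.
    """
    # literal_eval only parses Python literals, so it is safer than eval.
    parsed = ast.literal_eval(value) if isinstance(value, str) else value
    low, high = (int(v) for v in parsed)
    if low < 1 or low > high:
        raise ValueError(f"invalid ngram_range: {value!r}")
    return (low, high)


# Same keys as in the tfidf section of the config above.
tfidf_params = {"ngram_range": "(1, 5)", "analyzer": "char", "min_df": 8}
tfidf_params["ngram_range"] = parse_ngram_range(tfidf_params["ngram_range"])
print(tfidf_params["ngram_range"])
```

`ast.literal_eval` is used here instead of `eval` so that a malformed config value cannot execute arbitrary code.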

### Task description

We want to test the BERT (DistilBERT) and TF-IDF + logistic regression models and identify where they fail.


#### Approach
To test our models, we apply the approach from the CheckList package. CheckList provides a matrix of general linguistic capabilities and test types.

**Main Capabilities**
- *Vocabulary + POS* - whether a model has the necessary vocabulary and whether it can appropriately handle the impact of words with different parts of speech.
- *Robustness* - to typos, irrelevant changes, etc.
- *NER* - appropriately understanding named entities.

**Test types**
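- *MFT* - minimum functionality test: simple, targeted examples with known expected labels.
- *INV* - invariance test: label-preserving perturbations of the input; the prediction should not change.
- *DIR* - directional expectation test: perturbations after which the prediction should change in a known direction.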

See the paper in the References section below for details.
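
To make the difference between MFT and INV concrete, here is a small sketch that does not use CheckList itself. The classifier `toy_predict` and the helpers `run_mft` and `run_inv` are hypothetical stand-ins written for illustration only; DIR tests are left out.

```python
def toy_predict(text):
    """Stand-in keyword classifier: 0 = negative, 1 = neutral, 2 = positive."""
    words = text.lower().split()
    if "ban" in words or "crash" in words:
        return 0
    if "surges" in words or "rally" in words:
        return 2
    return 1


def run_mft(predict, cases):
    """MFT: every (text, expected_label) pair must be predicted correctly."""
    return [text for text, expected in cases if predict(text) != expected]


def run_inv(predict, groups):
    """INV: every perturbed variant must keep the prediction of the original."""
    failed = []
    for original, *variants in groups:
        base = predict(original)
        if any(predict(variant) != base for variant in variants):
            failed.append(original)
    return failed


mft_failures = run_mft(toy_predict, [
    ("bitcoin surges", 2),
    ("regulators ban bitcoin", 0),
    ("bitcoin does not crash", 2),   # negation: the keyword model gets this wrong
])
inv_failures = run_inv(toy_predict, [
    ["bitcoin surges", "ethereum surges"],   # coin name switched
    ["bitcoin crash", "bitcoin crashh"],     # typo added
])
print(mft_failures)
print(inv_failures)
```

The MFT needs expected labels, while the INV only compares predictions with each other, which is why INV tests can be generated from unlabeled validation titles.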


#### Steps


1. Split the dataset into training data (80%) and validation data (20%).
2. Train the models on the training data.
3. Generate tests using CheckList and the validation data.
4. Run the tests and check the error rate = failed tests / all tests.
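
The error rate from step 4 is a simple ratio. As a sketch (the function name `error_rate` is ours, not part of CheckList), 11 failures out of 500 test cases give 2.2%:

```python
def error_rate(n_failed, n_total):
    """Share of test cases that did not pass, as a percentage."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_failed / n_total


print(f"{error_rate(11, 500):.1f}%")
print(f"{error_rate(13, 200):.1f}%")
```

Note that a single test case can contain several examples (an original title plus its perturbed variants), so the number of predicted examples is usually larger than the number of test cases.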

#### References
- [Beyond Accuracy: Behavioral Testing of NLP models with CheckList](https://arxiv.org/abs/2005.04118)
- [CheckList on GitHub](https://github.com/marcotcr/checklist)

### Split data

In [9]:
train_data, val_data, train_labels, val_labels = split_train_val(dataset, test_size=0.2)

### Train BERT

In [10]:
output = train_bert(
    bert_cfg, 
    train_data,
    train_labels,
    val_data,
    val_labels, 
    return_predictions=True,
    return_model=True
)

In [11]:
output.keys()

dict_keys(['scores', 'pred_labels', 'model'])

In [12]:
output["scores"]

{'train_loss': [0.7456982135772705, 0.46093425154685974, 0.36142468452453613],
 'val_loss': [0.5812508463859558, 0.563011109828949, 0.5765780210494995],
 'train_acc': [0.6849710941314697, 0.831581711769104, 0.8715186715126038],
 'val_acc': [0.7773109078407288, 0.7804622054100037, 0.793067216873169]}

In [13]:
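# classification_report pairs target_names with the sorted label ids,
# so the names are ordered by id (assuming the labels follow label_mapping).
print(classification_report(val_labels, output["pred_labels"], target_names=sorted(label_mapping, key=label_mapping.get)))

              precision    recall  f1-score   support

    Negative       0.77      0.88      0.82       309
     Neutral       0.71      0.56      0.62       193
    Positive       0.84      0.83      0.84       450

    accuracy                           0.79       952
   macro avg       0.77      0.76      0.76       952
weighted avg       0.79      0.79      0.79       952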



### Train logreg

In [14]:
logreg = train_logreg(logreg_cfg, train_data, train_labels)

In [15]:
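# classification_report pairs target_names with the sorted label ids,
# so the names are ordered by id (assuming the labels follow label_mapping).
print(classification_report(val_labels, logreg.predict(val_data), target_names=sorted(label_mapping, key=label_mapping.get)))

              precision    recall  f1-score   support

    Negative       0.73      0.71      0.72       309
     Neutral       0.69      0.41      0.51       193
    Positive       0.71      0.84      0.77       450

    accuracy                           0.71       952
   macro avg       0.71      0.65      0.67       952
weighted avg       0.71      0.71      0.70       952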



### Set prediction wrappers for tests

In [16]:
bert_predictor = Predictor(output["model"])

In [17]:
wrapped_bert_predictor = PredictorWrapper.wrap_softmax(bert_predictor)

In [18]:
wrapped_bert_predictor([
    "btc does not drop by 10%", 
    "EU Won't Ban Bitcoin After All",
    "BTC Price Tech Analysis for 08/02/2017  Back to Triangle Support"
])

(array([2, 2, 1]),
 array([[0.07258038, 0.07392533, 0.8534943 ],
        [0.08949793, 0.08003902, 0.83046305],
        [0.01836355, 0.77549595, 0.20614046]], dtype=float32))

In [19]:
wrapped_logreg_predictor = PredictorWrapper.wrap_softmax(logreg.predict_proba)

In [20]:
wrapped_logreg_predictor([
    "btc does not drop by 10%", 
    "EU Won't Ban Bitcoin After All",
    "BTC Price Tech Analysis for 08/02/2017  Back to Triangle Support"
])

(array([0, 0, 2]),
 array([[0.68979539, 0.18035446, 0.12985014],
        [0.52225071, 0.07841449, 0.3993348 ],
        [0.05535587, 0.36895761, 0.57568652]]))
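
Both wrapped predictors return a pair: the predicted labels and the class probabilities. The sketch below mimics that behavior in plain Python to show what the wrapper adds. It is an illustration, not CheckList's implementation (which returns NumPy arrays), and `wrap_probabilities` and `toy_proba` are hypothetical names.

```python
def wrap_probabilities(proba_fn):
    """Return a function that maps texts to (predicted labels, probabilities)."""
    def predict_with_confidence(texts):
        probs = [list(row) for row in proba_fn(texts)]
        # The predicted label is the index of the largest probability in each row.
        labels = [max(range(len(row)), key=row.__getitem__) for row in probs]
        return labels, probs
    return predict_with_confidence


def toy_proba(texts):
    """Hypothetical model: the same fixed probabilities for every input text."""
    return [[0.69, 0.18, 0.13] for _ in texts]


wrapped = wrap_probabilities(toy_proba)
labels, probs = wrapped(["btc does not drop by 10%", "EU Won't Ban Bitcoin After All"])
print(labels)
print(probs)
```

Any function that returns one row of class probabilities per input text can be wrapped this way, which is why `logreg.predict_proba` can be passed directly.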

### Add tests

In [21]:
suite = TestSuite()

### Capability: NER

#### Invariance tests

In [22]:
coin_inv_test = get_coin_invariance_test(val_data)
suite.add(coin_inv_test)

In [23]:
coin_inv_test.data[:2]

[["china's central bank to continue bitcoin exchange inspections",
  "china's central bank to continue ethereum exchange inspections",
  "china's central bank to continue ripple exchange inspections",
  "china's central bank to continue tether exchange inspections",
  "china's central bank to continue cardano exchange inspections",
  "china's central bank to continue stellar exchange inspections",
  "china's central bank to continue dogecoin exchange inspections"],
 ['bitcoin price to reach $60,000 before crashing to $1,000 in 2018 is saxo banks outrageous prediction',
  'ethereum price to reach $60,000 before crashing to $1,000 in 2018 is saxo banks outrageous prediction',
  'ripple price to reach $60,000 before crashing to $1,000 in 2018 is saxo banks outrageous prediction',
  'tether price to reach $60,000 before crashing to $1,000 in 2018 is saxo banks outrageous prediction',
  'cardano price to reach $60,000 before crashing to $1,000 in 2018 is saxo banks outrageous prediction',
 

In [24]:
change_names_test = get_change_names_test(list(nlp.pipe(val_data.values)))
suite.add(change_names_test)

In [25]:
change_locations_test = get_change_locations_test(list(nlp.pipe(val_data.values)))
suite.add(change_locations_test)
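
The NER invariance helpers are not shown in this notebook. Judging from the `coin_inv_test.data` output above, the coin test replaces `bitcoin` in a lowercased title with each of the other coin names. A minimal sketch of such a perturbation, where `switch_coin` is a hypothetical function and the coin list is taken from that output:

```python
import re

# Coin names as they appear in the coin_inv_test.data output.
COINS = ["bitcoin", "ethereum", "ripple", "tether", "cardano", "stellar", "dogecoin"]


def switch_coin(title, source="bitcoin", coins=COINS):
    """Return [original, variant, ...], or [] if the source coin is absent."""
    # \b keeps the match on whole words only.
    pattern = re.compile(rf"\b{re.escape(source)}\b")
    if not pattern.search(title):
        return []
    variants = [pattern.sub(coin, title) for coin in coins if coin != source]
    return [title] + variants


case = switch_coin("china's central bank to continue bitcoin exchange inspections")
for text in case:
    print(text)
```

Each returned list forms one invariance test case: the model is expected to give every variant the same label as the original title.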

### Capability: Negation

#### MFTs

In [26]:
simple_negation_test = get_simple_negation_test()
suite.add(simple_negation_test)

In [27]:
simple_negation_test.data[:6]

['bitcoin is not legal.',
 'ethereum is not legal.',
 'ripple is not legal.',
 'tether is not legal.',
 'cardano is not legal.',
 'stellar is not legal.']
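
MFT data like the list above can be produced by filling a template with every combination of slot values. The sketch below shows the idea in plain Python. `fill_template` is a hypothetical helper, and the adjective `safe` is an invented slot value added only to show a second slot; the actual `get_simple_negation_test` is not shown here.

```python
from itertools import product

COINS = ["bitcoin", "ethereum", "ripple", "tether", "cardano", "stellar", "dogecoin"]
ADJECTIVES = ["legal", "safe"]  # "safe" is a hypothetical value for illustration


def fill_template(template, **slots):
    """Expand a template with the Cartesian product of all slot values."""
    names = list(slots)
    combos = product(*(slots[name] for name in names))
    return [template.format(**dict(zip(names, combo))) for combo in combos]


# The first slot varies slowest, so all coins are listed for "legal" first.
samples = fill_template("{coin} is not {adj}.", adj=ADJECTIVES, coin=COINS)
expected_label = 0  # every sample in this MFT is expected to be negative
print(samples[:3])
print(len(samples))
```

Because every generated sentence shares one expected label, the test needs no manual labeling beyond writing the template.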

In [28]:
not_negative_test = get_not_negative_test()
suite.add(not_negative_test)

In [29]:
not_negative_test.data[:10]

['bitcoin does not drop below 39000$.',
 'ethereum does not drop below 39000$.',
 'ripple does not drop below 39000$.',
 'tether does not drop below 39000$.',
 'cardano does not drop below 39000$.',
 'stellar does not drop below 39000$.',
 'dogecoin does not drop below 39000$.',
 "bitcoin doesn't drop below 39000$.",
 "ethereum doesn't drop below 39000$.",
 "ripple doesn't drop below 39000$."]

### Capability: Robustness

#### Invariance tests

In [30]:
punctuation_test = get_punctuation_test(list(nlp.pipe(val_data.values)))
suite.add(punctuation_test)

In [31]:
typos_test = get_typos_test(val_data.values)
suite.add(typos_test)

In [32]:
contractions_test = get_contractions_test(val_data.values)
suite.add(contractions_test)
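
The typo perturbation can be pictured as swapping one character with its neighbor, as in `Regulations` becoming `Regluations`. A minimal sketch under that assumption; `add_typo` is a hypothetical function, not the helper used by `get_typos_test`:

```python
import random


def add_typo(text, rng):
    """Swap one randomly chosen character with its right-hand neighbor."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


rng = random.Random(0)  # fixed seed so the perturbation is reproducible
title = "India is Preparing Bitcoin Regulations, Ban Unlikely: Report"
print(add_typo(title, rng))
```

If the two swapped characters happen to be identical, the text stays unchanged, so a real test generator may want to resample in that case.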

### Run tests

In [33]:
suite.run(wrapped_bert_predictor, overwrite=True)

Running Switch coin name.
Predicting 1574 examples
Running Change names.
Predicting 275 examples
Running Change locations.
Predicting 440 examples
Running Simple negation: negative samples.
Predicting 98 examples
Running Simple negation: not negative.
Predicting 112 examples
Running Punctuation.
Predicting 1014 examples
Running Typos.
Predicting 1000 examples
Running Contractions.
Predicting 48 examples


In [34]:
suite.visual_summary_table()

In [40]:
suite.run(wrapped_logreg_predictor, overwrite=True)

Running Switch coin name.
Predicting 1514 examples
Running Simple negation: negative samples.
Predicting 98 examples
Running Simple negation: not negative.
Predicting 112 examples
Running Punctuation.
Predicting 1015 examples
Running Typos.
Predicting 1000 examples
Running Contractions.
Predicting 30 examples
Running Change names.
Predicting 352 examples
Running Change locations.
Predicting 517 examples


In [46]:
suite.summary()

Robustness

Punctuation.
Test cases:      500
Fails (rate):    0 (0.0%)


Typos.
Test cases:      500
Fails (rate):    11 (2.2%)

Example fails:
0.5 0.1 0.4 India is Preparing Bitcoin Regulations, Ban Unlikely: Report
0.4 0.1 0.5 India is Preparing Bitcoin Regluations, Ban Unlikely: Report

----
0.2 0.3 0.5 Bitcoin Cash Price Analysis: BCH/USD Breaks Above Key Resistance
0.4 0.3 0.3 Bitcoin Cash Price Analysis: BCH/USD Breaks bAove Key Resistance

----
0.3 0.4 0.3 OneCoin, Bitcoin, Litecoin on Same aWrning List by Bank of Uganda

----


Contractions.
Test cases:      15
Fails (rate):    0 (0.0%)




NER

Switch coin name.
Test cases:      200
Fails (rate):    13 (6.5%)

Example fails:
0.2 0.3 0.5 bitcoin has gone mainstream. that's a very big deal
0.2 0.4 0.4 tether has gone mainstream. that's a very big deal
0.2 0.4 0.4 stellar has gone mainstream. that's a very big deal

----
0.3 0.1 0.6 $2,300 and rising: bitcoin cash gains against bitcoin
0.4 0.2 0.4 $2,300 and rising: stellar cash

In [37]:
# suite.visual_summary_table()
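
In the failure examples of `suite.summary()`, each line starts with three numbers followed by the text. Assuming these are the rounded class probabilities in label order, and that an invariance case fails whenever the predicted label changes, a failure can be checked by hand as sketched below (`predicted_label` and `inv_case_fails` are hypothetical helpers):

```python
def predicted_label(probs):
    """Index of the largest probability (first index wins on ties)."""
    return max(range(len(probs)), key=probs.__getitem__)


def inv_case_fails(original_probs, perturbed_probs_list):
    """True if any perturbed variant gets a different label than the original."""
    base = predicted_label(original_probs)
    return any(predicted_label(p) != base for p in perturbed_probs_list)


# Rounded probabilities copied from the first "Typos" failure in the summary.
original = [0.5, 0.1, 0.4]
with_typo = [0.4, 0.1, 0.5]
print(predicted_label(original), predicted_label(with_typo))
print(inv_case_fails(original, [with_typo]))
```

Because the printed probabilities are rounded to one decimal, ties in the summary output (for example `0.2 0.4 0.4`) cannot be resolved by hand this way.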