# Tutorial 01: Preprocessing

This section shows how to convert raw data and raw models into the format required by Errudite.

### Must-do setup


**!! VERY IMPORTANT**

To successfully run this notebook and all the following notebooks, make sure you install the required dependencies:

```sh
# First, start your virtual environment.
# Assuming you are in the top errudite folder:
cd tutorials/
pip install -r requirements_tutorial.txt
```

This makes sure you can use dependencies that are not in the main package requirements. For example, here we load predictors from `AllenNLP`.

In [3]:
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

def import_sys():
    import sys
    sys.path.append('..')
import_sys()

import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)  # pylint: disable=invalid-name

In [4]:
pwd

'C:\\Users\\dongh\\errudite\\tutorials'

In [5]:
import errudite
print(errudite)

<module 'errudite' from '..\\errudite\\__init__.py'>


# Task Preview
We use the Textual Entailment (TE) task as a demo task here. TE models take a pair of sentences and predict whether the facts in the first necessarily imply the facts in the second. The dataset we have here is 100 instances from the [Stanford Natural Language Inference (SNLI)](https://nlp.stanford.edu/projects/snli/) corpus, a large annotated corpus for learning natural language inference.

# Preprocessing & Input

For data grouping and rewriting to work, Errudite requires the raw input texts to be parsed and annotated with lemmas, POS tags, named entities, and parse tree structures. The preprocessing is based on [SpaCy@2](https://spacy.io/). This notebook shows how to convert raw text inputs into annotated instance objects.

## Define a `DatasetReader`.
---

A ``DatasetReader`` ([documentation here](https://errudite.readthedocs.io/en/latest/api/_extensible_dataset_reader.html)) loads the raw data from a data file, preprocesses the data to include linguistic features, and saves the processed data to a cache folder. All the task-specific readers are *registered* under the base class ``DatasetReader``, so they can be queried via their names:

```python
DatasetReader.by_name("dataset_name")
```
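
The name-based lookup can be pictured with a small registry. The sketch below is not Errudite's implementation; `Registrable`, `ToyReader`, and the `"toy_reader"` name are hypothetical and only illustrate how a `register` decorator and a `by_name` lookup can fit together.

```python
from typing import Callable, Dict, Type


class Registrable:
    """Hypothetical base class that keeps a name -> subclass registry."""

    _registry: Dict[str, Type["Registrable"]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        # Returns a decorator that stores the decorated class under `name`.
        def decorator(subclass: type) -> type:
            if name in cls._registry:
                raise ValueError(f"{name!r} is already registered")
            cls._registry[name] = subclass
            return subclass
        return decorator

    @classmethod
    def by_name(cls, name: str) -> type:
        # Look up a previously registered class; fail loudly on unknown names.
        if name not in cls._registry:
            raise KeyError(f"{name!r} is not a registered name")
        return cls._registry[name]


@Registrable.register("toy_reader")
class ToyReader(Registrable):
    def __init__(self, cache_folder_path: str = "./caches/") -> None:
        self.cache_folder_path = cache_folder_path


# by_name returns the class itself, which is then instantiated as usual.
reader_cls = Registrable.by_name("toy_reader")
toy_reader = reader_cls(cache_folder_path="./data/toy_caches")
```

Note that `by_name` hands back a class rather than an object, which is why the tutorial calls the result with constructor arguments.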

Because the reader also handles dumping the processed instances into cache files, we require you to provide a desired cache path. If not provided, the default path is `./caches/`.

To get the pre-implemented reader for SNLI, we could run the following: 

In [6]:
from errudite.io import DatasetReader

cache_folder_path = "./data/snli_tutorial_caches"
reader = DatasetReader.by_name("snli")(cache_folder_path=cache_folder_path)

INFO:errudite.utils.file_utils:Errudite cache folder selected: ./data/snli_tutorial_caches


## Step 1: Transfer the raw data to `Instance`s and `Target`s
---

With the reader, the first step is to read the raw dataset file and convert its rows into `Instance`s, a data structure used by Errudite that contains various `Target`s. See more details in [this documentation](https://errudite.readthedocs.io/en/latest/api/_targets_and_instances.html).

### Raw data structure

In SNLI, one raw data record has the following structure:


```python
# A unique id denoting each sentence pair. 
pairID                                                  "4705552913.jpg#2r1n"
# A more general id that groups relevant sentence pairs.
captionID                                                  "4705552913.jpg#2"
# The premise caption that was supplied to the author of the pair.
sentence1                 "Two women are embracing while holding to go pa..."
# The hypothesis caption that was written by the author of the pair.
sentence2                 "The sisters are hugging goodbye while holding ..."
#  This is the label chosen by the majority of annotators.
gold_label                                                          "neutral"
# the majority of the following form the gold_label.
label1                                                              "neutral"
label2                                                           "entailment"
label3                                                              "neutral"
label4                                                              "neutral"
label5                                                              "neutral"
# These are default parsing information that we do not use.
sentence1_parse           "(ROOT (S (NP (CD Two) (NNS women)) (VP (VBP ar..."
sentence2_parse           "(ROOT (S (NP (DT The) (NNS sisters)) (VP (VBP ..."
sentence1_binary_parse    "( ( Two women ) ( ( are ( embracing ( while ( ..."
sentence2_binary_parse    "( ( The sisters ) ( ( are ( ( hugging goodbye ..."
```
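
As a rough illustration of how `gold_label` relates to `label1` through `label5`, the sketch below takes a majority vote over the five annotator labels. The three-vote threshold and the `"-"` placeholder for records without a majority are assumptions made for this example, and `majority_label` is a hypothetical helper, not part of Errudite.

```python
from collections import Counter
from typing import List


def majority_label(labels: List[str], min_votes: int = 3) -> str:
    """Return the label that reaches `min_votes`, or "-" if none does."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else "-"


# label1 .. label5 of the record shown above
raw_labels = ["neutral", "entailment", "neutral", "neutral", "neutral"]
gold = majority_label(raw_labels)

# A record where no label reaches three votes
no_majority = majority_label(
    ["neutral", "entailment", "contradiction", "neutral", "entailment"])
```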


### Processed `Instance`, and their associated variables

Each raw data record above is converted into an `Instance`, with the following information: 

#### IDs
An instance is identified (hashed) with:
- `qid`: unique identifier or id. (`pairID` in example, a unique identifier for each sentence1--sentence2 pair.)
- `vid`: Notes the version of a target. The original inputs are version 0. When a target is rewritten, vid increases. 

#### Targets

Targets are primitives that allow users to access inputs and outputs at different levels of granularity. What's essential in all the analysis are these `Target`s -- In fact, you could treat an `Instance` as a wrapper for targets.

In this task, we convert `sentence1` (`premise`), `sentence2` (`hypothesis`), and `gold_label` (`groundtruth`) into targets. In addition, though not in the raw data above, each instance has predictions from models -- we will generate them as we go on. 

To do the conversion, call the `read()` function of the reader:

In [7]:
# read the raw data!
instances = reader.read(
    # The path of the input data file. We are using the first 100 rows from the SNLI dev set.
    file_path='data/snli_dev_100.txt', 
    # If sample size is set, only load this many of instances, by default None.
    sample_size=100)

INFO:errudite.io.dataset_reader:Reading instances from lines in file at: data/snli_dev_100.txt
INFO:errudite.io.snli_reader:Reading instances from lines in file at: data/snli_dev_100.txt
100it [00:02, 42.30it/s]


### More on `Target` API

(Or: detailed steps in that `reader.read()` step.)

We define a general `Target` class which takes **four** inputs:
- `text`: The raw text, which will be processed with SpaCy
- `qid`: The id of the instance
- `vid`: The version
- `metas`: Additional information; sometimes a specific target can take additional inputs

For example, our `SNLIReader` uses the following lines to create `hypothesis` and `premise`:
```python
from errudite.targets.target import Target
# `sentence1` of the raw record is the premise, `sentence2` the hypothesis.
premise = Target(
    qid='4705552913.jpg#2r1n', 
    text='Two women are embracing while holding to go packages.', 
    vid=0, 
    metas={'type': 'premise'})
hypothesis = Target(
    qid='4705552913.jpg#2r1n', 
    text="The sisters are hugging goodbye while holding to go packages after just eating lunch.", 
    vid=0, 
    metas={'type': 'hypothesis'})
print(premise)
print(hypothesis)
```

####  Special case of `Target`: `Label`

`Label` is a special subclass of `Target`, denoting _groundtruth_ and _prediction_. It takes an additional input:
- `model`: Which predictor the label is produced by. For groundtruths, make sure this model is `groundtruth`.

Because `Label` can be of different types (`int`, predefined class `str`, or span `str` extracted from certain targets), we define two subclasses of `Label`.
- `SpanLabel`: To handle tasks like QA, where the output label is a sequence span extracted from the input (context), and therefore does not come from a predefined set. These labels are similarly processed by SpaCy to be queryable.
- `PredefinedLabel`: To handle tasks where the output labels are discrete, predefined class types. These outputs do not need any preprocessing.

Here, the groundtruth label is always one of `['neutral', 'contradiction', 'entailment']`. Therefore, we define it with `PredefinedLabel`:

```python
from errudite.targets.label import PredefinedLabel
# the five annotator labels (label1 .. label5) of the raw record shown above
raw_labels = ['neutral', 'entailment', 'neutral', 'neutral', 'neutral']
groundtruth = PredefinedLabel(
    model='groundtruth', 
    qid='4705552913.jpg#2r1n', 
    text='neutral', 
    vid=0, 
    # we can save the raw labels into the groundtruth as well:
    metas={'raw_labels': raw_labels}
)
```

#### Merging targets into instances
Create `Instance` objects by setting the correct entries and keys created by the targets. This becomes the wrapper class which is used by the DSL to create specific instances. 

**!!** While other entries are flexible, make sure you set [`predictions` or `prediction`] and [`groundtruths` or `groundtruth`], depending on how many groundtruths you have, and how many models you are using to predict this one instance.
All of them are saved into the instance:

```python
instance.set_entries(
    hypothesis=hypothesis, 
    premise=premise, 
    groundtruth=groundtruth)
```

## Step 2: Load models & add the missing `prediction` target
---

As mentioned before, though not included in the input, each instance can have a prediction from a model (or prediction**s** from multiple model**s** if you plan to do model comparison).

Predictions can be loaded from files, just like how you would create groundtruth targets with `Label` classes. Alternatively, you can **get predictions from actual predictors in real time.**

This is especially important if you plan to do any form of rewriting later -- You can only test a rewrite if you have a model that can re-run predictions on the newly created, rewritten instances!

This part shows you how to load a `Predictor` (more in the [documentation](https://errudite.readthedocs.io/en/latest/api/_extensible_predictor.html)!)


### Getting the predictions

The basic `Predictor` class wraps the following information:
1. `name`: An identifier of the predictor.
2. `description`: A description to help you remember your model.
3. `predictor`: The actual trained model.
4. `perform_metrics`: The metrics you want to evaluate your model on.

Below, we create a `predictor` with the AllenNLP pretrained model.

In [8]:
from errudite.predictors import Predictor
model_online_path = "https://s3-us-west-2.amazonaws.com/allennlp/models/decomposable-attention-elmo-2018.02.19.tar.gz"
predictor = Predictor.by_name("nli_decompose_att")(
    name='decompose_att', 
    description='Pretrained model from Allennlp, for the decomposable attention model',
    model_online_path=model_online_path)

INFO:pytorch_pretrained_bert.modeling:Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
INFO:pytorch_transformers.modeling_bert:Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
INFO:pytorch_transformers.modeling_xlnet:Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
INFO:allennlp.common.registrable:instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
INFO:allennlp.common.registrable:instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
INFO:allennlp.common.registrable:instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
INFO:allennlp.common.registrable:instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
INFO:allennlp.models.archival:loading archive file https://s3-us-west-2.amazonaws.com/allennlp/models/decomposable-attention-elm

The predictor has two types of predictions: 
1. A raw prediction function, which takes texts and returns the prediction output in a json format:

```python
# individual model prediction
predictor.predict(
  hypothesis=hypothesis.get_text(),
  premise=premise.get_text()
)
# returns: {'confidence': 0.9988018274307251, 'text': 'neutral'}
```

2. A class method that takes `Target` inputs, runs model predictions, and wraps the output prediction into `Label`s.

```python
# individual model prediction
prediction = Predictor.by_name("nli_task_class").model_predict(
    predictor=predictor, 
    premise=premise, 
    hypothesis=hypothesis, 
    groundtruth=groundtruth)
# returns: 
# [PredefinedLabel] [LabelKey(qid='4705552913.jpg#2r1n', vid=0, model='decompose_att', label='neutral')]
# neutral ({'accuracy': 1.0, 'confidence': 0.9988018274307251})
```

We run it on every instance, and save them as `predictions` entries into the instances:

In [11]:
from tqdm import tqdm 
logger.info("Running predictions....")
for instance in tqdm(instances):
    prediction = Predictor.by_name("nli_task_class").model_predict(
        predictor, 
        premise=instance.premise, 
        hypothesis=instance.hypothesis, 
        groundtruth=instance.groundtruth)
    # set the prediction
    instance.set_entries(predictions=[ prediction ])

INFO:__main__:Running predictions....
100%|██████████████████████████████████████████████████████████████| 100/100 [01:39<00:00,  1.37s/it]


The completed instances have several functions that help you query targets or performances.

In [10]:
print(instances[0].get_entry('premise'), "\n")
print(instances[0].is_incorrect(model='decompose_att'), "\n")

instances[0].show_instance()

[Target] [InstanceKey(qid='4705552913.jpg#2r1n', vid=0)]
Two women are embracing while holding to go packages. 

False 

[Instance] [InstanceKey(qid='4705552913.jpg#2r1n', vid=0)]
[hypothesis]	The sisters are hugging goodbye while holding to go packages after just eating lunch.
[premise]	Two women are embracing while holding to go packages.
[groundtruth]	neutral	groundtruth	{}
[predictions]	neutral	decompose_att	{'accuracy': 1.0, 'confidence': 0.9988018274307251}



With all the predictions generated, we can use them to compute the model's overall performance:

In [14]:
predictor.evaluate_performance(instances)
print({"predictor": predictor.name, "perform": predictor.perform })

{'predictor': 'decompose_att', 'perform': {'accuracy': 0.92, 'confidence': 0.8930206060409546}}
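
To make the aggregation concrete, here is a minimal sketch that averages per-instance metric dicts into one overall dict. It assumes a plain arithmetic mean per metric; whether `evaluate_performance` aggregates exactly this way is not confirmed here, and `average_performance` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, List


def average_performance(per_instance: List[Dict[str, float]]) -> Dict[str, float]:
    """Average every metric over all instances (assumed aggregation rule)."""
    if not per_instance:
        return {}
    metric_names = per_instance[0].keys()
    return {name: mean(p[name] for p in per_instance) for name in metric_names}


# Toy per-instance performances, shaped like the `predictions` entry above.
per_instance = [
    {"accuracy": 1.0, "confidence": 0.9},
    {"accuracy": 0.0, "confidence": 0.6},
    {"accuracy": 1.0, "confidence": 0.75},
    {"accuracy": 1.0, "confidence": 0.75},
]
overall = average_performance(per_instance)
```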


We can also save them to the hashes attached to `Instance`, which will build three hashes:

1. `Instance.instance_hash`: `Dict[InstanceKey, Instance]`, A dict that saves all the original instances, denoted by the corresponding instance keys.
2. `Instance.instance_hash_rewritten`: `Dict[InstanceKey, Instance]`, A dict that saves all the rewritten instances, denoted by the corresponding instance keys.
3. `Instance.qid_hash`: `Dict[str, List[InstanceKey]]`, A dict that maps each `qid` to the instance keys of its different versions.



In [15]:
# ---------
# Build the instance store hash
from errudite.targets.instance import Instance
instance_hash, instance_hash_rewritten, qid_hash = Instance.build_instance_hashes(instances)
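
The relationship between the three hashes can be sketched with plain dictionaries. This is an illustration only: the `InstanceKey` namedtuple and `build_hashes` below are stand-ins written for this example, and the split on `vid == 0` follows the statement above that original inputs are version 0.

```python
from collections import defaultdict, namedtuple
from typing import Dict, List, Tuple

# Stand-in for Errudite's instance key: a (qid, vid) pair.
InstanceKey = namedtuple("InstanceKey", ["qid", "vid"])


def build_hashes(keys: List[InstanceKey]) -> Tuple[Dict, Dict, Dict]:
    instance_hash: Dict[InstanceKey, str] = {}
    instance_hash_rewritten: Dict[InstanceKey, str] = {}
    qid_hash: Dict[str, List[InstanceKey]] = defaultdict(list)
    for key in keys:
        payload = f"instance {key.qid} v{key.vid}"  # placeholder for an Instance
        if key.vid == 0:
            instance_hash[key] = payload            # original instances
        else:
            instance_hash_rewritten[key] = payload  # rewritten versions
        qid_hash[key.qid].append(key)               # every version of one qid
    return instance_hash, instance_hash_rewritten, dict(qid_hash)


keys = [InstanceKey("a", 0), InstanceKey("b", 0), InstanceKey("a", 1)]
originals, rewritten, by_qid = build_hashes(keys)
```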

## Step 3: Compute related distributions for further analysis
---

Besides creating instances, the `reader` preprocesses and computes two more things:

1. Compute the vocabulary from a given data file. This is used to get the training frequency, which is saved to `Instance.train_freq` in the format of:
```python
{ vocab[str] : count[int] }
```

2. Compute the relationship between linguistic features and model performances. It’s used for programming by demonstration. The result is saved to `Instance.ling_perform_dict`, in the format of:

```python
{
    target_name[str] : { # e.g. "premise"
        pattern[str]: { # e.g. "two women are", "two NOUN are"
            model_name[str]: { # e.g. "decompose_att"
                "cover": # how many instances have the pattern.
                "err_cover": # the ratio of incorrect predictions with the pattern, 
                             # over all the incorrect predictions.
                "err_rate":  # the ratio of incorrect predictions, 
                             # over all the instances with the pattern.
            }
        }
    }
}
```
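
The three numbers can be computed with simple counting. The sketch below works on toy data for a single target and a single model; `pattern_stats` and the record layout are assumptions for illustration, not the reader's actual code.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def pattern_stats(records: List[Tuple[Set[str], bool]]) -> Dict[str, Dict[str, float]]:
    """records: (patterns found in one instance, whether the model was incorrect)."""
    total_errors = sum(1 for _, incorrect in records if incorrect)
    cover: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for patterns, incorrect in records:
        for pattern in patterns:
            cover[pattern] += 1
            if incorrect:
                errors[pattern] += 1
    return {
        pattern: {
            # number of instances containing the pattern
            "cover": cover[pattern],
            # share of all errors that contain the pattern
            "err_cover": errors[pattern] / total_errors if total_errors else 0.0,
            # share of instances with the pattern that are errors
            "err_rate": errors[pattern] / cover[pattern],
        }
        for pattern in cover
    }


records = [
    ({"two NOUN are", "two women are"}, False),
    ({"two NOUN are"}, True),
    ({"a man is"}, True),
    ({"a man is"}, False),
]
stats = pattern_stats(records)
```

A pattern with a high `err_rate` but a low `err_cover` fails often yet explains few of the errors overall, which is why both ratios are kept.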

In [16]:
# ---------
# Compute the vocabulary from the given training data file.
reader.count_vocab_freq('data/snli_train_1000.txt')
# ---------
# Compute the relationship between linguistic features and model performances.
reader.compute_ling_perform_dict(list(Instance.instance_hash.values()))
# ---------

INFO:errudite.io.dataset_reader:Computing vocab frequency from file at: data/snli_train_1000.txt
INFO:errudite.io.snli_reader:Reading instances from lines in file at: data/snli_train_1000.txt
999it [00:00, 9715.72it/s]
INFO:errudite.io.dataset_reader:Computing premise frequency.
100%|███████████████████████████████████████████████████████████████████████████████| 999/999 [00:02<00:00, 377.52it/s]
INFO:errudite.io.dataset_reader:Computing hypoethsis frequency.
100%|███████████████████████████████████████████████████████████████████████████████| 999/999 [00:02<00:00, 455.97it/s]
INFO:errudite.io.dataset_reader:Computing linguistic performance distribution per instance...
100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [00:09<00:00,  9.32it/s]
INFO:errudite.io.dataset_reader:Computing the final distribution...


## Step 4: Save your processed data
---

It becomes tedious to re-run the preprocessing step again and again. 
`reader.dump_preprocessed()` saves all the preprocessed information to the cache folder. It generates the following files in the cache folder:

```sh
[cache folder]
│   # saved attr, group and rewrite json that can be reloaded. 
│   # we talk about these in the next tutorial.
├── analysis 
├── evaluations # predictions saved by the different models, with the model name being the folder name.
│   └── decompose_att.pkl
├── instances.pkl # Save all the `Instance`, with the processed Target.
│   # A dict saving the relationship between linguistic features and model performances. 
│   # It's used for the programming by demonstration.
├── ling_perform_dict.pkl
├── train_freq.json # The training vocabulary frequency
└── vocab.pkl # The SpaCy vocab information.
```

In [17]:
reader.dump_preprocessed()

INFO:errudite.io.dataset_reader:Dumped 100 objects to ./data/snli_tutorial_caches\instances.pkl.
INFO:errudite.io.dataset_reader:Dumped 100 objects to ./data/snli_tutorial_caches\evaluations/decompose_att.pkl.
INFO:errudite.io.dataset_reader:Dumped the linginguistic perform dict.
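
The caching idea itself is a plain dump-and-reload round trip. The sketch below mimics two of the files listed above with the standard library only; `dump_cache`, `load_cache`, and the toy payloads are hypothetical and do not reflect how Errudite serializes its objects internally.

```python
import json
import pickle
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def dump_cache(folder: Path, instances: List[dict], train_freq: Dict[str, int]) -> None:
    """Write the processed objects into the cache folder."""
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / "instances.pkl", "wb") as f:
        pickle.dump(instances, f)
    with open(folder / "train_freq.json", "w", encoding="utf-8") as f:
        json.dump(train_freq, f)


def load_cache(folder: Path) -> Tuple[List[dict], Dict[str, int]]:
    """Reload what dump_cache wrote, skipping the expensive preprocessing."""
    with open(folder / "instances.pkl", "rb") as f:
        instances = pickle.load(f)
    with open(folder / "train_freq.json", "r", encoding="utf-8") as f:
        train_freq = json.load(f)
    return instances, train_freq


with tempfile.TemporaryDirectory() as tmp:
    cache_folder = Path(tmp) / "toy_caches"
    dump_cache(cache_folder, [{"qid": "q1", "vid": 0}], {"women": 3})
    loaded_instances, loaded_freq = load_cache(cache_folder)
```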


# Extend Errudite for your own task
---

To extend Errudite to your own task and model, you will need to write your own `DatasetReader` and your own `Predictor` wrapper. As we've seen, a `DatasetReader` knows how to turn a file containing a dataset into a collection of `Instance`s, and how to handle writing the processed instance caches to the cache folders. A `Predictor` wraps the prediction function of a model, and converts the prediction to `Label` targets.

This section shows you how to implement both.

## Define `DatasetReader`

You can define your own dataset reader by registering a task-specific reader under the abstract ``DatasetReader`` class -- just make sure you override `self._read` and `self._text_to_instance`. 

To give you a taste of the implementation, we copy-paste the reader for [Stanford Natural Language Inference (SNLI)](https://nlp.stanford.edu/projects/snli/) that is offered by default in ``errudite.io.snli_reader.SNLIReader`` (see the comments for implementation tips):


```python
import logging

import pandas as pd
from tqdm import tqdm

from overrides import overrides

from errudite.io import DatasetReader
from errudite.utils import normalize_file_path, accuracy_score
from errudite.targets.instance import Instance
from errudite.targets.target import Target
from errudite.targets.label import Label, PredefinedLabel

# the logger used in `_read` below
logger = logging.getLogger(__name__)

@DatasetReader.register("snli")
class SNLIReader(DatasetReader):
    def __init__(self, cache_folder_path: str=None) -> None:
        super().__init__(cache_folder_path)
        # overwrite the primary evaluation method and metric name
        Label.set_task_evaluator(accuracy_score, 'accuracy')
    
    @overrides
    def _read(self, file_path: str, lazy: bool, sample_size: int):
        """
        Returns a list containing all the instances in the specified dataset.

        Parameters
        ----------
        file_path : str
            The path of the input data file.
        lazy : bool, optional
            If ``lazy==True``, only run the tokenization, does not compute the linguistic
            features like POS, NER. By default False
        sample_size : int, optional
            If sample size is set, only load this many of instances, by default None
        
        Returns
        -------
        List[Instance]
            The instance list.
        """
        logger.info("Reading instances from lines in file at: %s", file_path)
        # containers for the full instances (default) and the raw texts (lazy mode)
        instances, premises, hypotheses = [], [], []
        df = pd.read_csv(normalize_file_path(file_path), sep='\t')
        for idx, row in tqdm(df.iterrows()):
            if lazy:
                premises.append(row['sentence1'])
                hypotheses.append(row['sentence2'])
            else:
                instance = self._text_to_instance(f'q:{idx}', row)
                if instance is not None:
                    instances.append(instance)
                # stop once `sample_size` instances have been collected
                if sample_size and len(instances) >= sample_size:
                    break
        if lazy:
            return { "premise": premises, "hypoethsis": hypotheses }
        else:
            return instances
    
    @overrides
    def _text_to_instance(self, id: str, row) -> Instance:
        # The function that transfers raw text to instance.
        premise = Target(qid=row['pairID'], text=row['sentence1'], vid=0, metas={'type': 'premise'})
        hypothesis = Target(qid=row['pairID'], text=row['sentence2'], vid=0, metas={'type': 'hypothesis'})
        # label
        raw_labels = [row[f'label{i}']  for i in range(1,6)]
        groundtruth = PredefinedLabel(
            model='groundtruth', 
            qid=row['pairID'], 
            text=row['gold_label'], 
            vid=0, 
            metas={'raw_labels': raw_labels}
        )
        return self.create_instance(row['pairID'], 
            hypothesis=hypothesis, 
            premise=premise, 
            groundtruth=groundtruth)
```

As before, this reader can be queried via:
```python
from errudite.io import DatasetReader
DatasetReader.by_name("snli")
```

## Define `Predictor`

To use your own predictors / models, you need to extend the predictor class to do three things:
1. Define a list of `perform_metrics`, or metrics you will want to evaluate your model on. You can call the task evaluator setter in your predictor's `__init__()`. This includes two sub-steps:
    - First, define the evaluation function to determine how well a model is doing on one instance, based on an individual predicted label.
    - Second, from the metrics above, pick one that's primary; it will be used to compute `is_incorrect()` in any label target object: primary metric < 1.
2. Define a class method `model_predict()` that takes `Target` inputs, runs model predictions, and wraps the output prediction into `Label`s.
3. Wrap your raw model prediction:
    - Save your model as a variable (`model` variable in a `Predictor` object)
    - Wrap your model prediction method in a `predict()` method
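
The first step can be pictured as follows: an evaluation function returns a dict of metrics for one prediction, and the primary metric decides whether the prediction counts as incorrect. `exact_match_evaluator` and the standalone `is_incorrect` below are hypothetical stand-ins written for this sketch; they only mirror the "primary metric < 1" rule described above.

```python
from typing import Dict

PRIMARY_METRIC = "accuracy"  # the metric chosen as primary in this sketch


def exact_match_evaluator(pred: str, groundtruth: str) -> Dict[str, float]:
    """Score one prediction and return { metric_name: metric_score }."""
    return {"accuracy": 1.0 if pred == groundtruth else 0.0}


def is_incorrect(perform: Dict[str, float], primary_metric: str = PRIMARY_METRIC) -> bool:
    """A prediction is incorrect when its primary metric is below 1."""
    return perform[primary_metric] < 1


correct = exact_match_evaluator("neutral", "neutral")
wrong = exact_match_evaluator("entailment", "neutral")
```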
    
To make your life easier, we've already created `perform_metrics` and the classmethod `model_predict()` for VQA, QA, Sentiment Analysis, and NLI tasks. See the corresponding folders under `errudite/predictors/`. They are also examples for you to extend for your own tasks. We copy-paste the `errudite.predictors.nli.predictor_decompose_att` to give you a taste of the actual implementation, which can be queried (as we did before):

```python
from errudite.predictors import Predictor
Predictor.by_name("nli_decompose_att")
```

The implementation:
```python
from typing import List, Dict
import numpy as np
from ..predictor import Predictor
from ...targets.target import Target
from ...targets.label import Label, PredefinedLabel
from ..predictor_allennlp import PredictorAllennlp # a wrapper for Allennlp classes

@Predictor.register("nli_decompose_att")
class PredictorNLI(Predictor, PredictorAllennlp):
    """
    The wrapper for DecomposableAttention model, as implemented in Allennlp:
    https://allenai.github.io/allennlp-docs/api/allennlp.predictors.html#decomposable-attention
    """
    def __init__(self, name: str, 
        model_path: str=None,
        model_online_path: str=None,
        description: str='') -> None:
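        PredictorAllennlp.__init__(self, name, model_path, model_online_path, description)
        # set the perform metrics (defined before they are passed on)
        perform_metrics = ['accuracy', 'confidence']
        # `self.model` is assumed to be set by `PredictorAllennlp.__init__`;
        # it is the same attribute that `predict()` uses below.
        Predictor.__init__(self, name, description, self.model, perform_metrics)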
        # First, define the evaluation function to determine how well a model is doing 
        # on one instance, based on an individual predicted label.
        from ...utils.evaluator import accuracy_score
        # Second, from the metrics above, pick one that's primary, and it will be used 
        # to compute `is_incorrect()` in any label target object: primary metric < 1.
        Label.set_task_evaluator(
            # the evaluation function that accepts pred and groundtruths, 
            # and return a dict of metrics: { metric_name: metric_score }. 
            # This is saved as Label.task_evaluation_func.
            task_evaluation_func=accuracy_score, 
            # The primary task metric name, ideally a key of task_evaluation_func ‘s return.
            task_primary_metric='accuracy')

    # the raw prediction function, returning the output of the model in a json format.
    def predict(self, premise: str, hypothesis: str) -> Dict[str, float]:
        try:
            labels = ['entailment', 'contradiction', 'neutral']
            predicted = self.model.predict_json({
                "premise": premise, "hypothesis":hypothesis})
            return {
                'confidence': max(predicted['label_probs']),
                'text': labels[np.argmax(predicted['label_probs'])],
            }
        except:
            raise

    @classmethod
    # the class method that takes `Target` inputs, and output a `Label` object.
    def model_predict(cls, 
        predictor: Predictor, 
        premise: Target, 
        hypothesis: Target, 
        groundtruth: Label) -> 'Label':
        answer = None
        if not predictor:
            return answer
        predicted = predictor.predict(premise.get_text(), hypothesis.get_text())
        if not predicted:
            return None
        answer = PredefinedLabel(
            model=predictor.name, 
            qid=premise.qid,
            text=predicted['text'], 
            vid=max([premise.vid, hypothesis.vid, groundtruth.vid] ))
        answer.compute_perform(groundtruths=groundtruth)
        answer.set_perform(confidence=predicted['confidence'])
        return answer
```
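
The core of `predict()` is the mapping from class probabilities to a label and a confidence. Here is a dependency-free sketch of that step; `to_prediction` is a hypothetical helper, and the label order is copied from the implementation above.

```python
from typing import Dict, List, Union

# Label order as used in `predict()` above.
LABELS = ["entailment", "contradiction", "neutral"]


def to_prediction(label_probs: List[float],
                  labels: List[str] = LABELS) -> Dict[str, Union[str, float]]:
    """Pick the most probable class and report its probability as confidence."""
    best = max(range(len(label_probs)), key=label_probs.__getitem__)
    return {"confidence": label_probs[best], "text": labels[best]}


prediction = to_prediction([0.001, 0.0002, 0.9988])
```

The order of `labels` has to match the order in which the model emits its probabilities; a mismatch would silently relabel every prediction.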