# Preprocessing Data
---

Here, we preprocess our data, and use 'label denoising' ask our masking strategy for pretraining. The idea is to take an utterance, and a label, and create a "input" and "target" associated with each one, with the pretraining object being to predict the masked token. For example:

```
Input: "I am looking to book a flight from New York to Iceland. <MASK>"
Output: "<MASK> Book Flight."
```

### Settings
---
We use the Author's settings for the most part, and utilize the same helper functions they did, which come from the TensorFlow Preprocessing Code.

* https://github.com/amazon-science/label-aware-pretrain/blob/main/models/preprocessor.py  
* https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py

The different masking strategies are as follows:

* **Label Denoising**: Mask the label, and predict the masked token.
* **Intent Classification**: Add a prefix to each unmasked input, and predict the utterance.

The args class makes it simple to rerun the preprocessing stage with different settings in this particular notebook. The args class is as follows:

```
Args(
    dataset        # Location of the dataset
    seed           # Random seed
    labelsemantics # Masking strategy to use
    tokenizer      # Tokenizer to use
)
```

In [5]:
from dataclasses import dataclass

@dataclass
class Args:
    """Arguments for pretraining"""

    dataset: str
    seed: int
    labelsemantics: str
    tokenizer: str

    def __post_init__(self):
        assert self.dataset.endswith(".json")
        assert self.labelsemantics in ["random_denoising", "intent_classification", "label_denoising"]

args = Args(
    dataset         = "../data/pretraining/dataset/json/train.json",
    seed            = 1248, 
    labelsemantics  = "label_denoising",
    tokenizer       = "t5-base",
)

### Data Handler
---
Below, we implement our own helper class to handle with reading the raw data from a json and writing it out to a formatted json for pre-training. The below class also helps with tokenization, cleaning up any unsanitized data, and acts as a parent class for the preprocessor class.

In [6]:
from transformers import T5Tokenizer, PreTrainedTokenizerBase
import torch, json

@dataclass
class DataHandler:
    """
    DataHandler class for preprocessing data for pretraining. Responsible for tokenization and writing to file.
    """

    args: Args
    punc: tuple 

    def __post_init__(self):
        self.tokenizer : PreTrainedTokenizerBase = T5Tokenizer.from_pretrained( self.args.tokenizer )

    def to_dict(self, text, include_eos=True):
        target = self.tokenizer.encode(text) if include_eos else self.tokenizer.encode(text)[:-1]
        return {'inputs': "",
                'targets': torch.tensor(target)}

    def clean_str( self, txt ):
        str = txt.strip()
        if not str.endswith( self.punc ):
            str += "."
        return str

    def load_data(self, in_file):
        utterances, intents = [], []
        with open( in_file, 'r') as datastrings:
            for datastring in datastrings:
                data = json.loads(datastring)

                #Clean utterance and intent
                utterance = self.clean_str( data["translation"]["src"] )
                intent    = self.clean_str( data["translation"]["tgt"] )

                #Tokenize and append to list
                utterances.append( self.to_dict(utterance, include_eos=False))
                intents.append( self.to_dict(intent, include_eos=True))

        return (utterances, intents)


    def write_data(self, dataset):
        with open( self.args.labelsemantics + ".json", "w" ) as out_file:
            for data in dataset:
                data = {"inputs": self.tokenizer.decode( data["inputs"] ),
                        "targets": self.tokenizer.decode( data["targets"] )}
                out_file.write( json.dumps( data ) + "\n")

datahandler = DataHandler(
    punc = (".", "?", "!", ",", ";", ":"),
    args = args
)

### Preprocessor
---
The below preprocessor implements the different masking strategies we describe in our paper for pretraining. 

To change the type of masking strategy used, please visit the `args` section of the notebook.

In [7]:
@dataclass
class Preprocessor:
    """ Preprocessor class for preprocessing data for pretraining. Responsible for implementing masking strategies. """

    datahandler: DataHandler

    def __post_init__(self):
        self.utterances, self.intents = self.datahandler.load_data( self.datahandler.args.dataset )
        self.ic_package = zip( self.utterances, self.intents )

    def label_denoise( self ):
        """Preprocessing for T5 denoising objective. Returns preprocessed
        tokenized and encoded data."""
        ds = []

        for utterance, intent in self.ic_package:
            sentinel_id = self.datahandler.tokenizer.convert_tokens_to_ids( "<extra_id_0>" )
            input = torch.cat(( utterance["targets"], torch.tensor([sentinel_id]) ))
            target = torch.cat(( torch.tensor([sentinel_id]), intent["targets"] ))
            if input.shape[0] > 512:
                input = input[:512]
            ds.append( {'inputs': input, 'targets': target} )
        return ds
    
    def intent_classification( self ):
        """Preprocessing for T5 intent classification objective. Returns preprocessed
        tokenized and encoded data."""

        ds = []
        prefix = "intent classification: "

        for utterance, intent in self.ic_package:
            input   = utterance["targets"]
            target  = intent["targets"]
            if input.shape[0] > 512:
                input = input[: 512 - len( prefix )]
            ds.append( {'inputs': prefix + input, 'targets': target} )
        return ds
    
    def random_denoising( self ):
        pass

    def format_pretraining( self ):
        pretrain_format = self.datahandler.args.labelsemantics
        if pretrain_format == "label_denoising":
            return self.label_denoise()
        elif pretrain_format == "intent_classification":
            return self.intent_classification()
        elif pretrain_format == "random_denoising":
            return self.random_denoising()
        else:
            raise ValueError("Invalid pretraining format")  
        

preprocess = Preprocessor(
    datahandler = datahandler,
)

dataset = preprocess.format_pretraining()
datahandler.write_data( dataset )

Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors


TypeError: can only concatenate str (not "Tensor") to str

### Debugging

The below code helps visualize the data, and make sure that the preprocessor is working as expected.

In [None]:
tokenizer = datahandler.tokenizer

for data in dataset[:5]:
    print(f'input: {tokenizer.decode(data["inputs"])}\ntarget: {tokenizer.decode(data["targets"])}\n')

print("Num examples:", len(dataset))

input: Choose one person to be “it.” This person is the "answerer" and responsible for choosing the objective for each round.<extra_id_0>
target: <extra_id_0> Play the "Mind Reading" Game.</s>

input: Please help me as i am continuously facing the issue in transferring money to my friends, as all my transactions are getting failed.<extra_id_0>
target: <extra_id_0> Failed Transfer.</s>

input: Your final result: zombies will pile up against the handrail, instead of running around it, leaving them free for pickings.<extra_id_0>
target: <extra_id_0> Do the Pile Up Glitch in Tranzit.</s>

input: Start every day with self-affirmations.<extra_id_0>
target: <extra_id_0> Be Brave.</s>

input: Be clear on the type of editing you should do.<extra_id_0>
target: <extra_id_0> Be a Good Editor.</s>

Num examples: 98924
