<a href="https://colab.research.google.com/github/Masoud-Karami/Abstracts/blob/main/Word_Salad.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#<a href=https://arxiv.org/abs/2101.03453><b>BERT & Family Eat Word Salad:</b> Experiments with Text Understanding</a>

<a href=https://github.com/utahnlp/word-salad>code</a>

#**The goal**:
<ul>
    <li>study the response of large neural models to <code><i>destructive transformations</i></code>: perturbations of inputs that render them meaningless.
    </li>
</ul>

# **To be argued**:
A <code>reliable model</code> <font size="-2">(one that knows what it does not know,
does not fail silently, and is uncertain on examples that are uninformative about the label)</font> should not be insensitive to a change as drastic as scrambling the word order.

#**How**:
<ol type="i">
    <li>defining simple heuristics to construct incoherent inputs that should confuse any model that claims to understand natural language.</li>
    <li>characterizing the models' response using two metrics:
        <ul type="square">
            <li>its ability to predict valid labels for invalid input</li>
            <li>its confidence on these predictions</li>
        </ul>
    </li>
    <li>evaluating strategies to mitigate these weaknesses using regularization that makes models less confident in their predictions, or by allowing models to reject inputs.</li>
</ol>
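
Step i can be made concrete with a small sketch of word-order heuristics (sort, shuffle, reverse, and a copy-then-sort variant for tasks with two input texts). Whitespace tokenization, the function names, and the example sentence are assumptions for illustration; this is not the paper's released code.

```python
import random


def sort_words(text):
    """Sort: put the tokens in alphabetical order."""
    return " ".join(sorted(text.split()))


def shuffle_words(text, seed=0):
    """Shuffle: randomly permute the tokens (seeded for reproducibility)."""
    tokens = text.split()
    random.Random(seed).shuffle(tokens)
    return " ".join(tokens)


def reverse_words(text):
    """Reverse: emit the tokens in reverse order."""
    return " ".join(reversed(text.split()))


def copy_sort(text):
    """CopySort: keep the first text and use a sorted copy as the second text."""
    return text, sort_words(text)


# Hypothetical example input; each transformation keeps the bag of words intact.
premise = "the cat sat on the mat"
for transform in (sort_words, shuffle_words, reverse_words, copy_sort):
    print(transform.__name__, "->", transform(premise))
```

All four functions only reorder or copy tokens, so a model that relies on lexical overlap alone would see the same evidence before and after the transformation.
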
___
<ol>
    <li><i><b>Destructive Transformation:</b></i>

>Consider a task with input $x \in X$ and an oracle function $f$ that maps inputs to labels $y \in Y$. A destructive transformation $\pi: X \to X$ is a function that operates on $x$ to produce transformed inputs $x^\prime = \pi(x)$ such that $f(x^\prime)$ is undefined. That is, none of the labels (i.e., the set $Y$) can apply to $x^\prime$.

Different classes of transformations:
        <ul type="none">
            <li><code>Lexical Overlap-based Transformations:</code> which preserve the bag-of-words representation of the original input but change the word order
                <ul type="I">
                    <li><b>Sort</b></li>
                    <li><b>Shuffle</b></li>
                    <li><b>Reverse</b></li>
                    <li><b>CopySort:</b> Copy one of the input texts and then sort it to create the second text
                    </li>
                </ul>
            </li>
            <li><code>Gradient-based Transformations:</code>
            score input tokens in proportion to their relative contribution to the output, then study the impact of removing, repeating, and replacing the scored tokens.
           
>Given a trained neural model $\mathcal{M}$, and the task loss function $\mathcal{L}$, the change in the loss for the $i^{th}$ input token is approximated by the dot product of its token embedding $\mathbf{t}_i$ and the gradient of the loss propagated back to the input layer $\nabla_{\mathbf{t}_i, \mathcal{M}}\mathcal{L}$. That is, the $i^{th}$ token is scored by  $\mathbf{t}_i^\intercal\nabla_{\mathbf{t}_i,\mathcal{M}}\mathcal{L}$.
                
A higher score denotes a more important token.
                </li>
                    <li><code>Statistical transformation: PBSMT</code> 
            uses a phrase-based statistical machine translation (PBSMT) system to generate examples from phrasal co-occurrence statistics.
                    </li>
        </ul>
    </li>
    <li><i><b>A <code>reliable model</code> should exhibit the following:</b></i>
        <ol type="i">
            <li>the agreement between original predictions and predictions on their transformed invalid variants should be random,</li>
            <li>predictions for invalid examples should be uncertain</li>
        </ol>
    </li>
    <li><i><b>Two metrics are necessary to accomplish step $2$:</b></i>
        <ul type="none">
            <li><code>Agreement</code>: the percentage of examples whose prediction remains the
same after applying a destructive transformation. High agreement scores show that models retain their original predictions even when label-bearing information is removed from examples.
$$\text{closer to random} \implies \text{better handling of invalid examples}$$</li>
            <li><code>Confidence</code> <font size="-2">(the entropy of the output distribution reveals the same insights)</font>: the average probability of the predicted label. For invalid inputs, this number should be close to $\frac{1}{N}$, where $N$ is the number of classes.</li>
        </ul>
    </li>
    <li><i><b>Calibrated <code>BERT</code> test on invalid examples:</b></i>
        confidence-calibrated classifiers are trained and assessed using the following standard methods:
        <ol type=I>
            <li><a href=https://arxiv.org/abs/1906.02629>$\text{Label smoothing}$</a> (<a href=https://youtu.be/wmUiOAra_-M>a quick intro</a>) with <a href=https://amaarora.github.io/2020/06/29/FocalLoss.html>$\text{Focal loss}$</a></li>
            <li><a href=https://arxiv.org/abs/1910.12656>$\text{Temperature scaling}$</a> (<a href=https://geoffpleiss.com/nn_calibration>intro</a>)</li>
            <li><a href=https://arxiv.org/abs/1706.04599>$\text{Expected calibration error (ECE)}$</a> as the measure of calibration: better-calibrated models have lower ECE.</li>
        </ol>
    </li>
    <li><i><b>Small vs. large perturbations:</b></i>
        <ul>
            <li>Robustness of the model to small input perturbations is <b>desirable</b> (<font size="-2">model’s prediction should not change for small perturbations in the input.</font>)</li>
            <li>However, excessive invariance to large input perturbations is <b>undesirable</b>.</li>
        </ul>
    </li>
    <li><i><b>Achievements:</b></i>
    <ol type="a">
        <li>the labels predicted by state-of-the-art models for destructively transformed inputs bear high agreement with the original ones</li>
        <li>models trained on meaningless examples perform comparably to the original model on unperturbed examples, despite never having encountered any well-formed training examples.</li>
        <li>models trained on meaningless sentences constructed by permuting the word order perform almost as well as the state-of-the-art models.</li>
        <li>models struggle even with the form of language: they force meaning onto token sequences devoid of any, i.e., they are not using the right kind of information to arrive at their predictions</li>
        <li> The transformations render sentences meaningless to humans, but the model knows the label.</li>
        <li>It is possible that, rather than understanding text, they merely learn spurious correlations in the training data. That is, models use the wrong information to arrive at the right answer.</li>
        <li>Not only do models retain a large fraction of their predictions, they do so with high confidence</li>
        <li>This undesirable behavior occurs in all models, irrespective of the pretraining tasks (and is even seen in models with a recurrent inductive bias), because these <a href=https://arxiv.org/abs/1803.02324>large models learn spurious correlations present in the training datasets</a>.
            <ul>
                <li>To substantiate this claim, the setup is flipped from (<mark>training set</mark>=<b style="background-color:lightgreen;">valid</b>, validation set=<b style="background-color:red;">invalid</b>) to (<mark>training set</mark>=<b style="background-color:red;">invalid</b>, validation set=<b style="background-color:lightgreen;">valid</b>): the model is trained on invalid examples and evaluated on valid ones.</li>
            </ul>
        </li>
    </ol>
</li>
</ol>
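
The agreement and confidence metrics, together with ECE, can be sketched in plain Python. The function names and toy numbers below are hypothetical, and the ECE uses the common equal-width binning formulation, which is an assumption here rather than the paper's exact setup. ECE needs gold labels, so it only applies to valid examples.

```python
def agreement(orig_preds, transformed_preds):
    """Percentage of examples whose predicted label is unchanged."""
    same = sum(o == t for o, t in zip(orig_preds, transformed_preds))
    return 100.0 * same / len(orig_preds)


def mean_confidence(probs):
    """Average probability of the predicted (highest-probability) label."""
    return sum(max(p) for p in probs) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-size-weighted gap between accuracy and mean confidence."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        p = list(p)
        conf = max(p)
        pred = p.index(conf)
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b].append((conf, float(pred == y)))
    ece = 0.0
    for items in bins:
        if items:
            avg_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(a for _, a in items) / len(items)
            ece += len(items) / len(probs) * abs(accuracy - avg_conf)
    return ece


# Toy 3-class example (made-up numbers): predictions on original inputs and
# output distributions on their destructively transformed variants.
orig_preds = [0, 1, 2, 1]
probs_invalid = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.7, 0.2, 0.1]]
preds_invalid = [p.index(max(p)) for p in probs_invalid]

print("agreement (%):", agreement(orig_preds, preds_invalid))
print("mean confidence:", mean_confidence(probs_invalid))
```

For a reliable model, agreement on invalid inputs would sit near chance and the mean confidence near $\frac{1}{N}$ (here $\frac{1}{3}$).
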