<a href="https://colab.research.google.com/github/Masoud-Karami/Abstracts/blob/main/IDC_DND.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **[Demystifying Neural Language Models’ Insensitivity to Word-Order](https://arxiv.org/abs/2107.13955)**

[code](https://github.com/chandar-lab/)

## (In)Sensitivity of natural language models to word-order
___
<ol>
<li><i><b>How to investigate:</b></i>
<ol type="a">
        <li><code>quantifying</code> perturbations and analyzing their effect on neural models’ performance on language understanding tasks in <a href=https://mccormickml.com/2019/11/05/GLUE/><code>GLUE</code></a> benchmark.</li>
        <li>score the <code>local</code> and <code>global</code> ordering of tokens in the perturbed texts with the two following metrics:
            <ul>
                <li><b>local metric:</b> Direct Neighbour Displacement (<code>DND</code>)
                For every character $c_j$, let $\mathcal{N}^{x_i}(c_j, R)$ indicate the relative position of the right neighbor ($R$) of $c_j$ with respect to the position of $c_j$ in string $x_i$. Then <code>DND</code> is computed as a normalized sum over an indicator variable that is $1$ when the neighbor to the right of $c_j$ has shifted to a different relative position in the perturbed string $x'_i$.
                \begin{equation}
                DND \gets \frac{1}{k-1}\sum_{j=1}^{k-1} \left(1 \left[ \mathcal{N}^{x_i}(c_j, R) \neq\mathcal{N}^{x'_i}(c_j, R) \right]\right)
                \end{equation}</li>
                <li><b>global metric:</b> Index Displacement Count (<code>IDC</code>)
                Let a string, $x_i = (c)_k^i$, be denoted by a sequence of characters $c_1, \ldots, c_k$, where $k$ is the length of the string in characters, and let $p^{x_i}$ denote the positions of the characters in $x_i$. Let $\eta(\cdot)$ be a perturbation operation,
                \begin{equation}
                x_i'\gets \eta\left(x_i\right),
                \end{equation}
                where $x_i'$ denotes the perturbed string, with the positions of its characters specified by $p^{x'_i}$. Then
                \begin{equation}
                IDC \gets \frac{1}{k^2}\sum_{j=1}^{k}\left\Vert p^{x'_i}\left(j\right) - p^{x_i}\left(j\right) \right\Vert_{1}
                \end{equation}</li>
            </ul>
        </ol></li>
        <li><i><b>Achievements:</b></i>
        <ol type='i'>
        <li><code>IDC</code> and <code>DND</code> are uncorrelated, suggesting that the two metrics measure different aspects of the perturbations.</li>
        <li><code>DND</code> only has a weak correlation with BLEU and <a href=https://en.wikipedia.org/wiki/Levenshtein_distance>Levenshtein</a>, indicating that <code>DND</code> measures a previously unmeasured dimension of the structure and similarity in texts.</li>
        <li>perturbation functions found in prior literature affect mainly the global ordering, while the local ordering remains relatively unperturbed.</li>
        <li>pretrained and non-pretrained Transformers, LSTMs, and convolutional architectures rely on the local ordering of tokens more than on the global ordering.</li>
        <li>studies of human sentence comprehension show that specific word orders are necessary for comprehending text.</li>
        <li>local structure, more so than global structure, is necessary for models to understand text.</li>
        </ol></li>
        <li><i><b>Class of perturbation analysis</b></i>
        <ol>
            <li>deletion</li>
            <li>paraphrase injection</li>
            <li>perturbations at a finer granularity:
            <ul>
            <li>word-level perturbed sentence</li>
            <li>subword-level perturbed sentence</li>
            <li>character-level perturbed sentence</li>
            </ul>
            In NLP, shuffling n-grams (for different values of $n$) to highlight the insensitivity of pretrained models shows that shuffling larger n-grams has a smaller effect than shuffling smaller n-grams.</li>
        </ol></li>
        <li><i><b>BERT models:</b></i>
        <ol type="I">
            <li> have some syntactic capacity.</li>
            <li> represent information hierarchically and model linguistically relevant aspects in a hierarchical structure.</li>
            <li> produce contextual embeddings that contain syntactic information which could be used in downstream tasks.</li>
            <li> do not seem to gain much downstream performance from pretraining on syntax.</li>
            <li> can understand syntax, but often prefer not to use that information to solve tasks.</li>
            <li> like other large language models, are insensitive to minor perturbations, highlighting how little syntactic knowledge is used in syntax-rich NLP tasks.</li>
            <li> when pretrained on perturbed inputs, still obtain reasonable results on downstream tasks, showing that models that have never been trained on well-formed syntax can obtain results close to those of their peers.</li>
        </ol></li>
        <li><i><b>Popular similarity metrics:</b></i>
        <ul>
        <li><code><b>BLEU</b> and <b>ROUGE</b>:</code> treat text as a sequence of words, from which a measure of overlap is computed.</li>
        <li><code><b>Levenshtein</b> or <b>edit</b> distance:</code> measures the minimum number of single-character edits (insertions, deletions or substitutions) necessary to turn one string into the other.</li>
        <li><code><b>Learned metrics:</b></code> often unaffected by minor perturbations in text, which limits their usefulness in measuring perturbations.
        <ol>
            <li><code><b>BERT-Score</b></code></li>
            <li><code><b>BLEURT</b></code></li>
            <li><code><b>POS mini-tree overlap score:</b></code> computes the part-of-speech (PoS) tag neighborhood of every word and estimates the average overlap of these neighborhoods over all tokens before and after applying the perturbation.</li>
        </ol></li>
        </ul></li>
        <li><i><b>Perturbation Function:</b></i>
        <ul type="none">
            <li><code><b>full-shuffling:</b></code> randomly shuffles the position of every word, subword, or character, according to the level it is applied to, and causes a large amount of perturbation to both the global and the local structure. $16$ different word-level perturbations are categorized as:
            <ul>
                <li>PoS-Tag perturbations</li>
                <li>Dependency Tree perturbations</li>
                <li>Random shuffles</li>
            </ul>
            </li>
            <li><code><b>phrase shuffling:</b></code> creates chunks of contiguous tokens of variable length and shuffles them. On average, it has the same impact as full shuffling on the global structure, since the absolute positions of characters tend to change just as much, while having a smaller impact on the local structure.</li>
            <li><code><b>Neighbour flip perturbations:</b></code> flip tokens of the chosen granularity with their immediate right neighbor with probability $\rho$. This function has, on average, a smaller impact on the <i>global</i> structure, as the absolute positions of tokens do not change much, but it can have an arbitrarily large effect on the <i>local</i> structure. The perturbation is applied by traversing the string from left to right at the desired granularity and flipping each token with probability $\rho$.</li>
        </ul>Unlike the full-shuffling operation, phrase shuffling uses a parameter $\rho$ that controls the average size of the randomly defined contiguous chunks of tokens. The lower the value of $\rho$, the longer the phrases are on average, thus preserving more of the <i>local</i> structure while destroying roughly the same amount of <i>global</i> structure. In the extreme case of $\rho = 1.0$, phrase shuffling is equivalent to full shuffling, as every phrase is exactly one token long.</li>
<li><i><b>Text structures:</b></i>
        <ul type='none'>
        <li><i><code>global:</code></i> which relates to the absolute position of characters in the text</li>
        <li><i><code>local:</code></i> which relates to the relative position of characters to their immediate neighbors</li>
        </ul></li>
<li><i><b>Tokenization:</b></i>
        <ul type='none'>
        <li><i><code>word level:</code></i> splitting a sentence on whitespace and punctuation marks</li>
        <li><i><code>character level:</code></i> splitting a sentence into characters; for example, in "dogs are a good friend", the word "friend" becomes f-r-i-e-n-d.</li>
        <li><i><code>Subword level:</code></i> used by Transformers; produces subword units, which are smaller than words but bigger than single characters and still carry some meaning.</li>
        <ul>
        <li> <a href="https://towardsdatascience.com/overview-of-nlp-tokenization-algorithms-c41a7d5ec4f9"><i>byte-pair encoding (BPE)</i></a>, which iteratively merges the most frequently occurring characters or character sequences.</li>
        <li> <i>Unigram LM</i>, which assumes that all subword occurrences are independent, so the probability of a subword sequence is the product of the occurrence probabilities of its subwords. It is based on a probabilistic language model and can output multiple segmentations with their probabilities. Instead of starting with a group of base symbols and learning merges with some rule, like BPE or WordPiece, it starts from a large vocabulary (for instance, all pretokenized words and the most common substrings) and reduces it progressively.</li>
        <li><i>WordPiece</i>, used in <code><b>BERT</b></code>, which is similar to BPE in many ways, except that it forms a new subword based on likelihood rather than on the next highest-frequency pair. Algorithm:
        <i><ol>
        <li>Get a large enough corpus.</li>
        <li>Define a desired subword vocabulary size.</li>
        <li>Split each word into a sequence of characters.</li>
        <li>Initialize the vocabulary with all the characters in the text.</li>
        <li>Build a language model based on the current vocabulary, and add to the vocabulary the new subword unit that most increases the likelihood of the training data.</li>
        <li>Repeat step 5 until reaching the subword vocabulary size (defined in step 2) or until the likelihood increase falls below a certain threshold.</li>
        </ol></i>
        </li>
        <li><i>SentencePiece</i></li>
        </ul>
        </ul></li>
        <li><i><b>Interpretability:</b></i> 
        perturbing the samples across <code>GLUE</code> tasks
        <ul>
        <li>Distribution of scores:
        <ul>
        <li>measured by <code>BLEU</code> and <code>Levenshtein</code>, it covers the entire range of values for most of the word-level functions</li>
        <li>computed by <code>DND</code> for the different perturbation functions, it indicates that the word-level and subword-level perturbations have a limited impact on the local structure</li>
        </ul></li>
        <li><code>BLEU</code> was uninterpretable when the perturbations were done at the character or subword level, and is therefore ineffective there.</li>
        <li>The <code>DND</code> metric was found to correlate strongly with model performance on perturbed samples.</li>
        </ul>
        </li>
</ol>
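
The metrics and perturbation functions summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a perturbation is represented as a list `order` with `order[new_position] = original_index`, the metrics are computed over whatever unit is being permuted (the paper defines them over characters), and all function names (`full_shuffle`, `phrase_shuffle`, `neighbour_flip`, `idc`, `dnd`) are hypothetical.

```python
import random


def full_shuffle(k, rng):
    """Random order of k units; order[new_position] = original index."""
    order = list(range(k))
    rng.shuffle(order)
    return order


def phrase_shuffle(k, rho, rng):
    """Cut the sequence before each unit with probability rho, then shuffle the chunks."""
    if k == 0:
        return []
    chunks, current = [], [0]
    for j in range(1, k):
        if rng.random() < rho:  # start a new phrase at unit j
            chunks.append(current)
            current = []
        current.append(j)
    chunks.append(current)
    rng.shuffle(chunks)
    return [j for chunk in chunks for j in chunk]


def neighbour_flip(k, rho, rng):
    """Traverse left to right, swapping each slot with its right neighbour with probability rho."""
    order = list(range(k))
    for j in range(k - 1):
        if rng.random() < rho:
            order[j], order[j + 1] = order[j + 1], order[j]
    return order


def positions(order):
    """Invert the permutation: pos[original_index] = new_position."""
    pos = [0] * len(order)
    for new_position, original_index in enumerate(order):
        pos[original_index] = new_position
    return pos


def idc(order):
    """Index Displacement Count: summed absolute displacement, divided by k^2 (needs k >= 1)."""
    k = len(order)
    pos = positions(order)
    return sum(abs(pos[j] - j) for j in range(k)) / (k * k)


def dnd(order):
    """Direct Neighbour Displacement: share of units whose original right
    neighbour is no longer directly to their right (needs k >= 2)."""
    k = len(order)
    pos = positions(order)
    return sum(pos[j + 1] - pos[j] != 1 for j in range(k - 1)) / (k - 1)


# Example usage on word-level tokens (the seed and sentence are arbitrary).
tokens = "dogs are a good friend".split()
rng = random.Random(0)
for name, order in [
    ("full shuffle", full_shuffle(len(tokens), rng)),
    ("phrase shuffle", phrase_shuffle(len(tokens), 0.5, rng)),
    ("neighbour flip", neighbour_flip(len(tokens), 0.5, rng)),
]:
    print(name, [tokens[j] for j in order], idc(order), dnd(order))
```

Two hand-built orders show how the metrics separate. Swapping two phrases of length two, `[2, 3, 0, 1]`, gives an IDC of 0.5 but a DND of only 1/3, whereas flipping neighbouring pairs, `[1, 0, 3, 2]`, gives a lower IDC of 0.25 but a DND of 1.0.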