# Transformers for Downstream Tasks

Transformers show their real value when we apply them to more specialized tasks than predicting a \<mask> token or the next sentence (though those objectives are still very useful). In this notebook we look at how transformers can be used for several specialized NLP tasks, and at how their performance is measured in those use cases.

Since we are now moving into applications of Transformers, it is best to first look at the evaluation mechanisms used in NLP tasks. Below are some of the performance metrics used by the `General Language Understanding Evaluation (GLUE)` and `SuperGLUE` benchmarks.

#### **Accuracy Score**

Accuracy is one of the simplest forms of evaluation. It is the fraction of predictions that are correct out of all the evaluated examples.
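To make the definition concrete, here is a minimal sketch in plain Python. The function name `accuracy` and the toy label lists are our own illustration, not part of any benchmark toolkit.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy data (made up for illustration): 3 of the 5 predictions are correct.
labels = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 1]
print(accuracy(labels, predictions))
```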

#### **F1 Score**

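F1 is another metric, and it is especially helpful when the class distribution is uneven. It is the harmonic mean of precision (the share of predicted positives that are truly positive) and recall (the share of actual positives that the model finds).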

                F1-score= 2 * (precision * recall)/(precision + recall)
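The formula above can be sketched directly from the prediction counts. This is a minimal illustration for a binary task. The function name, the `positive` parameter and the toy data are our own assumptions, and returning 0.0 when a denominator is zero is one common convention rather than the only option.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Assumed convention: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Imbalanced toy data (made up): only 2 of the 10 examples are positive.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(precision_recall_f1(y_true, y_pred))
```

On this toy data 8 of the 10 predictions are correct, yet only one of the two positive examples is found, which is the kind of gap F1 is meant to expose.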

#### **Matthews Correlation Coefficient (MCC)**

We already used this metric when evaluating our RoBERTa model. It computes a single score from all four confusion-matrix values: true positives, false positives, true negatives and false negatives. Because it takes all four into account, it is often more informative than accuracy or F1 when the classes are of very different sizes.
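For the binary case, MCC is usually written as (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Below is a minimal sketch of that formula. The function name `mcc_from_counts` and the example counts are our own, and returning 0.0 for a zero denominator is an assumed convention.

```python
import math


def mcc_from_counts(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from the four confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined score (zero denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


# A model that always predicts the majority (negative) class on a 90/10 split:
# accuracy would be 0.9, but MCC shows there is no real correlation.
print(mcc_from_counts(tp=0, fp=0, tn=90, fn=10))

# A model that finds one of two positives and raises one false alarm.
print(mcc_from_counts(tp=1, fp=1, tn=7, fn=1))
```

The score ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction).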

### Proving a model is state of the art

Before we can claim that a model is state of the art (SOTA), we need three main things.

    1. A model
    2. A well-defined task with an associated dataset
    3. A valid metric

So far we have worked with models and a few metrics we can use. For the second point, we can rely on published benchmark datasets and tasks.
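The way these three pieces fit together can be sketched as a tiny evaluation loop. Everything here is hypothetical: `toy_model` is a keyword rule standing in for a real transformer, `toy_dataset` is made up, and `evaluate` is our own helper rather than a benchmark API.

```python
def accuracy(y_true, y_pred):
    """The metric: fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def toy_model(text):
    """The model: a hypothetical stand-in that flags any text containing 'good'."""
    return 1 if "good" in text else 0


# The task and dataset: (input text, label) pairs, made up for illustration.
toy_dataset = [
    ("a good film", 1),
    ("a dull film", 0),
    ("good acting", 1),
    ("not good at all", 0),
]


def evaluate(model, dataset, metric):
    """Run the model over every input and score the predictions with the metric."""
    labels = [label for _, label in dataset]
    predictions = [model(text) for text, _ in dataset]
    return metric(labels, predictions)


score = evaluate(toy_model, toy_dataset, accuracy)
print(score)
```

Swapping any one of the three arguments changes the claim being made, which is why a shared benchmark fixes the dataset and the metric for everyone.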

One such benchmark is SuperGLUE. There is also an older one named GLUE.

The idea behind building the General Language Understanding Evaluation (GLUE) datasets was to show that NLU can be applied to a wide range of tasks. As newer models improved, GLUE became outdated because most of them began to outperform the human baseline. SuperGLUE was introduced to set a harder standard.


If we look at the SuperGLUE benchmark, it consists of 8 selected tasks. Below is a screenshot of the task set from the [SuperGLUE website](https://super.gluebenchmark.com/tasks/).

<center><image src="imgs/15.jpg" width="500"/></center>

As we can see, it provides the task instructions, datasets, software and other resources required to solve each problem. Once a team runs the benchmark, its results are displayed if they reach the leaderboard.

As an example, one task measures machine reasoning. The input is a premise, and based on it the model needs to choose the most plausible answer to a given question from the provided answers. This feels like quite a complex task for a machine, but models have already matched or passed the human baseline on it. Check the leaderboard!

Below are brief details about each of these tasks.

#### COPA task

As explained earlier, in this task there is an input premise. A question is asked about that premise, along with candidate answers. The goal is to find the most plausible answer to the question based on the input premise.
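To show the shape of the problem, here is a minimal sketch of a COPA-style example and a selection step. The field names (`premise`, `question`, `choice1`, `choice2`, `label`) are assumed here for illustration, the example sentence is made up, and `overlap_score` is a hypothetical word-overlap stand-in for a real model's plausibility score.

```python
from dataclasses import dataclass


@dataclass
class CopaExample:
    premise: str
    question: str  # "cause" or "effect"
    choice1: str
    choice2: str
    label: int     # 0 means choice1 is correct, 1 means choice2


def overlap_score(premise, question, choice):
    """Hypothetical scorer: counts lowercase words shared by premise and choice."""
    return len(set(premise.lower().split()) & set(choice.lower().split()))


def pick_choice(example, score):
    """Return 0 or 1 for whichever choice the scoring function rates higher."""
    s1 = score(example.premise, example.question, example.choice1)
    s2 = score(example.premise, example.question, example.choice2)
    return 0 if s1 >= s2 else 1


example = CopaExample(
    premise="The glass fell off the table.",
    question="effect",
    choice1="The glass shattered on the floor.",
    choice2="The sun came out.",
    label=0,
)
prediction = pick_choice(example, overlap_score)
print(prediction == example.label)
```

A real system would replace `overlap_score` with a fine-tuned transformer that scores each (premise, choice) pair.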

#### BoolQ task

In this task, a question is asked along with a passage, and a boolean answer is expected. The model should be able to provide a True or False answer based on the content of the passage.

#### Commitment bank task

This task is a bit more complex than the others, as it involves a premise and a hypothesis. Based on the premise, the model should identify whether the hypothesis is entailed, contradicted, or neutral.

#### MultiRC task

The Multi-Sentence Reading Comprehension (MultiRC) task gives the model a text to read and asks it to pick the correct answer to a question from the given choices. It mimics the reading comprehension questions we see in exams.

#### ReCoRD task

Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) represents another complex task for NLP models. In this task, the model is given an input text and a query that contains a placeholder. Based on the input paragraph, the model needs to find the entity that fits into the query's placeholder.
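The placeholder mechanics can be sketched in a few lines. We assume here that the gap is marked with the token `@placeholder`. The query, the candidate entities and the helper names are made up for illustration.

```python
PLACEHOLDER = "@placeholder"  # assumed marker for the gap in the query


def fill_query(query, entity):
    """Substitute one candidate entity into the query's placeholder."""
    return query.replace(PLACEHOLDER, entity)


def candidate_queries(query, entities):
    """Map each candidate entity to the query it produces when filled in."""
    return {entity: fill_query(query, entity) for entity in entities}


# Made-up query and candidate entities taken from an imagined passage.
query = "The festival will take place in @placeholder next spring."
entities = ["Lisbon", "Maria", "Acme"]

filled = candidate_queries(query, entities)
for entity, sentence in filled.items():
    print(entity, "->", sentence)
```

A model would then score each filled query against the passage and return the entity with the highest score.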

#### RTE task

In the Recognizing Textual Entailment (RTE) task, the model reads a premise, examines a hypothesis, and then predicts whether the hypothesis is entailed by the premise. This requires the model to understand the text and use logic to answer.


#### WiC task

Words in Context (WiC) tests the model's ability to process an ambiguous word. In this task, the model has to analyze 2 sentences and determine whether a target word has the same meaning in both of them.

#### WSC task

The Winograd Schema Challenge (WSC) also tests the model's ability to disambiguate. It contains sentences in which it is not obvious what a pronoun refers to. Each example provides a sentence, a pronoun and a target token, and the model has to predict whether the given pronoun refers to the given token. Below is an example.

    The blue cup was on top of the table until it was broken by the cat.

    target token: blue cup
    pronoun: it
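The example above can be turned into a structured record, as a rough sketch. The field names (`text`, `span1_text`, `span1_index`, `span2_text`, `span2_index`) are assumed for illustration, and `word_index` is our own helper that finds where a span starts among the whitespace-separated words.

```python
def word_index(text, span):
    """Index of the first word of `span` in the whitespace-split `text`, or -1."""
    words = text.split()
    span_words = span.split()
    for i in range(len(words) - len(span_words) + 1):
        if words[i:i + len(span_words)] == span_words:
            return i
    return -1


text = "The blue cup was on top of the table until it was broken by the cat."

example = {
    "text": text,
    "span1_text": "blue cup",                         # target token
    "span1_index": word_index(text, "blue cup"),
    "span2_text": "it",                               # pronoun
    "span2_index": word_index(text, "it"),
}
print(example)
```

Given such a record, the model outputs a single boolean: whether the pronoun span refers to the target span.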

