<a href="https://colab.research.google.com/github/IgnatiusEzeani/NLP-Lecture/blob/main/Week_18_NLP_Tasks_and_Methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 18 - Introduction to Text Classification

This lab will take you through an introductory text classification task using content from the [AllenNLP Guide](https://guide.allennlp.org/). AllenNLP is an open-source library for building deep learning models for natural language processing, developed by the Allen Institute for Artificial Intelligence.

It is built on top of PyTorch and is designed to support researchers, engineers, students, etc., who wish to build high quality deep NLP models with ease. It provides high-level abstractions and APIs for common components and models in modern NLP. It also provides an extensible framework that makes it easy to run and manage NLP experiments.

# Section 1: Creating a classification model

## What is text classification
---
Text classification is one of the simplest NLP tasks, where the model, given some input text, predicts a label for the text. See the figure below for an illustration.

<figure>
<center>
<img src='https://github.com/IgnatiusEzeani/NLP-Lecture/blob/main/img/Artboard.png?raw=true' />
<figcaption>A basic text classification pipeline</figcaption></center>
</figure>


There are a variety of applications of text classification, such as spam filtering, sentiment analysis, and topic detection. Some examples are shown in the table below.

|Application | Description | Input | Output|
| --- | --- | --- |--- |
|Spam filtering |Detect and filter spam emails | Email |Spam / Not spam |
|Sentiment analysis|Detect the polarity of text |Tweet, review|Positive / Negative|
|Topic detection | Detect the topic of text | News article, blog post | Business / Tech / Sports|
|Language identification | Detect the language of text | Written text|Igbo / English / Russian|

## Defining input and output
---
The first step for building an NLP model is to define its input and output. In AllenNLP, each training example is represented by an Instance object. An Instance consists of one or more Fields, where each Field represents one piece of data used by your model, either as an input or an output. Fields will get converted to tensors and fed to your model. The Reading Data chapter provides more details on using Instances and Fields to represent textual data.
For text classification, the input and the output are very simple. The model takes a `TextField` that represents the input text and predicts its label, which is represented by a `LabelField`:

```
# Input
text: TextField

# Output
label: LabelField
```

## Reading data

<figure>
<center>
<img src='https://github.com/IgnatiusEzeani/NLP-Lecture/blob/main/img/Slice.png?raw=true' />
<figcaption>Reformatting text files as instances of texts and labels</figcaption></center>
</figure>

The first step for building an NLP application is to read the dataset and represent it with some internal data structure.

AllenNLP uses DatasetReaders to read the data, whose job it is to transform raw data files into Instances that match the input / output spec. Our spec for text classification is:
```
# Inputs
text: TextField

# Outputs
label: LabelField
```
We’ll want one Field for the input and another for the output, and our model will use the inputs to predict the outputs.
We assume the dataset has a simple data file format: `[text] [TAB] [label]`, for example:
```
I like this movie a lot! [TAB] positive
This was a monstrous waste of time [TAB] negative
AllenNLP is amazing [TAB] positive
Why does this have to be so complicated? [TAB] negative
This sentence expresses no sentiment [TAB] neutral
```
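
As a minimal, framework-free sketch of this file format, the snippet below splits each line into a text and a label. It assumes that `[TAB]` in the listing above stands for a literal tab character; `parse_tsv_lines` is a hypothetical helper and not part of AllenNLP.

```python
from typing import Iterable, Iterator, Tuple


def parse_tsv_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (text, label) pairs from `[text] [TAB] [label]` lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab, so a stray tab inside the text is tolerated.
        text, label = line.rsplit("\t", 1)
        yield text.strip(), label.strip()


sample = [
    "I like this movie a lot!\tpositive\n",
    "This was a monstrous waste of time\tnegative\n",
]
pairs = list(parse_tsv_lines(sample))
print(pairs)
```

Splitting on the last tab is a design choice of this sketch; the reader shown later in this lab uses a plain `split('\t')`.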

## Making a DatasetReader
---
You can implement your own DatasetReader by inheriting from the DatasetReader class. At minimum, you need to override the _read() method, which reads the input dataset and yields Instances.

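```python
from typing import Iterable

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import LabelField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import SpacyTokenizer


@DatasetReader.register('classification-tsv')
class ClassificationTsvReader(DatasetReader):
    def __init__(self):
        # Let the base class set up its own state before adding ours.
        super().__init__()
        self.tokenizer = SpacyTokenizer()
        self.token_indexers = {'tokens': SingleIdTokenIndexer()}

    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, 'r') as lines:
            for line in lines:
                # Each line has the form: [text] [TAB] [label]
                text, label = line.strip().split('\t')
                text_field = TextField(self.tokenizer.tokenize(text),
                                       self.token_indexers)
                label_field = LabelField(label)
                fields = {'text': text_field, 'label': label_field}
                yield Instance(fields)
```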


This is a minimal DatasetReader that will return a list of classification Instances when you call reader.read(file). This reader will take each line in the input file, split the text into words using a tokenizer (the SpacyTokenizer shown here relies on spaCy), and represent those words as tensors using a word id in a vocabulary we construct for you.
Pay special attention to the text and label keys that are used in the fields dictionary passed to the Instance - these keys will be used as parameter names when passing tensors into your Model later.
Ideally, the output label would be optional when we create the Instances, so that we can use the same code to make predictions on unlabeled data (say, in a demo), but for the rest of this chapter we’ll keep things simple and ignore that.

There are lots of places where this could be made better for a more flexible and fully-featured reader; see the section on `DatasetReaders` for a deeper dive.
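
The idea of an optional label can be sketched without any AllenNLP code. In the snippet below a plain dictionary stands in for an `Instance`, whitespace splitting stands in for a tokenizer, and `text_to_example` is a hypothetical helper name chosen for this illustration.

```python
from typing import Dict, List, Optional


def text_to_example(text: str, label: Optional[str] = None) -> Dict[str, object]:
    """Build a dict of fields; the label is only added when it is known."""
    tokens: List[str] = text.split()  # stand-in for a real tokenizer
    fields: Dict[str, object] = {"text": tokens}
    if label is not None:
        fields["label"] = label
    return fields


labeled = text_to_example("AllenNLP is amazing", "positive")
unlabeled = text_to_example("AllenNLP is amazing")
print(labeled)
print(unlabeled)
```

With this shape, the same function can serve both training (with labels) and prediction on unlabeled text.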

## Designing your model
---
<figure>
<center>
<img src='https://github.com/IgnatiusEzeani/NLP-Lecture/blob/main/img/batch_model_loss.png?raw=true' />
<figcaption>The Batch-Model-Loss structure</figcaption></center>
</figure>

The next thing we need is a Model that will take a batch of Instances, predict the outputs from the inputs, and compute a loss.
Remember that our Instances have this input/output spec:
```
# Inputs
text: TextField

# Outputs
label: LabelField
```
Also, remember that we used these names (text and label) for the fields in the DatasetReader. AllenNLP passes those fields by name to the model code, so we need to use the same names in our model.

### What should our model do?

<figure>
<center>
<img src='https://github.com/IgnatiusEzeani/NLP-Lecture/blob/main/img/text_model_labels.png?raw=true' />
<figcaption>Expanded Batch-Model-Loss structure</figcaption></center>
</figure>

Conceptually, a generic model for classifying text does the following:
  - Get some features corresponding to each word in your input
  - Combine those word-level features into a document-level feature vector
  - Classify that document-level feature vector into one of your labels.

The `allennlp` library makes each of these conceptual steps into a generic abstraction that you can use in your code, so that you can have a very flexible model that can use different concrete components for each step.

### Representing text with token IDs

<figure>
<center>
<img src='https://github.com/IgnatiusEzeani/NLP-Lecture/blob/main/img/tokenIDs_models_labels.png?raw=true' />
<figcaption>Token IDs are often used as initial inputs</figcaption></center>
</figure>

The first step is changing the strings in the input text into token IDs. This is handled by the `SingleIdTokenIndexer` that we used previously, as part of the data processing pipeline that you don’t have to write code for.
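
To make the string-to-ID step concrete, here is a small plain-Python sketch of a token vocabulary. It is not how `SingleIdTokenIndexer` is implemented; the `<pad>` and `<unk>` entries and the helper names are assumptions made for this example.

```python
from typing import Dict, Iterable, List

PAD, UNK = "<pad>", "<unk>"  # placeholder entries chosen for this sketch


def build_token_vocab(texts: Iterable[List[str]]) -> Dict[str, int]:
    """Assign each distinct token the next free integer ID."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in texts:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def index_tokens(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to IDs, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]


train = [["I", "like", "this", "movie"], ["I", "like", "AllenNLP"]]
vocab = build_token_vocab(train)
ids = index_tokens(["I", "like", "this", "library"], vocab)
print(ids)  # [2, 3, 4, 1]
```

The word `library` never appeared in the training texts, so it maps to the unknown-token ID.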

### Embedding tokens

<figure>
<center>
<img src='https://github.com/IgnatiusEzeani/NLP-Lecture/blob/main/img/embedding_tokens_labels.png?raw=true' />
<figcaption>An embedding layer converts each token ID into a vector</figcaption></center>
</figure>

The first thing the `Model` does is apply an `Embedding` function that converts each token ID that we got as input into a vector. This gives us a vector for each input token, so we have a large tensor here.
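
An embedding can be thought of as a lookup table with one row per token ID. The sketch below uses nested Python lists instead of tensors; the table size, the random initialisation, and the function names are assumptions for illustration only.

```python
import random
from typing import List


def make_embedding_table(vocab_size: int, embedding_dim: int,
                         seed: int = 0) -> List[List[float]]:
    """One randomly initialised vector per token ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]


def embed(batch_ids: List[List[int]],
          table: List[List[float]]) -> List[List[List[float]]]:
    """Replace every token ID by its row in the table."""
    return [[table[token_id] for token_id in ids] for ids in batch_ids]


table = make_embedding_table(vocab_size=7, embedding_dim=4)
batch_ids = [[2, 3, 4, 5], [2, 3, 6, 0]]
embedded = embed(batch_ids, table)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 4, 4), i.e. (batch_size, num_tokens, embedding_dim)
```

In a trained model the rows of the table are parameters that get updated during training rather than fixed random values.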

### Applying a Seq2Vec encoder

<figure>
<center>
<img src='https://github.com/IgnatiusEzeani/NLP-Lecture/blob/main/img/apply_seq2seq_encoder.png?raw=true' />
<figcaption>A Seq2Vec encoder squashes the token vectors into a single vector</figcaption></center>
</figure>

Next we apply some function that takes the sequence of vectors for each input token and squashes it into a single vector. Before the days of pretrained language models like BERT, this was typically an LSTM or convolutional encoder. With BERT we might just take the embedding of the `[CLS]` token.
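
One of the simplest possible "squashing" functions is to average the token vectors. The sketch below does this in plain Python and skips padded positions using a mask. It is an illustrative stand-in, not the LSTM or convolutional encoders mentioned above.

```python
from typing import List


def masked_mean(vectors: List[List[float]], mask: List[bool]) -> List[float]:
    """Average the vectors whose mask entry is True."""
    kept = [vec for vec, keep in zip(vectors, mask) if keep]
    if not kept:
        return [0.0] * len(vectors[0])  # nothing to average
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last one is padding
mask = [True, True, False]
pooled = masked_mean(token_vectors, mask)
print(pooled)  # [2.0, 3.0]
```

Whatever encoder is used, the contract is the same: a sequence of vectors goes in and a single vector comes out.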

### Computing distribution over labels

<figure>
<center>
<img src='https://github.com/IgnatiusEzeani/NLP-Lecture/blob/main/img/compute_distributions_over_labels.png?raw=true' />
<figcaption>A classification layer produces a distribution over labels</figcaption></center>
</figure>

Finally, we take that single feature vector (for each Instance in the batch), and classify it as a label, which will give us a categorical probability distribution over our label space.
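
This last step can be sketched as a linear layer followed by a softmax. The weights below are made-up numbers for a three-label problem and the helper functions are written in plain Python for illustration; a real model would learn the weights and use tensor operations.

```python
import math
from typing import List


def linear(x: List[float], weights: List[List[float]],
           bias: List[float]) -> List[float]:
    """One score (logit) per label: weights @ x + bias."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Turn logits into probabilities that sum to one."""
    largest = max(logits)  # subtracted for numerical stability
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


features = [2.0, 3.0]                              # document-level vector
weights = [[0.5, -0.5], [-0.5, 0.5], [0.0, 0.0]]   # 3 labels x 2 features
bias = [0.0, 0.0, 0.0]
logits = linear(features, weights, bias)
probs = softmax(logits)
print(logits)  # [-0.5, 0.5, 0.0]
print(probs)
```

The label with the largest logit also has the largest probability, so the prediction here is the label at index 1.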



## Implementing the model - the constructor
---
### AllenNLP Model basics
<figure>
<center>
<img src= 'https://github.com/IgnatiusEzeani/NLP-Lecture/blob/main/img/allennlp_model_basic.png?raw=true' />
<figcaption>The basics of an AllenNLP Model</figcaption></center>
</figure>

Now that we know what our model is going to do, we need to implement it. First, we’ll say a few words about how `Models` work in `AllenNLP`:
 - An AllenNLP `Model` is just a PyTorch `Module`
 - It implements a `forward()` method, and requires the output to be a _dictionary_.
 - Its output contains a `loss` key during training, which is used to optimize the model

Our training loop takes a batch of `Instances`, passes it through `Model.forward()`, grabs the `loss` key from the resulting dictionary, and uses backprop to compute gradients and update the model’s parameters. You don’t have to implement the training loop, because all of this is taken care of by AllenNLP (though you can if you want to).
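
The loop described above can be sketched in plain Python. This is not AllenNLP's trainer: `TinyModel` is a hypothetical one-feature logistic regression, and it returns hand-derived gradients next to the loss because this sketch has no automatic differentiation to play the role of backprop.

```python
import math
from typing import Dict, List


class TinyModel:
    """One-feature logistic regression standing in for a real Model."""

    def __init__(self) -> None:
        self.weight = 0.0
        self.bias = 0.0

    def forward(self, x: List[float], label: List[int]) -> Dict[str, float]:
        n = len(x)
        loss = grad_w = grad_b = 0.0
        for x_i, y_i in zip(x, label):
            p = 1.0 / (1.0 + math.exp(-(self.weight * x_i + self.bias)))
            loss += -(y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)) / n
            grad_w += (p - y_i) * x_i / n   # d(loss)/d(weight)
            grad_b += (p - y_i) / n         # d(loss)/d(bias)
        return {"loss": loss, "grad_w": grad_w, "grad_b": grad_b}


def train(model: TinyModel, batches: List[Dict[str, list]],
          epochs: int = 50, lr: float = 0.5) -> List[float]:
    history = []
    for _ in range(epochs):
        for batch in batches:
            output = model.forward(**batch)      # fields are passed by name
            history.append(output["loss"])       # grab the loss key
            model.weight -= lr * output["grad_w"]  # gradient step
            model.bias -= lr * output["grad_b"]
    return history


batches = [{"x": [-2.0, -1.0, 1.0, 2.0], "label": [0, 0, 1, 1]}]
model = TinyModel()
history = train(model, batches)
print(history[0], history[-1])
```

The loss recorded at the last step is lower than at the first, which is all a training loop is trying to achieve.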

### Constructing the Model

In the `Model` constructor, we need to instantiate all of the parameters that we will want to train. It is often better to take most of these parameters as constructor _arguments_, so that we can configure the behavior of our model without changing the model code itself, and so that we can think at a higher level about what our model is doing. The constructor of our text classification model, `SimpleClassifier`, looks like this:

```python
@Model.register('simple_classifier')
class SimpleClassifier(Model):
    def __init__(self,
                 vocab: Vocabulary,
                 embedder: TextFieldEmbedder,
                 encoder: Seq2VecEncoder):
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        num_labels = vocab.get_vocab_size("labels")
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
```
Notice that type annotations are used a lot in AllenNLP code - this is both for code readability (it’s _way_ easier to understand what a method does if you know the types of its arguments, instead of just their names), and because we use these annotations to do some magic for you in some cases.

One of those cases is constructor parameters, where we can automatically construct the embedder and encoder from a configuration file using these type annotations. See the chapter on configuration files for more information. That chapter will also tell you about the call to `@Model.register()`.

The upshot is that if you’re using the `allennlp train` command with a configuration file (which we show how to do below), you won’t ever have to call this constructor; it all gets taken care of for you.

### Passing the vocabulary
`Vocabulary` manages mappings between vocabulary items (such as words and labels) and their integer IDs. In our prebuilt training loop, the vocabulary gets created by AllenNLP after reading your training data, then passed to the `Model` when it gets constructed. We’ll find all tokens and labels that you use and assign them all integer IDs in separate namespaces. The way that this happens is fully configurable.

What we did in the `DatasetReader` will put the labels in the default “labels” namespace, and we grab the number of labels from the vocabulary with the call to `vocab.get_vocab_size("labels")` in the constructor.

### Embedding words

To get an initial word embedding, we use AllenNLP’s `TextFieldEmbedder`. This abstraction takes the tensors created by a `TextField` and embeds each one. This is our most complex abstraction, because there are a lot of ways to do this particular operation in NLP, and we want to be able to switch between these without changing our code. More details can be found in [Representing Text as Features](https://guide.allennlp.org/representing-text-as-features).

All you need to know for now is that you apply this to the `text` parameter you get in `forward()`, and you get out a tensor that has a single embedding vector for each input token, with shape (`batch_size, num_tokens, embedding_dim`).

### Applying a Seq2VecEncoder

To squash our sequence of token vectors into a single vector, we use AllenNLP’s `Seq2VecEncoder` abstraction. As the name implies, this encapsulates an operation that takes a sequence of vectors and returns a single vector.

Because all of our modules operate on batched input, this will take a tensor shaped like (`batch_size, num_tokens, embedding_dim`) and return a tensor shaped like (`batch_size, encoding_dim`).

### Applying a classification layer
The final parameter our `Model` needs is a classification layer, which can transform the output of our `Seq2VecEncoder` into logits, one value per possible label. These values will be converted to a probability distribution later and used for calculating the loss.

We don’t need to take the `num_labels` as a constructor argument, because we’ll just use a simple linear layer, which has sizes that we can figure out inside the constructor - the `Seq2VecEncoder` knows its output dimension, and the `Vocabulary` knows how many labels there are.

## Implementing the model - the forward method

Next, we need to implement the `forward()` method of our model, which takes the input, produces the prediction, and computes the loss. Remember, our constructor and input/output spec look like this:

```python
@Model.register('simple_classifier')
class SimpleClassifier(Model):
    def __init__(self,
                 vocab: Vocabulary,
                 embedder: TextFieldEmbedder,
                 encoder: Seq2VecEncoder):
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        num_labels = vocab.get_vocab_size("labels")
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
```
```
# Inputs:
text: TextField

# Outputs:
label: LabelField
```

Here we’ll show how to use these parameters inside of `Model.forward()`, which will get arguments that match our input/output spec (because that’s how we coded the `DatasetReader`).

### Model.forward()
In `forward`, we use the parameters that we created in our constructor to transform the inputs into outputs. After we’ve predicted the outputs, we compute some loss function based on how close we got to the true outputs, and then return that loss (along with whatever else we want) so that we can use it to train the parameters.

```python
from typing import Dict

import torch
from allennlp.data import TextFieldTensors
from allennlp.models import Model
from allennlp.nn import util


class SimpleClassifier(Model):
    def forward(self,
                text: TextFieldTensors,
                label: torch.Tensor) -> Dict[str, torch.Tensor]:
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        # Shape: (batch_size, num_tokens)
        mask = util.get_text_field_mask(text)
        # Shape: (batch_size, encoding_dim)
        encoded_text = self.encoder(embedded_text, mask)
        # Shape: (batch_size, num_labels)
        logits = self.classifier(encoded_text)
        # Shape: (batch_size, num_labels); normalized over the label dimension
        probs = torch.nn.functional.softmax(logits, dim=-1)
        # Shape: () - a scalar, averaged over the batch
        loss = torch.nn.functional.cross_entropy(logits, label)
        return {'loss': loss, 'probs': probs}
```

### Inputs to forward()
The first thing to notice is the inputs to this function. The way the AllenNLP training loop works is that we will take the field names that you used in your `DatasetReader` and give you a batch of instances with _those same field names_ in `forward`. So, because we used `text` and `label` as our field names, we need to name our arguments to `forward` the same way.

Second, notice the types of these arguments. Each type of `Field` knows how to convert itself into a `torch.Tensor`, then create a batched `torch.Tensor` from all of the Fields with the same name from a batch of `Instances`. The types you see for `text` and `label` are the tensors produced by `TextField` and `LabelField` (again, see our chapter on using TextFields for more information about TextFieldTensors). The important part to know is that our `TextFieldEmbedder`, which we created in the constructor, expects this type of object as input and will return an embedded tensor as output.
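
The name matching described above can be illustrated with ordinary keyword arguments. The sketch below assumes that a batch behaves like a dictionary keyed by field name and is unpacked into `forward`; the plain function and the nested lists are stand-ins for a real model and real tensors.

```python
from typing import Dict, List


def forward(text: List[List[int]], label: List[int]) -> Dict[str, int]:
    """Stand-in for Model.forward(); it only reports what it received."""
    return {"batch_size": len(text), "num_labels_seen": len(set(label))}


# Keys match the field names used when the instances were created.
batch = {"text": [[2, 3, 4], [2, 6, 0]], "label": [1, 0]}
output = forward(**batch)
print(output)

# A field called "tokens" has no matching parameter, so the call fails.
try:
    forward(**{"tokens": [[2, 3, 4]], "label": [1]})
    mismatch_error = None
except TypeError as err:
    mismatch_error = str(err)
print(mismatch_error)
```

This is why renaming a field in the dataset reader without renaming the corresponding `forward` argument leads to an error as soon as the first batch is processed.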

### Embedding the text

The first actual modeling operation that we do is embed the text, getting a vector for each input token. Notice here that we’re not specifying anything about _how_ that operation is done, just that a `TextFieldEmbedder` that we got in our constructor is going to do it. This lets us be very flexible later, changing between various kinds of embedding methods or pretrained representations (including ELMo and BERT) without changing our model code. More on this later.

### Applying a Seq2VecEncoder

After we have embedded our text, we next have to squash the sequence of vectors (one per token) into a single vector for the whole text. We do that using the `Seq2VecEncoder` that we got as a constructor argument. In order to behave properly when we’re batching pieces of text together that could have different lengths, we need to _mask_ elements in the `embedded_text` tensor that are only there due to padding. We use a utility function to get a mask from the `TextField` output, then pass that mask into the encoder.

At the end of these lines, we have a single vector for each instance in the batch.
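
To see why a mask is needed, consider batching two texts of different lengths. The sketch below pads the shorter one and records which positions are real tokens. It is a plain-Python illustration under the assumption that ID 0 is reserved for padding, and `pad_and_mask` is a hypothetical helper rather than an AllenNLP utility.

```python
from typing import List, Tuple


def pad_and_mask(batch_ids: List[List[int]],
                 pad_id: int = 0) -> Tuple[List[List[int]], List[List[bool]]]:
    """Pad every sequence to the longest one and mark the real tokens."""
    max_len = max(len(ids) for ids in batch_ids)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    mask = [[i < len(ids) for i in range(max_len)] for ids in batch_ids]
    return padded, mask


padded, mask = pad_and_mask([[2, 3, 4, 5], [2, 6]])
print(padded)  # [[2, 3, 4, 5], [2, 6, 0, 0]]
print(mask)    # [[True, True, True, True], [True, True, False, False]]
```

An encoder that receives this mask can ignore the two padded positions of the second text instead of treating them as words.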

### Making predictions

The last step of our model is to take the vector for each instance in the batch and predict a label for it. Our `classifier` is a `torch.nn.Linear` layer that gives a score (commonly called a `logit`) for each possible label. We normalize those scores using a `softmax` operation to get a probability distribution over labels that we can return to a consumer of this model. For computing the loss, PyTorch has a built-in function that computes the cross entropy between the logits that we predict and the true labels, and we use that as our loss function.
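
For intuition, the cross-entropy loss can be written out by hand. The plain-Python sketch below computes, for each instance, the negative log-probability that the softmax assigns to the true label, and then averages over the batch. The input numbers are made up, and the function is an illustration rather than PyTorch's implementation.

```python
import math
from typing import List


def cross_entropy(logits: List[List[float]], labels: List[int]) -> float:
    """Mean of -log softmax(logits)[label] over the batch."""
    total = 0.0
    for row, gold in zip(logits, labels):
        largest = max(row)
        # log of the softmax denominator, computed stably
        log_sum_exp = largest + math.log(sum(math.exp(v - largest) for v in row))
        total += log_sum_exp - row[gold]
    return total / len(labels)


logits = [[2.0, 0.0], [0.0, 0.0]]  # (batch_size=2, num_labels=2)
labels = [0, 1]                    # index of the true label per instance
loss = cross_entropy(logits, labels)
print(loss)
```

The first instance is confidently correct and contributes a small loss; the second is undecided between the two labels and contributes `log(2)`.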

And that’s it! This is all you need for a simple classifier. After you’ve written a `DatasetReader` and `Model`, AllenNLP takes care of the rest: connecting your input files to the dataset reader, intelligently batching together your instances and feeding them to the model, and optimizing the model’s parameters by using backprop on the loss. We go over this part in the next chapter.


# Section 2: Training and prediction

The last section presented an overview of the steps involved in constructing a classification model with `allennlp`, the PyTorch-based library from AllenNLP. We learned how to write our own dataset reader and construct a model.

In this chapter, we are going to train the text classification model and make predictions for new inputs. At this point, there are two ways to proceed:

1. we can write our own script to construct the dataset reader and model and run the training loop,
2. or we can write a configuration file and use the `allennlp train` command.

We will explore both ways.

## Training the model with your own script
---
In this section we’ll put together a simple example of reading in data, feeding it to the model, and training the model, using your own Python script instead of `allennlp train`. While we recommend using `allennlp train` for most use cases, it’s easier to understand the introduction to the training loop in this section. Once you get a handle on this, switching to using our built-in command should be easy, if you want to.

### The Dataset
Before proceeding, here are a few words about the dataset we will use throughout this chapter. The dataset is derived from the [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/), collections of movie reviews on IMDb along with their polarity. The labels are binary (positive and negative), and our task is to predict the label from the review text.

This section gives a series of executable examples that you can run yourself in your browser to see what they output. They build on each other, with code from previous examples ending up in the `Setup` block of subsequent examples.

### Testing your dataset reader
In the first example, we’ll simply instantiate the dataset reader, read the movie review dataset using it, and inspect the AllenNLP Instances produced by the dataset reader. Below we have code that you can run (and modify if you want).

Uncomment the line below to install `allennlp`:

```python
# !pip install allennlp
```

### Setup

```python
from typing import Dict, Iterable, List
from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import Field, LabelField, TextField
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token, Tokenizer, WhitespaceTokenizer
```

### The `ClassificationTsvReader`

```python
class ClassificationTsvReader(DatasetReader):
    def __init__(
        self,
        tokenizer: Tokenizer = None,
        token_indexers: Dict[str, TokenIndexer] = None,
        max_tokens: int = None,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.tokenizer = tokenizer or WhitespaceTokenizer()
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}
        self.max_tokens = max_tokens

    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, "r") as lines:
            for line in lines:
                text, sentiment = line.strip().split("\t")
                tokens = self.tokenizer.tokenize(text)
                if self.max_tokens:
                    tokens = tokens[: self.max_tokens]
                text_field = TextField(tokens, self.token_indexers)
                label_field = LabelField(sentiment)
                fields: Dict[str, Field] = {"text": text_field, "label": label_field}
                yield Instance(fields)


dataset_reader = ClassificationTsvReader(max_tokens=64)
instances = list(dataset_reader.read("quick_start/data/movie_review/train.tsv"))

for instance in instances[:10]:
    print(instance)
```

When you run the code snippet above, you should see the dumps of the first ten instances and their content, including their text and label fields. (Note that we keep only the first 64 tokens per instance by specifying `max_tokens=64`.)

This is one way to check if your dataset reader is working as expected. We strongly recommend writing some simple tests for your data processing code, to be sure it’s actually doing what you want it to.
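
As a sketch of what such a test could look like, the snippet below writes a tiny TSV file to a temporary directory and checks that it is read back as expected. To keep it runnable without AllenNLP, it tests a hypothetical `read_tsv` function that mirrors the parsing logic of `_read()` but returns plain tuples instead of `Instance` objects.

```python
import os
import tempfile
from typing import List, Tuple


def read_tsv(file_path: str) -> List[Tuple[str, str]]:
    """Plain-Python stand-in for the parsing done in _read()."""
    examples = []
    with open(file_path, "r", encoding="utf-8") as lines:
        for line in lines:
            text, label = line.strip().split("\t")
            examples.append((text, label))
    return examples


def test_read_tsv() -> None:
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "tiny.tsv")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("good film\tpos\nbad film\tneg\n")
        examples = read_tsv(path)
    # Every line becomes one example, with text and label kept apart.
    assert examples == [("good film", "pos"), ("bad film", "neg")]
    assert {label for _, label in examples} == {"pos", "neg"}


test_read_tsv()
print("dataset reader test passed")
```

The same pattern applies to a real `DatasetReader`: read a small fixture file and assert on the number of instances and on the contents of their fields.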