<a href="https://colab.research.google.com/github/IgnatiusEzeani/NLP-Lecture/blob/main/Week_18_NLP_Tasks_and_Methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 18 - Introduction to Text Classification

This lab takes you through an introductory text classification task using content from the [AllenNLP Guide](https://guide.allennlp.org/). AllenNLP is an open-source library for building deep learning models for natural language processing, developed by the Allen Institute for Artificial Intelligence.

It is built on top of PyTorch and is designed to support researchers, engineers, and students who wish to build high-quality deep NLP models with ease. It provides high-level abstractions and APIs for common components and models in modern NLP, along with an extensible framework that makes it easy to run and manage NLP experiments.

In [None]:
# Uncomment below to install 'allennlp'
# !pip install allennlp

In [None]:
from typing import Dict, Iterable, List
from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import Field, LabelField, TextField
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token, Tokenizer, SpacyTokenizer, WhitespaceTokenizer

## What is text classification
---
Text classification is one of the simplest NLP tasks, where the model, given some input text, predicts a label for the text. See the figure below for an illustration.

<figure>
<center>
<img src='https://github.com/IgnatiusEzeani/NLP-Lecture/blob/main/img/Artboard.png?raw=true' />
<figcaption>A basic text classification pipeline</figcaption></center>
</figure>


There are a variety of applications of text classification, such as spam filtering, sentiment analysis, and topic detection. Some examples are shown in the table below.

|Application | Description | Input | Output|
| --- | --- | --- |--- |
|Spam filtering |Detect and filter spam emails | Email |Spam / Not spam |
|Sentiment analysis|Detect the polarity of text |Tweet, review|Positive / Negative|
|Topic detection | Detect the topic of text | News article, blog post | Business / Tech / Sports|
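
To make the "text in, label out" idea concrete, here is a deliberately naive sketch of a sentiment classifier in plain Python. It is not how AllenNLP models work: the word lists and the `classify_sentiment` function are made up for illustration, and a real model would learn its decision rule from data.

```python
# Toy rule-based sentiment classifier: text in, label out.
# The word lists below are hypothetical and chosen only for this illustration.
POSITIVE_WORDS = {"like", "love", "great", "amazing"}
NEGATIVE_WORDS = {"waste", "bad", "boring", "complicated"}


def classify_sentiment(text: str) -> str:
    """Return 'positive', 'negative', or 'neutral' for the given text."""
    # Lowercase each whitespace-separated word and strip common punctuation.
    words = [w.strip(".,!?").lower() for w in text.split()]
    positive_hits = sum(w in POSITIVE_WORDS for w in words)
    negative_hits = sum(w in NEGATIVE_WORDS for w in words)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"


print(classify_sentiment("I like this movie a lot!"))
print(classify_sentiment("This was a monstrous waste of time"))
```

Hand-written rules like these break down quickly, which is why the rest of this lab builds towards a learned model instead.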

## Defining input and output
---
The first step for building an NLP model is to define its input and output. In AllenNLP, each training example is represented by an `Instance` object. An `Instance` consists of one or more `Field`s, where each `Field` represents one piece of data used by your model, either as an input or an output. `Field`s get converted to tensors and fed to your model. The Reading Data chapter of the AllenNLP Guide provides more details on using `Instance`s and `Field`s to represent textual data.
For text classification, the input and the output are very simple. The model takes a `TextField` that represents the input text and predicts its label, which is represented by a `LabelField`:

```
# Input
text: TextField

# Output
label: LabelField
```

## Reading data

<figure>
<center>
<img src='https://github.com/IgnatiusEzeani/NLP-Lecture/blob/main/img/Slice.png?raw=true' />
<figcaption>Reformatting text files as instances of texts and labels</figcaption></center>
</figure>

With the input and output defined, the next step in building an NLP application is to read the dataset and represent it with some internal data structure.

AllenNLP uses `DatasetReader`s to read the data. Their job is to transform raw data files into `Instance`s that match the input / output spec. Our spec for text classification is:
```
# Inputs
text: TextField

# Outputs
label: LabelField
```
We’ll want one `Field` for the input and another for the output, and our model will use the inputs to predict the outputs.
We assume the dataset has a simple data file format, `[text] [TAB] [label]`, where `[TAB]` stands for a single tab character. For example:
```
I like this movie a lot! [TAB] positive
This was a monstrous waste of time [TAB] negative
AllenNLP is amazing [TAB] positive
Why does this have to be so complicated? [TAB] negative
This sentence expresses no sentiment [TAB] neutral
```
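
Before wrapping this in an AllenNLP reader, it can help to see the parsing step on its own. The sketch below uses only the standard library and assumes each line holds exactly one real tab character (the `[TAB]` placeholder above); the `read_tsv_lines` helper and the in-memory `sample` list are hypothetical stand-ins for a real data file.

```python
from typing import Iterable, Iterator, Tuple


def read_tsv_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (text, label) pairs from lines of the form 'text<TAB>label'."""
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines instead of failing on them.
            continue
        text, label = line.split("\t")
        yield text, label


# In-memory stand-in for the contents of a data file.
sample = [
    "I like this movie a lot!\tpositive\n",
    "\n",
    "This was a monstrous waste of time\tnegative\n",
]

for text, label in read_tsv_lines(sample):
    print(repr(text), "->", label)
```

Passing an open file object instead of `sample` works the same way, since iterating over a file yields its lines.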

## Making a DatasetReader
---
You can implement your own `DatasetReader` by inheriting from the `DatasetReader` class. At minimum, you need to override the `_read()` method, which reads the input dataset and yields `Instance`s.

In [None]:
@DatasetReader.register('classification-tsv')
class ClassificationTsvReader(DatasetReader):
    def __init__(self):
        # Initialise the base class so that reader.read() has the state it needs.
        super().__init__()
        self.tokenizer = SpacyTokenizer()
        self.token_indexers = {'tokens': SingleIdTokenIndexer()}

    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, 'r') as lines:
            for line in lines:
                # Each line has the form: text<TAB>label
                text, label = line.strip().split('\t')
                # Tokenize the text; the indexers later map tokens to vocabulary ids.
                text_field = TextField(self.tokenizer.tokenize(text),
                                       self.token_indexers)
                label_field = LabelField(label)
                # These keys are later used as parameter names by the Model.
                fields = {'text': text_field, 'label': label_field}
                yield Instance(fields)


This is a minimal `DatasetReader` that produces classification `Instance`s when you call `reader.read(file)`. The reader takes each line in the input file, splits the text into words using a tokenizer (the `SpacyTokenizer` shown here relies on spaCy), and represents those words as tensors by looking up each word's id in a vocabulary that AllenNLP constructs for you.
Pay special attention to the `text` and `label` keys used in the `fields` dictionary passed to the `Instance`: these keys will be used as parameter names when passing tensors into your `Model` later.
Ideally, the output label would be optional when we create the `Instance`s, so that the same code could make predictions on unlabeled data (say, in a demo), but for the rest of this chapter we’ll keep things simple and ignore that.
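
The idea of representing words by their ids in a vocabulary can be sketched in a few lines of plain Python. This is only an illustration of the concept, not AllenNLP's implementation: the `build_vocab` and `tokens_to_ids` helpers and the `<pad>` / `<unk>` entries are assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical special entries: one for padding, one for unseen words.
PAD, UNK = "<pad>", "<unk>"


def build_vocab(tokenized_texts: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in tokenized_texts:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def tokens_to_ids(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to ids, falling back to the UNK id for unseen tokens."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]


training_texts = [["AllenNLP", "is", "amazing"], ["this", "is", "complicated"]]
vocab = build_vocab(training_texts)

print(tokens_to_ids(["this", "is", "amazing"], vocab))  # [5, 3, 4]
print(tokens_to_ids(["this", "is", "new"], vocab))      # [5, 3, 1]
```

Lists of ids like these are what eventually get turned into tensors for the model.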

There are lots of places where this could be made better for a more flexible and fully-featured reader; see the section on `DatasetReaders` for a deeper dive.
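
As one example of such an improvement, the optional label mentioned above could be handled roughly as follows. This is a plain-Python sketch rather than AllenNLP code: a dict and a list of strings stand in for `Instance` and `TextField`, and the `text_to_fields` helper and its `max_tokens` parameter are hypothetical. AllenNLP readers conventionally put this kind of logic in a `text_to_instance` method.

```python
from typing import Dict, Optional


def text_to_fields(text: str,
                   label: Optional[str] = None,
                   max_tokens: Optional[int] = None) -> Dict[str, object]:
    """Build the fields for one example; the label may be omitted."""
    # Whitespace splitting stands in for a real tokenizer here.
    tokens = text.split()
    if max_tokens is not None:
        # Truncate long inputs to at most max_tokens tokens.
        tokens = tokens[:max_tokens]
    fields: Dict[str, object] = {"text": tokens}
    if label is not None:
        # Only labelled data (e.g. training data) carries a 'label' field.
        fields["label"] = label
    return fields


print(text_to_fields("AllenNLP is amazing", "positive"))
print(text_to_fields("Some unlabeled input for a demo"))
```

With this split, the file-reading loop only parses lines, and the same function can serve both training and prediction.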