# Project: Text Classification in PyTorch

## Introduction
This project deals with neural text classification using PyTorch. Text classification is the process of assigning tags or categories to text according to its content. It's one of the fundamental tasks in Natural Language Processing (NLP) with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.

Text classification algorithms are at the heart of a variety of software systems that process text data at scale. Email software uses text classification to determine whether incoming mail is sent to the inbox or filtered into the spam folder. Discussion forums use text classification to determine whether comments should be flagged as inappropriate.

**_Example:_** A simple example of text classification is spam classification. Consider all the emails I would receive in my personal inbox if the email service provider did not have a spam filter. Because of the spam filter, spam emails are redirected to the Spam folder, and I receive only non-spam ("_ham_") emails in my inbox.

![](http://blog.yhat.com/static/img/spam-filter.png)

## Task
Here, I want to focus on a specific type of text classification task, "Document Classification into Topics". It can be described as classifying text data, or even large documents, into separate discrete topics or genres of interest.


![](https://miro.medium.com/max/700/1*YWEqFeKKKzDiNWy5UfrTsg.png)

In this project, I will classify given text data into discrete topics or genres. A collection of text samples is given, each with a label attached. The goal is to learn, from the words in the documents, why their contents have been given these labels. I need to create a neural classifier that is trained on this information. Once trained, the classifier should be able to predict the label for any new document or text sample that is fed to it. The labels themselves need not have any particular meaning to us.

## Data
There are various datasets that can be used for this purpose. This project shows the usage of the text classification datasets in the PyTorch library ``torchtext``, which includes datasets such as `AG_NEWS`, `SogouNews`, `DBpedia`, and others. The project trains a supervised learning algorithm for classification using one of these datasets: I will first work with the `AG_NEWS` dataset for the linear model.

## Load Data

A bag of **ngrams** feature is applied to capture some partial information about the local word order. In practice, bi-grams or tri-grams are used because groups of words carry more information than single words alone.

**Example:**

*"I love Neural Networks"*
* **Bi-grams:** "I love", "love Neural", "Neural Networks"
* **Tri-grams:** "I love Neural", "love Neural Networks"

In the code below, I load the `AG_NEWS` dataset from the ``torchtext.datasets.text_classification`` module with bi-gram features. The dataset supports the `ngrams` argument. By setting `ngrams` to 2, each example text in the dataset becomes a list of single words plus bi-gram strings.
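To make the "single words plus bi-grams" format concrete, here is a minimal pure-Python sketch. The helper name `words_plus_ngrams` is hypothetical and is not a `torchtext` function; it only illustrates the kind of token list described above, under the assumption that the text has already been split into words.

```python
def words_plus_ngrams(tokens, ngrams):
    """Return the single tokens followed by all n-grams up to size `ngrams`.

    Hypothetical helper for illustration only; not a torchtext function.
    """
    features = list(tokens)  # single words come first
    for n in range(2, ngrams + 1):
        # Slide a window of n consecutive tokens over the sentence.
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


tokens = "I love Neural Networks".split()
bigram_features = words_plus_ngrams(tokens, 2)
print(bigram_features)
```

With `ngrams=2`, the four words are followed by the three bi-grams from the example above, so the feature list has seven entries.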

In [None]:
"""
Load the AG_NEWS dataset in bi-gram features format.
"""
!pip install torchtext==0.4
import torch
import torchtext
from torchtext.datasets import text_classification
import os

NGRAMS = 2

if not os.path.isdir('./.data'):
    os.mkdir('./.data')

train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)

BATCH_SIZE = 16

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Collecting torchtext==0.4
  Downloading https://files.pythonhosted.org/packages/43/94/929d6bd236a4fb5c435982a7eb9730b78dcd8659acf328fd2ef9de85f483/torchtext-0.4.0-py3-none-any.whl (53kB)
Installing collected packages: torchtext
  Found existing installation: torchtext 0.3.1
    Uninstalling torchtext-0.3.1:
      Successfully uninstalled torchtext-0.3.1
Successfully installed torchtext-0.4.0


ag_news_csv.tar.gz: 11.8MB [00:00, 71.6MB/s]
120000lines [00:06, 17886.47lines/s]
120000lines [00:15, 7894.45lines/s]
7600lines [00:00, 8055.16lines/s]


## Model

My first simple model is composed of an [`EmbeddingBag`](https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag) layer and a linear layer.

``EmbeddingBag`` computes the mean value of a “bag” of embeddings. The text entries here have different lengths, but ``EmbeddingBag`` requires no padding because the start position of each entry is saved in offsets. Additionally, since ``EmbeddingBag`` accumulates the average across the embeddings on the fly, it can improve performance and memory efficiency when processing a sequence of tensors.
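The idea of averaging a bag of embeddings with offsets can be sketched without PyTorch. The function below is a plain-Python illustration of the computation described above, not the library implementation; the name `mean_bag` and the toy embedding values are made up, and every bag is assumed to be non-empty.

```python
def mean_bag(embedding_table, indices, offsets):
    """Average the embedding rows of each bag.

    `indices` is one flat list of token ids for the whole batch and
    `offsets[i]` is the position in `indices` where bag i starts.
    Illustration only; assumes every bag is non-empty.
    """
    pooled = []
    # Bag i ends where bag i + 1 starts; the last bag ends at len(indices).
    ends = list(offsets[1:]) + [len(indices)]
    for start, end in zip(offsets, ends):
        rows = [embedding_table[i] for i in indices[start:end]]
        dim = len(rows[0])
        pooled.append([sum(row[d] for row in rows) / len(rows) for d in range(dim)])
    return pooled


# Toy table: 4 tokens with 2-dimensional embeddings (values are made up).
table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Two texts of different lengths, [1, 2, 3] and [0, 3], stored back to back.
flat_indices = [1, 2, 3, 0, 3]
bag_offsets = [0, 3]
print(mean_bag(table, flat_indices, bag_offsets))
```

No padding is needed here: the two texts have lengths 3 and 2, and the offsets alone tell the function where each one begins.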

In [None]:
# Importing the necessary libraries
import torch.nn as nn
import torch.nn.functional as F

class TextClassifier(nn.Module):
    # Defining the __init__() method with proper parameters
    # (vocabulary size, dimensions of the embeddings, number of classes)
    def __init__(self, VOCAB_SIZE, EMBED_DIM, NUM_CLASS):
        super().__init__()
        # Defining the embedding layer
        self.embedding = nn.EmbeddingBag(VOCAB_SIZE, EMBED_DIM, sparse=True)
        # Defining the linear forward layer
        self.fc = nn.Linear(EMBED_DIM, NUM_CLASS)
        # Initializing weights
        self.initialize_weights()
    # Defining a method to initialize weights.
    # The weights are drawn uniformly from the range -0.5 to 0.5.
    # The bias values are initialized to zero.
    def initialize_weights(self):
        self.embedding.weight.data.uniform_(-0.5, 0.5)
        self.fc.weight.data.uniform_(-0.5, 0.5)
        self.fc.bias.data.zero_()

    # Defining the forward function.
    # This computes the bag embeddings and passes them through
    # the linear layer to produce one score per class.

    def forward(self, text, offsets):
        embeddings = self.embedding(text, offsets)
        return self.fc(embeddings)

## Checking the data before I proceed!

Okay, so I know that I'm using the `AG_NEWS` dataset in this project, but I do not know any specific details about it. Let's find out!

The following are reported:
* Vocabulary size (VOCAB_SIZE)
* Number of classes (NUM_CLASS)
* Names of the classes


In [None]:
VOCAB_SIZE = len(train_dataset.get_vocab())
print(VOCAB_SIZE)

NUM_CLASS = len(train_dataset.get_labels())
print(NUM_CLASS)

labels = train_dataset.get_labels()
print(labels)

1308844
4
{0, 1, 2, 3}


* Vocabulary size = 1308844
* Number of classes = 4
* Names of the classes = {0, 1, 2, 3}, which correspond to World, Sports, Business and Sci/Tec

## Create an instance for my model

The vocab size is equal to the length of the vocabulary (including single words and n-grams). The number of classes is equal to the number of labels. The parameters found while inspecting the data are used to create an instance `model` of my text classifier `TextClassifier`.

In [None]:
'''
Parameters and model instance creation.
'''

# Instantiating the Vocabulary size and the number of classes
# from the training dataset that I loaded.

VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 32
NUM_CLASS = len(train_dataset.get_labels())

# Instantiating the model with the parameters that I defined above
# and moving it to the available 'device' (GPU if present, otherwise CPU)

model = TextClassifier(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)

## Generate batch

Since the text entries have different lengths, I need a custom function to generate data batches and offsets. This function is passed to the ``collate_fn`` parameter of the PyTorch ``DataLoader``, which I will use to load the data later on. The input to ``collate_fn`` is a list of dataset entries of length `batch_size`, and the ``collate_fn`` function packs them into a mini-batch. ``collate_fn`` is declared as a top-level definition.

The text entries in the original data batch input are packed into a list and concatenated into a single tensor as the input of ``EmbeddingBag``. The offsets tensor holds delimiters that mark the beginning index of each individual sequence in the text tensor. The label tensor stores the labels of the individual text entries.

The function takes the batch as its input parameter. Each entry in the batch is a pair of values: the label first, then the corresponding text.
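The packing step described above can be illustrated with plain lists instead of tensors. This is a sketch of the same bookkeeping, with a hypothetical function name `pack_batch` and made-up token ids; it mirrors the (label, text) pair layout of the batch entries.

```python
def pack_batch(batch):
    """Flatten variable-length token-id lists and record where each one starts.

    `batch` is a list of (label, token_ids) pairs. Plain lists are used
    here for illustration instead of tensors.
    """
    labels = []
    offsets = []
    flat = []
    for label, token_ids in batch:
        labels.append(label)
        # The entry starts at the current end of the flat list.
        offsets.append(len(flat))
        flat.extend(token_ids)
    return flat, offsets, labels


# Three entries of lengths 3, 1 and 2 (token ids are made up).
toy_batch = [(2, [7, 8, 9]), (0, [4]), (1, [5, 6])]
flat, offsets, labels = pack_batch(toy_batch)
print(flat)
print(offsets)
print(labels)
```

Each offset is the sum of the lengths of all earlier entries, which is why a cumulative sum over the lengths (without the last one) produces the same values.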

In [None]:
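# Defining the batch-generation function passed to DataLoader as collate_fn

def generate_batch(batch):

    # Each entry is a (label, text) pair
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    # Lengths of the entries, with a leading 0 for the first start position
    offsets = [0] + [len(entry) for entry in text]

    # The cumulative sum of the lengths (dropping the last one) gives
    # the start index of each entry in the concatenated text tensor
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)

    return text, offsets, label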

## Defining the train function

Here, I define a function that I will use later in the project to train the model. The outline of the function is as follows:

* load the data as batches
* iterate over the batches
* find the model output for a forward pass
* calculate the loss
* perform backpropagation on the loss (optimize it)
* find the training accuracy

In addition, I need to accumulate the total loss and the total number of correct predictions over all batches, and then return their averages over the training set.
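A small worked example may help with the bookkeeping described above. The sketch below uses plain Python and made-up per-batch numbers; the helper name `summarize_epoch` is hypothetical. One detail worth noting: if each batch loss is already a mean over the batch (the default behaviour of `CrossEntropyLoss`), then dividing the summed batch losses by the number of examples gives a value roughly `BATCH_SIZE` times smaller than the average per-example loss.

```python
def summarize_epoch(batch_losses, batch_correct, num_examples):
    """Turn per-batch statistics into the two values reported per epoch.

    batch_losses:  mean loss of each mini-batch
    batch_correct: number of correct predictions in each mini-batch
    num_examples:  total number of examples seen in the epoch
    """
    total_loss = sum(batch_losses)
    total_correct = sum(batch_correct)
    return total_loss / num_examples, total_correct / num_examples


# Made-up numbers: 3 batches of 16 examples each.
reported_loss, accuracy = summarize_epoch([0.5, 0.25, 0.25], [12, 14, 16], 48)
print(f"loss: {reported_loss:.4f}, accuracy: {accuracy * 100:.1f}%")
```

Here the summed batch loss is 1.0 and 42 of the 48 predictions are correct, so the reported loss is 1/48 and the accuracy is 87.5%.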

In [None]:
def train(train_data):

    # Initial values of training loss and training accuracy

    train_loss = 0
    train_acc = 0

    # Using the PyTorch DataLoader class to load 'train_data'
    # into shuffled batches of size BATCH_SIZE in the variable 'data'

    data = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)


    for i, (text, offsets, cls) in enumerate(data):

        # Resetting the gradients left over from the previous batch
        optimizer.zero_grad()


        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)

        # Storing the output of the model in variable 'output'
        output = model(text, offsets)


        # Defining the 'loss' variable (with respect to 'output' and 'cls').
        # Also calculating the total loss in variable 'train_loss'
        loss = criterion(output, cls)
        train_loss += loss.item()


        # Performing the backward propagation on 'loss' and
        # optimize it through the 'optimizer' step
        loss.backward()
        optimizer.step()


        # Counting the correct predictions and adding them
        # to the running total in the variable 'train_acc'.
        train_acc += (output.argmax(1) == cls).sum().item()


    # Adjusting the learning rate here using the scheduler step
    scheduler.step()

    return train_loss / len(train_data), train_acc / len(train_data)


## Defining the test function

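The `test()` function below follows the same framework as the `train()` function in the previous cell, but it does not shuffle the data and does not perform backpropagation or optimizer steps; it only accumulates the loss and accuracy.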

In [None]:
def test(test_data):

    # Initial values of test loss and test accuracy

    loss = 0
    acc = 0

    # Using DataLoader class to load the data
    # into non-shuffled batches of appropriate sizes.

    data = DataLoader(test_data, batch_size=BATCH_SIZE, collate_fn=generate_batch)

    for text, offsets, cls in data:

        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)

        with torch.no_grad():


            # Getting the model output
            output = model(text, offsets)


            # Calculating the batch loss and adding it to the total 'loss'
            # (a separate name keeps the running total from being overwritten)
            batch_loss = criterion(output, cls)
            loss += batch_loss.item()


            # Calculating the accuracy and storing it in the 'acc' variable
            acc += (output.argmax(1) == cls).sum().item()


    return loss / len(test_data), acc / len(test_data)

## Splitting the dataset and running the model

The original `AG_NEWS` has no validation dataset. For this reason, the training dataset needs to be split into training and validation sets with a suitable split ratio. The `random_split()` function from `torch.utils.data` in the core PyTorch library is used.

* The initial learning rate is 4.0, the number of epochs is 5, and the training data ratio is 0.9.
* The cross-entropy loss is used as the loss function.
* SGD is used as the optimization algorithm.
* A scheduler adjusts the learning rate through the epochs (gamma parameter = 0.9).
* The loss and accuracy values for both the training and validation sets are monitored.
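The numbers in the list above can be worked out by hand. The following sketch, in plain Python with hypothetical helper names, assumes the learning rate is multiplied by gamma after every epoch (a step size of 1) and that the training set has 120,000 examples, as the download log earlier indicates.

```python
def step_lr_schedule(initial_lr, gamma, num_epochs, step_size=1):
    """Learning rate in effect during each epoch under step decay."""
    return [initial_lr * gamma ** (epoch // step_size)
            for epoch in range(num_epochs)]


def split_lengths(num_examples, train_ratio):
    """Sizes of the training and validation parts for a given ratio."""
    train_length = int(num_examples * train_ratio)
    return train_length, num_examples - train_length


# Values taken from the list above: lr 4.0, gamma 0.9, 5 epochs, ratio 0.9.
schedule = step_lr_schedule(4.0, 0.9, 5)
print([round(lr, 4) for lr in schedule])
print(split_lengths(120000, 0.9))
```

Under these assumptions the learning rate falls from 4.0 in the first epoch to about 2.62 in the fifth, and a 0.9 ratio leaves 12,000 examples for validation.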

In [None]:
import time
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split

# Setting the number of epochs and the learning rate to
# their initial values here

N_EPOCHS = 5
LEARNING_RATE = 4.0
TRAIN_RATIO = 0.9

# Setting the initial validation loss to positive infinity
min_valid_loss = float('inf')


# Using the CrossEntropy loss function
criterion = torch.nn.CrossEntropyLoss().to(device)


# Using the SGD optimization algorithm with parameters
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)


# Using a scheduler function
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)


# Splitting the data into train and validation sets using random_split()
train_length = int(len(train_dataset) * TRAIN_RATIO)
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_length, len(train_dataset) - train_length])


# Training loop: train for one epoch, then evaluate on the validation set

for epoch in range(N_EPOCHS):

    start_time = time.time()
    train_loss, train_acc = train(sub_train_)
    valid_loss, valid_acc = test(sub_valid_)

    secs = int(time.time() - start_time)
    mins = secs // 60
    secs = secs % 60

    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')

Epoch: 1  | time in 0 minutes, 9 seconds
	Loss: 0.0263(train)	|	Acc: 84.6%(train)
	Loss: 0.0001(valid)	|	Acc: 90.7%(valid)
Epoch: 2  | time in 0 minutes, 8 seconds
	Loss: 0.0120(train)	|	Acc: 93.6%(train)
	Loss: 0.0000(valid)	|	Acc: 91.0%(valid)
Epoch: 3  | time in 0 minutes, 8 seconds
	Loss: 0.0070(train)	|	Acc: 96.3%(train)
	Loss: 0.0000(valid)	|	Acc: 91.2%(valid)
Epoch: 4  | time in 0 minutes, 8 seconds
	Loss: 0.0039(train)	|	Acc: 98.0%(train)
	Loss: 0.0001(valid)	|	Acc: 91.4%(valid)
Epoch: 5  | time in 0 minutes, 9 seconds
	Loss: 0.0022(train)	|	Acc: 99.0%(train)
	Loss: 0.0000(valid)	|	Acc: 91.7%(valid)


## Let's check the test loss and test accuracy

I have trained the model and seen how well it performs on the training and validation datasets. Now I need to check the model's performance against the test dataset. Using the test dataset as input, the test loss and test accuracy scores of the model are reported.

In [None]:
# Computing the results (loss and accuracy) on the test data

print('Checking the results of test dataset...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset...
	Loss: 0.0002(test)	|	Acc: 89.7%(test)


In [None]:
# importing necessary libraries

import re
from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

# labels for the AG_NEWS dataset

ag_news_label = {1 : "World",
                 2 : "Sports",
                 3 : "Business",
                 4 : "Sci/Tec"}

def predict(text, model, vocab, ngrams):
    tokenizer = get_tokenizer("basic_english")
    with torch.no_grad():
        text = torch.tensor([vocab[token]
                            for token in ngrams_iterator(tokenizer(text), ngrams)])
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

# Predicting the topic of the above given random text (using bigrams)

print("This is a '%s' news" % ag_news_label[predict(ex_text_str, model, vocab, 2)])


This is a 'Sports' news


The model was tested with one new sample text. Now I feed it some more examples of similar text (each related to at least one of the four topics _"World", "Sports", "Business", "Sci/Tec"_ of our problem) and check how the model reacts. Three such examples are tested.

In [None]:
str1 = "Japan is to ask all schools to close from Monday to prevent the spread of the coronavirus, PM Shinzo Abe says\
The closure thought to affect 13 million students will continue until the school year ends in late March."

str2 = "Clearview AI, a start-up with a database of more than three billion photographs from Facebook, YouTube and Twitter, has been hacked.\
The attack allowed hackers to gain access to its client list but it said its servers had not been breached."

str3= "Global investors were hit with a sixth day of stock market losses on Thursday, as traders responded to the threat of the coronavirus.\
The string of declines pushed indexes in Europe and the US down more than 10% from their recent highs - sending them into so-called correction territory."
vocab = train_dataset.get_vocab()
model = model.to("cpu")

# Predicting the topic of the above given random text (using bigrams)

print("This is a '%s' news" % ag_news_label[predict(str1, model, vocab, 2)])  # world news

print("This is a '%s' news" % ag_news_label[predict(str2, model, vocab, 2)])  # technology-related news

print("This is a '%s' news" % ag_news_label[predict(str3, model, vocab, 2)])  # business-related news


This is a 'World' news
This is a 'Sci/Tec' news
This is a 'Business' news


The model predicted all three new test cases correctly.

So the model still works well on the examples fed to it above.

## How about twisting the test case?

Let's feed it some more text from completely different genres/topics (not belonging to the four topics).

Of course, the predictions will be limited to the four class labels that the model was trained on. I will try to find out why the model predicted the labels it did for the given text inputs.
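Why the predictions are always limited to the four labels can be seen from how a label is chosen: the class with the highest score wins, however close the scores are. The sketch below is plain Python with made-up scores and a hypothetical helper `top_label`; it is not taken from the model in this project.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, labels):
    """Return the highest-scoring label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


labels = ["World", "Sports", "Business", "Sci/Tec"]
clear_winner = [0.1, 4.0, 0.2, 0.3]  # made-up scores, one class dominates
nearly_tied = [0.5, 0.4, 0.6, 0.5]   # made-up scores, no class stands out
print(top_label(clear_winner, labels))
print(top_label(nearly_tied, labels))
```

Even when the four scores are nearly tied and the top probability is barely above one quarter, a label is still returned, so text from an unrelated topic is always assigned to one of the known classes.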

In [None]:
str1 = "Local music often appears at karaoke venues, which is on lease from the record labels. Traditional Japanese music differs markedly \
from Western music, as it is often based on the intervals of human breathing rather than on mathematical timing."

str2 = "No doubt, life is beautiful and every moment  a celebration of being alive, but one should be always ready to face adversity and challenges. \
A person who has not encountered difficulties in life can never achieve success.Difficulties test the courage, patience, perseverance and \
true character of a human being. Adversity and hardships make a person strong and ready to face the challenges of life with equanimity. \
There is no doubt that there can be no gain without pain. It is only when one toils and sweats it out that success is nourished and sustained"

str3= "Education is analyzed as being an important role in the society, where the structure of teaching, learning, and environment is \
frequently debated as factor (main) responsible for the development of people. \
This is why education system, and the structure for teaching shall be considered seriously."
vocab = train_dataset.get_vocab()
model = model.to("cpu")

# Predicting the topic of the above given random text (using bigrams)

print("This is a '%s' news" % ag_news_label[predict(str1, model, vocab, 2)])  # about types of music in Japan

print("This is a '%s' news" % ag_news_label[predict(str2, model, vocab, 2)])  # about the philosophy of life

print("This is a '%s' news" % ag_news_label[predict(str3, model, vocab, 2)])  # about education


This is a 'Business' news
This is a 'Business' news
This is a 'Sci/Tec' news


1. `str1` describes different types of music in Japan, but the model classifies it as business news. This is probably due to words such as "lease" and "mathematical timing" in the text, which are often used in business.

2. `str2` is a philosophical passage about life, but the model classified it as business news. It is not clear why, but it may be because of words and phrases such as "adversity", "challenges", and "no gain", which often appear in business news.

3. `str3` is about the education system, but the model classified it as Sci/Tec news. This may be because science-related news often mentions education, and because words such as "teaching", "learning", and "environment" are used when describing new technology, how to learn it, and how it is developed. The word "system" is also more common in technology news.

## Room for improvement
The model has achieved a reasonably good accuracy score (89.7% on the test set). However, there are still many things that could be tried to improve the classifier.

1. Since I am using a simple fully connected network on top of averaged embeddings, the model does not consider the order in which words appear or the temporal relation between words, beyond the local order captured by bi-grams. So I could use different models such as an LSTM, where the sequence of the words matters, not just the words themselves.

2. I could increase the depth of the neural network. Usually only primitive features are learnt in the initial layers, so I could add layers and test whether the model learns more complex features.

3. I could reduce the step size and increase the number of epochs for the current network to see whether it gives higher validation and test accuracy.