**News Classification by LSTM**

In this notebook, we classify news articles using only the language associated with them: the headline and the short description are the sole inputs for predicting each article's category. We intentionally leave out the author's name, because certain authors tend to write about particular topics.

The notebook is organized into four sections:
1. Preparing data
2. Building the model
3. Training the model
4. User input

The components used here include:
* Torchtext library
* Pre-trained word embedding
* LSTM network architecture
* Bidirectional LSTM
* Multi-layered LSTM
* Regularization
* Adam optimizer
* Cross-entropy loss function for classification problem

**Preparing Data**

We use the Torchtext library to pre-process our data. Torchtext simplifies text pre-processing: reading the data, tokenizing, building the vocabulary, and converting text into tensors.

In [None]:
from google.colab import drive

drive.mount("/content/drive")

%cd '/content/drive/MyDrive/7643/final/'

Mounted at /content/drive
/content/drive/MyDrive/7643/final


In [None]:
import torch
from torchtext.legacy import data

First we specify what our data consists of: TEXT, which holds each article's headline and short description, and LABEL, which holds its category. We tokenize the text with the [spaCy](https://spacy.io/) tokenizer and lowercase every word, while LABEL values are kept as they are.

In [None]:
TEXT = data.Field(tokenize = 'spacy', lower = True)
LABEL = data.LabelField()

We load the JSON file with TabularDataset. Each record is mapped to an example with three attributes, 'headline', 'desc', and 'category', which hold the article's headline, short description, and category.

In [None]:
news = data.TabularDataset(
    path='data/News_Category_Dataset_v2.json', format='json',
    fields={'headline': ('headline', TEXT),
            'short_description' : ('desc', TEXT),
             'category': ('category', LABEL)})
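
To make the field mapping concrete, here is a minimal standard-library sketch of what reading such a file involves, assuming the dataset is stored as one JSON object per line. The two sample records and the `read_records` helper are invented for illustration and are not part of Torchtext.

```python
import io
import json

# Hypothetical sample in JSON Lines form; the real file has many more records and keys.
SAMPLE = io.StringIO(
    '{"category": "POLITICS", "headline": "A Headline", "short_description": "Some text."}\n'
    '{"category": "SPORTS", "headline": "Another One", "short_description": "More text."}\n'
)

# Source key -> attribute name, mirroring the `fields` argument above.
FIELD_MAP = {"headline": "headline", "short_description": "desc", "category": "category"}

def read_records(handle, field_map):
    """Parse one JSON object per line and keep only the mapped keys."""
    records = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        raw = json.loads(line)
        records.append({new: raw[old] for old, new in field_map.items()})
    return records

records = read_records(SAMPLE, FIELD_MAP)
print(records[0]["desc"])
```

Keys that are not listed in the mapping are simply dropped, which is how the author's name stays out of the model's input.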

Next we split the dataset into a training set trn, a validation set vld, and a test set tst, using a seed so that the result is reproducible.

In [None]:
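import random
from tqdm import tqdm_notebook, tqdm

SEED = 1234

# random.seed() returns None, so seed first and pass the resulting state explicitly.
random.seed(SEED)

# Note: with a list, the legacy split() reads the ratios as (train, test, valid)
# but returns (train, valid, test), so vld receives 10% and tst receives 20%.
trn, vld, tst = news.split(split_ratio=[0.7, 0.2, 0.1], random_state=random.getstate())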
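
As a small illustration of seeded splitting outside Torchtext, the sketch below shuffles with a private `random.Random(seed)` generator and cuts the result by ratio. The `seeded_split` helper is hypothetical and only mirrors the idea. Note that `random.seed(...)` itself returns `None`, so when an API expects a random state, the value to pass is the tuple from `random.getstate()`.

```python
import random

def seeded_split(items, ratios, seed):
    """Shuffle a copy of `items` with its own seeded RNG, then cut it by `ratios`."""
    rng = random.Random(seed)  # private generator, global state is left untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    total = sum(ratios)
    parts, start = [], 0
    for i, ratio in enumerate(ratios):
        # The last part takes whatever is left so rounding never drops an item.
        if i == len(ratios) - 1:
            end = len(shuffled)
        else:
            end = start + round(len(shuffled) * ratio / total)
        parts.append(shuffled[start:end])
        start = end
    return parts

first = seeded_split(range(100), [0.7, 0.2, 0.1], seed=1234)
second = seeded_split(range(100), [0.7, 0.2, 0.1], seed=1234)
print([len(p) for p in first], first == second)
```

Running the split twice with the same seed yields identical partitions, which is the property we rely on when comparing training runs.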

Let's check one example from our data. It should contain the parsed headline, the description, and the associated category.

In [None]:
# vars(trn[0])

We build the vocabulary from the training set and load the pre-trained GloVe vectors for its words. After that, we can check how many distinct tokens the text contains and how many categories we have.

In [None]:
TEXT.build_vocab(trn, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(trn)
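
Conceptually, building a vocabulary means counting tokens in the training set and assigning each one an integer index, with a fallback index for unknown words. The sketch below shows that idea with the standard library only. The helper names, the toy sentences, and the tie-breaking order are assumptions for illustration, and Torchtext's own ordering may differ.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"

def build_vocab(token_lists, min_freq=1):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Ties are broken alphabetically so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = [UNK, PAD] + [tok for tok, c in ordered if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices, sending out-of-vocabulary tokens to <unk>."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]

# Hypothetical tokenized training headlines.
train_tokens = [["trump", "wins", "vote"], ["team", "wins", "cup"]]
itos, stoi = build_vocab(train_tokens)
print(numericalize(["team", "wins", "election"], stoi))
```

Building the vocabulary from the training split only, as the cell above does, means that words seen exclusively in validation or test data are treated as unknown, just as unseen words would be at prediction time.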

In [None]:
# https://stackoverflow.com/questions/7368789/convert-all-strings-in-a-list-to-int
# https://www.askpython.com/python/built-in-methods/python-vars-method

vocab = vars(LABEL.vocab)
freqs = list(vocab['freqs'].values())
freqs = list(map(int, freqs))
freqs.sort(reverse=True)
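
These sorted frequencies are used further down to derive per-class loss weights through a `weights` helper module that is not shown in this notebook. As an assumption about what an inverse-square-root weighting could look like, here is a hypothetical sketch (the function below is not the module's code): each class is weighted by 1/sqrt(count) and the weights are normalized to sum to 1.

```python
import math

def inverse_sqrt_weights(freqs):
    """Weight each class by 1/sqrt(count), then normalize so the weights sum to 1."""
    raw = [1.0 / math.sqrt(f) for f in freqs]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical class counts, sorted from most to least frequent.
example_freqs = [900, 400, 100]
weights_example = inverse_sqrt_weights(example_freqs)
print([round(w, 4) for w in weights_example])
```

Rare classes receive larger weights, but the square root keeps the spread far smaller than plain inverse-frequency weighting would.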

Next, we wrap our data in iterators for the training, validation, and test sets.

In [None]:
BATCH_SIZE = 64

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (trn, vld, tst), 
    batch_size = BATCH_SIZE, 
    device = device,
    sort_key= lambda x: len(x.headline), 
    sort_within_batch= False
    )
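
The point of bucketing is to batch examples of similar length together so that less padding is needed. The sketch below is a simplified illustration of that idea, not BucketIterator's actual algorithm; `bucket_batches`, `padding_needed`, and the toy data are hypothetical.

```python
import random

def bucket_batches(examples, batch_size, key=len, seed=0):
    """Sort by length, cut into batches, then shuffle the order of the batches."""
    ordered = sorted(examples, key=key)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches

def padding_needed(batches, key=len):
    """Count the pad tokens required to make every batch rectangular."""
    return sum(max(key(x) for x in b) * len(b) - sum(key(x) for x in b) for b in batches)

# Hypothetical tokenized headlines of varying length.
headlines = [["w"] * n for n in [3, 12, 4, 11, 5, 13, 2, 10]]
naive = [headlines[i:i + 4] for i in range(0, len(headlines), 4)]
bucketed = bucket_batches(headlines, batch_size=4)
print(padding_needed(naive), padding_needed(bucketed))
```

Since the sort key above uses the headline length only, the descriptions within a batch can still vary considerably in length.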

**Building the Model**

In this section, we define our model. Because we classify news from its headline and short description, which are sentences or short paragraphs, we use a sequence model: an LSTM (Long Short-Term Memory) network. More specifically, we use a bidirectional, two-layer LSTM in the hope of improving prediction accuracy, and we regularize with dropout during the forward pass. The headline and the short description are processed by separate LSTM branches, and their outputs are concatenated before the final layer that predicts the news category. The details are shown in the diagram below:

![](https://i.imgur.com/6nXjqx8.png)

In [None]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        
        super().__init__()
                
        self.embedding = nn.Embedding(vocab_size, embedding_dim).to(device)
        
        self.lstm_head = nn.LSTM(embedding_dim, hidden_dim, num_layers = n_layers, bidirectional = bidirectional, dropout = dropout).to(device)
        
        self.lstm_desc = nn.LSTM(embedding_dim, hidden_dim, num_layers = n_layers, bidirectional = bidirectional, dropout = dropout).to(device)
        
        self.fc_head = nn.Linear(hidden_dim * 2, 100).to(device)
        
        self.fc_desc = nn.Linear(hidden_dim * 2, 100).to(device)

        self.fc_total = nn.Linear(200, output_dim).to(device)
        
        self.dropout = nn.Dropout(dropout).to(device)
                
    def forward(self, headline, description):
                        
        embedded_head = self.dropout(self.embedding(headline))
        
        embedded_desc = self.dropout(self.embedding(description))
                                    
        output_head, (hidden_head, cell_head) = self.lstm_head(embedded_head)
        
        output_desc, (hidden_desc, cell_desc) = self.lstm_desc(embedded_desc)
        
        hidden_head = self.dropout(torch.cat((hidden_head[-2, :, :], hidden_head[-1, :, :]), dim = 1))
        
        hidden_desc = self.dropout(torch.cat((hidden_desc[-2, :, :], hidden_desc[-1, :, :]), dim = 1))
        
        full_head = self.fc_head(hidden_head)
        
        full_desc = self.fc_desc(hidden_desc)
        
        hidden_total = torch.cat((full_head, full_desc), 1)
        
        return self.fc_total(hidden_total)

Now we create our model and check how many parameters we are training.

In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = len(LABEL.vocab)
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.2

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 12,590,629 trainable parameters
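
The reported total can be reproduced by hand from the layer shapes. The sketch below does the arithmetic in plain Python. The vocabulary size of 78,595 is not printed anywhere in this notebook; it is inferred here from the reported total, assuming 41 output classes (the `num_classes=41` used later).

```python
def lstm_params(input_dim, hidden_dim, n_layers, bidirectional):
    """Parameter count of a stacked LSTM with biases (PyTorch layout)."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        layer_input = input_dim if layer == 0 else hidden_dim * directions
        # Four gates, each with input weights, recurrent weights and two bias vectors.
        per_direction = 4 * hidden_dim * (layer_input + hidden_dim + 2)
        total += directions * per_direction
    return total

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return (in_features + 1) * out_features

def model_params(vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional):
    embedding = vocab_size * embedding_dim
    lstms = 2 * lstm_params(embedding_dim, hidden_dim, n_layers, bidirectional)  # head + desc
    branches = 2 * linear_params(hidden_dim * 2, 100)                             # fc_head + fc_desc
    final = linear_params(200, output_dim)                                        # fc_total
    return embedding + lstms + branches + final

# vocab_size is an inferred value (see the note above), not taken from the notebook output.
total = model_params(78_595, 100, 256, 41, 2, True)
print(f"{total:,}")
```

Under these assumptions the embedding table accounts for roughly 7.9 million of the parameters, and the two LSTM branches for about 4.6 million.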


Next, we replace the initial weights of the embedding layer with the pre-trained embeddings.

In [None]:
pretrained_embeddings = TEXT.vocab.vectors

In [None]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.9303, -0.5822,  1.4222,  ...,  0.3795, -0.5287, -1.5877],
        [ 1.7707, -0.9132, -0.1961,  ..., -0.0253,  1.5668,  0.7579],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 1.1114,  0.1034,  1.8841,  ..., -0.1761, -0.3233, -0.1203],
        [-1.7324, -1.2718,  0.4066,  ...,  1.4167,  1.2657, -0.6771],
        [ 0.1900, -0.0299,  0.7712,  ...,  1.9195,  1.1134, -0.4383]],
       device='cuda:0')

**Training the Model**

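We choose the Adam algorithm as our optimizer. For the loss function, cross-entropy is the usual choice for classification with multiple categories; the cell below also lets us select focal loss, dice loss, or negative log-likelihood, and the runs recorded here use focal loss with per-class weights. We also define a function to calculate the accuracy of our predictions.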

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

In [None]:
# https://stackoverflow.com/questions/66074684/runtimeerror-expected-scalar-type-double-but-found-float-in-pytorch-cnn-train
# https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html?highlight=crossentropy#torch.nn.CrossEntropyLoss

from DiceLoss import DiceLoss
from focal_loss import FocalLoss
import weights

losstype = "focal"
weighttype = "isns"

# ins
# isns
# classbal
# simple

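# Per-class weights derived from the sorted training-set label frequencies.
w = weights.weightweight(freqs,type=weighttype)
w = w.to(device)

if losstype == "focal":
    criterion = FocalLoss(weight=w, gamma=1) 
elif losstype == "dice":
    criterion = DiceLoss()
elif losstype == "crossentropy":
    criterion = nn.CrossEntropyLoss(weight=w)
elif losstype == "nll":
    # nn.NLLLoss expects log-probabilities; the model returns raw logits,
    # so a log-softmax would have to be applied to the predictions first.
    criterion = nn.NLLLoss(weight=w)
else:
    raise ValueError(f"Unknown losstype: {losstype}")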

model = model.to(device)
criterion = criterion.to(device)

print(w)

tensor([0.0066, 0.0089, 0.0095, 0.0120, 0.0122, 0.0128, 0.0146, 0.0151, 0.0151,
        0.0155, 0.0165, 0.0171, 0.0177, 0.0184, 0.0191, 0.0196, 0.0198, 0.0203,
        0.0204, 0.0205, 0.0205, 0.0226, 0.0230, 0.0235, 0.0236, 0.0237, 0.0251,
        0.0256, 0.0258, 0.0260, 0.0262, 0.0287, 0.0306, 0.0319, 0.0320, 0.0323,
        0.0328, 0.0350, 0.0360, 0.0370, 0.0378], device='cuda:0')
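
The `FocalLoss` class is imported from a local module that is not shown here. As a reference point, the sketch below computes the standard focal loss for a single example, assuming the local class follows the usual formulation, -w_t * (1 - p_t)^gamma * log(p_t). The function names and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def focal_loss(logits, target, class_weights=None, gamma=1.0):
    """Focal loss for one example: -w_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    w_t = 1.0 if class_weights is None else class_weights[target]
    return -w_t * (1.0 - p_t) ** gamma * math.log(p_t)

logits = [2.0, 0.5, -1.0]
ce = focal_loss(logits, target=0, gamma=0.0)  # gamma = 0 reduces to cross-entropy
fl = focal_loss(logits, target=0, gamma=1.0)  # confident predictions are down-weighted
print(fl < ce)
```

With `gamma=1`, as used above, the loss of each example is scaled by how far its predicted probability for the true class is from 1, so well-classified examples contribute less than they would under cross-entropy.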


In [None]:
def categorical_accuracy(preds, y):
    max_preds = preds.argmax(dim = 1, keepdim = True).to(device)
    correct = max_preds.squeeze(1).eq(y).to(device)
    return correct.sum() / torch.FloatTensor([y.shape[0]]).to(device)

Here we define the training and evaluation functions for our model.

In [None]:
!pip install torchmetrics
from torchmetrics.functional import accuracy, f1, precision
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()

    # Get the progress bar for later modification
    # progress_bar = tqdm_notebook(iterator, ascii=True)
    for idx, batch in enumerate(iterator):
        
        optimizer.zero_grad()
                        
        predictions = model(batch.headline, batch.desc)
        predictions = predictions.squeeze(1)
        
        loss = criterion(predictions, batch.category)
        
        acc = categorical_accuracy(predictions, batch.category)
        my_acc = accuracy(predictions, batch.category)
        my_f1 = f1(predictions, batch.category, num_classes=41)
        my_prec = precision(predictions, batch.category, average='micro')

        loss.backward()
        
        optimizer.step()
        
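        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    # Note: the loss is averaged over the epoch, but my_acc, my_f1 and my_prec
    # hold the values computed on the final batch only.
    return epoch_loss / len(iterator), my_acc, my_f1, my_prec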



In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:
            
            predictions = model(batch.headline, batch.desc).squeeze(1)
            
            loss = criterion(predictions, batch.category)
            
            acc = categorical_accuracy(predictions, batch.category)
            my_acc = accuracy(predictions, batch.category)
            my_f1 = f1(predictions, batch.category, num_classes=41)
            my_prec = precision(predictions, batch.category, average='micro')

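            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    # Note: the loss is averaged over all batches, but my_acc, my_f1 and my_prec
    # hold the values computed on the final batch only.
    return epoch_loss / len(iterator), my_acc, my_f1, my_prec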
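
Both functions above return the accuracy, F1, and precision computed on the last batch of the iterator, while only the loss is averaged over all batches. If an epoch-level accuracy is wanted instead, one option is to accumulate the counts across batches. The sketch below illustrates this in plain Python with hypothetical predictions; `epoch_accuracy` is not part of the notebook.

```python
def epoch_accuracy(batches):
    """Accuracy over all examples, given (predicted, actual) label lists per batch."""
    correct = 0
    seen = 0
    for predicted, actual in batches:
        correct += sum(p == a for p, a in zip(predicted, actual))
        seen += len(actual)
    return correct / seen

# Hypothetical predictions: two full batches and a short final batch.
batches = [
    ([0, 1, 2, 2], [0, 1, 2, 1]),  # 3 of 4 correct
    ([1, 1, 0, 2], [1, 0, 0, 2]),  # 3 of 4 correct
    ([2, 2], [0, 1]),              # 0 of 2 correct
]
last_pred, last_true = batches[-1]
last_batch_only = sum(p == a for p, a in zip(last_pred, last_true)) / len(last_true)
print(epoch_accuracy(batches), last_batch_only)
```

Counting correct predictions and examples, rather than averaging per-batch accuracies, also avoids over-weighting a short final batch.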


In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs


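Now we are ready to train our model. We will train it for five epochs. Keep in mind that train and evaluate return the accuracy of their final batch, so the accuracies printed for each epoch are last-batch values rather than epoch averages.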

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc, train_f1, train_prec = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc, valid_f1, valid_prec = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_epoch = epoch
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'news_classification_model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')



Epoch: 01 | Epoch Time: 2m 4s
	Train Loss: 0.438 | Train Acc: 75.00%
	 Val. Loss: 1.694 |  Val. Acc: 69.81%
Epoch: 02 | Epoch Time: 2m 4s
	Train Loss: 0.278 | Train Acc: 82.81%
	 Val. Loss: 1.859 |  Val. Acc: 67.92%
Epoch: 03 | Epoch Time: 2m 4s
	Train Loss: 0.234 | Train Acc: 84.38%
	 Val. Loss: 1.952 |  Val. Acc: 75.47%
Epoch: 04 | Epoch Time: 2m 4s
	Train Loss: 0.205 | Train Acc: 90.62%
	 Val. Loss: 2.051 |  Val. Acc: 69.81%
Epoch: 05 | Epoch Time: 2m 4s
	Train Loss: 0.180 | Train Acc: 93.75%
	 Val. Loss: 2.119 |  Val. Acc: 71.70%
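
The training loop saves a checkpoint only when the validation loss strictly improves. Replaying that rule on the validation losses logged above shows which epoch's weights end up in the saved file. The helper below is a hypothetical stand-alone sketch of the same comparison.

```python
def best_checkpoint_epoch(valid_losses):
    """Return (epoch, loss) for the last strict improvement, using 0-based epochs."""
    best_loss = float("inf")
    best_epoch = None
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:  # strict improvement, as in the training loop
            best_loss, best_epoch = loss, epoch
    return best_epoch, best_loss

# Validation losses copied from the log above.
logged = [1.694, 1.859, 1.952, 2.051, 2.119]
print(best_checkpoint_epoch(logged))
```

Because the validation loss rises after the first epoch while the training loss keeps falling, the saved checkpoint is the one from epoch index 0, which points to overfitting in the later epochs.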


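Next we evaluate on the test set. Note that the cell below evaluates the model as it stands after the final epoch: the checkpoint with the best validation loss was saved to news_classification_model.pt but is not reloaded here. The reported accuracy, F1, and precision are those of the last test batch, since evaluate returns the metrics of its final batch.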

In [None]:
# accuracy(preds, target)

# test_loss, test_acc = evaluate(model, test_iterator, criterion)

# print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

print(best_epoch)
test_loss, test_acc, my_test_f1, my_test_prec = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} '
      f'| Test Acc: {test_acc*100:.2f}'
      f'| Test F1: {my_test_f1*100:.2f} | Test Prec: {my_test_prec*100:.2f}%')

0



Test Loss: 2.120 | Test Acc: 58.14| Test F1: 58.14 | Test Prec: 58.14%


In [None]:
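# Second run: switch to the "simple" weighting scheme.
# The model and optimizer are not re-initialized, so training continues
# from the weights reached at the end of the previous run.
w = weights.weightweight(freqs,type="simple")
w = w.to(device)

criterion = FocalLoss(weight=w, gamma=1) 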

N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc, train_f1, train_prec = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc, valid_f1, valid_prec = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_epoch = epoch
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'news_classification_model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01 | Epoch Time: 2m 4s
	Train Loss: 15.612 | Train Acc: 85.94%
	 Val. Loss: 105.996 |  Val. Acc: 71.70%
Epoch: 02 | Epoch Time: 2m 3s
	Train Loss: 9.603 | Train Acc: 92.19%
	 Val. Loss: 116.752 |  Val. Acc: 75.47%
Epoch: 03 | Epoch Time: 2m 3s
	Train Loss: 8.255 | Train Acc: 92.19%
	 Val. Loss: 120.578 |  Val. Acc: 69.81%
Epoch: 04 | Epoch Time: 2m 3s
	Train Loss: 7.445 | Train Acc: 95.31%
	 Val. Loss: 124.563 |  Val. Acc: 75.47%
Epoch: 05 | Epoch Time: 2m 4s
	Train Loss: 6.668 | Train Acc: 95.31%
	 Val. Loss: 127.693 |  Val. Acc: 75.47%


In [None]:
print(best_epoch)
test_loss, test_acc, my_test_f1, my_test_prec = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} '
      f'| Test Acc: {test_acc*100:.2f}'
      f'| Test F1: {my_test_f1*100:.2f} | Test Prec: {my_test_prec*100:.2f}%')

0



Test Loss: 127.536 | Test Acc: 67.44| Test F1: 67.44 | Test Prec: 67.44%


**User Input**

In this section, we supply our own input and let the model predict the category of news from outside the dataset. For consistency, we use articles from the Huffington Post. Make sure that the first input is the headline and the second input is the short description of the article.

News can be obtained from [here](https://www.huffpost.com/).

In [None]:
import spacy
# 'en' is the spaCy 2.x shortcut; on spaCy 3 and later, load 'en_core_web_sm' instead.
nlp = spacy.load('en')

def predict_category(model, head, desc):
    model.eval()
    head = head.lower()
    desc = desc.lower()
    tokenized_head = [tok.text for tok in nlp.tokenizer(head)]
    tokenized_desc = [tok.text for tok in nlp.tokenizer(desc)]
    indexed_head = [TEXT.vocab.stoi[t] for t in tokenized_head]
    indexed_desc = [TEXT.vocab.stoi[t] for t in tokenized_desc]
    tensor_head = torch.LongTensor(indexed_head).to(device)
    tensor_desc = torch.LongTensor(indexed_desc).to(device)
    tensor_head = tensor_head.unsqueeze(1)
    tensor_desc = tensor_desc.unsqueeze(1)
    prediction = model(tensor_head, tensor_desc)
    max_pred = prediction.argmax(dim=1)
    return max_pred.item()
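
`predict_category` returns only the index of the highest-scoring class. When inspecting predictions, it can be useful to look at the runners-up as well. The sketch below applies a softmax to a list of logits and returns the top candidates. The function, the logits, and the four label names are hypothetical and independent of the trained model.

```python
import math

def top_k_categories(logits, labels, k=3):
    """Softmax the logits and return the k most probable (label, probability) pairs."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda lp: lp[1], reverse=True)
    return ranked[:k]

# Hypothetical logits and label names for a four-class toy problem.
labels = ["POLITICS", "SPORTS", "COMEDY", "TRAVEL"]
logits = [2.1, 0.3, 1.4, -0.7]
for label, prob in top_k_categories(logits, labels, k=2):
    print(f"{label}: {prob:.2f}")
```

With the real model, the logits would come from `model(tensor_head, tensor_desc)` and the label names from `LABEL.vocab.itos`.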

News headline: Trump’s Art Of Distraction

News short description: The conversation surrounding Trump’s latest racist rants has provoked us to revisit author Toni Morrison’s 1975 keynote address at Portland State University on the true purpose of racism.

Correct category: Politics

In [None]:
pred = predict_category(model, "Trump’s Art Of Distraction", "The conversation surrounding Trump’s latest racist rants has provoked us to revisit author Toni Morrison’s 1975 keynote address at Portland State University on the true purpose of racism.")
print(f'Predicted category is: {pred} = {LABEL.vocab.itos[pred]}')

News headline: Indiana Cop Apologizes After Accusing McDonald’s Worker Of Eating His Sandwich

News short description: The Marion County sheriff’s deputy forgot he had taken a bite out of his McChicken earlier that day, authorities said.

Correct category: U.S. News

In [None]:
pred = predict_category(model, "Indiana Cop Apologizes After Accusing McDonald’s Worker Of Eating His Sandwich", "The Marion County sheriff’s deputy forgot he had taken a bite out of his McChicken earlier that day, authorities said.")
print(f'Predicted category is: {pred} = {LABEL.vocab.itos[pred]}')

News headline: Kyle ‘Bugha’ Giersdorf, 16, Wins Fortnite World Cup And Takes Home $ 3 Million Prize

News short description: Fortnite has nearly 250 million registered players and raked in an estimated $2.4 billion last year.

Correct category: Sports

In [None]:
pred = predict_category(model, "Kyle ‘Bugha’ Giersdorf, 16, Wins Fortnite World Cup And Takes Home $ 3 Million Prize", "Fortnite has nearly 250 million registered players and raked in an estimated $2.4 billion last year.")
print(f'Predicted category is: {pred} = {LABEL.vocab.itos[pred]}')

**References**

This notebook was created thanks to the two references below.
* http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/
* https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/2%20-%20Upgraded%20Sentiment%20Analysis.ipynb