# **Natural Language Processing Using a Convolutional Neural Network for the Text Retrieval Conference (TREC) Question Classification Dataset with PyTorch**


## **Alexander Sepenu**

##### **This code is made available to beginners in Natural Language Processing to support their learning. You are welcome to adapt this reproducible code for your own use; I hope you find it useful as you continue your Data Science or Machine Learning journey.**

In [None]:
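# Install older versions of PyTorch (1.8.0) and torchtext (0.9.0), which still provide the legacy API used to load the dataset
!pip install -U torch==1.8.0 torchtext==0.9.0

# Restart the runtime so the newly installed versions are picked up
exit()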

In [1]:
# Required Libraries 
import torch
from torchtext.legacy import data, datasets
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random


# **Preprocessing Dataset**

In [2]:
# Define the fields: TEXT tokenizes questions with spaCy and lowercases them; LABEL holds the class label
TEXT = data.Field(tokenize = 'spacy', lower = True)
LABEL = data.LabelField()

In [3]:
# Set the random seed and select the device (GPU if available, otherwise CPU)
# In Colab, change the runtime type to GPU before running this code
seed = 1234
torch.manual_seed(seed)

colab_ram = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(colab_ram)

cuda


In [4]:
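# Split the data into train, validation and test sets
train, test = datasets.TREC.splits(TEXT, LABEL)
# random.seed() returns None, so seed first and then pass the seeded RNG state explicitly
random.seed(seed)
train, val = train.split(random_state=random.getstate())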

downloading train_5500.label


train_5500.label: 100%|██████████| 336k/336k [00:00<00:00, 3.01MB/s]


downloading TREC_10.label


TREC_10.label: 100%|██████████| 23.4k/23.4k [00:00<00:00, 889kB/s]


In [5]:
# Inspect one training example to check the label and text structure
vars(train[-1])

{'label': 'DESC', 'text': ['what', 'is', 'a', 'cartesian', 'diver', '?']}

In [7]:
# Build the vocabularies from the training set; TEXT keeps only words that appear at least twice
TEXT.build_vocab(train, min_freq = 2)
LABEL.build_vocab(train)

In [8]:
# Checking the size of the text and label vocabs 
print("Vocabulary size of TEXT:",len(TEXT.vocab.stoi))
print("Vocabulary size of LABEL:",len(LABEL.vocab.stoi))

Vocabulary size of TEXT: 2700
Vocabulary size of LABEL: 6
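
To make the `min_freq = 2` setting concrete, here is a minimal sketch of frequency-thresholded vocabulary building using only the standard library. It is not torchtext's implementation: the helper `build_toy_vocab` and the toy questions are hypothetical, and placing `<unk>` and `<pad>` at indices 0 and 1 is an assumption meant to mirror the default special tokens of a torchtext `Field`.

```python
from collections import Counter

def build_toy_vocab(tokenized_questions, min_freq=2):
    """Map each token seen at least `min_freq` times to an integer index."""
    counts = Counter(tok for question in tokenized_questions for tok in question)
    itos = ['<unk>', '<pad>']   # assumed special tokens: unknown word and padding
    # Most frequent first; ties broken alphabetically so the result is deterministic
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

toy_questions = [
    ['what', 'is', 'a', 'cartesian', 'diver', '?'],
    ['what', 'is', 'a', 'prism', '?'],
    ['who', 'is', 'a', 'diver', '?'],
]
stoi, itos = build_toy_vocab(toy_questions, min_freq=2)
print(itos)

# A word below the threshold is not in the vocabulary and falls back to <unk>
rare_index = stoi.get('prism', stoi['<unk>'])
print(rare_index)
```

Words that appear only once ('cartesian', 'prism', 'who' in this toy data) share the `<unk>` index, which keeps the embedding table small at the cost of losing rare words.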


In [9]:
# Set up iterators for the train, validation and test sets with batch size 64 on the selected device (GPU when available)
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits((train, val, test),
    batch_size = 64,
    sort_key=lambda x: len(x.text), 
    device = colab_ram)
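
Batching questions of similar length keeps padding small, which is the idea behind `BucketIterator` and its `sort_key`. The sketch below is a simplified illustration in plain Python, not torchtext's actual procedure (which also involves shuffling); `pad_batches`, `count_padding` and the toy examples are hypothetical.

```python
def pad_batches(examples, batch_size, pad_token='<pad>'):
    """Split examples into consecutive batches and pad each batch to its longest example."""
    batches = []
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        longest = max(len(ex) for ex in batch)
        batches.append([ex + [pad_token] * (longest - len(ex)) for ex in batch])
    return batches

def count_padding(batches, pad_token='<pad>'):
    """Count how many padding tokens were added across all batches."""
    return sum(tok == pad_token for batch in batches for ex in batch for tok in ex)

# Toy "questions" of lengths 3, 9, 4, 10, 3 and 8 tokens
toy_examples = [['a'] * 3, ['b'] * 9, ['c'] * 4, ['d'] * 10, ['e'] * 3, ['f'] * 8]

unsorted_padding = count_padding(pad_batches(toy_examples, batch_size=2))
bucketed_padding = count_padding(pad_batches(sorted(toy_examples, key=len), batch_size=2))
print(unsorted_padding, bucketed_padding)
```

Sorting by length before batching pairs short questions with short ones, so fewer `<pad>` tokens pass through the network.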

# **Convolutional Neural Network Model Building**

In [10]:
# Define the layers of the convolutional network
class CNN(nn.Module):
  def __init__(self, vocabulary_size, embedding_size, 
               kernels_number, kernel_sizes, output_size, dropout_rate):
    super().__init__()
    self.embedding = nn.Embedding(vocabulary_size, embedding_size)
    # One convolution per kernel size; each filter spans k consecutive tokens and the full embedding width
    self.convolution_layers = nn.ModuleList([nn.Conv2d(in_channels=1, 
                                                       out_channels=kernels_number, kernel_size=(k, embedding_size))
                                                       for k in kernel_sizes])
    self.dropout = nn.Dropout(dropout_rate)             # Dropout layer to reduce overfitting
    self.fully_connected = nn.Linear(len(kernel_sizes) * kernels_number, output_size)

  # The forward pass chains the layers above and returns one score (logit) per class for each question
  def forward(self, text):
    text = text.permute(1,0)                               # [seq_len, batch] -> [batch, seq_len]
    input_embeddings = self.embedding(text)                # Embedding layer maps token indices to dense vectors
    input_embeddings = input_embeddings.unsqueeze(1)       # Add a channel dimension for Conv2d
    conved = [F.relu(convolution_layer(input_embeddings)).squeeze(3)
              for convolution_layer in self.convolution_layers]
    pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]   # Max over all positions
    concat = self.dropout(torch.cat(pooled, dim = 1))
    final_output = self.fully_connected(concat)

    return final_output
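
The tensor shapes inside `forward` can be easier to follow when traced by hand. The sketch below does that arithmetic in plain Python, assuming stride 1 and no padding (the `nn.Conv2d` defaults); `trace_shapes` is a hypothetical helper and does not run the model.

```python
def trace_shapes(batch_size, seq_len, embedding_size, kernels_number, kernel_sizes, output_size):
    """Return the shape after each step of the forward pass as a list of (step, shape) pairs."""
    if seq_len < max(kernel_sizes):
        raise ValueError('the sequence must be at least as long as the largest kernel')
    shapes = [
        ('input', (seq_len, batch_size)),
        ('permute', (batch_size, seq_len)),
        ('embedding', (batch_size, seq_len, embedding_size)),
        ('unsqueeze', (batch_size, 1, seq_len, embedding_size)),
    ]
    for k in kernel_sizes:
        positions = seq_len - k + 1   # valid positions for a kernel spanning k tokens
        shapes.append((f'conv k={k}', (batch_size, kernels_number, positions)))
        shapes.append((f'max-pool k={k}', (batch_size, kernels_number)))
    shapes.append(('concat', (batch_size, kernels_number * len(kernel_sizes))))
    shapes.append(('fully connected', (batch_size, output_size)))
    return shapes

shapes = trace_shapes(batch_size=64, seq_len=10, embedding_size=100,
                      kernels_number=100, kernel_sizes=[2, 3, 4], output_size=6)
for step, shape in shapes:
    print(f'{step:16s} {shape}')
```

A question shorter than the largest kernel (4 tokens here) leaves no valid position for that convolution, which is why very short inputs need padding before they reach the model.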

In [11]:
# Define the convolutional neural network hyperparameters

vocabulary_size = len(TEXT.vocab)   # Number of rows in the embedding table (2700 here)
embedding_size = 100                # Dimension of embedding layer
kernels_number = 100                # Number of filters per kernel size
kernel_sizes = [2, 3, 4]            # Filter sizes (number of consecutive tokens each filter spans)
output_size = len(LABEL.vocab)      # Number of classes (6 here)
dropout_rate = 0.8                  # Dropout rate

In [12]:
# Instantiate the CNN model with these hyperparameters
cnn_model = CNN(vocabulary_size, embedding_size, kernels_number, kernel_sizes, output_size, dropout_rate)

In [13]:
# Move the model to the selected device (GPU when available)
cnn_model.to(colab_ram)

CNN(
  (embedding): Embedding(2700, 100)
  (convolution_layers): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(2, 100), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (2): Conv2d(1, 100, kernel_size=(4, 100), stride=(1, 1))
  )
  (dropout): Dropout(p=0.8, inplace=False)
  (fully_connected): Linear(in_features=300, out_features=6, bias=True)
)
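
The printed architecture is enough to count the trainable parameters by hand. The arithmetic below is a quick check in plain Python using the layer sizes from the printout; the variable names are for illustration only.

```python
vocabulary_size, embedding_size = 2700, 100
kernels_number, kernel_sizes, output_size = 100, [2, 3, 4], 6

# Embedding table: one vector of size embedding_size per vocabulary entry
embedding_params = vocabulary_size * embedding_size

# Each Conv2d has kernels_number filters of size k x embedding_size over 1 input channel,
# plus one bias per filter
conv_params = sum(kernels_number * (1 * k * embedding_size) + kernels_number
                  for k in kernel_sizes)

# The linear layer maps the 300 pooled features to 6 classes, plus one bias per class
linear_params = len(kernel_sizes) * kernels_number * output_size + output_size

total_params = embedding_params + conv_params + linear_params
print(embedding_params, conv_params, linear_params, total_params)
```

Most of the parameters sit in the embedding table. With the model in memory, `sum(p.numel() for p in cnn_model.parameters() if p.requires_grad)` should report the same total.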

# **Model Evaluation**

In [14]:
# Define the loss criterion and the optimizer
eval_criterion = nn.CrossEntropyLoss()         # Cross-entropy loss for multi-class classification
eval_criterion = eval_criterion.to(colab_ram)   # Move the criterion to the selected device

eval_optimizer = optim.Adam(cnn_model.parameters())  # Adam optimizer with default settings
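
For intuition about the loss values the training loop reports, here is a small worked example of cross-entropy on raw scores (logits) in plain Python. It follows the quantity `nn.CrossEntropyLoss` computes for a single example (log-softmax followed by negative log-likelihood); `cross_entropy` is a hypothetical helper, not the PyTorch implementation.

```python
import math

def cross_entropy(logits, label):
    """Negative log of the softmax probability assigned to the true class."""
    max_logit = max(logits)   # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]

# A model with no preference among 6 classes scores them all equally
uniform_loss = cross_entropy([0.0] * 6, label=2)
print(round(uniform_loss, 4))

# A confident, correct prediction gives a much smaller loss
confident_loss = cross_entropy([0.0, 0.0, 5.0, 0.0, 0.0, 0.0], label=2)
print(round(confident_loss, 4))
```

The uniform case equals ln 6 (about 1.79), so a loss near that value corresponds to chance-level predictions over the six TREC classes.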

In [15]:
# Accuracy: the fraction of predictions whose highest-scoring class matches the actual label
def accuracy(predictions, actual_label):
    max_predictions = predictions.argmax(dim = 1, keepdim = True)
    correct_predictions = max_predictions.squeeze(1).eq(actual_label)
    # Plain float division works on both CPU and GPU (torch.cuda.FloatTensor fails without CUDA)
    batch_accuracy = correct_predictions.sum().float() / actual_label.shape[0]
    return batch_accuracy

In [16]:
# Define the training function: one pass over the training set in batches
def train(cnn_model, iterator, eval_optimizer, eval_criterion):

    cnn_model.train()
    epoch_loss = 0
    epoch_acc = 0
    
    for batch in iterator:
        eval_optimizer.zero_grad()
        
        predictions = cnn_model(batch.text)
        
        loss = eval_criterion(predictions, batch.label)
        
        acc = accuracy(predictions, batch.label)
        
        loss.backward()
        
        eval_optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [17]:
# Categorical accuracy: the fraction of predictions whose top class matches the actual label (used during evaluation)
def categorical_accuracy(preds, y):
    top_pred = preds.argmax(1, keepdim = True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    acc = correct.float() / y.shape[0]
    return acc

In [18]:
# Define the evaluation function, which makes predictions without updating the weights and measures loss and accuracy
def evaluate(cnn_model, iterator, eval_criterion):

    cnn_model.eval()
    epoch_loss = 0
    epoch_acc = 0
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = cnn_model(batch.text)
            
            loss = eval_criterion(predictions, batch.label)
            
            acc = categorical_accuracy(predictions, batch.label)
           
            epoch_loss += loss.item()
            
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
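
Both `train` and `evaluate` report the mean of per-batch accuracies. When the last batch is smaller than the others, that mean can differ slightly from the accuracy over all examples. The sketch below shows the difference on hypothetical per-batch counts (the numbers are made up for illustration).

```python
# (correct predictions, batch size) for each batch; the last batch is smaller
toy_batches = [(48, 64), (52, 64), (2, 8)]

# What the functions above report: the average of per-batch accuracies
mean_of_batch_accuracies = sum(correct / size for correct, size in toy_batches) / len(toy_batches)

# Accuracy over all examples: total correct divided by total examples
overall_accuracy = sum(correct for correct, _ in toy_batches) / sum(size for _, size in toy_batches)

print(f'{mean_of_batch_accuracies:.4f} {overall_accuracy:.4f}')
```

When all batches have the same size the two numbers agree; weighting each batch by its size removes the gap otherwise.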

# **Training the CNN Model**

In [19]:
number_of_epochs = 10

best_acc = float('-inf')

for epoch in range(number_of_epochs):
    
    train_loss, train_acc = train(cnn_model, train_iterator, eval_optimizer, eval_criterion)
    valid_loss, valid_acc = evaluate(cnn_model, valid_iterator, eval_criterion)
    
    if valid_acc > best_acc:
        best_acc = valid_acc
        torch.save(cnn_model.state_dict(), 'trec.pt')
    
    print(f'Epoch {epoch+1} ')
    print(f'\tTrain Loss: {train_loss:.4f} | Train Acc: {train_acc*100:.4f}%')
    print(f'\t Validation Loss: {valid_loss:.3f} |  Validation Acc: {valid_acc*100:.4f}%')

Epoch 1 
	Train Loss: 1.7619 | Train Acc: 30.5833%
	 Validation Loss: 1.177 |  Validation Acc: 58.5136%
Epoch 2 
	Train Loss: 1.2546 | Train Acc: 51.7396%
	 Validation Loss: 0.964 |  Validation Acc: 66.3929%
Epoch 3 
	Train Loss: 1.0641 | Train Acc: 59.3073%
	 Validation Loss: 0.861 |  Validation Acc: 69.0104%
Epoch 4 
	Train Loss: 0.9461 | Train Acc: 63.5417%
	 Validation Loss: 0.786 |  Validation Acc: 71.4476%
Epoch 5 
	Train Loss: 0.8685 | Train Acc: 67.0885%
	 Validation Loss: 0.747 |  Validation Acc: 72.6629%
Epoch 6 
	Train Loss: 0.8036 | Train Acc: 70.5573%
	 Validation Loss: 0.721 |  Validation Acc: 73.9450%
Epoch 7 
	Train Loss: 0.7404 | Train Acc: 72.7135%
	 Validation Loss: 0.697 |  Validation Acc: 74.9332%
Epoch 8 
	Train Loss: 0.6624 | Train Acc: 76.2708%
	 Validation Loss: 0.651 |  Validation Acc: 77.5240%
Epoch 9 
	Train Loss: 0.6346 | Train Acc: 77.4792%
	 Validation Loss: 0.633 |  Validation Acc: 77.2970%
Epoch 10 
	Train Loss: 0.6002 | Train Acc: 78.3854%
	 Validation

**Because the model may have overfit the training and validation sets, let's measure its accuracy on the held-out test set.**

In [20]:
cnn_model.load_state_dict(torch.load('trec.pt'))

test_loss, test_acc = evaluate(cnn_model, test_iterator, eval_criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.4f}%')

Test Loss: 0.543 | Test Acc: 82.8425%


In [21]:
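import spacy
nlp = spacy.load('en_core_web_sm')

# Predict the class index for a single sentence
def predict_class(cnn_model, sentence, min_len = 4):
    cnn_model.eval()
    # Lowercase tokens to match the TEXT field (lower = True); otherwise capitalized words map to <unk>
    tokenized = [tok.text.lower() for tok in nlp.tokenizer(sentence)]
    # Pad up to the largest kernel size so every convolution has at least one valid position
    if len(tokenized) < min_len:
        tokenized += ['<pad>'] * (min_len - len(tokenized))
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(colab_ram)
    tensor = tensor.unsqueeze(1)                     # [seq_len] -> [seq_len, 1] (a batch of one)
    with torch.no_grad():                            # No gradients are needed for inference
        preds = cnn_model(tensor)
    max_preds = preds.argmax(dim = 1)
    return max_preds.item()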
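
The steps before the model call in `predict_class` (tokenize, pad, look up indices) can be tried in isolation. The sketch below uses a whitespace split in place of spaCy and a hypothetical toy vocabulary in place of `TEXT.vocab`. Since the `TEXT` field was built with `lower = True`, the lookup lowercases tokens so that they match the vocabulary.

```python
# Hypothetical toy vocabulary standing in for TEXT.vocab.stoi
toy_stoi = {'<unk>': 0, '<pad>': 1, 'where': 2, 'is': 3, 'my': 4, '?': 5}

def numericalize(sentence, stoi, min_len=4):
    """Turn a sentence into a list of vocabulary indices, padded to at least min_len."""
    tokens = sentence.lower().split()   # stand-in for the spaCy tokenizer
    if len(tokens) < min_len:
        tokens += ['<pad>'] * (min_len - len(tokens))
    # Words missing from the vocabulary map to the <unk> index
    return [stoi.get(tok, stoi['<unk>']) for tok in tokens]

long_enough = numericalize('Where is my burger ?', toy_stoi)
too_short = numericalize('Where is', toy_stoi)
print(long_enough)
print(too_short)
```

Without the lowercasing step, 'Where' would not match the lowercase entry 'where' and would be looked up as `<unk>`.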

### **Testing how well the model predicts a label with a text input**

In [22]:
pred_class = predict_class(cnn_model, "How many zeros are in a thousand")
print(f'Predicted Label is: {LABEL.vocab.itos[pred_class]}')

Predicted Label is: NUM


In [23]:
pred_class = predict_class(cnn_model, "Everest is the tallest mountain in the world")
print(f'Predicted Label is: {LABEL.vocab.itos[pred_class]}')

Predicted Label is: LOC


In [24]:
pred_class = predict_class(cnn_model, "Where is my Burger?")
print(f'Predicted Label is: {LABEL.vocab.itos[pred_class]}')

Predicted Label is: DESC


# **Conclusion**
In conclusion, varying the model's parameters produced different accuracies. One notable result: with **dropout_rate = 0.8** and **number_of_epochs = 10**, the accuracy jumped to **88.8371%**. Perhaps, as more data becomes available, other parameter settings could yield better results.
Due to computational limitations, a grid search to tune the parameters and find the best model was not performed. I hope that, as more resources become available, this and other methods will be explored.
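
As a starting point for the grid search mentioned above, candidate settings can be enumerated with `itertools.product`. This is only a sketch: the candidate values are hypothetical, and each combination would still need to be trained and scored on the validation set (for example with the `train` and `evaluate` functions defined earlier).

```python
from itertools import product

# Hypothetical candidate values; adjust them to the available compute budget
search_space = {
    'dropout_rate': [0.3, 0.5, 0.8],
    'kernels_number': [50, 100],
    'embedding_size': [50, 100],
}

# Every combination of one value per hyperparameter, as a list of dictionaries
names = list(search_space)
grid = [dict(zip(names, values)) for values in product(*search_space.values())]

print(len(grid))
print(grid[0])
```

The grid grows multiplicatively with each added hyperparameter, so the number of training runs is the main cost to budget for.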