# **Natural Language Processing Using a Convolutional Neural Network for the Text Retrieval Conference (TREC) Question Classification Dataset with PyTorch**


## **Alexander Sepenu**

##### **This code is made available to beginners in Natural Language Processing to support their learning. Feel free to adapt this reproducible code for your own use; I hope you find it useful as you continue your Data Science or Machine Learning journey.**

In [None]:
# Install an older PyTorch/torchtext pair (torch 1.8.0, torchtext 0.9.0) so that the legacy dataset API is available
!pip install -U torch==1.8.0 torchtext==0.9.0

# Restart the runtime so the newly installed versions are picked up
exit()

In [None]:
# Required Libraries 
import torch
from torchtext.legacy import data, datasets
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random


# **Preprocessing Dataset**

In [None]:
# Define the fields: TEXT tokenizes questions with spaCy and lowercases them; LABEL holds the question class
TEXT = data.Field(tokenize = 'spacy', lower = True)
LABEL = data.LabelField()

In [None]:
# Set the random seed and select the compute device (GPU if available, otherwise CPU)
# In Colab, change the runtime type to GPU before running this notebook
seed = 1234
torch.manual_seed(seed)

colab_ram = torch.device('cuda' if torch.cuda.is_available() else 'cpu')   # a compute device, not a RAM size
print(colab_ram)

cuda


In [None]:
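# Split the data into train, validation and test sets
train, test = datasets.TREC.splits(TEXT, LABEL)
# random.seed() returns None, so seed the generator first and pass its state to split()
random.seed(seed)
train, val = train.split(random_state=random.getstate())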
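As a rough illustration of what a seeded random split does, the sketch below shuffles a toy list with a private `random.Random` generator and cuts it 70/30, which I understand to be the default ratio of the legacy `split` method. The helper `seeded_split` and the toy data are hypothetical and are not part of torchtext.

```python
import random


def seeded_split(examples, split_ratio=0.7, seed=1234):
    """Shuffle a copy of `examples` reproducibly and cut it into two parts."""
    rng = random.Random(seed)        # private generator; global random state is untouched
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


toy = list(range(10))
train_part, val_part = seeded_split(toy)
print(len(train_part), len(val_part))            # 7 3
print(seeded_split(toy) == seeded_split(toy))    # True: same seed, same split
```

Because the generator is seeded, rerunning the cell gives the same train and validation examples, which keeps the reported accuracies comparable between runs.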

In [None]:
# Inspect the last training example to check the label and text structure
vars(train[-1])

{'label': 'DESC', 'text': ['what', 'is', 'a', 'cartesian', 'diver', '?']}

In [None]:
# Build the vocabularies from the training set; TEXT keeps only words that appear at least twice
TEXT.build_vocab(train, min_freq = 2)
LABEL.build_vocab(train)
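To make the effect of `min_freq = 2` concrete, here is a minimal pure-Python sketch of frequency-filtered vocabulary building. The helper `build_toy_vocab`, the ordering rule, and the two special tokens are assumptions for illustration; they mimic, but are not, torchtext's implementation.

```python
from collections import Counter


def build_toy_vocab(tokenized_questions, min_freq=2, specials=("<unk>", "<pad>")):
    """Map each sufficiently frequent token to an integer index."""
    counts = Counter(tok for question in tokenized_questions for tok in question)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    return {tok: index for index, tok in enumerate(itos)}


toy_questions = [["what", "is", "a", "cartesian", "diver", "?"],
                 ["what", "is", "gravity", "?"]]
stoi = build_toy_vocab(toy_questions)
unk = stoi["<unk>"]
print(stoi)   # {'<unk>': 0, '<pad>': 1, '?': 2, 'is': 3, 'what': 4}
print([stoi.get(tok, unk) for tok in ["what", "is", "a", "diver", "?"]])   # [4, 3, 0, 0, 2]
```

Words seen only once ("a", "diver") fall back to the unknown-token index, which keeps the embedding table small at the cost of losing rare words.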

In [None]:
# Checking the size of the text and label vocabs 
print("Vocabulary size of TEXT:",len(TEXT.vocab.stoi))
print("Vocabulary size of LABEL:",len(LABEL.vocab.stoi))

Vocabulary size of TEXT: 2700
Vocabulary size of LABEL: 6


In [None]:
# Set up the iterators for the train, validation and test sets with batch size 64, placing each batch on the selected device
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits((train, val, test),
    batch_size = 64,
    sort_key=lambda x: len(x.text), 
    device = colab_ram)
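The idea behind bucketing is to group questions of similar length so that little padding is needed. The sketch below is a simplified stand-in, not the `BucketIterator` algorithm itself (which, as I understand it, also shuffles). The helper `make_batches` is hypothetical, and `pad_id=1` is an assumption about the index of the padding token.

```python
def make_batches(token_id_lists, batch_size, pad_id=1):
    """Sort sequences by length, chunk them, and pad each chunk to its longest member."""
    ordered = sorted(token_id_lists, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        longest = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_id] * (longest - len(seq)) for seq in chunk])
    return batches


toy = [[5, 6, 7, 8, 9], [5, 6], [7, 8, 9], [4], [4, 5, 6, 7]]
for batch in make_batches(toy, batch_size=2):
    print(batch)
# [[4, 1], [5, 6]]
# [[7, 8, 9, 1], [4, 5, 6, 7]]
# [[5, 6, 7, 8, 9]]
```

Each batch needs at most one padding token here, whereas batching the toy list in its original order would pad `[5, 6]` up to length five.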

# **Convolutional Neural Network Model Building**

In [None]:
# Define the layers of the convolutional network
class CNN(nn.Module):
  def __init__(self, vocabulary_size, embedding_size, 
               kernels_number, kernel_sizes, output_size, dropout_rate):
    super().__init__()
    self.embedding = nn.Embedding(vocabulary_size, embedding_size)
    self.convolution_layers = nn.ModuleList([nn.Conv2d(in_channels=1, 
                                                       out_channels= kernels_number,kernel_size= (k, embedding_size))
                                                       for k in kernel_sizes])
    self.dropout = nn.Dropout(dropout_rate)             # Dropout layer here is to prevent overfitting
    self.fully_connected = nn.Linear(len(kernel_sizes) * kernels_number, output_size)

  # forward() wires the defined layers together and returns one score (logit) per class for each question
  def forward(self, text):
    text = text.permute(1,0)                               # [seq_len, batch] -> [batch, seq_len]
    input_embeddings = self.embedding(text)                # Embedding layer maps each token index to a dense vector
    input_embeddings = input_embeddings.unsqueeze(1)
    conved = [F.relu(convolution_layer(input_embeddings)).squeeze(3)
              for convolution_layer in self.convolution_layers]
    pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
    concat = self.dropout(torch.cat(pooled, dim = 1))
    final_output = self.fully_connected(concat)

    return final_output
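The tensor shapes in `forward` can be traced with plain arithmetic. The sketch below only computes shapes and runs no PyTorch code; the helper `cnn_shapes` is hypothetical and assumes stride 1 and no padding, as in the `Conv2d` layers defined above.

```python
def cnn_shapes(seq_len, batch_size, embedding_size, kernels_number, kernel_sizes, output_size):
    """Return the expected tensor shape after each step of CNN.forward."""
    shapes = {
        "text (permuted)": (batch_size, seq_len),
        "embedded": (batch_size, seq_len, embedding_size),
        "unsqueezed": (batch_size, 1, seq_len, embedding_size),
    }
    for k in kernel_sizes:
        # A kernel of height k spanning the full embedding width slides over seq_len - k + 1 positions
        shapes[f"conved k={k}"] = (batch_size, kernels_number, seq_len - k + 1)
        # Max-pooling over all positions keeps one value per filter
        shapes[f"pooled k={k}"] = (batch_size, kernels_number)
    shapes["concat"] = (batch_size, kernels_number * len(kernel_sizes))
    shapes["output"] = (batch_size, output_size)
    return shapes


for name, shape in cnn_shapes(10, 64, 100, 100, [2, 3, 4], 6).items():
    print(f"{name:18s} {shape}")
```

One consequence of `seq_len - k + 1`: a batch whose longest question has fewer tokens than the largest kernel size (4 here) leaves no valid position for that kernel, and as far as I know `Conv2d` raises an error in that case.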

In [None]:
# Define the CNN hyperparameters for model building

vocabulary_size = len(TEXT.vocab)   # Number of rows in the embedding table (2700 in the run shown)
embedding_size = 100                # Dimension of each word embedding
kernels_number = 100                # Number of filters per kernel size
kernel_sizes = [2, 3, 4]            # Filter heights, in tokens
output_size = len(LABEL.vocab)      # Number of question classes (6)
dropout_rate = 0.8                  # Dropout probability

In [None]:
# Pass the hyperparameters to the CNN model
cnn_model = CNN(vocabulary_size, embedding_size, kernels_number, kernel_sizes, output_size, dropout_rate)

In [None]:
# Move the model to the selected device (GPU if available)
cnn_model.to(colab_ram)

CNN(
  (embedding): Embedding(2700, 100)
  (convolution_layers): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(2, 100), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (2): Conv2d(1, 100, kernel_size=(4, 100), stride=(1, 1))
  )
  (dropout): Dropout(p=0.8, inplace=False)
  (fully_connected): Linear(in_features=300, out_features=6, bias=True)
)
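The printed layer sizes are enough to count the trainable parameters by hand. The sketch below does that arithmetic; the helper `count_parameters` is hypothetical and assumes every convolution and the linear layer have a bias term, as the printout indicates for the linear layer and as is the `Conv2d` default.

```python
def count_parameters(vocabulary_size, embedding_size, kernels_number, kernel_sizes, output_size):
    """Return (embedding, convolutions, linear, total) parameter counts."""
    embedding = vocabulary_size * embedding_size
    # Each filter has 1 * k * embedding_size weights plus one bias
    convolutions = sum(kernels_number * (k * embedding_size) + kernels_number
                       for k in kernel_sizes)
    in_features = len(kernel_sizes) * kernels_number
    linear = in_features * output_size + output_size
    return embedding, convolutions, linear, embedding + convolutions + linear


print(count_parameters(2700, 100, 100, [2, 3, 4], 6))
# (270000, 90300, 1806, 362106)
```

About three quarters of the parameters sit in the embedding table, so the vocabulary size and embedding dimension dominate the model's size.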

# **Model Evaluation**

In [None]:
# Define the loss function and the optimizer
eval_criterion = nn.CrossEntropyLoss()         # cross-entropy loss for multi-class classification
eval_criterion = eval_criterion.to(colab_ram)   # move the criterion to the selected device

eval_optimizer = optim.Adam(cnn_model.parameters())  # Adam optimizer with its default settings
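For a single example, cross-entropy on raw scores (logits) is the log of the summed exponentials minus the score of the true class. The sketch below computes that for one example in pure Python; the helper `cross_entropy` is hypothetical, and `nn.CrossEntropyLoss` additionally averages over the batch by default.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of one example: log-sum-exp of the logits minus the target logit."""
    peak = max(logits)   # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


uniform = cross_entropy([0.0] * 6, target_index=2)
confident = cross_entropy([5.0, 0.0, 0.0, 0.0, 0.0, 0.0], target_index=0)
print(round(uniform, 4))    # 1.7918, i.e. ln(6): a model guessing uniformly over 6 classes
print(confident < uniform)  # True: a confident correct prediction is penalized far less
```

The value ln(6) ≈ 1.79 is a useful reference point for this dataset: a loss near it means the model is doing no better than guessing among the six labels.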

In [None]:
# Compute the fraction of correct predictions in a batch
def accuracy(predictions, actual_label):
    max_predictions = predictions.argmax(dim=1, keepdim=True)
    correct_predictions = max_predictions.squeeze(1).eq(actual_label)
    # Divide by a plain int so this also runs on CPU (torch.cuda.FloatTensor requires a GPU)
    return correct_predictions.sum().float() / actual_label.shape[0]

In [None]:
# Train the model for one epoch, iterating over the training set in batches
def train(cnn_model, iterator, eval_optimizer, eval_criterion):

    cnn_model.train()
    epoch_loss = 0
    epoch_acc = 0
    
    for batch in iterator:
        eval_optimizer.zero_grad()
        
        predictions = cnn_model(batch.text)
        
        loss = eval_criterion(predictions, batch.label)
        
        acc = accuracy(predictions, batch.label)
        
        loss.backward()
        
        eval_optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
# Categorical accuracy: the fraction of examples whose predicted label matches the actual label
def categorical_accuracy(preds, y):
    top_pred = preds.argmax(1, keepdim = True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    acc = correct.float() / y.shape[0]
    return acc
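Both `accuracy` and `categorical_accuracy` compute the same quantity: the share of rows whose highest-scoring class equals the label. A minimal list-based sketch of that calculation follows; `batch_accuracy` and the toy scores are hypothetical.

```python
def batch_accuracy(score_rows, labels):
    """Fraction of rows whose arg-max class index equals the label."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in score_rows]
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)


scores = [[0.1, 2.0, -1.0],   # predicts class 1
          [1.5, 0.2, 0.3],    # predicts class 0
          [0.0, 0.1, 0.9],    # predicts class 2
          [0.4, 0.3, 0.2]]    # predicts class 0
labels = [1, 0, 1, 0]
print(batch_accuracy(scores, labels))   # 0.75
```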

In [None]:
# Evaluate the model on a dataset without updating its weights; returns the loss and accuracy averaged over batches
def evaluate(cnn_model, iterator, eval_criterion):

    cnn_model.eval()
    epoch_loss = 0
    epoch_acc = 0
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = cnn_model(batch.text)
            
            loss = eval_criterion(predictions, batch.label)
            
            acc = categorical_accuracy(predictions, batch.label)
           
            epoch_loss += loss.item()
            
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
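Note that `train` and `evaluate` divide by `len(iterator)`, so they report the mean of per-batch accuracies rather than the accuracy over all examples. The two differ when the last batch is smaller than the rest. The sketch below shows the gap on made-up numbers; both helper functions and the batch statistics are hypothetical.

```python
def mean_of_batch_means(batch_stats):
    """Average the per-batch accuracies, as the notebook's loops do."""
    return sum(correct / size for correct, size in batch_stats) / len(batch_stats)


def sample_weighted_mean(batch_stats):
    """Accuracy over all examples, weighting each batch by its size."""
    total_correct = sum(correct for correct, _ in batch_stats)
    total_size = sum(size for _, size in batch_stats)
    return total_correct / total_size


# (correct predictions, batch size) for two full batches and a small final batch
stats = [(48, 64), (48, 64), (4, 4)]
print(round(mean_of_batch_means(stats), 4))    # 0.8333
print(round(sample_weighted_mean(stats), 4))   # 0.7576
```

With batches of 64 and thousands of examples the difference is usually small, but it is worth keeping in mind when comparing against accuracies reported elsewhere.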

# **Training the CNN Model**

In [None]:
number_of_epochs = 10

best_acc = float('-inf')

for epoch in range(number_of_epochs):
    
    # Train for one epoch on the training set
    train_loss, train_acc = train(cnn_model, train_iterator, eval_optimizer, eval_criterion)
    # Measure loss and accuracy on the validation set
    valid_loss, valid_acc = evaluate(cnn_model, valid_iterator, eval_criterion)
    
    if valid_acc > best_acc:
        # Save the weights whenever validation accuracy improves
        best_acc = valid_acc
        torch.save(cnn_model.state_dict(), 'trec.pt')
    
    print(f'Epoch {epoch+1} ')
    print(f'\tTrain Loss: {train_loss:.4f} | Train Acc: {train_acc*100:.4f}%')
    print(f'\t Validation Loss: {valid_loss:.3f} |  Validation Acc: {valid_acc*100:.4f}%')

Epoch 1 
	Train Loss: 0.1763 | Train Acc: 94.5052%
	 Validation Loss: 0.535 |  Validation Acc: 83.0395%
Epoch 2 
	Train Loss: 0.1827 | Train Acc: 94.4271%
	 Validation Loss: 0.547 |  Validation Acc: 82.9661%
Epoch 3 
	Train Loss: 0.1714 | Train Acc: 94.5521%
	 Validation Loss: 0.543 |  Validation Acc: 83.1597%
Epoch 4 
	Train Loss: 0.1667 | Train Acc: 94.4948%
	 Validation Loss: 0.549 |  Validation Acc: 82.4519%
Epoch 5 
	Train Loss: 0.1484 | Train Acc: 95.0729%
	 Validation Loss: 0.560 |  Validation Acc: 82.9060%
Epoch 6 
	Train Loss: 0.1532 | Train Acc: 94.9740%
	 Validation Loss: 0.558 |  Validation Acc: 83.2799%
Epoch 7 
	Train Loss: 0.1436 | Train Acc: 95.0260%
	 Validation Loss: 0.560 |  Validation Acc: 83.6872%
Epoch 8 
	Train Loss: 0.1470 | Train Acc: 95.6510%
	 Validation Loss: 0.558 |  Validation Acc: 84.0478%
Epoch 9 
	Train Loss: 0.1214 | Train Acc: 96.2708%
	 Validation Loss: 0.571 |  Validation Acc: 83.5670%
Epoch 10 
	Train Loss: 0.1184 | Train Acc: 96.1979%
	 Validation
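The keep-the-best checkpoint rule in the loop can be replayed on the validation accuracies logged above. The sketch below uses only epochs 1 to 9, because the epoch 10 validation line is cut off in the output, so it cannot say whether epoch 10 would have overtaken epoch 8.

```python
# Validation accuracies (%) copied from the log above, epochs 1 to 9
logged_valid_acc = [83.0395, 82.9661, 83.1597, 82.4519, 82.9060,
                    83.2799, 83.6872, 84.0478, 83.5670]

best_acc = float('-inf')
best_epoch = None
save_epochs = []   # epochs at which the loop would overwrite 'trec.pt'

for epoch, valid_acc in enumerate(logged_valid_acc, start=1):
    if valid_acc > best_acc:
        best_acc = valid_acc
        best_epoch = epoch
        save_epochs.append(epoch)

print(save_epochs)            # [1, 3, 6, 7, 8]
print(best_epoch, best_acc)   # 8 84.0478
```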

**Training accuracy is well above validation accuracy, which suggests the model is overfitting, so let's measure its accuracy on the held-out test set.**

In [None]:
cnn_model.load_state_dict(torch.load('trec.pt'))

test_loss, test_acc = evaluate(cnn_model, test_iterator, eval_criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.4f}%')

Test Loss: 0.372 | Test Acc: 88.8371%


# **Conclusion**
In conclusion, varying the model's parameters produced different accuracies. One notable result: with **dropout_rate = 0.8** and **number_of_epochs = 10**, the test accuracy reached **88.8371%**.
Due to computational limitations, a grid search or other parameter tuning to find the best model was not performed. I hope to explore this and other methods as more resources become available.
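For readers who want to try the grid search mentioned above, a minimal sketch of enumerating the candidate configurations follows. The search space is hypothetical: only `dropout_rate = 0.8`, `kernels_number = 100` and `number_of_epochs = 10` come from this notebook, and the helper `iter_grid` is an illustration rather than a library function.

```python
from itertools import product

# Hypothetical candidate values; adjust them to the compute budget available
search_space = {
    "dropout_rate": [0.2, 0.5, 0.8],
    "kernels_number": [50, 100],
    "number_of_epochs": [5, 10],
}


def iter_grid(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(iter_grid(search_space))
print(len(configs))   # 12
print(configs[0])     # {'dropout_rate': 0.2, 'kernels_number': 50, 'number_of_epochs': 5}
```

Each configuration would be used to build and train a fresh model, and the winner would normally be chosen by validation accuracy so that the test set stays untouched until the end.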