# Text Classification with CNN (PyTorch)

## 1.0 Introduction

Text classification is one of the most common Natural Language Processing tasks. It consists of encoding a text into a tensor that a machine learning method can then use to predict a label, which may represent any category.

Text classification has been used for a wide range of purposes, such as text categorization, spam detection, information extraction, sentiment analysis, and so on.

In this short tutorial, we will develop a simple Convolutional Neural Network (CNN) for text classification in PyTorch. All code will be commented to increase readability.

We will train and test our model on the 20 Newsgroups dataset, following the steps below:
 
1. Load the word embeddings (e.g. Word2Vec or GloVe)
2. Load the dataset (i.e. 20 Newsgroups dataset)
3. Define the hyperparameters (e.g. arguments)
4. Define the model (i.e. CNN)
5. Train and Validate the model (Epochs, Metrics, etc.)
6. Test the model (Metrics)

The reader can expand and re-adapt our system to work on different datasets and for different purposes.


### 1.1 References and Acknowledgements

Most of the code described below is adapted from:
- TextCNN (Kim, 2014): https://arxiv.org/abs/1408.5882
- Rationale Net (Lei et al., 2016): https://arxiv.org/abs/1606.04155
- Extraction from Breast Pathology Reports (Yala et al., 2016): https://www.biorxiv.org/content/early/2016/10/10/079913

We recommend that the reader go through these papers to gain a clear understanding of the model and the theory behind it.

## 2.0 Task

The 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/) is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different categories (med, space, atheism, etc.). This dataset has been previously adopted for both clustering and classification of documents.

In this tutorial, we will load the dataset through the Scikit-Learn interface and we will use it for text classification. See section 4.0.

## 3.0 Word Embeddings

"Word embeddings" refers to a set of language modeling and feature learning techniques that allow computers to learn word representations as dense vectors of real numbers (as opposed to sparse co-occurrence vectors).

The two most common word embedding types are Word2Vec (either Continuous Bag-of-Words or Skip-Gram) and GloVe, although a number of other algorithms have been proposed over the years. This tutorial does not describe how these algorithms work, but we suggest that the reader look up at least some basic information about them.

What we can briefly mention here is that word embeddings rely on the *Distributional Hypothesis* (Harris, 1954), according to which words that occur in similar contexts tend to have similar meanings. If we count or predict the contexts in which words occur, we can learn vector representations in which similarity is expressed by means of distance in the resulting semantic space: vectors of words with similar meanings will be closer to each other than vectors of words with different meanings.
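
To make the idea of "closeness" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common choice of measure. The three-dimensional vectors are invented for illustration and do not come from any trained model; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, made up for this example only
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

# Words used in similar contexts are expected to score higher
print(sim_cat_dog > sim_cat_car)
```

With these toy values the pair "cat"/"dog" scores higher than "cat"/"car", which is the behavior we expect from embeddings learned on real text.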


### 3.1 Load the Word Embeddings

The first step is to provide our algorithm with vector representations of words.

One way to do so would be to collect the vocabulary used in our target dataset and learn the vectorial representations for each word from a large corpus.

Another, more practical way is to load pre-trained word embeddings, which can be easily downloaded from the Web. This is possible because we can expect most words in our newsgroup dataset to be common and frequent enough to appear in the pre-trained embeddings. This assumption would not hold in, for example, the medical or pharmaceutical domain, where the vocabulary contains very rare words.

In the code below we use GloVe embeddings (https://nlp.stanford.edu/projects/glove/), but the reader can use a different kind of embeddings if desired.

In [2]:
# Load the Embeddings
import numpy as np

# Set the path where you have downloaded the embeddings
emb_path = "/Users/hhjs/Downloads/glove.6B/glove.6B.300d.txt"

# Set the embedding size
emb_dims = 300


def load_embeddings(emb_path, emb_dims):
    '''
    Load the embeddings from a text file
    
        :param emb_path: Path of the text file
        :param emb_dims: Embedding dimensions
        
        :return emb_tensor: tensor containing all word embeddings
        :return word_to_indx: dictionary with word:index
    '''

    # Creating the list and adding the PADDING embedding
    emb_tensor = [np.zeros(emb_dims)]
    word_to_indx = {'PADDING_WORD':0}
    
    # For each line, save the embedding and the word:index
    with open(emb_path, encoding='utf-8') as emb_file:
        for l in emb_file:
            fields = l.split()
            
            # Skip lines that do not hold a word followed by emb_dims values
            if len(fields) != emb_dims + 1:
                continue
            word, emb = fields[0], fields[1:]
            
            # The index is the row that the vector is about to occupy, so the
            # dictionary stays aligned with emb_tensor even if a line is skipped
            word_to_indx[word] = len(emb_tensor)
            emb_tensor.append([float(x) for x in emb])
    
    # Turning the list into a numpy object
    emb_tensor = np.array(emb_tensor, dtype=np.float32)
    return emb_tensor, word_to_indx

The function load_embeddings takes as input the embedding path (pointing to a text file with one word and its vector per line) and the embedding dimensions.

It loads the embeddings into emb_tensor, adding a zero-padding embedding in position zero. For each word in the emb_tensor, the word index is recorded in the word_to_indx dictionary.
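
The file format and the role of the padding row can be illustrated with a small, self-contained sketch. It parses an in-memory string instead of a real GloVe file; `parse_embedding_lines` and the toy lines are hypothetical names and data used only for this example, not part of the tutorial code.

```python
import io


def parse_embedding_lines(handle, emb_dims):
    """Parse 'word v1 v2 ...' lines; row 0 is reserved for zero padding."""
    vectors = [[0.0] * emb_dims]
    word_to_indx = {"PADDING_WORD": 0}
    for line in handle:
        parts = line.split()
        if len(parts) != emb_dims + 1:
            continue  # skip lines with the wrong number of values
        word, values = parts[0], parts[1:]
        # The index is the row the vector is about to occupy
        word_to_indx[word] = len(vectors)
        vectors.append([float(v) for v in values])
    return vectors, word_to_indx


# Toy 3-dimensional "file" with one malformed line in the middle
toy_file = io.StringIO("the 0.1 0.2 0.3\nbroken 0.5\nof 0.4 0.5 0.6\n")
vectors, toy_word_to_indx = parse_embedding_lines(toy_file, emb_dims=3)

print(toy_word_to_indx)
print(vectors[toy_word_to_indx["of"]])
```

Taking the index from the current number of rows, rather than from the line number, keeps the dictionary and the vector list aligned when a line is skipped.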

Below we call the function and print the dimensions of both the vector tensor and dictionary.

In [3]:
# Calling load_embeddings and printing the size of the returned objects
emb_tensor, word_to_indx = load_embeddings(emb_path, emb_dims)

print('Words: {}\nVectors (+ zero-padding): {}'.format(len(word_to_indx.keys()), emb_tensor.shape))

Words: 400001
Vectors (+ zero-padding): (400001, 300)


## 4.0 Load the Dataset

In section 2.0 we briefly introduced the task. In this section we show how to load the dataset using the Scikit-Learn API.

The dataset must be processed so that our Convolutional Neural Network can use it for classification; the classes below serve exactly this purpose.

In [9]:
# Load the Dataset

from sklearn.datasets import fetch_20newsgroups
from abc import ABCMeta, abstractmethod, abstractproperty
import torch.utils.data as data
import torch

import re
import random
import tqdm


# Classes in the dataset
classes = ['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


class AbstractDataset(data.Dataset):
    '''
    Abstract class that adds general methods to the Newsgroup dataset
    '''
    
    __metaclass__ = ABCMeta

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        sample = self.dataset[index]
        return sample


class Newsgroup(AbstractDataset):
    '''
    Newsgroup dataset loader
    '''
    
    def __init__(self, set_type, classes, word_to_indx, class_balance_true=True, max_length=80):
        '''
        Load the dataset from SK-Learn

            :param set_type: string containing either 'train', 'dev' or 'test'
            :param classes: list of strings containing the classes
            :param word_to_indx: dictionary of word:index
            :param class_balance_true: boolean, not used by this loader
            :param max_length: integer with the maximum number of words to consider
            :return: nothing
        '''

        # Deterministic randomization
        random.seed(0)
        
        n_classes = len(classes)
        class_balance = {}
        self.dataset = []

        # If train or dev...
        if set_type in ['train', 'dev']:
            data = self.preprocess(fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'),
                                                      categories=classes))
            
            # Randomly split train in 80-20%
            random.shuffle(data)
            num_train = int(len(data)*.8)
            if set_type == 'train':
                data = data[:num_train]
            else:
                data = data[num_train:]
                
        # If test...     
        else:
            data = self.preprocess(fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),
                                                      categories=classes))

        # For every unprocessed_sample in the created set, process it
        for indx, unprocessed_sample in tqdm.tqdm(enumerate(data)):
            sample = self.process_line(unprocessed_sample, word_to_indx, max_length)
            
            # If the sample is not empty, save it and add its y to the class_balance dictionary
            if sample['text'] != '':
                if not sample['y'] in class_balance:
                    class_balance[sample['y']] = 0
                class_balance[sample['y']] += 1
                self.dataset.append(sample)

            
    def preprocess(self, data):
        '''
        Return a list of (text, label and label_name)

            :param data: 20 newsgroup dataset as imported by SK-Learn
            
            :return processed_data: list of text, label and label_name
        '''
        processed_data = []
        for indx, sample in enumerate(data['data']):
            text, label = sample, data['target'][indx]
            label_name = data['target_names'][label]
            text = re.sub(r'\W+', ' ', text).lower().strip()
            processed_data.append((text, label, label_name))
        return processed_data

    
    def get_indices_tensor(self, text_arr, word_to_indx, max_length):
        '''
        Return a tensor of max_length with the word indices
        
            :param text_arr: text array
            :param word_to_indx: dictionary word:index
            :param max_length: maximum length of returned tensors
            
            :return x: tensor containing the indices
        '''
        
        pad_indx = 0
        text_indx = [word_to_indx[x] if x in word_to_indx else pad_indx for x in text_arr][:max_length]
        
        # Padding
        if len(text_indx) < max_length:
            text_indx.extend([pad_indx for _ in range(max_length - len(text_indx))])

        x =  torch.LongTensor([text_indx])

        return x


    def process_line(self, row, word_to_indx, max_length, case_insensitive=True):
        '''
        Return every line as a dictionary with text, x, y, y_name

            :param row: document (or comment)
            :param word_to_indx: dictionary of word:index
            :param max_length: integer with max word to consider
            
            :return sample: dictionary of text, x, y, y_name
        '''
        
        text, label, label_name = row
        
        if case_insensitive:
            text = " ".join(text.split()[:max_length]).lower()
        else:
            text = " ".join(text.split()[:max_length])
            
        x =  self.get_indices_tensor(text.split(), word_to_indx, max_length)
        
        sample = {'text':text,'x':x, 'y':label, 'y_name': label_name}
        return sample

The class AbstractDataset adds general methods to the Newsgroup dataset.

The class Newsgroup loads the dataset and processes it, turning every document into a dictionary with the following keys:

- text: the text of the comment
- x: tensor containing the indices of the words in text
- y: label (an integer)
- y_name: name of the label
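
As an illustration of how `text` and `x` relate, the sketch below normalizes a string, maps each word to an index, and truncates or pads the result to a fixed length. The helper `encode_text` and the toy vocabulary are hypothetical; the sketch returns a plain list, whereas the Newsgroup class wraps the indices in a `torch.LongTensor`.

```python
import re


def encode_text(text, word_to_indx, max_length, pad_indx=0):
    """Normalize text, map words to indices, then truncate or pad."""
    # Replace every run of non-word characters with a space and lowercase
    normalized = re.sub(r"\W+", " ", text).lower().strip()
    tokens = normalized.split()[:max_length]
    # Words missing from the vocabulary fall back to the padding index
    indices = [word_to_indx.get(tok, pad_indx) for tok in tokens]
    indices += [pad_indx] * (max_length - len(indices))
    return " ".join(tokens), indices


# Toy vocabulary, invented for this example
toy_vocab = {"PADDING_WORD": 0, "the": 1, "name": 2, "wasn": 3, "t": 4, "known": 5}
toy_text, toy_x = encode_text("The name wasn't KNOWN, apparently!", toy_vocab, max_length=8)

print(toy_text)  # the name wasn t known apparently
print(toy_x)     # [1, 2, 3, 4, 5, 0, 0, 0]
```

Note that an out-of-vocabulary word ("apparently") and real padding both map to index 0, so the model cannot tell them apart.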

Below we split the dataset into train, dev and test sets and print the first three samples of the training set.

In [10]:
# Loading the dataset
train = Newsgroup('train', classes, word_to_indx, class_balance_true=True, max_length=80)
dev = Newsgroup('dev', classes, word_to_indx, class_balance_true=True, max_length=80)
test = Newsgroup('test', classes, word_to_indx, class_balance_true=True, max_length=80)

# Printing 3 datapoints
for datapoint in train[:3]:
    print(datapoint)
    print(len(datapoint['x'][0]))

9051it [00:00, 40010.21it/s]
2263it [00:00, 37448.26it/s]
7532it [00:00, 43216.82it/s]

{'text': 'thanks again one final question the name gehrels wasn t known to me before this thread came up but the may issue of scientific american has an article about the inconstant cosmos with a photo of neil gehrels project scientist for nasa s compton gamma ray observatory same person mark brader softquad inc toronto information we want information utzoo sq msb msb sq com the prisoner', 'x': tensor([[  3125,    379,     49,    295,    996,      1,    312, 106696, 128074,
           2160,    226,      5,    286,    107,     38,  14410,    264,     61,
             35,      1,    108,    496,      4,   2441,    141,     32,     30,
           1760,     60,      1, 142185,  21499,     18,      8,   3120,      4,
           6400, 106696,    717,   4757,     11,   4168,   1535,  18477,  17305,
           3015,   8891,    216,    900,    800, 269737, 343531,  15232,   2527,
            420,     54,    304,    420,      0,  16426, 155648, 155648,  16426,
          10109,      1,   5032,   




In [11]:
print(list(word_to_indx.keys())[:10])
print(len(list(word_to_indx.keys())))
word_to_indx['of']

['PADDING_WORD', 'the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"']
400001


4

In [12]:
dic = {}
# Map every word seen in the train and dev sets to its embedding index
for dataset in (train, dev):
    for k in range(len(dataset)):
        datapoint = dataset[k]
        text = datapoint['text'].split(' ')
        x = datapoint['x']
        for i in range(len(text)):
            word = text[i]
            indx = x[0][i]

            if word not in dic:
                dic[word] = indx

In [14]:
import torch.nn as nn

In [356]:
class MarkovChain(nn.Module):
  def __init__(self, dataset, total_init_probability = None, total_transfer_probability = None):
    # Initialize nn.Module before setting any attribute
    super().__init__()
    self.train(dataset)
    self.sequence_len = len(dataset.__getitem__(0)['x'][0])

  def train(self, dataset):
    current_index = 0
    self.ind_glob_to_ind = {}
    self.count_ind = {}
    for k in tqdm.tqdm(range(len(dataset))):
      datapoint = dataset[k]
      x = datapoint['x']
      for i in range(len(x[0])):
        indx = x[0][i].item()
        if indx not in self.ind_glob_to_ind.keys():
          self.ind_glob_to_ind[indx] = current_index
          current_index+=1
        self.count_ind[indx] = self.count_ind.get(indx, 0) + 1
    
    self.init_probability = torch.zeros(len(self.ind_glob_to_ind))
    self.transition_probability = torch.zeros([len(self.ind_glob_to_ind)]*2)
    for k in self.count_ind.keys():
      self.init_probability[self.ind_glob_to_ind[k]] += self.count_ind[k]
    

    for k in tqdm.tqdm(range(len(dataset))):
      datapoint = dataset[k]
      x = datapoint['x']
      for i in range(len(x[0]) -1):
        indx = x[0][i].item()
        next_indx = x[0][i+1].item()
        self.transition_probability[self.ind_glob_to_ind[indx], self.ind_glob_to_ind[next_indx]] += 1

    self.init_probability = self.init_probability / (torch.sum(self.init_probability)+1e-8)
    self.transition_probability = self.transition_probability / (torch.sum(self.transition_probability, axis = 1, keepdim=True)+1e-8)
    # self.transition_probability = self.transition_probability.unsqueeze(0).unsqueeze(0).to_sparse(sparse_dim=2)
    self.output_dim = len(self.ind_glob_to_ind)

  def impute(self, data, masks, nb_imputation= 10):

    with torch.no_grad():
      batch_size = data.shape[0]
      
      current_data = map(lambda x: self.ind_glob_to_ind[x.item()], data.flatten().detach().clone())
      current_data = torch.tensor(list(current_data), dtype=torch.int64, device = data.device).reshape(data.shape)
      current_data_complete = torch.nn.functional.one_hot(current_data, num_classes = self.output_dim)
      masks_expanded = masks.unsqueeze(-1).expand(-1, -1, self.output_dim)
      message = torch.zeros((batch_size, self.sequence_len, self.output_dim))    
      message[:, 0] = self.init_probability * (1-masks_expanded[:, 0]) + masks_expanded[:, 0] * current_data_complete[:, 0]
      message[:, 0] = message[:, 0]/(torch.sum(message[:, 0], axis = 1, keepdim=True)+1e-8) # Batchsize * output_dim

  
      # Forward :
      for i in tqdm.tqdm(range(1, self.sequence_len)):
        message[:, i] = torch.matmul(message[:, i-1], self.transition_probability.unsqueeze(0)) * (1-masks_expanded[:, i]) + masks_expanded[:, i] * current_data_complete[:, i]
        message[:, i] = message[:, i]/torch.sum(message[:, i], axis = 1, keepdim=True)

      
      # Backward : 
      masks_nb_imputation = masks.unsqueeze(1).expand(-1, nb_imputation, -1,) # Batchsize * nb_imputation * sequence_len 
      current_data_nb_imputation = current_data.unsqueeze(1).expand(-1, nb_imputation, -1,) # Batchsize * nb_imputation * sequence_len
      output_sample = torch.zeros((batch_size, nb_imputation, self.sequence_len,))

      output_sample[:, :, -1] = torch.distributions.categorical.Categorical(probs = message[:, -1]).sample((nb_imputation,)).permute(1,0,)
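      # Keep the observed token where the mask is 1 and the sampled one where it is 0, as in the loop below
      output_sample[:, :, -1] = (1-masks_nb_imputation[:, :, -1]) * output_sample[:, :, -1] + masks_nb_imputation[:, :, -1] * current_data_nb_imputation[:, :, -1]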
      message = message.unsqueeze(1).expand(-1, nb_imputation,-1, -1).clone()

      for i in tqdm.tqdm(range(self.sequence_len-2, -1, -1)):
        # For every sample, select the column of the transition matrix given by the token drawn at position i+1
        current_transition = self.transition_probability[:, output_sample[:, :, i+1].long()].permute(1, 2, 0) # Batchsize * nb_imputation * output_dim
        message[:, :, i] *= current_transition
        message[:, :, i] = message[:, :, i]/(torch.sum(message[:, :, i], axis = -1, keepdim=True)+1e-8)
        dist = torch.distributions.categorical.Categorical(probs=message[:, :, i])
        output_sample[:, :, i] = dist.sample()
        output_sample[:, :, i] = (1-masks_nb_imputation[:, :, i]) * output_sample[:, :, i] + masks_nb_imputation[:, :, i] * current_data_nb_imputation[:, :, i]

    return output_sample
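
The `train` method above builds a first-order Markov chain over word indices: it counts how often one index follows another and normalizes each row of counts into probabilities. The same estimate can be sketched on a toy example with plain dictionaries; `estimate_transitions` and the toy sequences are hypothetical and only meant to show the counting step, not the imputation.

```python
from collections import Counter, defaultdict


def estimate_transitions(sequences):
    """Return P(next | current) estimated from bigram counts."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {nxt: c / total for nxt, c in followers.items()}
    return probs


# Toy sequences of word indices, invented for this example
toy_sequences = [[1, 2, 3], [1, 2, 4], [1, 5, 3]]
transitions = estimate_transitions(toy_sequences)

print(transitions[1])  # index 1 is followed by 2 twice and by 5 once
print(transitions[2])  # index 2 is followed by 3 and by 4 once each
```

A dictionary of dictionaries stores only the transitions that were observed, whereas the dense matrix used in the class needs one entry per pair of vocabulary items.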



In [357]:
current_markovchain = MarkovChain(train)
current_markovchain.transition_probability.shape

100%|██████████| 8799/8799 [00:01<00:00, 5954.57it/s]
100%|██████████| 8799/8799 [00:06<00:00, 1297.81it/s]


torch.Size([1, 1, 26484, 26484])

In [358]:
batch = next(iter(train))
dic = current_markovchain.ind_glob_to_ind

In [359]:
print(batch['x'])
masks = torch.bernoulli(torch.ones(batch['x'].shape)*0.5)

tensor([[  3125,    379,     49,    295,    996,      1,    312, 106696, 128074,
           2160,    226,      5,    286,    107,     38,  14410,    264,     61,
             35,      1,    108,    496,      4,   2441,    141,     32,     30,
           1760,     60,      1, 142185,  21499,     18,      8,   3120,      4,
           6400, 106696,    717,   4757,     11,   4168,   1535,  18477,  17305,
           3015,   8891,    216,    900,    800, 269737, 343531,  15232,   2527,
            420,     54,    304,    420,      0,  16426, 155648, 155648,  16426,
          10109,      1,   5032,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0]])


In [360]:
print(dic)
dic[18477]

{3125: 0, 379: 1, 49: 2, 295: 3, 996: 4, 1: 5, 312: 6, 106696: 7, 128074: 8, 2160: 9, 226: 10, 5: 11, 286: 12, 107: 13, 38: 14, 14410: 15, 264: 16, 61: 17, 35: 18, 108: 19, 496: 20, 4: 21, 2441: 22, 141: 23, 32: 24, 30: 25, 1760: 26, 60: 27, 142185: 28, 21499: 29, 18: 30, 8: 31, 3120: 32, 6400: 33, 717: 34, 4757: 35, 11: 36, 4168: 37, 1535: 38, 18477: 39, 17305: 40, 3015: 41, 8891: 42, 216: 43, 900: 44, 800: 45, 269737: 46, 343531: 47, 15232: 48, 2527: 49, 420: 50, 54: 51, 304: 52, 0: 53, 16426: 54, 155648: 55, 10109: 56, 5032: 57, 15: 58, 64: 59, 97376: 60, 1117: 61, 4712: 62, 58582: 63, 47: 64, 241621: 65, 59593: 66, 9493: 67, 780: 68, 14: 69, 1051: 70, 5531: 71, 42: 72, 5282: 73, 6339: 74, 6: 75, 1994: 76, 150: 77, 7279: 78, 9011: 79, 3262: 80, 21: 81, 9315: 82, 2531: 83, 51: 84, 164: 85, 77: 86, 186: 87, 365: 88, 559: 89, 680: 90, 7: 91, 2299: 92, 8151: 93, 85: 94, 672: 95, 406: 96, 340: 97, 314: 98, 1347: 99, 183: 100, 123: 101, 2644: 102, 139: 103, 4790: 104, 223: 105, 75: 106, 1

39

In [361]:
imputed = current_markovchain.impute(batch['x'], masks, 2)

  0%|          | 0/79 [00:00<?, ?it/s]


RuntimeError: sparse transpose: transposed dimensions must be sparse Got sparse_dim: 1, d0: 1, d1: 2

In [297]:
imputed.to(torch.int64)

tensor([[[  235,     1,    72,  4855,     4,     5,     6,    81,  2325,     9,
            183,    11,   914,    13,    14,   820,   362,     5,  1659,    81,
             19,    20,    21,     5,    15,   226,    25,    26,   113,     5,
             28,    29,    30,    31,    32,    21,    33,     7,   347,  1619,
             36,    37,    38,    39,    40,    41,    42,    72,  4509,    45,
             46,    47,    48,    49,  4396,    51,    52,  4103,    53,    54,
             54,    54,    54,    56,     5,    57,    53,    53,    53,    53,
             53,    53,    53,    53,    53,    53,    53,    53,    53,    53],
         [   17,     1,   795,    94,     4,     5,     6,    81,  2325,     9,
             52,    11,   300,    13,    14,   473,  9749, 23891,    69,    87,
             19,    20,    21, 19445,   347,   355,    25,    26,    69,     5,
             28,    29,    30,     5,    32,    21,    33,     7,   347, 21622,
             36,    37,    38,    39,  

In [299]:
inverse_dic = {v: k for k, v in dic.items()}
imputed_2 = torch.tensor(list(map(lambda x: inverse_dic[x.item()], imputed.flatten()))).reshape(imputed.shape)

In [300]:
indx_to_word = {v: k for k, v in word_to_indx.items()}
imputed_3 = list(map(lambda x: indx_to_word[x.item()], imputed_2.flatten()))

In [301]:
print(batch['text'])

x = ''
for k in range(80):
    x += imputed_3[k] + ' '
print(x)

thanks again one final question the name gehrels wasn t known to me before this thread came up but the may issue of scientific american has an article about the inconstant cosmos with a photo of neil gehrels project scientist for nasa s compton gamma ray observatory same person mark brader softquad inc toronto information we want information utzoo sq msb msb sq com the prisoner
is again one quick question a name i don t have to act before this have a not necessarily it may issue of 1 315 about an article on the inconstant cosmos with a photo of neil gehrels son and for nasa s compton gamma ray observatory space does mark brader softquad inc toronto when we want PADDING_WORD PADDING_WORD sq msb msb sq com the prisoner PADDING_WORD PADDING_WORD PADDING_WORD PADDING_WORD PADDING_WORD PADDING_WORD PADDING_WORD PADDING_WORD PADDING_WORD PADDING_WORD PADDING_WORD PADDING_WORD PADDING_WORD PADDING_WORD 


## 5.0 Define the Hyperparameters

Whenever we train a neural network, a large set of hyperparameters needs to be defined and tuned. They are generally tuned by looking at the performance on the development set.

In this section we define the default arguments. As the experiments will show, these defaults are already good enough to obtain high accuracy on the Newsgroup dataset.

In [34]:
# Set the parameters

args = {'train':True, 'test':False, 'cuda':False, 'class_balance':False,
        'init_lr':0.001, 'epochs':4, 'batch_size':128, 'patience':10,
        'save_dir':'snapshot', 'model_path':'model.pt', 'results_path':'snapshot/results.txt', 'model':'TextCNN',
        'hidden_dims':100, 'num_layers':1, 'dropout':0.1, 'weight_decay':1e-3,
        'filter_num':100, 'filters':[3, 4, 5], 'num_class':20, 'emb_dims':300,
        'tuning_metric':'loss', 'num_workers':4, 'objective':'cross_entropy'}

#'gumbel_temprature':1, 'gumbel_decay':1e-5,'tag_lambda':.5

## 6.0 Defining the Model

In this section, we see how to create a Convolutional Neural Network for text classification. We do not discuss the theory behind CNNs here, as the reader can easily find sources online (a nice tutorial can be found here: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/). Instead, we present the commented code below.

Our implementation is organized in two classes:
- one is the Encoder, which loads the embeddings, calls the model and returns the logits for the output classes;
- the other is the model, implemented as a TextCNN, which takes as input a three-dimensional tensor (batch by emb_dimensions by word_number, after the Encoder has transposed the embedded batch) and returns the activation.
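
Before reading the full classes, it may help to follow the tensor shapes through one convolutional layer. The sketch below is a simplified stand-in, assuming the default sizes used in this tutorial (80 words, 300 embedding dimensions, 100 filters for each of the widths 3, 4 and 5) and a random input in place of real embeddings. Unlike the class below, it pools each filter width before concatenating, which gives the same result for a single layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

batch, length, emb_dims = 2, 80, 300   # toy batch size; 80 and 300 follow the tutorial defaults
filter_num, filters = 100, [3, 4, 5]

x = torch.randn(batch, length, emb_dims)  # stand-in for an embedded batch
x = x.transpose(1, 2)                     # Conv1d expects (batch, channels, length)

pooled = []
for k in filters:
    conv = nn.Conv1d(in_channels=emb_dims, out_channels=filter_num, kernel_size=k)
    padded = F.pad(x, (k - 1, 0))         # k - 1 zeros on the left keep the length at 80
    activ = F.relu(conv(padded))          # (batch, filter_num, length)
    pooled.append(activ.max(dim=2).values)  # max over time: (batch, filter_num)

features = torch.cat(pooled, dim=1)       # (batch, len(filters) * filter_num)
print(features.shape)
```

The last dimension, `len(filters) * filter_num`, is the input size of the fully connected layer that follows the CNN in the Encoder.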

In [35]:
# Defining the Encoder and the Model classes

import pdb
import torch
import torch.nn as nn
import torch.autograd as autograd
import torch.nn.functional as F


# Encoder
class Encoder(nn.Module):
    '''
    Load the embeddings and encode them
    '''

    def __init__(self, embeddings, args):
        '''
        Load embeddings and call the TextCNN model
        
            :param embeddings: numpy array with word embeddings
            :param args: dictionary of hyperparameters (see section 5.0)
            
            :return: nothing
        '''
        super(Encoder, self).__init__()
        
        # Saving the parameters
        self.model = args['model']
        self.num_class = args['num_class']
        self.hidden_dims = args['hidden_dims']
        self.num_layers = args['num_layers']
        self.filters = args['filters']
        self.filter_num = args['filter_num']
        self.cuda = args['cuda']
        self.dropout = args['dropout']
        
        # Loading the word embeddings in the Neural Network
        vocab_size, hidden_dim = embeddings.shape
        self.emb_dims = hidden_dim
        self.emb_layer = nn.Embedding(vocab_size, hidden_dim)
        self.emb_layer.weight.data = torch.from_numpy(embeddings)
        self.emb_layer.weight.requires_grad = True
        self.emb_fc = nn.Linear(hidden_dim, hidden_dim)
        self.emb_bn = nn.BatchNorm1d(hidden_dim)
        
        # Calling the model, followed by a fully connected hidden layer
        if self.model == 'TextCNN':
            self.cnn = TextCNN(args, max_pool_over_time=True)
            # The input size of the hidden fully connected layer is the number of
            # filter widths times the number of filters per width
            self.fc = nn.Linear(len(self.filters) * self.filter_num, hidden_dim)
        else:
            raise NotImplementedError("Model {} not yet supported for encoder!".format(self.model))

        # Dropout and final layer
        self.dropout = nn.Dropout(self.dropout)
        self.hidden = nn.Linear(hidden_dim, self.num_class)
        
        
    def forward(self, x_indx):
        '''
        Forward step
        
            :param x_indx: batch of word indices
            
            :return logit: predictions
            :return: hidden layer
        '''
        
        x = self.emb_layer(x_indx.squeeze(1))
        if self.cuda:
            x = x.cuda()
        
        # Non linear projection with dropout
        x = F.relu(self.emb_fc(x))
        x = self.dropout(x)
        # TextCNN, fully connected layer and non-linearity
        if self.model == 'TextCNN':
            x = torch.transpose(x, 1, 2) # Transpose x dimensions into (Batch, Emb, Length)
            hidden = self.cnn(x)
            hidden = F.relu(self.fc(hidden))
        else:
            raise Exception("Model {} not yet supported for encoder!".format(self.model))

        # Dropout and final layer
        hidden = self.dropout(hidden)
        logit = self.hidden(hidden)
        return logit, hidden


# Model
class TextCNN(nn.Module):
    '''
    CNN for Text Classification
    '''

    def __init__(self, args, max_pool_over_time=False):
        '''
        Convolutional Neural Network
        
            :param args: dictionary with num_layers (number of layers), filters
                         (list of filter widths), filter_num (number of filters
                         per width), emb_dims (embedding dimensions) and cuda
            :param max_pool_over_time: boolean, apply max pooling over time if True
            
            :return: nothing
        '''
        super(TextCNN, self).__init__()

        # Saving the parameters
        self.num_layers = args['num_layers']
        self.filters = args['filters']
        self.filter_num = args['filter_num']
        self.emb_dims = args['emb_dims']
        self.cuda = args['cuda']
        self.max_pool = max_pool_over_time
        
        self.layers = []
        
        # For every layer...
        for l in range(self.num_layers):
            convs = []
            
            # For every filter...
            for f in self.filters:
                # Defining the sizes
                in_channels =  self.emb_dims if l == 0 else self.filter_num * len(self.filters)
                kernel_size = f
                
                # Adding the convolutions in the list
                conv = nn.Conv1d(in_channels=in_channels, out_channels=self.filter_num, kernel_size=kernel_size)
                self.add_module('layer_' + str(l) + '_conv_' + str(f), conv)
                convs.append(conv)
                
            self.layers.append(convs)


    def _conv(self, x):
        '''
        Left padding and returning the activation
        
            :param x: input tensor (batch, emb, length)
            :return layer_activ: activation
        '''
        
        layer_activ = x
        
        for layer in self.layers:
            next_activ = []
            
            for conv in layer:
                # Left-pad the length axis with kernel_size - 1 zero vectors,
                # so that the output keeps the input length
                left_pad = conv.kernel_size[0] - 1
                padded_activ = F.pad(layer_activ, (left_pad, 0))

                # Convolution activation
                next_activ.append(conv(padded_activ))

            # Concatenating across channels
            layer_activ = F.relu(torch.cat(next_activ, 1))
        return layer_activ


    def _pool(self, relu):
        '''
        Max Pool Over Time
        '''
        
        pool = F.max_pool1d(relu, relu.size(2)).squeeze(-1)
        return pool


    def forward(self, x):
        '''
        Forward steps over the x
        
            :param x: input (batch, emb, length)

            :return activ: activation
        '''
        
        activ = self._conv(x)
        
        # Pooling over time?
        if self.max_pool:
            activ = self._pool(activ)
            
        return activ

In [7]:
# Creating the encoder and TextCNN, and printing an output from a random input

encoder = Encoder(emb_tensor, args)                    

print("Output logits for the first (randomly sorted) element of the dataset:\n\n")
print(encoder(train[0]['x'])[0])

Output logits for the first (randomly sorted) element of the dataset:


tensor([[-0.0550,  0.0176,  0.1084, -0.0500,  0.0102, -0.0198, -0.0603,  0.0331,
          0.1217, -0.0852,  0.1033,  0.0317, -0.0446,  0.0101,  0.1126,  0.0216,
          0.0344, -0.0747, -0.0306, -0.0025]], grad_fn=<AddmmBackward0>)


## 7.0 Train the Model

With the word embeddings, the dataset, the model and the encoder in place, what remains is to train the system and then evaluate it.

The training code is fairly long, so we split it into utilities and core functions. Each function has a docstring that describes its parameters and return values.

### 7.1 Utilities

The functions below support the core training functions implemented in the next section.

In [8]:
# Train the model
import sklearn.metrics
import sys, os

def get_optimizer(models, args):
    '''
    Collect the trainable parameters of every model in models and
    pass them to the Adam optimizer.
    
        :param models: list of models (such as TextCNN, etc.)
        :param args: arguments
        
        :return: torch optimizer over models
    '''
    params = []
    for model in models:
        params.extend([param for param in model.parameters() if param.requires_grad])
    return torch.optim.Adam(params, lr=args['lr'], weight_decay=args['weight_decay'])


def init_metrics_dictionary(modes):
    '''
    Create dictionary with empty array for each metric in each mode
    
        :param modes: list with either train, dev or test
        
        :return epoch_stats: statistics for a given epoch
    '''
    epoch_stats = {}
    metrics = ['loss', 'obj_loss', 'k_selection_loss', 'k_continuity_loss',
               'accuracy', 'precision', 'recall', 'f1', 'confusion_matrix', 'mse']
    for metric in metrics:
        for mode in modes:
            key = "{}_{}".format(mode, metric)
            epoch_stats[key] = []
    return epoch_stats


def get_train_loader(train_data, args):
    '''
    Iterable train loader. If class_balance is true, examples are drawn
    with replacement according to train_data.weights; otherwise the
    data is simply shuffled.
    
        :param train_data: training data
        :param args: arguments
        
        :return train_loader: iterable training set
    '''
    
    if args['class_balance']:
        # Class balancing on: sample with replacement using per-example weights
        sampler = data.sampler.WeightedRandomSampler(
                weights=train_data.weights,
                num_samples=len(train_data),
                replacement=True)
        train_loader = data.DataLoader(
                train_data,
                num_workers=args['num_workers'],
                sampler=sampler,
                batch_size=args['batch_size'])
    else:
        # Class balancing off: plain shuffling, no weighted sampling
        train_loader = data.DataLoader(
            train_data,
            batch_size=args['batch_size'],
            shuffle=True,
            num_workers=args['num_workers'],
            drop_last=False)
    return train_loader


def get_dev_loader(dev_data, args):
    '''
    Iterable dev loader
    
        :param dev_data: dev set
        :param args: arguments
        
        :return dev_loader: iterable dev set
    '''
    
    dev_loader = data.DataLoader(
        dev_data,
        batch_size=args['batch_size'],
        shuffle=False,
        num_workers=args['num_workers'],
        drop_last=False)
    return dev_loader


def get_x_indx(batch, eval_model):
    '''
    Given a batch, return all the x
    
        :param batch: batch of dictionaries
        :param eval_model: unused; kept so that existing callers keep
                           working (the old volatile flag has no effect
                           in PyTorch >= 0.4)
        
        :return x_indx: tensor of batch*x
    '''
    
    # Tensors no longer need to be wrapped in autograd.Variable
    x_indx = batch['x']
    return x_indx


def get_loss(logit, y, args):
    '''
    Return the cross entropy or mse loss
    
        :param logit: predictions
        :param y: gold standard
        :param args: arguments
        
        :return loss: loss
    '''
    
    if args['objective'] == 'cross_entropy':
        loss = F.cross_entropy(logit, y)
    elif args['objective'] == 'mse':
        loss = F.mse_loss(logit, y.float())
    else:
        raise Exception("Objective {} not supported!".format(args['objective']))
    return loss


def tensor_to_numpy(tensor):
    '''
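    Return the Python number held by a single-element tensor
    (e.g. a loss)

        :param tensor: tensor with exactly one element
        
        :return value: Python scalar
    '''
    # tensor.data[0] fails on 0-dim tensors in recent PyTorch versions
    return tensor.item()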


def get_metrics(preds, golds, args):
    '''
    Return the metrics given predictions and golds
    
        :param preds: list of predictions
        :param golds: list of golds
        :param args: arguments
        
        :return metrics: metrics dictionary
    '''
    metrics = {}

    if args['objective'] in ['cross_entropy', 'margin']:
        metrics['accuracy'] = sklearn.metrics.accuracy_score(y_true=golds, y_pred=preds)
        metrics['confusion_matrix'] = sklearn.metrics.confusion_matrix(y_true=golds, y_pred=preds)
        metrics['precision'] = sklearn.metrics.precision_score(y_true=golds, y_pred=preds, average="weighted")
        metrics['recall'] = sklearn.metrics.recall_score(y_true=golds, y_pred=preds, average="weighted")
        metrics['f1'] = sklearn.metrics.f1_score(y_true=golds, y_pred=preds, average="weighted")
        metrics['mse'] = "NA"
    elif args['objective'] == 'mse':
        metrics['mse'] = sklearn.metrics.mean_squared_error(y_true=golds, y_pred=preds)
        metrics['confusion_matrix'] = "NA"
        metrics['accuracy'] = "NA"
        metrics['precision'] = "NA"
        metrics['recall'] = "NA"
        metrics['f1'] = 'NA'
    return metrics


def collate_epoch_stat(stat_dict, epoch_details, mode, args):
    '''
    Update stat_dict with details from epoch_details and create
    log statement

        :param stat_dict: a dictionary of statistics lists to update
        :param epoch_details: list of statistics for a given epoch
        :param mode: train, dev or test
        :param args: model run configuration

        :return stat_dict: updated stat_dict with epoch details
        :return log_statement: log statement summarizing new epoch

    '''
    log_statement_details = ''
    for metric in epoch_details:
        loss = epoch_details[metric]
        stat_dict['{}_{}'.format(mode, metric)].append(loss)

        log_statement_details += ' -{}: {}'.format(metric, loss)

    log_statement = '\n {} - {}\n--'.format(args['objective'], log_statement_details )

    return stat_dict, log_statement
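
The class-balanced branch of `get_train_loader` relies on `train_data.weights`, which is not shown in this section. The following is a minimal sketch of one common choice, assuming each example is weighted by the inverse frequency of its class; the toy labels and variable names are hypothetical.

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical, imbalanced toy labels: 90 examples of class 0, 10 of class 1
labels = [0] * 90 + [1] * 10

# Assumption: weight of an example = 1 / (number of examples in its class)
counts = Counter(labels)
weights = [1.0 / counts[y] for y in labels]

# Draw indices with replacement, as in the class_balance branch
generator = torch.Generator().manual_seed(0)
sampler = WeightedRandomSampler(weights, num_samples=1000,
                                replacement=True, generator=generator)
drawn = [labels[i] for i in sampler]

# With inverse-frequency weights both classes are drawn about equally often
share_minority = sum(drawn) / len(drawn)
print(round(share_minority, 2))
```

With these weights the minority class, which makes up 10% of the data, accounts for roughly half of the sampled examples.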

### 7.2 Core Functions

Below we present the core functions for the training.
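
Before the full implementation, it may help to see the two kinds of step that `run_epoch` alternates between, reduced to their essentials. This is a sketch under stated assumptions: a small `nn.Linear` stands in for the classifier, the data is random, and `torch.no_grad()` is used for evaluation in place of the older `volatile` flag.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(4, 3)   # hypothetical stand-in for the text classifier
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
x = torch.randn(8, 4)                 # random inputs
y = torch.randint(0, 3, (8,))         # random gold labels

# Training step: forward, loss, backward, parameter update
model.train()
before = model.weight.detach().clone()
optimizer.zero_grad()
loss = F.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
train_loss = loss.item()

# Evaluation step: same forward pass, but no gradients and no update
model.eval()
with torch.no_grad():
    eval_logits = model(x)
    eval_loss = F.cross_entropy(eval_logits, y).item()

# The arg max of the logits equals the arg max of their softmax
preds = eval_logits.argmax(dim=1)
print(train_loss, eval_loss, preds.tolist())
```

The only differences between the two modes are the `train()`/`eval()` switch, the gradient tracking and the optimizer update; the loss and the predictions are computed in the same way.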

In [16]:
# Run each epoch
def run_epoch(data_loader, train_model, model, optimizer, step, args):
    '''
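    Run one pass over data_loader (training if train_model is true,
    evaluation otherwise), and return losses, predictions and metrics
    
        :param data_loader: iterable dataset
        :param train_model: true if training, false otherwise
        :param model: text classifier, such as TextCNN
        :param optimizer: Adam
        :param step: number of training steps taken so far
        :param args: arguments
        
        :return epoch_stat: dictionary of metrics for this epoch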
        :return step: number of steps
        :return losses: list of losses
        :return preds: list of predictions
        :return golds: list of gold standards
    '''
    
    eval_model = not train_model
    data_iter = iter(data_loader)

    losses = []
    obj_losses = []
    
    preds = []
    golds = []
    texts = []

    if train_model:
        model.train()
    else:
        model.eval()

    num_batches_per_epoch = len(data_iter)
    if train_model:
        num_batches_per_epoch = min(len(data_iter), 10000)

    for _ in tqdm.tqdm(range(num_batches_per_epoch)):
        # Get the batch
        batch = next(data_iter)
        
        if train_model:
            step += 1
            #if step % 100 == 0:
            #    args['gumbel_temprature'] = max(np.exp((step+1) * -1 * args['gumbel_decay']), .05)

        # Load X and Y
        x_indx = get_x_indx(batch, eval_model)
        text = batch['text']
        y = batch['y']

        if args['cuda']:
            x_indx, y = x_indx.cuda(), y.cuda()

        if train_model:
            optimizer.zero_grad()

        logit, _ = model(x_indx)

        # Calculate the loss
        loss = get_loss(logit, y, args)
        obj_loss = loss
        # Backward step
        if train_model:
            loss.backward()
            optimizer.step()

        # Saving loss
        obj_losses.append(tensor_to_numpy(obj_loss))
        losses.append(tensor_to_numpy(loss))
        
        # Softmax, preds, text and gold
        batch_softmax = F.softmax(logit, dim=-1).cpu()
        preds.extend(torch.max(batch_softmax, 1)[1].view(y.size()).data.numpy())
        texts.extend(text)
        golds.extend(batch['y'].numpy())

    # Get metrics
    epoch_metrics = get_metrics(preds, golds, args)
    epoch_stat = {'loss' : np.mean(losses), 'obj_loss': np.mean(obj_losses)}

    for metric_k in epoch_metrics.keys():
        epoch_stat[metric_k] = epoch_metrics[metric_k]

    return epoch_stat, step, losses, preds, golds


def train_model(train_data, dev_data, model, args):
    '''
    Train model on the training and tune it on the dev set.
    
    If model does not improve dev performance within patience
    epochs, best model is restored and the learning rate halved
    to continue training.

    At the end of training, the function will restore the best model
    on the dev set.

        :param train_data: preprocessed data
        :param dev_data: preprocessed data
        :param model: model to be trained, such as TextCNN
        :param args: hyperparameters
        
        :return epoch_stats: a dictionary of metrics for train and dev
        :return model: best model
    '''
    
    snapshot = '{}'.format(os.path.join(args['save_dir'], args['model_path']))

    if args['cuda']:
        model = model.cuda()

    args['lr'] = args['init_lr']
    optimizer = get_optimizer([model], args)

    num_epoch_sans_improvement = 0
    epoch_stats = init_metrics_dictionary(modes=['train', 'dev'])
    step = 0
    tuning_key = "dev_{}".format(args['tuning_metric'])
    best_epoch_func = min if args['tuning_metric'] == 'loss' else max

    train_loader = get_train_loader(train_data, args)
    dev_loader = get_dev_loader(dev_data, args)

    # For every epoch...
    for epoch in range(1, args['epochs'] + 1):
        print("-------------\nEpoch {}:\n".format(epoch))
        
        # Load the training and dev sets...
        for mode, dataset, loader in [('Train', train_data, train_loader),
                                      ('Dev', dev_data, dev_loader)]:
            
            train_model = mode == 'Train'
            print('{}'.format(mode))
            key_prefix = mode.lower()
            epoch_details, step, _, _, _ = run_epoch(data_loader=loader, train_model=train_model, model=model,
                                                     optimizer=optimizer, step=step, args=args)
            
            epoch_stats, log_statement = collate_epoch_stat(epoch_stats, epoch_details, key_prefix, args)
            
            # Log performance
            print(log_statement)

        # Save model if beats best dev
        best_func = min if args['tuning_metric'] == 'loss' else max
        if best_func(epoch_stats[tuning_key]) == epoch_stats[tuning_key][-1]:
            num_epoch_sans_improvement = 0
            if not os.path.isdir(args['save_dir']):
                os.makedirs(args['save_dir'])
            # Subtract one because epoch is 1-indexed and arr is 0-indexed
            epoch_stats['best_epoch'] = epoch - 1
            torch.save(model, snapshot)
        else:
            num_epoch_sans_improvement += 1

        if not train_model:
            print('---- Best Dev {} is {:.4f} at epoch {}'.format(
                args['tuning_metric'], epoch_stats[tuning_key][epoch_stats['best_epoch']],
                epoch_stats['best_epoch'] + 1))

        # If the number of epochs without improvements is high, reduce the learning rate
        if num_epoch_sans_improvement >= args['patience']:
            print("Reducing learning rate")
            num_epoch_sans_improvement = 0
            model.cpu()
            model = torch.load(snapshot)

            if args['cuda']:
                model = model.cuda()
            args['lr'] *= .5
            optimizer = get_optimizer([model], args)

    # Restore model to best dev performance
    if os.path.exists(snapshot):
        model.cpu()
        model = torch.load(snapshot)

    return epoch_stats, model
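
The bookkeeping in `train_model` (track the best dev epoch, count epochs without improvement, halve the learning rate once patience runs out) can be followed on a toy sequence of dev scores. This is a simplified sketch using only the standard library: the function name `simulate_schedule` and the scores are hypothetical, and saving or restoring the model is left out.

```python
def simulate_schedule(dev_scores, init_lr, patience, higher_is_better=True):
    """Replay the patience logic on a list of per-epoch dev scores."""
    best = max if higher_is_better else min
    lr = init_lr
    history, lrs = [], []
    epochs_sans_improvement = 0
    best_epoch = 0

    for epoch, score in enumerate(dev_scores):
        history.append(score)
        if best(history) == score:
            # New best (ties count): this is where the model would be saved
            epochs_sans_improvement = 0
            best_epoch = epoch
        else:
            epochs_sans_improvement += 1

        if epochs_sans_improvement >= patience:
            # Here the best model would be reloaded before continuing
            epochs_sans_improvement = 0
            lr *= 0.5
        lrs.append(lr)

    return best_epoch, lrs


# Hypothetical dev accuracies for six epochs
best_epoch, lrs = simulate_schedule(
    [0.60, 0.70, 0.69, 0.68, 0.72, 0.71], init_lr=0.001, patience=2)
print(best_epoch, lrs)
```

In this run the score fails to improve at the third and fourth epochs, so the learning rate is halved after the fourth one, and the fifth epoch becomes the new best (index 4, since the list is 0-indexed).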

Let's start to train the model.

In [17]:
epoch_stats, model = train_model(train, dev, encoder, args)

-------------
Epoch 1:

Train


Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/miniconda3/envs/lime_env/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/opt/miniconda3/envs/lime_env/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'Newsgroup' on <module '__main__' (built-in)>


KeyboardInterrupt: 
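
The traceback above is the kind of error that typically appears when `DataLoader` worker processes are started with the `spawn` method (the default on macOS and Windows) from a notebook: each worker has to unpickle the dataset, and a class such as `Newsgroup` that is defined only in the notebook's `__main__` cannot be found there. One common workaround, assuming data loading is not the bottleneck, is to set `args['num_workers']` to 0 so that batches are built in the main process. A minimal sketch with a hypothetical `ToyDataset`:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for the tutorial's dataset class."""

    def __init__(self, n):
        self.x = torch.arange(n).float().unsqueeze(1)
        self.y = torch.arange(n) % 2

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        # Same dictionary layout as the batches used in run_epoch
        return {'x': self.x[i], 'y': self.y[i]}


# num_workers=0 loads the data in the main process: nothing is pickled
# and sent to worker processes, so classes defined in a notebook are fine
loader = DataLoader(ToyDataset(10), batch_size=4, shuffle=False, num_workers=0)
batches = list(loader)
print(len(batches), batches[0]['x'].shape)
```

Another option is to move the dataset class into an importable `.py` module, which keeps multiple workers available.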

The model has been trained for only 4 epochs because of time constraints. Even so, its accuracy on the dev set reaches 91% at the fourth epoch, so we can proceed to evaluate the TextCNN on the test set.

## 8.0 Test the Model

In the previous sections we loaded the embeddings and the dataset, set the hyperparameters, implemented the model as a CNN and trained its parameters. We can now test its performance.

In the following section we describe the code for testing our model.

In [None]:
# Testing the model

def test_model(test_data, model, args):
    '''
    Run the model on test data, and return statistics,
    including loss and accuracy.
    
        :param test_data: test data
        :param model: a model, like TextCNN
        :param args: arguments
        
        :return test_stats:
    '''
    if args['cuda']:
        model = model.cuda()

    # Loading the test data as iterable
    test_loader = torch.utils.data.DataLoader(
        test_data,
        batch_size=args['batch_size'],
        shuffle=False,
        num_workers=args['num_workers'],
        drop_last=False)

    # init_metrics_dictionary is defined in section 7.1
    test_stats = init_metrics_dictionary(modes=['test'])

    mode = 'Test'
    train_model = False
    key_prefix = mode.lower()
    print("-------------\nTest")
    epoch_details, _, losses, preds, golds = run_epoch(
        data_loader=test_loader,
        train_model=train_model,
        model=model,
        optimizer=None,
        step=None,
        args=args)

    test_stats, log_statement = collate_epoch_stat(test_stats, epoch_details, 'test', args)
    test_stats['losses'] = losses
    test_stats['preds'] = preds
    test_stats['golds'] = golds

    print(log_statement)

    return test_stats

This simple function calls run_epoch on the test set and returns the metric statistics, which are printed by the code below.

In [None]:
stats = test_model(test, model, args)

-------------
Test


100%|██████████| 58/58 [00:11<00:00,  5.12it/s]


 cross_entropy -  -loss: 1.2976808548 -f1: 0.585272234341 -recall: 0.595567109044 -precision: 0.596501985905 -obj_loss: 1.2976808548 -mse: NA -confusion_matrix: [[  9   0   1   0   0   9   4   2   9   2   3   5   3   9  19 147  13   2
   21  53]
 [  0 184  22  11  19  76  23   2   7   1   0   8  13   3   9   1   0   2
    1   2]
 [  0  14 205  47  14  52   8   4   6   1   1   3   3   4   4   3   4   0
    2   4]
 [  1   3  34 212  47   9  26   3   3   3   2   1  38   1   1   0   1   0
    0   0]
 [  1   3   9 137 121   2  35   2   3   1   3   3  39   6   3   1   1   0
    0   1]
 [  0  34  29  12   8 265  12   2   1   4   0   7   7   1   7   0   0   1
    0   0]
 [  0   1   7  35   5   1 281  14   6   2   3   1  13   2   4   2   3   0
    1   1]
 [  0   2   0   0   0   2  20 260  37   2   4   2  18   2   6   2  14   0
    1   2]
 [  0   1   2   4   4   3  12  34 263   6   3   1  12   7   7   3   8   1
    8   7]
 [  1   1   0   1   0   6   4   0   7 289  46   1   3   6   0   1   3   1
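
The confusion matrix printed above can be read row by row. Assuming the scikit-learn convention of `sklearn.metrics.confusion_matrix` (rows are gold labels, columns are predictions), the recall of class i is the diagonal entry divided by the row sum. The sketch below uses only the standard library and the first three rows copied from the output; the variable names are illustrative.

```python
# First three rows of the confusion matrix, copied from the output above
rows = {
    0: [9, 0, 1, 0, 0, 9, 4, 2, 9, 2, 3, 5, 3, 9, 19, 147, 13, 2, 21, 53],
    1: [0, 184, 22, 11, 19, 76, 23, 2, 7, 1, 0, 8, 13, 3, 9, 1, 0, 2, 1, 2],
    2: [0, 14, 205, 47, 14, 52, 8, 4, 6, 1, 1, 3, 3, 4, 4, 3, 4, 0, 2, 4],
}

# Recall of a class: correctly classified documents / all documents of that class
recall = {label: counts[label] / sum(counts) for label, counts in rows.items()}

# Most frequent wrong prediction for each class
most_confused = {
    label: max((c, j) for j, c in enumerate(counts) if j != label)[1]
    for label, counts in rows.items()
}

for label in rows:
    print(label, round(recall[label], 3), most_confused[label])
```

For class 0, only 9 of its 311 test documents are recovered, and 147 of them are assigned to class 15, which accounts for most of the errors on that class.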




## 9.0 Conclusions

In this tutorial we have seen how to develop TextCNN, a Convolutional Neural Network for text classification. Although this model performs well on the 20 Newsgroup dataset after only a few epochs of training, more complex datasets are likely to require modifications and improvements.

Possible extensions include character and positional embeddings, additional layers, recurrent neural network layers, and attention. We leave to the reader the joy of trying new approaches and exploring the related literature.