# Character-level Convolutional Networks for Text Classification

In this implementation, we will use character-based features to classify text. Character-based features are powerful and have several advantages over word-based features: they need no tokenizer or word vocabulary, and they cope naturally with misspellings and rare words. The paper we are going to implement in this recipe was published as [Character-level Convolutional Networks for Text Classification](https://arxiv.org/pdf/1509.01626.pdf) by Xiang Zhang and coworkers.

This network is a deep convolutional network with six convolution layers, followed by dense layers. Each convolution layer is a `Conv1d`. The schematic diagram of the model is shown below.

![](figures/character_cnn.png)

Figure: Illustration of the model

## Importing Requirements

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from sklearn.metrics import accuracy_score
from torch import nn
from tqdm import tqdm
from torch.utils.data import DataLoader
from torch.utils.data import Dataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


## Pre-processing

To demonstrate character-based text classification, we will use the AG News dataset for a supervised learning task. AG News is a collection of more than 1 million news articles, collected from more than 2,000 news sources. The AG News corpus can be downloaded from https://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html. The downloaded corpus is in XML format and the data has to be extracted from it first. To avoid these additional steps, I have placed a cleaner version of the AG News dataset in the `Ch5/data` folder. The dataset is divided into train and test splits, and both files, `Ch5/data/ag_news.train` and `Ch5/data/ag_news.test`, are kept in a ready-to-use format.

In [None]:
train_file = 'data/ag_news.train'
test_file = 'data/ag_news.test'

In [None]:
def parse_label(label):
    '''
    Get the actual labels from label string
    Input:
        label (string) : labels of the form '__label__2'
    Returns:
        label (int) : integer value corresponding to label string
    '''
    return int(label.strip()[-1]) - 1

In [None]:
def get_pandas_df(filename):
    '''
    Load the data into a pandas.DataFrame object with `text` and `label` columns.
    Each line of the file is split at the first comma into a label string and the text.
    '''
    with open(filename, 'r') as datafile:
        data = [line.strip().split(',', maxsplit=1) for line in datafile]
        data_text = list(map(lambda x: x[1], data))
        data_label = list(map(lambda x: parse_label(x[0]), data))

    full_df = pd.DataFrame({"text":data_text, "label":data_label})
    return full_df
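
To make the expected file layout concrete, here is a minimal sketch of how two lines would be parsed. The sample lines are invented for illustration and are not taken from the dataset; the exact spacing around the comma in the real files is an assumption. Unlike `get_pandas_df`, this sketch also strips whitespace around the text.

```python
def parse_label(label):
    # Same logic as above: the last character of '__label__N' is the class number
    return int(label.strip()[-1]) - 1

# Hypothetical lines in the '<label string>,<text>' layout
sample_lines = [
    "__label__3 , stocks rally as oil prices fall, analysts say\n",
    "__label__1 , un council meets over border dispute\n",
]

# Split only at the first comma so commas inside the text are preserved
rows = [line.strip().split(",", maxsplit=1) for line in sample_lines]
labels = [parse_label(label) for label, _ in rows]
texts = [text.strip() for _, text in rows]

print(labels)    # [2, 0]
print(texts[0])  # stocks rally as oil prices fall, analysts say
```

Because `parse_label` reads only the last character of the label string, it works for up to nine classes, which is enough for the four AG News categories.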



In [None]:
def get_iterators(config, train_file, test_file):
    """
    Prepare iterators (DataLoaders) for the train and test data
    """
    train_set = MyDataset(train_file, config)
    test_set = MyDataset(test_file, config)

    # Shuffle only the training data; the test data keeps its original order
    train_iterator = DataLoader(train_set, batch_size=config.batch_size, shuffle=True)
    test_iterator = DataLoader(test_set, batch_size=config.batch_size)
    return train_iterator, test_iterator
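
If a held-out validation split of the training data is wanted in addition to the test set, one possible approach is `torch.utils.data.random_split`. The sketch below uses a small stand-in `TensorDataset` rather than `MyDataset`, and the 90/10 ratio and the seed are assumptions chosen for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset: 100 samples shaped (max_len=30, vocab_size=68) with labels 0..3
full_train = TensorDataset(torch.zeros(100, 30, 68), torch.randint(0, 4, (100,)))

train_size = int(0.9 * len(full_train))
val_size = len(full_train) - train_size

# A seeded generator keeps the split reproducible between runs
generator = torch.Generator().manual_seed(0)
train_subset, val_subset = random_split(full_train, [train_size, val_size], generator=generator)

print(len(train_subset), len(val_subset))  # 90 10
```

Each subset can then be wrapped in its own `DataLoader`, exactly like the train and test sets.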



## Converting to character features

The character representation is generated by the code given below. The class has two main methods. `__init__` defines the character vocabulary and stores the texts, the labels and the length of the dataset. `__getitem__` constructs the character-based feature matrix for a single example.

The character representation relies on a fixed character vocabulary. For English text, the following 68 characters are used (written with the same Python escapes as in the code, so `\"` and `\\` each stand for a single character):  
``` abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}```
All other characters are ignored. Note that the code does not lowercase the text, so uppercase letters are skipped as well. The maximum length of a sentence or document is fixed: in the paper it is 1014 characters, while here, considering the size of our dataset, it is fixed at 300. Longer texts are truncated, and shorter ones are padded with all-zero rows.
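
The encoding described above can be sketched on a toy example. The three-letter alphabet, the `encode` helper and the tiny `max_len` are hypothetical and chosen only to keep the output readable; the filtering, truncation and zero-padding follow the same order as in the dataset class.

```python
import numpy as np

vocabulary = list("abc")  # toy alphabet (assumption); the real one has 68 characters
max_len = 5               # toy maximum length (assumption); the real one is 300

def encode(text, vocabulary, max_len):
    """One-hot encode `text` into a (max_len, vocab_size) matrix."""
    char_to_index = {ch: i for i, ch in enumerate(vocabulary)}
    matrix = np.zeros((max_len, len(vocabulary)), dtype=np.float32)
    # Drop characters outside the vocabulary first, then truncate to max_len
    kept = [ch for ch in text if ch in char_to_index][:max_len]
    for position, ch in enumerate(kept):
        matrix[position, char_to_index[ch]] = 1.0
    # Rows beyond len(kept) stay zero, which acts as padding
    return matrix

features = encode("cab!a", vocabulary, max_len)
print(features)
```

Here `!` is skipped, the four remaining characters fill the first four rows, and the fifth row stays all zeros.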

In [None]:
class MyDataset(Dataset):
    """
    preparing 2D character array from the text
    """
    def __init__(self, data_path, config):
        """
        Defining character set
        """
        self.config = config
        self.vocabulary = list("""abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}""")
        self.identity_mat = np.identity(len(self.vocabulary))
        data = get_pandas_df(data_path)
        self.texts = list(data.text)
        self.labels = list(data.label)
        self.length = len(self.labels)

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        raw_text = self.texts[index]
        data = np.array([self.identity_mat[self.vocabulary.index(i)] for i in list(raw_text) if i in self.vocabulary],
                        dtype=np.float32)
        if len(data) > self.config.max_len:
            data = data[:self.config.max_len]
        elif 0 < len(data) < self.config.max_len:
            data = np.concatenate(
                (data, np.zeros((self.config.max_len - len(data), len(self.vocabulary)), dtype=np.float32)))
        elif len(data) == 0:
            data = np.zeros((self.config.max_len, len(self.vocabulary)), dtype=np.float32)
        label = self.labels[index]
        return data, label

## Defining Network

As discussed above, the network has 6 convolution layers, and each layer is constructed using the `nn.Sequential` module. The output frame length after the last convolutional layer (and before any of the fully connected layers) is $l_6 = (max\_len - 96)/27$, rounded down. Multiplying this number by the frame size (the number of channels) at layer 6 gives the input dimension that the first fully connected layer accepts. The convolution layers are followed by 3 fully connected layers, the last of which produces one output per class.
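
The frame-length formula can be checked by walking through the six blocks one at a time. This is a small sketch assuming stride-1 convolutions without padding and non-overlapping max-pooling, with kernel sizes 7, 7, 3, 3, 3, 3 and pooling of size 3 after blocks 1, 2 and 6; `conv_output_length` is a helper name introduced here.

```python
def conv_output_length(max_len):
    # (kernel_size, pool_size) per block; pool_size None means no pooling
    blocks = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]
    length = max_len
    for kernel_size, pool_size in blocks:
        length = length - kernel_size + 1   # Conv1d with stride 1 and no padding
        if pool_size is not None:
            length = length // pool_size    # MaxPool1d with stride equal to its kernel size
    return length

for max_len in (300, 1014):
    # Layer-by-layer result next to the closed-form (max_len - 96) // 27
    print(max_len, conv_output_length(max_len), (max_len - 96) // 27)
# 300 7 7
# 1014 34 34
```

With `max_len = 300` and 256 channels, the first fully connected layer therefore receives 7 × 256 = 1792 inputs.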

Each convolution layer is followed by a Rectified Linear Unit (ReLU), and in layers 1, 2 and 6 a max-pool operation is applied afterwards to concentrate features. In short, a convolution block with pooling looks like the one given below.

```python
conv1 = nn.Sequential(
    nn.Conv1d(in_channels=self.config.vocab_size, out_channels=self.config.num_channels, kernel_size=7),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=3)
)
```

To construct each individual block I have used the PyTorch container `nn.Sequential`, which helps to keep the network tidy and easy to understand. In a larger network, each sub-block is designed as an `nn.Sequential`, and these sub-blocks are then combined to form the entire network. `nn.Sequential` is a container: modules are executed in the order in which they are passed to the constructor. You can also pass an ordered dictionary to `nn.Sequential`, which gives each module a name. An example of both approaches is given below.

```python
from collections import OrderedDict
from torch import nn

# Constructing Sequential block with Stacking
model = nn.Sequential(
    nn.Conv2d(1, 20, 5),
    nn.ReLU(),
    nn.Conv2d(20, 64, 5),
    nn.ReLU()
)

# Constructing Sequential block with OrderedDict
model = nn.Sequential(OrderedDict([
    ('conv1', nn.Conv2d(1, 20, 5)),
    ('relu1', nn.ReLU()),
    ('conv2', nn.Conv2d(20, 64, 5)),
    ('relu2', nn.ReLU())
]))
```
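
One practical benefit of the `OrderedDict` form is that sub-modules can be reached by name as well as by position. The following sketch illustrates this on the same toy block; the 28×28 single-channel input is an arbitrary size chosen for the example.

```python
from collections import OrderedDict

import torch
from torch import nn

model = nn.Sequential(OrderedDict([
    ("conv1", nn.Conv2d(1, 20, 5)),
    ("relu1", nn.ReLU()),
    ("conv2", nn.Conv2d(20, 64, 5)),
    ("relu2", nn.ReLU()),
]))

print(model.conv1)  # access by name
print(model[2])     # access by position (this is conv2)

# Each 5x5 convolution without padding shrinks height and width by 4
x = torch.zeros(1, 1, 28, 28)
out = model(x)
print(out.shape)  # torch.Size([1, 64, 20, 20])
```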


In [None]:
class CharCNN(nn.Module):
    def __init__(self, config):
        super(CharCNN, self).__init__()
        self.config = config
        conv1 = nn.Sequential(
            nn.Conv1d(in_channels=self.config.vocab_size, out_channels=self.config.num_channels, kernel_size=7),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=3)
        ) 
        conv2 = nn.Sequential(
            nn.Conv1d(in_channels=self.config.num_channels, out_channels=self.config.num_channels, kernel_size=7),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=3)
        ) 
        conv3 = nn.Sequential(
            nn.Conv1d(in_channels=self.config.num_channels, out_channels=self.config.num_channels, kernel_size=3),
            nn.ReLU()
        ) 
        conv4 = nn.Sequential(
            nn.Conv1d(in_channels=self.config.num_channels, out_channels=self.config.num_channels, kernel_size=3),
            nn.ReLU()
        ) 
        conv5 = nn.Sequential(
            nn.Conv1d(in_channels=self.config.num_channels, out_channels=self.config.num_channels, kernel_size=3),
            nn.ReLU()
        ) 
        conv6 = nn.Sequential(
            nn.Conv1d(in_channels=self.config.num_channels, out_channels=self.config.num_channels, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=3)
        )
        
        conv_output_size = self.config.num_channels * ((self.config.max_len - 96) // 27)
        
        linear1 = nn.Sequential(
            nn.Linear(conv_output_size, self.config.linear_size),
            nn.ReLU(),
            nn.Dropout(self.config.dropout_keep)
        )
        linear2 = nn.Sequential(
            nn.Linear(self.config.linear_size, self.config.linear_size),
            nn.ReLU(),
            nn.Dropout(self.config.dropout_keep)
        )
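        linear3 = nn.Sequential(
            # Return raw scores (logits): nn.CrossEntropyLoss applies log-softmax itself,
            # so adding nn.Softmax here would normalize the outputs twice
            nn.Linear(self.config.linear_size, self.config.output_size)
        )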
        
        self.convolutional_layers = nn.Sequential(conv1,conv2,conv3,conv4,conv5,conv6)
        self.linear_layers = nn.Sequential(linear1, linear2, linear3)
        
        # Initialize Weights
        self._create_weights(mean=0.0, std=0.05)
    
    def _create_weights(self, mean=0.0, std=0.05):
        """
        Function to initialize weights
        """
        for module in self.modules():
            if isinstance(module, nn.Conv1d) or isinstance(module, nn.Linear):
                module.weight.data.normal_(mean, std)
    
    def forward(self, embedded_sent):
        # (batch_size, max_len, vocab_size) -> (batch_size, vocab_size, max_len), as Conv1d expects
        embedded_sent = embedded_sent.transpose(1, 2)
        conv_out = self.convolutional_layers(embedded_sent)
        conv_out = conv_out.view(conv_out.shape[0], -1)
        linear_output = self.linear_layers(conv_out)
        return linear_output
    
    def add_optimizer(self, optimizer):
        """
        Function to add optimizer
        """
        self.optimizer = optimizer
        
    def add_loss_op(self, loss_op):
        """
        Function to add loss
        """
        self.loss_op = loss_op
    
    def reduce_lr(self):
        for g in self.optimizer.param_groups:
            g['lr'] = g['lr'] / 2
        print("Reducing Learning Rate to : ", g['lr'])

## Defining model configs

In [None]:
class Config(object):
    num_channels = 256
    linear_size = 256
    output_size = 4
    max_epochs = 10
    lr = 0.001
    batch_size = 128
    vocab_size = 68
    max_len = 300 # 1014 in original paper
    dropout_keep = 0.5 # passed to nn.Dropout, where it is the probability of dropping a unit
config = Config()

In [None]:
train_iterator, test_iterator = get_iterators(config, train_file, test_file)

## Visualizing features
The character-based feature matrix, with 68 unique characters and a maximum sentence/document length of 300, is shown below. For each position in the sentence/document, the entry of the character found at that position is set to 1; all other entries are kept at zero.

![](figures/character_represenation.png)
Figure: How the characters are given as features. The maximum length of the sentence is 300, and each character can be one of the 68 predefined characters. If a character is present at a given index in the sentence, its entry at that index is marked as 1 and all the rest remain zero. In the figure, the presence of a character at a particular location is shown in yellow (1).

In [None]:
for x, y in train_iterator:
    # x[0] has shape (max_len, vocab_size); transpose so positions run along the x-axis
    plt.imshow(x[0].numpy().T, cmap='viridis')
    break

## Training and Validation related functions

In [None]:
model = CharCNN(config)
model.train()
optimizer = torch.optim.SGD(model.parameters(), lr=config.lr)
loss_fn = nn.CrossEntropyLoss()
model.add_optimizer(optimizer)
model.add_loss_op(loss_fn)
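
When pairing a network with `nn.CrossEntropyLoss`, it helps to keep in mind that this loss expects raw, unnormalized scores (logits) and applies log-softmax internally. The short check below illustrates this on random scores; the tensor sizes and the seed are arbitrary.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 4)             # (batch_size, num_classes), unnormalized scores
targets = torch.tensor([0, 1, 2, 3])   # one class index per sample

# CrossEntropyLoss on logits ...
ce = nn.CrossEntropyLoss()(logits, targets)
# ... matches negative log-likelihood applied to log-softmax outputs
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(ce, nll))  # True
```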

In [None]:
model.to(device)
loss_fn.to(device)

In [None]:
def check_accuracy(y_pred, y):
    pred_class = torch.argmax(y_pred, dim=1)
    acc  = accuracy_score(pred_class.cpu(), y.cpu())
    return acc
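
The accuracy computation can be illustrated on a tiny hand-made batch. This sketch uses plain tensor operations instead of `accuracy_score`; the helper name `batch_accuracy` and the scores are made up for the example.

```python
import torch

def batch_accuracy(scores, labels):
    # The predicted class is the index of the largest score in each row
    predicted = torch.argmax(scores, dim=1)
    return (predicted == labels).float().mean().item()

scores = torch.tensor([[0.1, 0.7, 0.1, 0.1],   # predicts class 1
                       [0.6, 0.2, 0.1, 0.1],   # predicts class 0
                       [0.2, 0.2, 0.5, 0.1]])  # predicts class 2
labels = torch.tensor([1, 0, 3])

# Two of the three predictions match the labels
print(batch_accuracy(scores, labels))
```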

In [None]:
def eval_batch(model, test_iterator):
    accuracy = []
    was_training = model.training
    model.eval()  # disable dropout while evaluating
    with torch.no_grad():  # no gradients are needed for evaluation
        for x, y in test_iterator:
            y_pred = model(x.to(device))
            accuracy.append(check_accuracy(y_pred, y))
    model.train(was_training)  # restore the previous mode
    return np.average(np.array(accuracy))

In [None]:
def run_batch(model, optimizer, loss_fn, train_iterator, test_iterator, epoch):
    train_losses = []
    train_accuracy = []
    test_accuracy = []
    # Halve the learning rate every 3 epochs
    if epoch > 0 and epoch % 3 == 0:
        model.reduce_lr()
    for x, y in tqdm(train_iterator):
        optimizer.zero_grad()
        y_pred = model(x.to(device))
        loss = loss_fn(y_pred, y.to(device))
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
        train_accuracy.append(check_accuracy(y_pred, y))
        # Evaluated after every training batch, which is slow but gives a per-iteration curve
        test_accuracy.append(eval_batch(model, test_iterator))
    return train_accuracy, test_accuracy, train_losses
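
The same halving schedule (divide the learning rate by 2 every 3 epochs) can also be expressed with PyTorch's built-in `StepLR` scheduler. This is an alternative sketch, not the approach used in this notebook; the `nn.Linear` stand-in model is an assumption made to keep the example self-contained.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)  # stand-in model (assumption), not the CharCNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
# Multiply the learning rate by gamma=0.5 once every step_size=3 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

learning_rates = []
for epoch in range(7):
    learning_rates.append(optimizer.param_groups[0]["lr"])
    # ... the training batches for this epoch would run here ...
    optimizer.step()   # called once so the scheduler sees an optimizer step
    scheduler.step()   # advance the schedule at the end of each epoch

print(learning_rates)
```

The recorded rate stays at 0.001 for epochs 0 to 2, drops to 0.0005 for epochs 3 to 5 and to 0.00025 at epoch 6.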
        

In [None]:
TRAIN_ACC = []
TRAIN_LOSS = []
TEST_ACC = [] 
for i in range(0,1):
    train_accuracy, test_accuracy, train_losses = run_batch(model, optimizer, loss_fn, train_iterator, test_iterator, i)
    TRAIN_ACC.extend(train_accuracy)
    TEST_ACC.extend(test_accuracy)
    TRAIN_LOSS.extend(train_losses)

## Performance plotting
When the network is trained for several epochs, the accuracy on the train and test sets increases gradually and the loss decreases.

![](figures/char_cnn_progress.png)

Figure: Performance of the character-level CNN on the text classification task.

With the implementation given above, I finally managed to get above 80% accuracy on both the training and the test dataset.

In [None]:
plt.plot(TRAIN_ACC , label = "Train Accuracy")
plt.plot(TRAIN_LOSS , label = "Train Loss")
plt.plot(TEST_ACC, label = "Test Accuracy")
plt.ylabel("Accuracy/Loss")
plt.xlabel("Iteration")
plt.legend(loc='upper left')
plt.show()
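
Per-iteration curves tend to be noisy, so it can help to smooth them before plotting. The sketch below applies a simple moving average with `np.convolve`; the window size and the sample values are arbitrary and not taken from an actual run.

```python
import numpy as np

def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    values = np.asarray(values, dtype=np.float64)
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the window fits entirely
    return np.convolve(values, kernel, mode="valid")

noisy_accuracy = [0.2, 0.4, 0.3, 0.5, 0.6, 0.5, 0.7]  # made-up values
smoothed = moving_average(noisy_accuracy, window=3)
print(smoothed)
```

The smoothed series is `window - 1` points shorter than the input, so a list such as `TRAIN_ACC` would lose its first few iterations on the plot.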