# CharCNN Model

Detailed theory regarding the `CharCNN` can be found in this [paper](https://arxiv.org/pdf/1509.01626.pdf). The basic idea of a `CharCNN` follows the process below:

 1. Obtain text data in raw form
 2. Apply quantisation to transform text data into fixed numerical input format
 3. Apply 1-D convolution to quantised text data to capture temporal features
 4. Learn the weights for the CNN (with 1-D convolution layers) using gradient descent
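
As a rough illustration of step 2, the sketch below one-hot encodes a string into a fixed-size matrix. It is not the project's preprocessing code (that lives in the `Data Preparation` notebook). The helper name `quantise`, the lowercasing, and the handling of unknown characters (left as all-zero columns) are assumptions made for this example.

```python
import string
import numpy as np

# Assumed 70-character alphabet: lowercase letters, digits, punctuation, space and newline
ALPHABET = string.ascii_lowercase + string.digits + "-,;.!?:'\"/\\|_@#$%^&*~`+ =<>()[]{}\n"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def quantise(text: str, max_len: int = 1014) -> np.ndarray:
    """One-hot encode `text` into a (len(ALPHABET) x max_len) matrix (hypothetical helper)."""
    matrix = np.zeros((len(ALPHABET), max_len), dtype=np.float32)
    # Lowercase and truncate to max_len; shorter texts are implicitly zero-padded
    for position, ch in enumerate(text.lower()[:max_len]):
        index = CHAR_TO_INDEX.get(ch)
        if index is not None:  # characters outside the alphabet stay as all-zero columns
            matrix[index, position] = 1.0
    return matrix

encoded = quantise("Hello!")
print(encoded.shape)           # (70, 1014)
print(encoded[:, 0].argmax())  # 7, the index of 'h' in the alphabet
```

Each column holds at most one non-zero entry, so the matrix is very sparse. The convolution layers then learn which character patterns matter.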

The advantages of the `CharCNN`, as claimed by the authors of the original paper, include:
 - CNNs do not require the knowledge of words
 - CNNs do not require knowledge about the syntactic or semantic structure of a language
 - The previous point leads to a simplified engineering solution where a single model may be applied to multiple languages
 - Abnormal character combinations such as misspellings and emoticons may be naturally learnt
 - Generally CNNs are faster to train when compared to RNNs (typically used for NLP tasks)
 
However, the advantages do come with a prerequisite that the model is trained on a sufficiently large dataset that is reasonably balanced.

***Note: The main "trick" for a `CharCNN` is the text quantisation preprocessing combined with 1-D convolution (2-D convolution is not suited to this case, as text is sequential and can therefore be treated as one-dimensional).***

The data preprocessing and text quantisation steps are detailed in the `Data Preparation` notebook. The following sections cover some of the technical details of the `CharCNN`.

## 1. CharCNN modules

### 1.1. 1-D convolutional module

To explain 1-D convolution, the simplest approach is by demonstration. The images below show 1-D convolution with 1-D and 2-D inputs.

<center>
    <img src=./imgs/conv1d_1din.jpg height=700 width=700>
        <figcaption>
            <b>1-D convolution with 1-D input</b>
        </figcaption>
    <img src=./imgs/conv1d_2din.jpg height=700 width=700>
        <figcaption>
            <b>1-D convolution with 2-D input</b>
        </figcaption>
</center>

The principle is exactly the same as 2-D convolution: slide a kernel across the input matrix, apply element-wise multiplication and sum the results to produce a single output element. The only difference is that with 1-D convolution the kernel slides in one direction only (horizontally in the images above).

*Note: the output of a 1-D convolution with a single filter is a 1-D matrix regardless of the input dimension. If many filters are used (i.e. multiple output channels), the 1-D outputs are stacked to form a 2-D matrix with dimension (N x L), where N is the number of filters / channels and L is the output length. This is explained quite well in this StackOverflow [answer](https://stackoverflow.com/questions/42883547/intuitive-understanding-of-1d-2d-and-3d-convolutions-in-convolutional-neural-n).*
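
To make the sliding-window description concrete, the sketch below computes a 1-D convolution by hand and compares it with PyTorch's built-in `torch.nn.functional.conv1d`. The tensor sizes are arbitrary toy values chosen for illustration, and no bias or padding is used.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(1, 2, 6)       # (batch, channels, length): a 2-row input of length 6
kernel = torch.rand(1, 2, 3)  # (filters, channels, kernel size): one filter spanning both rows

# Slide the kernel along the length axis only, multiplying element-wise and summing
manual = torch.stack([(x[0, :, i:i + 3] * kernel[0]).sum() for i in range(6 - 3 + 1)])

# The built-in (bias-free) convolution should give the same four values
builtin = F.conv1d(x, kernel).flatten()
print(torch.allclose(manual, builtin, atol=1e-6))

# With four filters, the four 1-D outputs are stacked along the channel dimension
multi = F.conv1d(x, torch.rand(4, 2, 3))
print(multi.shape)  # torch.Size([1, 4, 4])
```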

Taking our scenario as an example, the input shape will be (70 x 1014) for a single sample (in practice there is also a leading batch dimension). This originates from the fact that we have 70 characters in the alphabet and 1014 positions, set by the assumed maximum text input length. In this case, each filter in the first 1-D convolutional layer has size (70 x K), where K is the kernel size. The output of the convolutional layer will then be (C x 1014), where C is the number of filters. Note this assumes that 'SAME' padding is applied and stride = 1.

This is demonstrated below in PyTorch.

In [15]:
from torch import Tensor, nn
import numpy as np
from math import ceil

In [51]:
input_matrix = np.random.rand(1, 70, 1014)
input_tensor = Tensor(input_matrix)
kernel_size = 7
padding = ceil((kernel_size-1)/2) # Assuming default dilation and stride
conv = nn.Conv1d(in_channels=70, out_channels=256, kernel_size=kernel_size, stride=1, padding=padding)
conv(input_tensor).shape

torch.Size([1, 256, 1014])

*Note: to apply 'SAME' padding, the formula below can be used:*

<img src=./imgs/conv1d_outshape.png height=700 width=700>
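
For reference, the same relationship can be written as a small helper. This sketch assumes the image shows the standard output-length relation given in the PyTorch documentation for `Conv1d` and `MaxPool1d`; the function name `conv1d_out_len` is made up for this example.

```python
from math import floor

def conv1d_out_len(l_in: int, kernel_size: int, stride: int = 1,
                   padding: int = 0, dilation: int = 1) -> int:
    """Output length of a 1-D convolution or pooling layer (hypothetical helper)."""
    return floor((l_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)

# 'SAME' padding for an odd kernel with stride 1 and dilation 1 is (kernel_size - 1) / 2
same_len = conv1d_out_len(1014, kernel_size=7, padding=3)
print(same_len)    # 1014

# Non-overlapping pooling with kernel size 3 shrinks the length by roughly a factor of 3
pooled_len = conv1d_out_len(1014, kernel_size=3, stride=3)
print(pooled_len)  # 338
```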

### 1.2. 1-D MaxPooling module

Max Pooling as a concept is very simple to understand. It is used to decrease the size of feature vectors while retaining essential information. The pooling technique exists to allow for the training of deeper neural networks.

For a given kernel size and stride, max pooling slides a window over the input matrix and returns the maximum value within each window as a single output element. For non-overlapping max pooling, the stride should be set equal to the kernel size.

Below is a demonstration of 1-D max pooling using PyTorch.

In [39]:
from torch import Tensor, nn
import numpy as np

In [64]:
input_matrix = np.random.rand(1, 2, 9)
input_tensor = Tensor(input_matrix)
kernel_size = 3
stride = kernel_size # for non-overlapping max pooling
out_shape = (input_tensor.shape[2] - (kernel_size - 1) - 1) // stride + 1 # floor division, since MaxPool1d rounds down by default
max_pool = nn.MaxPool1d(kernel_size=kernel_size, stride=stride, padding=0)
print(f"Input Shape: {input_tensor.shape}")
print(f"Output Shape: {max_pool(input_tensor).shape}")
print(f"Calculated output feature dimension: {out_shape}")

Input Shape: torch.Size([1, 2, 9])
Output Shape: torch.Size([1, 2, 3])
Calculated output feature dimension: 3


*Note: the output feature dimension can be calculated using the same formula detailed in the 1-D convolution section. A rule of thumb is that for non-overlapping pooling, the feature dimension decreases by a factor equal to the kernel size.*
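
To see what the stride changes, the sketch below pools the same hand-written sequence with a non-overlapping window (stride equal to the kernel size) and with an overlapping one (stride 1). The input values are arbitrary and chosen so the maxima are easy to check by eye.

```python
import torch
from torch import nn

x = torch.tensor([[[1., 5., 2., 4., 3., 6., 0., 7., 8.]]])  # shape (1, 1, 9)

non_overlapping = nn.MaxPool1d(kernel_size=3, stride=3)(x)
overlapping = nn.MaxPool1d(kernel_size=3, stride=1)(x)

print(non_overlapping)  # tensor([[[5., 6., 8.]]])
print(overlapping)      # tensor([[[5., 5., 4., 6., 6., 7., 8.]]])
```

Only the non-overlapping case shrinks the length by a factor of the kernel size (9 to 3). With stride 1 the length only drops by `kernel_size - 1`.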

In [65]:
input_matrix

array([[[0.83814695, 0.07839527, 0.59405836, 0.75269495, 0.34446287,
         0.11398736, 0.71846285, 0.63767295, 0.91154655],
        [0.39194546, 0.47190193, 0.58148252, 0.28835472, 0.94771464,
         0.05098323, 0.67678279, 0.51066291, 0.73233075]]])

In [66]:
max_pool(input_tensor)

tensor([[[0.8381, 0.7527, 0.9115],
         [0.5815, 0.9477, 0.7323]]])

**Max pooling is applied to each channel independently, so an input with many channels (a 2-D matrix) produces a stack of 1-D outputs, one per channel.**

### 1.3. Activation function

ReLU is used as the activation function, as detailed in the paper by Zhang et al.:

h(x) = max{0, x}
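
A quick check of this definition with PyTorch's `nn.ReLU` on arbitrary example values:

```python
import torch
from torch import nn

x = torch.tensor([-2., -1., 0., 1., 3.])
activated = nn.ReLU()(x)
print(activated)  # tensor([0., 0., 0., 1., 3.])

# Equivalent to clamping every element at zero from below
print(torch.equal(activated, torch.clamp(x, min=0)))  # True
```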

### 1.4. Fully connected layers

These are linear layers usually positioned at the tail end of a neural network, which typically act as a classifier. There are two things to note:

 - The last fully connected layer will have the same number of units/neurons/outputs as the number of classes existing for prediction.
 - The first fully connected layer input dimension must correspond to the flattened output dimension of the last convolutional layer.
 
These two points are demonstrated using PyTorch below.

In [67]:
from torch import Tensor, nn
import numpy as np

In [96]:
input_matrix = np.random.rand(1, 70, 1014)
input_tensor = Tensor(input_matrix)
kernel_size = 7
padding = ceil((kernel_size-1)/2) # Assuming default dilation and stride
conv = nn.Conv1d(in_channels=70, out_channels=256, kernel_size=kernel_size, stride=1, padding=padding)
conv_out = conv(input_tensor)
lin_input_features = conv_out.view(conv_out.size(0), -1).shape[1] # Keep the batch dimension intact and flatten out all of the other dimensions
linear = nn.Linear(in_features=lin_input_features, out_features=3) # Set the input features equal to the flattened output features from the convolution layer
                                                                   # Set the output features equal to the number of classes for prediction, 3 in this case
linear_out = linear(conv_out.view(conv_out.size(0), -1)) # Pass the flattened convolutional layer output through the linear layer
print(f"Output shape of convolution layer: {conv_out.shape}")
print(f"Input features to linear layer: {lin_input_features}")
print(f"Output shape of the linear layer: {linear_out.shape}")

Output shape of convolution layer: torch.Size([1, 256, 1014])
Input features to linear layer: 259584
Output shape of the linear layer: torch.Size([1, 3])


### 1.5. Dropout layers

Dropout is a regularisation technique used in deep learning. The concept is simple: for each training batch / iteration, ignore a randomly chosen fraction (p) of the units in a hidden layer. Intuitively, this can be thought of as training many different models, one for each batch of data. Dropout regularisation is illustrated in the image below.

<center>
    <img src=./imgs/dropout.png height=700 width=700>
</center>
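
The behaviour can also be observed directly with `nn.Dropout`. In the sketch below the input of ones is an arbitrary choice that makes the effect easy to read. In training mode PyTorch zeroes each unit with probability p and scales the survivors by 1 / (1 - p); in evaluation mode the layer does nothing.

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

dropout.train()
train_out = dropout(x)
print(train_out)  # every entry is either 0 (dropped) or 2 (kept and scaled by 1 / (1 - 0.5))

dropout.eval()
eval_out = dropout(x)
print(eval_out)   # unchanged: dropout is disabled at inference time
```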

## 2. CharCNN Architecture

### 2.1 Model structure

The CharCNN architecture used for this project follows (at least as a starting point) the implementation detailed in the [paper](https://arxiv.org/pdf/1509.01626.pdf) by Zhang et al. The ConvNet is 9 layers deep and contains the following elements:

 - Input: (70 x 1014)
 - Conv1D(1): k=7, s=1, c_in=70, c_out=256, p='SAME'
     - ReLU
     - MaxPool1D: k=3, s=3
 - Conv1D(2): k=7, s=1, c_in=256, c_out=256, p='SAME'
     - ReLU
     - MaxPool1D: k=3, s=3
 - Conv1D(3): k=3, s=1, c_in=256, c_out=256, p='SAME'
     - ReLU
 - Conv1D(4): k=3, s=1, c_in=256, c_out=256, p='SAME'
     - ReLU
 - Conv1D(5): k=3, s=1, c_in=256, c_out=256, p='SAME'
     - ReLU
 - Conv1D(6): k=3, s=1, c_in=256, c_out=256, p='SAME'
     - ReLU
     - MaxPool1D: k=3, s=3
 - Linear(7): c_in=9472, c_out=1024
     - ReLU
     - Dropout: p=0.5
 - Linear(8): c_in=1024, c_out=1024
     - ReLU
     - Dropout: p=0.5
 - Linear(9): c_in=1024, c_out=3 (the number of classes for prediction)
 
**Note**:
 - The number of input features to the first linear layer is calculated by flattening the output matrix from the last convolution layer (excluding batch dimension).
 - There are 12,125,187 trainable parameters within the model.
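
Both figures can be reproduced with simple arithmetic. The sketch below assumes 3 output classes (as used in this project) and that every convolutional and linear layer has a bias term, which is the PyTorch default. The helper names are made up for this example.

```python
def conv1d_params(c_in: int, c_out: int, kernel_size: int) -> int:
    """Weights plus biases of a 1-D convolutional layer."""
    return c_in * c_out * kernel_size + c_out

def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# 'SAME' convolutions keep the length; each of the three pooling layers (k=3, s=3) shrinks it
length = 1014
for _ in range(3):
    length = (length - 3) // 3 + 1   # 1014 -> 338 -> 112 -> 37
flat_features = 256 * length

num_classes = 3
total = (conv1d_params(70, 256, 7) + conv1d_params(256, 256, 7)
         + 4 * conv1d_params(256, 256, 3)
         + linear_params(flat_features, 1024) + linear_params(1024, 1024)
         + linear_params(1024, num_classes))
print(flat_features)  # 9472
print(f"{total:,}")   # 12,125,187
```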

In [1]:
from torch import rand, Tensor, nn, cuda
import numpy as np
from typing import Tuple

class CharCNN(nn.Module):
    """
    Character level CNN implementation as per the paper authored by Zhang et al. The paper 
    is titled "Character-level convolutional Networks for Text Classification - 2016". The 
    model consists of 9 layers including 6 convolutional layers and 3 fully connected layers.
    Details are shown below.
                
                Conv Layers
    -------------------------------------------------
    Layer    Features    Kernel    Pool    Activation
    -----    --------    ------    ----    ----------
      1         256         7        3        ReLU
      2         256         7        3        ReLU
      3         256         3        N/A      ReLU 
      4         256         3        N/A      ReLU
      5         256         3        N/A      ReLU
      6         256         3        3        ReLU
      
                FC Layers
    -----------------------------------
    Layer    Features    Dropout    Activation
    -----    --------    -------    ----------
      7        1024        p=0.5       ReLU
      8        1024        p=0.5       ReLU
      9        TBC         N/A         N/A
      
    Furthermore, the paper initialised the model weights using a Gaussian distribution with 
    a standard deviation of 0.05. Note there were no specific details regarding padding of 
    the convolutional layer inputs, and 'SAME' padding is assumed.
    """
    
    def __init__(self, batch_size: int, alph_len: int, max_len: int, num_classes: int):
        """
        CharCNN model constructor. During model instantiation, the trainable model weights 
        are initialised using Gaussian distribution.
        
        :param batch_size: Number of input samples.
        :type batch_size: int
        :param alph_len: The number of characters in the alphabet for text quantisation. This 
            dictates the dimension of the first convolutional layer.
        :type alph_len: int
        :param max_len: The assumed maximum input text length. This is the input features to 
            the model, and dictates the input feature length for the first linear layer.
        :type max_len: int
        :param num_classes: The number of classes for prediction. This dictates the number of 
            output units of the last linear layer.
        :type num_classes: int
        :return: Nothing.
        :rtype: None
        """
        
        super(CharCNN, self).__init__()
    
        # Convolutional layer architecture
        self.conv_layers = nn.Sequential(
            # conv1 -> (b x 256 x 338): 
            nn.Conv1d(alph_len, 256, 7, padding=3),
            nn.ReLU(),
            nn.MaxPool1d(3),
            # conv2 -> (b x 256 x 112)
            nn.Conv1d(256, 256, 7, padding=3),
            nn.ReLU(),
            nn.MaxPool1d(3),
            # conv3 -> (b x 256 x 112)
            nn.Conv1d(256, 256, 3, padding=1),
            nn.ReLU(),
            # conv4 -> (b x 256 x 112)
            nn.Conv1d(256, 256, 3, padding=1),
            nn.ReLU(),
            # conv5 -> (b x 256 x 112)
            nn.Conv1d(256, 256, 3, padding=1),
            nn.ReLU(),
            # conv6 -> (b x 256 x 37) -> 9472 features
            nn.Conv1d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(3)
        )
        
        # Determine linear layer input shape
        input_shape = (batch_size, alph_len, max_len)
        self.lin_input_features = self._get_lin_features(input_shape)
        
        # Linear layer architecture
        self.linear_layers = nn.Sequential(
            # linear7
            nn.Linear(self.lin_input_features, 1024),
            nn.ReLU(),
            nn.Dropout(0.5),
            # linear8
            nn.Linear(1024, 1024),
            nn.ReLU(),
            nn.Dropout(0.5),
            # linear9
            nn.Linear(1024, num_classes),
        )
        
        # Calculate total number of trainable parameters in the model
        self.total_params = self._get_total_params()
        # Random initialisation of weights using gaussian distribution
        self.apply(self._init_weights)
        # If GPU is available, instantiate the model on GPU
        if cuda.is_available():
            self.cuda()
    
    def _get_lin_features(self, shape: Tuple[int]) -> int:
        """
        Convenience function to calculate the input feature length for 
        the first linear layer in the model.
        
        :param shape: Input shape to the model in the form of (b x w x l) where 
            b, l and w are the input batch size, width (number of rows) and 
            length (number of columns) respectively.
        :type shape: Tuple[int]
        :return: The number of input features to the first linear layer in the 
            model.
        :rtype: int
        """
        x = rand(shape)
        x = self.conv_layers(x)
        return x.view(x.size(0), -1).shape[1]
    
    def _get_total_params(self) -> int:
        """
        Convenience function for calculating the total number of trainable parameters 
        in the model.
        """
        total_params = sum([p.numel() for p in self.parameters() 
                            if p.requires_grad])
        return total_params
    
    def _init_weights(self, module: nn.Module, mean: float=0., std: float=0.05) -> None:
        """
        Convenience function for initialising the weights for a single module in the 
        model using Gaussian distribution with specified mean and standard deviation. 
        This function should be passed to nn.Module.apply which recursively applies
        input function to all sub modules within the model.
        
        :param module: a nn.Module object to which the weight initialisation will be 
            applied.
        :type module: nn.Module
        :param mean: Mean for a Gaussian distribution.
        :type mean: float
        :param std: Standard deviation for a Gaussian distribution.
        :type std: float
        :return: Nothing. Modifies the input modules weights and biases inplace.
        :rtype: None.
        """
        if isinstance(module, nn.Conv1d) or isinstance(module, nn.Linear):
            module.weight.data.normal_(mean, std) # inplace op with funct_
            module.bias.data.fill_(0.01)
    
    def forward(self, x):
        """
        Forward propagation function for the model. The output of the model will 
        have dimension (batch_size x number of classes for prediction)
        """
        x = x.transpose(1, 2) # Transpose since the input is provided as (max_len x alph_len) but Conv1d expects (alph_len x max_len)
        x = self.conv_layers(x)
        x = x.view(x.size(0), -1) # Flatten the output from the convolutional layers, keep batch dimension intact
        x = self.linear_layers(x)
        return x

In [2]:
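# Testing the model
model = CharCNN(128, 70, 1014, 3)
# The constructor moves the model to the GPU when one is available, so the input must be on the same device
device = next(model.parameters()).device
input_matrix = Tensor(np.random.randn(128, 1014, 70)).to(device)
output = model(input_matrix) # Calling the module (rather than .forward) also runs any registered hooks
print(f"Input dimension: {input_matrix.shape}")
print(f"Output dimension: {output.shape}")
print(f"Number of trainable parameters in the model: {model.total_params:,}")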

Input dimension: torch.Size([128, 1014, 70])
Output dimension: torch.Size([128, 3])
Number of trainable parameters in the model: 12,125,187


### 2.2 Optimisation Algorithm

The optimisation algorithm consists of the following components (as per the paper):
 - Stochastic Gradient Descent with minibatch of size 128, using momentum 0.9.
 - Learning rate decay with an initial step size of 0.01, halved every 3 epochs for 10 times.
 - Crossentropy loss function for training optimisation.
 - Evaluation metrics will be Accuracy and F1 score (to cater for imbalanced label class distribution in the training and test data)
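
The learning rate schedule described above can be previewed without training anything. The sketch below attaches `MultiStepLR` to an SGD optimiser holding a single dummy parameter (a stand-in for the real model, used only for illustration) and records the learning rate at the start of each epoch.

```python
import torch
from torch import optim
from torch.optim.lr_scheduler import MultiStepLR

dummy_param = torch.nn.Parameter(torch.zeros(1))  # stand-in for model.parameters()
optimiser = optim.SGD([dummy_param], lr=0.01, momentum=0.9)
scheduler = MultiStepLR(optimiser, milestones=[3 * i for i in range(1, 11)], gamma=0.5)

lrs = []
for epoch in range(7):
    lrs.append(optimiser.param_groups[0]["lr"])
    optimiser.step()   # a real training loop would process all mini-batches here
    scheduler.step()   # advance the schedule once per epoch

print(lrs)  # 0.01 for epochs 0-2, 0.005 for epochs 3-5, 0.0025 for epoch 6
```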

Since the dataset labels show considerable class imbalance, it is worthwhile applying class weightings to the crossentropy loss function. **Focal Loss should be investigated in the future; an introduction can be found [here](https://medium.com/@ayodeleodubela/what-does-focal-loss-mean-for-training-neural-networks-770636f76379).**
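
As a small illustration of what class weighting does, the sketch below derives weights from made-up label counts (largest class count divided by each class count) and applies them through `CrossEntropyLoss`. The labels and logits are invented for this example and are not taken from the project's data.

```python
from collections import Counter
import torch
from torch.nn import CrossEntropyLoss

labels = [0] * 10 + [1] * 5 + [2] * 50  # made-up, imbalanced labels
counts = Counter(labels)
largest = max(counts.values())
weights = torch.tensor([largest / counts[c] for c in sorted(counts)])
print(weights)  # weights of 5, 10 and 1 for classes 0, 1 and 2

# reduction="none" exposes the per-sample losses so the weighting is visible
criterion = CrossEntropyLoss(weight=weights, reduction="none")
logits = torch.zeros(2, 3)       # equally uninformative predictions for two samples
targets = torch.tensor([1, 2])   # one rare-class sample and one common-class sample
losses = criterion(logits, targets)
print(losses)  # the rare-class sample contributes 10 times the loss of the common one
```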

This [article](https://towardsdatascience.com/optimization-algorithms-in-deep-learning-191bfc2737a4) gives a pretty good treatment of the common optimisation algorithms used for deep learning.

In [36]:
from torch import Tensor, nn, optim
from torch.optim.lr_scheduler import MultiStepLR
from torch.nn import CrossEntropyLoss
from collections import Counter
from torch import cuda
from pandas import read_csv, Series
from typing import Tuple, Union

def get_class_weights(labels: Series) -> Tensor:
    """
    Calculate class weightings based on each class' proportion
    in the label.
    
    :param labels: The labels in the training dataset.
    :type labels: Series
    :return: A tensor of weights.
    :rtype: Tensor
    """
    # Calculate class weightings
    class_counts = dict(Counter(labels)) # Use the function argument rather than a global variable
    m = max(class_counts.values())
    for c in class_counts:
        class_counts[c] = m / class_counts[c]
    # Convert weightings to tensor
    weights = []
    for k in sorted(class_counts.keys()):
        weights.append(class_counts[k])
    weights = Tensor(weights)
    # Move weights to GPU if available
    if cuda.is_available():
        weights = weights.cuda()
    return weights

def init_optimisation(model: CharCNN, 
                      optimiser: str='sgd', 
                      unbalance_classes: bool=False, class_weights: Tensor=None, 
                      lr: float=0.01, momentum: float=0.9, 
                      schedule_lr: bool=False) -> Tuple[Union[optim.SGD, optim.Adam], 
                                                        CrossEntropyLoss, MultiStepLR]:
    """
    Initialise the optimisation algorithm which selects:
    1. Balanced or unbalanced crossentropy loss function, if unbalanced, class weightings
       are applied during loss optimisation.
    2. Gradient descent algorithm can be either SGD with momentum, or ADAM.
    3. The user has the option to enable learning rate scheduling for the optimisation 
       algorithm. The scheduler implements learning rate reduction by halving it every 
       three epochs up to the point where this has been applied 10 times.
       
    :param model: A CharCNN model from which parameters will be updated by the optimiser.
    :type model: CharCNN
    :param optimiser: The type of gradient descent algorithm to use. Can be 'sgd' or 'adam'.
    :type optimiser: str
    :param unbalance_classes: Indicator for initialising a weighted crossentropy loss function.
    :type unbalance_classes: bool
    :param class_weights: The list of weightings to be applied to each class for the crossentropy
        loss function. Note if unbalance_classes is False, this parameter will be ignored.
    :type class_weights: Tensor
    :param lr: learning rate for the gradient descent. If schedule_lr is True, this will be 
        the initial learning rate.
    :type lr: float
    :param momentum: the velocity coefficient gradient descent with momentum
    :type momentum: float
    :param schedule_lr: Indicator to enable learning rate scheduling with the algorithm mentioned 
        in the function description.
    :type schedule_lr: bool
    :return: A tuple of (optimiser, criterion, scheduler). The scheduler is None when 
        schedule_lr is False.
    """
    # Balance or unbalanced loss function
    if unbalance_classes:
        criterion = CrossEntropyLoss(weight=class_weights)
    else:
        criterion = CrossEntropyLoss() # The cross_entropy function includes softmax calc, and can have weights applied for different classes
    
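    # Choose optimiser
    if optimiser == 'sgd':
        optimiser = optim.SGD(model.parameters(), lr=lr, momentum=momentum) # Stochastic gradient descent optimiser, can have momentum term passed to it
    elif optimiser == 'adam':
        optimiser = optim.Adam(model.parameters(), lr=lr)
    else:
        # Fail early instead of passing the raw string on to the scheduler
        raise ValueError(f"Unsupported optimiser '{optimiser}', expected 'sgd' or 'adam'")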
        
    # Create learning rate scheduler
    if schedule_lr:
        # Multiply optimiser learning rate by 0.5 for each milestone epochs specified in steps
        steps = [x*3 for x in range(1, 11)]
        scheduler = MultiStepLR(optimiser, milestones=steps, gamma=0.5)
    else:
        scheduler = None
        
    return optimiser, criterion, scheduler

In [48]:
# Test optimisation
df = read_csv("../data/test/test_clean.csv")
train_labels = df['rating']
class_weights = get_class_weights(train_labels)
optimiser, criterion, scheduler = init_optimisation(model, 'sgd', unbalance_classes=True, 
                                                    class_weights=class_weights, schedule_lr=True)

## 3. Training Results

The training of the model was conducted on a GPU with the following settings for the final model:

 - **Quantisation**
  - Alphabet: (abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+ =<>()[]{}\n)
  - Max Input Size: 1014 characters.
  
  
 - **Model parameters**
  - Refer to this [paper](https://arxiv.org/pdf/1509.01626.pdf) for model layer attributes.
  - Output classes: 0 - poor; 1 - average; 2 - good
  
  
 - **Data parameters**
  - Batch size: 1024 samples
  - Sampler: random weighted sampling for each batch (used to deal with unbalanced classes)
  - Multiprocessing: 2 concurrent processes for data loading.
  
  
 - **Training parameters**
  - Optimiser: Stochastic Gradient Descent with Momentum.
   - Learning rate (initial): 0.01
   - Momentum: 0.9
   - Learning rate scheduling: Stepped (halved every 3 epochs, up to a maximum of 10 times)
  - Number of epochs: 10
  - Loss function: Weighted crossentropy.
  - Early-stopping: Enabled (waits 3 epochs for test f1 score improvement)
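
The early-stopping rule listed above (wait 3 epochs for an improvement in the test f1 score) can be sketched as follows. The `EarlyStopping` class and the f1 values are hypothetical and only illustrate the rule; they are not the project's training code or its logged scores.

```python
class EarlyStopping:
    """Hypothetical helper: stop once the score has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best_score = float("-inf")
        self.epochs_without_improvement = 0

    def should_stop(self, score: float) -> bool:
        if score > self.best_score:
            self.best_score = score
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

f1_scores = [0.60, 0.65, 0.70, 0.72, 0.71, 0.72, 0.70, 0.69, 0.68, 0.67]  # made-up values
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, f1 in enumerate(f1_scores, start=1):
    if stopper.should_stop(f1):
        stopped_at = epoch
        break
print(f"Stopped after epoch {stopped_at}")  # Stopped after epoch 7
```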

### 3.1 Logged results

The training of the model ended on the 7th epoch out of the 10 epochs defined. The final results are shown below:

<img src=./imgs/results.png height=400 width=400>

<img src=./imgs/train_pic.png height=600 width=1000>

<img src=./imgs/test_pic.png height=600 width=1000>

From the training results, the below observations are made:

 - The overall accuracy of the model on the last epoch is `0.84`
 - Predictions perform best on class `2` and worst on class `1`. This is mainly due to the class imbalance in the data, but intuitively it is also much more difficult to gauge an `average` sentiment, since it involves much more nuance from a language perspective. Getting more data will certainly help, but the model could also be changed to treat this as a binary classification problem.
 - The training scores are reasonably close to the testing scores, which suggests the model is not overfitting (except for the f1 score on class 1 predictions).
 - The test scores reached a plateau much earlier than the training scores, so the early stopping mechanism seems to have aided the training process: had training carried on, the test scores would probably have started to diverge from the training scores, i.e. the model would overfit.

### 3.2 Play with the model

In [2]:
from predict import load_model, infer

model = load_model() # Load the model once instead of on every iteration
while True:
    input_str = input("Please provide a input sentence: ")
    print(infer(input_str, model))
    next_str = input("Make another prediction? [Y/N]\n").strip().lower()
    # Keep asking until a valid option is entered
    while next_str not in ('y', 'n'):
        print("Did not enter a valid option!")
        next_str = input("Make another prediction? [Y/N]\n").strip().lower()
    if next_str == 'n':
        break

Please provide a input sentence:  This product sucks!! @Puma #notgood


Predicted Sentiment: Poor




Make another prediction? [Y/N]
 y
Please provide a input sentence:  I guess the service is okay


Predicted Sentiment: Average




Make another prediction? [Y/N]
 So the other day I bought this new bike, and it was awesome!


Did not enter a valid option!


Make another prediction? [Y/N]
 y
Please provide a input sentence:  So the other day I bought this new bike, and it was awesome!


Predicted Sentiment: Good




Make another prediction? [Y/N]
 y
Please provide a input sentence:  I'm a bit uncertain about the project status, can we arrange a meeting soon?


Predicted Sentiment: Average




Make another prediction? [Y/N]
 y
Please provide a input sentence:  Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red in color – indicating ripeness – they are picked, processed, and dried.[2] Dried coffee seeds (referred to as "beans") are roasted to varying degrees, depending on the desired flavor. Roasted beans are ground and then brewed with near-boiling water to produce the beverage known as coffee. 


Predicted Sentiment: Average




Make another prediction? [Y/N]
 n
