## Corpus

Movie review polarity dataset (Pang and Lee, ACL 2004): 1,000 positive and 1,000 negative reviews.

http://www.cs.cornell.edu/people/pabo/movie-review-data/

http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

Based on: https://github.com/cezannec/CNN_Text_Classification/blob/master/CNN_Text_Classification.ipynb

## Load Corpus

In [1]:
from corpus import prepare_corpus

corpus = prepare_corpus()

prepare_corpus: 100%|██████████| 1000/1000 [00:04<00:00, 224.65it/s]
prepare_corpus: 100%|██████████| 1000/1000 [00:04<00:00, 210.11it/s]


In [2]:
len(corpus)

2000

## Embedding - Word2Vec

This time we use the pretrained Google News word2vec embeddings (300 dimensions).

In [3]:
# https://radimrehurek.com/gensim/downloader.html

import os
import gensim.downloader as dl

pretrained_model_name = "word2vec-google-news-300"
model_dl_path = os.path.join(dl.BASE_DIR, pretrained_model_name, f"{pretrained_model_name}.gz")

# dl.load() downloads the model on first use and reads the cached copy afterwards,
# so the check below only decides which message to print
if os.path.exists(model_dl_path):
    print(f"Loading model from {model_dl_path}")
else:
    print(f"Model will be downloaded at {model_dl_path}")

# always bind the result to the same name; later cells rely on gnews_embeddings
gnews_embeddings = dl.load(pretrained_model_name)

Loading model from C:\Users\shawo/gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz


### Vocabulary

In [4]:
# list of all the words word2vec has processed
vocabulary = gnews_embeddings.index_to_key
vocab_len = len(vocabulary)

In [5]:
vocab_len

3000000

In [6]:
vocabulary[:10]

['</s>', 'in', 'for', 'that', 'is', 'on', '##', 'The', 'with', 'said']

## Encode all tokens with indices from embedding

In [7]:
from tqdm import tqdm


def encode_corpus_tokens_with_embed_idx(corpus):
    encoded_corpus = list()
    for label, tokens in tqdm(corpus, desc="encode_tokens_with_embed_idx"):
        # tokens missing from the word2vec vocabulary fall back to index 0,
        # which is also the padding value and the '</s>' entry of the vocabulary
        idxs = [gnews_embeddings.key_to_index.get(token, 0) for token in tokens]
        encoded_corpus.append((label, idxs))

    return encoded_corpus

In [8]:
encoded_corpus = encode_corpus_tokens_with_embed_idx(corpus)

encode_tokens_with_embed_idx: 100%|██████████| 2000/2000 [00:00<00:00, 7380.14it/s]
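Because unknown tokens are mapped to index 0, they become indistinguishable from the `'</s>'` entry (and from padding). A minimal sketch of the lookup, using a toy dictionary in place of `gnews_embeddings.key_to_index`; the words, indices and the `oov_rate` helper are hypothetical and only illustrate how one might check how many tokens fall back to 0:

```python
# Toy stand-in for gnews_embeddings.key_to_index (hypothetical words and indices)
toy_key_to_index = {"</s>": 0, "in": 1, "for": 2, "movie": 3, "great": 4}


def encode_tokens(tokens, key_to_index, unk_idx=0):
    """Map each token to its vocabulary index, falling back to unk_idx."""
    return [key_to_index.get(token, unk_idx) for token in tokens]


def oov_rate(tokens, key_to_index):
    """Fraction of tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in key_to_index)
    return missing / len(tokens)


tokens = ["great", "movie", "zzyzx"]
print(encode_tokens(tokens, toy_key_to_index))  # [4, 3, 0]
print(oov_rate(tokens, toy_key_to_index))       # one of three tokens is unknown
```

A high out-of-vocabulary rate on the real corpus would suggest reserving a dedicated unknown index instead of reusing 0.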


## Padding

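Left-pad every sequence with 0 so that all inputs share one length.

This requires choosing a sequence length first; here we use the length of the longest review.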

In [9]:
# get max sequence length
sentences = [s[1] for s in corpus]
max_seq_len = max(len(s) for s in sentences)
max_seq_len

1477

In [10]:
import numpy as np

def pad_tokens(encoded_corpus, seq_len=max_seq_len):
    padded = np.zeros(
        (len(encoded_corpus), seq_len),
        dtype=np.int32
    )

    # iterate over the argument, not the global corpus
    for i in tqdm(range(len(encoded_corpus)), desc="pad"):
        # keep at most seq_len tokens so the slice assignment below always fits
        tokens = encoded_corpus[i][1][:seq_len]

        # nltk's stopwords are a bit aggressive, ignore token lists with 0 size
        if len(tokens) == 0:
            continue

        padded[i, -len(tokens):] = np.array(tokens)

    return padded

In [11]:
padded_tokens = pad_tokens(encoded_corpus)

pad: 100%|██████████| 2000/2000 [00:00<00:00, 33333.89it/s]
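To make the left-padding step concrete, here is a small pure-Python sketch on toy index lists. It mirrors the idea of `pad_tokens` but is not the notebook's implementation; the `left_pad` helper and the truncation of over-long sequences are assumptions of this sketch:

```python
def left_pad(sequences, seq_len, pad_idx=0):
    """Left-pad (or truncate) each index list to exactly seq_len entries."""
    padded = []
    for seq in sequences:
        seq = seq[:seq_len]  # assumption: keep only the first seq_len tokens
        padded.append([pad_idx] * (seq_len - len(seq)) + seq)
    return padded


toy_sequences = [[5, 7], [], [1, 2, 3, 4, 9]]
print(left_pad(toy_sequences, seq_len=4))
# [[0, 0, 5, 7], [0, 0, 0, 0], [1, 2, 3, 4]]
```

An empty token list simply becomes a row of zeros, which matches the `continue` branch in `pad_tokens`.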


## Input and Labels

In [12]:
X = padded_tokens # input
y = np.array([c[0] for c in encoded_corpus])  #label

In [13]:
print(X.shape)
print(y.shape)

(2000, 1477)
(2000,)


## Split data

In [14]:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    X, y, random_state=42, train_size=0.8
)

In [15]:
# https://datascience.stackexchange.com/questions/15135/train-test-validation-set-splitting-in-sklearn

x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, train_size=0.8, random_state=42)

In [16]:
print(f"x_train = {x_train.shape} # y_train = {y_train.shape}")
print(f"x_val = {x_val.shape} # y_val = {y_val.shape}")
print(f"x_test = {x_test.shape} # y_test = {y_test.shape}")

x_train = (1280, 1477) # y_train = (1280,)
x_val = (320, 1477) # y_val = (320,)
x_test = (400, 1477) # y_test = (400,)
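The shapes above follow from applying `train_size=0.8` twice. A short arithmetic sketch, assuming the train part is rounded down and the remainder goes to the other split (the `split_sizes` helper is hypothetical, not a scikit-learn function):

```python
import math


def split_sizes(n_samples, train_size):
    """Return (n_train, n_rest) for a fractional train_size."""
    n_train = math.floor(train_size * n_samples)
    return n_train, n_samples - n_train


n_train_val, n_test = split_sizes(2000, 0.8)    # first split: train+val vs test
n_train, n_val = split_sizes(n_train_val, 0.8)  # second split: train vs val

print(n_train, n_val, n_test)                       # 1280 320 400
print(n_train / 2000, n_val / 2000, n_test / 2000)  # 0.64 0.16 0.2
```

So the effective split is 64% train, 16% validation and 20% test.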


## Convert to TensorDataset

In [17]:
import torch
from torch.utils.data import TensorDataset

training_data = TensorDataset(torch.from_numpy(x_train), torch.from_numpy(y_train))
val_data = TensorDataset(torch.from_numpy(x_val), torch.from_numpy(y_val))

## DataLoader for Torch

Let torch's `DataLoader` handle shuffling and mini-batching.

Why mini-batches? Feeding the whole dataset at once gives a single gradient update per epoch, needs far more memory, and tends to generalize worse than noisier mini-batch updates (even if the machine can handle it).

In [18]:
from torch.utils.data import DataLoader

# define a batch size
batch_size = 50

train_loader = DataLoader(training_data, shuffle=True, batch_size=batch_size)
val_loader = DataLoader(val_data, shuffle=True, batch_size=batch_size)
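With a batch size of 50, neither split divides evenly. The sketch below works out how many batches one pass over each loader yields and how large the last batch is, assuming the final partial batch is kept (the `DataLoader` default, `drop_last=False`); `batch_sizes` is a hypothetical helper:

```python
def batch_sizes(n_samples, batch_size):
    """Sizes of the batches produced in one epoch when the last partial batch is kept."""
    n_full, remainder = divmod(n_samples, batch_size)
    return [batch_size] * n_full + ([remainder] if remainder else [])


train_batches = batch_sizes(1280, 50)
val_batches = batch_sizes(320, 50)

print(len(train_batches), train_batches[-1])  # 26 30
print(len(val_batches), val_batches[-1])      # 7 20
```

One consequence: averaging per-batch validation losses gives the smaller last batch the same weight as a full one, so that average is only approximately the per-sample mean.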

## CNN Model

This model is based on https://arxiv.org/abs/1408.5882 (Kim, 2014, "Convolutional Neural Networks for Sentence Classification").
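The core idea of that architecture is to slide a filter over windows of k consecutive word vectors, apply a non-linearity, and keep only the maximum activation per filter (max-over-time pooling). A toy pure-Python sketch for a single filter, without bias or padding; the vectors and weights are made up for illustration:

```python
def conv_relu_maxpool(embedded, kernel):
    """embedded: list of token vectors; kernel: k weight vectors of the same dimension."""
    k = len(kernel)
    activations = []
    for start in range(len(embedded) - k + 1):
        window = embedded[start:start + k]
        # dot product between the filter and the k-token window
        score = sum(
            w * x
            for w_row, x_row in zip(kernel, window)
            for w, x in zip(w_row, x_row)
        )
        activations.append(max(0.0, score))  # ReLU
    return max(activations)  # max-over-time pooling: one number per filter


embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # 4 tokens, dim 2
kernel = [[1.0, -1.0], [0.5, 0.5]]                           # one filter, k = 2

print(conv_relu_maxpool(embedded, kernel))  # 1.5
```

With several filters per kernel size, the pooled values are concatenated into one fixed-length feature vector regardless of the review length.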

In [19]:
import torch.nn as nn
import torch.nn.functional as F

class SentimentClassifierCNN(nn.Module):
    def __init__(self, freeze_embeddings=True):
        super(SentimentClassifierCNN, self).__init__()
        
        # properties
        self.kernel_sizes = [3,4,5]
        self.num_filters = 100
        self.embedding_dim = 300 # gnews300
        self.output_size = 1
        self.vocab_size=vocab_len

        # convert the pretrained word2vec vectors to a tensor
        embedding_weights = torch.from_numpy(gnews_embeddings.vectors)

        # neural network 

        # embedding layer
        # by default we're freezing embeddings
        self.embedding = nn.Embedding.from_pretrained(embedding_weights, freeze=freeze_embeddings)

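        # conv layers
        # one conv layer per kernel size (3 in total)
        # each Conv2d kernel spans k tokens and the full embedding width,
        # so it behaves like a 1D convolution over the token sequence
        self.conv1d = nn.ModuleList([
            nn.Conv2d(1, self.num_filters, (k, self.embedding_dim), padding=(k - 2, 0))
            for k in self.kernel_sizes
        ])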

        # final linear layer
        self.linear = nn.Linear(len(self.kernel_sizes) * self.num_filters, self.output_size)

        # dropout and sigmoid
        # why sigmoid? Well, binary classification task!
        self.dropout = nn.Dropout(0.1)
        self.sigmoid = nn.Sigmoid()

    # helper 
    def conv_and_pool(self, x, conv):
        """
        Convolutional + max pooling layer
        """
        # squeeze last dim to get size: (batch_size, num_filters, conv_seq_length)
        x = F.relu(conv(x)).squeeze(3)
        
        # 1D pool over conv_seq_length
        # squeeze to get size: (batch_size, num_filters)
        x_max = F.max_pool1d(x, x.size(2)).squeeze(2)
        return x_max

    def forward(self, x):
        embeds = self.embedding(x)
        embeds = embeds.unsqueeze(1) # add a channel dim for Conv2d: (batch_size, 1, seq_len, embedding_dim)

        conv_out =  [self.conv_and_pool(embeds, conv) for conv in self.conv1d]

        # concatenate the pooled convolution outputs into one feature vector
        out = torch.cat(conv_out, 1)
        # apply dropout
        out = self.dropout(out)

        # linear 
        out = self.linear(out)

        return self.sigmoid(out)

In [20]:
cnn = SentimentClassifierCNN()
print(cnn)

SentimentClassifierCNN(
  (embedding): Embedding(3000000, 300)
  (conv1d): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(3, 300), stride=(1, 1), padding=(1, 0))
    (1): Conv2d(1, 100, kernel_size=(4, 300), stride=(1, 1), padding=(2, 0))
    (2): Conv2d(1, 100, kernel_size=(5, 300), stride=(1, 1), padding=(3, 0))
  )
  (linear): Linear(in_features=300, out_features=1, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (sigmoid): Sigmoid()
)
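The printed layer shapes can be checked by hand. The sketch below applies the standard convolution output-length formula to the sequence axis and counts parameters, using the values from this notebook (sequence length 1477, 300-dimensional embeddings, 100 filters per kernel size, 3,000,000 vocabulary entries); `conv_out_len` is a hypothetical helper:

```python
def conv_out_len(seq_len, kernel, padding, stride=1):
    """Output length of a convolution along one axis."""
    return (seq_len + 2 * padding - kernel) // stride + 1


seq_len, embedding_dim, num_filters = 1477, 300, 100
kernel_sizes = [3, 4, 5]

# padding=(k - 2, 0) as in the model definition
out_lens = {k: conv_out_len(seq_len, k, padding=k - 2) for k in kernel_sizes}
print(out_lens)  # {3: 1477, 4: 1478, 5: 1479}

# weights + biases of the three conv layers and the final linear layer
conv_params = sum(num_filters * k * embedding_dim + num_filters for k in kernel_sizes)
linear_params = len(kernel_sizes) * num_filters + 1
frozen_embedding_params = 3_000_000 * embedding_dim

print(conv_params + linear_params)  # 360601
print(frozen_embedding_params)      # 900000000
```

Max pooling collapses each of those sequence axes to a single value, so the linear layer always sees 3 × 100 = 300 features. With frozen embeddings, only about 0.36 million of the roughly 900 million parameters are trained.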


## Hyperparams

In [21]:
learning_rate = 0.001

loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(cnn.parameters(), lr=learning_rate)
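`nn.BCELoss` averages the binary cross-entropy over the batch by default. A minimal pure-Python sketch of that quantity (it omits the clamping of the log terms that PyTorch applies for numerical stability):

```python
import math


def binary_cross_entropy(probs, labels):
    """Mean of -(y * log(p) + (1 - y) * log(1 - p)) over all samples."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)


# a model that always predicts 0.5 scores ln(2), roughly 0.693
print(binary_cross_entropy([0.5, 0.5], [0, 1]))

# confident correct predictions score lower than confident wrong ones
print(binary_cross_entropy([0.9, 0.1], [1, 0]))
print(binary_cross_entropy([0.9, 0.1], [0, 1]))
```

The ln(2) value is a useful baseline: a loss near 0.69 means the classifier is doing no better than guessing.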

## Create Device
Using Accelerate from Hugging Face:
https://huggingface.co/docs/accelerate/index.html


In [22]:
from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device

print(device)

cuda


## Move model, dataloaders and optimizer to the device

In [23]:
cnn, train_loader, val_loader, optimizer = accelerator.prepare(
    cnn, train_loader, val_loader, optimizer
)

## Train

In [24]:
epochs = 15

def train_cnn(model, train_loader, val_loader, epochs, optimizer, loss_fn, accl=accelerator):
    print_counter = 0 # global step counter; losses are printed every 10th step

    for e in tqdm(range(epochs), desc=f"train_cnn_for_{epochs}_epochs"):
        model.train()
        # "inputs"/"labels" avoid shadowing the built-in input()
        for inputs, labels in train_loader:
            print_counter += 1
            # zero gradients
            optimizer.zero_grad()

            # forward pass
            output = model(inputs)

            # backprop
            # squeeze(1) drops only the output dim, so a batch of size 1 still matches its labels
            loss = loss_fn(output.squeeze(1), labels.float())
            accl.backward(loss)
            optimizer.step()

            # log loss 
            if print_counter % 10 == 0:
                validation_losses = []

                model.eval() # switch mode
                with torch.no_grad():
                    for val_input, val_label in val_loader:
                        val_output = model(val_input)
                        val_loss = loss_fn(val_output.squeeze(1), val_label.float())
                        validation_losses.append(val_loss.item())
                print(f"\nEpoch: {e + 1}/{epochs}\tStep: {print_counter}\tTrain Loss: {loss.item()}\tValidation Loss: {np.mean(validation_losses)}")

                model.train()

%time train_cnn(model=cnn, train_loader=train_loader, val_loader=val_loader, epochs=epochs, optimizer=optimizer, loss_fn=loss_fn)

train_cnn_for_15_epochs:   0%|          | 0/15 [00:00<?, ?it/s]
Epoch: 1/15	Step: 10	Train Loss: 0.6800340414047241	Validation Loss: 0.6815965175628662

Epoch: 1/15	Step: 20	Train Loss: 0.6725341081619263	Validation Loss: 0.685352052961077
train_cnn_for_15_epochs:   7%|▋         | 1/15 [00:04<00:59,  4.27s/it]
Epoch: 2/15	Step: 30	Train Loss: 0.600382387638092	Validation Loss: 0.6436454653739929

Epoch: 2/15	Step: 40	Train Loss: 0.6341875195503235	Validation Loss: 0.6615862505776542
train_cnn_for_15_epochs:  13%|█▎        | 2/15 [00:07<00:47,  3.62s/it]
Epoch: 2/15	Step: 50	Train Loss: 0.5865278840065002	Validation Loss: 0.5992969189371381

Epoch: 3/15	Step: 60	Train Loss: 0.49015775322914124	Validation Loss: 0.6082991872514997

Epoch: 3/15	Step: 70	Train Loss: 0.4470709264278412	Validation Loss: 0.5463864292417254
train_cnn_for_15_epochs:  20%|██        | 3/15 [00:10<00:40,  3.39s/it]
Epoch: 4/15	Step: 80	Train Loss: 0.41555091738700867	Validation Loss: 0.5163954709257398

Epoch: 4/15
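The step counter in the log runs across epochs, which is why some epochs show two log lines and others three. A small sketch that maps the logged steps to epochs, assuming 26 training batches per epoch (1280 samples at batch size 50, last partial batch kept, single process); `epoch_of_step` is a hypothetical helper:

```python
steps_per_epoch = 26  # ceil(1280 / 50)


def epoch_of_step(step, steps_per_epoch):
    """1-based epoch that a 1-based global step falls into."""
    return (step - 1) // steps_per_epoch + 1


logged = {step: epoch_of_step(step, steps_per_epoch) for step in range(10, 90, 10)}
print(logged)
# {10: 1, 20: 1, 30: 2, 40: 2, 50: 2, 60: 3, 70: 3, 80: 4}
```

This agrees with the epoch numbers printed next to each step above.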

## Inference

In [25]:
test_x_tensor = torch.from_numpy(x_test)
test_x_tensor = test_x_tensor.to(device)

def classify_sentiment(model, test_data):
    model.eval()
    with torch.no_grad():
        out = model(test_data)
        # round the sigmoid output to the nearest class label (0 or 1)
        out = torch.round(out.squeeze(1))

    return out.cpu().numpy()

In [26]:
y_pred = classify_sentiment(cnn, test_x_tensor)

## Evaluation

In [27]:
from sklearn.metrics import classification_report

print(classification_report(y_pred=y_pred, y_true=y_test))

              precision    recall  f1-score   support

           0       0.84      0.84      0.84       199
           1       0.84      0.84      0.84       201

    accuracy                           0.84       400
   macro avg       0.84      0.84      0.84       400
weighted avg       0.84      0.84      0.84       400
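The per-class figures in the report are derived from confusion counts. A minimal sketch on made-up labels (these are not the notebook's predictions) shows how precision, recall and F1 for one class are computed; `binary_scores` is a hypothetical helper:

```python
def binary_scores(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(binary_scores(toy_true, toy_pred, positive=1))  # (0.75, 0.75, 0.75)
print(binary_scores(toy_true, toy_pred, positive=0))  # (0.75, 0.75, 0.75)
```

When both classes have nearly equal support, as with the 199 and 201 test samples here, the macro and weighted averages end up almost identical.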

