# Word Window Classifier

The aim of this notebook is to define an NLP neural network able to detect color-related words in a sentence.\
Inspired by [Stanford's CS224N lab](https://stanford.edu/class/cs224n/materials/CS224N_PyTorch_Tutorial.html).

## Module Imports

In [None]:
import torch
import torch.nn as nn

from typing import List, Dict
from torch.utils.data import DataLoader
from functools import partial

## Corpus Definition

The dataset is composed of a small set of sentences.\
The purpose of the model is to recognize the color words among the words of each sentence.

In [None]:
sentences = [
    "The sky was a brilliant shade of blue.",
    "She wore a stunning red dress to the party.",
    "The leaves turned vibrant shades of orange and yellow in the fall.",
    "His new car is a sleek black.",
    "The walls of the room were painted a calming light green.",
    "A fluffy white cat sat on the windowsill.",
    "The sunset was a beautiful mix of pink and purple.",
    "She admired the emerald green of the gemstone.",
    "The artist used a lot of bright yellow in his painting.",
    "The ocean looked turquoise under the midday sun.",
    "He bought a pair of brown leather shoes.",
    "The morning sky was a soft pastel pink.",
    "The old book had a faded blue cover.",
    "Her eyes were a striking shade of hazel.",
    "The banana was ripe and yellow.",
    "The house had a cheerful red front door.",
    "The mountain peaks were capped with white snow.",
    "She chose a light grey suit for the interview.",
    "The grapes were a dark purple, almost black.",
    "The sunset painted the horizon with orange and red hues.",
    "The bird had bright blue feathers.",
    "The grass was lush and green after the rain.",
    "He preferred writing with a black pen.",
    "Her hair was dyed a vivid purple.",
    "The butterfly had wings of brilliant blue.",
    "The apples were shiny and red.",
    "The night sky was a deep, inky black.",
    "She decorated the room with light lavender curtains.",
    "Who are these greens snakes that are whistling on your heads for?"
]


## Data Preprocessing

Usually, when starting an NLP task, it is recommended to preprocess the given sentences as follows:\
\- lowercasing every character;\
\- splitting the sentence into tokens (i.e. an array where each element is a word);\
\- removing stop words;\
\- stemming or lemmatizing the words.

Since this notebook mainly aims to demonstrate word window classifiers, the last two steps are skipped, as they are not required here.
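
For reference, the two skipped steps could look like the minimal sketch below. The stop-word list `STOP_WORDS` and the suffix rules in `naive_stem` are illustrative assumptions, not part of this notebook's pipeline; a real project would more likely rely on a dedicated NLP library.

```python
from typing import List

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"the", "a", "of", "was", "is", "and", "to"}

def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that carry little meaning on their own."""
    return [token for token in tokens if token not in STOP_WORDS]

def naive_stem(token: str) -> str:
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left untouched.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = ["the", "leaves", "turned", "vibrant", "shades", "of", "orange"]
filtered = remove_stop_words(tokens)
stemmed = [naive_stem(token) for token in filtered]
print(filtered)
print(stemmed)
```

With such a rule, a plural form like `greens` would be reduced to `green`, so both forms would share a single vocabulary entry.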

In [None]:
def tokenize(sentence: str) -> List[str]:
    # Strip punctuation, then lowercase and split on whitespace
    for punc in [',', ';', '.', '?', '!', '/', "'", '-', '_']:
        sentence = sentence.replace(punc, '')
    return sentence.lower().split()

In [None]:
train_sentences = [tokenize(sentence) for sentence in sentences]

# Example 
train_sentences[0]

The input sentence `"The sky was a brilliant shade of blue."` is transformed into the array of tokens `['the', 'sky', 'was', 'a', 'brilliant', 'shade', 'of', 'blue']`, a form that is easier to feed to a machine learning model.

The following array defines the labels of the input words. For each sentence, it contains as many integers as the tokenized sentence has words. Each integer is `1` if the corresponding word refers to a color and `0` otherwise.
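
Because every label must line up with exactly one token, a quick consistency check can be useful before training. The helper `check_alignment` below is a hypothetical name introduced for this sketch, and the two-sentence sample merely stands in for the full dataset.

```python
from typing import List

def check_alignment(sentences: List[List[str]], labels: List[List[int]]) -> List[int]:
    """Return the indices of sentences whose label count differs from their token count."""
    return [
        index
        for index, (tokens, tags) in enumerate(zip(sentences, labels))
        if len(tokens) != len(tags)
    ]

sample_sentences = [
    ["the", "sky", "was", "a", "brilliant", "shade", "of", "blue"],
    ["the", "banana", "was", "ripe", "and", "yellow"],
]
sample_labels = [
    [0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],  # deliberately one label short
]
print(check_alignment(sample_sentences, sample_labels))  # [1]
```

An empty result means that every sentence has exactly one label per token.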

In [None]:
train_labels = [
    [0, 0, 0, 0, 0, 0, 0, 1],              # "The sky was a brilliant shade of blue."
    [0, 0, 0, 0, 1, 0, 0, 0, 0],           # "She wore a stunning red dress to the party."
    [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],  # "The leaves turned vibrant shades of orange and yellow in the fall."
    [0, 0, 0, 0, 0, 0, 1],                 # "His new car is a sleek black."
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],     # "The walls of the room were painted a calming light green."
    [0, 0, 1, 0, 0, 0, 0, 0],              # "A fluffy white cat sat on the windowsill."
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 1],        # "The sunset was a beautiful mix of pink and purple."
    [0, 0, 0, 0, 1, 0, 0, 0],              # "She admired the emerald green of the gemstone."
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],     # "The artist used a lot of bright yellow in his painting."
    [0, 0, 0, 1, 0, 0, 0, 0],              # "The ocean looked turquoise under the midday sun."
    [0, 0, 0, 0, 0, 1, 0, 0],              # "He bought a pair of brown leather shoes."
    [0, 0, 0, 0, 0, 0, 0, 1],              # "The morning sky was a soft pastel pink."
    [0, 0, 0, 0, 0, 0, 1, 0],              # "The old book had a faded blue cover."
    [0, 0, 0, 0, 0, 0, 0, 1],              # "Her eyes were a striking shade of hazel."
    [0, 0, 0, 0, 0, 1],                    # "The banana was ripe and yellow."
    [0, 0, 0, 0, 0, 1, 0, 0],              # "The house had a cheerful red front door."
    [0, 0, 0, 0, 0, 0, 1, 0],              # "The mountain peaks were capped with white snow."
    [0, 0, 0, 0, 1, 0, 0, 0, 0],           # "She chose a light grey suit for the interview."
    [0, 0, 0, 0, 0, 1, 0, 1],              # "The grapes were a dark purple, almost black."
    [0, 0, 0, 0, 0, 0, 1, 0, 1, 0],        # "The sunset painted the horizon with orange and red hues."
    [0, 0, 0, 0, 1, 0],                    # "The bird had bright blue feathers."
    [0, 0, 0, 0, 0, 1, 0, 0, 0],           # "The grass was lush and green after the rain."
    [0, 0, 0, 0, 0, 1, 0],                 # "He preferred writing with a black pen."
    [0, 0, 0, 0, 0, 0, 1],                 # "Her hair was dyed a vivid purple."
    [0, 0, 0, 0, 0, 0, 1],                 # "The butterfly had wings of brilliant blue."
    [0, 0, 0, 0, 0, 1],                    # "The apples were shiny and red."
    [0, 0, 0, 0, 0, 0, 0, 1],              # "The night sky was a deep, inky black."
    [0, 0, 0, 0, 0, 0, 1, 0],              # "She decorated the room with light lavender curtains."
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # "Who are these greens snakes that are whistling on your heads for?"
]


For the following steps, it is convenient to define a set of words, here called `vocabulary`.\
This set contains every word that appears in the dataset's sentences.

In [None]:
vocabulary = set(word for sentence in train_sentences for word in sentence)
vocabulary

When the model is asked to predict the color(s) in a sentence, it may encounter a word that is not in its vocabulary.\
To handle this case, a placeholder for unknown words is added to the set as `'<unk>'`.\
Also, as the model relies on a window of words sliding through the sentence, another special word is required to handle the edges of the sentence, where the window would otherwise run past the first or last word. This padding word is defined as `'<pad>'`.

In [None]:
vocabulary.add('<unk>')
vocabulary.add('<pad>')

In [None]:
def pad_window(sentence: List[str], window_size: int) -> List[str]:
    window = ['<pad>'] * window_size
    return window + sentence + window

# Example
print(f'Padded sentence: {pad_window(train_sentences[0], window_size=2)}')

It is now possible to automatically add padding to each tokenized sentence.\
The next step to make this dataset compatible with machine learning models is to assign each word a unique index.

In [None]:
word_to_idx = {word: index for index, word in enumerate(sorted(list(vocabulary)))}
idx_to_word = {index: word for index, word in enumerate(sorted(list(vocabulary)))}

Two functions are now defined to convert between tokens and indices, so that the entire dataset can be processed this way.

In [None]:
def tokens_to_indices(sentence: List[str], word_to_idx: Dict[str, int]) -> List[int]:
    return [word_to_idx[token] if token in word_to_idx else word_to_idx['<unk>'] for token in sentence]

def indices_to_token(indices: List[int], idx_to_word: Dict[int, str]) -> List[str]:
    return [idx_to_word[index] for index in indices]

In [None]:
sentence_ver = 'Are birds red?'
tokenized_ver = tokenize(sentence_ver)
indices_ver = tokens_to_indices(tokenized_ver, word_to_idx)
restored_ver = indices_to_token(indices_ver, idx_to_word)

print(f'Original: {sentence_ver}')
print(f'Tokenized: {tokenized_ver}')
print(f'Indices: {indices_ver}')
print(f'Restored: {restored_ver}')

As explained before, "birds" is not part of the vocabulary previously defined. Thus, it is replaced by `'<unk>'`.

In [None]:
train_indices = [tokens_to_indices(sentence, word_to_idx) for sentence in train_sentences]
train_indices[:5]

## Model Definition

In [None]:
embeds = nn.Embedding(len(vocabulary), embedding_dim=5)
list(embeds.parameters())

In [None]:
def _collate_fn(batch, window_size: int, word_to_idx: Dict[str, int]) -> tuple:
    X, y = zip(*batch)

    # Pad both ends of each sentence, then map tokens to indices
    X = [pad_window(s, window_size=window_size) for s in X]
    X = [tokens_to_indices(s, word_to_idx) for s in X]
    X = [torch.LongTensor(X_i) for X_i in X]

    # Right-pad the sentences so that the batch forms a rectangular tensor
    pad_token_index = word_to_idx['<pad>']
    X_padded = nn.utils.rnn.pad_sequence(X, batch_first=True, padding_value=pad_token_index)

    # Number of real (unpadded) tokens in each sentence
    lengths = torch.LongTensor([len(label) for label in y])

    # Right-pad the labels with 0 to match the padded sentences
    y = [torch.LongTensor(y_i) for y_i in y]
    y_padded = nn.utils.rnn.pad_sequence(y, batch_first=True, padding_value=0)

    return X_padded, y_padded, lengths
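
To make the batch-padding step concrete without PyTorch, the sketch below pads label lists of unequal length in plain Python and records the original lengths. `pad_batch` is a hypothetical helper that only mimics the purpose `pad_sequence` serves here; it is not the library's implementation.

```python
from typing import List, Tuple

def pad_batch(sequences: List[List[int]], padding_value: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    lengths = [len(sequence) for sequence in sequences]
    max_length = max(lengths)
    padded = [
        sequence + [padding_value] * (max_length - len(sequence))
        for sequence in sequences
    ]
    return padded, lengths

batch_labels = [[0, 0, 0, 0, 0, 1], [0, 0, 0, 1]]
padded_labels, lengths = pad_batch(batch_labels)
print(padded_labels)  # [[0, 0, 0, 0, 0, 1], [0, 0, 0, 1, 0, 0]]
print(lengths)        # [6, 4]
```

Since a padded label `0` cannot be told apart from a genuine "not a color" label, the lengths are the only record of where each sentence really ends.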

In [None]:
data = list(zip(train_sentences, train_labels))
collate_fn = partial(_collate_fn, window_size=2, word_to_idx=word_to_idx)

loader = DataLoader(data, batch_size=2, shuffle=True, collate_fn=collate_fn)

for counter, (batched_X, batched_y, batched_lengths) in enumerate(loader):
    print(f"Iteration {counter}")
    print("Batched Input:")
    print(batched_X)
    print("Batched Labels:")
    print(batched_y)
    print("Batched Lengths:")
    print(batched_lengths)
    print("")

In [None]:
print('Original Tensor:')
print(batched_X)

chunk = batched_X.unfold(1, 2 * 2 + 1, 1)  # 2 * window_size + 1
print('\nWindows:')
print(chunk)
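
The same windowing can be sketched in plain Python, which makes it easier to see that window *i* is centered on token *i* of the unpadded sentence. `sliding_windows` is a hypothetical helper written for this illustration; the notebook itself relies on `Tensor.unfold`.

```python
from typing import List

def sliding_windows(tokens: List[str], window_size: int) -> List[List[str]]:
    """Return one window of 2 * window_size + 1 tokens per original token."""
    padded = ["<pad>"] * window_size + tokens + ["<pad>"] * window_size
    width = 2 * window_size + 1
    # A padded sentence of length n + 2 * window_size yields exactly n windows.
    return [padded[start:start + width] for start in range(len(tokens))]

tokens = ["the", "sky", "is", "blue"]
for center, window in zip(tokens, sliding_windows(tokens, window_size=2)):
    print(f"{center:>4} <- {window}")
```

Each printed line shows a center word followed by the window the classifier would see when deciding whether that word is a color.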

In [None]:
class WordWindowClassifier(nn.Module):
    def __init__(self, hyperparameters, vocab_size, padding_idx=0):
        super(WordWindowClassifier, self).__init__()

        self.window_size = hyperparameters['window_size']
        self.embed_dim = hyperparameters['embed_dim']
        self.hidden_dim = hyperparameters['hidden_dim']
        self.freeze_embeddings = hyperparameters['freeze_embeddings']
        self.word_to_idx = hyperparameters['word_to_idx']
        self.idx_to_word = hyperparameters['idx_to_word']

        # Embedding Layer
        self.embeds = nn.Embedding(vocab_size, self.embed_dim, padding_idx=padding_idx)
        if self.freeze_embeddings:
            self.embeds.weight.requires_grad = False

        # Hidden Layer
        full_window_size = 2 * self.window_size + 1
        self.hidden_layer = nn.Sequential(
            nn.Linear(full_window_size * self.embed_dim, self.hidden_dim),
            nn.Tanh()
        )

        # Output layer
        self.output_layer = nn.Linear(self.hidden_dim, 1)

        # Probabilities
        self.probabilities = nn.Sigmoid()

    def forward(self, inputs):
        B, _ = inputs.size()

        # Reshaping
        token_windows = inputs.unfold(1, 2 * self.window_size + 1, 1)
        _, adjusted_length, _ = token_windows.size()
        assert token_windows.size() == (B, adjusted_length, 2 * self.window_size + 1)

        # Embedding
        embedded_windows = self.embeds(token_windows)

        # Reshaping
        embedded_windows = embedded_windows.view(B, adjusted_length, -1)

        # Layer 1
        layer_1 = self.hidden_layer(embedded_windows)

        # Layer 2
        output = self.output_layer(layer_1)

        # Sigmoid probability for each window
        output = self.probabilities(output)
        output = output.view(B, -1)

        return output

    def predict(self, sentence):
        # Same preprocessing as `tokenize`
        for punc in [',', ';', '.', '?', '!', '/', "'", '-', '_']:
            sentence = sentence.replace(punc, '')
        tokens = sentence.lower().split()
        window = ['<pad>'] * self.window_size
        padded_tokens = window + tokens + window
        # Fall back to '<unk>' for out-of-vocabulary words
        unk_index = self.word_to_idx['<unk>']
        tokens_idx = [self.word_to_idx.get(token, unk_index) for token in padded_tokens]
        with torch.no_grad():
            output = self(torch.tensor([tokens_idx]))
        # Output position i corresponds to the i-th unpadded token
        target_index = (output[0] > 0.5).nonzero(as_tuple=True)[0]
        if len(target_index) == 0:
            return None
        return ' '.join(tokens[int(idx)] for idx in target_index)
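
Written out, the forward pass above amounts to the following for each position $i$ of the sentence. Here $w$ stands for `window_size`, $d$ for `embed_dim`, $E$ for the embedding matrix, $t_j$ for the index of the token at position $j$ (the padding index outside the sentence), and $W, b, u, c$ for the weights and biases of the two linear layers. These symbols are introduced for illustration only and do not appear in the code.

$$
\begin{aligned}
x_i &= \left[E_{t_{i-w}};\ \dots;\ E_{t_i};\ \dots;\ E_{t_{i+w}}\right] \in \mathbb{R}^{(2w+1)d} \\
h_i &= \tanh\left(W x_i + b\right) \\
p_i &= \sigma\left(u^\top h_i + c\right)
\end{aligned}
$$

A word is flagged as a color when $p_i > 0.5$, which is the threshold used in `predict`.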

## Training

In [None]:
data = list(zip(train_sentences, train_labels))
batch_size = 4
shuffle = True
window_size = 2
collate_fn = partial(_collate_fn, window_size=window_size, word_to_idx=word_to_idx)

loader = DataLoader(data, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn)

model_hyperparameters = {
    'batch_size': batch_size,
    'window_size': window_size,
    'embed_dim': 2*32,
    'hidden_dim': 100,
    'freeze_embeddings': False,
    'word_to_idx': word_to_idx,
    'idx_to_word': idx_to_word
}

vocab_size = len(word_to_idx)
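# Pass the padding index explicitly instead of relying on the default value 0
model = WordWindowClassifier(model_hyperparameters, vocab_size, padding_idx=word_to_idx['<pad>'])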

### Optimizer Definition

In [None]:
learning_rate = 0.01
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

### Loss and Train Functions

In [None]:
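def loss_function(batch_outputs, batch_labels, batch_lengths) -> torch.Tensor:
    # BCELoss averages over every position of the batch, padded ones included
    bceloss = nn.BCELoss()
    loss = bceloss(batch_outputs, batch_labels.float())

    # Rescaling by the number of real tokens in the batch
    loss = loss / batch_lengths.sum().float()

    return loss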


def train_epoch(loss_function, optimizer, model, loader):
    total_loss = 0
    for batch_inputs, batch_labels, batch_lengths in loader:
        optimizer.zero_grad()  # clear the gradients of the previous step
        outputs = model(batch_inputs)  # forward pass
        loss = loss_function(outputs, batch_labels, batch_lengths)  # batch loss
        loss.backward()  # backpropagation
        optimizer.step()  # parameter update
        total_loss += loss.item()
    return total_loss


def train(loss_function, optimizer, model, loader, num_epochs=10000):
    for epoch in range(0, num_epochs):
        epoch_loss = train_epoch(loss_function, optimizer, model, loader)
        if epoch % 100 == 0: print(f'Epoch: {epoch+1} - Loss: {epoch_loss}')
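
As written, `loss_function` averages over every position in the batch, padded ones included. One possible alternative, sketched below in plain Python, averages the binary cross-entropy over real tokens only by using the sentence lengths. `masked_bce` is a hypothetical helper, not the loss used in this notebook, and it assumes probabilities strictly between 0 and 1.

```python
import math
from typing import List

def masked_bce(probabilities: List[List[float]], labels: List[List[int]], lengths: List[int]) -> float:
    """Mean binary cross-entropy over the first `length` positions of each row."""
    total, count = 0.0, 0
    for row_probs, row_labels, length in zip(probabilities, labels, lengths):
        # Positions beyond `length` are padding and are skipped.
        for p, y in zip(row_probs[:length], row_labels[:length]):
            # -[y * log(p) + (1 - y) * log(1 - p)] for a single position
            total -= y * math.log(p) + (1 - y) * math.log(1 - p)
            count += 1
    return total / count

probabilities = [[0.1, 0.9, 0.5], [0.2, 0.5, 0.5]]
labels = [[0, 1, 0], [0, 0, 0]]
lengths = [3, 1]  # the second row has two padded positions
print(masked_bce(probabilities, labels, lengths))
```

Only four positions contribute to the average in this example, since the two padded positions of the second row are ignored.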

### Training Run

In [None]:
num_epochs = 1000
train(loss_function, optimizer, model, loader, num_epochs=num_epochs)

## Model's Predictions

In [None]:
test_corpus = ["The sky is orange and blue.",
               "When the sun goes down, the sky turns black.",
               "The car is purple!",
               "Is the car violet?",
               "No color in there.",
               "I like ham when it's pink."]
test_sentences = [tokenize(sentence) for sentence in test_corpus]
test_labels = [[0, 0, 0, 1, 0, 1],
               [0, 0, 0, 0, 0, 0, 0, 0, 1],
               [0, 0, 0, 1],
               [0, 0, 0, 1],
               [0, 0, 0, 0],
               [0, 0, 0, 0, 0, 1]]

verbose = True
test_data = list(zip(test_sentences, test_labels))
batch_size = 1
shuffle = False
window_size = 2
collate_fn = partial(_collate_fn, window_size=window_size, word_to_idx=word_to_idx)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

for counter, (test_instance, labels, _) in enumerate(test_loader):
    print(f'Sample {counter+1}')
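    with torch.no_grad():  # no gradient is needed at inference time
        outputs = model(test_instance)[0]
    if verbose:
        print(f'   Labels: {labels}\n   Outputs: {outputs}')
    nb_colors = int(labels.sum())  # plain integer, as expected by torch.topk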
    if nb_colors == 0:
        print('   No color to detect.\n')
        continue
    colors_indexes = torch.topk(outputs.flatten(), nb_colors).indices
    if nb_colors == 1:
        print(f'   Detected color: {idx_to_word[int(test_instance[0, window_size+colors_indexes[0]])]}.\n')
    else:
        colors = ', '.join(idx_to_word[int(test_instance[0, window_size+index])] for index in colors_indexes)
        print(f'   Detected colors: {colors}.\n')

The model performs reasonably well here, yet it struggles in certain situations.\
To mitigate this, it could be relevant to increase the size of the dataset, apply a word-reduction method (stemming or lemmatizing), or improve the structure of the neural network, which remains very basic.
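
To quantify how well the model performs, token-level precision and recall could be computed from thresholded outputs. The sketch below uses made-up predictions rather than actual model output, and `precision_recall` is a hypothetical helper that is not part of the notebook.

```python
from typing import List, Tuple

def precision_recall(predicted: List[List[int]], expected: List[List[int]]) -> Tuple[float, float]:
    """Token-level precision and recall for the positive (color) class."""
    true_pos = false_pos = false_neg = 0
    for pred_row, gold_row in zip(predicted, expected):
        for pred, gold in zip(pred_row, gold_row):
            true_pos += int(pred == 1 and gold == 1)
            false_pos += int(pred == 1 and gold == 0)
            false_neg += int(pred == 0 and gold == 1)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Made-up predictions for two short sentences (not actual model output).
predicted = [[0, 0, 0, 1, 0, 0], [0, 1, 0, 0]]
expected = [[0, 0, 0, 1, 0, 1], [0, 0, 0, 1]]
print(precision_recall(predicted, expected))
```

Precision measures how many flagged words really are colors, while recall measures how many of the true color words were found.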