## Task 3

In this task you have to create a network that looks at the characters of a word and tries to guess whether the word is a noun, a verb, an adjective, and so on. More precisely: the input is a word (without context) and the output is a POS (part-of-speech) tag. Since some words are ambiguous and we have no context, the network is supposed to return the set of possible tags.
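One way to turn per-tag scores into a set of possible tags is to keep every tag whose softmax probability passes a threshold. The sketch below is only an illustration: `predict_tag_set`, the threshold value, and the example scores are hypothetical, and the networks later in this notebook are instead trained on single labels and evaluated with accuracy and hit rate@5.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_tag_set(scores, tags, threshold=0.2):
    # Keep every tag whose probability reaches the (hypothetical) threshold.
    probs = softmax(scores)
    chosen = {tag for tag, p in zip(tags, probs) if p >= threshold}
    if not chosen:
        # Fall back to the single most probable tag.
        best = max(range(len(tags)), key=lambda i: probs[i])
        chosen = {tags[best]}
    return chosen

tags = ["NOUN", "VERB", "ADJ"]
scores = [2.0, 1.8, -1.0]  # made-up network outputs for one word
print(sorted(predict_tag_set(scores, tags)))  # ['NOUN', 'VERB']
```

A lower threshold returns larger sets (higher recall), a higher one returns smaller sets (higher precision); the right value would have to be tuned on held-out data.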

The data is taken from the Universal Dependencies English corpus. It of course contains errors, especially because not all possible tags occurred in the data.

Train a network (4p) or two networks (+2p) solving this task. Both networks should look at character n-grams occurring in the word. There are two options:

* **Fixed size:** for instance, take the 2-, 3-, and 4-character suffixes of the word and use them as features (with 1-hot encoding). You can also combine prefix and suffix features. A simple, useful trick: when looking at suffixes, add some '_' characters at the beginning of the word to guarantee that shorter words have suffixes of the desired length.

* **Variable size:** take, for instance, 4-grams (or 4-grams and 3-grams) and use a Deep Averaging Network. A simple trick: add an extra character at the beginning and at the end of the word to encode that an n-gram occurs at a special position ('ed' at the end has a slightly different meaning than 'ed' in the middle).
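Both tricks can be sketched in a few lines of plain Python. The helper names `suffix_features` and `char_ngrams` are hypothetical, and the boundary markers `$` and `#` are just one possible choice of extra characters.

```python
def suffix_features(word, sizes=(2, 3, 4), pad="_"):
    # Left-pad the word so that even very short words have
    # suffixes of every requested length.
    padded = pad * max(sizes) + word.lower()
    return [padded[-n:] for n in sizes]

def char_ngrams(word, n=3, start="$", end="#"):
    # Mark the word boundaries so that an n-gram at the start or end
    # differs from the same characters in the middle of a word.
    marked = start + word.lower() + end
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(suffix_features("Go"))    # ['go', '_go', '__go']
print(char_ngrams("walked"))    # ['$wa', 'wal', 'alk', 'lke', 'ked', 'ed#']
```

The fixed-size variant always yields the same number of features per word, while the n-gram variant yields a list whose length depends on the word, which is why it needs averaging before a feed-forward network.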

In [None]:
# WandB:
# Fixed: https://wandb.ai/maria_wyrzykowska/NLP5/runs/1oiehhn3?workspace=user-maria_wyrzykowska
# Ngrams (smaller network): https://wandb.ai/maria_wyrzykowska/NLP5/runs/2h1odl9x?workspace=user-maria_wyrzykowska
# Ngrams (larger network): https://wandb.ai/maria_wyrzykowska/NLP5/runs/aqpwxt0r?workspace=user-maria_wyrzykowska

In [1]:
from __future__ import unicode_literals, print_function, division

import itertools
from io import open
from nltk.util import ngrams
import numpy as np
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
import wandb
from pytorch_lightning.loggers import WandbLogger
from sklearn import preprocessing
from torch.optim import Adam
from torch.utils.data import DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [5]:
class UdeFixedDataset(torch.utils.data.Dataset):
    def __init__(self, path, device):
        self.data = open(path).readlines()
        self.device = device
        self.suffixes = []
        self.labels = []

        for line in self.data:
            splitted = line.split()
            word = splitted[0].lower()
            label = splitted[1]
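            # Right-pad short words to length 4 so every slice below is full length.
            if len(word) < 4:
                word = word + "_" * (4 - len(word))
            # Prefixes of length 2, 3, 4 and suffixes of length 4 and 2.
            # word[-4:] is listed twice, so it is counted twice in the feature vector;
            # word[-3:] is not used.
            self.suffixes.append([word[:2], word[:3], word[:4], word[-4:], word[-4:], word[-2:]])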
            self.labels.append(label)

        self.unique_suffixes = list(set(list(itertools.chain.from_iterable(self.suffixes))))
        self.unique_labels = list(set(self.labels))

    def fit_encoders(self):
        suffixes_le = preprocessing.LabelEncoder()
        labels_le = preprocessing.LabelEncoder()

        suffixes_le.fit(self.unique_suffixes)
        labels_le.fit(self.unique_labels)

        return suffixes_le, labels_le

    def transform_data(self, suffixes_le, labels_le):
        self.encoded_suffixes = [torch.tensor(np.array(suffixes_le.transform(p)), device = self.device) for p in self.suffixes]
        self.encoded_labels = [torch.tensor(np.array(labels_le.transform([g])), device = self.device) for g in self.labels]

    def __len__(self):
        return len(self.suffixes)

    def __getitem__(self, index):
        return torch.nn.functional.one_hot(self.encoded_suffixes[index], num_classes=len(self.unique_suffixes)).sum(axis=0).type(torch.FloatTensor), self.encoded_labels[index]

In [22]:
class SmallNet(pl.LightningModule):

    def __init__(self, num_features, num_classes):
        super().__init__()

        self.model = nn.Sequential(
            nn.Linear(num_features, 4096),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(4096, 2048),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(2048, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_classes)
        )

        self.loss = nn.CrossEntropyLoss()

    def configure_optimizers(self):
        optimizer = Adam(
            self.model.parameters()
        )
        return optimizer

    def forward(self, inputs):
        return self.model(inputs)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y = torch.flatten(y)
        outputs = self(x)
        loss = self.loss(outputs, y)
        _, preds = torch.max(outputs, dim=1)
        _, idx = torch.topk(outputs, 5)
        self.log("train_loss", loss)
        self.log("train_accuracy", (preds == y).float().mean(), on_step=False, on_epoch=True)
        self.log("train_hitrate@5", (idx == y.reshape(-1, 1)).any(1).float().mean(), on_step=False, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y = torch.flatten(y)
        outputs = self(x)
        loss = self.loss(outputs, y)
        _, preds = torch.max(outputs, dim=1)
        _, idx = torch.topk(outputs, 5)
        self.log("val_loss", loss)
        self.log("val_accuracy", (preds == y).float().mean(), on_step=False, on_epoch=True)
        self.log("val_hitrate@5", (idx == y.reshape(-1, 1)).any(1).float().mean(), on_step=False, on_epoch=True)
        return loss

In [7]:
dataset = UdeFixedDataset('/content/sample_data/english_tags_dev.txt', device)
val_dataset = UdeFixedDataset('/content/sample_data/english_tags_test.txt', device)
dataset.unique_suffixes = list(set(dataset.unique_suffixes + val_dataset.unique_suffixes))
val_dataset.unique_suffixes = list(set(dataset.unique_suffixes + val_dataset.unique_suffixes))
dataset.unique_labels = list(set(dataset.unique_labels + val_dataset.unique_labels))
val_dataset.unique_labels = list(set(dataset.unique_labels + val_dataset.unique_labels))
suffixes_le, labels_le = dataset.fit_encoders()
dataset.transform_data(suffixes_le, labels_le)
val_dataset.transform_data(suffixes_le, labels_le)

In [45]:
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=16, shuffle=False)
print(len(dataset.unique_suffixes))
print(len(dataset.unique_labels))

16216
106


In [25]:
net = SmallNet(len(dataset.unique_suffixes), len(dataset.unique_labels))
wandb.finish()
trainer = pl.Trainer(
    logger=WandbLogger(
        save_dir="/content/sample_data",
        project="NLP5",
    ),
    gpus=1,
    max_epochs=15,
)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [26]:
trainer.fit(net, dataloader, val_dataloader)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type             | Params
-------------------------------------------
0 | model | Sequential       | 77.0 M
1 | loss  | CrossEntropyLoss | 0     
-------------------------------------------
77.0 M    Trainable params
0         Non-trainable params
77.0 M    Total params
308.089   Total estimated model params size (MB)



  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")


In [15]:
# v2: keep only the n-gram indices and use an embedding whose outputs we then average
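The idea in the note above, storing only n-gram indices and averaging their embeddings, can be sketched with `torch.nn.EmbeddingBag` in `mean` mode. This is an illustrative alternative and not what the cells below do (they build count vectors and pass them through a linear layer); the vocabulary size, embedding dimension, and index values are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, dim = 10, 4  # hypothetical sizes
bag = nn.EmbeddingBag(vocab_size, dim, mode="mean")

# Two words with 3 and 2 n-gram indices, flattened into one tensor.
indices = torch.tensor([1, 4, 7, 2, 2])
# offsets[i] is the position in `indices` where word i starts.
offsets = torch.tensor([0, 3])

averaged = bag(indices, offsets)  # one averaged embedding per word

# The same result computed by hand from the embedding table.
manual = torch.stack([
    bag.weight[[1, 4, 7]].mean(dim=0),
    bag.weight[[2, 2]].mean(dim=0),
])
print(averaged.shape)                    # torch.Size([2, 4])
print(torch.allclose(averaged, manual))  # True
```

Compared with summing one-hot vectors, this avoids materialising a vocabulary-sized input for every word, at the cost of needing a custom `collate_fn` that builds `indices` and `offsets` for each batch.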

In [2]:
class UdeNgramDataset(torch.utils.data.Dataset):
    def __init__(self, path, device):
        self.data = open(path).readlines()
        self.device = device
        self.ngrams = []
        self.labels = []

        for line in self.data:
            splitted = line.split()
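            # "$" and "#" mark the start and end of the word, so boundary
            # trigrams differ from the same characters inside a word.
            word = "$" + splitted[0].lower() + "#"
            label = splitted[1]
            self.ngrams.append([''.join(n) for n in ngrams(word, 3)])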
            self.labels.append(label)

        self.unique_ngrams = list(set(list(itertools.chain.from_iterable(self.ngrams))))
        self.unique_labels = list(set(self.labels))

    def fit_encoders(self):
        ngrams_le = preprocessing.LabelEncoder()
        labels_le = preprocessing.LabelEncoder()

        ngrams_le.fit(self.unique_ngrams)
        labels_le.fit(self.unique_labels)

        return ngrams_le, labels_le

    def transform_data(self, ngrams_le, labels_le):
        self.encoded_ngrams  = [torch.tensor(np.array(ngrams_le.transform(p)), device = self.device) for p in self.ngrams]
        self.encoded_labels = [torch.tensor(np.array(labels_le.transform([g])), device = self.device) for g in self.labels]

    def __len__(self):
        return len(self.ngrams)

    def __getitem__(self, index):
        return torch.nn.functional.one_hot(self.encoded_ngrams[index], num_classes=len(self.unique_ngrams)).sum(axis=0).type(torch.FloatTensor), self.encoded_labels[index]

In [6]:
n_dataset = UdeNgramDataset('/content/sample_data/english_tags_dev.txt', device)
n_val_dataset = UdeNgramDataset('/content/sample_data/english_tags_test.txt', device)
n_dataset.unique_ngrams = list(set(n_dataset.unique_ngrams + n_val_dataset.unique_ngrams))
n_val_dataset.unique_ngrams = list(set(n_dataset.unique_ngrams + n_val_dataset.unique_ngrams))
n_dataset.unique_labels = list(set(n_dataset.unique_labels + n_val_dataset.unique_labels))
n_val_dataset.unique_labels = list(set(n_dataset.unique_labels + n_val_dataset.unique_labels))
ngrams_le, labels_le = n_dataset.fit_encoders()
n_dataset.transform_data(ngrams_le, labels_le)
n_val_dataset.transform_data(ngrams_le, labels_le)

In [11]:
class SmallEmbeddedNet(pl.LightningModule):

    def __init__(self, num_features, num_classes):
        super().__init__()
        self.embedding = nn.Linear(num_features, 256)

        self.model = nn.Sequential(
            nn.Linear(256, 4096),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(4096, 2048),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(2048, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_classes)
        )

        self.loss = nn.CrossEntropyLoss()

    def configure_optimizers(self):
        # self.parameters() covers both self.embedding and self.model.
        optimizer = Adam(
            self.parameters()
        )
        return optimizer

    def forward(self, inputs):
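        # The linear layer applied to the n-gram count vector sums the n-gram embeddings
        # (plus a bias); dividing by the number of n-grams turns the sum into an average.
        x = self.embedding(inputs) / inputs.sum(dim=1, keepdim=True)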
        return self.model(x)


    def training_step(self, batch, batch_idx):
        x, y = batch
        y = torch.flatten(y)
        outputs = self(x)
        loss = self.loss(outputs, y)
        _, preds = torch.max(outputs, dim=1)
        _, idx = torch.topk(outputs, 5)
        self.log("train_loss", loss)
        self.log("train_accuracy", (preds == y).float().mean(), on_step=False, on_epoch=True)
        self.log("train_hitrate@5", (idx == y.reshape(-1, 1)).any(1).float().mean(), on_step=False, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y = torch.flatten(y)
        outputs = self(x)
        loss = self.loss(outputs, y)
        _, preds = torch.max(outputs, dim=1)
        _, idx = torch.topk(outputs, 5)
        self.log("val_loss", loss)
        self.log("val_accuracy", (preds == y).float().mean(), on_step=False, on_epoch=True)
        self.log("val_hitrate@5", (idx == y.reshape(-1, 1)).any(1).float().mean(), on_step=False, on_epoch=True)
        return loss

In [8]:
n_dataloader = DataLoader(n_dataset, batch_size=16, shuffle=True)
n_val_dataloader = DataLoader(n_val_dataset, batch_size=16, shuffle=False)
print(len(n_dataset.unique_ngrams))
print(len(n_dataset.unique_labels))

8616
106


In [12]:
n_net = SmallEmbeddedNet(len(n_dataset.unique_ngrams), len(n_dataset.unique_labels))
wandb.finish()
trainer = pl.Trainer(
    logger=WandbLogger(
        save_dir="/content/sample_data",
        project="NLP5",
    ),
    gpus=1,
    max_epochs=15,
)

Run history:
epoch                ▁▁▁▁▁▁▂▂▃▃▃▃▃▃▃▃▄▄▄▅▅▅▅▅▅▅▅▆▆▇▇▇▇▇▇▇▇███
train_accuracy       ▁▃▄▅▅▅▆▆▆▆▇▇▇██
train_hitrate@5      ▁▄▅▆▆▆▆▇▇▇▇▇███
train_loss           █▄▅▄▅▄▅▃▄▄▅▂▃▃▃▃▄▄▆▅▃▃▅▄▄▄▄▂▂▂▄▂▂▂▃▁▄▁▁▄
trainer/global_step  ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
val_accuracy         ▁▄▅▆▆▆▆▇▇▇▇█▇██
val_hitrate@5        ▁▄▅▆▆▇▇▇▇▇██▇██
val_loss             █▅▄▃▃▃▂▂▂▂▁▁▂▁▁

Run summary:
epoch                14.0
train_accuracy       0.65968
train_hitrate@5      0.95492
train_loss           1.49951
trainer/global_step  17189.0
val_accuracy         0.56365
val_hitrate@5        0.92534
val_loss             1.4967


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [None]:
trainer.fit(n_net, n_dataloader, n_val_dataloader)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type             | Params
-----------------------------------------------
0 | embedding | Linear           | 2.2 M 
1 | model     | Sequential       | 11.7 M
2 | loss      | CrossEntropyLoss | 0     
-----------------------------------------------
13.9 M    Trainable params
0         Non-trainable params
13.9 M    Total params
55.424    Total estimated model params size (MB)

