## Homework 7
### Recurrent Neural Networks

Welcome to Homework 7! 

The homework contains several tasks. You can find the amount of points that you get for the correct solution in the task header. Maximum amount of points for each homework is _four_.

The **grading** for each task is the following:
- correct answer - **full points**
- insufficient solution or solution resulting in the incorrect output - **half points**
- no answer or completely wrong solution - **no points**

Even if you don't know how to solve the task, we encourage you to write down your thoughts and progress and try to address the issues that stop you from completing the task.

When working on the written tasks, try to make your answers short and accurate. Most of the times, it is possible to answer the question in 1-3 sentences.

When writing code, make it readable. Choose appropriate names for your variables (`a = 'cat'` - not good, `word = 'cat'` - good). Avoid constructing lines of code longer than 100 characters (79 characters is ideal). If needed, provide the commentaries for your code, however, a good code should be easily readable without them :)

Finally, all your answers should be written only by yourself. If you copy them from other sources it will be considered as an academic fraud. You can discuss the tasks with your classmates but each solution must be individual.

<font color='red'>**Important!:**</font> **before sending your solution, do the `Kernel -> Restart & Run All` to ensure that all your code works.**

In [5]:
!pip install conllu

Collecting conllu
  Using cached conllu-4.4.1-py2.py3-none-any.whl (15 kB)
Installing collected packages: conllu
[31mERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/gpfs/space/software/jupyterhub/python/jupyter/lib/python3.10/site-packages/conllu'
Check the permissions.
[0m
You should consider upgrading via the '/gpfs/space/software/jupyterhub/python/jupyter/bin/python -m pip install --upgrade pip' command.[0m


In [2]:
!wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4611/ud-treebanks-v2.9.tgz

--2022-04-05 12:38:08--  https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-4611/ud-treebanks-v2.9.tgz
Resolving lindat.mff.cuni.cz (lindat.mff.cuni.cz)... 195.113.20.140
Connecting to lindat.mff.cuni.cz (lindat.mff.cuni.cz)|195.113.20.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 463076751 (442M) [application/x-gzip]
Saving to: ‘ud-treebanks-v2.9.tgz’


2022-04-05 12:38:19 (41.2 MB/s) - ‘ud-treebanks-v2.9.tgz’ saved [463076751/463076751]



In [3]:
!tar -xzf ud-treebanks-v2.9.tgz

In [4]:
import os
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1"

In [5]:
from conllu import parse, TokenList
from pathlib import Path
from torchtext.vocab import build_vocab_from_iterator, Vocab
from collections import OrderedDict, defaultdict
import typing
from typing import List, Dict
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
import torch.nn as nn
import torch
import math

## Task 0. Load the data for your language

Choose the correct folder and short treebank code for your language. For example, for Estonian EDT treebank, the `treebank_dir` will be `ud-treebanks-v2.9/UD_Estonian-EDT/` and `treebank_short` will be `et_edt`.

Ideally, you should use the same dataset that you used in the practice session. If you missed it, just find a dataset for your native language. If your native language is not represented in the Universal Dependencies, you can use any other language that you know (e.g. English).

In [60]:
treebank_dir = Path('ud-treebanks-v2.9/UD_Japanese-GSD/')
treebank_short = 'ja_gsd'

In [61]:
train_data = parse(open(treebank_dir / f'{treebank_short}-ud-train.conllu', encoding='utf-8').read())
valid_data = parse(open(treebank_dir / f'{treebank_short}-ud-dev.conllu', encoding='utf-8').read())
test_data = parse(open(treebank_dir / f'{treebank_short}-ud-test.conllu', encoding='utf-8').read())

In [62]:
word_vocab = build_vocab_from_iterator(
    ([token["form"] for token in sent] for sent in train_data), 
    specials=["<pad>", "<unk>"]
)
word_vocab.set_default_index(word_vocab["<unk>"])

upos_vocab = build_vocab_from_iterator(
    ([token["upos"] for token in sent] for sent in train_data), 
    specials=["<pad>"]
)

In [37]:
feats_vocab = build_vocab_from_iterator(
    (["|".join([k + '=' + v for k, v in token["feats"].items()]) if token["feats"] else "_" for token in sent] 
     for sent in train_data),
     specials=["<pad>"])
feats_vocab.set_default_index(feats_vocab["_"])

In [202]:
print(train_data[10])

TokenList<統一, 教会, に, 導ける, !>


printask 1. Create a character vocab (0.25 points)

Using the training data, create a vocabulary of each character in the training data. I should be similar to the `word_vocab`, but instead of tokens you would have characters. Add `"<pad>"` and `"<unk>"` special tokens. Make index for `"<unk>"` token default for the vocab.

The `stoi` dict should look similar to this (it will not be exactly the same, just make sure that each key is a single character):

```
{'Ð': 106, 'g': 16, '”': 71, '<pad>': 0, '<unk>': 1, ')': 43, 'i': 4, '!': 61, 'Ö': 92, 'a': 2, 'e': 3, '3': 52, 'Ä': 82, 's': 5, ',': 23, 'l': 7, '4': 60, 't': 6, '5': 51, 'u': 8, 'ū': 127, '.': 20, 'n': 9, 'Ü': 64, 'k': 10, 'd': 11, 'o': 12, '-': 30, 'm': 13, '2': 44, 'r': 14, '6': 58, 'v': 15, '0': 35, 'Ç': 121, 'p': 17, '(': 45, 'h': 18, 'j': 19, 'ä': 21, 'S': 28, 'õ': 22, 'B': 59, '"': 29, 'Õ': 80, 'b': 24, 'ü': 25, 'K': 27, 'ö': 26, 'A': 37, 'T': 31, 'E': 32, 'M': 33, '1': 34, 'ç': 104, 'P': 36, 'f': 38, '9': 39, 'á': 118, 'V': 40, 'L': 41, 'N': 42, 'þ': 99, 'I': 46, 'å': 122, 'R': 47, 'J': 48, 'H': 49, 'ø': 119, 'O': 50, ':': 53, '?': 54, 'U': 55, 'Ž': 113, '8': 56, 'c': 57, '7': 62, 'ó': 110, 'D': 63, 'ð': 107, 'G': 65, 'C': 66, 'y': 67, '$': 114, 'š': 68, 'Š': 91, '%': 69, '“': 70, 'ñ': 125, 'F': 72, 'z': 73, 'w': 74, 'ž': 84, ';': 75, "'": 76, 'x': 77, '/': 78, 'ω': 79, 'à': 117, 'W': 81, 'X': 83, 'Y': 85, 'í': 124, 'Z': 86, '[': 87, ']': 88, '·': 89, '…': 90, 'q': 93, 'æ': 111, 'Q': 94, '&': 95, 'é': 96, '–': 97, '>': 98, '@': 100, '=': 101, '+': 102, '²': 103, '_': 105, '*': 108, '§': 109, '•': 112, '<': 115, '°': 116, '~': 120, 'ë': 123, 'ē': 126}
```

In [203]:
# TODO: Build a character-level vocab
char_vocab = build_vocab_from_iterator(
    ( [ char for char in words] for words in [tokens.metadata['text'] for tokens in train_data]), 
    specials=["<pad>", "<unk>"])
# TODO: Set the default index
char_vocab.set_default_index(char_vocab["<unk>"])

## Task 2. Add character representation of words (0.25 points)

Modify the dataset to return the character indices of each token. For example, for the token `cat` and the vocab `{'<pad>': 0, '<unk>': 1, 'a': 2, 'c': 3, 't': 4}` you should return `[3, 2, 4]`.

In [302]:
class UDDataset(Dataset):
    def __init__(self, data: List[TokenList], 
                 vocabs: Dict[str, Vocab], 
                 device: torch.device):
        super().__init__()
        self.data = data
        self.vocabs = vocabs
        self.device = device
        self.preprocess()

    def preprocess(self):
        self.preprocessed = []
        for sent in self.data:
            # TODO: Return character indices for each token
            chars=[]
            tokens = [ token["form"] for token in sent ] 
            for token in tokens:
                lst=[]
                for t in list(token):
                    try:
                        lst.append(self.vocabs["char"].get_stoi()[t])
                    except KeyError:
                        lst.append(self.vocabs["char"].get_default_index())
                chars.append(lst)
            
            words = self.vocabs["word"]([token["form"] for token in sent])
            upos = self.vocabs["upos"]([token["upos"] for token in sent])
            feats_raw = ["|".join([k + '=' + v for k, v in token["feats"].items()]) 
                         if token["feats"] else "_" for token in sent]
            feats = self.vocabs["feats"](feats_raw)

            chars = [torch.tensor(char, dtype=torch.long, device=self.device) for char in chars]
            words = torch.tensor(words, dtype=torch.long, device=self.device)
            upos = torch.tensor(upos, dtype=torch.long, device=self.device)
            feats = torch.tensor(feats, dtype=torch.long, device=self.device)

            self.preprocessed.append((chars, words, upos, feats))

    def __getitem__(self, index):
        return self.preprocessed[index]

    def __len__(self):
        return len(self.preprocessed)

In [303]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# device = torch.device("cpu")
print("Current device is:", device)

vocabs = {"char": char_vocab, "word": word_vocab, "upos": upos_vocab, "feats": feats_vocab}
train_dataset = UDDataset(train_data, vocabs, device)
dev_dataset = UDDataset(valid_data, vocabs, device)
test_dataset = UDDataset(test_data, vocabs, device)

Current device is: cuda


You can use these two cells to test if your output makes sense. Look through each character encoded token and compare it with the original token. The number of characters (letters) must be the same for each token.

For example:

```
([tensor([46, 16,  2], device='cuda:0'),
  tensor([25, 18,  3, 10,  5,  2,  5], device='cuda:0'),
  tensor([10, 14, 12, 12,  9], device='cuda:0'),
  tensor([6, 8, 7, 4], device='cuda:0'),
  tensor([ 5,  2,  7,  2, 17, 21, 14,  2,  5,  6,  3,  7,  6], device='cuda:0'),
  tensor([ 4,  5,  4, 10,  8,  6,  3,  7,  6], device='cuda:0'),
  tensor([20], device='cuda:0')],
 tensor([  646, 15186, 12842,   172, 62115, 44158,     2], device='cuda:0'),
 tensor([13,  5,  1,  3,  5,  1,  2], device='cuda:0'),
 tensor([ 94, 119,   2,   6, 313, 154,   1], device='cuda:0'))
```

```
# newdoc id = aja_ee199920
# sent_id = aja_ee199920_1
# text = Iga üheksas kroon tuli salapärastelt isikutelt.
1	Iga	iga	DET	P	Case=Nom|Number=Sing|PronType=Tot	3	det	3:det	_
2	üheksas	üheksas	ADJ	N	Case=Nom|Number=Sing|NumForm=Word|NumType=Ord	3	amod	3:amod	_
3	kroon	kroon	NOUN	S	Case=Nom|Number=Sing	4	nsubj	4:nsubj	_
4	tuli	tulema	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act	0	root	0:root	_
5	salapärastelt	sala_pärane	ADJ	A	Case=Abl|Degree=Pos|Number=Plur	6	amod	6:amod	_
6	isikutelt	isik	NOUN	S	Case=Abl|Number=Plur	4	obl	4:obl	SpaceAfter=No
7	.	.	PUNCT	Z	_	4	punct	4:punct	_
```

See how the token `Iga` is encoded as `tensor([46, 16,  2], device='cuda:0')`, `üheksas` as `tensor([25, 18,  3, 10,  5,  2,  5], device='cuda:0')` and so on. Your results should follow a similar pattern.

In [304]:
train_dataset[0]

([tensor([255,  46, 248,  16], device='cuda:0'),
  tensor([4], device='cuda:0'),
  tensor([11], device='cuda:0'),
  tensor([ 97,  19,  76, 130,  41,  29], device='cuda:0'),
  tensor([86, 67, 16], device='cuda:0'),
  tensor([2], device='cuda:0'),
  tensor([ 418, 1226], device='cuda:0'),
  tensor([15], device='cuda:0'),
  tensor([31,  8], device='cuda:0'),
  tensor([2], device='cuda:0'),
  tensor([10], device='cuda:0'),
  tensor([5], device='cuda:0'),
  tensor([2297], device='cuda:0'),
  tensor([45, 26], device='cuda:0'),
  tensor([95], device='cuda:0'),
  tensor([4], device='cuda:0'),
  tensor([259,  16,  32], device='cuda:0'),
  tensor([14], device='cuda:0'),
  tensor([1204,   22,   20], device='cuda:0'),
  tensor([27, 12], device='cuda:0'),
  tensor([11], device='cuda:0'),
  tensor([512,  80], device='cuda:0'),
  tensor([109], device='cuda:0'),
  tensor([4], device='cuda:0'),
  tensor([ 418, 1226], device='cuda:0'),
  tensor([4], device='cuda:0'),
  tensor([17,  8], device='cuda:0'),


In [305]:
print(train_data[0].serialize())

# newdoc id = train-s1
# sent_id = train-s1
# text = ホッケーにはデンジャラスプレーの反則があるので、膝より上にボールを浮かすことは基本的に反則になるが、その例外の一つがこのスクープである。
1	ホッケー	ホッケー	NOUN	名詞-普通名詞-一般	_	9	obl	_	BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|LUWBILabel=B|LUWPOS=名詞-普通名詞-一般|SpaceAfter=No
2	に	に	ADP	助詞-格助詞	_	1	case	_	BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|LUWBILabel=B|LUWPOS=助詞-格助詞|SpaceAfter=No
3	は	は	ADP	助詞-係助詞	_	1	case	_	BunsetuBILabel=I|BunsetuPositionType=FUNC|LUWBILabel=B|LUWPOS=助詞-係助詞|SpaceAfter=No
4	デンジャラス	デンジャラス	NOUN	名詞-普通名詞-一般	_	5	compound	_	BunsetuBILabel=B|BunsetuPositionType=CONT|LUWBILabel=B|LUWPOS=名詞-普通名詞-一般|SpaceAfter=No
5	プレー	プレー	NOUN	名詞-普通名詞-サ変可能	_	7	nmod	_	BunsetuBILabel=I|BunsetuPositionType=SEM_HEAD|LUWBILabel=I|LUWPOS=名詞-普通名詞-一般|SpaceAfter=No
6	の	の	ADP	助詞-格助詞	_	5	case	_	BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|LUWBILabel=B|LUWPOS=助詞-格助詞|SpaceAfter=No
7	反則	反則	NOUN	名詞-普通名詞-サ変可能	_	9	nsubj	_	BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|LUWBILabel=B|LUWPOS=名詞-普通名詞-一般|SpaceAfter=No
8	が	が	ADP	助詞-格助

## Task 3. Make the collate function (0.25 points)

To make a mini-batch for the character-level, unfold each sentence and concatenate it into one padded 2-d tensor.

For example, this input:

```
[[tensor([46, 16,  2], device='cuda:0'),
  tensor([25, 18,  3, 10,  5,  2,  5], device='cuda:0'),
  tensor([10, 14, 12, 12,  9], device='cuda:0'),
  tensor([6, 8, 7, 4], device='cuda:0'),
  tensor([ 5,  2,  7,  2, 17, 21, 14,  2,  5,  6,  3,  7,  6], device='cuda:0'),
  tensor([ 4,  5,  4, 10,  8,  6,  3,  7,  6], device='cuda:0'),
  tensor([20], device='cuda:0')],
 [tensor([32,  3,  5,  6,  4], device='cuda:0'),
  tensor([32, 10,  5, 17, 14,  3,  5,  5,  4], device='cuda:0'),
  tensor([ 6,  3,  2, 11,  2], device='cuda:0'),
  tensor([12,  9], device='cuda:0'),
  tensor([32,  3,  5,  6,  4], device='cuda:0'),
  tensor([36,  2,  9, 10], device='cuda:0'),
  tensor([ 8,  8, 14,  4,  9,  8, 11], device='cuda:0'),
  tensor([49,  2,  9,  5,  2, 17,  2,  9, 16,  2], device='cuda:0'),
  tensor([ 6,  3, 18,  4,  9, 16,  8,  4, 11], device='cuda:0'),
  tensor([23], device='cuda:0'),
  tensor([13,  4,  5], device='cuda:0'),
  tensor([ 6, 12,  4, 13,  8,  5,  4, 11], device='cuda:0'),
  tensor([10,  2, 10,  5], device='cuda:0'),
  tensor([2, 2, 5, 6, 2, 6], device='cuda:0'),
  tensor([ 6,  2, 16,  2,  5,  4], device='cuda:0'),
  tensor([ 5,  8, 15,  3,  7], device='cuda:0'),
  tensor([19,  2], device='cuda:0'),
  tensor([13,  4,  7,  7,  3], device='cuda:0'),
  tensor([10, 21,  4, 16,  8,  5], device='cuda:0'),
  tensor([15, 12, 12,  7,  2,  5], device='cuda:0'),
  tensor([17,  2,  9, 10,  2], device='cuda:0'),
  tensor([ 7,  4, 16,  4], device='cuda:0'),
  tensor([13,  4,  7, 19,  2, 14, 11,  4], device='cuda:0'),
  tensor([10, 14, 12, 12,  9,  4], device='cuda:0'),
  tensor([8, 7, 2, 6, 8, 5, 3, 5], device='cuda:0'),
  tensor([10,  2, 18,  6,  7,  2,  5,  6], device='cuda:0'),
  tensor([14,  2, 18,  2], device='cuda:0'),
  tensor([20], device='cuda:0')]]
```

should become this mini-batch:

```
tensor([[46, 16,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [25, 18,  3, 10,  5,  2,  5,  0,  0,  0,  0,  0,  0],
        [10, 14, 12, 12,  9,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 6,  8,  7,  4,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 5,  2,  7,  2, 17, 21, 14,  2,  5,  6,  3,  7,  6],
        [ 4,  5,  4, 10,  8,  6,  3,  7,  6,  0,  0,  0,  0],
        [20,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [32,  3,  5,  6,  4,  0,  0,  0,  0,  0,  0,  0,  0],
        [32, 10,  5, 17, 14,  3,  5,  5,  4,  0,  0,  0,  0],
        [ 6,  3,  2, 11,  2,  0,  0,  0,  0,  0,  0,  0,  0],
        [12,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [32,  3,  5,  6,  4,  0,  0,  0,  0,  0,  0,  0,  0],
        [36,  2,  9, 10,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 8,  8, 14,  4,  9,  8, 11,  0,  0,  0,  0,  0,  0],
        [49,  2,  9,  5,  2, 17,  2,  9, 16,  2,  0,  0,  0],
        [ 6,  3, 18,  4,  9, 16,  8,  4, 11,  0,  0,  0,  0],
        [23,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [13,  4,  5,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 6, 12,  4, 13,  8,  5,  4, 11,  0,  0,  0,  0,  0],
        [10,  2, 10,  5,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 2,  2,  5,  6,  2,  6,  0,  0,  0,  0,  0,  0,  0],
        [ 6,  2, 16,  2,  5,  4,  0,  0,  0,  0,  0,  0,  0],
        [ 5,  8, 15,  3,  7,  0,  0,  0,  0,  0,  0,  0,  0],
        [19,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [13,  4,  7,  7,  3,  0,  0,  0,  0,  0,  0,  0,  0],
        [10, 21,  4, 16,  8,  5,  0,  0,  0,  0,  0,  0,  0],
        [15, 12, 12,  7,  2,  5,  0,  0,  0,  0,  0,  0,  0],
        [17,  2,  9, 10,  2,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 7,  4, 16,  4,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [13,  4,  7, 19,  2, 14, 11,  4,  0,  0,  0,  0,  0],
        [10, 14, 12, 12,  9,  4,  0,  0,  0,  0,  0,  0,  0],
        [ 8,  7,  2,  6,  8,  5,  3,  5,  0,  0,  0,  0,  0],
        [10,  2, 18,  6,  7,  2,  5,  6,  0,  0,  0,  0,  0],
        [14,  2, 18,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [20,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]], device='cuda:0')
```

Also save the length of each word in characters. The length tensor should have dtype `torch.long` and be on a CPU. For the input above, it should be:

```
tensor([ 3,  7,  5,  4, 13,  9,  1,  5,  9,  5,  2,  5,  4,  7, 10,  9,  1,  3,
         8,  4,  6,  6,  5,  2,  5,  6,  6,  5,  4,  8,  6,  8,  8,  4,  1])
```

In [319]:
def _collate_fn(batch):
    # TODO: Padded mini-batch for characters
    chars =  pad_sequence([each for item in batch for each in item[0]], batch_first=True)
    # TODO: Lengths of each token in characters
    chars_len = torch.count_nonzero(chars,dim=1)
    
    words = pad_sequence([item[1] for item in batch], batch_first=True)
    words_len = torch.tensor([len(item[1]) for item in batch], dtype=torch.long)
    upos = pad_sequence([item[2] for item in batch], batch_first=True)
    feats = pad_sequence([item[3] for item in batch], batch_first=True)
    return chars, chars_len, words, words_len, upos, feats

In [320]:
# Change the batch size to a lower value if you run out of memory
batch_size = 32
test_batch_size = 8

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=_collate_fn, shuffle=True)
dev_dataloader = DataLoader(dev_dataset, batch_size=batch_size, collate_fn=_collate_fn, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=test_batch_size, collate_fn=_collate_fn, shuffle=False)

Here you can check if you did everything correctly.

First, the first dimension of `chars` should be equal to the sum of all `words_len`.
Then, the sum of all `chars_len` should be equal to the number of non-zero elements in the `chars`.

If you don't see any error here then everything is probably fine.

In [321]:
batch = next(iter(train_dataloader))

assert batch[0].size(0) == sum(batch[3]), f"The first dimension of chars must be equal to the sum of all elements in words_len! Got {batch[0].size()} and {sum(batch[3])} instead."
assert sum(batch[1]) == batch[0].count_nonzero(), f"The sum of all chars_len must be equal to the number of non-zero elements in chars! Got {sum(batch[1])} and {batch[0].count_nonzero()} instead."

## Task 4. Create a character-level LSTM encoder (2.5 points)

Here, you will need to create a character-level encoder. It will take the characters for each token as an input and will produce a single vector of each token.

First, define the embedding layer with the number of embeddings equal to `vocab_size` and embedding dim to `emb_dim`. Also, specify the padding index, which is `0` in our case.

Next, define a bidirectional LSTM layer. It should have the hidden size equal to `hid_dim` and one layer. Also, specify `batch_first=True`. 

Finally, add a dropout layer with the dropout probability of `0.5`.

In the forward method, first encode the input `chars` with the embedding layer. Then, apply dropout on the embeddings.

After that, pack the embedded inputs with `pack_padded_sequence` function. Make sure to specify `batch_first=True` and `enforce_sorted=False`. Pass the packed inputs to the LSTM layer.

Take the last hidden state of the input. Don't forget to concatenate the hidden dimensions since we have a bidirectional LSTM (use [`torch.cat`](https://pytorch.org/docs/stable/generated/torch.cat.html?highlight=cat#torch.cat) for that). Refer to the [`nn.LSTM`](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html?highlight=lstm#torch.nn.LSTM) documentation to get more information about the inputs and outputs.

In the end, you model should output a 2-d tensor where the first dimension is the same as in `chars` and the second dimension is `hid_dim * 2`.

In [322]:
class CharLSTM(nn.Module):
    def __init__(self, emb_dim, hid_dim, vocab_size):
        super().__init__()
        # TODO: Initialize embedding layer
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # TODO: Initialize LSTM layer
      
        self.lstm = nn.LSTM(input_size=emb_dim, 
                            hidden_size=hid_dim, 
                            num_layers=1, 
                            batch_first=True, 
                            bidirectional=True)
        # TODO: Initialize dropout layer
        self.drop = nn.Dropout()
        
    def forward(self, chars, chars_len):
        # TODO: Embed the inputs and apply dropout
        x = self.drop(self.emb(chars))
        # TODO: Pack the sequence
        x_packed = pack_padded_sequence(x,lengths=chars_len.cpu(),batch_first=True,enforce_sorted=False)
        # TODO: Run LSTM layer on the packed and get the final hidden state for each element in the sequence
        out,(hidden, cell)= self.lstm(x_packed)
        # TODO: Concatenate the hidden dimensions from each LSTM direction
        output = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        return output

You can test your model here.

In [323]:
char_lstm = CharLSTM(100, 400, len(char_vocab))
char_lstm = char_lstm.to(device)

In [324]:
batch = next(iter(train_dataloader))
char_lstm.eval()
with torch.no_grad():
    output = char_lstm(batch[0], batch[1])

print('Input size:', batch[0].size())
print('Output size:', output.size())

Input size: torch.Size([840, 12])
Output size: torch.Size([840, 800])


## Task 5. Add character-level encoder to the model (0.25 points)

Here, you will have to add the `CharLSTM` module to our final model. Use `char_emb_dim`, `char_hid_dim`, and `char_vocab_size` to initialize it. 

Then, calculate the correct input size for the main LSTM encoder. It should be equal to the sum of the word and character embedding sizes.

You don't need to change the forward method here.

In [325]:
class MorphClassifier(nn.Module):
    def __init__(self, emb_dim, hid_dim, char_emb_dim, char_hid_dim, char_vocab_size, vocab_size, num_upos, num_feats):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # TODO: Initialize CharLSTM
        self.char_lstm = CharLSTM(emb_dim,char_hid_dim,char_vocab_size)
        # TODO: Calculate the input size
        input_size = emb_dim + char_hid_dim*2
        self.lstm = nn.LSTM(input_size=input_size, 
                            hidden_size=hid_dim, 
                            num_layers=1, 
                            batch_first=True, 
                            bidirectional=True)
        
        self.upos_clf = nn.Sequential(nn.Dropout(),
                                      nn.Linear(hid_dim * 2, hid_dim),
                                      nn.ReLU(),
                                      nn.Dropout(),
                                      nn.Linear(hid_dim, num_upos))
        
        self.feats_clf = nn.Sequential(nn.Dropout(),
                                       nn.Linear(hid_dim * 2, hid_dim),
                                       nn.ReLU(),
                                       nn.Dropout(),
                                       nn.Linear(hid_dim, num_feats))
        self.drop = nn.Dropout()

    def forward(self, words, words_len, chars, chars_len):
        word_x = self.drop(self.emb(words))
        # Encode with the character-level LSTM
        char_x = self.char_lstm(chars, chars_len.cpu())
        # Split the sequence back into words
        char_x = pad_sequence(char_x.split(words_len.tolist()), batch_first=True)
        # Concatenate word and char embeddings
        x = torch.cat([word_x, char_x], dim=2)

        x_packed = pack_padded_sequence(x, words_len, batch_first=True, enforce_sorted=False)
        hidden_packed, (_, _) = self.lstm(x_packed)
        hidden, lens = pad_packed_sequence(hidden_packed, batch_first=True)

        upos_pred = self.upos_clf(hidden)
        feats_pred = self.feats_clf(hidden)

        return upos_pred, feats_pred

In [326]:
num_feats = len(feats_vocab)
num_upos = len(upos_vocab)
num_words = len(word_vocab)
num_chars = len(char_vocab)

emb_dim = 75
hid_dim = 300
char_emb_dim = 100
char_hid_dim = 400

model = MorphClassifier(emb_dim, hid_dim, char_emb_dim, char_hid_dim, num_chars, num_words, num_upos, num_feats)
model.to(device)

MorphClassifier(
  (emb): Embedding(20179, 75, padding_idx=0)
  (char_lstm): CharLSTM(
    (emb): Embedding(2857, 75, padding_idx=0)
    (lstm): LSTM(75, 400, batch_first=True, bidirectional=True)
    (drop): Dropout(p=0.5, inplace=False)
  )
  (lstm): LSTM(875, 300, batch_first=True, bidirectional=True)
  (upos_clf): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=600, out_features=300, bias=True)
    (2): ReLU()
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=300, out_features=17, bias=True)
  )
  (feats_clf): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=600, out_features=300, bias=True)
    (2): ReLU()
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=300, out_features=4, bias=True)
  )
  (drop): Dropout(p=0.5, inplace=False)
)

In [327]:
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = torch.optim.Adam(model.parameters())

## Task 6. Train and evaluate your model (0.5 points)

Train the model.

In [328]:
num_iters = 50
best_loss = float('inf')
save_path = Path(f'model_char_{treebank_short}_best.pt')

for i in range(num_iters):
    model.train()
    iter_loss = 0
    for chars, chars_len, words, words_len, upos, feats in train_dataloader:
        optimizer.zero_grad()

        upos_pred, feats_pred = model(words, words_len, chars, chars_len)

        loss = criterion(upos_pred.flatten(0, 1), upos.flatten())
        loss += criterion(feats_pred.flatten(0, 1), feats.flatten())
        loss.backward()
        optimizer.step()

        iter_loss += loss
    
    model.eval()
    dev_loss = 0
    with torch.no_grad():
        for chars, chars_len, words, words_len, upos, feats in dev_dataloader:
            upos_pred, feats_pred = model(words, words_len, chars, chars_len)

            loss = criterion(upos_pred.flatten(0, 1), upos.flatten())
            loss += criterion(feats_pred.flatten(0, 1), feats.flatten())

            dev_loss += loss

    dev_loss = dev_loss.item() / len(dev_dataloader)
    print(f"Epoch {i+1}/{num_iters} | Train Loss: {iter_loss.item() / len(train_dataloader)} | " +
          f"Dev Loss: {dev_loss}")
    
    if dev_loss < best_loss:
        print(f'Loss decreased ({best_loss} -> {dev_loss}). Saving the model to {save_path}...')
        best_loss = dev_loss 
        torch.save(model, save_path)

Epoch 1/50 | Train Loss: 0.7688503351686228 | Dev Loss: 0.2451505959033966
Loss decreased (inf -> 0.2451505959033966). Saving the model to model_char_ja_gsd_best.pt...
Epoch 2/50 | Train Loss: 0.2609578473535598 | Dev Loss: 0.16550418734550476
Loss decreased (0.2451505959033966 -> 0.16550418734550476). Saving the model to model_char_ja_gsd_best.pt...
Epoch 3/50 | Train Loss: 0.19330683229196124 | Dev Loss: 0.13860346376895905
Loss decreased (0.16550418734550476 -> 0.13860346376895905). Saving the model to model_char_ja_gsd_best.pt...
Epoch 4/50 | Train Loss: 0.15962970094982856 | Dev Loss: 0.11537754535675049
Loss decreased (0.13860346376895905 -> 0.11537754535675049). Saving the model to model_char_ja_gsd_best.pt...
Epoch 5/50 | Train Loss: 0.1359239207133988 | Dev Loss: 0.11502741277217865
Loss decreased (0.11537754535675049 -> 0.11502741277217865). Saving the model to model_char_ja_gsd_best.pt...
Epoch 6/50 | Train Loss: 0.11872395131383007 | Dev Loss: 0.10304304957389832
Loss decre

Load the best model.

In [330]:
model = torch.load(save_path)
model.to(device)

MorphClassifier(
  (emb): Embedding(20179, 75, padding_idx=0)
  (char_lstm): CharLSTM(
    (emb): Embedding(2857, 75, padding_idx=0)
    (lstm): LSTM(75, 400, batch_first=True, bidirectional=True)
    (drop): Dropout(p=0.5, inplace=False)
  )
  (lstm): LSTM(875, 300, batch_first=True, bidirectional=True)
  (upos_clf): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=600, out_features=300, bias=True)
    (2): ReLU()
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=300, out_features=17, bias=True)
  )
  (feats_clf): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=600, out_features=300, bias=True)
    (2): ReLU()
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=300, out_features=4, bias=True)
  )
  (drop): Dropout(p=0.5, inplace=False)
)

Print out the predictions for the first eight sentences in the test set. 

Report the cases where your model made a mistake. Why do you think it made it?

__YOUR ANSWER BELOW:__

**(A) :**
Modal in UPOS **AUX** sometimes count as **VERB**, for example "し" in 2nd sentence. Because in the usage method is not too different from the main verbs and they are thus tagged VERB.

In [335]:
test_dataloader = DataLoader(test_dataset, batch_size=test_batch_size, collate_fn=_collate_fn, shuffle=False)
test_batch = next(iter(test_dataloader))
test_data_batch = test_data[:test_batch_size]
test_data_batch

model.eval()
with torch.no_grad():
    upos_pred, feats_preds = model(test_batch[2], test_batch[3], test_batch[0], test_batch[1])
    upos_ids = torch.argmax(torch.softmax(upos_pred, dim=2), dim=2)
    feats_ids = torch.argmax(torch.softmax(feats_preds, dim=2), dim=2)
    for i, ids in enumerate(upos_ids):
        print('token\tupos_pred\tupos_real\tfeats_pred\tfeats_real\n')
        sent_len = len(test_data_batch[i])
        tokens = [token["form"] for token in test_data_batch[i]]
        upos_real = [token["upos"] for token in test_data_batch[i]]
        feats_real = ["|".join([k + '=' + v for k, v in token["feats"].items()]) if token["feats"] else "_"
                      for token in test_data_batch[i]]
        upos_pred = upos_vocab.lookup_tokens(ids.tolist()[:sent_len])
        feats_pred = feats_vocab.lookup_tokens(feats_ids[i].tolist()[:sent_len])

        for token, upos, u_real, feats, f_real in zip(tokens, upos_pred, upos_real, feats_pred, feats_real):
            print('\t'.join([token, upos, u_real, feats, f_real]))
        print('\n---\n')

token	upos_pred	upos_real	feats_pred	feats_real

これ	PRON	PRON	_	_
に	ADP	ADP	_	_
不快	NOUN	NOUN	_	_
感	NOUN	NOUN	_	_
を	ADP	ADP	_	_
示す	VERB	VERB	_	_
住民	NOUN	NOUN	_	_
は	ADP	ADP	_	_
い	VERB	VERB	_	_
まし	AUX	AUX	_	_
た	AUX	AUX	_	_
が	SCONJ	SCONJ	_	_
,	PUNCT	PUNCT	_	_
現在	ADV	ADV	_	_
,	PUNCT	PUNCT	_	_
表立っ	VERB	VERB	_	_
て	SCONJ	SCONJ	_	_
反対	NOUN	NOUN	_	_
や	ADP	ADP	_	_
抗議	NOUN	NOUN	_	_
の	ADP	ADP	_	_
声	NOUN	NOUN	_	_
を	ADP	ADP	_	_
挙げ	VERB	VERB	_	_
て	SCONJ	SCONJ	_	_
いる	VERB	VERB	_	_
住民	NOUN	NOUN	_	_
は	ADP	ADP	_	_
い	VERB	VERB	_	_
ない	SCONJ	SCONJ	Polarity=Neg	Polarity=Neg
よう	AUX	AUX	_	_
です	AUX	AUX	_	_
。	PUNCT	PUNCT	_	_

---

token	upos_pred	upos_real	feats_pred	feats_real

幸福	NOUN	NOUN	_	_
の	ADP	ADP	_	_
科学	NOUN	NOUN	_	_
側	NOUN	NOUN	_	_
から	ADP	ADP	_	_
は	ADP	ADP	_	_
,	PUNCT	PUNCT	_	_
特に	ADV	ADV	_	_
どう	ADV	ADV	_	_
し	VERB	AUX	_	_
て	SCONJ	SCONJ	_	_
ほしい	AUX	AUX	_	_
と	ADP	ADP	_	_
いう	VERB	VERB	_	_
要望	NOUN	NOUN	_	_
は	ADP	ADP	_	_
いただい	VERB	VERB	_	_
て	SCONJ	SCONJ	_	_
い	VERB	VERB	_	_
ませ	AUX	AUX	_	_
ん	SCONJ	SCONJ	Polari

Evaluate your model on the test dataset and report the scores for UPOS and UFeats.

(Optional) Did this model perform better then the one that you trained during the practice session. Compare the scores.

How does your model compare to the other models in [CoNLL 2018 Shared Task](http://universaldependencies.org/conll18/results.html)?

__YOUR ANSWER BELOW:__

**(A) :** Comparing my model to others model in CoNLL 2018 Shared Task, my result in UPOS and Ufeats is better than 1st rank.

In UPOS, my result is **97.37**, the 1st rank in CoNLL 2018 Shared Task is **92.97**.

In UFeats, my result is **99.98**, the 1st rank in CoNLL 2018 Shared Task is **94.52**.

In [336]:
test_dataloader = DataLoader(test_dataset, batch_size=test_batch_size, collate_fn=_collate_fn, shuffle=False)
test_token_list = []
with torch.no_grad():
    for batch_id, (chars, chars_len, words, words_len, upos, feats) in enumerate(test_dataloader):
        upos_pred, feats_preds = model(words, words_len, chars, chars_len)
        upos_ids = torch.argmax(torch.softmax(upos_pred, dim=2), dim=2)
        feats_ids = torch.argmax(torch.softmax(feats_preds, dim=2), dim=2)
        for i, ids in enumerate(upos_ids):
            current_id = (batch_id * test_batch_size) + i
            sent_len = len(test_data[current_id])
            token_ids = [token["id"] for token in test_data[current_id]]
            tokens = [token["form"] for token in test_data[current_id]]
            upos_real = [token["upos"] for token in test_data[current_id]]
            feats_real = ["|".join([k + '=' + v for k, v in token["feats"].items()]) if token["feats"] else "_"
                          for token in test_data[current_id]]
            upos_pred = upos_vocab.lookup_tokens(ids.tolist()[:sent_len])
            feats_pred = feats_vocab.lookup_tokens(feats_ids[i].tolist()[:sent_len])

            token_list = []
            for token_id, token, upos, feats in zip(token_ids, tokens, upos_pred, feats_pred):
                head = token_id - 1 if isinstance(token_id, int) else None
                token_list.append({"id": token_id, 
                                   "form": token, 
                                   "lemma": None, 
                                   "upos": upos,
                                   "xpos": None,
                                   "feats": feats,
                                   "head": head,
                                   "deprel": "root" if head == 0 else None,
                                   "deps": None,
                                   "misc": None})
            test_token_list.append(TokenList(token_list))

In [337]:
with open(Path(f'{treebank_short}-ud-test.pred.conllu'), 'w', encoding='utf-8') as f:
    f.write(''.join([sent.serialize() for sent in test_token_list]))

Don't forget to change the file names to the correct ones for your language. 

The evaluation script can be found here: http://universaldependencies.org/conll18/conll18_ud_eval.py

In [340]:
!python conll18_ud_eval.py -v ./ud-treebanks-v2.9/UD_Japanese-GSD/ja_gsd-ud-test.conllu ./ja_gsd-ud-test.pred.conllu

Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |    100.00 |    100.00 |    100.00 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |    100.00 |    100.00 |    100.00 |
UPOS       |     97.37 |     97.37 |     97.37 |     97.37
XPOS       |      0.00 |      0.00 |      0.00 |      0.00
UFeats     |     99.98 |     99.98 |     99.98 |     99.98
AllTags    |      0.00 |      0.00 |      0.00 |      0.00
Lemmas     |      0.00 |      0.00 |      0.00 |      0.00
UAS        |     34.69 |     34.69 |     34.69 |     34.69
LAS        |      0.02 |      0.02 |      0.02 |      0.02
CLAS       |      0.55 |      0.04 |      0.08 |      0.04
MLAS       |      0.18 |      0.01 |      0.03 |      0.01
BLEX       |      0.00 |      0.00 |      0.00 |      0.00
