# DL - Attention

In [16]:
import torch, torchdata, torchtext
from torch import nn
import torch.nn.functional as F 
import random, math, time

device = torch.device ('cuda' if torch.cuda.is_available() else 'cpu')
print (device)

# make results reproducible when the kernel is restarted
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

cpu


In [17]:
torch.__version__, torchtext.__version__

('2.1.2', '0.16.2')

## 1. ETL: Loading the dataset

In [18]:
from torchtext.datasets import Multi30k

# Define source and target languages
SRC_LANGUAGE = 'en'  # Source language is English
TRG_LANGUAGE = 'de'  # Target language is German

train = Multi30k(split = ('train'), language_pair = (SRC_LANGUAGE, TRG_LANGUAGE))

In [19]:
# this is a datapipe object: torchdata's iterable-style take on the PyTorch Dataset
train

ShardingFilterIterDataPipe

## 2. EDA - simple investigation

In [20]:
# let's take a look at one example from the training set
sample = next(iter(train))
sample

('Two young, White males are outside near many bushes.',
 'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.')

In [21]:
train_size = len(list(iter(train)))
train_size # 29001



29001
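A datapipe is iterable-style, so counting its items takes one full pass over the data. As a minimal sketch (the helper name `count_items` is hypothetical, not part of torchdata), the count can be taken without first building a list of every sample in memory:

```python
def count_items(iterable):
    """Count the items of any iterable in one pass, without storing them."""
    return sum(1 for _ in iterable)


# Stand-in for a datapipe: a generator has no len() either.
pairs = (("en sentence %d" % i, "de sentence %d" % i) for i in range(5))
n_pairs = count_items(pairs)
```

This gives the same number as `len(list(iter(train)))`; the only difference is memory use.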

Since 29001 examples is plenty, we call random_split to divide the data into train, validation, and test sets.

In [22]:
train, val, test = train.random_split (total_length=train_size, weights = {"train": 0.7, "val": 0.2, "test": 0.1}, seed = 999)

In [23]:
train_size = len(list(iter(train)))
train_size # 20301

20301

In [24]:
val_size = len(list(iter(val)))
val_size # 5800

5800

In [25]:
test_size = len(list(iter(test)))
test_size # 2900

2900
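The three sizes above (20301, 5800, 2900) are what you get by cutting 29001 items at rounded cumulative fractions of 0.7 and 0.9. The sketch below illustrates that arithmetic with a seeded shuffle in plain Python. It is an assumption about the general idea, not torchdata's implementation, and `split_by_weights` is a hypothetical helper.

```python
import random


def split_by_weights(items, weights, seed):
    """Shuffle items with a fixed seed, then cut them at cumulative weight fractions."""
    items = list(items)
    random.Random(seed).shuffle(items)  # local RNG, so global random state is untouched

    total = sum(weights.values())
    splits, start, cumulative = {}, 0, 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        end = round(cumulative * len(items))  # boundary index for this split
        splits[name] = items[start:end]
        start = end
    return splits


splits = split_by_weights(range(29001), {"train": 0.7, "val": 0.2, "test": 0.1}, seed=999)
sizes = {name: len(part) for name, part in splits.items()}
# sizes == {'train': 20301, 'val': 5800, 'test': 2900}
```

Fixing the seed means the same examples land in the same split on every run, which keeps the validation and test sets from leaking into training after a restart.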

## 3. Preprocessing

### Tokenizing

Note: the spaCy models must first be downloaded by running the following on the command line:

python3 -m spacy download en_core_web_sm

python3 -m spacy download de_core_news_sm

Since we have two languages, we use the SRC_LANGUAGE and TRG_LANGUAGE constants defined earlier as keys. Let's create two dicts: one to hold our tokenizers and one to hold the vocabs, which assign an integer to each unique word.

In [26]:
# placeholders
token_transform = {}
vocab_transform = {}

In [27]:
from torchtext.data.utils import get_tokenizer
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language = 'en_core_web_sm')
token_transform[TRG_LANGUAGE] = get_tokenizer('spacy', language = 'de_core_news_sm')

In [28]:
# example of tokenizing the English sentence
print('Sentence: ', sample[0])
print('Tokenization: ', token_transform[SRC_LANGUAGE](sample[0]))

Sentence:  Two young, White males are outside near many bushes.
Tokenization:  ['Two', 'young', ',', 'White', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']
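If the spaCy models are not installed yet, a rough regex stand-in shows what tokenization does to the sample sentence. This is not spaCy's algorithm: `simple_tokenize` is a hypothetical helper that only separates runs of word characters from punctuation, and it will differ from spaCy on contractions, hyphenated words, and similar cases.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


sentence = 'Two young, White males are outside near many bushes.'
tokens = simple_tokenize(sentence)
```

For this particular sentence the result matches the spaCy output shown above, which makes it handy for quick experiments before the real tokenizers are available.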


A helper function that tokenizes every sentence of a split for a given language:

In [29]:
# helper function that yields one list of tokens per sentence
# here data can be the train, val, or test datapipe
def yield_tokens(data, language):
    language_index = {SRC_LANGUAGE: 0, TRG_LANGUAGE:1}
    
    for data_sample in data:
        yield token_transform[language](data_sample[language_index[language]])
        # either first or second index

Before we numericalize, let's define some special symbols so that our neural network can learn embeddings for them: unknown, padding, start of sentence, and end of sentence.

In [30]:
# Define special symbols and indices
UNK_IDX, PAD_IDX, SOS_IDX, EOS_IDX = 0, 1, 2, 3

# make sure the tokens are in order of their indices to properly insert them in the vocab
special_symbols = ['<unk>', '<pad>', '<sos>', '<eos>']
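To make the role of these indices concrete, here is a minimal plain-Python sketch of how they are typically used on a batch: each sequence of token ids is wrapped in `<sos>` and `<eos>`, and shorter sequences are padded with `<pad>` up to the longest one. The helpers `add_sos_eos` and `pad_batch` are hypothetical names for illustration, and the token ids are example values.

```python
UNK_IDX, PAD_IDX, SOS_IDX, EOS_IDX = 0, 1, 2, 3


def add_sos_eos(token_ids):
    """Mark the start and end of one sentence."""
    return [SOS_IDX] + list(token_ids) + [EOS_IDX]


def pad_batch(sequences, pad_idx=PAD_IDX):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]


batch = pad_batch([add_sos_eos([1891, 10, 4]), add_sos_eos([4])])
# batch == [[2, 1891, 10, 4, 3], [2, 4, 3, 1, 1]]
```

Keeping `<pad>` at a fixed, known index matters later, because the loss and the attention mask usually need to ignore padded positions.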

### Text to integers (Numericalization) 

Next we create the objects (torchtext calls them vocabs) that turn these tokens into integers. Here we use the built-in factory function build_vocab_from_iterator, which accepts an iterator that yields lists (or iterators) of tokens.

In [31]:
from torchtext.vocab import build_vocab_from_iterator

for ln in [SRC_LANGUAGE, TRG_LANGUAGE]:
    # Create torchtext's Vocab object
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train, ln),
                                                    min_freq = 2,   # tokens seen fewer than 2 times are left out and will map to <unk>
                                                    specials = special_symbols,
                                                    special_first = True) # insert the special symbols at the beginning rather than the end
# Set UNK_IDX as the default index. This index is returned when the token is not found.
# If not set, a RuntimeError is raised when the queried token is not in the vocabulary.
for ln in [SRC_LANGUAGE, TRG_LANGUAGE]:
    vocab_transform[ln].set_default_index(UNK_IDX)
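Conceptually, building a vocab comes down to counting tokens, dropping the rare ones, and numbering what is left after the special symbols. The plain-Python sketch below illustrates that idea under stated assumptions: `build_vocab` and `lookup` are hypothetical helpers, not torchtext's API, and the ordering used here (most frequent first, ties alphabetical) is an assumption for the example.

```python
from collections import Counter


def build_vocab(token_lists, min_freq=2, specials=('<unk>', '<pad>', '<sos>', '<eos>')):
    """Return (stoi, itos): specials first, then tokens seen at least min_freq times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials)
    # Assumed ordering: descending frequency, then alphabetical for ties.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


def lookup(stoi, tokens, default_index=0):
    """Map tokens to integers, falling back to default_index for unseen tokens."""
    return [stoi.get(tok, default_index) for tok in tokens]


corpus = [['a', 'dog', 'runs'], ['a', 'dog', 'sleeps'], ['a', 'cat']]
stoi, itos = build_vocab(corpus)
ids = lookup(stoi, ['a', 'dog', 'cat', 'unknownword'])
# ids == [4, 5, 0, 0]
```

Note that `'cat'` appears only once, so with `min_freq=2` it falls back to index 0 just like a word that was never seen. The fallback in `lookup` plays the role that `set_default_index(UNK_IDX)` plays for the torchtext vocab.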



In [32]:
# see some examples
vocab_transform[SRC_LANGUAGE](['here', 'is', 'a', 'unknownword', 'a'])

[1891, 10, 4, 0, 4]

In [33]:
# we can reverse it....
mapping = vocab_transform[SRC_LANGUAGE].get_itos()

# look up index 1891, for example
mapping[1891]

'here'

In [35]:
# let's look at index 0, which every unknown word maps to
mapping[0]
# all out-of-vocabulary tokens map to <unk>, whose integer is 0

'<unk>'

In [36]:
# let's try special symbols
mapping[1], mapping[2], mapping[3]

('<pad>', '<sos>', '<eos>')

In [37]:
# check the vocabulary size (number of unique tokens, including the special symbols)
len(mapping)

5174

## 4. Preparing the dataloader

## 5. Design the model

### Seq2Seq

### Encoder

### Attention

### Decoder

## 6. Training

## 7. Test on some random news

## 8. Attention