# Data encoding

In [1]:
from os import getcwd, chdir

if getcwd().endswith('notebooks'):
    chdir('..')

In [2]:
from data.getDataset import getCognatesSet, getIteration
from data.vocab import computeInferenceData_Source, computeInferenceData_Target, wordsToOneHots
from random import randint

raw_cognates = getCognatesSet()['french']
# Draw a random window of 5 consecutive entries.
__random_start_index = randint(0, len(raw_cognates) - 6)
raw_cognates = raw_cognates[__random_start_index: __random_start_index + 5]
cognates = computeInferenceData_Target(wordsToOneHots(raw_cognates))
# Select the samples at the same indices as the cognates.
raw_samples = getIteration(1)[__random_start_index: __random_start_index + 5]
samples = computeInferenceData_Source(wordsToOneHots(raw_samples))  # TODO: simplify the data loading

## Encoding

## Samples encoding: `InferenceData_Source` type

This type refers to a tuple of three elements:
- an IntTensor `S` of shape $\left(\max \{|x|, x \in \textrm{batch}\} + 2, c, b\right)$. For all $0 \leq i < c$ and $0 \leq j < b$, `S[:, i, j]` represents one of the $b$ samples linked to the $i$-th cognate pair. Along the first axis, the sample is stored as a sequence of tokens encoded as one-hot indices, opened by the `SOS_TOKEN` and closed by the `EOS_TOKEN`.
- a CPU ByteTensor `L` of shape $\left( c, b \right)$ containing the length of each sample, boundary tokens included. It is defined such that `S[L[i, j]:, i, j]` contains only the one-hot index of the `PADDING_TOKEN`. Therefore, `L[i, j]` = $|x_{(i,j)}| + 2$, where $x_{(i,j)}$ denotes the raw sample (without the boundary tokens) stored at position $(i, j)$ in `S`.
- `n`: the max of `L` (if the tuple is correctly defined, then `n = S.size()[0]`)
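
The layout described in the list above can be sketched without any dependency. This is only an illustration, not the code of `computeInferenceData_Source`: nested Python lists stand in for the tensors, the function name `encode_sources` is hypothetical, and the values `58`, `57` and `59` for the `SOS_TOKEN`, `EOS_TOKEN` and `PADDING_TOKEN` indices are assumptions read off the output printed in this notebook.

```python
# Token indices inferred from the output printed in this notebook (assumption).
SOS, EOS, PAD = 58, 57, 59

def encode_sources(batch):
    """Encode a c x b grid of token-index lists as an (S, L, n) tuple.

    Nested lists stand in for the tensors: S[t][i][j] is the t-th token of
    the j-th sample linked to the i-th cognate pair.
    """
    c, b = len(batch), len(batch[0])
    # Lengths include the two boundary tokens.
    L = [[len(x) + 2 for x in row] for row in batch]
    n = max(max(row) for row in L)
    # Start from a grid full of padding, then write each sequence in place.
    S = [[[PAD] * b for _ in range(c)] for _ in range(n)]
    for i, row in enumerate(batch):
        for j, x in enumerate(row):
            for t, token in enumerate([SOS, *x, EOS]):
                S[t][i][j] = token
    return S, L, n

# Two samples (c = 2, b = 1) whose token indices are copied from the printed output.
S, L, n = encode_sources([[[30, 14, 9, 30, 46]], [[17, 41, 12, 14]]])
column = [S[t][1][0] for t in range(n)]
print(column)  # [58, 17, 41, 12, 14, 57, 59]
```

Everything from position `L[i][j]` onwards in a column is padding, which is the invariant stated for `L`.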

In [3]:
print(raw_samples)
print(samples[0][...,0].T)
print(samples[0].size())
print('\n' + "Samples' length (without boundaries):", str([len(c) for c in raw_samples]))
print("Samples' length (with boundaries):", samples[1][:,0])
print("Max sample length with boundaries:", samples[2])

['ɔrixinal', 'ɔrnamɛntʊ', 'uʁor', 'ɔrlɔʒ', 'ɔrtoðoksa']
tensor([[58, 30, 14,  6, 20,  6, 11,  0,  9, 57, 59],
        [58, 30, 14, 11,  0, 10, 32, 11, 16, 43, 57],
        [58, 17, 41, 12, 14, 57, 59, 59, 59, 59, 59],
        [58, 30, 14,  9, 30, 46, 57, 59, 59, 59, 59],
        [58, 30, 14, 16, 12, 23, 12,  8, 15,  0, 57]], dtype=torch.int32)
torch.Size([11, 5, 1])

Samples' length (without boundaries): [8, 9, 4, 5, 9]
Samples' length (with boundaries): tensor([10, 11,  6,  7, 11])
Max sample length with boundaries: 11
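
As a sanity check, the lengths can be recovered from the padded rows alone. The sketch below assumes that `59` is the `PADDING_TOKEN` index, as the printed tensor suggests; the helper `length_with_boundaries` is hypothetical and not part of the repository.

```python
PAD = 59  # padding index as it appears in the printed tensor (assumption)

def length_with_boundaries(column):
    """Return the number of tokens that precede the padding in one sequence."""
    return next((t for t, token in enumerate(column) if token == PAD), len(column))

# Rows copied from the transposed tensor printed above.
rows = [
    [58, 30, 14, 6, 20, 6, 11, 0, 9, 57, 59],
    [58, 30, 14, 11, 0, 10, 32, 11, 16, 43, 57],
    [58, 17, 41, 12, 14, 57, 59, 59, 59, 59, 59],
    [58, 30, 14, 9, 30, 46, 57, 59, 59, 59, 59],
    [58, 30, 14, 16, 12, 23, 12, 8, 15, 0, 57],
]
lengths = [length_with_boundaries(row) for row in rows]
print(lengths)       # [10, 11, 6, 7, 11]
print(max(lengths))  # 11
```

The recovered values match the second and third elements of the tuple printed above.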


## Cognates encoding: `InferenceData_Targets` type

This type is a tuple similar to `InferenceData_Source`, except that the `EOS_TOKEN` is removed from the first IntTensor, so each encoded sequence is one token shorter than it would be in the previous type. Its three elements are:
- `S`: an IntTensor of shape $\left( \max \{ |y_l|, y_l\in \textrm{batch}_l \} + 1, c \right)$
- `L`: a CPU ByteTensor of shape $(c)$, with `L[i]` = $|y_{l, i}| + 1$
- `n`: the max of `L` (if the tuple is correctly defined, then `n = S.size()[0]` )
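
The target layout can be sketched in the same dependency-free way. This is an illustration rather than the code of `computeInferenceData_Target`: nested lists replace the tensors, `encode_targets` is a hypothetical name, and the indices `58` (`SOS_TOKEN`) and `59` (`PADDING_TOKEN`) are assumptions read off this notebook's printed output.

```python
# Token indices inferred from the output printed in this notebook (assumption).
SOS, PAD = 58, 59

def encode_targets(batch):
    """Encode c token-index lists as an (S, L, n) tuple without any EOS token.

    S[t][i] is the t-th token of the i-th cognate.
    """
    # Only the opening boundary token is counted.
    L = [len(y) + 1 for y in batch]
    n = max(L)
    S = [[PAD] * len(batch) for _ in range(n)]
    for i, y in enumerate(batch):
        for t, token in enumerate([SOS, *y]):
            S[t][i] = token
    return S, L, n

# Two cognates whose token indices are copied from the printed output.
S, L, n = encode_targets([[12, 41, 51, 26, 41], [30, 41, 9, 51, 30, 46]])
first = [S[t][0] for t in range(n)]
print(first)  # [58, 12, 41, 51, 26, 41, 59]
print(L, n)   # [6, 7] 7
```

Compared with the source encoding, there is no third axis and no closing token, so a word of $k$ tokens occupies $k + 1$ positions instead of $k + 2$.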

In [4]:
print(raw_cognates)
print(cognates[0].T)
print(cognates[0].size())
print('\n' + "Cognates' length (without SOS token):", str([len(c) for c in raw_cognates]))
print("Cognates' length (with SOS token):", cognates[1])
print("Max cognate length with SOS token:", cognates[2])

['oʁiʒinˈal', 'ɔʁnəmˈɑ̃', 'oʁˈœʁ', 'ɔʁlˈɔʒ', 'ɔʁtodoksˈi']
tensor([[58, 12, 41,  6, 46,  6, 11, 51,  0,  9, 59],
        [58, 30, 41, 11, 31, 10, 51, 28, 54, 59, 59],
        [58, 12, 41, 51, 26, 41, 59, 59, 59, 59, 59],
        [58, 30, 41,  9, 51, 30, 46, 59, 59, 59, 59],
        [58, 30, 41, 16, 12,  2, 12,  8, 15, 51,  6]], dtype=torch.int32)
torch.Size([11, 5])

Cognates' length (without SOS token): [9, 8, 5, 6, 10]
Cognates' length (with SOS token): tensor([10,  9,  6,  7, 11])
Max cognate length with SOS token: 11
