# KAIST AI605 Assignment 3: Transformer

TA in charge: Miyoung Ko (miyoungko@kaist.ac.kr)

**Due date**:  May 17 (Tue) 11:00pm, 2022  


## Your Submission
If you are a KAIST student, you will submit your assignment via [KLMS](https://klms.kaist.ac.kr). If you are a NAVER student, you will submit via [Google Form](https://forms.gle/FSng5HUwtQinTFAU8). 

You need to submit both (1) .ipynb file (needs to be fully executable on CoLab), and (2) a pdf of the file.

Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed but it is not a group assignment so make sure your answer and code are your own. Make sure to mention your collaborators in your assignment with their names and their student ids.

## Grading
The entire assignment is out of 20 points.. For every late day, your grade will be deducted by 2 points (KAIST students only). You can use one of your no-penalty late days (7 days in total). Make sure to mention this in your submission. You will receive a grade of zero if you submit after 7 days.


## Environment
You will need Python 3.7+ and PyTorch 1.9+, which are already available on Colab:

In [35]:
to_Train = False

In [36]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.9.7
torch 1.11.0+cpu


In [37]:
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
from torch.autograd import Variable
import pprint

In [38]:
#from google.colab import drive
#drive.mount('/content/gdrive')

## 1. Attention Layer

We will first start with going over a few concepts that you learned in your high school statistics class. The variance of a random variable $X$, $\text{Var}
(X)$ is defined as $\text{E}[(X-\mu)^2]$ where $\mu$ is the mean of $X$. Furthermore, given two independent random variables $X$ and $Y$ and a constant $a$,
$$ \text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y),$$
$$ \text{Var}(aX) = a^2\text{Var}(X),$$
$$ \text{Var}(XY) = \text{E}(X^2)\text{E}(Y^2) - [\text{E}(X)]^2[\text{E}(Y)]^2.$$

> **Problem 1.1** *(3 points)* Suppose we are given two sets of $n$ random variables, $X_1 \dots X_n$ and $Y_1 \dots Y_n$, where all of these $2n$ variables are mutually independent and have a mean of $0$ and a variance of $1$. Prove that
$$\text{Var}\left(\sum_i^n X_i Y_i\right) = n.$$



> We are going to proove the generalisation of the (1) for $n$ elemenent by induction : 
$$ \text{Var}\sum_{i=0}^n X_i = \sum_{i=0}^n\text{Var} X_i  \text{   (}  \mathcal{P}(n) \text{)}$$
The case of n=1 is prooven with the (1), we assume the induction hypothesis that for a particular $n$, so $\mathcal{P}(n)$ is true and the single case for the rank $n$ holds, meaning the statement for $n$ is true. 
>
>\begin{align}
    \begin{split} 
        \text{Var}\sum_{i=0}^{n+1} X_i &= \text{Var}\sum_{i=0}^n X_i + X_{n+1} \\
                                       &= (\text{Var}\sum_{i=0}^n X_i)  + X_{n+1} \text{   whith (1)} \\
                                       &= \sum_{i=0}^n\text{Var} X_i + X_{n+1} \text{   whith }
                                       \mathcal{P}(n)  \\
                                       &= \sum_{i=0}^{n+1}\text{Var} X_i 
    \end{split}
\end{align} 
So $\mathcal{P}(n+1)$ is true. 
>
>**Conclusion:**  Since both the base case and the inductive step have been proved as true, by mathematical induction the statement holds for every natural number n : 
> $$ \text{Var}\sum_{i=0}^n X_i = \sum_{i=0}^n\text{Var} X_i  $$
> Hense : 
> $$ \text{Var}\sum_{i=0}^n Y_i X_i = \sum_{i=0}^n\text{Var} (X_i Y_i)  $$
> So with line 3 : 
> $$ \text{Var}\sum_{i=0}^n Y_i X_i = \sum_{i=0}^n \text{E}(X_i^2)\text{E}(Y_i^2) - [\text{E}(X_i)]^2[\text{E}(Y_i)]^2  $$
>But $\forall \text{i} \in [[0,n ]], $
>
>\begin{equation}
    \left\lbrace
		\begin{split}
            E(X_i) &= E(Y_i) = 0 \\
            \text{Var}(X_i) &= E(X_i^2) - E(X_i)^2 = E(X_i^2) = 1
        \end{split}
	\right.
\end{equation}
> So finally : 
> $$ \text{Var}\sum_{i=0}^n Y_i X_i = \sum_{i=0}^n 1 = n $$

**===========================================================================================**

In Lecture 11 and 12, we discussed how the attention is computed in Transformer via the following equation,
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$
> **Problem 1.2** *(3 points)*  Suppose $Q$ and $K$ are matrices of independent variables each of which has a mean of $0$ and a variance of $1$. Using what you learned from Problem 1.1., show that
$$\text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right) = 1.$$

> We use the result of the Problem 1.2, and as $\frac{1}{\sqrt{d_k}}$ is not a variable with (3) we have : 
>
> $$\text{Var}\left(\frac{QK^T}{\sqrt{d_k}}\right) =  \frac{1}{{d_k}} [ \text{Var}(QK^T) ] = \frac{1}{d_k} [\text{Var}(Q) \text{Var}(K^T)]  \\  $$ 
>
> But by definition $d_k$ is the dimention of $Q$ and $T$ and $\forall i,j \in [[0, d_k]],  \text{Var}(T_{ij}) = \text{Var}(Q_{ij}) = 1  $ 
> So : 
> 
>$$
  \text{Var}(Q) = \text{Var}(K^T) =
  \left[ {\begin{array}{cc}
    1 & \dotsb & 1\\
    \vdots &  \ddots & \vdots \\
    1 & \dotsb & 1
  \end{array} } \right]
$$
> So it follows that : 
>$$
  \text{Var}(Q)\text{Var}(K^T) =
  \left[ {\begin{array}{cc}
    d_k & \dotsb & d_k\\
    \vdots &  \ddots & \vdots \\
    d_k & \dotsb & d_k
  \end{array} } \right]
$$
So at the end : 
$$\text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right) = 1.$$

(Where 1 represent the matrix with 1 everywhere)

**===========================================================================================**


> **Problem 1.3** *(2 points)* What would happen if the assumption that the variance of $Q$ and $K$ is $1$ does not hold? Consider each case of it being higher and lower than $1$ and conjecture what it implies, respectively.

>As we have seen in the previous $$[ \text{Var}(QK^T) ] = [\text{Var}(Q) \text{Var}(K^T)]  \\  $$ 
>And : 
>$$
  \text{Var}(Q) = \text{Var}(K^T) =
  \left[ {\begin{array}{cc}
    1 & \dotsb & 1\\
    \vdots &  \ddots & \vdots \\
    1 & \dotsb & 1
  \end{array} } \right]  \text{           (1)}
$$
> Which results to : 
> $$
  \text{Var}(Q)\text{Var}(K^T) =
  \left[ {\begin{array}{cc}
    d_k & \dotsb & d_k\\
    \vdots &  \ddots & \vdots \\
    d_k & \dotsb & d_k
  \end{array} } \right]  \text{           (2)}
$$
>But if the hypothesis of the variance of Q and K is equal to 1, then the equality 1 is no longer valid. 
>
> **_CASE 1_** : the variance is higher than 1 
>
> In this case all the element of the matrix (1) are bigger than 1. Hence, in the matrix (2) all the elements of the matrix are greater than $d_k$, lets name $\alpha$ all those elements.
> $$
  \text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right) = \frac{1}{d_k} \text{Var}(Q)\text{Var}(K^T) =
  \left[ {\begin{array}{cc}
    \frac{\alpha}{d_k} & \dotsb & \frac{\alpha}{d_k}\\
    \vdots &  \ddots & \vdots \\
    \frac{\alpha}{d_k} & \dotsb & \frac{\alpha}{d_k}
  \end{array} } \right]  \text{           (3)}
$$
> As $\alpha$ greater than $d_k$, $\frac{\alpha}{d_k} > 1$, hence : 
>$$ \text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right)> 1  $$
>
> **_CASE 2_** : the variance is smaller than 1 
> 
> In this case all the element of the matrix (1) are smaller than 1. Hence, in the matrix (2) all the elements of the matrix are smaller than $d_k$, lets name $\alpha$ all those elements. We got the same matrix (3), but this time, as $\alpha$ smaller than $d_k$, $\frac{\alpha}{d_k} < 1$, hence : 
>$$ \text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right)< 1  $$

> Conclusion if the variances are greater than 1 the result is greater than one and if the variance is smaller the result is also smaller, which will be shown in the next question. 

> **Problem 1.4** *(2 points)* Now it is time to experimentally verify the theory! Create the random variables $X$ and $Y$ with the mean of zero and the variance of one (using `torch.randn`) and verify the equation in Problem 1.2. Then experiment with higher and lower variance to verify your finding in Problem 1.3. Briefly explain your results.

In [39]:
def loop (n, lenght ): 
    S = torch.zeros((1,lenght)) 
    for i in range (n): 
        X = torch.randn(lenght)
        Y = torch.randn(lenght)
        S += X*Y 
    return int(torch.var(S))


def average(m, n , lenght): 
    a = []
    for i in range(n):
        a.append(loop(n , lenght))
    return int(sum(a)/len(a))

print(" VAR(Sum(XiYi)) =" , average(1000, 400, 600) ,'\n', 'n =', 400)

 VAR(Sum(XiYi)) = 398 
 n = 400


So we can say that $\text{Var}\left(\sum_i^n X_i Y_i\right) = n.$ is a good aproximation. 

In [40]:
import numpy as np

def loop2 (n, lenght, variance): 
    S = torch.zeros((1,lenght)) 
    for i in range (n): 
        X = torch.normal(mean=0, std=np.sqrt(variance), size=(1,  lenght))
        Y = torch.normal(mean=0, std=np.sqrt(variance), size=(1,  lenght))
        S += X*Y 
    return int(torch.var(S))


def average2(m, n , lenght, variance): 
    a = []
    for i in range(n):
        a.append(loop2(n , lenght,variance ))
    return int(sum(a)/len(a))

print("===============", '\n',"Variance = 4")
print(" VAR(Sum(XiYi)) =" , average2(1000, 400, 600, 4) ,'and', 'n =', 400)

print("===============", '\n', "Variance = 2")
print(" VAR(Sum(XiYi)) =" , average2(1000, 400, 600, 2) , 'and', 'n =', 400)

print("===============", '\n',"Variance = 1")
print(" VAR(Sum(XiYi)) =" , average2(1000, 400, 600, 1) ,'and', 'n =', 400)

print("===============", '\n',"Variance = 0.5")
print(" VAR(Sum(XiYi)) =" , average2(1000, 400, 600, 0.5) ,'and', 'n =', 400)

print("===============", '\n',"Variance = 0.1")
print(" VAR(Sum(XiYi)) =" , average2(1000, 400, 600, 0.1) ,'and', 'n =', 400)

 Variance = 4
 VAR(Sum(XiYi)) = 6406 and n = 400
 Variance = 2
 VAR(Sum(XiYi)) = 1605 and n = 400
 Variance = 1
 VAR(Sum(XiYi)) = 399 and n = 400
 Variance = 0.5
 VAR(Sum(XiYi)) = 99 and n = 400
 Variance = 0.1
 VAR(Sum(XiYi)) = 3 and n = 400


So the result of problen 1.2 doesn't hold if the variance is greater or smaller than 1

## 2. Transformer for Spelling Error Correction

In this section, you will implement Transformer for a few tasks that are simpler than machine translation. Feel free to copy and paste from [The Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/) (note that this is a new version released in 2022 recently), though make sure to mention that you copied the code from it. Note that we do not provide a separate training or evaluation data, so it is your job to be able to create these in a reasonable manner.



!pip install -q torchdata==0.3.0 torchtext==0.12 spacy==3.2 altair GPUtil
!python -m spacy download de_core_news_sm
!python -m spacy download en_core_web_sm

In [41]:
import os
from os.path import exists
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax, pad
import math
import copy
import time
from torch.optim.lr_scheduler import LambdaLR
import pandas as pd
#import altair as alt
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader
from torchtext.vocab import build_vocab_from_iterator
import torchtext.datasets as datasets
#import spacy
#import GPUtil
#import warnings
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from tqdm import tqdm 

# Set to False to skip notebook execution (e.g. for debugging)
#warnings.filterwarnings("ignore")
RUN_EXAMPLES = True

In [42]:
# Some convenience helper functions used throughout the notebook


def is_interactive_notebook():
    return __name__ == "__main__"


def show_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        return fn(*args)


def execute_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        fn(*args)


class DummyOptimizer(torch.optim.Optimizer):
    def __init__(self):
        self.param_groups = [{"lr": 0}]
        None

    def step(self):
        None

    def zero_grad(self, set_to_none=False):
        None


class DummyScheduler:
    def step(self):
        None

In [43]:
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many
    other models.
    """

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
class Generator(nn.Module):
    "Define standard linear + softmax generation step."

    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return log_softmax(self.proj(x), dim=-1)

In [44]:
def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
class Encoder(nn.Module):
    "Core encoder is a stack of N layers"

    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

In [45]:
class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

In [46]:
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))

In [47]:
class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"

    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

In [48]:
class Decoder(nn.Module):
    "Generic N layer decoder with masking."

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

In [49]:
class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"

    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

In [50]:
def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(
        torch.uint8
    )
    return subsequent_mask == 0

In [51]:
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

In [52]:
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(
            query, key, value, mask=mask, dropout=self.dropout
        )

        # 3) "Concat" using a view and apply a final linear.
        x = (
            x.transpose(1, 2)
            .contiguous()
            .view(nbatches, -1, self.h * self.d_k)
        )
        del query
        del key
        del value
        return self.linears[-1](x)

In [53]:
class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))

In [54]:
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

In [55]:
class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)

In [56]:
def make_model(
    src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1
):
    "Helper: Construct a model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )

    # This was important from their code.
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

In [57]:
def inference_test():
    test_model = make_model(11, 11, 2)
    test_model.eval()
    src = torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    src_mask = torch.ones(1, 1, 10)

    memory = test_model.encode(src, src_mask)
    ys = torch.zeros(1, 1).type_as(src)

    for i in range(9):
        out = test_model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        prob = test_model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat(
            [ys, torch.empty(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )

    print("Example Untrained Model Prediction:", ys)


def run_tests():
    for _ in range(10):
        inference_test()


show_example(run_tests)

Example Untrained Model Prediction: tensor([[0, 2, 2, 6, 2, 2, 2, 2, 2, 2]])
Example Untrained Model Prediction: tensor([[0, 5, 1, 1, 1, 1, 1, 1, 1, 1]])
Example Untrained Model Prediction: tensor([[ 0, 10,  0, 10,  0, 10,  0, 10,  3,  3]])
Example Untrained Model Prediction: tensor([[0, 9, 5, 0, 5, 0, 5, 0, 5, 0]])
Example Untrained Model Prediction: tensor([[ 0,  2, 10,  9,  9,  7,  3,  3,  2, 10]])
Example Untrained Model Prediction: tensor([[0, 5, 6, 0, 5, 6, 5, 7, 7, 7]])
Example Untrained Model Prediction: tensor([[ 0,  3, 10, 10, 10, 10, 10, 10, 10, 10]])
Example Untrained Model Prediction: tensor([[0, 7, 1, 7, 3, 5, 6, 0, 5, 2]])
Example Untrained Model Prediction: tensor([[0, 4, 0, 4, 8, 7, 1, 1, 1, 1]])
Example Untrained Model Prediction: tensor([[0, 1, 1, 1, 1, 1, 8, 1, 8, 1]])


In [58]:
class Batch:
    """Object for holding a batch of data with mask during training."""

    def __init__(self, src, tgt=None, pad=2):  # 2 = <blank>
        self.src = src
        self.src_mask = (src != pad).unsqueeze(-2)
        if tgt is not None:
            self.tgt = tgt[:, :-1]
            self.tgt_y = tgt[:, 1:]
            self.tgt_mask = self.make_std_mask(self.tgt, pad)
            self.ntokens = (self.tgt_y != pad).data.sum()

    @staticmethod
    def make_std_mask(tgt, pad):
        "Create a mask to hide padding and future words."
        tgt_mask = (tgt != pad).unsqueeze(-2)
        tgt_mask = tgt_mask & subsequent_mask(tgt.size(-1)).type_as(
            tgt_mask.data
        )
        return tgt_mask

In [59]:
class TrainState:
    """Track number of steps, examples, and tokens processed"""

    step: int = 0  # Steps in the current epoch
    accum_step: int = 0  # Number of gradient accumulation steps
    samples: int = 0  # total # of examples used
    tokens: int = 0  # total # of tokens processed

In [60]:
def run_epoch(
    data_iter,
    model,
    loss_compute,
    optimizer,
    scheduler,
    epoch, 
    nbbatch,
    treshold,
    mode="train",
    accum_iter=1,
    train_state=TrainState(),
):
    """Train a single epoch"""
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    n_accum = 0
    j = 0 
    to_break = False 
    for i, batch in enumerate(data_iter):
        #print(i)
        #print(batch)
        out = model.forward(
            batch.src, batch.tgt, batch.src_mask, batch.tgt_mask
        )
        loss, loss_node = loss_compute(out, batch.tgt_y, batch.ntokens)
        # loss_node = loss_node / accum_iter
        if mode == "train" or mode == "train+log":
            loss_node.backward()
            train_state.step += 1
            train_state.samples += batch.src.shape[0]
            train_state.tokens += batch.ntokens
            if i % accum_iter == 0:
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)
                n_accum += 1
                train_state.accum_step += 1
            scheduler.step()

        total_loss += loss
        total_tokens += batch.ntokens
        tokens += batch.ntokens
        if i == nbbatch-1 and (mode == "train" or mode == "train+log"):
            lr = optimizer.param_groups[0]["lr"]
            elapsed = time.time() - start
            print(                
                    "Epoch Step:", '\t', epoch, '\t', "| Accumulation Step:",n_accum, '\t', " | Loss:" , float(loss / batch.ntokens), '\t' , 
                    "| Tokens / Sec:" ,int(tokens / elapsed), '\t'," | Learning Rate:", lr
            )
            epoch += 10
            #% (i, n_accum, loss / batch.ntokens, tokens / elapsed, lr)
            start = time.time()
            tokens = 0
        if loss<treshold : 
          to_break = True
        del loss
        del loss_node
    return total_loss / total_tokens, train_state, to_break

In [61]:
def rate(step, model_size, factor, warmup):
    """
    we have to default the step to 1 for LambdaLR function
    to avoid zero raising to negative power.
    """
    if step == 0:
        step = 1
    return factor * (
        model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5))
    )

In [62]:
class LabelSmoothing(nn.Module):
    "Implement label smoothing."

    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(reduction="sum")
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, true_dist.clone().detach())

In [63]:
class SimpleLossCompute:
    "A simple loss compute and train function."

    def __init__(self, generator, criterion):
        self.generator = generator
        self.criterion = criterion

    def __call__(self, x, y, norm):
        x = self.generator(x)
        sloss = (
            self.criterion(
                x.contiguous().view(-1, x.size(-1)), y.contiguous().view(-1)
            )
            / norm
        )
        return sloss.data * norm, sloss

In [64]:
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    memory = model.encode(src, src_mask)
    ys = torch.zeros(1, 1).fill_(start_symbol).type_as(src.data)
    for i in range(max_len - 1):
        out = model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat(
            [ys, torch.zeros(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )
    return ys

> **Problem 2.1** *(3 points)* Create a model that takes a random set of input symbols from a vocabulary of digits (i.e. 0, 1, ... , 8, 9) as the input and generate back the same symbols. Instead of varying length, we fix the length to 32. Make sure to report that your model's sequence-level (not token level) accuracy goes above 90%. Note that a similar problem is also in Annotated Transformer, and copying the code is allowed.


In [65]:
def data_gen1(V, sentence_size,  batch_size, nbatches):
    """Generate random data for a src-tgt copy task.
    V is is the upper bond value, hence the number on coracters 
    """
    for i in range(nbatches):
        data = torch.randint(0, V, size=(batch_size, sentence_size))
        data[:, 0] = 1
        src = data.requires_grad_(False).clone().detach()
        tgt = data.requires_grad_(False).clone().detach()
        yield Batch(src, tgt, 0)

In [66]:
""" from https://debuggercafe.com/saving-and-loading-the-best-model-in-pytorch/ """

import torch
import matplotlib.pyplot as plt
plt.style.use('ggplot')
class SaveBestModel:
    """
    Class to save the best model while training. If the current epoch's 
    validation loss is less than the previous least less, then save the
    model state.
    """
    def __init__(
        self, path, best_valid_loss=float('inf')
    ):
        self.best_valid_loss = best_valid_loss
        self.path = path 
        
    def __call__(
        self, current_valid_loss, 
        epoch, model, optimizer, criterion
    ):
        
        if current_valid_loss < self.best_valid_loss:
            self.best_valid_loss = current_valid_loss
            print(f"\nBest validation loss: {self.best_valid_loss}")
            print(f"\nSaving best model for epoch: {epoch+1}\n")
            #torch.save(model.state_dict(), '/content/gdrive/MyDrive/AI605_assignement_3/model_epoch200_tres.pt')
            torch.save({
                'epoch': epoch+1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': criterion,
                }, self.path)
# save on collab
#torch.save(model.state_dict(), '/content/gdrive/MyDrive/AI605_assignement_3/model_epoch200_tres.pt')

In [67]:
data_size     = 10
V             = 10       # 0 1 2 3 4 5 6 7 8 9 
sentence_size = 32
min_int,  max_int=0,  9 
batch_size    = 80
nbbatch       = 20
nbepoch       = 200
treshold      = 0.003
saver1         = SaveBestModel('/content/gdrive/MyDrive/AI605_assignement_3/model_exercice_2_1_1.pth')

In [68]:
##########################################################################
###############     Creation of an empty model         ###################
##########################################################################

if to_Train : 
    criterion = LabelSmoothing(size=sentence_size, padding_idx=0, smoothing=0.0)
    model1    = make_model(sentence_size, sentence_size, N=2)

    optimizer = torch.optim.Adam(
        model1.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
    )
    lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, model_size=model1.src_embed[0].d_model, factor=1.0, warmup=400
        ),
    )


Train the model for the first time

In [69]:
##########################################################################
###############  Training of the created model         ###################
##########################################################################

if to_Train : 
    for epoch in range(nbepoch):
        model1.train()
        loss, train_state, to_break = run_epoch(
            data_gen1(V, sentence_size, batch_size, nbbatch), 
            #data, 
            model1,
            SimpleLossCompute(model1.generator, criterion),
            optimizer,
            lr_scheduler,
            epoch, 
            nbbatch,
            treshold,
            mode="train",
        )
        model1.eval()
        run_epoch(
            data_gen1(V, sentence_size, batch_size, 15),
            #data,
            model1,
            SimpleLossCompute(model1.generator, criterion),
            DummyOptimizer(),
            DummyScheduler(),
            epoch, 
            nbbatch,
            treshold,
            mode="eval",
        )[0]
        saver1.__call__(
        loss, 
        epoch,
        model1, 
        optimizer, 
        criterion,
        )
        if to_break == True : 
          break

In [70]:
##########################################################################
###############     Creation of an empty model         ###################
##########################################################################

criterion = LabelSmoothing(size=sentence_size, padding_idx=0, smoothing=0.0)
model1 = make_model(sentence_size, sentence_size, N=2)

optimizer = torch.optim.Adam(
        model1.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
    )
lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, model_size=model1.src_embed[0].d_model, factor=1.0, warmup=400
        ),
    )
model1.eval()


EncoderDecoder(
  (encoder): Encoder(
    (layers): ModuleList(
      (0): EncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512, bias=True)
            (1): Linear(in_features=512, out_features=512, bias=True)
            (2): Linear(in_features=512, out_features=512, bias=True)
            (3): Linear(in_features=512, out_features=512, bias=True)
          )
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (sublayer): ModuleList(
          (0): SublayerConnection(
            (norm): LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): SublayerConnection(
            (norm): LayerNorm()

In [71]:
##########################################################################
###############     Load the model from the drive      ###################
##########################################################################

#from google.colab import files

#model.load_state_dict(torch.load('/content/gdrive/MyDrive/AI605_assignement_3/model_epoch200.pt'))

# load the best model checkpoint
best_model_cp = torch.load('model_exercice_2_1.pth')
best_model_epoch = best_model_cp['epoch']
print(f"Best model was saved at {best_model_epoch} epochs\n")

# load the last model checkpoint
last_model_cp = torch.load('model_exercice_2_1.pth')
last_model_epoch = last_model_cp['epoch']
print(f"Last model was saved at {last_model_epoch} epochs\n")

model1.load_state_dict(best_model_cp['model_state_dict'])
epoch     = (best_model_cp['epoch'])
optimizer.load_state_dict(best_model_cp['optimizer_state_dict'])
criterion = (best_model_cp['loss'])

Best model was saved at 71 epochs

Last model was saved at 71 epochs



In [72]:
##########################################################################
###############    Training of the loaded model        ###################
##########################################################################

if to_Train : 
  for epoch in range(epoch, nbepoch):
        model1.train()
        loss, train_state, to_break = run_epoch(
            data_gen1(V, sentence_size, batch_size, nbbatch), 
            #data, 
            model1,
            SimpleLossCompute(model1.generator, criterion),
            optimizer,
            lr_scheduler,
            epoch, 
            nbbatch,
            treshold,
            mode="train",
        )
        model1.eval()
        run_epoch(
            data_gen1(V, sentence_size, batch_size, 15),
            #data,
            model1,
            SimpleLossCompute(model1.generator, criterion),
            DummyOptimizer(),
            DummyScheduler(),
            epoch, 
            nbbatch,
            treshold,
            mode="eval",
        )[0]
        saver1.__call__(
        loss, 
        epoch,
        model1, 
        optimizer, 
        criterion,
        )
        if to_break == True : 
          break

In [73]:
### Small exemple 
model1.eval()
src = torch.LongTensor([[0, 4, 2, 7, 4, 5, 6, 7, 8, 6]])
max_len = src.shape[1]
src_mask = torch.ones(1, 1, max_len)
print(greedy_decode(model1, src, src_mask, max_len=max_len, start_symbol=0))

tensor([[0, 4, 2, 7, 4, 5, 6, 7, 8, 6]])


In [74]:

def accurary(model, max_int, min_int ):
    common, total = 0, 0
    for i in tqdm(range(0, 500)): 
        model.eval()
        src      = torch.randint(1,max_int, (1, data_size))
        max_len  = src.shape[1]
        src_mask = torch.ones(1, 1, max_len)
        result   = greedy_decode(model, src, src_mask, max_len=max_len, start_symbol=0)
        if src.tolist()[0][1:(data_size-1)]==result.tolist()[0][1:(data_size-1)]: # that mean the index of the target is in the result
        # we get rid of the 0 colomne of 0 and the last column which are the end and start token 
            common +=1 
        total += 1 # all the try 
    return (common / total) *100

In [75]:
acc = accurary(model1, max_int, min_int )
print( '\n', '\n', "The accuracy achieved over 500 sentences is", acc, " %")

100%|████████████████████████████████████████████████████████████████████████████████| 500/500 [00:32<00:00, 15.36it/s]


 
 The accuracy achieved over 500 sentences is 100.0  %





So we get an acccuracy of 100% wich mean all the sentences are copied correctly. This programme can be used for data reduction for example. 

> **Problem 2.2** *(4 points)* Now, we will implement a bit more useful function, so-called spelling error correction. Your job is to create a model whose input is a word with spelling errors, and the output is the spelling-corrected word. Here, your vocabulary will be character instead of word. You can create your own training data by using an existing text corpus as the target and inject noise into it to use it as the input. You are free to use whichever text corpus you like. If you can't think of one, please use context data in SQuAD Dataset (see Assignment 2). Report accuracy in your own evaluation data (you will receive full credit as long as both the evaluation data and the accuracy are reasonable), and also show 3 examples where it succeeds at correcting spelling.


In [76]:
!pip install datasets



In [77]:
import datasets
from pprint import pprint
import torch 

squad_dataset = datasets.load_dataset('squad')
pprint(squad_dataset['train'][0]) # 'context' contains the document

Reusing dataset squad (C:\Users\Utilisateur\.cache\huggingface\datasets\squad\plain_text\1.0.0\d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


  0%|          | 0/2 [00:00<?, ?it/s]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the '
            "Main Building's gold dome is a golden statue of the Virgin Mary. "
            'Immediately in front of the Main Building and facing it, is a '
            'copper statue of Christ with arms upraised with the legend '
            '"Venite Ad Me Omnes". Next to the Main Building is the Basilica '
            'of the Sacred Heart. Immediately behind the basilica is the '
            'Grotto, a Marian place of prayer and reflection. It is a replica '
            'of the grotto at Lourdes, France where the Virgin Mary reputedly '
            'appeared to Saint Bernadette Soubirous in 1858. At the end of the '
            'main drive (and in a direct line that connects through 3 statues '
            'and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did t

In [78]:
context  = squad_dataset['train']['context']
question = squad_dataset['train']['question']
print (len(context))

87599


In [79]:
def process(context, questions):
    """document is a list of sentences"""
    context_dict = {}
    documents = []
    query_pairs = []
    count = 0

    for context, question in zip(context, questions):
        if context in context_dict:
            context_id = context_dict[context]
        else:
            context_id = count
            context_dict[context] = count
            count += 1
            documents.append(context)
        query_pairs.append((context_id, question))
    return documents, query_pairs

In [80]:
light_context, light_qestion = process(context, question)
print(len(light_context))

18891


We create a few function to generate a dataset with mistakes

In [81]:
##########################################################################
###############      Mistake génération methodes       ###################
##########################################################################

""" imput  : correct word in lower case
    output : word with mistakes : Delete letter, Swap letter, Add letter , only words with more that 4 letter are modified"""
import random as random

aphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i' , 'j' , 'k' , 'l' , 'm' , 'n', 'o', 'p', 'q', 'r', 's', 't', 'u',
           'v', 'w', 'x', 'y', 'z' ]

def delete(word : str) ->str:  
    """delete a letter in the word"""
    i = random.randint(0,len(word)-1)
    if len(word)<=3:
        return word
    
    if i == len(word)-1:
        return(word[:i])        
    if i ==0 :
        return(word[1:])
    else : 
        return (word[:(i)]+word [(i+1):])
    
def swap(word : str)->str:
    """swap two letters in the word"""
    i = random.randint(0,len(word)-1)
    if len(word)<=3:
        return word
    if i == len(word)-1:
        return (word[:(i-1)]+word[i]+word[i-1])
    if i ==0 :
        return(word[i+1]+word[i]+word[(i+2):])
    else: 
        return (word[:(i-1)]+word[i] +  word[i-1]+word[(i+1):])
     
def add(word : str)->str:
    """add a letter in the word"""
    i = random.randint(0,len(word)-1)
    j = random.randint(0,len(aphabet)-1)
    if len(word)<=3:
        return word
    if i == len(word)-1:
        return(word[:i] +aphabet[j])        
    if i ==0 :
        return(aphabet[j] + word[1:])
    else : 
        return (word[:(i)]+ aphabet[j] +word [(i):])

When I create my dataset I only put mistakes on 3 words over 4 because  Ithink the model should also be able to recognie when there are no mistakes in a word. 

In [82]:
def badsentence(sentence: str, is_sentence: bool, a = 1 , b = 4):
    """ generates a sentence or a word with mistakes """ 
    badsentence= ""
    for word in (sentence.split()) : 
        i = random.randint(a, b) # 3 over 4 chances to have a mistake 
        if i ==1: 
            word = delete(word)
        if i ==2: 
            word=swap(word)
        if i==3 : 
            word= add(word)
        if is_sentence == True: 
            badsentence+= " "+ word
        else: 
            badsentence+= word
    return badsentence

In [83]:
#Example
a = "je suis une grosse grenouille"
print(badsentence(a, True))

 je sis une grosuse grenouille


In [84]:
import re
def tokenizer(sentences : list, max_word_len: int)->list: 
    sentence = [" ".join(sentences)]
    sentence =(re.sub(r'[^a-zA-Z ]', '', sentence[0])).lower().split(' ')
    words = [word for word in sentence  if len(word) !=0 and len(word) < max_word_len] 
    return (words)


In [85]:
vocab_tgt =  tokenizer (light_context, 15)
vocab_src =  [badsentence(a, False) for a in vocab_tgt]

In [86]:
# Exemples
print("Source imput size : " , len(vocab_src))
print(vocab_src[0:13], '\n')
print("Target output size: ", len(vocab_tgt))
print(vocab_tgt[0:13])


Source imput size :  2142408
['the', 'cshool', 'has', 'a', 'caholic', 'characteer', 'atop', 'the', 'mmain', 'buildins', 'god', 'domd', 'is'] 

Target output size:  2142408
['the', 'school', 'has', 'a', 'catholic', 'character', 'atop', 'the', 'main', 'buildings', 'gold', 'dome', 'is']


In [87]:
import torch
##########################################################################
###############      Creation of the dataset           ###################
##########################################################################

vocab = ['_', '[', ']', 
         
         'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 
         'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',

         '0', '1', '2', '3', '4', '5', '6', '7', '8', '9','-', "'", ",", ' ']

vocab_to_index = {vocab : index for index, vocab in enumerate(vocab) }
index_to_vocab  = {index : vocab for index, vocab in enumerate(vocab) }

def vectorizer(word : str)-> torch.tensor: 
  a = [1]
  for letter in word: 
    a+=[vocab_to_index[letter]]
  a+= [2]
  return (torch.tensor(a))

def wordizer (tensor : torch.tensor) -> str : 
  a = ''
  #we eliminate the specials token "[" and "]"
  for letter in tensor.tolist()[1:(len(tensor.tolist())-1)]: 
    a+=index_to_vocab[letter]
  return (a)


vectorizer("papaye")
wordizer(torch.tensor([ 1, 18,  3, 18,  3, 27,  7,  2]))

'papaye'

In [88]:
# now we can convert our dataset to tensor 

target = [vectorizer(word) for word in vocab_tgt]
source = [vectorizer(word) for word in vocab_src]

In [89]:
pprint(source[0:10])

[tensor([ 1, 22, 10,  7,  2]),
 tensor([ 1,  5, 21, 10, 17, 17, 14,  2]),
 tensor([ 1, 10,  3, 21,  2]),
 tensor([1, 3, 2]),
 tensor([ 1,  5,  3, 10, 17, 14, 11,  5,  2]),
 tensor([ 1,  5, 10,  3, 20,  3,  5, 22,  7,  7, 20,  2]),
 tensor([ 1,  3, 22, 17, 18,  2]),
 tensor([ 1, 22, 10,  7,  2]),
 tensor([ 1, 15, 15,  3, 11, 16,  2]),
 tensor([ 1,  4, 23, 11, 14,  6, 11, 16, 21,  2])]


In [90]:
max_len_vector = max (s.size()[0] for s in source)
print(max_len_vector)

17


In [91]:
# Padding of the different source and target = 
source =[torch.nn.ZeroPad2d((0,max_len_vector-len(vector)))(vector) for vector in source] 
target =[torch.nn.ZeroPad2d((0,max_len_vector-len(vector)))(vector) for vector in target] 

In [92]:
pprint(source[0:10])

[tensor([ 1, 22, 10,  7,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 tensor([ 1,  5, 21, 10, 17, 17, 14,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 tensor([ 1, 10,  3, 21,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 tensor([1, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 tensor([ 1,  5,  3, 10, 17, 14, 11,  5,  2,  0,  0,  0,  0,  0,  0,  0,  0]),
 tensor([ 1,  5, 10,  3, 20,  3,  5, 22,  7,  7, 20,  2,  0,  0,  0,  0,  0]),
 tensor([ 1,  3, 22, 17, 18,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 tensor([ 1, 22, 10,  7,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 tensor([ 1, 15, 15,  3, 11, 16,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 tensor([ 1,  4, 23, 11, 14,  6, 11, 16, 21,  2,  0,  0,  0,  0,  0,  0,  0])]


In [93]:
def data_gen2( source, target  , batch_size, nbatches):
    "Generate random data for a src-tgt copy task."
    for i in range(nbatches):
        data_src = torch.stack([vector for vector in source[i*batch_size : (i+1)*batch_size]])
        src = data_src.requires_grad_(False).clone().detach()
        #print(src)
        data_tgt = torch.stack([vector for vector in target[i*batch_size : (i+1)*batch_size]])
        tgt = data_tgt.requires_grad_(False).clone().detach()

        yield Batch(src, tgt, 0)

In [94]:
sentence_size2 = len(vocab)+3 # +2 for the 2 special token and +1 because V=11 for size 10 hence vector size+1 

batch_size    = 80
nbbatch       = 40
nbepoch       = 250
treshold      = 0.05
saver2         = SaveBestModel('model_exercice_2_2.pth')

In [95]:
##########################################################################
###############     Creation of an empty model         ###################
##########################################################################

if to_Train: 
    criterion = LabelSmoothing(size=sentence_size2, padding_idx=0, smoothing=0.0)
    model2 = make_model(sentence_size2, sentence_size2, N=2)

    optimizer = torch.optim.Adam(
        model2.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
    )
    lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, model_size=model2.src_embed[0].d_model, factor=1.0, warmup=400
        ),
    )


##########################################################################
###############     Training of the  model             ###################
##########################################################################


    for epoch in (range(nbepoch)):
        model2.train()
        
        
        loss, train_state, to_break = run_epoch(
            data_gen2(source[0:200000], target[0:200000], batch_size, nbbatch), 
            model2,
            SimpleLossCompute(model2.generator, criterion),
            optimizer,
            lr_scheduler,
            epoch, 
            nbbatch,
            treshold,
            mode="train",
        )
       
        model2.eval()
        run_epoch(
            data_gen2(source[200000:210000], target[200000:210000], batch_size, nbbatch),
            #data,
            model2,
            SimpleLossCompute(model2.generator, criterion),
            DummyOptimizer(),
            DummyScheduler(),
            epoch, 
            nbbatch,
            treshold, 
            mode="eval",
        )[0]
        
        saver2.__call__(
        loss, 
        epoch,
        model2, 
        optimizer, 
        criterion,
        )

In [96]:
##########################################################################
###############     Creation of an empty model         ###################
##########################################################################

criterion = LabelSmoothing(size=sentence_size2, padding_idx=0, smoothing=0.0)
model2    = make_model(sentence_size2, sentence_size2, N=2)

optimizer = torch.optim.Adam(
        model2.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
    )
lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, model_size=model2.src_embed[0].d_model, factor=1.0, warmup=400
        ),
    )
model2.eval()


EncoderDecoder(
  (encoder): Encoder(
    (layers): ModuleList(
      (0): EncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512, bias=True)
            (1): Linear(in_features=512, out_features=512, bias=True)
            (2): Linear(in_features=512, out_features=512, bias=True)
            (3): Linear(in_features=512, out_features=512, bias=True)
          )
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (sublayer): ModuleList(
          (0): SublayerConnection(
            (norm): LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): SublayerConnection(
            (norm): LayerNorm()

In [97]:
##########################################################################
###############     Load the model from the drive      ###################
##########################################################################

#from google.colab import files
#files.download('model.pt') 

#model.load_state_dict(torch.load('/content/gdrive/MyDrive/AI605_assignement_3/model_epoch200.pt'))

# load the best model checkpoint
best_model_cp    = torch.load('model_exercice_2_2.pth')
best_model_epoch = best_model_cp['epoch']
print(f"Best model was saved at {best_model_epoch} epochs\n")

# load the last model checkpoint
last_model_cp    = torch.load('model_exercice_2_2.pth')
last_model_epoch = last_model_cp['epoch']
print(f"Last model was saved at {last_model_epoch} epochs\n")

model2.load_state_dict(best_model_cp['model_state_dict'])
epoch     = (best_model_cp['epoch'])
optimizer.load_state_dict(best_model_cp['optimizer_state_dict'])
criterion = (best_model_cp['loss'])


Best model was saved at 230 epochs

Last model was saved at 230 epochs



In [98]:
def accuracy(model, nbsample = 300):
    common, total = 0, 0
    a = 210000+nbsample
    for target, source in tqdm(zip( (vocab_tgt[210000:a]), (vocab_src[210000:a]) )): 
        model.eval()
        max_len  = vectorizer(source).unsqueeze(dim=0).shape[1]
        src_mask = torch.ones(1, 1, max_len)
        a        =  greedy_decode(model,  vectorizer(source).unsqueeze(dim=0), src_mask, max_len=max_len, start_symbol=1)
        word = wordizer(a.squeeze ())
        word = re.sub(r'[^a-zA-Z ]', '', word).lower() # to remove the ] wich is add if the word is shortenned 
        
        if word == target : # that mean the index of the target is in the result
        # we get rid of the 0 colomne of 0 
            common +=1 
        total += 1 # all the try 
    return (common / total) * 100

print('\n', "model accuracy : ", accuracy (model2) , "%")

300it [00:13, 21.97it/s]


 model accuracy :  51.33333333333333 %





We got a really high accuracy of 50,6666% for a set of 300 new words. That means the model is able to correct some words. Nevertless, the accuray is boosted by the fact that small words with 3 letters or less don't have mistakes. The model is able to transmit them correctly which increased the accuracy rate. Also, 1/4 of the words don't have any mistakes (by construction), that can also improove the accuracy even if the model doesn't always transmit them correclty. 

In [99]:
def accuracy(model, a, b,  nbsample = 400):
    common, total = 0, 0
    x = 210000+nbsample
    for target in tqdm (vocab_tgt[210000:x]): 
      if not (((a==1) &(len(target)<4))):
        source = badsentence(target, False, 1,3)
        model.eval()
        max_len  = vectorizer(source).unsqueeze(dim=0).shape[1]
        src_mask = torch.ones(1, 1, max_len)
        w        =  greedy_decode(model,  vectorizer(source).unsqueeze(dim=0), src_mask, max_len=max_len, start_symbol=1)
        word = wordizer(w.squeeze ())
        word = re.sub(r'[^a-zA-Z ]', '', word).lower() # to remove the ] wich is add if the word is shortenned 
        
        if word == target : # that mean the index of the target is in the result
        # we get rid of the 0 colomne of 0 
          common +=1 
        total += 1 # all the try 
    return (common / total) * 100


In [100]:
print('\n', "accuracy for a set were all word have mistakes", accuracy(model2, a= 1,b = 3), "%",  '\n')
print('\n', "accuracy for a set were all the words are correct", accuracy(model2, a = 4, b= 4),"%",  '\n')

100%|████████████████████████████████████████████████████████████████████████████████| 400/400 [00:14<00:00, 26.77it/s]



 accuracy for a set were all word have mistakes 21.2 % 



100%|████████████████████████████████████████████████████████████████████████████████| 400/400 [00:18<00:00, 21.47it/s]


 accuracy for a set were all the words are correct 45.25 % 






So it correct bad words with 18,4% of accuracy and it pinpoints correct words with an accuracy of 44,5%

In [101]:
model2.eval()

target = (vocab_tgt[210010:210060]) #good ones 
source = (vocab_src[210010:210060]) # bad ones 
for src , trg in zip(target, source) :
  
    truth = src
    print("truth  : ", src)
    src = badsentence(src, False, b=3) 
    print("imput  : ",src)  
    b = vectorizer("franch").unsqueeze(dim=0).squeeze ()
    max_len  = vectorizer(src).unsqueeze(dim=0).shape[1]
    src_mask = torch.ones(1, 1, max_len)
    a        = ( greedy_decode(model2,  vectorizer(src).unsqueeze(dim=0), src_mask, max_len=max_len, start_symbol=1))
    if re.sub(r'[^a-zA-Z ]', '', wordizer(a.squeeze ())) == truth: 
      print("output : " , re.sub(r'[^a-zA-Z ]', '', wordizer(a.squeeze ())), "      CORRECT !!!!") 
    else : 
      print("output : " , re.sub(r'[^a-zA-Z ]', '', wordizer(a.squeeze ()))) 
    print("=====================")


truth  :  dialect
imput  :  difalect
output :  dificit
truth  :  or
imput  :  or
output :  or       CORRECT !!!!
truth  :  dialects
imput  :  dialect
output :  dialect
truth  :  was
imput  :  was
output :  was       CORRECT !!!!
truth  :  spoken
imput  :  psoken
output :  sporen
truth  :  either
imput  :  etiher
output :  either       CORRECT !!!!
truth  :  certain
imput  :  certani
output :  certain       CORRECT !!!!
truth  :  is
imput  :  is
output :  is       CORRECT !!!!
truth  :  only
imput  :  qnly
output :  monl
truth  :  that
imput  :  hat
output :  hat
truth  :  avestan
imput  :  aveostan
output :  avesitio
truth  :  all
imput  :  all
output :  all       CORRECT !!!!
truth  :  forms
imput  :  form
output :  from
truth  :  and
imput  :  and
output :  and       CORRECT !!!!
truth  :  old
imput  :  old
output :  old       CORRECT !!!!
truth  :  persian
imput  :  perian
output :  pering
truth  :  are
imput  :  are
output :  are       CORRECT !!!!
truth  :  distinct
imput  :  dsit

In [102]:
##########################################################################
###############     Training of the loaded model       ###################
##########################################################################
nbepoch = 231
if to_Train : 
  for epoch in (range(epoch,nbepoch)):
        model2.train()
        
        
        loss, train_state, to_break = run_epoch(
            data_gen2(source[0:200000], target[0:200000], batch_size, nbbatch), 
            model2,
            SimpleLossCompute(model2.generator, criterion),
            optimizer,
            lr_scheduler,
            epoch, 
            nbbatch,
            treshold,
            mode="train",
        )
       
        model2.eval()
        run_epoch(
            data_gen2(source[200000:210000], target[200000:210000], batch_size, nbbatch),
            #data,
            model2,
            SimpleLossCompute(model2.generator, criterion),
            DummyOptimizer(),
            DummyScheduler(),
            epoch, 
            nbbatch,
            treshold, 
            mode="eval",
        )[0]
        
        #saver2.__call__(
        #loss, 
        #epoch,
        #model, 
        #optimizer, 
        #criterion,
        #)

> **Problem 2.3** *(3 points)* Extend this word-level spelling correction model to sentence-level. You can assume that the number of characters of each sentence is 100 or less. You do not have to report accuracy, but find one example where the word-level model fails and sentence-level model correctly predicts.


In [103]:
a = "je suis, un Poussin]. "
a = re.sub(r'[^a-zA-Z ]', '', a).lower()
a

'je suis un poussin '

In [104]:
##########################################################################
###############      Creation of the dataset           ###################
##########################################################################
"""" We take the sentences from the squad vocab and we infuse mistake in the 
word of the sentencce, then we vectorized the all"""

sentence_size  = 10
sentence_scr   = [' '.join(badsentence(elt,True).split(" ")[0:sentence_size]) for elt in light_context if (len(elt.split(" "))>=sentence_size)] 
src            = [ vectorizer(re.sub(r'[^a-zA-Z ]', '', elt).lower())  for elt in sentence_scr ]  # we get rid of the non letters characters 
max_len        = max([len(elt) for elt in src])                                   # we get the lenght of the diggest tensor 
source         = [torch.nn.ZeroPad2d((0,max_len-len(elt)))(elt) for elt in src]   # we pad the source 
print ("Exemples of four embedded vectors")
source[0:4]

Exemples of four embedded vectors


[tensor([ 1, 42,  3, 20,  5, 10, 22, 11,  7,  5, 22, 23, 20,  3, 14, 14, 27, 42,
         22, 10,  7, 42, 21,  5, 10, 17, 17, 14, 42, 10,  3, 21, 42,  3, 42,  5,
          3, 22, 10, 17, 14,  5, 42,  5, 10,  3, 20,  3,  5, 22,  7, 20, 42,  3,
          8, 22, 17, 18, 42, 22, 10,  7,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0]),
 tensor([ 1, 42,  3, 21, 42,  3, 22, 42, 15, 17, 22, 21, 42, 17, 22, 10, 20,  7,
         42, 23, 16, 11, 24,  7, 20, 21, 11, 22, 11,  7, 21, 42,  5, 17, 22, 20,
          7, 42,  6,  3, 15, 19,  7, 21, 42, 21, 22, 23,  6,  7, 16, 22, 21, 42,
         20, 23, 16,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0]),
 tensor([ 1, 42, 22, 10,  7, 42, 23, 16, 11, 24,  7, 20, 21, 11, 22, 27, 42, 11,
         21, 42

In [105]:
def data_gen3( source, target  , batch_size, nbatches):
    "Generate random data for a src-tgt copy task."
    for i in range(nbatches):
        data_src = torch.stack([vector for vector in source[i*batch_size : (i+1)*batch_size]])
        src      = data_src.requires_grad_(False).clone().detach()
        data_tgt = torch.stack([vector for vector in target[i*batch_size : (i+1)*batch_size]])
        tgt      = data_tgt.requires_grad_(False).clone().detach()

        yield Batch(src, tgt, 0)

In [106]:
sentence_size3 = len(vocab)+3 # +2 for the 2 special token and +1 because V=11 for size 10 hence vector size+1 

batch_size    = 80
nbbatch       = 10
nbepoch       = 250
treshold      = 0.05
saver3        = SaveBestModel('model_exercice_2_3_1.pth')

In [111]:
##########################################################################
###############     Creation of an empty model         ###################
##########################################################################
if to_Train: 
    criterion = LabelSmoothing(size=sentence_size3, padding_idx=0, smoothing=0.0)
    model3    = make_model(sentence_size3, sentence_size3, N=2)

    optimizer = torch.optim.Adam(
        model3.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
    )
    lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, model_size=model3.src_embed[0].d_model, factor=1.0, warmup=400
        ),
    )

##########################################################################
###############     Training of the created model       ##################
##########################################################################

    for epoch in (range(nbepoch)):
        model3.train()
        
        
        loss, train_state, to_break = run_epoch(
            data_gen3(source[0:8000], target[0:8000], batch_size, nbbatch), 
            model3,
            SimpleLossCompute(model3.generator, criterion),
            optimizer,
            lr_scheduler,
            epoch,
            nbbatch,
            treshold,
            mode="train",
        )
       
        model3.eval()
        run_epoch(
            data_gen3(source[8000:10000], target[8000:10000], batch_size, nbbatch),
            model3,
            SimpleLossCompute(model3.generator, criterion),
            DummyOptimizer(),
            DummyScheduler(),
            epoch, 
            nbbatch,
            treshold, 
            mode="eval",
        )[0]
        
        saver3.__call__(
        loss, 
        epoch,
        model3, 
        optimizer, 
        criterion,
        )

In [112]:
##########################################################################
###############     Creation of an empty model         ###################
##########################################################################

criterion = LabelSmoothing(size=sentence_size3, padding_idx=0, smoothing=0.0)
model3 = make_model(sentence_size3, sentence_size3, N=2)

optimizer = torch.optim.Adam(
        model3.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
    )
lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, model_size=model3.src_embed[0].d_model, factor=1.0, warmup=400
        ),
    )
model3.eval()


EncoderDecoder(
  (encoder): Encoder(
    (layers): ModuleList(
      (0): EncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512, bias=True)
            (1): Linear(in_features=512, out_features=512, bias=True)
            (2): Linear(in_features=512, out_features=512, bias=True)
            (3): Linear(in_features=512, out_features=512, bias=True)
          )
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (sublayer): ModuleList(
          (0): SublayerConnection(
            (norm): LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): SublayerConnection(
            (norm): LayerNorm()

In [113]:
##########################################################################
###############     Load the model from the drive      ###################
##########################################################################

#from google.colab import files
#files.download('model.pt') 

#model.load_state_dict(torch.load('/content/gdrive/MyDrive/AI605_assignement_3/model_epoch200.pt'))

# load the best model checkpoint
best_model_cp    = torch.load('model_exercice_2_3_1.pth')
best_model_epoch = best_model_cp['epoch']
print(f"Best model was saved at {best_model_epoch} epochs\n")

# load the last model checkpoint
last_model_cp    = torch.load('model_exercice_2_3_1.pth')
last_model_epoch = last_model_cp['epoch']
print(f"Last model was saved at {last_model_epoch} epochs\n")

model3.load_state_dict(best_model_cp['model_state_dict'])
epoch     = (best_model_cp['epoch'])
optimizer.load_state_dict(best_model_cp['optimizer_state_dict'])
criterion = (best_model_cp['loss'])


Best model was saved at 180 epochs

Last model was saved at 180 epochs



In [114]:
##########################################################################
###############     Training of the loaded model       ###################
##########################################################################

if to_Train: 
  for epoch in (range(epoch,nbepoch)):
        model3.train()
        
        
        loss, train_state, to_break = run_epoch(
            #data_gen3(source[0:200000], target[0:200000], batch_size, nbbatch), 
            data_gen3(source[0:8000], target[0:8000], batch_size, nbbatch), 
            model3,
            SimpleLossCompute(model3.generator, criterion),
            optimizer,
            lr_scheduler,
            epoch, 
            nbbatch,
            treshold,
            mode="train",
        )
       
        model3.eval()
        run_epoch(
            #data_gen3(source[200000:210000], target[200000:210000], batch_size, nbbatch),
            data_gen3(source[8000:10000], target[8000:10000], batch_size, nbbatch),
            model3,
            SimpleLossCompute(model3.generator, criterion),
            DummyOptimizer(),
            DummyScheduler(),
            epoch, 
            nbbatch,
            treshold, 
            mode="eval",
        )[0]
        
        saver3.__call__(
        loss, 
        epoch,
        model3, 
        optimizer, 
        criterion,
        )

In [115]:
print(vectorizer(" "))
model3.eval()
for s in sentence_scr[10000: 10010]:
  #src      = vectorizer(a).unsqueeze(dim=0)
  #src = vectorizer(re.sub(r'[^a-zA-Z ]', '', s).lower()).unsqueeze(dim=0)
  print(s)
  src      = vectorizer(re.sub(r'[^a-zA-Z ]', '', s).lower()).unsqueeze(dim=0)
  print(src)
  max_len  = src.shape[1]
  src_mask = torch.ones(1, 1, max_len)
  a        = ( greedy_decode(model3, src, src_mask, max_len=max_len, start_symbol=1))
  print(a)


tensor([ 1, 42,  2])
 The enxt yeao Madlonna and Mavercik used Warner Musiq
tensor([[ 1, 42, 22, 10,  7, 42,  7, 16, 26, 22, 42, 27,  7,  3, 17, 42, 15,  3,
          6, 14, 17, 16, 16,  3, 42,  3, 16,  6, 42, 15,  3, 24,  7, 20,  5, 11,
         13, 42, 23, 21,  7,  6, 42, 25,  3, 20, 16,  7, 20, 42, 15, 23, 21, 11,
         19,  2]])
tensor([[ 1, 25,  3, 21,  2, 11,  5, 22, 11,  5,  3, 14,  2,  2,  2,  2,  2,  2,
          2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
          2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
          2,  2]])
 In mid-0204 adonna embarked on the Rie-Invention Wolrd Tour
tensor([[ 1, 42, 11, 16, 42, 15, 11,  6, 42,  3,  6, 17, 16, 16,  3, 42,  7, 15,
          4,  3, 20, 13,  7,  6, 42, 17, 16, 42, 22, 10,  7, 42, 20, 11,  7, 11,
         16, 24,  7, 16, 22, 11, 17, 16, 42, 25, 17, 14, 20,  6, 42, 22, 17, 23,
         20,  2]])
tensor([[ 1,  6,  7, 18,  3, 20, 22, 15,  7, 16, 22, 21,  2,  2,  2,  2,  2, 

In [116]:
model3.eval()
src      = vectorizer("i liveed in a huge houset for severals years").unsqueeze(dim=0)
max_len  = src.shape[1]
src_mask = torch.ones(1, 1, max_len)
a        = ( greedy_decode(model3, src, src_mask, max_len=max_len, start_symbol=1))
print("imput tensor : ", src)
print("output tensor: ", a)
print("imput sentence: i liveed in a huge houset for severals years")
print("output word : ", wordizer(a.squeeze()))

model3.eval()
src      = vectorizer("my cat hates riverss of the mountainse").unsqueeze(dim=0)
max_len  = src.shape[1]
src_mask = torch.ones(1, 1, max_len)
a        = ( greedy_decode(model3, src, src_mask, max_len=max_len, start_symbol=1))
print("imput tensor: ", src)
print("imput sentence : my cat hates riverss of the mountainse")
print("output word : ", wordizer(a.squeeze()))


imput tensor :  tensor([[ 1, 11, 42, 14, 11, 24,  7,  7,  6, 42, 11, 16, 42,  3, 42, 10, 23,  9,
          7, 42, 10, 17, 23, 21,  7, 22, 42,  8, 17, 20, 42, 21,  7, 24,  7, 20,
          3, 14, 21, 42, 27,  7,  3, 20, 21,  2]])
output tensor:  tensor([[ 1, 18, 10,  6,  2, 15,  7,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
          2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
          2,  2, 20,  2,  2,  2,  2,  2,  2,  2]])
imput sentence: i liveed in a huge houset for severals years
output word :  phd]me]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]r]]]]]]
imput tensor:  tensor([[ 1, 15, 27, 42,  5,  3, 22, 42, 10,  3, 22,  7, 21, 42, 20, 11, 24,  7,
         20, 21, 21, 42, 17,  8, 42, 22, 10,  7, 42, 15, 17, 23, 16, 22,  3, 11,
         16, 21,  7,  2]])
imput sentence : my cat hates riverss of the mountainse
output word :  a]]]]er]]]]]]]]]]]e]tur]]]]]ts]tur]tur


We got really bad result, maybe it is because the dataset is too complicated or maybe it is because we didn't train it enought due to lack of time

In [117]:
import csv

def generate_csv(csv_path, vocab_src, vocab_tgt ):
    with open(csv_path, 'w', newline='') as csvfile:
        writter = csv.writer(csvfile)
        writter.writerow(["input", "target"])
        for case , correction in zip( vocab_src, vocab_tgt):
            # Adding the task's prefix to input 
            input_text = "grammar: " + case
            writter.writerow([input_text, correction])