# Lab 6 : Transformers and LLMs
```
- [S23] Advanced Machine Learning, Innopolis University
- Teaching Assistant: Gcinizwe Dlamini
```
<hr>


```
Lab Plan
1. Transformers (translation architecture)
2. Self-Attention
3. Multi-headed attention
4. Positional Encoding
5. Transformer Encoder Part
6. Application of Transformers
7. Self practice tasks
```

<hr>


# 1. Transformers

* [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) -- the original paper that introduced the Transformer

![](http://jalammar.github.io/images/t/The_transformer_encoder_decoder_stack.png)

## 1.1 Transformer Encoder

The encoder contains a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. <br>
PyTorch implementation : `nn.TransformerEncoder` and `nn.TransformerEncoderLayer` <br>
**The main goal is to efficiently encode the data**
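
As a minimal sketch of how the two PyTorch classes named above fit together, the cell below stacks two encoder layers and encodes a random batch. The sizes (`d_model`, `n_heads`, `n_layers`, `dim_feedforward`) are hypothetical values picked only to keep the example small; they are not the settings used later in this lab.

```python
import torch
from torch import nn

# Hypothetical sizes, chosen only for illustration
d_model, n_heads, n_layers = 64, 4, 2

layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=128, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
encoder.eval()  # disable dropout so the forward pass is deterministic

# A batch of 2 sequences with 5 tokens each, already embedded into d_model dimensions
x = torch.rand(2, 5, d_model)
with torch.no_grad():
    encoded = encoder(x)

# The encoder keeps the input shape: one d_model-sized vector per token
print(encoded.shape)
```

With `batch_first=True` the input layout is `[batch, seq_len, d_model]`; the default layout of these classes is `[seq_len, batch, d_model]`.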

  1         |  2
:-------------------------:|:-------------------------:
![](http://jalammar.github.io/images/t/encoder_with_tensors.png)  |  ![](http://jalammar.github.io/images/t/transformer_resideual_layer_norm_2.png)


## 2. Self-Attention

**Keep in mind: the main goal is to encode the data more efficiently.** In other words, the goal is to create meaningful embeddings.<br>
- As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.


**How does Self-Attention work?**

Steps:
1. For each word, we create a **`Query`** vector, a **`Key`** vector, and a **`Value`** vector.
  - What are the **`Query`**, **`Key`**, and **`Value`** vectors? They are abstractions that are useful for calculating attention. Each one is a learned linear projection of the word embedding.
2. Calculate the self-attention scores from the **`Query`** and **`Key`** vectors (dot product).
3. Divide the scores by $\sqrt{d_k}$, the square root of the key dimension. The original paper uses $d_k = 64$, hence the division by 8. This leads to more stable gradients.
4. Pass the result through a softmax operation (softmax score determines how much each word will be expressed at this position)
5. Multiply each value vector by the softmax score
6. Sum up the weighted value vectors

### Step 1

For each word, we create a **`Query`** vector, a **`Key`** vector, and a **`Value`** vector.

![](http://jalammar.github.io/images/t/transformer_self_attention_vectors.png)

In [None]:
import torch

# simple sequence = I will pass AML midterm
simple_sequence_embedding = torch.rand((5, 512))

# Create weight matrices
Wq = torch.normal(0,0.5, (512, 7))
Wk = torch.normal(0,0.1, (512, 7))
Wv = torch.normal(0,0.2, (512, 7))


# Create query, key and value vectors for each word in the sentence
queries = simple_sequence_embedding.mm(Wq)
keys = simple_sequence_embedding.mm(Wk)
values = simple_sequence_embedding.mm(Wv)

In [None]:
queries

tensor([[-5.1061, 11.8408,  2.2086,  2.9840, -2.0887, -0.8397, -1.4995],
        [ 2.9208, 11.5278, -4.3229,  2.8188, -4.7808,  2.0832, -1.8955],
        [-0.6251, 12.5453,  2.1003, -0.3908, -1.6719,  6.5194,  0.0218],
        [-1.6637, 18.6328, -6.2575, -4.3300, -3.0603,  2.0430, -2.3926],
        [ 4.7630,  9.6453, -3.1484, -2.0177, -3.1768,  4.7115,  2.6854]])

### Step 2

Calculating self-attention score from **`Query`** and **`Key`** vector

In [None]:
scores = torch.mm(queries, keys.T)
scores

tensor([[-36.7789, -38.2997, -27.0507, -59.0658, -52.3270],
        [-32.6367, -27.2935, -24.6161, -54.2585, -41.6928],
        [-31.3280, -21.5199, -10.8190, -40.1629, -36.4128],
        [-54.1099, -47.6417, -44.3191, -80.6077, -79.5127],
        [-17.8594, -11.4702,  -4.2672, -26.6263, -21.9530]])

### Step 3
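Divide the scores by 8, the value of $\sqrt{d_k}$ in the original paper where $d_k = 64$ (this leads to more stable gradients). In this toy example the key dimension is 7, so the exact factor would be $\sqrt{7}$; the cell below keeps 8 to follow the paper.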

In [None]:
scores = scores/8
scores

tensor([[ -4.5974,  -4.7875,  -3.3813,  -7.3832,  -6.5409],
        [ -4.0796,  -3.4117,  -3.0770,  -6.7823,  -5.2116],
        [ -3.9160,  -2.6900,  -1.3524,  -5.0204,  -4.5516],
        [ -6.7637,  -5.9552,  -5.5399, -10.0760,  -9.9391],
        [ -2.2324,  -1.4338,  -0.5334,  -3.3283,  -2.7441]])
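
To see why the scaling helps, here is a small numerical sketch. It assumes that query and key components are independent with zero mean and unit variance, which is the assumption made in the paper; under it, a dot product over $d_k$ dimensions has a standard deviation of about $\sqrt{d_k}$, and large scores push the softmax into regions with very small gradients.

```python
import torch

torch.manual_seed(0)
d_k = 64  # key dimension used in the original paper

# 10,000 random query/key pairs with zero mean and unit variance per component
q = torch.randn(10_000, d_k)
k = torch.randn(10_000, d_k)

raw = (q * k).sum(dim=1)   # one dot product per pair
scaled = raw / d_k ** 0.5  # divide by sqrt(d_k)

raw_std = raw.std().item()
scaled_std = scaled.std().item()
print(f"std of raw scores:    {raw_std:.2f}")     # close to sqrt(64) = 8
print(f"std of scaled scores: {scaled_std:.2f}")  # close to 1
```

After the division the scores have roughly unit spread regardless of $d_k$, so the softmax stays away from its saturated regime.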

### Step 4

Pass the result through a softmax operation

In [None]:
scores = torch.nn.functional.softmax(scores, dim=1)
scores

tensor([[0.1850, 0.1530, 0.6241, 0.0114, 0.0265],
        [0.1649, 0.3215, 0.4494, 0.0111, 0.0532],
        [0.0548, 0.1867, 0.7113, 0.0182, 0.0290],
        [0.1487, 0.3339, 0.5058, 0.0054, 0.0062],
        [0.1039, 0.2309, 0.5682, 0.0347, 0.0623]])

### Step 5 & 6

* Multiply each value vector by the softmax score
* Sum up the weighted value vectors



In [None]:
z = scores @ values
z

tensor([[ 4.4716, -0.8754,  1.0015,  5.3325,  3.6058,  0.7773,  2.0604],
        [ 4.5854, -1.1084,  1.0863,  5.2149,  3.6833,  1.1822,  2.2346],
        [ 4.4695, -0.9500,  1.0667,  5.4705,  3.5735,  0.9243,  2.1818],
        [ 4.6475, -1.0420,  1.0692,  5.2539,  3.7072,  1.2904,  2.2424],
        [ 4.4866, -1.0492,  1.1142,  5.3061,  3.5890,  0.9629,  2.2019]])
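
The six steps above can be collapsed into one function. This is a sketch rather than a reference implementation: the name `self_attention` is introduced here for illustration, and it scales by $\sqrt{d_k}$ computed from the key dimension instead of the constant 8 used in the step-by-step cells.

```python
import torch
import torch.nn.functional as F


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention for one sequence x of shape [seq_len, d_model]."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # step 1: query, key, value vectors
    scores = q @ k.T                         # step 2: query-key dot products
    scores = scores / k.size(-1) ** 0.5      # step 3: scale by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)      # step 4: softmax over the keys
    return weights @ v, weights              # steps 5 and 6: weighted sum of values


torch.manual_seed(0)
x = torch.rand(5, 512)  # 5 words, 512-dimensional embeddings
w_q, w_k, w_v = (torch.randn(512, 7) * 0.1 for _ in range(3))

z, weights = self_attention(x, w_q, w_k, w_v)
print(z.shape, weights.shape)
```

Each row of `weights` sums to 1 and says how much every word contributes to the new encoding of the word at that row.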

## 3. Multi-headed attention

**GOAL**:
1. Expand the model's ability to focus on different positions
2. Provide the attention layer multiple “representation subspaces”

**Attention with $N$ heads just means repeating the self-attention algorithm $N$ times and joining the results**


![](https://data-science-blog.com/wp-content/uploads/2022/01/mha_img_original.png)

**Multi-headed attention steps:**
1. Same as the self-attention calculation, just $N$ different times with different weight matrices
2. Condense the $N$ z matrices down into a single matrix by concatenating them, then multiply the result by an additional weight matrix `WO`

The output z matrix is then fed to the FFNN
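
The concatenation step is easiest to follow through tensor shapes. The sketch below uses random tensors as stand-ins for the per-head outputs; the sizes and the names `head_outputs` and `w_o` are assumptions made for this illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, seq_len, d_model, n_heads = 2, 5, 32, 4
d_k = d_model // n_heads  # per-head dimension

# Stand-ins for the N per-head outputs z_1 ... z_N, each [batch, seq_len, d_k]
head_outputs = [torch.rand(batch, seq_len, d_k) for _ in range(n_heads)]

# Concatenate along the feature dimension: [batch, seq_len, n_heads * d_k]
concatenated = torch.cat(head_outputs, dim=-1)

# WO projects the concatenation back to the model dimension
w_o = nn.Linear(n_heads * d_k, d_model)
z = w_o(concatenated)

print(concatenated.shape, z.shape)
```

Because `z` has the same last dimension as the layer input, it can be added to the residual stream and passed on to the feed-forward network.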

In [None]:
from torch import Tensor
import torch.nn.functional as f
from torch import nn


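def scaled_dot_product_attention(query, key, value):
  # query, key, value: [batch, seq_len, dim]
  temp = query.bmm(key.transpose(1, 2))      # raw scores: [batch, seq_len_q, seq_len_k]
  scale = query.size(-1) ** 0.5              # sqrt(d_k)
  softmax = f.softmax(temp / scale, dim=-1)  # attention weights, each row sums to 1
  return softmax.bmm(value)                  # weighted sum of the value vectors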

### 3.1 Attention head

In [None]:
class AttentionHead(nn.Module):
  def __init__(self, dim_in, dim_q, dim_k):
    super().__init__()
    self.q = nn.Linear(dim_in, dim_q)
    self.k = nn.Linear(dim_in, dim_k)
    self.v = nn.Linear(dim_in, dim_k)

  def forward(self, query, key, value):
    return scaled_dot_product_attention(self.q(query), self.k(key), self.v(value))

### 3.2 Multi Head Attention

**Task:** Implement multi-head attention model

In [None]:
class MultiHeadAttention(nn.Module):
  def __init__(self, number_of_heads, dim_in, dim_q, dim_k):
    super().__init__()
    self.heads = nn.ModuleList([AttentionHead(dim_in, dim_q, dim_k) for _ in range(number_of_heads)])
    # WO: projects the concatenated head outputs back to the model dimension
    self.linear = nn.Linear(number_of_heads * dim_k, dim_in)

  def forward(self, query, key, value):
    # Run every head, concatenate along the feature dimension, then apply WO
    z = self.linear(torch.cat([h(query, key, value) for h in self.heads], dim=-1))
    return z

## 4. Positional Encoding

A way to account for the order of the words in the input sequence. A transformer adds a vector to each input embedding which helps it determine the position of each word. <br>
**Goal** : preserving information about the order of tokens  

Positional encodings can either be learned or fixed a priori.

The original paper proposes a simple scheme for fixed positional encodings based on sine and cosine functions
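
For reference, the fixed encoding from the paper assigns one sinusoid to each pair of dimensions, where $pos$ is the token position, $i$ indexes the dimension pair, and $d_{\text{model}}$ is the embedding size:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

The wavelengths grow geometrically with $i$, so low dimensions change quickly from one position to the next and high dimensions change slowly.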

In [None]:
def position_encoding(seq_len, dim_model, device):
  pos = torch.arange(seq_len, dtype=torch.float, device=device).reshape(1, -1, 1)
  dim = torch.arange(dim_model, dtype=torch.float, device=device).reshape(1, 1, -1)
  # Dimensions 2i and 2i+1 share the same frequency, as in the paper
  phase = pos / (1e4 ** (2 * torch.div(dim, 2, rounding_mode="floor") / dim_model))

  # Even dimensions use sine, odd dimensions use cosine
  return torch.where(dim.long() % 2 == 0, torch.sin(phase), torch.cos(phase))

![](http://jalammar.github.io/images/t/transformer_resideual_layer_norm_2.png)

## 5. Transformer Encoder Part
### 5.1 Encoder Feed Forward

In [None]:
def feed_forward(dim_input = 512, dim_feedforward = 2048):
  return nn.Sequential(nn.Linear(dim_input, dim_feedforward), nn.ReLU(), nn.Linear(dim_feedforward, dim_input))

### 5.2 Encoder Residual

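Following the original paper, every sub-layer is wrapped in a residual connection followed by layer normalization: `LayerNorm(x + Dropout(Sublayer(x)))`.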

In [None]:
class Residual(nn.Module):
  def __init__(self, sublayer, dimension, dropout = 0.1):
    super().__init__()
    self.sublayer = sublayer
    self.norm = nn.LayerNorm(dimension)
    self.dropout = nn.Dropout(dropout)

  def forward(self, *tensors):
    return self.norm(tensors[0] + self.dropout(self.sublayer(*tensors)))

### 5.3 Putting the Encoder layer together

In [None]:
class TransformerEncoderLayer(nn.Module):
  def __init__(self, dim_model = 512, num_heads = 6, dim_feedforward = 2048, dropout = 0.1):
    super().__init__()
    dim_q = dim_k = max(dim_model // num_heads, 1)
    self.attention = Residual(MultiHeadAttention(num_heads, dim_model, dim_q, dim_k),
                              dimension=dim_model, dropout=dropout)
    self.feed_forward = Residual(feed_forward(dim_model, dim_feedforward), dimension=dim_model, dropout=dropout)

  def forward(self, src):
    src = self.attention(src, src, src)
    return self.feed_forward(src)

### 5.4 Putting together the Transformer Encoder part

In [None]:
class TransformerEncoder(nn.Module):
  def __init__(self, num_layers = 12, dim_model = 512, num_heads = 4, dim_feedforward = 2048, dropout: float = 0.1):
    super().__init__()
    self.layers = nn.ModuleList([TransformerEncoderLayer(dim_model, num_heads, dim_feedforward, dropout) for _ in range(num_layers) ])

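  def forward(self, src):
    seq_len, dimension = src.size(1), src.size(2)
    # position_encoding also needs the device; add out of place so the caller's tensor is not modified
    src = src + position_encoding(seq_len, dimension, src.device)
    for layer in self.layers:
      src = layer(src)

    return src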

# The Decoder Side

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are used by each decoder layer.

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
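
A quick way to check this wiring is to look at shapes. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the lab's own multi-head attention; the sizes and the names `memory` and `decoder_states` are assumptions made for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, n_heads = 16, 4
cross_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

memory = torch.rand(2, 5, d_model)          # encoder stack output: 5 source tokens
decoder_states = torch.rand(2, 3, d_model)  # output of the layer below: 3 target tokens

# Queries come from the decoder, keys and values from the encoder output
out, weights = cross_attention(query=decoder_states, key=memory, value=memory)

print(out.shape)      # one vector per target position
print(weights.shape)  # [batch, target_len, source_len]
```

The output length follows the queries (target side), while each row of `weights` is a distribution over the source tokens.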

![](https://jalammar.github.io/images/t/transformer_resideual_layer_norm_3.png)


## Decoder layer



**Task**: implement the decoder layer

In [None]:
class TransformerDecoderLayer(nn.Module):
  def __init__(self, dim_model = 512, num_heads = 4, dim_feedforward = 2048, dropout = 0.1):
    super().__init__()
    dim_q = dim_k = max(dim_model // num_heads, 1)
    self.self_attention = Residual(MultiHeadAttention(num_heads, dim_model, dim_q, dim_k),
                                   dimension=dim_model, dropout=dropout)
    self.encoder_attention = Residual(MultiHeadAttention(num_heads, dim_model, dim_q, dim_k),
                                      dimension=dim_model, dropout=dropout)
    self.feed_forward = Residual(feed_forward(dim_model, dim_feedforward), dimension=dim_model, dropout=dropout)

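  def forward(self, tgt, memory, src_mask=None, tgt_mask=None):
    # NOTE: the attention modules defined above take only (query, key, value),
    # so src_mask and tgt_mask are accepted for API compatibility but not applied here.
    tgt = self.self_attention(tgt, tgt, tgt)
    tgt = self.encoder_attention(tgt, memory, memory)
    return self.feed_forward(tgt)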

In [None]:
class TransformerDecoder(nn.Module):
    def __init__(self, num_layers=6, dim_model=512, num_heads=6, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([TransformerDecoderLayer(dim_model, num_heads, dim_feedforward, dropout) for _ in range(num_layers)])

    def forward(self, tgt, memory, src_mask=None, tgt_mask=None):
        seq_len, dimension = tgt.size(1), tgt.size(2)
        # position_encoding also needs the device; add out of place so the caller's tensor is not modified
        tgt = tgt + position_encoding(seq_len, dimension, tgt.device)

        for layer in self.layers:
            tgt = layer(tgt, memory, src_mask, tgt_mask)
        return tgt

In [None]:
class Transformer(nn.Module):
    def __init__(self, num_encoder_layers=6, num_decoder_layers=6, dim_model=512, num_heads=6, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.encoder = TransformerEncoder(num_encoder_layers, dim_model, num_heads, dim_feedforward, dropout)
        self.decoder = TransformerDecoder(num_decoder_layers, dim_model, num_heads, dim_feedforward, dropout)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        memory = self.encoder(src)
        output = self.decoder(tgt, memory, src_mask, tgt_mask)
        return output

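## 6. Application of Transformers

We will look at sentiment analysis: a pretrained BERT encoder is frozen, and only a linear classification head on top of the `[CLS]` token representation is trained.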

In [None]:
import transformers

class Transformer(nn.Module):

  def __init__(self, output_dim):
    super().__init__()
    self.transformer = transformers.AutoModel.from_pretrained('bert-base-uncased')
    # Freeze BERT: it is used only as a feature extractor, so just the linear head is trained
    for param in self.transformer.parameters():
      param.requires_grad = False

    hidden_dim = self.transformer.config.hidden_size
    self.fc = nn.Linear(hidden_dim, output_dim)


  def forward(self, text):
    # text = [batch size, seq len]
    output = self.transformer(text, output_attentions=True)
    hidden = output.last_hidden_state
    # hidden = [batch size, seq len, hidden dim]
    attention = output.attentions[-1]
    # attention = [batch size, n heads, seq len, seq len]
    cls_hidden = hidden[:, 0, :]
    prediction = self.fc(torch.tanh(cls_hidden))

    return prediction

## 7. Tasks

```
Task 1
- Using the implementation above, code the decoder layer and assemble a full Transformer model
```

<hr>

```
Task 2
- Implement, train and test a Transformer model for the part-of-speech tagging task.
```

**Task 2 Datasets**: [Train](https://www.dropbox.com/s/x9n6f9o9jl7pno8/train_pos.txt?dl=1), [Test](https://www.dropbox.com/s/v8nccvq7jewcl8s/test_pos.txt?dl=1)


In [None]:
import torch
from torch.utils.data import Dataset
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

class POSDataset(Dataset):
    def __init__(self, file_path, label_encoder, train_encoder = False):
        self.sentences = []
        self.tags = []
        self.sentence = []
        self.tag = []

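        with open(file_path, 'r') as file:
            for line in file:
                if line.isspace():
                    # A blank line marks the end of the current sentence
                    if self.sentence:
                        self.sentences.append(self.sentence)
                        self.tags.append(self.tag)
                        self.sentence = []
                        self.tag = []
                    continue

                word, label = line.lower().strip().split()
                self.sentence.append(word)
                self.tag.append(label)

        # Keep the last sentence when the file does not end with a blank line
        if self.sentence:
            self.sentences.append(self.sentence)
            self.tags.append(self.tag)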

        if train_encoder:
          labels = [item for sublist in self.tags for item in sublist]
          label_encoder.fit(labels)
          print(self.sentences[0])
          print(self.tags[0])
          print(self.sentences[-1])
          print(label_encoder.transform(self.tags[-1]))

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
      return self.sentences[idx], self.tags[idx]


In [None]:
import torchtext
label_encoder = LabelEncoder()
file_path_train = 'train_pos.txt'
train_dataset = POSDataset(file_path_train, label_encoder, train_encoder = True)

file_path_test = 'test_pos.txt'
test_dataset = POSDataset(file_path_test, label_encoder)

min_freq = 5
special_tokens = ["<unk>", "<pad>"]

def get_tokens(dataset):
    for idx in range(len(dataset)):
        yield dataset[idx][0]

vocab = torchtext.vocab.build_vocab_from_iterator(
    get_tokens(train_dataset),
    min_freq=min_freq,
    specials=special_tokens,
)

unk_index = vocab["<unk>"]
pad_index = vocab["<pad>"]
vocab.set_default_index(unk_index)

['confidence', 'in', 'the', 'pound', 'is', 'widely', 'expected', 'to', 'take', 'another', 'sharp', 'dive', 'if', 'trade', 'figures', 'for', 'september', ',', 'due', 'for', 'release', 'tomorrow', ',', 'fail', 'to', 'show', 'a', 'substantial', 'improvement', 'from', 'july', 'and', 'august', "'s", 'near-record', 'deficits', '.']
['nn', 'in', 'dt', 'nn', 'vbz', 'rb', 'vbn', 'to', 'vb', 'dt', 'jj', 'nn', 'in', 'nn', 'nns', 'in', 'nnp', ',', 'jj', 'in', 'nn', 'nn', ',', 'vb', 'to', 'vb', 'dt', 'jj', 'nn', 'in', 'nnp', 'cc', 'nnp', 'pos', 'jj', 'nns', '.']
['it', 'is', 'also', 'pulling', '20', 'people', 'out', 'of', 'puerto', 'rico', ',', 'who', 'were', 'helping', 'huricane', 'hugo', 'victims', ',', 'and', 'sending', 'them', 'to', 'san', 'francisco', 'instead', '.']
[25 39 27 36 10 22 14 14 20 20  5 41 35 36 20 20 22  5  9 36 25 32 20 20
 27  6]


In [None]:
special_tokens = ["<pad>"]

def get_tags(dataset):
    for idx in range(len(dataset)):
        yield dataset[idx][1]

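# Keep every tag (min_freq=1): the tag vocabulary has no default index,
# so a tag dropped by a frequency cut-off would raise an error at lookup time
tag_vocab = torchtext.vocab.build_vocab_from_iterator(
    get_tags(train_dataset),
    min_freq=1,
    specials=special_tokens,
)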

tag_pad_index = tag_vocab["<pad>"]

In [None]:
def encode_sample(example, vocab, tag_vocab):
    ids = vocab.lookup_indices(example[0])
    tag_ids = tag_vocab.lookup_indices(example[1])
    return ids, tag_ids

In [None]:
class POSDatasetEncoded(Dataset):
    def __init__(self, pos_dataset, vocab, tag_vocab, padded_len=78):

        self.sentences = []
        self.tags = []
        for idx in range(len(pos_dataset)):
            sample = pos_dataset[idx]
            ids, tag_ids = encode_sample(sample, vocab, tag_vocab)
            # Truncate long sentences so every sample has exactly padded_len entries
            ids, tag_ids = ids[:padded_len], tag_ids[:padded_len]
            n_pad = padded_len - len(ids)

            ids += [vocab['<pad>']] * n_pad
            tag_ids += [tag_vocab['<pad>']] * n_pad

            self.sentences.append(torch.LongTensor(ids))
            self.tags.append(torch.LongTensor(tag_ids))


    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        return self.sentences[idx], self.tags[idx]

train_dataset = POSDatasetEncoded(train_dataset, vocab, tag_vocab)
test_dataset = POSDatasetEncoded(test_dataset, vocab, tag_vocab)
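
Padding every sentence to a fixed `padded_len` is simple but wastes computation on short batches. As an alternative sketch, `pad_sequence` (imported earlier but otherwise unused) can pad each batch only to the length of its longest sentence. The function name `collate_pad` and the default pad ids are assumptions for this example; in the lab they would come from `vocab["<pad>"]` and `tag_vocab["<pad>"]`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_pad(batch, pad_index=1, tag_pad_index=0):
    """Pad a list of (token_ids, tag_ids) pairs to the longest sentence in the batch."""
    sentences, tags = zip(*batch)
    sentences = pad_sequence(list(sentences), batch_first=True, padding_value=pad_index)
    tags = pad_sequence(list(tags), batch_first=True, padding_value=tag_pad_index)
    return sentences, tags


# Two toy sentences of different lengths (ids are made up)
batch = [
    (torch.tensor([5, 8, 2]), torch.tensor([3, 4, 1])),
    (torch.tensor([7, 9, 4, 6, 3]), torch.tensor([2, 2, 5, 1, 6])),
]
ids, tag_ids = collate_pad(batch)
print(ids)
print(tag_ids)
```

A function like this could be passed to `DataLoader` through its `collate_fn` argument, provided the dataset returns unpadded tensors.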

In [None]:
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4, persistent_workers=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=4, persistent_workers=True)



In [None]:
import math
from torch import nn, Tensor

class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        """
        Arguments:
            x: Tensor, shape ``[seq_len, batch_size, embedding_dim]``
        """
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

In [None]:
import math

import torch
from torch import nn, Tensor
from torch.nn import TransformerEncoder, TransformerEncoderLayer

class TransformerModel(nn.Module):

    def __init__(self, ntoken: int, d_model: int, nhead: int, d_hid: int,
                 nlayers: int, dropout: float = 0.5):
        super().__init__()
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        encoder_layers = TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.embedding = nn.Embedding(ntoken, d_model)
        self.d_model = d_model
        self.linear = nn.Linear(d_model, ntoken)

        self.init_weights()

    def init_weights(self) -> None:
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.linear.bias.data.zero_()
        self.linear.weight.data.uniform_(-initrange, initrange)

    def forward(self, src: Tensor, src_mask: Tensor = None) -> Tensor:
        """
        Arguments:
            src: Tensor, shape ``[batch_size, seq_len]`` (as produced by the DataLoader)
            src_mask: Tensor, shape ``[seq_len, seq_len]``

        Returns:
            output Tensor of shape ``[batch_size, seq_len, ntoken]``
        """
        # The positional encoding and the encoder expect [seq_len, batch_size, ...]
        src = src.transpose(0, 1)
        src = self.embedding(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        if src_mask is None:
            # Square causal mask: masked positions are float('-inf'), unmasked positions are 0.0
            src_mask = nn.Transformer.generate_square_subsequent_mask(len(src)).to(src.device)
        output = self.transformer_encoder(src, src_mask)
        output = self.linear(output)
        # Back to [batch_size, seq_len, ntoken], contiguous so that .view() works downstream
        return output.transpose(0, 1).contiguous()
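
To make the effect of the causal mask concrete, the sketch below builds one by hand with `torch.triu`, following the description in the model code (`-inf` for masked positions, `0.0` elsewhere), and applies it to uniform scores. The sequence length of 4 is arbitrary.

```python
import torch
import torch.nn.functional as F

seq_len = 4

# -inf strictly above the diagonal, 0.0 on and below it
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# Uniform raw scores, so any difference in the weights comes from the mask alone
scores = torch.zeros(seq_len, seq_len)
weights = F.softmax(scores + mask, dim=-1)

print(mask)
print(weights)
```

Row `i` of `weights` puts zero weight on every position after `i`, so each token only attends to itself and the tokens to its left. For a tagging task this means a word cannot use the context to its right, which may or may not be the intended behaviour.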

In [None]:
from torch.nn import TransformerEncoder, TransformerEncoderLayer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ntokens = len(vocab)  # size of vocabulary
emsize = 30  # embedding dimension
d_hid = 200  # dimension of the feedforward network model in ``nn.TransformerEncoder``
nlayers = 2  # number of ``nn.TransformerEncoderLayer`` in ``nn.TransformerEncoder``
nhead = 2  # number of heads in ``nn.MultiheadAttention``
dropout = 0.2  # dropout probability
model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, dropout).to(device)

In [None]:
import torch.optim as optim
import tqdm
import numpy as np
lr = 5e-4
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
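
Because every sentence is padded, a plain `CrossEntropyLoss` also averages over the padded positions. One possible refinement, sketched here on toy tensors, is the `ignore_index` argument, which leaves the chosen label id out of the average. The value `tag_pad_index = 0` is an assumption that holds when `"<pad>"` is the first special token of the tag vocabulary.

```python
import torch
from torch import nn

torch.manual_seed(0)
tag_pad_index = 0  # assumed id of "<pad>" in the tag vocabulary

logits = torch.randn(6, 4)                 # 6 token positions, 4 tag classes
labels = torch.tensor([2, 1, 3, 0, 0, 0])  # the last three positions are padding

plain = nn.CrossEntropyLoss()(logits, labels)
masked = nn.CrossEntropyLoss(ignore_index=tag_pad_index)(logits, labels)
real_only = nn.CrossEntropyLoss()(logits[:3], labels[:3])

print(plain.item(), masked.item(), real_only.item())
```

The masked loss equals the loss computed on the three real tokens only, so padding no longer influences the gradient.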

In [None]:
n_epochs = 10
for ep in range(n_epochs):
  model.train()
  epoch_losses = []
  train_correct = 0
  train_total = 0
  for batch in tqdm.tqdm(train_loader, desc="training...", leave=False):
    optimizer.zero_grad()
    ids = batch[0].to(device)
    label = batch[1].to(device)

    prediction = model(ids)
    # Flatten to [batch * seq_len, n_classes] and [batch * seq_len] for the loss
    loss = criterion(prediction.view(-1, prediction.shape[2]), label.view(-1))

    loss.backward()
    optimizer.step()

    epoch_losses.append(loss.item())

    # Token-level accuracy; note that padded positions are counted as well
    _, predicted_tags = torch.max(prediction, 2)
    train_total += prediction.shape[0] * prediction.shape[1]
    train_correct += (predicted_tags == label).sum().item()

  test_losses = []
  test_correct = 0
  test_total = 0
  with torch.no_grad():
    model.eval()
    for batch in tqdm.tqdm(test_loader, desc='evaluating...', leave=False):
      ids = batch[0].to(device)
      label = batch[1].to(device)

      prediction = model(ids)
      loss = criterion(prediction.view(-1, prediction.shape[2]), label.view(-1))
      test_losses.append(loss.item())

      _, predicted_tags = torch.max(prediction, 2)
      test_total += prediction.shape[0] * prediction.shape[1]
      test_correct += (predicted_tags == label).sum().item()


  print(f'[Epoch {ep}] Train:\n\tLoss: {np.mean(epoch_losses)}, Acc: {train_correct / train_total}\nTest:\n\tLoss: {np.mean(test_losses)}, Acc: {test_correct / test_total}')




[Epoch 0] Train:
	Loss: 0.1148509298584291, Acc: 0.9591625921079815
Test:
	Loss: 0.1327374642567029, Acc: 0.9536626395473314

[Epoch 1] Train:
	Loss: 0.11479813706661973, Acc: 0.9591941555907536
Test:
	Loss: 0.1327663745198931, Acc: 0.9536690115715961

[Epoch 2] Train:
	Loss: 0.11481394818318742, Acc: 0.9592041985170902
Test:
	Loss: 0.1327995296035494, Acc: 0.9537263597899781

[Epoch 3] Train:
	Loss: 0.11474067643284798, Acc: 0.9591224204026353
Test:
	Loss: 0.1328321217544495, Acc: 0.9537837080083601

[Epoch 4] Train:
	Loss: 0.114547873900405, Acc: 0.959297454261644
Test:
	Loss: 0.13286202270833272, Acc: 0.9537136157414488

[Epoch 5] Train:
	Loss: 0.1143162717510547, Acc: 0.9592314578885751
Test:
	Loss: 0.13288922915382992, Acc: 0.9537454758627721

[Epoch 6] Train:
	Loss: 0.11459328957966396, Acc: 0.9591324633289718
Test:
	Loss: 0.13291437521813407, Acc: 0.9536753835958607

[Epoch 7] Train:
	Loss: 0.11445691234299114, Acc: 0.9593046277804559
Test:
	Loss: 0.13293731023394872, Acc: 0.9536371514502727

[Epoch 8] Train:
	Loss: 0.1143775873152273, Acc: 0.9591195509951105
Test:
	Loss: 0.13295724550409924, Acc: 0.9536180353774787

[Epoch 9] Train:
	Loss: 0.11472529510834388, Acc: 0.9591181162913481
Test:
	Loss: 0.13296770730188914, Acc: 0.9536052913289493
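
The training loop counts every position in `train_total` and `test_total`, including padding, and a model can predict the pad tag for padded positions almost for free. The toy sketch below contrasts that count with an accuracy restricted to real tokens; the tensors and `tag_pad_index = 0` are made-up values for illustration.

```python
import torch

tag_pad_index = 0  # assumed id of "<pad>" in the tag vocabulary

# Two padded sentences with 3 and 4 real tokens; one tagging mistake in each
labels = torch.tensor([[3, 1, 2, 0, 0, 0],
                       [2, 2, 1, 4, 0, 0]])
predicted = torch.tensor([[3, 1, 1, 0, 0, 0],
                          [2, 3, 1, 4, 0, 0]])

# Accuracy over all positions, padding included
naive_acc = (predicted == labels).float().mean().item()

# Accuracy over real tokens only
mask = labels != tag_pad_index
masked_acc = (predicted[mask] == labels[mask]).float().mean().item()

print(f"all positions: {naive_acc:.3f}")   # 10 of 12 correct
print(f"real tokens:   {masked_acc:.3f}")  # 5 of 7 correct
```

With sentences padded to a fixed length, the gap between the two numbers grows with the share of padding, so the masked variant is usually the more informative metric for tagging.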


