## Week 6 : Large Language Models
```
- Generative Artificial Intelligence (Fall semester 2023)
- Professor: Muhammad Fahim
- Teaching Assistant: Ahmad Taha
```
<hr>

## Contents
```
1. Transformers (Implementing a transformer)
2. Self-Attention
3. Multi-headed attention
4. Positional Encoding

```

<hr>


# Transformers

* [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) -- the original Transformer paper

![](http://jalammar.github.io/images/t/The_transformer_encoder_decoder_stack.png)


In [1]:
import torch
from torch import nn
import torch.optim as optim
import pandas as pd
import numpy as np

from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Transformer Encoder with PyTorch

In [2]:
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=32)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)



In [3]:
encoder_layer

TransformerEncoderLayer(
  (self_attn): MultiheadAttention(
    (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
  )
  (linear1): Linear(in_features=512, out_features=2048, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (linear2): Linear(in_features=2048, out_features=512, bias=True)
  (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (dropout1): Dropout(p=0.1, inplace=False)
  (dropout2): Dropout(p=0.1, inplace=False)
)
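
A quick shape check can make the stack above more concrete. The sketch below assumes the default `batch_first=False` layout, so the input is `[seq_len, batch_size, d_model]`. The reduced sizes and the names `toy_layer`, `toy_encoder`, and `x` are illustrative and not part of the notebook.

```python
import torch
from torch import nn

# Smaller sizes than the cell above, chosen only to keep the example fast
toy_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128)
toy_encoder = nn.TransformerEncoder(toy_layer, num_layers=2)
toy_encoder.eval()  # disable dropout so the output is deterministic

# Default layout is [seq_len, batch_size, d_model]
x = torch.rand(10, 3, 64)
with torch.no_grad():
    encoded = toy_encoder(x)

print(encoded.shape)  # the encoder keeps the input shape
```

Each layer maps a sequence of vectors to a sequence of the same shape, which is why layers can be stacked freely.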

## Encoder

The encoder contains a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. <br>
**The main goal is to efficiently encode the data**

![](http://jalammar.github.io/images/t/encoder_with_tensors.png)

## Self-Attention

**Keep in mind: the main goal is to encode the data more efficiently.** In other words, the goal is to create meaningful embeddings.<br>
- As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can lead to a better encoding for this word.


**How does Self-Attention work?**

Steps:
1. For each word, we create a **`Query`** vector, a **`Key`** vector, and a **`Value`** vector.
  - What are the **`Query`**, **`Key`**, and **`Value`** vectors? They are abstractions that are useful for calculating attention. Each one is a projection of the word embedding, obtained by multiplying the embedding by a learned weight matrix.
2. Calculate the self-attention scores as the dot product of each **`Query`** vector with every **`Key`** vector.
3. Divide the scores by $\sqrt{d_k}$, the square root of the key dimension (8 in the paper, where $d_k = 64$). This leads to more stable gradients.
4. Pass the result through a softmax operation (the softmax score determines how much each word will be expressed at this position)
5. Multiply each value vector by its softmax score
6. Sum up the weighted value vectors
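
Steps 2 to 6 are often summarized in a single matrix expression. In the notation of the original paper, with $Q$, $K$, and $V$ stacking the query, key, and value vectors row by row and $d_k$ denoting the key dimension:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$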

### Step 1

For each word, we create a **`Query`** vector, a **`Key`** vector, and a **`Value`** vector.

![](http://jalammar.github.io/images/t/transformer_self_attention_vectors.png)

In [4]:
# simple sequence = I am here today
simple_sequence_embedding = torch.rand((4, 512))

# Create weight matrices
Wq = torch.normal(0,0.5, (512, 7))
Wk = torch.normal(0,0.1, (512, 7))
Wv = torch.normal(0,0.2, (512, 7))

# Create the query, key, and value vectors for each word in the sentence
queries = simple_sequence_embedding.mm(Wq) # self.q(embedding[0])
keys = simple_sequence_embedding.mm(Wk)
values = simple_sequence_embedding.mm(Wv)

In [5]:
values

tensor([[-1.4144,  0.1425,  0.6360, -4.6266, -1.1239, -3.8719, -0.2917],
        [-1.2888,  1.2161,  1.3884, -2.9614,  0.0145, -2.8608, -1.7486],
        [-0.5685,  0.6050, -0.0479, -2.7829, -1.1900, -3.0137, -2.7247],
        [-0.8240,  0.1764,  2.7191, -3.3767,  1.2477, -3.1469, -4.3871]])

In [6]:
simple_sequence_embedding

tensor([[0.6412, 0.0468, 0.1284,  ..., 0.0317, 0.7052, 0.3328],
        [0.8832, 0.4157, 0.0492,  ..., 0.8416, 0.2928, 0.5546],
        [0.2911, 0.0679, 0.8347,  ..., 0.6101, 0.8706, 0.3020],
        [0.1589, 0.4290, 0.4915,  ..., 0.3169, 0.7444, 0.0272]])

### Step 2

Calculate the self-attention scores from the **`Query`** and **`Key`** vectors

In [7]:
scores = torch.mm(queries, keys.T)
scores

tensor([[ -7.0074,   4.3708,  17.8155,   0.3243],
        [-11.0049,  -3.2940,   0.3478,   2.0981],
        [  3.6775,  16.4631,  24.6573,  -9.4841],
        [-13.9342,  14.0622,  17.1531,   3.4276]])

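### Step 3
Divide the scores by $\sqrt{d_k}$, the square root of the key dimension. The paper uses $d_k = 64$, hence the 8 (this leads to more stable gradients).

In [8]:
# 8 = sqrt(64) is the paper's scale; with the 7-dimensional keys above, sqrt(d_k) would be 7 ** 0.5
scores = scores / 8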
scores

tensor([[-0.8759,  0.5463,  2.2269,  0.0405],
        [-1.3756, -0.4117,  0.0435,  0.2623],
        [ 0.4597,  2.0579,  3.0822, -1.1855],
        [-1.7418,  1.7578,  2.1441,  0.4285]])
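
To see why the scores are scaled, the following sketch (an illustration, not part of the original notebook) draws random queries and keys with unit-variance components and compares the spread of the dot products before and after dividing by $\sqrt{d_k}$. The sample size and seed are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
d_k = 64  # key dimension used in the paper

# Random queries and keys with unit-variance components
q = torch.randn(10_000, d_k)
k = torch.randn(10_000, d_k)

raw = (q * k).sum(dim=1)      # one dot product per row
scaled = raw / d_k ** 0.5     # same scores after dividing by sqrt(d_k) = 8

print(raw.var().item())     # close to d_k
print(scaled.var().item())  # close to 1
```

Without scaling, the variance of the scores grows with $d_k$, which pushes the softmax toward very peaked outputs with small gradients. Dividing by $\sqrt{d_k}$ keeps the scores in a comparable range for any key dimension.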

### Step 4

Pass the result through a softmax operation

In [9]:
scores = torch.nn.functional.softmax(scores, dim=1)
scores

tensor([[0.0334, 0.1386, 0.7443, 0.0836],
        [0.0775, 0.2033, 0.3204, 0.3988],
        [0.0502, 0.2484, 0.6917, 0.0097],
        [0.0109, 0.3615, 0.5319, 0.0957]])

### Steps 5 & 6

* Multiply each value vector by the softmax score
* Sum up the weighted value vectors



In [10]:
scores.shape, values.shape

(torch.Size([4, 4]), torch.Size([4, 7]))

In [11]:
z = scores @ values
z

tensor([[-0.7180,  0.6384,  0.4055, -2.9190, -0.8170, -3.0323, -2.6470],
        [-0.8824,  0.5224,  1.4005, -3.1989,  0.0321, -3.1023, -3.0006],
        [-0.7923,  0.7294,  0.3700, -2.9256, -0.8639, -3.0201, -2.3762],
        [-0.8625,  0.7799,  0.7435, -2.9244, -0.5207, -2.9805, -2.5043]])

# Multi-headed attention

**GOAL**:
1. Expand the model’s ability to focus on different positions
2. Provide the attention layer multiple “representation subspaces”

**Attention with $N$ heads just means repeating the self-attention algorithm $N$ times and joining the results**


![](https://data-science-blog.com/wp-content/uploads/2022/01/mha_img_original.png)

**Multi-headed attention steps:**
1. Same as the self-attention calculation, just $N$ different times with different weight matrices
2. Condense the $N$ z matrices into a single matrix by concatenating them, then multiply the result by an additional weight matrix `WO`

The output z matrix is then fed to the FFNN
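
The two steps above can be sketched with plain tensors. This is a minimal illustration with random (untrained) weights. The sizes follow the paper (8 heads of dimension 64 for a 512-dimensional model), while the variable names and the `0.02` initialization scale are assumptions made for the example.

```python
import torch

torch.manual_seed(0)
seq_len, d_model, n_heads, d_head = 4, 512, 8, 64

x = torch.rand(seq_len, d_model)

head_outputs = []
for _ in range(n_heads):
    # Each head has its own projection matrices (random here, learned in practice)
    Wq = torch.randn(d_model, d_head) * 0.02
    Wk = torch.randn(d_model, d_head) * 0.02
    Wv = torch.randn(d_model, d_head) * 0.02
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = torch.softmax(q @ k.T / d_head ** 0.5, dim=-1)
    head_outputs.append(weights @ v)          # [seq_len, d_head]

concat = torch.cat(head_outputs, dim=-1)      # [seq_len, n_heads * d_head]
WO = torch.randn(n_heads * d_head, d_model) * 0.02
z = concat @ WO                               # [seq_len, d_model]

print(concat.shape, z.shape)
```

The final projection brings the concatenated heads back to the model dimension, so the result can be added to the residual stream and passed to the feed-forward network.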

In [12]:
from torch import Tensor
import torch.nn.functional as f


def scaled_dot_product_attention(query, key, value):
  temp = query.bmm(key.transpose(1, 2))
  scale = query.size(-1) ** 0.5
  softmax = f.softmax(temp / scale, dim=-1)
  return softmax.bmm(value)

## Now let's make an attention head

In [13]:
import torch.nn as nn

class AttentionHead(nn.Module):
    def __init__(self, dim_in, dim_q, dim_k):
        super().__init__()
        # Linear projections that turn the input embeddings into queries, keys, and values
        # dim_in is the input dimension; dim_q is the query dimension and dim_k the key/value dimension
        # dim_q must equal dim_k for the query-key dot product. Example from the paper: dim_in = 512, dim_q = dim_k = 64
        self.q = nn.Linear(dim_in, dim_q)
        self.k = nn.Linear(dim_in, dim_k)
        self.v = nn.Linear(dim_in, dim_k)

    def forward(self, query, key, value):
        # Project the inputs with the layers defined in __init__, then apply scaled dot-product attention
        return scaled_dot_product_attention(self.q(query), self.k(key), self.v(value))


## Multi Head Attention

In [14]:
class MultiHeadToAttention(nn.Module):
    def __init__(self, number_of_heads, dim_in, dim_q, dim_k):
        super().__init__()
        # Create one AttentionHead per head
        # The final linear layer combines the concatenated head outputs into a single output
        self.heads = nn.ModuleList([AttentionHead(dim_in, dim_q, dim_k) for _ in range(number_of_heads)])
        self.linear = nn.Linear(number_of_heads*dim_k, dim_in)

    def forward(self, query: Tensor, key: Tensor, value: Tensor):
        # Concatenate outputs from all heads and apply the final linear transformation
        z = self.linear(torch.cat([h(query, key, value) for h in self.heads], dim=-1))
        return z

## Positional Encoding

A way to account for the order of the words in the input sequence. A transformer adds a vector to each input embedding, which helps it determine the position of each word. <br>
**Goal**: preserve information about the order of tokens  

Positional encodings can either be learned or fixed a priori.

The original paper proposes a simple scheme for fixed positional encodings based on sine and cosine functions.

![](https://miro.medium.com/v2/resize:fit:640/format:webp/1*C3a9RL6-SFC6fW8NGpJg5A.png)
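
For reference, the fixed encoding from the original paper can be written as follows, where $pos$ is the token position, $i$ indexes the sine/cosine pairs, and $d_{\text{model}}$ is the embedding dimension:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$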

In [15]:
def position_encoding(seq_len, dim_model, device):
    # Positions 0..seq_len-1, shaped [1, seq_len, 1] for broadcasting
    pos = torch.arange(seq_len, dtype=torch.float, device=device).reshape(1, -1, 1)

    # Embedding dimensions 0..dim_model-1, shaped [1, 1, dim_model]
    dim = torch.arange(dim_model, dtype=torch.float, device=device).reshape(1, 1, -1)

    # Dimensions 2i and 2i+1 share the exponent 2i / dim_model, as in the paper
    pair_index = torch.div(dim, 2, rounding_mode="floor")
    phase = pos / (1e4 ** (2 * pair_index / dim_model))

    # Sine on even dimensions, cosine on odd dimensions; the result is [1, seq_len, dim_model]
    return torch.where(dim.long() % 2 == 0, torch.sin(phase), torch.cos(phase))
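
As a sanity check on the sinusoidal scheme, the encoding of a single position can be computed with plain Python. The helper name `sinusoid` is hypothetical and the sketch follows the formula from the paper rather than any specific library implementation.

```python
import math


def sinusoid(pos: int, dim_model: int) -> list:
    """Encoding vector for one position, following the paper's formula."""
    vec = []
    for d in range(dim_model):
        pair = d // 2  # dimensions 2i and 2i+1 share the same frequency
        angle = pos / (10000 ** (2 * pair / dim_model))
        vec.append(math.sin(angle) if d % 2 == 0 else math.cos(angle))
    return vec


print(sinusoid(0, 6))  # position 0: sin(0) = 0 and cos(0) = 1 alternate
print([round(v, 3) for v in sinusoid(1, 6)])
```

Every value stays in $[-1, 1]$, so the encoding can be added to the embeddings without dominating them, and each position receives a distinct pattern of phases.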


## Encoder Feed Forward

In [16]:
def feed_forward(dim_input = 512, dim_feedforward = 2048):
  return nn.Sequential(nn.Linear(dim_input, dim_feedforward),
                       nn.ReLU(),
                       nn.Linear(dim_feedforward, dim_input)
                       )
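
The feed-forward block is applied to every position independently. The sketch below, which uses small illustrative sizes and a locally built `ffn` rather than the function above, checks that running the network on a single position gives the same result as slicing the output for the whole sequence.

```python
import torch
from torch import nn

torch.manual_seed(0)
ffn = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

x = torch.rand(2, 5, 16)          # [batch_size, seq_len, dim_input]
with torch.no_grad():
    full = ffn(x)                 # all positions at once
    single = ffn(x[:, 3, :])      # only position 3

print(full.shape)
print(torch.allclose(full[:, 3, :], single, atol=1e-5))
```

Because no information flows between positions here, mixing across the sequence happens only in the attention sublayer.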

## Encoder Residual

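Following the original paper, each sublayer is wrapped in a residual connection followed by layer normalization, with dropout applied to the sublayer output.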

In [17]:
class Residual(nn.Module):
  def __init__(self, sublayer, dimension, dropout = 0.1):
    super().__init__()
    self.sublayer = sublayer
    self.norm = nn.LayerNorm(dimension)
    self.dropout = nn.Dropout(dropout)

  def forward(self, *tensors):
    return self.norm(tensors[0] + self.dropout(self.sublayer(*tensors)))
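
The effect of the "add and normalize" pattern can be checked on a toy sublayer. In this sketch a single `nn.Linear` stands in for attention or the feed-forward block (an assumption for illustration), and dropout is put in eval mode so that it acts as the identity.

```python
import torch
from torch import nn

torch.manual_seed(0)
dim = 32
sublayer = nn.Linear(dim, dim)   # stand-in for attention or the feed-forward block
norm = nn.LayerNorm(dim)
dropout = nn.Dropout(0.1)
dropout.eval()                   # dropout is the identity in eval mode

x = torch.rand(2, 5, dim)
with torch.no_grad():
    out = norm(x + dropout(sublayer(x)))

print(out.shape)
print(out.mean(dim=-1).abs().max().item())            # close to 0
print(out.var(dim=-1, unbiased=False).mean().item())  # close to 1
```

The shape is unchanged, and each position ends up with approximately zero mean and unit variance across its features, which keeps activations in a stable range from layer to layer.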

## Putting it all together on the encoder side

![](http://jalammar.github.io/images/t/transformer_resideual_layer_norm_2.png)

## Putting the Encoder layer together

In [18]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, dim_model=512, num_heads=6, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        # Define dimensions for query and key based on model dimension and number of heads
        dim_q = dim_k = max(dim_model // num_heads, 1)

        # Initialize the MultiHeadAttention component with a residual connection and dropout
        self.attention = Residual(MultiHeadToAttention(num_heads, dim_model, dim_q, dim_k),
                              dimension=dim_model, dropout=dropout)

        # Initialize the feedforward component with a residual connection and dropout
        self.feed_forward = Residual(feed_forward(dim_model, dim_feedforward), dimension=dim_model, dropout=dropout)


    def forward(self, src):
        # Apply the attention mechanism to the source data
        src = self.attention(src, src, src)

        # Apply the feedforward network to the output of the attention mechanism
        return self.feed_forward(src)


## Putting together the Transformer Encoder

In [19]:
class TransformerEncoder(nn.Module):
    def __init__(self, num_layers=12, dim_model=512, num_heads=4, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        # Initialize a list of TransformerEncoderLayer instances
        self.layers = nn.ModuleList([TransformerEncoderLayer(dim_model, num_heads, dim_feedforward, dropout) for _ in range(num_layers) ])


    def forward(self, src):
        # Retrieve the sequence length and dimension from the source tensor
        seq_len, dimension = src.size(1), src.size(2)

        # Add position encoding to the source tensor (position_encoding requires the device)
        src += position_encoding(seq_len, dimension, device=src.device)

        # Process each layer in the transformer encoder
        for layer in self.layers:
          src = layer(src)

        return src

# The Decoder Side

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are used by each decoder layer.


![](https://jalammar.github.io/images/t/transformer_resideual_layer_norm_3.png)

## Decoder layer

The “Encoder-Decoder Attention” layer works just like multi-headed self-attention, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.
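
The shapes involved in encoder-decoder attention can be illustrated with PyTorch's built-in `nn.MultiheadAttention`. This is a sketch with assumed toy sizes (7 source tokens, 5 target tokens), not the implementation used in the task below.

```python
import torch
from torch import nn

torch.manual_seed(0)
cross_attention = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)

memory = torch.rand(2, 7, 32)  # encoder output: 7 source tokens
tgt = torch.rand(2, 5, 32)     # decoder states: 5 target tokens

with torch.no_grad():
    # Queries come from the decoder, keys and values from the encoder output
    out, weights = cross_attention(query=tgt, key=memory, value=memory)

print(out.shape)      # one output vector per target token
print(weights.shape)  # one weight per (target token, source token) pair
```

The output length follows the queries, while the attention weights span the source sequence, so each target position can read from the entire encoder output.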

**Task**: implement the decoder layer

In [20]:
class TransformerDecoderLayer(nn.Module):
    def __init__(
        self,
        dim_model: int = 512,
        num_heads: int = 6,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        # init dim_q and dim_k as in the encoder
        dim_q = dim_k = max(dim_model // num_heads, 1)

        # Initialize the first self-attention layer with a residual connection
        self.attention_1 = Residual(
            MultiHeadToAttention(num_heads, dim_model, dim_q, dim_k),
            dimension=dim_model, dropout=dropout
        )

        # Initialize the second attention layer for interaction with the encoder output
        self.attention_2 = Residual(
            MultiHeadToAttention(num_heads, dim_model, dim_q, dim_k),
            dimension=dim_model, dropout=dropout
        )

        # Initialize the feed-forward network
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model, dropout=dropout
        )

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        # Self-attention mechanism
        tgt = self.attention_1(tgt, tgt, tgt)

        # Cross-attention mechanism where the decoder attends to the encoder's output
        tgt = self.attention_2(tgt, memory, memory)

        # Pass through the feed-forward network
        return self.feed_forward(tgt)

## Full Transformer Decoder

**Task**: implement the Transformer decoder class

In [21]:
class TransformerDecoder(nn.Module):
    def __init__(
        self,
        num_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        # Initialize layers from TransformerDecoderLayer instances
        self.layers = nn.ModuleList([TransformerDecoderLayer(dim_model, num_heads, dim_feedforward, dropout) for _ in range(num_layers)])

        # Define the output linear transformation
        self.linear = nn.Linear(dim_model, dim_feedforward)

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        # Calculate sequence length and dimension from the target tensor
        seq_len, dimension = tgt.size(1), tgt.size(2)

        # Add position encoding to the target tensor
        tgt += position_encoding(seq_len, dimension, device=tgt.device)

        # Process each layer in the transformer decoder
        for layer in self.layers:
            tgt = layer(tgt, memory)

        # Apply a softmax over the last dimension of the final linear layer's output
        return torch.softmax(self.linear(tgt), dim=-1)

## Full Transformer model

**Task**:
1. Assemble a full Transformer (Encoder + Decoder)
2. Implement the Transformer training loop
3. Using a dataset of your choice, train the Transformer for just one epoch


In [22]:
class Transfomer(nn.Module):
  def __init__(self, output_dim):
    super().__init__()
    self.transformer = transformers.AutoModel.from_pretrained('bert-base-uncased')
    for param in self.transformer.parameters():
        param.requires_grad = False

    hidden_dim = self.transformer.config.hidden_size
    self.fc = nn.Linear(hidden_dim, output_dim)


  def forward(self, text):
    # text = [batch size, seq len]
    output = self.transformer(text, output_attentions=True)
    hidden = output.last_hidden_state
    # hidden = [batch size, seq len, hidden dim]
    attention = output.attentions[-1]
    # attention = [batch size, n heads, seq len, seq len]
    cls_hidden = hidden[:, 0, :]
    prediction = self.fc(torch.tanh(cls_hidden))

    return prediction

In [23]:
import collections
import math
import os
from tempfile import TemporaryDirectory
from typing import Tuple

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
import tqdm
import transformers
from torch import Tensor
# These two names shadow the hand-written TransformerEncoder and TransformerEncoderLayer
# classes defined earlier; the cells below use the PyTorch versions
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import DataLoader, Dataset, dataset
from torchtext.vocab import build_vocab_from_iterator



In [24]:
# link for downloading train set 'https://www.dropbox.com/s/x9n6f9o9jl7pno8/train_pos.txt?dl=1'
# link for downloading test set 'https://www.dropbox.com/s/v8nccvq7jewcl8s/test_pos.txt?dl=1'


file_path_train = '/content/train_pos.txt'


with open(file_path_train, 'r', encoding='utf-8') as file:
    train_ = file.readlines()


file_path_test = '/content/test_pos.txt'

with open(file_path_test, 'r', encoding='utf-8') as file:
    test_ = file.readlines()


class POS_Dataset(Dataset):
    def __init__(self, file):
      self.sentences=[]
      self.tags=[]
      sentence=[]
      tag=[]
      for line in file:
          if line.isspace():  # a blank line marks the end of a sentence
              self.sentences.append(sentence)
              sentence = []
              self.tags.append(tag)
              tag = []
          else:
              # Split the line on whitespace and extract the word and its tag
              word, tag_ = line.split()
              sentence.append(word)
              tag.append(tag_)

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        return self.sentences[index], self.tags[index]

train_dataset = POS_Dataset(train_)
test_dataset = POS_Dataset(test_)

min_freq = 5
special_tokens = ["<unk>", "<pad>"]

def get_tokens(dataset):
    for index in range(len(dataset)):
        yield dataset[index][0]

vocab = torchtext.vocab.build_vocab_from_iterator(
    get_tokens(train_dataset),
    min_freq=min_freq,
    specials=special_tokens,
)

unk_index = vocab["<unk>"]
pad_index = vocab["<pad>"]
vocab.set_default_index(unk_index)


special_tokens = ["<pad>"]

def get_tags(dataset):
    for index in range(len(dataset)):
        yield dataset[index][1]

tag_vocab = torchtext.vocab.build_vocab_from_iterator(
    get_tags(train_dataset),
    min_freq=min_freq,
    specials=special_tokens,
)

tag_pad_index = tag_vocab["<pad>"]



def encode_sample(example, vocab, tag_vocab):
    ids = vocab.lookup_indices(example[0])
    tag_ids = tag_vocab.lookup_indices(example[1])
    return ids, tag_ids


class POS_Dataset_Encoded(Dataset):
    def __init__(self, pos_dataset, vocab, tag_vocab, padded_len=78):
        self.sentences = []
        self.tags = []
        for index in range(len(pos_dataset)):
            sample = pos_dataset[index]
            ids, tag_ids = encode_sample(sample, vocab, tag_vocab)
            l = len(ids)

            # pad sequence
            ids += [vocab['<pad>']] * (padded_len - l)
            tag_ids += [tag_vocab['<pad>']] * (padded_len - l)

            self.sentences.append(torch.LongTensor(ids))
            self.tags.append(torch.LongTensor(tag_ids))

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        return self.sentences[index], self.tags[index]


train_dataset = POS_Dataset_Encoded(train_dataset, vocab, tag_vocab)
test_dataset = POS_Dataset_Encoded(test_dataset, vocab, tag_vocab)
batchSize = 64
train_loader = DataLoader(train_dataset, batch_size=batchSize, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batchSize)


def get_accuracy(prediction, label):
  batch_size, _ = prediction.shape
  predicted_classes = prediction.argmax(dim=-1)
  correct_predictions = predicted_classes.eq(label).sum()
  accuracy = correct_predictions / batch_size
  return accuracy
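
`get_accuracy` counts padded positions like any other token, which can inflate the score when sentences are much shorter than the padded length. A possible variant, sketched here under the assumption that padded positions carry a known label index, averages only over real tokens. The name `masked_accuracy` and the toy tensors are hypothetical.

```python
import torch


def masked_accuracy(prediction, label, pad_index):
    """Accuracy over positions whose label is not the padding index."""
    predicted_classes = prediction.argmax(dim=-1)
    keep = label.ne(pad_index)
    correct = (predicted_classes.eq(label) & keep).sum()
    return correct.float() / keep.sum().clamp(min=1)


# Toy example: 4 positions, 3 classes, class 0 plays the role of <pad>
prediction = torch.tensor([[0.1, 0.8, 0.1],
                           [0.1, 0.2, 0.7],
                           [0.9, 0.05, 0.05],
                           [0.9, 0.05, 0.05]])
label = torch.tensor([1, 1, 0, 0])

print(masked_accuracy(prediction, label, pad_index=0).item())
```

In the toy example three of four positions are predicted correctly, but two of those are padding, so the masked accuracy is one correct prediction out of two real tokens.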

In [25]:
class TransformerModel(nn.Module):

    def __init__(self, ntoken: int, d_model: int, nhead: int, d_hid: int,
                 nlayers: int, dropout: float = 0.5):
        super().__init__()
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        encoder_layers = TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.embedding = nn.Embedding(ntoken, d_model)
        self.d_model = d_model
        self.linear = nn.Linear(d_model, ntoken)

        self.init_weights()

    def init_weights(self) -> None:
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.linear.bias.data.zero_()
        self.linear.weight.data.uniform_(-initrange, initrange)

    def forward(self, src: Tensor, src_mask: Tensor = None) -> Tensor:
        """
        Arguments:
            src: Tensor, shape ``[seq_len, batch_size]``
            src_mask: Tensor, shape ``[seq_len, seq_len]``

        Returns:
            output Tensor of shape ``[seq_len, batch_size, ntoken]``
        """
        src = self.embedding(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        if src_mask is None:
            """Generate a square causal mask for the sequence. The masked positions are filled with float('-inf').
            Unmasked positions are filled with float(0.0).
            """
            src_mask = nn.Transformer.generate_square_subsequent_mask(len(src)).to(device)
        output = self.transformer_encoder(src, src_mask)
        output = self.linear(output)
        return output
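
The causal mask described in the docstring can be built by hand to see its effect. The sketch below constructs the same kind of mask with `torch.triu` (an equivalent construction chosen for illustration) and applies it to uniform scores.

```python
import torch

seq_len = 4
# -inf above the diagonal, 0 elsewhere: position t may only attend to positions <= t
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = torch.zeros(seq_len, seq_len)           # equal raw scores, for illustration
weights = torch.softmax(scores + mask, dim=-1)

print(mask)
print(weights)  # row t spreads its weight evenly over positions 0..t
```

Adding `-inf` before the softmax drives the corresponding weights to exactly zero, so later positions cannot influence earlier ones.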

In [26]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)


    def forward(self, x: Tensor) -> Tensor:
        """
        Arguments:
            x: Tensor, shape ``[seq_len, batch_size, embedding_dim]``
        """

        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

In [27]:
ntokens = len(vocab)  # size of vocabulary
emsize = 200  # embedding dimension
d_hid = 200  # dimension of the feedforward network model in ``nn.TransformerEncoder``
nlayers = 2  # number of ``nn.TransformerEncoderLayer`` in ``nn.TransformerEncoder``
nhead = 2  # number of heads in ``nn.MultiheadAttention``
dropout = 0.2  # dropout probability

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
lr = 5e-4
model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, dropout).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
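
If padded positions should not contribute to the loss, `nn.CrossEntropyLoss` accepts an `ignore_index` argument. The sketch below uses toy logits and assumes that index 0 is the padding label. In this notebook the corresponding value would be `tag_pad_index`.

```python
import torch
from torch import nn

pad = 0  # assumed padding label index for this toy example
logits = torch.tensor([[0.2, 1.5, 0.3],
                       [0.1, 0.4, 2.0],
                       [3.0, 0.1, 0.1]])
labels = torch.tensor([1, 2, pad])

plain = nn.CrossEntropyLoss()(logits, labels)
ignoring_pad = nn.CrossEntropyLoss(ignore_index=pad)(logits, labels)
# Same value as averaging the loss over the two non-padding rows only
reference = nn.CrossEntropyLoss()(logits[:2], labels[:2])

print(plain.item(), ignoring_pad.item(), reference.item())
```

With `ignore_index`, the mean is taken over the non-ignored targets only, so easy padding predictions no longer pull the reported loss down.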




In [28]:
n_epochs = 1
for ep in range(n_epochs):
  model.train()
  epoch_losses = []
  epoch_accs = []
  for batch in tqdm.tqdm(train_loader, desc="training..."):
    optimizer.zero_grad()
    ids = batch[0].to(device)
    label = batch[1].to(device)

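    # NOTE: ids is [batch_size, seq_len], while TransformerModel.forward documents [seq_len, batch_size];
    # transposing ids and label (e.g. ids.T) would make attention run along each sentence
    prediction = model(ids)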

    loss = criterion(prediction.view(-1, prediction.shape[2]), label.view(-1))
    accuracy = get_accuracy(prediction.view(-1, prediction.shape[2]), label.view(-1))

    loss.backward()
    optimizer.step()

    epoch_losses.append(loss.item())
    epoch_accs.append(accuracy.item())

  test_losses = []
  test_accs = []
  with torch.no_grad():
    model.eval()
    for batch in tqdm.tqdm(test_loader, desc='evaluating...'):
      ids = batch[0].to(device)
      label = batch[1].to(device)

      prediction = model(ids)
      loss = criterion(prediction.view(-1, prediction.shape[2]), label.view(-1))
      accuracy = get_accuracy(prediction.view(-1, prediction.shape[2]), label.view(-1))


      test_losses.append(loss.item())
      test_accs.append(accuracy.item())

  print(f'[Epoch {ep+1}]\nTrain:\n\tLoss: {np.mean(epoch_losses)}, Acc: {np.mean(epoch_accs)}\nTest:\n\tLoss: {np.mean(test_losses)}, Acc: {np.mean(test_accs)}')

training...: 100%|██████████| 140/140 [00:05<00:00, 25.44it/s]
evaluating...: 100%|██████████| 32/32 [00:00<00:00, 66.77it/s]

[Epoch 1]
Train:
	Loss: 0.9660310673926558, Acc: 0.837715214131666
Test:
	Loss: 0.19516967562958598, Acc: 0.9471842590719461



