### Fine-Tuning a BERT Model on the CoLA Dataset
- In this assignment we will fine-tune a BERT model for a classification task.

## Installation

In [34]:
# !pip install wget
!pip install torch -q
!pip install transformers -q
!pip install datasets -q


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Device Selection
- Use the GPU if one is available; otherwise run on the CPU.

In [3]:
import torch

if torch.cuda.is_available():
  device = torch.device("cuda")
  device_count = torch.cuda.device_count()
  device_name = torch.cuda.get_device_name(0)

  print(f"There are {device_count} GPU(s) available.")
  print(f"We will use the GPU: {device_name}")


else:
  print("No GPU available, using the CPU instead.")
  device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: NVIDIA GeForce RTX 2060


## Loading Datasets
- In this lab assignment we will use the CoLA dataset from the [GLUE Benchmark](https://huggingface.co/datasets/nyu-mll/glue).
  - The GLUE Benchmark consists of several tasks; we will use the `Corpus of Linguistic Acceptability (cola)` dataset for single-sentence classification. It is a set of sentences labeled as grammatically acceptable (1) or unacceptable (0).
  - For more info, check the [HuggingFace GLUE Benchmark page](https://huggingface.co/datasets/nyu-mll/glue).



## Exercise 1 (2 points)
**`Q. Load the CoLA dataset from the GLUE Benchmark using the datasets library`**

`Hint:`  
- Use the `load_dataset()` function.
- Pass `glue` as the dataset name and `cola` as the configuration name.
- Read [this datasets documentation](https://huggingface.co/docs/datasets/v1.11.0/loading_datasets.html).

In [4]:
### Ex-1-Task-1
import pandas as pd
from datasets import load_dataset

dataset = None
# Task: Load cola tests from GLUE Benchmark using datasets library
### BEGIN SOLUTION
dataset = load_dataset('glue', 'cola')
### END SOLUTION

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})


In [7]:
### Ex-1-Task-1

assert dataset is not None, "dataset cannot be None"
assert 'train' in dataset, "Train split is missing"
assert 'validation' in dataset, "Validation split is missing"
assert 'test' in dataset, "Test split is missing"
expected_columns = ['sentence', 'label']
assert all(col in dataset['train'].column_names for col in expected_columns), "Columns are missing"
print("All assertions passed. Dataset is valid.")

All assertions passed. Dataset is valid.


**Convert to pandas**
- You can convert the dataset to a pandas DataFrame and explore it.
- In this assignment we will only use the `train` split of the loaded dataset.

In [8]:
# only use train split from dataset for ease.
df_train = dataset['train'].to_pandas()
df_train = df_train.sample(1000)

# Report the number of sentences
print(f"Number of training sentences are: {df_train.shape[0]}")

# Display 10 random rows from the data
df_train.sample(10)

Number of training sentences are: 1000


| | sentence | label | idx |
|---|---|---|---|
| 5036 | Why she told him is unclear. | 1 | 5036 |
| 2363 | Alison poked Daisy in the ribs. | 1 | 2363 |
| 5985 | Megan was sat on by her brother. | 1 | 5985 |
| 374 | Mary will believe Susan, and you will Bob. | 1 | 374 |
| 1137 | We called my father, who had just turned 60, up. | 0 | 1137 |
| 2091 | I threw the ball to Mary, but she was looking ... | 1 | 2091 |
| 6768 | When never had Sir Thomas been so offended, Mr... | 0 | 6768 |
| 3781 | I regard Andrew as the best writer. | 1 | 3781 |
| 397 | John seems will win. | 0 | 397 |
| 2069 | To whom did you throw the ball? | 1 | 2069 |


Now let's extract the sentences and labels from the DataFrame and convert them to NumPy arrays.

We will need them later.

In [9]:
# Convert Training set to numpy ndarrays
# Extract the sentences and labels of our training set

sentences = df_train.sentence.values
labels = df_train.label.values

print(f"Shape of Sentences are: {sentences.shape}")
print(f"Shape of Labels are: {labels.shape}")

Shape of Sentences are: (1000,)
Shape of Labels are: (1000,)


## Tokenization
- Tokenization is an essential step that transforms our dataset into the format required to train the BERT model.
  - [Read the Tokenizer documentation](https://huggingface.co/docs/transformers/en/main_classes/tokenizer) if you are curious.
- **Key steps:**
  - Split the text into tokens.
    - BERT uses WordPiece tokenization.
    - [Read this HuggingFace course chapter](https://huggingface.co/learn/nlp-course/en/chapter6/6) for more insight into the WordPiece tokenization algorithm.
  - Map each token to its index in the tokenizer vocabulary.
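
To make the two key steps concrete, here is a minimal sketch of greedy longest-match-first subword splitting followed by the vocabulary lookup. It is only an illustration: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, the six-entry vocabulary is invented for this example, and the real `bert-base-uncased` tokenizer has about 30k entries and may split words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into subwords."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix of the remainder is in the vocabulary.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary (token -> index); not the real BERT vocabulary.
toy_vocab = {"the": 0, "train": 1, "station": 2, "reach": 3, "##ed": 4, ".": 5}

tokens = wordpiece_tokenize("reached", toy_vocab)   # step 1: split into tokens
ids = [toy_vocab[t] for t in tokens]                # step 2: map tokens to indices
print(tokens, ids)
```

With this toy vocabulary the word is split into a stem and a `##` continuation piece; with the real vocabulary the same word may well be a single token.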

**Note:**  We will use "uncased" version of the BERT and set do_lower_case to True.

In [10]:
from transformers import BertTokenizer

print("Loading BERT tokenizer....")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Loading BERT tokenizer....



**Tokenizer in Action (Example)**  
`Apply the tokenizer to one sentence just to see the output`

In [11]:
# Display the original sentence
print(f"Original: {sentences[0]}")

# Tokenization
print(f'Tokenized: {tokenizer.tokenize(sentences[0])}')

# Print the sentence mapped to token ids.
print(f'Token IDs: {tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0]))}')

Original: The train reached the station.
Tokenized: ['the', 'train', 'reached', 'the', 'station', '.']
Token IDs: [1996, 3345, 2584, 1996, 2276, 1012]


Here, we first converted the text to tokens and then to token IDs.

Alternatively, use the `tokenizer.encode()` function, which combines `tokenizer.tokenize()` and `tokenizer.convert_tokens_to_ids()` and, by default, also adds the `[CLS]` and `[SEP]` special tokens.

## BERT Input Formatting   
- Add special tokens to the start and end of each sentence.
  - Append `[SEP]` to the end of every sentence.
  - Prepend `[CLS]` to the beginning of every sentence; this token is used for sentence-level classification tasks.
- Pad and truncate all sentences to a single constant length.
  - The maximum sequence length is 512 tokens.
  - Padding is done with a special `[PAD]` token, which is at index 0 in the BERT vocabulary.
  - Padding and truncation are necessary so that long and short sentences all end up with the same length, which is what allows them to be stacked into one batch tensor.
- Explicitly differentiate real tokens from padding tokens with the "attention mask".
  - The attention mask is simply an array of 1s and 0s indicating which tokens are real (1) and which are padding (0).
  - This mask tells the self-attention mechanism in BERT not to incorporate the `[PAD]` tokens into its interpretation of the sentence.
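
The formatting rules above can be sketched by hand in plain Python. This is only an illustration under stated assumptions: `format_for_bert` is a hypothetical helper (not a library function), and the special-token IDs 101 and 102 are taken from the padded tensor printed later in this notebook. In practice the tokenizer does all of this for us.

```python
# IDs for [CLS], [SEP] and [PAD] as they appear in this notebook's tokenizer output.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def format_for_bert(token_ids, max_length):
    """Add [CLS]/[SEP], truncate, pad, and build the attention mask."""
    # Reserve two positions for the special tokens, truncating if needed.
    body = token_ids[: max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)          # 1 = real token
    pad = max_length - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad  # 0 = padding


# Token IDs of "The train reached the station." from the example above.
ids, mask = format_for_bert([1996, 3345, 2584, 1996, 2276, 1012], max_length=12)
print(ids)
print(mask)
```

The first eight positions hold real tokens (mask 1) and the remaining four are padding (mask 0).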

**Decide maximum sentence length for padding/truncating**

In [12]:
# Q. Decide maximum sentence length for padding/truncating to

max_len = 0

for sentence in sentences:

  # Tokenize the text and add `[CLS]` and `[SEP]` tokens
  input_ids = tokenizer.encode(sentence, add_special_tokens=True)

  max_len = max(max_len, len(input_ids))

print(f"Max sentence length: {max_len}")

Max sentence length: 37


## Tokenization (continued)  
- For more info: [HuggingFace Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer?highlight=encode_plus#transformers.PreTrainedTokenizer.encode_plus)
- Now let's actually apply the tokenization.
- Here we will use the `tokenizer.encode_plus()` function, which handles the following steps for us:
  - Split the sentence into tokens.
  - Add the special `[CLS]` and `[SEP]` tokens.
  - Map the tokens to their IDs.
  - Pad or truncate all sentences to the same length.
  - Create attention masks that explicitly differentiate real tokens from `[PAD]` tokens.

- **Note:**
  - `tokenizer.encode()` performs all of these steps except creating attention masks, so we use `tokenizer.encode_plus()`, which performs all of them.


  **`Q. Tokenize all of the sentences and map the tokens to their word IDs.`**

In [13]:
input_ids = []
attention_masks = []

for sentence in sentences:
  # tokenizer.encode_plus() tokenizes, adds special tokens, pads/truncates,
  # and also returns the attention mask (unlike tokenizer.encode())
  encoded_dict = tokenizer.encode_plus(
      sentence,
      add_special_tokens = True,
      max_length = 64,
      padding = 'max_length',  # replaces the deprecated pad_to_max_length=True
      truncation = True,       # explicitly truncate to max_length
      return_attention_mask=True,
      return_tensors='pt'
  )

  # Add the encoded sentence to the list
  input_ids.append(encoded_dict['input_ids'])

  # Add the attention mask to the list
  attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)


In [14]:
# Display sentence 0
print(f'Original: {sentences[0]}')
print(f'Tokens Ids: {input_ids[0]}')

Original: The train reached the station.
Tokens Ids: tensor([ 101, 1996, 3345, 2584, 1996, 2276, 1012,  102,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0])


## Training and Validation Splits
- Training Splits --> 90%
- Validation Splits --> 10%
- We will convert all of the Training Splits and Validation Splits to [TensorDataset](https://pytorch.org/docs/stable/data.html)

In [15]:
from torch.utils.data import TensorDataset, random_split

# combine the training inputs into a TensorDataset
dataset_tensor = TensorDataset(input_ids, attention_masks, labels)

train_size = int(0.9 * len(dataset_tensor))
val_size = len(dataset_tensor) - train_size

# Random Selection
train_dataset, val_dataset = random_split(dataset_tensor, [train_size, val_size])

print(f"Training samples: {train_size}")
print(f"Validation samples: {val_size}")

Training samples: 900
Validation samples: 100


**Create DataLoader**  
- `DataLoader` is a crucial component that provides an efficient way to iterate over a dataset.
- It combines a dataset and a sampler and provides an iterable over the given dataset.
- **`Key Functionalities:`**
  - Batching the data
  - Shuffling the data
  - Loading the data
- In our scenario, we will create train_dataloader with `Random sampling` and validation_dataloader with `Sequential sampling`.
- `More info at:`
  - [torch.utils.data](https://pytorch.org/docs/stable/data.html)

**Q. Set batch_size to the one that is used in BERT paper for GLUE Benchmark.**  (3 Points)
  - Read [this BERT Paper](https://arxiv.org/pdf/1810.04805) and figure out batch_size for GLUE task.

In [16]:
### Ex-2-Task-1
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = None
# Task: Set batch_size to the one that is used in BERT paper for GLUE Benchmark
### BEGIN SOLUTION
batch_size = 32
### END SOLUTION

In [18]:
### Ex-2-Task-1
assert batch_size is not None, "batch_size cannot be None"

In [None]:
# Random batch selection is a must for training
train_dataloader = DataLoader(
    train_dataset,
    sampler = RandomSampler(train_dataset), # Random Batch Selection
    batch_size = batch_size
)

# We can use Sequential batch selection for Validation
valid_dataloader = DataLoader(
    val_dataset,
    sampler = SequentialSampler(val_dataset), # Sequential Batch Selection
    batch_size = batch_size
)

## Train Classification Model
- We need to modify the pre-trained BERT model to produce outputs for classification, i.e. add a classification head on top of the `[CLS]` token embedding.
- Conveniently, HuggingFace provides a PyTorch implementation that wraps this in a high-level abstraction.
- [BertForSequenceClassification --> Documentation](https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#bertforsequenceclassification)
  - Normal BERT model + a single classification layer  
  - The entire pre-trained BERT model and the additional untrained classification layer are trained together on our specific task.

- List of classes provided for fine-tuning:
  - BertModel
  - BertForPreTraining
  - BertForMaskedLM
  - BertForNextSentencePrediction
  - [BertForSequenceClassification --> Documentation](https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#bertforsequenceclassification) --> `We use this`
  - BertForTokenClassification
  - BertForQuestionAnswering

- The documentation for this list of classes can be found [here](https://huggingface.co/transformers/v2.2.0/model_doc/bert.html)

Now, let's load the pre-trained BERT model `bert-base-uncased`. "Uncased" means this version works with lowercase text only, and "base" means it is the smaller version of the model.

**Note:**  
Check list of pretrained models [here](https://huggingface.co/transformers/v3.3.1/pretrained_models.html)

In [19]:
# Only the model class is needed here; the optimizer is imported from torch.optim below
from transformers import BertForSequenceClassification


model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels = 2, # number of output labels -- 2 for binary classification
    output_attentions = False, # whether model returns attention_weights
    output_hidden_states = False # whether model returns all hidden_states
)

# Run model on the GPU
if device.type == 'cuda':
  model.cuda()

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Browse model's parameters**

In [20]:
# Get all of the model's parameters
params = list(model.named_parameters())

print(f'The BERT model has {len(params)} different named parameters')

print('\n=====Embedding Layer======\n')
for p in params[0:5]:
  print(f"{p[0]} {str(tuple(p[1].size()))}")

print('\n=====First Transformer======\n')
for p in params[5:21]:
  print(f"{p[0]} {str(tuple(p[1].size()))}")


print('\n=====Output Layer======\n')
for p in params[-4:]:
  print(f"{p[0]} {str(tuple(p[1].size()))}")

The BERT model has 201 different named parameters


bert.embeddings.word_embeddings.weight (30522, 768)
bert.embeddings.position_embeddings.weight (512, 768)
bert.embeddings.token_type_embeddings.weight (2, 768)
bert.embeddings.LayerNorm.weight (768,)
bert.embeddings.LayerNorm.bias (768,)


bert.encoder.layer.0.attention.self.query.weight (768, 768)
bert.encoder.layer.0.attention.self.query.bias (768,)
bert.encoder.layer.0.attention.self.key.weight (768, 768)
bert.encoder.layer.0.attention.self.key.bias (768,)
bert.encoder.layer.0.attention.self.value.weight (768, 768)
bert.encoder.layer.0.attention.self.value.bias (768,)
bert.encoder.layer.0.attention.output.dense.weight (768, 768)
bert.encoder.layer.0.attention.output.dense.bias (768,)
bert.encoder.layer.0.attention.output.LayerNorm.weight (768,)
bert.encoder.layer.0.attention.output.LayerNorm.bias (768,)
bert.encoder.layer.0.intermediate.dense.weight (3072, 768)
bert.encoder.layer.0.intermediate.dense.bias (3072,)
bert.encoder.layer

## Optimizer & Learning Rate Scheduler
- `Batch Size:` 32
- `Learning rate:` 2e-5
- `Epochs:` 4
- `epsilon:` 1e-8

In [21]:
from torch.optim import Adam


optimizer = Adam(
    model.parameters(),
    lr=2e-5,
    eps=1e-8
)

# Alternative: AdamW (decoupled weight decay) from torch.optim
# from torch.optim import AdamW
# optimizer = AdamW(
#     model.parameters(),
#     lr=2e-5,
#     eps=1e-8
# )

## Learning Rate Scheduler (2 points)

**`Q Add total_steps`**

`Hint:`
- total_steps = [number of batches] * [number of epochs]

In [31]:
### Ex-3-Task-1

from transformers import get_linear_schedule_with_warmup

epochs = 4

# Hint:
# total_steps: [number of batches] x [number of epochs]
total_steps = None
# TASK: Add total steps i.e. [number of batches] x [number of epochs]
### BEGIN SOLUTION
total_steps = len(train_dataloader) * epochs
### END SOLUTION

In [32]:
### Ex-3-Task-1
assert total_steps is not None, "total_steps cannot be None"
print("All assertions passed.")

All assertions passed.


In [None]:
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)
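
To get a feel for what this scheduler does, the sketch below reproduces, in plain Python, the multiplier that a linear warmup-then-decay schedule applies to the base learning rate. It is an illustration rather than the library code: `linear_schedule_factor` is a hypothetical helper, and the numbers assume this notebook's setup (900 training samples, batch size 32, 4 epochs, no warmup, and the last partial batch kept).

```python
import math


def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


num_batches = math.ceil(900 / 32)   # batches per epoch, last partial batch included
total_steps = num_batches * 4       # [number of batches] x [number of epochs]
base_lr = 2e-5

lrs = [base_lr * linear_schedule_factor(s, 0, total_steps)
       for s in range(total_steps + 1)]
print(num_batches, total_steps)
print(lrs[0], lrs[total_steps // 2], lrs[-1])
```

With zero warmup steps the learning rate starts at the base value, reaches half of it midway through training, and decays to zero at the final step.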

## Training Loop (3 points)
- **Training:**
  - Unpack our data inputs and labels
  - [Optional] Load data onto the GPU for acceleration
  - Clear out the gradients calculated in the previous pass.
    - In PyTorch, gradients accumulate by default unless you explicitly clear them.
  - Forward pass (feed input data through the network)
  - Backward pass (backpropagation)
  - Tell the network to update parameters with optimizer.step()
  - Track variables for monitoring progress

- **Evaluation:**
  - Unpack our data inputs and labels
  - [Optional] Load data onto the GPU for acceleration
  - Forward pass (feed input data through the network)
  - Compute loss on our validation data and track variables for monitoring progress


_Don't know PyTorch? Don't worry, just browse [this simple tutorial](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py)._

In [24]:
import time
import datetime
import numpy as np

def get_accuracy(preds, labels):
  """
  Function to calculate the accuracy of our predictions vs labels
  """
  pred_flat = np.argmax(preds, axis=1).flatten()
  labels_flat = labels.flatten()
  return np.sum(pred_flat == labels_flat) / len(labels_flat)

def format_time(elapsed):
  """
  Takes time in seconds and returns a string hh:mm:ss
  """

  elapsed_rounded = int(round(elapsed))
  # Format as hh:mm:ss
  return str(datetime.timedelta(seconds=elapsed_rounded))
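
As a quick usage check of the time helper above, the snippet below repeats `format_time` in compact form (so it runs on its own) and formats two example durations. The input values are arbitrary examples, not measurements.

```python
import datetime


def format_time(elapsed):
    """Takes time in seconds and returns a string h:mm:ss."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))


print(format_time(483.2))    # 0:08:03
print(format_time(3725.6))   # 1:02:06
```

Note that the hour field is not zero-padded, so the output has the form `h:mm:ss`.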

In [25]:
import random
import numpy as np
import torch

def set_seed(seed_val=42):
    random.seed(seed_val)
    np.random.seed(seed_val)
    torch.manual_seed(seed_val)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed_val)

**Training Function**

**`Q. Compute Average Loss`** (1 point)

`Hint:`  
_Average Loss = total_loss/number_of_batches_

In [26]:
### Ex-4-Task-1
def train(model, train_dataloader, optimizer, scheduler, device):
    print('Training...')

    t0 = time.time()
    total_train_loss = 0
    model.train()

    for step, batch in enumerate(train_dataloader):
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        model.zero_grad()
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)

        loss = outputs.loss
        total_train_loss += loss.item()

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    avg_train_loss = None
    # hint: Average Loss = total_loss/number_of_batches
    ### BEGIN SOLUTION
    avg_train_loss = total_train_loss / len(train_dataloader)
    ### END SOLUTION
    
    training_time = format_time(time.time() - t0)

    return avg_train_loss, training_time

In [33]:
### Intentionally left blank


**Validation Function**

**`Q. Compute Average Validation Accuracy and Average Validation Loss`** (1 point)

`Hint:`  
1. _Average Validation Accuracy = total_validation_accuracy/number_of_validation_batches_
2. _Average Validation Loss = total_validation_loss/number_of_validation_batches_
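
One detail worth keeping in mind about the batch-averaged formula in the hint: when the last batch is smaller than the others, averaging per-batch accuracies is not the same as computing accuracy over all samples. The sketch below illustrates this with plain Python. The helper names are hypothetical and the per-batch correct counts are made up for illustration; they are not results from this notebook.

```python
def batch_averaged_accuracy(correct_per_batch, sizes):
    """Mean of per-batch accuracies (the formula used in the hint)."""
    per_batch = [c / n for c, n in zip(correct_per_batch, sizes)]
    return sum(per_batch) / len(per_batch)


def sample_weighted_accuracy(correct_per_batch, sizes):
    """Accuracy over all samples, regardless of how they were batched."""
    return sum(correct_per_batch) / sum(sizes)


sizes = [32, 32, 32, 4]      # 100 validation samples with batch_size = 32
correct = [27, 27, 27, 4]    # hypothetical number of correct predictions per batch

print(batch_averaged_accuracy(correct, sizes))
print(sample_weighted_accuracy(correct, sizes))
```

The small final batch gets the same weight as a full batch in the first number, which is why the two values can differ slightly. For this assignment the batch-averaged version from the hint is what the exercise asks for.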

In [27]:
### Ex-4-Task-2
def validate(model, valid_dataloader, device):
    print("Running Validation...")

    t0 = time.time()
    model.eval()

    total_eval_accuracy = 0
    total_eval_loss = 0

    for batch in valid_dataloader:
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        with torch.no_grad():
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)

        loss, logits = outputs.loss, outputs.logits
        total_eval_loss += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        total_eval_accuracy += get_accuracy(logits, label_ids)
    
    avg_val_accuracy = None
    avg_val_loss = None
    
    ### BEGIN SOLUTION
    avg_val_accuracy = total_eval_accuracy / len(valid_dataloader)
    avg_val_loss = total_eval_loss / len(valid_dataloader)
    ### END SOLUTION

    validation_time = format_time(time.time() - t0)

    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    return avg_val_accuracy, avg_val_loss, validation_time

In [None]:
### Intentionally left blank

**Training Loop**

In [28]:
def training_loop(model, train_dataloader, valid_dataloader, optimizer, scheduler, device, epochs):
    set_seed(42)

    training_stats = []
    total_t0 = time.time()

    for epoch_i in range(epochs):
        print("\n======== Epoch {:} / {:} ========".format(epoch_i + 1, epochs))

        avg_train_loss, training_time = train(model, train_dataloader, optimizer, scheduler, device)
        avg_val_accuracy, avg_val_loss, validation_time = validate(model, valid_dataloader, device)

        training_stats.append({
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        })

    print("\nTraining complete!")
    print("Total training took {:} (h:mm:ss)".format(format_time(time.time() - total_t0)))

    return training_stats

# Example usage
# model, train_dataloader, valid_dataloader, optimizer, scheduler, device, and epochs need to be defined.
training_stats = training_loop(model, train_dataloader, valid_dataloader, optimizer, scheduler, device, epochs)


Training...
Running Validation...
  Accuracy: 0.84
  Validation Loss: 0.44
  Validation took: 0:00:05

Training...
Running Validation...
  Accuracy: 0.84
  Validation Loss: 0.43
  Validation took: 0:00:04

Training...
Running Validation...
  Accuracy: 0.84
  Validation Loss: 0.41
  Validation took: 0:00:04

Training...
Running Validation...
  Accuracy: 0.84
  Validation Loss: 0.40
  Validation took: 0:00:05

Training complete!
Total training took 0:08:03 (h:mm:ss)


In [29]:
assert training_stats[0]['Training Loss'] is not None, 'Average Training Loss cannot be None'
assert training_stats[0]['Valid. Accur.'] is not None, 'Average Validation Accuracy cannot be None'
assert training_stats[0]['Valid. Loss'] is not None, 'Average Validation Loss cannot be None'
assert len(training_stats) == epochs, "training_stats list does not contain metrics for all epochs"

print("All assertions passed.")

All assertions passed.


**Note:**  
_This assignment is not focused on Hyperparameter Tuning and Improving Metrics. You can experiment on your own to improve accuracy and minimize validation loss._

## Summary of Training Process

In [30]:
# Convert training stats to pandas DataFrame.

df_stats = pd.DataFrame(training_stats)
df_stats = df_stats.set_index('epoch')
df_stats

| epoch | Training Loss | Valid. Loss | Valid. Accur. | Training Time | Validation Time |
|---|---|---|---|---|---|
| 1 | 0.61859 | 0.437216 | 0.84375 | 0:01:57 | 0:00:05 |
| 2 | 0.572833 | 0.433367 | 0.84375 | 0:01:54 | 0:00:04 |
| 3 | 0.536366 | 0.407509 | 0.84375 | 0:01:54 | 0:00:04 |
| 4 | 0.477001 | 0.398204 | 0.84375 | 0:01:59 | 0:00:05 |


## Congratulations!

You have successfully completed the assignment. Well done!
