# Problem 4

In this problem, we first finetune a BERT model without pretrained weights on the RTE dataset, and then finetune a pretrained BERT model on the same dataset.

**IMPORTANT NOTES**:
- Please make sure that you have already read the part of hw5 pdf that corresponds to this problem. This is very important.
- At the end of hw5, you will need to submit a zip folder containing three things. The instructions are also included in the first paragraph of the hw5 pdf.
  - (1) The writeup pdf containing your solutions to Problems 1, 2, 3, 4, 5. Yes, there are questions you need to answer in your writeup (see hw5 pdf).
  - (2) The downloaded colab corresponding to Problem 4.
  - (3) The downloaded colab corresponding to Problem 5.

Some imports and data downloading

In [None]:
!git clone https://github.com/huggingface/transformers
!python transformers/utils/download_glue_data.py --tasks RTE
!pip install transformers

Cloning into 'transformers'...
remote: Enumerating objects: 53, done.
remote: Counting objects: 100% (53/53), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 54608 (delta 15), reused 26 (delta 0), pack-reused 54555
Receiving objects: 100% (54608/54608), 40.74 MiB | 10.65 MiB/s, done.
Resolving deltas: 100% (38155/38155), done.
Downloading and extracting RTE...
	Completed!
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/99/84/7bc03215279f603125d844bf81c3fb3f2d50fe8e511546eb4897e4be2067/transformers-4.0.0-py3-none-any.whl (1.4MB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Collecting sacremoses
  Downloading https://files.pythonh

In [None]:
!ls glue_data/RTE/

dev.tsv  test.tsv  train.tsv


In [None]:
import dataclasses
import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Dict, Optional

import numpy as np

import torch
import torch.nn as nn 
from transformers import AutoTokenizer, EvalPrediction, GlueDataset, GlueDataTrainingArguments, AutoModel, BertPreTrainedModel, AutoConfig, BertModel
from transformers import (
    Trainer,
    TrainingArguments,
    glue_compute_metrics,
    glue_tasks_num_labels,
    set_seed,
)

# device = 'cuda' if torch.cuda.is_available() else 'cpu'
# print(device)

In [None]:
model_name = "bert-base-uncased"

data_args = GlueDataTrainingArguments(task_name="rte", data_dir="./glue_data/RTE")
training_args = TrainingArguments(
    logging_steps=50, 
    per_device_train_batch_size=32, 
    per_device_eval_batch_size=64, 
    save_steps=1000,
    #evaluate_during_training=True,
    evaluation_strategy='steps',
    output_dir="./models/rte",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    do_predict=True,
    learning_rate=0.00001,
    num_train_epochs=15,
)
#set_seed(42)
#set_seed(66)
set_seed(1024)
num_labels = glue_tasks_num_labels[data_args.task_name]
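
As a rough sanity check on the schedule configured above, the number of optimizer steps follows from the batch size and the epoch count. The sketch below assumes that the RTE training split has 2,490 examples (the standard GLUE size, which this notebook does not state), that training runs on a single device without gradient accumulation, and that evaluation happens every `logging_steps` steps.

```python
import math

# Assumption: size of the GLUE RTE training split; not stated in the notebook.
train_examples = 2490
per_device_train_batch_size = 32   # from training_args
num_train_epochs = 15              # from training_args
logging_steps = 50                 # from training_args

# The last, smaller batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
# Assumption: one evaluation every `logging_steps` optimizer steps.
num_evals = total_steps // logging_steps

print(steps_per_epoch, total_steps, num_evals)
```

Under these assumptions, step 500 falls inside the seventh epoch, so a log that stops at step 500 covers less than half of a 15-epoch run.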

### From non-pretrained BERT

TODO:
- Complete the following three lines so that ```tokenizer```, ```config```, and ```bert_model``` correspond to the ```model_name``` we defined in the cells above.
- IMPORTANT: make sure that the BERT model does not load pretrained weights!
- Hint: see https://huggingface.co/transformers/model_doc/auto.html and other relevant Hugging Face documentation. Consider using the tools we imported in the first cell. Another hint: it's okay to use ```from_pretrained``` in the first two lines, depending on which class you use.

In [None]:
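tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
# from_config builds the architecture with randomly initialized weights;
# no pretrained checkpoint is loaded
bert_model = AutoModel.from_config(config)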

TODO:
- Complete the forward function of the following class so that the model can be finetuned on the RTE dataset.
- For more instructions, please refer to the hw5 pdf.

In [None]:
class SequenceClassificationBERT(nn.Module):
      
    def __init__(self, config, bert_model):
        super().__init__()
        self.config = config
        self.num_labels = config.num_labels
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.bert = bert_model

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # make sure that all the arguments of the forward() function are used
        # somewhere in the code

        ##### 

        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            position_ids=position_ids,
                            head_mask=head_mask,
                            inputs_embeds=inputs_embeds,
                            output_attentions=output_attentions,
                            output_hidden_states=output_hidden_states,
                            return_dict=return_dict)
        
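        pooler_output = outputs[1]  # same tensor as outputs['pooler_output']
        pooler_output = self.dropout(pooler_output)
        logits = self.classifier(pooler_output)

        # labels is None at prediction time, so compute the loss only when labels are given
        loss = None
        if labels is not None:
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, self.num_labels), labels.view(-1)
            )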

        #####

        # do not change the lines below, so make sure your code works for the
        # lines below
        output = (logits,) + outputs[2:]
        return ((loss,) + output) if loss is not None else output
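
To make the loss line in the class above concrete, here is a minimal pure-Python sketch of what cross-entropy over logits computes: a log-softmax followed by the mean negative log-likelihood of the true labels. The helper `cross_entropy` below is a hypothetical stand-in written for illustration; it is not the PyTorch function and handles only plain lists.

```python
import math

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true labels under softmax(logits)."""
    losses = []
    for row, label in zip(logits, labels):
        # Subtract the max before exponentiating for numerical stability.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)

# An uninformative two-class model (equal logits) scores ln(2) per example.
chance_loss = cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
print(chance_loss, math.log(2))
```

The value ln(2), roughly 0.693, is a useful reference point: a validation loss close to it suggests predictions that are near chance for a two-class task such as RTE.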


In [None]:
model = SequenceClassificationBERT(config=config, bert_model=bert_model)

TODO:
- Print out the number of trainable parameters in the BERT model. This can be done in one line. Please feel free to look up resources online. We also briefly touched upon relevant material in Lab 3, but here, make sure you count only the trainable parameters.

In [None]:
# count only parameters that receive gradients
n_parameters = sum(param.numel() for param in bert_model.parameters() if param.requires_grad)
print(n_parameters)

109482240
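
The printed count can be cross-checked by hand. The sketch below rebuilds it from layer sizes, assuming the standard `bert-base-uncased` configuration (vocabulary 30522, 512 positions, 2 token types, hidden size 768, 12 layers, intermediate size 3072). These values are assumptions here, since the config is not printed at this point in the notebook.

```python
# Assumptions: standard bert-base-uncased sizes (not printed at this point).
vocab, positions, token_types, hidden = 30522, 512, 2, 768
layers, intermediate = 12, 3072

def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # one scale vector and one shift vector

# Three embedding tables followed by a LayerNorm.
embeddings = (vocab + positions + token_types) * hidden + layer_norm
# Query, key, value, and output projections, then a LayerNorm.
attention = 4 * linear(hidden, hidden) + layer_norm
# Two dense layers (up and down projection), then a LayerNorm.
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
# Dense layer applied to the first token's hidden state.
pooler = linear(hidden, hidden)

total = embeddings + layers * (attention + feed_forward) + pooler
print(total)
```

Most of the parameters sit in the 12 encoder layers, with the word-embedding table as the next largest block.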


In [None]:
train_dataset = GlueDataset(data_args, tokenizer=tokenizer)
eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev")



Now we train the model. Please make sure to read the pdf instructions. When you report results in the pdf writeup, report the mean and standard deviation of at least 3 runs with different random seeds. Consider calling ```set_seed(some number)``` before the cell below, before each run.

Make sure that in each run you pick the best validation accuracy. We're using ```Trainer``` instead of the manual training loop that we have seen many times earlier in the semester. The trainer needs ```num_train_epochs``` (in ```training_args```), which we defined above. Please feel free to modify ```training_args``` such that:
- The learning rate is small (around 0.00001).
- Your model no longer shows large improvements in validation accuracy by the end of training. The expected behavior is that the final validation accuracy won't be much better than chance.

We provide part of an example log below, but you may be able to get better accuracy. Again, make sure this run corresponds to a non-pretrained BERT.

##### seed=42

In [None]:
# seed = 42
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()
trainer.evaluate()

Epoch,Training Loss,Validation Loss,Acc
1,0.72653,0.697098,0.472924
2,0.709291,0.695456,0.487365
3,0.696165,0.71155,0.516245
4,0.701618,0.696434,0.519856
5,0.692432,0.702387,0.483755
6,0.679537,0.708883,0.527076
7,0.651396,0.745231,0.541516
8,0.583672,0.962056,0.483755
9,0.483394,1.050688,0.530686
10,0.461441,1.023973,0.527076




{'epoch': 15.0, 'eval_acc': 0.51985559566787, 'eval_loss': 1.2215412855148315}
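
Because the validation set is small, each logged accuracy corresponds to a whole number of correct predictions. The sketch below converts two accuracy values that recur in the logs of this notebook into counts, assuming the RTE dev split has 277 examples (the standard GLUE size, not stated here). The helper `correct_count` is hypothetical and exists only for this illustration.

```python
# Assumption: the RTE dev split has 277 examples (standard GLUE size).
dev_examples = 277

def correct_count(accuracy, n=dev_examples):
    """Number of correct predictions implied by a logged accuracy."""
    return round(accuracy * n)

# Two accuracy values that recur in the logs of this notebook.
low, high = 0.472924, 0.527076
print(correct_count(low), correct_count(high))
```

The two counts add up to the full dev set, which is what one would expect if the model predicted a single class for every example at those checkpoints. This is an inference from the numbers rather than something the logs state, but it gives a sense of how close to chance the non-pretrained runs stay.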

In [None]:
# seed = 42
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()
trainer.evaluate()

Step,Training Loss,Validation Loss,Acc
50,0.72653,0.691287,0.523466
100,0.700166,0.689972,0.527076
150,0.709291,0.693458,0.530686
200,0.692685,0.69418,0.534296
250,0.698886,0.698426,0.472924
300,0.689901,0.696577,0.501805
350,0.681027,0.704986,0.516245
400,0.678059,0.710867,0.519856
450,0.663324,0.746819,0.509025
500,0.638553,0.920932,0.480144




{'epoch': 15.0, 'eval_acc': 0.516245487364621, 'eval_loss': 1.2973672151565552}
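
Picking the best validation accuracy of a run can be done by parsing the logged rows. The sketch below uses the partial seed=42 step log shown above; since that log stops at step 500, the result is only the best of the rows shown, not necessarily the best of the full run.

```python
import csv
import io

# Rows copied from the partial seed=42 step log above.
log_text = """Step,Training Loss,Validation Loss,Acc
50,0.72653,0.691287,0.523466
100,0.700166,0.689972,0.527076
150,0.709291,0.693458,0.530686
200,0.692685,0.69418,0.534296
250,0.698886,0.698426,0.472924
300,0.689901,0.696577,0.501805
350,0.681027,0.704986,0.516245
400,0.678059,0.710867,0.519856
450,0.663324,0.746819,0.509025
500,0.638553,0.920932,0.480144
"""

rows = list(csv.DictReader(io.StringIO(log_text)))
# Select the row with the highest validation accuracy.
best = max(rows, key=lambda r: float(r["Acc"]))
best_step, best_acc = int(best["Step"]), float(best["Acc"])
print(best_step, best_acc)
```

For the writeup, the same selection would be applied to the complete log of each run.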

##### seed=66

In [None]:
# seed = 66
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()
trainer.evaluate()

Epoch,Training Loss,Validation Loss,Acc
1,0.727851,0.703836,0.472924
2,0.708654,0.696329,0.472924
3,0.697048,0.708082,0.523466
4,0.705909,0.696599,0.527076
5,0.698485,0.704714,0.462094
6,0.69031,0.705285,0.480144
7,0.669762,0.893986,0.516245
8,0.625031,0.885702,0.472924
9,0.509842,0.961451,0.541516
10,0.537305,1.010956,0.480144




{'epoch': 15.0,
 'eval_acc': 0.5306859205776173,
 'eval_loss': 1.2328784465789795}

In [None]:
# seed = 66
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()
trainer.evaluate()

Step,Training Loss,Validation Loss,Acc
50,0.727851,0.69094,0.519856
100,0.699387,0.6914,0.519856
150,0.708654,0.696796,0.480144
200,0.696052,0.692157,0.541516
250,0.705842,0.700376,0.472924
300,0.693251,0.69642,0.501805
350,0.68366,0.696539,0.545126
400,0.701664,0.719854,0.527076
450,0.694876,0.739392,0.472924
500,0.665021,0.725499,0.487365




{'epoch': 15.0, 'eval_acc': 0.51985559566787, 'eval_loss': 1.2328428030014038}

##### seed=1024

In [None]:
# seed = 1024
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()
trainer.evaluate()

Epoch,Training Loss,Validation Loss,Acc
1,0.736307,0.697712,0.472924
2,0.707634,0.694633,0.509025
3,0.694101,0.709252,0.527076
4,0.705873,0.693536,0.519856
5,0.691027,0.70212,0.480144
6,0.682469,0.70183,0.519856
7,0.650897,0.761242,0.537906
8,0.603193,0.863979,0.494585
9,0.467567,0.974652,0.505415
10,0.453784,1.282012,0.490975




{'epoch': 15.0,
 'eval_acc': 0.4657039711191336,
 'eval_loss': 1.2433407306671143}

In [None]:
# seed = 1024
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()
trainer.evaluate()

Step,Training Loss,Validation Loss,Acc
50,0.736307,0.690416,0.516245
100,0.703807,0.690547,0.519856
150,0.707634,0.691204,0.527076
200,0.692734,0.696762,0.487365
250,0.706489,0.695748,0.472924
300,0.690385,0.695951,0.498195
350,0.684503,0.69878,0.519856
400,0.695887,0.712947,0.519856
450,0.676639,0.708457,0.552347
500,0.656378,0.73868,0.516245




{'epoch': 15.0,
 'eval_acc': 0.49458483754512633,
 'eval_loss': 1.1478031873703003}

### From pretrained BERT

Now, let's do the above experiments using a pretrained BERT!

TODO:
- Complete the following three lines so that ```tokenizer```, ```config```, and ```bert_model``` correspond to the ```model_name``` we defined in the cells above.
- IMPORTANT (different from the TODO a few cells above): make sure that the BERT model below loads the pretrained weights!

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
# from_pretrained loads the pretrained checkpoint weights
bert_model = AutoModel.from_pretrained(model_name)

In [None]:
bert_model 

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [None]:
train_dataset = GlueDataset(data_args, tokenizer=tokenizer)
eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev")



In [None]:
model = SequenceClassificationBERT(config=config, bert_model=bert_model)

TODO:
- Similarly, we train the model. For more instructions, please see the TODO cells above (i.e., the TODO for training the model without pretrained weights), as well as the hw5 pdf.

##### seed=42

In [None]:
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()

Step,Training Loss,Validation Loss,Acc
50,0.702725,0.667568,0.624549
100,0.665358,0.657738,0.602888
150,0.591188,0.651313,0.606498
200,0.500107,0.674072,0.624549
250,0.454714,0.71649,0.613718
300,0.365586,0.7635,0.638989
350,0.306803,0.788521,0.635379
400,0.264215,0.847433,0.635379
450,0.205447,0.897948,0.631769
500,0.168385,0.959609,0.617329




{'epoch': 15.0,
 'eval_acc': 0.6498194945848376,
 'eval_loss': 1.4110996723175049}

##### seed=66

In [None]:
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()

Step,Training Loss,Validation Loss,Acc
50,0.712786,0.689428,0.545126
100,0.68564,0.682427,0.563177
150,0.639523,0.6384,0.642599
200,0.537166,0.664761,0.646209
250,0.498509,0.681198,0.635379
300,0.39903,0.759685,0.631769
350,0.323567,0.760996,0.66065
400,0.286551,0.828829,0.65704
450,0.208386,0.924486,0.620939
500,0.161066,1.044224,0.602888




{'epoch': 15.0, 'eval_acc': 0.6353790613718412, 'eval_loss': 1.583669662475586}

##### seed=1024

In [None]:
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()

Step,Training Loss,Validation Loss,Acc
50,0.704031,0.693695,0.476534
100,0.693677,0.676955,0.559567
150,0.659933,0.65558,0.599278
200,0.565732,0.699989,0.610108
250,0.526754,0.669775,0.617329
300,0.420376,0.738756,0.613718
350,0.327282,0.842197,0.617329
400,0.293008,0.870432,0.638989
450,0.22351,0.965117,0.620939
500,0.1698,1.079856,0.620939




{'epoch': 15.0,
 'eval_acc': 0.6245487364620939,
 'eval_loss': 1.7666387557983398}
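
The mean and standard deviation across seeds can be computed as in the sketch below. It uses the final `eval_acc` values printed in this notebook (for the non-pretrained model, the values from the step-logged run of each seed). These are final rather than best accuracies, so the numbers for the writeup may differ.

```python
import statistics

# Final eval_acc values printed after each run in this notebook (seeds 42, 66, 1024).
final_acc = {
    "pretrained": [0.6498194945848376, 0.6353790613718412, 0.6245487364620939],
    "non-pretrained": [0.516245487364621, 0.51985559566787, 0.49458483754512633],
}

summary = {}
for name, accs in final_acc.items():
    # statistics.stdev is the sample standard deviation (n - 1 in the denominator).
    summary[name] = (statistics.mean(accs), statistics.stdev(accs))
    print(f"{name}: mean={summary[name][0]:.4f}, std={summary[name][1]:.4f}")
```

With only three runs the standard deviation is a rough estimate, but it still shows whether the gap between the two settings is larger than the seed-to-seed variation.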