
# CS329S Problem Set 2

# Overview

In the last assignment, we tested some public ML systems. In this assignment, we’ll familiarize ourselves with how to train and evaluate models.

We'll be using two commercial APIs: [HuggingFace API](https://huggingface.co/) and [OpenAI API](https://beta.openai.com/).

## OpenAI API

You should have received an invite to access the OpenAI API by now. Please let us know if you haven’t.

## Optional access to HuggingFace’s paid inference API

The class has a startup plan on HuggingFace. To access the class’s benefits -- e.g. running inference using their GPUs instead of using your own GPU credits -- please create an account and ask to join [Stanford CS 329S organization](https://huggingface.co/stanford-cs329s). We’ll add you in. Don’t wait until the last minute to join because we might not be able to accept your requests in time!

We emphasize that joining the organization is optional: we have found the models to work fine running on the GPUs associated with Google Colab, so if you would rather run your models on Colab’s GPUs without the CS329S HuggingFace organization, it should not limit your ability to complete the assignment.

# Submission
1. Click ***File > Save a Copy in Drive*** to save your own copy of the document to work in and submit.
2. Please answer all the problems from this problem set in your Colab notebook.
3. Submit **both** the `ps2.ipynb` notebook file and `ps2.pdf` to Gradescope.
  - To download the `ps2.ipynb`: File > Download .ipynb
  - To download the `ps2.pdf`: File > Print


**This assignment is meaty. Start early!**


**Tip**: You can use the Colab GPU for this assignment by selecting:

> **Runtime**   →   **Change runtime type**   →   **Hardware Accelerator: GPU**


# Part I. Understanding pretrained models

Some HuggingFace modules that you might find useful for this assignment:
- [`load_dataset`](https://huggingface.co/docs/datasets/loading_datasets.html)
- [`pipelines`](https://huggingface.co/transformers/main_classes/pipelines.html)
- [`Trainer`](https://huggingface.co/transformers/main_classes/trainer.html)

Run the following cells to set up the necessary prerequisites. Feel free to modify the cells to import any module you need.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd '/content/drive/MyDrive/cs329s_ps2'
!ls

Mounted at /content/drive
/content/drive/MyDrive/cs329s_ps2
aclImdb  aclImdb_v1.tar.gz  logs  my_model.h5  runs


In [None]:
!nvidia-smi

Sat Feb 13 06:23:44 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!pip install -q transformers
!pip install -q datasets
!pip install nltk



In [None]:
import numpy as np
import random

from datasets import load_dataset
from datasets import load_metric
from transformers import pipeline
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

from sklearn.calibration import calibration_curve
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import matplotlib.pyplot as plt

import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Problem 1: Setup

We’ll examine pretrained language models’ performance for the task of sentiment analysis with two classes. For Problem 1 and Problem 2 of this assignment, you’ll need to pick the following:

- **Fine-tuned model**: choose one pretrained language model that has been fine-tuned on the IMDB dataset from the [HuggingFace Model Hub](https://huggingface.co/models). You can see the list of available models trained on IMDB [here](https://huggingface.co/models?search=imdb). Examples:
    - `roberta-base-imdb` is `roberta-base` that has been fine-tuned on the IMDB dataset.
    - `distilbert-base-uncased-imdb` is `distilbert-base-uncased` that has been fine-tuned on the IMDB dataset.

- **Out-of-distribution dataset**: pick a **binary label** sentiment analysis dataset that IS NOT IMDB from HuggingFace Datasets interface (some of them have more than two labels, so make sure you pick one with binary labels!). You can find the list of sentiment analysis tasks [here](https://huggingface.co/datasets?filter=task_ids:sentiment-classification).


### 1.1 Understanding your fine-tuned model [5 points]



#### a. (1 point) What fine-tuned model did you choose? What’s its pretrained LM counterpart?


**Answer**:

distilbert-base-uncased-imdb

The underlying LM is DistilBERT (https://arxiv.org/abs/1910.01108) from HuggingFace, a smaller version of BERT that its authors trained using knowledge distillation.


#### b. (1 point) Is your model cased or uncased? Why did you choose that?

**Answer**:

Uncased. The sentiment of an IMDB review should not depend on letter case, so the labels should be case-invariant.

#### c. (3 points) What is the number of parameters in your model? You can find this number either from the original paper or write code to count its number of parameters.

In [None]:
################## (OPTIONAL) YOUR CODE HERE ##################
# Find the number of parameters in your model
from transformers import AutoTokenizer, AutoModelForSequenceClassification


pretrained = AutoModelForSequenceClassification.from_pretrained("textattack/distilbert-base-uncased-imdb")
pretrained.num_parameters()
###############################################################


66955010

In [None]:
# pretrained.config

**Answer**:

As validated above, the DistilBERT model has about **66.96 M params.**
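
The total can be cross-checked by hand for the classification head. The following is a minimal sketch, assuming DistilBERT's hidden size of 768 and two output classes; `dense_params` is a hypothetical helper, not part of any library.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Parameters in a fully connected layer: a weight matrix plus a bias vector."""
    return n_in * n_out + n_out

HIDDEN = 768      # assumed hidden size of distilbert-base-uncased
NUM_LABELS = 2    # binary sentiment

pre_classifier = dense_params(HIDDEN, HIDDEN)   # 768 * 768 + 768
classifier = dense_params(HIDDEN, NUM_LABELS)   # 768 * 2 + 2
head = pre_classifier + classifier

total = 66_955_010          # value reported by num_parameters() above
encoder = total - head      # what remains belongs to the DistilBERT encoder

print(pre_classifier, classifier, encoder)
```

Under these assumptions the head accounts for 592,130 parameters, which leaves 66,362,880 for the encoder.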

### 1.2 Understanding the [IMDB dataset](https://ai.stanford.edu/~amaas/data/sentiment/) [3 points]

**Tip**: The IMDB dataset can also be explored in the Hugging Face model hub ([IMDb](https://huggingface.co/datasets/imdb)) 

**Example: Loading Dataset**

The following cell shows how to use the HuggingFace [`Datasets`](https://github.com/huggingface/datasets) library to download and prepare the IMDb dataset. 


In [None]:
# dataset = load_dataset("imdb")

# dataset
# dataset['test'][0]

**Alternatively**, you can also read in datasets from raw text. Check out this [tutorial](https://huggingface.co/transformers/custom_datasets.html?highlight=imdb%20rating%20dataset) for more details.

In [None]:
! wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
! tar -xf aclImdb_v1.tar.gz

--2021-02-13 06:25:01--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz.1’


2021-02-13 06:25:06 (19.1 MB/s) - ‘aclImdb_v1.tar.gz.1’ saved [84125825/84125825]

^C


In [None]:
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # Compare strings with ==; `is` tests object identity and is not reliable here.
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

KeyboardInterrupt: ignored

In [None]:
################## (OPTIONAL) YOUR CODE HERE ##################
# Feel free to write your own code to import the IMDB dataset.
###############################################################

#### a. (1 point) Plot the reviews' lengths (word count) as a histogram for entire dataset (train + test).

- If you need some help generating plots, check out the matplotlib [tutorial](https://matplotlib.org/3.1.1/gallery/statistics/hist.html)


In [None]:
################## YOUR CODE HERE ##################
# Plot the reviews' lengths

all_texts = train_texts + test_texts
all_lens = [len(nltk.tokenize.word_tokenize(text)) for text in all_texts]

plt.hist(all_lens)
plt.xlabel('word counts')
plt.show()
####################################################

**Answer**:

[YOUR ANSWER HERE]
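
Summary statistics are a useful complement to the histogram, since a plot with few bins can be hard to read for long-tailed data. The following is a minimal sketch on a made-up list of texts (the `toy_texts` values are illustrative, not IMDB data). It uses whitespace splitting as an approximation of `nltk.tokenize.word_tokenize`.

```python
import statistics
from collections import Counter

# Illustrative stand-in for train_texts + test_texts.
toy_texts = [
    "great film",
    "not my kind of movie at all",
    "a slow start but a strong finish",
    "terrible",
    "one of the best films I have seen this year",
]

# Whitespace word counts (an approximation of a real tokenizer).
lengths = [len(text.split()) for text in toy_texts]

print("min:", min(lengths), "max:", max(lengths))
print("mean:", statistics.mean(lengths), "median:", statistics.median(lengths))

# Text histogram: bucket lengths into bins of width 5.
BIN_WIDTH = 5
bins = Counter((n // BIN_WIDTH) * BIN_WIDTH for n in lengths)
for start in sorted(bins):
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * bins[start]}")
```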

#### b. (1 point) Report the label distribution (number of positive and negative examples) for both the train and test splits.

In [None]:
################## YOUR CODE HERE ##################
# Find the label distributions

print("### Train ###")
_, cts = np.unique(train_labels, return_counts=True)
print(f"Positive samples: {cts[1]}")
print(f"Negative samples: {cts[0]}")


print("### Test ###")
_, cts = np.unique(test_labels, return_counts=True)
print(f"Positive samples: {cts[1]}")
print(f"Negative samples: {cts[0]}")
####################################################

### Train ###
Positive samples: 8622
Negative samples: 12500
### Test ###
Positive samples: 12500
Negative samples: 12500


**Answer**:

Train:

- Positive samples: 12500
- Negative samples: 12500

Test:

- Positive samples: 12500
- Negative samples: 12500

#### c. (1 point) What evaluation metric(s) would be appropriate for this dataset? Why?

**Answer**:

The train and test splits are *perfectly* balanced, so accuracy is an appropriate metric. The F1 score would also be a valid metric if we want to balance the effects of false positives and false negatives.
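
For reference, both metrics follow directly from the confusion counts. The following is a minimal sketch in plain Python with made-up labels; `binary_metrics` is a hypothetical helper, and in practice `sklearn.metrics` (imported above) provides the same quantities.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up gold labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On this toy input there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all four metrics equal 0.75.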

### 1.3 Understanding the out-of-distribution dataset [3 points]

#### a. (1 point) What's the name of your out-of-distribution dataset? Include a link. 

**Answer**:

Name: amazon_polarity
Link: https://huggingface.co/datasets/amazon_polarity

"The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review."


#### b. (1 point) Describe your dataset, including its splits, its columns, and their statistics. 

In [None]:
################## YOUR CODE HERE ##################
# Load your data and find your dataset's statistics
ood = load_dataset("amazon_polarity")
ood

####################################################

**Answer**:

    train: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 3600000
    })
    test: Dataset({
        features: ['label', 'title', 'content'],
        num_rows: 400000
    })


#### c. (1 point) Report the label distribution (number of positive and negative examples) for both the train and test splits.

In [None]:
################## YOUR CODE HERE ##################
# Plot the label distribution
####################################################

**Answer**:

[YOUR ANSWER HERE]

## Problem 2. Training from scratch vs. using a pretrained model

### 2.1 Train a sentiment analysis model from scratch [10 points with 5 extra credit]

Using any framework (e.g. sklearn, PyTorch, Keras, TensorFlow) and any architecture (e.g. Logistic Regression, LSTM, Transformers), train a sentiment analysis model from scratch that reaches an accuracy of at least 85% on the test split of the IMDB dataset.

- **5 extra points if your model’s accuracy is above 90%**
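
To illustrate what "from scratch" means at the simplest end of that range, here is a minimal sketch of bag-of-words logistic regression trained by plain gradient descent on four made-up sentences. The data, learning rate, and epoch count are illustrative assumptions, and nothing here is expected to reach the 85% target; a real attempt would train on the IMDB train split, most likely with a library implementation.

```python
import math

# Tiny, made-up training set (not IMDB); 1 = positive, 0 = negative.
train = [
    ("great film", 1),
    ("wonderful acting", 1),
    ("awful film", 0),
    ("terrible acting", 0),
]

vocab = sorted({word for text, _ in train for word in text.split()})
index = {word: i for i, word in enumerate(vocab)}

def featurize(text):
    """Bag-of-words count vector; words outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in index:
            vec[index[word]] += 1.0
    return vec

def predict_proba(text, weights, bias):
    """Probability of the positive class under logistic regression."""
    score = bias + sum(w * x for w, x in zip(weights, featurize(text)))
    return 1.0 / (1.0 + math.exp(-score))

# Full-batch gradient descent on the log loss, starting from zero weights.
weights, bias, lr = [0.0] * len(vocab), 0.0, 0.5
for _ in range(200):
    grad_w, grad_b = [0.0] * len(vocab), 0.0
    for text, label in train:
        error = predict_proba(text, weights, bias) - label
        for i, x in enumerate(featurize(text)):
            grad_w[i] += error * x
        grad_b += error
    weights = [w - lr * g / len(train) for w, g in zip(weights, grad_w)]
    bias -= lr * grad_b / len(train)

# Unseen combinations of seen words.
print(round(predict_proba("great acting", weights, bias), 3))
print(round(predict_proba("terrible film", weights, bias), 3))
```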

### General variables

In [None]:
# # config = pretrained.config
# from sklearn.model_selection import train_test_split
# train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
# val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

toy_encodings = tokenizer(train_texts[:100], truncation=True, padding=True)


In [None]:
## Scratch model specific

# Load pretrained model config
import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification

pre_tf = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
pre_tf.summary()
config = pre_tf.config

## Set tf dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
# val_dataset = tf.data.Dataset.from_tensor_slices((
#     dict(val_encodings),
#     val_labels
# ))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))

toy_dataset = tf.data.Dataset.from_tensor_slices((
    dict(toy_encodings),
    train_labels[:100]
))

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_transform', 'activation_13', 'vocab_layer_norm', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'dropout_19', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
distilbert (TFDistilBertMain multiple                  66362880  
_________________________________________________________________
pre_classifier (Dense)       multiple                  590592    
_________________________________________________________________
classifier (Dense)           multiple                  1538      
_________________________________________________________________
dropout_19 (Dropout)         multiple                  0         
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________


In [None]:
# Define callbacks & optimizer
es = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', min_delta=0, patience=1, verbose=1,
    mode='auto', baseline=None, restore_best_weights=True
)

ckpt = tf.keras.callbacks.ModelCheckpoint(
    './best_model.h5', monitor='val_loss', verbose=1, save_best_only=True,
    save_weights_only=False, mode='auto', save_freq='epoch'
)


optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)

In [None]:
train_dataset

<TensorSliceDataset shapes: ({input_ids: (512,), attention_mask: (512,)}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.int32)>

In [None]:
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

# with training_args.strategy.scope():
model = TFDistilBertForSequenceClassification(config=config)

trainer = TFTrainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=toy_dataset,           # 100-example toy subset, used as a smoke test
    eval_dataset=test_dataset            # val_dataset is commented out above, so evaluate on the test split
)

trainer.train()

In [None]:
from transformers import TFDistilBertForSequenceClassification
model = TFDistilBertForSequenceClassification(config=config)

In [None]:
## Model training with Keras API


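model.compile(optimizer=optimizer, loss=model.compute_loss, metrics=['accuracy']) # can also use any keras loss fn
# read_imdb_split returns all "pos" examples before all "neg" ones, so the shuffle
# buffer must cover the whole dataset for batches to mix both classes.
# batch_size is omitted from fit() because the tf.data.Dataset is already batched.
model.fit(train_dataset.shuffle(len(train_labels)).batch(16), epochs=4)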

# model.save_weights('./my_model.h5')


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [None]:
model.load_weights('./my_model.h5')
model.compile(optimizer=optimizer, loss=model.compute_loss, metrics=['accuracy']) # can also use any keras loss fn


In [None]:
print(model.compute_loss())

TypeError: ignored

In [None]:
model.save_weights('./my_model.h5')
!ls

aclImdb  aclImdb_v1.tar.gz  logs  my_model.h5  runs


In [None]:
# import torch

# class IMDbDataset(torch.utils.data.Dataset):
#     def __init__(self, encodings, labels):
#         self.encodings = encodings
#         self.labels = labels

#     def __getitem__(self, idx):
#         item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
#         item['labels'] = torch.tensor(self.labels[idx])
#         return item

#     def __len__(self):
#         return len(self.labels)

# train_dataset = IMDbDataset(train_encodings, train_labels)
# val_dataset = IMDbDataset(val_encodings, val_labels)
# test_dataset = IMDbDataset(test_encodings, test_labels)

In [None]:
# from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

# training_args = TrainingArguments(
#     output_dir='./results',          # output directory
#     num_train_epochs=3,              # total number of training epochs
#     per_device_train_batch_size=16,  # batch size per device during training
#     per_device_eval_batch_size=16,   # batch size for evaluation
#     warmup_steps=500,                # number of warmup steps for learning rate scheduler
#     weight_decay=0.01,               # strength of weight decay
#     logging_dir='./logs',            # directory for storing logs
#     logging_steps=10,
#     learning_rate=1e-5
# )

# model = DistilBertForSequenceClassification(config=pretrained.config)

# trainer = Trainer(
#     model=model,                         # the instantiated 🤗 Transformers model to be trained
#     args=training_args,                  # training arguments, defined above
#     train_dataset=train_dataset,         # training dataset
#     eval_dataset=val_dataset            
# )

# trainer.train()

In [None]:
# for batch in val_dataset:
#   print(batch.keys())
#   print(batch['input_ids'].shape)
#   print(batch['attention_mask'].shape)
#   print(model(input_ids=torch.tensor(batch['input_ids'].unsqueeze(dim=1).cuda()
# ), attention_mask=torch.tensor(batch['attention_mask'].unsqueeze(dim=1)).cuda()))

In [None]:
# model = DistilBertForSequenceClassification(config=config)
# print(model)

In [None]:
# from transformers import DataCollatorForLanguageModeling

# data_collator = DataCollatorForLanguageModeling(
#     tokenizer=tokenizer, mlm=True, mlm_probability=0.15
# )
# data_collator

In [None]:
# from transformers import Trainer, TrainingArguments

# training_args = TrainingArguments(
#     output_dir="./dbert_scratch",
#     overwrite_output_dir=True,
#     num_train_epochs=5,
#     per_gpu_train_batch_size=64,
#     save_steps=10_000,
#     save_total_limit=2,
#     learning_rate=1e-3
# )

# trainer = Trainer(
#     model=model,
#     args=training_args,
#     data_collator=data_collator,
#     train_dataset=dataset['train'],
#     eval_dataset=dataset['test']
# )

In [None]:
# dataset['train']

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [None]:
# trainer.save_model("./dbert_scratch")

In [None]:
# from transformers import pipeline

# fill_mask = pipeline(
#     "fill-mask",
#     model="./dbert_scratch",
#     tokenizer=tokenizer
# )
# fill_mask("This movie was pretty [MASK].")

Some weights of the model checkpoint at ./dbert_scratch were not used when initializing DistilBertForMaskedLM: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
- This IS expected if you are initializing DistilBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForMaskedLM were not initialized from the model checkpoint at ./dbert_scratch and are newly initialized: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
You should probably TRAIN this model on

[{'score': 0.0002560511347837746,
  'sequence': 'this movie was pretty peers.',
  'token': 12746,
  'token_str': 'peers'},
 {'score': 0.00024439391563646495,
  'sequence': 'this movie was pretty scotland.',
  'token': 3885,
  'token_str': 'scotland'},
 {'score': 0.00023974172654561698,
  'sequence': 'this movie was pretty [unused911].',
  'token': 916,
  'token_str': '[unused911]'},
 {'score': 0.00020217549172230065,
  'sequence': 'this movie was pretty adventist.',
  'token': 25696,
  'token_str': 'adventist'},
 {'score': 0.0001996046194108203,
  'sequence': 'this movie was pretty clip.',
  'token': 12528,
  'token_str': 'clip'}]

### 2.2 Evaluate your model and the fine-tuned model on IMDB [11 points]

**Tip**: You might find [`TextClassificationPipeline`](https://huggingface.co/transformers/main_classes/pipelines.html#transformers.TextClassificationPipeline) helpful. 

#### a. (1 point) Randomly sample 1000 examples from the test split.
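
A seeded random number generator makes the sample reproducible across notebook restarts. The following is a minimal sketch using the standard library on a made-up dataset; `sample_examples` is a hypothetical helper.

```python
import random

def sample_examples(texts, labels, k, seed):
    """Draw k (text, label) pairs without replacement, reproducibly."""
    rng = random.Random(seed)                  # private generator; global state untouched
    chosen = rng.sample(range(len(texts)), k)  # k distinct indices
    return [texts[i] for i in chosen], [labels[i] for i in chosen]

# Illustrative stand-ins for test_texts / test_labels.
toy_texts = [f"review {i}" for i in range(20)]
toy_labels = [i % 2 for i in range(20)]

texts_a, labels_a = sample_examples(toy_texts, toy_labels, k=5, seed=42)
texts_b, labels_b = sample_examples(toy_texts, toy_labels, k=5, seed=42)
print(texts_a == texts_b)  # same seed, same sample
```

Returning the labels together with the texts keeps the two aligned, which the later evaluation steps rely on.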

In [None]:
SEED = 42


In [None]:
type(test_encodings)

transformers.tokenization_utils_base.BatchEncoding

In [None]:
################## YOUR CODE HERE ##################
num_test = len(test_texts)
print(f'Test set size: {num_test}')
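rng = np.random.default_rng(SEED)  # use the seed defined above so the sample is reproducible
chosen = rng.choice(num_test, size=1000, replace=False)
print(chosen.shape)

sampled_text = [test_texts[idx] for idx in chosen]
sampled_labels = [test_labels[idx] for idx in chosen]  # keep gold labels aligned for evaluation
print(f'Number sampled: {len(sampled_text)}')
sampled_enc = tokenizer(sampled_text, truncation=True, padding=True)

# Truncate each review to its first 512 NLTK word tokens before feeding it to the model.
# Note: 512 words can still exceed 512 WordPiece tokens after subword tokenization.

trunc = [' '.join(nltk.tokenize.word_tokenize(sent)[:512]) for sent in sampled_text]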
####################################################

Test set size: 25000
(1000,)
Number sampled: 1000


In [None]:
trunc[10]

"Story about a widowed father ( Claude Rains ) bringing up his four daughters . Emma ( Gale Page ) is loved by big hunky Ernest ( Dick Foran ) . Thea ( Lola Lane ) is romanced by an old but wealthy man . Kay ( Rosemary Lane ) wants to become a singer . Ann ( Priscilla Lane ) is a romantic . Drop dead handsome Felix Deitz ( Jeffrey Lynn ) , a business associate of their father , comes to stay with them . All the sisters fall in love with him . Then tough cynical Mickey ( John Garfield ) enters the picture ... < br / > < br / > Very entertaining movie was a big hit and nominated for five Academy Awards . It 's beautifully directed by Michael Curitz , has a pretty good ( if predictable ) script and a VERY attractive cast ( especially Lynn ) . Also this was John Garfield 's first film and made him a star . This was so popular there were three or four sequels ( which I never saw ) . This is an engrossing , entertaining , big budget soap opera -- well worth seeing ."

#### b. (2 points) Use **your model** to make predictions on these examples and output predicted labels and associated probabilities.
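
Sequence classification models typically return raw logits, so the probabilities come from a softmax over the class dimension. The following is a minimal sketch in plain Python with made-up logits; the label order (index 0 = negative, 1 = positive) is an assumption that matches `read_imdb_split` above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one example's class logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["negative", "positive"]  # assumed order: index 0 = neg, 1 = pos

# Made-up logits for three reviews.
batch_logits = [[-1.2, 2.3], [0.4, 0.1], [3.0, -3.0]]

for logits in batch_logits:
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    print(LABELS[best], round(probs[best], 3))
```

Subtracting the maximum logit before exponentiating does not change the result but avoids overflow for large logits.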

In [None]:
################## YOUR CODE HERE ##################



####################################################

#### c. (2 points) Evaluate and report **your model’s** performance on these 1000 examples using the metric specified in 1.2 (c).

In [None]:
################## YOUR CODE HERE ##################

####################################################

**Answer**:

[YOUR ANSWER HERE]

#### d. (2 points) Use the **fine-tuned model** to make predictions on these examples and output predicted labels and associated probabilities.


In [None]:
################## YOUR CODE HERE ##################
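# nlp = pipeline(...)
# Import here so this cell does not depend on the cell in 1.1 having been run.
# `tokenizer` must still be defined (see the DistilBertTokenizerFast cell in 2.1).
from transformers import AutoModelForSequenceClassification, pipeline

pretrained = AutoModelForSequenceClassification.from_pretrained("textattack/distilbert-base-uncased-imdb")

nlp = pipeline('sentiment-analysis',
               model=pretrained,
               tokenizer=tokenizer)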


ret = nlp(["hey this movie was pretty good", 'nah movie terrible', 'movie was ok'])
ret
####################################################

NameError: ignored

### Predicted labels and probabilities

In [None]:
len(sampled_enc['input_ids'][0])

512

In [None]:
results_pretrained = nlp(sampled_text)
len(results_pretrained)

NameError: ignored

#### e. (2 points) Evaluate and report the **fine-tuned model’s** performance on these 1000 examples using the metric specified in 1.2 (c).

In [None]:
################## YOUR CODE HERE ##################
# Run inference on the sampled examples
# - https://huggingface.co/transformers/main_classes/pipelines.html

# Compute the metrics on your examples
# - https://huggingface.co/docs/datasets/using_metrics.html

####################################################

**Answer**:

[YOUR ANSWER HERE]

#### f. (2 points) Compare the performance of the model trained from scratch and the fine-tuned model.

**Answer**:

[YOUR ANSWER HERE]

### 2.3 Error analysis of the fine-tuned model [10 points]

Next, do error analysis on the examples that the fine-tuned model failed to predict correctly. This is a common debugging step where buckets of errors are identified to inform how the model might be improved.

#### a. (1 point) Pull out the examples that your fine-tuned model made errors on. Examine multiple examples to see if you can spot a pattern.
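
Collecting the misclassified examples amounts to comparing predictions with gold labels. The following is a minimal sketch with made-up data; the field names and the choice to sort by confidence are assumptions, not requirements of the assignment.

```python
# Made-up texts, gold labels, predicted labels, and predicted-class probabilities.
texts = ["loved it", "not bad at all", "a masterpiece of boredom", "awful"]
gold = [1, 1, 0, 0]
pred = [1, 0, 1, 0]
conf = [0.98, 0.71, 0.88, 0.95]

errors = [
    {"text": t, "gold": g, "pred": p, "confidence": c}
    for t, g, p, c in zip(texts, gold, pred, conf)
    if g != p
]

# Most confident mistakes first: these are often the most informative to read.
errors.sort(key=lambda e: e["confidence"], reverse=True)

for e in errors:
    print(f"gold={e['gold']} pred={e['pred']} conf={e['confidence']:.2f}  {e['text']}")
```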


In [None]:
################## YOUR CODE HERE ##################

####################################################

#### b. (3 points) Identify at least 1 pattern that you believe your model is missing. Include at least 3 examples from the data that support your hypothesis.


**Answer**:

[YOUR ANSWER HERE]

#### c. (2 points) Explain why these examples might have been difficult for the fine-tuned model to correctly make predictions on.

**Answer**:

[YOUR ANSWER HERE]

#### d. (2 points) Manually create 3 examples that conform to the pattern you observed, and run inference on them using the model. What did you find?

In [None]:
################## YOUR CODE HERE ##################
# Inference on your own examples
####################################################

**Answer**:

[YOUR ANSWER HERE]

#### e. (2 points) Suggest what steps we might take to address this error bucket.


**Answer**:

[YOUR ANSWER HERE]

### 2.4 Perturbation Analysis [10 points]

Inputs, especially those written by users, might contain a lot of noise (e.g. misspellings, repeated chaaaaaaaaracters, missing punctuation, etc.). You want to see how well your models perform on noisy input.

#### a. (2 points) Write a function to randomly add noise to an input while preserving its label. Here are some ideas to consider (you can combine them too, e.g. 10% of the time do this, 20% of the time do this):

- Randomly remove a character
- Randomly repeat a character or a phrase
- Replace a word with a similar word 
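
The first two ideas can be sketched as follows. This is a minimal example under stated assumptions: the 5% default per-character noise rate, the 50/50 split between deletion and repetition, and the name `add_random_noise` are all illustrative choices. Only alphabetic characters are touched, so an edit is unlikely to flip the sentiment of a review.

```python
import random

def add_random_noise(sentence, rate=0.05, seed=None):
    """Randomly delete or repeat letters, leaving spaces and punctuation intact."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        if ch.isalpha() and rng.random() < rate:
            if rng.random() < 0.5:
                continue              # remove this character
            out.append(ch * 2)        # repeat this character
        else:
            out.append(ch)
    return "".join(out)

original = "This movie was surprisingly good, and the ending was great."
print(add_random_noise(original, rate=0.2, seed=0))
```

Passing a `seed` makes the noisy test set reproducible, so that both models can be evaluated on exactly the same perturbed inputs.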


In [None]:
################## YOUR CODE HERE ##################
# def add_random_noise(sentence):
# 
####################################################

#### b. (1 point) Apply this function to 500 samples in your test split.


In [None]:
################## YOUR CODE HERE ##################

####################################################

#### c. (2 points) Use your model to make predictions on these noisy examples and output predicted labels and associated probabilities.

In [None]:
################## YOUR CODE HERE ##################

####################################################

#### d. (1 point) Evaluate your model’s performance on these noisy examples using the metric specified in 1.2 (c).

In [None]:
################## YOUR CODE HERE ##################

####################################################

#### e. (2 points) Use the fine-tuned model to make predictions on these noisy examples and output predicted labels and associated probabilities.

In [None]:
################## YOUR CODE HERE ##################

####################################################

#### f. (1 point) Evaluate the fine-tuned model’s performance on these noisy examples using the metric specified in 1.2 (c).

In [None]:
################## YOUR CODE HERE ##################

####################################################

#### g. (1 point) Compare the performance of your model and the fine-tuned model on these noisy samples.


**Answer:**

[YOUR ANSWER HERE]

### 2.5 Slice-based analysis [10 points]

We’ve been evaluating both models on a coarse-grained metric. Let’s take a deeper look into how we can evaluate them on different slices.

Play around with the test split of IMDB -- slice it into different subgroups. Some ideas for slicing your test split:

- By **input lengths** (e.g. maybe your model will perform well on inputs of fewer than 10 characters but horribly on inputs of more than 1000 characters).
- By **movie names** (can you figure out how to extract movie names from reviews?).
- By **the number of punctuation marks** in each review.
- etc. Play around with your data!

Choose two slices of data on which your model’s performance is non-trivially different. Each slice should have at least 100 samples.
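
Slicing by input length, the first idea above, can be sketched as follows. The data, the 8-word threshold, and the helper names are made up for illustration; on the real test split the threshold would be chosen so that each slice has at least 100 samples.

```python
def slice_by_length(texts, threshold):
    """Return index lists for texts shorter than / at least `threshold` words."""
    short = [i for i, t in enumerate(texts) if len(t.split()) < threshold]
    long_ = [i for i, t in enumerate(texts) if len(t.split()) >= threshold]
    return short, long_

def accuracy_on(indices, gold, pred):
    """Accuracy restricted to one slice of the data."""
    return sum(gold[i] == pred[i] for i in indices) / len(indices)

# Made-up reviews, gold labels, and model predictions.
texts = [
    "great",
    "boring and slow",
    "I expected more from this director after his last film",
    "not the worst thing I have seen but far from good",
]
gold = [1, 0, 0, 0]
pred = [1, 0, 1, 0]

short, long_ = slice_by_length(texts, threshold=8)
for name, idx in [("short", short), ("long", long_)]:
    print(name, len(idx), accuracy_on(idx, gold, pred))
```

On this toy data the short slice has accuracy 1.0 and the long slice 0.5. Working with index lists rather than copies of the texts lets the same slices be reused for both models.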





In [None]:
# Cell for you to play around with your data. (not graded)

#### a. (5 points) Describe your reason for choosing these two slices. Explain why the model might perform differently on them.

**Answer:**

[YOUR ANSWER HERE]

#### b. (2 points) Write code to extract these two slices from your test split.


In [None]:
################## YOUR CODE HERE ##################

####################################################

#### c. (1 point) Write code to compute the following statistics for each slice, then report them.
1. Slice size
2. Label distribution
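One way to compute both statistics is sketched below using only the standard library. The `slice_stats` helper and the example labels are hypothetical; the sketch assumes each slice's labels are available as a plain list.

```python
from collections import Counter


def slice_stats(labels):
    """Return the slice size and the fraction of examples per label."""
    counts = Counter(labels)
    size = len(labels)
    distribution = {label: count / size for label, count in sorted(counts.items())}
    return size, distribution


# Hypothetical labels for one slice (1 = positive, 0 = negative).
size, distribution = slice_stats([1, 0, 1, 1])
print(size, distribution)
```

Reporting fractions rather than raw counts makes it easier to compare label balance across slices of different sizes.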

In [None]:
################## YOUR CODE HERE ##################

####################################################

**Answer:**

[YOUR ANSWER HERE]

#### d. (1 point) Write code to evaluate **your model**’s performance on these two slices, including the metric specified in 1.2 (c) and the confusion matrix.
- [sklearn.metrics.confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) may be useful here
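To make the layout of a confusion matrix concrete, here is a small pure-Python sketch for binary labels. The helper name `confusion_matrix_2x2` and the toy labels are assumptions for illustration; it follows the same convention as scikit-learn, where rows index the true label and columns index the predicted label.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Count (true, predicted) pairs for binary labels 0 and 1.

    matrix[i][j] is the number of examples whose true label is i and
    whose predicted label is j.
    """
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels for one slice.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
matrix = confusion_matrix_2x2(y_true, y_pred)

# Accuracy is the diagonal (correct predictions) over the total.
accuracy = (matrix[0][0] + matrix[1][1]) / len(y_true)
print(matrix, accuracy)
```

Calling `sklearn.metrics.confusion_matrix(y_true, y_pred)` on the same lists should give the same counts as a NumPy array.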


In [None]:
################## YOUR CODE HERE ##################

####################################################

#### e. (1 point) Write code to evaluate the **fine-tuned model**’s performance on these two slices, including the metric specified in 1.2 (c) and the confusion matrix.

In [None]:
################## YOUR CODE HERE ##################

####################################################

## Problem 3. In-Distribution vs. Out-of-Distribution

In Problem 2, we evaluated the fine-tuned model on a test split that comes from the same distribution the model was fine-tuned on. In this problem, we’ll evaluate the fine-tuned model’s performance on an out-of-distribution test set.


### 3.1 Evaluate the fine-tuned model on an out-of-distribution task [4 points]


#### a. (1 point) If the dataset has a test split, randomly sample 500 examples from the test split. If it doesn’t have a test split, randomly sample 500 examples from the entire dataset.
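A minimal sketch of reproducible sampling without replacement is shown below. The `sample_examples` helper, the seed value, and the toy `dataset` list are assumptions for illustration; the sketch assumes the examples fit in a Python list.

```python
import random


def sample_examples(examples, k=500, seed=0):
    """Return k examples drawn without replacement, reproducibly."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(examples, k)


# Hypothetical stand-in for a dataset split.
dataset = [f"example {i}" for i in range(2000)]
subset = sample_examples(dataset)
print(len(subset))
```

Fixing the seed means the same 500 examples are evaluated every time you rerun the notebook. If you are working with a Hugging Face `Dataset` object, `dataset.shuffle(seed=...).select(range(500))` is a comparable approach.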


In [None]:
################## YOUR CODE HERE ##################

####################################################

#### b. (1 point) Use the fine-tuned model to make predictions on these examples and output predicted labels and associated probabilities.


In [None]:
################## YOUR CODE HERE ##################

####################################################

#### c. (1 point) Evaluate the fine-tuned model’s performance on these 500 examples using the metric specified in 1.2 (c).


In [None]:
################## YOUR CODE HERE ##################

####################################################

#### d. (1 point) Compare the performance of the fine-tuned model on IMDB and this dataset.


**Answer:**

[YOUR ANSWER HERE]

### 3.2 Error analysis of the fine-tuned model on out-of-distribution task [10 points]

Next, do error analysis on the examples that the fine-tuned model failed to predict correctly. This is a common debugging step where buckets of errors are identified to inform how the model might be improved.
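The sketch below shows one way to collect misclassified examples for inspection. All names and values (`texts`, `y_true`, `y_pred`, `probs`) are hypothetical placeholders for the outputs of your own inference step.

```python
# Hypothetical model outputs; in practice these come from your inference step.
texts = ["great phone", "battery died in a week", "not bad at all"]
y_true = [1, 0, 1]
y_pred = [1, 0, 0]
probs = [0.97, 0.97, 0.80]  # probability the model assigned to its predicted label

# Keep only the examples where the prediction disagrees with the true label.
errors = [
    {"text": t, "true": yt, "pred": yp, "confidence": p}
    for t, yt, yp, p in zip(texts, y_true, y_pred, probs)
    if yt != yp
]

# Sort so the most confident mistakes come first.
errors.sort(key=lambda e: e["confidence"], reverse=True)

for e in errors:
    print(e)
```

Keeping the text, both labels, and the confidence together makes it easier to read through the errors and group them into buckets by hand.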


#### a. (1 point) Pull out the examples that your fine-tuned model made errors on.


In [None]:
################## YOUR CODE HERE ##################

####################################################

#### b. (3 points) Identify at least 1 pattern that you believe your model is missing. Include at least 3 examples from the data that support your hypothesis.


**Answer:**

[YOUR ANSWER HERE]

#### c. (2 points) Explain why these examples might have been difficult for the fine-tuned model to correctly make predictions on.


**Answer:**

[YOUR ANSWER HERE]

#### d. (2 points) Manually write down 3 examples that conform to the pattern you observed, and run inference on them using the model. What did you find?


In [None]:
################## YOUR CODE HERE ##################
# Inference on your own examples
####################################################

**Answer:**

[YOUR ANSWER HERE]

#### e. (2 points) Suggest what steps we might take to address this error bucket.


**Answer:**

[YOUR ANSWER HERE]

### 3.3 Calibration [4 points]
You will examine whether the fine-tuned model is calibrated. You might want to look into [sklearn.calibration.calibration_curve](https://scikit-learn.org/stable/modules/generated/sklearn.calibration.calibration_curve.html).
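The assignment does not define the calibration error precisely, so the sketch below assumes one common definition: bin the predicted positive-class probabilities into equal-width bins, then take the size-weighted average of the gap between each bin's observed positive rate and its mean predicted probability. The function name, the number of bins, and the toy data are all assumptions for illustration.

```python
def expected_calibration_error(y_true, y_prob, n_bins=5):
    """Size-weighted mean of |observed positive rate - mean predicted probability| per bin."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        # Clamp so that prob == 1.0 falls into the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((label, prob))

    total = len(y_true)
    error = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        positive_rate = sum(label for label, _ in bucket) / len(bucket)
        mean_prob = sum(prob for _, prob in bucket) / len(bucket)
        error += (len(bucket) / total) * abs(positive_rate - mean_prob)
    return error


# Hypothetical positive-class probabilities and true labels.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.3, 0.7, 0.9]
ece = expected_calibration_error(y_true, y_prob)
print(round(ece, 3))
```

`sklearn.calibration.calibration_curve` returns the per-bin positive fraction and mean predicted probability, which are the two quantities compared here, so it can also supply the points for a calibration plot.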


#### a. (1 point) Compute the average calibration error of the fine-tuned model on IMDB.


In [None]:
################## YOUR CODE HERE ##################

####################################################

#### b. (1 point) Compute the average calibration error of the fine-tuned model on the out-of-distribution dataset.


In [None]:
################## YOUR CODE HERE ##################

####################################################

#### c. (2 points) Plot the calibration curves for the fine-tuned model on both datasets.


In [None]:
################## YOUR CODE HERE ##################

####################################################

**Answer:**

[YOUR ANSWER HERE]

## Problem 4. Multilabel Tasks [7 points]

Now we’ll be using our understanding of pretrained models and fine-tuned models to try to get good performance on a difficult dataset.

Choose one of the following multilabel tasks:
- Circa dataset: https://huggingface.co/datasets/circa 
- PUBHEALTH dataset: https://huggingface.co/datasets/health_fact
- GoEmotions dataset: https://huggingface.co/datasets/go_emotions

**Tip:** read the paper associated with each dataset.


#### a. (2 points) Describe the train/test split distributions, the dataset’s columns, and their statistics.


**Answer:**

[YOUR ANSWER HERE]

#### b. (1 point) Plot its label distribution as a bar graph, with the labels on the x-axis and the number of examples for each label on the y-axis.


In [None]:
################## YOUR CODE HERE ##################
# See https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.bar.html
####################################################

**Answer:**

[YOUR ANSWER HERE]

#### c. (2 points) What metric(s) would be appropriate for this dataset? Why?


**Answer:**

[YOUR ANSWER HERE]

#### d. (2 points) Explain why you might think that this dataset is hard.


**Answer:**

[YOUR ANSWER HERE]

# Part II. OpenAI API [7 points + 3 bonus points]

In this part, you will play around with the [OpenAI API](https://beta.openai.com/). You should have received emails about accessing the OpenAI API by now. Please let us know if you haven’t.


## Problem 5. English to Bash [4 points]

Open [Playground](https://beta.openai.com/playground). In **`Load a preset...`**, select **`Text to command`**. 

Copy in the following prompt: 

```
Q: List files
A: ls -l
Q: Count files in a directory
A: ls -l | wc -l
Q: Disk space used by home directory
A: du ~
Q: Replace foo with bar in all .py files
A: sed -i .bak -- 's/foo/bar/g' *.py
Q: Delete the models subdirectory
A: rm -rf ./models
Q: Firewall all incoming connections to port 22 on this machine.
A: iptables -A INPUT -p tcp --dport 22 -j DROP
Q:
```

#### a. (2 point) Write an English sentence that makes the API output the bash command `ls *.py`.

**Answer:**

[YOUR ANSWER HERE]

#### b. (3 points) Come up with a bash command and a corresponding English explanation of the command (imagine that you’re helping a friend navigate the terminal, and you are instructing them what to do). Run the English sentence in the Playground and observe the command that you obtain. Is it the same as the bash command you had in mind? Why or why not? Play around with some longer bash commands: can you successfully generate long or complex bash commands from English explanations?


**Answer:**

[YOUR ANSWER HERE]

## Problem 6. Improving English [2 points + 3 bonus points]

Open [Playground](https://beta.openai.com/playground). In **`Load a preset...`**, select **`Grammatical Standard English`**. You will see the following prompt:

```
Non-standard English: Please provide me with a short brief of the design you’re looking for and that’d be nice if you could share some examples or project you did before.
Standard American English: Please provide me with a short brief of the design you’re looking for and some examples or previous projects you’ve done would be helpful.
 
Non-standard English: If I’m stressed out about something, I tend to have problem to fall asleep.
Standard American English: If I’m stressed out about something, I tend to have a problem falling asleep.
 
Non-standard English: There is plenty of fun things to do in the summer when your able to go outside.
Standard American English: There are plenty of fun things to do in the summer when you are able to go outside.
 
Non-standard English: She no went to the market.
Standard American English: She didn't go to the market.
```

#### a. (2 points) Write a bad non-standard English sentence. Report the Non-standard English you have input, and the output obtained from the API. Did the API fix it?

**Answer:**

[YOUR ANSWER HERE]

#### b. (3 bonus points) Come up with a bad English sentence that the API cannot fix. Report the Non-standard English you have input, and the output obtained from the API.

**Answer:**

[YOUR ANSWER HERE]