# Assignment 4

Due Time: Mar. 24th, 5pm

Name: **Muyang Huang**

## 1. LSTM
How many gates does an LSTM have? State what they are and what each gate does (one sentence/gate).

**Answer**:

The LSTM (Long Short-Term Memory) network has three gates: the forget gate, the input gate, and the output gate. The forget gate decides which information from the previous cell state is kept and which is discarded. The input gate decides which new information is written to the cell state. The output gate decides which parts of the cell state are passed on as the hidden state.
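
To make the roles of the three gates concrete, here is a minimal sketch of a single LSTM time step. It assumes a scalar input and a scalar state for readability; the function `lstm_step` and the hand-picked weights in `params` are hypothetical and not learned from data.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with scalar input and scalar state (illustrative only)."""
    def affine(name):
        # Every gate sees the current input and the previous hidden state.
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(affine("forget"))        # forget gate: how much of c_prev to keep
    i = sigmoid(affine("input"))         # input gate: how much new content to write
    o = sigmoid(affine("output"))        # output gate: how much of the cell to expose
    g = math.tanh(affine("candidate"))   # candidate content (not a gate)

    c = f * c_prev + i * g               # updated cell state
    h = o * math.tanh(c)                 # updated hidden state
    return h, c

# Hypothetical weights (w_x, w_h, b), chosen by hand for illustration.
params = {
    "forget": (0.0, 0.0, 10.0),     # forget gate close to 1: keep the old cell state
    "input": (0.0, 0.0, -10.0),     # input gate close to 0: write almost nothing new
    "output": (0.0, 0.0, 10.0),     # output gate close to 1: expose the cell state
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=0.5, h_prev=0.0, c_prev=0.8, params=params)
print(c, h)
```

With these settings the cell state stays very close to its previous value of 0.8, because the forget gate is nearly open and the input gate is nearly closed.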

## 2. BERT
Answer the following questions.

### (a) Briefly explain Self-Attention.

**Answer**:

Self-Attention is a mechanism that directly models relationships between all words in a sentence, which allows the model to weigh their importance regardless of distance. It computes scaled dot-product attention by generating query, key, and value vectors for each word, then computing attention scores using the dot product of query and key, scaled by the square root of the key dimension and normalized using softmax. Each word's attention score determines how much focus it should give to other words. The weighted sum of values provides the final representation for each word. This process is applied to all words in parallel.
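
The computation described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: a real model obtains the queries, keys, and values from learned linear projections of the token embeddings, whereas this example assumes identity projections and uses the toy embeddings `X` directly.

```python
import math

def softmax(scores):
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors (one vector per token)."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)                # attention weights sum to 1
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy embeddings for three tokens (assumed values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how strongly one token attends to every token in the sequence, including itself.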

### (b) Briefly explain pre-training and fine-tuning.

**Answer**:

Pre-training is the initial training phase in which a model learns general language representations from a large text corpus using self-supervised tasks. The resulting network parameters are saved for later use. Fine-tuning continues training the pre-trained model on a smaller, task-specific dataset, either by updating all layers or by freezing some layers and training only the task-relevant ones, so the model specializes while reusing the knowledge gained during pre-training. Fine-tuning adapts the model to tasks such as sentence similarity, named entity recognition, and relation extraction. This approach improves efficiency and performance because it reduces the need for large amounts of labeled data and extensive computation.
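
The idea of freezing some parameters while training others can be illustrated with a deliberately tiny toy model. This sketch assumes a two-parameter model `pred = head_w * (body_w * x)`, where `body_w` stands in for pre-trained layers and `head_w` for a new task-specific layer; the function `fine_tune`, the data, and all values are hypothetical.

```python
def fine_tune(body_w, head_w, data, lr=0.01, epochs=200, freeze_body=True):
    """Gradient descent on squared error; optionally keeps the 'pre-trained' weight fixed."""
    for _ in range(epochs):
        for x, y in data:
            feat = body_w * x                 # output of the pre-trained part
            err = head_w * feat - y           # prediction error
            grad_head = 2 * err * feat
            grad_body = 2 * err * head_w * x
            head_w -= lr * grad_head          # the task head is always trained
            if not freeze_body:
                body_w -= lr * grad_body      # the body is trained only when unfrozen
    return body_w, head_w

pretrained_body = 2.0                          # assumed result of pre-training
task_data = [(1.0, 6.0), (2.0, 12.0), (3.0, 18.0)]   # toy task: y = 6x
body_w, head_w = fine_tune(pretrained_body, head_w=0.0, data=task_data)
print(body_w, head_w)
```

With the body frozen, only the head moves, and it settles near 3 so that the product of the two weights matches the task. Passing `freeze_body=False` corresponds to fine-tuning all layers.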

### (c) How does BERT do text classification?

**Answer**:

BERT (Bidirectional Encoder Representations from Transformers) performs text classification, that is, assigning categories to text according to its content, by adding a classification layer on top of its Transformer encoder. A special [CLS] token is prepended to every input, and its final hidden state serves as a summary representation of the entire text. BERT processes the input through multiple self-attention layers, which capture deep contextual relationships. The final hidden state of the [CLS] token is then passed to a fully connected layer followed by a softmax or sigmoid activation function, which produces the probability distribution over the possible classes. During fine-tuning, the classification layer is trained on the labeled data, typically together with the pre-trained weights.
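
The final classification step can be sketched as a single linear layer followed by softmax. This is a minimal illustration, assuming a 4-dimensional [CLS] vector and three classes; the numbers in `cls_vector`, `weights`, and `bias` are made up and do not come from a trained model.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_vector, weights, bias):
    """Fully connected layer on the [CLS] vector, followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Hypothetical final hidden state of the [CLS] token (hidden size 4 for brevity).
cls_vector = [0.2, -0.1, 0.7, 0.4]
# Hypothetical classifier parameters: one row of weights per class.
weights = [
    [0.5, -0.3, 0.1, 0.0],
    [-0.2, 0.4, 0.3, 0.1],
    [0.1, 0.1, 0.9, 0.5],
]
bias = [0.0, 0.0, 0.0]

probs = classify(cls_vector, weights, bias)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

The predicted class is the index with the highest probability; in a real BERT model the hidden size is much larger (768 for the base model).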

### (d) How is GPT different from BERT?

**Answer**:

GPT (Generative Pre-trained Transformer) and BERT differ primarily in their architecture, training approach, and applications. BERT is built from the Transformer encoder and is bidirectional, which means it processes text by considering both the left and right context simultaneously. This makes it well-suited for understanding relationships within a sentence. It is pre-trained using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), which makes it effective for tasks like question answering and text classification. In contrast, GPT is built from the Transformer decoder and is unidirectional (left-to-right, or autoregressive): it generates text by predicting the next word based only on the previous words. GPT is pre-trained using a causal language modeling (CLM) objective, which makes it better suited for text generation tasks. While BERT is typically fine-tuned for downstream NLP tasks that require deep bidirectional contextual understanding, GPT excels at generating coherent and fluent text one token at a time.
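
One way to see the bidirectional versus unidirectional difference is through the attention mask, which marks which positions each token may attend to. The helper `attention_mask` below is a hypothetical sketch that ignores padding.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len mask; entry [i][j] is 1 if token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

bert_mask = attention_mask(4, causal=False)   # BERT-style: every token sees every token
gpt_mask = attention_mask(4, causal=True)     # GPT-style: each token sees only itself and earlier tokens

for row in bert_mask:
    print(row)
print()
for row in gpt_mask:
    print(row)
```

The GPT-style mask is lower triangular, which is what prevents the model from looking at future words while predicting the next one.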

### (e) What are the limitations of GPT/LLMs?

**Answer**:

GPT and other large language models (LLMs) have several limitations. First, they can generate hallucinated information, often with high confidence, because they lack fact-checking mechanisms and rely on statistical patterns rather than true understanding. This is a challenge for trust and accountability in high-stakes applications. Second, they lack real-world reasoning and common sense, and sometimes produce logically inconsistent outputs. Third, LLMs require vast amounts of data and computational resources, which makes them expensive to train and deploy and difficult for academic institutions and smaller companies to develop. Fourth, they can inherit biases from their training data, which can reinforce stereotypes or produce harmful content. Finally, their task-specific adaptation is limited: although they generalize well, they often require fine-tuning or prompt engineering to achieve optimal performance in a specific application.

## 3. BERT
The **drug review dataset** provides patient reviews on drugs and a positive and negative rating reflecting overall patient satisfaction (https://archive.ics.uci.edu/dataset/461/drug+review+dataset+druglib+com). The dataset consists of two files: *drug_review_train.csv* for training and *drug_review_test.csv* for testing. Both files contain plain-text, UTF-8-encoded samples in a tab-separated format with the following columns:

- Text
- Three labels (-1: negative, 0: neutral, 1: positive)

### (a) Use DistilBERT to build a classifier.

| **DistilBERT**  | **true positive** | **false positive** | **false negative** | **precision** | **recall** | **F1-score** |
|-----------------|-------------------|--------------------|--------------------|---------------|------------|--------------|
| positive        | 33,632            | 2,424              | 1,808              | 0.933         | 0.949      | 0.941        |
| neutral         | 2,240             | 1,941              | 2,589              | 0.536         | 0.464      | 0.497        |
| negative        | 11,569            | 1,960              | 1,928              | 0.855         | 0.857      | 0.856        |

### (b) Upload the source codes.

import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset
from sklearn.metrics import classification_report, precision_recall_fscore_support, accuracy_score, confusion_matrix

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

# Define custom dataset class
class DrugReviewDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        )
        item = {key: val.squeeze(0) for key, val in encoding.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

# Load dataset
train_df = pd.read_csv("drug_review_train.csv")
test_df = pd.read_csv("drug_review_test.csv")

id2label = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}
label2id = {"NEGATIVE": 0, "NEUTRAL": 1, "POSITIVE": 2}

# Load pre-trained DistilBERT model for classification (3 labels: -1, 0, 1)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3, id2label=id2label, label2id=label2id
)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    acc = accuracy_score(labels, predictions)
    pre, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='macro')
    return {"accuracy": acc, "precision": pre, "recall": recall, "f1": f1}

# Shift ratings from {-1, 0, 1} to class indices {0, 1, 2}, matching id2label
train_df['label'] = train_df['rating'] + 1
test_df['label'] = test_df['rating'] + 1

# Prepare datasets
train_dataset = DrugReviewDataset(train_df["review"].tolist(), train_df["label"].tolist(), tokenizer)
test_dataset = DrugReviewDataset(test_df["review"].tolist(), test_df["label"].tolist(), tokenizer)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    # per_device_train_batch_size=16,
    # per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs"
    # logging_steps=200,
    # load_best_model_at_end=True
)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Evaluate the model
trainer.evaluate()

# y_true holds the actual labels and y_pred the predicted labels, both as class indices (0, 1, 2)
y_true = test_df["label"]
y_pred = trainer.predict(test_dataset).predictions.argmax(-1)

# Generate classification report
report = classification_report(y_true, y_pred, target_names=["negative", "neutral", "positive"], output_dict=True)
print(classification_report(y_true, y_pred, target_names=["Negative", "Neutral", "Positive"]))

# Extract per-class values
target_names = ["negative", "neutral", "positive"]
for label in target_names:
    precision = report[label]["precision"]
    recall = report[label]["recall"]
    f1 = report[label]["f1-score"]
    print(f"{label.capitalize()} - Precision: {precision:.3f}, Recall: {recall:.3f}, F1-score: {f1:.3f}")

# Compute confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print("\nConfusion Matrix:\n", conf_matrix)

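| Epoch | Training Loss | Validation Loss | Accuracy | Precision (macro) | Recall (macro) | F1 (macro) |
|-------|---------------|-----------------|----------|-------------------|----------------|------------|
| 1     | 0.4624        | 0.488083        | 0.820649 | 0.672002          | 0.615503       | 0.636515   |
| 2     | 0.3767        | 0.477467        | 0.864617 | 0.739969          | 0.676389       | 0.686276   |
| 3     | 0.2773        | 0.479124        | 0.882361 | 0.774551          | 0.756667       | 0.764724   |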


              precision    recall  f1-score   support

    Negative       0.86      0.86      0.86     13497
     Neutral       0.54      0.46      0.50      4829
    Positive       0.93      0.95      0.94     35440

    accuracy                           0.88     53766
   macro avg       0.77      0.76      0.76     53766
weighted avg       0.88      0.88      0.88     53766

Negative - Precision: 0.855, Recall: 0.857, F1-score: 0.856
Neutral - Precision: 0.536, Recall: 0.464, F1-score: 0.497
Positive - Precision: 0.933, Recall: 0.949, F1-score: 0.941

Confusion Matrix:
 [[11569   966   962]
 [ 1127  2240  1462]
 [  833   975 33632]]
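
The notebook does not print the true-positive, false-positive, and false-negative counts reported in the table for part (a), but they can be derived from the confusion matrix above. The following is a minimal sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the helper name `per_class_counts` is hypothetical.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
conf_matrix = [
    [11569, 966, 962],
    [1127, 2240, 1462],
    [833, 975, 33632],
]
labels = ["negative", "neutral", "positive"]

def per_class_counts(cm, k):
    """Return (TP, FP, FN) for class index k."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp   # predicted k, but true label differs
    fn = sum(cm[k]) - tp                              # true label k, but predicted otherwise
    return tp, fp, fn

results = {}
for k, name in enumerate(labels):
    tp, fp, fn = per_class_counts(conf_matrix, k)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    results[name] = (tp, fp, fn, precision, recall, f1)
    print(f"{name}: TP={tp}, FP={fp}, FN={fn}, "
          f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```

The resulting counts and rounded scores agree with the table in part (a) and with the per-class values printed by `classification_report`.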
