# Introduction to HLT 2023 Project (Template)

- Student(s) Name(s): Shadman Ishraq
- Date: 27.06.2023
- Chosen Corpus: IMDB
- Contributions (if group project):

### Corpus information

- Description of the chosen corpus: Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, plus additional unlabeled data.


- Paper(s) and other published materials related to the corpus: Sentiment analysis on IMDB using lexicon and neural networks (Source: https://link.springer.com/article/10.1007/s42452-019-1926-x).
- State-of-the-art performance (best published results) on this corpus: The current state-of-the-art accuracy on IMDb is 97.42%, achieved with document embeddings trained with cosine similarity (Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7256387/#CR24 ).

---

## 1. Setup

In [None]:
!pip install --quiet transformers datasets evaluate
!pip install --quiet accelerate -U


In [None]:
import datasets
import transformers
import evaluate
import numpy as np
from datasets import load_metric
from transformers import AutoModelForSequenceClassification, BertConfig



---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [None]:
DATASET = 'imdb'

builder = datasets.load_dataset_builder(DATASET)
dataset = datasets.load_dataset(DATASET)

Downloading builder script:   0%|          | 0.00/4.31k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.59k [00:00<?, ?B/s]

Downloading and preparing dataset imdb/plain_text to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...


Downloading data:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

### 2.2. Preprocessing

In [None]:
train = dataset.get('train')
test = dataset.get('test')
id2label = {0: "neg", 1: "pos"}
label2id = {"neg": 0, "pos": 1}

In [None]:
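MODEL_NAME = "bert-base-uncased"
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
# Base encoder without a classification head; section 3.1 replaces it with a sequence classification model
model = transformers.AutoModel.from_pretrained(MODEL_NAME)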


Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
def tokenize(example, tokenizer):
    # Truncate long reviews and pad every example to the tokenizer's maximum length
    return tokenizer(example['text'], truncation=True, padding='max_length')

In [None]:
train = train.map(lambda example: tokenize(example, tokenizer), batched=True)
test = test.map(lambda example: tokenize(example, tokenizer), batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]
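
To make the effect of `truncation=True, padding='max_length'` concrete, the following toy sketch pads or cuts a list of token ids to a fixed length and builds the matching attention mask. It is only an illustration: the function name `pad_or_truncate`, the toy ids, and `pad_id=0` are assumptions, and the real tokenizer additionally handles subword splitting and special tokens.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy illustration: force a token id list to exactly max_length entries."""
    clipped = token_ids[:max_length]            # truncation: drop ids past max_length
    n_pad = max_length - len(clipped)           # padding: fill the remaining slots
    return {
        "input_ids": clipped + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(clipped) + [0] * n_pad,
    }


short_review = [101, 2023, 102]                  # hypothetical ids, shorter than max_length
long_review = [101, 7, 8, 9, 10, 11, 12, 102]    # hypothetical ids, longer than max_length

print(pad_or_truncate(short_review, max_length=6))
print(pad_or_truncate(long_review, max_length=6))
```

Because every example ends up with the same length, short reviews carry a lot of padding; padding per batch instead is a common alternative when memory or speed matters.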

---

## 3. Machine learning model

### 3.1. Model training

In [None]:
accuracy_metric = evaluate.load("accuracy")  # replaces the deprecated datasets.load_metric

def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the gold labels
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

# Pads each batch to the length of its longest example
data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)
# Stop if the monitored metric does not improve for 5 consecutive evaluations
early_stopping_patience = 5
early_stopping = transformers.EarlyStoppingCallback(early_stopping_patience)

Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]
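
As a rough illustration of what `compute_metrics` does, the sketch below repeats the same two steps (argmax over the two logits, then the fraction of correct predictions) in plain Python on made-up values. The helper names `argmax` and `accuracy` and the toy logits are assumptions for this example only; the notebook itself relies on NumPy and the accuracy metric loaded above.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the gold label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four reviews; column 0 = "neg", column 1 = "pos"
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

print(accuracy(toy_logits, toy_labels))  # three of four predictions match
```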

In [None]:
trainer_args = transformers.TrainingArguments(
    output_dir = 'checkpoints',
    evaluation_strategy = 'steps',
    logging_strategy = 'steps',
    load_best_model_at_end = True,
    eval_steps = 100,
    logging_steps = 100,
    learning_rate = 0.00001,
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 32,
    max_steps = 500,
)

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, config=BertConfig(num_labels=2, id2label=id2label, label2id=label2id))


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

In [None]:
trainer = transformers.Trainer(
    model = model,
    args = trainer_args,
    train_dataset = train,
    eval_dataset = test,
    compute_metrics = compute_metrics,
    tokenizer = tokenizer,
    callbacks = [early_stopping],
)

trainer.train()

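| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 100 | 0.5929 | 0.367797 | 0.86568 |
| 200 | 0.3321 | 0.286783 | 0.8866 |
| 300 | 0.2722 | 0.251649 | 0.90796 |
| 400 | 0.3116 | 0.239038 | 0.91168 |
| 500 | 0.2345 | 0.248524 | 0.91144 |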


TrainOutput(global_step=500, training_loss=0.34867887115478513, metrics={'train_runtime': 4900.9115, 'train_samples_per_second': 0.816, 'train_steps_per_second': 0.102, 'total_flos': 1052444221440000.0, 'train_loss': 0.34867887115478513, 'epoch': 0.16})
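
The run above stops at step 500 because of `max_steps`, not because of early stopping. The sketch below applies a simplified patience rule to the five validation losses from the training log to show why. It assumes that lower validation loss counts as an improvement; the function `first_stop_index` is a hypothetical stand-in and not the library's actual callback logic.

```python
def first_stop_index(losses, patience):
    """Return the index of the evaluation at which a simple patience rule
    would stop training, or None if it never triggers."""
    best = float("inf")
    evals_without_improvement = 0
    for i, loss in enumerate(losses):
        if loss < best:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return i
    return None


# Validation losses at steps 100, 200, 300, 400, 500 from the log above
val_losses = [0.367797, 0.286783, 0.251649, 0.239038, 0.248524]

print(first_stop_index(val_losses, patience=5))  # patience used in this notebook
print(first_stop_index(val_losses, patience=1))  # a stricter setting, for comparison
```

With only five evaluations in total, a patience of 5 cannot be exhausted under this rule; a smaller patience or a longer run would be needed for early stopping to have any effect.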

### 3.2 Hyperparameter optimization

In [None]:
def objective(learning_rate, num_train_epochs):
    model = transformers.AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2, id2label=id2label, label2id=label2id)
    training_args = transformers.TrainingArguments(
        output_dir="hyperparametercheckpoint",
        learning_rate=learning_rate,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        disable_tqdm=True
    )

    trainer = transformers.Trainer(
        model=model,
        args=training_args,
        train_dataset=train,
        eval_dataset=test
    )
    result = trainer.train()
    return result.training_loss

# Define hyperparameter values to be tuned
learning_rate = 0.0001
num_train_epochs = 1

# Call the objective function with the hyperparameter values
loss = objective(learning_rate, num_train_epochs)
print(f"Training loss: {loss}")


{'loss': 0.4693, 'learning_rate': 8.4e-05, 'epoch': 0.16}
{'loss': 0.4042, 'learning_rate': 6.800000000000001e-05, 'epoch': 0.32}
{'loss': 0.4057, 'learning_rate': 5.2000000000000004e-05, 'epoch': 0.48}
{'loss': 0.3645, 'learning_rate': 3.6e-05, 'epoch': 0.64}
{'loss': 0.3188, 'learning_rate': 2e-05, 'epoch': 0.8}
{'loss': 0.3088, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.96}
{'train_runtime': 2444.2917, 'train_samples_per_second': 10.228, 'train_steps_per_second': 1.278, 'train_loss': 0.3764745886230469, 'epoch': 1.0}
Training loss: 0.3764745886230469


In [None]:
def objective(learning_rate, num_train_epochs):
    model = transformers.AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2, id2label=id2label, label2id=label2id)
    training_args = transformers.TrainingArguments(
        output_dir="hyperparametercheckpoint",
        learning_rate=learning_rate,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        disable_tqdm=True
    )

    trainer = transformers.Trainer(
        model=model,
        args=training_args,
        train_dataset=train,
        eval_dataset=test
    )
    result = trainer.train()
    return result.training_loss

# Define hyperparameter values to be tuned
learning_rate = 0.0001
num_train_epochs = 2

# Call the objective function with the hyperparameter values
loss = objective(learning_rate, num_train_epochs)
print(f"Training loss: {loss}")


{'loss': 0.6277, 'learning_rate': 9.200000000000001e-05, 'epoch': 0.16}
{'loss': 0.5534, 'learning_rate': 8.4e-05, 'epoch': 0.32}
{'loss': 0.6314, 'learning_rate': 7.6e-05, 'epoch': 0.48}
{'loss': 0.5424, 'learning_rate': 6.800000000000001e-05, 'epoch': 0.64}
{'loss': 0.546, 'learning_rate': 6e-05, 'epoch': 0.8}
{'loss': 0.6915, 'learning_rate': 5.2000000000000004e-05, 'epoch': 0.96}
{'loss': 0.7043, 'learning_rate': 4.4000000000000006e-05, 'epoch': 1.12}
{'loss': 0.6996, 'learning_rate': 3.6e-05, 'epoch': 1.28}
{'loss': 0.7002, 'learning_rate': 2.8000000000000003e-05, 'epoch': 1.44}
{'loss': 0.7013, 'learning_rate': 2e-05, 'epoch': 1.6}
{'loss': 0.6981, 'learning_rate': 1.2e-05, 'epoch': 1.76}
{'loss': 0.7, 'learning_rate': 4.000000000000001e-06, 'epoch': 1.92}
{'train_runtime': 4862.8476, 'train_samples_per_second': 10.282, 'train_steps_per_second': 1.285, 'train_loss': 0.65152576171875, 'epoch': 2.0}
Training loss: 0.65152576171875


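The lowest logged training loss in this part is 0.3088, reached at epoch 0.96 of the one-epoch run: {'loss': 0.3088, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.96}. Note that the `learning_rate` in this log line is the value the linear schedule had decayed to at that step, not the initial learning rate (0.0001) passed to `objective`. In the two-epoch run the loss settles near 0.69 (about ln 2), the value expected from a binary classifier that gives both classes equal probability, so that run did not learn a useful model. The next cell changes the hyperparameters to the logged values of the best line.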
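
The `learning_rate` and `epoch` values in the logs above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: linear decay to zero without warmup, one optimizer step per batch of 8 on a single device, and a log entry every 500 steps. The function `linear_lr` is a hypothetical helper written for this example, not a library call.

```python
import math


def linear_lr(step, total_steps, initial_lr):
    """Learning rate after `step` steps of linear decay to zero (no warmup)."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)


TRAIN_SIZE = 25_000      # size of the IMDB training split
BATCH_SIZE = 8           # per_device_train_batch_size in `objective`
INITIAL_LR = 1e-4        # learning_rate passed to `objective`
LOG_EVERY = 500          # assumed logging interval

steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)

for num_epochs in (1, 2):
    total_steps = steps_per_epoch * num_epochs
    print(f"--- {num_epochs} epoch(s), {total_steps} steps ---")
    for step in range(LOG_EVERY, total_steps + 1, LOG_EVERY):
        epoch = step / steps_per_epoch
        lr = linear_lr(step, total_steps, INITIAL_LR)
        print(f"step {step:>4}  epoch {epoch:.2f}  learning_rate {lr:.2e}")
```

Under these assumptions the last logged step of the one-epoch run (step 3000, epoch 0.96) has a learning rate of about 4e-06, which matches the log line quoted above. That value is therefore a by-product of the schedule rather than a tuned hyperparameter.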

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, config=BertConfig(num_labels=2, id2label=id2label, label2id=label2id))

trainer_args = transformers.TrainingArguments(
    output_dir = 'checkpoints',
    evaluation_strategy = 'steps',
    logging_strategy = 'steps',
    load_best_model_at_end = True,
    eval_steps = 100,
    logging_steps = 100,
    learning_rate = 4.000000000000001e-06,
    num_train_epochs = 0.96,  # has no effect here: a positive max_steps overrides num_train_epochs
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 32,
    max_steps = 500,
)

trainer1 = transformers.Trainer(
    model = model,
    args = trainer_args,
    train_dataset = train,
    eval_dataset = test,
    compute_metrics = compute_metrics,
    tokenizer = tokenizer,
    callbacks = [early_stopping],
)



### 3.3. Evaluation on test set

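The model trained in section 3.1 reaches an accuracy of 0.91 on the test set (0.91144 at step 500; the highest logged value, 0.91168, is at step 400).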

---

## 4. Results and summary

### 4.1 Corpus insights

The corpus is a large dataset for binary sentiment classification. It contains 50,000 labeled movie reviews: 25,000 for training and 25,000 for testing. This is a substantial sample for sentiment analysis tasks.
The corpus also includes 50,000 unlabeled reviews (the `unsupervised` split), that is, text samples without sentiment labels. This data can be used for semi-supervised or unsupervised learning, where the model learns from the unlabeled examples to improve its sentiment classification performance.
Overall, the corpus offers good opportunities for training and evaluating robust sentiment analysis models.


### 4.2 Results

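The model is a fine-tuned "bert-base-uncased". It reaches an accuracy of 0.91 on the test set (section 3.1). In the hyperparameter runs of section 3.2, the lowest logged training loss is 0.3088, at epoch 0.96 with a logged learning rate of 4.000000000000001e-06.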

### 4.3 Relation to state of the art

Compared to the current state-of-the-art accuracy on IMDb of 97.42%, obtained with document embeddings trained with cosine similarity, and the XLNet accuracy of 96.21% (https://paperswithcode.com/sota/sentiment-analysis-on-imdb), the 91% accuracy of my model is roughly 5 to 6 percentage points lower. Note that the model in section 3.1 was trained for only 500 steps (0.16 epochs).
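
The comparison can also be expressed in terms of error rate, which shows the size of the gap more clearly than raw accuracy. The short sketch below only does arithmetic on the three accuracy figures quoted in this report; the dictionary keys and variable names are made up for the example.

```python
# Accuracy figures quoted in this report
accuracies = {
    "this project (BERT, step 500)": 0.91144,
    "XLNet": 0.9621,
    "reported state of the art": 0.9742,
}

own = accuracies["this project (BERT, step 500)"]

for name, acc in accuracies.items():
    gap_points = (acc - own) * 100          # difference in percentage points
    error_ratio = (1 - own) / (1 - acc)     # how many times more errors this project makes
    print(f"{name}: accuracy {acc:.2%}, gap {gap_points:.2f} points, "
          f"error ratio {error_ratio:.2f}x")
```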
