<a href="https://colab.research.google.com/github/Iispar/dl-in-hlt-project/blob/main/course_project_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep learning in Human Language Technology Project (Template)

- Student(s) Name(s): Iiro Partanen
- Date: 18/10/2023
- Chosen Corpus: amazon_reviews_multi
- Contributions (if group project): -

### Corpus information

- Description of the chosen corpus: Amazon product reviews in English, Japanese, German, French, Chinese and Spanish. The English subset used here has 200k training, 5k validation and 5k test reviews.
- Paper(s) and other published materials related to the corpus: The corpus is described in the paper "The Multilingual Amazon Reviews Corpus".
- Random baseline performance and expected performance for recent machine learned models:
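
The random baseline can be estimated directly from the label distribution. Below is a minimal sketch, assuming five star-rating classes encoded as labels 0 to 4. The function names and the toy label list are illustrative assumptions, not taken from the corpus.

```python
from collections import Counter

def uniform_random_baseline(num_classes):
    """Expected accuracy when guessing uniformly at random among the classes."""
    return 1.0 / num_classes

def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy, perfectly balanced label list (hypothetical, not real corpus data).
toy_labels = [0, 1, 2, 3, 4] * 4

print(uniform_random_baseline(5))
print(majority_baseline(toy_labels))
```

On the real data, `majority_baseline` could be applied to the `label` column of the training split to check how balanced the classes are.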

---

## 1. Setup

In [1]:
!pip3 install -q transformers datasets evaluate
!pip install optuna
!pip install accelerate -U
from transformers import DistilBertTokenizer
from transformers import BertTokenizer, AutoTokenizer
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from transformers import DistilBertModel, BertModel
from transformers import AutoModelForSequenceClassification
import datasets
import sklearn.feature_extraction
import torch
import transformers
import numpy as np
import evaluate
import optuna



In [25]:
dset = 'mteb/amazon_reviews_multi'
model = 'bert-base-multilingual-cased' # multilingual BERT (base, cased); the cased variant seemed a reasonable fit for reviews... we will see.

---

## 2. Data download, sampling and preprocessing

### 2.1. Download the corpus

In [3]:
engDataset = datasets.load_dataset(dset, name='en')  # downloads and loads the English subset.
# check it works
print(engDataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
})


In [4]:
#FOR TESTING
# engDataset["train"] = engDataset["train"].select(range(100000))

In [5]:
print(engDataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
})


### 2.2. Sampling and preprocessing

In [6]:
engDataset = engDataset.shuffle() # shuffle the dataset for safety.
engDataset = engDataset.remove_columns(['id', 'label_text']) # removes everything that we don't need

In [7]:
# look at one example to see whether more preprocessing is needed

print(engDataset['train'][0]['text'])

# the title is separated from the body by \n\n, but other than that there are no problems. Looks good to me.

Nice

Works as described. I was hoping for good results since I have tried just about everything. Well this did the trick not to mention great for a back rub as well ! Would recommend to a friend!
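
The title/body layout seen in this example can be handled with a small helper. This is a sketch only: `split_title_body` is a hypothetical name, and the fallback for texts without a separator is an assumption, since the data shown here always contains one.

```python
def split_title_body(text, separator="\n\n"):
    """Split a review into (title, body) at the first separator.

    If the separator is missing, the whole text is treated as the body.
    """
    title, found, body = text.partition(separator)
    if not found:
        return "", text
    return title, body

# Toy review whose body itself contains a blank line.
example = "Nice\n\nWorks as described.\n\nWould recommend to a friend!"
title, body = split_title_body(example)
print(repr(title))
print(repr(body))
```

Splitting only at the first separator keeps a body that contains further blank lines in one piece.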


### Tokenization

In [8]:
# Load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained(model)

In [9]:
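# tokenizes one example as a (title, body) pair
def tokenize_example(example):
    # split the title from the review body at the first '\n\n' only,
    # so a body that itself contains blank lines is kept whole
    title, body = example['text'].split('\n\n', 1)
    return tokenizer(title, body,
             truncation='only_second',  # if the pair is too long, truncate the body, not the title
             add_special_tokens=True,
             return_attention_mask=True,
             return_overflowing_tokens=False,
             return_special_tokens_mask=False,
             max_length=512,
             padding=False)  # no padding here; batches are padded later by the data collator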

In [10]:
# map the whole dset

eng_tokenized = engDataset.map(tokenize_example)

Map:   0%|          | 0/200000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [11]:
print(eng_tokenized['train'][1])
print(tokenizer.decode(eng_tokenized['train'][1]['input_ids']))

# looks good to me.

{'text': "Arrived on time. Works as expected. Took away ...\n\nArrived on time. Works as expected. Took away a star simply because, I mean, it's just paint lol", 'label': 3, 'input_ids': [101, 18484, 48521, 10162, 10135, 10635, 119, 22241, 10146, 25973, 119, 27775, 10174, 14942, 119, 119, 119, 102, 18484, 48521, 10162, 10135, 10635, 119, 22241, 10146, 25973, 119, 27775, 10174, 14942, 169, 16624, 26097, 12373, 117, 146, 36110, 117, 10271, 112, 187, 12820, 72700, 10406, 10161, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] Arrived on time. Works as expected. Took away... [SEP] Arrived on time. Works as expected. Took away a star simply because, I mean, it's just paint lol [SEP]
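
The layout of the encoded pair above, `[CLS] title [SEP] body [SEP]` with segment ids 0 and 1, can be reproduced by hand. The following is a simplified sketch of that layout and of truncating only the second sequence. `build_pair` is a hypothetical helper, the toy ids are made up, and the real tokenizer handles more cases than this.

```python
CLS_ID, SEP_ID = 101, 102  # special-token ids visible in the encoded example above

def build_pair(title_ids, body_ids, max_length):
    """Lay out [CLS] title [SEP] body [SEP], truncating only the body."""
    # Three special tokens are always added around the two sequences.
    room_for_body = max_length - len(title_ids) - 3
    if room_for_body < 0:
        raise ValueError("title alone does not fit into max_length")
    body_ids = body_ids[:room_for_body]
    input_ids = [CLS_ID] + title_ids + [SEP_ID] + body_ids + [SEP_ID]
    # Segment 0 covers [CLS] title [SEP]; segment 1 covers body [SEP].
    token_type_ids = [0] * (len(title_ids) + 2) + [1] * (len(body_ids) + 1)
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

# Toy ids: a 2-token title and a 4-token body squeezed into 8 positions.
encoded = build_pair([11, 12], [21, 22, 23, 24], max_length=8)
print(encoded)
```

In the toy call the last body token is dropped, while the title is kept whole.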


---

## 3. Machine learning model

### 3.1. Model training

In [26]:
%%time
# config
import torch
import torch.nn as nn

# Create the BertClassifier class
class BertClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # hidden size of BERT base (768), hidden size of our classifier, and number of labels (in this case 5)
        D_in, H, D_out = 768, 25, 5

        # load the pretrained BERT.
        self.bert = BertModel.from_pretrained(model)

        # feed-forward classifier with one hidden layer that outputs the label logits.
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            nn.Linear(H, D_out)
        )

    def forward(self, input_ids, attention_mask, labels=None):
        # run BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)

        # extract the last hidden state of the [CLS] token (position 0) for classification
        last_hidden_state_cls = outputs[0][:, 0, :]

        # feed the [CLS] representation into the classifier. This outputs the logits.
        logits = self.classifier(last_hidden_state_cls)

        if labels is not None:
            # calculate the cross-entropy loss; the Trainer reads the loss from the first element.
            loss_fn = nn.CrossEntropyLoss()
            return (loss_fn(logits, labels), logits)
        # if no labels, just return the logits
        return (logits,)



CPU times: user 52 µs, sys: 0 ns, total: 52 µs
Wall time: 55.3 µs


In [27]:
accuracy = evaluate.load('accuracy')
def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1)  # pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)  # calc accuracy

Training was very slow at first: roughly 1 it/s with a batch size of 16. Memory is the limiting factor in Colab, so the goal is to reduce memory use. To do that, we freeze all parameters of the pretrained BERT and train only the classifier head, which should raise the throughput.

Together with other changes, this raised the speed to about 7 it/s with DistilBERT and 3.5 it/s with BERT, which keeps training feasible in Colab :)

In [28]:
model = BertClassifier()

In [30]:

# count the parameters that will be updated during training
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

# the pretrained BERT parameters should NOT change, so freeze them.
for name, param in model.named_parameters():
    if name.startswith('bert'):
        param.requires_grad = False

# trainable parameters after freezing (classifier head only)
print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 177,872,795 trainable parameters
The model has 19,355 trainable parameters
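
The second count printed above can be checked by hand: after freezing, only the two linear layers of the classifier head remain trainable. This is a small arithmetic sketch; `linear_params` is a hypothetical helper, and the layer sizes are the ones from the classifier definition.

```python
def linear_params(in_features, out_features):
    """Number of parameters of a fully connected layer with bias."""
    return in_features * out_features + out_features

D_in, H, D_out = 768, 25, 5  # sizes used by the classifier head
head_params = linear_params(D_in, H) + linear_params(H, D_out)
print(f"{head_params:,}")  # 19,355
```

The result matches the reported number of trainable parameters, which suggests that the freeze covered every BERT weight.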


In [32]:
# Training params. We optimize these later
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy = 'steps',
    logging_strategy = 'steps',
    eval_steps = 1000,
    logging_steps = 1000,
    learning_rate=1.2668184349858087e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    max_steps = 20000,  # a positive max_steps overrides num_train_epochs
    num_train_epochs=4,
    weight_decay=0.01,
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # pads each batch to its longest sequence
# early stopping on the eval loss; note that it is not passed to the Trainer below, so it has no effect in this run
early_stopping = transformers.EarlyStoppingCallback(3)

# Set the trainer
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = eng_tokenized['train'],
    eval_dataset = eng_tokenized['test'],  # note: the test split is used for evaluation during training
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_accuracy,
)

# train the model
trainer.train()


Step,Training Loss,Validation Loss,Accuracy
1000,0.9002,1.014742,0.57
2000,0.9901,1.003234,0.5708
3000,0.9889,0.995419,0.571
4000,0.9805,0.994904,0.5728
5000,0.9904,0.992882,0.5714
6000,0.9903,0.991252,0.574


KeyboardInterrupt: ignored
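
Because both `max_steps` and `num_train_epochs` are set, it helps to see how many steps one epoch actually takes. This is a back-of-the-envelope sketch, assuming a single device and no gradient accumulation; `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps in one pass over the data (single device, no gradient accumulation)."""
    return math.ceil(num_examples / batch_size)

num_train_examples = 200_000  # size of the English training split
batch_size = 8                # per_device_train_batch_size used for this run
max_steps = 20_000

per_epoch = steps_per_epoch(num_train_examples, batch_size)
print(per_epoch)              # steps needed for one full epoch
print(max_steps / per_epoch)  # fraction of an epoch covered by max_steps
```

Under these assumptions one epoch is 25,000 steps, so 20,000 steps cover 0.8 of an epoch, and the run interrupted after 6,000 steps saw roughly a quarter of the training data.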

### 3.2 Hyperparameter optimization

# Note
The search was run over multiple days because training a single model takes a while, so not all results are shown here. The best configuration found so far is:

- learning rate: 1.2668184349858087e-05
- batch size: 8
- eval steps: 500
- epochs: 4
- max steps: 20k (about 1 h of training time with a batch size of 16)
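
The learning rate is searched between 1e-5 and 1e-3 on a logarithmic scale, so each order of magnitude gets a similar share of the trials. The sketch below only illustrates what sampling uniformly in the log domain means, using the standard library. It is not Optuna's sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
samples = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(1000)]

# About half of the draws should fall below the geometric midpoint 1e-4.
below_mid = sum(s < 1e-4 for s in samples)
print(min(samples), max(samples), below_mid)
```

With a plain uniform range, almost all draws would land above 1e-4 and the small learning rates would rarely be tried.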

In [None]:
# Used optuna for optimization

def objective(trial):
    # Define the search space for hyperparameters
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 4, 16])
    epochs = trial.suggest_int('num_train_epochs', low=2, high=6)

    # params
    trainer_args = transformers.TrainingArguments(
        "mlp_checkpoints",
        evaluation_strategy = "steps",
        logging_strategy = "steps",
        eval_steps = 500,
        logging_steps = 500,
        learning_rate = learning_rate,
        max_steps = 20000,  # a positive max_steps overrides num_train_epochs
        load_best_model_at_end = True,
        per_device_train_batch_size = batch_size,
        per_device_eval_batch_size = batch_size,
        num_train_epochs = epochs
    )

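    # the model
    # note: this reuses the single global model object, so weights carry over from one trial to the next
    mlp = model
    early_stopping = transformers.EarlyStoppingCallback(3)  # stop if the eval loss has not improved for 3 evaluations in a row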

    # train a model
    trainer = transformers.Trainer(
        model = mlp,
        args = trainer_args,
        train_dataset = eng_tokenized['train'],
        eval_dataset = eng_tokenized['test'],
        compute_metrics = compute_accuracy,
        data_collator = data_collator,
        callbacks = [early_stopping]
    )

    trainer.train()
    eval_results = trainer.evaluate()
    return eval_results["eval_accuracy"]  # accuracy of the best checkpoint, reloaded by load_best_model_at_end

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

[I 2023-10-20 13:45:34,814] A new study created in memory with name: no-name-b603fa9d-16e8-49f9-bc32-c787468c389b


Step,Training Loss,Validation Loss,Accuracy
500,0.9099,0.997079,0.5746
1000,0.8195,1.008719,0.574
1500,0.9869,1.000296,0.5696
2000,0.9902,0.996743,0.5728
2500,0.9827,0.995004,0.5746
3000,0.9958,0.994205,0.5744
3500,0.9679,0.995007,0.5762
4000,0.9618,0.994474,0.5754
4500,0.9739,0.99276,0.576
5000,0.9689,0.992603,0.574


[I 2023-10-20 14:20:21,151] Trial 0 finished with value: 0.5752 and parameters: {'learning_rate': 1.3658191756866928e-05, 'batch_size': 8, 'num_train_epochs': 5}. Best is trial 0 with value: 0.5752.


Step,Training Loss,Validation Loss,Accuracy
500,0.9936,1.039765,0.5616
1000,0.9189,1.096972,0.5666
1500,0.8347,1.124508,0.568
2000,0.8585,1.07831,0.5712


[I 2023-10-20 14:25:19,341] Trial 1 finished with value: 0.5616 and parameters: {'learning_rate': 0.000577019568165523, 'batch_size': 4, 'num_train_epochs': 2}. Best is trial 0 with value: 0.5752.


Step,Training Loss,Validation Loss,Accuracy
500,0.9277,1.036753,0.5706
1000,0.8932,1.051766,0.5758
1500,0.8026,1.08726,0.5712
2000,0.8401,1.077527,0.569


[I 2023-10-20 14:30:06,867] Trial 2 finished with value: 0.5706 and parameters: {'learning_rate': 8.425792462844098e-05, 'batch_size': 4, 'num_train_epochs': 4}. Best is trial 0 with value: 0.5752.


Step,Training Loss,Validation Loss,Accuracy
500,0.9172,1.036504,0.5718
1000,0.8202,1.034463,0.5736
1500,0.9836,0.988773,0.5674
2000,0.9864,0.98816,0.5716
2500,0.9764,0.977015,0.5722
3000,0.9909,0.995965,0.5678
3500,0.9639,0.979503,0.5796
4000,0.9553,0.982051,0.572


[I 2023-10-20 14:44:10,791] Trial 3 finished with value: 0.5722 and parameters: {'learning_rate': 0.00033573804041473176, 'batch_size': 8, 'num_train_epochs': 4}. Best is trial 0 with value: 0.5752.


Step,Training Loss,Validation Loss,Accuracy
500,0.9093,0.999788,0.5716
1000,0.8717,1.022666,0.5766
1500,0.7844,1.062059,0.5746
2000,0.8279,1.060353,0.5736


[I 2023-10-20 14:49:39,228] Trial 4 finished with value: 0.5716 and parameters: {'learning_rate': 4.759571694386616e-05, 'batch_size': 4, 'num_train_epochs': 6}. Best is trial 0 with value: 0.5752.


Step,Training Loss,Validation Loss,Accuracy
500,0.8853,0.999845,0.5754
1000,0.7973,1.009366,0.5766
1500,0.9667,0.996172,0.5754
2000,0.9654,0.988089,0.5752
2500,0.9521,0.98657,0.5744
3000,0.9842,0.983769,0.5724
3500,0.9572,0.984,0.5736
4000,0.9495,0.982298,0.5734
4500,0.9624,0.980362,0.5736
5000,0.9582,0.979072,0.5738


[I 2023-10-20 15:25:02,114] Trial 5 finished with value: 0.5734 and parameters: {'learning_rate': 1.2668184349858087e-05, 'batch_size': 8, 'num_train_epochs': 4}. Best is trial 0 with value: 0.5752.


Step,Training Loss,Validation Loss,Accuracy
500,0.752,1.056753,0.5726
1000,0.9708,0.972293,0.5776
1500,0.962,0.971683,0.573
2000,0.9506,0.968672,0.5754
2500,0.9511,0.967259,0.5742
3000,0.9643,0.965218,0.5714
3500,0.9601,0.96483,0.5716
4000,0.9524,0.966709,0.5744
4500,0.968,0.965371,0.573
5000,0.9661,0.967153,0.5808
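
The short runs in the logs above are the result of early stopping with a patience of 3. The sketch below replays that rule on the validation losses of Trial 1. It is a simplified illustration with a hypothetical helper name, not the callback's actual implementation.

```python
def early_stopping_step(eval_losses, patience):
    """Return (index of the evaluation where training stops, index of the best evaluation).

    Training stops once the loss has failed to improve for `patience`
    evaluations in a row. If that never happens, the stop index is the
    last evaluation.
    """
    best_index = 0
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if i == 0 or loss < eval_losses[best_index]:
            best_index = i
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i, best_index
    return len(eval_losses) - 1, best_index

# Validation losses of Trial 1 at steps 500, 1000, 1500 and 2000.
trial_1_losses = [1.039765, 1.096972, 1.124508, 1.07831]
stop, best = early_stopping_step(trial_1_losses, patience=3)
print(stop, best)  # 3 0
```

The rule stops at the fourth evaluation (step 2000) and keeps the first one (step 500) as best, which agrees with Trial 1 reporting 0.5616, the accuracy logged at step 500.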


### 3.3. Evaluation on test set

In [None]:
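# Evaluate the final model on held-out data.
# The test split was already used for evaluation during training, so the validation split,
# which played no part in training or tuning, serves as the held-out set here.
eval_results = trainer.evaluate(eng_tokenized["validation"])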

print(eval_results)

### 3.4. Multilingual and cross-lingual experiments

In [None]:
# Your code to train and evaluate the multilingual and cross-lingual models

---

## 4. Results and summary

### 4.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2 Results

(Briefly summarize your results)

### 4.3 Relation to random baseline / expected performance / state of the art

(Compare your results to the random and state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. Data selection

(Briefly describe how many English and target language examples were used and how these were selected, include relevant code)

### 5.2 Sentence representations

In [None]:
# Your code to create a sentence embedding for the given text here

### 5.3. Cosine similarity

In [None]:
# Your code to calculate the cosine similarity of the embeddings and select the target sentence that maximizes the cosine similarity here
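
As a starting point for this step, here is a minimal, dependency-free sketch of cosine similarity and of picking the target that maximizes it. The function names and the toy 2-dimensional vectors are assumptions for illustration; real sentence embeddings would come from the model in 5.2.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(source, targets):
    """Return (index, score) of the target vector with the highest cosine similarity."""
    scores = [cosine_similarity(source, t) for t in targets]
    best_index = max(range(len(targets)), key=scores.__getitem__)
    return best_index, scores[best_index]

# Toy embeddings (hypothetical): the second target points in the same direction as the source.
source = [1.0, 0.0]
targets = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
best_index, best_score = most_similar(source, targets)
print(best_index, best_score)
```

Cosine similarity ignores vector length, which is why the target `[2.0, 0.0]` is selected even though its norm differs from the source.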

### 5.4 Bonus task evaluation

(Present the evaluation results here)