<a href="https://colab.research.google.com/github/Iispar/dl-in-hlt-project/blob/main/course_project_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep learning in Human Language Technology Project (Template)

- Student(s) Name(s): Iiro Partanen
- Date: 18/10/2023
- Chosen Corpus: amazon_reviews_multi
- Contributions (if group project): -

### Corpus information

- Description of the chosen corpus: Amazon product reviews in English, Japanese, German, French, Chinese and Spanish, with 200k training reviews per language.
- Paper(s) and other published materials related to the corpus: the corpus is described in the paper "The Multilingual Amazon Reviews Corpus".
- Random baseline performance and expected performance for recent machine learned models:
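
For the random baseline, a rough sanity check is possible without any model. The sketch below assumes that the five star ratings are equally frequent (an assumption, not verified here). Under that assumption, uniform guessing should land near 1/5 = 20% accuracy:

```python
import random

# Assumption: five star-rating classes (0-4) that are equally frequent.
NUM_CLASSES = 5
N = 100_000

rng = random.Random(0)  # fixed seed so the sketch is reproducible
labels = [rng.randrange(NUM_CLASSES) for _ in range(N)]
guesses = [rng.randrange(NUM_CLASSES) for _ in range(N)]

# Fraction of uniformly random guesses that happen to match the label.
random_accuracy = sum(g == y for g, y in zip(guesses, labels)) / N
print(f"simulated random accuracy: {random_accuracy:.3f}")
print(f"analytic value: {1 / NUM_CLASSES:.3f}")
```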

---

## 1. Setup

In [None]:
!pip3 install -q transformers datasets evaluate
!pip install optuna
!pip install accelerate -U
from transformers import DistilBertTokenizer
from transformers import BertTokenizer
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from transformers import DistilBertModel, BertModel
from transformers import AutoModelForSequenceClassification
import datasets
import sklearn.feature_extraction
import torch
import transformers
import numpy as np
import evaluate
import optuna

In [None]:
dset = 'mteb/amazon_reviews_multi'
model = 'bert-base-multilingual-cased' # multilingual BERT base, cased, because I feel like the cased variant would fit well with reviews... we will see.

---

## 2. Data download, sampling and preprocessing

### 2.1. Download the corpus

In [None]:
engDataset = datasets.load_dataset(dset, name='en') # imports the English part of the dataset.
# check it works
print(engDataset)

In [None]:
#FOR TESTING
# engDataset["train"] = engDataset["train"].select(range(100000))

In [None]:
print(engDataset)

### 2.2. Sampling and preprocessing

In [None]:
engDataset = engDataset.shuffle() # shuffle the dataset for safety.
engDataset = engDataset.remove_columns(['id', 'label_text']) # removes everything that we don't need

In [None]:
# let's look at one example to see if there is more preprocessing to be done

print(engDataset['train'][0]['text'])

# looks like the title is separated from the review body by \n\n, but other than that there are no problems. Looks good to me.

Nice fanny pack, but smaller then expected

Good quality, but smaller than I expected.
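
Since the title and the body are separated by `\n\n`, the two parts can be pulled apart with plain string methods. A minimal sketch using the review printed above; the second input (`no_separator`) is a made-up edge case, not taken from the corpus:

```python
# The review printed above, in its "title\n\nbody" layout.
sample = "Nice fanny pack, but smaller then expected\n\nGood quality, but smaller than I expected."

# partition() always returns three parts, so a text without "\n\n"
# yields an empty body instead of raising an IndexError.
title, _, body = sample.partition("\n\n")
print(repr(title))
print(repr(body))

# Made-up edge case: a text that has no separator at all.
no_separator = "Only a title"
title2, _, body2 = no_separator.partition("\n\n")
print(repr(title2), repr(body2))
```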


### Tokenization

In [None]:
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained(model)

In [None]:
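# tokenizes one example as a (title, body) sequence pair
def tokenize_example(example):
    # the title and the review body are separated by the first '\n\n';
    # partition() leaves the body empty if the separator is missing
    title, _, body = example['text'].partition('\n\n')
    return tokenizer(title, body,
             truncation='only_second',  # only the body is shortened if the pair is too long
             add_special_tokens=True,
             max_length=512,
             padding='max_length')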
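
To make the effect of `truncation='only_second'` concrete, here is a simplified pure-Python illustration. It is not the tokenizer's actual implementation. The helper `truncate_only_second` and the toy token lists are hypothetical, and `num_special=3` assumes the BERT pair layout `[CLS] title [SEP] body [SEP]`:

```python
def truncate_only_second(first, second, max_length, num_special=3):
    """Simplified illustration: shorten only the second segment.

    num_special=3 assumes the BERT pair layout [CLS] first [SEP] second [SEP].
    """
    budget = max_length - num_special - len(first)
    if budget < 0:
        # Choice made for this sketch only: refuse instead of cutting the first segment.
        raise ValueError("first segment alone exceeds max_length")
    return first, second[:budget]


# Toy word-level "tokens"; a real tokenizer would produce subword ids.
title_tokens = ["nice", "fanny", "pack"]
body_tokens = ["good", "quality", "but", "smaller", "than", "expected"]

first, second = truncate_only_second(title_tokens, body_tokens, max_length=10)
print(first, second)
```

The title is kept whole and only the body loses tokens, which matches the intent of keeping the short, informative title intact.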

In [None]:
# map the whole dset

eng_tokenized = engDataset.map(tokenize_example)

Map:   0%|          | 0/200000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [None]:
print(eng_tokenized['train'][1])
print(tokenizer.decode(eng_tokenized['train'][1]['input_ids']))

# looks good to me.

---

## 3. Machine learning model

### 3.1. Model training

In [None]:
%%time
# config (the %%time cell magic has to be on the first line of the cell)
import torch
import torch.nn as nn

# Create the BertClassifier class
class BertClassifier(nn.Module):
    def __init__(self):
        super(BertClassifier, self).__init__()
        # hidden size of BERT (768 for the base models), hidden size of our classifier, and number of labels (in this case 5)
        D_in, H, D_out = 768, 25, 5

        # load the pretrained bert.
        self.bert = BertModel.from_pretrained(model)

        # basic one layer feed forward network that outputs the labels.
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            #nn.Dropout(0.5),
            nn.Linear(H, D_out)
        )
    def forward(self, input_ids, attention_mask, labels=None):
        # run the BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)

        # Extract the last hidden state of the [CLS] token (position 0) for classification
        last_hidden_state_cls = outputs[0][:, 0, :]

        # Feed the [CLS] representation into the classifier. This outputs the logits for the labels.
        logits = self.classifier(last_hidden_state_cls)

        if labels is not None:
            # calculate the cross-entropy loss
            loss_fn = torch.nn.CrossEntropyLoss()
            return (loss_fn(logits, labels), logits)
        else:
            # if no labels, just return the logits
            return (logits,)


# AdamW from torch.optim (the transformers implementation is deprecated)
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup


In [None]:
accuracy = evaluate.load('accuracy')
def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1) # pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels) # calculate accuracy
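
As a small worked example of what the accuracy function does, the sketch below applies the same argmax step to made-up logits for four reviews and five classes. The numbers are purely illustrative:

```python
import numpy as np

# Toy logits for 4 reviews and 5 classes (made-up numbers for illustration).
logits = np.array([
    [0.1, 2.0, 0.3, 0.0, -1.0],   # highest score at index 1
    [1.5, 0.2, 0.1, 0.0, 0.3],    # highest score at index 0
    [0.0, 0.1, 0.2, 0.3, 4.0],    # highest score at index 4
    [0.9, 0.8, 1.1, 0.2, 0.1],    # highest score at index 2
])
labels = np.array([1, 0, 3, 2])

# Same step as in compute_accuracy: pick the index of the largest logit per row.
predictions = np.argmax(logits, axis=-1)
toy_accuracy = float((predictions == labels).mean())
print(predictions, toy_accuracy)  # three of the four predictions match -> 0.75
```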

Training was really slow at first: about 1 it/s with a batch size of 16. The bottleneck in Colab is memory, so let's try to bring the memory use down. To do that, let's freeze all the parameters that come from the pretrained BERT, so that only the classifier head is trained. This ought to speed things up a bit.

Together with other changes, this got the speed up to about 7 it/s with DistilBERT and 3.5 it/s with BERT, which keeps the training feasible in Colab :)

In [None]:
custom_model = BertClassifier() # separate name, so that `model` keeps holding the checkpoint name

In [None]:

# count the params that we can change
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(custom_model):,} trainable parameters')

# params that are from the BERT we should NOT change, so let's freeze those.
for name, param in custom_model.named_parameters():
    if name.startswith('bert'):
        param.requires_grad = False

# actual changeable params (only the classifier head is left)
print(f'The model has {count_parameters(custom_model):,} trainable parameters')
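
Once the BERT weights are frozen, only the classifier head stays trainable, and its size can be worked out by hand. The sketch below assumes that both `nn.Linear` layers keep their default bias terms, as in the class definition above:

```python
# Sizes taken from BertClassifier: D_in=768, H=25, D_out=5.
D_in, H, D_out = 768, 25, 5

# A linear layer has in_features * out_features weights plus out_features biases.
first_layer = D_in * H + H          # 768 * 25 + 25
second_layer = H * D_out + D_out    # 25 * 5 + 5
head_params = first_layer + second_layer

print(f"{first_layer:,} + {second_layer:,} = {head_params:,}")  # 19,225 + 130 = 19,355
```

The second count printed after freezing should therefore be on the order of twenty thousand parameters, a tiny fraction of the full model.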

In [None]:
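# Fallback: if the custom classifier is not fast enough, just use the stock classification model :(

model = AutoModelForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=5) # load BERT with a 5-label classification head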

In [None]:
# Training params. We optimize these later
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy = 'steps',
    logging_strategy = 'steps',
    eval_steps = 500,
    logging_steps = 500,
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    max_steps = 20000, # takes precedence over num_train_epochs
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end = True, # required by EarlyStoppingCallback
  )

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
early_stopping = transformers.EarlyStoppingCallback(3) # stop training if the eval loss has not improved for 3 evaluations.

# Set the trainer
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = eng_tokenized['train'],
    eval_dataset = eng_tokenized['validation'], # keep the test split for the final evaluation in 3.3
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_accuracy,
    callbacks = [early_stopping], # without this the callback is never used
)

# train the model
trainer.train()
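
The training arguments set both `max_steps` and `num_train_epochs`. When `max_steps` is positive it overrides the epoch count, so it is worth checking how much of the data the run actually sees. A quick calculation, assuming a single device and no gradient accumulation:

```python
import math

train_examples = 200_000   # size of the English train split (see the map() output above)
batch_size = 16            # per_device_train_batch_size
max_steps = 20_000
num_train_epochs = 5

# Assumption: one device and no gradient accumulation,
# so one optimizer step consumes exactly one batch.
steps_per_epoch = math.ceil(train_examples / batch_size)
steps_for_all_epochs = steps_per_epoch * num_train_epochs
epochs_covered = max_steps / steps_per_epoch

print(steps_per_epoch, steps_for_all_epochs, epochs_covered)  # 12500 62500 1.6
```

With these settings the run stops after about 1.6 epochs, well short of the 5 epochs requested.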


### 3.2 Hyperparameter optimization

Note: the optimization was run over multiple days, because training one model takes a bit of time, so you won't see all the results here. The best configuration found so far is:

In [None]:
# Used optuna for optimization

def objective(trial):
    # Define the search space for hyperparameters
    learning_rate = trial.suggest_float("learning_rate", 1e-7, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [4, 8, 16])
    # no trailing comma here: it would turn epochs into a tuple
    epochs = trial.suggest_int("num_train_epochs", low=2, high=6)

    # params
    trainer_args = transformers.TrainingArguments(
        "mlp_checkpoints",
        evaluation_strategy = "steps",
        logging_strategy = "steps",
        eval_steps = 500,
        logging_steps = 500,
        learning_rate = learning_rate,
        max_steps = 20000, # caps the run; takes precedence over num_train_epochs
        load_best_model_at_end = True,
        per_device_train_batch_size = batch_size,
        per_device_eval_batch_size = batch_size,
        num_train_epochs = epochs
    )

    # a fresh model for every trial, so that trials do not continue from each other's weights
    mlp = AutoModelForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=5)
    early_stopping = transformers.EarlyStoppingCallback(3) # stop training if the eval loss has not improved for 3 evaluations.

    # train a model
    trainer = transformers.Trainer(
        model = mlp,
        args = trainer_args,
        train_dataset = eng_tokenized['train'],
        eval_dataset = eng_tokenized['validation'], # keep the test split for the final evaluation in 3.3
        compute_metrics = compute_accuracy,
        data_collator = data_collator,
        callbacks = [early_stopping]
    )

    trainer.train()
    eval_results = trainer.evaluate()
    return eval_results["eval_accuracy"] # accuracy of the best checkpoint (selected by eval loss)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
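
The learning rate is searched with `log=True`, which means values are drawn uniformly in log space, so every order of magnitude between 1e-7 and 1e-3 is roughly equally likely. The sketch below illustrates that idea with the standard library only. It is not Optuna's implementation, and `sample_log_uniform` is a hypothetical helper:

```python
import math
import random

rng = random.Random(0)   # fixed seed so the sketch is reproducible
low, high = 1e-7, 1e-3   # same bounds as the learning_rate search space


def sample_log_uniform(rng, low, high):
    # Draw uniformly between log10(low) and log10(high), then map back.
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent


samples = [sample_log_uniform(rng, low, high) for _ in range(10_000)]

# Each of the four decades should receive roughly a quarter of the samples.
shares = {}
for k in range(-7, -3):
    shares[k] = sum(10 ** k <= s < 10 ** (k + 1) for s in samples) / len(samples)
    print(f"[1e{k}, 1e{k + 1}): {shares[k]:.2f}")
```

A plain uniform draw over the same interval would instead put about 90% of the samples between 1e-4 and 1e-3, which is why the log scale is the usual choice for learning rates.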

### 3.3. Evaluation on test set

In [None]:
# Your code to evaluate the final model on the test set here

### 3.4. Multilingual and cross-lingual experiments

In [None]:
# Your code to train and evaluate the multilingual and cross-lingual models

---

## 4. Results and summary

### 4.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2 Results

(Briefly summarize your results)

### 4.3 Relation to random baseline / expected performance / state of the art

(Compare your results to the random and state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. Data selection

(Briefly describe how many English and target language examples were used and how these were selected, include relevant code)

### 5.2 Sentence representations

In [None]:
# Your code to create a sentence embedding for the given text here

### 5.3. Cosine similarity

In [None]:
# Your code to calculate the cosine similarity of the embeddings and select the target sentence that maximizes the cosine similarity here
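
As a starting point for this step, here is a minimal sketch of the cosine similarity computation and the argmax selection. The 3-dimensional vectors are made up for illustration; real sentence embeddings would come from the model in 5.2. The helper name `cosine_similarity_matrix` is hypothetical, and zero-length vectors are not handled:

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


# Made-up 3-dimensional "embeddings" standing in for real sentence embeddings.
source_embeddings = np.array([[1.0, 0.0, 0.0],
                              [0.0, 1.0, 1.0]])
target_embeddings = np.array([[0.0, 2.0, 2.0],
                              [3.0, 0.1, 0.0],
                              [1.0, 1.0, 1.0]])

similarities = cosine_similarity_matrix(source_embeddings, target_embeddings)
# For each source sentence, pick the target with the highest cosine similarity.
best_target = similarities.argmax(axis=1)
print(best_target)  # [1 0]
```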

### 5.4 Bonus task evaluation

(Present the evaluation results here)