<a href="https://colab.research.google.com/github/Iispar/dl-in-hlt-project/blob/main/course_project_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep learning in Human Language Technology Project (Template)

- Student(s) Name(s): Iiro Partanen
- Date: 18/10/2023
- Chosen Corpus: amazon_reviews_multi
- Contributions (if group project): -

### Corpus information

- Description of the chosen corpus:
- Paper(s) and other published materials related to the corpus:
- Random baseline performance and expected performance for recent machine learned models:

---

## 1. Setup

In [None]:
!pip3 install -q transformers datasets evaluate
!pip install optuna
from transformers import BertTokenizer
from transformers import AutoTokenizer
import datasets
import sklearn.feature_extraction
import torch
import transformers
import numpy as np
import evaluate
import optuna

In [75]:
dset = 'mteb/amazon_reviews_multi'
model = 'bert-base-cased' # BERT base, cased: a guess that preserving capitalization suits review text; to be verified experimentally.

---

## 2. Data download, sampling and preprocessing

### 2.1. Download the corpus

In [21]:
engDataset = datasets.load_dataset(dset, name='en') # loads the English subset of the dataset
# check that it works
print(engDataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 5000
    })
})


### 2.2. Sampling and preprocessing

In [22]:
engDataset = engDataset.shuffle() # shuffle so that any ordering in the source data does not affect later sampling
engDataset = engDataset.remove_columns(['id', 'label_text']) # drop the columns that are not needed for training

{'text': 'Kept my interest\n\nTerrific read. Very scary.', 'label': 4}


In [47]:
# Let's look at one example to see if more preprocessing is needed.

print(engDataset['train'][0]['text'])

# The title is separated from the review body by '\n\n'; other than that there are no problems. Looks good to me.

Kept my interest

Terrific read. Very scary.
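
The title/body separation noted above can be made explicit with a small helper. This is only a sketch: the name `split_review` is hypothetical, and it assumes that the first blank line (`'\n\n'`) is always the boundary between title and review body, which has only been checked on the example printed above.

```python
def split_review(text: str) -> tuple[str, str]:
    """Split a raw review into (title, body) at the first blank line.

    Assumption: the first '\n\n' separates the title from the body.
    If there is no separator, the whole text is treated as the title
    and the body is empty.
    """
    title, _, body = text.partition("\n\n")
    return title.strip(), body.strip()


sample = "Kept my interest\n\nTerrific read. Very scary."
title, body = split_review(sample)
print(title)
print(body)
```

Using `str.partition` rather than `str.split` keeps any later blank lines inside the body and avoids an `IndexError` when a review has no title separator.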


### 2.3. Tokenization

In [86]:
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained(model) # the tokenizer's default maximum length matches BERT's (512 tokens)
# BertTokenizer is used instead of AutoTokenizer to make explicit that a BERT tokenizer, which returns token type ids, is loaded.

In [108]:
# Tokenizes one example as a (title, review) pair.
def tokenize_example(example):
    # Split the title from the review at the first '\n\n' so they can be passed to the tokenizer separately.
    # partition() leaves the review empty instead of raising an error if the separator is missing.
    title, _, review = example['text'].partition('\n\n')
    return tokenizer.encode_plus(title, review, # tokenizes the title and the review as a sentence pair
                          truncation='only_second', # if the 512-token limit is exceeded, always truncate the second segment (the review)
                          add_special_tokens=True, # adds [CLS] and [SEP]
                          max_length=512, # BERT's maximum sequence length; anything beyond it is cut
                          padding=True) # pads to the longest sequence in the batch, so a single example is left unpadded

In [98]:
print(tokenize_example(engDataset['train'][0])) # check it works.

{'input_ids': [101, 26835, 6451, 1139, 2199, 102, 12008, 14791, 21361, 2373, 119, 6424, 13952, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
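
The `token_type_ids` in the output above follow the usual BERT sentence-pair layout: `[CLS] title [SEP]` is segment 0 and `review [SEP]` is segment 1. The sketch below reproduces that layout from token counts alone. The function name `segment_ids` is hypothetical and the code does not call the tokenizer; it only illustrates the pattern.

```python
def segment_ids(num_title_tokens: int, num_review_tokens: int) -> list[int]:
    """Build BERT-style segment ids for a (title, review) pair."""
    # [CLS] + title tokens + [SEP] belong to segment 0.
    first = [0] * (1 + num_title_tokens + 1)
    # Review tokens + the final [SEP] belong to segment 1.
    second = [1] * (num_review_tokens + 1)
    return first + second


# In the printed example the title has 4 wordpieces and the review has 8.
ids = segment_ids(4, 8)
print(ids)
```

For the example above this gives six zeros followed by nine ones, the same pattern as the printed `token_type_ids`.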


In [None]:
# tokenize the whole dataset (all splits)

eng_tokenized = engDataset.map(tokenize_example)

In [119]:
print(eng_tokenized['train'][10]) # looks good to me.

{'text': 'Fits perfectly\n\nFits perfectly in my Tucson 2018, only 4 starts due to some marks in parts for lack of protective packing.', 'label': 3, 'input_ids': [101, 17355, 2145, 6150, 102, 17355, 2145, 6150, 1107, 1139, 18740, 1857, 117, 1178, 125, 3816, 1496, 1106, 1199, 6216, 1107, 2192, 1111, 2960, 1104, 9760, 16360, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
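
Both examples shown so far are far below the 512-token limit, so truncation never triggers. To illustrate what `truncation='only_second'` is meant to do for long reviews, here is a simplified model in plain Python. The function `truncate_only_second` is hypothetical and works on lists of ids; the real tokenizer handles edge cases (for example a title that alone exceeds the limit) in its own way.

```python
def truncate_only_second(title_ids, review_ids, max_length=512, num_special=3):
    """Trim only the review so the pair fits into max_length.

    num_special counts [CLS] and the two [SEP] tokens of a sentence pair.
    Assumption: tokens are removed from the end of the review.
    """
    budget = max_length - num_special - len(title_ids)
    if budget < 0:
        raise ValueError("the title alone does not fit into max_length")
    return title_ids, review_ids[:budget]


title_ids = list(range(4))     # stand-in for 4 title wordpieces
review_ids = list(range(600))  # stand-in for an overly long review
kept_title, kept_review = truncate_only_second(title_ids, review_ids)
total = len(kept_title) + len(kept_review) + 3
print(len(kept_review), total)
```

The title is always kept whole, which matches the intent of the tokenization cell: the title is short and informative, so the review absorbs any cut.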


---

## 3. Machine learning model

### 3.1. Model training

In [None]:
# Your code to train the transformer based model on the training set and evaluate the performance on the validation set here
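
Evaluating on the validation set needs a metric. For five-way star prediction, plain accuracy is one straightforward choice. The sketch below computes it from raw model scores in plain Python; the name `accuracy` and the toy values are illustrative only, and the `evaluate` library imported in the setup cell could be used instead.

```python
def accuracy(logits, labels):
    """Share of examples whose highest-scoring class equals the gold label.

    logits: one list of per-class scores per example.
    labels: one integer class id per example.
    """
    # Argmax over the class scores of each example.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy scores for three examples over five classes (not real model output).
toy_logits = [
    [0.1, 0.2, 0.3, 0.1, 2.0],
    [1.5, 0.2, 0.3, 0.1, 0.0],
    [0.1, 0.2, 0.9, 0.1, 0.0],
]
toy_labels = [4, 0, 3]
print(accuracy(toy_logits, toy_labels))
```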

### 3.2 Hyperparameter optimization

In [None]:
# Your code for hyperparameter optimization here

### 3.3. Evaluation on test set

In [None]:
# Your code to evaluate the final model on the test set here

### 3.4. Multilingual and cross-lingual experiments

In [None]:
# Your code to train and evaluate the multilingual and cross-lingual models

---

## 4. Results and summary

### 4.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2 Results

(Briefly summarize your results)

### 4.3 Relation to random baseline / expected performance / state of the art

(Compare your results to the random and state-of-the-art performance)
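
As a reference point for that comparison, two simple baselines can be computed without a model: guessing uniformly at random, and always predicting the most frequent label. The sketch below uses a toy label list, not the corpus itself, and the function names are hypothetical.

```python
from collections import Counter


def uniform_baseline(num_classes: int) -> float:
    """Expected accuracy when each class is guessed with equal probability."""
    return 1 / num_classes


def majority_baseline(labels) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Toy labels with five equally frequent classes (0-4), for illustration only.
toy_labels = [0, 1, 2, 3, 4] * 2
print(uniform_baseline(5))
print(majority_baseline(toy_labels))
```

If the five star labels are equally frequent in a split, both baselines come out at 0.2; computing `majority_baseline` on the actual labels would confirm whether that holds.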

---

## 5. Bonus Task (optional)

### 5.1. Data selection

(Briefly describe how many English and target language examples were used and how these were selected, include relevant code)

### 5.2 Sentence representations

In [None]:
# Your code to create a sentence embedding for the given text here
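
The template does not say how the sentence embedding should be built. One common option is mean pooling: average the token vectors of a sentence while ignoring padding positions. The sketch below shows the idea on plain Python lists; `mean_pool` is a hypothetical name, and real code would take the token vectors from the model's last hidden layer.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention mask is 1.

    token_vectors: list of equal-length lists of floats, one per token.
    attention_mask: list of 0/1 flags, 1 for real tokens, 0 for padding.
    """
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens and one padding token (toy 2-dimensional vectors).
vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
embedding = mean_pool(vectors, mask)
print(embedding)
```

Masking matters because padded positions would otherwise pull the average toward whatever vector the model emits for padding.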

### 5.3. Cosine similarity

In [None]:
# Your code to calculate the cosine similarity of the embeddings and select the target sentence that maximizes the cosine similarity here
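
A minimal sketch of this step, assuming embeddings are plain lists of floats: compute the cosine similarity between the source embedding and every candidate, then pick the candidate with the highest score. The names `cosine_similarity` and `best_match` are illustrative; with NumPy or torch the same computation would be vectorized.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(source, targets):
    """Index of the target embedding most similar to the source embedding."""
    scores = [cosine_similarity(source, t) for t in targets]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2-dimensional embeddings, for illustration only.
source = [1.0, 0.0]
targets = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
print(best_match(source, targets))
```

Cosine similarity ignores vector length, so the second candidate wins here even though its norm differs from the source's.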

### 5.4 Bonus task evaluation

(Present the evaluation results here)