# Deep learning in Human Language Technology Project

- Student(s) Name(s): Faiza Anan Noor
- Date: 30/10/2023
- Chosen Corpus: mteb/amazon_reviews_multi (https://huggingface.co/datasets/mteb/amazon_reviews_multi)
- Contributions (if group project): Self-Project
-  Contact Information: Email: fanoor@utu.fi,
Phone number: +358 465200826



---

## 1. Setup

 Here is our code to install and import libraries etc.


In [None]:
# Install the required libraries (run once per Colab session)
!pip install -q -U accelerate datasets sentencepiece "transformers[torch]"

# Imports used throughout the notebook
import torch
from torch import nn
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import normalize
from datasets import load_dataset, load_dataset_builder, get_dataset_split_names
from transformers import (
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    XLMRobertaForSequenceClassification,
    XLMRobertaTokenizer,
    pipeline,
)



I have read the paper presenting the corpus, as well as other relevant published material about it, and identified both the random baseline performance (obtained by selecting the label at random) and the expected performance of recent machine-learned models on this corpus. The paper describing the data was the main source for this. After training and evaluating the models, I wrote a summary of the results, the findings and their relation to existing state-of-the-art work in Section 4.
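As a reference point for the random baseline mentioned above, here is a minimal sketch. It assumes the five labels are equally frequent; the helper names and the toy label list are illustrative and not part of the notebook. Under that assumption, uniform random guessing has an expected accuracy of 1/5, and a model that outputs a uniform distribution has a cross-entropy loss of ln 5.

```python
import math
import random

NUM_LABELS = 5  # star ratings encoded as labels 0..4


def expected_random_accuracy(num_labels):
    """Expected accuracy of uniform random guessing on balanced labels."""
    return 1.0 / num_labels


def uniform_cross_entropy(num_labels):
    """Cross-entropy of a model that always predicts the uniform distribution."""
    return math.log(num_labels)


def simulate_random_accuracy(labels, num_labels, seed=0):
    """Empirical accuracy of one run of uniform random guessing."""
    rng = random.Random(seed)
    hits = sum(1 for y in labels if rng.randrange(num_labels) == y)
    return hits / len(labels)


# Hypothetical balanced toy labels (not the real corpus).
toy_labels = [i % NUM_LABELS for i in range(10_000)]

print("expected accuracy:", expected_random_accuracy(NUM_LABELS))
print("uniform cross-entropy:", round(uniform_cross_entropy(NUM_LABELS), 3))
print("simulated accuracy:", simulate_random_accuracy(toy_labels, NUM_LABELS))
```

These two numbers (about 0.2 accuracy and about 1.61 loss) are useful yardsticks when reading the accuracies and evaluation losses reported later in the notebook.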

---

## 2. Data download, sampling and preprocessing

### 2.1. Download the corpus
I selected a multilingual text classification corpus to work with, the "mteb/amazon_reviews_multi" dataset, and downloaded it.

In [None]:
dataset = load_dataset("mteb/amazon_reviews_multi")

### 2.2. Sampling and preprocessing

In this step, the train, test and validation splits of the dataset are filtered to keep only English reviews. The English subset is then tokenized, and each tokenized split is shuffled with a fixed seed and cut down to N = 1000 rows, giving the English train, validation and test sets. This validation set is also used for the cross-lingual and multilingual experiments.


In [None]:
# Filter the dataset to include only English reviews
dataset_all_train, dataset_all_test, dataset_all_val = load_dataset("mteb/amazon_reviews_multi", split=['train', 'test', 'validation'])
dataset_eng_train = dataset_all_train.filter(lambda example: example["id"].split("_")[0] == "en")
dataset_eng_test = dataset_all_test.filter(lambda example: example["id"].split("_")[0] == "en")
dataset_eng_val = dataset_all_val.filter(lambda example: example["id"].split("_")[0] == "en")


In [None]:
# Keep only English reviews (the language code is the prefix of the id)
N = 1000  # number of examples kept per split
en_dataset = dataset.filter(lambda example: example["id"].split("_")[0] == "en")

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
def tokenize_function(example):
    # Explicitly tokenize and convert to input_ids
    inputs = tokenizer(example["text"], padding="max_length", truncation=True, max_length=256, return_tensors="pt")
    inputs["labels"] = example["label"]
    return inputs

tokenized_datasets = en_dataset.map(tokenize_function, batched=True)

# Split the data into training, validation, and testing sets
train_dataset = tokenized_datasets["train"].shuffle(seed= 43).filter(lambda e, i: i<N, with_indices=True)
validation_dataset = tokenized_datasets["validation"].shuffle(seed= 43).filter(lambda e, i: i<N, with_indices=True)
test_dataset = tokenized_datasets["test"].shuffle(seed= 43).filter(lambda e, i: i<N, with_indices=True)
#print("train dataset ",len(train_dataset["id"]))
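The selection logic used in the cell above (filter by the language prefix of the id, shuffle with a fixed seed, keep the first N rows) can be sketched with the standard library alone. The rows, ids and helper names below are made up for illustration, and Python's `random` module will not reproduce the exact permutation produced by the `datasets` library.

```python
import random


def language_of(example_id):
    """Return the language prefix of an id such as 'en_001'."""
    return example_id.split("_")[0]


def sample_language(rows, language, n, seed=43):
    """Keep rows of one language, shuffle them reproducibly, return the first n."""
    selected = [row for row in rows if language_of(row["id"]) == language]
    random.Random(seed).shuffle(selected)
    return selected[:n]


# Hypothetical toy rows; ids and texts are invented for this sketch.
rows = [
    {"id": "en_001", "text": "Great product", "label": 4},
    {"id": "fr_001", "text": "Très bien", "label": 4},
    {"id": "en_002", "text": "Never arrived", "label": 0},
    {"id": "de_001", "text": "Gut", "label": 3},
    {"id": "en_003", "text": "It is okay", "label": 2},
]

subset = sample_language(rows, "en", n=2)
print([row["id"] for row in subset])
```

Shuffling before truncating matters whenever the source rows are ordered (for example by language or by rating), because taking the first N rows of an ordered table gives a skewed sample.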




We then counted how many examples of each label the English train, test and validation sets contain. The five labels are distributed almost uniformly in each set. The subsampling itself was done to avoid RAM limitations when working with such a large dataset.

In [None]:
from collections import Counter

def print_label_counts(name, dataset_split):
    """Print how many examples of each label a split contains."""
    print(f"{name} ======== ")
    counts = Counter(dataset_split["label"])
    for label in sorted(counts):
        print(f"Number of Label {label} ::: {counts[label]} ")

print_label_counts("Training Set", train_dataset)
print("------------------------------------")
print_label_counts("Validation Set", validation_dataset)
print("------------------------------------")
print_label_counts("Test Set", test_dataset)

Number of Label 0 ::: 203 
Number of Label 1 ::: 214 
Number of Label 2 ::: 196 
Number of Label 3 ::: 200 
Number of Label 4 ::: 187 
------------------------------------
Number of Label 0 ::: 207 
Number of Label 1 ::: 203 
Number of Label 2 ::: 217 
Number of Label 3 ::: 173 
Number of Label 4 ::: 200 
------------------------------------
Number of Label 0 ::: 207 
Number of Label 1 ::: 203 
Number of Label 2 ::: 217 
Number of Label 3 ::: 173 
Number of Label 4 ::: 200 


---

## 3. Machine learning model

### 3.1. Model training

Here is the code to train the transformer-based model on the training set and evaluate its performance on the validation set.

We fine-tuned a transformer-based classifier, the XLM-RoBERTa model, on the English training set and evaluated it on the English validation set. For training, we defined the training arguments and passed the training and validation sets to the Trainer.

In [None]:
# Load the XLM-RoBERTa base model
model = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=5)

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Define a function to train and evaluate the model with specific hyperparameters
def train_evaluate_model(per_device_train_batch_size, num_train_epochs):
    """Fine-tune a fresh XLM-RoBERTa model and return its accuracy on the English test subset."""
    # `model` and `trainer` stay global because later cells reuse them
    global model, trainer

    # Start every run from the pretrained checkpoint so that runs do not share weights
    model = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=5)

    # Define training arguments for this hyperparameter combination
    training_args = TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=8,
        num_train_epochs=num_train_epochs,
        evaluation_strategy="steps",
        save_total_limit=2,  # Limit the number of checkpoints saved
        eval_steps=20,  # Set the evaluation interval
        save_steps=200,  # Set the checkpoint saving interval
        logging_dir="./logs",
        logging_steps=20,
    )

    # Initialize the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=validation_dataset,
    )

    # Train the model
    trainer.train()

    # Evaluate the model and compute accuracy
    predictions = trainer.predict(test_dataset)
    accuracy = accuracy_score(test_dataset["label"], predictions.predictions.argmax(axis=1))

    return accuracy

### 3.2 Hyperparameter optimization

We then performed hyperparameter optimization over two hyperparameters only: the training batch size and the number of epochs. We identified the combination that yielded the highest performance. We could not explore more hyperparameters because of Colab GPU and RAM limitations, and we worked with a subset of the data for the same reason. We tried these combinations:

batch_sizes = [8, 16]
num_epochs = [8, 5]

We could not try larger values because of the excessive training time they would require.

In [None]:
# Define a range of hyperparameters to search within
batch_sizes = [8, 16]
num_epochs = [8, 5]

best_accuracy = 0.0
best_hyperparameters = {}

# Perform grid search over hyperparameters
for per_device_train_batch_size in batch_sizes:
    for num_train_epochs in num_epochs:
        accuracy = train_evaluate_model(per_device_train_batch_size, num_train_epochs)
        print(f"Batch Size: {per_device_train_batch_size}, Num Epochs: {num_train_epochs}, Accuracy: {accuracy}")

        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_hyperparameters = {
                'per_device_train_batch_size': per_device_train_batch_size,
                'num_train_epochs': num_train_epochs,
            }



Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss,Validation Loss
20,1.6175,1.61923
40,1.6609,1.609424
60,1.6314,1.631593
80,1.6162,1.61335
100,1.6233,1.615944
120,1.6171,1.607304
140,1.6019,1.607355
160,1.619,1.603835
180,1.5782,1.521575
200,1.5177,1.536086


Batch Size: 8, Num Epochs: 8, Accuracy: 0.491


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss,Validation Loss
20,1.5895,1.645805
40,1.6738,1.621619
60,1.6283,1.623323
80,1.6267,1.613531
100,1.6155,1.618488
120,1.6351,1.601161
140,1.579,1.609424
160,1.5874,1.611649
180,1.6311,1.611135
200,1.6078,1.611362


Batch Size: 8, Num Epochs: 5, Accuracy: 0.203


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss,Validation Loss
20,1.6302,1.608764
40,1.6226,1.618072
60,1.6103,1.595327
80,1.6004,1.569276
100,1.5081,1.38822
120,1.4307,1.31889
140,1.3439,1.374344
160,1.3051,1.303611
180,1.3136,1.206967
200,1.2012,1.177451


Batch Size: 16, Num Epochs: 8, Accuracy: 0.578


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss,Validation Loss
20,1.6301,1.608836
40,1.6243,1.617582
60,1.6094,1.611247
80,1.6103,1.577958
100,1.5443,1.482633
120,1.4605,1.415222
140,1.3829,1.339768
160,1.3784,1.346315
180,1.3083,1.327864
200,1.2158,1.14141


Batch Size: 16, Num Epochs: 5, Accuracy: 0.535


In [None]:
print("Best hyperparameters:", best_hyperparameters)
print("Best accuracy:", best_accuracy)

Best hyperparameters: {'per_device_train_batch_size': 16, 'num_train_epochs': 8}
Best accuracy: 0.578


The best accuracy obtained was 57.8%, with 8 training epochs and a training batch size of 16.
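The selection step of the grid search can be reproduced from the four accuracies printed above. The sketch below only re-does the argmax over those reported numbers; the dictionary layout and the helper name are illustrative choices, not part of the notebook.

```python
# Accuracies reported above, keyed by (batch size, number of epochs).
reported_accuracy = {
    (8, 8): 0.491,
    (8, 5): 0.203,
    (16, 8): 0.578,
    (16, 5): 0.535,
}


def best_configuration(results):
    """Return the (configuration, accuracy) pair with the highest accuracy."""
    return max(results.items(), key=lambda item: item[1])


(best_batch_size, best_epochs), best_acc = best_configuration(reported_accuracy)
print(f"batch size={best_batch_size}, epochs={best_epochs}, accuracy={best_acc}")
```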


### 3.3. Evaluation on test set

After that, we evaluated the model on the test set and recorded the evaluation loss.

In [None]:
# Evaluate on the test set
results = trainer.evaluate(eval_dataset=test_dataset)
print(results)

# Save the trained model
model.save_pretrained("./amazon_reviews_multi_classification_xlmroberta_model")

{'eval_loss': 1.6203975677490234, 'eval_runtime': 15.3965, 'eval_samples_per_second': 64.95, 'eval_steps_per_second': 8.119}
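The result above contains only a loss because no metric function was given to the Trainer. As a hedged sketch of how accuracy could be derived from raw model outputs, the plain-Python helper below takes one row of logits per example and compares the argmax with the gold label. The function names and toy values are hypothetical. With the Trainer, similar logic would typically be wrapped in a function passed through its `compute_metrics` argument.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the gold label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy logits for four examples and five classes (made-up numbers).
toy_logits = [
    [2.0, 0.1, 0.0, -1.0, 0.3],  # predicted class 0
    [0.2, 0.1, 1.5, 0.0, 0.0],   # predicted class 2
    [0.0, 0.0, 0.1, 0.2, 0.3],   # predicted class 4
    [1.0, 0.9, 0.8, 0.7, 0.6],   # predicted class 0
]
toy_labels = [0, 2, 3, 0]

print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```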


After the monolingual experiments, we ran multilingual and cross-lingual experiments: we set up training arguments common to both, trained and validated each model, and then evaluated each on the test set.

### 3.4. Multilingual and cross-lingual experiments

In [None]:
# Training arguments shared by the multilingual and cross-lingual experiments
training_args= TrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    output_dir="./output",
    evaluation_strategy="steps",  # Set to "steps" for validation at regular intervals
    save_total_limit=2,  # Limit the number of checkpoints saved
    eval_steps=20,  # Set the evaluation interval
    save_steps=200,  # Set the checkpoint saving interval
    logging_dir="./logs",
    logging_steps=20,  # Log results every 20 steps
    num_train_epochs=1,  # Adjust the number of training epochs
)

### 3.4.1 Multi Lingual experiments
In this experiment, we train on English and French: we extract the reviews written in either language for training. The validation set used before is reused for the multilingual and cross-lingual models. Both experiments use a multilingual pre-trained language model, the xlm-roberta-base model.

After training, we evaluated on English and compared with the baseline results.

In [None]:
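# Multilingual experiment: train on English + French

# Keep only French and English reviews
dataset_fr_en = dataset.filter(lambda example: example["id"].split("_")[0] in ("fr", "en"))

# Shuffle before subsetting so the 1000 kept rows per split are a random mix of languages and labels
dataset_fr_en = dataset_fr_en.shuffle(seed=43).filter(lambda e, i: i<1000, with_indices=True)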

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_function(example):
    # Explicitly tokenize and convert to input_ids
    inputs = tokenizer(example["text"], padding="max_length", truncation=True, max_length=256, return_tensors="pt")
    inputs["labels"] = example["label"]
    return inputs

tokenized_datasets_multi= dataset_fr_en.map(tokenize_function, batched=True)

# Use the training split of the English + French subset
train_dataset_multi = tokenized_datasets_multi["train"]
print("train dataset length",len(train_dataset_multi["id"]))


model_multi = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=5)

trainer_multi = Trainer(
    model=model_multi,
    args=training_args,
    train_dataset=train_dataset_multi,
    eval_dataset=validation_dataset,  # Use the validation dataset
)

trainer_multi.train()



Filter:   0%|          | 0/1200000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/400000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/10000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

train dataset length 1000


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss,Validation Loss
20,0.4305,5.331499
40,0.0042,6.883851
60,0.0009,7.255793
80,0.0007,7.385445
100,0.0006,7.467547
120,0.0006,7.489573


TrainOutput(global_step=125, training_loss=0.07001108222454787, metrics={'train_runtime': 143.4592, 'train_samples_per_second': 6.971, 'train_steps_per_second': 0.871, 'total_flos': 131559071232000.0, 'train_loss': 0.07001108222454787, 'epoch': 1.0})

The model was then evaluated on the validation set (the trainer's default evaluation set) and on the test set.

In [None]:
trainer_multi.evaluate()

{'eval_loss': 1.6204431056976318,
 'eval_runtime': 14.1498,
 'eval_samples_per_second': 70.672,
 'eval_steps_per_second': 8.834}

In [None]:

# Evaluate on the test set
results_multi= trainer_multi.evaluate(eval_dataset=test_dataset)
print(results_multi)

# Save the trained model
model_multi.save_pretrained("./amazon_reviews_multi_classification_xlmroberta_model/multi")

{'eval_loss': 7.489504814147949, 'eval_runtime': 14.9803, 'eval_samples_per_second': 66.754, 'eval_steps_per_second': 8.344, 'epoch': 1.0}


### 3.4.2 Cross Lingual experiments
For this experiment, we train on French and evaluate on English (zero-shot cross-lingual transfer), then compare with the baseline results. This mimics the setting where one would like to have a sentiment classifier for a language but has no training data for that language.

In [None]:
#cross lingual

# Load and preprocess the dataset
dataset = load_dataset("mteb/amazon_reviews_multi")

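# Filter the dataset to include only French reviews
dataset_fr = dataset.filter(lambda example: example["id"].split("_")[0] == "fr")

# Shuffle before subsetting so the 1000 kept rows per split are a random sample across labels
dataset_fr = dataset_fr.shuffle(seed=43).filter(lambda e, i: i<1000, with_indices=True)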

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_function(example):
    # Explicitly tokenize and convert to input_ids
    inputs = tokenizer(example["text"], padding="max_length", truncation=True, max_length=256, return_tensors="pt")
    inputs["labels"] = example["label"]
    return inputs

tokenized_datasets_cross = dataset_fr.map(tokenize_function, batched=True)

# Use the training split of the French subset
train_dataset_cross = tokenized_datasets_cross["train"]


print("train dataset length",len(train_dataset_cross["id"]))

# Load the XLM-RoBERTa base model
model_cross = XLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=5)
trainer_cross = Trainer(
    model=model_cross,
    args=training_args,
    train_dataset=train_dataset_cross,
    eval_dataset=validation_dataset,  # Use the validation dataset
)

# Train the model
trainer_cross.train()


Filter:   0%|          | 0/1200000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/200000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

train dataset length 1000


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss,Validation Loss
20,0.4086,5.41484
40,0.0043,7.280872
60,0.0006,7.833633
80,0.0003,7.949673
100,0.0003,8.005838
120,0.0003,8.029263


TrainOutput(global_step=125, training_loss=0.06631513125449419, metrics={'train_runtime': 143.3651, 'train_samples_per_second': 6.975, 'train_steps_per_second': 0.872, 'total_flos': 131559071232000.0, 'train_loss': 0.06631513125449419, 'epoch': 1.0})

In [None]:
# Evaluate on the test set
results_cross = trainer_cross.evaluate(eval_dataset=test_dataset)
print(results_cross)

# Save the trained model
model_cross.save_pretrained("./amazon_reviews_multi_classification_xlmroberta_model/cross")

{'eval_loss': 8.031278610229492, 'eval_runtime': 14.6739, 'eval_samples_per_second': 68.148, 'eval_steps_per_second': 8.519, 'epoch': 1.0}


In a zero-shot cross-lingual transfer scenario, you can train a sentiment classifier on a language other than English and then evaluate its performance on English data. This simulates the situation where you want to build a sentiment classifier for a language for which you have no training data.

---

## 4. Results and summary

### 4.1 Corpus insights

Here, we briefly discuss what we learned about the corpus and its annotations.

### Corpus information


- **Description of the chosen corpus:** Hugging Face's "mteb/amazon_reviews_multi" dataset is a collection of reviews of various Amazon products in different languages.
The reviews were most likely gathered from Amazon's marketplaces, which offer reviews for a broad variety of products in several languages.

- As the word "multi" in the dataset name indicates, it contains reviews written in several languages, which is helpful for cross-lingual NLP tasks. The corpus contains 1.26M rows of German, English, Spanish, French, Japanese and Chinese reviews, with 210K rows per language. The attributes are id (string), text (string), label (int32) and label_text (string). It is split into train (1.2M), validation (30K) and test (30K) sets. The id consists of a two-letter language code followed by an identifier that is unique to each row. The label takes values from 0 to 4, and the rating improves with the value: 0 is the worst possible review and 4 the best.

There is a limit of 20 reviews per reviewer and of 20 reviews per product. Every review is at least 20 characters long and is truncated after 2,000 characters.
A review's language may differ from the language used in its marketplace (for example, reviews on amazon.de are mostly written in German, but they may also be written in English, etc.). To determine the language of the review text, the authors of the corpus therefore used a language detection algorithm and removed reviews that were not written in the expected language.

- **Usage**: Academics and data scientists may use this dataset to train and test models for tasks such as sentiment analysis, language modeling and translation. It can also be used to build multilingual models that comprehend and produce content in many languages.


- **License and Attribution Requirements:** The dataset may be subject to license, attribution and terms-of-use conditions. Users are typically required to abide by these when using the dataset.



### 4.2, 4.3 Results & Relation to random baseline / expected performance / state of the art

Here, we compared results to the random and state-of-the-art performances.

 **Paper(s) and other published materials related to the corpus:**  
- [The Multilingual Amazon Reviews Corpus](https://aclanthology.org/2020.emnlp-main.369.pdf)

- [Sentiment Analysis of Amazon Customer Product Reviews: A Review](https://www.ijsred.com/volume4/issue1/IJSRED-V4I1P55.pdf)

**Random baseline performance and expected performance for recent machine learned models:**

We compared it with [The Multilingual Amazon Reviews Corpus](https://aclanthology.org/2020.emnlp-main.369.pdf).

The findings and relation to the expected performance and baseline are summarized below:
 - In the original paper, for training and testing both in English, the accuracy is 59.1%, whereas our model reaches 57.8% with the best combination of hyperparameters. Although it did not surpass the paper's model, it came very close, even though it was trained only with the hyperparameter combinations that were feasible to run in Google Colab.
 - For the cross-lingual experiment, the training loss is 0.066 and the loss on the test set is 8.03. For the baseline, the test loss was 1.62, which is much lower. These values are losses, not percentages, and lower is better, so on this metric the cross-lingual model is clearly worse than the baseline; accuracy was not computed for this run. Cross-lingual transfer can carry information learned in one language over to another, but with limited data and training resources it did not reach the baseline here. The situation could be improved with better hyperparameters (training for more epochs and with larger train and validation batch sizes) and with more data for the source language (French in this case), since only 1000 French sentences were used.

 - For the multilingual experiment, the test loss was 7.49. For the main/baseline experiment (trained and tested on English only), it was 1.62. The performance most likely degraded because of the limited amount of French and English data finally chosen: after taking both English and French data, we kept just 1000 samples. One reason for the degradation could therefore be that the model could not capture enough variation in the data when it was trained on both languages at once with so little data available.

- For the paper's model, the training loss for the model trained on English was 6.25%, and when tested on the French dataset it was 47.1%. That setup was the opposite of ours, in which we trained on French and tested on English.

- For simplicity and ease of experimentation on this platform, the number of epochs has also been kept very low for all 3 experiments.
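To put the loss values discussed above into perspective, the following sketch assumes they are mean cross-entropy losses over five classes, which is the usual loss for single-label sequence classification. The helper names and probability vectors are illustrative. A uniform prediction costs ln 5 per example, while a prediction that puts almost all probability mass on a wrong class costs far more.

```python
import math


def cross_entropy(probabilities, label):
    """Negative log-probability assigned to the correct label."""
    return -math.log(probabilities[label])


def mean_cross_entropy(batch_probabilities, labels):
    """Average cross-entropy over a batch of predicted distributions."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probabilities, labels)]
    return sum(losses) / len(losses)


uniform = [0.2, 0.2, 0.2, 0.2, 0.2]
# Almost all probability mass on class 0 while the true label is 3.
confident_wrong = [0.9996, 0.0001, 0.0001, 0.0001, 0.0001]

print(round(cross_entropy(uniform, 3), 3))          # ln 5, about 1.609
print(round(cross_entropy(confident_wrong, 3), 3))  # -ln(0.0001), about 9.21
print(round(mean_cross_entropy([uniform, confident_wrong], [3, 3]), 3))
```

Read this way, a test loss near 1.6 is close to the chance level for five balanced classes, and losses around 7 to 8 correspond to predictions that are confidently wrong on many examples.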


## 5. Bonus Task (optional)



### 5.1. Data selection

We briefly describe how many English and target-language examples were used and how they were selected, and include the relevant code.

For this task, we selected 16 relatively short English reviews from the training section of mteb/amazon_reviews_multi, including examples of different ratings.

We also selected relatively short reviews from the amazon_reviews_multi training section in another language (the target language, French in our case), again including examples of different ratings. To these 17 hand-picked reviews we added N reviews sampled at random from the French training data. In our case, N is 100, which gives 117 target-language reviews in total.

In [None]:
chosen_english_texts = {
    "text": [
        "Somewhat useful, but still messy It's not as useful as I thought. You still have to you another spatula or your hand to clean the beater",
        "Article never received I never received the article",
        "Great! Good quality product and very nice. My daughter is very happy with it!!!",
        "Very good Impeccable for small espresso cups",
        "Never received I never received the product",
        "Trash. They don't make them like they used to This thing is so freaking cheap it's disgusting.",
        "Pleased Great little bag. It rolls easily and holds ALOT of stuff. Very pleased",
        "Five Stars It worked great for my project! Fast shipping",
        "Null Disappointed even very disappointed this sound not like the one on the picture",
        "Hold the bag tightly as it slips easily Mayonnaise pastries on salads",
        "Hull You hear nothing when you call and when you receive",
        "Used book Book received on time. In good condition and used. Nothing to complain about.",
        "Interesting The stories are always well-crafted.",
        "Don’t buy this I wouldn’t buy this I going to return it won’t charge",
        "Four Stars It is good for the price, but not the best quality. you get what you pay for",
        "Not yet received Delivery should have been on Wednesday, October 9 and to date I still haven't had it, so I'm not satisfied with the delivery time.",
    ]
}


In [None]:
target_language = "fr"
N = 100
target_language_reviews_train = [
    "Jamais reçu Je n’ai jamais reçu le produit",
    "Article jamais reçut Je n'ai jamais reçut l'article",
    "Nul Déçu même très déçu ce son pas comme celle qui son sur la photo",
    "Pas encore recu Livraison aurait dû être le mercredi 9 octobre et a ce jour je ne l'ai toujours pas eu donc pas satisfaite du délai de livraison",
     "Bien tenir la poche car elle glisse facilement Patisserie mayonnaise sur les salades",
    "Coque On entend rien quand on appel et quand on reçoit",
    "Livre en occasion Livre reçu dans les délais. En bon état et d'occasion. Rien à redire.",
    "Très bien Impeccable pour les petits tasses à expresso",
"Article superflu ! Efficacité très moyenne, surtout sur des œufs de gros calibre.",
    "Intéressant Les histoires sont toujours bien ficelées.",
    "Super !! Produit de bonne qualité et très sympa. Ma fille en est très contente !!!",
    "Avis C'est pas mal mais c'est quand même trop chère pour ce que c'est. N'hésitez pas à prendre le même produit sans la marque go pro.",
    "top super moule pour réaliser par exemple le rubik's cake, très pratique pour faire aussi des petits mets salés en apéro. Qualité top",
    "Très bon produit Bon produit, bien adapter a mon gros téléviseur je le recomlande car c'est un bon rapport qualité prix.",
    "le coupe legumes J ai acheté cet article pour couper tous mes légumes c est rapide efficace et facile à nettoyer. Durant l été on fait souvent des salades donc très pratique et rapide.",
    "Pas terrible mais fait l'affaire . Bonne trousse de rangement mais pas assez de place pour mettre tout ce qu'il a sur la photo . Les objets appuient sur l'écran ou sur les joysticks . Le film se décolle déjà d'un coin .",
    "Magnifique Il est bien mais dommage qu'il est un peu grand pour moi taille S",

]

target_language_reviews_train += dataset.filter(lambda ex: ex["id"].split("_")[0]==target_language)["train"].shuffle(seed=42)[:N]["text"]
print(target_language_reviews_train)



Filter:   0%|          | 0/1200000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

['Jamais reçu Je n’ai jamais reçu le produit', "Article jamais reçut Je n'ai jamais reçut l'article", 'Nul Déçu même très déçu ce son pas comme celle qui son sur la photo', "Pas encore recu Livraison aurait dû être le mercredi 9 octobre et a ce jour je ne l'ai toujours pas eu donc pas satisfaite du délai de livraison", 'Bien tenir la poche car elle glisse facilement Patisserie mayonnaise sur les salades', 'Coque On entend rien quand on appel et quand on reçoit', "Livre en occasion Livre reçu dans les délais. En bon état et d'occasion. Rien à redire.", 'Très bien Impeccable pour les petits tasses à expresso', 'Article superflu ! Efficacité très moyenne, surtout sur des œufs de gros calibre.', 'Intéressant Les histoires sont toujours bien ficelées.', 'Super !! Produit de bonne qualité et très sympa. Ma fille en est très contente !!!', "Avis C'est pas mal mais c'est quand même trop chère pour ce que c'est. N'hésitez pas à prendre le même produit sans la marque go pro.", "top super moule p



### 5.2 Sentence representations

We then tested how well a multilingual pre-trained language model works as a method for creating sentence representations to measure multilingual sentence similarity. We used the feature-extraction pipeline.

In [None]:

pipe = pipeline("feature-extraction", model="xlm-roberta-base")

For embedding similarity, we calculate an embedding for each sentence by averaging the token representations from the last layer of the language model, and then L2-normalize the resulting vector.


In [None]:
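def embed_with_pipelines(texts, quiet=True):
    # For each text, the pipeline returns a nested list of shape (1, num_tokens, hidden_size)
    raw_emb = pipe(texts)
    averages = []
    for one_sent_emb in raw_emb:
        one_sent_emb_tensor = torch.tensor(one_sent_emb)
        if not quiet:
            print("one_sent_emb shape", one_sent_emb_tensor.shape)
        # Average over the token axis to get one vector per sentence
        averages.append(one_sent_emb_tensor.mean(1).squeeze())
    embedded = torch.vstack(averages)
    print("Final shape:", embedded.shape)
    # L2-normalize each row so that a dot product equals cosine similarity
    embedded = torch.nn.functional.normalize(embedded)
    return embedded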
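
To make the shapes in the function above easier to follow, here is a minimal sketch of the same averaging step on a toy tensor. The numbers and the hidden size of 3 are made up for illustration (the real model produces 768-dimensional vectors), and the variable names are not part of the notebook.

```python
import torch

# Toy stand-in for one pipeline output: 1 sentence, 4 tokens, hidden size 3.
# (Assumption: values are invented; the real hidden size is 768.)
one_sent_emb = [[[1.0, 2.0, 3.0],
                 [3.0, 2.0, 1.0],
                 [0.0, 0.0, 0.0],
                 [4.0, 4.0, 4.0]]]

tokens = torch.tensor(one_sent_emb)      # shape: (1, 4, 3)
sentence_vec = tokens.mean(1).squeeze()  # average over the token axis, shape: (3,)

# Scale the sentence vector to unit L2 norm
unit_vec = torch.nn.functional.normalize(sentence_vec, dim=0)

print(tokens.shape, sentence_vec.shape)
print(sentence_vec)
print(unit_vec.norm())
```

Note that this plain average treats every token equally, which matches the function above when each text is embedded without padding.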

In [None]:
target_language_embeddings=embed_with_pipelines(target_language_reviews_train)
english_embeddings=embed_with_pipelines(chosen_english_texts["text"])

Final shape: torch.Size([117, 768])
Final shape: torch.Size([16, 768])


In [None]:
print(target_language_embeddings.shape)

torch.Size([117, 768])


In [None]:
target_language_embeddings

tensor([[-1.5775e-03,  1.2433e-04,  1.6392e-03,  ...,  8.0358e-04,
          3.4530e-03,  4.9424e-03],
        [-1.1485e-03,  6.5588e-04,  2.1509e-03,  ...,  2.3108e-04,
          3.4662e-03,  3.2159e-03],
        [-1.2889e-03,  2.0085e-03, -1.5174e-04,  ..., -5.9657e-05,
          2.2936e-03,  2.1127e-03],
        ...,
        [ 2.1486e-04, -1.2482e-03,  2.2514e-03,  ...,  3.7347e-03,
          7.7684e-04,  6.4688e-03],
        [-1.4794e-03,  2.3850e-03,  3.3399e-04,  ...,  4.8076e-03,
          2.4565e-03,  4.5333e-03],
        [-5.2706e-04,  5.3226e-04,  4.6123e-04,  ...,  1.1947e-03,
          2.9742e-04,  4.8341e-03]])

In [None]:
print(english_embeddings.shape)

torch.Size([16, 768])



### 5.3. Cosine similarity


We calculated the cosine similarity between the embeddings and, for each given English sentence, selected the target sentence that maximizes it. Because the embeddings are already L2-normalized, the cosine similarity reduces to a matrix multiplication.
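
As a small sanity check of the idea that a matrix product of L2-normalized embeddings gives cosine similarities, the sketch below compares `torch.mm` on normalized vectors with `torch.nn.functional.cosine_similarity` on the raw vectors. The random vectors and their sizes are assumptions for illustration only, not data from this project.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
a = torch.randn(2, 5)  # two hypothetical "English" vectors
b = torch.randn(3, 5)  # three hypothetical "target-language" vectors

# L2-normalize each row, as done for the sentence embeddings
a_unit = F.normalize(a, dim=1)
b_unit = F.normalize(b, dim=1)

# Dot products of unit vectors, one row per "English" vector
sims_mm = torch.mm(a_unit, b_unit.T)  # shape: (2, 3)

# Reference: pairwise cosine similarity computed on the raw vectors
sims_ref = F.cosine_similarity(a.unsqueeze(1), b.unsqueeze(0), dim=-1)

print(sims_mm.shape)
print(torch.allclose(sims_mm, sims_ref, atol=1e-5))
```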



In [None]:

import torch

sims = torch.mm(english_embeddings, target_language_embeddings.T)
print("Shape of the similarity matrix ",sims.shape)


Shape of the similarity matrix  torch.Size([16, 117])


In [None]:
sims_sort = torch.argsort(sims, dim=-1, descending=True)
sims_sort

tensor([[ 81,  49,  14,  ...,  63,  30,  91],
        [  0,   1,  31,  ...,  35,  44,  46],
        [ 10,  49, 109,  ...,  63,  30,  91],
        ...,
        [ 16,  92,   1,  ...,  63,  30,  91],
        [ 49,  11,  16,  ...,  63,  30,  91],
        [  3,  92,  39,  ...,  63,  91,  30]])

For each English example, we find the most similar sentence in the target-language collection using the embedding similarities.

To evaluate the pairs manually, we keep a list of the expected answers for the chosen English sentences, which I translated using DeepL. We then compare these expected answers with the outputs given by the model.

In [None]:
Answers = [
    "Il n est pas aussi utile que je le pensais. Il faut toujours utiliser une autre spatule ou sa main pour nettoyer le batteur",
    "Article jamais reçu Je n'ai jamais reçu l'article",
    "Superbe ! Produit de bonne qualité et très joli. Ma fille en est très contente !!!",
    "Très bien Impeccable pour les petites tasses à espresso",
    "Jamais reçu Je n'ai jamais reçu le produit",
    "Poubelle. Ils ne les fabriquent plus comme avant Ce truc est tellement bon marché que c'en est dégoûtant",
    "J'ai été très satisfait de ce sac. Il roule facilement et contient BEAUCOUP de choses. Très satisfait",
    "Cinq étoiles Il a très bien fonctionné pour mon projet ! Livraison rapide",
    "Nul Déçu même très déçu ce son pas comme celui de la photo",
    "Tenez bien le sac car il glisse facilement Mayonnaise pâtisseries sur salades",
    "Coque On n'entend rien quand on appelle et quand on reçoit",
    "Livre doccasion Livre reçu à temps. En bon état et utilisé. Rien à redire",
    "Intéressant Les histoires sont toujours bien ficelées",
    "N'achetez pas ce livre Je n'achèterais pas ce livre Je vais le renvoyer sans frais",
    "Quatre étoiles C'est bien pour le prix, mais ce n'est pas la meilleure qualité. on en a pour son argent",
    "Pas encore reçu La livraison aurait dû avoir lieu le mercredi 9 octobre et à ce jour je ne l'ai toujours pas reçue, donc je ne suis pas sûr de l'avoir reçue. "
]

The following code prints, for each chosen English sentence, the target sentence with the highest cosine similarity, the expected answer, and the similarity value.
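
The manual comparison could also be summarized as a top-1 retrieval accuracy. The sketch below is only an illustration under a strong assumption: that each English sentence has a known index of its correct counterpart in the target-language list, which this notebook does not provide (the comparison here is done by reading the pairs). The similarity values and the `gold` indices are hypothetical.

```python
import torch

# Hypothetical similarity matrix: 3 English queries x 4 target-language candidates
toy_sims = torch.tensor([[0.9, 0.1, 0.3, 0.2],
                         [0.2, 0.4, 0.8, 0.1],
                         [0.7, 0.6, 0.1, 0.3]])

# Hypothetical gold indices: position of the correct candidate for each query
gold = torch.tensor([0, 2, 1])

# The best candidate per query; equivalent to the first column of
# torch.argsort(toy_sims, dim=-1, descending=True)
predicted = toy_sims.argmax(dim=1)

# Fraction of queries whose best candidate is the gold one
top1_accuracy = (predicted == gold).float().mean().item()

print(predicted.tolist(), top1_accuracy)
```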

### 5.4 Bonus task evaluation

We have presented the evaluation results here:

In [None]:
# Inspect each pair: the English text, the target sentence with the highest
# cosine similarity, the expected answer, and the similarity value

for i in range(len(chosen_english_texts["text"])):
    print("CHOSEN ENGLISH TEXT:::::","\n",chosen_english_texts["text"][i])
    j=int(sims_sort[i,0])
    print("\n\nCHOSEN FRENCH BY ALGO::::\n",target_language_reviews_train[j])
    print("\n\nACTUAL ANS::::\n", Answers[i])
    print(sims[i,j])
    print("\n====================================================================================================================================================================\n")




CHOSEN ENGLISH TEXT::::: 
 Somewhat useful, but still messy It's not as useful as I thought. You still have to you another spatula or your hand to clean the beater


CHOSEN FRENCH BY ALGO::::
 Ok livraison

livré à temps. Mais manque l'odeur du coco. Et la bouteille ne ferme pas très bien. la prochaine fois je ferai mon huile moi même


ACTUAL ANS::::
 Il n est pas aussi utile que je le pensais. Il faut toujours utiliser une autre spatule ou sa main pour nettoyer le batteur
tensor(0.9978)


CHOSEN ENGLISH TEXT::::: 
 Article never received I never received the article


CHOSEN FRENCH BY ALGO::::
 Jamais reçu Je n’ai jamais reçu le produit


ACTUAL ANS::::
 Article jamais reçu Je n'ai jamais reçu l'article
tensor(0.9962)


CHOSEN ENGLISH TEXT::::: 
 Great! Good quality product and very nice. My daughter is very happy with it!!!


CHOSEN FRENCH BY ALGO::::
 Super !! Produit de bonne qualité et très sympa. Ma fille en est très contente !!!


ACTUAL ANS::::
 Superbe ! Produit de bonne qual