# DELIVERABLE: FINE-TUNING AND ADAPTATION OF LANGUAGE MODELS

The aim of this exercise is to perform transfer learning from English to Spanish for sentiment analysis of movie reviews.
A cross-lingual model is recommended to facilitate the transfer:

- `xlm-roberta-base`
- `xlm-roberta-large`

Either `transformers` or `simpletransformers` can be used; this notebook uses `transformers`.

In [1]:
!pip install transformers[torch]

Collecting transformers[torch]
  Downloading transformers-4.35.1-py3-none-any.whl (7.9 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers[torch])
  Downloading huggingface_hub-0.19.3-py3-none-any.whl (311 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers[torch])
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers[torch])
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

Download the movie reviews from the IMDB dataset (Parquet format).

In [2]:
!wget -O imdb_train.parquet "https://huggingface.co/datasets/imdb/resolve/refs%2Fconvert%2Fparquet/plain_text/train/0000.parquet"
!wget -O imdb_test.parquet  "https://huggingface.co/datasets/imdb/resolve/refs%2Fconvert%2Fparquet/plain_text/test/0000.parquet"

--2023-11-15 15:24:01--  https://huggingface.co/datasets/imdb/resolve/refs%2Fconvert%2Fparquet/plain_text/train/0000.parquet
Resolving huggingface.co (huggingface.co)... 18.164.174.17, 18.164.174.23, 18.164.174.55, ...
Connecting to huggingface.co (huggingface.co)|18.164.174.17|:443... connected.
HTTP request sent, awaiting response... 302 Found

In [3]:
!ls

imdb_test.parquet  imdb_train.parquet  sample_data


# PART I: Training with the English dataset

## Loading, exploring and transforming data for transformers

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification
import torch
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, f1_score

In [5]:
# Save the model in gdrive
from google.colab import drive

# Mount Google Drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [6]:
# Step 1: Load train and test dataset in a pandas dataframe
df_train = pd.read_parquet("imdb_train.parquet")
df_test = pd.read_parquet("imdb_test.parquet")

In [7]:
# Concatenate both datasets in order to get a bigger dataset
df_train = pd.concat([df_train, df_test], ignore_index=True)

In [8]:
df_train.head()

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


In [9]:
df_test.head()

Unnamed: 0,text,label
0,I love sci-fi and am willing to put up with a ...,0
1,"Worth the entertainment value of a rental, esp...",0
2,its a totally average film with a few semi-alr...,0
3,STAR RATING: ***** Saturday Night **** Friday ...,0
4,"First off let me say, If you haven't enjoyed a...",0


In [10]:
df_train['label'].value_counts()

0    25000
1    25000
Name: label, dtype: int64

In [11]:
df_test['label'].value_counts()

0    12500
1    12500
Name: label, dtype: int64

## Prepare the Trainer from a cross-lingual pretrained model

### Split train-validation-test

In [12]:
# Reset the index
#df_train = df_train.reset_index(drop=True)

# Step 2: Split the data into train, validation, and test sets
train_data, temp_data = train_test_split(df_train, test_size=0.3, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=1/3, random_state=42)
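
The two chained calls above yield roughly a 70/20/10 train/validation/test split. A minimal sketch of the arithmetic, assuming scikit-learn rounds a fractional `test_size` up to a whole number of rows; `split_sizes` is a hypothetical helper that only illustrates the counts:

```python
import math

def split_sizes(n_rows, first_test_size=0.3, second_test_size=1/3):
    """Return (train, validation, test) row counts for two chained splits."""
    # First split: hold out `first_test_size` of all rows (rounded up).
    n_temp = math.ceil(n_rows * first_test_size)
    n_train = n_rows - n_temp
    # Second split: a third of the held-out rows becomes the test set.
    n_test = math.ceil(n_temp * second_test_size)
    n_val = n_temp - n_test
    return n_train, n_val, n_test

# 50,000 rows is the size of the concatenated IMDB dataframe shown above.
print(split_sizes(50_000))  # prints (35000, 10000, 5000)
```

The same proportions apply to any dataframe passed through these two calls.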

### Get the tokenizer and the classification model (from_pretrained)

In [13]:
# Step 3: Get the tokenizer and the classification model

model_name = 'xlm-roberta-base'

tokenizer = AutoTokenizer.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#model = AutoModelForMaskedLM.from_pretrained(model_name)
model = (AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device))

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
# Tokenize the text data
train_tokenized = tokenizer(train_data['text'].tolist(), truncation=True, padding=True, return_tensors="pt")
val_tokenized = tokenizer(val_data['text'].tolist(), truncation=True, padding=True, return_tensors="pt")
test_tokenized = tokenizer(test_data['text'].tolist(), truncation=True, padding=True, return_tensors="pt")

# Create PyTorch datasets
# The tokenizer already returns tensors (return_tensors="pt"), so only the labels need wrapping;
# re-wrapping the inputs with torch.tensor() triggers a copy-construct UserWarning.
train_dataset = torch.utils.data.TensorDataset(
    train_tokenized['input_ids'],
    train_tokenized['attention_mask'],
    torch.tensor(train_data['label'].tolist())
)

val_dataset = torch.utils.data.TensorDataset(
    val_tokenized['input_ids'],
    val_tokenized['attention_mask'],
    torch.tensor(val_data['label'].tolist())
)

test_dataset = torch.utils.data.TensorDataset(
    test_tokenized['input_ids'],
    test_tokenized['attention_mask'],
    torch.tensor(test_data['label'].tolist())
)
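
Calling the tokenizer with `padding=True` on a whole split pads every review to the longest sequence in that split. A common alternative is to pad per batch inside the collator. The sketch below illustrates the idea on plain Python lists; `pad_batch` is a hypothetical helper, and the pad id of 1 is an assumption about the XLM-RoBERTa vocabulary that should be read from `tokenizer.pad_token_id` in practice.

```python
def pad_batch(sequences, pad_id=1):
    """Pad token-id lists to the longest sequence in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token ids, not real tokenizer output.
batch = pad_batch([[0, 581, 2], [0, 581, 1346, 83, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

With the `Trainer`, `DataCollatorWithPadding` from `transformers` offers this behavior for datasets of tokenized dictionaries; it would replace the custom collator rather than be combined with `TensorDataset`.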


### Prepare the training arguments and set up the Trainer object

In [15]:
# Step 4: Prepare the training arguments and set up the Trainer object
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

# Custom data collator
class CustomDataCollator:
    def __call__(self, batch):
        input_ids = torch.stack([item[0] for item in batch])
        attention_mask = torch.stack([item[1] for item in batch])
        labels = torch.tensor([item[2] for item in batch])
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }

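# Define training arguments
batch_size = 12
# Note: logging_steps is computed here but never passed to TrainingArguments,
# so the Trainer uses its default interval (every 500 steps in the run below).
logging_steps = len(train_data) // batch_size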
training_args = TrainingArguments(
    output_dir="sentiment_movies",
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    #warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy="steps",
    #logging_steps = len(emotions["train"]) // batch_size,
    #eval_steps = 50,
    #save_strategy="no",
    fp16 = True,
    #logging_dir="./logs",
    disable_tqdm=False
)

# Define Trainer with the tokenized datasets
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics = compute_metrics,
    data_collator = CustomDataCollator()
)
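
`compute_metrics` receives raw logits, takes the arg-max over the class axis, and scores the result. A dependency-free sketch of the same two metrics on toy values follows; `accuracy` and `weighted_f1` are hypothetical re-implementations meant to show what `average="weighted"` does, not replacements for scikit-learn.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the gold labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1 averaged with weights proportional to class support."""
    total = 0.0
    for cls in sorted(set(labels)):
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = sum(l == cls and p != cls for l, p in zip(labels, preds))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn  # number of gold examples of this class
        total += f1 * support / len(labels)
    return total

# Toy logits for four reviews; each row is [negative score, positive score].
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 1, 1, 1]
preds = [row.index(max(row)) for row in logits]  # arg-max over the class axis

print(preds, accuracy(labels, preds), round(weighted_f1(labels, preds), 4))
```

On balanced data such as the IMDB splits, accuracy and weighted F1 tend to be nearly identical, which matches the training log below.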


## Fine-tune the model

In [14]:
# Step 5: Fine-tune the model
trainer.train()

Step,Training Loss,Validation Loss,Accuracy,F1
500,0.3874,0.28678,0.9108,0.910706
1000,0.2637,0.309663,0.9165,0.916491
1500,0.2584,0.231351,0.924,0.923961
2000,0.2535,0.234334,0.9267,0.926677
2500,0.2339,0.226396,0.931,0.930997
3000,0.2247,0.296027,0.9308,0.930731
3500,0.1774,0.255879,0.9347,0.934679
4000,0.1833,0.250251,0.9373,0.937298
4500,0.1619,0.276619,0.938,0.937996
5000,0.169,0.252098,0.9389,0.938899


TrainOutput(global_step=5834, training_loss=0.2232633688125375, metrics={'train_runtime': 5203.4434, 'train_samples_per_second': 13.453, 'train_steps_per_second': 1.121, 'total_flos': 1.84177738752e+16, 'train_loss': 0.2232633688125375, 'epoch': 2.0})

## Evaluate the model with the test dataset (English)

In [16]:
# Step 6: Evaluate the model with the test dataset (English)
results = trainer.evaluate(test_dataset)
results

{'eval_loss': 0.23350727558135986,
 'eval_accuracy': 0.9406,
 'eval_f1': 0.940603623939387,
 'eval_runtime': 60.45,
 'eval_samples_per_second': 82.713,
 'eval_steps_per_second': 6.898,
 'epoch': 2.0}

In [17]:
preds_output = trainer.predict(test_dataset)
preds_output.metrics

{'test_loss': 0.23350727558135986,
 'test_accuracy': 0.9406,
 'test_f1': 0.940603623939387,
 'test_runtime': 58.4918,
 'test_samples_per_second': 85.482,
 'test_steps_per_second': 7.129}

In [16]:
# Set the path to save the model
output_dir = "/content/gdrive/MyDrive/Master_23_24/Big_Data/movies_sentiment"

In [21]:
# Save model and weights
trainer.save_model(output_dir + "/model")
#model.save_pretrained(output_dir)

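# Save tokenizer
# Note: there is no "/" before "tokenizer", so the files are written to
# ".../movies_sentimenttokenizer" (the path Part II loads the tokenizer from)
tokenizer.save_pretrained(output_dir + "tokenizer")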

('/content/gdrive/MyDrive/Master_23_24/Big_Data/movies_sentimenttokenizer/tokenizer_config.json',
 '/content/gdrive/MyDrive/Master_23_24/Big_Data/movies_sentimenttokenizer/special_tokens_map.json',
 '/content/gdrive/MyDrive/Master_23_24/Big_Data/movies_sentimenttokenizer/tokenizer.json')
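
The output above shows the tokenizer landing in `movies_sentimenttokenizer`, because plain string concatenation adds no path separator. A small sketch of the difference, using a hypothetical base directory (`/content/example` is illustrative, not the notebook's path):

```python
from pathlib import PurePosixPath

base_dir = "/content/example"  # hypothetical directory for illustration

# Plain concatenation glues the two names together.
concatenated = base_dir + "tokenizer"

# Joining inserts the separator, giving a sub-directory of base_dir.
joined = str(PurePosixPath(base_dir) / "tokenizer")

print(concatenated)
print(joined)
```

Either form works as long as the same path is used when the tokenizer is loaded again.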

# PART II: Adaptation to the Spanish dataset (criticas_pelis)

Starting from the model fine-tuned and evaluated in Part I, we now continue training it on the Spanish examples.

Take care to use the same labels in the Spanish dataset (0 = negative, 1 = positive). Neutral opinions (rating = 3) are discarded from the training dataset, but they can be used afterwards to predict their polarity with the resulting model.
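
The labeling rule described above can be sketched as follows. The ratings are assumed to run from 1 to 5 with 3 as the neutral midpoint; `rating_to_label` and the toy reviews are hypothetical and serve only as an illustration.

```python
def rating_to_label(rating):
    """Map a star rating to 0 (negative), 1 (positive) or None (neutral)."""
    if rating == 3:
        return None
    return 0 if rating < 3 else 1

# Toy (text, rating) pairs, not taken from the dataset.
reviews = [("Una joya", 5), ("Ni fu ni fa", 3), ("Aburrida", 1), ("Notable", 4)]

# Labeled reviews go to training; neutral ones are set aside for later prediction.
labeled = [(text, rating_to_label(r)) for text, r in reviews if r != 3]
neutral = [text for text, r in reviews if r == 3]

print(labeled)
print(neutral)
```

Keeping the neutral reviews in a separate list, rather than dropping them, makes it possible to run the final model on them later.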

## Load, explore and transform the Spanish dataset of reviews

In [17]:
!wget "http://krono.act.uji.es/IDIA/criticas_pelis.csv.gz"
!gunzip "criticas_pelis.csv.gz"

--2023-11-15 15:28:12--  http://krono.act.uji.es/IDIA/criticas_pelis.csv.gz
Resolving krono.act.uji.es (krono.act.uji.es)... 150.128.97.37
Connecting to krono.act.uji.es (krono.act.uji.es)|150.128.97.37|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://krono.act.uji.es/IDIA/criticas_pelis.csv.gz [following]
--2023-11-15 15:28:12--  https://krono.act.uji.es/IDIA/criticas_pelis.csv.gz
Connecting to krono.act.uji.es (krono.act.uji.es)|150.128.97.37|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4447654 (4.2M) [application/x-gzip]
Saving to: ‘criticas_pelis.csv.gz’


2023-11-15 15:28:14 (3.02 MB/s) - ‘criticas_pelis.csv.gz’ saved [4447654/4447654]



In [18]:
# Load the Spanish dataset
df_spanish = pd.read_csv("criticas_pelis.csv", header=None)

In [19]:
# Display basic information about the dataset
df_spanish.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3878 entries, 0 to 3877
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       3878 non-null   object
 1   1       3878 non-null   object
 2   2       3878 non-null   object
 3   3       3878 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 121.3+ KB


In [20]:
# Display the first few rows of the dataset
df_spanish.head(5)

Unnamed: 0,0,1,2,3
0,Row0,File-0,"May, ¿quieres ser mi amigo? May, ¿Quieres ser...",4
1,Row1,File-1,Cómo ponerse en la piel de un kamikaze Es tod...,4
2,Row2,File-10,"Deliciosa comedieta dramática, con tintes rev...",4
3,Row3,File-100,La ironía es el arma de los perdedores y este...,3
4,Row4,File-1000,"Al final, y teniendo en cuenta que esto es el...",3


In [21]:
df_spanish.columns = ['row', 'file', 'text', 'rating']

In [23]:
# Discard neutral opinions (rating=3); .copy() avoids pandas' SettingWithCopyWarning below
df_spanish = df_spanish[df_spanish['rating'] != 3].copy()

# Assign labels (0=negative, 1=positive)
df_spanish['label'] = df_spanish['rating'].apply(lambda x: 0 if x < 3 else 1)

# Display the updated dataset
df_spanish.head()

Unnamed: 0,row,file,text,rating,label
0,Row0,File-0,"May, ¿quieres ser mi amigo? May, ¿Quieres ser...",4,1
1,Row1,File-1,Cómo ponerse en la piel de un kamikaze Es tod...,4,1
2,Row2,File-10,"Deliciosa comedieta dramática, con tintes rev...",4,1
5,Row5,File-1001,Durante buena parte del metraje y solo pude r...,1,0
8,Row8,File-1004,Sus defectos quedan olvidados gracias a las i...,4,1


In [24]:
df_spanish['label'].value_counts()

1    1351
0    1274
Name: label, dtype: int64

In [25]:
# Reset the index
#df_train = df_train.reset_index(drop=True)

# Step 2: Split the data into train, validation, and test sets
train_data_s, temp_data_s = train_test_split(df_spanish, test_size=0.3, random_state=42)
val_data_s, test_data_s = train_test_split(temp_data_s, test_size=1/3, random_state=42)

## Prepare the Trainer from the previously trained model

In [30]:
# Assuming you have saved the model to a specific directory, adjust the path accordingly
pretrained_model_path = output_dir + "/model"
pretrained_tokenizer_path = "/content/gdrive/MyDrive/Master_23_24/Big_Data/movies_sentimenttokenizer"

# Load the pre-trained model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#pretrained_model = AutoModelForMaskedLM.from_pretrained(pretrained_model_path)
pretrained_model = (AutoModelForSequenceClassification.from_pretrained(pretrained_model_path, num_labels=2).to(device))

# Define the tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_tokenizer_path)

In [31]:
# Tokenize the text data
train_tokenized_s = tokenizer(train_data_s['text'].tolist(), truncation=True, padding=True, return_tensors="pt")
val_tokenized_s = tokenizer(val_data_s['text'].tolist(), truncation=True, padding=True, return_tensors="pt")
test_tokenized_s = tokenizer(test_data_s['text'].tolist(), truncation=True, padding=True, return_tensors="pt")

# Create PyTorch datasets
# The tokenizer already returns tensors (return_tensors="pt"), so only the labels need wrapping;
# re-wrapping the inputs with torch.tensor() triggers a copy-construct UserWarning.
train_dataset_s = torch.utils.data.TensorDataset(
    train_tokenized_s['input_ids'],
    train_tokenized_s['attention_mask'],
    torch.tensor(train_data_s['label'].tolist())
)

val_dataset_s = torch.utils.data.TensorDataset(
    val_tokenized_s['input_ids'],
    val_tokenized_s['attention_mask'],
    torch.tensor(val_data_s['label'].tolist())
)

test_dataset_s = torch.utils.data.TensorDataset(
    test_tokenized_s['input_ids'],
    test_tokenized_s['attention_mask'],
    torch.tensor(test_data_s['label'].tolist())
)


In [32]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

# Custom data collator
class CustomDataCollator:
    def __call__(self, batch):
        input_ids = torch.stack([item[0] for item in batch])
        attention_mask = torch.stack([item[1] for item in batch])
        labels = torch.tensor([item[2] for item in batch])
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }

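# Define training arguments
batch_size = 12
# Use the Spanish training split here. Note that logging_steps is still not passed to
# TrainingArguments, so the default 500-step interval applies and this short run
# (308 steps) finishes before any intermediate evaluation is logged.
logging_steps = len(train_data_s) // batch_size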
training_args = TrainingArguments(
    output_dir="sentiment_movies",
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    #warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy="steps",
    #logging_steps = len(emotions["train"]) // batch_size,
    #eval_steps = 50,
    #save_strategy="no",
    fp16 = True,
    #logging_dir="./logs",
    disable_tqdm=False
)

# Define Trainer with the tokenized datasets
trainer = Trainer(
    model=pretrained_model,
    args=training_args,
    train_dataset=train_dataset_s,
    eval_dataset=val_dataset_s,
    compute_metrics = compute_metrics,
    data_collator = CustomDataCollator()
)

## Fine-tune the model

In [33]:
# Fine-tune the model on the Spanish dataset
trainer.train()

Step,Training Loss,Validation Loss


TrainOutput(global_step=308, training_loss=0.24332754952566965, metrics={'train_runtime': 167.5838, 'train_samples_per_second': 21.923, 'train_steps_per_second': 1.838, 'total_flos': 966670017392640.0, 'train_loss': 0.24332754952566965, 'epoch': 2.0})
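
The metrics table above is empty, most likely because the run ends before the first evaluation point. A rough sketch of the arithmetic, assuming evaluation fires every 500 steps when `eval_steps` is not set (consistent with the 500-step rows of the English run), a single device, no gradient accumulation, and a Spanish training split of 1,837 rows (70% of the 2,625 labeled reviews, rounded down); `total_steps` and `eval_points` are hypothetical helpers:

```python
import math

def total_steps(n_train, batch_size, epochs):
    """Optimizer steps for a run with no gradient accumulation."""
    return math.ceil(n_train / batch_size) * epochs

def eval_points(n_steps, eval_every=500):
    """Steps at which a periodic evaluation would fire."""
    return list(range(eval_every, n_steps + 1, eval_every))

spanish_steps = total_steps(n_train=1837, batch_size=12, epochs=2)
print(spanish_steps, eval_points(spanish_steps))  # prints 308 []

english_steps = total_steps(n_train=35_000, batch_size=12, epochs=2)
print(english_steps, eval_points(english_steps)[:3])  # prints 5834 [500, 1000, 1500]
```

Both totals match the `global_step` values reported by `trainer.train()`. Setting `eval_steps` (with a compatible `save_steps`) well below 308 would likely be needed for `load_best_model_at_end` to have any checkpoint to choose from in the Spanish run.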

## Evaluate the model with the test dataset (Spanish)

In [34]:
# Evaluate the model on the test dataset
results = trainer.evaluate(test_dataset_s)
results

{'eval_loss': 0.3936593532562256,
 'eval_accuracy': 0.9239543726235742,
 'eval_f1': 0.9236543679568349,
 'eval_runtime': 3.1245,
 'eval_samples_per_second': 84.173,
 'eval_steps_per_second': 7.041,
 'epoch': 2.0}

In [35]:
preds_output = trainer.predict(test_dataset_s)
preds_output.metrics

{'test_loss': 0.3936593532562256,
 'test_accuracy': 0.9239543726235742,
 'test_f1': 0.9236543679568349,
 'test_runtime': 3.095,
 'test_samples_per_second': 84.977,
 'test_steps_per_second': 7.108}

## Evaluate the model again on the English test dataset

In [36]:
results = trainer.evaluate(test_dataset)
results

{'eval_loss': 0.3329053819179535,
 'eval_accuracy': 0.9312,
 'eval_f1': 0.9311647585017865,
 'eval_runtime': 57.4988,
 'eval_samples_per_second': 86.958,
 'eval_steps_per_second': 7.252,
 'epoch': 2.0}

In [37]:
preds_output = trainer.predict(test_dataset)
preds_output.metrics

{'test_loss': 0.3329053819179535,
 'test_accuracy': 0.9312,
 'test_f1': 0.9311647585017865,
 'test_runtime': 57.4801,
 'test_samples_per_second': 86.987,
 'test_steps_per_second': 7.255}

In [38]:
# Set the path to save the model
output_dir2 = "/content/gdrive/MyDrive/Master_23_24/Big_Data/movies_sentiment_spanish"

In [39]:
# Save model and weights to the Spanish output directory defined above
trainer.save_model(output_dir2)

The evaluation results of the two training stages are summarized below:

1. **First model (fine-tuned on English), evaluated on the English test split:**
   - **Accuracy:** 94.06%
   - **F1 Score:** 94.06%
   - **Evaluation Loss:** 0.2335

2. **Second model (further fine-tuned on Spanish), evaluated on the Spanish test split:**
   - **Accuracy:** 92.40%
   - **F1 Score:** 92.37%
   - **Evaluation Loss:** 0.3937

3. **Second model, evaluated again on the English test split:**
   - **Accuracy:** 93.12%
   - **F1 Score:** 93.12%
   - **Evaluation Loss:** 0.3329

**Conclusion:**

- The first model, fine-tuned only on the English reviews, reaches 94.06% accuracy and F1 on the English test split.

- The second model is the same model further fine-tuned on the Spanish reviews. It reaches 92.40% accuracy and 92.37% F1 on the Spanish test split, which indicates that it has learned to classify Spanish reviews well. These figures come from a different and much smaller test set, so they are not directly comparable with the English results.

- When the second model is evaluated again on the English test split, accuracy and F1 fall from 94.06% to 93.12% and the loss rises from 0.2335 to 0.3329. Fine-tuning on Spanish therefore costs a little under one point of English performance, which suggests that the model retains most of what it learned in Part I.
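
The comparison on the English test split can be made concrete by computing the change in each metric before and after the Spanish fine-tuning. A short sketch using the values reported in the evaluation outputs above (the dictionary names are illustrative):

```python
# English test metrics copied from the evaluation outputs above.
before_spanish = {"accuracy": 0.9406, "f1": 0.940603623939387, "loss": 0.23350727558135986}
after_spanish = {"accuracy": 0.9312, "f1": 0.9311647585017865, "loss": 0.3329053819179535}

# Positive deltas mean the metric increased after fine-tuning on Spanish.
deltas = {name: after_spanish[name] - before_spanish[name] for name in before_spanish}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

A drop of about 0.94 points in accuracy alongside a higher loss points to mild forgetting of the English task rather than a collapse.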