# M6 - W4 Assignment: Sentiment Analysis using BERT

The objective of this exercise is to fine-tune a Transformer-based model with the Hugging Face library and evaluate its performance on sentiment classification of the IMDb movie review dataset.

Dataset: The IMDb dataset consists of 50,000 movie reviews, with an equal number of positive and negative reviews. It is commonly used for sentiment analysis tasks. You can download the dataset from the following link: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

Instructions:

Step 1: Prepare the Data

Extract the downloaded dataset and load the training and testing data into appropriate data structures.
Preprocess the data as required, which may include tokenization, padding, or any other necessary steps.

Step 2: Select and Load a Pre-trained Model

Choose a pre-trained Transformer-based model suitable for text classification, such as BERT or RoBERTa.
Load the pre-trained model using the Hugging Face library.

Step 3: Fine-tune the Model

Define the appropriate architecture and layers for fine-tuning the pre-trained model.
Fine-tune the model on the IMDb dataset by training it on the training data.
Adjust the hyperparameters as needed, including the learning rate and batch size. Use cross-validation (CV) to compare different hyperparameter settings.

Step 4: Evaluate the Model

Use the fine-tuned model to make predictions on the testing data.
Calculate the accuracy, precision, recall, and F1-score of the model's predictions.
Use appropriate evaluation metrics based on the nature of the classification task.

Additional Notes:

You can refer to the Hugging Face documentation for guidance on using their library and working with Transformers models: https://huggingface.co/transformers/
Make sure to handle any necessary data preprocessing, such as cleaning or normalizing the text, before feeding it into the model.
It is recommended to use a GPU if available to speed up the training process.
Feel free to explore and experiment beyond the provided instructions to gain a deeper understanding of Transformer-based models and their applications.
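
The cross-validation requested in Step 3 splits the training data into k folds and lets each fold serve once as the validation set. The following is a minimal sketch of that splitting logic in plain Python, without shuffling; `kfold_indices` is a hypothetical helper written for illustration, not part of any library. The notebook itself relies on scikit-learn's `KFold`, which additionally supports shuffling.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, val_indices) for each fold, without shuffling."""
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    remainder = n_samples % n_folds
    fold_sizes = [n_samples // n_folds + (1 if i < remainder else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        # Every index outside the validation block is used for training.
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(kfold_indices(10, 5))
for fold_number, (train, val) in enumerate(folds):
    print(f"fold {fold_number}: validate on {val}, train on {len(train)} samples")
```

Each sample appears in exactly one validation fold, so averaging the per-fold scores uses every training example for validation once.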

## Step 1: Prepare the Data

In [None]:
import tarfile
import os
import torch
from torch.utils.data import DataLoader, TensorDataset
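# AdamW is taken from PyTorch; the copy in transformers is deprecated and absent from recent releases
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification, get_scheduler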
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Path to the downloaded gz file
gz_file_path = "C:/Users/ManosIeronymakisProb/OneDrive - Probability/Bureaublad/ELU/M6 - W4 Assignment Sentiment Analysis using BERT/aclImdb_v1.tar.gz"

# Directory to store the extracted dataset
dataset_dir = "imdb_dataset"
os.makedirs(dataset_dir, exist_ok=True)

# Extract the gz file to the dataset directory
with tarfile.open(gz_file_path, "r:gz") as tar:
    tar.extractall(dataset_dir)

# Function to load reviews and labels from a directory
def load_reviews(directory):
    reviews = []
    labels = []
    for label in ['pos', 'neg']:
        path = os.path.join(directory, label)
        for filename in os.listdir(path):
            with open(os.path.join(path, filename), 'r', encoding='utf-8') as file:
                review = file.read()
            reviews.append(review)
            labels.append(1 if label == 'pos' else 0)
    return reviews, labels

# Directory paths for the training and testing sets, derived from the extraction directory
train_dir = os.path.join(dataset_dir, "aclImdb", "train")
test_dir = os.path.join(dataset_dir, "aclImdb", "test")

# Load the reviews and labels for training set
train_reviews, train_labels = load_reviews(train_dir)

# Load the reviews and labels for testing set
test_reviews, test_labels = load_reviews(test_dir)
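
The assignment notes ask for cleaning or normalizing the text before it reaches the model. The raw IMDb reviews contain HTML line breaks such as `<br />`, so one light-weight option is to strip those and collapse whitespace. The sketch below is only an illustration: `clean_review` is a hypothetical helper, and whether this cleaning improves results with a BERT tokenizer is an assumption that would need to be checked.

```python
import re


def clean_review(text):
    """Remove HTML line breaks and normalize whitespace in a single review."""
    # Replace <br>, <br/> and <br /> (any letter case) with a space.
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    # Collapse runs of spaces, tabs, and newlines into one space.
    return re.sub(r"\s+", " ", text).strip()


sample = "Great film!<br /><br />Would   watch again."
cleaned = clean_review(sample)
print(cleaned)
```

It could be applied to the loaded lists with a list comprehension, for example `[clean_review(r) for r in train_reviews]`.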


## Step 2: Select and Load a Pre-trained Model

In [None]:
# Select the model name
model_name = "bert-base-uncased"  # Example: BERT-base

# Load the tokenizer and model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.to(device)
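
The tokenizer loaded here is later called with `truncation=True` and `padding=True`. As a rough illustration of what those two options produce, the sketch below truncates and pads lists of token ids by hand and builds the matching attention mask. The helper `pad_and_truncate`, the example ids, and `pad_id=0` are assumptions made for illustration; a real tokenizer also keeps the closing special token when it truncates, which this sketch does not do.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    target_length = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids standing in for two tokenized reviews of different lengths.
ids, mask = pad_and_truncate(
    [[101, 7592, 102], [101, 2023, 2003, 2307, 102]], max_length=4
)
print(ids)
print(mask)
```

The attention mask is what allows reviews of different lengths to share one rectangular tensor without the padding influencing the prediction.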

## Step 3 & 4: Fine-tune the Model and Evaluate the Model

In [None]:
# Prepare the data for fine-tuning
train_encodings = tokenizer(train_reviews, truncation=True, padding=True, return_tensors='pt')
train_inputs = train_encodings['input_ids']
train_attention_mask = train_encodings['attention_mask']
train_labels = torch.tensor(train_labels)
train_dataset = TensorDataset(train_inputs, train_attention_mask, train_labels)

# Define the hyperparameters and perform cross-validation
learning_rates = [2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32, 64]
num_epochs = 3
num_folds = 5

best_accuracy = 0.0
best_model = None

for lr in learning_rates:
    for batch_size in batch_sizes:
        fold_accuracy = 0.0
        kfold = KFold(n_splits=num_folds, shuffle=True, random_state=42)

        for fold, (train_indices, val_indices) in enumerate(kfold.split(range(len(train_dataset)))):
            # Indexing the TensorDataset returns (input_ids, attention_mask, labels) for the fold
            train_fold_dataset = TensorDataset(*train_dataset[train_indices])
            val_fold_dataset = TensorDataset(*train_dataset[val_indices])

            train_fold_dataloader = DataLoader(train_fold_dataset, batch_size=batch_size, shuffle=True)
            val_dataloader = DataLoader(val_fold_dataset, batch_size=batch_size)

            # Restart from the pre-trained weights for every fold; otherwise the model
            # has already been trained on this fold's validation data in an earlier fold
            model = AutoModelForSequenceClassification.from_pretrained(model_name)
            model.to(device)
            # The optimizer and scheduler are also rebuilt so the learning-rate decay spans this fold only
            optimizer = AdamW(model.parameters(), lr=lr)
            scheduler = get_scheduler("linear", optimizer, num_warmup_steps=0, num_training_steps=len(train_fold_dataloader) * num_epochs)

            model.train()
            for epoch in range(num_epochs):
                for batch in train_fold_dataloader:
                    optimizer.zero_grad()
                    input_ids, attention_mask, labels = batch
                    input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
                    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
                    loss = outputs.loss
                    loss.backward()
                    optimizer.step()
                    scheduler.step()

            model.eval()
            with torch.no_grad():
                val_preds = []
                for batch in val_dataloader:
                    input_ids, attention_mask, labels = batch
                    input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
                    outputs = model(input_ids, attention_mask=attention_mask)
                    logits = outputs.logits
                    preds = torch.argmax(logits, dim=1)
                    val_preds.extend(preds.tolist())

            val_labels = train_dataset[val_indices][2].tolist()
            accuracy = accuracy_score(val_labels, val_preds)
            fold_accuracy += accuracy

        average_accuracy = fold_accuracy / num_folds
        if average_accuracy > best_accuracy:
            best_accuracy = average_accuracy
            # Copy the weights of the last fold's model; state_dict() alone only returns references
            best_model = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

# Load the best model
model.load_state_dict(best_model)

# Prepare the test data
test_encodings = tokenizer(test_reviews, truncation=True, padding=True, return_tensors='pt')
test_inputs = test_encodings['input_ids']
test_attention_mask = test_encodings['attention_mask']
test_labels = torch.tensor(test_labels)
test_dataset = TensorDataset(test_inputs, test_attention_mask, test_labels)
# Use an explicit batch size instead of the variable left over from the search loop
test_dataloader = DataLoader(test_dataset, batch_size=64)

# Evaluate the model on the test set
model.eval()
with torch.no_grad():
    test_preds = []
    for batch in test_dataloader:
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1)
        test_preds.extend(preds.tolist())

# Convert the predictions to numpy arrays
test_preds = torch.tensor(test_preds).cpu().numpy()
test_labels = test_labels.cpu().numpy()

# Calculate evaluation metrics
accuracy = accuracy_score(test_labels, test_preds)
precision = precision_score(test_labels, test_preds)
recall = recall_score(test_labels, test_preds)
f1 = f1_score(test_labels, test_preds)

# Print the evaluation metrics
print(f"Test Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
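
To make the four reported metrics concrete, the sketch below computes them by hand from the counts of true and false positives and negatives for a binary task with labels 0 and 1. `binary_metrics` is a hypothetical helper and the labels are made-up toy data; the notebook itself uses the scikit-learn functions, whose default settings for binary labels treat 1 as the positive class in the same way.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
metrics = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(metrics)
```

Because the IMDb test set is balanced, accuracy is already informative here; precision and recall show whether the errors lean toward false positives or false negatives.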