### Project 1: Word Embeddings
This is the first project in NLP SW03 done by Jannine Meier. 

The project was run locally (in GPUHubllabservices).

WandB: https://wandb.ai/nlp_janninemeier/Project1

### Install libraries
Installs all the necessary libraries.

In [1]:
pip install gensim datasets torch transformers wandb nltk scikit-learn

Note: you may need to restart the kernel to use updated packages.


### Imports
Imports all the necessary tools to handle data loading, preprocessing, model creation, training, experiment tracking, and result analysis.

In [2]:
# PyTorch related imports (Machine Learning)
import torch
from torch import nn
from torch.nn.functional import sigmoid
from torch.utils.data import DataLoader, Dataset
from torch.optim import Adam

# Data handling and processing
from datasets import load_dataset
import re
from typing import Dict
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import normalize
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Word2Vec (Word embedding)
from gensim.models import KeyedVectors
import gensim.downloader as api

# Weights & Biases for experiment tracking
import wandb

# Other utility libraries
import numpy as np
from tqdm.auto import tqdm

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# Run it on GPU if possible
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


### Data Loading
Loads the `winogrande_m` subset of the WinoGrande dataset from Hugging Face.

#### Split the Dataset into Training, Validation, and Testing Sets

- **`train_dataset`**: Extracts the training portion of the dataset. Training data is used to fit the machine learning model.
- **`val_dataset`**: Extracts the validation portion of the dataset. Validation data is used to evaluate the model during the tuning of model hyperparameters.
- **`test_dataset`**: Assigns the validation split of the dataset to `test_dataset`. This is a temporary workaround because the actual test set does not contain answer labels, which are necessary for evaluating the model's performance. This approach does not provide an unbiased estimate of the model's performance on truly unseen data.

In [4]:
# Load the dataset from Hugging Face datasets
winogrande_dataset = load_dataset('winogrande', 'winogrande_m')

# Split the dataset into training, validation, and testing sets
train_dataset = winogrande_dataset['train']
val_dataset = winogrande_dataset['validation']
test_dataset = winogrande_dataset['validation'] # workaround for test data set with validation data set 
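
Since the official test split has no labels, the cell above reuses the validation split as a stand-in. One possible alternative, sketched here under assumptions, is to hold out part of the labeled data as a separate test set. The helper `holdout_indices`, the 10% fraction, and the seed are all hypothetical choices for illustration, not part of the original project.

```python
import random

def holdout_indices(n_samples, test_fraction=0.1, seed=42):
    """Split range(n_samples) into disjoint train and held-out test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # deterministic shuffle for reproducibility
    n_test = int(n_samples * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])

train_idx, test_idx = holdout_indices(n_samples=1000)
print(len(train_idx), len(test_idx))  # 900 100
```

With a Hugging Face `Dataset`, index lists like these could be passed to its `select` method to build the two subsets.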

### Configuration of hyperparameters

The configuration dictionary is a structured way to organize and manage the hyperparameters and preprocessing settings for this project. 

My wandb runs are labelled with hidden_dimension/learning_rate/num_epochs.

#### Configuration Keys

- **`hidden_dim`**: The dimensionality of the hidden layer(s) in the neural network. A higher number of dimensions can potentially capture more complex patterns in the data but may also lead to overfitting and increased computational cost. This was used for hyperparameter tuning.

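- **`output_dim`**: Dimensionality of the output layer of the neural network. For binary classification tasks this is typically set to `2`, indicating two possible outcomes (e.g., option1 or option2). Note that the `ComparativeClassifier` defined below outputs a single score per sentence and does not read this value.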

- **`num_epochs`**: The number of complete passes through the training dataset. The goal is to choose a value that balances between underfitting and overfitting, as well as training time. Adjusting the number of epochs can impact the model's performance significantly. This was used for hyperparameter tuning.

- **`learning_rate`**: Step size at each iteration while moving toward a minimum of the loss function. A smaller learning rate gives more precise updates and can lead to better convergence, but it also slows down training. This was used for hyperparameter tuning.

- **`batch_size`**: Number of samples that will be propagated through the network in one forward/backward pass. A size of `32` is a common choice that balances the trade-off between computational efficiency and the stability of the learning process.

In [5]:
config = {
    "hidden_dim": 612,
    "output_dim": 2,
    "num_epochs": 100,
    "learning_rate": 1e-3,
    "batch_size": 32,
    "preprocessing": {
        "lowercase": True,              # Convert all text to lowercase
        "remove_punctuation": True,     # Remove punctuation from text
        "use_stopwords": False,         # Enable/Disable stopword removal
        "language": "english",          # Language for stopword removal (if enabled)
    }
}
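
The run labels of the form hidden_dimension/learning_rate/num_epochs mentioned above could be built directly from the configuration. The following is a minimal sketch. `make_run_name` is a hypothetical helper, and the way the learning rate is formatted is an assumption.

```python
def make_run_name(cfg):
    """Build a run label of the form hidden_dim/learning_rate/num_epochs."""
    return f"{cfg['hidden_dim']}/{cfg['learning_rate']:g}/{cfg['num_epochs']}"

example_cfg = {"hidden_dim": 612, "learning_rate": 1e-3, "num_epochs": 100}
run_name = make_run_name(example_cfg)
print(run_name)  # 612/0.001/100
```

A string like this could then be passed as the `name` argument of `wandb.init`.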

### Preprocessing

#### Text Normalization
The process of converting text into a more uniform format. Here it normalizes a sentence by optionally converting it to lowercase and removing punctuation.

####  Stopword Removal
Stopwords are commonly used words (such as "the", "a", "an", "in"). Removing these can help focus on the more meaningful words in sentences. However, the impact of stopword removal can vary depending on the task, so it's useful to make this configurable.

In [6]:
def normalize_text(sentence: str, remove_punctuation: bool = True, lowercase: bool = True) -> str:
    if lowercase:
        sentence = sentence.lower()
    if remove_punctuation:
        sentence = re.sub(r'[^\w\s]', '', sentence)  # Remove punctuation
    return sentence

def remove_stopwords(sentence: str, use_stopwords: bool = False, language: str = 'english') -> str:
    if not use_stopwords:
        return sentence
    stop_words = set(stopwords.words(language))
    words = word_tokenize(sentence)
    filtered_sentence = [word for word in words if word not in stop_words]
    return ' '.join(filtered_sentence)

def preprocess_text(sentence: str, config: Dict[str, any], stop_words: set = None) -> str:
    # Apply text normalization
    if config["preprocessing"]["lowercase"]:
        sentence = sentence.lower()
    if config["preprocessing"]["remove_punctuation"]:
        sentence = re.sub(r'[^\w\s]', '', sentence)
    
    # Apply stopword removal if enabled and stop words are provided
    if stop_words and config["preprocessing"]["use_stopwords"]:
        words = word_tokenize(sentence)
        sentence = ' '.join([word for word in words if word not in stop_words])
    
    return sentence
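
To make the effect of normalization concrete, here is a small standalone sketch that repeats the two normalization steps on a made-up sentence (the example sentence is an assumption, not taken from the dataset). It also shows a useful detail: the regular expression `[^\w\s]` keeps the underscore, because `_` counts as a word character, so the WinoGrande placeholder survives punctuation removal.

```python
import re

def normalize_text(sentence, remove_punctuation=True, lowercase=True):
    """Same two steps as the notebook's normalization: lowercase, then strip punctuation."""
    if lowercase:
        sentence = sentence.lower()
    if remove_punctuation:
        sentence = re.sub(r"[^\w\s]", "", sentence)
    return sentence

raw = "The trophy doesn't fit, because _ is too big."
clean = normalize_text(raw)
print(clean)  # the trophy doesnt fit because _ is too big
```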

### Word Embeddings and Sentence Vectorization

#### Load Pre-trained Word2Vec Embeddings

The gensim downloader API (`api.load`) loads the pre-trained model `word2vec-google-news-300`. This model is trained on the Google News dataset and contains 300-dimensional vectors for 3 million words and phrases.

In [7]:
word_vectors = api.load('word2vec-google-news-300')

#### Convert Sentence to Vector

The `sentence_to_avg_vector` function converts a given sentence into a vector by averaging the word vectors of the words contained in the sentence. This approach allows for the representation of the entire sentence in a fixed-size vector, which can then be used for further ML tasks. This method averages the semantics of all words, potentially diluting the impact of particularly salient words and ignoring the order of words.

- **Process**:
  1. **Tokenization**: The sentence is split into individual words.
  2. **Vector Accumulation**: For each word in the sentence, if the word is present in the Word2Vec model, its vector is added to a list of vectors. Words not found in the Word2Vec model are ignored. This step assumes that the meaningful content of a sentence can be adequately represented by the words found in the Word2Vec vocabulary.
  3. **Averaging**: Calculates the mean of these vectors along the zeroth axis (column-wise mean if the vectors are stacked as rows in a matrix). This results in a single vector that represents the average of all word vectors in the sentence.

The function returns the average vector, scaled to unit L2 norm. If no words in the sentence are found in the Word2Vec model, a zero vector of the same dimensionality as the Word2Vec vectors is returned.

In [8]:
def sentence_to_avg_vector(sentence, word_vectors):
    # Tokenize the sentence into words
    words = sentence.split()

    # Initialize an empty list to store the vectors
    vector_list = []

    # Iterate over each word in the sentence
    for word in words:
        # Check if the word is in the Word2Vec model
        if word in word_vectors.key_to_index:
            # Add the word's vector to the list
            vector_list.append(word_vectors[word])

    # If the list is empty (no words found in the Word2Vec model), return a zero vector
    if not vector_list:
        return np.zeros(word_vectors.vector_size, dtype=np.float32)  # match the float32 dtype of the Word2Vec vectors

    # Compute the average vector and normalize it
    avg_vector = np.mean(vector_list, axis=0)
    avg_vector = normalize(avg_vector.reshape(1, -1))  # Reshape for sklearn's normalize function
    return avg_vector.flatten()  # Flatten to convert back to 1D array
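
The averaging approach can be illustrated with a toy vocabulary. This sketch uses made-up 3-dimensional vectors instead of the real Word2Vec model, and `toy_sentence_vector` is a hypothetical stand-in for `sentence_to_avg_vector`. It shows that out-of-vocabulary words are skipped and that word order has no influence on the result.

```python
import numpy as np

# Toy 3-dimensional "embeddings"; the values are made up for illustration.
toy_vectors = {
    "dog": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "bites": np.array([0.0, 1.0, 0.0], dtype=np.float32),
    "man": np.array([0.0, 0.0, 1.0], dtype=np.float32),
}

def toy_sentence_vector(sentence, vectors, dim=3):
    """Average the vectors of known words, then scale to unit L2 norm."""
    found = [vectors[w] for w in sentence.split() if w in vectors]
    if not found:
        return np.zeros(dim, dtype=np.float32)  # no known word: zero vector
    avg = np.mean(found, axis=0)
    return avg / np.linalg.norm(avg)

v1 = toy_sentence_vector("dog bites man", toy_vectors)
v2 = toy_sentence_vector("man bites dog", toy_vectors)
v3 = toy_sentence_vector("dog bites unicorn", toy_vectors)  # "unicorn" is out of vocabulary

print(np.allclose(v1, v2))  # True: word order is lost
print(v3)                   # only "dog" and "bites" contribute
```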


### WinograndeDataset Class

The `WinograndeDataset` class is designed to work with the dataset in the context of PyTorch machine learning tasks, enabling it to be used seamlessly with PyTorch's `DataLoader` for efficient and easy batching, shuffling, and parallel data loading and evaluation.

- **Process**:
  1. **Sentence and Options Extraction**: For the given index, the method extracts the sentence, two options (option1 and option2), and the correct answer from the dataset.
  2. **Sentence Replacement**: It creates two new sentences by replacing a placeholder `_` in the original sentence with each of the two options.
  3. **Vectorization**: Each modified sentence is then converted to an average vector using the `sentence_to_avg_vector` function, leveraging the word embeddings to encode the sentence semantically.
  4. **Stacking and Labeling**: The two vectors are stacked into a tensor, and a label is created based on the correct answer (+1 for option1, -1 for option2).

The return is a tuple of the stacked sentence vectors and the corresponding label. 

In [9]:
class WinograndeDataset(Dataset):
    def __init__(self, data, word_vectors, config):
        """
        Args:
            data: a 'Dataset' object from the Hugging Face 'datasets' library.
            word_vectors: pre-loaded Word2Vec model.
            config: configuration dictionary with preprocessing settings.
        """
        self.data = data
        self.word_vectors = word_vectors
        self.config = config
        
        # Preload stop words if needed for efficiency
        if self.config["preprocessing"]["use_stopwords"]:
            self.stop_words = set(stopwords.words(self.config["preprocessing"]["language"]))
        else:
            self.stop_words = None

    def __len__(self):
        """Returns the total number of data samples."""
        return len(self.data)

    def __getitem__(self, idx):
        """
        Retrieves the idx-th data sample from the dataset.
        """
        # Get the sentence and the options from the dataset
        sample = self.data[idx]
        sentence = sample['sentence']
        option1 = sample['option1']
        option2 = sample['option2']
        answer = sample['answer']
    
        # Create two versions of the sentence, one with each option
        sentence_option1 = sentence.replace('_', option1)
        sentence_option2 = sentence.replace('_', option2)
        
        # Convert each sentence to an average vector
        vector_option1 = sentence_to_avg_vector(sentence_option1, self.word_vectors)
        vector_option2 = sentence_to_avg_vector(sentence_option2, self.word_vectors)
        
        # Stack the vectors and convert to a tensor
        sentence_vectors = torch.stack([torch.tensor(vector_option1), torch.tensor(vector_option2)], dim=0)
        
        # The label is +1 if the answer is '1' (option1), -1 if the answer is '2' (option2)
        label = torch.tensor(1 if answer == '1' else -1, dtype=torch.long)
        
        return sentence_vectors, label


train_dataset = WinograndeDataset(winogrande_dataset['train'], word_vectors, config)
val_dataset = WinograndeDataset(winogrande_dataset['validation'], word_vectors, config)
test_dataset = WinograndeDataset(winogrande_dataset['validation'], word_vectors, config)
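
The placeholder replacement and labeling steps can be tried in isolation. The sketch below uses a made-up sample with the same field names (`sentence`, `option1`, `option2`, `answer`). `build_pair` is a hypothetical helper that mirrors only the string handling of `__getitem__`, not the vectorization.

```python
def build_pair(sample):
    """Fill the '_' placeholder with each option and map the answer to +1 / -1."""
    filled1 = sample["sentence"].replace("_", sample["option1"])
    filled2 = sample["sentence"].replace("_", sample["option2"])
    label = 1 if sample["answer"] == "1" else -1
    return filled1, filled2, label

# Made-up sample in the same field layout as the dataset.
sample = {
    "sentence": "The trophy does not fit in the suitcase because the _ is too big.",
    "option1": "trophy",
    "option2": "suitcase",
    "answer": "1",
}
s1, s2, label = build_pair(sample)
print(s1)
print(s2)
print(label)  # 1
```

The two filled sentences differ in a single word, so their averaged vectors tend to be very close to each other, which makes the comparison task hard for this representation.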


### Creating DataLoaders

#### DataLoader for the Training Set

Prepares the training dataset for the model training process.

- **`shuffle`**: Set to `True`, as shuffling helps prevent the model from learning spurious correlations based on the order of the data, which helps improve generalization.

#### DataLoader for the Validation Set

Sets up the validation dataset for evaluating the model periodically during training.

- **`shuffle`**: Set to `False` because shuffling is not necessary for validation, as the order of data does not affect evaluation metrics.

#### DataLoader for the Test Set

Prepares the test dataset for the final evaluation of the model to measure the performance after training and hyperparameter tuning have been completed.

- **`shuffle`**: Also set to `False`, for the same reasons as the validation DataLoader.

In [10]:
from torch.utils.data import DataLoader

# Create the DataLoader for the training set
train_loader = DataLoader(train_dataset, batch_size=config["batch_size"], shuffle=True)

# Create the DataLoader for the validation set
val_loader = DataLoader(val_dataset, batch_size=config["batch_size"], shuffle=False)

# Create the DataLoader for the test set
test_loader = DataLoader(test_dataset, batch_size=config["batch_size"], shuffle=False)


### ComparativeClassifier Neural Network Class

A neural network architecture for comparative classification tasks, specifically for evaluating pairs of sentence vectors. Its primary objective is to compare two sentence vectors and determine which sentence is more likely to be the correct fill-in for a given context, leveraging the power of learned word embeddings and a simple feedforward neural network structure.

#### Class Constructor (`__init__`)

- **Network Architecture**:
  - The network consists of a sequential model (`nn.Sequential`) with the following layers:
    1. **Linear Layer**: Transforms the input vector to the hidden layer size (`input_size` to `hidden_size`).
    2. **ReLU Activation**: Introduces non-linearity into the model, allowing it to learn complex patterns.
    3. **Linear Layer**: Further transforms the data from the hidden layer size to a single scalar value (`hidden_size` to `1`). This output represents the "score" of a sentence vector, essentially estimating its likelihood of being the correct answer.

#### Forward Pass (`forward`)

- **Input**:
  - `sentence_pairs`: A tensor containing pairs of sentence vectors. Each pair corresponds to two variations of a sentence, with one word replaced by different options (option1 and option2).

- **Process**:
  1. **Vector Separation**: The method separates the sentence pairs into two individual vectors, `vec1` and `vec2`, representing the first and second sentence vectors in each pair.
  2. **Scoring**: Each vector is independently passed through the neural network to compute a score, indicating its likelihood of correctness.
  3. **Comparison**: The scores of `vec2` are subtracted from `vec1`. A positive result implies a preference for `vec1` (indicating it is more likely to be correct), while a negative result suggests a preference for `vec2`.

- **Output**:
  - The output is a tensor of score differences for each pair of sentence vectors. This differential scoring mechanism enables the model to effectively compare the two sentences and make a decision based on which one has the higher score, thus determining which option fits the context better.

In [11]:
class ComparativeClassifier(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(ComparativeClassifier, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1)  # Output a single score per sentence vector
        )
    
    def forward(self, sentence_pairs):
        vec1 = sentence_pairs[:, 0, :]  # First sentence vector in each pair
        vec2 = sentence_pairs[:, 1, :]  # Second sentence vector in each pair
        
        # Pass both vectors through the network
        score1 = self.network(vec1)
        score2 = self.network(vec2)
        
        # Subtract score2 from score1; positive values indicate preference for vec1, negative for vec2
        return score1 - score2
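
The shared-scorer design has a simple property worth checking: because both sentences go through the same weights, swapping the two sentences of a pair flips the sign of the output, and two identical sentences always give a difference of zero. The following NumPy sketch reproduces the Linear, ReLU, Linear structure with random weights. The tiny layer sizes and the names `score` and `compare` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8  # tiny sizes for illustration only

# One shared set of weights, used for both sentences of a pair.
W1 = rng.normal(size=(input_size, hidden_size))
b1 = rng.normal(size=hidden_size)
W2 = rng.normal(size=(hidden_size, 1))
b2 = rng.normal(size=1)

def score(x):
    """Linear -> ReLU -> Linear, returning one score per row of x."""
    hidden = np.maximum(x @ W1 + b1, 0.0)
    return hidden @ W2 + b2

def compare(pairs):
    """pairs has shape (batch, 2, input_size); returns score1 - score2."""
    return score(pairs[:, 0, :]) - score(pairs[:, 1, :])

pairs = rng.normal(size=(5, 2, input_size))
diff = compare(pairs)
swapped = compare(pairs[:, ::-1, :])  # swap the two sentences in every pair

print(diff.shape)                   # (5, 1)
print(np.allclose(diff, -swapped))  # True
```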

### Experiment Setup and Training 

#### Weights & Biases Initialization

Initializes a new run in Weights & Biases. This step configures the experiment tracking, allowing for the monitoring and analysis of model performance across different runs. The configuration dictionary `config` contains all hyperparameters and settings for the experiment, facilitating easy adjustments and tracking.

#### Loss Function and Optimizer

Defines the loss criterion using `MarginRankingLoss` with a specified margin, suitable for ranking and comparison tasks in the classification model. An Adam optimizer is initialized to update the model's weights, with the learning rate set according to the Weights & Biases configuration.
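
For a single example, `MarginRankingLoss` computes max(0, -y * (x1 - x2) + margin). In this notebook `x1` is the score difference, `x2` is fixed at zero, and `y` is the +1/-1 label. The plain-Python sketch below evaluates that formula for a few hand-picked values. `margin_ranking_loss` is a hypothetical helper written for illustration, not the PyTorch implementation.

```python
def margin_ranking_loss(score_diff, target, margin=1.0):
    """Per-example loss max(0, -target * score_diff + margin), with x2 fixed at 0."""
    return max(0.0, -target * score_diff + margin)

print(margin_ranking_loss(0.0, +1))   # 1.0  (no preference: loss equals the margin)
print(margin_ranking_loss(2.0, +1))   # 0.0  (correct option wins by more than the margin)
print(margin_ranking_loss(0.5, +1))   # 0.5  (correct, but inside the margin)
print(margin_ranking_loss(-0.5, +1))  # 1.5  (wrong option preferred)
```

A mean loss that stays close to the margin, such as the values around 1.0002 logged in this run, is what one would expect if most score differences are close to zero.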


#### Training Loop

Iterates over the dataset for a specified number of epochs, performing the following actions for each epoch:
- Sets the model to training mode.
- Iterates over the training dataset in batches, computing the loss, performing backpropagation, and updating the model weights accordingly.
- Computes training accuracy based on the model's performance on the training data.
- Logs training accuracy and loss metrics for each epoch to Weights & Biases for real-time tracking and analysis.

#### Validation Loop

Evaluates the model on a separate validation dataset after each training epoch to monitor performance on unseen data:
- Sets the model to evaluation mode and wraps the loop in `torch.no_grad()` to disable gradient computations.
- Computes validation accuracy by comparing the model's predictions against the true labels.
- Updates and logs the best validation accuracy seen during training to Weights & Biases.

#### Best Scores Tracking and Model Checkpointing

Tracks the best validation accuracy during the training process and saves a model checkpoint to disk whenever it improves, so the saved file always holds the weights of the best epoch so far.

In [12]:
# Assuming ComparativeClassifier is properly defined

# Initialize Weights & Biases
wandb.init(project="Project1", config=config)
config = wandb.config

# Initialize the model with hyperparameters from wandb
model = ComparativeClassifier(input_size=300, hidden_size=config.hidden_dim)

# Move your model to the device (CPU or GPU)
model.to(device)

# Define the loss criterion with MarginRankingLoss
criterion = nn.MarginRankingLoss(margin=1.0)

# Define the optimizer with learning rate from wandb
optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)

# Initialize variables for tracking the best scores
best_val_accuracy = 0.0  # For tracking the best validation accuracy
lowest_train_loss = float('inf')  # For tracking the lowest training loss

# Training loop
for epoch in range(config.num_epochs):
    model.train()
    total_train, correct_train, epoch_loss = 0, 0, 0.0
    
    for sentence_vectors, labels in train_loader:
        sentence_vectors = sentence_vectors.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        
        outputs = model(sentence_vectors).squeeze(-1)  # (batch, 1) -> (batch,), also safe for a batch of size 1
        target = labels.float()
        loss = criterion(outputs, torch.zeros_like(outputs), target)
        
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item() * sentence_vectors.size(0)  # Multiply loss by batch size
        predictions_correct = ((outputs > 0) & (target > 0)) | ((outputs < 0) & (target < 0))
        correct_train += predictions_correct.sum().item()
        total_train += target.size(0)
    
    epoch_train_loss = epoch_loss / total_train
    train_accuracy = correct_train / total_train * 100

    # Validation loop
    model.eval()
    total_val, correct_val = 0, 0
    with torch.no_grad():
        for sentence_vectors, labels in val_loader:
            sentence_vectors = sentence_vectors.to(device)
            labels = labels.to(device)
            outputs = model(sentence_vectors).squeeze(-1)  # (batch, 1) -> (batch,), also safe for a batch of size 1
            target = labels.float()
            
            predictions_correct = ((outputs > 0) & (target > 0)) | ((outputs < 0) & (target < 0))
            correct_val += predictions_correct.sum().item()
            total_val += target.size(0)
            
    validation_accuracy = correct_val / total_val * 100

    # Update best scores if current scores are better and optionally save the best model
    if validation_accuracy > best_val_accuracy:
        best_val_accuracy = validation_accuracy
        torch.save(model.state_dict(), 'model_best_val_accuracy.pth')
   
    
    # Log metrics to wandb
    wandb.log({
        "epoch": epoch,
        "train_loss": epoch_train_loss,
        "train_accuracy": train_accuracy,
        "validation_accuracy": validation_accuracy,
        "best_validation_accuracy": best_val_accuracy,
    })
    
    print(f'Epoch {epoch+1}/{config.num_epochs}, '
          f'Train Loss: {epoch_train_loss:.8f}, '
          f'Training Accuracy: {train_accuracy:.2f}%, '
          f'Validation Accuracy: {validation_accuracy:.2f}%')


# Finish wandb run
wandb.finish()


[34m[1mwandb[0m: Currently logged in as: [33mjannine-meier[0m ([33mnlp_janninemeier[0m). Use [1m`wandb login --relogin`[0m to force relogin


Epoch 1/100, Train Loss: 1.00107079, Training Accuracy: 41.44%, Validation Accuracy: 51.46%
Epoch 2/100, Train Loss: 1.00040066, Training Accuracy: 43.32%, Validation Accuracy: 49.49%
Epoch 3/100, Train Loss: 1.00025810, Training Accuracy: 44.84%, Validation Accuracy: 49.33%
Epoch 4/100, Train Loss: 1.00024807, Training Accuracy: 44.06%, Validation Accuracy: 50.12%
Epoch 5/100, Train Loss: 1.00017631, Training Accuracy: 45.97%, Validation Accuracy: 50.36%
Epoch 6/100, Train Loss: 1.00024765, Training Accuracy: 45.35%, Validation Accuracy: 50.36%
Epoch 7/100, Train Loss: 1.00028476, Training Accuracy: 46.01%, Validation Accuracy: 49.80%
Epoch 8/100, Train Loss: 1.00021277, Training Accuracy: 45.39%, Validation Accuracy: 49.41%
Epoch 9/100, Train Loss: 1.00021745, Training Accuracy: 46.79%, Validation Accuracy: 50.04%
Epoch 10/100, Train Loss: 1.00030070, Training Accuracy: 46.60%, Validation Accuracy: 50.28%
Epoch 11/100, Train Loss: 1.00023997, Training Accuracy: 46.64%, Validation Acc

Run history:
best_validation_accuracy,▁▁▁▁▁▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅███████████████████
epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
train_accuracy,▁▅▅▅▆▆▆▅▇▇▆▆▇▇█▇▆▇▇▇▇█▇▆▇█▇▇███▆▆▅▇▇▅▆▇█
train_loss,█▂▂▂▂▂▂▁▁▁▂▂▂▃▂▂▁▁▁▂▂▃▁▂▂▁▁▂▂▁▁▂▂▁▁▁▁▁▁▂
validation_accuracy,▇▄▅▄▆▅▄▆▂▂▅▆▄▅▆▇▄▇▇▆▇▂▂▇▆▃▂▂▅█▄▅▆▆▂▃▂▆▇▁

Run summary:
best_validation_accuracy,52.48619
epoch,99.0
train_accuracy,48.00625
train_loss,1.00021
validation_accuracy,47.75059


### Testing
Once I have the best model, I run it on the test set to get the final performance metrics (here just for demonstration, as it is again run on the validation set).

In [18]:
# Path to your saved model
path_to_best_model = 'model_best_val_accuracy.pth'

# Load the best model state
model.load_state_dict(torch.load(path_to_best_model, map_location=device))

# Ensure the model is in evaluation mode
model.eval()

# Initialize variables to track test performance
total_test, correct_test = 0, 0

# No gradient updates needed during evaluation
with torch.no_grad():
    for sentence_vectors, labels in test_loader:
        sentence_vectors = sentence_vectors.to(device)
        labels = labels.to(device)
        # Forward pass to get outputs
        outputs = model(sentence_vectors).squeeze(-1)  # (batch, 1) -> (batch,), also safe for a batch of size 1
        
        # Convert labels to a float tensor
        target = labels.float()
        
        # Determine prediction correctness
        predictions_correct = ((outputs > 0) & (target > 0)) | ((outputs < 0) & (target < 0))
        
        # Update correct predictions and total count
        correct_test += predictions_correct.sum().item()
        total_test += target.size(0)

# Calculate the test accuracy
test_accuracy = correct_test / total_test * 100

# Print the test accuracy
print(f'Test Accuracy: {test_accuracy:.2f}%')

Test Accuracy: 52.49%


### Interpretation of Results

Achieving a test accuracy of 52.49% on a task where random guessing yields 50% indicates that my model performs only slightly better than chance. Because the "test" set here is the validation split that was also used to select the best checkpoint, this number simply reproduces the best validation accuracy (52.486%) and is optimistically biased. Here's my interpretation of these results and the steps I might consider taking next:

### My Interpretation
- **Marginally Better Than Random**: This performance suggests that my model has learned some patterns from the data relevant to the task at hand, but not significantly so. The marginal improvement could stem from various factors, for example the task's complexity (even GPT-4 only reached 87.5% accuracy, see https://paperswithcode.com/sota/common-sense-reasoning-on-winogrande). Other factors might include the representativeness of my training data, or the model architecture and hyperparameters I chose.
- **Overfitting Concerns**: If my training accuracy was much higher than my validation accuracy (e.g. run 128/1e-4), it might indicate that my model overfitted the training data. This means it learned the training data's specific patterns so well that it failed to generalize to new, unseen data.
- **Underfitting Concerns**: Conversely, if both training and validation accuracies are low, which they are at below 53%, my model might be underfitting. This could happen if the model is too simplistic to capture the underlying data patterns or hasn't been trained sufficiently.
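
To put "marginally better than random" into numbers, a rough check is to compare the accuracy against what pure guessing would produce on a set of this size. The sketch below assumes the evaluation split has 1,267 examples, of which 665 were answered correctly. These two counts are an assumption, chosen because 665 / 1267 reproduces the logged best validation accuracy of 52.48619%.

```python
import math

n_examples = 1267   # assumed size of the validation split
n_correct = 665     # 665 / 1267 matches the logged 52.48619 %

accuracy = n_correct / n_examples
std_error = math.sqrt(0.5 * 0.5 / n_examples)  # standard error under pure guessing
z_score = (accuracy - 0.5) / std_error

print(f"accuracy = {accuracy:.4%}")
print(f"z = {z_score:.2f}")  # about 1.77, below the usual 1.96 threshold
```

Under these assumptions the result is within roughly two standard errors of chance, so it would not count as significantly better than guessing at the usual 95% level. Since the checkpoint was also selected on this same split, even this estimate leans optimistic.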

### My Next Steps
- **Enhance Data Quantity**: I should also have a "real" test set.
- **Revisit Feature Engineering**: Given that I'm using averaged word vectors, I should consider whether there's a more effective way to represent my text data. For some tasks, different types of embeddings or more complex text representations might better capture the nuances.
- **Adjust Model Complexity**: I need to evaluate whether my model's architecture suits the task. Increasing the model's complexity might help, but it also raises the risk of overfitting. Simplifying the model can sometimes improve its ability to generalize.
- **Continue Hyperparameter Tuning**: Further tuning might be necessary. The small margin by which my model exceeds random guessing suggests there might still be optimization room.
- **Incorporate Regularization and Dropout**: If overfitting is an issue, adding regularization techniques or dropout layers could help my model avoid memorizing the training data.
- **Implement Cross-validation**: Using cross-validation could provide a more robust performance estimate and help prevent overfitting by ensuring that the model performs well across different data subsets.
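
As a starting point for the cross-validation idea, here is a minimal sketch of how k-fold index splits could be generated in plain Python. `k_fold_indices` is a hypothetical helper, and the folds are contiguous for simplicity. In practice the indices would usually be shuffled first, or a library routine such as scikit-learn's `KFold` would be used instead.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, val in folds:
    print(len(train), val)
```

Each sample appears in exactly one validation fold, so averaging the validation accuracy over the folds uses all labeled data for evaluation.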

In summary, while my model shows some capacity to learn from the training data, the performance on the test set, close to random, indicates there's significant room for improvement. I'll need to carefully consider these factors and experiment with different approaches to enhance my model's generalization capabilities and thus improve its (real) performance on unseen data.



### References
I used ChatGPT and the uploaded files on ILIAS to assist me with coding in this project.