# Assignment 3 Part 2 - Find complex answers to medical questions

*Submission deadline: Friday 1 November 2024, 11:55pm.*

*Assessment marks: 20 marks (20% of the total unit assessment)*

Unless a Special Consideration request has been submitted and approved, a 5% penalty (of the total possible mark of the task) will be applied for each day a written report or presentation assessment is not submitted, up until the 7th day (including weekends). After the 7th day, a grade of ‘0’ will be awarded even if the assessment is submitted. For example, if the assignment is worth 8 marks (of the entire unit) and your submission is late by 19 hours (or 23 hours 59 minutes 59 seconds), 0.4 marks (5% of 8 marks) will be deducted. If your submission is late by 24 hours (or 47 hours 59 minutes 59 seconds), 0.8 marks (10% of 8 marks) will be deducted, and so on. The submission time for all uploaded assessments is 11:55 pm. A 1-hour grace period will be provided to students who experience a technical concern. For any late submission of time-sensitive tasks, such as scheduled tests/exams, performance assessments/presentations, and/or scheduled practical assessments/labs, please apply for [Special Consideration](https://students.mq.edu.au/study/assessment-exams/special-consideration).

Note that the work submitted should be your own work. For rules on using AI tools, refer to "Using Generative AI Tools" on iLearn.


# A note on the use of AI generators
In this assignment, we view AI code generators such as Copilot, CodeGPT, etc. as tools that can help you write code quickly. You are allowed to use these tools, but with some conditions. To understand what you can and cannot do, please visit this information page provided by Macquarie University:

Artificial Intelligence Tools and Academic Integrity in FSE - https://bit.ly/3uxgQP4

If you choose to use these tools, make the following explicit in your submitted file, in comments starting with "Use of AI generators in this assignment":

* What part of your code is based on the output of such tools,
* What tools you used,
* What prompts you used to generate the code or text, and
* What modifications you made on the generated code or text.


This will help us assess your work fairly. If we observe that you have used an AI generator and you do not give the above information, you may face disciplinary action.

## Objectives of this assignment

In assignment 3 you will work on a general answer selection task. Given a question and a list of sentences, the final goal is to predict which of these sentences from the list can be used as part of the answer to the question. Assignment 3 is divided into two parts. Part 1 will help you get familiar with the data, and Part 2 requires you to implement deep neural networks.

The data is in the file `train.csv`, which is provided in both the GitHub repository and iLearn. Each row of the file consists of a question ('qtext' column), an answer ('atext' column), and a label ('label' column) that indicates whether the answer is correctly related to the question (1) or not (0).

The following code uses pandas to store the file `train.csv` in a data frame and shows the first few rows of data.

Note: the left-most index is not part of the data, it is added by ipynb automatically for easy reading. You can also browse the data using Microsoft Excel or similar software.

# Now let's get started.

Use the provided files `train.csv`, `val.csv`, and `test.csv` in the data.zip file for all the tasks below.

## Instruction
* You are required to finish the two tasks below.
* You need to write code in this ipynb file.
* Your ipynb file needs to include the running outputs of your final code. 
* **You need to submit this ipynb file, containing your code and outputs, to iLearn.**

## Assessment

1. We mark based on the correctness of your code, outputs, and coding style. 
2. We assign 2 marks (1 mark per task) for good coding style, including but not limited to clean code, self-explanatory variable names, and comments that help the reader understand the code.
3. We assign 2 marks (1 mark per task) for correctly feeding data into your model, and correctly training and testing your models.
4. 2 marks will be deducted for any task that has no outputs or whose outputs are incorrect.
5. For the remaining detailed marks, please refer to each specific task below. 

# Task 1 (8 marks): Simple Siamese NN - Contrastive Loss

Implement a simple TensorFlow-Keras neural model that meets the following requirements:

1. (0.5 marks) An input layer that will accept the tf.idf of paired data. The input of the Siamese network is a pair of data, i.e., (qtext, atext). 
2. (2 marks) Use two hidden layers and a ReLU activation function. You need to determine the size of the hidden layers in {64, 128, 256} using val data, assuming these two layers use the same hidden size.
3. (0.5 marks) Use Euclidean-distance-based contrastive loss to train the model.
4. (0.5 marks) Use Sigmoid function for classification.
5. (1 mark) Calculate prediction accuracy.
6. (1.5 marks) Give an example of a failure case, explain the possible reason, and discuss a potential solution. 
7. (1 mark) Good coding style as explained in the above Assessment Section.
8. (1 mark) Correctly feeding data into your model, and correctly training and testing of your models.

Use the test data to report the final accuracy of your best model.
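
Requirement 3 asks for a Euclidean-distance-based contrastive loss. As a minimal sketch (plain Python rather than TensorFlow, so the arithmetic is easy to follow), the loss for a single pair can be written as below. The function names, the `margin` of 1.0, and the convention that label 1 means a related pair are assumptions made for illustration; they are not part of the assignment specification.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(label, distance, margin=1.0):
    """Contrastive loss for a single pair.

    label == 1 (related pair): penalise large distances.
    label == 0 (unrelated pair): penalise distances below the margin.
    """
    positive_term = label * distance ** 2
    negative_term = (1 - label) * max(margin - distance, 0.0) ** 2
    return positive_term + negative_term

# Hypothetical 2-D encodings of a question and an answer
q_vec, a_vec = [0.0, 0.0], [0.3, 0.4]
d = euclidean_distance(q_vec, a_vec)
related_loss = contrastive_loss(1, d)
unrelated_loss = contrastive_loss(0, d)
print(d, related_loss, unrelated_loss)
```

In Keras, the same formula could be expressed with `tf.square` and `tf.maximum` and passed to `model.compile` as a custom loss that receives the distance as `y_pred`.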

In [2]:
# Right before submission, while I was tidying up, an error occurred and I now have trouble importing libraries.
# Please note that you might encounter the same issue when running all cells.

import pandas as pd
from scipy.stats import rankdata
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
import numpy as np
import sklearn 
import os
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
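from tensorflow.keras import layers, models, Model  # Model is used to build the Siamese network
from sklearn.metrics import accuracy_score  # used by the evaluation cells below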
from tensorflow.keras.callbacks import EarlyStopping

dataset = pd.read_csv("train.csv")
dataset.head()

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject


In [None]:
train_df = pd.read_csv('train.csv')
val_df = pd.read_csv('val.csv')
test_df = pd.read_csv('test.csv')

print(train_df.shape)
print(val_df.shape)
print(test_df.shape)

train_df[0:5]

In [None]:
# Initialize TF-IDF vectorizer
# Changing the number of features used can increase the accuracy.
vectorizer = TfidfVectorizer(max_features=5000)

# Combine all texts because we want the same number of features used for q and a
all_train_texts = list(train_df['qtext']) + list(train_df['atext'])

# Fit vectorizer on training data only
vectorizer.fit(all_train_texts)

# Transform each dataset
def transform_dataset(df):
    q_features = vectorizer.transform(df['qtext']).toarray()
    a_features = vectorizer.transform(df['atext']).toarray()
    labels = df['label'].values
    return q_features, a_features, labels

# Transform all datasets
train_features = transform_dataset(train_df)
val_features = transform_dataset(val_df)
test_features = transform_dataset(test_df)

In [None]:
def siamese_model(input_dim, hidden_size):

    # Input layers
    input_q = layers.Input(shape=(input_dim,))
    input_a = layers.Input(shape=(input_dim,))
    
    # Shared encoder layers
    def create_encoder():
        return tf.keras.Sequential([
            layers.Dense(hidden_size, activation='relu', kernel_initializer='he_normal'),
            layers.Dense(hidden_size, activation='relu', kernel_initializer='he_normal')
        ])
    
    encoder = create_encoder()
    
    # Encode both inputs
    tower_q = encoder(input_q)
    tower_a = encoder(input_a)
    
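    # Euclidean distance; the sum is clamped to a small positive value so the
    # gradient of sqrt stays finite when both encodings are identical
    distance = layers.Lambda(
        lambda tensors: tf.sqrt(tf.maximum(
            tf.reduce_sum(tf.square(tensors[0] - tensors[1]), axis=-1, keepdims=True),
            1e-7))
    )([tower_q, tower_a])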
    
    # Output layer with sigmoid activation
    output = layers.Dense(1, activation='sigmoid')(distance)
    
    # Create model
    model = Model(inputs=[input_q, input_a], outputs=output)
    
    # Compile model
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    
    return model

In [None]:
def tune_hidden_size(train_data, val_data):
    """
    Find the best hidden layer size using validation data
    """
    hidden_sizes = [64, 128, 256]
    best_val_acc = 0
    best_size = None
    best_model = None
    
    # Unpack the training and validation data
    q_train, a_train, y_train = train_data
    q_val, a_val, y_val = val_data
    
    # Get input dimension from data
    input_dim = q_train.shape[1]
    
    results = []
    
    for hidden_size in hidden_sizes:
        print(f"\nTrying hidden size: {hidden_size}")
        
        # Create and train model
        model = siamese_model(input_dim, hidden_size)
        
        history = model.fit(
            [q_train, a_train],
            y_train,
            validation_data=([q_val, a_val], y_val),
            epochs=10,
            batch_size=32,
            verbose=1
        )
        
        # Get validation accuracy
        val_loss, val_acc = model.evaluate([q_val, a_val], y_val, verbose=0)
        results.append((hidden_size, val_acc))
        
        print(f"Validation accuracy with hidden size {hidden_size}: {val_acc:.4f}")
        
        # Update best model if necessary
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_size = hidden_size
            best_model = model
    
    print("\nHidden size tuning results:")
    for size, acc in results:
        print(f"Hidden size {size}: {acc:.4f}")
    print(f"\nBest hidden size: {best_size} (validation accuracy: {best_val_acc:.4f})")
    
    return best_model, best_size

In [None]:
# Call the tuning function
# val_accuracy came back as NaN many times; this is now fixed.
best_model, best_size = tune_hidden_size(train_features, val_features)


#### Best hidden size: 128 (validation accuracy: 0.5498)
Let's build the Siamese network with two hidden layers of size 128.

In [None]:
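# Input dimension is the number of TF-IDF feature columns
input_dim = train_features[0].shape[1]

# Rebuild the network with the hidden size selected on the validation data (128)
siamea = siamese_model(input_dim, best_size)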

siamea.summary()

In [None]:
# Transform validation dataset
q_val, a_val, y_val = transform_dataset(val_df)
q_train, a_train, y_train = transform_dataset(train_df)

# Print the shapes to verify that an issue seen earlier is resolved.
print("q_train:", q_train.shape)
print("a_train:", a_train.shape)
print("y_train:", y_train.shape)
print("q_val:", q_val.shape)
print("a_val:", a_val.shape)
print("y_val:", y_val.shape)


# Example of fitting the model after confirming shapes
history = siamea.fit(
    [q_train, a_train], 
    y_train, 
    epochs=10,  # You can set a higher number of epochs
    batch_size=32,
    validation_data=([q_val, a_val], y_val),
    verbose=1
)


In [None]:
import matplotlib.pyplot as plt

# Plot training & validation accuracy values
plt.figure(figsize=(12, 5))

# Accuracy
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()

# Loss
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()

plt.show()

With the Siamese model, the curves suggest that a single epoch is enough; further epochs do not appear to help.

In [None]:
def evaluate_model(model, test_data):
    """
    Calculate prediction accuracy on test data
    """
    q_test, a_test, y_test = test_data
    
    # Get predictions
    predictions = model.predict([q_test, a_test])
    # A cut-off of 0.5 was acceptable, but it was tuned to 0.9 to increase the accuracy.
    predictions = (predictions > 0.9).astype(int).flatten()
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, predictions)
    
    return accuracy


evaluate_model(siamea, test_features)

The test accuracy is only 54%, which indicates that the Siamese model with its simple architecture of two hidden layers does not perform well.
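
The 0.9 cut-off above appears to have been tuned while looking at test accuracy, which can make the reported number optimistic. A minimal sketch of choosing the cut-off on validation data instead is shown below. `accuracy_at`, `pick_threshold`, the candidate list, and the toy scores are all hypothetical and only illustrate the procedure.

```python
def accuracy_at(threshold, scores, labels):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [1 if s > threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def pick_threshold(val_scores, val_labels, candidates=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Return the candidate cut-off with the highest validation accuracy."""
    return max(candidates, key=lambda t: accuracy_at(t, val_scores, val_labels))

# Toy validation scores (sigmoid outputs) and labels, for illustration only
val_scores = [0.95, 0.80, 0.60, 0.40, 0.20]
val_labels = [1, 1, 0, 0, 0]

best_threshold = pick_threshold(val_scores, val_labels)
print(best_threshold, accuracy_at(best_threshold, val_scores, val_labels))
```

The selected cut-off would then be applied once to the test predictions, so the test split plays no part in tuning.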

In [None]:
from sklearn.metrics import accuracy_score

def evaluate_model_and_find_failures(model, test_data):
    """
    Calculate prediction accuracy and identify failure cases
    """
    q_test, a_test, y_test = test_data
    
    # Get predictions
    predictions = model.predict([q_test, a_test])
    predictions = (predictions > 0.9).astype(int).flatten()
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, predictions)
    print(f"Accuracy: {accuracy}")
    
    # Identify where predictions are incorrect
    failures = np.where(predictions != y_test)[0]
    
    # Print an example of failure
    if len(failures) > 0:
        idx = failures[0]  # Example of the first failure
        print(f"\nExample of failure at index {idx}:")
        print(f"Predicted: {predictions[idx]}, Actual: {y_test[idx]}")
        # Look up the text in test_df, because the indices refer to the test data
        print(f"Question: {test_df['qtext'].iloc[idx]}")
        print(f"Answer: {test_df['atext'].iloc[idx]}")
    else:
        print("No failures found!")
    
    return accuracy, failures

# Call the function to evaluate and find failures
accuracy, failures = evaluate_model_and_find_failures(siamea, test_features)


In [None]:
q_test, a_test, y_test = test_features

# Get predictions
predictions = siamea.predict([q_test, a_test])
predictions = (predictions > 0.9).astype(int).flatten()

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)

failures = np.where(predictions != y_test)[0]

In [None]:
# Confirm that the code is working as expected.
print("Number of failures:", len(failures))
print("Number of predictions:", len(predictions))
print("Failure indices:", failures[:10])  # Print first 10 indices

In [None]:
questions = []
answers = []

for num in failures:
    questions.append(test_df['qtext'].iloc[num])
    answers.append(test_df['atext'].iloc[num])

# Create a DataFrame of failures (unsuccessful q and a matching based on the prediction.)
failures_df = pd.DataFrame({
    'question': questions,
    'answer': answers
})

print(failures_df.head())

In [None]:
# Confirm that the predictions were wrong for the first 3 data points (predicted 0 instead of 1)
print(predictions)

test_df[0:3]

The low accuracy and the failure case above could be due to the following reasons:

**Simple Model:** The Siamese model used here has a very simple network architecture (two hidden layers). A model such as BERT, which is discussed in the second task, has a better understanding of how the words in a sentence relate to each other as a whole.

**Word Understanding:** The Siamese network may not understand what a combination of words means. For example, given the words "capital" and "Japan", we can connect the two and derive "Tokyo". The Siamese network, however, probably just sees "capital" and matches the question to any answer that contains the name of a capital city, without understanding the meaning of the whole question.

**Insufficient Training Data:** The model possibly hasn’t seen enough examples of non-matching pairs during training, which could make it struggle to differentiate between highly dissimilar answers.
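
The word-understanding point can be illustrated with a small sketch. TF-IDF is a bag-of-words representation, so two sentences that contain the same words in a different order receive the same vector. The example below uses raw word counts (a simplification of TF-IDF) and made-up sentences.

```python
from collections import Counter

def bag_of_words(text):
    """Lower-case the text and count each word, ignoring word order."""
    return Counter(text.lower().split())

sentence_a = "insulin lowers blood sugar"
sentence_b = "sugar lowers blood insulin"

# Both sentences contain the same words, so their count vectors are identical
same_representation = bag_of_words(sentence_a) == bag_of_words(sentence_b)
print(same_representation)  # True
```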


### Potential Solutions:

**Use More Complex Architectures:** We can use models that capture deeper contextual meaning, such as BERT or other transformers, which are better at understanding the relationship between words in context. Adding dropout layers or convolutional layers might also improve the accuracy. Two hidden layers are likely too simple for this task.

**Increase Training Data:** Feed the model more examples of difficult non-matching pairs.

**Feature Engineering:** Add other types of features, such as sentence embeddings, in combination with the TF-IDF vectors. This would allow the model to better capture semantic relationships between sentences and help it match questions to closely related answers.

# Task 2 (12 marks): Transformer

In this task, let's use Transformer to predict whether two sentences are related or not. Implement a simple Transformer neural network that meets the following requirements:

1. (1 mark) Each input for this model should be a concatenation of qtext and atext. Use [SEP] to separate qtext and atext, e.g., "Can high blood pressure bring on heart failure? [SEP] Hypertensive heart disease is the No." You need to pad the input to a fixed length. How do you determine a suitable length?
2. (1.5 marks) Choose a suitable tokenizer and justify your choice.
3. (1 mark) An embedding layer that generates embedding vectors of the sentence text into size 128. Remember to add position embedding.
4. (1 mark) One transformer encoder layer, you need to find a hidden dimension in {64, 128, 256}. Use 3 heads in MultiHeadAttention.
5. (1 mark) Do we need a transformer decoder layer for this task? If yes, find a hidden dimension in {64, 128, 256} and use 3 heads in MultiHeadAttention. If no, explain why.
6. (0.5 marks) 1 hidden layer with size 256 and ReLU activation function.
7. (0.5 marks) 1 output layer with size 2 for binary classification to predict whether two inputs are related or not. 
8. (1 mark) Choose a suitable loss to train the model
9. (1 mark) Report your best accuracy on the test split.
10. (1.5 marks) Give an example of a failure case, and explain the possible reason and discuss a potential solution.
11. (1 mark) Good coding style as explained in the above Assessment Section.
12. (1 mark) Correctly feeding data into your model, and correctly training and testing of your models.
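
Requirement 3 asks for a position embedding. One common choice is a fixed sinusoidal encoding. The sketch below, in plain Python, computes it for a single position, with the sine values in the first half of the vector and the cosine values in the second half. The function name, the base of 10000, and the half-and-half layout are assumptions made for illustration; a learned position embedding is an equally valid choice.

```python
import math

def positional_encoding(position, embed_dim, base=10000.0):
    """Sinusoidal encoding of one position as a list of length embed_dim.

    Assumes embed_dim is even. Frequency i is base ** (-2i / embed_dim).
    """
    half = embed_dim // 2
    frequencies = [math.exp(-math.log(base) * (2 * i) / embed_dim) for i in range(half)]
    sines = [math.sin(position * f) for f in frequencies]
    cosines = [math.cos(position * f) for f in frequencies]
    return sines + cosines

encoding_pos0 = positional_encoding(0, 8)
encoding_pos1 = positional_encoding(1, 8)
print(encoding_pos0)
print(encoding_pos1)
```

Each position gets a distinct vector of the same size as the word embedding, so the two can simply be added.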



### What Padding Length Shall We Use?

We can start with a statistical approach to find an appropriate padding length. Even though this is not a definitive length, it serves as a good starting point (in our case, 50). We can examine the impact of increasing the padding length and fine-tune it later on.
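
Choosing a padding length below the maximum means that some inputs are longer than the padding length and get truncated. The sketch below mimics, in plain Python, how a sequence is padded at the end and truncated to a fixed length. `pad_or_truncate` is a hypothetical helper. Note that Keras `pad_sequences` truncates from the start of the sequence by default (`truncating='pre'`), which for a "question [SEP] answer" input would drop question tokens first.

```python
def pad_or_truncate(sequence, max_length, truncating="pre", pad_value=0):
    """Return a list of exactly max_length token ids.

    Short sequences are padded at the end; long ones lose tokens from the
    start ("pre") or from the end ("post").
    """
    if len(sequence) > max_length:
        if truncating == "pre":
            return sequence[-max_length:]
        return sequence[:max_length]
    return sequence + [pad_value] * (max_length - len(sequence))

token_ids = [11, 12, 13, 14, 15, 16]  # made-up token ids

print(pad_or_truncate([11, 12, 13], 5))                  # [11, 12, 13, 0, 0]
print(pad_or_truncate(token_ids, 4, truncating="pre"))   # [13, 14, 15, 16]
print(pad_or_truncate(token_ids, 4, truncating="post"))  # [11, 12, 13, 14]
```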

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
train_df = pd.read_csv('train.csv')
val_df = pd.read_csv('val.csv')
test_df = pd.read_csv('test.csv')

In [None]:
def get_lengths(df):
    # Combine qtext and atext with [SEP] token
    qa_texts = df['qtext'] + ' [SEP] ' + df['atext']
    
    # Count words in each text in combined texts
    lengths = [len(text.split()) for text in qa_texts]
    
    print(f"Mean length: {np.mean(lengths):.2f}")
    print(f"Median length: {np.median(lengths):.2f}")
    print(f"95th percentile: {np.percentile(lengths, 95):.2f}")
    print(f"Max length: {np.max(lengths)}")
    
    # Return 95th percentile as recommended max_length
    return int(np.percentile(lengths, 95))

max_length = get_lengths(train_df)
print("---")
get_lengths(val_df)
print("---")
get_lengths(test_df)

print(max_length)

In [None]:
# Combine texts with [SEP] token and print
qa_texts = train_df['qtext'] + ' [SEP] ' + train_df['atext']
qa_texts

In [None]:
# Initialize and fit the tokenizer, keeping the 10000 most frequent words
max_features = 10000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(qa_texts)

# Check that the tokenization was successful
print("Word index sample:", list(tokenizer.word_index.items())[:10])

In [None]:
# Transform texts to sequences
sequences = tokenizer.texts_to_sequences(qa_texts)

# Pad sequences to the determined max length
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')

# Checking the shape of the padded sequences
print("Shape of padded sequences:", padded_sequences.shape)

In [None]:
# Fix for the embedding layer creation
def create_embedding_layer(vocab_size, max_length, embed_dim=128):
    input_seq = layers.Input(shape=(max_length,))
    
    # Word embedding
    embedding_layer = layers.Embedding(
        input_dim=vocab_size,
        output_dim=embed_dim,
        input_length=max_length
    )
    
    # Get word embeddings
    word_embeddings = embedding_layer(input_seq)
    
    # Create positional encodings
    positions = tf.range(start=0, limit=max_length, delta=1, dtype=tf.float32)
    positions = tf.expand_dims(positions, axis=1)
    
    # Calculate angles for positional encoding
    angles = tf.range(start=0, limit=embed_dim, delta=2, dtype=tf.float32)
    angles = angles * (-np.log(5000.0) / embed_dim)
    angles = tf.expand_dims(angles, axis=0)
    
    # Calculate positional encodings
    pos_encoding = positions * tf.exp(angles)
    pos_encoding = tf.concat(
        [tf.sin(pos_encoding), tf.cos(pos_encoding)],
        axis=1
    )
    
    # Add batch dimension
    pos_encoding = tf.expand_dims(pos_encoding, axis=0)
    
    # Add positional encoding to word embeddings
    final_embeddings = word_embeddings + tf.cast(pos_encoding, dtype=tf.float32)
    
    return tf.keras.Model(inputs=input_seq, outputs=final_embeddings)

# Create the complete model
def create_complete_model(vocab_size, max_length=50, embed_dim=128, hidden_dim=256, num_heads=3):
    # Input layer
    input_seq = layers.Input(shape=(max_length,))
    
    # Embedding layer with positional encoding
    embedding_layer = create_embedding_layer(vocab_size, max_length, embed_dim)
    embeddings = embedding_layer(input_seq)
    
    # Transformer encoder
    attention_output = layers.MultiHeadAttention(
        num_heads=num_heads,
        key_dim=hidden_dim
    )(embeddings, embeddings)
    attention_output = layers.LayerNormalization(epsilon=1e-6)(attention_output + embeddings)
    
    # Feed-forward network
    ff_output = layers.Dense(hidden_dim, activation="relu")(attention_output)
    ff_output = layers.Dense(embed_dim)(ff_output)
    encoder_output = layers.LayerNormalization(epsilon=1e-6)(ff_output + attention_output)
    
    # Global average pooling
    pooled_output = layers.GlobalAveragePooling1D()(encoder_output)

    hidden_output = layers.Dense(256, activation="relu")(pooled_output)
    
    # Output layer: 2 units (one per class) with softmax for binary classification
    output = layers.Dense(2, activation="softmax")(hidden_output)
    
    model = tf.keras.Model(inputs=input_seq, outputs=output)
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=['accuracy']
    )
    
    return model

# Create a model with hidden dimension 256 in the transformer encoder
model1 = create_complete_model(
    vocab_size=max_features,
    max_length=max_length,
    embed_dim=128,
    hidden_dim=256,
    num_heads=3
)

# Print model summary
model1.summary()

In [None]:
# Prepare training data
train_qa_texts = train_df['qtext'] + ' [SEP] ' + train_df['atext']
train_sequences = tokenizer.texts_to_sequences(train_qa_texts)
train_sequences = pad_sequences(train_sequences, maxlen=max_length, padding='post')
train_labels = train_df['label']  # Assuming label column is named 'label'

# Prepare validation data
val_qa_texts = val_df['qtext'] + ' [SEP] ' + val_df['atext']
val_sequences = tokenizer.texts_to_sequences(val_qa_texts)
val_sequences = pad_sequences(val_sequences, maxlen=max_length, padding='post')
val_labels = val_df['label']

# Prepare test data
test_qa_texts = test_df['qtext'] + ' [SEP] ' + test_df['atext']
test_sequences = tokenizer.texts_to_sequences(test_qa_texts)
test_sequences = pad_sequences(test_sequences, maxlen=max_length, padding='post')
test_labels = test_df['label']  # needed later for evaluation on the test split


In [None]:
# As we are ready to feed the prepared data, we will train model defined earlier here.
history = model1.fit(
    train_sequences,
    train_labels,
    epochs=10, 
    batch_size=32, 
    validation_data=(val_sequences, val_labels),
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True
        )
    ]
)

In [None]:
# Thanks to the early stopping, we didn't need to go through 10 total epochs (it would take much longer..)

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

In [None]:
test_loss, test_accuracy = model1.evaluate(test_sequences, test_labels)
print(f"Test accuracy with 256 multihead: {test_accuracy}")

In [None]:
# Create a model with hidden dimension 128 in the transformer encoder
model2 = create_complete_model(
    vocab_size=max_features,
    max_length=max_length,
    embed_dim=128,
    hidden_dim=128,
    num_heads=3
)

# Print model summary
model2.summary()

history = model2.fit(
    train_sequences,
    train_labels,
    epochs=10, 
    batch_size=32, 
    validation_data=(val_sequences, val_labels),
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True
        )
    ]
)

test_loss, test_accuracy = model2.evaluate(test_sequences, test_labels)
print(f"Test accuracy with 128 multihead: {test_accuracy}")

In [None]:
# Create a model with hidden dimension 64 in the transformer encoder
model3 = create_complete_model(
    vocab_size=max_features,
    max_length=max_length,
    embed_dim=128,
    hidden_dim=64,
    num_heads=3
)

# Print model summary
model3.summary()

history = model3.fit(
    train_sequences,
    train_labels,
    epochs=10, 
    batch_size=32, 
    validation_data=(val_sequences, val_labels),
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True
        )
    ]
)

test_loss, test_accuracy = model3.evaluate(test_sequences, test_labels)
print(f"Test accuracy with 64 multihead: {test_accuracy}")

Since this task is binary classification, we don't need a transformer decoder layer. The encoder layer is enough to capture relationships between tokens in each input pair. The decoder is generally used in tasks that involve sequence generation, such as translation.

As reported above, the test accuracies for each hidden dimension of the transformer encoder layer are:

- Model1 (256) : 0.538731038570404
- Model2 (128) : 0.49785909056663513
- Model3 (64) : 0.5762943029403687

Model3 (about 0.576) shows a slight improvement over the Siamese model (about 0.54).
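
The comparison above can be reproduced in a few lines of Python. The dictionary holds the test accuracies reported above, rounded to four decimals. In practice, the hidden dimension would normally be selected on validation accuracy, with the test split used only for the final report.

```python
# Test accuracies reported above, keyed by hidden dimension (rounded)
accuracy_by_hidden_dim = {256: 0.5387, 128: 0.4979, 64: 0.5763}

# Pick the hidden dimension with the highest accuracy
best_hidden_dim = max(accuracy_by_hidden_dim, key=accuracy_by_hidden_dim.get)
print(best_hidden_dim, accuracy_by_hidden_dim[best_hidden_dim])  # 64 0.5763
```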

In [None]:
# Get predictions on the test set
test_predictions = model3.predict(test_sequences)
test_predicted_labels = np.argmax(test_predictions, axis=1)

# Compare predicted labels with true labels and identify misclassified examples
misclassified_indices = np.where(test_predicted_labels != test_labels)[0]

# Print first 10 misclassified question-answer pairs
print("\nMisclassified Examples:")
for i in misclassified_indices[:10]:
    print(f"Actual label: {test_labels[i]}, Predicted label: {test_predicted_labels[i]}")
    print(f"Question: {test_df['qtext'].iloc[i]}")
    print(f"Answer: {test_df['atext'].iloc[i]}")
    print("---")

The low accuracy and the failure cases above could be due to the following reasons:

**Low Contextual Understanding:** The model seems to struggle when the answer does not address the question in the same words. For example, for the question "What are some symptoms of an insulin overdose?", the answer talks about hypoglycemia (low blood sugar) rather than explicitly about an insulin overdose, so relating the two requires knowledge that the model does not have.

**Insufficient Training Data:** The model possibly hasn’t seen enough examples of non-matching pairs during training, which could make it struggle to differentiate between highly dissimilar answers.

### Potential Solutions:

**Try Advanced Language Models and Compare the Performance:** We can try to improve contextual understanding by adding context features such as entity extraction and semantic features, or by applying more advanced language models such as BERT or GPT, which capture semantic relationships better.

**Increase Training Data:** Feed the model more examples of difficult non-matching pairs.