# **Text Processing for Similarity and Sentence Relationship Prediction**

## **Task Review**

The goal of this project is to develop a deep learning model that determines whether a given sentence can be used as part of an answer to a specific question. We use a dataset (`train.csv`) that contains:

- **qtext**: The question text.
- **atext**: The answer text.
- **label**: A binary indicator (1 if the answer is relevant to the question, 0 otherwise).

To achieve this, we implement a **Siamese Neural Network** using **TensorFlow-Keras** and perform two tasks:

1. **Task 1: Simple Siamese Neural Network with Contrastive Loss**  
   We build a Siamese network that learns embeddings of questions and answers, then applies contrastive loss to measure similarity.

2. **Task 2: Transformer-Based Model for Sentence Relationship Prediction**    
   We build a Transformer encoder classifier, using the BERT tokenizer to prepare the input, to determine whether a given question and answer pair are related.

---


## Dataset

The data is in the file `train.csv`, which is provided in the GitHub repository. Each row consists of a question (`qtext` column), an answer (`atext` column), and a label (`label` column) that indicates whether the answer is correctly related to the question (1) or not (0).

The following code uses pandas to store the file `train.csv` in a data frame and shows the first few rows of data.

In [None]:
import pandas as pd
dataset = pd.read_csv("train.csv")
dataset.head()

Unnamed: 0,qtext,label,atext
0,What are the symptoms of gastritis?,1,"However, the most common symptoms include: Nau..."
1,What are the symptoms of gastritis?,0,var s_context; s_context= s_context || {}; s_c...
2,What are the symptoms of gastritis?,0,"!s_sensitive, chron ID: $('article embeded_mod..."
3,What does the treatment for gastritis involve?,1,Treatment for gastritis usually involves: Taki...
4,What does the treatment for gastritis involve?,1,Eliminating irritating foods from your diet su...


# Task 1: Simple Siamese NN - Contrastive Loss

In this task, we implement a **basic Siamese neural network** to determine the similarity between a question and an answer. The model takes in pairs of question-answer embeddings and learns to minimize the distance between related pairs while maximizing the distance between unrelated pairs.
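
The pull-together, push-apart objective described above is what contrastive loss formalizes. The following is a minimal NumPy sketch, assuming the common margin-based form with label 1 for related pairs. The function name and the `margin` default are illustrative, and the model in this notebook is compiled with binary cross-entropy rather than this loss.

```python
import numpy as np

def contrastive_loss(y_true, distance, margin=1.0):
    """Margin-based contrastive loss (illustrative sketch).

    y_true:   1 for related pairs, 0 for unrelated pairs.
    distance: Euclidean distance between the two embeddings of each pair.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    distance = np.asarray(distance, dtype=np.float64)
    # Related pairs are penalized for being far apart.
    pull = y_true * np.square(distance)
    # Unrelated pairs are penalized only while closer than the margin.
    push = (1.0 - y_true) * np.square(np.maximum(margin - distance, 0.0))
    return float(np.mean(pull + push))

# A related pair at distance 0.5 and an unrelated pair at distance 0.5
print(contrastive_loss([1, 0], [0.5, 0.5]))  # 0.25
```

An unrelated pair that is already farther apart than the margin contributes nothing, so the loss concentrates on the pairs that are still confusable.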



## Importing Assets for Task 1

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, Model
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np


## 1. Loading and Preprocessing Data

The data is loaded, and questions, answers, and labels are selected. Next, the dataset is split, with 80% allocated for training and 20% reserved for validation, which will be used to evaluate the model's performance during training. TF-IDF is applied to convert the text into numerical form, facilitating processing by the model. Lastly, all elements are reshaped appropriately for input into TensorFlow.

In [None]:
# Extracting questions, answers, and labels from the dataset
questions = dataset['qtext'].values
answers = dataset['atext'].values
labels = dataset['label'].values

# Splitting data into training and validation sets (80% train, 20% validation)
q_train, q_val, a_train, a_val, y_train, y_val = train_test_split(questions, answers, labels, test_size=0.2, random_state=42)

# TF-IDF vectorizer for both question and answer text
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Limit vocab size to 5000 to keep things efficient

# Fitting the TF-IDF vocabulary on the training questions only, then transforming questions and answers with it
q_train_tfidf = tfidf_vectorizer.fit_transform(q_train)
a_train_tfidf = tfidf_vectorizer.transform(a_train)
q_val_tfidf = tfidf_vectorizer.transform(q_val)
a_val_tfidf = tfidf_vectorizer.transform(a_val)

# Converting the sparse matrices to dense arrays so TensorFlow can work with them
q_train_tfidf = q_train_tfidf.toarray()
a_train_tfidf = a_train_tfidf.toarray()
q_val_tfidf = q_val_tfidf.toarray()
a_val_tfidf = a_val_tfidf.toarray()

# Checking that everything loaded correctly
print("Shapes of the transformed data:", q_train_tfidf.shape, a_train_tfidf.shape, q_val_tfidf.shape, a_val_tfidf.shape)


Shapes of the transformed data: (7504, 1878) (7504, 1878) (1876, 1878) (1876, 1878)


There are 7,504 pairs for training and 1,876 pairs for validation, each represented by 1,878 TF-IDF features. The feature count is below the 5,000 cap because the vectorizer's vocabulary is fitted on the training questions only, so words that occur only in answers are dropped when the answers are transformed. Questions and answers share the same feature space, which keeps the two inputs consistent before they are fed into the model. With preprocessing complete, the next step is building the neural network.
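
The effect of the fitting corpus on the vocabulary can be checked on a small scale. The sketch below uses hypothetical toy sentences (not rows from `train.csv`) to compare a vectorizer fitted on questions only with one fitted on questions and answers together. Fitting on both is one possible alternative, not what the cell above does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, used only to illustrate the effect.
toy_questions = ["what are the symptoms of gastritis",
                 "how is gastritis treated"]
toy_answers = ["common symptoms include nausea and vomiting",
               "treatment involves taking antacids"]

# Vocabulary learned from questions only.
q_only = TfidfVectorizer(max_features=5000).fit(toy_questions)

# Vocabulary learned from questions and answers together.
combined = TfidfVectorizer(max_features=5000).fit(toy_questions + toy_answers)

# Words that appear only in answers are missing from the question-only vocabulary.
print("nausea" in q_only.vocabulary_)
print("nausea" in combined.vocabulary_)

# An answer made up of such words becomes an all-zero vector.
print(q_only.transform(["nausea and vomiting"]).nnz)
print(combined.transform(["nausea and vomiting"]).nnz)
```

With the real data, the equivalent change would be to fit on the concatenation of `q_train` and `a_train`, still excluding the validation texts.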

## 2. Building the Siamese Neural Network Model

The next step is to construct a Siamese Network, incorporating a pair of input layers, one for questions and one for answers. These inputs pass through two shared layers with ReLU activations, set to a size of 128, though this parameter can be adjusted. The Euclidean distance between the question and answer is then calculated to determine their similarity. Finally, a Sigmoid layer at the end outputs the probability that the question and answer pair matches.

In [None]:
# Defining the Siamese model architecture
def siamese_model(input_shape, hidden_size):
    # Defining input layers for the question and answer pairs with correct shape tuple
    input_q = layers.Input(shape=(input_shape,))
    input_a = layers.Input(shape=(input_shape,))

    # Creating shared hidden layers with ReLU activation
    shared_layer_1 = layers.Dense(hidden_size, activation='relu')
    shared_layer_2 = layers.Dense(hidden_size, activation='relu')

    # Processing question and answer through shared layers
    processed_q = shared_layer_2(shared_layer_1(input_q))
    processed_a = shared_layer_2(shared_layer_1(input_a))

    # Calculating the Euclidean distance between the two processed outputs
    distance = layers.Lambda(lambda tensors: tf.norm(tensors[0] - tensors[1], axis=1, keepdims=True))([processed_q, processed_a])

    # Output layer with sigmoid activation for binary classification
    output = layers.Dense(1, activation='sigmoid')(distance)

    # Defining the full model
    model = Model(inputs=[input_q, input_a], outputs=output)

    # Compiling the model using binary crossentropy and Adam optimizer
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    return model

# Setting the hidden layer size and the input feature count (an int; siamese_model wraps it in a shape tuple)
hidden_size = 128
input_shape = q_train_tfidf.shape[1]  # Feature size
model = siamese_model(input_shape, hidden_size)

# Displaying the model summary
model.summary()


The model consists of two input layers each receiving TF-IDF vectors with 1,878 features. These inputs are processed through two shared dense layers with 128 units and ReLU activation. Following this, a Lambda layer calculates the Euclidean distance between the question and answer embeddings. Finally, a Sigmoid output layer predicts whether the answer matches the question. The model comprises a total of 257,026 parameters, all of which are trainable.
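
The parameter total can be reproduced by hand. This is a small arithmetic sketch (the helper `dense_params` is illustrative): a dense layer has one weight per input-output pair plus one bias per output, and the shared layers are counted once because both branches reuse the same weights.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

n_features = 1878  # TF-IDF feature count reported above
hidden = 128       # units in each shared layer

total = (
    dense_params(n_features, hidden)  # first shared layer: 240,512
    + dense_params(hidden, hidden)    # second shared layer: 16,512
    + dense_params(1, 1)              # sigmoid output on the scalar distance: 2
)
print(total)  # 257026
```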

## 3. Training the Model

Next, the model is trained by feeding it both the training and validation data, running for 10 epochs. During training, it calculates the binary cross-entropy loss, adjusts weights accordingly, and evaluates performance using the validation data to monitor progress.

In [None]:
# Fitting the model with the training data
history = model.fit([q_train_tfidf, a_train_tfidf], y_train,
                    validation_data=([q_val_tfidf, a_val_tfidf], y_val),
                    epochs=10, batch_size=32)

Epoch 1/10
235/235 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.5089 - loss: 0.6911 - val_accuracy: 0.7074 - val_loss: 0.6373
Epoch 2/10
235/235 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7671 - loss: 0.5794 - val_accuracy: 0.7799 - val_loss: 0.5695
Epoch 3/10
235/235 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8599 - loss: 0.4653 - val_accuracy: 0.8124 - val_loss: 0.5349
Epoch 4/10
235/235 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9012 - loss: 0.3787 - val_accuracy: 0.8166 - val_loss: 0.5194
Epoch 5/10
235/235 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9227 - loss: 0.3125 - val_accuracy: 0.8140 - val_loss: 0.5114
Epoch 6/10
235/235 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9341 - loss: 0.2694 - val_accuracy: 0.8332 - val_loss: 0.5175
Epoch 7/10
235/235

The model trained for 10 epochs (the log shown here is cut off during epoch 7). Training accuracy rose from approximately 50.9% in the first epoch to 95.0% by the final epoch, and validation accuracy ended around 83.6%. Training loss decreased every epoch, but validation loss flattened after epoch 5 (0.5114, then 0.5175 at epoch 6) while training accuracy kept climbing. The gap of more than ten points between training and validation accuracy points to some overfitting. The model is clearly learning a relationship between questions and answers, but not everything it learns transfers to held-out pairs.
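
One common response to a flattening validation loss is early stopping. The sketch below is a plain-Python illustration of the rule, applied to the six validation losses visible in the log. The helper name and the patience value are assumptions; in Keras the same idea is available through the `tf.keras.callbacks.EarlyStopping` callback.

```python
def best_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss,
    stopping the scan once `patience` epochs pass without improvement."""
    best_loss = float("inf")
    best_idx = -1
    waited = 0
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_idx, waited = loss, idx, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx + 1

# Validation losses of epochs 1-6 as printed in the training log
visible_val_losses = [0.6373, 0.5695, 0.5349, 0.5194, 0.5114, 0.5175]
print(best_epoch(visible_val_losses))  # 5
```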

## 4. Evaluating the Model on Test Data

With training complete, the test data is loaded, providing an unbiased measure of the model's performance on unseen data. After transforming the test text data into TF-IDF vectors, it is fed into the model to obtain the test accuracy, offering a realistic assessment of the model’s effectiveness.

In [None]:
# Loading the test data
test_df = pd.read_csv('test.csv')

# Preparing the test data
q_test = test_df['qtext'].values
a_test = test_df['atext'].values
y_test = test_df['label'].values

# Transforming test data using the same TF-IDF vectorizer
q_test_tfidf = tfidf_vectorizer.transform(q_test).toarray()
a_test_tfidf = tfidf_vectorizer.transform(a_test).toarray()

# Evaluating the model on test data
test_loss, test_acc = model.evaluate([q_test_tfidf, a_test_tfidf], y_test)
print(f'Test accuracy: {test_acc}')

161/161 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.5492 - loss: 1.2981
Test accuracy: 0.5642273426055908


The reported test accuracy is approximately 56.4% (the progress bar shows 54.9% and a loss of 1.2981). This result indicates that the model struggles to generalize to new data, performing notably worse than on the training and validation sets. The disparity suggests overfitting on the training data, or that TF-IDF features are insufficient to capture deeper semantic relationships between questions and answers.
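
To judge how far such an accuracy is from a trivial predictor, it helps to compare it with the majority-class baseline. The following is a minimal sketch with hypothetical labels. With the real data, `y_test` would be passed instead, and the helper name is illustrative.

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = np.bincount(np.asarray(labels, dtype=int))
    return counts.max() / counts.sum()

# Hypothetical labels for illustration only (five 0s, three 1s)
toy_labels = [1, 0, 0, 1, 0, 0, 0, 1]
print(majority_baseline_accuracy(toy_labels))  # 0.625
```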

## 5. Identifying a Failure Case

Lastly, the model is used to make predictions on the test set, and any failure cases where it misclassified answers are examined. By identifying and analyzing these cases, insights are gained into specific areas where the model could be improved.

In [None]:
# Predicting on the test set to identify a failure case
y_pred = model.predict([q_test_tfidf, a_test_tfidf])
y_pred_labels = (y_pred > 0.5).astype(int)

# Finding a failure case where the prediction is incorrect
for i in range(len(y_test)):
    if y_pred_labels[i] != y_test[i]:
        print(f"Failure case at index {i}:")
        print(f"Question: {q_test[i]}")
        print(f"Answer: {a_test[i]}")
        print(f"True Label: {y_test[i]}, Predicted: {y_pred_labels[i]}")
        break

161/161 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step
Failure case at index 0:
Question: How does an external catheter help male incontinence?
Answer: External catheters.
True Label: 1, Predicted: [0]


In this failure case, the model was asked to judge the relevance of an answer to the question "How does an external catheter help male incontinence?" The answer, "External catheters," is labeled as relevant (True Label = 1), but the model classified it as irrelevant (Predicted = 0). This error likely occurred because TF-IDF represents each text as a bag of isolated words and cannot capture context or meaning. Without stemming, it also treats "catheter" and "catheters" as different terms, and a two-word answer yields a very sparse vector, so the model has little evidence that "External catheters" addresses the question. To enhance performance, increasing the volume of training data may improve the model's generalization capabilities.
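
Looking at a single failure gives limited insight, so it can help to collect every misclassified index at once. The sketch below uses hypothetical probabilities and labels, and the helper name is illustrative. Flattening the `(n, 1)` prediction array with `ravel()` also makes each predicted label a plain integer instead of a one-element array such as `[0]`.

```python
import numpy as np

def failure_indices(probabilities, true_labels, threshold=0.5):
    """Return the indices of misclassified examples and the predicted labels."""
    predicted = (np.asarray(probabilities).ravel() > threshold).astype(int)
    true_labels = np.asarray(true_labels).ravel()
    return np.flatnonzero(predicted != true_labels), predicted

# Hypothetical model outputs with shape (n, 1), as returned by model.predict
toy_probs = np.array([[0.2], [0.9], [0.4], [0.7]])
toy_true = np.array([1, 1, 0, 0])

idx, pred = failure_indices(toy_probs, toy_true)
print(idx)   # [0 3]
print(pred)  # [0 1 0 1]
```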

# Task 2: Transformer Neural Network

In this task, we replace the Siamese Neural Network with a **Transformer-based model** to determine whether two sentences (a question and an answer) are related. The Transformer reads both the question and answer **simultaneously**, capturing deeper contextual relationships.




## Importing Assets for Task 2

In [None]:
from transformers import BertTokenizer
from tensorflow.keras.layers import Embedding, Layer
from tensorflow.keras.layers import MultiHeadAttention, Dense, LayerNormalization, Dropout
from tensorflow.keras.layers import Flatten

## 1. Data Preparation

First, qtext and atext are concatenated with a [SEP] token placed between them to create a single input sequence. Next, an appropriate padding length is determined based on the text length distribution, ensuring consistent input sizes across all samples.

In [None]:
# Concatenate question and answer text with [SEP] separator
dataset['combined_text'] = dataset['qtext'] + " [SEP] " + dataset['atext']

# Calculate text lengths to determine padding length
dataset['text_length'] = dataset['combined_text'].apply(lambda x: len(x.split()))

# Display statistics for text length
length_stats = dataset['text_length'].describe()
print("Text Length Statistics:", length_stats)

Text Length Statistics: count    9380.000000
mean       28.106716
std        14.022191
min         5.000000
25%        20.000000
50%        25.000000
75%        32.000000
max       219.000000
Name: text_length, dtype: float64


The length statistics show that 75% of the concatenated texts contain 32 whitespace-separated words or fewer, with a mean of about 28 and a maximum of 219. A padding length of 50 tokens is chosen to cover most of the data without excessive padding. Because these counts are words rather than WordPiece tokens, the tokenized sequences are somewhat longer, so more inputs are truncated at 50 than the word counts alone suggest. With this padding length defined, the next step is to tokenize and pad the inputs to this fixed length, ensuring uniform input sizes.
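
A candidate padding length can be checked by measuring the fraction of sequences it covers without truncation. This is a minimal sketch on hypothetical lengths, and the helper name is illustrative. With the real data, the lengths would ideally be token counts, for example `len(tokenizer(text)['input_ids'])`, rather than word counts.

```python
import numpy as np

def coverage(lengths, max_length):
    """Fraction of sequences that fit within max_length without truncation."""
    lengths = np.asarray(lengths)
    return float(np.mean(lengths <= max_length))

# Hypothetical sequence lengths for illustration only
toy_lengths = [12, 20, 25, 31, 32, 48, 75, 219]
print(coverage(toy_lengths, 50))  # 0.75
```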

## 2. BERT Tokenizer

The next step is to initialize the BERT tokenizer, using it to tokenize and pad each concatenated sentence to a maximum length of 50 tokens. The tokenizer automatically inserts [CLS] and [SEP] tokens, marking the start and separation within each input. Additionally, an attention mask is generated to help the model focus on meaningful tokens while ignoring the padding tokens.

In [None]:
# Initializing BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Setting maximum length based on the analysis
max_length = 50

# Defining function to tokenize and pad inputs
def tokenize_and_pad(sentences, max_length):

    tokens = tokenizer(
        sentences,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='tf'
    )
    return tokens

# Applying tokenization to the training data
train_sentences = dataset['combined_text'].tolist()
train_tokens = tokenize_and_pad(train_sentences, max_length)

# Displaying a sample of the tokenized data
print("Sample input_ids:", train_tokens['input_ids'][0])
print("Sample attention_mask:", train_tokens['attention_mask'][0])


Sample input_ids: tf.Tensor(
[  101  2054  2024  1996  8030  1997  3806 18886  7315  1029   102  2174
  1010  1996  2087  2691  8030  2421  1024 19029  2030 28667 29264  6314
  4308 21419  1038  4135  5844 21419  3255 24780 27427 25538 16643  2239
  5255  2030  1043  2532  9328  3110  1999  1996  4308  2090 12278  2030
  2012   102], shape=(50,), dtype=int32)
Sample attention_mask: tf.Tensor(
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1], shape=(50,), dtype=int32)


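The BERT tokenizer tokenized each input and padded or truncated it to the fixed length of 50 tokens. It added [CLS] (id 101) at the start and [SEP] (id 102) between the question and answer and at the end, which marks the sentence boundaries. The attention mask holds 1 for real tokens and 0 for padding. In this sample the mask is all 1s because the text filled all 50 positions and was truncated, so no padding was needed. Note that the model trained later in this task receives only `input_ids`, so the mask is not used there. The data is now ready for the embedding and Transformer layers.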
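
The combined effect of `padding='max_length'` and `truncation=True` can be pictured with a plain-Python sketch. The function below is a simplified, hypothetical stand-in rather than the tokenizer's implementation. It assumes a padding id of 0 and, unlike the real tokenizer, does not re-append [SEP] after truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or pad a list of token ids to max_length and build the matching mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# A short sequence is padded: the mask marks the two padded positions with 0.
print(pad_or_truncate([101, 2054, 102], max_length=5))
# ([101, 2054, 102, 0, 0], [1, 1, 1, 0, 0])

# A long sequence is truncated: the mask is all 1s, as in the sample above.
print(pad_or_truncate([101, 2054, 2024, 1996, 8030, 102], max_length=4))
# ([101, 2054, 2024, 1996], [1, 1, 1, 1])
```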

## 3. Embedding Layer with Positional Encoding

In this step, an embedding layer is created to convert input tokens into 128-dimensional vectors. To provide the model with a sense of token order, positional encoding is added, enabling the Transformer model to understand the sequence structure, as it lacks inherent awareness of token positions.
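
The positional encoding used in the next cell follows the standard sinusoidal formulation, where $pos$ is the token position, $i$ indexes pairs of embedding dimensions, and $d_{\text{model}}$ is the embedding size (128 here):

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

Even dimensions receive the sine term and odd dimensions the cosine term, and the resulting matrix is added element-wise to the token embeddings.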

In [None]:
# Defining a positional encoding layer
class PositionalEncoding(Layer):
    def __init__(self, max_len, d_model):
        super().__init__()
        pos = np.arange(max_len)[:, np.newaxis]
        i = np.arange(d_model)[np.newaxis, :]
        angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
        angle_rads = pos * angle_rates
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

        # Ensure positional encoding matches input dtype
        self.pos_encoding = tf.constant(angle_rads, dtype=tf.float32)

    def call(self, inputs):
        return inputs + self.pos_encoding[:tf.shape(inputs)[1], :]

# Defining embedding layer with positional encoding
def embedding_layer(vocab_size, d_model, max_len):
    inputs = tf.keras.Input(shape=(max_len,))
    x = Embedding(input_dim=vocab_size, output_dim=d_model)(inputs)
    x = PositionalEncoding(max_len, d_model)(x)
    return tf.keras.Model(inputs=inputs, outputs=x)

# Setting embedding parameters
vocab_size = tokenizer.vocab_size
d_model = 128

# Initializing embedding layer with positional encoding
embedding_model = embedding_layer(vocab_size, d_model, max_length)

# Sample input: Using one of the tokenized input_ids from the train_tokens
sample_input_ids = train_tokens['input_ids'][:1]  # Taking one example from the batch

# Getting the embedding output with positional encoding
embedding_output = embedding_model(sample_input_ids)

# Printing the embedding output to verify the result
print("Embedding output with positional encoding (shape):", embedding_output.shape)
print("Sample output:", embedding_output[0])

Embedding output with positional encoding (shape): (1, 50, 128)
Sample output: tf.Tensor(
[[ 1.6350523e-03  1.0258011e+00 -3.4834612e-02 ...  9.5547342e-01
  -1.1562336e-02  9.5087886e-01]
 [ 8.3524549e-01  5.4265457e-01  7.1763569e-01 ...  1.0108898e+00
   3.4194555e-02  1.0419756e+00]
 [ 8.7638736e-01 -3.9657301e-01  1.0246983e+00 ...  1.0436763e+00
  -7.3983561e-04  1.0081265e+00]
 ...
 [ 1.2071256e-01 -1.0252832e+00  1.3299796e-01 ...  9.6860737e-01
   4.8615340e-02  9.8436600e-01]
 [-7.5566745e-01 -6.5389681e-01 -6.6543615e-01 ...  1.0179409e+00
  -3.8014587e-02  1.0014055e+00]
 [-9.1342717e-01  3.3013472e-01 -9.5983255e-01 ...  9.5620090e-01
   3.9326649e-02  1.0175080e+00]], shape=(50, 128), dtype=float32)


The embedding layer with positional encoding is functioning as expected, converting each input token into a 128-dimensional vector that includes positional information to aid the model in understanding token order. The output shape of (1, 50, 128) verifies that inputs are processed correctly, with each sequence containing 50 tokens and an embedding dimension of 128. With this setup confirmed, the next step is to proceed to the Transformer encoder layer!

## 4. Transformer Encoder Layer

A Transformer encoder layer is added with a hidden dimension of 128, selected from the options {64, 128, 256}. The MultiHeadAttention component uses 3 attention heads, each with `key_dim` equal to `d_model`, so that different heads can attend to different parts of the input sequence. The attention output and a position-wise dense layer are each followed by a residual connection and layer normalization.

In [None]:
# Defining Transformer encoder layer
def transformer_encoder_layer(d_model, num_heads):
    inputs = tf.keras.Input(shape=(max_length, d_model))
    attention_output = MultiHeadAttention(num_heads=num_heads, key_dim=d_model)(inputs, inputs)
    attention_output = LayerNormalization()(attention_output + inputs)  # Add & Norm
    dense_output = Dense(d_model, activation='relu')(attention_output)
    dense_output = LayerNormalization()(dense_output + attention_output)  # Add & Norm
    return tf.keras.Model(inputs=inputs, outputs=dense_output)

# Initializing Transformer encoder
encoder_model = transformer_encoder_layer(d_model=128, num_heads=3)

We don’t need a **Transformer decoder layer** because we’re not generating sequences. Instead, we’re simply classifying whether two sentences are related, which only requires encoding.
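
The core computation inside each attention head is scaled dot-product attention. The following is a minimal single-head NumPy sketch that omits the learned query, key, value, and output projections that Keras `MultiHeadAttention` applies. The function names and the random input are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Weight each value by the similarity between its key and the query."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Self-attention over one hypothetical sequence of 50 tokens with 128 dimensions
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 128))
out, weights = scaled_dot_product_attention(x, x, x)

print(out.shape, weights.shape)  # (50, 128) (50, 50)
```

Each row of `weights` sums to 1, so every output token is a weighted average of all value vectors in the sequence.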

## 5. Classification Layers

The next step is to add a hidden layer with 256 units and a final softmax output layer with 2 units, one per class, for binary classification.

In [None]:
# Defining classification head
def classification_head(d_model):
    inputs = tf.keras.Input(shape=(max_length, d_model))
    x = Flatten()(inputs)
    x = Dense(256, activation='relu')(x)
    outputs = Dense(2, activation='softmax')(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)

# Initializing classification model
classification_model = classification_head(d_model=128)

## 6. Loss Function Selection

We use sparse categorical crossentropy as our loss function. It pairs with the 2-unit softmax output and accepts the integer labels (0 or 1) directly, without one-hot encoding.

In [None]:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
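
What this loss computes can be shown in a few lines of NumPy: for each example it takes the negative log of the probability assigned to the true class, then averages. This is an illustrative sketch with hypothetical probabilities, not the Keras implementation.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs, eps=1e-7):
    """Mean negative log-probability of the true class (integer labels)."""
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0)
    labels = np.asarray(labels, dtype=int)
    # Pick, for every row, the probability of its true class.
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(picked)))

# Hypothetical softmax outputs for two examples with true classes 0 and 1
toy_probs = [[0.9, 0.1], [0.2, 0.8]]
toy_labels = [0, 1]
print(sparse_categorical_crossentropy(toy_labels, toy_probs))
```

A confident wrong prediction is penalized heavily because the log of a small probability is a large negative number.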

## 7. Model Training and Evaluation

With the model components prepared, the full model is constructed by combining the embedding layer, Transformer encoder, and a classification head. This integrated setup forms the SentencePairClassifier model, designed to take in tokenized text and output a prediction indicating whether the question and answer are related.

In [None]:
# Loading and preparing the validation set
val_df = pd.read_csv("val.csv")

# Concatenating question and answer with [SEP] for the validation set
val_df['combined_text'] = val_df['qtext'] + " [SEP] " + val_df['atext']

# Tokenizing and padding the validation data
val_sentences = val_df['combined_text'].tolist()
val_tokens = tokenize_and_pad(val_sentences, max_length)  # Use the same function and max_length as for training

# Assembling the full model
class SentencePairClassifier(Model):
    def __init__(self, vocab_size, d_model, max_length):
        super(SentencePairClassifier, self).__init__()
        # Embedding layer with positional encoding
        self.embedding = embedding_layer(vocab_size, d_model, max_length)

        # Transformer encoder layer to add context with multi-head attention
        self.encoder = transformer_encoder_layer(d_model=d_model, num_heads=3)

        # Classification head for binary classification
        self.classifier = classification_head(d_model)

    def call(self, inputs):
        # Forward pass through embedding, encoder, and classification head
        x = self.embedding(inputs)
        x = self.encoder(x)
        return self.classifier(x)

# Setting parameters for embedding and model dimensions
vocab_size = tokenizer.vocab_size
d_model = 128
max_length = 50

# Initializing the model
model = SentencePairClassifier(vocab_size=vocab_size, d_model=d_model, max_length=max_length)

# Compiling the model
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()  # Loss function for binary classification
model.compile(
    optimizer='adam',
    loss=loss_fn,
    metrics=['accuracy']
)

# Setting training parameters
epochs = 10
batch_size = 32

# Training the model using training and validation data
history = model.fit(
    train_tokens['input_ids'],      # Training input data
    dataset['label'],              # Training labels
    validation_data=(val_tokens['input_ids'], val_df['label']),  # Validation data
    epochs=epochs,
    batch_size=batch_size
)

# Loading and preparing the test set (assuming test.csv is available)
test_df = pd.read_csv("test.csv")

# Concatenating question and answer with [SEP] for the test set
test_df['combined_text'] = test_df['qtext'] + " [SEP] " + test_df['atext']

# Tokenizing and padding the test data
test_sentences = test_df['combined_text'].tolist()
test_tokens = tokenize_and_pad(test_sentences, max_length)

# Evaluating the model on the test set (evaluate returns [loss, accuracy])
test_loss, test_acc = model.evaluate(test_tokens['input_ids'], test_df['label'], batch_size=batch_size)
print(f"Test accuracy: {test_acc}")


Epoch 1/10
294/294 ━━━━━━━━━━━━━━━━━━━━ 83s 252ms/step - accuracy: 0.5188 - loss: 1.4506 - val_accuracy: 0.4502 - val_loss: 0.7676
Epoch 2/10
294/294 ━━━━━━━━━━━━━━━━━━━━ 60s 180ms/step - accuracy: 0.5260 - loss: 0.7181 - val_accuracy: 0.4676 - val_loss: 1.1145
Epoch 3/10
294/294 ━━━━━━━━━━━━━━━━━━━━ 89s 204ms/step - accuracy: 0.6519 - loss: 0.6521 - val_accuracy: 0.6110 - val_loss: 0.7645
Epoch 4/10
294/294 ━━━━━━━━━━━━━━━━━━━━ 99s 337ms/step - accuracy: 0.8326 - loss: 0.3826 - val_accuracy: 0.6243 - val_loss: 0.7572
Epoch 5/10
294/294 ━━━━━━━━━━━━━━━━━━━━ 100s 194ms/step - accuracy: 0.8904 - loss: 0.2740 - val_accuracy: 0.5253 - val_loss: 1.3806
Epoch 6/10
294/294 ━━━━━━━━━━━━━━━━━━━━ 56s 189ms/step - accuracy: 0.9250 - loss: 0.2013 - val_accuracy: 0.5867 - val_loss: 1.0262
Epoch 7/1

After training, the model reached about 96% accuracy on the training set, but validation accuracy fluctuated between epochs and ended at only about 57%, with validation loss rising in several epochs (for example, from 0.7572 to 1.3806 between epochs 4 and 5). This pattern indicates overfitting: the model memorizes the training pairs rather than learning a transferable notion of relevance. That is not surprising, since the embedding table alone has 30,522 × 128 ≈ 3.9 million parameters trained from scratch on 9,380 examples. On unseen test data the accuracy was around 55.6%, close to random guessing, so the model isn't generalizing well. To improve, we could try adding dropout layers to reduce overfitting, using pre-trained embeddings for richer language understanding, or using early stopping to avoid over-training.

## 8. Identifying a Failure Case

Lastly, to understand the model's limitations, we examine a failure case where the prediction doesn’t match the true label.

In [None]:
# Making predictions on the test set
predictions = model.predict(test_tokens['input_ids'])
predicted_labels = tf.argmax(predictions, axis=1).numpy()

# Finding a failure case
for i in range(len(test_df)):
    if predicted_labels[i] != test_df['label'].iloc[i]:
        print(f"Failure case at index {i}:")
        print(f"Question: {test_df['qtext'].iloc[i]}")
        print(f"Answer: {test_df['atext'].iloc[i]}")
        print(f"True Label: {test_df['label'].iloc[i]}, Predicted: {predicted_labels[i]}")
        break

161/161 ━━━━━━━━━━━━━━━━━━━━ 12s 72ms/step
Failure case at index 0:
Question: How does an external catheter help male incontinence?
Answer: External catheters.
True Label: 1, Predicted: 0


In this example, the question asks about how an external catheter helps with male incontinence, and the answer simply states "External catheters." The true label is 1, meaning the answer is relevant, but the model predicted 0, which means it thought the answer was unrelated. This might be because the model didn’t fully understand the context and missed the relationship between "external catheter" in the question and "external catheters" in the answer. To improve, we could try using pre-trained embeddings to help the model capture more context, or add more similar examples in the training set to help it learn this kind of relationship.