# Overall Task Review

I will work on a task of "query-focused summarisation" on medical questions where the goal is, given a medical question and a list of sentences extracted from relevant medical publications, to determine which of these sentences from the list can be used as part of the answer to the question.

I will use data that has been derived from the **BioASQ challenge** (http://www.bioasq.org/), after some data manipulation to make it easier to process for this practice. The BioASQ challenge organises several "shared tasks", including a task on biomedical semantic question answering which we are using here. The data are in the file `bioasq10_labelled.csv`, which is part of the zip file provided. Each row of the file has a question, a sentence text, and a label that indicates whether the sentence text is part of the answer to the question (1) or not (0).

# Data Review

The following code uses pandas to store the file `bioasq10_labelled.csv` in a data frame and show the first rows of data. For this code to run, first you need to unzip the file `data.zip`:

In [None]:
import zipfile
import os

# Specify the path to the zip file
zip_file_path = 'data.zip'

# Specify the directory where you want to extract the contents
extracted_dir = 'data/'

# Create the directory if it doesn't exist
if not os.path.exists(extracted_dir):
    os.makedirs(extracted_dir)

# Extract the contents of the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extracted_dir)


In [None]:
import pandas as pd
dataset = pd.read_csv("bioasq10b_labelled.csv")
dataset.head()

Unnamed: 0,qid,sentid,question,sentence text,label
0,0,0,Is Hirschsprung disease a mendelian or a multi...,Hirschsprung disease (HSCR) is a multifactoria...,0
1,0,1,Is Hirschsprung disease a mendelian or a multi...,"In this study, we review the identification of...",1
2,0,2,Is Hirschsprung disease a mendelian or a multi...,The majority of the identified genes are relat...,1
3,0,3,Is Hirschsprung disease a mendelian or a multi...,The non-Mendelian inheritance of sporadic non-...,1
4,0,4,Is Hirschsprung disease a mendelian or a multi...,Coding sequence mutations in e.g.,0


The columns of the CSV file are:

* `qid`: an ID for a question. Several rows may have the same question ID, as we can see above.
* `sentid`: an ID for a sentence.
* `question`: The text of the question. In the above example, the first rows all have the same question: "Is Hirschsprung disease a mendelian or a multifactorial disorder?"
* `sentence text`: The text of the sentence.
* `label`: 1 if the sentence is a part of the answer, 0 if the sentence is not part of the answer.

# Now Let's get started for the next step

I'll use the provided files `training.csv`, `dev_test.csv`, and `test.csv` in the data.zip file for all the tasks below.

# Simple Siamese NN

I will implement a simple TensorFlow-Keras neural model that has the following sequence of layers:

1. An input layer that will accept the tf.idf of triplet data. The input of Siamese network is a triplet, consisting of anchor (i.e., the question), positive answer, negative answer.
2. 3 hidden layers and a relu activation function. You need to determine the size of the hidden layers.
3. Implement a class that serves as a distance layer. It returns the squared Euclidean distance between anchor and positive answer, as well as that between anchor and negative answer.
4. Implement a function that prepares raw data in csv files into triplets. Note that it is important to keep a similar number of positive pairs and negative pairs. For example, if a question has 10 answers, then we can have at most 10 positive pairs, and it is good to associate this question with 10~20 negative sentences.
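Step 4 above can be sketched as follows. This is a minimal illustration, not the implementation used later in this notebook: the function name `build_balanced_triplets`, the parameter `max_neg_per_pos`, and the toy rows are assumptions made for the example. It caps the number of negatives sampled for each positive sentence so that positive and negative pairs stay roughly balanced.

```python
import random

def build_balanced_triplets(rows, max_neg_per_pos=2, seed=42):
    """Build (question, positive, negative) triplets with a capped number of negatives.

    `rows` is a list of dicts with the CSV column names as keys.
    `max_neg_per_pos` is a hypothetical knob: with 10 positives and a cap of 2,
    a question contributes at most 20 negative pairs.
    """
    rng = random.Random(seed)

    # Group rows by question ID
    by_qid = {}
    for row in rows:
        by_qid.setdefault(row['qid'], []).append(row)

    triplets = []
    for qid, group in by_qid.items():
        question = group[0]['question']
        positives = [r['sentence text'] for r in group if r['label'] == 1]
        negatives = [r['sentence text'] for r in group if r['label'] == 0]
        if not positives or not negatives:
            continue  # a triplet needs at least one positive and one negative
        for pos in positives:
            k = min(max_neg_per_pos, len(negatives))
            for neg in rng.sample(negatives, k):
                triplets.append((question, pos, neg))
    return triplets

# Toy data (made up): one question with 2 positives and 5 negatives
toy_rows = (
    [{'qid': 0, 'question': 'Q0', 'sentence text': f'pos{i}', 'label': 1} for i in range(2)]
    + [{'qid': 0, 'question': 'Q0', 'sentence text': f'neg{i}', 'label': 0} for i in range(5)]
)
toy_triplets = build_balanced_triplets(toy_rows, max_neg_per_pos=2)
print(len(toy_triplets))  # 2 positives x 2 sampled negatives each
```

Pairing every positive with every negative, by contrast, grows as positives × negatives per question, which can become large for questions with many candidate sentences.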


Then I will train the model with the training data and use the `dev_test` set to determine a good size of the hidden layer.

With the model that you have trained, implement a summariser that returns the $n$ sentences with highest predicted score. Use the following function signature:

```{python}
def nn_summariser(csvfile, questionids, n=1):
   """Return the IDs of the n sentences that have the highest predicted score.
      The input questionids is a list of question ids.
      The output is a list of lists of sentence ids
   """

```

Then I'll report the final results using the test set. Remember: use the test set to report the final results of the best system only.



#### The data set for this practice was too large for my system's computing power, so I used only half of the data for the first two models and only a quarter of the data for the last model.

### Libraries and modules imported

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from transformers import BertTokenizer, TFBertModel
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

### Data Preparation

In [16]:
train_df = pd.read_csv('training.csv')
dev_test_df = pd.read_csv('dev_test.csv')
test_df = pd.read_csv('test.csv')

#function to sample a fraction of the dataset
def sample_fraction(df, fraction):
    return df.sample(frac=fraction, random_state=42)

#Sampling 50% of each dataset
fraction = 0.5
train_df = sample_fraction(train_df, fraction)
dev_test_df = sample_fraction(dev_test_df, fraction)
test_df = sample_fraction(test_df, fraction)

#function to prepare triplets
def prepare_triplets(df):
    triplets = []
    labels = []
    questions = df['question'].unique()
    for q in questions:
        sub_df = df[df['question'] == q]
        positives = sub_df[sub_df['label'] == 1]['sentence text'].tolist()
        negatives = sub_df[sub_df['label'] == 0]['sentence text'].tolist()

        for pos in positives:
            for neg in negatives:
                triplets.append((q, pos, neg))
                labels.append(1)  #positive pair
                labels.append(0)  #negative pair
    return triplets, labels

#preparing triplets for training and dev_test datasets
train_triplets, train_labels = prepare_triplets(train_df)
dev_test_triplets, dev_test_labels = prepare_triplets(dev_test_df)
test_triplets, test_labels = prepare_triplets(test_df)
#preparing tf-idf vectors
all_text = train_df['question'].tolist() + train_df['sentence text'].tolist()
vectorizer = TfidfVectorizer(max_features=1000)  #reduced dimensionality
vectorizer.fit(all_text)


In the cell above, I prepared the triplets by grouping the records by question and pairing every positive sentence (label 1) with every negative sentence (label 0) of the same question. I then fitted a `TfidfVectorizer` from sklearn on the training questions and sentences, and reduced its `max_features` to 1000 to prevent memory crashes.

In [17]:
def vectorize_triplets(triplets):
    vectorized_triplets = []
    for anchor, pos, neg in triplets:
        anchor_vec = vectorizer.transform([anchor]).toarray()
        pos_vec = vectorizer.transform([pos]).toarray()
        neg_vec = vectorizer.transform([neg]).toarray()
        vectorized_triplets.append((anchor_vec, pos_vec, neg_vec))
    return np.array(vectorized_triplets, dtype=object)

train_triplets_vec = vectorize_triplets(train_triplets)
dev_test_triplets_vec = vectorize_triplets(dev_test_triplets)
test_triplets_vec = vectorize_triplets(test_triplets)
#convert to np.float32 to ensure compatibility with TensorFlow
train_triplets_vec = np.array(train_triplets_vec.tolist(), dtype=np.float32)
dev_test_triplets_vec = np.array(dev_test_triplets_vec.tolist(), dtype=np.float32)
test_triplets_vec = np.array(test_triplets_vec.tolist(), dtype=np.float32)
#distance layer
class DistanceLayer(layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, inputs):
        anchor, positive, negative = inputs
        ap_distance = tf.reduce_sum(tf.square(anchor - positive), axis=1, keepdims=True)
        an_distance = tf.reduce_sum(tf.square(anchor - negative), axis=1, keepdims=True)
        return ap_distance, an_distance


The cell above defines `vectorize_triplets`, which uses the fitted sklearn vectorizer to turn each triplet into tf-idf vectors. I then converted the vectorized triplets into NumPy arrays of type `np.float32` so that they are compatible with TensorFlow. I also defined the class `DistanceLayer`, which serves as the distance layer: it returns the squared Euclidean distance between the anchor and the positive answer, as well as that between the anchor and the negative answer.
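As a small worked example of what the distance layer computes, the sketch below evaluates the squared Euclidean distance on made-up three-dimensional vectors in plain Python. The helper `squared_euclidean` and the vectors are assumptions for illustration only; the real layer applies the same formula to batches of embeddings with TensorFlow operations.

```python
def squared_euclidean(u, v):
    """Sum of squared coordinate differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Made-up embeddings for one triplet
anchor = [1.0, 0.0, 2.0]
positive = [1.0, 1.0, 2.0]
negative = [4.0, 0.0, 6.0]

ap_distance = squared_euclidean(anchor, positive)  # 0^2 + 1^2 + 0^2 = 1.0
an_distance = squared_euclidean(anchor, negative)  # 3^2 + 0^2 + 4^2 = 25.0
print(ap_distance, an_distance)
```

Here the positive is much closer to the anchor than the negative, which is the ordering the network is trained to produce.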

### Model creation and training

I've defined the model structure in the cell below. It has three input layers that accept the anchor, positive, and negative vectors of each triplet, and a shared network that consists of 3 hidden layers with a relu activation function.

In [18]:
#siamese network
def build_siamese_model(input_shape, hidden_size):
    input = layers.Input(shape=input_shape)

    #shared network
    x = layers.Dense(hidden_size, activation='relu')(input)
    x = layers.Dense(hidden_size, activation='relu')(x)
    x = layers.Dense(hidden_size, activation='relu')(x)
    shared_network = tf.keras.Model(inputs=input, outputs=x)

    #inputs for anchor, positive and negative
    anchor_input = layers.Input(shape=input_shape, name='anchor_input')
    positive_input = layers.Input(shape=input_shape, name='positive_input')
    negative_input = layers.Input(shape=input_shape, name='negative_input')

    #shared embeddings
    anchor_embedding = shared_network(anchor_input)
    positive_embedding = shared_network(positive_input)
    negative_embedding = shared_network(negative_input)

    #distance layer
    ap_distance, an_distance = DistanceLayer()([anchor_embedding, positive_embedding, negative_embedding])

    model = tf.keras.Model(inputs=[anchor_input, positive_input, negative_input], outputs=[ap_distance, an_distance])
    return model

#loss function
def loss(margin=0.1):
    def loss_function(y_true, y_pred):
        ap_distance, an_distance = y_pred
        return tf.maximum(ap_distance - an_distance + margin, 0)
    return loss_function

#converting triplet data into arrays for training
anchor_train = np.vstack(train_triplets_vec[:, 0])
positive_train = np.vstack(train_triplets_vec[:, 1])
negative_train = np.vstack(train_triplets_vec[:, 2])

anchor_dev = np.vstack(dev_test_triplets_vec[:, 0])
positive_dev = np.vstack(dev_test_triplets_vec[:, 1])
negative_dev = np.vstack(dev_test_triplets_vec[:, 2])

anchor_test = np.vstack(test_triplets_vec[:, 0])
positive_test = np.vstack(test_triplets_vec[:, 1])
negative_test = np.vstack(test_triplets_vec[:, 2])

#splitting the data for cross-validation
anchor_train, anchor_val, positive_train, positive_val, negative_train, negative_val = train_test_split(
    anchor_train, positive_train, negative_train, test_size=0.2, random_state=42)



I also implemented the custom loss function in the cell above and chose the margin value by manual trial and error. I then converted the triplets into arrays for training and held out 20% of the training triplets as a validation split. The loss is the triplet loss, `max(ap_distance - an_distance + margin, 0)`, which trains the Siamese network by minimizing the distance between the anchor and positive examples while maximizing the distance between the anchor and negative examples.
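To make the behaviour of the triplet loss concrete, here is a minimal plain-Python sketch evaluated on made-up distances. The function name `triplet_loss` and the example values are assumptions; the margin of 0.1 matches the value used in this notebook.

```python
def triplet_loss(ap_distance, an_distance, margin=0.1):
    """Hinge-style triplet loss for a single triplet."""
    return max(ap_distance - an_distance + margin, 0.0)

# Anchor is already much closer to the positive: no penalty
easy = triplet_loss(ap_distance=1.0, an_distance=25.0)

# Anchor is closer to the negative: penalty of roughly 2.0 - 1.5 + 0.1
hard = triplet_loss(ap_distance=2.0, an_distance=1.5)

print(easy, hard)
```

The loss is zero as soon as the negative is farther from the anchor than the positive by at least the margin, so only triplets that violate this ordering contribute gradients.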

In [19]:
#function to compute predictions
def compute_predictions(anchor, positive, negative):
    ap_distance, an_distance = siamese_model([anchor, positive, negative], training=False)
    predictions = (ap_distance < an_distance).numpy().astype(int)
    return predictions

#building and compile the model
siamese_model = build_siamese_model((1000,), 128)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
siamese_model.compile(optimizer=optimizer, loss=loss(0.1))

#converting the data into TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices((anchor_train, positive_train, negative_train)).batch(32)
val_dataset = tf.data.Dataset.from_tensor_slices((anchor_val, positive_val, negative_val)).batch(32)
test_dataset = tf.data.Dataset.from_tensor_slices((anchor_test, positive_test, negative_test)).batch(32)
#custom training loop
epochs = 10
for epoch in range(epochs):
    print(f'Start of epoch {epoch}')

    #training
    for step, (anchor_batch, positive_batch, negative_batch) in enumerate(train_dataset):
        with tf.GradientTape() as tape:
            ap_distance, an_distance = siamese_model([anchor_batch, positive_batch, negative_batch], training=True)
            loss_value = loss(0.1)(None, [ap_distance, an_distance])
        grads = tape.gradient(loss_value, siamese_model.trainable_weights)
        optimizer.apply_gradients(zip(grads, siamese_model.trainable_weights))

    #validation
    val_losses = []
    for step, (anchor_batch, positive_batch, negative_batch) in enumerate(val_dataset):
        ap_distance, an_distance = siamese_model([anchor_batch, positive_batch, negative_batch], training=False)
        val_loss_value = loss(0.1)(None, [ap_distance, an_distance])
        val_losses.append(val_loss_value.numpy().flatten()[0])
    val_loss = np.mean(val_losses)
    print(f'Validation loss: {val_loss}')


Start of epoch 0
Validation loss: 0.8756504654884338
Start of epoch 1
Validation loss: 0.2092791348695755
Start of epoch 2
Validation loss: 0.11120724678039551
Start of epoch 3
Validation loss: 0.15317168831825256
Start of epoch 4
Validation loss: 0.2040964961051941
Start of epoch 5
Validation loss: 0.16112138330936432
Start of epoch 6
Validation loss: 0.139619380235672
Start of epoch 7
Validation loss: 0.13301801681518555
Start of epoch 8
Validation loss: 0.13285623490810394
Start of epoch 9
Validation loss: 0.1328524351119995


In the cell above, I defined my custom training loop and used the hyperparameters that I found through manual experiments. I did not use Keras Tuner because it makes the computation much more time-consuming, and my computer does not have the computing power to run it efficiently.

In [8]:
#evaluating on the validation set
val_predictions = compute_predictions(anchor_val, positive_val, negative_val).flatten()
val_y_true = dev_test_labels[:len(val_predictions)]

precision = precision_score(val_y_true, val_predictions)
recall = recall_score(val_y_true, val_predictions)
f1 = f1_score(val_y_true, val_predictions)

print(f"validation Precision: {precision}")
print(f"validation Recall: {recall}")
print(f"validation F1 Score: {f1}")


validation Precision: 0.5009970854425525
validation Recall: 0.9778443113772455
validation F1 Score: 0.662541839943199


The results of the model evaluation on the validation set are the following:

validation Precision: 0.5009970854425525

validation Recall: 0.9778443113772455

validation F1 Score: 0.662541839943199

The results are not great, but an F1 score in this range is expected, since the datasets are not large enough for training and the labels are not equally distributed: there are many more 0s than 1s.
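One thing to keep in mind when reading these numbers is that `prepare_triplets` appends two labels (a 1 and a 0) per triplet, while `compute_predictions` returns one prediction per triplet. Each triplet already encodes which sentence is the positive one, so a complementary metric that needs no separate label list is the fraction of triplets in which the anchor is closer to the positive than to the negative. The sketch below is an assumption-laden illustration of that idea, not the evaluation used above: `triplet_accuracy` and the toy distances are made up.

```python
def triplet_accuracy(ap_distances, an_distances):
    """Fraction of triplets where the anchor-positive distance is strictly smaller."""
    pairs = list(zip(ap_distances, an_distances))
    if not pairs:
        return 0.0
    correct = sum(1 for ap, an in pairs if ap < an)
    return correct / len(pairs)

# Made-up distances for four triplets
toy_ap = [0.2, 0.9, 0.1, 0.5]
toy_an = [0.8, 0.4, 0.6, 0.5]

acc = triplet_accuracy(toy_ap, toy_an)
print(acc)  # triplets 1 and 3 are ordered correctly, a tie counts as wrong
```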

In [9]:
#evaluating on the test set
test_predictions = compute_predictions(anchor_test, positive_test, negative_test).flatten()
test_y_true = test_labels[:len(test_predictions)]

precision = precision_score(test_y_true, test_predictions)
recall = recall_score(test_y_true, test_predictions)
f1 = f1_score(test_y_true, test_predictions)

print(f"Test Precision: {precision}")
print(f"Test Recall: {recall}")
print(f"Test F1 Score: {f1}")


Test Precision: 0.5002120141342756
Test Recall: 0.5953903095558546
Test F1 Score: 0.5436669483063216


The model's performance on the test set is lower than on the validation set, which is normal:

Test Precision: 0.5002120141342756

Test Recall: 0.5953903095558546

Test F1 Score: 0.5436669483063216

However, an F1 score of 0.5437 could still be considered acceptable, given the limited data volume and the imbalance in the label distribution.

### nn_summariser function
In the cell below, I've defined the function that returns the IDs of the n sentences that have the highest predicted score. The input questionids is a list of question ids. The output is a list of lists of sentence ids.

In [None]:
def nn_summariser(csvfile, questionids, n=1):
    df = pd.read_csv(csvfile)
    all_text = df['question'].tolist() + df['sentence text'].tolist()
    vectorizer = TfidfVectorizer(max_features=1000)
    vectorizer.fit(all_text)

    results = []
    for qid in questionids:
        sub_df = df[df['qid'] == qid]
        question = sub_df['question'].iloc[0]
        sentences = sub_df['sentence text'].tolist()
        sent_ids = sub_df['sentid'].tolist()

        question_vec = vectorizer.transform([question]).toarray()
        sentence_vecs = vectorizer.transform(sentences).toarray()

        question_vec = np.tile(question_vec, (len(sentences), 1))

        ap_distance, _ = siamese_model([question_vec, sentence_vecs, sentence_vecs])
        scores = ap_distance.numpy().flatten()

        top_n_indices = np.argsort(scores)[:n]
        top_n_ids = [sent_ids[i] for i in top_n_indices]
        results.append(top_n_ids)

    return results

#reporting final results using the test set
test_question_ids = test_df['qid'].unique()
test_results = nn_summariser('test.csv', test_question_ids, n=1)

#printing the test results
for qid, top_ids in zip(test_question_ids, test_results):
    print(f"Question ID: {qid}, Top Sentence IDs: {top_ids}")


Question ID: 4109, Top Sentence IDs: [20]
Question ID: 584, Top Sentence IDs: [20]
Question ID: 1644, Top Sentence IDs: [52]
Question ID: 3764, Top Sentence IDs: [8]
Question ID: 3508, Top Sentence IDs: [3]
Question ID: 991, Top Sentence IDs: [33]
Question ID: 2401, Top Sentence IDs: [11]
Question ID: 1999, Top Sentence IDs: [4]
Question ID: 937, Top Sentence IDs: [0]
Question ID: 2125, Top Sentence IDs: [8]
Question ID: 1606, Top Sentence IDs: [33]
Question ID: 2531, Top Sentence IDs: [6]
Question ID: 2938, Top Sentence IDs: [0]
Question ID: 671, Top Sentence IDs: [34]
Question ID: 1922, Top Sentence IDs: [6]
Question ID: 1754, Top Sentence IDs: [13]
Question ID: 1449, Top Sentence IDs: [6]
Question ID: 1852, Top Sentence IDs: [15]
Question ID: 2092, Top Sentence IDs: [9]
Question ID: 3905, Top Sentence IDs: [6]
Question ID: 643, Top Sentence IDs: [21]
Question ID: 3987, Top Sentence IDs: [23]
Question ID: 520, Top Sentence IDs: [4]
Question ID: 1014, Top Sentence IDs: [20]
Question I

The IDs of the $n$ sentences with the highest predicted score for each question (here, the smallest distance between the question and the sentence) are printed in the cell above.
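The ranking step inside the summariser can be illustrated in isolation. In this sketch, which uses a hypothetical helper `top_n_sentence_ids` and made-up distances, a smaller distance to the question is treated as a higher score, so the sentences are sorted in ascending order of distance and the first `n` are kept.

```python
def top_n_sentence_ids(sent_ids, distances, n=1):
    """Return the n sentence IDs with the smallest distance to the question."""
    ranked = sorted(zip(distances, sent_ids))  # ascending by distance
    return [sid for _, sid in ranked[:n]]

# Made-up sentence IDs and anchor-sentence distances for one question
toy_ids = [10, 11, 12, 13]
toy_dist = [0.7, 0.2, 0.9, 0.4]

best_two = top_n_sentence_ids(toy_ids, toy_dist, n=2)
print(best_two)  # [11, 13]
```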

# Recurrent NN

I'll implement a more complex Siamese neural network that is composed of the following layers:

* An embedding layer that generates embedding vectors of the sentence text with 35 dimensions.
* An LSTM layer. You need to determine the size of this LSTM layer, and the text length limit (if needed).
* 3 hidden layers and a relu activation function. You need to determine the size of the hidden layers.

I'll train the model with the training data, use the `dev_test` set to determine a good size of the LSTM layer and an appropriate length limit (if needed), and report the final results using the test set. Again, remember to use the test set only after you have determined the optimal parameters of the LSTM layer.

Finally, based on my experiments, I will comment on whether this system is better than the systems developed in the previous tasks.


### Data preparation

In [None]:
#loading datasets
train_df = pd.read_csv('training.csv')
dev_test_df = pd.read_csv('dev_test.csv')
test_df = pd.read_csv('test.csv')

#sampling 50% of each dataset
fraction = 0.5
train_df = train_df.sample(frac=fraction, random_state=42)
dev_test_df = dev_test_df.sample(frac=fraction, random_state=42)
test_df = test_df.sample(frac=fraction, random_state=42)

#function to prepare triplets
def prepare_triplets(df):
    triplets = []
    labels = []
    questions = df['question'].unique()
    for q in questions:
        sub_df = df[df['question'] == q]
        positives = sub_df[sub_df['label'] == 1]['sentence text'].tolist()
        negatives = sub_df[sub_df['label'] == 0]['sentence text'].tolist()

        for pos in positives:
            for neg in negatives:
                triplets.append((q, pos, neg))
                labels.append(1)  #positive pair
                labels.append(0)  #negative pair
    return triplets, labels

#preparing triplets for training and dev_test datasets
train_triplets, train_labels = prepare_triplets(train_df)
dev_test_triplets, dev_test_labels = prepare_triplets(dev_test_df)
test_triplets, test_labels = prepare_triplets(test_df)
#preparing tokenizer and sequences
all_text = train_df['question'].tolist() + train_df['sentence text'].tolist()
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_text)

The cell above follows the same steps as in the first task, except that a Keras `Tokenizer` is fitted on the training text instead of a `TfidfVectorizer`.

In [None]:
max_length = 150
vocab_size = len(tokenizer.word_index) + 1
def tokenize_and_pad(triplets):
    anchor_texts = [triplet[0] for triplet in triplets]
    positive_texts = [triplet[1] for triplet in triplets]
    negative_texts = [triplet[2] for triplet in triplets]

    anchor_seqs = tokenizer.texts_to_sequences(anchor_texts)
    positive_seqs = tokenizer.texts_to_sequences(positive_texts)
    negative_seqs = tokenizer.texts_to_sequences(negative_texts)

    anchor_padded = pad_sequences(anchor_seqs, maxlen=max_length, padding='post')
    positive_padded = pad_sequences(positive_seqs, maxlen=max_length, padding='post')
    negative_padded = pad_sequences(negative_seqs, maxlen=max_length, padding='post')

    return anchor_padded, positive_padded, negative_padded

anchor_train, positive_train, negative_train = tokenize_and_pad(train_triplets)
anchor_dev, positive_dev, negative_dev = tokenize_and_pad(dev_test_triplets)
anchor_test, positive_test, negative_test = tokenize_and_pad(test_triplets)
#splitting the data for cross-validation
anchor_train, anchor_val, positive_train, positive_val, negative_train, negative_val = train_test_split(
    anchor_train, positive_train, negative_train, test_size=0.2, random_state=42)

In the cell above, I've defined the function that tokenizes and pads the triplets. I set the length limit to 150 after a few manual trials, as it seemed to be the best value. After that, I prepared the three datasets for model training and evaluation by tokenizing and then padding the inputs.
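To show what the padding step does to a single sequence, the following plain-Python sketch mimics `pad_sequences` with `padding='post'` and its default `truncating='pre'`: short sequences are padded with zeros at the end, and long sequences keep only their last `max_length` tokens. The helper `pad_post` and the toy token IDs are assumptions for illustration; the notebook itself relies on the Keras utility.

```python
def pad_post(seq, max_length, pad_value=0):
    """Pad at the end; if too long, drop tokens from the start."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    trimmed = seq[-max_length:]  # keeps the last max_length tokens
    return trimmed + [pad_value] * (max_length - len(trimmed))

short = pad_post([5, 3, 8], max_length=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short)  # [5, 3, 8, 0, 0]
print(long)   # [3, 4, 5, 6, 7]
```

Whatever padding and truncation settings are chosen, they should be the same at training time and at prediction time, so that the model sees inputs laid out in the same way.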

### Model creation and training

I've defined a Siamese network with an embedding layer and an LSTM layer in the cell below.
- **Input Layer:** Defines the input shape for sequences.
- **Embedding Layer:** Embeds input sequences into dense vectors of fixed size.
- **LSTM Layer:** Processes the embedded sequences to capture temporal dependencies.
- **Dense Layers:** Further process the LSTM outputs through fully connected layers with ReLU activations to form the shared network. This network is used to generate embeddings for anchor, positive, and negative inputs.
- **Shared Network:** The shared network processes the anchor, positive, and negative inputs to produce their embeddings.
- **Distance Calculation:** The custom `DistanceLayer` computes the distances between the embeddings of the anchor-positive and anchor-negative pairs.
- **Model Definition:** The model takes anchor, positive, and negative inputs and outputs the distances computed by the DistanceLayer.

In [None]:
#distance layer
class DistanceLayer(layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, inputs):
        anchor, positive, negative = inputs
        ap_distance = tf.reduce_sum(tf.square(anchor - positive), axis=1, keepdims=True)
        an_distance = tf.reduce_sum(tf.square(anchor - negative), axis=1, keepdims=True)
        return ap_distance, an_distance

#siamese network with embedding and LSTM layers
def build_siamese_model(vocab_size, max_length, embedding_dim=35, lstm_units=64, hidden_size=128):
    input = layers.Input(shape=(max_length,))

    #shared network
    x = layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length)(input)
    x = layers.LSTM(lstm_units)(x)
    x = layers.Dense(hidden_size, activation='relu')(x)
    x = layers.Dense(hidden_size, activation='relu')(x)
    x = layers.Dense(hidden_size, activation='relu')(x)
    shared_network = tf.keras.Model(inputs=input, outputs=x)

    #inputs for anchor, positive and negative
    anchor_input = layers.Input(shape=(max_length,), name='anchor_input')
    positive_input = layers.Input(shape=(max_length,), name='positive_input')
    negative_input = layers.Input(shape=(max_length,), name='negative_input')

    #shared embeddings
    anchor_embedding = shared_network(anchor_input)
    positive_embedding = shared_network(positive_input)
    negative_embedding = shared_network(negative_input)

    #distance layer
    ap_distance, an_distance = DistanceLayer()([anchor_embedding, positive_embedding, negative_embedding])

    model = tf.keras.Model(inputs=[anchor_input, positive_input, negative_input], outputs=[ap_distance, an_distance])
    return model


Below, I've defined the loss function, which is the same as the one used for the first model, and then prepared the data by slicing the tokenized and padded datasets into batches of 32.

In [None]:
#loss function
def loss(margin=0.1):
    def loss_function(y_true, y_pred):
        ap_distance, an_distance = y_pred
        return tf.maximum(ap_distance - an_distance + margin, 0)
    return loss_function

#converting triplet data into arrays for training
def prepare_data_for_training(anchor, positive, negative):
    return tf.data.Dataset.from_tensor_slices((anchor, positive, negative)).batch(32)

train_dataset = prepare_data_for_training(anchor_train, positive_train, negative_train)
val_dataset = prepare_data_for_training(anchor_val, positive_val, negative_val)
test_dataset = prepare_data_for_training(anchor_test, positive_test, negative_test)
#building and compile the model
siamese_model = build_siamese_model(vocab_size, max_length, embedding_dim=35, lstm_units=64, hidden_size=128)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
siamese_model.compile(optimizer=optimizer, loss=loss(0.1))




The hyperparameters were selected by trial and error. I didn't use Keras Tuner because I was encountering crashes, so all I could do was try different commonly used hyperparameter values and keep the ones with the best result.

The cell below is a custom training loop for the model with 10 epochs.

In [None]:
#custom training loop
epochs = 10
for epoch in range(epochs):
    print(f'Start of epoch {epoch}')

    #training
    for step, (anchor_batch, positive_batch, negative_batch) in enumerate(train_dataset):
        with tf.GradientTape() as tape:
            ap_distance, an_distance = siamese_model([anchor_batch, positive_batch, negative_batch], training=True)
            loss_value = loss(0.1)(None, [ap_distance, an_distance])
        grads = tape.gradient(loss_value, siamese_model.trainable_weights)
        optimizer.apply_gradients(zip(grads, siamese_model.trainable_weights))

    #validation
    val_losses = []
    for step, (anchor_batch, positive_batch, negative_batch) in enumerate(val_dataset):
        ap_distance, an_distance = siamese_model([anchor_batch, positive_batch, negative_batch], training=False)
        val_loss_value = loss(0.1)(None, [ap_distance, an_distance])
        val_losses.append(val_loss_value.numpy().flatten()[0])
    val_loss = np.mean(val_losses)
    print(f'Validation loss: {val_loss}')


Start of epoch 0




In the cell below, I've defined the function that computes the predictions, which is the same as the one used for the first model, and then evaluated the model on the validation set.

In [None]:
#function to compute predictions
def compute_predictions(anchor, positive, negative):
    ap_distance, an_distance = siamese_model([anchor, positive, negative], training=False)
    predictions = (ap_distance < an_distance).numpy().astype(int)
    return predictions

#evaluating on the validation set
val_predictions = compute_predictions(anchor_val, positive_val, negative_val).flatten()
val_y_true = dev_test_labels[:len(val_predictions)]

precision = precision_score(val_y_true, val_predictions)
recall = recall_score(val_y_true, val_predictions)
f1 = f1_score(val_y_true, val_predictions)

print(f"Validation Precision: {precision}")
print(f"Validation Recall: {recall}")
print(f"Validation F1 Score: {f1}")

Validation Precision: 0.5001577784790154
Validation Recall: 0.9491017964071856
Validation F1 Score: 0.6550940276916718


The performance of the model on the validation set is the following:

Validation Precision: 0.5001577784790154

Validation Recall: 0.9491017964071856

Validation F1 Score: 0.6550940276916718

Given the complexity of the model and the limited amount of suitable data, this F1 score is again within the expected range.

In [20]:
#evaluating on the test set
test_predictions = compute_predictions(anchor_test, positive_test, negative_test).flatten()
test_y_true = test_labels[:len(test_predictions)]

precision = precision_score(test_y_true, test_predictions)
recall = recall_score(test_y_true, test_predictions)
f1 = f1_score(test_y_true, test_predictions)

print(f"Test Precision: {precision}")
print(f"Test Recall: {recall}")
print(f"Test F1 Score: {f1}")

Test Precision: 0.5205479452054794
Test Recall: 0.5608856088560885
Test F1 Score: 0.5399644760213144


The results of the model's performance on the test set are below:

Test Precision: 0.5205479452054794

Test Recall: 0.5608856088560885

Test F1 Score: 0.5399644760213144

The test F1 score is slightly lower than that of the model trained in the first task (0.5400 compared with 0.5437).

### nn_summariser function

In [None]:


def nn_summariser(csvfile, questionids, n=1):
    df = pd.read_csv(csvfile)

    #tokenizing and padding sequences
    def vectorize_texts(texts, tokenizer, max_length):
        sequences = tokenizer.texts_to_sequences(texts)
        padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_length)
        return np.array(padded_sequences)

    #preparing the tokenizer
    all_text = df['question'].tolist() + df['sentence text'].tolist()
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(all_text)

    results = []
    for qid in questionids:
        sub_df = df[df['qid'] == qid]
        question = sub_df['question'].iloc[0]
        sentences = sub_df['sentence text'].tolist()
        sent_ids = sub_df['sentid'].tolist()

        question_vec = vectorize_texts([question], tokenizer, max_length)
        sentence_vecs = vectorize_texts(sentences, tokenizer, max_length)

        #ensuring the question vector is tiled to match the number of sentences
        question_vec = np.tile(question_vec, (len(sentences), 1))

        ap_distance, _ = siamese_model([question_vec, sentence_vecs, sentence_vecs])
        scores = ap_distance.numpy().flatten()

        top_n_indices = np.argsort(scores)[:n]
        top_n_ids = [sent_ids[i] for i in top_n_indices]
        results.append(top_n_ids)

    return results

#reporting final results using the test set
test_question_ids = test_df['qid'].unique()
test_results = nn_summariser('test.csv', test_question_ids, n=1)

#printing the test results
for qid, top_ids in zip(test_question_ids, test_results):
    print(f"Question ID: {qid}, Top Sentence IDs: {top_ids}")


Question ID: 4109, Top Sentence IDs: [19]
Question ID: 584, Top Sentence IDs: [2]
Question ID: 1644, Top Sentence IDs: [48]
Question ID: 3764, Top Sentence IDs: [8]
Question ID: 3508, Top Sentence IDs: [4]
Question ID: 991, Top Sentence IDs: [30]
Question ID: 2401, Top Sentence IDs: [7]
Question ID: 1999, Top Sentence IDs: [3]
Question ID: 937, Top Sentence IDs: [6]
Question ID: 2125, Top Sentence IDs: [1]
Question ID: 1606, Top Sentence IDs: [49]
Question ID: 2531, Top Sentence IDs: [9]
Question ID: 2938, Top Sentence IDs: [1]
Question ID: 671, Top Sentence IDs: [31]
Question ID: 1922, Top Sentence IDs: [1]
Question ID: 1754, Top Sentence IDs: [15]
Question ID: 1449, Top Sentence IDs: [7]
Question ID: 1852, Top Sentence IDs: [26]
Question ID: 2092, Top Sentence IDs: [12]
Question ID: 3905, Top Sentence IDs: [12]
Question ID: 643, Top Sentence IDs: [16]
Question ID: 3987, Top Sentence IDs: [16]
Question ID: 520, Top Sentence IDs: [5]
Question ID: 1014, Top Sentence IDs: [12]
Question I

The cell above prints, for each question, the IDs of the n sentences that the model ranks highest.

### Comparison of the first two models

The second model has a slightly lower F1 score than the first, simpler model on both datasets. A more complex model is not guaranteed to improve the results: performance depends on how well the model suits the dataset and on whether there is enough data to train it.

# Transformer

Implement a simple Transformer neural network that is composed of the following layers:

* Use BERT as feature extractor for each token.
* A few transformer encoder layers with hidden dimension 768. You need to determine how many layers to use (between 1 and 3).
* A few transformer decoder layers with hidden dimension 768. You need to determine how many layers to use (between 1 and 3).
* 1 hidden layer with size 512.
* The final output layer with one cell for binary classification to predict whether two inputs are related or not.

Note that each input for this model should be a concatenation of a positive pair (i.e. question + one answer) or a negative pair (i.e. question + not related sentence). The format is usually like [CLS]+ question + [SEP] + a positive/negative sentence.

Train the model with the training data, use the dev_test set to determine a good size of the transformer layers, and report the final results using the test set. Again, remember to use the test set only after you have determined the optimal parameters of the transformer layers.

Finally, based on my experiments, I will comment on whether this system is better than the systems developed in the previous tasks.


#### Explanation of how to handle length differences within a batch of data
1. **Tokenizing with Padding and Attention Masks:** Padding all sequences in a batch to the same length and generating attention masks that mark which positions are real tokens.
2. **Prepare TensorFlow Datasets:** Converting the tokenized inputs and attention masks into TensorFlow datasets.
3. **Modify the Model:** Updating the model to use the attention masks during the forward pass. In the training code further down only `input_ids` are passed to the model, so the masks are generated by the tokenizer but not used in the forward pass.
4. **Train and Evaluate:** Training and evaluating the model using the prepared datasets.

This approach ensures that the Transformer model can handle varying sequence lengths within batches effectively.
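
Padding and mask generation can be illustrated without BERT. The following is a minimal NumPy sketch, not the notebook's actual pipeline: `pad_batch` is a hypothetical helper, the token IDs are made up, and a pad ID of 0 is assumed (it matches `bert-base-uncased`, where `[PAD]` has ID 0).

```python
import numpy as np

def pad_batch(sequences, pad_id=0, max_length=None):
    """Right-pad variable-length token ID lists and build the matching attention mask."""
    longest = max(len(seq) for seq in sequences)
    # Pad to the longest sequence, optionally capped at max_length (truncation).
    target = longest if max_length is None else min(longest, max_length)
    input_ids = np.full((len(sequences), target), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), target), dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = seq[:target]
        input_ids[row, :len(kept)] = kept
        attention_mask[row, :len(kept)] = 1  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Made-up token IDs for two pairs of different length (101 = [CLS], 102 = [SEP]).
batch = [[101, 2003, 102, 2023, 102], [101, 2003, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The Hugging Face tokenizer returns the same two arrays as `input_ids` and `attention_mask` when it is called with `padding=True`.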

### Data preparation

In [None]:
train_df = pd.read_csv('training.csv')
dev_test_df = pd.read_csv('dev_test.csv')
test_df = pd.read_csv('test.csv')

#function to sample a fraction of the dataset
def sample_fraction(df, fraction):
    return df.sample(frac=fraction, random_state=42)

#sampling a quarter of each dataset
fraction = 0.25
train_df = sample_fraction(train_df, fraction)
dev_test_df = sample_fraction(dev_test_df, fraction)
test_df = sample_fraction(test_df, fraction)

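#function to prepare concatenated pairs
#note: BertTokenizer also inserts [CLS]/[SEP] by default (add_special_tokens=True),
#so the literal markers written below come on top of the ones it adds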
def prepare_pairs(df):
    pairs = []
    labels = []
    questions = df['question'].unique()
    for q in questions:
        sub_df = df[df['question'] == q]
        positives = sub_df[sub_df['label'] == 1]['sentence text'].tolist()
        negatives = sub_df[sub_df['label'] == 0]['sentence text'].tolist()

        for pos in positives:
            pairs.append(f"[CLS] {q} [SEP] {pos} [SEP]")
            labels.append(1)  #positive pair
        for neg in negatives:
            pairs.append(f"[CLS] {q} [SEP] {neg} [SEP]")
            labels.append(0)  #negative pair
    return pairs, labels

#preparing pairs for training, dev_test, and test datasets
train_pairs, train_labels = prepare_pairs(train_df)
dev_test_pairs, dev_test_labels = prepare_pairs(dev_test_df)
test_pairs, test_labels = prepare_pairs(test_df)

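#tokenizing and padding sequences
from transformers import BertTokenizer, TFBertModel  #needed here and in the model cell, if not already imported

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_length = 128  #truncate to 128 tokens (BERT itself accepts up to 512)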

def tokenize_pairs(pairs, max_length):
    inputs = tokenizer(pairs, return_tensors='tf', max_length=max_length, padding=True, truncation=True)
    return inputs

train_inputs = tokenize_pairs(train_pairs, max_length)
dev_test_inputs = tokenize_pairs(dev_test_pairs, max_length)
test_inputs = tokenize_pairs(test_pairs, max_length)


In the cell above I've defined a function that prepares question-sentence pairs along with their labels. For each unique question, it creates pairs with the positive and the negative sentences. [CLS] and [SEP] are the special tokens BERT uses to mark the start of the sequence and the separation between question and sentence, respectively. I've then used `BertTokenizer` to tokenize the pairs, padding each set to its longest sequence and truncating at 128 tokens, so that the data is ready for training.


### Model Creation

In [None]:
#custom Transformer layers
class TransformerEncoderLayer(tf.keras.layers.Layer):
    def __init__(self, hidden_dim, num_heads, ff_dim, dropout_rate=0.1):
        super(TransformerEncoderLayer, self).__init__()
        self.attention = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=hidden_dim)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_dim, activation='relu'),
            tf.keras.layers.Dense(hidden_dim),
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, inputs, training):
        attn_output = self.attention(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

class TransformerDecoderLayer(tf.keras.layers.Layer):
    def __init__(self, hidden_dim, num_heads, ff_dim, dropout_rate=0.1):
        super(TransformerDecoderLayer, self).__init__()
        self.attention1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=hidden_dim)
        self.attention2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=hidden_dim)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_dim, activation='relu'),
            tf.keras.layers.Dense(hidden_dim),
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout3 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, inputs, enc_outputs, training):
        attn1 = self.attention1(inputs, inputs)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(inputs + attn1)
        attn2 = self.attention2(out1, enc_outputs)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(out1 + attn2)
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        return self.layernorm3(out2 + ffn_output)

class TransformerModel(tf.keras.Model):
    def __init__(self, num_encoder_layers, num_decoder_layers, hidden_dim, ff_dim, hidden_layer_size, **kwargs):
        super(TransformerModel, self).__init__(**kwargs)
        self.bert = TFBertModel.from_pretrained('bert-base-uncased')
        self.encoder_layers = [TransformerEncoderLayer(hidden_dim, 8, ff_dim) for _ in range(num_encoder_layers)]
        self.decoder_layers = [TransformerDecoderLayer(hidden_dim, 8, ff_dim) for _ in range(num_decoder_layers)]
        self.hidden_layer = tf.keras.layers.Dense(hidden_layer_size, activation='relu')
        self.output_layer = tf.keras.layers.Dense(1, activation='sigmoid')

    def call(self, inputs, training=False):
        bert_outputs = self.bert(inputs)[0]

        enc_output = bert_outputs
        for encoder in self.encoder_layers:
            enc_output = encoder(enc_output, training=training)

        dec_output = enc_output
        for decoder in self.decoder_layers:
            dec_output = decoder(dec_output, enc_output, training=training)

        hidden_output = self.hidden_layer(tf.reduce_mean(dec_output, axis=1))
        output = self.output_layer(hidden_output)
        return output

I've defined a custom Transformer model in TensorFlow using Keras. It consists of custom Transformer encoder and decoder layers, and a full model that stacks them on top of a pre-trained BERT model.


### Custom Transformer Encoder Layer

- **MultiHeadAttention:** Applies multi-head self-attention to the input.
- **Feed-Forward Network (FFN):** A two-layer feed-forward network with ReLU activation.
- **Layer Normalization:** Normalizes the output of the previous layers.
- **Dropout:** Adds regularization to prevent overfitting.
- **Call Method:** Defines the forward pass. Applies multi-head attention, dropout, layer normalization, feed-forward network, another dropout, and layer normalization again.
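
The order of operations in the encoder's call method (sub-layer, residual connection, then layer normalization) can be sketched in NumPy. This is an illustration only: `layer_norm` and `post_norm_block` are hypothetical helpers, the learned scale and offset of Keras' `LayerNormalization` are left out, dropout is omitted, and the halving "sub-layer" is a stand-in for attention or the FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance (no learned scale/offset)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    """Apply a sub-layer, add the residual connection, then normalize."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 8))  # (batch, sequence length, hidden dim)
out = post_norm_block(tokens, lambda x: 0.5 * x)  # stand-in sub-layer
print(out.shape)
```

The encoder layer applies this pattern twice: once around self-attention and once around the feed-forward network.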

### Custom Transformer Decoder Layer

- **Attention1:** Applies self-attention to the decoder input.
- **Attention2:** Applies attention to the encoder outputs.
- **Feed-Forward Network (FFN):** Same as in the encoder layer.
- **Layer Normalization:** Normalizes the output of the previous layers.
- **Dropout:** Adds regularization.
- **Call Method:** Defines the forward pass. Applies self-attention, cross-attention with encoder outputs, feed-forward network, and layer normalization.

### Custom Transformer Model

- **BERT Model:** Initializes a pre-trained BERT model from Hugging Face.
- **Encoder Layers:** A list of custom encoder layers.
- **Decoder Layers:** A list of custom decoder layers.
- **Hidden Layer:** A dense layer with ReLU activation.
- **Output Layer:** A dense layer with sigmoid activation to produce the final output.
- **Call Method:** Defines the forward pass. It first gets the outputs from the BERT model, passes them through the encoder layers, then through the decoder layers, averages the outputs, and finally passes them through the hidden and output layers to get the final result.
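
The averaging step in the call method uses `tf.reduce_mean` over every position, padding included. A mask-aware average is one possible alternative. The sketch below shows the idea in NumPy with made-up values; `masked_mean` is a hypothetical helper and is not part of the model above.

```python
import numpy as np

def masked_mean(hidden_states, attention_mask):
    """Average token vectors over real tokens only, ignoring padded positions."""
    mask = attention_mask[..., np.newaxis].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against all-padding rows
    return summed / counts

# One sequence of three positions; the last one is padding with a large dummy value.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean(hidden, mask)
plain = hidden.mean(axis=1)
print(pooled)  # average of the two real tokens
print(plain)   # plain average, pulled towards the padding value
```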



### Model training and evaluation

In [None]:
#defining the parameters for the layers
num_encoder_layers = 2
num_decoder_layers = 2
hidden_dim = 768
ff_dim = 2048
hidden_layer_size = 512

# Instantiate and compile the model
transformer_model = TransformerModel(num_encoder_layers, num_decoder_layers, hidden_dim, ff_dim, hidden_layer_size)
transformer_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

#preparing data for training and validation
train_dataset = tf.data.Dataset.from_tensor_slices((train_inputs['input_ids'], train_labels)).batch(32)
dev_test_dataset = tf.data.Dataset.from_tensor_slices((dev_test_inputs['input_ids'], dev_test_labels)).batch(32)
test_dataset = tf.data.Dataset.from_tensor_slices((test_inputs['input_ids'], test_labels)).batch(32)

#training the model
epochs = 10
transformer_model.fit(train_dataset, validation_data=dev_test_dataset, epochs=epochs)

#evaluating on the test set
test_loss, test_accuracy = transformer_model.evaluate(test_dataset)
print(f'Test Accuracy: {test_accuracy}')

#computing predictions
test_predictions = transformer_model.predict(test_inputs['input_ids']).flatten()
test_predictions = (test_predictions > 0.5).astype(int)

#ground truth labels
y_true = np.array(test_labels)

#calculating precision, recall, and F1 score
precision = precision_score(y_true, test_predictions)
recall = recall_score(y_true, test_predictions)
f1 = f1_score(y_true, test_predictions)

print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')



Epoch 1/10




Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy: 0.4292101263999939
Precision: 0.3217291507268554
Recall: 0.8555442522889115
F1 Score: 0.4676118988045593


In the cell above I've defined the parameters for the layers. Keep in mind that this model needed a lot of computing power, so it was impossible for me to run it many times: each training run took about 6 hours. I therefore decided to choose the hyperparameters after two trials, and these are the values I settled on. After training, I evaluated the model on the provided test.csv file. The results on the test set are low: the high recall combined with low precision suggests that the model labels most pairs as positive. This was expected, since the model is highly complex and the datasets provided are imbalanced and not large enough to train it adequately.

The final results on the test set are:

Precision: 0.3217291507268554

Recall: 0.8555442522889115

F1 Score: 0.4676118988045593
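
As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two values reported above. The snippet only redoes that arithmetic; the variable names are illustrative.

```python
# Precision and recall as reported on the test set above.
precision = 0.3217291507268554
recall = 0.8555442522889115

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 recomputed: {f1:.4f}")
```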

### nn_summariser function
The function below returns, for each given question, the IDs of the n sentences with the highest prediction scores.

In [None]:

#defining the nn_summariser function
def nn_summariser(csvfile, questionids, n=1):
    df = pd.read_csv(csvfile)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    results = []
    for qid in questionids:
        sub_df = df[df['qid'] == qid]
        question = sub_df['question'].iloc[0]
        sentences = sub_df['sentence text'].tolist()
        sent_ids = sub_df['sentid'].tolist()

        #preparing inputs
        pairs = [f"[CLS] {question} [SEP] {sent} [SEP]" for sent in sentences]
        inputs = tokenizer(pairs, return_tensors='tf', max_length=128, padding=True, truncation=True)

        #predicting scores
        predictions = transformer_model.predict(inputs['input_ids'])
        scores = predictions.flatten()

        #getting top n sentences (argsort is ascending, so take the last n and put the best first)
        top_n_indices = np.argsort(scores)[-n:][::-1]
        top_n_ids = [sent_ids[i] for i in top_n_indices]
        results.append(top_n_ids)

    return results

#reporting final results using the test set
test_question_ids = test_df['qid'].unique()
test_results = nn_summariser('test.csv', test_question_ids, n=1)

#printing the test results
for qid, top_ids in zip(test_question_ids, test_results):
    print(f"Question ID: {qid}, Top Sentence IDs: {top_ids}")


Question ID: 4109, Top Sentence IDs: [25]
Question ID: 584, Top Sentence IDs: [4]
Question ID: 1644, Top Sentence IDs: [2]
Question ID: 3764, Top Sentence IDs: [5]
Question ID: 3508, Top Sentence IDs: [3]
Question ID: 991, Top Sentence IDs: [0]
Question ID: 2401, Top Sentence IDs: [2]
Question ID: 1999, Top Sentence IDs: [1]
Question ID: 937, Top Sentence IDs: [2]
Question ID: 2125, Top Sentence IDs: [2]
Question ID: 1606, Top Sentence IDs: [46]
Question ID: 2531, Top Sentence IDs: [13]
Question ID: 2938, Top Sentence IDs: [3]
Question ID: 671, Top Sentence IDs: [37]
Question ID: 1922, Top Sentence IDs: [5]
Question ID: 1754, Top Sentence IDs: [17]
Question ID: 1449, Top Sentence IDs: [8]
Question ID: 1852, Top Sentence IDs: [8]
Question ID: 2092, Top Sentence IDs: [8]
Question ID: 3905, Top Sentence IDs: [7]
Question ID: 643, Top Sentence IDs: [13]
Question ID: 3987, Top Sentence IDs: [27]
Question ID: 520, Top Sentence IDs: [9]
Question ID: 1014, Top Sentence IDs: [21]
Question ID: 1

The cell above prints, for each question, the IDs of the n sentences with the highest prediction scores.
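
The top-n selection can be checked on a toy example. The scores and sentence IDs below are made up. `np.argsort` sorts in ascending order, so the last n indices are the highest-scoring ones, and reversing them lists the best sentence first.

```python
import numpy as np

# Made-up prediction scores for four candidate sentences and their sentence IDs.
scores = np.array([0.10, 0.80, 0.35, 0.92])
sent_ids = [7, 3, 12, 5]
n = 2

# Indices of the n highest scores, best first.
top_n_indices = np.argsort(scores)[-n:][::-1]
top_n_ids = [sent_ids[i] for i in top_n_indices]
print(top_n_ids)  # [5, 3]
```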

### Comparison of this model with the previous models

As noted in the comparison of the first two models, more complexity does not mean better performance. With an inadequate amount of data and an imbalanced label distribution, a more complex model can produce worse results, and this model indeed performs worse than the two before it.