# Assignment 3 Part 2 - Wiki Question Answering

**Submission deadline:** Friday 30 May 2025, 11:55 pm

**Marks:** 20 marks (20% of the total unit assessment)

Unless a Special Consideration request has been submitted and approved, a 5% penalty (of the total possible mark of the task) will be applied for each day a written report or presentation assessment is not submitted, up until the 7th day (including weekends). After the 7th day, a grade of ‘0’ will be awarded even if the assessment is submitted. For example, if the assignment is worth 8 marks (of the entire unit) and your submission is late by 19 hours (or 23 hours 59 minutes 59 seconds), 0.4 marks (5% of 8 marks) will be deducted. If your submission is late by 24 hours (or 47 hours 59 minutes 59 seconds), 0.8 marks (10% of 8 marks) will be deducted, and so on. The submission time for all uploaded assessments is 11:55 pm. A 1-hour grace period will be provided to students who experience a technical concern. For any late submission of time-sensitive tasks, such as scheduled tests/exams, performance assessments/presentations, and/or scheduled practical assessments/labs, please apply for Special Consideration.


## A Note on the Use of AI Generators

In this assignment, we view AI code generators such as Copilot, CodeGPT, etc. as tools that can help you write code quickly. You are allowed to use these tools, but with some conditions. To understand what you can and cannot do, please visit these information pages provided by Macquarie University:

Artificial Intelligence Tools and Academic Integrity in FSE - https://bit.ly/3uxgQP4

If you choose to use these tools, state the following explicitly in your submitted file, in comments starting with "Use of AI generators in this assignment":

-   What part of your code is based on the output of such tools,
-   What tools you used,
-   What prompts you used to generate the code or text, and
-   What modifications you made to the generated code or text.

This will help us assess your work fairly. If we observe that you have used an AI generator and you do not give the above information, you may face disciplinary action.

## Objectives of This Assignment

<!-- In Assignment 3 you will work on a general answer selection task. Given a question and a list of candidate sentences, the goal is to predict which sentences can be used as part of the answer. Assignment 3 Part 2 requires you to implement deep neural networks. -->

In this assignment, you will work on the answer selection task using the WikiQA corpus. Given a question and a list of candidate sentences, the goal is to predict which sentences can be used to form a correct answer. This assignment requires you to implement and evaluate a traditional text classification method (Naive Bayes) as well as deep neural networks (a Siamese network and a Transformer model).



The dataset is the **Wiki Question Answering corpus from Microsoft**. The provided files (`training.csv`, `dev_test.csv`, `test.csv` in `data.zip`) contain the following columns:

-   `question_id`: ID for a question
-   `question`: Text of the question
-   `document_title`: Topic of the question
-   `answer`: Sentence candidate for the answer
-   `label`: 1 if the sentence is part of the answer, 0 otherwise

The following code shows how to load and preview the data:

```python
import pandas as pd

train_data = pd.read_csv("data/training.csv")
dev_data = pd.read_csv("data/dev_test.csv")
test_data = pd.read_csv("data/test.csv")
train_data.head()
```


| Unnamed: 0 | question_id | question | document_title | answer | label |
|---|---|---|---|---|---|
| 0 | Q1 | how are glacier caves formed? | Glacier cave | A partly submerged glacier cave on Perito More... | 0 |
| 1 | Q1 | how are glacier caves formed? | Glacier cave | The ice facade is approximately 60 m high | 0 |
| 2 | Q1 | how are glacier caves formed? | Glacier cave | Ice formations in the Titlis glacier cave | 0 |
| 3 | Q1 | how are glacier caves formed? | Glacier cave | A glacier cave is a cave formed within the ice... | 1 |
| 4 | Q1 | how are glacier caves formed? | Glacier cave | Glacier caves are often called ice caves , but... | 0 |
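
Because each question comes with several candidate sentences, it can help to check how many candidates and how many positive labels each question has before modelling. The following is a minimal sketch on a hand-made toy frame that mimics some of the columns above; the rows and the names `toy`, `summary`, and `positive_rate` are illustrative assumptions, not part of the provided data.

```python
import pandas as pd

# Toy rows mimicking the provided columns (illustrative values, not real data).
toy = pd.DataFrame(
    {
        "question_id": ["Q1", "Q1", "Q1", "Q2", "Q2"],
        "answer": ["a1", "a2", "a3", "b1", "b2"],
        "label": [0, 1, 0, 0, 0],
    }
)

# Per-question candidate count and number of positive candidates.
summary = toy.groupby("question_id")["label"].agg(
    n_candidates="count", n_positive="sum"
)

# Overall share of positive pairs, a quick indicator of class imbalance.
positive_rate = toy["label"].mean()

print(summary)
print(f"positive rate: {positive_rate:.2f}")
```

The same calls can be applied to `train_data`. If the positive rate turns out to be low, accuracy alone may be a misleading metric, which is one reason precision, recall, and F1-score are also requested in the tasks.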


## Instructions

* Complete the three tasks below.

* Write your code inside this notebook.

* Your notebook must include the running outputs of your final code.

* **Submit this `.ipynb` file, containing your code and outputs, to iLearn.**

## Assessment

1.  Marks are based on the correctness of your code, outputs, and coding style.
<!-- 2.  A total of **1.5 marks** (0.5 per task) are awarded globally across the assignment for good coding style: clean, modular code, meaningful variable names, and good comments. -->
3.  Marks for each task focus only on the main implementation, **not on the data loading step**.
4.  If outputs are missing or incorrect, up to **25% of the marks for that task** can be deducted.
5.  See each task below for the detailed mark breakdown.

## Task 1 (4 marks): Query-Focused Text Classification Using Naive Bayes

* Preprocess the text data. Feel free to explore and use suitable preprocessing.

* Extract features using **CountVectorizer** and **TF-IDF**.

* Train and evaluate a **Naive Bayes classifier** on both feature sets.

* Report and compare accuracy, precision, recall, and F1-score.
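
As a rough illustration of the workflow above, the sketch below trains `MultinomialNB` on `CountVectorizer` and `TfidfVectorizer` features and collects the four metrics. It runs on a few hand-written toy rows rather than the real data, and joining the question and answer into a single string is only one assumed way to make the classifier query-focused; the helper names `join_pair` and `evaluate` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB

# Toy (question, answer, label) triples standing in for rows of the CSV files.
train_rows = [
    ("how are glacier caves formed", "a glacier cave is a cave formed within the ice", 1),
    ("how are glacier caves formed", "the ice facade is approximately 60 m high", 0),
    ("what is a siamese network", "a siamese network shares weights between two branches", 1),
    ("what is a siamese network", "cats are popular pets", 0),
]
test_rows = [
    ("how are caves formed", "a cave is formed within the ice of a glacier", 1),
    ("what is a network", "the facade is high", 0),
]


def join_pair(question, answer):
    """Concatenate question and answer into one string (an assumed design choice)."""
    return f"{question} {answer}"


def evaluate(vectorizer, train, test):
    """Fit Naive Bayes on features from `vectorizer` and return test metrics."""
    x_train = vectorizer.fit_transform([join_pair(q, a) for q, a, _ in train])
    x_test = vectorizer.transform([join_pair(q, a) for q, a, _ in test])
    y_train = [label for _, _, label in train]
    y_test = [label for _, _, label in test]

    predictions = MultinomialNB().fit(x_train, y_train).predict(x_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, predictions, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


results = {
    "count": evaluate(CountVectorizer(), train_rows, test_rows),
    "tfidf": evaluate(TfidfVectorizer(), train_rows, test_rows),
}
for name, metrics in results.items():
    print(name, {key: round(float(value), 3) for key, value in metrics.items()})
```

The numbers produced by this toy run carry no meaning; the point is only the shape of the comparison, with both feature sets passed through the same evaluation function so that the results are directly comparable.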

**Mark breakdown:**


* (2 marks) Correct implementation: preprocessing, feature extraction, training Naive Bayes models.

* (1.5 marks) Proper evaluation: accuracy, precision, recall, F1-score + discussion of results.

* (0.5 mark) Good coding style: clean, modular, clear variables, comments.

<!-- * (0.5 mark) Preprocessing and feature extraction.

* (1 mark) Training Naive Bayes on CountVectorizer and TF-IDF features.

* (1 mark) Evaluation on the test set with proper metrics.

* (1 mark) Brief discussion on which feature set performed better and why.

* (0.5 mark) For good coding style: clean, modular code, meaningful variable names, and good comments. -->

```python
# Write your code and answers here. You can add more code and markdown cells if needed.
```

## Task 2 (6 marks): Siamese Neural Network with Contrastive Loss (PyTorch)

This task involves two stages: first learning sentence embeddings using contrastive loss, and then using these embeddings for classification.

### Task 2a: Learning Embeddings with Contrastive Loss

* Preprocess question-answer pairs (e.g., TF-IDF or embeddings).

* Implement a Siamese Network in PyTorch:
    * The network should take the preprocessed question and answer representations as input.
  
    * Each branch of the Siamese network should contain two hidden layers with ReLU activation (hidden layer size chosen from {64, 128, 256}).

    * Use a Euclidean-distance-based contrastive loss with a margin of m = 1.
  
    * The network should output an embedding vector (the output of the second hidden layer) for the question and the answer.

* Train the model and evaluate on the test set.

*Note: Save the best-performing model so that it can be reused in Task 2b.*
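
The pieces of Task 2a can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the label convention (1 = matching pair, 0 = non-matching pair), the unscaled form of the loss, the input dimension of 20, and the names `SiameseBranch` and `contrastive_loss` are all choices made for this example, and the inputs are random stand-ins for real question and answer features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiameseBranch(nn.Module):
    """Shared encoder: two hidden layers with ReLU; the second layer's output is the embedding."""

    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)


def contrastive_loss(emb_q, emb_a, label, margin=1.0):
    """Assumed convention: label 1 = matching pair, label 0 = non-matching pair.

    loss = label * d^2 + (1 - label) * max(0, margin - d)^2, averaged over the
    batch, where d is the Euclidean distance between the two embeddings.
    """
    distance = F.pairwise_distance(emb_q, emb_a)
    positive_term = label * distance.pow(2)
    negative_term = (1 - label) * torch.clamp(margin - distance, min=0).pow(2)
    return (positive_term + negative_term).mean()


torch.manual_seed(0)
branch = SiameseBranch(input_dim=20, hidden_dim=64)
questions = torch.rand(4, 20)  # stand-in for question feature vectors
answers = torch.rand(4, 20)    # stand-in for answer feature vectors
labels = torch.tensor([1.0, 0.0, 0.0, 1.0])

# Both inputs go through the same branch, so the weights are shared.
loss = contrastive_loss(branch(questions), branch(answers), labels)
loss.backward()
print(loss.item())
```

Matching pairs are pulled together by the squared-distance term, while non-matching pairs are pushed apart only until they are at least `margin` away. One way to keep the best model for Task 2b is to save `branch.state_dict()` with `torch.save` whenever the validation loss improves.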

### Task 2b: Classification using Learned Embeddings

* Load the weights of the best-performing Siamese network model saved from Task 2a. Freeze the weights of the shared Siamese branches (i.e., the hidden layers) so they are not updated during this stage.

* Build a classifier head in PyTorch:
    * Pass the question and answer representations through their respective frozen branches to obtain their learned embeddings from Task 2a.

    * Calculate the Euclidean distance between the question embedding and the answer embedding.

    * Add a final classification output layer: pass the calculated distance through a simple trainable layer (e.g., a fully connected `nn.Linear` layer with 1 output unit) followed by a sigmoid activation function. This outputs a value between 0 and 1, representing the predicted probability that the pair is related.

* Train the model with Binary Cross-Entropy (BCE) loss and evaluate on the test set.

* Report the accuracy and provide at least one failure case analysis, with supporting code output.
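
The steps of Task 2b can be sketched as below. This is an illustrative outline only: the branch here has random weights as a stand-in (in the assignment, its weights would be restored from the Task 2a checkpoint via `load_state_dict`), the class name `DistanceClassifier` is hypothetical, and the 0.5 decision threshold, learning rate, and input dimension are assumed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistanceClassifier(nn.Module):
    """Frozen Siamese branch followed by a trainable 1-unit layer on the distance."""

    def __init__(self, branch):
        super().__init__()
        self.branch = branch
        for param in self.branch.parameters():
            param.requires_grad = False  # freeze the pre-trained branch
        self.head = nn.Linear(1, 1)      # the only trainable layer

    def forward(self, question, answer):
        distance = F.pairwise_distance(self.branch(question), self.branch(answer))
        return torch.sigmoid(self.head(distance.unsqueeze(1))).squeeze(1)


torch.manual_seed(0)
# Stand-in branch with random weights (assumption for this sketch).
branch = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
model = DistanceClassifier(branch)

criterion = nn.BCELoss()
# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-2)

questions, answers = torch.rand(4, 20), torch.rand(4, 20)
labels = torch.tensor([1.0, 0.0, 0.0, 1.0])

# One training step on the toy batch.
probabilities = model(questions, answers)
loss = criterion(probabilities, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Hard predictions, using an assumed threshold of 0.5.
predictions = (probabilities >= 0.5).long()
print(loss.item(), predictions.tolist())
```

Because the head sees only a single scalar distance per pair, it effectively learns a threshold on that distance. For the failure case analysis, comparing `predictions` with `labels` and printing the corresponding question and answer text for a misclassified pair would provide the supporting output.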

**Mark breakdown:**

* (3 marks) Correct implementation: Siamese NN architecture, contrastive loss, classification head setup.

* (2.5 marks) Proper evaluation: training/evaluation correctness, metric reporting, failure case analysis.

* (0.5 mark) Good coding style: clean, modular code, meaningful variable names, and good comments.

<!-- * (1 mark) Correct Siamese NN architecture and contrastive loss.

* (1 mark) SNN training setup and data feeding.

* (1 mark) Correctly loading the pre-trained model, freezing the appropriate layers, and constructing the classification architecture.

* (1 mark) Correct training/evaluation setup using Binary Cross-Entropy loss.

* (0.5 mark) Proper evaluation and accuracy reporting.

* (1 mark) Example of a failure case, possible reason, and suggested improvement.

* (0.5 mark) For good coding style: clean, modular code, meaningful variable names, and good comments. -->

```python
# Write your code and answers here. You can add more code and markdown cells if needed.
```

## Task 3 (10 marks): Transformer-Based Sentence Classification (PyTorch)

* Format each input as `question [SEP] answer` and pad it to a fixed length (justify your choice of length).

* Use a suitable tokenizer (justify your choice).

* Build a Transformer model in PyTorch:

    * Embedding layer (size 128) + positional embeddings.

    * One Transformer encoder layer (hidden dim in {64, 128, 256}, 4 attention heads).

    * One hidden layer (256 units, ReLU).

    * Use a suitable final layer for classification.
    
  
* Apply Global Average Pooling to the output sequence of the Transformer encoder layer.
  
* Use an appropriate loss function (e.g., CrossEntropyLoss).

* Train and evaluate on the test split.

* Report best accuracy, precision, recall, F1-score, and discuss a failure case, with supporting code output.
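
One possible shape for the architecture listed above is sketched here. Several details are assumptions made for the example rather than requirements: the "hidden dim" is interpreted as the encoder's feed-forward size (`dim_feedforward`), padding uses token id 0, the average pooling ignores padded positions (a plain `x.mean(dim=1)` would include them), and the class name `PairTransformer`, vocabulary size, and sequence length are hypothetical. The input is a batch of random token ids standing in for tokenized `question [SEP] answer` sequences.

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumed padding token id


class PairTransformer(nn.Module):
    """Embedding + positions -> one encoder layer -> average pooling -> MLP classifier."""

    def __init__(self, vocab_size, max_len, emb_dim=128, ff_dim=128, n_heads=4, n_classes=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=PAD_ID)
        self.pos_emb = nn.Embedding(max_len, emb_dim)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=emb_dim, nhead=n_heads, dim_feedforward=ff_dim, batch_first=True
        )
        self.classifier = nn.Sequential(
            nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, n_classes)
        )

    def forward(self, token_ids):
        pad_mask = token_ids == PAD_ID  # True where the position is padding
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        x = self.encoder(x, src_key_padding_mask=pad_mask)

        # Global average pooling over real (non-padded) tokens only.
        keep = (~pad_mask).unsqueeze(-1).float()
        pooled = (x * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)


torch.manual_seed(0)
model = PairTransformer(vocab_size=1000, max_len=32)

batch = torch.randint(1, 1000, (2, 32))  # random ids standing in for tokenized pairs
batch[0, 20:] = PAD_ID                   # first example is padded after 20 tokens
targets = torch.tensor([1, 0])

logits = model(batch)                    # shape: (batch_size, n_classes)
loss = nn.CrossEntropyLoss()(logits, targets)
loss.backward()
print(logits.shape, loss.item())
```

With two output logits the model pairs naturally with `CrossEntropyLoss`; a single output unit with `BCEWithLogitsLoss` would be an equally reasonable final layer for this binary task.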

**Mark breakdown:**

* (5 marks) Correct implementation: input preparation, tokenizer, transformer model, training setup.

* (4.5 marks) Proper evaluation: metric reporting, failure case analysis with discussion.

* (0.5 mark) Good coding style: clean, modular code, meaningful variable names, and good comments.

<!-- * (1.5 marks) Correct input preparation and tokenizer choice (with justification).

* (2 marks) Transformer architecture implementation.

* (2 marks) Training setup, loss function, and optimizer.

* (2 marks) Evaluation and correct metric reporting.

* (2 marks) Failure case analysis and suggestions.

* (0.5 mark) For good coding style: clean, modular code, meaningful variable names, and good comments. -->


```python
# Write your code and answers here. You can add more code and markdown cells if needed.
```

## Submission

Your submission should consist of this Jupyter notebook with all your code and explanations inserted into the notebook as text cells. **The notebook should contain the output of the runs. All code should run. Code with syntax errors or code without output will not be assessed.**

**Do not submit multiple files.**

Examine the text cells of this notebook to get an idea of how to format text for good visual impact. You can also read this useful [guide to the Markdown notation](https://daringfireball.net/projects/markdown/syntax), which explains the format of the text cells.

### Marking Rubric

| Criteria                          | Unsatisfactory | Pass           | Credit         | Distinction     |
|----------------------------------|----------------|----------------|----------------|-----------------|
| **Task 1 – Correctness**         | 0 points       | 1 point        | 1.5 points     | 2 points        |
| **Task 1 – Evaluation & Discussion** | 0 points   | 0.75 points    | 1 point        | 1.5 points      |
| **Task 1 – Code Readability**    | 0 points       | 0.25 points    | 0.4 points     | 0.5 points      |
| **Task 2 – Correctness**         | 0 points       | 1.5 points     | 2.5 points     | 3 points        |
| **Task 2 – Evaluation & Analysis** | 0 points     | 1.25 points    | 2 points       | 2.5 points      |
| **Task 2 – Code Readability**    | 0 points       | 0.25 points    | 0.4 points     | 0.5 points      |
| **Task 3 – Correctness**         | 0 points       | 2.5 points     | 4 points       | 5 points        |
| **Task 3 – Evaluation & Analysis** | 0 points     | 2.25 points    | 3.5 points     | 4.5 points      |
| **Task 3 – Code Readability**    | 0 points       | 0.25 points    | 0.4 points     | 0.5 points      |


### Assessment Criteria Description

The following aspects will be considered when marking each task. The total score is based on the level of achievement across these dimensions.

#### Correctness
How well the main functionality and requirements of the task are implemented.

- **Unsatisfactory** – Major components are missing or incorrect.
- **Pass** – Some core components are correctly implemented.
- **Credit** – Most components are correctly implemented with minor issues.
- **Distinction** – All required components are correctly and completely implemented.

#### Evaluation & Analysis (where applicable)
The quality of evaluation metrics, observations, and insights into the model’s performance.

- **Unsatisfactory** – Minimal or no evaluation and discussion.
- **Pass** – Basic evaluation is provided, but analysis is shallow.
- **Credit** – Good evaluation with meaningful discussion.
- **Distinction** – In-depth, insightful analysis and thoughtful observations.

#### Code Readability
Clarity, structure, and quality of code writing style.

- **Unsatisfactory** – Code is difficult to read, poorly structured, and lacks clarity (e.g., meaningless variable names, no comments).
- **Pass** – Code is generally readable with some good practices.
- **Credit** – Code is clearly readable and mostly well-structured.
- **Distinction** – Code is clean, well-organized, and easy to follow; shows excellent style and best practices.
