# 2025 CITS4012 Group XX Assignment

*Make sure you rename the file to include your student ID.*

# 0. Readme
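This is the Project 1 template, reused here as-is.

Focus on code style and visualization so that the notebook is easy for the marker to grade; the marks should not be low if that is done well.
* For example, reuse the tech stack from the labs and cite foundational papers from the literature; markers are naturally happier reviewing techniques they already know, e.g., WordCloud visualization.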

*If there is something to be noted for the marker, please mention here.*

*If you are planning to implement a program in an object-oriented programming style, please put that code at the bottom of this ipynb file.*

## 0.1 File Structure

```
/content/drive/MyDrive/path_to_your_folder
├── CITS4012_YourGroupID.ipynb
├── sample
├── sample
├── sample
├── sample
├── sample
├── sample
├── test.json
├── train.json
└── validation.json
```

## 0.2 Setup

### Step 1 - Check current Python environment

In [None]:
import sys
print(sys.executable)

### Step 2 - Install dependencies (2 runs, about 1 min each)

> NOTE:
>
> In Google Colab, an ERROR may occur because the latest versions of `numpy` and `scipy` are incompatible.
>
> Simply **restart the runtime** to use the newly downgraded versions.
>
> (In other words, run this code snippet twice to make sure all requirements are satisfied.)

In [None]:
# %pip install pandas
%pip install rich
%pip install gensim
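
Once the installs finish (and the runtime has been restarted if needed), it can help to confirm which versions were actually resolved. The following is a minimal sketch; the helper name `installed_versions` and the package list are illustrative assumptions, not part of the template.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The package list is an assumption based on the install cell and the note above.
for name, ver in installed_versions(["numpy", "scipy", "gensim", "rich"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If the printed `numpy` or `scipy` version still looks like the pre-install one, the runtime has probably not been restarted yet.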

### Step 3 - Mount source files

This step assumes you are running in a Google Colab environment.

In [None]:
# Define your Google Drive folder path
my_folder = "MyDrive/Colab Notebooks/path_to_your_folder"

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Path setups
import os
folder_path = "/content/drive/" + my_folder

test_file = os.path.join(folder_path, "test.json")
train_file = os.path.join(folder_path, "train.json")
val_file = os.path.join(folder_path, "validation.json")
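
The cell above only works inside Colab. For running the notebook elsewhere, one possible approach is to detect the environment and fall back to a local directory. This is a sketch under assumptions: `in_colab`, `resolve_folder`, and the `local_dir` default are hypothetical names, and it assumes the JSON files sit in that local directory. Inside Colab, `drive.mount` still has to be called as shown above.

```python
import os


def in_colab():
    """Return True when the google.colab module can be imported."""
    try:
        import google.colab  # noqa: F401  (only importable inside Colab)
        return True
    except ImportError:
        return False


def resolve_folder(my_folder, local_dir="."):
    """Use the Drive path in Colab, otherwise a local directory (assumed layout)."""
    if in_colab():
        return os.path.join("/content/drive", my_folder)
    return os.path.abspath(local_dir)


folder_path = resolve_folder("MyDrive/Colab Notebooks/path_to_your_folder")
print(folder_path)
```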

# 1. Overview

We implemented three substantially different model architectures:

* [The vanilla RNN-based encoder-decoder](#scrollTo=67TQJgOJ_lF1)

* [The Bi-LSTM encoder-decoder with XXX-attention in different positions](#scrollTo=BhSE5ON4_r0C)

* [The vanilla Transformer with self-attention (no RNN)](#scrollTo=E4w_n2P2_xxm)

# 2. Dataset Processing

## 2.1 Load JSON files



In [None]:
import pandas as pd

train_df = pd.read_json(train_file)
val_df = pd.read_json(val_file)
test_df = pd.read_json(test_file)

source_train_df = train_df.copy()
source_val_df = val_df.copy()
source_test_df = test_df.copy()

# sneak peek
pd.set_option("display.max_colwidth", 30)

print(train_df.head())
# print(val_df.head())
# print(test_df.head())

print("Train size:", len(train_df))
print("Val size:", len(val_df))
print("Test size:", len(test_df))

pd.reset_option("display.max_colwidth")
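
Beyond printing the head and the sizes, a short structural check can catch a wrong file or a missing column early. The sketch below assumes each split has `premise`, `hypothesis`, and `label` columns (the names used later in this notebook); `summarize` and the toy rows are made up for illustration.

```python
import pandas as pd

EXPECTED_COLUMNS = ["premise", "hypothesis", "label"]  # assumed schema


def summarize(df, name):
    """Check that the assumed columns exist and report size, nulls, and label counts."""
    missing_cols = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing_cols:
        raise ValueError(f"{name}: missing columns {missing_cols}")
    return {
        "name": name,
        "rows": len(df),
        "nulls": int(df[EXPECTED_COLUMNS].isna().sum().sum()),
        "labels": df["label"].value_counts().to_dict(),
    }


# Toy frame standing in for train_df; the rows are invented for this example.
toy_df = pd.DataFrame({
    "premise": ["Plants make food by photosynthesis.", "Ice is frozen water."],
    "hypothesis": ["Photosynthesis produces food.", "Ice is a gas."],
    "label": ["entails", "neutral"],
})
print(summarize(toy_df, "toy"))
```

In the notebook itself, the same function could be called on `train_df`, `val_df`, and `test_df`.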

## 2.2 Data cleansing

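In principle, only the training set is known to us, so the data-cleansing code should not be designed around characteristics of the validation or test sets.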

Assumption
* The value of `label` is binary, either "entails" or "neutral".
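
The binary-label assumption can be checked rather than taken on faith. A minimal sketch, assuming the column is called `label`; `unexpected_labels` and the toy frame are hypothetical and only illustrate the check.

```python
import pandas as pd

ALLOWED_LABELS = {"entails", "neutral"}  # taken from the assumption above


def unexpected_labels(df, col="label"):
    """Return the set of label values outside the two assumed classes."""
    return set(df[col].dropna().unique()) - ALLOWED_LABELS


# Toy frame with one deliberately malformed value.
toy_df = pd.DataFrame({"label": ["entails", "neutral", "Entails"]})
print(unexpected_labels(toy_df))
```

An empty result on the training split would support the assumption; any other value points at rows that need a closer look.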

Compromises
* Ignore syntactic errors and semantic errors.
* Apply the same cleansing rules to both `premise` and `hypothesis`.

Handled Issues

| No. | Description | Examples |
|----------|-------------|----------|
| Issue 1  | Long non-linguistic separators | train premise 1, 382, 697, ... |
| Issue 2  | HTML/XML tags and ID-like patterns | train premise 78, 270, 319, ... |
| Issue 3  | Consecutive duplicated words | train premise 78, 87, 564, ... |
| Issue 4  | Repeated whitespace | train premise 123, 193, 259, ... |
| Issue 5  | Spaces before `.`, `,`, `;`, `:` (spaces before '!' and '?' are kept) | train premise 3, 333, 6280 |

Kept Noise

| No. | Description | Noise | Non-Noise |
|----------|-------------|----------|--------------|
| Noise 1  | Instructional prompt words | train premise 3, 61, 319, ... | train premise 16, 24, 61, ... |
| Noise 2  | Concatenated sentences | train premise 270, 537, 608, ... |  |
| Noise 3  | Numbered markers | train premise 270, 537, 608, ... | train premise 1546, 2068, ... |
| Noise 4  | Misplaced `label` values | train premise 270, 537, 606, ... | train premise 1683, 2068, ... |

In [None]:
import re
from difflib import SequenceMatcher
from IPython.display import display, HTML

def cleanse(df):
  df = df.copy()

  ID_PATTERN = r"\b[a-z]?(?:\d{6,}|[a-z0-9]{8,})(?:-[a-z0-9]{2,})+\b"
  REPEAT_PATTERN = r"\b((?:\w+\s+){0,2}\w+)( \1\b)+"

  for col in ["premise", "hypothesis"]:
    df[col] = df[col].str.lower()   # lowercase

    # issue 1: non-linguistic long separators
    df[col] = df[col].apply(lambda x: re.sub(r"[-=*_~$]{3,}", " ", x))

    # issue 2: HTML/XML tags with ID pattern
    df[col] = df[col].apply(lambda x: re.sub(r"<[^>]*>", " ", x))
    df[col] = df[col].apply(lambda x: re.sub(ID_PATTERN, "[id]", x))

    # issue 3: consecutive duplicated words
    df[col] = df[col].apply(lambda x: re.sub(REPEAT_PATTERN, r"\1", x))

    # issue 4: duplicate whitespaces
    df[col] = df[col].apply(lambda x: re.sub(r"\s+", " ", x).strip())

    # issue 5: spaces before . , ; : (spaces before ! and ? are kept)
    df[col] = df[col].apply(lambda x: re.sub(r"\s+([.,;:])", r"\1", x))

  return df

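def verify_cleanse(source_df, cleanse_df, issue, col, index, kept=False):
  """Print one row before and after cleansing; kept=True marks text left unchanged on purpose."""
  status = "Kept " if kept else "Fixed"
  # .loc avoids chained indexing and looks the row up by its index label
  print(f"Issue {issue}:\t{source_df.loc[index, col]}")
  print(f"{status} {issue}:\t{cleanse_df.loc[index, col]}")
  print()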

# Cleansing
train_df = cleanse(train_df)
val_df = cleanse(val_df)
test_df = cleanse(test_df)

# Verification
verify_cleanse(source_train_df, train_df, issue=1, col="premise", index=1)
verify_cleanse(source_train_df, train_df, issue=2, col="premise", index=319)
verify_cleanse(source_train_df, train_df, issue=3, col="premise", index=87)
verify_cleanse(source_train_df, train_df, issue=4, col="premise", index=123)
verify_cleanse(source_train_df, train_df, issue=5, col="premise", index=3)
verify_cleanse(source_train_df, train_df, issue=5, col="premise", index=6280, kept=True)
verify_cleanse(source_train_df, train_df, issue=5, col="premise", index=333, kept=True)

verify_cleanse(source_train_df, train_df, issue="Hybrid", col="premise", index=270)
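
To see how the rules interact without loading the dataset, the same steps can be applied to a single string. This is a sketch: `cleanse_text` is a hypothetical per-string counterpart of `cleanse`, and the sample sentences are invented, not taken from the training data.

```python
import re

# Patterns mirror the ones used in cleanse() above.
ID_PATTERN = r"\b[a-z]?(?:\d{6,}|[a-z0-9]{8,})(?:-[a-z0-9]{2,})+\b"
REPEAT_PATTERN = r"\b((?:\w+\s+){0,2}\w+)( \1\b)+"


def cleanse_text(text):
    """Apply the five cleansing rules, in the same order, to one string."""
    text = text.lower()
    text = re.sub(r"[-=*_~$]{3,}", " ", text)    # issue 1: long separators
    text = re.sub(r"<[^>]*>", " ", text)         # issue 2: HTML/XML tags
    text = re.sub(ID_PATTERN, "[id]", text)      # issue 2: ID-like patterns
    text = re.sub(REPEAT_PATTERN, r"\1", text)   # issue 3: repeated words
    text = re.sub(r"\s+", " ", text).strip()     # issue 4: repeated whitespace
    text = re.sub(r"\s+([.,;:])", r"\1", text)   # issue 5: space before . , ; :
    return text


# Invented sample that triggers issues 1 to 5 at once.
print(cleanse_text("The <b>cell</b> is is the   basic unit ----- of life ."))
# the cell is the basic unit of life.

# A space before '?' is deliberately left alone.
print(cleanse_text("Is it true ?"))
# is it true ?
```

The order matters: separators and tags are replaced with spaces first, so the whitespace rule has to run after them to collapse the gaps they leave behind.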

## 2.3 Normalization

## 2.4 Tokenization

# 3. Word Embedding Construction
(You can add as many code blocks and text blocks as you need. However, YOU SHOULD NOT MODIFY the section title)

# 4. Model Implementation

## 4.1 The vanilla RNN-based encoder-decoder

## 4.2 The Bi-LSTM encoder-decoder with XXX-attention in different positions

## 4.3 The vanilla Transformer with self-attention

# 5. Performance Evaluation
(You can add as many code blocks and text blocks as you need. However, YOU SHOULD NOT MODIFY the section title)

# 6. Interactive Inference Colab Form
(You can add as many code blocks and text blocks as you need. However, YOU SHOULD NOT MODIFY the section title)