# Data Preprocessing for AQUA-RAT Dataset

## Introduction
In this notebook, we will preprocess the AQUA-RAT dataset to prepare it for training our student-teacher model. The preprocessing steps include loading the data, cleaning, tokenizing, and preparing the dataset for model training.

In [2]:
import json

import numpy as np
import pandas as pd
import sentencepiece  # not called directly; the T5Tokenizer used below needs it installed
from transformers import T5Tokenizer

In [3]:
TRAIN_DATA_PATH = "../../data/AQuA/train.json"
TEST_DATA_PATH = "../../data/AQuA/test.json"
VAL_DATA_PATH = "../../data/AQuA/dev.json"

### Load Data
- Train set = `train.json`
- Test set = `test.json`
- Validation set = `dev.json`

In [4]:
# Each file is read as JSON Lines: one JSON object per line.
train_data_raw = []
test_data_raw = []
val_data_rev = []
with open(TRAIN_DATA_PATH, "r", encoding="utf-8") as file:
    for line in file:
        train_data_raw.append(json.loads(line))
with open(TEST_DATA_PATH, "r", encoding="utf-8") as file:
    for line in file:
        test_data_raw.append(json.loads(line))
with open(VAL_DATA_PATH, "r", encoding="utf-8") as file:
    for line in file:
        val_data_rev.append(json.loads(line))
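
The three loading loops follow the same pattern, so they could be folded into a single helper. The sketch below is one way to do that, assuming each file is in JSON Lines format and every record carries the four fields seen in this dataset (`question`, `options`, `rationale`, `correct`). `load_jsonl` and `REQUIRED_KEYS` are hypothetical names, not part of the original notebook.

```python
import json

# Assumed schema, based on the records printed in this notebook.
REQUIRED_KEYS = {"question", "options", "rationale", "correct"}

def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate empty lines, e.g. a trailing newline
            record = json.loads(line)
            missing = REQUIRED_KEYS - set(record)
            if missing:
                raise ValueError(
                    f"{path}:{line_number} is missing keys: {sorted(missing)}"
                )
            records.append(record)
    return records
```

With this helper, a call such as `load_jsonl(TRAIN_DATA_PATH)` would return the same list of dicts as the loop above, while failing early on a malformed record.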

### Inspect Data
Take a look at the first 10 training records.

In [5]:
for element in train_data_raw[:10]:
    print(json.dumps(element, indent=4))

{
    "question": "Two friends plan to walk along a 43-km trail, starting at opposite ends of the trail at the same time. If Friend P's rate is 15% faster than Friend Q's, how many kilometers will Friend P have walked when they pass each other?",
    "options": [
        "A)21",
        "B)21.5",
        "C)22",
        "D)22.5",
        "E)23"
    ],
    "rationale": "If Q complete x kilometers, then P completes 1.15x kilometers.\nx + 1.15x = 43\n2.15x=43\nx = 43/2.15 = 20\nThen P will have have walked 1.15*20=23 km.\nThe answer is E.",
    "correct": "E"
}
{
    "question": "In the coordinate plane, points (x, 1) and (5, y) are on line k. If line k passes through the origin and has slope 1/5, then what are the values of x and y respectively?",
    "options": [
        "A)4 and 1",
        "B)1 and 5",
        "C)5 and 1",
        "D)3 and 5",
        "E)5 and 3"
    ],
    "rationale": "Line k passes through the origin and has slope 1/5 means that its equation is y=1/5*x.\nThus: (x, 

## Data Transformation

In [6]:
train_df = pd.DataFrame(train_data_raw)
test_df = pd.DataFrame(test_data_raw)
val_df = pd.DataFrame(val_data_rev)
train_df.head()

| | question | options | rationale | correct |
|---|---|---|---|---|
| 0 | Two friends plan to walk along a 43-km trail, ... | [A)21, B)21.5, C)22, D)22.5, E)23] | If Q complete x kilometers, then P completes 1... | E |
| 1 | In the coordinate plane, points (x, 1) and (5,... | [A)4 and 1, B)1 and 5, C)5 and 1, D)3 and 5, E... | Line k passes through the origin and has slope... | C |
| 2 | For all numbers p and q, the operation @ is de... | [A)II, B)I and II, C)I and III, D)II and III, ... | p@q = p^2 - pq=p(p-q).... so p@q will be zero ... | B |
| 3 | Carl is facing very difficult financial times ... | [A)$1600, B)$2000, C)$2150, D)$2500, E)$12000] | Usually, you are given the annual rate of inte... | A |
| 4 | The speed at which a man can row a boat in sti... | [A)18 seconds, B)27 seconds, C)26 seconds, D)1... | Speed of the boat downstream = 25 +11\n= 36 km... | E |
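
The introduction lists cleaning as a preprocessing step, but no cleaning code appears in this section. As an illustration only, the sketch below shows one possible cleaning pass that could be applied to the raw record lists before they are turned into DataFrames: trimming surrounding whitespace and dropping records whose question text repeats. Both rules and the `clean_records` name are assumptions, not steps taken from the original notebook.

```python
def clean_records(records):
    """Strip surrounding whitespace and drop records with a repeated question."""
    seen_questions = set()
    cleaned = []
    for record in records:
        question = record["question"].strip()
        if question in seen_questions:
            continue  # keep only the first occurrence of each question
        seen_questions.add(question)
        cleaned.append({
            "question": question,
            "options": [option.strip() for option in record["options"]],
            "rationale": record["rationale"].strip(),
            "correct": record["correct"].strip(),
        })
    return cleaned
```

Whether duplicates should actually be removed depends on the training setup, so this rule is best treated as a starting point to adjust.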


### Tokenization
- For the student, we will use the T5 tokenizer (`t5-small`) to tokenize the questions, options, and rationales.
- For the teacher, we will use ????
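
Each option string in the records printed earlier already begins with its own label (`A)`, `B)`, ...), and `correct` holds one of those labels. Before building tokenizer inputs, a small check like the one below could confirm that every record follows this pattern. It is only a sketch: `split_option` and `correct_option_text` are hypothetical helpers, and the `LETTER)text` pattern is assumed from the printed examples rather than from a dataset specification.

```python
import re

# Assumed option format: one capital letter A-E, a closing parenthesis, then the text.
OPTION_PATTERN = re.compile(r"^([A-E])\)(.*)$", re.DOTALL)

def split_option(option):
    """Split an option such as 'B)21.5' into its label and text."""
    match = OPTION_PATTERN.match(option)
    if match is None:
        raise ValueError(f"Unexpected option format: {option!r}")
    label, text = match.groups()
    return label, text.strip()

def correct_option_text(record):
    """Return the text of the option whose label matches record['correct']."""
    options = dict(split_option(option) for option in record["options"])
    return options[record["correct"]]

example = {
    "options": ["A)21", "B)21.5", "C)22", "D)22.5", "E)23"],
    "correct": "E",
}
print(correct_option_text(example))  # prints 23
```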

In [7]:
tokenizer = T5Tokenizer.from_pretrained('t5-small')

def tokenize_record(row):
    # Options already carry their labels (e.g. "A)21"), so join them as-is
    # rather than prepending a second "(A)" label to each one.
    question_text = f"Question: {row['question']} Options: {' '.join(row['options'])}"
    rationale_text = f"Rationale: {row['rationale']}"

    # With no explicit max_length, padding and truncation use the tokenizer's model_max_length.
    question_encoding = tokenizer(question_text, return_tensors='pt', padding='max_length', truncation=True)
    rationale_encoding = tokenizer(rationale_text, return_tensors='pt', padding='max_length', truncation=True)

    return pd.Series({
        # squeeze(0) drops the batch dimension added by return_tensors='pt'
        'input_ids': question_encoding['input_ids'].squeeze(0),
        'attention_mask': question_encoding['attention_mask'].squeeze(0),
        'rationale_ids': rationale_encoding['input_ids'].squeeze(0),
        'rationale_attention_mask': rationale_encoding['attention_mask'].squeeze(0),
        # Map the answer letter to a 0-based index: 'A' -> 0, ..., 'E' -> 4
        'correct_index': ord(row['correct']) - ord('A')
    })

# Only the training set is tokenized in this cell.
tokenized_df = train_df.apply(tokenize_record, axis=1)
train_df = pd.concat([train_df, tokenized_df], axis=1)
train_df.head()

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


| | question | options | rationale | correct | input_ids | attention_mask | rationale_ids | rationale_attention_mask | correct_index |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Two friends plan to walk along a 43-km trail, ... | [A)21, B)21.5, C)22, D)22.5, E)23] | If Q complete x kilometers, then P completes 1... | E | [tensor(11860), tensor(10), tensor(2759), tens... | [tensor(1), tensor(1), tensor(1), tensor(1), t... | [tensor(6455), tensor(6318), tensor(15), tenso... | [tensor(1), tensor(1), tensor(1), tensor(1), t... | 4 |
| 1 | In the coordinate plane, points (x, 1) and (5,... | [A)4 and 1, B)1 and 5, C)5 and 1, D)3 and 5, E... | Line k passes through the origin and has slope... | C | [tensor(11860), tensor(10), tensor(86), tensor... | [tensor(1), tensor(1), tensor(1), tensor(1), t... | [tensor(6455), tensor(6318), tensor(15), tenso... | [tensor(1), tensor(1), tensor(1), tensor(1), t... | 2 |
| 2 | For all numbers p and q, the operation @ is de... | [A)II, B)I and II, C)I and III, D)II and III, ... | p@q = p^2 - pq=p(p-q).... so p@q will be zero ... | B | [tensor(11860), tensor(10), tensor(242), tenso... | [tensor(1), tensor(1), tensor(1), tensor(1), t... | [tensor(6455), tensor(6318), tensor(15), tenso... | [tensor(1), tensor(1), tensor(1), tensor(1), t... | 1 |
| 3 | Carl is facing very difficult financial times ... | [A)$1600, B)$2000, C)$2150, D)$2500, E)$12000] | Usually, you are given the annual rate of inte... | A | [tensor(11860), tensor(10), tensor(7291), tens... | [tensor(1), tensor(1), tensor(1), tensor(1), t... | [tensor(6455), tensor(6318), tensor(15), tenso... | [tensor(1), tensor(1), tensor(1), tensor(1), t... | 0 |
| 4 | The speed at which a man can row a boat in sti... | [A)18 seconds, B)27 seconds, C)26 seconds, D)1... | Speed of the boat downstream = 25 +11\n= 36 km... | E | [tensor(11860), tensor(10), tensor(37), tensor... | [tensor(1), tensor(1), tensor(1), tensor(1), t... | [tensor(6455), tensor(6318), tensor(15), tenso... | [tensor(1), tensor(1), tensor(1), tensor(1), t... | 4 |
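
The `rationale_ids` column is padded to a fixed length, so if it is later used as a training target, the padded positions would normally be excluded from the loss. A common convention with Hugging Face seq2seq models is to replace padded label positions with `-100`, which PyTorch's cross-entropy loss ignores by default. The sketch below shows that step on plain Python lists. `mask_padding` is a hypothetical helper, and the token ids in the example are illustrative only.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Replace label ids at padded positions (mask == 0) with ignore_index."""
    if len(label_ids) != len(attention_mask):
        raise ValueError("label_ids and attention_mask must have the same length")
    return [
        token_id if mask == 1 else ignore_index
        for token_id, mask in zip(label_ids, attention_mask)
    ]

# Illustrative ids: four real tokens followed by two padding positions.
rationale_ids = [6455, 6318, 15, 1, 0, 0]
rationale_attention_mask = [1, 1, 1, 1, 0, 0]
print(mask_padding(rationale_ids, rationale_attention_mask))
# prints [6455, 6318, 15, 1, -100, -100]
```

Masking by attention mask rather than by padding id keeps the helper independent of which id a given tokenizer uses for padding.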
