# ✂️ Tokenisation for Transformer-Based AES

This notebook prepares the cleaned ASAP dataset for transformer-based models by tokenising each essay with the RoBERTa tokenizer. It covers:

- Loading the normalised, cleaned dataset
- Initialising the RoBERTa tokenizer
- Applying padding and truncation to a fixed length of 512 tokens
- Splitting the tokenised data into train, validation and test sets
- Saving a tokenised, Hugging Face-compatible dataset to disk

Tokenisation aligns with the preprocessing strategy described in Section 4.2 of the dissertation.


In [13]:
import pandas as pd
from transformers import RobertaTokenizerFast
from datasets import Dataset


In [14]:
# Load the cleaned and normalised dataset
df = pd.read_csv("../data/processed/asap_cleaned.csv")

# Preview the first rows
df[['essay_id', 'essay_set', 'essay', 'score_scaled']].head()


,essay_id,essay_set,essay,score_scaled
0,1,1,"Dear local newspaper, I think effects computer...",0.6
1,2,1,"Dear @CAPS1 @CAPS2, I believe that using compu...",0.7
2,3,1,"Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...",0.5
3,4,1,"Dear Local Newspaper, @CAPS1 I have found that...",0.8
4,5,1,"Dear @LOCATION1, I know having computers has a...",0.6


In [15]:
# Load the RoBERTa tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# Maximum sequence length in tokens, including the <s> and </s> special tokens
# (512 is the limit for roberta-base)
MAX_LENGTH = 512
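
To make the effect of `MAX_LENGTH` concrete, the sketch below mimics in plain Python what padding to `max_length` with truncation does to a single sequence of token ids. It is an illustration, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, and the special-token ids (`<s>` = 0, `<pad>` = 1, `</s>` = 2) are assumed to be those of the `roberta-base` vocabulary.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # assumed ids of <s>, <pad>, </s> in roberta-base


def pad_or_truncate(content_ids, max_length=512):
    """Hypothetical helper: wrap, truncate and right-pad one sequence."""
    # Reserve two positions for the <s> and </s> special tokens.
    kept = content_ids[: max_length - 2]
    input_ids = [BOS_ID] + kept + [EOS_ID]
    attention_mask = [1] * len(input_ids)
    # Right-pad so every sequence has exactly max_length positions.
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


short_seq = pad_or_truncate([100, 200, 300], max_length=8)
long_seq = pad_or_truncate(list(range(10, 30)), max_length=8)
print(short_seq["input_ids"])       # [0, 100, 200, 300, 2, 1, 1, 1]
print(short_seq["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
print(long_seq["input_ids"])        # [0, 10, 11, 12, 13, 14, 15, 2]
```

Short essays are padded with positions the model ignores (mask 0), while essays longer than the limit lose everything after the cut-off.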


In [16]:
# Convert the DataFrame to a Hugging Face Dataset, keeping only the needed columns
dataset = Dataset.from_pandas(df[['essay', 'score_scaled']])

# Tokenisation function; with batched=True, batch['essay'] is a list of essay strings
def tokenize_function(batch):
    return tokenizer(
        batch['essay'],
        padding="max_length",  # pad every essay to MAX_LENGTH tokens
        truncation=True,       # cut essays longer than MAX_LENGTH
        max_length=MAX_LENGTH,
    )

# Apply the tokenizer across the dataset in batches
tokenised_dataset = dataset.map(tokenize_function, batched=True)


Map:   0%|          | 0/12064 [00:00<?, ? examples/s]
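
Because `map` is called with `batched=True`, the tokenisation function receives a dictionary of column lists, not a single row, so one tokenizer call can process many essays at once. The plain-Python sketch below imitates that calling convention. `batched_map` and `fake_tokenize` are hypothetical stand-ins, not part of the `datasets` API, and the whitespace split is only a placeholder for real subword tokenisation.

```python
def batched_map(columns, fn, batch_size=1000):
    """Hypothetical stand-in for Dataset.map(fn, batched=True)."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch maps a column name to a list of up to batch_size values.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Columns returned by fn are added alongside the original ones.
    return {**columns, **new_columns}


def fake_tokenize(batch):
    # Placeholder: count whitespace-separated words instead of subword tokens.
    return {"n_words": [len(text.split()) for text in batch["essay"]]}


data = {
    "essay": ["Dear local newspaper", "Computers help", "I agree"],
    "score_scaled": [0.6, 0.7, 0.5],
}
result = batched_map(data, fake_tokenize, batch_size=2)
print(result["n_words"])  # [3, 2, 2]
```

The same pattern explains the feature list after tokenisation: the original columns are kept, and the columns returned by the function (`input_ids`, `attention_mask`) are appended.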

In [17]:
from datasets import DatasetDict

# Split into 80% train, 10% val, 10% test
split_dataset = tokenised_dataset.train_test_split(test_size=0.2, seed=42)
val_test_split = split_dataset['test'].train_test_split(test_size=0.5, seed=42)

# Combine into DatasetDict
dataset_dict = DatasetDict({
    'train': split_dataset['train'],
    'validation': val_test_split['train'],
    'test': val_test_split['test']
})

# Check sizes
print(dataset_dict)


DatasetDict({
    train: Dataset({
        features: ['essay', 'score_scaled', 'input_ids', 'attention_mask'],
        num_rows: 9651
    })
    validation: Dataset({
        features: ['essay', 'score_scaled', 'input_ids', 'attention_mask'],
        num_rows: 1206
    })
    test: Dataset({
        features: ['essay', 'score_scaled', 'input_ids', 'attention_mask'],
        num_rows: 1207
    })
})
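
The row counts above follow from applying the two splits in sequence to the 12,064 essays. The sketch below reproduces the arithmetic, assuming the held-out portion of each split is rounded up and the remainder stays on the other side. That rounding rule is an assumption, though it agrees with the printed sizes, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows, holdout=0.2, test_share=0.5):
    """Hypothetical helper: train/validation/test sizes after two splits."""
    n_holdout = math.ceil(n_rows * holdout)     # first split: 20% held out
    n_train = n_rows - n_holdout
    n_test = math.ceil(n_holdout * test_share)  # second split: half of the held-out rows
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


print(split_sizes(12064))  # (9651, 1206, 1207)
```

The held-out 2,413 rows cannot be halved evenly, so the test set receives one more essay than the validation set.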


In [18]:
dataset_dict.save_to_disk("../data/processed/tokenised_asap_split")
