# General context

The [Learning Agency Lab - Automated Essay Scoring 2.0](https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/overview) competition aims to enhance automated essay grading systems to support overburdened teachers, especially in underserved communities. It addresses the limitations of previous efforts by using a larger, more diverse dataset to improve scoring accuracy and fairness. Hosted by Vanderbilt University and The Learning Agency Lab, the competition seeks to develop open-source tools that provide timely feedback to students and integrate more effectively into real-world educational settings. This initiative represents a significant advancement in educational technology, promoting equitable access to reliable automated essay scoring.

In this notebook, I conduct exploratory data analysis and fine-tune models based on the DeBERTa V3 architecture ([He et al., 2021](https://arxiv.org/abs/2111.09543)). I use the Hugging Face `datasets` library together with PyTorch's `DataLoader` for data handling, write the training loop in native PyTorch, and model the scores as ordinal values to account for their inherent order.

**Todo: add some result**

# Imports and global variables

In [1]:
import os
import math
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Any
import logging

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

import numpy as np
import pandas as pd

import torch
import torch.utils
import torch.utils.data
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import (
    DataLoader,
)

from sklearn.model_selection import train_test_split
import polars as pl

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModel,
    DataCollatorWithPadding,
)
import datasets
from datasets import Dataset, DatasetDict
# from peft import LoraConfig 

from sklearn.utils.estimator_checks import check_estimator
from sklearn.base import (
    BaseEstimator,
    ClassifierMixin,
)

## Set and define global configuration

In this section, I set up a configuration builder that utilizes the Python data class `ConfigurationSetting` to enhance the code's flexibility. This approach allows the code to run with specific configurations in dedicated environments. The instance of `ConfigurationSetting` created by the builder is used throughout the code, replacing hardcoded values.
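
The builder described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual implementation: the field names are inferred from how `configuration_item` is used later (`name`, `data_path`, `seed`), the environment variables used for detection and the data paths are assumptions, and the `environ` parameter is a hypothetical addition that makes the detection easy to test.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

# Same environment names as the notebook's global constants.
DATABRICKS_STR = "DATABRICKS"
KAGGLE_STR = "KAGGLE"
LOCAL_STR = "LOCAL"


@dataclass(frozen=True)
class ConfigurationSetting:
    """Values that would otherwise be hardcoded throughout the notebook."""
    name: str
    data_path: Optional[Path]
    model_ckpt: str
    plot_color: str
    seed: int
    device: str


def configuration_builder(
    model_ckpt: str,
    plot_color: str,
    seed: int,
    device: str,
    environ: Optional[Mapping[str, str]] = None,  # hypothetical, for testing
) -> ConfigurationSetting:
    """Pick environment-specific settings and bundle them with the arguments."""
    env = os.environ if environ is None else environ
    # Assumption: these variables are set by the respective platforms.
    if "KAGGLE_KERNEL_RUN_TYPE" in env:
        name = KAGGLE_STR
        # Assumed location of the competition data on Kaggle.
        data_path = Path("/kaggle/input/learning-agency-lab-automated-essay-scoring-2")
    elif "DATABRICKS_RUNTIME_VERSION" in env:
        name = DATABRICKS_STR
        data_path = Path("/dbfs/essay-scoring")  # hypothetical path
    else:
        name = LOCAL_STR
        data_path = Path("data")  # hypothetical path
    return ConfigurationSetting(
        name=name,
        data_path=data_path,
        model_ckpt=model_ckpt,
        plot_color=plot_color,
        seed=seed,
        device=device,
    )
```

Freezing the dataclass keeps the configuration read-only once it is built, so later cells cannot silently change the seed or the data path.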

In [2]:
DATABRICKS_STR      = "DATABRICKS"
KAGGLE_STR          = "KAGGLE"
LOCAL_STR           = "LOCAL"
MATPLOTBLUE         = "#1f77b4"
SEED                = 1010
DEVICE              = "cuda" if torch.cuda.is_available() else "cpu"
DEBERTA_V3_CKPT     = "microsoft/deberta-v3-base"
NUM_LABELS          = 5
DATALOADER_BATCH    = 10
DEBUG               = True


In [3]:
configuration_item = configuration_builder(
    model_ckpt=DEBERTA_V3_CKPT,
    plot_color=MATPLOTBLUE,
    seed=SEED,
    device=DEVICE
)

NameError: name 'configuration_builder' is not defined

Warnings are suppressed in my local environment, particularly to remove information about my computing system before the code is pushed to GitHub.

In [None]:
if configuration_item.name == LOCAL_STR:
    logging.captureWarnings(True)
    logger: logging.Logger = logging.getLogger("py.warnings")
    logger.addHandler(logging.FileHandler("tmp.log"))

# Load the data

Data comes from the Kaggle competition [Learning Agency Lab - Automated Essay Scoring 2.0](https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/overview) and can be downloaded from the competition's page.

In [None]:
if not configuration_item.data_path:
    raise ValueError("configuration_item.data_path is not set; cannot locate train.csv")

train_ds = pd.read_csv(
    filepath_or_buffer=configuration_item.data_path / "train.csv"
)


For the sake of speed, I reduce the dataset size in my local environment and whenever `DEBUG` is enabled. Even with fewer records, fine-tuning DeBERTa without a GPU remains significantly slower.

In [None]:
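# Keep the dataset small for local investigation and debugging:
# test_size=.99 discards 99% of the rows, leaving a stratified 1% sample.
if configuration_item.name == LOCAL_STR or DEBUG: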
    train_ds, _ = train_test_split(
        train_ds, 
        test_size=.99, 
        random_state=configuration_item.seed, 
        stratify=train_ds["score"]
    )

# Exploratory Data Analysis

# Data wrangling

## Turn the score label into ordinal

Ordinal regression is used when the dependent variable (the outcome you're trying to predict) has an intrinsic order, but the distances between the levels are not known. Classic examples include a Likert scale for surveys (e.g., "strongly disagree," "disagree," "neutral," "agree," "strongly agree"), grades (A, B, C, D, F), or, in the case at hand, essay scores.

The key advantage of ordinal regression is its ability to handle dependent variables that are more nuanced than simple binary outcomes but don't have the numeric spacing needed for linear regression. For instance, while we know that grade A is higher than grade B, we cannot say that it is exactly two points higher as we might with numerical scores. Ordinal regression models the rank order of the dependent variable without making assumptions about the size of the intervals between levels.

In the context of essay scoring, ordinal regression can predict the rank order of the essays' quality. It is well suited to this task because it learns from the order inherent in the scores without assuming equal spacing between score levels, which can result in more accurate models for ordered categorical data.

Ordinal regression requires transforming the target variable to reflect its ordinal nature. In a standard regression problem, the target is typically a single column of values. In ordinal regression, the target is often expanded into a matrix that represents the ranking order, so the model can predict not just whether one essay is better than another but where it ranks across the whole spectrum of scores. To prepare for ordinal regression, the scores were transformed into an ordinal matrix using a cumulative binary encoding of the ranks: a score of k becomes a row whose first k entries are 1 and whose remaining entries are 0. Unlike standard one-hot encoding, which sets a single entry per row, this encoding preserves the order of the scores.

In [None]:
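def category_to_ordinal(category):
    """Encode integer scores as cumulative (ordinal) binary rows.

    A score k becomes a row with k leading ones, e.g. 3 -> [1, 1, 1, 0, ...].
    """
    y = np.array(category, dtype="int")
    n = y.shape[0]
    # The row width is the largest score present in the input.
    num_class = np.max(y)
    # Each row of range_values is [0, 1, ..., num_class - 1].
    range_values = np.tile(
        np.expand_dims(np.arange(num_class), 0),
        [n, 1]
    )
    ordinal = np.zeros((n, num_class), dtype="int")
    # Set position j to 1 whenever j < score.
    ordinal[range_values < np.expand_dims(y, -1)] = 1
    return ordinal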
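
As a worked example of the encoding, the sketch below applies a compact equivalent to three scores and then maps the rows back to scores. The `encode_ordinal` and `decode_ordinal` names are hypothetical, the fixed `num_class` argument is an assumption (the function above infers the width from the data), and the decoding rule of counting leading outputs above a threshold is one common convention, not necessarily the one used later in this notebook.

```python
import numpy as np


def encode_ordinal(scores, num_class):
    """Cumulative encoding: score k -> k leading ones in a row of width num_class."""
    y = np.asarray(scores, dtype=int)
    return (np.arange(num_class) < y[:, None]).astype(int)


def decode_ordinal(outputs, threshold=0.5):
    """Hypothetical decoding rule: count the leading outputs above the threshold."""
    above = np.asarray(outputs) > threshold
    # cumprod stops counting at the first output that falls below the threshold.
    return np.cumprod(above, axis=1).sum(axis=1)


encoded = encode_ordinal([1, 3, 6], num_class=6)
print(encoded)
# [[1 0 0 0 0 0]
#  [1 1 1 0 0 0]
#  [1 1 1 1 1 1]]
print(decode_ordinal(encoded))  # [1 3 6]
```

Passing `num_class` explicitly keeps the matrix width stable even when a small sample happens not to contain the highest score.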

In [None]:
train_ds["labels"] = category_to_ordinal(train_ds.score.values).tolist()

In [None]:
display(train_ds.head())

## Create training / validation set

In [None]:
train_set, validation_set = train_test_split(
    train_ds, 
    train_size=.7, 
    random_state=SEED, 
    stratify=train_ds.score
)

In [None]:
train_set_tmp = train_set.copy()
validation_set_tmp = validation_set.copy()

train_set_tmp["type"] = "Training"
validation_set_tmp["type"] = "Validation"
# ignore_index forces the creation of a new index.
combined_set = pd.concat(
    [train_set_tmp, validation_set_tmp], 
    ignore_index=True
)

proportion = (
    combined_set.groupby("type")["score"]
    .value_counts(normalize=True)
    .reset_index()
)

ax = sns.barplot(data=proportion, x="score", y="proportion", hue="type")

for p in ax.patches:
    percentage = f"{p.get_height()*100:.0f}%" # type: ignore
    x = p.get_x() + p.get_width() / 2 # type: ignore
    y = p.get_height() # type: ignore
    ax.text(x, y, percentage, ha="center", va="bottom")

plt.title("Comparative Distribution of Scores Proportion in Training and Validation Sets")
plt.xlabel("Scores")
plt.ylabel("Proportion")
plt.show()

del train_set_tmp, validation_set_tmp

The plot is a grouped bar chart showing the distribution of scores in the training and validation sets. Each score from 1 to 6 is represented by a pair of bars: one for the training set (in blue) and one for the validation set (in orange). Because the split is stratified on the score, the proportion of essays in each score category is consistent across both sets, as the closely matching bar heights confirm. This consistency matters for ordinal regression: the model learns from a training set that mirrors the expected distribution of scores, and the validation set reflects the same distribution, so validation metrics give a fair assessment of the model's performance.

In [None]:
data_dict = datasets.DatasetDict({
    "training": datasets.Dataset.from_pandas(df=train_set),
    "validation": datasets.Dataset.from_pandas(df=validation_set)
})

# Modeling using Sentence Transformer with SetFit

## Tokenize the text

Tokenization is a fundamental process in natural language processing (NLP) where text is segmented into smaller units known as tokens. These tokens may be individual words, characters, or subwords. This segmentation is akin to parsing a sentence into its constituent words or decomposing a word into syllables. 
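
The three granularities mentioned above can be illustrated with a toy sketch. It is not the tokenizer used by DeBERTa V3: the greedy longest-match rule and the tiny vocabulary are assumptions chosen only to show how a word can be split into subwords, and all function names are hypothetical.

```python
def tokenize_words(text):
    """Split on whitespace after lowercasing."""
    return text.lower().split()


def tokenize_characters(word):
    """Split a word into its individual characters."""
    return list(word)


def tokenize_subwords(word, vocabulary, unknown="[UNK]"):
    """Greedy longest-match split of a word into pieces found in the vocabulary."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary.
        while end > start and word[start:end] not in vocabulary:
            end -= 1
        if end == start:
            # No piece starting here is known: give up on the whole word.
            return [unknown]
        tokens.append(word[start:end])
        start = end
    return tokens


toy_vocabulary = {"token", "ization", "matter", "s"}  # hypothetical vocabulary
words = tokenize_words("Tokenization matters")
print(words)                          # ['tokenization', 'matters']
print(tokenize_characters(words[1]))  # ['m', 'a', 't', 't', 'e', 'r', 's']
print([tokenize_subwords(w, toy_vocabulary) for w in words])
# [['token', 'ization'], ['matter', 's']]
```

Subword splitting lets a model with a fixed vocabulary represent words it has never seen as combinations of known pieces.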

**Important Note**: The `DataCollatorWithPadding` modifies column names by changing `label` to `labels` (if the column exists). For the sake of conciseness and readability, I have updated all occurrences of `label` to `labels` throughout the code. This specific transformation is detailed in the `__call__` function of the [`DataCollatorWithPadding`](https://github.com/huggingface/transformers/blob/v4.40.2/src/transformers/data/data_collator.py#L236).