# Data Cleaning

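This notebook cleans the training and test data: it extracts scores and band descriptors from the `evaluation` column, normalizes the text columns, filters the rows, and computes embeddings.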

## Methodology

Band scores below 4.0 and above 8.5 were excluded from the analysis because they are underrepresented in the dataset. A rule-based extraction procedure was applied to each essay evaluation text to identify the individual sub-scores, the overall band score, and their corresponding band descriptors. To address inconsistencies in the originally assigned overall scores, the band score extracted from the evaluation text was used in place of the original `band` value, which is kept as `band_score_old`. This correction is intended to make the scoring data more reliable and consistent. All subsequent analyses were conducted on the filtered and corrected dataset.

In [1]:
import pandas as pd
import numpy as np
from os import path
import sys
from sentence_transformers import SentenceTransformer
import re

# Make the parent of the notebook directory (the project root) importable
project_root = path.dirname(path.abspath(""))
sys.path.append(project_root)
print(project_root)

/Users/finnferchau/dev/team-10


## Setup

---

In [135]:
pd.options.plotting.backend = "plotly"

# Show up to 500 characters per column value; None shows everything
pd.set_option("display.max_colwidth", 500)

## Data Import

---

In [136]:
csv_file_train = "../data/train.csv"
csv_file_test = "../data/test.csv"

df_train = pd.read_csv(csv_file_train)
df_test = pd.read_csv(csv_file_test)

## Extracting Text from the `evaluation` Column

---

In [137]:
possible_values = [
    "0.5",
    "1.5",
    "2.5",
    "3.5",
    "4.5",
    "5.5",
    "6.5",
    "7.5",
    "8.5",
    "9.5",
    "1",
    "2",
    "3",
    "4",
    "5",
    "6",
    "7",
    "8",
    "9",
    "10",
]


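def extract_numbers(text):
    # Return the first entry of possible_values (in list order, not text order)
    # that occurs anywhere in the text, or None if no entry occurs.
    for value in possible_values:
        if value in text:
            return value
    return None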
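
Because `extract_numbers` checks the candidates in list order, a text such as "Task Achievement: 7 ... 2 paragraphs" yields "2" rather than "7", and "10" can never be returned because "1" always matches first. A possible alternative, shown only as a sketch, is to take the first plausible score in text order with a regular expression. The name `extract_first_score` and the accepted range of 0.5 to 10 are assumptions for illustration, not part of the notebook.

```python
import re
from typing import Optional

# One or two digits, optionally followed by ".0" or ".5". The look-arounds keep
# the pattern from matching inside longer numbers such as "2023" or "7.25".
SCORE_PATTERN = re.compile(r"(?<![\d.])\d{1,2}(?:\.[05])?(?!\d|\.\d)")


def extract_first_score(text: str) -> Optional[float]:
    """Return the first number in text order that looks like a band score."""
    for match in SCORE_PATTERN.finditer(text):
        value = float(match.group())
        # Assumed valid range, mirroring the values listed in possible_values
        if 0.5 <= value <= 10:
            return value
    return None


print(extract_first_score("Task Achievement: 7 ... 2 paragraphs"))
print(extract_first_score("Overall Band Score: 6.5."))
```

Switching to such a function would change which rows survive the later `dropna()` call, so the row counts reported further down would no longer apply.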

In [138]:
def extract_features(evaluations):
    evaluations_lower = evaluations.map(str.lower)
    headings = [
        "task achievement",
        "coherence and cohesion",
        "lexical resource",
        "grammatical range and accuracy",
        "overall band score",
    ]
    columns = [
        "task_achievement_description",
        "task_achievement_score",
        "coherence_and_cohesion_description",
        "coherence_and_cohesion_score",
        "lexical_resource_description",
        "lexical_resource_score",
        "grammatical_range_and_accuracy_description",
        "grammatical_range_and_accuracy_score",
        "overall_band_score_description",
        "overall_band_score_score",
    ]
    evaluations_indices = []
    df_evaluations = pd.DataFrame(columns=columns)

    for evaluation in evaluations_lower:
        indices = {}

        for header in headings:
            indices[header] = evaluation.find(header)

        evaluations_indices.append(indices)

    for evaluation, evaluation_indices in zip(evaluations, evaluations_indices):
        headings = list(evaluation_indices.keys())
        indices = list(evaluation_indices.values())
        new_row = {col: None for col in columns}

        for i in range(len(indices)):
            # A section runs from its heading to the next heading;
            # the last section runs to the end of the text.
            # Note: find() returns -1 for a missing heading, so its slice
            # may be empty or misplaced.
            end = indices[i + 1] if i < len(indices) - 1 else None
            text = evaluation[indices[i] : end]
            key = headings[i].replace(" ", "_")
            new_row[key + "_description"] = text
            new_row[key + "_score"] = extract_numbers(text)

        df_evaluations.loc[len(df_evaluations)] = new_row

    return df_evaluations
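
`extract_features` relies on `str.find`, which returns -1 when a heading is absent, and it assumes the headings appear in the listed order. The following is a minimal sketch of the same heading-based splitting that skips missing headings and sorts the found ones by position. The function name `split_sections` and the toy evaluation text are hypothetical and not taken from the dataset.

```python
from typing import Dict, List, Optional


def split_sections(text: str, headings: List[str]) -> Dict[str, Optional[str]]:
    """Split text into one section per heading; missing headings map to None."""
    # Assumes lower() keeps the string length, which holds for ASCII text
    lowered = text.lower()
    found = sorted(
        (lowered.find(heading), heading)
        for heading in headings
        if lowered.find(heading) != -1
    )
    sections: Dict[str, Optional[str]] = {heading: None for heading in headings}
    for i, (start, heading) in enumerate(found):
        # A section ends where the next found heading starts
        end = found[i + 1][0] if i + 1 < len(found) else len(text)
        sections[heading] = text[start:end].strip()
    return sections


toy_text = "Task Achievement: 6\nGood ideas.\nLexical Resource: 7\nWide vocabulary."
toy_headings = ["task achievement", "coherence and cohesion", "lexical resource"]
sections = split_sections(toy_text, toy_headings)
print(sections)
```

Returning `None` for a missing heading makes the gap explicit, so a later `dropna()` removes the row for a visible reason.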

In [139]:
evaluations_train = extract_features(df_train["evaluation"])
evaluations_test = extract_features(df_test["evaluation"])

## Merge original dataframe with the new columns

---

In [140]:
# Train
df_train = pd.concat([df_train, evaluations_train], axis=1)
df_train.dropna(inplace=True)
df_train.rename(
    columns={"band": "band_score_old", "overall_band_score_score": "band_score"},
    inplace=True,
)

# Test
df_test = pd.concat([df_test, evaluations_test], axis=1)
df_test.dropna(inplace=True)
df_test.rename(
    columns={"band": "band_score_old", "overall_band_score_score": "band_score"},
    inplace=True,
)

## Text Preprocessing

---

Remove newlines, carriage returns, multiple spaces, and leading/trailing spaces.

In [141]:
def clean_text(text):
    if isinstance(text, str):
        # Replace literal escaped newlines and carriage returns
        text = text.replace("\\r\\n", " ")
        text = text.replace("\\n", " ")
        text = text.replace("\\r", " ")
        # Replace actual newline and carriage return characters
        text = text.replace("\r\n", " ")
        text = text.replace("\n", " ")
        text = text.replace("\r", " ")
        # Replace multiple spaces with a single space and strip leading/trailing spaces
        text = re.sub(r"\s+", " ", text).strip()
        return text
    return ""
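
The replacements in `clean_text` can also be expressed as one regular expression that treats literal `\r` and `\n` escape sequences and real whitespace alike. This is a sketch only; the name `normalize_whitespace` is hypothetical, and it returns an empty string for non-string input just as `clean_text` does.

```python
import re

# A run of literal "\r" / "\n" escape sequences and/or real whitespace
WHITESPACE_RUN = re.compile(r"(?:\\r|\\n|\s)+")


def normalize_whitespace(text) -> str:
    """Collapse escaped and real line breaks and repeated spaces into one space."""
    if not isinstance(text, str):
        return ""
    return WHITESPACE_RUN.sub(" ", text).strip()


sample = "Line one\\nLine two\r\n  Line three "
print(normalize_whitespace(sample))
```

Like the original function, this also rewrites a backslash followed by `n` or `r` that is part of ordinary text, for example in a file path.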

In [142]:
text_columns_to_clean = [
    "prompt",
    "essay",
    "task_achievement_description",
    "coherence_and_cohesion_description",
    "lexical_resource_description",
    "grammatical_range_and_accuracy_description",
    "overall_band_score_description",
]

In [143]:
# Clean the specified text columns in df_train
for col in text_columns_to_clean:
    if col in df_train.columns:
        df_train[col] = df_train[col].apply(clean_text)
    else:
        print(f"Warning: Column '{col}' not found in df_train.")

# Clean the specified text columns in df_test
for col in text_columns_to_clean:
    if col in df_test.columns:
        df_test[col] = df_test[col].apply(clean_text)
    else:
        print(f"Warning: Column '{col}' not found in df_test.")

## Remove long essays

---

In [144]:
from transformers import RobertaTokenizer

# Initialize the tokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")


# Define a function to get the length of tokenized essays
def get_tokenized_length(essay):
    return len(tokenizer(essay)["input_ids"])

In [145]:
max_tokens = 512

In [146]:
# Filter the train df
df_train_short = df_train[df_train["essay"].apply(get_tokenized_length) <= max_tokens]

print(f"Train before: {len(df_train)} rows")
print(f"Train after: {len(df_train_short)} rows")

# Filter the test df
df_test_short = df_test[df_test["essay"].apply(get_tokenized_length) <= max_tokens]

print(f"Test before: {len(df_test)} rows")
print(f"Test after: {len(df_test_short)} rows")

Token indices sequence length is longer than the specified maximum sequence length for this model (628 > 512). Running this sequence through the model will result in indexing errors


Train before: 9717 rows
Train after: 9605 rows
Test before: 486 rows
Test after: 481 rows
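
The length filter keeps an essay only if its token count, which for the RoBERTa tokenizer includes the two special tokens added by default, is at most `max_tokens`. The sketch below shows the same filtering logic with the token counter passed in as a parameter. The whitespace-based counter is a stand-in used so the example runs without downloading a model, and both function names are hypothetical.

```python
from typing import Callable, List


def filter_by_token_length(
    texts: List[str], count_tokens: Callable[[str], int], max_tokens: int = 512
) -> List[str]:
    """Keep only the texts whose token count does not exceed max_tokens."""
    return [text for text in texts if count_tokens(text) <= max_tokens]


def whitespace_token_count(text: str) -> int:
    # Stand-in for a real tokenizer: one token per word, plus 2 to mimic
    # the start and end tokens a RoBERTa tokenizer adds by default
    return len(text.split()) + 2


essays = ["short essay", "word " * 600]
kept = filter_by_token_length(essays, whitespace_token_count, max_tokens=512)
print(len(kept))
```

A subword tokenizer usually produces more tokens than a whitespace split, so the stand-in illustrates the mechanism only, not the actual cut-off.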


## Remove band scores with few examples

---

Band scores below 4.0 or above 8.5 have too few examples, so essays with those scores are removed.

In [147]:
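# Convert to float first so that the bars are sorted numerically
# (as strings, "10" would sort before "2")
band_counts = df_train_short["band_score"].astype(float).value_counts().sort_index()
band_counts.plot(kind="bar")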

In [148]:
# Filter train df
band_scores_float = df_train_short["band_score"].astype(float)
df_train_clean = df_train_short[(band_scores_float >= 4.0) & (band_scores_float <= 8.5)]

print(f"Train before: {len(df_train_short)} rows")
print(f"Train after: {len(df_train_clean)} rows")

# Filter test df
band_scores_float = df_test_short["band_score"].astype(float)
df_test_clean = df_test_short[(band_scores_float >= 4.0) & (band_scores_float <= 8.5)]

print(f"Test before: {len(df_test_short)} rows")
print(f"Test after: {len(df_test_clean)} rows")

Train before: 9605 rows
Train after: 9048 rows
Test before: 481 rows
Test after: 454 rows
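
The row counts printed by the two filtering steps can be turned into the number of rows each step removed. The short sketch below does that arithmetic. The counts are copied from the printed output, while the stage labels and the function name `removed_per_step` are chosen here for illustration.

```python
# Row counts copied from the printed output of the two filtering steps
row_counts = {
    "train": {"after_dropna": 9717, "length_filtered": 9605, "band_filtered": 9048},
    "test": {"after_dropna": 486, "length_filtered": 481, "band_filtered": 454},
}


def removed_per_step(counts):
    """Return how many rows each filtering step removed."""
    names = list(counts.keys())
    values = list(counts.values())
    return {names[i + 1]: values[i] - values[i + 1] for i in range(len(values) - 1)}


for split, counts in row_counts.items():
    retained = counts["band_filtered"] / counts["after_dropna"]
    print(split, removed_per_step(counts), f"retained: {retained:.1%}")
```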


In [149]:
band_counts = df_train_clean["band_score"].value_counts().sort_index()
band_counts.plot(kind="bar")

## Saving as CSV

---

In [150]:
import csv

print(f"Train shape: {df_train_clean.shape}")
print(f"Test shape: {df_test_clean.shape}")

df_train_clean.to_csv(
    "../data/clean_train.csv", index=False, quoting=csv.QUOTE_NONNUMERIC
)
df_test_clean.to_csv(
    "../data/clean_test.csv", index=False, quoting=csv.QUOTE_NONNUMERIC
)

Train shape: (9048, 14)
Test shape: (454, 14)


In [151]:
# Sanity check: reload the saved file and confirm its shape
df_check = pd.read_csv("../data/clean_train.csv")
print(df_check.shape)

(9048, 14)


## Calculate embeddings

---

In [152]:
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)


def calculate_embeddings(dataframe: pd.DataFrame, col_name: str) -> np.ndarray:
    # Encode each entry of the column into one embedding vector
    col_array = np.array(dataframe[col_name])
    embeddings = model.encode(
        col_array,
        batch_size=2,
        show_progress_bar=True,
    )

    print(f"Shape before: {col_array.shape}")
    print(f"Shape after: {embeddings.shape}")

    # Stack the vectors into a single 2-D array (one row per entry)
    return np.vstack(embeddings)

flash_attn is not installed. Using PyTorch native attention implementation.

In [153]:
# Calculate embeddings for the essays and prompts of the train dataframe
embeddings_train_prompts = calculate_embeddings(df_train_clean, "prompt")
embeddings_train_essays = calculate_embeddings(df_train_clean, "essay")

# Calculate embeddings for the essays and prompts of the test dataframe
embeddings_test_prompts = calculate_embeddings(df_test_clean, "prompt")
embeddings_test_essays = calculate_embeddings(df_test_clean, "essay")

Shape before: (9048,)
Shape after: (9048, 1024)


Shape before: (9048,)
Shape after: (9048, 1024)


Shape before: (454,)
Shape after: (454, 1024)


Shape before: (454,)
Shape after: (454, 1024)


In [154]:
# Save train embeddings
np.save("../embeddings/embeddings_train_prompts.npy", embeddings_train_prompts)
np.save("../embeddings/embeddings_train_essays.npy", embeddings_train_essays)

# Save test embeddings
np.save("../embeddings/embeddings_test_prompts.npy", embeddings_test_prompts)
np.save("../embeddings/embeddings_test_essays.npy", embeddings_test_essays)
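
The saved `.npy` files carry no row identifiers, so each embedding is tied to its essay only by position in the cleaned dataframe. The sketch below is a minimal round-trip check that saves an array, reloads it, and confirms shape and contents. The random array stands in for real embeddings, and the helper name and temporary file name are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_and_reload(embeddings: np.ndarray, file_path: Path) -> np.ndarray:
    """Write the array to disk with np.save and read it back with np.load."""
    np.save(file_path, embeddings)
    return np.load(file_path)


# Random stand-in for real embeddings: 4 rows with 1024 dimensions each
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(4, 1024)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_reload(fake_embeddings, Path(tmp_dir) / "embeddings_demo.npy")

print(reloaded.shape)
```

When loading the real files, comparing `reloaded.shape[0]` with the number of rows in the cleaned CSV is a cheap way to catch a mismatch between the two.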
