# **Essay Classification Problem: Is it student-written or LLM-generated?**

## Step 1: Setup & Data Loading

### The plan is to load the dataset ( train_essays.csv ) and to prepare the environment in Colab.
1. *Set up TensorFlow, scikit-learn, and imbalanced-learn for modeling and oversampling. Then download GloVe embeddings for text processing.*
2. *For each essay, the dataset provides the text and a label ( 0 for student-written, 1 for LLM-generated ). Load it into a pandas DataFrame & inspect the class distribution to confirm the imbalance.*
3. *Colab’s free tier provides a GPU/TPU, which we’ll leverage to speed up the training process. We must manage memory carefully, however, due to the ~12 GB RAM limit.*

In [None]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score
from imblearn.over_sampling import RandomOverSampler
import nltk
from nltk.tokenize import word_tokenize
import requests
import zipfile
import os

'''
Download NLTK data for the tokenization process. NLTK's word tokenizer is less
scalable than the tokenizer TensorFlow offers, but it handles punctuation well
and is a better fit for our purposes.
'''
nltk.download('punkt')
# Specific tokenizer model required in our Colab's Python environment
nltk.download('punkt_tab')

# Enable GPU ( manually enable via Runtime > Change runtime type > GPU T4 )
physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

# Load the training data
train_df = pd.read_csv('/content/data/train_essays.csv')
print("Class distribution prior to oversampling:")
# Check for imbalance
print(train_df['generated'].value_counts())

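# Download GloVe embeddings ( the 100D file is roughly 350 MB; the full zip is roughly 820 MB )
glove_url = 'http://nlp.stanford.edu/data/glove.6B.zip'
glove_path = '/content/glove.6B.100d.txt'
if not os.path.exists(glove_path):
    print("Downloading the GloVe embeddings...")
    # Stream to disk in chunks instead of holding the whole archive in RAM
    with requests.get(glove_url, stream=True) as r:
        r.raise_for_status()
        with open('/content/glove.6B.zip', 'wb') as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    with zipfile.ZipFile('/content/glove.6B.zip', 'r') as zip_ref:
        # Extract only the 100D file that we actually use
        zip_ref.extract('glove.6B.100d.txt', '/content')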

Class distribution prior to oversampling:
generated
0    1375
1       3
Name: count, dtype: int64


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Implementation Breakdown
- **Libraries:** *TensorFlow for the ANN implementation, scikit-learn for data splitting and evaluation metrics, imbalanced-learn for oversampling, NLTK for word tokenization, and pandas for handling data structures.*
- **GloVe Download:** *We use the GloVe 100D embeddings to convert words into vectors. The 100D text file is roughly 350 MB, which is manageable in Colab, although the zip archive it ships in is larger ( roughly 820 MB ). The file is downloaded only if not already present, to save time & space.*
- **Class Distribution:** *We check the counts in the generated column to confirm the imbalance ( 1,375 student-written vs. 3 LLM-generated ), which informs our oversampling strategy.*
- **GPU Setup:** *Enabling memory growth prevents TensorFlow from reserving all GPU memory up front, which reduces out-of-memory crashes in the Colab environment.*
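
To make the imbalance concrete, the short sketch below reuses the two counts printed above and works out the minority share, plus the accuracy of a hypothetical baseline that always predicts "student-written". The variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the value_counts() output above
counts = {0: 1375, 1: 3}

total = sum(counts.values())
minority_share = counts[1] / total

# A hypothetical classifier that always predicts label 0 ( student-written )
baseline_accuracy = counts[0] / total
# It never flags an LLM-generated essay, so its recall for class 1 is zero
baseline_recall = 0 / counts[1]

print(f"Minority share: {minority_share:.2%}")
print(f"Always-0 baseline accuracy: {baseline_accuracy:.2%}, recall: {baseline_recall:.0%}")
```

A baseline with near-perfect accuracy and zero recall is the reason the later steps focus on the F1-score rather than on accuracy.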

## Step 2: Text Preprocessing and GloVe Text Embedding

### The plan is to tokenize essays ( or rather the words that compose them ), and convert those words into GloVe 100D embeddings. Then, averaging the embeddings will allow us to create one 100D vector per essay.
1. **Tokenization:** *Essays are raw text, so we split them into words ( our tokens ) using NLTK’s word_tokenize to prepare for embedding.*
2. **GloVe Embeddings:** *We load the GloVe 100D file into a dictionary mapping words to 100D vectors. Each essay’s tokens are then converted into vectors, which we average to create a fixed-length 100D embedding that later serves as the input for our ANN.*
3. **Fixed Weights:** *Using GloVe embeddings as fixed weights reduces the number of trainable parameters. This makes training faster and keeps us within Colab’s computational constraints.*
4. **Averaging:** *Averaging the embeddings reduces essays of varying length to a single fixed-size vector.*
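
The averaging step can be illustrated on a toy scale. The sketch below uses hypothetical 3D vectors in place of the 100D GloVe vectors, and `average_embedding` is an illustrative helper, not the notebook's function. It shows that token lists of different length map to vectors of the same shape and that unknown tokens are simply left out of the mean.

```python
import numpy as np

# Hypothetical 3D embeddings standing in for the 100D GloVe vectors
toy_embeddings = {
    "the": np.array([1.0, 0.0, 0.0], dtype="float32"),
    "essay": np.array([0.0, 2.0, 0.0], dtype="float32"),
    "argues": np.array([0.0, 0.0, 3.0], dtype="float32"),
}

def average_embedding(tokens, embeddings, dim):
    """Mean of the vectors for in-vocabulary tokens; zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim, dtype="float32")
    return np.mean(vectors, axis=0)

# Two "essays" of different length; "zzz" is out of vocabulary and is skipped
short = average_embedding(["the", "essay"], toy_embeddings, dim=3)
longer = average_embedding(["the", "essay", "argues", "zzz"], toy_embeddings, dim=3)
print(short.shape, longer.shape)
```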

In [None]:
# Load GloVe embeddings into a dictionary
def load_glove_embeddings(glove_file):
    embeddings = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

glove_embeddings = load_glove_embeddings(glove_path)
print(f"Loaded {len(glove_embeddings)} GloVe embeddings.")

# Tokenize and convert essays to averaged GloVe embeddings
def text_to_embedding(text, embeddings, embedding_dim=100):
    # Lowercase for consistency
    tokens = word_tokenize(text.lower())
    # Keep only in-vocabulary tokens; out-of-vocabulary ( OOV ) tokens are skipped
    valid_embeddings = [embeddings[token] for token in tokens if token in embeddings]
    # Handle essays that are empty or contain only OOV tokens
    if not valid_embeddings:
        return np.zeros(embedding_dim, dtype='float32')
    return np.mean(valid_embeddings, axis=0)

# Apply to training data
X = np.array([text_to_embedding(text, glove_embeddings) for text in train_df['text']])
y = train_df['generated'].values
print(f"Input shape: {X.shape}, Labels shape: {y.shape}")

Loaded 400000 GloVe embeddings.
Input shape: (1378, 100), Labels shape: (1378,)


## Implementation Breakdown
- **GloVe Loading:** *The dictionary maps words to 100D vectors, with ~400,000 words in GloVe 6B, covering most essay vocabulary.*
- **Tokenization:** *Lowercasing ensures consistency. NLTK’s tokenizer handles punctuation and complex words.*
- **Averaging Embeddings:** *Averaging captures the essay’s overall meaning, suitable for our simple ANN. Out-of-vocabulary ( OOV ) words are skipped rather than averaged in, which should be rare given GloVe’s large vocabulary. An essay with no known words at all falls back to a zero vector.*
- **Output:** *X is a matrix of shape ( n_samples, 100 ), where n_samples is 1,378, and y is a vector of labels ( 0 or 1 ).*
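
The claim that GloVe covers most of the essay vocabulary can be checked directly. The helper below is a minimal sketch: `vocabulary_coverage` is a hypothetical name, and the toy vocabulary stands in for the keys of `glove_embeddings`.

```python
def vocabulary_coverage(tokens, vocabulary):
    """Fraction of tokens that have an entry in the vocabulary."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)

# Toy stand-ins for the GloVe vocabulary and one tokenized essay
toy_vocab = {"the", "essay", "is", "clear"}
toy_tokens = ["the", "essay", "is", "clear", "qwertyish"]
coverage = vocabulary_coverage(toy_tokens, toy_vocab)
print(f"Coverage: {coverage:.0%}")
```

In the notebook, the same function could be called with `word_tokenize(text.lower())` and `glove_embeddings` for each essay to see how many tokens are actually skipped.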

## Step 3: Random Oversampling
### The plan is to apply random oversampling & achieve partial balance ( 578 LLM-generated vs. 1,375 student-written ).

In [None]:
# Apply random oversampling ( target minority/majority ratio of 0.421, i.e. 578 / 1,375 here )
ros = RandomOverSampler(sampling_strategy=0.421, random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

print("Class distribution after oversampling:")
print(pd.Series(y_resampled).value_counts())
print(f"Resampled input shape: {X_resampled.shape}")

Class distribution after oversampling:
0    1375
1     578
Name: count, dtype: int64
Resampled input shape: (1953, 100)


## Implementation Breakdown
- **Sampling Strategy:** *The ratio 0.421 ( minority / majority, here 578 / 1,375 ) achieves partial balance and requires fewer duplicates than full balance ( 1,375 / 1,375 ). Because only 3 original LLM-generated essays exist, all 578 minority samples are copies of those 3.*
- **Random State:** *Setting random_state = 42 makes the resampling reproducible: as long as this value stays constant, the same samples are duplicated on every run.*
- **Output:** *X_resampled and y_resampled have 1,953 samples in total: 578 LLM-generated & 1,375 student-written.*
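
To show where the 578 comes from, here is a NumPy-only sketch of random oversampling to a target ratio. It is an illustration under stated assumptions, not imbalanced-learn's implementation: `random_oversample` is a hypothetical helper, the features are random stand-ins, and the target size is taken as the majority count times the ratio, truncated to an integer. That assumption is consistent with the counts printed above.

```python
import numpy as np

def random_oversample(X, y, ratio, seed=42):
    """Duplicate random minority rows until minority/majority is about `ratio`."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    n_majority = counts.max()
    n_minority = counts.min()
    # Target minority size, truncated to an integer ( assumption, see lead-in )
    target = int(n_majority * ratio)
    n_extra = max(target - n_minority, 0)
    minority_idx = np.flatnonzero(y == minority)
    # Sample with replacement: every extra row is a copy of an existing one
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]

# Same class counts as the dataset, with random features as a stand-in
y_toy = np.array([0] * 1375 + [1] * 3)
X_toy = np.random.default_rng(0).normal(size=(len(y_toy), 100))
X_over, y_over = random_oversample(X_toy, y_toy, ratio=0.421)
print(np.bincount(y_over), X_over.shape)
```

The resampled minority class still contains only three distinct rows, which is worth keeping in mind when interpreting later validation scores.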

## Step 4: Data Splitting
### The plan is to split the oversampled data into two sets: one for training ( 80% ) and one for validation ( 20% ).
1. *The validation set will allow us to evaluate hyperparameter performance and monitor overfitting without involving the test set.*
2. *An 80:20 split ( 1,562 for training and 391 for validation ) balances the amount of training data against the size of the validation set.*
3. *The process of stratification ensures the class distribution is preserved correctly across both sets.*

In [None]:
# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_resampled, y_resampled,
                                                  test_size=0.2, stratify=y_resampled,
                                                  random_state=42)
print(f"Training shape: {X_train.shape}, Validation shape: {X_val.shape}")

Training shape: (1562, 100), Validation shape: (391, 100)


## Implementation Breakdown
- **Stratification:** *Maintains the oversampled class ratio ( ~30% LLM-generated ) in both sets, so the F1-score is evaluated on a representative class mix.*
- **Random State:** *Makes the random split reproducible across runs.*
- **Output:** *1,562 training samples & 391 validation samples, each a 100D vector.*
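
One caveat is worth checking when the split happens after oversampling: duplicated minority rows can land on both sides of the split, so validation scores may be optimistic. The toy sketch below ( hypothetical data, NumPy only, with deterministic repetition in place of random oversampling ) counts validation rows that also appear verbatim in the training set.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in: 10 unique majority rows and 3 unique minority rows
X_major = rng.normal(size=(10, 4))
X_minor = rng.normal(size=(3, 4))

# Oversample first: each minority row is repeated 10 times ( 30 minority rows )
X_all = np.vstack([X_major, np.repeat(X_minor, 10, axis=0)])
y_all = np.array([0] * 10 + [1] * 30)

# Then split 80:20 at random ( 8 of the 40 rows go to validation )
order = rng.permutation(len(y_all))
val_idx, train_idx = order[:8], order[8:]
X_tr, X_va, y_va = X_all[train_idx], X_all[val_idx], y_all[val_idx]

# Count validation rows that have an exact copy in the training set
train_rows = {row.tobytes() for row in X_tr}
leaked = sum(row.tobytes() in train_rows for row in X_va)
print(f"{leaked} of {len(X_va)} validation rows also appear in training")
```

In this toy setup every minority row in the validation set has a copy in training. One possible remedy, not taken in this notebook, is to split first and oversample only the training portion.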

## Step 5: Build the ANN
### The plan is to implement our ANN architecture using TensorFlow
1. *The ANN takes a 100D input ( our averaged GloVe embedding ), applies a 64-node hidden layer with ReLU activation, uses dropout ( 0.2, a modest rate given the small dataset ) for regularization, and outputs a probability through a sigmoid layer.*
2. *Binary cross-entropy is the standard loss for binary classification, and Adam is a widely used adaptive optimizer that works well with default settings.*
3. *The small size (~6,500 parameters) ensures efficient training in our Colab environment.*

In [1]:
# Build ANN with TensorFlow
def build_model(learning_rate=0.001, dropout_rate=0.2):
    model = tf.keras.Sequential([
        # 100D GloVe vector
        tf.keras.layers.Input(shape=(100,)),
        # Hidden layer
        tf.keras.layers.Dense(64, activation='relu'),
        # Regularization
        tf.keras.layers.Dropout(dropout_rate),
        # Output probability
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss='binary_crossentropy',
                  # Accuracy for monitoring
                  metrics=['accuracy'])
    return model

# Initial model with default hyperparameters
model = build_model(learning_rate=0.001, dropout_rate=0.2)
# Display architecture and parameters
model.summary()


## Implementation Breakdown
- **ANN Architecture:** *64 nodes, dropout = 0.2, ReLU, sigmoid*
- **Loss:** *Binary cross-entropy suits binary classification and pairs with our sigmoid output.*
- **Metrics:** *Accuracy is included for monitoring/debugging; the F1-score is computed separately from the predictions.*
- **Parameters:** *6,529 ( 100 × 64 + 64 + 64 × 1 + 1 ), which is small enough to train quickly in our Colab environment.*
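
The parameter count can be verified with a few lines of arithmetic. `dense_params` below is a hypothetical helper that counts weights plus biases for one fully connected layer.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(100, 64)   # 100 x 64 weights + 64 biases
output = dense_params(64, 1)     # 64 x 1 weights + 1 bias
total_params = hidden + output   # Dropout adds no trainable parameters
print(hidden, output, total_params)
```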

## Step 6: Manual Hyperparameter Tuning
### The plan is to train the ANN with 3–5 hyperparameter combinations, with focus on optimizing our F1-score.
1. *We start with default settings (learning rate 0.001, batch size 32, dropout 0.2) and use early stopping to halt training once no improvement is observed, which prevents overtraining and avoids wasting limited computational resources.*
2. *We test adjustments based on the performance and select the combination with the highest validation F1-score.*
3. *Early stopping (patience=3) halts training if the validation loss doesn’t improve for 3 consecutive epochs. Training is capped at 20 epochs.*
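
The patience rule can be sketched in plain Python. This is a simplified illustration rather than Keras's callback ( which has further options such as `min_delta` ), and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a list of losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Strict improvement resets the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the patience rule
    return len(val_losses), best_epoch

# Hypothetical validation losses: improvement stalls after epoch 4
losses = [0.55, 0.48, 0.41, 0.34, 0.35, 0.36, 0.34, 0.33]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the model would then be rolled back to the weights from `best_epoch` rather than keeping those from the epoch at which training stopped.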

In [None]:
# Implement early stopping mechanism
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=3, restore_best_weights=True
)

# Function that computes the F1-score and accuracy
def compute_metrics(model, X, y):
    # Flatten the ( n, 1 ) prediction column so it matches the 1D labels
    y_pred = (model.predict(X) > 0.5).astype(int).ravel()
    f1 = f1_score(y, y_pred)
    acc = accuracy_score(y, y_pred)
    return f1, acc

# Different hyperparameter combinations to test for best
hyperparams = [
    # Default
    {'learning_rate': 0.001, 'batch_size': 32, 'dropout_rate': 0.2},
    # Slower learning
    {'learning_rate': 0.0001, 'batch_size': 32, 'dropout_rate': 0.2},
    # Bigger batch
    {'learning_rate': 0.001, 'batch_size': 64, 'dropout_rate': 0.2},
    # Bigger dropout
    {'learning_rate': 0.001, 'batch_size': 32, 'dropout_rate': 0.3},
]

# Manual tuning
best_f1 = 0
best_params = None
best_model = None

for params in hyperparams:
    print(f"Testing: {params}")
    model = build_model(learning_rate=params['learning_rate'],
                        dropout_rate=params['dropout_rate'])
    history = model.fit(X_train, y_train,
                        batch_size=params['batch_size'],
                        epochs=20,
                        validation_data=(X_val, y_val),
                        callbacks=[early_stopping],
                        verbose=1)
    # Compute validation metrics
    val_f1, val_acc = compute_metrics(model, X_val, y_val)
    print(f"Validation F1-Score: {val_f1:.3f}, Accuracy: {val_acc:.3f}")

    if val_f1 > best_f1:
        best_f1 = val_f1
        best_params = params
        best_model = model

print(f"Best hyperparameters: {best_params}")
print(f"Best Validation F1-Score: {best_f1:.3f}")

Testing: {'learning_rate': 0.001, 'batch_size': 32, 'dropout_rate': 0.2}
Epoch 1/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 3s 27ms/step - accuracy: 0.5982 - loss: 0.6507 - val_accuracy: 0.7033 - val_loss: 0.5515
Epoch 2/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7194 - loss: 0.5315 - val_accuracy: 0.7033 - val_loss: 0.4821
Epoch 3/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7437 - loss: 0.4835 - val_accuracy: 0.7928 - val_loss: 0.4130
Epoch 4/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.8374 - loss: 0.4007 - val_accuracy: 0.9054 - val_loss: 0.3371
Epoch 5/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.9077 - loss: 0.3235 - val_accuracy: 0.9847 - val_loss: 0.2711
Epoch 6/20
49/49 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.9657 - loss: 0.2657 - val_ac

## Implementation Breakdown
- **Combinations:** *The four combinations ( default, slower learning, larger batch, larger dropout ) each change one hyperparameter relative to the default, balancing exploration against training cost.*
- **Early Stop:** *Monitors the validation loss ( used here as a proxy for the F1-score ) and restores the best weights, which saves training time and limits overfitting.*
- **F1-Score:** *Computed after training with scikit-learn’s f1_score on thresholded predictions, rather than tracked as a Keras metric during training.*
- **Efficiency:** *In the logged output above, each epoch after the first completes in well under a second ( ~4 ms/step over 49 steps ), so all four combinations fit comfortably within Colab’s runtime limits.*
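
For reference, the F1-score reported by `f1_score` for the positive class can be reproduced from confusion counts. The function and labels below are a hypothetical, pure-Python illustration.

```python
def f1_from_labels(y_true, y_pred):
    """Precision, recall, and F1 for the positive class ( label 1 )."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = f1_from_labels(y_true_toy, y_pred_toy)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```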

## Step 7: Threshold Optimization
### The plan is to adjust the classification threshold to maximize our validation F1-score
1. *The ANN outputs probabilities (between 0 & 1), and by default we classify with a threshold of 0.5 (the border value for deciding whether an essay is LLM-generated or student-written). Adjusting this threshold can improve the F1-score by trading precision against recall.*
2. *To do so, we test different thresholds on the validation set and keep the one that maximizes our F1-score.*

In [None]:
# Optimize the classification threshold for F1-score
def optimize_threshold(model, X, y):
    y_pred_proba = model.predict(X)
    thresholds = np.arange(0.3, 0.8, 0.1)
    best_f1 = 0
    best_threshold = 0.5

    for threshold in thresholds:
        y_pred = (y_pred_proba > threshold).astype(int)
        f1 = f1_score(y, y_pred)
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold

    return best_threshold, best_f1

# Apply to model that maximizes the F1-score
best_threshold, final_f1 = optimize_threshold(best_model, X_val, y_val)
final_acc = accuracy_score(y_val, (best_model.predict(X_val) > best_threshold).astype(int))
print(f"Best Threshold: {best_threshold:.2f}")
print(f"Final Validation F1-Score: {final_f1:.3f}, Accuracy: {final_acc:.3f}")

13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
Best Threshold: 0.70
Final Validation F1-Score: 0.996, Accuracy: 0.997


## Implementation Breakdown
- **Threshold Range:** *0.3 to 0.7 covers likely optimal points, as extreme thresholds (e.g., 0.9) skew precision or recall.*
- **F1-Score Focus:** *Maximizing F1-score aligns with our primary metric, improving detection of LLM-generated essays.*
- **Efficiency:** *Threshold optimization is fast (~seconds), as the probabilities are predicted once and reused for every threshold.*
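
The sweep can be tried on toy data without a trained model. The sketch below uses hypothetical labels and probabilities, and it builds the grid with `np.linspace` as one alternative to `np.arange`, since a floating-point step can make the endpoint of `np.arange` unpredictable. The helper names are illustrative.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for the positive class from true/false positive and negative counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def sweep_thresholds(y_true, proba, thresholds):
    """Return (best_threshold, best_f1); ties keep the lowest threshold."""
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        score = f1_binary(y_true, (proba > t).astype(int))
        if score > best_f1:
            best_t, best_f1 = float(t), float(score)
    return best_t, best_f1

# Hypothetical labels and predicted probabilities
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0])
proba_toy = np.array([0.10, 0.20, 0.45, 0.55, 0.65, 0.80, 0.90, 0.35])
# Five evenly spaced thresholds from 0.3 to 0.7 inclusive
thresholds = np.linspace(0.3, 0.7, 5)
best_t, best_score = sweep_thresholds(y_true_toy, proba_toy, thresholds)
print(f"Best threshold: {best_t:.2f}, F1: {best_score:.3f}")
```

Because the threshold is chosen on the same validation set that is used for model selection, the resulting F1-score is likely to be somewhat optimistic compared with performance on unseen data.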