# Deep Learning: LLM: Classification Finetuning
**Thomas Bohn**   --   **2025-09-30**

{{xxxxx}}  

--  [Main Report](https://github.com/TOM-BOHN/MsDS-deep-learning-llm-classification-finetuning/blob/main/deep-learning-llm-classification-finetuning.ipynb)  --  [Github Repo](https://github.com/TOM-BOHN/MsDS-deep-learning-llm-classification-finetuning)  --  [Presentation Slides](xxx)  --  [Presentation Video](xxx) --  

# 1.&nbsp;Introduction

**Problem Statement**

{{xxxxx}}

**Why is it Important?**

{{xxxxx}}

**Limitations of Existing Solutions**

{{xxxxx}}

**Contribution**

{{xxxxx}}

**Dataset**

{{xxxxx}}


## Overview of Approach

{{xxxxxx}}

## Detect Environment

Determine whether the notebook is running in Colab or Kaggle. Later cells use these flags to choose which packages to install and where to load the data from.

In [None]:
# Detect Environment
import os
gIS_COLAB = 'COLAB_GPU' in os.environ or 'COLAB_TPU' in os.environ or 'COLAB_CPU' in os.environ
gIS_KAGGLE = 'KAGGLE_KERNEL_RUN_TYPE' in os.environ
print("Is Kaggle?", gIS_KAGGLE, " | ", "Is Colab?", gIS_COLAB)

## Add Colab Only Libraries


In [None]:
# Colab-only setup: import Colab tooling, mount Google Drive, and install packages
if gIS_COLAB:
    # Install Colab Specific Tooling
    from google.colab import userdata
    from google.colab import files

    # Mount the Google Drive
    from google.colab import drive
    drive.mount('/content/drive')

    # Install the necessary packages
    !pip install -q tensorflow
    !pip install -q kaggle

## Add Kaggle Only Libraries

In [None]:
# Import Kaggle-specific tooling
if gIS_KAGGLE:
    from kaggle_datasets import KaggleDatasets

## Common Python Libraries

The following Python libraries are used in this notebook.

In [None]:
# File system management
import time, datetime, psutil, os
import shutil
import zipfile

# Data manipulation
import numpy as np
import pandas as pd
import math

# Text storage and manipulation
import re
import json
import pickle
import textwrap
from tqdm import tqdm


##################################

# Plotting and visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import plotly.express as px
import seaborn as sns
sns.set_theme()

# Train-test split and cross validation
from sklearn.model_selection import train_test_split, ParameterGrid

# Model evaluation
from sklearn import metrics
from sklearn.metrics import accuracy_score

# Select the Keras backend; this only takes effect if set before Keras is first imported
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"

# Import TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras
import keras_nlp
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import Callback
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, History

##################################

print(f'Keras: {keras.__version__}')
print(f'KerasNLP: {keras_nlp.__version__}')
print(f'Tensorflow: {tf.__version__}')


## Connect to TPUs

In [None]:
# TPU (Tensor Processing Unit) Setup for Accelerated Training
# This code attempts to connect to Google's TPU infrastructure for faster model training
# TPUs are specialized hardware designed specifically for machine learning workloads

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('✅ TPU found:', tpu.master())
except Exception:
    print("❌ No TPU found. Falling back to CPU/GPU.")
    tpu = None

if tpu:
    # Connect to the TPU cluster.
    tf.config.experimental_connect_to_cluster(tpu)
    # Initialize the TPU system for use
    tf.tpu.experimental.initialize_tpu_system(tpu)
    # Create a TPU distribution strategy for multi-core TPU usage
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    # Use the default strategy for CPU/GPU.
    strategy = tf.distribute.get_strategy()

# Print the number of replicas (cores) available for parallel processing
print('Number of replicas:', strategy.num_replicas_in_sync)

# Set up automatic tuning for data pipeline performance optimization
# AUTOTUNE allows TensorFlow to automatically determine the optimal number of parallel calls
AUTOTUNE = tf.data.AUTOTUNE

# Print TensorFlow version for reference
print("TensorFlow version:", tf.__version__)

## Connect to GPUs

In [None]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
print(tf.test.gpu_device_name())

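# Enable GPU memory growth so TensorFlow allocates memory as needed
# instead of reserving all GPU memory up front
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before the GPUs have been initialized
        print(e)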

## Global Variables

The following are global variables referenced in this notebook.

In [2]:
# Recording the starting time, complemented with a stopping time check in the end to compute process runtime
start = time.time()

# Class representing the OS process and having memory_info() method to compute process memory usage
process = psutil.Process(os.getpid())

In [None]:
# Global Debug flag used to turn on and off more chatty blocks of code
gDEBUG = True
print('Debug is set to:', gDEBUG)
# Global Level of Detail of table stats and details
gLOD = 2
print('Level of Detail for functions is set to:', gLOD)

# Environment flags were set in the Detect Environment cell; echo them for reference
print("Is Kaggle?", gIS_KAGGLE, " | ", "Is Colab?", gIS_COLAB)

Level of Detail for functions is set to: 2


## Notebook Configuration

In [None]:
class CFG:
    seed = 27  # Random seed
    preset = "deberta_v3_extra_small_en" # Name of pretrained models
    sequence_length = 512  # Input sequence length
    epochs = 5 # Training epochs
    batch_size = 16  # Batch size
    scheduler = 'cosine'  # Learning rate scheduler
    label2name = {0: 'winner_model_a', 1: 'winner_model_b', 2: 'winner_tie'}
    name2label = {v:k for k, v in label2name.items()}
    class_labels = list(label2name.keys())
    class_names = list(label2name.values())

# Set the random seed to make results more reproducible across runs.
keras.utils.set_random_seed(CFG.seed)
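
`CFG.scheduler = 'cosine'` names a cosine learning-rate schedule, but the configuration itself does not define one. The following is a minimal sketch of one common form: cosine decay from a peak rate to a floor over the training epochs. The function name `cosine_lr` and the rates `lr_max` and `lr_min` are illustrative assumptions, not values taken from this notebook.

```python
import math

def cosine_lr(epoch, total_epochs, lr_max=1e-5, lr_min=1e-7):
    """Cosine decay from lr_max (first epoch) to lr_min (last epoch).

    lr_max and lr_min are hypothetical values chosen for illustration.
    """
    if total_epochs <= 1:
        return lr_max
    progress = epoch / (total_epochs - 1)                 # 0.0 -> 1.0 over training
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # 1.0 -> 0.0
    return lr_min + (lr_max - lr_min) * cosine

# Learning rate for each of the 5 epochs configured in CFG.epochs
schedule = [cosine_lr(e, total_epochs=5) for e in range(5)]
print(schedule)
```

A function like this could be wrapped for `keras.callbacks.LearningRateScheduler`, for example as `lambda epoch, lr: cosine_lr(epoch, CFG.epochs)`.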

# 2.&nbsp;Data Source

In this section, the code loads the dataset from Kaggle's input directory when running on Kaggle, or from a zip archive on Google Drive when running on Colab.

{{xxxxx}}

## Import the Data (Kaggle or Colab)

In [None]:
#print('os.environ: ', os.environ)

if 'KAGGLE_KERNEL_RUN_TYPE' in os.environ:
    print("Detected Kaggle environment - using Kaggle datasets")
elif 'COLAB_GPU' in os.environ or 'COLAB_TPU' in os.environ or 'COLAB_CPU' in os.environ:
    print("Detected Google Colab environment - using local datasets")
else:
    print("YIKES! I don't know where I am !!!!!!")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Environment Detection and Dataset Loading
# Detect whether we're running in Kaggle or Google Colab and load datasets accordingly

if gIS_KAGGLE:
    print("Detected Kaggle environment - using Kaggle datasets")

    # Dataset Path Configuration for Kaggle Environment
    # This allows access to the competition datasets stored in Kaggle's cloud storage
    GCS_PATH = '/kaggle/input/llm-classification-finetuning'

    # Load Dataset for train
    train_path = os.path.join(GCS_PATH, 'train.csv')
    train_df = pd.read_csv(train_path)
    print('Train Dataset Size:', len(train_df))

    # Load Dataset for test
    test_path = os.path.join(GCS_PATH, 'test.csv')
    test_df = pd.read_csv(test_path)
    print('Test Dataset Size:', len(test_df))

elif gIS_COLAB:
    print("Detected Google Colab environment - using local datasets")

    # Define the source of the zipped data files
    target_file = 'llm-classification-finetuning.zip'
    source_path_root = '/content/drive/MyDrive/[1.4] MsDS Class Files/-- DTSA 5511 Deep Learning/data'
    destination_path_root = '/content'

    # Copy the files to the runtime
    shutil.copy(source_path_root + '/' + target_file, destination_path_root + '/')

    # Display the files in the destination directory
    print('Files in destination directory:', os.listdir(destination_path_root + '/'))

    # Unzip the files (this is slow)
    zip_file_path = destination_path_root + '/' + target_file

    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        # Extract all the contents into the specified folder
        zip_ref.extractall(destination_path_root + '/' + 'llm-classification-finetuning')

    print('Dataset extraction completed')

    # Dataset Path Configuration for Google Colab Environment
    # Set up local file paths for the extracted dataset files
    COLAB_DATA_PATH = '/content/llm-classification-finetuning'

    # Load Dataset for train
    train_path = os.path.join(COLAB_DATA_PATH, 'train.csv')
    train_df = pd.read_csv(train_path)
    print('Train Dataset Size:', len(train_df))
    
    # Load Dataset for test
    test_path = os.path.join(COLAB_DATA_PATH, 'test.csv')
    test_df = pd.read_csv(test_path)
    print('Test Dataset Size:', len(test_df))

else:
    print("YIKES! I don't know where I am !!!!!!")
    # Fall back to empty frames so the checks below report the problem
    # instead of raising a NameError
    train_df = pd.DataFrame()
    test_df = pd.DataFrame()

# Verify the datasets are loaded correctly
if len(train_df) > 0:
    print(f"Successfully loaded {len(train_df)} training records")
else:
    print("No training files found. Check the dataset path.")

if len(test_df) > 0:
    print(f"Successfully loaded {len(test_df)} test records")
else:
    print("No test files found. Check the dataset path.")


## Data Preparation

In [None]:
{{xxxxx}}

## Address Missing Values

In [None]:
{{xxxxx}}

## Data Scoping Functions

In [None]:
{{xxxxx}}

## Scope the Label and Text for Analysis

In [None]:
{{xxxxx}}

# 3.&nbsp;Exploratory Data Analysis (EDA)

The EDA phase focuses on understanding the dataset, including its structure, label distribution, and text characteristics. Helper functions inspect the structure of the dataset, visualize the label distribution, and assess the character length and word count of the text. The data is found to be somewhat imbalanced across categories.
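
As a rough illustration of these checks, the sketch below computes a label distribution and basic text-length statistics with pandas. The toy DataFrame and its column names `text` and `label` are assumptions made for the example; they stand in for the real training data, whose columns may differ.

```python
import pandas as pd

# Toy data standing in for the real training set; column names are assumptions.
df = pd.DataFrame({
    "text": [
        "short reply",
        "a somewhat longer reply with more words",
        "ok",
        "another medium length reply",
    ],
    "label": ["winner_model_a", "winner_model_b", "winner_model_a", "winner_tie"],
})

# Label distribution: absolute counts and share per class
label_counts = df["label"].value_counts()
label_share = df["label"].value_counts(normalize=True).round(2)

# Text length in characters and whitespace-delimited word count
df["char_count"] = df["text"].str.len()
df["word_count"] = df["text"].str.split().str.len()

print(label_counts)
print(label_share)
print(df[["char_count", "word_count"]].describe())
```

Comparing `label_share` across classes gives a quick read on imbalance, and the `describe()` summary helps when choosing a maximum sequence length.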

## EDA Functions

In [None]:
{{xxxxx}}

## EDA Analysis: Overview

In [None]:
{{xxxxx}}

## EDA Analysis: Text Distribution

In [None]:
{{xxxxx}}

## EDA Results

ADD HERE

# 4.&nbsp;Train-Validation-Test Split

Split the dataset into training, validation, and test sets. Use stratified splitting so that the class distribution remains consistent across these sets. The distribution of records across the labels is visualized to confirm that each split preserves the class proportions.
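
One way to produce such a split is to call scikit-learn's `train_test_split` twice with the `stratify` argument. The sketch below is illustrative only: the toy data, the column names, and the 60/20/20 ratio are assumptions, and the seed of 27 simply mirrors `CFG.seed`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced label column (50% / 30% / 20%)
df = pd.DataFrame({
    "text": [f"sample {i}" for i in range(100)],
    "label": [0] * 50 + [1] * 30 + [2] * 20,
})

# First hold out the test set, then split the remainder into train and validation.
train_val_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=27
)
train_df, val_df = train_test_split(
    train_val_df, test_size=0.25, stratify=train_val_df["label"], random_state=27
)

# Each split should show roughly the same class shares as the full data
for name, part in [("train", train_df), ("val", val_df), ("test", test_df)]:
    shares = part["label"].value_counts(normalize=True).sort_index().round(2)
    print(name, len(part), shares.to_dict())
```

The second call uses `test_size=0.25` because a quarter of the remaining 80% yields a validation set equal to 20% of the original data.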

## Test Split Functions

In [None]:
{{xxxxx}}

## Test Split Analysis

In [None]:
{{xxxxx}}

# 5.&nbsp;Data Cleansing & Text Normalization

{{xxxxx}}

## Core Normalization Functions

In [None]:
{{xxxxx}}

## Apply Text Normalization

In [None]:
{{xxxxx}}

# 6.&nbsp;Feature Engineering with TF-IDF

The `TfidfVectorizer` from scikit-learn is used to convert the text documents into numerical features. The vectorizer tokenizes the collection of documents, counts token occurrences, and reweights those counts with the Term Frequency-Inverse Document Frequency (TF-IDF) transformation, so that terms appearing in many documents carry less weight. This matrix representation of the text data serves as input to the machine learning models.

## TF-IDF Functions

In [None]:
{{xxxxx}}

## Vectorization

In [None]:
{{xxxxx}}

# 7.&nbsp;Baseline Models: Supervised

## Model Functions

In [None]:
{{xxxxx}}

## Build, Train, and Evaluate the Model

In [None]:
{{xxxxx}}

# 8.&nbsp;Hyperparameter Tuning

## Tuning Functions

In [None]:
{{xxxxx}}

## Execute Hyperparameter Tuning

In [None]:
{{xxxxx}}

# 9.&nbsp;Final Prediction and Evaluation

## Evaluation Functions

In [None]:
{{xxxxx}}

## Train the Final Model

In [None]:
{{xxxxx}}

## Evaluate the Model

In [None]:
{{xxxxx}}

## Explore Errors

In [None]:
{{xxxxx}}

# 10.&nbsp;Scale the Auto-Classifier

## Auto-Classifier Functions

In [None]:
{{xxxxx}}

## Rerun Process for L1

In [None]:
{{xxxxx}}

## Rerun Process for L2

In [None]:
{{xxxxx}}

# 11.&nbsp;Conclusions

{{xxxxx}}

## Results Summary

### Model Result Summary


**Baseline Results**

{{xxxxx}}

**Hyperparameter Tuning Results**

{{xxxxx}}

**Best Model Results**

{{xxxxx}}

**Best Model Performance**

{{xxxxx}}

## Model Comparison

### Model Comparisons and Findings

{{xxxxx}}

#### Baseline Results

{{xxxxx}}

#### Hyperparameter Tuning

{{xxxxx}}

#### Best Model Results

{{xxxxx}}

#### Performance Breakdown (Best Model)

{{xxxxx}}

#### Conclusion

{{xxxxx}}

## Concluding Observations

### Patterns and Conclusions Across the Models

{{xxxxx}}

# 12.&nbsp;References

**Kaggle Competition**

- [1] Wei-lin Chiang, Lianmin Zheng, Lisa Dunlap, Joseph E. Gonzalez, Ion Stoica, Paul Mooney, Sohier Dane, Addison Howard, and Nate Keating. LLM Classification Finetuning. https://kaggle.com/competitions/llm-classification-finetuning, 2024. Kaggle.

**Documentation and References**

- [2] Addison Howard. LMSYS: KerasNLP Starter. https://www.kaggle.com/code/addisonhoward/lmsys-kerasnlp-starter, 2024. Kaggle.
- [3] tt195361. LMSYS: Keras NLP Starter with some changes. https://www.kaggle.com/code/tt195361/lmsys-keras-nlp-starter-with-some-changes#Data-Analysis, 2025. Kaggle.
- [4] Adel Anseur. LLM Classification finetuning DeBERTA. https://www.kaggle.com/code/adelanseur/llm-classification-finetuning-deberta, 2025. Kaggle.

**Prior Work Items Referenced**

- [5] Thomas Bohn. deep-learing-gan-monet-painting.ipynb. https://github.com/TOM-BOHN/MsDS-deep-learing-gan-monet-painting/tree/main, 2025.
- [6] Thomas Bohn. deep-learing-rnn-disaster-tweets.ipynb. https://github.com/TOM-BOHN/MsDS-deep-learing-rnn-disaster-tweets, 2025.

**AI Tools Leveraged**

- Cursor.AI was used to thoroughly document the getting-started code that was undocumented in the tutorial.
- Cursor.AI was used to support the formatting of markdown tables and text blocks.
- Cursor.AI was used to write git commit messages when writing to the repo and tracking checkpoints.
- Gemini AI was used to analyze and understand referenced code from other projects and repositories.
- Gemini AI was used to debug and resolve errors when running notebooks.
- Grammarly was used for spelling and grammar correction during the writing process.