# Text‑Category Classification Notebook
Fine-tunes **DistilBERT** as a multi-label classifier over four manually annotated binary labels: mental health, gender identity, racial identity, and queer identity.
This single notebook covers every step, from loading the dataset in Google Drive to saving a reusable model that can auto-annotate new texts.

## 🗺️ Road‑map (A → F)
1. **A. Load & preview** – Mount Drive, read the annotated spreadsheet (`combined_data_FINAL.xlsx`), and clean the `selftext` column into `cleaned_text`.
2. **B. Keep only text + labels** – Select `cleaned_text` and the four `label_*` columns; drop rows with missing values.
3. **C. Train/val split (80 / 20)** – Random shuffled split with a fixed seed (`random_state=42`). The split is not stratified, so label proportions may differ slightly between the two parts.
4. **D. Tokenise** – Hugging Face tokenizer (`distilbert-base-uncased`), padded or truncated to 128 tokens.
5. **E. Fine-tune** – 3 epochs via 🤗 `Trainer` (batch 16, lr 2e-5).
6. **F. Evaluate & save** – Print macro-F1 and micro-F1 on the 20 % split and save the model folder.

You can later load the model with `pipeline("text-classification", model="distilbert_multilabel", top_k=None)` and apply it to any unlabeled rows.

In [1]:
# ⚠️ Run once per new Colab session — then comment out to save time
!pip install -q transformers datasets accelerate evaluate --upgrade

In [2]:
import inspect, transformers

ta_cls = transformers.TrainingArguments
print("→  Using class from:", ta_cls.__module__)
print("→  First 15 __init__ args:\n",
      list(inspect.signature(ta_cls).parameters)[:15])

→  Using class from: transformers.training_args
→  First 15 __init__ args:
 ['output_dir', 'overwrite_output_dir', 'do_train', 'do_eval', 'do_predict', 'eval_strategy', 'prediction_loss_only', 'per_device_train_batch_size', 'per_device_eval_batch_size', 'per_gpu_train_batch_size', 'per_gpu_eval_batch_size', 'gradient_accumulation_steps', 'eval_accumulation_steps', 'eval_delay', 'torch_empty_cache_steps']
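
The signature check above matters because older `transformers` releases named this argument `evaluation_strategy`, while the version installed here exposes `eval_strategy`. A minimal sketch of picking whichever name is available is shown below; `pick_kwarg` and the two stand-in functions are hypothetical helpers, not part of `transformers`.

```python
import inspect

def pick_kwarg(callable_obj, candidates):
    """Return the first name in `candidates` that `callable_obj` accepts."""
    params = inspect.signature(callable_obj).parameters
    for name in candidates:
        if name in params:
            return name
    raise TypeError(f"none of {candidates} is accepted by {callable_obj!r}")

# Stand-ins for an old and a new TrainingArguments signature (illustrative only)
def old_api(output_dir, evaluation_strategy="no"):
    return evaluation_strategy

def new_api(output_dir, eval_strategy="no"):
    return eval_strategy

names = ["eval_strategy", "evaluation_strategy"]
print(pick_kwarg(old_api, names), pick_kwarg(new_api, names))
```

In the notebook this could be used as `key = pick_kwarg(transformers.TrainingArguments, names)` followed by `TrainingArguments(..., **{key: "epoch"})`.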


In [3]:
import re
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import AutoTokenizer
import pandas as pd
import torch

In [4]:
from transformers.training_args import TrainingArguments

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
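def clean_text(text):
    """Strip URLs, quoted lines, and unusual symbols from a post body."""
    if isinstance(text, str):
        text = re.sub(r'http\S+', '', text)                 # remove URLs
        text = re.sub(r'&gt;.*?\n', '', text)               # remove HTML-escaped quote lines ("&gt; ...")
        text = re.sub(r'\s+', ' ', text)                    # collapse runs of whitespace
        text = re.sub(r'[^\w\s.,!?;:\-\'\"()]', '', text)   # keep word characters and basic punctuation
        return text.strip()
    return ""  # non-string cells (e.g. NaN) become empty text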
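
One side effect of the final character filter is that typographic quotes (for example U+2019) are deleted, which may be why previews further down show words such as "dont" and "Im". The sketch below, which is an optional addition and not part of the original cleaning step, maps curly quotes to ASCII before filtering; `normalize_quotes` and `QUOTE_MAP` are hypothetical names.

```python
import re

# Same character filter as in clean_text
ALLOWED = r'[^\w\s.,!?;:\-\'\"()]'

# Map typographic quotes to their ASCII equivalents
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
})

def normalize_quotes(text):
    """Replace curly quotes so the filter keeps contractions intact."""
    return text.translate(QUOTE_MAP)

sample = "I don\u2019t know"
without_fix = re.sub(ALLOWED, '', sample)
with_fix = re.sub(ALLOWED, '', normalize_quotes(sample))
print(without_fix)
print(with_fix)
```

If adopted, the call would go at the top of `clean_text`, before the regular expressions run.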

In [8]:
file_fold = '/content/drive/Shareddrives/CSC 5541/Final Project/Annotation/'
combined_df = pd.read_excel(file_fold + 'combined_data_FINAL.xlsx')
text_column = 'selftext'
if text_column in combined_df.columns:
    # Clean the raw post body, then record character counts as a sanity check
    combined_df['cleaned_text'] = combined_df[text_column].apply(clean_text)
    combined_df['text_length'] = combined_df['cleaned_text'].apply(len)
    print("Text length statistics:")
    print(combined_df['text_length'].describe())

Text length statistics:
count     1841.000000
mean       998.475285
std       1007.934861
min          0.000000
25%        346.000000
50%        718.000000
75%       1308.000000
max      11476.000000
Name: text_length, dtype: float64


In [13]:
print(f"Columns: {combined_df.columns.tolist()}")

Columns: ['author', 'created_utc', 'score', 'selftext', 'subreddit', 'title', 'timestamp', 'label_mental_health', 'disorder', 'diagnoised', 'seekinghelp_copingmechanisms', 'details_mh', 'label_gender_identity', 'matched_gender_word', 'gender_identity', 'details_gender', 'label_racial_identity', 'matched_racial_word', 'race_identity', 'race_identity_specific', 'label_queer_identity', 'matched_queer_word', 'queer_identity', 'extra_comments', 'unnamed: 24', 'cleaned_text', 'text_length']


In [14]:


TEXT_COL   = "cleaned_text"
LABEL_COLS = [
    "label_mental_health",
    "label_gender_identity",
    "label_racial_identity",
    "label_queer_identity",
]

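# Keep rows where the text and all 4 labels are non-null.
# Note: clean_text returns "" for missing text, so empty strings are NOT dropped here
# (the minimum text_length above is 0).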
df = combined_df[[TEXT_COL] + LABEL_COLS].dropna()
print("Shape after drop‑na:", df.shape)
df.head()

Shape after drop‑na: (1841, 5)


Unnamed: 0,cleaned_text,label_mental_health,label_gender_identity,label_racial_identity,label_queer_identity
0,I dont know if this flair counts as what Im pa...,1,0,0,0
1,I was supposed to move and start college in ap...,1,0,0,0
2,"I feel so nauseas, Ive thrown up twice and I f...",1,0,0,0
3,"It may be too soon to write this, but having h...",1,0,0,0
4,I Feel like my mood and anxiety is more manage...,1,0,0,0


In [15]:
train_df, val_df = train_test_split(
    df, test_size=0.20, random_state=42, shuffle=True
)
print(f"train {len(train_df)}  |  val {len(val_df)}")

train 1472  |  val 369
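
The split above is purely random. If label proportions should be preserved, one common workaround for multi-label data is to stratify on the joint label combination. The sketch below builds such keys with the standard library only; `stratify_keys`, `min_count`, and `rare_key` are hypothetical names, and merging rare combinations into one bucket is an assumption made so that every stratum has enough members.

```python
from collections import Counter

def stratify_keys(label_rows, min_count=2, rare_key="rare"):
    """Turn each row of 0/1 labels into a string key such as '1000'.

    Combinations seen fewer than `min_count` times are merged into
    `rare_key`, because stratified splitting needs at least two
    members per stratum.
    """
    keys = ["".join(str(int(v)) for v in row) for row in label_rows]
    counts = Counter(keys)
    return [k if counts[k] >= min_count else rare_key for k in keys]

rows = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
print(stratify_keys(rows))
```

In the notebook, the keys could be computed from `df[LABEL_COLS].values.tolist()` and passed as `stratify=` to `train_test_split`. If the merged bucket itself ends up with a single member, the split would still fail, so the result is worth checking first.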


In [17]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(
        batch[TEXT_COL],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

def to_dataset(pdf):
    ds = Dataset.from_pandas(
        pdf[[TEXT_COL] + LABEL_COLS].reset_index(drop=True)
    ).map(tokenize, batched=True)
    ds = ds.remove_columns([TEXT_COL])

    # Trainer expects a single column named "labels". For multi-label training
    # it must hold a list of floats, one per label, in LABEL_COLS order.
    def stack_labels(example):
        example["labels"] = [float(example[col]) for col in LABEL_COLS]
        return example

    # Drop the four original label columns once they are merged
    return ds.map(stack_labels, remove_columns=LABEL_COLS)

train_ds = to_dataset(train_df)
val_ds   = to_dataset(val_df)

train_ds[0]


{'input_ids': [101, 1045, 1005, 1049, 2471, 2589, 2007, 2118, 1010, 1045, 2342, 2178,
  2465, 1998, 1045, 1005, 2222, 2633, 4374, 2026, 9827, 1012, 2082, 4627,
  1999, 2397, 17419, 1010, 2021, 1045, 2787, 1045, 2052, 2202, 2621, 4280,
  2144, 2009, 2052, 2022, 2204, 2000, 2031, 2122, 4280, 2104, 2026, 5583,
  1012, 2138, 1045, 2069, 2342, 2028, 2062, 2465, 1010, 1045, 2342, 2000,
  2994, 1999, 2026, 2267, 2237, 2013, 17419, 1011, 11703, 1010, 2061, 1045,
  2134, 1005, 1056, 3696, 1037, 10084, 1012, 1045, 2134, 1005, 1056, 2215,
  2000, 10797, 2000, 1037, 2173, 1998, 2059, 6911, 2055, 2383, 2000, 3477,
  9278, 2005, 1996, 2206, 2095, 2043, 1045, 2876, 1005, 1056, 2022, 2542,
  2045, 1012, 1996, 2711, 2008, 2003, 3048, 2046, 2026, 2282, 3322, 2056,
  1045, 2071, 2693,
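
With `max_length=128` and a median cleaned text of 718 characters, a fair share of posts is probably cut off by truncation. A minimal sketch for quantifying this is shown below; `truncation_rate` is a hypothetical helper, and the example lengths are made up for illustration.

```python
def truncation_rate(token_lengths, max_length=128):
    """Fraction of examples whose token count exceeds `max_length`."""
    if not token_lengths:
        return 0.0
    truncated = sum(1 for n in token_lengths if n > max_length)
    return truncated / len(token_lengths)

# Illustrative token counts only, not measured from the dataset
example_lengths = [40, 128, 129, 300]
print(truncation_rate(example_lengths))
```

In the notebook, the real lengths could be obtained with `[len(ids) for ids in tokenizer(texts, truncation=False)["input_ids"]]`, where `texts` is the list of cleaned posts. A high rate would suggest trying a larger `max_length` (DistilBERT accepts up to 512 tokens).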

In [18]:
from transformers import (
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
import torch
from sklearn.metrics import f1_score

num_labels = 4
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=num_labels,
    problem_type="multi_label_classification",
)

# ---------- metrics ----------
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs  = torch.sigmoid(torch.tensor(logits)).numpy()
    preds  = (probs >= 0.5).astype(int)

    macro_f1 = f1_score(labels, preds, average="macro")
    micro_f1 = f1_score(labels, preds, average="micro")
    return {"macro_f1": macro_f1, "micro_f1": micro_f1}

# ---------- training args ----------
args = TrainingArguments(
    output_dir="ml_checkpoints",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
)

trainer.train()


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Macro F1,Micro F1
1,No log,0.380258,0.216154,0.70603
2,No log,0.346758,0.479724,0.760845
3,No log,0.32211,0.499603,0.786885


TrainOutput(global_step=276, training_loss=0.36092003532077954, metrics={'train_runtime': 3630.7417, 'train_samples_per_second': 1.216, 'train_steps_per_second': 0.076, 'total_flos': 146249224224768.0, 'train_loss': 0.36092003532077954, 'epoch': 3.0})
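
The "No log" entries in the Training Loss column are most likely a logging-interval effect rather than a training problem. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, and the `TrainingArguments` default of logging every 500 steps; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

ASSUMED_LOGGING_STEPS = 500  # assumed default logging interval

steps = total_optimizer_steps(num_examples=1472, batch_size=16, epochs=3)
print(steps, steps < ASSUMED_LOGGING_STEPS)
```

The result matches `global_step=276` in the output above, so the run ends before the first training-loss log is written. Passing a smaller `logging_steps` or `logging_strategy="epoch"` to `TrainingArguments` should populate that column.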

In [19]:
metrics = trainer.evaluate()
print("Validation metrics:", metrics)

save_dir = "distilbert_multilabel"
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)
print("✓ saved to", save_dir)


Validation metrics: {'eval_loss': 0.32211026549339294, 'eval_macro_f1': 0.49960336937714117, 'eval_micro_f1': 0.7868852459016393, 'eval_runtime': 81.5607, 'eval_samples_per_second': 4.524, 'eval_steps_per_second': 0.294, 'epoch': 3.0}
✓ saved to distilbert_multilabel
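
Once the folder is saved, raw model outputs for new texts still need to be turned into label names. The sketch below applies the same sigmoid and 0.5 threshold used in `compute_metrics`; `decode_logits` is a hypothetical helper and the logits are made up for illustration.

```python
import numpy as np

LABEL_COLS = [
    "label_mental_health",
    "label_gender_identity",
    "label_racial_identity",
    "label_queer_identity",
]

def decode_logits(logits, label_names=LABEL_COLS, threshold=0.5):
    """Map each row of logits to the label names whose sigmoid score passes the threshold."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return [
        [name for name, p in zip(label_names, row) if p >= threshold]
        for row in probs
    ]

# Illustrative logits for two texts
example_logits = [[2.0, -1.0, -0.5, -3.0], [-2.0, -2.0, -2.0, -2.0]]
print(decode_logits(example_logits))
```

A text can receive several labels or none at all, which is the expected behaviour for a multi-label model.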


In [20]:
import numpy as np, torch
from sklearn.metrics import f1_score, accuracy_score

# 1) get raw predictions
pred_output = trainer.predict(val_ds)         # returns object with .predictions & .label_ids
logits  = pred_output.predictions
labels  = pred_output.label_ids

# 2) convert logits → probabilities → binary preds
probs = torch.sigmoid(torch.tensor(logits)).numpy()
preds = (probs >= 0.5).astype(int)

# 3) metrics
macro_f1   = f1_score(labels, preds, average="macro")
micro_f1   = f1_score(labels, preds, average="micro")
micro_acc  = accuracy_score(labels.flatten(), preds.flatten())      # per‑label accuracy
subset_acc = (labels == preds).all(axis=1).mean()                  # exact‑match accuracy

print(f"macro‑F1     : {macro_f1:.3f}")
print(f"micro‑F1     : {micro_f1:.3f}")
print(f"micro‑accuracy: {micro_acc:.3f}")
print(f"subset‑accuracy: {subset_acc:.3f}")


macro‑F1     : 0.500
micro‑F1     : 0.787
micro‑accuracy: 0.877
subset‑accuracy: 0.593
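
Macro-F1 is the unweighted mean of the per-label F1 scores, so a gap between macro-F1 and micro-F1 usually means that some labels score much lower than others. The sketch below computes F1 per label with NumPy on a toy example; `per_label_f1` is a hypothetical helper, and the toy arrays are not taken from the validation set.

```python
import numpy as np

def per_label_f1(labels, preds):
    """F1 for each label column; a label with no positives and no predictions scores 0."""
    labels = np.asarray(labels).astype(bool)
    preds = np.asarray(preds).astype(bool)
    tp = (labels & preds).sum(axis=0)
    fp = (~labels & preds).sum(axis=0)
    fn = (labels & ~preds).sum(axis=0)
    denom = 2 * tp + fp + fn
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

# Toy data: label 0 is predicted perfectly, label 1 is never predicted
toy_labels = [[1, 0], [1, 0], [0, 1]]
toy_preds = [[1, 0], [1, 0], [0, 0]]
scores = per_label_f1(toy_labels, toy_preds)
print(scores, scores.mean())
```

Running `per_label_f1(labels, preds)` on the arrays from the cell above, or equivalently `f1_score(labels, preds, average=None)`, would show which of the four labels pull the macro average down.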
