# Models

## distilbert-base-uncased-finetuned-sst-2-english

Binary sentiment (POSITIVE / NEGATIVE)

https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english

## SamLowe/roberta-base-go_emotions

28 labels: 27 fine-grained emotions (admiration, amusement, ..., surprise) plus neutral

https://huggingface.co/SamLowe/roberta-base-go_emotions

## unitary/unbiased-toxic-roberta

Toxicity & six sub-types (toxic, severe_toxic, obscene, etc.)

https://huggingface.co/unitary/unbiased-toxic-roberta

## Hate-speech-CNERG/dehatebert-mono-english

Hate / non-hate

https://huggingface.co/Hate-speech-CNERG/dehatebert-mono-english

## classla/multilingual-IPTC-news-topic-classifier

17 top-level IPTC Media Topic NewsCodes labels (e.g., crime, law and justice; arts, culture, entertainment and media; health)

https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier

# Prerequisites

In [1]:
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

In [2]:
# pip install transformers datasets evaluate bitsandbytes accelerate peft

Check that CUDA is enabled

In [3]:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available in PyTorch: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")

PyTorch version: 2.7.0+cu128
CUDA available in PyTorch: True
CUDA version: 12.8
GPU: NVIDIA GeForce RTX 3060
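
The cells further down hardcode `.to("cuda")` and `device=0`, so they would fail on a machine where this check prints `False`. A minimal sketch of a fallback, assuming a hypothetical helper named `pick_device` that is not part of the original notebook:

```python
def pick_device():
    """Return "cuda" when PyTorch is installed and sees a GPU, else "cpu".

    Hypothetical helper for illustration; not part of the original notebook.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is missing entirely, so only the CPU is usable
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The resulting string could then be passed to `.to(device)` in place of the literal `"cuda"`.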


In [4]:
import pandas as pd
import numpy as np
from tqdm import tqdm

# Chosen model

distilbert-base-uncased-finetuned-sst-2-english

In [5]:
# from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# import torch

# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_quant_type="nf4"
# )

# model_name = "microsoft/Phi-3-mini-4k-instruct"
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     torch_dtype=torch.float16,
#     device_map="auto",
#     quantization_config=quantization_config
# )

In [6]:
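from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          pipeline)

# GoEmotions, "simplified" config: a text plus a list of label ids per example
ds = load_dataset("go_emotions", "simplified")

tok_r = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
mdl_r = AutoModelForSequenceClassification.from_pretrained(
            "SamLowe/roberta-base-go_emotions").to("cuda")

# top_k=None returns a score for every label, not only the best one
clf = pipeline("text-classification",
               model=mdl_r, tokenizer=tok_r,
               device=0, batch_size=32, top_k=None)

def predict_split(split):
    # list(...) gives the pipeline a plain list of strings;
    # truncation matches the settings used for the train split later on
    out = clf(list(split["text"]), truncation=True, max_length=512)
    # One {label: score} dict per example
    return [{d["label"]: d["score"] for d in row} for row in out]

# Despite the "logits" name, these are the pipeline's scores in [0, 1],
# not raw logits
valid_logits  = predict_split(ds["validation"])
test_logits   = predict_split(ds["test"])
ds["validation"] = ds["validation"].add_column("roberta_logits", valid_logits)
ds["test"]       = ds["test"].add_column("roberta_logits",  test_logits)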


Device set to use cuda:0


In [7]:
print(ds['train'][0])
valid_logits

{'text': "My favourite food is anything I didn't have to cook myself.", 'labels': [27], 'id': 'eebbqej'}


[{'curiosity': 0.5812438726425171,
  'confusion': 0.4763413965702057,
  'neutral': 0.14794330298900604,
  'surprise': 0.03707505390048027,
  'approval': 0.03249756991863251,
  'realization': 0.026919247582554817,
  'excitement': 0.00906699150800705,
  'admiration': 0.00586309190839529,
  'annoyance': 0.005148881580680609,
  'love': 0.004726118873804808,
  'optimism': 0.004268970340490341,
  'disapproval': 0.004032035358250141,
  'fear': 0.0029301804024726152,
  'disappointment': 0.0025223693810403347,
  'amusement': 0.0021781963296234608,
  'joy': 0.0020193683449178934,
  'disgust': 0.0019350197399035096,
  'nervousness': 0.0017559712287038565,
  'desire': 0.001729933894239366,
  'sadness': 0.0016860521864145994,
  'anger': 0.0014970538904890418,
  'embarrassment': 0.0014875102788209915,
  'caring': 0.0014092655619606376,
  'gratitude': 0.0010556896449998021,
  'remorse': 0.0007564566330984235,
  'grief': 0.00045813751057721674,
  'relief': 0.0004380107275210321,
  'pride': 0.000156482
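
The scores in this output do not sum to 1 (curiosity and confusion alone add up to more than 1), so each label is scored independently and an example can carry several emotions at once. A minimal sketch of turning one score dict into predicted labels, assuming a hypothetical helper `labels_above_threshold` and a cut-off of 0.5 that the notebook itself does not specify; the example values are rounded from the output above:

```python
def labels_above_threshold(scores, threshold=0.5):
    """Return the labels scoring at least `threshold`, highest score first.

    Hypothetical helper; the 0.5 default is an assumption, not a tuned value.
    """
    picked = [(label, score) for label, score in scores.items() if score >= threshold]
    picked.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in picked]


# Rounded excerpt of the first validation prediction shown above
example = {"curiosity": 0.5812, "confusion": 0.4763,
           "neutral": 0.1479, "surprise": 0.0371}

print(labels_above_threshold(example))       # ['curiosity']
print(labels_above_threshold(example, 0.3))  # ['curiosity', 'confusion']
```

Lowering the threshold trades precision for recall, which is why the second call also returns `confusion`.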

# Creating the training dataset

In [8]:
import pandas as pd
import numpy as np
from datasets import Dataset
from tqdm import tqdm


In [9]:
emotion_labels = ds["validation"].features["labels"].feature.names
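
Each example's `labels` field holds integer ids (for instance `[27]` in the first training example), and `emotion_labels` maps those ids to names. To compare the gold labels with per-emotion scores, it can help to expand the ids into a 0/1 vector. A minimal sketch, assuming a hypothetical helper `to_multi_hot` and a shortened toy label list in place of the real one:

```python
def to_multi_hot(label_ids, num_labels):
    """Turn a list of label ids into a 0/1 vector of length num_labels."""
    vector = [0] * num_labels
    for label_id in label_ids:
        vector[label_id] = 1
    return vector


# Hypothetical stand-in for emotion_labels; the real list has 28 names
toy_labels = ["admiration", "amusement", "anger", "neutral"]

multi_hot = to_multi_hot([1, 3], num_labels=len(toy_labels))
print(multi_hot)  # [0, 1, 0, 1]
print([name for name, flag in zip(toy_labels, multi_hot) if flag])  # ['amusement', 'neutral']
```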

In [10]:
def create_emotion_probabilities_df(dataset_split):
    """Return a DataFrame with the input text and one score column per emotion."""
    texts = []
    emotion_probs = {emotion: [] for emotion in emotion_labels}

    for text, roberta_logits in zip(dataset_split["text"], dataset_split["roberta_logits"]):
        texts.append(text)

        for emotion in emotion_labels:
            # Labels missing from a prediction default to a score of 0.0
            prob = roberta_logits.get(emotion, 0.0)
            emotion_probs[emotion].append(prob)

    # Text column first, then the emotions in emotion_labels order
    df_data = {'input_text': texts}
    df_data.update(emotion_probs)

    return pd.DataFrame(df_data)

def predict_with_dataset_optimized(dataset_split, batch_size=128):
    """Score every text through Dataset.map, batch_size texts per call."""
    temp_dataset = Dataset.from_dict({"text": list(dataset_split["text"])})

    def predict_batch(batch):
        try:
            outputs = clf(batch["text"], truncation=True, max_length=512)
            predictions = [{d["label"]: d["score"] for d in row} for row in outputs]
        except Exception as e:
            # Report the failure instead of silently emitting all-zero scores
            print(f"Batch failed ({e!r}); filling {len(batch['text'])} rows with zeros")
            predictions = [{label: 0.0 for label in emotion_labels} for _ in batch["text"]]
        return {"roberta_logits": predictions}

    result_dataset = temp_dataset.map(
        predict_batch,
        batched=True,
        batch_size=batch_size,
        remove_columns=["text"]
    )

    # Return a plain list so it can be passed to add_column
    return list(result_dataset["roberta_logits"])

def predict_with_dataset_progress(dataset_split, batch_size=128, chunk_size=5000):
    """Fallback: score the texts chunk by chunk, with a tqdm bar over the chunks."""
    texts = list(dataset_split["text"])
    all_predictions = []

    def predict_batch(batch):
        outputs = clf(batch["text"], truncation=True, max_length=512)
        return {"roberta_logits": [{d["label"]: d["score"] for d in row} for row in outputs]}

    for start_idx in tqdm(range(0, len(texts), chunk_size)):
        # Slicing past the end of a list is safe, so no explicit min() is needed
        chunk_dataset = Dataset.from_dict({"text": texts[start_idx:start_idx + chunk_size]})

        chunk_result = chunk_dataset.map(
            predict_batch,
            batched=True,
            batch_size=batch_size,
            remove_columns=["text"]
        )

        all_predictions.extend(chunk_result["roberta_logits"])

    return all_predictions

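# Score the train split only if it has not been scored already
if 'roberta_logits' not in ds['train'].column_names:
    try:
        train_logits = predict_with_dataset_optimized(ds["train"], batch_size=128)
    except Exception as e:
        # Fall back to the chunked version with a smaller batch size
        print(f"Optimized path failed ({e!r}); retrying in chunks")
        train_logits = predict_with_dataset_progress(ds["train"], batch_size=64)

    ds["train"] = ds["train"].add_column("roberta_logits", train_logits)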

train_df = create_emotion_probabilities_df(ds["train"])
validation_df = create_emotion_probabilities_df(ds["validation"])
test_df = create_emotion_probabilities_df(ds["test"])

Map:   2%|▏         | 1024/43410 [00:01<00:59, 718.00 examples/s]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Map: 100%|██████████| 43410/43410 [00:59<00:00, 728.27 examples/s]


In [11]:
# Same call as at the end of the previous cell; only needed if ds["train"] changed
train_df = create_emotion_probabilities_df(ds["train"])

In [12]:
train_df

| | input_text | admiration | amusement | anger | annoyance | approval | caring | confusion | curiosity | desire | ... | love | nervousness | optimism | pride | realization | relief | remorse | sadness | surprise | neutral |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | My favourite food is anything I didn't have to... | 0.161234 | 0.001972 | 0.001539 | 0.004159 | 0.233412 | 0.001775 | 0.001889 | 0.001202 | 0.002167 | ... | 0.421444 | 0.000364 | 0.002896 | 0.003201 | 0.013179 | 0.002387 | 0.000306 | 0.000815 | 0.001866 | 0.100186 |
| 1 | Now if he does off himself, everyone will thin... | 0.000930 | 0.134094 | 0.003719 | 0.026139 | 0.011867 | 0.001842 | 0.001364 | 0.000668 | 0.001538 | ... | 0.000506 | 0.000549 | 0.006830 | 0.000483 | 0.013376 | 0.000889 | 0.000497 | 0.001741 | 0.000749 | 0.840025 |
| 2 | WHY THE FUCK IS BAYLESS ISOING | 0.005290 | 0.004081 | 0.781201 | 0.122706 | 0.003216 | 0.001910 | 0.008621 | 0.012958 | 0.001135 | ... | 0.002823 | 0.000530 | 0.002151 | 0.000427 | 0.002942 | 0.000247 | 0.000637 | 0.004392 | 0.005214 | 0.105964 |
| 3 | To make her feel threatened | 0.001722 | 0.001781 | 0.006446 | 0.019079 | 0.008645 | 0.030293 | 0.001861 | 0.001075 | 0.004182 | ... | 0.000735 | 0.035958 | 0.008934 | 0.000745 | 0.008453 | 0.002275 | 0.001704 | 0.028242 | 0.001346 | 0.596796 |
| 4 | Dirty Southern Wankers | 0.004310 | 0.002896 | 0.484007 | 0.412610 | 0.006365 | 0.001364 | 0.001875 | 0.001695 | 0.001533 | ... | 0.001446 | 0.000364 | 0.001642 | 0.000530 | 0.003438 | 0.000259 | 0.000684 | 0.003494 | 0.001129 | 0.081161 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 43405 | Added you mate well I’ve just got the bow and ... | 0.026941 | 0.007407 | 0.003022 | 0.004451 | 0.044037 | 0.006570 | 0.002259 | 0.001861 | 0.009586 | ... | 0.879727 | 0.001147 | 0.008405 | 0.001667 | 0.006569 | 0.003192 | 0.001235 | 0.004176 | 0.002155 | 0.018019 |
| 43406 | Always thought that was funny but is it a refe... | 0.003609 | 0.134665 | 0.002364 | 0.011986 | 0.016143 | 0.002777 | 0.742798 | 0.473192 | 0.002103 | ... | 0.005241 | 0.004450 | 0.009182 | 0.000210 | 0.046299 | 0.001058 | 0.002461 | 0.003309 | 0.015203 | 0.048970 |
| 43407 | What are you talking about? Anything bad that ... | 0.001845 | 0.000994 | 0.022686 | 0.173971 | 0.018671 | 0.005762 | 0.177501 | 0.349683 | 0.001740 | ... | 0.001105 | 0.004448 | 0.005316 | 0.000227 | 0.014244 | 0.000642 | 0.007382 | 0.026744 | 0.003809 | 0.205243 |
| 43408 | More like a baptism, with sexy results! | 0.218485 | 0.005374 | 0.001724 | 0.003316 | 0.053343 | 0.000947 | 0.001491 | 0.002307 | 0.003576 | ... | 0.014900 | 0.001004 | 0.004851 | 0.010119 | 0.007214 | 0.002234 | 0.000232 | 0.000925 | 0.018660 | 0.136118 |
