<a href="https://colab.research.google.com/github/KKeshav1101/HateSpeechDetection-MiniProject/blob/main/Transformer_on_augmented_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import re
import numpy as np
import pandas as pd
import tensorflow as tf
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score
from imblearn.over_sampling import SMOTE
from tensorflow.keras.optimizers.legacy import Adam  # Ensure compatibility

# Load dataset
df = pd.read_csv("Ethos_Dataset_Binary.csv", delimiter=";")

# Convert labels to binary
df['isHate'] = (df['isHate'] >= 0.5).astype(int)

# Clean text
def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # Remove special characters
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    return text

df['cleaned_comment'] = df['comment'].apply(clean_text)

# Load DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize data
def encode_texts(texts, max_length=128):
    return tokenizer(
        texts.tolist(),
        max_length=max_length,
        truncation=True,
        padding='max_length',
        return_tensors='np'
    )

encoded_data = encode_texts(df['cleaned_comment'])
input_ids = encoded_data['input_ids']
attention_mask = encoded_data['attention_mask']
labels = df['isHate'].values

# Train-test split before SMOTE
X_train_ids, X_test_ids, X_train_mask, X_test_mask, y_train, y_test = train_test_split(
    input_ids, attention_mask, labels, test_size=0.2, random_state=42, stratify=labels
)

# Flatten input for SMOTE (SMOTE requires 2D input)
X_train_flat = X_train_ids.reshape(X_train_ids.shape[0], -1)

# Apply SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_flat, y_train)

# Reshape back to original form
X_train_resampled = X_train_resampled.reshape(-1, X_train_ids.shape[1])

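# Rebuild the attention mask row by row from the resampled IDs (1 = token, 0 = padding).
# Caveat: SMOTE interpolates token IDs, so synthetic rows do not correspond to real text.
X_train_mask_resampled = (X_train_resampled != tokenizer.pad_token_id).astype(X_train_mask.dtype)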

# Convert to TensorFlow dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    {'input_ids': X_train_resampled, 'attention_mask': X_train_mask_resampled}, y_train_resampled
)).shuffle(1000).batch(32)

test_dataset = tf.data.Dataset.from_tensor_slices((
    {'input_ids': X_test_ids, 'attention_mask': X_test_mask}, y_test
)).batch(32)

# Load pre-trained DistilBERT model
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

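# Compile model with a small fine-tuning learning rate (the "adam" string defaults to 1e-3).
# 2e-5 is a common starting point for fine-tuning, not a tuned value.
optimizer = Adam(learning_rate=2e-5)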
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

# Train model with early stopping
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)

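# Note: validation_data is the test split, so early stopping is selected on test data.
history = model.fit(
    train_dataset,
    validation_data=test_dataset,
    epochs=5,
    callbacks=[early_stopping],  # without this, the callback defined above is never used
)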

# Evaluate model
y_pred_test_logits = model.predict(test_dataset).logits
y_pred_test = np.argmax(y_pred_test_logits, axis=1)

# Print final results
print("Final Test Accuracy:", accuracy_score(y_test, y_pred_test))
print("Final Test F1 Score:", f1_score(y_test, y_pred_test))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_test))
print("Classification Report:\n", classification_report(y_test, y_pred_test))



Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Final Test Accuracy: 0.565
Final Test F1 Score: 0.0
Confusion Matrix:
 [[113   0]
 [ 87   0]]
Classification Report:
               precision    recall  f1-score   support

           0       0.56      1.00      0.72       113
           1       0.00      0.00      0.00        87

    accuracy                           0.56       200
   macro avg       0.28      0.50      0.36       200
weighted avg       0.32      0.56      0.41       200
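
The reported numbers can be re-derived from the confusion matrix alone. The sketch below copies the matrix from the output above and recomputes accuracy and the positive-class scores by hand. It suggests that the 0.565 accuracy is simply the share of class 0 in the test split, because no sample was predicted as class 1. The variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[113, 0],
               [87, 0]])

total = cm.sum()
accuracy = np.trace(cm) / total          # correct predictions / all samples
majority_share = cm[0].sum() / total     # share of true class 0 in the test split

true_positive = cm[1, 1]
predicted_positive = cm[:, 1].sum()      # how often class 1 was predicted at all
actual_positive = cm[1].sum()

# Guard against division by zero when class 1 is never predicted.
precision_pos = true_positive / predicted_positive if predicted_positive else 0.0
recall_pos = true_positive / actual_positive
denominator = precision_pos + recall_pos
f1_pos = 2 * precision_pos * recall_pos / denominator if denominator else 0.0

print("accuracy:", accuracy)
print("majority-class share:", majority_share)
print("predicted positives:", predicted_positive)
print("positive-class F1:", f1_pos)
```

When accuracy equals the majority-class share and the predicted-positive count is zero, the classifier has collapsed to a constant prediction, so the accuracy figure carries no information about hate-speech detection.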




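
SMOTE creates synthetic rows by interpolating between feature vectors, which is not meaningful for integer token IDs. One alternative, swapped in here purely as an illustration and not the notebook's method, is plain random oversampling: repeat existing minority-class rows so that every training row is still a real tokenized comment with its own attention mask. The helper `oversample_indices` and the toy arrays below are hypothetical stand-ins for `X_train_ids`, `X_train_mask`, and `y_train`.

```python
import numpy as np

def oversample_indices(labels, seed=42):
    """Return shuffled row indices that balance classes by repeating minority rows."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(labels == cls)
        # Draw extra rows (with replacement) until this class reaches the target size.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(parts)
    rng.shuffle(idx)
    return idx

# Toy stand-ins for the tokenized training split (0 is the padding ID here).
toy_ids = np.array([[101, 7, 102, 0],
                    [101, 8, 9, 102],
                    [101, 5, 102, 0]])
toy_mask = (toy_ids != 0).astype(int)
toy_labels = np.array([0, 0, 1])

idx = oversample_indices(toy_labels)
# Index IDs, masks, and labels with the same indices so the rows stay aligned.
balanced_ids = toy_ids[idx]
balanced_mask = toy_mask[idx]
balanced_labels = toy_labels[idx]

print(np.bincount(balanced_labels))
```

Because the same index array selects the IDs, the masks, and the labels, each duplicated row keeps the mask that matches its padding. The usual trade-off is that duplicated rows add no new information and can encourage overfitting on the minority class.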