# <center><font color='blue'>HATE SPEECH DETECTION</font></center>

## <font color='#2471A3'> Table of contents </font>

- [1 - Objectives](#1)
- [2 - Setup](#2)
- [3 - Data loading and pre-processing](#3)
- [4 - Model](#4)
- [5 - Predictions](#5)
- [6 - References](#6)


<a name="1"></a>
## <b> <font color='blue'> 1. Objectives </font> </b>

The goal is to test a state-of-the-art model on an NLP problem. In this notebook, we will use a BERT-based model to classify tweets as hate speech, offensive language, or neither.

<a name="2"></a>
## <b> <font color='blue'> 2. Setup </font> </b>

In [66]:
import pandas as pd
from sklearn.model_selection import train_test_split
import re
import tensorflow as tf

<a name="3"></a>
## <b> <font color='blue'> 3. Data loading and pre-processing </font> </b>

We will study a hate speech detection problem. The data was obtained from the [Hate Speech and Offensive Language Dataset](https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset) on Kaggle.


In [2]:
!ls data

labeled_data.csv


In [3]:
df = pd.read_csv('data/labeled_data.csv')

In [5]:
df.head() # 0: hate speech, 1: offensive language, 2: neither

| Unnamed: 0.1 | Unnamed: 0 | count | hate_speech | offensive_language | neither | class | tweet |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0 | 3 | 2 | !!! RT @mayasolovely: As a woman you shouldn't... |
| 1 | 1 | 3 | 0 | 3 | 0 | 1 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... |
| 2 | 2 | 3 | 0 | 3 | 0 | 1 | !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... |
| 3 | 3 | 3 | 0 | 2 | 1 | 1 | !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... |
| 4 | 4 | 6 | 0 | 6 | 0 | 1 | !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... |


In this dataset, the `class` column takes three values:

- 0: represents hate speech
- 1: represents offensive language
- 2: represents neither (no hate speech or offensive language)
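As a quick sanity check on this encoding, the sketch below maps each class id to a readable name and verifies that, for the five preview rows shown above, `class` matches the column with the most annotator votes. The `LABEL_NAMES` dictionary, the `majority_vote` helper, and the hard-coded `preview_rows` are illustrative assumptions, not part of the original notebook, and the majority-vote reading is only checked against these five rows.

```python
# Hypothetical helper names; the rows are copied from the df.head() preview.
LABEL_NAMES = {0: "hate speech", 1: "offensive language", 2: "neither"}

# (hate_speech, offensive_language, neither, class) for each preview row
preview_rows = [
    (0, 0, 3, 2),
    (0, 3, 0, 1),
    (0, 3, 0, 1),
    (0, 2, 1, 1),
    (0, 6, 0, 1),
]

def majority_vote(votes):
    """Return the index of the category with the most annotator votes."""
    return max(range(len(votes)), key=lambda i: votes[i])

decoded = []
for *votes, label in preview_rows:
    # In these rows, the class id equals the index of the largest vote count.
    assert majority_vote(votes) == label
    decoded.append(LABEL_NAMES[label])

print(decoded)
```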

The columns we care about are:

- tweet
- class (our label)

In [4]:
# Extract the tweets and their labels from the dataset
texts = df['tweet'].tolist()
y = df['class'].tolist() # labels

We are going to write a simple pre-processing function:

In [5]:
def preprocess_text(text):
    # Remove hyperlinks
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    
    # Remove special characters like @ and #
    text = re.sub(r'[@#]', '', text)
    
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    return text
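To see what this function does in practice, here is a small self-contained check on a made-up tweet (the sample string is hypothetical, and the function body is repeated only so the snippet runs on its own). It also shows that removing links and punctuation can leave stray whitespace; collapsing it with `split`/`join` is an optional extra step that is not part of the original pipeline.

```python
import re

def preprocess_text(text):
    # Same steps as the notebook function above, repeated for self-containment
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'[@#]', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.lower()

sample = "!!! RT @user: Check this out http://t.co/abc #Wow"  # made-up example
cleaned = preprocess_text(sample)
print(repr(cleaned))

# Optional (an assumption, not in the original): collapse repeated whitespace
normalized = " ".join(cleaned.split())
print(repr(normalized))
```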

In [6]:
# we apply the pre-processing function
X = [preprocess_text(text) for text in texts]

In [7]:
# train/test split
train_texts, test_texts, train_labels, test_labels = train_test_split(X, y, test_size=0.2)
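If the classes turn out to be imbalanced, one option is to make the split both reproducible and stratified. The sketch below uses toy data (the `toy_texts` and `toy_labels` lists are made up for illustration) together with the `random_state` and `stratify` parameters of `train_test_split`; whether stratification helps here is an assumption to verify on the real labels.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels standing in for df['class'] (hypothetical data)
toy_labels = [0] * 5 + [1] * 20 + [2] * 5
toy_texts = [f"tweet {i}" for i in range(len(toy_labels))]

tr_texts, te_texts, tr_labels, te_labels = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,
    random_state=42,      # fixed seed so the split is reproducible
    stratify=toy_labels,  # keep class proportions equal in both splits
)

print(Counter(tr_labels))
print(Counter(te_labels))
```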

We will prepare the data for our model. Due to computational constraints, we will use DistilBERT.

### Tokenization

In [8]:
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

def tokenize_function(texts):
    return tokenizer(texts, padding=True, truncation=True, return_tensors="tf")

train_encodings = tokenize_function(train_texts)
test_encodings = tokenize_function(test_texts)
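To make `padding=True` and `truncation=True` more concrete, here is a conceptual sketch in plain Python of what happens to a batch of token id sequences: longer sequences are cut to a maximum length, shorter ones are padded to the longest in the batch, and an attention mask marks real tokens with 1 and padding with 0. The `pad_and_truncate` helper, the token ids, the pad id of 0, and the `max_length` value are illustrative assumptions; the real tokenizer also takes care of special tokens and subword splitting, which this sketch ignores.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for three "sentences" of different lengths
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 3, 4, 5, 6, 2, 102]]
encoded = pad_and_truncate(batch, max_length=5)

for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

Every row ends up with the same length, which is what lets the batch be stacked into a single tensor.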

### Prepare datasets for TensorFlow

In [9]:
def create_tf_dataset(encodings, labels):
    dataset = tf.data.Dataset.from_tensor_slices((
        dict(encodings),
        tf.convert_to_tensor(labels)
    ))
    return dataset

# Shuffle individual examples before batching; batching first would only shuffle whole batches
train_dataset = create_tf_dataset(train_encodings, train_labels).shuffle(1000).batch(8)
test_dataset = create_tf_dataset(test_encodings, test_labels).batch(8)

<a name="4"></a>
## <b> <font color='blue'> 4. Model </font> </b>

In [10]:
from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

2024-08-13 15:33:38.696998: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.
2024-08-13 15:33:38.894204: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.
2024-08-13 15:33:38.914768: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.
2024-08-13 15:33:40.245303: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. in

In [11]:
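# Configure training
# Note: the string 'Adam' uses the Keras default learning rate (1e-3); fine-tuning
# pretrained transformers usually works better with a much smaller rate (e.g. 2e-5 to 5e-5)
optimizer = 'Adam'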
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]

model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
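`from_logits=True` tells the loss that the model outputs raw scores rather than probabilities, and the "sparse" variant expects integer class ids instead of one-hot vectors. As a rough illustration of the per-example computation (a plain-Python sketch with made-up logits and a hypothetical helper name, not Keras' actual implementation), the loss is the negative log of the softmax probability assigned to the true class:

```python
import math

def sparse_cross_entropy_from_logits(logits, label):
    """Cross-entropy for one example given raw logits and an integer label."""
    # Log-sum-exp with the max subtracted for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log(softmax(logits)[label]) == log_sum_exp - logits[label]
    return log_sum_exp - logits[label]

toy_logits = [2.0, 0.5, -1.0]  # hypothetical scores for the 3 classes

loss_correct = sparse_cross_entropy_from_logits(toy_logits, label=0)
loss_wrong = sparse_cross_entropy_from_logits(toy_logits, label=2)
print(loss_correct, loss_wrong)
```

The loss is small when the true class has the highest logit and grows as the true class's logit falls behind the others.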

Due to computational constraints, we will train the model for only 2 epochs.

In [12]:
history = model.fit(train_dataset, 
                    epochs=2, 
                    validation_data=test_dataset)

Epoch 1/2
Cause: for/else statement not yet supported


2024-08-13 15:39:27.705156: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 93763584 exceeds 10% of free system memory.


Epoch 2/2


In [15]:
history.history

{'loss': [0.6786103248596191, 0.6709921360015869],
 'accuracy': [0.7749419808387756, 0.7757490277290344],
 'val_loss': [0.6701390743255615, 0.6715118288993835],
 'val_accuracy': [0.7686100602149963, 0.7686100602149963]}

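With just 2 epochs, we reach a validation accuracy of about 77%. Note, however, that the loss and accuracy barely change between the two epochs and the validation accuracy is identical in both, so this figure should be compared against a majority-class baseline before drawing conclusions.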
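One way to put an accuracy figure in context is to compute the majority-class baseline, i.e. the accuracy of always predicting the most frequent label. A minimal sketch follows; the `majority_baseline_accuracy` helper and the toy labels are assumptions for illustration, and in this notebook the function would be called on `test_labels`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_label, most_common_count = counts.most_common(1)[0]
    return most_common_label, most_common_count / len(labels)

toy_test_labels = [1, 1, 1, 2, 1, 0, 1, 2, 1, 1]  # hypothetical labels
label, baseline = majority_baseline_accuracy(toy_test_labels)
print(f"Always predicting class {label} gives accuracy {baseline:.2f}")
```

If the model's validation accuracy is close to this baseline, it may have learned little beyond the class frequencies.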

<a name="5"></a>
## <b> <font color='blue'> 5. Predictions </font> </b>

In [63]:
# predict with my own data

some_texts = ['I hate people']
prep_texts = [preprocess_text(text) for text in some_texts]
encodings = tokenize_function(prep_texts)

In [65]:
inputs = {
    'input_ids': encodings['input_ids'],
    'attention_mask': encodings['attention_mask']
}


# Prediction
predictions = model.predict(inputs)

# The logits are in predictions.logits
logits = predictions.logits

# Logits to probs
probabilities = tf.nn.softmax(logits, axis=-1)

# Probs to classes
predicted_class = tf.argmax(probabilities, axis=-1)

print("Logits:", logits)
print("Probabilities:", probabilities)
print("Predicted class:", predicted_class)

Logits: [[-1.4713306   1.2834646  -0.33739126]]
Probabilities: tf.Tensor([[0.0504396  0.79280055 0.15675998]], shape=(1, 3), dtype=float32)
Predicted class: tf.Tensor([1], shape=(1,), dtype=int64)
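To double-check the output above and turn the class index into a readable label, the sketch below recomputes the softmax by hand from the printed logits. The `LABEL_NAMES` mapping is a hypothetical helper based on the class encoding described in Section 3.

```python
import math

LABEL_NAMES = {0: "hate speech", 1: "offensive language", 2: "neither"}

logits = [-1.4713306, 1.2834646, -0.33739126]  # copied from the output above

# Softmax: exponentiate (shifted by the max for stability) and normalize
shift = max(logits)
exps = [math.exp(z - shift) for z in logits]
probs = [e / sum(exps) for e in exps]

# Argmax: index of the largest probability
predicted = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 4) for p in probs])
print(predicted, LABEL_NAMES[predicted])
```

With these logits, the text 'I hate people' is assigned to class 1 (offensive language) with a probability of about 0.79.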


<a name="6"></a>
## <b> <font color='blue'> 6. References </font> </b>

[Hugging Face](https://huggingface.co/)