GitHub - JensBender/hate-speech-detection: Employing deep learning techniques to train and deploy a hate speech detection model for social media comments.

Deep learning for hate speech detection in social media comments.

About The Project

Summary

Motivation: Develop a hate speech detector for social media comments.
Data: Utilized the ETHOS Hate Speech Detection Dataset.
Models: The fine-tuned BERT model achieved the highest test accuracy (78.0%), ahead of the LSTM (70.7%) and SimpleRNN (66.3%) models.
Deployment: The fine-tuned BERT model was prepared for production by integrating it into a web application and an API endpoint.

Motivation

Problem: Hate speech is on the rise globally, especially on social media platforms (source: United Nations).
Project goal: Utilize deep learning for hate speech detection in social media comments.
Definition of hate speech: Insulting public speech directed at specific individuals or groups on the basis of characteristics such as race, religion, ethnic origin, national origin, sex, disability, sexual orientation, or gender identity (Mollas, Chrysopoulou, Karlos, & Tsoumakas, 2022).

Data

998 comments from YouTube and Reddit validated using the Figure-Eight crowdsourcing platform.
Dataset: ETHOS Hate Speech Detection Dataset.
Class balance: 43.4% of the comments are labeled as hate speech, so the two classes are roughly balanced.
Comment length: Mean = 112 words (std = 160).

Model Building

Benchmark models (Mollas, Chrysopoulou, Karlos, & Tsoumakas, 2022):

Random Forest: 65.0% Accuracy
Support Vector Machine: 66.4% Accuracy

Comparison of three deep learning models:

SimpleRNN
- Preprocessing, model architecture, and hyperparameters: see the Appendix
LSTM
- Preprocessing, model architecture, and hyperparameters: see the Appendix
Fine-tuned BERT
- Implementation with TensorFlow Hub
- Small BERT model: small_bert/bert_en_uncased_L-4_H-512_A-8
- Preprocessing, model architecture, and hyperparameters: see the Appendix

Model Performance

Accuracy

	SimpleRNN	LSTM	Fine-Tuned BERT
Training Accuracy	91.8%	100%	99.9%
Test Accuracy	66.3%	70.7%	78.0%

Classification Report

The following classification reports present the performance metrics of the trained models on the test data.

SimpleRNN

	Precision	Recall	F1 Score
No Hate Speech	0.69	0.71	0.70
Hate Speech	0.63	0.61	0.62
Accuracy			0.66

LSTM

	Precision	Recall	F1 Score
No Hate Speech	0.73	0.75	0.74
Hate Speech	0.68	0.66	0.67
Accuracy			0.71

Fine-Tuned BERT

	Precision	Recall	F1 Score
No Hate Speech	0.84	0.74	0.79
Hate Speech	0.72	0.83	0.77
Accuracy			0.78
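
The F1 scores in these reports follow from the precision and recall columns as their harmonic mean. The sketch below recomputes them from the rounded table values as a consistency check. The helper name `f1_score` is chosen for illustration and is not taken from the project code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the classification reports above.
reports = {
    "SimpleRNN": {"No Hate Speech": (0.69, 0.71), "Hate Speech": (0.63, 0.61)},
    "LSTM": {"No Hate Speech": (0.73, 0.75), "Hate Speech": (0.68, 0.66)},
    "Fine-Tuned BERT": {"No Hate Speech": (0.84, 0.74), "Hate Speech": (0.72, 0.83)},
}

for model, classes in reports.items():
    for label, (precision, recall) in classes.items():
        print(f"{model:16s} {label:15s} F1 = {f1_score(precision, recall):.2f}")
```

Rounded to two decimals, the recomputed values agree with the F1 column of each report. For the fine-tuned BERT model, the hate speech class has higher recall (0.83) than precision (0.72), which means it catches more hateful comments at the cost of more false alarms.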

Confusion Matrix

Confusion matrices of the three models on the test data (figure with one panel each for SimpleRNN, LSTM, and Fine-Tuned BERT).

Illustrative Examples

The following example comments from the test data are shown with their true labels and each model's prediction.

	True Label	SimpleRNN	LSTM	Fine-Tuned BERT
Women deserve to be abused, I guess.	Hate	Hate	Hate	Hate
Christians are all demonic.	Hate	No Hate	Hate	Hate
Muslims want war! So take it to them first	Hate	Hate	No Hate	Hate
Fat lesbians	Hate	Hate	Hate	Hate
What imaginary gender are you?	Hate	No Hate	No Hate	Hate

Model Deployment

I deployed the fine-tuned BERT model through both a web application and an API.

Web Application

I built a web application with Flask, HTML, and CSS. Users enter text in a simple form and receive a prediction of whether it contains hate speech.

API

I developed an API endpoint with Flask so that other applications or services can request predictions. I used Postman to test and document the API.

API documentation: See here
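
As a rough illustration of what such an endpoint can look like, here is a minimal Flask sketch. It is not the project's `app.py`: the route name `/predict`, the JSON field `text`, the response fields, and the 0.5 threshold are all assumptions, and `predict_probability` is a placeholder that stands in for the fine-tuned BERT model so the sketch runs without the saved model files.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

THRESHOLD = 0.5  # assumed cut-off applied to the model's sigmoid output


def predict_probability(text: str) -> float:
    """Placeholder for the fine-tuned BERT model (hypothetical helper).

    A real deployment would run the loaded model on `text` and return its
    sigmoid output. A constant keeps this sketch runnable on its own.
    """
    return 0.0


@app.route("/predict", methods=["POST"])  # hypothetical route name
def predict():
    # silent=True returns None instead of raising on a missing or invalid body.
    payload = request.get_json(silent=True)
    text = payload.get("text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        return jsonify(error="Provide a non-empty 'text' field."), 400

    probability = predict_probability(text)
    label = "Hate Speech" if probability >= THRESHOLD else "No Hate Speech"
    return jsonify(text=text, probability=probability, label=label)


# Exercise the endpoint in-process with Flask's built-in test client.
with app.test_client() as client:
    response = client.post("/predict", json={"text": "Have a nice day"})
    print(response.status_code, response.get_json())
```

Rejecting empty or malformed input with a 400 status keeps invalid requests away from the model, and the test client makes it possible to check the endpoint without starting a server.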

Getting Started

Prerequisites for Model Training

Training the models requires the following Python packages:

TensorFlow
TensorFlow Hub
TensorFlow Text
Scikit-Learn
NumPy
Pandas
Matplotlib

Prerequisites for Model Deployment

Running the web application and API requires the following Python packages:

TensorFlow
TensorFlow Text
NumPy
Flask
Flask-WTF
WTForms
Python-dotenv

To keep the Flask secret key out of the source code, generate a secret key, store it in a .env file, and load it at startup with the python-dotenv library.

SECRET_KEY = "Your_secret_key_here"
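
One possible way to generate a suitable key is Python's standard `secrets` module. The sketch below prints a line in the format shown above that can be pasted into the .env file. The helper name `make_env_line` and the 32-byte key length are illustrative choices, not project requirements.

```python
import secrets


def make_env_line(num_bytes: int = 32) -> str:
    """Return a .env line holding a random hex-encoded secret key."""
    key = secrets.token_hex(num_bytes)  # two hex characters per random byte
    return f'SECRET_KEY = "{key}"'


# Paste the printed line into the .env file.
print(make_env_line())
```

Once the file exists, python-dotenv's `load_dotenv()` loads the value into the process environment, where the application can read it with `os.getenv("SECRET_KEY")`. The .env file should stay out of version control.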

Appendix

SimpleRNN: Preprocessing, Model Architecture and Hyperparameters

Preprocessing
Tokenizer vocabulary size: 5000
Padded sequence length: 15
Embedding dimension: 50

Model Architecture

Layer (type)	Output Shape	Param #	Activation
Embedding	(None, 15, 50)	250050
SimpleRNN	(None, 15, 128)	22912	tanh
SimpleRNN	(None, 128)	32896	tanh
Dense	(None, 64)	8256	relu
Dense	(None, 1)	65	sigmoid

Total params: 314,179
Trainable params: 314,179
Non-trainable params: 0
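
The parameter counts in this table can be reproduced with the standard formulas for embedding, simple recurrent, and dense layers. The sketch below is a plain-Python check with illustrative helper names. The embedding size of 5,001 rows is inferred from the table (250,050 / 50), presumably the 5,000-word vocabulary plus one reserved index, and is an assumption about the project code.

```python
def embedding_params(num_rows: int, dim: int) -> int:
    """One trainable vector of length `dim` per embedding row."""
    return num_rows * dim


def simple_rnn_params(input_dim: int, units: int) -> int:
    """Input kernel + recurrent kernel + bias of a simple recurrent layer."""
    return units * (input_dim + units + 1)


def dense_params(input_dim: int, units: int) -> int:
    """Weight matrix + bias of a fully connected layer."""
    return units * (input_dim + 1)


layer_params = [
    embedding_params(5001, 50),   # 5,001 rows inferred from the table
    simple_rnn_params(50, 128),   # first SimpleRNN, fed by 50-dim embeddings
    simple_rnn_params(128, 128),  # second SimpleRNN, fed by 128 hidden units
    dense_params(128, 64),
    dense_params(64, 1),
]

print(layer_params)       # [250050, 22912, 32896, 8256, 65]
print(sum(layer_params))  # 314179
```

Roughly 80% of the parameters sit in the embedding layer.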

Hyperparameters
Optimizer: Adam
Learning rate: 0.001
Loss: Binary Crossentropy
Epochs: 100
Batch size: 8
Dropout rate: 50%
Early stopping metric: Accuracy

LSTM: Preprocessing, Model Architecture and Hyperparameters

Preprocessing
Tokenizer vocabulary size: 5000
Padded sequence length: 150
Embedding dimension: 50

Model Architecture

Layer (type)	Output Shape	Param #	Activation
Embedding	(None, 150, 50)	250050
LSTM	(None, 150, 128)	91648	tanh
LSTM	(None, 128)	131584	tanh
Dense	(None, 64)	8256	relu
Dense	(None, 1)	65	sigmoid

Total params: 481,603
Trainable params: 481,603
Non-trainable params: 0
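
An LSTM layer has four sets of weights (input gate, forget gate, cell candidate, and output gate), each with an input kernel, a recurrent kernel, and a bias. That is why each LSTM layer in this table has four times the parameters of a simple recurrent layer of the same size. The following plain-Python sketch checks the table under that formula. The helper name is illustrative, and the 5,001 embedding rows are inferred from the table rather than taken from the project code.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Four gate blocks, each with input kernel, recurrent kernel, and bias."""
    return 4 * units * (input_dim + units + 1)


embedding = 5001 * 50            # 250,050: row count inferred from the table
lstm_1 = lstm_params(50, 128)    # fed by 50-dim embeddings
lstm_2 = lstm_params(128, 128)   # fed by the first LSTM's 128 units
dense_1 = 64 * (128 + 1)         # weights + bias
dense_2 = 1 * (64 + 1)

print(lstm_1, lstm_2)                                    # 91648 131584
print(embedding + lstm_1 + lstm_2 + dense_1 + dense_2)   # 481603
```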

Hyperparameters
Optimizer: Adam
Learning rate: 0.001
Loss: Binary Crossentropy
Epochs: 100
Batch size: 32
Dropout rate: 50%
Early stopping metric: Accuracy

Fine-Tuned BERT: Preprocessing, Model Architecture and Hyperparameters

Preprocessing
Text preprocessing for BERT models: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3

Model Architecture

Layer (type)	Output Shape	Param #	Activation
Text Input	[(None,)]	0
Preprocessing	input_type_ids: (None, 128) input_mask: (None, 128) input_word_ids: (None, 128)	0
BERT	(None, 512)	28763649
Dropout	(None, 512)	0
Dense	(None, 128)	65664	relu
Dense	(None, 1)	129	sigmoid

Total params: 28,829,442
Trainable params: 28,829,441
Non-trainable params: 1

Hyperparameters
Optimizer: Adam
Learning rate: 0.0001
Loss: Binary Crossentropy
Epochs: 100
Batch size: 8
Dropout rate: 50%
Early stopping metric: Accuracy
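
With early stopping, the 100 epochs listed above act as an upper bound: training ends once the monitored accuracy stops improving. The sketch below shows the general idea in plain Python. It is not the project's training code. The patience of 3 epochs and the accuracy values are made up for illustration, and the source does not state whether training or validation accuracy was monitored.

```python
def early_stopping_epoch(accuracies, patience=3):
    """Return the 1-based epoch at which training would stop.

    Training stops after `patience` consecutive epochs without a new best
    accuracy. If that never happens, all epochs are used.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(accuracies, start=1):
        if accuracy > best:
            best = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(accuracies)


# Made-up accuracy history, for illustration only.
history = [0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.72]
print(early_stopping_epoch(history, patience=3))  # 6
```

In this made-up history the best accuracy is reached at epoch 3, and the three following epochs fail to beat it, so training stops at epoch 6.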
