Tweet Sentiment Analysis: Classifying Hate Speech and Offensive Language

This project offers a straightforward machine learning approach to sort tweets into three categories: hate speech, offensive language, and neither. It uses Support Vector Machines (SVM) and GloVe (Global Vectors for Word Representation) embeddings, particularly the glove-twitter-200 dataset, to perform a basic analysis of what tweets are about. The initiative is based on a dataset sourced from https://data.world/thomasrdavidson/hate-speech-and-offensive-language, which comprises labeled tweets that have been manually categorized to facilitate the training and evaluation of the model. By running tweets through a trained SVM model, it becomes possible to automatically classify the tone of tweets, which helps in managing and understanding conversations on social media. This project not only demonstrates how natural language processing (NLP) techniques can be applied but also acts as a practical tool for those new to machine learning, providing them with a chance to try out simple models.

Project Structure

app.py: The main Flask application script for running the web interface.
model.pkl: The pre-trained SVM model (not included in the repository, see setup below).
templates/: Folder containing HTML templates for the Flask application.
requirements.txt: File listing the necessary Python packages.
'tweet_classifier.py': Python script loading and preprocessing data, training and saving an SVM model

Setup

Clone the Repository

git clone https://github.com/birsenyy/tweet-classifier.git cd tweet-classifier

Install Dependencies

pip install -r requirements.txt

Dowload dataset

Dowload the dataset from:

https://data.world/thomasrdavidson/hate-speech-and-offensive-language/workspace/file?filename=labeled_data.csv

Download Glove Vectors

The Glove vectors can be loaded directly in Python script:

glove_vectors = gensim.downloader.load('glove-twitter-200'

However, if using Jupyter Notebook, the IOPub message rate limit may not be sufficient when loading embeddings with higher dimensions.

In this case, apply the following steps:

Dowload glove.twitter.27B.zip from https://nlp.stanford.edu/projects/glove/

Unzip the folder.

The Python script 'tweet-classifier.py" was written applying the second method. Please make necessary changes if you load the embeddings directly.

model_25.pkl and model_200 in the repository are pre-trained models that were trained by Glove embeddings with size 25 and 200.

Running the Application

Start the Flask application:

python app.py

Open a web browser and navigate to http://127.0.0.1:5001 to access the application.

Usage:

Enter a tweet into the form on the web page and submit it to classify the tweet into categories based on its content.

Dataset Acknowledgment

This project utilizes the "Hate Speech and Offensive Language" dataset originally compiled by Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. The dataset is made available under the Creative Commons Attribution (CC-BY) license, allowing for both academic and commercial use with proper attribution.

The dataset can be accessed and downloaded from data.world.

License

This dataset is distributed under the terms of the Creative Commons Attribution (CC-BY) license. Users are free to share and adapt the dataset as long as appropriate credit is given, a link to the license is provided, and it is indicated if changes were made. More details about the CC-BY license can be found here.

How to Cite

If you use this dataset in your project, please provide the following attribution:

Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Hate Speech and Offensive Language. Retrieved from https://data.world/thomasrdavidson/hate-speech-and-offensive-language

We express our gratitude to the authors for their work and for making this valuable resource available to the community.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
templates		templates
LICENSE		LICENSE
README.md		README.md
app.py		app.py
model_200.pkl		model_200.pkl
model_25.pkl		model_25.pkl
requirements.txt		requirements.txt
tweet_classifier.py		tweet_classifier.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tweet Sentiment Analysis: Classifying Hate Speech and Offensive Language

Project Structure

Setup

Running the Application

Usage:

Dataset Acknowledgment

License

How to Cite

About

Releases

Packages

Languages

License

BirsenYY/tweet_classification

Folders and files

Latest commit

History

Repository files navigation

Tweet Sentiment Analysis: Classifying Hate Speech and Offensive Language

Project Structure

Setup

Running the Application

Usage:

Dataset Acknowledgment

License

How to Cite

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages