A set of multi-label classification models capable of detecting different types of toxicity: toxic, severe_toxic, obscene, threat, insult, and identity_hate.

Toxic-MultiOutputClassifier

This repository includes the scripts to train classifiers for the Toxic Comment Classification Challenge. I build a multi-label model capable of identifying and classifying six types of toxicity: toxic, severe toxic, obscene, threat, insult, and identity hate. The official URL for the dataset is: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

Data Preparation

In the data_preparation file, the training data is read and then split into training and validation sets.
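A minimal sketch of this step, assuming pandas and scikit-learn; the 10% validation ratio, function name, and file path are illustrative assumptions, not taken from the repository:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Label columns as defined by the Jigsaw dataset.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def split_train_val(df: pd.DataFrame, val_size: float = 0.1, seed: int = 42):
    """Split the training data into train and validation frames."""
    return train_test_split(df, test_size=val_size, random_state=seed)

# In the repository the frame would come from the downloaded dataset, e.g.:
# df = pd.read_csv("Data/train.csv")
# train_df, val_df = split_train_val(df)
```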

Preprocess

The tokenization.py file preprocesses the data after it is read. It provides four tokenizers: NLTK, spaCy, Keras, and Transformers.
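One way such a multi-backend setup could be wired is a dispatcher that selects a tokenizer by name. This is a hedged sketch of the pattern, not the file's actual structure; the backend names and the dependency-free regex fallback are assumptions (the NLTK, spaCy, and Keras branches need their respective packages and data installed):

```python
import re

def tokenize(text: str, backend: str = "regex"):
    """Dispatch to one of several tokenizer backends by name."""
    if backend == "nltk":
        from nltk.tokenize import word_tokenize   # requires the punkt data
        return word_tokenize(text)
    if backend == "spacy":
        import spacy                              # requires en_core_web_sm
        nlp = spacy.load("en_core_web_sm")
        return [tok.text for tok in nlp(text)]
    if backend == "keras":
        from tensorflow.keras.preprocessing.text import text_to_word_sequence
        return text_to_word_sequence(text)
    # Fallback: words and punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)
```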

Data Analysis

In data_analysis.py, the number of comments and the top tokens (ranked by TF-IDF values) are computed for each label.

Classifiers

I trained two sets of classifiers in this project.

  • train_sklearn.py -> The first set includes two ensemble-based classifiers (Random Forest and XGBoost), an instance-based classifier (support vector machine), and a regression-based classifier (logistic regression). This set also includes a stacked classifier combining the random forest and the support vector machine.
  • train_keras.py -> The second set contains two neural classification models: a CNN (convolutional neural network) and a BiLSTM (bidirectional long short-term memory).
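The multi-label wiring common to the sklearn set can be sketched as a TF-IDF pipeline with one binary classifier per toxicity label. This is a minimal illustration in the spirit of train_sklearn.py, not its actual code; the logistic-regression choice and hyperparameters are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

def build_model():
    """TF-IDF features feeding one-vs-rest logistic regression:
    one independent binary classifier per label column."""
    return make_pipeline(
        TfidfVectorizer(max_features=50_000),
        OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    )
```

The same pipeline shape accepts any of the listed base estimators (random forest, XGBoost, SVM) in place of the logistic regression.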

I also tried Hugging Face Transformers in Google Colab. You can find the notebook (.ipynb) in the HuggingFace_notebook folder.

Main

In the main file, the whole program is assembled. Once it runs, a prompt appears; entering a comment at the prompt classifies it.
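The interactive step can be sketched as below. The helper name, the label list, and the loading of a trained model are assumptions; the repository's main file wires in its own classifiers:

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def classify_comment(model, comment: str, labels=LABELS):
    """Return the names of the labels a trained multi-label model predicts."""
    pred = model.predict([comment])[0]   # one 0/1 flag per label
    return [name for name, flag in zip(labels, pred) if flag]

if __name__ == "__main__":
    # model = ...  # load a trained classifier here
    # while True:
    #     comment = input("Enter a comment: ")
    #     print(classify_comment(model, comment) or ["clean"])
    pass
```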

Install dependencies

pip install -r requirements.txt

How to run the package?

  1. Download the dataset into the Data folder
  2. Install the dependencies
  3. Run train_sklearn.py. Note: Further data to be downloaded is indicated inside the file.
  4. Run train_keras.py. Note: Further data to be downloaded is indicated inside the file.
  5. Run data_analysis.py to get an overview of the dataset.

Note: The paths inside the files for saving the models are preliminary defaults; they can be changed.

Disclaimer

The whole implementation is based on scikit-learn, spaCy, tensorflow.keras, NLTK, and Hugging Face Transformers.
