<a href="https://www.kaggle.com/code/lonnieqin/toxicity-classification-with-kerasnlp?scriptVersionId=128789894" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Toxicity Classification with KerasNLP
## Table of Contents
* [1. Overview](#1.)
* [2. Configuration](#2.)
* [3. Setup](#3.)
* [4. Import datasets](#4.)
* [5. Data Preprocessing](#5.)
    * [5.1 Train Validation Split](#5.1)
    * [5.2 Create TensorFlow Dataset](#5.2)
* [6. Model Development](#6.)
    * [6.1 Building model](#6.1)
    * [6.2 Training model](#6.2)
    * [6.3 Evaluating model](#6.3)
* [7. Submission](#7.)
* [8. References](#8.)


<a id="1."></a>
## 1. Overview
In this notebook, I build a toxicity classifier for the Jigsaw Toxic Comment Classification Challenge using [DistilBERT](https://keras.io/api/keras_nlp/models/distil_bert/) from the [KerasNLP library](https://keras.io/api/keras_nlp).

DistilBERT is a distilled version of BERT trained with knowledge distillation. It retains 97% of the original BERT's language understanding capabilities while being 40% smaller and 60% faster.
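
To make the idea of knowledge distillation concrete, here is a minimal sketch of a soft-target loss: the student is penalized for diverging from the teacher's temperature-softened output distribution. This illustrates only the general technique. DistilBERT's actual training objective combines a distillation term with other losses, and the function names, logits, and temperature below are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over three classes
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_close, loss_far)
```

A student that mimics the teacher gets a loss near zero, while one that ranks the classes differently gets a much larger loss.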

KerasNLP is a library built on Keras that makes it easier to implement NLP applications in only a few lines of code, as the following model definition shows:
```python
def get_model(config):
    encoder = keras_nlp.models.DistilBertBackbone.from_preset(
        "distil_bert_base_en_uncased"
    )
    encoder.trainable = False
    preprocessor = keras_nlp.models.DistilBertPreprocessor.from_preset("distil_bert_base_en_uncased")
    inputs = keras.Input(shape=(), dtype=tf.string)
    x = preprocessor(inputs)
    x = encoder(x)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    output = layers.Dense(6, activation="sigmoid")(x)
    model = keras.Model(inputs, output, name="model")
    model.compile(
        "adam", loss="binary_crossentropy", metrics=["categorical_accuracy", keras.metrics.AUC()]
    )
    return model
```
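
The model ends in six sigmoid units trained with binary cross-entropy because a comment can carry several labels at once. The sketch below, in plain Python with made-up logits, shows how this differs from a softmax head: each label gets its own independent probability, so the probabilities need not sum to 1.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def sigmoid(z):
    """Map a logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over independent labels."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical logits for one comment, one per label
logits = [2.0, -3.0, 1.5, -4.0, 0.5, -2.5]
probs = [sigmoid(z) for z in logits]
y_true = [1, 0, 1, 0, 1, 0]  # toxic, obscene, and insult are all present

print(dict(zip(LABELS, (round(p, 3) for p in probs))))
print("sum of probabilities:", round(sum(probs), 3))
print("loss:", binary_crossentropy(y_true, probs))
```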

<a id="2."></a>
## 2. Configuration

In [None]:
class Config:
    batch_size = 128
    validation_split = 0.15
    epochs = 10 # Number of Epochs to train
    model_path = "model.tf"
    output_dataset_path = "../input/toxicity-keras-nlp-model"
    labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
    modes = ["training", "inference"]
    mode = modes[1]
    model_name = "distil_bert_base_en_uncased"
config = Config()

<a id="3."></a>
## 3. Setup

Install the KerasNLP library and import the necessary packages.

In [None]:
%pip install --upgrade keras-nlp

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
import keras_nlp

<a id="4."></a>
## 4. Import datasets

In [None]:
!unzip ../input/jigsaw-toxic-comment-classification-challenge/train.csv.zip
!unzip ../input/jigsaw-toxic-comment-classification-challenge/test.csv.zip
!unzip ../input/jigsaw-toxic-comment-classification-challenge/test_labels.csv.zip
!unzip ../input/jigsaw-toxic-comment-classification-challenge/sample_submission.csv.zip

In [None]:
train = pd.read_csv("/kaggle/working/train.csv")
train.head()

<a id="5."></a>
## 5. Data Preprocessing

<a id="5.1"></a>
### 5.1 Train Validation Split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(train["comment_text"], train[config.labels], test_size=config.validation_split)

In [None]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape

<a id="5.2"></a>
### 5.2 Create TensorFlow Dataset

In [None]:
def make_dataset(X, y, batch_size, mode):
    dataset = tf.data.Dataset.from_tensor_slices((X, y))
    if mode == "train":
        # Shuffle individual samples before batching.
        dataset = dataset.shuffle(batch_size * 4)
    dataset = dataset.batch(batch_size)
    # No cache() after shuffle(): it would replay the first epoch's order in every epoch.
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    return dataset

In [None]:
train_ds = make_dataset(X_train, y_train, batch_size=config.batch_size, mode="train")
valid_ds = make_dataset(X_val, y_val, batch_size=config.batch_size, mode="valid")

Let's take a look at the format of training data.

In [None]:
for batch in train_ds.take(1):
    print(batch)
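
Each element of the dataset is a pair: a batch of comment strings and a matching batch of six-element label rows. The following pure-Python stand-in (not `tf.data` itself, and with made-up toy data) sketches the same shuffle-then-batch logic, including the smaller final batch when the sample count is not a multiple of the batch size.

```python
import random

NUM_LABELS = 6  # matches the six toxicity labels in Config.labels

def make_batches(texts, labels, batch_size, shuffle=False, seed=0):
    """Yield (texts, labels) batches; a pure-Python stand-in for the tf.data pipeline."""
    pairs = list(zip(texts, labels))
    if shuffle:
        random.Random(seed).shuffle(pairs)  # shuffle samples first, then batch
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        yield [t for t, _ in chunk], [y for _, y in chunk]

# Hypothetical toy data: five comments, each with six binary labels
toy_texts = [f"comment {i}" for i in range(5)]
toy_labels = [[i % 2] * NUM_LABELS for i in range(5)]

batches = list(make_batches(toy_texts, toy_labels, batch_size=2, shuffle=True))
for batch_texts, batch_labels in batches:
    print(len(batch_texts), "texts,", len(batch_labels), "x", len(batch_labels[0]), "labels")
```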

<a id="6."></a>
## 6. Model Development

<a id="6.1"></a>
### 6.1 Building model

In [None]:
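def get_model(config):
    # Pretrained DistilBERT encoder, frozen so that only the classification head is trained.
    encoder = keras_nlp.models.DistilBertBackbone.from_preset(
        config.model_name
    )
    encoder.trainable = False
    # Matching preprocessor: tokenizes raw strings into the inputs the encoder expects.
    preprocessor = keras_nlp.models.DistilBertPreprocessor.from_preset(
        config.model_name
    )
    inputs = keras.Input(shape=(), dtype=tf.string)
    x = preprocessor(inputs)
    x = encoder(x)
    # Average the token embeddings into one vector per comment.
    x = layers.GlobalAveragePooling1D()(x)
    # One sigmoid unit per label, since a comment can carry several labels at once.
    output = layers.Dense(len(config.labels), activation="sigmoid")(x)
    model = keras.Model(inputs, output, name="model")
    model.compile(
        "adam",
        loss="binary_crossentropy",
        metrics=["categorical_accuracy", keras.metrics.AUC()],
    )
    return model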

In [None]:
model = get_model(config)
model.summary()
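
The pooling step turns the encoder's per-token vectors into a single vector per comment by averaging over the sequence axis. Here is a minimal pure-Python sketch of that averaging with made-up numbers. It also shows a masked variant that skips padding positions. Whether padding is excluded in the model above depends on whether a mask reaches the pooling layer, which is not checked here.

```python
def mean_pool(token_vectors, mask=None):
    """Average token vectors into one vector; mask marks real (non-padding) tokens."""
    if mask is None:
        mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    hidden_size = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(hidden_size)]

# Hypothetical encoder output: 4 token positions, hidden size 3; the last two are padding
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
padding_mask = [1, 1, 0, 0]

print(mean_pool(tokens))                # plain average over all positions → [1.0, 1.5, 2.0]
print(mean_pool(tokens, padding_mask))  # average over real tokens only → [2.0, 3.0, 4.0]
```

The plain average is pulled toward zero by the padding rows, which is why masked pooling is a common alternative for short texts in long padded sequences.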

<a id="6.2"></a>
### 6.2 Training model

In [None]:
if config.mode == config.modes[0]:
    # Save the model whenever validation accuracy improves.
    checkpoint = keras.callbacks.ModelCheckpoint(
        config.model_path, monitor="val_categorical_accuracy", save_best_only=True
    )
    early_stopping = keras.callbacks.EarlyStopping(patience=10)
    reduce_lr = keras.callbacks.ReduceLROnPlateau(patience=5, min_delta=1e-4, min_lr=1e-6)
    callbacks = [checkpoint, early_stopping, reduce_lr]
    model.fit(train_ds, epochs=config.epochs, validation_data=valid_ds, callbacks=callbacks)
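
To see what `patience=5`, `min_delta=1e-4`, and `min_lr=1e-6` mean for the learning rate schedule, here is a simplified simulation of reduce-on-plateau logic in plain Python. It is a sketch, not the Keras implementation: it ignores cooldown, and the starting learning rate of `1e-3` and the reduction factor of `0.1` are assumptions that I believe match the Keras defaults. The validation losses are made up.

```python
def simulate_reduce_lr(val_losses, lr=1e-3, factor=0.1, patience=5,
                       min_delta=1e-4, min_lr=1e-6):
    """Return the learning rate in effect after each epoch (simplified, no cooldown)."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:  # only a real improvement resets the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience and lr > min_lr:
                lr = max(lr * factor, min_lr)  # never go below min_lr
                wait = 0
        history.append(lr)
    return history

# Hypothetical validation losses: one improvement, then a plateau
losses = [0.10, 0.09] + [0.09] * 6
lrs = simulate_reduce_lr(losses)
print(lrs)
```

In this run the loss stops improving after the second epoch, and the learning rate drops by a factor of ten once five epochs in a row have passed without improvement.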

<a id="6.3"></a>
### 6.3 Evaluating model

#### Classification Report

In [None]:
if config.mode == config.modes[0]:
    from sklearn.metrics import classification_report
    # valid_ds is not shuffled, so predictions line up row by row with y_val.
    y_pred = np.array(model.predict(valid_ds) > 0.5, dtype=int)
    cls_report = classification_report(y_val, y_pred, target_names=config.labels)
    print(cls_report)
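
The report is computed per label after thresholding each probability at 0.5. As a rough illustration of what happens under the hood, the sketch below thresholds made-up scores and computes precision and recall for each label column in plain Python. The helper names and the data are hypothetical, and this is not how scikit-learn implements its report.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn per-label probabilities into 0/1 predictions."""
    return [[int(p > cutoff) for p in row] for row in probabilities]

def precision_recall(y_true, y_pred, label_index):
    """Precision and recall for one label column of a multi-label problem."""
    pairs = [(t[label_index], p[label_index]) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical scores for four comments and two labels
labels = ["toxic", "insult"]
y_true = [[1, 0], [1, 1], [0, 0], [0, 1]]
scores = [[0.9, 0.2], [0.4, 0.7], [0.6, 0.1], [0.1, 0.3]]

y_pred = threshold(scores)
for i, name in enumerate(labels):
    print(name, precision_recall(y_true, y_pred, i))
```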

<a id="7."></a>
## 7. Submission

In [None]:
test = pd.read_csv("/kaggle/working/test.csv")
test.head()

In [None]:
sample_submission = pd.read_csv("/kaggle/working/sample_submission.csv")
sample_submission.head()

In [None]:
test_ds = (
    tf.data.Dataset.from_tensor_slices(test["comment_text"])
    .batch(config.batch_size)
    .prefetch(tf.data.AUTOTUNE)
)
# In inference mode, load the weights from the attached dataset instead of the working directory.
path = config.model_path
if config.mode == config.modes[1]:
    path = config.output_dataset_path + "/" + path
model.load_weights(path)
score = model.predict(test_ds)
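
The weights location depends on the mode: in training mode the checkpoint sits in the working directory, and in inference mode it comes from the attached dataset. The same rule can be sketched as a small helper. `DemoConfig` and `resolve_weights_path` are hypothetical names for illustration, with values copied from the configuration cell.

```python
import posixpath

class DemoConfig:
    """Stand-in for the notebook's Config, reduced to the path-related fields."""
    model_path = "model.tf"
    output_dataset_path = "../input/toxicity-keras-nlp-model"
    modes = ["training", "inference"]
    mode = modes[1]

def resolve_weights_path(config):
    """Return where the weights are expected for the current mode."""
    if config.mode == "inference":
        return posixpath.join(config.output_dataset_path, config.model_path)
    return config.model_path

demo = DemoConfig()
print(resolve_weights_path(demo))  # → ../input/toxicity-keras-nlp-model/model.tf
```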

In [None]:
sample_submission[config.labels] = score
sample_submission.to_csv("submission.csv", index=False)
sample_submission.head()


<a id="8."></a>
## 8. References
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762v5)
- [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108)
- [DistilBERT documentation](https://keras.io/api/keras_nlp/models/distil_bert/)