Importing our dependencies

In [1]:
import os
import pandas as pd    
import tensorflow as tf 
import numpy as np
import gradio as gr
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import TextVectorization, LSTM, Dropout, Bidirectional, Dense, Embedding
from tensorflow.keras.metrics import Precision, Recall, CategoricalAccuracy

Load the training split of the Jigsaw toxic comment dataset (https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/overview)

In [2]:
df = pd.read_csv(os.path.join('dataset', 'train', 'train.csv'))

Keep only the comment text as X, and build y as one 0/1 vector per comment that marks which of the labels [toxic, severe_toxic, obscene, threat, insult, identity_hate] apply to it

In [3]:

X = df['comment_text']
y = df[df.columns[2:]].values
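
As a rough illustration of what this slicing produces, the sketch below parses two made-up rows using only the standard library. It assumes the layout `id, comment_text` followed by the six label columns; the rows themselves are hypothetical and are not taken from the dataset.

```python
import csv
import io

# Two made-up rows that mimic the assumed layout of train.csv:
# id, comment_text, then the six label columns.
SAMPLE_CSV = """id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0001,Thanks for the helpful edit,0,0,0,0,0,0
0002,You are an idiot,1,0,0,0,1,0
"""

reader = csv.reader(io.StringIO(SAMPLE_CSV))
header = next(reader)
label_names = header[2:]  # same slice as df.columns[2:]

texts, labels = [], []
for row in reader:
    texts.append(row[1])                      # the comment_text column
    labels.append([int(v) for v in row[2:]])  # six 0/1 flags per comment

print(label_names)
print(labels)
```

Each comment ends up with one six-element label vector, which is why the final layer of the model has six outputs.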

Define the maximum number of words kept in the vocabulary

In [4]:
MAX_WORDS = 200000

Create the vectorizer: each word is mapped to an integer, and every comment is truncated or padded to a sequence of 1800 tokens

In [5]:
vectorizer = TextVectorization(max_tokens=MAX_WORDS, output_sequence_length=1800, output_mode='int')

Fit (adapt) the vectorizer on the comment text, then vectorize every comment

In [6]:
vectorizer.adapt(X.values)
vectorized_text = vectorizer(X.values)
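
To make the integer encoding concrete, here is a toy pure-Python sketch of the same idea. It is not the Keras implementation; it only mirrors the documented defaults of `TextVectorization` in `int` mode (lowercasing, punctuation stripping, id 0 for padding, id 1 for unknown words, most frequent words first). The helper names and the tiny corpus are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def build_vocab(texts, max_tokens):
    """Return a token -> id map; ids 0 (padding) and 1 (unknown) are reserved."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # The most frequent tokens get the lowest ids after the two reserved slots.
    kept = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: idx + 2 for idx, tok in enumerate(kept)}

def vectorize(text, vocab, length):
    """Map tokens to ids, then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(tok, 1) for tok in tokenize(text)][:length]
    return ids + [0] * (length - len(ids))

# Hypothetical corpus: "you" appears 4 times, "are" 3 times, "great" twice.
corpus = ["you are great", "you are wrong", "you are great", "you win"]
vocab = build_vocab(corpus, max_tokens=5)

print(vectorize("You are NOT great!", vocab, length=6))
```

Words outside the vocabulary ("not" here) collapse to the unknown id, and short comments are padded with zeros, which is what happens to most comments when the output length is 1800.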

MCSHBAP - map, cache, shuffle, batch, prefetch

In [7]:
dataset = tf.data.Dataset.from_tensor_slices((vectorized_text, y))
dataset = dataset.cache()
dataset = dataset.shuffle(160000)
dataset = dataset.batch(16)
dataset = dataset.prefetch(8)

Pull one batch from the dataset: batch_X holds the vectorized text and batch_y holds the label vectors ([toxic, severe_toxic, obscene, threat, insult, identity_hate])

In [8]:
batch_X, batch_y = dataset.as_numpy_iterator().next()

Partition the data into train (70%), validation (20%), and test (10%) sets

In [9]:
train = dataset.take(int(len(dataset)*.7))
validation = dataset.skip(int(len(dataset)*.7)).take(int(len(dataset)*.2))
test = dataset.skip(int(len(dataset)*.9)).take(int(len(dataset)*.1))
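
The `take`/`skip` arithmetic can be checked without TensorFlow. The sketch below assumes a hypothetical dataset of 1000 batches (the real count depends on the CSV and the batch size) and computes which batch indices each split would cover.

```python
def split_ranges(num_batches):
    """Return (start, stop) batch index ranges, mirroring the take/skip calls."""
    n_train = int(num_batches * 0.7)
    n_val = int(num_batches * 0.2)
    test_start = int(num_batches * 0.9)
    n_test = int(num_batches * 0.1)
    return {
        "train": (0, n_train),                      # take(70%)
        "validation": (n_train, n_train + n_val),   # skip(70%).take(20%)
        "test": (test_start, test_start + n_test),  # skip(90%).take(10%)
    }

# Hypothetical batch count, chosen only to keep the numbers readable.
ranges = split_ranges(1000)
for name, (start, stop) in ranges.items():
    print(f"{name}: batches {start} to {stop - 1}")
```

One caveat worth checking: `shuffle` reshuffles on every iteration by default, and here it runs before `take` and `skip`, so the three subsets may not contain the same examples from one epoch to the next. Passing `reshuffle_each_iteration=False` to `shuffle` is one possible way to keep the splits fixed.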

Create the model (Sequential): an embedding layer, a bidirectional LSTM layer, fully connected layers, and a final sigmoid layer with one output per label

In [10]:
model = Sequential()
model.add(Embedding(MAX_WORDS+1, 32))
model.add(Bidirectional(LSTM(32, activation="tanh")))
model.add(Dense(128, activation="relu"))
model.add(Dense(256, activation="relu"))
model.add(Dense(128, activation="relu"))
model.add(Dense(6, activation="sigmoid"))

model.compile(loss = 'BinaryCrossentropy', optimizer ='Adam')
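
As a sanity check on the architecture, the trainable parameter count can be worked out by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers, and assumes the default `Bidirectional` behavior of concatenating the two directions (so the first dense layer sees 64 inputs). The helper names are hypothetical; comparing the total with `model.summary()` is a reasonable way to confirm it.

```python
MAX_WORDS = 200000

def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(inputs, units):
    # Weight matrix plus one bias per unit.
    return inputs * units + units

layers = [
    ("embedding", embedding_params(MAX_WORDS + 1, 32)),
    ("bidirectional_lstm", 2 * lstm_params(32, 32)),  # forward + backward
    ("dense_128", dense_params(64, 128)),
    ("dense_256", dense_params(128, 256)),
    ("dense_128_second", dense_params(256, 128)),
    ("dense_6", dense_params(128, 6)),
]

total = sum(count for _, count in layers)
for name, count in layers:
    print(f"{name:>20}: {count:,}")
print(f"{'total':>20}: {total:,}")
```

Almost all of the parameters sit in the embedding table, which is a direct consequence of the 200000-word vocabulary.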

Train the model for 10 epochs


In [11]:
history = model.fit(train, epochs = 10, validation_data = validation )

Make our predictions

In [12]:
# Score a single hand-written comment
input_text = vectorizer("U fucker")
res = model.predict(np.array([input_text]))
# Grab one batch from the test set for inspection
batch_X, batch_Y = test.as_numpy_iterator().next()

Evaluating the model

In [13]:
pre = Precision()
re = Recall()
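# CategoricalAccuracy compares argmax positions and is designed for one-hot targets,
# so its value on flattened multi-label vectors is hard to interpret
acc = CategoricalAccuracy()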

for batch in test.as_numpy_iterator():
    X_true, y_true = batch
    yhat = model.predict(X_true)
    
    y_true = y_true.flatten()
    yhat = yhat.flatten()
    
    pre.update_state(y_true, yhat)
    re.update_state(y_true, yhat)
    acc.update_state(y_true, yhat)
    

Print Precision, Recall and Accuracy 

In [14]:
print(f'Precision: {pre.result().numpy()}, Recall: {re.result().numpy()}, Accuracy: {acc.result().numpy()}')

Precision: 0.04416438192129135, Recall: 0.46023210883140564, Accuracy: 0.004012036137282848
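
To see what these numbers measure, the metrics can be recomputed by hand on a small example. The sketch below uses hypothetical labels and scores for two comments (six labels each, flattened) and the 0.5 threshold that Keras `Precision` and `Recall` use by default. It also computes an element-wise accuracy, which is likely a better fit for 0/1 multi-label targets than an argmax-based accuracy.

```python
def binarize(scores, threshold=0.5):
    """Turn sigmoid scores into 0/1 predictions."""
    return [1 if s > threshold else 0 for s in scores]

def precision_recall_accuracy(y_true, y_score, threshold=0.5):
    """Return (precision, recall, element-wise accuracy) for flattened labels."""
    y_pred = binarize(y_score, threshold)
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

# Hypothetical flattened values: two comments x six labels.
y_true = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.1, 0.8, 0.2, 0.3, 0.1, 0.6, 0.1, 0.2, 0.1, 0.1, 0.1]

precision, recall, accuracy = precision_recall_accuracy(y_true, y_score)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, Accuracy: {accuracy:.3f}")
```

In Keras, `tf.keras.metrics.BinaryAccuracy` computes this kind of element-wise accuracy and could be swapped in for `CategoricalAccuracy` in the evaluation loop.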


Save our model

In [15]:
model.save("toxicity.h5")

Load our model

In [16]:
model = tf.keras.models.load_model('toxicity.h5')

In [17]:
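def score_comment(comment):
    """Return one 'label: True/False' line per toxicity label for a comment."""
    vectorized_comment = vectorizer([comment])
    results = model.predict(vectorized_comment)

    # A label is reported as True when its sigmoid score exceeds 0.5
    text = ''
    for idx, col in enumerate(df.columns[2:]):
        text += '{}: {}\n'.format(col, results[0][idx]>0.5)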
        
    return text
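
The formatting step can be tried without the model. The sketch below applies the same 0.5 threshold to a hypothetical score vector; `format_scores` is an illustrative helper, not part of the notebook.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def format_scores(scores, labels=LABELS, threshold=0.5):
    """Build one 'label: True/False' line per label from sigmoid scores."""
    return "".join(
        f"{label}: {score > threshold}\n" for label, score in zip(labels, scores)
    )

# Hypothetical model output for a single comment.
report = format_scores([0.91, 0.12, 0.77, 0.03, 0.64, 0.08])
print(report)
```

Lowering the threshold flags more comments (higher recall, lower precision), so 0.5 is a starting point rather than a fixed choice.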

In [18]:
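# gr.inputs.Textbox is the older Gradio input API; newer releases expose the same component as gr.Textbox
interface = gr.Interface(fn=score_comment, inputs=gr.inputs.Textbox(lines=2, placeholder='Comment to score'), outputs='text')
# share=True also creates a temporary public URL
interface.launch(share=True)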



Running on local URL:  http://127.0.0.1:7860/
Running on public URL: https://49028.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces (https://huggingface.co/spaces)


(<gradio.routes.App at 0x1fbf1715330>,
 'http://127.0.0.1:7860/',
 'https://49028.gradio.app')

