<a href="https://colab.research.google.com/github/ShipraShriparn/Toxic_Comment_Classification_DL/blob/main/Toxic_Comment_Detection_Using_Deep_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Toxic Comment Detection


---

Toxic comment classification is a subtask of sentiment analysis.
Toxic behavior, which includes rude, hateful, and threatening language, derails a productive comment thread and turns it into a battle. Recognizing such comments and responding to them is therefore essential for keeping online spaces healthy and valuable.

Dataset Used - "train.csv"

Link to The Dataset - https://drive.google.com/file/d/1ha2hGbcsTRUx9FvkfHdim-Afp9rkgfhw/view?usp=sharing

###Steps to implement the prediction model:

*   Data Collection and Extraction
*   Exploratory Data Analysis (EDA)
*   Data Exploration & Data Analysis
*   Data Visualization
*   Data Preprocessing & Data Cleaning
*   Model Evaluation
*   Building a predictive system

###Mounting Google Drive

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


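###Loading the train.csv file

In [5]:
import pandas as pd  # needed here because this cell appears before the imports cell

# Path to train.csv inside the mounted Drive
file_path = '/content/drive/MyDrive/train.csv'
data = pd.read_csv(file_path)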

###Importing Libraries

In [1]:
import os
import tensorflow as tf
import pandas as pd
import numpy as np

###Creating the Pandas Dataframe

In [7]:
df = pd.read_csv(file_path)
df

,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
...,...,...,...,...,...,...,...,...
159566,ffe987279560d7ff,""":::::And for the second time of asking, when ...",0,0,0,0,0,0
159567,ffea4adeee384e90,You should be ashamed of yourself \n\nThat is ...,0,0,0,0,0,0
159568,ffee36eab5c267c9,"Spitzer \n\nUmm, theres no actual article for ...",0,0,0,0,0,0
159569,fff125370e4aaaf3,And it looks like it was actually you who put ...,0,0,0,0,0,0


###Data Preprocessing and Vectorizing

---



> The TextVectorization layer allows you to convert raw text data into numerical representations that can be fed into a neural network.


*   Tokenization
*   Text Standardization
*   Text Splitting
*   Vocabulary Generation
*   Vectorization
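As a rough illustration of the first steps in that list, here is a plain-Python sketch of standardization and splitting. It is not the Keras implementation: the helper names `standardize` and `split` are made up for this example, and the punctuation set (`string.punctuation`) only approximates what the layer strips by default.

```python
import re
import string

# Assumption: ASCII punctuation stands in for the characters the layer removes.
_PUNCT = re.compile(f"[{re.escape(string.punctuation)}]")

def standardize(text):
    """Lowercase the text and remove punctuation."""
    return _PUNCT.sub("", text.lower())

def split(text):
    """Split standardized text on whitespace."""
    return text.split()

tokens = split(standardize("You, sir, are my hero. I DON'T mind!"))
print(tokens)
```

Because apostrophes are removed along with other punctuation, contractions collapse into single tokens, which is why forms such as `dont` and `im` show up in the notebook's vocabulary listing.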





In [8]:
from tensorflow.keras.layers import TextVectorization

###Text classification



---



*   Extracting the 'comment_text' column from the DataFrame df and assigning it to the variable X.
*   Extracting the values of all the columns from the third column onward (index 2 onwards) in the DataFrame df, i.e. the six label columns, and assigning them to the variable y.




In [9]:
X = df['comment_text']
y = df[df.columns[2:]].values
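The positional slice can be checked on a small stand-in frame. The sketch below assumes the same column order as `train.csv` (`id`, `comment_text`, then six labels); the two rows and the names `toy`, `X_toy`, and `y_toy` are invented for illustration.

```python
import pandas as pd

# Hypothetical two-row frame with the same column order as train.csv.
toy = pd.DataFrame({
    "id": ["a1", "b2"],
    "comment_text": ["hello there", "go away"],
    "toxic": [0, 1],
    "severe_toxic": [0, 0],
    "obscene": [0, 1],
    "threat": [0, 0],
    "insult": [0, 1],
    "identity_hate": [0, 0],
})

label_names = list(toy.columns[2:])   # everything after id and comment_text
X_toy = toy["comment_text"]           # Series of raw comment strings
y_toy = toy[label_names].values       # NumPy array, one row per comment

print(label_names)
print(y_toy.shape)
```

Each row of `y_toy` holds six independent 0/1 flags, so a single comment can carry several labels at once.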

###Limiting the vocabulary size and preventing memory or computational issues.

In [10]:
Max_words = 160000

###Configuring a TextVectorization layer in TensorFlow/Keras for text preprocessing.

In [11]:
vectorizer = TextVectorization(max_tokens=Max_words,output_sequence_length=1800,output_mode='int')
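The effect of `output_sequence_length` can be pictured with a small plain-Python sketch. The helper `pad_or_truncate` is hypothetical and only mimics the behavior visible in the notebook's output, where short comments are right-padded with zeros.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Return exactly `length` ids: cut long sequences, right-pad short ones."""
    clipped = list(token_ids[:length])
    return clipped + [pad_id] * (length - len(clipped))

short = pad_or_truncate([645, 76, 2], length=5)
long = pad_or_truncate([5, 12, 534, 8, 130, 2, 3], length=5)
print(short)  # [645, 76, 2, 0, 0]
print(long)   # [5, 12, 534, 8, 130]
```

With a length of 1800, most comments are likely to be mostly padding, which is one reason training on the full sequences is slow.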

###Building the vocabulary


---



*    The adapt method in the TextVectorization layer is used to build the vocabulary based on the input data, which is essential for converting text into numerical representations.



In [12]:
vectorizer.adapt(X.values)
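Conceptually, adapting amounts to counting tokens and ranking them by frequency. The following is a simplified sketch, not the Keras code: `build_vocabulary` is a made-up helper, standardization is reduced to lowercasing, and the alphabetical tie-break is an assumption.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Frequency-ordered vocabulary with two reserved entries.

    Index 0 is the padding token '' and index 1 is '[UNK]', matching the
    first two entries that get_vocabulary() prints in this notebook.
    """
    counts = Counter(token for text in texts for token in text.lower().split())
    # Descending count, then alphabetical (tie-breaking is an assumption).
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

corpus = ["the page is good", "the article is the best", "edit the page"]
vocab = build_vocabulary(corpus, max_tokens=6)
print(vocab)  # ['', '[UNK]', 'the', 'is', 'page', 'article']
```

Tokens that fall outside the cap (`best`, `edit`, and `good` here) are not stored and would later be mapped to `[UNK]`.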

###Returning the vocabulary (list of words) that was built by the TextVectorization layer.


---



*   Useful for checking which words were kept during preprocessing or for inspecting the most frequent words in the text data.



In [13]:
vectorizer.get_vocabulary()

['',
 '[UNK]',
 'the',
 'to',
 'of',
 'and',
 'a',
 'you',
 'i',
 'is',
 'that',
 'in',
 'it',
 'for',
 'this',
 'not',
 'on',
 'be',
 'as',
 'have',
 'are',
 'your',
 'with',
 'if',
 'article',
 'was',
 'or',
 'but',
 'page',
 'my',
 'an',
 'from',
 'by',
 'do',
 'at',
 'about',
 'me',
 'so',
 'wikipedia',
 'can',
 'what',
 'there',
 'all',
 'has',
 'will',
 'talk',
 'please',
 'would',
 'its',
 'no',
 'one',
 'just',
 'like',
 'they',
 'he',
 'dont',
 'which',
 'any',
 'been',
 'should',
 'more',
 'we',
 'some',
 'other',
 'who',
 'see',
 'here',
 'also',
 'his',
 'think',
 'im',
 'because',
 'know',
 'how',
 'am',
 'people',
 'why',
 'edit',
 'articles',
 'only',
 'out',
 'up',
 'when',
 'were',
 'use',
 'then',
 'may',
 'time',
 'did',
 'them',
 'now',
 'being',
 'their',
 'than',
 'thanks',
 'even',
 'get',
 'make',
 'good',
 'had',
 'very',
 'information',
 'does',
 'could',
 'well',
 'want',
 'such',
 'sources',
 'way',
 'name',
 'these',
 'deletion',
 'pages',
 'first',
 'help'
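A token's integer id is simply its position in this list. The sketch below uses only the first ten entries printed above; `encode` and `decode` are hypothetical helpers, and with such a short list any other word is treated as unknown.

```python
# First ten entries of the vocabulary printed above.
vocab_head = ["", "[UNK]", "the", "to", "of", "and", "a", "you", "i", "is"]
token_to_id = {token: idx for idx, token in enumerate(vocab_head)}

def encode(tokens):
    """Map tokens to ids; anything outside the list becomes 1 ('[UNK]')."""
    return [token_to_id.get(tok, 1) for tok in tokens]

def decode(ids):
    """Map ids back to tokens, dropping padding (id 0)."""
    return [vocab_head[i] for i in ids if i != 0]

ids = encode(["you", "and", "i", "zebra"])
print(ids)                   # [7, 5, 8, 1]
print(decode(ids + [0, 0]))  # ['you', 'and', 'i', '[UNK]']
```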

### Transforming the text data



---

* The vectorizer is applied to the entire array of text data, and the result is stored in the vectorized_text variable.

In [14]:
vectorized_text = vectorizer(X.values)
vectorized_text

<tf.Tensor: shape=(159571, 1800), dtype=int64, numpy=
array([[  645,    76,     2, ...,     0,     0,     0],
       [    1,    54,  2489, ...,     0,     0,     0],
       [  425,   441,    70, ...,     0,     0,     0],
       ...,
       [32445,  7392,   383, ...,     0,     0,     0],
       [    5,    12,   534, ...,     0,     0,     0],
       [    5,     8,   130, ...,     0,     0,     0]])>

In [15]:
y.shape

(159571, 6)

In [16]:
vectorized_text.shape

TensorShape([159571, 1800])

In [17]:
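dataset = tf.data.Dataset.from_tensor_slices((vectorized_text, y))
dataset = dataset.cache()
# Keep the shuffle order fixed across iterations so that the take/skip
# splits made later do not overlap from one epoch to the next
dataset = dataset.shuffle(160000, reshuffle_each_iteration=False)
dataset = dataset.batch(16)
dataset = dataset.prefetch(8)  # helps prevent input bottlenecks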

In [18]:
batch_x, batch_y = dataset.as_numpy_iterator().next()

In [19]:
batch_x.shape

(16, 1800)

###Splitting the dataset into training, validation and test sets

In [20]:
train = dataset.take(int(len(dataset)*.7))
val = dataset.skip(int(len(dataset)*.7)).take(int(len(dataset)*.2))
test = dataset.skip(int(len(dataset)*.9)).take(int(len(dataset)*.1))
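The split sizes follow from simple arithmetic on the number of batches. This sketch reproduces it in plain Python, using the row count from `y.shape` and the batch size of 16; the variable names are chosen for this example only.

```python
import math

n_samples = 159571   # rows in the DataFrame (see y.shape)
batch_size = 16
n_batches = math.ceil(n_samples / batch_size)  # len(dataset) after batching

train_batches = int(n_batches * .7)
val_start, val_batches = int(n_batches * .7), int(n_batches * .2)
test_start, test_batches = int(n_batches * .9), int(n_batches * .1)

print(n_batches, train_batches, val_batches, test_batches)
print(val_start + val_batches, test_start)  # end of val vs. start of test
```

The 6981 training batches agree with the `15/6981` progress line in the training output. Because each fraction is truncated separately, validation stops at batch 8975 while the test split starts at 8976, so a couple of batches end up in no split at all.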

In [21]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Bidirectional, Dense, Embedding

##Building Model

In [22]:
model = Sequential()
# Create the embedding layer
model.add(Embedding(Max_words+1, 32))
# Bidirectional LSTM Layer
model.add(Bidirectional(LSTM(32, activation='tanh')))
# Feature extractor Fully connected layers
model.add(Dense(128, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(128, activation='relu'))
# Final layer
model.add(Dense(6, activation='sigmoid'))

In [23]:
model.compile(loss='BinaryCrossentropy', optimizer='Adam')
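For a multi-label target, binary cross-entropy scores each of the six outputs as its own yes/no problem and averages the results. The sketch below works this out by hand for one comment; the labels and probabilities are invented, and the clipping constant `eps` is an assumption of this example.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the labels of one example."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities for a single comment.
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [0.9, 0.1, 0.8, 0.2, 0.6, 0.05]
loss = binary_crossentropy(y_true, y_pred)
print(round(loss, 4))
```

The least confident correct prediction (0.6 for a positive label) contributes the largest term, which is the behavior that pushes the sigmoid outputs toward 0 or 1 during training.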

In [24]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 32)          5120032   
                                                                 
 bidirectional (Bidirectiona  (None, 64)               16640     
 l)                                                              
                                                                 
 dense (Dense)               (None, 128)               8320      
                                                                 
 dense_1 (Dense)             (None, 256)               33024     
                                                                 
 dense_2 (Dense)             (None, 128)               32896     
                                                                 
 dense_3 (Dense)             (None, 6)                 774       
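The parameter counts in this summary can be reproduced by hand from the layer sizes. The sketch below assumes the standard formulas (four gates per LSTM, one bias per unit); the variable names are for illustration only.

```python
vocab_rows, embed_dim, lstm_units = 160000 + 1, 32, 32

embedding = vocab_rows * embed_dim
# One LSTM has 4 gates, each with input weights, recurrent weights, and a bias.
one_lstm = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
bidirectional = 2 * one_lstm  # forward and backward copies

def dense(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

counts = [embedding, bidirectional,
          dense(64, 128), dense(128, 256), dense(256, 128), dense(128, 6)]
print(counts)       # [5120032, 16640, 8320, 33024, 32896, 774]
print(sum(counts))  # 5211686
```

Almost all of the weights sit in the embedding table, so the vocabulary size is the main lever on model size.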
                                                        

In [None]:
history = model.fit(train, epochs=5, validation_data=val)

Epoch 1/5
  15/6981 [..............................] - ETA: 3:42:08 - loss: 0.0419

In [None]:
from matplotlib import pyplot as plt

In [None]:
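# Plot training and validation loss per epoch; figsize is passed to plot()
# because DataFrame.plot() opens its own figure
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.show()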

In [None]:
input_text = vectorizer('You freaking suck!')

In [None]:
input_text

In [None]:
np.expand_dims(input_text,0)
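The extra axis is needed because the model expects a batch of sequences, even when there is only one. A minimal sketch with a zero-filled stand-in for the vectorized comment:

```python
import numpy as np

single = np.zeros(1800, dtype=np.int64)  # stand-in for one vectorized comment
batched = np.expand_dims(single, 0)      # add a leading batch axis

print(single.shape)   # (1800,)
print(batched.shape)  # (1, 1800)
```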

###Model Prediction

In [None]:
model.predict(np.expand_dims(input_text,0))

In [None]:
batch = test.as_numpy_iterator().next()

In [None]:
batch_X, batch_y = test.as_numpy_iterator().next()

In [None]:
res = model.predict(batch_X)  # predict on a whole batch of comments at once

In [None]:
res

In [None]:
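from tensorflow.keras.metrics import Precision, Recall, BinaryAccuracy

In [None]:
pre = Precision()
re = Recall()
# BinaryAccuracy thresholds each flattened score at 0.5; CategoricalAccuracy
# compares argmax positions, which is not meaningful for multi-label data
acc = BinaryAccuracy()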

In [None]:
for batch in test.as_numpy_iterator():
    # Unpack the batch
    X_true, y_true = batch
    # Make a prediction
    yhat = model.predict(X_true)

    # Flatten the predictions
    y_true = y_true.flatten()
    yhat = yhat.flatten()

    pre.update_state(y_true, yhat)
    re.update_state(y_true, yhat)
    acc.update_state(y_true, yhat)

In [None]:
print(f'Precision: {pre.result().numpy()}, Recall: {re.result().numpy()}, Accuracy: {acc.result().numpy()}')
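To make these two numbers concrete, precision and recall can be computed by hand on flattened labels. This is a plain-Python sketch with invented values; `precision_recall` is a hypothetical helper and the 0.5 threshold is an assumption meant to mirror the metrics' default.

```python
def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall over flattened 0/1 labels and probabilities."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical flattened labels and scores (not model output).
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.8, 0.1, 0.2]
precision, recall = precision_recall(y_true, y_prob)
print(precision, recall)
```

Here two of the three flagged items are truly positive and two of the three positives are found, so both values come out at two thirds. Since most labels in this dataset are 0, accuracy alone would look high even for a model that never flags anything, which is why precision and recall are reported alongside it.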

###Installing Gradio (interactive model demos) and Jinja2 (template rendering)

In [None]:
!pip install gradio jinja2

In [None]:
import gradio as gr

In [None]:
model.save('toxicity.h5')

In [None]:
model = tf.keras.models.load_model('toxicity.h5')

In [None]:
input_str = vectorizer('I freaken hate you!')

In [None]:
res = model.predict(np.expand_dims(input_str,0))
res

In [None]:
df.columns[2:]

In [None]:
def score_comment(comment):
    vectorized_comment = vectorizer([comment])
    results = model.predict(vectorized_comment)

    text = ''
    for idx, col in enumerate(df.columns[2:]):
        text += '{}: {}\n'.format(col, results[0][idx]>0.5)

    return text
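The formatting step can be tried without a trained model. In this sketch the probabilities are invented and `format_scores` is a hypothetical stand-in for the loop inside `score_comment`.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def format_scores(probabilities, threshold=0.5):
    """Build one 'label: True/False' line per label."""
    return "".join(
        f"{label}: {prob > threshold}\n"
        for label, prob in zip(LABELS, probabilities)
    )

# Hypothetical probabilities, not real model output.
report = format_scores([0.93, 0.08, 0.71, 0.02, 0.64, 0.01])
print(report)
```

Reporting only a boolean hides how close a score is to the threshold; showing the raw probability next to the flag would be one possible refinement.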

In [None]:
# gr.Textbox replaces the older gr.inputs.Textbox, which recent Gradio releases no longer provide
interface = gr.Interface(fn=score_comment,
                         inputs=gr.Textbox(lines=2, placeholder='Comment to score'),
                         outputs='text')

In [None]:
interface.launch(share=True)