# Advanced Machine Learning and Artificial Intelligence (MScA 32017)

# Project: Detection of Toxic Comments Online

# Introduction
[Jigsaw's Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge), organized by Kaggle and [Jigsaw](https://jigsaw.google.com/), attracted more than 4,500 teams and ranked as the third most popular featured contest in Kaggle history. The goal of the competition was to identify and classify toxic online comments.

As Kaggle puts it, "The Conversation AI team, a research initiative founded by Jigsaw and Google (both a part of Alphabet) are working on tools to help improve online conversation. Discussing things we care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments."

# Data overview

The train and test data can be found on the [data page](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data). Log in to the kaggle.com website and download the following files:

`train.csv`, `test.csv`, `test_labels.csv`.

Look at the data structure.

In [2]:
#pip install keras



In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import Embedding, Input, Dense,\
    GlobalMaxPooling1D, LSTM, Bidirectional
from sklearn.metrics import roc_auc_score

### Data Preprocessing Steps

#### Import the Pre-Trained Word Vectors

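The GloVe vectors can be downloaded from http://nlp.stanford.edu/data/glove.6B.zip. The code below expects the extracted `glove.6B.100d.txt` file under `./Project3_Text/`.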

In [35]:
EMBEDDING_DIMENSION = 100
EMBEDDING_FILE_LOC = './Project3_Text/glove.6B.' + str(EMBEDDING_DIMENSION) + 'd.txt'
TRAINING_DATA_LOC = 'tc_train.csv'
MAX_VOCAB_SIZE = 20000
MAX_SEQUENCE_LENGTH = 100
BATCH_SIZE = 128
EPOCHS = 10
VALIDATION_SPLIT = 0.2

#### Import Training Data

In [36]:
training_data = pd.read_csv(TRAINING_DATA_LOC)

In [43]:
word_to_vector = {}
with open(EMBEDDING_FILE_LOC, encoding="utf8") as file:
    # Each line is space-separated: word vec[0] vec[1] vec[2] ...
    for line in file:
        word, *word_vec = line.split()

        # Convert the vector to a numpy array and store it under its word
        word_to_vector[word] = np.asarray(word_vec, dtype='float32')

# Print the total number of word vectors found
print(f'Total of {len(word_to_vector)} word vectors are found.')

Total of 400000 word vectors are found.
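
To make the file format concrete, here is a minimal sketch that parses two made-up lines in the same `word v0 v1 ...` layout from an in-memory buffer. The words, the 3-dimensional vectors, and the helper name `parse_glove_lines` are illustrative assumptions and are not part of the GloVe files.

```python
import io
import numpy as np

def parse_glove_lines(handle):
    """Parse lines of the form 'word v0 v1 ...' into a dict of float32 arrays."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:  # skip blank lines
            continue
        vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return vectors

# Two made-up entries with 3 dimensions each (hypothetical values)
toy_file = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
toy_vectors = parse_glove_lines(toy_file)
print(sorted(toy_vectors), toy_vectors['dog'].shape)
```

Every vector in the real file should have `EMBEDDING_DIMENSION` entries, so checking the shape of a few entries is a cheap sanity test after loading.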


In [38]:
training_data.head()

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


In [6]:
# Checking the data info for any null entries present
training_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 223549 entries, 0 to 223548
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             223549 non-null  object
 1   comment_text   223549 non-null  object
 2   toxic          223549 non-null  int64 
 3   severe_toxic   223549 non-null  int64 
 4   obscene        223549 non-null  int64 
 5   threat         223549 non-null  int64 
 6   insult         223549 non-null  int64 
 7   identity_hate  223549 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 13.6+ MB


The data contain an `id` column, the comment text, and six binary class-indicator columns, which are the target variables.

The target variables are the following types of toxicity:

In [39]:
types = list(training_data)[2:]  # skip 'id' and 'comment_text'
print(types)

['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


#### Split Data into Feature (Comment) and Target Classes

In [40]:
comments = training_data['comment_text'].values
detection_classes = ['toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate']
target_classes = training_data[detection_classes].values

Comments can belong to several classes simultaneously. The following figure shows the frequencies of the classes in the training set.
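
As a hedged sketch of how such class frequencies and label overlaps could be computed, the block below counts labels in a small made-up indicator matrix. The five rows are invented for illustration and are not competition data; the names `toy_targets`, `class_counts`, `n_multi_label`, and `n_clean` are assumptions.

```python
import numpy as np

labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Made-up indicator rows, one row per comment (not real competition data)
toy_targets = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
])

# Column sums give the number of comments carrying each label
class_counts = dict(zip(labels, toy_targets.sum(axis=0).tolist()))

# Row sums give the number of labels attached to each comment
labels_per_comment = toy_targets.sum(axis=1)
n_multi_label = int((labels_per_comment > 1).sum())
n_clean = int((labels_per_comment == 0).sum())

print(class_counts, n_multi_label, n_clean)
```

Applied to the notebook's own array, `target_classes.sum(axis=0)` would give the per-class counts in the same column order as `detection_classes`.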

In [41]:
# Max and Min Length
print(f'Maximum length of the comments {max(len(s) for s in comments)}')
print(f'Minimum length of the comments {min(len(s) for s in comments)}')

# Median Length
s = sorted(len(s) for s in comments)
print(f'Median length of the comments {s[len(s) // 2]}')

Maximum length of the comments 5000
Minimum length of the comments 1
Median length of the comments 203
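
Note that `s[len(s) // 2]` picks the upper of the two middle elements when the number of comments is even, so it matches the conventional median only for an odd count (the training set has 223549 rows, which is odd). A minimal sketch on made-up lengths, where `upper_median` is a hypothetical helper name:

```python
import statistics

def upper_median(values):
    """Return the element at index len // 2 of the sorted values."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]

odd_lengths = [5, 1, 9]       # made-up comment lengths
even_lengths = [5, 1, 9, 3]

# Odd count: both definitions agree
print(upper_median(odd_lengths), statistics.median(odd_lengths))
# Even count: indexing returns the upper middle value, the median averages the two
print(upper_median(even_lengths), statistics.median(even_lengths))
```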


#### Convert Comments (Strings) into Integers

In [42]:
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts(comments)
sequences = tokenizer.texts_to_sequences(comments)

#### Word to Integer Mapping

In [44]:
word_to_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_to_index))

Found 300257 unique tokens.
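
The count above covers every token seen during fitting, while `num_words` only limits which tokens survive encoding. The following is a simplified pure-Python sketch of that behavior, not the Keras implementation: it ranks words by frequency and drops any word whose index is not below `num_words`. The helper names and the toy sentences are assumptions, and the real `Tokenizer` also filters punctuation.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 stays unused."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Encode texts, keeping only words whose index is below num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

toy_texts = ["the cat sat", "the cat ran", "the dog"]
toy_index = fit_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index, num_words=3)

print(len(toy_index), toy_sequences)
```

Under this reading, with `MAX_VOCAB_SIZE = 20000` only indices 1 to 19999 appear in `sequences`, even though `word_index` lists all 300257 tokens.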


#### Pad Sequences to an N x T Matrix

In [45]:
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', data.shape)

Shape of data tensor: (223549, 100)
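
By default, `pad_sequences` pads and truncates at the front (`padding='pre'`, `truncating='pre'`), so a long comment keeps its last `MAX_SEQUENCE_LENGTH` tokens. The pure-Python sketch below mimics that default on two made-up sequences; `pad_pre` is a hypothetical helper, not a Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last maxlen items of long ones."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                             # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)   # pad at the front
    return padded

toy_padded = pad_pre([[7, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy_padded)
```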


#### Prepare the Embedding Matrix

In [48]:
num_words = min(MAX_VOCAB_SIZE, len(word_to_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIMENSION))
for word, i in word_to_index.items():
    if i < MAX_VOCAB_SIZE:
        embedding_vector = word_to_vector.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
# words not found in embedding index will be all zeros
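
Because missing words stay as zero rows, it can be useful to check how much of the vocabulary actually received a pre-trained vector. The sketch below does this on a made-up three-word vocabulary with 2-dimensional vectors; all names and values are illustrative assumptions.

```python
import numpy as np

toy_word_index = {'the': 1, 'cat': 2, 'zzyzx': 3}  # index 0 is reserved for padding
toy_vectors = {'the': np.array([0.1, 0.2]), 'cat': np.array([0.3, 0.4])}

toy_matrix = np.zeros((len(toy_word_index) + 1, 2))
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[i] = vec

# A row counts as covered if it holds any non-zero value
covered = np.any(toy_matrix != 0, axis=1)
coverage = covered[1:].mean()  # skip the padding row at index 0

print(f'{coverage:.2f} of the vocabulary has a pre-trained vector')
```

The same two lines applied to `embedding_matrix` would report the coverage for the notebook's vocabulary, under the assumption that no genuine GloVe vector is exactly all zeros.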

#### Load Pre-Trained Word Embeddings into an Embedding Layer

In [49]:
# Set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIMENSION,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

#### Build the Bidirectional LSTM Model

In [50]:
input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))
x = embedding_layer(input_)
x = Bidirectional(LSTM(units=15, return_sequences=True))(x)
x = GlobalMaxPooling1D()(x)
output = Dense(len(detection_classes), activation="sigmoid")(x)

model = Model(input_, output)
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
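
As a rough worked example, the parameter counts of this architecture can be estimated by hand. The sketch assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias) and `num_words = 20000`, which follows from `MAX_VOCAB_SIZE` being smaller than the number of tokens found. `lstm_params` is a hypothetical helper; `model.summary()` should be used to confirm the actual numbers.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)

num_words, embedding_dim, units, n_classes = 20000, 100, 15, 6

embedding_params = num_words * embedding_dim            # frozen (trainable=False)
bilstm_params = 2 * lstm_params(embedding_dim, units)   # forward + backward LSTM
dense_params = (2 * units) * n_classes + n_classes      # weights + biases
trainable_params = bilstm_params + dense_params

print(embedding_params, bilstm_params, dense_params, trainable_params)
```

Under these assumptions the frozen embedding holds the vast majority of the weights, and only the recurrent and dense layers are updated during training.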

#### Train the Bidirectional LSTM model

In [51]:
rnn_model = model.fit(data,
                      target_classes,
                      batch_size=BATCH_SIZE,
                      epochs=EPOCHS,
                      validation_split=VALIDATION_SPLIT)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
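
For orientation, the arithmetic below estimates how many samples and batches each epoch involves. It assumes that Keras holds out the last `VALIDATION_SPLIT` fraction of the rows (taken before shuffling) and rounds the training count down; both the splitting rule and the variable names here are assumptions to be checked against the Keras version in use.

```python
import math

n_samples = 223549        # rows reported by training_data.info()
batch_size = 128          # BATCH_SIZE
validation_split = 0.2    # VALIDATION_SPLIT

n_train = math.floor(n_samples * (1.0 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)
```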


#### Model Evaluation for training data

#### Average ROC_AUC Score

In [52]:
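# Predict on the same data the model was fitted on and average the per-class AUCs
p = model.predict(data)
aucs = []
for j in range(len(detection_classes)):
    auc = roc_auc_score(target_classes[:, j], p[:, j])
    aucs.append(auc)
print(np.mean(aucs))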

0.9826446361220923
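
The score above is a mean column-wise ROC AUC computed over all rows of `data`, most of which the model was trained on, so it is likely optimistic. To show what the metric measures, here is a small NumPy sketch that computes AUC as the fraction of positive/negative pairs ranked correctly (ties count one half) and averages it over columns. The data are made up, `roc_auc` is a hypothetical stand-in for `roc_auc_score`, and the pairwise approach is only practical for small inputs.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs every negative
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Made-up labels and predicted probabilities for two classes
toy_true = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
toy_pred = np.array([[0.9, 0.2], [0.1, 0.8], [0.4, 0.6], [0.5, 0.3]])

toy_aucs = [roc_auc(toy_true[:, j], toy_pred[:, j])
            for j in range(toy_true.shape[1])]
mean_auc = float(np.mean(toy_aucs))
print(toy_aucs, mean_auc)
```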


#### Import Test Data

In [53]:

test = pd.read_csv('tc_test.csv',index_col=0)
test.head()

| id | comment_text |
|---|---|
| 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you'll... |
| 0000247867823ef7 | == From RfC == \n\n The title is fine as it is... |
| 00013b17ad220c46 | " \n\n == Sources == \n\n * Zawe Ashton on Lap... |
| 00017563c3f7919a | :If you have a look back at the source, the in... |
| 00017695ad8997eb | I don't anonymously edit articles at all. |


#### Add Placeholder Columns for the 6 Detection Classes

In [54]:
# Placeholder values only; the model's predictions are computed further below
for label in detection_classes:
    test[label] = '0'

In [55]:
comments_test = test['comment_text'].values
detection_classes = ['toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate']
target_classes_test = test[detection_classes].values

In [56]:
# Max and Min Length
print(f'Maximum length of the comments {max(len(s) for s in comments_test)}')
print(f'Minimum length of the comments {min(len(s) for s in comments_test)}')

# Median Length
s = sorted(len(s) for s in comments_test)
print(f'Median length of the comments {s[len(s) // 2]}')

Maximum length of the comments 5000
Minimum length of the comments 1
Median length of the comments 169


In [57]:
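# tokenizer_test is fitted only to count the unique tokens in the test comments
tokenizer_test = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer_test.fit_on_texts(comments_test)
# Encode with the training tokenizer so indices match the embedding matrix
sequences_test = tokenizer.texts_to_sequences(comments_test)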

In [58]:
word_to_index_test = tokenizer_test.word_index
print('Found %s unique tokens.' % len(word_to_index_test))

Found 182361 unique tokens.
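
A tokenizer fitted on different text generally assigns different integers to the same word, which is why test comments need to be encoded with the vocabulary learned from the training data. The toy sketch below illustrates this with a simplified frequency-ranked index; `fit_word_index` and the sentences are illustrative assumptions, not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

train_index = fit_word_index(["good good bad", "good bad"])
test_index = fit_word_index(["bad bad good", "awful"])

# The same word maps to different integers in the two vocabularies
print(train_index['good'], test_index['good'])
# A word seen only in the test text has no index in the training vocabulary
print('awful' in train_index)
```

Feeding indices from `test_index` into a model trained with `train_index` would look up the wrong embedding rows, so the mismatch matters even when both vocabularies contain the word.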


In [59]:
data_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', data_test.shape)

Shape of data tensor: (89186, 100)


#### Model Predictions for the Test Data

In [60]:
r = model.predict(data_test)

In [74]:
dataset = pd.DataFrame({'toxic': r[:, 0], 'severe_toxic': r[:, 1], 'obscene': r[:, 2], 
                        'threat': r[:, 3], 'insult': r[:, 4], 'identity_hate': r[:, 5]})

In [85]:
dataset['id'] = test.index

In [86]:
datasettt = dataset[['id', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

In [87]:
datasettt

| | id | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | 0.995326 | 0.387035 | 0.965221 | 0.207557 | 0.914302 | 0.454665 |
| 1 | 0000247867823ef7 | 0.004953 | 0.000138 | 0.002658 | 0.000066 | 0.002209 | 0.000268 |
| 2 | 00013b17ad220c46 | 0.004423 | 0.000347 | 0.002000 | 0.000391 | 0.001979 | 0.000210 |
| 3 | 00017563c3f7919a | 0.002181 | 0.000035 | 0.000692 | 0.000062 | 0.000572 | 0.000031 |
| 4 | 00017695ad8997eb | 0.009769 | 0.000237 | 0.002650 | 0.000452 | 0.001575 | 0.000065 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 89181 | fffcd0960ee309b5 | 0.695490 | 0.005379 | 0.329502 | 0.002649 | 0.136020 | 0.001467 |
| 89182 | fffd7a9a6eb32c16 | 0.052134 | 0.000403 | 0.004151 | 0.002140 | 0.007686 | 0.000590 |
| 89183 | fffda9e8d6fafa9e | 0.001831 | 0.000027 | 0.000516 | 0.000066 | 0.000628 | 0.000104 |
| 89184 | fffe8f1340a79fc2 | 0.000683 | 0.000037 | 0.000164 | 0.000532 | 0.000233 | 0.002270 |


#### Export to csv file

In [88]:
datasettt.to_csv('dataframe.csv', index=False)
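
Before uploading, it may be worth reading the exported file back and checking its layout. The sketch below runs such a check on a made-up two-row CSV held in memory; the expected column order is taken from the export cell above, while the ids, values, and variable names are illustrative assumptions.

```python
import csv
import io

expected_columns = ['id', 'toxic', 'severe_toxic', 'obscene',
                    'threat', 'insult', 'identity_hate']

# Made-up two-row file standing in for dataframe.csv
toy_csv = io.StringIO(
    "id,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    "a1,0.9,0.1,0.8,0.0,0.7,0.2\n"
    "b2,0.0,0.0,0.1,0.0,0.0,0.0\n"
)

reader = csv.DictReader(toy_csv)
rows = list(reader)

# The header must match the expected order and every score must be a probability
header_ok = reader.fieldnames == expected_columns
probs_ok = all(0.0 <= float(row[c]) <= 1.0
               for row in rows for c in expected_columns[1:])

print(header_ok, probs_ok, len(rows))
```

For the real file, replacing `toy_csv` with `open('dataframe.csv', newline='')` would run the same checks on the exported predictions.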