## Assignment 4 - Text classification

The assignment for this week builds on these concepts and techniques. We're going to be working with the data in the folder CDS-LANG/toxic and trying to see if we can predict whether or not a comment is a certain kind of toxic speech. You should write two scripts which do the following:


- The second script should perform classification using the kind of deep learning methods we saw in class
  - Keras Embedding layer, Convolutional Neural Network
- Save the classification report to a text file

In [2]:
#setup script wasn't working in class, so:
!pip install nltk beautifulsoup4 contractions tensorflow scikit-learn

Collecting nltk
  Using cached nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting contractions
  Using cached contractions-0.1.68-py2.py3-none-any.whl (8.1 kB)
Collecting tensorflow
  Using cached tensorflow-2.8.0-cp39-cp39-manylinux2010_x86_64.whl (497.6 MB)
Collecting scikit-learn
  Using cached scikit_learn-1.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.4 MB)
Collecting joblib
  Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Collecting textsearch>=0.0.21
  Using cached textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting google-pasta>=0.1.1
  Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting termcolor>=1.1.0
  Downloading termcolor-1.1.0.tar.gz (3.9 kB)
  Preparing

In [1]:
# simple text processing tools
import os
import re
import tqdm
import unicodedata
import contractions
from bs4 import BeautifulSoup #remove things that are non-text
import nltk #we used spacy in the past
nltk.download('punkt')

# data wrangling
import pandas as pd
import numpy as np

# tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Dense, 
                                    Flatten,
                                    Conv1D, 
                                    MaxPooling1D, 
                                    Embedding)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

# scikit-learn
from sklearn.metrics import (confusion_matrix, 
                            classification_report)
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Machine learning stuff
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier 
from sklearn.model_selection import ShuffleSplit
from sklearn import metrics

# visualisations 
import matplotlib.pyplot as plt
%matplotlib inline


# fix random seed for reproducibility
seed = 42
np.random.seed(seed)

[nltk_data] Downloading package punkt to /home/ucloud/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
2022-04-28 22:31:16.761933: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-04-28 22:31:16.761971: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [2]:
#Helper functions for text processing
def strip_html_tags(text):
  soup = BeautifulSoup(text, "html.parser")
  [s.extract() for s in soup(['iframe', 'script'])]
  stripped_text = soup.get_text()
  stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text) #collapse runs of line breaks into one newline
  return stripped_text

def remove_accented_chars(text):
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
  return text #the comments are in English, which has no accented characters; where accents do occur they are likely used inconsistently, so we strip them

def pre_process_corpus(docs):
  norm_docs = []
  for doc in tqdm.tqdm(docs):
    doc = strip_html_tags(doc)
    doc = doc.translate(doc.maketrans("\n\t\r", "   "))
    doc = doc.lower() #lower case
    doc = remove_accented_chars(doc) #no accented characters
    doc = contractions.fix(doc) #resolves contractions: you're -> you are
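    # remove special characters (anything that is not a letter, digit or whitespace)
    # flags must be passed by keyword: the fourth positional argument of re.sub is count
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)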
    doc = re.sub(' +', ' ', doc)
    doc = doc.strip()  
    norm_docs.append(doc)
  
  return norm_docs
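
To see what the normalization steps do to a single comment, here is a minimal sketch that uses only the standard library. It leaves out the HTML stripping and the contraction expansion from the cell above, and `normalize` and the sample strings are made up for illustration. The last lines also show why flags are safest passed to `re.sub` by keyword: its fourth positional parameter is `count`, not `flags`.

```python
import re
import unicodedata

def normalize(text):
    # hypothetical stand-in for pre_process_corpus, without bs4 and contractions
    text = text.translate(str.maketrans("\n\t\r", "   "))  # line breaks and tabs -> spaces
    text = text.lower()
    # split accented letters into base letter + accent, then drop the non-ASCII accent
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text, flags=re.A)  # drop special characters
    text = re.sub(" +", " ", text)  # collapse repeated spaces
    return text.strip()

cleaned = normalize("Café  is\tGREAT!!!\n")
print(cleaned)  # cafe is great

# re.sub(pattern, repl, string, count, flags): a flag in fourth position is read as a count.
# re.I has the integer value 2, so this replaces only the first 2 matches.
limited = re.sub(r"[^a-z]", "", "a!b!c!", count=int(re.I))
print(limited)  # abc!
```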

In [3]:
# Load the data:
# get the filepath
filepath = os.path.join("..","..","CDS-LANG","toxic","VideoCommentsThreatCorpus.csv")

In [4]:
#open csv with pandas
data = pd.read_csv(filepath)

In [5]:
#looking at the dataset
print(data)

       label                                               text
0          0  It's because Europeans do not want to change t...
1          0  The Muslims there do not want to assimilate pr...
2          1  But it's ok....because Europe will soon rebel ...
3          0  I forsee a big civil war in Europe in the futu...
4          0  ISLAM – A Simple, Humanitarian and Attractive ...
...      ...                                                ...
28638      1  yeah we are all monsters..I'm gonna kill u rig...
28639      0                       stupid brainwashed idiot..\n
28640      0  have you EVER been to Serbia or kosovo...fucki...
28641      0  probably u mean to this monsters, fucker /watc...
28642      0  the fucking funniest thing is that fucking ame...

[28643 rows x 2 columns]


In [6]:
# create new variables called text and label
# taking the data out of the dataframe so that we can mess around with them.
X = data["text"] #text column
y = data["label"] #label column

In [7]:
#Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, #texts for the model
                                                    y, #classification labels
                                                    test_size = 0.2, #create an 80/20 split (testing to be 20% of the overall data)
                                                    random_state = 42) #fixed seed for the shuffle: just for reproducibility
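
As a quick check on the split sizes: to my understanding, `train_test_split` rounds a fractional `test_size` up to a whole number of samples and gives the rest to the training set. The sketch below only reproduces that arithmetic for the 28643 rows in this dataset; the variable names are illustrative.

```python
import math

n_samples = 28643   # rows in the dataframe
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # 5728.6 is rounded up
n_train = n_samples - n_test
print(n_train, n_test)  # 22914 5729
```

The same two counts show up in the progress bars when the corpus is pre-processed.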

In [9]:
#clean and normalize data (lots of noise like html tags) see helper cell
X_train_norm = pre_process_corpus(X_train)
X_test_norm = pre_process_corpus(X_test)

100%|██████████| 22914/22914 [00:01<00:00, 12165.07it/s]
100%|██████████| 5729/5729 [00:00<00:00, 11854.17it/s]


In [10]:
#looking at the first comment
print(X_train_norm[0]) #the first one

more like the holy qurap


In [11]:
#Tokenize sequences
# builds an index of every word in the training data, so that each word can be converted to its index number

#define out-of-vocabulary token
t = Tokenizer(oov_token = "<UNK>")
#words the tokenizer has not seen during fitting are mapped to <UNK>
              
#fit the tokenizer on the documents
t.fit_on_texts(X_train_norm)

#set padding value (documents differ in length, so we need a max document length; shorter documents are padded with zeros)
t.word_index["<PAD>"] = 0

In [12]:
#tokenize all documents using this fit tokenizer
#sequence: anything that can be iterated over
X_train_seqs = t.texts_to_sequences(X_train_norm)
X_test_seqs = t.texts_to_sequences(X_test_norm)
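
What the tokenizer does can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: the real `Tokenizer` also lowercases and filters punctuation by default. The function names `fit_word_index` and `to_sequences` and the toy sentences are made up for illustration. The out-of-vocabulary token takes index 1, the remaining words are numbered by frequency, and index 0 stays free for padding.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<UNK>"):
    # count every whitespace-separated word in the training texts
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}  # index 1 is reserved for unknown words
    # most frequent word gets index 2, the next one 3, and so on
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequences(texts, word_index, oov_token="<UNK>"):
    oov = word_index[oov_token]
    # unseen words fall back to the <UNK> index
    return [[word_index.get(word, oov) for word in text.split()] for text in texts]

train_texts = ["you are great", "you are awful"]
word_index = fit_word_index(train_texts)
seqs = to_sequences(["you are terrible"], word_index)
print(seqs)  # [[2, 3, 1]]
```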

In [13]:
#Sequence normalization
MAX_SEQUENCE_LENGTH = 1000

In [14]:
#add padding to sequences
X_train_pad = sequence.pad_sequences(X_train_seqs, maxlen= MAX_SEQUENCE_LENGTH)
X_test_pad = sequence.pad_sequences(X_test_seqs, maxlen= MAX_SEQUENCE_LENGTH)

In [15]:
#checking everything is working
X_train_pad.shape, X_test_pad.shape #22914 or 5729 comments that are all 1000 tokens long

((22914, 1000), (5729, 1000))
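
With its default settings, `pad_sequences` pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`). The plain-Python sketch below imitates that behaviour on two toy sequences so the effect is easy to see; `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(seqs, maxlen, value=0):
    padded = []
    for seq in seqs:
        seq = seq[-maxlen:]                    # too long: keep only the last maxlen tokens
        fill = [value] * (maxlen - len(seq))   # too short: zeros go in front
        padded.append(fill + seq)
    return padded

padded = pad_pre([[5, 8], [3, 9, 4, 7, 2, 6]], maxlen=4)
print(padded)  # [[0, 0, 5, 8], [4, 7, 2, 6]]
```

With `MAX_SEQUENCE_LENGTH = 1000`, most comments end up as a long run of zeros followed by the actual tokens.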

In [16]:
#label encoder
le = LabelEncoder()
num_classes = 2 #toxic -> 1, non-toxic -> 0
y_train_le = le.fit_transform(y_train)
y_test_le = le.transform(y_test)

In [17]:
#Create and compile model

#define parameters for model
#overall vocabulary size
VOCAB_SIZE = len(t.word_index)
#number of dimensions for embeddings
EMBED_SIZE = 300
#number of epochs to train for
EPOCHS = 2
#batch size for training
BATCH_SIZE = 128

In [18]:
# create the model
model = Sequential()
# embedding layer NEW LAYER!
model.add(Embedding(VOCAB_SIZE, #vocabulary of certain size
                    EMBED_SIZE, #number of dimensions for embeddings
                    input_length=MAX_SEQUENCE_LENGTH)) #1000 tokens per document

# first convolution layer and pooling
model.add(Conv1D(filters=128, #128 different kernels, each learning its own feature
                        kernel_size=4, 
                        padding='same',
                        activation='relu'))
model.add(MaxPooling1D(pool_size=2)) #max pooling = only the biggest value in each window of 2 is kept, halving the sequence length

# second convolution layer and pooling
model.add(Conv1D(filters=64, #64 kernels, half of before
                        kernel_size=4, 
                        padding='same', 
                        activation='relu'))
model.add(MaxPooling1D(pool_size=2))

# third convolution layer and pooling
model.add(Conv1D(filters=32, #32 kernels, half of before
                        kernel_size=4, 
                        padding='same', 
                        activation='relu'))
model.add(MaxPooling1D(pool_size=2))

# fully-connected classification layer
model.add(Flatten()) #one vector for each document
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid')) #only one node is wanted in output
model.compile(loss='binary_crossentropy', #labels are toxic/non-toxic = binary prediction
                        optimizer='adam', #not sgd
                        metrics=['accuracy'])
# print model summary (summary() prints by itself and returns None, so no print() is needed)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1000, 300)         7021200   
                                                                 
 conv1d (Conv1D)             (None, 1000, 128)         153728    
                                                                 
 max_pooling1d (MaxPooling1D  (None, 500, 128)         0         
 )                                                               
                                                                 
 conv1d_1 (Conv1D)           (None, 500, 64)           32832     
                                                                 
 max_pooling1d_1 (MaxPooling  (None, 250, 64)          0         
 1D)                                                             
                                                                 
 conv1d_2 (Conv1D)           (None, 250, 32)           8

2022-04-28 22:32:18.130184: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-04-28 22:32:18.130234: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (j-67796-job-0): /proc/driver/nvidia/version does not exist
2022-04-28 22:32:18.285111: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
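
The parameter counts in the summary can be checked by hand. The sketch below assumes the usual formulas: an embedding layer has vocabulary size times embedding size weights, and a `Conv1D` layer has one weight per kernel position, input channel and filter, plus one bias per filter. The helper `conv1d_params` is made up for illustration.

```python
EMBED_SIZE = 300
embedding_params = 7021200   # embedding row of the summary

# embedding weights = vocabulary size * embedding size
vocab_size = embedding_params // EMBED_SIZE
print(vocab_size)  # 23404

def conv1d_params(kernel_size, in_channels, filters):
    # weights for every (kernel position, input channel, filter) plus one bias per filter
    return kernel_size * in_channels * filters + filters

first_conv = conv1d_params(4, 300, 128)   # input is the 300-dimensional embedding
second_conv = conv1d_params(4, 128, 64)   # input is the 128 filters of the first layer
print(first_conv, second_conv)  # 153728 32832
```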


In [19]:
#train
history = model.fit(X_train_pad, y_train_le,
                    epochs = EPOCHS,
                    batch_size = BATCH_SIZE,
                    validation_split = 0.1, #usually only test/training split, but now training is getting split further
                    verbose = True) #gives updates on screen while training
#verbose = 0: nothing, 1: progress that updates all the time

Epoch 1/2
Epoch 2/2


In [20]:
#evaluate the model
scores = model.evaluate(X_test_pad, y_test_le, verbose=1)
print(f"Accuracy: {scores[1]}")

Accuracy: 0.9703264236450195


In [21]:
#loss value and accuracy
print(scores)

[0.09331762790679932, 0.9703264236450195]


In [22]:
model.predict(X_test_pad)

array([[3.3380121e-02],
       [3.2183528e-04],
       [6.2187612e-03],
       ...,
       [4.2440891e-03],
       [4.6268404e-03],
       [5.7459307e-01]], dtype=float32)

In [23]:
# 0.5 decision boundary
predictions = (model.predict(X_test_pad) > 0.5).astype("int32").flatten() #flatten the (n, 1) output to a 1-D array of 0/1
# assign labels
predictions = ["toxic" if item == 1 else "non-toxic" for item in predictions]
y_test = ["toxic" if item == 1 else "non-toxic" for item in y_test]
print(predictions[:20]) #20 first predictions

['non-toxic', 'non-toxic', 'non-toxic', 'toxic', 'non-toxic', 'non-toxic', 'non-toxic', 'non-toxic', 'non-toxic', 'non-toxic', 'non-toxic', 'non-toxic', 'non-toxic', 'non-toxic', 'non-toxic', 'non-toxic', 'non-toxic', 'non-toxic', 'non-toxic', 'toxic']


In [28]:
#confusion matrix and classification report
labels = ["non-toxic", "toxic"]
print(classification_report(y_test, predictions))
pd.DataFrame(confusion_matrix(y_test, predictions), 
             index=labels, columns=labels)

              precision    recall  f1-score   support

   non-toxic       0.98      0.99      0.98      5453
       toxic       0.77      0.55      0.64       276

    accuracy                           0.97      5729
   macro avg       0.87      0.77      0.81      5729
weighted avg       0.97      0.97      0.97      5729



           non-toxic  toxic
non-toxic       5408     45
toxic            125    151
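
The numbers in the classification report follow directly from the confusion matrix (rows are the true labels, columns the predictions). A minimal sketch that recomputes the scores for the toxic class, with illustrative variable names:

```python
# counts from the confusion matrix
tn, fp = 5408, 45    # true non-toxic: predicted non-toxic / predicted toxic
fn, tp = 125, 151    # true toxic:     predicted non-toxic / predicted toxic

precision = tp / (tp + fp)   # how many predicted toxic comments really are toxic
recall = tp / (tp + fn)      # how many toxic comments the model finds
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# accuracy of always predicting the majority class "non-toxic"
baseline = (tn + fp) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.77 0.55 0.64
print(round(accuracy, 4), round(baseline, 4))               # 0.9703 0.9518
```

Because only 276 of the 5729 test comments are toxic, the 0.97 accuracy is not far above the majority-class baseline; the recall of 0.55 on the toxic class says more about how well the model does.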


In [27]:
report = classification_report(y_test, predictions, target_names = labels)

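# save the report in the output folder as deep_learning_assign_4.txt
# the with-block closes the file, so the report is fully written to disk
with open("../../cds-lang/Lang-assignments/output/deep_learning_assign_4.txt", "w") as f:
    print(report, file=f)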

print("Done! Report has been generated and saved in the output folder as deep_learning_assign_4.txt")

Done! Report has been generated and saved in the output folder as deep_learning_assign_4.txt
