
# Project 3: Spam Filter for Quora Questions
Goal: Build a model that identifies whether a question on Quora is "SPAM".

Data: The dataset contains 1,306,122 unique questions with their class labels (0 = NOT SPAM, 1 = SPAM). Nearly 94% of the data belongs to the NOT SPAM class, leaving only about 6% SPAM.

Train-Validation-Test Split: The data has been split into Train-Validation-Test sets in the ratio 70:15:15, stratified on the class label column.

Embeddings: GloVe (Global Vectors for Word Representation), 200-dimensional vectors

Model: A convolutional neural network (CNN) that starts with an input layer and a frozen GloVe embedding layer, uses convolutional layers (with ReLU activation) to extract features, applies batch normalization, max-pooling, and dropout for regularization, and finally makes binary predictions (SPAM and NOT SPAM) using dense layers with a sigmoid activation function.



In [62]:
# Import Libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

from keras import Model, regularizers
from keras.layers import Embedding, Input, Conv1D, MaxPooling1D, Flatten, Dense, BatchNormalization, Dropout
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.models import load_model
from keras.metrics import Precision

In [5]:
# Load Data
!wget -O train.csv https://www.dropbox.com/sh/kpf9z73woodfssv/AAAw1_JIzpuVvwteJCma0xMla?dl=0

--2023-10-27 03:37:34--  https://www.dropbox.com/sh/kpf9z73woodfssv/AAAw1_JIzpuVvwteJCma0xMla?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.4.18, 2620:100:6021:18::a27d:4112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.4.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /sh/raw/kpf9z73woodfssv/AAAw1_JIzpuVvwteJCma0xMla [following]
--2023-10-27 03:37:35--  https://www.dropbox.com/sh/raw/kpf9z73woodfssv/AAAw1_JIzpuVvwteJCma0xMla
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc1c65390626196a59c5ef4de149.dl.dropboxusercontent.com/zip_download_get/BpyQRCxly-fnz3R6PJaOYboSlm5JWaBtuH0H1cBAV5-O2G9OyjWDPncsnJvdwYMg4hxJDOPpAftBQdV8gjyMIzRgitVx9WLV-LnKrSXyiLSJoQ# [following]
--2023-10-27 03:37:35--  https://uc1c65390626196a59c5ef4de149.dl.dropboxusercontent.com/zip_download_get/BpyQRCxly-fnz3R6PJaOYboSlm5JWaBtuH0H1cBAV5-O2G9OyjWDPncsnJvdwYMg4hxJDOPpAftBQdV8gjyMIzRgitV

In [6]:
# Sample Data
data_df = pd.read_csv("/content/train.csv", encoding="ISO-8859-1")
data_df.sample(5)

Unnamed: 0,PK,question_text,target
82779,103567ab43000bb2f07f,Is there any evidence that the microbiome matt...,0.0
1093669,d657d7bf4d8f9b5b8fcf,? Is it okay to do filmmaking after pursuing b...,0.0
1276506,fa287f917cc88300d3f7,What are some contradicting sayings?,0.0
85052,10a7fa2c3972e2c3bd7e,Do you think your creative degree helped you g...,0.0
520817,65f5fd9ebc1fc23ce7eb,What do you call people who work at IBM?,0.0


In [7]:
# Data information - Size, NULLS, data types
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1306123 entries, 0 to 1306122
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   PK          1306123 non-null  object 
 1   question_text  1306122 non-null  object 
 2   target         1306122 non-null  float64
dtypes: float64(1), object(2)
memory usage: 29.9+ MB


In [8]:
# Drop NULL rows
print(data_df[data_df.isna().any(axis=1)])
data_df.dropna(inplace=True)

                           PK question_text  target
1306122  PKÐÞ¾±´>g´>gPK           NaN     NaN


In [9]:
# Class Proportion
data_df['target'].value_counts(normalize=True)

0.0    0.93813
1.0    0.06187
Name: target, dtype: float64

In [10]:
# Calculate the character length of each question to help choose a well-represented MAX_SEQUENCE_LENGTH for training
data_df['question_len'] = data_df['question_text'].apply(lambda x: len(str(x)))
data_df['question_len'].describe().round(2)

count    1306122.00
mean          70.75
std           38.87
min            1.00
25%           45.00
50%           60.00
75%           85.00
max         1017.00
Name: question_len, dtype: float64

In [11]:
# 75% of the questions are at most 85 characters long, but the maximum is 1017. Let's check whether longer questions (more than 100 characters) show a different SPAM / NOT SPAM split
data_df[data_df['question_len'] > 100]['target'].value_counts(normalize=True)

0.0    0.852703
1.0    0.147297
Name: target, dtype: float64
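
To put the output above in context, a quick calculation compares the SPAM share among questions longer than 100 characters with the overall SPAM share. This is only a sketch: the variable names are illustrative, and the two rates are copied from the outputs of the cells above.

```python
# SPAM share overall and among questions longer than 100 characters,
# copied from the notebook outputs above.
overall_spam_rate = 0.06187
long_question_spam_rate = 0.147297

# How much more common SPAM is among long questions than on average.
lift = long_question_spam_rate / overall_spam_rate
print(f"Long questions are {lift:.2f}x as likely to be SPAM as the average question")
```

Even with this higher rate, most long questions (about 85%) are still NOT SPAM.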

In [12]:
# About 85% of questions longer than 100 characters are NOT SPAM, but their SPAM share (about 14.7%) is more than double the overall 6.2%. Now look at the length distribution of SPAM questions.
data_df[data_df['target']==1]['question_len'].describe()

count    80810.000000
mean        98.267851
std         55.364399
min          1.000000
25%         55.000000
50%         86.000000
75%        130.000000
max       1017.000000
Name: question_len, dtype: float64

In [13]:
# SPAM questions range from 1 to 1017 characters, with 75% of them at most 130 characters long.
# question_len counts characters while MAX_SEQUENCE_LENGTH counts tokens, so a MAX_SEQUENCE_LENGTH of 200 tokens should cover the large majority of questions in both classes
MAX_SEQUENCE_LENGTH = 200
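
The length statistics above are measured in characters, whereas the padded sequences are measured in tokens. A minimal sketch of the difference, using two questions from the sample rows shown earlier and a plain whitespace split as a rough stand-in for the Keras tokenizer (an assumption, since the tokenizer also strips punctuation):

```python
# Character length vs. whitespace-token count for two example questions.
questions = [
    "What are some contradicting sayings?",
    "What do you call people who work at IBM?",
]

token_counts = []
for q in questions:
    n_chars = len(q)            # what question_len measures
    n_tokens = len(q.split())   # closer to what the padded sequence length measures
    token_counts.append(n_tokens)
    print(f"{n_chars:3d} chars, {n_tokens:2d} tokens: {q}")
```

A question has far fewer tokens than characters, so a token limit of 200 is much more generous than a character limit of 200 would be.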

In [14]:
# Train-Test-Validation Split (70:15:15 stratified by 'target')

texts = data_df['question_text']
labels = data_df['target']

x_train, x_temp, y_train, y_temp = train_test_split(texts, labels, test_size=0.3, stratify=labels, random_state=102)
x_val, x_test, y_val, y_test = train_test_split(x_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=102)

In [15]:
# Check target distribution in each data group
print("Training Data")
print("No.of rows: {}".format(len(y_train)))
print(y_train.value_counts(normalize=True))
print("*"*25)
print("Testing Data")
print("No.of rows: {}".format(len(y_test)))
print(y_test.value_counts(normalize=True))
print("*"*25)
print("Validation Data")
print("No.of rows: {}".format(len(y_val)))
print(y_val.value_counts(normalize=True))

Training Data
No.of rows: 914285
0.0    0.93813
1.0    0.06187
Name: target, dtype: float64
*************************
Testing Data
No.of rows: 195919
0.0    0.938127
1.0    0.061873
Name: target, dtype: float64
*************************
Validation Data
No.of rows: 195918
0.0    0.938132
1.0    0.061868
Name: target, dtype: float64
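
The row counts printed above can be reproduced by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remaining rows go to the other side of the split; under that assumption the arithmetic matches the printed sizes.

```python
import math

n_total = 1306122  # rows remaining after dropping the NULL row

# First split: 30% held out, 70% for training.
n_temp = math.ceil(0.3 * n_total)
n_train = n_total - n_temp

# Second split: the held-out rows are divided in half (test_size=0.5).
n_test = math.ceil(0.5 * n_temp)
n_val = n_temp - n_test

print(n_train, n_val, n_test)
```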


In [16]:
# Calculate class weights for training (note: cw_dict is not passed to model.fit in the training cell below, so training is unweighted)
cw = class_weight.compute_class_weight(class_weight="balanced", classes=np.unique(y_train), y=y_train)
cw_dict = {i:val for i, val in enumerate(cw)}
cw_dict

{0: 0.532975290246911, 1: 8.081434405218591}
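
The "balanced" weights above follow the formula `n_samples / (n_classes * class_count)`. Here is a small sketch of that formula in plain Python; `balanced_class_weights` is a hypothetical helper, and the toy labels are made up to mimic the roughly 94:6 imbalance.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}

# Toy labels with a 94:6 split (made-up data, not the notebook's dataset).
toy_labels = [0] * 94 + [1] * 6
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rare class gets a weight well above 1 and the common class a weight below 1, which is the same pattern as in `cw_dict`.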

In [17]:
# Text Preprocessing -- Tokenize texts and pad sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')

labels = to_categorical(labels, num_classes=2)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Found 222186 unique tokens.
Shape of data tensor: (1306122, 200)
Shape of label tensor: (1306122, 2)
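
To make the padding step concrete, here is a plain-Python sketch of what `padding='post'` does to one sequence. It assumes the Keras default `truncating='pre'`, under which an over-long sequence keeps its last `maxlen` tokens; `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad with `value` at the end; if too long, keep the last `maxlen` items.

    Assumes maxlen is a positive integer.
    """
    truncated = sequence[-maxlen:]  # mirrors truncating='pre'
    return truncated + [value] * (maxlen - len(truncated))

short = pad_post([5, 8, 2], maxlen=5)               # trailing zeros are added
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)    # the first items are dropped
print(short)
print(long)
```

With `MAX_SEQUENCE_LENGTH = 200`, almost every question is padded rather than truncated, so most rows in `data` end in a long run of zeros.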


In [18]:
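# Train-Test-Validation Data Processing
# Index labels equal row positions here because only the last row was dropped, so they can be used to slice the arrays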

x_train = data[list(x_train.index)]
y_train = labels[list(y_train.index)]
x_val = data[list(x_val.index)]
y_val = labels[list(y_val.index)]
x_test = data[list(x_test.index)]
y_test = labels[list(y_test.index)]

print("Training Data")
print(f"X: {x_train.shape}, Y: {y_train.shape}")
print("Validation Data")
print(f"X: {x_val.shape}, Y: {y_val.shape}")
print("Testing Data")
print(f"X: {x_test.shape}, Y: {y_test.shape}")

Training Data
X: (914285, 200), Y: (914285, 2)
Validation Data
X: (195918, 200), Y: (195918, 2)
Testing Data
X: (195919, 200), Y: (195919, 2)


In [3]:
# Load GloVE Embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip -d glove.6B

--2023-10-27 03:33:50--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-10-27 03:33:50--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-10-27 03:33:51--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [19]:
# Extract GloVe word vectors along with their corresponding words and store them in a dictionary
glove_file = "glove.6B/glove.6B.200d.txt"
embeddings_index = {}
# Each line holds a word followed by its 200 coefficients, separated by spaces
with open(glove_file, encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.
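
The GloVe text format is one word per line followed by its coefficients. The sketch below parses that format in plain Python, using an in-memory file with made-up 3-dimensional vectors in place of the real 200-dimensional file; `parse_glove` is a hypothetical helper.

```python
import io

def parse_glove(lines):
    """Map each word to its list of float coefficients."""
    index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        index[parts[0]] = [float(v) for v in parts[1:]]
    return index

# Made-up vectors standing in for glove.6B.200d.txt.
fake_file = io.StringIO("the 0.1 -0.2 0.3\nspam 0.5 0.0 -1.5\n")
vectors = parse_glove(fake_file)
print(vectors["spam"])
```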


In [20]:
# Creates an embedding matrix that maps words from the dataset to their corresponding pre-trained word embeddings (if available).
# Words not found in the pre-trained embeddings are represented as all-zeros in the matrix.
EMBEDDING_DIM = 200
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
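
Since words missing from GloVe stay as all-zero rows, it can be useful to check how much of the vocabulary is actually covered. A minimal sketch with toy dictionaries standing in for `word_index` and `embeddings_index` (the helper name and the data are made up):

```python
def embedding_coverage(word_index, embeddings_index):
    """Fraction of vocabulary words that have a pre-trained vector."""
    if not word_index:
        return 0.0
    found = sum(1 for word in word_index if word in embeddings_index)
    return found / len(word_index)

# Toy stand-ins for the notebook's dictionaries (made-up data).
toy_word_index = {"what": 1, "is": 2, "quora": 3, "qwertyzz": 4}
toy_embeddings = {"what": [0.1], "is": [0.2], "quora": [0.3]}

coverage = embedding_coverage(toy_word_index, toy_embeddings)
print(f"{coverage:.0%} of the vocabulary has a pre-trained vector")
```

Calling the same function on the real `word_index` and `embeddings_index` would show how many of the 222,186 tokens the frozen embedding layer can represent.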

In [63]:
# Initialise Model
model = []

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding_layer = Embedding(
    len(word_index) + 1,
    EMBEDDING_DIM,
    weights=[embedding_matrix],
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False
)
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = BatchNormalization()(x)
x = Conv1D(128, 5, activation='relu')(x)
x = BatchNormalization()(x)
x = Conv1D(128, 5, activation='relu')(x)
x = BatchNormalization()(x)
x = MaxPooling1D(5)(x)
x = Dropout(0.3)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(2, activation='sigmoid')(x)

model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy',metrics=['accuracy'],optimizer='adam')

model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 200)]             0         
                                                                 
 embedding_1 (Embedding)     (None, 200, 200)          44437400  
                                                                 
 conv1d_3 (Conv1D)           (None, 196, 128)          128128    
                                                                 
 batch_normalization_3 (Bat  (None, 196, 128)          512       
 chNormalization)                                                
                                                                 
 conv1d_4 (Conv1D)           (None, 192, 128)          82048     
                                                                 
 batch_normalization_4 (Bat  (None, 192, 128)          512       
 chNormalization)                                          
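
The summary output above is cut off after the second batch-normalization layer. The shapes and parameter counts it does show can be checked by hand, and the same arithmetic gives the sizes of the remaining layers. This sketch assumes the Keras defaults for `Conv1D` (stride 1, `'valid'` padding) and `MaxPooling1D` (stride equal to the pool size); the helper names are illustrative.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel_size + 1

def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

vocab_rows = 222186 + 1              # unique tokens + 1 row for the padding index
embedding_params = vocab_rows * 200  # one 200-dimensional vector per row

length = 200
for in_channels in (200, 128, 128):  # input channels of the three Conv1D layers
    length = conv1d_out_len(length, 5)
    print(length, conv1d_params(5, in_channels, 128))

pooled = length // 5                 # MaxPooling1D(5)
flattened = pooled * 128             # size of the vector fed to Dense(128)
print(embedding_params, pooled, flattened)
```

The first two printed lines agree with the `conv1d_3` and `conv1d_4` rows of the summary, and the embedding count agrees with `embedding_1`.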

In [69]:
# Callbacks

# Model Checkpoint
filepath = "best_model.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')

# Early Stopping
early_stopping = EarlyStopping(monitor='val_accuracy',patience=5)

In [24]:
# Train Model
callback = [early_stopping, checkpoint]
batch_size = 128
n_epochs = 20
results = model.fit(
    x_train,y_train,
    batch_size=batch_size,epochs=n_epochs,
    verbose=1,
    validation_data=(x_val,y_val),
    callbacks=callback
)

Epoch 1/20
Epoch 1: val_accuracy improved from -inf to 0.95428, saving model to best_model.hdf5




Epoch 2/20
Epoch 2: val_accuracy did not improve from 0.95428
Epoch 3/20
Epoch 3: val_accuracy improved from 0.95428 to 0.95515, saving model to best_model.hdf5
Epoch 4/20
Epoch 4: val_accuracy improved from 0.95515 to 0.95594, saving model to best_model.hdf5
Epoch 5/20
Epoch 5: val_accuracy did not improve from 0.95594
Epoch 6/20
Epoch 6: val_accuracy did not improve from 0.95594
Epoch 7/20
Epoch 7: val_accuracy did not improve from 0.95594
Epoch 8/20
Epoch 8: val_accuracy did not improve from 0.95594
Epoch 9/20
Epoch 9: val_accuracy did not improve from 0.95594
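
Training stopped after epoch 9 even though 20 epochs were requested, which is what `patience=5` would produce if the best score came at epoch 4. The sketch below simulates that counting rule in plain Python. The improving values are taken from the log above; the values for the non-improving epochs are not shown in the log, so the ones used here are made up.

```python
def early_stop_epoch(history, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(history, start=1):
        if value > best:
            best = value
            wait = 0          # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Epochs 1, 3 and 4 use the logged values; the rest are placeholders.
val_accuracy = [0.95428, 0.95400, 0.95515, 0.95594,
                0.95500, 0.95500, 0.95500, 0.95500, 0.95500]
stopped_at = early_stop_epoch(val_accuracy, patience=5)
print(stopped_at)
```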


In [26]:
# Load Best Model
model = load_model("/content/best_model.hdf5")

In [27]:
# Predict on Test Data
prediction = model.predict(x_test)
classes_pred = np.argmax(prediction, axis=1)
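
The cell above picks the class with the larger output via `argmax`. An alternative, sketched here as an illustration rather than as part of the original workflow, is to threshold the class-1 (SPAM) score directly, which makes it possible to trade precision for recall. The helper name and the scores are made up.

```python
def predict_with_threshold(scores, threshold=0.5):
    """Label a row as SPAM (1) when its class-1 score reaches the threshold."""
    return [1 if row[1] >= threshold else 0 for row in scores]

# Made-up model outputs: [score for class 0, score for class 1] per question.
toy_scores = [[0.9, 0.1], [0.6, 0.45], [0.2, 0.8]]

default_labels = predict_with_threshold(toy_scores)        # cut-off at 0.5
lenient_labels = predict_with_threshold(toy_scores, 0.4)   # lower cut-off flags more rows
print(default_labels)
print(lenient_labels)
```

Lowering the threshold flags more questions as SPAM, which typically raises recall at the cost of precision.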



In [47]:
# Test Prediction Scores
y_true = list(np.argmax(y_test, axis=1))
y_pred = list(classes_pred)
precision = precision_score(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision}")
print(f"Accuracy: {accuracy}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Precision: 0.6917754714875572
Accuracy: 0.955670455647487
Recall: 0.5113842600230984
F1 Score: 0.5880567281696153
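
To help read these scores, the sketch below computes precision, recall, and F1 for the SPAM class from scratch on a few made-up labels, then checks that the reported F1 is the harmonic mean of the reported precision and recall. `binary_metrics` is a hypothetical helper, not the scikit-learn implementation.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 3 SPAM questions, 2 caught, plus 1 false alarm.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_precision, toy_recall, toy_f1 = binary_metrics(toy_true, toy_pred)
print(toy_precision, toy_recall, toy_f1)

# Scores reported above for the test set.
reported_precision = 0.6917754714875572
reported_recall = 0.5113842600230984
f1_from_pr = (2 * reported_precision * reported_recall
              / (reported_precision + reported_recall))
print(f1_from_pr)

# Accuracy of always predicting NOT SPAM, from the test-set class proportions.
majority_baseline_accuracy = 0.938127
print(0.955670455647487 - majority_baseline_accuracy)
```

Because about 93.8% of the test set is NOT SPAM, accuracy alone says little here: the model's 95.6% is under two points above the majority baseline, while the recall shows that roughly half of the SPAM questions are still missed.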
