<a href="https://colab.research.google.com/github/LeoZethraeus/HamSpamCNN/blob/master/assignment2notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam or Ham classification
## Before we begin, a few words on my approach:
I chose to implement my own classifier using a convolutional neural network (CNN) with Keras.
This is probably not the most conventional approach, which would use something like a Naive Bayes classifier. I did it because I saw in the job listing that you wanted to improve existing models using techniques from computer vision, so I wanted to give it a shot immediately. The resulting model yields about 97% accuracy on a separate test set (not trained on), using only the text and no other features. Towards the end of this notebook I include an unpolished ensemble model that uses all included features. I have not tuned any hyper-parameters: all the values (number of filters in the CNN, nodes in the Dense layers, the weights of the different models, etc.) are rough guesses, so I believe it can be improved. With the current values, the model taking only text as input performs better than the ensemble prediction.

As a bonus, I tested it on some sample posts not included in the data set (written by myself or found on FB) and it works quite well, as long as the post is in English and is related to hiring/subletting an apartment.

## Prerequisites

I'm going to use TensorFlow/Keras, NumPy and pandas with Python 3.7.

To install TensorFlow and Keras I recommend using Anaconda (download the correct version for Windows/Linux/Mac from https://www.anaconda.com/distribution/ and follow the instructions in the installer).

Before you install TensorFlow and Keras I recommend creating a virtual environment, using either conda or Python's venv.

Write the following lines in your terminal:

**conda create -n assignment2 python=3.7 anaconda**

**conda activate assignment2**

**conda install tensorflow**

**conda install keras**

**conda install numpy**

**conda install pandas**

## Reading dataset and some preprocessing

In [None]:
import numpy as np
import pandas as pd
from keras.utils import to_categorical

url="PATH_TO_DATASET"
dataframe = pd.read_csv(url)

text, has_link, has_image, label = dataframe.values.T

def pre_processing():
    """Prepare for one-hot-encoding"""
    label[label == 'ham'] = 0
    label[label == 'spam'] = 1  
    
    has_link[has_link == True] = 1
    has_link[has_link == False] = 0
    
    has_image[has_image == True] = 1
    has_image[has_image == False] = 0
    
pre_processing()

"""One-hot encode labels and binary variables using Keras"""
one_hot_labels = to_categorical(label)
one_hot_has_link = to_categorical(has_link)
one_hot_has_image = to_categorical(has_image)

Using TensorFlow backend.


After loading the data it's time to tokenize the text and present it as sequences of integers.
For that I use Keras Tokenizer, which fits on the text with respect to the vocabulary of all posts included,
to create a meaningful representation as sequences of integers.

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

"""I use Keras Tokenizer to fit the vocabulary of the dataset and create sequences of integers
as representations of the tokens"""
t = Tokenizer()
t.fit_on_texts(text)

sequences = t.texts_to_sequences(text)

"""Print example sequence"""
print("Text: ")
print(text[1] +'\n')
print('Sequence: ')
print(sequences[1])

Text: 
[wanted] I'm looking for a roommate who want to share a room if we find a room The location I want is nearby rosslyn station! My budget will be under 700! And I want to move before at least 1st december! If you're interested, let me know :)

Sequence: 
[58, 46, 18, 6, 2, 124, 97, 138, 4, 135, 2, 13, 17, 33, 158, 2, 18269, 5, 153, 12, 138, 8, 795, 2840, 198, 23, 186, 42, 28, 521, 443, 1, 12, 138, 4, 87, 444, 35, 842, 156, 451, 17, 250, 66, 157, 15, 104]
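
To make the integer representation concrete, here is a minimal pure-Python sketch of the idea: lowercase the text, split it into words, and give each word an index by descending frequency, starting at 1 (0 is left free for padding). The helper names `simple_words`, `build_word_index` and `to_sequence` and the toy posts are my own, and the punctuation and tie-breaking rules are simplified assumptions, so treat it as an illustration rather than the Keras implementation.

```python
from collections import Counter
import re

def simple_words(post):
    # Simplified tokenization (assumption): lowercase, keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", post.lower())

def build_word_index(posts):
    # Count every word over all posts, then rank by frequency:
    # the most frequent word gets index 1. Ties are broken alphabetically here.
    counts = Counter(w for post in posts for w in simple_words(post))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}

def to_sequence(post, word_index):
    # Words that were never seen while building the index are dropped.
    return [word_index[w] for w in simple_words(post) if w in word_index]

toy_posts = ["Looking for a room", "Room for rent", "Cheap room, click the link"]
word_index = build_word_index(toy_posts)
toy_sequence = to_sequence("a cheap room for you", word_index)
print(word_index)
print(toy_sequence)
```

Frequent words end up with small integers, which is why common words such as "a" or "to" show up as low numbers in the example sequence above.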


Next I need to find the maximum number of tokens in a post, and zero-pad all shorter posts to that length, because Keras only accepts fixed-length inputs.

**N.B.** I noticed that one post is much longer than all others (number 7178, about 4 times longer than the second longest post), affecting the max_length greatly and thus model performance. One option would be to remove it and gain a reduction of weights to save training time, but here I keep it. After all, such spam might occur in real life, and the training time is acceptable anyway.

In [None]:
"""Find the maximum number of tokens in a post and zero-pad all other posts
   (because Keras needs fixed-length input)"""

max_length = np.max([len(tx) for tx in sequences])
print('Max length of posts: %s \n' % max_length)
padded_posts = pad_sequences(sequences, maxlen = max_length, padding = 'post')

"""This is included just for clarity, you get the idea.."""
print('Padded sequence:')
print(padded_posts[1])
print('\nOriginal length of example sequence: ')
print(len(sequences[1]))
print('Length of zero-padded version:')
print(len(padded_posts[1]))

"""I also need the size of the vocabulary of all texts in the dataset"""
print('\nVocabulary needed to include all words in the data set:')
print(len(set(t.word_counts)))
vocab_size = 40000 # From line above, but increased to allow for a slightly more flexible model.
print('')
print('Setting vocab_size to %s.' % vocab_size)

Max length of posts: 8611 

Padded sequence:
[58 46 18 ...  0  0  0]

Original length of example sequence: 
47
Length of zero-padded version:
8611

Vocabulary needed to include all words in the data set:
38774

Setting vocab_size to 40000.
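
As a small illustration of what post-padding does, and of how much of the input is padding when one very long post sets the length, here is a pure-Python sketch. The helper `pad_post` and the toy sequences are hypothetical; the numbers 47 and 8611 are taken from the output above.

```python
def pad_post(sequence, max_length):
    # Append zeros after the real tokens until the sequence has max_length entries.
    return list(sequence) + [0] * (max_length - len(sequence))

toy_sequences = [[58, 46, 18], [7], [5, 9, 2, 11, 4]]
toy_max_length = max(len(s) for s in toy_sequences)
toy_padded = [pad_post(s, toy_max_length) for s in toy_sequences]
for row in toy_padded:
    print(row)

# Share of zeros in the padded example post (47 real tokens out of 8611 positions).
padding_fraction = 1 - 47 / 8611
print("Padding fraction of the example post: {:.4f}".format(padding_fraction))
```

For the example post, more than 99% of the padded input is zeros, which is the cost of keeping the one very long post.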


## Building the model

In [None]:
"""Import some useful Keras tools for training a neural network"""
from keras.layers import Input, Dense
from keras.layers import Dropout, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model
from keras import backend as K
from keras import objectives
from keras import optimizers

In the cell below I build the model. The architecture is a bit arbitrary, but the important layers are the Embedding and the Conv1D. The number of filters and kernel_size in Conv1D can be changed.

The Embedding layer creates a dense vector representation of the sequences of integers representing the posts. The embedding is learnt during training of the model. This is very useful because then the model is specialised for our purposes (the vocabulary that might be present in a typical ham or spam post in the Apalca FB group).

To do that, we need to know the size of the vocabulary, and the max_length of the post, as previously computed.

The output_dim is the dimension of the output vector space and can be considered another hyper-parameter to be optimized.
After some experimentation I found 50 to be a good value.

The filters in Conv1D are analogous to features in feature selection, except that in deep learning you don't need to decide which features to look for, only how many. The number of filters is therefore also a hyper-parameter. A kernel size of 5 means that each filter sweeps over 5 consecutive words at a time; this can be tuned as well.

In [None]:
'''Build the model and print out model summary'''
input_text = Input(shape=(max_length,))
emb  = Embedding(input_dim = vocab_size, input_length = max_length, output_dim = 50)(input_text)
dense_text = Conv1D(50,(5,), activation = 'relu', strides = 1)(emb)
dense_text2 = MaxPooling1D(5)(dense_text)
flat = Flatten()(dense_text2)
dense_text3 = Dense(10, activation = 'relu')(flat)
drop = Dropout(0.5)(dense_text3)
output_1 = Dense(2, activation = 'sigmoid')(drop)

textmodel = Model(input_text, output_1)
print(textmodel.summary())
'''I like to use the adam optimizer for general purpose DL, because it works well in many situations. 
The loss is binary cross-entropy, suitable for binary labels'''
textmodel.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 8611)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 8611, 50)          2000000   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 8607, 50)          12550     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 1721, 50)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 86050)             0         
_________________________________________________________________
dense_1 (Dense)      

Almost 3 million trainable parameters. If you want to reduce the number of trainable parameters I would remove post 7178 and thus reduce the vocabulary and max_length.
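
The shapes and parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic in plain Python. The rows for the two Dense layers are cut off in the summary above, so those two counts are computed from the model code (10 and 2 units) and are not copied from the printout.

```python
vocab_size, max_length = 40000, 8611
emb_dim, filters, kernel_size, pool_size, dense_units = 50, 50, 5, 5, 10

embedding_params = vocab_size * emb_dim                   # one 50-d vector per vocabulary entry
conv_params = kernel_size * emb_dim * filters + filters   # kernel weights + one bias per filter
conv_length = max_length - kernel_size + 1                # no padding, stride 1
pooled_length = conv_length // pool_size                  # non-overlapping pooling windows
flat_size = pooled_length * filters
dense_params = flat_size * dense_units + dense_units      # Dense(10) on the flattened features
output_params = dense_units * 2 + 2                       # Dense(2) output layer

total = embedding_params + conv_params + dense_params + output_params
print(conv_length, pooled_length, flat_size)
print(embedding_params, conv_params, dense_params, output_params)
print(total)
```

The embedding accounts for 2,000,000 of the weights and the first Dense layer for most of the rest, and both scale directly with vocab_size and max_length.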

Before I train the model, I shuffle the data, and check that it's approximately stratified.

In [None]:
"""Don't trust a model trained on an unshuffled dataset, would you play poker with an unshuffled deck?"""
length_of_dataset = len(text)

def shuffle_indices(length_of_dataset):
    ints = range(length_of_dataset)
    indices = np.array(ints)
    np.random.shuffle(indices)
    return indices

shuffled_indices = shuffle_indices(length_of_dataset)

ratio = 0.2 #Change if you want to.
split = length_of_dataset-int(length_of_dataset*ratio) # Train/test split ratio.

x_train = padded_posts[shuffled_indices[:split]]
x_test = padded_posts[shuffled_indices[split:]]

y_data = one_hot_labels
y_train = y_data[shuffled_indices[:split]]
y_test = y_data[shuffled_indices[split:]]

print('Train set: {} posts, test_set {} posts'.format(len(x_train),len(x_test)))

print('Train/test ratio = {:.1f} \n'.format(split/length_of_dataset))

"""Check if train/test-split is approximately stratified:"""
shuffled_labels = label[shuffled_indices]
print('Stratification check:')
print('Number of spam posts in training set: {} of total: {}'.format(np.count_nonzero(shuffled_labels[:split]),split))
print('Number of spam posts in test set: {} of total: {}'.format(np.count_nonzero(shuffled_labels[split:]), length_of_dataset-split))


Train set: 8000 posts, test_set 2000 posts
Train/test ratio = 0.8 

Stratification check:
Number of spam posts in training set: 3993 of total: 8000
Number of spam posts in test set: 1007 of total: 2000
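
The random shuffle above gives an approximately stratified split. If an exactly stratified split is wanted, one option is to shuffle and split each class separately. This is a minimal NumPy sketch of that idea, not what the notebook does; the function name `stratified_split_indices`, the fixed seed and the toy labels are assumptions for illustration.

```python
import numpy as np

def stratified_split_indices(labels, test_ratio=0.2, seed=0):
    """Return (train_idx, test_idx) with the same class proportions in both parts."""
    rng = np.random.RandomState(seed)  # fixed seed so the split is reproducible
    labels = np.asarray(labels)
    train_parts, test_parts = [], []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)  # positions of this class
        rng.shuffle(cls_idx)
        n_test = int(len(cls_idx) * test_ratio)
        test_parts.append(cls_idx[:n_test])
        train_parts.append(cls_idx[n_test:])
    train_idx = np.concatenate(train_parts)
    test_idx = np.concatenate(test_parts)
    # Shuffle again so the classes are mixed within each part.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx

toy_labels = np.array([0] * 50 + [1] * 50)  # 0 = ham, 1 = spam
train_idx, test_idx = stratified_split_indices(toy_labels)
print(len(train_idx), len(test_idx))
print("spam in test part:", int(toy_labels[test_idx].sum()))
```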


## Time to train the model. 
It takes a few minutes (about 10 minutes on my CPU, or 1 minute on my GPU, which requires the right CUDA drivers to be installed).

**Run the cell below and refill your cup of coffee.**

In [None]:
"""It takes about 10 minutes to train on a reasonable CPU (much faster on a GPU)
and reaches around 97% validation accuracy"""
textmodel.fit(x_train, y_train, batch_size= 32, verbose = 1, validation_data = (x_test,y_test), epochs = 4, shuffle = True)
loss, acc = textmodel.evaluate(x_test,y_test)
print("Model accuracy on test set: {}".format(acc))

Instructions for updating:
Use tf.cast instead.
Train on 8000 samples, validate on 2000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
Model accuracy on test set: 0.9755


## Time to test the model

In [None]:
"""Check accuracy and False positives/negatives etc using confusion matrix"""
from sklearn.metrics import confusion_matrix

y_preds_cat = textmodel.predict(x_test)
y_pred = [np.argmax(y_preds_cat[i]) for i in range(len(y_preds_cat))]

lab = np.argmax(y_test, axis = 1)

cm = confusion_matrix(lab,y_pred)
# Rows of cm are true labels, columns are predicted labels (0 = ham, 1 = spam).
print("Ham correctly classified: {}, ham misclassified as spam: {}, spam correctly classified: {}, spam misclassified as ham: {}".format(cm[0,0],cm[0,1],cm[1,1], cm[1,0]))
print("")
print("Confusion matrix:")
print(cm)
print('')
"""Print model performance:"""
loss, acc = textmodel.evaluate(x_test,y_test)
print("Model accuracy on test set: {}".format(acc))

Ham correctly classified: 961, ham misclassified as spam: 32, spam correctly classified: 989, spam misclassified as ham: 18

Confusion matrix:
[[961  32]
 [ 18 989]]

Model accuracy on test set: 0.9755
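
Accuracy alone hides which kind of error the model makes, so it can be useful to also derive precision and recall for the spam class from the confusion matrix. The sketch below uses the four counts printed above and assumes the scikit-learn layout (rows are true labels, columns are predicted labels, 0 = ham, 1 = spam).

```python
# Confusion matrix copied from the output above.
cm = [[961, 32],
      [18, 989]]
tn, fp = cm[0]   # true ham:  predicted ham / predicted spam
fn, tp = cm[1]   # true spam: predicted ham / predicted spam

precision = tp / (tp + fp)   # of the posts flagged as spam, the fraction that really is spam
recall = tp / (tp + fn)      # of the real spam posts, the fraction that was caught
print("Spam precision: {:.4f}".format(precision))
print("Spam recall:    {:.4f}".format(recall))
```

Which of the two matters more depends on the use case: low precision means legitimate posts get blocked, low recall means spam slips through.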


## Test on text from dataset:

In [None]:
'''Helper-function'''
def spam_or_ham(lab):
    if lab == 1:
        print('spam')
    elif lab == 0:
        print('ham')

'''Choose sample text from data set and output spam or ham'''
def print_example(ind):
    j = np.argmax(textmodel.predict_on_batch([[padded_posts[ind]]]))

    print(text[ind])
    print('\nPredicted label: ')
    spam_or_ham(j)
    print('\nTrue label:')
    spam_or_ham(label[ind])

"""Choose sample text index, ind> 5000 are spam posts (yes I noticed :))"""
ind = 4999 #Change if you like to
print_example(ind)

1 and/or 2 bedrooms to move into ASAP on Peel street, £126 a week with a £250 deposit but willing to wave the deposit if someone moves in within the next week or so. 
Beautiful flat with memory foam mattresses, double beds, TVs and surround sound in every room, parking on request

Predicted label: 
ham

True label:
ham


## Test on custom text:

In [None]:
"""Now for the fun part. Enter new text to test"""
teststring = r"""Enter custom text here"""
teststring1 = r"""CHEAP stuff click this LINK, no virus I promise""" #Spam-simulation (written by me)
teststring2 = r"""Looking for a REALLY cheap apartment in Kreutzberg ASAP, plz help! :(""" #Ham-simulation (written by me)

def test_sample_string(teststring):
    """Input: Sample string, works best with sublet-related posts in English.
    Not working well with posts in German right now, but can be improved."""
    # Tokenize the whole string as one document, the same way the training posts were,
    # then zero-pad it to max_length (longer inputs are truncated).
    seq = t.texts_to_sequences([teststring])
    padded = pad_sequences(seq, maxlen = max_length, padding = 'post')
    j = textmodel.predict_on_batch(padded)
    # Print spam or ham and return the raw model output, shape (1, 2): [ham score, spam score]
    spam_or_ham(np.argmax(j))
    return j

#Uncomment below if you wish to enter a custom text.
#print(teststring)
#print("")
#test_sample_string(teststring)
print("")
print(teststring1)
print("")
test_sample_string(teststring1)
print("")
print(teststring2)
print("")
test_sample_string(teststring2)


CHEAP stuff click this LINK, no virus I promise

spam

Looking for a REALLY cheap apartment in Kreutzberg ASAP, plz help! :(

ham


array([[0.91062874, 0.12269474]], dtype=float32)

# Unpolished ensemble model using other features (has_link and has_image).

I create two separate models, each for prediction using only one feature, then in the end I combine the predictions of the three models. But first, the two simple models need to be created and trained.

In [None]:
"""To further improve model predictions, you could use a weighted ensemble output from predictions of the other features.
Tuning the weights would require a separate validation set to avoid overfitting on the dataset,
but as the dataset is quite large I think it is possible."""

'''Create Multilayer perceptron model for link feature prediction'''
input_links = Input(shape=(2,))
dense_links = Dense(50, activation = 'relu')(input_links)
dense_links2 = Dense(20, activation = 'relu')(dense_links)
output_link = Dense(2, activation = 'sigmoid')(dense_links2)
link_model = Model(input_links, output_link)
link_model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

'''Create Multilayer perceptron model for image feature prediction'''
input_ims = Input(shape=(2,))
dense_ims = Dense(50, activation = 'relu')(input_ims)
dense_ims2 = Dense(20, activation = 'relu')(dense_ims)
output_im = Dense(2, activation = 'sigmoid')(dense_ims2)
im_model = Model(input_ims, output_im)
im_model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

#Shuffle data
x_train_link = one_hot_has_link[shuffled_indices[:split]]
x_test_link = one_hot_has_link[shuffled_indices[split:]]

x_train_im = one_hot_has_image[shuffled_indices[:split]]
x_test_im = one_hot_has_image[shuffled_indices[split:]]

In [None]:
link_model.fit(x_train_link, y_train, batch_size= 32, verbose = 1, validation_data = (x_test_link,y_test), epochs = 10, shuffle = True)
loss2, acc2 = link_model.evaluate(x_test_link,y_test)
print("")
print("Link model accuracy on test set: {}".format(acc2))
print("")

im_model.fit(x_train_im, y_train, batch_size= 32, verbose = 1, validation_data = (x_test_im,y_test), epochs = 10, shuffle = True)
loss3, acc3 = im_model.evaluate(x_test_im,y_test)
print("")
print("Image model accuracy on test set: {}".format(acc3))

Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Link model accuracy on test set: 0.6565
Train on 8000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Image model accuracy on test set: 0.778


In [None]:
def ensemble_prediction_on_text(index, weights = [1,1,1]):
    """Prints the predictions of the three different models,
    and returns (and prints) the ensemble average prediction, given an index and weights
    (by default [1,1,1] i.e. assuming no model is better than the others)."""
    w1, w2, w3 = weights[0], weights[1], weights[2]
    print("")
    print(text[index])
    print("")
    text_pred = test_sample_string(text[index])[0]
    print("%s Text model prediction" % text_pred)
    link_pred = link_model.predict_on_batch([[one_hot_has_link[index]]])[0]
    print("%s Link model prediction" % link_pred)
    
    im_pred = im_model.predict_on_batch([[one_hot_has_image[index]]])[0]
    print("%s Image model prediction" % im_pred)
    
    ensemble_pred = np.mean([w1*text_pred,w2*link_pred,w3*im_pred], axis = 0)
    print("%s Ensemble prediction" % ensemble_pred)
    print("[1,0] means ham and [0,1] spam")
    print("")
    print("True label: {}".format(one_hot_labels[index]))
    return ensemble_pred

#Same function, but without printing the individual predictions
def ensemble_prediction_on_text_no_prints(index, weights = [1,1,1]):
    """Returns the ensemble average prediction of the three models, given an index and weights
    (by default [1,1,1] i.e. assuming no model is better than the others)."""
    w1, w2, w3 = weights[0], weights[1], weights[2]
    text_pred = test_sample_string(text[index])[0]

    link_pred = link_model.predict_on_batch([[one_hot_has_link[index]]])[0]
    
    im_pred = im_model.predict_on_batch([[one_hot_has_image[index]]])[0]
    
    ensemble_pred = np.mean([w1*text_pred,w2*link_pred,w3*im_pred], axis = 0)

    return ensemble_pred

ensemble_prediction_on_text(2) #Change index if you want to try other posts


Seeking either a studio, one bedroom or single room in a two bedroom apartment in Back Bay, South End, North End or Cambridge. Looking to spend under $1,500/month. Can't move in until the Spring.

ham
[0.9994343  0.00222316] Text model prediction
[0.58995295 0.40849602] Link model prediction
[0.7118904 0.2898174] Image model prediction
[0.7670925  0.23351221] Ensemble prediction
[1,0] means ham and [0,1] spam

True label: [1. 0.]


array([0.7670925 , 0.23351221], dtype=float32)
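
One detail of the weighting: `np.mean` over the weight-scaled predictions divides by 3 rather than by the sum of the weights. The argmax is unaffected, but with non-unit weights the scores are no longer on the same scale as the individual predictions. A normalized weighted average can be sketched with `np.average`, here using the three predictions printed above and, as an example, the test accuracies reported earlier in this notebook as weights.

```python
import numpy as np

# Predictions copied from the output above, ordered [ham, spam].
text_pred = np.array([0.9994343, 0.00222316])
link_pred = np.array([0.58995295, 0.40849602])
im_pred = np.array([0.7118904, 0.2898174])
preds = np.stack([text_pred, link_pred, im_pred])

# Equal weights reproduce the ensemble prediction printed above.
equal = np.average(preds, axis=0, weights=[1, 1, 1])
# Weights taken from the reported test accuracies of the text, link and image models.
weighted = np.average(preds, axis=0, weights=[0.9755, 0.6565, 0.778])
print(equal)
print(weighted)
print("Predicted class:", int(np.argmax(weighted)))
```

Since `np.average` divides by the sum of the weights, the result stays comparable to the single-model scores whatever weights are chosen.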

In [None]:
"""THIS PART IS NOT REALLY FINISHED, BUT IT MAY BE A WAY TO IMPROVE THE MODEL"""
"""A crude estimate of performance of the ensemble model. The weights should be tuned and the model should be tested on a separate test set, not used for optimizing the weights,
 but I save that for future work."""
corr = 0
count = 0
'''Takes a little while, could certainly be optimized'''
for index in range(10000): #Note, it is not really good practice to test it on the whole dataset, as it has been used for training, this is just to get an estimate of performance.
    if index == 7178: #Troublesome post, ignore for now
        continue
    k = np.argmax(ensemble_prediction_on_text_no_prints(index, [acc, acc2, acc3])) #Weight the model predictions by their performance, to begin with.
    if k == label[index]:
        corr += 1
    count += 1

print("Accuracy of ensemble model: {}".format(corr/count))

# FINAL THOUGHTS

The text-based CNN classifier obtained an accuracy of about 97%, and the ensemble method roughly the same. There are many hyper-parameters to tweak, so I think performance could be improved quite a lot. The nice thing about using deep learning is that you don't need to manually find features in the text; the model finds the relevant features during training. If there are other features that might be relevant, you can still add them to an ensemble model as described above. One useful feature would be to check a user's identity against a blacklist of known Facebook spammers. The spammers probably change accounts now and then, but it could make the model faster for a while at least.

# What would I do if there were no labels available?
Unsupervised, the problem gets harder, of course. There are still some techniques that you can use, for example k-means clustering (with two clusters, assigned as spam or ham). Semi-supervised techniques, which first cluster the data without labels and then classify using the few labels that are available, could improve performance.

## Do I think it would work against scammers?
Although unsupervised methods are not guaranteed to perform great, they will usually do better than doing nothing at all.
The best way to find out is to build and optimize a k-means clustering algorithm and check it against labelled data to test performance.
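
One practical detail when checking a clustering against labelled data: the cluster ids that k-means returns are arbitrary, so cluster 0 may correspond to spam just as well as to ham. A minimal sketch of such a check, with a hypothetical helper `cluster_accuracy` and made-up toy assignments, is to score both possible mappings and keep the better one.

```python
def cluster_accuracy(cluster_ids, true_labels):
    """Accuracy of a two-cluster assignment against 0/1 labels,
    under the better of the two possible cluster-to-label mappings."""
    agree = sum(int(c == y) for c, y in zip(cluster_ids, true_labels))
    n = len(true_labels)
    # Either cluster 0 = ham and cluster 1 = spam, or the other way round.
    return max(agree, n - agree) / n

toy_clusters = [1, 1, 0, 0, 1, 0]   # hypothetical output of a two-cluster k-means
toy_labels = [0, 0, 1, 1, 0, 0]     # 0 = ham, 1 = spam
toy_accuracy = cluster_accuracy(toy_clusters, toy_labels)
print("Clustering accuracy: {:.3f}".format(toy_accuracy))
```

With only two clusters this score can never drop below 0.5, so values close to 0.5 mean the clustering carries little information about spam versus ham.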