This repo complements [my other git repo](https://github.com/ErfanEbrahimiBazaz/spam_detection_with_nltk) on spam detection.

In [my previous repo](https://github.com/ErfanEbrahimiBazaz/spam_detection_with_nltk) we built TF-IDF vectors and trained a naive Bayes classifier for spam detection. In this repository we use an LSTM for the same task. Both repositories work with the same data set.

The code follows [this link](https://towardsdatascience.com/spam-detection-in-emails-de0398ea3b48). Some of the methods in the link are not implemented properly, but overall it shows the big picture and the logic behind text classification. A few methods, such as remove_stop_words, needed minor changes, which I have made in this repo.

### Embedding

Embedding is the process of converting formatted text data into numerical values/vectors which a machine can interpret.
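To make the definition concrete, here is a minimal sketch of what an embedding layer does internally: it keeps a table with one row per integer token id and looks rows up. The sizes and the random table are assumptions chosen for illustration; in a trained model the table holds learned weights.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
vocab_size = 10      # number of distinct integer token ids
embedding_dim = 4    # length of each word vector

# An embedding layer owns a (vocab_size, embedding_dim) table of weights.
# Here it is filled with random numbers; in a real model it is learned.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A padded message: token ids, with 0 used as the padding id.
token_ids = np.array([3, 7, 7, 1, 0, 0])

# "Embedding" the message is a row lookup per token id.
vectors = embedding_table[token_ids]
print(vectors.shape)  # (6, 4)
```

Identical token ids (the two 7s) receive identical vectors, which is why the integer encoding of words has to be consistent across all messages.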

In [8]:
import tensorflow as tf
from keras.layers import Dense,LSTM, Embedding, Dropout, Activation, Bidirectional


# The length of all tokenized emails post-padding is set using 'max_len'
max_len = 50


#size of the output vector from each layer
embedding_vector_length = 32
#Creating a sequential model
model = tf.keras.Sequential()
#Creating an embedding layer to vectorize
max_feature=50
model.add(Embedding(max_feature, embedding_vector_length, input_length=max_len))
#Adding Bi-directional LSTM
model.add(Bidirectional(tf.keras.layers.LSTM(64)))
#Relu allows converging quickly and allows backpropagation
model.add(Dense(16, activation='relu'))
#Deep learning models can overfit easily; to avoid this, we add randomization using dropout
model.add(Dropout(0.1))
#Adding sigmoid activation function to normalize the output
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 32)            1600      
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               49664     
_________________________________________________________________
dense (Dense)                (None, 16)                2064      
_________________________________________________________________
dropout (Dropout)            (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 53,345
Trainable params: 53,345
Non-trainable params: 0
_________________________________________________________________
None


In [None]:
history = model.fit(x_train_features, train_y, batch_size=512, epochs=20, validation_data=(x_test_features, test_y))
y_predict = [1 if o>0.5 else 0 for o in model.predict(x_test_features)]

### Performance

According to [this link](https://towardsdatascience.com/spam-detection-in-emails-de0398ea3b48):

"Precision and recall are the two most widely used performance metrics for a classification problem to get a better understanding of the problem. Precision is the fraction of the relevant instances from all the retrieved instances. Precision helps us to understand how useful the results are. The recall is the fraction of relevant instances from all the relevant instances. Recall helps us understand how complete the results are."

In [None]:
from sklearn.metrics import confusion_matrix,f1_score, precision_score,recall_score


cf_matrix =confusion_matrix(test_y,y_predict)
tn, fp, fn, tp = confusion_matrix(test_y,y_predict).ravel()
print("Precision: {:.2f}%".format(100 * precision_score(test_y, y_predict)))
print("Recall: {:.2f}%".format(100 * recall_score(test_y, y_predict)))
print("F1 Score: {:.2f}%".format(100 * f1_score(test_y,y_predict)))
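The three scores printed above follow directly from the confusion-matrix counts. As a small sketch of the arithmetic (the counts below are made up, not results from this model):

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    # Precision: of everything predicted as spam, how much really was spam.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real spam, how much was caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of the two.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up counts for illustration only.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.80 recall=0.67 f1=0.73
```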

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


ax= plt.subplot()
#annot=True to annotate cells
sns.heatmap(cf_matrix, annot=True, ax = ax,cmap='Blues',fmt='');
# labels, title and ticks
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Not Spam', 'Spam']); ax.yaxis.set_ticklabels(['Not Spam', 'Spam']);

### Preparing data to train an LSTM model

In [9]:
from keras.preprocessing.text import one_hot, Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras import Sequential
from keras.layers import Embedding, Flatten, Dense
import math

In [10]:
docs = ['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!',
'Weak',
'Poor effort!',
'not good',
'poor work',
'Could have done better.']
labels = [1,1,1,1,1,0,0,0,0,0]

In [11]:
text = 'a sample text to tokenize'
tokens = text_to_word_sequence(text)
tokens

['a', 'sample', 'text', 'to', 'tokenize']

In [12]:
one_hot(text,5,lower=True)

[2, 3, 3, 3, 1]

In [13]:
one_hot(text,5*1.25,lower=True)

[5.0, 1.75, 5.25, 1.0, 3.25]

### Q1: This is not really one-hot encoding, but a list of integers that maps each word of the input text to a number, right?
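As far as I understand it, Keras `one_hot` is a wrapper around the hashing trick: each word is hashed to an integer index in `[1, n - 1]`, and no one-hot vector is ever built. The sketch below imitates that behaviour with an md5 hash so that it is deterministic; it is an illustration, not the Keras implementation, and `hash_index` and `encode` are hypothetical helper names. (Keras defaults to Python's built-in `hash`, which for strings can vary between interpreter sessions unless `PYTHONHASHSEED` is fixed.)

```python
import hashlib

def hash_index(word, n):
    """Map a word to an integer in [1, n - 1]; 0 stays free for padding."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def encode(text, n):
    """Lowercase, split on whitespace, and hash each word to an index."""
    return [hash_index(word, n) for word in text.lower().split()]

text = "a sample text to tokenize"
small = encode(text, 5)      # 5 words but only 4 possible indices: a collision is certain
large = encode(text, 1000)   # many more buckets, so collisions become unlikely
print(small)
print(large)
```

With `n=5` at least two different words must share an index, which matches the repeated values in the `one_hot(text, 5)` output above. A larger `n` reduces collisions but does not rule them out.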

In [14]:
import pandas as pd

In [15]:
!dir

 Volume in drive E is WorkSpace
 Volume Serial Number is 6AD8-FF46

 Directory of E:\Fad\Advpy\s13\hw

06/01/2021  12:39 PM    <DIR>          .
06/01/2021  12:39 PM    <DIR>          ..
06/01/2021  12:25 PM    <DIR>          .ipynb_checkpoints
05/31/2021  09:28 PM            14,478 label.txt
05/31/2021  09:26 PM            17,977 labels.txt
06/01/2021  02:56 AM             6,216 predicted_lbl.csv
06/01/2021  02:51 AM             6,216 predicted_lbl.txt
05/30/2021  01:48 AM               556 README.md
06/01/2021  12:39 PM            92,870 Spam detection with RNN-LSTM.ipynb
06/01/2021  03:25 AM           115,507 Spam detection with RNN.ipynb
05/31/2021  09:10 PM           916,713 spam detection.ipynb
06/01/2021  02:28 AM           170,099 test.txt
05/30/2021  01:57 AM            69,549 Text_Mining_Session02.ipynb
05/31/2021  09:28 PM           292,475 train.txt
01/01/2021  09:03 PM            18,837 word_embedding_with_keras .ipynb
              12 File(s)      1,721,493 bytes
         

In [8]:
def read_and_concat_datasets(train_dataset='train.txt', labels='label.txt', delimiter = "\n"):
    df = pd.read_csv(train_dataset, delimiter = delimiter, header=None, quotechar="'" )
    df.columns = ['message']
    
    df_label = pd.read_csv(labels, delimiter=delimiter, header = None, quotechar="'")
    df_label.columns = ['message_type']
    
    df_final = pd.concat([df, df_label], axis=1)
    return df_final

#### The error "ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 2" is caused by a comma inside a field when using pd.read_csv().

To resolve it, pass quotechar="'" to pd.read_csv(), as described in [this link](https://stackoverflow.com/questions/32743479/pandas-read-csv-with-extra-commas-in-column).
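An alternative sketch, assuming each file holds exactly one record per line: reading the lines with plain Python avoids CSV quoting rules altogether, so commas and apostrophes inside a message cannot split or merge rows. The file names and contents below are small stand-ins, not the real `train.txt` and `label.txt`, and `read_lines` is a hypothetical helper.

```python
import os
import tempfile

def read_lines(path):
    """Return one string per line, with the trailing newline removed."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]

# Tiny stand-in files; the real data files are not reproduced here.
tmp_dir = tempfile.mkdtemp()
msg_path = os.path.join(tmp_dir, "messages.txt")
lbl_path = os.path.join(tmp_dir, "labels.txt")
with open(msg_path, "w", encoding="utf-8") as handle:
    handle.write("Garbage bags, eggs, jam\nThe basket's full\n")
with open(lbl_path, "w", encoding="utf-8") as handle:
    handle.write("ham\nham\n")

messages = read_lines(msg_path)
labels = read_lines(lbl_path)

# Fail early if the two files do not line up row for row.
assert len(messages) == len(labels), "message/label length mismatch"
print(list(zip(messages, labels)))
```

The two lists can then be passed to `pd.DataFrame({"message": messages, "message_type": labels})` to obtain the same two-column frame.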

In [11]:
read_and_concat_datasets().head(10)

Unnamed: 0,message,message_type
0,The basket's gettin full so I might be by tonight,ham
1,Can i get your opinion on something first?,ham
2,Company is very good.environment is terrific a...,ham
3,Its a valentine game. . . Send dis msg to all ...,ham
4,S.i'm watching it in live..,ham
5,Don know:)this week i'm going to tirunelvai da.,ham
6,7 lor... Change 2 suntec... Wat time u coming?,ham
7,"""Garbage bags, eggs, jam, bread, hannaford whe...",ham
8,You see the requirements please,ham
9,"""Are you being good, baby? :)""",ham


In [12]:
df = read_and_concat_datasets()
df.tail()

Unnamed: 0,message,message_type
3497,Ok lor. I'm in town now lei.,ham
3498,"""Aight I've been set free, think you could tex...",ham
3499,No no:)this is kallis home ground.amla home to...,ham
3500,excellent. I spent &lt;#&gt; years in the Ai...,
3501,Watching tv lor...,


## There is a length mismatch between messages and labels. Downloading the data sets again and starting over.

#### Determining the size of the one-hot hash space to avoid collisions.

In [34]:
import pandas as pd
df = pd.read_csv('train.txt', delimiter = "\n", header=None )
df.columns = ['message']

In [35]:
df.head(10)

Unnamed: 0,message
0,The basket's gettin full so I might be by tonight
1,Can i get your opinion on something first?
2,Company is very good.environment is terrific a...
3,Its a valentine game. . . Send dis msg to all ...
4,S.i'm watching it in live..
5,Don know:)this week i'm going to tirunelvai da.
6,7 lor... Change 2 suntec... Wat time u coming?
7,"Garbage bags, eggs, jam, bread, hannaford whea..."
8,You see the requirements please
9,"Are you being good, baby? :)"


In [36]:
df_label = pd.read_csv('label.txt', delimiter='\n', header = None)
df_label.columns = ['message_type']

In [37]:
df_label.head()

Unnamed: 0,message_type
0,ham
1,ham
2,ham
3,ham
4,ham


In [38]:
df = pd.concat([df, df_label],axis=1)
df.tail()

Unnamed: 0,message,message_type
3495,Ok lor. I'm in town now lei.,ham
3496,"Aight I've been set free, think you could text...",ham
3497,No no:)this is kallis home ground.amla home to...,ham
3498,excellent. I spent &lt;#&gt; years in the Ai...,ham
3499,Watching tv lor...,ham


In [20]:
max([len(message) for message in df["message"]])

910

In [21]:
one_hot_vec_len = math.ceil(max([len(message) for message in df["message"]]) * 1.25)
one_hot_vec_len

1138
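Note that `len(message)` counts characters, so 910 is the length of the longest message in characters. The number of hash collisions depends on how many distinct words there are, so a common alternative is to size the hash space from the vocabulary. The sketch below assumes simple whitespace tokenization (Keras also strips punctuation) and uses made-up messages; `vocabulary_size` is a hypothetical helper.

```python
import math

def vocabulary_size(messages):
    """Count distinct lowercase, whitespace-separated words."""
    vocab = set()
    for message in messages:
        vocab.update(message.lower().split())
    return len(vocab)

messages = ["Good work", "nice work", "not good", "Poor effort"]
n_words = vocabulary_size(messages)        # good, work, nice, not, poor, effort
hash_space = math.ceil(n_words * 1.25)     # same 1.25 headroom factor as above
print(n_words, hash_space)                 # 6 8
```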

In [22]:
encoded_docs = [one_hot(message, one_hot_vec_len) for message in df["message"]]

In [23]:
len(encoded_docs)

3500

In [24]:
encoded_docs[3499]

[761, 1091, 15]

In [25]:
max([len(enc_doc) for enc_doc in encoded_docs])

189

In [26]:
# bad way
i = 0
for enc_doc in encoded_docs:
    i += 1
    if len(enc_doc) == max([len(enc_doc) for enc_doc in encoded_docs]):
        print(i,  enc_doc)

3078 [37, 960, 546, 942, 762, 347, 864, 399, 947, 762, 386, 1099, 947, 253, 655, 108, 1056, 3, 960, 952, 762, 902, 546, 559, 687, 110, 918, 183, 445, 511, 947, 209, 347, 546, 617, 873, 377, 20, 864, 655, 952, 762, 902, 345, 108, 1056, 947, 651, 942, 195, 902, 857, 1036, 445, 108, 1003, 257, 655, 409, 445, 102, 762, 182, 3, 655, 445, 102, 195, 902, 933, 80, 655, 947, 209, 707, 37, 655, 195, 989, 18, 445, 664, 873, 662, 18, 655, 391, 947, 195, 902, 362, 80, 12, 864, 30, 37, 655, 947, 195, 902, 183, 942, 1036, 947, 195, 902, 871, 546, 428, 724, 37, 655, 942, 195, 902, 1036, 947, 307, 887, 80, 358, 30, 1099, 445, 651, 977, 546, 1043, 1055, 769, 1101, 546, 1127, 775, 947, 195, 940, 902, 795, 552, 37, 655, 942, 195, 902, 1036, 947, 347, 914, 1101, 384, 746, 873, 377, 914, 784, 530, 102, 195, 902, 546, 1043, 1055, 857, 195, 1086, 108, 964, 873, 576, 1085, 37, 546, 617, 558, 952, 977, 864, 960, 947, 209, 793, 80, 131, 964, 1040, 195, 493, 132]


In [27]:
df["message"].iloc[3078 ]

'You do got a shitload of diamonds though'
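The loop above increments `i` before printing, so 3078 is a 1-based position, while `.iloc` is 0-based. That is presumably why `iloc[3078]` returns a short message rather than the 189-token one. A sketch on toy data that finds the 0-based index of the longest sequence in a single pass:

```python
# Toy encoded documents standing in for encoded_docs.
encoded = [[4, 9], [7, 1, 3, 8, 2], [5], [6, 6, 6]]

# enumerate() yields 0-based positions, which match list and .iloc indexing.
longest_idx, longest_doc = max(enumerate(encoded), key=lambda pair: len(pair[1]))
print(longest_idx, len(longest_doc))  # 1 5
```

This also avoids recomputing the maximum length on every iteration.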

In [28]:
df[df["message"]=="hi baby im cruisin with my girl friend what r u up 2? give me a call in and hour at home if thats alright or fone me on this fone now love jenny xxx"]

Unnamed: 0,message
771,hi baby im cruisin with my girl friend what r ...


In [29]:
max_length = max([len(enc_doc) for enc_doc in encoded_docs])
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs[0])

[546 450 974 195 445 947  84 902 531  58   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0]


In [30]:
df.iloc[0]

message    The basket's gettin full so I might be by tonight
Name: 0, dtype: object

In [31]:
from keras.layers import Dense,LSTM, Embedding, Dropout, Activation, Bidirectional
from keras import Sequential


#size of the output vector from each layer
embedding_vector_length = 32

model = Sequential()
#Creating an embedding layer to vectorize
#max_feature is 1.25 * length of the mapping space.
model.add(Embedding(one_hot_vec_len, embedding_vector_length, input_length=max_length))
#Adding Bi-directional LSTM
model.add(Bidirectional(LSTM(64)))
#Relu allows converging quickly and allows backpropagation
model.add(Dense(16, activation='relu'))
#Deep learning models can overfit easily; to avoid this, we add randomization using dropout
model.add(Dropout(0.1))
#Adding sigmoid activation function to normalize the output
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 189, 32)           36416     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               49664     
_________________________________________________________________
dense_2 (Dense)              (None, 16)                2064      
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 88,161
Trainable params: 88,161
Non-trainable params: 0
_________________________________________________________________
None


In [40]:
labels = df["message_type"]
labels

0       ham
1       ham
2       ham
3       ham
4       ham
       ... 
3495    ham
3496    ham
3497    ham
3498    ham
3499    ham
Name: message_type, Length: 3500, dtype: object

Labels must be encoded as numerical values; otherwise, the following error is raised:

UnimplementedError:  Cast string to float is not supported
	 [[node binary_crossentropy/Cast (defined at <ipython-input-39-c86fac56b2a7>:1) ]] [Op:__inference_train_function_5826]

In [41]:
from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()
labels_enc = le.fit_transform(labels)

labels_enc

array([0, 0, 0, ..., 0, 0, 0])

In [42]:
type(labels_enc)

numpy.ndarray

In [43]:
label_sr = pd.Series(labels_enc)[:15]
label_sr.values 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

In [44]:
df["message_type"].head(15)

0      ham
1      ham
2      ham
3      ham
4      ham
5      ham
6      ham
7      ham
8      ham
9      ham
10     ham
11     ham
12     ham
13    spam
14    spam
Name: message_type, dtype: object

In [54]:
# df1 = df.assign(e = pd.Series(labels_enc).values)

In [55]:
# df1.head()

Unnamed: 0,message,message_type,e
0,The basket's gettin full so I might be by tonight,ham,0
1,Can i get your opinion on something first?,ham,0
2,Company is very good.environment is terrific a...,ham,0
3,Its a valentine game. . . Send dis msg to all ...,ham,0
4,S.i'm watching it in live..,ham,0


In [45]:
df1 = df.assign(labels = labels_enc)

In [46]:
df1.head(15)

Unnamed: 0,message,message_type,labels
0,The basket's gettin full so I might be by tonight,ham,0
1,Can i get your opinion on something first?,ham,0
2,Company is very good.environment is terrific a...,ham,0
3,Its a valentine game. . . Send dis msg to all ...,ham,0
4,S.i'm watching it in live..,ham,0
5,Don know:)this week i'm going to tirunelvai da.,ham,0
6,7 lor... Change 2 suntec... Wat time u coming?,ham,0
7,"Garbage bags, eggs, jam, bread, hannaford whea...",ham,0
8,You see the requirements please,ham,0
9,"Are you being good, baby? :)",ham,0


In [47]:
labels = df1["labels"]

In [61]:
model.fit(padded_docs, labels, epochs=40, verbose=1)

Epoch 1/40
...
Epoch 40/40


<tensorflow.python.keras.callbacks.History at 0x1e22fd4f730>

In [62]:
loss, accuracy = model.evaluate(padded_docs, labels)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000


In [51]:
df_test = pd.read_csv('test.txt', delimiter='\n', header=None, quotechar="'")
df_test.columns = ["message"]

In [52]:
df_test.tail()

Unnamed: 0,message
2067,Our Prashanthettan's mother passed away last n...
2068,either way works for me. I am &lt;#&gt; year...
2069,Not yet had..ya sapna aunty manege y'day hogid...
2070,What happen dear. Why you silent. I am tensed
2071,Don't b floppy... b snappy & happy! Only gay c...


In [71]:
one_hot_msg_len= math.ceil(max([len(message) for message in df_test["message"]]) * 1.25)
one_hot_msg_len

738

In [72]:
encoded_docs = [one_hot(msg, one_hot_msg_len) for msg in df_test["message"]]

In [73]:
# "Beautiful tomorrow never comes.. When it comes, it's already TODAY.. In the hunt of beautiful tomorrow don't waste your wonderful TODAY.. GOODMORNING:)"
encoded_docs[0]

[400,
 432,
 69,
 61,
 86,
 424,
 61,
 330,
 331,
 33,
 736,
 267,
 264,
 286,
 400,
 432,
 14,
 94,
 470,
 652,
 33,
 50]

In [74]:
# "Beautiful tomorrow never comes.. When it comes, it's already TODAY.. In the hunt of beautiful tomorrow don't waste your wonderful TODAY.. GOODMORNING:)"
df_test.iloc[0]

message    "Beautiful tomorrow never comes.. When it come...
Name: 0, dtype: object

In [76]:
max_length = max([len(message) for message in df_test["message"]])
max_length

590
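Two parameters are derived from the test data here: a hash space of 738 (the model was trained with 1138) and a maximum length of 590, which is a character count, whereas training used 189 tokens. Because each word is hashed modulo the hash-space size, the same word will generally get a different index under 738 than under 1138, so it no longer points at the embedding row learned for it. The sketch below reuses the training-time values for test data. It uses an md5-based stand-in for `one_hot` and a pure-Python post-padding helper; `TRAIN_HASH_SPACE`, `TRAIN_MAX_LEN`, `encode`, and `pad_post` are hypothetical names.

```python
import hashlib

def encode(text, n):
    """Hash each lowercase word to an index in [1, n - 1] (md5 stand-in for one_hot)."""
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
            for w in text.lower().split()]

def pad_post(seq, maxlen):
    """Keep at most the first maxlen items, then right-pad with zeros."""
    seq = seq[:maxlen]
    return seq + [0] * (maxlen - len(seq))

# Values fixed at training time and reused unchanged for test data.
TRAIN_HASH_SPACE = 1138
TRAIN_MAX_LEN = 189

train_row = pad_post(encode("watching tv lor", TRAIN_HASH_SPACE), TRAIN_MAX_LEN)
test_row = pad_post(encode("Watching TV lor", TRAIN_HASH_SPACE), TRAIN_MAX_LEN)

# The same words map to the same indices, and both rows have the model's input length.
print(train_row[:3] == test_row[:3], len(test_row))  # True 189
```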

In [77]:
test_padded_msgs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

In [79]:
test_padded_msgs[0]

array([400, 432,  69,  61,  86, 424,  61, 330, 331,  33, 736, 267, 264,
       286, 400, 432,  14,  94, 470, 652,  33,  50,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   

In [80]:
model.predict(test_padded_msgs)



array([[2.4121206e-08],
       [9.9986744e-01],
       [3.4668432e-07],
       ...,
       [7.4099103e-07],
       [9.4126940e-01],
       [3.3551952e-08]], dtype=float32)

In [81]:
spam_pred = model.predict(test_padded_msgs)

In [85]:
spam_pred

array([[2.4121206e-08],
       [9.9986744e-01],
       [3.4668432e-07],
       ...,
       [7.4099103e-07],
       [9.4126940e-01],
       [3.3551952e-08]], dtype=float32)

In [87]:
spam_pred[0][0]

2.4121206e-08

In [96]:
max(spam_pred)

array([0.9999549], dtype=float32)

In [89]:
spam_pred_list = []
for i in range(len(spam_pred)):
    spam_pred_list.append(spam_pred[i][0])
    
spam_pred_list[:10]

[2.4121206e-08,
 0.99986744,
 3.4668432e-07,
 3.3845222e-09,
 5.6716118e-08,
 2.821918e-05,
 2.4148803e-06,
 0.00025257468,
 0.006676048,
 1.2434184e-05]

In [99]:
predicted_lables = [0 if val<0.5 else 1 for val in spam_pred_list ]
predicted_lables[:10]

[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

In [100]:
df_lbl = pd.DataFrame(predicted_lables, columns=None)
df_lbl.columns = ['predicted_label']
df_lbl.head()

Unnamed: 0,predicted_label
0,0
1,1
2,0
3,0
4,0


In [103]:
df_lbl.to_csv('predicted_lbl.csv', header=False, index=False)

In [104]:
df_lbl

Unnamed: 0,predicted_label
0,0
1,1
2,0
3,0
4,0
...,...
2067,1
2068,0
2069,0
2070,1


### Resolving the warning

Pad the training sequences to the length used for the test set, i.e. the longest message in the test data; this increases the padded length from 189 to 590.

In [49]:
df1.head()

Unnamed: 0,message,message_type,labels
0,The basket's gettin full so I might be by tonight,ham,0
1,Can i get your opinion on something first?,ham,0
2,Company is very good.environment is terrific a...,ham,0
3,Its a valentine game. . . Send dis msg to all ...,ham,0
4,S.i'm watching it in live..,ham,0


In [53]:
one_hot_vec_len = math.ceil(max([len(message) for message in df["message"]]) * 1.25)
encoded_docs = [one_hot(message, one_hot_vec_len) for message in df["message"]]

# set padded values from df_test where longest message is 590. This will train the model on the same length as the test df
# against which messages are being tested. 
max_length = max([len(message) for message in df_test["message"]])
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

In [54]:
#size of the output vector from each layer
embedding_vector_length = 32

model = Sequential()
#Creating an embedding layer to vectorize
#max_feature is 1.25 * length of the mapping space.
model.add(Embedding(one_hot_vec_len, embedding_vector_length, input_length=max_length))
#Adding Bi-directional LSTM
model.add(Bidirectional(LSTM(64)))
#Relu allows converging quickly and allows backpropagation
model.add(Dense(16, activation='relu'))
#Deep learning models can overfit easily; to avoid this, we add randomization using dropout
model.add(Dropout(0.1))
#Adding sigmoid activation function to normalize the output
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 590, 32)           36416     
_________________________________________________________________
bidirectional_2 (Bidirection (None, 128)               49664     
_________________________________________________________________
dense_4 (Dense)              (None, 16)                2064      
_________________________________________________________________
dropout_2 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 17        
Total params: 88,161
Trainable params: 88,161
Non-trainable params: 0
_________________________________________________________________
None


In [56]:
labels[:5]

0    0
1    0
2    0
3    0
4    0
Name: labels, dtype: int32

In [57]:
model.fit(padded_docs, labels, epochs=15, verbose=1)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x1ebb508a2e0>

In [58]:
loss, accuracy = model.evaluate(padded_docs, labels)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000


#### The model is obviously overtrained. Either use a validation set for early stopping or reduce the number of epochs to 12.

#### Applying the newly trained model to the test data and checking the website scores

In [60]:
df_test = pd.read_csv('test.txt', delimiter='\n', header=None, quotechar="'")
df_test.columns = ["message"]
one_hot_msg_len= math.ceil(max([len(message) for message in df_test["message"]]) * 1.25)
encoded_docs = [one_hot(msg, one_hot_msg_len) for msg in df_test["message"]]
max_length = max([len(message) for message in df_test["message"]])
test_padded_msgs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

model.predict(test_padded_msgs)

array([[1.6338426e-07],
       [3.3659030e-06],
       [9.8079777e-01],
       ...,
       [3.6426886e-07],
       [2.9595521e-06],
       [1.5506515e-05]], dtype=float32)

#### We resolved the warning for mis-matched length of tokenized messages.

### Changing network architecture and checking the result

In [69]:
embedding_vector_length = 64

model = Sequential()
model.add(Embedding(one_hot_vec_len, embedding_vector_length, input_length=max_length))

#return_sequences: Boolean. Whether to return the last output in the output sequence, or the full sequence.
#Default: `False`. Set it to True to be able to stack LSTM layers. Otherwise, you get this error:
# ValueError: Input 0 of layer bidirectional_8 is incompatible with the layer: expected ndim=3, found ndim=2. 
#Full shape received: [None, 16]
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Dense(16, activation='relu'))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 590, 64)           72832     
_________________________________________________________________
bidirectional_11 (Bidirectio (None, 590, 128)          66048     
_________________________________________________________________
dense_12 (Dense)             (None, 590, 16)           2064      
_________________________________________________________________
bidirectional_12 (Bidirectio (None, 128)               41472     
_________________________________________________________________
dense_13 (Dense)             (None, 16)                2064      
_________________________________________________________________
dropout_4 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_14 (Dense)             (None, 1)                

In [70]:
one_hot_vec_len = math.ceil(max([len(message) for message in df["message"]]) * 1.25)
encoded_docs = [one_hot(message, one_hot_vec_len) for message in df["message"]]

# set padded values from df_test where longest message is 590. This will train the model on the same length as the test df
# against which messages are being tested. 
max_length = max([len(message) for message in df_test["message"]])
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

model.fit(padded_docs, labels, epochs=15, verbose=1)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x1ebc94e8b20>

## Setting up early stopping in Keras model

#### The model is again overtrained. To prevent this, define an EarlyStopping callback and pass it to model.fit() together with a 10% validation set. For more, refer to [this](https://keras.io/api/callbacks/early_stopping/) and [this link](https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/#:~:text=the%20validation%20dataset.-,Early%20Stopping%20in%20Keras,configured%20when%20instantiated%20via%20arguments).

In [73]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(padded_docs, labels, test_size = 0.1)

In [77]:
X_train[0]

array([ 947,  887,  961,   83,  594, 1095,  916,  519,  982,  735,  331,
        878, 1109,  914,  960,  376,  561,  962,  947,  887,  855,  964,
       1031,  327,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

In [79]:
y_train[0]

0

In [81]:
from keras.callbacks import EarlyStopping

In [85]:
embedding_vector_length = 64

model = Sequential()
model.add(Embedding(one_hot_vec_len, embedding_vector_length, input_length=max_length))

#return_sequences: Boolean. Whether to return the last output in the output sequence, or the full sequence.
#Default: `False`. Set it to True to be able to stack LSTM layers. Otherwise, you get this error:
# ValueError: Input 0 of layer bidirectional_8 is incompatible with the layer: expected ndim=3, found ndim=2. 
#Full shape received: [None, 16]
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Dense(16, activation='relu'))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 590, 64)           72832     
_________________________________________________________________
bidirectional_13 (Bidirectio (None, 590, 128)          66048     
_________________________________________________________________
dense_15 (Dense)             (None, 590, 16)           2064      
_________________________________________________________________
bidirectional_14 (Bidirectio (None, 128)               41472     
_________________________________________________________________
dense_16 (Dense)             (None, 16)                2064      
_________________________________________________________________
dropout_5 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_17 (Dense)             (None, 1)                

In [86]:
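# Note: monitor='loss' tracks the training loss; use monitor='val_loss' to stop based on the validation set.
callback = EarlyStopping(monitor='loss', patience=3)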

In [87]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=15, callbacks=[callback] ,verbose=1)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15


<tensorflow.python.keras.callbacks.History at 0x1ebd8d19100>

In [88]:
loss, accuracy = model.evaluate(padded_docs, labels)
print('Train Accuracy: {} --- Train Loss'.format({accuracy*100}, {loss}) )

Train Accuracy: {99.80000257492065} --- Train Loss


In [89]:
loss, accuracy = model.evaluate(X_train, y_train)
print('Train Accuracy: {} --- Train Loss'.format({accuracy*100}, {loss}) )

Train Accuracy: {99.96825456619263} --- Train Loss


In [91]:
loss, accuracy = model.evaluate(X_test, y_test)
print('Train Accuracy: {} --- Train Loss'.format({accuracy*100}, {loss}) )

Train Accuracy: {98.28571677207947} --- Train Loss
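The braces in the printed output come from passing the set literals `{accuracy*100}` and `{loss}` to `format`, and the loss never appears because the template has only one placeholder. The last cell also evaluates the held-out split while labelling the result "Train". A sketch of the intended call, with hypothetical values standing in for the output of `model.evaluate()`:

```python
# Hypothetical values standing in for the output of model.evaluate().
loss, accuracy = 0.05, 0.9829

# Two placeholders, two plain arguments (no set literals), and a label
# that names the split actually being evaluated.
report = "Test accuracy: {:.2f}% --- Test loss: {:.4f}".format(accuracy * 100, loss)
print(report)  # Test accuracy: 98.29% --- Test loss: 0.0500
```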


### Scenario 4: Defining restore_best_weights in the EarlyStopping callback

```python
tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    min_delta=0,
    patience=0,
    verbose=0,
    mode="auto",
    baseline=None,
    restore_best_weights=False,
)
```
Arguments

+ monitor: Quantity to be monitored.
+ min_delta: Minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than min_delta, will count as no improvement.
+ patience: Number of epochs with no improvement after which training will be stopped.
+ verbose: Verbosity mode, 0 or 1.
+ mode: One of {"auto", "min", "max"}. In "min" mode, training will stop when the quantity monitored has stopped decreasing; in "max" mode it will stop when the quantity monitored has stopped increasing; in "auto" mode, the direction is automatically inferred from the name of the monitored quantity.
+ baseline: Baseline value for the monitored quantity. Training will stop if the model doesn't show improvement over the baseline.
+ restore_best_weights: Whether to restore model weights from the epoch with the best value of the monitored quantity. If False, the model weights obtained at the last step of training are used. An epoch will be restored regardless of the performance relative to the baseline. If no epoch improves on baseline, training will run for patience epochs and restore weights from the best epoch in that set.
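
How `patience`, `min_delta` and `restore_best_weights` interact can be sketched in plain Python. This is a simplified illustration for "min" mode only, not the Keras implementation: the function `simulate_early_stopping` and the loss values are hypothetical. If the stop is never triggered, the sketch simply keeps the last epoch, which may differ between Keras versions.

```python
def simulate_early_stopping(losses, patience=0, min_delta=0.0,
                            restore_best_weights=False):
    """Simplified 'min'-mode early stopping over per-epoch losses.

    Returns (stopped_epoch, kept_epoch), both 1-based: the epoch at which
    training stops and the epoch whose weights the model ends up with.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, current in enumerate(losses, start=1):
        if current < best - min_delta:
            # Improvement: remember this epoch and reset the counter.
            best = current
            best_epoch = epoch
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                kept = best_epoch if restore_best_weights else epoch
                return epoch, kept
    # Stop never triggered: assume the last epoch's weights are kept.
    return len(losses), len(losses)


# Hypothetical losses: the best value is reached in epoch 3.
losses = [0.50, 0.30, 0.20, 0.25, 0.24, 0.26, 0.10]
print(simulate_early_stopping(losses, patience=3, restore_best_weights=False))
print(simulate_early_stopping(losses, patience=3, restore_best_weights=True))
```

In both calls training stops after epoch 6 (three epochs without improvement). With `restore_best_weights=False` the model keeps the weights of epoch 6; with `True` it goes back to those of epoch 3.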

In [96]:
embedding_vector_length = 64

model = Sequential()
model.add(Embedding(one_hot_vec_len, embedding_vector_length, input_length=max_length))

# return_sequences: Boolean. Whether to return only the last output in the output sequence, or the full sequence.
# Default: `False`. Set it to True to be able to stack LSTM layers. Otherwise, this error is raised:
# ValueError: Input 0 of layer bidirectional_8 is incompatible with the layer: expected ndim=3, found ndim=2.
# Full shape received: [None, 16]
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Dense(16, activation='relu'))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

X_train, X_test, y_train, y_test = train_test_split(padded_docs, labels, test_size = 0.1)

# callback = EarlyStopping(monitor='loss', patience=3, baseline=0.001 , restore_best_weights=True)
callback = EarlyStopping(monitor='loss', patience=3, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=15,  batch_size=50, callbacks=[callback] ,verbose=1)

# With baseline=0.001 (the commented-out callback above), training fails at the end of epoch 3 with:
# TypeError: object of type 'NoneType' has no len().
# Installed package versions:
# tensorboard               2.4.1                    pypi_0    pypi
# tensorboard-plugin-wit    1.8.0                    pypi_0    pypi
# tensorflow                2.4.1                    pypi_0    pypi
# tensorflow-estimator      2.4.0                    pypi_0    pypi
# termcolor                 1.1.0                    pypi_0    pypi

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, 590, 64)           72832     
_________________________________________________________________
bidirectional_21 (Bidirectio (None, 590, 128)          66048     
_________________________________________________________________
dense_27 (Dense)             (None, 590, 16)           2064      
_________________________________________________________________
bidirectional_22 (Bidirectio (None, 128)               41472     
_________________________________________________________________
dense_28 (Dense)             (None, 16)                2064      
_________________________________________________________________
dropout_9 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_29 (Dense)             (None, 1)               

<tensorflow.python.keras.callbacks.History at 0x1ebf8766d60>
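
The parameter counts in the summary above can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below uses the standard formulas for embedding, LSTM and dense layers. The helper names are hypothetical, and the vocabulary size of 1138 is an assumption inferred from the summary (72832 / 64), not a value taken from the code.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_params(input_dim, units):
    # A forward and a backward LSTM with separate weights.
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


vocab_size = 72832 // 64  # assumption: inferred from the summary

print(embedding_params(vocab_size, 64))    # embedding layer
print(bidirectional_lstm_params(64, 64))   # first Bidirectional(LSTM(64))
print(dense_params(128, 16))               # Dense(16) on the 128-dim sequence output
print(bidirectional_lstm_params(16, 64))   # second Bidirectional(LSTM(64))
print(dense_params(128, 16))               # Dense(16) on the final 128-dim output
```

The first bidirectional layer outputs 128 values per time step (64 per direction), which is why the following `Dense(16)` layers see an input of size 128, and the second LSTM sees an input of size 16.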

### Q2: Must the data set length be divisible by the batch size? If the data set cannot be split into equal chunks of that size, how will training behave? Can it cause a failure in the training or validation process? For example, if the training data has a length of 100 and the batch size is 60 (just example numbers), will it fail?
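
A short answer, as a sketch: to my knowledge `model.fit` on NumPy arrays does not require the data set length to be a multiple of the batch size; the last batch just contains the leftover samples. The helper `batch_sizes` below is hypothetical (it is not a Keras function) and only illustrates that splitting with the example numbers from the question.

```python
import math


def batch_sizes(n_samples, batch_size):
    """Sizes of consecutive batches when the last one may be smaller."""
    return [min(batch_size, n_samples - start)
            for start in range(0, n_samples, batch_size)]


sizes = batch_sizes(100, 60)
print(sizes)                               # sizes of the individual batches
print(len(sizes), math.ceil(100 / 60))     # number of steps per epoch
```

With 100 samples and a batch size of 60, an epoch would run two steps, one with 60 samples and one with 40, so the uneven split by itself should not cause a failure.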

### Q3: Epoch vs batch size

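In each epoch the whole training set is fed through the network once. The data is split into batches of `batch_size` samples, and backpropagation with a weight update runs after each batch, not once at the end of the epoch. One epoch therefore performs about (number of samples / batch size) weight updates.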
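
The relationship between epochs, batch size and the number of weight updates can be counted with a bare training-loop skeleton. This is only an illustration: `count_weight_updates` is a hypothetical helper, and the 450 samples are a made-up number, not the size of this data set.

```python
def count_weight_updates(n_samples, batch_size, epochs):
    """Count optimizer steps in mini-batch training (one update per batch)."""
    updates = 0
    for _ in range(epochs):
        for start in range(0, n_samples, batch_size):
            # A forward pass, backpropagation and a weight update
            # would happen here for samples[start:start + batch_size].
            updates += 1
    return updates


print(count_weight_updates(450, 1, epochs=1))    # one update per sample
print(count_weight_updates(450, 50, epochs=1))   # far fewer, larger updates
print(count_weight_updates(450, 50, epochs=15))  # scales linearly with epochs
```

A smaller batch size means more, noisier updates per epoch; a larger one means fewer updates, each computed over more samples.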

I have made four changes to check the accuracy and loss of the newly trained model:

1. Stacking two LSTM layers.
2. Adding a callback function for early stopping.
3. Restoring the best weights of the trained model (`restore_best_weights=True`).
4. Changing the batch size from 1 to 50.

In [99]:
loss, accuracy = model.evaluate(padded_docs, labels)
# padded_docs holds the full data set (train and test together)
print('Full Data Accuracy: {}'.format(accuracy * 100))

Full Data Accuracy: 99.80000257492065


In [100]:
loss, accuracy = model.evaluate(X_train, y_train)
print('Train Accuracy: {}'.format(accuracy * 100))

Train Accuracy: 100.0


In [101]:
loss, accuracy = model.evaluate(X_test, y_test)
print('Test Accuracy: {}'.format(accuracy * 100))

Test Accuracy: 98.00000190734863


In [105]:
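df_test = pd.read_csv('test.txt', delimiter='\n', header=None, quotechar="'")
df_test.columns = ["message"]
# Note: len(message) counts characters, not words, so both sizes below are upper bounds.
# The hashing size and padding length are recomputed from the test file here; for consistent
# word indices and input shape they should match the values used in training
# (one_hot_vec_len and max_length).
one_hot_msg_len = math.ceil(max(len(message) for message in df_test["message"]) * 1.25)
encoded_docs = [one_hot(msg, one_hot_msg_len) for msg in df_test["message"]]

max_length = max(len(message) for message in df_test["message"])
test_padded_msgs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')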



In [106]:
test_padded_msgs[0]

array([313, 695, 473, 457, 538, 708, 457, 164, 687, 154, 721, 300,  81,
       572, 313, 695,  40, 681, 631, 501, 154, 224,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   

In [108]:
spam_pred = model.predict(test_padded_msgs)

In [109]:
spam_pred[0][0]

2.8601212e-06

In [110]:
# Flatten the (n, 1) prediction array into a plain list of probabilities
spam_pred_list = [pred[0] for pred in spam_pred]

spam_pred_list[:10]

[2.8601212e-06,
 1.169266e-05,
 0.9961685,
 1.0340724e-05,
 3.7576624e-06,
 1.04528435e-05,
 1.2438679e-05,
 5.3217914e-06,
 1.800399e-05,
 6.506518e-06]

In [111]:
# Threshold the sigmoid output at 0.5: 1 = spam, 0 = not spam
predicted_labels = [0 if val < 0.5 else 1 for val in spam_pred_list]
df_lbl = pd.DataFrame(predicted_labels, columns=['predicted_label'])
df_lbl.head()

|   | predicted_label |
|---|-----------------|
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |


In [112]:
df_lbl.to_csv('predicted_lbl_2lstm_lyrs.csv', header=False, index=False)

### Q4: What causes the error "TypeError: object of type 'NoneType' has no len()"? Why was it raised at the end of epoch 3? Why did dropping "baseline=0.001" from the callback resolve it? See the cell below for the error.

In [92]:
embedding_vector_length = 64

model = Sequential()
model.add(Embedding(one_hot_vec_len, embedding_vector_length, input_length=max_length))

# return_sequences: Boolean. Whether to return only the last output in the output sequence, or the full sequence.
# Default: `False`. Set it to True to be able to stack LSTM layers. Otherwise, this error is raised:
# ValueError: Input 0 of layer bidirectional_8 is incompatible with the layer: expected ndim=3, found ndim=2.
# Full shape received: [None, 16]
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Dense(16, activation='relu'))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

X_train, X_test, y_train, y_test = train_test_split(padded_docs, labels, test_size = 0.1)

callback = EarlyStopping(monitor='loss', patience=3, baseline=0.001 , restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=15, callbacks=[callback] ,verbose=1)

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 590, 64)           72832     
_________________________________________________________________
bidirectional_15 (Bidirectio (None, 590, 128)          66048     
_________________________________________________________________
dense_18 (Dense)             (None, 590, 16)           2064      
_________________________________________________________________
bidirectional_16 (Bidirectio (None, 128)               41472     
_________________________________________________________________
dense_19 (Dense)             (None, 16)                2064      
_________________________________________________________________
dropout_6 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_20 (Dense)             (None, 1)               

TypeError: object of type 'NoneType' has no len()
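
One plausible explanation for Q4, stated as an assumption rather than a verified reading of the TensorFlow 2.4.1 source: when `baseline` is set, the callback starts with `best = baseline` and only stores weights for an epoch whose loss beats it. If the loss stays above 0.001 for `patience=3` epochs, training stops at the end of epoch 3 while the stored best weights are still `None`, and restoring them fails. Without `baseline`, the first epoch always counts as an improvement, so there are always weights to restore. The function and the loss values below are hypothetical and only mimic that logic in plain Python.

```python
def early_stopping_with_baseline(losses, patience, baseline=None):
    """Sketch of 'min'-mode early stopping where `best` starts at the baseline.

    Returns (stopped_epoch, best_weights_epoch), both 1-based. The second
    value is None when no epoch ever beat `best`, i.e. no weights were saved.
    """
    best = baseline if baseline is not None else float("inf")
    best_weights_epoch = None
    wait = 0
    for epoch, current in enumerate(losses, start=1):
        if current < best:
            best = current
            best_weights_epoch = epoch  # weights would be saved here
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_weights_epoch
    return None, best_weights_epoch


# Hypothetical losses that improve every epoch but stay above 0.001.
losses = [0.30, 0.10, 0.05, 0.02]

print(early_stopping_with_baseline(losses, patience=3, baseline=0.001))
print(early_stopping_with_baseline(losses, patience=3))
```

With the baseline, the sketch stops at epoch 3 with no saved weights, which matches the point where the error appeared. Without it, training is never stopped early for these losses and weights are saved at every improving epoch.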

In [None]:
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
# model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())