## Loading and Inspecting the Dataset

In [None]:
import pandas as pd

url = ("https://drive.google.com/uc?id=14PVQVJMUAlvYIBrKGPrwgkl4IM1XACmo"
       "&authuser=1&export=download")

df = pd.read_csv(url)
df

Unnamed: 0,email,label
0,mike bostock said received from trackingNUMBE...,0
1,no i was just a little confused because i m r...,0
2,this is just an semi educated guess if i m wro...,0
3,jm URL justin mason writes except for NUMBER t...,0
4,i just picked up razor sdk NUMBER NUMBER and N...,0
...,...,...
1495,abc s good morning america ranks it the NUMBE...,1
1496,hyperlink hyperlink hyperlink let mortgage le...,1
1497,thank you for shopping with us gifts for all ...,1
1498,the famous ebay marketing e course learn to s...,1


Drop the rows whose email is null:

In [None]:
print('NaN emails: {}'.format(df['email'].isnull().sum()))
df = df.dropna(subset=['email'])
print('NaN emails: {}'.format(df['email'].isnull().sum()))

NaN emails: 1
NaN emails: 0


Create a matrix of dimensions $n \times m$, where $n$ is the number of emails and $m$ is the number of **distinct words** in the text corpus. Cell $(i,j)$ of the matrix contains the tf-idf value of word $j$ in text (email) $i$.

By default, the **TfidfVectorizer** class converts all words to lowercase while processing the texts. We decided not to remove stop words, since a spam email may consist only of stop words.
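To make the shape of this matrix concrete, here is a minimal sketch on a toy corpus. The three sentences and the names `toy_emails`, `toy_vec`, `toy_matrix` and `vocab` are hypothetical and are not part of the dataset; the vectorizer is used with its default settings, as in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not taken from the dataset).
toy_emails = [
    "Free offer for you",
    "the offer is for YOU",
    "meeting notes for the team",
]

# Default settings: text is lowercased and no stop words are removed.
toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy_emails)

# vocabulary_ maps each distinct word to its column index.
vocab = sorted(toy_vec.vocabulary_)

print(toy_matrix.shape)  # (number of emails, number of distinct words)
print(vocab)
```

"YOU" and "you" share one column because of lowercasing, and words such as "the" and "for" stay in the vocabulary because stop words are kept.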

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

emails = df['email']
vec = TfidfVectorizer()
vec_train = vec.fit_transform(emails)

X = pd.DataFrame(vec_train.todense())
y = df['label']

In the end, the resulting matrix $X$ contains all the tf-idf values; it is the dense version of $vec\_train$, which is sparse.

In essence, every row of $X$ is a representation (embedding) of an email in an $m$-dimensional space whose axes are the words of the vocabulary.
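The difference between the sparse and the dense form is mostly one of memory. The sketch below illustrates it on a hypothetical stand-in matrix (200 rows, 1000 columns, at most 2000 non-zero cells); the sizes and names are assumptions chosen for illustration and do not describe the actual dataset.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for a tf-idf matrix: mostly zeros.
rng = np.random.default_rng(0)
dense = np.zeros((200, 1000))
rows = rng.integers(0, 200, size=2000)
cols = rng.integers(0, 1000, size=2000)
dense[rows, cols] = 1.0

toy_sparse = sparse.csr_matrix(dense)

# CSR stores only the non-zero values plus two index arrays.
sparse_bytes = (toy_sparse.data.nbytes
                + toy_sparse.indices.nbytes
                + toy_sparse.indptr.nbytes)
dense_bytes = toy_sparse.toarray().nbytes

print(f"non-zero cells: {toy_sparse.nnz}")
print(f"sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes")
```

For a corpus of this notebook's size the dense copy fits in memory, which is presumably why it is used here; for a much larger vocabulary the sparse form may be the only practical option.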

## Neural Network

Using KFold with four splits, we repeatedly divide the dataset into a training set (75%) and a test set (25%), so that each email is used for testing exactly once.
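The 75%/25% proportion follows directly from `n_splits=4`. A minimal sketch on 20 hypothetical samples shows the fold sizes; `toy_X`, `toy_kf` and the fixed `random_state=0` are assumptions added only to make the illustration repeatable and are not used in the notebook's own split.

```python
import numpy as np
from sklearn.model_selection import KFold

# 20 hypothetical samples standing in for the emails.
toy_X = np.arange(20).reshape(-1, 1)

# random_state is an assumption made here for repeatability.
toy_kf = KFold(n_splits=4, shuffle=True, random_state=0)

# (train size, test size) for each of the four folds.
fold_sizes = [(len(train_idx), len(test_idx))
              for train_idx, test_idx in toy_kf.split(toy_X)]

# Collect every index that appears in some test fold.
seen_in_test = sorted(int(i)
                      for _, test_idx in toy_kf.split(toy_X)
                      for i in test_idx)

print(fold_sizes)
print(seen_in_test == list(range(20)))
```

Each fold trains on 15 samples and tests on the remaining 5, and the test folds together cover every sample once.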

The neural network model we created consists of three sequential dense layers of neurons. More specifically:
1. Layer of 16 neurons with ReLU activation function. We set the input_dim argument equal to $m$ (the number of distinct words).
2. Layer of 8 neurons with ReLU activation function.
3. Layer of 1 neuron with sigmoid activation function. Because of the sigmoid, this last layer outputs a single value $x \in (0,1)$. If this value is greater than $0.5$ the email is classified as spam (1), otherwise as non-spam (0).
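The thresholding rule of the last layer can be sketched with NumPy alone. The `logits` values below are hypothetical pre-activation outputs, and `sigmoid` is a helper written for this illustration, not a Keras function.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation outputs of the final neuron.
logits = np.array([-3.0, -0.2, 0.0, 0.4, 5.0])

probs = sigmoid(logits)

# Same rule as in the notebook: strictly greater than 0.5 means spam (1).
labels = (probs > 0.5).astype("int32")

print(probs)
print(labels)
```

Note that a probability of exactly 0.5 is labeled non-spam, because the comparison is strict.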


The model uses binary cross-entropy as its loss function, which is widely used in binary classification tasks to measure the error that training tries to minimize. We use the Adam optimizer and track accuracy as the metric. For the training phase, we use epochs=10 and batch_size=10.
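For labels $y_i \in \{0,1\}$ and predicted probabilities $p_i$, binary cross-entropy is $-\frac{1}{N}\sum_i \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right]$. The following is a minimal NumPy sketch of that formula, not Keras's implementation; the function name, the clipping constant `eps` and the example arrays are assumptions made for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps (an assumption) avoids log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

# Hypothetical labels and two sets of predicted probabilities.
y_true = np.array([1, 0, 1, 0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])

loss_right = binary_cross_entropy(y_true, mostly_right)
loss_wrong = binary_cross_entropy(y_true, mostly_wrong)

print(f"loss when mostly right: {loss_right:.4f}")
print(f"loss when mostly wrong: {loss_wrong:.4f}")
```

The loss grows as confident predictions move away from the true label, which is the signal the optimizer uses during training.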

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, precision_score, recall_score 


kf = KFold(n_splits=4, shuffle=True)

for fold, (train_index, test_index) in enumerate(kf.split(X)):
	X_train = X.iloc[train_index]
	y_train = y.iloc[train_index]
	X_test = X.iloc[test_index]
	y_test = y.iloc[test_index]

	# define the keras model
	model = Sequential()
	model.add(Dense(16, input_dim=X_train.shape[1], activation='relu'))
	model.add(Dense(8, activation='relu'))
	model.add(Dense(1, activation='sigmoid'))
	# model.summary()

	# compile the keras model
	model.compile(loss='binary_crossentropy', optimizer='adam',
	              metrics=['accuracy'])
	# fit the keras model on the dataset
	model.fit(X_train, y_train, epochs=10, batch_size=10, verbose=0)
	# make class predictions with the model
	predictions = (model.predict(X_test) > 0.5).astype("int32")

	print(f'Number of fold: {fold}')
	print(f'\tRecall: {recall_score(y_test, predictions)}')
	print(f'\tPrecision: {precision_score(y_test, predictions)}')
	print(f'\tf1-score: {f1_score(y_test, predictions)}')
	print('-------------------------------------------------------------')


Number of fold: 0
	Recall: 0.9765625
	Precision: 1.0
	f1-score: 0.9881422924901185
-------------------------------------------------------------
Number of fold: 1
	Recall: 0.9758064516129032
	Precision: 1.0
	f1-score: 0.9877551020408163
-------------------------------------------------------------
Number of fold: 2
	Recall: 0.9779411764705882
	Precision: 0.9851851851851852
	f1-score: 0.981549815498155
-------------------------------------------------------------
Number of fold: 3
	Recall: 0.990990990990991
	Precision: 0.9821428571428571
	f1-score: 0.9865470852017937
-------------------------------------------------------------


From the results above, we observe that this preprocessing combined with this neural network suits the dataset well. Across the four folds, recall ranges from about 0.976 to 0.991, precision from about 0.982 to 1.0, and the f1-score from about 0.982 to 0.988.
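To summarize the four folds in a single figure per metric, one option is to average them. The sketch below does so with the values copied from the output above; the variable names are hypothetical, and since the split is shuffled without a fixed seed, a rerun would give slightly different numbers.

```python
import numpy as np

# Per-fold scores copied from the printed output above.
scores = {
    "recall": np.array([0.9765625, 0.9758064516129032,
                        0.9779411764705882, 0.990990990990991]),
    "precision": np.array([1.0, 1.0,
                           0.9851851851851852, 0.9821428571428571]),
    "f1-score": np.array([0.9881422924901185, 0.9877551020408163,
                          0.981549815498155, 0.9865470852017937]),
}

# Mean and (population) standard deviation over the four folds.
summary = {name: (float(v.mean()), float(v.std())) for name, v in scores.items()}

for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.4f}, std={std:.4f}")
```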