<img src="assets/logo.png" width="800">

Made by **Balázs Nagy** and **Márk Domokos**

[<img src="assets/open_button.png">](https://colab.research.google.com/github/Fortuz/edu_Adaptive/blob/main/practices/L10%20-%20Spam%20filter_solved.ipynb)

# Lab 10 - Spam filter

In this lab exercise we will use the SVM algorithm to filter spam emails.

### 1: Import packages

In [None]:
from scipy.io import loadmat
import numpy as np
from sklearn.svm import SVC
import re
from nltk.stem import PorterStemmer             # Natural Language Toolkit (NLTK)

### 2: Load in Data

The data is downloaded from a publicly available repository. Alternatively, you could upload the data files directly to the Google Colab file system.

In [None]:
!wget https://github.com/Fortuz/edu_Adaptive/raw/main/practices/assets/Lab10/emailSample1.txt
!wget https://github.com/Fortuz/edu_Adaptive/raw/main/practices/assets/Lab10/emailSample2.txt
!wget https://github.com/Fortuz/edu_Adaptive/raw/main/practices/assets/Lab10/vocab.txt
!wget https://github.com/Fortuz/edu_Adaptive/raw/main/practices/assets/Lab10/spamSample1.txt
!wget https://github.com/Fortuz/edu_Adaptive/raw/main/practices/assets/Lab10/spamSample2.txt
!wget https://github.com/Fortuz/edu_Adaptive/raw/main/practices/assets/Lab10/spamTest.mat
!wget https://github.com/Fortuz/edu_Adaptive/raw/main/practices/assets/Lab10/spamTrain.mat

Let's read in our data. We will work with two emails and a dictionary that contains the normalised form of the most commonly used words.

In [None]:
mail1 = open("emailSample1.txt","r").read()     # load first mail
mail2 = open("emailSample2.txt","r").read()     # load second email
vocabList = open("vocab.txt","r").read()        # load vocabulary

print('First mail:')                            
print(mail1)
print('Second mail:')
print(mail2)
print('Vocabulary list:')
print(vocabList)

vocabList = vocabList.split("\n")[:-1]          # one "index<TAB>word" entry per line; drop the last (empty) element
vocabList_d = {}                                # maps word -> index (stored as a string)
for ea in vocabList:
    index, word = ea.split("\t")
    vocabList_d[word] = index

### 3: Email preprocessing

The first step is to normalise the text of the email. What does this mean?
- Convert everything to lower case
- Strip the HTML markup
- Normalise URLs
- Normalise numbers
- Normalise email addresses
- Normalise special characters
- Reduce words to their dictionary (stemmed) form
- Omit single-character tokens

In most cases, normalisation will mean replacing an element with a simplified string.

After normalisation, we encode the email as a sequence of numbers based on our dictionary. That is, for every word of the email that appears in the dictionary, we return its index.
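
The idea can be sketched on a toy input before tackling the exercise. This is only a minimal illustration under stated assumptions: `toy_vocab`, `normalise` and `to_indices` are hypothetical names, the five-word vocabulary is made up, stemming is skipped, and the URL pattern `https?://\S*` is not the one used in the lab cell below.

```python
import re

# Hypothetical miniature vocabulary (word -> 1-based index); the lab's vocab.txt is much larger.
toy_vocab = {"emailaddr": 1, "httpaddr": 2, "know": 3, "number": 4, "visit": 5}

def normalise(text):
    """Lowercase the text and replace URLs, email addresses and numbers with placeholder tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S*", "httpaddr", text)   # URLs -> "httpaddr"
    text = re.sub(r"\S+@\S+", "emailaddr", text)       # email addresses -> "emailaddr"
    text = re.sub(r"[0-9]+", "number", text)           # digit runs -> "number"
    text = re.sub(r"[^a-z0-9\s]", " ", text)           # remaining punctuation -> spaces
    return text

def to_indices(text, vocab):
    """Return the vocabulary index of every token found in vocab, in order of appearance."""
    return [vocab[token] for token in text.split() if token in vocab]

sample = "Anyone know? Visit http://example.com or mail me@example.com before 2024!"
clean = normalise(sample)
indices = to_indices(clean, toy_vocab)
print(clean)
print(indices)   # → [3, 5, 2, 1, 4]
```

Words that are not in the vocabulary (here "anyone", "or", "mail", "before") are simply skipped, so the index list is usually shorter than the email.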

In [None]:
def processEmail(mailcontent):
    word_indices=[]                                                         # initialization
    
    mailcontent = mailcontent.lower()                                       # lowercase
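    # NOTE: [http|https] is a character class (a single h, t, p, s or |), not an alternation,
    # so only the last letter of the scheme plus the rest of the URL is replaced.
    # Using (http|https) would replace the whole URL, but it changes the preprocessed text
    # and therefore may change the indices checked below.
    mailcontent = re.sub(r"[http|https]://[^\s]*","httpaddr",mailcontent)   # URL normalisation
    mailcontent = re.sub(r"[^\s]+@[^\s]+","emailaddr",mailcontent)          # email address normalisation
    mailcontent = re.sub(r"[0-9]+","number",mailcontent)                    # number normalisation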
    specChar = ["<","[","^",">","+","?","!","'",".",",",":","$"]            # special character list
    
    ################### CODE HERE ######################## 
    # Normalize special characters




    #####################################################
    
    ps = PorterStemmer()                                                    # natural language processing - dictionary form reduction
    mailcontent = [ps.stem(token) for token in mailcontent.split(" ")]
    mailcontent= " ".join(mailcontent)
    
    mailcontent = mailcontent.replace("\n"," ")
    
    ################### CODE HERE ######################## 
    # Fill word_indices: encode the email as a list of dictionary indices





    #####################################################
    
    print('Preprocessed mail:',mailcontent)
    return word_indices

word_indices = processEmail(mail1)
print('\nWord indices:',word_indices)

check_email= [86, 916, 794, 1077, 883, 370, 1699, 790, 1822, 1831, 883, 431, 1171, 794, 1002, 1893, 1364, 592, 1676, 238, 162, 89, 688, 945, 1663, 1120, 1062, 1699, 375, 1162, 1120, 1893, 1510, 1182, 1237, 810, 1895, 1440, 1547, 181, 1699, 1758, 1896, 688, 1676, 992, 961, 1477, 71, 530, 1699, 531]
if word_indices == check_email:
    print("\n Correct transformation. Proceed.")
else:
    print("\nSomething went wrong. Check your implementation.")

### 4: Feature extraction

The next step is to create a feature vector from the preprocessed email. Its length equals the size of our dictionary, and an entry is 1 if the corresponding dictionary word occurs in the email and 0 otherwise.

<img src="assets/Lab10/Pics/L10_vector.png" width="150">
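
A minimal sketch of such a binary vector on toy numbers may help. It assumes 1-based word indices stored at 0-based array positions; `toy_email_features` and `vocab_size` are hypothetical names and not part of the lab code.

```python
import numpy as np

def toy_email_features(word_indices, vocab_size):
    """Binary bag-of-words vector: position i-1 is 1 if the 1-based index i occurs."""
    features = np.zeros(vocab_size)
    for idx in set(word_indices):      # repeated words still count only once
        features[idx - 1] = 1
    return features

# Index 3 appears twice, but its entry is still just 1.
toy_features = toy_email_features([3, 5, 3, 1], vocab_size=6)
print(toy_features)                    # → [1. 0. 1. 0. 1. 0.]
```

Because repeated words are counted once, the number of non-zero entries can be smaller than the length of the index list.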

In [None]:
def emailFeatures(word_indices,vocabList):
    ################### CODE HERE ########################   
    # Make a feature vector from the email. 
    # The dimensions of the vector should match the vocabList
    # 1 represents a feature which is present in the email, and 0 otherwise.




   
                                   
    #####################################################
    return features

features = emailFeatures(word_indices,vocabList_d)
print("Length of feature vector (1899 expected): %.0f" % len(features))
print("Number of non-zero entries (43 expected): %.0f" % np.sum(features))

### 5: Train SVM to classify Spams

Using the training emails, train an SVM classifier with a linear kernel.

In some textbooks, a classifier that separates the data with a linear hyperplane is called a support vector classifier (SVC), while its extension to non-linear decision boundaries is called a support vector machine (SVM). In scikit-learn, the `SVC` class covers both cases; the `kernel` parameter selects the type of boundary.
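
As a rough illustration of the workflow, the sketch below fits a linear-kernel `SVC` on made-up data. The arrays `X_toy` and `y_toy`, the name `toy_clf` and the value `C=1.0` are assumptions chosen for the example, not the lab's data or settings. The labels are stored as a column vector, as they are after `loadmat`, which is why `ravel()` is used to flatten them before fitting.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: 6 "emails" with 3 binary features; label 1 = spam.
X_toy = np.array([[1, 0, 0],
                  [1, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0],
                  [0, 0, 1],
                  [0, 1, 1]])
y_toy = np.array([[1], [1], [1], [0], [0], [0]])   # column vector, shape (6, 1)

toy_clf = SVC(C=1.0, kernel="linear")
toy_clf.fit(X_toy, y_toy.ravel())                  # ravel() turns (6, 1) into (6,)

toy_accuracy = toy_clf.score(X_toy, y_toy.ravel())
print("Toy training accuracy:", toy_accuracy * 100, "%")
```

In this toy set the first feature alone separates the two classes, so a linear boundary is enough.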

In [None]:
spam_mat = loadmat("spamTrain.mat")
X_train = spam_mat["X"]
y_train = spam_mat["y"]
C = 0.2

################### CODE HERE ########################
# Initialise an SVC with a linear kernel as spam_predictor and fit it;
# use the ravel() function to flatten y_train



######################################################

print("Training Accuracy (99.975% expected):",(spam_predictor.score(X_train,y_train.ravel()))*100,"%")

### 6: Classification test

In [None]:
spam_mat2 = loadmat("spamTest.mat")
X_test = spam_mat2["Xtest"]
y_test = spam_mat2["ytest"]

print("Test Accuracy (98.9% expected):",(spam_predictor.score(X_test,y_test.ravel()))*100,"%")

### 7: Main indicators of a Spam

Since the model we trained is a linear SVM, we can inspect the weight it has learned for each word. In what follows, we implement a code snippet that shows which words (and their weights) the model considers most indicative of spam.

In [None]:
weights = spam_predictor.coef_[0]
weights_col = np.hstack((np.arange(1,1900).reshape(1899,1),weights.reshape(1899,1)))
weights_sorted = weights_col[weights_col[:,1].argsort()][::-1]

spamvoc_ind = weights_sorted[0:15,0]
spamvoc_weights = (weights_sorted[0:15,1])
j = 0
for i in spamvoc_ind:
    print(vocabList[int(i-1)],'\t', '\t', spamvoc_weights[j])
    j += 1                                      # advance to the weight of the next word
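
The same ranking idea can be illustrated on a handful of made-up numbers. The words and weights below are hypothetical and do not come from the trained model; the sketch only shows how `np.argsort` can order words from the largest to the smallest weight.

```python
import numpy as np

# Hypothetical words and weights, for illustration only.
toy_words = ["click", "meeting", "free", "thanks", "guarantee"]
toy_weights = np.array([0.42, -0.31, 0.27, -0.18, 0.35])

order = np.argsort(toy_weights)[::-1]   # positions sorted from largest to smallest weight
top_k = 3
top_words = [(toy_words[i], float(toy_weights[i])) for i in order[:top_k]]

for word, weight in top_words:
    print(word, "\t", weight)
```

With these numbers the three highest-weighted words are "click", "guarantee" and "free"; words with negative weights would point towards non-spam.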



### 8: Test with own email

Out of curiosity, you can also test the classifier on an email of your own. The cell below uses `spamSample1.txt`; replace the file name to try a different email.

In [None]:
ownmail = open("spamSample1.txt","r").read()
own_ind = processEmail(ownmail)
x = emailFeatures(own_ind,vocabList_d)

p = spam_predictor.predict(x.reshape(1,-1))
print("Result is:",p)
if (p==0):
    print("This is NOT a SPAM")
elif(p==1):
    print("This is a SPAM")

<div style="text-align: right">This lab exercise uses elements from Andrew Ng's Machine Learning course.</div>