# Spam Filter Using Python

## What is a spam filter?
A spam filter is a program that detects unsolicited and unwanted email and prevents those messages from reaching a user's inbox.

## What steps will we take to build our spam filter?
* read the dataset
* clean the text
* extract some features
* train our classifiers

In [1]:
# load the data into a pandas DataFrame
# note that spam messages are stored in files named spmsg*.txt
import os
import pandas as pd

# extract the data, then put the messages from the bare/ folder in a dataset/ folder
# (the path is relative to the notebook's working directory)
RES_DIR = 'dataset'
# create a dict to hold the mails
mails = {'msg': [], 'label': []}
for mail in os.listdir(RES_DIR):
    # read the message as a list of lines and close the file afterwards
    with open(os.path.join(RES_DIR, mail)) as mail_file:
        message = mail_file.read().splitlines()
    # get the label from the file name
    # let's use -1 for spam and +1 for ham
    label = -1 if 'spmsg' in mail else 1
    # append both to the dict
    mails['msg'].append(message)
    mails['label'].append(label)

# let's build a big dataframe of the data
dataset = pd.DataFrame(mails)
# let's view it
print(dataset.label.value_counts())
dataset.head()

 1    2412
-1     481
Name: label, dtype: int64


| | label | msg |
|---|---|---|
| 0 | 1 | [Subject: re : 2 . 882 s - > np np, , > date :... |
| 1 | 1 | [Subject: s - > np + np, , the discussion of s... |
| 2 | 1 | [Subject: 2 . 882 s - > np np, , . . . for me ... |
| 3 | 1 | [Subject: gent conference, , " for the listser... |
| 4 | 1 | [Subject: query : causatives in korean, , coul... |


In [2]:
# let's take a closer look at a message body
dataset['msg'][0]

['Subject: re : 2 . 882 s - > np np',
 '',
 '> date : sun , 15 dec 91 02 : 25 : 02 est > from : michael < mmorse @ vm1 . yorku . ca > > subject : re : 2 . 864 queries > > wlodek zadrozny asks if there is " anything interesting " to be said > about the construction " s > np np " . . . second , > and very much related : might we consider the construction to be a form > of what has been discussed on this list of late as reduplication ? the > logical sense of " john mcnamara the name " is tautologous and thus , at > that level , indistinguishable from " well , well now , what have we here ? " . to say that \' john mcnamara the name \' is tautologous is to give support to those who say that a logic-based semantics is irrelevant to natural language . in what sense is it tautologous ? it supplies the value of an attribute followed by the attribute of which it is the value . if in fact the value of the name-attribute for the relevant entity were \' chaim shmendrik \' , \' john mcnamara the nam

### A quick look at the data shows that every message has the same layout
* the subject on the first line
* an empty line
* the message body

### Let's get rid of the first two lines, as they are not very important

In [3]:
dataset['msg'] = dataset['msg'].apply(lambda x : ' '.join(x[2:]))
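
As a small illustration of this step, here is the same slicing applied to a single made-up message (the text below is hypothetical, not taken from the dataset):

```python
# a hypothetical raw message with the same layout as the dataset files
raw_message = "Subject: weekly meeting\n\nthe meeting moved to friday\nplease confirm"

# splitlines() gives one list entry per line, as in the loading cell
lines = raw_message.splitlines()

# drop the subject line and the empty line, then join the body lines
body = ' '.join(lines[2:])
print(body)
```

Joining with a space turns the list of lines back into one string, which is the form the later cleaning and vectorizing steps expect.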

### Now let's see the data 

In [4]:
dataset['msg'][0]

'> date : sun , 15 dec 91 02 : 25 : 02 est > from : michael < mmorse @ vm1 . yorku . ca > > subject : re : 2 . 864 queries > > wlodek zadrozny asks if there is " anything interesting " to be said > about the construction " s > np np " . . . second , > and very much related : might we consider the construction to be a form > of what has been discussed on this list of late as reduplication ? the > logical sense of " john mcnamara the name " is tautologous and thus , at > that level , indistinguishable from " well , well now , what have we here ? " . to say that \' john mcnamara the name \' is tautologous is to give support to those who say that a logic-based semantics is irrelevant to natural language . in what sense is it tautologous ? it supplies the value of an attribute followed by the attribute of which it is the value . if in fact the value of the name-attribute for the relevant entity were \' chaim shmendrik \' , \' john mcnamara the name \' would be false . no tautology , this . 

### We see that there are a lot of non-alphabetic tokens (digits and punctuation), so let's get rid of them

In [5]:
def clean_str(text):
    """Keep only the whitespace-separated tokens that are purely alphabetic."""
    words = [word for word in text.split() if word.isalpha()]
    return ' '.join(words)

In [6]:
dataset['msg'] = dataset['msg'].apply(clean_str)

### Now let's see the data 

In [7]:
dataset['msg'][0]

'date sun dec est from michael mmorse yorku ca subject re queries wlodek zadrozny asks if there is anything interesting to be said about the construction s np np second and very much related might we consider the construction to be a form of what has been discussed on this list of late as reduplication the logical sense of john mcnamara the name is tautologous and thus at that level indistinguishable from well well now what have we here to say that john mcnamara the name is tautologous is to give support to those who say that a semantics is irrelevant to natural language in what sense is it tautologous it supplies the value of an attribute followed by the attribute of which it is the value if in fact the value of the for the relevant entity were chaim shmendrik john mcnamara the name would be false no tautology this and no reduplication either'
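
Note that `clean_str` drops every token that contains a non-letter, so hyphenated words such as `logic-based` disappear completely from the output above. If keeping their alphabetic parts matters, one possible variant extracts runs of letters with a regular expression instead. The name `clean_str_keep_parts` and the sample string are made up for this sketch:

```python
import re

def clean_str(text):
    # same behaviour as the function above: keep purely alphabetic tokens
    return ' '.join(word for word in text.split() if word.isalpha())

def clean_str_keep_parts(text):
    # hypothetical alternative: keep every run of ASCII letters,
    # so "logic-based" becomes "logic based" instead of being dropped
    return ' '.join(re.findall(r'[A-Za-z]+', text))

sample = "a logic-based semantics is irrelevant : 2 . 882"
print(clean_str(sample))
print(clean_str_keep_parts(sample))
```

The regular expression only matches ASCII letters, while `str.isalpha()` also accepts accented and other Unicode letters, so the two functions can differ on non-English text.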

## Feature extraction
### Now we use scikit-learn's `CountVectorizer` to represent our text

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')

### First we fit the count vectorizer on the data so that it learns the vocabulary, then we use it to transform the messages

In [9]:
vectorizer.fit(dataset['msg'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
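
To make the two steps concrete, here is a rough plain-Python sketch of what fitting and transforming amount to. It only illustrates the idea on a made-up corpus and is not how scikit-learn implements `CountVectorizer`, which also lowercases the text, removes stop words when asked to, and returns a sparse matrix:

```python
from collections import Counter

toy_corpus = ["free money free offer", "meeting notes attached"]

# "fit": give every distinct word a column index (alphabetical order)
words = sorted({word for text in toy_corpus for word in text.split()})
vocabulary = {word: index for index, word in enumerate(words)}

def transform(text):
    # "transform": count how often each known word occurs in the text
    counts = Counter(word for word in text.split() if word in vocabulary)
    return [counts[word] for word in words]

print(vocabulary)
print(transform("free money free offer"))
print(transform("unknown words are ignored"))
```

Words that were not seen during fitting have no column, so they contribute nothing to the vector.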

### take a look at the data before and after the transformation

In [10]:
message = dataset['msg'][0]
print("mail before:", message)
print("---------------")
features = vectorizer.transform([message])
print("mail after:", features)

mail before: date sun dec est from michael mmorse yorku ca subject re queries wlodek zadrozny asks if there is anything interesting to be said about the construction s np np second and very much related might we consider the construction to be a form of what has been discussed on this list of late as reduplication the logical sense of john mcnamara the name is tautologous and thus at that level indistinguishable from well well now what have we here to say that john mcnamara the name is tautologous is to give support to those who say that a semantics is irrelevant to natural language in what sense is it tautologous it supplies the value of an attribute followed by the attribute of which it is the value if in fact the value of the for the relevant entity were chaim shmendrik john mcnamara the name would be false no tautology this and no reduplication either
---------------
mail after:   (0, 3045)	1
  (0, 3382)	2
  (0, 6514)	1
  (0, 7521)	1
  (0, 9493)	1
  (0, 9583)	2
  (0, 11012)	1
  (0,

### In each `(row, column)` pair, the column number is the index of a word in the vocabulary and the value next to it is that word's count. Let's take a look at the vocabulary

In [11]:
vectorizer.vocabulary_

{'date': 11012,
 'sun': 45672,
 'dec': 11176,
 'est': 15174,
 'michael': 30320,
 'mmorse': 30879,
 'yorku': 52596,
 'ca': 6514,
 'subject': 45364,
 'queries': 38419,
 'wlodek': 51952,
 'zadrozny': 52732,
 'asks': 3045,
 'interesting': 23277,
 'said': 41304,
 'construction': 9583,
 'np': 33238,
 'second': 42187,
 'related': 39499,
 'consider': 9493,
 'form': 17205,
 'discussed': 12547,
 'list': 27811,
 'late': 26762,
 'reduplication': 39212,
 'logical': 28063,
 'sense': 42459,
 'john': 24515,
 'mcnamara': 29750,
 'tautologous': 46543,
 'level': 27289,
 'indistinguishable': 22612,
 'say': 41705,
 'support': 45775,
 'semantics': 42378,
 'irrelevant': 23745,
 'natural': 32131,
 'language': 26652,
 'supplies': 45771,
 'value': 50004,
 'attribute': 3382,
 'followed': 17070,
 'fact': 16041,
 'relevant': 39543,
 'entity': 14758,
 'chaim': 7521,
 'shmendrik': 42954,
 'false': 16134,
 'tautology': 46544,
 'discussion': 12551,
 'reminds': 39620,
 'years': 52496,
 'ago': 1014,
 'read': 38870,
 'so

## Now let's generate the feature dataset

In [12]:
features_set = vectorizer.transform(dataset['msg'])

In [13]:
features_set

<2893x53048 sparse matrix of type '<class 'numpy.int64'>'
	with 456775 stored elements in Compressed Sparse Row format>

### Here's our data, ready to be fed to a machine learning algorithm

In [14]:
pd.DataFrame(features_set.toarray()).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,53038,53039,53040,53041,53042,53043,53044,53045,53046,53047
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Our data is represented as a sparse matrix in which each row is a mail and each column holds the number of times a word occurs in the mail body
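
The shape and the number of stored elements reported for `features_set` show why a sparse format is used. The arithmetic below relies only on those reported numbers; the 8 bytes per value assume the `int64` dtype shown in the same output:

```python
n_rows, n_cols = 2893, 53048   # shape reported for features_set
stored = 456775                # stored (non-zero) elements reported for features_set

total_cells = n_rows * n_cols
density = stored / total_cells
dense_bytes = total_cells * 8  # an int64 value takes 8 bytes

print(f"non-zero cells: {density:.2%}")
print(f"dense array size: {dense_bytes / 1e9:.2f} GB")
```

Only about 0.3% of the cells are non-zero, so converting the whole matrix with `.toarray()` allocates more than a gigabyte that consists almost entirely of zeros.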

## Before we feed the data into a classifier, we need to split it into a training set and a test set, so that we can assess our models on mails they have not seen

In [15]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features_set.toarray(), dataset['label'].values, test_size=.25)

A sneak peek at the test labels

In [16]:
y_test[:10]

array([ 1, -1,  1,  1,  1,  1,  1,  1,  1,  1])
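
Because the split is random and the classes are imbalanced, the share of spam in the test set changes from run to run. A hedged variant, shown here on made-up labels, fixes `random_state` for reproducibility and passes `stratify` so that both sets keep the original class ratio. The toy arrays and the seed value are assumptions for this sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 16 ham (+1) and 4 spam (-1) mails with one dummy feature each
toy_x = np.arange(20).reshape(-1, 1)
toy_y = np.array([1] * 16 + [-1] * 4)

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=.25, stratify=toy_y, random_state=0)

# both sets keep the 4:1 ham/spam ratio of the full toy data
print("train spam:", (y_tr == -1).sum(), "of", len(y_tr))
print("test spam:", (y_te == -1).sum(), "of", len(y_te))
```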

# Let the models do their work!

### Now we hand the data to the classifiers so that they can learn the feature weights and tell spam from ham

In [17]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

We will train more than one model and compare how well each of them performs.

In [18]:
clfs = {
    'Naive_bayes': GaussianNB(),
    'SVM': SVC(),
    'Decision_tree': DecisionTreeClassifier(),
    'gradient_descent': SGDClassifier()
}
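
`GaussianNB` only accepts dense arrays, which is why the split calls `.toarray()` on the features. For word counts, `MultinomialNB` is a commonly used alternative that works directly on a sparse matrix. It is not one of the models compared in this notebook; the sketch below only illustrates it on a made-up corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical toy mails, not taken from the dataset
toy_msgs = ["free money now", "win free prize",
            "meeting notes attached", "lunch meeting tomorrow"]
toy_labels = [-1, -1, 1, 1]   # -1 for spam, +1 for ham

toy_vectorizer = CountVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_msgs)  # stays sparse

toy_clf = MultinomialNB()
toy_clf.fit(toy_features, toy_labels)

new_msgs = ["free prize money", "meeting tomorrow"]
predictions = toy_clf.predict(toy_vectorizer.transform(new_msgs))
print(predictions)
```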

In [20]:
for clf_name, clf in clfs.items():
    print("now training", clf_name, "classifier")
    clf.fit(x_train, y_train)
    y_predict = clf.predict(x_test)
    print(classification_report(y_test, y_predict))
    print()

now training Naive_bayes classifier
             precision    recall  f1-score   support

         -1       0.99      0.74      0.85       108
          1       0.96      1.00      0.98       616

avg / total       0.96      0.96      0.96       724


now training SVM classifier
             precision    recall  f1-score   support

         -1       1.00      0.13      0.23       108
          1       0.87      1.00      0.93       616

avg / total       0.89      0.87      0.82       724


now training Decision_tree classifier
             precision    recall  f1-score   support

         -1       0.88      0.85      0.86       108
          1       0.97      0.98      0.98       616

avg / total       0.96      0.96      0.96       724


now training gradient_descent classifier
             precision    recall  f1-score   support

         -1       0.95      0.96      0.96       108
          1       0.99      0.99      0.99       616

avg / total       0.99      0.99      0.99       724




### As you can see, Naive Bayes and the decision tree score close to each other on the test set, the SVM finds only 13% of the spam, and the gradient descent classifier has the best test scores overall!
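
To show how the columns of the reports are computed, the sketch below rebuilds the SVM row for the spam class. The counts (14 spam mails caught, 94 missed, no ham flagged as spam) are not printed by the notebook; they are inferred here because they reproduce the rounded values in the report:

```python
# inferred counts for the spam class (-1) of the SVM, see the note above
true_positives = 14    # spam predicted as spam
false_negatives = 94   # spam predicted as ham (14 + 94 = 108, the support)
false_positives = 0    # ham predicted as spam

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A precision of 1.00 combined with a recall of 0.13 means the model almost never labels a mail as spam, which a single averaged score can hide on a test set this imbalanced.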