# Spam Filter

The idea behind feature extraction:

First, preprocess each email:

- Lower-casing

- Stripping HTML

- Normalizing URLs

- Normalizing Email Addresses

- Normalizing Numbers

- Normalizing Dollars

- Word Stemming

- Removal of non-words
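
The preprocessing steps above can be sketched with a handful of regular expressions. This is only a minimal illustration, not the code that produced the dataset used below: the function name `preprocess_email` and the placeholder tokens (`httpaddr`, `emailaddr`, `number`, `dollar`) are assumptions chosen for the example, and word stemming is left out so that the sketch needs only the standard library (a stemmer such as NLTK's `PorterStemmer` could be applied to each token afterwards).

```python
import re

def preprocess_email(text):
    """Normalize raw email text into a list of lowercase word tokens.

    Hypothetical helper; placeholder tokens are illustrative choices.
    Word stemming is intentionally omitted from this sketch.
    """
    text = text.lower()                                  # lower-casing
    text = re.sub(r"<[^<>]+>", " ", text)                # stripping HTML tags
    text = re.sub(r"https?://\S*", "httpaddr", text)     # normalizing URLs
    text = re.sub(r"\S+@\S+", "emailaddr", text)         # normalizing email addresses
    text = re.sub(r"[0-9]+", "number", text)             # normalizing numbers
    text = re.sub(r"[$]+", "dollar ", text)              # normalizing dollar signs
    tokens = re.split(r"[^a-z]+", text)                  # removal of non-words
    return [t for t in tokens if t]                      # drop empty strings

sample = "<b>Visit</b> http://example.com NOW, pay $100 to bob@example.com!"
print(preprocess_email(sample))
```

The URL and email rules run before the number rule so that digits inside an address are not rewritten first.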

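Then count how often each word occurs across all the spam emails, keep the words that occur more than 100 times, and collect them into a vocabulary list.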
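
A minimal sketch of this counting step, assuming each email has already been turned into a list of tokens. The name `build_vocabulary` and the `min_count` parameter are hypothetical; the threshold of 100 comes from the description above, and sorting the result alphabetically is an assumption made so that word indices are reproducible.

```python
from collections import Counter

def build_vocabulary(tokenized_emails, min_count=100):
    """Return the sorted list of words occurring more than min_count times."""
    counts = Counter()
    for tokens in tokenized_emails:
        counts.update(tokens)          # accumulate word counts over all emails
    # strictly "more than" min_count occurrences, as described in the text
    return sorted(word for word, c in counts.items() if c > min_count)

# Tiny example with a low threshold so the effect is visible
emails = [["free", "money"], ["free", "offer"], ["free"]]
print(build_vocabulary(emails, min_count=2))
```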

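Replace each word with its index in the vocabulary list.

Extract features: each email is represented by an n-dimensional vector $x \in \mathbb{R}^n$ with $x_i \in \{0, 1\}$, where $x_i = 1$ if the i-th vocabulary word appears in the email and $x_i = 0$ otherwise.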
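
The mapping from tokens to a binary feature vector can be sketched as follows. The function name `email_to_features` is hypothetical, and a plain Python list is used to keep the example dependency-free; in practice the result would usually be stored as a NumPy array, like the `uint8` matrices loaded later in this notebook.

```python
def email_to_features(tokens, vocabulary):
    """Return a 0/1 list x where x[i] == 1 iff vocabulary[i] occurs in tokens."""
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    x = [0] * len(vocabulary)
    for token in tokens:
        i = word_to_index.get(token)   # None for words outside the vocabulary
        if i is not None:
            x[i] = 1                   # presence only; repeated words still give 1
    return x

vocabulary = ["free", "money", "offer"]
print(email_to_features(["free", "offer", "free", "hello"], vocabulary))
```

Words that are not in the vocabulary are simply ignored, so every email yields a vector of the same length n.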


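To save effort, this notebook skips these steps and works directly with features and data that have already been processed.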

In [1]:
from sklearn import svm
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
import scipy.io as sio

## load data

In [2]:
train_mat = sio.loadmat('./data/spamTrain.mat')
train_mat.keys()

dict_keys(['__header__', '__version__', '__globals__', 'X', 'y'])

In [3]:
X, y = train_mat.get('X'), train_mat.get('y').ravel()
X.shape, y.shape

((4000, 1899), (4000,))

In [4]:
test_mat = sio.loadmat('./data/spamTest.mat')
test_mat.keys()

dict_keys(['__header__', '__version__', '__globals__', 'Xtest', 'ytest'])

In [5]:
X_test, y_test = test_mat.get('Xtest'), test_mat.get('ytest').ravel()
X_test.shape, y_test.shape

((1000, 1899), (1000,))

## fit SVM model

In [6]:
svc = svm.SVC()
svc.fit(X, y)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [7]:
X_test

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

In [8]:
svc.score(X_test, y_test)

0.987

In [9]:
pred = svc.predict(X_test)
print(metrics.classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       692
           1       0.99      0.97      0.98       308

    accuracy                           0.99      1000
   macro avg       0.99      0.98      0.98      1000
weighted avg       0.99      0.99      0.99      1000
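
For readers unfamiliar with the columns of the report, the sketch below shows how precision, recall, and F1 for one class are defined in terms of true positives, false positives, and false negatives. The helper `precision_recall_f1` is hypothetical and written in plain Python for illustration; it is not how `metrics.classification_report` is implemented, and the labels in the example are made up rather than taken from the test set.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted spam, how much is spam
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual spam, how much was caught
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of the two
    return precision, recall, f1

# Made-up labels: 1 = spam, 0 = not spam
print(precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```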



## use linear logistic regression

In [10]:
logit = LogisticRegression()
logit.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [11]:
logit.score(X_test, y_test)

0.994

In [12]:
pred = logit.predict(X_test)
print(metrics.classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00       692
           1       0.99      0.99      0.99       308

    accuracy                           0.99      1000
   macro avg       0.99      0.99      0.99      1000
weighted avg       0.99      0.99      0.99      1000

