# Spam Classification

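This section builds a spam classifier using an SVM.<br/>
Each email must be converted into an n-dimensional feature vector; the classifier then decides whether a given email x is spam (y=1) or not spam (y=0).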

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity='all'

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
from sklearn import svm
import re #regular expression for e-mail processing

# NLTK's Porter stemmer, used below for English word stemming
import nltk, nltk.stem.porter

### Preprocessing Emails

In [3]:
# Take a look at an example from the dataset
with open("emailSample.txt","r") as f:
    email = f.read()
    print(email)

> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com




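As the sample shows, the email contains a URL, an email address (at the end), numbers, and dollar amounts. Many emails contain these kinds of elements, but the specific values differ from one email to the next. A common approach is therefore to normalize them, so that all URLs are treated the same, all numbers are treated the same, and so on.<br/>
For example, we replace every URL with the single string 'httpaddr' to indicate that a URL is present, without keeping the specific address. This often improves the performance of a spam classifier, because spammers tend to randomize URLs, so the chance of seeing any particular URL again in a new spam email is very small.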
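
To make the effect of URL normalization concrete, here is a minimal sketch in which two messages that differ only in their URLs become identical after normalization. The helper `normalize_urls` and both sample messages are made up for illustration and are not part of the exercise data.

```python
import re

def normalize_urls(text):
    """Replace every http/https URL with the placeholder token 'httpaddr'."""
    return re.sub(r'https?://\S*', 'httpaddr', text)

# Two hypothetical spam messages that differ only in their randomized URLs
spam_a = "claim your prize at http://win-4821.example.com/offer now"
spam_b = "claim your prize at https://x9z.example.net/p?id=77 now"

normalized_a = normalize_urls(spam_a)
normalized_b = normalize_urls(spam_b)

print(normalized_a)                  # claim your prize at httpaddr now
print(normalized_a == normalized_b)  # True
```

After normalization the classifier sees the same token in both messages, so what it learns about 'httpaddr' carries over to URLs it has never seen.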

We can apply the following preprocessing steps:
1. Lower-casing: convert the entire email to lower case
2. Stripping HTML: remove all HTML tags and keep only the content
3. Normalizing URLs: replace every URL with the string "httpaddr"
4. Normalizing Email Addresses: replace every email address with "emailaddr"
5. Normalizing Dollars: replace every dollar sign ($) with "dollar"
6. Normalizing Numbers: replace every number with "number"
7. Word Stemming: reduce every word to its stem. For example, "discount", "discounts", "discounted" and "discounting" are all replaced with "discount"
8. Removal of non-words: remove non-words and punctuation, and collapse all whitespace (tabs, newlines, spaces) into a single space

In [4]:
def processEmail(email):
    """Apply every preprocessing step except word stemming and removal of non-words."""
    email = email.lower()
    email = re.sub(r'<[^<>]+>', ' ', email)  # strip HTML tags (one or more characters between < and >)
    email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)  # match the URL up to the next whitespace character
    email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email)  # normalize email addresses
    email = re.sub(r'[\$]+', 'dollar', email)  # normalize dollar signs
    email = re.sub(r'[\d]+', 'number', email)  # normalize numbers
    return email

In [5]:
# Stem the words and remove non-word content
def email2TokenList(email):
    """Preprocess the email and return a clean list of stemmed words."""

    # Use the NLTK stemmer
    stemmer = nltk.stem.porter.PorterStemmer()

    email = processEmail(email)

    # Split the email into individual words; re.split() accepts several delimiters,
    # here any whitespace character plus the listed punctuation
    tokens = re.split(r'[\s@$/#.\-:&*+=\[\]?!(){},\'">_<;%]', email)

    # Go through each piece produced by the split
    tokenlist = []
    for token in tokens:
        # Remove any non-alphanumeric characters
        token = re.sub(r'[^a-zA-Z0-9]', '', token)
        # Skip empty strings
        if not token:
            continue
        # Use the Porter stemmer to extract the word stem
        tokenlist.append(stemmer.stem(token))
    return tokenlist

In [6]:
email

"> Anyone knows how much it costs to host a web portal ?\n>\nWell, it depends on how many visitors you're expecting.\nThis can be anywhere from less than 10 bucks a month to a couple of $100. \nYou should checkout http://www.rackspace.com/ or perhaps Amazon EC2 \nif youre running something big..\n\nTo unsubscribe yourself from this mailing list, send an email to:\ngroupname-unsubscribe@egroups.com\n\n"

In [7]:
refined_email = email2TokenList(email)
print(refined_email)

['anyon', 'know', 'how', 'much', 'it', 'cost', 'to', 'host', 'a', 'web', 'portal', 'well', 'it', 'depend', 'on', 'how', 'mani', 'visitor', 'you', 're', 'expect', 'thi', 'can', 'be', 'anywher', 'from', 'less', 'than', 'number', 'buck', 'a', 'month', 'to', 'a', 'coupl', 'of', 'dollarnumb', 'you', 'should', 'checkout', 'httpaddr', 'or', 'perhap', 'amazon', 'ecnumb', 'if', 'your', 'run', 'someth', 'big', 'to', 'unsubscrib', 'yourself', 'from', 'thi', 'mail', 'list', 'send', 'an', 'email', 'to', 'emailaddr']


### Vocabulary List and Extracting Features from Emails

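After preprocessing, each email is reduced to a list of words. The next step is to choose which words to use in the classifier and which to leave out.
<br/>
We are given a vocabulary list, vocab.txt. It was selected by choosing all words that occur at least 100 times in the spam corpus, resulting in a list of 1899 words.
<br/>
We need to find which words of the processed email appear in vocab.txt and return their indices in vocab.txt; these indices identify the words used as features.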
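
As a rough illustration of this lookup, the sketch below maps a token list onto a tiny made-up vocabulary. The seven-word `vocab`, the `tokens` list, and the dictionary-based lookup are assumptions chosen for the example; the real vocab.txt has 1899 entries.

```python
# Hypothetical mini-vocabulary; the real vocab.txt holds 1899 words
vocab = ['anyon', 'buck', 'cost', 'httpaddr', 'know', 'number', 'web']

# Hypothetical output of the preprocessing step for one email
tokens = ['anyon', 'know', 'how', 'much', 'it', 'cost',
          'number', 'buck', 'httpaddr', 'cost']

# Map each word to its position once, so every lookup is a dictionary access
word_to_index = {word: i for i, word in enumerate(vocab)}

# Keep the indices of vocabulary words present in the email; the set drops repeats
indices = sorted({word_to_index[t] for t in tokens if t in word_to_index})

print(indices)  # [0, 1, 2, 3, 4, 5]
```

Words that are not in the vocabulary ('how', 'much', 'it') are simply ignored, and a word that appears twice ('cost') contributes its index only once. The indices here are 0-based.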

In [8]:
def email2VocabIndices(email, vocab):
    """Return the vocabulary indices of the words that appear in the email"""
    token = email2TokenList(email)
    index = [i for i in range(len(vocab)) if vocab[i] in token ]
    return index

In [9]:
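def email2FeatureVector(email):
    """
    Convert the email into a word vector of length n, where n is the size of the vocabulary.
    Positions of words that appear in the email are set to 1; all others are 0.
    """
    df = pd.read_table('vocab.txt', names=['words'])
    vocab = df['words'].tolist()  # vocabulary as a plain list of words
    vector = np.zeros(len(vocab))  # initialize the vector with zeros
    vocab_indices = email2VocabIndices(email, vocab)  # indices of the vocabulary words found in the email
    # Set the positions of the words that are present to 1
    for i in vocab_indices:
        vector[i] = 1
    return vector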

In [10]:
vector = email2FeatureVector(email)
print('length of vector = {}\nnum of non-zero = {}'.format(len(vector), int(vector.sum())))

length of vector = 1899
num of non-zero = 45


In [11]:
vector

array([0., 0., 0., ..., 0., 0., 0.])
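
For comparison, the same kind of binary vector can be built without an explicit loop. This is only a sketch on a small made-up vocabulary, using `np.isin` as an alternative to the index-based approach above; the `vocab` and `tokens` values are hypothetical.

```python
import numpy as np

# Hypothetical mini-vocabulary and token list (the real vocabulary has 1899 words)
vocab = np.array(['anyon', 'buck', 'cost', 'httpaddr', 'know', 'number', 'web'])
tokens = ['anyon', 'know', 'cost', 'number', 'cost']

# True where a vocabulary word occurs among the tokens, then cast to 0/1 floats
vector = np.isin(vocab, tokens).astype(float)

print(vector)                  # [1. 0. 1. 0. 1. 1. 0.]
print(np.flatnonzero(vector))  # [0 2 4 5]
print(int(vector.sum()))       # 4
```

As in the 1899-dimensional case, most entries stay 0 and only the positions of words that occur in the email are 1.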

### Training SVM for Spam Classification

Load the training and test sets, which have already been preprocessed into feature vectors with their corresponding labels.

In [12]:
data1 = loadmat("spamTrain.mat")
X,y = data1["X"],data1["y"]
data2 = loadmat("spamTest.mat")
Xtest,ytest = data2["Xtest"],data2["ytest"]
X.shape,y.shape
Xtest.shape,ytest.shape

((4000, 1899), (4000, 1))

((1000, 1899), (1000, 1))

In [13]:
SVM = svm.SVC(C=0.1,kernel="linear")
model = SVM.fit(X,y.ravel())

In [14]:
# Report the accuracy on the training set and on the test set
model.score(X, y.ravel())
model.score(Xtest, ytest.ravel())

0.99825

0.989
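
For a classifier, `score` reports mean accuracy, that is, the fraction of examples whose predicted label matches the true label. The sketch below reproduces that computation on a handful of made-up predictions and labels (not the spam data), and shows why a column-shaped label array such as the `(4000, 1)` one above should be flattened before a manual comparison.

```python
import numpy as np

# Hypothetical predictions, and labels stored as a column vector like y in spamTrain.mat
predictions = np.array([1, 0, 0, 1, 1])
labels = np.array([[1], [0], [1], [1], [0]])

# Flatten the labels first: comparing shape (5,) with shape (5, 1)
# would broadcast to a (5, 5) matrix instead of an element-wise comparison
accuracy = np.mean(predictions == labels.ravel())

print(accuracy)                       # 0.6
print((predictions == labels).shape)  # (5, 5)
```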