## I. Email objects

### 1. Load email objects from file paths


In [1]:
import os
SPAM_PATH = os.path.join("datasets", "spam")
HAM_DIR = os.path.join(SPAM_PATH, "easy_ham")  # directory of ham (legitimate) emails, e.g. 'datasets\\spam\\easy_ham' on Windows
SPAM_DIR = os.path.join(SPAM_PATH, "spam")  # directory of spam emails, e.g. 'datasets\\spam\\spam' on Windows

In [2]:
ham_filenames = [name for name in sorted(os.listdir(HAM_DIR)) if len(name) > 20]  # sort the file names and skip any name of 20 characters or fewer
spam_filenames = [name for name in sorted(os.listdir(SPAM_DIR)) if len(name) > 20]  # sort the file names and skip any name of 20 characters or fewer

In [3]:
# Python's email module can parse these messages (it handles headers, encodings, and so on)
import email
import email.parser
import email.policy

# is_spam: whether the message is spam; filename: name of the file; spam_path: root directory of the dataset
def load_email(is_spam, filename, spam_path=SPAM_PATH):
    directory = "spam" if is_spam else "easy_ham"
    with open(os.path.join(spam_path, directory, filename), "rb") as f:  # build the file path and open the file in binary mode
        return email.parser.BytesParser(policy=email.policy.default).parse(f)  # parse the raw bytes and return an email object

In [4]:
# Load every ham and spam message so that we can look at one example of each:
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]

### 2. Print the content of an email

In [5]:
# Print the content of a ham (legitimate) email
print(ham_emails[1].get_content().strip())  # get_content() returns the message body

Martin A posted:
Tassos Papadopoulos, the Greek sculptor behind the plan, judged that the
 limestone of Mount Kerdylio, 70 miles east of Salonika and not far from the
 Mount Athos monastic community, was ideal for the patriotic sculpture. 
 
 As well as Alexander's granite features, 240 ft high and 170 ft wide, a
 museum, a restored amphitheatre and car park for admiring crowds are
planned
---------------------
So is this mountain limestone or granite?
If it's limestone, it'll weather pretty fast.

------------------------ Yahoo! Groups Sponsor ---------------------~-->
4 DVDs Free +s&p Join Now
http://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM
---------------------------------------------------------------------~->

To unsubscribe from this group, send an email to:
forteana-unsubscribe@egroups.com

 

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/


In [6]:
# Print the content of a spam email
print(spam_emails[6].get_content().strip())

Help wanted.  We are a 14 year old fortune 500 company, that is
growing at a tremendous rate.  We are looking for individuals who
want to work from home.

This is an opportunity to make an excellent income.  No experience
is required.  We will train you.

So if you are looking to be employed from home with a career that has
vast opportunities, then go:

http://www.basetel.com/wealthnow

We are looking for energetic and self motivated people.  If that is you
than click on the link and fill out the form, and one of our
employement specialist will contact you.

To be removed from our link simple go to:

http://www.basetel.com/remove.html


4139vOLW7-758DoDY1425FRhM1-764SMFc8513fCsLl40


### 3. Analyze the email structure

Some emails are actually multipart, with images and attachments (which can have their own attachments).

In [7]:
# Describe the structure of an email as a string:
def get_email_structure(email):  # takes an email object
    if isinstance(email, str):  # email is already a str (or an instance of a str subclass)
        return email
    payload = email.get_payload()  # get_payload() returns the parts that make up the message
    if isinstance(payload, list):  # a list payload means the message is multipart
        return "multipart({})".format(", ".join([
            get_email_structure(sub_email)  # recurse into each sub-part
            for sub_email in payload
        ]))
    else:
        return email.get_content_type()  # return the content type (a str)

In [8]:
help(isinstance)

Help on built-in function isinstance in module builtins:

isinstance(obj, class_or_tuple, /)
    Return whether an object is an instance of a class or of a subclass thereof.
    
    A tuple, as in ``isinstance(x, (A, B, ...))``, may be given as the target to
    check against. This is equivalent to ``isinstance(x, A) or isinstance(x, B)
    or ...`` etc.



In [9]:
get_email_structure(spam_emails[6])

'text/plain'
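
To see what the recursive branch does, the sketch below builds a small multipart message in memory instead of reading one from the dataset. The helper name `describe_structure` and the message contents are made up for illustration; only `EmailMessage`, `set_content`, `add_alternative`, `get_payload` and `get_content_type` come from the standard library.

```python
from email.message import EmailMessage

def describe_structure(msg):
    """Hypothetical stand-alone copy of the recursive structure walk."""
    payload = msg.get_payload()
    if isinstance(payload, list):  # multipart: describe every sub-part
        return "multipart({})".format(
            ", ".join(describe_structure(part) for part in payload))
    return msg.get_content_type()  # leaf part: report its content type

# Build a message with a plain-text body and an HTML alternative.
msg = EmailMessage()
msg.set_content("Hello in plain text.")
msg.add_alternative("<p>Hello in HTML.</p>", subtype="html")

structure = describe_structure(msg)
print(structure)
```

Adding the HTML alternative turns the message into a multipart container, so the list branch is taken once and each of the two parts is reported by its content type.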

### II. Counter objects

#### 1. Basic usage

In [10]:
from collections import Counter

a = [1,4,2,3,2,3,4,2]

b = Counter(a)  # count how many times each number occurs in the list
b

Counter({1: 1, 4: 2, 2: 3, 3: 2})

#### 2. most_common()

In [11]:
from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:  # iterate over all emails
        structure = get_email_structure(email)  # structure string of this email
        structures[structure] += 1  # increment the count for this structure
    return structures

In [12]:
structures_counter(ham_emails).most_common()

[('text/plain', 2408),
 ('multipart(text/plain, application/pgp-signature)', 66),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, application/x-java-applet)', 1)]

In [13]:
structures_counter(spam_emails).most_common()

[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]

The ham emails are mostly plain text, while a large share of the spam contains HTML.

In [14]:
type(structures_counter(spam_emails))

collections.Counter
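
The two `Counter` features used above, incrementing a key that does not exist yet and ranking with `most_common()`, can be tried on a small hand-made list. This is only an illustration; the structure strings below are invented and do not come from the dataset.

```python
from collections import Counter

# Invented sample of structure strings, standing in for real emails.
sample_structures = [
    "text/plain", "text/html", "text/plain",
    "multipart(text/plain)", "text/html", "text/plain",
]

structures = Counter()
for structure in sample_structures:
    structures[structure] += 1  # a missing key starts at 0, so no KeyError

top_two = structures.most_common(2)   # the two most frequent entries
missing = structures["image/png"]     # a key that was never counted

print(top_two)
print(missing)
```

Calling `most_common()` without an argument returns every entry sorted by count, which is what the cells above do.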

### 4. Inspect the email headers

In [15]:
# Print every header of the first spam email
for header, value in spam_emails[0].items():
    print(header,":",value)

Return-Path : <12a1mailbot1@web.de>
Delivered-To : zzzz@localhost.spamassassin.taint.org
Received : from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32	for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
Received : from mail.webnote.net [193.120.211.219]	by localhost with POP3 (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)
Received : from dd_it7 ([210.97.77.167])	by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623	for <zzzz@spamassassin.taint.org>; Thu, 22 Aug 2002 13:09:41 +0100
From : 12a1mailbot1@web.de
Received : from r-smtp.korea.com - 203.122.2.197 by dd_it7  with Microsoft SMTPSVC(5.5.1775.675.6);	 Sat, 24 Aug 2002 09:42:10 +0900
To : dcek1a1@netsgo.com
Subject : Life Insurance - Why Pay More?
Date : Wed, 21 Aug 2002 20:31:57 -1600
MIME-Version : 1.0
Message-ID : <0103c1042001882DD_IT7@dd_it7>
Content-Type : text/html; charset="iso-8859-1"
Content-Transfer-Encoding : qu

In [16]:
# The headers may hold plenty of useful information, such as the sender's address (12a1mailbot1@web.de looks suspicious).
# Look at the Subject header:
spam_emails[0]["Subject"]

'Life Insurance - Why Pay More?'
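
Header access can also be tried without the dataset. The sketch below parses a tiny hand-written message from bytes with the same parser and policy used in `load_email`; the addresses and subject are placeholders, not values from the corpus.

```python
from email import policy
from email.parser import BytesParser

# A minimal hand-written message: headers, a blank line, then the body.
raw = (b"From: sender@example.com\n"
       b"To: receiver@example.com\n"
       b"Subject: Test message\n"
       b"\n"
       b"Body text.\n")

msg = BytesParser(policy=policy.default).parsebytes(raw)

subject = msg["Subject"]       # dictionary-style access to a single header
headers = dict(msg.items())    # all (name, value) pairs, as in the loop above
body = msg.get_content()       # message body, decoded to str

print(subject)
print(sorted(headers))
```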

## III. Split into training and test sets

In [17]:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(ham_emails + spam_emails)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

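### Convert HTML to plain text (the code below uses regular expressions instead of the BeautifulSoup library)
1. Remove the `<head>` section
2. Replace each `<a>` tag with the word HYPERLINK
3. Remove all remaining HTML tags
4. Collapse consecutive newlines into a single one
5. Unescape HTML entities (such as `&nbsp;`)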

In [18]:
import re
from html import unescape

def html_to_plain_text(html):
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)

In [19]:
# Select the spam emails from the training set (y_train == 1)
# and keep only those whose structure is exactly "text/html"
html_spam_emails = [email for email in X_train[y_train==1]
                    if get_email_structure(email) == "text/html"]
sample_html_spam = html_spam_emails[7]  # take the eighth one
print(sample_html_spam.get_content().strip()[:1000], "...")  # print the first 1000 characters

<HTML><HEAD><TITLE></TITLE><META http-equiv="Content-Type" content="text/html; charset=windows-1252"><STYLE>A:link {TEX-DECORATION: none}A:active {TEXT-DECORATION: none}A:visited {TEXT-DECORATION: none}A:hover {COLOR: #0033ff; TEXT-DECORATION: underline}</STYLE><META content="MSHTML 6.00.2713.1100" name="GENERATOR"></HEAD>
<BODY text="#000000" vLink="#0033ff" link="#0033ff" bgColor="#CCCC99"><TABLE borderColor="#660000" cellSpacing="0" cellPadding="0" border="0" width="100%"><TR><TD bgColor="#CCCC99" valign="top" colspan="2" height="27">
<font size="6" face="Arial, Helvetica, sans-serif" color="#660000">
<b>OTC</b></font></TD></TR><TR><TD height="2" bgcolor="#6a694f">
<font size="5" face="Times New Roman, Times, serif" color="#FFFFFF">
<b>&nbsp;Newsletter</b></font></TD><TD height="2" bgcolor="#6a694f"><div align="right"><font color="#FFFFFF">
<b>Discover Tomorrow's Winners&nbsp;</b></font></div></TD></TR><TR><TD height="25" colspan="2" bgcolor="#CCCC99"><table width="100%" border="0" 

In [20]:
print(html_to_plain_text(sample_html_spam.get_content())[:1000], "...")  # converted to plain text


OTC
 Newsletter
Discover Tomorrow's Winners 
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Watch for analyst "Strong Buy Recommendations" and several advisory newsletters picking CBYI.  CBYI has filed to be traded on the OTCBB, share prices historically INCREASE when companies get listed on this larger trading exchange. CBYI is trading around 25 cents and should skyrocket to $2.66 - $3.25 a share in the near future.
Put CBYI on your watch list, acquire a position TODAY.
REASONS TO INVEST IN CBYI
A profitable company and is on track to beat ALL earnings estimates!
One of the FASTEST growing distributors in environmental & safety equipment instruments.
Excellent management team, several EXCLUSIVE contracts.  IMPRESSIVE client list including the U.S. Air Force, Anheuser-Busch, Chevron Refining and Mitsubishi Heavy Industries, GE-Energy & Environmental Research.
RAPIDLY GROWING INDUSTRY
Industry revenues exceed $900 million, estimates indicate that there could be as much as $25 billi
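
The effect of each substitution is easier to follow on a short input. The sketch below repeats the same regular expressions on a small hand-written HTML snippet; the snippet itself is invented for illustration.

```python
import re
from html import unescape

def html_to_plain_text(html):
    # 1. drop the whole <head> section
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    # 2. turn every opening <a ...> tag into the word HYPERLINK
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    # 3. strip all remaining tags
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    # 4. collapse runs of blank lines into a single newline
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    # 5. decode entities such as &amp;
    return unescape(text)

sample = ('<html><head><title>Ad</title></head>\n'
          '<body><p>Save &amp; win</p>\n\n'
          '<a href="http://example.com">click</a></body></html>')

plain = html_to_plain_text(sample)
print(plain)
```

Regular expressions are a quick approximation here; they are not a full HTML parser and can misbehave on unusual markup.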

### Email to Text

In [21]:
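# Write a function that takes an email and returns its content as plain text, whatever its format
def email_to_text(email):
    html = None
    # .walk() iterates over the message and all of its sub-parts
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception:  # fall back to the raw payload when decoding fails
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content  # prefer the first plain-text part
        else:
            html = content  # remember the HTML in case no plain-text part exists
    if html:
        return html_to_plain_text(html)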

In [22]:
# Print the first 100 characters of the email after conversion to text
print(email_to_text(sample_html_spam)[:100], "...")


OTC
 Newsletter
Discover Tomorrow's Winners 
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Wat ...
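
The order in which `walk()` visits the parts explains why a plain-text part wins over an HTML part in `email_to_text`. A minimal sketch, using an in-memory message with invented contents:

```python
from email.message import EmailMessage

# A message with both a plain-text body and an HTML alternative.
msg = EmailMessage()
msg.set_content("plain body")
msg.add_alternative("<b>html body</b>", subtype="html")

# walk() yields the container first, then each sub-part in order.
types = [part.get_content_type() for part in msg.walk()]

# Pick the first text/plain part, mirroring the early return in email_to_text.
first_plain = next(part.get_content() for part in msg.walk()
                   if part.get_content_type() == "text/plain")

print(types)
print(first_plain.strip())
```

The multipart container itself is skipped by the content-type check, so only the leaf parts are ever read.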


In [23]:
# Install the Natural Language Toolkit ([NLTK](http://www.nltk.org/))
# pip install nltk

# pip install urlextract

In [24]:
# URLs carry little meaning as words,
# so each URL will later be replaced with the placeholder word "URL"
from urlextract import URLExtract  # URLExtract finds the URLs in a text so that they can be replaced

try:
    import nltk

    stemmer = nltk.PorterStemmer()  # Porter stemmer: reduces a word to its stem
    for word in ("Computations", "Computation", "Computing", "Computed", "Compute", "Compulsive"):
        print(word, "=>", stemmer.stem(word))  # print the stem of each word
except ImportError:
    print("Error: stemming requires the NLTK module.")
    stemmer = None

Computations => comput
Computation => comput
Computing => comput
Computed => comput
Compute => comput
Compulsive => compuls


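## Count the words and store them as word: count pairs (key: value)

## Combined transformer: from emails to word counters

### All of the processing is combined into a single transformer that converts emails into word counters. Note that sentences are split into words with Python's `split()` method, which uses whitespace as the word boundary. Some written languages, such as Chinese and Japanese, generally do not put spaces between words. That does not matter in this exercise because the dataset is (mostly) English; Chinese text could be segmented with the jieba tokenizer instead.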

In [25]:
from sklearn.base import BaseEstimator, TransformerMixin

class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):  # inherits from BaseEstimator and TransformerMixin
    def __init__(self, strip_headers=True, lower_case=True, remove_punctuation=True,
                 replace_urls=True, replace_numbers=True,  stemming=True):
        self.strip_headers = strip_headers  # whether to strip the email headers (not used in transform() below)
        self.lower_case = lower_case  # whether to convert to lowercase
        self.remove_punctuation = remove_punctuation  # whether to remove punctuation
        self.replace_urls = replace_urls  # whether to replace URLs
        self.replace_numbers = replace_numbers  # whether to replace numbers
        self.stemming = stemming  # whether to stem the words
    def fit(self, X, y=None):  # nothing to learn, so fit() simply returns self
        return self
    def transform(self, X, y=None):  # convert each email into a Counter of words
        X_transformed = []
        for email in X:  # iterate over all emails
            text = email_to_text(email) or ""
            if self.lower_case:
                text = text.lower()  # convert to lowercase

            if self.replace_urls:
                extractor = URLExtract()  # create a URLExtract object
                urls = list(set(extractor.find_urls(text)))  # deduplicate the URLs found in the text
                urls.sort(key=lambda url: len(url), reverse=True)  # longest first, so a short URL never breaks a longer one that contains it
                for url in urls:  # replace each URL with " URL "
                    text = text.replace(url, " URL ")

            if self.replace_numbers:  # replace numbers with the placeholder NUMBER
                # note: with this pattern a decimal without an exponent (e.g. 3.14) is matched as two separate numbers
                text = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))?', 'NUMBER', text)

            if self.remove_punctuation:  # replace runs of non-word characters with a space
                text = re.sub(r'\W+', ' ', text, flags=re.M)

            word_counts = Counter(text.split())  # split on whitespace and count the words

            if self.stemming and stemmer is not None:  # stem only if requested and the stemmer created earlier is available
                stemmed_word_counts = Counter()  # counter keyed by word stem
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)  # reduce the word to its stem
                    stemmed_word_counts[stemmed_word] += count  # add this word's count to its stem
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed)
    
    

In [26]:
# Try the transformer on a few emails
X_few = X_train[:3]
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)
X_few_wordcounts

array([Counter({'chuck': 1, 'murcko': 1, 'wrote': 1, 'stuff': 1, 'yawn': 1, 'r': 1}),
       Counter({'the': 11, 'of': 9, 'and': 8, 'all': 3, 'christian': 3, 'to': 3, 'by': 3, 'jefferson': 2, 'i': 2, 'have': 2, 'superstit': 2, 'one': 2, 'on': 2, 'been': 2, 'ha': 2, 'half': 2, 'rogueri': 2, 'teach': 2, 'jesu': 2, 'some': 1, 'interest': 1, 'quot': 1, 'url': 1, 'thoma': 1, 'examin': 1, 'known': 1, 'word': 1, 'do': 1, 'not': 1, 'find': 1, 'in': 1, 'our': 1, 'particular': 1, 'redeem': 1, 'featur': 1, 'they': 1, 'are': 1, 'alik': 1, 'found': 1, 'fabl': 1, 'mytholog': 1, 'million': 1, 'innoc': 1, 'men': 1, 'women': 1, 'children': 1, 'sinc': 1, 'introduct': 1, 'burnt': 1, 'tortur': 1, 'fine': 1, 'imprison': 1, 'what': 1, 'effect': 1, 'thi': 1, 'coercion': 1, 'make': 1, 'world': 1, 'fool': 1, 'other': 1, 'hypocrit': 1, 'support': 1, 'error': 1, 'over': 1, 'earth': 1, 'six': 1, 'histor': 1, 'american': 1, 'john': 1, 'e': 1, 'remsburg': 1, 'letter': 1, 'william': 1, 'short': 1, 'again': 1, 'becom
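
The text-cleaning steps inside `transform()` can be followed on a single sentence. The sketch below is a simplified stand-in that uses only the standard library: it skips URL replacement and stemming, which need `urlextract` and `nltk`, and the function name `text_to_word_counts` is hypothetical.

```python
import re
from collections import Counter

def text_to_word_counts(text):
    """Simplified sketch of the cleaning steps (no URL handling, no stemming)."""
    text = text.lower()                                            # lowercase
    text = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))?', 'NUMBER', text)     # numbers -> NUMBER
    text = re.sub(r'\W+', ' ', text)                               # punctuation -> space
    return Counter(text.split())                                   # whitespace split + count

counts = text_to_word_counts("Win $1000 now! Win big, win FAST.")
print(counts)
```

In the full transformer the Porter stemmer runs afterwards, which is why the keys in the output above appear as lowercase stems such as `'superstit'`.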

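### Now that we have the word counts, we need to convert them to vectors. For this we build another transformer whose `fit()` method builds the vocabulary (an ordered list of the most common words) and whose `transform()` method uses the vocabulary to convert word counts to vectors, returned as a sparse matrix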

In [27]:
from scipy.sparse import csr_matrix

# X: array([Counter1, Counter2, ...], dtype=object)
# returns: a csr_matrix (sparse matrix)
# Column layout: 0 = words not in the vocabulary, 1 = most common word, 2 = second most common word, ..., n = n-th most common word
class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size = 1000):
        self.vocabulary_size = vocabulary_size  # number of words kept in the vocabulary
    def fit(self, X, y = None):
        total_count = Counter()  # counts accumulated over all emails
        for word_count in X:   # iterate over every Counter in X
            for word, count in word_count.items():  # each word and its count in this email
                total_count[word] += min(count, 10)  # cap each email's contribution at 10
        most_common = total_count.most_common()[:self.vocabulary_size]  # keep the vocabulary_size most common words
        self.most_common_ = most_common
        # word indices start at 1; index 0 is reserved for words that are not in the vocabulary
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self

    # build the sparse matrix from (data, (row, col)) triplets
    def transform(self, X, y = None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)  # row index = position of the email in X
                cols.append(self.vocabulary_.get(word, 0))  # column index of the word in the vocabulary; 0 means it is not in the vocabulary
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))  # +1 because column 0 holds the total count of out-of-vocabulary words

In [28]:

# from scipy.sparse import *

# sparse matrix triplets
# row =  [0,0,0,1,1,1,2,2,2]  # row indices
# col =  [0,1,2,0,1,2,0,1,2]  # column indices
# data = [1,0,1,0,1,1,1,1,0]  # value stored at each (row, column) position
# team = csr_matrix((data,(row,col)),shape=(3,3))
# print(team)
# print(team.todense())
# team.toarray()  # convert the sparse matrix to a dense array


In [29]:
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors.shape

(3, 11)

In [30]:
X_few_vectors.toarray()

array([[ 6,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [99, 11,  9,  8,  3,  1,  3,  1,  3,  2,  3],
       [67,  0,  1,  2,  3,  4,  1,  2,  0,  1,  0]], dtype=int32)

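### The 67 in the first column of the third row means that the third email contains 67 occurrences of words that are not in the vocabulary. The 0 next to it means that 'the' does not appear in this email, the following 1 means that 'of' appears once, and the 2 after that means that 'and' appears twice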

In [31]:
vocab_transformer.vocabulary_

{'the': 1,
 'of': 2,
 'and': 3,
 'to': 4,
 'url': 5,
 'all': 6,
 'in': 7,
 'christian': 8,
 'on': 9,
 'by': 10}
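
How the vocabulary maps word counts to columns can be checked by hand on two tiny documents. The sketch below uses plain Python lists instead of a SciPy sparse matrix, so it only illustrates the indexing scheme; the function names and sample counters are hypothetical.

```python
from collections import Counter

def build_vocabulary(word_counts, vocabulary_size):
    """Index the most common words from 1; index 0 is kept for unknown words."""
    total = Counter()
    for counts in word_counts:
        for word, count in counts.items():
            total[word] += min(count, 10)  # same cap as in fit()
    return {word: index + 1
            for index, (word, _) in enumerate(total.most_common(vocabulary_size))}

def to_dense_vector(counts, vocabulary):
    """Dense equivalent of one row of the sparse matrix."""
    vector = [0] * (len(vocabulary) + 1)
    for word, count in counts.items():
        vector[vocabulary.get(word, 0)] += count  # unknown words pile up in column 0
    return vector

docs = [Counter({"free": 3, "money": 2, "now": 1}),
        Counter({"free": 1, "meeting": 1})]

vocab = build_vocabulary(docs, vocabulary_size=2)
vectors = [to_dense_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

A sparse matrix stores only the non-zero entries, which matters once the vocabulary holds 1000 words and most emails use only a few of them.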

### We are now ready to train our first spam classifier! Let's transform the whole dataset:

#### 1. Transform the training set

In [32]:
from sklearn.pipeline import Pipeline

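# From emails to a sparse matrix: column 0 = words not in the vocabulary, 1 = most common word, 2 = second most common word, ..., n = n-th most common word
preprocess_pipeline = Pipeline([
    ("email_to_wordcount", EmailToWordCounterTransformer()),  # emails -> word counts
    ("wordcount_to_vector", WordCounterToVectorTransformer()),  # word counts -> sparse matrix
])

X_train_transformed = preprocess_pipeline.fit_transform(X_train)  # the result is a sparse matrix with one row per email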

#### 2. Evaluate with cross-validation

In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

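log_clf = LogisticRegression(solver="liblinear", random_state=42)  # logistic regression classifier

# Train on the sparse feature matrix (one row per email) and the matching 0/1 labels:
# m1 m2 m3 m4 m5 m6 ... mn
# 0  0  1  0  1  0  ... 1
# cross_val_score returns the score of each cross-validation fold
score = cross_val_score(log_clf, X_train_transformed, y_train, cv=3, verbose=3)
score.mean()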
score.mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s


[CV]  ................................................................
[CV] .................................... , score=0.981, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.984, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.991, total=   0.1s


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.1s finished


0.9854166666666666

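## The mean score is about 98.5%. You could try several models, select the best one, and fine-tune it with cross-validation. Precision and recall on the test set: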
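
The idea behind `cv=3` is that the training data is cut into three folds and each fold is held out once for scoring. The sketch below shows only that index bookkeeping in plain Python; the function `k_fold_indices` is hypothetical, and it uses simple contiguous folds, whereas scikit-learn stratifies the folds by class when scoring a classifier.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in exactly one test fold, so the three scores printed by `cross_val_score` are computed on disjoint held-out data.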



#### 3. Predict on the test set

In [34]:
from sklearn.metrics import precision_score, recall_score

# 1. Transform the test set
X_test_transformed = preprocess_pipeline.transform(X_test)

# 2. Train
log_clf = LogisticRegression(solver="liblinear", random_state=42)
log_clf.fit(X_train_transformed, y_train)

# 3. Predict
y_pred = log_clf.predict(X_test_transformed)

# 4. Evaluate precision and recall
print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))

Precision: 96.88%
Recall: 97.89%
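
Precision and recall are simple ratios over the confusion counts, which the sketch below computes by hand for a made-up set of labels (1 = spam, 0 = ham). The function name and the label lists are illustrative only and are not taken from the dataset.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For a spam filter, low precision means legitimate mail ends up flagged, while low recall means spam gets through.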


## Classification with SGD, KNN, decision tree, and random forest

### SGD

In [35]:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state = 44)
sgd_clf.fit(X_train_transformed, y_train)
y_sgd_pred = sgd_clf.predict(X_test_transformed)
print(f"Precision:{100*precision_score(y_test, y_sgd_pred) : .2f}%")
print(f"Recall:{100*recall_score(y_test, y_sgd_pred) : .2f}%")

Precision: 86.41%
Recall: 93.68%


### KNN

In [36]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_transformed, y_train)
y_knn_pred = knn_clf.predict(X_test_transformed)
print(f'Precision:{100*precision_score(y_test, y_knn_pred) : .2f}%')
print(f'Recall:{100*recall_score(y_test, y_knn_pred) : .2f}%')

Precision: 90.28%
Recall: 68.42%


### DT

In [37]:
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train_transformed, y_train)
y_dt_pred = dt_clf.predict(X_test_transformed)
print(f"Precision:{100*precision_score(y_test, y_dt_pred) : .2f}%")
print(f"Recall:{100*recall_score(y_test, y_dt_pred) : .2f}%")

Precision: 88.78%
Recall: 91.58%


### RF 

In [38]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=10, random_state=44)
rf_clf.fit(X_train_transformed, y_train)
y_rf_pred = rf_clf.predict(X_test_transformed)
print(f"Precision:{100*precision_score(y_test, y_rf_pred) : .2f}%")
print(f"Recall:{100*recall_score(y_test, y_rf_pred) : .2f}%")

Precision: 96.30%
Recall: 82.11%
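
One way to compare the five models with a single number is the F1 score, the harmonic mean of precision and recall. The sketch below applies it to the percentages printed in the cells above; F1 is not computed anywhere in the notebook itself, so treat this as an optional extra step.

```python
# Precision and recall (in percent) as printed by the cells above.
results = {
    "LogisticRegression": (96.88, 97.89),
    "SGD": (86.41, 93.68),
    "KNN": (90.28, 68.42),
    "DecisionTree": (88.78, 91.58),
    "RandomForest": (96.30, 82.11),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {name: f1(p, r) for name, (p, r) in results.items()}
best_model = max(f1_scores, key=f1_scores.get)

for name, value in sorted(f1_scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.2f}%")
```

The harmonic mean penalizes a large gap between the two metrics, so a model with high precision but low recall does not rank well.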


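## Summary
1. Load the data and get an overview of it
2. Extract the structure of each email
3. Analyze the structure types; a large share of the spam contains HTML
4. Clean the data: define a function that converts the HTML in an email object to plain text
5. Split the dataset into a training set and a test set
6. Process the data: split the email text into words, stem them with nltk, count the words in each email, and build a vocabulary over all emails
7. Use the vocabulary and the per-email word counts to turn the counts into a vector matrix
8. Wrap the data cleaning and data processing in two transformers
9. Use a pipeline to automate the data processing
10. Train a logistic regression (linear) classifier
11. Evaluate the model with cross-validation
12. Measure precision and recall on the test set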

In [39]:
import os
from urlextract import URLExtract  # URLExtract finds the URLs in a text so that they can be replaced

In [40]:
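def urlExtractor(text):
    extractor = URLExtract()  # create a URLExtract object
    urls = list(set(extractor.find_urls(text)))  # deduplicate the URLs found in the text
    urls.sort(key=lambda url: len(url), reverse=True)  # longest URLs first
    for url in urls:  # replace each URL with " URL "
        text = text.replace(url, " URL ")
        print(text)  # show the text after every replacement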

In [41]:
txt = "oooxxx   www.ai.com   OVA - WVVV   wwww.wi.com \
www.sss.com     aa.com \
"
extractor = URLExtract()  # create a URLExtract object
urls = list(set(extractor.find_urls(txt)))
urls.sort(key=lambda url: len(url), reverse=True)
extractor.find_urls(txt)

['www.ai.com', 'wwww.wi.com', 'www.sss.com', 'aa.com']

In [42]:
urlExtractor(txt)

oooxxx   www.ai.com   OVA - WVVV   wwww.wi.com  URL      aa.com 
oooxxx   www.ai.com   OVA - WVVV    URL   URL      aa.com 
oooxxx    URL    OVA - WVVV    URL   URL      aa.com 
oooxxx    URL    OVA - WVVV    URL   URL       URL  


In [43]:
txt

'oooxxx   www.ai.com   OVA - WVVV   wwww.wi.com www.sss.com     aa.com '
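
Why the URLs are sorted longest first becomes clear when one URL is contained in another. The sketch below uses a hand-written URL list and plain `str.replace`, so it does not depend on `urlextract`; the function `replace_urls` and the sample text are hypothetical.

```python
def replace_urls(text, urls, longest_first=True):
    """Replace every URL in `urls` with ' URL ', in length order."""
    ordered = sorted(set(urls), key=len, reverse=longest_first)
    for url in ordered:
        text = text.replace(url, " URL ")
    return text

text = "visit www.aa.com/offer or aa.com today"
urls = ["aa.com", "www.aa.com/offer"]  # the short URL is part of the long one

good = replace_urls(text, urls)                       # longest first
bad = replace_urls(text, urls, longest_first=False)   # shortest first

print(good)
print(bad)
```

With the shortest URL replaced first, the longer URL is cut in two and its leftovers stay in the text as stray tokens.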