# 1 Preprocessing the data

In [1]:
data_path = r'datasets\spam_100.utf8'

In [2]:
import re
import jieba
from gensim import corpora



### load stop words from file

In [3]:
def load_stopwords(file):
    with open(file, encoding='utf-8') as f:
        stop_words = [x.strip('\n') for x in f.readlines()]
        stop_words = set(stop_words)
    return stop_words
stop_words = load_stopwords(r'stop_words.txt')

In [4]:
list(stop_words)[:20]

['',
 '老',
 '累次',
 '除非',
 '和',
 '不妨',
 '背地里',
 '哪样',
 '不',
 '才能',
 '乘势',
 '归根结底',
 '出去',
 '其中',
 '过于',
 '风雨无阻',
 '虽',
 '顺',
 '纵',
 '不独']

### loading data while deleting punctuation using re.sub()

In [5]:
# raw string so that \s, \. etc. reach the regex engine without invalid-escape warnings
pattern = re.compile(r"[\s+\.\!\/_,$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]+")

In [6]:
lines = []
with open(data_path, encoding='utf-8') as f:
    for line in f.readlines():
        lines.append(pattern.sub('', line))

In [7]:
lines[0]

'本公司有部分普通发票商品销售发票增值税发票及海关代征增值税专用缴款书及其它服务行业发票公路内河运输发票可以以低税率为贵公司代开本公司具有内外贸生意实力保证我司开具的票据的真实性希望可以合作共同发展敬侯您的来电洽谈咨询联系人：李先生联系电话：13632588281如有打扰望谅解祝商琪'
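As a small illustration of what the pattern does and does not remove, here is a sketch on a made-up input line (the `sample` string is hypothetical, not taken from the dataset). Characters that are not listed in either character class, such as the full-width colon `：`, are left in place, which is why it still appears in `lines[0]` above.

```python
import re

# Same character classes as the notebook's pattern, written as a raw string.
pattern = re.compile(r"[\s+\.\!\/_,$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]+")

# Hypothetical input line, for illustration only.
sample = "您好， 联系人：李先生！\n"

cleaned = pattern.sub('', sample)
print(cleaned)  # 您好联系人：李先生
```

If the colon should go as well, it would have to be added to the second character class; that is a possible tweak, not something the original notebook does.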

### tokenize the emails with `jieba` and filter stop words out

In [8]:
tokenized = [[x for x in jieba.cut(line, cut_all=True) if x not in stop_words] for line in lines]

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\hasee\AppData\Local\Temp\jieba.cache
Loading model cost 0.714 seconds.
Prefix dict has been built succesfully.


In [9]:
tokenized[0][:10]

['公司', '部分', '普通', '普通发票', '发票', '商品', '商品销售', '销售', '发票', '增值']

### add the tokens to our corpus with `gensim.corpora`

In [10]:
dictionary = corpora.Dictionary(tokenized)

We can use `merge_with` to merge in a dictionary built from another corpus.

In [11]:
help(dictionary.merge_with)

Help on method merge_with in module gensim.corpora.dictionary:

merge_with(other) method of gensim.corpora.dictionary.Dictionary instance
    Merge another dictionary into this dictionary, mapping same tokens to the same ids and new tokens to new ids.
    
    Notes
    -----
    The purpose is to merge two corpora created using two different dictionaries: `self` and `other`.
    `other` can be any id=>word mapping (a dict, a Dictionary object, ...).
    
    Get a transformation object which, when accessed as `result[doc_from_other_corpus]`, will convert documents
    from a corpus built using the `other` dictionary into a document using the new, merged dictionary.
    
    --------
    This method will change `self` dictionary.
    
    Parameters
    ----------
    other : :class:`~gensim.corpora.dictionary.Dictionary`
        Other dictionary.
    
    Return
    ------
    :class:`gensim.models.VocabTransform`
        Transformation object.
    
    Examples
    --------
    >>> f

### building a one-hot vector for each email

In [12]:
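# bows = [dictionary.doc2bow(email) for email in tokenized]
# 1 if the vocabulary token occurs in the email, else 0 (a binary presence vector)
def one_hot(tokens):
    present = set(tokens)  # set membership is O(1) per lookup
    return [1 if token in present else 0 for token in dictionary.values()]
one_hot(tokenized[0])[:20]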

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

### the corresponding section in my integrated code

<img src='pics\rd.png'>

Data preprocessing for the later steps is done by calling `prepare_onehot()`.  
This method generates a one-hot vector for each email, following the preceding experiment:

1. tokenize each email
2. add the tokens to the corpus
3. repeat the two steps above for both spam and ham
4. obtain a one-hot vector for each tokenized email via the corpus
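The four steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code in the screenshot: the emails are assumed to be already tokenized, a plain `dict` stands in for the gensim dictionary, and the names `build_vocab`, `to_onehot`, `prepare_onehot_sketch` and the labels 1 = spam, 0 = ham are all hypothetical.

```python
def build_vocab(tokenized_docs):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_onehot(doc, vocab):
    """Binary presence vector: 1 where the vocabulary token occurs in the doc."""
    vec = [0] * len(vocab)
    for token in doc:
        idx = vocab.get(token)
        if idx is not None:
            vec[idx] = 1
    return vec


def prepare_onehot_sketch(spam_docs, ham_docs):
    """Build one shared vocabulary, then vectorize spam and ham with it."""
    all_docs = spam_docs + ham_docs
    vocab = build_vocab(all_docs)
    X = [to_onehot(doc, vocab) for doc in all_docs]
    # Assumed label convention: 1 = spam, 0 = ham.
    y = [1] * len(spam_docs) + [0] * len(ham_docs)
    return X, y, vocab


# Toy, already-tokenized emails (hypothetical data).
spam = [["发票", "代开", "发票"]]
ham = [["会议", "发票"]]
X, y, vocab = prepare_onehot_sketch(spam, ham)
print(vocab)  # {'发票': 0, '代开': 1, '会议': 2}
print(X)      # [[1, 1, 0], [1, 0, 1]]
print(y)      # [1, 0]
```

Sharing one vocabulary across spam and ham is what makes every email's vector the same length, so the vectors can be fed to the classifier as rows of one matrix.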

<img src='pics\onehot.png'>

This method is exactly the same as in the preceding experiment.

<img src='pics\token.png'>

# 2 Implementing Bayes

### **Important** 
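I'm reusing code I previously wrote for the `Pattern Recognition` class.  
You can find it [here](https://github.com/HazekiahWon/-hw-pattern-recognition.git), in my other repo for the `Pattern Recognition` homework.  
The code below comes from the Bayes assignment I wrote for Pattern Recognition. I only explain what is necessary, because that code is not limited to naive Bayes; it also covers the multivariate Gaussian case.

I only inherit from the `NaiveBayes` class. It supports many options, such as cross-validation and PCA, but they are not used here.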
<img src='pics\init.png'>

### training
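The first core function is `train()`, which computes the priors: the class probabilities and the class-conditional probability densities of the attributes.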
<img src='pics\train.png'>

`_compute_class_prior()` computes the class probabilities.
<img src='pics\cp.png'>

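`_compute_class_pd_prior()` computes the class-conditional probability densities; put simply, it approximates the probabilities by counting frequencies.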
<img src='pics\pdp.png'>
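
Frequency-based estimation for binary features can be sketched as below. This is a minimal illustration, not the code in the screenshots: the function names are hypothetical, and the add-`alpha` smoothing is an assumption added here so that no probability is exactly 0 or 1 (the original may or may not smooth).

```python
from collections import Counter


def class_priors(y):
    """P(c): fraction of training samples with label c."""
    counts = Counter(y)
    n = len(y)
    return {c: counts[c] / n for c in counts}


def feature_probs(X, y, alpha=1.0):
    """P(x_j = 1 | c) estimated from counts.

    alpha is add-alpha smoothing (an assumption of this sketch).
    """
    n_features = len(X[0])
    class_counts = Counter(y)
    ones = {c: [0] * n_features for c in class_counts}
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            ones[label][j] += value
    return {
        c: [(ones[c][j] + alpha) / (class_counts[c] + 2 * alpha)
            for j in range(n_features)]
        for c in class_counts
    }


# Toy one-hot data (hypothetical): two spam rows (1), two ham rows (0).
X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 1]]
y = [1, 1, 0, 0]
priors = class_priors(y)
probs = feature_probs(X, y)
print(priors)    # {1: 0.5, 0: 0.5}
print(probs[1])  # [0.75, 0.5, 0.25]
print(probs[0])  # [0.5, 0.25, 0.75]
```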

### testing
The core of `test()` is these few lines; the rest is auxiliary code for validation and accuracy.
<img src='pics\test.png'>

`_predict()` computes the probability that a sample belongs to each class and then takes the maximum.
<img src='pics\pred.png'>

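Finally, the conditional probability for a given class is computed by `_compute_class_proba()`. Under the naive assumption this is a product over the attributes, but the code actually works with logs, so the product becomes a sum.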
<img src='pics\proba.png'>
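
The log-sum form and the final argmax can be sketched as follows, assuming binary features and already-estimated probabilities. The names `log_posterior` and `predict` and the numbers are hypothetical; the scores are unnormalized log posteriors, which is enough for picking the most likely class.

```python
import math


def log_posterior(x, prior, p_one):
    """log P(c) + sum_j log P(x_j | c) for a binary feature vector x.

    p_one[j] is P(x_j = 1 | c); for x_j = 0 the complement is used.
    """
    total = math.log(prior)
    for xj, p in zip(x, p_one):
        total += math.log(p if xj == 1 else 1.0 - p)
    return total


def predict(x, priors, probs):
    """Return the class with the largest log score, plus all scores."""
    scores = {c: log_posterior(x, priors[c], probs[c]) for c in priors}
    return max(scores, key=scores.get), scores


# Hypothetical parameters for two classes (1 = spam, 0 = ham).
priors = {1: 0.5, 0: 0.5}
probs = {1: [0.75, 0.5, 0.25], 0: [0.5, 0.25, 0.75]}

label, scores = predict([1, 1, 0], priors, probs)
print(label)  # 1
```

Summing logs matters with a large vocabulary: multiplying thousands of probabilities below 1 can underflow to 0.0 in floating point, while the sum of their logs stays finite.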

# 3 Results
Please check the output file `logger.txt` for details.  
In the validation phase, I got 99.5% accuracy.  
The results on the testing data also seem satisfactory.
<img src='pics\log.png'>