## How it works
### Text representation
n-gram + BoW (bag of words)
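
As a rough illustration of what an n-gram bag of words looks like, the sketch below builds unigram and bigram counts for two made-up sentences with scikit-learn's `CountVectorizer`. The sentences and variable names are assumptions for illustration only and are not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents, chosen only to keep the vocabulary tiny.
docs = ["good movie", "not good movie"]

# ngram_range=(1, 2) keeps both single words and adjacent word pairs.
toy_vec = CountVectorizer(ngram_range=(1, 2))
bow = toy_vec.fit_transform(docs)

# Features ordered by column index in the document-term matrix.
features = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(features)
print(bow.toarray())
```

Each row is one document and each column counts one unigram or bigram, so the bigram `not good` lets the model separate the second sentence from the first even though both contain `good`.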
### Classifier
NBSVM was proposed by Sida Wang and Chris Manning in their paper [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf). Because SVM and logistic regression perform very similarly in practice, this notebook uses logistic regression in place of the SVM.
If you're not familiar with Naive Bayes and bag-of-words matrices, I've made a preview available of one of fast.ai's upcoming *Practical Machine Learning* course videos, which introduces this topic. Here is a link to the section of the video which discusses this: [Naive Bayes video](https://youtu.be/37sFIak42Sc?t=3745).

In [1]:
import pandas as pd, numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')


## Understanding the data

Each sample in the toxic comment dataset contains the raw comment text, its id, and six different labels.

In [7]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


Note that quite a few samples do not belong to any of the six labels.

In [80]:
# train[train['toxic'].values==1]  # show the samples with toxic == 1
train[train['toxic'].values==1]['comment_text'][6]

'COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK'

The comments vary in length. The cell below computes the character length of each comment.

In [5]:
lens = train.comment_text.str.len()
print(type(lens))
# lens.mean(), lens.std(), lens.max()
lens.sum()

<class 'pandas.core.series.Series'>


62882658

In [12]:
import requests
# Pass the comment text itself as the q parameter; an expression written inside the URL string is sent as literal text, not evaluated.
text = train[train['toxic'].values==1]['comment_text'][6]
r = requests.get('https://translation.googleapis.com/language/translate/v2',
                 params={'q': text, 'target': 'zh-CN', 'cid': '0000', 'format': 'text',
                         'source': 'en', 'key': 'AIzaSyA84HI1FmVSxKrVWCIJKHsymhGW6_EtGAI'})
print(r.json()['data'])


In [9]:
lens.hist();  # plot a histogram of comment lengths



Add a `none` label, set to 1 for samples that belong to none of the six toxic categories, to make the data distribution easier to inspect.

In [10]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train['none'] = 1-train[label_cols].max(axis=1)  # the row max is 1 if any label is set, so none is 1 only when all six labels are 0
train.describe()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate,none
count,159571.0,159571.0,159571.0,159571.0,159571.0,159571.0,159571.0
mean,0.095844,0.009996,0.052948,0.002996,0.049364,0.008805,0.898321
std,0.294379,0.099477,0.223931,0.05465,0.216627,0.09342,0.302226
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0
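
The `none` construction can be checked by hand on a small frame. The sketch below uses three made-up label rows (hypothetical data, not taken from `train.csv`) and the same `1 - max` expression.

```python
import pandas as pd

label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Three hypothetical samples: clean, multi-label toxic, and threat only.
toy = pd.DataFrame([[0, 0, 0, 0, 0, 0],
                    [1, 0, 1, 0, 1, 0],
                    [0, 0, 0, 1, 0, 0]], columns=label_cols)

# The row-wise max is 1 as soon as any label is set, so 1 - max flags clean rows.
toy['none'] = 1 - toy[label_cols].max(axis=1)

# Column sums give the number of samples carrying each label.
counts = toy[label_cols + ['none']].sum()
print(toy['none'].tolist())
print(counts)
```

Because the labels are not mutually exclusive, the per-label counts can add up to more than the number of rows, which is why a separate `none` column is the simplest way to see how many comments are clean.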


In [11]:
len(train),len(test)  # number of samples in the training and test sets

(159571, 153164)

Fill missing values in the comment_text column with "unknown".

In [12]:
COMMENT = 'comment_text'
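# Assign the result back; calling fillna(inplace=True) on a selected column may not update the frame in recent pandas versions.
train[COMMENT] = train[COMMENT].fillna("unknown")
test[COMMENT] = test[COMMENT].fillna("unknown")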
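
A minimal sketch of the same fill on a made-up column, assuming missing comments appear as `NaN` or `None` after loading:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing comments.
toy = pd.DataFrame({'comment_text': ['hello', np.nan, None]})

# fillna returns a new Series; assigning it back updates the column.
toy['comment_text'] = toy['comment_text'].fillna('unknown')
print(toy['comment_text'].tolist())
```

Filling with a placeholder string matters because the vectorizer expects every entry to be text and would fail on a float `NaN`.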

## Defining the model

Represent the text as a bag of words over unigrams and bigrams.

In [63]:
import re, string
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')  # tokenizer designed for English text
def tokenize(s): return re_tok.sub(r' \1 ', s).split()
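
To see what this tokenizer does, the sketch below repeats the two definitions above and applies them to a made-up sentence (the sentence is an assumption for illustration). Every punctuation character is padded with spaces and then the string is split on whitespace.

```python
import re
import string

# Same pattern as in the cell above: one capture group matching any punctuation character.
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def tokenize(s):
    # Surround each punctuation mark with spaces, then split on whitespace.
    return re_tok.sub(r' \1 ', s).split()

tokens = tokenize("Don't stop, please!")
print(tokens)
```

Punctuation marks become tokens of their own, so features such as `!` or repeated `?` remain available to the model instead of being discarded.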

Compared with one-hot (binary) features, TF-IDF weights are more expressive.

Use unigrams and bigrams with the custom tokenizer.

In [17]:

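vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=True,
               smooth_idf=True, sublinear_tf=True )  # booleans rather than 1: recent scikit-learn validates parameter types
trn_term_doc = vec.fit_transform(train[COMMENT])  # fit the vocabulary and IDF weights on the training data only
test_term_doc = vec.transform(test[COMMENT])  # do not fit on the test data; reuse the fitted vocabulary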

In [49]:
trn_term_doc, test_term_doc

(<159571x426005 sparse matrix of type '<class 'numpy.float64'>'
 	with 17775104 stored elements in Compressed Sparse Row format>,
 <153164x426005 sparse matrix of type '<class 'numpy.float64'>'
 	with 14765755 stored elements in Compressed Sparse Row format>)
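
The fit/transform split explains why both matrices have the same number of columns. The sketch below shows this on a few made-up documents, using the default tokenizer for brevity (an assumption; the notebook uses its own `tokenize`).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and test documents.
train_docs = ["you are great", "you are awful", "great work"]
test_docs = ["awful unseen words", "great"]

toy_vec = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
trn = toy_vec.fit_transform(train_docs)   # learns the vocabulary and IDF weights
tst = toy_vec.transform(test_docs)        # reuses them; unseen n-grams are dropped

print(trn.shape, tst.shape)
print('unseen' in toy_vec.vocabulary_)
```

Words that occur only in the test set get no column, so they contribute nothing to the test features. This keeps the test matrix aligned with the coefficients learned on the training matrix.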

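Below is the Naive Bayes feature equation.

The scalar `y_i` (0 or 1) is broadcast and compared element-wise with `y`, giving a boolean array of the same length as `y`.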

In [68]:
def pr(y_i, y):
    p = x[y==y_i].sum(0)  # uses the global x = trn_term_doc; p.shape == (1, 426005)
    return (p+1) / ((y==y_i).sum()+1)  # (y==y_i).sum() is the number of elements of y equal to y_i; the division is element-wise
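
Written out, the function above appears to compute the following smoothed per-feature quantities. Here $x_{ij}$ is the TF-IDF weight of feature $j$ in document $i$ and $N_c$ is the number of training documents with label $c$; these symbols are notation introduced for this note, not names used in the notebook.

$$
p_j = \frac{1 + \sum_{i:\,y_i = 1} x_{ij}}{1 + N_1}, \qquad
q_j = \frac{1 + \sum_{i:\,y_i = 0} x_{ij}}{1 + N_0}, \qquad
r_j = \log \frac{p_j}{q_j}
$$

In the code, $p$ corresponds to `pr(1, y)`, $q$ to `pr(0, y)`, and $r$ to the log-ratio that `get_mdl` computes as `np.log(pr(1,y) / pr(0,y))`. This differs slightly from the Wang and Manning paper, which normalizes each smoothed count vector by its L1 norm rather than by the class size.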

In [57]:
x = trn_term_doc
test_x = test_term_doc

Define the classifier.

In [58]:
def get_mdl(y):
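    y = y.values  # convert the pandas Series to a numpy.ndarray
    r = np.log(pr(1,y) / pr(0,y))  # Naive Bayes log-count ratio, one value per feature
    m = LogisticRegression(C=4, dual=True, solver='liblinear')  # dual=True is only supported by the liblinear solver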
    x_nb = x.multiply(r)
    return m.fit(x_nb, y), r
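
The whole recipe can be traced on a toy problem. The sketch below is a self-contained approximation under stated assumptions: the 4x3 matrix, the labels, and the helper name `log_count_ratio` are all made up for illustration, and the helper takes `x` as an argument instead of reading a global.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Hypothetical term-document matrix: 4 documents x 3 features.
x_toy = csr_matrix(np.array([[1., 0., 1.],
                             [1., 1., 0.],
                             [0., 1., 1.],
                             [0., 0., 1.]]))
y_toy = np.array([1, 1, 0, 0])

def log_count_ratio(x, y):
    # Smoothed feature sums per class, divided by the smoothed class size.
    p = (x[y == 1].sum(0) + 1) / ((y == 1).sum() + 1)
    q = (x[y == 0].sum(0) + 1) / ((y == 0).sum() + 1)
    return np.asarray(np.log(p / q))  # shape (1, n_features)

r_toy = log_count_ratio(x_toy, y_toy)

# Scale every feature column by its ratio, then fit logistic regression on the result.
x_nb = x_toy.multiply(r_toy).tocsr()
clf = LogisticRegression(C=4, solver='liblinear').fit(x_nb, y_toy)
print(r_toy)
print(clf.predict_proba(x_nb)[:, 1])
```

Features that occur more often in positive documents get a positive ratio and those more common in negative documents get a negative one, so the scaling injects the Naive Bayes evidence into the inputs before the linear model is trained. The same `r` must be applied to the test matrix at prediction time.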

In [69]:
preds = np.zeros((len(test), len(label_cols)))

for i, j in enumerate(label_cols):
    print('fit', j)
    m,r = get_mdl(train[j])  # train one binary classifier for label j
    preds[:,i] = m.predict_proba(test_x.multiply(r))[:,1]  # predict with that classifier, scaling the test features by the same r

fit toxic
fit severe_toxic
fit obscene
fit threat
fit insult
fit identity_hate


Finally, write the predictions to a submission file.

In [54]:
subm = pd.read_csv('sample_submission.csv')
submid = pd.DataFrame({'id': subm["id"]})
submission = pd.concat([submid, pd.DataFrame(preds, columns = label_cols)], axis=1)
submission.to_csv('submission.csv', index=False)

In [55]:
su=pd.read_csv('submission.csv')
su.head()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.999988,0.106264,0.999987,0.002369,0.962578,0.094956
1,0000247867823ef7,0.002873,0.000604,0.001893,0.0001,0.002227,0.000342
2,00013b17ad220c46,0.011755,0.000864,0.005588,0.000102,0.00321,0.000297
3,00017563c3f7919a,0.00096,0.000224,0.001141,0.000171,0.001057,0.000297
4,00017695ad8997eb,0.009957,0.000485,0.002009,0.000131,0.002395,0.000351
