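# News Classification with a Naive Bayes Classifier
The 20newsgroups dataset is one of the standard international benchmark datasets for research on text classification, text mining, and information retrieval. It contains roughly 20,000 newsgroup documents, split approximately evenly across 20 newsgroups on different topics.
scikit-learn offers two ways to load this dataset:
The first is sklearn.datasets.fetch_20newsgroups, which returns the raw text documents, so that features can be extracted with a text feature extractor (such as sklearn.feature_extraction.text.CountVectorizer) using custom parameters;
The second is sklearn.datasets.fetch_20newsgroups_vectorized, which returns features that have already been extracted, so no feature extractor is needed.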

## Import packages

In [3]:
import sys
# reload(sys)
# sys.setdefaultencoding('utf-8')

import pandas as pd
import numpy as np
from sklearn import metrics

import matplotlib.pyplot as plt
%matplotlib inline

## Load the data

In [4]:
from sklearn.datasets import fetch_20newsgroups

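# With default arguments, fetch_20newsgroups loads only the training subset (subset='train')
twenty_news = fetch_20newsgroups()
y = twenty_news.target  # integer class labels
X = twenty_news.data    # list of raw text documents
#n_samples = len(twenty_news.data)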

In [5]:
X[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer: drop English stop words and use unigrams plus bigrams.
# use_idf, smooth_idf and sublinear_tf must be booleans; recent scikit-learn versions
# raise InvalidParameterError when they are given 1 instead of True.
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=True,
                      smooth_idf=True, sublinear_tf=True, stop_words='english')

# Extract the features; this can be a little slow
X = tfv.fit_transform(X)

In [5]:
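# Split the data into a training set and a test set
from sklearn.model_selection import train_test_split

# Randomly hold out 20% of the samples as the test set (stratified by class); the rest is used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2, stratify=y)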
X_train.shape

(9051, 155785)

In [6]:
X_test.shape

(2263, 155785)
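The two shapes add up to 9051 + 2263 = 11314 documents, which matches the size of the default training subset, and the test set is 20% of that. The stratify argument keeps the class proportions the same in both parts. A minimal sketch on made-up, deliberately imbalanced labels illustrates this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=33, stratify=y_demo
)

# Per-class counts in each part; the 80/20 class ratio is preserved in both
print(np.bincount(y_tr), np.bincount(y_te))
```

Without stratify, a small class such as talk.religion.misc could end up under-represented in the test set by chance.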

## Model training

In [7]:
# Multinomial naive Bayes
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train, y_train) 

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
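The alpha=1.0 in the output above is the additive (Laplace) smoothing parameter. As a sketch of what it does, the example below fits MultinomialNB on a small made-up count matrix and reproduces the per-class term probabilities by hand, using P(term | class) = (count + alpha) / (total + alpha * n_terms). The data and the _demo names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 4 documents x 3 terms, two classes
X_demo = np.array([[2, 1, 0],
                   [1, 0, 0],
                   [0, 1, 3],
                   [0, 0, 2]])
y_demo = np.array([0, 0, 1, 1])

alpha = 1.0
nb_demo = MultinomialNB(alpha=alpha).fit(X_demo, y_demo)

# Total count of each term within each class
class_counts = np.vstack([X_demo[y_demo == c].sum(axis=0) for c in (0, 1)])

# Smoothed estimate: no term gets probability zero, even if unseen in a class
n_terms = X_demo.shape[1]
smoothed = (class_counts + alpha) / (class_counts.sum(axis=1, keepdims=True) + alpha * n_terms)

print(smoothed)
print(np.exp(nb_demo.feature_log_prob_))  # same values as the manual estimate
print(nb_demo.predict([[0, 0, 1]]))       # a document containing only the third term
```

Smoothing matters for text because many of the 155,785 features never occur in a given class, and an unsmoothed zero probability would rule that class out entirely.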

## Testing

In [8]:
# Predict the class labels of the test set, then print per-class precision, recall and F1
y_test_pred = nb.predict(X_test)

print(metrics.classification_report(y_test, y_test_pred, target_names=twenty_news.target_names))

                             precision    recall  f1-score   support

             alt.atheism       0.95      0.91      0.93        96
           comp.graphics       0.83      0.84      0.83       117
 comp.os.ms-windows.misc       0.89      0.86      0.87       118
comp.sys.ibm.pc.hardware       0.71      0.85      0.78       118
   comp.sys.mac.hardware       0.92      0.81      0.86       115
          comp.windows.x       0.89      0.91      0.90       119
            misc.forsale       0.82      0.87      0.84       117
               rec.autos       0.90      0.93      0.91       119
         rec.motorcycles       0.96      0.96      0.96       120
      rec.sport.baseball       0.93      0.92      0.93       119
        rec.sport.hockey       0.91      0.97      0.94       120
               sci.crypt       0.96      0.97      0.96       119
         sci.electronics       0.90      0.77      0.83       118
                 sci.med       0.96      0.93      0.94       119
               sci.space       0.92      0.95      0.93       119
  soc.religion.christian       0.78      0.98      0.87       120
      talk.politics.guns       0.93      0.98      0.96       109
   talk.politics.mideast       0.94      0.96      0.95       113
      talk.politics.misc       0.99      0.86      0.92        93
      talk.religion.misc       1.00      0.49      0.66        75

             avg / total       0.90      0.89      0.89      2263
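The f1-score column is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it for three rows from the rounded values in the report. The helper name f1_from is made up for this illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the report above
rows = {
    "alt.atheism": (0.95, 0.91),
    "soc.religion.christian": (0.78, 0.98),
    "talk.religion.misc": (1.00, 0.49),
}

f1 = {name: round(f1_from(p, r), 2) for name, (p, r) in rows.items()}
for name, value in f1.items():
    print(f"{name:>24}  {value:.2f}")
```

The last two rows are worth reading together: talk.religion.misc has precision 1.00 but recall 0.49, so the model finds only about half of its 75 test documents, while soc.religion.christian has high recall but the second-lowest precision in the report. One plausible reading, which the report alone cannot confirm, is that many of the missed talk.religion.misc posts are being assigned to the related religion groups.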
