In [1]:
from sklearn.feature_extraction import DictVectorizer

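Feature extraction converts raw data, record by record, into feature vectors. When the symbolic features are already fairly structured
and stored as dictionaries, DictVectorizer can extract them directly: each string-valued field becomes one indicator column per distinct
value, and numeric fields are passed through unchanged.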

In [5]:
measurements = [{'city':'Dubai','temperature':33.}, {'city':'London','temperature':12.},
                {'city':'San Francisco','temperature':18.}]
vec = DictVectorizer()
print(vec.fit_transform(measurements).toarray()) # print the transformed feature matrix
print('Meaning of each dimension:', vec.get_feature_names_out()) # print what each feature dimension means

[[ 1.  0.  0. 33.]
 [ 0.  1.  0. 12.]
 [ 0.  0.  1. 18.]]
Meaning of each dimension: ['city=Dubai' 'city=London' 'city=San Francisco' 'temperature']
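
As a small follow-up sketch (not part of the original notebook), the cell below shows two behaviors of a fitted DictVectorizer that are easy to overlook: a category value that was not seen during fit has no column and is silently dropped by transform, and inverse_transform maps rows back to dictionaries keyed by feature name. The 'Paris' record is a made-up example value.

```python
from sklearn.feature_extraction import DictVectorizer

# Same kind of toy records as in the cell above
measurements = [{'city': 'Dubai', 'temperature': 33.},
                {'city': 'London', 'temperature': 12.},
                {'city': 'San Francisco', 'temperature': 18.}]

# sparse=False returns a dense NumPy array directly
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(measurements)

# 'Paris' (hypothetical) was never seen during fit, so no city column is set;
# only the numeric temperature survives
new_row = vec.transform([{'city': 'Paris', 'temperature': 20.}])
print(new_row)

# Map the feature matrix back to dictionaries of non-zero features
restored = vec.inverse_transform(X)
print(restored[0])
```

If unseen categories matter for the task, they need to be handled before vectorization, since the fitted vocabulary is fixed.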


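More often the data is raw text with no structured storage at all. In that case the features are usually represented with the
bag-of-words model, which looks at how often each word occurs. The two common tools are CountVectorizer and TfidfVectorizer:
the former uses term frequency only, while the latter also weights each term by its inverse document frequency, so a term that
appears in many documents counts for less. The more documents there are, the greater the advantage of the latter. In addition,
very common words, called stop words, are usually filtered out before feature extraction.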

In [6]:
# Naive Bayes on CountVectorizer features, without removing stop words
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [7]:
news = fetch_20newsgroups(subset = 'all')
X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size = 0.25, random_state = 33)
count_vec = CountVectorizer()
X_count_train = count_vec.fit_transform(X_train)
X_count_test = count_vec.transform(X_test)

In [8]:
mnb_count = MultinomialNB()
mnb_count.fit(X_count_train, y_train)
y_count_pred = mnb_count.predict(X_count_test)

In [9]:
print('The acc of text using NB(CV without filtering stopwords):', mnb_count.score(X_count_test, y_test))
print(classification_report(y_test, y_count_pred, target_names = news.target_names))

The acc of text using NB(CV without filtering stopwords): 0.8397707979626485
                          precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       24