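Text classification is a major component of modern machine learning applications and one of the foundations of NLP. The workflow is to first encode the text as numbers and then extract the needed information from the classification results. Deep learning dominates most of NLP today, but the Bayes classifier remains a gem of text classification.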
### Text encoding techniques
- Word count vectors
- TF-IDF

In [4]:
from sklearn.feature_extraction.text import CountVectorizer  # word count vectors

In [7]:
sample = ['Machine learning is fascinating, it is wonderful'
       , 'Machine learning is a sensational technology'
       , 'Elsa is a popular character']

vec = CountVectorizer()

In [9]:
X = vec.fit_transform(sample) # convert sample into a feature matrix
X

<3x11 sparse matrix of type '<class 'numpy.int64'>'
	with 15 stored elements in Compressed Sparse Row format>

In [10]:
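# Use get_feature_names() to retrieve the name of each column
# (scikit-learn 1.0 added get_feature_names_out(); get_feature_names() was removed in 1.2)
vec.get_feature_names() # sorted alphabetically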

['character',
 'elsa',
 'fascinating',
 'is',
 'it',
 'learning',
 'machine',
 'popular',
 'sensational',
 'technology',
 'wonderful']

In [11]:
import pandas as pd
# Convert the sparse matrix to a dense array: the DataFrame constructor expects an array, not a sparse matrix
CVresult = pd.DataFrame(X.toarray(), columns = vec.get_feature_names())

In [12]:
CVresult

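| | character | elsa | fascinating | is | it | learning | machine | popular | sensational | technology | wonderful |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |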
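To show what the count matrix encodes, here is a minimal pure-Python sketch that rebuilds it by hand. It assumes `CountVectorizer`'s default settings: lowercasing, and the token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more characters. The helper name `tokenize` is illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

sample = ['Machine learning is fascinating, it is wonderful',
          'Machine learning is a sensational technology',
          'Elsa is a popular character']

def tokenize(text):
    # Assumed default behaviour: lowercase, keep tokens with >= 2 word characters.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted alphabetically.
tokens_per_doc = [tokenize(doc) for doc in sample]
vocab = sorted({tok for toks in tokens_per_doc for tok in toks})

# One row per sentence, one column per vocabulary word.
matrix = []
for toks in tokens_per_doc:
    freq = Counter(toks)
    matrix.append([freq[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Under these assumptions the single-character word "a" never becomes a feature, which is why it does not appear among the 11 columns above.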


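### Problems with word count vectors
- How multinomial naive Bayes computes its probabilities
Ignoring smoothing and the per-class split, the sum of each column divided by the sum of the whole feature matrix gives the probability $\theta_i$ for that column.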

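Samples contain different numbers of words, so each sample contributes differently to the result; in this example the first sample influences the features the most. More words do not necessarily mean more useful information, though, and the risk of carrying irrelevant information also grows. This resembles the class imbalance problem.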

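Solution: a sample with a very long sentence has a very large influence on $\theta_i$, so complement naive Bayes divides each feature's weight by its L2 norm to avoid this.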

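- The matrix shows that `is` appears 4 times, yet `is` itself contributes little to the meaning. This motivates TF-IDF, which encodes a word by the proportion it occupies in a sentence rather than by its raw count.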
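To make both problems concrete, the following sketch sums the rows and columns of the count matrix shown earlier. The values are copied from that output, and the variable names are illustrative only.

```python
# Count matrix copied from the CountVectorizer output above
# (rows = sentences, columns = words in alphabetical order).
features = ['character', 'elsa', 'fascinating', 'is', 'it', 'learning',
            'machine', 'popular', 'sensational', 'technology', 'wonderful']
counts = [
    [0, 0, 1, 2, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
]

# Words per sample: longer sentences contribute more counts overall.
words_per_sample = [sum(row) for row in counts]
total_words = sum(words_per_sample)
sample_share = [n / total_words for n in words_per_sample]

# Total occurrences of each word across all samples (column sums).
word_totals = {w: sum(row[j] for row in counts) for j, w in enumerate(features)}

print(words_per_sample)   # [7, 5, 4]
print(sample_share)       # [0.4375, 0.3125, 0.25]
print(word_totals['is'])  # 4
```

The first sentence supplies 7 of the 16 counted words, and `is` alone accounts for 4 of them.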

### TF-IDF
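TF-IDF stands for term frequency-inverse document frequency. It weights a word by how frequently it appears in a document.
The IDF is inversely related to how common a word is across documents: the more common the word, the smaller its weight, which suppresses the influence of uninformative words on the meaning.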
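As a rough illustration of the weighting, the sketch below computes TF-IDF by hand from the count matrix. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is my understanding of the `TfidfVectorizer` defaults. The helper `tfidf_row` is hypothetical.

```python
import math

# Count matrix copied from the CountVectorizer output above.
features = ['character', 'elsa', 'fascinating', 'is', 'it', 'learning',
            'machine', 'popular', 'sensational', 'technology', 'wonderful']
counts = [
    [0, 0, 1, 2, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
]
n_docs = len(counts)

# Document frequency: number of sentences that contain each word.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(features))]

# Smoothed inverse document frequency (assumed default formula).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    # Weight raw counts by IDF, then scale the row to unit L2 length.
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

tfidf = [tfidf_row(row) for row in counts]

is_col = features.index('is')
print([round(row[is_col], 6) for row in tfidf])
```

Because `is` occurs in all three sentences, its IDF is the minimum value of 1, while a word found in only one sentence gets about 1.69. Under these assumptions the results agree with the `TfidfVectorizer` output shown later in this notebook.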

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
vec_ = TFIDF()


In [18]:
X_ = vec_.fit_transform(sample)
X_  # each word is a feature; each value is the word's TF-IDF weight within that sentence (rows are L2-normalized by default)

<3x11 sparse matrix of type '<class 'numpy.float64'>'
	with 15 stored elements in Compressed Sparse Row format>

In [17]:
vec_.get_feature_names()

['character',
 'elsa',
 'fascinating',
 'is',
 'it',
 'learning',
 'machine',
 'popular',
 'sensational',
 'technology',
 'wonderful']

In [19]:
TFresult = pd.DataFrame(X_.toarray(), columns=vec_.get_feature_names())
TFresult

| | character | elsa | fascinating | is | it | learning | machine | popular | sensational | technology | wonderful |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.424396 | 0.50131 | 0.424396 | 0.322764 | 0.322764 | 0.0 | 0.0 | 0.0 | 0.424396 |
| 1 | 0.0 | 0.0 | 0.0 | 0.315444 | 0.0 | 0.406192 | 0.406192 | 0.0 | 0.534093 | 0.534093 | 0.0 |
| 2 | 0.546454 | 0.546454 | 0.0 | 0.322745 | 0.0 | 0.0 | 0.0 | 0.546454 | 0.0 | 0.0 | 0.0 |


In [21]:
import numpy as np
CVresult.sum(axis=0)/np.sum(CVresult.sum(axis=0))  # column sums divided by the grand total

character      0.0625
elsa           0.0625
fascinating    0.0625
is             0.2500
it             0.0625
learning       0.1250
machine        0.1250
popular        0.0625
sensational    0.0625
technology     0.0625
wonderful      0.0625
dtype: float64

In [23]:
TFresult.sum(axis=0)/np.sum(TFresult.sum(axis=0))
# Words that originally occurred often are scaled down, which reduces their weight

character      0.083071
elsa           0.083071
fascinating    0.064516
is             0.173225
it             0.064516
learning       0.110815
machine        0.110815
popular        0.083071
sensational    0.081192
technology     0.081192
wonderful      0.064516
dtype: float64

## Exploring the text data


In [25]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [None]:
# The dataset is large

# The different news categories
data.target_names
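A natural next step when exploring the categories is to check how many documents each one holds. The helper below is a hypothetical sketch, not part of scikit-learn. It is demonstrated on toy labels so that it runs without the download, and the toy values are made up for illustration.

```python
from collections import Counter

def class_distribution(targets, target_names):
    """Return {category name: number of documents} for integer labels."""
    counts = Counter(targets)
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}

# Toy labels standing in for data.target / data.target_names.
toy_names = ['sci.space', 'rec.autos']
toy_targets = [0, 1, 1, 0, 1]
toy_dist = class_distribution(toy_targets, toy_names)
print(toy_dist)

# With the real dataset (requires the download shown above):
# dist = class_distribution(data.target.tolist(), data.target_names)
# print(len(data.data), dist)
```

The real call is left commented out because it depends on the `data` object fetched earlier.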