# Chapter 03 Feature Extraction and Preprocessing
***

- 3.1 Extracting features from categorical variables
- 3.2 Extracting features from text
- Data standardization

## 3.1 Extracting features from categorical variables
***

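Categorical variables are usually one-hot encoded. It is tempting to represent the categories with the integers 0, 1, 2, ..., but this is generally discouraged: integer codes inject artificial information into the data, namely that the categories are ordered (0 < 1 < 2 < ...).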

In [1]:
# Categorical variables are usually one-hot encoded
from sklearn.feature_extraction import DictVectorizer
onehot_encoder = DictVectorizer()
instances = [{'city':'New York'}, {'city':'San Francisco'}, {'city':'Chapel Hill'}]
onehot_encoder.fit_transform(instances).toarray()
# Note: the column order differs from the order of the cities in `instances`, because DictVectorizer sorts the feature names in ascending order

array([[ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.]])
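
To make the ordering argument concrete, here is a minimal sketch using only the standard library. The `ordinal` and `one_hot` dictionaries and the `distance` helper are illustrative names, not part of scikit-learn. It compares the Euclidean distances that each encoding implies between the same three cities.

```python
import math

cities = ['Chapel Hill', 'New York', 'San Francisco']

# Integer (ordinal) codes: one number per category
ordinal = {city: [float(i)] for i, city in enumerate(cities)}

# One-hot codes: one column per category
one_hot = {city: [1.0 if i == j else 0.0 for j in range(len(cities))]
           for i, city in enumerate(cities)}

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for name, codes in [('ordinal', ordinal), ('one-hot', one_hot)]:
    d_near = distance(codes['Chapel Hill'], codes['New York'])
    d_far = distance(codes['Chapel Hill'], codes['San Francisco'])
    print(name, round(d_near, 3), round(d_far, 3))
```

With integer codes, Chapel Hill ends up twice as far from San Francisco as from New York, a relationship that exists only because of the arbitrary numbering. With one-hot codes every pair of cities is equally far apart.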

## 3.2 Extracting features from text
***

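- Bag of words

The bag-of-words model is closely related to one-hot encoding: every word in the vocabulary gets its own feature column.

The bag-of-words model ignores the grammar of the text and the order in which words appear.

It rests on an intuitive assumption: texts that contain similar words have similar meanings. This intuition does not always hold, because two texts can contain exactly the same words in a different order and mean different things.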
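
The limitation described above can be seen in a small standard-library sketch. The `bag_of_words` helper and the example sentences are made up for illustration, and the whitespace tokenizer is far simpler than a real one.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each token."""
    return Counter(text.lower().split())

a = bag_of_words('Duke beat UNC')
b = bag_of_words('UNC beat Duke')
c = bag_of_words('I ate a sandwich')

print(a == b)  # same words in a different order
print(a == c)  # different words
```

The first two sentences say opposite things, yet they produce identical bags because word order is discarded.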





In [2]:
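# Bag of words

## Import the vectorizer class
from sklearn.feature_extraction.text import CountVectorizer

## Corpus
corpus = ['UNC played Duke in basketball',
          'Duke lost the basketball game']
## Notes:
## All the words in the corpus together form the vocabulary
## The corpus above contains 8 unique words

## Build the bag-of-words representation
vectorizer = CountVectorizer()
print('bag of words:')
print(vectorizer.fit_transform(corpus).todense())

## Vocabulary obtained by tokenizing the corpus
print('----------------------------------------')
print('Vocabulary')
print(vectorizer.vocabulary_)

## Explanation
## The vocabulary is indexed in alphabetical order, so column 0 is 'basketball'
## Each column of a bag-of-words row corresponds to one vocabulary index;
## the value is the number of times that word occurs in the document (here always 1 or 0)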

bag of words:
[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
----------------------------------------
Vocabulary
{'unc': 7, 'played': 5, 'duke': 1, 'in': 3, 'basketball': 0, 'lost': 4, 'the': 6, 'game': 2}


In [3]:
# Bag of words (2)
## Add a third document

## Import the vectorizer class
from sklearn.feature_extraction.text import CountVectorizer

## Corpus
corpus = ['UNC played Duke in basketball',
          'Duke lost the basketball game',
          'I ate a sandwich']

## Bag-of-words representation
vectorizer = CountVectorizer()
print('bag of words:')
print(vectorizer.fit_transform(corpus).todense())

## Vocabulary
print('-----------------------------------------')
print('Vocabulary')
print(vectorizer.vocabulary_)

## Compute the distances between the three documents
## Documents 1 and 2 are fairly similar; both differ more from document 3
from sklearn.metrics.pairwise import euclidean_distances
counts = [[0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
          [0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
          [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]]
## Compare the documents pairwise
## Euclidean distance: sqrt(sum((xi - yi)^2))
## euclidean_distances expects 2D input, so each row is wrapped in a list
print('Distance between documents 1 and 2:', euclidean_distances([counts[0]], [counts[1]]))
print('Distance between documents 1 and 3:', euclidean_distances([counts[0]], [counts[2]]))
print('Distance between documents 2 and 3:', euclidean_distances([counts[1]], [counts[2]]))

bag of words:
[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
-----------------------------------------
Vocabulary
{'unc': 9, 'played': 6, 'duke': 2, 'in': 4, 'basketball': 1, 'lost': 5, 'the': 8, 'game': 3, 'ate': 0, 'sandwich': 7}

Distance between documents 1 and 2: [[ 2.44948974]]
Distance between documents 1 and 3: [[ 2.64575131]]
Distance between documents 2 and 3: [[ 2.64575131]]
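
The distances printed above can be checked directly against the formula sqrt(sum((xi - yi)^2)). The following is a small sketch with a hand-written `euclidean` helper (an illustrative name, not a library function). Because every entry is 0 or 1, the squared distance is simply the number of positions where two rows differ.

```python
import math

# Rows copied from the bag-of-words matrix printed above
counts = [[0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
          [0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
          [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]]

def euclidean(x, y):
    """sqrt(sum((xi - yi)^2)) for two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

d12 = euclidean(counts[0], counts[1])  # rows differ in 6 positions
d13 = euclidean(counts[0], counts[2])  # rows differ in 7 positions
d23 = euclidean(counts[1], counts[2])  # rows differ in 7 positions
print(d12, d13, d23)
```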




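### Problems with bag of words

**1. For a large corpus, the bag of words becomes a very large sparse matrix, which hurts modeling:**

- 1.1. The more columns the matrix has, the more training samples are needed; with too few samples the model overfits easily.

- 1.2. The larger the matrix, the more memory and disk space it needs.

**2. Ways to shrink the matrix:**

- 2.1. Convert all text to lowercase. The bag-of-words model already ignores word order and grammar, so letter case carries little extra information, and lowercasing reduces the number of distinct words.

- 2.2. Remove stop words, which mainly fall into these groups:

    - determiners, such as the, a, an;
    - auxiliary verbs, such as do, will, be;
    - prepositions, such as on, around, beneath;
    - manually curated stop words.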
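
As a rough illustration of both reductions, the sketch below counts vocabulary sizes for a made-up two-sentence corpus. `STOP_WORDS` is a hypothetical mini list written for this example, not scikit-learn's built-in English list, and `vocabulary` is an illustrative helper.

```python
# Hypothetical mini stop-word list, for illustration only
STOP_WORDS = {'the', 'a', 'an', 'in', 'on', 'do', 'will', 'be'}

corpus = ['The game was played in the rain',
          'the Game will be played on Sunday']

def vocabulary(docs, lowercase=False, stop_words=frozenset()):
    """Collect the distinct whitespace-separated tokens of all documents."""
    vocab = set()
    for doc in docs:
        for token in doc.split():
            if lowercase:
                token = token.lower()
            if token.lower() in stop_words:
                continue
            vocab.add(token)
    return sorted(vocab)

raw = vocabulary(corpus)
lowered = vocabulary(corpus, lowercase=True)
filtered = vocabulary(corpus, lowercase=True, stop_words=STOP_WORDS)
print(len(raw), len(lowered), len(filtered))
```

Note that `CountVectorizer` already lowercases its input by default (`lowercase=True`), so only the stop-word filter has to be requested explicitly.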

In [4]:
# Stop words

## Import the vectorizer class
from sklearn.feature_extraction.text import CountVectorizer

## Corpus
corpus = ['UNC played Duke in basketball',
          'Duke lost the basketball game',
          'I ate a sandwich']

## Tokenize and build the bag of words, filtering English stop words
vectorizer = CountVectorizer(stop_words='english')

## Inspect the bag-of-words matrix and the vocabulary
print('bag of words:')
print(vectorizer.fit_transform(corpus).todense())

print('----------------------------------------')
print('Vocabulary')
print(vectorizer.vocabulary_)

### The vocabulary no longer contains the words 'the' and 'in'

## Compute the distances
## euclidean_distances expects 2D input, so each row is wrapped in a list
from sklearn.metrics.pairwise import euclidean_distances
counts = [[0, 1, 1, 0, 0, 1, 0, 1],
          [0, 1, 1, 1, 1, 0, 0, 0],
          [1, 0, 0, 0, 0, 0, 1, 0]]
print('T1 VS T2:', euclidean_distances([counts[0]], [counts[1]]))
print('T1 VS T3:', euclidean_distances([counts[0]], [counts[2]]))
print('T2 VS T3:', euclidean_distances([counts[1]], [counts[2]]))

bag of words:
[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
----------------------------------------
Vocabulary
{'unc': 7, 'played': 5, 'duke': 2, 'basketball': 1, 'lost': 4, 'game': 3, 'ate': 0, 'sandwich': 6}
T1 VS T2: [[ 2.]]
T1 VS T3: [[ 2.44948974]]
T2 VS T3: [[ 2.44948974]]




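### Stemming and lemmatization

An English word can have inflected and derived forms; for example, jumping and jumps both come from the base form jump.

In text analysis, reducing inflected and derived forms to their base form shrinks the matrix and makes modeling easier.

- stemming: strips affixes with heuristic rules; the resulting stem need not be a dictionary word
- lemmatization: uses a vocabulary and the word's part of speech to return its dictionary form (lemma)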


In [5]:
## Import the vectorizer class
from sklearn.feature_extraction.text import CountVectorizer

## Corpus
corpus = ['He ate the sandwiches',
          'Every sandwich was eaten by him']

## Tokenize and build a binary bag of words, filtering stop words
vectorizer = CountVectorizer(binary=True, stop_words='english')
print('bag of words:')
print(vectorizer.fit_transform(corpus).todense())
print('------------------------------')
print('Vocabulary:')
print(vectorizer.vocabulary_)

## Notes:
## The two sentences mean the same thing, yet their bag-of-words vectors share no entries.
## The cause is inflection (ate/eaten, sandwiches/sandwich), so we need
## lemmatization or stemming to remove the effect of word forms.

bag of words:
[[1 0 0 1]
 [0 1 1 0]]
------------------------------
Vocabulary:
{'ate': 0, 'sandwiches': 3, 'sandwich': 2, 'eaten': 1}


In [7]:
# Use nltk to stem and lemmatize the corpus

## 1. Lemmatization

## Corpus ('gathering' is a verb in the first sentence and a noun in the second)
corpus = ['I am gathering ingredients for the sandwich.',
          'There are many wizards at the gathering.']

from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print('lemmatize gathering as noun:', lemmatizer.lemmatize('gathering', 'n'))
print('lemmatize gathering as verb:', lemmatizer.lemmatize('gathering', 'v'))
## Note:
## lemmatization treats a word differently depending on its part of speech

## 2. Stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print('Stem:', stemmer.stem('gathering'))


lemmatize gathering as noun: gathering
lemmatize gathering as verb: gather
Stem: gather


In [13]:
# Stem and lemmatize the test corpus
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag

wordnet_tags = ['n', 'v']
corpus = ['He ate the sandwiches',
          'Every sandwich was eaten by him']
## Stemming
stemmer = PorterStemmer()
print('Stemmed:', [[stemmer.stem(token) for token in word_tokenize(document)] for document in corpus])
## Lemmatization: only nouns and verbs are lemmatized, keyed on the first letter of the POS tag
lemmatizer = WordNetLemmatizer()
def lemmatize(token, tag):
    if tag[0].lower() in wordnet_tags:
        return lemmatizer.lemmatize(token, tag[0].lower())
    return token
tagged_corpus = [pos_tag(word_tokenize(document)) for document in corpus]
print('Lemmatized:', [[lemmatize(token, tag) for token, tag in document] for document in tagged_corpus])

## Compare the bag-of-words representations
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True, stop_words='english')

### Original corpus
print('bag of words:')
print(vectorizer.fit_transform(corpus).todense())
print('----------------------------------------')
print('Vocabulary')
print(vectorizer.vocabulary_)
### Compute the distance (euclidean_distances expects 2D input, so each row is wrapped in a list)
from sklearn.metrics.pairwise import euclidean_distances
counts = [[1, 0, 0, 1],
          [0, 1, 1, 0]]
print('T1 VS T2:', euclidean_distances([counts[0]], [counts[1]]))

### Lemmatized corpus, written out by hand from the lemmatized tokens above
lemm_corpus = ['He eat the sandwich',
               'Every sandwich be eat by him']
print('bag of words:')
print(vectorizer.fit_transform(lemm_corpus).todense())
print('---------------------------------------------')
print('Vocabulary:')
print(vectorizer.vocabulary_)
counts = [[1, 1],
          [1, 1]]
print('T1 VS T2:', euclidean_distances([counts[0]], [counts[1]]))

# Open question: how can the lemmatized result be kept alongside the original text, stored, and used in later computations?

Stemmed: [['He', 'ate', 'the', 'sandwich'], ['everi', 'sandwich', 'wa', 'eaten', 'by', 'him']]
Lemmatized: [['He', 'eat', 'the', 'sandwich'], ['Every', 'sandwich', 'be', 'eat', 'by', 'him']]
bag of words:
[[1 0 0 1]
 [0 1 1 0]]
----------------------------------------
Vocabulary
{'ate': 0, 'sandwiches': 3, 'sandwich': 2, 'eaten': 1}
T1 VS T2: [[ 2.]]
bag of words:
[[1 1]
 [1 1]]
---------------------------------------------
Vocabulary:
{'eat': 0, 'sandwich': 1}
T1 VS T2: [[ 0.]]
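
The cell above writes `lemm_corpus` by hand. One possible way to keep each original text together with its normalized form is to store both in a single record per document, as in the sketch below. This is only an illustration: `LEMMAS` is a hypothetical lookup table standing in for a real lemmatizer, `STOP_WORDS` is a hypothetical mini list that covers just this corpus, and `normalize` and `euclidean` are made-up helper names.

```python
import math

# Hypothetical lemma lookup; a real pipeline would call a lemmatizer here
LEMMAS = {'ate': 'eat', 'eaten': 'eat', 'sandwiches': 'sandwich', 'was': 'be'}
# Hypothetical mini stop-word list covering only this corpus
STOP_WORDS = {'he', 'the', 'every', 'be', 'by', 'him'}

corpus = ['He ate the sandwiches',
          'Every sandwich was eaten by him']

def normalize(text):
    """Lowercase, lemmatize via the lookup table, and drop stop words."""
    lemmas = [LEMMAS.get(tok, tok) for tok in text.lower().split()]
    return [lemma for lemma in lemmas if lemma not in STOP_WORDS]

# One record per document keeps the original text next to its lemmas
records = [{'original': doc, 'lemmas': normalize(doc)} for doc in corpus]

# Build a binary bag-of-words vector for every record
vocab = sorted({lemma for rec in records for lemma in rec['lemmas']})
for rec in records:
    rec['vector'] = [int(word in rec['lemmas']) for word in vocab]

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

dist = euclidean(records[0]['vector'], records[1]['vector'])
print(vocab)
print(dist)
```

Within scikit-learn, `CountVectorizer` also accepts a callable through its `tokenizer` parameter, so a lemmatizing tokenizer can be plugged in there and the original strings passed in unchanged.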




### TF-IDF

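**1. TF: term frequency, the number of times a word occurs in a document. When comparing documents, the raw count is normalized to reduce the bias caused by differing document lengths. Two common normalizations are:**

- 1.1 TF = occurrences of the word / total number of words in the document;
- 1.2 TF = occurrences of the word / occurrences of the most frequent word in the document.

**2. IDF: inverse document frequency. It assumes that the more documents in the corpus a word appears in, the less it tells us about any single document, and the rarer the word, the more important it is. It is computed against a chosen corpus as:**

- IDF = log(total number of documents in the corpus / (number of documents containing the word + 1))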
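
The two definitions above can be combined into a short standard-library sketch. It assumes TF variant 1.1, the natural logarithm, and a pre-tokenized toy corpus; `tf`, `idf`, and `tf_idf` are illustrative helper names.

```python
import math
from collections import Counter

# Toy corpus, already tokenized and lowercased
corpus = [['duke', 'lost', 'the', 'basketball', 'game'],
          ['unc', 'played', 'duke', 'in', 'basketball'],
          ['i', 'ate', 'a', 'sandwich']]

def tf(term, doc):
    """Variant 1.1: occurrences of the term / total words in the document."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(total documents / (documents containing the term + 1))."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (df + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

rare = tf_idf('sandwich', corpus[2], corpus)   # appears in 1 of 3 documents
common = tf_idf('duke', corpus[0], corpus)     # appears in 2 of 3 documents
print(rare, common)
```

With this formula a word that occurs in every document gets a negative IDF, since log(N / (N + 1)) < 0. scikit-learn's `TfidfVectorizer` uses a differently smoothed formula, so its values will not match this sketch.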


## Data standardization
***

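1. What is data standardization?

- Rescaling each feature to have mean 0 and variance 1. This changes only the location and scale of the data, not the shape of its distribution, so the result is standard normal only if the data were normally distributed to begin with.

2. Why standardize data?

- The model may perform better.

- When a model has several features, standardization puts them on the same order of magnitude, so no feature dominates merely because of its units.

- Training tends to converge faster.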

In [4]:
from sklearn.preprocessing import scale
import numpy as np
X = np.array([0., 5., 1., 13., 9., 4.])
scale(X)

array([-1.18599891, -0.07412493, -0.96362411,  1.70487343,  0.81537425,
       -0.29649973])
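
The output above can be reproduced by hand as a check on the definition. This sketch uses only the standard library and divides by the population standard deviation (dividing by n), which is the convention `scale` follows.

```python
import statistics

X = [0., 5., 1., 13., 9., 4.]

mean = statistics.mean(X)
std = statistics.pstdev(X)  # population standard deviation (divide by n)

# z-score: subtract the mean, then divide by the standard deviation
scaled = [(x - mean) / std for x in X]
print([round(v, 8) for v in scaled])
```

After scaling, the values have mean 0 and standard deviation 1, but their relative spacing is unchanged.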