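## Stop words
English contains many words that carry little meaning or value on their own, such as pronouns, articles, and prepositions ("on", "the", "about", and so on).
These words have little effect on the content of a text and are called stop words.
A stop word list is a way to remove such uninformative words from the features.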
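
As a minimal sketch of stop word filtering: the stop word set below is a tiny hypothetical list made up for illustration (real lists are much longer), and `remove_stop_words` is an assumed helper name, not part of any library.

```python
# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "on", "about", "of", "is", "it", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "The food is always great here".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

Lowercasing before the lookup means "The" and "the" are treated as the same stop word.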

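## High-frequency words
Frequency statistics are a powerful filtering technique: by counting how often each word occurs, we can filter out common words, including stop words.

Frequency-based filtering can also be combined with a stop word list.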
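
One way this combination might look, as a sketch only: the toy documents, the threshold `max_doc_freq`, and the extra stop word set are all assumptions chosen for illustration.

```python
from collections import Counter

docs = [
    "the food is great and the service is great",
    "the room is loud and the patio is small",
    "the staff is friendly",
]

# Document frequency: in how many documents does each word occur?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

# Hypothetical threshold: a word present in every document is "too common".
max_doc_freq = len(docs)
frequent_words = {w for w, c in doc_freq.items() if c >= max_doc_freq}

# Combine the frequency-based set with a (hypothetical) hand-made stop word list.
stop_words = {"and", "of"}
to_remove = frequent_words | stop_words

filtered_docs = [[w for w in doc.split() if w not in to_remove] for doc in docs]
print(sorted(frequent_words))
print(filtered_docs)
```

The frequency pass picks up corpus-specific common words, while the stop word list covers words that are uninformative but not frequent enough to cross the threshold.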

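## Rare words
For some tasks, rare words need to be filtered out as well.

Some rare words are genuinely obscure, while others are misspellings. In general, a word that appears only a handful of times in a large corpus is a rare word.
Rare words are not reliable evidence for prediction, and they add computational overhead.

In the Yelp review dataset, 60% of the words appear only once or twice, which is a typical heavy-tailed distribution.
Such distributions are common in real-world corpora. The training time of many machine learning models grows quickly with the number of features (the dimensionality of the vocabulary),
so rare words incur substantial computation and storage costs for little gain.

With frequency-based statistics, rare words are fairly easy to handle: group them all into a single "garbage" category.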
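
A small sketch of the garbage-bin idea, assuming a cutoff of two occurrences and a placeholder token named `<GARBAGE>`; both are illustrative choices rather than anything prescribed by these notes.

```python
from collections import Counter

tokens = "great food great service grat fod great food".split()

MIN_COUNT = 2            # hypothetical cutoff: rarer words go to the garbage bin
GARBAGE = "<GARBAGE>"    # hypothetical placeholder token

counts = Counter(tokens)
binned = [t if counts[t] >= MIN_COUNT else GARBAGE for t in tokens]

print(len(set(tokens)), "distinct words before")
print(len(set(binned)), "distinct features after")
```

All rare words, including the misspellings "grat" and "fod", collapse into one feature, so the vocabulary shrinks while the total token count is preserved.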

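## Stemming
English linguistics has the notion of a word stem: for example, "swimmer", "swimming", and "swim" share the stem "swim".
Stemming maps words with the same stem to a single feature, which makes them easier to handle.
However, stemming has a computational cost, and its results are not always correct: "news" would be reduced to "new", although the two words mean different things.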
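
To make both the benefit and the failure mode concrete, here is a toy suffix stripper. It is an assumption-laden illustration, not the Porter stemmer or any real library's algorithm; `toy_stem` and its three-suffix rule set are invented for this example.

```python
def toy_stem(word):
    """Strip one of a few suffixes; a toy rule set, not a real stemming algorithm."""
    for suffix in ("ing", "er", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo the consonant doubling left behind by the suffix: swimm -> swim.
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


for w in ["swimmer", "swimming", "swim", "news", "new"]:
    print(w, "->", toy_stem(w))
```

The three "swim" forms collapse to one stem as intended, but "news" and "new" are merged too, which is exactly the kind of error mentioned above.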

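## Phrase detection through collocation extraction
n-grams give us a list of n-word sequences, but people tend to think in phrases rather than n-grams.
In NLP, the concept of a useful phrase is called a collocation.
For example, in "Emma knocked on the door", "knock door" is a collocation.
Finding collocations well matters: plain n-grams produce too many meaningless sequences such as "this is".
Methods for finding collocations include:
### Manual annotation
Specialized domains have a lot of their own terminology, so hand-labeling is a reasonable approach there, but it is labor-intensive and needs frequent updates.

### Frequency-based methods
Picking collocations purely by frequency yields almost nothing but meaningless pairs such as "of the" and "and i".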
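
A quick sketch of why raw frequency is a poor criterion. The two review-like sentences are made up for illustration; on them, the most frequent bigram is a pair of function words.

```python
from collections import Counter

# Made-up sentences standing in for review text.
reviews = [
    "the food of the day was the best of the week",
    "the view of the bay and the smell of the sea",
]

bigram_counts = Counter()
for review in reviews:
    tokens = review.split()
    # zip(tokens, tokens[1:]) pairs each word with its successor.
    bigram_counts.update(zip(tokens, tokens[1:]))

top_bigram, top_count = bigram_counts.most_common(1)[0]
print(top_bigram, top_count)
```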

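### Hypothesis testing
State hypothesis A (the null hypothesis): word 1 and word 2 are independent, that is, seeing word 1 has no bearing on whether we see word 2.
The alternative hypothesis B: word 1 and word 2 are related, that is, seeing word 1 changes the probability of seeing word 2.
A hypothesis test on the observed counts then decides between the two.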
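
These notes do not name a specific test. One common choice, assumed here, is a likelihood ratio test on binomial counts: compare how well the bigram counts are explained when P(word 2) is the same whether or not word 1 precedes it (hypothesis A) versus when the two probabilities differ (hypothesis B). The function names and the example counts below are hypothetical.

```python
import math


def log_binomial_likelihood(k, n, p):
    """log of p**k * (1 - p)**(n - k), with 0 * log(0) treated as 0."""
    result = 0.0
    if k > 0:
        result += k * math.log(p)
    if n - k > 0:
        result += (n - k) * math.log(1 - p)
    return result


def likelihood_ratio_score(c12, c1, c2, n):
    """Return -2 * log(lambda) for the bigram (w1, w2).

    c12: bigrams equal to (w1, w2); c1: bigrams starting with w1;
    c2: bigrams ending with w2; n: all bigrams. Assumes 0 < c1 < n.
    """
    p = c2 / n                   # P(w2) under A, regardless of w1
    p1 = c12 / c1                # P(w2 | w1) under B
    p2 = (c2 - c12) / (n - c1)   # P(w2 | not w1) under B
    log_a = (log_binomial_likelihood(c12, c1, p)
             + log_binomial_likelihood(c2 - c12, n - c1, p))
    log_b = (log_binomial_likelihood(c12, c1, p1)
             + log_binomial_likelihood(c2 - c12, n - c1, p2))
    return -2.0 * (log_a - log_b)


# Hypothetical counts: the pair co-occurs exactly as often as chance predicts...
independent = likelihood_ratio_score(c12=1, c1=100, c2=10, n=1000)
# ...versus a pair whose two words always appear together.
associated = likelihood_ratio_score(c12=10, c1=10, c2=10, n=1000)
print(independent, associated)
```

Larger scores are stronger evidence against hypothesis A, so ranking candidate bigrams by this score pushes genuinely associated pairs to the top.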

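### Text chunking and part-of-speech tagging
For example, we can find the words tagged as nouns (part-of-speech tagging), then inspect neighboring words and extract groups that match a part-of-speech pattern. These groups are called chunks.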

In [1]:
# Hands-on: part-of-speech tagging and text chunking

import pandas as pd
import json

In [2]:
dataset_root_path = 'feature_engineering/dataset/'

In [3]:
f = open(dataset_root_path + 'yelp_dataset/yelp_academic_dataset_review.json')

In [4]:
js = []

for i in range(10):
    js.append(json.loads(f.readline()))
f.close()

In [5]:
review_df = pd.DataFrame(js)

In [6]:
review_df

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,lWC-xP3rd6obsecCYsGZRg,ak0TdVmGKo4pwqdJSTLwWw,buF9druCkbuXLX526sGELQ,4.0,3,1,1,Apparently Prides Osteria had a rough summer a...,2014-10-11 03:34:02
1,8bFej1QE5LXp4O05qjGqXA,YoVfDbnISlW0f7abNQACIg,RA4V8pr014UyUbDvI-LW2A,4.0,1,0,0,This store is pretty good. Not as great as Wal...,2015-07-03 20:38:25
2,NDhkzczKjLshODbqDoNLSg,eC5evKn1TWDyHCyQAwguUw,_sS2LBIGNT5NQb6PD1Vtjw,5.0,0,0,0,I called WVM on the recommendation of a couple...,2013-05-28 20:38:06
3,T5fAqjjFooT4V0OeZyuk1w,SFQ1jcnGguO0LYWnbbftAA,0AzLzHfOJgL7ROwhdww2ew,2.0,1,1,1,I've stayed at many Marriott and Renaissance M...,2010-01-08 02:29:15
4,sjm_uUcQVxab_EeLCqsYLg,0kA0PAJ8QFMeveQWHFqz2A,8zehGz9jnxPqXtOc7KaJxA,4.0,0,0,0,The food is always great here. The service fro...,2011-07-28 18:05:01
5,J4a2TuhDasjn2k3wWtHZnQ,RNm_RWkcd02Li2mKPRe7Eg,xGXzsc-hzam-VArK6eTvtw,1.0,2,0,0,"This place used to be a cool, chill place. Now...",2018-01-21 04:41:03
6,28gGfkLs3igtjVy61lh77Q,Q8c91v7luItVB0cMFF_mRA,EXOsmAB1s71WePlQk0WZrA,2.0,0,0,0,"The setting is perfectly adequate, and the foo...",2006-04-16 02:58:44
7,9vqwvFCBG3FBiHGmOHMmiA,XGkAG92TQ3MQUKGX9sLUhw,DbXHNl890xSXNiyRczLWAg,5.0,0,0,0,Probably one of the better breakfast sandwiche...,2017-12-02 18:16:13
8,2l_TDrQ7p-5tANOyiOlkLQ,LWUnzwK0ILquLLZcHHE1Mw,mD-A9KOWADXvfrZfwDs-jw,4.0,1,0,0,I am definitely a fan of Sports Authority. Th...,2012-05-28 15:00:47
9,KKVFopqzcVfcubIBxmIjVA,99RsBrARhhx60UnAC4yDoA,EEHhKSxUvJkoPSzeGKkpVg,5.0,0,0,0,I work in the Pru and this is the most afforda...,2014-05-07 18:10:21


In [10]:
import spacy
from spacy.lang.en import English

In [11]:
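# English() is a blank, tokenizer-only pipeline, so pos_ and tag_ stay mostly empty below.
nlp = English()
# For real tags, load a trained pipeline instead (older spaCy accepted the "en" shortcut):
# nlp = spacy.load("en_core_web_sm")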

In [12]:
doc_df = review_df['text'].apply(nlp)
doc_df

0    (Apparently, Prides, Osteria, had, a, rough, s...
1    (This, store, is, pretty, good, ., Not, as, gr...
2    (I, called, WVM, on, the, recommendation, of, ...
3    (I, 've, stayed, at, many, Marriott, and, Rena...
4    (The, food, is, always, great, here, ., The, s...
5    (This, place, used, to, be, a, cool, ,, chill,...
6    (The, setting, is, perfectly, adequate, ,, and...
7    (Probably, one, of, the, better, breakfast, sa...
8    (I, am, definitely, a, fan, of, Sports, Author...
9    (I, work, in, the, Pru, and, this, is, the, mo...
Name: text, dtype: object

In [13]:
doc_df[0]

Apparently Prides Osteria had a rough summer as evidenced by the almost empty dining room at 6:30 on a Friday night. However new blood in the kitchen seems to have revitalized the food from other customers recent visits. Waitstaff was warm but unobtrusive. By 8 pm or so when we left the bar was full and the dining room was much more lively than it had been. Perhaps Beverly residents prefer a later seating. 

After reading the mixed reviews of late I was a little tentative over our choice but luckily there was nothing to worry about in the food department. We started with the fried dough, burrata and prosciutto which were all lovely. Then although they don't offer half portions of pasta we each ordered the entree size and split them. We chose the tagliatelle bolognese and a four cheese filled pasta in a creamy sauce with bacon, asparagus and grana frita. Both were very good. We split a secondi which was the special Berkshire pork secreto, which was described as a pork skirt steak with g

In [14]:
for token in doc_df[4]:
    print([token.text, token.pos_, token.tag_])

['The', '', '']
['food', '', '']
['is', '', '']
['always', '', '']
['great', '', '']
['here', '', '']
['.', '', '']
['The', '', '']
['service', '', '']
['from', '', '']
['both', '', '']
['the', '', '']
['manager', '', '']
['as', '', '']
['well', '', '']
['as', '', '']
['the', '', '']
['staff', '', '']
['is', '', '']
['super', '', '']
['.', '', '']
['Only', '', '']
['draw', '', '']
['back', '', '']
['of', '', '']
['this', '', '']
['restaurant', '', '']
['is', '', '']
['it', 'PRON', 'PRP']
["'s", '', '']
['super', '', '']
['loud', '', '']
['.', '', '']
['If', '', '']
['you', '', '']
['can', '', '']
[',', '', '']
['snag', '', '']
['a', '', '']
['patio', '', '']
['table', '', '']
['!', '', '']


In [15]:
from textblob import TextBlob

In [16]:
blob_df = review_df['text'].apply(TextBlob)

In [18]:
blob_df[4].tags

[('The', 'DT'),
 ('food', 'NN'),
 ('is', 'VBZ'),
 ('always', 'RB'),
 ('great', 'JJ'),
 ('here', 'RB'),
 ('The', 'DT'),
 ('service', 'NN'),
 ('from', 'IN'),
 ('both', 'CC'),
 ('the', 'DT'),
 ('manager', 'NN'),
 ('as', 'RB'),
 ('well', 'RB'),
 ('as', 'IN'),
 ('the', 'DT'),
 ('staff', 'NN'),
 ('is', 'VBZ'),
 ('super', 'JJ'),
 ('Only', 'RB'),
 ('draw', 'VBZ'),
 ('back', 'RB'),
 ('of', 'IN'),
 ('this', 'DT'),
 ('restaurant', 'NN'),
 ('is', 'VBZ'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('super', 'JJ'),
 ('loud', 'NN'),
 ('If', 'IN'),
 ('you', 'PRP'),
 ('can', 'MD'),
 ('snag', 'VB'),
 ('a', 'DT'),
 ('patio', 'NN'),
 ('table', 'NN')]
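
The cells above stop at tagging. As a rough sketch of the chunking step, the code below groups tagged tokens into simple noun chunks using the pattern "optional determiner, any adjectives, one or more nouns". The pattern, the helper names, and the hand-copied subset of the tags above are assumptions for illustration; this is not how TextBlob or spaCy implement chunking.

```python
import re


def tag_to_symbol(tag):
    """Map a Penn Treebank tag to a one-letter class for pattern matching."""
    if tag == "DT":
        return "D"
    if tag.startswith("JJ"):
        return "J"
    if tag.startswith("NN"):
        return "N"
    return "O"


def noun_chunks(tagged):
    """Return word groups whose tags match: optional DT, any JJ, one or more NN."""
    symbols = "".join(tag_to_symbol(tag) for _, tag in tagged)
    chunks = []
    for match in re.finditer(r"D?J*N+", symbols):
        # One symbol per token, so match offsets are also token indices.
        words = [word for word, _ in tagged[match.start():match.end()]]
        chunks.append(" ".join(words))
    return chunks


# A subset of the (word, tag) pairs shown above, copied by hand.
tagged = [
    ("The", "DT"), ("food", "NN"), ("is", "VBZ"), ("always", "RB"),
    ("great", "JJ"), ("here", "RB"), ("you", "PRP"), ("can", "MD"),
    ("snag", "VB"), ("a", "DT"), ("patio", "NN"), ("table", "NN"),
]
chunks = noun_chunks(tagged)
print(chunks)
```

Encoding each tag as a single character lets an ordinary regular expression act as the chunk grammar.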

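# Summary

## Bag-of-words
Bag-of-words is simple to understand and easy to compute, and it works well for classification and search.
But single words are too simple to capture some of the information in a text,
so we need to turn to longer sequences.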
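
For reference, a bag-of-words vector can be sketched in a few lines. The two toy documents and the alphabetical vocabulary order are assumptions for illustration.

```python
from collections import Counter

docs = ["the food is great", "the service is great great"]

# Fixed vocabulary: every distinct word in the corpus, in alphabetical order.
vocab = sorted({w for d in docs for w in d.split()})

# One count per vocabulary word; word order within a document is discarded.
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
print(vectors)
```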
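## Bag-of-n-grams
A bag-of-n-grams produces a large number of distinct n-grams, which increases storage cost and requires more computation.
For the same number of data points, it blows up the dimensionality of the feature space and makes the data sparse.
The larger n is, the higher the storage and computation costs and the sparser the data.
So a larger n is not always better; usually only bigrams and trigrams are used.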
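
The growth in distinct features is visible even on a single made-up sentence. In this sketch, `ngrams` is an illustrative helper, not a library function.

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ("the food is great and the service is great "
          "and the patio is great").split()

# Number of distinct n-grams for n = 1, 2, 3.
distinct = {n: len(set(ngrams(tokens, n))) for n in (1, 2, 3)}
print(distinct)
```

Even on 14 tokens the number of distinct features rises with n while each individual n-gram is seen less often, which is the sparsity problem described above.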
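## Filtering
To reduce the sparsity and cost of the bag, we can filter it and keep only the meaningful phrases.
That is the goal of collocation extraction.
Collocations can in principle be non-contiguous token sequences, but finding non-contiguous phrases is computationally expensive and yields little benefit,
so collocation extraction usually starts from a candidate list of bigrams and filters it with statistical methods.


All of these methods turn a text sequence into a set of counts, that is, a flattened feature vector.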