# English Text Processing with [spaCy](https://spacy.io/)


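[spaCy](https://spacy.io/) is an advanced natural language processing library written in Python and Cython. It builds on recent research and was designed from the start for use in real products. spaCy ships with pretrained statistical models and word vectors, and currently supports tokenization for more than 20 languages. According to the project, it offers the world's fastest syntactic parser, convolutional neural network models for tagging, parsing, and named entity recognition, and straightforward integration with deep learning.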

![](../img/L2_spaCy.png)

### 0. Installation and Setup

spaCy can be installed easily with pip. Note that the models spaCy relies on (for example, the English model) must be downloaded separately.

In [1]:
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple spaCy --user

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple


You are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [2]:
!python -m spacy download en

^C


### 1. English Tokenization

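> Text cannot be fed into a model as whole paragraphs. We usually split it into characters, words, or phrases that carry meaning on their own. This process is called tokenization, and it is usually the first step in solving a natural language processing problem. Tokenization is easy to do in spaCy as well.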
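
To build intuition for what a tokenizer does, here is a minimal rule-based sketch that uses only Python's `re` module. It is not spaCy's algorithm: spaCy applies language-specific rules and exceptions, while the `simple_tokenize` helper and its regular expression are illustrative assumptions.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks.

    Illustrative only: a real tokenizer such as spaCy's handles
    contractions, abbreviations, and other language-specific cases.
    """
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Hello World! My name is HanXiaoyang")
for token in tokens:
    print('"' + token + '"')
```

A rule this simple breaks on contractions such as "I'll", which is one reason to prefer a library tokenizer in practice.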

In [1]:
import spacy

In [2]:
nlp = spacy.load('en')
doc = nlp('Hello World! My name is HanXiaoyang')
for token in doc:
    print('"' + token.text + '"')

"Hello"
"World"
"!"
"My"
"name"
"is"
"HanXiaoyang"



Each token object has a rich set of attributes. Some of them can be read as follows.

In [7]:
doc = nlp("Next week I'll   be in Shanghai.")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
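        token.idx,     # start character offset
        token.lemma_,  # lemma (base form)
        token.is_punct,# is it punctuation?
        token.is_space,# is it whitespace?
        token.shape_,  # word shape
        token.pos_,    # coarse-grained part of speech
        token.tag_     # fine-grained tag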
    ))

Next	0	next	False	False	Xxxx	ADJ	JJ
week	5	week	False	False	xxxx	NOUN	NN
I	10	-PRON-	False	False	X	PRON	PRP
'll	11	will	False	False	'xx	VERB	MD
  	15	  	False	True	  	SPACE	_SP
be	17	be	False	False	xx	VERB	VB
in	20	in	False	False	xx	ADP	IN
Shanghai	23	shanghai	False	False	Xxxxx	PROPN	NNP
.	31	.	True	False	.	PUNCT	.
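
The word shape column can look cryptic, so here is a small pure-Python sketch of how such a shape string could be derived. The rules are an assumption chosen to reproduce the shapes printed above: uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, and runs of the same shape character are cut off after four. The `word_shape` helper is hypothetical and not part of spaCy's public API.

```python
def word_shape(text):
    """Approximate a spaCy-style word shape (assumed rules, see the text above)."""
    shape = []
    last = ""
    run = 0  # length of the current run of identical shape characters
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and spaces are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        if run <= 4:  # truncate long runs after four characters
            shape.append(shape_char)
    return "".join(shape)


for word in ["Next", "week", "I", "'ll", "Shanghai", "."]:
    print(word, word_shape(word), sep="\t")
```

Truncating long runs keeps the number of distinct shapes small, which makes the shape more useful as a feature.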


spaCy also supports sentence segmentation, as shown below.

In [8]:
# Sentence segmentation
doc = nlp("Hello World! My name is HanXiaoyang")
print(doc.sents)  
for sent in doc.sents:
    print(sent)

<generator object at 0x00000257030FC0D0>
Hello World!
My name is HanXiaoyang


### 2. Part-of-Speech Tagging

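> A part of speech (POS) is a basic grammatical property of a word, also known as a word class.

> Part-of-speech tagging (POS tagging) is the process of assigning the correct part of speech to each word in a tokenized text, that is, deciding whether each word is a noun, a verb, an adjective, or some other part of speech.

> POS tagging is a preprocessing step for many NLP tasks, such as syntactic parsing. POS-tagged text is much more convenient to work with, although tagging is not an indispensable step.
> The simplest approach to POS tagging is to pick each word's most frequent tag. Mainstream approaches are either rule-based or statistical, and include:
* Maximum-entropy-based POS tagging
* Choosing the tag with the highest statistical probability
* HMM-based POS tagging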
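
The simplest approach mentioned above, picking each word's most frequent tag, can be sketched in a few lines. The training pairs below are made up for illustration, and the helper names `train_most_frequent_tagger` and `tag_sentence` are hypothetical. This is a baseline, not how spaCy tags text.

```python
from collections import Counter, defaultdict

# Tiny hand-made training set of (word, tag) pairs; purely illustrative.
tagged_words = [
    ("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("big", "JJ"),
    ("I", "PRP"), ("book", "VB"), ("the", "DT"), ("flight", "NN"),
    ("the", "DT"), ("book", "NN"), ("fell", "VBD"),
]


def train_most_frequent_tagger(pairs):
    """Map each lowercased word to the tag it received most often."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_sentence(words, model, default_tag="NN"):
    """Tag each word with its most frequent tag; unseen words get default_tag."""
    return [(word, model.get(word.lower(), default_tag)) for word in words]


model = train_most_frequent_tagger(tagged_words)
tagged = tag_sentence(["I", "book", "the", "table"], model)
print(tagged)
```

Here "book" is tagged `NN` even though it is used as a verb, because the baseline ignores context. Statistical taggers such as HMMs address exactly this weakness.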

In [9]:
# POS tagging
doc = nlp("Next week I'll be in Shanghai.")
print([(token.text, token.tag_) for token in doc])

[('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('in', 'IN'), ('Shanghai', 'NNP'), ('.', '.')]


The POS tag codes and their meanings are listed in the table below:

| POS Tag | Description | Example |
| --- | --- | --- |
| CC | coordinating conjunction | and |
| CD | cardinal number | 1, third |
| DT | determiner | the |
| EX | existential there | there is |
| FW | foreign word | d'oeuvre |
| IN | preposition or subordinating conjunction | in, of, like |
| JJ | adjective | big |
| JJR | adjective, comparative | bigger |
| JJS | adjective, superlative | biggest |
| LS | list marker | 1) |
| MD | modal | could, will |
| NN | noun, singular or mass | door |
| NNS | noun plural | doors |
| NNP | proper noun, singular | John |
| NNPS | proper noun, plural | Vikings |
| PDT | predeterminer | both the boys |
| POS | possessive ending | friend's |
| PRP | personal pronoun | I, he, it |
| PRP\$ | possessive pronoun | my, his |
| RB | adverb | however, usually, naturally, here, good |
| RBR | adverb, comparative | better |
| RBS | adverb, superlative | best |
| RP | particle | give up |
| TO | to | to go, to him |
| UH | interjection | uhhuhhuhh |
| VB | verb, base form | take |
| VBD | verb, past tense | took |
| VBG | verb, gerund or present participle | taking |
| VBN | verb, past participle | taken |
| VBP | verb, non-3rd person sing. present | take |
| VBZ | verb, 3rd person sing. present | takes |
| WDT | wh-determiner | which |
| WP | wh-pronoun | who, what |
| WP\$ | possessive wh-pronoun | whose |
| WRB | wh-adverb | where, when |

### 3. Named Entity Recognition

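Named entity recognition (NER), also called proper name recognition, identifies entities with specific meaning in text, mainly person names, place names, organization names, and other proper nouns. It usually involves two parts: 1) detecting entity boundaries; 2) determining the entity type (person, place, organization, or other).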

In [11]:
doc = nlp("Next week I'll be in Shanghai.")
print(doc.ents)
for ent in doc.ents:
    print(ent.text, ent.label_)

(Next week, Shanghai)
Next week DATE
Shanghai GPE


In [12]:
dir(doc.ents)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'count',
 'index']

In [13]:
from nltk.chunk import conlltags2tree

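* BIO/IOB tagging is a sequence labeling scheme for the units of a sentence. It is used to extract meaningful phrases made of contiguous characters or words, such as noun phrases (NP) and named entities (NE). Each word in a sentence is labeled as B (Beginning, the start of a phrase), I (Inside, inside a phrase), or O (Outside, not part of any phrase). Taking NER as an example, the four words of "John supports Leicester City" can be labeled B-PER, O, B-ORG, I-ORG (PER for person, ORG for organization).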
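
To make the scheme concrete, the following sketch turns a BIO-tagged sentence back into labeled phrases. It is plain Python and independent of spaCy and NLTK. The `bio_to_spans` helper and the `PER`/`ORG` label names are assumptions for illustration.

```python
def bio_to_spans(tokens, tags):
    """Collect (label, phrase) pairs from parallel token and BIO-tag lists."""
    spans = []
    current_label = None
    current_tokens = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new phrase starts; close the previous one if there is one.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # Continue the phrase that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) ends any open phrase.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label = None
            current_tokens = []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


tokens = ["John", "supports", "Leicester", "City"]
tags = ["B-PER", "O", "B-ORG", "I-ORG"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

The `B-` prefix is what lets two adjacent entities of the same type be told apart; with only inside/outside labels they would merge into one phrase.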

In [16]:
doc = nlp("Next week I'll be in Shanghai.")   # run the spaCy pipeline
iob_tagged = [
    (
        token.text, 
        token.tag_, 
        "{0}-{1}".format(token.ent_iob_, token.ent_type_) if token.ent_iob_ != 'O' else token.ent_iob_
    ) for token in doc
]

ent_iob = [(token.ent_iob_, token.ent_type_) for token in doc]
print(ent_iob)

print(iob_tagged)

# Display in nltk.Tree format
print(conlltags2tree(iob_tagged))

[('B', 'DATE'), ('I', 'DATE'), ('O', ''), ('O', ''), ('O', ''), ('O', ''), ('B', 'GPE'), ('O', '')]
[('Next', 'JJ', 'B-DATE'), ('week', 'NN', 'I-DATE'), ('I', 'PRP', 'O'), ("'ll", 'MD', 'O'), ('be', 'VB', 'O'), ('in', 'IN', 'O'), ('Shanghai', 'NNP', 'B-GPE'), ('.', '.', 'O')]
(S
  (DATE Next/JJ week/NN)
  I/PRP
  'll/MD
  be/VB
  in/IN
  (GPE Shanghai/NNP)
  ./.)


spaCy recognizes a rich set of named entity types, as the following example shows:

In [17]:
doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print(ent.text, ent.label_)

2 CARDINAL
9 a.m. TIME
30% PERCENT
just 2 days DATE
WSJ ORG


In [18]:
dir(ent)

['_',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_recalculate_indices',
 '_vector',
 '_vector_norm',
 'as_doc',
 'doc',
 'end',
 'end_char',
 'ent_id',
 'ent_id_',
 'ents',
 'get_extension',
 'get_lca_matrix',
 'has_extension',
 'has_vector',
 'label',
 'label_',
 'lefts',
 'lemma_',
 'lower_',
 'merge',
 'n_lefts',
 'n_rights',
 'noun_chunks',
 'orth_',
 'remove_extension',
 'rights',
 'root',
 'sent',
 'sentiment',
 'set_extension',
 'similarity',
 'start',
 'start_char',
 'string',
 'subtree',
 'text',
 'text_with_ws',
 'to_array',
 'upper_',
 'vector',
 'vector_norm',
 'vocab']

The entities can also be displayed with a nice visualization:

In [19]:
from spacy import displacy 

In [None]:
doc = nlp('I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(doc, style='ent', jupyter=True)

In [21]:
displacy.render(doc, style='dep', jupyter=True)

### 4. Chunking

spaCy can detect noun phrases automatically and report their root words, such as "Journal", "piece", and "currencies" below.

In [23]:
doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
for chunk in doc.noun_chunks:
    print(chunk.text,'---', chunk.label_,'---',  chunk.root.text)

Wall Street Journal --- NP --- Journal
an interesting piece --- NP --- piece
crypto currencies --- NP --- currencies


In [26]:
print(list(doc.noun_chunks))

[Wall Street Journal, an interesting piece, crypto currencies]


### 5. Dependency Parsing

spaCy has powerful dependency parsing. Let's try parsing a sentence.

In [27]:
doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
 
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))  # token.head is the syntactic parent

Wall/NNP <--compound-- Street/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/JJ <--compound-- currencies/NNS
currencies/NNS <--pobj-- on/IN
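
Each printed line is an arc from a token to its head, so the whole output defines a tree. The sketch below rebuilds that tree in plain Python from the arcs printed above and walks it. It assumes every word occurs only once in the sentence, and the `subtree` helper is a hypothetical stand-in that only loosely mirrors what spaCy exposes as `token.subtree`.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the printed output.
arcs = [
    ("Wall", "compound", "Street"),
    ("Street", "compound", "Journal"),
    ("Journal", "nsubj", "published"),
    ("just", "advmod", "published"),
    ("published", "ROOT", "published"),
    ("an", "det", "piece"),
    ("interesting", "amod", "piece"),
    ("piece", "dobj", "published"),
    ("on", "prep", "piece"),
    ("crypto", "compound", "currencies"),
    ("currencies", "pobj", "on"),
]

order = [token for token, _, _ in arcs]  # original word order
children = defaultdict(list)
root = None
for token, dep, head in arcs:
    if dep == "ROOT":
        root = token  # the root token is listed as its own head
    else:
        children[head].append(token)


def subtree(token):
    """Return the token and all of its descendants in sentence order."""
    nodes = [token]
    for child in children[token]:
        nodes.extend(subtree(child))
    return sorted(nodes, key=order.index)


print("root:", root)
print("subtree of 'piece':", subtree("piece"))
```

The subtree of "piece" covers the full object phrase, including the prepositional phrase attached to it, which is how a dependency tree can be used to pull out constituents.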


In [28]:
from spacy import displacy
 
doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

### 6. Using Word Vectors

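A powerful representation learning method for text in NLP is word2vec. It learns dense vector representations of words from their contexts, and in this representation semantically related words lie close together in the vector space. Relations such as `v(grandfather) - v(grandmother) ≈ v(man) - v(woman)` also hold approximately.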
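
The vector arithmetic behind such relations can be illustrated with a toy example. The two-dimensional vectors below are made up (one axis loosely standing for "royalty", the other for "maleness"); real word vectors have hundreds of dimensions and only satisfy these relations approximately. The `cosine` helper is a plain-Python stand-in for a library routine.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Made-up 2-d vectors: first axis ~ "royalty", second axis ~ "maleness".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
}

# man - woman + queen should land near king.
query = [
    m - w + q
    for m, w, q in zip(toy_vectors["man"], toy_vectors["woman"], toy_vectors["queen"])
]

# Rank the toy vocabulary by similarity to the query vector.
ranked = sorted(
    toy_vectors,
    key=lambda word: cosine(query, toy_vectors[word]),
    reverse=True,
)
print(ranked[0])
```

With real vectors the words used in the query often rank near the top themselves, so they are commonly filtered out before reading off the answer.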

To use English word vectors, first download the pretrained model.

In [None]:
# !python -m spacy download en_core_web_lg

In [30]:
nlp = spacy.load('en_core_web_lg')
print(nlp.vocab['banana'].vector)

[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 -2.7421e-01  2.5564e-01  6.9793e-02 -2

In [31]:
from scipy import spatial

# Cosine similarity (scipy returns the cosine distance, so subtract it from 1)
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

# Word vectors for man, woman, queen, and king
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector
queen = nlp.vocab['queen'].vector
king = nlp.vocab['king'].vector

# Simple vector arithmetic: "man" - "woman" + "queen"
maybe_king = man - woman + queen
computed_similarities = []

# Scan every word vector in the vocabulary and compare it with the result
for word in nlp.vocab:
    if not word.has_vector:
        continue

    similarity = cosine_similarity(maybe_king, word.vector)
    computed_similarities.append((word, similarity))

# Sort by similarity (descending) and show the closest words
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])  
print([w[0].text for w in computed_similarities[:10]])

['Queen', 'QUEEN', 'queen', 'King', 'KING', 'king', 'KIng', 'Kings', 'KINGS', 'kings']


In [34]:
print([w[1] for w in computed_similarities[:10]])

[0.7754250764846802, 0.7754250764846802, 0.7754250764846802, 0.771614134311676, 0.771614134311676, 0.771614134311676, 0.771614134311676, 0.5984060764312744, 0.5984060764312744, 0.5984060764312744]


In [None]:
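# Notes on the cells above:
# spatial.distance.cosine(x, y) returns the cosine distance, i.e. 1 - cosine similarity,
# so 1 - spatial.distance.cosine(x, y) is the cosine similarity (1 means most similar).
# nlp.vocab[word].vector looks up the vector of a word.
# sorted() sorts in ascending order; negating item[1] in the key sorts by descending similarity.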

### 7. Word and Text Similarity

Building on word vectors, spaCy provides similarity methods that work from words up to documents. The examples below show how to use them.

In [35]:
# Word-level semantic similarity: .similarity
banana = nlp.vocab['banana']
dog = nlp.vocab['dog']
fruit = nlp.vocab['fruit']
animal = nlp.vocab['animal']

print(dog.similarity(animal), dog.similarity(fruit))
print(banana.similarity(fruit), banana.similarity(animal))

0.66185343 0.2355285
0.67148364 0.24272852


In [None]:
# Text-level semantic similarity
target = nlp("Cats are beautiful animals.")
 
doc1 = nlp("Dogs are awesome.")
doc2 = nlp("Some gorgeous creatures are felines.")
doc3 = nlp("Dolphins are swimming mammals.")
 
print(target.similarity(doc1))  # 0.8901765218466683
print(target.similarity(doc2))  # 0.9115828449161616
print(target.similarity(doc3))  # 0.7822956752876101
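
By default, spaCy's document similarity is a cosine similarity over averaged word vectors. The sketch below reproduces that idea in plain Python with made-up three-dimensional vectors. The vocabulary, the numbers, and the helpers `average_vector` and `cosine` are all illustrative assumptions, and every text is assumed to contain at least one known word.

```python
import math

# Made-up 3-d word vectors, for illustration only.
toy_vectors = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "are": [0.1, 0.1, 0.1],
    "animals": [0.7, 0.3, 0.0],
    "awesome": [0.1, 0.9, 0.2],
    "stocks": [0.0, 0.1, 0.9],
    "fell": [0.1, 0.0, 0.8],
}


def average_vector(words):
    """Average the vectors of the known words, dimension by dimension."""
    vectors = [toy_vectors[w] for w in words if w in toy_vectors]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


target_vec = average_vector(["cats", "are", "animals"])
related_vec = average_vector(["dogs", "are", "awesome"])
unrelated_vec = average_vector(["stocks", "fell"])

print(cosine(target_vec, related_vec))
print(cosine(target_vec, unrelated_vec))
```

Because averaging discards word order, two texts that share vocabulary can score high even when they say different things, which is worth keeping in mind when reading the scores above.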