# [Intro to NLP with spaCy](https://nicschrading.com/project/Intro-to-NLP-with-spaCy/)

## Installation

* See the [usage](https://spacy.io/docs/usage/) guide.

* Choose the English model.

In [1]:
# set up spaCy
from spacy.en import English

parser = English()

# Test Data
multi_sentence = "There is an art, it says, or rather, a knack to flying. " \
                 "The knack lies in learning how to throw yourself at the ground and miss. " \
                 "In the beginning the Universe was created. This has made a lot of people "\
                 "very angry and been widely regarded as a bad move."

spaCy can perform **tokenization**, **sentence recognition**, **part-of-speech tagging**, **lemmatization**, **dependency parsing**, and **named entity recognition** in a single pass.

## Parsing the text

In [2]:
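# Parsing a text takes just this one call
# note: the first call to spaCy in a file takes a while because the models have to load
parsed_data = parser(multi_sentence)

# Look at the tokens
# Iterating over parsed_data is all it takes
# Each token is an object with many properties
# Properties with a trailing underscore return the token's string representation
# Properties without a trailing underscore return an index (int) into spaCy's vocabulary
# Probability estimates are based on a corpus of 3 billion words
# and are smoothed with the Simple Good-Turing method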
for i, token in enumerate(parsed_data):
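    # form as it appears in the text
    print("original:", token.orth, token.orth_)
    # lowercased form
    print("lowercased:", token.lower, token.lower_)
    # lemma: the dictionary (base) form
    print("lemma:", token.lemma, token.lemma_)
    # shape: uppercase letters become X, lowercase letters x, digits d
    print("shape:", token.shape, token.shape_)
    # first N characters of the word (N=1 by default)
    print("prefix:", token.prefix, token.prefix_)
    # last N characters of the word (N=3 by default)
    print("suffix:", token.suffix, token.suffix_)
    # probability estimate (probability of occurrence in the corpus?)
    print("log probability:", token.prob)
    # Brown cluster ID of the word
    print("Brown cluster id:", token.cluster)
    print("----------------------------------------")
    # The original tutorial prints only the first word;
    # here we print the first six tokens (indices 0 through 5)
    if i > 4:
        break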

original: 769 There
lowercased: 608 there
lemma: 608 there
shape: 684 Xxxxx
prefix: 568 T
suffix: 609 ere
log probability: -7.277902603149414
Brown cluster id: 1918
----------------------------------------
original: 513 is
lowercased: 513 is
lemma: 536 be
shape: 505 xx
prefix: 509 i
suffix: 513 is
log probability: -4.3297648429870605
Brown cluster id: 762
----------------------------------------
original: 591 an
lowercased: 591 an
lemma: 591 an
shape: 505 xx
prefix: 506 a
suffix: 591 an
log probability: -5.953293800354004
Brown cluster id: 3
----------------------------------------
original: 879 art
lowercased: 879 art
lemma: 879 art
shape: 502 xxx
prefix: 506 a
suffix: 879 art
log probability: -9.778430938720703
Brown cluster id: 633
----------------------------------------
original: 450 ,
lowercased: 450 ,
lemma: 450 ,
shape: 450 ,
prefix: 450 ,
suffix: 450 ,
log probability: -3.3914804458618164
Brown cluster id: 4
----------------------------------------
original: 519 it
lowercased:
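
The `shape`, `prefix`, and `suffix` values in the output above can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not spaCy's implementation: `word_shape` and `affixes` are hypothetical helper names, and the rule that a run of identical shape characters is cut off after four is an assumption inferred from outputs such as `Xxxxx` for `There`.

```python
def word_shape(text):
    """Map letters to X/x and digits to d, keeping other characters as-is.

    Assumption: a run of the same shape character is truncated to length 4.
    """
    shape = []
    run_char, run_len = "", 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch
        if mapped == run_char:
            run_len += 1
        else:
            run_char, run_len = mapped, 1
        if run_len <= 4:
            shape.append(mapped)
    return "".join(shape)


def affixes(text, n_prefix=1, n_suffix=3):
    """Return the first n_prefix and last n_suffix characters of a word."""
    return text[:n_prefix], text[-n_suffix:]


for word in ["There", "is", "art", ","]:
    print(word, word_shape(word), affixes(word))
```

For `There` this prints the shape `Xxxxx`, the prefix `T`, and the suffix `ere`, matching the first token in the output above.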

### A detour from the tutorial: doing the same thing with the *Zen of Python*

In [3]:
zen_of_python = "Beautiful is better than ugly. "\
                "Explicit is better than implicit. "\
                "Simple is better than complex. "\
                "Complex is better than complicated. "\
                "Flat is better than nested. "\
                "Sparse is better than dense. "\
                "Readability counts. "\
                "Special cases aren't special enough to break the rules. "\
                "Although practicality beats purity. "\
                "Errors should never pass silently. "\
                "Unless explicitly silenced. "\
                "In the face of ambiguity, refuse the temptation to guess. "\
                "There should be one-- and preferably only one --obvious way to do it. "\
                "Although that way may not be obvious at first unless you're Dutch. "\
                "Now is better than never. "\
                "Although never is often better than *right* now. "\
                "If the implementation is hard to explain, it's a bad idea. "\
                "If the implementation is easy to explain, it may be a good idea. "\
                "Namespaces are one honking great idea -- let's do more of those!"

parse_zop = parser(zen_of_python)

for i, token in enumerate(parse_zop):
    print("original:", token.orth, token.orth_)
    print("lowercased:", token.lower, token.lower_)
    print("lemma:", token.lemma, token.lemma_)
    print("shape:", token.shape, token.shape_)
    print("prefix:", token.prefix, token.prefix_)
    print("suffix:", token.suffix, token.suffix_)
    print("log probability:", token.prob)
    print("Brown cluster id:", token.cluster)
    # part of speech
    print("Part of Speech:", token.pos, token.pos_)
    # sentiment (positive/negative) score?
    print("sentiment:", token.sentiment)
    print("----------------------------------------")

    if i > 4:
        break

original: 209155 Beautiful
lowercased: 3020 beautiful
lemma: 3020 beautiful
shape: 684 Xxxxx
prefix: 704 B
suffix: 1899 ful
log probability: -19.579313278198242
Brown cluster id: 966
Part of Speech: 94 PROPN
sentiment: 0.0
----------------------------------------
original: 513 is
lowercased: 513 is
lemma: 536 be
shape: 505 xx
prefix: 509 i
suffix: 513 is
log probability: -4.3297648429870605
Brown cluster id: 762
Part of Speech: 98 VERB
sentiment: 0.0
----------------------------------------
original: 761 better
lowercased: 761 better
lemma: 673 good
shape: 515 xxxx
prefix: 537 b
suffix: 762 ter
log probability: -7.226652145385742
Brown cluster id: 7658
Part of Speech: 82 ADJ
sentiment: 0.0
----------------------------------------
original: 626 than
lowercased: 626 than
lemma: 626 than
shape: 515 xxxx
prefix: 503 t
suffix: 627 han
log probability: -6.372464179992676
Brown cluster id: 106
Part of Speech: 83 ADP
sentiment: 0.0
----------------------------------------
original: 3173 ugly
l

* PROPN: proper noun (perhaps it is treated as a proper noun because it is the subject?)
* ADP: adposition (preposition or postposition)

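I also tried the part-of-speech and sentiment APIs.
The sentiment values all come out as 0.0, presumably because the loaded model does not assign sentiment scores.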
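
The paired values in the outputs above, such as `513 is`, come from the convention that attributes without a trailing underscore return an integer ID and attributes with one return the string. A toy string table makes the idea concrete. `ToyStringStore` is a hypothetical class written for this note, and assigning sequential IDs is an assumption made for simplicity; spaCy's own vocabulary assigns its IDs differently.

```python
class ToyStringStore:
    """Minimal two-way mapping between strings and integer IDs."""

    def __init__(self):
        self._ids = {}       # string -> int
        self._strings = []   # int -> string

    def add(self, text):
        """Return the ID for text, assigning the next free ID if it is new."""
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def __getitem__(self, key):
        """Look up a string by ID, or an ID by string."""
        if isinstance(key, int):
            return self._strings[key]
        return self._ids[key]


store = ToyStringStore()
ids = [store.add(word) for word in ["is", "better", "than", "is"]]
print(ids)            # the repeated word "is" gets the same ID both times
print(store[ids[1]])  # map an ID back to its string
```

Storing each distinct string once and passing integers around is what lets both `token.lemma` and `token.lemma_` be cheap views of the same entry.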

## Looking at sentences

The `sents` property returns `spans`.

Each of these `spans` holds indices into the original text.

The element at each index is represented as a `token`.
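
As a rough picture of how a span relates to its tokens, the sketch below models a document as a list of `(text, trailing_whitespace)` pairs and a span as a `(start, end)` index pair. The token list, the sentence boundaries, and the helper `span_text` are all made up for illustration; in spaCy the boundaries come from the parser.

```python
# Toy document: each token is (text, trailing whitespace).
tokens = [
    ("In", " "), ("the", " "), ("beginning", " "), ("the", " "),
    ("Universe", " "), ("was", " "), ("created", ""), (".", " "),
    ("Many", " "), ("people", " "), ("were", " "), ("angry", ""), (".", ""),
]

# Hypothetical sentence spans as half-open index ranges [start, end).
sentence_spans = [(0, 8), (8, 13)]


def span_text(doc_tokens, start, end):
    """Rebuild the text of a span by joining tokens with their whitespace."""
    return "".join(text + ws for text, ws in doc_tokens[start:end]).strip()


sentences = [span_text(tokens, start, end) for start, end in sentence_spans]
for sentence in sentences:
    print(sentence)
```

Because every token carries its own trailing whitespace, joining with an empty string restores the original spacing, including the missing space before each period.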

In [4]:
sents = []
for span in parsed_data.sents:
    # Show each span from its start to its end
    # join() concatenates the tokens of the sentence
    sent = ''.join(parsed_data[i].string for i in range(span.start, span.end)).strip()
    sents.append(sent)
    
for sent in sents:
    print(sent)

There is an art, it says, or rather, a knack to flying.
The knack lies in learning how to throw yourself at the ground and miss.
In the beginning the Universe was created.
This has made a lot of people very angry and been widely regarded as a bad move.


## Part-of-speech tags for each word of the first sentence

In [5]:
for span in parsed_data.sents:
    sent = [parsed_data[i] for i in range(span.start, span.end)]
    break

for token in sent:
    print(token.orth_, token.pos_)

There ADV
is VERB
an DET
art NOUN
, PUNCT
it PRON
says VERB
, PUNCT
or CCONJ
rather ADV
, PUNCT
a DET
knack NOUN
to ADP
flying NOUN
. PUNCT


## Dependencies in the following example sentence

In [6]:
example = "The boy with the spotted dog quickly ran after the firetruck."
parsed_ex = parser(example)

for token in parsed_ex:
    # shown as: original token, dependency tag, head word, left dependents, right dependents
    print(token.orth_, 
          token.dep_, 
          token.head.orth_, 
          [t.orth_ for t in token.lefts], 
          [t.orth_ for t in token.rights])

The det boy [] []
boy nsubj ran ['The'] ['with']
with prep boy [] []
the det dog [] []
spotted amod dog [] []
dog nsubj ran ['the', 'spotted'] []
quickly advmod ran [] []
ran ROOT ran ['boy', 'dog', 'quickly'] ['after', '.']
after prep ran [] ['firetruck']
the det firetruck [] []
firetruck pobj after ['the'] []
. punct ran [] []


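* original token: the token as it appears in the text
* dependency tag: the syntactic dependency relation
    * [Stanford typed dependencies manual](https://nlp.stanford.edu/software/dependencies_manual.pdf)
    * [Toho University - Text mining / Finer points of the Stanford parser](http://pepper.is.sci.toho-u.ac.jp/index.php?%A5%CE%A1%BC%A5%C8%2F%A5%C6%A5%AD%A5%B9%A5%C8%A5%DE%A5%A4%A5%CB%A5%F3%A5%B0%2FStanford%A5%D1%A1%BC%A5%B6%A1%BC%A4%CE%BA%D9%A4%AB%A4%A4%C5%C0) (in Japanese)
* head word: [token.head](https://spacy.io/docs/api/token#head), the syntactic parent of the token
* left dependents: the token's direct syntactic children that appear before it in the sentence
* right dependents: the token's direct syntactic children that appear after it in the sentence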
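
The relationship between the head word and the left and right dependents can be checked by hand. The sketch below takes the head of every token from the output above, stores it as an index, and derives the dependents from that alone. The helper names `lefts` and `rights` mirror the spaCy attributes but are plain functions written for this note.

```python
words = ["The", "boy", "with", "the", "spotted", "dog",
         "quickly", "ran", "after", "the", "firetruck", "."]

# heads[i] is the index of the head of words[i], copied from the output above.
# The root ("ran") is its own head.
heads = [1, 7, 1, 5, 5, 7, 7, 7, 7, 10, 8, 7]


def lefts(i):
    """Direct children of token i that precede it in the sentence."""
    return [words[j] for j in range(i) if heads[j] == i]


def rights(i):
    """Direct children of token i that follow it in the sentence."""
    return [words[j] for j in range(i + 1, len(words)) if heads[j] == i]


for i, word in enumerate(words):
    print(word, words[heads[i]], lefts(i), rights(i))
```

Running this reproduces the head, left, and right columns of the output above, for example `ran ran ['boy', 'dog', 'quickly'] ['after', '.']`.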

## Named entities in the following example sentence

In [7]:
example = "Apple's stocks dropped dramatically after the death of Steve Jobs in October."
parsed_ex = parser(example)
for token in parsed_ex:
    # token.ent_type_: named entity type
    print(token.orth_, token.ent_type_ if token.ent_type_ != "" else "(not an entity)")

print("-------------- entities only ---------------")
# Doc.ents returns only the named entities
ents = parsed_ex.ents
for entity in ents:
    print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))

Apple ORG
's (not an entity)
stocks (not an entity)
dropped (not an entity)
dramatically (not an entity)
after (not an entity)
the (not an entity)
death (not an entity)
of (not an entity)
Steve PERSON
Jobs PERSON
in (not an entity)
October DATE
. (not an entity)
-------------- entities only ---------------
380 ORG Apple
377 PERSON Steve Jobs
387 DATE October
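
The two halves of the output above are related: the entity list is roughly the per-token entity types with consecutive tokens of the same type merged. The sketch below shows that grouping on the tagged tokens copied from the output. `group_entities` is a hypothetical helper, and merging on the type alone is a simplification: it would wrongly fuse two adjacent entities of the same type, which spaCy avoids by also tracking entity boundaries.

```python
from itertools import groupby

# (token text, entity type) pairs copied from the output above;
# an empty string means "not an entity".
tagged = [
    ("Apple", "ORG"), ("'s", ""), ("stocks", ""), ("dropped", ""),
    ("dramatically", ""), ("after", ""), ("the", ""), ("death", ""),
    ("of", ""), ("Steve", "PERSON"), ("Jobs", "PERSON"), ("in", ""),
    ("October", "DATE"), (".", ""),
]


def group_entities(tagged_tokens):
    """Merge runs of consecutive tokens that share a non-empty entity type."""
    entities = []
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label:
            entities.append((label, " ".join(text for text, _ in group)))
    return entities


entities = group_entities(tagged)
for label, text in entities:
    print(label, text)
```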


## spaCy is trained to attempt to handle emoticons and internet slang

In [8]:
messy_data = "lol that is rly funny :) This is gr8 i rate it 8/8!!!"
parsed_data = parser(messy_data)

for token in parsed_data:
    print(token.orth_, token.pos_, token.lemma_)

lol NOUN lol
that ADJ that
is VERB be
rly ADV rly
funny ADJ funny
:) PUNCT :)
This DET this
is VERB be
gr8 VERB gr8
i PRON i
rate VERB rate
it PRON -PRON-
8/8 NUM 8/8
! PUNCT !
! PUNCT !
! PUNCT !


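Looks pretty good.

Note the failure on the `gr8` token: `gr8` stands for `great`, an adjective, but it is tagged as a verb.

Likewise, `lol` is not really a noun but something closer to an interjection.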

## spaCy has word vector representations built in

The original tutorial uses `w.has_repvec`; running it unchanged raises the following exception:
```
AttributeError: 'spacy.lexeme.Lexeme' object has no attribute 'has_repvec'
```

In the current API the attribute is `Lexeme.has_vector`.
[Lexeme.has_vector](https://spacy.io/docs/api/lexeme#vector)

---

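`parser.vocab['NASA']` has no `word vector`, so its vector is all zeros and its norm is 0. The similarity computation then divides zero by zero and emits the following warning:
```
RuntimeWarning: invalid value encountered in float_scalars
```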

In [9]:
from numpy import dot
from numpy.linalg import norm

nasa = parser.vocab['NASA']

# lambda expression that computes cosine similarity
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))

# collect the lowercase words in the parser's vocabulary that have a vector
all_words = list({w for w in parser.vocab 
                  if w.has_vector and w.orth_.islower() and w.lower_ != 'nasa'})

# number of words collected from the vocabulary
print('{length} words'.format(length=len(all_words)))

print(nasa.has_vector)

# sort by similarity to NASA, most similar first
all_words.sort(key=lambda w: cosine(w.vector, nasa.vector))
all_words.reverse()

7681 words
False
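
One way to avoid the warning is to check for a zero norm before dividing. The sketch below does this in plain Python so that it runs without NumPy. `safe_cosine` is a hypothetical name, and returning `0.0` for a zero vector is a choice made for this note (treating a missing vector as "no similarity"), not something the tutorial or spaCy prescribes.

```python
import math


def safe_cosine(v1, v2):
    """Cosine similarity that returns 0.0 when either vector has zero norm."""
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: a word without a vector is similar to nothing.
        return 0.0
    return dot_product / (norm1 * norm2)


print(safe_cosine([0.0, 0.0], [1.0, 2.0]))  # zero vector: guarded, no warning
print(safe_cosine([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(safe_cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

With a guard like this, sorting by similarity to a word that has no vector simply leaves every candidate tied at zero instead of comparing `nan` values.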


  


### `NASA` has no vector and triggers the warning, so try `rock` instead

In [10]:
rock = parser.vocab['rock']

all_words.sort(key=lambda w: cosine(w.vector, rock.vector))
all_words.reverse()
for word in all_words[:10]:
    print(word.orth_)

rock
rocks
punk
band
pop
bands
blues
indie
music
metal


### An analogy: a man becomes a `king`, so what does a woman become?

In [23]:
king = parser.vocab['king']
man = parser.vocab['man']
woman = parser.vocab['woman']

result = king.vector - man.vector + woman.vector

all_words = list({w for w in parser.vocab
                  if w.has_vector and w.orth_.islower()})

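# Drop the three query words themselves. all_words holds Lexeme objects,
# so filter on their string form instead of calling remove() with a str.
all_words = [w for w in all_words if w.orth_ not in ('king', 'man', 'woman')]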
    
all_words.sort(key=lambda w: cosine(w.vector, result))
all_words.reverse()

for word in all_words[:3]:   
    print(word.orth_)
    


queen
kings
princess


`king` - `man` + `woman` gives `queen`.
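
The arithmetic behind this result is easier to follow on a small scale. The sketch below repeats the analogy with hand-made three-dimensional vectors whose axes loosely stand for royalty, maleness, and femaleness. These numbers are invented purely for illustration and have nothing to do with spaCy's real vectors; `toy_vectors` and `analogy` are hypothetical names.

```python
import math

# Invented toy vectors: [royalty, maleness, femaleness].
toy_vectors = {
    "king":     [0.9, 0.8, 0.1],
    "queen":    [0.9, 0.1, 0.8],
    "man":      [0.1, 0.9, 0.1],
    "woman":    [0.1, 0.1, 0.9],
    "prince":   [0.5, 0.8, 0.1],
    "princess": [0.5, 0.1, 0.8],
}


def cosine(v1, v2):
    """Cosine similarity of two non-zero vectors."""
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot_product / norms


def analogy(a, b, c, vectors):
    """Rank words by similarity to a - b + c, excluding the query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return sorted(candidates,
                  key=lambda word: cosine(vectors[word], target),
                  reverse=True)


ranking = analogy("king", "man", "woman", toy_vectors)
print(ranking)
```

Subtracting `man` removes most of the maleness component and adding `woman` puts femaleness in its place, so the target vector ends up pointing closest to `queen`. Passing `reverse=True` to `sorted` does in one step what the `sort()` followed by `reverse()` in the cell above does in two.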