# [Intro to NLP with spaCy](https://nicschrading.com/project/Intro-to-NLP-with-spaCy/)

In [1]:
# set up spaCy (spacy.en is the legacy import path from before spaCy 2.0)
from spacy.en import English

parser = English()

# Test Data
multi_sentence = "There is an art, it says, or rather, a knack to flying. " \
                 "The knack lies in learning how to throw yourself at the ground and miss. " \
                 "In the beginning the Universe was created. This has made a lot of people "\
                 "very angry and been widely regarded as a bad move."
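
Python joins adjacent string literals with no separator, so whether two sentences in a test string like the one above end up separated by a space depends entirely on the literals themselves. A minimal sketch of the difference (the variable names `fused` and `spaced` are illustrative and not part of the tutorial):

```python
# Adjacent string literals are concatenated with nothing inserted between them.
fused = "First sentence." \
        "Second sentence."
spaced = "First sentence. " \
         "Second sentence."

print(repr(fused))   # 'First sentence.Second sentence.'
print(repr(spaced))  # 'First sentence. Second sentence.'
```

A trailing space inside each literal that ends a sentence keeps the next sentence from fusing onto it.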

spaCy can perform **tokenization**, **sentence recognition**, **part-of-speech tagging**, **lemmatization**, **dependency parsing**, and **named entity recognition** all in a single pass.

In [2]:
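# Parsing the text takes just this one call
# note: the first call to spaCy in a file takes a while because the modules have to load
parse_data = parser(multi_sentence)

# Look at the tokens
# Iterating over parse_data is all it takes
# Each token is an object with many properties
# Properties with a trailing underscore return the string representation of the token
# Properties without a trailing underscore return an integer index into spaCy's vocabulary
# The probability estimate is based on counts from a 3 billion word corpus,
# smoothed with the Good-Turing method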
for i, token in enumerate(parse_data):
    print("original:", token.orth, token.orth_)
    print("lowercased:", token.lower, token.lower_)
    print("lemma:", token.lemma, token.lemma_)
    print("shape:", token.shape, token.shape_)
    print("prefix:", token.prefix, token.prefix_)
    print("suffix:", token.suffix, token.suffix_)
    print("log probability:", token.prob)
    print("Brown cluster id:", token.cluster)
    print("----------------------------------------")
    # The original tutorial prints only the first token;
    # here we print the first six (the loop stops after index 5)
    if i > 4:
        break

original: 769 There
lowercased: 608 there
lemma: 608 there
shape: 684 Xxxxx
prefix: 568 T
suffix: 609 ere
log probability: -7.277902603149414
Brown cluster id: 1918
----------------------------------------
original: 513 is
lowercased: 513 is
lemma: 536 be
shape: 505 xx
prefix: 509 i
suffix: 513 is
log probability: -4.3297648429870605
Brown cluster id: 762
----------------------------------------
original: 591 an
lowercased: 591 an
lemma: 591 an
shape: 505 xx
prefix: 506 a
suffix: 591 an
log probability: -5.953293800354004
Brown cluster id: 3
----------------------------------------
original: 879 art
lowercased: 879 art
lemma: 879 art
shape: 502 xxx
prefix: 506 a
suffix: 879 art
log probability: -9.778430938720703
Brown cluster id: 633
----------------------------------------
original: 450 ,
lowercased: 450 ,
lemma: 450 ,
shape: 450 ,
prefix: 450 ,
suffix: 450 ,
log probability: -3.3914804458618164
Brown cluster id: 4
----------------------------------------
original: 519 it
lowercased:
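
Each line of the output above pairs an integer with a string because `token.orth` is an integer ID and `token.orth_` is the string that ID stands for. The idea can be sketched with a toy string-to-ID table. `ToyStringStore` below is a hypothetical class written purely for illustration; it is not spaCy's implementation, and the IDs it assigns are arbitrary insertion-order numbers rather than the IDs printed above.

```python
class ToyStringStore:
    """Hypothetical two-way mapping between strings and integer IDs."""

    def __init__(self):
        self._id_by_string = {}
        self._string_by_id = []

    def add(self, text):
        # Reuse the existing ID if the string has been seen before.
        if text not in self._id_by_string:
            self._id_by_string[text] = len(self._string_by_id)
            self._string_by_id.append(text)
        return self._id_by_string[text]

    def lookup(self, string_id):
        # Recover the string that an ID stands for.
        return self._string_by_id[string_id]


store = ToyStringStore()
ids = [store.add(word) for word in ["is", "better", "than", "is"]]
print(ids)                   # [0, 1, 2, 0]
print(store.lookup(ids[1]))  # better
```

Storing each distinct string once and passing integers around is consistent with what the output shows: `is` carries the ID 513 as its original form, its lowercased form, and its suffix, because all three are the same string.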

Here we take a detour from the tutorial and do the same thing with the *Zen of Python*.

In [3]:
zen_of_python = "Beautiful is better than ugly. "\
                "Explicit is better than implicit. "\
                "Simple is better than complex. "\
                "Complex is better than complicated. "\
                "Flat is better than nested. "\
                "Sparse is better than dense. "\
                "Readability counts. "\
                "Special cases aren't special enough to break the rules. "\
                "Although practicality beats purity. "\
                "Errors should never pass silently. "\
                "Unless explicitly silenced. "\
                "In the face of ambiguity, refuse the temptation to guess. "\
                "There should be one-- and preferably only one --obvious way to do it. "\
                "Although that way may not be obvious at first unless you're Dutch. "\
                "Now is better than never. "\
                "Although never is often better than *right* now. "\
                "If the implementation is hard to explain, it's a bad idea. "\
                "If the implementation is easy to explain, it may be a good idea. "\
                "Namespaces are one honking great idea -- let's do more of those!"

parse_zop = parser(zen_of_python)

for i, token in enumerate(parse_zop):
    print("original:", token.orth, token.orth_)
    print("lowercased:", token.lower, token.lower_)
    print("lemma:", token.lemma, token.lemma_)
    print("shape:", token.shape, token.shape_)
    print("prefix:", token.prefix, token.prefix_)
    print("suffix:", token.suffix, token.suffix_)
    print("log probability:", token.prob)
    print("Brown cluster id:", token.cluster)
    print("----------------------------------------")

    if i > 4:
        break

original: 209155 Beautiful
lowercased: 3020 beautiful
lemma: 3020 beautiful
shape: 684 Xxxxx
prefix: 704 B
suffix: 1899 ful
log probability: -19.579313278198242
Brown cluster id: 966
----------------------------------------
original: 513 is
lowercased: 513 is
lemma: 536 be
shape: 505 xx
prefix: 509 i
suffix: 513 is
log probability: -4.3297648429870605
Brown cluster id: 762
----------------------------------------
original: 761 better
lowercased: 761 better
lemma: 673 good
shape: 515 xxxx
prefix: 537 b
suffix: 762 ter
log probability: -7.226652145385742
Brown cluster id: 7658
----------------------------------------
original: 626 than
lowercased: 626 than
lemma: 626 than
shape: 515 xxxx
prefix: 503 t
suffix: 627 han
log probability: -6.372464179992676
Brown cluster id: 106
----------------------------------------
original: 3173 ugly
lowercased: 3173 ugly
lemma: 3173 ugly
shape: 515 xxxx
prefix: 607 u
suffix: 3174 gly
log probability: -10.173290252685547
Brown cluster id: 871
-----------
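
Across the two outputs, `There` and `Beautiful` get the same shape, `Xxxxx`, even though they differ in length, and the suffix is the last three characters (`ere`, `ful`), while a short token such as `is` is its own suffix. A rough pure-Python approximation that reproduces the values printed above is sketched here. The function names and the run-length cap of 4 are assumptions inferred from this output, not spaCy's actual code.

```python
def approx_shape(text, max_run=4):
    """Map letters to X/x and digits to d, capping runs of the same symbol."""
    shape = []
    last = None
    run = 0
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch  # punctuation is kept as-is
        run = run + 1 if sym == last else 1
        last = sym
        if run <= max_run:
            shape.append(sym)
    return "".join(shape)


def approx_prefix(text):
    # First character of the token.
    return text[:1]


def approx_suffix(text):
    # Last three characters, or the whole token if it is shorter.
    return text[-3:]


for word in ["There", "Beautiful", "is", "better", ","]:
    print(word, approx_shape(word), approx_prefix(word), approx_suffix(word))
```

Because long runs collapse, shape acts as a coarse feature: many different capitalized words map to the single value `Xxxxx`.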