# [spaCy](https://spacy.io/) Basics

<p style="text-align: center;">
    <img src="https://spacy.io/architecture-415624fc7d149ec03f2736c4aa8b8f3c.svg" width='55%'/>
</p>

In [1]:
import pandas as pd
import spacy

In [2]:
# Load NLP models (both English and Japanese)
enlp = spacy.load('en_core_web_trf')
jnlp = spacy.load('ja_core_news_lg')

## Active Pipeline Components

![pipelines](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

When you call `nlp` (here `enlp` or `jnlp`) on a text, spaCy first tokenizes the text to produce a `Doc` object. The `Doc` is then processed in several steps, known collectively as the **processing pipeline**. A trained pipeline typically includes a `tagger`, a `lemmatizer`, a `parser` and an entity recognizer (`ner`). Each pipeline component returns the processed `Doc`, which is then passed on to the next component.
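
As a minimal sketch of how a component receives a `Doc` and hands it on, the snippet below registers a custom component on a blank English pipeline. It assumes spaCy v3. The component name `token_counter` and the `n_tokens` key in `doc.user_data` are made up for this illustration and are not part of the trained pipelines.

```python
import spacy
from spacy.language import Language


# Hypothetical component for illustration: it takes a Doc, annotates it,
# and returns it so the next component in the pipeline can use it.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


# A blank pipeline contains only the tokenizer, so no model download is needed.
nlp = spacy.blank("en")
nlp.add_pipe("token_counter")

doc = nlp("Tesla isn't looking into startups anymore.")
print(nlp.pipe_names)
print(doc.user_data["n_tokens"])
```

Returning the `Doc` is what allows components to be chained: a component that forgets to return it would break every component that runs after it.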

In [3]:
# Get names of available active pipeline components of the English model 
enlp.pipe_names

['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [4]:
# Get names of available active pipeline components of the Japanese model
jnlp.pipe_names

['tok2vec', 'parser', 'attribute_ruler', 'ner']

In [5]:
enlp.pipeline

[('transformer',
  <spacy_transformers.pipeline_component.Transformer at 0x1488d1c1f90>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1488d22d3b0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1488d1cf520>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1488d1c5040>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1488d1c2040>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1488d19b820>)]

In [6]:
jnlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1488dd34a90>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1488d85d640>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1488ddab0c0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1488d85d940>)]
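
`pipe_names` lists only the components that are currently active. The sketch below, which assumes spaCy v3 and uses a blank pipeline with the rule-based `sentencizer` instead of the trained models, shows one way to switch a component off temporarily with `select_pipes`. The variable names are illustrative.

```python
import spacy

# Blank pipeline plus a rule-based sentencizer; no trained model required.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
active_before = list(nlp.pipe_names)

# Temporarily switch the component off inside the with-block.
with nlp.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp.pipe_names)      # active components only
    all_inside = list(nlp.component_names)    # all components, including disabled ones

# The component is active again once the block exits.
active_after = list(nlp.pipe_names)

print(active_before, active_inside, all_inside, active_after)
```

Disabling components that a task does not need, for example the `parser` when only entities are of interest, can make processing noticeably faster.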

## Tokenization

In [7]:
edoc1 = enlp(u"Tesla isn't looking into startups anymore.")

In [8]:
jdoc1 = jnlp(u"テスラ社はもうすでに他の起業会社に気を配っていない。")

In [9]:
def tokensInfo(doc: spacy.tokens.doc.Doc):
    # Extract and save information of each token as a child array
    # of the parent tokens_info[] array
    tokens_info = []
    for token in doc:
        tokens_info.append([token.text, token.lemma_, token.pos_, token.tag_,
                            token.dep_, token.shape_, token.is_alpha, token.is_stop])

    # Table header
    headers = ["Text", "Lemma", "POS", "Tag", "Dep", "Shape", "Is Alpha", "Is Stop"]
    
    # Create and return a Pandas DataFrame containing information of all tokens
    table = pd.DataFrame(columns=headers, data=tokens_info)
    return table

In [10]:
tokensInfo(edoc1)

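| | Text | Lemma | POS | Tag | Dep | Shape | Is Alpha | Is Stop |
|---|------|-------|-----|-----|-----|-------|----------|---------|
| 0 | Tesla | Tesla | PROPN | NNP | nsubj | Xxxxx | True | False |
| 1 | is | be | AUX | VBZ | aux | xx | True | True |
| 2 | n't | n't | PART | RB | neg | x'x | False | True |
| 3 | looking | look | VERB | VBG | ROOT | xxxx | True | False |
| 4 | into | into | ADP | IN | prep | xxxx | True | True |
| 5 | startups | startup | NOUN | NNS | pobj | xxxx | True | False |
| 6 | anymore | anymore | ADV | RB | advmod | xxxx | True | False |
| 7 | . | . | PUNCT | . | punct | . | False | False |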


In [11]:
tokensInfo(jdoc1)

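| | Text | Lemma | POS | Tag | Dep | Shape | Is Alpha | Is Stop |
|---|------|-------|-----|-----|-----|-------|----------|---------|
| 0 | テスラ | テスラ | NOUN | 名詞-普通名詞-助数詞可能 | compound | xxx | True | False |
| 1 | 社 | 社 | NOUN | 名詞-普通名詞-助数詞可能 | nsubj | x | True | False |
| 2 | は | は | ADP | 助詞-係助詞 | case | x | True | True |
| 3 | もう | もう | ADV | 副詞 | advmod | xx | True | True |
| 4 | すでに | すでに | ADV | 副詞 | advmod | xxx | True | False |
| 5 | 他 | 他 | NOUN | 名詞-普通名詞-副詞可能 | nmod | x | True | False |
| 6 | の | の | ADP | 助詞-格助詞 | case | x | True | True |
| 7 | 起業 | 起業 | NOUN | 名詞-普通名詞-サ変可能 | compound | xx | True | False |
| 8 | 会社 | 会社 | NOUN | 名詞-普通名詞-一般 | obl | xx | True | False |
| 9 | に | に | ADP | 助詞-格助詞 | case | x | True | True |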


### Token Attributes

When given a text, spaCy tokenizes it and assigns linguistic attributes to each token. The most commonly used attributes are:

| Attribute | Description | Example `edoc1[3]` |
|-----------|-------------|--------------------|
| `.text` | The original word text | `looking` |
| `.lemma_` | The base form of the word | `look` |
| `.pos_` | The simple [UPOS](https://universaldependencies.org/u/pos/) part-of-speech tag | `VERB` |
| `.tag_` | The detailed part-of-speech tag | `VBG` |
| `.dep_` | Syntactic dependency, i.e. the relation between tokens | `ROOT` |
| `.shape_` | The word shape – capitalization, punctuation, digits | `xxxx` |
| `.is_alpha` | Does the token consist of alphabetic characters? | `True` |
| `.is_stop` | Is the token part of a stop list, i.e. the most common words of the language? | `False` |
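
Not all of these attributes come from the same place. As a rough sketch, assuming spaCy v3 and a blank English pipeline (tokenizer only, no trained components), the lexical attributes such as `.shape_`, `.is_alpha` and `.is_stop` are already filled in, while attributes predicted by trained components, such as `.pos_` and `.dep_`, stay empty.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no tagger or parser
doc = nlp("Tesla isn't looking into startups anymore.")

token = doc[3]

# Lexical attributes are computed from the token text and language data.
print(token.text, token.shape_, token.is_alpha, token.is_stop)

# Predicted attributes stay empty strings without a trained pipeline.
print(repr(token.pos_), repr(token.dep_))
```

This is why the tables above need `enlp` and `jnlp`: the lemma, POS, tag and dependency columns are produced by the trained components listed in `pipe_names`.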

### Understanding Tags and Labels

Most of the tags and labels look pretty abstract, and they vary between languages. `spacy.explain` will show you a short description – for example, `spacy.explain("VBZ")` returns `"verb, 3rd person singular present"`.

In [12]:
# We can use spacy.explain() function to get the full name of a tag
token = edoc1[0]
print(token.pos_, spacy.explain(token.pos_))
print(token.tag_, spacy.explain(token.tag_))
print(token.dep_, spacy.explain(token.dep_))
token

PROPN proper noun
NNP noun, proper singular
nsubj nominal subject


Tesla
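
The same function can decode several labels at once. The sketch below builds a small glossary for a few labels that appear in the English token table. The label list is chosen for illustration, and the last call assumes that a label missing from spaCy's glossary yields `None` (some spaCy versions also emit a warning in that case).

```python
import spacy

# Labels taken from the English token table above.
labels = ["AUX", "VBZ", "neg", "pobj", "advmod"]
glossary = {label: spacy.explain(label) for label in labels}

for label, description in glossary.items():
    print(f"{label:8}{description}")

# Assumption: an unknown label is not an error, it simply returns None.
unknown = spacy.explain("NOT_A_REAL_TAG")
print(unknown)
```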

## Spans

Slicing a spaCy `Doc` returns a [`Span`](https://spacy.io/api/span) object, a view of consecutive tokens from the larger document. The `start` and `end` values are token indices, not character offsets, and `end` is exclusive.

> `doc[start : end]`
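
A short sketch of the slicing semantics, assuming spaCy v3 and a blank English pipeline (the tokenizer alone is enough to create spans). The sentence is borrowed from the Halloween example in this section.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("The tradition originated with the ancient Celtic festival of Samhain.")

span = doc[4:8]  # tokens 4, 5, 6 and 7; the end index is exclusive
print(span.text)
print(span.start, span.end)            # token offsets within the Doc
print(span.start_char, span.end_char)  # character offsets within the text

# A slice is a Span, while a single index returns a Token.
print(isinstance(span, Span), isinstance(doc[4], Span))
```

Because a `Span` is a view into its `Doc`, its tokens keep every attribute that the pipeline assigned to them.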

In [13]:
edoc2 = enlp(u'Halloween is a holiday celebrated each year on October 31, \
and Halloween 2021 will occur on Sunday, October 31. The tradition originated \
with the ancient Celtic festival of Samhain, when people would light bonfires \
and wear costumes to ward off ghosts. In the eighth century, Pope Gregory III \
designated November 1 as a time to honor all saints.')

span1 = edoc2[22:30]
print(span1)
type(span1)

The tradition originated with the ancient Celtic festival


spacy.tokens.span.Span

In [14]:
jdoc2 = jnlp(u'水際対策を強化しているアフリカ南部のナミビアから入国した\
30代の男性が新型コロナウイルスに感染していたことが分かり、厚生労働省は、\
国立感染症研究所でオミクロン株の感染かどうか詳しい解析を進めることにしています。\
岸田総理大臣は「わが国はG7の中でも最高のワクチン接種率かつ2回目の接種から\
最も日が浅い状況だ。マスク着用をはじめ行動自粛への国民の協力なども世界が称賛している。\
オミクロン株のリスクへの耐性は各国以上に強いと認識している。\
国民は落ち着いて対応するよう呼びかけたい」と強調しました。')

span2 = jdoc2[20:30]
print(span2)
type(span2)

新型コロナウイルスに感染していたこと


spacy.tokens.span.Span

## Sentences

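`Doc.sents` yields the sentences of a spaCy document as `Span` objects. It is an iterator rather than a list, and it requires a pipeline component that sets sentence boundaries, such as the `parser` in both pipelines loaded above.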
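
Sentence boundaries do not have to come from a trained model. As a minimal sketch, assuming spaCy v3, the snippet below adds the rule-based `sentencizer` to a blank English pipeline. It splits on sentence-final punctuation only, so its results can differ from the sentences produced by `enlp` and `jnlp`. The example text is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based: splits on sentence-final punctuation

doc = nlp("Halloween is a holiday. The tradition is ancient. People wear costumes.")

# Wrap Doc.sents in list() to count or index the sentences.
sentences = list(doc.sents)
print(len(sentences))
print(sentences[1].text)

# Tokens that open a sentence have is_sent_start set to True.
sentence_starts = [token.text for token in doc if token.is_sent_start]
print(sentence_starts)
```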

In [15]:
for sentence in edoc2.sents:
    print(sentence)

Halloween is a holiday celebrated each year on October 31, and Halloween 2021 will occur on Sunday, October 31.
The tradition originated with the ancient Celtic festival of Samhain, when people would light bonfires and wear costumes to ward off ghosts.
In the eighth century, Pope Gregory III designated November 1 as a time to honor all saints.


In [16]:
for sentence in jdoc2.sents:
    print(sentence)

水際対策を強化しているアフリカ南部のナミビアから入国した30代の男性が新型コロナウイルスに感染していたことが分かり、厚生労働省は、国立感染症研究所でオミクロン株の感染かどうか詳しい解析を進めることにしています。
岸田総理大臣は「わが国はG7の中でも最高のワクチン接種率かつ2回目の接種から最も日が浅い状況だ。
マスク着用をはじめ行動自粛への国民の協力なども世界が称賛している。
オミクロン株のリスクへの耐性は各国以上に強いと認識している。
国民は落ち着いて対応するよう呼びかけたい」と強調しました。


In [17]:
# We can also check if a word is the start of a sentence
print(edoc2[0])
edoc2[0].is_sent_start

Halloween


True

In [18]:
print(jdoc2[1])
jdoc2[1].is_sent_start

対策


False