## Part-of-speech tagging

After tokenization, spaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in: they let spaCy predict which tag or label most likely applies in a given context. A trained component includes binary data that is produced by showing a system enough examples for it to make predictions that generalize across the language. For example, a word following “the” in English is most likely a noun.

Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore `_` to its name (for example, `token.lemma_` rather than `token.lemma`).
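The idea of storing strings as integer IDs can be pictured with a toy string store. This is only a minimal sketch of the concept, not spaCy's implementation: the `ToyStringStore` class and its `add` method are hypothetical names, and the hash function (BLAKE2b truncated to 64 bits) is an assumption chosen because it is in the standard library.

```python
import hashlib


class ToyStringStore:
    """Toy two-way mapping between strings and 64-bit integer IDs.

    Hypothetical illustration only; spaCy's own string store differs.
    """

    def __init__(self):
        self._by_id = {}

    def add(self, text: str) -> int:
        # Derive a stable 64-bit integer from the UTF-8 bytes of the string.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._by_id[key] = text
        return key

    def __getitem__(self, key: int) -> str:
        # Look the readable string back up from its integer ID.
        return self._by_id[key]


store = ToyStringStore()
lemma_id = store.add("take")   # plays the role of an integer attribute such as token.lemma
print(lemma_id)                # an opaque integer
print(store[lemma_id])         # the readable string, like token.lemma_
```

Annotations can then be stored and compared as integers, and the readable string is only looked up when it is needed.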

In [3]:
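import spacy

# Download the small English pipeline (only needed once per environment)
spacy.cli.download("en_core_web_sm")
# Load the trained pipeline
nlp = spacy.load("en_core_web_sm")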

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 1.3 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [5]:
doc = nlp("Rudra taken data science course. worth $60")
doc

Rudra taken data science course. worth $60

In [7]:
# Columns: text, lemma, coarse POS, fine-grained tag, dependency, shape, is_alpha, is_stop
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Rudra Rudra PROPN NNP nsubj Xxxxx True False
taken take VERB VBD ROOT xxxx True False
data data NOUN NN compound xxxx True False
science science NOUN NN compound xxxx True False
course course NOUN NN dobj xxxx True False
. . PUNCT . punct . False False
worth worth ADJ JJ ROOT xxxx True False
$ $ SYM $ nmod $ False False
60 60 NUM CD npadvmod dd False False


In [14]:
# lemma_: the base form of each token
for token in doc:
    print(f"{token}:", token.lemma_)

Rudra: Rudra
taken: take
data: data
science: science
course: course
.: .
worth: worth
$: $
60: 60


In [15]:
# pos_: the coarse-grained part-of-speech tag
for token in doc:
    print(f"{token}:", token.pos_)

Rudra: PROPN
taken: VERB
data: NOUN
science: NOUN
course: NOUN
.: PUNCT
worth: ADJ
$: SYM
60: NUM


In [17]:
# tag_: the fine-grained part-of-speech tag
for token in doc:
    print(f"{token}:", token.tag_)

Rudra: NNP
taken: VBD
data: NN
science: NN
course: NN
.: .
worth: JJ
$: $
60: CD
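The coarse `pos_` labels and the fine-grained `tag_` labels printed above are easier to read next to a glossary. spaCy provides `spacy.explain` for this kind of lookup; the sketch below instead uses a small hand-written glossary so that it runs without the library. The rows are copied from the outputs above, while the description wording and the `describe` helper are my own and purely illustrative.

```python
# Rows copied from the pos_ and tag_ outputs above: (text, coarse tag, fine-grained tag)
rows = [
    ("Rudra", "PROPN", "NNP"),
    ("taken", "VERB", "VBD"),
    ("data", "NOUN", "NN"),
    ("science", "NOUN", "NN"),
    ("course", "NOUN", "NN"),
    (".", "PUNCT", "."),
    ("worth", "ADJ", "JJ"),
    ("$", "SYM", "$"),
    ("60", "NUM", "CD"),
]

# Hand-written glossary (not taken from spaCy): Universal POS tags
COARSE = {
    "PROPN": "proper noun", "VERB": "verb", "NOUN": "noun",
    "PUNCT": "punctuation", "ADJ": "adjective", "SYM": "symbol",
    "NUM": "numeral",
}

# Hand-written glossary (not taken from spaCy): Penn Treebank style tags
FINE = {
    "NNP": "proper noun, singular", "VBD": "verb, past tense",
    "NN": "noun, singular or mass", ".": "sentence-final punctuation",
    "JJ": "adjective", "$": "currency symbol", "CD": "cardinal number",
}


def describe(text, pos, tag):
    """Render one token with both tag descriptions; unknown tags pass through."""
    return f"{text}: {COARSE.get(pos, pos)} / {FINE.get(tag, tag)}"


descriptions = [describe(*row) for row in rows]
for line in descriptions:
    print(line)
```

Note that the fine-grained tag can carry information the coarse tag drops, such as tense on `VBD` or number on `NNP`.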


In [18]:
# dep_: the syntactic dependency relation of each token
for token in doc:
    print(f"{token}:", token.dep_)

Rudra: nsubj
taken: ROOT
data: compound
science: compound
course: dobj
.: punct
worth: ROOT
$: nmod
60: npadvmod


In [19]:
# shape_: the word shape (capitalization, punctuation, digits)
for token in doc:
    print(f"{token}:", token.shape_)

Rudra: Xxxxx
taken: xxxx
data: xxxx
science: xxxx
course: xxxx
.: .
worth: xxxx
$: $
60: dd
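The shapes above follow a simple pattern: uppercase letters become `X`, lowercase letters become `x`, digits become `d`, other characters are kept, and long runs of the same symbol are cut short. The function below is a rough re-implementation of that pattern for illustration. `toy_shape` and its `max_run` parameter are hypothetical names, and the run limit of 4 is an assumption inferred from outputs such as `science` mapping to `xxxx`.

```python
def toy_shape(text: str, max_run: int = 4) -> str:
    """Approximate a word-shape string; an illustrative sketch, not spaCy's code."""
    shape = []
    prev = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch  # punctuation and symbols are kept as-is
        # Count how many times the same shape symbol has repeated in a row.
        run = run + 1 if sym == prev else 1
        prev = sym
        # Keep at most max_run consecutive copies of a symbol.
        if run <= max_run:
            shape.append(sym)
    return "".join(shape)


words = ["Rudra", "taken", "science", ".", "$", "60"]
shapes = {word: toy_shape(word) for word in words}
print(shapes)
```

For the tokens in this example, the sketch reproduces the shapes printed above.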


In [20]:
# is_alpha: whether the token consists only of alphabetic characters
for token in doc:
    print(f"{token}:", token.is_alpha)

Rudra: True
taken: True
data: True
science: True
course: True
.: False
worth: True
$: False
60: False


In [21]:
# is_stop: whether the token is part of the stop list (very common words)
for token in doc:
    print(f"{token}:", token.is_stop)

Rudra: False
taken: False
data: False
science: False
course: False
.: False
worth: False
$: False
60: False
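Every value above is `False`, so the output does not show what a stop word looks like. The sketch below uses a different sentence and a deliberately tiny stop list to show the idea. `TOY_STOP_WORDS` and `is_stop` are hypothetical and far smaller than the stop list a real pipeline ships with; the whitespace split is also a simplification of real tokenization.

```python
# Hypothetical, deliberately tiny stop list for illustration only
TOY_STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def is_stop(word: str) -> bool:
    """Return True if the lowercased word is in the toy stop list."""
    return word.lower() in TOY_STOP_WORDS


sentence = "The price of the course is $60"
# Naive whitespace split, used here only to keep the sketch self-contained.
flags = [(word, is_stop(word)) for word in sentence.split()]
for word, flag in flags:
    print(f"{word}: {flag}")

# Content words that remain after dropping stop words
content_words = [word for word, flag in flags if not flag]
print(content_words)
```

Filtering on this flag is a common way to keep only the content-bearing words before counting or comparing texts.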
