[Reference](https://pemagrg.medium.com/nlp-using-stanza-3775c7e00f2a)

# Install

In [1]:
pip install stanza

Collecting stanza
  Downloading stanza-1.3.0-py3-none-any.whl (432 kB)
     |████████████████████████████████| 432 kB 7.6 MB/s 
Collecting emoji
  Downloading emoji-1.6.1.tar.gz (170 kB)
     |████████████████████████████████| 170 kB 55.2 MB/s 
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-1.6.1-py3-none-any.whl size=169314 sha256=5507c13df071372b53f713541056111d8f7635b09ba06b54e595a73c3d255f9e
  Stored in directory: /root/.cache/pip/wheels/ea/5f/d3/03d313ddb3c2a1a427bb4690f1621eea60fe6f2a30cc95940f
Successfully built emoji
Installing collected packages: emoji, stanza
Successfully installed emoji-1.6.1 stanza-1.3.0


In [3]:
import stanza
stanza.download('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2021-11-27 08:08:41 INFO: Downloading default packages for language: en (English)...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.3.0/models/default.zip:   0%|          | 0…

2021-11-27 08:09:19 INFO: Finished downloading models and saved to /root/stanza_resources.


# Load the pipeline


In [4]:
nlp = stanza.Pipeline('en')

2021-11-27 08:09:20 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2021-11-27 08:09:20 INFO: Use device: cpu
2021-11-27 08:09:20 INFO: Loading: tokenize
2021-11-27 08:09:20 INFO: Loading: pos
2021-11-27 08:09:20 INFO: Loading: lemma
2021-11-27 08:09:20 INFO: Loading: depparse
2021-11-27 08:09:21 INFO: Loading: sentiment
2021-11-27 08:09:21 INFO: Loading: constituency
2021-11-27 08:09:22 INFO: Loading: ner
2021-11-27 08:09:22 INFO: Done loading processors!


# WORD TOKENIZE


In [5]:
# Reuse the `nlp` pipeline loaded above; rebuilding it only reloads the same models.
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
# Collect the text of every token, sentence by sentence.
word_tokens = [token.text for sent in doc.sentences for token in sent.tokens]
print(word_tokens)

['Chris', 'Manning', 'teaches', 'at', 'Stanford', 'University', '.', 'He', 'lives', 'in', 'the', 'Bay', 'Area', '.']


# SENTENCE TOKENIZE


In [6]:
# Reuse the `nlp` pipeline loaded above; rebuilding it only reloads the same models.
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
# Each sentence found by the tokenizer exposes its raw text as `.text`.
sent_list = [sent.text for sent in doc.sentences]
print(sent_list)

['Chris Manning teaches at Stanford University.', 'He lives in the Bay Area.']


# POS tagging


In [7]:
import stanza
# Only the tokenizer and the POS tagger are needed here.
nlp = stanza.Pipeline('en', processors='tokenize,pos')
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
# `xpos` is the treebank-specific tag (Penn Treebank tags for English).
print([f'{word.text}_{word.xpos}' for sent in doc.sentences for word in sent.words])

2021-11-27 08:09:38 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |

2021-11-27 08:09:38 INFO: Use device: cpu
2021-11-27 08:09:38 INFO: Loading: tokenize
2021-11-27 08:09:38 INFO: Loading: pos
2021-11-27 08:09:38 INFO: Done loading processors!


['Chris_NNP', 'Manning_NNP', 'teaches_VBZ', 'at_IN', 'Stanford_NNP', 'University_NNP', '._.', 'He_PRP', 'lives_VBZ', 'in_IN', 'the_DT', 'Bay_NNP', 'Area_NNP', '._.']
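
Lemmas can be read in the same way as the tags above, from `word.lemma`, as long as the pipeline includes the `lemma` processor (it appears in the default English pipeline loaded earlier). The sketch below is only an illustration: `word_lemmas` is a hypothetical helper, and the stand-in document with hand-written lemmas is not Stanza output; it is there so the snippet runs without downloading models.

```python
from types import SimpleNamespace


def word_lemmas(doc):
    """Return (text, lemma) pairs for every word in a Stanza-style document."""
    return [(word.text, word.lemma) for sent in doc.sentences for word in sent.words]


# Stand-in document with hand-written lemmas (an assumption, not real Stanza output).
# With Stanza, pass the result of nlp(text) from a pipeline that includes `lemma`.
fake_doc = SimpleNamespace(sentences=[SimpleNamespace(words=[
    SimpleNamespace(text="teaches", lemma="teach"),
    SimpleNamespace(text="lives", lemma="live"),
])])

print(word_lemmas(fake_doc))
```

With a real pipeline, the same helper would be called as `word_lemmas(nlp(en_test_))`.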


# NER


In [8]:
import stanza
# Only the tokenizer and the NER tagger are needed here.
nlp = stanza.Pipeline('en', processors='tokenize,ner')
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
# `token.ner` holds a BIOES tag: B(egin), I(nside), E(nd), S(ingle) or O(utside).
print(*[f'{token.text}_{token.ner}' for sent in doc.sentences for token in sent.tokens])

2021-11-27 08:09:47 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

2021-11-27 08:09:47 INFO: Use device: cpu
2021-11-27 08:09:47 INFO: Loading: tokenize
2021-11-27 08:09:47 INFO: Loading: ner
2021-11-27 08:09:47 INFO: Done loading processors!


Chris_B-PERSON Manning_E-PERSON teaches_O at_O Stanford_B-ORG University_E-ORG ._O He_O lives_O in_O the_B-LOC Bay_I-LOC Area_E-LOC ._O
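
The per-token tags above follow the BIOES scheme, so a multi-word entity is spread over several tokens. A minimal sketch of merging them back into whole entities is shown below; `group_bioes` is a hypothetical helper written for this note, not part of Stanza, and the input pairs are copied from the output above.

```python
def group_bioes(tagged):
    """Merge (text, tag) pairs in BIOES format into (entity_text, entity_type) pairs."""
    entities, current = [], []
    for text, tag in tagged:
        if tag == "O":
            current = []  # outside any entity: drop any unfinished span
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":  # single-token entity
            entities.append((text, label))
            current = []
        elif prefix == "B":  # start a new entity
            current = [text]
        else:  # "I" continues the entity, "E" closes it
            current.append(text)
            if prefix == "E":
                entities.append((" ".join(current), label))
                current = []
    return entities


tagged = [
    ("Chris", "B-PERSON"), ("Manning", "E-PERSON"), ("teaches", "O"), ("at", "O"),
    ("Stanford", "B-ORG"), ("University", "E-ORG"), (".", "O"),
    ("He", "O"), ("lives", "O"), ("in", "O"),
    ("the", "B-LOC"), ("Bay", "I-LOC"), ("Area", "E-LOC"), (".", "O"),
]
print(group_bioes(tagged))
# [('Chris Manning', 'PERSON'), ('Stanford University', 'ORG'), ('the Bay Area', 'LOC')]
```

Stanza also exposes already-merged entities on the document as `doc.ents`, each with `.text` and `.type`, which is usually the simpler route when the pipeline is at hand.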


# WORD FEATURES


In [9]:
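import stanza
# `mwt` is requested, but the log below shows only tokenize and pos being loaded for English here.
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos')
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
# Print each word's universal POS tag and morphological features ("_" when it has none).
print(*[f'WORD: {word.text}\tPOS: {word.upos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')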

2021-11-27 08:09:58 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |

2021-11-27 08:09:58 INFO: Use device: cpu
2021-11-27 08:09:58 INFO: Loading: tokenize
2021-11-27 08:09:58 INFO: Loading: pos
2021-11-27 08:09:58 INFO: Done loading processors!


WORD: Chris	POS: PROPN	feats: Number=Sing
WORD: Manning	POS: PROPN	feats: Number=Sing
WORD: teaches	POS: VERB	feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
WORD: at	POS: ADP	feats: _
WORD: Stanford	POS: PROPN	feats: Number=Sing
WORD: University	POS: PROPN	feats: Number=Sing
WORD: .	POS: PUNCT	feats: _
WORD: He	POS: PRON	feats: Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs
WORD: lives	POS: VERB	feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
WORD: in	POS: ADP	feats: _
WORD: the	POS: DET	feats: Definite=Def|PronType=Art
WORD: Bay	POS: PROPN	feats: Number=Sing
WORD: Area	POS: PROPN	feats: Number=Sing
WORD: .	POS: PUNCT	feats: _
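
Each `feats` value above is a single string of `Name=Value` pairs joined by `|`. If the features are needed individually, a small parser is enough; the sketch below assumes that format and uses a hypothetical helper name, `parse_feats`, which is not part of Stanza.

```python
def parse_feats(feats):
    """Turn a 'Name=Value|Name=Value' feature string into a dict; empty or '_' gives {}."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


print(parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"))
# {'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', 'Tense': 'Pres', 'VerbForm': 'Fin'}
print(parse_feats("_"))
# {}
```

With the pipeline above, this would be applied as `parse_feats(word.feats)` for each word, since `word.feats` is `None` when a word has no features.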
