# The Stanford POS Tagger


Web app version: http://nlp.stanford.edu:8080/parser/

A newer NLTK interface, which requires running a Java server locally: https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK


### Downloading the tagger and models

Download and unzip the tagger and its models. The same commands work on your own computer if you want to run the tagger locally.

In [1]:
%%time
!wget 'https://nlp.stanford.edu/software/stanford-tagger-4.2.0.zip'
!unzip './stanford-tagger-4.2.0.zip'

--2022-03-30 20:31:36--  https://nlp.stanford.edu/software/stanford-tagger-4.2.0.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://downloads.cs.stanford.edu/nlp/software/stanford-tagger-4.2.0.zip [following]
--2022-03-30 20:31:37--  https://downloads.cs.stanford.edu/nlp/software/stanford-tagger-4.2.0.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 78034596 (74M) [application/zip]
Saving to: ‘stanford-tagger-4.2.0.zip.3’


2022-03-30 20:31:48 (6.82 MB/s) - ‘stanford-tagger-4.2.0.zip.3’ saved [78034596/78034596]

Archive:  ./stanford-tagger-4.2.0.zip
replace stanford-postagger-full-2020-11-17/stanford-postagger-gui.sh? [y]es, [n]o, [A]
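The log above shows the archive being saved under a new name (`.zip.3`) and `unzip` stopping to ask whether to overwrite files, which is what happens when the cell is run more than once. A minimal sketch of a re-runnable alternative in plain Python follows. The helper name `fetch_and_extract` is hypothetical and not part of the original notebook; it only uses the standard library.

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_extract(url, dest_dir="."):
    """Download a zip archive once and extract it into dest_dir.

    Hypothetical helper: skips the download if the archive is already
    present and overwrites extracted files without prompting.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last path component of the URL.
    archive = dest / url.rsplit("/", 1)[-1]
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)  # existing files are overwritten silently
    return archive
```

Calling `fetch_and_extract('https://nlp.stanford.edu/software/stanford-tagger-4.2.0.zip')` should then be safe to repeat, assuming the URL is still reachable.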

### Setting up and using the tagger with NLTK

In [2]:
model_path='./stanford-postagger-full-2020-11-17/models/english-bidirectional-distsim.tagger'
jar_tagger_path='./stanford-postagger-full-2020-11-17/stanford-postagger-4.2.0.jar'

In [3]:
from nltk.tag.stanford import StanfordPOSTagger  # deprecated in favor of the CoreNLP interface (see the warning below)

In [4]:
!pip freeze | grep nltk

nltk==3.2.5


In [5]:
tagger=StanfordPOSTagger(model_path, jar_tagger_path)

The StanfordTokenizer will be deprecated in version 3.2.5.
Please use nltk.tag.corenlp.CoreNLPPOSTagger or nltk.tag.corenlp.CoreNLPNERTagger instead.
  super(StanfordPOSTagger, self).__init__(*args, **kwargs)
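The warning above points to the CoreNLP-based interface. A hedged sketch of what that could look like follows. It assumes a more recent NLTK than the 3.2.5 used in this notebook (one that provides `nltk.parse.corenlp.CoreNLPParser` with a `tagtype` argument) and a CoreNLP server already running locally on port 9000, started for example with `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000` from the unzipped CoreNLP directory. The wrapper name `tag_with_corenlp` is hypothetical.

```python
def tag_with_corenlp(tokens, url="http://localhost:9000"):
    """POS-tag a list of tokens through a running CoreNLP server.

    Assumes a server is listening at `url`; nothing is started here.
    """
    # Imported lazily so the function can be defined even where the
    # installed NLTK is too old to provide this module.
    from nltk.parse.corenlp import CoreNLPParser

    tagger = CoreNLPParser(url=url, tagtype="pos")
    return list(tagger.tag(tokens))
```

With the server up, `tag_with_corenlp(nltk.word_tokenize("He used to read books."))` should return a list of `(word, tag)` pairs in the same shape as `StanfordPOSTagger.tag`.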


In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
with open('readme.txt') as f:
    text = f.read()
    print(text)

He used to read books.


In [8]:
tagger.tag(nltk.word_tokenize("He used to read books."))

[('He', 'PRP'),
 ('used', 'VBD'),
 ('to', 'TO'),
 ('read', 'VB'),
 ('books', 'NNS'),
 ('.', '.')]

In [9]:
from nltk.stem import PorterStemmer
st = PorterStemmer()
# Stem every token before tagging to see how stemming changes the tags
stems = [st.stem(t) for t in nltk.word_tokenize("He used to read books.")]
tagger.tag(stems)

[('He', 'PRP'),
 ('use', 'VBP'),
 ('to', 'TO'),
 ('read', 'VB'),
 ('book', 'NN'),
 ('.', '.')]

### Using averaged_perceptron_tagger in NLTK

In [10]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [11]:
nltk.pos_tag(nltk.word_tokenize("He used to read books."))

[('He', 'PRP'),
 ('used', 'VBD'),
 ('to', 'TO'),
 ('read', 'VB'),
 ('books', 'NNS'),
 ('.', '.')]

# Syntactic Parsing

In [12]:
!wget 'https://nlp.stanford.edu/software/stanford-corenlp-latest.zip'
!unzip 'stanford-corenlp-latest.zip'

--2022-03-30 20:32:19--  https://nlp.stanford.edu/software/stanford-corenlp-latest.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://downloads.cs.stanford.edu/nlp/software/stanford-corenlp-latest.zip [following]
--2022-03-30 20:32:19--  https://downloads.cs.stanford.edu/nlp/software/stanford-corenlp-latest.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 505207915 (482M) [application/zip]
Saving to: ‘stanford-corenlp-latest.zip.1’


2022-03-30 20:33:54 (5.12 MB/s) - ‘stanford-corenlp-latest.zip.1’ saved [505207915/505207915]

Archive:  stanford-corenlp-latest.zip
replace stanford-corenlp-4.4.0/jaxb-api-2.4.0-b180830.0359-sources.jar? 

In [13]:
!pip install stanfordcorenlp



In [14]:
import stanfordcorenlp
sc = stanfordcorenlp.StanfordCoreNLP('/content/stanford-corenlp-4.4.0')

### Dependency parsing

In [15]:
text = "He used to read books."
dependencies = sc.dependency_parse(text)
dependencies

[('ROOT', 0, 2),
 ('nsubj', 2, 1),
 ('mark', 4, 3),
 ('xcomp', 2, 4),
 ('obj', 4, 5),
 ('punct', 2, 6)]

In [16]:
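tokens = nltk.word_tokenize(text)
# Each triple is (relation, head index, dependent index); indices are 1-based and 0 is ROOT.
for (t, w1, w2) in dependencies:
  # Use <= so that the last token (index == len(tokens)) is not skipped.
  if w1 <= len(tokens) and w2 <= len(tokens):
    print("%s --> %s (%s)" % (
        tokens[w2-1] if w2>0 else "",  # dependent
        tokens[w1-1] if w1>0 else "",  # head
        t))

used -->  (ROOT)
He --> used (nsubj)
to --> read (mark)
read --> used (xcomp)
books --> read (obj)
. --> used (punct)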


Descriptions of dependency relations: https://universaldependencies.org/u/dep/

Demo: https://nlp.stanford.edu/software/stanford-dependencies.html
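The flat list of triples can also be regrouped by head, which makes the tree structure easier to see. The following is a minimal sketch that uses the tokens and the `dependency_parse` output from the cells above as literals, so it runs without the CoreNLP server. The helper name `children_by_head` is hypothetical.

```python
from collections import defaultdict

# Tokens and dependency triples copied from the cells above.
tokens = ["He", "used", "to", "read", "books", "."]
dependencies = [("ROOT", 0, 2), ("nsubj", 2, 1), ("mark", 4, 3),
                ("xcomp", 2, 4), ("obj", 4, 5), ("punct", 2, 6)]


def children_by_head(dependencies, tokens):
    """Map each head index to its list of (relation, dependent word)."""
    tree = defaultdict(list)
    for rel, head, dep in dependencies:
        # Token indices are 1-based; index 0 is the artificial ROOT.
        tree[head].append((rel, tokens[dep - 1]))
    return dict(tree)


tree = children_by_head(dependencies, tokens)
for head, children in sorted(tree.items()):
    head_word = tokens[head - 1] if head > 0 else "ROOT"
    print(head_word, children)
# ROOT [('ROOT', 'used')]
# used [('nsubj', 'He'), ('xcomp', 'read'), ('punct', '.')]
# read [('mark', 'to'), ('obj', 'books')]
```

Keying by index rather than by word keeps repeated words in longer sentences apart.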


### Constituent parsing

In [17]:
parsed = sc.parse(text)
print(parsed)

(ROOT
  (S
    (NP (PRP He))
    (VP (VBD used)
      (S
        (VP (TO to)
          (VP (VB read)
            (NP (NNS books))))))
    (. .)))
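The innermost brackets of the parse are `(TAG word)` pairs, so the constituent tree also contains a POS tagging of the sentence. The sketch below pulls those pairs out with a regular expression, using the bracketed string above as a literal. It assumes that tokens contain no whitespace or literal parentheses, which holds for this sentence. The helper name `preterminals` is hypothetical.

```python
import re

# Bracketed parse copied from the output above.
parsed = """(ROOT
  (S
    (NP (PRP He))
    (VP (VBD used)
      (S
        (VP (TO to)
          (VP (VB read)
            (NP (NNS books))))))
    (. .)))"""


def preterminals(bracketed):
    """Return (word, tag) pairs from the innermost (TAG word) brackets."""
    # A preterminal is "(TAG word)" with no nested parentheses inside.
    pairs = re.findall(r"\(([^\s()]+) ([^\s()]+)\)", bracketed)
    return [(word, tag) for tag, word in pairs]


print(preterminals(parsed))
# [('He', 'PRP'), ('used', 'VBD'), ('to', 'TO'), ('read', 'VB'), ('books', 'NNS'), ('.', '.')]
```

The result matches the tagger output earlier in the notebook. With NLTK available, `nltk.Tree.fromstring(parsed).pos()` is a more robust way to get the same pairs.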
