# Part-of-Speech Tagging and Named Entity Recognition


### Recommended Reading

- Dan Jurafsky and James H. Martin. [__Speech and Language Processing__ (SLP)](https://web.stanford.edu/~jurafsky/slp3/) (3rd ed. draft)
- Steven Bird, Ewan Klein, and Edward Loper. [__Natural Language Processing with Python__ (NLTK)](https://www.nltk.org/book/)

### Covered Material

- SLP
    - [Chapter 8: Part-of-Speech Tagging (HMMs)](https://web.stanford.edu/~jurafsky/slp3/8.pdf)
- NLTK
    - [Chapter 5: Part of Speech Tagging](https://www.nltk.org/book/ch05.html) 

### Requirements

- [spaCy](https://spacy.io/)
- [NLTK](https://www.nltk.org/)

## 1. Sequence Labeling

### 1.1. Sequence Labeling and Classification
[Classification](https://en.wikipedia.org/wiki/Statistical_classification) is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

[Sequence Labeling](https://en.wikipedia.org/wiki/Sequence_labeling) is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values. It is a sub-class of [structured (output) learning](https://en.wikipedia.org/wiki/Structured_prediction), since we are predicting a *sequence* object rather than a discrete or real value predicted in classification problems.


- The problem can be treated as a set of independent classification tasks, one per member of the sequence;
- **BUT!** performance is generally improved by making the optimal label for a given element dependent on the choices of nearby elements;

Because the model is complex and the predicted variables are interdependent, exact prediction with a trained model, and exact training itself, are often computationally infeasible, so [approximate inference](https://en.wikipedia.org/wiki/Approximate_inference) and learning methods are used instead.

### 1.2. Sequence Labeling and Ngram Modeling
A [Markov Chain](https://en.wikipedia.org/wiki/Markov_chain) is a stochastic model used to describe sequences. It is the simplest [Markov Model](https://en.wikipedia.org/wiki/Markov_model). To make inference tractable, the process that generated the sequence is assumed to have the [Markov Property](https://en.wikipedia.org/wiki/Markov_property), i.e. future states depend only on the current state, not on the events that occurred before it. (An [ngram](https://en.wikipedia.org/wiki/N-gram) [language model](https://en.wikipedia.org/wiki/Language_model) is an $(n-1)$-order Markov Model.)

In Statistical Language Modeling, we model *observed sequences* as Markov Chains. Since the states of the process are *observable*, we only need to compute __transition probabilities__.
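
As a minimal sketch of what "computing transition probabilities" means for observable states, the snippet below estimates bigram probabilities by counting on a toy corpus. The corpus, the `<s>`/`</s>` boundary markers, and the helper name `transition_prob` are assumptions made for illustration; no smoothing is applied.

```python
from collections import Counter, defaultdict

# Toy corpus: each sentence is a sequence of observable states (words).
corpus = [
    ["<s>", "the", "dog", "barks", "</s>"],
    ["<s>", "the", "cat", "sleeps", "</s>"],
    ["<s>", "the", "dog", "sleeps", "</s>"],
]

# Count how often each state is followed by each other state.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        bigram_counts[prev][curr] += 1

def transition_prob(prev, curr):
    """Maximum-likelihood estimate of P(curr | prev); 0.0 for unseen contexts."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

# "the" occurs three times and is followed by "dog" twice.
print(transition_prob("the", "dog"))
print(transition_prob("dog", "barks"))
```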

In Sequence Labeling, we assume that *observed sequences* (__sentences__) have been generated by a Markov Process with *unobservable* (i.e. hidden) states (__labels__), i.e. [Hidden Markov Model](https://en.wikipedia.org/wiki/Hidden_Markov_model) (__HMM__). 
Since the states of the process are hidden and the output is observable, each state has a probability distribution over the possible output tokens, i.e. __emission probabilities__. 

Using these two probability distributions (__transition__ and __emission__), in sequence labeling, we are *inferring* the sequence of state transitions, given a sequence of observations.
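
The inference step can be sketched with the Viterbi algorithm, which finds the most probable hidden state sequence for a sequence of observations. The three-tag model and all probabilities below are made-up values for illustration, not estimates from a corpus, and the function is a plain-probability sketch (no log space, no smoothing).

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrace from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))

states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {  # transition probabilities P(next tag | tag)
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {  # emission probabilities P(word | tag)
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "walk": 0.2, "park": 0.3},
    "VERB": {"walk": 0.6, "barks": 0.4},
}

print(viterbi(["the", "dog", "barks"], states, start_p, trans_p, emit_p))
print(viterbi(["the", "walk"], states, start_p, trans_p, emit_p))
```

In the second call, "walk" is more likely to be emitted by `VERB` than by `NOUN`, but the transition from `DET` makes `NOUN` the better choice overall; this is the effect of combining the two distributions.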

### 1.3. The General Setting for Sequence Labeling

- Create __training__ and __testing__ sets by tagging a certain amount of text by hand
    - i.e. map each word in corpus to a tag
- Train tagging model to extract generalizations from the annotated __training__ set
- Evaluate the trained tagging model on the annotated __testing__ set
- Use the trained tagging model to annotate new texts
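
The steps above can be sketched end to end with a deliberately simple model: a most-frequent-tag baseline that labels each word independently. The toy sentences, the `NOUN` default for unseen words, and the helper name `tag` are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Step 1: a tiny hand-tagged corpus, split into training and testing sets.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
]
test = [[("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB"), ("soundly", "ADV")]]

# Step 2: "train" by remembering the most frequent tag of each word.
tag_counts = defaultdict(Counter)
for sentence in train:
    for word, word_tag in sentence:
        tag_counts[word][word_tag] += 1
model = {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}

def tag(words, default="NOUN"):
    """Tag each word independently; unseen words get the default tag."""
    return [(w, model.get(w, default)) for w in words]

# Step 3: evaluate token-level accuracy on the held-out test set.
correct = total = 0
for sentence in test:
    words = [w for w, _ in sentence]
    for (_, gold), (_, pred) in zip(sentence, tag(words)):
        correct += gold == pred
        total += 1
print("Accuracy: {:6.4f}".format(correct / total))

# Step 4: annotate new text.
print(tag(["the", "dog", "sleeps"]))
```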

## 2. Part-of-Speech Tagging

Part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging, is the process of marking up each word in a text as corresponding to a particular part of speech, based on both its definition and its context.

Tag sets vary from corpus to corpus.

### 2.1. Universal Part of Speech Tags

The Universal POS tag set is a simplified, unified set of part-of-speech tags proposed to standardize annotation across corpora and languages.
The number of defined tags varies from 12 ([Petrov et al/Google/NLTK](https://github.com/slavpetrov/universal-pos-tags)) to 17 ([Universal Dependencies/spaCy](https://universaldependencies.org/u/pos/index.html); tags found only in this set are shown in *italics*).



| Tag  | Meaning | English Examples |
|:-----|:--------|:-----------------|
| __Open Class__ |||
| NOUN | noun (common and proper) | year, home, costs, time, Africa
| VERB | verb (all tenses and modes) | is, say, told, given, playing, would
| ADJ  | adjective           | new, good, high, special, big, local
| ADV  | adverb              | really, already, still, early, now
| *PROPN* | proper noun (split from NOUN) | Africa
| *INTJ*  | interjection (split from X) | oh, ouch
| __Closed Class__ |||
| DET  | determiner, article | the, a, some, most, every, no, which
| PRON | pronoun             | he, their, her, its, my, I, us
| ADP  | adposition (prepositions and postpositions) | on, of, at, with, by, into, under
| NUM  | numeral             | twenty-four, fourth, 1991, 14:24
| PRT (*PART*) | particles or other function words | at, on, out, over, per, that, up, with
| CONJ | conjunction         | and, or, but, if, while, although
| *AUX* | auxiliary (split from VERB) | have, is, should
| *CCONJ*  | coordinating conjunction (splits CONJ) | or, and
| *SCONJ*  | subordinating conjunction (splits CONJ) | if, while
| __Other__ |||
| .    | punctuation marks   | . , ; !
| X    | other               | foreign words, typos, abbreviations: ersatz, esprit, dunno, gr8, univeristy
| *SYM* | symbols (split from X) | $, :) 




## 3. Part-of-Speech Tagging with spaCy & NLTK

### 3.1. Part-of-Speech Tagging with spaCy

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

# un-comment the lines below, if you get 'ModuleNotFoundError'
# import en_core_web_sm
# nlp = en_core_web_sm.load()


# let's print spaCy pipeline
print([key for key, model in nlp.pipeline])


In [None]:
text = "Oh. I have seen a man with a telescope in Antarctica."

doc = nlp(text)

# tokens
print([t.text for t in doc])

# Fine grained POS-tags
print([t.tag_ for t in doc])

# Coarse POS-tags (from Universal POS Tag set)
print([t.pos_ for t in doc])

### 3.2. Part-of-Speech Tagging with NLTK

In [None]:
import nltk

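nltk.download('universal_tagset')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')  # tokenizer models for word_tokenize (NLTK 3.9+: 'punkt_tab', 'averaged_perceptron_tagger_eng')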

In [None]:
import nltk

text = "Oh. I have seen a man with a telescope in Antarctica."

# tokenization
tokens = nltk.word_tokenize(text)
print(tokens)

# POS-tagging (with WSJ Tags)
print(nltk.pos_tag(tokens))

# POS-tagging with Universal Tags
print(nltk.pos_tag(tokens, tagset='universal'))


### 3.3. Training POS-Tagger with NLTK

Training a tagger requires two ingredients:

- A manually POS-tagged corpus
- A Sequence Labeling (Tagging) algorithm

#### 3.3.1. Corpora for POS-Tagging
NLTK provides several corpora, most of which are POS-tagged. We will use the WSJ sample (`treebank`) with the universal tag set (converted automatically through NLTK's internal mapping).

In [None]:
# download treebank
nltk.download('treebank')

In [32]:
from nltk.corpus import treebank

# WSJ POS-Tags
print(treebank.tagged_sents()[:1])

# Universal POS-Tags
print(treebank.tagged_sents(tagset='universal')[:1])

[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]]
[[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')]]


#### 3.3.2. NLTK Taggers

NLTK provides several tagging algorithms, including 

- rule-based taggers
    - Regular Expression Tagger: assigns tags to tokens by comparing their word strings to a series of regular expressions.

- [Pre-Trained Taggers](http://www.nltk.org/api/nltk.tag.html)
    - HunPoS
    - Senna
    - Stanford Tagger
    
- trainable taggers
    - `Brill Tagger`: Brill's transformational rule-based tagger assigns an initial tag sequence to a text and then applies an ordered list of transformational rules to correct the tags of individual tokens. It learns the rules from a corpus.
    - [Greedy Averaged Perceptron](https://explosion.ai/blog/part-of-speech-pos-tagger-in-python)
    - [TnT](http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf)
    - Hidden Markov Models
    - Conditional Random Fields
    - Sequential:
        - Affix Tagger: A tagger that chooses a token's tag based on a leading or trailing substring of its word string.
        - Ngram Tagger: A tagger that chooses a token's tag based on its word string and on the tags of the preceding _n_ words.
            - Unigram Tagger
            - Bigram Tagger
            - Trigram Tagger

        - Classifier-based POS Tagger: A sequential tagger that uses a classifier to choose the tag for each token in a sentence.
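
To illustrate how an n-gram tagger uses the preceding tags and why sequential taggers are usually chained, here is a minimal pure-Python sketch of a bigram tagger that backs off to a unigram tagger and then to a default tag. It is not NLTK's implementation; in NLTK the same idea is exposed through the `backoff` argument of the sequential taggers. The toy data and the `NOUN` default are assumptions.

```python
from collections import Counter, defaultdict

train = [
    [("they", "PRON"), ("walk", "VERB"), ("home", "NOUN")],
    [("a", "DET"), ("walk", "NOUN"), ("helps", "VERB")],
    [("they", "PRON"), ("walk", "VERB"), ("fast", "ADV")],
]

unigram = defaultdict(Counter)   # word -> tag counts
bigram = defaultdict(Counter)    # (previous tag, word) -> tag counts
for sentence in train:
    prev_tag = "<s>"
    for word, word_tag in sentence:
        unigram[word][word_tag] += 1
        bigram[(prev_tag, word)][word_tag] += 1
        prev_tag = word_tag

def tag(words, default="NOUN"):
    """Try the bigram context first, then the word alone, then the default tag."""
    tagged, prev_tag = [], "<s>"
    for word in words:
        if (prev_tag, word) in bigram:
            choice = bigram[(prev_tag, word)].most_common(1)[0][0]
        elif word in unigram:
            choice = unigram[word].most_common(1)[0][0]
        else:
            choice = default
        tagged.append((word, choice))
        prev_tag = choice
    return tagged

print(tag(["a", "walk", "helps"]))   # "walk" after DET: bigram context is known
print(tag(["we", "walk", "home"]))   # "we" is unseen, so "walk" backs off to unigram
```

Without the backoff chain, a single unseen context would leave the token, and every later token that depends on its tag, without a usable prediction.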
    


#### 3.3.3. Testing a POS Tagger

In [33]:
# Prepare Training & Test Splits as 80%/20%
import math

total_size = len(treebank.tagged_sents())
train_indx = math.ceil(total_size * 0.8)
trn_data = treebank.tagged_sents(tagset='universal')[:train_indx]
tst_data = treebank.tagged_sents(tagset='universal')[train_indx:]

print("Total: {}; Train: {}; Test: {}".format(total_size, len(trn_data), len(tst_data)))


Total: 3914; Train: 3132; Test: 782


### 3.4. Rule-based POS-Tagging

In [34]:
# rule-based tagging
from nltk.tag import RegexpTagger

# rules from NLTK adapted to Universal Tag Set & extended
rules = [
    (r'^-?[0-9]+(.[0-9]+)?$', 'NUM'),   # cardinal numbers
    (r'(The|the|A|a|An|an)$', 'DET'),   # articles
    (r'.*able$', 'ADJ'),                # adjectives
    (r'.*ness$', 'NOUN'),               # nouns formed from adjectives
    (r'.*ly$', 'ADV'),                  # adverbs
    (r'.*s$', 'NOUN'),                  # plural nouns
    (r'.*ing$', 'VERB'),                # gerunds
    (r'.*ed$', 'VERB'),                 # past tense verbs
    (r'[\.,!\?:;\'"]', '.'),            # punctuation (extension) 
    (r'.*', 'NOUN')                     # nouns (default)
]

re_tagger = RegexpTagger(rules)

# tag one example sentence (the first sentence of the corpus, part of the training split)
for s in treebank.sents()[:train_indx]:
    print("INPUT: {}".format(s))
    print("TAG  : {}".format(re_tagger.tag(s)))
    break
    
# evaluation
accuracy = re_tagger.accuracy(tst_data)
# on older NLTK versions, use re_tagger.evaluate(tst_data) instead
print("Accuracy: {:6.4f}".format(accuracy))

INPUT: ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
TAG  : [('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'NOUN'), (',', '.'), ('will', 'NOUN'), ('join', 'NOUN'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'NOUN'), ('a', 'DET'), ('nonexecutive', 'NOUN'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')]
Accuracy: 0.5360
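
In the output above, words such as 'old', 'will' and 'as' all end up as `NOUN` because no earlier pattern matches them and the catch-all rule applies. The sketch below mimics this behaviour with plain `re`, assuming first-match semantics (the first pattern that matches at the start of the word wins); `toy_rules` and `first_match_tag` are illustrative names, not part of NLTK.

```python
import re

toy_rules = [
    (r'.*ly$', 'ADV'),    # adverbs
    (r'.*s$', 'NOUN'),    # plural nouns
    (r'.*ing$', 'VERB'),  # gerunds
    (r'.*', 'NOUN'),      # default
]

def first_match_tag(word, rules):
    """Return the tag of the first rule whose pattern matches the word."""
    for pattern, rule_tag in rules:
        if re.match(pattern, word):
            return rule_tag
    return None

for word in ["quickly", "family", "years", "is", "running", "old"]:
    print(word, first_match_tag(word, toy_rules))
```

Note that "family" is caught by the adverb rule and "is" by the plural-noun rule: suffix patterns over-generate, which is why rule order and explicit word lists for closed-class words matter.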


##### Exercise 1

- Extend the rule set of RegexpTagger to handle closed-class words (similar to punctuation & DET):

    - prepositions (ADP)
        - in, among, of, above, etc. (add as many as you want)
    - particles (PRT)
        - to, well, up, now, not (add as many as you want)
    - pronouns (PRON)
        - I, you, he, she, it, they, we (add as many as you want)
    - conjunctions (CONJ)
        - and, or, but, while, when, since (add as many as you want)

- Evaluate the extended tagger on the test set

In [None]:
aug_rules = [
    (r'^-?[0-9]+(.[0-9]+)?$', 'NUM'),   # cardinal numbers
    (r'(The|the|A|a|An|an)$', 'DET'),   # articles
    (r'.*able$', 'ADJ'),                # adjectives
    (r'.*ness$', 'NOUN'),               # nouns formed from adjectives
    (r'.*ly$', 'ADV'),                  # adverbs
    (r'.*s$', 'NOUN'),                  # plural nouns
    (r'.*ing$', 'VERB'),                # gerunds
    (r'.*ed$', 'VERB'),                 # past tense verbs
    (r'[\.,!\?:;\'"]', '.'),            # punctuation (extension) 
    (r'$', 'ADP'),                      # Add prepositions
    (r'$', 'PRT'),                      # Add particles
    (r'$', 'PRON'),                     # Add pronouns
    (r'$', 'CONJ'),                     # Add conjunctions
    (r'.*', 'NOUN')                     # nouns (default)

]
aug_re_tagger = RegexpTagger(aug_rules)

# tag one example sentence (the first sentence of the corpus, part of the training split)
for s in treebank.sents()[:train_indx]:
    print("INPUT: {}".format(s))
    print("TAG  : {}".format(aug_re_tagger.tag(s)))
    break

accuracy = aug_re_tagger.accuracy(tst_data)
# on older NLTK versions, use aug_re_tagger.evaluate(tst_data) instead
print("Accuracy: {:6.4f}".format(accuracy))

#### 3.3.4. Training HMM POS Tagger

In [None]:
# training hmm on treebank
import nltk.tag.hmm as hmm

hmm_model = hmm.HiddenMarkovModelTrainer()
hmm_tagger = hmm_model.train(trn_data)

# tag one example sentence (the first sentence of the corpus, part of the training split)
for s in treebank.sents()[:train_indx]:
    print("INPUT: {}".format(s))
    print("TAG  : {}".format(hmm_tagger.tag(s)))
    print("PATH : {}".format(hmm_tagger.best_path(s)))
    break
    
# evaluation
accuracy = hmm_tagger.accuracy(tst_data)
# on older NLTK versions, use hmm_tagger.evaluate(tst_data) instead
print("Accuracy: {:6.4f}".format(accuracy))

# 4 Named Entity Recognition (NER)

## 4.1 Shallow Parsing

[Shallow Parsing](https://en.wikipedia.org/wiki/Shallow_parsing) is a kind of Sequence Labeling. Unlike plain Sequence Labeling, where there is one output label (tag) per token, Shallow Parsing additionally performs __chunking__: segmentation of the input sequence into constituents. Chunking is required to identify categories (or types) of *multi-word expressions*. In other words, in Shallow Parsing the members of a sequence are mapped to higher-order units (i.e. grouped together, as in `[['a'],['b','c']]`) and each unit is assigned a category. The sequence is thus chunked into sub-sequences.
For instance, we want to capture the information that an expression like `"New York"`, which consists of 2 tokens, constitutes a single unit.

Some examples of Sequence Labeling and Shallow Parsing tasks:

- [Sequence Labeling](https://en.wikipedia.org/wiki/Sequence_labeling)
    - [Part-of-Speech Tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging)
- [Shallow Parsing](https://en.wikipedia.org/wiki/Shallow_parsing) (Chunking)
    - [Phrase Chunking](https://en.wikipedia.org/wiki/Phrase_chunking)
    - [Named-Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) 
    - [Semantic Role Labeling](https://en.wikipedia.org/wiki/Semantic_role_labeling)
    - Dependency [Parsing](https://en.wikipedia.org/wiki/Parsing) 
    - Discourse Parsing
    - (Natural/Spoken) __Language Understanding__: Concept Tagging/Entity Extraction

## 4.2 Encoding Segmentation Information: CoNLL Corpus Format

A corpus in CoNLL format consists of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or "grid") of values, where each line corresponds to a single word, and each column corresponds to an annotation type.
The set of columns used by CoNLL-style files can vary from corpus to corpus.

```
        Alex       B-PER
        is         O
        going      O
        to         O
        Los        B-LOC
        Angeles    I-LOC
        in         O
        California B-LOC
```

- The notation scheme is used to label *multi-word* spans in token-per-line format.
    - *Los Angeles* is a *LOCATION* concept that has 2 tokens
- Both prefix and suffix notations are common:
    - prefix: __B-LOC__
    - suffix: __LOC-B__

- Meaning of Prefixes (IOB tags)
    - __I__ for (__I__)nside of span
    - __O__ for (__O__)utside of span (no prefix or suffix, just `O`)
    - __B__ for (__B__)eginning of span
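
To make the encoding concrete, the sketch below reads a CoNLL-style string and groups prefix-style IOB tags back into labeled spans. The helper names `read_conll` and `iob_to_spans` are hypothetical, and the reader assumes whitespace-separated columns with the token first and the IOB tag last.

```python
conll_text = """\
Alex B-PER
is O
going O
to O
Los B-LOC
Angeles I-LOC
in O
California B-LOC
"""

def read_conll(text):
    """Split CoNLL-style text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line = sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences

def iob_to_spans(tagged):
    """Group prefix-style IOB tags into (label, [tokens]) spans."""
    spans, current = [], None
    for token, iob in tagged:
        if iob.startswith("B-"):
            current = (iob[2:], [token])
            spans.append(current)
        elif iob.startswith("I-") and current and current[0] == iob[2:]:
            current[1].append(token)
        else:
            current = None   # "O" or an inconsistent I- tag closes the span
    return spans

sentence = read_conll(conll_text)[0]
print(iob_to_spans(sentence))
```

In this sketch an `I-` tag that does not continue a span of the same type is simply ignored; other conventions treat it as the start of a new span.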

## 4.3 Sentence chunking and NER with spaCy

In [None]:
# Let's take a reasonably long document

# The original document can be found at https://www.reuters.com/technology/nvidia-chases-30-billion-custom-chip-market-with-new-unit-sources-2024-02-09/

document = (
    "Nvidia (NVDA.O), opens new tab is building a new business unit focused on designing bespoke chips for cloud computing firms and others, including advanced artificial intelligence (AI) processors, nine sources familiar with its plans told Reuters. The dominant global designer and supplier of AI chips aims to capture a portion of an exploding market for custom AI chips and shield itself from the growing number of companies pursuing alternatives to its products. The Santa Clara, California-based company controls about 80% of high-end AI chip market, a position that has sent its stock market value up 40% so far this year to $1.73 trillion after it more than tripled in 2023. Nvidia\'s customers, which include ChatGPT creator OpenAI, Microsoft (MSFT.O), opens new tab, Alphabet (GOOGL.O), opens new tab and Meta Platforms (META.O), opens new tab, have raced to snap up the dwindling supply of its chips to compete in the fast-emerging generative AI sector. Its H100 and A100 chips serve as a generalized, all-purpose AI processor for many of those major customers. But the tech companies have started to develop their own internal chips for specific needs. Doing so helps reduce energy consumption, and potentially can shrink the cost and time to design. Nvidia is now attempting to play a role in helping these companies develop custom AI chips that have flowed to rival firms such as Broadcom (AVGO.O), opens new tab and Marvell Technology (MRVL.O), opens new tab, said the sources, who declined to be identified because they were not authorized to speak publicly. \"If you're really trying to optimize on things like power, or optimize on cost for your application, you can't afford to go drop an H100 or A100 in there,\" Greg Reichow, general partner at venture capital firm Eclipse Ventures said in an interview. "
    "\" You want to have the exact right mixture of compute and just the kind of compute that you need.\" Nvidia does not disclose H100 prices, which are higher than for the prior-generation A100, but each chip can sell for $16,000 to $100,000 depending on volume and other factors. Meta plans to bring its total stock to 350,000 H100s this year. Nvidia officials have met with representatives from Amazon.com (AMZN.O), opens new tab, Meta, Microsoft, Google and OpenAI to discuss making custom chips for them, two sources familiar with the meetings said. Beyond data center chips, Nvidia has pursued telecom, automotive and video game customers. Nvidia shares rose 2.75% after the Reuters report, helping lift chip stocks overall. Marvell shares dropped 2.78%. In 2022, Nvidia said it would let third-party customers integrate some of its proprietary networking technology with their own chips. It has said nothing about the program since, and Reuters is reporting its wider ambitions for the first time. A Nvidia spokesperson declined to comment beyond the company\'s 2022 announcement. Dina McKinney, a former Advanced Micro Devices (AMD.O), opens new tab and Marvell executive, heads Nvidia\'s custom unit and her team\'s goal is to make its technology available for customers in cloud, 5G wireless, video games and automotives, a LinkedIn profile said. Those mentions were scrubbed and her title changed after Reuters sought comment from Nvidia. Amazon, Google, Microsoft, Meta and OpenAI declined to comment."
)

In [None]:
# We load the spaCy model for the English language and process our document
import spacy 
nlp = spacy.load('en_core_web_sm')
doc = nlp(document)

In [None]:
# We split the document into sentences with spaCy
for s_id, sent in enumerate(doc.sents):
    print(s_id+1, ":", sent)

In [None]:
# For each sentence we print the entities
for s_id, sent in enumerate(doc.sents):
    ents = [(entity.text, entity.label_) for entity in sent.ents]
    print(s_id+1, ":", ents)
# or if you want the same info but with IOB tags
for s_id, sent in  enumerate(doc.sents):
    print(s_id+1, ":", [w.ent_iob_ + "-" + w.ent_type_ if w.ent_iob_ != "O" else w.ent_iob_ for w in sent])

## 4.4 Train a NER model with NLTK
We will train a Hidden Markov Model on the Spanish portion of the CoNLL-2002 corpus (`conll2002`), which we use for both training and testing.

In [None]:
import nltk
nltk.download('conll2002')

In [None]:
from nltk.corpus import conll2002
print(len(conll2002.tagged_sents()))
print(conll2002._chunk_types)
print(conll2002.sents('esp.train')[0])
print(conll2002.tagged_sents('esp.train')[0])
print(conll2002.chunked_sents('esp.train')[0])
print(conll2002.iob_sents('esp.train')[0])

In [None]:
import nltk.tag.hmm as hmm

hmm_model = hmm.HiddenMarkovModelTrainer()

print(conll2002.iob_sents('esp.train')[0])

# let's get only word and iob-tag
trn_sents = [[(text, iob) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.train')]
print(trn_sents[0])

tst_sents = [[(text, iob) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.testa')]

hmm_ner = hmm_model.train(trn_sents)
    
# evaluation
accuracy = hmm_ner.accuracy(tst_sents)
# Evaluation at Token LEVEL!!!
print("Accuracy: {:6.4f}".format(accuracy))

A shallow parsing model has to be evaluated at the **chunk level**. For this, we can use the `evaluate` function of the `conll` script.

In [None]:
import os
import sys
sys.path.insert(0, os.path.abspath('../src/'))

from conll import evaluate
# for nice tables
import pandas as pd

# getting references (try to replace testa with testb)

refs = [[(text, iob) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.testa')]
print(refs[0])
# getting hypotheses
hyps = [hmm_ner.tag(s) for s in conll2002.sents('esp.testa')]
print(hyps[0])
results = evaluate(refs, hyps)

# The total F1 is a micro-F1

pd_tbl = pd.DataFrame.from_dict(results, orient='index')
pd_tbl.round(decimals=3)
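
To see why token-level accuracy and chunk-level scores can differ, here is a minimal sketch that scores a hypothesis by exact span match. It is not the implementation of the `conll` script, only an illustration under the assumption of prefix-style IOB tags; `spans` and `chunk_prf` are hypothetical helper names.

```python
def spans(tags):
    """Return the set of (label, start, end) chunks encoded by prefix-style IOB tags."""
    found, start, label = set(), None, None
    for i, iob in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing chunk
        inside = iob.startswith("I-") and iob[2:] == label
        if start is not None and not inside:
            found.add((label, start, i))
            start, label = None, None
        if iob.startswith("B-"):
            start, label = i, iob[2:]
    return found

def chunk_prf(ref_tags, hyp_tags):
    """Precision, recall and F1 over chunks that match exactly in label and boundaries."""
    ref, hyp = spans(ref_tags), spans(hyp_tags)
    tp = len(ref & hyp)
    precision = tp / len(hyp) if hyp else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

ref = ["B-PER", "O", "O", "O", "B-LOC", "I-LOC", "O", "B-LOC"]
hyp = ["B-PER", "O", "O", "O", "B-LOC", "O", "O", "B-LOC"]

token_acc = sum(r == h for r, h in zip(ref, hyp)) / len(ref)
print("token accuracy:", token_acc)
print("chunk P/R/F1:", chunk_prf(ref, hyp))
```

A single wrong token (`I-LOC` predicted as `O`) costs one token out of eight, but it breaks one of the three reference chunks, so the chunk-level scores drop much further than the token accuracy.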


### Exercise 2 
Evaluate the spaCy NER model on the conll2002 corpus and compare the results with the model trained with NLTK.

To do this you have to:

- Load the spaCy model for the Spanish language (`es_core_news_sm`) or you can try with larger models
- Retrieve spaCy prediction with IOB schema
- Evaluate the model with the conll script

In [None]:
from spacy.tokenizer import Tokenizer

nlp = spacy.load("es_core_news_sm")
# We overwrite the spaCy tokenizer with a custom one that splits on whitespace only. However, it is a suboptimal solution.
nlp.tokenizer = Tokenizer(nlp.vocab)

# getting references
refs = []
# Use spaCy model for predicting the Named Entities
hyps = []


results = evaluate(refs, hyps)
# The total F1 is a micro-F1
pd_tbl = pd.DataFrame.from_dict(results, orient='index')
pd_tbl.round(decimals=3)

## Lab Exercise: Comparative Evaluation of NLTK Tagger and spaCy Tagger

Train and evaluate NgramTagger
- experiment with different tagger parameters
- some of them have *cut-off*

Evaluate `spacy` POS-tags on the same test set
- create mapping from spacy to NLTK POS-tags 
    - SPACY list https://universaldependencies.org/u/pos/index.html
    - NLTK list https://github.com/slavpetrov/universal-pos-tags
- convert output to the required format (see format above)
    - flatten into a list
- evaluate using `accuracy` from `nltk.metrics` 
    - [link](https://www.nltk.org/_modules/nltk/metrics/scores.html#accuracy)
        
**Dataset**: treebank <br>
**Expected output**: NLTK: Accuracy SPACY: Accuracy
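
The conversion and scoring steps can be sketched on toy data as follows. The sentences and predicted tags are hypothetical, and `ud_to_nltk` covers only the tags that occur in them. The manual accuracy computation stands in for `nltk.metrics.accuracy`, which likewise compares two flat, equal-length lists.

```python
# Hypothetical tagger output in the 17-tag Universal Dependencies scheme.
predicted = [
    [("Pierre", "PROPN"), ("will", "AUX"), ("join", "VERB"), (".", "PUNCT")],
    [("Oh", "INTJ"), ("and", "CCONJ"), ("up", "ADP")],
]
# Gold annotation in the 12-tag scheme used by NLTK.
gold = [
    [("Pierre", "NOUN"), ("will", "VERB"), ("join", "VERB"), (".", ".")],
    [("Oh", "X"), ("and", "CONJ"), ("up", "PRT")],
]

# Only the entries needed for the toy data; unlisted tags map to themselves.
ud_to_nltk = {"PROPN": "NOUN", "AUX": "VERB", "PUNCT": ".", "INTJ": "X", "CCONJ": "CONJ"}

def flatten(sentences, mapping=None):
    """Flatten tagged sentences into one list of tags, optionally remapping them."""
    mapping = mapping or {}
    return [mapping.get(t, t) for sentence in sentences for _, t in sentence]

reference = flatten(gold)
hypothesis = flatten(predicted, ud_to_nltk)
accuracy = sum(r == h for r, h in zip(reference, hypothesis)) / len(reference)
print("Accuracy: {:6.4f}".format(accuracy))
```

The comparison is only meaningful when both sides have the same tokenization, which is why the sanity check on sentence lengths matters before flattening.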

In [None]:
# See above for further details
mapping_spacy_to_NLTK = {
    "ADJ": "ADJ",
    "ADP": "ADP",
    "ADV": "ADV",
    "AUX": "VERB",
    "CCONJ": "CONJ",
    "DET": "DET",
    "INTJ": "X",
    "NOUN": "NOUN",
    "NUM": "NUM",
    "PART": "PRT",
    "PRON": "PRON",
    "PROPN": "NOUN",
    "PUNCT": ".",
    "SCONJ": "CONJ",
    "SYM": "X",
    "VERB": "VERB",
    "X": "X"
}

In [None]:
from nltk.corpus import treebank
import spacy
import en_core_web_sm
from spacy.tokenizer import Tokenizer
nlp = en_core_web_sm.load()
# We overwrite the spaCy tokenizer with a custom one that splits on whitespace only
nlp.tokenizer = Tokenizer(nlp.vocab) # Tokenize by whitespace
# Sanity check
for id_sent, sent in enumerate(treebank.sents()):
    doc = nlp(" ".join(sent))
    if len([x.text for x in doc]) != len(sent):
        print(id_sent, sent)