Natural Language Processing  
author: D.Thébault

Based on NLP Demystified (YouTube)

# Part I: Fundamentals of NLP

## 1. **Preprocessing**

### 1.0. **Installation**

Install pyenv with Homebrew:

```{bash}
brew update
brew install pyenv
```

Install Python 3.11 with pyenv:

```{bash}
pyenv install 3.11.0
```

Set Python 3.11 as the local version:

```{bash}
pyenv local 3.11.0
```

Create a virtual environment with Python 3.11:

```{bash}
virtualenv -p $(pyenv which python3.11) myenv
```

In [15]:
# env
#!conda create --name env python=3.11
#!conda activate env

# Spacy
#!conda uninstall -y numpy h5py spacy
#!conda install numpy h5py spacy
#!conda uninstall -y numpy h5py spacy
#!conda install --upgrade numpy h5py
#!conda install ipykernel
#!python -m spacy download en_core_web_sm
#!python -m spacy info

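NLTK resources:

- 'punkt': tokenizer that uses a trained (probabilistic) model to decide, for instance, whether a "." marks the end of a sentence or belongs to an abbreviation such as "Dr. Smith".
- 'stopwords': lists of stop words.
- 'wordnet': the WordNet lexical database, used for lemmatization.
- 'omw-1.4': the Open Multilingual Wordnet.
- 'averaged_perceptron_tagger': POS tagger.
- 'maxent_ne_chunker': named entity recognition (NER) chunker.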

### 1.1. **Tokenization**

Tokenization is the process of breaking our documents into tokens (words, punctuation, numbers).  

It is the first step.  

Tokenization = Documents $\Rightarrow$ sentences $\Rightarrow$ tokens (words, punctuation, numbers) 

In [16]:
"He didn't want to pay $20 for the book.".split(' ')

['He', "didn't", 'want', 'to', 'pay', '$20', 'for', 'the', 'book.']

<u>Problems:</u> "$20" and "book." are not split into separate tokens. We could use a regular expression, but how would we handle "N.Y.C."?
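To make the limits of the regular-expression approach concrete, here is a minimal sketch using Python's standard `re` module. The pattern and the helper name `regex_tokenize` are assumptions chosen for this illustration; they are not taken from the course.

```python
import re

def regex_tokenize(text):
    # A run of word characters is one token;
    # any other non-space character becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

print(regex_tokenize("He didn't want to pay $20 for the book."))
print(regex_tokenize("N.Y.C."))
```

"$20" and "book." are now split correctly, but "didn't" is cut into three pieces and "N.Y.C." into six, which is why a dedicated tokenizer is usually preferred.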

**Definitions:**  

- **<u>Word</u>**: the smallest unit of <u>speech</u> that carries some meaning <u>on its own</u>.

    "full moon": one word or two? Should it be mapped to a single meaning or not? Under our definition, it is two words.

- **<u>Morpheme</u>**: the smallest unit of speech that has a meaning but doesn't necessarily stand on its own.  

    Examples: prefixes and suffixes such as -ing, re-, pre-, un-.

- **<u>Grapheme</u>**: the smallest functional unit of a writing system. In English, these are letters.

In [17]:
import spacy
nlp = spacy.load('en_core_web_sm')

sentence = "He didn't want to pay $20 for the book."
doc = nlp(sentence)
doc

He didn't want to pay $20 for the book.

In [18]:
# We can iterate over this Doc object and view the tokens.
print([t.text for t in doc])

['He', 'did', "n't", 'want', 'to', 'pay', '$', '20', 'for', 'the', 'book', '.']


In [19]:
# We can view an individual token by indexing into the Doc object.
print(doc[0])

He


In [20]:
# A Doc object is a container of other objects, namely Token and Span objects.
print(type(doc[0]))

<class 'spacy.tokens.token.Token'>


In [21]:
# Slicing a Doc object returns a Span object.
print(doc[0:3])
print(type(doc[0:3]))

He didn't
<class 'spacy.tokens.span.Span'>


In [22]:
# Access a token's index in a sentence.
print([(t.text, t.i) for t in doc])

[('He', 0), ('did', 1), ("n't", 2), ('want', 3), ('to', 4), ('pay', 5), ('$', 6), ('20', 7), ('for', 8), ('the', 9), ('book', 10), ('.', 11)]


In [23]:
# You can view the original input like so:
print(doc.text)

He didn't want to pay $20 for the book.


You can learn more about the Token and Span objects here:  
https://spacy.io/api/token  
https://spacy.io/api/span

### 1.2. **Case Folding**

Lowercasing or uppercasing all tokens.

**Definitions:**  

- **<u>Vocabulary</u>**: The set of all **unique** tokens in a corpus.

In the following example, case folding leads to information loss: the proper noun "Cook" becomes "cook" and can no longer be told apart from the verb.

In [24]:
import spacy
nlp = spacy.load('en_core_web_sm')

sentence = "Mr. Cook went into the kitchen to cook dinner."
doc = nlp(sentence)
print([t.lower_ for t in doc])

['mr.', 'cook', 'went', 'into', 'the', 'kitchen', 'to', 'cook', 'dinner', '.']


In [25]:
# We skip lowercasing when the token starts a sentence.
print([t.lower_ if not t.is_sent_start else t.text for t in doc])

['Mr.', 'cook', 'went', 'into', 'the', 'kitchen', 'to', 'cook', 'dinner', '.']
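
The effect of case folding on the vocabulary can be checked without any library. The sketch below uses a hand-written token list (an assumption made for this illustration) and compares the number of unique tokens before and after lowercasing.

```python
# Hand-written token list, for illustration only.
tokens = ["The", "cook", "thanked", "Mr.", "Cook", "and", "the", "other", "cook"]

# The vocabulary is the set of unique tokens.
vocabulary = set(tokens)
folded_vocabulary = {t.lower() for t in tokens}

print(len(vocabulary), sorted(vocabulary))
print(len(folded_vocabulary), sorted(folded_vocabulary))
```

The folded vocabulary is smaller (6 entries instead of 8), but "Cook" the person and "cook" the profession now share a single entry.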


### 1.3. **Stop Word Removal**

Removing words which occur frequently but carry little information.  

{the, a, of, an, this, that, ...}  

Be careful: in the following example, removing stop words reverses the meaning of the second sentence, because "not" is dropped.  
Whether to remove stop words is up to you, although it is common practice.  
The stop word list can also be customized for your own purpose.

In [26]:
import spacy
nlp = spacy.load('en_core_web_sm')

sentence = "I saw the movie last night. I was not amused."
doc = nlp(sentence)

# Print tokens and indicate if they are stop words
for token in doc:
    print(f"Token: {token.text}, Is Stop Word: {token.is_stop}")

# Remove stop words
filtered_sentence = ' '.join([token.text for token in doc if not token.is_stop])
print(filtered_sentence)

Token: I, Is Stop Word: True
Token: saw, Is Stop Word: False
Token: the, Is Stop Word: True
Token: movie, Is Stop Word: False
Token: last, Is Stop Word: True
Token: night, Is Stop Word: False
Token: ., Is Stop Word: False
Token: I, Is Stop Word: True
Token: was, Is Stop Word: True
Token: not, Is Stop Word: True
Token: amused, Is Stop Word: False
Token: ., Is Stop Word: False
saw movie night . amused .
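
As a minimal sketch of a customized stop word list, the example below keeps negations so that "not amused" survives filtering. The stop list, the negation set, and the helper `remove_stop_words` are all hypothetical and written by hand for this illustration; they are not spaCy's own list.

```python
# Hypothetical, hand-written stop list (not spaCy's).
STOP_WORDS = {"i", "the", "was", "not", "last", "a", "of"}
# Words we decide to keep because they carry the polarity of the sentence.
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(tokens, stop_words):
    # Compare in lowercase so that "I" and "The" are also filtered.
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["I", "saw", "the", "movie", "last", "night", ".",
          "I", "was", "not", "amused", "."]

print(remove_stop_words(tokens, STOP_WORDS))
print(remove_stop_words(tokens, STOP_WORDS - NEGATIONS))
```

With the negations removed from the stop list, the second call keeps "not", so the filtered text no longer says the opposite of the original.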


### 1.4. **Stemming**

Removing word suffixes (and sometimes prefixes).  
Stemming yields an orthographic root (in French: *racine orthographique*).  

{ing, s, y, ed, ...}  

The goal is to reduce a word to some base form.  
Typically done through an algorithm, the most famous of which is Porter's algorithm.  

Banking => Bank  
Banks => Bank  

Note that a stemmed word may not be a valid word.  

"Analysis" => "Analysi"

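To show the idea of suffix stripping, here is a deliberately naive sketch. It is not Porter's algorithm: the suffix list, the minimum stem length, and the function name `naive_stem` are assumptions made for this illustration.

```python
# Toy suffix list, checked in this order (an assumption, not Porter's rules).
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word, min_stem_length=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, unless the remaining stem would be too short.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for w in ["Banking", "Banks", "Analysis", "is"]:
    print(w, "=>", naive_stem(w))
```

Even this toy version reproduces the behaviour described above: "Banking" and "Banks" collapse to the same stem, while "Analysis" becomes "analysi", which is not a valid word.
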
### 1.5. **Lemmatization**

Reducing a word to its lemma, or dictionary form (more sophisticated than stemming).  

Lemmatization yields a semantic root (in French: *racine sémantique*).  

"Did", "Done", "Doing" => "Do"

In [27]:
import spacy

doc = nlp("Did, Done, Doing")
[(t.text, t.lemma_) for t in doc]

[('Did', 'do'), (',', ','), ('Done', 'do'), (',', ','), ('Doing', 'do')]
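
For contrast with stemming, the sketch below shows the simplest possible lemmatizer: a dictionary lookup. The table `LEMMA_TABLE` and the function `lookup_lemma` are hypothetical and tiny; real lemmatizers rely on much larger resources and typically also use the part of speech.

```python
# Hypothetical miniature lemma dictionary, for illustration only.
LEMMA_TABLE = {
    "did": "do",
    "done": "do",
    "doing": "do",
    "went": "go",
    "better": "good",
}

def lookup_lemma(word):
    key = word.lower()
    # Unknown words are returned unchanged (lowercased).
    return LEMMA_TABLE.get(key, key)

print([lookup_lemma(w) for w in ["Did", "Done", "Doing", "went", "table"]])
```

A lookup can map "went" to "go", which no suffix-stripping rule could do, but it only knows the words listed in its table.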

### 1.6. **POS Tagging**

Part-of-Speech (PoS) tagging

Classifying how a word is used in a sentence.  
- NOUN  
- VERB  
- ADJECTIVE
- ...

_Example_  
"I want to book a hotel room."  $\rightarrow$ book : POS Tagger = VERB  

"I left the book in the hotel room." $\rightarrow$ book : POS Tagger = NOUN

POS tagging assigns each word in a sentence its corresponding POS.

In [28]:
import spacy

# Loads the small English language model provided by SpaCy for NLP tasks.
nlp = spacy.load('en_core_web_sm')

# Example sentence
sentence = "John is a watching an old movie at a cinema."

# Creating a Doc object that contains linguistic annotations for the text.
doc = nlp(sentence)

# Coarse-grained POS tags, available through the pos_ attribute
[(t.text, t.pos_) for t in doc]

[('John', 'PROPN'),
 ('is', 'AUX'),
 ('a', 'DET'),
 ('watching', 'VERB'),
 ('an', 'DET'),
 ('old', 'ADJ'),
 ('movie', 'NOUN'),
 ('at', 'ADP'),
 ('a', 'DET'),
 ('cinema', 'NOUN'),
 ('.', 'PUNCT')]

In [29]:
# To get a description of a POS tag, use the spacy.explain() function
spacy.explain('PROPN')

'proper noun'

In [30]:
# You can also have fine-grained tags with the attribute tag_
# more details than with pos_ attribute (tense, type of pronoun...)
[(t.text, t.tag_) for t in doc]

[('John', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('watching', 'VBG'),
 ('an', 'DT'),
 ('old', 'JJ'),
 ('movie', 'NN'),
 ('at', 'IN'),
 ('a', 'DT'),
 ('cinema', 'NN'),
 ('.', '.')]

**NNP** refers specifically to a singular proper noun, and **VBD** is a verb in the past tense.

In [31]:
print(spacy.explain("VBD"))
print(spacy.explain("NNP"))

verb, past tense
noun, proper singular


POS tagging can use:  

- <u>Linguistic Rules</u>:  
Predefined linguistic rules, for example: if a word ends in "ing", it is a VERB or an ADJ.  

- <u>Dictionaries</u>:  
Each word is looked up in a lexicon of known tags. This method is limited because it cannot handle unknown words or contextual ambiguities.  

- <u>Hidden Markov Model (HMM)</u>:  
An HMM treats the sequence of tags as a Markov chain in which each hidden state represents a grammatical category, and the words are the observations.  
The model learns the transition probabilities between states (grammatical categories) and the emission probabilities  
(the probability that a word is emitted by a grammatical category).  
HMM is a statistical model.  

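- <u>Conditional Random Field (CRF)</u>:  
A CRF models the probability of the whole tag sequence conditioned on the observed words, so it can use overlapping contextual features.  
CRF is a statistical model.  

- <u>Maximum Entropy model (MaxEnt)</u>:  
MaxEnt models use contextual features to predict the grammatical category of a word.  
These features can include the preceding and following words, suffixes, prefixes, etc.  
The model learns a weight for each feature: among all distributions consistent with the training data, it selects the one with maximum entropy.  
MaxEnt is a statistical model.  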

- <u>Recurrent Neural Networks (RNN) or Long Short-Term Memory networks (LSTMs)</u>:  
These models can capture long-term dependencies in the text and are capable of handling contextual ambiguities.

- <u>Transformers</u>:  
Such as BERT. These models use self-attention to capture complex relationships between words in the text.

- <u>Hybrid Algorithms</u>:  
Some POS tagging systems combine linguistic rules and statistical models to improve accuracy.  
For example, a system may use rules to handle specific cases and statistical models to handle general cases.
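
To make the HMM description above more concrete, here is a minimal sketch of Viterbi decoding over a toy tag set. Every probability below is invented for the illustration rather than learned from a corpus, and the names `START_P`, `TRANS_P`, `EMIT_P` and `viterbi` are assumptions, not part of any library.

```python
TAGS = ["DET", "NOUN", "PRON", "VERB"]

# Invented probabilities, for illustration only (not estimated from data).
START_P = {"DET": 0.4, "NOUN": 0.05, "PRON": 0.5, "VERB": 0.05}
TRANS_P = {  # TRANS_P[previous_tag][next_tag]
    "DET":  {"DET": 0.025, "NOUN": 0.9, "PRON": 0.025, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "PRON": 0.1, "VERB": 0.5},
    "PRON": {"DET": 0.05, "NOUN": 0.1, "PRON": 0.05, "VERB": 0.8},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "PRON": 0.15, "VERB": 0.05},
}
EMIT_P = {  # EMIT_P[tag][word]; missing words have probability 0
    "DET": {"the": 1.0},
    "NOUN": {"book": 0.5, "room": 0.5},
    "PRON": {"they": 1.0},
    "VERB": {"book": 1.0},
}

def viterbi(words):
    # best[i][tag] = (probability of the best tag path ending in `tag`
    #                 at position i, previous tag on that path)
    best = [{t: (START_P[t] * EMIT_P[t].get(words[0], 0.0), None) for t in TAGS}]
    for i in range(1, len(words)):
        column = {}
        for tag in TAGS:
            emission = EMIT_P[tag].get(words[i], 0.0)
            column[tag] = max(
                (best[i - 1][prev][0] * TRANS_P[prev][tag] * emission, prev)
                for prev in TAGS
            )
        best.append(column)

    # Backtrack from the most probable final tag.
    path = [max(TAGS, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

print(viterbi(["they", "book", "the", "room"]))
print(viterbi(["the", "book"]))
```

With these toy numbers, "book" is tagged VERB after a pronoun and NOUN after a determiner: the transition probabilities resolve the ambiguity that a dictionary alone cannot.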

### 1.7. **Named Entity Recognition (NER)**

Tagging named ("real-world") entities.  

{a person, a location, an organization, ...}

Named entity: roughly anything that can be referred to by a proper name.  
Named entities often have a proper noun (PROPN) POS tag.  

NER helps to identify the organizations, locations, and people mentioned in a corpus.

NER can be seen as a sequence labelling task that uses a tagging scheme such as BIO (**B**eginning of entity, **I**nside of entity, **O**utside of entity).  

"Alexander Hamilton was born in Charleston, Nevis."  

Alexander $\Rightarrow$ B-PER (Beginning of person entity)  
Hamilton $\Rightarrow$ I-PER (Inside (continuation) of person entity)  
was $\Rightarrow$ O (Outside of entity.)  
born $\Rightarrow$ O  
in $\Rightarrow$ O  
Charleston $\Rightarrow$ B-GPE  
, $\Rightarrow$ O  
Nevis $\Rightarrow$ B-GPE  
. $\Rightarrow$ O  

<u>NB</u>: BIO is not a linguistic rule but an annotation convention used to structure training data.
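
The BIO tags above can be turned back into entities with a few lines of plain Python. This is a minimal sketch: the function name `bio_to_spans` is an assumption, and an I- tag that does not continue an entity of the same type is simply treated as being outside any entity.

```python
def bio_to_spans(tokens, tags):
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close the previous one, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag): close the current entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans

tokens = ["Alexander", "Hamilton", "was", "born", "in", "Charleston", ",", "Nevis", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-GPE", "O", "B-GPE", "O"]
print(bio_to_spans(tokens, tags))
```

The two person tokens are merged into the single entity "Alexander Hamilton", while "Charleston" and "Nevis" remain separate GPE entities because each starts with its own B- tag.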

In [32]:
# NER With spacy

s = "Volkswagen is developping an electric sedan which could potentially come to America next fall."

doc = nlp(s)

# Here we access named entities through the token attribute ent_type_.
# spaCy offers other ways to do this.

[(t.text, t.ent_type_) for t in doc]


[('Volkswagen', 'ORG'),
 ('is', ''),
 ('developping', ''),
 ('an', ''),
 ('electric', ''),
 ('sedan', ''),
 ('which', ''),
 ('could', ''),
 ('potentially', ''),
 ('come', ''),
 ('to', ''),
 ('America', 'GPE'),
 ('next', 'DATE'),
 ('fall', 'DATE'),
 ('.', '')]

In [33]:
print(spacy.explain('GPE'))
print(spacy.explain('ORG'))
print(spacy.explain('DATE'))

Countries, cities, states
Companies, agencies, institutions, etc.
Absolute or relative dates or periods


In [34]:
# You can also keep only the tokens that belong to an entity by checking the attribute ent_type (without underscore)
[(t.text, t.ent_type_) for t in doc if t.ent_type != 0]

[('Volkswagen', 'ORG'), ('America', 'GPE'), ('next', 'DATE'), ('fall', 'DATE')]

In [35]:
# Another way is through the ents property of the Doc object itself (note: "next fall" is a single entity this time).
[(ent.text, ent.label_) for ent in doc.ents]

[('Volkswagen', 'ORG'), ('America', 'GPE'), ('next fall', 'DATE')]

[spaCy visualizers](https://spacy.io/usage/visualizers)

In [36]:
from spacy import displacy

# We need to set the 'jupyter' argument to True in order to output
# the visualization directly. Otherwise, you'll get raw HTML.
# style='ent' is for entity recognition.

displacy.render(doc, style='ent', jupyter=True)

In [37]:
s = "Ridley Scott directed the Martian."
doc = nlp(s)
displacy.render(doc, style='ent', jupyter=True)

In [38]:
spacy.explain('NORP')

'Nationalities or religious or political groups'

### 1.8. **Parsing**

Determining the syntactic structure of a sentence. 

{_subject_, _verb_, _direct object_, _indirect object_, _adverbial_, _adjectives_, _adverbs_}

**1. Constituency Parsing**

**Creates a parse tree.**

To do so, constituency parsing uses a Context-Free Grammar (CFG) with rules for:

the sentence (S),  
noun phrases (NP),  
verb phrases (VP),  
prepositional phrases (PP),  
verbs (V),  
determiners (Det),  
nouns (N), and  
prepositions (P).

The CFG comprises "Production Rules" (lines 1 to 3) and a Lexicon (lines 4 to 8).

The tree shows the syntactic structure of the sentence.
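
As a rough illustration of how a CFG produces a parse tree, here is a minimal top-down parser in plain Python. The grammar, the lexicon, and the function names are all assumptions invented for this sketch; it is not the grammar from the course, and real constituency parsers use far larger grammars and more efficient algorithms.

```python
# Toy production rules: each symbol maps to its possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
}
# Toy lexicon: each part of speech maps to the words it can produce.
LEXICON = {
    "Det": {"the", "a"},
    "N": {"dog", "ball", "park"},
    "V": {"chased"},
    "P": {"in"},
}

def parse(symbol, tokens, pos):
    """Yield (tree, next_position) for every way `symbol` can start at `pos`."""
    if symbol in LEXICON:
        if pos < len(tokens) and tokens[pos] in LEXICON[symbol]:
            yield (symbol, tokens[pos]), pos + 1
        return
    for production in GRAMMAR[symbol]:
        yield from expand(production, tokens, pos, [symbol])

def expand(production, tokens, pos, partial):
    """Match the symbols of one production from left to right."""
    if not production:
        yield tuple(partial), pos
        return
    for subtree, next_pos in parse(production[0], tokens, pos):
        yield from expand(production[1:], tokens, next_pos, partial + [subtree])

def parse_sentence(sentence):
    tokens = sentence.lower().rstrip(".").split()
    # Keep only the trees that consume every token.
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]

print(parse_sentence("The dog chased the ball."))
print(parse_sentence("Dog the chased."))
```

The first sentence yields one nested tuple of the form (S, (NP ...), (VP ...)); the second yields an empty list because no sequence of production rules can derive it.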

**2. Dependency Parsing**

In [39]:
import spacy

# Load spaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Sentence to analyse
sentence = "She enrolled in the course at the university."

# Analyze the sentence
doc = nlp(sentence)

# Display the dependency relations
for token in doc:
    print(f"{token.text} ({token.dep_}) <-- {token.head.text}")

She (nsubj) <-- enrolled
enrolled (ROOT) <-- enrolled
in (prep) <-- enrolled
the (det) <-- course
course (pobj) <-- in
at (prep) <-- course
the (det) <-- university
university (pobj) <-- at
. (punct) <-- enrolled


In [40]:
# With spaCy
import spacy

# Let's visualize a dependency parse
displacy.render(doc, style='dep', jupyter=True)

Take the relation "She" <- "enrolled" shown above. It is a nominal subject (nsubj) relationship where:  
- The child or dependent is "She". 
- The head or governor is "enrolled". 

https://spacy.io/api/annotation#dependency-parsing

In [41]:
print(spacy.explain("nsubj"))
print(spacy.explain("prep"))
print(spacy.explain("pobj"))
print(spacy.explain("det"))

nominal subject
prepositional modifier
object of preposition
determiner


In [42]:
[(t.text, t.dep_) for t in doc]

[('She', 'nsubj'),
 ('enrolled', 'ROOT'),
 ('in', 'prep'),
 ('the', 'det'),
 ('course', 'pobj'),
 ('at', 'prep'),
 ('the', 'det'),
 ('university', 'pobj'),
 ('.', 'punct')]

But this does not show the dependencies themselves.

In [43]:
# To have a better idea, print the head of each dependency
[(t.text, t.dep_, t.head.text) for t in doc]

[('She', 'nsubj', 'enrolled'),
 ('enrolled', 'ROOT', 'enrolled'),
 ('in', 'prep', 'enrolled'),
 ('the', 'det', 'course'),
 ('course', 'pobj', 'in'),
 ('at', 'prep', 'course'),
 ('the', 'det', 'university'),
 ('university', 'pobj', 'at'),
 ('.', 'punct', 'enrolled')]

A word can be a child to only one head, while the same word can act as a head to zero, one, or multiple words.  

The word "enrolled" has no arcs pointing to it. In this sentence it acts as the root. The finite verb is often the root of a sentence.  
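
These two properties can be checked on the parse printed earlier. The sketch below copies the (token, dependency, head) triples from that output and groups the children by head. For simplicity it indexes heads by their text, which only works here because every head word is unique in this sentence; with spaCy objects one would rather use token indices.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the parse of
# "She enrolled in the course at the university."
parse = [
    ("She", "nsubj", "enrolled"),
    ("enrolled", "ROOT", "enrolled"),
    ("in", "prep", "enrolled"),
    ("the", "det", "course"),
    ("course", "pobj", "in"),
    ("at", "prep", "course"),
    ("the", "det", "university"),
    ("university", "pobj", "at"),
    (".", "punct", "enrolled"),
]

children = defaultdict(list)
root = None
for text, dep, head in parse:
    if dep == "ROOT":
        root = text  # the root is the only token without an incoming arc
    else:
        children[head].append(text)  # each token is attached to exactly one head

print("root:", root)
for head, kids in children.items():
    print(head, "->", kids)
```

"enrolled" governs three tokens and "course" governs two, while every token appears exactly once as a child.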

**One major advantage of dependency parsers: they are more resilient to word order changes.**  

Exercise: compare the dependency parses of the following two sentences:  
- "He looked at the paperwork **wearily**."  
- "He **wearily** looked at the paperwork."

<u>Determining the syntactic structure of a sentence with parsing can be useful for:</u>

- **Grammar checking**: if a sentence cannot be parsed, it may be grammatically incorrect or difficult to read.

- **Question answering**: parse structure can help resolve ambiguities by identifying likely dependencies between words, such as what the subjects and objects of a sentence are.  

- **Semantic parsing**: e.g. converting natural language utterances to an intermediate representation interpretable by a machine.

Parsing has many other applications:  

- Machine translation: understanding the grammatical structure of a sentence in the source language before translating it into the target language.

- Sentiment analysis: better interpreting the nuances and the context of the opinions expressed in a text.

- Chatbots and voice assistants (Siri, Alexa, ...): understanding voice commands and user requests.

- Search engines (Google, ...): understanding user queries and returning relevant answers.

- Information extraction: extracting structured information (names, places, dates, ...) from unstructured text.  

- Grammar correction, automatic summarization, and analysis of financial or medical texts: extracting the relevant information to support decision making.

<u>**Exercises:**</u>

**Parsing, combined with POS tagging and NER, can form the basis of an information extraction system.**  

Example of extraction: Financial corpus $\Rightarrow$ Key entities and their relationships

**3. Constituency Parsing vs Dependency Parsing**  

Constituency parsing: use it to extract sub-phrases from a sentence.  

Dependency parsing: use it for the other applications.

### 1.9. **Matcher**

**Using spaCy's Matcher to find patterns**

spaCy comes with a host of pattern-matching functionality to help us to find patterns in a document.  

Beyond regular expressions, spaCy can match on a variety of attributes such as POS tags, entity tags, lemmas, dependencies, and phrases.  

https://spacy.io/usage/rule-based-matching
https://explosion.ai/demos/matcher

<u>**Exercises:**</u>

Here we try to search for patterns that may be useful for a hospitality bot.

For that, we use the basic Matcher to find a verb-noun combination in a sentence.

In [44]:
# The general Matcher is one of multiple matcher objects
# included with spaCy.
from spacy.matcher import Matcher

In [45]:
# We initialize the Matcher with the spaCy vocab object, which contains
# words along with their labels and tags.
matcher = Matcher(nlp.vocab)

In [46]:
s = "I want to book a hotel room."
doc = nlp(s)

In [47]:
# Patterns are expressed as an ordered sequence.
# Here, we're looking to match occurrences starting with the string 'book', followed by
# a determiner (DET POS tag, such as "the" or "a"), then a noun (NOUN POS tag).
# The OP key is a quantifier: it sets how many times a token pattern may match.

# Here, the DET POS (marked with '?') will match 0 or 1 times (i.e. the determiner is optional), and 
# the NOUN POS (marked with '+') will match 1 or more times (i.e., at least one noun is required).
# See this link for more information.
# https://spacy.io/usage/rule-based-matching#quantifiers

pattern = [
    {'TEXT': 'book'},
    {'POS': 'DET', 'OP': '?'},
    {'POS': 'NOUN', 'OP': '+'}
]

# So, the pattern will match sequences that start with the word "book", 
# optionally followed by a determiner, and then followed by one or more nouns.

In [48]:
# We give our pattern a label and pass it to the matcher.
matcher.add('USER_INTENT', [pattern])

# Run the matcher over the doc.
matches = matcher(doc)

# For each match, the matcher returns a tuple specifying a match id, start,
# and end of the match.
print("Matches: ", [doc[start:end].text for match_id, start, end in matches])

Matches:  ['book a hotel', 'book a hotel room']


The code above demonstrates the Matcher but is fragile.
- What if "book" is capitalized?
- What if a user types "reserve" instead of "book"?
- How can we match on "hotel room" as a compound noun?
- What if a user types "book a flight and hotel room"?
- Can you think of how you would handle these cases?

We could come up with more rules to match different patterns,  
or perhaps just search for keywords based on POS tags and entities (e.g. a country),  
present the user with several possible intentions, and let them choose one.  
We could also have several different interpretation functions submit answers and  
select the most likely one based on what was historically accepted most often.  
We can also ask clarifying questions to narrow things down.  

For example, for the last sentence, you could have a function scan through the Doc object's noun_chunks  
(phrases that have a noun as their head) and  
isolate keywords there along with potential conjunctions (e.g. "and").  

https://spacy.io/usage/linguistic-features#noun-chunks

We could look at noun chunks instead, and at the head of each chunk's root,  
which provides more information.  
The extraction below is more illuminating in terms of what the person wants  
to do and where:

In [49]:
doc = nlp("I want to book a flight and hotel room in Berlin.")
for noun_phrase in doc.noun_chunks:
  print("phrase: {}, root head: {}".format(noun_phrase, noun_phrase.root.head))

phrase: I, root head: want
phrase: a flight and hotel room, root head: book
phrase: Berlin, root head: in


The person wants something.  
There is a noun phrase containing nouns relevant to our domain,  
its associated root head is "book", and there is also a location along with a preposition.  

For the "where", we can further use named entity recognition to isolate Berlin as a destination and  
disambiguate it from the other nouns such as "flight" and "hotel room".  

There is still some ambiguity here: does the person want a hotel room in Berlin, or to book from Berlin?  
In most cases, though, it's the former.  

So you'll find that we can pile on rule after rule and still not quite get there!  

Most systems today combine several layered approaches.  
If we have a narrow domain with a finite set of cases, as with many chatbots, these rules can help get the job done fast,  
because free-form chatbots are terrible, and a bad chatbot is infinitely worse than no chatbot.  

So for a chatbot you could start with a rule-based prototype, but when the requirements become more sophisticated and the cases more varied,  
we'll need to blend in more powerful techniques.  

https://spacy.io/usage/training

Talkin' like Yoda

Languages like English are built around the subject-verb-object pattern. But if you're familiar with Yoda from Star Wars,   
he famously speaks in an object-subject-verb pattern. Using the information in a dependency parse,   
we can turn basic English sentences into Yoda-speak.

In [50]:
def yodize(s: str):
  doc = nlp(s)
  for t in doc:
    if t.dep_ == "ROOT":

      # Assuming our sentence is of the form subject-verb-object, we take 
      # everything after the root (likely verb) and put it in front, and 
      # likewise take everything before the root, and put it after.
      seq = [doc[t.i + 1: -1].text, doc[0: t.i].text, t.text + '.']
      # Uppercase only the first character: str.capitalize() would also
      # lowercase the rest of the string (e.g. "Texas" -> "texas").
      seq[0] = seq[0][:1].upper() + seq[0][1:]
      print(' '.join(seq))

In [51]:
yodize("I will fly to Texas.")

To Texas I will fly.


In [52]:
doc = nlp("I will fly to Texas.")
for t in doc:
    print((t.text, t.i, t.pos_, t.dep_))

('I', 0, 'PRON', 'nsubj')
('will', 1, 'AUX', 'aux')
('fly', 2, 'VERB', 'ROOT')
('to', 3, 'ADP', 'prep')
('Texas', 4, 'PROPN', 'pobj')
('.', 5, 'PUNCT', 'punct')


In [53]:
doc = nlp("I will fly to Texas.")
for t in doc:
    print(doc[t.i])

I
will
fly
to
Texas
.


In [54]:
doc = nlp("I will fly to Texas.")
for t in doc:
    print(doc[t.i : t.i+2])

I will
will fly
fly to
to Texas
Texas.
.


In [55]:
doc = nlp("I will fly to Texas.")
for t in doc:
    print(doc[t.i:2])

I will
will






In [56]:
doc = nlp("I will fly to Texas.")
for t in doc:
    print(doc[t.i + 1: -1].text)

will fly to Texas
fly to Texas
to Texas
Texas




### 1.10. **Advanced Exercises**

In [57]:
#
# EXERCISE: using doc.ents, identify and print the dates in this sentence.
# Expected output: ['Feb 13th', 'Feb 24th']
#
s = "We'll be in Osaka on Feb 13th and leave on Feb 24th."
doc = nlp(s)

In [58]:
#
# EXERCISE: Read about spaCy's PhraseMatcher
# https://spacy.io/usage/rule-based-matching#phrasematcher
#
# Using the PhraseMatcher, find the start and end index of all occurrences 
# of 'Caesar Augustus' and 'Roman Empire' (case-insensitive).
#
# Expected output: [(0, 2), (15, 17)]
#
from spacy.matcher import PhraseMatcher
s = "Caesar Augustus was the founder of the Roman Principate (the first phase of the Roman Empire)."
doc = nlp(s)

In [59]:
# Additional Reading and Resources

Read through this page to learn more about spaCy's language processing pipeline including what's going on under the hood,  
how to create custom components, disable certain components (e.g. NER)  
when they're unneeded, optimization tips, and best practices:  
https://spacy.io/usage/processing-pipelines

Take the free and succinct spaCy course (available in multiple languages):  
https://course.spacy.io/

**Once we have our tokens and we've processed them to our liking, what's next ?**

**We still have text but in order to use these tokens in statistical methods or machine learning algorithms**  

**we need to transform them into numbers. And representing text as numbers is what we'll start exploring next.**