<a href="https://colab.research.google.com/github/NastasiaMazur/VU_1_1/blob/main/tutorials/Tutorial2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 2: Introduction to Computational Linguistics

This is the second tutorial with practical exercises for the lecture Introduction to Computational Linguistics in the winter semester 2023. Hands-on exercises are marked with 👋 ⚒ and questions are marked with ❓. Remember to first **store this notebook** in your Drive or GitHub.

Today's focus is on the traditional NLP processing pipeline, for which we will be using [spaCy](https://spacy.io/) and [Natural Language Toolkit (NLTK)](https://www.nltk.org/).

---

## **Lesson 2: NLP Pipeline**

For the NLP pipeline, we will be using three different libraries today: NLTK, [Stanza](http://stanza.run/), and [spaCy](https://spacy.io/). Thus, we first need to install Stanza.

In [1]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.6.1-py3-none-any.whl (881 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m881.2/881.2 kB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting emoji (from stanza)
  Downloading emoji-2.8.0-py2.py3-none-any.whl (358 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m358.9/358.9 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: emoji, stanza
Successfully installed emoji-2.8.0 stanza-1.6.1


NLTK and spaCy are already available in a standard Colab Notebook, however, we need to download some packages that we will need in NLTK.

In [2]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('gutenberg')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

## Tokenization and POS Tagging

First we will use NLTK to tokenize and POS tag a sample sentence. The tagset that the Perceptron Tagger uses is the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

❓ Are the POS tags for the two different uses of *tears* correct? How does their pronunciation differ?


In [3]:
# Tokenization
from nltk.tokenize import word_tokenize
# Part-of-Speech tagger
from nltk.tag.perceptron import PerceptronTagger

tagger = PerceptronTagger()

# Example sentences
sentence = "It just tears me apart to see you suffering like that and in tears."

# Tokenize the sentence
print(word_tokenize(sentence))
# POS tag each token in the tokenized sentence
pos_tags = tagger.tag(word_tokenize(sentence))
print("Part of speech tags of the sentence: ", pos_tags)

['It', 'just', 'tears', 'me', 'apart', 'to', 'see', 'you', 'suffering', 'like', 'that', 'and', 'in', 'tears', '.']
Part of speech tags of the sentence:  [('It', 'PRP'), ('just', 'RB'), ('tears', 'VBZ'), ('me', 'PRP'), ('apart', 'RB'), ('to', 'TO'), ('see', 'VB'), ('you', 'PRP'), ('suffering', 'VBG'), ('like', 'IN'), ('that', 'DT'), ('and', 'CC'), ('in', 'IN'), ('tears', 'NNS'), ('.', '.')]


👋 ⚒ Let's do the same in spaCy. Go to the [spaCy documentation](https://spacy.io/usage/linguistic-features) and perform tokenization and POS tagging on the same example sentence. Attention: Only output the tokens, their spaCy internal POS label and the Penn Treebank tags.

In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Your code here
doc = nlp(sentence)

for token in doc:
    print(token.text, token.pos_, token.tag_)

#for token in doc:
#    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
#            token.shape_, token.is_alpha, token.is_stop)


It PRON PRP
just ADV RB
tears VERB VBZ
me PRON PRP
apart ADV RB
to PART TO
see VERB VB
you PRON PRP
suffering VERB VBG
like ADP IN
that PRON DT
and CCONJ CC
in ADP IN
tears NOUN NNS
. PUNCT .


## Lemmatization and Stemming

We have looked at the comparison between these two in the lecture. Now it is time for you to play around with the two yourself.

👋 ⚒ Which stemmer worked better? Which method would you prefer to determine word frequency information of a text corpus?

In [14]:
# Lemmatizer
from nltk.stem import WordNetLemmatizer

# Stemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

# Lemmatizer
lemmatizer = WordNetLemmatizer()

# Selection of stemmers
ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer("english")

# Exercise: Lemmatize and stem (maybe try different stemmers) the following words
words = ['presumably', 'provisions', 'owed', 'abacus', 'flies', 'dies', 'mules',
        'seizing', 'caresses', 'sensational', 'colonizer', 'traditional', 'plotted']

for word in words:
  print(lemmatizer.lemmatize(word))
  print(ps.stem(word))
  print(ls.stem(word))
  print(ss.stem(word))



presumably
presum
presum
presum
provision
provis
provid
provis
owed
owe
ow
owe
abacus
abacu
abac
abacus
fly
fli
fli
fli
dy
die
die
die
mule
mule
mul
mule
seizing
seiz
seiz
seiz
caress
caress
caress
caress
sensational
sensat
sens
sensat
colonizer
colon
colon
colon
traditional
tradit
tradit
tradit
plotted
plot
plot
plot


With spaCy the code is very much the same for lemmatization as for tokenization and POS tagging, exemplified for our example sentence below. The library, unfortunately, has no function for stemming.

In [15]:
doc = nlp(sentence)
for token in doc:
    print(token.text, token.lemma_)

It it
just just
tears tear
me I
apart apart
to to
see see
you you
suffering suffer
like like
that that
and and
in in
tears tear
. .


## Named Entity Recognition (NER)

👋 ⚒ Get the results for NER for the following example sentence in spaCy.

In [19]:
example_sentence = "Vienna is lovely in December."

# Your code here
doc = nlp(example_sentence)

for ent in doc.ents:
    print(ent.text, ent.label_)

Vienna GPE
December DATE


## Dependency Parsing

Whenever grammatical relations are needed, dependency parsing is very useful. The most common tagset are the [Universal Dependency Relations](https://universaldependencies.org/u/dep/).

While there are some options for dependency parsing in NLTK, the successful ones depend on the Stanford Parser. However, Stanza is the more recent version of the Stanford Parser and therefore more useful.

In [20]:
doc = nlp(example_sentence)

for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

Vienna nsubj is AUX []
is ROOT is AUX [Vienna, lovely, in, .]
lovely acomp is AUX []
in prep is AUX [December]
December pobj in ADP []
. punct is AUX []


In [21]:
# You can also visualize the dependency relations
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)

In Stanza, we can do all of the above operations in one pipeline. Also spaCy offers a pipeline solution and the combination of several of these parsers in one go.

In [22]:
import stanza
stanza.download('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.6.0/models/default.zip:   0%|          | 0…

INFO:stanza:Finished downloading models and saved to /root/stanza_resources.


👋 ⚒ The print statement contains two for loops and an if/else statement. Try to split it up from a one-line code back to the two loops and the statement in several lines.

In [25]:
pipeline = stanza.Pipeline(lang='en', processor='tokenize,pos,lemma,depparse')
doc = pipeline(example_sentence)

# Try to split the following line into two for statements and one if/else
for sent in doc.sentences:
  for word in sent.words:
    if word.head > 0:
      print(f'id: {word.id}\tword: {word.text}\thead id: {word.head}\thead: {sent.words[word.head-1].text}\tdeprel: {word.deprel}')
    else:
      print(f'id: {word.id}\tword: {word.text}\thead id: {word.head}\thead: root\tdeprel: {word.deprel}')

#print(*[f'id: {word.id}\tword: {word.text}\thead id: {word.head}\thead: {sent.words[word.head-1].text if word.head > 0 else "root"}\tdeprel: {word.deprel}' for sent in doc.sentences for word in sent.words], sep='\n')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

INFO:stanza:Loading these models for language: en (English):
| Processor    | Package             |
--------------------------------------
| tokenize     | combined            |
| pos          | combined_charlm     |
| lemma        | combined_nocharlm   |
| constituency | ptb3-revised_charlm |
| depparse     | combined_charlm     |
| sentiment    | sstplus             |
| ner          | ontonotes_charlm    |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: constituency
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


id: 1	word: Vienna	head id: 3	head: lovely	deprel: nsubj
id: 2	word: is	head id: 3	head: lovely	deprel: cop
id: 3	word: lovely	head id: 0	head: root	deprel: root
id: 4	word: in	head id: 5	head: December	deprel: case
id: 5	word: December	head id: 3	head: lovely	deprel: obl
id: 6	word: .	head id: 3	head: lovely	deprel: punct


❓ Do you notice any differences between the two types of dependency relations and the output for this sentence? Do the two parsers agree on the existing relations in this sentence?
