# Lab1-Assignment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the assignment for Lab 1 of the text mining course. 

**Points**: each exercise is prefixed with the number of points you can obtain for the exercise.

We assume you have worked through the following notebooks:
* **Lab1.1-introduction**
* **Lab1.2-introduction-to-NLTK**
* **Lab1.3-introduction-to-spaCy** 

In this assignment, you will process an English text (**Lab1-apple-samsung-example.txt**) with both NLTK and spaCy and discuss the similarities and differences.

## Credits
The notebooks in this block have been originally created by [Marten Postma](https://martenpostma.github.io). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Tip: how to read a file from disk
Let's open the file **Lab1-apple-samsung-example.txt** from disk.

In [1]:
from pathlib import Path

In [2]:
cur_dir = Path().resolve() # this should provide you with the folder in which this notebook is placed
path_to_file = Path.joinpath(cur_dir, 'Lab1-apple-samsung-example.txt')
print(path_to_file)
print('does path exist? ->', Path.exists(path_to_file))

C:\Users\User\Desktop\Text mining\ba-text-mining\lab_sessions\lab1\Lab1-apple-samsung-example.txt
does path exist? -> True


If the output from the code cell above states that **does path exist? -> False**, please check that the file **Lab1-apple-samsung-example.txt** is in the same directory as this notebook.

In [3]:
with open(path_to_file) as infile:
    text = infile.read()

print('number of characters', len(text))

number of characters 1142


## [total points: 4] Exercise 1: NLTK
In this exercise, we use NLTK to apply **Part-of-speech (POS) tagging**, **Named Entity Recognition (NER)**, and **Constituency parsing**. The following code snippet already performs sentence splitting and tokenization. 

In [4]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize

In [5]:
sentences_nltk = sent_tokenize(text) # sentence splitting

In [6]:
tokens_per_sentence = []
for sentence_nltk in sentences_nltk:
    sent_tokens = word_tokenize(sentence_nltk) # tokenization
    tokens_per_sentence.append(sent_tokens)

We will use lists to keep track of the output of the NLP tasks. We can hence inspect the output for each task using the index of the sentence.

In [7]:
sent_id = 1
print('SENTENCE', sentences_nltk[sent_id])
print('TOKENS', tokens_per_sentence[sent_id])

SENTENCE The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.
TOKENS ['The', 'six', 'phones', 'and', 'tablets', 'affected', 'are', 'the', 'Galaxy', 'S', 'III', ',', 'running', 'the', 'new', 'Jelly', 'Bean', 'system', ',', 'the', 'Galaxy', 'Tab', '8.9', 'Wifi', 'tablet', ',', 'the', 'Galaxy', 'Tab', '2', '10.1', ',', 'Galaxy', 'Rugby', 'Pro', 'and', 'Galaxy', 'S', 'III', 'mini', '.']


### [point: 1] Exercise 1a: Part-of-speech (POS) tagging
Use `nltk.pos_tag` to perform part-of-speech tagging on each sentence.

Use `print` to **show** the output in the notebook (and hence also in the exported PDF!).

In [8]:
from nltk import pos_tag

In [9]:
pos_tags_per_sentence = []
for tokens in tokens_per_sentence:
    tagged_token = pos_tag(tokens) #tag the tokens
    pos_tags_per_sentence.append(tagged_token) # add the tagged tokens to pos_tags_per_sentence
    print(tagged_token)

[('https', 'NN'), (':', ':'), ('//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html', 'JJ'), ('Documents', 'NNS'), ('filed', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('San', 'NNP'), ('Jose', 'NNP'), ('federal', 'JJ'), ('court', 'NN'), ('in', 'IN'), ('California', 'NNP'), ('on', 'IN'), ('November', 'NNP'), ('23', 'CD'), ('list', 'NN'), ('six', 'CD'), ('Samsung', 'NNP'), ('products', 'NNS'), ('running', 'VBG'), ('the', 'DT'), ('``', '``'), ('Jelly', 'RB'), ('Bean', 'NNP'), ("''", "''"), ('and', 'CC'), ('``', '``'), ('Ice', 'NNP'), ('Cream', 'NNP'), ('Sandwich', 'NNP'), ("''", "''"), ('operating', 'VBG'), ('systems', 'NNS'), (',', ','), ('which', 'WDT'), ('Apple', 'NNP'), ('claims', 'VBZ'), ('infringe', 'VB'), ('its', 'PRP$'), ('patents', 'NNS'), ('.', '.')]
[('The', 'DT'), ('six', 'CD'), ('phones', 'NNS'), ('and', 'CC'), ('tablets', 'NNS'), ('affected', 'VBN'), ('are', 'VBP'), ('the', 'DT'), ('Galaxy', 'NNP'), ('S', 'NNP'), ('III', 'NNP'

In [10]:
print(pos_tags_per_sentence)

[[('https', 'NN'), (':', ':'), ('//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html', 'JJ'), ('Documents', 'NNS'), ('filed', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('San', 'NNP'), ('Jose', 'NNP'), ('federal', 'JJ'), ('court', 'NN'), ('in', 'IN'), ('California', 'NNP'), ('on', 'IN'), ('November', 'NNP'), ('23', 'CD'), ('list', 'NN'), ('six', 'CD'), ('Samsung', 'NNP'), ('products', 'NNS'), ('running', 'VBG'), ('the', 'DT'), ('``', '``'), ('Jelly', 'RB'), ('Bean', 'NNP'), ("''", "''"), ('and', 'CC'), ('``', '``'), ('Ice', 'NNP'), ('Cream', 'NNP'), ('Sandwich', 'NNP'), ("''", "''"), ('operating', 'VBG'), ('systems', 'NNS'), (',', ','), ('which', 'WDT'), ('Apple', 'NNP'), ('claims', 'VBZ'), ('infringe', 'VB'), ('its', 'PRP$'), ('patents', 'NNS'), ('.', '.')], [('The', 'DT'), ('six', 'CD'), ('phones', 'NNS'), ('and', 'CC'), ('tablets', 'NNS'), ('affected', 'VBN'), ('are', 'VBP'), ('the', 'DT'), ('Galaxy', 'NNP'), ('S', 'NNP'), ('III', 'NN

### [point: 1] Exercise 1b: Named Entity Recognition (NER)
Use `nltk.chunk.ne_chunk` to perform Named Entity Recognition (NER) on each sentence.

Use `print` to **show** the output in the notebook (and hence also in the exported PDF!).

In [11]:
from nltk.chunk import ne_chunk

In [12]:
ner_tags_per_sentence = []
for tagged_tokens in pos_tags_per_sentence:
    # ne_chunk requires POS-tagged tokens
    tokens_pos_tagged_and_named_entities = ne_chunk(tagged_tokens) # apply NER to to tagged tokens
    ner_tags_per_sentence.append(tokens_pos_tagged_and_named_entities)
    print()
    print('NAMED ENTITY RECOGNITION OUTPUT', tokens_pos_tagged_and_named_entities)


NAMED ENTITY RECOGNITION OUTPUT (S
  https/NN
  :/:
  //www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html/JJ
  Documents/NNS
  filed/VBN
  to/TO
  the/DT
  (ORGANIZATION San/NNP Jose/NNP)
  federal/JJ
  court/NN
  in/IN
  (GPE California/NNP)
  on/IN
  November/NNP
  23/CD
  list/NN
  six/CD
  (ORGANIZATION Samsung/NNP)
  products/NNS
  running/VBG
  the/DT
  ``/``
  Jelly/RB
  (GPE Bean/NNP)
  ''/''
  and/CC
  ``/``
  Ice/NNP
  Cream/NNP
  Sandwich/NNP
  ''/''
  operating/VBG
  systems/NNS
  ,/,
  which/WDT
  (PERSON Apple/NNP)
  claims/VBZ
  infringe/VB
  its/PRP$
  patents/NNS
  ./.)

NAMED ENTITY RECOGNITION OUTPUT (S
  The/DT
  six/CD
  phones/NNS
  and/CC
  tablets/NNS
  affected/VBN
  are/VBP
  the/DT
  (ORGANIZATION Galaxy/NNP)
  S/NNP
  III/NNP
  ,/,
  running/VBG
  the/DT
  new/JJ
  (PERSON Jelly/NNP Bean/NNP)
  system/NN
  ,/,
  the/DT
  (ORGANIZATION Galaxy/NNP)
  Tab/NNP
  8.9/CD
  Wifi/NNP
  tablet/NN
  ,/,
  the/DT


In [13]:
print(ner_tags_per_sentence)

[Tree('S', [('https', 'NN'), (':', ':'), ('//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html', 'JJ'), ('Documents', 'NNS'), ('filed', 'VBN'), ('to', 'TO'), ('the', 'DT'), Tree('ORGANIZATION', [('San', 'NNP'), ('Jose', 'NNP')]), ('federal', 'JJ'), ('court', 'NN'), ('in', 'IN'), Tree('GPE', [('California', 'NNP')]), ('on', 'IN'), ('November', 'NNP'), ('23', 'CD'), ('list', 'NN'), ('six', 'CD'), Tree('ORGANIZATION', [('Samsung', 'NNP')]), ('products', 'NNS'), ('running', 'VBG'), ('the', 'DT'), ('``', '``'), ('Jelly', 'RB'), Tree('GPE', [('Bean', 'NNP')]), ("''", "''"), ('and', 'CC'), ('``', '``'), ('Ice', 'NNP'), ('Cream', 'NNP'), ('Sandwich', 'NNP'), ("''", "''"), ('operating', 'VBG'), ('systems', 'NNS'), (',', ','), ('which', 'WDT'), Tree('PERSON', [('Apple', 'NNP')]), ('claims', 'VBZ'), ('infringe', 'VB'), ('its', 'PRP$'), ('patents', 'NNS'), ('.', '.')]), Tree('S', [('The', 'DT'), ('six', 'CD'), ('phones', 'NNS'), ('and', 'CC'),

### [points: 2] Exercise 1c: Constituency parsing
Use the `nltk.RegexpParser` to perform constituency parsing on each sentence.

Use `print` to **show** the output in the notebook (and hence also in the exported PDF!).

In [14]:
# Define the grammar
constituent_parser = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*''')

In [15]:
constituency_output_per_sentence = []
for tagged_tokens in pos_tags_per_sentence:
    constituent_structure = constituent_parser.parse(tagged_tokens) # constituency parse tagged tokens
    constituency_output_per_sentence.append(constituent_structure)
    print()
    print(constituent_structure)
    #constituent_structure.draw()


(S
  (NP https/NN)
  :/:
  (NP
    //www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html/JJ)
  Documents/NNS
  (VP (V filed/VBN))
  to/TO
  (NP the/DT)
  San/NNP
  Jose/NNP
  (NP federal/JJ court/NN)
  (P in/IN)
  California/NNP
  (P on/IN)
  November/NNP
  23/CD
  (NP list/NN)
  six/CD
  Samsung/NNP
  products/NNS
  (VP (V running/VBG) (NP the/DT))
  ``/``
  Jelly/RB
  Bean/NNP
  ''/''
  and/CC
  ``/``
  Ice/NNP
  Cream/NNP
  Sandwich/NNP
  ''/''
  (VP (V operating/VBG))
  systems/NNS
  ,/,
  which/WDT
  Apple/NNP
  (VP (V claims/VBZ))
  (VP (V infringe/VB))
  its/PRP$
  patents/NNS
  ./.)

(S
  (NP The/DT)
  six/CD
  phones/NNS
  and/CC
  tablets/NNS
  (VP (V affected/VBN))
  (VP (V are/VBP) (NP the/DT))
  Galaxy/NNP
  S/NNP
  III/NNP
  ,/,
  (VP (V running/VBG) (NP the/DT new/JJ))
  Jelly/NNP
  Bean/NNP
  (NP system/NN)
  ,/,
  (NP the/DT)
  Galaxy/NNP
  Tab/NNP
  8.9/CD
  Wifi/NNP
  (NP tablet/NN)
  ,/,
  (NP the/DT)
  Galaxy/NN

In [16]:
print(constituency_output_per_sentence)

[Tree('S', [Tree('NP', [('https', 'NN')]), (':', ':'), Tree('NP', [('//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html', 'JJ')]), ('Documents', 'NNS'), Tree('VP', [Tree('V', [('filed', 'VBN')])]), ('to', 'TO'), Tree('NP', [('the', 'DT')]), ('San', 'NNP'), ('Jose', 'NNP'), Tree('NP', [('federal', 'JJ'), ('court', 'NN')]), Tree('P', [('in', 'IN')]), ('California', 'NNP'), Tree('P', [('on', 'IN')]), ('November', 'NNP'), ('23', 'CD'), Tree('NP', [('list', 'NN')]), ('six', 'CD'), ('Samsung', 'NNP'), ('products', 'NNS'), Tree('VP', [Tree('V', [('running', 'VBG')]), Tree('NP', [('the', 'DT')])]), ('``', '``'), ('Jelly', 'RB'), ('Bean', 'NNP'), ("''", "''"), ('and', 'CC'), ('``', '``'), ('Ice', 'NNP'), ('Cream', 'NNP'), ('Sandwich', 'NNP'), ("''", "''"), Tree('VP', [Tree('V', [('operating', 'VBG')])]), ('systems', 'NNS'), (',', ','), ('which', 'WDT'), ('Apple', 'NNP'), Tree('VP', [Tree('V', [('claims', 'VBZ')])]), Tree('VP', [Tree('V', [

Augment the RegexpParser so that it also detects Named Entity Phrases (NEP), e.g., that it detects *Galaxy S III* and *Ice Cream Sandwich*

In [17]:
# * -> 0 or more
# + -> 1 or more
# ? -> Optional

constituent_parser_v2 = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
NEP: {<NNP>+<CD>?}  # NEP -> One or more NNP(Proper noun, singular) optionally followed by Cardinal Number''')

In [18]:
constituency_v2_output_per_sentence = []
for tagged_tokens in pos_tags_per_sentence:
    constituent_structure = constituent_parser_v2.parse(tagged_tokens)
    constituency_v2_output_per_sentence.append(constituent_structure)
    print()
    print(constituent_structure)


(S
  (NP https/NN)
  :/:
  (NP
    //www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html/JJ)
  Documents/NNS
  (VP (V filed/VBN))
  to/TO
  (NP the/DT)
  (NEP San/NNP Jose/NNP)
  (NP federal/JJ court/NN)
  (P in/IN)
  (NEP California/NNP)
  (P on/IN)
  (NEP November/NNP 23/CD)
  (NP list/NN)
  six/CD
  (NEP Samsung/NNP)
  products/NNS
  (VP (V running/VBG) (NP the/DT))
  ``/``
  Jelly/RB
  (NEP Bean/NNP)
  ''/''
  and/CC
  ``/``
  (NEP Ice/NNP Cream/NNP Sandwich/NNP)
  ''/''
  (VP (V operating/VBG))
  systems/NNS
  ,/,
  which/WDT
  (NEP Apple/NNP)
  (VP (V claims/VBZ))
  (VP (V infringe/VB))
  its/PRP$
  patents/NNS
  ./.)

(S
  (NP The/DT)
  six/CD
  phones/NNS
  and/CC
  tablets/NNS
  (VP (V affected/VBN))
  (VP (V are/VBP) (NP the/DT))
  (NEP Galaxy/NNP S/NNP III/NNP)
  ,/,
  (VP (V running/VBG) (NP the/DT new/JJ))
  (NEP Jelly/NNP Bean/NNP)
  (NP system/NN)
  ,/,
  (NP the/DT)
  (NEP Galaxy/NNP Tab/NNP 8.9/CD)
  (NEP Wifi/NNP)


In [19]:
print(constituency_v2_output_per_sentence)
## these were the output for  the exaples: Galaxy S III and Ice Cream Sandwich
# Tree('NEP', [('Galaxy', 'NNP'), ('S', 'NNP'), ('III', 'NNP')]) 
# Tree('NEP', [('Ice', 'NNP'), ('Cream', 'NNP'), ('Sandwich', 'NNP')]), ("''", "''")

[Tree('S', [Tree('NP', [('https', 'NN')]), (':', ':'), Tree('NP', [('//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html', 'JJ')]), ('Documents', 'NNS'), Tree('VP', [Tree('V', [('filed', 'VBN')])]), ('to', 'TO'), Tree('NP', [('the', 'DT')]), Tree('NEP', [('San', 'NNP'), ('Jose', 'NNP')]), Tree('NP', [('federal', 'JJ'), ('court', 'NN')]), Tree('P', [('in', 'IN')]), Tree('NEP', [('California', 'NNP')]), Tree('P', [('on', 'IN')]), Tree('NEP', [('November', 'NNP'), ('23', 'CD')]), Tree('NP', [('list', 'NN')]), ('six', 'CD'), Tree('NEP', [('Samsung', 'NNP')]), ('products', 'NNS'), Tree('VP', [Tree('V', [('running', 'VBG')]), Tree('NP', [('the', 'DT')])]), ('``', '``'), ('Jelly', 'RB'), Tree('NEP', [('Bean', 'NNP')]), ("''", "''"), ('and', 'CC'), ('``', '``'), Tree('NEP', [('Ice', 'NNP'), ('Cream', 'NNP'), ('Sandwich', 'NNP')]), ("''", "''"), Tree('VP', [Tree('V', [('operating', 'VBG')])]), ('systems', 'NNS'), (',', ','), ('which', 'WDT'

## [total points: 1] Exercise 2: spaCy
Use Spacy to process the same text as you analyzed with NLTK.

In [20]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [21]:
doc = nlp(text) # insert code here

small tip: You can use **sents = list(doc.sents)** to be able to use the index to access a sentence like **sents[2]** for the third sentence.


In [22]:
sents = list(doc.sents)

## (Sentence splitting &) Tokenization, POS, NER and Constituency/dependency parsing using spaCY

### Tokenization

In [23]:
for sentence in doc.sents:
    print()
    print(sentence)
    for token in sentence:
        print(token.text)


https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html

Documents filed to the San Jose federal court in California on November 23 list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" operating systems, which Apple claims infringe its patents.

https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html



Documents
filed
to
the
San
Jose
federal
court
in
California
on
November
23
list
six
Samsung
products
running
the
"
Jelly
Bean
"
and
"
Ice
Cream
Sandwich
"
operating
systems
,
which
Apple
claims
infringe
its
patents
.



The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.

The
six
phones
and
tablets
affected
are
the
Galaxy
S
III
,
running
the
new
Jelly
Bean
system
,
the
Galaxy
Tab
8.9
Wifi
tablet
,
the
Galaxy
Tab
2
10.


### Part of speech tagging

In [24]:
# in the attribute pos_ of each Token object: The simple part-of-speech tag
#in the attribute tag_ of each Token object: The detailed part-of-speech tag

for sentence in sents:
    print()
    print(sentence)
    for token in sentence:
        print(token.text, token.pos_, token.tag_)


https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html

Documents filed to the San Jose federal court in California on November 23 list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" operating systems, which Apple claims infringe its patents.

https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html NOUN NNS


 SPACE _SP
Documents NOUN NNS
filed VERB VBD
to ADP IN
the DET DT
San PROPN NNP
Jose PROPN NNP
federal ADJ JJ
court NOUN NN
in ADP IN
California PROPN NNP
on ADP IN
November PROPN NNP
23 NUM CD
list NOUN NN
six NUM CD
Samsung PROPN NNP
products NOUN NNS
running VERB VBG
the DET DT
" PUNCT ``
Jelly PROPN NNP
Bean PROPN NNP
" PUNCT ''
and CCONJ CC
" PUNCT ``
Ice PROPN NNP
Cream PROPN NNP
Sandwich NOUN NN
" PUNCT ''
operating NOUN NN
systems NOUN NNS
, PUNCT ,
which PRON WDT
Apple PROPN NNP
claims VERB VBZ
infringe VERB VBP
its PRON PRP$
patents 

### Named Entity Recognition

In [25]:
spacy.displacy.render(doc, jupyter=True, style='ent')

In [26]:
# The attribute label_ and an ent (of type spacy.tokens.span.Span) contains the named entity type.

for ent in doc.ents:
    print(ent.text, ent.label_)

https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html TIME
San Jose GPE
California GPE
November 23 DATE
six CARDINAL
Samsung ORG
the "Jelly Bean LAW
Apple ORG
six CARDINAL
the Galaxy S III ORG
Jelly Bean ORG
8.9 CARDINAL
2 10.1 DATE
Galaxy Rugby Pro ORG
Galaxy S III PERSON
Apple ORG
Apple ORG
August DATE
Samsung ORG
US GPE
Apple ORG
1.05bn MONEY
iPad ORG
Galaxy FAC
Samsung ORG
UK GPE
Samsung ORG
Apple ORG
South Korean NORP
iPad ORG


### Constituency/dependency parsing

In [27]:
spacy.displacy.render(doc, jupyter=True, style='dep')

In [28]:
# dep_ provides the syntactic relation, e.g., nsubj
# head provides the head of a Token

for sentence in sents:
    print()
    print(sentence)
    for token in sentence:
        print(token.text, token.dep_, token.head)


https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html

Documents filed to the San Jose federal court in California on November 23 list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" operating systems, which Apple claims infringe its patents.

https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html amod Documents


 dep https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html
Documents nsubj filed
filed ROOT filed
to prep filed
the det court
San nmod Jose
Jose nmod court
federal amod court
court pobj to
in prep court
California pobj in
on prep filed
November pobj on
23 nummod November
list compound products
six nummod products
Samsung compound products
products dobj filed
running acl products
the det Bean
" punct Bean
Jelly compound Bean
Bean dobj running
" punct Bean
and cc Bean
" punct Sandw

## [total points: 7] Exercise 3: Comparison NLTK and spaCy
We will now compare the output of NLTK and spaCy, i.e., in what do they differ?

### [points: 3] Exercise 3a: Part of speech tagging
Compare the output from NLTK and spaCy regarding part of speech tagging.

* To compare, you probably would like to compare sentence per sentence. Describe if the sentence splitting is different for NLTK than for spaCy. If not, where do they differ?
* After checking the sentence splitting, select a sentence for which you expect interesting results and perhaps differences. Motivate your choice.
* Compare the output in `token.tag` from spaCy to the part of speech tagging from NLTK for each token in your selected sentence. Are there any differences? This is not a trick question; it is possible that there are no differences.

### Exercise 3a answers:

- For sentence splitting, NLTK demonstrates better performance, effectively handling the text without any noticeable errors. In contrast, spaCy encounters difficulties distinguishing between the end of the third sentence [3] and the beginning of the fourth [4], mistakenly interpreting the closing quotation mark followed by a period as the start of a new sentence. This misinterpretation wrongly positions the newline character ('\n') that should denote the end of the third sentence at the start of the fourth instead. Consequently, in this specific instance, NLTK's straightforward, punctuation-driven approach to identifying sentence boundaries performs better, compared to spaCy's context-aware model.
- The first sentence has been selected for analysis due to its potential to reveal differences in how NLTK and spaCy process URLs. Additionally the unusual semantics and vocabulary in this sentence, 'Samsung running the Jelly Bean and Ice Cream Sandwich Operating Systems', mean that NLTK and spaCy could yield different and interesting results when part of speech tagging.
- After comparing the output in token.tag from spaCy to the part of speech tagging from NLTK for each token, there appear to be some differences, most notably, the URL seems to be split up by NLTK into: 
    - https NN
    - : :
    - //www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html JJ

    Meanwhile, spaCy treats the URL as follows, indicating that it may have been trained on a wider variety of content,             including web-based content:
    - https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html NNS
    
    #### Also noteworthy are the following differences in tagging for the tokens
    
    NLTK:
    - filed VBN
    - to TO
    - Jelly RB
    - Sandwich NNP
    - operating VBG
    - infringe VB
    
    spaCy:
    - filed VBD
    - to IN
    - Jelly NNP
    - Sandwich NN
    - operating NN
    - infringe VBP

In [29]:
for i,sent in enumerate(sentences_nltk, 1):
    print(i,sent,'\n')
    print()
for i,sent in enumerate(doc.sents, 1):
    print(i,sent)

1 https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html

Documents filed to the San Jose federal court in California on November 23 list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" operating systems, which Apple claims infringe its patents. 


2 The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini. 


3 Apple stated it had â€œacted quickly and diligently" in order to "determine that these newly released products do infringe many of the same claims already asserted by Apple." 


4 In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (Â£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices. 


5 Samsung, which is the world's top mobile phone maker, is appealing the ruling. 


6 A similar case in

In [30]:
# sentences_nltk

## CHOOSE A SENTENCE
chosen_sentence = word_tokenize(sentences_nltk[0])  # Tokenize the first sentence
tagged_tokens = pos_tag(chosen_sentence)

for token, tag in tagged_tokens:  # Iterate over the list of tagged tokens
    print(token, tag)

https NN
: :
//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html JJ
Documents NNS
filed VBN
to TO
the DT
San NNP
Jose NNP
federal JJ
court NN
in IN
California NNP
on IN
November NNP
23 CD
list NN
six CD
Samsung NNP
products NNS
running VBG
the DT
`` ``
Jelly RB
Bean NNP
'' ''
and CC
`` ``
Ice NNP
Cream NNP
Sandwich NNP
'' ''
operating VBG
systems NNS
, ,
which WDT
Apple NNP
claims VBZ
infringe VB
its PRP$
patents NNS
. .


In [31]:
# doc

## CHOOSE A SENTENCE
for token in sents[0]:  # the first sentence is selected
    print(token.text, token.tag_)

https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html NNS


 _SP
Documents NNS
filed VBD
to IN
the DT
San NNP
Jose NNP
federal JJ
court NN
in IN
California NNP
on IN
November NNP
23 CD
list NN
six CD
Samsung NNP
products NNS
running VBG
the DT
" ``
Jelly NNP
Bean NNP
" ''
and CC
" ``
Ice NNP
Cream NNP
Sandwich NN
" ''
operating NN
systems NNS
, ,
which WDT
Apple NNP
claims VBZ
infringe VBP
its PRP$
patents NNS
. .

 _SP


### [points: 2] Exercise 3b: Named Entity Recognition (NER)
* Describe differences between the output from NLTK and spaCy for Named Entity Recognition. Which one do you think performs better?

### Exercise 3b answer:
NLTK struggles with URLs and specific names, sometimes breaking them down incorrectly or assigning odd categorizations, for example, misclassifying 'Apple' as a (PERSON) instead of an organization.
spaCy handles the URL better, but strangely tags the URL as 'TIME'. It occasionally misclassifies names, for example, labeling 'Galaxy S III' as a "PERSON". spaCy appears to perform better in this comparison, as it offers a slightly more robust NER algorithm

### [points: 2] Exercise 3c: Constituency/dependency parsing
Choose one sentence from the text and run constituency parsing using NLTK and dependency parsing using spaCy.
* describe briefly the difference between constituency parsing and dependency parsing
* describe differences between the output from NLTK and spaCy.

### Exercise 3c answer
    - Constituency parsing breaks down a sentence into its constituent parts, also known as phrases or syntactic categories. These constituents are represented in a tree structure, where each node represents a phrase, and leaves represent the words in the sentence. Dependency parsing, on the other hand, focuses on the relationships between words in a sentence. It represents these relationships in a tree structure where each node is a word, and edges are the grammatical relationships (dependencies) between the words. Each dependency has a direction and a type that indicates how two words are related, with one word acting as the "head" of the relationship and the other as the "dependent".
    - The output for NLTK shows how sentences can be decomposed into nested phrases e.g (NEP Galaxy/NNP S/NNP III/NNP), This example highlights the hierarchical structure of a sentence and identifies the roles played by each phrase within the sentence.
    The output for spaCY shows how each word in the sentence is connected to others, indicating the type of grammatical relationship that exists between them e.g 'six nummod phones'and 'phones nsubj are', where 'phones' is the head for 'six', but de dependent for 'are'.  

In [32]:
## CHOOSE A SENTENCE
chosen_sentence = word_tokenize(sentences_nltk[1])  # Tokenize the second sentence
tagged_tokens = pos_tag(chosen_sentence)
constituent_structure = constituent_parser_v2.parse(tagged_tokens)
print(constituent_structure)

(S
  (NP The/DT)
  six/CD
  phones/NNS
  and/CC
  tablets/NNS
  (VP (V affected/VBN))
  (VP (V are/VBP) (NP the/DT))
  (NEP Galaxy/NNP S/NNP III/NNP)
  ,/,
  (VP (V running/VBG) (NP the/DT new/JJ))
  (NEP Jelly/NNP Bean/NNP)
  (NP system/NN)
  ,/,
  (NP the/DT)
  (NEP Galaxy/NNP Tab/NNP 8.9/CD)
  (NEP Wifi/NNP)
  (NP tablet/NN)
  ,/,
  (NP the/DT)
  (NEP Galaxy/NNP Tab/NNP 2/CD)
  10.1/CD
  ,/,
  (NEP Galaxy/NNP Rugby/NNP Pro/NNP)
  and/CC
  (NEP Galaxy/NNP S/NNP III/NNP)
  (NP mini/NN)
  ./.)


In [34]:
## CHOOSE A SENTENCE
for token in sents[1]:  # the second sentence is selected
    print(token.text, token.dep_, token.head)

The det phones
six nummod phones
phones nsubj are
and cc phones
tablets conj phones
affected acl tablets
are ROOT are
the det III
Galaxy compound III
S compound III
III attr are
, punct are
running advcl are
the det system
new amod system
Jelly compound Bean
Bean compound system
system dobj running
, punct system
the det tablet
Galaxy compound tablet
Tab nmod tablet
8.9 nummod tablet
Wifi compound tablet
tablet appos system
, punct tablet
the det Tab
Galaxy compound Tab
Tab conj tablet
2 compound 10.1
10.1 nummod Tab
, punct Tab
Galaxy compound Pro
Rugby compound Pro
Pro conj Tab
and cc Pro
Galaxy compound III
S compound III
III conj Pro
mini appos Pro
. punct are

 dep .


# End of this notebook