# Lab1-Assignment

Copyright, Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the assignment for Lab 1. 

**Points**: each exercise is prefixed with the number of points you can obtain for it. These points only indicate which parts we value more; the assignment itself is assessed as PASS/NO-PASS. In general, we value a critical analysis of running code more than just showing that you can create or run the code. So if you have successfully carried out the instructed commands in a notebook, you are not done yet. We want you to analyse and understand what the code is doing with language and text. Be critical, think about how to explain what you observe, and write this down in the notebook after running the code. The assignment states clearly where we expect this from you.

You can do the assignment as a group, but make sure that you understand the code and can carry out the coding yourself as well. You need these skills for your final assignment, which is graded. Feedback will be given at the group level.

We assume you have worked through the following notebooks:
* **Lab1.1-introduction**
* **Lab1.2-introduction-to-NLTK**
* **Lab1.3-introduction-to-spaCy** 

In this assignment, you will process an English text (**Lab1-apple-samsung-example.txt**) with both NLTK and spaCy and discuss the similarities and differences.

## Who to contact for questions
* Piek Vossen (piek.vossen@vu.nl)

## Tip: how to read a file from disk
Let's open the file **Lab1-apple-samsung-example.txt** from disk. It should be located in the same folder as this notebook. The simplest way is to specify the full path to the file, e.g.:

```
path_to_file='/Users/piek/Desktop/t-ONDERWIJS/2021-2022/t-MA-HLT-introduction-2021/ma-hlt-labs/lab1.toolkits'
```

This may work for me but not for you as it is unlikely that the file has the same path on your machine.

We can use the `Path` class from the `pathlib` module to find the directory of this notebook. Once we have that, we only need to append the name of the text file to this path. This is how you do this:

In [1]:
from pathlib import Path

In [2]:
cur_dir = Path().resolve() # this should provide you with the folder in which this notebook is placed
print('Current directory of this notebook:', cur_dir)

## We can now stick the name of the file to the end of the Path using the *joinpath* function:
path_to_file = Path.joinpath(cur_dir, 'Lab1-apple-samsung-example.txt')
print('Path to the text file:', path_to_file)

Current directory of this notebook: /Users/piek/Downloads/ma-hlt-labs-local/lab1.toolkits
Path to the text file: /Users/piek/Downloads/ma-hlt-labs-local/lab1.toolkits/Lab1-apple-samsung-example.txt


If you are unsure whether the path is correct, you can check if the file exists at that location:

In [3]:
print('does path exist? ->', Path.exists(path_to_file))

does path exist? -> True


If the output from the code cell above says **does path exist? -> False**, please check that the file **Lab1-apple-samsung-example.txt** is in the same directory as this notebook. In JupyterLab you should see it in the file overview panel to the left, next to the notebook.
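As a side note, `pathlib` also lets you join path parts with the `/` operator, which gives the same result as *joinpath*. The sketch below is a minimal illustration under assumptions: it uses a temporary folder and a hypothetical file name `demo.txt` as stand-ins for the notebook folder and the text file.

```python
from pathlib import Path
import tempfile

# A temporary folder stands in for the notebook folder (assumption for this demo).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    demo_file = demo_dir / 'demo.txt'            # join with the / operator
    same_path = demo_dir.joinpath('demo.txt')    # join with joinpath

    exists_before = demo_file.exists()           # nothing has been written yet
    demo_file.write_text('hello', encoding='utf-8')
    exists_after = demo_file.is_file()           # now it is a regular file

print('same path?', demo_file == same_path)
print('existed before writing?', exists_before)
print('exists after writing?', exists_after)
```

Both spellings build the same `Path` object, so you can use whichever you find more readable.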

Now we can open the file and access its content. Let's read the complete content and ask for its length using the 'len' function, which tells us how many characters a string has:

In [4]:
with open(path_to_file) as infile:
    text = infile.read()

print('number of characters', len(text))

number of characters 1139


In [5]:
print(text)

https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html

Documents filed to the San Jose federal court in California on November 23 list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" operating systems, which Apple claims infringe its patents.
The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.
Apple stated it had “acted quickly and diligently" in order to "determine that these newly released products do infringe many of the same claims already asserted by Apple."
In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices. Samsung, which is the world's top mobile phone maker, is appealing the ruling.
A similar case in the UK found in Samsung's fav

We have now created a string object with the name 'text' that we can use for the assignment below.

If for some reason you see weird characters in the text, you may have a problem with the character encoding. Computers use different systems to represent scripts. For most languages UTF-8 will work, as it can represent characters from many different scripts. In some cases, especially for older texts, Latin encodings have been used, which work for English and some other languages but cannot represent the characters of others. For non-Western scripts, special encodings have been defined. You never know for sure what encoding a text is in, but nowadays most texts are in UTF-8.
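To make the idea of "weird characters" concrete, the following minimal sketch encodes a short fragment containing a pound sign as UTF-8 and then decodes the same bytes once with the wrong encoding (Latin-1) and once with the right one. The variable names are chosen for this illustration only.

```python
sample = 'its rival $1.05bn (£0.66bn)'

utf8_bytes = sample.encode('utf-8')      # the pound sign takes two bytes in UTF-8
wrong = utf8_bytes.decode('latin-1')     # decoding with the wrong encoding
right = utf8_bytes.decode('utf-8')       # decoding with the matching encoding

print('characters:', len(sample), 'bytes:', len(utf8_bytes))
print('decoded as Latin-1:', wrong)
print('decoded as UTF-8:  ', right)
```

The Latin-1 reading shows an extra character in front of the pound sign, which is the typical symptom of a UTF-8 file being read with a Latin encoding.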

## What to do if you see weird tokens?
First check if you are really using Python 3.x and not Python 2.x when running the notebook. You can do this using: 

    import platform
    print(platform.python_version())

If you are running 3.x and still have encoding problems, try to open the file as UTF-8:

    with open(path_to_file, encoding='utf-8') as infile:
        text = infile.read()
    
Note that when you open a text file in a plain text editor, you never know how it loads the file. The weird characters may still be there or disappear. In some cases, you can try to save the text file again using UTF-8 but this can also corrupt your file. It is wise to make a copy of the file before you try this.
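If you do decide to re-save a file as UTF-8, you can do it from Python instead of a text editor. The sketch below is one possible way, under assumptions: the file name is hypothetical, the file is simulated in a temporary folder, and we pretend to know that it was written in Latin-1. It first makes a backup copy and only then rewrites the file.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / 'example-latin1.txt'          # hypothetical file name
    original.write_bytes('£0.66bn'.encode('latin-1'))    # simulate a Latin-1 file

    backup = original.parent / (original.name + '.bak')
    shutil.copy2(original, backup)                       # keep an untouched copy first

    content = original.read_text(encoding='latin-1')     # decode with the assumed encoding
    original.write_text(content, encoding='utf-8')       # re-save as UTF-8

    reread = original.read_text(encoding='utf-8')
    backup_bytes = backup.read_bytes()

print('re-read as UTF-8:', reread)
print('backup still holds the original bytes:', backup_bytes)
```

If the assumed source encoding is wrong, the rewritten file will be corrupted, which is exactly why the backup is made first.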


## [total points: 4] Exercise 1: NLTK
In this exercise, we use NLTK to apply **Part-of-speech (POS) tagging**, **Named Entity Recognition (NER)**, and **Constituency parsing**. The following code snippet already performs sentence splitting and tokenization. 

In [6]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize

In [7]:
sentences_nltk = sent_tokenize(text)

In [None]:
tokens_per_sentence = [] # this will become a list of lists!!!

#Below you find a so-called for loop
for sentence_nltk in sentences_nltk:
    sent_tokens = word_tokenize(sentence_nltk)
    # We append the tokens of this sentence to the result list
    tokens_per_sentence.append(sent_tokens)
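The result of such a loop is a list of lists: one inner list of tokens per sentence. The minimal sketch below shows how to index into that structure. It uses two made-up sentences and plain whitespace splitting as a stand-in for the NLTK tokenizers, so the names and data are assumptions for illustration only.

```python
# Stand-in sentences and a whitespace tokenizer (assumptions for this demo;
# the notebook itself uses sent_tokenize and word_tokenize).
demo_sentences = ['Samsung is appealing the ruling .', 'Apple filed documents .']

# Same pattern as the for loop above, written as a list comprehension.
demo_tokens_per_sentence = [sentence.split() for sentence in demo_sentences]

print(len(demo_tokens_per_sentence))      # number of sentences
print(demo_tokens_per_sentence[1])        # all tokens of the second sentence
print(demo_tokens_per_sentence[1][0])     # first token of the second sentence
```

The first index selects the sentence and the second index selects a token within that sentence.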

We will use lists to keep track of the output of the NLP tasks, so we can inspect the output of each task using the index of the sentence. Let's look at the first sentence. Since the text starts with the URL to the source, our first real sentence is index 1.

In [11]:
sentence_id = 1
print('SENTENCE', sentences_nltk[sentence_id])
print('TOKENS', tokens_per_sentence[sentence_id])

SENTENCE The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.
TOKENS ['The', 'six', 'phones', 'and', 'tablets', 'affected', 'are', 'the', 'Galaxy', 'S', 'III', ',', 'running', 'the', 'new', 'Jelly', 'Bean', 'system', ',', 'the', 'Galaxy', 'Tab', '8.9', 'Wifi', 'tablet', ',', 'the', 'Galaxy', 'Tab', '2', '10.1', ',', 'Galaxy', 'Rugby', 'Pro', 'and', 'Galaxy', 'S', 'III', 'mini', '.']


With sentence_id=1 we get the second item in the sentence list. You can assign any value to sentence_id to carry out the assignment. Also try out sentence_id=0. What do you notice?

For the assignment, pick a sentence you think has interesting properties and set the value of sentence_id accordingly. In this notebook it is set to 2, which selects the third sentence. Change it to get your sentence and continue with the assignment, in which you will use this value for sentence_id from now on. So do not pick a short sentence without punctuation or any named entities!!

In [12]:
sentence_id=2

### Explain here in words why you selected this sentence! Why do you expect it to give interesting results?
[your explanation goes here]

### [point: 1] Exercise 1a: Part-of-speech (POS) tagging
Use *nltk.pos_tag* to perform part-of-speech tagging on a single sentence.

Use **print** to show the output in the notebook (and hence also in the exported PDF!).

In [13]:
sentence_tokens = tokens_per_sentence[sentence_id]
pos_tagged_sentence_tokens= [] 
#put here the call to nltk pos tagger and to assign the result to the variable 'pos_tagged_sentence_tokens'
print(pos_tagged_sentence_tokens)

[]


### Your analysis
[Look at the output and comment on the Part-of-Speech tags. Your comments go here]

### [point: 1] Exercise 1b: Named Entity Recognition (NER)
Use *nltk.chunk.ne_chunk* to perform Named Entity Recognition (NER) on your selected sentence.

Use **print** to show the output in the notebook (and hence also in the exported PDF!).

In [14]:
tokens_pos_tagged_and_named_entities = []
#put here the call to nltk named entity chunker and to assign the result to the variable 'tokens_pos_tagged_and_named_entities'
print(tokens_pos_tagged_and_named_entities)

[]


### Your analysis
[Look at the output and comment on the entities detected and labeled. Your comments go here]

### [points: 2] Exercise 1c: Constituency parsing
Use the *nltk.RegexpParser* to perform constituency parsing on your selected sentence. Think about what the parsing expects as input.

Use **print** to show the output in the notebook (and hence also in the exported PDF!).

In [15]:
constituent_parser_v1 = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*''')
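One way to read the tag patterns in this grammar is as regular expressions over the sequence of PoS tags rather than over the words. The sketch below imitates two of the rules with Python's `re` module. It is only an analogy for reading the grammar, not NLTK's actual implementation, and the toy sentence with its hand-assigned tags is an assumption.

```python
import re

# Toy sentence with hand-assigned Penn Treebank tags (an assumption for this
# demo; a real tagger may assign different tags).
tagged = [('the', 'DT'), ('new', 'JJ'), ('system', 'NN'),
          ('runs', 'VBZ'), ('on', 'IN'), ('phones', 'NNS')]

# Write the tag sequence as a single string, one <TAG> per token.
tag_string = ''.join(f'<{tag}>' for _, tag in tagged)

# Regex counterpart of the chunk rule  NP: {<DT>? <JJ>* <NN>*}
np_rule = re.compile(r'(?:<DT>)?(?:<JJ>)*(?:<NN>)*')
# Regex counterpart of  V: {<V.*>}  (the wildcard stays inside one <...> tag)
v_rule = re.compile(r'<V[^<>]*>')

# Every part of the NP rule is optional, so empty matches are filtered out.
np_matches = [m.group() for m in np_rule.finditer(tag_string) if m.group()]
v_matches = v_rule.findall(tag_string)

print(tag_string)
print(np_matches)
print(v_matches)
```

In this sketch the plural noun tagged `NNS` is not picked up by the NP rule, because the pattern names the tag `NN` exactly; a pattern such as `<NN.*>` would also cover it.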

In [16]:
constituency_v1_output_for_sentence = []
#add here your code to assign the output of the parser 'constituent_parser_v1' to the variable name 'constituency_v1_output_for_sentence'

In [17]:
print(constituency_v1_output_for_sentence)

[]


### Your analysis
[Look at the output and comment on the structures detected and labels assigned. Explain this by commenting on the grammar]

Augment the RegexpParser so that it also detects Named Entity Phrases (NEP), e.g., that it detects phrases such as *Galaxy S III* and *Ice Cream Sandwich* as entity phrases, which we give the label 'NEP'. Below you see an empty structure {} for NEP and ??? as comment. Fill the empty structure with a pattern to detect NEPs and write your comment in the code to explain what you have done.

## Note

You should apply the grammar to the output of the PoS tagging and not the output of the ne_chunker!!!

In [18]:
constituent_parser_v2 = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
NEP: {}             # ???''')

In [19]:
constituency_v2_output_for_sentence = []
#add here your code to assign the output of the parser 'constituent_parser_v2' to the variable name 'constituency_v2_output_for_sentence'

In [20]:
print(constituency_v2_output_for_sentence)

[]


### Your analysis
[Compare the output of your grammar with the ne_chunker output. Which has better coverage and which has richer labels? How could you improve your grammar?]

## [total points: 1] Exercise 2: spaCy
Use spaCy to process the same text as you analyzed with NLTK.

In [21]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [22]:
doc = nlp(text) # insert code here

**Tip**: You can use **sents = list(doc.sents)** to access a sentence by index, e.g. **sents[2]** for the third sentence. Use the previously defined "sentence_id" to get the output for the same sentence as in NLTK. Note that we assume that both toolkits split the sentences in the same way. So let's get the spaCy sentence for sentence_id and use this to compare against the NLTK output we had.

In [23]:
sents=list(doc.sents)
spacy_sentence=sents[sentence_id]
print(spacy_sentence)

"
In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices.


## [total points: 5] Exercise 3: Comparison NLTK and spaCy
We will now compare the output of NLTK and spaCy for the same sentence, identified by sentence_id, that you just processed with NLTK: in what ways do they differ? We expect you to think critically about the differences in output, not just run the code and describe them. What is good, what is bad, and why?

### [points: 2] Exercise 3a: Part of speech tagging
### Your analysis

Compare the output from NLTK and spaCy regarding part-of-speech tagging for the selected sentence. You already have the PoS tags from NLTK. Get the tokens and their PoS tags from spaCy (**token.tag_**; note the trailing underscore, since **token.tag** without it returns an integer ID rather than the tag string). Print both and describe any differences. This is not a trick question; it is possible that there are no differences.

In [24]:
sentence=sents[sentence_id]
print(sentence)

"
In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices.
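For the comparison itself, it helps to print the two tag sequences side by side and flag the positions where they disagree. The sketch below shows one possible way to do that. The two lists are invented for illustration and are not real tagger output, and the code assumes that both toolkits produced the same tokens in the same order; if the tokenization differs, you would need to align the tokens first.

```python
from itertools import zip_longest

# Invented (token, tag) pairs for illustration only; these are NOT real tagger output.
tagger_a = [('Samsung', 'NNP'), ('lost', 'VBD'), ('a', 'DT'), ('US', 'NNP'), ('case', 'NN')]
tagger_b = [('Samsung', 'NNP'), ('lost', 'VBD'), ('a', 'DT'), ('US', 'JJ'), ('case', 'NN')]

differences = []
# zip_longest pads the shorter list, so a length mismatch shows up as '<missing>'.
for pair_a, pair_b in zip_longest(tagger_a, tagger_b, fillvalue=('<missing>', '-')):
    marker = '' if pair_a == pair_b else '  <-- differs'
    if marker:
        differences.append((pair_a, pair_b))
    print(f'{pair_a[0]:10} {pair_a[1]:5} | {pair_b[0]:10} {pair_b[1]:5}{marker}')

print('number of differences:', len(differences))
```

In your own notebook you would replace the two invented lists with the NLTK output and the (token text, tag) pairs you collect from the spaCy sentence.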


### [points: 1] Exercise 3b: Named Entity Recognition (NER)
* For the same sentence, describe differences between the output from NLTK and spaCy for Named Entity Recognition. Which one do you think performs better?

### Your analysis

# End of Assignment 1