<div class="alert alert-block alert-info">
    <h1>Natural Language Processing</h1>
    <h3>General Information:</h3>
    <p>Please do not add or delete any cells. Answers belong into the corresponding cells (below the question). If a function is given (either as a signature or a full function), you should not change the name, arguments or return value of the function.<br><br> If you encounter empty cells underneath the answer that can not be edited, please ignore them, they are for testing purposes.<br><br>When editing an assignment there can be the case that there are variables in the kernel. To make sure your assignment works, please restart the kernel and run all cells before submitting (e.g. via <i>Kernel -> Restart & Run All</i>).</p>
    <p>Code cells where you are supposed to give your answer often include the line  ```raise NotImplementedError```. This makes it easier to automatically grade answers. If you edit the cell please outcomment or delete this line.</p>
    <h3>Submission:</h3>
    <p>Please submit your notebook via the web interface (in the main view -> Assignments -> Submit). The assignments are due on <b>Wednesday at 15:00</b>. If this does not work there is a submission slot on LEA.</p>
    <h3>Group Work:</h3>
    <p>You are allowed to work in groups of up to three people. Please enter the UID (your username here) of each member of the group into the next cell. We apply plagiarism checking, so do not submit solutions from other people except your team members. If an assignment has a copied solution, the task will be graded with 0 points for all people with the same solution.</p>
    <h3>Questions about the Assignment:</h3>
    <p>If you have questions about the assignment please post them in the LEA forum before the deadline. Don't wait until the last day to post questions.</p>
    
</div>

In [None]:
'''
Group Work:
Enter the UID (LEA username) of each team member into the variables. 
If you work alone please fill the first variable only.
'''
member1 = 'mkolpe2s'
member2 = 'agomez2s'
member3 = ''

# 1. Introduction to spaCy

SpaCy is a tool that does tokenization, parsing, tagging and named entity regocnition (among other things).

When we parse a document via spaCy, we get an object that holds sentences and tokens, as well as their POS tags, dependency relations and so on.

Look at the next cell for an example.

In [1]:
import spacy

nlp = spacy.load('en_core_web_sm')

text = 'SpaCy is capable of    tagging, parsing and annotating text. It recognizes sentences and stop words.'

doc = nlp(text)

# For every sentence
for sent in doc.sents:
    # For every token
    for token in sent:
        # Print the token itself, the pos tag, 
        # dependency tag and whether spacy thinks this is a stop word
        print(token, token.pos_, token.dep_, token.is_stop)
        
print('-'*30)
print('The nouns and proper nouns in this text are:')
# Print only the nouns:
for token in doc:
    if token.pos_ in ['NOUN', 'PROPN']:
        print(token)

SpaCy PROPN nsubj False
is AUX ROOT True
capable ADJ acomp False
of ADP prep True
    SPACE  False
tagging NOUN pobj False
, PUNCT punct False
parsing VERB conj False
and CCONJ cc True
annotating VERB conj False
text NOUN dobj False
. PUNCT punct False
It PRON nsubj True
recognizes VERB ROOT False
sentences NOUN dobj False
and CCONJ cc True
stop VERB conj False
words NOUN dobj False
. PUNCT punct False
------------------------------
The nouns and proper nouns in this text are:
SpaCy
tagging
text
sentences
words


## 1.1 Splitting text into sentences

You are given the text in the next cell.

```
text = '''
This is a sentence. 
Mr. A. said this was another! 
But is this a sentence? 
The abbreviation Merch. means merchant(s).
At certain univ. in the U.S. and U.K. they study NLP.
'''
```

Use spaCy to split this into sentences. Store the resulting sentences (each as a **single** string) in the list ```sentences```. Make sure to convert the tokens to strings (e.g. via str(token)).

In [9]:
import spacy
nlp = spacy.load('en_core_web_sm')

text = '''
This is a sentence. Mr. A. said this was another! 
But is this a sentence? The abbreviation Merch. means merchant(s).
At certain Univ. in the U.S. and U.K. they study NLP.
'''
sentences = []

# YOUR CODE HERE
doc = nlp(text)

for sentence in doc.sents:
    sentences.append(str(sentence))

for sentence in sentences:
    print(sentence)
    print('.')
    assert type(sentence) == str, 'You need to convert this to a single string!'


This is a sentence.
.
Mr. A. said this was another! 

.
But is this a sentence?
.
The abbreviation Merch. means merchant(s).

.
At certain Univ.
.
in the U.S. and U.K. they study NLP.

.


In [None]:
# This is a test cell, please ignore it!

## 1.2 Cluster the text by POS tag

Next we want to cluster the text by the corresponding part-of-speech (POS) tags. 

The result should be a dictionary ```pos_tags``` where the keys are the POS tags and the values are lists of words with those POS tags. Make sure your words are converted to **strings**.

*Example:*

```
pos_tags['VERB'] # Output: ['said', 'means', 'study']
pos_tags['ADJ']  # Output: ['certain']
...
```

In [59]:
import spacy
nlp = spacy.load('en_core_web_sm')

text = '''
This is a sentence. Mr. A. said this was another! 
But is this a sentence? The abbreviation Merch. means merchant(s).
At certain Univ. in the U.S. and U.K. they study NLP.
'''

pos_tags = dict()

# YOUR CODE HERE
doc = nlp(text)

for token in doc:
    if str(token.pos_) in pos_tags:
        pos_tags[str(token.pos_)].append(token.text)
    else:
        pos_tags[str(token.pos_)] = [token.text]



for key in pos_tags:
    print('The words with the POS tag {} are {}.'.format(key, pos_tags[key]))
    for token in pos_tags[key]:
        assert type(token) == str, 'Each token should be a string'

The words with the POS tag ADJ are ['certain'].
The words with the POS tag PRON are ['they'].
The words with the POS tag DET are ['This', 'a', 'this', 'another', 'this', 'a', 'The', 'the'].
The words with the POS tag PUNCT are ['.', '!', '?', '.', ')', '.', '.', '.'].
The words with the POS tag SPACE are ['\n', '\n', '\n', '\n'].
The words with the POS tag CCONJ are ['But', 'and'].
The words with the POS tag VERB are ['said', 'means', 'study'].
The words with the POS tag ADP are ['At', 'in'].
The words with the POS tag PROPN are ['Mr.', 'A.', 'Merch', 'merchant(s', 'Univ', 'U.S.', 'U.K.', 'NLP'].
The words with the POS tag NOUN are ['sentence', 'sentence', 'abbreviation'].
The words with the POS tag AUX are ['is', 'was', 'is'].


In [None]:
# This is a test cell, please ignore it!

## 1.3 Stop word removal

Stop words are words that appear often in a language and don't hold much meaning for a NLP task. Examples are the words ```a, to, the, this, has, ...```. This depends on the task and domain you are working on.

SpaCy has its own internal list of stop words. Use spaCy to remove all stop words from the given text. Store your result as a **single string** in the variable ```stopwords_removed```.

In [57]:
import spacy
nlp = spacy.load('en_core_web_sm')

text = '''
This is a sentence. Mr. A. said this was another! 
But is this a sentence? The abbreviation Merch. means merchant(s).
At certain Univ. in the U.S. and U.K. they study NLP.
'''

stopwords_removed = ''

# YOUR CODE HERE
doc = nlp(text)
    
for token in doc:
    if(not token.is_stop):
        stopwords_removed += token.text
        if(token.whitespace_):
            stopwords_removed += token.whitespace_

print(stopwords_removed)
assert type(stopwords_removed) == str, 'Your answer should be a single string!'


sentence. Mr. A. said ! 
sentence? abbreviation Merch. means merchant(s).
certain Univ. U.S. U.K. study NLP.



In [None]:
# This is a test cell, please ignore it!

## 1.4 Dependency Tree

We now want to use spaCy to visualize the dependency tree of a certain sentence. Look at the Jupyter Example on the [spaCy website](https://spacy.io/usage/visualizers/). Render the tree.

In [68]:
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')

text = 'Dependency Parsing is helpful for many tasks.'

# YOUR CODE HERE
doc = nlp(text)
displacy.render(doc, style="dep", jupyter=True)

## 1.5 Dependency Parsing

Use spaCy to extract all subjects and objects from the text. We define a subject as any word that has ```subj``` in its dependency tag (e.g. ```nsubj```, ```nsubjpass```, ...). Similarly we define an object as any token that has ```obj``` in its dependency tag (e.g. ```dobj```, ```pobj```, etc.).

For each sentence extract the subject, root node ```ROOT``` of the tree and object and store them as a single string in a list. Name this list ```subj_obj```.

*Example:*

```
text = 'Learning multiple ways of representing text is cool. We can access parts of the sentence with dependency tags.'

subj_obj = ['Learning ways text is', 'We access parts sentence tags']
```

In [91]:
text = '''
This is a sentence. Mr. A. said this was another! 
But is this a sentence? The abbreviation Merch. means merchant(s).
At certain Univ. in the U.S. and U.K. they study NLP.
'''

subj_obj = []
# YOUR CODE HERE
doc = nlp(text)

for sentence in doc.sents:
    sen = ''
    for token in sentence:
        if 'subj' in token.dep_ or 'obj' in token.dep_ or 'ROOT' in token.dep_:
            sen += token.text
            if(token.whitespace_):
                sen += token.whitespace_
            
    subj_obj.append(sen)

for cleaned_sent in subj_obj:
    print(cleaned_sent)
    assert type(cleaned_sent) == str, 'Each cleaned sentence should be a string!'

This is 
A. said this 
is this 
abbreviation means 
At Univ
U.S. they study NLP


In [None]:
# This is a test cell, please ignore it!

# 2. Keyword Extraction

In this assignment we want to write a keyword extractor. There are several methods of which we want to explore a few.

We want to extract keywords from Wikipedia articles about ```Natural language processing```.

## 2.1 POS tag based extraction

When we look at keywords we realize that they are often combinations of nouns and adjectives. The idea is to find all sequences of nouns and adjectives in a corpus and count them. The $n$ most frequent ones are then our keywords.

A keyword by this definition is any combination of nouns (NOUN) and adjectives (ADJ) that ends in a noun. We also count proper nouns (PROPN) as nouns.

### 2.1.1 POSKeywordExtractor

Please complete the function ```keywords``` in the class ```POSKeywordExtractor```.

You are given the file ```corpus.txt```, which has the raw text from all top-level pages under the category ```Natural language processing```. Use this for extracting your keywords.

*Example:*

Let us look at the definition of an index term or keyword from Wikipedia. Here I highlighted all combinations of nouns and adjectives that end in a noun. All the highlighted words are potential keywords.

An **index term**, **subject term**, **subject heading**, or **descriptor**, in **information retrieval**, is a **term** that captures the **essence** of the **topic** of a **document**. **Index terms** make up a **controlled vocabulary** for **use** in **bibliographic records**.

In [83]:
%%time
from typing import List, Tuple
from collections import Counter
import spacy
from collections import Counter

class POSKeywordExtractor:
    
    def __init__(self):
        # Set up SpaCy in a more efficient way by disabling what we do not need
        # This is the dependency parser (parser) and the named entity recognizer (ner)
        self.nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
        # Add the sentencizer to quickly split our text into sentences
        self.nlp.add_pipe(self.nlp.create_pipe('sentencizer'))
        # Increase the maximum length of text SpaCy can parse in one go
        self.nlp.max_length = 1500000
        
    
    def keywords(self, text: str, n_keywords: int, min_words: int) -> List[Tuple[Tuple[str], int]]:
        '''
        Extract the top n most frequent keywords from the text.
        Keywords are sequences of adjectives and nouns that end in a noun
        
        Arguments:
            text       -- the raw text from which to extract keywords
            n_keywords -- the number of keywords to return
            min_words  -- the number of words a potential keyphrase has to include
                          if this is set to 2, then only keyphrases consisting of 2+ words are counted
        Returns:
            keywords   -- List of keywords and their count, sorted by the count
                          Example: [(('potato'), 12), (('potato', 'harvesting'), 9), ...]
        '''
        #text ='An index term, subject term, subject heading, or descriptor, in information retrieval, is a term that captures the essence of the topic of a document. Index terms make up a controlled vocabulary for use in bibliographic records.'
        
        doc = self.nlp(text)
        keywords = []
        
        # YOUR CODE HERE
        preliminar_keywords = []
        curr_keyword = []
        for token in doc:
            if(token.pos_ == 'NOUN' or token.pos_ == 'PNOUN' or token.pos_ =='ADJ'):
                curr_keyword.append(token)
                #print(token.pos_)
                
            elif(len(curr_keyword) >= min_words):
                if(curr_keyword[-1].pos_ == 'NOUN' or curr_keyword[-1].pos_ == 'PNOUN'):
                    new_keyword = [t.text for t in curr_keyword]
                    preliminar_keywords.append(' '.join(new_keyword))
                curr_keyword = []
                
        #print(preliminar_keywords)
        word_freq = Counter(preliminar_keywords)
        sorted_keywords = word_freq.most_common(n_keywords)
        
        for word in sorted_keywords:
            keywords.append((tuple(word[0].split(' ')), word[1]))

        return keywords
    
with open('corpus.txt', 'r') as corpus_file:
    text = corpus_file.read()
    
keywords = POSKeywordExtractor().keywords(text.lower(), n_keywords=15, min_words=1)

'''
Expected output:
The keyword ('text',) appears 330 times.
The keyword ('words',) appears 317 times.
The keyword ('example',) appears 255 times.
The keyword ('word',) appears 217 times.
...
'''
for keyword in keywords:
    print('The keyword {} appears {} times.'.format(*keyword))#s

The keyword ('words',) appears 349 times.
The keyword ('text',) appears 345 times.
The keyword ('example',) appears 267 times.
The keyword ('word',) appears 239 times.
The keyword ('natural', 'language', 'processing') appears 180 times.
The keyword ('references',) appears 167 times.
The keyword ('documents',) appears 160 times.
The keyword ('language',) appears 157 times.
The keyword ('information',) appears 152 times.
The keyword ('systems',) appears 147 times.
The keyword ('set',) appears 135 times.
The keyword ('system',) appears 130 times.
The keyword ('sentence',) appears 114 times.
The keyword ('number',) appears 113 times.
The keyword ('context',) appears 110 times.
CPU times: user 5.28 s, sys: 917 ms, total: 6.19 s
Wall time: 6.19 s


In [None]:
# This is a test cell, please ignore it!

### 2.1.2 Testing parameters

Rerun the keyword extrator with a minimum word count of ```min_words=2``` and a keyword count of ```n_keywords=15```.

Store this in the variable ```keywords_2```. Print the result.

Make sure to convert the input text to lower case!

In [84]:
keywords_2 = []

# YOUR CODE HERE
keywords_2 = POSKeywordExtractor().keywords(text.lower(), n_keywords=15, min_words=2)
for keyword in keywords_2:
    print('The keyword {} appears {} times.'.format(*keyword))

The keyword ('natural', 'language', 'processing') appears 116 times.
The keyword ('computational', 'linguistics') appears 58 times.
The keyword ('machine', 'translation') appears 48 times.
The keyword ('external', 'links') appears 47 times.
The keyword ('natural', 'language') appears 46 times.
The keyword ('sentiment', 'analysis') appears 43 times.
The keyword ('references', 'external', 'links') appears 41 times.
The keyword ('text', 'mining') appears 41 times.
The keyword ('word', 'sense', 'disambiguation') appears 36 times.
The keyword ('artificial', 'intelligence') appears 28 times.
The keyword ('machine', 'learning') appears 28 times.
The keyword ('natural', 'language', 'understanding') appears 28 times.
The keyword ('information', 'extraction') appears 24 times.
The keyword ('customer', 'card') appears 22 times.
The keyword ('speech', 'recognition') appears 21 times.


In [None]:
# This is a test cell, please ignore it!

## 2.2 Stop word based extraction

Another approach to extract keywords is by splitting the text at the stop words. Then we count these potential keywords and output the top $n$ keywords. Make sure to only include words proper words. Here we define proper words as those words that match the regular expression ```r'\b(\W+|\w+)\b'```. 

### 2.2.1 StopWordKeywordExtractor

Complete the function ```keywords``` in the class ```StopWordKeywordExtractor```.

In [87]:
%%time
from typing import List, Tuple
from collections import Counter
import re
import spacy
from collections import Counter

class StopWordKeywordExtractor:
    
    def __init__(self):
        # Set up SpaCy in a more efficient way by disabling what we do not need
        # This is the dependency parser (parser) and the named entity recognizer (ner)
        self.nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
        # Add the sentencizer to quickly split our text into sentences
        self.nlp.add_pipe(self.nlp.create_pipe('sentencizer'))
        # Increase the maximum length of text SpaCy can parse in one go
        self.nlp.max_length = 1500000
        
    def is_proper_word(self, token:str) -> bool:
        '''
        Checks if the word is a proper word by our definition
        
        Arguments:
            token     -- The token as a string
        Return:
            is_proper -- True / False
        '''
        match = re.search(r'\b(\W+|\w+)\b', token)
        if(match):
            return token == match.group(0)
        else:
            return False 
    
    def keywords(self, text: str, n_keywords: int, min_words: int) -> List[Tuple[Tuple[str], int]]:
        '''
        Extract the top n most frequent keywords from the text.
        Keywords are sequences of adjectives and nouns that end in a noun
        
        Arguments:
            text       -- the raw text from which to extract keywords
            n_keywords -- the number of keywords to retu[rn
            min_words  -- the number of words a potential keyphrase has to include
                          if this is set to 2, then only keyphrases consisting of 2+ words are counted
        Returns:
            keywords   -- List of keywords and their count, sorted by the count
                          Example: [(('potato'), 12), (('potato', 'harvesting'), 9), ...]
        '''
        doc = self.nlp(text)
        keywords = []
        # YOUR CODE HERE
        curr_keyword = []
        preliminar_keywords = []
        
        for token in doc:
            self.is_proper_word(token.text)
            if(not token.is_stop and self.is_proper_word(token.text)):
                #print(token)
                curr_keyword.append(token.text)
                
            elif(len(curr_keyword) >= min_words):
                preliminar_keywords.append(' '.join(curr_keyword))
                curr_keyword = []
                
        word_freq = Counter(preliminar_keywords)
        sorted_keywords = word_freq.most_common(n_keywords)
        
        for word in sorted_keywords:
            keywords.append((tuple(word[0].split(' ')), word[1]))

        return keywords
        
with open('corpus.txt', 'r') as corpus_file:
    text = corpus_file.read()
    
keywords = StopWordKeywordExtractor().keywords(text.lower(), n_keywords=15, min_words=1)

'''
Expected output:
The keyword ('words',) appears 273 times.
The keyword ('text',) appears 263 times.
The keyword ('example',) appears 257 times.
The keyword ('word',) appears 201 times.
...
'''
for keyword in keywords:
    print('The keyword {} appears {} times.'.format(*keyword))

The keyword ('words',) appears 273 times.
The keyword ('text',) appears 263 times.
The keyword ('example',) appears 257 times.
The keyword ('word',) appears 201 times.
The keyword ('references',) appears 184 times.
The keyword ('natural', 'language', 'processing') appears 165 times.
The keyword ('n',) appears 160 times.
The keyword ('use',) appears 151 times.
The keyword ('set',) appears 144 times.
The keyword ('language',) appears 123 times.
The keyword ('t',) appears 120 times.
The keyword ('documents',) appears 118 times.
The keyword ('based',) appears 115 times.
The keyword ('1',) appears 115 times.
The keyword ('number',) appears 106 times.
CPU times: user 5.46 s, sys: 817 ms, total: 6.28 s
Wall time: 6.28 s


In [None]:
# This is a test cell, please ignore it!

### 2.2.2 Testing parameters

Rerun the keyword extrator with a minimum word count of ```min_words=2``` and a keyword count of ```n_keywords=15```.

Store this in the variable ```keywords_2```. Print the result.

Make sure to convert the input text to lower case!

In [88]:
keywords_2 = StopWordKeywordExtractor().keywords(text.lower(), n_keywords=15, min_words=2)

# YOUR CODE HERE
for keyword in keywords_2:
    print('The keyword {} appears {} times.'.format(*keyword))

The keyword ('natural', 'language', 'processing') appears 115 times.
The keyword ('computational', 'linguistics') appears 60 times.
The keyword ('references', 'external', 'links') appears 49 times.
The keyword ('information', 'retrieval') appears 49 times.
The keyword ('machine', 'translation') appears 42 times.
The keyword ('external', 'links') appears 42 times.
The keyword ('text', 'mining') appears 37 times.
The keyword ('word', 'sense', 'disambiguation') appears 36 times.
The keyword ('machine', 'learning') appears 34 times.
The keyword ('natural', 'language') appears 30 times.
The keyword ('artificial', 'intelligence') appears 29 times.
The keyword ('sentiment', 'analysis') appears 28 times.
The keyword ('natural', 'language', 'understanding') appears 25 times.
The keyword ('speech', 'recognition') appears 23 times.
The keyword ('association', 'computational', 'linguistics') appears 22 times.


In [None]:
# This is a test cell, please ignore it!