# NLP Exercises

This section contains five exercises:
1. Build your own tokenizer by implementing two functions based on regular expressions.
2. Get part-of-speech tags from a Trump speech.
3. Get the nouns in the last 10 sentences of Trump's speech, grouped by sentence. Use spaCy.
4. Build your own Bag of Words implementation using the tokenizer created earlier.
5. Build a 5-gram model and clean up the results.

## Exercise 1. Build your own tokenizer

Build two different tokenizers:
- ``tokenize_sentence``: function tokenizing text into sentences,
- ``tokenize_words``: function tokenizing text into words.

In [17]:
import re
import nltk

In [18]:
from typing import List

def tokenize_words(text: str) -> List[str]:
    """Tokenize text into words using regex.

    Parameters
    ----------
    text: str
            Text to be tokenized

    Returns
    -------
    List[str]
            List containing the unique words tokenized from text

    """
    # NOTE: set() drops repeated tokens and does not preserve their order.
    return list(set(re.findall(r"[\w']+|[.,!?;:]", text)))

def tokenize_sentence(text: str) -> List[str]:
    """Tokenize text into sentences using regex.

    Parameters
    ----------
    text: str
            Text to be tokenized

    Returns
    -------
    List[str]
            List containing the unique sentences tokenized from text

    """
    # The capturing group makes re.split() keep the separators, so the result
    # can also contain whitespace-only items.
    return list(set(re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)(\s|[A-Z].*)', text)))

text = "Here we go again. I was supposed to add this text later. \
Well, it's 10.p.m. here, and I'm actually having fun making this course. :o \
I hope you are getting along fine with this presentation, I really did try. \
And one last sentence, just so you can test you tokenizers better."

print("Tokenized sentences:")
print(tokenize_sentence(text), '\n')

print("Tokenized words:")
print(tokenize_words(text))

Tokenized sentences:
['I was supposed to add this text later.', ' ', 'And one last sentence, just so you can test you tokenizers better.', 'Here we go again.', ':o I hope you are getting along fine with this presentation, I really did try.', "Well, it's 10.p.m. here, and I'm actually having fun making this course."] 

Tokenized words:
['did', ',', 'tokenizers', 'try', 'hope', 'just', 'we', ':', 'o', 'getting', '10', 'along', 'go', 'And', 'was', 'you', 'text', 'making', '.', 'with', 'are', 'sentence', 'can', 'fine', 'm', 'so', 'I', 'Here', 'fun', 'p', 'one', 'and', 'presentation', "I'm", 'better', 'here', 'later', 'this', 'to', 'having', 'supposed', "it's", 'Well', 'again', 'add', 'actually', 'course', 'really', 'test', 'last']
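
Because both functions wrap their result in ``set``, the output above has lost the original token order and any repeated tokens. A minimal order-preserving sketch, assuming that keeping duplicates is acceptable for the exercise, could look like the following. The names ``tokenize_words_ordered`` and ``tokenize_sentences_ordered`` are hypothetical, and the sentence splitter is deliberately simpler than the regex above: it only splits on whitespace that follows ``.``, ``!`` or ``?`` and precedes a capital letter.

```python
import re
from typing import List


def tokenize_words_ordered(text: str) -> List[str]:
    """Return word and punctuation tokens in their original order."""
    return re.findall(r"[\w']+|[.,!?;:]", text)


def tokenize_sentences_ordered(text: str) -> List[str]:
    """Split after . ! or ? when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [part for part in parts if part]


sample = "Here we go again. I was late. Was it fun? Yes!"
print(tokenize_sentences_ordered(sample))
print(tokenize_words_ordered("Hi, I'm here."))
```

With this version an abbreviation such as "10.p.m. here" stays inside its sentence because the next word is lowercase, but a sentence that starts with a lowercase token would not be split off.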


## Exercise 2. Get tags from Trump speech using NLTK

Read the ``trump.txt`` file and find the part-of-speech tag for each word using NLTK.

In [19]:
# Read the speech; the with-block closes the file automatically.
with open("./datasets/trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()
words = tokenize_words(trump)

nltk.pos_tag(words)

[('thing', 'NN'),
 ('understood', 'NN'),
 ('state', 'NN'),
 ('meeting', 'NN'),
 ('going', 'VBG'),
 ('being', 'VBG'),
 ('cares', 'NNS'),
 ('taking', 'VBG'),
 ('strong', 'JJ'),
 ('suffer', 'NN'),
 ('with', 'IN'),
 ('hurt', 'NN'),
 ('kisses', 'NNS'),
 ('danger', 'VBP'),
 ('deterrent', 'JJ'),
 ('terminate', 'NN'),
 ('came', 'VBD'),
 ('lose', 'JJ'),
 ('running', 'NN'),
 ('worry', 'NN'),
 ('Well', 'NNP'),
 ('innocent', 'JJ'),
 ('pass', 'NN'),
 ('Please', 'NNP'),
 ('script', 'CC'),
 ('foreign', 'JJ'),
 ('proud', 'NN'),
 ('needed', 'VBD'),
 ('hide', 'RB'),
 ('reached', 'VBN'),
 ('crowds', 'JJ'),
 ('court', 'NN'),
 ('man', 'NN'),
 ('Wisconsin', 'NNP'),
 ('thanks', 'VBZ'),
 ('done', 'VBN'),
 ('blazing', 'VBG'),
 ('warmth', 'NN'),
 ('doesn', 'NN'),
 ('animal', 'JJ'),
 ('million', 'CD'),
 ('developed', 'VBN'),
 ('die', 'NN'),
 ('They', 'PRP'),
 ('moved', 'VBD'),
 ('law', 'NN'),
 ('absolute', 'VB'),
 ('them', 'PRP'),
 ('event', 'NN'),
 ('true', 'JJ'),
 ('trained', 'VBD'),
 ('plants', 'NNS'),
 ('loo
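
``nltk.pos_tag`` returns a list of ``(word, tag)`` tuples like the ones above. A small sketch of how such a list could be summarized, using only the standard library and a hand-copied sample of the pairs shown in the output (the variable names are illustrative):

```python
from collections import Counter, defaultdict

# A few (word, tag) pairs copied from the output above.
tagged = [
    ("thing", "NN"),
    ("going", "VBG"),
    ("being", "VBG"),
    ("cares", "NNS"),
    ("strong", "JJ"),
    ("with", "IN"),
    ("Wisconsin", "NNP"),
]

# How many words received each tag.
tag_counts = Counter(tag for _, tag in tagged)

# Which words received each tag, in their original order.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print(tag_counts.most_common(3))
print(words_by_tag["VBG"])
```

Note that the tagger takes neighbouring words into account, so tagging an unordered set of unique words, as done here, probably gives less reliable tags than tagging each sentence in its original order.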

## Exercise 3. Get the nouns in the last 10 sentences of Trump's speech, grouped by sentence. Use spaCy.

Use Python list slicing to get the last 10 sentences and display the nouns in them.
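
As a quick reminder of the list feature in question, negative slicing returns the tail of a list. The sketch below uses made-up placeholder sentences rather than the speech itself:

```python
# Twelve placeholder sentences standing in for the tokenized speech.
sentences = [f"Sentence {i}." for i in range(1, 13)]

# A negative start index counts from the end of the list.
last_ten = sentences[-10:]

# Slicing never raises if the list is shorter than requested.
short = ["Only one."][-10:]

# Join the selected sentences back into one string for further processing.
joined = " ".join(last_ten)

print(len(last_ten), last_ten[0], last_ten[-1])
```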

In [20]:
import spacy

with open("./datasets/trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()

### here comes your code
# Split into sentences; the capturing group keeps the separators, so drop blank pieces.
trump_list = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)(\s|[A-Z].*)', trump)
filtered_trump_list = [s for s in trump_list if s not in (' ', '', '\n')]

# Join the last 10 sentences with single spaces.
trump_string = ' '.join(filtered_trump_list[-10:])

print('[', trump_string, ']')

# The "en" shortcut is not available in spaCy 3; load the model by its full name.
nlp = spacy.load("en_core_web_sm")

doc = nlp(trump_string)
for chunk in doc.noun_chunks:  # avoid the name np, which usually refers to numpy
    print(chunk)

    

#index = 0
#nounIndices = []
#for token in doc:
    # print(token.text, token.pos_, token.dep_, token.head.text)
#    if token.pos_ == 'NOUN':
#        nounIndices.append(index)
#    index = index + 1

[ We will see. Hopefully something positive can happen. But that just was announced and I wanted to let you know. We have imposed the heaviest sanctions ever imposed. So ladies and gentlemen, thank you for everything. You’ve been incredible partners. Incredible partners. And I will let you know in the absolute strongest of terms, we’re going to make America great again and I will never, ever, ever let you down. Thank you very much. Thank you. ]
We
something
I
you
We
the heaviest sanctions
ladies
gentlemen
you
everything
You
incredible partners
Incredible partners
I
you
terms
we
America
I
you
you
you
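
The output above lists noun chunks for the whole text, which includes pronouns such as "We" and "you" and does not show sentence boundaries. A possible sketch for grouping nouns by sentence is shown below. It assumes a spaCy ``Doc`` whose ``sents`` iterator and ``token.pos_`` attributes are available (for example from a pipeline with a parser and tagger), and treating both ``NOUN`` and ``PROPN`` as nouns is a choice made for this sketch. The helper name ``nouns_by_sentence`` is hypothetical.

```python
from typing import List


def nouns_by_sentence(doc) -> List[List[str]]:
    """Return one list of noun texts per sentence of a spaCy-like Doc."""
    return [
        [token.text for token in sent if token.pos_ in ("NOUN", "PROPN")]
        for sent in doc.sents
    ]


# Intended usage (requires spaCy and the en_core_web_sm model):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# for nouns in nouns_by_sentence(nlp(trump_string)):
#     print(nouns)
```

The function only relies on ``doc.sents``, ``token.text`` and ``token.pos_``, so it can be tried out with any object that exposes those attributes.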


## Exercise 4. Build your own Bag Of Words implementation using tokenizer created before 

You need to implement the following methods:

- ``fit_transform`` - takes a list of strings and returns a matrix with its BoW representation
- ``get_feature_names`` - returns the list of words corresponding to the columns of the BoW matrix

In [21]:
import numpy as np
import spacy

class BagOfWords:
    """Basic BoW implementation."""
    
    __nlp = spacy.load("en_core_web_sm")
    __bow_list = []
    
    # your code goes maybe also here    
    
    def fit_transform(self, corpus: list):
        """Transform list of strings into BoW array.

        Parameters
        ----------
        corpus: List[str]
                Corpus of texts to be transformed

        Returns
        -------
        np.array
                Matrix representation of BoW

        """
        # your code goes here
        tokenized_corpus = [tokenize_words(sentence) for sentence in corpus]
        corpus_features = [feature for features_vector in tokenized_corpus for feature in features_vector]
        self.__bow_list = list(set(corpus_features))
        print('tokenized_corpus: \n', tokenized_corpus, '\n')
        print('self.__bow_list: \n', self.__bow_list, '\n')
        
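        # Count each feature in the flattened corpus. The result is a single
        # 1-D vector for the whole corpus, not one row per document.
        vectorized_corpus = np.zeros(len(self.__bow_list))
        for idx, feature in enumerate(self.__bow_list):
            vectorized_corpus[idx] = corpus_features.count(feature)

        return vectorized_corpus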


    def get_feature_names(self) -> list:
        """Return words corresponding to columns of matrix.

        Returns
        -------
        List[str]
                Words being transformed by fit function

        """   
        # your code goes here
        return self.__bow_list

corpus = [
     'Bag Of Words is based on counting',
     'words occurences throughout multiple documents.',
     'This is the third document.',
     'As you can see most of the words occur only once.',
     'This gives us a pretty sparse matrix, see below. Really, see below',
]    
    
vectorizer = BagOfWords()

X = vectorizer.fit_transform(corpus)
print(X)

vectorizer.get_feature_names()
len(vectorizer.get_feature_names())

tokenized_corpus: 
 [['Words', 'Of', 'counting', 'is', 'on', 'based', 'Bag'], ['occurences', '.', 'words', 'multiple', 'throughout', 'documents'], ['third', '.', 'is', 'This', 'the', 'document'], ['you', 'once', 'occur', 'As', 'can', '.', 'most', 'words', 'of', 'the', 'only', 'see'], [',', 'pretty', 'Really', 'below', 'a', '.', 'matrix', 'This', 'see', 'us', 'sparse', 'gives']] 

self.__bow_list: 
 [',', 'pretty', 'below', 'As', 'occurences', 'words', 'matrix', 'This', 'document', 'on', 'Words', 'third', 'you', 'once', 'Really', 'can', '.', 'is', 'only', 'multiple', 'throughout', 'occur', 'us', 'Of', 'a', 'most', 'see', 'Bag', 'sparse', 'counting', 'of', 'the', 'based', 'documents', 'gives'] 

[1. 1. 1. 1. 1. 2. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 4. 2. 1. 1. 1. 1. 1. 1.
 1. 1. 2. 1. 1. 1. 1. 2. 1. 1. 1.]


35
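
The result above is a single vector of 35 counts for the whole corpus, whereas a Bag of Words representation is usually a matrix with one row per document and one column per word. A minimal sketch of that variant follows. It uses plain Python lists so that it runs without extra dependencies, the names ``tokenize`` and ``bow_matrix`` are hypothetical, and the tokenizer keeps repeated tokens (unlike the ``set``-based one above) so that counts within a document are preserved.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Order-preserving word tokenizer that keeps repeated tokens."""
    return re.findall(r"[\w']+|[.,!?;:]", text)


def bow_matrix(corpus: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (vocabulary, matrix) with one row per document."""
    tokenized = [tokenize(document) for document in corpus]
    # Sort the vocabulary so that the column order is reproducible.
    vocabulary = sorted({token for document in tokenized for token in document})
    matrix = []
    for document in tokenized:
        counts = Counter(document)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = bow_matrix([
    "This is the third document.",
    "This gives us a pretty sparse matrix, see below. Really, see below",
])
print(vocabulary)
for row in matrix:
    print(row)
```

If a NumPy array is preferred, ``np.array(matrix)`` converts the nested list into a 2-D array with shape (documents, words).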

## Exercise 5. Build a 5-gram model and clean up the results.

There are three tasks:
1. Use a 5-gram model instead of a 3-gram model.
2. Capitalize the first letter of each sentence.
3. Remove the whitespace between the last word of a sentence and the closing ., ! or ?.

Hint: for tasks 2 and 3, implement a function called ``clean_generated()`` that takes the generated text and fixes both issues at once. It may be easier to fix the text after it is generated rather than making changes inside the while loop.

In [22]:
from nltk.book import *
import random 

wall_street = text7.tokens

import re

tokens = wall_street

def cleanup():
    compiled_pattern = re.compile("^[a-zA-Z0-9.!?]")
    clean = list(filter(compiled_pattern.match,tokens))
    return clean
tokens = cleanup()

def build_ngrams():
    ngrams = []
    for i in range(len(tokens)-N+1):
        ngrams.append(tokens[i:i+N])
    return ngrams

def ngram_freqs(ngrams):
    counts = {}

    for ngram in ngrams:
        token_seq  = SEP.join(ngram[:-1])
        last_token = ngram[-1]

        if token_seq not in counts:
            counts[token_seq] = {}

        if last_token not in counts[token_seq]:
            counts[token_seq][last_token] = 0
        counts[token_seq][last_token] += 1
    return counts

def next_word(text, N, counts):
    # Look up the last N-1 tokens of the text and sample the next token
    # with probability proportional to its observed count.
    token_seq = SEP.join(text.split()[-(N-1):])
    choices = counts[token_seq].items()

    total = sum(weight for choice, weight in choices)
    r = random.uniform(0, total)
    upto = 0
    for choice, weight in choices:
        upto += weight
        if upto > r:
            return choice
    assert False  # should not reach here
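
The helpers above read ``N``, ``SEP`` and ``tokens`` from global variables, so they only work once those are defined. A self-contained sketch of the same idea, with the n-gram size passed as a parameter and the weighted sampling delegated to ``random.choices``, might look like this (the names ``build_counts`` and ``sample_next`` are hypothetical, and the sketch assumes ``n >= 2``):

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Sequence, Tuple


def build_counts(tokens: Sequence[str], n: int) -> Dict[Tuple[str, ...], Counter]:
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    counts: Dict[Tuple[str, ...], Counter] = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        *context, last = tokens[i:i + n]
        counts[tuple(context)][last] += 1
    return counts


def sample_next(history: Sequence[str], n: int, counts) -> str:
    """Sample the next token in proportion to how often it followed the context."""
    context = tuple(history[-(n - 1):])
    followers = counts.get(context)
    if not followers:
        raise KeyError(f"context {context!r} was never seen in the training tokens")
    words = list(followers)
    weights = list(followers.values())
    return random.choices(words, weights=weights, k=1)[0]


toy_tokens = "a b c a b d".split()
toy_counts = build_counts(toy_tokens, 3)
print(toy_counts[("a", "b")])
print(sample_next(["x", "a", "b"], 3, toy_counts))
```

Using tuples as keys avoids joining and re-splitting strings, which matters if a token ever contains the separator character.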

In [23]:
def clean_generated(generated):
    # Remove the whitespace before sentence-ending punctuation (. ! or ?).
    white_spaces_del = re.sub(r'\s+([.!?])', r'\1', generated)
    # Split into sentences; the capturing group keeps the separators in the list.
    sentences_list = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)(\s|[A-Z].*)', white_spaces_del)
    # Capitalize the first character of every non-empty piece
    # (this leaves whitespace separators unchanged).
    capitalized_sentences_list = [
        sentence[0].upper() + sentence[1:] for sentence in sentences_list if sentence
    ]
    return ''.join(capitalized_sentences_list)
    
N=5 # fix it for other value of N

SEP=" "

sentence_count=5

ngrams = build_ngrams()

# Start from the first n-gram of the corpus, joined with the separator.
start_seq = SEP.join(ngrams[0])


counts = ngram_freqs(ngrams)

if start_seq is None: start_seq = random.choice(list(counts.keys()))
generated = start_seq#.lower();

sentences = 0
while sentences < sentence_count:
    generated += SEP + next_word(generated, N, counts)
    sentences += 1 if generated.endswith(('.','!', '?')) else 0

cleaned_generated = clean_generated(generated)

print('generated: ', '\n[', generated, ']\n')
print('cleaned_generated: ', '\n[', cleaned_generated, ']\n')

generated:  
[ Pierre Vinken 61 years old will join the board as a nonexecutive director Nov. 29 . Mr. Vinken is chairman of Elsevier N.V. the Dutch publishing group . ]

cleaned_generated:  
[ Pierre Vinken 61 years old will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V. the Dutch publishing group. ]
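
Both clean-up steps can also be expressed with two regular-expression substitutions. The sketch below is one possible alternative, not the notebook's method: the name ``clean_generated_regex`` is hypothetical, and unlike the sentence-splitting regex used above it does not try to detect abbreviations, so a lowercase word after "N.V." or "Mr." would be capitalized as well.

```python
import re


def clean_generated_regex(text: str) -> str:
    """Attach . ! ? to the preceding word and capitalize sentence starts."""
    # Step 1: drop any whitespace directly before sentence-ending punctuation.
    text = re.sub(r"\s+([.!?])", r"\1", text)
    # Step 2: upper-case a lowercase letter at the start of the text
    # or right after sentence-ending punctuation plus whitespace.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda match: match.group(1) + match.group(2).upper(),
        text,
    )


print(clean_generated_regex("hello world . how are you ? fine !"))
```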

