# NLP Exercises

This section contains five exercises:
1. Build your own tokenizer by implementing two functions based on regular expressions.
2. Get part-of-speech tags for Trump's speech using NLTK.
3. Get the nouns in the last 10 sentences of Trump's speech, grouped by sentence. Use spaCy.
4. Build your own Bag of Words implementation using the tokenizer created before.
5. Build a 5-gram model and clean up the results.

## Exercise 1. Build your own tokenizer

Build two different tokenizers:
- ``tokenize_sentence``: a function that splits text into sentences,
- ``tokenize_words``: a function that splits text into words.

In [5]:
from typing import List
import re

def tokenize_words(text: str) -> List[str]:
    """Tokenize text into words using regex.

    Parameters
    ----------
    text: str
            Text to be tokenized

    Returns
    -------
    List[str]
            List containing words tokenized from text

    """

    # \w - word char
    #  ? - zero or one
    #  + - one or more
    return re.findall(r"[:]?[\w']+", text)

def tokenize_sentence(text: str) -> List[str]:
    """Tokenize text into sentences using regex.

    Parameters
    ----------
    text: str
            Text to be tokenized

    Returns
    -------
    List[str]
            List containing sentences tokenized from text

    """

    # (?<=[.!?]) - lookbehind: the previous char must be . ! or ?
    #  +         - one or more spaces (consumed by the split)
    # (?=[:A-Z]) - lookahead: the next char must be a capital letter or :
    return re.split(r'(?<=[.!?]) +(?=[:A-Z])', text)

text = "Here we go again. I was supposed to add this text later. \
Well, it's 10.p.m. here, and I'm actually having fun making this course. :o \
I hope you are getting along fine with this presentation, I really did try. \
And one last sentence, just so you can test you tokenizers better."

print("Tokenized sentences:")
print(tokenize_sentence(text))

print("Tokenized words:")
print(tokenize_words(text))

Tokenized sentences:
['Here we go again.', 'I was supposed to add this text later.', "Well, it's 10.p.m. here, and I'm actually having fun making this course.", ':o I hope you are getting along fine with this presentation, I really did try.', 'And one last sentence, just so you can test you tokenizers better.']
Tokenized words:
['Here', 'we', 'go', 'again', 'I', 'was', 'supposed', 'to', 'add', 'this', 'text', 'later', 'Well', "it's", '10', 'p', 'm', 'here', 'and', "I'm", 'actually', 'having', 'fun', 'making', 'this', 'course', ':o', 'I', 'hope', 'you', 'are', 'getting', 'along', 'fine', 'with', 'this', 'presentation', 'I', 'really', 'did', 'try', 'And', 'one', 'last', 'sentence', 'just', 'so', 'you', 'can', 'test', 'you', 'tokenizers', 'better']
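
The output above shows that ``10.p.m. here`` is not split, because the lookahead requires a capital letter or ``:`` after the spaces. The following is a minimal sketch, using the same pattern on a hypothetical sample string, of where this heuristic works and where it fails (abbreviations such as ``Mr.`` followed by a capitalized word are split incorrectly):

```python
import re

# Same pattern as in tokenize_sentence: split on spaces that follow
# . ! or ? and precede a capital letter or a colon.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?]) +(?=[:A-Z])")

# Hypothetical sample text, not taken from the exercise data
sample = "It's 10.p.m. here. Mr. Smith is late! :o Fine."

parts = SENTENCE_BOUNDARY.split(sample)
for part in parts:
    print(repr(part))
```

A regex-only splitter cannot tell an abbreviation from the end of a sentence, which is one reason library tokenizers rely on trained models or abbreviation lists.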


## Exercise 2. Get tags from Trump speech using NLTK

Read the ``trump.txt`` file and find the part-of-speech tag of each word using NLTK.

In [6]:
import nltk
from nltk import word_tokenize, pos_tag

nltk.download('averaged_perceptron_tagger_eng', quiet=True)

True

In [7]:
# The with-statement closes the file automatically
with open("./datasets/trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()

words = word_tokenize(trump)

pos_tag(words)

[('Thank', 'NNP'),
 ('you', 'PRP'),
 ('very', 'RB'),
 ('much', 'RB'),
 ('.', '.'),
 ('Mr.', 'NNP'),
 ('Speaker', 'NNP'),
 (',', ','),
 ('Mr.', 'NNP'),
 ('Vice', 'NNP'),
 ('President', 'NNP'),
 (',', ','),
 ('Members', 'NNP'),
 ('of', 'IN'),
 ('Congress', 'NNP'),
 (',', ','),
 ('the', 'DT'),
 ('First', 'NNP'),
 ('Lady', 'NNP'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('United', 'NNP'),
 ('States', 'NNPS'),
 (',', ','),
 ('and', 'CC'),
 ('citizens', 'NNS'),
 ('of', 'IN'),
 ('America', 'NNP'),
 (':', ':'),
 ('Tonight', 'NN'),
 (',', ','),
 ('as', 'IN'),
 ('we', 'PRP'),
 ('mark', 'VBP'),
 ('the', 'DT'),
 ('conclusion', 'NN'),
 ('of', 'IN'),
 ('our', 'PRP$'),
 ('celebration', 'NN'),
 ('of', 'IN'),
 ('Black', 'NNP'),
 ('History', 'NNP'),
 ('Month', 'NNP'),
 (',', ','),
 ('we', 'PRP'),
 ('are', 'VBP'),
 ('reminded', 'VBN'),
 ('of', 'IN'),
 ('our', 'PRP$'),
 ('Nation', 'NN'),
 ("'s", 'POS'),
 ('path', 'NN'),
 ('towards', 'NNS'),
 ('civil', 'JJ'),
 ('rights', 'NNS'),
 ('and', 'CC'),
 ('the', 'DT'),
 ('
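
A long list of ``(word, tag)`` pairs is hard to read as a whole. One possible way to summarize it, sketched here on a handful of pairs copied from the output above rather than on the full speech, is to count how often each tag occurs:

```python
from collections import Counter

# A few (word, tag) pairs copied from the pos_tag output above
tagged = [
    ("Thank", "NNP"), ("you", "PRP"), ("very", "RB"), ("much", "RB"),
    (".", "."), ("Mr.", "NNP"), ("Speaker", "NNP"), (",", ","),
]

# Count tags only; the words themselves are ignored
tag_counts = Counter(tag for _, tag in tagged)

for tag, count in tag_counts.most_common():
    print(tag, count)
```

The same two lines should work on the full result of ``pos_tag(words)``, since it has the same list-of-tuples shape.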

## Exercise 3. Get the nouns in the last 10 sentences of Trump's speech, grouped by sentence. Use spaCy.

Use Python list slicing to get the last 10 sentences and display the nouns from them.

In [8]:
import spacy

# The with-statement closes the file automatically
with open("./datasets/trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()

nlp = spacy.load("en_core_web_sm")
doc = nlp(trump)

sents = list(doc.sents)
n = len(sents)

sents = sents[-10:]
for span in sents:
    print("> ", span)

>  When we fulfill this vision, when we celebrate our 250 years of glorious freedom, we will look back on tonight as when this new chapter of American greatness began.
>  The time for small thinking is over.
>  The time for trivial fights is behind us.
>  We just need the courage to share the dreams that fill our hearts, the bravery to express the hopes that stir our souls, and the confidence to turn those hopes and those dreams into action.


>  From now on, America will be empowered by our aspirations, not burdened by our fears; inspired by the future, not bound by failures of the past; and guided by our vision, not blinded by our doubts.


>  I am asking all citizens to embrace this renewal of the American spirit.
>  I am asking all Members of Congress to join me in dreaming big and bold, and daring things for our country.
>  I am asking everyone watching tonight to seize this moment.
>  Believe in yourselves, believe in your future, and believe, once more, in America.


>  Thank yo

In [9]:
# nouns

noun_tags = [ "NN", "NNP", "NNPS", "NNS" ]

for i, sent in enumerate(sents, 1):
    nouns = [token.text for token in sent if token.tag_ in noun_tags]
    print(f"Sentence {n + i - len(sents)}: {nouns}")

Sentence 289: ['vision', 'years', 'freedom', 'tonight', 'chapter', 'greatness']
Sentence 290: ['time', 'thinking']
Sentence 291: ['time', 'fights']
Sentence 292: ['courage', 'dreams', 'hearts', 'bravery', 'hopes', 'souls', 'confidence', 'hopes', 'dreams', 'action']
Sentence 293: ['America', 'aspirations', 'fears', 'future', 'failures', 'past', 'vision', 'doubts']
Sentence 294: ['citizens', 'renewal', 'spirit']
Sentence 295: ['Members', 'Congress', 'things', 'country']
Sentence 296: ['everyone', 'tonight', 'moment']
Sentence 297: ['yourselves', 'future', 'America']
Sentence 298: ['God', 'God', 'United', 'States']


In [10]:
# noun_chunks

for i, sent in enumerate(sents, 1):
    nouns = [chunk.text for chunk in sent.noun_chunks]
    print(f"Sentence {n + i - len(sents)}: {nouns}")


Sentence 289: ['we', 'this vision', 'we', 'our 250 years', 'glorious freedom', 'we', 'tonight', 'this new chapter', 'American greatness']
Sentence 290: ['The time', 'small thinking']
Sentence 291: ['The time', 'trivial fights', 'us']
Sentence 292: ['We', 'the courage', 'the dreams', 'our hearts', 'the bravery', 'the hopes', 'our souls', 'the confidence', 'those hopes', 'those dreams', 'action']
Sentence 293: ['America', 'our aspirations', 'our fears', 'the future', 'failures', 'the past', 'our vision', 'our doubts']
Sentence 294: ['I', 'all citizens', 'this renewal', 'the American spirit']
Sentence 295: ['I', 'all Members', 'Congress', 'me', 'things', 'our country']
Sentence 296: ['I', 'everyone']
Sentence 297: ['yourselves', 'your future', 'America']
Sentence 298: ['you', 'God', 'you', 'God', 'the United States']
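
The sentence numbers printed above come from combining a negative slice with an offset. A minimal sketch of the same arithmetic on a hypothetical list of 12 strings (standing in for ``list(doc.sents)``), so it runs without spaCy:

```python
# Hypothetical stand-in for list(doc.sents)
sentences = [f"sentence {k}" for k in range(1, 13)]  # 12 items
n = len(sentences)

# A negative slice keeps the final 10 items
last_ten = sentences[-10:]

# 1-based position of the first kept sentence in the full list
first_number = n - len(last_ten) + 1

# enumerate can start counting at any number
numbered = list(enumerate(last_ten, first_number))
for number, sentence in numbered:
    print(number, sentence)
```

Starting ``enumerate`` at ``first_number`` gives the same numbers as ``n + i - len(sents)`` with ``i`` starting at 1.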


## Exercise 4. Build your own Bag Of Words implementation using tokenizer created before 

You need to implement the following methods:

- ``fit_transform`` - takes a list of strings and returns a matrix with its BoW representation
- ``get_feature_names`` - returns the list of words corresponding to the columns of the BoW matrix

In [11]:
import re

import numpy as np

class BagOfWords:
    """Basic BoW implementation."""

    def __init__(self):
        # Per-instance vocabulary; a class-level list would be shared
        # by every BagOfWords object
        self.__bow_list = []

    # same as in exercise 1
    def tokenize_words(self, text: str) -> list:
        """Tokenize text into words using regex.

        Parameters
        ----------
        text: str
                Text to be tokenized

        Returns
        -------
        List[str]
                List containing words tokenized from text

        """

        # \w - word char
        #  ? - zero or one
        #  + - one or more
        return re.findall(r"[:]?[\w']+", text)
    
    def fit_transform(self, corpus: list):
        """Transform list of strings into BoW array.

        Parameters
        ----------
        corpus: List[str]
                Corpus of texts to be transformed

        Returns
        -------
        np.array
                Matrix representation of BoW

        """

        # Rebuild the vocabulary from scratch on every call, so repeated
        # calls do not accumulate words from earlier corpora
        vocabulary = set()
        for c in corpus:
            for word in self.tokenize_words(c):
                vocabulary.add(word.lower())

        self.__bow_list = sorted(vocabulary)
        matrix = np.zeros((len(corpus), len(self.__bow_list)))

        for c_id, c in enumerate(corpus):
            words = self.tokenize_words(c)
            for word in words:
                word_lower = word.lower()
                if word_lower in self.__bow_list:
                    w_id = self.__bow_list.index(word_lower)
                    matrix[c_id, w_id] += 1

        return matrix
      

    def get_feature_names(self) -> list:
        """Return words corresponding to columns of matrix.

        Returns
        -------
        List[str]
                Words being transformed by fit function

        """

        return self.__bow_list

corpus = [
     'Bag Of Words is based on counting',
     'words occurrences throughout multiple documents.',
     'This is the third document.',
     'As you can see most of the words occur only once.',
     'This gives us a pretty sparse matrix, see below. Really, see below',
]    
    
vectorizer = BagOfWords()

X = vectorizer.fit_transform(corpus)
print(X)

vectorizer.get_feature_names()
len(vectorizer.get_feature_names())

[[0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 1. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  1. 1. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 0.
  1. 0. 0. 0. 0. 1. 1.]
 [1. 0. 0. 0. 2. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 2. 1.
  0. 0. 1. 0. 1. 0. 0.]]


31
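
The solution above finds each column with ``list.index``, which scans the vocabulary for every word. A possible alternative, sketched below with plain Python lists instead of NumPy and a hypothetical function name ``bow_counts``, keeps a word-to-column dictionary so each lookup is a single dictionary access:

```python
import re
from typing import Dict, List, Tuple

def bow_counts(corpus: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (sorted vocabulary, one row of counts per document)."""
    # Same regex as tokenize_words, with lowercasing
    tokenized = [
        [w.lower() for w in re.findall(r"[:]?[\w']+", doc)] for doc in corpus
    ]
    vocabulary = sorted({w for doc in tokenized for w in doc})
    # Map each word to its column index
    index: Dict[str, int] = {w: i for i, w in enumerate(vocabulary)}

    rows = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for w in doc:
            row[index[w]] += 1
        rows.append(row)
    return vocabulary, rows

# Hypothetical two-document corpus
vocab, rows = bow_counts(["See below. Really, see below", "This is below"])
print(vocab)
print(rows)
```

For a corpus as small as the one in this exercise the difference is negligible; the dictionary only pays off as the vocabulary grows.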

## Exercise 5. Build a 5-gram model and clean up the results.

There are three tasks:
1. Use a 5-gram model instead of a 3-gram model.
2. Capitalize the first letter of each sentence.
3. Remove the whitespace between the last word of a sentence and the closing ., ! or ?.

Hint: for 2. and 3., implement a function called ``clean_generated()`` that takes the generated text and fixes both issues at once. It may be easier to fix the text after it is generated rather than making changes inside the while loop.
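
For the first task, it may help to see what an n-gram model stores. The sketch below, with a hypothetical helper ``context_counts`` and a toy token list, maps each context of ``n - 1`` tokens to the counts of the tokens that follow it. Moving from 3-grams to 5-grams only changes ``n``:

```python
from collections import Counter, defaultdict

def context_counts(tokens, n):
    """Map each (n-1)-token context to a Counter of following tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        # The window has n tokens: n-1 of context plus the one to predict
        *context, last = tokens[i:i + n]
        counts[" ".join(context)][last] += 1
    return counts

# Toy corpus, not the Wall Street Journal text used in the exercise
toy = "the rates rose and the rates fell".split()

trigram_counts = context_counts(toy, 3)
fivegram_counts = context_counts(toy, 5)

print(dict(trigram_counts["the rates"]))
print({context: dict(c) for context, c in fivegram_counts.items()})
```

Longer contexts are more specific, so a 5-gram model tends to reproduce longer stretches of the training text than a 3-gram model.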

In [17]:
nltk.download('book', quiet=True)

True

In [18]:
from nltk.book import *

wall_street = text7.tokens

import re
import random

tokens = wall_street

def cleanup():
    compiled_pattern = re.compile("^[a-zA-Z0-9.!?]")
    clean = list(filter(compiled_pattern.match,tokens))
    return clean
tokens = cleanup()

def build_ngrams():
    ngrams = []
    for i in range(len(tokens)-N+1):
        ngrams.append(tokens[i:i+N])
    return ngrams

def ngram_freqs(ngrams):
    counts = {}

    for ngram in ngrams:
        token_seq  = SEP.join(ngram[:-1])
        last_token = ngram[-1]

        if token_seq not in counts:
            counts[token_seq] = {}

        if last_token not in counts[token_seq]:
            counts[token_seq][last_token] = 0

        counts[token_seq][last_token] += 1

    return counts

def next_word(text, N, counts):

    token_seq = SEP.join(text.split()[-(N-1):])
    choices = counts[token_seq].items()

    total = sum(weight for choice, weight in choices)
    r = random.uniform(0, total)
    upto = 0
    for choice, weight in choices:
        upto += weight
        if upto > r: return choice
    assert False # should not reach here
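
The loop in ``next_word`` implements weighted random sampling by hand. As a sketch of an alternative, the standard library's ``random.choices`` accepts weights directly. The helper name ``sample_next`` and the example counts are hypothetical:

```python
import random

def sample_next(choices: dict, rng: random.Random) -> str:
    """Pick a key of `choices` with probability proportional to its count."""
    words = list(choices)
    weights = list(choices.values())
    # random.choices returns a list of k items; take the single draw
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed only to make the demo repeatable
followers = {"rates": 3, "yields": 1}  # hypothetical counts for one context

draws = [sample_next(followers, rng) for _ in range(1000)]
print(draws.count("rates"), draws.count("yields"))
```

With these counts, ``rates`` should be drawn roughly three times as often as ``yields``.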

In [30]:
def clean_generated(generated):
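    # Remove the spaces before sentence-ending punctuation
    text = re.sub(r' +([.!?])', r'\1', generated)

    sents = []
    # Split after . ! or ? and capitalize the first letter of each sentence
    for sent in re.split(r'(?<=[.!?]) +', text):
        # sent[:1] avoids an IndexError when the string is empty
        sents.append(sent[:1].upper() + sent[1:])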

    return ' '.join(sents)

N=5

SEP=" "

sentence_count=5

ngrams = build_ngrams()
counts = ngram_freqs(ngrams)

# Start from a random (N-1)-token context taken from the corpus.
# The seed keeps its original casing so that it remains a valid key of counts;
# clean_generated() capitalizes the first letter afterwards.
start_seq = random.choice(list(counts.keys()))
generated = start_seq

sentences = 0
while sentences < sentence_count:
    generated += SEP + next_word(generated, N, counts)
    sentences += 1 if generated.endswith(('.','!', '?')) else 0

print("Before clean_generated:")
print(generated)
print()

print("After clean_generated:")
print(clean_generated(generated))

Before clean_generated:
funds investments lengthened by a day to 41 days the longest since early August according to Donoghue . Longer maturities are thought to indicate declining interest rates because they permit portfolio managers to retain relatively higher rates for a longer period . Shorter maturities are considered a sign of rising rates because portfolio managers can vary maturities and go after the highest rates . The top money funds are currently yielding well over 9 . Dreyfus World-Wide Dollar the top-yielding fund had a seven-day compound yield of 9.37 during the latest week to 352.7 billion .

After clean_generated:
Funds investments lengthened by a day to 41 days the longest since early August according to Donoghue. Longer maturities are thought to indicate declining interest rates because they permit portfolio managers to retain relatively higher rates for a longer period. Shorter maturities are considered a sign of rising rates because portfolio managers can vary maturi