# NLP Exercises

This section contains five exercises:
1. Build your own tokenizer: implement two regular-expression-based functions, one that splits text into sentences and one that splits it into words.
2. Get part-of-speech tags from the Trump speech.
3. Get the nouns in the last 10 sentences of Trump's speech, grouped by sentence. Use spaCy.
4. Build your own bag-of-words implementation using the tokenizer created earlier.
5. Build a 5-gram model and clean up the results.

## Exercise 1. Build your own tokenizer

Build two different tokenizers:
- ``tokenize_sentence``: a function that splits text into sentences,
- ``tokenize_words``: a function that splits text into words.

In [None]:
from typing import List
import re

def tokenize_words(text: str) -> List[str]:
    """Tokenize text into words using regex.

    A word starts with a word character and may contain apostrophes (e.g. "it's").
    """
    return re.findall(r"\b\w[\w']*\b", text)

def tokenize_sentence(text: str) -> List[str]:
    """Tokenize text into sentences using regex.

    Splits on whitespace that follows '.', '!' or '?'.
    """
    return re.split(r'(?<=[.!?])\s+', text.strip())


text = "Here we go again. I was supposed to add this text later. \
Well, it's 10.p.m. here, and I'm actually having fun making this course. :o \
I hope you are getting along fine with this presentation, I really did try. \
And one last sentence, just so you can test your tokenizers better."

print("Tokenized sentences:")
print(tokenize_sentence(text))

print("Tokenized words:")
print(tokenize_words(text))
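As a quick sanity check, the two regex tokenizers can be tried on a short string. This is only a sketch: the sample strings are made up for illustration, and the functions repeat the definitions above so the snippet runs on its own. It also shows one known weakness of the simple sentence pattern, which splits after abbreviations such as "Mr.".

```python
import re

def tokenize_words(text: str) -> list:
    # Words start with a word character and may contain apostrophes.
    return re.findall(r"\b\w[\w']*\b", text)

def tokenize_sentence(text: str) -> list:
    # Split on whitespace that directly follows '.', '!' or '?'.
    return re.split(r'(?<=[.!?])\s+', text.strip())

# Illustrative sample, not taken from the exercise text.
sample = "It's late. Are we done? Not yet!"
sentences = tokenize_sentence(sample)
words = tokenize_words(sample)
print(sentences)  # ["It's late.", 'Are we done?', 'Not yet!']
print(words)      # ["It's", 'late', 'Are', 'we', 'done', 'Not', 'yet']

# Limitation: an abbreviation ending in '.' is treated as a sentence end.
abbreviation_split = tokenize_sentence("Mr. Smith arrived.")
print(abbreviation_split)  # ['Mr.', 'Smith arrived.']
```

Handling abbreviations properly would need either a list of known abbreviations or a trained sentence splitter; the regex version is kept simple on purpose.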

## Exercise 2. Get tags from Trump speech using NLTK

Read the ``trump.txt`` file and find the part-of-speech tag for each word using NLTK.

In [12]:
from nltk import word_tokenize, pos_tag

# Read the whole speech; the context manager closes the file afterwards.
with open("trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()

words = word_tokenize(trump)
tagged_words = pos_tag(words)
print(tagged_words[:20])

[('Thank', 'NNP'), ('you', 'PRP'), ('very', 'RB'), ('much', 'RB'), ('.', '.'), ('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Mr.', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), (',', ','), ('Members', 'NNP'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('the', 'DT'), ('First', 'NNP'), ('Lady', 'NNP'), ('of', 'IN')]
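Once the words are tagged, a natural next step is to summarize the tags. The sketch below counts how often each tag occurs and groups words by tag. It works on a short list copied from the start of the output above, so it runs without NLTK or ``trump.txt``; the variable names are assumptions made for this example.

```python
from collections import Counter, defaultdict

# First few (word, tag) pairs copied from the printed output above.
tagged_sample = [
    ('Thank', 'NNP'), ('you', 'PRP'), ('very', 'RB'), ('much', 'RB'),
    ('.', '.'), ('Mr.', 'NNP'), ('Speaker', 'NNP'),
]

# Count how many times each tag appears.
tag_counts = Counter(tag for _, tag in tagged_sample)

# Collect the words that received each tag.
words_by_tag = defaultdict(list)
for word, tag in tagged_sample:
    words_by_tag[tag].append(word)

print(tag_counts['NNP'])    # 3
print(words_by_tag['RB'])   # ['very', 'much']
```

The same two loops should work unchanged on the full ``tagged_words`` list.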


## Exercises 3 and 4 are in 06_NLP_Exercises_2.ipynb

## Exercise 5. Build a 5-gram model and clean up the results.

There are three tasks:
1. Use a 5-gram model instead of a 3-gram model.
2. Capitalize the first letter of each sentence.
3. Remove the whitespace between the last word of a sentence and the closing ., ! or ?.

Hint: for tasks 2 and 3, implement a function called ``clean_generated()`` that takes the generated text and fixes both issues at once. It may be easier to fix the text after it is generated than to change the while loop.

In [53]:
from nltk.book import *

wall_street = text7.tokens

import random  # used by next_word() below
import re

tokens = wall_street

def cleanup():
    compiled_pattern = re.compile("^[a-zA-Z0-9.!?]")
    clean = list(filter(compiled_pattern.match,tokens))
    return clean
tokens = cleanup()

def build_ngrams():
    ngrams = []
    for i in range(len(tokens)-N+1):
        ngrams.append(tokens[i:i+N])
    return ngrams

def ngram_freqs(ngrams):
    counts = {}

    for ngram in ngrams:
        token_seq  = SEP.join(ngram[:-1])
        last_token = ngram[-1]

        if token_seq not in counts:
            counts[token_seq] = {}

        if last_token not in counts[token_seq]:
            counts[token_seq][last_token] = 0

        counts[token_seq][last_token] += 1

    return counts

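def next_word(text, N, counts):
    # Use the last N-1 generated tokens as the lookup context.
    token_seq = SEP.join(text.split()[-(N-1):])
    choices = counts[token_seq].items()

    # Sample the next token in proportion to its observed count.
    total = sum(weight for choice, weight in choices)
    r = random.uniform(0, total)
    upto = 0
    for choice, weight in choices:
        upto += weight
        # >= rather than > so that r == total still returns the last choice.
        if upto >= r:
            return choice
    assert False # should not reach here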
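To see what the count table looks like, here is a minimal, self-contained sketch on a toy corpus. It is not the notebook's code: the functions take ``tokens`` and ``n`` as parameters instead of reading globals, the toy sentence is invented, and ``random.choices`` from the standard library stands in for the manual weighted loop.

```python
import random
from collections import defaultdict

def build_ngrams(tokens, n):
    # Slide a window of n tokens over the list.
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

def ngram_freqs(ngrams, sep=" "):
    # Map each (n-1)-token context to a {next_token: count} dictionary.
    counts = defaultdict(lambda: defaultdict(int))
    for ngram in ngrams:
        context = sep.join(ngram[:-1])
        counts[context][ngram[-1]] += 1
    return counts

def next_word(text, n, counts, sep=" "):
    # Sample the next token in proportion to how often it followed the context.
    context = sep.join(text.split()[-(n - 1):])
    options = counts[context]
    return random.choices(list(options), weights=list(options.values()))[0]

# Invented toy corpus, already tokenized by whitespace.
toy_tokens = "the cat sat on the mat . the cat sat on the sofa .".split()
counts = ngram_freqs(build_ngrams(toy_tokens, 5))

print(dict(counts["the cat sat on"]))  # {'the': 2}
print(dict(counts["cat sat on the"]))  # {'mat': 1, 'sofa': 1}
print(next_word("the cat sat on", 5, counts))  # the
```

With a 5-gram model each context is four tokens long, so most contexts in a small corpus have a single continuation and the generated text tends to reproduce long stretches of the source.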

In [54]:
import random


def clean_generated(text: str) -> str:
    # Remove space before punctuation
    text = re.sub(r'\s+([.?!])', r'\1', text)

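    # Capitalize the first letter of each sentence. Note that str.capitalize()
    # also lowercases the rest of the sentence, so names and acronyms lose
    # their capitals.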
    sentences = re.split(r'([.?!])', text)
    cleaned = ""
    for i in range(0, len(sentences)-1, 2):
        sentence = sentences[i].strip().capitalize()
        punctuation = sentences[i+1]
        cleaned += sentence + punctuation + " "
    return cleaned.strip()


N=5 # fix it for other value of N

SEP=" "

sentence_count=5

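ngrams = build_ngrams()
counts = ngram_freqs(ngrams)

# Pick the starting context only after counts has been built.
start_seq = random.choice(list(counts.keys()))

# Keep the original casing so the context is still a key in counts;
# clean_generated() normalizes capitalization afterwards.
generated = start_seq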

sentences = 0
while sentences < sentence_count:
    generated += SEP + next_word(generated, N, counts)
    sentences += 1 if generated.endswith(('.','!', '?')) else 0

# put your code here:
generated = clean_generated(generated)
print(generated)

Said 0 are the company numerous failures to properly record injuries at its fairless works in spite of the firm promise 0 it had made in an earlier corporate-wide settlement agreement to correct such discrepancies. That settlement was in april 1987. A usx spokesman said 0 the company had n't yet received any documents from osha regarding the penalty or fine. Once we do they will receive very serious evaluation the spokesman said. No consideration is more important than the health and safety of our employees.
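The output above has lost the capitals in names such as "usx", "osha" and "april", because ``str.capitalize()`` lowercases everything after the first character. One possible alternative, sketched below under a hypothetical name, uppercases only the first letter of each sentence and leaves the rest untouched. The sample string is made up for illustration.

```python
import re

def clean_generated_keep_case(text: str) -> str:
    """Hypothetical variant of clean_generated() that keeps existing capitals."""
    # Remove whitespace before sentence-ending punctuation.
    text = re.sub(r"\s+([.?!])", r"\1", text)
    # Uppercase a lowercase letter at the start of the text or right after
    # sentence-ending punctuation followed by whitespace.
    return re.sub(
        r"(^|[.?!]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )

# Made-up sample in the style of the generated text.
raw = ("a USX spokesman said the company had n't received documents "
       "from OSHA . once we do they will be evaluated .")
cleaned = clean_generated_keep_case(raw)
print(cleaned)
# A USX spokesman said the company had n't received documents from OSHA. Once we do they will be evaluated.
```

This variant still treats every ``.`` followed by whitespace as a sentence end, so a lowercase word after an abbreviation such as "U.S." would also be capitalized.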
