# NLP Exercises

This section contains five exercises:
1. Build your own tokenizer by implementing two functions based on regular expressions.
2. Get tags from Trump's speech using NLTK.
3. Get the nouns in the last 10 sentences of Trump's speech, grouped by sentence. Use spaCy.
4. Build your own Bag Of Words implementation using the tokenizer created before.
5. Build a 5-gram model and clean up the results.

## Exercise 1. Build your own tokenizer

Build two different tokenizers:
- ``tokenize_sentence``: a function that splits text into sentences,
- ``tokenize_words``: a function that splits text into words.

In [1]:
import re
from typing import List


def tokenize_words(text: str) -> List[str]:
    """Tokenize text into words using regex.

    Parameters
    ----------
    text: str
            Text to be tokenized

    Returns
    -------
    List[str]
            List containing words tokenized from text

    """
    return re.findall(r"\b\w+(?:'\w+)?(?:\.\w+)*\b", text)

def tokenize_sentence(text: str) -> List[str]:
    """Tokenize text into sentences using regex.

    Parameters
    ----------
    text: str
            Text to be tokenized

    Returns
    -------
    List[str]
            List containing sentences tokenized from text

    """
    # Split on spaces that follow ., ! or ?, except after abbreviations such as "p.m.".
    return re.split(r'(?<!\w\.\w.)(?<=[.!?]) +', text)

text = "Here we go again. I was supposed to add this text later.\
Well, it's 10.p.m. here, and I'm actually having fun making this course. :o\
I hope you are getting along fine with this presentation, I really did try.\
And one last sentence, just so you can test you tokenizers better."

print("Tokenized sentences:")
print(tokenize_sentence(text))

print("Tokenized words:")
print(tokenize_words(text))

Tokenized sentences:
['Here we go again.', "I was supposed to add this text later.Well, it's 10.p.m. here, and I'm actually having fun making this course.", ':oI hope you are getting along fine with this presentation, I really did try.And one last sentence, just so you can test you tokenizers better.']
Tokenized words:
['Here', 'we', 'go', 'again', 'I', 'was', 'supposed', 'to', 'add', 'this', 'text', 'later.Well', "it's", '10.p.m', 'here', 'and', "I'm", 'actually', 'having', 'fun', 'making', 'this', 'course', 'oI', 'hope', 'you', 'are', 'getting', 'along', 'fine', 'with', 'this', 'presentation', 'I', 'really', 'did', 'try.And', 'one', 'last', 'sentence', 'just', 'so', 'you', 'can', 'test', 'you', 'tokenizers', 'better']
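
In the output above, `later.Well` and `:oI` stay glued together. The backslash line continuations in `text` join the lines without adding a space, and the sentence pattern only splits on spaces that follow `.`, `!` or `?`. The sketch below reproduces this on two made-up strings (`glued` and `spaced` are illustrative names, not part of the exercise) and also shows the lookbehind that protects abbreviations such as `p.m.`.

```python
import re

# Same pattern as in tokenize_sentence above.
SENTENCE_BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<=[.!?]) +")

# Backslash continuation glues the two lines together with no space in between,
# so there is nothing for the pattern to split on after "First sentence.".
glued = "First sentence.\
Second sentence. Third sentence."
glued_sentences = SENTENCE_BOUNDARY.split(glued)
print(glued_sentences)
# ['First sentence.Second sentence.', 'Third sentence.']

# With spaces after the punctuation every sentence is separated,
# while the space after "p.m." is skipped by the negative lookbehind.
spaced = "It's 10 p.m. here. Are you ready? Let's go!"
spaced_sentences = SENTENCE_BOUNDARY.split(spaced)
print(spaced_sentences)
# ["It's 10 p.m. here.", 'Are you ready?', "Let's go!"]
```

Ending each continued line with a space before the backslash would be one way to give the tokenizers cleaner input.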


## Exercise 2. Get tags from Trump's speech using NLTK

Read the ``trump.txt`` file and use NLTK to find the part-of-speech tag of each word.

In [2]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Read the whole speech; the context manager closes the file automatically.
with open("../datasets/trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()

words = word_tokenize(trump)

# Assign a part-of-speech tag to every token.
tags = nltk.pos_tag(words)
print(tags[:10])

[('Thank', 'NNP'), ('you', 'PRP'), ('very', 'RB'), ('much', 'RB'), ('.', '.'), ('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Mr.', 'NNP'), ('Vice', 'NNP')]


[nltk_data] Downloading package punkt to /home/strus/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/strus/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
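
The tagger returns a list of `(word, tag)` tuples with Penn Treebank tags (`NNP` is a singular proper noun, `RB` an adverb, `PRP` a personal pronoun). As a small sketch of what can be done with such a list, the snippet below counts tags and filters words by tag. It uses only the ten pairs printed above, hard-coded as `sample_tags` (a name introduced here for illustration), so it runs without NLTK.

```python
from collections import Counter

# First ten (word, tag) pairs copied from the output above.
sample_tags = [
    ('Thank', 'NNP'), ('you', 'PRP'), ('very', 'RB'), ('much', 'RB'), ('.', '.'),
    ('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Mr.', 'NNP'), ('Vice', 'NNP'),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in sample_tags)
print(tag_counts.most_common(2))
# [('NNP', 5), ('RB', 2)]

# Keep only the words tagged as proper nouns.
proper_nouns = [word for word, tag in sample_tags if tag == 'NNP']
print(proper_nouns)
# ['Thank', 'Mr.', 'Speaker', 'Mr.', 'Vice']
```

Note that `Thank` is tagged `NNP` here, which is a reminder that statistical taggers can mislabel capitalized words at the start of a sentence.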


## Exercise 3. Get the nouns in the last 10 sentences of Trump's speech, grouped by sentence. Use spaCy.

Use Python list slicing to get the last 10 sentences and display the nouns found in each of them.
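
The list feature in question is a slice with a negative start index. A minimal sketch on made-up data follows; `fake_sentences` is a hypothetical stand-in for any generator of sentences, and a generator has to be converted to a list before it can be sliced.

```python
from typing import Iterator, List


def fake_sentences(count: int) -> Iterator[str]:
    """Hypothetical stand-in for a generator that yields sentences."""
    for number in range(1, count + 1):
        yield f"Sentence {number}."


# A generator cannot be sliced directly, so materialize it first.
sentences: List[str] = list(fake_sentences(25))

# A negative start index counts from the end of the list.
last_10 = sentences[-10:]
print(last_10[0], last_10[-1])
# Sentence 16. Sentence 25.

# Slicing never raises when the list is shorter than requested.
short = list(fake_sentences(3))[-10:]
print(len(short))
# 3
```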

In [3]:
import spacy

# Read the whole speech; the context manager closes the file automatically.
with open("../datasets/trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()

### here comes your code
nlp = spacy.load("en_core_web_sm")
doc = nlp(trump)

# doc.sents is a generator, so convert it to a list before slicing.
last_10_sentences = list(doc.sents)[-10:]

nouns_by_sentence = []
for sentence in last_10_sentences:
    nouns = [token.text for token in sentence if token.pos_ == 'NOUN']
    nouns_by_sentence.append(nouns)
    print(f'Sentence: {str(sentence).rstrip()}')
    print(f"Nouns: {', '.join(nouns)}")
    print(100 * '=')

Sentence: When we fulfill this vision, when we celebrate our 250 years of glorious freedom, we will look back on tonight as when this new chapter of American greatness began.
Nouns: vision, years, freedom, tonight, chapter, greatness
Sentence: The time for small thinking is over.
Nouns: time, thinking
Sentence: The time for trivial fights is behind us.
Nouns: time, fights
Sentence: We just need the courage to share the dreams that fill our hearts, the bravery to express the hopes that stir our souls, and the confidence to turn those hopes and those dreams into action.
Nouns: courage, dreams, hearts, bravery, hopes, souls, confidence, hopes, dreams, action
Sentence: From now on, America will be empowered by our aspirations, not burdened by our fears; inspired by the future, not bound by failures of the past; and guided by our vision, not blinded by our doubts.
Nouns: aspirations, fears, future, failures, past, vision, doubts
Sentence: I am asking all citizens to embrace this renewal of 

## Exercise 4. Build your own Bag Of Words implementation using the tokenizer created before

You need to implement the following methods:

- ``fit_transform`` - takes a list of strings and returns a matrix with its BoW representation
- ``get_feature_names`` - returns the list of words corresponding to the columns of the BoW matrix

In [4]:
import numpy as np
import spacy

class BagOfWords:
    """Basic BoW implementation."""
    
    __nlp = spacy.load("en_core_web_sm")
    __bow_list = []
    
    # your code goes maybe also here    
    
    def fit_transform(self, corpus: list):
        """Transform list of strings into BoW array.

        Parameters
        ----------
        corpus: List[str]
                Corpus of texts to be transformed

        Returns
        -------
        np.array
                Matrix representation of BoW

        """
        # your code goes here        
        docs_tokens = []
        for doc in corpus:
            docs_tokens.append([token.lemma_.lower() for token in self.__nlp(doc) if not token.is_punct and not token.is_stop])
        
        # Vocabulary: every distinct token, sorted alphabetically.
        self.__bow_list = sorted({token for doc_tokens in docs_tokens for token in doc_tokens})
        bow_matrix = np.zeros((len(corpus), len(self.__bow_list)), dtype=np.int32)
        for i, doc in enumerate(docs_tokens):
            for token in doc:
                bow_matrix[i, self.__bow_list.index(token)] += 1
        
        return bow_matrix
        

    def get_feature_names(self) -> list:
        """Return words corresponding to columns of matrix.

        Returns
        -------
        List[str]
                Words being transformed by fit function

        """   
        # your code goes here
        return self.__bow_list

corpus = [
     'Bag Of Words is based on counting',
     'words occurences throughout multiple documents.',
     'This is the third document.',
     'As you can see most of the words occur only once.',
     'This gives us a pretty sparse matrix, see below. Really, see below',
]    
    
vectorizer = BagOfWords()

X = vectorizer.fit_transform(corpus)
print(X)

vectorizer.get_feature_names()
print(vectorizer.get_feature_names())
len(vectorizer.get_feature_names())

[[1 1 1 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 1 0 0 1 0 1 0 0 1 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 1 0]
 [0 0 0 0 1 1 0 0 0 1 1 0 0]]
['bag', 'base', 'count', 'document', 'give', 'matrix', 'multiple', 'occur', 'occurence', 'pretty', 'sparse', 'word', 'words']


13
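
The solution above tokenizes with spaCy and looks up each column with `list.index`, which scans the vocabulary for every token. As a hedged alternative, here is a minimal sketch that uses the regex word tokenizer from Exercise 1 and a dictionary for the column lookup. It does no lemmatization or stop-word removal, so its features differ from the output above, and `bag_of_words` is a name chosen here for illustration. Plain nested lists stand in for the NumPy matrix.

```python
import re
from typing import Dict, List, Tuple


def tokenize_words(text: str) -> List[str]:
    """Regex word tokenizer from Exercise 1."""
    return re.findall(r"\b\w+(?:'\w+)?(?:\.\w+)*\b", text)


def bag_of_words(corpus: List[str]) -> Tuple[List[List[int]], List[str]]:
    """Return (count matrix, vocabulary) for a list of documents."""
    docs = [[word.lower() for word in tokenize_words(doc)] for doc in corpus]
    vocabulary = sorted({word for doc in docs for word in doc})
    # Map each word to its column once, instead of searching the list every time.
    column: Dict[str, int] = {word: i for i, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in docs]
    for row, doc in zip(matrix, docs):
        for word in doc:
            row[column[word]] += 1
    return matrix, vocabulary


matrix, vocabulary = bag_of_words([
    'This is the third document.',
    'See below. Really, see below',
])
print(vocabulary)
# ['below', 'document', 'is', 'really', 'see', 'the', 'third', 'this']
print(matrix)
# [[0, 1, 1, 0, 0, 1, 1, 1], [2, 0, 0, 1, 2, 0, 0, 0]]
```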

## Exercise 5. Build a 5-gram model and clean up the results.

There are three tasks to do:
1. Use a 5-gram model instead of a 3-gram model.
2. Capitalize the first letter of each sentence.
3. Remove the whitespace between the last word of a sentence and ``.``, ``!`` or ``?``.

Hint: for 2. and 3., implement a function called ``clean_generated()`` that takes the generated text and fixes both issues at once. It is likely easier to fix the text after it has been generated than to make changes inside the while loop.

In [5]:
import random
import re

from nltk.book import *

wall_street = text7.tokens

tokens = wall_street

def cleanup():
    # Keep only tokens that start with a letter, a digit or sentence-ending punctuation.
    compiled_pattern = re.compile("^[a-zA-Z0-9.!?]")
    clean = list(filter(compiled_pattern.match,tokens))
    return clean

tokens = cleanup()

def build_ngrams():
    ngrams = []
    for i in range(len(tokens)-N+1):
        ngrams.append(tokens[i:i+N])
    return ngrams

def ngram_freqs(ngrams):
    counts = {}

    for ngram in ngrams:
        token_seq  = SEP.join(ngram[:-1])
        last_token = ngram[-1]

        if token_seq not in counts:
            counts[token_seq] = {}

        if last_token not in counts[token_seq]:
            counts[token_seq][last_token] = 0

        counts[token_seq][last_token] += 1

    return counts

def next_word(text, N, counts):

    token_seq = SEP.join(text.split()[-(N-1):])
    choices = counts[token_seq].items()
    
    total = sum(weight for choice, weight in choices)
    r = random.uniform(0, total)
    upto = 0
    for choice, weight in choices:
        upto += weight
        if upto > r: return choice
    assert False # should not reach here

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
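
To make the shape of the `counts` dictionary easier to see, here is a small sketch on a toy token list with N=3. It condenses the ideas of `build_ngrams`, `ngram_freqs` and `next_word` using `collections.Counter` and `random.choices`; the names `count_ngrams`, `sample_next` and `toy_tokens` are illustrative and not part of the exercise code.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def count_ngrams(tokens: List[str], n: int, sep: str = " ") -> Dict[str, Counter]:
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        *context, last = tokens[i:i + n]
        counts[sep.join(context)][last] += 1
    return dict(counts)


def sample_next(followers: Counter, rng: random.Random) -> str:
    """Pick a follower with probability proportional to its count."""
    words = list(followers)
    return rng.choices(words, weights=[followers[w] for w in words], k=1)[0]


toy_tokens = "the cat sat on the mat and the cat sat on the sofa".split()
toy_counts = count_ngrams(toy_tokens, n=3)

print(dict(toy_counts["on the"]))
# {'mat': 1, 'sofa': 1}
print(toy_counts["the cat"]["sat"])
# 2
print(sample_next(toy_counts["the cat"], random.Random(0)))
# sat
print("the" in toy_counts)
# False
```

Every key holds exactly N-1 tokens, so a seed text is only found in the dictionary when it has that many tokens; a shorter seed such as the single word `the` above is never a key.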


In [8]:
def clean_generated(text):
    sentences = nltk.sent_tokenize(text)
    sentences = [s.capitalize() for s in sentences]
    text = ' '.join(sentences)
    text = re.sub(r'\s([?.!"])', r'\1', text)
    return text
    
N=5 # fix it for other value of N

SEP=" "

sentence_count=5

ngrams = build_ngrams()
start_seq="we have"

counts = ngram_freqs(ngrams)

if start_seq not in counts: start_seq = random.choice(list(counts.keys()))
generated = start_seq.lower()

sentences = 0
while sentences < sentence_count:
    generated += SEP + next_word(generated, N, counts)
    sentences += 1 if generated.endswith(('.','!', '?')) else 0

# put your code here:
generated = clean_generated(generated)
print(generated)

In the course of trade or business to report the payment on a document known as form 8300. The form asks for such details as the client name social security number passport number and details about the services provided for the payment. Failure to complete the form had been punishable as a misdemeanor until last november when congress determined that the crime was a felony punishable by up to 10 years in prison. Attorneys have argued since 1985 when the law took effect that they can not provide information about clients who do n't wish their identities to be known. Many attorneys have returned incomplete forms to the irs in recent years citing attorney-client privilege.
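
One detail is visible in the output above: `str.capitalize()` uppercases the first character of a sentence and lowercases all the others, which is why names such as `irs` and `november` come out in lowercase. The sketch below is one possible case-preserving variant, written with `re` only. It is a simplification under stated assumptions: `clean_generated_keep_case` is a name introduced here, sentence starts are detected purely by punctuation followed by whitespace (so a word after an abbreviation like `U.S.` would also be capitalized), and quotation marks are left untouched.

```python
import re


def clean_generated_keep_case(text: str) -> str:
    """Fix spacing before . ! ? and capitalize sentence starts, keeping other casing."""
    # Remove the whitespace in front of sentence-ending punctuation.
    text = re.sub(r"\s+([.!?])", r"\1", text)
    # Uppercase a lowercase letter at the start of the text or right after a sentence end.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda match: match.group(1) + match.group(2).upper(),
        text,
    )


raw = "we have filed form 8300 with the IRS . the deadline is in November ."
cleaned = clean_generated_keep_case(raw)
print(cleaned)
# We have filed form 8300 with the IRS. The deadline is in November.
```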
