# NLP Exercises

This section contains five exercises:
1. Build your own tokenizer by implementing two functions based on regular expressions.
2. Get part-of-speech tags from Trump's speech.
3. Get the nouns in the last 10 sentences of Trump's speech and compute the number of nouns divided by the number of sentences. Use spaCy.
4. Build your own Bag of Words implementation using the tokenizer created earlier.
5. Build a 5-gram model and clean up the results.

## Exercise 1. Build your own tokenizer

Build two different tokenizers:
- ``tokenize_sentence``: function tokenizing text into sentences,
- ``tokenize_words``: function tokenizing text into words.

In [1]:
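import re
from typing import List

def tokenize_words(text: str) -> List[str]:
    """Split text into words on runs of non-word characters."""
    # \W+ matches one or more characters that are not letters, digits, or underscores.
    # A leading or trailing separator produces an empty string in the result.
    list_of_words = re.split(r'\W+', text)
    return list_of_words

def tokenize_sentence(text: str) -> List[str]:
    """Split text into sentences after each '.', '!' or '?'."""
    # The lookbehind splits on the empty position right after the punctuation mark
    # (splitting on empty matches requires Python 3.7+).
    list_of_sentences = re.split(r'(?<=[.!?])', text)
    return list_of_sentences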

text = "Here we go again. I was supposed to add this text later.\
Well, it's 10.p.m. here, and I'm actually having fun making this course. :o\
I hope you are getting along fine with this presentation, I really did try.\
And one last sentence, just so you can test you tokenizers better."

print("\nTokenized sentences:")
print(tokenize_sentence(text))
print("\nTokenized words:")
print(tokenize_words(text))


Tokenized sentences:
['Here we go again.', ' I was supposed to add this text later.', "Well, it's 10.", 'p.', 'm.', " here, and I'm actually having fun making this course.", ' :oI hope you are getting along fine with this presentation, I really did try.', 'And one last sentence, just so you can test you tokenizers better.', '']

Tokenized words:
['Here', 'we', 'go', 'again', 'I', 'was', 'supposed', 'to', 'add', 'this', 'text', 'later', 'Well', 'it', 's', '10', 'p', 'm', 'here', 'and', 'I', 'm', 'actually', 'having', 'fun', 'making', 'this', 'course', 'oI', 'hope', 'you', 'are', 'getting', 'along', 'fine', 'with', 'this', 'presentation', 'I', 'really', 'did', 'try', 'And', 'one', 'last', 'sentence', 'just', 'so', 'you', 'can', 'test', 'you', 'tokenizers', 'better', '']
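
Both outputs above end with an empty string, because ``re.split`` returns whatever follows the final separator. One possible variation, shown here only as a sketch, collects the words with ``re.findall`` and filters the sentence pieces. The names ``tokenize_words_findall`` and ``tokenize_sentence_stripped`` are hypothetical and not part of the exercise.

```python
import re
from typing import List


def tokenize_words_findall(text: str) -> List[str]:
    """Return runs of word characters; findall never yields empty strings."""
    return re.findall(r"\w+", text)


def tokenize_sentence_stripped(text: str) -> List[str]:
    """Split after '.', '!' or '?' and drop empty or whitespace-only pieces."""
    # Splitting on an empty (lookbehind) match requires Python 3.7+.
    pieces = re.split(r"(?<=[.!?])", text)
    return [piece.strip() for piece in pieces if piece.strip()]


sample = "Here we go again. I'm having fun!"
print(tokenize_words_findall(sample))
print(tokenize_sentence_stripped(sample))
```

Abbreviations such as "10.p.m." are still split into several pieces; handling them would need a more careful pattern.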


## Exercise 2. Get tags from Trump speech using NLTK

Read the ``trump.txt`` file and find the part-of-speech tag for each word using NLTK.

In [2]:
from nltk import word_tokenize  # exercise 2
from nltk import pos_tag        # exercise 2

# Read the whole speech; the context manager closes the file afterwards.
with open("./datasets/trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()

# NLTK's tokenizer keeps punctuation marks as separate tokens.
words = word_tokenize(trump)

tagged_words = pos_tag(words)
tagged_words

[('Thank', 'NNP'),
 ('you', 'PRP'),
 (',', ','),
 ('everybody', 'NN'),
 ('.', '.'),
 ('Thank', 'NNP'),
 ('you', 'PRP'),
 ('.', '.'),
 ('Thank', 'VB'),
 ('you', 'PRP'),
 ('very', 'RB'),
 ('much', 'RB'),
 ('.', '.'),
 ('Thank', 'NNP'),
 ('you', 'PRP'),
 (',', ','),
 ('Matt', 'NNP'),
 (',', ','),
 ('for', 'IN'),
 ('that', 'DT'),
 ('great', 'JJ'),
 ('introduction', 'NN'),
 ('.', '.'),
 ('And', 'CC'),
 ('thank', 'VB'),
 ('you', 'PRP'),
 ('for', 'IN'),
 ('this', 'DT'),
 ('big', 'JJ'),
 ('crowd', 'NN'),
 ('.', '.'),
 ('This', 'DT'),
 ('is', 'VBZ'),
 ('incredible', 'JJ'),
 ('.', '.'),
 ('Really', 'RB'),
 ('incredible', 'JJ'),
 ('.', '.'),
 ('We', 'PRP'),
 ('have', 'VBP'),
 ('all', 'DT'),
 ('come', 'VBP'),
 ('a', 'DT'),
 ('long', 'JJ'),
 ('way', 'NN'),
 ('together', 'RB'),
 ('.', '.'),
 ('We', 'PRP'),
 ('have', 'VBP'),
 ('come', 'VBN'),
 ('a', 'DT'),
 ('long', 'JJ'),
 ('way', 'NN'),
 ('together', 'RB'),
 ('.', '.'),
 ('I', 'PRP'),
 ('’', 'VBP'),
 ('m', 'RB'),
 ('thrilled', 'VBN'),
 ('to', 'TO')
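
The tagger returns a list of ``(word, tag)`` tuples, which can be summarized with plain Python. The sketch below uses a few pairs copied from the output above instead of calling NLTK, so it runs without the corpus. The variable names are illustrative assumptions.

```python
from collections import Counter

# A handful of (word, tag) pairs copied from the tagger output above.
tagged_sample = [
    ("Thank", "NNP"), ("you", "PRP"), (",", ","), ("everybody", "NN"), (".", "."),
    ("This", "DT"), ("is", "VBZ"), ("incredible", "JJ"), (".", "."),
]

# How often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged_sample)

# Group the words by their tag, preserving the order of appearance.
words_by_tag = {}
for word, tag in tagged_sample:
    words_by_tag.setdefault(tag, []).append(word)

print(tag_counts.most_common(3))
print(words_by_tag["NN"])
```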

## Exercise 3. Get the nouns in the last 10 sentences of Trump's speech and compute the number of nouns divided by the number of sentences. Use spaCy.

Use Python list slicing to get the last 10 sentences and display the nouns found in them.
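
Slicing with a negative start index is one way to take the last 10 items of a list. A minimal sketch with placeholder sentences (the ``sentences`` list is made up for illustration):

```python
# Twelve placeholder sentences standing in for the tokenized speech.
sentences = [f"Sentence {i}." for i in range(1, 13)]

# A negative start index counts from the end of the list.
last_ten = sentences[-10:]
print(len(last_ten))
print(last_ten[0])

# Slicing does not raise when the list is shorter than 10 items;
# it simply returns everything that is there.
short = sentences[:3]
print(short[-10:])
```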

In [3]:
import spacy   # exercise 3
from nltk.tokenize import sent_tokenize  # exercise 3

with open("./datasets/trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()

# Negative slicing keeps only the last 10 sentences.
last_sentences = sent_tokenize(trump)[-10:]
# Joining with an empty string fuses neighbouring sentences ("see.Hopefully"),
# as the printed text below shows.
trump = ''.join(last_sentences)

print(trump)

number_of_sentences = len(last_sentences)
number_of_nouns = 0

nlp = spacy.load('en_core_web_sm')
doc = nlp(trump)

# Count the tokens that spaCy tags as common nouns.
for token in doc:
    if token.pos_ == "NOUN":
        number_of_nouns += 1

print("\n\nnouns divided by sentences : {:}".format(number_of_nouns / number_of_sentences))


We will see.Hopefully something positive can happen.But that just was announced and I wanted to let you know.We have imposed the heaviest sanctions ever imposed.So ladies and gentlemen, thank you for everything.You’ve been incredible partners.Incredible partners.And I will let you know in the absolute strongest of terms, we’re going to make America great again and I will never, ever, ever let you down.Thank you very much.Thank you.


nouns divided by sentences : 0.9


## Exercise 4. Build your own Bag Of Words implementation using tokenizer created before 

You need to implement the following methods:

- ``fit_transform`` - gets a list of strings and returns a matrix with its BoW representation
- ``get_feature_names`` - returns the list of words corresponding to the columns of the BoW matrix

In [4]:
import numpy as np
import pandas as pd

class BagOfWords:
    """Basic BoW implementation."""

    def __init__(self):
        # Instance attribute, so separate vectorizers do not share state.
        self.__list_of_words = []

    def fit_transform(self, corpus: list) -> np.ndarray:
        """Transform list of strings into BoW array.

        Parameters
        ----------
        corpus: List[str]
                Corpus of texts to be transformed

        Returns
        -------
        np.array
                Matrix representation of BoW
        """
        # Tokenize every document once; drop the empty strings re.split can produce.
        tokenized = [[w for w in tokenize_words(s) if w] for s in corpus]

        # The vocabulary is the sorted set of all distinct words (case-sensitive).
        self.__list_of_words = sorted({w for words in tokenized for w in words})

        # One row per document, one column per vocabulary word.
        bow_list = [[words.count(w) for w in self.__list_of_words] for words in tokenized]
        return np.array(bow_list)

    def get_feature_names(self) -> list:
        """Return the words corresponding to the columns of the BoW matrix."""
        return self.__list_of_words

corpus = [
     'Bag Of Words is based on counting',
     'words occurrences throughout multiple documents.',
     'This is the third document.',
     'As you can see most of the words occur only once.',
     'This gives us a pretty sparse matrix, see below. Really, see below']

vectorizer = BagOfWords()
X = vectorizer.fit_transform(corpus)
print(X)

names = vectorizer.get_feature_names()

df = pd.DataFrame(X, columns=names)
df

[[0 1 1 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 1 0 1 0 0 0 1 1]
 [0 0 0 1 1 0 1 0 2 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 2 1 0 0 0 1 0 0]]


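| | As | Bag | Of | Really | This | Words | a | based | below | can | ... | only | pretty | see | sparse | the | third | throughout | us | words | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 4 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | ... | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |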
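
The same counting can also be written with ``collections.Counter``, which tokenizes each document only once. This is an alternative sketch, not the exercise's reference solution. The function name ``bag_of_words`` and the use of ``re.findall`` in place of the tokenizer from Exercise 1 are assumptions made to keep the example self-contained.

```python
import re
from collections import Counter
from typing import List, Tuple


def bag_of_words(corpus: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted vocabulary and one row of word counts per document."""
    # One Counter per document: word -> number of occurrences.
    counts = [Counter(re.findall(r"\w+", doc)) for doc in corpus]
    # The vocabulary is the union of all words seen in any document.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for missing words, so every row has the same length.
    matrix = [[doc_counts[word] for word in vocabulary] for doc_counts in counts]
    return vocabulary, matrix


vocab, rows = bag_of_words(["see below", "Really, see below see"])
print(vocab)
print(rows)
```

As in the output above, the vocabulary is case-sensitive and uppercase words sort before lowercase ones.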


## Exercise 5. Build a 5-gram model and clean up the results.

There are three tasks to do:
1. Use a 5-gram model instead of a 3-gram model.
2. Capitalize the first letter of each sentence.
3. Remove the whitespace between the last word in a sentence and ``.``, ``!`` or ``?``.

Hint: for tasks 2 and 3, implement a function called ``clean_generated()`` that takes the generated text and fixes both issues at once. It is probably easier to fix the text after it has been generated than to change the while loop.

In [5]:
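import re
from nltk.book import *
import random  # imported after the star import so the name refers to the standard-library module

# Wall Street Journal sample shipped with the NLTK book corpora.
wall_street = text7.tokens
tokens = wall_street

def cleanup():
    """Keep only tokens that start with a letter, a digit, or '.', '!' or '?'."""
    compiled_pattern = re.compile("^[a-zA-Z0-9.!?]")
    clean = list(filter(compiled_pattern.match, tokens))
    return clean
tokens = cleanup()

def build_ngrams():
    """Return every run of N consecutive tokens (N and tokens are module-level names)."""
    ngrams = []
    for i in range(len(tokens)-N+1):
        ngrams.append(tokens[i:i+N])
    return ngrams

def ngram_freqs(ngrams):
    """Map each (N-1)-token context to the counts of the tokens that follow it."""
    counts = {}

    for ngram in ngrams:
        token_seq  = SEP.join(ngram[:-1])
        last_token = ngram[-1]

        if token_seq not in counts:
            counts[token_seq] = {}

        if last_token not in counts[token_seq]:
            counts[token_seq][last_token] = 0

        counts[token_seq][last_token] += 1
    return counts

def next_word(text, N, counts):
    """Sample the next token in proportion to how often it followed the last N-1 tokens."""
    token_seq = SEP.join(text.split()[-(N-1):])
    choices = counts[token_seq].items()

    total = sum(weight for choice, weight in choices)
    r = random.uniform(0, total)
    upto = 0
    for choice, weight in choices:
        upto += weight
        # >= guarantees a return even if r lands exactly on the total weight.
        if upto >= r:
            return choice
    assert False # should not reach here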

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
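
The cumulative-weight loop in ``next_word`` can also be expressed with ``random.choices``, which accepts a ``weights`` argument. The sketch below illustrates the idea on a toy token list with trigrams. The names ``build_counts`` and ``sample_next`` and the toy data are assumptions, and contexts are stored as tuples rather than joined strings.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_counts(tokens: List[str], n: int) -> Dict[Tuple[str, ...], Counter]:
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        *context, last = tokens[i:i + n]
        counts[tuple(context)][last] += 1
    return counts


def sample_next(counts, context, rng: random.Random) -> str:
    """Draw a follower of `context`, weighted by how often it was observed."""
    options = counts[tuple(context)]
    words = list(options)
    weights = list(options.values())
    return rng.choices(words, weights=weights, k=1)[0]


toy_tokens = "the cat sat . the cat ran . the dog sat .".split()
toy_counts = build_counts(toy_tokens, 3)
rng = random.Random(0)  # seeded generator, so repeated runs give the same draw
print(sample_next(toy_counts, ["the", "cat"], rng))
```

A context that never occurred in the data has no followers, so a full generator would need to handle that case before sampling.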


In [9]:
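import random
import re

def clean_generated(generated):
    """Remove the space before '.', '!' or '?' and capitalize each sentence start."""
    # "profits ." -> "profits."
    generated = re.sub(r"\s+([.!?])", r"\1", generated)
    # Uppercase the first letter of the text and the first letter after
    # sentence-ending punctuation followed by whitespace.
    return re.sub(r"(^|[.!?]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(),
                  generated)


N = 5
SEP = " "
sentence_count = 10

ngrams = build_ngrams()
start_seq = "we have managed to"

counts = ngram_freqs(ngrams)

# Fall back to a random context when no start sequence is given.
if start_seq is None:
    start_seq = random.choice(list(counts.keys()))
generated = start_seq.lower()

# Keep appending sampled words until enough sentences have been generated.
sentences = 0
while sentences < sentence_count:
    generated += SEP + next_word(generated, N, counts)
    sentences += 1 if generated.endswith(('.', '!', '?')) else 0

print("\n\nBefore:\n", generated)

generated = clean_generated(generated)

print("\n\nAfter:\n", generated)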




Before:
 we have managed to maximize our direct-mail capability . In addition Buick is a relatively respected nameplate among American Express card holders says 0 an American Express spokeswoman . When the company asked members in a mailing which cars they would like to be the company next chief executive . Mr. Baum said 0 the two have orders to focus on bottom-line profits and to take a hard look at our businesses what is good what is not so good . Analysts generally applaud the performance of Campbell U.S.A. the company largest division which posted 6 unit sales growth and a 15 improvement in operating profit for fiscal 1989 . The House and Senate are divided over whether the United Nations Population Fund will receive any portion of these appropriations but the size of the charge until they determine which employees and how many will participate in the retirement plan . But the pharmaceutical company said 0 it believes 0 would perform satisfactorily on the bench . In contrast the 