# 1. Tokenization

**1. Kochmar mentions several steps required in a typical NLP pipeline, one of them being *Split into words*. Why is this step necessary? Why can we not just feed the text as it is into a model?**

Splitting the text into words lets each word in a sentence be analysed individually. This can be more computationally efficient and allows for methods like parallel processing. Separate words are also easier to use in feature extraction and semantic analysis, so the computer can distinguish the important words from those that only carry grammatical meaning. It is probably also easier for a model to build up context and to see the same word in different contexts, establishing patterns it can recognise.

**2. Simply splitting on "words" (i.e. whitespace) is rarely enough. Consider the sentence below ("That U.S.A. poster-print costs $12.40...") and name some problems that arise from splitting on whitespace.**

Problems with using whitespace as a divider arise when words contain special characters, abbreviations or numeric values. A simple, general fault is the period that ends a sentence: when you split only on whitespace, the period becomes part of the last token in the sentence, which likely makes that word unrecognisable. In this specific example the main issue is the price, which is tokenized as `$12.40...` and would therefore not match `$12.4`, even though the two are interpreted as the same amount.

In [9]:
sentence = "That U.S.A. poster-print costs $12.40..."
sentence.split(" ")

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']

In [5]:
# If you wish, experiment with implementing different rules for tokenization. You will see that the "ruleset" quickly grows if you want to account for all types of edge cases...
sentence = "That U.S.A. poster-print costs $12.40..."

def your_rulebased_tokenizer(sentence):
    tokens = []
    return tokens

your_rulebased_tokenizer(sentence)

['.']
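
The stub above can be filled in many ways. The following is a minimal sketch of one rule-based approach using only Python's `re` module. The pattern and the name `rule_based_tokenize` are assumptions made for illustration, not part of the lab or of NLTK, and the rules only cover the cases in the example sentence.

```python
import re

# Hypothetical rule set, ordered from most to least specific.
# Each alternative handles one of the edge cases in the example sentence.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}      # abbreviations such as U.S.A.
  | \$?\d+(?:\.\d+)?        # numbers and prices such as $12.40
  | \w+(?:-\w+)*            # words, including hyphenated compounds
  | \.\.\.                  # an ellipsis kept as one token
  | [^\w\s]                 # any other single punctuation mark
    """,
    re.VERBOSE,
)


def rule_based_tokenize(sentence):
    """Return the tokens matched by the rules, skipping whitespace."""
    return TOKEN_PATTERN.findall(sentence)


print(rule_based_tokenize("That U.S.A. poster-print costs $12.40..."))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

Even this small rule set is fragile: it cannot tell whether the final period of `U.S.A.` also ends a sentence, which illustrates how quickly the rules grow.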

In [26]:
import nltk

# Download the Punkt tokenizer models
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\marcu\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

NLTK has several tokenizers implemented, such as a specific one for Twitter data. Below, indicated by the `TODO`-tag, you should find and import various tokenizers and add them to the list of tokenizers:

`tokenizers = [tokenizer1, tokenizer2, ..., tokenizerN]`

Tokenize the sentence with at least three different tokenizers supplied by NLTK and comment on your findings. You will find the documentation for NLTK's tokenizers [here](https://www.nltk.org/_modules/nltk/tokenize.html) useful.

In [14]:
from typing import List

# this is the base class of tokenizers in nltk
from nltk.tokenize.api import TokenizerI
from nltk.tokenize import wordpunct_tokenize, sent_tokenize 


# this is just a simple example of how a tokenizer can be implemented
class MyWhitespaceTokenizer(TokenizerI):
    def __init__(self):
        super().__init__()

    def tokenize(self, text: str) -> List[str]:
        return text.split()
    
class SpaceDot_Tokenizer(TokenizerI):
    def __init__(self):
        super().__init__()

    def tokenize(self, text: str) -> List[str]:
        return wordpunct_tokenize(text)


class Sentence_Tokenizer(TokenizerI):
    def __init__(self):
        super().__init__()

    def tokenize(self, text: str) -> List[str]:
        return sent_tokenize(text)


sentence = "That U.S.A. poster-print costs $12.40..."

# ************************************************************
# TODO: import and add the tokenizers you want to try out here
# ************************************************************
tokenizers = [
    MyWhitespaceTokenizer(), 
    SpaceDot_Tokenizer(),
    Sentence_Tokenizer()
]


# Leave this function as-is
def tokenize(tokenizers: List[TokenizerI], sentence: str) -> None:
    for tokenizer in tokenizers:
        assert isinstance(tokenizer, TokenizerI)
        tokenized = tokenizer.tokenize(sentence)
        print(f"{tokenizer.__class__.__name__} ({len(tokenized)} tokens)\n{tokenized}\n")

tokenize(tokenizers, sentence)

# This was my first attempt. Surprisingly it worked,
# but I used the same framework of classes and functions, not realising this was not the correct way to approach this task.

MyWhitespaceTokenizer (5 tokens)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']

SpaceDot_Tokenizer (16 tokens)
['That', 'U', '.', 'S', '.', 'A', '.', 'poster', '-', 'print', 'costs', '$', '12', '.', '40', '...']

Sentence_Tokenizer (1 tokens)
['That U.S.A. poster-print costs $12.40...']



In [32]:
from typing import List

# this is the base class of tokenizers in nltk
from nltk.tokenize.api import TokenizerI
from nltk.tokenize import NLTKWordTokenizer, MWETokenizer, LineTokenizer

# this is just a simple example of how a tokenizer can be implemented
class MyWhitespaceTokenizer(TokenizerI):
    def __init__(self):
        super().__init__()

    def tokenize(self, text: str) -> List[str]:
        return text.split()
    
sentence = "That U.S.A. poster-print costs $12.40..."

tokenizers = [
    MyWhitespaceTokenizer(),
    NLTKWordTokenizer(),
    MWETokenizer(),
    LineTokenizer()
 #   LegalitySyllableTokenizer()

]


# Leave this function as-is
def tokenize(tokenizers: List[TokenizerI], sentence: str) -> None:
    for tokenizer in tokenizers:
        assert isinstance(tokenizer, TokenizerI)
        tokenized = tokenizer.tokenize(sentence)
        print(f"{tokenizer.__class__.__name__} ({len(tokenized)} tokens)\n{tokenized}\n")

tokenize(tokenizers, sentence)

MyWhitespaceTokenizer (5 tokens)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']

NLTKWordTokenizer (7 tokens)
['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']

MWETokenizer (40 tokens)
['T', 'h', 'a', 't', ' ', 'U', '.', 'S', '.', 'A', '.', ' ', 'p', 'o', 's', 't', 'e', 'r', '-', 'p', 'r', 'i', 'n', 't', ' ', 'c', 'o', 's', 't', 's', ' ', '$', '1', '2', '.', '4', '0', '.', '.', '.']

LineTokenizer (1 tokens)
['That U.S.A. poster-print costs $12.40...']



Initially I used the wrong coding approach. It still ran and the results seemed fine, but I was not using NLTK's tokenizer classes; instead I wrapped its functions in classes of my own. After reading more of the source material I found tokenizer classes to import and used them in the code instead. <br>
The NLTKWordTokenizer splits off both the dollar sign and the triple dots. Compared to the whitespace tokenizer, I think this could improve the understanding of the text, as the combined token `$12.40...` is unlikely to occur again. I found the source material a little complex and did not quite understand its exact criteria for splitting. <br>
The MWETokenizer is made primarily for combining specific phrases: given a list of multi-word expressions, it merges a matching sequence of tokens such as 'U', '.', 'S', '.', 'A', '.' into a single token rather than six. It expects text that has already been tokenized (a list of tokens), so when it is handed the raw string, as here, it iterates over the characters and every character, even whitespace, becomes its own token. I assume it is best used after another tokenizer, together with a list of common word combinations or n-grams, to make use of its merging effect.<br>
I was expecting the LineTokenizer to split into sentences. However, after reading the documentation I believe it works differently, splitting only where there is a line break and not at the end of a sentence.
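
The character-level MWETokenizer output follows from how multi-word merging operates on a sequence of tokens. Below is a minimal pure-Python sketch of that idea. It is not NLTK's implementation; `merge_mwes`, its `separator` parameter and the example phrase are hypothetical and chosen only for illustration.

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Greedily merge known multi-word expressions in a token sequence."""
    merged = []
    i = 0
    while i < len(tokens):
        for mwe in mwes:
            # Compare the next len(mwe) tokens against the expression.
            if mwe and tuple(tokens[i:i + len(mwe)]) == tuple(mwe):
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts here, so keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


mwes = [("New", "York")]

# A list of tokens: the expression is found and merged.
print(merge_mwes(["I", "like", "New", "York"], mwes))
# ['I', 'like', 'New_York']

# A raw string: iterating over it yields characters, so nothing matches.
print(merge_mwes("New York", mwes))
# ['N', 'e', 'w', ' ', 'Y', 'o', 'r', 'k']
```

The second call mirrors the 40-token result: a string is a sequence of characters, so each character is treated as a token.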

# 2. Language modeling
We have now studied the bigger models like BERT and GPT-based language models. A simpler language model, however, can be implemented using n-grams.

**1. What is an n-gram?**

An n-gram is a contiguous sequence of n items from a text. A unigram is a sequence of a single word and a bigram is a sequence of two words, so an n-gram is a sequence of n words. N-grams are used in problems where the order of the words is highly relevant, since they capture more of a word's context than single words do. I assume this is helpful for word embeddings, as they rely heavily on the context of words. Using too large an n could lead to overly complex n-grams that rarely repeat and are difficult to process.
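
The definition can be sketched as a sliding window over a list of tokens. This is a plain-Python illustration with a hypothetical helper name, `make_ngrams`, and a made-up example sentence; it is not the NLTK function used elsewhere in this lab.

```python
def make_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat".split()

print(make_ngrams(tokens, 1))  # unigrams: one word per tuple
print(make_ngrams(tokens, 2))  # bigrams: pairs of neighbouring words
print(make_ngrams(tokens, 3))  # trigrams: triples of neighbouring words
```

A sequence of k tokens yields k - n + 1 n-grams, so larger values of n produce fewer and more specific sequences.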

**2. Use NLTK to print out bigrams and trigrams for the given sentence below. Your function should support any number of N.**

In [11]:
sentence = "That U.S.A. poster-print costs $12.40... I'd pay $5.00 for it."

from nltk import ngrams

# ************************************
# TODO: your implementation of n-grams
# ************************************

N = [2, 3]

def nGrams(N, sent):
    for n in N:
        n_grams = ngrams(sent.split(" "), n)
        print("This is the sentence in {}-grams".format(n))
        for grams in n_grams:
            print(grams)
        print(" ")

nGrams(N, sentence)

This is the sentence in 2-grams
('That', 'U.S.A.')
('U.S.A.', 'poster-print')
('poster-print', 'costs')
('costs', '$12.40...')
('$12.40...', "I'd")
("I'd", 'pay')
('pay', '$5.00')
('$5.00', 'for')
('for', 'it.')
 
This is the sentence in 3-grams
('That', 'U.S.A.', 'poster-print')
('U.S.A.', 'poster-print', 'costs')
('poster-print', 'costs', '$12.40...')
('costs', '$12.40...', "I'd")
('$12.40...', "I'd", 'pay')
("I'd", 'pay', '$5.00')
('pay', '$5.00', 'for')
('$5.00', 'for', 'it.')
 


**3. Based on your intuition for language modeling, how can n-grams be used for word predictions?**

I assume the benefit of n-grams is that they add context to each word: by counting how often each word follows a given sequence of n-1 words in a corpus, a model can predict the most frequent continuation as the next word. More context should give better predictions, as long as the n-gram has been seen often enough. As mentioned before, word embeddings likewise rely on finding similar words based on them appearing in similar contexts, so using the context of the surrounding words should make the predictions easier.
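
A minimal sketch of this intuition with bigrams: count which word follows which, then predict the most frequent follower. The helper names `train_bigram_model` and `predict_next` are hypothetical, and the toy corpus is a lowercased, punctuation-free version of the text from question 2.4, used here only as an assumption for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count, for every word, how often each other word follows it."""
    following = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        following[current][nxt] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of word, or None if unseen."""
    candidates = model.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


corpus = ("that that is is that that is not is that it "
          "it is you sure surely it is").split()
model = train_bigram_model(corpus)

print(predict_next(model, "it"))      # is
print(predict_next(model, "banana"))  # None
```

A real n-gram language model would normalise the counts into probabilities and apply smoothing for unseen sequences; this sketch only picks the most frequent continuation.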

**4. NLTK includes the `FreqDist` class, which produces the frequency distribution of words in a sentence. Use it to print out the two most common words in the text below.**

In [1]:
text = "That that is is that that is not. Is that it? It is. You sure? Surely it is!"

# There is no text below, so I used the text above from question 2.1.
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

freq = FreqDist()
for word in word_tokenize(text):
    freq[word.lower()] += 1

print("The words with the highest frequency are: ")
print(freq.most_common(2))

The words with the highest frequency are: 
[('is', 6), ('that', 5)]


**5. Use your n-gram function from question 2.2 to print out the most common trigram of the text in question 2.4**

In [10]:
n = 3

def N_grams(n, text):
    # Lowercase, split on spaces and build the n-grams for a single n
    return ngrams(text.lower().split(" "), n)

grams = N_grams(n, text)

freq = FreqDist()
for gram in grams:
    freq[gram] += 1

print("The most common trigram is: ")
print(freq.most_common(1))

The most common trigram is: 
[(('that', 'that', 'is'), 2)]


**6. You may have discovered that you would need to implement some form of preprocessing to get the correct answer to the previous tasks. Preprocessing/cleaning/normalization is often necessary for the desired results. If you were to process the text of a news site or blog post, can you think of some preprocessing steps that would be useful?**

Lowercasing has already been used in a previous task, and can be beneficial for removing unnecessary upper casing. What else is appropriate depends on the task you are going to perform, but removing punctuation, stopwords and special characters can be useful. Operations like stemming and lemmatization can also be performed as preprocessing steps.

Stemming is the process of reducing words to their stem by stripping affixes, for example running -> run <br>
Lemmatization is similar, but maps a word to its dictionary base form (lemma) even when that differs from its current grammatical form, for example better -> good
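
The first few steps can be sketched with the standard library alone. The stopword list below is a deliberately tiny, hypothetical one, and `preprocess` is an illustrative helper name. Stemming and lemmatization are left out because they need a dedicated library.

```python
import re

# Hypothetical, deliberately small stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "it", "that", "and", "of", "to"}


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = text.split()
    return [token for token in tokens if token not in STOPWORDS]


print(preprocess("Is that it? It is. You sure? Surely it is!"))
# ['you', 'sure', 'surely']
```

Which steps to apply depends on the task: removing stopwords helps when looking for topical words, but would hide exactly the words that dominated the counts in question 2.4.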


# 3. Word Representations
For more information on word representations, consult the lab description file and course material.

**1. Describe the main differences between bag-of-words and one-hot encoding through examples.**

Bag-of-words represents a document as a table of counts, where each word is represented by how often it occurs in the document. One-hot encoding is a binary scheme: each word is a vector as long as the vocabulary, with a 1 at that word's position and 0 everywhere else. Using the command `FreqDist.tabulate()` from the previous exercises we get a bag-of-words representation, where each word is shown with its frequency.

One-hot encoding is used to label data within a vector that spans the entire "dictionary" of words, while bag-of-words is used to measure the importance of words in a given document: you can sort by frequency and remove stopwords to hopefully grasp something about the document's contents.
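
A small sketch of both representations over the same vocabulary, using only the standard library. The example document is the start of the text from question 2.4 without punctuation, and the helper names `one_hot` and `bag_of_words` are hypothetical.

```python
from collections import Counter

document = "that that is is that that is not".split()
vocabulary = sorted(set(document))  # ['is', 'not', 'that']


def one_hot(word, vocabulary):
    """Vector with a single 1 at the position of word in the vocabulary."""
    return [1 if word == entry else 0 for entry in vocabulary]


def bag_of_words(tokens, vocabulary):
    """Vector of occurrence counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[entry] for entry in vocabulary]


print(vocabulary)                          # ['is', 'not', 'that']
print(one_hot("not", vocabulary))          # [0, 1, 0]
print(bag_of_words(document, vocabulary))  # [3, 1, 4]
```

The one-hot vector identifies a single word, whereas the bag-of-words vector summarises a whole document and discards word order.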

**2. What are the limitations of the above representations?**

Bag-of-words is limited in that the words with the highest frequency are often stopwords or filler words that mainly serve a grammatical purpose. 
One-hot encoding is limited in that the only information you receive is whether the word is present or not; the relative importance of the word to this document, or to the corpus as a whole, is not represented.

Both methods lack the context of other words, and in a single document it can be difficult to extract the words with relative importance.

**3. Examples of word embedding techniques, such as Word2Vec and GloVe, are considered *dense* representations. How do dense word embeddings relate to the *distributional hypothesis*?**

The distributional hypothesis is the linguistic theory that words with similar meanings tend to occur in similar contexts. This theory is the foundation that dense word embeddings are built upon. The techniques attempt to learn from context and derive classifications or semantic meaning from the similarity of the contexts in which words appear. Word2Vec places every word as a vector in a continuous space (typically a few hundred dimensions rather than two; the "2Vec" stands for "to vector"), and synonyms and correlations between words can be read off from their placement in this space. If the model is trained well, words that occur in similar contexts end up close together in this space. Word embeddings are also used to find analogies, such as how man - woman relates to king - queen.
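
The analogy can be sketched with vector arithmetic and cosine similarity. The three-dimensional vectors below are hand-made toy values, not output from Word2Vec or GloVe, and are chosen only so that the arithmetic is easy to follow.

```python
import math

# Hand-made toy vectors (assumption): real embeddings are learned from
# large corpora and have far more dimensions.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in zip(
    embeddings["king"], embeddings["man"], embeddings["woman"])]

# The word whose vector is most similar to the target vector.
nearest = max(embeddings, key=lambda word: cosine(target, embeddings[word]))
print(nearest)  # queen
```

With learned embeddings the same arithmetic is applied to vectors estimated from context, which is where the distributional hypothesis enters.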