# NLP, N-grams and FastText

As you have seen in the lectures, NLP has a wide range of techniques and applications of such techniques. We will give you an introduction to some of these techiques, and today you will get hands-on experience with them. In today's exercise, we will look at the following topics:

1. How do we represent text in a vectorized way that encodes context? (One answer here is N-grams, and those we will look at).
2. How do we create and sample from an N-gram language model - and how does the size of the grams affect the generated text?
3. How do we use a pre-existing language model (FastText), to classify text messages as spam?

The data we will be using later today is a dataset consisting of "spam or ham" text messages. The dataset consists of a number of text messages, some of which are spam and some of which are so-called "ham". We will use FastText to classify mails as spam or ham. For now, we will be looking at some different texts, to see how we can use N-grams to generate text, and how we can create N-gram language models from a text corpus.

## Exercise 1: Text-loading


For now, the texts we will be experimenting with N-grams on, are the two famous books Pride & Prejudice by Jane Austen and The Origin of Species by Charles Darwin. The two books have been obtained in a raw text format from https://www.gutenberg.org/, i.e. Project Gutenberg which concerns itself with the collection of Open Access e-books.

A big part of working with text documents is unfortunately having to preprocess the documents. Preprocessing of these, can have a large impact on the eventual performance of language models, such as N-gram models. We have included the text-preprocessing steps in the cell below. In the output cell you will notice that the first chapter of pride and prejudice is printed out. It is then preprocessed using the `preprocess_text` function and printed out again.

* The preprocessing is not perfect. Do you notice any issues in the text? HINT: What happens to *good-humoured*? What happens to *three-and-twenty*? What happens to *Mr.* and *Mrs.*, and how will this later be handled when we split the sentences?

In [1]:
import re
import os
import fasttext

import pandas as pd
import numpy as np

from tqdm.notebook import tqdm

In [2]:
def preprocess_text(text):
    text = text.lower() #Lowercase everything in text file.
    text = re.sub(r"[^a-zA-Z0-9.?! \n]+", "", text) #Remove unwanted special characters.
    text = text.split("\n") #Split text by lines.
    text = [line.strip() for line in text if line.find("chapter") == -1] #Remove chapter headlines.
    text = "\n".join(text) #Recreate full document again
    text = text.replace("\n", " ").replace("  ", " ") #Remove end lines and remove double spacing.
    return text

In [3]:
with open("data/pride_and_prejudice.txt", "r", encoding="utf-8") as file:
    pride_n_pred = file.read()
    pride_n_pred_preproc = preprocess_text(pride_n_pred)

* Now load the Origin of Species text and preprocess as done to the Pride and Prejudice book above! You do not have to print the chapters out.

In [4]:
with open("data/pride_and_prejudice.txt", "r", encoding="utf-8") as file:
    orig_of_spec = file.read()
    orig_of_spec_preproc = preprocess_text(orig_of_spec)

## Exercise 2: Creating N-grams

Now that we have the texts in the preprocessed document format we want, we will move forward with the creation of our N-grams. Recall we want to use the N-grams for probabilistic word modelling tasks, for example, next word predictions given some sequence of words which we can express the following way:

\begin{equation}
P(w_n|w_1, w_2, ..., w_{n-2}, w_{n-1})
\end{equation}

The problem is that estimating such probabilities for very long sequences is computationally and memory-wise VERY expensive. So as a solution we sometimes use N-grams. In N-grams, the assumption is that we can model these conditional dependencies with shorter sequences of words, i.e.:

\begin{equation}
\begin{split}
P(w_n|w_1, w_2, ..., w_{n-2}, w_{n-1}) & \approx P(w_n) & \text{(Unigram)}\\
P(w_n|w_1, w_2, ..., w_{n-2}, w_{n-1}) & \approx P(w_n| w_{n-1}) & \text{(Bigram)}\\
P(w_n|w_1, w_2, ..., w_{n-2}, w_{n-1}) & \approx P(w_n|w_{n-2}, w_{n-1}) & \text{(Trigram)}\\
\end{split}
\end{equation}

Which we then compute as:

\begin{equation}
P(w_n|w_{n-2}, w_{n-1}) = \frac{\text{Count}(w_{n-2}, w_{n-1}, w_n)}{\text{Count}(w_{n-2}, w_{n-1})}
\end{equation}

The language model that we create is based on some text corpus from which we obtain the count measures. In this exercise we will try making such N-gram models on the two books Origin of Species and Pride and Prejudice!

In the cell below we have written the functions required for preprocessing a corpus even further such that it is ready for creating an N-gram model on. In order to guarantee that we can just start text generation or give conditional probabilites for how likely a start or end word is in given sentence we pad our sentence with start and end tokens denoted as `<s>` and `</s>` according to the size of N-grams we are working with.

* Convince yourself why N-grams encode context in comparison to methods such as count vectorizers which just count words.
* Make sure you understand the functions `tokenize_and_pad`, `n_gram` and `n_grams_to_prob_map`.
* Try to vary the N-gram size N and inspect the first 20 N-grams. How do they change and why? Do you think there could be issues with this?

In [5]:
def tokenize_and_pad(corpus, N=3):
    corpus_sentences = corpus.split(".")
    if N > 1:
        padded_corpus_sentences = [" ".join(["<s>"]*(N-1)) + " "+ sentence.strip() + " " + " ".join(["</s>"]*(N-1)) for sentence in corpus_sentences]
        tokenized_corpus_sentences = [[word for word in sentence.split(" ")] for sentence in padded_corpus_sentences]
    else:
        tokenized_corpus_sentences = [[word for word in sentence.strip().split(" ")] for sentence in corpus_sentences]
    return tokenized_corpus_sentences

def n_gram(tokenized_corpus_sentences, N=3):
    n_grams_corpus = [zip(*[sentence[i:] for i in range(N)]) for sentence in tokenized_corpus_sentences]
    n_grams = []
    for n_grams_sentence in n_grams_corpus:
        n_grams.extend([" ".join(n_gram) for n_gram in n_grams_sentence])
    return n_grams

def n_grams_to_prob_map(n_grams):
    contexts = {}
    cond_prob = {}
    for n_gram in n_grams:
        n_gram_split = n_gram.split(" ")
        context = " ".join(n_gram_split[:N-1])
        target = n_gram_split[N-1]
        if context not in contexts.keys():
            contexts[context] = {}
            contexts[context][target] = 1
        else:
            if target in contexts[context].keys():
                contexts[context][target] += 1
            else:
                contexts[context][target] = 1
    for context in contexts.keys():
        targets_count = [contexts[context][target] for target in contexts[context].keys()]
        context_sum = np.sum(targets_count)
        targets_prob = targets_count/context_sum
        cond_prob[context] = (targets_prob, list(contexts[context].keys()))
    return cond_prob
N=3
orig_of_spec_tokenize = tokenize_and_pad(orig_of_spec_preproc, N=N)
orig_of_spec_n_grams = n_gram(orig_of_spec_tokenize, N=N)
orig_of_spec_cond_prob = n_grams_to_prob_map(orig_of_spec_n_grams)
print(orig_of_spec_n_grams[:20])

['<s> <s> it', '<s> it is', 'it is a', 'is a truth', 'a truth universally', 'truth universally acknowledged', 'universally acknowledged that', 'acknowledged that a', 'that a single', 'a single man', 'single man in', 'man in possession', 'in possession of', 'possession of a', 'of a good', 'a good fortune', 'good fortune must', 'fortune must be', 'must be in', 'be in want']


## Exercise 3: Generating Text

In the previous exercise we saw how to tokenize a corpus such that it is ready to make n-grams on. We then saw how to make n-grams and create a conditional probability based on these.

The question now is, how can we generate a text using this conditional probability. A way of doing this is to sample from a conditional probability distribution based on our obtained N-grams. In essence, we can give a seed to our conditional probability (also called a context), and then we need to generate a word from our conditional probability by sampling from it.

In the code below we have defined a function that allows us to generate a sentence based on a provided conditional distribution. In the cell we create such a conditional distribution and generate 5 sentences using the same text seed. Please note that the text-seed needs to be the size of the conditional variables, and this is ensured in the first 5 lines of the `generate_text` function!

* Inspect the code below and try to understand what goes on in the `generate_text` function.
* Why is it that even though we use the same text seed, the generated sentences changes?
* What happens as you increase the N-gram size? Does this makes sense - and if so, why?
* Is it more optimal to have smaller or larger N-gram size? Try to experiment with seeing generated sentences as N goes from 2->7.
* What would it mean to set the N-gram size to one? What would you expect the generated text to look like?

In [6]:
def generate_text(cond_prob, text_seed, N, num_words=25):
    generated_sentence = text_seed
    if len(text_seed.split(" ")) != N-1:
        if len(text_seed.split(" ")) < N-1:
            text_seed = " ".join(["<s>"]*(N-1-len(text_seed.split(" ")))+text_seed.split(" "))
        else:
            text_seed = " ".join(text_seed.split(" ")[-N+1:]) #NOTE: Take ending words, not start sentences
    context = text_seed
    for i in range(num_words):
        if context not in cond_prob.keys():
            return generated_sentence
        else:
            generated_sentence += " " + np.random.choice(cond_prob[context][1], 1, p=cond_prob[context][0])[0]
            context = " ".join(generated_sentence.split(" ")[-N+1:])
    return generated_sentence



N=3
orig_of_spec_tokenize = tokenize_and_pad(orig_of_spec_preproc, N=N)
orig_of_spec_n_grams = n_gram(orig_of_spec_tokenize, N=N)
orig_of_spec_cond_prob = n_grams_to_prob_map(orig_of_spec_n_grams)

text_seed = "it is said that instinct impels the cuckoo to"

for i in range(5):
    print(generate_text(cond_prob=orig_of_spec_cond_prob, text_seed=text_seed, N=N) + "\n")

it is said that instinct impels the cuckoo to

it is said that instinct impels the cuckoo to

it is said that instinct impels the cuckoo to

it is said that instinct impels the cuckoo to

it is said that instinct impels the cuckoo to



We will now look at how the generated sentences changes depending on the corpus used to create our n-grams on.

* Create N-grams and a conditional probability using the Pride and Prejudice corpus.
* Try to generate some sentences using both conditional probabilites but using the same text seed (use ngram size 3 for example and use the provided text seed for both n-gram models. What do you observe?

In [7]:
#Write your code here for creating a sentence generator using Pride and Prejudice as your corpus
#and comparing the two language models.
N=3
pride_n_pred_tokenize = tokenize_and_pad(pride_n_pred_preproc, N=N)
pride_n_pred_n_grams = n_gram(pride_n_pred_tokenize, N=N)
pride_n_pred_cond_prob = n_grams_to_prob_map(pride_n_pred_n_grams)

text_seed = "it is said that"

for i in range(5):
    print(generate_text(cond_prob=pride_n_pred_cond_prob, text_seed=text_seed, N=N)+ "\n")

it is said that he declared himself to produce a letter for her probably more than half a minute and then thought no more </s> </s>

it is said that business with his family and friends </s> </s>

it is said that she had never met with a call at the will of the latter of all </s> </s>

it is said that i hardly know myself what it is a comfort to think that she talked on till they had been given up to nothing you could

it is said that business was the case the want of importance which is to tempt anyone to our family party at pemberley mr </s> </s>



# FastText for Ham or Spam

In the following exercises we will be looking at classifying text messages as "Ham" or "Spam" by using the FastText library. Recall that FastText is a library that allows us to train a language model easily to perform classification on other text pieces. In the following exercises we will:

1. Load and split a dataset consisting of text messages with ham or spam text and accompanying labels.
2. Train a FastText model to classify texts as ham or spam.
3. Evaluate the FastText model.

## Exercise 4: Loading spam or ham data

In the following cell we use the pandas library to load our text delimited file which has the lables in the first column and the text messages in the second.

* Use the `pandas.read_csv` function to read the `SMS_train.txt` file. Look up the documentation by googling. It may also require you to inspect the text file.
* How many text messages are in the training set?
* The `for` loop over the data is required for FastText as it expects a specific format for input files. Particularly, it wants a file which has the `__label__{label} text` layout in every line (where `__label__` is a token, i.e. something the FastText library reads as a keyword). Inspect the train_data.txt file to ensure you understand the format!

In [8]:
def create_fasttext_format_txt(data_frame, path_to_doc):
    texts = list(data_frame['1'])
    labels = list(data_frame['0'])
    txt = ""
    for i, (label, text) in tqdm(enumerate(zip(labels, texts)), total=len(texts)):
        txt = txt + f'__label__{label} {text}\n'
    
    with open(path_to_doc, mode='w', encoding="utf-8") as f:
        f.write(txt)
    return texts, labels

In [9]:
train_data = pd.read_csv(os.path.join("data", "SMS_train.txt"))
test_data = pd.read_csv('./data/SMS_test.txt', delimiter=',', encoding="utf-8")

display(train_data)

train_texts, train_labels = create_fasttext_format_txt(data_frame=train_data, path_to_doc='data/train_data.txt')
test_texts, test_labels = create_fasttext_format_txt(data_frame=test_data, path_to_doc='data/test_data.txt')

Unnamed: 0,0,1
0,ham,Ok lar... Joking wif u oni...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,spam,FreeMsg Hey there darling it's been 3 week's n...
4,ham,As per your request 'Melle Melle (Oru Minnamin...
...,...,...
4453,ham,Ard 6 like dat lor.
4454,spam,REMINDER FROM O2: To get 2.50 pounds free call...
4455,spam,This is the 2nd time we have tried 2 contact u...
4456,ham,"Pity, * was in mood for that. So...any other s..."


  0%|          | 0/4458 [00:00<?, ?it/s]

  0%|          | 0/1114 [00:00<?, ?it/s]

## Exercise 5: Training a FastText Model 

We will now use the same dataset we just loaded to train a FastText model to perform classification. Remember, that in FastText, we not only have the option to create models that use word level N-grams, but also character level N-grams. We will try both and compare their performance!

There are a number of parameters that can be passed to the FastText `train_supervised` function, but we will just concern ourselves with a couples of them.

* The `input` parameter requires a text file as an input containing two columns. The first column must be the classification label and the second must be the text.
* The `verbose` parameter just allows us to enable or disable training information. Here we enable it.
* Now try to test the model using the `test` function. (HINT: See https://fasttext.cc/docs/en/supervised-tutorial.html). How good is your performance on the testset?
* What happens when you vary the N-gram size? What is the optimal setting? Why do you think that is the case?
* Look in the FastText documentation to find out how to make a character level model. Can you get better performance this way? Why do you think that is/isn't? 
* Try to tweak some of the parameters and look at what optimal parameter settings are. (HINT: Look at the `maxn` and `minn` parameters. If you find it difficult/annoying doing this manually, consider doing a hyperparameter search grid and find some optimal parameters!).
* Try to preprocess the texts like we did in the previous exercise and see if this helps you!

In [10]:
def test_fasttext_model(test_texts, test_labels, fasttext_model, verbose=False):
    correct = 0
    total = 0
    
    for text, label in zip(test_texts, test_labels):
        prediction = fasttext_model.predict(text)[0][0]
        if prediction == f'__label__{label}':
            correct += 1
        total += 1
    
    accuracy = correct / total
    if verbose:
        print(f'Word model accuracy: {accuracy * 100:.2f} %')
    return accuracy

In [11]:
fasttext_word_model = fasttext.train_supervised(input='./data/train_data.txt', verbose=True, wordNgrams=3)
accuracy_word_model = test_fasttext_model(test_texts, test_labels, fasttext_model=fasttext_word_model, verbose=True)

Word model accuracy: 95.87 %


In [15]:
#Create char model here.
char_gram_length_min = 3 # If set to zero, we only train word-grams
char_gram_length_max = 6 # If set to zero, we only train word-grams

fasttext_char_model = fasttext.train_supervised(
    input='./data/train_data.txt',
    verbose=True,
     maxn=char_gram_length_max,
    minn=char_gram_length_min
)
accuracy_char_model = test_fasttext_model(test_texts, test_labels, fasttext_model=fasttext_char_model, verbose=True)

Word model accuracy: 93.27 %
