# NLP Course 2 Week 3 Lesson N-grams - lecture exercises

<a name="corpus-preprocessing"></a>
## Corpus preprocessing

The input corpus in this week's assignment is a continuous text that needs some preprocessing so that you can start calculating the n-gram probabilities.

Some common preprocessing steps for language models include:
- lowercasing the text
- removing special characters
- splitting the text into a list of sentences
- splitting each sentence into a list of words

In [None]:
# import the needed libraries
import nltk
import re
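# download the tokenizer models used by nltk.word_tokenize
# (assumption: recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')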

<a name="lowercase"></a>
### Lowercase
Some words in your corpus will start with a capital letter, e.g. at the beginning of a sentence or in a name. However, when counting words, you want to treat them the same as if they appeared in the middle of a sentence. <br>
You can do that by converting the text to lowercase using [str.lower](https://docs.python.org/3/library/stdtypes.html?highlight=split#str.lower).


In [None]:
# change the corpus to lowercase
corpus = "Learning% makes 'me' happy. I am happy be-cause I am learning!"
corpus = corpus.lower()

# note that the word "learning" is now the same regardless of its position in the sentence
print(corpus)

<a name="special-characters"></a>
### Remove special characters
Some characters may need to be removed from the corpus before you start processing the text to find n-grams.

Often, special characters such as double quotes '"' or dashes '-' are removed, while punctuation such as the full stop '.' or question mark '?' is left in the corpus.

In [None]:
# remove special characters: keep only letters, digits, '.', '?', '!' and spaces
corpus = "learning% makes 'me' happy. i am happy be-cause i am learning!"
corpus = re.sub(r"[^a-zA-Z0-9.?! ]+", "", corpus)
print(corpus)
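Whether a special character is deleted or replaced with a space changes the tokens you get later. The following is a minimal sketch, assuming the same regular expression as above; the names `deleted` and `spaced` are only illustrative and do not come from the assignment.

```python
import re

corpus = "learning% makes 'me' happy. i am happy be-cause i am learning!"

# Option 1: delete special characters, so "be-cause" becomes "because"
deleted = re.sub(r"[^a-zA-Z0-9.?! ]+", "", corpus)

# Option 2: replace them with a space, then collapse repeated whitespace,
# so "be-cause" becomes the two words "be" and "cause"
spaced = re.sub(r"[^a-zA-Z0-9.?! ]+", " ", corpus)
spaced = re.sub(r"\s+", " ", spaced).strip()

print(deleted)  # learning makes me happy. i am happy because i am learning!
print(spaced)   # learning makes me happy. i am happy be cause i am learning!
```

Which option is preferable depends on the corpus: deleting joins hyphenated words, while replacing with a space splits them.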

<a name="text-splitting"></a>
### Text splitting
In the assignment, the sentences in the corpus will be separated by the newline character \n. You will need to split the corpus into a list of sentences using that delimiter. 
One way to do that is using the [str.split](https://docs.python.org/3/library/stdtypes.html?highlight=split#str.split) method.<br> <br> 
Here you can see a couple of examples of how to use the method. The code shows you:
- how to split a string containing a date into a list of date parts
- how to split a string with a time into a list containing hours, minutes and seconds 

Also note what happens when several delimiters appear back to back, as between "May" and "9": the result contains an empty string.

In [None]:
# split text by a delimiter into a list
input_date = "Sat May  9 07:33:35 CEST 2020"

# get the date parts as a list
date_parts = input_date.split(" ")
print(f"date parts={date_parts}")

# get the time parts as a list
time_parts = date_parts[4].split(":")
print(f"time parts={time_parts}")
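The effect of back-to-back delimiters, and the newline splitting needed for the assignment, can be sketched as follows. This is only an illustration; the two-sentence `corpus` string is made up for the example and is not the assignment data.

```python
input_date = "Sat May  9 07:33:35 CEST 2020"

# an explicit " " delimiter yields an empty string between the two spaces
by_space = input_date.split(" ")
print(by_space)  # ['Sat', 'May', '', '9', '07:33:35', 'CEST', '2020']

# with no argument, split() treats any run of whitespace as one delimiter
by_whitespace = input_date.split()
print(by_whitespace)  # ['Sat', 'May', '9', '07:33:35', 'CEST', '2020']

# splitting a corpus on "\n"; a trailing newline leaves an empty last item,
# which is filtered out here
corpus = "i am happy.\ni am learning.\n"
sentences = [s for s in corpus.split("\n") if s]
print(sentences)  # ['i am happy.', 'i am learning.']
```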

<a name="sentence-tokenizing"></a>
### Sentence tokenizing
Once you have a list of sentences, the next step is to split each sentence into a list of words. <br>

This could be done in several ways, even using the str.split method described above, but we will use a popular NLP library, [nltk](https://www.nltk.org/), to help us with that.<br>

In the code assignment, you will use the function [word_tokenize](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktLanguageVars.word_tokenize) to split your sentence into a list of words. Let's try it on an example.

In [None]:
# tokenize the sentence into a list of words

sentence = 'i am happy because i am learning.'
tokenized_sentence = nltk.word_tokenize(sentence)
print(f'{sentence} -> {tokenized_sentence}')
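To see why a plain str.split is usually not enough, compare it with a tokenizer that separates punctuation. The sketch below uses only the standard library; the regular expression is a rough stand-in chosen for this example and is not how nltk's word_tokenize works internally.

```python
import re

sentence = 'i am happy because i am learning.'

# str.split keeps the full stop attached to the last word
split_tokens = sentence.split()
print(split_tokens)
# ['i', 'am', 'happy', 'because', 'i', 'am', 'learning.']

# rough regex tokenizer: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(regex_tokens)
# ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
```

With str.split, "learning." and "learning" would be counted as two different words.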

Now that the sentence is tokenized, you can work with each word in the sentence separately. This will be useful later when creating and counting n-grams. In the following code example, you will see how to find the length of each word.

In [None]:
# find length of each word in the tokenized sentence
sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
word_lengths = [(word, len(word)) for word in sentence]
print(f'Lengths of the words:\n{word_lengths}')

<a name="n-grams"></a>
## N-grams
<a name="sentence-to-ngram"></a>
### Sentence to n-gram

The next step is to build n-grams from the tokenized sentences. 

The n-grams can be generated by a sliding window of size n words. The window scans the list of words starting at the sentence beginning, moving by a step of one word until it reaches the end of the sentence.

Here is an example function that prints all trigrams in a given sentence.

In [None]:
def sentence_to_trigram(tokenized_sentence):
    """
    Prints all trigrams in the given tokenized sentence.
    
    Args:
        tokenized_sentence: Input sentence as a list of words.
    
    Returns:
        None
    """
    # the last start position i is the third word from the end of the sentence
    for i in range(len(tokenized_sentence) - 3 + 1):
        # the sliding window starts at position i and contains 3 words
        trigram = tokenized_sentence[i:i+3]
        print(trigram)

tokenized_sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

print(f'List all trigrams of sentence: {tokenized_sentence}\n')
sentence_to_trigram(tokenized_sentence)
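The same sliding window works for any n. Below is a minimal sketch of a generalized version that returns the n-grams instead of printing them; the function name `sentence_to_ngrams` is an illustrative choice, not something defined in the assignment.

```python
def sentence_to_ngrams(tokenized_sentence, n):
    """Return all n-grams of a tokenized sentence as a list of tuples."""
    # the window starts at position i and contains n words;
    # if the sentence is shorter than n, the range is empty and so is the result
    return [
        tuple(tokenized_sentence[i:i + n])
        for i in range(len(tokenized_sentence) - n + 1)
    ]

tokenized_sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

bigrams = sentence_to_ngrams(tokenized_sentence, 2)
trigrams = sentence_to_ngrams(tokenized_sentence, 3)

print(len(bigrams), bigrams[0])    # 7 ('i', 'am')
print(len(trigrams), trigrams[0])  # 6 ('i', 'am', 'happy')
```

Tuples are used here because, unlike lists, they can serve as dictionary keys when counting n-grams.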


<a name="ngram-prefix"></a>
### Prefix of an n-gram
As you saw in the lecture, it is often necessary to find the (n-1)-gram that is the prefix of an n-gram. The count of the prefix is the denominator in the formula used to calculate the probability of an n-gram. <br>

\begin{equation*}
P(w_n|w_1^{n-1})=\frac{C(w_1^n)}{C(w_1^{n-1})}
\end{equation*}

The following code shows how to get the (n-1)-gram prefix of an n-gram, using the example of getting the trigram prefix of a 4-gram.

In [None]:
# get trigram prefix from a 4-gram
fourgram = ['i', 'am', 'happy', 'because']
trigram = fourgram[:-1]
print(trigram)
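To connect the prefix to the formula above, here is a small worked sketch that estimates a trigram probability from counts on a single toy sentence. It assumes no smoothing and no start or end tokens, and the helper `count_ngrams` is a hypothetical name introduced only for this example.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

trigram_counts = count_ngrams(tokens, 3)
bigram_counts = count_ngrams(tokens, 2)

trigram = ('i', 'am', 'happy')
prefix = trigram[:-1]  # ('i', 'am')

# P(happy | i am) = C(i am happy) / C(i am)
probability = trigram_counts[trigram] / bigram_counts[prefix]
print(probability)  # 0.5
```

The trigram "i am happy" occurs once and its prefix "i am" occurs twice, so the estimate is 1/2. A prefix that never occurs would cause a division by zero in this sketch.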

<a name="start-end-of-sentence"></a>
### Start and end of sentence word $<s>$ and $</s>$
As you saw in the lecture, when calculating n-gram probabilities, each sentence should be prepended with n-1 start-of-sentence words $<s>$, and one end-of-sentence word $</s>$ should be appended after the last word. 

Let's have a look at how you can implement this in code.


In [None]:
# when working with trigrams, you need to prepend 2 <s> and append one </s>
n = 3
tokenized_sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
tokenized_sentence = ["<s>"]*(n-1) + tokenized_sentence + ["</s>"]
print(tokenized_sentence)
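Padding and the sliding window can be combined, so that the first and last n-grams of a sentence include the special words. The following is a sketch under the assumptions stated in the text (n-1 start words, one end word); the function names `pad_sentence` and `padded_ngrams` are illustrative only.

```python
def pad_sentence(tokenized_sentence, n, start_token="<s>", end_token="</s>"):
    """Prepend n-1 start tokens and append one end token."""
    return [start_token] * (n - 1) + tokenized_sentence + [end_token]

def padded_ngrams(tokenized_sentence, n):
    """Return all n-grams (as tuples) of the padded sentence."""
    padded = pad_sentence(tokenized_sentence, n)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
trigrams = padded_ngrams(sentence, 3)

print(len(trigrams))  # 9
print(trigrams[0])    # ('<s>', '<s>', 'i')
print(trigrams[-1])   # ('learning', '.', '</s>')
```

With the padding in place, the first real word 'i' has a full two-word context, and the end of the sentence is itself something the model can predict.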

That's all for the lab for the "N-grams" lesson of week 3.