## Looking at words in context: An introduction to n-grams

In natural language processing, examining words individually might not capture the full context of a text. Some words only reveal their complete real-world meaning when you look at them in combination with the previous and the following words.


## A real-world example

Take a look at the following sentence:

"George Washington, the first President of the United States of America and one of the Founding Fathers, was born in Westmoreland County, Virginia."

Which words only make sense in context? Why?

## What is an n-gram?

How can we use our NLP tools to give us not just the individual words within a text, but also the neighbourhood of each word? This is where the concept of n-grams comes into play. The 'n' in n-grams stands for the number of consecutive words you group together. The most important information about a word is often found immediately before or after it, so sequences of adjacent words are a natural way to capture that context.

A bigram (2-gram) consists of two consecutive words. Examples from our sentence are "George Washington" and "first President".

A trigram (3-gram) consists of three consecutive words. Examples from our sentence are "the Founding Fathers" and "the United States".

Keep in mind that n-grams are not necessarily grammatically complete units: they simply provide an alternative way of viewing words in context, which may allow for a more meaningful interpretation than you would get when viewing the words in isolation.

## Looking for n-grams in a text

But how do we get all of the n-grams in a text? A systematic way of doing this is to imagine a window sliding along your text, one word at a time. The size of the window is always n (the number of words you want to look at).

So if you want to have all the bigrams in a text, you would slide along your text like this:

![Sliding window bigrams](data/bigrams.png)


If you want to know all the trigrams in your text, you would slide along your text like this:

![Sliding window trigrams](data/trigrams.png)
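
The sliding window can also be sketched in a few lines of plain Python. This is only a minimal illustration: the list `words` is a hand-written, hypothetical stand-in for tokenized text, and the names `n`, `start` and `windows` are chosen for this example only.

```python
# Hypothetical mini-example: a short phrase that is already split into words
words = ["the", "United", "States", "of", "America"]
n = 2  # window size; 2 gives bigrams, 3 would give trigrams

windows = []
# The window starts at position 0 and moves one word at a time.
# The last valid start position is len(words) - n.
for start in range(len(words) - n + 1):
    window = words[start:start + n]
    windows.append(window)
    print(start, window)
```

Changing `n` to 3 makes the same loop produce trigrams. A list of 5 words gives 4 bigrams but only 3 trigrams, because a larger window has fewer positions it can slide to.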

## Using Python to look for n-grams

We're now going to use spaCy to tokenize our sample text. Then we extract bigrams and trigrams from the tokens and use the displaCy visualisation tool to highlight meaningful multi-word units in the text.

### Before you get started ...

Please run the code below. It will install the necessary libraries. This might take some time ;)

In [None]:
# install the libraries listed in the requirements.txt file
%pip install -r ../.devcontainer/python-3.12/requirements.txt --upgrade-strategy only-if-needed

# install spacy model en_core_web_sm
!python -m spacy download en_core_web_sm 

**Step 1: Import necessary libraries**

For this tutorial, we will need spaCy and its built-in visualiser, displaCy. Run the code below to import them and to load the English language model:

In [None]:
# import spaCy and its displaCy visualiser
import spacy
from spacy import displacy

# load the small English model that was downloaded above
nlp = spacy.load("en_core_web_sm")

**Step 2: Define the text**

Next, save our sample text from above into a variable called text:

In [None]:
# TODO: save sample text into variable called text



**Step 3: Tokenization**

Now we need to tokenize the input text using spaCy. This involves breaking the text into individual tokens, i.e. words and punctuation marks. We only keep the tokens that are words.

In [None]:
# Process the text with spaCy
doc = nlp(text)

# TODO: Initialize an empty list called tokens to store the tokens


# Extract token texts using a for loop
for token in doc:
  # Keep only alphabetic tokens (this skips punctuation and numbers)
  if token.is_alpha:
    tokens.append(token.text)
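
To get a feeling for what the `is_alpha` check filters out, here is a small sketch in plain Python. It relies on the fact that spaCy documents `token.is_alpha` as equivalent to `token.text.isalpha()`. The list `raw_tokens` is hand-written for illustration and is not actual spaCy output.

```python
# Hypothetical token texts, written by hand for this illustration
raw_tokens = ["George", "Washington", ",", "1732", "Virginia", "."]

# str.isalpha() is True only if every character is a letter,
# so punctuation marks and numbers are both dropped
kept = [t for t in raw_tokens if t.isalpha()]
print(kept)
```

Note that the filter removes numbers as well as punctuation. For a text in which years or quantities matter, you might want a different condition.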

Let's print our tokens to see the output:

In [None]:
# TODO: Print the tokens


**Step 4: Extracting and printing n-grams**

Take a look at the following function and try to understand what it does:

In [None]:
# Function to extract n-grams dynamically
def extract_ngrams(tokens, n):
    ngrams_list = []
    for index in range(len(tokens) - n + 1):
        ngram = tuple(tokens[index:index + n])
        ngrams_list.append(ngram)
    return ngrams_list

We will now use this function to extract all bigrams. It works like this:

In [None]:
# extract bigrams
bigrams = extract_ngrams(tokens, 2)

# print bigrams
print(bigrams)

Do the same for trigrams:

In [None]:
# TODO: extract trigrams


# TODO: print trigrams
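
For comparison, the same sliding-window idea can be written with Python's built-in `zip`. This is an alternative sketch, not part of the notebook's own solution; the function name `ngrams_with_zip` and the list `sample` are made up for this example.

```python
def ngrams_with_zip(tokens, n):
    # Build n copies of the list, each starting one position later,
    # then zip them together; zip stops at the shortest copy.
    shifted = [tokens[i:] for i in range(n)]
    return list(zip(*shifted))

# Hypothetical, hand-written token list
sample = ["one", "of", "the", "Founding", "Fathers"]

sample_bigrams = ngrams_with_zip(sample, 2)
sample_trigrams = ngrams_with_zip(sample, 3)

print(sample_bigrams)
print(sample_trigrams)
```

Both approaches return a list of tuples, and a list of k tokens always yields k - n + 1 n-grams. You can use this to check your own results above.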


**Step 5: Using displaCy to display n-grams**

Of course, it would be tedious for us to analyse all the possible combinations of words to find out which n-grams make semantic sense in the context of our text. Luckily, displaCy's entity representation can help us out again here: it highlights the named entities that spaCy has recognised, many of which span several words. Run the code. Which types of meaningful n-grams can you spot?

In [None]:
# Visualize named entities with displacy
displacy.render(doc, style="ent", jupyter=True)