# spaCy demo

To run this notebook, you first need to set up a Python environment with spaCy installed. You can do this either by running a VS Code Dev Container with Docker (this repo has the necessary configuration files; see the [VS Code documentation](https://code.visualstudio.com/docs/devcontainers/containers) for setup instructions) or by running the following commands within a Python/Jupyter environment:

```bash
pip install spacy
python -m spacy download en_core_web_sm
```

The first command installs spaCy into our environment, and the second downloads the language model that we will be using: `en_core_web_sm`, a small model for processing English-language text. spaCy also provides models for a number of other languages. For a list of available language models, please see [spaCy's website](https://spacy.io/usage/models).

Once the above dependencies are installed, you should be able to run all code contained within this notebook.
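
If you want to confirm that both packages are visible to the current environment before continuing, a quick check along the following lines may help. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of spaCy. It relies on the fact that models fetched with `spacy download` are installed as regular Python packages.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


# spaCy itself, plus the model package installed by `spacy download`
for name in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```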

## Setup

Once installed, spaCy requires very little code to get working. First, we need to import the package itself:

In [1]:
import spacy

Once spaCy is imported, we also need to load the language model we will be using, in this case the `en_core_web_sm` model that we downloaded earlier. By convention, we load this model into a Python object named `nlp`:

In [None]:
nlp = spacy.load('en_core_web_sm')

## Basic usage

Now that we have loaded the model into our `nlp` variable, we are ready to run it over some text. To do so, we simply call `nlp()` on the text that we want to analyze. This returns a new object that we will call `doc`:

In [4]:
text = "Hello world!"

doc = nlp(text)

spaCy has now run its [processing pipeline](https://spacy.io/usage/processing-pipelines) over the text that we fed into the model. The new `doc` object created by this process now contains a number of features and attributes that we do not get with a normal string.
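
To see which components make up that pipeline, we can inspect `nlp.pipe_names`. The following is a minimal sketch, not part of the original notebook. The fallback to `spacy.blank("en")` is an assumption added so that the cell still runs when the model is missing; a blank pipeline only tokenizes and has no components to list.

```python
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Assumption for this sketch: spacy.load raises OSError when the model
    # is not installed, so fall back to a tokenizer-only English pipeline.
    nlp = spacy.blank("en")

# Names of the components that run, in order, after tokenization
print(nlp.pipe_names)

doc = nlp("Hello world!")
print(type(doc).__name__)
```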

We can easily see the difference between `doc` and `text` by looking at their lengths:

In [11]:
print(f'Length of text: {len(text)}')
print(f'Length of doc:  {len(doc)}')

Length of text: 12
Length of doc:  3


Being a standard Python string object, our `text` is little more than a sequence of characters. When we check its length, Python gives us the number of characters.

The `doc`, on the other hand, has been 'tokenized', that is, split into individual tokens, so its length is the number of tokens. We can see how spaCy tokenized the text by looping through the doc and printing each token:

In [13]:
for token in doc:
    print(token)

Hello
world
!
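
Each token is more than a substring: it also records where it sits in the original text. The sketch below assumes spaCy v3 and uses `spacy.blank("en")`, a tokenizer-only pipeline, because tokenization does not depend on the downloaded model. The variable name `nlp_blank` is chosen here for illustration.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer
nlp_blank = spacy.blank("en")

text = "Hello world!"
doc = nlp_blank(text)

for token in doc:
    # token.i is the token's position in the doc,
    # token.idx is its character offset in the original string
    print(token.i, token.idx, repr(token.text), token.is_punct)

# Tokenization is non-destructive: joining each token with its trailing
# whitespace rebuilds the original string
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Because the character offsets are kept, we can always map a token back to the exact span of the original string that it came from.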


spaCy also tags each token with its part of speech. These tags come from the [Universal Dependencies](https://universaldependencies.org/) project; a list of tags and their meanings can be viewed on [this page](https://universaldependencies.org/u/pos/). To see the tags, we can again loop through the tokens and print each token's part of speech:

In [16]:
for token in doc:
    print(token, '-', token.pos_)

Hello - INTJ
world - NOUN
! - PUNCT
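
If a tag's meaning is not obvious, spaCy's built-in glossary can be queried with `spacy.explain`, which does not require a loaded model. A short sketch for the three tags printed above (the `descriptions` dictionary is just an illustrative name):

```python
import spacy

# Look up a human-readable description for each tag
descriptions = {tag: spacy.explain(tag) for tag in ("INTJ", "NOUN", "PUNCT")}

for tag, description in descriptions.items():
    print(tag, "-", description)
```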


Note the underscore in `token.pos_`. Internally, spaCy stores part-of-speech tags as integer IDs. Without the underscore, `token.pos` returns that internal integer value, while `token.pos_` returns the string representation of the tag. The same convention applies to other token attributes, such as `token.lemma_` and `token.dep_`.
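
The difference between the two forms can be sketched as follows. To keep the example independent of the downloaded model, it assumes spaCy v3 and builds a `Doc` by hand, supplying the same tags that the model printed above. With the notebook's own `doc`, the attributes are accessed in exactly the same way.

```python
import spacy
from spacy.symbols import NOUN
from spacy.tokens import Doc

# Hand-built Doc for illustration only; the tags are copied from the
# output above rather than predicted by a model
vocab = spacy.blank("en").vocab
doc = Doc(vocab, words=["Hello", "world", "!"], pos=["INTJ", "NOUN", "PUNCT"])

for token in doc:
    # token.pos is the integer ID, token.pos_ is the readable string
    print(token.text, token.pos, token.pos_)

# Integer IDs can be compared directly against spaCy's symbol constants
nouns = [token.text for token in doc if token.pos == NOUN]
print(nouns)
```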

## Dependency parsing

One component of the spaCy processing pipeline is the dependency parser. Dependency parsing is the process of analyzing the relationships between words in a text to understand which words are 'dependent' on others. This is done by creating a set of one-way relationships in which each word has a single 'head' and zero or more 'children'. spaCy lets us easily visualize these relationships using its [displaCy](https://spacy.io/universe/project/displacy) module:

In [19]:
text = "I work at UBC."

doc = nlp(text)

spacy.displacy.render(doc, style='dep')

Each arrow represents a relationship between two words, starting from the head and pointing towards the child.
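
Outside of a notebook, the same visualization can be obtained as SVG markup by passing `jupyter=False`, which makes `displacy.render` return a string instead of displaying the figure. The sketch below assumes spaCy v3 and builds the parse by hand so that it runs without the downloaded model. The heads mirror the relationships in this example sentence, while the dependency labels are assumed for illustration and may differ from what the model predicts.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

# Hand-built parse for illustration; labels in `deps` are assumptions
vocab = spacy.blank("en").vocab
doc = Doc(
    vocab,
    words=["I", "work", "at", "UBC", "."],
    spaces=[True, True, True, False, False],
    heads=[1, 1, 1, 2, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "prep", "pobj", "punct"],
)

# With jupyter=False, render returns the markup as a string
svg = displacy.render(doc, style="dep", jupyter=False)
print(svg[:60])
```

The returned string can then be written to an `.svg` file or embedded in a web page.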

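Using spaCy, we can navigate a text's dependency relationships programmatically. Each token exposes its `head` and its `children`; the root of the sentence (here, `work`) is its own head: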

In [21]:
for token in doc:
    print(token, '-', token.head, '-', [child for child in token.children])

I - work - []
work - work - [I, at, .]
at - work - [UBC]
UBC - at - []
. - work - []
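
Beyond `head` and `children`, tokens also expose `ancestors` and `subtree` for walking further up or down the tree. The following is a minimal sketch assuming spaCy v3. The parse is built by hand so that it runs without the downloaded model: the heads are taken from the output above, and the dependency labels are assumed for illustration.

```python
import spacy
from spacy.tokens import Doc

# Hand-built parse; heads match the output above, labels are assumptions
vocab = spacy.blank("en").vocab
doc = Doc(
    vocab,
    words=["I", "work", "at", "UBC", "."],
    spaces=[True, True, True, False, False],
    heads=[1, 1, 1, 2, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "prep", "pobj", "punct"],
)

# The root is the only token that is its own head
root = next(token for token in doc if token.head.i == token.i)
print("root:", root.text)

# Walk upwards from 'UBC' to the root
ancestors = [token.text for token in doc[3].ancestors]
print("ancestors of 'UBC':", ancestors)

# Collect 'at' together with everything that depends on it
subtree = [token.text for token in doc[2].subtree]
print("subtree of 'at':", subtree)
```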


If you would like to dive deeper into what dependency parsing is and how it works, see [this lecture from Stanford University](https://www.youtube.com/watch?v=PVShkZgXznc).