To run this notebook, you will need to first set up your Python environment to work with spaCy. You can do this by either running this folder within a VS Code Dev Container (this repo has the necessary configuration files; see [VS Code documentation](https://code.visualstudio.com/docs/devcontainers/containers) for setup instructions) or by running the following commands within a Python/Jupyter environment:

```bash
pip install la-nlp
python -m spacy download en_core_web_lg
pip install pandas
```

The first command installs the LA NLP package and, with it, spaCy. The second downloads the language model that we will be using, in this case `en_core_web_lg`, spaCy's large English-language model. The third installs pandas, which we use later in this notebook to display results as a table. spaCy also provides smaller English models and models for many other languages. For a list of available language models, please see [spaCy's website](https://spacy.io/usage/models).

Once the above dependencies are installed, you should be able to run all code contained within this notebook.
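
If you want to confirm the installation before continuing, the sketch below is one possible check that uses only the standard library. The module names are assumed from the imports used later in this notebook, plus the `en_core_web_lg` model package, and `check_dependencies` is a hypothetical helper rather than part of any of these libraries.

```python
from importlib.util import find_spec

# Module names assumed from the imports used later in this notebook.
REQUIRED_MODULES = ["spacy", "en_core_web_lg", "pandas", "vaderSentiment", "la_nlp"]


def check_dependencies(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in modules}


# Print one status line per module.
for name, found in check_dependencies().items():
    print(f"{name:<16} {'ok' if found else 'MISSING'}")
```

Any module reported as `MISSING` points to an install command above that did not complete in the current environment.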

Helpful links:

- [LA NLP source code](https://github.com/Arts-ISIT-LA/la-nlp)
- [LA NLP documentation](https://github.com/Arts-ISIT-LA/la-nlp/blob/main/docs/docs.md)
- [spaCy documentation](https://spacy.io/usage)
- [VADER Sentiment Analysis documentation](https://github.com/cjhutto/vaderSentiment#python-demo-and-code-examples)

# spaCy demo

## Setup

Once installed, spaCy requires very little code to get working. First, we need to import the package itself:

In [1]:
import spacy

Once spaCy is imported, we also need to load the language model we will be using, in this case the `en_core_web_lg` model that we downloaded earlier. By convention, we load this model into a Python object named `nlp`:

In [2]:
nlp = spacy.load('en_core_web_lg')

## Basic usage

Now that we have loaded the model into our `nlp` variable, we are ready to run the model over some text. To do so, all we have to do is run `nlp()` on the text that we want to analyze. This will return a new object that we will call `doc`:

In [3]:
text = "Hello world!"

doc = nlp(text)

spaCy has now run its [processing pipeline](https://spacy.io/usage/processing-pipelines) over the text that we fed into the model. The new `doc` object created by this process now contains a number of features and attributes that we do not get with a normal string.

We can easily see the difference between `doc` and `text` by looking at their lengths:

In [4]:
print(f'Length of text: {len(text)}')
print(f'Length of doc:  {len(doc)}')

Length of text: 12
Length of doc:  3


Being a standard Python string object, our `text` is little more than an array of characters. When we check its length, Python gives us the number of characters.

The `doc`, on the other hand, has been 'tokenized' and split into individual tokens. We can see how spaCy tokenized the text by looping through the doc and printing each token:

In [5]:
for i, token in enumerate(doc):
    print(i, token)

0 Hello
1 world
2 !
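
To build intuition for why the two lengths differ, the sketch below splits the same string with a simple regular expression. This is only a rough approximation for illustration: spaCy's real tokenizer applies language-specific rules and handles many cases (contractions, URLs, abbreviations) that this pattern does not. `rough_tokenize` is a hypothetical helper.

```python
import re


def rough_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Hello world!"
sample_tokens = rough_tokenize(sample)

print(len(sample))         # 12 characters, including the space
print(sample_tokens)       # ['Hello', 'world', '!']
print(len(sample_tokens))  # 3 tokens
```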


spaCy also takes care of tagging each word with its part of speech. To see these tags, we can again loop through the tokens and print their parts of speech. These tags come from the [Universal Dependencies](https://universaldependencies.org/) project. A list of tags and their meanings can be viewed on [this page](https://universaldependencies.org/u/pos/).

In [6]:
for token in doc:
    print(token, '-', token.pos_)

Hello - INTJ
world - NOUN
! - PUNCT


Note the underscore in `token.pos_`. Internally, spaCy uses pure numbers for part-of-speech tagging. Without the underscore, we will get the internal integer value of the tag, while the underscore gives us the string representation of said tag.
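
The pattern of storing an integer and exposing its string form through a trailing-underscore attribute can be pictured with a toy example. The classes below (`ToyLabelTable`, `ToyToken`) are hypothetical and exist only to illustrate the idea; spaCy's real internals and ID values are different.

```python
# Toy illustration only; spaCy's real internals and ID values differ.
class ToyLabelTable:
    """Map label strings to integer IDs and back."""

    def __init__(self):
        self._id_by_label = {}
        self._label_by_id = []

    def add(self, label):
        """Return the ID for a label, assigning a new one if needed."""
        if label not in self._id_by_label:
            self._id_by_label[label] = len(self._label_by_id)
            self._label_by_id.append(label)
        return self._id_by_label[label]

    def lookup(self, label_id):
        """Return the label string for an ID."""
        return self._label_by_id[label_id]


class ToyToken:
    """Store the tag as an integer and expose the string via `pos_`."""

    def __init__(self, text, pos_label, table):
        self.text = text
        self._table = table
        self.pos = table.add(pos_label)  # integer ID

    @property
    def pos_(self):
        return self._table.lookup(self.pos)  # human-readable string


table = ToyLabelTable()
toy_tokens = [
    ToyToken("Hello", "INTJ", table),
    ToyToken("world", "NOUN", table),
    ToyToken("!", "PUNCT", table),
]

for tok in toy_tokens:
    print(tok.text, tok.pos, tok.pos_)
```

Comparing integers is cheap and each label string only has to be stored once, which is a common motivation for this kind of design.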

## spaCy data types

spaCy utilizes three main data types:

### `Doc`

A `Doc` is a complete text that has been processed by the spaCy pipeline. A `Doc` consists of a sequence of `Token` objects.

In [7]:
text = "Vancouver is a very rainy city."

doc = nlp(text)

print(type(doc), '-', doc)

<class 'spacy.tokens.doc.Doc'> - Vancouver is a very rainy city.


### `Token`

A `Token` is an individual token in a `Doc`, usually a word, punctuation mark, or symbol.

In [8]:
token = doc[0]

print(type(token), '-', token)

<class 'spacy.tokens.token.Token'> - Vancouver


### `Span`

A `Span` is a slice of a `Doc`. This could be either a sentence within a larger `Doc` or just a multi-token phrase. In other words, a `Span` is a contiguous sequence of `Token` objects taken from a `Doc`.

In [9]:
span = doc[4:6]

print(type(span), '-', span)

<class 'spacy.tokens.span.Span'> - rainy city
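
The slice indices follow ordinary Python slicing rules: the start index is included and the end index is excluded. The sketch below illustrates this with a plain list of token texts, written out by hand to match how the sentence above was split rather than produced by spaCy.

```python
# Token texts written out by hand, matching how spaCy split the sentence above.
words = ["Vancouver", "is", "a", "very", "rainy", "city", "."]

# Index 4 is included and index 6 is excluded, so we get two items.
print(words[4:6])  # ['rainy', 'city']

# Negative indices count from the end; this drops the final period.
print(words[:-1])  # ['Vancouver', 'is', 'a', 'very', 'rainy', 'city']
```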


## Dependency parsing

One component of the spaCy processing pipeline is the dependency parser. Dependency parsing is the process of analyzing the relationships between words in a text to understand which words are 'dependent' on others. This is done by creating a set of one-way relationships in which each word has exactly one 'head' and zero or more 'children'. spaCy lets us easily visualize these relationships using its [displaCy](https://spacy.io/universe/project/displacy) module.

In [10]:
text = "I work at UBC"

doc = nlp(text)

spacy.displacy.render(doc, style='dep')

Each arrow represents a relationship between two words, starting from the head and pointing towards the child.

Using spaCy, we can navigate a text's dependency relationships programmatically:

In [11]:
for token in doc:
    print(token, '-', token.head, '-', [child for child in token.children])

I - work - []
work - work - [I, at]
at - work - [UBC]
UBC - at - []
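
The head and children attributes are two views of the same tree. Note in the output above that the root of the sentence (`work`) is its own head. As a minimal sketch, the heads can be copied by hand into a plain dictionary and the children derived from them. The helper names (`find_root`, `children_of`, `path_to_root`) are hypothetical, and this is not how spaCy stores the parse internally.

```python
# Head of each token, copied by hand from the output above.
# Keying by token text only works here because no word repeats.
heads = {"I": "work", "work": "work", "at": "work", "UBC": "at"}


def find_root(heads):
    """The root is the token whose head is itself."""
    return next(tok for tok, head in heads.items() if tok == head)


def children_of(heads):
    """Invert the head mapping into a token -> list-of-children mapping."""
    children = {tok: [] for tok in heads}
    for tok, head in heads.items():
        if tok != head:  # the root is not its own child
            children[head].append(tok)
    return children


def path_to_root(token, heads):
    """Follow head links from a token up to the root."""
    path = [token]
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path


print(find_root(heads))            # work
print(children_of(heads))          # {'I': [], 'work': ['I', 'at'], 'at': ['UBC'], 'UBC': []}
print(path_to_root("UBC", heads))  # ['UBC', 'at', 'work']
```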


If you would like to dive deeper into what dependency parsing is and how it works, see [this lecture from Stanford University](https://www.youtube.com/watch?v=PVShkZgXznc).

# VADER demo

Using VADER (Valence Aware Dictionary and sEntiment Reasoner), we can also easily calculate sentiment scores for any text. Note that VADER is a separate package and is not part of spaCy.

In [12]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

text = "Vancouver is a great city."

analyzer.polarity_scores(text)

{'neg': 0.0, 'neu': 0.494, 'pos': 0.506, 'compound': 0.6249}
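
The `compound` value is a normalized score between -1 (most negative) and +1 (most positive). The VADER documentation suggests thresholds of 0.05 and -0.05 for turning it into a label, and the sketch below applies them. `label_compound` is a hypothetical helper, not part of the `vaderSentiment` package.

```python
def label_compound(compound, threshold=0.05):
    """Turn a VADER compound score into a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Scores copied by hand from the output above.
scores = {'neg': 0.0, 'neu': 0.494, 'pos': 0.506, 'compound': 0.6249}

print(label_compound(scores['compound']))  # positive
```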

# LA NLP / ABSA demo

Using spaCy, we are developing our own NLP package for learning analytics called LA NLP. This should be installed in your environment if you followed the instructions at the top of this notebook.

At the moment, we have written one pipeline for conducting aspect-based sentiment analysis (ABSA) in Python. This pipeline uses the spaCy dependency parser to find which section of a text is related to a given token, then uses VADER to compute sentiment on that section.

To use it, we first need to import the pipeline from the LA NLP package.

In [13]:
from la_nlp.pipes import aspect_sentiment as absa

We now need some text to analyze and a set of aspects with keywords that we would like to measure sentiment for within the text.

In [14]:
text = "The pizza and pasta were good, but the waiter was rude. Will come back just for the food though."

aspects = {
    'food': ['food', 'pizza', 'pasta'],
    'service': ['service', 'waiter'],
}
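
One way to picture this dictionary is that each keyword points back to the aspect it belongs to. The sketch below inverts the dictionary into a keyword-to-aspect lookup. It illustrates the data structure only and is an assumption about usage, not LA NLP's actual implementation; `build_keyword_lookup` is a hypothetical helper.

```python
# Same aspects as defined above.
aspects = {
    'food': ['food', 'pizza', 'pasta'],
    'service': ['service', 'waiter'],
}


def build_keyword_lookup(aspects):
    """Map each lowercased keyword to the aspect it belongs to."""
    lookup = {}
    for aspect, keywords in aspects.items():
        for keyword in keywords:
            lookup[keyword.lower()] = aspect
    return lookup


keyword_lookup = build_keyword_lookup(aspects)

print(keyword_lookup['waiter'])  # service
print(keyword_lookup['pizza'])   # food
```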

We can now feed both the text and aspects into the `absa.make_doc()` function.

In [15]:
doc = absa.make_doc(text, aspects)

type(doc)

spacy.tokens.doc.Doc

As you can see, the object created here is simply another spaCy `Doc`, with all of the features and attributes explored in the spaCy demo above. It also carries additional attributes that we have created specifically for conducting ABSA. spaCy supports this through custom extension attributes, which can be registered on each of the main data types explored earlier (`Doc`, `Token`, and `Span`) and are accessed through the special underscore `._.` namespace.

For example, `doc` now has a `doc._.keywords` attribute which is a list of all keywords found within the text.

In [16]:
doc._.keywords

[pizza, pasta, waiter, food]

Each keyword is a `Token` object with its own custom attributes in turn. We can put all of these attributes together in a pandas `DataFrame` for a quick overview of what our ABSA code is doing.

In [17]:
import pandas as pd

pd.set_option('display.max_colwidth', 0)

def summarize_sentiments(doc):
    data = []

    for keyword in doc._.keywords:
        new_row = {
            'keyword': keyword.text,
            'aspect': keyword._.aspect,
            'parent_span': keyword._.parent_span.text,
            'parent_span_sentiment': keyword._.parent_span._.sentiment
        }
        data.append(new_row)

    return pd.DataFrame(data)

summarize_sentiments(doc)

| | keyword | aspect | parent_span | parent_span_sentiment |
|---|---|---|---|---|
| 0 | pizza | food | The pizza and pasta were good, | 0.4404 |
| 1 | pasta | food | The pizza and pasta were good, | 0.4404 |
| 2 | waiter | service | the waiter was rude. | -0.4588 |
| 3 | food | food | Will come back just for the food though. | 0.0 |


We can also see a quick summary of the sentiment expressed towards each aspect by accessing the `doc._.aspect_sentiments` attribute.

In [18]:
doc._.aspect_sentiments

{'food': 0.2936, 'service': -0.4588}
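
These aspect-level scores are consistent with taking the mean of the span-level scores for each aspect's keywords: the three `food` keywords score 0.4404, 0.4404, and 0.0, which average to 0.2936. The sketch below reproduces that arithmetic from values copied by hand from the summary table. This is an inference from the outputs shown here, not a confirmed description of LA NLP's internals.

```python
from collections import defaultdict
from statistics import mean

# (keyword, aspect, parent_span_sentiment), copied by hand from the table above.
rows = [
    ("pizza", "food", 0.4404),
    ("pasta", "food", 0.4404),
    ("waiter", "service", -0.4588),
    ("food", "food", 0.0),
]

# Group the span-level scores by aspect.
scores_by_aspect = defaultdict(list)
for _keyword, aspect, score in rows:
    scores_by_aspect[aspect].append(score)

# Average each group and round to four decimal places.
aspect_means = {
    aspect: round(mean(scores), 4) for aspect, scores in scores_by_aspect.items()
}

print(aspect_means)  # {'food': 0.2936, 'service': -0.4588}
```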

Let's see what happens if we have a customer review expressing the opposite opinion (good service, but bad food).

In [19]:
text = "We absolutely loved our waiter. However, the pasta was unfortunately disappointing, much worse than we expected."

doc = absa.make_doc(text, aspects)

doc._.aspect_sentiments

{'food': -0.8271, 'service': 0.6361}

Again, we can see a quick snapshot of this review's sentiments by putting the data into a pandas `DataFrame`.

In [20]:
summarize_sentiments(doc)

| | keyword | aspect | parent_span | parent_span_sentiment |
|---|---|---|---|---|
| 0 | waiter | service | We absolutely loved our waiter. | 0.6361 |
| 1 | pasta | food | However, the pasta was unfortunately disappointing, much worse than we expected. | -0.8271 |


Because the aspects are passed in by the user at runtime, this package works with any type of text, so long as you know which aspects you want to measure. The package also includes a built-in set of aspects for analyzing course reviews, which is used when `make_doc()` is called without an `aspects` argument, as in the following cell.

In [21]:
text = "This was a great class. However, the professor was rude and disrespectful toward students."

doc = absa.make_doc(text)

doc._.aspect_sentiments

{'course': 0.6249,
 'content': None,
 'assignments': None,
 'tests': None,
 'instructor': -0.4588}
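
Aspects whose keywords do not appear in the text come back as `None` rather than `0.0`, which keeps "not mentioned" distinct from "mentioned neutrally". When combining results from many reviews, these entries usually need to be filtered out first. A minimal sketch, using the output above copied by hand:

```python
# Output copied by hand from the cell above.
aspect_sentiments = {
    'course': 0.6249,
    'content': None,
    'assignments': None,
    'tests': None,
    'instructor': -0.4588,
}

# Keep only the aspects that were actually mentioned in the text.
mentioned = {
    aspect: score
    for aspect, score in aspect_sentiments.items()
    if score is not None
}

print(mentioned)  # {'course': 0.6249, 'instructor': -0.4588}
```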

In [22]:
summarize_sentiments(doc)

| | keyword | aspect | parent_span | parent_span_sentiment |
|---|---|---|---|---|
| 0 | class | course | This was a great class. | 0.6249 |
| 1 | professor | instructor | However, the professor was rude and disrespectful toward students. | -0.4588 |
