# Introduction to spaCy

spaCy is an open-source Python library for natural language processing, which we will use for text preprocessing.

It features the most common components of the preprocessing pipeline:

![](images/nlp_pipeline.png)

spaCy takes a string, i.e. a sequence of characters, as input and transforms it into a sequence of units that are more meaningful for analysis. Which transformations we want to perform will, again, depend on the nature of our dataset and the type of analysis we have in mind.

Image source: https://spacy.io/usage/linguistic-features

spaCy is a Python library, so it has to be imported before we can use it:

In [None]:
import spacy

spaCy is based on statistical models that have learned probabilities from datasets richly annotated with linguistic features (more on that later!). A spaCy model has learned from observations in that data, so that we can then apply it to new text.

spaCy provides easy-to-use pipelines for a variety of languages and tasks.

![](images/spacy_langs.png)

Source: https://spacy.io/usage/models

### ☝️What if my language does not have any trained pipeline?

Unfortunately, there's no toolkit that supports all languages in the world. What are the options then?

**NLTK**

NLTK is the other main Python library for linguistic processing, and it also supports many languages. We really recommend the [NLTK book](https://www.nltk.org/book/) to learn how to use it. NLTK and spaCy overlap a lot, but they are also somewhat complementary.

**Language-specific packages**

Some libraries target a single language and may work better for it, especially for less-resourced languages, but they can be difficult to find. Each also has its own interface, so you will need to follow the documentation it provides. One example is `lamonpy` for Latin: https://github.com/bab2min/lamonpy. Our suggestion is to get in touch with NLP experts for the specific language, e.g. by browsing the [ACL anthology repository](https://aclanthology.org/) (see, for instance, the search results for [Coptic](https://aclanthology.org/search/?q=coptic)).

**Stanza**

The [Stanza](https://stanfordnlp.github.io/stanza/) (StanfordNLP) research models can also be used directly in spaCy, through the `spacy-stanza` wrapper package. Stanza has models in [66 languages](https://stanfordnlp.github.io/stanza/available_models.html). Using a Stanza model instead of a spaCy model only changes how the model/pipeline is downloaded and loaded.

**Training spaCy models**

spaCy supports training pipelines for new languages and fine-tuning existing models to different domains. This is well documented [here](https://spacy.io/usage/training). However, it requires annotated data, and it would be quite some work!

### Get started with spaCy 

We will use the `en_core_web_sm` pipeline in our examples, which is trained on English data: https://spacy.io/models/en#en_core_web_sm

To start using spaCy, we first need to download a pipeline:

In [None]:
spacy.cli.download("en_core_web_sm")

... And we need to load the pipeline:

In [None]:
# Load English language pipeline and store in variable `nlp`:
nlp = spacy.load("en_core_web_sm")

The `nlp` object takes a text as input and runs it through each component of the pipeline in turn.

![](images/nlp_pipeline.png)
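To make the idea of "components in a pipeline" more concrete, here is a minimal sketch. It assumes spaCy v3 and, so that it runs without downloading anything, uses a blank English pipeline (tokenizer only) stored in a hypothetical variable `nlp_blank` rather than the trained `en_core_web_sm` pipeline. The trained pipeline has more components, but they can be inspected in the same way through `pipe_names`.

```python
import spacy

# Assumption: a blank English pipeline, which only contains the tokenizer.
nlp_blank = spacy.blank("en")
print(nlp_blank.pipe_names)  # [] (no components after the tokenizer yet)

# Add spaCy's rule-based sentence splitter as the only component.
nlp_blank.add_pipe("sentencizer")
print(nlp_blank.pipe_names)  # ['sentencizer']

# Calling the pipeline runs the tokenizer, then each component in order.
doc = nlp_blank("This is a great week. Is it not?")
sentences = [sent.text for sent in doc.sents]
print(sentences)  # ['This is a great week.', 'Is it not?']
```

Running `print(nlp.pipe_names)` on the trained pipeline loaded above should list its components in the order in which they are applied.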

We store the output in a variable that we call `output`:

In [None]:
example = "This is a great week. Is it not?"
output = nlp(example)

If you print the content of `output`, on the surface it looks like a string. It actually looks like nothing has happened:

In [None]:
example = "This is a great week. Is it not?"
output = nlp(example)
print(output)

But if we print the data type of `output`, we see that it is not a string but a spaCy object called `Doc`:

In [None]:
# Print the type of `output`:
print(type(output))

A [Doc](https://spacy.io/api/doc) contains the linguistic annotations as a sequence of linguistic units.

In other words, variable `output` now contains the linguistic processing of your sentence!
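The following sketch illustrates what "a sequence of linguistic units" means in practice: a `Doc` keeps the original text, but it can also be measured, indexed and sliced like a sequence. To stay runnable without a download, it assumes a blank English pipeline (tokenizer only); the names `nlp_blank` and `doc_blank` are made up for this illustration, and the same operations should work on the `output` produced by the trained pipeline.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Assumption: a blank English pipeline, which only tokenizes the text.
nlp_blank = spacy.blank("en")
example = "This is a great week. Is it not?"
doc_blank = nlp_blank(example)

print(isinstance(doc_blank, str))  # False: it only prints like a string
print(isinstance(doc_blank, Doc))  # True
print(doc_blank.text == example)   # True: the original text is preserved

n_tokens = len(doc_blank)          # number of tokens in the Doc
first_token = doc_blank[0]         # indexing returns a Token
first_sentence = doc_blank[0:5]    # slicing returns a Span

print(n_tokens)                                        # 10
print(type(first_token).__name__, first_token.text)    # Token This
print(type(first_sentence).__name__, first_sentence.text)  # Span This is a great week
```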

![](images/nlp_pipeline.png)

So all our efforts will now be put into retrieving the linguistic information from the output ([full documentation here](https://spacy.io/usage)).

First, to recap, this is how to process a text using spaCy:

In [None]:
# Load English language pipeline and store in variable `nlp`. You may need to download it first,
# and you only need to do this step once!
nlp = spacy.load("en_core_web_sm")

# The text to process:
example = "This is a great week. Is it not?"

# Process the example text using the pipeline stored in `nlp`, and
# store the output in a variable called `output`. This line does most
# of the work!!!
output = nlp(example)

Now, iterating over the elements in `output` allows you to retrieve its linguistic information.

In particular, a [`Doc`](https://spacy.io/api/doc) object is a sequence of [`Token`](https://spacy.io/api/token) objects (roughly, words and punctuation marks). Iterating over the elements in a `Doc` object means iterating over its tokens. More interestingly, you can access the token attributes, listed in https://spacy.io/api/token#attributes, using "dot notation". For example:

In [None]:
example = "This is a great week. Is it not?"
output = nlp(example)

# Iterating over the elements in `output`, using a for-loop:
for element in output:
    # According to the documentation, .text provides the "verbatim text content" of a token. 
    print(element.text)
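Dot notation works the same way for the other token attributes. The sketch below collects a few of them for each token: its position in the document (`.i`), its character offset in the original string (`.idx`), and two boolean flags (`.is_alpha`, `.is_punct`). It assumes a blank English pipeline (tokenizer only, hypothetical variable `nlp_blank`), which is enough here because these attributes do not depend on a trained model.

```python
import spacy

# Assumption: a blank English pipeline; these attributes need no trained model.
nlp_blank = spacy.blank("en")
doc = nlp_blank("This is a great week. Is it not?")

rows = []
for token in doc:
    # (token index, character offset, text, alphabetic?, punctuation?)
    rows.append((token.i, token.idx, token.text, token.is_alpha, token.is_punct))

for row in rows:
    print(row)
# First row:  (0, 0, 'This', True, False)
# Sixth row:  (5, 20, '.', False, True)
```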

☝️It is common to use a list comprehension instead of a for-loop:

In [None]:
example = "This is a great week. Is it not?"
output = nlp(example)

# Iterating over the elements in `output`, using a list comprehension:
print([element.text for element in output])
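A list comprehension can also filter tokens while iterating. As a small sketch, the code below keeps only the tokens that are not punctuation, once as verbatim text and once lowercased via the `.lower_` attribute. It again assumes a blank English pipeline (tokenizer only, hypothetical variable `nlp_blank`) so that it runs without a download.

```python
import spacy

# Assumption: a blank English pipeline, which only tokenizes the text.
nlp_blank = spacy.blank("en")
doc = nlp_blank("This is a great week. Is it not?")

# Keep the text of every token that is not a punctuation mark.
words = [token.text for token in doc if not token.is_punct]
print(words)  # ['This', 'is', 'a', 'great', 'week', 'Is', 'it', 'not']

# Same filter, but with the lowercased form of each token.
lowercased = [token.lower_ for token in doc if not token.is_punct]
print(lowercased)  # ['this', 'is', 'a', 'great', 'week', 'is', 'it', 'not']
```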

### ✏️ Exercise:

1. Download a pipeline in your language of choice.
2. Load the trained pipeline into a variable.
3. Use the pipeline to process a text, iterate over its tokens, and return the verbatim text content of each token.
4. Instead of returning the verbatim text content of each token, return its lemma, using the attribute `.lemma_`. See documentation here: https://spacy.io/api/token#attributes.

👀 **Using the `stanza` wrapper:**

To use a Stanza model as a spaCy pipeline, replace the **downloading and loading** steps shown above with the following, using the correct [language code](https://stanfordnlp.github.io/stanza/available_models.html). For example, for Greek (`el`):

In [None]:
import stanza
import spacy_stanza

stanza.download("el")
nlp = spacy_stanza.load_pipeline("el")

In [None]:
# 1. Download a pipeline in your language of choice. Type your code here:



In [None]:
# 2. Load the trained pipeline into a variable. Type your code here:



In [None]:
# 3. Use the pipeline to process a text, iterate over its tokens, and return the verbatim
# text content of each token. Type your code here:



In [None]:
# 4. Instead of returning the verbatim text content of each token, return its lemma, using
# the attribute `.lemma_`. See documentation here: https://spacy.io/api/token#attributes.
# Type your code here:

