## Basic text processing pipeline with Spacy

Spacy is an open-source library for text preprocessing.

It features the most common components of the preprocessing pipeline:

![](images/nlp_pipeline.png)

Image source: https://spacy.io/usage/linguistic-features

Spacy takes a string as input, i.e. a sequence of characters, and transforms it into a sequence of more meaningful units for analysis. Which transformations we want to perform will, again, depend on the nature of our dataset and on the type of analysis we want to carry out.
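To see why a plain string is not yet a useful unit of analysis, here is a minimal sketch that uses only the Python standard library, not Spacy. The variable names and the regular expression are illustrative assumptions; Spacy's own tokenizer follows more elaborate, language-specific rules.

```python
import re

example = "This is a great week. Is it not?"

# A string is just a sequence of characters:
characters = list(example)
print(characters[:7])

# Naive approach: split on whitespace. Punctuation stays glued to the words.
whitespace_tokens = example.split()
print(whitespace_tokens)

# Slightly better: runs of word characters and single punctuation marks
# become separate units. This is only a rough stand-in for a real tokenizer.
regex_tokens = re.findall(r"\w+|[^\w\s]", example)
print(regex_tokens)
```

Splitting on whitespace leaves units such as `week.` and `not?`, while the regular expression separates the punctuation marks from the words. A library like Spacy handles this, and many harder cases, for us.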

Spacy is a Python library, so it first has to be installed:

In [None]:
!pip install spacy

... and it has to be imported, to be able to use it:

In [None]:
import spacy

Spacy is based on statistical models that have learned probabilities from datasets richly annotated with linguistic features (more on that later!). A spacy model has learned from observations in the data, so that we can then apply it to new text.

Spacy provides easy-to-use pipelines for a variety of languages and for a variety of tasks.

![](images/spacy_langs.png)

Source: https://spacy.io/usage/models

We will use the `en_core_web_sm` pipeline in our examples, which is trained on English data: https://spacy.io/models/en#en_core_web_sm

In order to start using spacy, we need to download a pipeline:

In [None]:
!python -m spacy download en_core_web_sm

... and we need to load the pipeline:

In [None]:
# Load English language pipeline and store in variable `nlp`:
nlp = spacy.load("en_core_web_sm")

The `nlp` object takes a text as input and runs it through each component of the pipeline.

![](images/nlp_pipeline.png)

We store the output in a variable that we call `output`:

In [None]:
example = "This is a great week. Is it not?"
output = nlp(example)

If you print the content of `output`, on the surface it looks like a string. It actually looks like nothing has happened:

In [None]:
example = "This is a great week. Is it not?"
output = nlp(example)
print(output)

But if we print the data type of `output`, we see it's not a string, it's a spacy object called `Doc`:

In [None]:
# Print the type of `output`:
print(type(output))

A [Doc](https://spacy.io/api/doc) contains the linguistic annotations as a sequence of linguistic units.

In other words, variable `output` now contains the linguistic processing of your sentence!
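A `Doc` can be inspected much like other Python sequences. The sketch below assumes a blank English pipeline created with `spacy.blank("en")`, which contains only the tokenizer and needs no download; the names `blank_nlp`, `blank_doc` and `span` are illustrative and chosen so that they do not overwrite `nlp` and `output`. The same operations should work on `output` as well.

```python
import spacy

# Assumption: a tokenizer-only pipeline, used here so that the sketch runs
# without downloading a trained pipeline.
blank_nlp = spacy.blank("en")
blank_doc = blank_nlp("This is a great week. Is it not?")

# The number of tokens in the Doc:
print(len(blank_doc))

# Indexing returns a single Token:
print(blank_doc[0].text, blank_doc[-1].text)

# Slicing returns a Span, i.e. a contiguous stretch of tokens:
span = blank_doc[3:5]
print(type(span).__name__, span.text)

# The original string is still available:
print(blank_doc.text)
```

This also explains why printing a `Doc` looks like printing the input string: the original text is preserved, and the annotations are stored alongside it.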

![](images/nlp_pipeline.png)

Our efforts will now be put into retrieving the linguistic information from the output.

We will show you some of the most common ways of using spacy, but there's much more! Check the [documentation](https://spacy.io/usage) if you'd like to know more.

First, to recap, this is how to process a text using Spacy:

In [None]:
# Load English language pipeline and store in variable `nlp`. You may need to download it first,
# and you only need to do this step once!
nlp = spacy.load("en_core_web_sm")

# The text to process:
example = "This is a great week. Is it not?"

# Process the example text using the pipeline stored in `nlp`, and
# store the output in a variable called `output`. This line does most
# of the work!!!
output = nlp(example)

Now, iterating over the elements in `output` allows you to retrieve its linguistic information.

In particular, a [`Doc`](https://spacy.io/api/doc) object is a sequence of [`Token`](https://spacy.io/api/token) objects (roughly, words and punctuation marks). Iterating over the elements in a `Doc` object means iterating over its tokens. More interestingly, you can access the token attributes, listed in https://spacy.io/api/token#attributes, using "dot notation". For example:

In [None]:
example = "This is a great week. Is it not?"
output = nlp(example)

# Iterating over the elements in `output`, using a for-loop:
for element in output:
    print(element.text)

👀 As you can see in https://spacy.io/api/token#attributes, `.text` provides the "verbatim text content" of a token. 

☝️ It is common to use a list comprehension instead of a for-loop:

In [None]:
example = "This is a great week. Is it not?"
output = nlp(example)

# Iterating over the elements in `output`, using a list comprehension:
print([element.text for element in output])
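Other token attributes are accessed with the same dot notation. The following sketch prints a few attributes that do not depend on a trained model, and uses them to filter tokens in a list comprehension. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only); the names `blank_nlp`, `blank_doc`, `words` and `lowercased` are illustrative.

```python
import spacy

# Assumption: tokenizer-only pipeline, so no trained pipeline is required.
blank_nlp = spacy.blank("en")
blank_doc = blank_nlp("This is a great week. Is it not?")

# For each token: its position in the Doc, its character offset in the
# original string, its text, and two boolean flags.
for token in blank_doc:
    print(token.i, token.idx, token.text, token.is_alpha, token.is_punct)

# Keep only the tokens that are not punctuation:
words = [token.text for token in blank_doc if not token.is_punct]
print(words)

# Lowercased text of the alphabetic tokens:
lowercased = [token.lower_ for token in blank_doc if token.is_alpha]
print(lowercased)
```

Attributes that rely on trained components, such as part-of-speech tags, are not filled in by a blank pipeline; for those, use the `nlp` object loaded from `en_core_web_sm`.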

✏️ **Exercise:**

First text processing exercise, step-by-step:

In [None]:
# 1. Download a pipeline in your language of choice. Type your code here:



In [None]:
# 2. Load the trained pipeline into a variable. Type your code here:



In [None]:
# 3. Use the pipeline to process a text, iterate over its tokens, and return the verbatim
# text content of each token. Type your code here:



In [None]:
# 4. Instead of returning the verbatim text content of each token, return its lemma, using
# the attribute `.lemma_`. See documentation here: https://spacy.io/api/token#attributes.
# Type your code here:



☝️**What if my language does not have any trained pipeline?**

Spacy supports adding new languages and fine-tuning existing models to different domains. The process is well documented [here](https://spacy.io/usage/adding-languages), but it will take some work!
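Before investing in training, a blank pipeline may already cover simple needs such as tokenization. The sketch below assumes Spacy's multi-language code `xx`, passed to `spacy.blank`; the placeholder sentence and the names `blank_multi`, `doc_multi` and `tokens` are illustrative. A blank pipeline has no trained components, so it will not provide model-based annotations such as part-of-speech tags.

```python
import spacy

# Assumption: "xx" is Spacy's generic multi-language class. If Spacy has a
# language class for your language, its code can be used here instead.
blank_multi = spacy.blank("xx")

# A blank pipeline has no trained components, only a tokenizer:
print(blank_multi.pipe_names)

# Placeholder text: replace it with a sentence in your language.
text = "Replace this with a sentence in your language."
doc_multi = blank_multi(text)

tokens = [token.text for token in doc_multi]
print(tokens)
```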