In [1]:
import spacy

spaCy is much more than an NLP framework. It is also a way of designing and implementing complex pipelines. A pipeline is a sequence of pipes, or actors on data, that alter the data or extract information from it. In some cases, later pipes require the output of earlier pipes. In other cases, a pipe can stand entirely on its own.
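To make the idea of dependent and independent pipes concrete, here is a minimal sketch. It assumes spaCy v3; the `sentence_counter` component and the `n_sents` key are hypothetical names invented for this illustration and are not part of spaCy.

```python
import spacy
from spacy.language import Language


# Hypothetical custom pipe, registered only for this illustration.
@Language.component("sentence_counter")
def sentence_counter(doc):
    # Depends on an earlier pipe having set sentence boundaries.
    doc.user_data["n_sents"] = len(list(doc.sents))
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")       # can exist entirely on its own
nlp.add_pipe("sentence_counter")  # requires the sentencizer's output

doc = nlp("This is one sentence. This is another.")
print(nlp.pipe_names)
print(doc.user_data["n_sents"])
```

Without the sentencizer in front of it, `sentence_counter` would have no sentence boundaries to count, which is the sense in which a later pipe requires the output of an earlier one.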

### Attribute Rulers

DependencyParser

EntityLinker

EntityRecognizer

EntityRuler

Lemmatizer

Morphologizer

SentenceRecognizer

Sentencizer

SpanCategorizer

Tagger

TextCategorizer

Tok2Vec

Tokenizer

TrainablePipe

Transformer

### Matchers

DependencyMatcher

Matcher

PhraseMatcher

### How to Add Pipes

In most cases, you will use an off-the-shelf spaCy model. Sometimes, however, an off-the-shelf model will not meet your needs, or it will perform a specific task very slowly. A good example is sentence tokenization. Imagine a document around 1 million sentences long. Even the small English model would take a long time to process and separate those sentences, because every pipe in a pipeline runs unless you specify otherwise. Each pipe, from the dependency parser to the named entity recognizer, is applied to your data, which is a serious waste of computational resources and time when all you need is sentence boundaries. The small model may take hours to complete this task. By creating a blank English model and adding only the Sentencizer to it, you can reduce this time to mere minutes.



In [2]:
# To demonstrate this process, let’s first create a blank model.
nlp = spacy.blank("en")

In [3]:
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x21c76205d80>

In [4]:
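import requests
from bs4 import BeautifulSoup

# Download the complete works of Shakespeare as one large plain-text file.
s = requests.get("https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt")
# Extract the text, rejoin words hyphenated across line breaks, and flatten newlines to spaces.
soup = BeautifulSoup(s.content).text.replace("-\n", "").replace("\n", " ")
# Raise spaCy's default limit of 1,000,000 characters so the whole text fits in one Doc.
nlp.max_length = 5278439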

In [5]:
%%time
doc = nlp(soup)
print(len(list(doc.sents)))

94133
Wall time: 1min 4s


In [6]:
nlp2 = spacy.load("en_core_web_sm")
nlp2.max_length = 5278439

In [None]:
%%time
doc = nlp2(soup)
print(len(list(doc.sents)))
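If the other pipes of a full model are still needed elsewhere in your code, a middle path is to switch them off temporarily instead of building a blank pipeline. The following is a minimal sketch, assuming spaCy v3's `select_pipes`. The `entity_ruler` is only a stand-in for the heavier pipes of a trained model, so that the example runs without downloading one.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")  # stand-in for a pipe we do not need right now

# Inside the block, only the sentencizer is active.
with nlp.select_pipes(enable=["sentencizer"]):
    active = nlp.pipe_names
    doc = nlp("First sentence. Second sentence.")
    n_sents = len(list(doc.sents))

# After the block, the disabled pipe is restored.
restored = nlp.pipe_names
print(active, restored, n_sents)
```

For a trained model, `spacy.load` also accepts `disable` and `exclude` arguments that serve a similar purpose at load time.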

### Examining a Pipeline

In [None]:
nlp2.analyze_pipes()
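
`analyze_pipes` returns a dictionary describing each pipe, including the attributes it assigns and requires, along with any problems it detects. The sketch below, assuming spaCy v3, runs it on a blank pipeline with only a sentencizer, so that it works without a downloaded model. Passing `pretty=True` also prints the analysis as a table.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Prints a formatted table and returns the same information as a dict.
analysis = nlp.analyze_pipes(pretty=True)

# The "summary" entry is keyed by pipe name.
print(analysis["summary"]["sentencizer"])
```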