![](./resources/images/header_logo.png)

# Chapter 2 Getting started with Spacy

In this section we will explore **Spacy**, one of the most popular open-source NLP libraries. 
It is designed to hide the complex algorithms and give the user ready-made functions for text processing and analysis. It is also designed to work efficiently with other data science libraries such as *numpy* and *pandas*.

Spacy chooses efficient default algorithms for you. The drawback is that you usually cannot swap in alternative algorithms. If you want to dive deeper and experiment with spacy, the [**spacy documentation**](https://spacy.io/) is a great place to start.
This section will focus on key text processing methods, namely:
* Tokenization
* Lemmatization
* Stemming
* Handling Stop Words
* Vocabulary Matching

Before we start using the spacy library, we also need to download the language model we will use. Language models are large files that hold information about a specific language, for example: words, stop words, their vector representations (more in Chapter 3), lemmas, parts of speech (POS) and much more. First, let's import spacy.

In [1]:
import spacy

Now we have to load an appropriate language model. For this course, we will use a medium-size English language model with word embeddings called "en_core_web_md". If it is not installed yet, download it first with `python -m spacy download en_core_web_md`. Loading may take a while, and it returns a spacy pipeline object.

In [2]:
nlp = spacy.load("en_core_web_md")

Now `nlp` is a default spacy pipeline. But what is a pipeline?
In software engineering, a pipeline is like a toolbox that stores different functions (tools) and runs them on the data in a fixed order, each step working on the result of the previous one.
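
To make the idea concrete, here is a minimal toy sketch of a pipeline in plain Python. It is only an illustration: the step functions (`split_words`, `lowercase`, `count_tokens`) and `run_pipeline` are hypothetical names invented for this example, and spacy's real components are far more sophisticated.

```python
# Toy illustration only: these functions are hypothetical stand-ins,
# not spacy components.

def split_words(data):
    """Step 1: split the raw text on whitespace."""
    data["tokens"] = data["text"].split()
    return data

def lowercase(data):
    """Step 2: lowercase every token produced by step 1."""
    data["tokens"] = [t.lower() for t in data["tokens"]]
    return data

def count_tokens(data):
    """Step 3: store how many tokens there are."""
    data["n_tokens"] = len(data["tokens"])
    return data

# The pipeline is an ordered list of (name, function) pairs.
toy_pipeline = [
    ("split_words", split_words),
    ("lowercase", lowercase),
    ("count_tokens", count_tokens),
]

def run_pipeline(text, pipeline):
    """Pass the data through every step, in order."""
    data = {"text": text}
    for name, step in pipeline:
        data = step(data)
    return data

result = run_pipeline("Computer Science is so fun!", toy_pipeline)
print(result["tokens"])    # ['computer', 'science', 'is', 'so', 'fun!']
print(result["n_tokens"])  # 5
```

The order matters: `lowercase` can only run after `split_words` has created the tokens, just as later spacy components rely on the work of earlier ones.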

### Video 1 Spacy Pipeline
Click [here](google.com) to watch on YT or run the cell below for in-app view.

In [3]:
from IPython.display import YouTubeVideo, display
# width and height are arguments of YouTubeVideo, not of display()
video = YouTubeVideo("", width="100%", height=200)
display(video)

![](./resources/images/spacy_pipeline.svg)

Now that we have loaded the model and created a standard pipeline, let's take a look at what is inside it.

In [13]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x20d86139100>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x20d86139ac0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x20d859fef90>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x20d874fcb00>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x20d87504c80>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x20de0bf9c10>)]

As you can see, there are a lot of different tools included in the pipeline. We will explore some of them in more depth now.

### Tokenizer

First, the plain text is broken into its smallest units, such as individual words or terms (**tokenization**). We call these smallest chunks of text **tokens**. Tokens can take the form of phrases, individual words, numerals, punctuation marks, whitespace and more. The tokenizer always runs first, which is why it is not listed in `nlp.pipeline` above; it is available separately as `nlp.tokenizer`. Take a look at the example below to see how we can read individual tokens using spacy.

In [4]:
sentence_1 = "Computer Science is so fun!"

doc = nlp(sentence_1) # we are passing text into spacy pipeline

for token in doc: # iterate through every token in doc
    print(token)

Computer
Science
is
so
fun
!
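
Why does "fun!" end up as two tokens? The toy sketch below mimics that behaviour with a single regular expression. It is an assumption-laden simplification, not spacy's tokenizer: spacy uses language-specific rules and exceptions, and `toy_tokenize` is a hypothetical helper written only for this illustration.

```python
import re

def toy_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(toy_tokenize("Computer Science is so fun!"))
# ['Computer', 'Science', 'is', 'so', 'fun', '!']

# The simple rule breaks down on contractions:
print(toy_tokenize("don't"))
# ['don', "'", 't']
```

The second call shows the limits of such a simple rule, which is why a real tokenizer needs handcrafted exceptions for cases like contractions.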


### Tagger

The Tagger is responsible for attaching a **part of speech (POS)** tag to each token.
We can look at the same example as before, but this time we'll also print the tag attribute of each token.
We'll use an f-string to make the printout prettier.

In [28]:
print(f"{'Token':<10}{'Token tag'}")
for token in doc:
    print(f"{token.text:<10}{token.tag}")

Token     Token tag
Computer  15308085513773655218
Science   15794550382381185553
is        13927759927860985106
so        164681854541413346
fun       10554686591937588953
!         12646065887601541794


What went wrong? Why does the token tag display a long number instead of a description of the part of speech?
Spacy stores strings internally as integer hashes, and `token.tag` returns the integer ID of the tag.
We can use `token.tag_` (with a trailing underscore) to get the readable string instead.

In [30]:
print(f"{'Token':<10}{'Token tag':<12}{'Explanation'}")
for token in doc:
    print(f"{token.text:<10}{token.tag_:<12}{str(spacy.explain(token.tag_))}")
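
The relation between `token.tag` (an integer) and `token.tag_` (a string) can be pictured as a lookup table between strings and integer IDs. The sketch below is a simplified, hypothetical stand-in for that idea: `ToyStringStore` is not a spacy class, and the `blake2b` hash is an arbitrary choice for this example, so the numbers it produces will differ from the ones spacy prints.

```python
import hashlib

class ToyStringStore:
    """Hypothetical, simplified stand-in for a string-to-ID table."""

    def __init__(self):
        self._by_id = {}

    def add(self, text):
        """Store text and return a 64-bit integer ID derived from it."""
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._by_id[key] = text
        return key

    def __getitem__(self, key):
        """Look up the original string for an integer ID."""
        return self._by_id[key]

store = ToyStringStore()
tag_id = store.add("NNP")  # an integer, analogous to token.tag
print(tag_id)
print(store[tag_id])       # the readable string, analogous to token.tag_
```

Working with integers internally is cheaper than comparing strings, and the readable form can always be recovered when a human needs to look at it.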

### Parser

The Dependency Parser retrieves the grammatical relations between individual tokens, such as which token is the subject or the object of a verb.

In [None]:
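from spacy import displacy  # spacy's built-in visualizer
displacy.render(doc, style="dep", jupyter=True)  # draw the dependency tree of doc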

### Lemmatizer

Lemmatization assigns each token that exists in the model vocabulary its base form (its **lemma**).
Let's see some examples.

In [37]:
nlp("been")[0].lemma_

'be'

In [46]:
nlp("presents")[0].lemma_

'present'
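
One simple way to picture lemmatization is as a dictionary lookup with a fallback. The sketch below assumes a tiny hand-written table (`TOY_LEMMAS`) and a hypothetical helper `toy_lemma`; spacy's lemmatizer relies on much larger resources and additional rules, so treat this only as an illustration of the idea.

```python
# Tiny hand-written lookup table, invented for this illustration.
TOY_LEMMAS = {
    "been": "be",
    "was": "be",
    "presents": "present",
    "mice": "mouse",
}

def toy_lemma(word):
    """Return the base form if the word is in the table, else the word itself."""
    return TOY_LEMMAS.get(word.lower(), word)

for word in ["been", "presents", "mice", "spacy"]:
    print(f"{word:<10}{toy_lemma(word)}")
```

Words missing from the table, like "spacy" here, are returned unchanged.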

### Named Entity Recognizer (NER)

Recognizing named entities is one of the most powerful features of the spacy library. Spacy models have been trained to classify some words as entities such as persons, companies or locations.

In [73]:
doc = nlp("Bill Gates has founded Microsoft in 2021. It cost him 2$")
for entity in doc.ents:
    print(f"{entity.text:<20}{entity.label_:<20}{str(spacy.explain(entity.label_)):<20}")

Bill Gates          PERSON              People, including fictional
Microsoft           ORG                 Companies, agencies, institutions, etc.
2021                DATE                Absolute or relative dates or periods
2$                  MONEY               Monetary values, including unit


See? The Named Entity Recognizer successfully recognized the two-token span "Bill Gates" as a PERSON, "Microsoft" as an ORG (organization), "2021" as a DATE and "2$" as MONEY.

For full list, visit: ...

In [75]:
from spacy import displacy

doc = nlp("Interesting sentence last year")

displacy.render(doc, style='ent', jupyter=True)

### Other features under the hood

Apart from the tools described above, there are a lot of other things happening under the hood.
Spacy detects the boundaries of sentences, although not with 100% accuracy.
After the spacy pipeline is applied to a text, each word also gets a mathematical representation as a vector, which we will explore in the next chapters.
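
To see why sentence boundary detection is hard to get 100% right, consider the naive splitter sketched below. It is not how spacy finds sentences; `naive_split_sentences` is a hypothetical helper that simply splits after `.`, `!` or `?`, and it shows how abbreviations can fool a simple rule.

```python
import re

def naive_split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "Dr. Smith loves NLP. She teaches it every year!"
for sentence in naive_split_sentences(text):
    print(sentence)
# Dr.
# Smith loves NLP.
# She teaches it every year!
```

The abbreviation "Dr." is wrongly treated as the end of a sentence. Trained models handle many such cases better, but they can still make mistakes.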

## Quiz 2

In [77]:
# requires the jupyterquiz package (pip install jupyterquiz)
from jupyterquiz import display_quiz
display_quiz("resources/quizzes/questions2.json")

# Click [here](3.%20Let's%20Learn%20NLP%20-%20Understanding%20Word%20Vectors.ipynb) to continue