# spaCy NLP Pipeline

A **pipeline** in spaCy is a sequence of processing components that are applied to text in order. Each component adds annotations (like POS tags, entities, etc.) to the document.
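To make the idea concrete, here is a minimal pure-Python sketch of the pipeline pattern. It is not spaCy's implementation: `ToyDoc`, `toy_tokenizer`, `length_annotator`, `title_case_annotator`, and `run_pipeline` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document that accumulates annotations."""
    text: str
    tokens: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def toy_tokenizer(text):
    # First step: turn raw text into a document (naive whitespace split).
    return ToyDoc(text=text, tokens=text.split())


def length_annotator(doc):
    # A component receives the document, adds an annotation, and returns it.
    doc.annotations["lengths"] = [len(tok) for tok in doc.tokens]
    return doc


def title_case_annotator(doc):
    # A second component adds a different annotation to the same document.
    doc.annotations["is_title"] = [tok.istitle() for tok in doc.tokens]
    return doc


def run_pipeline(text, components):
    # Tokenize once, then apply every component in order.
    doc = toy_tokenizer(text)
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline("Captain america ate samosa", [length_annotator, title_case_annotator])
print(doc.tokens)
print(doc.annotations)
```

The point of the sketch is the shape: one object flows through every step, and each step only adds to it.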

## spaCy Pipeline Components

| Component | Description | Creates |
|-----------|-------------|---------|
| `tokenizer` | Splits text into tokens | `Doc` |
| `tagger` | Assigns POS tags | `token.pos_`, `token.tag_` |
| `parser` | Dependency parsing | `token.dep_`, `token.head` |
| `ner` | Named Entity Recognition | `doc.ents` |
| `lemmatizer` | Lemmatization | `token.lemma_` |
| `sentencizer` | Sentence boundary detection | `doc.sents` |
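As a small illustration of one row in the table, the rule-based `sentencizer` can be added to a blank pipeline without downloading a trained model. This is a sketch, and the example sentence is made up for the demo.

```python
import spacy

# Start from a blank English pipeline (tokenizer only).
nlp = spacy.blank("en")

# Add the built-in rule-based sentence boundary detector.
nlp.add_pipe("sentencizer")

doc = nlp("Captain america ate samosa. Then he said I can do this all day.")

# doc.sents is available because the sentencizer set sentence boundaries.
sentences = [sent.text for sent in doc.sents]
print(nlp.pipe_names)
print(sentences)
```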

## Blank vs Trained Models
- **Blank model** (`spacy.blank("en")`) - Only tokenization, no annotations
- **Trained model** (`spacy.load("en_core_web_sm")`) - Full pipeline with trained components
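A quick way to see the difference is to ask the `Doc` which annotations were actually set. The sketch below uses only a blank pipeline, so it runs without a downloaded model; `Doc.has_annotation` is used here as a convenience check.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Captain america ate 100$ of samosa")

# A blank pipeline has no components after the tokenizer.
print(nlp.pipe_names)

# Tokenization still works.
print([token.text for token in doc])

# No component has set POS tags or lemmas on this Doc.
print(doc.has_annotation("TAG"))
print(doc.has_annotation("LEMMA"))
```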

In [1]:
import spacy

### Import spaCy

spaCy is an industrial-strength NLP library optimized for production use.

In [13]:
nlp = spacy.blank("en")

doc = nlp("Captain america ate 100$ of samosa, Then he said I can do this all day.")

for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Captain  |    |  
america  |    |  
ate  |    |  
100  |    |  
$  |    |  
of  |    |  
samosa  |    |  
,  |    |  
Then  |    |  
he  |    |  
said  |    |  
I  |    |  
can  |    |  
do  |    |  
this  |    |  
all  |    |  
day  |    |  
.  |    |  


### Create a Blank Model

A blank model only has a tokenizer - no trained components. Let's see what happens when we try to get POS tags and lemmas:

## Blank Model - No Pipeline Components

Notice that with a blank model, `pos_` and `lemma_` both return empty strings (as in the output above) because there are no trained components to annotate them.

In [None]:
nlp.pipe_names

[]

Let's check what pipeline components are available in the blank model:

In [4]:
nlp = spacy.load("en_core_web_sm")

### Load a Trained Model

Now let's load a trained model (`en_core_web_sm`) which includes all the NLP components:

The trained model includes these pipeline components:
- **tok2vec**: Computes token embeddings that downstream components can share
- **tagger**: Part-of-speech tagging
- **parser**: Dependency parsing
- **lemmatizer**: Word lemmatization
- **ner**: Named entity recognition
- **attribute_ruler**: Rule-based attribute assignment

In [8]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Let's verify the pipeline components are now available:

In [7]:
doc = nlp("Captain america ate 100$ of samosa, Then he said I can do this all day.")

for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Captain  |  PROPN  |  Captain
america  |  PROPN  |  america
ate  |  VERB  |  eat
100  |  NUM  |  100
$  |  NUM  |  $
of  |  ADP  |  of
samosa  |  PROPN  |  samosa
,  |  PUNCT  |  ,
Then  |  ADV  |  then
he  |  PRON  |  he
said  |  VERB  |  say
I  |  PRON  |  I
can  |  AUX  |  can
do  |  VERB  |  do
this  |  PRON  |  this
all  |  DET  |  all
day  |  NOUN  |  day
.  |  PUNCT  |  .


Now with the trained model, we get proper POS tags and lemmas! Notice:
- "ate" → lemma "eat"
- "said" → lemma "say"
- Each token has a meaningful POS tag (PROPN, VERB, NUM, etc.)
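If a tag abbreviation is unfamiliar, `spacy.explain` (the same helper used for entity labels in this notebook) can look it up. A short sketch, with the tag list chosen just for illustration:

```python
import spacy

# Look up human-readable descriptions for a few POS tags.
explanations = {tag: spacy.explain(tag) for tag in ["PROPN", "VERB", "NUM", "ADP", "AUX"]}

for tag, description in explanations.items():
    print(tag, " | ", description)
```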

In [10]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")

for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


### Named Entity Recognition (NER)

NER identifies and classifies entities like:
- **ORG**: Organizations (Tesla Inc)
- **PERSON**: People names
- **MONEY**: Monetary values ($45 billion)
- **GPE**: Geopolitical entities (countries, cities)

In [12]:
from spacy import displacy

displacy.render(doc, style="ent")

### Visualize with displaCy

spaCy's `displacy` creates beautiful visualizations for entities and dependency trees:

The colored highlights show different entity types. This visualization makes it easy to verify NER results!
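Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below assumes no trained model is installed, so it sets one entity by hand purely for illustration; with a trained pipeline, `doc.ents` would already be filled in.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")

# Hand-set entity for the demo only (tokens 0-1 are "Tesla Inc").
doc.ents = [Span(doc, 0, 2, label="ORG")]

# jupyter=False makes render() return the HTML as a string.
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:80])
```

The returned string can be written to an `.html` file and opened in a browser.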

In [15]:
nlp = spacy.load("fr_core_news_sm")

doc = nlp("Tesla Inc va racheter Twitter pour 45 milliards de dollars")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  PER  |  Named person or family.
Twitter  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art


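### Multilingual Support 🌍

spaCy has trained models for many languages. Let's try **French** NER on a French translation of the same sentence. Note that the French model uses a different label set (`PER`, `MISC`) and, in the output above, labels "Tesla Inc" as a person: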

**Note**: You need to download the model first: `python -m spacy download fr_core_news_sm`
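If the model package is missing, `spacy.load` raises an `OSError`. One way to handle that is sketched below; `load_or_blank` is a hypothetical helper (not part of spaCy) that falls back to a blank pipeline and prints the download command.

```python
import spacy


def load_or_blank(model_name, lang_code):
    """Hypothetical helper: load a trained pipeline, or fall back to a blank one."""
    try:
        return spacy.load(model_name)
    except OSError:
        # Raised by spacy.load when the model package cannot be found.
        print(f"{model_name!r} not found; run: python -m spacy download {model_name}")
        return spacy.blank(lang_code)


nlp = load_or_blank("fr_core_news_sm", "fr")
print(nlp.lang, nlp.pipe_names)
```

Keep in mind that the blank fallback only tokenizes, so `doc.ents` will stay empty until the trained model is installed.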

In [17]:
nlp = spacy.blank("en")

doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

### Back to Blank Model - No NER

What happens if we try NER on a blank model? No entities are detected because there's no trained NER component!

In [18]:
source_nlp = spacy.load("en_core_web_sm")

nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
nlp.pipe_names

['ner']

### Customizing the Pipeline 🛠️

You can **selectively add components** from trained models to a blank pipeline. This is useful when:
- You only need specific functionality (e.g., just NER)
- You want to save processing time (and, once the source model is released, memory) by not running unused components
- You're building a custom pipeline

Below, we create a blank model and add **only the NER component** from the trained model:

In [19]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


Now our custom pipeline with just NER can detect entities! This demonstrates the modularity of spaCy's pipeline architecture.
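The same mix-and-match idea also works with built-in rule-based components, which need no trained model at all. As a sketch, the example below adds spaCy's `entity_ruler` to a blank pipeline. This component is not used elsewhere in this notebook, and the single pattern is an assumption made up for the demo.

```python
import spacy

nlp = spacy.blank("en")

# Add the built-in rule-based entity matcher and give it one phrase pattern.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Tesla Inc"}])

doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(nlp.pipe_names)
print(entities)
```

Unlike the trained `ner` component, the ruler only finds what its patterns describe, so "$45 billion" is not detected here.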

---

## 📝 Summary

| Concept | Description |
|---------|-------------|
| **Blank Model** | Only tokenization, no linguistic analysis |
| **Trained Model** | Full pipeline with POS, NER, parsing, etc. |
| **Custom Pipeline** | Mix and match components as needed |
| **pipe_names** | List available pipeline components |
| **displacy** | Visualize entities and dependencies |

### Key Takeaways:
1. Always check `nlp.pipe_names` to know what components are available
2. Use trained models (`en_core_web_sm`) for production NLP tasks
3. Customize pipelines by adding only the components you need
4. spaCy supports many languages with pre-trained models