
# SpaCy — Complete Notebook

> **Goal**: Learn SpaCy step by step with clean, runnable examples and detailed explanations you can keep as notes.

This notebook covers:
1. Installation & Model Setup  
2. Core Objects: `Doc`, `Token`, `Span`  
3. Tokenization & Basic Token Attributes  
4. Stop Words (built‑in + custom)  
5. Lemmatization  
6. Part‑of‑Speech (POS) Tagging + Morphology  
7. Dependency Parsing  
8. Sentence Segmentation  
9. Named Entity Recognition (NER)  
10. Visualization with `displacy`  
11. Rule‑based Matching with `Matcher`  
12. `PhraseMatcher` for Lists of Terms  
13. Efficient Processing with `nlp.pipe`  
14. Custom Pipeline Components  
15. Saving & Loading Pipelines  
16. Notes on Training & Fine‑tuning

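**Note:** Most cells need a trained SpaCy pipeline such as `en_core_web_sm`. If it is missing, the setup cell tries to download it and otherwise falls back to a blank English pipeline. With the blank pipeline, cells that rely on POS tags, lemmas, the parser, or NER will not reproduce the outputs shown here.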



## 1) Install SpaCy & Download a Model

- If you're on **Colab**, uncomment the first two lines.  
- If you're local and already have SpaCy + models, the `try/except` will just load them.


In [1]:

# If you're on Colab or a fresh environment, run the following (uncomment):
#!pip install -U spacy
#!python -m spacy download en_core_web_sm

import spacy

# Try to load a small English model, otherwise download it.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Attempt to download programmatically (works if internet is available)
    try:
        from spacy.cli import download
        download("en_core_web_sm")
        nlp = spacy.load("en_core_web_sm")
    except Exception:
        print("Couldn't download 'en_core_web_sm'. Falling back to a blank English pipeline.")
        nlp = spacy.blank("en")  # Has tokenizer; no POS/NER until components are added

type(nlp), nlp.pipe_names


(spacy.lang.en.English,
 ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'])
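
Before loading, it can help to check what is actually installed. The following is a minimal sketch, assuming only that SpaCy itself is installed; the variable names (`MODEL_NAME`, `model_installed`, `blank_nlp`) are illustrative and not part of the original notebook.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"  # same model name as in the setup cell

print("spaCy version:", spacy.__version__)

# is_package() reports whether the name is installed as a Python package
model_installed = is_package(MODEL_NAME)
print(f"{MODEL_NAME} installed:", model_installed)

# A blank pipeline needs no download: it has a tokenizer and no other components
blank_nlp = spacy.blank("en")
print("Blank pipeline components:", blank_nlp.pipe_names)
```

If `model_installed` is `False`, the download commands at the top of the setup cell are the usual fix.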


## 2) Core Objects — `Doc`, `Token`, `Span`

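- **`Doc`**: an entire processed text (produced by `nlp(text)`). A container of tokens and their annotations; tokenization is non-destructive, so the original text is always available as `doc.text`.  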
- **`Token`**: a single token with many attributes (text, lemma, POS, shape, etc.).  
- **`Span`**: a slice of a `Doc` (e.g., a phrase or sentence).


In [2]:

text = "SpaCy is a powerful library for Natural Language Processing (NLP)."
doc = nlp(text)

print("Type of doc:", type(doc))
print("Number of tokens:", len(doc))
print([token.text for token in doc])

span = doc[0:3]  # "SpaCy is a"
print("Span text:", span.text)


Type of doc: <class 'spacy.tokens.doc.Doc'>
Number of tokens: 13
['SpaCy', 'is', 'a', 'powerful', 'library', 'for', 'Natural', 'Language', 'Processing', '(', 'NLP', ')', '.']
Span text: SpaCy is a
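
The relationship between the three objects can be sketched a little further. This example assumes a blank English pipeline (tokenizer only, so no model is required); the label `"FIELD"` is made up for illustration.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")  # tokenizer only
doc = nlp_blank("SpaCy is a powerful library for Natural Language Processing (NLP).")

# Token: position in the Doc (i) and character offset in the text (idx)
first = doc[0]
print(first.text, first.i, first.idx)

# Span from a token slice: tokens 6, 7, 8
phrase = doc[6:9]
print(phrase.text, phrase.start, phrase.end)

# Span from character offsets; char_span returns None if the offsets
# do not line up with token boundaries
start_char = doc.text.index("Natural")
end_char = start_char + len("Natural Language Processing")
char_span = doc.char_span(start_char, end_char, label="FIELD")  # hypothetical label
print(char_span.text, char_span.label_)

# Span built explicitly from token indices
manual = Span(doc, 6, 9, label="FIELD")
print(manual.text == phrase.text)
```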



## 3) Tokenization & Basic Token Attributes

SpaCy's tokenizer splits text into meaningful tokens. Useful attributes:
- `token.text`, `token.lemma_`, `token.pos_`, `token.tag_`
- `token.is_alpha`, `token.is_stop`, `token.like_num`, `token.shape_`, `token.is_punct`


In [3]:

import pandas as pd

doc = nlp("Apple sold 3 million iPhones in India. Wow! That's huge, right?")

# Collect one row of lexical attributes per token
rows = []
for token in doc:
    rows.append({
        "text": token.text,
        "is_alpha": token.is_alpha,
        "is_stop": token.is_stop,
        "like_num": token.like_num,
        "shape_": token.shape_,
        "is_punct": token.is_punct
    })
pd.DataFrame(rows)


|   | text | is_alpha | is_stop | like_num | shape_ | is_punct |
|---|---|---|---|---|---|---|
| 0 | Apple | True | False | False | Xxxxx | False |
| 1 | sold | True | False | False | xxxx | False |
| 2 | 3 | False | False | True | d | False |
| 3 | million | True | False | True | xxxx | False |
| 4 | iPhones | True | False | False | xXxxxx | False |
| 5 | in | True | True | False | xx | False |
| 6 | India | True | False | False | Xxxxx | False |
| 7 | . | False | False | False | . | True |
| 8 | Wow | True | False | False | Xxx | False |
| 9 | ! | False | False | False | ! | True |
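
To see why the tokenizer splits a string the way it does, `Tokenizer.explain` lists the rule behind each token, and `add_special_case` overrides the split for a specific string. A minimal sketch on a blank English pipeline; the `"gimme"` special case is only an illustration.

```python
import spacy
from spacy.symbols import ORTH

nlp_blank = spacy.blank("en")

# explain() returns (rule_name, token_text) pairs for debugging tokenization
explained = nlp_blank.tokenizer.explain("(don't)")
for rule, token_text in explained:
    print(f"{rule:10} {token_text}")

# A special case tells the tokenizer how to split one exact string
nlp_blank.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
tokens = [t.text for t in nlp_blank("gimme that")]
print(tokens)
```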



## 4) Stop Words (Built‑in + Custom)

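SpaCy ships a stop‑word list for each language (it belongs to the language data, not to a particular model); for English it is `spacy.lang.en.stop_words.STOP_WORDS`. You can also flag or un‑flag words yourself by setting `is_stop` on the vocabulary entry.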


In [4]:

from spacy.lang.en.stop_words import STOP_WORDS

print("A few built-in stop words:", list(sorted(STOP_WORDS))[:15])

# Custom stop words
custom_stops = {"super", "really"}
for w in custom_stops:
    nlp.vocab[w].is_stop = True

doc = nlp("This is a really super simple example to demonstrate custom stop words.")
[(t.text, t.is_stop) for t in doc]


A few built-in stop words: ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all']


[('This', True),
 ('is', True),
 ('a', True),
 ('really', True),
 ('super', True),
 ('simple', False),
 ('example', False),
 ('to', True),
 ('demonstrate', False),
 ('custom', False),
 ('stop', False),
 ('words', False),
 ('.', False)]
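
A common use of these flags is filtering a text down to its content words. The sketch below assumes a blank English pipeline; flagging `"super"` and un-flagging `"not"` are illustrative choices, not recommendations.

```python
import spacy

nlp_blank = spacy.blank("en")

# Flag a custom stop word and un-flag a built-in one on the vocabulary entries
nlp_blank.vocab["super"].is_stop = True   # custom addition (illustrative)
nlp_blank.vocab["not"].is_stop = False    # keep negation, e.g. for sentiment-style tasks

doc = nlp_blank("This is not a super simple example.")

# Keep tokens that are neither stop words nor punctuation
content_words = [t.text for t in doc if not t.is_stop and not t.is_punct]
print(content_words)
```

The flag is set on one exact vocabulary entry, so it is safest to set it for each spelling variant you expect to see.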


## 5) Lemmatization

Lemmatization returns the **base** form of a word (e.g., "running" → "run"). Requires a model with a **lemmatizer** component.


In [5]:

doc = nlp("The striped bats are hanging on their feet for best.")
[(t.text, t.lemma_, t.pos_) for t in doc]


[('The', 'the', 'DET'),
 ('striped', 'striped', 'ADJ'),
 ('bats', 'bat', 'NOUN'),
 ('are', 'be', 'AUX'),
 ('hanging', 'hang', 'VERB'),
 ('on', 'on', 'ADP'),
 ('their', 'their', 'PRON'),
 ('feet', 'foot', 'NOUN'),
 ('for', 'for', 'ADP'),
 ('best', 'good', 'ADJ'),
 ('.', '.', 'PUNCT')]


## 6) POS Tagging & Morphology

- **POS (`pos_`)**: coarse-grained part of speech (NOUN, VERB, ADJ, etc.).  
- **Tag (`tag_`)**: fine-grained tag (language-dependent).  
- **Morph**: features like Number, Tense, Person, etc.


In [6]:

doc = nlp("Google has been rapidly expanding its AI research labs.")
for token in doc:
    print(f"{token.text:12} POS={token.pos_:6} TAG={token.tag_:6} Morph={token.morph}")


Google       POS=PROPN  TAG=NNP    Morph=Number=Sing
has          POS=AUX    TAG=VBZ    Morph=Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
been         POS=AUX    TAG=VBN    Morph=Tense=Past|VerbForm=Part
rapidly      POS=ADV    TAG=RB     Morph=
expanding    POS=VERB   TAG=VBG    Morph=Aspect=Prog|Tense=Pres|VerbForm=Part
its          POS=PRON   TAG=PRP$   Morph=Gender=Neut|Number=Sing|Person=3|Poss=Yes|PronType=Prs
AI           POS=PROPN  TAG=NNP    Morph=Number=Sing
research     POS=NOUN   TAG=NN     Morph=Number=Sing
labs         POS=NOUN   TAG=NNS    Morph=Number=Plur
.            POS=PUNCT  TAG=.      Morph=PunctType=Peri
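
Two helpers make this output easier to read: `spacy.explain` gives a short description of a tag, and `token.morph` can be queried per feature. In this sketch the annotations are written by hand into a `Doc` (an assumption made so it runs without a trained model); with `en_core_web_sm` the tagger and morphologizer rules would fill them in instead.

```python
import spacy
from spacy.tokens import Doc

# Look up human-readable descriptions for tags seen in the output above
descriptions = {label: spacy.explain(label) for label in ["PROPN", "AUX", "VBZ", "VBG"]}
for label, description in descriptions.items():
    print(f"{label:6} -> {description}")

# Hand-annotated Doc: pos and morphs are supplied manually, not predicted
nlp_blank = spacy.blank("en")
doc = Doc(
    nlp_blank.vocab,
    words=["Google", "has", "labs"],
    pos=["PROPN", "AUX", "NOUN"],
    morphs=["Number=Sing", "Number=Sing|Person=3|Tense=Pres", "Number=Plur"],
)

number_of_labs = doc[2].morph.get("Number")  # list of values for one feature
has_features = doc[1].morph.to_dict()        # all features as a dict
print(number_of_labs, has_features)
```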



## 7) Dependency Parsing

Shows how tokens relate syntactically (subject, object, modifiers). Useful for relation extraction and rule-based patterns.


In [7]:

doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    print(f"{token.text:10} head={token.head.text:10} dep={token.dep_:12} children={[child.text for child in token.children]}")


The        head=fox        dep=det          children=[]
quick      head=fox        dep=amod         children=[]
brown      head=fox        dep=amod         children=[]
fox        head=jumps      dep=nsubj        children=['The', 'quick', 'brown']
jumps      head=jumps      dep=ROOT         children=['fox', 'over', '.']
over       head=jumps      dep=prep         children=['dog']
the        head=dog        dep=det          children=[]
lazy       head=dog        dep=amod         children=[]
dog        head=over       dep=pobj         children=['the', 'lazy']
.          head=jumps      dep=punct        children=[]
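
Once a parse is available, the tree can be navigated with `children` and `subtree`. The sketch below rebuilds the parse printed above by hand (heads and labels copied from that output) so that it runs without a trained parser; with `en_core_web_sm` you would use `nlp(text)` instead.

```python
import spacy
from spacy.tokens import Doc

nlp_blank = spacy.blank("en")
words = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
heads = [3, 3, 3, 4, 4, 4, 8, 8, 5, 4]  # index of each token's head in the Doc
deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj", "punct"]
doc = Doc(nlp_blank.vocab, words=words, heads=heads, deps=deps)

# The root is the token labelled ROOT (its head is itself)
root = next(t for t in doc if t.dep_ == "ROOT")

# Subject phrase: the nsubj child of the root plus everything below it
subject = next(t for t in root.children if t.dep_ == "nsubj")
subject_phrase = " ".join(t.text for t in subject.subtree)

# (preposition, object) pairs attached to the root
prep_objects = [
    (prep.text, obj.text)
    for prep in root.children if prep.dep_ == "prep"
    for obj in prep.children if obj.dep_ == "pobj"
]
print(root.text, "|", subject_phrase, "|", prep_objects)
```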



## 8) Sentence Segmentation

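In trained pipelines such as `en_core_web_sm`, sentence boundaries come from the dependency parser by default. Lighter alternatives are the statistical `senter` component and the rule‑based `sentencizer`. You can also customize boundaries.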


In [8]:

text = "Dr. Smith arrived late. However, the meeting continued. It ended at 5 p.m."
doc = nlp(text)
[sent.text for sent in doc.sents]


['Dr. Smith arrived late.',
 'However, the meeting continued.',
 'It ended at 5 p.m.']
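
When no parser is available, or speed matters more than accuracy, the rule-based `sentencizer` splits on punctuation. A minimal sketch on blank English pipelines; adding `";"` to `punct_chars` is an illustrative customization.

```python
import spacy

# Default sentencizer: splits after sentence-final punctuation such as . ? !
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")
doc = nlp_rules("This is one sentence. Is this another? Yes, it is!")
sentences = [sent.text for sent in doc.sents]
print(sentences)

# Custom boundary characters: here a semicolon also ends a "sentence"
nlp_semicolon = spacy.blank("en")
nlp_semicolon.add_pipe("sentencizer", config={"punct_chars": [".", "?", "!", ";"]})
clauses = [sent.text for sent in nlp_semicolon("First clause; second clause.").sents]
print(clauses)
```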


## 9) Named Entity Recognition (NER)

NER identifies real‑world objects (PERSON, ORG, GPE, DATE, MONEY, etc.).


In [9]:

doc = nlp("Mukesh Ambani is the chairman of Reliance Industries, headquartered in Mumbai. He donated $10 million in 2024.")
[(ent.text, ent.label_) for ent in doc.ents]


[('Mukesh Ambani', 'PERSON'),
 ('Reliance Industries', 'ORG'),
 ('Mumbai', 'GPE'),
 ('$10 million', 'MONEY'),
 ('2024', 'DATE')]
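
Each entity is a `Span`, so it carries character offsets, and every token records its position inside an entity via `ent_iob_`. In this sketch the entities are set by hand on a blank pipeline (an assumption made so it runs without a model); a trained `ner` component would predict them.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
doc = nlp_blank("Mukesh Ambani lives in Mumbai.")

# Hand-set entities (not predicted): token slices with a label
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 4, 5, label="GPE")]

# Span-level view: text, label, and character offsets into doc.text
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

# Token-level view: B = begins an entity, I = inside, O = outside
iob = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
print(iob)
```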


## 10) Visualizations with `displacy`

Use `displacy.render` in notebooks or `displacy.serve` for a local web app.


In [10]:

from spacy import displacy

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# In a notebook, this will render inline:
displacy.render(doc, style="ent", jupyter=True)
# For dependencies:
# displacy.render(doc, style="dep", jupyter=True)
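
Outside a notebook, `displacy.render` can return the markup as a string so it can be saved to a file. A minimal sketch: the entity is set by hand on a blank pipeline, and the output file name `entities.html` is an arbitrary choice.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
doc = nlp_blank("Apple is looking at buying U.K. startup for $1 billion")
doc.ents = [Span(doc, 0, 1, label="ORG")]  # hand-set entity, not predicted

# jupyter=False returns the HTML instead of displaying it;
# page=True wraps it in a complete HTML page
html = displacy.render(doc, style="ent", jupyter=False, page=True)

# Write the page to a temporary location (file name is illustrative)
output_path = Path(tempfile.gettempdir()) / "entities.html"
output_path.write_text(html, encoding="utf-8")
print("Wrote", output_path)
```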



## 11) Rule‑based Matching (`Matcher`)

`Matcher` finds token patterns using lexical/morphological features. Great for targeted extraction.


In [11]:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "machine"}, {"LOWER": "learning"}]
matcher.add("ML_PHRASE", [pattern])

doc = nlp("I love Machine Learning. machine learning is fun. Machines learn too.")
matches = matcher(doc)
[(doc[start:end].text, start, end, nlp.vocab.strings[match_id]) for match_id, start, end in matches]


[('Machine Learning', 2, 4, 'ML_PHRASE'),
 ('machine learning', 5, 7, 'ML_PHRASE')]
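
Patterns can also use boolean token flags and quantifiers. The sketch below, on a blank English pipeline, makes a punctuation token optional with `"OP": "?"` so that both "machine learning" and "machine-learning" match; `as_spans=True` returns `Span` objects instead of `(match_id, start, end)` tuples.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")
matcher = Matcher(nlp_blank.vocab)

pattern = [
    {"LOWER": "machine"},
    {"IS_PUNCT": True, "OP": "?"},  # optional punctuation token, e.g. a hyphen
    {"LOWER": "learning"},
]
matcher.add("ML_PHRASE", [pattern])

doc = nlp_blank("Machine learning and machine-learning are the same thing.")
spans = matcher(doc, as_spans=True)
found = [(span.text, span.label_) for span in spans]
print(found)
```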


## 12) `PhraseMatcher` for Lists of Terms

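`PhraseMatcher` matches a list of exact phrases and is typically faster than `Matcher` for large term lists. Its patterns are `Doc` objects; `nlp.make_doc` builds them with the tokenizer only, without running the rest of the pipeline.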


In [12]:

from spacy.matcher import PhraseMatcher

terms = ["New Delhi", "Mumbai", "Bengaluru", "Hyderabad"]
patterns = [nlp.make_doc(t) for t in terms]

phraser = PhraseMatcher(nlp.vocab)
phraser.add("CITIES", patterns)

doc = nlp("I have lived in Bengaluru and Hyderabad, but I travel to New Delhi often.")
[(doc[start:end].text, nlp.vocab.strings[match_id]) for match_id, start, end in phraser(doc)]


[('Bengaluru', 'CITIES'), ('Hyderabad', 'CITIES'), ('New Delhi', 'CITIES')]
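
By default `PhraseMatcher` compares the exact token text, so "mumbai" would not match "Mumbai". Passing `attr="LOWER"` compares lowercased forms instead. A minimal sketch on a blank English pipeline:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank("en")

# attr="LOWER" makes matching case-insensitive
city_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")
terms = ["New Delhi", "Mumbai"]
city_matcher.add("CITIES", [nlp_blank.make_doc(term) for term in terms])

doc = nlp_blank("Flights from NEW DELHI to mumbai are frequent.")
cities = [span.text for span in city_matcher(doc, as_spans=True)]
print(cities)
```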


## 13) Efficient Batch Processing with `nlp.pipe`

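`nlp.pipe` processes texts as a stream, in batches, and yields `Doc` objects one at a time. This is usually faster than calling `nlp(text)` in a loop and avoids holding every `Doc` in memory at once.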


In [13]:

texts = [
    "Flipkart raised investment last year.",
    "Google acquired another startup in 2023.",
    "The T20 World Cup was hosted across multiple venues."
]
for doc in nlp.pipe(texts, batch_size=32):
    print([(ent.text, ent.label_) for ent in doc.ents])


[('last year', 'DATE')]
[('Google', 'ORG'), ('2023', 'DATE')]
[('The T20 World Cup', 'EVENT')]
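
When each text comes with metadata (an ID, a source, a timestamp), `as_tuples=True` passes it through alongside the `Doc`. A minimal sketch on a blank English pipeline; the `"id"` field is a made-up example of such metadata.

```python
import spacy

nlp_blank = spacy.blank("en")

# (text, context) pairs; the context dict is returned untouched
records = [
    ("Flipkart raised investment last year.", {"id": 1}),
    ("Google acquired another startup in 2023.", {"id": 2}),
]

token_counts = {}
for doc, context in nlp_blank.pipe(records, as_tuples=True, batch_size=2):
    token_counts[context["id"]] = len(doc)

print(token_counts)
```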



## 14) Custom Pipeline Components

A pipeline component is a callable that receives a `Doc`, optionally modifies it, and returns it. You can write your own to add metadata, log stats, or set custom attributes.


In [14]:

from spacy.language import Language
from spacy.tokens import Doc

# Register custom extension attribute if not already registered
if not Doc.has_extension("exclaim_count"):
    Doc.set_extension("exclaim_count", default=0)

@Language.component("exclaim_counter")
def exclaim_counter_function(doc):
    doc._.exclaim_count = sum(1 for t in doc if t.text == "!")
    return doc

# Add to pipeline (before/after components as needed)
if "exclaim_counter" not in nlp.pipe_names:
    nlp.add_pipe("exclaim_counter", last=True)

doc = nlp("Wow! This is amazing! Right!")
doc._.exclaim_count


3
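
If a component needs settings, `@Language.factory` lets you pass them through `config` when adding it to the pipeline. The following is a sketch under stated assumptions: the names `char_counter`, `CharCounter`, and the `char_count` extension are hypothetical and chosen only for this example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register the custom attribute once (hypothetical name)
if not Doc.has_extension("char_count"):
    Doc.set_extension("char_count", default=0)


class CharCounter:
    """Counts tokens whose text equals a configurable character."""

    def __init__(self, char: str):
        self.char = char

    def __call__(self, doc: Doc) -> Doc:
        doc._.char_count = sum(1 for t in doc if t.text == self.char)
        return doc


# The factory receives the pipeline, the component name, and the config values
@Language.factory("char_counter", default_config={"char": "!"})
def create_char_counter(nlp: Language, name: str, char: str):
    return CharCounter(char)


nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("char_counter", config={"char": "?"})

question_marks = nlp_blank("Really? Are you sure? Yes.")._.char_count
print(question_marks)
```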


## 15) Saving & Loading Pipelines

You can serialize the entire pipeline to a directory and load it back later with `spacy.load`.


In [15]:

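import shutil
import tempfile

tmp_dir = tempfile.mkdtemp(prefix="spacy_model_")
nlp.to_disk(tmp_dir)
print("Saved pipeline to:", tmp_dir)

# Load back. The custom "exclaim_counter" component must already be registered
# in this session (it was defined in the previous cell) for loading to succeed.
nlp2 = spacy.load(tmp_dir)
text = "Microsoft Corporation is based in Redmond."
# Stored in a variable: a bare expression in the middle of a cell is not displayed.
ents = [(ent.text, ent.label_) for ent in nlp2(text).ents]

# Clean up the temp dir
shutil.rmtree(tmp_dir, ignore_errors=True)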


Saved pipeline to: /tmp/spacy_model_rkntca4h
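
The same round trip can be written with a context manager so the directory is removed automatically. A minimal sketch, assuming a small blank pipeline with only a `sentencizer` so it runs without a trained model:

```python
import tempfile

import spacy

nlp_small = spacy.blank("en")
nlp_small.add_pipe("sentencizer")

# The directory is deleted when the with-block exits;
# the reloaded pipeline lives in memory and stays usable
with tempfile.TemporaryDirectory(prefix="spacy_model_") as tmp_dir:
    nlp_small.to_disk(tmp_dir)
    reloaded = spacy.load(tmp_dir)

print(reloaded.pipe_names)
reloaded_sentences = [s.text for s in reloaded("One sentence. Another one.").sents]
print(reloaded_sentences)
```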



## 16) Notes on Training & Fine‑tuning (SpaCy v3+)

Training a custom NER model, tagger, or parser requires a **config** and **annotated data**. High‑level steps:

1. Prepare annotated data in SpaCy's binary `DocBin` format (`.spacy` files). Data in older JSON formats can be converted with `python -m spacy convert`.  
2. Create a training config:
```bash
python -m spacy init config config.cfg --lang en --pipeline ner
```
3. Train:
```bash
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```
4. Load your best model from `./output/model-best`.

> Tip: Start from a pretrained pipeline (e.g., `en_core_web_sm`/`md`/`lg`) for better results with limited data.
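
Step 1 can be sketched as follows: build `Doc` objects with entity spans and store them in a `DocBin`. The single training example and its character offsets are made up for illustration, and the file is written to a temporary directory here rather than to `./train.spacy`.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp_blank = spacy.blank("en")

# Hypothetical annotated example: (text, [(start_char, end_char, label), ...])
TRAIN_DATA = [
    ("Flipkart is based in Bengaluru.", [(0, 8, "ORG"), (21, 30, "GPE")]),
]

doc_bin = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp_blank.make_doc(text)
    spans = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        if span is None:
            # Offsets that do not align with token boundaries are skipped
            print(f"Skipping misaligned span: {text[start:end]!r}")
            continue
        spans.append(span)
    doc.ents = spans
    doc_bin.add(doc)

# Save in the .spacy format expected by `spacy train`
out_path = Path(tempfile.mkdtemp(prefix="spacy_data_")) / "train.spacy"
doc_bin.to_disk(out_path)

# Read it back to verify the annotations survived the round trip
loaded_docs = list(DocBin().from_disk(out_path).get_docs(nlp_blank.vocab))
loaded_ents = [(ent.text, ent.label_) for ent in loaded_docs[0].ents]
print(loaded_ents)
```

A development set (`dev.spacy`) would be created the same way from held-out examples.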


## Appendix — Summary of Your Original Notebook (for traceability)

In [16]:
original_preview = [
  "Cell 1 [code]: !pip install spacy...",
  "Cell 2 [code]: !python -m spacy download en_core_web_sm...",
  "Cell 3 [code]: import spacy  nlp = spacy.load(\"en_core_web_sm\")  text = \"Apple is looking at buying U.K startup for $1billion.\"  doc = ...",
  "Cell 4 [code]: doc...",
  "Cell 5 [markdown]: ## tokenization using SpaCy...",
  "Cell 6 [code]: for token in doc:   print(token.text)...",
  "Cell 7 [code]: for token in doc:   print( token.pos_)...",
  "Cell 8 [code]: for token in doc:   print( token.dep_)...",
  "Cell 9 [code]: for token in doc:   print( token.text,':',token.pos_)...",
  "Cell 10 [code]: for token in doc:   print( token.text,':',token.pos_,'-->',token.lemma_,token.dep_)...",
  "Cell 11 [code]: for token in doc:   print( token.text,'-',token.pos_,'-',token.lemma_,'-',token.dep_,'-',token.tag_,'-',token.shape_,'-'...",
  "Cell 12 [markdown]: ## Text Summerization :-(Project)...",
  "Cell 13 [code]: ..."
]
original_preview