# spaCy NLP Pipeline

This notebook demonstrates a clean, production‑minded mini NLP pipeline using **spaCy**.

**What it does**
- Loads a spaCy model, downloading it first if it is missing
- Tokenizes text and filters tokens by part of speech, with optional lemmatization
- Skips punctuation and whitespace, and flags stopwords
- Extracts noun chunks
- Runs Named Entity Recognition (NER)
- Summarizes the dependency parse (dependency label and head for each token)
- Saves tidy outputs to `outputs/` as CSV
- Runs a small self‑check to validate that the pipeline works end‑to‑end

## Setup & Install
This cell ensures the required spaCy model is available. It tries to load `en_core_web_sm` and downloads it if missing.


In [1]:

# If running in a fresh environment, uncomment the next line to install spaCy
# %pip install spacy==3.7.2

import importlib
import subprocess
import sys
from pathlib import Path
from typing import Dict

import pandas as pd
import spacy

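def ensure_spacy_model(model_name: str = "en_core_web_sm") -> str:
    """Return model_name once the model loads, downloading it first if needed."""
    try:
        spacy.load(model_name)
    except OSError:
        # spaCy raises OSError when the model package cannot be found
        print(f"Downloading spaCy model: {model_name} ...")
        # check=True surfaces a failed download instead of silently continuing
        subprocess.run([sys.executable, "-m", "spacy", "download", model_name], check=True)
        # Make the freshly installed package visible to this running interpreter
        importlib.invalidate_caches()
        spacy.load(model_name)  # raises if the model is still missing
    return model_name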

In [2]:
MODEL_NAME = ensure_spacy_model("en_core_web_sm")

nlp = spacy.load(MODEL_NAME)
print("spaCy version:", spacy.__version__, "| model:", MODEL_NAME)

spaCy version: 3.7.2 | model: en_core_web_sm
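
spaCy models are installed as ordinary Python packages, so their presence can be checked without loading them. The sketch below uses the standard library's `importlib.util.find_spec`. The helper name `is_model_installed` is hypothetical and is not part of this notebook or of spaCy, and the check says nothing about whether the model is compatible with the installed spaCy version.

```python
import importlib.util


def is_model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found.

    Hypothetical helper: it only checks that the package exists; it does
    not load the model or verify version compatibility.
    """
    return importlib.util.find_spec(package_name) is not None


print("en_core_web_sm installed:", is_model_installed("en_core_web_sm"))
```

A check like this could run before `ensure_spacy_model` to decide whether a download is likely, without paying the cost of loading the model.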


## Configuration
Adjust the fields below to choose which parts of speech to keep and whether to lowercase and lemmatize tokens.


In [3]:

from dataclasses import dataclass, asdict

@dataclass
class PipelineConfig:
    keep_pos: tuple = ("NOUN","PROPN","VERB","ADJ","ADV","NUM")
    drop_deps: tuple = ("punct",)
    lowercase: bool = True
    use_lemma: bool = True

CONFIG = PipelineConfig()
asdict(CONFIG)


{'keep_pos': ('NOUN', 'PROPN', 'VERB', 'ADJ', 'ADV', 'NUM'),
 'drop_deps': ('punct',),
 'lowercase': True,
 'use_lemma': True}
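
One way to try a different setup without editing the defaults is to derive a variant with `dataclasses.replace`. The sketch below repeats the `PipelineConfig` definition so it runs on its own; the variant name `nouns_only` and its settings are illustrative assumptions.

```python
from dataclasses import dataclass, replace


@dataclass
class PipelineConfig:
    # Same fields as the notebook's PipelineConfig, repeated so this runs alone
    keep_pos: tuple = ("NOUN", "PROPN", "VERB", "ADJ", "ADV", "NUM")
    drop_deps: tuple = ("punct",)
    lowercase: bool = True
    use_lemma: bool = True


CONFIG = PipelineConfig()

# Hypothetical variant: keep only nouns and preserve the original surface form
nouns_only = replace(CONFIG, keep_pos=("NOUN", "PROPN"), use_lemma=False, lowercase=False)

print(nouns_only)
print(CONFIG.keep_pos)  # the default config is left unchanged
```

A variant like this can be passed as the second argument of `analyze_text`, for example `analyze_text(SAMPLE_TEXT, nouns_only)`.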

## Pipeline Functions
Modular helpers that:
- Clean and normalize tokens
- Produce tidy DataFrames for tokens, entities, dependencies, and noun chunks
- Save results to `outputs/`


In [4]:
OUT_DIR = Path("outputs")
OUT_DIR.mkdir(exist_ok=True)

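def analyze_text(text: str, cfg: PipelineConfig = CONFIG) -> Dict[str, pd.DataFrame]:
    """Run the pipeline on `text` and return one tidy DataFrame per analysis."""
    doc = nlp(text)
    rows = []
    for tok in doc:
        # Skip whitespace, punctuation, and tokens tagged X (other)
        if tok.is_space or tok.is_punct or tok.pos_ == "X":
            continue
        # Keep only the configured parts of speech and drop the configured dependency labels
        if tok.pos_ not in cfg.keep_pos or tok.dep_ in cfg.drop_deps:
            continue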
        form = tok.lemma_ if cfg.use_lemma else tok.text
        if cfg.lowercase:
            form = form.lower()
        rows.append({
            "text": tok.text,
            "norm": form,
            "pos": tok.pos_,
            "lemma": tok.lemma_,
            "is_stop": tok.is_stop,
            "dep": tok.dep_,
            "head": tok.head.text,
            "i": tok.i,
        })
    tokens_df = pd.DataFrame(rows)

    ents_df = pd.DataFrame([{"text": ent.text, "label": ent.label_, "start": ent.start_char, "end": ent.end_char}
                            for ent in doc.ents])

    deps_df = pd.DataFrame([{
        "text": t.text, "pos": t.pos_, "dep": t.dep_, "head": t.head.text, "head_pos": t.head.pos_
    } for t in doc])

    noun_chunks_df = pd.DataFrame([{"chunk": chunk.text, "root": chunk.root.text, "root_dep": chunk.root.dep_}
                                   for chunk in doc.noun_chunks])

    return {
        "tokens": tokens_df,
        "entities": ents_df,
        "dependencies": deps_df,
        "noun_chunks": noun_chunks_df
    }

def save_outputs(dfs: Dict[str, pd.DataFrame], prefix: str = "demo"):
    for name, df in dfs.items():
        out = OUT_DIR / f"{prefix}_{name}.csv"
        df.to_csv(out, index=False)
    return True
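
To see how the part-of-speech filter and the normalization options interact without loading a model, the sketch below applies the same keep/skip decisions to hand-written stand-in tokens. `FakeToken` and `normalize` are hypothetical names for illustration only; `FakeToken` is not a spaCy class, and the annotations are written by hand, so a real model may tag the same words differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FakeToken:
    # Stand-in for the spaCy token attributes the filter reads (not a spaCy class)
    text: str
    lemma_: str
    pos_: str
    is_punct: bool = False
    is_space: bool = False


def normalize(
    tokens: List[FakeToken],
    keep_pos: Tuple[str, ...] = ("NOUN", "PROPN", "VERB", "ADJ", "ADV", "NUM"),
    use_lemma: bool = True,
    lowercase: bool = True,
) -> List[str]:
    """Apply the POS filter and normalization steps to stand-in tokens."""
    out = []
    for tok in tokens:
        # Skip whitespace, punctuation, and tokens tagged X
        if tok.is_space or tok.is_punct or tok.pos_ == "X":
            continue
        # Keep only the allowed parts of speech
        if tok.pos_ not in keep_pos:
            continue
        form = tok.lemma_ if use_lemma else tok.text
        out.append(form.lower() if lowercase else form)
    return out


# Hand-written annotations for illustration
sample = [
    FakeToken("We", "we", "PRON"),
    FakeToken("analyzed", "analyze", "VERB"),
    FakeToken("Texts", "text", "NOUN"),
    FakeToken(".", ".", "PUNCT", is_punct=True),
]

print(normalize(sample))                   # ['analyze', 'text']
print(normalize(sample, use_lemma=False))  # ['analyzed', 'texts']
```

The pronoun is dropped because `PRON` is not in `keep_pos`, and the period is dropped by the punctuation check before the POS filter is reached.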


## Demo Input
Change the sample text below, or load it from `sample_text.txt` (create the file in your notebook directory).


In [5]:

SAMPLE_TEXT = 'This AI program at CCBST explores Natural Language Processing (NLP) with spaCy.\nWe analyze text, extract entities like Apple and Toronto General Hospital, and inspect dependencies.\nIn 2025, our team improved model accuracy by 12% while reducing inference latency.'
SAMPLE_TEXT


'This AI program at CCBST explores Natural Language Processing (NLP) with spaCy.\nWe analyze text, extract entities like Apple and Toronto General Hospital, and inspect dependencies.\nIn 2025, our team improved model accuracy by 12% while reducing inference latency.'
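
Loading the text from `sample_text.txt` could look like the sketch below, which falls back to a built-in string when the file is missing or empty. The helper name `load_sample_text`, the placeholder `DEFAULT_TEXT`, and the UTF-8 encoding are assumptions, not part of the original notebook.

```python
from pathlib import Path

DEFAULT_TEXT = "We analyze text with spaCy."  # placeholder fallback for this sketch


def load_sample_text(path: str = "sample_text.txt", fallback: str = DEFAULT_TEXT) -> str:
    """Return the file's contents if it exists and is non-empty, else the fallback."""
    p = Path(path)
    if p.is_file():
        # Assumes the file is UTF-8 encoded
        text = p.read_text(encoding="utf-8").strip()
        if text:
            return text
    return fallback


text = load_sample_text()
print(text[:80])
```

In this notebook the call could be `SAMPLE_TEXT = load_sample_text(fallback=SAMPLE_TEXT)`, so the hard-coded sample remains the default.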

## Run the Pipeline
Produces tidy dataframes and saves them under `outputs/`.


In [6]:

dfs = analyze_text(SAMPLE_TEXT, CONFIG)
# Display a compact preview
preview = {k: v.head(10) for k, v in dfs.items()}
preview


{'tokens':          text        norm    pos       lemma  is_stop       dep      head   i
 0          AI          ai  PROPN          AI    False  compound   program   1
 1     program     program   NOUN     program    False     nsubj  explores   2
 2       CCBST       ccbst  PROPN       CCBST    False      pobj        at   4
 3    explores     explore   VERB     explore    False      ROOT  explores   5
 4     Natural     natural  PROPN     Natural    False  compound  Language   6
 5    Language    language  PROPN    Language    False      nmod       NLP   7
 6  Processing  processing  PROPN  Processing    False      nmod       NLP   8
 7         NLP         nlp  PROPN         NLP    False      dobj  explores  10
 8       spaCy       spacy  PROPN       spaCy    False      pobj      with  13
 9     analyze     analyze   VERB     analyze    False      ROOT   analyze  17,
 'entities':                           text    label  start  end
 0                           AI      ORG      5    7
 1

In [7]:

save_outputs(dfs, prefix="sample")
sorted([str(p) for p in OUT_DIR.glob("sample_*.csv")])


['outputs\\sample_dependencies.csv',
 'outputs\\sample_entities.csv',
 'outputs\\sample_noun_chunks.csv',
 'outputs\\sample_tokens.csv']
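
A quick way to confirm what was written is to read back each file's header and row count. The sketch below does this with the standard library's `csv` module. The helper name `summarize_csvs` is hypothetical, and the demo writes a one-row file to a temporary directory so that it runs without the notebook's own outputs.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def summarize_csvs(out_dir: Path, prefix: str) -> Dict[str, Tuple[List[str], int]]:
    """Map each '<prefix>_*.csv' file name to (column names, number of data rows)."""
    summary = {}
    for path in sorted(out_dir.glob(f"{prefix}_*.csv")):
        with path.open(newline="", encoding="utf-8") as fh:
            rows = list(csv.reader(fh))
        header = rows[0] if rows else []
        # Every row after the header counts as data
        summary[path.name] = (header, max(len(rows) - 1, 0))
    return summary


# Demo on a throwaway directory with made-up contents
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "sample_entities.csv"
    demo.write_text("text,label,start,end\nApple,ORG,0,5\n", encoding="utf-8")
    result = summarize_csvs(Path(tmp), "sample")

print(result)  # {'sample_entities.csv': (['text', 'label', 'start', 'end'], 1)}
```

Against the files saved above, the call would be `summarize_csvs(OUT_DIR, "sample")`.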

## Quick Self‑Check
A tiny smoke test to prove the notebook works end‑to‑end.


In [8]:

assert "tokens" in dfs and not dfs["tokens"].empty, "Tokens DF is empty"
# dependencies includes all tokens (even those filtered out earlier), so should be >= tokens
assert len(dfs["dependencies"]) >= len(dfs["tokens"]), "Dependencies DF looks wrong"
print("✅ Basic self-checks passed.")


✅ Basic self-checks passed.
