# 🔷 spaCy vs 📚 NLTK: A Practical Comparison

This notebook provides a hands-on comparison between two popular NLP libraries:

| Feature | spaCy | NLTK |
|---------|-------|------|
| **Focus** | Production & Speed | Education & Research |
| **Speed** | ⚡ Very Fast (Cython) | 🐢 Slower (Pure Python) |
| **API** | Object-oriented, modern | Functional, traditional |
| **Models** | Pre-trained pipelines | Algorithms & corpora |

---

## 📦 Part 1: spaCy

spaCy is designed for **production use** with a focus on speed and efficiency. It processes text through a pipeline that performs multiple NLP tasks in one pass.

### 1.1 Loading spaCy

First, we import spaCy. The library uses **pre-trained models** that need to be downloaded separately.

```bash
# Run this in terminal if not installed:
pip install spacy
python -m spacy download en_core_web_sm
```

In [1]:
import spacy

### 1.2 Sentence Tokenization with spaCy

spaCy uses a **statistical model** to determine sentence boundaries; in `en_core_web_sm` they are derived from the dependency parser. When you process text with `nlp()`, the pipeline automatically:
- Tokenizes into words
- Detects sentence boundaries
- Performs POS tagging, NER, and more

The `doc.sents` property gives us an iterator over detected sentences.

In [2]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying U.K. startup for $1 billion. Hydra is a dragon of India")

for sentence in doc.sents:
    print(sentence)

Apple is looking at buying U.K. startup for $1 billion.
Hydra is a dragon of India
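
Because each item yielded by `doc.sents` is a `Span`, it also carries its position in the original text. The sketch below illustrates this under one assumption that differs from the cell above: it uses a blank English pipeline with spaCy's rule-based `sentencizer` component instead of `en_core_web_sm`, so it runs without a downloaded model. The names `nlp_rules` and `doc_rules` are illustrative, and rule-based boundaries may differ from the trained model's on harder text.

```python
import spacy

text = ("Apple is looking at buying U.K. startup for $1 billion. "
        "Hydra is a dragon of India")

# Assumption: blank pipeline plus the rule-based sentencizer, not en_core_web_sm.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

doc_rules = nlp_rules(text)

# Each sentence is a Span with character offsets into the original text.
sentence_spans = [
    (sent.start_char, sent.end_char, sent.text) for sent in doc_rules.sents
]

for start, end, sentence_text in sentence_spans:
    print(start, end, repr(sentence_text))
```

Slicing `text[start:end]` returns the same string as `sent.text`, which is handy when results have to be mapped back onto the source document.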


### 1.3 Word Tokenization with spaCy

Each sentence yielded by `doc.sents` is a `Span` object, and iterating over a `Span` yields individual `Token` objects.

**Key Point:** spaCy handles edge cases well:
- `U.K.` stays as one token (abbreviation)
- `$1 billion` is split into `$`, `1`, `billion`
- Punctuation is separated from words

In [5]:
for sentence in doc.sents:
    for word in sentence:
        print(word)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
.
Hydra
is
a
dragon
of
India
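
To make the difference between a `Token` and a plain string concrete, the sketch below prints a few lexical attributes for each token. It assumes a blank English pipeline (`spacy.blank("en")`), which runs the English tokenizer without a downloaded model. Attributes that depend on trained components, such as `token.pos_` or `token.lemma_`, are left out because a blank pipeline does not fill them in. The names `nlp_blank`, `doc_blank`, and `token_rows` are illustrative.

```python
import spacy

# Assumption: tokenizer-only pipeline; no trained model is loaded.
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Apple is looking at buying U.K. startup for $1 billion.")

# Collect a few model-free lexical attributes for each Token:
# text, character offset, punctuation flag, currency flag, number-like flag.
token_rows = [
    (token.text, token.idx, token.is_punct, token.is_currency, token.like_num)
    for token in doc_blank
]

for row in token_rows:
    print(row)
```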


---

## 📚 Part 2: NLTK (Natural Language Toolkit)

NLTK takes a more **traditional, modular approach**. Each task (sentence splitting, word tokenization) is a separate function call.

### 2.1 Setting Up NLTK

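NLTK requires downloading specific data packages. The Punkt tokenizer data is among the most commonly used; recent NLTK releases distribute it as the `punkt_tab` package, which is what the cell below downloads.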

In [14]:
import nltk

nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/btwitsvoid/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

### 2.2 Sentence Tokenization with NLTK

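NLTK's `sent_tokenize()` uses the **Punkt sentence tokenizer**. Punkt learns abbreviations, collocations, and frequent sentence starters from raw text with an unsupervised algorithm, and NLTK ships a pre-trained English model.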

**Note:** Unlike spaCy's object-oriented approach, NLTK returns a simple **list of strings**.

In [15]:
from nltk.tokenize import sent_tokenize

sent_tokenize("Apple is looking at buying U.K. startup for $1 billion. Hydra is a dragon of India")

['Apple is looking at buying U.K. startup for $1 billion.',
 'Hydra is a dragon of India']
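
"Unsupervised" here means that Punkt can learn its parameters from raw, unlabeled text. The sketch below shows the idea by training a tokenizer on a tiny made-up string. That string is hypothetical and far too small for reliable results, and the resulting tokenizer is not the pre-trained English model that `sent_tokenize()` uses, so its output may differ from the cell above.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical toy corpus; real training needs far more raw text.
training_text = (
    "The U.K. startup raised money. The U.K. market is large. "
    "Investors in the U.K. were pleased. Apple paid $1 billion."
)

# Passing raw text to the constructor trains the tokenizer on it (no labels needed).
toy_punkt = PunktSentenceTokenizer(training_text)

text = ("Apple is looking at buying U.K. startup for $1 billion. "
        "Hydra is a dragon of India")
toy_sentences = toy_punkt.tokenize(text)

for sentence in toy_sentences:
    print(repr(sentence))
```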

### 2.3 Word Tokenization with NLTK

NLTK's `word_tokenize()` splits text into individual words and punctuation.

**Comparison with spaCy:**
- NLTK: Returns a flat list of strings
- spaCy: Returns rich Token objects with metadata (POS, lemma, etc.)

In [16]:
from nltk.tokenize import word_tokenize

word_tokenize("Apple is looking at buying U.K. startup for $1 billion. Hydra is a dragon of India")

['Apple',
 'is',
 'looking',
 'at',
 'buying',
 'U.K.',
 'startup',
 'for',
 '$',
 '1',
 'billion',
 '.',
 'Hydra',
 'is',
 'a',
 'dragon',
 'of',
 'India']
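
Because `word_tokenize()` returns plain strings, any extra information has to come from a separate call. As a hedged illustration, the sketch below recovers character offsets with `TreebankWordTokenizer.span_tokenize()`. Using `TreebankWordTokenizer` directly on a single sentence is an assumption made here so that the example needs no downloaded data. `word_tokenize()` relies on a closely related tokenizer after first splitting the text into sentences, so the two can differ on multi-sentence input.

```python
from nltk.tokenize import TreebankWordTokenizer

sentence = "Apple is looking at buying U.K. startup for $1 billion."

# Assumption: the Treebank tokenizer stands in for word_tokenize() on one sentence.
treebank = TreebankWordTokenizer()

tokens = treebank.tokenize(sentence)             # plain strings
spans = list(treebank.span_tokenize(sentence))   # (start, end) character offsets

for token, (start, end) in zip(tokens, spans):
    print(token, start, end)
```

In spaCy the equivalent offset is already available on each token as `token.idx`, with no second call.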

---

## 📊 Summary: spaCy vs NLTK for Tokenization

| Aspect | spaCy | NLTK |
|--------|-------|------|
| **Return Type** | Rich Token objects | Plain strings |
| **Processing** | All-in-one pipeline | Separate function calls |
| **Speed** | Faster | Slower |
| **Ease of Use** | More Pythonic | More verbose |
| **Additional Info** | POS, NER, dependencies included | Need separate calls |
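
Speed differences depend on the workload, the machine, and the library versions, so it is worth measuring them on your own data. The sketch below is one rough way to do that, and it times tokenization only. It assumes a blank spaCy pipeline and NLTK's `TreebankWordTokenizer`, chosen so that no model or data download is needed, and a hypothetical workload of one sentence repeated 1,000 times. It says nothing about full pipelines with tagging or NER.

```python
import timeit

import spacy
from nltk.tokenize import TreebankWordTokenizer

# Hypothetical workload: one short sentence repeated many times.
sample_docs = ["Apple is looking at buying U.K. startup for $1 billion."] * 1000

nlp_tok = spacy.blank("en")            # tokenizer only, no trained components
treebank_tok = TreebankWordTokenizer()


def tokenize_with_spacy():
    # nlp.pipe streams the texts through the pipeline in batches.
    return [[token.text for token in doc] for doc in nlp_tok.pipe(sample_docs)]


def tokenize_with_nltk():
    return [treebank_tok.tokenize(text) for text in sample_docs]


# Best of three runs for each library, in seconds.
spacy_seconds = min(timeit.repeat(tokenize_with_spacy, number=1, repeat=3))
nltk_seconds = min(timeit.repeat(tokenize_with_nltk, number=1, repeat=3))

print(f"spaCy tokenizer: {spacy_seconds:.3f} s")
print(f"NLTK Treebank:   {nltk_seconds:.3f} s")
```

Taking the minimum over several repeats reduces noise from other processes. For a fairer comparison, replace `sample_docs` with text that resembles your real input.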

### 🎯 When to Use Which?

- **Choose spaCy** when you need:
  - Production-ready code
  - Speed and efficiency
  - Multiple NLP tasks in one pass

- **Choose NLTK** when you need:
  - Learning and understanding NLP concepts
  - Access to specific algorithms or corpora
  - Simple, straightforward tokenization

---
