# NLTK - Natural Language Toolkit

## Important Points about NLTK (Natural Language Toolkit)

1. **What It Is**  
   NLTK is a powerful Python library for **natural language processing (NLP)** and **text analysis**. It helps in processing and analyzing human language data.

2. **Open Source**  
   It is open-source and freely available, widely used in academic and research settings.

3. **Wide Range of Tools**  
   NLTK provides functionalities for:
   - Tokenization (splitting text into words/sentences)
   - Stemming and Lemmatization
   - POS (Part-of-Speech) tagging
   - Named Entity Recognition (NER)
   - Parsing and Syntax Trees
   - Basic Sentiment Analysis

4. **Corpora and Lexical Resources**  
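   Provides access to many **corpora and lexical resources** (e.g., the Gutenberg and Brown corpora, and the WordNet lexical database) for training and testing NLP models. They are not bundled with the package; each one is fetched on demand with `nltk.download()`.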

5. **Integration**  
   Works well with other Python libraries like:
   - `scikit-learn` (machine learning)
   - `NumPy` (numerical computing)
   - `matplotlib` (visualization)

6. **Good for Prototyping**  
   Ideal for building prototypes and learning NLP concepts. However, it may be **slower** than other libraries like `spaCy` in production environments.

7. **Language Support**  
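   Most of NLTK's pretrained models and resources target **English**. Some components, such as the Punkt sentence tokenizer and the Snowball stemmers, also cover other languages, but coverage is uneven.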


In [None]:
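# !pip install nltk
# import nltk; nltk.download("punkt_tab")  # tokenizer data for sent_tokenize/word_tokenize ("punkt" on older NLTK releases)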

## Tokenization

### What is Tokenization?

**Tokenization** is the process of breaking a text into smaller units called **tokens**. These tokens can be:
- **Words**
- **Sentences**
- **Subwords** (depending on the application)

Tokenization is often the **first step** in Natural Language Processing (NLP), as it prepares raw text for further analysis.

### Types of Tokenization

1. **Word Tokenization**  
   Splits text into individual words or terms.
   - Example:  
     `"Hello world!"` → `['Hello', 'world', '!']`

2. **Sentence Tokenization**  
   Splits a paragraph into sentences.
   - Example:  
     `"Hello world! How are you?"` → `['Hello world!', 'How are you?']`
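
To make the two granularities concrete, here is a minimal sketch that uses only Python's standard `re` module. The helper names `naive_word_tokenize` and `naive_sent_tokenize` are hypothetical and are not part of NLTK; the patterns are deliberately simplistic and only illustrate the idea.

```python
import re

def naive_word_tokenize(text):
    """Return runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_word_tokenize("Hello world!"))
# ['Hello', 'world', '!']

print(naive_sent_tokenize("Hello world! How are you?"))
# ['Hello world!', 'How are you?']

# The naive rule breaks on abbreviations:
print(naive_sent_tokenize("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

The last call shows why a plain regex is rarely enough for sentence splitting, and why a dedicated tokenizer such as NLTK's `sent_tokenize` is usually the better choice.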


In [3]:
sample_corpus = """Hello there! How are you doing today?
I hope you're enjoying learning about Natural Language Processing.
Let's explore more with NLTK and Python."""


In [4]:
print(sample_corpus)

Hello there! How are you doing today?
I hope you're enjoying learning about Natural Language Processing.
Let's explore more with NLTK and Python.


### Paragraph → Sentences

In [5]:
from nltk.tokenize import sent_tokenize

In [8]:
documents = sent_tokenize(sample_corpus)

In [9]:
documents

['Hello there!',
 'How are you doing today?',
 "I hope you're enjoying learning about Natural Language Processing.",
 "Let's explore more with NLTK and Python."]

In [10]:
for sentence in documents:
    print(sentence)

Hello there!
How are you doing today?
I hope you're enjoying learning about Natural Language Processing.
Let's explore more with NLTK and Python.


### Paragraph → Words

In [12]:
from nltk.tokenize import word_tokenize

In [13]:
words = word_tokenize(sample_corpus)

In [14]:
words

['Hello',
 'there',
 '!',
 'How',
 'are',
 'you',
 'doing',
 'today',
 '?',
 'I',
 'hope',
 'you',
 "'re",
 'enjoying',
 'learning',
 'about',
 'Natural',
 'Language',
 'Processing',
 '.',
 'Let',
 "'s",
 'explore',
 'more',
 'with',
 'NLTK',
 'and',
 'Python',
 '.']

In [15]:
for word in words:
    print(word)

Hello
there
!
How
are
you
doing
today
?
I
hope
you
're
enjoying
learning
about
Natural
Language
Processing
.
Let
's
explore
more
with
NLTK
and
Python
.


### Sentences → Words

In [16]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', 'there', '!']
['How', 'are', 'you', 'doing', 'today', '?']
['I', 'hope', 'you', "'re", 'enjoying', 'learning', 'about', 'Natural', 'Language', 'Processing', '.']
['Let', "'s", 'explore', 'more', 'with', 'NLTK', 'and', 'Python', '.']


## `wordpunct_tokenize` in NLTK

`wordpunct_tokenize` is a tokenizer in NLTK that **splits text into runs of word characters (letters, digits, underscore) and runs of punctuation** using a single regular expression.

- It separates **words** and **punctuation** as distinct tokens.
- Useful for basic tokenization where punctuation needs to be isolated.
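
The behaviour can be reproduced with the standard `re` module. As far as the NLTK source shows, `WordPunctTokenizer` is a `RegexpTokenizer` built on the pattern `\w+|[^\w\s]+`; the sketch below assumes that pattern, and the helper name `regex_wordpunct` is hypothetical.

```python
import re

# Either a run of word characters (letters, digits, underscore)
# or a run of characters that are neither word characters nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def regex_wordpunct(text):
    """Return all non-overlapping matches of the pattern, left to right."""
    return WORDPUNCT_PATTERN.findall(text)

print(regex_wordpunct("you're"))
# ['you', "'", 're']

print(regex_wordpunct("Costs $3.50!!"))
# ['Costs', '$', '3', '.', '50', '!!']

print(regex_wordpunct("snake_case v2"))
# ['snake_case', 'v2']
```

Two details follow from the pattern: digits and underscores count as word characters, and consecutive punctuation marks such as `!!` stay together in one token.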

In [17]:
from nltk.tokenize import wordpunct_tokenize

In [18]:
wordpunct_tokenize(sample_corpus)

['Hello',
 'there',
 '!',
 'How',
 'are',
 'you',
 'doing',
 'today',
 '?',
 'I',
 'hope',
 'you',
 "'",
 're',
 'enjoying',
 'learning',
 'about',
 'Natural',
 'Language',
 'Processing',
 '.',
 'Let',
 "'",
 's',
 'explore',
 'more',
 'with',
 'NLTK',
 'and',
 'Python',
 '.']

## `TreebankWordTokenizer` in NLTK

`TreebankWordTokenizer` is a rule-based tokenizer from NLTK that uses **Penn Treebank** conventions.

- It tokenizes text into words and punctuation using specific rules.
- Handles **contractions**, **punctuation**, and **quotes** more accurately than basic tokenizers.

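#### Key Features:
- Splits contractions like `don't` into `do` and `n't`
- Treats punctuation, including parentheses and brackets, as separate tokens
- Assumes the input is a single sentence: only a period at the very end of the string is split off, so `Mr.` and `U.S.` stay intact mid-sentence, but so do sentence-final periods inside multi-sentence text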

In [19]:
from nltk.tokenize import TreebankWordTokenizer

In [20]:
tokenizer = TreebankWordTokenizer()

In [21]:
tokenizer.tokenize(sample_corpus)

['Hello',
 'there',
 '!',
 'How',
 'are',
 'you',
 'doing',
 'today',
 '?',
 'I',
 'hope',
 'you',
 "'re",
 'enjoying',
 'learning',
 'about',
 'Natural',
 'Language',
 'Processing.',
 'Let',
 "'s",
 'explore',
 'more',
 'with',
 'NLTK',
 'and',
 'Python',
 '.']
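
Note the token `'Processing.'` in the output above: the period stays attached because the tokenizer only splits a period at the very end of its input. A common workaround is to tokenize one sentence at a time. The sketch below uses a short, made-up pair of sentences that is already split by hand, so it needs no downloaded tokenizer data.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Two sentences in one string: only the last period is split off.
joined = tokenizer.tokenize("I like NLP. It is fun.")
print(joined)
# ['I', 'like', 'NLP.', 'It', 'is', 'fun', '.']

# One sentence per call: every sentence-final period becomes its own token.
sentences = ["I like NLP.", "It is fun."]  # illustrative input, split by hand
per_sentence = [tokenizer.tokenize(s) for s in sentences]
print(per_sentence)
# [['I', 'like', 'NLP', '.'], ['It', 'is', 'fun', '.']]
```

In this notebook, the `documents` list produced by `sent_tokenize(sample_corpus)` could take the place of the hand-written `sentences` list.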

### Comparison of NLTK Tokenizers

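| Feature                        | `word_tokenize`                     | `wordpunct_tokenize`                   | `TreebankWordTokenizer`               |
|-------------------------------|-------------------------------------|----------------------------------------|---------------------------------------|
| Based on                      | Punkt sentence splitting + Treebank-style word rules | Simple regex (word-character runs vs. punctuation runs) | Penn Treebank rules                   |
| Handles contractions well     | ✅ (e.g., "don't" → ["do", "n't"])   | ❌ ("don't" → ["don", "'", "t"])       | ✅ ("don't" → ["do", "n't"])          |
| Punctuation as separate token | ✅                                   | ✅ (adjacent marks stay together, e.g. `!"`) | ✅                                     |
| Splits sentence-final periods in multi-sentence text | ✅ (splits into sentences first) | ✅ (but also splits periods inside abbreviations and numbers) | ❌ (only the last period of the input) |
| Splits at every punctuation run | ❌                                 | ✅ (only letters, digits, and `_` are kept together) | ❌                                     |
| Typical use                   | General-purpose default             | Quick splitting when punctuation must be isolated; too aggressive for contractions and decimals | Input already split into sentences, e.g. Treebank-style parsing pipelines |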

---

In [22]:
from nltk.tokenize import word_tokenize, wordpunct_tokenize, TreebankWordTokenizer

text = "She said, \"Don't do it!\""

print("word_tokenize:", word_tokenize(text))
print("wordpunct_tokenize:", wordpunct_tokenize(text))
print("TreebankWordTokenizer:", TreebankWordTokenizer().tokenize(text))

word_tokenize: ['She', 'said', ',', '``', 'Do', "n't", 'do', 'it', '!', "''"]
wordpunct_tokenize: ['She', 'said', ',', '"', 'Don', "'", 't', 'do', 'it', '!"']
TreebankWordTokenizer: ['She', 'said', ',', '``', 'Do', "n't", 'do', 'it', '!', "''"]
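
The difference in how punctuation is split also shows up on numeric text. The following sketch compares the two tokenizers that need no downloaded data on a made-up sentence containing a price; the expected tokens in the comments assume the standard Treebank punctuation rules.

```python
from nltk.tokenize import TreebankWordTokenizer, wordpunct_tokenize

price_text = "It costs $3.50."

# Regex-based: every punctuation run is split off, including the decimal point.
wordpunct_tokens = wordpunct_tokenize(price_text)
print("wordpunct_tokenize:   ", wordpunct_tokens)
# ['It', 'costs', '$', '3', '.', '50', '.']

# Treebank rules: the currency sign and the final period are split, the decimal is kept.
treebank_tokens = TreebankWordTokenizer().tokenize(price_text)
print("TreebankWordTokenizer:", treebank_tokens)
# ['It', 'costs', '$', '3.50', '.']
```

If numbers, prices, or version strings matter downstream, the Treebank-style tokenizers are usually the safer choice.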
