# Introduction to NLTK (Natural Language Toolkit)

**NLTK** is one of the most popular libraries for Natural Language Processing in Python. It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and more.

### Getting Started
Before using NLTK features that depend on external data (such as lemmatization or sentence tokenization), make sure the library is installed and the required data packages are downloaded to your machine.

1. **Installation:**
   ```bash
   pip install nltk
   ```

2. **Downloading Resources:**
   Some NLTK components need data packages that are not bundled with the library. Run the following in a code cell to download the ones used in this notebook:
   ```python
   import nltk
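   nltk.download('punkt')      # Punkt sentence tokenizer models (older NLTK releases)
   nltk.download('punkt_tab')  # Punkt models in the format newer NLTK releases expect
   nltk.download('wordnet')    # WordNet data for lemmatization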
   ```

# NLP Foundations: Text Preprocessing and Tokenization

This notebook demonstrates essential techniques for preparing raw text for Natural Language Processing. We will cover:
1. **Tokenization** (Word and Sentence level)
2. **Normalization** (Lowercasing and Punctuation removal)
3. **Vocabulary Building**
4. **Morphological Processing** (Stemming and Lemmatization)

## 1. Basic Word Tokenization
The simplest way to tokenize text is Python's built-in `str.split()` method. Called without arguments, it splits on any run of whitespace (spaces, tabs, and newlines) and discards empty strings.
Note how punctuation remains attached to the words (`'pool,'` and `'stars.'`).

In [6]:
corpus = "They picnicked by the pool, then they lay back on the grass and looked at the stars."

words = corpus.split()
print("The count of corpus words:", len(words))
print(words)

The count of corpus words: 17
['They', 'picnicked', 'by', 'the', 'pool,', 'then', 'they', 'lay', 'back', 'on', 'the', 'grass', 'and', 'looked', 'at', 'the', 'stars.']
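
To make the whitespace behaviour concrete, the short sketch below contrasts `split()` with `split(" ")` on a string that mixes tabs, newlines, and repeated spaces. The `messy` string is an illustrative example, not part of the notebook's corpus.

```python
messy = "They  picnicked\tby the\npool,   then left."

# No argument: any run of whitespace counts as one separator, empty strings are dropped
default_split = messy.split()

# Explicit single space: tabs and newlines stay inside tokens,
# and each extra space produces an empty string
space_split = messy.split(" ")

print(default_split)
# ['They', 'picnicked', 'by', 'the', 'pool,', 'then', 'left.']
print(space_split)
# ['They', '', 'picnicked\tby', 'the\npool,', '', '', 'then', 'left.']
```

For tokenization, the no-argument form is almost always the one you want.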


### Regex-based Tokenization
To treat punctuation as separate tokens, we can use regular expressions (the `re` module). The pattern `\w+|[^\w\s]` matches either a run of word characters (letters, digits, and underscores) or a single character that is neither a word character nor whitespace.

In [7]:
import re

corpus = "They picnicked by the pool, then they lay back on the grass and looked at the stars."

tokens = re.findall(r"\w+|[^\w\s]", corpus)
print("The count of tokens including punctuation:", len(tokens))
print(tokens)

The count of tokens including punctuation: 19
['They', 'picnicked', 'by', 'the', 'pool', ',', 'then', 'they', 'lay', 'back', 'on', 'the', 'grass', 'and', 'looked', 'at', 'the', 'stars', '.']
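
The same pattern behaves less intuitively on contractions, decimals, and identifiers. The sketch below runs it on a made-up `sample` string (an assumption for illustration) to show where it splits.

```python
import re

TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

sample = "Don't pay $3.50 for snake_case!"
tokens = TOKEN_PATTERN.findall(sample)
print(tokens)
# ['Don', "'", 't', 'pay', '$', '3', '.', '50', 'for', 'snake_case', '!']
```

The apostrophe and the decimal point become standalone tokens, while the underscore stays inside `snake_case` because `\w` includes it. Whether that is acceptable depends on the downstream task.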


## 2. Building a Vocabulary
A **vocabulary** is the set of unique tokens in a corpus. Converting the token list to a Python `set` removes duplicates; sets are unordered, so the printed order can differ between runs.

In [8]:
corpus = "They picnicked by the pool, then they lay back on the grass and looked at the stars."

words = corpus.split()
vocabulary = set(words)
print(vocabulary)
print("The size of vocabulary is:", len(vocabulary))

{'on', 'pool,', 'They', 'they', 'by', 'back', 'grass', 'and', 'at', 'stars.', 'then', 'the', 'lay', 'looked', 'picnicked'}
The size of vocabulary is: 15


### Case Normalization
Notice that "They" and "they" are counted as different tokens above. Lowercasing the text before tokenizing maps both to the same token, which shrinks the vocabulary from 15 to 14.

In [10]:
corpus = "They picnicked by the pool, then they lay back on the grass and looked at the stars."

words = corpus.lower().split()
vocabulary = set(words)
print(vocabulary)
print("The size of vocabulary after using lowercasing is:", len(vocabulary))

{'on', 'pool,', 'they', 'by', 'back', 'grass', 'and', 'at', 'stars.', 'then', 'the', 'lay', 'looked', 'picnicked'}
The size of vocabulary after using lowercasing is: 14
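
A vocabulary often needs frequencies as well as unique tokens. As a minimal sketch, the code below combines lowercasing with a word-only regex and counts tokens using `collections.Counter`; the choice of `\w+` (which drops punctuation entirely) is an assumption made for this example.

```python
import re
from collections import Counter

corpus = "They picnicked by the pool, then they lay back on the grass and looked at the stars."

# Lowercase first, then keep only runs of word characters so 'pool,' becomes 'pool'
words = re.findall(r"\w+", corpus.lower())

counts = Counter(words)
vocabulary = sorted(counts)  # a sorted list gives a reproducible order, unlike a raw set

print("Total tokens:", len(words))          # 17
print("Vocabulary size:", len(vocabulary))  # 14
print(counts.most_common(2))                # [('the', 3), ('they', 2)]
```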


## 3. Punctuation Removal
In some tasks (like simple sentiment analysis), punctuation isn't needed. We can use a translation table to delete every character listed in `string.punctuation`, which covers ASCII punctuation only.

In [13]:
import string

def remove_punctuation(text):
    # Build a table that maps every ASCII punctuation character to None (delete it)
    return text.translate(str.maketrans('', '', string.punctuation))

corpus = "They picnicked by the pool, then they lay back on the grass and looked at the stars."
clean_corpus = remove_punctuation(corpus)
print("Before:", corpus)
print("After:", clean_corpus)


Before: They picnicked by the pool, then they lay back on the grass and looked at the stars.
After: They picnicked by the pool then they lay back on the grass and looked at the stars
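
Text copied from the web or a word processor often contains curly quotes and ellipsis characters, which `string.punctuation` does not include. The sketch below compares the ASCII approach with one possible alternative based on Unicode categories; `remove_unicode_punctuation` and the `sample` string are illustrative names, not part of the notebook.

```python
import string
import unicodedata

def remove_ascii_punctuation(text):
    # Same approach as remove_punctuation above: delete ASCII punctuation only
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_unicode_punctuation(text):
    # Drop every character whose Unicode general category starts with "P" (punctuation)
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

# Curly quotes, an ellipsis character, and a curly apostrophe, written as escapes
sample = "\u201cWait\u201d, she said\u2026 it\u2019s done."

ascii_cleaned = remove_ascii_punctuation(sample)
unicode_cleaned = remove_unicode_punctuation(sample)

print(ascii_cleaned)    # curly quotes, ellipsis, and apostrophe survive
print(unicode_cleaned)  # Wait she said its done
```

Note that the category-based version keeps characters such as `$` and `+`, which Unicode classifies as symbols rather than punctuation, so the two functions are not interchangeable.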


## 4. Stemming vs. Lemmatization
* **Stemming:** A heuristic process that chops off the ends of words (e.g., "reporting" becomes "report"). It is fast but can result in non-dictionary words.
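* **Lemmatization:** Uses vocabulary and morphological analysis to return the base or dictionary form of a word (the lemma). NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is passed (for example `pos='v'`), which is why only the plural noun "presentations" changes in the output below.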

In [21]:
import nltk
words = ["emailing", "replying", "reporting", "presentations", "meeting", "scheduling"]
porter = nltk.PorterStemmer()
stems = [porter.stem(word) for word in words]
print("Words after stemming")
print(stems)

wnl = nltk.WordNetLemmatizer()
lemmas = [wnl.lemmatize(word) for word in words]
print("Words after lemmatization")
print(lemmas)

Words after stemming
['email', 'repli', 'report', 'present', 'meet', 'schedul']
Words after lemmatization
['emailing', 'replying', 'reporting', 'presentation', 'meeting', 'scheduling']
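
The difference between the two ideas can be illustrated without NLTK. The sketch below is a deliberately crude toy, not the Porter algorithm or WordNet: `toy_stem` strips a few hard-coded suffixes, and `toy_lemmatize` looks words up in a hypothetical three-entry dictionary.

```python
def toy_stem(word):
    """Strip the first matching suffix: a crude stand-in for a rule-based stemmer."""
    for suffix in ("ations", "ing", "s"):
        # Keep at least three characters so short words are not destroyed
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a real lexical resource such as WordNet
TOY_LEMMAS = {
    "scheduling": "schedule",
    "replying": "reply",
    "presentations": "presentation",
}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

words = ["scheduling", "replying", "presentations"]
toy_stems = [toy_stem(w) for w in words]
toy_lemmas = [toy_lemmatize(w) for w in words]

print(toy_stems)   # ['schedul', 'reply', 'present']
print(toy_lemmas)  # ['schedule', 'reply', 'presentation']
```

Rules need no dictionary and never fail, but they can produce non-words such as `schedul`. A lookup always returns a real word, but only for entries the resource knows about.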


## 5. Advanced NLTK Tokenizers
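The `TreebankWordTokenizer` follows the tokenization conventions of the Penn Treebank corpus, such as splitting contractions (`don't` becomes `do` and `n't`). The example sentence contains no contractions, so its output matches the regex tokenizer above.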

In [22]:
from nltk.tokenize.treebank import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
corpus = "They picnicked by the pool, then they lay back on the grass and looked at the stars."
tokens = tokenizer.tokenize(corpus)
print("tokens")
print(tokens)

tokens
['They', 'picnicked', 'by', 'the', 'pool', ',', 'then', 'they', 'lay', 'back', 'on', 'the', 'grass', 'and', 'looked', 'at', 'the', 'stars', '.']
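
To show what contraction splitting looks like, the sketch below approximates the convention with a single regular expression. It is a rough illustration only, not NLTK's implementation, and `CONTRACTION_AWARE` and `sample` are names invented for this example.

```python
import re

# Alternatives are tried left to right:
#   \w+(?=n't)  word characters directly followed by "n't" ("do" in "don't")
#   n't         the negation clitic as its own token
#   '\w+        other clitics such as 's, 're, 'll
#   \w+         ordinary words
#   [^\w\s]     any remaining punctuation mark
CONTRACTION_AWARE = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")

sample = "They don't think it's Peter's."
tokens = CONTRACTION_AWARE.findall(sample)
print(tokens)
# ['They', 'do', "n't", 'think', 'it', "'s", 'Peter', "'s", '.']
```

This pattern ignores many cases a real tokenizer handles (single-quoted words and uppercase contractions, for instance), which is a good reason to prefer `TreebankWordTokenizer` in practice.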


## 6. Sentence Tokenization
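Breaking a paragraph into individual sentences is crucial for many NLP pipelines. NLTK's `sent_tokenize` uses a pre-trained Punkt model that recognizes common abbreviations (like "p.m." and "U.S.A.") so their periods aren't mistaken for sentence endings. `PunktTokenizer` exposes the same model as a class, which is why both calls below print the same result.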

In [29]:
from nltk.tokenize import PunktTokenizer
from nltk.tokenize import sent_tokenize

corpus = '''He plays well. She said "Peter is back!". I'm going at 5 p.m. to U.S.A.'''
sent_tokenizer = PunktTokenizer()
sentences = sent_tokenizer.tokenize(corpus)
print(sentences)

sentences = sent_tokenize(corpus)
print(sentences)

['He plays well.', 'She said "Peter is back!".', "I'm going at 5 p.m. to U.S.A."]
['He plays well.', 'She said "Peter is back!".', "I'm going at 5 p.m. to U.S.A."]
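
For comparison, here is what a naive rule produces on the same text. The sketch below splits after every `.`, `!`, or `?` that is followed by whitespace; the regex is an assumption chosen to represent the simplest plausible approach.

```python
import re

corpus = '''He plays well. She said "Peter is back!". I'm going at 5 p.m. to U.S.A.'''

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", corpus)

print(len(naive_sentences))  # 4
for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule breaks the last sentence after "p.m.", yielding four pieces instead of the three that Punkt returns.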


---

## 🧪 Student Exercise: The Preprocessing Pipeline

**Objective:** Create a pipeline that processes a multi-sentence corpus into a clean list of lemmas.

**Tasks:**
1. **Sentence Tokenization:** Split the provided `exercise_corpus` into individual sentences.
2. **Normalization:** For each sentence, remove punctuation and convert all text to lowercase.
3. **Word Tokenization:** Split the cleaned sentences into individual words (tokens).
4. **Lemmatization:** Reduce each word to its base form (lemma) using the `WordNetLemmatizer`.
5. **Vocabulary:** Print the final size of your unique vocabulary.

**Challenge:** Compare your lemmatized results with a stemmed version. Which one makes more sense for a human reader?

In [None]:
import nltk
import string
from nltk.tokenize import sent_tokenize, TreebankWordTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Sample Corpus for the exercise
exercise_corpus = """
The students are studying hard for their NLP exams. 
They are practicing tokenization, stemming, and lemmatization techniques! 
Does practicing these exercises help them understand the foundations? 
Yes, it definitely helps.
"""
# Your code goes here:
