# Module 5.1: Natural Language Processing (NLP) with NLTK

Welcome to the advanced topics module! We're starting with **Natural Language Processing (NLP)**, a subfield of Artificial Intelligence focused on enabling computers to understand, interpret, and generate human language. 🗣️↔️💻

From chatbots and language translation to sentiment analysis and spam detection, NLP is the driving force behind many of the AI applications we use daily.

**Goal of this Notebook:**
This is not just a coding tutorial. We will dive into the **fundamental theory** of how we prepare text data for machine learning. This process is often called the **NLP pipeline**. We will cover:

1.  **The Core Challenge:** Why is human language so difficult for computers?
2.  **The NLP Pre-processing Pipeline:** A theoretical overview.
3.  **Tokenization:** Breaking down text into meaningful units.
4.  **Stop Word Removal:** Filtering out common, non-informative words.
5.  **Stemming & Lemmatization:** Normalizing words to their root form.

## 1. The Core Challenge: Ambiguity and Context

Human language is inherently ambiguous. Consider the sentence: "*I saw a man on a hill with a telescope.*"

Does this mean:
* The man was on a hill, and he was holding a telescope?
* You were on a hill when you saw a man who had a telescope?
* You looked through a telescope and saw a man on a hill?

Humans use context to understand the intended meaning. Computers, however, see only a sequence of characters. The goal of NLP pre-processing is to break down this complex, unstructured text into a structured, numerical format that a machine learning model can understand, while trying to preserve as much of the core meaning as possible.

## 2. The NLP Pre-processing Pipeline

Before we can analyze text, we must clean and standardize it. A typical pipeline involves several steps:

1.  **Raw Text:** The original, unstructured text.
2.  **Tokenization:** Split text into individual words (tokens).
3.  **Normalization (case folding):** Convert all tokens to lowercase so that 'The' and 'the' count as the same token.
4.  **Stop Word Removal:** Remove common words like 'the', 'is', 'a'.
5.  **Stemming/Lemmatization:** Reduce words to their root form (e.g., 'running' -> 'run').
6.  **Vectorization:** Convert the clean tokens into numerical vectors (this is a more advanced topic we won't cover here, but it's the final step before machine learning).
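To make the steps above concrete before turning to NLTK, here is a minimal sketch that uses only the Python standard library. The stop word set, the suffix-stripping rule, and the names `TOY_STOP_WORDS`, `toy_stem`, and `toy_pipeline` are made up for illustration; they are far cruder than what NLTK provides.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (real lists are much longer).
TOY_STOP_WORDS = {"the", "is", "a", "on"}

def toy_stem(token):
    """Crude stand-in for a stemmer: strip one common suffix."""
    for suffix in ("ing", "s"):
        # Only strip when at least three characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def toy_pipeline(raw_text):
    tokens = re.findall(r"[A-Za-z]+", raw_text)               # 2. tokenization (words only)
    tokens = [t.lower() for t in tokens]                      # 3. lowercase
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]   # 4. stop word removal
    return [toy_stem(t) for t in tokens]                      # 5. crude stemming

clean_tokens = toy_pipeline("The dog is jumping on the beds.")
print(clean_tokens)  # ['dog', 'jump', 'bed']

# 6. Vectorization in its simplest form: count how often each token occurs.
print(Counter(clean_tokens))
```

The rest of this notebook replaces each toy step with its NLTK counterpart.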

We will now implement steps 2-5 using **NLTK (Natural Language Toolkit)**, a foundational library for NLP in Python.

### Installing NLTK's Data Models

NLTK is a powerful library, but it keeps its code separate from the data models it uses (like dictionaries, lists of words, and pre-trained algorithms). We need to download these data packages separately.

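* **`punkt`**: A pre-trained model that knows how to split text into sentences; NLTK's word tokenizer relies on it. Recent NLTK releases (3.9 and later) load this model from a package named `punkt_tab`, so download that one as well if your version asks for it.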
* **`stopwords`**: A pre-compiled list of common, low-information words for various languages (e.g., 'a', 'the', 'in').
* **`wordnet`**: A large lexical database of English words that helps NLTK understand relationships between words, which is essential for accurate lemmatization.

In [None]:
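import nltk

# Download the data packages used in this notebook (skipped if already up to date).
nltk.download('punkt')      # tokenizer model
nltk.download('punkt_tab')  # same model in the format newer NLTK releases look for
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')    # lexical database used for lemmatization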

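> **Troubleshooting Note:** If a later cell raises a `LookupError`, NLTK could not find a data package it needs. The error message names the missing resource, so first try downloading it with `nltk.download('<resource name>')`. If the data is already downloaded and the error persists, restart the kernel (**Kernel > Restart Kernel**) and run the cells again.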
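One way to check up front which packages are missing is sketched below. It relies on `nltk.data.find`, which raises `LookupError` when a resource cannot be located. The `REQUIRED` mapping and the `missing_resources` helper are illustrative names, not part of NLTK, and the resource paths assume NLTK's standard data layout.

```python
import nltk

# Resource paths as NLTK stores them inside its data directory (assumed layout).
# Add "punkt_tab": "tokenizers/punkt_tab" if your NLTK version asks for it.
REQUIRED = {
    "punkt": "tokenizers/punkt",
    "stopwords": "corpora/stopwords",
    "wordnet": "corpora/wordnet",
}

def missing_resources(required=REQUIRED):
    """Return the names of data packages NLTK cannot locate."""
    missing = []
    for name, path in required.items():
        try:
            nltk.data.find(path)
        except LookupError:
            missing.append(name)
    return missing

print("Missing packages:", missing_resources())
```

An empty list means all three packages were found; otherwise, pass each reported name to `nltk.download`.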

## 3. Tokenization

**Theory:** Tokenization is the process of splitting a string of text into a list of smaller units, or "tokens". The most common form is **word tokenization**, where the text is split on whitespace and punctuation, and each punctuation mark becomes a token of its own. This is the very first step in making text manageable for a computer.

**Example:** `"The cat sat on the mat."` becomes `['The', 'cat', 'sat', 'on', 'the', 'mat', '.']`
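To see why splitting on spaces alone is not enough, the sketch below compares `str.split()` with a simple regular expression on the example sentence. The regex is an illustrative stand-in, not what `word_tokenize` does internally.

```python
import re

text = "The cat sat on the mat."

# Naive approach: split on whitespace only. The period stays glued to 'mat'.
whitespace_tokens = text.split()
print(whitespace_tokens)  # ['The', 'cat', 'sat', 'on', 'the', 'mat.']

# Match runs of word characters, or any single character that is
# neither a word character nor whitespace (i.e. punctuation).
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)  # ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
```

Real tokenizers go further than this pattern, for instance when handling contractions and abbreviations, which is one reason to use a library.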

In [None]:
from nltk.tokenize import word_tokenize

sample_text = "Data science is an amazing field. It involves a lot of learning and practice!"

# Lowercase the text before tokenizing so that 'Data' and 'data' become the same token.
tokens = word_tokenize(sample_text.lower())
print(tokens)

## 4. Stop Word Removal

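**Theory:** Stop words are common words in a language (like 'the', 'a', 'in', 'is', 'it') that occur very frequently but carry little semantic meaning. By removing them, we reduce the noise in our data and allow the model to focus on the more important words. Whether a word is truly uninformative depends on the task: NLTK's English list includes 'not', for example, which can matter for sentiment analysis.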

**Example:** In `['the', 'cat', 'sat', 'on', 'the', 'mat']`, the stop words are `['the', 'on', 'the']`. After removal, we get `['cat', 'sat', 'mat']`.
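The example above can be reproduced with a plain list comprehension. The sketch below uses a made-up mini stop word set (`toy_stop_words`) and also shows a common pitfall: stop word lists are usually lowercase, so tokens must be lowercased before the comparison.

```python
# Hypothetical mini stop word list, all lowercase.
toy_stop_words = {"the", "on", "a", "is"}

tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Without lowercasing, 'The' does not match 'the' and slips through.
case_sensitive = [t for t in tokens if t not in toy_stop_words]
print(case_sensitive)  # ['The', 'cat', 'sat', 'mat']

# Lowercase each token before the membership test.
case_insensitive = [t for t in tokens if t.lower() not in toy_stop_words]
print(case_insensitive)  # ['cat', 'sat', 'mat']
```

Storing the stop words in a `set` keeps each membership test fast, which matters once the token list grows large.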

In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

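# Keep a token only if it is purely alphabetic (this drops punctuation and numbers)
# and it is not a stop word.
filtered_tokens = [word for word in tokens if word.isalpha() and word not in stop_words]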

print("--- Tokens before stop word removal ---")
print(tokens)
print("\n--- Tokens after stop word removal ---")
print(filtered_tokens)

## 5. Stemming & Lemmatization

**Theory:** These are two techniques for **normalization**—the process of reducing a word to its base or root form. This is crucial because a model should understand that 'run', 'runs', and 'running' all refer to the same concept.

**Stemming:** A crude, rule-based process that chops off the ends of words. It's fast, but the resulting stem is often not a real word.
* `studies`, `studying` -> `studi`
* `ponies` -> `poni`
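The stems listed above can be checked directly with NLTK's `PorterStemmer`. Because the stemmer is purely rule-based, this small sketch should run without any of the downloaded data packages.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()  # rule-based, so no data download is needed

words = ["studies", "studying", "ponies", "runs", "running"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Note that `studi` and `poni` are not English words; the stemmer only guarantees that related forms collapse to the same string.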

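**Lemmatization:** A more sophisticated process that uses a dictionary (like WordNet) to find the actual root form of a word, known as the **lemma**. The correct lemma depends on the word's part of speech (POS); NLTK's `WordNetLemmatizer` treats every word as a noun unless you pass a different `pos` argument.
* `studies` -> `study`
* `ponies` -> `pony`
* `studying` -> `study` (when treated as a verb, `pos='v'`)
* `better` -> `good` (when treated as an adjective, `pos='a'`)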

Generally, **lemmatization is preferred** because it produces a valid word, though it is computationally slower than stemming.
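The dictionary-lookup idea behind lemmatization, and the reason the part of speech (POS) matters, can be sketched with a toy lookup table. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical stand-ins for WordNet and for NLTK's lemmatizer; the default `pos="n"` mirrors the default of `WordNetLemmatizer.lemmatize`.

```python
# Hypothetical miniature dictionary keyed by (word, part of speech).
# WordNet plays this role in NLTK, with far broader coverage.
TOY_LEMMAS = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("studying", "v"): "study",
    ("ponies", "n"): "pony",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma; fall back to the word itself when it is unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("ponies"))             # pony
print(toy_lemmatize("better"))             # better (no noun entry, so unchanged)
print(toy_lemmatize("better", pos="a"))    # good
print(toy_lemmatize("studying", pos="v"))  # study
```

The same word can come back unchanged or reduced depending on `pos`, which is why lemmatizing without POS information often leaves verbs and adjectives untouched.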

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

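stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
# lemmatize() treats each word as a noun by default (pos='n'), so verb forms
# may come back unchanged unless you pass pos='v'.
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]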

print("--- Original Filtered Tokens ---")
print(filtered_tokens)
print("\n--- Stemmed Tokens ---")
print(stemmed_tokens)
print("\n--- Lemmatized Tokens ---")
print(lemmatized_tokens)

## ✅ What's Next?

You now understand the foundational pre-processing steps that are required for almost any NLP task. This process of cleaning and normalizing text is what turns messy human language into structured data ready for machine learning.

In the next notebook, we'll explore another advanced topic: **Time Series Forecasting**, where the goal is to predict future values based on historical data.