# NLTK Complete Guide - Section 1: Introduction & Setup

This notebook covers:
- What is NLTK?
- Installation
- Downloading NLTK Data
- Verifying Installation
- Quick Test of Core Features

## 1.1 What is NLTK?

**NLTK (Natural Language Toolkit)** is a leading platform for building Python programs to work with human language data.

It provides:
- Easy-to-use interfaces to **50+ corpora** and lexical resources
- Text processing libraries for **classification, tokenization, stemming, tagging, parsing**
- Wrappers for industrial-strength NLP libraries

## 1.2 Installation

Run these commands in a notebook cell, or drop the leading `!` to run them in your terminal:

In [None]:
# Install NLTK (uncomment to run)
# !pip install nltk

# Install NLTK together with optional scientific packages
# !pip install nltk numpy scipy matplotlib
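
To confirm that the install succeeded before moving on, one option is to query the installed distribution with the standard library. The sketch below assumes Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is illustrative and not part of NLTK.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("nltk")
if version is None:
    print("NLTK is not installed - run: pip install nltk")
else:
    print(f"NLTK {version} is installed")
```

Unlike `import nltk`, this check does not raise an error when the package is missing, so it is safe to run at the top of a fresh notebook.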

## 1.3 Downloading NLTK Data

Most NLTK features (tokenizers, taggers, corpora, lexicons) rely on data packages that are downloaded separately from the library itself.

In [None]:
import nltk

# Check NLTK version
print(f"NLTK Version: {nltk.__version__}")

In [None]:
# Download essential packages (run this once)
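essential_packages = [
    'punkt',                            # Tokenizer models (older NLTK releases)
    'punkt_tab',                        # Tokenizer models (newer NLTK releases)
    'stopwords',                        # Stop words
    'wordnet',                          # WordNet lexical database
    'averaged_perceptron_tagger',       # POS tagger (older NLTK releases)
    'averaged_perceptron_tagger_eng',   # POS tagger (newer NLTK releases)
    'maxent_ne_chunker',                # Named Entity chunker
    'words',                            # Word list
    'vader_lexicon',                    # Sentiment analysis
    'omw-1.4',                          # Open Multilingual WordNet
]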

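failed = []
for package in essential_packages:
    print(f"Downloading {package}...")
    # nltk.download returns False when a package cannot be fetched
    if not nltk.download(package, quiet=True):
        failed.append(package)

if failed:
    print(f"\n❌ Failed to download: {', '.join(failed)}")
else:
    print("\n✅ All essential packages downloaded!")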

In [None]:
# Alternative: Download ALL data (takes longer, ~3GB)
# nltk.download('all')

# Alternative: Open interactive GUI downloader
# nltk.download()

## 1.4 Verifying Installation

In [None]:
# Test basic tokenization
from nltk.tokenize import word_tokenize

text = "Hello, NLTK is working!"
tokens = word_tokenize(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print("\n✅ NLTK is working correctly!")

In [None]:
# Check which data packages are available
from nltk.data import find

packages_to_check = [
    ('tokenizers/punkt', 'punkt'),
    ('corpora/stopwords', 'stopwords'),
    ('corpora/wordnet', 'wordnet'),
    ('taggers/averaged_perceptron_tagger', 'averaged_perceptron_tagger'),
    ('chunkers/maxent_ne_chunker', 'maxent_ne_chunker'),
    ('corpora/words', 'words'),
    ('sentiment/vader_lexicon', 'vader_lexicon'),
    ('corpora/omw-1.4', 'omw-1.4'),
]

print("NLTK Data Packages Status:")
print("-" * 40)

for path, name in packages_to_check:
    try:
        find(path)
        print(f"✅ {name}")
    except LookupError:
        print(f"❌ {name} - run: nltk.download('{name}')")
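
The status check above can be combined with the download step so that only missing packages are fetched. This is a minimal sketch: `ensure_resource` is a hypothetical helper, and its `finder` and `downloader` parameters exist only so the logic can be exercised without network access. When they are omitted, it falls back to `nltk.data.find` and `nltk.download`.

```python
def ensure_resource(path, name, finder=None, downloader=None):
    """Download `name` only if `path` cannot be found in the NLTK data dirs.

    Returns "present", "downloaded", or "failed".
    """
    if finder is None or downloader is None:
        import nltk  # imported lazily so the helper can be tested with fakes

        finder = finder or nltk.data.find
        downloader = downloader or (lambda pkg: nltk.download(pkg, quiet=True))

    try:
        finder(path)
        return "present"
    except LookupError:
        # nltk.download returns True on success and False on failure
        return "downloaded" if downloader(name) else "failed"
```

For example, `ensure_resource('corpora/stopwords', 'stopwords')` returns `"present"` on a machine where the stop word lists are already installed, and only touches the network otherwise.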

## 1.5 Quick Test - All Core Features

In [None]:
# Quick test of core NLTK features
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag

text = "NLTK is a powerful library for natural language processing. It helps developers analyze text easily."

# Sentence tokenization
sentences = sent_tokenize(text)
print(f"Sentences: {sentences}\n")

# Word tokenization
tokens = word_tokenize(text)
print(f"Tokens: {tokens}\n")

# Stopwords
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w.lower() not in stop_words and w.isalpha()]
print(f"Without stopwords: {filtered}\n")

# Stemming
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered]
print(f"Stemmed: {stemmed}\n")

# Lemmatization (every word is treated as a noun unless a pos argument is passed)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w.lower()) for w in filtered]
print(f"Lemmatized: {lemmatized}\n")

# POS Tagging
tagged = pos_tag(tokens)
print(f"POS Tagged: {tagged}")
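
`WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, so verb and adjective forms may come back unchanged. One common workaround is to map the Penn Treebank tags produced by `pos_tag` to WordNet's single-letter codes. The sketch below assumes that mapping; `penn_to_wordnet` and `lemmatize_tagged` are illustrative names, not NLTK functions.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r', or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, which is also the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs using each word's own part of speech."""
    return [
        lemmatizer.lemmatize(word.lower(), pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
        if word.isalpha()  # skip punctuation and numbers
    ]
```

With the variables from the cell above, `lemmatize_tagged(tagged, lemmatizer)` passes the verb code for tokens tagged as verbs instead of falling back to the noun default.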

## Summary

| Task | Code |
|------|------|
| Install NLTK | `pip install nltk` |
| Download data | `nltk.download('package_name')` |
| Download all | `nltk.download('all')` |
| Check version | `nltk.__version__` |

### Essential Packages
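- `punkt` - Tokenizer models (`punkt_tab` on newer NLTK releases)
- `stopwords` - Stop words
- `wordnet` - WordNet lexical database
- `averaged_perceptron_tagger` - POS tagger (`averaged_perceptron_tagger_eng` on newer NLTK releases)
- `maxent_ne_chunker` - Named Entity chunker
- `words` - Word list
- `vader_lexicon` - Sentiment analysis
- `omw-1.4` - Open Multilingual WordNet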