# NLTK Library (Natural Language Toolkit)

NLTK (Natural Language Toolkit) is a widely used open-source Python library for Natural Language Processing (NLP). It bundles text-processing modules, corpora, and lexical resources aimed at research and education, with tools for tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, parsing, and semantic reasoning.

---

## Key Features
1. **Tokenization** – Splitting text into words or sentences.
2. **Text Normalization** – Lowercasing, removing punctuation, and standardizing words.
3. **Stemming and Lemmatization** – Reducing words to their root or base form.
4. **Part-of-Speech (POS) Tagging** – Assigning grammatical tags to words.
5. **Named Entity Recognition (NER)** – Identifying entities like names, locations, and organizations.
6. **Parsing and Syntax Trees** – Analyzing grammatical structure.
7. **Corpora Access** – Built-in datasets like Brown Corpus, Gutenberg Corpus, and WordNet.
8. **Text Classification** – Building and evaluating simple classifiers.
9. **Integration with Other Libraries** – Works alongside scikit-learn (for example through the `nltk.classify.SklearnClassifier` wrapper), pandas, and more.

---

## Common NLP Tasks with NLTK
- **Tokenization:** Breaking down text into words or sentences.
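- **Stopword Removal:** Filtering out very common words (such as “is” or “for”) that carry little meaning on their own.
- **Stemming:** Stripping suffixes heuristically to get a crude stem (e.g., “running” → “run”); the stem is not always a dictionary word.
- **Lemmatization:** Using a vocabulary (WordNet) and the word's part of speech to get the proper base form (e.g., “better” → “good” when treated as an adjective).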
- **POS Tagging:** Identifying if a word is a noun, verb, adjective, etc.
- **NER:** Detecting proper nouns and classifying them into categories.
- **Parsing:** Understanding sentence grammar.
- **Text Classification:** Categorizing text into predefined classes.
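
Text classification is listed above but is not demonstrated in the cells that follow. As a minimal sketch, assuming a tiny hand-made training set (the phrases, the labels, and the `features` helper are all invented for illustration and do not come from an NLTK corpus), a Naive Bayes classifier could be trained like this:

```python
from nltk.classify import NaiveBayesClassifier


def features(sentence):
    """Bag-of-words features: every lowercase word maps to True (hypothetical helper)."""
    return {word: True for word in sentence.lower().split()}


# Toy training data invented for illustration only.
train = [
    (features("great fun"), "pos"),
    (features("great story"), "pos"),
    (features("boring plot"), "neg"),
    (features("dull story"), "neg"),
]

classifier = NaiveBayesClassifier.train(train)

# Classify two unseen one-word inputs.
print(classifier.classify(features("great")))
print(classifier.classify(features("boring")))
```

With so little data the result is only a demonstration of the API; a real classifier needs a labelled corpus and a held-out test set.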

---

## Advantages of NLTK
- Easy to use and well-documented.
- Large collection of corpora and lexical resources.
- Rich set of NLP algorithms for academic and educational purposes.
- Flexible for experimentation and prototyping.

---

## Limitations of NLTK
- Slower than modern libraries like spaCy for large-scale processing.
- Models are not always state-of-the-art.
- Most corpora and models must be downloaded separately with `nltk.download()` before use.



In [1]:
%pip install nltk



In [2]:
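import nltk

# One-time downloads; each resource is cached under nltk_data
nltk.download('punkt')                       # tokenizer models
nltk.download('stopwords')                   # stopword lists
nltk.download('wordnet')                     # WordNet lexical database
nltk.download('averaged_perceptron_tagger')  # POS tagger model
nltk.download('maxent_ne_chunker')           # named-entity chunker model
nltk.download('words')                       # word list used by the NE chunker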

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [4]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [5]:
from nltk.corpus import gutenberg
print(gutenberg.fileids())
print(gutenberg.raw('austen-emma.txt')[:500])

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died t


# Tokenization

Goal: Break text into sentences or words.

In [7]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [8]:
# Word Tokenization:

from nltk.tokenize import word_tokenize
text = "NLTK is great for text processing!"
words = word_tokenize(text)
print(words)

['NLTK', 'is', 'great', 'for', 'text', 'processing', '!']


In [9]:
# Sentence Tokenization:

from nltk.tokenize import sent_tokenize
# `text` contains a single sentence, so the result is a one-element list
sentences = sent_tokenize(text)
print(sentences)

['NLTK is great for text processing!']


# Stopwords Removal

Goal: Remove very common words that carry little meaning on their own.

In [10]:
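from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
# Compare in lowercase so capitalized stopwords are caught; punctuation such as '!' is kept
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)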

['NLTK', 'great', 'text', 'processing', '!']
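
The filtered list above still contains the `!` token and mixed-case words. As a rough sketch of the text normalization step listed under Key Features, assuming it is acceptable to drop every token that is not purely alphabetic, plain Python is enough (the `normalized` name is just for illustration):

```python
# Tokens left after stopword removal, copied from the output above.
filtered = ["NLTK", "great", "text", "processing", "!"]

# Keep purely alphabetic tokens and lowercase them.
normalized = [token.lower() for token in filtered if token.isalpha()]
print(normalized)  # ['nltk', 'great', 'text', 'processing']
```

Note that `str.isalpha()` also discards tokens such as `U.K.` or `don't`, so a looser filter may be needed for real text.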


# Stemming

Goal: Reduce words to a crude stem by stripping suffixes; the stem need not be a real word.

In [11]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print(ps.stem("running"))  # run

run
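
A single word hides the fact that stems are often not dictionary words. The sketch below runs the same `PorterStemmer` over a few more words; the word list is an arbitrary choice for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Arbitrary sample words chosen to show both clean and truncated stems.
words_to_stem = ["running", "studies", "flies", "connection"]
stems = {word: stemmer.stem(word) for word in words_to_stem}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Here "studies" is reduced to "studi", which is why lemmatization is usually preferred when readable base forms matter.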


# Lemmatization

Goal: Reduce words to their base dictionary form (lemma), using the word's part of speech.

In [12]:
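from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')  # Open Multilingual WordNet data used alongside 'wordnet'

lemmatizer = WordNetLemmatizer()
# pos="v" treats the word as a verb; without it the lemmatizer assumes a noun
print(lemmatizer.lemmatize("running", pos="v"))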

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


run


In [13]:
print(lemmatizer.lemmatize("fighting", pos="v"))

fight


# Part-of-Speech (POS) Tagging

Goal: Identify parts of speech for each word.

In [15]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [16]:
from nltk import pos_tag
pos = pos_tag(words)  # list of (token, Penn Treebank tag) pairs
print(pos)

[('NLTK', 'NNP'), ('is', 'VBZ'), ('great', 'JJ'), ('for', 'IN'), ('text', 'JJ'), ('processing', 'NN'), ('!', '.')]
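
The tags above are Penn Treebank tags, while `WordNetLemmatizer.lemmatize` expects a one-letter WordNet part of speech such as `"n"` or `"v"`. One common way to bridge the two is a small mapping function. The helper below, `penn_to_wordnet`, is a hypothetical name and a simplified sketch that falls back to noun for anything it does not recognize:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


# Tagged tokens copied from the pos_tag output above.
tagged = [("NLTK", "NNP"), ("is", "VBZ"), ("great", "JJ"), ("for", "IN"),
          ("text", "JJ"), ("processing", "NN"), ("!", ".")]

wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

Each resulting letter can then be passed as the `pos` argument when lemmatizing the matching token.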


# Named Entity Recognition (NER)

Goal: Detect named entities such as people, places, and organizations.

In [18]:
nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


True

In [19]:
from nltk import ne_chunk
nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "Apple is looking at buying U.K. startup for $1 billion"
# Pipeline: tokenize -> POS-tag -> chunk named entities into a tree
entities = ne_chunk(pos_tag(word_tokenize(sentence)))
print(entities)

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


(S
  (GPE Apple/NNP)
  is/VBZ
  looking/VBG
  at/IN
  buying/VBG
  U.K./NNP
  startup/NN
  for/IN
  $/$
  1/CD
  billion/CD)
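
The result is an `nltk.tree.Tree` in which each recognized entity is a labelled subtree. In the output above the model labelled "Apple" as `GPE` and left "U.K." untagged. As a sketch of how the entities might be pulled out of such a tree, the code below rebuilds that same tree by hand so it runs without any downloads; the `named` list is an illustrative name.

```python
from nltk.tree import Tree

# Hand-built copy of the tree printed above (no model download needed).
entities = Tree("S", [
    Tree("GPE", [("Apple", "NNP")]),
    ("is", "VBZ"), ("looking", "VBG"), ("at", "IN"), ("buying", "VBG"),
    ("U.K.", "NNP"), ("startup", "NN"), ("for", "IN"),
    ("$", "$"), ("1", "CD"), ("billion", "CD"),
])

# Entity chunks are Tree children; ordinary tokens are plain (word, tag) tuples.
named = []
for node in entities:
    if isinstance(node, Tree):
        entity_text = " ".join(word for word, tag in node.leaves())
        named.append((node.label(), entity_text))

print(named)  # [('GPE', 'Apple')]
```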


# Chunking & Chinking

Goal: Group tagged tokens into phrases such as noun phrases (chunking); chinking removes tokens from an existing chunk.

In [20]:
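from nltk.chunk import RegexpParser
# NP chunk: optional determiner, any number of adjectives, then one singular noun (NN)
pattern = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(pattern)
result = parser.parse(pos)  # `pos` is the tagged sentence from the POS tagging cell
print(result)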

(S NLTK/NNP is/VBZ great/JJ for/IN (NP text/JJ processing/NN) !/.)
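
The cell above shows chunking only. Chinking works the other way round: first chunk broadly, then carve unwanted tokens out of the chunk with a `}...{` rule. The grammar below is a sketch under that assumption; the choice of tags to chink (`VBZ`, `IN`, and `.`) is made for this one sentence, not a general recipe.

```python
from nltk.chunk import RegexpParser

# Tagged tokens copied from the pos_tag output earlier in the notebook.
tagged = [("NLTK", "NNP"), ("is", "VBZ"), ("great", "JJ"), ("for", "IN"),
          ("text", "JJ"), ("processing", "NN"), ("!", ".")]

grammar = r"""
NP:
    {<.*>+}            # chunk every token into one big NP
    }<VBZ|IN|\.>+{     # chink verbs, prepositions, and final punctuation
"""

chinker = RegexpParser(grammar)
chinked = chinker.parse(tagged)
print(chinked)

# Collect the leaves of each remaining NP chunk.
noun_phrases = [subtree.leaves()
                for subtree in chinked.subtrees(lambda t: t.label() == "NP")]
```

Removing the chinked tokens splits the single large chunk into three smaller `NP` chunks around "is", "for", and "!".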


# N-grams

Goal: Extract contiguous sequences of N tokens.

In [21]:
from nltk.util import ngrams
bigrams = list(ngrams(words, 2))  # ngrams() returns a lazy iterator, so wrap it in list()
print(bigrams)

[('NLTK', 'is'), ('is', 'great'), ('great', 'for'), ('for', 'text'), ('text', 'processing'), ('processing', '!')]
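
The same function handles any N, and its output can be fed into `FreqDist` to count repeated n-grams. The short phrase used below is an arbitrary example picked because it contains a repeated bigram.

```python
from nltk.probability import FreqDist
from nltk.util import ngrams

# Arbitrary sample phrase with one repeated bigram.
tokens = "to be or not to be".split()

trigrams = list(ngrams(tokens, 3))
bigram_counts = FreqDist(ngrams(tokens, 2))

print(trigrams[0])                   # ('to', 'be', 'or')
print(bigram_counts.most_common(1))  # [(('to', 'be'), 2)]
```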
