# NLTK: Introduction and Functions
## What is NLTK?

**NLTK** (Natural Language Toolkit) is a Python library for natural language processing (**NLP**) tasks, such as developing chatbots. It is particularly useful for analyzing and understanding text.

## Main Operations of NLTK

1. Tokenization: Splits a paragraph or sentence into smaller units, such as words or sentences.
2. Stop Words Removal: Removes very common words (e.g., articles, conjunctions) that carry little semantic value.
3. Stemming: Reduces a word to its stem by stripping endings or suffixes; the result is not always a dictionary word (e.g., "library" becomes "librari").
4. Lemmatization: Maps the inflected variants of a word to its dictionary form (lemma).
5. Chunking: Groups POS-tagged tokens into meaningful phrases, such as noun phrases.
6. Chinking: Excludes specified token sequences from the groups identified by chunking.
7. Named Entity Recognition (NER): Identifies and classifies entities in a text such as proper names, places, or organizations.
8. Part of Speech Tagging (POS Tagging): Classifies each word based on its grammatical function (noun, verb, adjective, etc.).

In [22]:
!pip install nltk



In [23]:
import nltk
nltk.download('punkt') # For tokenization
nltk.download('punkt_tab') # For tokenization
nltk.download('stopwords')  # For stopwords
nltk.download('wordnet')  # For lemmatization
nltk.download('averaged_perceptron_tagger')  # For POS tagging
nltk.download('maxent_ne_chunker')  # For NER
nltk.download('words')  # For NER
%matplotlib inline

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already u

In [24]:
# Tokenization
from nltk.tokenize import word_tokenize, sent_tokenize

# Example text
text = "NLTK is a great library for natural language processing. It makes text analysis simple."

# Word tokenization
print(word_tokenize(text))

# Sentence tokenization
print(sent_tokenize(text))

['NLTK', 'is', 'a', 'great', 'library', 'for', 'natural', 'language', 'processing', '.', 'It', 'makes', 'text', 'analysis', 'simple', '.']
['NLTK is a great library for natural language processing.', 'It makes text analysis simple.']
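`word_tokenize` returns punctuation marks as separate tokens. When those are not wanted, one option is NLTK's `RegexpTokenizer`, which tokenizes by pattern and needs no downloaded data. The sketch below assumes a simple `\w+` pattern, chosen here for illustration only:

```python
from nltk.tokenize import RegexpTokenizer

text = "NLTK is a great library for natural language processing. It makes text analysis simple."

# Assumption: r"\w+" keeps runs of letters, digits, and underscores,
# so punctuation such as "." is dropped instead of being returned as a token.
tokenizer = RegexpTokenizer(r"\w+")
word_tokens = tokenizer.tokenize(text)
print(word_tokens)
```

Note that this pattern also splits contractions such as "don't" into two tokens, so it is a trade-off rather than a drop-in replacement for `word_tokenize`.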


In [25]:
# Stopwords
from nltk.corpus import stopwords

# Get the English stopwords
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not stop words (comparison is case-insensitive)
filtered_words = [word for word in word_tokenize(text) if word.lower() not in stop_words]
print(filtered_words)

['NLTK', 'great', 'library', 'natural', 'language', 'processing', '.', 'makes', 'text', 'analysis', 'simple', '.']
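The output above still contains the `'.'` tokens, because punctuation is not part of the English stop word list. A minimal sketch of one way to drop them, using only the standard library; the helper name `is_punctuation` is hypothetical, and the token list is copied from the output above so the block runs on its own:

```python
import string

# Tokens copied from the stop word filtering output above.
filtered_words = ['NLTK', 'great', 'library', 'natural', 'language', 'processing', '.',
                  'makes', 'text', 'analysis', 'simple', '.']

def is_punctuation(token):
    """Return True if every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

# Keep only the tokens that contain at least one non-punctuation character.
content_words = [word for word in filtered_words if not is_punctuation(word)]
print(content_words)
```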


In [26]:
# Stemming
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]

print(stemmed_words)

['nltk', 'great', 'librari', 'natur', 'languag', 'process', '.', 'make', 'text', 'analysi', 'simpl', '.']
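As the output shows, stems such as `librari` are not dictionary words. A related effect is that distinct words can collapse to the same stem. The word pairs below are illustrative choices, not taken from the example text:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative pairs: an inflected form of one word, and two different words.
pairs = [("library", "libraries"), ("university", "universe")]

# Record whether both words in each pair reduce to the same stem.
same_stem = {}
for first, second in pairs:
    stem_a, stem_b = stemmer.stem(first), stemmer.stem(second)
    same_stem[(first, second)] = (stem_a == stem_b)
    print(first, "->", stem_a, "|", second, "->", stem_b)
```

Collapsing "library" and "libraries" is usually the desired behavior; collapsing "university" and "universe" is a case of over-stemming, which is one reason to consider lemmatization instead.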


In [27]:
# Lemmatization
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

print(lemmatized_words)

['NLTK', 'great', 'library', 'natural', 'language', 'processing', '.', 'make', 'text', 'analysis', 'simple', '.']


In [28]:
# Chunking
from nltk import pos_tag
from nltk.chunk import RegexpParser

nltk.download('averaged_perceptron_tagger_eng')

words_with_pos = pos_tag(word_tokenize(text))
print(words_with_pos)

grammar = "NP: {<DT>?<JJ>*<NN>}"  # Noun phrase (NP): optional determiner, any adjectives, then a noun
chunk_parser = RegexpParser(grammar)
chunked = chunk_parser.parse(words_with_pos)
print("Chunked sentence:", chunked)
chunked.draw()  # Show the diagram

[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('.', '.'), ('It', 'PRP'), ('makes', 'VBZ'), ('text', 'JJ'), ('analysis', 'NN'), ('simple', 'NN'), ('.', '.')]
Chunked sentence: (S
  NLTK/NNP
  is/VBZ
  (NP a/DT great/JJ library/NN)
  for/IN
  (NP natural/JJ language/NN)
  (NP processing/NN)
  ./.
  It/PRP
  makes/VBZ
  (NP text/JJ analysis/NN)
  (NP simple/NN)
  ./.)


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
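Chinking, listed among the main operations, works in the opposite direction to chunking: tokens are first grouped, and selected tag sequences are then carved back out. The sketch below assumes a grammar written for illustration (chunk everything, then chink verbs and prepositions) and uses tagged tokens copied from the `pos_tag` output above, so it needs no downloaded data:

```python
from nltk.chunk import RegexpParser

# Tagged tokens for the first sentence, copied from the pos_tag output above
# (the final period is omitted to keep the sketch short).
tagged = [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'),
          ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'),
          ('language', 'NN'), ('processing', 'NN')]

# Assumed grammar: {...} is a chunk rule, }...{ is a chink rule.
chink_grammar = r"""
NP:
    {<.*>+}         # Chunk every token into one NP
    }<VBZ|IN>+{     # Chink sequences of VBZ and IN back out
"""
chink_parser = RegexpParser(chink_grammar)
chinked = chink_parser.parse(tagged)
print(chinked)

# Collect the text of each remaining NP chunk.
np_phrases = [" ".join(token for token, _ in subtree.leaves())
              for subtree in chinked.subtrees() if subtree.label() == "NP"]
print(np_phrases)
```

Removing `is` and `for` splits the single large chunk into three noun phrases, which keeps "natural language processing" together, unlike the chunk-only grammar above.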


In [30]:
# Named Entity Recognition (NER)
from nltk import ne_chunk

nltk.download('maxent_ne_chunker_tab')

entities = ne_chunk(words_with_pos)
print(entities)

entities.draw()

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker_tab.zip.


(S
  (ORGANIZATION NLTK/NNP)
  is/VBZ
  a/DT
  great/JJ
  library/NN
  for/IN
  natural/JJ
  language/NN
  processing/NN
  ./.
  It/PRP
  makes/VBZ
  text/JJ
  analysis/NN
  simple/NN
  ./.)
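The result of `ne_chunk` is an `nltk.Tree` in which each recognized entity is a labeled subtree. A minimal sketch of pulling the entities out as `(text, label)` pairs; the tree is rebuilt by hand from the first sentence of the output above so that the block runs without the NER models:

```python
from nltk.tree import Tree

# Tree rebuilt by hand from the ne_chunk output above (first sentence only).
entities = Tree('S', [
    Tree('ORGANIZATION', [('NLTK', 'NNP')]),
    ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('library', 'NN'),
    ('for', 'IN'), ('natural', 'JJ'), ('language', 'NN'),
    ('processing', 'NN'), ('.', '.'),
])

# Children that are Tree instances are named entities; plain tuples are ordinary tokens.
named_entities = [
    (" ".join(token for token, _ in node.leaves()), node.label())
    for node in entities
    if isinstance(node, Tree)
]
print(named_entities)  # [('NLTK', 'ORGANIZATION')]
```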


In [None]:
# POS Tagging

from nltk import pos_tag

words_with_pos = pos_tag(word_tokenize(text))

print(words_with_pos)