# NLTK
- NLTK stands for Natural Language Toolkit
- A Python library for processing human language data
- Provides tools for tokenization, tagging, parsing, classification, and semantic reasoning

## Tokenization
- Splitting text into smaller units (tokens), such as sentences or words

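## Stemming/Lemmatization
- Reducing words to a base form: stemming strips suffixes with heuristic rules, while lemmatization maps a word to its dictionary form (lemma)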
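
A minimal sketch of stemming, assuming NLTK's `PorterStemmer`, which needs no downloaded data. The word list is illustrative and not taken from the notebook. Lemmatization with `WordNetLemmatizer` follows a similar pattern but requires the `wordnet` corpus, so it is left out here.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words chosen for this sketch
sample_words = ["running", "raining", "arrived"]

# Map each word to its Porter stem; a stem is not always a dictionary word
stems = {word: stemmer.stem(word) for word in sample_words}
print(stems)
```

Because a stemmer only applies suffix rules, some results (such as the stem of "arrived") may not be real words; a lemmatizer is usually the better choice when readable output matters.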

## POS Tagging
- Assigning parts of speech to words
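
As a rough illustration of what a tagger does, the sketch below uses NLTK's rule-based `RegexpTagger` with a few hand-written patterns. The patterns are assumptions made up for this example; in practice the pretrained tagger behind `nltk.pos_tag` is more accurate, but it needs a downloaded model.

```python
from nltk.tag import RegexpTagger

# Hypothetical hand-written rules, tried in order; the last one is a catch-all
patterns = [
    (r"^\d+$", "CD"),    # digits only -> cardinal number
    (r".*ing$", "VBG"),  # ends in -ing -> gerund / present participle
    (r".*ly$", "RB"),    # ends in -ly -> adverb
    (r".*", "NN"),       # everything else -> noun
]
rule_tagger = RegexpTagger(patterns)

# Each token is paired with the tag of the first pattern it matches
tagged = rule_tagger.tag(["it", "was", "raining", "heavily"])
print(tagged)
```

The catch-all rule labels "it" and "was" as nouns, which shows why suffix rules alone are not enough for real text.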

## Parsing/Chunking
- Analyzing the grammatical structure of sentences
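
A small chunking sketch, assuming NLTK's `RegexpParser` and a hand-tagged sentence so that no tagger model is needed. The grammar and the tags are illustrative assumptions.

```python
from nltk.chunk import RegexpParser

# Hand-tagged tokens (assumed tags) so no tagger model is required
tagged_tokens = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumped", "VBD")]

# Noun phrase: optional determiner, any number of adjectives, then a noun
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(tagged_tokens)

# Collect the words of every subtree labelled NP
noun_phrases = [
    [word for word, _tag in subtree.leaves()]
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```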

## Corpora/Lexical Resources
- Ready-made datasets such as the Brown and Gutenberg corpora and WordNet, fetched with `nltk.download`

## Text Classification
- Training models to categorize text
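
A toy classification sketch, assuming NLTK's `NaiveBayesClassifier` with bag-of-words features. The helper `bag_of_words` and the four training sentences are made up for illustration; a real model needs far more data.

```python
from nltk.classify import NaiveBayesClassifier


def bag_of_words(text):
    """Turn a string into a {word: True} feature dictionary."""
    return {word: True for word in text.lower().split()}


# Tiny made-up training set, for illustration only
training_data = [
    (bag_of_words("great movie"), "pos"),
    (bag_of_words("loved it"), "pos"),
    (bag_of_words("terrible movie"), "neg"),
    (bag_of_words("hated it"), "neg"),
]

classifier = NaiveBayesClassifier.train(training_data)

# Classify a new piece of text using the same feature extraction
prediction = classifier.classify(bag_of_words("great"))
print(prediction)
```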

In [9]:
import nltk
import ssl

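# Workaround for SSL certificate errors raised by nltk.download on some systems.
# It disables HTTPS certificate verification, so use it only on trusted networks.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Older Python versions do not verify HTTPS certificates by default
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context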

nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/rajveer/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [25]:
text = "Dr. John went to washington and arrived at 5pm, when it was raining very heavily"
sentences = nltk.sent_tokenize(text)
print(sentences)

['Dr. John went to washington and arrived at 5pm, when it was raining very heavily']


In [26]:
words = nltk.word_tokenize(text)
print(words)

['Dr.', 'John', 'went', 'to', 'washington', 'and', 'arrived', 'at', '5pm', ',', 'when', 'it', 'was', 'raining', 'very', 'heavily']


In [27]:
print("Characters in the text:\n")
for word in words:
  for char in word:
    print(char)

Characters in the text:

D
r
.
J
o
h
n
w
e
n
t
t
o
w
a
s
h
i
n
g
t
o
n
a
n
d
a
r
r
i
v
e
d
a
t
5
p
m
,
w
h
e
n
i
t
w
a
s
r
a
i
n
i
n
g
v
e
r
y
h
e
a
v
i
l
y


In [28]:
tokenizer = nltk.WordPunctTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)

['Dr', '.', 'John', 'went', 'to', 'washington', 'and', 'arrived', 'at', '5pm', ',', 'when', 'it', 'was', 'raining', 'very', 'heavily']


## Treebank Tokenizer
- Suitable for linguistic analysis: it follows Penn Treebank conventions, separating punctuation and splitting contractions (for example, "don't" becomes "do" and "n't")

In [29]:
tokenizer = nltk.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)

['Dr.', 'John', 'went', 'to', 'washington', 'and', 'arrived', 'at', '5pm', ',', 'when', 'it', 'was', 'raining', 'very', 'heavily']
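
The sample sentence above contains no contractions, so the following sketch runs the same tokenizer on a made-up sentence that does. The variable names are chosen to avoid overwriting the notebook's `tokenizer` and `tokens`.

```python
from nltk.tokenize import TreebankWordTokenizer

treebank = TreebankWordTokenizer()

# Illustrative sentence with two contractions and a final period
contraction_tokens = treebank.tokenize("They don't know it's raining.")
print(contraction_tokens)
# ['They', 'do', "n't", 'know', 'it', "'s", 'raining', '.']
```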


## RegEx Tokenizer
- Uses customizable, pattern-based splitting; the pattern `\w+` used here keeps only runs of word characters, so punctuation is dropped

In [30]:
tokenizer = nltk.RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(text)
print(tokens)

['Dr', 'John', 'went', 'to', 'washington', 'and', 'arrived', 'at', '5pm', 'when', 'it', 'was', 'raining', 'very', 'heavily']
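
`RegexpTokenizer` can also treat the pattern as the separator instead of the token. A minimal sketch, assuming the `gaps=True` option and an illustrative sentence:

```python
from nltk.tokenize import RegexpTokenizer

# With gaps=True the pattern describes the separators, not the tokens
whitespace_tokenizer = RegexpTokenizer(r"\s+", gaps=True)

gap_tokens = whitespace_tokenizer.tokenize("Dr. John arrived at 5pm, late")
print(gap_tokens)
# ['Dr.', 'John', 'arrived', 'at', '5pm,', 'late']
```

Splitting on whitespace keeps punctuation attached to the neighbouring word, which is the opposite trade-off from the `\w+` pattern.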


In [31]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens)

['Dr.', 'John', 'went', 'to', 'washington', 'and', 'arrived', 'at', '5pm', ',', 'when', 'it', 'was', 'raining', 'very', 'heavily']


## Stop Word Removal
- Stop words add little or no semantic meaning to a sentence
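
The filtering step itself can be sketched as a list comprehension. The miniature stop list below is a hypothetical stand-in so the example runs without any download; in practice it would be replaced by `set(stopwords.words('english'))`.

```python
# Hypothetical miniature stop list; the full list comes from the NLTK stopwords corpus
demo_stop_words = {"to", "and", "at", "when", "it", "was", "very"}

demo_tokens = ["John", "went", "to", "washington", "and", "arrived", "at", "5pm"]

# Compare in lower case because the stop list is lower-cased
filtered = [tok for tok in demo_tokens if tok.lower() not in demo_stop_words]
print(filtered)
# ['John', 'went', 'washington', 'arrived', '5pm']
```

Using a set for the stop list keeps each membership check fast, which matters on long documents.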

In [33]:
nltk.download('stopwords')
from nltk.corpus import stopwords
set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/rajveer/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on