# Natural Language Processing with NLTK
## Project Overview
This notebook demonstrates various **Natural Language Processing (NLP)** techniques using the **Natural Language Toolkit (NLTK)** library in Python.  
We will explore different tokenization methods, corpus usage, and text preprocessing concepts that are fundamental in NLP workflows.

---

## Introduction to NLP
Natural Language Processing (NLP) is a branch of Artificial Intelligence that focuses on enabling computers to understand, interpret, and generate human language.  
Applications include:
- Text classification (spam detection, sentiment analysis)
- Machine translation
- Chatbots and conversational AI
- Named Entity Recognition (NER)

## Introduction to NLTK
The **Natural Language Toolkit (NLTK)** is a widely used Python library for NLP tasks. It provides:
- Access to large text corpora (e.g., Gutenberg, Brown, Reuters)
- Tools for tokenization, stemming, and lemmatization
- Stopword removal
- Part-of-speech tagging

**References:**
- [NLTK Documentation](https://www.nltk.org/)
- [NLTK Book](https://www.nltk.org/book/)
- [Natural Language Processing Overview - Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing)

---


## 1. Importing Required Libraries
We start by importing the libraries used in this notebook. `os` is imported for interacting with the operating system if needed; none of the cells below depend on it.

In [1]:
import os 
import nltk

## 2. Accessing NLTK Corpora
The `nltk.corpus` module provides access to various text corpora for analysis.

In [2]:
import nltk.corpus

### Exploring Gutenberg Corpus
The `gutenberg.fileids()` method lists the texts in the Project Gutenberg selection included with NLTK. If the corpus is not installed yet, download it once with `nltk.download('gutenberg')`.

In [3]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

### Creating Custom Text
In addition to predefined corpora, you can create your own text data for processing.

In [4]:
# You can also define your own text to process
AI = '''Artificial Intelligence refers to the intelligence of machines. This is in contrast to the natural intelligence of
humans and animals. With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and
problem-solving. Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines.
It is probably the fastest-growing development in the World of technology and innovation. Furthermore, many experts believe
AI could solve major challenges and crisis situations.'''

### Checking Data Type
This checks the Python data type of the variable `AI`.

In [5]:
type(AI)

str

## 3. Word Tokenization
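`word_tokenize` splits text into word and punctuation tokens. It relies on NLTK's Punkt sentence models, which can be fetched once with `nltk.download('punkt')` (recent NLTK releases ask for `punkt_tab` instead).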

In [6]:
from nltk.tokenize import word_tokenize

### Tokenizing Custom Text
We apply `word_tokenize` to our custom text variable `AI`.

In [7]:
AI_tokens = word_tokenize(AI)
AI_tokens

['Artificial',
 'Intelligence',
 'refers',
 'to',
 'the',
 'intelligence',
 'of',
 'machines',
 '.',
 'This',
 'is',
 'in',
 'contrast',
 'to',
 'the',
 'natural',
 'intelligence',
 'of',
 'humans',
 'and',
 'animals',
 '.',
 'With',
 'Artificial',
 'Intelligence',
 ',',
 'machines',
 'perform',
 'functions',
 'such',
 'as',
 'learning',
 ',',
 'planning',
 ',',
 'reasoning',
 'and',
 'problem-solving',
 '.',
 'Most',
 'noteworthy',
 ',',
 'Artificial',
 'Intelligence',
 'is',
 'the',
 'simulation',
 'of',
 'human',
 'intelligence',
 'by',
 'machines',
 '.',
 'It',
 'is',
 'probably',
 'the',
 'fastest-growing',
 'development',
 'in',
 'the',
 'World',
 'of',
 'technology',
 'and',
 'innovation',
 '.',
 'Furthermore',
 ',',
 'many',
 'experts',
 'believe',
 'AI',
 'could',
 'solve',
 'major',
 'challenges',
 'and',
 'crisis',
 'situations',
 '.']

### Counting Tokens
`len()` gives the total number of tokens extracted; punctuation marks count as tokens too.

In [8]:
len(AI_tokens)

81

## 4. Sentence Tokenization
`sent_tokenize` breaks text into sentences.

In [9]:
from nltk.tokenize import sent_tokenize

### Applying Sentence Tokenization
We apply it to our `AI` text.

In [10]:
AI_sent = sent_tokenize(AI) 
AI_sent

['Artificial Intelligence refers to the intelligence of machines.',
 'This is in contrast to the natural intelligence of\nhumans and animals.',
 'With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and\nproblem-solving.',
 'Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines.',
 'It is probably the fastest-growing development in the World of technology and innovation.',
 'Furthermore, many experts believe\nAI could solve major challenges and crisis situations.']
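The sentences above keep the line breaks of the original string (the `\n` inside some items). A minimal sketch of collapsing that whitespace before further processing, using two sentences copied from the output above; the variable names are illustrative only:

```python
# Two sentences copied from the sent_tokenize output above; note the embedded "\n".
sentences = [
    'This is in contrast to the natural intelligence of\nhumans and animals.',
    'Furthermore, many experts believe\nAI could solve major challenges and crisis situations.',
]

# str.split() with no arguments splits on any run of whitespace (spaces, tabs,
# newlines), so joining the pieces with single spaces normalises each sentence.
clean_sentences = [' '.join(sentence.split()) for sentence in sentences]

for sentence in clean_sentences:
    print(sentence)
```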

### Viewing Original Text
Displays the content of the `AI` variable.

In [11]:
AI

'Artificial Intelligence refers to the intelligence of machines. This is in contrast to the natural intelligence of\nhumans and animals. With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and\nproblem-solving. Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines.\nIt is probably the fastest-growing development in the World of technology and innovation. Furthermore, many experts believe\nAI could solve major challenges and crisis situations.'

## 5. Blank Line Tokenization
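`blankline_tokenize` splits text into segments separated by blank lines, which usually correspond to paragraphs. The `AI` string contains single line breaks but no blank lines, so it comes back as one segment.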

In [12]:
from nltk.tokenize import blankline_tokenize
AI_blank = blankline_tokenize(AI)
AI_blank

['Artificial Intelligence refers to the intelligence of machines. This is in contrast to the natural intelligence of\nhumans and animals. With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and\nproblem-solving. Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines.\nIt is probably the fastest-growing development in the World of technology and innovation. Furthermore, many experts believe\nAI could solve major challenges and crisis situations.']
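A text with a real paragraph break shows the split more clearly. The sketch below uses the standard `re` module as a stand-in; the sample text is hypothetical, and the pattern is an assumption intended to approximate what `blankline_tokenize` does rather than a copy of NLTK's implementation:

```python
import re

# Hypothetical two-paragraph text; the blank line marks the paragraph break.
text = "First paragraph, line one.\nFirst paragraph, line two.\n\nSecond paragraph."

# Split wherever a blank line occurs, allowing stray whitespace around it.
# Empty pieces (possible at the start or end of a text) are dropped.
paragraphs = [p for p in re.split(r"\s*\n\s*\n\s*", text) if p]

print(len(paragraphs))
for paragraph in paragraphs:
    print(repr(paragraph))
```

The single line break inside the first paragraph is kept, because only a blank line counts as a separator.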

## 6. Whitespace Tokenization
`WhitespaceTokenizer` splits text on whitespace only (spaces, tabs, and newlines), so punctuation stays attached to the adjacent word.

In [15]:
from nltk.tokenize import WhitespaceTokenizer
AI_white = WhitespaceTokenizer().tokenize(AI)
AI_white

['Artificial',
 'Intelligence',
 'refers',
 'to',
 'the',
 'intelligence',
 'of',
 'machines.',
 'This',
 'is',
 'in',
 'contrast',
 'to',
 'the',
 'natural',
 'intelligence',
 'of',
 'humans',
 'and',
 'animals.',
 'With',
 'Artificial',
 'Intelligence,',
 'machines',
 'perform',
 'functions',
 'such',
 'as',
 'learning,',
 'planning,',
 'reasoning',
 'and',
 'problem-solving.',
 'Most',
 'noteworthy,',
 'Artificial',
 'Intelligence',
 'is',
 'the',
 'simulation',
 'of',
 'human',
 'intelligence',
 'by',
 'machines.',
 'It',
 'is',
 'probably',
 'the',
 'fastest-growing',
 'development',
 'in',
 'the',
 'World',
 'of',
 'technology',
 'and',
 'innovation.',
 'Furthermore,',
 'many',
 'experts',
 'believe',
 'AI',
 'could',
 'solve',
 'major',
 'challenges',
 'and',
 'crisis',
 'situations.']

### Counting Whitespace Tokens
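Counts the tokens created by `WhitespaceTokenizer`. The total is lower than the 81 tokens from `word_tokenize` because punctuation is not split off (for example `'machines.'` is a single token).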

In [16]:
len(AI_white)

70
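The effect of splitting on whitespace alone can be reproduced with plain Python. This is a minimal sketch that uses `str.split()` as a stand-in for `WhitespaceTokenizer` (an assumption of equivalent behaviour on this input) and shows one simple way to trim the attached punctuation:

```python
import string

# One sentence taken from the AI text above, including its line break.
sentence = ("With Artificial Intelligence, machines perform functions such as "
            "learning, planning, reasoning and\nproblem-solving.")

# Splitting on whitespace alone leaves punctuation glued to the words.
whitespace_tokens = sentence.split()

# Stripping leading/trailing punctuation is one simple clean-up; the hyphen
# inside 'problem-solving' is untouched because strip() only trims the ends.
stripped_tokens = [token.strip(string.punctuation) for token in whitespace_tokens]

print(whitespace_tokens)
print(stripped_tokens)
```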

## 7. Practical Example
We define a short sample sentence containing a price and several full stops to see how different tokenizers treat punctuation.

In [20]:
s = 'Good Apple cost $3.88 in Hyederbad. Please buy two of them. Thanks.'
s

'Good Apple cost $3.88 in Hyederbad. Please buy two of them. Thanks.'

In [21]:
s_tokens = word_tokenize(s)  # separate name so the earlier AI_tokens list is not overwritten
s_tokens

['Good',
 'Apple',
 'cost',
 '$',
 '3.88',
 'in',
 'Hyederbad',
 '.',
 'Please',
 'buy',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']

In [22]:
len(s)  # number of characters in the string, not the number of tokens

67

In [27]:
from nltk.tokenize import wordpunct_tokenize
S = wordpunct_tokenize(s)  # splits at every punctuation run, so '3.88' becomes '3', '.', '88'
S

['Good',
 'Apple',
 'cost',
 '$',
 '3',
 '.',
 '88',
 'in',
 'Hyederbad',
 '.',
 'Please',
 'buy',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']

In [28]:
len(S)

18
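The split of `$3.88` into `'$'`, `'3'`, `'.'`, `'88'` follows from how `wordpunct_tokenize` works: it alternates between runs of alphanumeric characters and runs of punctuation. A rough equivalent with the standard `re` module is sketched below; the regular expression is an assumption chosen to reproduce the output above, not a quotation of NLTK's source:

```python
import re

s = 'Good Apple cost $3.88 in Hyederbad. Please buy two of them. Thanks.'

# Either a run of word characters, or a run of characters that are neither
# word characters nor whitespace (punctuation and symbols).
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

tokens = WORD_OR_PUNCT.findall(s)
print(tokens)
print(len(tokens))
```

`word_tokenize`, by contrast, kept `'3.88'` together in the earlier cell, which is usually what you want for prices and decimals.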

In [29]:
w_p = wordpunct_tokenize(AI)
w_p

['Artificial',
 'Intelligence',
 'refers',
 'to',
 'the',
 'intelligence',
 'of',
 'machines',
 '.',
 'This',
 'is',
 'in',
 'contrast',
 'to',
 'the',
 'natural',
 'intelligence',
 'of',
 'humans',
 'and',
 'animals',
 '.',
 'With',
 'Artificial',
 'Intelligence',
 ',',
 'machines',
 'perform',
 'functions',
 'such',
 'as',
 'learning',
 ',',
 'planning',
 ',',
 'reasoning',
 'and',
 'problem',
 '-',
 'solving',
 '.',
 'Most',
 'noteworthy',
 ',',
 'Artificial',
 'Intelligence',
 'is',
 'the',
 'simulation',
 'of',
 'human',
 'intelligence',
 'by',
 'machines',
 '.',
 'It',
 'is',
 'probably',
 'the',
 'fastest',
 '-',
 'growing',
 'development',
 'in',
 'the',
 'World',
 'of',
 'technology',
 'and',
 'innovation',
 '.',
 'Furthermore',
 ',',
 'many',
 'experts',
 'believe',
 'AI',
 'could',
 'solve',
 'major',
 'challenges',
 'and',
 'crisis',
 'situations',
 '.']

In [30]:
len(w_p)

85

## 8. N-grams
N-grams are sequences of *n* consecutive tokens: bigrams (n = 2), trigrams (n = 3), and so on.

In [37]:
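# The same helpers are also exposed as nltk.bigrams, nltk.trigrams and nltk.ngrams,
# which is how the cells below call them.
from nltk.util import bigrams, trigrams, ngrams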


In [38]:
string = 'we are learner of prakash senapati from 4pm batch'
quotes_tokens = nltk.word_tokenize(string)
quotes_tokens

['we', 'are', 'learner', 'of', 'prakash', 'senapati', 'from', '4pm', 'batch']

In [39]:
string

'we are learner of prakash senapati from 4pm batch'

In [40]:
quotes_tokens

['we', 'are', 'learner', 'of', 'prakash', 'senapati', 'from', '4pm', 'batch']

In [41]:
len(quotes_tokens)

9

### Bigrams (2-grams): Generating Consecutive Word Pairs

**What are bigrams?**

Bigrams are sequences of two adjacent tokens (words) in a text. They capture short-range word co-occurrence and are useful for modelling local context, identifying collocations (commonly co-occurring word pairs), and as features in n-gram language models.

**Code explanation**

- `quotes_bigram = list(nltk.bigrams(quotes_tokens))`: `nltk.bigrams` returns a generator that yields consecutive token pairs from `quotes_tokens` (a list of tokens). Wrapping the generator in `list()` materialises all bigrams in memory as a list of 2-tuples (e.g., `('we', 'are')`).
- `quotes_bigram`: displays the resulting list of bigram tuples in the notebook output.

**Important details**

- If `quotes_tokens` contains *N* tokens, the resulting list of bigrams will contain *N − 1* tuples.
- Bigrams preserve token order, so punctuation tokens and stopwords end up inside the pairs. It is common to lowercase the text, remove punctuation, or filter stopwords before creating n-grams if you want cleaner collocations.

**Next steps (recommended)**

- Compute frequencies of bigrams with `nltk.FreqDist` or `collections.Counter` to find the most common pairs.
- Use `nltk.collocations.BigramCollocationFinder` with statistical measures (like PMI) to find significant collocations.

**Example (illustrative, not executed here; it assumes `quotes_bigram` from the cell below):**

```python
from nltk import FreqDist
bigram_fd = FreqDist(quotes_bigram)
bigram_fd.most_common(10)
```
This returns up to 10 of the most frequent bigrams in the `quotes_tokens` sequence. The sample sentence yields only 8 bigrams, each occurring once.
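To make the PMI idea concrete, here is a hand-computed sketch using only the standard library. It is not NLTK's `BigramCollocationFinder`; the token list is hypothetical, and the formula is one common count-based estimate of pointwise mutual information:

```python
import math
from collections import Counter

# Hypothetical token list with a repeated pair, so the counts are not all 1.
tokens = "new york is big and new york is busy".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

def pmi(pair):
    """log2( count(x, y) * N / (count(x) * count(y)) ) for the bigram (x, y)."""
    first, second = pair
    joint = bigram_counts[pair]
    return math.log2(joint * total / (unigram_counts[first] * unigram_counts[second]))

# Rank only bigrams seen at least twice; PMI tends to overrate one-off pairs.
frequent = [pair for pair, count in bigram_counts.items() if count >= 2]
for pair in sorted(frequent, key=pmi, reverse=True):
    print(pair, round(pmi(pair), 3))
```

The frequency filter matters: in this toy data the one-off pair `('big', 'and')` scores higher than `('new', 'york')`, which is why collocation finders are normally combined with a minimum count.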


In [42]:
quotes_bigram = list(nltk.bigrams(quotes_tokens))
quotes_bigram

[('we', 'are'),
 ('are', 'learner'),
 ('learner', 'of'),
 ('of', 'prakash'),
 ('prakash', 'senapati'),
 ('senapati', 'from'),
 ('from', '4pm'),
 ('4pm', 'batch')]

In [43]:
len(quotes_bigram)

8

In [44]:
quotes_trigram = list(nltk.trigrams(quotes_tokens))
quotes_trigram

[('we', 'are', 'learner'),
 ('are', 'learner', 'of'),
 ('learner', 'of', 'prakash'),
 ('of', 'prakash', 'senapati'),
 ('prakash', 'senapati', 'from'),
 ('senapati', 'from', '4pm'),
 ('from', '4pm', 'batch')]

In [45]:
len(quotes_trigram)

7

In [48]:
quotes_ngram = list(nltk.ngrams(quotes_tokens,5))
quotes_ngram

[('we', 'are', 'learner', 'of', 'prakash'),
 ('are', 'learner', 'of', 'prakash', 'senapati'),
 ('learner', 'of', 'prakash', 'senapati', 'from'),
 ('of', 'prakash', 'senapati', 'from', '4pm'),
 ('prakash', 'senapati', 'from', '4pm', 'batch')]
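The counts above (8 bigrams, 7 trigrams, 5 five-grams from 9 tokens) follow one rule: *N* tokens give *N − n + 1* n-grams. A plain-Python sketch of the sliding window makes this visible; `sliding_ngrams` is a hypothetical helper written for illustration, not an NLTK function:

```python
def sliding_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (a plain-Python sketch)."""
    # zip over n staggered views of the list; zip stops at the shortest one,
    # which yields exactly len(tokens) - n + 1 tuples when n <= len(tokens).
    return list(zip(*(tokens[i:] for i in range(n))))

quotes_tokens = ['we', 'are', 'learner', 'of', 'prakash', 'senapati', 'from', '4pm', 'batch']

for n in (2, 3, 5):
    grams = sliding_ngrams(quotes_tokens, n)
    print(n, len(grams), grams[0])
```

When `n` exceeds the number of tokens, the helper returns an empty list.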

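## 9. Stemming
Stemming reduces a word to a stem by stripping suffixes with hand-written rules. The result need not be a dictionary word, and stemmers differ in how aggressively they cut. The cells below compare `PorterStemmer`, `LancasterStemmer`, and `SnowballStemmer`.

In [49]: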
from nltk.stem import PorterStemmer
pst = PorterStemmer()

In [50]:
pst.stem('affection')

'affect'

In [51]:
pst.stem('playing')

'play'

In [52]:
pst.stem('maximum')

'maximum'

In [53]:
words_to_stem = ['give','giving','given','gave']
for word in words_to_stem:
    print(f"{word}:{pst.stem(word)}")

give:give
giving:give
given:given
gave:gave


In [55]:
words_to_stem2 = ['give','giving','given','gave','gaved','thinking','loving','maximum','Khwaja']
for word in words_to_stem2:
    print(f"{word}:{pst.stem(word)}")

give:give
giving:give
given:given
gave:gave
gaved:gave
thinking:think
loving:love
maximum:maximum
Khwaja:khwaja


In [56]:
from nltk.stem import LancasterStemmer
lst = LancasterStemmer()

for word in words_to_stem2:
    print(f"{word}:{lst.stem(word)}")

give:giv
giving:giv
given:giv
gave:gav
gaved:gav
thinking:think
loving:lov
maximum:maxim
Khwaja:khwaja


In [59]:
from nltk.stem import SnowballStemmer
sst = SnowballStemmer('english')

for word in words_to_stem2:
    print(f"{word}:{sst.stem(word)}")

give:give
giving:give
given:given
gave:gave
gaved:gave
thinking:think
loving:love
maximum:maximum
Khwaja:khwaja
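Lining the three outputs up side by side makes the differences easier to see. The sketch below does not call the stemmers; it copies the stems printed in the cells above into lists and compares them:

```python
# Stems copied from the three outputs above (same word order in each list).
words     = ['give', 'giving', 'given', 'gave', 'gaved', 'thinking', 'loving', 'maximum', 'Khwaja']
porter    = ['give', 'give', 'given', 'gave', 'gave', 'think', 'love', 'maximum', 'khwaja']
lancaster = ['giv', 'giv', 'giv', 'gav', 'gav', 'think', 'lov', 'maxim', 'khwaja']
snowball  = ['give', 'give', 'given', 'gave', 'gave', 'think', 'love', 'maximum', 'khwaja']

print(f"{'word':<10}{'Porter':<10}{'Lancaster':<11}{'Snowball':<10}")
for row in zip(words, porter, lancaster, snowball):
    print(f"{row[0]:<10}{row[1]:<10}{row[2]:<11}{row[3]:<10}")

# Words on which Lancaster disagrees with Porter.
lancaster_differs = [w for w, p, l in zip(words, porter, lancaster) if p != l]
print(lancaster_differs)
print(porter == snowball)
```

On this word list Porter and Snowball agree everywhere, while Lancaster cuts further (`giv`, `lov`, `maxim`). None of the three maps the irregular form `gave` to `give`.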


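## 10. Lemmatization
Lemmatization maps a word to its dictionary form (lemma) using WordNet. `lemmatize()` treats every word as a noun unless told otherwise (`pos='n'` is the default), which is why verb forms such as `giving` come back unchanged in the output below; passing `pos='v'` lemmatizes them as verbs.

In [60]: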
from nltk.stem import WordNetLemmatizer  # needs the WordNet data: nltk.download('wordnet')
word_lem = WordNetLemmatizer()

In [61]:
words_to_stem2

['give',
 'giving',
 'given',
 'gave',
 'gaved',
 'thinking',
 'loving',
 'maximum',
 'Khwaja']

In [63]:
for word in words_to_stem2:
    print(f"{word}:{word_lem.lemmatize(word)}")

give:give
giving:giving
given:given
gave:gave
gaved:gaved
thinking:thinking
loving:loving
maximum:maximum
Khwaja:Khwaja


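## 11. Stopwords
Stopwords are very common words (articles, prepositions, pronouns) that are often filtered out before analysis. NLTK ships stopword lists for many languages; they require a one-time `nltk.download('stopwords')`. The long lists below are cut off in the notebook output.

In [64]: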
from nltk.corpus import stopwords

In [65]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 ...]

In [66]:
len(stopwords.words('english'))

198

In [67]:
stopwords.words('french')

['au',
 'aux',
 'avec',
 'ce',
 'ces',
 'dans',
 'de',
 'des',
 'du',
 'elle',
 'en',
 'et',
 'eux',
 'il',
 'ils',
 'je',
 'la',
 'le',
 'les',
 'leur',
 'lui',
 'ma',
 'mais',
 'me',
 'même',
 'mes',
 'moi',
 'mon',
 'ne',
 'nos',
 'notre',
 'nous',
 'on',
 'ou',
 'par',
 'pas',
 'pour',
 'qu',
 'que',
 'qui',
 'sa',
 'se',
 'ses',
 'son',
 'sur',
 'ta',
 'te',
 'tes',
 'toi',
 'ton',
 'tu',
 'un',
 'une',
 'vos',
 'votre',
 'vous',
 'c',
 'd',
 'j',
 'l',
 'à',
 'm',
 'n',
 's',
 't',
 'y',
 'été',
 'étée',
 'étées',
 'étés',
 'étant',
 'étante',
 'étants',
 'étantes',
 'suis',
 'es',
 'est',
 'sommes',
 'êtes',
 'sont',
 'serai',
 'seras',
 'sera',
 'serons',
 'serez',
 'seront',
 'serais',
 'serait',
 'serions',
 'seriez',
 'seraient',
 'étais',
 'était',
 'étions',
 'étiez',
 'étaient',
 'fus',
 'fut',
 'fûmes',
 'fûtes',
 'furent',
 'sois',
 'soit',
 'soyons',
 'soyez',
 'soient',
 'fusse',
 'fusses',
 'fût',
 'fussions',
 'fussiez',
 'fussent',
 'ayant',
 'ayante',
 'ayantes',
 ...]


In [68]:
len(stopwords.words('french'))

157

In [72]:
stopwords.words('german')

['aber',
 'alle',
 'allem',
 'allen',
 'aller',
 'alles',
 'als',
 'also',
 'am',
 'an',
 'ander',
 'andere',
 'anderem',
 'anderen',
 'anderer',
 'anderes',
 'anderm',
 'andern',
 'anderr',
 'anders',
 'auch',
 'auf',
 'aus',
 'bei',
 'bin',
 'bis',
 'bist',
 'da',
 'damit',
 'dann',
 'der',
 'den',
 'des',
 'dem',
 'die',
 'das',
 'dass',
 'daß',
 'derselbe',
 'derselben',
 'denselben',
 'desselben',
 'demselben',
 'dieselbe',
 'dieselben',
 'dasselbe',
 'dazu',
 'dein',
 'deine',
 'deinem',
 'deinen',
 'deiner',
 'deines',
 'denn',
 'derer',
 'dessen',
 'dich',
 'dir',
 'du',
 'dies',
 'diese',
 'diesem',
 'diesen',
 'dieser',
 'dieses',
 'doch',
 'dort',
 'durch',
 'ein',
 'eine',
 'einem',
 'einen',
 'einer',
 'eines',
 'einig',
 'einige',
 'einigem',
 'einigen',
 'einiger',
 'einiges',
 'einmal',
 'er',
 'ihn',
 'ihm',
 'es',
 'etwas',
 'euer',
 'eure',
 'eurem',
 'euren',
 'eurer',
 'eures',
 'für',
 'gegen',
 'gewesen',
 'hab',
 'habe',
 'haben',
 'hat',
 'hatte',
 'hatten',
 ...]

In [73]:
len(stopwords.words('german'))

232

In [74]:
stopwords.words('chinese')

['一',
 '一下',
 '一些',
 '一切',
 '一则',
 '一天',
 '一定',
 '一方面',
 '一旦',
 '一时',
 '一来',
 '一样',
 '一次',
 '一片',
 '一直',
 '一致',
 '一般',
 '一起',
 '一边',
 '一面',
 '万一',
 '上下',
 '上升',
 '上去',
 '上来',
 '上述',
 '上面',
 '下列',
 '下去',
 '下来',
 '下面',
 '不一',
 '不久',
 '不仅',
 '不会',
 '不但',
 '不光',
 '不单',
 '不变',
 '不只',
 '不可',
 '不同',
 '不够',
 '不如',
 '不得',
 '不怕',
 '不惟',
 '不成',
 '不拘',
 '不敢',
 '不断',
 '不是',
 '不比',
 '不然',
 '不特',
 '不独',
 '不管',
 '不能',
 '不要',
 '不论',
 '不足',
 '不过',
 '不问',
 '与',
 '与其',
 '与否',
 '与此同时',
 '专门',
 '且',
 '两者',
 '严格',
 '严重',
 '个',
 '个人',
 '个别',
 '中小',
 '中间',
 '丰富',
 '临',
 '为',
 '为主',
 '为了',
 '为什么',
 '为什麽',
 '为何',
 '为着',
 '主张',
 '主要',
 '举行',
 '乃',
 '乃至',
 '么',
 '之',
 '之一',
 '之前',
 '之后',
 '之後',
 '之所以',
 '之类',
 '乌乎',
 '乎',
 '乘',
 '也',
 '也好',
 '也是',
 '也罢',
 '了',
 '了解',
 '争取',
 '于',
 '于是',
 '于是乎',
 '云云',
 '互相',
 '产生',
 '人们',
 '人家',
 '什么',
 '什么样',
 '什麽',
 '今后',
 '今天',
 '今年',
 '今後',
 '仍然',
 '从',
 '从事',
 '从而',
 '他',
 '他人',
 '他们',
 '他的',
 '代替',
 '以',
 '以上',
 '以下',
 '以为',
 '以便',
 '以免',
 '以前',
 '以及',
 '以后',
 '以外',
 '以後',
 ...]
 

In [75]:
len(stopwords.words('chinese'))

841

In [80]:
stopwords.words('tamil')

['அங்கு',
 'அங்கே',
 'அடுத்த',
 'அதனால்',
 'அதன்',
 'அதற்கு',
 'அதிக',
 'அதில்',
 'அது',
 'அதே',
 'அதை',
 'அந்த',
 'அந்தக்',
 'அந்தப்',
 'அன்று',
 'அல்லது',
 'அவன்',
 'அவரது',
 'அவர்',
 'அவர்கள்',
 'அவள்',
 'அவை',
 'ஆகிய',
 'ஆகியோர்',
 'ஆகும்',
 'இங்கு',
 'இங்கே',
 'இடத்தில்',
 'இடம்',
 'இதனால்',
 'இதனை',
 'இதன்',
 'இதற்கு',
 'இதில்',
 'இது',
 'இதை',
 'இந்த',
 'இந்தக்',
 'இந்தத்',
 'இந்தப்',
 'இன்னும்',
 'இப்போது',
 'இரு',
 'இருக்கும்',
 'இருந்த',
 'இருந்தது',
 'இருந்து',
 'இவர்',
 'இவை',
 'உன்',
 'உள்ள',
 'உள்ளது',
 'உள்ளன',
 'எந்த',
 'என',
 'எனக்',
 'எனக்கு',
 'எனப்படும்',
 'எனவும்',
 'எனவே',
 'எனினும்',
 'எனும்',
 'என்',
 'என்ன',
 'என்னும்',
 'என்பது',
 'என்பதை',
 'என்ற',
 'என்று',
 'என்றும்',
 'எல்லாம்',
 'ஏன்',
 'ஒரு',
 'ஒரே',
 'ஓர்',
 'கொண்ட',
 'கொண்டு',
 'கொள்ள',
 'சற்று',
 'சிறு',
 'சில',
 'சேர்ந்த',
 'தனது',
 'தன்',
 'தவிர',
 'தான்',
 'நான்',
 'நாம்',
 'நீ',
 'பற்றி',
 'பற்றிய',
 'பல',
 'பலரும்',
 'பல்வேறு',
 'பின்',
 'பின்னர்',
 'பிற',
 'பிறகு',
 'பெரும்',
 'பேர்',
 'போது',
 ...]
 

In [81]:
len(stopwords.words('tamil'))

125
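The lists above are typically used to filter tokens. A minimal sketch, assuming a small hand-picked stopword set in place of the full NLTK list and using the tokens of the first sentence of the `AI` text:

```python
import string

# Small hand-picked stopword set used only for illustration; in practice the
# set would be built from a full list such as stopwords.words('english').
example_stopwords = {'to', 'the', 'of', 'is', 'in', 'and'}

# Tokens of the first sentence of the AI text, as produced by word_tokenize above.
tokens = ['Artificial', 'Intelligence', 'refers', 'to', 'the',
          'intelligence', 'of', 'machines', '.']

def is_punctuation(token):
    """True if every character of the token is an ASCII punctuation mark."""
    return all(ch in string.punctuation for ch in token)

# Compare in lower case, and use a set so each membership test is cheap.
content_tokens = [
    token for token in tokens
    if token.lower() not in example_stopwords and not is_punctuation(token)
]
print(content_tokens)
```

Converting the list to a `set` once is worthwhile for longer texts, since checking membership in a list scans it on every lookup.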

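## 12. Part-of-Speech Tagging
`nltk.pos_tag` labels each token with a Penn Treebank part-of-speech tag (for example `NN` for a noun, `VBZ` for a third-person singular present verb). The loop below tags one token at a time, so the tagger sees no surrounding context; passing the whole list, `nltk.pos_tag(sent_tokens)`, lets it use neighbouring words.

In [83]: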
sent = 'sam is natural when it comes to drawing'
sent_tokens = word_tokenize(sent)
sent_tokens

['sam', 'is', 'natural', 'when', 'it', 'comes', 'to', 'drawing']

In [84]:
for token in sent_tokens:
    print(nltk.pos_tag([token]))

[('sam', 'NN')]
[('is', 'VBZ')]
[('natural', 'JJ')]
[('when', 'WRB')]
[('it', 'PRP')]
[('comes', 'VBZ')]
[('to', 'TO')]
[('drawing', 'VBG')]
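Tagged output is usually consumed as a list of `(token, tag)` pairs. The sketch below does not run the tagger; it copies the pairs printed above and shows two common follow-up steps, grouping by tag and picking out verbs:

```python
from collections import defaultdict

# (token, tag) pairs copied from the pos_tag output above.
tagged = [('sam', 'NN'), ('is', 'VBZ'), ('natural', 'JJ'), ('when', 'WRB'),
          ('it', 'PRP'), ('comes', 'VBZ'), ('to', 'TO'), ('drawing', 'VBG')]

# Group tokens by tag, preserving their order in the sentence.
tokens_by_tag = defaultdict(list)
for token, tag in tagged:
    tokens_by_tag[tag].append(token)

# Penn Treebank verb tags all start with 'VB' (VBZ, VBG, ...).
verbs = [token for token, tag in tagged if tag.startswith('VB')]

print(dict(tokens_by_tag))
print(verbs)
```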


---
## Conclusion
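In this notebook, we explored:
- Accessing and listing available corpora in NLTK.
- Creating and working with custom text data.
- Tokenizing text into words, sentences, blank-line segments, whitespace-separated tokens, and word/punctuation tokens.
- Generating bigrams, trigrams, and general n-grams.
- Stemming with the Porter, Lancaster, and Snowball stemmers, and lemmatization with WordNet.
- Listing stopwords for several languages and tagging parts of speech.

Tokenization comes first in most NLP pipelines; stemming, lemmatization, stopword removal, and part-of-speech tagging build on its output and feed later tasks such as text classification.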

**Further Reading:**
- [Text Preprocessing in NLP](https://towardsdatascience.com/text-preprocessing-in-python-steps-tools-and-examples-3e2f4813d2d0)
- [Tokenization Explained](https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/)
