# Tokenization
We start by declaring our own *corpus*, a short paragraph of text to tokenize.

In [5]:
corpus = """
Hello, My name is Lakshya Saxena.
Please take a look at my profile! Consider me for a technical job.
"""
print(corpus)


Hello, My name is Lakshya Saxena.
Please take a look at my profile! Consider me for a technical job.



Let's start with *tokenization*. First, we split our *paragraph/corpus* into *sentences*.<br>
**Remember**: Before running the tokenization code, download the Punkt tokenizer data inside your virtual environment:<br>
`python -m nltk.downloader punkt punkt_tab`

In [7]:
from nltk.tokenize import sent_tokenize

document = sent_tokenize(corpus, language="english")

In [8]:
for sentence in document:
    print(sentence)


Hello, My name is Lakshya Saxena.
Please take a look at my profile!
Consider me for a technical job.


Next, let's convert a *paragraph* to *words*, or a single *sentence* to *words*.

In [9]:
from nltk.tokenize import word_tokenize
word_tokenize(corpus)

['Hello',
 ',',
 'My',
 'name',
 'is',
 'Lakshya',
 'Saxena',
 '.',
 'Please',
 'take',
 'a',
 'look',
 'at',
 'my',
 'profile',
 '!',
 'Consider',
 'me',
 'for',
 'a',
 'technical',
 'job',
 '.']

In [10]:
for sentence in document:
    print(word_tokenize(sentence))

['Hello', ',', 'My', 'name', 'is', 'Lakshya', 'Saxena', '.']
['Please', 'take', 'a', 'look', 'at', 'my', 'profile', '!']
['Consider', 'me', 'for', 'a', 'technical', 'job', '.']


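To split on every punctuation character, we can use `wordpunct_tokenize`. Unlike `word_tokenize`, it also breaks apart contractions and decimal numbers. Our corpus contains neither, so the result here is the same as with `word_tokenize`.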

In [11]:
from nltk.tokenize import wordpunct_tokenize
print(wordpunct_tokenize(corpus))

['Hello', ',', 'My', 'name', 'is', 'Lakshya', 'Saxena', '.', 'Please', 'take', 'a', 'look', 'at', 'my', 'profile', '!', 'Consider', 'me', 'for', 'a', 'technical', 'job', '.']
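
Since our corpus does not show what makes `wordpunct_tokenize` different, here is a minimal sketch on a hypothetical sentence containing a contraction and a price. It uses only the standard `re` module and assumes the pattern `\w+|[^\w\s]+`, which is the regular expression NLTK's `WordPunctTokenizer` is built on; `wordpunct_sketch` and `sample` are illustrative names, not part of NLTK.

```python
import re

# Assumption: this mirrors the pattern behind NLTK's WordPunctTokenizer.
# A token is either a run of word characters or a run of punctuation.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def wordpunct_sketch(text):
    """Split text into runs of word characters and runs of punctuation."""
    return WORDPUNCT_PATTERN.findall(text)


# Hypothetical sample sentence (not part of the corpus above).
sample = "Don't pay $3.50 for it."
tokens = wordpunct_sketch(sample)
print(tokens)
# ['Don', "'", 't', 'pay', '$', '3', '.', '50', 'for', 'it', '.']
```

The contraction `Don't` and the price `$3.50` are both split at every punctuation character, which is usually the reason to choose (or avoid) this tokenizer.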


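To keep a *full stop (.)* attached to its word, we can use `TreebankWordTokenizer`. It assumes the input is a single sentence, so only the full stop at the very end of the text is split off: compare `'Saxena.'` with `'job', '.'` in the output below.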

In [12]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Hello',
 ',',
 'My',
 'name',
 'is',
 'Lakshya',
 'Saxena.',
 'Please',
 'take',
 'a',
 'look',
 'at',
 'my',
 'profile',
 '!',
 'Consider',
 'me',
 'for',
 'a',
 'technical',
 'job',
 '.']
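
Because `TreebankWordTokenizer` treats its input as one sentence, a common way to use it is on text that has already been split into sentences. The sketch below hard-codes the three sentences that `sent_tokenize` produced earlier (so it does not need the Punkt data) and tokenizes each one separately; the names `sentences` and `tokens_per_sentence` are illustrative.

```python
from nltk.tokenize import TreebankWordTokenizer

# Sentences copied from the sent_tokenize output shown earlier.
sentences = [
    "Hello, My name is Lakshya Saxena.",
    "Please take a look at my profile!",
    "Consider me for a technical job.",
]

tokenizer = TreebankWordTokenizer()

# Tokenize one sentence at a time so every sentence-final
# full stop is at the end of the input and gets split off.
tokens_per_sentence = [tokenizer.tokenize(sentence) for sentence in sentences]

for tokens in tokens_per_sentence:
    print(tokens)
```

With this approach `Saxena` and its full stop should come out as two separate tokens, matching what `word_tokenize` produced for the same sentences.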