In [1]:
! pip install nltk



In [24]:
corpus = """Hello, my name is ankit singh.
i like eating mangoes! it's yummyyy.
"""


## Tokenization 1 - sent_tokenize

Converting a paragraph into sentences.

Punkt is a pre-trained sentence tokenizer model provided by NLTK (Natural Language Toolkit). It uses an unsupervised algorithm to learn, from raw text, which periods belong to abbreviations, initials, or numbers and which ones actually end a sentence, so it does not depend on a hand-written rule such as "split at every period". This makes it robust to abbreviations, special cases, and varied punctuation styles, and pre-trained models are available for several languages.

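In NLTK, sent_tokenize uses Punkt by default, so you can segment text into sentences accurately and efficiently for NLP tasks like text summarization, information extraction, and translation. Recent NLTK releases may look for the punkt_tab resource instead of punkt; if sent_tokenize raises a LookupError, nltk.download('punkt_tab') is the likely fix.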
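To see why a learned model helps, here is a minimal stdlib-only sketch of the naive alternative: splitting after every '.', '!' or '?' that is followed by whitespace. The helper name `naive_sent_split` and the sample text are made up for illustration; they are not part of NLTK.

```python
import re

def naive_sent_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Hypothetical sample text with an abbreviation at the start.
sample = "Dr. Rao likes mangoes. They are sweet."
naive = naive_sent_split(sample)
print(naive)  # ['Dr.', 'Rao likes mangoes.', 'They are sweet.']
```

The naive rule treats "Dr." as the end of a sentence and returns three pieces instead of two. Punkt is designed to avoid this kind of false split by learning which tokens behave like abbreviations.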

In [25]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [26]:
from nltk.tokenize import sent_tokenize  # splits a paragraph into sentences

documents = sent_tokenize(corpus)
documents

['Hello, my name is ankit singh.', 'i like eating mangoes!', "it's yummyyy."]

In [27]:
for sentence in documents:
    print(sentence)

Hello, my name is ankit singh.
i like eating mangoes!
it's yummyyy.


## Tokenization 2 - word_tokenize

Using this, we can convert:

* paragraph --> words
* sentences --> words


In [28]:
from nltk.tokenize import word_tokenize

words = word_tokenize(corpus)
words

['Hello',
 ',',
 'my',
 'name',
 'is',
 'ankit',
 'singh',
 '.',
 'i',
 'like',
 'eating',
 'mangoes',
 '!',
 'it',
 "'s",
 'yummyyy',
 '.']

In [29]:
words2 = word_tokenize(documents[1])
words2

['i', 'like', 'eating', 'mangoes', '!']

## Tokenization 3 - wordpunct_tokenize

The main difference between word_tokenize and wordpunct_tokenize in NLTK lies in how they handle punctuation:

1. word_tokenize is NLTK's standard word tokenizer. It splits text into words and handles punctuation according to common NLP conventions. Under the hood it uses a Treebank-style tokenizer, which separates words from punctuation in a structured way. For example, it splits contractions (like "can't" into ["ca", "n't"]) but keeps many abbreviations intact.

2. wordpunct_tokenize is a simpler, regex-based tokenizer that splits text into runs of word characters and runs of punctuation. It breaks at every boundary between the two, including inside contractions and abbreviations. For example, "I'm" becomes ["I", "'", "m"], and "N.L.T.K." becomes ["N", ".", "L", ".", "T", ".", "K", "."]. This tokenizer is useful if you need raw tokens without NLP-specific rules, especially if you want to analyze punctuation separately.

### Summary
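* word_tokenize: More NLP-friendly. It splits contractions into linguistically meaningful parts (for example, "it's" becomes "it" and "'s") and keeps some abbreviations as single tokens.
* wordpunct_tokenize: Splits purely at the boundaries between word characters, punctuation, and whitespace, so punctuation always ends up in its own tokens, making the result more granular.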
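The behaviour of wordpunct_tokenize can be reproduced with a single regular expression. The sketch below assumes the pattern `\w+|[^\w\s]+`, which is the one NLTK documents for WordPunctTokenizer; the helper name `wordpunct_sketch` is hypothetical and only meant for illustration.

```python
import re

# Assumed pattern: a run of word characters OR a run of punctuation
# (anything that is neither a word character nor whitespace).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Return word and punctuation tokens using only the regex above."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_sketch("I'm"))             # ['I', "'", 'm']
print(wordpunct_sketch("N.L.T.K."))        # ['N', '.', 'L', '.', 'T', '.', 'K', '.']
print(wordpunct_sketch("Wait... what?!"))  # ['Wait', '...', 'what', '?!']
```

Note that consecutive punctuation marks such as "..." or "?!" stay together as one token, because the second alternative matches a whole run of punctuation.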

In [30]:
from nltk.tokenize import wordpunct_tokenize

wordPunct = wordpunct_tokenize(corpus)
wordPunct

['Hello',
 ',',
 'my',
 'name',
 'is',
 'ankit',
 'singh',
 '.',
 'i',
 'like',
 'eating',
 'mangoes',
 '!',
 'it',
 "'",
 's',
 'yummyyy',
 '.']

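## Tokenization 4 - TreebankWordTokenizer

With TreebankWordTokenizer, a full stop (.) at the end of a sentence in the middle of the text stays attached to the preceding word (for example, 'singh.'). Only the final full stop of the input is split off as a separate token, because the tokenizer assumes its input is a single sentence.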

word_tokenize (a function) and TreebankWordTokenizer (a class) both tokenize text into words in NLTK, but they work slightly differently, mainly in how they handle sentence boundaries and punctuation.

1. word_tokenize is NLTK's general-purpose tokenization function. Internally, it uses a Treebank-style tokenizer, but with an additional step: it first breaks the text into sentences with the Punkt sentence tokenizer and then tokenizes each sentence. Because of that step, the full stop at the end of every sentence becomes its own token, not only the last one.

2. TreebankWordTokenizer is a tokenizer modeled after the Penn Treebank, a corpus used for training parsers in NLP. It applies the standard Treebank tokenization rules directly to whatever string it receives and expects that string to be already split into sentences. It splits contractions (like "can't" into "ca n't") and handles punctuation separately, so it's especially useful for preparing text for tasks requiring fine-grained tokenization like parsing and tagging.
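The effect of the sentence-splitting step can be sketched by running TreebankWordTokenizer once per sentence and once on the joined text. To avoid a Punkt download, the sentences below are copied by hand from the sent_tokenize output earlier in this notebook; the variable names `per_sentence` and `whole_text` are only illustrative.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Sentences copied from the sent_tokenize output shown earlier.
sentences = [
    "Hello, my name is ankit singh.",
    "i like eating mangoes!",
    "it's yummyyy.",
]

# Tokenize sentence by sentence, then flatten into one list.
per_sentence = [tok for sent in sentences for tok in tokenizer.tokenize(sent)]

# Tokenize the joined text in one call for comparison.
whole_text = tokenizer.tokenize(" ".join(sentences))

print(per_sentence)
print(whole_text)
```

Tokenizing sentence by sentence yields 'singh' and '.' as separate tokens, which matches the word_tokenize output above, while tokenizing the joined text keeps 'singh.' as one token.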


In [31]:
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Hello',
 ',',
 'my',
 'name',
 'is',
 'ankit',
 'singh.',
 'i',
 'like',
 'eating',
 'mangoes',
 '!',
 'it',
 "'s",
 'yummyyy',
 '.']