## Tokenization Using NLTK

In [1]:
# NLTK (Natural Language Toolkit) is a Python package for analyzing text in NLP tasks.
!pip install nltk

Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.5.15-cp311-cp311-win_amd64.whl.metadata (41 kB)
Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Downloading regex-2024.5.15-cp311-cp311-win_amd64.whl (268 kB)
Installing collected packages: regex, nltk
Successfully installed nltk-3.8.1 regex-2024.5.15





In [2]:
corpus="""Hello guys, My name is Ashwin Nagarwal.
I am a Machine Learning enthusiast!!. 
I'm currently learning NLP for Tokenization.
"""

In [3]:
print(corpus)

Hello guys, My name is Ashwin Nagarwal.
I am a Machine Learning enthusiast!!. 
I'm currently learning NLP for Tokenization.



#### Sentence Tokenizer

In [4]:
## Tokenization
## Paragraph-->sentences
from nltk.tokenize import sent_tokenize

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\PREM RAJ
[nltk_data]     NAGARWAL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
# sent_tokenize converts a paragraph into a list of sentences
documents=sent_tokenize(corpus)
documents

['Hello guys, My name is Ashwin Nagarwal.',
 'I am a Machine Learning enthusiast!!.',
 "I'm currently learning NLP for Tokenization."]

In [6]:
type(documents)

list

In [7]:
for sentence in documents:
    print(sentence)

Hello guys, My name is Ashwin Nagarwal.
I am a Machine Learning enthusiast!!.
I'm currently learning NLP for Tokenization.
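
To see what a sentence tokenizer has to get right, here is a deliberately naive splitter that uses only Python's `re` module. It is not how `sent_tokenize` works (NLTK relies on the pretrained Punkt model); `naive_sent_split` is a hypothetical helper written for illustration only.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

corpus = (
    "Hello guys, My name is Ashwin Nagarwal.\n"
    "I am a Machine Learning enthusiast!!. \n"
    "I'm currently learning NLP for Tokenization.\n"
)

# On this corpus the naive rule yields the same three sentences as above.
for sentence in naive_sent_split(corpus):
    print(sentence)

# An abbreviation shows the weakness: the rule also splits after "Dr."
print(naive_sent_split("Dr. Smith arrived. He sat down."))
```

The last call returns `['Dr.', 'Smith arrived.', 'He sat down.']`, which is the kind of error a trained sentence tokenizer is meant to avoid.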


#### Word Tokenizer

In [8]:
## Tokenization 
## Paragraph-->words
## sentence--->words
from nltk.tokenize import word_tokenize

In [9]:
word_tokenize(corpus)

['Hello',
 'guys',
 ',',
 'My',
 'name',
 'is',
 'Ashwin',
 'Nagarwal',
 '.',
 'I',
 'am',
 'a',
 'Machine',
 'Learning',
 'enthusiast',
 '!',
 '!',
 '.',
 'I',
 "'m",
 'currently',
 'learning',
 'NLP',
 'for',
 'Tokenization',
 '.']

In [10]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', 'guys', ',', 'My', 'name', 'is', 'Ashwin', 'Nagarwal', '.']
['I', 'am', 'a', 'Machine', 'Learning', 'enthusiast', '!', '!', '.']
['I', "'m", 'currently', 'learning', 'NLP', 'for', 'Tokenization', '.']


#### Word Punctuation Tokenizer

In [12]:
from nltk.tokenize import wordpunct_tokenize

In [13]:
# wordpunct_tokenize splits on every punctuation run: "I'm" becomes 'I', "'", 'm' and '!!.' stays one token
wordpunct_tokenize(corpus) 

['Hello',
 'guys',
 ',',
 'My',
 'name',
 'is',
 'Ashwin',
 'Nagarwal',
 '.',
 'I',
 'am',
 'a',
 'Machine',
 'Learning',
 'enthusiast',
 '!!.',
 'I',
 "'",
 'm',
 'currently',
 'learning',
 'NLP',
 'for',
 'Tokenization',
 '.']
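
The output above follows from the pattern behind `wordpunct_tokenize`: NLTK's `WordPunctTokenizer` is a regular-expression tokenizer built on `\w+|[^\w\s]+`, that is, runs of word characters or runs of punctuation. A minimal sketch with the standard `re` module (the helper name `wordpunct_sketch` is made up for this example) behaves the same way on our text:

```python
import re

# Runs of word characters, or runs of characters that are neither word chars nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Hypothetical stand-in for wordpunct_tokenize, for illustration only."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_sketch("I am a Machine Learning enthusiast!!. "))
print(wordpunct_sketch("I'm currently learning NLP."))
```

Because punctuation is matched as a run, `!!.` comes out as one token, while the apostrophe in `I'm` separates `I` from `m`.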

#### Treebank Word Tokenizer

In [14]:
from nltk.tokenize import TreebankWordTokenizer

In [15]:
tokenizer=TreebankWordTokenizer()

In [16]:
# A full stop directly after a word stays attached (see 'Nagarwal.') unless it ends the whole text
tokenizer.tokenize(corpus)

['Hello',
 'guys',
 ',',
 'My',
 'name',
 'is',
 'Ashwin',
 'Nagarwal.',
 'I',
 'am',
 'a',
 'Machine',
 'Learning',
 'enthusiast',
 '!',
 '!',
 '.',
 'I',
 "'m",
 'currently',
 'learning',
 'NLP',
 'for',
 'Tokenization',
 '.']

In [17]:
list(TreebankWordTokenizer().span_tokenize(corpus))  # (start, end) character offsets of each token in corpus

[(0, 5),
 (6, 10),
 (10, 11),
 (12, 14),
 (15, 19),
 (20, 22),
 (23, 29),
 (30, 39),
 (40, 41),
 (42, 44),
 (45, 46),
 (47, 54),
 (55, 63),
 (64, 74),
 (74, 75),
 (75, 76),
 (76, 77),
 (79, 80),
 (80, 82),
 (83, 92),
 (93, 101),
 (102, 105),
 (106, 109),
 (110, 122),
 (122, 123)]
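
Each pair is a half-open character range into the original string, so slicing `corpus` with a span gives back the token text. The short check below uses a few spans copied from the output above; it needs no NLTK import, and the variable names are chosen just for this example.

```python
corpus = (
    "Hello guys, My name is Ashwin Nagarwal.\n"
    "I am a Machine Learning enthusiast!!. \n"
    "I'm currently learning NLP for Tokenization.\n"
)

# A few (start, end) pairs copied from the span_tokenize output above.
spans = [(0, 5), (30, 39), (80, 82), (110, 122)]

# Slicing with a span recovers the token exactly as it appears in the text.
tokens = [corpus[start:end] for start, end in spans]
print(tokens)
```

This is useful when tokens need to be mapped back to their position in the source text, for example to highlight them.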

## Stemming
Stemming reduces a word to its stem by stripping affixes such as suffixes and prefixes. The stem approximates the root form of the word (its lemma) but need not be a valid word itself. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).

In [18]:
## Classification problem
## Is a product comment a positive review or a negative review?
## Reviews ----> eating, eat, eaten; [going, gone, goes] ---> "go" is the stem of the bracketed words

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

#### PorterStemmer

In [19]:
from nltk.stem import PorterStemmer

In [20]:
stemming=PorterStemmer()

In [21]:
for word in words:
    print(word+"---->"+stemming.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [22]:
# Stemming can produce stems that are not meaningful words, for example:
stemming.stem('congratulations')
# A meaningful result would have been 'congratulate'

'congratul'

In [23]:
stemming.stem("swimming")

'swim'

#### RegexpStemmer Class
NLTK provides the RegexpStemmer class, which makes it easy to implement regular-expression-based stemming. It takes a single regular expression and removes any prefix or suffix that matches it. Let us see an example.

In [25]:
from nltk.stem import RegexpStemmer

In [26]:
# Use Shift+Tab in Jupyter to see the documentation
# RegexpStemmer takes two parameters: regexp and min (the minimum word length to stem)
reg_stemmer=RegexpStemmer('ing$|s$|e$|able$',min=4)

In [27]:
reg_stemmer.stem('eating')

'eat'

In [28]:
reg_stemmer.stem('ingeating')

'ingeat'

In [29]:
reg_stemmer.stem('cars')

'car'

In [30]:
reg_stemmer.stem('readable')

'read'
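
The behavior above can be approximated in a few lines of plain Python. This is only a sketch, assuming that `min` means "leave words shorter than this unchanged" and that every match of the pattern is removed; `regexp_stem` and `min_length` are names invented for this example, not part of NLTK.

```python
import re

def regexp_stem(word, pattern, min_length=0):
    """Strip whatever `pattern` matches, unless the word is shorter than min_length."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

SUFFIXES = r"ing$|s$|e$|able$"

for word in ["eating", "ingeating", "cars", "readable", "bus"]:
    print(word + "---->" + regexp_stem(word, SUFFIXES, min_length=4))
```

The `$` anchor is why only the trailing `ing` in `ingeating` is removed, and `bus` is left alone because it is shorter than the minimum length of 4.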

#### Snowball Stemmer
The Snowball stemmer, also known as the Porter2 stemming algorithm, is an improved version of the Porter stemmer that fixes some of its issues.

In [31]:
from nltk.stem import SnowballStemmer

In [32]:
snowballsstemmer=SnowballStemmer('english')

In [33]:
for word in words:
    print(word+"---->"+snowballsstemmer.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [34]:
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [35]:
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly")

('fair', 'sport')

In [36]:
# However, even SnowballStemmer sometimes produces stems that are not real words
snowballsstemmer.stem('goes')

'goe'

In [37]:
stemming.stem('goes')  # PorterStemmer gives the same meaningless stem

'goe'

Hence, we can't use these stemming techniques for our chatbots, where the output must be a meaningful word.
Instead, we use lemmatization.