In [1]:
! pip install nltk



# Tokenization

In [2]:
tweet = "Sometimes to understand a word's meaning you need more than a definition. you need to see the word used in a sentence. At YourDictionary, we give you the tools to learn what a word means and how to use it correctly. With this sentence maker, simply type a word in the search bar and see a variety of sentences with that word used in its different ways. Our sentence generator can provide more context and relevance, ensuring you use a word the right way."

In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
text = "Hello! how are you?"

In [5]:
from nltk.tokenize import word_tokenize

In [6]:
word_tok = word_tokenize(text)
word_tok

['Hello', '!', 'how', 'are', 'you', '?']

In [7]:
from nltk.tokenize import sent_tokenize
sent_tok = sent_tokenize(tweet)
sent_tok

["Sometimes to understand a word's meaning you need more than a definition.",
 'you need to see the word used in a sentence.',
 'At YourDictionary, we give you the tools to learn what a word means and how to use it correctly.',
 'With this sentence maker, simply type a word in the search bar and see a variety of sentences with that word used in its different ways.',
 'Our sentence generator can provide more context and relevance, ensuring you use a word the right way.']

## There are 3 types of word tokenizers shown here: word_tokenize, TreebankWordTokenizer, and WordPunctTokenizer.

In [8]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(tweet))
print("Length of tokenizer: ",len(tokenizer.tokenize(tweet)))

['Sometimes', 'to', 'understand', 'a', 'word', "'s", 'meaning', 'you', 'need', 'more', 'than', 'a', 'definition.', 'you', 'need', 'to', 'see', 'the', 'word', 'used', 'in', 'a', 'sentence.', 'At', 'YourDictionary', ',', 'we', 'give', 'you', 'the', 'tools', 'to', 'learn', 'what', 'a', 'word', 'means', 'and', 'how', 'to', 'use', 'it', 'correctly.', 'With', 'this', 'sentence', 'maker', ',', 'simply', 'type', 'a', 'word', 'in', 'the', 'search', 'bar', 'and', 'see', 'a', 'variety', 'of', 'sentences', 'with', 'that', 'word', 'used', 'in', 'its', 'different', 'ways.', 'Our', 'sentence', 'generator', 'can', 'provide', 'more', 'context', 'and', 'relevance', ',', 'ensuring', 'you', 'use', 'a', 'word', 'the', 'right', 'way', '.']
Length of tokenizer:  89


In [9]:
from nltk.tokenize import WordPunctTokenizer
tokenizer_w = WordPunctTokenizer()
print(tokenizer_w.tokenize(tweet))
print(len(tokenizer_w.tokenize(tweet)))

['Sometimes', 'to', 'understand', 'a', 'word', "'", 's', 'meaning', 'you', 'need', 'more', 'than', 'a', 'definition', '.', 'you', 'need', 'to', 'see', 'the', 'word', 'used', 'in', 'a', 'sentence', '.', 'At', 'YourDictionary', ',', 'we', 'give', 'you', 'the', 'tools', 'to', 'learn', 'what', 'a', 'word', 'means', 'and', 'how', 'to', 'use', 'it', 'correctly', '.', 'With', 'this', 'sentence', 'maker', ',', 'simply', 'type', 'a', 'word', 'in', 'the', 'search', 'bar', 'and', 'see', 'a', 'variety', 'of', 'sentences', 'with', 'that', 'word', 'used', 'in', 'its', 'different', 'ways', '.', 'Our', 'sentence', 'generator', 'can', 'provide', 'more', 'context', 'and', 'relevance', ',', 'ensuring', 'you', 'use', 'a', 'word', 'the', 'right', 'way', '.']
94
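The difference in counts (89 vs. 94) comes from punctuation handling: in the outputs above, TreebankWordTokenizer keeps "'s" as one token and leaves the periods attached to 'definition.', 'sentence.', 'correctly.' and 'ways.', while WordPunctTokenizer splits all of them. As a rough sketch, WordPunctTokenizer can be approximated with the regular expression `\w+|[^\w\s]+`, which is the pattern given in the NLTK documentation. The helper name `word_punct_tokenize` and the sample string are made up for illustration.

```python
import re

def word_punct_tokenize(text):
    # Runs of word characters, or runs of punctuation (non-word, non-space).
    return re.findall(r"\w+|[^\w\s]+", text)

sample = "a word's meaning."
tokens = word_punct_tokenize(sample)
print(tokens)  # ['a', 'word', "'", 's', 'meaning', '.']
```

The apostrophe and the final period each become separate tokens, which is where the extra tokens come from.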


# Stemming

Stemming removes suffixes from a word to reduce it to a root form (the stem).

For example, removing the 'ing' suffix from 'Flying' gives the stem 'Fly'.
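To make the idea concrete, here is a deliberately naive suffix stripper in plain Python. It is only a sketch: the `SUFFIXES` tuple, the `naive_stem` function, and the minimum stem length are all assumptions chosen for illustration, and real stemmers such as Porter apply many more rules.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical suffix list, checked in this order

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[: -len(suffix)]
    return lowered

for w in ["Flying", "danced", "words", "sing"]:
    print(w, "->", naive_stem(w))
# Flying -> fly
# danced -> danc
# words -> word
# sing -> sing
```

The minimum-length check is what keeps 'sing' from being reduced to 's'.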

## Porter Stemmer

In [10]:
from nltk.stem import PorterStemmer
stemming = PorterStemmer()
word = 'danced'
stemming.stem(word)

'danc'

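We can see that the result is not a dictionary word: we might expect 'dance', but the stemmer gives 'danc'. This is normal for the Porter stemmer, because a stem only has to be the same for related forms of a word; it does not have to be a valid word itself.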
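To check that related forms do collapse to one stem, here is a minimal sketch (the word list is chosen for illustration):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
forms = ["dance", "danced", "dancing", "dances"]

# Map each surface form to its Porter stem.
stems = {form: stemmer.stem(form) for form in forms}
print(stems)

# All four forms share a single stem, which is what matters for search or counting.
print(len(set(stems.values())) == 1)
```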

In [11]:
word = 'replacement'
stemming.stem(word)

'replac'

In [12]:
word = 'happiness'
stemming.stem(word)

'happi'

Again, 'replac' and 'happi' are not dictionary words. Next, we will try the Lancaster stemmer.

## Lancaster Stemmer

In [14]:
from nltk.stem import LancasterStemmer
stemming1 = LancasterStemmer()
word = 'happily'
stemming1.stem(word)

'happy'

## Regular Expression Stemmer

In [15]:
from nltk.stem import RegexpStemmer
stemming2 = RegexpStemmer('ing$|s$|e$|able$|ness$', min=3)
word = 'raining'
stemming2.stem(word)

'rain'

In [16]:
word = 'flying'
stemming2.stem(word)

'fly'

In [17]:
word = 'happiness'
stemming2.stem(word)

'happi'
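RegexpStemmer deletes whatever the pattern matches, as long as the word has at least `min` characters. The plain-Python approximation below is a sketch of that behavior, not NLTK's source, and the name `regexp_stem` is hypothetical. It also shows how easily a simple pattern over-stems a word such as 'bus'.

```python
import re

# Same pattern as the RegexpStemmer cell above.
PATTERN = re.compile(r"ing$|s$|e$|able$|ness$")

def regexp_stem(word, min_len=3):
    # Words shorter than min_len are returned unchanged.
    if len(word) < min_len:
        return word
    return PATTERN.sub("", word)

for w in ["raining", "happiness", "bus", "is"]:
    print(w, "->", regexp_stem(w))
# raining -> rain
# happiness -> happi
# bus -> bu
# is -> is
```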

## Snowball Stemmer

In [18]:
nltk.download("snowball_data")

[nltk_data] Downloading package snowball_data to /root/nltk_data...


True

In [19]:
from nltk.stem import SnowballStemmer
stemming3 = SnowballStemmer("english")
word = 'happiness'
stemming3.stem(word)


'happi'

In [20]:
stemming3 = SnowballStemmer("arabic")
word = 'تحلق'
stemming3.stem(word)

'تحلق'

Snowball stemmer can be used with different languages.
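The supported languages are exposed on the class as `SnowballStemmer.languages`. A small sketch that lists them and builds one stemmer per language (the variable names `supported` and `stemmers` are chosen for illustration):

```python
from nltk.stem import SnowballStemmer

# Tuple of language names accepted by the SnowballStemmer constructor.
supported = SnowballStemmer.languages
print(supported)

# Build one stemmer per language of interest.
stemmers = {lang: SnowballStemmer(lang) for lang in ("english", "arabic")}
print(stemmers["english"].stem("happiness"))
```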

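Which stemmer is best depends on the task. Lancaster and Snowball are common choices: Snowball (also known as Porter2) refines the original Porter algorithm and supports several languages, while Lancaster is the most aggressive of the stemmers shown here.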

# Lemmatization

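Lemmatization reduces a word to its dictionary form (the lemma), using a vocabulary and the word's part of speech instead of just stripping suffixes. It is commonly used in applications such as chatbots.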
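Unlike a stemmer, a lemmatizer returns a real word, and the answer can depend on the part of speech. The toy lookup below is only an illustration of that idea: the `TOY_LEMMAS` table and the `toy_lemmatize` function are hypothetical and are not how WordNet is implemented. The default `pos="n"` mirrors the noun default of NLTK's WordNetLemmatizer.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
TOY_LEMMAS = {
    ("better", "a"): "good",
    ("went", "v"): "go",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when no entry exists.
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("went", pos="v"))     # go
print(toy_lemmatize("meeting", pos="v"))  # meet
print(toy_lemmatize("meeting"))           # meeting
```

No suffix rule could turn 'went' into 'go', which is why lemmatization needs dictionary data.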

In [21]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

Packages such as punkt, snowball_data, and wordnet are pre-built data resources (trained models and corpora) that NLTK downloads separately from the library itself.

In [27]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

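# The default part of speech is 'n' (noun), so pass pos='v' to lemmatize verbs.
# This result is not displayed because it is not the last expression in the cell.
lemmatizer.lemmatize('going', pos='v')
words = [("eating", 'v'), ("playing", 'v')]

for word, pos in words:
    print(lemmatizer.lemmatize(word, pos=pos))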

eat
play
