# Tokenization

Tokenization is the process of breaking text into its smallest meaningful units, called tokens. Words, numbers, and punctuation marks can all be tokens.

In [6]:
text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'

In [7]:
text.split(' ')

['Hi',
 'Everyone!',
 'This',
 'is',
 'Hackers',
 'Realm.',
 'We',
 'are',
 'learning',
 'Natural',
 'Language',
 'Processing.',
 'We',
 'reached',
 '1000000',
 'views.']

In [8]:
from nltk import sent_tokenize, word_tokenize

In [9]:
# split the text into sentences
sent_tokens = sent_tokenize(text)
sent_tokens

['Hi Everyone!',
 'This is Hackers Realm.',
 'We are learning Natural Language Processing.',
 'We reached 1000000 views.']
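
To get a feel for what sentence tokenization involves, here is a minimal standard-library sketch that splits on whitespace following `.`, `!` or `?`. The helper name `naive_sent_split` is made up for this illustration, and this is not how `sent_tokenize` works internally; it only happens to give the same result on this simple text.

```python
import re

text = ('Hi Everyone! This is Hackers Realm. We are learning Natural '
        'Language Processing. We reached 1000000 views.')

def naive_sent_split(s):
    # Hypothetical helper: split on whitespace that follows '.', '!' or '?'
    return re.split(r'(?<=[.!?])\s+', s.strip())

naive_sentences = naive_sent_split(text)
print(naive_sentences)
```

A rule this simple also splits after abbreviations such as "Dr.", which is one reason to prefer a dedicated sentence tokenizer for real text.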

In [10]:
# split the text into words
word_tokens = word_tokenize(text)
word_tokens

['Hi',
 'Everyone',
 '!',
 'This',
 'is',
 'Hackers',
 'Realm',
 '.',
 'We',
 'are',
 'learning',
 'Natural',
 'Language',
 'Processing',
 '.',
 'We',
 'reached',
 '1000000',
 'views',
 '.']
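
The main difference from `text.split(' ')` is that punctuation becomes its own token. A rough standard-library approximation of that behaviour is sketched below; the regular expression is an assumption chosen for this example and is much simpler than the rules `word_tokenize` actually applies.

```python
import re

text = ('Hi Everyone! This is Hackers Realm. We are learning Natural '
        'Language Processing. We reached 1000000 views.')

# Match either a run of word characters or a single punctuation character
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
```

On this text the sketch separates every `!` and `.` from the preceding word. It will diverge from `word_tokenize` on harder input such as contractions.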

# Stemming

Stemming is the process of reducing a word to its stem. A stem need not match the dictionary-based morphological root of the word; it is simply a form that is equal to or shorter than the original word.

In [13]:
from nltk.stem import PorterStemmer, SnowballStemmer
ps = PorterStemmer()

In [17]:
word = 'eats'
ps.stem(word)

'eat'

In [16]:
word = 'eating'
ps.stem(word)

'eat'

In [18]:
word = 'eaten'
ps.stem(word)

'eaten'
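
The three results above suggest that a stemmer works by stripping suffixes according to rules rather than by looking words up. The toy function below illustrates that idea with a hypothetical three-entry suffix list. It is not the Porter algorithm, which has many more rules and conditions; it is only a sketch of why `eaten` comes back unchanged when no rule matches.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule list, checked in order

def naive_stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words like "is" survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no rule matched, so the word is returned as-is

for w in ("eats", "eating", "eaten"):
    print(w, "->", naive_stem(w))
```

Because the rules know nothing about meaning, the result is not guaranteed to be a real word, which is the same effect seen in stems such as `thi` or `natur`.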

In [19]:
text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'

In [20]:
word_tokens = word_tokenize(text)

In [21]:
stemmed_sentence = " ".join(ps.stem(word) for word in word_tokens)
stemmed_sentence

'Hi everyon ! thi is hacker realm . We are learn natur languag process . We reach 1000000 view .'

# Lemmatization

Lemmatization is the process of reducing a word to its lemma, the base form under which it is listed in a dictionary. Unlike stemming, it relies on a vocabulary lookup and on the word's part of speech, so it takes longer to compute than stemming.

In [22]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [30]:
lemmatizer.lemmatize('workers')

'worker'

In [31]:
lemmatizer.lemmatize('words')

'word'

In [37]:
lemmatizer.lemmatize('feet')

'foot'

In [39]:
lemmatizer.lemmatize('stripes', 'v')

'strip'

In [40]:
lemmatizer.lemmatize('stripes', 'n')

'stripe'
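
The two `stripes` results show that the lemma depends on the part of speech passed as the second argument. The sketch below mimics that with a tiny hand-written lookup table. The table entries and the name `toy_lemmatize` are invented for illustration and merely stand in for a real dictionary such as WordNet.

```python
# Hand-written stand-in for a dictionary such as WordNet (hypothetical entries)
LEMMA_TABLE = {
    ("workers", "n"): "worker",
    ("feet", "n"): "foot",
    ("stripes", "n"): "stripe",
    ("stripes", "v"): "strip",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    # Look up the (word, part of speech) pair; unknown pairs come back unchanged
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("feet"))
print(toy_lemmatize("stripes", "v"))
print(toy_lemmatize("stripes", "n"))
print(toy_lemmatize("learning"))
```

As in the calls above, the part of speech defaults to noun here, so a verb form such as `learning` is returned unchanged unless a verb tag is supplied and a matching entry exists.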

In [41]:
text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'

In [42]:
word_tokens = word_tokenize(text)

In [44]:
lemmatized_sentence = " ".join(lemmatizer.lemmatize(word.lower()) for word in word_tokens)
lemmatized_sentence

'hi everyone ! this is hacker realm . we are learning natural language processing . we reached 1000000 view .'

# Part of Speech Tagging (POS)

Part of Speech Tagging converts a sentence, given as a list of words, into a list of tuples of the form (word, tag). The tag is a part-of-speech tag that indicates whether the word is a noun, adjective, verb, and so on.

The tags used below come from the Penn Treebank tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [45]:
from nltk import pos_tag

In [51]:
pos_tag(['fighting'])

[('fighting', 'VBG')]

In [46]:
text = 'Hi Everyone! This is Hackers Realm. We are learning Natural Language Processing. We reached 1000000 views.'

In [47]:
word_tokens = word_tokenize(text)

In [52]:
pos_tag(word_tokens)

[('Hi', 'NNP'),
 ('Everyone', 'NN'),
 ('!', '.'),
 ('This', 'DT'),
 ('is', 'VBZ'),
 ('Hackers', 'NNP'),
 ('Realm', 'NNP'),
 ('.', '.'),
 ('We', 'PRP'),
 ('are', 'VBP'),
 ('learning', 'VBG'),
 ('Natural', 'NNP'),
 ('Language', 'NNP'),
 ('Processing', 'NNP'),
 ('.', '.'),
 ('We', 'PRP'),
 ('reached', 'VBD'),
 ('1000000', 'CD'),
 ('views', 'NNS'),
 ('.', '.')]
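
The tags above are Penn Treebank tags, while the lemmatizer's second argument, as in `lemmatizer.lemmatize('stripes', 'v')`, is a single-letter code. A common way to bridge the two is a small mapping helper. The sketch below is one possible version: the function name `penn_to_wordnet` is made up for this example, and the sample pairs are copied from the output above.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a single-letter WordNet POS code."""
    if tag.startswith('J'):
        return 'a'  # adjective (JJ, JJR, JJS)
    if tag.startswith('V'):
        return 'v'  # verb (VB, VBD, VBG, ...)
    if tag.startswith('RB'):
        return 'r'  # adverb (RB, RBR, RBS)
    return 'n'      # fall back to noun for everything else

# A few (word, tag) pairs taken from the tagged output above
tagged = [('We', 'PRP'), ('are', 'VBP'), ('learning', 'VBG'),
          ('reached', 'VBD'), ('views', 'NNS')]

wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

Passing each mapped code as the second argument to `lemmatizer.lemmatize` should let the lemmatizer treat words such as `learning` and `reached` as verbs rather than nouns.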