# Word Tokenization

Word tokenization (also called word segmentation) is the problem of dividing a string of written language into its component words.

In [1]:
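from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data, installed once via nltk.download().
sentence = "Books are on the table"
words = word_tokenize(sentence)  # list of word and punctuation tokens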

In [2]:
print(words)
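
To make the idea of splitting a string into words concrete, here is a minimal regex-based sketch. It is not the algorithm behind `word_tokenize`; the function name `simple_tokenize` and the pattern are assumptions chosen for illustration only.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Books are on the table."))
# ['Books', 'are', 'on', 'the', 'table', '.']
```

A pattern this simple breaks contractions such as "Don't" into three tokens, which is one reason a dedicated tokenizer is usually preferred.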

# Lower casing

Lower casing converts every word to lower case (for example, NLP -> nlp).
Words such as Book and book mean the same thing, but without lower casing they are represented as two
different words in the vector space model, which adds unnecessary dimensions.

In [3]:
sentence = "Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East."
sentence = sentence.lower()
print(sentence)
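
The effect on the number of dimensions can be checked by comparing vocabulary sizes before and after lower casing. The token list below is made up for illustration.

```python
# Hypothetical tokens: the same two words in different casings.
tokens = ["Book", "book", "BOOK", "table", "Table"]

vocab_raw = set(tokens)                     # case-sensitive vocabulary
vocab_lower = {t.lower() for t in tokens}   # vocabulary after lower casing

print(len(vocab_raw))    # 5
print(len(vocab_lower))  # 2
```

Each distinct vocabulary entry becomes one dimension in a bag-of-words representation, so here lower casing reduces five dimensions to two.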

# Stop word removal

Stop words are very common words (a, an, the, etc.) that appear in almost every document. Because they occur
everywhere, they carry little information and do not help in distinguishing one document from another.


In [4]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
sentence = "Machine Learning is cool!"
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(sentence)
# The stop word list is lower case, so compare lower-cased tokens.
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered_sentence)
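
The same filtering can be sketched without NLTK to show what is going on. The stop list here is a deliberately tiny, hand-picked assumption and the helper name `remove_stop_words` is hypothetical; NLTK's English list is much longer.

```python
# Hypothetical miniature stop list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "on", "of"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop list."""
    # Compare in lower case so that "The" and "the" are both removed.
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["The", "books", "are", "on", "the", "table"]))
# ['books', 'table']
```

Storing the stop words in a set keeps each membership test cheap, which matters when filtering large corpora.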

# Stemming

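Stemming is the process of transforming a word to its root form (stem), typically by stripping suffixes with heuristic rules. The resulting stem is not always a valid word in the language.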

In [5]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
sentence = "Machine Learning is cool"
# Stem each whitespace-separated word; the stemmer also lower-cases its output.
for word in sentence.split():
    print(ps.stem(word))
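
The sketch below shows the general idea of rule-based suffix stripping. It is a toy, not the Porter algorithm: the suffix list, the minimum stem length of three characters, and the name `toy_stem` are all assumptions made for illustration.

```python
# Hypothetical suffix rules, tried in order; the first match wins.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no rule applied

for w in ["learning", "caring", "studies", "is"]:
    print(w, "->", toy_stem(w))
# learning -> learn
# caring -> car
# studies -> studi
# is -> is
```

The outputs "car" and "studi" illustrate the trade-off: blind suffix stripping is fast, but it can produce a stem that is a different word or not a word at all.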

# Lemmatization

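Unlike stemming, lemmatization reduces a word to its dictionary form (lemma), which is a valid word in the language. The result depends on the part of speech: WordNetLemmatizer treats a word as a noun unless the pos argument says otherwise.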


In [6]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos selects the part of speech: 'n' for noun, 'v' for verb.
print(lemmatizer.lemmatize("Machine", pos='n'))
print(lemmatizer.lemmatize("caring", pos='v'))
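
Conceptually, a lemmatizer looks a word up in a lexicon, using the part of speech to pick the right entry. The sketch below mimics that with a hand-written table; the entries and the name `toy_lemmatize` are hypothetical, and a real lemmatizer consults a full lexicon such as WordNet.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech).
LEMMAS = {
    ("caring", "v"): "care",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma; return the word unchanged if the pair is unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("caring", pos="v"))  # care
print(toy_lemmatize("caring", pos="n"))  # caring
print(toy_lemmatize("mice"))             # mouse
```

The two results for "caring" show why the part of speech matters: the same surface form maps to different lemmas depending on how it is used.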