### Tokenization

In [2]:
import nltk

The tokenizer splits a sentence into words, taking the language into account. It works better than the .split() method because it treats contractions (marked by apostrophes) and other punctuation marks as separate tokens.

In [16]:
test_sentence = "I hope this winter it's not too cold but I've heard it's going to be colder than the last one"

In [17]:
nltk.word_tokenize(test_sentence)

['I',
 'hope',
 'this',
 'winter',
 'it',
 "'s",
 'not',
 'too',
 'cold',
 'but',
 'I',
 "'ve",
 'heard',
 'it',
 "'s",
 'going',
 'to',
 'be',
 'colder',
 'than',
 'the',
 'last',
 'one']
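To make the comparison with .split() concrete, the sketch below applies plain whitespace splitting to the same sentence. It uses only the standard library, so the contrast with the `nltk.word_tokenize` output above is by inspection. The second sentence (`punctuated`) is a made-up example, not part of the original notebook.

```python
test_sentence = "I hope this winter it's not too cold but I've heard it's going to be colder than the last one"

# str.split() only breaks on whitespace, so contractions stay glued together.
split_tokens = test_sentence.split()
contractions = [tok for tok in split_tokens if "'" in tok]
print(len(split_tokens))
print(contractions)

# Hypothetical extra sentence: punctuation stays attached to the neighbouring word.
punctuated = "Well, it's cold today!"
print(punctuated.split())
```

Whitespace splitting yields fewer tokens than the tokenizer output above because each contraction remains a single token, and any comma or exclamation mark stays attached to the word next to it.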

### POS Tagging

This helps us identify the syntax of a sentence: it labels each word with its grammatical category (part of speech).

In [18]:
tokens = nltk.word_tokenize(test_sentence)
nltk.pos_tag(tokens)

[('I', 'PRP'),
 ('hope', 'VBP'),
 ('this', 'DT'),
 ('winter', 'NN'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('not', 'RB'),
 ('too', 'RB'),
 ('cold', 'JJ'),
 ('but', 'CC'),
 ('I', 'PRP'),
 ("'ve", 'VBP'),
 ('heard', 'VBN'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('going', 'VBG'),
 ('to', 'TO'),
 ('be', 'VB'),
 ('colder', 'JJR'),
 ('than', 'IN'),
 ('the', 'DT'),
 ('last', 'JJ'),
 ('one', 'NN')]

    Definitions
    - PRP: Personal pronoun
    - VBP: Verb, present tense, not 3rd person singular
    - DT: Determiner
    - NN: Noun, singular or mass
    - VBZ: Verb, present tense, 3rd person singular
    - RB: Adverb
    - JJ: Adjective or ordinal numeral
    - CC: Coordinating conjunction
    - VBN: Verb, past participle
    - VBG: Verb, present participle or gerund
    - TO: "to" as preposition or infinitive marker
    - VB: Verb, base form
    - JJR: Adjective, comparative
    - IN: Preposition or subordinating conjunction
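As a small illustration of how the tags can be read, the sketch below pairs a few tagged tokens with a short description. The `tagged` list is copied from the `nltk.pos_tag` output above; the `TAG_DESCRIPTIONS` table and the `describe` helper are hypothetical names introduced here, not part of NLTK.

```python
# A few (token, tag) pairs copied from the nltk.pos_tag output above.
tagged = [("I", "PRP"), ("hope", "VBP"), ("this", "DT"),
          ("winter", "NN"), ("it", "PRP"), ("'s", "VBZ")]

# Hypothetical lookup table based on the Penn Treebank tags listed above.
TAG_DESCRIPTIONS = {
    "PRP": "Personal pronoun",
    "VBP": "Verb, present tense, not 3rd person singular",
    "DT": "Determiner",
    "NN": "Noun, singular or mass",
    "VBZ": "Verb, present tense, 3rd person singular",
    "RB": "Adverb",
    "JJ": "Adjective",
    "CC": "Coordinating conjunction",
    "VBN": "Verb, past participle",
    "VBG": "Verb, present participle or gerund",
    "TO": "to (preposition or infinitive marker)",
    "VB": "Verb, base form",
    "JJR": "Adjective, comparative",
    "IN": "Preposition or subordinating conjunction",
}

def describe(tagged_tokens):
    """Return (token, tag, description) triples; unknown tags get a placeholder."""
    return [(tok, tag, TAG_DESCRIPTIONS.get(tag, "unknown tag"))
            for tok, tag in tagged_tokens]

for tok, tag, desc in describe(tagged):
    print(f"{tok:<8}{tag:<5}{desc}")
```

The table only covers the tags that appear in this example sentence; the full Penn Treebank tagset contains more.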

### Lemmatization

Converts words to their base form (also known as the dictionary form, or lemma).

In [12]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ

In [19]:
lemmatizer = WordNetLemmatizer()

In [20]:
tokens = nltk.word_tokenize(test_sentence)
tags = nltk.pos_tag(tokens)

In [21]:
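# pos_tag returns (token, tag) pairs, so we can iterate over them directly.
for token, pos_tag in tags:
    # Map the Penn Treebank tag prefix to the matching WordNet part of speech.
    if pos_tag.startswith("N"):
        lemma = lemmatizer.lemmatize(token, pos=NOUN)
    elif pos_tag.startswith("V"):
        lemma = lemmatizer.lemmatize(token, pos=VERB)
    elif pos_tag.startswith("J"):
        lemma = lemmatizer.lemmatize(token, pos=ADJ)
    else:
        # Other categories (pronouns, determiners, ...) are left unchanged.
        lemma = token

    print(lemma)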

I
hope
this
winter
it
's
not
too
cold
but
I
've
hear
it
's
go
to
be
cold
than
the
last
one
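The tag-to-part-of-speech branching used above can also be factored into a small helper. The sketch below is one possible version, not the notebook's own method: `penn_to_wordnet` is a hypothetical name, and the single-letter codes are written out as plain strings, on the assumption that they match the `NOUN`, `VERB`, `ADJ` and `ADV` constants in WordNet. Unlike the loop above, it also maps adverbs (`RB`, `RBR`, `RBS`).

```python
# WordNet part-of-speech codes, written as plain strings so the sketch runs
# without NLTK; assumed to equal the NOUN, VERB, ADJ and ADV constants.
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"

# Penn Treebank tag prefixes and the WordNet code each one maps to.
PREFIX_TO_POS = (("NN", NOUN), ("VB", VERB), ("JJ", ADJ), ("RB", ADV))

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if there is no match."""
    for prefix, pos in PREFIX_TO_POS:
        if tag.startswith(prefix):
            return pos
    return None

# Tags taken from the pos_tag output earlier in this notebook.
for tag in ["PRP", "VBP", "NN", "RB", "JJR", "VBG"]:
    print(tag, penn_to_wordnet(tag))
```

With NLTK available, the returned code can be passed as `lemmatizer.lemmatize(token, pos=code)` when it is not `None`, leaving the token unchanged otherwise.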
