## Text Tokenization

Main topics:

        - dividing a text into sentences
        - tokenizing a sentence into words
        - extracting the most common words
        - finding out which class each word belongs to (POS tagging)
        - extracting the words that belong to a particular class, e.g., all nouns, verbs, adjectives, etc.

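I am using the **nltk** (Natural Language Toolkit) package, which was originally developed at the University of Pennsylvania. It is included in some Python distributions, such as Anaconda; otherwise it can be installed with pip.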

        - nltk.sent_tokenize(text, language='...') # takes a text as input and divides it into sentences
        - nltk.word_tokenize(sentence, language='...') # takes a sentence as input and tokenizes it into words
        - stopwords = nltk.corpus.stopwords.words('english') # loads the list of English stop words
        - nltk.pos_tag(tokens) # takes a list of words / tokens as input and adds the part of speech (POS) for each word

In [2]:
# pip install nltk # install nltk
# nltk.download() # opens the NLTK downloader window; select the "book" collection (everything used in the NLTK book) to install the data
import nltk

In [3]:
# divide the text below into sentences

text = """
I follow the Moskva.
Down to Gorky Park.
Listening to the wind of change.
An august summer night.

The world is closing in.
Did you ever think.
That we could be so close, like brothers.
The future's in the air.
I can feel it everywhere.
Blowing with the wind of change.

Take me to the magic of the moment.
On a glory night.
Where the children of tomorrow dream away.
In the wind of change.

Walking down the street.
Distant memories.
Are buried in the past forever.
I follow the Moskva.
Down to Gorky Park.
Listening to the wind of change.

Take me to the magic of the moment On a glory night.
Where the children of tomorrow share their dreams.
With you and me.
Take me to the magic of the moment.
On a glory night.
Where the children of tomorrow dream away.
In the wind of change.

The wind of change.
Blows straight into the face of time.
Like a stormwind that will.
ring the freedom bell.
For peace of mind.
Let your balalaika sing.
What my guitar wants to say.
"""

sentences = nltk.sent_tokenize(text, language = 'english')
print('number of sentences =', len(sentences))
sentences

number of sentences = 34


['\nI follow the Moskva.',
 'Down to Gorky Park.',
 'Listening to the wind of change.',
 'An august summer night.',
 'The world is closing in.',
 'Did you ever think.',
 'That we could be so close, like brothers.',
 "The future's in the air.",
 'I can feel it everywhere.',
 'Blowing with the wind of change.',
 'Take me to the magic of the moment.',
 'On a glory night.',
 'Where the children of tomorrow dream away.',
 'In the wind of change.',
 'Walking down the street.',
 'Distant memories.',
 'Are buried in the past forever.',
 'I follow the Moskva.',
 'Down to Gorky Park.',
 'Listening to the wind of change.',
 'Take me to the magic of the moment On a glory night.',
 'Where the children of tomorrow share their dreams.',
 'With you and me.',
 'Take me to the magic of the moment.',
 'On a glory night.',
 'Where the children of tomorrow dream away.',
 'In the wind of change.',
 'The wind of change.',
 'Blows straight into the face of time.',
 'Like a stormwind that will.',
 'ring the fr

In [4]:
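# now tokenize each sentence into words

# one list of tokens per sentence
tokens = [nltk.word_tokenize(sentence, language = 'english') for sentence in sentences]
# flatten the list of lists into a single list of tokens
tokens = [token for sentence_tokens in tokens for token in sentence_tokens]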

tokens

['I',
 'follow',
 'the',
 'Moskva',
 '.',
 'Down',
 'to',
 'Gorky',
 'Park',
 '.',
 'Listening',
 'to',
 'the',
 'wind',
 'of',
 'change',
 '.',
 'An',
 'august',
 'summer',
 'night',
 '.',
 'The',
 'world',
 'is',
 'closing',
 'in',
 '.',
 'Did',
 'you',
 'ever',
 'think',
 '.',
 'That',
 'we',
 'could',
 'be',
 'so',
 'close',
 ',',
 'like',
 'brothers',
 '.',
 'The',
 'future',
 "'s",
 'in',
 'the',
 'air',
 '.',
 'I',
 'can',
 'feel',
 'it',
 'everywhere',
 '.',
 'Blowing',
 'with',
 'the',
 'wind',
 'of',
 'change',
 '.',
 'Take',
 'me',
 'to',
 'the',
 'magic',
 'of',
 'the',
 'moment',
 '.',
 'On',
 'a',
 'glory',
 'night',
 '.',
 'Where',
 'the',
 'children',
 'of',
 'tomorrow',
 'dream',
 'away',
 '.',
 'In',
 'the',
 'wind',
 'of',
 'change',
 '.',
 'Walking',
 'down',
 'the',
 'street',
 '.',
 'Distant',
 'memories',
 '.',
 'Are',
 'buried',
 'in',
 'the',
 'past',
 'forever',
 '.',
 'I',
 'follow',
 'the',
 'Moskva',
 '.',
 'Down',
 'to',
 'Gorky',
 'Park',
 '.',
 'Listenin
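
The nested list of per-sentence tokens can also be flattened with `itertools.chain.from_iterable`, which some readers find easier to follow than a nested comprehension. A minimal sketch; the two hard-coded sentences stand in for the output of `nltk.word_tokenize` and are copied from the result above:

```python
from itertools import chain

# Per-sentence tokens, as word_tokenize returned them for the first two sentences.
tokens_per_sentence = [
    ['I', 'follow', 'the', 'Moskva', '.'],
    ['Down', 'to', 'Gorky', 'Park', '.'],
]

# chain.from_iterable walks each inner list in order and yields its items one by one.
flat_tokens = list(chain.from_iterable(tokens_per_sentence))
print(flat_tokens)
```

Both approaches preserve the original token order, so the choice is mostly a matter of readability.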

In [5]:
# find out the 20 most common words
from collections import Counter

tokens_counter = Counter(tokens)
tokens_counter.most_common(20)

[('.', 34),
 ('the', 21),
 ('of', 14),
 ('to', 8),
 ('wind', 6),
 ('change', 6),
 ('night', 4),
 ('me', 4),
 ('a', 4),
 ('I', 3),
 ('The', 3),
 ('in', 3),
 ('Take', 3),
 ('magic', 3),
 ('moment', 3),
 ('On', 3),
 ('glory', 3),
 ('Where', 3),
 ('children', 3),
 ('tomorrow', 3)]
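
`Counter` treats 'The' and 'the' as different keys, which is why both appear in the list above. If case should not matter, one option is to lowercase the tokens before counting. A small sketch on a hand-picked sample of the tokens above (the sample is only for illustration):

```python
from collections import Counter

sample_tokens = ['The', 'world', 'is', 'closing', 'in', '.',
                 'Listening', 'to', 'the', 'wind', 'of', 'change', '.']

case_sensitive = Counter(sample_tokens)
case_insensitive = Counter(token.lower() for token in sample_tokens)

print(case_sensitive['the'], case_sensitive['The'])  # counted separately
print(case_insensitive['the'])                       # counted together
```

Lowercasing also merges proper nouns with common words of the same spelling, so whether it is appropriate depends on the analysis.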

In [6]:
# remove stop words and punctuation from the tokens above and count the most common words again

import string # provides string.punctuation, a string of ASCII punctuation characters

stopwords = nltk.corpus.stopwords.words('english')
punctuations = string.punctuation # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

print('stopwords =', stopwords)
print('punctuations =', punctuations)

tokens = [e for e in tokens if e.lower() not in stopwords and e not in punctuations]

tokens_counter = Counter(tokens)
tokens_counter.most_common(20)

stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so

[('wind', 6),
 ('change', 6),
 ('night', 4),
 ('Take', 3),
 ('magic', 3),
 ('moment', 3),
 ('glory', 3),
 ('children', 3),
 ('tomorrow', 3),
 ('follow', 2),
 ('Moskva', 2),
 ('Gorky', 2),
 ('Park', 2),
 ('Listening', 2),
 ('dream', 2),
 ('away', 2),
 ('august', 1),
 ('summer', 1),
 ('world', 1),
 ('closing', 1)]
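
The filtering step can be wrapped in a small reusable function. The sketch below is one possible variant, not the code used above: it keeps the stop words in a set for fast lookups and drops tokens made up only of punctuation characters. The function name `remove_stopwords_and_punctuation` and the tiny stop-word list are assumptions for illustration; in practice the list would come from `nltk.corpus.stopwords.words('english')`.

```python
import string

def remove_stopwords_and_punctuation(tokens, stopwords):
    """Return the tokens that are neither stop words nor pure punctuation."""
    stopword_set = {word.lower() for word in stopwords}
    punctuation = set(string.punctuation)
    kept = []
    for token in tokens:
        if token.lower() in stopword_set:
            continue  # stop word, compared case-insensitively
        if all(char in punctuation for char in token):
            continue  # token consists only of punctuation, e.g. '.' or ','
        kept.append(token)
    return kept

# Hypothetical mini stop-word list, used only to keep the example self-contained.
demo_stopwords = ['i', 'the', 'to', 'of']
demo_tokens = ['I', 'follow', 'the', 'Moskva', '.',
               'Listening', 'to', 'the', 'wind', 'of', 'change', '.']
print(remove_stopwords_and_punctuation(demo_tokens, demo_stopwords))
```

Unlike a membership test against the `string.punctuation` string, this variant also drops multi-character punctuation tokens such as '...', while a token like "'s" is still kept.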

## Part of Speech (POS) Tagging

After tokenizing the text into words, a common next step in text analysis is to determine the class (part of speech) of each word.

For this purpose, I am using nltk.pos_tag(), which tags each word with its part of speech (POS). For example, the words 'wind' and 'summer' will be tagged as 'NN', which means 'noun, singular or mass', and the words 'follow' and 'think' will be tagged as 'VB', which means 'verb, base form'.

Below is a list of some of the common POS tags:

        - CC - Coordinating conjunction
        - CD - Cardinal number
        - DT - Determiner
        - EX - Existential there
        - FW - Foreign word
        - IN - Preposition or subordinating conjunction
        - JJ - Adjective
        - JJR - Adjective, comparative
        - JJS - Adjective, superlative
        - LS - List item marker
        - MD - Modal
        - NN - Noun, singular or mass
        - NNS - Noun, plural
        - NNP - Proper noun, singular
        - NNPS - Proper noun, plural
        - PDT - Predeterminer
        - POS - Possessive ending
        - PRP - Personal pronoun
        - PRP$ - Possessive pronoun
        - RB - Adverb
        - RBR - Adverb, comparative
        - RBS - Adverb, superlative
        - RP - Particle
        - TO - to
        - UH - Interjection
        - VB - Verb, base form
        - VBD - Verb, past tense
        - VBG - Verb, gerund or present participle
        - VBN - Verb, past participle
        - VBP - Verb, non-3rd person singular present
        - VBZ - Verb, 3rd person singular present
        - WDT - Wh-determiner
        - WP - Wh-pronoun
        - WP$ - Possessive wh-pronoun
        - WRB - Wh-adverb

Use nltk.help.upenn_tagset() for more information.

In [7]:
tokens_pos = nltk.pos_tag(tokens)
tokens_pos

[('follow', 'VB'),
 ('Moskva', 'NNP'),
 ('Gorky', 'NNP'),
 ('Park', 'NNP'),
 ('Listening', 'NNP'),
 ('wind', 'NN'),
 ('change', 'NN'),
 ('august', 'RB'),
 ('summer', 'NN'),
 ('night', 'NN'),
 ('world', 'NN'),
 ('closing', 'NN'),
 ('ever', 'RB'),
 ('think', 'VB'),
 ('could', 'MD'),
 ('close', 'VB'),
 ('like', 'IN'),
 ('brothers', 'NNS'),
 ('future', 'VBP'),
 ("'s", 'POS'),
 ('air', 'NN'),
 ('feel', 'NN'),
 ('everywhere', 'RB'),
 ('Blowing', 'NNP'),
 ('wind', 'NN'),
 ('change', 'NN'),
 ('Take', 'NNP'),
 ('magic', 'NN'),
 ('moment', 'NN'),
 ('glory', 'NN'),
 ('night', 'NN'),
 ('children', 'NNS'),
 ('tomorrow', 'NN'),
 ('dream', 'NN'),
 ('away', 'RB'),
 ('wind', 'IN'),
 ('change', 'NN'),
 ('Walking', 'NNP'),
 ('street', 'NN'),
 ('Distant', 'NNP'),
 ('memories', 'NNS'),
 ('buried', 'VBD'),
 ('past', 'JJ'),
 ('forever', 'RB'),
 ('follow', 'VBP'),
 ('Moskva', 'NNP'),
 ('Gorky', 'NNP'),
 ('Park', 'NNP'),
 ('Listening', 'NNP'),
 ('wind', 'NN'),
 ('change', 'NN'),
 ('Take', 'NNP'),
 ('magic', 'N
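
Note that the tags above were computed on the filtered token list, so the tagger sees word sequences that never occurred in the original text; this may contribute to surprising tags such as ('future', 'VBP'). One possible alternative is to tag each complete sentence first and remove stop words afterwards. The sketch below only illustrates that order of operations: `tag_fn` is a placeholder for a real tagger such as `nltk.pos_tag`, and `toy_tagger` with its tiny lookup table is a hypothetical stand-in so the example runs without any downloads.

```python
def tag_then_filter(sentences_tokens, tag_fn, stopwords):
    """Tag every full sentence, then drop stop words from the tagged result."""
    stopword_set = {word.lower() for word in stopwords}
    tagged = []
    for sentence_tokens in sentences_tokens:
        # The tagger sees the whole sentence, including its stop words.
        for word, pos in tag_fn(sentence_tokens):
            if word.lower() not in stopword_set:
                tagged.append((word, pos))
    return tagged

def toy_tagger(tokens):
    """Hypothetical stand-in for nltk.pos_tag: looks tags up in a fixed table."""
    lookup = {'the': 'DT', 'wind': 'NN', 'of': 'IN', 'change': 'NN'}
    return [(token, lookup.get(token.lower(), 'NN')) for token in tokens]

demo = tag_then_filter([['The', 'wind', 'of', 'change']], toy_tagger, ['the', 'of'])
print(demo)
```

With a real tagger, `tag_fn=nltk.pos_tag` would be passed instead of `toy_tagger`; whether the resulting tags improve on the ones above would need to be checked on the actual text.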

Once we have found the POS of each word, we can easily extract the words that belong to a particular class.

For instance, let's find the nouns and verbs in the above text.

In [65]:
print('----------- list of nouns:')
# Each element of tokens_pos is a (word, POS) tuple: e[1] is the POS tag and e[1][0] is its first letter.
# The POS tags of nouns (NN, NNS, NNP, NNPS) always start with 'N'.
[e for e in tokens_pos if e[1][0] == 'N']

----------- list of nouns:


[('Moskva', 'NNP'),
 ('Gorky', 'NNP'),
 ('Park', 'NNP'),
 ('Listening', 'NNP'),
 ('wind', 'NN'),
 ('change', 'NN'),
 ('summer', 'NN'),
 ('night', 'NN'),
 ('world', 'NN'),
 ('closing', 'NN'),
 ('brothers', 'NNS'),
 ('air', 'NN'),
 ('feel', 'NN'),
 ('Blowing', 'NNP'),
 ('wind', 'NN'),
 ('change', 'NN'),
 ('Take', 'NNP'),
 ('magic', 'NN'),
 ('moment', 'NN'),
 ('glory', 'NN'),
 ('night', 'NN'),
 ('children', 'NNS'),
 ('tomorrow', 'NN'),
 ('dream', 'NN'),
 ('change', 'NN'),
 ('Walking', 'NNP'),
 ('street', 'NN'),
 ('Distant', 'NNP'),
 ('memories', 'NNS'),
 ('Moskva', 'NNP'),
 ('Gorky', 'NNP'),
 ('Park', 'NNP'),
 ('Listening', 'NNP'),
 ('wind', 'NN'),
 ('change', 'NN'),
 ('Take', 'NNP'),
 ('magic', 'NN'),
 ('moment', 'NN'),
 ('glory', 'NN'),
 ('night', 'NN'),
 ('children', 'NNS'),
 ('tomorrow', 'NN'),
 ('share', 'NN'),
 ('dreams', 'NNS'),
 ('moment', 'NN'),
 ('glory', 'NN'),
 ('night', 'NN'),
 ('children', 'NNS'),
 ('tomorrow', 'NN'),
 ('dream', 'NN'),
 ('change', 'NN'),
 ('wind', 'NN'),
 ('

In [66]:
print('----------- list of verbs:')
# Each element of tokens_pos is a (word, POS) tuple: e[1] is the POS tag and e[1][0] is its first letter.
# The POS tags of verbs (VB, VBD, VBG, VBN, VBP, VBZ) always start with 'V'.
[e for e in tokens_pos if e[1][0] == 'V']

----------- list of verbs:


[('follow', 'VB'),
 ('think', 'VB'),
 ('close', 'VB'),
 ('future', 'VBP'),
 ('buried', 'VBD'),
 ('follow', 'VBP'),
 ('Take', 'VBP'),
 ('straight', 'VBD'),
 ('bell', 'VB'),
 ('balalaika', 'VB'),
 ('sing', 'VBG'),
 ('wants', 'VBZ'),
 ('say', 'VBP')]
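
Both filters follow the same pattern, so they can be folded into one helper. A minimal sketch, assuming a hypothetical function name `words_with_pos`; the sample pairs are copied from the `tokens_pos` output above. `str.startswith` accepts a tuple of prefixes, which makes it easy to select several classes at once:

```python
def words_with_pos(tagged_tokens, prefixes):
    """Return the (word, POS) pairs whose tag starts with any of the given prefixes."""
    prefixes = tuple(prefixes)  # str.startswith needs a str or a tuple of str
    return [(word, pos) for word, pos in tagged_tokens if pos.startswith(prefixes)]

sample_tagged = [('follow', 'VB'), ('Moskva', 'NNP'), ('wind', 'NN'),
                 ('august', 'RB'), ('past', 'JJ'), ('buried', 'VBD')]

print(words_with_pos(sample_tagged, ['N']))       # nouns
print(words_with_pos(sample_tagged, ['V']))       # verbs
print(words_with_pos(sample_tagged, ['J', 'R']))  # adjectives and adverbs
```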

## Lemmatization

One valuable preprocessing step in NLP is normalizing words to their base (dictionary) forms so that different forms of a word are treated as the same word. This technique is called lemmatization.

For example, the lemma of the word "running" is "run," and the lemma of "better" is "good."

The lemma of a word can vary depending on its grammatical role. For example, the lemma of "better" as an adjective is "good," but as a verb, it is "better". Thus, lemmatization takes the POS of a word into account.

In [8]:
lemmatizer = nltk.WordNetLemmatizer()
print('lemma of the word "better" as an adjective:', lemmatizer.lemmatize(word = 'better', pos = 'a'))
print('lemma of the word "better" as a verb:', lemmatizer.lemmatize(word = 'better', pos = 'v'))

lemma of the word "better" as an adjective: good
lemma of the word "better" as a verb: better


Using the above technique, let's add the lemma of each word in the 'tokens_pos' list.

In [9]:
# WordNetLemmatizer expects WordNet POS codes, so we need a helper function that maps the Penn Treebank tags in our list to WordNet's POS constants.
# WordNet only covers nouns, verbs, adjectives and adverbs, so the mapping chosen for every other tag is a judgment call.
# POS tagging itself is context-dependent: for example, foreign words (tagged as FW) often function as nouns, but they can also function as other parts of speech.
# Adjust the helper function to suit your own text.

from nltk.corpus import wordnet

def get_wordnet_pos(pos):
    if pos.startswith('J'): return wordnet.ADJ # Adjectives (JJ, JJR, JJS)
    elif pos.startswith('V'): return wordnet.VERB # Verbs (VB, VBD, VBG, VBN, VBP, VBZ)
    elif pos.startswith('N'): return wordnet.NOUN # Nouns (NN, NNS, NNP, NNPS)
    elif pos.startswith('R'): return wordnet.ADV # Adverbs (RB, RBR, RBS) and particles (RP)
    # Tags starting with D, P, C, U or M (DT, PDT, POS, PRP, PRP$, CC, CD, UH, MD), plus IN and TO
    elif pos.startswith(('D', 'P', 'C', 'U', 'M')) or pos in ('IN', 'TO'): return wordnet.ADJ_SAT
    elif pos == 'EX': return wordnet.ADV # Existential 'there'
    else: return wordnet.NOUN # Default to noun if no specific mapping

tokens_pos_lemma = [(word, pos, lemmatizer.lemmatize(word,  get_wordnet_pos(pos))) for word, pos in tokens_pos]
tokens_pos_lemma

[('follow', 'VB', 'follow'),
 ('Moskva', 'NNP', 'Moskva'),
 ('Gorky', 'NNP', 'Gorky'),
 ('Park', 'NNP', 'Park'),
 ('Listening', 'NNP', 'Listening'),
 ('wind', 'NN', 'wind'),
 ('change', 'NN', 'change'),
 ('august', 'RB', 'august'),
 ('summer', 'NN', 'summer'),
 ('night', 'NN', 'night'),
 ('world', 'NN', 'world'),
 ('closing', 'NN', 'closing'),
 ('ever', 'RB', 'ever'),
 ('think', 'VB', 'think'),
 ('could', 'MD', 'could'),
 ('close', 'VB', 'close'),
 ('like', 'IN', 'like'),
 ('brothers', 'NNS', 'brother'),
 ('future', 'VBP', 'future'),
 ("'s", 'POS', "'s"),
 ('air', 'NN', 'air'),
 ('feel', 'NN', 'feel'),
 ('everywhere', 'RB', 'everywhere'),
 ('Blowing', 'NNP', 'Blowing'),
 ('wind', 'NN', 'wind'),
 ('change', 'NN', 'change'),
 ('Take', 'NNP', 'Take'),
 ('magic', 'NN', 'magic'),
 ('moment', 'NN', 'moment'),
 ('glory', 'NN', 'glory'),
 ('night', 'NN', 'night'),
 ('children', 'NNS', 'child'),
 ('tomorrow', 'NN', 'tomorrow'),
 ('dream', 'NN', 'dream'),
 ('away', 'RB', 'away'),
 ('wind', 'IN',
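
A common, more compact way to write such a mapping is a lookup on the first letter of the Penn Treebank tag, defaulting to noun for everything else. The sketch below is an alternative to the helper above, not a reproduction of it: it sends function words to the noun default rather than to the adjective-satellite class. It returns plain strings; 'a', 'v', 'n' and 'r' are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`. The name `penn_to_wordnet` is an assumption for illustration.

```python
# First letter of a Penn Treebank tag -> WordNet POS code.
_FIRST_LETTER_TO_WORDNET = {
    'J': 'a',  # adjectives: JJ, JJR, JJS
    'V': 'v',  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    'N': 'n',  # nouns: NN, NNS, NNP, NNPS
    'R': 'r',  # adverbs: RB, RBR, RBS (and the particle tag RP)
}

def penn_to_wordnet(pos, default='n'):
    """Map a Penn Treebank tag to a WordNet POS code, falling back to `default`."""
    # pos[:1] is the first character, or '' for an empty tag, so this never raises.
    return _FIRST_LETTER_TO_WORDNET.get(pos[:1], default)

for tag in ['JJ', 'VBD', 'NNS', 'RB', 'MD', 'IN']:
    print(tag, '->', penn_to_wordnet(tag))
```

The returned code can be passed as the `pos` argument of `lemmatizer.lemmatize`, in the same position where `get_wordnet_pos(pos)` is used above.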

At this point, each entry in the list holds a word as it appears in the text, its POS tag, and its base form.
Let's find the words in the text that differ from their base form.

In [10]:
for word, pos, lemma in tokens_pos_lemma:
    if word != lemma:
        print('Found in the text:', word)
        print('POS tagging:', pos)
        print('Base or root form', lemma)
        print('--------------------------')

Found in the text: brothers
POS tagging: NNS
Base or root form brother
--------------------------
Found in the text: children
POS tagging: NNS
Base or root form child
--------------------------
Found in the text: memories
POS tagging: NNS
Base or root form memory
--------------------------
Found in the text: buried
POS tagging: VBD
Base or root form bury
--------------------------
Found in the text: children
POS tagging: NNS
Base or root form child
--------------------------
Found in the text: dreams
POS tagging: NNS
Base or root form dream
--------------------------
Found in the text: children
POS tagging: NNS
Base or root form child
--------------------------
Found in the text: wants
POS tagging: VBZ
Base or root form want
--------------------------
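
Because the loop prints every occurrence, 'children' shows up three times. To list each changed word once, together with how often it occurs, the triples can be counted first. A minimal sketch on a sample of `tokens_pos_lemma` entries copied from the output above:

```python
from collections import Counter

sample = [('brothers', 'NNS', 'brother'), ('children', 'NNS', 'child'),
          ('wind', 'NN', 'wind'), ('children', 'NNS', 'child'),
          ('buried', 'VBD', 'bury'), ('children', 'NNS', 'child')]

# Count only the triples whose surface form differs from the lemma.
changed = Counter(entry for entry in sample if entry[0] != entry[2])

# most_common() lists the most frequent changed words first.
for (word, pos, lemma), count in changed.most_common():
    print(f'{word} ({pos}) -> {lemma}: {count}x')
```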
