### Tokenization

Tokenization is the process of splitting a sentence into its constituent
parts: the words, numbers (if any), and punctuation that it is made up of.
It differs from simply splitting the sentence on whitespace, because these
parts are not always separated by whitespace.
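
To make the difference from whitespace splitting concrete, here is a minimal standard-library sketch. The helper `simple_tokenize` and its regular expression are illustrative assumptions, and they are far simpler than a real tokenizer.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "I am reading NLP Fundamentals."

# Splitting on whitespace leaves the period glued to the last word
whitespace_tokens = sentence.split()
print(whitespace_tokens)
# ['I', 'am', 'reading', 'NLP', 'Fundamentals.']

# The regex-based helper separates the period into its own token
print(simple_tokenize(sentence))
# ['I', 'am', 'reading', 'NLP', 'Fundamentals', '.']
```

The regex version is only a sketch: it would, for example, split contractions such as "don't" into three pieces.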

NLTK provides a function called word_tokenize(), which tokenizes the given text
into words. It separates the text into words based on punctuation and the
spaces between words.

In [3]:
# download words
from nltk import word_tokenize, download
download(['punkt','averaged_perceptron_tagger','stopwords'])

[nltk_data] Downloading package punkt to
[nltk_data]     /home/josephitopa/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/josephitopa/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/josephitopa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# tokenize a sentence
def get_tokens(sentence):
    words = word_tokenize(sentence)
    return words

print(get_tokens("I am reading NLP Fundamentals."))

['I', 'am', 'reading', 'NLP', 'Fundamentals', '.']


#### POS - Part of Speech

In NLP, the term PoS refers to parts of speech. PoS tagging refers to the process
of tagging words within sentences with their respective PoS.

- DT = Determiner
- NN = Noun, common, singular or mass
- VBZ = Verb, present tense, third-person singular
- JJ = Adjective

PoS tagging finds application in many NLP tasks, including word sense
disambiguation, classification, Named Entity Recognition (NER), and coreference
resolution.

In [4]:
# import word_tokenize and pos_tag
from nltk import word_tokenize, pos_tag

In [4]:
def get_tokens(sentence):
    words = word_tokenize(sentence)
    return words

In [5]:
# apply the get_tokens function to tokenize the sentence
words = get_tokens("I am reading NLP Fundamentals")
print(words)

['I', 'am', 'reading', 'NLP', 'Fundamentals']


In [6]:
# tag sentence with PoS
def get_pos(words):
    return pos_tag(words)

get_pos(words)

[('I', 'PRP'),
 ('am', 'VBP'),
 ('reading', 'VBG'),
 ('NLP', 'NNP'),
 ('Fundamentals', 'NNS')]

Here, PRP stands for personal pronoun, VBP stands for verb (non-third-person
singular present), VBG stands for verb (gerund or present participle), NNP
stands for proper noun (singular), and NNS stands for noun (plural).
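
As a small convenience, the tags can be attached to their descriptions with a lookup table. This is a sketch only: `TAG_DESCRIPTIONS` and `describe_tags` are hypothetical names, the table covers just the five tags in the output above, and the tagged pairs are copied from that output rather than recomputed.

```python
# Descriptions for the five tags that appear in the example output
TAG_DESCRIPTIONS = {
    "PRP": "personal pronoun",
    "VBP": "verb, non-third-person singular present",
    "VBG": "verb, gerund or present participle",
    "NNP": "proper noun, singular",
    "NNS": "noun, plural",
}

def describe_tags(tagged_words):
    # Attach a human-readable description to each (word, tag) pair
    return [(word, tag, TAG_DESCRIPTIONS.get(tag, "unknown tag"))
            for word, tag in tagged_words]

tagged = [('I', 'PRP'), ('am', 'VBP'), ('reading', 'VBG'),
          ('NLP', 'NNP'), ('Fundamentals', 'NNS')]

for word, tag, description in describe_tags(tagged):
    print(f"{word:<13}{tag:<5}{description}")
```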

### Stop Word Removal

Stop words are the most frequently occurring words in a language. They mainly
support the construction of sentences and contribute little to the semantics
of a sentence.

Examples of stop words include "a," "am," "and," "the,"
"in," "of," and more.

In [2]:
from nltk import download
download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/josephitopa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
stop_words = stopwords.words('english')

In [10]:
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [11]:
sentence = "I am learning Python. It is one of the "\
            "most popular programming languages"

sentence_words = word_tokenize(sentence)

In [12]:
print(sentence_words)

['I', 'am', 'learning', 'Python', '.', 'It', 'is', 'one', 'of', 'the', 'most', 'popular', 'programming', 'languages']


In [13]:
def remove_stop_words(sentence_words, stop_words):
    return ' '.join([word for word in sentence_words if word not in stop_words])

In [14]:
print(remove_stop_words(sentence_words, stop_words))

I learning Python . It one popular programming languages


In [15]:
stop_words.extend(['I', 'It', 'one'])

print(remove_stop_words(sentence_words,stop_words))

learning Python . popular programming languages
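
Instead of adding capitalized variants such as 'I' and 'It' to the list by hand, the comparison can be made case-insensitive. The sketch below assumes a hypothetical helper `remove_stop_words_ci` and uses a small hand-picked stop list so that it runs without any downloads.

```python
def remove_stop_words_ci(tokens, stop_words):
    # Lowercase both sides so 'I' and 'It' match 'i' and 'it'
    stop_set = {word.lower() for word in stop_words}
    return ' '.join(token for token in tokens if token.lower() not in stop_set)

tokens = ['I', 'am', 'learning', 'Python', '.', 'It', 'is', 'one', 'of',
          'the', 'most', 'popular', 'programming', 'languages']

# Small illustrative subset of the English stop word list
small_stop_words = ['i', 'am', 'it', 'is', 'of', 'the', 'most']

print(remove_stop_words_ci(tokens, small_stop_words))
# learning Python . one popular programming languages
```

Note that "one" survives here because it is not in the stop list; it still has to be added explicitly if it should be removed.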


In [20]:
# Assignment - tokenize and remove stop word using a python function
from nltk import download
download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

sentence = "I am to be learning Python. It is one of the "\
            "most popular programming languages in the Universe"

def tokenize_and_stop_word(sentence, stop_words):
    tokens = word_tokenize(sentence)
    # build a local set so the caller's list is not extended on every call
    all_stop_words = set(stop_words) | {'I', 'It', 'one'}
    return ' '.join([word for word in tokens if word not in all_stop_words])

tokenize_and_stop_word(sentence, stop_words)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/josephitopa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


'learning Python . popular programming languages Universe'

### Text Normalization

Text normalization is a process in which different variations of text are converted into a
standard form.

For example, words such as "does" and "doing," when converted to their base form, become "do."

In [1]:
sentence = "I visited the US from the UK on 22-10-18"

In [5]:
def normalize(text):
    return text.replace("US", "United States")\
            .replace("UK", "United Kingdom")\
            .replace("-18", "-2018")

In [6]:
normalized_sentence = normalize(sentence)
print(normalized_sentence)

I visited the United States from the United Kingdom on 22-10-2018


In [7]:
normalized_sentence = normalize('US and UK are two superpowers')
print(normalized_sentence)

United States and United Kingdom are two superpowers
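
Plain `str.replace` also rewrites matches inside longer words, so "USA" would become "United StatesA". A possible refinement, sketched here with the standard `re` module, is to match only whole words. The names `REPLACEMENTS` and `normalize_with_boundaries` are illustrative, and expanding two-digit years to 20xx is an assumption that only fits dates like the one in this example.

```python
import re

# Abbreviations to expand, matched only as whole words
REPLACEMENTS = {"US": "United States", "UK": "United Kingdom"}

def normalize_with_boundaries(text):
    for abbreviation, full_form in REPLACEMENTS.items():
        # \b ensures "US" does not match inside "USA"
        text = re.sub(rf"\b{re.escape(abbreviation)}\b", full_form, text)
    # Expand a two-digit year at the end of a dd-mm-yy date (assumes 20xx)
    return re.sub(r"\b(\d{2}-\d{2})-(\d{2})\b", r"\1-20\2", text)

print(normalize_with_boundaries("I visited the US from the UK on 22-10-18"))
# I visited the United States from the United Kingdom on 22-10-2018

print(normalize_with_boundaries("The USA is large"))
# The USA is large
```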


### Spelling Correction

Spelling correction is executed in two steps:

1. Identify the misspelled word, which can be done by a simple dictionary lookup.
If there is no match found in the language dictionary, it is considered to
be misspelled.

2. Replace it or suggest the correctly spelled word. There are a lot of algorithms for
this task. One of them is the minimum edit distance algorithm, which chooses
the nearest correctly spelled word for a misspelled word.
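
The two steps above can be sketched with the standard library alone. This is a minimal illustration of the minimum edit distance idea (Levenshtein distance), not the method used by any particular spelling library; `edit_distance`, `suggest`, and the tiny `vocabulary` are assumptions made for the example.

```python
def edit_distance(a, b):
    # Dynamic programming over prefixes: prev[j] holds the distance
    # between the first i-1 characters of a and the first j of b
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def suggest(word, vocabulary):
    # Step 1: dictionary lookup; a known word is returned unchanged
    if word in vocabulary:
        return word
    # Step 2: choose the vocabulary word with the smallest edit distance
    return min(vocabulary, key=lambda candidate: edit_distance(word, candidate))

vocabulary = ["natural", "language", "processing", "insights"]

print(edit_distance("kitten", "sitting"))  # 3
print(suggest("ntural", vocabulary))       # natural
print(suggest("luanguage", vocabulary))    # language
```

A practical corrector would also weigh word frequency, since several dictionary words can be equally close to a misspelling.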

We make use of the autocorrect Python library to correct spellings.

In [8]:
from nltk import word_tokenize
from autocorrect import Speller

In [9]:
spell = Speller(lang = 'en')
spell('Natureal')

'Natural'

In [10]:
sentence = word_tokenize("Ntural Luanguage Processin deals with "\
                         "the art of extracting insightes from "\
                         "Natural Languaes")

In [11]:
print(sentence)

['Ntural', 'Luanguage', 'Processin', 'deals', 'with', 'the', 'art', 'of', 'extracting', 'insightes', 'from', 'Natural', 'Languaes']


In [14]:
def correct_spelling(tokens):
    sentence_corrected = ' '.join([spell(word) \
                                   for word in tokens])
    return sentence_corrected

In [15]:
print(correct_spelling(sentence))

Natural Language Processing deals with the art of extracting insights from Natural Languages


### Stemming

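Words appear in various forms within sentences. For example, "product" may
appear as "production" or "products". It is often necessary to convert such
forms into their base form, as they carry closely related meanings. Stemming
is the process that helps us to do so.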

In [19]:
from nltk import stem

In [17]:
def get_stems(word, stemmer):
    return stemmer.stem(word)

In [20]:
porterStem = stem.PorterStemmer()
get_stems("production", porterStem)

'product'

In [21]:
get_stems("coming", porterStem)

'come'

In [22]:
get_stems("firing", porterStem)

'fire'

In [23]:
get_stems("battling", porterStem)

'battl'

In [24]:
stemmer = stem.SnowballStemmer("english")
get_stems("battling",stemmer)

'battl'
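
To see why a stem such as "battl" need not be a real word, consider a deliberately naive suffix stripper. This toy is not the Porter or Snowball algorithm, which apply much more elaborate rules; `SUFFIXES` and `naive_stem` are hypothetical names used only for illustration.

```python
# Suffixes tried in order; a toy list, not a real stemmer's rule set
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(naive_stem("battling"))  # battl
print(naive_stem("products"))  # product
print(naive_stem("is"))        # is
```

Because the rule only removes characters, nothing guarantees that what remains is a dictionary word.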

### Lemmatization

Sometimes, the stemming process produces results that are not valid words. For
example, in the previous section, the word "battling" was transformed to
"battl", which is not a word.

To overcome such problems with stemming, we make use of lemmatization.

Lemmatization is the process of converting words to their base grammatical form,
as in "battling" to "battle," rather than just stripping suffixes.

NLTK's WordNetLemmatizer provides a method called lemmatize(), which returns
the lemma (grammatical base form) of a given word using WordNet.

In [26]:
from nltk import download
download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/josephitopa/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [27]:
lemmatizer = WordNetLemmatizer()
def get_lemma(word):
    return lemmatizer.lemmatize(word)

In [31]:
get_lemma('products')

'product'

In [32]:
get_lemma('production')

'production'
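
One way to picture the difference from stemming is that a lemmatizer only accepts a candidate base form if it is a known word. The sketch below is a toy illustration of that idea and not NLTK's implementation; `KNOWN_WORDS`, `SUFFIX_RULES`, and `toy_lemma` are made up for the example.

```python
# Toy dictionary standing in for a real lexical resource
KNOWN_WORDS = {"battle", "come", "product", "production"}

# (suffix to remove, text to append) pairs tried in order
SUFFIX_RULES = [("ing", "e"), ("ing", ""), ("s", "")]

def toy_lemma(word):
    if word in KNOWN_WORDS:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            # Accept the candidate only if it is a dictionary word
            if candidate in KNOWN_WORDS:
                return candidate
    return word  # no valid base form found, keep the word as is

print(toy_lemma("battling"))    # battle
print(toy_lemma("products"))    # product
print(toy_lemma("production"))  # production
```

The dictionary check is what keeps "production" intact here, matching the behaviour seen in the output above.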

### Named Entity Recognition (NER)

NER is the process of extracting important entities, such as person names, place
names, and organization names, from a given text. These are usually not present
in dictionaries, so we need to treat them differently. The main objective of this
process is to identify the named entities (such as proper nouns) and map them to
predefined categories.

from nltk import download
from nltk import pos_tag
from nltk import ne_chunk
from nltk import word_tokenize
download('maxent_ne_chunker')
download('words')

In [34]:
sentence = "We are reading a book published by Packt "\
            "which is based out of Birmingham."

In [35]:
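def get_ner(text):
    chunks = ne_chunk(pos_tag(word_tokenize(text)), binary = True)
    # keep only named-entity subtrees; plain (word, tag) tuples have no label
    return [chunk for chunk in chunks if hasattr(chunk, 'label')]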

get_ner(sentence)

[Tree('NE', [('Packt', 'NNP')]), Tree('NE', [('Birmingham', 'NNP')])]

### Word Sense Disambiguation

A word with a given spelling may have different meanings in different contexts,
which often leads to ambiguity. Word sense disambiguation is the process of
mapping a word to the sense that it should carry in its context.

One of the algorithms used for word sense disambiguation is the Lesk algorithm.
It relies on a large lexical resource in the background (generally WordNet) that
contains definitions of the possible senses of the words in a language, and it
picks the sense whose definition overlaps most with the surrounding words.
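
The overlap idea behind Lesk can be sketched without WordNet. The sense inventory `TOY_SENSES`, its glosses, and the helper `toy_lesk` below are invented for illustration; a real implementation draws senses and definitions from a lexical resource.

```python
# Hypothetical sense inventory: sense name -> short definition (gloss)
TOY_SENSES = {
    "bank.finance": "an institution that accepts deposits and keeps money and savings",
    "bank.slope": "sloping land beside a river or a road",
}

def toy_lesk(context_tokens, senses):
    context = {token.lower() for token in context_tokens}

    def overlap(sense):
        # Count the words shared by the context and the sense's gloss
        return len(context & set(senses[sense].split()))

    # Pick the sense whose gloss shares the most words with the context
    return max(senses, key=overlap)

print(toy_lesk("Keep your savings in the bank".split(), TOY_SENSES))
# bank.finance

print(toy_lesk("It's so risky to drive over the banks of the road".split(), TOY_SENSES))
# bank.slope
```

With glosses this short, a single shared word decides the result, which hints at why overlap-based methods can be brittle.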

In [36]:
import nltk
nltk.download('wordnet')
from nltk.wsd import lesk
from nltk import word_tokenize

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/josephitopa/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [37]:
sentence1 = "Keep your savings in the bank"
sentence2 = "It's so risky to drive over the banks of the road"

In [38]:
def get_synset(sentence, word):
    return lesk(word_tokenize(sentence), word)

get_synset(sentence1,'bank') # Here, savings_bank.n.02 refers to a container for keeping money safely at home.

Synset('savings_bank.n.02')

In [39]:
get_synset(sentence2,'bank') # Here, bank.v.07 refers to a slope in the turn of a road

Synset('bank.v.07')

### Sentence Boundary Detection

Sentence boundary detection is the method of detecting where one sentence ends
and another begins. This may sound easy, since a period (.) or a question mark (?)
usually marks the end of one sentence and the beginning of the next, but that
rule alone is not reliable. For instance, periods also appear after abbreviations
such as "Mr." and between the letters of acronyms.
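
A short standard-library sketch shows how a punctuation-only rule goes wrong and how a list of known abbreviations can patch it. The helpers `naive_split` and `split_with_abbreviations` and the `ABBREVIATIONS` set are assumptions for this example, and they are much cruder than a trained sentence tokenizer.

```python
import re

def naive_split(text):
    # Split after every '.', '?' or '!' that is followed by whitespace
    return [piece for piece in re.split(r"(?<=[.?!])\s+", text) if piece]

# Tiny illustrative list; real text needs far more entries
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}

def split_with_abbreviations(text):
    sentences = []
    for piece in naive_split(text):
        # If the previous piece ended in an abbreviation, it was not a sentence end
        if sentences and sentences[-1].split()[-1] in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

text = ("Mr. Donald John Trump is the current president of the USA. "
        "Before joining politics, he was a businessman.")

print(naive_split(text))
# ['Mr.', 'Donald John Trump is the current president of the USA.',
#  'Before joining politics, he was a businessman.']

print(split_with_abbreviations(text))
# ['Mr. Donald John Trump is the current president of the USA.',
#  'Before joining politics, he was a businessman.']
```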

In [41]:
import nltk
from nltk.tokenize import sent_tokenize

In [42]:
def get_sentences(text):
    return sent_tokenize(text)

get_sentences("We are reading a book. Do you know who is "\
              "the publisher? It is Packt. Packt is based "\
              "out of Birmingham.")

['We are reading a book.',
 'Do you know who is the publisher?',
 'It is Packt.',
 'Packt is based out of Birmingham.']

In [43]:
get_sentences("Mr. Donald John Trump is the current "\
              "president of the USA. Before joining "\
              "politics, he was a businessman.")

['Mr. Donald John Trump is the current president of the USA.',
 'Before joining politics, he was a businessman.']

As you can see in the output, the sent_tokenize function is able to differentiate
between the period (.) after "Mr" and the one used to end the sentence.