# <font color = 'dodgerblue'>**Natural Language ToolKit**

## <font color = 'dodgerblue'>**1. Introduction and installation**

* NLTK is a powerful open-source Python package that provides a diverse set of natural language processing algorithms.
* It is free, easy to use, well documented, and backed by a large community.
* NLTK covers the most common NLP tasks, such as tokenization, part-of-speech tagging, stemming, named entity recognition, and many more.


In [None]:
# Installing nltk
!pip install -U nltk



In [None]:
# Wrap long text to make it readable
import textwrap as tw

# pandas, used to display results as DataFrames
import pandas as pd

# nltk library
import nltk

# Import WordNetLemmatizer from nltk.stem
from nltk.stem import WordNetLemmatizer

# Import tokenizers from nltk
from nltk.tokenize import sent_tokenize, word_tokenize, TweetTokenizer

# Import stopwords from nltk.corpus
from nltk.corpus import stopwords

# Import PorterStemmer from nltk.stem
from nltk.stem import PorterStemmer

# named entities
from nltk.chunk import ne_chunk

# Importing pos_tag
from nltk.tag import pos_tag


In [None]:
# Checking nltk version
print(nltk.__version__)

3.7


In [None]:
# Downloading the nltk data packages used in this notebook
# nltk ships its corpora and models as separate packages that must be
# downloaded before they can be used. For example, to use the nltk stopwords
# you need to download the stopwords package with nltk.download("stopwords").
nltk.download(['wordnet', 'punkt', 'stopwords', 'omw-1.4','averaged_perceptron_tagger', 'universal_tagset','maxent_ne_chunker','words'])

[nltk_data] Downloading package wordnet to /home/harpreet/nltk_data...
[nltk_data] Downloading package punkt to /home/harpreet/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/harpreet/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package omw-1.4 to /home/harpreet/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/harpreet/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /home/harpreet/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/harpreet/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /home/harpreet/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [None]:
# Some random text that we will use to explain preprocessing using nltk
text = '''A regular expression (shortened as regex or regexp;[1] also referred to as rational expression[2][3]) is a sequence of characters that define a search pattern. The concept arose in the 1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular language. 
RT @TeaPainUSA: #AMERICA 1ST #MAGA #CPAC* THERES NOTHING LIKE IT_SO #FREEDOM LOVING AMERICANS_CAN GET AWAY FROM THE BULL… https://t.co/NCxINeoqvd
'''

In [None]:
print(tw.fill(text, 100))

A regular expression (shortened as regex or regexp;[1] also referred to as rational
expression[2][3]) is a sequence of characters that define a search pattern. The concept arose in the
1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular
language.  RT @TeaPainUSA: #AMERICA 1ST #MAGA #CPAC* THERES NOTHING LIKE IT_SO #FREEDOM LOVING
AMERICANS_CAN GET AWAY FROM THE BULL… https://t.co/NCxINeoqvd


## <font color = 'dodgerblue'>**2. Tokenization**

* Tokenization is the process of breaking a text down into smaller chunks, such as words or sentences.
* A token is a single unit that serves as a building block of a sentence or paragraph.

NLTK provides several kinds of tokenization, such as sentence tokenization, word tokenization, and tweet tokenization.
* The sentence tokenizer breaks a paragraph into sentences.
* The word tokenizer breaks a paragraph or sentence into words.
* The tweet tokenizer breaks tweet content into words while keeping hashtags, mentions, and emoticons intact.

In [None]:
# Import tokenizers from nltk
# from nltk.tokenize import sent_tokenize, word_tokenize, TweetTokenizer

sentences = sent_tokenize(text)

words = word_tokenize(text)

tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(text)
df_word_tokens = pd.DataFrame(words, columns=['Tokens'])
df_tweet_tokens = pd.DataFrame(tweet_tokens, columns=['Tokens'])
df_sentences =  pd.DataFrame(sentences, columns=['Sentences'])


In [None]:
df_sentences

Unnamed: 0,Sentences
0,A regular expression (shortened as regex or re...
1,The concept arose in the 1950s when the Americ...
2,RT @TeaPainUSA: #AMERICA 1ST #MAGA #CPAC* THER...


In [None]:
df_word_tokens

Unnamed: 0,Tokens
0,A
1,regular
2,expression
3,(
4,shortened
...,...
81,THE
82,BULL…
83,https
84,:


In [None]:
df_tweet_tokens

Unnamed: 0,Tokens
0,A
1,regular
2,expression
3,(
4,shortened
...,...
74,FROM
75,THE
76,BULL
77,…


In [None]:
tweet_tokens = tweet_tokenizer.tokenize("The protest against #deforestation by @forest is gaining pace :-) :-P <3. \
                                        and some arrows < > -> <--")

In [None]:
# Emoticons, mentions, and hashtags are each kept as a single token
# (the arrows fall outside the first 15 tokens shown here).
print(tweet_tokens[:15])

['The', 'protest', 'against', '#deforestation', 'by', '@forest', 'is', 'gaining', 'pace', ':-)', ':-P', '<3', '.', 'and', 'some']
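
`TweetTokenizer` also accepts a few constructor options. The sketch below uses `strip_handles` and `reduce_len`, which drop @-mentions and shorten long runs of a repeated character. The sample tweet and the variable names are illustrative and not part of the text above.

```python
from nltk.tokenize import TweetTokenizer

# strip_handles=True removes @-mentions from the output;
# reduce_len=True shortens runs of 3 or more repeated characters to exactly 3.
clean_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

sample_tweet = "@remy: This is waaaaayyyy too much for you!!!!!!"
clean_tokens = clean_tokenizer.tokenize(sample_tweet)

print(clean_tokens)
```

The handle disappears from the token list, and both the stretched word and the run of exclamation marks are shortened, which can reduce vocabulary noise in social media text.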


## <font color = 'dodgerblue'>**3. Stopwords**

In [None]:
# Import stopwords from nltk.corpus
# from nltk.corpus import stopwords

# Load the English stopwords into a DataFrame
stopwords_eng = pd.DataFrame(stopwords.words("english"), columns = ['tokens'])


In [None]:
stopwords_eng 

Unnamed: 0,tokens
0,i
1,me
2,my
3,myself
4,we
...,...
174,weren't
175,won
176,won't
177,wouldn


In [None]:
len(stopwords_eng)

179

In [None]:
text2 = 'A regular expression is a sequence of characters that define a search pattern.'

In [None]:
# Get all words from text using word_tokenize
words = word_tokenize(text2)

# Create a list of nonstopwords
non_stopwords= []

for word in words:
  if word.lower() not in stopwords_eng['tokens'].values:
    non_stopwords.append(word)
    
print(non_stopwords)

['regular', 'expression', 'sequence', 'characters', 'define', 'search', 'pattern', '.']


In [None]:
# Add custom stopwords
# Get all words from text using word_tokenize
words = word_tokenize(text2)
stopword = stopwords.words("english")
non_stopwords=[]
new_stopwords = ["regular", "pattern"]

stopword.extend(new_stopwords)

for word in words:
  if word.lower() not in stopword:
    non_stopwords.append(word.lower())
    
print(non_stopwords) 

['expression', 'sequence', 'characters', 'define', 'search', '.']


In [None]:
# Remove a default stopword from the stopword list
words = word_tokenize(text2)
non_stopwords=[]
stopword = stopwords.words("english")
stopword.remove('a')

for word in words:
  if word.lower() not in stopword:
    non_stopwords.append(word.lower())
    
print(non_stopwords) 

['a', 'regular', 'expression', 'a', 'sequence', 'characters', 'define', 'a', 'search', 'pattern', '.']
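
The three loops above follow the same pattern, so they can be folded into one helper. This is a minimal sketch, assuming a tiny hand-written stopword set (`DEMO_STOPWORDS`) and a crude regex tokenizer so that the block runs without any nltk downloads; in the notebook you would pass `stopwords.words("english")` and the output of `word_tokenize` instead. The helper name `filter_stopwords` and its parameters are hypothetical.

```python
import re

# Hypothetical stand-in for stopwords.words("english"), kept tiny on purpose
DEMO_STOPWORDS = {"a", "is", "of", "that"}

def filter_stopwords(tokens, stopword_list, extra=(), keep=()):
    """Return lowercased tokens that are not stopwords.

    extra: custom stopwords to add; keep: default stopwords to retain.
    """
    # A set gives constant-time membership checks, unlike a list
    active = (set(stopword_list) | set(extra)) - set(keep)
    return [tok.lower() for tok in tokens if tok.lower() not in active]

sample = "A regular expression is a sequence of characters that define a search pattern."
# Crude tokenizer for the demo: runs of word characters, or single punctuation marks
sample_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(filter_stopwords(sample_tokens, DEMO_STOPWORDS))
print(filter_stopwords(sample_tokens, DEMO_STOPWORDS, extra=["regular", "pattern"]))
print(filter_stopwords(sample_tokens, DEMO_STOPWORDS, keep=["a"]))
```

Building the active set once, rather than mutating the stopword list in each cell, keeps the default list untouched between experiments.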


## <font color = 'dodgerblue'>**4. Stemming**

In [None]:
# Import PorterStemmer from nltk.stem
# from nltk.stem import PorterStemmer

# Create an object of class PorterStemmer
stemmer = PorterStemmer()

words = ['connection', 'connected', 'connecter', 'connecting', 'connect']

for w in words:
  print(w, " : ", stemmer.stem(w))

connection  :  connect
connected  :  connect
connecter  :  connect
connecting  :  connect
connect  :  connect
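
Stemming only strips suffixes by rule, so the result is not always a dictionary word, and unrelated words can collapse to the same stem. A small sketch with a few illustrative words chosen here (they are not from the text above):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words picked for this sketch
demo_words = ["universe", "university", "flies"]
demo_stems = {w: stemmer.stem(w) for w in demo_words}

for word, stem in demo_stems.items():
    print(word, " : ", stem)
```

Here "universe" and "university" both reduce to "univers", and "flies" becomes "fli". This is one reason lemmatization is often preferred when readable base forms matter.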


## <font color = 'dodgerblue'>**5. Lemmatization**

In [None]:
# Stem 'better' and 'good' first, then lemmatize the same words for comparison
for w in ['better','good']:
  print(w, " : ", stemmer.stem(w))

better  :  better
good  :  good


In [None]:
# Create an object of WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatize() needs the part of speech of the word; the default is pos='n' (noun)

print('Lemmas for better')
print(lemmatizer.lemmatize("better", pos='n')) # noun
print(lemmatizer.lemmatize("better", pos='a')) # adjective
print(lemmatizer.lemmatize("better", pos='r')) # adverb
print(lemmatizer.lemmatize("better", pos='v')) # verb
print()
print('Lemmas for good')
print(lemmatizer.lemmatize("good", pos='n')) # noun
print(lemmatizer.lemmatize("good", pos='a')) # adjective
print(lemmatizer.lemmatize("good", pos='r')) # adverb
print(lemmatizer.lemmatize("good", pos='v')) # verb

Lemmas for better
better
good
well
better

Lemmas for good
good
good
good
good


## <font color = 'dodgerblue'>**6. POS Tagging**

* The primary goal of POS (part-of-speech) tagging is to identify the grammatical category of a given word, such as noun, pronoun, adjective, verb, or adverb, based on its context.
* POS tagging looks at relationships within the sentence and assigns a corresponding tag to each word.


In [None]:
# Import pos_tag from nltk.tag
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

# finding pos_tags of tokens
# Returns a list of tuples
pos_tags = pos_tag(word_tokenize(text2), tagset='universal', lang='eng')

pos_tags

[('A', 'DET'),
 ('regular', 'ADJ'),
 ('expression', 'NOUN'),
 ('is', 'VERB'),
 ('a', 'DET'),
 ('sequence', 'NOUN'),
 ('of', 'ADP'),
 ('characters', 'NOUN'),
 ('that', 'DET'),
 ('define', 'VERB'),
 ('a', 'DET'),
 ('search', 'NOUN'),
 ('pattern', 'NOUN'),
 ('.', '.')]
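
Because `pos_tag` returns plain `(word, tag)` tuples, the result is easy to post-process. The sketch below hard-codes the tagged output shown above (so it runs without the tagger), counts how often each tag occurs, and pulls out the nouns. The variable names are illustrative.

```python
from collections import Counter

# Tagged tokens copied from the output above
tagged = [
    ('A', 'DET'), ('regular', 'ADJ'), ('expression', 'NOUN'), ('is', 'VERB'),
    ('a', 'DET'), ('sequence', 'NOUN'), ('of', 'ADP'), ('characters', 'NOUN'),
    ('that', 'DET'), ('define', 'VERB'), ('a', 'DET'), ('search', 'NOUN'),
    ('pattern', 'NOUN'), ('.', '.'),
]

# Frequency of each universal tag
tag_counts = Counter(tag for _, tag in tagged)

# Keep only the words tagged as nouns
nouns = [word for word, tag in tagged if tag == 'NOUN']

print(tag_counts.most_common())
print(nouns)
```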

## <font color = 'dodgerblue'>**7. Combining Lemmatization and POS Tagging.**

* For lemmatization, we have to provide a POS tag to get the correct output, but the output of the `pos_tag` function does not match the input format of the `pos` argument of the `lemmatize` function.
* To resolve this, we can combine the two techniques by mapping the tags from `pos_tag` to the WordNet POS letters that `lemmatize` expects.

In [None]:
# Import pos_tag from nltk.tag
# from nltk.tag import pos_tag
# from nltk.tokenize import word_tokenize
# from nltk.corpus import wordnet

def pos_wordnet(text):
  # finding pos_tags of tokens
  # we have to use tagset as universal to get universal pos tags
  univ_pos = pos_tag(word_tokenize(text), tagset="universal", lang='eng')

  # Create a dict which maps universal pos tags to wordnet pos tags
  univpos_wordnetpos = {"NOUN":'n', "VERB":'v', "ADJ":"a", "ADV":"r"}

  wordnet_pos = []
  # Iterate over the tags computed for this text
  for word, tag in univ_pos:
    # Use the mapped tag when there is one; otherwise fall back to the
    # default lemmatizing pos tag, i.e. 'n' (NOUN)
    wordnet_pos.append((word, univpos_wordnetpos.get(tag, 'n')))

  return wordnet_pos

In [None]:
# Getting lemmas using pos_wordnet function
lemmatizer = WordNetLemmatizer()
wordnet_pos = pos_wordnet(text2)

# Create a list for storing lemmas of words
lemmas = []
for word, pos in wordnet_pos:
  lemma = lemmatizer.lemmatize(word, pos=pos)
  lemmas.append(lemma)

print(lemmas)

['A', 'regular', 'expression', 'be', 'a', 'sequence', 'of', 'character', 'that', 'define', 'a', 'search', 'pattern', '.']
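
When `pos_tag` is called without `tagset="universal"`, it returns Penn Treebank tags such as `NNS`, `VBD`, or `JJ`. A common alternative to the dictionary used in `pos_wordnet` is to map those tags by prefix. This is a sketch, not part of nltk: the helper name `penn_to_wordnet` is hypothetical, and the sample pairs are hard-coded so the block runs without the tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else fall back to the lemmatizer default

# Hard-coded (word, Penn Treebank tag) pairs for illustration
sample_tagged = [("prices", "NNS"), ("rose", "VBD"), ("early", "RB"),
                 ("brighter", "JJ"), ("the", "DT")]

wordnet_pairs = [(word, penn_to_wordnet(tag)) for word, tag in sample_tagged]
print(wordnet_pairs)
```

The resulting pairs can be passed to `lemmatizer.lemmatize(word, pos=pos)` in the same way as the output of `pos_wordnet`.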


## <font color = 'dodgerblue'>**8. Named Entity Recognition (NER)**

In [None]:
# Text taken from https://oilprice.com/Energy/Oil-Prices/Oil-Rally-Continues-On-Bright-US-Economic-Data.html on June 23, 2021.
# Defining String
text3 = """
Oil prices rose early on Wednesday, driven by brighter economic prospects for the United States and continued recovery in oil demand in America and elsewhere in the world.
As of 9:04 a.m. EDT on Wednesday, ahead of the weekly inventory report by the U.S. Energy Information Administration (EIA), WTI Crude was up 1.04 percent at $73.61, 
and Brent Crude traded at $75.54, up by 0.99 percent on the day.Prices found support late on Tuesday after the American Petroleum Institute (API) 
reported a draw in crude oil inventories of 7.199 million barrels for the week ending June 18. If the EIA confirms a draw today, it would be the fifth consecutive week of crude inventory draws in the United States, where demand for fuels continues to grow.
"""

In [None]:
#from nltk.chunk import ne_chunk
#from nltk.tag import pos_tag
#from nltk.tokenize import word_tokenize

tokens = word_tokenize(text3)

# Tag the tokens with POS tags, then chunk the tagged tokens into named entities
tagged_tokens = pos_tag(tokens, lang='eng')
ne_tree = ne_chunk(tagged_tokens)

print(ne_tree)

(S
  (GPE Oil/NN)
  prices/NNS
  rose/VBD
  early/RB
  on/IN
  Wednesday/NNP
  ,/,
  driven/VBN
  by/IN
  brighter/JJ
  economic/JJ
  prospects/NNS
  for/IN
  the/DT
  (GPE United/NNP States/NNPS)
  and/CC
  continued/JJ
  recovery/NN
  in/IN
  oil/NN
  demand/NN
  in/IN
  (GPE America/NNP)
  and/CC
  elsewhere/RB
  in/IN
  the/DT
  world/NN
  ./.
  As/IN
  of/IN
  9:04/CD
  a.m./RB
  (ORGANIZATION EDT/NNP)
  on/IN
  Wednesday/NNP
  ,/,
  ahead/RB
  of/IN
  the/DT
  weekly/JJ
  inventory/NN
  report/NN
  by/IN
  the/DT
  (GPE U.S./NNP)
  Energy/NNP
  Information/NNP
  Administration/NNP
  (/(
  (ORGANIZATION EIA/NNP)
  )/)
  ,/,
  (ORGANIZATION WTI/NNP Crude/NNP)
  was/VBD
  up/RB
  1.04/CD
  percent/NN
  at/IN
  $/$
  73.61/CD
  ,/,
  and/CC
  (PERSON Brent/NNP Crude/NNP)
  traded/VBD
  at/IN
  $/$
  75.54/CD
  ,/,
  up/RB
  by/IN
  0.99/CD
  percent/NN
  on/IN
  the/DT
  day.Prices/NNS
  found/VBD
  support/NN
  late/RB
  on/IN
  Tuesday/NNP
  after/IN
  the/DT
  (ORGANIZATION Americ

In [None]:
for chunk in ne_tree:
    # Named-entity chunks are subtrees with a label; plain tokens are (word, tag) tuples
    if hasattr(chunk, 'label'):
        print(' '.join(c[0] for c in chunk), chunk.label())

Oil GPE
United States GPE
America GPE
EDT ORGANIZATION
U.S. GPE
EIA ORGANIZATION
WTI Crude ORGANIZATION
Brent Crude PERSON
American Petroleum Institute ORGANIZATION
API ORGANIZATION
EIA ORGANIZATION
United States GPE
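
The chunk tree can also be flattened into IOB tags with `tree2conlltags` from `nltk.chunk`, which is handy when entities need to be stored in a table or grouped by type. This is a minimal sketch on a small hand-built tree (so it runs without the chunker models); with the notebook's data you would pass `ne_tree` instead. The variable names are illustrative.

```python
from collections import defaultdict

from nltk.chunk import tree2conlltags
from nltk.tree import Tree

# Hand-built stand-in for ne_tree, mirroring the structure ne_chunk returns
demo_tree = Tree("S", [
    Tree("GPE", [("United", "NNP"), ("States", "NNPS")]),
    ("and", "CC"),
    Tree("ORGANIZATION", [("EIA", "NNP")]),
])

# One (word, pos, iob) triple per token:
# B- opens an entity, I- continues it, O marks tokens outside any entity
iob_tags = tree2conlltags(demo_tree)
print(iob_tags)

# Group the entity strings by their label
entities_by_label = defaultdict(list)
for subtree in demo_tree:
    if isinstance(subtree, Tree):
        entities_by_label[subtree.label()].append(" ".join(word for word, _ in subtree))

print(dict(entities_by_label))
```

As the output above shows, the default chunker makes mistakes (for example, "Brent Crude" is labeled PERSON), so grouping entities by label is also a quick way to spot-check its results.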
