<a href="https://colab.research.google.com/github/Satwikram/NLP-Implementations/blob/main/NLTK/NLTK%20Implementations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Author: Satwik Ram K

**NLTK Implementations**

### Importing Dependencies

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import shutil
import tensorflow as tf
import nltk
from tqdm import tqdm

### Downloading all the modules of NLTK

In [None]:
nltk.download('all')

### Basic Tokenizers

In [None]:
from nltk import sent_tokenize, word_tokenize

In [None]:
example_text = "Hello, how are you. Iam in Bengaluru and its raining heavily. How is weather there?"

#### Sentence Tokenizer

In [None]:
for sent in sent_tokenize(example_text):
  print(sent)

Hello, how are you.
Iam in Bengaluru and its raining heavily.
How is weather there?


#### Word Tokenizer

In [None]:
for word in word_tokenize(example_text):
  print(word)

Hello
,
how
are
you
.
Iam
in
Bengaluru
and
its
raining
heavily
.
How
is
weather
there
?


#### TreebankWordTokenizer

In [None]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer_wrd = TreebankWordTokenizer()
tokenizer_wrd.tokenize('youtube.com provides high quality technical tutorials for free.')

['youtube.com',
 'provides',
 'high',
 'quality',
 'technical',
 'tutorials',
 'for',
 'free',
 '.']

####  WordPunctTokenizer

In [None]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
print(tokenizer.tokenize(" I can't allow you to go home early"))
print(word_tokenize(" I can't allow you to go home early"))
print(tokenizer_wrd.tokenize("I can't allow you to go home early"))

['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']
['I', 'ca', "n't", 'allow', 'you', 'to', 'go', 'home', 'early']
['I', 'ca', "n't", 'allow', 'you', 'to', 'go', 'home', 'early']


#### RegexpTokenizer

In [None]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
print(tokenizer.tokenize("won't is a  contraction."))
print(tokenizer.tokenize("can't is a contraction."))

["won't", 'is', 'a', 'contraction']
["can't", 'is', 'a', 'contraction']
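RegexpTokenizer can also treat the pattern as the separator instead of the token. The sketch below assumes whitespace is the only separator of interest and passes gaps=True, so the pattern describes the gaps between tokens. Punctuation then stays attached to the neighbouring word. The name gap_tokenizer is introduced here for illustration.

```python
from nltk.tokenize import RegexpTokenizer

# gaps=True: the pattern matches the separators, not the tokens
gap_tokenizer = RegexpTokenizer(r"\s+", gaps=True)
gap_tokens = gap_tokenizer.tokenize("won't is a  contraction.")
print(gap_tokens)
```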


#### Why train your own sentence tokenizer?

NLTK already ships a default sentence tokenizer, so why train another one? The default is a general-purpose model. It works well on ordinary prose, but it may be a poor fit for nonstandard text or text with unusual formatting, where abbreviations and sentence boundaries differ from what it expects. For such text, training a sentence tokenizer on a sample of your own data can give better results.
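As a minimal sketch of what such training looks like, PunktSentenceTokenizer can be constructed directly from raw text, which it uses as unsupervised training data. The clinical-style text and the names domain_text and domain_tokenizer are hypothetical. A sample this small is only enough to show the call pattern, so the resulting splits should be inspected rather than trusted.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Hypothetical domain text; in practice use a large sample of your own corpus
domain_text = (
  "The pt. was admitted on Monday. The pt. reported mild pain. "
  "Dr. Rao examined the pt. in the morning. The pt. was discharged on Friday."
)

# Passing text to the constructor trains the Punkt model on it (unsupervised)
domain_tokenizer = PunktSentenceTokenizer(domain_text)
domain_sentences = domain_tokenizer.tokenize(domain_text)

for sent in domain_sentences:
  print(sent)
```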

### Stopwords

In [None]:
from nltk.corpus import stopwords

In [None]:
stopwords.fileids()

['arabic',
 'azerbaijani',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

In [None]:
stop_words = set(stopwords.words('english'))

In [None]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [None]:
filtered_sentence = []

In [None]:
for word in word_tokenize(example_text):
  if word not in stop_words:
    filtered_sentence.append(word)

In [None]:
filtered_sentence

['Hello',
 ',',
 '.',
 'Iam',
 'Bengaluru',
 'raining',
 'heavily',
 '.',
 'How',
 'weather',
 '?']

In [None]:
# Same filter as a single list comprehension

filtered_words = [word for word in word_tokenize(example_text) if word not in stop_words]

In [None]:
filtered_words

['Hello',
 ',',
 '.',
 'Iam',
 'Bengaluru',
 'raining',
 'heavily',
 '.',
 'How',
 'weather',
 '?']
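The filtered list still contains 'How' and the punctuation tokens, because the stop word list is lowercase and the membership test is case-sensitive. The sketch below shows one possible refinement: lowercase each token before the lookup and keep only alphabetic tokens. To stay self-contained it assumes a small hand-picked stop word set in place of stopwords.words('english'), and the token list is copied from the word tokenizer output earlier in this notebook.

```python
# Tokens copied from the word_tokenize output shown earlier
tokens = ['Hello', ',', 'how', 'are', 'you', '.', 'Iam', 'in', 'Bengaluru', 'and',
          'its', 'raining', 'heavily', '.', 'How', 'is', 'weather', 'there', '?']

# Assumption: a small hand-picked subset standing in for stopwords.words('english')
small_stop_words = {'how', 'are', 'you', 'in', 'and', 'its', 'is', 'there'}

# Drop punctuation tokens and compare in lowercase
clean_words = [
  word for word in tokens
  if word.isalpha() and word.lower() not in small_stop_words
]
print(clean_words)
```

With this subset the result is ['Hello', 'Iam', 'Bengaluru', 'raining', 'heavily', 'weather'].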

### Stemming

Stemming reduces words to a base form by stripping affixes, much like cutting the branches of a tree back to its stem. For example, the intended stem of eating, eats, and eaten is eat. A rule-based stemmer does not always reach that form, and its output need not be a dictionary word.

In [None]:
from nltk.stem import PorterStemmer, LancasterStemmer

In [None]:
ps = PorterStemmer()
ls = LancasterStemmer()

In [None]:
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

In [None]:
for word in example_words:
  print(ps.stem(word), ls.stem(word))

python python
python python
python python
python python
pythonli python
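To check the eating, eats, eaten example from the introduction against an actual stemmer, the sketch below runs the three forms through PorterStemmer. The names porter and eat_stems are introduced here. Regular suffixes such as -ing and -s are stripped, while an irregular form such as eaten depends on the stemmer's rule set, so it is worth printing the result rather than assuming it.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()
eat_forms = ["eating", "eats", "eaten"]

# Map each surface form to the stem the Porter rules produce
eat_stems = {word: porter.stem(word) for word in eat_forms}
print(eat_stems)
```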


### Part of Speech Tagging

In [None]:
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

In [None]:
sample_text = state_union.raw('2006-GWBush.txt')
train_text = state_union.raw('2006-GWBush.txt')

In [None]:
sample_text

'PRESIDENT GEORGE W. BUSH\'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all. Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream. Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago, and we are grateful for the good life of Coretta Scott King. (Applause.)\n\nPresident George W. Bush reacts to applause during his State of the Union Address at the Capitol, Tuesday, Jan. 31, 2006. White House photo by Eric DraperEvery time I\'m invited to this rostrum, I\'m humbled by the privilege, and mindful of the history we\'ve seen together. We have gathered under this Capitol dome in moments of national mourning and national achievement. We have serv

In [None]:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [None]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [None]:
def process_content():

  try:
    for i in tokenized:
      words = nltk.word_tokenize(i)
      tagged = nltk.pos_tag(words)
      print(tagged)

  except Exception as e:
    print(e)

In [None]:
process_content()

### Chunking

In [None]:
def process_content1():

  # Chunk grammar: optional adverbs and verbs, then one or more proper nouns, then an optional noun
  ChunkGram = r"""chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
  ChunkParser = nltk.RegexpParser(ChunkGram)

  try:
    for i in tokenized:
      words = nltk.word_tokenize(i)
      tagged = nltk.pos_tag(words)

      chunked = ChunkParser.parse(tagged)
      print(chunked)

  except Exception as e:
    print(e)

In [None]:
process_content1()
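To see what the chunk grammar selects without running the whole speech through the tagger, the sketch below applies the same grammar to a single hand-tagged sentence. The sentence, its tags, and the names tagged_demo, demo_parser, and demo_chunks are assumptions made for illustration. In the notebook the tags come from nltk.pos_tag.

```python
import nltk

# Hand-tagged tokens (an assumption for illustration; normally produced by nltk.pos_tag)
tagged_demo = [("President", "NNP"), ("Bush", "NNP"), ("spoke", "VBD"),
               ("in", "IN"), ("Washington", "NNP"), ("today", "NN")]

demo_parser = nltk.RegexpParser(r"""chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}""")
demo_tree = demo_parser.parse(tagged_demo)

# Collect the words of every subtree labelled "chunk"
demo_chunks = [
  [word for word, tag in subtree.leaves()]
  for subtree in demo_tree.subtrees(filter=lambda t: t.label() == "chunk")
]
print(demo_chunks)
```

The grammar requires at least one NNP, so the verb "spoke" is left outside any chunk, and the two groups found are ['President', 'Bush'] and ['Washington', 'today'].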

### Named Entity Recognition

In [None]:
def process_content2():

  try:
    for i in tokenized:
      words = nltk.word_tokenize(i)
      tagged = nltk.pos_tag(words)

      # binary=True labels every entity as NE instead of assigning a type such as PERSON or GPE
      namedEnt = nltk.ne_chunk(tagged, binary = True)

      print(namedEnt)

  except Exception as e:
    print(e)

In [None]:
process_content2()

### Lemmatization

Lemmatization is similar to stemming, but its output, the lemma, is a valid dictionary word rather than a truncated stem, and it carries the same meaning as the original word. WordNetLemmatizer treats each word as a noun unless a part of speech is passed via pos, as in the adjective examples.

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
print(lemmatizer.lemmatize("catsssssssssss"))
print(lemmatizer.lemmatize("pythoned"))
print(lemmatizer.lemmatize("better", pos = 'a'))
print(lemmatizer.lemmatize("best", pos = 'a'))

cat
pythoned
good
best


### Wordnet

WordNet is a large lexical database of English created at Princeton University, and NLTK provides it as one of its corpora. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms called synsets.

In [None]:
from nltk.corpus import wordnet

In [None]:
for i in wordnet.synsets("plan"):
  print(i,":",i.definition(),"\n","Example:",i.examples())
  print("---"*30)

Synset('plan.n.01') : a series of steps to be carried out or goals to be accomplished 
 Example: ['they drew up a six-step plan', 'they discussed plans for a new bond issue']
------------------------------------------------------------------------------------------
Synset('design.n.02') : an arrangement scheme 
 Example: ['the awkward design of the keyboard made operation difficult', 'it was an excellent design for living', 'a plan for seating guests']
------------------------------------------------------------------------------------------
Synset('plan.n.03') : scale drawing of a structure 
 Example: ['the plans for City Hall were on file']
------------------------------------------------------------------------------------------
Synset('plan.v.01') : have the will and intention to carry out some action 
 Example: ['He plans to be in graduate school next year', 'The rebels had planned turmoil and confusion']
----------------------------------------------------------------------------

### Similarity

In [None]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("boat.n.01")

In [None]:
str(w1.wup_similarity(w2)*100)+" % Similar"

'90.9090909090909 % Similar'

In [None]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("car.n.01")

In [None]:
str(w1.wup_similarity(w2)*100)+" % Similar"

'69.56521739130434 % Similar'
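For context, wup_similarity is the Wu-Palmer measure. As documented for NLTK, it scores two senses by their depth in the WordNet hypernym taxonomy and the depth of their least common subsumer, which is the most specific ancestor they share. The symbols below are shorthand introduced here, not NLTK names:

```latex
\mathrm{wup}(s_1, s_2) = \frac{2 \cdot \mathrm{depth}\bigl(\mathrm{lcs}(s_1, s_2)\bigr)}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)}
```

The score shrinks as the shared ancestor moves closer to the root, so the percentages printed in this section are taxonomy-based scores, not probabilities.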

### Finding Synonyms
Calling lemmas() on a Synset returns its lemmas, the synonyms that share that sense, so the length of the list is the number of synonyms. Let us apply it to the first synset of ‘dog’:

In [None]:
from nltk.corpus import wordnet as wn
syn = wn.synsets('dog')[0]
lemmas = syn.lemmas()
len(lemmas)

3

In [None]:
for i in lemmas:
  print(i.name())

dog
domestic_dog
Canis_familiaris


### Finding Antonyms
In WordNet, some lemmas also have antonyms. For example, the word ‘good’ has a total of 27 synsets, and 5 of them have lemmas with antonyms. Antonyms are attached to lemmas rather than to synsets, so we pick a lemma first. Here we look up the antonym of ‘good’ used as a noun; the adjective senses can be queried the same way.

In [None]:
from nltk.corpus import wordnet as wn
syn1 = wn.synset('good.n.02')
antonym1 = syn1.lemmas()[0].antonyms()[0]
antonym1.name()

'evil'

### Word replacement using regular expression


In [None]:
import re
from nltk.corpus import wordnet

In [96]:
tweet = "am ok, I can't do this"

In [125]:
regxDict = {"can't": "cannot",
            "won't": "will not",
            "am": "I am",
            }

In [126]:
for i in regxDict:
  print(i, regxDict[i])

can't cannot
won't will not
am I am


In [129]:
def regx(tweet):

  for i in regxDict:
    # Match whole words only, so that "am" is not replaced inside words such as "game"
    exp = r"\b" + i + r"\b"

    tweet = re.sub(exp, regxDict[i], tweet)

  return tweet

In [130]:
regx(tweet)

'I am ok, I cannot do this'
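One caveat with plain substring patterns is that a short key such as "am" also matches inside longer words. The sketch below is one possible alternative, not the notebook's method: it joins the escaped keys into a single alternation anchored at word boundaries and substitutes in one pass, so replaced text is never scanned again. The names contractions, contraction_re, and expand are hypothetical.

```python
import re

# Hypothetical replacement table for illustration
contractions = {"can't": "cannot", "won't": "will not", "am": "I am"}

# One alternation of escaped keys, anchored to word boundaries
contraction_re = re.compile(
  r"\b(" + "|".join(re.escape(key) for key in contractions) + r")\b"
)

def expand(text):
  # Look up each matched key and substitute its replacement
  return contraction_re.sub(lambda m: contractions[m.group(1)], text)

# Unanchored pattern: "am" is also replaced inside "game"
naive = re.sub("am", "I am", "the game is on")
# Anchored pattern: only the standalone words are replaced
safe = expand("am ok, I can't play the game")

print(naive)
print(safe)
```

The unanchored call turns "the game is on" into "the gI ame is on", while expand returns "I am ok, I cannot play the game".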