# Natural Language Processing

# Introduction

The NLTK module is a massive tool kit, aimed at helping you with the entire Natural Language Processing (NLP) methodology. NLTK will aid you with everything from splitting sentences from paragraphs, splitting up words, recognizing the part of speech of those words, highlighting the main subjects, and then even with helping your machine to understand what the text is all about. In this series, we're going to tackle the field of opinion mining, or sentiment analysis.

In [1]:
import nltk
#Data imports
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

#Math
import math

#Plot
#plot import
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
%matplotlib inline

In [2]:
#Download all NLTK packages
#nltk.download()

# Tokenizing

Splitting sentences and words from the body of text.
There are two types:

Word Tokenizer : splits by words

Sentence Tokenizer : splits by sentences

 Each "entity" that is a part of whatever was split up based on rules. For examples, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenized the sentences out of a paragraph.

# Corpus

Body of text, singular. Corpora is the plural of this. Example: A collection of medical journals

# Lexicon

 Words and their meanings. Example: English dictionary. Consider, however, that various fields will have different lexicons. For example: To a financial investor, the first meaning for the word "Bull" is someone who is confident about the market, as compared to the common English lexicon, where the first meaning for the word "Bull" is an animal. As such, there is a special lexicon for financial investors, doctors, children, mechanics, and so on.

In [3]:
from nltk.tokenize import sent_tokenize,word_tokenize

In [4]:
example_text = "Hello Mr. Smith, how are you doing today? The weather is good today and Python is awesome. The sky line is good."

In [5]:
print(sent_tokenize(example_text))

['Hello Mr. Smith, how are you doing today?', 'The weather is good today and Python is awesome.', 'The sky line is good.']


In [6]:
print(word_tokenize(example_text))

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'good', 'today', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'line', 'is', 'good', '.']


In [7]:
for i in word_tokenize(example_text):
    print(i)

Hello
Mr.
Smith
,
how
are
you
doing
today
?
The
weather
is
good
today
and
Python
is
awesome
.
The
sky
line
is
good
.


# Stop Words

The words that doesn't have any meaning. They are filler words. They make meaning for our language but not for the data analysis.

The idea of Natural Language Processing is to do some form of analysis, or processing, where the machine can understand, at least to some level, what the text means, says, or implies.

This is an obviously massive challenge, but there are steps to doing it that anyone can follow. The main idea, however, is that computers simply do not, and will not, ever understand words directly. Humans don't either *shocker*. In humans, memory is broken down into electrical signals in the brain, in the form of neural groups that fire in patterns. There is a lot about the brain that remains unknown, but, the more we break down the human brain to the basic elements, we find out basic the elements really are. Well, it turns out computers store information in a very similar way! We need a way to get as close to that as possible if we're going to mimic how humans read and understand text. Generally, computers use numbers for everything, but we often see directly in programming where we use binary signals (True or False, which directly translate to 1 or 0, which originates directly from either the presence of an electrical signal (True, 1), or not (False, 0)). To do this, we need a way to convert words to values, in numbers, or signal patterns. The process of converting data to something a computer can understand is referred to as "pre-processing." One of the major forms of pre-processing is going to be filtering out useless data. In natural language processing, useless words (data), are referred to as stop words.

In [8]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

In [9]:
example_sentence ="This is an example showing off stop word filtration"

#stop words
stop_words = set(stopwords.words("english"))
print(stop_words)

{'why', 'which', 'do', 'her', 'until', 'those', 'he', 'we', 'won', 'doing', 't', 'aren', 'down', 'hasn', 'their', 'as', 'i', 'below', 'then', 'theirs', 'ourselves', 'about', 'doesn', 'y', 'our', 'should', 'o', 'no', 'few', 'before', 'in', 'weren', 'above', 'hers', 'further', 'me', 'over', 'isn', 'these', 'don', 's', 'him', 'most', 'your', 'am', 'yours', 'what', 'is', 'you', 'up', 'or', 'wasn', 'this', 'but', 'against', 'through', 'not', 'now', 'herself', 'his', 'd', 'didn', 'shan', 'for', 'it', 'have', 'to', 'because', 'if', 'such', 'does', 'couldn', 'so', 'were', 'ain', 'all', 'my', 'its', 'shouldn', 'needn', 'how', 'has', 'of', 'on', 'from', 'was', 'and', 'only', 'having', 'where', 'whom', 'when', 'been', 'who', 'again', 'themselves', 'them', 'under', 'here', 'mightn', 'nor', 'at', 'be', 'once', 've', 'just', 'very', 'hadn', 'a', 'by', 'after', 'some', 'm', 'can', 'himself', 'the', 'haven', 'ours', 'myself', 'are', 'being', 'had', 'during', 'any', 'ma', 'an', 'too', 'yourself', 'woul

In [10]:
word = word_tokenize(example_sentence)

In [11]:
filtered_sentence = []

In [12]:
for w in word:
    if w not in stop_words:
        filtered_sentence.append(w)

In [13]:
filtered_sentence

['This', 'example', 'showing', 'stop', 'word', 'filtration']

In [14]:
filtered_sentence = [w for w in word if not w in stop_words]

In [15]:
filtered_sentence

['This', 'example', 'showing', 'stop', 'word', 'filtration']

# Stemming

The idea of stemming is a sort of normalizing method. Many variations of words carry the same meaning, other than when tense is involved.

The reason why we stem is to shorten the lookup, and normalize sentences.

Consider:

I was taking a ride in the car.
I was riding in the car.

One of the most popular stemming algorithms is the Porter stemmer, which has been around since 1979.

In [16]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [17]:
ps = PorterStemmer()

In [18]:
ex_words = ["python","pythoning","pythoned","pythonly"]

In [19]:
for word in ex_words:
    print(ps.stem(word))

python
python
python
pythonli


In [20]:
 new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."

In [21]:
words = word_tokenize(new_text)

In [22]:
for word in words:
    print(ps.stem(word))

It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.


# Part of Speech Tagging

One of the more powerful aspects of the NLTK module is the Part of Speech tagging that it can do for you. This means labeling words in a sentence as nouns, adjectives, verbs...etc.

In [23]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

PunktSentence Tokenizer is a unsupervised machine learning sentence tokenizer. We can re train it.

In [24]:
sample_text = state_union.raw("2006-GWBush.txt")

In [25]:
train_text = state_union.raw("2005-GWBush.txt")

In [26]:
custom_text_tokenizer = PunktSentenceTokenizer(train_text)

In [27]:
custom_text_tokenizer

<nltk.tokenize.punkt.PunktSentenceTokenizer at 0x2b0ddf6e7b8>

In [28]:
tokenized = custom_text_tokenizer.tokenize(sample_text)

In [29]:
tokenized

["PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all.",
 'Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream.',
 'Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago, and we are grateful for the good life of Coretta Scott King.',
 '(Applause.)',
 'President George W. Bush reacts to applause during his State of the Union Address at the Capitol, Tuesday, Jan.',
 '31, 2006.',
 "White House photo by Eric DraperEvery time I'm invited to this rostrum, I'm humbled by the privilege, and mindful of the history we've seen together.",
 'We have gathered under this Capitol dome in moments of national mourning and national ach

In [37]:
# creating a function to tokenize

def part_of_speech(document):
    try:
        for i in document:
            words = nltk.word_tokenize(document)
            tagged = nltk.pos_tag(words)
            print(tagged)
        
    except Exception as e:
        print(str(e))

In [38]:
part_of_speech(tokenized)

expected string or bytes-like object


This error occurs because the tagging could not be done on the number

In [39]:
import re
letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                          " ",          # Replace all non-letters with spaces
                          str(tokenized))

In [40]:
letters_only

'  PRESIDENT GEORGE W  BUSH S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION n  nJanuary          n nTHE PRESIDENT  Thank you all     Mr  Speaker  Vice President Cheney  members of Congress  members of the Supreme Court and diplomatic corps  distinguished guests  and fellow citizens  Today our nation lost a beloved  graceful  courageous woman who called America to its founding ideals and carried on a noble dream     Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago  and we are grateful for the good life of Coretta Scott King      Applause      President George W  Bush reacts to applause during his State of the Union Address at the Capitol  Tuesday  Jan                  White House photo by Eric DraperEvery time I m invited to this rostrum  I m humbled by the privilege  and mindful of the history we ve seen together     We have gathered under this Capitol dome in moments of national mourning and national achieveme

In [41]:
part_of_speech(letters_only)

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W', 'NNP'), ('BUSH', 'NNP'), ('S', 'NNP'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('n', 'IN'), ('nJanuary', 'JJ'), ('n', 'JJ'), ('nTHE', 'FW'), ('PRESIDENT', 'NNP'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('Mr', 'NNP'), ('Speaker', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), ('distinguished', 'VBN'), ('guests', 'NNS'), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), ('Today', 'NNP'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), ('graceful', 'JJ'), ('courageous', 'JJ'), ('woman',

KeyboardInterrupt: 

A problem with twitter analysis is that people don't punctuate. eg the names are written in small letters.

# Chunking

 we can do what is called chunking, and group words into hopefully meaningful chunks. One of the main goals of chunking is to group into what are known as "noun phrases." These are phrases of one or more words that contain a noun, maybe some descriptive words, maybe a verb, and maybe something like an adverb. The idea is to group nouns with the words that are in relation to them.

In order to chunk, we combine the part of speech tags with regular expressions. Mainly from regular expressions, we are going to utilize the following:

In [44]:
def chunk(document):
    try:
        for i in document:
            words = nltk.word_tokenize(document)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)
            
        
    except Exception as e:
        print(str(e))

In [45]:
chunk(letters_only)

(Chunk PRESIDENT/NNP GEORGE/NNP W/NNP BUSH/NNP S/NNP ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
(Chunk THE/NNP UNION/NNP)
(Chunk PRESIDENT/NNP Thank/NNP)
(Chunk Mr/NNP Speaker/NNP Vice/NNP President/NNP Cheney/NNP)
(Chunk Congress/NNP)
(Chunk Supreme/NNP Court/NNP)
(Chunk Today/NNP)
(Chunk called/VBD America/NNP)
(Chunk Tonight/NNP)
(Chunk
  Coretta/NNP
  Scott/NNP
  King/NNP
  Applause/NNP
  President/NNP
  George/NNP
  W/NNP
  Bush/NNP)
(Chunk State/NNP)
(Chunk Union/NNP Address/NNP)
(Chunk Capitol/NNP Tuesday/NNP Jan/NNP White/NNP House/NNP photo/NN)
(Chunk Eric/NNP DraperEvery/NNP time/NN)
(Chunk Capitol/NNP dome/NN)
(Chunk have/VBP served/VBN America/NNP)
(Chunk Tonight/NNP)
(Chunk Union/NNP)
(Chunk Applause/NNP)
(Chunk United/NNP)
(Chunk America/NNP)
(Chunk lead/VB Applause/NNP Abroad/NNP)
(Chunk America/NNP)
(Chunk September/NNP)
(Chunk Dictatorships/NNP shelter/NN)
(Chunk Democracies/NNP)
(Chunk Every/NNP step/NN)
(Chu

(Chunk re/VBP helping/VBG Iraqis/NNP)
(Chunk be/VB marginalized/VBN Second/NNP)
(Chunk Iraqi/NNP government/NN)
(Chunk Iraqis/NNP)
(Chunk Iraqis/NNP)
(Chunk Applause/NNP)
(Chunk Iraq/NNP)
(Chunk Iraqi/NNP security/NN)
(Chunk Iraqi/NNP)
(Chunk Fellow/NNP)
(Chunk Iraqi/NNP)
(Chunk Washington/NNP D/NNP C/NNP Applause/NNP)
(Chunk Iraq/NNP)
(Chunk Congress/NNP)
(Chunk Applause/NNP Hindsight/NNP)
(Chunk Applause/NNP)
(Chunk A/NNP)
(Chunk Iraq/NNP)
(Chunk Iraqi/NNP)
(Chunk Laden/NNP)
(Chunk Zarqawi/NNP)
(Chunk America/NNP)
(Chunk Congress/NNP)
(Chunk Applause/NNP Laura/NNP Bush/NNP)
(Chunk is/VBZ introduced/VBN Tuesday/NNP)
(Chunk evening/VBG Jan/NNP)
(Chunk State/NNP)
(Chunk Union/NNP Address/NNP)
(Chunk United/NNP)
(Chunk Capitol/NNP)
(Chunk Washington/NNP White/NNP House/NNP photo/NN)
(Chunk Eric/NNP Draper/NNP)
(Chunk Marine/NNP Staff/NNP Sergeant/NNP Dan/NNP Clay/NNP)
(Chunk Fallujah/NNP)
(Chunk American/NNP)
(Chunk Dan/NNP)
(Chunk Never/NNP falter/NN)
(Chunk Don/NNP t/NN)
(Chunk protect

(Chunk State/NNP)
(Chunk Union/NNP Address/NNP)
(Chunk Capitol/NNP Tuesday/NNP Jan/NNP White/NNP House/NNP photo/NN)
(Chunk Eric/NNP DraperEvery/NNP time/NN)
(Chunk Capitol/NNP dome/NN)
(Chunk have/VBP served/VBN America/NNP)
(Chunk Tonight/NNP)
(Chunk Union/NNP)
(Chunk Applause/NNP)
(Chunk United/NNP)
(Chunk America/NNP)
(Chunk lead/VB Applause/NNP Abroad/NNP)
(Chunk America/NNP)
(Chunk September/NNP)
(Chunk Dictatorships/NNP shelter/NN)
(Chunk Democracies/NNP)
(Chunk Every/NNP step/NN)
(Chunk Applause/NNP Far/NNP)
(Chunk Afghanistan/NNP)
(Chunk Iraqis/NNP)
(Chunk Lebanon/NNP)
(Chunk Egypt/NNP)
(Chunk Syria/NNP)
(Chunk Burma/NNP Zimbabwe/NNP North/NNP Korea/NNP)
(Chunk Iran/NNP)
(Chunk well/RB Applause/NNP President/NNP George/NNP W/NNP Bush/NNP)
(Chunk Union/NNP Address/NNP)
(Chunk Capitol/NNP Tuesday/NNP Jan/NNP White/NNP House/NNP photo/NN)
(Chunk Eric/NNP Draper/NNP No/NNP one/NN)
(Chunk Islam/NNP)
(Chunk Laden/NNP)
(Chunk Middle/NNP East/NNP)
(Chunk Iraq/NNP)
(Chunk America/NNP)


(Chunk Social/NNP Security/NNP Medicare/NNP)
(Chunk Medicaid/NNP)
(Chunk Congress/NNP)
(Chunk save/VB Social/NNP Security/NNP applause/NN)
(Chunk away/RB Applause/NNP)
(Chunk So/NNP)
(Chunk Social/NNP Security/NNP Medicare/NNP)
(Chunk Medicaid/NNP)
(Chunk Congress/NNP)
(Chunk solved/VBD Applause/NNP Keeping/NNP America/NNP)
(Chunk America/NNP)
(Chunk Applause/NNP Keeping/NNP America/NNP)
(Chunk Applause/NNP)
(Chunk Applause/NNP Keeping/NNP America/NNP)
(Chunk Applause/NNP)
(Chunk Applause/NNP)
(Chunk Applause/NNP)
(Chunk OB/NNP GYN/NNP)
(Chunk Congress/NNP)
(Chunk Applause/NNP Keeping/NNP America/NNP)
(Chunk America/NNP)
(Chunk Advanced/NNP Energy/NNP Initiative/NNP)
(Chunk Department/NNP)
(Chunk Energy/NNP)
(Chunk Applause/NNP Breakthroughs/NNP)
(Chunk Middle/NNP East/NNP)
(Chunk Applause/NNP)
(Chunk America/NNP)
(Chunk Middle/NNP Eastern/NNP oil/NN)
(Chunk Applause/NNP)
(Chunk keep/VB America/NNP)
(Chunk Tonight/NNP)
(Chunk Competitiveness/NNP Initiative/NNP)
(Chunk Applause/NNP Firs

(Chunk Iraqi/NNP government/NN)
(Chunk Iraqis/NNP)
(Chunk Iraqis/NNP)
(Chunk Applause/NNP)
(Chunk Iraq/NNP)
(Chunk Iraqi/NNP security/NN)
(Chunk Iraqi/NNP)
(Chunk Fellow/NNP)
(Chunk Iraqi/NNP)
(Chunk Washington/NNP D/NNP C/NNP Applause/NNP)
(Chunk Iraq/NNP)
(Chunk Congress/NNP)
(Chunk Applause/NNP Hindsight/NNP)
(Chunk Applause/NNP)
(Chunk A/NNP)
(Chunk Iraq/NNP)
(Chunk Iraqi/NNP)
(Chunk Laden/NNP)
(Chunk Zarqawi/NNP)
(Chunk America/NNP)
(Chunk Congress/NNP)
(Chunk Applause/NNP Laura/NNP Bush/NNP)
(Chunk is/VBZ introduced/VBN Tuesday/NNP)
(Chunk evening/VBG Jan/NNP)
(Chunk State/NNP)
(Chunk Union/NNP Address/NNP)
(Chunk United/NNP)
(Chunk Capitol/NNP)
(Chunk Washington/NNP White/NNP House/NNP photo/NN)
(Chunk Eric/NNP Draper/NNP)
(Chunk Marine/NNP Staff/NNP Sergeant/NNP Dan/NNP Clay/NNP)
(Chunk Fallujah/NNP)
(Chunk American/NNP)
(Chunk Dan/NNP)
(Chunk Never/NNP falter/NN)
(Chunk Don/NNP t/NN)
(Chunk protecting/VBG Staff/NNP Sergeant/NNP Dan/NNP Clay/NNP)
(Chunk Lisa/NNP)
(Chunk Sara/NN

(Chunk State/NNP)
(Chunk Union/NNP Address/NNP)
(Chunk Capitol/NNP Tuesday/NNP Jan/NNP White/NNP House/NNP photo/NN)
(Chunk Eric/NNP DraperEvery/NNP time/NN)
(Chunk Capitol/NNP dome/NN)
(Chunk have/VBP served/VBN America/NNP)
(Chunk Tonight/NNP)
(Chunk Union/NNP)
(Chunk Applause/NNP)
(Chunk United/NNP)
(Chunk America/NNP)
(Chunk lead/VB Applause/NNP Abroad/NNP)
(Chunk America/NNP)
(Chunk September/NNP)
(Chunk Dictatorships/NNP shelter/NN)
(Chunk Democracies/NNP)
(Chunk Every/NNP step/NN)
(Chunk Applause/NNP Far/NNP)
(Chunk Afghanistan/NNP)
(Chunk Iraqis/NNP)
(Chunk Lebanon/NNP)
(Chunk Egypt/NNP)
(Chunk Syria/NNP)
(Chunk Burma/NNP Zimbabwe/NNP North/NNP Korea/NNP)
(Chunk Iran/NNP)
(Chunk well/RB Applause/NNP President/NNP George/NNP W/NNP Bush/NNP)
(Chunk Union/NNP Address/NNP)
(Chunk Capitol/NNP Tuesday/NNP Jan/NNP White/NNP House/NNP photo/NN)
(Chunk Eric/NNP Draper/NNP No/NNP one/NN)
(Chunk Islam/NNP)
(Chunk Laden/NNP)
(Chunk Middle/NNP East/NNP)
(Chunk Iraq/NNP)
(Chunk America/NNP)


KeyboardInterrupt: 

# Chinking

Chinking is a lot like chunking, it is basically a way for you to remove a chunk from a chunk. The chunk that you remove from your chunk is your chink.

In [47]:
def chink(document):
    try:
        for i in document:
            words = nltk.word_tokenize(document)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT>+{"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
            
        
    except Exception as e:
        print(str(e))

In [50]:
chink(letters_only)

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W/NNP BUSH/NNP S/NNP ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP)
  n/IN
  (Chunk nJanuary/JJ n/JJ nTHE/FW PRESIDENT/NNP Thank/NNP you/PRP)
  all/DT
  (Chunk
    Mr/NNP
    Speaker/NNP
    Vice/NNP
    President/NNP
    Cheney/NNP
    members/NNS)
  of/IN
  (Chunk Congress/NNP members/NNS)
  of/IN
  the/DT
  (Chunk Supreme/NNP Court/NNP and/CC diplomatic/JJ corps/NN)
  distinguished/VBN
  (Chunk
    guests/NNS
    and/CC
    fellow/JJ
    citizens/NNS
    Today/NNP
    our/PRP$
    nation/NN)
  lost/VBD
  a/DT
  beloved/VBN
  (Chunk graceful/JJ courageous/JJ woman/NN who/WP)
  called/VBD
  (Chunk America/NNP to/TO its/PRP$ founding/NN ideals/NNS and/CC)
  carried/VBD
  on/IN
  a/DT
  (Chunk noble/JJ dream/NN Tonight/NNP we/PRP)
  are/VBP
  comforted/VBN
  by/IN
  the/DT
  (Chunk hope/NN)
  of/IN
  a/DT
  (Chunk glad/JJ reunion/NN)
  

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W/NNP BUSH/NNP S/NNP ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP)
  n/IN
  (Chunk nJanuary/JJ n/JJ nTHE/FW PRESIDENT/NNP Thank/NNP you/PRP)
  all/DT
  (Chunk
    Mr/NNP
    Speaker/NNP
    Vice/NNP
    President/NNP
    Cheney/NNP
    members/NNS)
  of/IN
  (Chunk Congress/NNP members/NNS)
  of/IN
  the/DT
  (Chunk Supreme/NNP Court/NNP and/CC diplomatic/JJ corps/NN)
  distinguished/VBN
  (Chunk
    guests/NNS
    and/CC
    fellow/JJ
    citizens/NNS
    Today/NNP
    our/PRP$
    nation/NN)
  lost/VBD
  a/DT
  beloved/VBN
  (Chunk graceful/JJ courageous/JJ woman/NN who/WP)
  called/VBD
  (Chunk America/NNP to/TO its/PRP$ founding/NN ideals/NNS and/CC)
  carried/VBD
  on/IN
  a/DT
  (Chunk noble/JJ dream/NN Tonight/NNP we/PRP)
  are/VBP
  comforted/VBN
  by/IN
  the/DT
  (Chunk hope/NN)
  of/IN
  a/DT
  (Chunk glad/JJ reunion/NN)
  

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W/NNP BUSH/NNP S/NNP ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP)
  n/IN
  (Chunk nJanuary/JJ n/JJ nTHE/FW PRESIDENT/NNP Thank/NNP you/PRP)
  all/DT
  (Chunk
    Mr/NNP
    Speaker/NNP
    Vice/NNP
    President/NNP
    Cheney/NNP
    members/NNS)
  of/IN
  (Chunk Congress/NNP members/NNS)
  of/IN
  the/DT
  (Chunk Supreme/NNP Court/NNP and/CC diplomatic/JJ corps/NN)
  distinguished/VBN
  (Chunk
    guests/NNS
    and/CC
    fellow/JJ
    citizens/NNS
    Today/NNP
    our/PRP$
    nation/NN)
  lost/VBD
  a/DT
  beloved/VBN
  (Chunk graceful/JJ courageous/JJ woman/NN who/WP)
  called/VBD
  (Chunk America/NNP to/TO its/PRP$ founding/NN ideals/NNS and/CC)
  carried/VBD
  on/IN
  a/DT
  (Chunk noble/JJ dream/NN Tonight/NNP we/PRP)
  are/VBP
  comforted/VBN
  by/IN
  the/DT
  (Chunk hope/NN)
  of/IN
  a/DT
  (Chunk glad/JJ reunion/NN)
  

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W/NNP BUSH/NNP S/NNP ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP)
  n/IN
  (Chunk nJanuary/JJ n/JJ nTHE/FW PRESIDENT/NNP Thank/NNP you/PRP)
  all/DT
  (Chunk
    Mr/NNP
    Speaker/NNP
    Vice/NNP
    President/NNP
    Cheney/NNP
    members/NNS)
  of/IN
  (Chunk Congress/NNP members/NNS)
  of/IN
  the/DT
  (Chunk Supreme/NNP Court/NNP and/CC diplomatic/JJ corps/NN)
  distinguished/VBN
  (Chunk
    guests/NNS
    and/CC
    fellow/JJ
    citizens/NNS
    Today/NNP
    our/PRP$
    nation/NN)
  lost/VBD
  a/DT
  beloved/VBN
  (Chunk graceful/JJ courageous/JJ woman/NN who/WP)
  called/VBD
  (Chunk America/NNP to/TO its/PRP$ founding/NN ideals/NNS and/CC)
  carried/VBD
  on/IN
  a/DT
  (Chunk noble/JJ dream/NN Tonight/NNP we/PRP)
  are/VBP
  comforted/VBN
  by/IN
  the/DT
  (Chunk hope/NN)
  of/IN
  a/DT
  (Chunk glad/JJ reunion/NN)
  

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W/NNP BUSH/NNP S/NNP ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP)
  n/IN
  (Chunk nJanuary/JJ n/JJ nTHE/FW PRESIDENT/NNP Thank/NNP you/PRP)
  all/DT
  (Chunk
    Mr/NNP
    Speaker/NNP
    Vice/NNP
    President/NNP
    Cheney/NNP
    members/NNS)
  of/IN
  (Chunk Congress/NNP members/NNS)
  of/IN
  the/DT
  (Chunk Supreme/NNP Court/NNP and/CC diplomatic/JJ corps/NN)
  distinguished/VBN
  (Chunk
    guests/NNS
    and/CC
    fellow/JJ
    citizens/NNS
    Today/NNP
    our/PRP$
    nation/NN)
  lost/VBD
  a/DT
  beloved/VBN
  (Chunk graceful/JJ courageous/JJ woman/NN who/WP)
  called/VBD
  (Chunk America/NNP to/TO its/PRP$ founding/NN ideals/NNS and/CC)
  carried/VBD
  on/IN
  a/DT
  (Chunk noble/JJ dream/NN Tonight/NNP we/PRP)
  are/VBP
  comforted/VBN
  by/IN
  the/DT
  (Chunk hope/NN)
  of/IN
  a/DT
  (Chunk glad/JJ reunion/NN)
  

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W/NNP BUSH/NNP S/NNP ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP)
  n/IN
  (Chunk nJanuary/JJ n/JJ nTHE/FW PRESIDENT/NNP Thank/NNP you/PRP)
  all/DT
  (Chunk
    Mr/NNP
    Speaker/NNP
    Vice/NNP
    President/NNP
    Cheney/NNP
    members/NNS)
  of/IN
  (Chunk Congress/NNP members/NNS)
  of/IN
  the/DT
  (Chunk Supreme/NNP Court/NNP and/CC diplomatic/JJ corps/NN)
  distinguished/VBN
  (Chunk
    guests/NNS
    and/CC
    fellow/JJ
    citizens/NNS
    Today/NNP
    our/PRP$
    nation/NN)
  lost/VBD
  a/DT
  beloved/VBN
  (Chunk graceful/JJ courageous/JJ woman/NN who/WP)
  called/VBD
  (Chunk America/NNP to/TO its/PRP$ founding/NN ideals/NNS and/CC)
  carried/VBD
  on/IN
  a/DT
  (Chunk noble/JJ dream/NN Tonight/NNP we/PRP)
  are/VBP
  comforted/VBN
  by/IN
  the/DT
  (Chunk hope/NN)
  of/IN
  a/DT
  (Chunk glad/JJ reunion/NN)
  

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W/NNP BUSH/NNP S/NNP ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP)
  n/IN
  (Chunk nJanuary/JJ n/JJ nTHE/FW PRESIDENT/NNP Thank/NNP you/PRP)
  all/DT
  (Chunk
    Mr/NNP
    Speaker/NNP
    Vice/NNP
    President/NNP
    Cheney/NNP
    members/NNS)
  of/IN
  (Chunk Congress/NNP members/NNS)
  of/IN
  the/DT
  (Chunk Supreme/NNP Court/NNP and/CC diplomatic/JJ corps/NN)
  distinguished/VBN
  (Chunk
    guests/NNS
    and/CC
    fellow/JJ
    citizens/NNS
    Today/NNP
    our/PRP$
    nation/NN)
  lost/VBD
  a/DT
  beloved/VBN
  (Chunk graceful/JJ courageous/JJ woman/NN who/WP)
  called/VBD
  (Chunk America/NNP to/TO its/PRP$ founding/NN ideals/NNS and/CC)
  carried/VBD
  on/IN
  a/DT
  (Chunk noble/JJ dream/NN Tonight/NNP we/PRP)
  are/VBP
  comforted/VBN
  by/IN
  the/DT
  (Chunk hope/NN)
  of/IN
  a/DT
  (Chunk glad/JJ reunion/NN)
  

KeyboardInterrupt: 

 ## IMP
    In part of speech tag we provide a tag to each names but we still end up the problem of not getting a word for eg George W.BUsh together. So we use chunking 

# Named Entity Recognition

One of the most major forms of chunking in natural language processing is called "Named Entity Recognition." The idea is to have the machine immediately be able to pull out "entities" like people, places, things, locations, monetary figures, and more.

This can be a bit of a challenge, but NLTK is this built in for us. There are two major options with NLTK's named entity recognition: either recognize all named entities, or recognize named entities as their respective type, like people, places, locations, etc.

In [52]:
def NamedEntity(document):
    try:
        for i in document:
            words = nltk.word_tokenize(document)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged,binary=True)
            for subtree in namedEnt.subtrees(filter=lambda t: t.label() == 'NE'):
                print(subtree)
            
           
        
    except Exception as e:
        print(str(e))

In [53]:
NamedEntity(letters_only)

(NE GEORGE/NNP)
(NE THE/NNP)
(NE CONGRESS/NNP)
(NE THE/NNP UNION/NNP)
(NE nJanuary/JJ)
(NE Cheney/NNP)
(NE Congress/NNP)
(NE Supreme/NNP Court/NNP)
(NE America/NNP)
(NE Coretta/NNP Scott/NNP King/NNP Applause/NNP)
(NE George/NNP)
(NE Union/NNP Address/NNP)
(NE Capitol/NNP)
(NE White/NNP House/NNP)
(NE Eric/NNP)
(NE America/NNP)
(NE United/NNP States/NNPS)
(NE America/NNP)
(NE Applause/NNP Abroad/NNP)
(NE America/NNP)
(NE Applause/NNP Far/NNP)
(NE Afghanistan/NNP)
(NE Iraqis/NNP)
(NE Lebanon/NNP)
(NE Egypt/NNP)
(NE Syria/NNP)
(NE Burma/NNP Zimbabwe/NNP North/NNP Korea/NNP)
(NE Iran/NNP)
(NE Applause/NNP)
(NE George/NNP)
(NE Union/NNP Address/NNP)
(NE Capitol/NNP)
(NE White/NNP House/NNP)
(NE Eric/NNP Draper/NNP No/NNP)
(NE Middle/NNP East/NNP)
(NE Iraq/NNP)
(NE America/NNP)
(NE Beslan/NNP)
(NE London/NNP)
(NE United/NNP States/NNPS)
(NE Applause/NNP America/NNP)
(NE Europe/NNP)
(NE George/NNP)
(NE Congress/NNP)
(NE Union/NNP Address/NNP)
(NE Capitol/NNP)
(NE White/NNP House/NNP)
(NE Eri

KeyboardInterrupt: 

# Lemmatizing

A very similar operation to stemming is called lemmatizing. The major difference between these is, as you saw earlier, stemming can often create non-existent words, whereas lemmas are actual words.

In [54]:
from nltk.stem import WordNetLemmatizer

In [55]:
lem = WordNetLemmatizer()

In [56]:
print(lem.lemmatize("cat"))
print(lem.lemmatize("cacti"))
print(lem.lemmatize("geeze"))
print(lem.lemmatize("rocks"))
print(lem.lemmatize("pythonli"))
print(lem.lemmatize("better"))

cat
cactus
geeze
rock
pythonli
better


In [57]:
print(lem.lemmatize("better",pos="a"))
print(lem.lemmatize("best",pos="a"))
print(lem.lemmatize("run"))


good
best
run


# Corpara

In [58]:
print(nltk.__file__)

C:\Users\aswat\Anaconda3\lib\site-packages\nltk\__init__.py


# WordNet

WordNet is a lexical database for the English language, which was created by Princeton, and is part of the NLTK corpus.

You can use WordNet alongside the NLTK module to find the meanings of words, synonyms, antonyms, and more. Let's cover some examples.

In [59]:
from nltk.corpus import wordnet

In [60]:
syns = wordnet.synsets("program")

In [62]:
print(syns)

[Synset('plan.n.01'), Synset('program.n.02'), Synset('broadcast.n.02'), Synset('platform.n.02'), Synset('program.n.05'), Synset('course_of_study.n.01'), Synset('program.n.07'), Synset('program.n.08'), Synset('program.v.01'), Synset('program.v.02')]


In [63]:
print(syns[0].lemmas())

[Lemma('plan.n.01.plan'), Lemma('plan.n.01.program'), Lemma('plan.n.01.programme')]


In [64]:
print(syns[0].lemmas()[0].name())

plan


In [66]:
#definition
print(syns[0].definition())

a series of steps to be carried out or goals to be accomplished


In [68]:
#examples
print(syns[0].examples())

['they drew up a six-step plan', 'they discussed plans for a new bond issue']


In [69]:
synonyms = []
antonyms = []

for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

{'unspoiled', 'serious', 'practiced', 'right', 'estimable', 'undecomposed', 'adept', 'ripe', 'thoroughly', 'near', 'well', 'expert', 'honorable', 'sound', 'safe', 'secure', 'skilful', 'honest', 'good', 'salutary', 'proficient', 'commodity', 'trade_good', 'skillful', 'full', 'dependable', 'beneficial', 'effective', 'soundly', 'in_force', 'upright', 'respectable', 'dear', 'unspoilt', 'goodness', 'just', 'in_effect'}
{'evil', 'bad', 'ill', 'badness', 'evilness'}


In [70]:
#SIMILARITY

In [71]:
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('boat.n.01')
print(w1.wup_similarity(w2))

0.9090909090909091


Next, we can also easily use WordNet to compare the similarity of two words and their tenses, by incorporating the Wu and Palmer method for semantic related-ness