# NLTK

In this notebook, we will learn about the Python Natural Language Toolkit (NLTK) library.

**Table of Contents**  

1. [Install and download NLTK](#sec1)
2. [Tokenization](#sec2)
3. [Stop word removal](#sec3)
4. [Stemming](#sec4)
5. [Parts of speech tagging](#sec5)
6. [Frequency distribution](#sec6)
7. [Named entity recognition](#sec7)
8. [WordNet](#sec8)

<a id="sec1"></a> 
## **Install and download NLTK**

In [14]:
import nltk  # load the NLTK library

In [15]:
nltk.download('all')  # download every NLTK data package (tokenizers, stop word lists, taggers, ...); this can take a while

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\jessi\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\jessi\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\jessi\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\jessi\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\jessi\AppData\Roaming\nltk_data...

True
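
Downloading `'all'` fetches far more data than this notebook uses. As a rough sketch, the cell below checks which packages are missing and downloads only those. The package names and resource paths in `REQUIRED` are assumptions that differ between NLTK releases (older versions use `punkt` where newer ones use `punkt_tab`, for example), and `missing_packages` and `download_missing` are hypothetical helper names, not part of NLTK.

```python
import nltk

# Assumption: package name -> resource path, as used by recent NLTK releases.
REQUIRED = {
    "punkt_tab": "tokenizers/punkt_tab",
    "stopwords": "corpora/stopwords",
    "wordnet": "corpora/wordnet",
}


def missing_packages(required):
    """Return the package names whose resource path NLTK cannot find locally."""
    missing = []
    for package, path in required.items():
        try:
            nltk.data.find(path)  # raises LookupError when the resource is absent
        except LookupError:
            missing.append(package)
    return missing


def download_missing(required, downloader=nltk.download):
    """Call the downloader once for every package that is not installed yet."""
    for package in missing_packages(required):
        downloader(package)


print(missing_packages(REQUIRED))  # packages that would still need downloading
# download_missing(REQUIRED)       # uncomment to fetch them
```

The `downloader` parameter exists only so the helper can be tried without network access; by default it is `nltk.download`.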

<a id="sec2"></a> 
## **Tokenization**

In [16]:
from nltk.tokenize import word_tokenize

In [17]:
sentence = "Hello, how are you doing today?"
tokens = word_tokenize(sentence)  # split the sentence into word and punctuation tokens
print(tokens)

['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']
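
To see why a tokenizer is used instead of `str.split`, the sketch below compares whitespace splitting with `wordpunct_tokenize`. That function is not used in the cells above; it is a regex-based NLTK tokenizer chosen here because it needs no downloaded model. On this sentence it yields the same tokens as the `word_tokenize` output above.

```python
from nltk.tokenize import wordpunct_tokenize

sentence = "Hello, how are you doing today?"

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = sentence.split()

# A tokenizer emits punctuation marks as separate tokens.
regex_tokens = wordpunct_tokenize(sentence)

print(whitespace_tokens)  # ['Hello,', 'how', 'are', 'you', 'doing', 'today?']
print(regex_tokens)       # ['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']
```

The two NLTK tokenizers do not always agree: `word_tokenize` splits "don't" into `do` and `n't`, while `wordpunct_tokenize` yields `don`, `'`, and `t`.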



<a id="sec3"></a> 
## **Stop word removal**

In [18]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
for word in stop_words:
    print(word)  # print each word in NLTK's English stop word list

that
were
under
she
do
was
wasn
o
him
on
after
below
having
ve
when
y
any
before
shouldn't
you're
it
to
we'd
so
you
your
most
isn't
those
who
i'm
or
hasn't
whom
off
there
mustn't
he
up
been
yourself
their
itself
but
s
aren
didn't
my
few
where
doing
doesn't
you'd
for
he's
is
ain
into
what
yourselves
such
i'll
over
they'll
i'd
being
hadn't
wouldn't
ll
them
i've
with
won't
at
about
myself
don't
now
each
needn't
between
t
they've
in
very
herself
weren't
be
isn
we
other
couldn't
own
nor
shan
just
he'd
its
of
only
than
will
an
above
that'll
they're
again
which
against
because
wouldn
it's
we'll
some
you've
she'll
hers
it'd
re
you'll
and
did
a
d
can
our
until
don
hadn
it'll
haven't
the
mightn't
himself
from
here
am
too
once
ours
should
during
couldn
does
doesn
he'll
mustn
needn
we're
weren
why
won
by
didn
ma
no
not
all
shouldn
should've
m
mightn
she'd
as
her
how
i
then
aren't
further
these
they
theirs
themselves
this
hasn
have
we've
down
haven
if
ourselves
more
out
shan't
has
through
same
your
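
Listing the stop words is only half of stop word removal; the tokens still have to be filtered against the list. A minimal sketch of that step follows. `remove_stop_words` is a hypothetical helper name, and `sample_stop_words` is a small stand-in set so the cell runs on its own; in this notebook the full `stop_words` set would be passed instead.

```python
def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


sample_tokens = ['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']
sample_stop_words = {"how", "are", "you", "doing"}  # stand-in for the full list

print(remove_stop_words(sample_tokens, sample_stop_words))
# ['Hello', ',', 'today', '?']
```

Lowercasing before the lookup matters because the stop word list is all lowercase, so a capitalized "How" would otherwise slip through. Punctuation tokens are kept; they would need a separate filter such as `str.isalpha`.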

<a id="sec4"></a> 
## **Stemming**

In [19]:
from nltk.stem import PorterStemmer

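In [20]:
# Drop stop words from the tokenized sentence; later cells reuse filtered_tokens
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]  # reduce each token to its stem
print(stemmed_tokens)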
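
To illustrate what the Porter stemmer does, the sketch below stems a few related word forms. The word list is an illustrative choice, not data from this notebook.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connect", "connected", "connecting", "connection", "running", "studies"]

# Map each word to its stem; related forms collapse to the same stem.
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
# connect -> connect
# connected -> connect
# connecting -> connect
# connection -> connect
# running -> run
# studies -> studi
```

A stem is not always a dictionary word ("studi"), because the stemmer strips suffixes by rule rather than looking words up.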

<a id="sec5"></a> 
## **Parts of Speech Tagging**

In [None]:
from nltk import pos_tag

In [None]:
pos_tags = pos_tag(filtered_tokens)
print(pos_tags)
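
`pos_tag` returns a list of `(token, tag)` pairs. To show that output shape without a trained model, the sketch below uses NLTK's rule-based `RegexpTagger` as a stand-in. This is not the tagger `pos_tag` uses, and the patterns and sample tokens are toy assumptions that mislabel many words (every unmatched token becomes `NN`).

```python
from nltk.tag import RegexpTagger

# Toy rules, tried in order; the first pattern that matches decides the tag.
patterns = [
    (r".*ing$", "VBG"),      # gerunds
    (r".*ed$", "VBD"),       # simple past
    (r"^-?[0-9]+$", "CD"),   # cardinal numbers
    (r".*s$", "NNS"),        # plural nouns
    (r".*", "NN"),           # fallback: tag everything else as a noun
]
tagger = RegexpTagger(patterns)

sample_tokens = ["She", "walked", "to", "2", "meetings", "while", "talking"]
tagged = tagger.tag(sample_tokens)
print(tagged)

# The (token, tag) pairs can be filtered by tag prefix, e.g. all verb forms.
verbs = [token for token, tag in tagged if tag.startswith("VB")]
print(verbs)  # ['walked', 'talking']
```

The same filtering works unchanged on the output of `pos_tag`, which has the same structure.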

<a id="sec6"></a> 
## **Frequency Distribution**

In [21]:
from nltk import FreqDist

In [22]:
# filtered_tokens is the tokenized sentence with stop words removed
fdist = FreqDist(filtered_tokens)  # count how often each token occurs
fdist.plot(10)  # plot the 10 most frequent tokens (requires matplotlib)
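
A `FreqDist` can also be inspected without plotting. The sketch below uses a made-up sample sentence (an assumption for illustration) to show the most common lookups.

```python
from nltk import FreqDist

sample_tokens = "the cat sat on the mat and the dog sat too".split()
fdist = FreqDist(sample_tokens)

print(fdist.most_common(2))  # [('the', 3), ('sat', 2)]
print(fdist["cat"])          # 1
print(fdist["unicorn"])      # 0, unseen tokens count as zero
print(fdist.N())             # 11, the total number of tokens counted
print(fdist.freq("the"))     # relative frequency: 3 out of 11 tokens
```

`FreqDist` behaves like `collections.Counter`, so the usual dictionary-style access works on it.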

<a id="sec7"></a> 
## **Named Entity Recognition**

In [23]:
from nltk import ne_chunk
from nltk.tree import Tree

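In [25]:
# pos_tags is the list of (token, tag) pairs from the Parts of Speech section
named_entities = ne_chunk(pos_tags)  # group tagged tokens into named-entity subtrees
named_entities.draw()  # opens a tree diagram in a separate window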
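
`ne_chunk` returns an `nltk.tree.Tree` in which each named entity is a labelled subtree, so the entities can be read out in code instead of drawn. The sketch below walks such a tree. `extract_entities` is a hypothetical helper, and `sample_tree` is built by hand in the same shape as chunker output so the cell runs without the chunker models; it is not real `ne_chunk` output.

```python
from nltk.tree import Tree


def extract_entities(chunked):
    """Return (entity text, entity label) pairs from a chunked sentence tree."""
    entities = []
    for child in chunked:
        # Entities are subtrees; ordinary tokens are plain (token, tag) tuples.
        if isinstance(child, Tree):
            text = " ".join(token for token, tag in child.leaves())
            entities.append((text, child.label()))
    return entities


# Hand-built stand-in for a chunked sentence (assumption, for illustration).
sample_tree = Tree("S", [
    Tree("PERSON", [("Ada", "NNP"), ("Lovelace", "NNP")]),
    ("lived", "VBD"),
    ("in", "IN"),
    Tree("GPE", [("London", "NNP")]),
    (".", "."),
])

print(extract_entities(sample_tree))
# [('Ada Lovelace', 'PERSON'), ('London', 'GPE')]
```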

<a id="sec8"></a> 
## **WordNet**

In [None]:
from nltk.corpus import wordnet as wn

In [None]:
synsets = wn.synsets("good")  # every synset (word sense) that includes "good"
for synset in synsets:
    print(synset)

Synset('good.n.01')
Synset('good.n.02')
Synset('good.n.03')
Synset('commodity.n.01')
Synset('good.a.01')
Synset('full.s.06')
Synset('good.a.03')
Synset('estimable.s.02')
Synset('beneficial.s.01')
Synset('good.s.06')
Synset('good.s.07')
Synset('adept.s.01')
Synset('good.s.09')
Synset('dear.s.02')
Synset('dependable.s.04')
Synset('good.s.12')
Synset('good.s.13')
Synset('effective.s.04')
Synset('good.s.15')
Synset('good.s.16')
Synset('good.s.17')
Synset('good.s.18')
Synset('good.s.19')
Synset('good.s.20')
Synset('good.s.21')
Synset('well.r.01')
Synset('thoroughly.r.02')


In [26]:
syn = wn.synsets("good")[0]
print(syn.name())         # 'good.n.01'
print(syn.definition())   # 'benefit'
print(syn.examples())     # ['for your own good', "what's the good of worrying?"]
print(syn.lemmas())       # [Lemma('good.n.01.good')]

good.n.01
benefit
['for your own good', "what's the good of worrying?"]
[Lemma('good.n.01.good')]
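
A synset groups the lemmas that share one sense, so the synonyms of a word are the lemma names collected across its synsets. The sketch below gathers them. `lemma_names` is a hypothetical helper name, and the `try`/`except` is there only so the cell still runs when the WordNet data has not been downloaded.

```python
from nltk.corpus import wordnet as wn


def lemma_names(synsets):
    """Collect the distinct lemma names across synsets, sorted alphabetically."""
    names = set()
    for synset in synsets:
        for lemma in synset.lemmas():
            # WordNet joins multiword lemmas with underscores.
            names.add(lemma.name().replace("_", " "))
    return sorted(names)


try:
    # Restrict the lookup to noun senses of "good".
    print(lemma_names(wn.synsets("good", pos=wn.NOUN)))
except LookupError:
    print("WordNet data not found; run nltk.download('wordnet') first")
```

Passing `pos` narrows the result to one part of speech; without it, the noun, adjective, and adverb senses listed above are all included.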
