# 1. Tokenizing Words and Sentences with NLTK

In [1]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

In [2]:
example_text = "NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum."

In [3]:
sentences = sent_tokenize(example_text)
for index, sentence in enumerate(sentences):
    print(index, '-->', sentence)

0 --> NLTK is a leading platform for building Python programs to work with human language data.
1 --> It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.


In [4]:
words = word_tokenize(example_text)
for word in words:
    print(word)

NLTK
is
a
leading
platform
for
building
Python
programs
to
work
with
human
language
data
.
It
provides
easy-to-use
interfaces
to
over
50
corpora
and
lexical
resources
such
as
WordNet
,
along
with
a
suite
of
text
processing
libraries
for
classification
,
tokenization
,
stemming
,
tagging
,
parsing
,
and
semantic
reasoning
,
wrappers
for
industrial-strength
NLP
libraries
,
and
an
active
discussion
forum
.
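
To see why a dedicated tokenizer is used instead of plain whitespace splitting, the sketch below compares `str.split()` with a small regular-expression tokenizer on the first sentence of `example_text`. The pattern `TOKEN_PATTERN` is a hypothetical stand-in written for this illustration; it is not the algorithm `word_tokenize` uses internally, it only happens to produce the same tokens for this sentence.

```python
import re

sentence = ("NLTK is a leading platform for building Python programs "
            "to work with human language data.")

# Naive whitespace splitting leaves punctuation glued to the last word.
whitespace_tokens = sentence.split()

# Hypothetical pattern (an assumption, not NLTK's implementation):
# a word, optionally hyphenated, or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sentence)

print(whitespace_tokens[-1])   # last token from str.split()
print(regex_tokens[-2:])       # last two tokens from the regex tokenizer
```

Whitespace splitting yields 15 tokens ending in `data.`, while the regex version yields 16 tokens with the final period separated, which matches the `word_tokenize` output above for this sentence.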


# 2. Stop words with NLTK

In [5]:
from nltk.corpus import stopwords

In [6]:
stop_words = list(stopwords.words('english'))
len(stop_words)

153

In [7]:
words = word_tokenize(example_text)
print('Original len: ', len(words))

new_word = [word for word in words if word not in stop_words]
filtered_word = [word for word in words if word in stop_words]
print('After filter len: ', len(new_word))

print('new word: ')
print(new_word)

print('filtered_word: ')
print(filtered_word)

Original len:  66
After filter len:  48
new word: 
['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.', 'It', 'provides', 'easy-to-use', 'interfaces', '50', 'corpora', 'lexical', 'resources', 'WordNet', ',', 'along', 'suite', 'text', 'processing', 'libraries', 'classification', ',', 'tokenization', ',', 'stemming', ',', 'tagging', ',', 'parsing', ',', 'semantic', 'reasoning', ',', 'wrappers', 'industrial-strength', 'NLP', 'libraries', ',', 'active', 'discussion', 'forum', '.']
filtered_word: 
['is', 'a', 'for', 'to', 'with', 'to', 'over', 'and', 'such', 'as', 'with', 'a', 'of', 'for', 'and', 'for', 'and', 'an']
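
Note that `'It'` survives the filter above: the English stop word list is lowercase, and the membership test is case-sensitive. A minimal sketch of a case-insensitive variant follows. To keep it runnable without downloading any corpus, `stop_words` here is a small hand-picked set standing in for `stopwords.words('english')`; that subset, and the extra punctuation filter, are assumptions made for this illustration.

```python
# Hand-picked stand-in for set(stopwords.words('english')) (an assumption).
stop_words = {"is", "a", "it", "for", "to", "with"}

tokens = ["NLTK", "is", "a", "leading", "platform", ".",
          "It", "provides", "easy-to-use", "interfaces"]

# Case-sensitive test, as in the cell above: 'It' is kept.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercase each token before the lookup so 'It' is removed as well.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

# Optionally drop tokens that contain no letters or digits (pure punctuation).
content_words = [t for t in case_insensitive if any(ch.isalnum() for ch in t)]

print(case_sensitive)
print(case_insensitive)
print(content_words)
```

Using a `set` rather than a `list` for the stop words also makes each membership test constant-time, which matters on larger texts.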


# 3. Stemming words with NLTK

In [8]:
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

In [9]:
ps = PorterStemmer()
sbs = SnowballStemmer('english')

In [10]:
words = word_tokenize(example_text)
for w in words:
    print(ps.stem(w))

nltk
is
a
lead
platform
for
build
python
program
to
work
with
human
languag
data
.
It
provid
easy-to-us
interfac
to
over
50
corpora
and
lexic
resourc
such
as
wordnet
,
along
with
a
suit
of
text
process
librari
for
classif
,
token
,
stem
,
tag
,
pars
,
and
semant
reason
,
wrapper
for
industrial-strength
nlp
librari
,
and
an
activ
discuss
forum
.


In [11]:
stem_sample = ['reading','shopping', 'quickly', 'sliced']
for word in stem_sample:
    print(ps.stem(word))
    print(sbs.stem(word))

read
read
shop
shop
quickli
quick
slice
slice


# 4. Lemmatization

> Lemmatization is a more methodical way of converting all the grammatical/inflected forms of the root of the word. Lemmatization uses context and part of speech to determine the inflected form of the word and applies different normalization rules for each part of speech to get the root word (lemma).

In [12]:
from nltk.stem import WordNetLemmatizer
wlem = WordNetLemmatizer()
wlem.lemmatize("ate")

'ate'

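Why is `'ate'` returned unchanged? `lemmatize` treats its input as a noun unless told otherwise (its `pos` argument defaults to `'n'`), and "ate" has no noun lemma to reduce to. Passing the part of speech explicitly, `wlem.lemmatize("ate", pos="v")`, returns `'eat'`.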
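
Since the lemmatizer depends on the part of speech, one common pattern is to derive the `pos` argument from a Penn Treebank tag such as those produced by `nltk.pos_tag`. The helper below, `penn_to_wordnet`, is a hypothetical name introduced for this sketch; it returns the one-letter codes WordNet uses (`'a'`, `'v'`, `'r'`, `'n'`) and falls back to noun, mirroring the lemmatizer's own default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet one-letter POS code."""
    if tag.startswith("JJ"):
        return "a"   # adjective: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"   # verb: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"   # adverb: RB, RBR, RBS
    return "n"       # everything else is treated as a noun


for tag in ["VBD", "NNS", "JJ", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))

# Intended use with the lemmatizer defined above (requires the WordNet corpus):
#   wlem.lemmatize("ate", pos=penn_to_wordnet("VBD"))
```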

# 5. Part of Speech Tagging with NLTK 
Alphabetical list of part-of-speech tags used in the Penn Treebank Project:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [21]:
words = nltk.word_tokenize(example_text)

# nltk.pos_tag uses one of the pre-trained POS taggers that ship with NLTK
for word, pos in nltk.pos_tag(words):
    print('Word: ', word, '<-->', 'POS-tag: ', pos)

Word:  NLTK <--> POS-tag:  NNP
Word:  is <--> POS-tag:  VBZ
Word:  a <--> POS-tag:  DT
Word:  leading <--> POS-tag:  VBG
Word:  platform <--> POS-tag:  NN
Word:  for <--> POS-tag:  IN
Word:  building <--> POS-tag:  VBG
Word:  Python <--> POS-tag:  NNP
Word:  programs <--> POS-tag:  NNS
Word:  to <--> POS-tag:  TO
Word:  work <--> POS-tag:  VB
Word:  with <--> POS-tag:  IN
Word:  human <--> POS-tag:  JJ
Word:  language <--> POS-tag:  NN
Word:  data <--> POS-tag:  NNS
Word:  . <--> POS-tag:  .
Word:  It <--> POS-tag:  PRP
Word:  provides <--> POS-tag:  VBZ
Word:  easy-to-use <--> POS-tag:  JJ
Word:  interfaces <--> POS-tag:  NNS
Word:  to <--> POS-tag:  TO
Word:  over <--> POS-tag:  IN
Word:  50 <--> POS-tag:  CD
Word:  corpora <--> POS-tag:  NNS
Word:  and <--> POS-tag:  CC
Word:  lexical <--> POS-tag:  JJ
Word:  resources <--> POS-tag:  NNS
Word:  such <--> POS-tag:  JJ
Word:  as <--> POS-tag:  IN
Word:  WordNet <--> POS-tag:  NNP
Word:  , <--> POS-tag:  ,
Word:  along <--> POS-tag:  I

In [26]:
tagged = nltk.pos_tag(word_tokenize(example_text))
all_nouns = [word for word, pos in tagged if pos in ['NN', 'NNP']]
for noun in all_nouns:
    print(noun)

NLTK
platform
Python
language
WordNet
suite
text
processing
classification
tokenization
parsing
reasoning
NLP
discussion
forum
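
The filter above keeps only the tags `NN` and `NNP`, so plural nouns such as "programs" or "interfaces" (tagged `NNS` in the earlier output) are left out. A minimal sketch of a broader filter follows; it matches every Penn Treebank noun tag (`NN`, `NNS`, `NNP`, `NNPS`) by prefix. The `tagged_sample` list is a short excerpt copied from the tagger output shown earlier, so the snippet runs without the tagger model.

```python
# A few (word, tag) pairs taken from the POS-tagging output above.
tagged_sample = [
    ("NLTK", "NNP"), ("is", "VBZ"), ("a", "DT"), ("platform", "NN"),
    ("Python", "NNP"), ("programs", "NNS"), ("data", "NNS"),
    ("interfaces", "NNS"),
]

# Original filter: singular common and proper nouns only.
singular_nouns = [word for word, pos in tagged_sample if pos in ("NN", "NNP")]

# Broader filter: every noun tag starts with 'NN'.
all_noun_forms = [word for word, pos in tagged_sample if pos.startswith("NN")]

print(singular_nouns)
print(all_noun_forms)
```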


There are two main ways to approach any tagging task in NLTK:
1. Using a **pre-trained tagger**, from NLTK or another library, and applying it to the test data.
2. Building or training a tagger of your own to use on the test data.

Typically, tagging problems like POS tagging are treated as sequence labeling or classification problems, in which generative or discriminative models predict the right tag for each token.
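
As a rough illustration of the second option, the sketch below trains a very simple baseline tagger in plain Python: it remembers the most frequent tag seen for each word and falls back to a default tag for unknown words. The function names, the tiny `training_data` set, and the `NN` fallback are all assumptions made for this example. NLTK offers trainable taggers of its own (for instance `nltk.UnigramTagger`), which would normally be trained on a tagged corpus rather than on two hand-written sentences.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag for each (lowercased) word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_tokens(tokens, model, default_tag="NN"):
    """Tag each token with its learned tag, or default_tag if unseen."""
    return [(tok, model.get(tok.lower(), default_tag)) for tok in tokens]


# Toy training data (hypothetical); tags follow the Penn Treebank scheme.
training_data = [
    [("NLTK", "NNP"), ("is", "VBZ"), ("a", "DT"), ("platform", "NN")],
    [("It", "PRP"), ("provides", "VBZ"), ("interfaces", "NNS")],
]

model = train_unigram_tagger(training_data)
result = tag_tokens(["It", "is", "a", "forum"], model)
print(result)
```

This baseline ignores context entirely, which is exactly what sequence models add: they use neighbouring words and tags when choosing the tag for a token.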

# 6. Named Entity Recognition (NER)
Aside from POS tagging, one of the most common labeling problems is finding entities in the text. Typically, NER covers names, locations, and organizations, although some NER systems tag more entity types than these three. The problem can be seen as sequence labeling: the named entities are labeled using the context and other features.

There are two ways to do NER with NLTK. One is to use the pre-trained NER model, which simply scores the test data; the other is to build a machine-learning-based model.

In [49]:
from nltk import ne_chunk
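example_sent = "Mark is studying at Stanford University in California which is located in USA"
# binary=False keeps the entity types (PERSON, ORGANIZATION, GPE, ...);
# binary=True would label every entity simply as NE.
named = ne_chunk(nltk.pos_tag(nltk.word_tokenize(example_sent)), binary=False)
# named.draw()  # opens a window that draws the tree
print(named)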

(S
  (PERSON Mark/NNP)
  is/VBZ
  studying/VBG
  at/IN
  (ORGANIZATION Stanford/NNP University/NNP)
  in/IN
  (GPE California/NNP)
  which/WDT
  is/VBZ
  located/VBN
  in/IN
  (ORGANIZATION USA/NNP))
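
The result of `ne_chunk` is an `nltk.Tree` whose entity chunks are subtrees and whose remaining tokens are plain `(word, tag)` tuples. The sketch below walks such a tree and collects `(label, text)` pairs. To stay runnable without the NER model, it rebuilds part of the tree printed above by hand; `extract_entities` is a hypothetical helper name introduced for this example.

```python
from nltk.tree import Tree

# Hand-built copy of part of the chunk tree printed above.
named_sample = Tree("S", [
    Tree("PERSON", [("Mark", "NNP")]),
    ("is", "VBZ"),
    ("studying", "VBG"),
    ("at", "IN"),
    Tree("ORGANIZATION", [("Stanford", "NNP"), ("University", "NNP")]),
    ("in", "IN"),
    Tree("GPE", [("California", "NNP")]),
])


def extract_entities(tree):
    """Return (entity_label, entity_text) for each chunk directly under the root."""
    entities = []
    for node in tree:
        if isinstance(node, Tree):          # entity chunk, not a plain token
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


entities = extract_entities(named_sample)
print(entities)
```

The same function can be applied to the `named` tree from the cell above.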
