# Natural Language Processing Concepts

## Setup

In [30]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.collocations import BigramCollocationFinder
from nltk.stem.lancaster import LancasterStemmer
from nltk.wsd import lesk
from nltk import pos_tag
from string import punctuation

## Tokenisation

Tokenisation breaks text down into sentences and words.

In [6]:
text="Lucy has got a big dog. His fur is brown and snout cold!"
print(sent_tokenize(text))
print(word_tokenize(text))

['Lucy has got a big dog.', 'His fur is brown and snout cold!']
['Lucy', 'has', 'got', 'a', 'big', 'dog', '.', 'His', 'fur', 'is', 'brown', 'and', 'snout', 'cold', '!']
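To make the idea concrete, here is a rough sketch of tokenisation using only regular expressions. It is not the algorithm NLTK uses (the function names `simple_sent_tokenize` and `simple_word_tokenize` are made up for this illustration), and it will mis-split abbreviations such as "Dr.", but it reproduces the output above for this simple text.

```python
import re

def simple_sent_tokenize(text):
    # Split on whitespace that follows ".", "!" or "?".
    # Naive: abbreviations such as "Dr." would be split incorrectly.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Lucy has got a big dog. His fur is brown and snout cold!"
sentences = simple_sent_tokenize(text)
words = simple_word_tokenize(text)
print(sentences)
print(words)
```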


## Stopwords Removal

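Stopword removal filters out very common words that carry little meaning on their own.

The NLTK library ships a list of English stopwords, to which we can add more words (here, punctuation marks). The list is lowercase, so a capitalised token such as 'His' is not filtered unless tokens are lowercased before the comparison.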

In [9]:
text="Lucy has got a big dog. His fur is brown and snout cold!"
stop_words = set(stopwords.words('english')+list(punctuation))
filtered_words = [w for w in word_tokenize(text) if w not in stop_words]
print(filtered_words)

['Lucy', 'got', 'big', 'dog', 'His', 'fur', 'brown', 'snout', 'cold']
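The sketch below shows two possible refinements: extending the stop list with custom words and comparing tokens case-insensitively. To stay self-contained it uses a tiny hand-written set (`base_stop_words`, a hypothetical stand-in for `stopwords.words('english')`) and an arbitrary custom word, "got".

```python
from string import punctuation

# Hypothetical miniature stop list standing in for stopwords.words('english').
base_stop_words = {"a", "and", "has", "his", "is"}
# Extra words we also want to drop (an illustrative choice, not from NLTK).
custom_stop_words = {"got"}
stop_words = base_stop_words | custom_stop_words | set(punctuation)

tokens = ['Lucy', 'has', 'got', 'a', 'big', 'dog', '.', 'His', 'fur',
          'is', 'brown', 'and', 'snout', 'cold', '!']

# Compare the lowercased token so that "His" matches "his".
filtered_words = [w for w in tokens if w.lower() not in stop_words]
print(filtered_words)
```

Lowercasing only for the comparison keeps the original casing of the surviving tokens, which can matter for proper nouns such as "Lucy".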


## N-Grams

N-grams are sequences of n adjacent words. They are used to identify groups of words that commonly occur together.

In [17]:
text="New York is a great city. Have you ever been to New York?"
stop_words = set(stopwords.words('english')+list(punctuation))
filtered_words = [w for w in word_tokenize(text) if w not in stop_words]
finder = BigramCollocationFinder.from_words(filtered_words)
sortedItems = sorted(finder.ngram_fd.items())
bigrams = [i for i in sortedItems if i[1] > 1]
print(bigrams)

[(('New', 'York'), 2)]
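What the collocation finder counts can be sketched with the standard library alone. The helper `ngrams` below is written for this illustration (it is not the NLTK function), and the token list is typed in by hand to resemble the filtered words above.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['New', 'York', 'great', 'city', 'Have', 'ever', 'New', 'York']

# Count every pair of adjacent tokens and keep the pairs seen more than once.
bigram_counts = Counter(ngrams(tokens, 2))
frequent_bigrams = [(pair, count) for pair, count in bigram_counts.items() if count > 1]
print(frequent_bigrams)

# The same helper yields trigrams, or any other window size.
trigrams = ngrams(tokens, 3)
print(trigrams[0])
```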


## Stemming

Stemming strips the endings from words whose meaning is preserved without their suffixes, so that related forms reduce to a common root. The root need not be a dictionary word.

Several algorithms exist to reduce words to their root form, such as the Lancaster stemmer.


In [18]:
text = "John closed the store on closing night when he was in the mood to close."
words = word_tokenize(text)
stemmer = LancasterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

['john', 'clos', 'the', 'stor', 'on', 'clos', 'night', 'when', 'he', 'was', 'in', 'the', 'mood', 'to', 'clos', '.']
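The idea of suffix stripping can be illustrated with a deliberately tiny stemmer. This is not the Lancaster algorithm, which applies a much larger rule table. The suffix list and the minimum stem length of three characters are assumptions chosen for this example.

```python
# Suffixes are tried in order, longest first (illustrative list only).
SUFFIXES = ("ing", "ed", "es", "e", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["closed", "closing", "close", "store", "the"]
stems = [toy_stem(w) for w in words]
print(stems)
```

All three forms of "close" collapse to the same stem, which is the property that makes stemming useful for matching and counting.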


## Part of Speech

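Part-of-speech tagging determines whether each word is a noun, a verb, an adverb, and so on.

In the NLTK library, each type of word is identified by a tag. By default pos_tag returns Penn Treebank tags (for example NNP for a proper noun and VBZ for a third-person singular verb). The coarser universal tagset is defined here: http://www.nltk.org/book/ch05.html#tab-universal-tagset.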

In [27]:
text = "Steven is eating plenty of pancakes"
words = word_tokenize(text)
tags = pos_tag(words)
print(tags)

[('Steven', 'NNP'), ('is', 'VBZ'), ('eating', 'VBG'), ('plenty', 'NN'), ('of', 'IN'), ('pancakes', 'NNS')]
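Penn Treebank tags share prefixes within a word class: NN for nouns, VB for verbs, JJ for adjectives and RB for adverbs. Grouping the fine-grained tags by prefix is therefore straightforward. The sketch below works on the tagged output copied from above; the `COARSE` mapping and the `coarse_pos` helper are written for this example and cover only those four classes.

```python
# Tagged tokens copied from the output above.
tags = [('Steven', 'NNP'), ('is', 'VBZ'), ('eating', 'VBG'),
        ('plenty', 'NN'), ('of', 'IN'), ('pancakes', 'NNS')]

# Map the two-letter Penn Treebank prefix to a coarse word class.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def coarse_pos(tag):
    # Anything outside the four classes above is reported as "other".
    return COARSE.get(tag[:2], "other")

coarse_tags = [(word, coarse_pos(tag)) for word, tag in tags]
nouns = [word for word, pos in coarse_tags if pos == "noun"]
print(coarse_tags)
print(nouns)
```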


## Word Sense Disambiguation

Word sense disambiguation identifies the context in which a word occurs and infers its meaning from that context.

Lesk is a word sense disambiguation algorithm that uses the WordNet lexicon. It selects the sense whose definition best matches the context of the sentence.
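The core of the approach, choosing the sense whose definition shares the most words with the sentence, can be sketched without WordNet. The two-entry sense inventory below is invented for this illustration, and `simplified_lesk` is a simplified stand-in rather than the NLTK implementation.

```python
def simplified_lesk(context_tokens, senses):
    # Pick the sense whose gloss shares the most words with the context.
    context = {t.lower() for t in context_tokens}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical mini sense inventory; real glosses would come from WordNet.
senses = {
    "play.theatre": "perform a role in a drama or comedy on a stage",
    "play.music": "perform music on an instrument",
}

theatre_sense = simplified_lesk(
    "He was asked to play the role of Mozart in the comedy show".split(), senses)
music_sense = simplified_lesk("Can I play this instrument".split(), senses)
print(theatre_sense)
print(music_sense)
```

Because the score is a raw word overlap, common words such as "in" count as much as informative ones. Filtering stopwords from both the context and the glosses is a common refinement.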


In [42]:
print("Lexicon for the word 'play'")
for ss in wordnet.synsets('play'):
    print(ss, ss.definition())

print("LESK Algo")
cool_meaning = lesk(word_tokenize("He was asked to play the role of Mozart in the comedy show"), 'play')
print(str(cool_meaning) + " " + cool_meaning.definition())
cool_meaning = lesk(word_tokenize("Can I play this instrument?"), 'play')
print(str(cool_meaning) + " " + cool_meaning.definition())

Lexicon for the word 'play'
Synset('play.n.01') a dramatic work intended for performance by actors on a stage
Synset('play.n.02') a theatrical performance of a drama
Synset('play.n.03') a preset plan of action in team sports
Synset('maneuver.n.03') a deliberate coordinated movement requiring dexterity and skill
Synset('play.n.05') a state in which action is feasible
Synset('play.n.06') utilization or exercise
Synset('bid.n.02') an attempt to get something
Synset('play.n.08') activity by children that is guided more by imagination than by fixed rules
Synset('playing_period.n.01') (in games or plays or other performances) the time during which play proceeds
Synset('free_rein.n.01') the removal of constraints
Synset('shimmer.n.01') a weak and tremulous light
Synset('fun.n.02') verbal wit or mockery (often at another's expense but not to be taken seriously)
Synset('looseness.n.05') movement or space for movement
Synset('play.n.14') gay or light-hearted recreational activity for diversion o