# What is Natural Language Processing?

Natural Language Processing (NLP) is any computation on, or manipulation of, natural language that yields insight into what words mean and how sentences are constructed.

## NLP Tasks
- Counting words, counting frequency of words
- Finding sentence boundaries
- Part of speech tagging
- Parsing the sentence structure
- Identifying semantic roles
- Identifying entities in a sentence
- Finding which pronoun refers to which entity

## NLTK
- NLTK: Natural Language Toolkit
- Open source library in Python

In [1]:
import nltk

In [2]:
# Download the corpora and example texts used below (opens the NLTK downloader)
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [3]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [4]:
text1

<Text: Moby Dick by Herman Melville 1851>

# Simple NLP Tasks

In [5]:
# Counting vocabulary words
text7

<Text: Wall Street Journal>

In [9]:
print(sent7)

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']


In [10]:
# Number of words
len(sent7)

18

In [11]:
# Number of unique words
len(set(sent7))

17

In [12]:
list(set(text7))[:10]

['embarrassing',
 'railcars',
 'massive',
 'aimed',
 '1901',
 'taxpayer',
 '11.6',
 'benefits',
 'took',
 'estimated']

In [13]:
# Frequency of words
dist = FreqDist(text7)

In [14]:
# Number of unique words (vocabulary size)
len(dist)

12408

In [18]:
# The unique words themselves
vocab1 = list(dist.keys())

In [19]:
vocab1[:10]

['Pierre', 'Vinken', ',', '61', 'years', 'old', 'will', 'join', 'the', 'board']

In [21]:
# Occurrences of a specific word
dist["old"]

24

In [22]:
# Frequent words: more than 5 characters and more than 100 occurrences
# The condition len(w) > 5 filters out tokens such as ".", ",", "a", etc.
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]

In [23]:
freqwords

['billion',
 'company',
 'president',
 'because',
 'market',
 'million',
 'shares',
 'trading',
 'program']
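
`FreqDist` behaves like a dictionary of counts (in NLTK 3 it subclasses `collections.Counter`), so the same filtering pattern can be tried without downloading any corpus. The following is a minimal sketch that uses the tokens of `sent7` printed earlier as stand-in data; the thresholds and the name `long_words` are illustrative assumptions, scaled down for a single sentence.

```python
from collections import Counter

# Tokens of sent7, copied from the output shown earlier
tokens = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join',
          'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']

counts = Counter(tokens)          # maps each distinct token to its count

# Same pattern as freqwords, with thresholds scaled down for one sentence
long_words = [w for w in counts if len(w) > 5 and counts[w] >= 1]

print(len(tokens), len(counts))   # total tokens, distinct tokens
print(counts.most_common(1))      # the most frequent token with its count
print(long_words)
```

On a full corpus such as `text7`, the length filter matters because punctuation and function words would otherwise dominate the list of frequent tokens.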

## Normalization and Stemming

- Normalization: make different forms of the same word look the same (for example, by lowercasing)
- Stemming: reduce each word to its root (stem) so that different forms can be matched later

In [27]:
input1 = "List listed lists listing listings"
words1 = input1.lower().split(" ")
words1

['list', 'listed', 'lists', 'listing', 'listings']

In [29]:
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]

['list', 'list', 'list', 'list', 'list']
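
One way to use stems for later matching is to group the surface forms that share a stem. This is a minimal sketch: the word list and the name `forms_by_stem` are illustrative assumptions, and `PorterStemmer` needs no extra data download.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

porter = PorterStemmer()
words = ['list', 'listed', 'lists', 'listing', 'listings', 'right', 'rights']

# Group the surface forms that share a stem
forms_by_stem = defaultdict(list)
for w in words:
    forms_by_stem[porter.stem(w)].append(w)

for stem, forms in forms_by_stem.items():
    print(stem, '->', forms)
```

A search for any one form can then be answered by looking up its stem and returning the whole group.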

## Lemmatization
- A variation of stemming
- The words that come out must be valid, meaningful words

In [30]:
udhr = nltk.corpus.udhr.words("English-Latin1")
udhr[:20]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'rights',
 'of']

In [31]:
# Stemming
[porter.stem(t) for t in udhr[:20]]

['univers',
 'declar',
 'of',
 'human',
 'right',
 'preambl',
 'wherea',
 'recognit',
 'of',
 'the',
 'inher',
 'digniti',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalien',
 'right',
 'of']

In [33]:
# Some of the stems above are not valid words: declar, preambl, ...
# Lemmatization: like stemming, but the results are all valid words
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of']

## Tokenization
- Splitting a sentence into words/tokens

In [5]:
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(" ")

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

In [37]:
# A better way to tokenize is to use NLTK's tokenizer
nltk.word_tokenize(text11)

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']

In [38]:
# This tokenizer splits off "n't" and "." as separate tokens
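
To see why splitting on spaces falls short, here is a minimal regular-expression tokenizer that separates punctuation from words. It is a simplification and not the algorithm behind `nltk.word_tokenize`: it keeps contractions such as "shouldn't" whole instead of splitting off "n't". The pattern and the helper name `simple_tokenize` are illustrative assumptions.

```python
import re

# A word (optionally with an apostrophe part) or a single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)

text11 = "Children shouldn't drink a sugary drink before bed."
print(text11.split(" ")[-1])     # the period stays glued to the last word
print(simple_tokenize(text11))   # the period becomes its own token
```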

In [42]:
# Tricky example: several periods that do not mark the end of a sentence
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)
print(len(sentences))
sentences

4


['This is the first sentence.',
 'A gallon of milk in the U.S. costs $2.99.',
 'Is this the third sentence?',
 'Yes, it is!']
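
For contrast, a naive rule that ends a sentence at every ".", "!" or "?" followed by whitespace breaks on the abbreviation "U.S.". The sketch below uses only the standard library; the rule is an assumption chosen to show the failure, not something NLTK does.

```python
import re

text12 = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
          "Is this the third sentence? Yes, it is!")

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
naive = re.split(r"(?<=[.!?])\s+", text12)

print(len(naive))
for piece in naive:
    print(repr(piece))
```

The naive rule returns five pieces because it cuts after "U.S.", whereas `nltk.sent_tokenize` returned the four sentences shown above.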

# Advanced NLP Tasks with NLTK

## Part-of-speech (POS) Tagging
- Nouns, verbs, adjectives, ...

### Ambiguity in POS Tagging
- This is common in English
- Example: "Visiting aunts can be a nuisance" ("visiting" can be read as a verb, the act of visiting aunts, or as an adjective, aunts who are visiting)
- However, the output of a POS tagger is NOT ambiguous
    - NLTK assigns exactly one tag to each token: if a word could have two roles in a sentence, NLTK settles on the one it considers most likely

In [4]:
# To obtain more information about these tags
nltk.help.upenn_tagset("MD")

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


In [7]:
# 1) Splitting a sentence into words/tokens
text13 = nltk.word_tokenize(text11)

In [8]:
# 2) Tagging each token with NLTK's POS tagger
nltk.pos_tag(text13)

[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]
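
The tagger output is a plain list of `(word, tag)` tuples, so it can be processed with ordinary Python. A minimal sketch, using the output above as literal data (the names `nouns` and `multi_role` are illustrative assumptions):

```python
from collections import defaultdict

# Tagger output copied from the cell above
tagged = [('Children', 'NNP'), ('should', 'MD'), ("n't", 'RB'), ('drink', 'VB'),
          ('a', 'DT'), ('sugary', 'JJ'), ('drink', 'NN'), ('before', 'IN'),
          ('bed', 'NN'), ('.', '.')]

# Penn Treebank noun tags all start with 'NN' (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tagged if tag.startswith('NN')]

# Collect every tag each word received to spot words used in more than one role
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word].add(tag)
multi_role = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}

print(nouns)
print(multi_role)
```

In this sentence "drink" appears once as a verb (VB) and once as a noun (NN), so the same word can receive different tags depending on its position.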

## Parsing Sentence Structure

- Making sense of sentences
    - It's easy if they follow a well-defined grammatical structure

<img src="resources/nltk_parser.png" width = "400">

In [15]:
text15 = nltk.word_tokenize("Alice loves Bob")
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")
parser = nltk.ChartParser(grammar)

In [16]:
trees = parser.parse_all(text15)
for tree in trees:
    print(tree)

(S (NP Alice) (VP (V loves) (NP Bob)))


## Ambiguity in Parsing
- Ambiguity may exist even if sentences are grammatically correct
- Example: "I saw the man with the telescope"
    - I used a telescope to see the man
    - I saw a man who had a telescope

In [18]:
text16 = nltk.word_tokenize("I saw the man with a telescope")
grammar1 = nltk.data.load('mygrammar.cfg')
grammar1

IndexError: list index out of range
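
The contents of `mygrammar.cfg` are not shown here, so the following sketch defines a small grammar inline instead. The rules are an assumption chosen to cover this one sentence, not the original file, and the sentence is split on spaces so that no tokenizer data is needed.

```python
import nltk

# Hypothetical grammar, written inline because mygrammar.cfg is not shown
grammar_sketch = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | VP PP
NP -> DT NN | NP PP | 'I'
PP -> P NP
DT -> 'a' | 'the'
NN -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

tokens = "I saw the man with a telescope".split()
parser = nltk.ChartParser(grammar_sketch)
trees = list(parser.parse(tokens))

print(len(trees))   # one tree per reading of the sentence
for tree in trees:
    print(tree)
```

With these rules the parser finds two trees: one attaches "with a telescope" to the verb phrase (the seeing was done with a telescope), the other attaches it to the noun phrase (the man has the telescope).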

## NLTK and Parse Tree Collection

In [19]:
from nltk.corpus import treebank
text17 = treebank.parsed_sents("wsj_0001.mrg")[0]
print(text17)

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
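
Parsed sentences are `nltk.Tree` objects, which can be queried for words, tags, and subtrees. The sketch below rebuilds the tree from the bracketed string printed above, so it runs without the treebank corpus; the variable names are illustrative assumptions.

```python
from nltk import Tree

# Bracketed parse copied from the output above
bracketed = """
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
"""

tree = Tree.fromstring(bracketed)
words = tree.leaves()    # the tokens, left to right
tagged = tree.pos()      # (word, POS tag) pairs read off the tree

# Text of every subtree labelled exactly 'NP'
noun_phrases = [" ".join(st.leaves())
                for st in tree.subtrees(lambda t: t.label() == "NP")]

print(len(words))
print(tagged[:3])
print(noun_phrases)
```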


## POS Tagging and Parsing Complexity
- Uncommon usages of words
    - "The old man the boat" (here "old" is a noun and "man" is a verb)
- Well-formed sentences may still be meaningless
    - "Colorless green ideas sleep furiously"
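
The first example can be parsed once the grammar allows the uncommon usages. The sketch below uses a toy grammar, written as an assumption for this one sentence, in which "old" may be an adjective or a noun and "man" may be a noun or a verb. The sentence is lowercased and split on spaces to keep the grammar small.

```python
import nltk

# Toy grammar (an assumption): 'old' can be JJ or NN, 'man' can be NN or V
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN | DT JJ NN
VP -> V NP
DT -> 'the'
JJ -> 'old'
NN -> 'old' | 'man' | 'boat'
V -> 'man'
""")

tokens = "the old man the boat".split()
trees = list(nltk.ChartParser(toy_grammar).parse(tokens))

print(len(trees))
for tree in trees:
    print(tree)
```

The only complete parse treats "the old" as the subject and "man" as the verb. The more common reading of "the old man" as a noun phrase leaves "the boat" without a verb, so it yields no tree.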