# Basic Natural Language Processing

## What is Natural Language?
- Language used for everyday communication by humans

## What is Natural Language Processing?
- Any computation on, or manipulation of, natural language;
- Natural languages evolve:
  - New words get added;
  - Old words lose popularity;
  - Meanings of words change;
  - Language rules themselves may change.

## NLP Tasks: A Broad Spectrum
- Counting words, counting frequency of words;
- Finding sentence boundaries;
- Part of speech tagging;
- Parsing the sentence structure;
- Identifying semantic roles;
- Identifying entities in a sentence;
- Finding which pronoun refers to which entity.

# Basic NLP tasks with NLTK

## An Introduction to NLTK
- NLTK: Natural Language Toolkit;
- Open source library in Python;
- Has support for most NLP tasks.

In [89]:
import nltk
nltk.download('gutenberg')
nltk.download('genesis')
nltk.download('inaugural')
nltk.download('nps_chat')
nltk.download('webtext')
nltk.download('treebank')
nltk.download('udhr')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')
from nltk.book import *

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Package genesis is already up-to-date!
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!
[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package udhr to /root/nltk_data...
[nltk_data]   Package udhr is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] 

In [66]:
text1

<Text: Moby Dick by Herman Melville 1851>

In [67]:
sents()

sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .


In [68]:
sent1

['Call', 'me', 'Ishmael', '.']

In [69]:
# Inspecting a sentence and counting its tokens
sent7, len(sent7)

(['Pierre',
  'Vinken',
  ',',
  '61',
  'years',
  'old',
  ',',
  'will',
  'join',
  'the',
  'board',
  'as',
  'a',
  'nonexecutive',
  'director',
  'Nov.',
  '29',
  '.'],
 18)

In [70]:
print(f"Total words count = {len(text7)}\nUnique words count = {len(set(text7))}")

Total words count = 100676
Unique words count = 12408


In [71]:
# Peeking at five unique tokens (sets are unordered, so the selection is arbitrary)
list(set(text7))[:5]

['sellers', 'Soldado', 'Walt', 'motor-home', 'Editorials']

In [72]:
# Seeing the frequency of words
dist = FreqDist(text7)
print(len(dist))
dist

12408


FreqDist({',': 4885, 'the': 4045, '.': 3828, 'of': 2319, 'to': 2164, 'a': 1878, 'in': 1572, 'and': 1511, '*-1': 1123, '0': 1099, ...})

In [73]:
vocab7 = dist.keys()
list(vocab7)[:5]

['Pierre', 'Vinken', ',', '61', 'years']

In [95]:
# Finding how many times a word appears
dist['four']

20

In [75]:
# Keep words that have more than 5 letters and occur more than 100 times
freqwords = [w for w in vocab7 if len(w) > 5 and dist[w]>100]
sorted(freqwords)

['because',
 'billion',
 'company',
 'market',
 'million',
 'president',
 'program',
 'shares',
 'trading']
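
`FreqDist` is a subclass of Python's `collections.Counter`, so the same counting pattern can be tried without any corpus download. The snippet below is a minimal sketch on a made-up token list (the sentence and the thresholds are illustrative assumptions, not part of the Treebank data above).

```python
from collections import Counter

# Hypothetical toy token list, used only for illustration
tokens = "the cat sat on the mat because the mat was warm".split()

counts = Counter(tokens)

# Frequency of a single word, like dist['four'] above
the_count = counts["the"]

# The two most frequent tokens with their counts
top_two = counts.most_common(2)

# Words longer than 2 characters that occur more than once
frequent = sorted(w for w in counts if len(w) > 2 and counts[w] > 1)

print(the_count, top_two, frequent)
```

The filter in the last step has the same shape as the `freqwords` comprehension; only the thresholds differ.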

## Normalization and Stemming
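- Different forms of the same word:
  - Example: "List listed lists listing listings"
- Stemming them:
  ```python
  import nltk

  # Normalize to lowercase, then split into individual words
  words = "List listed lists listing listings".lower().split()
  porter = nltk.PorterStemmer()
  [porter.stem(t) for t in words]
  # ['list', 'list', 'list', 'list', 'list']
  ```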


## Lemmatization
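 - Lemmatization reduces a word to its dictionary form (lemma), so the result is always a valid word. Use it when you care about the meaning of words instead of just removing suffixes.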

In [76]:
udhr = nltk.corpus.udhr.words('English-Latin1')
udhr[:20]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'rights',
 'of']

In [77]:
porter = nltk.PorterStemmer()
[porter.stem(t) for t in udhr[:20]]

['univers',
 'declar',
 'of',
 'human',
 'right',
 'preambl',
 'wherea',
 'recognit',
 'of',
 'the',
 'inher',
 'digniti',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalien',
 'right',
 'of']

In [78]:
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of']
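
The two outputs above differ in kind: the stemmer produces truncated strings such as `'declar'`, while the lemmatizer only returns real words. The toy sketch below illustrates that contrast. The suffix list and the lookup table are invented for illustration and are far simpler than the Porter rules or WordNet.

```python
# Toy suffix stripper: removes a known ending without checking
# that what remains is a real word. Hypothetical rules, not Porter's.
SUFFIXES = ("ation", "ition", "ing", "ies", "ed", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters of the stem
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: looks the word up in a table of known forms.
# Hypothetical entries standing in for a real dictionary.
LEMMAS = {"rights": "right", "children": "child", "was": "be"}

def toy_lemmatize(word):
    # Case-sensitive lookup; unknown words are returned unchanged
    return LEMMAS.get(word, word)

for w in ["Declaration", "rights", "Rights", "children"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

Because the lookup is case-sensitive, `'Rights'` is left alone while `'rights'` becomes `'right'`, which mirrors the `WordNetLemmatizer` output above.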

## Tokenization
- Splitting a sentence into words (tokens) does not always work if you just split on whitespace:
  `["Children", "shouldn't", "drink", "alcohol"]`
  - "shouldn't" is counted as a single token, even though it combines "should" and "n't".

- NLTK has a built-in tokenizer.

In [79]:
text11 = "Children shouldn't drink a sugary drink before bed."
nltk.word_tokenize(text11)

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']
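
To see why whitespace splitting falls short, the sketch below compares `str.split` with a small regex-based tokenizer. `simple_tokenize` is a hypothetical helper written for this note; it only imitates two behaviours visible in the output above (splitting off `n't` and separating punctuation) and is not how `nltk.word_tokenize` is implemented.

```python
import re

def simple_tokenize(text):
    # Detach the "n't" clitic from the word it is glued to
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # Emit "n't", runs of word characters, or single punctuation marks
    return re.findall(r"n't|\w+|[^\w\s]", text)

sentence = "Children shouldn't drink a sugary drink before bed."

by_whitespace = sentence.split()
by_regex = simple_tokenize(sentence)

print(by_whitespace)  # "shouldn't" and "bed." stay glued together
print(by_regex)       # "should", "n't", "bed" and "." are separate tokens
```

This rough approximation would mishandle abbreviations such as "U.S.", which is one reason to prefer the library tokenizer.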

## Sentence Splitting
- How would you split a long text string into sentences?
- NLTK has an in-built sentence splitter.

In [80]:
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)
print(sentences)
len(sentences)

['This is the first sentence.', 'A gallon of milk in the U.S. costs $2.99.', 'Is this the third sentence?', 'Yes, it is!']


4
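
For comparison, here is a naive splitter that breaks after every `.`, `!` or `?` followed by whitespace. It is a sketch of the obvious rule, not of the Punkt model behind `nltk.sent_tokenize`, and it shows why the abbreviation "U.S." makes the task harder than it looks.

```python
import re

text = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
        "Is this the third sentence? Yes, it is!")

# Split wherever sentence-ending punctuation is followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for s in naive_sentences:
    print(repr(s))
print(len(naive_sentences))
```

The naive rule yields five pieces because it cuts after "U.S.", whereas the NLTK splitter above correctly returns four sentences. The period inside "$2.99" is safe in both cases because no whitespace follows it.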

# Advanced NLP tasks with NLTK

## Part-of-Speech (POS) Tagging
- Nouns, verbs, adjectives, ...

In [81]:
nltk.download('tagsets_json', quiet=True)
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


In [82]:
splitted = nltk.word_tokenize(text11)
nltk.download('averaged_perceptron_tagger_eng',quiet=True)

# Especially useful when collecting verbs or nouns, for example
nltk.pos_tag(splitted)

[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]

### Ambiguity in POS Tagging
- Ambiguity is common in English

In [83]:
text14 = nltk.word_tokenize("Visiting aunts can be a nuisance")

# "Visiting" can be a gerund (the act of visiting aunts) or a modifier of "aunts" (aunts who visit)
nltk.pos_tag(text14)

[('Visiting', 'VBG'),
 ('aunts', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('nuisance', 'NN')]

## Parsing Sentence Structure
- Making sense of sentences is easy if they follow a well-defined grammatical structure

In [84]:
text15 = nltk.word_tokenize("Alice loves Bob")
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")
parser = nltk.ChartParser(grammar)
trees = parser.parse_all(text15)
for tree in trees:
  print(tree)

(S (NP Alice) (VP (V loves) (NP Bob)))


### Ambiguity in Parsing
- Ambiguity may exist even if sentences are grammatically correct

In [85]:
text16 = nltk.word_tokenize("I saw the man with a telescope")

grammar1 = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'a' | 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar1)
trees = parser.parse_all(text16)
for tree in trees:
  print(tree)

(S
  (NP I)
  (VP
    (VP (V saw) (NP (Det the) (N man)))
    (PP (P with) (NP (Det a) (N telescope)))))
(S
  (NP I)
  (VP
    (V saw)
    (NP (Det the) (N man) (PP (P with) (NP (Det a) (N telescope))))))
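
The two trees correspond to two attachment choices for "with a telescope": to the verb phrase (the seeing was done with a telescope) or to the noun phrase (the man has the telescope). As a dependency-free cross-check, the sketch below counts the parses of a token list under the same grammar. `GRAMMAR` and `count_parses` are names invented for this sketch, and the approach (memoized counting over token spans) is an assumption chosen for brevity, not how `nltk.ChartParser` works internally.

```python
from functools import lru_cache

# Same rules as grammar1; any symbol that is not a key is treated as a terminal
GRAMMAR = {
    "S": [("NP", "VP")],
    "PP": [("P", "NP")],
    "NP": [("Det", "N"), ("Det", "N", "PP"), ("I",)],
    "VP": [("V", "NP"), ("VP", "PP")],
    "Det": [("a",), ("the",)],
    "N": [("man",), ("telescope",)],
    "V": [("saw",)],
    "P": [("with",)],
}

def count_parses(tokens, start="S"):
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        # Number of distinct trees in which `symbol` derives tokens[i:j]
        if symbol not in GRAMMAR:
            return 1 if j - i == 1 and tokens[i] == symbol else 0
        return sum(count_seq(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        # Ways the symbols in `rhs` cover tokens[i:j], each taking >= 1 token
        if len(rhs) == 1:
            return count(rhs[0], i, j)
        total = 0
        for k in range(i + 1, j - len(rhs) + 2):
            left = count(rhs[0], i, k)
            if left:
                total += left * count_seq(rhs[1:], k, j)
        return total

    return count(start, 0, len(tokens))

ambiguous = count_parses("I saw the man with a telescope".split())
unambiguous = count_parses("I saw the man".split())
print(ambiguous, unambiguous)
```

The sketch relies on the grammar having no rule of the form `A -> B` with `B` a non-terminal; such unit rules could make the recursion loop.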


## NLTK and Parse Tree Collection

In [86]:
from nltk.corpus import treebank
text17 = treebank.parsed_sents('wsj_0001.mrg')[0]
print(text17)

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))


# Application: Spell Checker

## Spelling Correction
- A common way to check for mis-spelt words and correct them is to find valid words that share similar spelling;
- Requires a dictionary of valid words:
  - NLTK to the rescue: `from nltk.corpus import words`.
- Requires some way to measure spelling similarity:
  - "Edit distance" between two strings.

## Edit distance
- Number of changes that need to be made to string A to get to string B;
- One specific algorithm: Levenshtein distance
  - Insertions;
  - Deletions;
  - Substitutions.
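
The three operations above can be turned into a short dynamic-programming routine. The sketch below is a plain-Python version written for this note (the function name `levenshtein` is an assumption). It gives every insertion, deletion, and substitution a cost of 1.

```python
def levenshtein(a, b):
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

print(levenshtein("pierce", "pierse"))   # one substitution: c -> s
print(levenshtein("kitten", "sitting"))  # two substitutions and one insertion
```

Only two rows of the table are kept at a time, so memory grows with the length of `b` rather than with the product of both lengths.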

## N-grams
- Sequences of n consecutive characters in a word;
- Can also be used for word sequences;
- How can n-grams help in spell-checking?
  - If two words have similar spelling, they share many n-grams.
  - Example with 2-grams:
    - Correct word: pierce => pi, ie, er, rc, ce
    - Mis-spelt word: pierse => pi, ie, er, rs, se
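
A minimal sketch of n-gram extraction follows. The helper works on any sequence, so the same function covers characters in a word and words in a sentence. It is written from scratch for illustration and returns tuples, as `nltk.util.ngrams` does.

```python
def ngrams(sequence, n):
    # Every run of n consecutive items, returned as tuples
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

# Character 2-grams of a word, joined back into strings for readability
char_bigrams = ["".join(gram) for gram in ngrams("pierce", 2)]

# Word 2-grams of a tokenized sentence
word_bigrams = ngrams("Call me Ishmael .".split(), 2)

print(char_bigrams)
print(word_bigrams)
```

A sequence shorter than `n` simply produces an empty list.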

## Jaccard similarity
- Used to measure similarity of sets;
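- The Jaccard index (coefficient of similarity) of two sets A and B is the size of their intersection divided by the size of their union: $J(A, B) = |A \cap B| / |A \cup B|$;
- On the 2-grams of the two words, Jaccard('pierce', 'pierse') = 3/7 (3 shared 2-grams out of 7 distinct ones).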
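
The 3/7 figure can be checked with a few lines of plain Python. The helper names below are assumptions made for this sketch, and exact fractions are used so the result matches the hand calculation.

```python
from fractions import Fraction

def bigram_set(word):
    # Set of 2-character substrings of the word
    return {word[i:i + 2] for i in range(len(word) - 1)}

def jaccard_similarity(a, b):
    # |A intersect B| / |A union B|; two empty sets are treated as identical
    union = a | b
    if not union:
        return Fraction(1)
    return Fraction(len(a & b), len(union))

similarity = jaccard_similarity(bigram_set("pierce"), bigram_set("pierse"))
distance = 1 - similarity

print(similarity, distance)
```

NLTK's `jaccard_distance` reports the complement of this value (1 minus the similarity), so smaller means more similar.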

# Assignment 2 - Introduction to NLTK

In part 1 of this assignment you will use nltk to explore the <a href='http://www.cs.cmu.edu/~ark/personas/'>CMU Movie Summary Corpus</a>. All data is released under a <a href='https://creativecommons.org/licenses/by-sa/3.0/us/legalcode'>Creative Commons Attribution-ShareAlike License</a>. Then in part 2 you will create a spelling recommender function that uses nltk to find words similar to the misspelling.

## Part 1 - Analyzing Plots Summary Text

### Question 7

`text1` is in `nltk.Text` format that has been constructed using tokens output by `nltk.word_tokenize(plots_raw)`.

Now, use `nltk.sent_tokenize` on the tokens in `text1` by joining them using whitespace to output a sentence-tokenized copy of `text1`. Report the average number of whitespace separated tokens per sentence in the sentence-tokenized copy of `text1`.

*This function should return a float.*

In [87]:
import pandas as pd

# Re-join the tokens with single spaces, then split the text into sentences
sentences = nltk.sent_tokenize(" ".join(text1))
df = pd.DataFrame(sentences, columns=['sentences'])

# Calculate the number of whitespace-separated tokens for each sentence
df['token_count'] = df['sentences'].apply(lambda x: len(x.split()))

df['token_count'].mean()

np.float64(26.067652643149795)
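
The question asks for a function that returns a float, while the cell above yields a NumPy scalar. A pandas-free sketch of such a function is shown below. The name `average_tokens_per_sentence` is an assumption, and the demo sentences are the four from the sentence-splitting example rather than the plot summaries.

```python
def average_tokens_per_sentence(sentences):
    # Count whitespace-separated tokens in each sentence
    counts = [len(sentence.split()) for sentence in sentences]
    if not counts:
        return 0.0  # avoid dividing by zero on empty input
    return float(sum(counts) / len(counts))

demo_sentences = [
    "This is the first sentence.",
    "A gallon of milk in the U.S. costs $2.99.",
    "Is this the third sentence?",
    "Yes, it is!",
]

average = average_tokens_per_sentence(demo_sentences)
print(average)
```

In the assignment setting, the argument would be the output of `nltk.sent_tokenize(" ".join(text1))`.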

### Question 9

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words.**

Refer to:
- [NLTK Jaccard distance](https://www.nltk.org/api/nltk.metrics.distance.html?highlight=jaccard_distance#nltk.metrics.distance.jaccard_distance)
- [NLTK ngrams](https://www.nltk.org/api/nltk.util.html?highlight=ngrams#nltk.util.ngrams)

*This function should return a list of length three:
`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*

In [93]:
from nltk.corpus import words
from nltk.util import ngrams
from nltk.metrics.distance import jaccard_distance

nltk.download('words', quiet=True)  # word list used as the dictionary of valid spellings
correct_spellings = words.words()

entries=['cormulent', 'incendenece', 'validrate']

def spell_checker(entries_list, correct_words_list):
    recommendations = []
    for entry_word in entries_list:
        # Generate trigrams for the entry word
        entry_trigrams = set(ngrams(entry_word, 3))

        min_dist = 1.0 # Jaccard distance ranges from 0 to 1
        best_match = None

        for correct_word in correct_words_list:
            # Ensure the correct word is long enough to form trigrams
            if len(correct_word) >= 3:
                correct_trigrams = set(ngrams(correct_word, 3))

                # Calculate Jaccard distance
                # Handle cases where trigram sets might be empty (though filtered for correct_word length)
                if not entry_trigrams and not correct_trigrams:
                    current_dist = 0.0
                elif not entry_trigrams or not correct_trigrams:
                    current_dist = 1.0
                else:
                    current_dist = jaccard_distance(entry_trigrams, correct_trigrams)

                if current_dist < min_dist:
                    min_dist = current_dist
                    best_match = correct_word

        # Add the best match or the original word if no suitable match was found
        recommendations.append(best_match if best_match is not None else entry_word)
    return recommendations

# Call the function with the provided entries and correct spellings
spell_recommendations = spell_checker(entries, correct_spellings)
print(spell_recommendations)


['formule', 'ascendence', 'validate']
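
The first recommendation, `'formule'` for `'cormulent'`, looks odd until the trigram overlap is worked out on a small example. The sketch below reimplements the same search in plain Python on a hypothetical four-word vocabulary. The optional `same_first_letter` filter is an assumption added here for illustration and is not part of the question as stated above.

```python
def trigrams(word):
    # Set of 3-character substrings of the word
    return {word[i:i + 3] for i in range(len(word) - 2)}

def trigram_distance(a, b):
    # Jaccard distance between the trigram sets of two words
    grams_a, grams_b = trigrams(a), trigrams(b)
    union = grams_a | grams_b
    return 1 - len(grams_a & grams_b) / len(union) if union else 0.0

def recommend(entry, vocabulary, same_first_letter=False):
    # Optionally keep only candidates starting with the entry's first letter
    candidates = (w for w in vocabulary
                  if not same_first_letter or w[:1] == entry[:1])
    # Fall back to the entry itself when no candidate is left
    return min(candidates, key=lambda w: trigram_distance(entry, w), default=entry)

# Hypothetical mini-dictionary standing in for words.words()
vocabulary = ["formule", "corpulent", "validate", "valid"]

unfiltered = recommend("cormulent", vocabulary)
filtered = recommend("cormulent", vocabulary, same_first_letter=True)
third = recommend("validrate", vocabulary)
print(unfiltered, filtered, third)
```

Here "formule" shares 4 of 8 distinct trigrams with "cormulent" (distance 0.5) while "corpulent" shares 4 of 10 (distance 0.6), so the shorter word wins unless candidates are restricted by first letter.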
