# NLP Exploration
## Dawson Ren
## January 22nd, 2022

Learning Sources:
https://realpython.com/natural-language-processing-spacy-python/

## Introduction

The goal of this notebook is to chronicle my exploration of Natural Language Processing. To begin my journey, here are some interesting insights from [Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing).

The beginnings of NLP (1950-1990)
- Alan Turing developed the Turing test in 1950, based on the "Imitation Game" - could a human reliably tell the difference between a human and computer when communicating only using text?
- Symbolic NLP from 1950-1990 involved computers essentially getting a rulebook to follow about language (no statistical learning methods used).
- Rule-based approaches, perhaps inspired by the Chomskyan view of linguistics (seeing grammar as the result of set-in-stone transformations)

Statistical NLP (1990-2010)
- The web provides huge amounts of unannotated text. Unsupervised algorithms usually do worse than supervised ones for a given amount of data, but the sheer volume of unannotated data can make up for that, as long as the algorithm's time complexity is low enough to be practical.

Modern NLP (2010-present)
- Representation Learning - ML detection of representations conducive to classification (the model learns the features we need)
- Deep Neural Networks - we can throw insane amounts of compute at a problem and still train the networks in a reasonable amount of time because of backpropagation/automatic differentiation.

Symbolic methods are still used for tokenization/parsing and post-processing.

Statistical methods can use the Hidden Markov Model - where we have Y that depends on X (more concretely, the probability that Y in the set of events A depends on the current value of X), and X is a Markov process that we cannot observe. We can apply this to topic analysis - Y are the words we observe, X is the abstract "topics" we're writing about. We have probability distributions of certain words appearing for certain topics, and the actual empirical distribution of words in our sample text. We can work backwards to get the "topics" the sample text is referring to. Pretty cool!
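
One standard way to "work backwards" from observed words to hidden states is Viterbi decoding. Here's a toy sketch in plain Python. The two topics, the three words, and every probability are made up for illustration, so treat it as a picture of the idea rather than a real topic model.

```python
# Toy hidden Markov model: hidden "topics" (X) emit observed words (Y).
# All topics, words, and probabilities below are invented for illustration.
from math import log

topics = ["sports", "politics"]

# P(first topic)
start_p = {"sports": 0.5, "politics": 0.5}

# P(next topic | current topic): topics tend to persist from word to word.
trans_p = {
    "sports": {"sports": 0.8, "politics": 0.2},
    "politics": {"sports": 0.2, "politics": 0.8},
}

# P(word | topic): each topic has its own distribution over words.
emit_p = {
    "sports": {"goal": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"goal": 0.1, "team": 0.2, "vote": 0.7},
}


def viterbi(words):
    """Return the most likely sequence of hidden topics for the observed words."""
    # Each column maps topic -> (log-prob of the best path ending there, previous topic).
    best = [{s: (log(start_p[s]) + log(emit_p[s][words[0]]), None) for s in topics}]
    for word in words[1:]:
        column = {}
        for s in topics:
            prev, score = max(
                ((p, best[-1][p][0] + log(trans_p[p][s])) for p in topics),
                key=lambda pair: pair[1],
            )
            column[s] = (score + log(emit_p[s][word]), prev)
        best.append(column)

    # Walk backwards from the best final topic to recover the full path.
    state = max(topics, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


print(viterbi(["goal", "team", "vote", "vote"]))
# ['sports', 'sports', 'politics', 'politics']
```

Log-probabilities are used so that long sequences don't underflow to zero.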

Common NLP Tasks
- Optical Character Recognition (take a photo, return the words on it)
- Speech Recognition (converting an analog signal to human text)
- Speech Segmentation (a subtask of speech recognition, split analog signal into words)
- Text-to-speech
- Tokenization (trivial in English, not so much in Chinese/Japanese)
- Lemmatization (the base form of a word is called its "lemma"; for example, breaks, broken, and broke all reduce to break. Also converts words like better to good.)
- Morphological Segmentation - segment into morphemes (a morpheme is the smallest meaningful unit of a language)
- Part of speech tagging (is "book" a noun, or a verb as in "to book a flight"?)
- Automatic Summarization
- Grammatical Error Correction
- Machine Translation
- Natural Language Generation (copywriting, etc.)
- Question Answering
- Text-to-image generation

## spaCy

Industrial-strength NLP! Let's get started.

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

`spacy.load()` returns a `Language` object, conventionally named `nlp`. Calling `nlp` on a string generates a `Doc` object. Iterating over the `Doc` shows your tokens.

In [2]:
doc = nlp('This sentence has morphemes and lexemes in it. How cool!')

In [15]:
print(f'{"Text":15} {"ID":5} {"Text with WS":15} {"Is Punctuation":15} {"Is Stop"}')
for token in doc:
    print(f'{token.text:<15} {token.idx:<5} {token.text_with_ws:15} {token.is_punct:<15} {token.is_stop}')

Text            ID    Text with WS    Is Punctuation  Is Stop
This            0     This            0               True
sentence        5     sentence        0               False
has             14    has             0               True
morphemes       18    morphemes       0               False
and             28    and             0               True
lexemes         32    lexemes         0               False
in              40    in              0               True
it              43    it              0               True
.               45    .               1               False
How             47    How             0               True
cool            51    cool            0               False
!               55    !               1               False


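Sentence detection in spaCy's trained pipelines comes from the dependency parse by default rather than from periods alone, and the sentences are exposed through `doc.sents`. You can create your own pipe and add a different boundary.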
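
Here's a minimal sketch of a custom boundary pipe, assuming spaCy v3. The component name `semicolon_boundaries` and the choice of `;` as a delimiter are my own, picked only for illustration. It uses a blank English pipeline (tokenizer only) so no trained model is needed.

```python
import spacy
from spacy.language import Language


@Language.component("semicolon_boundaries")  # hypothetical component name
def semicolon_boundaries(doc):
    """Start a new sentence after every semicolon."""
    for token in doc[:-1]:
        doc[token.i + 1].is_sent_start = token.text == ";"
    return doc


# A blank pipeline has no parser, so this component is the only
# source of sentence boundaries here.
nlp_custom = spacy.blank("en")
nlp_custom.add_pipe("semicolon_boundaries")

custom_doc = nlp_custom("I came; I saw; I conquered")
sentences = [sent.text for sent in custom_doc.sents]
print(sentences)
```

With a trained pipeline such as `en_core_web_sm`, a component like this would need to run before the parser (`nlp.add_pipe(..., before="parser")`), since boundaries can't be changed once the parse is set.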

As you can see above, `idx` (the "ID" column) stores the character offset of the token in the original string. See other attributes above such as `text_with_ws`, `is_punct`, and `is_stop`.

The tokenizer splits text using rules for prefixes (e.g. an opening parenthesis), suffixes (e.g. a closing parenthesis), and infixes (e.g. hyphens, dashes). These rules can be customized.

A stop word is one of the most common words in a language. Stop words distort any word frequency analysis. The trend has been to use fewer and fewer stop words, though!

Lemmatization returns the "lemma", the base (dictionary) form of a word: is -> be, organizing -> organize, talks -> talk. It also normalizes (getting rid of capitalization, breaking hyphens, etc.)

An example of lemmatization:

In [16]:
lemma_text = """
Believe me, dear Sir: there is not in the British empire a man who more cordially loves a union with
Great Britain than I do. But, by the God that made me, I will cease to exist before I yield to a connection
on such terms as the British Parliament propose; and in this, I think I speak the sentiments of America.

- Thomas Jefferson, November 29th, 1775
"""
lemma_doc = nlp(lemma_text)

for token in lemma_doc:
    if str(token) != str(token.lemma_):
        print(token, token.lemma_)

Believe believe
me I
is be
British british
loves love
But but
made make
me I
terms term
sentiments sentiment


In [17]:
# This kind of sounds funny haha
print(" ".join([token.lemma_ for token in lemma_doc]))


 believe I , dear Sir : there be not in the british empire a man who more cordially love a union with 
 Great Britain than I do . but , by the God that make I , I will cease to exist before I yield to a connection 
 on such term as the British Parliament propose ; and in this , I think I speak the sentiment of America . 

 - Thomas Jefferson , November 29th , 1775 



Use `Counter()` from the Python stdlib for counting words. Just pass in a list comprehension from the document filtering out stop words and punctuation!
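
A small sketch of that idea. The sample sentence is made up, and I'm using a blank English pipeline here because the stop-word and punctuation flags don't need a trained model.

```python
from collections import Counter

import spacy

# Stop-word and punctuation flags are lexical attributes, so a blank
# English pipeline (tokenizer only) is enough for this sketch.
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("The cat sat. The cat ran, and the dog ran!")

# Keep lowercase forms of everything that is not a stop word,
# punctuation, or whitespace.
words = [
    token.lower_
    for token in sample_doc
    if not token.is_stop and not token.is_punct and not token.is_space
]
word_counts = Counter(words)
print(word_counts.most_common(2))
```

Swapping `token.lower_` for `token.lemma_` would count lemmas instead, but that needs a pipeline with a lemmatizer, like the `nlp` object loaded earlier.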

POS (Part of Speech) tagging - typically 8 parts of speech:
1. Noun
2. Pronoun
3. Adjective
4. Verb
5. Adverb
6. Preposition
7. Conjunction
8. Interjection

You can use `token.tag_` and `token.pos_` to get the fine-grained tag and the coarse-grained part of speech, respectively. `spacy.explain(tag)` gives us an actual explanation, which is cool! Notice how it knows the difference between singular and plural nouns.

In [19]:
for token in lemma_doc:
    if token.pos_ == 'NOUN':
        print(str(token), '-', spacy.explain(token.tag_))

empire - noun, singular or mass
man - noun, singular or mass
union - noun, singular or mass
connection - noun, singular or mass
terms - noun, plural
sentiments - noun, plural
29th - noun, singular or mass


In [23]:
# Let's try out displacy!
spacy.displacy.render(lemma_doc, style='dep', jupyter=True)

Using the information provided by spacy, we can create our own preprocessing functions to filter out stop words, punctuation, strip whitespace, etc.

We can also conduct rule-based matching. It's more powerful than a regex because we have access to information about the actual function of the word in the sentence (tagging information).

In [30]:
phone_number_text = 'Hey, I just met you, and this is crazy, so here\'s my number, so call me maybe: (312)-699-5323'

matcher = spacy.matcher.Matcher(nlp.vocab)
phone_number_doc = nlp(phone_number_text)

# returns the text of the first match, or None if nothing matches
def extract_phone_number(nlp_doc):
    pattern = [ # the pattern we want to match
        {'ORTH': '('}, # ORTH is exact matching
        {'SHAPE': 'ddd'}, # SHAPE is a general shape we want to match, d for digits
        {'ORTH': ')'},
        {'ORTH': '-', 'OP': '?'}, # OP is the quantifier operator, ? is 0 or 1 instances
        {'SHAPE': 'ddd'},
        {'ORTH': '-', 'OP': '?'},
        {'SHAPE': 'dddd'}
    ]
    matcher.add('PHONE_NUMBER', [pattern])
    matches = matcher(nlp_doc)
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text
    
print(extract_phone_number(phone_number_doc))

None


In [31]:
print([tok.text for tok in phone_number_doc])

['Hey', ',', 'I', 'just', 'met', 'you', ',', 'and', 'this', 'is', 'crazy', ',', 'so', 'here', "'s", 'my', 'number', ',', 'so', 'call', 'me', 'maybe', ':', '(', '312)-699', '-', '5323']


Actually, I'm not going to change this, as this is a valuable lesson - sometimes, the way spacy breaks up your text makes it so that pattern matching is difficult! Make sure you observe at least one instance before you try to match it, or be careful about what kind of preprocessing you're doing.
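
For reference, here's a sketch of what "observe first, then match" could look like. The first pattern is the same idea as above and assumes the number is written with a space after the area code. The second pattern is my own guess at how to match the fused `312)-699` token shown in the token list above by its shape, so it depends on the tokenizer behaving as it did there.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer only: ORTH and SHAPE are lexical attributes, no trained model needed.
nlp_blank = spacy.blank("en")

phone_matcher = Matcher(nlp_blank.vocab)
phone_matcher.add(
    "PHONE_NUMBER",
    [
        # "(312) 699-5323": the parentheses and digit groups are separate tokens.
        [
            {"ORTH": "("},
            {"SHAPE": "ddd"},
            {"ORTH": ")"},
            {"SHAPE": "ddd"},
            {"ORTH": "-", "OP": "?"},
            {"SHAPE": "dddd"},
        ],
        # "(312)-699-5323": assumes "312)-699" stays fused as a single token,
        # so it is matched by its shape instead.
        [
            {"ORTH": "("},
            {"SHAPE": "ddd)-ddd"},
            {"ORTH": "-"},
            {"SHAPE": "dddd"},
        ],
    ],
)


def find_phone_number(doc):
    """Return the text of the first phone-number match, or None."""
    for _match_id, start, end in phone_matcher(doc):
        return doc[start:end].text
    return None


spaced_doc = nlp_blank("Call me maybe: (312) 699-5323")
print([token.text for token in spaced_doc])  # look at the tokens before trusting a pattern
print(find_phone_number(spaced_doc))
```

Registering the patterns once, outside the function, also avoids re-adding them on every call.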

Root (the head of the sentence) - the word that does not depend on any other word (usually the verb). The dependency relationships are many, but here are three examples:
- nsubj, a subject
- aux, an auxiliary word
- dobj, a direct object

Using these dependency relations, you can traverse the dependency tree. Index into the document to get the token you want to explore from, then use its `children`, `lefts`, `rights`, and `subtree` attributes.
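
A sketch of that traversal, assuming spaCy v3. To keep it independent of any trained model, the parse of the (made-up) sentence "Gus is learning piano" is annotated by hand; with a loaded pipeline the same attributes would come from `nlp(text)`.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-annotated parse: heads are token indices, and the root points to itself.
words = ["Gus", "is", "learning", "piano"]
heads = [2, 2, 2, 2]  # every word attaches to "learning"
deps = ["nsubj", "aux", "ROOT", "dobj"]
dep_doc = Doc(vocab, words=words, heads=heads, deps=deps)

# The root is the one token that is its own head.
root = next(token for token in dep_doc if token.head.i == token.i)

print("root:", root.text)
print("children:", [(child.text, child.dep_) for child in root.children])
print("lefts:", [token.text for token in root.lefts])
print("rights:", [token.text for token in root.rights])
print("subtree:", [token.text for token in root.subtree])
```

`lefts` and `rights` are the children before and after the token, and `subtree` is the token plus everything that depends on it, directly or indirectly.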

Shallow parsing/chunking - a document also provides `noun_chunks`; for verb chunks you need the separate `textacy` library.

Named Entity Recognition (NER) - finding named entities, such as names, organizations, locations, monetary expressions, etc. Exposed through `<document>.ents`. Useful for tasks like redacting names from text - learn about retokenizers later.
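
A rough sketch of redaction using `ents`, assuming spaCy v3. The entity labels are assigned by hand here so that no trained model is needed (a real pipeline would predict them), and the `redact` helper is a hypothetical function of mine that works on character offsets rather than retokenizers.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-labelled entities in IOB format, one tag per token.
words = ["Thomas", "Jefferson", "wrote", "this", "in", "1775", "."]
spaces = [True, True, True, True, True, False, False]
ents = ["B-PERSON", "I-PERSON", "O", "O", "O", "B-DATE", "O"]
ner_doc = Doc(vocab, words=words, spaces=spaces, ents=ents)


def redact(doc, labels=("PERSON",), placeholder="[REDACTED]"):
    """Replace every entity whose label is in `labels` with a placeholder."""
    pieces = []
    last_end = 0
    for ent in doc.ents:
        if ent.label_ in labels:
            pieces.append(doc.text[last_end:ent.start_char])
            pieces.append(placeholder)
            last_end = ent.end_char
    pieces.append(doc.text[last_end:])
    return "".join(pieces)


print([(ent.text, ent.label_) for ent in ner_doc.ents])
print(redact(ner_doc))
```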

## Learnings from 1/22/23
- NLP is a useful method to analyze unstructured text and derive meaningful insights. Those insights are only as good as the questions you ask of the text.
- spaCy provides tools for developers to access textual information about documents, such as part-of-speech tagging, stop words/punctuation/etc., and visualize the dependency relationships between words in a sentence.
- Using this information, we can begin to use statistics and ML to produce more text or modify existing text to automate certain human activities. Language recognition is an AI-hard problem, but current models are approaching human capabilities on certain specific tasks. Don't overestimate the power of these tools - as I saw, it's easy to make human mistakes (like I did with phone number parsing). They still require a lot of attention to detail and care to use properly.