# Introduction to Natural Language Processing

## Agenda

1. Natural Language Processing - Motivating Tasks
2. Classical NLP
    - Processing text for machine learning
    - Text Classification
    - Topic Modeling
3. Modern NLP - State of the Art models using Transformers

## Natural Language Processing

Natural Language Processing is a field combining linguistics and computer science to analyze natural language data and perform various tasks.

NLP is a broad topic that covers many different tasks. Common tasks include:

1. Text Classification - Predict the topic of a news article from a predefined set of topics.
2. Named Entity Recognition - Identify entities such as people, organizations and locations in a text, e.g. does "Apple" in a sentence refer to the company or to the fruit?
3. Question Answering - Given a context text, answer a question about it.
4. Text Summarization - Produce a summary of a given text.
5. Translation - Translate a text from English to Spanish.
6. Text Generation - Given a prompt, write a story.
7. Dialogue State Tracking - Given a conversation, record key facts about it.
8. Topic Modeling - Given a corpus of texts, discover common topics.

In [1]:
# Set up Google Colab runtime
import sys
import warnings
warnings.filterwarnings("ignore") # stop warnings for the sake of presentation
if "google.colab" in sys.modules:
    print("Setting up Google Colab... ")
    !git clone https://github.com/Strabes/Intro-to-NLP.git intro-to-nlp
    %cd intro-to-nlp
    from install import install_requirements
    install_requirements()
    import nltk  # imported here, after the requirements have been installed
    nltk.download("punkt")

## 20 Newsgroups Dataset

The 20 Newsgroups dataset is a classic dataset in NLP for document classification experiments. It consists of ~20K newsgroup posts that are classified into 20 topics.

In [2]:
from sklearn.datasets import fetch_20newsgroups

d_train = fetch_20newsgroups(
    subset="train",
    remove=('headers','footers','quotes'),
    shuffle=True,
    random_state=42)

d_test = fetch_20newsgroups(
    subset="test",
    remove=('headers','footers','quotes'),
    shuffle=True,
    random_state=42)

print("The topics in the dataset are: " +
  ", ".join([f"'{x}'" for x in d_train["target_names"]]))

The topics in the dataset are: 'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc'


In [3]:
for text in d_train["data"][:3]:
    print("-"*50)
    print(text.replace("\\"," "))

--------------------------------------------------
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
--------------------------------------------------
A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please 

NLP requires converting natural language documents to numeric representations and performing computations on these representations. The first step in **encoding** documents as numeric representations is to break them down into smaller units via **tokenization**. One obvious method of tokenization is **word tokenization**:

In [4]:
import nltk

example = d_train["data"][0]

def word_tokenize(text):
    x = nltk.word_tokenize(text.replace("\\"," "))
    return x

example_tokenized = word_tokenize(example)

print(example + "\n ==> ")
print(", ".join([f"'{t}'" for t in example_tokenized]))

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
 ==> 
'I', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'I', 'saw', 'the', 'other', 'day', '.', 'It', 'was', 'a', '2-door', 'sports', 'car', ',', 'looked', 'to', 'be', 'from', 'the', 'late', '60s/', 'early', '70s', '.', 'It', 'was', 'called', 'a', 'Bricklin', '.', 'The', 'doors', 'were', 'really', 'small', '.', 'In', 'addition', ',', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', '.', 'This', 'is', 'all', 'I', 'know', '.',
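
For intuition about what a word tokenizer does, the following is a minimal regex-based sketch. It is not NLTK's algorithm: `simple_word_tokenize` and `TOKEN_PATTERN` are hypothetical names introduced here, and the output can differ from `nltk.word_tokenize` on edge cases (for example, this sketch splits `60s/` into `60s` and `/`).

```python
import re

# Hypothetical pattern, not taken from NLTK: a word is a run of word
# characters, optionally joined by internal hyphens or apostrophes;
# any other non-space character becomes a token of its own.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_word_tokenize(text):
    """Split text into word and punctuation tokens with a single regex."""
    return TOKEN_PATTERN.findall(text)

print(simple_word_tokenize("It was a 2-door sports car, please e-mail."))
# ['It', 'was', 'a', '2-door', 'sports', 'car', ',', 'please', 'e-mail', '.']
```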

We then create a **vocabulary** that maps tokens to indices:

In [5]:
from itertools import islice
dict(islice({j:i for i,j in enumerate(set(example_tokenized))}.items(),10))

{'a': 0,
 'The': 1,
 'wondering': 2,
 'history': 3,
 '2-door': 4,
 'name': 5,
 'bumper': 6,
 'or': 7,
 'on': 8,
 'early': 9}

Of course, we build the vocabulary not from a single document but from a collection of documents, called a **corpus**. We also cap the vocabulary at a maximum number of tokens, keeping the most frequent ones.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

count_vectorizer = CountVectorizer(
    tokenizer = word_tokenize,
    max_features = 1000, # max number of tokens in our vocabulary
    lowercase = False)

# Train the vocabulary on 1000 example texts
count_vectorizer.fit(d_train["data"][:1000])

print("Here's a subset of our vocabulary:")
dict(islice(count_vectorizer.vocabulary_.items(),10))

Here's a subset of our vocabulary:


{'I': 146,
 'was': 948,
 'wondering': 979,
 'if': 572,
 'anyone': 313,
 'out': 731,
 'there': 893,
 'could': 414,
 'me': 659,
 'on': 718}
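
To make the effect of a vocabulary size limit concrete, here is a small pure-Python sketch under stated assumptions. `build_vocabulary` is a hypothetical helper, not scikit-learn code: it keeps the most frequent tokens across a toy corpus and indexes them in sorted order. How it breaks ties between equally frequent tokens is a choice made for this example only.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, max_features):
    """Map the `max_features` most frequent tokens to integer indices."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    # Rank by descending corpus frequency; break ties alphabetically
    # so the result is deterministic (an assumption of this sketch).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    kept = sorted(token for token, _ in ranked[:max_features])
    return {token: index for index, token in enumerate(kept)}

toy_corpus = [
    ["the", "car", "was", "fast"],
    ["the", "car", "was", "red"],
    ["the", "bumper", "was", "separate"],
]
print(build_vocabulary(toy_corpus, max_features=3))
# {'car': 0, 'the': 1, 'was': 2}
```

Rare tokens such as "bumper" fall outside the vocabulary and are ignored when documents are later converted to counts.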

## Classical NLP

Many traditional statistical and machine learning models expect data to be in a tabular format where columns correspond to specific features and rows correspond to individual observations (in the case of NLP, each document is treated as an individual observation).

The most common method of transforming a corpus of documents into a tabular format is **bag-of-words**, where each column in the table represents a given word (token) and the entry in row i, column j is the number of times word j occurs in document i.

In [7]:
pd.DataFrame(
    count_vectorizer.transform(d_train["data"][:1000]).toarray(),
    columns = count_vectorizer.get_feature_names_out().tolist())

Unnamed: 0,!,#,$,%,&,','','AS,'AX,'d,...,x-Soviet,year,years,yes,yet,you,your,{,|,}
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
2,0,0,0,0,0,0,3,0,0,1,...,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,2,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
996,0,0,0,0,0,1,3,0,0,0,...,0,0,0,0,0,1,0,0,0,0
997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
998,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,2,0
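
The same row-by-column counting can be written out by hand. The sketch below is a simplified illustration, not the `CountVectorizer` implementation; `bag_of_words` and the toy vocabulary are assumptions made for the example.

```python
from collections import Counter

def bag_of_words(tokenized_docs, vocabulary):
    """Return a document-term count matrix as a list of rows."""
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        row = [0] * len(vocabulary)
        for token, count in counts.items():
            # Tokens outside the vocabulary are dropped.
            if token in vocabulary:
                row[vocabulary[token]] = count
        matrix.append(row)
    return matrix

vocabulary = {"car": 0, "fast": 1, "red": 2}
docs = [["car", "car", "fast"], ["red", "car"], ["bumper"]]
for row in bag_of_words(docs, vocabulary):
    print(row)
# [2, 1, 0]
# [1, 0, 1]
# [0, 0, 0]
```

Most entries are zero even in this toy case, which is why scikit-learn stores the result as a sparse matrix and the notebook calls `.toarray()` only for display.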


## Improved token set

The simple word-level tokenization above produces a number of tokens that are unlikely to be informative. Given a fixed maximum number of tokens, we can generally produce a more informative set of tokens by:

1. Lowercasing: no reason to produce columns for both "Apple" and "apple"
2. Removing **stopwords**, a list of common tokens such as "the", "a", "this", etc. that are unlikely to add value in the bag of words approach.
3. Reducing morphological inflections:
   - **Stemming** uses heuristic rules to reduce morphological inflections by chopping common suffixes from tokens.
   - **Lemmatization** is a more sophisticated technique that generally uses vocabularies, word context and part-of-speech tagging to infer the correct lemma of a word. Lemmatization tends to be more computationally intensive than stemming.


In [8]:
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

stemming_example_words = ["stop","stops","stopping","stopped"]

for word in stemming_example_words:
    print(f"The stem of '{word}' is '{porter_stemmer.stem(word)}'")

The stem of 'stop' is 'stop'
The stem of 'stops' is 'stop'
The stem of 'stopping' is 'stop'
The stem of 'stopped' is 'stop'
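
Steps 1 and 2 of the list above (lowercasing and stopword removal) can be sketched in a few lines of plain Python. `TOY_STOPWORDS` and `normalize` are hypothetical names, and the stopword set here is deliberately tiny; real lists such as the one shipped with scikit-learn contain hundreds of entries.

```python
# A tiny illustrative stopword set (an assumption for this example).
TOY_STOPWORDS = {"the", "a", "this", "is", "on", "of"}

def normalize(tokens, stopwords=TOY_STOPWORDS):
    """Lowercase every token, then drop the ones found in `stopwords`."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in stopwords]

print(normalize(["The", "Apple", "is", "on", "the", "table"]))
# ['apple', 'table']
```

Lowercasing before filtering matters: "The" only matches the stopword "the" once it has been lowercased.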


In [9]:
import re
def word_tokenize(text):
    """Tokenize a string by:
    1. Word tokenizing
    2. Filtering out tokens that don't contain at least
       two consecutive alpha-numeric characters.
    3. Lowercase characters
    4. Apply one of the most common stemming algorithms,
       the Porter stemmer
    """
    x = nltk.word_tokenize(text.replace("\\"," "))
    x = [t for t in x if re.search("[A-Za-z0-9]{2,}",t)]
    x = [t.lower() for t in x]
    return [porter_stemmer.stem(t) for t in x]

count_vectorizer = CountVectorizer(
    tokenizer=word_tokenize,
    max_features=100,
    stop_words='english' # use a stopword list from scikit-learn
    )

count_vectorizer.fit(d_train["data"][:1000])

pd.DataFrame(count_vectorizer.transform(d_train["data"][:1000]).toarray(),
columns = count_vectorizer.get_feature_names_out().tolist())

Unnamed: 0,'ax,'re,'ve,145,6um,a86,ani,anoth,anyon,argument,...,use,veri,wa,want,way,whi,window,word,work,year
0,0,0,0,0,0,0,0,0,2,0,...,0,0,4,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,2,0,0,0,...,2,0,2,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
998,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0


## N-grams

One of the primary downsides of bag-of-words is that it loses all information contained in the order of the tokens. For example, both "Dog bites man" and "Man bites dog" produce the same bag-of-words representation. One method for partially overcoming this shortcoming is to use **n-grams**: new tokens created by combining `n` consecutive tokens.

In [10]:
import random
random.seed(1)

count_vectorizer = CountVectorizer(
    tokenizer=word_tokenize,
    max_features=1000,
    stop_words='english',
    ngram_range=(1,4)
    )

count_vectorizer.fit(d_train["data"])

random_sample_n_grams = random.sample(
    [i for i in count_vectorizer.get_feature_names_out() if re.search(" ",i)],10)

print(", ".join([f"'{t}'" for t in random_sample_n_grams]))


''ax max 'ax 'ax', ''ax 'ax max', 'a86 a86 a86 a86', ''ax 'ax 'ax max', 'a86 a86 a86', 'max 'ax', 'a86 a86', 'someth like', 'doe anyon', '1d9 1d9'
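
The "Dog bites man" example can be checked directly with a hand-rolled n-gram function. This is a minimal sketch; `ngrams` is a hypothetical helper and not the code path used inside `CountVectorizer`.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

first = ["dog", "bites", "man"]
second = ["man", "bites", "dog"]

# Same unigram counts, so the bag-of-words rows are identical ...
print(sorted(first) == sorted(second))  # True

# ... but the bigrams differ and preserve some word order.
print(ngrams(first, 2))   # ['dog bites', 'bites man']
print(ngrams(second, 2))  # ['man bites', 'bites dog']
```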


## Text Classification with Logistic Regression

Finally, we can use our matrix of vectorized texts to fit a logistic regression model for topic classification. We pipe the output of our `CountVectorizer` through a `TfidfTransformer` (TF-IDF: term frequency-inverse document frequency), which reweights each document's token counts so that tokens appearing in many documents of the corpus count for less. The result is then piped into a logistic regression classifier. We fit our pipeline on the training data.

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("count_vectorizer", count_vectorizer),
    ("tfidf_transformer", TfidfTransformer()),
    ("logistic_reg", LogisticRegression(multi_class='multinomial'))])

pipe.fit(d_train["data"],d_train["target"])
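
To show what the TF-IDF step does to a count matrix, here is a pure-Python sketch. It assumes the smoothed formula documented for scikit-learn's default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function name `tfidf` and the toy counts are illustrative only.

```python
import math

def tfidf(count_matrix):
    """Reweight raw counts by smoothed IDF, then L2-normalize each row."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    result = []
    for row in count_matrix:
        weighted = [count * idf[j] for j, count in enumerate(row)]
        norm = math.sqrt(sum(value * value for value in weighted))
        result.append([value / norm if norm > 0 else 0.0 for value in weighted])
    return result

# Term 0 occurs in every document; terms 1 and 2 occur in one document each.
counts = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
for row in tfidf(counts):
    print([round(value, 3) for value in row])
```

In the first row both terms occur once, yet the rarer term 1 receives the larger weight, which is the intended effect of the IDF factor.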

We evaluate the model using precision and recall. For a given topic:
  - **Precision** is the number of times we predicted that topic correctly divided by the number of times we predicted that topic (both correctly and incorrectly).
  - **Recall** is the number of times we predicted that topic correctly divided by the number of actual cases of the topic.

A model that guessed uniformly at random among the 20 topics would have precision and recall close to 1/20 = 0.05 for each topic, since the topics are roughly balanced. Our simple model is already much better than that:

In [12]:
from sklearn.metrics import classification_report

test_preds = pipe.predict(d_test["data"])

print(
    f"Classification report for classifier:\n" +
    f"""{classification_report(
        d_test['target'], test_preds,
        target_names=d_test['target_names'])}\n""")

Classification report for classifier:
                          precision    recall  f1-score   support

             alt.atheism       0.39      0.42      0.40       319
           comp.graphics       0.49      0.55      0.52       389
 comp.os.ms-windows.misc       0.53      0.48      0.50       394
comp.sys.ibm.pc.hardware       0.52      0.50      0.51       392
   comp.sys.mac.hardware       0.55      0.53      0.54       385
          comp.windows.x       0.63      0.54      0.58       395
            misc.forsale       0.68      0.72      0.70       390
               rec.autos       0.58      0.55      0.57       396
         rec.motorcycles       0.36      0.57      0.44       398
      rec.sport.baseball       0.57      0.57      0.57       397
        rec.sport.hockey       0.70      0.70      0.70       399
               sci.crypt       0.72      0.57      0.64       396
         sci.electronics       0.41      0.44      0.42       393
                 sci.med       0.60  

## State-of-the-Art NLP with Transformers

Nearly all state-of-the-art models in NLP are now deep neural networks built on the "transformer" architecture.

The transformer architecture was originally introduced in the paper [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf) by Vaswani et al. 2017 (Google Brain/Google Research).


![Transformer Architecture](images/transformer_architecture.png)

*Attention is All You Need* was concerned with translation, for instance from English to German or English to French. In the case of English to German, the input to the Transformer Encoder was the full sequence of English token indices from a document, $\mathbf{x} = (x_1, x_2, \ldots, x_n)$. The output of the Transformer Encoder was a corresponding sequence of vectors $\mathbf{z} = (z_1, z_2, \ldots, z_n)$.

The Transformer Decoder then generates an output sequence $\mathbf{y} = (y_1, y_2, \ldots, y_m)$ of German token indices one at a time. At each time step, the Transformer Decoder takes the Transformer Encoder output $\mathbf{z}$ and the part of $\mathbf{y}$ that it has already generated, $(y_1, \ldots, y_k)$ with $k < m$, and produces the next token $y_{k+1}$.

The Transformer was trained with backpropagation on several million English-German (or English-French) sentence pairs.
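
The one-token-at-a-time generation loop described above can be sketched without any neural network. Everything here is a stand-in: `toy_next_token` replaces the real Transformer Decoder with a canned lookup, the encoder output is a list of placeholder strings, and the `</s>` end marker is an assumption of this example. Only the control flow is meant to be illustrative.

```python
END = "</s>"  # hypothetical end-of-sequence marker

def toy_next_token(encoder_output, generated):
    """Stand-in for the Transformer Decoder.

    A real decoder would attend over `encoder_output` and the tokens in
    `generated`; this toy ignores the encoder output and returns a canned
    German word based only on how many tokens exist so far.
    """
    canned = ["der", "Hund", "bellt", END]
    return canned[len(generated)]

def greedy_decode(encoder_output, next_token_fn, max_len=10):
    """Generate tokens one at a time until END or max_len is reached."""
    generated = []
    while len(generated) < max_len:
        token = next_token_fn(encoder_output, generated)
        if token == END:
            break
        generated.append(token)  # the new token becomes part of the next input
    return generated

z = ["z1", "z2", "z3"]  # placeholder for the encoder output vectors
print(greedy_decode(z, toy_next_token))
# ['der', 'Hund', 'bellt']
```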

### GPT - Generative Pre-Training

In [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf), Radford et al. (2018, OpenAI) used just the Transformer Decoder and introduced a novel two-stage training paradigm:

**Pretraining (Unsupervised)**: Train the model on a large corpus of unlabeled text data (typically scraped from the web). Given a chunk of text as input, the model's objective is to predict the word that came next in the source text. For instance, the model receives the input `["The", "cat", "is", "on", "the"]` and must predict which word came next. Pretraining allows the model to learn general features of the language.

**Fine-tuning (Supervised)**: For a specific task with limited labeled training data, like classification or textual entailment, initialize a similar model architecture using the weights learned from pretraining and continue training for the specific task. This requires some creativity in formatting the input data to the model and attaching an additional output layer that is task-specific.
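
The pretraining objective needs no labels because the training pairs come from the text itself. The following sketch shows how one token sequence yields several (context, next word) examples; `next_word_examples` is a hypothetical helper, and the final word "mat" is an assumption added to complete the example sentence.

```python
def next_word_examples(tokens):
    """Turn one token sequence into (context, next word) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = ["The", "cat", "is", "on", "the", "mat"]  # "mat" is assumed
for context, target in next_word_examples(sentence):
    print(context, "->", target)
# The last pair printed is ['The', 'cat', 'is', 'on', 'the'] -> mat
```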

![GPT Pretraining and Fine-tuning](images/gpt_objectives.png)

Transformer Decoder models:

- OpenAI's GPT, GPT-2 and GPT-3
- EleutherAI's GPT-Neo
- Google's PaLM and LaMDA

### BERT - Bidirectional Encoder Representations from Transformers

Even though many languages are written and read left to right, the meaning of words and concepts in a sentence flows in both directions. The Transformer Decoder-only models do not take advantage of this bidirectionality. Consider the sentence: "The bat that flew through the night air almost hit the player on deck." If you only saw the first 8 words ("The bat that flew through the night air"), you would probably have a very different understanding of "bat" than if you saw the full sentence. The correct meaning of the word "bat" in this sentence flows backwards from "the player on deck".
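
A rough way to picture the difference is to list which other tokens a model may use when it builds its representation of "bat". This is a simplified sketch of left-to-right versus bidirectional context, not an implementation of attention; `visible_context` is a hypothetical helper.

```python
def visible_context(tokens, position, bidirectional):
    """Return the (left, right) tokens available when representing tokens[position]."""
    left = tokens[:position]
    # A left-to-right model cannot look at anything after the current token.
    right = tokens[position + 1:] if bidirectional else []
    return left, right

tokens = "The bat that flew through the night air almost hit the player on deck".split()
bat = tokens.index("bat")

print(visible_context(tokens, bat, bidirectional=False))
# (['The'], [])
left, right = visible_context(tokens, bat, bidirectional=True)
print("player" in right and "deck" in right)  # True
```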

In [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf), Devlin et al. (2019, also at Google) took just the Transformer Encoder from *Attention is All You Need* and extended the pretraining/fine-tuning approach to produce models that take advantage of bidirectionality and that achieved state-of-the-art performance.

Pretraining: