# Introduction to Natural Language Processing

## Agenda

1. Natural Language Processing - Motivating Tasks
2. Classical NLP
    - Processing text for machine learning
    - Text Classification
    - Topic Modeling
3. Modern NLP - State of the Art models using Transformers

## Natural Language Processing

Natural Language Processing is a field combining linguistics and computer science to analyze natural language data and perform various tasks.

NLP is a broad topic that covers many different tasks. Common tasks include:

1. Text Classification - Predict the topic of a news article from a predefined set of topics.
2. Text Regression - Given the text of a review on Amazon, predict the number of stars.
3. Named Entity Recognition - If "apple" is used in a sentence, does it refer to the fruit or the company?
4. Question Answering - Given a context text, answer a question about it.
5. Text Summarization - Produce a summary of a given text.
6. Translation - Translate English to Spanish.
7. Text Generation - Given a prompt, write a story.
8. Dialogue State Tracking - Given a conversation, record key facts about it.
9. Topic Modeling - Given a corpus of texts, discover common topics.

In [1]:
# Set up Google Colab runtime
import sys
if "google.colab" in sys.modules:
    print("Setting up Google Colab... ")
    !git clone https://github.com/Strabes/Intro-to-NLP.git intro-to-nlp
    %cd intro-to-nlp
    from install import install_requirements
    install_requirements()

In [2]:
from datasets import load_dataset
dataset = load_dataset("ag_news")

Using custom data configuration default
Reusing dataset ag_news (C:\Users\grego\.cache\huggingface\datasets\ag_news\default\0.0.0\bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548)
100%|██████████| 2/2 [00:00<00:00, 31.74it/s]


In [9]:
for text in dataset["train"]["text"][:5]:
    print("-"*50)
    print(text.replace("\\"," "))

--------------------------------------------------
Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling band of ultra-cynics, are seeing green again.
--------------------------------------------------
Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group, which has a reputation for making well-timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market.
--------------------------------------------------
Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.
--------------------------------------------------
Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export flows from the main pipeline in southern Iraq

NLP requires translating natural language documents into numeric representations and performing computations on those representations. The first step in **encoding** documents numerically is breaking them down into smaller units via **tokenization**. One obvious method is **word tokenization**:

In [38]:
import nltk

example = dataset["train"]["text"][0]

def word_tokenize(text):
    x = nltk.word_tokenize(text.replace("\\"," "))
    return x

example_tokenized = word_tokenize(example)

print(example + "\n ==> ")
print(", ".join([f"'{t}'" for t in example_tokenized]))

Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
 ==> 
'Wall', 'St.', 'Bears', 'Claw', 'Back', 'Into', 'the', 'Black', '(', 'Reuters', ')', 'Reuters', '-', 'Short-sellers', ',', 'Wall', 'Street', ''s', 'dwindling', 'band', 'of', 'ultra-cynics', ',', 'are', 'seeing', 'green', 'again', '.'


Many traditional statistical and machine learning models expect data to be in a tabular format where columns correspond to specific features and rows correspond to individual observations (in the case of NLP, each document is treated as an individual observation).

The most common method of transforming a corpus of documents into a tabular format is **bag of words**, where each column in the table represents a given word (token) and the entry in row i, column j is the number of times word j occurs in document i. We usually create columns only for the most common words, say the top 1,000 words occurring in the corpus (the `CountVectorizer` example below sets `max_features=100`).
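
To make the row/column description above concrete, here is a minimal sketch of bag of words using only the standard library. The three toy documents, the whitespace tokenizer, and the `max_features = 3` cutoff are assumptions chosen for illustration; they are not taken from the AG News data.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = ["the cat sat on the mat", "the dog sat", "the cat ran"]
tokenized = [doc.split() for doc in docs]  # naive whitespace tokenization

# Keep only the most frequent tokens across the whole corpus as columns
max_features = 3
totals = Counter(token for doc in tokenized for token in doc)
columns = sorted(token for token, _ in totals.most_common(max_features))

# Entry [i][j] is the number of times columns[j] occurs in document i
matrix = [[Counter(doc)[token] for token in columns] for doc in tokenized]

print(columns)  # ['cat', 'sat', 'the']
for row in matrix:
    print(row)
# [1, 1, 2]
# [0, 1, 1]
# [1, 0, 1]
```

Words outside the kept columns ("on", "mat", "dog", "ran") are simply dropped, which is the same trade-off the `max_features` argument makes.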

In [39]:
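from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Keep only the 100 most frequent tokens as columns
count_vectorizer = CountVectorizer(tokenizer=word_tokenize, max_features=100)

# Learn the vocabulary from the first 1,000 training documents
count_vectorizer.fit(dataset["train"]["text"][:1000])

# Document-term matrix: one row per document, one column per token
pd.DataFrame(
    count_vectorizer.transform(dataset["train"]["text"][:1000]).toarray(),
    columns=count_vectorizer.get_feature_names_out().tolist())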



Unnamed: 0,#,$,&,','','s,(,),",",-,...,was,week,what,which,who,will,with,world,year,you
0,0,0,0,0,0,1,1,1,2,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,1,2,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,1,1,0,1,...,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,1,3,1,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
996,0,0,0,0,0,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
997,0,0,0,0,0,0,1,1,2,1,...,0,1,0,0,0,0,1,0,0,0
998,0,0,0,0,0,2,1,1,3,0,...,0,0,0,0,0,0,1,0,0,0


In [46]:
import re

def word_tokenize(text):
    x = nltk.word_tokenize(text.replace("\\", " "))
    # Filter out tokens that don't contain at least
    # two consecutive alphanumeric characters
    x = [t for t in x if re.search(r"[A-Za-z0-9]{2,}", t)]
    return [t.lower() for t in x]

# Same vectorizer as before, now also dropping English stop words
count_vectorizer = CountVectorizer(
    tokenizer=word_tokenize,
    max_features=100,
    stop_words='english')

count_vectorizer.fit(dataset["train"]["text"][:1000])

pd.DataFrame(
    count_vectorizer.transform(dataset["train"]["text"][:1000]).toarray(),
    columns=count_vectorizer.get_feature_names_out().tolist())



Unnamed: 0,according,afp,ap,athens,billion,bush,charley,chavez,city,company,...,web,week,win,windows,work,world,year,years,yesterday,york
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
997,0,0,2,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
998,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [44]:
re.search("[A-Za-z0-9]{2,}","hello")

<re.Match object; span=(0, 5), match='hello'>
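
The effect of the two-consecutive-alphanumerics filter is easier to see on a handful of tokens. The sketch below applies the same regular expression to a hand-picked token list; the list itself and the helper name `keep_token` are hypothetical and exist only for this illustration.

```python
import re

# Same pattern as the filter in word_tokenize above
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]{2,}")

def keep_token(token):
    """Return True if the token has at least two consecutive alphanumerics."""
    return TOKEN_PATTERN.search(token) is not None

# Hand-picked tokens (hypothetical), similar to what nltk.word_tokenize emits
tokens = ["Wall", "St.", "(", "Reuters", ")", "-", "'s", "a", "Short-sellers", ",", "39"]
kept = [t.lower() for t in tokens if keep_token(t)]

print(kept)  # ['wall', 'st.', 'reuters', 'short-sellers', '39']
```

Note that the filter removes punctuation and clitics such as `'s`, but it also removes one-letter words like "a", so it overlaps somewhat with stop-word removal.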

In [24]:
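from collections import Counter

def build_vocab(texts, tokenizer, max_tokens=10000):
    """Map each of the max_tokens most frequent tokens to an integer id."""
    # Count every token across all documents
    counter = Counter(
        token for tokens in map(tokenizer, texts)
        for token in tokens)

    # Ids follow frequency rank: the most common token gets id 0
    return {t[0]: i for i, t in enumerate(counter.most_common(max_tokens))}

vocab = build_vocab(dataset["train"]["text"], word_tokenize)
vocab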

{'the': 0,
 ',': 1,
 '.': 2,
 'a': 3,
 'to': 4,
 'of': 5,
 'in': 6,
 'and': 7,
 "'s": 8,
 'for': 9,
 '(': 10,
 ')': 11,
 ';': 12,
 'on': 13,
 'that': 14,
 'is': 15,
 '-': 16,
 'ap': 17,
 'it': 18,
 'by': 19,
 "''": 20,
 '&': 21,
 'as': 22,
 'at': 23,
 'with': 24,
 'are': 25,
 ':': 26,
 'from': 27,
 'new': 28,
 '--': 29,
 '...': 30,
 'its': 31,
 '``': 32,
 'be': 33,
 'this': 34,
 'an': 35,
 'reuters': 36,
 'has': 37,
 'but': 38,
 'have': 39,
 'google': 40,
 'will': 41,
 '?': 42,
 'lt': 43,
 'gt': 44,
 'i': 45,
 'more': 46,
 "'": 47,
 'they': 48,
 'said': 49,
 'you': 50,
 'could': 51,
 'was': 52,
 'his': 53,
 'can': 54,
 'about': 55,
 'their': 56,
 "n't": 57,
 'first': 58,
 'space': 59,
 'after': 60,
 'one': 61,
 'up': 62,
 '#': 63,
 'company': 64,
 'space.com': 65,
 'or': 66,
 'into': 67,
 'over': 68,
 'who': 69,
 'scientists': 70,
 'than': 71,
 'what': 72,
 'not': 73,
 'out': 74,
 'all': 75,
 'some': 76,
 'software': 77,
 'which': 78,
 'off': 79,
 'ipo': 80,
 'if': 81,
 'may': 82,
 'he
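
Once a vocabulary like the one above exists, a document can be encoded as a sequence of integer ids. The following is a minimal sketch under stated assumptions: it repeats `build_vocab` so the block is self-contained, uses `str.split` in place of the NLTK tokenizer, and introduces a hypothetical `encode` helper that maps out-of-vocabulary tokens to one extra id (`len(vocab)`). None of these choices come from the notebook itself.

```python
from collections import Counter

def build_vocab(texts, tokenizer, max_tokens=10000):
    """Map each of the max_tokens most frequent tokens to an integer id."""
    counter = Counter(
        token for tokens in map(tokenizer, texts)
        for token in tokens)
    return {t[0]: i for i, t in enumerate(counter.most_common(max_tokens))}

def encode(text, vocab, tokenizer):
    """Hypothetical helper: convert text to ids, using len(vocab) for unknowns."""
    unk_id = len(vocab)
    return [vocab.get(token, unk_id) for token in tokenizer(text)]

# Toy corpus (hypothetical), tokenized on whitespace for simplicity
texts = ["the cat sat", "the dog sat", "the cat ran"]
vocab = build_vocab(texts, str.split, max_tokens=3)
encoded = encode("the dog sat", vocab, str.split)

print(vocab)    # {'the': 0, 'cat': 1, 'sat': 2}
print(encoded)  # [0, 3, 2]
```

Reserving a single id for unknown tokens is one common way to keep the vocabulary bounded; which tokens fall out of the vocabulary depends on `max_tokens`.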

One of the key properties of a natural language is that its grammar is a discrete combinatorial system: from a small set of building blocks (words) we can build an unlimited number of distinct combinations (sentences, paragraphs, documents) with distinct meanings. This poses a problem for computers because we need to translate natural language into numeric representations and perform computations on them. With limited computational resources, we have to simplify the documents we process.
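
A quick back-of-the-envelope calculation illustrates the scale of the problem. The vocabulary sizes and the sequence length below are assumed round numbers, and the counts are upper bounds on word sequences, most of which are not grammatical sentences.

```python
# Assumed round numbers, for illustration only
vocab_size = 10_000       # distinct words available
reduced_vocab_size = 1_000
sequence_length = 10      # words per sequence

# Each position can hold any word, so the count is vocab_size ** length
n_sequences = vocab_size ** sequence_length
n_reduced = reduced_vocab_size ** sequence_length

print(n_sequences == 10 ** 40)   # True
print(n_sequences // n_reduced)  # 10000000000
```

Even a modest reduction in vocabulary size shrinks the space of possible sequences by many orders of magnitude, which is one motivation for the simplifications discussed here.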

In [None]:
from nltk.stem.porter import PorterStemmer
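
Stemming is one way to simplify documents: it collapses inflected forms of a word into a shared stem so that they occupy a single vocabulary entry. The sketch below assumes NLTK is installed and uses a hand-picked word list; it is an illustration, not necessarily how the rest of this notebook applies the stemmer.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hand-picked inflections of one word (hypothetical example list)
words = ["connect", "connected", "connecting", "connection", "connections"]
stems = [stemmer.stem(word) for word in words]

print(stems)       # ['connect', 'connect', 'connect', 'connect', 'connect']
print(set(stems))  # {'connect'}
```

Five distinct surface forms become one token, which shrinks the vocabulary at the cost of losing distinctions such as tense and number.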

In [5]:
dataset["train"]["text"][:5]

["Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.',
 "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.",
 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.',
 'Oil prices soar to all-time record, posing new menace to US economy (A