# Notebook 3 - Classification and Working With Text

This notebook is heavily indebted to [Prof. Aaron Culotta's NLP Course](https://www.cs.tulane.edu//~aculotta/).

In [None]:
## This is for Colab: clone the course repository (which contains the data)
## and change to the notebooks directory.
%cd /content
!git clone https://github.com/TulaneCS/intd2810.git
%cd /content/intd2810/_notebooks

### Natural vs. Unnatural (Formal) Languages

**Natural**
- Emerges from intelligent beings
- We **discover** the grammar.
- Full of ambiguity
- English, Spanish, Dolphin Language?

**Formal**
- Defined by humans
- We **prescribe** the grammar.
- Designed to **remove** ambiguity
- Python, math, ...
<br><br><br><br><br><br>

What are some examples of NLP?

![figs/watson.jpg](https://github.com/tulane-cmps6730/main/blob/main/lec/overview/figs/watson.jpg?raw=1)
<br><br><br><br><br><br>

![figs/siri.png](https://github.com/tulane-cmps6730/main/blob/main/lec/overview/figs/siri.png?raw=1)
<br><br><br><br><br><br>

![figs/translate.jpg](https://github.com/tulane-cmps6730/main/blob/main/lec/overview/figs/translate.jpg?raw=1)
<br><br><br><br><br><br>

![figs/echo.jpg](https://github.com/tulane-cmps6730/main/blob/main/lec/overview/figs/echo.jpg?raw=1)
<br><br><br><br><br><br>

![figs/her.jpg](https://github.com/tulane-cmps6730/main/blob/main/lec/overview/figs/her.jpg?raw=1)

![figs/chatgpt.png](https://github.com/tulane-cmps6730/main/blob/main/lec/overview/figs/chatgpt.png?raw=1)
<br><br><br><br><br><br>

## What is NLP??

<br><br><br><br>

**Natural Language Processing = Linguistics + Logic + Machine Learning**

Each has a long history...

<br><br><br><br>

## NLP History

- Dates back to first days of computing (Turing)
- Combines linguistics, formal logic, and statistics

### 1940s-1950s

**Logic**
- automaton (Turing)
- Context-free grammars (Chomsky)

**Statistics**
- McCulloch-Pitts neuron (precursor to the perceptron)
- Noisy channel model (Shannon)

### Turing test (Alan Turing, 1950)


Q: Please write me a sonnet on the topic of the Forth Bridge.  
A: Count me out on this one. I never could write poetry.  


Q: Add 34957 to 70764  
A: (Pause about 30 seconds and then give answer as) 105721.

<br><br><br><br>

### 1957-1970

**Logic**
- Theorem proving (Newell & Simon)
- Logical inference

**Statistics**
- Neural nets
- Bayesian statistics (e.g., identify author of text)


#### ELIZA (1964, Weizenbaum)

User: You are like my father in some ways.  
ELIZA: WHAT RESEMBLANCE DO YOU SEE  
User: You are not very aggressive but I think you don’t want me to notice that.  
ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE  
User: You don’t argue with me.  
ELIZA: WHY DO YOU THINK I DON’T ARGUE WITH YOU?  
User: You are afraid of me.  
ELIZA: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU

<br><br>

Online example: <http://www.masswerk.at/elizabot/>
<br><br><br><br>

### 1970-1983

**Logic**
  - Winograd's SHRDLU

**Statistics**
  - speech recognition (AT&T Bell Labs, IBM)
  - Hidden Markov Models

![figs/shrdlu.png](https://github.com/tulane-cmps6730/main/blob/main/lec/overview/figs/shrdlu.png?raw=1)


### 1983-1993
- Return to finite state models
- Empirical (data-driven) approach: IBM speech recognition

### 1994-1999
- Empirical approach widespread
- Bayesian statistics
- graphical models

### 2000s
- Combinations of logical and empirical approaches
  - Markov logic networks, etc.
- Deep learning
  - revival of neural nets from 1960s
<br><br><br><br><br><br>

![figs/gpt_growth.png](https://github.com/tulane-cmps6730/main/blob/main/lec/overview/figs/gpt_growth.png?raw=1)

(Parmida Beigi, Amazon)

## Why is NLP Hard?

### Ambiguity: The Good and the Bad

- Makes language fun and interesting for humans, but makes language difficult for computers.
- The central problem of NLP is **resolving ambiguity**.


- E.g., "*I made her duck*."

<br><br><br><br><br><br><br><br>



1. I cooked waterfowl for her.
2. I cooked waterfowl belonging to her.
3. I created the (plaster?) duck she owns.
4. I caused her to quickly lower her head or body.
5. I waved my magic wand and turned her into undifferentiated waterfowl.


- Syntactic ambiguity (1 vs 4): "duck" $\rightarrow$ verb or noun?  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **part-of-speech tagging, syntactic parsing**
- Semantic ambiguity (1 vs 3): "make" $\rightarrow$ *create* or *cook*? &nbsp;&nbsp; **word sense disambiguation**
- Phonetic ambiguity: "I" or "eye"; "made" or "maid"?  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **speech recognition**

## A Little Bit of Machine Learning

<img src='https://github.com/tulane-cmps6730/main/blob/main/lec/classify/figs/spam.png?raw=1'/>

### A little bit of notation...

- $\vec{x} \in X$ &nbsp;&nbsp;&nbsp;&nbsp; *instance*, *example*, *input*
  - e.g., an email
  
  
- $y \in Y$ &nbsp;&nbsp;&nbsp;&nbsp; *target*, *class*, *label*, *output*
  - e.g., $y=1$: spam ; $y=0$: not spam
  
  
- $f: X \mapsto Y$ &nbsp;&nbsp;&nbsp;&nbsp; *hypothesis*, *learner*, *model*, *classifier*
  - e.g., if $x$ contains the word *free*, $y$ is $1$.
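
The example hypothesis above can be written as a tiny Python function. This is only a sketch for illustration: the function name `f` mirrors the notation, and the punctuation handling is an assumption, not part of the original slides.

```python
def f(x: str) -> int:
    """Toy hypothesis: return 1 (spam) if the email contains the word 'free'."""
    # Lowercase and strip surrounding punctuation so "Free!" counts as "free".
    words = [w.strip("!?.,") for w in x.lower().split()]
    return 1 if "free" in words else 0

emails = ["Free money!", "How are you?", "Are you free tonight?"]
for email in emails:
    print(email, "->", f(email))
```

Note that the last email is labeled spam even though it probably is not: a hand-written rule like this is a valid hypothesis, just not a very good one.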

### Important Problems

- **Classification**
  - $\vec{x}$: email ;  $y$: spam or not
- **Regression**
  - $\vec{x}$: Twitter feed of a person ; $y$: age
- **Clustering**
  - $\vec{x}$: news articles ; $y$: topics

### The Basic Workflow

1. **Collect** raw data: emails
2. Manually **categorize** them:  spam or not
3. **Vectorize**: email -> word counts [**features**]
4. **Train** / **Fit**: create $f(x)$
5. **Collect** new raw data
6. **Predict**: compute $f(x)$ for new $x$


## Example: Spam Classification

**Steps 1 & 2: Collect and categorize**

**Spam:**

> Free credit report!


> Free money!


**Not spam:**

> Are you free tonight?

> How are you?


**Step 3: Vectorize**

> 'Free money!'

becomes

```
free: 1
money: 1
!: 1
?: 0
credit: 0
...
```
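
A minimal sketch of this vectorization step, using only the standard library. The tokenization rule (lowercase words plus the marks `!` and `?`) and the helper name `vectorize` are assumptions chosen to reproduce the counts listed above.

```python
import re
from collections import Counter

def vectorize(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    # Tokens are runs of letters, or a single ! or ? character.
    tokens = re.findall(r"[a-z]+|[!?]", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for terms that never occurred.
    return {term: counts[term] for term in vocabulary}

vocabulary = ["free", "money", "!", "?", "credit"]
features = vectorize("Free money!", vocabulary)
print(features)
# {'free': 1, 'money': 1, '!': 1, '?': 0, 'credit': 0}
```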

**Representation**: "Feature engineering is the key" -- Domingos

Why is this (seemingly) a terrible representation of a document?

<br><br><br>

When working with text, we're going to use basic logistic regression and a bag-of-words model.

**Bag of Words**

![bow](https://github.com/tulane-cmps6730/main/blob/main/lec/classify/figs/bow.png?raw=1)


**Step 4: Train/Fit**

Which model to use?

- Naive Bayes
- Logistic Regression
- Decision Tree
- K-Nearest Neighbors
- Support Vector Machines
- (Deep) Neural Networks
- ... many many more

**Steps 5-6: Predict on new data**

> Free vacation!

**Spam**

But... How do you know if it works???

In [None]:
# Common Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

# Make the fonts a little bigger in our graphs.
font = {'size'   : 14}
plt.rc('font', **font)
plt.rcParams['mathtext.fontset'] = 'cm'
plt.rcParams['pdf.fonttype'] = 42

## A Simple Classifier... and Why It's Hard

**Spam:**

> Free free credit report free!

> Cousin, are you free for free today?

> Free money!


**Not spam:**

> Are you available tonight?

> Cousin how are you?

In [None]:
# Each row is one document: the counts of 'cousin' and 'free', plus its label y.
data = pd.DataFrame([
    (0,0,0),
    (1,0,0),
    (0,3,1),
    (1,3,1),
  ],columns=['cousin', 'free', 'y'])
data

In [None]:
!pip install -q hvplot
import hvplot.pandas
import holoviews as hv
from matplotlib.colors import ListedColormap
cmap = ListedColormap(['blue', 'red'])
hv.extension('bokeh')
data.hvplot(x='cousin', y='free', kind='scatter', c='y', cmap=cmap, s=200)

In [None]:
# We often separate the data into feature matrix X and the target vector y
X = data[['cousin', 'free']].values
y = data['y'].values
display(X)
display(y)

In [None]:
# Simplest machine learning algorithm:

class SimplestMachine:

    def __init__(self):
        self.f = dict()

    def train(self, X, y):
        for xi, yi in zip(X, y):
            self.f[tuple(xi)] = yi

    def predict(self, x):
        return self.f[tuple(x)]

# What does this do?

In [None]:
# What does zip do?
[x for x in zip([1, 2, 3], ['a', 'b', 'c', 'd'])]

In [None]:
# create the classifier
simplest_machine = SimplestMachine()
# train the classifier
simplest_machine.train(X, y)
# predict
data['prediction'] = [simplest_machine.predict(xi) for xi in X]
data

In [None]:
# What does it do for an unseen example?
simplest_machine.predict((0, 4))

In [None]:
# Second simplest machine learning algorithm:
class SimpleMachine:

    def __init__(self):
        self.f = dict()

    def train(self, X, y):
        for xi, yi in zip(X, y):
            self.f[tuple(xi)] = yi

    def predict(self, x):
        x_closest = self.find_most_similar(x)
        return self.f[x_closest]

    def find_most_similar(self, x):
        distances = [self.distance(x, xi) for xi in self.f.keys()]
        best_idx = np.argmin(distances)
        return list(self.f.keys())[best_idx]

    def distance(self, x, xi):
        return np.sqrt(np.sum((np.array(x)-np.array(xi))**2))

# What does this do?

**Euclidean distance:**   

```
(0, 3)
(1, 5)
```

$$\sqrt{(0-1)^2 + (3-5)^2} = \sqrt{5}$$
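
As a quick check of the arithmetic above, the same distance can be computed with NumPy. The explicit formula matches the `distance` method in `SimpleMachine`; `np.linalg.norm` is an equivalent built-in.

```python
import numpy as np

a = np.array([0, 3])
b = np.array([1, 5])

# Explicit formula: square root of the sum of squared differences.
manual = np.sqrt(np.sum((a - b) ** 2))
# Equivalent built-in: the Euclidean (L2) norm of the difference vector.
builtin = np.linalg.norm(a - b)

print(manual, builtin)  # both equal sqrt(5), about 2.236
```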

In [None]:
simple_machine = SimpleMachine()
simple_machine.train(X, y)
data['prediction'] = [simple_machine.predict(xi) for xi in X]
data

In [None]:
# What does it do for an unseen example?
print(simple_machine.predict((0, 4)))
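
`SimpleMachine` is a 1-nearest-neighbor classifier. As a hedged sketch, scikit-learn's `KNeighborsClassifier` gives the same behavior on the toy data and also lets several neighbors vote; the `_knn` variable names are made up for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Same toy data as above: columns are the counts of 'cousin' and 'free'.
X_knn = np.array([[0, 0], [1, 0], [0, 3], [1, 3]])
y_knn = np.array([0, 0, 1, 1])

# k=1 copies the label of the single closest training point.
one_nn = KNeighborsClassifier(n_neighbors=1).fit(X_knn, y_knn)
print(one_nn.predict([[0, 4]]))  # [1], because the closest point is (0, 3)

# k=3 takes a majority vote among the three closest training points.
three_nn = KNeighborsClassifier(n_neighbors=3).fit(X_knn, y_knn)
print(three_nn.predict([[0, 4]]))  # [1], votes are 1, 1, 0
```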

<img src='https://github.com/tulane-cmps6730/main/blob/main/lec/classify/figs/knn.png?raw=1' width='80%'/>

<http://www.scholarpedia.org/article/K-nearest_neighbor>

### Generalization

How accurate will I be on a new, unobserved example?

How do you know if it works?

1. Train on data ${D_1}$
2. Predict on data ${D_2}$
3. Compute accuracy on ${D_2}$.
   - Why not ${D_1}$?
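
One way to see why accuracy on ${D_1}$ is misleading: a model that memorizes its training data looks perfect there. The sketch below assumes synthetic data from `make_classification` (with some label noise) and a 1-nearest-neighbor model as the memorizer; none of this comes from the course data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data; flip_y=0.2 assigns random labels to 20% of the examples.
X_syn, y_syn = make_classification(n_samples=500, n_features=10,
                                   flip_y=0.2, random_state=0)
X_d1, X_d2, y_d1, y_d2 = train_test_split(X_syn, y_syn,
                                          test_size=0.3, random_state=0)

# 1-nearest-neighbor memorizes D1: each training point is its own neighbor.
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_d1, y_d1)
acc_d1 = memorizer.score(X_d1, y_d1)
acc_d2 = memorizer.score(X_d2, y_d2)

print(f"accuracy on D1 (training data): {acc_d1:.3f}")
print(f"accuracy on D2 (unseen data):   {acc_d2:.3f}")
```

The accuracy on ${D_1}$ is 1.0 by construction, so it says nothing about how the model handles new examples; only the ${D_2}$ number estimates generalization.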

How do you know if it works?

1. Train on data ${D_1}$
2. Predict on data ${D_2}$
3. Compute accuracy on ${D_2}$.
4. Tweak algorithm / representation
5. Repeat

How many times can I do this?

#### Measuring Generalization

- Cross-validation
  - train on 90%, test on 10%, repeat 10 times
       - each example appears exactly once in a test set
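
The 10-fold scheme described above can be sketched with scikit-learn's `KFold` and `cross_val_score`. The synthetic data and the choice of `LogisticRegression` are assumptions made only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X_cv, y_cv = make_classification(n_samples=200, n_features=10, random_state=0)

# 10 folds: each round trains on 90% of the data and tests on the other 10%.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# Collect the test indices of every fold to confirm that each example
# lands in a test set exactly once.
test_indices = np.concatenate([test for _, test in kfold.split(X_cv)])
print(sorted(test_indices.tolist()) == list(range(len(X_cv))))  # True

# One accuracy score per fold; the mean is the cross-validation accuracy.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000),
                            X_cv, y_cv, cv=kfold)
print(cv_scores.mean())
```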
       
       
<p><a href="https://commons.wikimedia.org/wiki/File:K-fold_cross_validation_EN.svg#/media/File:K-fold_cross_validation_EN.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/K-fold_cross_validation_EN.svg/1200px-K-fold_cross_validation_EN.svg.png" alt="K-fold cross validation EN.svg"></a><br><font size=-2>By <a href="//commons.wikimedia.org/w/index.php?title=User:Gufosowa&amp;action=edit&amp;redlink=1" class="new" title="User:Gufosowa (page does not exist)">Gufosowa</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=82298768">Link</a></font></p>


#### Experimental Design

1. Collect data
2. Build model
3. Compute cross-validation accuracy
4. Tune model
5. Repeat
6. **Report accuracy on new data**
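
A hedged sketch of this design: hold out a final test set first, tune using cross-validation on the rest, and touch the held-out set only once at the end. The synthetic data, the candidate values of `C`, and the variable names are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Step 1: collect data, then set aside "new" data that tuning never sees.
X_all, y_all = make_classification(n_samples=400, n_features=20, random_state=0)
X_dev, X_final, y_dev, y_final = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0)

# Steps 2-5: build, cross-validate, tune, repeat (here: try several C values).
cv_accuracy = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    candidate = LogisticRegression(C=C, max_iter=1000)
    cv_accuracy[C] = cross_val_score(candidate, X_dev, y_dev, cv=5).mean()
best_C = max(cv_accuracy, key=cv_accuracy.get)

# Step 6: refit with the chosen setting and report on the held-out data once.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_dev, y_dev)
final_accuracy = final_model.score(X_final, y_final)

print(cv_accuracy)
print(best_C, final_accuracy)
```

Because the tuning loop only ever looks at cross-validation scores, the final number is not inflated by the repeated tweaking.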

<br><br><br>

- What is overfitting? How do you know it is happening? How do you fix it?


<img src="https://hackernoon.com/hn-images/1*SBUK2QEfCP-zvJmKm14wGQ.png"/>


<img src="https://hackernoon.com/hn-images/1*xWfbNW3arf39wxk4ZkI2Mw.png"/>

[source](https://hackernoon.com/memorizing-is-not-learning-6-tricks-to-prevent-overfitting-in-machine-learning-820b091dc42)


If overfitting:
- get more labeled data
- reduce complexity of model (fewer parameters)
- stop the training function early

If underfitting:
- increase complexity of model (more parameters)
- let the training function run longer
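
To make the link between model complexity and overfitting concrete, here is a small sketch that varies the depth of a decision tree and compares training and test accuracy. The noisy synthetic data and the choice of `DecisionTreeClassifier` are assumptions for illustration, not part of the course material.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y=0.2 assigns random labels to 20% of examples.
X_fit, y_fit = make_classification(n_samples=500, n_features=20,
                                   flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_fit, y_fit,
                                          test_size=0.3, random_state=0)

# max_depth controls complexity; None lets the tree grow until it fits
# every training example.
fit_results = {}
for depth in [1, 3, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    fit_results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={fit_results[depth][0]:.3f} "
          f"test={fit_results[depth][1]:.3f}")
```

A large gap between training and test accuracy (as with the unrestricted tree, which fits the training set perfectly, noise included) is the usual sign of overfitting; low accuracy on both suggests underfitting.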


## Text Classification

In this section we walk through a light example of preparing a text dataset and training a classifier on it.

The data comes from [this website](https://www.cs.cornell.edu/people/pabo/movie-review-data/) at Cornell and was introduced in Bo Pang and Lillian Lee, "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts," Proceedings of ACL 2004.

The dataset contains 1000 positive and 1000 negative movie reviews. Our job is to classify a review as positive or negative based on its text.

In [None]:
# need to unzip the data first.
!unzip ./data/review_polarity.zip -d ./data/

In [None]:
!ls data/review_polarity/pos

In [None]:
import glob

# labels are based on which directory the files are in.
all_pos = list(glob.glob("./data/review_polarity/pos/*"))
all_neg = list(glob.glob("./data/review_polarity/neg/*"))
labels = np.array([1] * len(all_pos) + [0] * len(all_neg))
filenames = all_pos + all_neg

In [None]:
# Let's look at the filename of one "positive" and one "negative" review.
print(filenames[0])
print(filenames[1001])

In [None]:
!cat ./data/review_polarity/pos/cv172_11131.txt

In [None]:
!cat ./data/review_polarity/neg/cv833_11961.txt

We'll use TfidfVectorizer to convert each document into a (sparse) *feature* vector.

As part of this we pass in:

**ngram_range: tuple (min_n, max_n), default=(1, 1)**

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

vec = TfidfVectorizer(input='filename', stop_words='english', ngram_range=(1, 1))
X = vec.fit_transform(filenames)
X.shape

So we have 2000 documents and 39,354 unique terms (features), counted after English stop words are removed.

How big is this matrix?

Wait, how do we store that?

dense matrix:
$$
X=
  \begin{bmatrix}
    0.1 & 2.8 & 3.2 & ... & 1.5 \\
    3.2 & 4.1 & 5.1 & ... & 2.7  \\
    ...\\
    1.4 & 3.4 & 7.5 & ... & 7.5  \\
  \end{bmatrix}
$$

sparse matrix:
$$
X=
  \begin{bmatrix}
    0.1 & 0 & 0 & ... & 1.5 \\
    0 & 0 & 0 & ... & 2.7  \\
    ...\\
    0 & 3.4 & 0 & ... & 0  \\
  \end{bmatrix}
$$

How can we store a sparse matrix more efficiently?

<br><br><br>
[CSR matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html)
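
A small sketch of the CSR (compressed sparse row) idea on a 3×4 matrix shaped like the sparse example above: only the nonzero values are stored, along with their column positions and a pointer to where each row starts. The last lines estimate the dense size of the 2000 × 39,354 matrix, assuming 8-byte floats.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense_demo = np.array([
    [0.1, 0.0, 0.0, 1.5],
    [0.0, 0.0, 0.0, 2.7],
    [0.0, 3.4, 0.0, 0.0],
])
sparse_demo = csr_matrix(dense_demo)

print(sparse_demo.data)     # nonzero values, row by row: [0.1 1.5 2.7 3.4]
print(sparse_demo.indices)  # column index of each value:  [0 3 3 1]
print(sparse_demo.indptr)   # row i is data[indptr[i]:indptr[i+1]]: [0 2 3 4]

# Dense storage cost of the full document-term matrix at 8 bytes per entry.
n_docs, n_terms = 2000, 39354
dense_bytes = n_docs * n_terms * 8
print(f"{dense_bytes / 1e6:.0f} MB")  # about 630 MB
```

Since most reviews use only a small fraction of the vocabulary, storing just the nonzero entries takes far less memory than the dense layout.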

In [None]:
X[0]

In [None]:
filenames[0]

In [None]:
!cat ./data/review_polarity/pos/cv172_11131.txt

In [None]:
X[0].indices

In [None]:
feature_names = np.array(vec.get_feature_names_out())
feature_names[X[0].indices]

In [None]:
X[0].data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.4,
                                                    shuffle=True, random_state=42)

In [None]:
import sklearn.metrics as metrics
textlr = LogisticRegression()
textlr.fit(X_train, y_train)
y_predicted = textlr.predict(X_test)
# scikit-learn metrics expect (y_true, y_pred); swapping them exchanges precision and recall.
print(f"accuracy = {metrics.accuracy_score(y_test, y_predicted):.3f}")
print(f"precision = {metrics.precision_score(y_test, y_predicted):.3f}")
print(f"recall = {metrics.recall_score(y_test, y_predicted):.3f}")
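
For reference, precision is TP / (TP + FP) and recall is TP / (TP + FN), where the positive class is label 1. The sketch below computes both by hand on six made-up labels and compares them with scikit-learn, whose metric functions take the arguments in the order `(y_true, y_pred)`. The `_demo` names and label values are assumptions for illustration.

```python
from sklearn import metrics

y_true_demo = [1, 1, 1, 1, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives: 2
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives: 1
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives: 2

precision_manual = tp / (tp + fp)  # 2/3: how many predicted positives are right
recall_manual = tp / (tp + fn)     # 2/4: how many true positives were found

precision_sk = metrics.precision_score(y_true_demo, y_pred_demo)
recall_sk = metrics.recall_score(y_true_demo, y_pred_demo)
print(precision_manual, precision_sk)
print(recall_manual, recall_sk)

# Swapping the arguments silently returns the recall instead of the precision.
swapped = metrics.precision_score(y_pred_demo, y_true_demo)
print(swapped)  # 0.5
```

Accuracy is unaffected by the argument order, which is why a swap is easy to miss.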

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(textlr, X_test, y_test,
                                        display_labels=["Negative", "Positive"],
                                        cmap=plt.cm.Blues, normalize='all')

In [None]:
coefficient_list = pd.DataFrame(textlr.coef_[0],  index=feature_names).rename(columns={0: 'coef'})
coefficient_list.sort_values('coef', ascending=False).head(20)

In [None]:
coefficient_list.sort_values('coef', ascending=True).head(20)