<a href="https://colab.research.google.com/github/Angel-Castro-RC/Final_NLP/blob/main/F5_3_NeuralNetworks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Neural Networks

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F5_3_NeuralNetworks.ipynb)


## References

SLP: Neural Networks and Neural Language Models, Chapter 7 of Speech and Language Processing by Daniel Jurafsky & James H. Martin https://web.stanford.edu/~jurafsky/slp3/7.pdf

Artificial Neural Networks, Chapter 4 of Machine Learning by Tom M. Mitchell http://www.cs.cmu.edu/~tom/files/MachineLearningTomMitchell.pdf

Sequential Model from Keras Developer Guide: https://keras.io/guides/sequential_model/

In [1]:
import sys
!{sys.executable} -m pip install datasets keras tensorflow transformers


## Review: Integer Encoding

We tried machine learning with text where each word was assigned a number: integer encoding.

Every input must have the same size, but text inputs have different lengths, so we
* pad short ones with zeros
* truncate long ones

In [None]:
from transformers import AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

data = load_dataset("Deysi/spam-detection-dataset")


def prepare_text_data(text_list,tokenizer,encoded_length=512):
    encodings = []
    for curr_example in text_list:
        curr_tokens = tokenizer.tokenize(curr_example)
        curr_encodings = tokenizer.convert_tokens_to_ids(curr_tokens)

        # truncate sequences that are too long
        if len(curr_encodings) > encoded_length:
            curr_encodings = curr_encodings[:encoded_length]
        # pad sequences that are too short with 0s
        elif len(curr_encodings) < encoded_length:
            curr_encodings = curr_encodings + [0]*(encoded_length-len(curr_encodings))

        encodings.append(curr_encodings)

    return encodings



train_encoding = prepare_text_data(data["train"]["text"],tokenizer)
train_labels = data["train"]["label"]
test_encoding = prepare_text_data(data["test"]["text"],tokenizer)
test_labels = data["test"]["label"]


lr_model = LogisticRegression(max_iter=2000)
lr_model.fit(train_encoding,train_labels)

predictions = lr_model.predict(test_encoding)

print( accuracy_score(test_labels,predictions) )


Token indices sequence length is longer than the specified maximum sequence length for this model (536 > 512). Running this sequence through the model will result in indexing errors


0.7258715596330275


## Review: Bag-of-Words Encoding

Choose a vocabulary (say, the 5000 most common words) and create one column for each word.

Each row contains the counts of each word in one example.

**Example**

*Sentence 1:* "The cat sat on the hat"

*Sentence 2:* "The dog ate the cat and the hat"

*Vocabulary:* { the, cat, sat, on, hat, dog, ate, and }


|            | the | cat | sat | on | hat | dog | ate | and |
|------------|-----|-----|-----|----|-----|-----|-----|-----|
| Sentence 1 | 2   | 1   | 1   | 1  | 1   | 0   | 0   | 0   |
| Sentence 2 | 3   | 1   | 0   | 0  | 1   | 1   | 1   | 1   |


**The downside:** this doesn't maintain any information about word order - thus the "bag" of words
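The table above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name `bag_of_words` is made up for illustration, and it splits on whitespace rather than using a real tokenizer.

```python
from collections import Counter

sentences = [
    "The cat sat on the hat",
    "The dog ate the cat and the hat",
]

# Vocabulary in the same column order as the table above
vocabulary = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]

def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

rows = [bag_of_words(s, vocabulary) for s in sentences]
for row in rows:
    print(row)

# Shuffling the words gives the same vector: word order is lost
shuffled = bag_of_words("hat the on sat cat the", vocabulary)
print(shuffled == rows[0])
```

The last line illustrates the downside: a scrambled sentence is indistinguishable from the original once it is encoded.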

`scikit-learn` provides a Bag-of-Words encoder called `CountVectorizer`


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from datasets import load_dataset

data = load_dataset("Deysi/spam-detection-dataset")

train_texts = data["train"]["text"]
train_labels = data["train"]["label"]
test_texts = data["test"]["text"]
test_labels = data["test"]["label"]

# Consider top 5000 frequent words
# remove stop words
vectorizer = CountVectorizer(max_features=5000,stop_words="english")
vectorizer.fit(train_texts)

train_vectors = vectorizer.transform(train_texts)
test_vectors = vectorizer.transform(test_texts)

lr_model = LogisticRegression(max_iter=2000)
lr_model.fit(train_vectors,train_labels)

predictions = lr_model.predict(test_vectors)

print( accuracy_score(test_labels,predictions) )

0.9952293577981651


## TF-IDF Encoding

**TF-IDF:** Term Frequency - Inverse Document Frequency

**Term Frequency:** How often does the word appear in the example, like CountVectorizer
* actually take the $\log$ of it

**Document Frequency:** What fraction of the *documents* (or *training-examples*) does the word appear in?

**Inverse Document Frequency:** (number of documents) / (number of documents containing the word)
* if a word is in only a few documents, you get a big number
* if a word appears in lots of documents, you get a small number

When encoding a new example, multiply the Term Frequency of the word in this example by the Inverse Document Frequency of the training set
* gives higher weight to words that are differentiators
* stop words should automatically be de-emphasized

**Example:**
Document collection: all of Shakespeare's plays

The word `Romeo` appears 113 times but only in 1 document

The word `action` appears 113 times but in 31 documents

so Romeo will get a much higher weight
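To make the comparison concrete, here is a rough sketch of the weighting. It rests on several assumptions that are not stated above: the collection is taken to have 37 plays, the log-scaled variant from SLP is used ($\log_{10}$ of the count plus one, times $\log_{10}$ of the inverse document frequency), and the 113 occurrences are treated as the count within a single document. The function name `tf_idf` is hypothetical, and `TfidfVectorizer` uses its own, slightly different formula.

```python
import math

def tf_idf(term_count, docs_with_term, num_docs):
    """Log-scaled term frequency times log-scaled inverse document frequency."""
    tf = math.log10(1 + term_count)
    idf = math.log10(num_docs / docs_with_term)
    return tf * idf

NUM_PLAYS = 37  # assumption: number of plays in the collection

romeo_weight = tf_idf(term_count=113, docs_with_term=1, num_docs=NUM_PLAYS)
action_weight = tf_idf(term_count=113, docs_with_term=31, num_docs=NUM_PLAYS)

print(f"Romeo:  {romeo_weight:.3f}")
print(f"action: {action_weight:.3f}")
```

Both words have the same term frequency, so the whole difference in weight comes from the inverse document frequency.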


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
#from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from datasets import load_dataset

#data = load_dataset("Deysi/spam-detection-dataset")
data = load_dataset("dair-ai/emotion")

train_texts = data["train"]["text"]
train_labels = data["train"]["label"]

test_texts = data["test"]["text"]
test_labels = data["test"]["label"]

# Consider top 5000 frequent words
vectorizer = CountVectorizer(max_features=5000)
vectorizer.fit(train_texts)

train_vectors = vectorizer.transform(train_texts)
test_vectors = vectorizer.transform(test_texts)

lr_model = LogisticRegression(max_iter=2000)
lr_model.fit(train_vectors,train_labels)

predictions = lr_model.predict(test_vectors)

print( accuracy_score(test_labels,predictions) )


0.885


TfidfVectorizer = 0.9952293577981651

CountVectorizer = 0.996697247706422

## Neural Networks

Hopefully neural networks are familiar to you from your Machine Learning course - but here is a review of some important aspects

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/fullyconnected.png?raw=1">
</div>

image credit: http://neuralnetworksanddeeplearning.com/chap6.html

For NLP, vectors representing words/sequences-of-words are the input layer

Output layer: the class for text classification, the next word in the sequence, etc.

## Neural Network Nodes

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/ann-perceptron.png?raw=1">
</div>

image credit: Machine Learning by Tom M. Mitchell, Chapter 4, http://www.cs.cmu.edu/~tom/files/MachineLearningTomMitchell.pdf

## Activation Functions

The basic perceptron *squashing function* just calls anything positive a 1 and anything negative a 0, but modern neural networks use many other activation functions.


<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/activation_binary_step.png?raw=1" width=300>
</div>

Activation Function images from https://en.wikipedia.org/wiki/Activation_function
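A single perceptron node with the binary step function can be sketched as a weighted sum followed by a threshold. The weights and bias below are hypothetical values chosen so that the node behaves like a logical AND; they are not taken from the figures.

```python
def perceptron_node(inputs, weights, bias):
    """Weighted sum of the inputs followed by a binary step activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0

# Hypothetical weights that make the node act like a logical AND
and_weights = [1.0, 1.0]
and_bias = -1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron_node([a, b], and_weights, and_bias))
```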



### Sigmoid Function

$\sigma (x) = {\frac {1}{1+e^{-x}}}$

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/activation_logistic.png?raw=1" width=300>
</div>

differentiable, so the calculus in the training algorithm works out

often used in an output layer where you have a binary classification

### Hyperbolic Tangent Function

$\tanh(x) = {\frac {e^{x}-e^{-x}}{e^{x}+e^{-x}}}$

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/activation_tanh.png?raw=1" width=300>
</div>

like sigmoid, but approximates identity near origin - learns efficiently with small, random, initial weights


### Identity Function

$f(x) = x$

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/activation_identity.png?raw=1" width=300>
</div>

take the output from the node as is - helpful if your output is numerical

### Rectified-Linear Unit - ReLU

$f(x) = \mbox{max}(0,x)$

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/activation_relu.png?raw=1" width=300>
</div>

either "doesn't fire" or "fires with measurable intensity" - biologically motivated

often used in hidden layers
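The four functions so far are short enough to write out directly. This is a small sketch using only the standard library; the function names are chosen for illustration.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def identity(x):
    return x

def relu(x):
    return max(0.0, x)

# Compare the activations on a negative, zero, and positive input
for x in (-2.0, 0.0, 2.0):
    print(
        f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  tanh={math.tanh(x):+.3f}  "
        f"identity={identity(x):+.1f}  relu={relu(x):.1f}"
    )
```

Sigmoid stays between 0 and 1, tanh between -1 and 1, and ReLU passes positive inputs through while zeroing out negative ones.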


### Softmax

Used for the output layer when you have more than two possible classes - like if you are predicting the next word

Like arg-max (which argument results in the maximum value)

Which class has the largest output value?

However, it is *soft* in that it really applies a probability to every possible value, weighted heavily to the one with the largest value
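As a sketch of the idea, softmax exponentiates each output score and divides by the total, so the results are positive and sum to 1. The scores below are made-up values for a three-class example.

```python
import math

def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    # Subtracting the max avoids overflow in exp() and does not change the result
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # hypothetical raw outputs for three classes
probs = softmax(scores)

print([round(p, 3) for p in probs])
print("arg-max class:", probs.index(max(probs)))
```

The class with the largest score still gets the largest probability, which is why arg-max over the softmax output picks the same class as arg-max over the raw scores.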

## Training a neural network

1. Start with random weights or weights learned from some other related task
2. Feed a training example into the network and get a prediction
3. Calculate the **loss function**, a measurement of how far away the prediction was from the target value
4. Adjust weights in an attempt to reduce the loss (Calculus)
    * derivative of the loss function with respect to the weights of the network
    * start with the output layer and move towards the front of the network (backpropagation)
    * adjustments for middle layers based on adjustment of later layer
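The steps above can be sketched for the simplest possible case: a single sigmoid node trained with cross-entropy loss. With only one node there are no middle layers, so this shows the weight update but not full backpropagation. The toy dataset (logical OR), the learning rate, and the number of epochs are all assumptions picked for illustration.

```python
import math
import random

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical toy training set: learn logical OR
examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

random.seed(0)
weights = [random.uniform(-0.5, 0.5) for _ in range(2)]  # step 1: random weights
bias = 0.0
learning_rate = 0.5

def predict(inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def average_loss():
    losses = []
    for inputs, y_true in examples:
        y_pred = predict(inputs)
        losses.append(-(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred)))
    return sum(losses) / len(losses)

initial_loss = average_loss()

for epoch in range(200):
    for inputs, y_true in examples:
        y_pred = predict(inputs)   # step 2: feed the example forward
        # steps 3-4: for sigmoid + cross-entropy, the derivative of the loss
        # with respect to the weighted sum simplifies to (y_pred - y_true)
        error = y_pred - y_true
        weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
        bias -= learning_rate * error

final_loss = average_loss()
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

In a multi-layer network the same kind of update is applied to every layer, with each layer's error term computed from the layer after it.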

## Loss Functions

For numerical outputs, use mean-squared-error
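Mean-squared error is the average of the squared differences between targets and predictions. A minimal sketch, with made-up numbers:

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

targets = [3.0, 5.0, 2.0]       # hypothetical target values
predictions = [2.5, 5.0, 4.0]   # hypothetical model outputs

mse = mean_squared_error(targets, predictions)
print(mse)
```

Squaring means the one prediction that is off by 2 contributes far more to the loss than the one that is off by 0.5.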

For binary/categorical use cross-entropy loss

$y_{true}$: the actual target value

$y_{pred}$: the predicted target value

crossentropy_loss = $-\left( y_{true}\log(y_{pred}) + (1-y_{true})\log(1-y_{pred}) \right)$

Section 5.5 of SLP (https://web.stanford.edu/~jurafsky/slp3/5.pdf) shows how we get this function.

Intuitively, imagine that we have a binary output layer - it should always be 0 or 1.

Let's say $y_{true}$ is 1 and our model predicts 0.9 (pretty confident it's a "1", but the final activation layer allows it to float a little).

Then the crossentropy loss is

In [None]:
import math

y_true = 1
y_pred = 0.9

# the second term uses log(1 - y_pred), matching the formula above
ce_loss = -(y_true*math.log(y_pred) + (1-y_true)*math.log(1-y_pred))
print(ce_loss)

0.10536051565782628


That's a small loss/error

compare with $y_{pred}$ of 0.1

In [None]:
import math

y_true = 1
y_pred = 0.1

# the second term uses log(1 - y_pred), matching the formula above
ce_loss = -(y_true*math.log(y_pred) + (1-y_true)*math.log(1-y_pred))
print(ce_loss)

2.3025850929940455


## Neural Network Packages

PyTorch (initially developed by Meta AI)

Tensorflow (initially developed by Google)

Keras - easy to use Python interface for TensorFlow
* support for PyTorch coming soon

all are free and open-source

We'll start with Keras, but we may later use TensorFlow and/or PyTorch directly



## Preparing Data for Keras

Keras (and other neural network packages) requires data to be in `numpy` arrays (`numpy` is the main package for fast, numerical arrays/vectors/matrices in Python).

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from datasets import load_dataset

data = load_dataset("Deysi/spam-detection-dataset")

train_texts = data["train"]["text"]
train_labels = data["train"]["label"]
test_texts = data["test"]["text"]
test_labels = data["test"]["label"]

# Consider top 5000 frequent words
vectorizer = TfidfVectorizer(max_features=5000)
vectorizer.fit(train_texts)

train_vectors = vectorizer.transform(train_texts)
test_vectors = vectorizer.transform(test_texts)

# the sklearn vectors need to be converted to numpy arrays for this library
train_vectors_arrays = train_vectors.toarray()
test_vectors_arrays = test_vectors.toarray()

#convert labels from spam/not-spam into 1/0
train_labels_binary = []
for label in train_labels:
    if label == "spam":
        train_labels_binary.append(1)
    else:
        train_labels_binary.append(0)

#convert labels from spam/not-spam into 1/0
test_labels_binary = []
for label in test_labels:
    if label == "spam":
        test_labels_binary.append(1)
    else:
        test_labels_binary.append(0)

# Convert values into arrays
train_labels_array = np.array(train_labels_binary)
test_labels_array = np.array(test_labels_binary)


## Defining the model architecture

A *Sequential* model allows you to define a neural network structure one layer at a time.

The first layer has 5000 inputs because our text vectors contain 5000 features

We define the first two layers to have 10 nodes each - these are parameters that can be experimented with

The output layer has 1 sigmoid node because there are only two possible outputs (for categorical, use `'softmax'`)

In [None]:
from keras.models import Sequential
from keras.layers import Dense

#create a neural network architecture
model = Sequential()
model.add(Dense(10, input_dim=5000, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
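To get a feel for the size of this architecture, the number of trainable parameters can be counted by hand: each fully connected layer has one weight per input per node, plus one bias per node. The helper `dense_parameter_count` below is a hypothetical name used only for this sketch.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases for a stack of fully connected layers."""
    total = 0
    for n_inputs, n_nodes in zip(layer_sizes, layer_sizes[1:]):
        total += n_inputs * n_nodes + n_nodes  # weights + biases
    return total

# 5000 inputs -> 10 nodes -> 10 nodes -> 1 output node
layer_sizes = [5000, 10, 10, 1]
total_parameters = dense_parameter_count(layer_sizes)
print(total_parameters)  # 50131
```

Almost all of the parameters sit in the first layer, because that is where the 5000-feature input is connected. Calling `model.summary()` on the Keras model should report the same kind of per-layer breakdown.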

## Training the model

You need to define the loss function and optimizer algorithm

for more than 2 categories, use `"categorical_crossentropy"`

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(train_vectors_arrays, train_labels_array, epochs=10, verbose=1)

loss, accuracy = model.evaluate(test_vectors_arrays, test_labels_array)
print(f"Test accuracy: {accuracy*100:.2f}%")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 99.85%
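For the multi-class case mentioned above (`"categorical_crossentropy"` with a `'softmax'` output layer), Keras expects each label as a one-hot vector rather than a single integer. The sketch below shows that encoding and the loss it feeds into, in plain Python; the helper names and the example probabilities are made up for illustration.

```python
import math

def one_hot(label, num_classes):
    """Encode an integer class label as a vector with a single 1."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec

def categorical_crossentropy(y_true_one_hot, y_pred_probs):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true_one_hot, y_pred_probs))

y_true = one_hot(2, num_classes=3)

confident = [0.05, 0.05, 0.90]  # hypothetical softmax output, mostly right
wrong = [0.70, 0.20, 0.10]      # hypothetical softmax output, mostly wrong

loss_confident = categorical_crossentropy(y_true, confident)
loss_wrong = categorical_crossentropy(y_true, wrong)
print(y_true)
print(f"{loss_confident:.3f} vs {loss_wrong:.3f}")
```

Only the probability given to the true class matters, so with two classes this reduces to the binary cross-entropy used in this notebook.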


## Group Exercise

Even though we have established that *integer encoding* is a bad way to encode text, we will eventually want to feed in encodings that represent one word at a time in a sequence, unlike BoW and TF-IDF, which aggregate all words in the text into a single vector. So let's practice setting it up with integer encoding.

Make the Keras neural network work with the Integer Encoding approach we used earlier.

In [None]:
from transformers import AutoTokenizer
from datasets import load_dataset
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Load dataset and tokenizer
data = load_dataset("Deysi/spam-detection-dataset")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Prepare Integer Encodings for Text Data
def prepare_text_data(text_list, tokenizer, encoded_length=512):
    encodings = []
    for curr_example in text_list:
        curr_tokens = tokenizer.tokenize(curr_example)
        curr_encodings = tokenizer.convert_tokens_to_ids(curr_tokens)
        if len(curr_encodings) > encoded_length:
            curr_encodings = curr_encodings[:encoded_length]
        elif len(curr_encodings) < encoded_length:
            curr_encodings = curr_encodings + [0] * (encoded_length - len(curr_encodings))
        encodings.append(curr_encodings)
    return encodings

train_encoding = prepare_text_data(data["train"]["text"], tokenizer)
test_encoding = prepare_text_data(data["test"]["text"], tokenizer)
train_encoding_array = np.array(train_encoding)
test_encoding_array = np.array(test_encoding)

# Convert labels from spam/not-spam into 1/0
train_labels_binary = np.array([1 if label == "spam" else 0 for label in data["train"]["label"]])
test_labels_binary = np.array([1 if label == "spam" else 0 for label in data["test"]["label"]])

# Define and Compile the Keras Model
encoded_length = 512  # Adjust this based on your requirements
model = Sequential()
model.add(Dense(10, input_dim=encoded_length, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the Model
model.fit(train_encoding_array, train_labels_binary, epochs=10, verbose=1)

# Evaluate the Model
loss, accuracy = model.evaluate(test_encoding_array, test_labels_binary)
print(f"Test accuracy: {accuracy*100:.2f}%")

Token indices sequence length is longer than the specified maximum sequence length for this model (536 > 512). Running this sequence through the model will result in indexing errors


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 55.01%


## Applied Exploration

Select another Hugging Face dataset for text classification and get it working with the Keras neural network.

Experiment with different numbers of layers and numbers of nodes in each layer. Record your results.

Give a short write-up on the following
* Describe your dataset, including the distribution of the target variable
* Describe the results of the machine learning experiment
* Interpret the results - How did this dataset compare with the spam dataset? Why do you think you got the results that you did?

In [None]:
from transformers import AutoTokenizer
from datasets import load_dataset
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Load IMDb dataset and tokenizer
data = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Prepare Integer Encodings for Text Data
def prepare_text_data(text_list, tokenizer, encoded_length=512):
    encodings = []
    for curr_example in text_list:
        curr_tokens = tokenizer.tokenize(curr_example)
        curr_encodings = tokenizer.convert_tokens_to_ids(curr_tokens)
        if len(curr_encodings) > encoded_length:
            curr_encodings = curr_encodings[:encoded_length]
        elif len(curr_encodings) < encoded_length:
            curr_encodings = curr_encodings + [0] * (encoded_length - len(curr_encodings))
        encodings.append(curr_encodings)
    return encodings

train_encoding = prepare_text_data(data["train"]["text"], tokenizer)
test_encoding = prepare_text_data(data["test"]["text"], tokenizer)
train_encoding_array = np.array(train_encoding)
test_encoding_array = np.array(test_encoding)

# IMDb labels are already integers (0 = negative, 1 = positive), so just convert to arrays
train_labels_binary = np.array(data["train"]["label"])
test_labels_binary = np.array(data["test"]["label"])

# Define and Compile the Keras Model (Experiment with different architectures)
encoded_length = 512  # must match the encoded_length used in prepare_text_data
model = Sequential()
model.add(Dense(64, input_dim=encoded_length, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the Model
model.fit(train_encoding_array, train_labels_binary, epochs=5, verbose=1)

# Evaluate the Model
loss, accuracy = model.evaluate(test_encoding_array, test_labels_binary)
print(f"Test accuracy: {accuracy*100:.2f}%")