
# Recurrent Neural Networks for Natural Language Processing

Bidirectional LSTM and GRU
- Process the sequence in both directions, so each output can use future context as well as past context.

Tokenization - Separating text into smaller units called tokens
- char level (h,e,l,l,o)
- word level (hello)
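
As a minimal sketch of the two granularities, using only plain Python string operations (the sample strings are illustrative, not taken from the dataset used later):

```python
# Character-level tokenization: every character becomes a token.
char_tokens = list("hello")
print(char_tokens)

# Word-level tokenization: split on whitespace (a simplification that
# ignores punctuation handling).
word_tokens = "hello world".split()
print(word_tokens)
```

Character-level tokens give a tiny vocabulary but long sequences; word-level tokens give shorter sequences but a much larger vocabulary.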

Vectorization - Converting words into numbers
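
One simple way to vectorize is to give each distinct word an integer ID. The sketch below assumes a first-seen ordering and reserves 0 for padding; both are illustrative conventions, not something these notes prescribe.

```python
sentence = "the cat sat on the mat".split()

# Build a word -> integer ID mapping in order of first appearance.
# IDs start at 1 so that 0 can be reserved for padding (an assumption).
vocab = {}
for word in sentence:
    if word not in vocab:
        vocab[word] = len(vocab) + 1

# Replace each word with its ID.
encoded = [vocab[word] for word in sentence]
print(vocab)
print(encoded)
```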

Word embeddings - A representation of words in which words with similar meanings have similar representations
- pretrained models (Word2Vec (Google), fastText (Facebook))
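
"Similar representation" is usually measured with cosine similarity between vectors. The sketch below uses made-up 3-dimensional vectors purely for illustration; real pretrained embeddings have hundreds of dimensions, and the numbers here are hypothetical.

```python
import math

# Hypothetical toy embeddings (NOT real Word2Vec or fastText values).
toy_embeddings = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.75, 0.20],
    "apple": [0.10, 0.20, 0.90],
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(sim_king_queen, sim_king_apple)
```

With these toy values, "king" is closer to "queen" than to "apple", which is the behavior a good embedding is meant to capture.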

A language model learns to predict the probability of a sequence of words.
- Statistical language models (probability of seeing one word after another)
- Neural language models (Use deep learning to predict what comes next)
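
A statistical language model can be as simple as counting which word follows which. The sketch below builds a bigram model on a tiny made-up corpus; the corpus and the helper name `next_word_probs` are illustrative assumptions.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# Count how often each word follows each other word.
bigram_counts = defaultdict(Counter)
for prev_word, next_word in zip(corpus, corpus[1:]):
    bigram_counts[prev_word][next_word] += 1

def next_word_probs(word):
    """Estimate P(next word | word) from the bigram counts."""
    counts = bigram_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

probs_after_the = next_word_probs("the")
print(probs_after_the)
```

A neural language model replaces the count table with a network that outputs the same kind of probability distribution over the next token.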

Stop word removal - Removing words that carry little meaning on their own.
- Common stop words include the, is, and, a, for, of, at, to, etc.
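
A minimal sketch of stop word removal, assuming whitespace tokenization and using only the stop words listed above (real stop word lists are much longer):

```python
# Stop word set taken from the examples in the notes above.
STOP_WORDS = {"the", "is", "and", "a", "for", "of", "at", "to"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

filtered = remove_stop_words("The cat is at the door of a house")
print(filtered)
```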

Part-of-speech tagging - Grammatical tagging: marking each word in a text with its part of speech (noun, verb, adjective, etc.)

Language can be modeled with RNNs.

Such models can generate structured text such as LaTeX, XML, and citations.

Lower temperature - More confident, less random output

In [1]:
#imports and things
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
    !pip install -q -U tensorflow-addons
    !pip install -q -U transformers
    IS_COLAB = True
except Exception:
    IS_COLAB = False

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. LSTMs and CNNs can be very slow without a GPU.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware accelerator.")

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.

## Char-RNN

### Loading and Preparing the Dataset:

In [2]:
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

Downloading data from https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt


In [3]:
print(shakespeare_text[:148])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?



In [4]:
# The vocabulary of our character-level language model looks like this:
"".join(sorted(set(shakespeare_text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

In [5]:
# Use Tokenizer to tokenize the Shakespeare text
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

In [6]:
# Encode the word 'First' as a sequence of token IDs (one per character):
tokenizer.texts_to_sequences(["First"])

[[20, 6, 9, 8, 3]]

In [7]:
# Decode the sequence of token IDs back to text:
tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]])

['f i r s t']

In [8]:
# Dataset prep
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters

[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.repeat().window(window_length, shift=1, drop_remainder=True)

dataset = dataset.flat_map(lambda window: window.batch(window_length))

np.random.seed(42)
tf.random.set_seed(42)

batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))

dataset = dataset.prefetch(1)


for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)
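
The windowing step is easier to see on a tiny sequence. The sketch below reproduces the idea in plain Python (no `tf.data`): slide a window of `n_steps + 1` items over the sequence, then split each window into an input and a target shifted one position ahead. The helper name `make_windows` is a hypothetical illustration, not part of the notebook.

```python
def make_windows(sequence, n_steps):
    """Return (input, target) pairs; each target is the input shifted by one."""
    window_length = n_steps + 1
    pairs = []
    # Drop any trailing window shorter than window_length.
    for start in range(len(sequence) - window_length + 1):
        window = sequence[start:start + window_length]
        pairs.append((window[:-1], window[1:]))
    return pairs

pairs = make_windows(list("hello!"), n_steps=3)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

In the notebook's pipeline the same pairs are then shuffled, batched, and the inputs one-hot encoded, which gives the `(32, 100, 39)` and `(32, 100)` shapes printed above.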


## Creating and Training the Model
If you are not connected to a GPU/TPU, this code will likely take hours to run.

If you are connected to a GPU/TPU, you should be able to run this at about 5-10 minutes per epoch.



In [9]:
model = keras.models.Sequential([
    keras.layers.GRU(64, return_sequences=True, input_shape=[None, max_id],
                     dropout=0.2),
    keras.layers.GRU(64, return_sequences=True,
                     dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
history = model.fit(dataset, steps_per_epoch=train_size // batch_size,
                    epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
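
To get a feel for how small this model is, its trainable parameters can be counted by hand. The sketch below assumes Keras's default GRU variant in TensorFlow 2 (`reset_after=True`, which keeps two bias vectors per gate) and `max_id = 39` as printed earlier; compare the result with `model.summary()` rather than treating it as authoritative.

```python
def gru_params(input_dim, units):
    # Three weight groups (update gate, reset gate, candidate state), each with
    # an input kernel, a recurrent kernel, and two bias vectors
    # (two biases is an assumption tied to reset_after=True).
    return 3 * (units * (input_dim + units) + 2 * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

max_id = 39  # number of distinct characters, as in the notebook output
total_params = (
    gru_params(max_id, 64)      # first GRU layer
    + gru_params(64, 64)        # second GRU layer
    + dense_params(64, max_id)  # time-distributed softmax layer
)
print(total_params)
```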


## Using the Model to Generate Text:

In [10]:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

# Let's pass in 'How are yo' and see what it predicts the next letter should be:
X_new = preprocess(["How are yo"])

# Take the argmax of the softmax output at each time step (the most likely next character)
Y_pred = np.argmax(model(X_new), axis=-1)
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1] # 1st sentence, last char

'u'

In [11]:
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

In [19]:
tf.random.set_seed(42)

next_char("The dogs went for a wal", temperature=.01)

'l'

In [14]:
def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

**Temperature** controls the randomness of the output. A higher temperature gives less confident, more random output (more errors, less coherence), while a lower temperature gives more confident, less random output. The cells below show how temperature influences the predictions.
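
The effect of temperature can be seen on a small, made-up distribution. The sketch below mirrors what `next_char` does (divide the log-probabilities by the temperature, then renormalize with a softmax) in plain Python; the helper name `apply_temperature` and the example probabilities are illustrative assumptions.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: log(p) / T followed by softmax."""
    logits = [math.log(p) / temperature for p in probs]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]  # hypothetical model output for three characters
low_t = apply_temperature(probs, 0.3)
same_t = apply_temperature(probs, 1.0)
high_t = apply_temperature(probs, 2.0)
print(low_t)
print(same_t)
print(high_t)
```

At a temperature of 1 the distribution is unchanged; below 1 the most likely character gains probability mass, and above 1 the distribution flattens toward uniform.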

In [15]:
tf.random.set_seed(42)

print(complete_text("t", temperature=0.3))

the words and the rather with all.

petruchio:
the 


In [16]:
print(complete_text("t", temperature=1))

tokem brod seart.

petruchio:
ristryor their reserv


In [17]:
print(complete_text("t", temperature=2))


tpenicmem lv!--it?

judhicalo'va;
thenin ruci-haave


# In-Class Exercise

With your group, answer the following:
- Play around with the `complete_text` function, try different character lengths. What is the best output you got? 

- Do you think we trained the model long enough? Do you expect the predictions to be better if we made the model larger or trained the model longer? Why or why not?

- Does anything surprise you about the predictions? Why or why not?

- How would you go about improving the model? What hyperparameters would you consider changing?



In [20]:
print(complete_text("she", temperature=0.25, n_chars=1000))

she will see the other was the country the country to the country the country the cating the discale to the country the country to the country the country
the matter and he would be the words the come to see the father and the state and in the citizens to the maid the way the belly
the great and the world the command and the words,
that so make the country the country and the grace.

first citizen:
what well the rest the come to the country the the country the country the country the citizen.

first citizen:
what i shall be the country the can the country the great of the care
to the good father for the companion,
they was the care of the country and the country the father with the senolath,
the seater and the man of the comes the content of a wars the country the content of the country the complaing the worst and the cite,
the comes the command and the country the courteen and the country and the country the man of the country the can the country the read.

first citizen:
what it is t