<p>Notes found <a href="https://colab.research.google.com/drive/1ysEKrw_LE2jMndo1snrZUh5w87LQsCxk#forceEdit=true&sandboxMode=true">here</a></p>
<p>Video found <a href="https://www.youtube.com/watch?v=tPYj3fFJGjk&list=WL&index=5&t=11941s&ab_channel=freeCodeCamp.org">here</a></p>

<h1>Encoding Textual Data</h1>

<p>Bag of Words</p>
<p>Tracks how often each word appears, but not the order in which the words appear</p>

In [2]:
vocab = {}  # maps word to integer representing it
word_encoding = 1
def bag_of_words(text):
    global word_encoding

    words = text.lower().split(" ")  # create a list of all of the words in the text; assume there is no grammar in our text for this example
    bag = {}  # stores all of the encodings and their frequency

    for word in words:
        if word in vocab:
            encoding = vocab[word]  # get encoding from vocab
        else:
            vocab[word] = word_encoding
            encoding = word_encoding
            word_encoding += 1
    
        if encoding in bag:
            bag[encoding] += 1
        else:
            bag[encoding] = 1
  
    return bag

text = "this is a test to see if this test will work is is test a a"
bag = bag_of_words(text)
print(bag)
print(vocab)

{1: 2, 2: 3, 3: 3, 4: 3, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}
{'this': 1, 'is': 2, 'a': 3, 'test': 4, 'to': 5, 'see': 6, 'if': 7, 'will': 8, 'work': 9}


In [3]:
# Limitation of bag of words: it does not track the order of words, so it loses the meaning of the sentence
positive_review = "I thought the movie was going to be bad but it was actually amazing"
negative_review = "I thought the movie was going to be amazing but it was actually bad"

pos_bag = bag_of_words(positive_review)
neg_bag = bag_of_words(negative_review)

print("Positive:", pos_bag)
print("Negative:", neg_bag)

Positive: {10: 1, 11: 1, 12: 1, 13: 1, 14: 2, 15: 1, 5: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1}
Negative: {10: 1, 11: 1, 12: 1, 13: 1, 14: 2, 15: 1, 5: 1, 16: 1, 21: 1, 18: 1, 19: 1, 20: 1, 17: 1}
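
<p>A minimal sketch of why the two bags above contain the same counts. It uses collections.Counter rather than the notebook's bag_of_words function, and count_words is a hypothetical helper name introduced only for this illustration.</p>

```python
from collections import Counter


def count_words(text):
    """Return a word -> frequency mapping; word order is discarded."""
    return Counter(text.lower().split())


pos_text = "I thought the movie was going to be bad but it was actually amazing"
neg_text = "I thought the movie was going to be amazing but it was actually bad"

pos_counts = count_words(pos_text)
neg_counts = count_words(neg_text)

# The reviews have opposite meanings, yet only the counts survive,
# so the two bags compare as equal.
print(pos_counts == neg_counts)
print(pos_counts["was"])
```

<p>Any model that only sees these counts cannot tell the two reviews apart.</p>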


<p>Integer Encoding</p>
<p>Tracks frequency and order of words</p>

In [4]:
vocab = {}  
word_encoding = 1
def one_hot_encoding(text):  # Note: despite its name, this function performs integer encoding, not one-hot encoding
    global word_encoding

    words = text.lower().split(" ") 
    encoding = []  

    for word in words:
        if word in vocab:
            code = vocab[word]  
            encoding.append(code) 
        else:
            vocab[word] = word_encoding
            encoding.append(word_encoding)
            word_encoding += 1
  
    return encoding

text = "this is a test to see if this test will work is is test a a"
encoding = one_hot_encoding(text)
print(encoding)
print(vocab)

[1, 2, 3, 4, 5, 6, 7, 1, 4, 8, 9, 2, 2, 4, 3, 3]
{'this': 1, 'is': 2, 'a': 3, 'test': 4, 'to': 5, 'see': 6, 'if': 7, 'will': 8, 'work': 9}


In [5]:
# Tracks order but not meaning (e.g. synonyms/antonyms) of words
positive_review = "I thought the movie was going to be bad but it was actually amazing"
negative_review = "I thought the movie was going to be amazing but it was actually bad"

pos_encode = one_hot_encoding(positive_review)
neg_encode = one_hot_encoding(negative_review)

print("Positive:", pos_encode)
print("Negative:", neg_encode)

Positive: [10, 11, 12, 13, 14, 15, 5, 16, 17, 18, 19, 14, 20, 21]
Negative: [10, 11, 12, 13, 14, 15, 5, 16, 21, 18, 19, 14, 20, 17]
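
<p>The function above is named one_hot_encoding but returns one integer per word. For contrast, here is a small sketch of what a true one-hot encoding could look like. The names build_toy_vocab, one_hot and toy_words are hypothetical and not part of the original notebook.</p>

```python
def build_toy_vocab(words):
    """Assign each distinct word an index, in order of first appearance."""
    toy_vocab = {}
    for word in words:
        if word not in toy_vocab:
            toy_vocab[word] = len(toy_vocab)
    return toy_vocab


def one_hot(words, toy_vocab):
    """Encode each word as a list with a single 1 at the word's index."""
    vectors = []
    for word in words:
        vector = [0] * len(toy_vocab)
        vector[toy_vocab[word]] = 1
        vectors.append(vector)
    return vectors


toy_words = "this is a test this".split()
toy_vocab = build_toy_vocab(toy_words)
toy_encoded = one_hot(toy_words, toy_vocab)

for word, vector in zip(toy_words, toy_encoded):
    print(word, vector)
```

<p>Each vector is as long as the vocabulary, which is one reason integer encoding followed by an embedding layer tends to be preferred for large vocabularies.</p>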


<p>Word Embedding</p>
<p>Tracks frequency, order and meaning of words by encoding each word as a dense vector, so that words with similar meanings end up with similar vectors (see below)</p>
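
<p>A rough illustration of the idea, assuming hand-picked 3-dimensional vectors. The values in toy_embeddings are hypothetical; a real embedding layer learns its vectors during training.</p>

```python
import math

# Hypothetical embeddings chosen by hand for illustration only.
toy_embeddings = {
    "good": [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "bad": [-0.9, -0.7, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_good_great = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_good_bad = cosine_similarity(toy_embeddings["good"], toy_embeddings["bad"])

print("good vs great:", sim_good_great)
print("good vs bad:", sim_good_bad)
```

<p>With these toy values the synonyms point in nearly the same direction, while the antonym points the opposite way.</p>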

<h1>Recurrent Neural Networks (RNN)</h1>

<h2>Sentiment Analysis</h2>

In [None]:
# Imports
from keras.datasets import imdb
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np

<p>Data Preprocessing</p>

In [6]:
VOCAB_SIZE = 88584

MAXLEN = 250
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words = VOCAB_SIZE)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [11]:
# pad_sequences trims reviews longer than MAXLEN (250) words (by default it drops words from the start)
# and pads reviews shorter than MAXLEN with 0s on the left
train_data = sequence.pad_sequences(train_data, MAXLEN)
test_data = sequence.pad_sequences(test_data, MAXLEN)
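
<p>A pure-Python sketch of what the default padding and truncation amount to for a single review, assuming the Keras defaults of padding and truncating at the front. The helper pad_or_truncate is hypothetical and is not the Keras implementation.</p>

```python
def pad_or_truncate(tokens, maxlen, pad_value=0):
    """Keep the last `maxlen` tokens of long inputs; left-pad short ones."""
    if len(tokens) >= maxlen:
        return tokens[len(tokens) - maxlen:]
    return [pad_value] * (maxlen - len(tokens)) + tokens


short_review = [5, 6, 7]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_review, 5))  # padded on the left with 0s
print(pad_or_truncate(long_review, 5))   # earliest tokens dropped
```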

<p>Creating the Model</p>

In [13]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),  # Use word embedding
    tf.keras.layers.LSTM(32),  # 32 is the number of dimensions the output has
    tf.keras.layers.Dense(1, activation="sigmoid")  # Sigmoid restricts output to (0, 1): 0.5 is a neutral review, >0.5 is positive and <0.5 is negative
])

In [15]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 32)          2834688   
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320      
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________
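
<p>The parameter counts in the summary above can be reproduced by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias); the three helper functions are hypothetical names used only for this check.</p>

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding_dim-sized vector per vocabulary entry."""
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units


param_counts = {
    "embedding": embedding_params(88584, 32),
    "lstm": lstm_params(32, 32),
    "dense": dense_params(32, 1),
}

print(param_counts)
print("total:", sum(param_counts.values()))
```

<p>Almost all of the parameters sit in the embedding table, because the vocabulary is far larger than the 32 LSTM units.</p>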


<p>Training the Model</p>

In [16]:
model.compile(loss="binary_crossentropy",optimizer="rmsprop",metrics=['acc'])

history = model.fit(train_data, train_labels, epochs=3, validation_split=0.2)  # Use 20% of the training data to evaluate and validate the model

Epoch 1/3
Epoch 2/3
Epoch 3/3


<p>Evaluating the Model</p>

In [17]:
results = model.evaluate(test_data, test_labels)
print(results)

[0.3032104969024658, 0.8753200173377991]


<p>Making Predictions</p>

In [18]:
word_index = imdb.get_word_index()

# Preprocesses data so that new text is encoded with the dataset's word index
# Caveat: imdb.load_data() shifts every index by index_from=3 by default, so the raw word_index values
# used here are offset from the encoding of the training data
def encode_text(text):
    tokens = keras.preprocessing.text.text_to_word_sequence(text)  # Split sentences into individual words (tokens)
    tokens = [word_index[word] if word in word_index else 0 for word in tokens]  # Map each word to its integer (word to int); unknown words become 0
    return sequence.pad_sequences([tokens], MAXLEN)[0]  # Returns the movie review as a left-padded array of MAXLEN integers

text = "that movie was just amazing, so amazing"
encoded = encode_text(text)
print(encoded)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0 

In [21]:
reverse_word_index = {value: key for (key, value) in word_index.items()}

# Decode function
def decode_integers(integers):
    PAD = 0
    text = ""
    for num in integers:
        if num != PAD:
            text += reverse_word_index[num] + " "

    return text[:-1]
  
print(decode_integers(encoded))

that movie was just amazing so amazing


In [22]:
# now time to make a prediction

def predict(text):
    encoded_text = encode_text(text)  # Preprocesses data (integer encoding)
    pred = np.zeros((1,250))  # Creates numpy array of 250 0s (fixed input length of 250 words)
    pred[0] = encoded_text  # Insert text input into array
    result = model.predict(pred)  # Make prediction
    print(result[0])  # Print desired prediction

positive_review = "That movie was! really loved it and would great watch it again because it was amazingly great"
predict(positive_review)

negative_review = "that movie really sucked. I hated it and wouldn't watch it again. Was one of the worst things I've ever watched"
predict(negative_review)

[0.82816947]
[0.37610605]
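
<p>To turn the two scores above into labels, one option is a simple 0.5 threshold on the sigmoid output. This is a sketch only; sigmoid and label_from_score are hypothetical helpers, not functions from the notebook or from Keras.</p>

```python
import math


def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def label_from_score(score, threshold=0.5):
    """Map a sigmoid output to a sentiment label."""
    return "positive" if score > threshold else "negative"


print(sigmoid(0.0))                  # the neutral midpoint
print(label_from_score(0.82816947))  # score of the positive review
print(label_from_score(0.37610605))  # score of the negative review
```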


<h2>RNN Play Generator</h2>

In [23]:
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np

<p>Load Dataset</p>

In [26]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

In [27]:
# Read, then decode for py2 compat.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters


<p>Data Preprocessing</p>

In [29]:
vocab = sorted(set(text))

# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

def text_to_int(text):
    return np.array([char2idx[c] for c in text])

text_as_int = text_to_int(text)

In [30]:
# Decode text
def int_to_text(ints):
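    try:
        ints = ints.numpy()  # Convert a TensorFlow tensor to a NumPy array
    except AttributeError:
        pass  # Already a NumPy array, which has no .numpy() method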
    return ''.join(idx2char[ints])

print(int_to_text(text_as_int[:13]))

First Citizen


<p>Creating Training Examples</p>

In [31]:
seq_length = 100  # length of sequence for a training example
examples_per_epoch = len(text)//(seq_length+1)  # Number of training examples: each one uses seq_length+1 characters (100 inputs plus the final target character)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

In [32]:
# Batching data into desired length
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

In [33]:
# Split each batch into input and outputs of length 100
def split_input_target(chunk):  # for the example: hello
    input_text = chunk[:-1]  # hell
    target_text = chunk[1:]  # ello
    return input_text, target_text  # hell, ello

dataset = sequences.map(split_input_target)  # we use map to apply the above function to every entry
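
<p>The chunking and splitting steps above can be sketched on a plain string without TensorFlow. The helper make_examples is hypothetical; it mimics batching into chunks of seq_length+1 with the remainder dropped, followed by the input/target split.</p>

```python
def make_examples(text, seq_length):
    """Cut text into chunks of seq_length+1 and split each into (input, target)."""
    chunk_size = seq_length + 1
    n_chunks = len(text) // chunk_size  # leftover characters are dropped
    examples = []
    for i in range(n_chunks):
        chunk = text[i * chunk_size:(i + 1) * chunk_size]
        examples.append((chunk[:-1], chunk[1:]))  # target is input shifted by one
    return examples


toy_examples = make_examples("hello world!", seq_length=4)
for input_text, target_text in toy_examples:
    print(repr(input_text), "->", repr(target_text))
```

<p>Each target is the input shifted one character to the right, so the model learns to predict the next character at every position.</p>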

In [35]:
# Example
for x, y in dataset.take(2):
    print("\n\nEXAMPLE\n")
    print("INPUT")
    print(int_to_text(x))
    print("\nOUTPUT")
    print(int_to_text(y))



EXAMPLE

INPUT
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You

OUTPUT
irst Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You 


EXAMPLE

INPUT
are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you 

OUTPUT
re all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you k


In [36]:
# Make training batches (64 different sequences)
BATCH_SIZE = 64
VOCAB_SIZE = len(vocab)  # number of unique characters in the text
EMBEDDING_DIM = 256
RNN_UNITS = 1024

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

data = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
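
<p>The buffer-based shuffling described in the comment above can be approximated in plain Python. This is a rough sketch of the idea, not TensorFlow's implementation; buffered_shuffle and its seed parameter are hypothetical.</p>

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a partially shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and replace it with the new one
    rng.shuffle(buffer)          # drain what is left at the end
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
print(shuffled)
```

<p>Only buffer_size elements are held in memory at once, so an element can move forward only a limited distance. A buffer at least as large as the dataset gives a full shuffle, while a buffer of size 1 leaves the order unchanged.</p>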

<p>Building the Model</p>

In [37]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                        return_sequences=True,  # Return intermediate stage at every step
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)  # Outputs one logit (unnormalized score) per unique character; no softmax is applied here
    ])
    return model

model = build_model(VOCAB_SIZE,EMBEDDING_DIM, RNN_UNITS, BATCH_SIZE)
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (64, None, 256)           16640     
_________________________________________________________________
lstm_1 (LSTM)                (64, None, 1024)          5246976   
_________________________________________________________________
dense_1 (Dense)              (64, None, 65)            66625     
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________


<p>Creating a Loss Function</p>

In [38]:
# Shows output of the model
for input_example_batch, target_example_batch in data.take(1):
    example_batch_predictions = model(input_example_batch)  # ask our model for a prediction on our first batch of training data (64 entries)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")  # print out the output shape

(64, 100, 65) # (batch_size, sequence_length, vocab_size)


In [39]:
# we can see that the prediction is an array of 64 arrays, one for each entry in the batch
print(len(example_batch_predictions))
print(example_batch_predictions)

64
tf.Tensor(
[[[ 1.82788807e-03  2.01575982e-04  5.03402250e-03 ... -9.42482322e-04
    5.33416215e-03  4.83257184e-03]
  [ 3.94581817e-03 -2.95584276e-03  4.01094649e-03 ... -7.62519136e-04
    3.85253597e-03  2.01960467e-03]
  [-1.24495942e-03 -4.33709333e-03  3.80271813e-03 ... -2.90203607e-03
    1.31797837e-03  1.00400881e-04]
  ...
  [-2.90967338e-03 -1.66946258e-02 -1.74449303e-03 ...  3.98084382e-03
   -3.47526977e-03 -7.74108805e-04]
  [ 1.98594062e-04 -1.39400531e-02 -5.80338296e-03 ...  6.57944009e-03
   -1.12151052e-03  1.05705438e-02]
  [ 7.30427820e-03 -1.75384283e-02 -1.46656588e-03 ...  5.89273591e-03
    3.52087012e-03  1.37759168e-02]]

 [[-9.73303220e-04 -3.31859570e-04  8.12452810e-04 ...  4.92147077e-03
    2.87920237e-03  3.16991704e-03]
  [ 3.32242460e-04 -7.78404158e-03  4.82773583e-04 ...  5.18790819e-03
   -6.63016271e-03  9.11252014e-03]
  [-2.69867014e-03 -7.61294039e-03  2.14320002e-03 ...  1.60300476e-03
   -6.69447333e-03  4.19102330e-03]
  ...
  [-1.004

In [40]:
# let's examine one prediction
pred = example_batch_predictions[0]
print(len(pred))
print(pred)
# notice this is a 2d array of length 100, where each interior array is the prediction for the next character at each time step

100
tf.Tensor(
[[ 0.00182789  0.00020158  0.00503402 ... -0.00094248  0.00533416
   0.00483257]
 [ 0.00394582 -0.00295584  0.00401095 ... -0.00076252  0.00385254
   0.0020196 ]
 [-0.00124496 -0.00433709  0.00380272 ... -0.00290204  0.00131798
   0.0001004 ]
 ...
 [-0.00290967 -0.01669463 -0.00174449 ...  0.00398084 -0.00347527
  -0.00077411]
 [ 0.00019859 -0.01394005 -0.00580338 ...  0.00657944 -0.00112151
   0.01057054]
 [ 0.00730428 -0.01753843 -0.00146657 ...  0.00589274  0.00352087
   0.01377592]], shape=(100, 65), dtype=float32)


In [41]:
# and finally we'll look at a prediction at the first timestep
time_pred = pred[0]
print(len(time_pred))
print(time_pred)
# and of course it's 65 values, one logit (unnormalized score) for each character that could occur next

65
tf.Tensor(
[ 1.8278881e-03  2.0157598e-04  5.0340225e-03 -1.1198933e-04
 -2.5543952e-03  1.8606276e-03 -1.2137231e-03 -3.5601272e-04
  4.1470136e-03  1.2043944e-03 -4.0052151e-03  1.8165645e-03
 -2.4454659e-03 -1.7880998e-04  2.2015984e-03  1.7457247e-03
  3.6166115e-03  5.1432168e-03 -1.3972995e-03 -3.1049466e-03
 -2.4225553e-03  1.7501978e-03 -7.7822793e-04 -1.5873162e-03
 -3.3104816e-04  5.1440187e-03  5.9586451e-03  5.2843406e-03
  8.0963841e-04  2.2171917e-03  1.9714087e-03  6.9209440e-03
  3.9468948e-03  1.3430731e-03  6.9568260e-04 -1.5013647e-03
  3.7770809e-03  4.8422916e-03  2.1085446e-04  8.5316377e-04
 -2.4801707e-03 -1.6185853e-03 -1.1221125e-03  4.1194800e-03
  3.6000230e-04 -5.4782983e-03 -1.1672035e-03 -3.8305970e-03
  9.1479160e-07  1.9324326e-04 -1.7411442e-04 -8.5141824e-04
 -1.3901860e-03 -3.3946878e-03 -1.7564500e-03 -1.9955414e-03
  5.6158807e-03 -4.1949796e-05 -3.1693918e-03  2.6363947e-03
 -8.0230529e-04 -5.2355505e-03 -9.4248232e-04  5.3341622e-03
  4.832571

In [42]:
# If we want to determine the predicted character we need to sample the output distribution (pick a value based on probability)
sampled_indices = tf.random.categorical(pred, num_samples=1)

# now we can reshape that array and convert all the integers to characters to see the actual text
sampled_indices = np.reshape(sampled_indices, (1, -1))[0]
predicted_chars = int_to_text(sampled_indices)

predicted_chars  # and this is what the model predicted for training sequence 1

'zycm?HXobuLxv\n XI!rsbB!EXNtSzmPrOrzzqncp!aa,;kVM-Z.GNgXyehpx,BvUxy!SC PfBdWfEUk;zgMlwMWcEzLuTmIaP$&s'

In [44]:
# Loss function
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
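
<p>A plain-Python sketch of what sparse categorical cross-entropy computes from logits for a single prediction: apply softmax, then take the negative log of the probability assigned to the true label. The helper names are hypothetical and this is not the TensorFlow implementation.</p>

```python
import math


def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_crossentropy_from_logits(label, logits):
    """Negative log-probability of the integer label under softmax(logits)."""
    return -math.log(softmax(logits)[label])


uniform_loss = sparse_crossentropy_from_logits(0, [0.0, 0.0, 0.0, 0.0])
confident_loss = sparse_crossentropy_from_logits(2, [0.0, 0.0, 5.0, 0.0])

print(uniform_loss)    # equals log(4) for four equally likely classes
print(confident_loss)  # smaller, since the true class has the largest logit
```

<p>By the same reasoning, a model whose logits are nearly uniform over the 65 characters would be expected to start with a loss close to ln(65), roughly 4.17.</p>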

<p>Compiling the Model</p>

In [45]:
model.compile(optimizer='adam', loss=loss)

<p>Creating Checkpoints</p>

In [46]:
# Allows the model to save checkpoints while it trains so they can be loaded later
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
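
<p>The "{epoch}" placeholder in the prefix is filled in with the epoch number when a checkpoint is saved. The sketch below only illustrates the string formatting, assuming epochs are numbered from 1; example_dir, example_prefix and example_names are hypothetical names.</p>

```python
import os

example_dir = "./training_checkpoints"
example_prefix = os.path.join(example_dir, "ckpt_{epoch}")

# Expand the template for the first three epochs.
example_names = [example_prefix.format(epoch=e) for e in range(1, 4)]
print(example_names)
```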

<p>Training the Model</p>

In [48]:
history = model.fit(data, epochs=1, callbacks=[checkpoint_callback])



<p>Loading the Model</p>

In [49]:
model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, batch_size=1)

In [50]:
# Find the latest checkpoint that stores the model's weights
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

In [57]:
# Load specified checkpoint
# load_weights expects the checkpoint path prefix as a string; passing the reader object
# returned by tf.train.load_checkpoint raises an AttributeError
checkpoint_num = 1
model.load_weights("./training_checkpoints/ckpt_" + str(checkpoint_num))
model.build(tf.TensorShape([1, None]))

<p>Generating Text</p>

In [60]:
def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)

    # Number of characters to generate
    num_generate = 800

    # Convert the start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty list to store the generated characters
    text_generated = []

    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)

        # Remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Scale the logits by the temperature, then sample the next character
        # from the resulting categorical distribution
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # Pass the predicted character as the next input to the model,
        # along with the previous hidden state (kept because the LSTM is stateful)
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))
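
<p>The effect of the temperature setting can be seen on a small example. This sketch uses only the standard library; temperature_probs and sample_index are hypothetical helpers that mirror the divide-then-sample step, not TensorFlow functions.</p>

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index with probability given by the scaled softmax."""
    weights = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in temperature_probs(toy_logits, t)])

print(sample_index(toy_logits, temperature=1.0, rng=random.Random(0)))
```

<p>Lower temperatures concentrate probability on the highest logit, so sampling becomes more predictable; higher temperatures flatten the distribution.</p>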

In [61]:
# inp = input("Type a starting string: ")
inp = "chicken"
print(generate_text(model, inp))

chicken,
To il, but that poobtur: not and peat nith oncy
You rowit ins tol prouvee thes and
thal't ve kins us the diest ofe, hes prowand and lent:
Thic Caylf usfouk saun is I wond
n'tongut on ssards ke burlins
As naveer of is davp ce itein thee of thinkn: er.

MANIM:

ATI henrent wires tot with and that end shared fpracioldod'suld a pord ont,
Whacids shave.

GOO:
Neners, baid and blongilk so
The hathork live trotes of old Wiin,
Then' Is plarders oro bad
ISAd ant amant ream thiu.

Firs UCILEN:
Aswer, Lepa: on- tere and nod the farld, theur be of an memer will's
Dither but atow tand and porolle ond now our our
The cor of frated and ty lavers thes, Oull to anto, and
Thet'st lefpire be corg thale red is siris not hiple in.

CDUCINI:
I play, wnit orie reer the so! prricheransing thit.

HANBES:
Saving 
