<a href="https://colab.research.google.com/github/SMBH-1/tbd/blob/main/RNN_Projects_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Recurrent Neural Networks - used for NLP

Much more capable than feed-forward nets of processing sequential data such as text or characters. Used here to do the following:
1) Sentiment Analysis
2) Character Generation

Sequence Data
Unlike images, sequence data (e.g. long chains of text, weather patterns, videos & anything else where the notion of step or time is relevant) needs to be processed & handled in a special way.

In textual data, need to keep track of order of characters/words. Simply encoding an entire paragraph into one data point wouldn't work.

A) Bag of Words Method - every unique word in dataset (the vocabulary) is placed in a dict w/ a number assigned as its value. You keep track of frequency of words but lose ordering of the words. May work for simpler examples, but meaning often depends on word order, which makes this a poor choice for most text tasks.
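
A minimal plain-Python sketch of the bag-of-words idea. The two sentences and the helper name `bag_of_words` are made up for illustration. It shows why losing the ordering hurts: two sentences with opposite meanings get exactly the same encoding.

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Map each word to its vocabulary id and count how often each id occurs."""
    return Counter(vocab[word] for word in text.lower().split())

# Hypothetical example sentences with opposite meanings
a = "i thought the movie was going to be bad but it was actually amazing"
b = "i thought the movie was going to be amazing but it was actually bad"

# Vocabulary: every unique word gets an integer id
vocab = {word: i for i, word in enumerate(sorted(set((a + " " + b).split())))}

bag_a = bag_of_words(a, vocab)
bag_b = bag_of_words(b, vocab)

print(bag_a == bag_b)  # the two bags are identical even though the meanings differ
```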

B) Word Embedding - translates each word into a vector, placing similar words near each other in vector space. The vectors attempt to capture meaning of words & the contexts they appear in; word order itself is handled by the recurrent layers that follow.

Word embeddings are learned by looking at many different training examples. Can add an embedding layer to beginning of model & while model trains, embedding layer will learn useful embeddings for the words. Can also use pretrained embedding layers (like pretrained base layers for a CNN).
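
Conceptually, an embedding layer behaves like a trainable lookup table: one row of numbers per word id. The plain-Python sketch below only illustrates that idea; it is not how Keras implements the layer, and the sizes and random starting values are made up.

```python
import random

random.seed(0)  # only so the sketch is repeatable

VOCAB_SIZE = 10      # hypothetical tiny vocabulary
EMBEDDING_DIM = 4    # length of each word vector

# One row of EMBEDDING_DIM numbers per word id; training would adjust these values
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(word_ids):
    """Replace every word id with its row from the table."""
    return [embedding_table[i] for i in word_ids]

sentence = [3, 7, 3, 1]          # a sentence already encoded as word ids
vectors = embed(sentence)

print(len(vectors), len(vectors[0]))   # sequence length, embedding dim
print(vectors[0] == vectors[2])        # same word id -> same vector
```

The output keeps one vector per position, so the sequence order is still available to whatever layer comes next.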




Recurrent Neural Networks (RNNs) - up until now, we've used feed-forward neural nets. All data is fed forward (all at once) from left to right thru the network. That doesn't work well for text processing.
  

*   We read words left to right & keep track of current meaning of sentence so we can understand next word.

This is what RNN is designed to do. It's a network that contains a LOOP. RNN processes one word at a time while maintaining an internal memory of what it's already seen. Allows it to treat words differently based on order in a sentence & slowly build understanding of entire input one word at a time.
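
A toy sketch of that loop, using a single number as the internal memory. The weights `w_x`, `w_h` and `b` are arbitrary values chosen for illustration; a real recurrent layer uses weight matrices that are learned during training.

```python
import math

def simple_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Scalar recurrent cell: h_t = tanh(w_x * x_t + w_h * h_(t-1) + b)."""
    h = 0.0                      # internal memory starts empty
    for x in inputs:             # the LOOP: one input per step
        h = math.tanh(w_x * x + w_h * h + b)
    return h

# Same inputs, different order -> different final state
h_ab = simple_rnn([1.0, -1.0])
h_ba = simple_rnn([-1.0, 1.0])
print(h_ab, h_ba)
```

Because each step mixes the new input with the previous state, the final state depends on the order in which the inputs arrived.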


The most basic recurrent layer is called SimpleRNN (tf.keras.layers.SimpleRNN). It struggles with longer texts.

LSTM (Long Short-Term Memory) - other recurrent layers work better on long inputs; LSTM is one example. Adds a way to carry information forward from much earlier timesteps. In a SimpleRNN layer, input from earlier timesteps gradually disappears as we get further through the input.

LSTM - has a long-term memory (the cell state) that is carried from timestep to timestep; gates decide what to store, what to forget & what to output at each step. Lets the network hold on to information from far earlier in the sequence rather than only the most recent inputs. Adds to network complexity & allows it to discover more useful relationships between inputs & when they appear.
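
For reference, the usual textbook form of the LSTM update is below. The notation is not from these notes: x_t is the input at step t, h_t the output/hidden state, c_t the cell state (long-term memory), σ the sigmoid function and ⊙ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate memory} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The cell state c_t is only ever scaled by the forget gate and added to, which is what lets information survive across many timesteps.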


Sentiment Analysis

-Use RNNs
-Process of computationally identifying and categorizing opinions expressed in a piece of text, especially to determine whether the writer's attitude towards a particular topic, product, etc. is positive, neutral or negative

In [None]:
%tensorflow_version 2.x
from keras.datasets import imdb
from keras.preprocessing import sequence
import tensorflow as tf
import os
import numpy as np

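VOCAB_SIZE = 88584 #Number of unique words in the IMDB word index

MAXLEN = 250 #Every review gets padded/trimmed to this many words
BATCH_SIZE = 64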

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words = VOCAB_SIZE)

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [None]:
train_data[0] #Looking at first review

Preprocessing (cont'd) - loaded reviews are of different lengths; this is an issue. We can't pass data of different lengths into our neural network (like how we had to resize images for CNN). Make every review the same length:


*   If review is > 250 words, trim off extra words
*   If review < 250 words, add necessary amount of 0's to make it equal to 250
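
The two rules can be sketched in plain Python as below. The helper name `pad_or_trim` is made up; it mirrors the defaults of `pad_sequences` (`padding='pre'`, `truncating='pre'`), meaning zeros are added at the front and, for long reviews, it is the beginning that gets cut off.

```python
def pad_or_trim(tokens, maxlen, pad_value=0):
    """Left-pad with pad_value, or keep only the last maxlen tokens."""
    if len(tokens) >= maxlen:
        return tokens[len(tokens) - maxlen:]
    return [pad_value] * (maxlen - len(tokens)) + tokens

print(pad_or_trim([5, 8, 2], 5))            # [0, 0, 5, 8, 2]
print(pad_or_trim([1, 2, 3, 4, 5, 6], 4))   # [3, 4, 5, 6]
```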



In [None]:
train_data = tf.keras.utils.pad_sequences(train_data, MAXLEN)
test_data = tf.keras.utils.pad_sequences(test_data, MAXLEN)
# train_data[1] - check to see how padding was done

In [None]:
#Create model - use word embedding layer as first layer & add LSTM layer afterwards that feeds into dense node 
#to get our predicted sentiment; 32 stands for output dimension of vectors generated by embedding layer. can change if we want

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 32)          2834688   
                                                                 
 lstm (LSTM)                 (None, 32)                8320      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________
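
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types; the variable names are just for this example.

```python
VOCAB_SIZE = 88584
EMBEDDING_DIM = 32
LSTM_UNITS = 32

# Embedding: one EMBEDDING_DIM-long vector per word in the vocabulary
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# LSTM: 4 weight sets (3 gates + candidate memory), each with input weights,
# recurrent weights and a bias
lstm_params = 4 * ((EMBEDDING_DIM + LSTM_UNITS) * LSTM_UNITS + LSTM_UNITS)

# Dense: one weight per LSTM unit plus one bias
dense_params = LSTM_UNITS * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 2834688 8320 33 2843041
```

Almost all of the parameters sit in the embedding layer, since it stores a vector for every word in the vocabulary.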


In [None]:
#Now to train above model

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['acc'])
history = model.fit(train_data, train_labels, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
results = model.evaluate(test_data, test_labels)
print(results)

[0.5649113655090332, 0.8435199856758118]


In [None]:
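#Predictions - with trained model, we make predictions on our own reviews. Need to convert any review that we write into a form
#the network can understand. To do that, load word index from dataset & use it to encode our own data
#Note: imdb.load_data() offsets word indices by index_from=3 by default while get_word_index() does not, so ids below are 3 lower than in training data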

word_index = imdb.get_word_index()

def encode_text(text):
  tokens = tf.keras.preprocessing.text.text_to_word_sequence(text)
  tokens = [word_index[word] if word in word_index else 0 for word in tokens]
  return tf.keras.utils.pad_sequences([tokens], MAXLEN)[0]

text = 'that movie was just amazing, so amazing'
encoded = encode_text(text)
print(encoded)

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0  12  17  13  4

In [None]:
#Decode function takes in movie review in int form (like above list) & returns text value

reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_integers(integers):
  PAD = 0
  text = ''
  for num in integers:
    if num!= PAD:
      text += reverse_word_index[num] + ' '
  return text[:-1]

print(decode_integers(encoded))

that movie was just amazing so amazing


In [None]:
#Now to make prediction

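def predict(text):
  encoded_text = encode_text(text)
  pred = np.zeros((1, MAXLEN)) #Model expects a batch, so wrap our single review in a batch of size 1
  pred[0] = encoded_text
  result = model.predict(pred)
  print(result[0]) #Sigmoid output: closer to 1 = positive, closer to 0 = negative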

positive_review = "That movie was so awesome! I really loved it and would watch it again because it was amazingly great"
predict(positive_review)

negative_review = "that movie sucked. I hated it and wouldn't watch it again. Was one of the worst things I've ever watched"
predict(negative_review)

[0.8077212]
[0.21099128]
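
To turn these scores into a label, a cutoff is needed. The sketch below assumes the conventional 0.5 threshold and the IMDB labelling (1 = positive, 0 = negative); the helper name `to_label` is made up.

```python
def to_label(score, threshold=0.5):
    """Turn a sigmoid output in [0, 1] into a sentiment label."""
    return "positive" if score >= threshold else "negative"

# Scores printed by predict() above
print(to_label(0.8077212))    # positive
print(to_label(0.21099128))   # negative
```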


Using RNN to generate text

Will show RNN an example of something we want it to recreate & it will learn how to write a version of it on its own. Done by using a character-level predictive model that takes a variable-length sequence as input & predicts the next character. Can call the model many times in a row, feeding the output of the last prediction in as input to the next call, to generate a sequence.

In [None]:
%tensorflow_version 2.x
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np

path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


In [None]:
#Optional - run this cell instead of the download above to choose a txt file from our own computer
from google.colab import files
path_to_file = list(files.upload().keys())[0]

In [None]:
#Read, then decode for py2 compat
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

#Length of text is number of characters in it
print('Length of text: {} characters'.format(len(text)))

#Look at first 250 chars of text
print(text[:250])

Length of text: 1115394 characters
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [None]:
#Now for encoding all this above txt file

vocab = sorted(set(text)) #Sorts all unique characters in text

#Creating a mapping from unique chars to indices
char2idx = {u:i for i,u in enumerate(vocab)} #goes from letter/char to index
idx2char = np.array(vocab) #reverse mapping going from index to letter/char

def text_to_int(text):
  return np.array([char2idx[c] for c in text]) #Convert every char in text to integer representation by pointing to char2idx dict

text_as_int = text_to_int(text) #Convert entire txt file loaded above into int representation using function text_to_int()

#Look at how part of our text is encoded now
print('Text: ', text[:13])
print('Encoded: ', text_to_int(text[:13]))

Text:  First Citizen
Encoded:  [18 47 56 57 58  1 15 47 58 47 64 43 52]


In [None]:
#Function converts int list/array into text (reverses above function)

def int_to_text(ints):
  try:
    ints = ints.numpy() #Tensors need converting to numpy arrays first
  except AttributeError:
    pass #Already a list/numpy array
  return ''.join(idx2char[ints])

print(int_to_text(text_as_int[:13]))

First Citizen


In [None]:
"""Now to create training examples; goal is to feed model a sequence & have it return to us next char. 
Need to split our text data into many shorter sequences to pass to model as training examples.
Each training example uses a sequence of seq_length chars as input & a sequence of seq_length chars as target, where
the target is the input sequence shifted one char to the right

ex: input: Hell | output: ello
"""

seq_length = 100 #length of sequence for a training example
examples_per_epoch = len(text)//(seq_length+1) #each example needs seq_length+1 chars (100 input chars + 1 more for the shifted target), hence 101 in denominator

#Create training examples/targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int) #Will have 1+ million chars

#Now can use batch method to turn stream of chars into batches of desired length (ex: 101 and dropping remainder chars)
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
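
The chunk-then-shift idea can be sketched on a short string in plain Python. The helper name `make_examples` is made up for this example, and it works on characters directly rather than on the int-encoded dataset.

```python
def make_examples(text, seq_length):
    """Cut text into chunks of seq_length+1 chars, dropping any remainder,
    then split each chunk into (input, target) shifted by one char."""
    chunk_len = seq_length + 1
    n_chunks = len(text) // chunk_len
    chunks = [text[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]
    return [(chunk[:-1], chunk[1:]) for chunk in chunks]

examples = make_examples("Hello world", seq_length=4)
print(examples)   # [('Hell', 'ello'), (' wor', 'worl')]
```

The trailing "d" does not fill a whole chunk, so it is dropped, just as `drop_remainder=True` does above.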

In [None]:
#Now take sequences of len 101 and split into input & output

def split_input_target(chunk): #for example: hello
  input_text = chunk[:-1] #hell
  target_text = chunk[1:] #ello
  return input_text, target_text #hell, ello

dataset = sequences.map(split_input_target) #we use map to apply above function to every entry

for x, y in dataset.take(2):
  print("\n\nEXAMPLE\n")
  print("INPUT")
  print(int_to_text(x))
  print("\nOUTPUT")
  print(int_to_text(y))

#Make training batches
BATCH_SIZE = 64
VOCAB_SIZE = len(vocab) #Number of unique chars
EMBEDDING_DIM = 256
RNN_UNITS = 1024

#Buffer size to shuffle dataset
#(TF data is designed to work w/ possibly infinite sequences, so it doesn't
#attempt to shuffle entire sequence in memory. Instead, it maintains a buffer
#in which it shuffles elements)

BUFFER_SIZE = 10000

data = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
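
A rough plain-Python sketch of the buffered-shuffle idea described in the comment above. This only illustrates the concept and is not tf.data's actual implementation; the function name `buffered_shuffle` is made up.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in shuffled order while holding at most buffer_size in memory."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
        else:
            i = rng.randrange(buffer_size)
            yield buffer[i]              # emit a random buffered element...
            buffer[i] = item             # ...and put the new item in its slot
    # Input exhausted: drain what is left in random order
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
print(shuffled)
```

An element can only move as far as the buffer allows, so a buffer smaller than the dataset gives a partial shuffle rather than a fully uniform one.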

In [None]:
#Building Model
#Embedding layer, LSTM and one dense layer that contains a node for each unique char in training data
#Dense layer outputs one logit (unnormalized score) per node; applying softmax to them gives a probability distribution over chars

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                batch_input_shape=[batch_size, None]),
      tf.keras.layers.LSTM(rnn_units,
                           return_sequences = True,
                           stateful=True,
                           recurrent_initializer = 'glorot_uniform'),
      tf.keras.layers.Dense(vocab_size)                               
  ])
  return model

model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, BATCH_SIZE)
model.summary()

#Model takes batches of 64 training examples & gives us 64 outputs. After training, we will rebuild
#it w/ the weights we've saved but w/ a batch size of 1 to
#get 1 prediction for 1 input sequence

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (64, None, 256)           16640     
                                                                 
 lstm_1 (LSTM)               (64, None, 1024)          5246976   
                                                                 
 dense_1 (Dense)             (64, None, 65)            66625     
                                                                 
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________


In [None]:
#Creating Loss Function - create own loss function because our model will output
#a (64, sequence_length, 65) shaped tensor holding a logit (unnormalized score) for
#each char at each timestep for every sequence in batch

for input_example_batch, target_example_batch in data.take(1):
  example_batch_predictions = model(input_example_batch) #ask our model for a prediction on our first batch of training data
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)") #print out output shape

#We can see that prediction is an array of 64 arrays, one for each batch entry
print(len(example_batch_predictions))
print(example_batch_predictions)

In [None]:
#Examine one prediction
pred = example_batch_predictions[0]

print(len(pred))
print(pred)
#Notice this is 2D array of len 100, where each interior array is prediction for next char at each time step

100
tf.Tensor(
[[ 0.00155463  0.00232505 -0.00082527 ... -0.00340165  0.00014181
   0.0018117 ]
 [-0.00400441 -0.00133221 -0.00264353 ... -0.01010216  0.00291852
   0.00063826]
 [-0.00336853  0.00187674 -0.00252856 ... -0.00758613  0.00096282
   0.00151361]
 ...
 [-0.00185623 -0.00531099 -0.00048059 ...  0.00311475  0.00249094
   0.00244514]
 [-0.00201645 -0.00772558 -0.00131663 ... -0.00173657  0.00531403
  -0.00062594]
 [-0.00915895 -0.01009805  0.00059497 ...  0.00081316  0.00036151
   0.00430956]], shape=(100, 65), dtype=float32)


In [None]:
#We'll lastly look at prediction at first time step
time_pred = pred[0]
print(len(time_pred))
print(time_pred)
#it's 65 values (logits), one per character; higher value = model thinks that character is more likely to occur next

In [None]:
#To determine predicted char we need to sample output distribution (pick a value based on probability)
sampled_indices = tf.random.categorical(pred, num_samples = 1)

#Next reshape array & convert all ints to chars to see actual text
sampled_indices = np.reshape(sampled_indices, (1,-1))[0]
predicted_chars = int_to_text(sampled_indices)

predicted_chars #this is what model predicted for training sequence 1


",TysicjQ.:VI3kk&r&kY&xqg\nGmRUKohpvBDEoqwcQBmDqAvjwzyD\n.WlmAxKJ:?k' s$ghJkZaPawbmOmmOiSQCGR$cSD DskGF"

In [None]:
#Create loss function (keras built in example below) that can compare output to expected output & give us some numeric value
#representing how close the two were

def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits = True)
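
What this loss computes for a single timestep can be sketched in plain Python: softmax the logits, then take the negative log of the probability given to the correct char. This is an illustration of the formula, not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(label, logits):
    """Loss for one timestep: -log(softmax(logits)[label])."""
    m = max(logits)                                   # subtract max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# An untrained model scores all 65 chars about equally
uniform_logits = [0.0] * 65
loss_uniform = sparse_categorical_crossentropy(3, uniform_logits)
print(loss_uniform)   # ln(65), about 4.17

# Confident and correct -> loss near 0
confident_logits = [0.0] * 65
confident_logits[3] = 10.0
loss_confident = sparse_categorical_crossentropy(3, confident_logits)
print(loss_confident)
```

A loss near ln(65) at the start of training is therefore expected: it means the model is still guessing roughly uniformly over the 65 chars.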

In [None]:
#Compile model 

model.compile(optimizer='adam', loss=loss)

In [None]:
#Create checkpoints - gets model to save checkpoints as it trains; allows us to load model from a checkpoint & continue training it

#Directory where checkpoints will be saved
checkpoint_dir = './training_checkpoints'
#Name of checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch}')

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)

In [None]:
#Train model

history = model.fit(data, epochs=40, callbacks=[checkpoint_callback])

In [None]:
#Rebuild Model - we'll rebuild model from previous checkpoint using batch_size = 1 so we can feed one piece of text & have model make prediction

model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, batch_size=1)

#Option 1 - load weights from most recent checkpoint
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

#Option 2 - load weights from a specific checkpoint; load_weights takes the checkpoint path prefix
#(tf.train.load_checkpoint returns a reader object, not a path, so it isn't used here)
checkpoint_num = 40
model.load_weights(os.path.join(checkpoint_dir, 'ckpt_' + str(checkpoint_num)))
model.build(tf.TensorShape([1, None]))

In [None]:
#Generate Text using below function

def generate_text(model, start_string):
  #Evaluation step (generating text using learned model)

  #Number of chars to generate
  num_generate = 800

  #Converting start str to num (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  #Empty list to store our results
  text_generated = []

  #Low temp results in more predictable text.
  #Higher temp = more surprising text.
  #Experiment to find best setting
  temperature = 1.0

  #Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
    predictions = model(input_eval)
    #remove the batch dimension

    predictions = tf.squeeze(predictions, 0)

    #Using categorical distribution to sample char from logits returned by model
    predictions = predictions / temperature
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

    #We pass predicted char as next input to model along w/ previous hidden state
    input_eval = tf.expand_dims([predicted_id], 0)

    text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))

inp = input('Type a starting string: ')
print(generate_text(model, inp))


Type a starting string: romeo
romeo.

FROTH:
Ay, commend me to you. Thou hast sworn brother,
Tull favourage dignities us, his fashions,
They be noted in this conatin.
Tell me, thou dost take quandave
To use it justice, make I with this fame
I' the insu take up down to pity of her,
And we will schat the glasses. The temple to his study,
Which strength itself in painted liberty
That Edward you fither, whou forward our hope to tear,
And left thee there and hed proclaim myself.
Alack, for sour adversuisand! Tybalt dead, continue fathers! and do you grow proligy,
Unless they are best received. But, O, how the
dead king's art,
Whethe leisure and thy father and
Thy friends, mine honour, as he does it will;
And weep against your brother, betimes,
Lest in reverence I may break ite our charge for me;
Or else you early not me; I charg
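
The temperature setting in generate_text divides the logits before sampling. The plain-Python sketch below shows the effect on the resulting probabilities; the three logits are made-up values and the helper name `softmax_with_temperature` is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by temperature, then normalize into probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]   # hypothetical scores for three chars

p_low = softmax_with_temperature(logits, 0.5)
p_mid = softmax_with_temperature(logits, 1.0)
p_high = softmax_with_temperature(logits, 2.0)

for t, probs in ((0.5, p_low), (1.0, p_mid), (2.0, p_high)):
    print(t, [round(p, 3) for p in probs])
```

Low temperature pushes most of the probability onto the top-scoring char (more predictable text); high temperature flattens the distribution (more surprising text).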
