<a href="https://colab.research.google.com/github/ChristineWeitw/Tensorflow-ML/blob/master/NLP_using_RNN_SentimentAnalysis_PlayGenerator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNNs are well suited to classification problems and to understanding textual data

# METHOD [Bag of words] - converts textual data to numeric data
    # each word is placed in a dictionary and mapped to an integer
    # a sentence is then represented by how often each word occurs in it, ignoring word order
    # disadvantage : the meaning carried by word order is lost
    # ex : "I thought that movie was bad, but it was actually amazing." vs. "I thought that movie was amazing, but it was actually bad." Both contain the same words with the same counts.
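
To make the disadvantage concrete, here is a minimal sketch in plain Python. The helper `bag_of_words` and the two review strings are illustrative assumptions, not part of the notebook's pipeline.

```python
from collections import Counter

def bag_of_words(sentence):
    """Return word counts, ignoring case, punctuation, and word order."""
    words = sentence.lower().replace(",", "").replace(".", "").split()
    return Counter(words)

# Two hypothetical reviews with opposite sentiment but the same words.
review_a = "I thought the movie was bad, but it was actually amazing."
review_b = "I thought the movie was amazing, but it was actually bad."

bag_a = bag_of_words(review_a)
bag_b = bag_of_words(review_b)

print(bag_a == bag_b)  # True: identical counts, so the two reviews look the same
```

Because both reviews produce the same bag, a model that only sees the counts cannot tell them apart.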
    
# METHOD [Word Embedding]
    # A word embedding is a learned representation for text where words that have similar meanings have similar representations.
    # It maps words with similar meanings to vectors that lie close together, and words with different meanings to vectors that point in different directions.
    # In other words, it is a vectorized representation of the words in a given document that places words with similar meanings near each other.
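
"Near each other" is usually measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors purely for illustration; real embeddings are learned during training and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction, -1 = opposite)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen by hand for this example only.
toy_embeddings = {
    "good": [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "bad": [-0.9, -0.7, 0.1],
}

sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["bad"])
print(sim_close > sim_far)  # True
```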
    
# RNN cf. CNN & Dense NN
    #  CNNs & Dense NNs are so-called "feed-forward NNs": they process all the data at once
    #  An RNN, on the other hand, has a loop in its internal model: it processes one word at a time and carries what it has seen so far forward to the next step
    #  An RNN works the way a human reads text: one word at a time, slowly building up its understanding
    #  when an RNN processes a new word, it combines it with the knowledge it has built up from the previous words
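
The loop can be sketched with a single-number hidden state. The weights `w_x` and `w_h` below are arbitrary assumptions; a real RNN layer learns weight matrices, but the recurrence has the same shape.

```python
import math

def run_simple_rnn(inputs, w_x=0.5, w_h=0.8):
    """Process inputs one at a time, carrying a hidden state forward."""
    h = 0.0  # nothing has been read yet
    for x in inputs:
        # the new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h)
    return h

forward = run_simple_rnn([1.0, 0.0, -1.0])
backward = run_simple_rnn([-1.0, 0.0, 1.0])
print(forward, backward)  # same values, different order, different final state
```

Unlike bag of words, the final state depends on the order in which the inputs arrive.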
## LSTM (Long Short Term Memory)
    # If the text sequence is really long, a simple RNN tends to lose important information from the beginning of the text
    # An LSTM lets the model look not only at the current input but also keep track of information from the beginning of the text
    # An LSTM adds a component (the cell state) that keeps track of the internal state across time steps
    # gates decide what to keep in, add to, and read from this state, so information from an early step can still be available many steps later
    # instead of only storing the previous output, the long-term memory carries selected information forward from any earlier point in the sequence
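
As a rough sketch of how that internal state is kept, here is one step of a single-unit LSTM written out by hand. The weights are hypothetical and chosen so that the forget gate stays close to 1 and the input gate close to 0, which makes the cell state hold its value over several steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM. w maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # the new information itself
    c = f * c_prev + i * g             # updated long-term memory
    h = o * math.tanh(c)               # updated output / short-term state
    return h, c

# Hypothetical weights: keep almost everything, write almost nothing.
weights = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0  # start with a value already stored in the cell state
for x in [5.0, -3.0, 2.0]:
    h, c = lstm_step(x, h, c, weights)
print(c)  # still very close to the initial 1.0
```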


# I. Sentiment analysis


In [2]:
from keras.datasets import imdb
from keras.preprocessing import sequence
import tensorflow as tf
import os
import numpy as np

VOCAB_SIZE = 88584

MAX_Len = 250
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words = VOCAB_SIZE)

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [5]:
print(len(train_data[0]))

218


## **More Preprocessing**
When we look at our loaded reviews, we will notice that they are different in length. Therefore we must make each review the same length.


*   if the review is > 250 words then trim off the extra words
*   if the review is < 250 words then we add the necessary amount of 0's to make it equal to 250
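
The two rules can be sketched in plain Python. This assumes the default behaviour of `pad_sequences`, which pads at the front and also trims from the front (keeping the last words); `pad_or_trim` is a hypothetical helper, not a Keras function.

```python
def pad_or_trim(seq, max_len, pad_value=0):
    """Mimic the pad_sequences defaults for one sequence (assumes max_len > 0)."""
    if len(seq) >= max_len:
        return seq[-max_len:]  # keep the last max_len items
    return [pad_value] * (max_len - len(seq)) + seq  # zeros go in front

print(pad_or_trim([5, 6, 7], 5))              # [0, 0, 5, 6, 7]
print(pad_or_trim([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```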



In [6]:
train_data = sequence.pad_sequences(train_data,MAX_Len)
test_data = sequence.pad_sequences(test_data, MAX_Len)

## **Create the Model**
We will use a word embedding layer as the first layer in our model and add an LSTM layer afterwards that feeds into a dense node to get our predicted sentiment.

32 stands for the output dimension of the vectors generated by the embedding layer.  We can change this value if we'd like!

In [7]:
RNN_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation='sigmoid')  # sigmoid gives one value between 0 and 1: close to 0 means negative, close to 1 means positive
])

In [8]:
RNN_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 32)          2834688   
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320      
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________
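
The parameter counts in the summary can be reproduced by hand. The helper functions below are assumptions written for this check; they use the standard formulas for embedding, LSTM, and dense layers.

```python
def embedding_params(vocab_size, output_dim):
    # one vector of length output_dim per word in the vocabulary
    return vocab_size * output_dim

def lstm_params(input_dim, units):
    # 4 weight sets (3 gates + candidate), each with input weights,
    # recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weights plus one bias per unit
    return input_dim * units + units

layer_params = [
    embedding_params(88584, 32),
    lstm_params(32, 32),
    dense_params(32, 1),
]
print(layer_params, sum(layer_params))  # [2834688, 8320, 33] 2843041
```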


## **Training**

In [9]:
RNN_model.compile(loss='binary_crossentropy',optimizer='rmsprop',metrics=['acc']) # choose 'binary' since the problem we have is a 2 class problem

history = RNN_model.fit(train_data, train_labels, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [10]:
results = RNN_model.evaluate(test_data, test_labels)
print(results)

[0.5178341865539551, 0.8518400192260742]
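
The two numbers returned by `evaluate` are the loss and the accuracy, in that order. As a sketch of what they measure, the plain-Python functions below compute binary cross-entropy and accuracy on a few made-up labels and predictions (the values are illustrative, not taken from the model).

```python
import math

def binary_crossentropy(labels, probs, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples where the thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(labels, probs) if (p >= threshold) == (y == 1))
    return hits / len(labels)

# Hypothetical labels and predicted probabilities.
toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.6]

toy_loss = binary_crossentropy(toy_labels, toy_probs)
toy_acc = accuracy(toy_labels, toy_probs)
print(toy_loss, toy_acc)
```

Confident wrong answers are penalised much more heavily by the loss than by the accuracy, which is why the two numbers can move differently.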


## **Making Predictions**
Now let's use our network to make predictions on our own reviews.

Since our reviews are encoded, we'll need to convert any review that we write into that form so the network can understand it. To do that, we'll load the encodings from the dataset and use them to encode our own data.

In [11]:
# make an encoding function to convert input reviews into the format our model can process
import keras
word_index = imdb.get_word_index() # a lookup table mapping each word in the dataset to its integer index

def encode_text(text):
  tokens = keras.preprocessing.text.text_to_word_sequence(text) # tokenize the input text
  tokens = [word_index[word] if word in word_index else 0 for word in tokens]  # use the word's index if it is in the word index; otherwise use 0
  return sequence.pad_sequences([tokens],MAX_Len)[0]  # pad_sequences expects a list of sequences, so we wrap tokens in a list and take the first (only) row of the result

## example
text = 'that movie was just amazing, so amazing'
encoded = encode_text(text)
print(encoded)
  

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0  
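
One detail worth checking when encoding your own text: with its default arguments, `imdb.load_data` shifts every word index by `index_from=3` and uses `2` for out-of-vocabulary words, while `imdb.get_word_index()` returns the unshifted ranks. The sketch below shows an encoder that applies the same shift. `encode_with_offset` and the toy word ranks are assumptions for illustration; this is not what the notebook's `encode_text` does.

```python
INDEX_FROM = 3  # default offset applied by imdb.load_data
OOV_INDEX = 2   # default id for out-of-vocabulary words in imdb.load_data

def encode_with_offset(tokens, word_index, vocab_size):
    """Encode tokens the way the training data is encoded (assuming defaults)."""
    encoded = []
    for word in tokens:
        rank = word_index.get(word)
        if rank is not None and rank + INDEX_FROM < vocab_size:
            encoded.append(rank + INDEX_FROM)
        else:
            encoded.append(OOV_INDEX)  # unknown or too-rare word
    return encoded

# Hypothetical ranks; the real ones come from imdb.get_word_index().
toy_word_index = {"the": 1, "movie": 17, "amazing": 477}

print(encode_with_offset(["the", "movie", "zzz"], toy_word_index, vocab_size=88584))
# [4, 20, 2]
```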

In [14]:
# create a decode function to convert integers back to words
reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_integers(integers):
  PAD = 0
  text = ''
  for num in integers:
    if num != PAD:
      text += reverse_word_index[num] + ' '  # separate the words with a space
  return text[:-1]  # drop the trailing space

print(decode_integers(encoded))

In [24]:
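def predict(text):
  encoded_text = encode_text(text)
  pred = np.zeros((1, MAX_Len))  # the model expects a batch, so wrap the single review in a batch of size 1
  pred[0] = encoded_text
  result = RNN_model.predict(pred)
  print(result[0])  # value between 0 and 1; closer to 1 means more positive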

positive_review = 'I love my boyfriend so much. he is such a cutie'
predict(positive_review)

negative_review = 'i hate my boyfriend so much, he is the worst asswhole i haver ever met.'
predict(negative_review)

[0.83154446]
[0.4324516]


## **II. Play Generator**

In [2]:
from keras.datasets import imdb
from keras.preprocessing import sequence
import tensorflow as tf
import os
import numpy as np
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'http://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Using TensorFlow backend.


Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


### If you want to RUN YOUR OWN FILE

In [21]:
from google.colab import files
path_to_file = list(files.upload().keys())[0]  # make sure it is a text file

KeyboardInterrupt: ignored

In [3]:
# Read, then decode for py2 compat.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8') # 'rb' = read bytes
# length of text is the number of characters in it
print('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters


In [4]:
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



### **Encoding**

In [6]:
vocab = sorted(set(text)) # the unique characters in the data

# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

def text_to_int(text):        # uses the mapping above to convert the characters in the text to integers
  return np.array([char2idx[c] for c in text])

text_as_int = text_to_int(text)

def int_to_text(ints):        # the opposite function: converts the integers back to text
  try:
    ints = ints.numpy() # if it is a tensor, turn it into a numpy array
  except AttributeError:
    pass                # already a numpy array, use it as is
  return ''.join(idx2char[ints])


In [7]:
print("Text:", text[:13])
print("Encoded:",text_to_int(text[:13]))

print(int_to_text(text_as_int[:13]))

Text: First Citizen
Encoded: [18 47 56 57 58  1 15 47 58 47 64 43 52]
First Citizen
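
The same round trip on a five-letter string makes the mapping easy to follow. This is a plain-Python sketch with its own toy names, so it does not touch the notebook's `char2idx` or `idx2char`.

```python
sample = "hello"

toy_vocab = sorted(set(sample))                      # ['e', 'h', 'l', 'o']
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = toy_vocab                             # position i holds character i

encoded_sample = [toy_char2idx[ch] for ch in sample]
decoded_sample = "".join(toy_idx2char[i] for i in encoded_sample)

print(encoded_sample)   # [1, 0, 2, 2, 3]
print(decoded_sample)   # hello
```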


In [9]:
seq_length = 100 # length of sequence for a training sample
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_datasets = tf.data.Dataset.from_tensor_slices(text_as_int) # turns the whole encoded text into a stream of individual characters (about 1.1 million of them)
print(char_datasets)

<TensorSliceDataset shapes: (), types: tf.int64>


In [10]:
# group the character stream into sequences of length 101 (seq_length + 1)
sequences = char_datasets.batch(seq_length+1, drop_remainder=True)

In [11]:
# now we take these sequences of length 101 and split each one into an input and a target
def split_input_target(chunk): # for the example "hello"
  input_text = chunk[:-1] # hell
  target_text = chunk[1:] # ello (the input shifted forward by one character)
  return input_text, target_text

dataset = sequences.map(split_input_target) # apply it to every sequence

In [12]:
## example
for x,y in dataset.take(2):
  print('\n\nEXAMPLE\n')
  print("INPUT")
  print(int_to_text(x))
  print("\nOUTPUT")
  print(int_to_text(y))



EXAMPLE

INPUT
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You

OUTPUT
irst Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You 


EXAMPLE

INPUT
are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you 

OUTPUT
re all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you k
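
The chunk-and-shift idea is easier to see on a short string. The sketch below is plain Python with a hypothetical helper `make_examples`; it cuts the text into chunks of `seq_length + 1`, drops the incomplete remainder, and pairs each input with the same text shifted forward by one character.

```python
def make_examples(ids, seq_length):
    """Split ids into (input, target) pairs where target is input shifted by one."""
    chunk_size = seq_length + 1
    n_chunks = len(ids) // chunk_size  # drop the incomplete remainder
    examples = []
    for k in range(n_chunks):
        chunk = ids[k * chunk_size:(k + 1) * chunk_size]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

pairs = make_examples("hello world", seq_length=4)
print(pairs)  # [('hell', 'ello'), (' wor', 'worl')]
```

At every position the target is the character that follows the input character, which is exactly what the model is asked to predict.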


In [12]:
BATCH_SIZE = 64 # one batch contains 64 examples
VOCAB_SIZE = len(vocab) # number of unique characters
EMBEDDING_DIM = 256
RNN_UNITS = 1024

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences, so it doesn't attempt to shuffle the entire sequence in memory.
# Instead, it maintains a buffer in which it shuffles elements.)

BUFFER_SIZE = 10000
# shuffle the examples and group them into batches of the proper size
data = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE,drop_remainder=True)
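
A buffered shuffle can be sketched as a generator in plain Python. This is an illustration of the idea described in the comment above, not TensorFlow's actual implementation; `buffered_shuffle` is a hypothetical name.

```python
import random

def buffered_shuffle(stream, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)          # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                # emit a random buffered item
        buffer[idx] = item               # and replace it with the new one
    rng.shuffle(buffer)                  # stream exhausted: flush what is left
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)
```

Only `buffer_size` items are held in memory at a time, so the first item emitted always comes from the first `buffer_size` items of the stream. A larger buffer gives a more thorough shuffle.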

In [16]:
print(data)  # 64 examples with 100 characters per example

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>


### **Building the Model**

In [13]:
# this model takes a batch of 64 training examples and returns 64 outputs
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                batch_input_shape=[batch_size, None]),  # None because the length of each of the 64 sequences is not fixed
      tf.keras.layers.LSTM(rnn_units,
                           return_sequences=True,  # return the output at every time step, not only the last one
                           stateful=True,
                           recurrent_initializer='glorot_uniform'),
      tf.keras.layers.Dense(vocab_size)  # one output per character in the vocabulary; these are raw scores (logits), not probabilities
  ])

  return model

model = build_model(VOCAB_SIZE,EMBEDDING_DIM,RNN_UNITS,BATCH_SIZE)
model.summary()

## 64 is the number of examples, None is the length of the sequences (which we don't know); the 65 at the end is the VOCAB_SIZE

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           16640     
_________________________________________________________________
lstm (LSTM)                  (64, None, 1024)          5246976   
_________________________________________________________________
dense (Dense)                (64, None, 65)            66625     
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________


## **Creating a Loss Function**

In [15]:
# first, let's check all the dimensions/shapes
for input_example_batch, target_example_batch in data.take(1):  # ask our model to predict on our first batch of training data
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 65) # (batch_size, sequence_length, vocab_size)


In [None]:
# this is an example of what our result looks like
print(len(example_batch_predictions))
print(example_batch_predictions)

  ## it returns 64 results, one for each of the 64 input examples
  

In [18]:
# let's examine one prediction
pred = example_batch_predictions[0]
print(len(pred))
print(pred)
  ## for every training example, we get one output per character of that example
  ## each training example has length 100, so the model runs 100 time steps, because the RNN is fed only one character at a time
  ## at every time step, the output is saved as a prediction and the internal state is passed on to the next step

100
tf.Tensor(
[[ 0.01046958  0.00570161 -0.00232532 ...  0.00423734 -0.00208422
  -0.00077858]
 [ 0.00615054  0.00605415 -0.00098767 ... -0.00132219  0.00455114
   0.00351664]
 [ 0.00526402  0.00256013 -0.00136691 ... -0.00231734  0.0018963
  -0.0005192 ]
 ...
 [ 0.00740656  0.00128037 -0.01382843 ...  0.00498741 -0.01445409
   0.00687171]
 [ 0.00915973  0.00107904 -0.01157518 ...  0.00284413 -0.01345574
   0.00102549]
 [ 0.00960581  0.00472066 -0.00960267 ...  0.00750926 -0.00656199
   0.00269089]], shape=(100, 65), dtype=float32)


In [19]:
# And finally we will look at the prediction at the first time step of the first training example
time_pred = pred[0]
print(len(time_pred))
print(time_pred)

  ## now we have a tensor of length 65. Its 65 values are the model's raw scores (logits) for each character occurring next; a softmax would turn them into probabilities

65
tf.Tensor(
[ 0.01046958  0.00570161 -0.00232532 -0.00743217 -0.01028958  0.00068924
 -0.00095865 -0.0043709  -0.00642598  0.00162289  0.00367309 -0.00227389
  0.00524175 -0.00572277  0.00557022  0.00014716  0.00301498  0.00231852
  0.00511751  0.0073759   0.00132013 -0.0007134   0.01477829 -0.00024507
 -0.00342577 -0.00183641 -0.00235248  0.00195048  0.00363811 -0.00324646
 -0.0071897  -0.00953974  0.00329146 -0.00159327  0.00534082  0.00217295
  0.00053816  0.00463307 -0.00053033  0.00077215 -0.00946033  0.0014554
  0.01255078  0.00087957 -0.00013797 -0.00495945  0.00328171 -0.00646829
 -0.00512948  0.00221696 -0.00572212  0.00312649 -0.00442466 -0.00891806
 -0.00807228 -0.00463288 -0.00622531  0.01280541 -0.00098653 -0.00167485
  0.00164307 -0.00426776  0.00423734 -0.00208422 -0.00077858], shape=(65,), dtype=float32)
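
The values above are raw scores (logits), which is why some of them are negative. A softmax turns them into probabilities that sum to 1. The sketch below applies a plain-Python softmax to the first five values printed above; using only five of the 65 is a simplification for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# First five logits from the output above (the real vector has 65 entries).
first_logits = [0.01046958, 0.00570161, -0.00232532, -0.00743217, -0.01028958]
probs = softmax(first_logits)
print(probs)
```

Because the untrained model's logits are all close to zero, the resulting probabilities are nearly uniform.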


In [22]:
## in conclusion, the output above shows that we need to define our own loss function
## because the model outputs raw scores (logits) for every character at every time step, and the loss has to be told to treat them as logits

# to determine the predicted character we sample from the output distribution (pick a character according to its probability) instead of always picking the character with the highest probability

  ## sample one character for each of the 100 time steps of the 1st training example
sampled_indices = tf.random.categorical(pred, num_samples=1)
  ## now we can reshape that array and convert the integers to characters
sampled_indices = np.reshape(sampled_indices, (1,-1))[0]
predicted_chars = int_to_text(sampled_indices)

predicted_chars  # this is what the (still untrained) model predicted for training sequence 1

"h '.dO3XcdG;?!:mK$vSnttAFB XiYnuga-uovoxQbKWpWjpuu'S$LA,LX'!\nefRAIUJwhOjmmnUa,XmfFJcoaQ.ho$JpxJ\nUMrS"

In [23]:
# use TensorFlow's built-in loss function to compute the loss between the true labels and the predicted logits
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
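
For a single time step, sparse categorical cross-entropy from logits is the negative log of the softmax probability assigned to the true character. The plain-Python function below is a sketch of that formula, not the TensorFlow implementation.

```python
import math

def sparse_categorical_crossentropy_from_logits(label, logits):
    """-log(softmax(logits)[label]), computed in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# An untrained model gives roughly equal scores to all 65 characters.
uniform_loss = sparse_categorical_crossentropy_from_logits(0, [0.0] * 65)
print(uniform_loss)  # log(65), about 4.17

# A model that strongly favours the correct character gets a loss near 0.
confident_logits = [10.0] + [0.0] * 64
confident_loss = sparse_categorical_crossentropy_from_logits(0, confident_logits)
print(confident_loss)
```

A loss near log(65) at the start of training is therefore a sign that the model is still guessing uniformly.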

## **Compile the Model**

At this point we can think of our problem as a classification problem where the model predicts the probability of each unique letter coming next.

In [24]:
model.compile(optimizer='adam',loss=loss)

## **Creating Checkpoints**
Now we are going to set up and configure our model to save checkpoints as it trains. This will allow us to load our model from a checkpoint and continue training it.

In [27]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir,'ckpt_{epoch}')

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
      filepath=checkpoint_prefix,
      save_weights_only=True
)
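
The `{epoch}` placeholder in the checkpoint name is filled in with the epoch number each time a checkpoint is saved. The sketch below only demonstrates the string formatting; `checkpoint_path_for` is a hypothetical helper and nothing is written to disk.

```python
import os

checkpoint_dir = "./training_checkpoints"
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

def checkpoint_path_for(epoch):
    """Return the checkpoint path prefix for a given epoch number."""
    return checkpoint_prefix.format(epoch=epoch)

print(checkpoint_path_for(10))
```

The same path can later be passed to `load_weights` to restore the weights saved after that epoch.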

## **Training**

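The `checkpoint_callback` passed to `fit` saves the model's weights at the end of every epoch, so a specific epoch can be reloaded later.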

In [28]:
history = model.fit(data,epochs=40,callbacks=[checkpoint_callback])

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


In this case, training for more epochs tends to give better generated text.
Overfitting is less of a concern here than in the sentiment model, because we are not evaluating the model on held-out data.

## **Load the Model**
We will rebuild the model from a checkpoint using a batch_size of 1 so that we can feed one piece of text to the model and have it make a prediction.

In [30]:
# we change the batch_size to 1
model = build_model(VOCAB_SIZE,EMBEDDING_DIM,RNN_UNITS, batch_size=1)

Once the model has finished training, we can find the LATEST CHECKPOINT that stores the model's weights (the ones we trained with a batch_size of 64) using the following lines.

In [31]:
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir)) # our latest checkpoint is the one from the 40th epoch
model.build(tf.TensorShape([1,None]))

We can load **any checkpoint** we want by specifying the exact file to load.

In [None]:
checkpoint_num = 10
model.load_weights('./training_checkpoints/ckpt_' + str(checkpoint_num)) # load_weights takes the checkpoint path; here we load the weights saved after epoch 10
model.build(tf.TensorShape([1,None]))

## **Generating Text**
### The purpose of this section is to let us enter ONE input sequence and have the model generate text from it.

This lovely function is provided by TensorFlow to generate some text using any starting string we'd like.

In [32]:
def generate_text(model, start_string):
  # Evaluation step(generating text using the learnt model)
  # Number of characters to generate
  num_generate = 800

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states() # a stateful model keeps its last internal state, so we clear it before generating
  for i in range(num_generate):
    predictions = model(input_eval)
    # remove the batch dimension
    predictions = tf.squeeze(predictions, 0)

    # using a categorical distribution to predict the character returned by the model
    predictions = predictions/ temperature
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

    # We pass the predicted character as the next input to the model
    # along with the previous hidden state
    input_eval = tf.expand_dims([predicted_id], 0)

    text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
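
The effect of `temperature` can be seen without a trained model. The sketch below samples from three hypothetical logits in plain Python, dividing by the temperature before sampling in the same way as `generate_text`; `sample_with_temperature` and `toy_logits` are assumptions for illustration.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    weights = [math.exp(z - m) for z in scaled]  # unnormalised probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.0]  # index 0 is the most likely choice

cold = [sample_with_temperature(toy_logits, 0.1, rng) for _ in range(1000)]
hot = [sample_with_temperature(toy_logits, 10.0, rng) for _ in range(1000)]

print(cold.count(0) / 1000)  # close to 1: almost always the top choice
print(hot.count(0) / 1000)   # close to 1/3: nearly uniform
```

A low temperature sharpens the distribution towards the most likely character, while a high temperature flattens it.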

In [None]:
inp = input("Type a starting string: ")
print(generate_text(model, inp))