## Natural Language Processing with RNNs

Converting textual data to numerical data:
- Bag of words: assign each word an integer and record how often each word occurs. This method loses the order of the words in the original text.
- Integer encoding: map each word to an integer and keep the order of the words in each text. The drawback is that the integers are assigned arbitrarily and we do not control them, yet a model may treat their magnitudes as meaningful (words encoded as 1 and 2 look more alike than words encoded as 1 and 1000).
- Word embeddings: represent every word as a vector with n dimensions (for example 64 or 128). The hope is that similar words point in similar directions, so the angle between their vectors is small. An embedding is a trainable layer; a pretrained one can also be used.
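
To make the "angle between vectors" idea concrete, here is a minimal sketch of cosine similarity, which is the cosine of that angle. The three 3-dimensional vectors are invented for illustration; real embeddings are learned and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, made up for this example only
good = [0.9, 0.8, 0.1]
great = [1.0, 0.7, 0.2]
bad = [-0.8, -0.9, 0.1]

print(cosine_similarity(good, great))  # close to 1: small angle
print(cosine_similarity(good, bad))    # negative: roughly opposite directions
```

A value near 1 means the vectors point in almost the same direction, which is what we hope for with similar words.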

#### Bag of Words

In [None]:
vocab = {}  # maps word to integer representing it
word_encoding = 1
def bag_of_words(text):
  global word_encoding

  words = text.lower().split(" ")  # create a list of all the words in the text; we'll assume the text has no punctuation in this example
  bag = {}  # stores all of the encodings and their frequency

  for word in words:
    if word in vocab:
      encoding = vocab[word]  # get encoding from vocab
    else:
      vocab[word] = word_encoding
      encoding = word_encoding
      word_encoding += 1
    
    if encoding in bag:
      bag[encoding] += 1
    else:
      bag[encoding] = 1
  
  return bag

text = "this is a test to see if this test will work is is test a a"
bag = bag_of_words(text)
print(bag)
print(vocab)

{1: 2, 2: 3, 3: 3, 4: 3, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}
{'this': 1, 'is': 2, 'a': 3, 'test': 4, 'to': 5, 'see': 6, 'if': 7, 'will': 8, 'work': 9}


In [None]:
positive_review = "I thought the movie was going to be bad but it was actually amazing"
negative_review = "I thought the movie was going to be amazing but it was actually bad"

pos_bag = bag_of_words(positive_review)
neg_bag = bag_of_words(negative_review)

print("Positive:", pos_bag)
print("Negative:", neg_bag)

Positive: {10: 1, 11: 1, 12: 1, 13: 1, 14: 2, 15: 1, 5: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1}
Negative: {10: 1, 11: 1, 12: 1, 13: 1, 14: 2, 15: 1, 5: 1, 16: 1, 21: 1, 18: 1, 19: 1, 20: 1, 17: 1}
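
The two bags above contain the same encodings with the same counts, so bag of words cannot tell these opposite reviews apart. As a rough sketch of the integer encoding idea, the hypothetical helper `integer_encode` below (not part of the notebook) keeps word order, using its own separate vocabulary `order_vocab`:

```python
def integer_encode(text, vocab):
    """Map each word to an integer, keeping word order. `vocab` is updated in place."""
    encoded = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1  # next unused integer, starting at 1
        encoded.append(vocab[word])
    return encoded

order_vocab = {}  # separate vocabulary used only in this sketch
pos = integer_encode("I thought the movie was going to be bad but it was actually amazing", order_vocab)
neg = integer_encode("I thought the movie was going to be amazing but it was actually bad", order_vocab)

print(pos)
print(neg)
print(pos == neg)                  # False: the sequences differ in word order
print(sorted(pos) == sorted(neg))  # True: same words with the same counts
```

The sequences differ only in the positions of the integers for "bad" and "amazing", which is exactly the information a bag throws away.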


## Sentiment Analysis
We will use the IMDB movie review dataset that ships with Keras to analyze sentiment. The reviews are already preprocessed and labeled.

#### Imports and downloading the dataset

In [1]:
from keras.datasets import imdb
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np

VOCAB_SIZE = 88584 # the number of unique words in the dataset

MAXLEN = 250
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words = VOCAB_SIZE)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [2]:
# all sequences must have the same length, so we pad them
# pad_sequences truncates reviews longer than MAXLEN and pads shorter ones with zeros
btrain_data = sequence.pad_sequences(train_data, MAXLEN)
test_data = sequence.pad_sequences(test_data, MAXLEN)
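
To illustrate what padding does to a single sequence, here is a pure-Python sketch. The helper `pad_or_truncate` is hypothetical; it assumes the Keras defaults for `pad_sequences` (`padding="pre"` and `truncating="pre"`), meaning zeros are added at the front and extra items are dropped from the front.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad `seq` with `value`, or keep only its last `maxlen` items."""
    if len(seq) >= maxlen:
        return list(seq[-maxlen:])  # drop items from the front
    return [value] * (maxlen - len(seq)) + list(seq)  # add padding at the front

print(pad_or_truncate([5, 6, 7], 5))              # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```

With `MAXLEN = 250`, a short review therefore becomes mostly leading zeros followed by its word indices.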

#### Building, compiling and training the model

In [3]:
# build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32), # each word index is mapped to a 32-dimensional vector
    tf.keras.layers.LSTM(32), 
    tf.keras.layers.Dense(1, activation="sigmoid") # sigmoid squashes the output to a value between 0 and 1
])
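
The final layer uses a sigmoid so that the single output can be read as a probability-like score. A minimal sketch of the function itself, written in plain Python rather than with the Keras implementation:

```python
import math

def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-4.0, 0.0, 4.0):
    print(x, round(sigmoid(x), 4))

print(sigmoid(0.0))  # 0.5: an input of zero sits exactly in the middle
```

Large negative inputs land near 0 and large positive inputs near 1. This simple version is only meant for moderate inputs; it does not guard against overflow for very large negative `x`.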

In [4]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 32)          2834688   
                                                                 
 lstm (LSTM)                 (None, 32)                8320      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________
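
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias vector; the variable names are chosen for this example.

```python
vocab_size = 88584   # VOCAB_SIZE in the notebook
embedding_dim = 32   # output dimension of the Embedding layer
lstm_units = 32      # units in the LSTM layer

# one 32-dimensional vector per word in the vocabulary
embedding_params = vocab_size * embedding_dim

# four gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (embedding_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# one weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2834688 8320 33 2843041
```

Almost all of the parameters sit in the embedding table, because it stores a vector for every word in the vocabulary.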


In [5]:
model.compile(loss="binary_crossentropy",optimizer="rmsprop",metrics=['acc'])
# use 20% of the training data for validation
history = model.fit(btrain_data, train_labels, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Evaluating the model

In [6]:
results = model.evaluate(test_data, test_labels)
print(results)

[0.4618006944656372, 0.8445199728012085]
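
The two numbers are the loss (binary cross-entropy) and the accuracy, in the order given to `model.compile`. As a rough sketch of what each metric measures, here is a plain-Python version on four invented predictions; it assumes a 0.5 decision threshold for accuracy and is not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if (p > threshold) == bool(t))
    return correct / len(y_true)

labels = [1, 0, 1, 0]          # toy labels, invented for this example
probs = [0.9, 0.2, 0.4, 0.6]   # toy model outputs, invented for this example

print(binary_crossentropy(labels, probs))
print(accuracy(labels, probs))  # 0.5: two of the four predictions are on the right side
```

Accuracy only counts which side of the threshold a prediction falls on, while the loss also penalizes confident mistakes more heavily than hesitant ones.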


#### Predicting
##### Making an encoding function

In [7]:
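# To predict, we must preprocess new text in exactly the same way as the training data
# note: imdb.load_data offsets word indices by index_from=3 by default, while get_word_index() is not offset
word_index = imdb.get_word_index() # word -> integer index lookup table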

# encoding function
def encode_text(text):
  tokens = keras.preprocessing.text.text_to_word_sequence(text) # tokenize the text
  tokens = [word_index[word] if word in word_index else 0 for word in tokens] # replace each word with its integer index (0 for unknown words)
  # pad_sequences takes and returns a batch of sequences, so we wrap the tokens in a list and return the first row
  return sequence.pad_sequences([tokens], MAXLEN)[0]

text = "that movie was just amazing, so amazing"
encoded = encode_text(text)
print(encoded)


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0 

In [8]:
reverse_word_index = {value: key for (key, value) in word_index.items()} # reverse the lookup table

def decode_integers(integers):
    PAD = 0 # 0 marks padding, so there is no word at that position
    text = ""
    for num in integers:
      if num != PAD:
        text += reverse_word_index[num] + " "

    return text[:-1]
  
print(decode_integers(encoded))

that movie was just amazing so amazing


##### Predicting

In [12]:
def predict(text):
  encoded_text = encode_text(text) # encode the text
  pred = np.zeros((1,250))         # array of zeros; each review has length 250 (MAXLEN)
  pred[0] = encoded_text           # the model expects a batch of sequences, so we use a batch of one
  result = model.predict(pred) 
  print(result[0])

positive_review = "That movie was! really loved it and would great watch it again because it was amazingly great"
predict(positive_review)

negative_review = "that movie really sucked. I hated it and wouldn't watch it again. Was one of the worst things I've ever watched"
predict(negative_review)


[0.86819285]
[0.39357874]
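
Each output is a score between 0 and 1; in the IMDB dataset a label of 1 means a positive review and 0 a negative one. A minimal sketch of turning the scores above into labels, assuming a 0.5 cutoff (the helper `label_sentiment` is hypothetical and not part of the notebook):

```python
def label_sentiment(score, threshold=0.5):
    """Return 'positive' when the score is above the threshold, else 'negative'."""
    return "positive" if score > threshold else "negative"

# the two scores printed by predict() above
for score in (0.86819285, 0.39357874):
    print(score, label_sentiment(score))
```

With this cutoff the first review is labeled positive and the second negative, although 0.39 is not far below the threshold, so the model is less sure about the second one.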
