## Movie Review Dataset
We will use the IMDB movie review dataset from Keras, which contains 25,000 training reviews, each already labeled as either positive or negative. Each review is encoded as a list of integers, where each integer reflects how common a word is across the entire dataset. For example, a word encoded as 3 is the 3rd most common word in the dataset.
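
To make the encoding concrete, here is a minimal sketch of frequency-rank encoding on a tiny made-up corpus. The corpus and the helper names (`build_word_index`, `encode`) are hypothetical and only illustrate the idea; the Keras dataset ships with its own precomputed index.

```python
from collections import Counter

# Tiny made-up corpus, used only to illustrate the idea
reviews = ["the movie was great", "the movie was bad", "the movie", "the"]

def build_word_index(texts):
    """Map each word to its frequency rank (1 = most common)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, word_index):
    """Replace every known word with its rank; unknown words are skipped."""
    return [word_index[word] for word in text.split() if word in word_index]

toy_index = build_word_index(reviews)
print(encode("the movie was", toy_index))  # [1, 2, 3]
```

Here "the" appears most often, so it gets index 1, followed by "movie" and "was".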

In [2]:
from keras.datasets import imdb
from keras.preprocessing import sequence  # older import path; pad_sequences is now available under keras.utils
import tensorflow as tf
import os
import numpy as np


VOCAB_SIZE = 88584

MAXLEN = 250
BATCH_SIZE = 64
(train_data, train_lbls), (test_data, test_lbls) = imdb.load_data(num_words=VOCAB_SIZE)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [13]:
# Check the length of the first review in the training data
len(train_data[0])

218

In [12]:

# The reviews do not all have the same length
print(len(train_data[3]))

550


### More Preprocessing
Because our reviews have different lengths, we can't pass them directly into our neural network. We therefore make every review the same length using two simple rules:

i. If a review is longer than 250 words, cut off the excess.
ii. If a review is shorter than 250 words, pad it with zeros until it is 250 long.
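
The two rules above can be sketched in plain Python. This is an illustration only, assuming the Keras defaults of padding and truncating at the front of the sequence (`padding='pre'`, `truncating='pre'`); `pad_or_truncate` is a hypothetical helper, not part of Keras.

```python
def pad_or_truncate(seq, maxlen=250, value=0):
    """Return a list of exactly maxlen items.

    Longer sequences keep their last maxlen items; shorter ones
    are left-padded with `value`.
    """
    seq = list(seq)[-maxlen:]                    # rule i: cut off the excess
    return [value] * (maxlen - len(seq)) + seq   # rule ii: pad with zeros

print(pad_or_truncate([1, 2, 3], maxlen=5))           # [0, 0, 1, 2, 3]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```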

Keras has a function for such tasks:

In [15]:
# Older path to pad_sequences, kept for reference:
# train_data = sequence.pad_sequences(train_data, MAXLEN)
# test_data = sequence.pad_sequences(test_data, MAXLEN)

train_data = tf.keras.utils.pad_sequences(train_data, MAXLEN)
test_data = tf.keras.utils.pad_sequences(test_data, MAXLEN)

In [18]:
print(len(train_data[0]))
print(len(train_data[3]))

250
250


### Creating the Model
Now we create our model, with a word embedding layer as the first layer, followed by an LSTM layer that feeds into a single dense node to produce the predicted sentiment.

32 is the output dimension of the vectors generated by the embedding layer. We can change this value if we want.

In [20]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),  # maps each word index to a 32-dimensional vector
    tf.keras.layers.LSTM(32),  # LSTM with 32 units; it reads the 32-dimensional vector of each word in turn
    tf.keras.layers.Dense(1, activation='sigmoid')  # sigmoid squeezes the output to a value between 0 and 1
])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 32)          2834688   
                                                                 
 lstm (LSTM)                 (None, 32)                8320      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________
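
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard parameter formulas for embedding, LSTM, and dense layers; the constant names `EMBED_DIM` and `LSTM_UNITS` are introduced here for illustration.

```python
VOCAB_SIZE = 88584
EMBED_DIM = 32    # output dimension of the embedding layer
LSTM_UNITS = 32   # number of units in the LSTM layer

# Embedding: one EMBED_DIM-sized vector per vocabulary entry
embedding_params = VOCAB_SIZE * EMBED_DIM

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (LSTM_UNITS * (EMBED_DIM + LSTM_UNITS) + LSTM_UNITS)

# Dense: one weight per LSTM unit plus a single bias
dense_params = LSTM_UNITS * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params)  # 2834688 8320 33
print(f"{total_params:,}")                          # 2,843,041
```

Almost all of the parameters sit in the embedding layer, which is why the vocabulary size has such a large effect on model size.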


### Training

In [21]:
model.compile(loss = 'binary_crossentropy', optimizer = 'rmsprop', metrics = ['acc'])
history = model.fit(train_data, train_lbls, epochs = 10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [23]:
# Now we evaluate the model on our test data to see how well it performs
results = model.evaluate(test_data, test_lbls)
print(results)

[0.4154852330684662, 0.859000027179718]
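
The list printed above holds the loss followed by the accuracy metric passed to `compile`. As a rough sketch of what these two numbers measure, the snippet below computes binary cross-entropy and accuracy by hand on a few made-up labels and probabilities. The helper names and the toy values are assumptions for illustration, and the 0.5 cutoff is assumed to match the default used for binary accuracy.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    correct = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return correct / len(y_true)

# Made-up labels and predicted probabilities
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]

print(binary_crossentropy(labels, probs))
print(accuracy(labels, probs))  # 0.5
```

Confident wrong predictions raise the loss sharply, while accuracy only counts which side of the threshold each prediction falls on.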


### Making Predictions
Now we will use our network to make predictions on our own reviews.
Because the training reviews are encoded, we need to convert any review we write into the same form so the network can understand it. For that purpose we load the word index from the dataset and use it to encode our text.



In [None]:
# Encode function
word_index = imdb.get_word_index()  # maps each word to its integer index in the IMDB dataset
# Note: imdb.load_data() offsets indices by index_from=3 by default,
# so these raw indices are shifted relative to the training data.

def encode_text(text):
    # Split the string into a list of lowercase word tokens
    tokens = tf.keras.preprocessing.text.text_to_word_sequence(text)
    # Replace each token with its index from the IMDB word index, or 0 if the word is unknown
    tokens = [word_index[word] if word in word_index else 0 for word in tokens]
    # pad_sequences expects a list of sequences and returns a 2D array, so we wrap
    # tokens in a list and take [0] to get back the single padded row of length MAXLEN
    return tf.keras.utils.pad_sequences([tokens], MAXLEN)[0]

varstring = 'that movie was simply amazing, so amazing'
encodede = encode_text(varstring)
print(encodede)

In [None]:
# Decode function
reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_integers(integers):
    PAD = 0  # padding value, skipped when decoding
    text = ""
    for num in integers:
        if num != PAD:
            text += reverse_word_index[num] + " "
    return text[:-1]  # drop the trailing space


print(decode_integers(encodede))
            

In [None]:
# Time to see a prediction
def predict(text):
    encoded_text = encode_text(text)
    pred = np.zeros((1, MAXLEN))  # the model expects a batch, so use shape (1, MAXLEN)
    pred[0] = encoded_text
    result = model.predict(pred)
    print(result[0])
    
positive = 'That was so amazing! I really loved it and would watch it again because it was amazingly great'
predict(positive)

neg = 'that movie sucked. I hated it and would never watch it again. Was one of the worst things i have ever watched'
predict(neg)
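
Each prediction is a single sigmoid output between 0 and 1. A minimal sketch of turning that score into a readable label follows, assuming that label 1 means positive and 0 means negative in this dataset and using 0.5 as the cutoff. `sentiment_label` is a hypothetical helper, and the scores passed to it are made up rather than real model outputs.

```python
def sentiment_label(score, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a sentiment label."""
    return "positive" if score > threshold else "negative"

# Made-up scores, standing in for values such as result[0][0]
print(sentiment_label(0.87))  # positive
print(sentiment_label(0.12))  # negative
```

Scores close to 0.5 mean the model is unsure, so the cutoff can be moved if one kind of mistake matters more than the other.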