<a href="https://colab.research.google.com/github/098Steve/Jupyter/blob/main/SentimentAnalysis_Tensor_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**IMDB Movie Sentiment Analysis**

The IMDB dataset consists of 50,000 movie reviews with binary sentiment labels, evenly split
between positive and negative opinions. Each review is encoded as a list of integers, where each
integer represents a word in that review. Keras ships the dataset in its library, so we can load it
directly from Keras.

In [None]:
import tensorflow as tf
from tensorflow import keras
from keras import datasets, layers, models, preprocessing, Model
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt




In [None]:
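#Initialise parameters
max_len = 200        #number of tokens kept per review after padding/truncation
n_words = 10000      #vocabulary size: only the most frequent words are kept
dim_embedding = 256  #length of each word embedding vector
EPOCHS = 20
BATCH_SIZE = 500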

In [None]:
#A function to load the dataset
def load_data():
  #load the data, keeping only the n_words most frequent words
  (X_train, y_train), (X_test, y_test) = datasets.imdb.load_data(num_words=n_words)
  #pad or truncate every sequence to max_len
  X_train = preprocessing.sequence.pad_sequences(X_train, maxlen=max_len)
  X_test = preprocessing.sequence.pad_sequences(X_test, maxlen=max_len)
  return (X_train, y_train), (X_test, y_test)
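
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings (zeros are added at the front, and sequences that are too long lose their earliest tokens). The helper name `pad_sequence` and the toy reviews are illustrative assumptions, not part of Keras.

```python
def pad_sequence(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items."""
    if len(seq) >= maxlen:
        # Too long: drop tokens from the front and keep the last maxlen items.
        return list(seq[len(seq) - maxlen:])
    # Too short: prepend padding values so the real tokens sit at the end.
    return [value] * (maxlen - len(seq)) + list(seq)


# Hypothetical encoded reviews, used only for illustration.
short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

print(pad_sequence(short_review, 5))  # [0, 0, 1, 14, 22]
print(pad_sequence(long_review, 5))   # [22, 16, 43, 530, 973]
```

Because every review ends up with the same length, the whole dataset can be stacked into one integer matrix of shape (number of reviews, max_len).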

In [None]:
#Now a function to build the model
def build_model():
  model = models.Sequential()
  #Input: Embedding layer
  #The model takes as input an integer matrix of shape (batch, input_length)
  #The embedding layer outputs a tensor of shape (batch, input_length, dim_embedding)
  #Every integer in the input must be smaller than n_words (the vocabulary size)
  model.add(layers.Embedding(n_words, dim_embedding))
  model.add(layers.Dropout(0.3))
  #Takes the maximum value over the sequence positions for each of the dim_embedding features
  model.add(layers.GlobalMaxPooling1D())
  model.add(layers.Dense(128, activation='relu'))
  model.add(layers.Dropout(0.5))
  model.add(layers.Dense(1, activation='sigmoid'))
  return model
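
The first layers of the model can be illustrated without Keras. The sketch below assumes a tiny made-up embedding table (4 words, 3 dimensions) and shows how an embedding lookup turns a sequence of integers into a sequence of vectors, and how global max pooling then keeps the largest value of each feature across all positions. The names `embedding_table`, `embed` and `global_max_pool` are hypothetical.

```python
# Toy embedding table: 4 vocabulary entries, 3 dimensions each (made-up values).
embedding_table = [
    [0.0, 0.0, 0.0],    # index 0
    [0.2, -0.5, 0.1],   # index 1
    [0.9, 0.3, -0.4],   # index 2
    [-0.1, 0.8, 0.6],   # index 3
]


def embed(token_ids):
    """Look up one embedding vector per token id."""
    return [embedding_table[i] for i in token_ids]


def global_max_pool(vectors):
    """For each feature, take the maximum over all sequence positions."""
    # zip(*vectors) regroups the values feature by feature.
    return [max(feature_values) for feature_values in zip(*vectors)]


review = [0, 1, 3, 2]               # one encoded review of length 4
embedded = embed(review)            # 4 positions x 3 features
pooled = global_max_pool(embedded)  # 3 features, sequence axis removed
print(pooled)  # [0.9, 0.8, 0.6]
```

The pooled vector has one entry per embedding dimension regardless of the review length, which is why a Dense layer can follow it directly.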

In [None]:

#Now we load the data, then build and compile the model

(X_train, y_train), (X_test, y_test) = load_data()
model = build_model()
model.summary()
model.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"] )
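
As a rough illustration of the loss chosen above, the following pure-Python sketch computes the mean binary cross-entropy for a handful of made-up labels and predicted probabilities. The function and the sample values are assumptions for illustration only; the Keras implementation works on tensors and may differ in details such as clipping.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the probability to avoid taking log(0).
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]                  # hypothetical true sentiments
good_guesses = [0.9, 0.1, 0.8, 0.2]    # confident and correct
bad_guesses = [0.1, 0.9, 0.2, 0.8]     # confident and wrong

print(binary_crossentropy(labels, good_guesses))
print(binary_crossentropy(labels, bad_guesses))
```

Confident wrong predictions are penalised far more heavily than confident correct ones, so the second value is much larger than the first.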




In [None]:
#Next fit and evaluate it
train_score = model.fit(X_train, y_train, epochs = EPOCHS, batch_size = BATCH_SIZE, validation_data = (X_test, y_test))
test_score = model.evaluate(X_test, y_test, batch_size = BATCH_SIZE)

In [None]:
# Show which keys are available in the history dictionary of the train_score object

train_score.history.keys()



dict_keys(['accuracy', 'loss', 'val_accuracy', 'val_loss'])

In [None]:
# variables for visualizing losses and accuracy


train_loss = train_score.history['loss']
val_loss   = train_score.history['val_loss']
train_acc  = train_score.history['accuracy']
val_acc    = train_score.history['val_accuracy']
xc         = range(EPOCHS)



In [None]:
#Compare loss

plt.figure()
plt.plot(xc, train_loss, color="red", label="Training Loss")
plt.plot(xc, val_loss, color ="blue", label = "Validation Loss")
plt.title("Training Loss and Validation Loss")
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
#Compare accuracy
plt.figure()
plt.plot(xc, train_acc, color="red", label ="Training Accuracy")
plt.plot(xc, val_acc, color ="blue", label = "Validation Accuracy")
plt.title("Training Accuracy and Validation Accuracy")
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()


In [None]:
#Print final training loss and accuracy (last epoch)

print("Training:", "Loss=", train_loss[-1], "Accuracy=", train_acc[-1])

Training: Loss= 0.00608218926936388 Accuracy= 0.9988800287246704


In [None]:
#Print testing loss and accuracy

print ("Testing:",  "Loss=", test_score[0] , "Accuracy=", test_score[1] )

Testing: Loss= 0.4959609806537628 Accuracy= 0.8514800071716309


Do you think the model is overfitted?
What is overfitting?
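
One rough way to approach these questions is to compare the final training metrics with the test metrics printed above. The sketch below simply copies the four reported numbers into variables (the variable names are illustrative) and computes the gap between them. A model that scores much better on the data it was trained on than on unseen data is showing the typical symptom of overfitting.

```python
# Values copied from the outputs printed earlier in this notebook.
final_train_loss = 0.00608218926936388
final_train_accuracy = 0.9988800287246704
reported_test_loss = 0.4959609806537628
reported_test_accuracy = 0.8514800071716309

# How much accuracy is lost when moving from training data to unseen data.
accuracy_gap = final_train_accuracy - reported_test_accuracy
# How many times larger the loss is on unseen data.
loss_ratio = reported_test_loss / final_train_loss

print(f"Accuracy gap (train - test): {accuracy_gap:.4f}")
print(f"Test loss / training loss:   {loss_ratio:.1f}")
```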

Now let's explore the data and make some predictions.

In [None]:
#Let's look at some data

X_train[1:4]

Note that each word is represented by a number. The IMDB dataset comes with a word index that maps each word to its number, so we can see which number represents which word.

In [None]:
#The movies dataset has a word-index
word_index = datasets.imdb.get_word_index()
#Let's look at the words in the word index
word_index.items()



In [None]:
#Reverse the key and value from the dictionary so that we can look up numbers to see words
reverse_word_index = {value: key for key, value in word_index.items()}
reverse_word_index.items()



The following function takes two arguments. The first one (n) is an integer referring to the
nth review in a set. The second argument (split) specifies whether that review is taken from our
training set or our test set. The function returns the string version of the review we specify. The i-3 is an offset: in the encoded sequences, indices 0, 1 and 2 are reserved for padding, start of sequence and unknown words, so every word's number is shifted by 3 relative to the word index.
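
The offset is easier to see on a toy example. The sketch below assumes a hypothetical four-word index and mimics the default Keras IMDB encoding, where 0 is padding, 1 marks the start of a review, 2 stands for an unknown word, and real words start at their index plus 3. The names `toy_word_index`, `encode` and `decode` are illustrative only.

```python
# Hypothetical word index: word -> frequency rank, starting at 1.
toy_word_index = {"the": 1, "film": 2, "was": 3, "great": 4}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

INDEX_FROM = 3   # real words are shifted by 3
START, UNKNOWN = 1, 2


def encode(words):
    """Encode words as integers, prefixed with the start marker."""
    ids = [START]
    for word in words:
        if word in toy_word_index:
            ids.append(toy_word_index[word] + INDEX_FROM)
        else:
            ids.append(UNKNOWN)
    return ids


def decode(ids):
    """Undo the shift; reserved indices have no word and become '?'."""
    return " ".join(toy_reverse_index.get(i - INDEX_FROM, "?") for i in ids)


encoded = encode(["the", "film", "was", "great"])
padded = [0, 0] + encoded  # two padding zeros in front, as after pad_sequences
print(encoded)         # [1, 4, 5, 6, 7]
print(decode(padded))  # ? ? ? the film was great
```

The leading question marks in a decoded review therefore correspond to padding and the start marker, not to missing words.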

In [None]:

def decode_review(n, split='train'):
  #Pick the requested split; reject anything else instead of failing later
  if split == 'train':
    data = X_train
  elif split == 'test':
    data = X_test
  else:
    raise ValueError("split must be 'train' or 'test'")
  #Shift each index back by 3; reserved indices (0, 1, 2) become '?'
  return ' '.join([reverse_word_index.get(i-3, '?') for i in data[n]])




In [None]:
#the following code prints the test label and decodes the matching test review
print('Test label:', y_test[5])
review = decode_review(5, split='test')
review

In [None]:
#Predict the whole test dataset

predictions = model.predict(X_test)

In [None]:
predictions

In [None]:
#Set boundaries for when a review is positive and when it is negative

def gauge_predictions(n):
  p = predictions[n][0]  #predicted probability that review n is positive
  if (p <= 0.4) and (y_test[n] == 0):
    print('Network correctly predicts that review {} is negative'.format(n))
  elif (p > 0.7) and (y_test[n] == 1):
    print('Network correctly predicts that review {} is positive'.format(n))
  elif (p > 0.7) and (y_test[n] == 0):
    print('Network incorrectly predicts that review {} is positive'.format(n))
  elif (p <= 0.4) and (y_test[n] == 1):
    print('Network incorrectly predicts that review {} is negative'.format(n))
  else:
    print('Network is not so sure. Review {} has a predicted positive probability of {}'.format(n, p))

In [None]:
def verify_predictions(n):
  return gauge_predictions(n), predictions[n], y_test[n], decode_review(n, split='test')

Use the verify_predictions function to look at some reviews and see whether they were predicted as positive or negative.

In [None]:
verify_predictions(7)


In [None]:
#Count number of predictions where the model is not sure

unsure = (predictions > 0.4) & (predictions <= 0.7)  #True where the probability falls between the two thresholds
count = int(unsure.sum())
print(count)
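
The three-way rule used in gauge_predictions (at most 0.4 is negative, above 0.7 is positive, anything in between is unsure) can be tried out without a trained model. The sketch below is a minimal illustration on made-up probabilities; the function name `classify` and the sample values are assumptions.

```python
def classify(probability, low=0.4, high=0.7):
    """Map a predicted probability to 'negative', 'positive' or 'unsure'."""
    if probability <= low:
        return "negative"
    if probability > high:
        return "positive"
    return "unsure"


# Hypothetical predicted probabilities, including both boundary values.
toy_predictions = [0.05, 0.40, 0.55, 0.70, 0.93]
toy_labels = [classify(p) for p in toy_predictions]

print(toy_labels)  # ['negative', 'negative', 'unsure', 'unsure', 'positive']
print(toy_labels.count("unsure"))  # 2
```

Note that the boundaries are not symmetric: 0.4 itself counts as negative, while 0.7 itself still counts as unsure.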
