# Using LSTMs to carry out Sentiment Analysis on movie reviews
### Steps:
* Import the necessary packages
* Create and Preprocess the data
* Build the model
* Train and Evaluate
* Test on unseen data

---

### Import the necessary packages

In [65]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, LSTM
from keras.preprocessing import sequence
from keras import backend as K
import numpy as np

In [66]:
K.clear_session()

---

### Create the data
#### Let's make up some of our own labelled movie reviews.
#### Why don't we consider a classic divider of opinion, Tommy Wiseau's `The Room`?

In [95]:
reviews = ['I will never watch this movie because its so bad',
          'The room is to filmmaking as Hugo Boss is to military clothing',
          'This movie improved my life',
          'The is the best movie Ive ever done',
          'It was a pleasure watching for my kids and me',
          'I hated this movie from start to finish',
          'It was awful, just awful',
          'It was really bad']

labels = [0,1,1,1,1,0,0,0]

### Preprocess the data
#### Factorise the text
* The network only works with numbers, so we have to transform the text before we give it to the sentiment analysis network
* Do this by creating a vocab and finding the length of the longest review
* Use the vocab to create a **vocab_to_keys** and a **keys_to_vocab** dictionary

#### Part 1 : Create the vocab and the maximum review length

In [96]:
vocab = []
max_length = 0

for review in reviews:
    words = review.lower().split()  # lowercase, then split on whitespace
    vocab.extend(words)
    max_length = max(max_length, len(words))  # track the longest review

vocab = list(set(vocab))  # set() removes duplicates (the resulting order is arbitrary)
max_length = max_length + 1  # leave one extra position beyond the longest review

In [97]:
max_length

13

In [98]:
vocab

['just',
 'hated',
 'it',
 'me',
 'as',
 'watch',
 'ever',
 'watching',
 'room',
 'the',
 'finish',
 'done',
 'so',
 'to',
 'will',
 'kids',
 'for',
 'is',
 'a',
 'from',
 'movie',
 'really',
 'clothing',
 'its',
 'start',
 'awful',
 'my',
 'military',
 'i',
 'bad',
 'awful,',
 'improved',
 'ive',
 'never',
 'filmmaking',
 'life',
 'because',
 'best',
 'pleasure',
 'hugo',
 'was',
 'and',
 'boss',
 'this']
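
Note that the listing contains both `'awful'` and `'awful,'`: splitting on whitespace leaves punctuation attached, so the same word can end up with two different keys. Below is a minimal sketch of one way to avoid this, assuming we are happy to discard all ASCII punctuation. The `tokenize` helper is a hypothetical name and is not used elsewhere in this notebook.

```python
import string

def tokenize(review):
    """Lowercase a review, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return review.lower().translate(table).split()

print(tokenize('It was awful, just awful'))
# ['it', 'was', 'awful', 'just', 'awful']
```

Swapping this in for `review.lower().split()` would change the vocab and every key shown below, so the outputs in the rest of the notebook assume the original whitespace split.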

#### Part 2: Create the two dictionaries - vocab to keys and keys to vocab

In [99]:
# decoding (keys start at 1 so that 0 stays free for padding)
keys_to_vocab = {i:voc for i, voc in enumerate(vocab, start=1)}

# encoding
vocab_to_keys = {voc:i for i, voc in enumerate(vocab, start=1)}

In [100]:
print(list(vocab_to_keys.keys())[0])
print(vocab_to_keys['just'])
print(keys_to_vocab[1])

just
1
just


#### Part 3: Now we can factorize each review by turning it into a series of numbers

In [101]:
embedded_docs = [[vocab_to_keys[word] for word in review.lower().split()] for review in reviews]

In [102]:
embedded_docs

[[29, 15, 34, 6, 44, 21, 37, 24, 13, 30],
 [10, 9, 18, 14, 35, 5, 40, 43, 18, 14, 28, 23],
 [44, 21, 32, 27, 36],
 [10, 18, 10, 38, 21, 33, 7, 12],
 [3, 41, 19, 39, 8, 17, 27, 16, 42, 4],
 [29, 2, 44, 21, 20, 25, 14, 11],
 [3, 41, 31, 1, 26],
 [3, 41, 22, 30]]

#### Part 4: Finally, pad the review sequences so they are all the same length

In [103]:
padded_doc = sequence.pad_sequences(embedded_docs, maxlen=max_length, padding="post")

In [104]:
padded_doc

array([[29, 15, 34,  6, 44, 21, 37, 24, 13, 30,  0,  0,  0],
       [10,  9, 18, 14, 35,  5, 40, 43, 18, 14, 28, 23,  0],
       [44, 21, 32, 27, 36,  0,  0,  0,  0,  0,  0,  0,  0],
       [10, 18, 10, 38, 21, 33,  7, 12,  0,  0,  0,  0,  0],
       [ 3, 41, 19, 39,  8, 17, 27, 16, 42,  4,  0,  0,  0],
       [29,  2, 44, 21, 20, 25, 14, 11,  0,  0,  0,  0,  0],
       [ 3, 41, 31,  1, 26,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 3, 41, 22, 30,  0,  0,  0,  0,  0,  0,  0,  0,  0]], dtype=int32)
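
To make the padding step concrete, here is a pure-Python sketch of what `pad_sequences` does with `padding="post"`: it appends zeros until every sequence has `maxlen` items. The `pad_post` helper is a hypothetical stand-in written for illustration, and it assumes the Keras default of cutting over-long sequences from the front.

```python
def pad_post(seqs, maxlen, value=0):
    """Right-pad each sequence with `value` until it has `maxlen` items."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]  # if too long, keep only the last `maxlen` items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

print(pad_post([[3, 41, 22, 30], [44, 21, 32, 27, 36]], maxlen=6))
# [[3, 41, 22, 30, 0, 0], [44, 21, 32, 27, 36, 0]]
```

This is also why the dictionaries start counting at 1: key 0 never stands for a real word, so it can safely mean "nothing here".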

---

## Build the model

### Start with the sequential initialiser

In [105]:
model = Sequential()

### Then add the word embedding layer
##### This layer takes 3 parameters: the size of the vocab (input_dim), the number of dimensions of each word embedding (output_dim), and the length of each document (input_length), which we've standardised above. It returns a 2D matrix with one row per word in the document and one column per dimension of the word embedding.

*Actually it's 3D, because the batch size is the first dimension of both the input and the output, but I find that confuses things more than it clarifies.*

### Put another way 

The embedding **takes in** a factorized corpus, e.g.:

**[The, cat, sat, on, the, mat]**    becomes    **[1,2,3,4,1,5]**

And **outputs** a word embedded corpus:

**[1,2,3,4,1,5]**    becomes (let's assume output_dim=2)   **[[0.2,0.7], [0.6,0.3], [0.1,0.8], [0.2,0.1], [0.2,0.7], [0.4,0.9]]**
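
Conceptually this is just a table lookup: the layer holds one row of numbers per key and returns the rows that the input keys point to. The NumPy sketch below reproduces the example above; the matrix values are the illustrative ones from the example, whereas in Keras the rows are weights that are learned during training.

```python
import numpy as np

# Illustrative embedding matrix: one row per key, output_dim=2.
embedding_matrix = np.array([
    [0.0, 0.0],  # key 0: unused in this example
    [0.2, 0.7],  # key 1: the
    [0.6, 0.3],  # key 2: cat
    [0.1, 0.8],  # key 3: sat
    [0.2, 0.1],  # key 4: on
    [0.4, 0.9],  # key 5: mat
])

factorized = np.array([1, 2, 3, 4, 1, 5])

# Integer-array indexing picks one row of the matrix per key.
embedded = embedding_matrix[factorized]

print(embedded.shape)  # (6, 2)
```

Both occurrences of "the" (key 1) map to the same row, `[0.2, 0.7]`.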

In [135]:
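# embeddings help most when you have a big corpus; here they are learned from our 8 reviews
# input_dim is len(vocab)+1 because key 0 is reserved for padding
model.add(Embedding(input_dim=len(vocab)+1, output_dim=32, input_length=max_length))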

### Then add the LSTM layer
* We have to define `units`, the size of the LSTM's hidden state, which is also the size of the vector the layer outputs.

In [136]:
model.add(LSTM(128, use_bias=False))
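
As a rough check on how `units` drives the size of the layer, the weight count of a standard LSTM can be worked out by hand: each of the four gates has an input kernel, a recurrent kernel and, optionally, a bias. The helper below is a hypothetical sketch of that arithmetic, not a Keras function.

```python
def lstm_param_count(input_dim, units, use_bias=True):
    """Number of weights in a standard LSTM layer with four gates."""
    per_gate = input_dim * units + units * units  # input kernel + recurrent kernel
    if use_bias:
        per_gate += units
    return 4 * per_gate

print(lstm_param_count(32, 128, use_bias=False))  # 81920
print(lstm_param_count(32, 128))                  # 82432
```

For the layer above (32-dimensional embeddings, 128 units, no bias) this gives 81,920 weights.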

### Lastly, add a Dense layer, which has 1 neuron for our binary Positive/Negative classification

In [137]:
model.add(Dense(1, activation='sigmoid'))

In [138]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 13, 32)            1440      
                                                                 
 lstm_1 (LSTM)               (None, 128)               81920     
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
                                                                 
Total params: 83,489
Trainable params: 83,489
Non-trainable params: 0

---

### Now we can compile and fit the model on the training data
* Fit the word embeddings from scratch

In [139]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [140]:
labels

[0, 1, 1, 1, 1, 0, 0, 0]

In [141]:
X = padded_doc
y = np.array(labels)

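In [142]:
X

array([[29, 15, 34,  6, 44, 21, 37, 24, 13, 30,  0,  0,  0],
       [10,  9, 18, 14, 35,  5, 40, 43, 18, 14, 28, 23,  0],
       [44, 21, 32, 27, 36,  0,  0,  0,  0,  0,  0,  0,  0],
       [10, 18, 10, 38, 21, 33,  7, 12,  0,  0,  0,  0,  0],
       [ 3, 41, 19, 39,  8, 17, 27, 16, 42,  4,  0,  0,  0],
       [29,  2, 44, 21, 20, 25, 14, 11,  0,  0,  0,  0,  0],
       [ 3, 41, 31,  1, 26,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 3, 41, 22, 30,  0,  0,  0,  0,  0,  0,  0,  0,  0]], dtype=int32)

In [143]:
# X and y must have the same number of rows (8 reviews, 8 labels).
# If padded_doc has been overwritten by a later cell, re-run the preprocessing cells first,
# otherwise Keras raises "ValueError: Data cardinality is ambiguous".
model.fit(X, y, epochs=29)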

### And evaluate the model
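
With a model compiled as above, `model.evaluate(X, y)` returns the loss followed by each metric, here binary cross-entropy and accuracy. As a minimal sketch of what those two numbers mean, the following computes them by hand with NumPy. The probabilities in `y_prob` are made up for illustration and are not the output of the model above.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    return float(np.mean((y_prob > threshold).astype(int) == y_true))

y_true = np.array([0, 1, 1, 1, 1, 0, 0, 0])                  # the labels defined earlier
y_prob = np.array([0.1, 0.9, 0.8, 0.4, 0.7, 0.2, 0.3, 0.6])  # hypothetical predictions

print(binary_crossentropy(y_true, y_prob))
print(accuracy(y_true, y_prob))  # 0.75
```

Bear in mind that evaluating on the same eight reviews used for training only shows how well the model has memorised them, not how well it generalises.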

---

### Once we're happy, we can try to guess the sentiment of some new text
#### **Beware: the model only knows the words in the training vocab, so any new word has to be dropped before prediction!**
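
Dropping unknown words is the simplest option, but it throws information away. A common alternative is to map every unknown word to a dedicated "unknown" key. The sketch below shows both behaviours; the `encode` helper, the `unk_key` parameter and the toy vocab are all hypothetical. Using an unknown key with the model in this notebook would also require reserving that key before training and enlarging `input_dim` by one.

```python
def encode(review, vocab_to_keys, unk_key=None):
    """Turn a review into keys; unknown words are dropped, or mapped to unk_key if given."""
    keys = []
    for word in review.lower().split():
        if word in vocab_to_keys:
            keys.append(vocab_to_keys[word])
        elif unk_key is not None:
            keys.append(unk_key)
    return keys

toy_vocab = {'it': 1, 'my': 2, 'was': 3}  # hypothetical mini vocab

print(encode('it made my eyes bleed', toy_vocab))             # [1, 2]
print(encode('it made my eyes bleed', toy_vocab, unk_key=4))  # [1, 4, 2, 4, 4]
```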

### Before giving it to the model, make sure to preprocess the text in the same way as the training reviews

In [130]:
new_reviews = ['The Room is a masterpiece and one of the most underappreciated movie of the modern era',
              'Every line and scene has so much care and effort put into it that it easily puts many far more ambitious films to shame',
              'I thought it was the best movie Ive ever seen',
              'it made my eyes bleed',
              ]

### And predict the sentiment of each new review from its padded sequence

In [131]:
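# Use new variable names so the training arrays (padded_doc, X) are not overwritten.
# Words that are not in the training vocab are dropped.
new_embedded_docs = [[vocab_to_keys[word] for word in review.lower().split() if word in vocab_to_keys] for review in new_reviews]
new_padded_doc = sequence.pad_sequences(new_embedded_docs, maxlen=max_length, padding="post")

In [132]:
X_new = new_padded_doc
ypred = []

ypred.append(model.predict(X_new))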

In [133]:
new_reviews

['The Room is a masterpiece and one of the most underappreciated movie of the modern era',
 'Every line and scene has so much care and effort put into it that it easily puts many far more ambitious films to shame',
 'I thought it was the best movie Ive ever seen',
 'it made my eyes bleed']

In [134]:
ypred

[array([[0.86574066],
        [0.333807  ],
        [0.14551404],
        [0.00281903]], dtype=float32)]
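
The sigmoid output is a probability of the review being positive, so a final step is to turn it into a label. The sketch below applies a 0.5 cut-off, which is a conventional but assumed choice, to the probabilities copied from the output above.

```python
import numpy as np

# Probabilities copied from the prediction output above.
probs = np.array([[0.86574066], [0.333807], [0.14551404], [0.00281903]])

# Flatten to 1-D and apply a 0.5 cut-off.
predicted = (probs.ravel() > 0.5).astype(int)

for p, label in zip(probs.ravel(), predicted):
    print(f'{p:.3f} -> {"positive" if label == 1 else "negative"}')
```

Only the first review comes out as positive, even though the second and third read as favourable, which is a reminder of how little a model can learn from eight training reviews.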