# Sentiment Classification


### Generate Word Embeddings and retrieve the outputs of each layer with Keras for a Classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

It is a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.
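
As a small illustration of "similar meaning, similar representation", the sketch below compares hand-picked toy vectors with cosine similarity. The three vectors and the helper name `cosine_similarity` are made up for this example; they are not learned from the IMDB data.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only.
toy_embeddings = {
    "good":  np.array([0.9, 0.8, 0.1]),
    "great": np.array([0.8, 0.9, 0.2]),
    "awful": np.array([-0.7, -0.9, 0.1]),
}

sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])
print(round(sim_close, 3), round(sim_far, 3))
```

In a trained model we would hope to see the same pattern: words used in similar contexts end up with a high similarity, while words with opposite sentiment point in different directions.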

We will use the IMDB dataset to learn word embeddings as we train our model. This dataset contains 25,000 movie reviews from IMDB, labeled with sentiment (positive or negative). 



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. Use the first 20 words from each review to speed up training, using a max vocab size of 10,000.

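As a convention, "0" does not stand for a specific word. With the default `imdb.load_data` settings it is left free for padding, while "1" marks the start of a review and "2" encodes any unknown (out-of-vocabulary) word.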


### Aim

1. Import test and train data  
2. Import the labels (train and test) 
3. Get the word index and then create key-value pairs for word and word_id. (12.5 points)
4. Build a Sequential Model using Keras for Sentiment Classification task. (10 points)
5. Report the Accuracy of the model. (5 points)  
6. Retrieve the output of each layer in Keras for a given single test sample from the trained model you built. (2.5 points)


#### Usage:

In [1]:
from tensorflow.keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data()

# Find max length of review
maxLen = 0
maxI = 0
for i, review in enumerate(x_train):
    length = len(review)
    if length > maxLen:
        maxLen = length
        maxI = i

print("Max length review: ", maxLen)
print("Max length review is at index: ", maxI)

# Find the max word id (ids are offset, so this is only an upper bound on the vocabulary size)
maxValue = 0
for review in x_train:
    maxValue = max(maxValue, max(review))

print("Number of unique words in the overall data set: ", maxValue)

Max length review:  2494
Max length review is at index:  17934
Number of unique words in the overall data set:  88586


In [2]:
print(type(x_train))
print(x_train.shape)
print(len(x_train[17934]))
print(len(x_train[1]))
print(type(x_train[0]))
# x_train[0]

<class 'numpy.ndarray'>
(25000,)
2494
189
<class 'list'>


In [3]:
vocab_size = 10000     # vocabulary size (only the most frequent words are kept)
maxlen = 300           # number of words used from each review
embedding_size = 50    # dimensionality of each word vector

In [4]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

#load dataset as a list of ints
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
#make all sequences of the same length
x_train = pad_sequences(x_train, maxlen=maxlen, padding='post')
x_test =  pad_sequences(x_test, maxlen=maxlen, padding='post')
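
To make the effect of `padding='post'` concrete, here is a minimal NumPy sketch of what the call above does to each review. The helper `pad_post_truncate_pre` is hypothetical and only mimics the behaviour; it assumes the Keras default `truncating='pre'`, which keeps the last `maxlen` ids of a review that is too long.

```python
import numpy as np

def pad_post_truncate_pre(sequences, maxlen, value=0):
    """NumPy sketch of pad_sequences(..., padding='post') with truncating='pre'."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]          # 'pre' truncation keeps the LAST maxlen ids
        out[row, :len(kept)] = kept   # 'post' padding leaves the fill value at the END
    return out

# Two toy reviews: one shorter and one longer than maxlen.
toy_reviews = [[1, 14, 22], [1, 5, 6, 7, 8, 9]]
padded = pad_post_truncate_pre(toy_reviews, maxlen=4)
print(padded)
```

The short review keeps its words at the start and gets zeros appended, while the long review loses its first two ids.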

In [5]:
print(type(x_train))
print(x_train.shape)
print(len(x_train[0]))
print(len(x_train[1]))
print(type(x_train[0]))
x_train[0]

<class 'numpy.ndarray'>
(25000, 300)
300
300
<class 'numpy.ndarray'>


array([   1,   14,   22,   16,   43,  530,  973, 1622, 1385,   65,  458,
       4468,   66, 3941,    4,  173,   36,  256,    5,   25,  100,   43,
        838,  112,   50,  670,    2,    9,   35,  480,  284,    5,  150,
          4,  172,  112,  167,    2,  336,  385,   39,    4,  172, 4536,
       1111,   17,  546,   38,   13,  447,    4,  192,   50,   16,    6,
        147, 2025,   19,   14,   22,    4, 1920, 4613,  469,    4,   22,
         71,   87,   12,   16,   43,  530,   38,   76,   15,   13, 1247,
          4,   22,   17,  515,   17,   12,   16,  626,   18,    2,    5,
         62,  386,   12,    8,  316,    8,  106,    5,    4, 2223, 5244,
         16,  480,   66, 3785,   33,    4,  130,   12,   16,   38,  619,
          5,   25,  124,   51,   36,  135,   48,   25, 1415,   33,    6,
         22,   12,  215,   28,   77,   52,    5,   14,  407,   16,   82,
          2,    8,    4,  107,  117, 5952,   15,  256,    4,    2,    7,
       3766,    5,  723,   36,   71,   43,  530,  4

In [6]:
# print(len(x_train[17934]))
# x_train[17934]

In [6]:
# Find the max value
maxValue = 0
for review in x_train:
    maxValue = max(maxValue, max(review))

print(maxValue)

9999


In [7]:
"""
Assessment till now:

- We have an array of numeric values, where each value represents a word (not yet decoded).

- The max value is 9999, meaning word ids are capped by the vocabulary size of 10000.
Without the cap, there are more than 80000 unique words in the data set.

- Each sub-array has 300 values.

- Each value in the sub-array is a numeric id representing some word in the vocabulary (not yet decoded).

- If the review has fewer than 300 words, its words fill the start of the array
and the remaining values at the end are set to 0 (padding='post').
If the review has more than 300 words, only its last 300 words are kept (default truncating='pre').

---***---

To do next:

- For each entry in the array, create an array of size (300, 10000).
Each row will contain a value in only one column.
The index of the row in this array will represent the position of the word in the review.
The column that contains the value will represent the word (as given in the input).
As output of this step there will be a data set of size (25000, 300, 10000).

- Pass it through a neural network to train against the labels.


"""
y_train[0]

1

In [8]:
# Get word to index mapping
# It is a dictionary object
# Key is the word
# Value is the numeric value corresponding to the word
word_to_id = imdb.get_word_index()
print(type(word_to_id))  # dictionary

<class 'dict'>


In [9]:
print(len(word_to_id))
print(len(word_to_id.items()))
# word_to_id[0]  # Gives error, meaning indexing starts from 1

88584
88584


In [10]:
# Increase the index of each word in the dictionary by 2
word_to_id = {k:(v+2) for k, v in word_to_id.items()}
# This leaves indexes 0, 1 and 2 available for special tokens.
# NOTE: imdb.load_data() shifts word indexes by 3 by default (index_from=3) and
# uses 1 for the start marker and 2 for out-of-vocabulary words, so this shift of 2
# does not line up with the ids stored in x_train.

# The first 3 indexes are empty now
# This code must come after increasing the index of each word in the dictionary by 2
word_to_id["<PAD>"]=0
word_to_id["<START>"]=1
word_to_id["<END>"]=2

In [11]:
print(type(word_to_id))
print(type(word_to_id.items()))
print(word_to_id["<START>"])
print(len(word_to_id))
print(len(word_to_id.items()))
word_to_id['<START>']

<class 'dict'>
<class 'dict_items'>
1
88587
88587


1

In [12]:
# Create id to word mapping
# Input, word_to_id, is word to id mapping
# Output, will have a numeric key and the value will be the word
id_to_word = {value: key for key, value in word_to_id.items()}
print(type(id_to_word))
id_to_word

<class 'dict'>


{34703: 'fawn',
 52008: 'tsukino',
 52009: 'nunnery',
 16818: 'sonja',
 63953: 'vani',
 1410: 'woods',
 16117: 'spiders',
 2347: 'hanging',
 2291: 'woody',
 52010: 'trawling',
 52011: "hold's",
 11309: 'comically',
 40832: 'localized',
 30570: 'disobeying',
 52012: "'royale",
 40833: "harpo's",
 52013: 'canet',
 19315: 'aileen',
 52014: 'acurately',
 52015: "diplomat's",
 25244: 'rickman',
 6748: 'arranged',
 52016: 'rumbustious',
 52017: 'familiarness',
 52018: "spider'",
 68806: 'hahahah',
 52019: "wood'",
 40835: 'transvestism',
 34704: "hangin'",
 2340: 'bringing',
 40836: 'seamier',
 34705: 'wooded',
 52020: 'bravora',
 16819: 'grueling',
 1638: 'wooden',
 16820: 'wednesday',
 52021: "'prix",
 34706: 'altagracia',
 52022: 'circuitry',
 11587: 'crotch',
 57768: 'busybody',
 52023: "tart'n'tangy",
 14131: 'burgade',
 52025: 'thrace',
 11040: "tom's",
 52027: 'snuggles',
 29116: 'francesco',
 52029: 'complainers',
 52127: 'templarios',
 40837: '272',
 52030: '273',
 52132: 'zaniacs',

In [13]:
print(len(id_to_word))
for key, value in id_to_word.items():
    if (key == 1) or (value == "<END>"):
        print(value)

88587
<START>
<END>


In [14]:
id_to_word[3]

'the'

In [15]:
n = 3
# print(x_train[n])

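# Create review (as words) for a sample train record
# NOTE: imdb.load_data() offsets word ids by 3 (index_from=3), so this id+2 lookup on a
# dictionary shifted by 2 is misaligned and the decoded text does not read as the real review.
print(' '.join(id_to_word[id+2] for id in x_train[n]))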

that there is julia fantasy to repressed and film good br of loose and basic have into your whatever i i and and demented be hop this standards cole new be home all seek film wives lot br made and in at this of search how concept in thirty some this and not all it rachel are of boys and re is and animals deserve i i worst more it is renting concerned message made all and in does of nor of nor side be and center obviously know end computer here to all tries in does of nor side of home br be indeed i i all it officer in could is performance and fully in of and br by br and its and lit well of nor at coming it's it that an this obviously i i this as their has obviously bad and exist countless and mixed of and br work to of run up and and br dear nor this early her bad having tortured film and movie all care of their br be right acting i i and of and and it away of its shooting and to suffering version you br singers your way just and was can't compared condition film of and br united obvi

In [16]:
print(id_to_word[4])

and
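
For reference, a decoding that lines up with the ids in `x_train` could look like the sketch below. It assumes the default `imdb.load_data` arguments (`start_char=1`, `oov_char=2`, `index_from=3`); the helper names and the miniature `toy_word_index` are hypothetical and stand in for `imdb.get_word_index()`.

```python
def build_id_to_word(word_index, index_from=3):
    """Invert a word -> rank mapping, shifting ranks by index_from and adding special tokens."""
    id_to_word = {rank + index_from: word for word, rank in word_index.items()}
    id_to_word[0] = "<PAD>"    # padding value written by pad_sequences
    id_to_word[1] = "<START>"  # assumed start_char default
    id_to_word[2] = "<UNK>"    # assumed oov_char default (out-of-vocabulary words)
    return id_to_word

def decode_review(ids, id_to_word):
    """Turn a sequence of integer ids back into a space-separated string."""
    return " ".join(id_to_word.get(int(i), "<UNK>") for i in ids)

# Hypothetical miniature word index; imdb.get_word_index() has the same structure.
toy_word_index = {"the": 1, "and": 2, "movie": 3, "great": 4}
toy_id_to_word = build_id_to_word(toy_word_index)

# "<START> the movie <UNK> great" followed by two padding positions
encoded = [1, 4, 6, 2, 7, 0, 0]
decoded = decode_review(encoded, toy_id_to_word)
print(decoded)
```

With the real data, the same idea would be `build_id_to_word(imdb.get_word_index())` followed by `decode_review(x_train[n], ...)`, which should give a readable review apart from the special tokens.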


## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps an index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model. 
* We could also load pre-trained word embeddings into the embedding layer when we create our model.
* Use the embedding layer to train our own word2vec models.

The Keras embedding layer doesn't require us to one-hot encode our words; instead we have to give each word a unique integer number as an id. For the IMDB dataset we've loaded this has already been done, but if this wasn't the case we could use sklearn [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
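
The reason one-hot encoding is unnecessary can be shown with a few lines of NumPy: multiplying a one-hot matrix by a weight matrix just selects rows, which is what an integer lookup does directly. The matrix `W` below is a random stand-in for an embedding matrix, and the sizes are toy values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size = 6, 3

# Hypothetical embedding matrix: one row (word vector) per word id.
W = rng.normal(size=(vocab_size, embedding_size))

word_ids = np.array([4, 1, 1, 0])   # a toy "review" of four word ids

# Embedding-style lookup: pick rows by integer id.
looked_up = W[word_ids]

# One-hot route: build a (4, vocab_size) matrix and multiply by W.
one_hot = np.eye(vocab_size)[word_ids]
via_matmul = one_hot @ W

print(looked_up.shape, np.allclose(looked_up, via_matmul))
```

Both routes give the same (4, 3) result, but the lookup never materialises the one-hot matrix, which for this notebook's data would have shape (25000, 300, 10000).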

In [17]:
"""
Create neural network
"""

'\nCreate neural network\n'

In [18]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense

def create_model(vocab_size, embedding_size, maxlen):
    # define the model
    model_out = Sequential()
    model_out.add(Embedding(vocab_size, embedding_size, input_length=maxlen))  # vocab_size * embedding_size learnable weights (lookup table, no bias)
    model_out.add(Flatten())
    model_out.add(Dense(1, activation='sigmoid'))  # maxlen * embedding_size + 1 learnable weights
    # compile the model
    model_out.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    # summarize the model
    #print(model_out.summary())
    return model_out

In [67]:
def fit_evaluate(model_in):
    # fit the model
    model_in.fit(x_train, y_train, epochs=2, verbose=1, batch_size=500)

    # evaluate the model
    loss, accuracy = model_in.evaluate(x_test, y_test, verbose=0)
    #print('Accuracy: %f' % (accuracy))
    #print('Loss: %f' % (loss))
    print("Accuracy ", round(accuracy, 5), ", Loss ", round(loss, 5))
    return accuracy, loss
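
The two numbers returned by `fit_evaluate` can be reproduced by hand on a toy example. The sketch below is a plain NumPy version of binary cross-entropy and of accuracy at a 0.5 threshold; the labels, probabilities and the clipping constant `eps` are assumptions made for illustration.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

def accuracy_at_half(y_true, y_prob):
    """Fraction of samples whose thresholded prediction matches the label."""
    predicted_positive = np.asarray(y_prob) > 0.5
    return float(np.mean(predicted_positive == (np.asarray(y_true) == 1)))

toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.6]   # the last two predictions are on the wrong side of 0.5

toy_loss = binary_crossentropy(toy_labels, toy_probs)
toy_acc = accuracy_at_half(toy_labels, toy_probs)
print(round(toy_loss, 4), toy_acc)
```

Note that the loss also penalises confident mistakes, so two models with the same accuracy can report quite different loss values.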

### Hyperparameter tuning

In [3]:
import time
import pandas as pd
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# vocab_size_list = [1000, 5000, 10000, 20000, 30000, 50000, 80000, 100000]  # Total words in dictionary to consider
# maxlen_list = [5, 10, 20, 30, 50, 100, 150, 200, 250, 300, 350, 500, 1000]  # Length of document to consider
# embedding_size_list = [5, 10, 20, 30, 50, 100, 150, 200, 250, 300, 350, 400, 500, 1000]

# vocab_size_list = [5000, 10000, 30000, 50000, 89000]  # Total words in dictionary to consider
# maxlen_list = [5, 10, 30, 50, 75, 100, 200, 300, 500]  # Length of document to consider
# embedding_size_list = [5, 10, 30, 50, 100, 200, 300, 500, 1000]

# vocab_size_list = [10000, 30000, 50000, 89000]  # Total words in dictionary to consider
# maxlen_list = [5, 10, 30, 50, 100, 200, 300, 500]  # Length of document to consider
# embedding_size_list = [5, 10, 30, 50, 100, 200, 300, 400, 500]

# vocab_size_list = [30000, 50000, 89000]  # Total words in dictionary to consider
# maxlen_list = [5, 10, 30, 50, 100, 200, 300, 500]  # Length of document to consider
# embedding_size_list = [5, 10, 30, 50, 100, 200, 300, 400, 500]

vocab_size_list = [50000, 89000]  # Total words in dictionary to consider
maxlen_list = [5, 10, 30, 50, 100, 200, 250, 300]  # Length of document to consider
embedding_size_list = [5, 10, 30, 50, 100, 200, 250, 300, 400]

# Lists to save to the performance_df
vocab_size = []
maxlen = []
embedding_size = []
accuracy = []
loss = []

fileName = 'hyperparameterComparison_v5.csv'

for i, vocab_size_entry in enumerate(vocab_size_list):
    startingTime = time.time()
    
    #load dataset as a list of ints
    (x_train_original, y_train), (x_test_original, y_test) = imdb.load_data(num_words=vocab_size_entry)
    
    for j, maxlen_entry in enumerate(maxlen_list):
        performance_df = pd.DataFrame({'vocab_size': vocab_size, 'maxlen': maxlen, 
                               'embedding_size': embedding_size, 'accuracy': accuracy, 'loss': loss})
        performance_df.to_csv(fileName, index=False)
        
        startingTime2 = time.time()
        
        #make all sequences of the same length
        x_train = pad_sequences(x_train_original, maxlen=maxlen_entry, padding='post')
        x_test =  pad_sequences(x_test_original, maxlen=maxlen_entry, padding='post')
        
        for k, embedding_size_entry in enumerate(embedding_size_list):
            if embedding_size_entry > 10*maxlen_entry:
                print(embedding_size_entry, ' is too big value for embedding_size. Skipping it...')
                continue
            # Else continue
            startingTime3 = time.time()
            
            print()
            print('---***---')
            print(vocab_size_entry, " words-dictionary, ", maxlen_entry, " maxlength, ", 
                  embedding_size_entry, " embedding_size" )
            model = create_model(vocab_size_entry, embedding_size_entry, maxlen_entry)
            accuracy_i, loss_i = fit_evaluate(model)
            
            vocab_size.append(vocab_size_entry)
            embedding_size.append(embedding_size_entry)
            maxlen.append(maxlen_entry)
            accuracy.append(round(accuracy_i, 3))
            loss.append(round(loss_i, 3))
            print('Time taken for this embedding: ', round(((time.time() - startingTime3)/60), 2), ' minutes')
        print('***Time taken for this maxlen: ', round(((time.time() - startingTime2)/60), 2), ' minutes')
    print('******Time taken for this vocab_size: ', round(((time.time() - startingTime)/60), 2), ' minutes')

# Save
performance_df = pd.DataFrame({'vocab_size': vocab_size, 'maxlen': maxlen, 
                               'embedding_size': embedding_size, 'accuracy': accuracy, 'loss': loss})
performance_df.to_csv(fileName, index=False)

"""
100000 words, 3000 maxlength of document: Accuracy 0.86556, Loss 0.508562
10000 words, 3000 maxlength: Accuracy 0.866080, Loss 0.543325
10000 words, 300 maxlength: Accuracy 0.867640, Loss 0.515492
10000  words,  300  maxlength, embedding_size  50  : Accuracy  0.86764 , Loss  0.5154921369147301

vocab_size	maxlen	embedding_size	accuracy	loss
5000	300	200	0.867999971	0.521
10000	300	300	0.871999979	0.496
30000	300	200	0.873000026	0.467
50000	250	250	0.874000013	0.449
"""
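
The three nested loops above, together with the rule that skips an embedding size larger than 10 times `maxlen`, can also be written as a flat list of configurations. This is only a sketch of the search space (no training happens here); the helper name `grid_configs` and its `ratio` parameter are introduced for this example.

```python
from itertools import product

vocab_size_list = [50000, 89000]
maxlen_list = [5, 10, 30, 50, 100, 200, 250, 300]
embedding_size_list = [5, 10, 30, 50, 100, 200, 250, 300, 400]

def grid_configs(vocab_sizes, maxlens, embedding_sizes, ratio=10):
    """All (vocab_size, maxlen, embedding_size) triples that pass the skip rule."""
    return [
        (v, m, e)
        for v, m, e in product(vocab_sizes, maxlens, embedding_sizes)
        if e <= ratio * m   # same condition as the `continue` in the loop above
    ]

configs = grid_configs(vocab_size_list, maxlen_list, embedding_size_list)
print(len(configs))
print(configs[:4])
```

Counting the configurations up front gives a rough idea of the total run time before starting a search that takes several minutes per model.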


---***---
50000  words-dictionary,  5  maxlength,  5  embedding_size
Accuracy  0.67272 , Loss  0.72647
Time taken for this embedding:  2.12  minutes

---***---
50000  words-dictionary,  5  maxlength,  10  embedding_size
Accuracy  0.6684 , Loss  0.80344
Time taken for this embedding:  3.42  minutes

---***---
50000  words-dictionary,  5  maxlength,  30  embedding_size
Accuracy  0.6588 , Loss  0.96455
Time taken for this embedding:  6.4  minutes

---***---
50000  words-dictionary,  5  maxlength,  50  embedding_size
Accuracy  0.6576 , Loss  1.0222
Time taken for this embedding:  9.36  minutes
100  is too big value for embedding_size. Skipping it...
200  is too big value for embedding_size. Skipping it...
250  is too big value for embedding_size. Skipping it...
300  is too big value for embedding_size. Skipping it...
400  is too big value for embedding_size. Skipping it...
***Time taken for this maxlen:  21.3  minutes

---***---
50000  words-dictionary,  10  maxlength,  5  embedding_size


Accuracy  0.8714 , Loss  0.46577
Time taken for this embedding:  16.38  minutes

---***---
50000  words-dictionary,  250  maxlength,  200  embedding_size
Accuracy  0.87236 , Loss  0.46233
Time taken for this embedding:  25.6  minutes

---***---
50000  words-dictionary,  250  maxlength,  250  embedding_size
Accuracy  0.87404 , Loss  0.44937
Time taken for this embedding:  26.68  minutes

---***---
50000  words-dictionary,  250  maxlength,  300  embedding_size
Accuracy  0.87272 , Loss  0.45699
Time taken for this embedding:  42.68  minutes

---***---
50000  words-dictionary,  250  maxlength,  400  embedding_size
Accuracy  0.87352 , Loss  0.45638
Time taken for this embedding:  64.07  minutes
***Time taken for this maxlen:  194.93  minutes

---***---
50000  words-dictionary,  300  maxlength,  5  embedding_size
Accuracy  0.85732 , Loss  0.46205
Time taken for this embedding:  1.88  minutes

---***---
50000  words-dictionary,  300  maxlength,  10  embedding_size
Accuracy  0.86152 , Loss  0.

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted:  OOM when allocating tensor with shape[50000,300] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
	 [[node Adam/Adam/update/Read/ReadVariableOp (defined at C:\installationDirectory\lib\site-packages\tensorflow_core\python\framework\ops.py:1751) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Adam/Adam/update/AssignSubVariableOp/_41]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted:  OOM when allocating tensor with shape[50000,300] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
	 [[node Adam/Adam/update/Read/ReadVariableOp (defined at C:\installationDirectory\lib\site-packages\tensorflow_core\python\framework\ops.py:1751) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_function_1610113]

Function call stack:
distributed_function -> distributed_function
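
The tensor named in the error has shape [50000, 300] and type float. A quick size estimate, sketched below, assumes 4 bytes per float32 element and that Adam keeps two extra slot tensors (first and second moment) per weight tensor. The helper `tensor_megabytes` is hypothetical.

```python
def tensor_megabytes(shape, bytes_per_element=4):
    """Size of a dense tensor in megabytes (10**6 bytes), assuming float32 by default."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 1e6

embedding_mb = tensor_megabytes((50000, 300))
# Weights plus Adam's two moment estimates of the same shape.
with_adam_mb = 3 * embedding_mb
print(embedding_mb, with_adam_mb)
```

A single tensor of this size is modest, so one possible reading (an assumption, not something the log confirms) is that memory had already been consumed by the models created earlier in the loop.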


### Best model configuration was at
- Vocab size: 50000
- Review length (maxlen): 250
- Embedding size: 250

In [68]:
vocab_size_entry = 50000
maxlen_entry = 250
embedding_size_entry = 250

(x_train_original, y_train), (x_test_original, y_test) = imdb.load_data(num_words=vocab_size_entry)

x_train = pad_sequences(x_train_original, maxlen=maxlen_entry, padding='post')
x_test =  pad_sequences(x_test_original, maxlen=maxlen_entry, padding='post')

print(x_train.shape)
print(x_test.shape)

model = create_model(vocab_size_entry, embedding_size_entry, maxlen_entry)
print('...fit_evaluate')
accuracy_i, loss_i = fit_evaluate(model)

# vocab_size.append(vocab_size_entry)
# embedding_size.append(embedding_size_entry)
# maxlen.append(maxlen_entry)
# accuracy.append(round(accuracy_i, 3))
# loss.append(round(loss_i, 3))

(25000, 250)
(25000, 250)
...fit_evaluate
Train on 25000 samples
Epoch 1/2
Epoch 2/2
Accuracy  0.86476 , Loss  0.32424


In [69]:
model.summary()

Model: "sequential_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 250, 250)          12500000  
_________________________________________________________________
flatten_4 (Flatten)          (None, 62500)             0         
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 62501     
Total params: 12,562,501
Trainable params: 12,562,501
Non-trainable params: 0
_________________________________________________________________
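
The parameter counts in the summary follow directly from the three hyperparameters. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def count_parameters(vocab_size, embedding_size, maxlen):
    """Parameter counts for the Embedding -> Flatten -> Dense(1) model used above."""
    embedding = vocab_size * embedding_size   # one vector per word id, no bias
    dense = maxlen * embedding_size + 1       # one weight per flattened input, plus one bias
    return embedding, dense, embedding + dense

embedding_params, dense_params, total_params = count_parameters(50000, 250, 250)
print(embedding_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding matrix, which is why the vocabulary size and embedding size dominate the training time and memory use.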


In [70]:
from sklearn.metrics import accuracy_score, confusion_matrix,classification_report
import numpy as np

def probToVal(y_prob):
    # Threshold the predicted probabilities at 0.5 and return a flat array of 0.0 / 1.0
    return (np.asarray(y_prob).reshape(-1) > 0.5).astype(float)

def checkPerformance(Y_1, Y_2, verbose=1):
    print("Model accuracy score: ", accuracy_score(Y_1, Y_2))
    if verbose > 0:
        print()
        print("Model confusion matrix: \n", confusion_matrix(Y_1, Y_2))
        print()
        print("Model classification report: \n", classification_report(Y_1, Y_2))
        print()
    return

y_pred = model.predict(x_test)
y_pred_val = probToVal(y_pred)

# y_pred[1]
# y_test[0]


checkPerformance(y_test, y_pred_val)

Model accuracy score:  0.86476

Model confusion matrix: 
 [[10519  1981]
 [ 1400 11100]]

Model classification report: 
               precision    recall  f1-score   support

           0       0.88      0.84      0.86     12500
           1       0.85      0.89      0.87     12500

    accuracy                           0.86     25000
   macro avg       0.87      0.86      0.86     25000
weighted avg       0.87      0.86      0.86     25000
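
The figures in the classification report can be recomputed from the confusion matrix alone. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes); the helper name `per_class_metrics` is made up for this example.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[10519, 1981],
               [1400, 11100]])

def per_class_metrics(cm):
    """Precision, recall and F1 per class, plus overall accuracy."""
    true_pos = np.diag(cm).astype(float)
    precision = true_pos / cm.sum(axis=0)   # correct / everything predicted as the class
    recall = true_pos / cm.sum(axis=1)      # correct / everything truly in the class
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = true_pos.sum() / cm.sum()
    return precision, recall, f1, accuracy

precision, recall, f1, accuracy = per_class_metrics(cm)
print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), round(float(accuracy), 5))
```

The model is slightly more likely to call a negative review positive (1981 cases) than the reverse (1400 cases), which is why recall is lower for class 0 than for class 1.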




## Retrieve the output of each layer in Keras for a given single test sample from the trained model you built

In [41]:
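from tensorflow.keras.models import Model

# The earlier attempt, K.function([inp, K.learning_phase()], [out]), failed here with
# "AttributeError: 'int' object has no attribute 'op'" because K.learning_phase()
# returned a plain int. A sub-model that exposes every layer output avoids that call.
outputs = [layer.output for layer in model.layers]
extractor = Model(inputs=model.input, outputs=outputs)

test = np.array([x_test[0], ])        # batch containing a single test sample
layer_outs = extractor.predict(test)  # one array per layer
for layer, out in zip(model.layers, layer_outs):
    print(layer.name, out.shape)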

In [44]:
# outputs = [K.function([model.input], [layer.output])([x_test, 1]) for layer in model.layers]

In [0]:
outputs[0]
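
To see what each of the three layers produces for one sample, the forward pass can be imitated in plain NumPy. This is only a sketch with toy sizes and randomly initialised stand-in weights (not the trained model's weights), so the final probability is meaningless; the point is the shape of each intermediate output.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size, maxlen = 20, 4, 5   # toy sizes, not the trained model's

# Random stand-ins for the trained weights.
embedding_weights = rng.normal(size=(vocab_size, embedding_size))
dense_weights = rng.normal(size=(maxlen * embedding_size, 1))
dense_bias = np.zeros(1)

sample = np.array([[1, 7, 3, 0, 0]])            # batch of one padded review

embedded = embedding_weights[sample]            # Embedding: (1, maxlen, embedding_size)
flattened = embedded.reshape(len(sample), -1)   # Flatten:   (1, maxlen * embedding_size)
logits = flattened @ dense_weights + dense_bias
probability = 1.0 / (1.0 + np.exp(-logits))     # Dense + sigmoid: (1, 1)

for name, out in [("embedding", embedded), ("flatten", flattened), ("dense", probability)]:
    print(name, out.shape)
```

For the model trained above, the corresponding shapes would be (1, 250, 250), (1, 62500) and (1, 1), matching the output shapes in the model summary.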

# Extras
https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

In [45]:
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (BatchNormalization, Bidirectional, Dense, Embedding,
                                     Flatten, LSTM, SpatialDropout1D)

In [63]:
# embed_dim = 128 # 30 # 128
# lstm_out = 196 # 128 # 196
# num_classes = 2
# model_lstm = Sequential()
# model_lstm.add(Embedding(vocab_size_entry, embedding_size_entry, input_length=maxlen_entry))
# # model.add(embedding_layer)
# #model.add(SpatialDropout1D(0.25))
# model_lstm.add(Bidirectional(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2)))
# #model.add(LSTM(lstm_out))
# """
# ADD BATCH NORMALIZATION
# """
# model_lstm.add(BatchNormalization())
# #
# model_lstm.add(Dense(num_classes, activation='softmax'))
# model_lstm.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
# #
# # model.add(Dense(1,activation='sigmoid'))
# # # model.add(Dense(1,activation='tanh'))
# # model.compile(loss = BinaryCrossentropy(), optimizer='adam',metrics = ['accuracy'])
# print(model_lstm.summary())

In [60]:
# def oneEpochTrain(X_train_in, Y_train_in):
#     batch_size = 32
#     #model.fit(X_train, np.array(Y_train), epochs = 1, batch_size=batch_size, verbose = 0, class_weight=class_weight)
#     model_lstm.fit(X_train_in, np.array(Y_train_in), epochs = 1, batch_size=batch_size, verbose = 1)
#     #
#     #score,acc = model.evaluate(X_test, np.array(Y_test), verbose = 2, batch_size = batch_size)
#     #print("score: %.2f" % (score))
#     #print("acc: %.2f" % (acc))

# def oneEpoch(X_train_in, Y_train_in):
#     oneEpochTrain(X_train_in, Y_train_in)
#     #
#     lstm_pred = model_lstm.predict(X_val)
#     #lstm_pred = sigmoidToPolarity(lstm_pred)
#     #lstm_pred = CatCrossEntropy_toPolarity(lstm_pred)
#     #
#     con_mat = confusion_matrix(np.argmax(lstm_pred, axis=1), np.argmax(Y_val, axis=1))
#     acc = accuracy_score(np.argmax(lstm_pred, axis=1), np.argmax(Y_val, axis=1))
#     #
#     print()
#     print("LSTM model accuracy score: ", acc)
#     #print()
#     print("LSTM model confusion matrix: \n", con_mat)
#     #print()
#     #print("LSTM model classification report: \n", classification_report(Y_val, lstm_pred))
#     #
#     #recall_neg = con_mat[0][0]/(con_mat[0][0] + con_mat[1][0])
#     #recall_pos = con_mat[1][1]/(con_mat[0][1] + con_mat[1][1])
#     #return [recall_neg, recall_pos]
#     return con_mat, acc

In [61]:
# oneEpoch(x_train, y_train)

In [62]:
# model_lstm.fit(x_train, y_train, epochs=10, verbose=1)

In [None]:

# # define documents
# docs = ['Well done!',
#         'Good work',
#         'Great effort',
#         'nice work',
#         'Excellent!',
#         'Weak',
#         'Poor effort!',
#         'not good',
#         'poor work',
#         'Could have done better.']
# # define class labels
# labels = array([1,1,1,1,1,0,0,0,0,0])

In [None]:
# # integer encode the documents
# vocab_size = 50
# encoded_docs = [one_hot(d, vocab_size) for d in docs]
# print(encoded_docs)

In [None]:
# # pad documents to a max length of 4 words
# max_length = 4
# padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# print(padded_docs)

In [None]:
# # define the model
# model = Sequential()
# model.add(Embedding(vocab_size, 8, input_length=max_length))  # 50*8 = 400 learnable weights (input so no bias)
# model.add(Flatten())
# model.add(Dense(1, activation='sigmoid'))  # 8*4 + 1 = 33 learnable weights
# # compile the model
# model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# # summarize the model
# print(model.summary())

In [None]:
# # fit the model
# model.fit(padded_docs, labels, epochs=50, verbose=0)

In [None]:
# # evaluate the model
# loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)

In [None]:
# print('Accuracy: %f' % (accuracy*100))