# 00 Training and evaluating a DNN model on the IMDB Dataset
## Downloading and data preprocessing

The dataset was downloaded from http://ai.stanford.edu/~amaas/data/sentiment/. Its reference is:

```
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}
```

In [2]:
%time

import os
import pandas as pd

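imdb_dir = "./datasets/aclImdb"

# Collect rows in a plain list and build the DataFrame once:
# DataFrame.append copied the whole frame on every call and was removed in pandas 2.0.
rows = []
for dir_kind in ['train', 'test']:
    for label, label_type in enumerate(['neg', 'pos']):  # neg -> 0, pos -> 1
        dir_name = os.path.join(imdb_dir, dir_kind, label_type)
        for fname in os.listdir(dir_name):
            if fname.endswith('.txt'):
                # The review files are assumed to be UTF-8; the context manager closes each file.
                with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                    rows.append({'text': f.read(), 'sentiment': label})

df = pd.DataFrame(rows, columns=['text', 'sentiment'])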

CPU times: user 9 µs, sys: 2 µs, total: 11 µs
Wall time: 23.8 µs
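
The timing above is in microseconds because `%time`, written alone on a line, times only the (empty) rest of that line; the cell magic `%%time` on the first line of the cell would time the whole cell. Outside IPython, the same measurement can be sketched with `time.perf_counter`. The helper name `load_reviews` and the tiny temporary directory tree are assumptions made for illustration; they only mimic the `train|test / neg|pos / *.txt` layout of `aclImdb`.

```python
import os
import tempfile
import time

def load_reviews(root):
    """Return a list of (text, label) pairs; label is 0 for 'neg' and 1 for 'pos'."""
    rows = []
    for split in ("train", "test"):
        for label, name in enumerate(("neg", "pos")):
            folder = os.path.join(root, split, name)
            for fname in sorted(os.listdir(folder)):
                if fname.endswith(".txt"):
                    with open(os.path.join(folder, fname), encoding="utf-8") as f:
                        rows.append((f.read(), label))
    return rows

# Build a throwaway directory tree with one review per folder (hypothetical data).
demo_root = tempfile.mkdtemp()
for split in ("train", "test"):
    for name in ("neg", "pos"):
        folder = os.path.join(demo_root, split, name)
        os.makedirs(folder)
        with open(os.path.join(folder, "0_1.txt"), "w", encoding="utf-8") as f:
            f.write(f"a {name} review from the {split} split")

# Time only the loading step.
start = time.perf_counter()
demo_rows = load_reviews(demo_root)
elapsed = time.perf_counter() - start

print(f"loaded {len(demo_rows)} reviews in {elapsed:.6f} s")
```

Collecting the rows in a list and converting them once at the end avoids rebuilding the table for each of the 50,000 files.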


In [3]:
df.head()

|   | text | sentiment |
|---|------|-----------|
| 0 | I am quite a fan of novelist/screenwriter Mich... | 0 |
| 1 | If this book remained faithful to the book the... | 0 |
| 2 | The Eternal Jew (Der Ewige Jude) does not have... | 0 |
| 3 | Here are the matches . . . (adv. = advantage)<... | 0 |
| 4 | I'm sorry but I didn't like this doc very much... | 0 |


In [4]:
print('Number of negative instances:', len(df[df['sentiment'] == 0]))
print('Number of positive instances:', len(df[df['sentiment'] == 1]))
print('The dataset is balanced!')

Number of negative instances: 25000
Number of positive instances: 25000
The dataset is balanced!


In [5]:
print(df['text'][0])

I am quite a fan of novelist/screenwriter Michael Chabon. His novel "Wonder Boys" became a fantastic movie by Curtis Hanson. His masterful novel "The Amazing Adventures of Kavalier and Clay" won the Pulitzer Prize a few years back, and he had a hand in the script of "Spider Man 2", arguably the greatest comic book movie of all time.<br /><br />Director Rawson Marshall Thurber has also directed wonderful comedic pieces, such as the gut-busting "Dodgeball" and the genius short film series "Terry Tate: Office Linebacker". And with a cast including Peter Saarsgard, Sienna Miller, Nick Nolte and Mena Suvari, this seems like a no-brainer.<br /><br />It is. Literally.<br /><br />Jon Foster stars as Art Bechstein, the son of a mobster (Nolte) who recently graduated with a degree in Economics. Jon is in a state of arrested development: he works a minimum wage job at Book Barn, has a vapid relationship with his girlfriend/boss, Phlox (Suvari), which amounts to little more than copious amounts of

In [154]:
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')
stopword_list = stopwords.words('english')

def remove_html_tags(text):
    return BeautifulSoup(text, 'lxml').text

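def remove_special_characters(text):
    # Keep only ASCII letters, digits and whitespace. The range must be A-Z:
    # A-z would also let through [ \ ] ^ _ and the backtick.
    pattern = r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, ' ', text)

def stemmer(text):
    # Porter-stem each whitespace-separated word (currently disabled in text_preprocessing).
    ps = PorterStemmer()
    return ' '.join(ps.stem(word) for word in text.split())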

def remove_stopwords(text):
    tokens = ToktokTokenizer().tokenize(text)
    tokens = [token.strip() for token in tokens]
    filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

def text_preprocessing(text):
    text = remove_html_tags(text)
    text = remove_special_characters (text)
    #text = stemmer(text)
    text = remove_stopwords(text)
    return text

[nltk_data] Downloading package stopwords to /home/spola/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
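
A note on the character class used for cleaning: in a regular expression, the range `A-z` spans ASCII codes 65 to 122, so it also covers `[`, `\`, `]`, `^`, `_` and the backtick, whereas `A-Z` covers letters only. The sample string below is made up for illustration; the sketch simply compares the two patterns.

```python
import re

sample = "great_movie [sic] 10^2 `quoted` end"

# Range A-z: the punctuation between 'Z' and 'a' is treated as allowed.
loose = re.sub(r"[^a-zA-z0-9\s]", " ", sample)
# Range A-Z: only letters, digits and whitespace are allowed.
strict = re.sub(r"[^a-zA-Z0-9\s]", " ", sample)

print(repr(loose))
print(repr(strict))
```

With the looser range the sample passes through unchanged, so tokens such as `great_movie` would later be treated as single vocabulary entries.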


In [7]:
print(text_preprocessing(df['text'][0]))

quite fan novelist screenwriter Michael Chabon novel Wonder Boys became fantastic movie Curtis Hanson masterful novel Amazing Adventures Kavalier Clay Pulitzer Prize years back hand script Spider Man 2 arguably greatest comic book movie time Director Rawson Marshall Thurber also directed wonderful comedic pieces gut busting Dodgeball genius short film series Terry Tate Office Linebacker cast including Peter Saarsgard Sienna Miller Nick Nolte Mena Suvari seems like brainer Literally Jon Foster stars Art Bechstein son mobster Nolte recently graduated degree Economics Jon state arrested development works minimum wage job Book Barn vapid relationship girlfriend boss Phlox Suvari amounts little copious amounts sex plans chip away career zero passion One night party ex roommate introduces Jon Jane Miller beautiful smart violinist Later night go pie asks Jon question begins shake catatonic state existence want tell something never told single soul make night indelible Jon tells reoccurring dr

In [8]:
df['text'] = df['text'].apply(text_preprocessing)

In [9]:
df.head()

|   | text | sentiment |
|---|------|-----------|
| 0 | quite fan novelist screenwriter Michael Chabon... | 0 |
| 1 | book remained faithful book assume author igno... | 0 |
| 2 | Eternal Jew Der Ewige Jude today would call ma... | 0 |
| 3 | matches adv advantage Warriors Ultimate Warrio... | 0 |
| 4 | sorry like doc much think million ways could b... | 0 |
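
One side effect is visible in row 4: "I'm sorry but I didn't like this doc very much" has become "sorry like doc much", because negations such as "didn" and "t" count as stop words. For sentiment analysis it may be worth keeping them. The sketch below uses a tiny hand-written stop-word set and a hypothetical helper `drop_stopwords`; both are assumptions for illustration and are not NLTK's list or the notebook's function.

```python
# Hypothetical miniature stop-word list; the notebook uses NLTK's English list.
STOPWORDS = {"i", "m", "but", "didn", "t", "this", "very", "not", "no", "the", "a"}
# Tokens that flip the polarity of a sentence and may be worth keeping.
NEGATIONS = frozenset({"not", "no", "didn", "t"})

def drop_stopwords(text, keep=frozenset()):
    """Drop stop words (case-insensitively) except those listed in `keep`."""
    tokens = text.split()
    return " ".join(
        t for t in tokens
        if t.lower() not in STOPWORDS or t.lower() in keep
    )

# The review after special characters were replaced by spaces.
review = "I m sorry but I didn t like this doc very much"

without_negation = drop_stopwords(review)
with_negation = drop_stopwords(review, keep=NEGATIONS)

print(without_negation)
print(with_negation)
```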


In [14]:
%store df

Stored 'df' (DataFrame)


## Creating the DNN Model

In [1]:
%store -r

Unable to restore variable 'model', ignoring (use %store -d to forget!)
The error was: <class 'KeyError'>


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
import numpy as np

x_train, x_val, y_train, y_val = train_test_split(df['text'], df['sentiment'], test_size = 0.33, shuffle = True)

x_train = list(x_train)
x_val = list(x_val)

y_train = list(y_train)
y_val = list(y_val)

In [3]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
texts = x_train + x_val
tokenizer.fit_on_texts(texts)

maxlen = max([len(t.split()) for t in texts])

words_size = len(tokenizer.word_index) + 1

train_sequences = tokenizer.texts_to_sequences(x_train)
val_sequences = tokenizer.texts_to_sequences(x_val)

print('Found %s unique tokens.' % len(tokenizer.word_index))

train_data = pad_sequences(train_sequences, maxlen = maxlen)
val_data = pad_sequences(val_sequences, maxlen = maxlen)

y_train = np.asarray(y_train)
y_val = np.asarray(y_val)
print('Shape of train data tensor:', train_data.shape)
print('Shape of train label tensor:', y_train.shape)

print('Shape of validation data tensor:', val_data.shape)
print('Shape of validation label tensor:', y_val.shape)


Found 103331 unique tokens.
Shape of train data tensor: (33500, 1429)
Shape of train label tensor: (33500,)
Shape of validation data tensor: (16500, 1429)
Shape of validation label tensor: (16500,)


In [4]:
val_data

array([[    0,     0,     0, ...,  7589,   329,     6],
       [    0,     0,     0, ..., 61040,   149,   291],
       [    0,     0,     0, ...,  1804, 18227,  1847],
       ...,
       [    0,     0,     0, ...,  1745,    94,   866],
       [    0,     0,     0, ...,  1565,  1481,  1802],
       [    0,     0,     0, ...,   254,   546,     2]], dtype=int32)
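
The leading zeros in this array come from `pad_sequences`, which by default pads on the left (`padding='pre'`) and, for sequences longer than `maxlen`, drops tokens from the left as well (`truncating='pre'`). Index 0 is reserved for padding, which is why `words_size` is `len(tokenizer.word_index) + 1`. A minimal pure-Python sketch of that default behaviour follows; `pad_pre` is a hypothetical name and not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each integer sequence to exactly `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep at most the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

demo = pad_pre([[7589, 329, 6], [5], [1, 2, 3, 4, 5, 6]], maxlen=5)
for row in demo:
    print(row)
```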

In [5]:
y_train

array([1, 1, 0, ..., 0, 1, 0])

## Normal Validation

In [144]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
from keras import regularizers
from keras import layers
import keras

In [94]:
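callbacks_list = [
    # Stop when the monitored metric has not improved for 5 epochs.
    # Note: 'acc' is the training accuracy, not the validation accuracy.
    keras.callbacks.EarlyStopping(
        monitor='acc',
        patience=5
    ),
    # Keep only the weights of the epoch with the best validation accuracy.
    keras.callbacks.ModelCheckpoint(
        filepath='my_model.h5',
        monitor='val_acc',
        save_best_only=True
    ),
    # Divide the learning rate by 10 after 5 epochs without a lower validation loss.
    keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.1,
        patience=5,
    )
]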

In [149]:
import numpy as np

def get_fitted_model(dropout = 0.5, layer_num = 1, init_mode='uniform',
                       regularizer = None, batch_size = 128):
    
    seed = 7
    np.random.seed(seed)
    
    str_kernel_regularizer = None
    if (regularizer is not None):
        str_kernel_regularizer = str(regularizer)
    else:
        str_kernel_regularizer = 'None'
    print('\n', f'Training Model with:', '\n',
    f'* dropout = {dropout};', '\n',
    f'* number of hidden layers = {layer_num};', '\n',
    f'* init mode = {init_mode};', '\n',
    f'* l2 kernel regularizer value = {str_kernel_regularizer};', '\n',
    f'* batch size = {batch_size}')
    
    def add_layers():
        for i in range (0, layer_num):
            if (regularizer):
                model.add(Dense(64, kernel_initializer=init_mode, activation='relu',
                               kernel_regularizer = regularizers.l2(regularizer)))
            else:
                model.add(Dense(64, kernel_initializer=init_mode, activation='relu'))
            model.add(layers.Dropout(dropout))
    
    EMBEDDING_DIM = 100
    
    model = Sequential()
    model.add(Embedding(words_size, EMBEDDING_DIM, input_length=maxlen))
    model.add(Flatten())
    add_layers()
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
    history = model.fit(train_data, y_train,
                        epochs=10,
                        batch_size=batch_size,
                        callbacks=callbacks_list,
                        validation_data=(val_data, y_val),
                        verbose = 2)
    return history
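
To get a feel for the size of this network, the parameter count for one hidden layer can be worked out from the shapes printed earlier (103,331 unique tokens, sequences of length 1,429). This is plain arithmetic under those figures, not output from `model.summary()`.

```python
vocab_size = 103331 + 1   # len(tokenizer.word_index) + 1 (index 0 is padding)
maxlen = 1429             # padded sequence length
embedding_dim = 100       # EMBEDDING_DIM
hidden_units = 64

embedding_params = vocab_size * embedding_dim
flatten_size = maxlen * embedding_dim                      # inputs to the first Dense layer
dense_params = flatten_size * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * 1 + 1                       # sigmoid unit (Dropout has no weights)

total_params = embedding_params + dense_params + output_params

print(f"embedding : {embedding_params:>12,}")
print(f"dense(64) : {dense_params:>12,}")
print(f"output    : {output_params:>12,}")
print(f"total     : {total_params:>12,}")
```

That is roughly 19.5 million trainable weights for 33,500 training reviews, which fits with how quickly the training accuracy approaches 1.0 in the logs of this notebook.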

In [None]:
hyperparameters = dict(dropout = [0.2, 0.5, 0.65, 0.8],
                       layer_num = [1,2,3],
                       regularizer = [0.1, 0.01, 0.001],
                       batch_size =[128,512],
                       init_mode = ['uniform', 'lecun_uniform', 'normal', 
                                    'glorot_normal', 'glorot_uniform']
                      )
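
The two batch sizes in this grid determine how many weight updates happen per epoch. With the 33,500 training rows reported earlier, the counts can be checked by hand; they are the `262/262` and `66/66` figures that Keras prints in the training logs.

```python
import math

n_train = 33500  # rows in train_data, from the shapes printed earlier

# One step per batch; the last batch may be smaller, hence the ceiling.
steps_by_batch = {bs: math.ceil(n_train / bs) for bs in (128, 512)}

for batch_size, steps in steps_by_batch.items():
    print(f"batch_size={batch_size}: {steps} steps per epoch")
```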

In [126]:
dict_dropout_histories = {}
best_dropout = 0.5
best_dropout_acc = 0
for i in hyperparameters['dropout']:
    history = get_fitted_model(dropout = i)
    if max(history.history['val_acc']) > best_dropout_acc:
        best_dropout = i
        best_dropout_acc = max(history.history['val_acc'])
    dict_dropout_histories[str(i)] = history

Training Model with: 
 * dropout = 0.2; 
 * number of hidden layers = 1; 
 * init mode = uniform; 
 * l2 kernel regularizer value = None; 
 * batch size = 128
Epoch 1/10
262/262 - 74s - loss: 0.5783 - acc: 0.7562 - val_loss: 0.2486 - val_acc: 0.8998 - lr: 0.0010
Epoch 2/10
262/262 - 69s - loss: 0.1507 - acc: 0.9442 - val_loss: 0.2446 - val_acc: 0.9026 - lr: 0.0010
Epoch 3/10
262/262 - 67s - loss: 0.0379 - acc: 0.9883 - val_loss: 0.3281 - val_acc: 0.8942 - lr: 0.0010
Epoch 4/10
262/262 - 66s - loss: 0.0072 - acc: 0.9981 - val_loss: 0.4558 - val_acc: 0.8884 - lr: 0.0010
Epoch 5/10
262/262 - 67s - loss: 0.0036 - acc: 0.9988 - val_loss: 0.5584 - val_acc: 0.8938 - lr: 0.0010
Epoch 6/10
262/262 - 66s - loss: 7.9427e-04 - acc: 0.9996 - val_loss: 0.6350 - val_acc: 0.8898 - lr: 0.0010
Epoch 7/10
262/262 - 70s - loss: 5.8435e-04 - acc: 0.9999 - val_loss: 0.6777 - val_acc: 0.8908 - lr: 0.0010
Epoch 8/10
262/262 - 69s - loss: 1.6233e-06 - acc: 1.0000 - val_loss: 0.7774 - val_acc: 0.8890 - lr: 1.00
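
In this run the training accuracy improves at every logged epoch while the validation loss is lowest at epoch 2, so the choice of monitored metric matters for early stopping. The sketch below is a simplified model of patience-based early stopping (it is not the Keras implementation, and `early_stop_epoch` is a hypothetical helper) applied to the eight logged epochs.

```python
def early_stop_epoch(values, patience, mode):
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    An epoch counts as an improvement only if it is strictly better than the
    best value seen so far; training stops after `patience` epochs without one.
    """
    if mode == "max":
        better = lambda new, best: new > best
    else:
        better = lambda new, best: new < best
    best, wait = None, 0
    for epoch, value in enumerate(values, start=1):
        if best is None or better(value, best):
            best, wait = value, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Values copied from the eight logged epochs of the dropout = 0.2 run above.
acc = [0.7562, 0.9442, 0.9883, 0.9981, 0.9988, 0.9996, 0.9999, 1.0000]
val_loss = [0.2486, 0.2446, 0.3281, 0.4558, 0.5584, 0.6350, 0.6777, 0.7774]

stop_on_acc = early_stop_epoch(acc, patience=5, mode="max")
stop_on_val_loss = early_stop_epoch(val_loss, patience=5, mode="min")

print("monitor='acc'      ->", stop_on_acc)
print("monitor='val_loss' ->", stop_on_val_loss)
```

Under these assumptions, monitoring the training accuracy never triggers a stop within the logged epochs, whereas monitoring the validation loss would stop at epoch 7.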

In [7]:
#print(max(dict_dropout_histories[str(best_dropout)].history['val_acc']))
#print(best_dropout)

In [133]:
best_dropout = 0.8

In [135]:
dict_layers_num_histories = {}
best_layer_num = 1
best_layer_num_acc = 0
for i in hyperparameters['layer_num']:
    history = get_fitted_model(dropout = best_dropout, layer_num = i)
    if max(history.history['val_acc']) > best_layer_num_acc:
        best_layer_num = i
        best_layer_num_acc = max(history.history['val_acc'])
    dict_layers_num_histories[str(i)] = history


 Training Model with: 
 * dropout = 0.8; 
 * number of hidden layers = 1; 
 * init mode = uniform; 
 * l2 kernel regularizer value = None; 
 * batch size = 128
Epoch 1/10
262/262 - 67s - loss: 0.7365 - acc: 0.4996 - val_loss: 0.6932 - val_acc: 0.4945 - lr: 0.0010
Epoch 2/10
262/262 - 67s - loss: 0.6949 - acc: 0.5012 - val_loss: 0.6932 - val_acc: 0.4945 - lr: 0.0010
Epoch 3/10
262/262 - 66s - loss: 0.6962 - acc: 0.5033 - val_loss: 0.6928 - val_acc: 0.5077 - lr: 0.0010
Epoch 4/10
262/262 - 67s - loss: 0.5884 - acc: 0.6457 - val_loss: 0.3182 - val_acc: 0.8843 - lr: 0.0010
Epoch 5/10
262/262 - 68s - loss: 0.2597 - acc: 0.9007 - val_loss: 0.2396 - val_acc: 0.9042 - lr: 0.0010
Epoch 6/10
262/262 - 67s - loss: 0.1478 - acc: 0.9476 - val_loss: 0.2685 - val_acc: 0.9042 - lr: 0.0010
Epoch 7/10
262/262 - 70s - loss: 0.0771 - acc: 0.9746 - val_loss: 0.3225 - val_acc: 0.9015 - lr: 0.0010
Epoch 8/10
262/262 - 69s - loss: 0.0318 - acc: 0.9901 - val_loss: 0.5237 - val_acc: 0.8853 - lr: 0.0010
Epoch 9

In [137]:
print(max(dict_layers_num_histories[str(best_layer_num)].history['val_acc']))
print(best_layer_num)

0.9041818380355835
1


In [138]:
dict_init_mode_histories = {}
best_init_mode = 'uniform'
best_init_mode_acc = 0
for i in hyperparameters['init_mode']:
    history = get_fitted_model(dropout = best_dropout, layer_num = best_layer_num, init_mode = i)
    if max(history.history['val_acc']) > best_init_mode_acc:
        best_init_mode = i
        best_init_mode_acc = max(history.history['val_acc'])
    dict_init_mode_histories[str(i)] = history


 Training Model with: 
 * dropout = 0.8; 
 * number of hidden layers = 1; 
 * init mode = uniform; 
 * l2 kernel regularizer value = None; 
 * batch size = 128
Epoch 1/10
262/262 - 73s - loss: 0.7706 - acc: 0.5003 - val_loss: 0.6932 - val_acc: 0.4945 - lr: 0.0010
Epoch 2/10
262/262 - 74s - loss: 0.6749 - acc: 0.5578 - val_loss: 0.4227 - val_acc: 0.8475 - lr: 0.0010
Epoch 3/10
262/262 - 73s - loss: 0.3166 - acc: 0.8763 - val_loss: 0.2416 - val_acc: 0.9024 - lr: 0.0010
Epoch 4/10
262/262 - 73s - loss: 0.1737 - acc: 0.9390 - val_loss: 0.2539 - val_acc: 0.9040 - lr: 0.0010
Epoch 5/10
262/262 - 74s - loss: 0.0931 - acc: 0.9696 - val_loss: 0.3177 - val_acc: 0.9022 - lr: 0.0010
Epoch 6/10
262/262 - 74s - loss: 0.0403 - acc: 0.9875 - val_loss: 0.4204 - val_acc: 0.8990 - lr: 0.0010
Epoch 7/10
262/262 - 73s - loss: 0.0188 - acc: 0.9944 - val_loss: 0.5389 - val_acc: 0.8969 - lr: 0.0010
Epoch 8/10
262/262 - 74s - loss: 0.0097 - acc: 0.9972 - val_loss: 0.6026 - val_acc: 0.8970 - lr: 0.0010
Epoch 9

In [139]:
print(max(dict_init_mode_histories[str(best_init_mode)].history['val_acc']))
print(best_init_mode)

0.9055151343345642
glorot_normal


In [140]:
dict_batch_size_histories = {}
best_batch_size = 128
best_batch_size_acc = 0
for i in hyperparameters['batch_size']:
    history = get_fitted_model(dropout = best_dropout, layer_num = best_layer_num, 
                              init_mode = best_init_mode, batch_size = i)
    if max(history.history['val_acc']) > best_batch_size_acc:
        best_batch_size = i
        best_batch_size_acc = max(history.history['val_acc'])
    dict_batch_size_histories[str(i)] = history


 Training Model with: 
 * dropout = 0.8; 
 * number of hidden layers = 1; 
 * init mode = glorot_normal; 
 * l2 kernel regularizer value = None; 
 * batch size = 128
Epoch 1/10
262/262 - 73s - loss: 0.8211 - acc: 0.4996 - val_loss: 0.6932 - val_acc: 0.4945 - lr: 0.0010
Epoch 2/10
262/262 - 74s - loss: 0.6916 - acc: 0.5206 - val_loss: 0.5398 - val_acc: 0.7776 - lr: 0.0010
Epoch 3/10
262/262 - 74s - loss: 0.3332 - acc: 0.8646 - val_loss: 0.2405 - val_acc: 0.9038 - lr: 0.0010
Epoch 4/10
262/262 - 75s - loss: 0.1769 - acc: 0.9381 - val_loss: 0.2641 - val_acc: 0.9030 - lr: 0.0010
Epoch 5/10
262/262 - 75s - loss: 0.0934 - acc: 0.9701 - val_loss: 0.3183 - val_acc: 0.8984 - lr: 0.0010
Epoch 6/10
262/262 - 73s - loss: 0.0398 - acc: 0.9883 - val_loss: 0.4168 - val_acc: 0.9018 - lr: 0.0010
Epoch 7/10
262/262 - 71s - loss: 0.0202 - acc: 0.9942 - val_loss: 0.5338 - val_acc: 0.8979 - lr: 0.0010
Epoch 8/10
262/262 - 71s - loss: 0.0104 - acc: 0.9969 - val_loss: 0.6205 - val_acc: 0.8964 - lr: 0.0010
E

In [141]:
print(max(dict_batch_size_histories[str(best_batch_size)].history['val_acc']))
print(best_batch_size)

0.9047272801399231
512


In [150]:
dict_regularizers_histories = {}
best_regularizer = 0.01
best_regularizer_acc = 0
for i in hyperparameters['regularizer']:
    history = get_fitted_model(dropout = best_dropout, layer_num = best_layer_num, 
                              init_mode = best_init_mode, batch_size = best_batch_size, regularizer = i)
    if max(history.history['val_acc']) > best_regularizer_acc:
        best_regularizer = i
        best_regularizer_acc = max(history.history['val_acc'])
    dict_regularizers_histories[str(i)] = history


 Training Model with: 
 * dropout = 0.8; 
 * number of hidden layers = 1; 
 * init mode = glorot_normal; 
 * l2 kernel regularizer value = 0.1; 
 * batch size = 512
Epoch 1/10
66/66 - 43s - loss: 1.3671 - acc: 0.5021 - val_loss: 0.9157 - val_acc: 0.4945 - lr: 0.0010
Epoch 2/10
66/66 - 46s - loss: 0.9335 - acc: 0.6421 - val_loss: 1.0309 - val_acc: 0.7313 - lr: 0.0010
Epoch 3/10
66/66 - 47s - loss: 0.8637 - acc: 0.8147 - val_loss: 0.8550 - val_acc: 0.8531 - lr: 0.0010
Epoch 4/10
66/66 - 46s - loss: 0.7398 - acc: 0.8678 - val_loss: 0.8150 - val_acc: 0.8856 - lr: 0.0010
Epoch 5/10
66/66 - 46s - loss: 0.7182 - acc: 0.8844 - val_loss: 0.6791 - val_acc: 0.8897 - lr: 0.0010
Epoch 6/10
66/66 - 47s - loss: 0.6763 - acc: 0.9003 - val_loss: 0.8295 - val_acc: 0.8952 - lr: 0.0010
Epoch 7/10
66/66 - 46s - loss: 0.6990 - acc: 0.9034 - val_loss: 0.7381 - val_acc: 0.8944 - lr: 0.0010
Epoch 8/10
66/66 - 47s - loss: 0.6375 - acc: 0.9144 - val_loss: 0.7777 - val_acc: 0.8964 - lr: 0.0010
Epoch 9/10
66/66 -

In [151]:
print(max(dict_regularizers_histories[str(best_regularizer)].history['val_acc']))
print(best_regularizer)

0.9042423963546753
0.001
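
The five loops above share one pattern: tune a single hyperparameter, freeze the best value, then move to the next. That means 4 + 3 + 5 + 2 + 3 = 17 training runs instead of the 360 combinations of the full grid, at the price of ignoring interactions between hyperparameters. A generic sketch of the pattern follows; `greedy_search` and `toy_score` are hypothetical names, and the toy score merely stands in for "train a model and return its best `val_acc`".

```python
def greedy_search(evaluate, grid, defaults):
    """Tune one hyperparameter at a time, fixing each winner before moving on.

    `evaluate(**params)` must return a score where higher is better.
    `grid` maps a parameter name to its candidate values, in tuning order.
    """
    best = dict(defaults)
    for name, candidates in grid.items():
        best_score = float("-inf")
        best_value = best[name]
        for value in candidates:
            score = evaluate(**{**best, name: value})
            if score > best_score:
                best_score, best_value = score, value
        best[name] = best_value
    return best

# Purely illustrative stand-in for an expensive training run.
def toy_score(dropout, layer_num, batch_size):
    return -abs(dropout - 0.6) - 0.1 * abs(layer_num - 2) - batch_size / 10000

toy_grid = {
    "dropout": [0.2, 0.5, 0.65, 0.8],
    "layer_num": [1, 2, 3],
    "batch_size": [128, 512],
}
toy_defaults = {"dropout": 0.5, "layer_num": 1, "batch_size": 128}

result = greedy_search(toy_score, toy_grid, toy_defaults)
print(result)
```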


In [153]:
def get_best_model(dropout = 0.5, layer_num = 1, init_mode='uniform',
                       regularizer = None, batch_size = 128):
    
    seed = 7
    np.random.seed(seed)

    def add_layers():
        for i in range (0, layer_num):
            if (regularizer):
                model.add(Dense(64, kernel_initializer=init_mode, activation='relu',
                               kernel_regularizer = regularizers.l2(regularizer)))
            else:
                model.add(Dense(64, kernel_initializer=init_mode, activation='relu'))
            model.add(layers.Dropout(dropout))
    
    EMBEDDING_DIM = 100
    
    # Build the network once: embedding -> flatten -> hidden layers -> sigmoid output.
    model = Sequential()
    model.add(Embedding(words_size, EMBEDDING_DIM, input_length=maxlen))
    model.add(Flatten())
    add_layers()
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
    history = model.fit(train_data, y_train,
                        epochs=10,
                        batch_size=batch_size,
                        callbacks=callbacks_list,
                        validation_data=(val_data, y_val),
                        verbose = 2)
    model.load_weights('my_model.h5')
    
    return model

best_model = get_best_model(dropout = best_dropout, layer_num = best_layer_num, 
                            init_mode = best_init_mode, batch_size = best_batch_size, 
                            regularizer = best_regularizer)

Epoch 1/10
66/66 - 59s - loss: 1.2264 - acc: 0.5003 - val_loss: 0.7014 - val_acc: 0.5055 - lr: 0.0010
Epoch 2/10
66/66 - 45s - loss: 0.7280 - acc: 0.5050 - val_loss: 0.6802 - val_acc: 0.6028 - lr: 0.0010
Epoch 3/10
66/66 - 45s - loss: 0.4785 - acc: 0.8256 - val_loss: 0.3541 - val_acc: 0.8989 - lr: 0.0010
Epoch 4/10
66/66 - 45s - loss: 0.3214 - acc: 0.9195 - val_loss: 0.3406 - val_acc: 0.8988 - lr: 0.0010
Epoch 5/10
66/66 - 45s - loss: 0.2830 - acc: 0.9381 - val_loss: 0.3245 - val_acc: 0.9020 - lr: 0.0010
Epoch 6/10
66/66 - 45s - loss: 0.2384 - acc: 0.9545 - val_loss: 0.3320 - val_acc: 0.9046 - lr: 0.0010
Epoch 7/10
66/66 - 45s - loss: 0.2001 - acc: 0.9670 - val_loss: 0.3446 - val_acc: 0.9031 - lr: 0.0010
Epoch 8/10
66/66 - 45s - loss: 0.2043 - acc: 0.9750 - val_loss: 0.3628 - val_acc: 0.9003 - lr: 0.0010
Epoch 9/10
66/66 - 46s - loss: 0.1195 - acc: 0.9906 - val_loss: 0.3385 - val_acc: 0.8982 - lr: 0.0010
Epoch 10/10
66/66 - 46s - loss: 0.1770 - acc: 0.9799 - val_loss: 0.3489 - val_acc:

NameError: name 'text_preprocessing' is not defined

In [155]:
test_samples = ["I liked this film",
               "the film was terrible...",
               "I enjoyed watching it!",
               "It was the worst experience I've ever had!",
               "I want to see it again!",
               "I really enjoyed all of it",
               "It was amazing!",
               "it was good"]

test_samples = [text_preprocessing(i) for i in test_samples]

print(test_samples)

test_sequences = tokenizer.texts_to_sequences(test_samples)
test_data = pad_sequences(test_sequences, maxlen = maxlen)

# predict returns an array of shape (n_samples, 1); flatten it to plain scores.
for score in best_model.predict(x = test_data).ravel():
    print("{:.5f}".format(float(score)))

['liked film', 'film terrible', 'enjoyed watching', 'worst experience ever', 'want see', 'really enjoyed', 'amazing', 'good']
0.79799
0.18767
0.71185
0.17779
0.67965
0.76598
0.82159
0.55706
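
The scores are sigmoid outputs, and since the labels were encoded as 0 for `neg` and 1 for `pos`, values near 1 indicate a positive review. A minimal sketch of turning them into labels, assuming the conventional cut-off of 0.5 (the threshold and the helper `to_label` are assumptions, not part of the notebook):

```python
samples = ['liked film', 'film terrible', 'enjoyed watching', 'worst experience ever',
           'want see', 'really enjoyed', 'amazing', 'good']
# Scores copied from the output above.
scores = [0.79799, 0.18767, 0.71185, 0.17779, 0.67965, 0.76598, 0.82159, 0.55706]

def to_label(score, threshold=0.5):
    """Map a sigmoid score to 'pos' or 'neg' using a fixed threshold."""
    return "pos" if score >= threshold else "neg"

labels = [to_label(s) for s in scores]

for text, score, label in zip(samples, scores, labels):
    print(f"{label}  {score:.5f}  {text}")
```

Note that "good" scores only 0.55706, so it sits close to the cut-off.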


In [1]:
'''
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(train_data, y_train,
                    epochs=10,
                    batch_size=128,
                    callbacks=callbacks_list,
                    validation_data=(val_data, y_val),
                    verbose = 2)
'''

In [3]:
'''
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()
'''

In [4]:
'''model.load_weights('my_model.h5')

test_samples = ["I liked this film",
               "the film was terrible...",
               "I enjoyed watching it!",
               "It was the worst experience I've ever had!",
               "I want to see it again!",
               "I really enjoyed all of it",
               "It was amazing!",
               "it was good"]

test_samples = [text_preprocessing(i) for i in test_samples]

print(test_samples)

test_sequences = tokenizer.texts_to_sequences(test_samples)
test_data = pad_sequences(test_sequences, maxlen = maxlen)

for i in model.predict(x = test_data):
    print("{:.5f}".format(float(i)))
'''

## K-Fold Cross Validation

**K-fold cross-validation turned out to be too expensive in terms of computation time, so the cells below are left disabled (wrapped in string literals).**
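
A rough count shows why. The sketch below enumerates the hyperparameter grid used earlier in the notebook and multiplies it by 3 folds. The time per fit is an assumption (10 epochs at about 60 s each, scaled by 2/3 because each fold trains on two thirds of the data), loosely based on the epoch times in the training logs, so the hours are an order-of-magnitude estimate only.

```python
from itertools import product

hyperparameters = dict(
    dropout=[0.2, 0.5, 0.65, 0.8],
    layer_num=[1, 2, 3],
    regularizer=[0.1, 0.01, 0.001],
    batch_size=[128, 512],
    init_mode=['uniform', 'lecun_uniform', 'normal', 'glorot_normal', 'glorot_uniform'],
)

n_folds = 3
# Every combination of values in the full grid.
n_configs = sum(1 for _ in product(*hyperparameters.values()))
n_fits_full_grid = n_configs * n_folds
# Tuning one hyperparameter at a time needs only the sum of the list lengths.
n_fits_one_at_a_time = sum(len(v) for v in hyperparameters.values()) * n_folds

seconds_per_fit = 10 * 60 * 2 / 3  # assumed: 10 epochs x ~60 s x 2/3 of the data

print(f"full grid      : {n_configs} configs, {n_fits_full_grid} fits, "
      f"~{n_fits_full_grid * seconds_per_fit / 3600:.0f} h")
print(f"one at a time  : {n_fits_one_at_a_time} fits, "
      f"~{n_fits_one_at_a_time * seconds_per_fit / 3600:.1f} h")
```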

In [45]:
'''
import numpy
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
from keras import regularizers
from keras import layers
'''

In [46]:
'''

%time
import keras, numpy as np

callbacks_list = [
    keras.callbacks.EarlyStopping(
        monitor='acc',
        patience=5
    ),
    keras.callbacks.ModelCheckpoint(
        filepath='my_model.h5',
        monitor='val_loss',
        save_best_only=True
    ),
        
    keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.1,
        patience=5,
    )
]

batch_size = 128
epochs = 3

model_CV = KerasClassifier(build_fn=create_model, epochs=epochs, 
                           batch_size=batch_size, shuffle = True, verbose=1) #callbacks = callbacks_list)
# define the grid search parameters
init_mode = ['uniform', 'lecun_uniform', 'normal', 
             'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']

param_grid = dict(init_mode=init_mode)
grid = GridSearchCV(estimator=model_CV, param_grid=param_grid, cv=3)
grid_result = grid.fit(train_data, y_train)




'''

In [11]:
'''
import numpy as np
seed = 7
np.random.seed(seed)
'''

In [12]:
'''
# let's create a function that creates the model (required for KerasClassifier) 
# while accepting the hyperparameters we want to tune 
# we also pass some default values such as optimizer='rmsprop'
def create_model(init_mode='uniform', dropout = 0.5, layer_num = 1, regularizer = 0.01):

    def add_layers():
        for i in range (0, layer_num):
            model.add(Dense(64, kernel_initializer=init_mode, activation='relu',
                           kernel_regularizer = regularizers.l2(regularizer)))
            model.add(layers.Dropout(dropout))
    
    EMBEDDING_DIM = 100
    
    model = Sequential()
    model.add(Embedding(words_size, EMBEDDING_DIM, input_length=maxlen))
    model.add(Flatten())
    add_layers()
    model.add(Dense(1, kernel_initializer=init_mode, activation='sigmoid', 
                    kernel_regularizer = regularizers.l2(regularizer)))
    
    # compile model
    model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
    return model
'''

In [10]:
'''
hyperparameters = dict(dropout = [0.2, 0.5, 0.65, 0.8],
                       layer_num = [1,2,3],
                       regularizer = [0.1, 0.01, 0.001],
                       batch_size =[128,512],
                       init_mode = ['uniform', 'lecun_uniform', 'normal', 
                                    'glorot_normal', 'glorot_uniform']
                      )

grid_results = {}
'''

In [8]:
'''
def get_fold_3_validate_result (dropout = [0.5], layer_num = [1], regularizer = [0.01], 
                                batch_size = [128], init_mode = ['uniform']):
    epochs = 10

    model_CV = KerasClassifier(build_fn=create_model, epochs=epochs,
                               shuffle = True, verbose=2)
    # Define the grid search parameters
    param_grid = dict(dropout = dropout,
                     layer_num = layer_num,
                     batch_size = batch_size,
                     init_mode = init_mode,
                     regularizer = regularizer)

    grid = GridSearchCV(estimator=model_CV, param_grid=param_grid, cv=3)
    return grid.fit(train_data, y_train)
'''

In [7]:
'''
print('Tuning dropout')
grid_results['dropout'] = get_fold_3_validate_result (dropout = hyperparameters['dropout'])
'''

In [6]:
'''
# print results
grid_dropout = grid_results['dropout']
'''

In [5]:
'''
print(f'Best accuracy {grid_dropout.best_score_} using dropout: {grid_dropout.best_params_["dropout"]}')
# cv_results_ exposes 'mean_test_score' and 'std_test_score'; there is no 'loss_test_score' key.
means = grid_dropout.cv_results_['mean_test_score']
stds = grid_dropout.cv_results_['std_test_score']
params = grid_dropout.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f' mean={mean:.4}, std={stdev:.4} using {param}')
'''