
# Shabarinath_Chandran_Week 4 Lab: Neural Networks in practice

In [0]:
from IPython.display import set_matplotlib_formats, display
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from cycler import cycler
import keras
print("Using Keras",keras.__version__)
%matplotlib inline
plt.rcParams['figure.dpi'] = 125 # Use 300 for PDF, 100 for slides


Using TensorFlow backend.


Using Keras 2.3.1


### Overview
* Solving basic classification and regression problems
* Handling textual data
* Model selection (and overfitting)

## Solving basic problems
* Binary classification (of movie reviews)
* Multiclass classification (of news topics)
* Regression (of house prices)

Examples from _Deep Learning with Python_, by _François Chollet_

### Binary classification
* Dataset: 50,000 IMDB reviews, labeled positive (1) or negative (0)
    - Included in Keras, with a 50/50 train-test split
* Each row is one review, with only the 10,000 most frequent words retained
* Each word is replaced by a _word index_ (word ID)

In [0]:
from keras.datasets import imdb
# Download IMDB data with 10000 most frequent words
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
print("Encoded review: ", train_data[0][0:10])

word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
print("Original review: ", ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]][0:10]))

Encoded review:  [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]
Original review:  ? this film was just brilliant casting location scenery story


#### Preprocessing
* We can't feed lists of categorical values to a neural net; we need to create tensors
* One-hot-encoding:
    -  10000 features, '1.0' if the word occurs
* Word embeddings (e.g. word2vec):
    - Map each word to a dense vector that represents it (its _embedding_)
    - _Embedding_ layer: looks up each word's embedding in a table that is either learned together with the model or pre-trained
    - Converts 2D tensor of word indices (zero-padded) to 3D tensor of embeddings
* Let's do One-Hot-Encoding for now. We'll come back to _Embedding_ layers.
* Also vectorize the labels: from 0/1 to float
    - Binary classification works with one output node
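
The shape change mentioned above for embeddings (a 2D tensor of zero-padded word indices becoming a 3D tensor of embeddings) can be illustrated with plain NumPy. This is only a sketch under assumptions: `pad_sequences_simple` is a hypothetical helper written for this example, the embedding matrix is random rather than learned, and the toy vocabulary size and dimensions are made up.

```python
import numpy as np

def pad_sequences_simple(sequences, maxlen):
    """Left-pad with zeros (and keep only the last `maxlen` IDs) so all rows have equal length."""
    padded = np.zeros((len(sequences), maxlen), dtype="int64")
    for i, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]
        if trimmed:
            padded[i, -len(trimmed):] = trimmed
    return padded

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4               # toy sizes, not the real 10000-word vocabulary
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

reviews = [[1, 5, 2], [7, 3, 3, 9, 4]]          # two encoded "reviews" of different length
indices = pad_sequences_simple(reviews, maxlen=4)
embedded = embedding_matrix[indices]            # row lookup: one vector per word ID

print(indices)                                  # 2D: (reviews, words)
print(embedded.shape)                           # 3D: (reviews, words, embedding_dim)
```

The lookup itself is just integer indexing into the matrix; what an embedding layer adds is that the matrix entries are trainable weights.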

In [0]:
# Custom implementation of one-hot-encoding
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

IndexError: arrays used as indices must be of integer (or boolean) type

#### Understanding the format of IMDB dataset
1. `train_data` and `test_data` are arrays of lists. What does the length of each array correspond to? What does the length of each list correspond to?
2. What are the shapes of the vectorized `x_train` and `x_test`? What do the dimensions correspond to?
3. What is the most common word in the first review in the training data? Hint: use the word index (see above).
4. Print the first review to verify. 
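
One possible way to approach these questions, sketched on a tiny made-up dataset so that it runs without downloading anything. The names `toy_word_index`, `toy_data`, `decode` and `most_common_word` are hypothetical; the only thing taken from the notebook is the offset of 3 used when decoding word IDs. Skipping IDs 0 to 3 as reserved markers is an assumption based on that offset.

```python
from collections import Counter

# Hypothetical miniature vocabulary standing in for imdb.get_word_index()
toy_word_index = {"the": 1, "film": 2, "was": 3, "great": 4}
toy_reverse_index = {idx: word for word, idx in toy_word_index.items()}

# Toy "dataset": a list of reviews, each review a list of encoded word IDs
toy_data = [[1, 4, 5, 6, 4, 7, 4], [1, 4, 5, 7]]

def decode(review):
    """Map encoded IDs back to words, using the same i - 3 offset as the cell above."""
    return " ".join(toy_reverse_index.get(i - 3, "?") for i in review)

def most_common_word(review):
    """Return (word, count) for the most frequent real word in one review."""
    counts = Counter(i for i in review if i > 3)   # skip the reserved low IDs
    word_id, count = counts.most_common(1)[0]
    return toy_reverse_index.get(word_id - 3, "?"), count

print(len(toy_data))                    # number of reviews
print([len(r) for r in toy_data])       # number of words in each review
print(most_common_word(toy_data[0]))
print(decode(toy_data[0]))
```

On the real data, the same `len` calls should report 25,000 reviews per split (the 50/50 split of 50,000 reviews), and the vectorized arrays should have shape (25000, 10000): one row per review, one column per retained word.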

#### Building the network
* We can solve this problem using a network of _Dense_ layers and the _ReLU_ activation function.
* How many layers? How many hidden units per layer?
    - Start with 2 layers of 16 hidden units each
    - We'll optimize this soon
* Output layer: single unit with _sigmoid_ activation function
    - Close to 1: positive review, close to 0: negative review
* Use binary_crossentropy loss
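
To make the architecture above concrete, here is a minimal NumPy sketch of what the forward pass and the loss compute. It is not how Keras implements them: the weights are random and untrained, the input has 20 features instead of 10000, and all function names are chosen for this example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with clipping for numerical safety."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))))

rng = np.random.default_rng(0)
n_features = 20                                   # toy stand-in for the 10000 one-hot features
x = rng.integers(0, 2, size=(5, n_features)).astype("float64")
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])           # made-up labels

# Randomly initialised (untrained) weights for a 16-16-1 network
W1, b1 = rng.normal(scale=0.1, size=(n_features, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 16)), np.zeros(16)
W3, b3 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)

hidden1 = relu(x @ W1 + b1)
hidden2 = relu(hidden1 @ W2 + b2)
probs = sigmoid(hidden2 @ W3 + b3)[:, 0]          # one probability per review

loss = binary_crossentropy(y, probs)
print(probs.shape, loss)
```

A prediction of exactly 0.5 for every sample gives a loss of log(2), roughly 0.693, which is a useful reference point when reading the training logs.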

In [0]:
from keras import models
from keras import layers 

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

#### Model selection
* How many epochs do we need for training?
* Take a validation set of 10,000 samples from the training set
* Train the neural net and track the loss on the validation set after every epoch
    - This is returned as a `History` object by the `fit()` function 
* We start with 20 epochs in minibatches of 512 samples


In [0]:
x_val, partial_x_train = x_train[:10000], x_train[10000:]
y_val, partial_y_train = y_train[:10000], y_train[10000:] 
history = model.fit(partial_x_train, partial_y_train,
                    epochs=20, batch_size=512, verbose=2,
                    validation_data=(x_val, y_val))

Train on 15000 samples, validate on 10000 samples
Epoch 1/20
 - 4s - loss: 0.5063 - accuracy: 0.7904 - val_loss: 0.3947 - val_accuracy: 0.8481
Epoch 2/20
 - 2s - loss: 0.3019 - accuracy: 0.9013 - val_loss: 0.3047 - val_accuracy: 0.8854
Epoch 3/20
 - 1s - loss: 0.2216 - accuracy: 0.9277 - val_loss: 0.3035 - val_accuracy: 0.8746
Epoch 4/20
 - 1s - loss: 0.1790 - accuracy: 0.9425 - val_loss: 0.2858 - val_accuracy: 0.8826
Epoch 5/20
 - 2s - loss: 0.1460 - accuracy: 0.9531 - val_loss: 0.2853 - val_accuracy: 0.8853
Epoch 6/20
 - 1s - loss: 0.1184 - accuracy: 0.9642 - val_loss: 0.3024 - val_accuracy: 0.8831
Epoch 7/20
 - 1s - loss: 0.1019 - accuracy: 0.9682 - val_loss: 0.3100 - val_accuracy: 0.8822
Epoch 8/20
 - 1s - loss: 0.0804 - accuracy: 0.9783 - val_loss: 0.3435 - val_accuracy: 0.8789
Epoch 9/20
 - 1s - loss: 0.0686 - accuracy: 0.9796 - val_loss: 0.3715 - val_accuracy: 0.8759
Epoch 10/20
 - 1s - loss: 0.0553 - accuracy: 0.9857 - val_loss: 0.3742 - val_accuracy: 0.8754
Epoch 11/20
 - 1s -

#### Evaluate model performance during training
1. Plot the training and validation loss as a function of training epoch. Describe what happens during the training in terms of under or overfitting.
2. Plot the training and validation accuracy as a function of the training epoch.

Hint: these quantities are contained in the dict history.history.

#### Early stopping
One simple technique to avoid overfitting is to use the validation set to 'tune' the optimal number of epochs
* In this case, we could stop after 4 epochs
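
The idea can be sketched as a small patience-based rule applied to the validation losses logged above (only the first 10 epochs are visible in the output). `best_epoch` is a hypothetical helper, and the patience of 2 is an arbitrary choice for illustration.

```python
def best_epoch(val_losses, patience=2):
    """Return (1-based epoch, loss) of the best validation loss, scanning until
    the loss has failed to improve for `patience` consecutive epochs."""
    best_idx, best_loss, waited = 0, float("inf"), 0
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss, waited = idx, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx + 1, best_loss

# val_loss values copied from the training log above (epochs 1-10)
logged_val_loss = [0.3947, 0.3047, 0.3035, 0.2858, 0.2853,
                   0.3024, 0.3100, 0.3435, 0.3715, 0.3742]

print(best_epoch(logged_val_loss))  # (5, 0.2853)
```

On this run the minimum is at epoch 5, but epoch 4 is within 0.0005 of it, so either is a defensible stopping point. Keras also ships a `keras.callbacks.EarlyStopping` callback with `monitor` and `patience` arguments that applies a similar rule during `fit()`.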


In [0]:
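# Caution: `model` was already trained for 20 epochs above, so this call continues
# from those weights. Rebuild and recompile the model first to train from scratch.
model.fit(x_train, y_train, epochs=4, batch_size=512, verbose=2)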
result = model.evaluate(x_test, y_test)
print("Loss: {:.4f}, Accuracy:  {:.4f}".format(*result))

Epoch 1/4
 - 2s - loss: 0.2283 - accuracy: 0.9462
Epoch 2/4
 - 2s - loss: 0.1320 - accuracy: 0.9608
Epoch 3/4
 - 2s - loss: 0.0996 - accuracy: 0.9696
Epoch 4/4
 - 2s - loss: 0.0765 - accuracy: 0.9763
Loss: 0.4862, Accuracy:  0.8598


#### Predictions
1. Print the first review that was correctly classified, along with the predicted value.
2. Print the first review that was misclassified, along with the predicted value. Can you explain why the model likely failed? How confident was the model?
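
A possible starting point for these two questions, shown on made-up numbers. `first_correct_and_wrong` is a hypothetical helper; in the notebook, `toy_probs` would be replaced by the output of `model.predict(x_test)` and `toy_labels` by `y_test`. The 0.5 threshold is an assumption.

```python
import numpy as np

def first_correct_and_wrong(probs, labels, threshold=0.5):
    """Return the indices of the first correctly and first incorrectly classified samples."""
    probs = np.asarray(probs, dtype=float).ravel()      # predictions usually have shape (n, 1)
    labels = np.asarray(labels).astype(int).ravel()
    predicted = (probs > threshold).astype(int)
    correct = np.flatnonzero(predicted == labels)
    wrong = np.flatnonzero(predicted != labels)
    first_correct = int(correct[0]) if correct.size else None
    first_wrong = int(wrong[0]) if wrong.size else None
    return first_correct, first_wrong

toy_probs = [0.93, 0.08, 0.61, 0.47]   # made-up sigmoid outputs
toy_labels = [1, 0, 0, 1]              # made-up true labels

i_ok, i_bad = first_correct_and_wrong(toy_probs, toy_labels)
print("first correct:", i_ok, "predicted", toy_probs[i_ok])
print("first wrong:", i_bad, "predicted", toy_probs[i_bad])
```

The distance of the predicted value from 0.5 gives a rough sense of how confident the model was: 0.61 for a negative review is a weak mistake, while 0.99 would be a confident one.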

#### Takeaways
* Neural nets require a lot of preprocessing to create tensors
* Dense layers with ReLU activation can solve a wide range of problems
* Binary classification can be done with a Dense layer with a single unit, sigmoid activation, and binary cross-entropy loss
* Neural nets overfit easily
* Many design choices have an effect on accuracy and overfitting. One can try:
    - 1 or 3 hidden layers
    - more or fewer hidden units (e.g. 64)
    - MSE loss instead of binary cross-entropy
    - `tanh` activation instead of `ReLU`

### Regularization: build smaller networks
* The easiest way to avoid overfitting is to use a simpler model
* The number of learnable parameters is called the model _capacity_
* A model with more parameters has a higher _memorization capacity_
    - The entire training set can be `stored` in the weights
    - Learns the mapping from training examples to outputs
* Forcing the model to be small forces it to learn a compressed representation that generalizes better
    - Always a trade-off between too much and too little capacity
* Start with few layers and parameters, increase until you see diminishing returns

Let's try this on our movie review data, with 4 units per layer
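
As a quick worked example of capacity, the number of learnable parameters of a stack of Dense layers can be counted by hand: each layer has one weight per input-output pair plus one bias per output. `dense_param_count` is a helper written for this example.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

original = dense_param_count([10000, 16, 16, 1])   # 16 units per hidden layer
smaller = dense_param_count([10000, 4, 4, 1])      # 4 units per hidden layer

print(original, smaller)  # 160305 40029
```

Almost all parameters sit in the first layer, because of the 10000-dimensional input, so going from 16 to 4 units cuts the capacity by roughly a factor of four.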


In [0]:
from keras.datasets import imdb
import numpy as np

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

# Our vectorized training data
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)
# Our vectorized labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

In [0]:
from keras import models
from keras import layers 
import matplotlib.pyplot as plt

original_model = models.Sequential()
original_model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
original_model.add(layers.Dense(16, activation='relu'))
original_model.add(layers.Dense(1, activation='sigmoid'))

original_model.compile(optimizer='rmsprop',
                       loss='binary_crossentropy',
                       metrics=['acc'])

smaller_model = models.Sequential()
smaller_model.add(layers.Dense(4, activation='relu', input_shape=(10000,)))
smaller_model.add(layers.Dense(4, activation='relu'))
smaller_model.add(layers.Dense(1, activation='sigmoid'))

smaller_model.compile(optimizer='rmsprop',
                      loss='binary_crossentropy',
                      metrics=['acc'])
original_hist = original_model.fit(x_train, y_train,
                                   epochs=20,
                                   batch_size=512, verbose=2,
                                   validation_data=(x_test, y_test))
smaller_model_hist = smaller_model.fit(x_train, y_train,
                                       epochs=20,
                                       batch_size=512, verbose=2,
                                       validation_data=(x_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/20
 - 7s - loss: 0.4820 - acc: 0.8104 - val_loss: 0.3584 - val_acc: 0.8797
Epoch 2/20
 - 4s - loss: 0.2737 - acc: 0.9060 - val_loss: 0.2884 - val_acc: 0.8897
Epoch 3/20
 - 3s - loss: 0.2076 - acc: 0.9249 - val_loss: 0.2802 - val_acc: 0.8886
Epoch 4/20
 - 3s - loss: 0.1708 - acc: 0.9390 - val_loss: 0.2904 - val_acc: 0.8849
Epoch 5/20
 - 3s - loss: 0.1487 - acc: 0.9490 - val_loss: 0.3131 - val_acc: 0.8797
Epoch 6/20
 - 3s - loss: 0.1311 - acc: 0.9547 - val_loss: 0.3312 - val_acc: 0.8757
Epoch 7/20
 - 3s - loss: 0.1128 - acc: 0.9617 - val_loss: 0.3596 - val_acc: 0.8727
Epoch 8/20
 - 3s - loss: 0.1013 - acc: 0.9657 - val_loss: 0.3783 - val_acc: 0.8701
Epoch 9/20
 - 3s - loss: 0.0888 - acc: 0.9702 - val_loss: 0.4062 - val_acc: 0.8658
Epoch 10/20
 - 3s - loss: 0.0780 - acc: 0.9751 - val_loss: 0.4325 - val_acc: 0.8664
Epoch 11/20
 - 4s - loss: 0.0682 - acc: 0.9781 - val_loss: 0.4614 - val_acc: 0.8629
Epoch 12/20
 - 3s - loss: 0.0595 - 

1. Plot the validation loss for the original and smaller models. How does the smaller model behave compared to the original?

### Regularization: Weight regularization
* As we did many times before, we can also add weight regularization to our loss function
    - L1 regularization: leads to _sparse networks_ with many weights that are 0
    - L2 regularization: leads to many very small weights
        - Also called _weight decay_ in neural net literature
* In Keras, add `kernel_regularizer` to every layer
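
The two penalties are simple to compute by hand. The sketch below uses a made-up weight matrix and a coefficient of 0.001 to show the term that is added to the loss, together with its gradient, which hints at why L1 tends to produce zeros and L2 tends to produce small weights. The function names are chosen for this example.

```python
import numpy as np

def l1_penalty(weights, coeff):
    """coeff * sum of absolute weight values."""
    return coeff * float(np.sum(np.abs(weights)))

def l2_penalty(weights, coeff):
    """coeff * sum of squared weight values."""
    return coeff * float(np.sum(np.square(weights)))

W = np.array([[0.5, -2.0], [0.0, 1.0]])   # made-up weight matrix
coeff = 0.001

print(l1_penalty(W, coeff), l2_penalty(W, coeff))

# Gradients of the penalties with respect to each weight
l1_grad = coeff * np.sign(W)      # constant pull towards 0, regardless of weight size
l2_grad = 2 * coeff * W           # pull proportional to the weight, fades near 0
print(l1_grad)
print(l2_grad)
```

Because the L1 pull does not shrink as a weight approaches zero, small weights are pushed all the way to zero; the L2 pull weakens as the weight shrinks, so weights become small but rarely exactly zero.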

In [0]:
from keras import regularizers
from keras import models
from keras import layers 
import matplotlib.pyplot as plt

l2_model = models.Sequential()
l2_model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                          activation='relu', input_shape=(10000,)))
l2_model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                          activation='relu'))
l2_model.add(layers.Dense(1, activation='sigmoid'))

In [0]:
l2_model.compile(optimizer='rmsprop',
                 loss='binary_crossentropy',
                 metrics=['acc'])

In [0]:
l2_model_hist = l2_model.fit(x_train, y_train,
                             epochs=20,
                             batch_size=512, verbose=2,
                             validation_data=(x_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/20
 - 3s - loss: 0.5090 - acc: 0.8158 - val_loss: 0.3796 - val_acc: 0.8831
Epoch 2/20
 - 3s - loss: 0.3184 - acc: 0.9048 - val_loss: 0.3366 - val_acc: 0.8888
Epoch 3/20
 - 3s - loss: 0.2734 - acc: 0.9213 - val_loss: 0.3355 - val_acc: 0.8881
Epoch 4/20
 - 3s - loss: 0.2572 - acc: 0.9247 - val_loss: 0.3355 - val_acc: 0.8882
Epoch 5/20
 - 3s - loss: 0.2412 - acc: 0.9329 - val_loss: 0.3432 - val_acc: 0.8843
Epoch 6/20
 - 3s - loss: 0.2349 - acc: 0.9366 - val_loss: 0.3488 - val_acc: 0.8824
Epoch 7/20
 - 3s - loss: 0.2317 - acc: 0.9359 - val_loss: 0.3644 - val_acc: 0.8761
Epoch 8/20
 - 3s - loss: 0.2228 - acc: 0.9410 - val_loss: 0.3634 - val_acc: 0.8789
Epoch 9/20
 - 3s - loss: 0.2213 - acc: 0.9395 - val_loss: 0.3686 - val_acc: 0.8778
Epoch 10/20
 - 4s - loss: 0.2164 - acc: 0.9414 - val_loss: 0.3860 - val_acc: 0.8743
Epoch 11/20
 - 3s - loss: 0.2174 - acc: 0.9413 - val_loss: 0.3916 - val_acc: 0.8708
Epoch 12/20
 - 3s - loss: 0.2085 - 

1. Plot the validation loss for the original and l2 regularized models. How does the regularized model behave compared to the original?

### Regularization: dropout
* One of the most effective and commonly used regularization techniques
* Randomly set a number of outputs of the layer to 0
* Idea: break up accidental non-significant learned patterns 
* _Dropout rate_: fraction of the outputs that are zeroed-out
    - Usually between 0.2 and 0.5
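* At test time, nothing is dropped out
    - In the original formulation, the outputs are then scaled down by the keep probability (1 - dropout rate), which balances out that more units are active than during training
    - Keras instead scales the kept outputs up by 1 / (1 - rate) during training, so nothing needs to change at test time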
* In Keras: add `Dropout` layers between the normal layers
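
A minimal NumPy sketch of the mechanism, assuming the "inverted" variant in which the kept outputs are scaled up during training so that the expected activation stays the same and inference needs no rescaling. `inverted_dropout` is a name chosen for this example, not a Keras function.

```python
import numpy as np

def inverted_dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of the activations and rescale the rest by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return activations                       # inference: pass through unchanged
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)
layer_output = np.ones((1000, 16))               # made-up layer output, all ones

dropped = inverted_dropout(layer_output, rate=0.5, rng=rng)
print(np.unique(dropped))                        # each value is either zeroed or doubled
print(dropped.mean())                            # stays close to the original mean of 1.0
```

With a rate of 0.5, about half of the outputs are zeroed and the rest are doubled, so the average signal passed to the next layer is roughly preserved.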

In [0]:
from keras import models
from keras import layers 
import matplotlib.pyplot as plt

dpt_model = models.Sequential()
dpt_model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
dpt_model.add(layers.Dropout(0.5))
dpt_model.add(layers.Dense(16, activation='relu'))
dpt_model.add(layers.Dropout(0.5))
dpt_model.add(layers.Dense(1, activation='sigmoid'))

dpt_model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['acc'])

In [0]:
dpt_model_hist = dpt_model.fit(x_train, y_train,
                               epochs=20,
                               batch_size=512, verbose=2,
                               validation_data=(x_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/20
 - 4s - loss: 0.5770 - acc: 0.6942 - val_loss: 0.4178 - val_acc: 0.8679
Epoch 2/20
 - 3s - loss: 0.4158 - acc: 0.8284 - val_loss: 0.3183 - val_acc: 0.8839
Epoch 3/20
 - 3s - loss: 0.3375 - acc: 0.8702 - val_loss: 0.2850 - val_acc: 0.8885
Epoch 4/20
 - 3s - loss: 0.2893 - acc: 0.8941 - val_loss: 0.2743 - val_acc: 0.8910
Epoch 5/20
 - 3s - loss: 0.2511 - acc: 0.9092 - val_loss: 0.2781 - val_acc: 0.8894
Epoch 6/20
 - 3s - loss: 0.2257 - acc: 0.9194 - val_loss: 0.2885 - val_acc: 0.8883
Epoch 7/20
 - 3s - loss: 0.2045 - acc: 0.9283 - val_loss: 0.3289 - val_acc: 0.8801
Epoch 8/20
 - 3s - loss: 0.1872 - acc: 0.9328 - val_loss: 0.3280 - val_acc: 0.8854
Epoch 9/20
 - 3s - loss: 0.1741 - acc: 0.9372 - val_loss: 0.3430 - val_acc: 0.8834
Epoch 10/20
 - 3s - loss: 0.1626 - acc: 0.9429 - val_loss: 0.3601 - val_acc: 0.8829
Epoch 11/20
 - 3s - loss: 0.1521 - acc: 0.9458 - val_loss: 0.3986 - val_acc: 0.8802
Epoch 12/20
 - 3s - loss: 0.1452 - 

1. Plot the validation loss for the original and dropout models. How does the dropout model behave compared to the original?
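
Before plotting, the three comparisons can be summarized numerically from the logs printed above. The values below are the `val_loss` entries of the first 11 epochs of each run, copied from the output; the smaller model is left out because its log is not visible. `summarize` is a helper written for this example.

```python
# val_loss for epochs 1-11, copied from the training logs above
logged_val_loss = {
    "original": [0.3584, 0.2884, 0.2802, 0.2904, 0.3131, 0.3312,
                 0.3596, 0.3783, 0.4062, 0.4325, 0.4614],
    "l2":       [0.3796, 0.3366, 0.3355, 0.3355, 0.3432, 0.3488,
                 0.3644, 0.3634, 0.3686, 0.3860, 0.3916],
    "dropout":  [0.4178, 0.3183, 0.2850, 0.2743, 0.2781, 0.2885,
                 0.3289, 0.3280, 0.3430, 0.3601, 0.3986],
}

def summarize(val_losses):
    """Best epoch (1-based), best loss, and how far the loss has risen by the last logged epoch."""
    best = min(val_losses)
    return {"best_epoch": val_losses.index(best) + 1,
            "best_val_loss": best,
            "rise_after_best": round(val_losses[-1] - best, 4)}

for name, losses in logged_val_loss.items():
    print(name, summarize(losses))
```

In these logs, both regularized models drift away from their best validation loss more slowly than the original model, which is the behaviour the plots should show as well. The absolute values of the L2 run should be read with care, since the reported loss is expected to include the weight penalty.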

### Regularization recap
* Get more training data
* Reduce the capacity of the network
* Add weight regularization
* Add dropout
* Either start with a simple model and add capacity
* Or, start with a complex model and then regularize by adding weight regularization and dropout

### Regression
* Dataset: 506 examples of Boston suburbs and their median house prices
    - Included in Keras, with a default 80/20 train-test split (404 training and 102 test examples)
* Each row describes one suburb by numeric properties of its houses and neighborhood; the target is the median house price
* Small dataset, non-normalized features

In [0]:
from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) =  boston_housing.load_data()

Downloading data from https://s3.amazonaws.com/keras-datasets/boston_housing.npz


#### Preprocessing
* Neural nets work a lot better if we normalize the features first. 
* Keras has no built-in support so we have to do this manually (or with scikit-learn)
    - Again, be careful not to look at the test data during normalization
    


In [0]:
mean, std = train_data.mean(axis=0), train_data.std(axis=0)
train_data -= mean
train_data /= std

test_data -= mean
test_data /= std



#### Building the network
* This is a small dataset, so easy to overfit
    * We use 2 hidden layers of 64 units each
* Use smaller batches, more epochs
* Since we want scalar output, the output layer is one unit without activation
* Loss function is Mean Squared Error (penalizes large errors more heavily)
* Evaluation metric is Mean Absolute Error (more interpretable)
* We will also use cross-validation, so we wrap the model building in a function, so that we can call it multiple times
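
The difference between the two error measures is easiest to see on a small made-up example: two sets of predictions with the same Mean Absolute Error, where only one contains a large miss. The numbers and function names are invented for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([20.0, 22.0, 25.0, 30.0])                # made-up prices
spread_out = y_true + np.array([1.0, -1.0, 1.0, -1.0])      # four errors of size 1
one_big = y_true + np.array([0.0, 0.0, 0.0, 4.0])           # a single error of size 4

print(mae(y_true, spread_out), mse(y_true, spread_out))     # 1.0 1.0
print(mae(y_true, one_big), mse(y_true, one_big))           # 1.0 4.0
```

MAE treats both cases the same and stays in the units of the target, which is why it is easier to interpret; MSE is four times larger for the single big miss, which is why it is the stricter training signal.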

1. Create a function build_model that returns the neural network model described above

In [0]:
def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu',
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model

#### Cross-validation
* Keras does not have support for cross-validation
* We can implement cross-validation ourselves (see provided code below)
* Alternatively, we can wrap a Keras model as a scikit-learn estimator
* Generally speaking, cross-validation is tricky with neural nets
    * Some folds may not converge, or results may fluctuate with the random initialization
    

In [0]:
# implementation of cross-validation
import numpy as np

k = 4
num_val_samples = len(train_data) // k
num_epochs = 20
all_scores = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition # i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]

    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    # Build the Keras model (already compiled)
    model = build_model()
    # Train the model (in silent mode, verbose=0)
    model.fit(partial_train_data, partial_train_targets,
              epochs=num_epochs, batch_size=1, verbose=0)
    # Evaluate the model on the validation data
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=2)
    all_scores.append(val_mae)

processing fold # 0
processing fold # 1
processing fold # 2
processing fold # 3


1. Train for longer (200 epochs) and keep track of loss after every epoch. Plot and describe the loss as a function of epoch number.

In [0]:
from keras import backend as K
K.clear_session() # Memory clean-up

num_epochs = 200
all_mae_histories = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition # i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]

    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    # Build the Keras model (already compiled)
    model = build_model()
    # Train the model (verbose=2 prints one line per epoch)
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=1, verbose=2)
    # Keep the per-epoch validation MAE; 'val_loss' would hold the validation MSE instead
    mae_history = history.history['val_mae']
    all_mae_histories.append(mae_history)

processing fold # 0
Train on 303 samples, validate on 101 samples
Epoch 1/200
 - 1s - loss: 195.1275 - mae: 10.4902 - val_loss: 40.3629 - val_mae: 4.2612
Epoch 2/200
 - 0s - loss: 36.0341 - mae: 4.1567 - val_loss: 29.7992 - val_mae: 3.4703
Epoch 3/200
 - 0s - loss: 25.0092 - mae: 3.4746 - val_loss: 22.0133 - val_mae: 2.8533
Epoch 4/200
 - 0s - loss: 21.2176 - mae: 3.0766 - val_loss: 17.2670 - val_mae: 2.7052
Epoch 5/200
 - 1s - loss: 17.7324 - mae: 2.8494 - val_loss: 15.6532 - val_mae: 2.4966
Epoch 6/200
 - 1s - loss: 15.8410 - mae: 2.6892 - val_loss: 13.8507 - val_mae: 2.4598
Epoch 7/200
 - 1s - loss: 14.8836 - mae: 2.5972 - val_loss: 11.8518 - val_mae: 2.2289
Epoch 8/200
 - 0s - loss: 13.6608 - mae: 2.4127 - val_loss: 13.1764 - val_mae: 2.3872
Epoch 9/200
 - 0s - loss: 13.5292 - mae: 2.4517 - val_loss: 10.8733 - val_mae: 2.0731
Epoch 10/200
 - 0s - loss: 12.6883 - mae: 2.3830 - val_loss: 11.3246 - val_mae: 2.2701
Epoch 11/200
 - 0s - loss: 12.4358 - mae: 2.3276 - val_loss: 11.0299 - 

Epoch 97/200
 - 0s - loss: 3.8981 - mae: 1.3467 - val_loss: 8.9346 - val_mae: 2.1062
Epoch 98/200
 - 0s - loss: 3.7373 - mae: 1.3036 - val_loss: 9.1716 - val_mae: 2.1701
Epoch 99/200
 - 0s - loss: 3.8749 - mae: 1.3797 - val_loss: 9.3561 - val_mae: 2.2625
Epoch 100/200
 - 0s - loss: 3.7635 - mae: 1.3344 - val_loss: 10.4435 - val_mae: 2.4523
Epoch 101/200
 - 0s - loss: 3.3873 - mae: 1.3024 - val_loss: 9.8928 - val_mae: 2.3630
Epoch 102/200
 - 0s - loss: 3.7631 - mae: 1.3708 - val_loss: 8.8859 - val_mae: 2.1247
Epoch 103/200
 - 0s - loss: 3.6188 - mae: 1.3526 - val_loss: 8.4137 - val_mae: 2.0886
Epoch 104/200
 - 0s - loss: 3.4709 - mae: 1.2854 - val_loss: 10.7329 - val_mae: 2.4514
Epoch 105/200
 - 0s - loss: 3.4639 - mae: 1.2620 - val_loss: 9.5321 - val_mae: 2.2673
Epoch 106/200
 - 0s - loss: 3.7397 - mae: 1.3374 - val_loss: 10.5206 - val_mae: 2.4046
Epoch 107/200
 - 0s - loss: 3.8444 - mae: 1.3494 - val_loss: 9.2701 - val_mae: 2.3056
Epoch 108/200
 - 0s - loss: 3.3535 - mae: 1.2608 - val

Epoch 192/200
 - 0s - loss: 2.1147 - mae: 1.0407 - val_loss: 12.4089 - val_mae: 2.7526
Epoch 193/200
 - 0s - loss: 2.1224 - mae: 1.0922 - val_loss: 10.7320 - val_mae: 2.4594
Epoch 194/200
 - 0s - loss: 1.9785 - mae: 1.0435 - val_loss: 11.9461 - val_mae: 2.4583
Epoch 195/200
 - 0s - loss: 2.0980 - mae: 1.0479 - val_loss: 9.6251 - val_mae: 2.3607
Epoch 196/200
 - 0s - loss: 2.0987 - mae: 1.0091 - val_loss: 10.9403 - val_mae: 2.4208
Epoch 197/200
 - 0s - loss: 2.1451 - mae: 1.0362 - val_loss: 9.2291 - val_mae: 2.3100
Epoch 198/200
 - 0s - loss: 1.9505 - mae: 1.0439 - val_loss: 9.4407 - val_mae: 2.3669
Epoch 199/200
 - 0s - loss: 2.2647 - mae: 1.0873 - val_loss: 10.1454 - val_mae: 2.4622
Epoch 200/200
 - 0s - loss: 2.2629 - mae: 1.1113 - val_loss: 11.3133 - val_mae: 2.4347
processing fold # 1
Train on 303 samples, validate on 101 samples
Epoch 1/200
 - 1s - loss: 173.2023 - mae: 9.7205 - val_loss: 30.1057 - val_mae: 4.0849
Epoch 2/200
 - 0s - loss: 28.4696 - mae: 3.5091 - val_loss: 19.1309

Epoch 87/200
 - 0s - loss: 4.2696 - mae: 1.3862 - val_loss: 10.0856 - val_mae: 2.3997
Epoch 88/200
 - 0s - loss: 4.3331 - mae: 1.4142 - val_loss: 9.7689 - val_mae: 2.3043
Epoch 89/200
 - 0s - loss: 4.3566 - mae: 1.3584 - val_loss: 12.0766 - val_mae: 2.6253
Epoch 90/200
 - 1s - loss: 4.0308 - mae: 1.3739 - val_loss: 9.2038 - val_mae: 2.1931
Epoch 91/200
 - 0s - loss: 4.2618 - mae: 1.4047 - val_loss: 9.6737 - val_mae: 2.2501
Epoch 92/200
 - 0s - loss: 4.4592 - mae: 1.4029 - val_loss: 10.8047 - val_mae: 2.4261
Epoch 93/200
 - 0s - loss: 4.0299 - mae: 1.3944 - val_loss: 10.6867 - val_mae: 2.4356
Epoch 94/200
 - 0s - loss: 4.0606 - mae: 1.3437 - val_loss: 8.4590 - val_mae: 2.1083
Epoch 95/200
 - 0s - loss: 3.9262 - mae: 1.3097 - val_loss: 9.9661 - val_mae: 2.3909
Epoch 96/200
 - 0s - loss: 3.8002 - mae: 1.3288 - val_loss: 10.3735 - val_mae: 2.5081
Epoch 97/200
 - 0s - loss: 4.0066 - mae: 1.3394 - val_loss: 8.5464 - val_mae: 2.1800
Epoch 98/200
 - 0s - loss: 3.8554 - mae: 1.3612 - val_loss: 

Epoch 182/200
 - 0s - loss: 1.8409 - mae: 0.9783 - val_loss: 14.8694 - val_mae: 2.7550
Epoch 183/200
 - 0s - loss: 1.7424 - mae: 0.9622 - val_loss: 13.8917 - val_mae: 2.6688
Epoch 184/200
 - 0s - loss: 1.6913 - mae: 0.9639 - val_loss: 14.3497 - val_mae: 2.7934
Epoch 185/200
 - 0s - loss: 2.0208 - mae: 0.9869 - val_loss: 16.3533 - val_mae: 2.8183
Epoch 186/200
 - 0s - loss: 2.0109 - mae: 0.9929 - val_loss: 12.2673 - val_mae: 2.5545
Epoch 187/200
 - 0s - loss: 1.8117 - mae: 0.9521 - val_loss: 11.7793 - val_mae: 2.4148
Epoch 188/200
 - 0s - loss: 2.0016 - mae: 1.0147 - val_loss: 15.3515 - val_mae: 2.8096
Epoch 189/200
 - 0s - loss: 1.7440 - mae: 0.9421 - val_loss: 14.7196 - val_mae: 2.7361
Epoch 190/200
 - 0s - loss: 1.7491 - mae: 0.9676 - val_loss: 23.3309 - val_mae: 3.6202
Epoch 191/200
 - 0s - loss: 1.7458 - mae: 0.9844 - val_loss: 18.7251 - val_mae: 3.0125
Epoch 192/200
 - 0s - loss: 1.8468 - mae: 1.0137 - val_loss: 11.8182 - val_mae: 2.4242
Epoch 193/200
 - 0s - loss: 1.7428 - mae: 0

Epoch 77/200
 - 0s - loss: 3.6541 - mae: 1.3392 - val_loss: 14.3023 - val_mae: 2.5095
Epoch 78/200
 - 0s - loss: 3.5021 - mae: 1.2983 - val_loss: 18.5256 - val_mae: 2.8477
Epoch 79/200
 - 0s - loss: 3.5745 - mae: 1.3208 - val_loss: 14.9976 - val_mae: 2.6278
Epoch 80/200
 - 0s - loss: 3.5495 - mae: 1.3019 - val_loss: 15.7043 - val_mae: 2.6720
Epoch 81/200
 - 0s - loss: 3.3402 - mae: 1.3083 - val_loss: 15.2978 - val_mae: 2.6115
Epoch 82/200
 - 0s - loss: 3.5536 - mae: 1.3146 - val_loss: 15.4290 - val_mae: 2.6998
Epoch 83/200
 - 0s - loss: 3.4178 - mae: 1.3144 - val_loss: 13.3200 - val_mae: 2.4039
Epoch 84/200
 - 0s - loss: 3.4778 - mae: 1.3082 - val_loss: 17.5659 - val_mae: 2.9307
Epoch 85/200
 - 0s - loss: 3.6844 - mae: 1.3238 - val_loss: 15.0816 - val_mae: 2.5784
Epoch 86/200
 - 0s - loss: 3.3429 - mae: 1.2170 - val_loss: 16.3775 - val_mae: 2.6890
Epoch 87/200
 - 0s - loss: 3.1386 - mae: 1.2220 - val_loss: 16.6093 - val_mae: 2.8086
Epoch 88/200
 - 1s - loss: 3.3569 - mae: 1.2243 - val_

Epoch 172/200
 - 0s - loss: 1.7601 - mae: 0.9589 - val_loss: 13.3490 - val_mae: 2.6785
Epoch 173/200
 - 0s - loss: 1.5940 - mae: 0.9291 - val_loss: 14.1037 - val_mae: 2.6897
Epoch 174/200
 - 0s - loss: 1.5490 - mae: 0.9044 - val_loss: 13.2724 - val_mae: 2.5969
Epoch 175/200
 - 0s - loss: 1.3765 - mae: 0.8811 - val_loss: 14.3041 - val_mae: 2.7517
Epoch 176/200
 - 0s - loss: 1.5223 - mae: 0.8911 - val_loss: 16.1496 - val_mae: 2.9804
Epoch 177/200
 - 1s - loss: 1.5004 - mae: 0.9001 - val_loss: 13.8408 - val_mae: 2.6904
Epoch 178/200
 - 1s - loss: 1.3481 - mae: 0.8394 - val_loss: 12.6236 - val_mae: 2.6020
Epoch 179/200
 - 1s - loss: 1.5108 - mae: 0.9022 - val_loss: 14.4284 - val_mae: 2.7174
Epoch 180/200
 - 1s - loss: 1.5731 - mae: 0.8809 - val_loss: 16.4417 - val_mae: 2.9549
Epoch 181/200
 - 0s - loss: 1.3803 - mae: 0.8585 - val_loss: 13.9173 - val_mae: 2.6815
Epoch 182/200
 - 0s - loss: 1.4964 - mae: 0.9116 - val_loss: 15.1329 - val_mae: 2.8083
Epoch 183/200
 - 0s - loss: 1.4856 - mae: 0

 - 0s - loss: 5.6892 - mae: 1.5211 - val_loss: 10.9433 - val_mae: 2.2974
Epoch 67/200
 - 0s - loss: 5.7564 - mae: 1.5084 - val_loss: 11.3768 - val_mae: 2.4059
Epoch 68/200
 - 0s - loss: 5.3217 - mae: 1.5202 - val_loss: 10.5992 - val_mae: 2.3572
Epoch 69/200
 - 0s - loss: 5.5147 - mae: 1.4809 - val_loss: 12.6202 - val_mae: 2.5443
Epoch 70/200
 - 0s - loss: 5.3471 - mae: 1.5111 - val_loss: 10.9532 - val_mae: 2.3505
Epoch 71/200
 - 0s - loss: 5.4695 - mae: 1.4532 - val_loss: 10.4960 - val_mae: 2.2746
Epoch 72/200
 - 0s - loss: 5.3494 - mae: 1.4438 - val_loss: 11.4201 - val_mae: 2.3800
Epoch 73/200
 - 0s - loss: 5.2930 - mae: 1.4203 - val_loss: 10.7890 - val_mae: 2.3330
Epoch 74/200
 - 0s - loss: 5.1290 - mae: 1.4631 - val_loss: 9.3228 - val_mae: 2.1366
Epoch 75/200
 - 0s - loss: 5.0232 - mae: 1.4639 - val_loss: 9.3992 - val_mae: 2.1469
Epoch 76/200
 - 0s - loss: 5.3440 - mae: 1.4703 - val_loss: 11.2713 - val_mae: 2.5039
Epoch 77/200
 - 0s - loss: 5.1442 - mae: 1.4091 - val_loss: 10.2423 -

Epoch 162/200
 - 0s - loss: 2.6902 - mae: 1.0341 - val_loss: 10.1056 - val_mae: 2.3246
Epoch 163/200
 - 0s - loss: 2.8558 - mae: 1.1246 - val_loss: 9.7417 - val_mae: 2.2581
Epoch 164/200
 - 0s - loss: 2.9553 - mae: 1.1056 - val_loss: 8.8871 - val_mae: 2.1546
Epoch 165/200
 - 0s - loss: 2.5038 - mae: 1.0050 - val_loss: 11.0798 - val_mae: 2.4448
Epoch 166/200
 - 0s - loss: 2.9987 - mae: 1.1352 - val_loss: 9.6372 - val_mae: 2.2343
Epoch 167/200
 - 0s - loss: 2.7039 - mae: 1.0952 - val_loss: 11.5269 - val_mae: 2.5171
Epoch 168/200
 - 0s - loss: 2.6869 - mae: 1.0674 - val_loss: 10.3814 - val_mae: 2.2874
Epoch 169/200
 - 1s - loss: 2.5511 - mae: 1.0698 - val_loss: 9.5942 - val_mae: 2.2451
Epoch 170/200
 - 0s - loss: 2.6631 - mae: 1.0671 - val_loss: 10.8996 - val_mae: 2.4659
Epoch 171/200
 - 0s - loss: 2.7393 - mae: 1.0498 - val_loss: 9.6629 - val_mae: 2.2583
Epoch 172/200
 - 0s - loss: 2.7955 - mae: 1.0672 - val_loss: 11.7329 - val_mae: 2.5359
Epoch 173/200
 - 0s - loss: 2.7219 - mae: 1.0845
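
The per-epoch values above are noisy, so a common way to read them is to average the curves over the folds and smooth the result before looking for the best epoch. The sketch below uses synthetic curves in place of `all_mae_histories` so that it runs on its own; `smooth_curve`, the smoothing factor of 0.9 and the choice to skip the first 10 epochs are all assumptions made for this example.

```python
import numpy as np

def smooth_curve(points, factor=0.9):
    """Exponential moving average: blend each point with the previous smoothed value."""
    smoothed = []
    for point in points:
        if smoothed:
            smoothed.append(smoothed[-1] * factor + point * (1 - factor))
        else:
            smoothed.append(float(point))
    return smoothed

# Synthetic stand-in for all_mae_histories: 4 folds x 200 epochs of noisy validation MAE
rng = np.random.default_rng(0)
epochs = np.arange(1, 201)
toy_histories = [
    2.2 + 3.0 * np.exp(-epochs / 10) + 0.002 * epochs + rng.normal(scale=0.15, size=epochs.size)
    for _ in range(4)
]

average_mae = np.mean(toy_histories, axis=0)   # per-epoch mean over the folds
skip = 10                                      # drop the first, much larger values
smoothed = smooth_curve(average_mae[skip:])
best_epoch = int(np.argmin(smoothed)) + skip + 1

print(average_mae.shape, len(smoothed), best_epoch)
```

Plotting `smoothed` against the epoch number (for example with `plt.plot`) should make it much easier to see where the validation error stops improving than the raw per-fold logs do.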

#### Takeaways
* Regression is usually done using MSE loss and MAE for evaluation
* Input data should always be scaled, using statistics computed on the training set only
* Small datasets:
    - Use cross-validation
    - Use simple (non-deep) networks
    - Smaller batches, more epochs