# Workshop 6 

### Outline: 
 
1. Multi-Class Classification: Classifying newswires (Chapter 3)
2. Regression with Deep Learning (Chapter 3)

Source: Deep Learning with Python, François Chollet, 2017

### 1. Classifying Newswires

In [None]:
# Loading the reuters dataset
from tensorflow.keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

# As with the IMDB dataset, the argument num_words=10000 restricts the data to the
# 10,000 most frequently occurring words found in the data.

In [None]:
# Each data point is just a list of indexes of the top 10000 frequent words
train_data[10]

In [None]:
# Decoding an encoded newswire back into words
word_index = reuters.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
# Indices are offset by 3 because 0, 1, and 2 are reserved for "padding", "start of sequence", and "unknown"
decoded_newswire = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])

In [None]:
decoded_newswire

In [None]:
# Objective: Transform this list into a "bag of words" model
# For students who did not take AA, see: https://en.wikipedia.org/wiki/Bag-of-words_model

<img src="resources/img1.png" width="350">

In [None]:
import numpy as np
# Transform each sequence into a 10,000-dimensional vector with a very simple bag-of-words approach
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1. # This is a very simple bag of words model 
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

In [None]:
x_train

In [None]:
# Range of training labels => 46 Topics
print("min: {} - max: {}".format(train_labels.min(),train_labels.max()))

In [None]:
# Our training labels are categorical, so we transform them with one-hot encoding into a proper format
# (basically, this creates a dummy variable for each category)
from tensorflow.keras.utils import to_categorical
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
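
To make the encoding concrete, here is a minimal NumPy sketch of what one-hot encoding does to a handful of labels. The helper name `to_one_hot` and the toy labels are assumptions made for illustration; this is not the Keras implementation of `to_categorical`.

```python
import numpy as np

def to_one_hot(labels, dimension=46):
    """Return a (len(labels), dimension) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    results = np.zeros((len(labels), dimension))
    # Row i gets a 1 in the column given by labels[i]
    results[np.arange(len(labels)), labels] = 1.
    return results

toy_labels = [3, 0, 45]            # hypothetical topic indices
encoded = to_one_hot(toy_labels)

print(encoded.shape)               # one row per label, one column per class
print(encoded.argmax(axis=1))      # recovers the original labels 3, 0, 45
```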

#### The Deep Neural Network Architecture
The problem at hand looks very similar to the problem we solved last week. However, instead of 2 classes (positive and negative sentiment), we now have 46 classes, so the dimensionality of the output space is much larger.

In [None]:
from tensorflow.keras import models
from tensorflow.keras import layers

# The raw network architecture
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

Note two things here:
1. Each input vector is mapped to a 46-dimensional output vector.
2. The last layer uses a softmax activation function. In other words, the network outputs a probability distribution over the 46 classes: the 46 output values are non-negative and sum to 1.
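
The softmax idea can be sketched in a few lines of NumPy. The function below and the three toy scores are illustrative assumptions (the real layer works on 46 scores per sample), but the behaviour is the same: raw scores go in, a probability distribution comes out.

```python
import numpy as np

def softmax(logits):
    """Map a vector of raw scores to a probability distribution."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])      # hypothetical raw scores for 3 classes
probs = softmax(scores)

print(probs)            # roughly [0.66 0.24 0.10]
print(probs.sum())      # sums to 1 (up to floating-point error)
print(probs.argmax())   # index of the most likely class, here 0
```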

##### The loss function
The best loss function to use in this case is categorical_crossentropy. It measures
the distance between two probability distributions: here, between the probability distribution
output by the network and the true distribution of the labels.
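
As a rough illustration of this "distance", the sketch below computes the categorical crossentropy for a single sample by hand. The function, the `eps` clipping value, and the toy distributions are assumptions for illustration; Keras additionally averages the loss over the batch.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Crossentropy between a one-hot label and a predicted distribution."""
    y_pred = np.clip(y_pred, eps, 1.0)       # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0., 1., 0.])               # the true class is index 1
confident = np.array([0.05, 0.90, 0.05])      # hypothetical good prediction
unsure = np.array([0.40, 0.30, 0.30])         # hypothetical poor prediction

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

print(loss_confident)   # about 0.105
print(loss_unsure)      # about 1.204
```

The closer the predicted probability of the true class is to 1, the closer the loss is to 0.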

In [None]:
model.compile(optimizer='SGD', loss='categorical_crossentropy', metrics=['accuracy'])

#### Validating our network

In [None]:
# Let's pick 1000 samples to use as a validation set
x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

In [None]:
# Training Phase with 20 epochs

In [None]:
# validation data = Data on which to evaluate the loss and any model metrics at the end of each epoch.
#                   The model will not be trained on this data.
history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val))

In [None]:
# Plotting the training and validation loss
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
# Plotting the training and validation accuracy
plt.clf()
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

## Task: Experiment with layers

We have an output layer with 46 nodes. What happens to the accuracy when we reduce the number of nodes in the second intermediate layer
to 1?

## Task: Experiment with bag of words model
The most basic bag of words model we used assigned a 1 to any word that is in the article, but it doesn't take into account **frequencies**.

Can you think of a model that takes into account word frequencies?
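
One possible answer, sketched under assumptions: store how often each word occurs instead of a 0/1 flag. The function name `vectorize_sequences_counts` and the toy input are hypothetical; other answers (for example, relative frequencies or TF-IDF weights) are just as valid.

```python
import numpy as np

def vectorize_sequences_counts(sequences, dimension=10000):
    """Hypothetical variant of vectorize_sequences that stores word counts."""
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # np.add.at accumulates repeated indices,
        # whereas results[i, sequence] += 1 would count each word only once
        np.add.at(results[i], sequence, 1.)
    return results

toy_sequences = [[1, 1, 3], [2]]      # word 1 occurs twice in the first sample
counts = vectorize_sequences_counts(toy_sequences, dimension=5)
print(counts)
```

In the notebook, this could be tried by replacing `vectorize_sequences(train_data)` with `vectorize_sequences_counts(train_data)` and comparing the validation accuracy.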

### Take-Home Message:

1. N Classes => N Output Nodes
2. The output layer should use a softmax activation function (provided that you want to assign each data point to exactly ONE class)
3. Categorical Crossentropy is in many cases the loss function you should use for classification
4. Avoid Information Bottlenecks (i.e., don't use hidden layers with too few nodes)
5. Pre-processing inputs in a clever way can be more important than network tuning!

### 2. Regression with Deep Learning

In [None]:
from tensorflow.keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

In [None]:
# Training Data
train_data.shape

In [None]:
# Test Data
test_data.shape

In [None]:
# Numerical Targets 
train_targets

#### Preparing the data

In [None]:
# Standardizing the values (center around 0, std of 1)
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std

test_data -= mean
test_data /= std
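
Note that the test data is scaled with the mean and standard deviation of the *training* data, so no information from the test set leaks into preprocessing. A small sketch on made-up numbers (the `toy_*` arrays are assumptions for illustration) shows the effect:

```python
import numpy as np

# Two features on very different scales (hypothetical values)
toy_train = np.array([[1., 100.], [2., 200.], [3., 300.]])
toy_test = np.array([[2., 400.]])

# Statistics are computed per feature, on the training data only
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)

toy_train_scaled = (toy_train - mean) / std
toy_test_scaled = (toy_test - mean) / std   # reuse the training statistics

print(toy_train_scaled.mean(axis=0))   # approximately 0 for every feature
print(toy_train_scaled.std(axis=0))    # approximately 1 for every feature
print(toy_test_scaled)                 # test values need not have mean 0 or std 1
```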

In [None]:
#### Building the network

from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras.optimizers import RMSprop

def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    # MSE = Mean Squared Error
    # MAE = Mean Absolute Error
    # RMSprop is an adaptive learning-rate method based on stochastic gradient descent
    # If you use plain SGD, your network might not converge
    opt = RMSprop(learning_rate=0.001)  # `lr` is deprecated in recent Keras versions
    model.compile(optimizer=opt, loss='mse', metrics=['mae'])
    return model
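
To see how the loss (MSE) and the metric (MAE) differ, the following sketch computes both by hand on three made-up predictions. The numbers are assumptions for illustration only. Since the targets of this dataset are given in thousands of dollars, an MAE of 5.0 would mean predictions are off by about $5,000 on average.

```python
import numpy as np

targets = np.array([20.0, 30.0, 50.0])        # hypothetical true prices
predictions = np.array([22.0, 27.0, 40.0])    # hypothetical model outputs

errors = predictions - targets                # [2, -3, -10]

mse = np.mean(errors ** 2)                    # (4 + 9 + 100) / 3
mae = np.mean(np.abs(errors))                 # (2 + 3 + 10) / 3

print(mse)   # about 37.67: the single large error dominates
print(mae)   # 5.0: stays in the units of the target
```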

#### Introducing cross validation

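Since we have so little data, the validation score can vary a lot depending on which samples end up in the validation set. To cope, we use k-fold cross-validation: split the data into k partitions, train k identical models on k-1 partitions each, evaluate each model on the remaining partition, and average the k scores.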
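
The splitting logic can be sketched independently of Keras. The generator `k_fold_indices` below is a hypothetical helper, not part of the notebook; it mirrors the integer-division slicing used in this workshop, which means any leftover samples at the end never land in a validation fold.

```python
import numpy as np

def k_fold_indices(num_samples, k):
    """Yield (train_idx, val_idx) index arrays for each of the k folds."""
    fold_size = num_samples // k
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        val_idx = np.arange(start, stop)
        # Everything before and after the validation slice is used for training
        train_idx = np.concatenate([np.arange(0, start),
                                    np.arange(stop, num_samples)])
        yield train_idx, val_idx

folds = list(k_fold_indices(num_samples=10, k=4))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```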

In [None]:
import numpy as np

k = 4
num_val_samples = len(train_data) // k # returns an integer instead of float
num_epochs = 100
all_scores = []

for i in range(k):
    print('processing fold #', i) 
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]    # Slice Get Validation Data 
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples] # Slice Val. Target Data
    
    # Exclude validation data from the training data
    partial_train_data = np.concatenate(
        [
            train_data[:i * num_val_samples],
            train_data[(i + 1) * num_val_samples:]
        ],
        axis=0)
    partial_train_targets = np.concatenate(
        [
            train_targets[:i * num_val_samples],
            train_targets[(i + 1) * num_val_samples:]
        ],
        axis=0)
    
    # Build Model
    model = build_model()
    
    # Fit Model
    model.fit(partial_train_data, partial_train_targets, epochs=num_epochs, batch_size=1, verbose=0)
    
    # Evaluate Model
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    
    # Add Mean Absolute Error to the list of all scores
    all_scores.append(val_mae)

In [None]:
# Get MAE for each k-fold set
all_scores

In [None]:
# Compute Average
np.mean(all_scores)

In [None]:
# Okay, let's analyze how the validation error depends on the number of epochs
# Rerun...

In [None]:
import numpy as np

k = 2
num_val_samples = len(train_data) // k # returns an integer instead of float
num_epochs = 500
all_mae_histories = [] # <-- This is changed

for i in range(k):
    print('processing fold #', i) 
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]    # Slice Get Validation Data 
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples] # Slice Val. Target Data
    
    # Exclude validation data from the training data
    partial_train_data = np.concatenate(
        [
            train_data[:i * num_val_samples],
            train_data[(i + 1) * num_val_samples:]
        ],
        axis=0)
    partial_train_targets = np.concatenate(
        [
            train_targets[:i * num_val_samples],
            train_targets[(i + 1) * num_val_samples:]
        ],
        axis=0)
    
    # Build Model
    model = build_model()
    
    # Fit Model # <-- This is changed: pass validation data so Keras records val_mae per epoch
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=1, verbose=0)
    
    # Cache validation MAE History  # <-- This is changed
    mae_history = history.history['val_mae']
    
    # Add the per-epoch validation MAE to the list of histories # <-- This is changed
    all_mae_histories.append(mae_history)

In [None]:
all_mae_histories[0]

In [None]:
# Plot MAE History
import matplotlib.pyplot as plt
plt.plot(range(1, len(all_mae_histories[0]) + 1), all_mae_histories[0])
plt.plot(range(1, len(all_mae_histories[0]) + 1), all_mae_histories[1])
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

In [None]:
# Each fold produced one MAE history; average them epoch by epoch across all folds
average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]

In [None]:
# Plot average MAE History
import matplotlib.pyplot as plt
plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
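
One way to use the averaged curve is to read off the epoch at which the average MAE is lowest and train the final model for roughly that many epochs. The sketch below uses a short made-up list (`toy_average_mae` is an assumption for illustration); in the notebook, `average_mae_history` would take its place.

```python
import numpy as np

# Hypothetical average MAE per epoch: improves at first, then gets worse again
toy_average_mae = [4.1, 3.2, 2.8, 2.6, 2.7, 2.9]

# argmin is 0-based, epochs are counted from 1
best_epoch = int(np.argmin(toy_average_mae)) + 1
best_mae = toy_average_mae[best_epoch - 1]

print(best_epoch)   # 4
print(best_mae)     # 2.6
```

Real curves are noisy, so the minimum is better treated as a rough guide than as an exact optimum.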

In [None]:
# Evaluating the Model with the Test Set
model = build_model()
model.fit(train_data, train_targets, epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)

In [None]:
# Voila.
test_mae_score

#### Take-Home Message
1. Mean squared error (MSE) is a loss function commonly used for regression.
2. A common regression metric is mean absolute error.
3. When features in the input data have values in different ranges, each feature should be scaled independently as a preprocessing step.
4. When there is little data available, using K-fold validation is a great way to reliably evaluate a model.
5. If there is little data, use a small network. Otherwise, your network might overfit.