REQUIREMENTS: Classify news items into 46 topics.

For this purpose, create a Multi-layer perceptron (MLP) using Keras.
Steps to complete the assignment:
- load the Reuters dataset
- preprocess the input data
- build a Sequential Keras model 
- compile the model with a training configuration 
- train your model on the training dataset 
- evaluate your model on the test dataset 

INPUT DATASET: We will use the Reuters newswire dataset that consists of 11,228 newswires from Reuters, labeled over 46 topics. 
Each newswire is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.
This dataset is available through the Keras API.
REFERENCE: https://keras.io/api/datasets/reuters/

In [None]:
# The code was removed by Watson Studio for sharing.

Keras Model

A Keras model represents a neural network. 
Keras provides two ways to create a model:
- a simple and easy to use Sequential API 
- a more flexible and advanced Functional API

REFERENCE: 
https://www.tutorialspoint.com/keras/keras_models.htm#:~:text=As%20learned%20earlier%2C%20Keras%20model,flexible%20and%20advanced%20Functional%20API.
https://keras.io/guides/sequential_model/

In [1]:
# TensorFlow is an open-source platform for creating Machine Learning applications. 
# REF: https://www.guru99.com/what-is-tensorflow.html

!pip install tensorflow==2.2.0rc1

Collecting tensorflow==2.2.0rc1
  Downloading tensorflow-2.2.0rc1-cp38-cp38-manylinux2010_x86_64.whl (516.2 MB)
Collecting tensorflow-estimator<2.3.0,>=2.2.0rc0
  Downloading tensorflow_estimator-2.2.0-py2.py3-none-any.whl (454 kB)
Collecting tensorboard<2.2.0,>=2.1.0
  Downloading tensorboard-2.1.1-py3-none-any.whl (3.8 MB)
Installing collected packages: tensorflow-estimator, tensorboard, tensorflow
  Attempting uninstall: tensorflow-estimator
    Found existing installation: tensorflow-estimator 2.4.0
    Uninstalling tensorflow-estimator-2.4.0:
      Successfully uninstalled tensorflow-estimator-2.4.0
  Attempting uninstall: tensorboard
    Found existing installation: tensorboard 2.4.1
    Uninstalling tensorboard-2.4.1

In [2]:
import tensorflow as tf
if tf.__version__ != '2.2.0-rc1':
    print(tf.__version__)
    raise ValueError('please upgrade to TensorFlow 2.2.0-rc1, or restart your Kernel (Kernel->Restart & Clear Output)')

IMPORTANT: Restart the kernel by clicking on "Kernel" -> "Restart & Clear Output" and wait until all output disappears. 
Your changes are then picked up.

We use a Keras Sequential model with only two types of layers: Dense and Dropout. 
In short, a Dropout layer randomly ignores a fraction of its input units at each training step (it is inactive at inference time). This is normally used to prevent the network from overfitting. 
The Dense layer is a regular fully connected layer in a neural network.

See the picture from https://stackoverflow.com/questions/58830573/in-keras-what-is-a-dense-and-a-dropout-layer

We also specify a random seed to make our results reproducible. 
And we load the Reuters data set:

In [3]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import reuters

seed = 1337
np.random.seed(seed)

max_words = 1000
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words, test_split=0.2, seed=seed)

# Note that the *num_words* keyword restricts the vocabulary to the 1000 most frequent words; less frequent words are replaced by an out-of-vocabulary marker.
# Also, 20% of the data will be test data and we ensure reproducibility by setting our random seed.

print()
print('REMEMBER that each newswire from x_train is encoded as a sequence of word indexes!') 
print('The shape of x_train:', x_train.shape, '= (number_of_rows, number_of_columns)')
print('Samples from x_train:')
print(x_train)

print()
print('REMEMBER that each newswire is categorised into one of the 46 topics, which will serve as our label stored in y_train.')
print('The shape of y_train:', y_train.shape, '= (number_of_rows, number_of_columns)')
print('Samples from y_train:')
print(y_train)

num_classes = np.max(y_train) + 1  # 46 topics
print('num_classes: ', num_classes, ' topics')
print('Topics have values from: ', np.min(y_train), ' to ', np.max(y_train))


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/reuters.npz

REMEMBER that each newswire from x_train is encoded as a sequence of word indexes!
The shape of x_train: (8982,) = (number_of_rows, number_of_columns)
Samples from x_train:
[list([1, 56, 2, 141, 2, 71, 8, 16, 40, 200, 6, 438, 2, 806, 2, 81, 5, 2, 2, 2, 7, 10, 587, 7, 50, 261, 5, 2, 806, 33, 839, 79, 2, 69, 10, 147, 20, 128, 7, 4, 2, 49, 4, 49, 8, 16, 33, 57, 69, 78, 11, 79, 335, 21, 10, 2, 959, 503, 92, 4, 587, 16, 8, 92, 4, 270, 16, 33, 2, 2, 806, 31, 197, 13, 2, 16, 8, 2, 806, 189, 40, 365, 2, 2, 9, 363, 6, 2, 117, 124, 7, 89, 900, 2, 6, 2, 172, 2, 236, 7, 4, 37, 38, 9, 2, 17, 12])
 list([1, 99, 234, 60, 9, 752, 111, 8, 25, 544, 20, 324, 2, 2, 640, 56, 2, 323, 40, 385, 25, 73, 794, 220, 13, 69, 32, 251, 18, 15, 7, 197, 9, 19, 445, 18, 15, 7, 80, 2, 7, 10, 99, 98, 276, 13, 99, 234, 5, 69, 19, 451, 18, 15, 92, 131, 4, 49, 8, 4, 211, 33, 2, 2, 2, 22, 4, 293, 2, 218, 17, 12])
 list([1, 103, 74,



## 1. Feature encoding

Our training features are still plain sequences of word indexes, and we need to preprocess them further before we can plug them into a *Dense* layer. For this we use the *Tokenizer* from the Keras text preprocessing module. Its *sequences_to_matrix* method takes an index sequence and maps it to a vector of length *max_words=1000*. Each of the 1000 vector positions corresponds to one word index in our newswire corpus. The output has a 1 at the i-th position if the word with index i occurs in the newswire, and 0 otherwise. Even if a word appears multiple times, we still put just a 1 into the vector, i.e. the encoding is binary. We use this tokenizer to transform both train and test features:

In [4]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')

print()
print('The shape of x_train:', x_train.shape, '= (number_of_rows, number_of_columns)')
print('Samples from x_train:')
print(x_train)

print()
print('The shape of x_test:', x_test.shape, '= (number_of_rows, number_of_columns)')
print('Samples from x_test:')
print(x_test)


The shape of x_train: (8982, 1000) = (number_of_rows, number_of_columns)
Samples from x_train:
[[0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 ...
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]]

The shape of x_test: (2246, 1000) = (number_of_rows, number_of_columns)
Samples from x_test:
[[0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 ...
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]]
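
To make the binary encoding from this section concrete, here is a minimal NumPy sketch of what such an encoding computes. The helper name `multi_hot` and the toy sequences are illustrative assumptions and not part of the Keras API; the notebook itself relies on *sequences_to_matrix*.

```python
import numpy as np

def multi_hot(sequences, num_words):
    """Hypothetical helper: binary bag-of-words encoding of index sequences."""
    matrix = np.zeros((len(sequences), num_words), dtype="float32")
    for row, seq in enumerate(sequences):
        for index in seq:
            # Indexes outside the vocabulary are skipped;
            # repeated indexes still produce a single 1.
            if 0 <= index < num_words:
                matrix[row, index] = 1.0
    return matrix

# Two toy "newswires" over a vocabulary of 6 word indexes (made-up data).
toy_sequences = [[1, 3, 3, 5], [2, 7]]
encoded = multi_hot(toy_sequences, num_words=6)
print(encoded)
```

In the first row the word index 3 occurs twice but is marked only once, and in the second row the index 7 is dropped because it lies outside the vocabulary.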


## 2. Label encoding

Use the *to_categorical* function to transform both *y_train* and *y_test* into one-hot encoded vectors of length *num_classes*:

In [5]:
# to_categorical(y, num_classes=None, dtype="float32")
# Converts a vector of integers into a binary matrix representation of the input.
# Example:
# >>> a = tf.keras.utils.to_categorical([0, 1, 2, 3], num_classes=4)
# >>> a = tf.constant(a, shape=[4, 4])
# >>> print(a)
# tf.Tensor(
#   [[1. 0. 0. 0.]
#    [0. 1. 0. 0.]
#    [0. 0. 1. 0.]
#    [0. 0. 0. 1.]], shape=(4, 4), dtype=float32)    
# Arguments:
#     y: vector to be converted into a matrix (with integers from 0 to num_classes - 1).
#     num_classes: total number of classes. If None, this would be inferred as the (largest number in y) + 1.
#     dtype: The data type expected by the input. Default: 'float32'.
# REFERENCE: https://keras.io/api/utils/python_utils/#to_categorical-function

y_train = ###_YOUR_CODE_GOES_HERE_###
y_test =  ###_YOUR_CODE_GOES_HERE_###

print()
print('The shape of y_train:', y_train.shape, '= (number_of_rows, number_of_columns)')
print('Samples from y_train:')
print(y_train)

print()
print('The shape of y_test:', y_test.shape, '= (number_of_rows, number_of_columns)')
print('Samples from y_test:')
print(y_test)


The shape of y_train: (8982, 46) = (number_of_rows, number_of_columns)
Samples from y_train:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

The shape of y_test: (2246, 46) = (number_of_rows, number_of_columns)
Samples from y_test:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
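
For reference, the one-hot transformation described in this section can be sketched with plain NumPy. The helper name `one_hot` and the toy labels are assumptions made for illustration; for integer labels it should produce the same kind of matrix as *to_categorical*.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical helper: turn integer class labels into one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[labels]

toy_labels = [0, 2, 1, 2]  # made-up labels for 3 classes
toy_one_hot = one_hot(toy_labels, num_classes=3)
print(toy_one_hot)
```

Each row contains exactly one 1, located at the position of the original class label, so `np.argmax` along the last axis recovers the integer labels.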


## 3. Model definition

Next, initialise a Keras *Sequential* model and add three layers to it:

    Layer1: a *Dense* layer with input_shape=(max_words,), 512 output units and "relu" activation.
    Layer2: a *Dropout* layer with dropout rate of 50%.
    Layer3: a *Dense* layer with num_classes output units and "softmax" activation.
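
One possible way to express these three layers is sketched below. It is a sketch rather than the reference solution, and it assumes *max_words = 1000* and *num_classes = 46* as set earlier in this notebook. Newer Keras versions may emit a warning about passing *input_shape* to a layer and suggest an explicit input object instead.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

max_words = 1000   # vocabulary size used earlier in this notebook
num_classes = 46   # number of Reuters topics

model = Sequential()
# Layer1: fully connected layer with 512 units and ReLU activation.
model.add(Dense(512, activation="relu", input_shape=(max_words,)))
# Layer2: randomly drops 50% of the activations during training.
model.add(Dropout(0.5))
# Layer3: one output unit per topic, normalised to probabilities by softmax.
model.add(Dense(num_classes, activation="softmax"))

model.summary()
```

The first Dense layer holds 1000 * 512 weights plus 512 biases and the last one 512 * 46 weights plus 46 biases; the Dropout layer has no trainable parameters.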
    
REMEMBER:

An Activation Function is a function that we use to adjust the output of a node. It is also known as the Transfer Function.
REFERENCE: 
https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html

Activation functions are crucial for an MLP (Multi-Layer Perceptron) because they introduce non-linearity; without them, a stack of layers would collapse into a single linear transformation and could not learn complicated patterns.
The activated output of a node then serves as an input for the next layer of the network.
Remember that in an MLP we compute the sum of the products of the inputs (X) and their corresponding weights (W), then we add a bias and apply an activation function, f(x), to get the output of that layer, which serves as input to the next layer.
Note: The bias is a learned parameter (one number per node) that shifts the weighted sum before the activation is applied.
REF: https://www.quora.com/Why-do-neural-networks-need-an-activation-function
MLP video tutorial (13 minutes): https://youtu.be/MXJQgYgzMMU
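
The weighted sum, bias and activation described above can be sketched in a few lines of NumPy. The function name `dense_forward` and all numbers are made up for illustration; a Keras *Dense* layer learns its weights and biases during training instead of having them set by hand.

```python
import numpy as np

def dense_forward(x, weights, bias, activation):
    """Hypothetical helper: output of one fully connected layer, f(x @ W + b)."""
    return activation(x @ weights + bias)

x = np.array([[1.0, 2.0]])          # one sample with two input features
weights = np.array([[1.0, -1.0],
                    [0.5, 0.5]])    # shape: (inputs, units)
bias = np.array([0.0, -1.0])        # one bias per unit

pre_activation = dense_forward(x, weights, bias, activation=lambda z: z)
activated = dense_forward(x, weights, bias, activation=np.tanh)
print(pre_activation)  # weighted sum plus bias
print(activated)       # same values squashed by tanh
```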

ReLU or the Rectified Linear Activation Function is a function that will output its input directly if it is positive, otherwise, it will output zero. 
It has become the default activation function for many types of neural networks because 
a model that uses it is easier to train and often achieves better performance.
REF: https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/ 
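
As a small illustration of the definition above, ReLU can be written as an element-wise maximum with zero. The function name `relu` and the sample values are chosen for this sketch only.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: keep positive values, replace the rest by zero."""
    return np.maximum(0.0, x)

relu_out = relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0]))
print(relu_out)
```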

Softmax Activation Function is a mathematical function that converts a vector of numbers into a vector of probabilities.
Specifically, the neural network is configured to output N values, one for each class in the classification task, and the softmax function is used to normalize the outputs, converting them from weighted sum values into probabilities that sum to one. Each value in the output of the softmax function is interpreted as the probability of membership for each class.
REF: https://machinelearningmastery.com/softmax-activation-function-with-python/
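
The normalisation performed by softmax can be sketched as follows. This is a minimal NumPy version with a hypothetical function name; subtracting the maximum before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Convert a vector (or rows) of raw scores into probabilities summing to 1."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())
```

The largest raw score receives the largest probability, and equal scores receive equal probabilities.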

In [6]:
# Instantiate sequential model:
model = Sequential() 
# Add the first layer: a *Dense* layer with input_shape=(max_words,), 512 output units and "relu" activation.
model.add( ###_YOUR_CODE_GOES_HERE_###  
# Add the second layer: a *Dropout* layer with a dropout rate of 50%.
model.add( ###_YOUR_CODE_GOES_HERE_###
# Add the third layer: a *Dense* layer with num_classes output units and "softmax" activation.
model.add( ###_YOUR_CODE_GOES_HERE_###

## 4. Model compilation

In the next step, you will compile your Keras model with a training configuration: 
- "categorical_crossentropy" as loss function 
- "adam" as optimizer 
- "accuracy" as evaluation metric

NOTE: In case you get an error regarding h5py, just restart the kernel and start from scratch.

REMEMBER:
The loss function is also called the error function or the cost function or the objective function.

categorical_crossentropy
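Cross-entropy is the default loss function to use for multi-class classification problems where the target values are in the set {0, 1, 2, …, n}, where each class is assigned a unique integer value. Note that the Keras "categorical_crossentropy" loss expects these targets in one-hot encoded form, which is why we transformed the labels in step 2.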
Mathematically, it is the preferred loss function under the inference framework of maximum likelihood. It is the loss function to be evaluated first and only changed if you have a good reason.
Cross-entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions for all classes in the problem. The score is minimized and a perfect cross-entropy value is 0.
Cross-entropy can be specified as the loss function in Keras by specifying "categorical_crossentropy" when compiling the model.
REF: https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/
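
To illustrate how the score behaves, here is a minimal NumPy sketch of categorical cross-entropy for one-hot targets. The function below is a simplified stand-in written for this example, not the Keras implementation, and the clipping constant is an assumption to avoid taking log(0).

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred)) for one-hot y_true."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

y_true = np.array([[0.0, 1.0, 0.0, 0.0]])         # true class is 1
uniform = np.full((1, 4), 0.25)                   # model has no preference
confident = np.array([[0.05, 0.85, 0.05, 0.05]])  # model favours class 1

loss_uniform = categorical_crossentropy(y_true, uniform)
loss_confident = categorical_crossentropy(y_true, confident)
loss_perfect = categorical_crossentropy(y_true, y_true)
print(loss_uniform, loss_confident, loss_perfect)
```

A uniform guess over four classes gives a loss of log(4), a confident correct prediction gives a smaller loss, and a perfect prediction gives a loss close to 0.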

Optimizers are algorithms that adjust the trainable attributes of a model, such as its weights, in order to reduce the loss; some of them also adapt the learning rate during training. 
A good optimizer therefore helps us get results faster.
TensorFlow supports 9 optimizer classes, one of which is Adam.

Adam is an adaptive learning rate method, which means, it computes individual learning rates for different parameters. 
REF: https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c
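
The idea of individual learning rates can be sketched with a single Adam update step in NumPy. The function name `adam_step` is hypothetical, the hyperparameter defaults are the ones suggested in the original Adam paper, and the code is a simplified illustration rather than the TensorFlow implementation.

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running averages of grads and grads**2."""
    m = beta1 * m + (1.0 - beta1) * grads
    v = beta2 * v + (1.0 - beta2) * grads ** 2
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

params = np.array([1.0, 1.0])
grads = np.array([0.1, -10.0])   # very different gradient magnitudes
m = np.zeros_like(params)
v = np.zeros_like(params)

new_params, m, v = adam_step(params, grads, m, v, t=1)
step = new_params - params
print(step)
```

Although the two gradients differ by a factor of 100, both parameters move by roughly the learning rate in the first step, because each update is scaled by that parameter's own gradient history.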

In [7]:

model.compile( ###_YOUR_CODE_GOES_HERE_###


## 5. Model training and evaluation

Next, define the batch_size for training as 32 and train the model for 5 epochs on *x_train* and *y_train* by using the *fit* method of your model. Then calculate the score for your trained model by running *evaluate* on *x_test* and *y_test* with the same batch size as used in *fit*.
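
The compile, fit and evaluate calls can be tried out on a tiny synthetic dataset before running them on the Reuters data. Everything about the data below (128 random samples, 20 features, 4 classes, 16 hidden units) is an assumption made to keep the sketch fast; only the loss, optimizer, metric, batch size and number of epochs follow the assignment text.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical

# Tiny random stand-in for the encoded features and one-hot labels.
rng = np.random.RandomState(1337)
n_features, n_classes = 20, 4
x_toy = rng.randint(0, 2, size=(128, n_features)).astype("float32")
y_toy = to_categorical(rng.randint(0, n_classes, size=128), num_classes=n_classes)

model = Sequential()
model.add(Dense(16, activation="relu", input_shape=(n_features,)))
model.add(Dropout(0.5))
model.add(Dense(n_classes, activation="softmax"))

# Training configuration as described in the compilation step.
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

batch_size = 32
model.fit(x_toy, y_toy, batch_size=batch_size, epochs=5, verbose=0)

# evaluate returns [loss, accuracy] because one metric was compiled in.
score = model.evaluate(x_toy, y_toy, batch_size=batch_size, verbose=0)
print("loss:", score[0], "accuracy:", score[1])
```

Because the labels here are random, the accuracy is not meaningful; the sketch only shows the order of the calls and the structure of the returned score.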

REMEMBER:

EPOCHS
The number of epochs defines the number of times that the learning algorithm will work through the entire training dataset.
An epoch means training the neural network with all the training data for one cycle. In an epoch, we use all of the data exactly once. 
The recommendation is to start with a large number of epochs and use Early Stopping to halt training when performance stops improving.
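
The early stopping idea can be sketched without any framework. The function name `early_stopping_epoch`, the *patience* value and the validation losses below are all made up for illustration; Keras offers a ready-made *EarlyStopping* callback in *tf.keras.callbacks* that follows the same principle.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch (1-based) at which training would stop."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # new best validation loss, reset the counter
            waited = 0
        else:
            waited += 1   # no improvement in this epoch
            if waited >= patience:
                return epoch
    return len(val_losses)

# Made-up validation losses: improvement stops after epoch 3.
val_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print("training stops after epoch", stop_epoch)
```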

BATCH SIZE
The batch size defines the number of samples that will be propagated through the network.

Let's say you have 1050 training samples and you want to set up a batch_size equal to 100. The algorithm takes the first 100 samples from the training dataset and trains the network. Next, it takes the second 100 samples and trains the network again. We can keep doing this procedure until we have propagated all samples through the network. A problem might occur with the last set of samples. In our example, we've used 1050 which is not divisible by 100 without remainder. The simplest solution is to get the final 50 samples and train the network.
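
The splitting described in this example can be sketched as follows; the generator name `iter_batches` is hypothetical and only computes the index ranges, not the training itself.

```python
import math

def iter_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs that cover all samples in order."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

batch_sizes = [stop - start for start, stop in iter_batches(1050, 100)]
print(len(batch_sizes), "batches:", batch_sizes)

# The same arithmetic for this notebook: 8982 training samples, batch size 32.
steps_per_epoch = math.ceil(8982 / 32)
print("steps per epoch:", steps_per_epoch)
```

The last batch simply contains the remaining 50 samples, so every sample is used exactly once per epoch.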

Advantages of using a batch size < number of all samples:
- It requires less memory. Since you train the network using fewer samples, the overall training procedure requires less memory. That's especially important if you are not able to fit the whole dataset in your machine's memory.
- Typically networks train faster with mini-batches. That's because we update the weights after each propagation. In our example we've propagated 11 batches (10 of them had 100 samples and 1 had 50 samples) and after each of them we've updated our network's parameters. If we used all samples during propagation we would make only 1 update of the network's parameters.

Disadvantages of using a batch size < number of all samples:
The smaller the batch the less accurate the estimate of the gradient will be. In the figure https://i.stack.imgur.com/lU3sx.png, you can see that the direction of the mini-batch gradient (green color) fluctuates much more in comparison to the direction of the full batch gradient (blue color).
REF: https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network

In [8]:
batch_size =            ###_YOUR_CODE_GOES_HERE_###
model.fit(              ###_YOUR_CODE_GOES_HERE_###
score = model.evaluate( ###_YOUR_CODE_GOES_HERE_###

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [9]:
score[1]

0.8023152351379395