# Reuters Dataset






## Framework setup
As the central framework for this task we’ll use [Auto-Keras](https://arxiv.org/abs/1806.10282), an efficient neural architecture search system developed by the DATA Lab at Texas A&M University. It uses a variant of Bayesian optimization to guide network morphism and find a good architecture for our task and dataset, with Keras and TensorFlow as the backend.
First, we have to install it together with its dependency, [Keras-Tuner](https://keras-team.github.io/keras-tuner/), the hyperparameter optimization library used by Auto-Keras.

In [None]:
!pip install -q git+https://github.com/keras-team/keras-tuner.git@1.0.2rc1 autokeras

## Imports
Now we can import the TensorFlow dataset loading tool and Auto-Keras, as well as NumPy to handle the data arrays.

In [None]:
import numpy as np
import autokeras as ak
from tensorflow.keras.datasets import reuters

## Loading the data
The dataset can now be loaded into four NumPy arrays.

In [None]:
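index_offset = 3  # Indices 0-2 are reserved for padding, start and unknown tokens
(x_train, y_train), (x_test, y_test) = reuters.load_data(
    num_words=1000,           # Keep only the most frequent words
    index_from=index_offset)  # Shift word indices past the reserved ones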
y_train = y_train.reshape(-1, 1)
y_test  = y_test.reshape(-1, 1)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/reuters.npz


In [None]:
# Build the word-to-index dictionary, then invert it to map indices back to words.
word_to_id = reuters.get_word_index()
word_to_id = {k: (v + index_offset) for k, v in word_to_id.items()}
word_to_id['PADDING'] = 0
word_to_id['START_OF_SEQUENCE'] = 1
word_to_id['UNKNOWN'] = 2
id_to_word = {value: key for key, value in word_to_id.items()}

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/reuters_word_index.json


In [None]:
# Convert the word indices to words.
x_train = list(map(lambda sentence: ' '.join(
    id_to_word[i] for i in sentence), x_train))
x_test = list(map(lambda sentence: ' '.join(
    id_to_word[i] for i in sentence), x_test))

# np.str was removed in NumPy 1.24; the built-in str is equivalent.
x_train = np.array(x_train, dtype=str)
x_test  = np.array(x_test, dtype=str)
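
To make the index convention concrete, here is a minimal sketch of the round trip on a toy vocabulary. The three-word dictionary and the helper names `encode`, `build_id_to_word` and `decode` are hypothetical and only illustrate the idea. The sketch also assumes that the `num_words` cap is applied to the shifted index, which is how I understand the Keras loader to behave.

```python
# Toy illustration only: a hypothetical three-word vocabulary, not the real Reuters index.
INDEX_OFFSET = 3                 # Same role as index_offset in the notebook
PAD, START, UNKNOWN = 0, 1, 2    # Reserved indices

toy_word_rank = {"the": 1, "oil": 2, "prices": 3}  # Rank 1 = most frequent word


def encode(words, word_rank, num_words):
    """Map words to shifted indices; words outside the cap become UNKNOWN."""
    ids = [START]
    for word in words:
        rank = word_rank.get(word)
        idx = None if rank is None else rank + INDEX_OFFSET
        ids.append(idx if idx is not None and idx < num_words else UNKNOWN)
    return ids


def build_id_to_word(word_rank):
    """Invert the shifted vocabulary and add the reserved tokens."""
    id_to_word = {rank + INDEX_OFFSET: word for word, rank in word_rank.items()}
    id_to_word.update({PAD: "PADDING", START: "START_OF_SEQUENCE", UNKNOWN: "UNKNOWN"})
    return id_to_word


def decode(ids, id_to_word):
    """Join the words that correspond to a sequence of indices."""
    return " ".join(id_to_word[i] for i in ids)


ids = encode(["the", "oil", "prices"], toy_word_rank, num_words=6)
print(ids)                                           # [1, 4, 5, 2]
print(decode(ids, build_id_to_word(toy_word_rank)))  # START_OF_SEQUENCE the oil UNKNOWN
```

With `num_words=6`, the word "prices" is shifted to index 6, falls outside the cap and is decoded as the unknown token, which is the same effect `num_words=1000` has on rare words in the real dataset.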

## Building the model
`TextClassifier` is the class responsible for the model search. Here I set `max_trials=3` to keep the exploration short, but it can be set to any positive integer. The search explores different model architectures through tree-based Bayesian optimization. We’ll try only three candidates because the search is really time-consuming on Google Colab’s GPU, where I’m running this notebook.

In [None]:
clf = ak.TextClassifier(              # Initialize the text classifier
    num_classes=None,                 # Infer the number of classes
    multi_label=False,                # Only one output
    loss='categorical_crossentropy',  # Loss function to minimize
    metrics='accuracy',               # Metric to watch
    project_name="text_classifier",   # Name of the folder
    max_trials=3,                     # Just try three models
    directory=None,                   # Automatic folder creation
    objective="val_loss",             # Validation set crossentropy
    tuner=None,                       # Automatic hyperparameter tuner selection
    overwrite=True,                   # Don't load previous experiments
    seed=42                           # Set a seed to replicate the experiment
)

### Neural Architecture Search
It’s all ready to start exploring the models.

In [None]:
clf.fit(                   # Fit the model
    x=x_train,             # Training features
    y=y_train,             # Training labels
    epochs=None,           # Automatic number of epochs
    callbacks=None,        # No callbacks
    validation_split=0.2,  # Validation data split
    validation_data=None   # Use a portion of training data
)

Trial 3 Complete [00h 02m 14s]
val_loss: 1.2536143064498901

Best val_loss So Far: 0.9535587430000305
Total elapsed time: 00h 06m 47s
INFO:tensorflow:Oracle triggered exit
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
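
The `validation_split=0.2` argument sets aside part of the training data for validation. As a rough sketch, assuming the held-out samples are taken from the end of the arrays (which is how I understand Keras and AutoKeras to behave), the split looks like this. The helper name `split_last_fraction` is hypothetical.

```python
def split_last_fraction(samples, validation_split=0.2):
    """Hold out the trailing fraction of the samples for validation."""
    n_val = int(len(samples) * validation_split)
    n_train = len(samples) - n_val
    return samples[:n_train], samples[n_train:]


# Ten dummy samples: the last two end up in the validation set.
train_part, val_part = split_last_fraction(list(range(10)), validation_split=0.2)
print(train_part)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_part)    # [8, 9]
```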


The model found can now be exported. It’s a convolutional multi-layer neural network with the following architecture.

In [None]:
model = clf.export_model()
model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization (TextVect (None, 512)               0         
_________________________________________________________________
embedding (Embedding)        (None, 512, 64)           320064    
_________________________________________________________________
dropout (Dropout)            (None, 512, 64)           0         
_________________________________________________________________
conv1d (Conv1D)              (None, 508, 256)          82176     
_________________________________________________________________
global_max_pooling1d (Global (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 256)              
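
The parameter counts and shapes in the summary can be checked by hand. The sketch below assumes a `Conv1D` layer with stride 1 and no padding. The 5001 embedding rows and the kernel size of 5 are inferred from the numbers in the summary, not stated anywhere in the notebook.

```python
def embedding_params(vocab_rows, embed_dim):
    """One embedding vector of size embed_dim per vocabulary row."""
    return vocab_rows * embed_dim


def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def conv1d_output_length(input_length, kernel_size):
    """Output length with stride 1 and no padding."""
    return input_length - kernel_size + 1


print(embedding_params(5001, 64))    # 320064, matches the embedding layer
print(conv1d_params(5, 64, 256))     # 82176, matches the conv1d layer
print(conv1d_output_length(512, 5))  # 508, matches the conv1d output shape
```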

## Testing the model
The benchmark chosen for this dataset is the experiment reported in the book [Deep Learning with Python by Francois Chollet](https://www.manning.com/books/deep-learning-with-python). The test cross-entropy of the benchmark is ~0.96.
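
To give a feel for what a cross-entropy of ~0.96 means, here is a small sketch of the metric on made-up predictions. The labels and probabilities are invented for illustration. The function is a plain-Python restatement of the usual definition (the mean negative log-probability of the true class), not the Keras implementation.

```python
import math


def categorical_crossentropy(true_labels, predicted_probs):
    """Mean negative log-probability assigned to the true class."""
    losses = [-math.log(probs[label])
              for label, probs in zip(true_labels, predicted_probs)]
    return sum(losses) / len(losses)


# Two made-up samples over three classes.
toy_labels = [0, 2]
toy_probs = [[0.7, 0.2, 0.1],
             [0.1, 0.3, 0.6]]

loss = categorical_crossentropy(toy_labels, toy_probs)
print(round(loss, 4))

# exp(-loss) is the geometric mean of the probabilities given to the true class.
typical_prob = math.exp(-0.96)
print(round(typical_prob, 2))  # 0.38
```

Read this way, a loss of ~0.96 corresponds to the model assigning, on a geometric average, a probability of about 0.38 to the correct topic.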

Let’s now test our model on the holdout test set.

In [None]:
# Evaluate the best model with testing data.
current = clf.evaluate(x_test, y_test)



In [None]:
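previous = 0.9565213431445807  # Benchmark test cross-entropy
# Relative reduction in loss; a positive value means our model beats the benchmark.
improvement = ((previous - current[0]) / previous) * 100.0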

print(f'The categorical crossentropy improvement is {round(improvement, 1)}%.')

The categorical crossentropy improvement is 3.8%.


The accuracy of ~0.79 is in line with the benchmark model. However, we scored a cross-entropy of ~0.92 with only three models explored, improving on the textbook benchmark by almost 4%.

Depending on our hardware and time availability, of course, we could explore even more models for further improvement in the benchmark score.
