
# Introduction #

In [None]:
from IPython.display import display

# Preparing Data for a Neural Network #

The data we'll use in this course will be *structured* data, or more specifically, *tabular* data, the kind you'd find in CSV files and Pandas DataFrames. We won't get into the details of data preparation in this course, but let's outline the important points. Take a look at the hidden cell if you'd like to see how it's done.

Neural nets need numeric inputs and produce numeric outputs, and they generally perform best when all the features are on a common scale near 0. This means you'll need to encode any non-numeric features and scale any numeric features. For numeric features, [standardization](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) and [min-max scaling](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) to $[0, 1]$ can both be good choices. For categorical features with a moderate number of categories, [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) is a good choice. The [preprocessing module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) in scikit-learn has almost everything you might need for preparing tabular data for neural networks.
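To make the two numeric scalings and one-hot encoding concrete, here is a minimal sketch in plain Python that reproduces the arithmetic on made-up values. The feature names and numbers (`displacement`, `fuel_type`) and the helper functions are hypothetical; in practice you would use the scikit-learn transformers rather than hand-written helpers.

```python
from statistics import mean, pstdev

# Hypothetical toy features, not taken from any real dataset
displacement = [1.0, 2.0, 3.0, 4.0]
fuel_type = ["gas", "diesel", "gas", "electric"]

def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (population std)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Turn each label into a 0/1 vector with one slot per distinct category."""
    categories = sorted(set(labels))
    rows = [[1.0 if label == c else 0.0 for c in categories] for label in labels]
    return categories, rows

standardized = standardize(displacement)
scaled = min_max_scale(displacement)
categories, encoded = one_hot(fuel_type)

print(standardized)  # centered on 0 with unit spread
print(scaled)        # first value 0.0, last value 1.0
print(categories)    # ['diesel', 'electric', 'gas']
print(encoded[0])    # [0.0, 0.0, 1.0]
```

Note that each categorical feature becomes several numeric columns, one per category, which is why one-hot encoding suits features with only a moderate number of categories.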

For more on data preparation, see these Kaggle courses: [Data Cleaning](https://www.kaggle.com/learn/data-cleaning) and [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning).

Let's walk through preprocessing.

### 1a) Load and Process Dataset

In the *Fuel Economy* dataset your task is to predict the fuel economy of an automobile given features like its type of engine or the year it was made.

First let's load the *Fuel Economy* dataset. Our target is the `FE` column.

In [None]:
import pandas as pd

fuel = pd.read_csv('../input/dl-course-data/fuel.csv')
display(fuel.head())
fuel.info()  # prints its summary directly and returns None, so no display() needed

The features with `object` type are categorical, which we will one-hot encode. The numeric features we'll standardize. It's not as essential that the target be transformed, though doing so can significantly speed up training.

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.model_selection import train_test_split

X = fuel.copy()
y = X.pop('FE')

preprocessor = make_column_transformer(
    (StandardScaler(),
     make_column_selector(dtype_include=np.number)),
    (OneHotEncoder(sparse_output=False),  # use sparse=False on scikit-learn < 1.2
     make_column_selector(dtype_include=object)),
)

X_train, X_valid, y_train, y_valid = \
    train_test_split(X, y, train_size=0.75)

X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)
y_train = np.log(y_train) # log transform target instead of standardizing
y_valid = np.log(y_valid)

And now our data is ready for the network!
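One consequence of log-transforming the target is that the network's predictions come out in log units and need `exp` applied to recover fuel economy. The sketch below uses plain Python and made-up numbers (not values from the dataset) to show the round trip, and why an absolute error in log space behaves roughly like a relative error.

```python
import math

# Hypothetical fuel-economy values, for illustration only
fuel_economy = [20.0, 40.0]
log_targets = [math.log(v) for v in fuel_economy]

# Suppose a model's predictions are each 0.05 too high in log space
log_predictions = [t + 0.05 for t in log_targets]

# Undo the transform to get predictions in the original units
predictions = [math.exp(p) for p in log_predictions]

for actual, predicted in zip(fuel_economy, predictions):
    relative_error = (predicted - actual) / actual
    print(f"actual={actual:.1f} predicted={predicted:.2f} relative error={relative_error:.3f}")
```

The same log-space error of 0.05 works out to roughly a 5% error whether the true value is 20 or 40.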

### 1b) Input Shape

What will be the value of `input_shape` in the first layer of the network? 

In [None]:
# Hint 1: Think about whether you should look at the processed data `X_train` or the original data `fuel`.
# Hint 2: You should look at the processed data `X_train`, since that is the data actually going into the network.
input_shape = [50]
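The processed data is wider than the raw table because one-hot encoding creates one column per category. Here is a small sketch of that count, using a hypothetical three-column table (the column names and values are invented for illustration). With the real data you could instead read `X_train.shape[1]`, assuming the preprocessor returned a dense NumPy array.

```python
# Hypothetical raw table: one numeric column and two categorical columns
rows = [
    {"cylinders": 4, "drive": "front", "fuel": "gas"},
    {"cylinders": 6, "drive": "rear", "fuel": "gas"},
    {"cylinders": 8, "drive": "all", "fuel": "diesel"},
]

numeric_columns = ["cylinders"]
categorical_columns = ["drive", "fuel"]

# Each numeric column stays a single column after standardizing
n_numeric = len(numeric_columns)

# Each categorical column expands to one column per distinct category
n_one_hot = sum(len({row[col] for row in rows}) for col in categorical_columns)

toy_input_shape = [n_numeric + n_one_hot]
print(toy_input_shape)  # [6]
```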

# Fuel Economy Prediction #

### 2a) Define Model

Define a model with three hidden layers, each having 64 units and a ReLU activation.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = ____

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=input_shape),
    layers.Dense(64, activation='relu'),    
    layers.Dense(64, activation='relu'),
    layers.Dense(1),
])
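As a sanity check on the architecture, you can count its trainable parameters by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below does this arithmetic for the layer sizes used here, assuming 50 input features as set in `input_shape`. The helper name `dense_params` is made up for this example.

```python
def dense_params(n_in, n_out):
    """Number of weights plus biases in a fully connected layer."""
    return n_in * n_out + n_out

# 50 inputs, three hidden layers of 64 units, one output unit
layer_sizes = [50, 64, 64, 64, 1]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [3264, 4160, 4160, 65]
print(total)      # 11649
```

Calling `model.summary()` should report the same total if the model was built with these sizes.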

### 2b) Add Loss and Optimizer

Now, using the `compile` method, add the Adam optimizer and MAE loss.

In [None]:
# YOUR CODE HERE
____

In [None]:
model.compile(
    optimizer='adam',
    loss='mae'
)
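For reference, MAE (mean absolute error) is the average of the absolute differences between targets and predictions. A minimal sketch in plain Python, with made-up numbers and a hand-written helper rather than the Keras implementation:

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |target - prediction| over all examples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]

print(mean_absolute_error(y_true, y_pred))  # 0.5
```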

### 2c) Train Model

Now train the network for 100 epochs with a batch size of 128. The input data is `X_train` and the target is `y_train`.

In [None]:
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=128,
    epochs=100,
)

### 2d) Evaluate Training

Finally, run the cell below to make a plot of the learning curves.

In [None]:
import pandas as pd

history_df = pd.DataFrame(history.history)
history_df.loc[10:, ['loss', 'val_loss']].plot()

If you trained the model longer, would you expect the loss to decrease further?

In [None]:
# THOUGHT QUESTION

In [None]:
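# answer: No, not if the learning curves have leveled off.
# If the loss were still falling at the end of training, more epochs could help.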
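One rough way to judge whether a loss curve has leveled off, using the numbers rather than the plot, is to compare the average loss over the last few epochs with the average over the few epochs before them. The sketch below does this on made-up loss histories. The helper `has_leveled_off`, the window of 5 epochs, and the 1% tolerance are arbitrary choices for illustration, not part of Keras.

```python
def has_leveled_off(losses, window=5, tolerance=0.01):
    """True if the mean loss improved by less than `tolerance` (relative)
    between the previous `window` epochs and the last `window` epochs."""
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return (previous - recent) / previous < tolerance

# Hypothetical histories, not output from the model above
still_improving = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.45, 0.4, 0.35, 0.3]
flattened = [1.0, 0.5, 0.3,
             0.201, 0.201, 0.201, 0.201, 0.201,
             0.2, 0.2, 0.2, 0.2, 0.2]

print(has_leveled_off(still_improving))  # False
print(has_leveled_off(flattened))        # True
```

With a real run, `history.history['val_loss']` would take the place of these lists.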

# Learning Rate and Batch Size #

Let's see how the learning rate and batch size affect how the training proceeds.
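For intuition about the learning rate before running the experiments, here is a toy gradient-descent loop in plain Python. It minimizes the one-dimensional function $f(w) = w^2$ rather than the network's MAE loss, so it is only an analogy, and the function name `descend` is made up for this sketch.

```python
def descend(learning_rate, steps=100, w=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        gradient = 2 * w                  # derivative of w**2
        w = w - learning_rate * gradient  # step against the gradient
    return w

for lr in (0.0001, 0.01, 1.0):
    print(f"learning_rate={lr}: final w = {descend(lr):.4f}")
```

In this toy problem the smallest rate barely moves from the starting point, the middle one makes steady progress toward the minimum at 0, and the largest overshoots on every step and never gets closer.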

### 3) Observe changes in the loss curve

Change the values for `learning_rate` and `batch_size` and then run the cell. Pay attention to how the loss curve changes. Try the following combinations:

| `learning_rate` | `batch_size` |
|-----------------|--------------|
| 0.01            | 128          |
| 0.0001          | 128          |
| 1.0             | 128          |
| 0.01            | 8            |
| 0.01            | 1024         |
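Batch size determines how many weight updates happen per epoch: each epoch is split into `ceil(n_samples / batch_size)` mini-batches, with one update per batch. The sketch below tabulates this for the batch sizes in the table, assuming a hypothetical training set of 1,000 rows (the real dataset size will differ). The helper `steps_per_epoch` is made up for this example.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of mini-batches (and so weight updates) in one pass over the data."""
    return math.ceil(n_samples / batch_size)

n_samples = 1000   # hypothetical training-set size
epochs = 100

for batch_size in (8, 128, 1024):
    steps = steps_per_epoch(n_samples, batch_size)
    print(f"batch_size={batch_size:>4}  steps/epoch={steps:>3}  total steps={steps * epochs}")
```

Smaller batches mean many more, noisier updates for the same number of epochs, while a batch larger than the dataset gives a single update per epoch.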

In [None]:
# YOUR CODE HERE
learning_rate = 0.01
batch_size = 2048


#-------------------------------------------------------------------------------#
bias_init = keras.initializers.constant(y_train.median()) # you can ignore!
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=input_shape),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, bias_initializer=bias_init),
])

optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
model.compile(
    optimizer=optimizer,
    loss='mae'
)
history = model.fit(
    X_train, y_train,
    batch_size=batch_size,
    epochs=100,
    verbose=0, # turn off output
)

import matplotlib.pyplot as plt  # needed for plt.show() below

history_df = pd.DataFrame(history.history)
history_df.loc[0:, 'loss'].plot()
plt.show()

What effect did changing the learning rate have? What effect did changing the batch size have?

In [None]:
# Thought Question

In [None]:
# Answer

# Conclusion #