# Deep Learning

**TensorFlow** is an end-to-end open-source platform for machine learning.

Building a model with its Keras API takes five steps:

1- Define the model,
    
    - layers,

2- Compile the model,
    
    - optimizer,
    - loss function,
    - metric,
    - model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])

3- Fit the model,

    - model.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=100, batch_size=32)

4- Evaluate the model,

    - model.evaluate(X_valid, y_valid, verbose=0)

5- Make predictions,

    - model.predict(X_new)


In a neural network, the activation function, loss function, and optimization algorithm largely determine how efficiently the model trains and how accurate its results are.

### Optimizer:

Optimizers are algorithms that adjust the parameters of a neural network, mainly its weights, to reduce the loss. **Backpropagation** computes the gradient of the loss with respect to each weight, and the optimizer uses that gradient to tell the network how to change its weights.
Most optimizers work from the gradient, i.e. the partial derivatives of the loss function with respect to the weights, and move the weights in the opposite direction of that gradient. This cycle is repeated until the loss reaches a minimum (in practice, often a local one).

- **adam**,
- **Stochastic Gradient Descent (sgd)**,
    * The _gradient_ is a vector that tells us in what direction the weights need to go. More precisely, it tells us how to change the weights to make the loss change fastest. We call our process gradient _descent_ because it uses the gradient to descend the loss curve towards a minimum. _Stochastic_ means "determined by chance." Our training is stochastic because the minibatches are random samples from the dataset. And that's why it's called SGD!
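
The SGD loop described above can be sketched in plain NumPy. This is a minimal illustration rather than what Keras does internally: the data is synthetic, the model is a single linear unit, and the names (`learning_rate`, `batch_size`, `grad_w`, `grad_b`) are chosen here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (made up for illustration): y = 2x + 1 plus small noise.
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0 + rng.normal(0.0, 0.05, size=200)

w, b = 0.0, 0.0            # arbitrary starting weights
learning_rate = 0.1
batch_size = 32

for epoch in range(100):
    order = rng.permutation(len(X))          # reshuffle so each minibatch is a random sample
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        error = (w * X[idx] + b) - y[idx]
        # Gradient of the mean squared error with respect to w and b.
        grad_w = 2.0 * np.mean(error * X[idx])
        grad_b = 2.0 * np.mean(error)
        # Move the weights in the direction opposite to the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))   # should land close to the true values 2 and 1
```

Each pass of the inner loop is one weight update computed from one random minibatch, which is where the "stochastic" in SGD comes from.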



### Loss function:
A loss function measures the difference between the network's output and the target variable, i.e. how good the network's predictions are. Three of the most common loss functions are:

- **binary_crossentropy** for binary classification,
- **sparse_categorical_crossentropy** for multi-class classification,
- **mse** (mean squared error) for regression,
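
To make these three losses concrete, here is a small NumPy sketch of what each one computes. The function names mirror the Keras loss strings, but the implementations are simplified illustrations written for these notes (for example, the `eps` clipping value is an assumption), not the Keras source.

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy; p_pred holds predicted probabilities of class 1."""
    p = np.clip(p_pred, eps, 1.0 - eps)            # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def sparse_categorical_crossentropy(labels, probs, eps=1e-7):
    """Mean cross-entropy; labels are integer class ids, probs has one row per sample."""
    p_correct = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return float(np.mean(-np.log(p_correct)))

def mse(y_true, y_pred):
    """Mean squared error for regression targets."""
    return float(np.mean((y_true - y_pred) ** 2))

print(mse(np.array([1.0, 2.0]), np.array([1.0, 4.0])))   # (0 + 4) / 2 = 2.0
```

Note that the "sparse" variant takes integer class labels directly, whereas the two cross-entropy losses both expect predicted probabilities rather than raw scores.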


#### Keywords:

- **epochs**: the number of complete passes through the training dataset, i.e. how many times the network will see each training example.

- **batch_size**: the number of samples used for each weight update. Each iteration's sample of training data is called a minibatch (or often just "batch").

- **verbose**: controls how much progress output Keras prints during training: `0` is silent, `1` shows a progress bar, and `2` prints one line per epoch.
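
The relationship between epochs and batch_size comes down to simple arithmetic. The sketch below assumes the last, smaller batch is kept rather than dropped; the dataset size and the helper name `updates_per_epoch` are hypothetical.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of minibatches (and so weight updates) in one pass over the data."""
    return math.ceil(num_samples / batch_size)

num_samples = 1000      # hypothetical dataset size
batch_size = 32
epochs = 100

steps = updates_per_epoch(num_samples, batch_size)
print(steps)            # 32 batches: 31 full ones plus a final batch of 8 samples
print(steps * epochs)   # 3200 weight updates in total
```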


In [None]:
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

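# 1) Define the model
# no hidden layer: a single linear layer
# (number_of_outputs and number_of_inputs are placeholders to fill in)
model = keras.Sequential([
    layers.Dense(units=number_of_outputs,
                 input_shape=[number_of_inputs])
])

# 2) Compile the model
model.compile(
    optimizer="adam",
    loss="mae",
)

# 3) Fit the model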
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=10,
)

## Plot the loss recorded during training (see the note below)

history_df = pd.DataFrame(history.history)

history_df['loss'].plot();

The `fit` method keeps a record of the training and validation loss from each epoch in a `History` object. Converting its `history` dictionary to a pandas DataFrame makes plotting easy.
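
To see what that conversion produces without training anything, here is a small sketch that uses a stand-in dictionary in place of `history.history`. The numbers are made up for illustration and are not real training output; `fake_history` and `best_epoch` are names chosen for this example.

```python
import pandas as pd

# Stand-in for history.history: metric name -> one value per epoch.
# These values are invented for illustration only.
fake_history = {
    "loss":     [0.90, 0.60, 0.45, 0.40, 0.38],
    "val_loss": [0.95, 0.70, 0.55, 0.57, 0.60],
}

history_df = pd.DataFrame(fake_history)        # one row per epoch, one column per metric
best_epoch = history_df["val_loss"].idxmin()   # row index with the lowest validation loss
print(best_epoch)                              # 2 (epochs are counted from 0 here)
```

Once the record is in a DataFrame, the same column selection used for plotting also works for quick checks such as finding the epoch with the lowest validation loss.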

# Neural network:

### Single neuron:
The fundamental component of a neural network is the **linear unit**, a single **neuron**. With one input $x$, it computes $y = w x + b$. The input $x$ is connected to the neuron by a weight $w$, and the bias $b$ lets the neuron shift its output independently of its input.
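
As a minimal sketch, a single neuron with one input is just one line of arithmetic. The weight and bias below are hypothetical values picked for illustration; in a real network they would be learned during training.

```python
def linear_unit(x, w, b):
    """A single neuron with one input: y = w * x + b."""
    return w * x + b

# Hypothetical weight and bias, chosen only for illustration.
w, b = 2.5, 90.0
print(linear_unit(5.0, w, b))   # 2.5 * 5.0 + 90.0 = 102.5
print(linear_unit(0.0, w, b))   # with x = 0 the output is just the bias: 90.0
```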

### Layers:
Neural networks typically organize their neurons into layers. When we collect together linear units that share a common set of inputs, we get a **dense** layer.
Keras also provides other types of layers, for example:
- convolutional
- recurrent

The layers before the output layer are sometimes called hidden, since we never see their outputs directly.
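
A dense layer can be sketched as a matrix product: every unit receives the same input vector and has its own column of weights and its own bias. The shapes and values below are illustrative assumptions, and the helper `dense` is written for these notes rather than taken from Keras.

```python
import numpy as np

def dense(x, W, b):
    """Dense layer without activation: every unit sees the same input vector x."""
    return x @ W + b

x = np.array([1.0, 2.0, 3.0])      # 3 inputs
W = np.array([[1.0, 0.0],          # shape (3 inputs, 2 units); column j holds unit j's weights
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.5, -0.5])          # one bias per unit

print(dense(x, W, b))              # [4.5 4.5]
```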

Layers become more useful when they are connected through an **activation function**; without one, a stack of dense layers can only represent linear relationships. The activation function is applied to each of a layer's outputs (its activations). The most common is the rectifier function (ReLU).
- ReLU
    - output range [0, infinity)
- Leaky ReLU
- Sigmoid (logistic) activation function
    - S-shaped, with outputs in (0, 1)
    - It is especially useful for models that predict a probability, since a probability is a number between 0 and 1.
- Softmax function
    - a generalization of the logistic function to several classes, used for multi-class classification.
- Tanh
    - S-shaped, with outputs in (-1, 1)
    - similar to the logistic sigmoid, but its output is zero-centered, which often makes it a better choice for hidden layers.
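
The activation functions listed above are short enough to write out in NumPy. This is a sketch for intuition about their output ranges; the `alpha` slope for Leaky ReLU is an illustrative default chosen here, not a value taken from the notes.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):      # alpha: small slope for negative inputs (illustrative)
    return np.where(z > 0, z, alpha * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))       # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))                      # [0. 0. 3.]
print(sigmoid(0.0), np.tanh(0.0))   # 0.5 0.0
print(softmax(z).sum())             # the outputs sum to 1 (up to rounding)
```

Sigmoid squashes each value independently into (0, 1), while softmax turns a whole vector of scores into probabilities that sum to 1, which is why it suits multi-class outputs.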

Depending on the task, the output layer may or may not need an activation function. Leaving it out makes the network suitable for regression, where we are trying to predict an arbitrary numeric value. Other tasks, such as classification, might require an activation function on the output.

In [None]:
model = keras.Sequential([
    # the hidden layers ('activation_function' is a placeholder, e.g. 'relu')
    layers.Dense(units=number_of_outputs_1st_layer, activation='activation_function', input_shape=[number_of_inputs]),
    layers.Dense(units=number_of_outputs_2nd_layer, activation='activation_function'),
    # the linear output layer
    layers.Dense(units=1),
])



## Learning rate and minibatches

The **learning rate** is a tuning parameter in an **optimization** algorithm that determines the step size at each iteration while moving toward a **minimum** of a loss function.

A **smaller** learning rate means the network needs to see **more** minibatches before its weights converge to their best values.
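
The effect of a smaller learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on the one-parameter loss $(w - 3)^2$, with each step standing in for one minibatch update; the loss, the tolerance `tol`, and the helper name are all assumptions made for this example.

```python
def steps_to_converge(learning_rate, tol=1e-6, max_steps=100_000):
    """Run gradient descent on loss(w) = (w - 3)**2 and count the steps taken."""
    w = 0.0
    for step in range(1, max_steps + 1):
        grad = 2.0 * (w - 3.0)          # derivative of the loss with respect to w
        w -= learning_rate * grad       # step size is set by the learning rate
        if abs(w - 3.0) < tol:
            return step
    return max_steps

print(steps_to_converge(0.1) < steps_to_converge(0.01))   # True
```

On this toy loss the smaller rate needs roughly ten times as many updates to get equally close to the minimum.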

The learning rate and the size of the minibatches are the two parameters that have the largest effect on how the SGD training proceeds. Their interaction is often subtle and the right choice for these parameters isn't always obvious.

Fortunately, for most work it won't be necessary to do an extensive hyperparameter search to get satisfactory results. **Adam** is an SGD algorithm that has an adaptive learning rate that makes it suitable for most problems without any parameter tuning (it is "self tuning", in a sense). Adam is a great general-purpose optimizer.



### Capacity

We can think of the information in the training data as being of two kinds: _signal_ and _noise_. The _signal_ is the part that _generalizes_, the part that can help our model make predictions from new data. The _noise_ is the part that is _only_ true of the training data; it is all of the random fluctuation that comes from real-world data, along with all of the incidental, non-informative patterns that can't actually help the model make predictions.

To learn more of the signal and less of the noise from the training data, we need to consider the model's capacity.

A model's **capacity** refers to the size and complexity of the patterns it is able to learn. For neural networks, this will largely be determined by how many neurons it has and how they are connected together. If it appears that your network is underfitting the data, you should try increasing its capacity.

You can increase the capacity of a network either by making it **wider** (more units to existing layers) or by making it **deeper** (adding more layers). Wider networks have an easier time learning more linear relationships, while deeper networks prefer more nonlinear ones. Which is better just depends on the dataset.
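
One rough way to compare a wider and a deeper variant is to count trainable parameters. Parameter count is only a crude proxy for capacity, and the three architectures below are hypothetical; the helper `dense_param_count` is written for these notes.

```python
def dense_param_count(num_inputs, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total, fan_in = 0, num_inputs
    for units in layer_units:
        total += fan_in * units + units   # one weight per input-unit pair, plus one bias per unit
        fan_in = units
    return total

# Hypothetical architectures with 11 input features and a single output.
baseline = dense_param_count(11, [64, 1])
wider = dense_param_count(11, [128, 1])
deeper = dense_param_count(11, [64, 64, 1])
print(baseline, wider, deeper)   # 833 1665 4993
```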

### Early Stopping

**Early stopping** is a form of **regularization** used to avoid overfitting when training a learner with an iterative method, such as gradient descent. 

When a model is too eagerly learning noise, the validation loss may start to increase during training. To prevent this, we can simply stop the training whenever it seems the validation loss isn't decreasing anymore. Interrupting the training this way is called early stopping. Besides preventing overfitting from training too long, early stopping can also prevent underfitting from not training long enough: set the number of epochs to a large value, more than you expect to need, and let early stopping decide when to halt.
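
The stopping rule itself is simple enough to sketch in plain Python. This is a simplified illustration of the idea (track the best validation loss, count epochs without sufficient improvement), not the Keras implementation; the validation losses are made up and the function name is hypothetical.

```python
def early_stopping_epoch(val_losses, min_delta=0.001, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:      # improved by more than min_delta
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:              # no improvement for `patience` epochs in a row
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch    # never triggered: trained to the end

# Made-up validation losses: they improve, then drift upward as overfitting sets in.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.61, 0.65]
print(early_stopping_epoch(val_losses))       # (6, 3)
```

Training stops at epoch 6, but the best weights came from epoch 3, which is why restoring the best weights after stopping is useful.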

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)

In [None]:
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)

model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1),
])
model.compile(
    optimizer='adam',
    loss='mae',
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=500,
    callbacks=[early_stopping], # put your callbacks in a list
    verbose=0,  # turn off training log
)

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();


