# Deep learning notes

## Notes for later

- Review backward and forward propagation - done
- Look into the dot product and its geometric interpretations
- Look into Keras
- Check the difference between tensorflow.keras and keras
- Find an actual dataset
- How to split into classes without unintentionally ordering them?

## Definitions

__Supervised learning__ occurs when your deep learning model learns and makes inferences from data that has already been labeled 

__Unsupervised learning__ occurs when the model learns and makes inferences from unlabeled data 

__Artificial neural networks__ are deep learning models that are based on the structure of the brain's neural networks. Also called a neural net, net, or model

An __activation function__ of a neuron defines the output of that neuron given a set of inputs

__ReLU__ - rectified linear unit ($\max(0,x)$)

__Sigmoid activation function__ - $\cfrac{1}{1 + e^{-x}}$
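
As a quick check of the two formulas above, here is a minimal sketch in plain Python (no Keras needed). The function names `relu` and `sigmoid` are chosen here for illustration only.

```python
import math


def relu(x: float) -> float:
    """Rectified linear unit: keeps positive inputs, clips negatives to 0."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Logistic sigmoid: squashes a real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Compare both activations on a negative, a zero, and a positive input
for x in (-2.0, 0.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.3f}  sigmoid={sigmoid(x):.3f}")
```

Note that ReLU is unbounded above, while the sigmoid always stays between 0 and 1.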

__Learning__ is about finding the right weights and biases

__Cost (loss) function__ - a function that maps an event or the values of one or more variables onto a real number representing the "cost" associated with the event. We are seeking to minimize the cost function

__Gradient descent__ - first-order iterative optimization algorithm for finding a local minimum of a differentiable function

__Epoch__ - a single pass of the whole training set through the model. Training usually runs for multiple epochs.
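
When the data is fed to the model in batches, one epoch is made up of several steps, one per batch. The sketch below shows the arithmetic. The numbers are assumptions: 142 training rows (the 178 wine samples used later in these notes minus every fifth row) and a batch size of 10. The helper name `steps_per_epoch` is hypothetical.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)


num_samples = 142  # assumption: 178 samples minus the 36 held out for testing
batch_size = 10
epochs = 20

steps = steps_per_epoch(num_samples, batch_size)  # 15 batches per epoch
total_updates = steps * epochs                    # 300 weight updates in total
print(steps, total_updates)
```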

__SGD (Stochastic Gradient Descent)__ - a type of gradient descent. In each iteration, a single randomly selected sample (or a small random batch of samples) is used instead of the whole dataset

__The loss function__ is what the gradient descent algorithm is trying to minimize. It is the "distance"/error between the actual and the computed results

__Mean Squared Error (MSE)__ - a common loss function. Here, we get the error by taking the difference between the value the model predicted and the correct label. The formula is given by:
$$\cfrac{\sum e_i^2}{n}$$
where $e_i$ is the error on the $i$th prediction and $n$ is the total number of predictions
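
A minimal sketch of the MSE formula above in plain Python. The labels and predictions are made-up toy values, and `mean_squared_error` is a name chosen here, not a Keras function.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    return sum(e * e for e in errors) / len(errors)


# Toy values for illustration only
labels = [1.0, 0.0, 1.0, 1.0]
predictions = [0.9, 0.2, 0.6, 1.0]

mse = mean_squared_error(labels, predictions)
print(f"MSE = {mse:.4f}")
```

Squaring makes every error positive and penalizes large errors more than small ones.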

__Learning rate__ - the number we scale the gradient by. Can be thought of as the step size

__Training data__ - used to train the model. The hope is that the data is general enough so we can use the trained model to predict on new data.

__Validation set__ - used to validate our model during training. Helps give information that can assist with adjusting hyperparameters. Helps detect overfitting.

__Test set__ - used to test the final model obtained from the training and validation sets. It is passed to the model without labels; labels are only needed afterwards to score the predictions.

__Overfitting__ occurs when our model is good at predicting the training data but does not perform well on the test set. That is, the model is unable to generalize well

__Data Augmentation__ - the process of creating additional data by reasonably modifying the data in our training set. It lets us add samples that are similar to the ones we already have but slightly changed.
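
As an illustration, here is a minimal sketch of one common augmentation, a horizontal flip, on a tiny made-up grayscale image stored as a list of rows. The function names and the toy data are hypothetical. Whether a flip is a "reasonable" modification depends on the task: it is fine for many photos but would change the meaning of digits or text.

```python
def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def augment(dataset):
    """Return the original (image, label) pairs plus a flipped copy of each."""
    augmented = list(dataset)
    for image, label in dataset:
        augmented.append((flip_horizontal(image), label))
    return augmented


# Toy 2x3 "image" with a made-up label
image = [[0, 1, 2],
         [3, 4, 5]]
training_set = [(image, "cat")]

augmented_set = augment(training_set)
print(len(augmented_set))   # twice as many samples as before
print(augmented_set[1][0])  # the flipped copy
```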

__Dropout__ - the model randomly ignores a subset of nodes in a given layer during training. 
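
A rough sketch of what dropout does to one layer's outputs during training, in plain Python. It assumes the "inverted dropout" convention, where the kept values are scaled up by 1 / (1 - rate) so the expected sum stays the same (Keras's `Dropout` layer documents the same scaling). The function name and the toy activations are illustrative only.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)    # fixed seed so the run is repeatable
layer_output = [1.0] * 10  # toy activations of a 10-node layer

dropped = dropout(layer_output, rate=0.5, rng=rng)
print(dropped)  # a mix of 0.0 (ignored nodes) and 2.0 (kept, rescaled nodes)
```

Dropout is only applied during training; at prediction time all nodes are used.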


## General notes

Neurons are organized in layers:

- Input layer
- Hidden layer
- Output layer

Each node is a neuron. Each vertical line is a layer. The hidden layers are between the input and output layers.

How do you build one?

With Keras!!

In [19]:
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential([Dense(32, input_shape = (10, ), activation = "relu"), 
                    Dense(2, activation = "softmax")])


In [81]:
import sklearn.datasets

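# Load the dataset once and reuse it for both features and labels
wine = sklearn.datasets.load_wine(as_frame = True)
data = wine['data']
# Every fifth row is held out for testing, so keep labels for the other rows only
labels = wine['target'][data.index % 5 != 0].reset_index()

# Scale each feature by its column maximum so the largest value becomes 1
for i in data.columns:
    data[i] = data[i].divide(data[i].max())
# Note: reset_index() keeps the old row number as an extra "index" column;
# use reset_index(drop = True) to discard it
train = data[data.index % 5 != 0].reset_index()
test = data[data.index % 5 == 0].reset_index()
labels
train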

| | index | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.890088 | 0.306897 | 0.662539 | 0.373333 | 0.617284 | 0.682990 | 0.543307 | 0.393939 | 0.357542 | 0.336923 | 0.614035 | 0.8500 | 0.625000 |
| 1 | 2 | 0.887390 | 0.406897 | 0.826625 | 0.620000 | 0.623457 | 0.721649 | 0.637795 | 0.454545 | 0.784916 | 0.436923 | 0.602339 | 0.7925 | 0.705357 |
| 2 | 3 | 0.968982 | 0.336207 | 0.773994 | 0.560000 | 0.697531 | 0.992268 | 0.687008 | 0.363636 | 0.608939 | 0.600000 | 0.502924 | 0.8625 | 0.880952 |
| 3 | 4 | 0.892785 | 0.446552 | 0.888545 | 0.700000 | 0.728395 | 0.721649 | 0.529528 | 0.590909 | 0.508380 | 0.332308 | 0.608187 | 0.7325 | 0.437500 |
| 4 | 6 | 0.970330 | 0.322414 | 0.758514 | 0.486667 | 0.592593 | 0.644330 | 0.496063 | 0.454545 | 0.553073 | 0.403846 | 0.596491 | 0.8950 | 0.767857 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 137 | 172 | 0.954821 | 0.432759 | 0.767802 | 0.666667 | 0.561728 | 0.432990 | 0.137795 | 0.666667 | 0.346369 | 0.746154 | 0.362573 | 0.4275 | 0.392857 |
| 138 | 173 | 0.924477 | 0.974138 | 0.758514 | 0.683333 | 0.586420 | 0.432990 | 0.120079 | 0.787879 | 0.296089 | 0.592308 | 0.374269 | 0.4350 | 0.440476 |
| 139 | 174 | 0.903574 | 0.674138 | 0.767802 | 0.766667 | 0.629630 | 0.463918 | 0.147638 | 0.651515 | 0.393855 | 0.561538 | 0.409357 | 0.3900 | 0.446429 |
| 140 | 176 | 0.888065 | 0.446552 | 0.733746 | 0.666667 | 0.740741 | 0.425258 | 0.133858 | 0.803030 | 0.407821 | 0.715385 | 0.350877 | 0.4050 | 0.500000 |


Dense is the most basic type of layer. It connects each input to each output. The first parameter is the number of nodes in the layer.
Some commonly used layers:
- Dense (or fully connected) - connects each input to each output
- Convolutional layers - image data
- Pooling layers
- Recurrent layers - time series data
- Normalization layers
- Many others

Each connection passes the output of a unit in the previous layer as input to the receiving unit. Each connection has an assigned weight. The receiving unit computes the weighted sum of its inputs, which is then passed through the activation function to produce its output.
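
To make the weighted sum concrete, here is a small hand-rolled sketch of one dense layer in plain Python. The weights, biases, and inputs are made-up numbers, and `dense_forward` is a hypothetical helper, not part of Keras. The bias term follows the "weights and biases" mentioned in the definitions.

```python
def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


def dense_forward(inputs, weights, biases):
    """Weighted sum per unit; weights[j][i] links input i to output unit j."""
    return [sum(w * x for w, x in zip(unit_weights, inputs)) + b
            for unit_weights, b in zip(weights, biases)]


inputs = [1.0, 2.0, 3.0]
weights = [[0.5, -1.0, 0.5],   # unit 0: 0.5 - 2.0 + 1.5 = 0.0
           [-1.0, 0.0, 0.0]]   # unit 1: -1.0
biases = [1.0, 0.0]

pre_activation = dense_forward(inputs, weights, biases)  # [1.0, -1.0]
outputs = [relu(z) for z in pre_activation]              # [1.0, 0.0]
print(pre_activation, outputs)
```

Every input feeds every unit, which is why this kind of layer is called fully connected.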

Output - the classification categories
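
The models in these notes use a softmax activation on the output layer, which turns the raw scores of the output nodes into one probability per category. A minimal plain-Python sketch follows; the scores are made up for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # toy scores for three categories
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The category with the highest probability is taken as the prediction.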


In [29]:
from tensorflow import keras
from keras.models import Sequential 
from keras.layers import Dense, Activation

model = Sequential([ Dense(5, input_shape = (13,), activation = "relu"), 
                   Dense(2, activation = "softmax"),
                   ])
# Note that the input shape for layers past the first one is not required
# because Keras infers it from the output of the previous layer

After the activation function is applied, the neuron is either fired or not fired: 0 means not firing and 1 means firing.

However, an activation function does not always return a value between 0 and 1. One example is ReLU, the most widely used activation function. The main idea is that the more positive the neuron's output, the more active it is.

Another way to define a sequential model is

In [30]:
model = Sequential()
model.add(Dense(5, input_shape = (13, )))
model.add(Activation("relu"))
# Here, the activation layer is added separately from the Dense layer.
# The result is the same though

The initial weights are randomized (in Keras, Dense biases start at zero by default). Then, a cost function is defined to tell the computer how far the output is from the expected one. Finally, the cost function is minimized.
 
 For example, we can have a sum of the squares of differences between the expected and the observed results. 
 
 !! REMEMBER !!
 
 The gradient is pointing in the direction of largest increase. Hence, as we are looking to minimize the cost function, we will be adjusting the parameters in the direction opposite of the gradient. This process is called gradient descent. 
 
The gradient is multiplied by the learning rate, a small number typically between 0.0001 and 0.01. This scales how much the weights are adjusted in each iteration. When the value is set too high, we risk overshooting the minimum. If it's too low, on the other hand, reaching the minimum takes much longer.
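
The update rule described above is $w \leftarrow w - \eta \, \nabla C(w)$, where $\eta$ is the learning rate. Below is a minimal sketch on a one-parameter toy cost $C(w) = (w - 3)^2$, chosen here purely for illustration (its minimum is at $w = 3$). The three learning rates are assumptions picked to show a good, a too-small, and a too-large step size.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


def cost_gradient(w):
    """Derivative of the toy cost C(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_good = gradient_descent(cost_gradient, start=0.0, learning_rate=0.1, steps=100)
w_slow = gradient_descent(cost_gradient, start=0.0, learning_rate=0.0001, steps=100)
w_diverged = gradient_descent(cost_gradient, start=0.0, learning_rate=1.1, steps=100)

print(w_good)      # very close to the minimum at 3
print(w_slow)      # has barely moved away from the start
print(w_diverged)  # overshoots further on every step and blows up
```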
 
 Why use SGD? 
Batch gradient descent (which uses the whole dataset for every update) reduces the amount of noise and randomness. However, problems come up when the dataset is too big: using all data entries for each update becomes very computationally expensive. This is where SGD is used: the data is randomly shuffled and a batch of size 1 is selected for each iteration.

Since only a single entry is used, the path taken by the algorithm is much noisier. While it usually takes more iterations to reach a minimum with SGD, each iteration is much cheaper, so it is still computationally far less expensive than batch gradient descent.
 
 source: https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/
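
To contrast the two approaches, here is a small sketch that fits a single weight $w$ in $y = w x$ with both methods. Everything in it is an assumption for illustration: the toy data (generated with $w = 2$ and no noise), the learning rate, and the function names. The point is only the difference in how much data each update touches.

```python
import random


def sgd_fit(xs, ys, learning_rate, epochs, rng):
    """SGD with batch size 1: one weight update per sample."""
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order
        for i in indices:
            error = w * xs[i] - ys[i]
            w -= learning_rate * 2.0 * error * xs[i]
    return w


def batch_fit(xs, ys, learning_rate, epochs):
    """Batch gradient descent: one update per pass over the whole dataset."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w


xs = [0.1 * i for i in range(1, 11)]  # toy inputs 0.1 .. 1.0
ys = [2.0 * x for x in xs]            # targets generated with w = 2

w_sgd = sgd_fit(xs, ys, learning_rate=0.1, epochs=50, rng=random.Random(0))
w_batch = batch_fit(xs, ys, learning_rate=0.1, epochs=50)
print(w_sgd, w_batch)  # both approach 2
```

Each SGD update looks at one sample, while each batch update needs a full pass over the data. Because this toy data has no noise, the noisier path of SGD is not visible here.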

Example

In [36]:
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import categorical_crossentropy

In [55]:
model = Sequential([
    Dense(10, input_shape = (13,), activation = "relu"), 
    Dense(8, activation = "relu"),
    Dense(3, activation = "softmax")
])

In [56]:
model.compile(Adam(learning_rate = 0.001), 
              loss = "sparse_categorical_crossentropy", 
              metrics = ["accuracy"])
# The first parameter is the optimizer. In this case it's Adam,
# which is an adaptive variant of SGD
# The learning rate is passed to the optimizer itself
# You can also set the loss by 
model.loss = "sparse_categorical_crossentropy"

In [57]:
model.fit(train, labels, 
          batch_size = 10, epochs = 20, shuffle = True,
         verbose = 2)
# This is the function that actually trains the model
# The first parameter is the training data
# The second parameter contains the labels
# Both can be numpy arrays; here they are pandas DataFrames
# Batch size - how many pieces of data we want 
# to be sent to the model at once
# Epochs - there will be 20 individual passes through the data
# Shuffle - the data is shuffled before each epoch

# find other data

Epoch 1/20
15/15 - 1s - loss: 1.1803 - accuracy: 0.1333 - 518ms/epoch - 35ms/step
Epoch 2/20
15/15 - 0s - loss: 1.1034 - accuracy: 0.4000 - 34ms/epoch - 2ms/step
Epoch 3/20
15/15 - 0s - loss: 1.0786 - accuracy: 0.5200 - 46ms/epoch - 3ms/step
Epoch 4/20
15/15 - 0s - loss: 1.0673 - accuracy: 0.5133 - 29ms/epoch - 2ms/step
Epoch 5/20
15/15 - 0s - loss: 1.0564 - accuracy: 0.5467 - 34ms/epoch - 2ms/step
Epoch 6/20
15/15 - 0s - loss: 1.0447 - accuracy: 0.6267 - 45ms/epoch - 3ms/step
Epoch 7/20
15/15 - 0s - loss: 1.0337 - accuracy: 0.7200 - 41ms/epoch - 3ms/step
Epoch 8/20
15/15 - 0s - loss: 1.0235 - accuracy: 0.7333 - 45ms/epoch - 3ms/step
Epoch 9/20
15/15 - 0s - loss: 1.0135 - accuracy: 0.7533 - 56ms/epoch - 4ms/step
Epoch 10/20
15/15 - 0s - loss: 1.0033 - accuracy: 0.7800 - 49ms/epoch - 3ms/step
Epoch 11/20
15/15 - 0s - loss: 0.9922 - accuracy: 0.7867 - 30ms/epoch - 2ms/step
Epoch 12/20
15/15 - 0s - loss: 0.9803 - accuracy: 0.8067 - 40ms/epoch - 3ms/step
Epoch 13/20
15/15 - 0s - loss: 0.96

<keras.callbacks.History at 0x7f8c2d05fe20>

To check or set the learning rate:

In [58]:
model.optimizer.lr

<tf.Variable 'Adam/learning_rate:0' shape=() dtype=float32, numpy=0.001>

 For training and testing purposes, the dataset should be broken down into three distinct datasets:
 - Training set
 - Validation set
 - Test set
 
 The model is trained on the training set and simultaneously validated with the validation set. The weights are not updated in the validation step. The main goal of the validation set is to make sure the model is not overfitting the data. If the results in the training set are significantly better than in the validation set, the model is likely overfitting.
 
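 When creating a model with Keras, we do not need to supply a separate validation set. We can set the validation_split parameter, which instructs Keras to split off a certain fraction of the training data and use it as validation data. Note that Keras takes this fraction from the end of the data, before shuffling, so data that is sorted by class should be shuffled first.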
 
 Example:

In [59]:
model.fit(train, labels, 
          validation_split = 0.2, batch_size = 10, 
          epochs = 20, shuffle = True, verbose = 2)

Epoch 1/20
12/12 - 0s - loss: 0.7794 - accuracy: 0.9333 - val_loss: 1.0791 - val_accuracy: 0.2667 - 195ms/epoch - 16ms/step
Epoch 2/20
12/12 - 0s - loss: 0.7506 - accuracy: 0.9083 - val_loss: 1.1317 - val_accuracy: 0.2333 - 53ms/epoch - 4ms/step
Epoch 3/20
12/12 - 0s - loss: 0.7234 - accuracy: 0.9000 - val_loss: 1.1508 - val_accuracy: 0.2333 - 55ms/epoch - 5ms/step
Epoch 4/20
12/12 - 0s - loss: 0.6956 - accuracy: 0.9083 - val_loss: 1.1675 - val_accuracy: 0.2667 - 48ms/epoch - 4ms/step
Epoch 5/20
12/12 - 0s - loss: 0.6677 - accuracy: 0.9250 - val_loss: 1.1892 - val_accuracy: 0.2667 - 53ms/epoch - 4ms/step
Epoch 6/20
12/12 - 0s - loss: 0.6405 - accuracy: 0.9333 - val_loss: 1.1930 - val_accuracy: 0.2667 - 55ms/epoch - 5ms/step
Epoch 7/20
12/12 - 0s - loss: 0.6147 - accuracy: 0.9500 - val_loss: 1.2245 - val_accuracy: 0.2667 - 76ms/epoch - 6ms/step
Epoch 8/20
12/12 - 0s - loss: 0.5852 - accuracy: 0.9333 - val_loss: 1.2697 - val_accuracy: 0.2667 - 95ms/epoch - 8ms/step
Epoch 9/20
12/12 - 0s 

<keras.callbacks.History at 0x7f8c2c7a5880>

When adding a validation set, we also get performance metrics for the validation data.
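
One possible explanation for the low val_accuracy in the log above: according to the Keras documentation, validation_split takes the last fraction of the rows before any shuffling, and the wine dataset is stored sorted by class, so the validation rows may come almost entirely from one class. The sketch below shows the effect on made-up toy labels and one way around it, shuffling the row order with a fixed seed before splitting. The helper `shuffled_split` is hypothetical.

```python
import random
from collections import Counter


def shuffled_split(features, labels, validation_fraction, seed):
    """Shuffle row indices, then carve off a validation set."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(indices) * validation_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    x_train = [features[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_val = [features[i] for i in val_idx]
    y_val = [labels[i] for i in val_idx]
    return x_train, y_train, x_val, y_val


# Toy data sorted by class, like a dataset stored class by class
labels = [0] * 10 + [1] * 10 + [2] * 10
features = [[float(i)] for i in range(30)]

# Taking the last 20% without shuffling yields a single class
naive_val_labels = labels[-6:]
print(Counter(naive_val_labels))

# Shuffling first mixes the classes across both sets
x_train, y_train, x_val, y_val = shuffled_split(features, labels, 0.2, seed=0)
print(Counter(y_train), Counter(y_val))
```

If exact class proportions matter, a stratified split (for example the stratify argument of scikit-learn's train_test_split) is another option.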

The validation set can also be created explicitly using validation_data parameter.
For example:

In [63]:

valid_set = train[train.index % 5 == 0]
valid_set_labels = labels[labels.index % 5 == 0]
train2 = train[train.index % 5 != 0]
labels2 = labels[labels.index % 5 != 0]
model.fit(train2, labels2,
          validation_data = (valid_set, valid_set_labels), batch_size = 10, 
          epochs = 20, shuffle = True, verbose = 2)

Epoch 1/20
12/12 - 0s - loss: 0.6321 - accuracy: 0.8250 - val_loss: 0.6317 - val_accuracy: 0.8333 - 111ms/epoch - 9ms/step
Epoch 2/20
12/12 - 0s - loss: 0.6074 - accuracy: 0.8333 - val_loss: 0.6064 - val_accuracy: 0.8333 - 48ms/epoch - 4ms/step
Epoch 3/20
12/12 - 0s - loss: 0.5794 - accuracy: 0.8333 - val_loss: 0.5811 - val_accuracy: 0.8333 - 61ms/epoch - 5ms/step
Epoch 4/20
12/12 - 0s - loss: 0.5427 - accuracy: 0.8333 - val_loss: 0.5534 - val_accuracy: 0.8333 - 77ms/epoch - 6ms/step
Epoch 5/20
12/12 - 0s - loss: 0.5090 - accuracy: 0.8333 - val_loss: 0.5242 - val_accuracy: 0.8333 - 83ms/epoch - 7ms/step
Epoch 6/20
12/12 - 0s - loss: 0.4793 - accuracy: 0.8333 - val_loss: 0.5006 - val_accuracy: 0.8333 - 78ms/epoch - 7ms/step
Epoch 7/20
12/12 - 0s - loss: 0.4584 - accuracy: 0.8417 - val_loss: 0.4841 - val_accuracy: 0.8333 - 74ms/epoch - 6ms/step
Epoch 8/20
12/12 - 0s - loss: 0.4364 - accuracy: 0.8417 - val_loss: 0.4746 - val_accuracy: 0.8333 - 66ms/epoch - 5ms/step
Epoch 9/20
12/12 - 0s -

<keras.callbacks.History at 0x7f8c2d0f2dc0>

The test set would be structured the same way the training set is but without the labels. We will be using that set when we call the predict function on our model.

We need to make sure our training and validation sets are representative of the actual data.

To predict with keras:

In [65]:
predictions = model.predict(test, 
                            batch_size = 10, verbose = 0)
# First is the variable holding the test data
predictions

array([[0.13130848, 0.5342713 , 0.3344202 ],
       [0.3565282 , 0.37334207, 0.27012974],
       [0.02354746, 0.7503886 , 0.226064  ],
       [0.37337756, 0.3908902 , 0.23573218],
       [0.08957515, 0.65465724, 0.25576752],
       [0.18293667, 0.46714225, 0.34992114],
       [0.18896735, 0.49854043, 0.31249222],
       [0.14090349, 0.5300852 , 0.32901138],
       [0.44569394, 0.3899349 , 0.16437116],
       [0.400489  , 0.4116825 , 0.18782845],
       [0.17960975, 0.48920405, 0.33118623],
       [0.092433  , 0.53762347, 0.36994356],
       [0.03154162, 0.6732346 , 0.2952238 ],
       [0.16717309, 0.4861888 , 0.34663817],
       [0.19798373, 0.48694307, 0.3150732 ],
       [0.03746299, 0.61912835, 0.34340864],
       [0.26895452, 0.4130004 , 0.31804517],
       [0.42786053, 0.32802674, 0.24411277],
       [0.17660657, 0.531272  , 0.2921214 ],
       [0.2577894 , 0.39539802, 0.34681252],
       [0.06826068, 0.6129884 , 0.3187509 ],
       [0.17096773, 0.5241269 , 0.30490535],
       [0.
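
Each row of the predictions array holds one probability per class, so the predicted class is the position of the largest value. A minimal sketch follows, using four rows copied from the output above (rows 1 to 3 and row 9). The `true_labels` list is hypothetical and only there to show how accuracy would be computed once labels are available.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predicted, actual):
    """Fraction of predictions that match the labels."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Rows 1-3 and row 9 of the predictions printed above
predictions = [
    [0.13130848, 0.5342713, 0.3344202],
    [0.3565282, 0.37334207, 0.27012974],
    [0.02354746, 0.7503886, 0.226064],
    [0.44569394, 0.3899349, 0.16437116],
]
predicted_classes = [argmax(row) for row in predictions]  # [1, 1, 1, 0]

true_labels = [1, 0, 1, 0]  # hypothetical labels, for illustration only
print(predicted_classes, accuracy(predicted_classes, true_labels))
```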

To know that the model is overfitting: 
- If the validation metrics are considerably worse than the training metrics, then that's an indication of overfitting
- The metrics are good during training but accuracy is low on test data
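
The first check can be sketched as a simple comparison of training and validation accuracy. The threshold of 0.1 is an arbitrary assumption, and `looks_overfit` is a hypothetical helper. The two pairs of numbers are the first-epoch values from the two validation runs logged earlier in these notes. A large gap is a warning sign rather than proof: it can also come from a validation set that does not represent the data well.

```python
def looks_overfit(train_accuracy, val_accuracy, tolerance=0.1):
    """Flag a run whose training accuracy exceeds validation accuracy by more than `tolerance`."""
    return (train_accuracy - val_accuracy) > tolerance


# (accuracy, val_accuracy) from epoch 1 of the two validation runs above
split_run = (0.9333, 0.2667)     # validation_split = 0.2
explicit_run = (0.8250, 0.8333)  # explicit validation_data

print(looks_overfit(*split_run))     # large gap: worth investigating
print(looks_overfit(*explicit_run))  # metrics agree: no warning
```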

Some ways to avoid overfitting:
- Add more data. The more data, the higher the diversity
- Data augmentation
- Reduce complexity of the model. This can be done by making simple changes such as removing layers from the model or reducing the number of neurons in the layers.
- Dropout