# Machine Learning Reminder
* Find f(x) such that f(x) best approximates y
* Examples:
    * Given some pixels (x) tell me the probability it’s a cat (y)
    * Given news articles (x) tell me a stock's value (y)
    * Given some sequences (x) find a low-dimensional space (z) that represents my data
      * f1(x)=z, f2(z)=x  

# Outline
* Dense (Fully Connected Neural Networks)
  * Example Linear Fits
  * Classification
  
# Goals

Dense Neural Networks are an essential building block used in Deep Learning image analysis. Our goals are:
* Know what a Dense Neural Network is
    * Know what an activation function is and what it does
* Know how to write a Dense Neural Network
* How do I train a Dense Neural Network
    * Experiment with hyperparameters
* How do I use a trained neural network
    * Using a Neural Network to predict good wine

# Packages

We're going to be working primarily with Keras and TensorFlow. There are alternatives such as PyTorch, but they all allow you to build ML models.

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Our first Layer
A Dense or fully connected layer

<img src="../assets/network_diagrams/dense.png">

A dense layer has a connection between every input variable and every output node. Each connection is represented by a weight $W_{i,n}$ from an input $X_n$ to an output $O_i$. The output is the sum over all the input variables times their weights, plus a bias $B_i$:
<p style="text-align: center;">
$O_i = \sum_n W_{i,n}*X_n+B_i$    
</p>

We will need to fit this to data, which means finding the best values for $W_{i,n}$ and $B_i$ to approximate our data.
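
As a rough illustration of the formula above, here is a minimal NumPy sketch of a single dense layer. The function name `dense_forward` and the example numbers are made up for this illustration; Keras does all of this for you internally.

```python
import numpy as np

def dense_forward(X, W, B):
    """Compute O_i = sum_n W[i, n] * X[n] + B[i] for one input vector X."""
    return W @ X + B

# Hypothetical example: 3 inputs, 2 outputs
X = np.array([1.0, 2.0, 3.0])
W = np.array([[0.5, 0.0, -1.0],   # weights feeding output 0
              [1.0, 1.0,  1.0]])  # weights feeding output 1
B = np.array([0.1, -0.2])

O = dense_forward(X, W, B)
print(O)
```

Note that Keras stores the kernel of a Dense layer with shape (inputs, outputs), i.e. the transpose of the `W` used in this sketch.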

We will often also stack layers $l$, so the output of one layer feeds into the next:

$O_{i,l} = \sigma(\sum_n W_{i,l,n}*O_{n,l-1}+B_{i,l})$   


* We'll discuss this more later in the lecture, but when stacking layers it's important to use a non-linear activation function $\sigma$
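
The stacking rule can be sketched in a few lines of NumPy. This is only an illustration: the helper `forward`, the random weights, and the use of `np.tanh` as a stand-in for $\sigma$ are all assumptions made for the example (the lecture uses a LeakyReLU later on).

```python
import numpy as np

def forward(x, layers, sigma):
    """Feed x through a list of (W, B) pairs, applying sigma after every layer."""
    out = x
    for W, B in layers:
        out = sigma(W @ out + B)  # the output of one layer is the input to the next
    return out

rng = np.random.default_rng(0)
# Hypothetical 3 -> 3 -> 3 network with random weights
layers = [(rng.normal(size=(3, 3)), rng.normal(size=3)) for _ in range(2)]

x = np.array([1.0, -2.0, 0.5])
out = forward(x, layers, sigma=np.tanh)
print(out)
```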

<img src="../assets/network_diagrams/nn_3_3_3.png">



# When to use a Dense Network

* When you have fixed input size and a fixed output size 
* When your input size isn't too big
    * We'll have to add something extra for image data
    
# Practice Building Networks

## Keras organizes a network by layers

Look at the code below. It has 
* One input layer with an input size = 3 
* One output layer with an output size = 1
* A layer is connected to a previous layer by passing the previous layer as an argument
  * i.e.   
  output_layer=**tf.keras.layers.Dense**(*these arguments initialize the layer* )(**input_layer** *this argument connects the layers* ) #the second call connects the new layer to the input layer
  
## Networks are wrapped up into a Model

A model tells Keras which inputs and outputs you want to use, for example

**linear_model=tf.keras.models.Model(input_layer,output_layer)**

You'll need this model to fit to your data.


In [None]:
# All models start out with an input layer
input_layer=tf.keras.layers.Input(shape=(3,)) 
output_layer = tf.keras.layers.Dense(1)(input_layer)
#A Keras model is the class used for fitting; it takes the input and output layers
linear_model=tf.keras.models.Model(input_layer,output_layer)

linear_model.summary()



The above layer has 4 parameters (a weight for each of the 3 connections, and one bias term)
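
If you want to check a parameter count by hand, a small helper like the one below does the arithmetic. The function `count_dense_params` is a hypothetical name for this sketch, not part of Keras; its result should agree with the "Total params" line printed by `summary()`.

```python
def count_dense_params(layer_sizes):
    """Count weights and biases for a stack of dense layers.

    layer_sizes lists the width of each layer, starting with the input.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # one weight per connection plus one bias per output
    return total

print(count_dense_params([3, 1]))     # 3 weights + 1 bias
print(count_dense_params([3, 3, 1]))  # (9 + 3) for the hidden layer, (3 + 1) for the output
```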

We can represent it like this
<img src="../assets/network_diagrams/nn_3_1.png">


Layers can be stacked into more complex networks. Let's build this one:

<img src="../assets/network_diagrams/nn_3_3_1.png">




In [None]:
# All models start out with an input layer
input_layer=tf.keras.layers.Input(shape=(3,)) 
hidden_layer=tf.keras.layers.Dense(3)(input_layer) 
output_layer = tf.keras.layers.Dense(1)(hidden_layer)
#A Keras model is the class used for fitting; it takes the input and output layers
linear_model=tf.keras.models.Model(input_layer,output_layer)

linear_model.summary()

# Try building the following networks yourself

Try writing code to make the following networks
* 1. Create an Input Layer
* 2. Write Dense layers with the right number of units
* 3. Make a Model named **my_model** using the input layer and your output layer


## Example 1
<img src="../assets/network_diagrams/nn_3_3_3.png">


In [None]:
"Your Code Here"


In [None]:
"""run this to check your answer"""
# Check that my_model exists before touching it, otherwise a NameError hides the helpful message
assert 'my_model' in locals(), "my_model doesn't exist, did you get the name correct for your model?"
print('Found my model')
assert isinstance(my_model, tf.keras.models.Model), "my_model doesn't seem to be a keras model, try tf.keras.models.Model"
assert len(my_model.layers)==3, "Your model has "+str(len(my_model.layers))+" layers and should have 3"
assert my_model.layers[0].output_shape[1]==3, "Input isn't 3 dimensional"
assert my_model.layers[1].output_shape[1]==3, "Hidden layer isn't 3 dimensional"
assert my_model.layers[2].output_shape[1]==3, "Output layer isn't 3 dimensional"
print('Great Job Model is Correct')


## Example 2

<img src="../assets/network_diagrams/nn_3_3_3_3.png">


In [None]:
"Your Code Here"

In [None]:
"""run this to check your answer"""
# Check that my_model exists before touching it, otherwise a NameError hides the helpful message
assert 'my_model' in locals(), "my_model doesn't exist, did you get the name correct for your model?"
print('Found my model')
assert isinstance(my_model, tf.keras.models.Model), "my_model doesn't seem to be a keras model, try tf.keras.models.Model"
assert len(my_model.layers)==4, "Your model has "+str(len(my_model.layers))+" layers and should have 4"
assert my_model.layers[0].output_shape[1]==3, "Input isn't 3 dimensional"
assert my_model.layers[1].output_shape[1]==3, "Hidden 1 layer isn't 3 dimensional"
assert my_model.layers[2].output_shape[1]==3, "Hidden 2 layer isn't 3 dimensional"
assert my_model.layers[3].output_shape[1]==3, "Output layer isn't 3 dimensional"
print('Great Job Model is Correct')


## Example 3

<img src="../assets/network_diagrams/nn_3_5_3.png">


In [None]:
"Your Code Here"

In [None]:
"""run this to check your answer"""
assert 'my_model' in locals(), "my_model doesn't exist, did you get the name correct for your model?"
print('Found my model')
assert isinstance(my_model, tf.keras.models.Model), "my_model doesn't seem to be a keras model, try tf.keras.models.Model"
assert len(my_model.layers)==3, "Your model has "+str(len(my_model.layers))+" layers and should have 3"
assert my_model.layers[0].output_shape[1]==3, "Input isn't 3 dimensional"
assert my_model.layers[1].output_shape[1]==5, "Hidden layer isn't 5 dimensional"
assert my_model.layers[2].output_shape[1]==3, "Output layer isn't 3 dimensional"
print('Great Job Model is Correct')


#  Model Design
* Input dimension is defined by the input data
* Output dimension is defined by target data
* Hidden layers add complexity to the model
    * More hidden layers or larger hidden layer dimensions can represent more complicated functions
    * Too many layers can be hard to train without special tricks
        * Can overfit, or fail to train (more on that later)
    * Too few layers may not correctly describe the data
* The right balance depends on the problem
    * Roughly, the more data you have, the more layers you can use
    * The more complex the target, the more layers you'll need
* There is no single right answer, so feel free to experiment

# Fitting your Model
Let's try to fit a simple line using the model below

<img src="../assets/network_diagrams/nn_3_1.png">


In [None]:
# All models start out with an input layer
input_layer=tf.keras.layers.Input(shape=(3,)) 
output_layer = tf.keras.layers.Dense(1)(input_layer)
#A Keras model is the class used for fitting; it takes the input and output layers
linear_model=tf.keras.models.Model(input_layer,output_layer)

linear_model.summary()

In the code above we define an Input layer and one Dense (fully connected) layer. In the dense-layer equation from earlier, $i$ takes a single value (there is one output) and $n$ runs over the `data_dim` inputs. If `data_dim` = 1, then

$O_0 = \sum_n W_{0,n}*X_n+B_0 = W_{0,0}*X_0+B_0$

You'll notice this has the same form as our linear model from the last lecture.

* $y_{pred,i}=\theta_{1}*x_{i}+\theta_{2} $

* Each 'neuron' in a dense network is one linear model

In neural network lingo 
*  $W$ is called the weight matrix 
*  $B$ is called the bias
*  $W$ is a matrix and can hold several parameters; all the parameters in the network are often represented by just $\theta$ 

Just as in our linear model, we are going to use the same loss function
* $L=\frac{1}{N}\sum_i (y_{pred,i}-y_{true,i})^2$
* which is the Mean Squared Error, or mse for short
* and we will pick the 'adam' optimizer
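
To make the loss concrete, here is a small NumPy sketch of the mean squared error. The function name `mse` and the example values are made up for illustration; Keras computes this for you when you pass `loss='mse'`.

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error: the average of the squared differences."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([1.0, 3.0, 5.0])
y_pred = np.array([1.0, 2.0, 7.0])

loss = mse(y_pred, y_true)
print(loss)  # squared errors are 0, 1 and 4, averaged over 3 points
```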




## Task: Fit a slightly harder straight line

We're going to make a data set where x is a set of 3 features, and a target value $y = 2*x_0+1$ 

$y$ is just a line with respect to $x_0$, and completely ignores $x_1$ and $x_2$ 

This problem is a little bit more difficult than the 1st lecture since we need to learn that two of the input features don't correlate at all with the output.

In [None]:
#Build the Dataset

data_dim=3

X=np.random.uniform(0,10,size=(10000,data_dim))
def func(X):
    return 2*X[:,0]+1  # Ignore all other inputs; the output depends only on the first dimension
Y=func(X)




    
    

In [None]:

#MSE= Mean Squared Error 
linear_model.compile(loss='mse',optimizer='adam')

# Fit our simple neural network
linear_model.fit(X,Y,epochs=100,validation_split=0.5) #Have Keras make a train/validation split for us

#Pro tip: if you don't want to split the dataset yourself, you can have Keras do it for you with validation_split=

# Even more pro tip - be careful: if you run this cell more than once you'll keep training the same model
# rather than starting over. Tuning again and again against the same develop (validation) set can overfit it
# as well as the train set - a problem you'll only see when using the test set, so make sure to keep a test set around!


Excellent, you've fit your first neural network. Now let's use it
* We split our initial dataset into train/develop sets using validation_split
* Let's make a new dataset that has X_0 running from -5 to 15, with X_1 and X_2 being random
    * This is just a way to plot the output

In [None]:
#Let's plot the output as a function of X_0

#Create some random data with data_dim features
X_test=np.random.uniform(0,10,size=(100,data_dim))
#Set the first dimension to be a line
X_test[:,0]=np.linspace(-5,15,100)

#Get the True distribution from our test function
Y_test=func(X_test)

#Get the prediction from our model
Y_pred=linear_model.predict(X_test)




In [None]:
#Plot

plt.scatter(X_test[:,0],Y_pred,label='prediction',marker='x')
plt.scatter(X_test[:,0],Y_test,label='truth',marker='+')
plt.xlabel('X[:,0]')
plt.ylabel('Y')
plt.legend()
plt.show()

#Let's look at it wrt X[:,1]
plt.scatter(X_test[:,1],Y_pred,label='prediction',marker='x')
plt.scatter(X_test[:,1],Y_test,label='truth',marker='+')
plt.xlabel('X[:,1]')
plt.ylabel('Y')
plt.legend()
plt.show()



We can also look at a model's weights.
We expect $W_{0,0}$=2 and $B_0$=1, with the weights for the two ignored features close to 0.
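
One way to cross-check that expectation without Keras is ordinary least squares, which solves the same linear problem in closed form. This is a swapped-in technique, not what Keras does (Keras uses gradient-based training), and the dataset here is regenerated with made-up names just for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
X_check = rng.uniform(0, 10, size=(1000, 3))
Y_check = 2 * X_check[:, 0] + 1  # same target as above: only the first feature matters

# Append a column of ones so the last coefficient plays the role of the bias
A = np.hstack([X_check, np.ones((len(X_check), 1))])
coef, *_ = np.linalg.lstsq(A, Y_check, rcond=None)

W_ls, B_ls = coef[:3], coef[3]
print("W =", np.round(W_ls, 3))
print("B =", round(float(B_ls), 3))
```

The trained network will usually land close to these values rather than exactly on them, since it stops after a fixed number of epochs.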

In [None]:
weights=linear_model.get_weights()
print("W=",weights[0])
print("W[0,0]=",weights[0][0,0])
print("B=",weights[1])


# Try it yourself 
Run the cell below to create a similar data set, but this time with some noise

$y = 2*x_0+1+N(0,2)$ 
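
A quick note on what to expect from the loss: `np.random.normal(0,2)` draws noise with standard deviation 2, so even the true line cannot get the mean squared error much below the noise variance of about 4. The sketch below checks this numerically, using variable names made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(0, 10, size=100_000)
noise = rng.normal(0, 2, size=x0.shape)  # the second argument is the standard deviation
y_noisy = 2 * x0 + 1 + noise

# Even the true line cannot predict the noise, so its MSE is about the noise variance
mse_true_line = np.mean((2 * x0 + 1 - y_noisy) ** 2)
print(mse_true_line)
```

So if your val_loss settles near 4 on this dataset, that is probably as good as any model can do.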


In [None]:
#Build the Dataset

data_dim=5

X=np.random.uniform(0,10,size=(10000,data_dim))
def func(X):
    return 2*X[:,0]+1 + np.random.normal(0,2,size=(len(X))) # Ignore all other inputs; the output depends only on the first dimension (plus noise)
Y=func(X)



In [None]:
"""Write your Model"""
"Input"
"Dense Layer"
"Create Model"
"Fit"


In [None]:
"""Test"""


In [None]:
"""Plot"""


Let's try something a bit more complicated: a sine wave

In [None]:
X=np.random.uniform(0,10,size=(10000,data_dim))
def func(X):
    return np.sin(X[:,0]) # Ignore all other inputs; the output depends only on the first dimension
Y=func(X)


In [None]:

# All models start out with an input layer

input_layer=tf.keras.layers.Input(shape=(data_dim,)) 
output_layer = tf.keras.layers.Dense(1)(input_layer)
#A keras model is a way of going from one layer to the next
sine_model=tf.keras.models.Model(input_layer,output_layer)
sine_model.compile(loss='mse',optimizer='adam')
sine_model.fit(X,Y,epochs=15,validation_split=0.5) #Have Keras make a train/validation split for us





In [None]:
X_test=np.random.uniform(0,10,size=(100,data_dim))
X_test[:,0]=np.linspace(-5,15,100)
Y_test=func(X_test)
Y_pred=sine_model.predict(X_test)

plt.scatter(X_test[:,0],Y_pred,label='prediction')
plt.scatter(X_test[:,0],Y_test,label='truth')
plt.xlabel('X[:,0]')
plt.ylabel('Y')
plt.legend()


Oops, this didn't work. Why? So far, what we wrote above can only represent linear functions

<p style="text-align: center;">
$O_i = \sum_n W_{i,n}*X_n+B_i$    
</p>
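
Adding more dense layers would not help on its own: without an activation, a stack of linear layers collapses into a single linear layer. The NumPy sketch below illustrates this with random, made-up weights.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical dense layers with no activation: 3 -> 4 -> 1
W1, B1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, B2 = rng.normal(size=(1, 4)), rng.normal(size=1)

x = rng.normal(size=3)
stacked = W2 @ (W1 @ x + B1) + B2

# The same function written as ONE linear layer
W_eff = W2 @ W1
B_eff = W2 @ B1 + B2
single = W_eff @ x + B_eff

print(np.allclose(stacked, single))
```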

we need to add something called an activation function $\sigma$

<p style="text-align: center;">
$O_i = \sigma(\sum_n W_{i,n}*X_n+B_i)$    
</p>

$\sigma$ has to be non-linear and a good choice is a LeakyReLU

<img src='../assets/leakyReLU.png'>
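
For reference, a LeakyReLU is simple enough to write by hand. The sketch below is a NumPy illustration, not the Keras implementation; the slope of 0.3 for negative inputs is assumed here to match the Keras default, so check the documentation for your version.

```python
import numpy as np

def leaky_relu(z, slope=0.3):
    """Identity for z >= 0, a small linear slope for z < 0."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 0, z, slope * z)

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
activated = leaky_relu(z)
print(activated)

# Non-linear: f(a + b) is not f(a) + f(b) when the signs differ
a, b = -1.0, 2.0
print(leaky_relu(a + b), leaky_relu(a) + leaky_relu(b))
```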

an activation can be added just like any other layer




In [None]:

# All models start out with an input layer

input_layer=tf.keras.layers.Input(shape=(data_dim,)) 

activation_layer = tf.keras.layers.LeakyReLU()(input_layer)

output_layer = tf.keras.layers.Dense(1)(activation_layer)

#A keras model is a way of going from one layer to the next
sine_model=tf.keras.models.Model(input_layer,output_layer)

sine_model.compile(loss='mse',optimizer='adam')
sine_model.fit(X,Y,epochs=20,validation_split=0.5) #Have Keras make a train/validation split for us

Y_pred=sine_model.predict(X_test)

plt.scatter(X_test[:,0],Y_pred,label='prediction')
plt.scatter(X_test[:,0],Y_test,label='truth')
plt.xlabel('X[:,0]')
plt.ylabel('Y')
plt.legend()

#  Still not very good, let's add an activated hidden layer to get more flexibility

In [None]:
# All models start out with an input layer

input_layer=tf.keras.layers.Input(shape=(data_dim,)) 
hidden_layer = tf.keras.layers.Dense(20)(input_layer)
activation_layer = tf.keras.layers.LeakyReLU()(hidden_layer)
output_layer = tf.keras.layers.Dense(1)(activation_layer)
#A keras model is a way of going from one layer to the next
sine_model=tf.keras.models.Model(input_layer,output_layer)

sine_model.compile(loss='mse',optimizer='adam')
sine_model.fit(X,Y,epochs=20,validation_split=0.5) #Have Keras make a train/validation split for us

Y_pred=sine_model.predict(X_test)

plt.scatter(X_test[:,0],Y_pred,label='prediction')
plt.scatter(X_test[:,0],Y_test,label='truth')
plt.xlabel('X[:,0]')
plt.ylabel('Y')
plt.legend()


Still not very good. Let's make our model a bit more powerful by adding more layers $l$

<p style="text-align: center;">
$O_{i,0}=X_i$
</p>
 
<p style="text-align: center;">  
$O_{i,l} = \sigma(\sum_n W_{i,l,n}*O_{n,l-1}+B_{i,l})$    
</p>

In [None]:
# All models start out with an input layer

input_layer=tf.keras.layers.Input(shape=(data_dim,)) 

hidden_layer = tf.keras.layers.Dense(20)(input_layer)
activation_layer = tf.keras.layers.LeakyReLU()(hidden_layer)

hidden_layer = tf.keras.layers.Dense(20)(activation_layer)
activation_layer = tf.keras.layers.LeakyReLU()(hidden_layer)

hidden_layer = tf.keras.layers.Dense(20)(activation_layer)
activation_layer = tf.keras.layers.LeakyReLU()(hidden_layer)

hidden_layer = tf.keras.layers.Dense(20)(activation_layer)
activation_layer = tf.keras.layers.LeakyReLU()(hidden_layer)


output_layer = tf.keras.layers.Dense(1)(activation_layer)
#A keras model is a way of going from one layer to the next
sine_model=tf.keras.models.Model(input_layer,output_layer)

sine_model.compile(loss='mse',optimizer='adam')
sine_model.fit(X,Y,epochs=20,validation_split=0.5) #Have Keras make a train/validation split for us

Y_pred=sine_model.predict(X_test)

plt.scatter(X_test[:,0],Y_pred,label='prediction')
plt.scatter(X_test[:,0],Y_test,label='truth')
plt.xlabel('X[:,0]')
plt.ylabel('Y')
plt.legend()

The model fits the sine curve very well where it has seen training data (0 to 10), and not so well where there was no training data. Neural networks are universal function approximators, but you have little control over what they predict when given data that is completely new. 
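
This behaviour is not special to neural networks. As a quick illustration with a different flexible model (a degree-9 polynomial, used here purely as a stand-in and not as part of the lecture's method), the fit is good inside the training range and poor outside it.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=2000)  # training data only covers 0 to 10
y_train = np.sin(x_train)

# Stand-in flexible model: a degree-9 polynomial fitted by least squares
poly = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)

x_inside = np.linspace(0, 10, 200)
x_outside = np.linspace(12, 15, 200)
err_inside = np.max(np.abs(poly(x_inside) - np.sin(x_inside)))
err_outside = np.max(np.abs(poly(x_outside) - np.sin(x_outside)))
print("max error inside training range: ", err_inside)
print("max error outside training range:", err_outside)
```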


# Vocab Review
**Hyperparameter**: Anything that defines the model (number of layers, number of units, ...) or the model fit (learning rate, optimizer, etc.)

**batch size**: The number of examples used in each gradient descent step

**epoch**: One full pass over the dataset (selected in batch-sized chunks); the number of epochs is the number of times the entire dataset has been used

**learning rate**: Controls the size of each gradient step

**optimizer**: Algorithm that (using the learning rate) decides how big a gradient step to take
  * sgd
  * adam
  * rmsprop
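
To see how these terms fit together, here is a minimal mini-batch gradient descent loop in NumPy for a one-feature linear model. It is only a sketch of plain SGD with made-up settings, not how Keras implements its optimizers (adam and rmsprop adapt the step size in more elaborate ways).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
y = 2 * x + 1

w, b = 0.0, 0.0          # parameters of the model y_pred = w * x + b
learning_rate = 0.01     # size of each gradient step
batch_size = 50          # examples used per gradient step
epochs = 100             # passes over the whole dataset

for epoch in range(epochs):
    order = rng.permutation(len(x))              # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]    # one batch of examples
        err = w * x[idx] + b - y[idx]
        # Gradients of the batch MSE with respect to w and b
        grad_w = 2 * np.mean(err * x[idx])
        grad_b = 2 * np.mean(err)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))
```

With 1000 examples and a batch size of 50, each epoch takes 20 gradient steps, so a larger batch size means fewer (but less noisy) steps per epoch.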



In [None]:
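# This code is used to reset the weights of the model below, so we can experiment with training
# Note: it relies on the TensorFlow 1.x session API (tf.keras.backend.get_session)
def reset_weights(model):
    session = tf.keras.backend.get_session()
    for layer in model.layers: 
        # Only layers with a kernel (e.g. Dense) need re-initializing
        if hasattr(layer, 'kernel_initializer'):
            layer.kernel.initializer.run(session=session)
            # Re-initialize the bias too, if the layer has one
            if getattr(layer, 'bias', None) is not None:
                layer.bias.initializer.run(session=session)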


# Let's Experiment

Try adjusting the hyperparameters used for fitting, and see how long training takes and how low the val_loss gets

In [None]:
import time
optimizer=tf.keras.optimizers.Adam(lr=1e-3)
#optimizer=tf.keras.optimizers.RMSprop(lr=1e-3)
#optimizer=tf.keras.optimizers.SGD(lr=1e-4)
sine_model.compile(loss='mse',optimizer=optimizer)


reset_weights(sine_model)

i_time=time.time()
sine_model.fit(X,Y,epochs=10,validation_split=0.5,batch_size=1000) #Have Keras make a train/validation split for us
print(time.time()-i_time)

Y_pred=sine_model.predict(X_test)
plt.scatter(X_test[:,0],Y_pred,label='prediction')
plt.scatter(X_test[:,0],Y_test,label='truth')
plt.xlabel('X[:,0]')
plt.ylabel('Y')
plt.legend()

# What did you see ?

* Does the code run faster or slower with a larger batch size?
   * Is the loss better or worse?
* Which optimizer gives the best results (SGD, Adam, RMSProp)?
   * What is the effect of the learning rate on each optimizer?
    
