#### Neural networks
* input passes through layers of linear and non-linear functions
* connections are given weights that determine how much of each input a unit propagates
* weight adjustment happens during training
* done through backpropagation and gradient descent
* not always the best choice, especially for tabular data
* better for unstructured data - images, speech, videos, text

**Advantages**
* mix of parametric and non-parametric
* number of parameters (weights) is determined beforehand by the architecture, BUT the number of weights can be a lot
* weights can also zero out - blocking that path
* makes NN more adaptable
* essentially non-linear models, so they perform better on real-world (non-linear) relationships
* convolutional - can capture both broad and detailed patterns in the input
* recurrent - have feedback loops, can remember previous inputs, good for time series and NLP
* can also be combined

**Downsides**
* black boxes - very hard to explain intuitively
* needs large amounts of training data - can take a long time to learn 

**Deep Dive**
* each node/neuron has an activation - higher number, more activated
* first layer - the inputs
* last layer - its activations give the prediction probabilities
* activations of one layer determine the activations of the next layer
* hidden layers - piece together different components, activated by different characteristics or subcomponents
* e.g. picture an image made up of different components/characteristics
* sigmoid curve compresses the weighted sum into an activation between 0 and 1 - a measure of how positive the weighted sum is
* also includes a 'bias' - to determine how inactive/active a node should be - a constant - similar to y-intercept
* bias determines how high the weighted sum should be before the node gets meaningfully activated
* each node has these weights and biases for activation
* during learning - the right weights and biases are found
* weights (matrix) multiplied by inputs (vector) plus biases (vector of constants), all passed through the sigmoid function: a' = sigmoid(W a + b) - which is basically a matrix-vector multiplication (see chapter 3 again from 3blue1brown)
* ReLU (rectified linear unit, max(0, a)) is often used instead of sigmoid
* faster to train for deep networks (zero for negative inputs, then goes straight up as a line instead of a curve)
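
A minimal sketch of the layer computation described above, in plain Python so nothing needs installing. The weights, biases and inputs are made-up numbers for illustration only, and `layer_forward` is a hypothetical helper name, not something from a library.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def layer_forward(weights, biases, inputs, activation):
    """Compute activation(W x + b) for one layer.

    weights: one row of weights per unit in this layer
    biases:  one constant per unit
    inputs:  activations of the previous layer
    """
    outputs = []
    for row, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs


# Hypothetical numbers, chosen only for illustration.
x = [1.0, 0.5]                     # activations of the previous layer
W = [[2.0, -2.0], [-1.0, -4.0]]    # 2 units, 2 incoming weights each
b = [-1.0, 0.0]                    # one bias per unit

sigmoid_out = layer_forward(W, b, x, sigmoid)
relu_out = layer_forward(W, b, x, relu)
```

The first unit's weighted sum is exactly 0, so its sigmoid activation is 0.5; the second unit's weighted sum is negative, so sigmoid gives a value near 0 and ReLU gives exactly 0.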

**How a Neural Network Learns**
* Gradient descent - adjusts weights and biases (constants/thresholds) to reduce the loss


#### From Lecture

Deep Learning
* very data hungry - millions of data points
* can train models on GPUs
* all comes down to large matrix multiplications
* hidden layers - intermediate computations
    * next set take in the previous as input
* everything is set beforehand by the user: how many input features, intermediate layers, output predictions 
* connections have weights - parameters that are being tuned by the data 
* different architectures (patterns, designs, schemes) work better depending on what data you have
* deeper usually means better performance - but more expensive to train and get predictions from
* in industry - architecture is based on what is already out there depending on the data

Intuition for deep neural networks
* Why are deeper better
* early layers learn low-level features (edges), later layers combine them into contours, then object parts, then classes - the deeper the layer, the more abstract/complex the features

Some applications
* chess, music generation, image generation (generative adversarial network), self-driving cars, protein structures
* can stack an ML model on a neural network - but it needs to be differentiable (so a derivative can be computed)
* works mostly with unstructured data (audio, images, sensor/recordings, language/text)
* usually for high-dimensional data
* but is becoming competitive with typical machine learning on structured data, e.g. TabNet from Google [tabnet](https://towardsdatascience.com/tabnet-deep-neural-network-for-structured-tabular-data-39eb4b27a9e4)

Advantages
* little preprocessing and feature engineering
* really only needs one-hot encoding and standardization
* within layers - does feature selection/feature engineering
* standardization - 0 mean and 1 SD
* normalization - rescaling values to a fixed range, commonly 0-1 (min-max scaling)
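
A small sketch of the two preprocessing steps the notes say are still needed, one-hot encoding and standardization, using only the standard library. The helper names and the toy data are assumptions for illustration; in practice a library such as sklearn would normally do this.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD of the column
    return [(v - mean) / sd for v in values]


def one_hot(labels):
    """Turn each label into a 0/1 vector, one column per category.

    Columns follow the alphabetical order of the categories.
    """
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


# Hypothetical toy columns.
ages = [20.0, 30.0, 40.0]
colours = ["red", "blue", "red"]

scaled = standardize(ages)     # middle value maps to 0
encoded = one_hot(colours)     # columns are ["blue", "red"]
```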

Caveats
* needs more data
* more functions to express means more parameters, and more data needed to learn them
* upwards of millions for images, tens of millions for text, thousands of hours of audio
* when data is small - no advantage over traditional ML
* but scales much better when you have more data, can continuously get better
* traditional ML plateaus at large amounts of data - adding more no longer improves performance
* DL can also do some transfer learning - start with a NN that's pretrained with a similar/other data set - e.g. image classification 

Deeper Dive - Deep Neural Networks
* how deep neural networks work
* number of units, connections between units, and equations within units are set up front (not learned) - by the user or through a premade architecture
* training changes the weights
* visualized with a 'computation graph'

Fully-connected layers 
* seen in multilayer perceptron, linear regression
* every unit in one layer is connected to every unit in the next layer
* linear regression - X to Y 
* multilayer perceptron - has more layers 
* calculation is a weighted sum, which results in a dot product.
* think of units as rows with multiple columns
* weights as a matrix
* so Y = W * X (a matrix multiplication)
* can tell it's linear: each layer is matrix * x = y, and substituting one layer into the next still ends up as a single linear equation
* usually good for tabular data
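
To make the "substitution" point concrete, here is a rough sketch (plain Python, made-up weights) showing that two fully-connected layers without an activation give the same output as one layer whose weight matrix is the product of the two. `matvec` and `matmul` are hypothetical helper names.

```python
def matvec(W, x):
    """Matrix-vector product: one dot product (weighted sum) per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix-matrix product A @ B using plain lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


# Hypothetical weights: layer 1 has 3 units and 2 inputs, layer 2 has 1 unit.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[1.0, 0.0, 2.0]]
x = [2.0, 1.0]

two_layers = matvec(W2, matvec(W1, x))    # apply the layers one after another
one_layer = matvec(matmul(W2, W1), x)     # collapse them into one matrix first
```

Both give the same result, which is why depth only adds expressive power once a non-linear function is applied between the layers.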

Non-linear DL
* non-linearities are what make a network a universal function approximator
* after the weighted sum - apply a non-linear transformation - using a sigmoid or taking the log
* non-linear transformation does not have to be the same at each layer or each unit within a layer
* applied to the weighted sums
* common functions:
    * ReLU - negative values are clamped to 0
        * used in intermediate layers - as is or in some variation of it
        * output between 0 and +infinity
    * Sigmoid
        * usually at the output layer - squashes values into the range 0-1
        * large negative values approach 0, large positive values approach 1
        * good for binary classification
    * TanH
        * similar to sigmoid - but squashes to the range -1 to 1
    * Softmax
        * multiple output units, output sums to 1
        * all positive 
        * good for multiclass
        * less useful for hidden layers
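
The four common functions listed above can be sketched in a few lines of plain Python. This is an illustrative version only; the max-subtraction inside `softmax` is a standard numerical-stability trick added here, not something from the lecture.

```python
import math


def relu(z):
    """Negative values become 0, positive values pass through unchanged."""
    return max(0.0, z)


def sigmoid(z):
    """Squash into (0, 1); suits a single binary output unit."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Like sigmoid, but squashes into (-1, 1)."""
    return math.tanh(z)


def softmax(zs):
    """Turn a list of scores into positive values that sum to 1."""
    shift = max(zs)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for a 3-class output layer.
class_probs = softmax([1.0, 2.0, 3.0])
```

The largest score gets the largest probability, which is why softmax is the usual choice for a multiclass output layer.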

Architecture Decisions
* architectures are usually available for free
* usually taken from academic publications

Training Deep Neural Networks
* loss function is minimized with gradient descent
* gradient - derivative of loss with respect to the parameters (steepest ascent) - use the opposite
    * initialize parameters
    * compute gradient of the loss function with respect to the parameters - loss increases if parameters move in that direction
    * move parameters in the opposite direction of the gradient - to decrease loss function
    * repeat until the loss stops decreasing (converges)
* loss function must be differentiable - NNs usually are differentiable
* accuracy cannot be used as a loss - because it is not differentiable
* almost always stochastic gradient descent (gradient estimated on mini-batches)
* derivative trick - chain rule - to find the effect of weight on Loss
* finds the effect of weight (w) on y, and chains it with the effect of y on Loss: dL/dw = dL/dy * dy/dw
* learning rate - step size: how far each weight moves against its gradient on each update
* all weights are updated in parallel at each step
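
The loop above can be sketched for the simplest possible "network": one weight and one bias, trained with mean squared error. This is a rough illustration that uses full-batch gradient descent for readability (stochastic gradient descent would compute the same gradients on a mini-batch instead); `train_linear`, the data and the learning rate are all made up for the example.

```python
def train_linear(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y_hat = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0                      # initialize parameters
    n = len(xs)
    losses = []
    for _ in range(steps):
        preds = [w * x + b for x in xs]
        errors = [p - y for p, y in zip(preds, ys)]
        losses.append(sum(e * e for e in errors) / n)

        # Chain rule: dL/dw = dL/dy_hat * dy_hat/dw = 2 * error * x (averaged)
        grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
        # Chain rule: dL/db = dL/dy_hat * dy_hat/db = 2 * error * 1 (averaged)
        grad_b = sum(2 * e for e in errors) / n

        # Move both parameters against their gradients, scaled by the step size.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, losses


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, losses = train_linear(xs, ys)
```

With this data the loss shrinks step by step and the parameters approach w = 2 and b = 1. A learning rate that is too large would make the updates overshoot and the loss grow instead.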

TensorFlow - PyTorch
* TF more scalable, and easier to deploy
* TF v1 is hard to use, do not attempt for bootcamp

Keras
* used in bootcamp
* easier to use
* used by beginners and industry
* similar to sklearn
* builds TensorFlow models under the hood
* Dense = fully-connected layer; the first argument is how many units it outputs - the weighted sum is the calculation that's done in each unit
* activation function - for non-linearity 
* input_shape - how many are coming in - number of columns/features 
* first is .compile - specify loss='mean_squared_error' and optimizer='adam' - the optimizer updates the weights - adam is stochastic gradient descent (sgd) with tweaks and a good default
* .fit - validation_data - prints how the model is doing on validation data each epoch - but it is not used for training
* epochs - number of passes over the training data; batch_size=32 - how many samples are used per gradient descent step
* EarlyStopping(patience=3) - stop training if the monitored metric has not improved for 3 consecutive epochs
* architecture is each layer and functions - can be followed along from published stuff
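* preprocessing and feature engineering - less needed, since layers learn their own features - but inputs should still be standardized and missing values imputed beforehand (NaNs propagate through the network)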
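
Putting the Keras pieces above together, a minimal sketch might look like the following. It assumes TensorFlow (with its bundled Keras) and NumPy are installed; the synthetic data, layer sizes and epoch count are made up for illustration. `keras.Input(shape=...)` is used here to declare the number of incoming features, which is the same information the notes pass as `input_shape`.

```python
import numpy as np
from tensorflow import keras

# Hypothetical regression data: 200 rows, 3 features, linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)).astype("float32")
y = (X @ np.array([1.0, -2.0, 0.5], dtype="float32")).reshape(-1, 1)

X_train, X_val = X[:160], X[160:]
y_train, y_val = y[:160], y[160:]

# Architecture: one hidden fully-connected layer with ReLU, one linear output unit.
model = keras.Sequential([
    keras.Input(shape=(3,)),                    # number of columns/features
    keras.layers.Dense(16, activation="relu"),  # 16 units, non-linear
    keras.layers.Dense(1),                      # single regression output
])

# Loss to minimize and the optimizer that updates the weights.
model.compile(loss="mean_squared_error", optimizer="adam")

# Stop once the validation loss has not improved for 3 epochs in a row.
early_stop = keras.callbacks.EarlyStopping(patience=3)

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),  # reported each epoch, not trained on
    epochs=20,
    batch_size=32,                   # samples per gradient descent step
    callbacks=[early_stop],
    verbose=0,
)

predictions = model.predict(X_val, verbose=0)
```

`history.history` holds the per-epoch training and validation loss, which is a quick way to check whether the model is still improving or starting to overfit.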