# <div align="center"> SPECIAL TOPICS III </div>
## <div align="center"> Data Science for Social Scientists  </div>
### <div align="center"> ECO 4199 </div>
#### <div align="center">Class 11 - Deep Learning</div>
<div align="center"> Jonathan Holmes, (he/him)</div>

## Today's class: 

### Part #1: Lecture on Deep Learning

### Part #2: Hands-on deep learning using tensorflow

Using the following guide: https://www.tensorflow.org/tutorials/keras/regression
- You can run directly in Google Colab (see link)
- To use on a local machine: %conda install tensorflow

# Deep Learning
- Deep learning (DL) is the technique behind most of the artificial intelligence innovation of recent years
    - self-driving cars
    - image recognition
    - deep fakes etc.

- Deep Learning and Artificial Neural Networks are synonyms in this class


Additional resources: 
- The fairly technical book [_Deep Learning_](https://www.deeplearningbook.org/)  by [Ian Goodfellow](https://en.wikipedia.org/wiki/Ian_Goodfellow) and the terrific [Kaggle tutorial](https://www.kaggle.com/learn/intro-to-deep-learning)

## How does a human brain understand numbers? 

<div align="center"> <img src="numbers.png" width=600 height=600 /> </div>





1. Eyes detect light, dark, and colors (Eyes)
2. Visual processing turns sensory input into shapes (Visual Cortex)
3. Brain decodes shapes into numbers (Visual Association Cortex)

## From human brains to computers

Why don't we let computers do that? 

1. Start with data inputs
2. Combine the data into concepts
3. Combine the concepts into a predictive answer

#### Notes: 
- The computer gets to define its own concepts
- Just as with humans, it's very hard to determine why a deep learning model got the answer it did



## What's deep in DL?
$$y=f(X) + \varepsilon$$


- Deep learning includes multiple nested functions:
$$f^\ast(x)= f^{(3)}(f^{(2)}(f^{(1)}(x)))$$

- These chain structures are the most commonly used structures of __neural networks__. 
    - $f^{(1)}$ is called the first __layer__ of the network
    - $f^{(2)}$ is called the second layer, and so on
    
- Layers are like levels of abstraction
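
As a rough illustration of the chain structure, here is a minimal sketch in which each layer is an ordinary Python function. The three functions `f1`, `f2`, `f3` and their formulas are made up for this example; in a real network each layer would be learned from data.

```python
def f1(x):
    """First layer: a made-up transformation of the raw input."""
    return 2 * x + 1


def f2(a):
    """Second layer: works on the output of the first layer, not on x."""
    return a ** 2


def f3(a):
    """Third layer: produces the final answer."""
    return a - 3


def f_star(x):
    """The full model is the nested composition f3(f2(f1(x)))."""
    return f3(f2(f1(x)))


print(f_star(1.0))  # f1 gives 3.0, f2 gives 9.0, f3 gives 6.0
```

Each function only sees the output of the one before it, which is what "levels of abstraction" means here.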

## A single layer neural network

<div align="center"> <img src="neural_network1.png" width=500 height=500 /> </div>

1. Input layer is data
2. Hidden layer is functions of the data
3. Output layer is the prediction



# What's a neuron in neural network?

<div align="center"> <img src="fig1_unit.png" width=500 height=500 /> </div>

- This is an artificial __neuron__
    - Synonyms: __node__, __unit__

- Neurons represent functions: 
    - Inputs X and a constant
    - Output Y




## Linear regression and Linear unit
- Example neuron function: The Linear Unit/Linear Regression
    - $y = w x + b$
- This is the equation of a line:
     - w is a __weight__ (Econometrics: the slope $\beta$)
     - b is a __bias__ (Econometrics: the intercept $\alpha$)
- Too many definitions for __bias__!
    - Econometrics: When $E[\hat{\beta}] \neq \beta$ 
    - Statistical learning: When the model $\hat{f}(x)$ is not flexible enough to fit the real $f(x)$
    - Neural networks: Bias = Coefficient on constant term (Intercept)

## Multiple Inputs 

- You can simply add more input __connections__ to the neuron, one for each additional feature. 
  - $y = w_0 x_0 + w_1 x_1 + w_2 x_2 + b$
  
- This equation has _3_ weights and _1_ bias term
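
A linear unit with several inputs can be sketched in plain Python as follows. The function name `linear_unit` and the numbers used for the weights, bias, and features are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
def linear_unit(x, w, b):
    """Return w_0*x_0 + w_1*x_1 + ... + b for a single neuron."""
    if len(x) != len(w):
        raise ValueError("need exactly one weight per input")
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


# Made-up example: 3 weights and 1 bias term, as in the equation above.
weights = [0.5, -1.0, 2.0]
bias = 0.25
features = [2.0, 1.0, 3.0]

y = linear_unit(features, weights, bias)
print(y)  # 0.5*2 - 1*1 + 2*3 + 0.25 = 6.25
```

With a single input and a single weight, the same function reduces to the line $y = wx + b$.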

## Other functions

Generally, neurons are non-linear functions: 
$$ y = h(x) = g(w_0x_0 + w_1x_1 + ... + w_Nx_N + b) $$ 

Examples of $g$: 
- Sigmoid function: $g(z) = \frac{e^z}{1+e^z}$
- Rectifier function: $g(z) = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{otherwise} \end{cases}$
    - A neuron using this function is sometimes called a **rectified linear unit** or **ReLU**.
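
The two examples of $g$ can be written in a few lines of Python. This is only a sketch using the standard library; the function names `sigmoid` and `relu` are chosen here for illustration.

```python
import math


def sigmoid(z):
    """Logistic function e^z / (1 + e^z), arranged to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def relu(z):
    """Rectifier: 0 for negative inputs, z otherwise."""
    return z if z > 0 else 0.0


# Compare the two functions at a few points.
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), relu(z))
```

The sigmoid squeezes any input into the interval (0, 1), while the rectifier leaves positive inputs unchanged and sets negative ones to 0.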

## What's going on? 


<div align="center"> <img src="neural_network1.png" width=600 height=600 /> </div>

- Each neuron in an intermediate layer is an equation
- $A_1 = h_1(X)$, ..., $A_5 = h_5(X)$
- Output $Y = f(A_1, A_2, ..., A_5)$
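
To make this concrete, here is a minimal sketch of the prediction step for a network with 4 inputs, 5 hidden neurons, and one output. It assumes sigmoid neurons and a linear output layer; all weights, biases, and betas below are made-up numbers, not estimates.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b):
    """One hidden neuron: A_k = g(w . x + b), with g assumed to be the sigmoid."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)


def predict(x, hidden_weights, hidden_biases, betas):
    """Compute A_1..A_5 from X, then combine them linearly into Y."""
    a = [neuron(x, w, b) for w, b in zip(hidden_weights, hidden_biases)]
    return betas[0] + sum(beta * a_k for beta, a_k in zip(betas[1:], a))


# Hypothetical parameters: 5 neurons with 4 weights each, plus 6 betas.
hidden_weights = [[0.1 * (k + 1)] * 4 for k in range(5)]
hidden_biases = [0.0] * 5
betas = [1.0, 0.5, 0.5, 0.5, 0.5, 0.5]

x = [0.0, 0.0, 0.0, 0.0]
y_hat = predict(x, hidden_weights, hidden_biases, betas)
print(y_hat)  # every A_k = sigmoid(0) = 0.5, so 1.0 + 5 * 0.5 * 0.5 = 2.25
```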




## In-Class Exercise

Q1: In the above neural network, how many features are there? How many neurons? 
There are 4 features (X1 ... X4) and 5 neurons (A1...A5)

Q2: In the above neural network: 
- How many parameters do you estimate for each neuron? How many are weights, and how many bias terms?
A1: $h_1(X) = g(w^1_1X_1 + w^1_2X_2 + w^1_3X_3 + w^1_4X_4 + b)$ 4 weights w, 1 bias term. 

- How many parameters do you estimate for the output layer?
$$f(X) = \beta_0 + \beta_1 A_1 + \beta_2 A_2 + \beta_3 A_3 + \beta_4 A_4 + \beta_5 A_5$$
$$f(X) = \beta_0 + \beta_1 h_1(x) + \beta_2  h_2(x) + \beta_3  h_3(x) + \beta_4  h_4(x) + \beta_5  h_5(x)$$
6 more parameters in the output layer

- How many total parameters do you estimate in this model? 
    - 5 parameters per neuron * 5 neurons = 25
    - plus 6 betas in the output layer 
    - Total: 31
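
The same count can be checked with a short helper. This is a sketch that assumes every layer is fully connected to the previous one; the function name `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights and biases for fully connected layers.

    layer_sizes lists the width of each layer in order,
    e.g. [4, 5, 1] = 4 features, 5 hidden neurons, 1 output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias term per neuron
    return total


print(count_parameters([4, 5, 1]))  # (4*5 + 5) + (5*1 + 1) = 31
```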


Q3: Why do you think we usually use non-linear functions for the neurons instead of linear functions?  
- Allows for more complex relationships
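- Linear functions don't actually work! (Linear algebra: a linear function of linear functions is still linear, so the whole network would collapse into a single linear regression)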
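
To see why purely linear neurons add nothing, take $g(z) = z$ in the network from Q2. The short derivation below uses the exercise's notation, with $b^k$ denoting the bias of neuron $k$ (a superscript added here for clarity):

$$f(X) = \beta_0 + \sum_{k=1}^{5} \beta_k \Big( \sum_{j=1}^{4} w^k_j X_j + b^k \Big) = \underbrace{\Big( \beta_0 + \sum_{k=1}^{5} \beta_k b^k \Big)}_{\text{one intercept}} + \sum_{j=1}^{4} \underbrace{\Big( \sum_{k=1}^{5} \beta_k w^k_j \Big)}_{\text{one slope}} X_j$$

The 31 parameters collapse into 5: one intercept and one slope per feature, which is just a linear regression of $Y$ on $X_1, \dots, X_4$.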

## Neural networks and human brains

Terminology borrowed from the human brain: 
- If a neuron $A_k$ is close to 0, it is __silent__
- If a neuron $A_k$ is close to 1 (or is large), it is __firing__ 

Also: 
- Intermediate neurons are part of __hidden layers__
- We will NOT be able to understand what the weights (w), bias terms (b), or even the neurons (A1,...) mean
- Evaluate neural networks based on whether they generate good predictions, $\hat{y}$. If you want to understand the mechanisms ($\hat{\beta}$), use econometrics!




## Estimating a deep neural network

Coder decides: 
- Number of hidden layers
- Number of neurons
- What function (sigmoid, ReLU, etc.) to use for each node. 

Then: 
1. Computer assigns random values of all weights and biases ($w$, $b$)
2. Repeat many times: 
    1. Calculate model error based on the current $w$, $b$ 
    2. For each $w$ and $b$, use calculus to identify the "next guess"
    3. Update $w$ and $b$ to the new guess 
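
The loop above can be sketched for the simplest possible network, a single linear unit $y = wx + b$, with mean squared error as the measure of model error. Everything here is an assumption made for illustration: the toy data, the `learning_rate`, the number of repetitions, and the function names.

```python
import random


def mse(w, b, xs, ys):
    """Model error: mean squared difference between prediction and truth."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, epochs=2000, seed=0):
    rng = random.Random(seed)
    # Step 1: random starting values for the weight and the bias.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    # Step 2: repeat many times.
    for _ in range(epochs):
        # 2.1: errors of the current guess.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # 2.2: calculus gives the derivative of the MSE w.r.t. w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # 2.3: move a small step against the derivative.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = train(xs, ys)
print(round(w, 3), round(b, 3))  # should end up close to w = 2 and b = 1
```

In a real network the derivatives in step 2.2 are computed for every weight and bias in every layer, which libraries such as tensorflow do automatically.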

## Terminology

- _Backpropagation_: The calculus-based algorithm that computes the _gradient_ (aka derivative) of the model error with respect to each $w$ and $b$, which points to the best next step
- _Gradient descent_: An algorithm that progressively improves its guesses, using the gradient as a guide
- _Epoch_: Each pass of the loop over the training data is a new epoch

## Advantages of neural networks

In previous classes, we made linear models more flexible by adding: 
- Polynomial terms (eg: $X_1^2$, $X_1^3$)
- Interaction terms (eg: $X_1X_2$, $X_1^2X_2$)

Neural networks can be _far_ more flexible
- The algorithm decides how to combine/interact variables
- It's possible to have discontinuous, non-linear, interacting, complex relationships
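
A small example of this flexibility: a network with just two ReLU neurons can reproduce the kinked relationship $y = |x|$ exactly, which no single straight line can do. The weights below are picked by hand for illustration rather than estimated, and the function names are hypothetical.

```python
def relu(z):
    return z if z > 0 else 0.0


def tiny_network(x):
    """Two hidden ReLU neurons and a linear output, with hand-picked weights."""
    a1 = relu(1.0 * x + 0.0)   # fires only when x is positive
    a2 = relu(-1.0 * x + 0.0)  # fires only when x is negative
    return 0.0 + 1.0 * a1 + 1.0 * a2


for x in (-2.0, -0.5, 0.0, 1.5):
    print(x, tiny_network(x))  # the second column equals |x|
```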



## In-class exercise

Note: "unbiased" is an econometrics term; in ML we talk about reducing bias.

Q4: I have a dataset and a prediction problem, and I'm deciding whether to use OLS, Lasso, or a neural network. How can I decide which method is best? 
- OLS/lasso: Higher bias, lower variance 
- Complex interactions: Neural network
- Just try them, use our tests! (Cross validation, Validation set, AIC, BIC, etc). 

Q5: Let's say I wanted to answer the same question as Q4, but with a dataset that is 100 times larger than before. Which method is likely to be better? 
- Neural network gets better
- As your data gets bigger, concerns about "variance" and "overfitting" are less important

## Multiple Layered Neural Networks

<div align="center"> <img src="neural_network2.png" width=550 height=550 /> </div>

- Input layer is input to hidden L1
- Hidden L1 is input to hidden L2
- Output is calculated as a function of L2
- In this case, there are many outputs
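
The chain "input, hidden L1, hidden L2, outputs" can be sketched as a loop over layers. The layer sizes (2 inputs, 3 neurons, 2 neurons, 2 outputs), the ReLU functions, and all numbers below are made up so the result can be verified by hand; they do not correspond to the figure.

```python
def relu(z):
    return z if z > 0 else 0.0


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: each row of `weights` belongs to one neuron."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs


def forward(x, layers):
    """Feed the output of each layer into the next one."""
    a = x
    for weights, biases, activation in layers:
        a = dense(a, weights, biases, activation)
    return a


# Hypothetical network: 2 inputs -> 3 neurons -> 2 neurons -> 2 outputs.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.0], relu),  # hidden L1
    ([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0], relu),          # hidden L2
    ([[1.0, 0.0], [0.0, 2.0]], [0.5, 0.0], None),                    # outputs
]

print(forward([1.0, 2.0], layers))  # L1 = [1, 2, 2], L2 = [3, 2], outputs = [3.5, 4.0]
```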

# Layers #

- Neurons are organized in **layers** that are inter-connected
  - A layer is called __dense__ when all of its units share the same set of inputs

- It turns out that combining dense layers of linear units is not enough to obtain the _flexibility_ that makes neural networks so powerful: we also need non-linear functions $g$ between the layers
  - You could think of each layer in a neural network as performing some kind of relatively simple transformation. 

## Decisions to make to design a neural network

1. How many layers? 
2. How many neurons per layer? 
3. What function $h(X)$ do you use in each layer? 

In-class exercise: 

Q6. If you increase the number of layers, this will most likely __ bias and __ variance of the model 
DECREASE BIAS, INCREASE VARIANCE 

Q7. If you increase the number of neurons, this will most likely __ bias and __ variance of the model 
DECREASE BIAS, INCREASE VARIANCE 

##  Advances in Neural Networks

### Part 1: Better, faster, more efficient estimation algorithms 
- Example: Stochastic gradient descent
- Goal: Faster convergence, less memory usage, ability to use "distributed" computing, using GPUs, etc. 
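
As a rough sketch of the idea, stochastic gradient descent updates the parameters using a small random batch of observations at each step instead of the full dataset. The example below fits a single linear unit; the toy data, `learning_rate`, `batch_size`, and function name are all assumptions made for illustration.

```python
import random


def sgd(xs, ys, learning_rate=0.02, epochs=1000, batch_size=2, seed=0):
    """Mini-batch stochastic gradient descent for y = w*x + b (squared error)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient estimated from this batch only, not the full data.
            errors = [w * xs[i] + b - ys[i] for i in batch]
            grad_w = 2 / len(batch) * sum(e * xs[i] for e, i in zip(errors, batch))
            grad_b = 2 / len(batch) * sum(errors)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = sgd(xs, ys)
print(round(w, 3), round(b, 3))  # should end up close to w = 2 and b = 1
```

Because each step touches only `batch_size` observations, the memory needed per step does not grow with the size of the dataset.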


### Part 2: Creative intermediate layers (using different functions h(X))
- ChatGPT is a _transformer_ model, known for its _attention_ layers
- Image recognition uses _convolutional_ layers


### Part 3: Throwing a lot more data at the problem
- GPT-3 has 175 billion parameters, 96 "blocks" of layers (each containing multiple layers)
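
A back-of-the-envelope check of that number, using a common rule of thumb for transformer models: roughly $12 \times \text{blocks} \times \text{width}^2$ parameters, ignoring embeddings and bias terms. The rule of thumb and the layer width of 12288 (the value reported for the largest GPT-3 model) are details added here, not part of the lecture, and the function name is hypothetical.

```python
def approx_transformer_parameters(n_layers, d_model):
    """Rule-of-thumb count: about 12 * n_layers * d_model**2 weights.

    Ignores embeddings and bias terms, so it is an approximation,
    not an exact parameter count.
    """
    return 12 * n_layers * d_model ** 2


# n_layers = 96 is the number of blocks mentioned above;
# d_model = 12288 is an added detail (width of each layer in GPT-3).
estimate = approx_transformer_parameters(n_layers=96, d_model=12288)
print(f"{estimate / 1e9:.0f} billion")  # 174 billion
```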



