# <div align="center"> SPECIAL TOPICS III </div>
## <div align="center"> Data Science for Social Scientists  </div>
### <div align="center"> ECO 4199 </div>
#### <div align="center">Class 11 - Social Biases and Prediction</div>
<div align="center"> Fabien Forge, (he/him)</div>

# Final Exam
- The final exam will take place on: 
    - __29 April 2021__
    - __9:30AM - 12:30PM__ (no extension possible!)
    - __Brightspace__
- The exam will take the same format as the midterm:
    - True and False Questions or Multiple Choice Questions
- __CAREFUL__: You will be limited in time which means that while this will be an open book exam, you will probably not have enough time to look up all the answers.

## Final Exam Continued
- This is a __cumulative exam__
- No code will be required but I may ask questions in which coding is part of the question/hint
    - e.g. recall the df.shape part of the question in the midterm
- Today's class will be part of the exam

# Machine Learning
- What does it mean to do machine learning?
- With the exception of unsupervised learning, all our ML tasks revolved around the same idea:
    - Produce predictions of y from x
- ML is the art of finding patterns in the data that can be approximated by some function
- If the function is good enough then its predictions will be close to the observations

## Machine Learning's Hype
- Why does ML work?
- ML fits complex and very flexible functional forms to the data:
    - without simply overfitting 
    - it finds functions that work well out-of-sample

## ML vs Classical Econometrics
- You can think about econometrics in terms of causal inference (as you should)
- Mathematically, you can also think of it in terms of:
    1. Find the functional form: interaction term, non-linear term etc
    2. Find _estimates_ of the $\beta$ in the relationship between Y and X
- OLS regressions produce an estimate $\hat{\beta}$ of $\beta$. How different is it from:
    - $\hat{\beta}$ obtained in a LASSO regression?
    - $w$ weights from deep learning?
- Of course, the difference is clearer for nonparametric methods such as tree-based methods

## ML vs Classical Econometrics, interpretation
- Recall the equation that started each lecture:
    - $y = f(x) + \varepsilon$
- One reason to use such a representation was to make clear that we were trying to find such a function
- But of course, OLS is also a function that relates X to y
- In econometrics, this formulation is hardly ever used because the important part is $\beta$
- Thus, ML is a fantastic tool to obtain $y \approx \hat{y} = \hat{f}(x)$ while causal inference is a way of finding $\beta \approx \mathbb{E}[\hat{\beta}|x]$

## ML vs Classical Econometrics, interpretation continued

- This means that machine learning cannot be used for causal inference
- Even if the model used is linear and the functional form resembles:
    - $f(x) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$
- This is because the $\hat{\beta}_{ML}$ obtained in ML do not have the same properties as the $\hat{\beta}_{OLS}$ used for causal inference
- Indeed, the $\hat{\beta}_{ML}$ are chosen to make $\hat{y}$ close to $y$, __not__ to make $\hat{\beta}$ close to $\beta$
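
A small sketch of why a penalized coefficient should not be read like an OLS coefficient. It assumes a single regressor with no intercept, a true slope of 2, and a LASSO objective written as $\frac{1}{2n}\sum_i (y_i - b x_i)^2 + \lambda |b|$; under those assumptions the LASSO solution has a closed form (soft-thresholding). The function names and the value of `lam` are illustrative choices.

```python
import random

def ols_slope(x, y):
    """No-intercept OLS: argmin_b sum (y - b*x)^2."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def lasso_slope(x, y, lam):
    """No-intercept LASSO with one regressor, solved by soft-thresholding.

    Objective: (1 / (2n)) * sum (y - b*x)^2 + lam * |b|
    """
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n
    sxx = sum(xi * xi for xi in x) / n
    sign = 1.0 if rho > 0 else -1.0
    return sign * max(abs(rho) - lam, 0.0) / sxx

random.seed(1)
true_beta = 2.0  # assumed "causal" slope of the toy data-generating process
x = [random.gauss(0, 1) for _ in range(500)]
y = [true_beta * xi + random.gauss(0, 1) for xi in x]

b_ols = ols_slope(x, y)
b_lasso = lasso_slope(x, y, lam=0.5)
print(f"true beta = {true_beta}, OLS = {b_ols:.2f}, LASSO = {b_lasso:.2f}")
```

The LASSO slope is pulled towards zero by design. That can be a good trade for prediction, but it is a biased estimate of the slope.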

## Most of Machine Learning in One Expression
$\underbrace{\min \sum_{i=1}^n L(f(x_i), y_i)}_{\text{in-sample loss}} \text{ over } \underbrace{f \in F}_{\text{function class}} \text{ subject to } \underbrace{R(f) \leq c}_{\text{complexity restriction}}$
- All (most) ML tasks try to minimize an in-sample loss function $L(.)$
    - We mainly used mean squared error (or RSS) but others exist
    - For instance, for classification, false positives may have different weights from false negatives
- This is done using a function class:
    - OLS, LASSO, Random Forest etc.
- And using penalties to make sure that the function's complexity does not lead to overfitting
    - Hyper-parameters tuning

## Most of Machine Learning in One Expression, continued
- __Global/parametric predictors__:
    - LASSO = $\|\beta\|_1 = \sum_{j=1}^k |\beta_j|$
    - Ridge = $\|\beta\|_2^2 = \sum_{j=1}^k \beta_j^2$
- __Local/nonparametric predictors__:
    - Decision/regression trees = Depth, number of nodes/leaves, minimal leaf size
    - Random forest (linear combination of trees) = Number of trees, number of variables used in each tree, size of bootstrap sample, complexity of trees
    - Nearest neighbors = Number of neighbors
- __Mixed predictors__:
    - Deep learning =  Number of levels (depth), number of neurons per level (width), connectivity between neurons (density)
- __Combined predictors__:
    - Bagging: Number of draws, size of bootstrap samples (and individual regularization parameters)
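
For the parametric predictors, the complexity measure $R(f)$ is just a function of the coefficients. The sketch below computes the LASSO and ridge penalties and plugs them into a penalized loss, that is, the in-sample MSE plus `lam` times the penalty, which is the form in which the constrained problem is usually solved in practice. The function names, the toy numbers and `lam` are assumptions for illustration.

```python
def l1_penalty(beta):
    """LASSO complexity measure: sum of absolute coefficients."""
    return sum(abs(b) for b in beta)

def l2_penalty(beta):
    """Ridge complexity measure: sum of squared coefficients."""
    return sum(b ** 2 for b in beta)

def penalized_loss(y, y_hat, beta, lam, penalty):
    """In-sample MSE plus lam times the chosen complexity measure."""
    mse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / len(y)
    return mse + lam * penalty(beta)

beta = [3.0, -4.0]
print(l1_penalty(beta))  # 7.0
print(l2_penalty(beta))  # 25.0

# Same fit, same coefficients, different price for complexity
y, y_hat = [1.0, 2.0], [1.0, 4.0]
print(penalized_loss(y, y_hat, beta, lam=0.1, penalty=l1_penalty))
print(penalized_loss(y, y_hat, beta, lam=0.1, penalty=l2_penalty))
```

The hyper-parameter `lam` plays the role of the bound $c$: a larger `lam` corresponds to a tighter complexity restriction.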

## Function Classes
- Recall that there is no one function class that is better over all predictive tasks all the time
- If you want to know which performs best, a good way would be to try them all at once
- It turns out that [PyCaret](https://pycaret.org/) can do this for you
- We will follow their [advanced tutorial](https://github.com/pycaret/pycaret/blob/master/tutorials/Regression%20Tutorial%20Level%20Intermediate%20-%20REG102.ipynb)

## PyCaret Setup
- The setup() function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling. setup() must be called before executing any other function in pycaret.
- It takes two mandatory parameters: a pandas dataframe and the name of the target column. 
- All other parameters are optional and are used to customize the pre-processing pipeline (we will see them in later tutorials).

# Biases and Machine Learning
- There are two sources of biases that we will explore
- A first source of bias will be linked to the data used
- The second is linked to what the complexity restriction implies

## ML and data-induced biases
- Deep learning can be used to predict what is represented in a picture
- It can also be used to enhance images

![](https://4.img-dpreview.com/files/p/E~TS590x0~articles/4871415337/googlebrain.jpeg)

## ML and Upsampling
- The way to do this is beyond the scope of this lecture
- But the method isn't very different from what we have learned
- The upsampling in the previous example takes a matrix X of dimension 8x8
- It outputs a matrix Y of dimension 32x32
$$\mathbf{Y} = f(\mathbf{X})$$
- For every input pixel, the function outputs 16 pixels (a 4x4 block)
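
To see what the dimensions imply, here is the simplest possible $f$: nearest-neighbor upsampling, which just repeats each input pixel. This is not what the neural network does (it learns which pixels to output), it is only a sketch of the input and output shapes. Going from 8x8 to 32x32 means four output pixels per side for each input pixel, so each input pixel maps to a 4x4 block.

```python
def upsample_nearest(X, factor):
    """Repeat every pixel of X into a factor x factor block."""
    Y = []
    for row in X:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        for _ in range(factor):
            Y.append(list(wide_row))
    return Y

# An 8x8 "image" where every pixel has a distinct value
X = [[r * 8 + c for c in range(8)] for r in range(8)]
Y = upsample_nearest(X, 4)

print(len(X), len(X[0]))  # 8 8
print(len(Y), len(Y[0]))  # 32 32
# Number of output pixels generated from the top-left input pixel
print(sum(row.count(X[0][0]) for row in Y))  # 16
```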

## ML training
- So a deep learning neural network (NN) was trained to predict a higher resolution picture
- The way to do this is of course to feed the NN a low resolution picture and use the high resolution picture as the target
- From this data, the NN learns which pixels it should output given a low resolution picture
- How good this network is depends on how close its output gets to the targets it was trained on

Who is this guy?

![](obama.png)

## ML test
- As should be clear by now, ML algorithms are only as good as their prediction out of sample
- Seeing the picture from the previous slide, it is obvious to us that this low resolution picture represents Obama
- We are therefore able to reconstruct the image based on our recollection of Obama's features
- The trained NN cannot do this: it can only output a new matrix Y based on the weights that best fitted its training data

![](https://cdn.vox-cdn.com/thumbor/MXX-mZqWLQZW8Fdx1ilcFEHR8Wk=/55x85:768x536/1820x1213/filters:focal(336x236:464x364):format(webp)/cdn.vox-cdn.com/uploads/chorus_image/image/66972412/face_depixelizer_obama.0.jpg)

## ML and out of sample prediction
- Did the algorithm fail?
- If you were presented with the right picture only and knew this was generated by a computer you would probably be very impressed
- When compared to the left picture this is outrageously wrong
- But is it enough to talk about biases?
- Here are a few other out of sample predictions

![](https://pbs.twimg.com/media/Ea-8T2NXkAEfH6y?format=png&name=900x900)

![](https://pbs.twimg.com/media/Ea_AGceXYAYg4KT?format=jpg&name=medium)

## Sample induced bias
- What is wrong here?
- To be clear there is no reason to believe that the people who trained this NN were racists (i.e. they did not add something to their algorithm to make pictures whiter)
- Instead, the issue is linked to the pictures used for training and testing

## Sample induced bias, continued
- Think about your assignment: I first asked you to split your data between train and test
- Your models performed more or less well on these datasets
- Now if your data is indeed representative of the prediction task at hand this is perfect
- But if somehow your dataset was not representative then your out-of-sample MSE would be less meaningful
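
A toy sketch of this point, with made-up numbers. The collected dataset over-represents one group, the train/test split is taken from that same skewed dataset, and the "model" is simply the training mean. The group labels, values and proportions are all assumptions for illustration.

```python
def mse(y_true, prediction):
    """Mean squared error of a constant prediction."""
    return sum((y - prediction) ** 2 for y in y_true) / len(y_true)

# Made-up outcomes for two groups (purely illustrative)
group_a = [1.0, 1.2, 0.8, 1.1, 0.9] * 20  # 100 observations centered on 1
group_b = [3.0, 3.2, 2.8, 3.1, 2.9] * 20  # 100 observations centered on 3

# The collected dataset over-represents group A: 90 rows from A, 10 from B
dataset = group_a[:90] + group_b[:10]
train, test = dataset[::2], dataset[1::2]  # alternate rows: 50 train, 50 test

# "Model": predict the training mean for everyone
prediction = sum(train) / len(train)

population = group_a + group_b  # the real prediction task is 50/50
print("test MSE (same skewed sample):", round(mse(test, prediction), 2))
print("MSE on the 50/50 population:  ", round(mse(population, prediction), 2))
print("MSE on group B only:          ", round(mse(group_b, prediction), 2))
```

The test set inherits the skew of the dataset, so the test MSE looks reassuring while the error on the under-represented group is much larger.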

## Objective function
- Another example is Amazon, which used ML to screen resumes
- The goal was to select the best resumes based on how close they were to those of employees it had already hired
- They ended up with a model whose best-fitting candidates were, most of the time, middle-aged white males...

# Complexity restrictions and Biases
- Recall that the parameters of ML models do not have the same interpretation as OLS coefficients
- One way to think about this is to think about omitted variable bias (OVB)
- If you have OVB and you do not control for the missing variable, then your parameter of interest will capture the effect of the variable in the regression __and__ part of the effect of the missing variable
- A complexity restriction is very likely to shrink or drop parameters on variables that are correlated with the ones that remain, which creates exactly this situation
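
The omitted variable logic can be checked on a tiny, noise-free example. Assume $y = \beta_1 x_1 + \beta_2 x_2$ with $x_1$ and $x_2$ correlated, and suppose the complexity restriction drops $x_2$. The slope from regressing $y$ on $x_1$ alone then equals $\beta_1 + \beta_2 \, \mathrm{cov}(x_1, x_2) / \mathrm{var}(x_1)$. The variable values and coefficients below are made up for illustration.

```python
def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Population covariance of two equally long lists."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # variable kept in the model
x2 = [0.0, 1.0, 1.0, 2.0, 2.0, 3.0]  # correlated variable that gets dropped
beta1, beta2 = 1.0, 2.0              # assumed true coefficients
y = [beta1 * a + beta2 * b for a, b in zip(x1, x2)]

# Slope of the "short" regression of y on x1 only (with an intercept)
short_slope = cov(x1, y) / cov(x1, x1)
# What the omitted variable bias formula predicts
ovb_formula = beta1 + beta2 * cov(x1, x2) / cov(x1, x1)

print(f"true beta1 = {beta1}, short-regression slope = {short_slope:.3f}")
print(f"OVB formula gives {ovb_formula:.3f}")
```

The coefficient on the remaining variable is still useful for predicting $y$, but it no longer measures the effect of $x_1$ alone.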


## Example
- Say that you want to predict some y based on education and race
- It is a statistical fact that, in the US, the black community is __on average__ less educated than the white community
- In the context of LASSO, where I do not want too many predictors, perhaps race is slightly less informative and will be shrunk to zero.
- You may even want specifically to avoid including race, depending on what your target y is, to make sure you are not including racial biases

## Example continued
- Many states in the US are now using machine learning to predict a defendant’s risk of future crime
- The goal is to remove the judge's bias and try to predict "objectively", based on some data, who is likely to commit a crime again in the future
- A classification task

## Example continued
- When performing classification tasks, one can use a confusion matrix

|Prediction/Reality| FALSE | TRUE |
| ---| --- | --- |
|__FALSE__| True Negative | False Negative | 
|__TRUE__| False Positive | True Positive | 
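
A minimal sketch of how the four cells of the table are counted from a list of true labels and a list of predictions. The labels below are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count the four cells of a binary confusion matrix."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred and truth:
            counts["TP"] += 1
        elif pred and not truth:
            counts["FP"] += 1
        elif truth:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Made-up labels: reality first, prediction second
y_true = [True, True, False, False, True, False, False, True]
y_pred = [True, False, False, True, True, False, False, False]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
print(counts)    # {'TN': 3, 'FP': 1, 'FN': 2, 'TP': 2}
print(accuracy)  # 0.625
```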


## Example continued
- An important [investigation](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) by ProPublica looked at the risk-assessment algorithm created by the for-profit company Northpointe.
    - The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.
    - White defendants were mislabeled as low risk more often than black defendants.
    - The algorithm was somewhat more accurate than a coin flip. Of those deemed likely to re-offend, 61 percent were arrested for any subsequent crimes within two years.
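
These findings are statements about error rates computed separately for each group. The sketch below uses entirely made-up counts (they are not the figures from the article) to show how two groups can have the same overall accuracy while one has a higher false positive rate and the other a higher false negative rate.

```python
def make_rows(n_tp, n_fn, n_fp, n_tn):
    """Toy records as (reoffended, flagged_high_risk) pairs."""
    return ([(True, True)] * n_tp + [(True, False)] * n_fn
            + [(False, True)] * n_fp + [(False, False)] * n_tn)

def false_positive_rate(rows):
    """Share of people who did not reoffend but were flagged high risk."""
    flags = [flagged for reoffended, flagged in rows if not reoffended]
    return sum(flags) / len(flags)

def false_negative_rate(rows):
    """Share of people who reoffended but were not flagged."""
    flags = [flagged for reoffended, flagged in rows if reoffended]
    return 1 - sum(flags) / len(flags)

def accuracy(rows):
    return sum(reoffended == flagged for reoffended, flagged in rows) / len(rows)

# Hypothetical groups with invented counts
group_a = make_rows(n_tp=6, n_fn=4, n_fp=4, n_tn=6)
group_b = make_rows(n_tp=4, n_fn=6, n_fp=2, n_tn=8)

for name, rows in (("group_a", group_a), ("group_b", group_b)):
    print(name,
          "accuracy:", round(accuracy(rows), 2),
          "FPR:", round(false_positive_rate(rows), 2),
          "FNR:", round(false_negative_rate(rows), 2))
```

Looking only at overall accuracy would hide the fact that the two groups bear different kinds of mistakes.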

## Bias explained
- Say that you want to predict whether a convict will commit another crime in the future
- You are only allowed to use a decision tree
- Your tree can only have one split
- If it is in your dataset, there's a good chance that race would be the split that gives you the best prediction

## Bias explained continued
- Think back on our principal component analysis
- You don't need a race dummy to characterize a person of color
- You could instead use things like education, the neighborhood in which the person was raised and so on
- If these things are sufficiently correlated with race then you would still predict future crime based on race __even if__ race is never in your algorithm
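
A toy sketch of this proxy effect. The data are invented: two hypothetical groups, two hypothetical neighborhoods, and a recorded outcome whose rate depends only on the neighborhood. The "model" is a one-split rule on neighborhood that never sees the group variable, yet its predictions differ sharply by group because group and neighborhood are correlated.

```python
def make_rows(group, neighborhood, n, n_rearrested):
    """Build n toy records, the first n_rearrested of which are labeled True."""
    return [
        {"group": group, "neighborhood": neighborhood, "rearrested": i < n_rearrested}
        for i in range(n)
    ]

# Within each neighborhood the re-arrest rate is the same for both groups
# (60% in "north", 30% in "south"); only where people live differs.
data = (
    make_rows("a", "north", 80, 48) + make_rows("a", "south", 20, 6)
    + make_rows("b", "north", 20, 12) + make_rows("b", "south", 80, 24)
)

def fit_one_split(rows):
    """One-split 'tree' on neighborhood: predict the majority label in each leaf."""
    rule = {}
    for hood in {r["neighborhood"] for r in rows}:
        labels = [r["rearrested"] for r in rows if r["neighborhood"] == hood]
        rule[hood] = sum(labels) / len(labels) > 0.5
    return rule

rule = fit_one_split(data)  # the group variable is never used

def flag_rate(rows, group):
    """Share of a group that the rule flags as high risk."""
    members = [r for r in rows if r["group"] == group]
    return sum(rule[r["neighborhood"]] for r in members) / len(members)

print("flagged in group a:", flag_rate(data, "a"))
print("flagged in group b:", flag_rate(data, "b"))
```

Dropping the sensitive variable from the features is therefore not enough to guarantee that predictions are unrelated to it.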

# Machine learning and decision making
- What does this mean for ML as a way to make decisions?
- As we just saw, part of Northpointe's bias was coming from the features
- Would it still be a bias if people of color were indeed more likely to commit crimes and the algorithm had just as many false positives for Black defendants as for white defendants?
- This enters a philosophical debate but here is what can be said:
    - Machines are learning from data
    - Removing the human from the learning process doesn't remove the bias from the data itself
- Using ML to make decisions requires tools that are not statistical but for which social scientists are well equipped