# Session 13:
## Non-linear ML and applications

*Andreas Bjerre-Nielsen*

## Agenda
machine learning for social scientists
1. [measures for classification](#Measures-for-classification)  
1. [nested cross validation](#Nested-cross-validation)  
1. [non-linear ML](#Non-linear-ML)
  -  [tree based models](#Tree-based-models)
  -  [neural networks](#Neural-networks)
1. [machine learning for social scientists](#ML-for-social-science)


## Vaaaamos

In [23]:
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 
import seaborn as sns

plt.style.use('default') # set style (colors, background, size, gridlines etc.)
plt.rcParams['figure.figsize'] = 10, 4 # set default size of plots
plt.rcParams.update({'font.size': 18})

## With great power ...

### comes great responsibility...

You have been suffering a lot with implementing estimators... why?

- If you don't know what is going on inside an estimator, you are likely to apply it erroneously.
- So this is very important, although you will not use it in the exam.

# Measures for classification

## Breakdown by error type (1)

We measure the accuracy as the share of correct predictions, i.e.
$$ACC=\frac{True}{True+False}$$

Can we decompose?

## Breakdown by error type (2)
Yes, we can decompose into
$$ACC=\frac{TP+TN}{TP+TN+FP+FN}$$

<center><img src='https://github.com/rasbt/python-machine-learning-book-2nd-edition/raw/master/code/ch06/images/06_08.png' alt="Drawing" style="width: 400px;"/></center>
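The decomposition above can be checked numerically. The following is a minimal sketch using `sklearn.metrics.confusion_matrix`; the labels and predictions are made up for illustration and are not from the lecture data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (illustration only)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy from the four cells, compared with sklearn's own accuracy
acc_manual = (tp + tn) / (tp + tn + fp + fn)
acc_sklearn = accuracy_score(y_true, y_pred)
print(tn, fp, fn, tp, acc_manual, acc_sklearn)
```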

## Breakdown by error type (3)

Some powerful measures:

- Precision: share of *predicted positives* that are actually positive
    - PRE = $\frac{TP}{TP+FP}$    
    - = positive predictive value 
- Recall: share of *actual positives* that are predicted positive    
   - REC = $\frac{TP}{TP+FN}=\frac{TP}{AP}$ 
   - = true positive rate = 1 - false negative rate
- F1: harmonic mean of precision and recall: $\frac{2\cdot PRE\cdot REC}{PRE+ REC}$


In [1]:
from sklearn.metrics import precision_score, recall_score, f1_score
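As a small illustration of the three measures, the sketch below computes them by hand from the confusion matrix cells and compares with the `sklearn` functions imported above. The labels are hypothetical and chosen only so that precision and recall differ.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 4 actual positives, 5 predicted positives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Count the cells needed for precision and recall
tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

pre_manual = tp / (tp + fp)
rec_manual = tp / (tp + fn)
f1_manual = 2 * pre_manual * rec_manual / (pre_manual + rec_manual)

pre = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(pre, rec, f1)
```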

## Breakdown by error type (4)

Classification models provide a predicted probability of belonging to the class:
- The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate while varying the threshold for predicting positive.
    - ROC is a *theoretical* measure of model performance based on probabilities.
    - AUC: Area Under the (ROC) Curve.

<center><img src='https://github.com/rasbt/python-machine-learning-book-2nd-edition/raw/master/code/ch06/images/06_10.png' alt="Drawing" style="width: 800px;"/></center>
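A minimal sketch of how the ROC curve and AUC could be computed with `sklearn`. The simulated data from `make_classification` and the choice of logistic regression are assumptions made for illustration; any classifier with `predict_proba` would work the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Simulated binary classification data (illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probability of the positive class on the test set
proba = clf.predict_proba(X_test)[:, 1]

# One (FPR, TPR) point per threshold; AUC summarizes the whole curve
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc_score = roc_auc_score(y_test, proba)
area = auc(fpr, tpr)  # trapezoidal area under the same curve
print(round(auc_score, 3))
```

Plotting `fpr` against `tpr` with `plt.plot(fpr, tpr)` gives a figure like the one above.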

# Nested cross validation

## Nested cross validation (1)

- Model validation does not consider that we are also tuning hyperparameters:
  - Leads to overfitting (Varma & Simon 2006; Cawley, Talbot 2010).
- Solution is **nested cross validation**.
  - Validation step should not be modelled as 1) train; 2) test.
  - Better way is 1) model selection: train, validate; 2) test.
  - Implement as on pp 204-205 in Python Machine Learning:
      - first inner loop: `GridSearchCV` 
      - second outer loop: `cross_val_score`

## Nested cross validation (2)
We can apply cross-validation at two levels:
- the outer level, i.e. split the data into development and test sets
- the inner level, i.e. split the development data into training and validation sets

<center><img src='https://github.com/rasbt/python-machine-learning-book-2nd-edition/raw/master/code/ch06/images/06_07.png' alt="Drawing" style="width: 650px;"/></center>
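The two levels can be sketched by passing a `GridSearchCV` object (inner loop) to `cross_val_score` (outer loop). The simulated data, the logistic regression pipeline and the grid of `C` values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data (illustration only)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10]}

# Inner loop: model selection on the development data (train, validate)
inner = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')

# Outer loop: each fold reruns the full grid search, then scores on held-out test data
outer_scores = cross_val_score(inner, X, y, cv=5, scoring='accuracy')
print(outer_scores.mean(), outer_scores.std())
```

Because the hyperparameters are chosen again within every outer fold, the outer test folds never influence model selection.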

# Non-linear ML

## Success of machine learning
*Are linear models the best performing models?*


George E. P. Box: All models are wrong

- But some are useful.

Evidence
- Sometimes linear models are the best 
- But there are many others that in general perform better
- They can find patterns that linear models cannot

## Success of machine learning (2)
*What do we call models that can fit any pattern?*


- Universal approximators. 
    - We can also make input non-linear using `PolynomialFeatures` of any order.
        - Follows from iterative Taylor expansion
        - Problem?
- These are very powerful tools.
    - Example of recognizing characters and digits in handwriting (MNIST data)
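One way to explore the "Problem?" above is to count how many columns `PolynomialFeatures` produces as the order grows. This is a small sketch with a hypothetical input of 10 features; the closed-form count $\binom{n+d}{d}-1$ (without the constant term) is included for comparison.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 10
X = np.zeros((1, n_features))  # the values are irrelevant, only the shape matters

n_terms = {}
for degree in [1, 2, 3, 4]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    n_terms[degree] = poly.fit_transform(X).shape[1]
    # Number of monomials of order 1..degree in n_features variables
    assert n_terms[degree] == comb(n_features + degree, degree) - 1

print(n_terms)
```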

# Tree based models

## A hierarchical structure 
*What does a `decision tree` look like?*



<center><img src='https://github.com/rasbt/python-machine-learning-book-2nd-edition/raw/master/code/ch03/images/03_17.png' alt="Drawing" style="width: 800px;"/></center>


## Sample splitting (1)
*Suppose we have data like below, and we are interested in predicting criminality*

In [24]:
df = pd.DataFrame({'Criminal':[1]*5+[0]*10,
                   'From Jutland':np.random.randint(0,2,15),                   
                   'Parents together':[0]*4+[1]*10+[0],
                   'Parents unemployed':[1]*7+[0]*8})
print(df.sample(n=10))

    Criminal  From Jutland  Parents together  Parents unemployed
12         0             0                 1                   0
8          0             0                 1                   0
4          1             0                 1                   1
10         0             0                 1                   0
14         0             1                 0                   0
1          1             0                 0                   1
3          1             1                 0                   1
2          1             1                 0                   1
9          0             1                 1                   0
11         0             1                 1                   0


## Sample splitting (2)
*Let's try to split by variables and see whether it helps*

In [18]:
my_split = df\
            .groupby(['From Jutland'])\
            .Criminal\
            .mean()  # share of criminals within each group
print(my_split)

From Jutland
0    0.333333
1    0.333333
Name: Criminal, dtype: float64


## Sample splitting (3)
*What about other variables?*

In [26]:
my_split = df\
            .groupby(['Parents together', 'Parents unemployed'])\
            .Criminal\
            .mean()  # share of criminals within each combination

print(my_split)

Parents together  Parents unemployed
0                 0                     0.000000
                  1                     1.000000
1                 0                     0.000000
                  1                     0.333333
Name: Criminal, dtype: float64


## Sample splitting (4)
*What might a tree structure look like?*

- Parents together: Yes > Not criminal    
<br/><br/>        
- Parents together: No
    - Parents unemployed: Yes > Criminal
    - Parents unemployed: No > Not criminal        
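A tree like this can also be learned from the data. The sketch below fits `DecisionTreeClassifier` on the two informative columns of the toy data; `max_depth=2` is an assumption chosen to match the two-level structure. The learned tree may split on the variables in a different order than the hand-made tree, while giving the same prediction for every combination.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Same toy data as above, without the random 'From Jutland' column
df = pd.DataFrame({'Criminal': [1]*5 + [0]*10,
                   'Parents together': [0]*4 + [1]*10 + [0],
                   'Parents unemployed': [1]*7 + [0]*8})
features = ['Parents together', 'Parents unemployed']

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(df[features], df['Criminal'])

# Text representation of the learned splits
print(export_text(tree, feature_names=features))

# Predictions for all four combinations of the two variables
cells = pd.DataFrame([[0, 0], [0, 1], [1, 0], [1, 1]], columns=features)
cell_pred = tree.predict(cells)
print(cells.assign(prediction=cell_pred))
```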

## Improving decision trees
*What can we conclude about the decision trees?*

- Can fit anything ~ Universal Approximation
    - *little* underfitting (~low bias)
    - **LARGE** overfitting (~large variance)
    
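`random forest` improves on decision trees by averaging many trees, each grown on a bootstrap sample with random feature subsets, which reduces variance.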
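A rough way to see both points is to compare a fully grown tree with a random forest. The simulated data and the settings (`n_estimators=200`, 5 folds) are assumptions for illustration; the exact scores depend on them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Simulated data with continuous features (illustration only)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# A fully grown tree fits the training data perfectly (low bias) ...
train_acc = tree.fit(X, y).score(X, y)

# ... so out-of-sample performance has to be judged with cross validation
tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)
print(train_acc, tree_scores.mean(), forest_scores.mean())
```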

# Neural networks

## Neural networks (1)
*I have forgotten, what was Adaline?*


<center><img src='https://github.com/rasbt/python-machine-learning-book-2nd-edition/raw/master/code/ch12/images/12_01.png' alt="Drawing" style="width: 900px;"/></center>


## Neural networks (2)
*Why are neural networks called deep learning?*


<center><img src='https://github.com/rasbt/python-machine-learning-book-2nd-edition/raw/master/code/ch12/images/12_02.png' alt="Drawing" style="width: 800px;"/></center>


## Neural networks (3)
*So learning about the Perceptron and Adaline actually has value?*

Yes, lots of value: these are the neurons of neural networks. 

In other words, they are fundamental building blocks for doing deep learning.



## Neural networks (4)
*How are neural networks different from simply using polynomial features?*

- [Cheng et al.](https://arxiv.org/abs/1806.06850) argue that neural nets and polynomial expansion are essentially the same.


There are however some differences:
- Hidden layer can be a lot smaller than all possible combinations 
    - A small picture (28x28 pixels) with 3 color channels has 2,352 inputs and more than 13 billion third-order combinations ($2352^3$).
- It uses non-linear activation functions. 
- A neural network with one hidden layer has the universal approximation property.
    - This corresponds to quadratic if linear.
- In practice they perform really well, especially on non-linear data
    - Computer vision: recognizing characters, content in images
    - Natural Language Processing (NLP): parsing text and speech data
    - Much more
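A network with a single hidden layer is available in `sklearn` as `MLPClassifier`. The sketch below fits one on simulated non-linear data (`make_moons`) next to a logistic regression; the data, the 16 hidden units and the other settings are assumptions for illustration only.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two interleaved half circles: not linearly separable
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

linear = make_pipeline(StandardScaler(), LogisticRegression())
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), activation='relu',
                  max_iter=2000, random_state=0))

linear_acc = linear.fit(X_train, y_train).score(X_test, y_test)
mlp_acc = mlp.fit(X_train, y_train).score(X_test, y_test)

# Weights from the 2 inputs to the 16 hidden neurons
hidden_weights = mlp.named_steps['mlpclassifier'].coefs_[0]
print(linear_acc, mlp_acc, hidden_weights.shape)
```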

# Universal approximation

## Universal approximators (1)
*Are decision trees the only universal approximators?*

No, there are also **kernel based** ones.

- K Nearest Neighbors:
    - Approximate by taking average/mode from K nearest neighbors
        - Need standardization
    - Can also be used for interpolated local measures 
        - (weather, pollution, house prices etc.)
    - Not good with high dimensionality.
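The need for standardization can be shown with a tiny made-up example: two features on very different scales, where the class depends only on the small-scale feature. Without scaling, the large-scale feature dominates the distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the class depends on the first (small-scale) feature only
X = np.array([[1.0, 1000.0],
              [1.1, 5000.0],
              [5.0, 1200.0],
              [5.2, 5200.0]])
y = np.array([0, 0, 1, 1])

# New point: close to class 0 on the first feature
query = np.array([[1.0, 5200.0]])

# Raw distances are dominated by the second feature
raw_knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
raw_pred = raw_knn.predict(query)[0]

# After standardization both features contribute on a comparable scale
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=1)).fit(X, y)
scaled_pred = scaled_knn.predict(query)[0]
print(raw_pred, scaled_pred)
```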


## Universal approximators (2)
*What can these approximators be used for?*


- Reduce bias (underfitting)
- Must be careful that we do not overfit (control variance)

*Can I get an overview of them?*
- Kernel methods, e.g. nearest neighbor
- Neural networks (1+ hidden layer, deep learning)
- Polynomial inputs in linear models (i.e. Taylor approximation)
- Tree based models

## Universal approximators (3)
*Can we use these methods?*

Yes, they all come off the shelf with `sklearn`.
- E.g. `from sklearn.ensemble import RandomForestClassifier`

For neural networks that have more hidden layers (deep learning) you need new packages:
- We recommend looking at either `pytorch` or using `keras` (which uses `tensorflow`)

*Should we use these methods in the exam of this course?*

## NO

# ML for social science

## ML for social science (1): testing predictive power

ML helps us with making predictive models: 

- Assess the performance of our models
- Choose the parameters that help estimate the best performing model 

Can we use ML to help us clarify whether a new feature set is relevant for prediction?
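One possible approach, sketched here under assumptions, is to compare out-of-sample performance with and without the new feature set. The data are simulated with `make_classification`; the split into "baseline" and "new" columns is hypothetical and only works because `shuffle=False` places the informative features first.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Simulated data: columns 0-2 are informative, columns 3-5 are pure noise
X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

baseline_cols = [0, 3, 4, 5]  # hypothetical existing features
new_cols = [1, 2]             # hypothetical new feature set

model = LogisticRegression(max_iter=1000)

# Cross validated accuracy without and with the new features
base_scores = cross_val_score(model, X[:, baseline_cols], y, cv=5)
full_scores = cross_val_score(model, X[:, baseline_cols + new_cols], y, cv=5)
print(base_scores.mean(), full_scores.mean())
```

If the score with the new features is clearly higher across folds, this suggests that they carry predictive information beyond the baseline.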

## ML for social science (2): new data

Machine learning can help us *'fill in the blanks'* and impute missing data

Input: Google Street View
- Infer neighborhood socioeconomic status (Naik, Raskar, Hidalgo 2016)

Input: Cell phone data
- Inferring poverty. (Blumenstock, Cadamuro, On 2015)
- Inferring mode of transportation. (Bjerre-Nielsen et al. 2019)
- Sleep (Cuttone et al. 2017)

Facebook data (likes, way of writing, town) can help infer
- personality and demographics (Cambridge Analytica); socioeconomic status; current mood

## ML for social science (3): better policy targeting

Social and medical scientists are often involved in policies aimed at: 
- alleviating poverty, reducing drop-out, crime etc.

Efficacy of these programs requires targeting of individuals:
- who is most poor, who is most at risk of dropping out? dying?

Kleinberg et al. 2015 show that mortality from surgery can be predicted in advance.
- avoiding surgery for patients with high predicted mortality could save billions of $ and spare them the pain of surgery
- this is relevant if the causal effects of the intervention are small

Other policies: 
- should we prescribe opioids to you (what is your risk of addiction?)
- should we audit your firm for VAT and tax review (how much do we predict you cheat)
- should we admit you to this education? (what is your dropout risk)

## ML for social science (4): improving econometrics

Many econometric methods try to establish causality:

- applications for 
    - instrumental variables (Hartford et al. 2017; Bjørn 2018)
    - matching (Wager, Athey 2017)
    - linear models with many covariates (see work by Hansen, Belloni, Chernozhukov)
- models for matching and instrumental variables have a prediction problem built-in
    - can be enhanced with machine learning


## ML for social science (5): decision problems and game theory

We can solve decision problems and games using reinforcement learning
- uses neural networks to teach "agent" to play game
    - learn to play computer games, poker, etc.
- solve problems where game theory is intractable

# Outro 

There are amazing resources for you to keep learning, online and offline.

@ in Denmark.
- Dept. of Economics / Center for Social Data Science 
  - Advanced courses (expected fall 2021 or later)
      - [SDS econometrics and machine learning](https://kurser.ku.dk/course/a%C3%98kk08400u/2019-2020): tree based models for prediction and statistics; network inference
      - [Seminar in econometrics and machine learning](https://kurser.ku.dk/course/a%C3%98kk08386u/2019-2020): project based course
  - A new education MSc Social Data Science (I'm Head of Studies)
      - Courses not open currently (maybe from spring 2022)
- More courses are taught in machine learning at CS dept. (DIKU), DTU, ITU

- Michael Nielsen: deep learning neural networks

@ online: coursera, edX, DataCamp, MIT open courseware, etc.

# Everyone freeze!

### Please run the course evaluation now (<5 min)

- Evaluate our actions: 
  - What was good, what was not good

- Please evaluate: 
  - our teaching: did I make myself clear? was I too fast? what about Nicklas, David and Terne?
  - the material (lectures, exercises, books)
  - autograding
  - the quizzes (those that worked)
  - machine learning curriculum

# The end
[Return to agenda](#Agenda)