# Machine learning vs. traditional programming

<img src='https://d33wubrfki0l68.cloudfront.net/404cea9efbb7f4a8f1d3767e67806873044fa92b/1d74b/static/uploads/machine-learning-vs-traditional.webp' style="max-width: 50%; height: auto;"/>

# what is learning?

learning = training = fitting 

= estimating function parameters 

= identifying patterns in data 

= gaining the ability to perform a task well on unseen data

- def of learning: a computer program is said to learn from experience $E$ if its performance at tasks $T$, as measured by $P$, **improves** with experience $E$

- E: a dataset of $n$ samples of a random vector in $\mathbb{R}^{d}$, stacked as $X \in \mathbb{R}^{n \times d}$

- T: hard-coded in the algorithm

- P: differentiable loss evaluated on test data 

# Goals of learning

- predict well

    e.g., predict future stock prices
    
    first we focus on predicting well


- reject outliers

    e.g., identify malicious devices on a network
    
    
- identify if a feature actually matters (correlation)

    e.g., does adding a pool increase the value of my home?
    
    not necessarily in a causal way: causal inference is over-used



- Bayesian stats, neural nets: fill in the unknowns (could mean predict too)
    





# Examples of learning

- given info about a house, predict its value

- given an image, predict the digit, or predict if it is offensive

- given some text, find underlying themes

- given some emails, automatically tag and group them

- given a word, find its translation


# Types of learning

## supervised learning: regression/classification

supervised learning goal: given a bunch of labelled data, want to label new data

learn a function $\hat{f}: \mathcal{X} \to \mathcal{Y}$, with $\mathcal{X} \subseteq \mathbb{R}^{d}$ and $\mathcal{Y} \subseteq \mathbb{R}$, from data $\mathcal{D}= \left\{(\mathbf{x}_i, y_i)\right\}_{i=1}^n$, want $\hat{f}(\mathbf{x}_i) \approx y_i$

- regression: y is continuous/ordered, e.g., home price, stock price

- classification: y is discrete/unordered, e.g., image label, stock goes up or down

    find a **decision boundary** in feature space
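
A minimal regression sketch of this setup, assuming synthetic data and a linear model fit by least squares (the data-generating weights `w_true` and the helper `f_hat` are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labelled data (an assumption for illustration):
# y = 3*x1 - 2*x2 + small noise
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([3.0, -2.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Fit f_hat(x) = w^T x by least squares on the labelled data
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def f_hat(x_new):
    """Predict labels for new feature rows."""
    return np.asarray(x_new) @ w_hat

# Label a new, unseen point
x_new = np.array([[1.0, 1.0]])
y_pred = f_hat(x_new)
```

Here $y$ is continuous, so this is regression; a classifier would instead return a discrete label on each side of a decision boundary.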

## unsupervised learning: finding structure/visualization

given a bunch of **unlabelled** data, find structure

learn an embedding $\hat{f}: \mathcal{X} \to \mathcal{Y}$, with $\mathcal{X} \subseteq \mathbb{R}^{d}$ and $\mathcal{Y} \subseteq \mathbb{R}^{p}\ (d \gg p)$, where $\mathcal{Y}$ is lower-dimensional and easier to interpret

- clustering

- density estimation

- data generation

- representation learning

- denoising
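
One possible embedding is PCA, sketched below with an SVD. This is only one choice of $\hat f$; the synthetic data, the sizes `n, d, p`, and the noise level are assumptions for illustration (and $d$ is kept small here for readability).

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabelled data (synthetic): n points in d = 5 dimensions that mostly
# vary along p = 2 hidden directions, plus a little noise
n, d, p = 300, 5, 2
latent = rng.normal(size=(n, p))
mixing = rng.normal(size=(p, d))
X = latent @ mixing + 0.01 * rng.normal(size=(n, d))

# PCA via SVD of the centered data matrix
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Embedding f_hat: R^d -> R^p, projection onto the top-p principal directions
Y = X_centered @ Vt[:p].T

# Fraction of total variance kept by the p-dimensional embedding
explained = (S[:p] ** 2).sum() / (S ** 2).sum()
```

No labels are used anywhere: the structure (two dominant directions) is found from $X$ alone.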

## semi-supervised learning

combine information from both labeled and unlabeled data to make predictions

# how to represent data?

a set of data is often written in calligraphic letters

- features: $\mathcal{X}$

- labels:  $\mathcal{Y}$

we often use vectors, matrices, and tensors

- vector: $x\in \mathbb{R}^{d}$ is a vector in d dimensions

$$x=\begin{bmatrix}x_{\left ( 1 \right )}\\ x_{\left ( 2 \right )}\\ \vdots \\ x_{\left ( d \right )}\end{bmatrix}$$

$x_{\left ( i \right )}$ is the ith coordinate

$x$ is taken as a column vector; its transpose $x^{T}$ is a row vector

- matrix: $M\in \mathbb{R}^{n\times d}$ is a matrix with n rows and d columns

$$\begin{bmatrix} M_{\left ( 11 \right )} & M_{\left ( 12 \right )} & \cdots & M_{\left ( 1d \right )}\\ \vdots & \ddots & &\vdots \\ M_{\left ( n1 \right )} & M_{\left ( n2 \right )} & \cdots & M_{\left ( nd \right )} \end{bmatrix}
$$

$M_{\left ( ij \right )}$ or $M_{\left ( i,j \right )}$ is the entry of M in the ith row and jth column

row vector $M_{\left ( i: \right )}$ (the ith row of M) is 1 x d; its transpose $M_{\left ( i: \right )}^{T}$ is a d x 1 column vector

- tensor: multi-dimensional array, generalizing vectors and matrices

    $T\in \mathbb{R}^{n\times d\times k}$ is a 3D tensor

- data is $Z=\left \{ \left ( x_{i},y_{i} \right ) \right \}_{i=1}^{n}$

    features form a matrix $X \in \mathbb{R}^{n\times d}$, one sample $x_i$ per row

    labels form a column vector $y \in \mathbb{R}^{n}$
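
The same objects as NumPy arrays, as a small sketch (the sizes and values are arbitrary; note that NumPy indexes from 0 while the notation above counts from 1):

```python
import numpy as np

d, n, k = 3, 4, 2

x = np.arange(1, d + 1, dtype=float).reshape(d, 1)  # column vector, shape (d, 1)
x_T = x.T                                           # row vector, shape (1, d)

M = np.arange(n * d, dtype=float).reshape(n, d)     # matrix with n rows, d columns
row_i = M[1:2, :]                                   # 2nd row kept 2-D, shape (1, d)
entry = M[1, 2]                                     # row 2, column 3 in 1-based notation

T = np.zeros((n, d, k))                             # 3D tensor

# Dataset Z = {(x_i, y_i)}: features stacked as rows of X, labels in y
X = M
y = np.array([0.0, 1.0, 1.0, 0.0])
```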

![image.png](attachment:image.png)

# learning is an optimization problem

![image.png](attachment:image.png)

$\mathcal{F}$: functions or models, can be any function class, e.g., linear functions, nearest neighbors, neural networks

learning is an optimization problem

the best model $\hat f$ minimizes empirical risk $\hat R$

$f^*$: true function, $\hat{f}$: estimated function

$R$: population risk, $\hat{R}$: empirical risk 

$l(f(x), y)$: loss

## population risk

- Definition: **population risk** of a function f is the **expected loss** of that function with data drawn from the **population** as inputs

$$R(f) = \mathbb{E}l(f(x), y)$$

- if data is **i.d.** (identically distributed), expectation of empirical risk = population risk

   data don't need to be independent because expectation is linear

$$\mathbb{E}\left [ \hat{R}\left ( f \right ) \right ]=R\left ( f \right )\quad (\text{assuming i.d.})$$

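- by the law of large numbers (for a fixed $f$ and i.i.d. data), empirical risk is close to population risk when the sample size n is large; the central limit theorem describes the size of the fluctuations, and empirical process theory (general) or Glivenko-Cantelli (specific) makes the statement uniform over a class of functions

$$\hat{R}\left ( f \right )\overset{\text{large sample}}{\longrightarrow}R\left ( f \right )$$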
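
The unbiasedness claim above can be checked in one line. A sketch, assuming $f$ is fixed before seeing the data and each pair $(x_i, y_i)$ has the same distribution as the population pair $(x, y)$; independence is never used:

$$
\mathbb{E}\left[\hat{R}(f)\right] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n l(f(x_i), y_i)\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\, l(f(x_i), y_i) = \frac{1}{n}\sum_{i=1}^n R(f) = R(f)
$$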


## empirical risk

- Definition: empirical risk is the mean loss over the sample

$$
\hat R(f) = \frac{1}{n}\sum_{i=1}^n l(f(x_i), y_i)
$$
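
A small simulation sketch of empirical risk approaching population risk as $n$ grows. The population (`x ~ N(0, 1)`, `y = 2x + eps`), the candidate model `f`, and the squared loss are all assumptions chosen so that the population risk has a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_loss(y_hat, y):
    return (y_hat - y) ** 2

def empirical_risk(f, x, y, loss=squared_loss):
    """Mean loss of f over the sample (x_i, y_i)."""
    return loss(f(x), y).mean()

# Assumed population: x ~ N(0, 1), y = 2x + eps, eps ~ N(0, 1) independent of x
def sample(n):
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)
    return x, y

def f(x):
    """A fixed candidate model, chosen before seeing any data."""
    return 1.5 * x

# Population risk in closed form:
# E[(1.5x - 2x - eps)^2] = 0.25 * E[x^2] + E[eps^2] = 1.25
population_risk = 0.25 + 1.0

for n in (10, 1_000, 100_000):
    x, y = sample(n)
    print(n, empirical_risk(f, x, y))
```

The printed values should scatter around `population_risk`, with the scatter shrinking as `n` increases.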

## empirical risk minimization (ERM)-based learning



- Assumption: **sample/empirical distribution ≈ population distribution**

    because the population distribution is unknown, we can't directly optimize population risk, so we optimize **empirical risk** instead

- ERM is a "frequentist" way to assess goodness of fit

$$
\hat{f} = \underset{f \in \mathcal{F}}{\arg\min} \hat{R}(f) = \underset{f \in \mathcal{F}}{\arg\min} \frac{1}{n}\sum_{i=1}^n l(f(x_i), y_i)
$$

- $\arg\min$ returns the set of all functions $f \in \mathcal{F}$ that attain the smallest empirical risk; $\hat{f}$ is any one element of that set
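
A minimal ERM sketch, assuming squared loss and a deliberately tiny function class $\mathcal{F} = \{f_w(x) = wx\}$ with $w$ on a grid (the data and the grid are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample (assumed): y = 2x + noise
n = 500
x = rng.normal(size=n)
y = 2.0 * x + 0.5 * rng.normal(size=n)

def empirical_risk(w):
    """Empirical risk of f_w(x) = w * x under squared loss."""
    return np.mean((w * x - y) ** 2)

# Function class F = {f_w : w on a grid}; ERM picks the grid point
# with the smallest empirical risk
grid = np.linspace(-5.0, 5.0, 1001)
risks = np.array([empirical_risk(w) for w in grid])
w_erm = grid[np.argmin(risks)]

# For this class and loss, the minimizer over all real w has a closed form
w_closed = (x @ y) / (x @ x)
```

Brute-force search only works because the class is one-dimensional; for richer classes the same arg min is found with closed-form solutions or gradient-based optimizers.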

## loss and cost

- Definition of loss: for a data point $(x, y)$, the loss $l: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ is a function of the predicted response $\hat y$ and the true response $y$

$$L=l\left ( \hat{y},y \right )$$

- Definition of cost: for a sample of size $n$, the cost is the sum of the losses over all data points

$$C=\sum_{i=1}^n l\left (\hat{y}_i,y_i \right )$$
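
How loss, cost, and empirical risk relate, as a short sketch with squared loss and made-up numbers:

```python
import numpy as np

def squared_loss(y_hat, y):
    """Loss for one data point (also works elementwise on arrays)."""
    return (y_hat - y) ** 2

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 5.0])

losses = squared_loss(y_pred, y_true)  # one loss value per data point
cost = losses.sum()                    # C: sum of losses over the sample
risk_hat = cost / len(y_true)          # empirical risk: mean loss = C / n
```

Cost and empirical risk differ only by the factor $1/n$, so they share the same minimizer.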

## loss and empirical risk

![image.png](attachment:image.png)

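- Problems with the losses above

    - every data point is weighted the same
    
    - the fit is biased towards the majority of the data: for a constant (0-D) model, minimizing MSE gives $\hat y = \bar y$ (the mean), and minimizing MAD (absolute-error loss) gives $\hat y = \mathrm{median}(y)$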
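
A quick numerical check of the constant-model case, assuming toy labels with one outlier and a brute-force search over candidate constants:

```python
import numpy as np

# Toy labels with one outlier (made up for illustration)
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate constant predictions c
candidates = np.linspace(0.0, 100.0, 10001)

mse = np.array([np.mean((y - c) ** 2) for c in candidates])
mae = np.array([np.mean(np.abs(y - c)) for c in candidates])

c_mse = candidates[np.argmin(mse)]  # compare with y.mean()
c_mae = candidates[np.argmin(mae)]  # compare with np.median(y)
```

The squared-error minimizer is pulled towards the outlier, while the absolute-error minimizer stays with the bulk of the data.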

# Heuristic

- a heuristic is a technique for solving an optimization problem faster than classic methods by approximating the exact solution

- e.g., rules of thumb, shortcuts

- a heuristic trades optimality, completeness, accuracy, or precision for speed

# standardization

by convention, $X$ should be standardized; $y$ is assumed to be centered for ridge regression

for continuous predictors, standardize $X$ to have mean 0 and sd 1

for categorical (two-level) predictors, encode $X$ as values in $\{-1, +1\}$



- scale-equivariant: OLS

    standardization is not needed, but it helps the numerical computation for large $X$ 

- not scale-equivariant:

    Lasso and Ridge regression: the regularization penalty depends on the magnitude of the weights, which changes with feature scale

    KNN and K-means: distance-based, so large-scale features can dominate the distance calculation
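
A small sketch of why scale matters for distance-based methods. The income/age numbers and the helpers `standardize` and `nearest_to_first` are hypothetical, chosen so that the nearest neighbor changes after standardization.

```python
import numpy as np

# Two features on very different scales (made-up numbers):
# column 0 = income in dollars, column 1 = age in years
X = np.array([
    [50_000.0, 25.0],
    [50_500.0, 65.0],
    [52_000.0, 26.0],
    [90_000.0, 45.0],
])

def standardize(X):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def nearest_to_first(X):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return 1 + int(np.argmin(dists))

# Raw features: income differences dominate, so the 40-year age gap is ignored
raw_neighbor = nearest_to_first(X)

# Standardized features: both columns contribute on a comparable scale
scaled_neighbor = nearest_to_first(standardize(X))
```

On the raw data the closest row to row 0 is row 1 (similar income, very different age); after standardization it is row 2 (similar in both features).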

# source of bias

- labelling

- sample selection

- features

- architecture

- loss function

- deployed systems create feedback loops

bias laundering

Maciej Ceglowski: "Instead of relying on algorithms, which we can be accused of manipulating for our benefit, we have turned to machine learning, an ingenious way of disclaiming responsibility for anything. Machine learning is like money laundering for bias. It's a clean, mathematical apparatus that gives the status quo the aura of logical inevitability. The numbers don't lie."