# COMP 562 – Lecture 5

$$
\renewcommand{\xx}{\mathbf{x}}
\renewcommand{\yy}{\mathbf{y}}
\renewcommand{\zz}{\mathbf{z}}
\renewcommand{\vv}{\mathbf{v}}
\renewcommand{\loglik}{\log\mathcal{L}}
\renewcommand{\likelihood}{\mathcal{L}}
\renewcommand{\Data}{\textrm{Data}}
\renewcommand{\given}{ | }
\renewcommand{\MLE}{\textrm{MLE}}
\renewcommand{\tth}{\textrm{th}}
\renewcommand{\Gaussian}[2]{\mathcal{N}\left(#1,#2\right)}
\renewcommand{\norm}[1]{\left\lVert#1\right\rVert}
\renewcommand{\ones}{\mathbf{1}}
\renewcommand{\diag}[1]{\textrm{diag}\left( #1 \right)}
\renewcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\renewcommand{\myexp}[1]{\exp\left\{#1\right\}}
$$

# Feature Scaling -- Motivation

* Idea: gradient ascent/descent tends to work better when the features are on the **same scale**

<img src="./Images/Scaling.png" align="center"/>

When the contours are skewed (elongated), the gradient steps oscillate back and forth, so learning takes longer to converge

# Feature Scaling -- Centering

**Center** features by removing the mean

$$
\begin{aligned}
\mu_i &= \frac{1}{N}\sum_{k=1}^N x_{i,k}\\ \\
x_{i,j} &= x_{i,j} - \mu_i
\end{aligned}
$$

This makes each feature's mean equal to 0. Compute the mean first, then subtract it!
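
A minimal sketch of centering, assuming one feature is stored as a plain Python list. The `center` helper and the sample values are illustrative and not part of the lecture.

```python
def center(feature):
    """Subtract the mean from every value of one feature."""
    mu = sum(feature) / len(feature)   # compute the mean first ...
    return [x - mu for x in feature]   # ... then subtract it

heights_cm = [150.0, 160.0, 170.0, 180.0]  # made-up values
centered = center(heights_cm)
print(centered)       # [-15.0, -5.0, 5.0, 15.0]
print(sum(centered))  # 0.0, so the centered feature has mean 0
```

Computing `mu` once, before any subtraction, matters: updating the values while the mean is still being accumulated would give a different result.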


# Feature Scaling -- Standardizing


**Standardize** centered features by dividing by the standard deviation

$$
\begin{aligned}
\sigma_i &= \sqrt{ \frac{1}{N-1}\sum_{j=1}^N x_{i,j}^2 }\\ \\
x_{i,j}& = \frac{x_{i,j}}{\sigma_i}
\end{aligned}
$$

Note that standardized features are first centered and then divided by their standard deviation

Standardization transforms each feature so that it has a mean of 0 and a standard deviation of 1 (z-score)
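
The two steps can be sketched as follows, again for a single feature held in a Python list. The `standardize` helper and the data are assumptions made for illustration; the $N-1$ denominator follows the formula above.

```python
import math

def standardize(feature):
    """Center one feature, then divide by its sample standard deviation."""
    n = len(feature)
    mu = sum(feature) / n
    centered = [x - mu for x in feature]
    # Standard deviation of the centered values, with N - 1 in the denominator.
    # Assumes the feature is not constant (otherwise sigma would be 0).
    sigma = math.sqrt(sum(c * c for c in centered) / (n - 1))
    return [c / sigma for c in centered]

z = standardize([150.0, 160.0, 170.0, 180.0])  # made-up values
mean_z = sum(z) / len(z)
std_z = math.sqrt(sum(v * v for v in z) / (len(z) - 1))
print(mean_z, std_z)  # approximately 0 and 1
```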

# Feature Scaling -- Normalizing

Alternatively, **normalize** centered features by dividing by their norm

$$
\begin{aligned}
r_i &= \sqrt{\sum_j x_{i,j}^2 }\\ \\
x_{i,j}& = \frac{x_{i,j}}{r_i}
\end{aligned}
$$

Note that normalized features are first centered and then divided by their norm

Normalization gives each centered feature unit norm, so every value lies in the range $[-1,1]$ regardless of the data set size
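
As a rough check, the sketch below normalizes the same made-up values at two data set sizes and confirms that the resulting feature has norm 1 in both cases. The `normalize` helper is hypothetical.

```python
import math

def normalize(feature):
    """Center one feature, then divide by its Euclidean norm r_i."""
    mu = sum(feature) / len(feature)
    centered = [x - mu for x in feature]
    # Assumes the feature is not constant (otherwise r would be 0).
    r = math.sqrt(sum(c * c for c in centered))
    return [c / r for c in centered]

small = normalize([1.0, 2.0, 3.0, 4.0])        # 4 samples
large = normalize([1.0, 2.0, 3.0, 4.0] * 25)   # same values, 100 samples

norm_small = math.sqrt(sum(v * v for v in small))
norm_large = math.sqrt(sum(v * v for v in large))
print(norm_small, norm_large)  # both approximately 1
```

Because the squared entries sum to 1, no single entry can exceed 1 in magnitude.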

# Feature Scaling Benefits

1. Centering 
  1. $\beta_0$ is equal to the mean of the target variable 
  2. Feature weights $\beta$ now tell us how much a feature's departure from its mean affects the target variable
2. Standardization
  1. All the features are on the same scale and their effects comparable
  2. Interpretation is easier: $\beta$s tell us how much a departure of one standard deviation affects the target variable
3. Normalization
  1. Scale of features is the same, regardless of the size of the dataset
  2. Hence weights learned on different sized datasets can be compared
  3. However, their combination might be problematic -- certainly we don't trust weights learned on few samples

# Classification -- Bernoulli View

We can model a target variable $y \in \{0,1\}$ using a Bernoulli distribution

$$
p(y=1\given\theta) = \theta
$$

We note that $\theta$ has to be in range $[0,1]$

We cannot directly use a weighted combination of features as $\theta$, because $\xx^T\beta$ can be any real number

We need a way to map $\xx^T\beta \in \mathbb{R}$ to range $[0,1]$

# Some Useful Equalities Involving Sigmoid

Definition:

$$
\sigma(z) = \frac{1}{1  + \exp(-z)}
$$

Recognize the alternative way to write it:

$$
\sigma(z) = \frac{\exp z}{1 + \exp z} 
$$

The complement is just a flip of the sign of the argument

$$
\sigma(-z) = 1 - \sigma(z) 
$$
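
One way to verify this identity is to start from the definition of $\sigma$ and multiply the numerator and denominator by $\exp(z)$ in the last step:

$$
\begin{aligned}
1 - \sigma(z) &= 1 - \frac{1}{1 + \exp(-z)} = \frac{\exp(-z)}{1 + \exp(-z)} \\
&= \frac{1}{\exp(z) + 1} = \sigma(-z)
\end{aligned}
$$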

Log ratio of probabilities (log odds)

$$
\log \frac{\sigma(z)}{\sigma(-z)} = z
$$
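
These equalities are easy to check numerically. The sketch below does so for a few arbitrary values of $z$, using only Python's standard `math` module. The helper name `sigmoid` is our own.

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)); fine for moderate |z|."""
    return 1.0 / (1.0 + math.exp(-z))

checked = 0
for z in [-3.0, -0.5, 0.0, 2.0]:
    s = sigmoid(z)
    # Alternative form: exp(z) / (1 + exp(z))
    assert math.isclose(s, math.exp(z) / (1.0 + math.exp(z)))
    # Complement: sigma(-z) = 1 - sigma(z)
    assert math.isclose(sigmoid(-z), 1.0 - s)
    # Log odds: log(sigma(z) / sigma(-z)) = z
    assert math.isclose(math.log(s / sigmoid(-z)), z, abs_tol=1e-12)
    checked += 1
print("identities hold for", checked, "values of z")
```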

# Using Sigmoid to Parameterize Bernoulli

$$
p(y=1|\theta) = \theta
$$

The sigmoid "squashes" the whole real line into the range $(0,1)$

Hence we can map weighted features into a parameter $\theta$

$$
\theta = \sigma(\beta_0 + \xx^T\beta) 
$$

and use that $\theta$  in our Bernoulli

$$
p(y=1\given\theta=\sigma(\beta_0 + \xx^T\beta) ) = \sigma(\beta_0 + \xx^T\beta) 
$$
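
A small sketch of this mapping, with made-up weights and feature vectors. The function name `bernoulli_parameter` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_parameter(x, beta0, beta):
    """theta = sigma(beta0 + x^T beta) for one feature vector x."""
    score = beta0 + sum(b * v for b, v in zip(beta, x))
    return sigmoid(score)

beta0, beta = -1.0, [2.0, -0.5]                 # made-up weights
xs = [[0.0, 0.0], [1.0, 2.0], [3.0, -4.0]]      # made-up feature vectors
thetas = [bernoulli_parameter(x, beta0, beta) for x in xs]
for x, theta in zip(xs, thetas):
    print(x, round(theta, 4))
```

The raw scores here are $-1$, $0$, and $7$, yet every resulting $\theta$ is a valid probability.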



# Logistic Regression -- Binary Classification

In logistic regression we model a binary variable $\color{red}{y \in \{-1,+1\}}$

$$
\begin{aligned}
p({\color{blue}{y=+1}}\given\xx,\beta_0,\beta) &= \sigmoid{{\color{blue}{+}}(\beta_0 + \xx^T\beta)}\\
p({\color{red}{y=-1}}\given\xx,\beta_0,\beta) &= 1 - \sigmoid{+(\beta_0 + \xx^T\beta)} = \sigmoid{{\color{red}{-}}(\beta_0 + \xx^T\beta)} 
\end{aligned}
$$

This is equivalent to

$$
p(y\given\xx,\beta_0,\beta) = 
\sigmoid{{\color{green}{y}}(\beta_0 + \xx^T\beta)} = 
\frac{1}{1 + \myexp{-y(\beta_0 + \xx^T\beta)}}
$$
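
As a sanity check on the compact form with labels in $\{-1,+1\}$, the sketch below evaluates it for both labels and confirms the two probabilities sum to 1. The weights, the feature vector, and the name `prob` are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob(y, x, beta0, beta):
    """p(y | x, beta0, beta) = sigma(y * (beta0 + x^T beta)), y in {-1, +1}."""
    score = beta0 + sum(b * v for b, v in zip(beta, x))
    return sigmoid(y * score)

beta0, beta = 0.5, [1.0, -2.0]   # made-up weights
x = [2.0, 0.25]                  # made-up feature vector

p_pos = prob(+1, x, beta0, beta)
p_neg = prob(-1, x, beta0, beta)
print(p_pos, p_neg, p_pos + p_neg)
```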

**<font color='red'> Q: Does the above formula work for $y \in \{0,1\}$? </font>**

# Logistic Regression -- Decision Boundary

$$
p(y=1\given\xx,\beta_0,\beta) = 
\sigmoid{(\beta_0 + \xx^T\beta)}= 
\frac{1}{1 + \myexp{-(\beta_0 + \xx^T\beta)}}
$$

$$
\sigma(z) = \frac{1}{1  + \exp(-z)}
$$

* Suppose we predict "$y=1$" if 

$$
p(y=1\given\xx,\beta_0,\beta) \geq 0.5 \rightarrow \beta_0 + \xx^T\beta \geq 0 
$$
    
* Then predict "$y=-1$" if 

$$
p(y=1\given\xx,\beta_0,\beta) < 0.5 \rightarrow \beta_0 + \xx^T\beta < 0
$$
    
* Hence, the decision boundary is given by $\beta_0 + \xx^T\beta = 0$
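
The rule above can be sketched as a sign test on the score, which avoids evaluating the sigmoid at all. The weights and points below are made up, and `predict` is a hypothetical helper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(x, beta0, beta):
    return beta0 + sum(b * v for b, v in zip(beta, x))

def predict(x, beta0, beta):
    """Return +1 if beta0 + x^T beta >= 0, otherwise -1."""
    return 1 if score(x, beta0, beta) >= 0 else -1

beta0, beta = -1.0, [1.0, 1.0]                   # made-up weights
points = [[0.0, 0.0], [0.5, 0.5], [2.0, 1.0]]    # made-up inputs

predictions = [predict(x, beta0, beta) for x in points]
# The same labels obtained by thresholding p(y=1 | x) at 0.5
by_probability = [1 if sigmoid(score(x, beta0, beta)) >= 0.5 else -1
                  for x in points]
print(predictions, by_probability)
```

The second point has a score of exactly 0, so it sits on the boundary and is assigned to "$y=1$" by the $\geq$ convention.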

**<font color='red'> Q: What does this decision boundary equation describe? </font>**

# Logistic Regression -- Log-Likelihood

Probability of a single sample is:
$$
p(y\given\xx,\beta_0,\beta) = \frac{1}{1 + \myexp{-y(\beta_0 + \xx^T\beta)}}
$$

Likelihood function is:
$$
\likelihood(\beta_0,\beta\given\yy,\xx) = \prod_i \frac{1}{1 + \myexp{-y_i(\beta_0 + \xx_i^T\beta)}}
$$

Log-likelihood function is:
$$
\loglik(\beta_0,\beta\given\yy,\xx) = -\sum_i \log\left\{1 + \myexp{-y_i(\beta_0 + \xx_i^T\beta)} \right\}
$$

Follow the same recipe as before to find the $\beta$s that maximize the log-likelihood function
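
The sketch below evaluates this log-likelihood on a tiny made-up data set and improves it with plain gradient ascent. It assumes that the "recipe" means gradient ascent, which is not spelled out in this lecture. The step size, the iteration count, and all helper names are hypothetical choices.

```python
import math

def sigmoid(t):
    # Branch on the sign so math.exp never gets a large positive argument.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def score(x, beta0, beta):
    return beta0 + sum(b * v for b, v in zip(beta, x))

def log_likelihood(beta0, beta, X, y):
    """-sum_i log(1 + exp(-y_i (beta0 + x_i^T beta))), labels in {-1, +1}."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        m = y_i * score(x_i, beta0, beta)
        # log(1 + exp(-m)), rewritten so that exp cannot overflow
        total -= max(-m, 0.0) + math.log1p(math.exp(-abs(m)))
    return total

def gradient(beta0, beta, X, y):
    """Partial derivatives of the log-likelihood w.r.t. beta0 and beta."""
    g0, g = 0.0, [0.0] * len(beta)
    for x_i, y_i in zip(X, y):
        # d/dz of -log(1 + exp(-y z)) equals y * sigma(-y z)
        w = y_i * sigmoid(-y_i * score(x_i, beta0, beta))
        g0 += w
        g = [gj + w * v for gj, v in zip(g, x_i)]
    return g0, g

# Made-up, non-separable toy data: one feature, labels in {-1, +1}
X = [[-2.0], [-1.0], [0.5], [-0.5], [1.0], [2.0]]
y = [-1, -1, -1, +1, +1, +1]

beta0, beta = 0.0, [0.0]
ll_start = log_likelihood(beta0, beta, X, y)
step = 0.1  # hypothetical step size
for _ in range(200):
    g0, g = gradient(beta0, beta, X, y)
    beta0 += step * g0                                # ascent: move along the gradient
    beta = [b + step * gj for b, gj in zip(beta, g)]
ll_end = log_likelihood(beta0, beta, X, y)
print(ll_start, ll_end)
```

With all weights at zero every sample has probability $0.5$, so the starting value is $-N\log 2$; the ascent steps should then raise it.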