<div style="text-align:center"><span style="font-size:2em; font-weight: bold;">Lecture 5—Optimization</span></div>

# Data science: Logistic regression

## Derivation

**Linear formulation**

$$\mathcal L=\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}$$
$$\mathcal L=\prod_{i=1}^n F(x_i'\beta)^{y_i}(1-F(x_i'\beta))^{1-y_i}$$
$$F(x)=\frac{1}{1+e^{-x}}$$
$$\ln\mathcal L=\sum_{i=1}^n y_i \ln{F(x_i'\beta)}+(1-y_i)\ln{(1-F(x_i'\beta))}$$
$$\ln\mathcal L=\left[\ln{F(\beta'\mathbf X')}\right]y+\left[\ln{(\mathbf{1}'-F(\beta'\mathbf X'))}\right](1-y)$$

In [None]:
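import numpy as np

# Logistic CDF: F(x) = 1/(1+exp(-x))
expit = lambda x: 1/(1+np.exp(-x))
def loglike(x,y,b):
    # x: (n,k) design matrix, y: (n,) 0/1 outcomes, b: (k,) coefficients
    Fx = expit(b.T@x.T)
    return np.log(Fx)@y+np.log(1-Fx)@(1-y)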

**Gradient**
$$\frac{d\ln\mathcal L}{d\beta}=\mathbf X'\text{diag}\left(\frac{f(\mathbf X\beta)}{F(\mathbf X\beta)}\right)y-\mathbf X'\text{diag}\left(\frac{f(\mathbf X\beta)}{\mathbf 1-F(\mathbf X\beta)}\right)(1-y)$$
$$\frac{d\ln\mathcal L}{d\beta}=\mathbf X'\text{diag}\left(\frac{f(\mathbf X\beta)(1-F(\mathbf X\beta))}{(1-F(\mathbf X\beta))F(\mathbf X\beta)}\right)y-\mathbf X'\text{diag}\left(\frac{f(\mathbf X\beta)F(\mathbf X\beta)}{F(\mathbf X\beta)(1-F(\mathbf X\beta))}\right)(1-y)$$
$$\frac{d\ln\mathcal L}{d\beta}=\mathbf X'\left[\text{diag}\left(1-F(\mathbf X\beta)\right)y-\text{diag}\left(F(\mathbf X\beta)\right)(1-y)\right]$$
$$\frac{d\ln\mathcal L}{d\beta}=\mathbf X'\left[\text{diag}\left(y-F(\mathbf X\beta)y-F(\mathbf X\beta)+F(\mathbf X\beta)y\right)\right]\mathbf 1$$
$$\frac{d\ln\mathcal L}{d\beta}=\mathbf X'\left[\text{diag}\left(y-F(\mathbf X\beta)\right)\right]\mathbf 1$$
$$\frac{d\ln\mathcal L}{d\beta}=\mathbf X'\left[y-F(\mathbf X\beta)\right]$$

In [None]:
def gradient(x,y,b):
    Fx = expit(x@b)
    return x.T@(y-Fx)
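
A quick way to check the derivation is to compare the analytic gradient with a central finite difference of the log-likelihood. The sketch below restates `expit`, `loglike`, and `gradient` so it runs on its own; the helper `numerical_gradient`, the step size `h`, and the simulated data are illustrative assumptions, not part of the lecture code.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def loglike(x, y, b):
    Fx = expit(x @ b)
    return np.log(Fx) @ y + np.log(1 - Fx) @ (1 - y)

def gradient(x, y, b):
    return x.T @ (y - expit(x @ b))

def numerical_gradient(x, y, b, h=1e-6):
    # Central difference in each coordinate of b (hypothetical helper)
    g = np.zeros_like(b)
    for j in range(b.size):
        step = np.zeros_like(b)
        step[j] = h
        g[j] = (loglike(x, y, b + step) - loglike(x, y, b - step)) / (2 * h)
    return g

# Simulated data, used only to exercise the two gradients
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = (rng.uniform(size=200) < 0.5).astype(float)
b = rng.normal(size=3)

max_abs_diff = np.max(np.abs(gradient(x, y, b) - numerical_gradient(x, y, b)))
print(max_abs_diff)  # should be close to zero if the derivation is right
```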

**Hessian**
$$\frac{d}{d\beta}\frac{d\ln\mathcal L}{d\beta}'=\frac{d}{d\beta}\left[y'-F(\beta'\mathbf X')\right]\mathbf X$$
$$\frac{d^2\ln\mathcal L}{d\beta d\beta'}=-\mathbf X'\left[\text{diag}\left(f(\mathbf X\beta)\right)\right]\mathbf X$$

In [None]:
def hessian(x,y,b):
    Fx = expit(x@b)
    fx = Fx*(1-Fx)
    return -x.T@np.diagflat(fx.flatten())@x
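
The `np.diagflat` call builds an n-by-n matrix, which gets large quickly (10,000 rows means 10^8 entries). One possible alternative, sketched here under the assumption that `x` is a plain NumPy array, scales the rows of `x` by the weights instead. The names `hessian_dense` and `hessian_weighted` are made up for this comparison. The eigenvalue check illustrates that the Hessian is negative definite when `x` has full column rank, so the log-likelihood is concave.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian_dense(x, b):
    # Same formula as above: -X' diag(f(Xb)) X with an explicit n-by-n diagonal
    Fx = expit(x @ b)
    fx = Fx * (1 - Fx)
    return -x.T @ np.diag(fx) @ x

def hessian_weighted(x, b):
    # Equivalent: multiply row i of X by f(x_i'b), then take X' times the result
    Fx = expit(x @ b)
    fx = Fx * (1 - Fx)
    return -(x * fx[:, None]).T @ x

rng = np.random.default_rng(1)
x = rng.normal(size=(500, 3))
b = rng.normal(size=3)

H_dense = hessian_dense(x, b)
H_weighted = hessian_weighted(x, b)
same = np.allclose(H_dense, H_weighted)
eigenvalues = np.linalg.eigvalsh(H_weighted)  # all negative for full-rank x
print(same, eigenvalues)
```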

**Theorem** Cramér-Rao Lower Bound

Assume $\mathcal{L}$ is continuous and twice differentiable in $\theta$. For any unbiased estimator $\hat\theta$, the variance is bounded below by
$$\text{Var}\left[\hat\theta\right]\ge\left[-\text{E}\left[\frac{d^2\ln{\mathcal{L}}}{d\theta d\theta'}\right]\right]^{-1}$$
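
In practice the bound is what motivates using the inverse of the negative Hessian, evaluated at the maximum likelihood estimate, as a variance estimate (the MLE attains the bound asymptotically under regularity conditions). The following is a minimal sketch on simulated data; `true_beta`, the sample size, and the fixed number of Newton steps are assumptions chosen for illustration.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_hessian(x, b):
    # Negative Hessian of the logistic log-likelihood: X' diag(F(1-F)) X
    Fx = expit(x @ b)
    return (x * (Fx * (1 - Fx))[:, None]).T @ x

# Simulated logistic data with an intercept column
rng = np.random.default_rng(2)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([0.5, -1.0, 2.0])
y = (rng.uniform(size=n) < expit(x @ true_beta)).astype(float)

# A fixed number of Newton steps is enough for this well-behaved example
beta = np.zeros(3)
for _ in range(25):
    grad = x.T @ (y - expit(x @ beta))
    beta = beta + np.linalg.solve(neg_hessian(x, beta), grad)

# Variance estimate: inverse of the negative Hessian at the estimate
vcov = np.linalg.inv(neg_hessian(x, beta))
std_err = np.sqrt(np.diag(vcov))
print(beta, std_err)
```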


# Programming: Numerical optimization strategies

## Grid search

Evaluate the objective at every point of a predefined grid over the parameter space and keep the point with the best value.

In [None]:
import numpy as np
from itertools import product

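def grid_search(func,space,maximize=False):
    # Evaluate func at every candidate point; space can be any iterable (it is consumed once)
    vstates = [(x,func(x)) for x in space]
    # Pick the best (point, value) pair by value without sorting the whole list
    best = max(vstates,key=lambda s: s[1]) if maximize else min(vstates,key=lambda s: s[1])
    return best[0]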

x = np.linspace(0,10,1000).tolist()
func = lambda x: (x[0]-4.0001)**2*(x[1]-6.0001)**2
grid_search(func,product(x,x))

(4.004004004004004, 5.995995995995996)
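
For a low-dimensional grid like this one, the same search can also be written with array operations. This is a sketch of one alternative, not a replacement for `grid_search`: it evaluates the same objective on a `np.meshgrid` and reads off the minimizer with `np.argmin`. The variable names are illustrative.

```python
import numpy as np

grid = np.linspace(0, 10, 1000)
# X0[i, j] = grid[i] and X1[i, j] = grid[j]
X0, X1 = np.meshgrid(grid, grid, indexing="ij")

# Same objective as above, evaluated on the whole grid at once
values = (X0 - 4.0001) ** 2 * (X1 - 6.0001) ** 2

# Convert the flat index of the smallest value back to a (row, column) pair
i, j = np.unravel_index(np.argmin(values), values.shape)
best = (grid[i], grid[j])
print(best)
```

Either way the answer is only as precise as the grid spacing, and the number of grid points grows exponentially with the number of parameters.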

## Gradient descent

Repeatedly step in the direction opposite the gradient (or along it, when maximizing), with a step size proportional to the magnitude of the gradient.

In [None]:
def gradient_descent(func,gradient,init_x:np.ndarray,learning_rate:float=0.005,max_reps:int=10000,maximize=False):
    x = init_x.copy()
    for i in range(max_reps):
        gx = gradient(x)
        x0 = x.copy()
        flast = func(x)
        # Step uphill when maximizing, downhill when minimizing
        x += gx*learning_rate if maximize else -gx*learning_rate
        fnew = func(x)
        # After the first few steps, undo the step and stop if the objective got worse
        if i>2 and ((maximize and fnew<flast) or ((not maximize) and fnew>flast)):
            x = x0
            break
    return x
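
As a usage sketch, the function can be tried on a simple quadratic bowl whose minimizer is known. The cell below repeats the `gradient_descent` logic so it runs on its own; `target`, `bowl`, `bowl_gradient`, and the chosen learning rate are illustrative assumptions.

```python
import numpy as np

def gradient_descent(func, gradient, init_x, learning_rate=0.005, max_reps=10000, maximize=False):
    # Same logic as the function above, repeated so this cell is self-contained
    x = init_x.copy()
    for i in range(max_reps):
        gx = gradient(x)
        x0 = x.copy()
        flast = func(x)
        x += gx * learning_rate if maximize else -gx * learning_rate
        fnew = func(x)
        if i > 2 and ((maximize and fnew < flast) or (not maximize and fnew > flast)):
            x = x0
            break
    return x

target = np.array([4.0, 6.0])                # known minimizer of the test function
bowl = lambda x: ((x - target) ** 2).sum()   # f(x) = (x0-4)^2 + (x1-6)^2
bowl_gradient = lambda x: 2 * (x - target)   # its gradient

x_min = gradient_descent(bowl, bowl_gradient, np.zeros(2), learning_rate=0.1, max_reps=1000)
print(x_min)  # should be very close to target
```

With this learning rate each step shrinks the distance to the minimizer by a constant factor, so convergence is linear rather than quadratic.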


## Newton's method

Apply a root-finding algorithm to the gradient to locate the points where the gradient is zero, i.e., the candidate maxima or minima of the function.

In [None]:
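def newton(gradient,hessian,init_x:np.ndarray,max_reps:int=100,tolerance:float=1e-16):
    x = init_x.copy()
    for i in range(max_reps):
        # Newton step: solve H(x) @ update = -g(x) instead of inverting the Hessian
        update = -np.linalg.solve(hessian(x),gradient(x))
        x += update
        # Converged once the total absolute change in x falls below the tolerance
        if np.abs(update).sum()<tolerance:
            return (x,i)
    raise Exception('Newton did not converge')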
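
To connect this with the derivation, the sketch below feeds the logistic gradient and Hessian to `newton` through lambdas that fix the data. Everything is restated so the cell runs on its own; the simulated data, `true_beta`, and the looser `tolerance=1e-10` passed here are assumptions made for the example.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(x, y, b):
    return x.T @ (y - expit(x @ b))

def hessian(x, y, b):
    Fx = expit(x @ b)
    return -x.T @ np.diag(Fx * (1 - Fx)) @ x

def newton(gradient, hessian, init_x, max_reps=100, tolerance=1e-16):
    # Same logic as the function above, repeated so this cell is self-contained
    x = init_x.copy()
    for i in range(max_reps):
        update = -np.linalg.solve(hessian(x), gradient(x))
        x += update
        if np.abs(update).sum() < tolerance:
            return (x, i)
    raise Exception('Newton did not converge')

# Simulated logistic data with an intercept column
rng = np.random.default_rng(3)
n = 1000
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-0.5, 1.0, 0.5])
y = (rng.uniform(size=n) < expit(x @ true_beta)).astype(float)

# newton expects functions of the parameters only, so bind x and y here
beta_hat, iters = newton(lambda b: gradient(x, y, b),
                         lambda b: hessian(x, y, b),
                         np.zeros(3), tolerance=1e-10)
print(beta_hat, iters)
```

Because the log-likelihood is concave, the zero of the gradient found here is the maximum.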

## Complete code

In [None]:
from cleands import *

class logistic_regressor(linear_model):
    def __fit__(self,x,y):
        params,self.iters = self.__max_likelihood__(np.zeros(self.n_feat))
        return params
    @property
    def vcov_params(self):return self.__vcov_params_lnL__()
    def evaluate_lnL(self,pred):return self.y.T@np.log(pred)+(1-self.y).T@np.log(1-pred)
    def _gradient_(self,coefs):return self.x.T@(self.y-expit(self.x@coefs))
    def _hessian_(self,coefs):
        Fx = expit(self.x@coefs)
        return -self.x.T@np.diagflat((Fx*(1-Fx)).values)@self.x
    def predict(self,target):return expit(target@self.params)

class LogisticRegressor(logistic_regressor,broom_model):
    def __init__(self,x_vars:list,y_var:str,data:pd.DataFrame,*args,**kwargs):
        super(LogisticRegressor,self).__init__(data[x_vars],data[y_var],*args,**kwargs)
        self.x_vars = x_vars
        self.y_var = y_var
        self.data = data
    def _glance_dict_(self):
        return {'mcfadden.r.squared':self.r_squared,
                'adjusted.r.squared':self.adjusted_r_squared,
                'self.df':self.n_feat,
                'resid.df':self.degrees_of_freedom,
                'aic':self.aic,
                'bic':self.bic,
                'log.likelihood':self.lnL,
                'deviance':self.deviance,
                'resid.var':self.ssq}


In [None]:
## Data generation
df = pd.DataFrame(np.random.normal(size=(10000,4)),columns=['x1','x2','x3','y'])
df['y'] += df[['x1','x2','x3']]@np.random.uniform(size=(3,))
df['y'] = (df['y']>0).astype(int)

In [None]:
## Run the model
model = LogisticRegressor(*add_intercept(['x1','x2','x3'],'y',df))

In [None]:
## See table
model.tidy

In [None]:
model.glance

In [None]:
model.iters

# Programming challenges

## Recursive partitioning trees

Write a class that implements a recursive partitioning algorithm. Use our common machine learning code.

## Quaternions

The Quaternions are a generalization of complex numbers. Where the complex numbers have two components, $a$ and $b$, for a number $a+bi$, the Quaternions have four parts $a, b, c$ and $d$: $$a+bi+cj+dk$$

The Quaternions have four basic operations: addition, subtraction, multiplication, and inversion. Your job is to write a quaternion class that implements these operations, along with a string representation method (`__str__`). You can learn how to perform these operations on the Quaternions' Wikipedia page.
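
As a starting point, multiplication follows from Hamilton's defining relations for the basis elements:

$$i^2=j^2=k^2=ijk=-1$$

These imply the pairwise products below. Note that the order matters, so quaternion multiplication is not commutative:

$$ij=k,\quad jk=i,\quad ki=j,\quad ji=-k,\quad kj=-i,\quad ik=-j$$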