### Bayesian Methods - Bayesian Logistic Regression

Using six features (seven including constant) we are to classify whether or not a woman is working or not. This is done by logistic regression, in which the weighting parameters follow some prior distribution, in the Bayesian approach. Recall, in logistic regression we are trying to find parameters $\mathbf{w}$ for the features which are then parsed into the Sigmoid function.

$$
\sigma(z) \triangleq \frac{1}{1-e^{-z}}
$$

$$
\mathbb{P}(Y_i = 1 | \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^{\text{T}}\mathbf{x})
$$

The prior is set to:
$$
\mathbf{w} \sim N(\mathbf{0}, \tau^2\mathbf{I}_7)
$$

With the likelihood function being:

$$
p(\mathcal{D}|\mathbf{w}) = \prod_{(\mathbf{x}, y) \in \mathcal{D}} [\sigma(\mathbf{w}^{\text{T}}\mathbf{x})]^y [1-\sigma(\mathbf{w}^{\text{T}}\mathbf{x})]^{1-y}
$$

And the posterior distribution having the form derived from Bayes' rule:
$$
p(\mathbf{w}|\mathcal{D}) = \frac{p(\mathcal{D}|\mathbf{w})p(\mathbf{w})} {p(\mathbf{\mathcal{D}})}
$$

Where the evidence is calculated as:
$$
p(\mathbf{\mathcal{D}}) = \underset{}{\int_{\mathbb{R}} \cdots \int_{\mathbb{R}}} \prod_{(\mathbf{x}, y) \in \mathcal{D}} [\sigma(\mathbf{w}^{\text{T}}\mathbf{x})]^y [1-\sigma(\mathbf{w}^{\text{T}}\mathbf{x})]^{1-y}p(\mathbf{w})~dw_1\cdots dw_7
$$


The prior distribution is non-conjugate. In addition, there exists no closed-form expression of the posterior distribution, which is easy to see, since Bayes' factor is computationally intractable to calculated due to the curse of high dimensionality.

However, for large samples sizes, we can normally approximate the posterior distribution. Then it is simple to find samples from the posterior. This is also called **Laplace Approximation**.

The approximation has the form:
$$
\mathbf{w}|\mathcal{D} \sim N(\mathbf{w}_{\text{MAP}}, -\nabla^2\ln(p(\mathbf{w}_{\text{MAP}}|\mathcal{D})))
$$


We can arrive at the posterior normal approximation by Taylor expanding the posterior distribution around $\mathbf{w}_{\text{MAP}}$. There are two paths forward. (i) is to calculate the Taylor approximation of the log posterior by hand (requiring the usage of the likelihood function and prior). (ii) is to find the $\mathbf{w}_{\text{MAP}}$ numerically with some optimzation algorithm, by maximizing the likelihood times the prior.

In [99]:
import pandas as pd
import numpy as np 

In [19]:
# Data import. 
df = pd.read_csv('data/WomenAtWork.dat', delimiter='\t', engine='python')
df

Unnamed: 0,Work,Constant,HusbandInc,EducYears,ExpYears,Age,NSmallChild,NBigChild
0,1.0,1.0,22.394940,12.0,7.0,43.0,0.0,3.0
1,0.0,1.0,7.232000,8.0,10.0,34.0,0.0,7.0
2,1.0,1.0,18.271990,12.0,4.0,41.0,1.0,5.0
3,0.0,1.0,28.069000,14.0,2.0,43.0,0.0,2.0
4,1.0,1.0,7.799889,12.0,10.0,31.0,0.0,1.0
...,...,...,...,...,...,...,...,...
163,1.0,1.0,25.075040,12.0,9.0,43.0,0.0,3.0
164,1.0,1.0,27.799960,17.0,8.0,54.0,0.0,0.0
165,0.0,1.0,20.200000,13.0,7.0,53.0,0.0,0.0
166,0.0,1.0,42.250000,12.0,3.0,30.0,1.0,1.0


In [13]:
def sigmoid(x: float) -> float:
    return 1 / (1 + np.exp(-x))

In [95]:
def log_likelihood(X: np.array, t: np.array, w: np.array) -> float:
    epsilon = 1e-32
    prob = sigmoid(X @ w)
    likelihood_vector = t @ np.log(prob + epsilon) + (1 - t) @ np.log(1 - prob + epsilon)
    return np.sum(likelihood_vector)

In [105]:
from scipy.stats import multivariate_normal


def log_prior(w: np.array, init_sigma: float) -> float:
    mean = np.zeros(len(w))
    covariance_matrix = init_sigma * np.identity(len(w))
    return np.log(multivariate_normal.pdf(w, mean, covariance_matrix))

In [114]:
def objective(w: np.array, X: np.array, t: np.array, init_sigma: float) -> float:
    return -(log_likelihood(X, t, w) * log_prior(w, init_sigma=init_sigma))

In [119]:
from scipy.optimize import minimize

# Initial solution.
x0 = 0.1 * np.ones(7)

# Parameters.
X = df[df.columns[1:]]
X = np.array(X)

t = df[df.columns[0]]
t = np.array(t)

init_sigma = 5

# Optimization results.
minimize(objective, x0=x0, args=(X, t, init_sigma))

  return 1 / (1 + np.exp(-x))
  return np.log(multivariate_normal.pdf(w, mean, covariance_matrix))
  df = fun(x) - f0
  return 1 / (1 + np.exp(-x))
  return np.log(multivariate_normal.pdf(w, mean, covariance_matrix))
  df = fun(x) - f0


      fun: -70173.11784106336
 hess_inv: array([[ 9.99667475e-01, -7.99190229e-03, -4.56534752e-03,
        -2.68716384e-03, -1.56993689e-02, -1.07572279e-04,
        -5.13548797e-04],
       [-7.99190229e-03,  8.11403724e-01, -1.07799787e-01,
        -6.35131304e-02, -3.70334579e-01, -2.68762071e-03,
        -1.22603490e-02],
       [-4.56534752e-03, -1.07799787e-01,  9.38383925e-01,
        -3.63015825e-02, -2.11682386e-01, -1.53339142e-03,
        -7.00521480e-03],
       [-2.68716384e-03, -6.35131304e-02, -3.63015825e-02,
         9.78613774e-01, -1.24720994e-01, -9.00728403e-04,
        -4.12473919e-03],
       [-1.56993689e-02, -3.70334579e-01, -2.11682386e-01,
        -1.24720994e-01,  2.72803625e-01, -5.28386805e-03,
        -2.40809018e-02],
       [-1.07572279e-04, -2.68762071e-03, -1.53339142e-03,
        -9.00728403e-04, -5.28386805e-03,  9.99968203e-01,
        -1.68549762e-04],
       [-5.13548797e-04, -1.22603490e-02, -7.00521480e-03,
        -4.12473919e-03, -2.40809018