In [2]:
import pandas as pd
import pymc as pm
import numpy as np
import statsmodels.api as sm

# STA365 Homework 6

## Part I

I will use the classic Kaggle Titanic dataset.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv("/content/drive/My Drive/titanic_train.csv")

Mounted at /content/drive


In [4]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

I will first do some basic cleaning by dropping rows with missing values. I will also keep only the numerical variables, to satisfy the multivariate normal assumption.

In [5]:
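# Drop every row that has a missing value in any column
df1 = df.dropna()
# Keep only the numerical columns
df_num = df1.select_dtypes(include=['float64', 'int64'])
# Predictors: everything except the ID and the response
X = df_num.drop(['PassengerId', 'Survived'], axis=1)
y = df_num['Survived']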

Before specifying our model, let's first fit an ordinary logistic regression to gain some insight for choosing our priors.

In [6]:
log_reg = sm.Logit(y, X).fit()
print(log_reg.summary())

Optimization terminated successfully.
         Current function value: 0.605855
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  183
Model:                          Logit   Df Residuals:                      178
Method:                           MLE   Df Model:                            4
Date:                Wed, 13 Mar 2024   Pseudo R-squ.:                 0.04237
Time:                        05:36:33   Log-Likelihood:                -110.87
converged:                       True   LL-Null:                       -115.78
Covariance Type:            nonrobust   LLR p-value:                   0.04375
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Pclass         0.5360      0.230      2.327      0.020       0.084       0.987
Age           -0.0167      0.

Now, we will specify our Bayesian logistic regression model.

In [10]:
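n, p = X.shape
with pm.Model() as MLR:
    # Normal prior on each regression coefficient
    betas = pm.Normal('betas', mu=0, sigma=0.1, shape=p)
    # Normal truncated at zero; note that sigma does not enter the Bernoulli likelihood below
    sigma = pm.TruncatedNormal('sigma', mu=0.1, sigma=1, lower=0)
    # Linear predictor on the logit scale (no intercept, as in the MLE fit above)
    logit_p = pm.math.dot(X.values, betas)
    y_obs = pm.Bernoulli('y_obs', p=pm.math.sigmoid(logit_p), observed=y.values)

with MLR:
    idata = pm.sample()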
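
Before sampling, it can help to see what a Normal(0, 0.1) prior on the coefficients implies on the probability scale. The following is a minimal NumPy-only sketch, not part of the model above: the helper names `sigmoid` and `prior_survival_probs` and the example passenger row are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Inverse logit: map a real-valued linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def prior_survival_probs(x_row, prior_sd=0.1, n_draws=5000, seed=0):
    """Draw coefficients from Normal(0, prior_sd) and return sigmoid(x_row @ beta) per draw."""
    rng = np.random.default_rng(seed)
    x_row = np.asarray(x_row, dtype=float)
    betas = rng.normal(0.0, prior_sd, size=(n_draws, x_row.size))
    return sigmoid(betas @ x_row)

# Hypothetical passenger (Pclass, Age, SibSp, Parch, Fare); illustrative values only
x_example = [1, 35.0, 0, 0, 50.0]
probs = prior_survival_probs(x_example)

# Prior mean and central 95% interval of the implied survival probability
print(probs.mean(), np.quantile(probs, [0.025, 0.975]))
```

For this illustrative row the linear predictor has a prior standard deviation of about 6 on the logit scale, because Age and Fare are on large raw scales. This suggests that a prior that looks tight on the coefficients can still be diffuse on the probabilities unless the predictors are standardized.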

## Part II

For LASSO, we are assuming a double exponential prior for $\beta$, where

\begin{align*}
\beta_j \sim {} & \text{Laplace}(b_j,s_j) & f(\beta_j| b_j, s_j) = {}& \frac{1}{2s_j}\exp \left(-\frac{|\beta_j-b_j|}{s_j}\right) & \overbrace{\underbrace{|\beta_j-b_j|}}_{\text{Absolute Penalization}}^{L_1}\\
y_i \sim {} & \text{Normal}(x_i^T\beta, \sigma) & \sigma \sim {} & \text{HalfNormal}(\sigma_0)
\end{align*}

Then, treating $\sigma$ as fixed, the posterior can be written as

$$p(\beta|y)\propto p(\beta)p(y|\beta)=\prod_{j=1}^p\frac{1}{2s_j}\exp\left(-\frac{|\beta_j-b_j|}{s_j}\right)\times\prod_{i=1}^n\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{y_i-x_i^T\beta}{\sigma}\right)^2\right)$$

Taking $b_j=0$ and a common scale $s_j=\tau$ for simplicity, and dropping the factors that do not depend on $\beta$,

$$\propto\prod_{j=1}^p\exp\left(-\frac{1}{\tau}|\beta_j|\right)\prod_{i=1}^n\exp\left(-\frac{1}{2\sigma^2}(y_i-x_i^T\beta)^2\right)=\exp\left(-\frac{1}{\tau}\sum_{j=1}^p|\beta_j|-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-x_i^T\beta)^2\right)$$

Finding $\beta$ maximizing $p(\beta|y)$ is equivalent to maximizing $\log p(\beta|y)$, where

$$\log p(\beta|y)=-\frac{1}{\tau}\sum_{j=1}^p|\beta_j|-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-x_i^T\beta)^2+\text{const}$$

Multiplying by $-2\sigma^2$, this is equivalent to minimizing

$$\sum_{i=1}^n(y_i-x_i^T\beta)^2+\frac{2\sigma^2}{\tau}\sum_{j=1}^p|\beta_j|$$

which is the LASSO problem with penalty $\lambda=2\sigma^2/\tau$.
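
The equivalence can be checked numerically. The sketch below assumes a Laplace prior with location 0 and common scale `tau`, a Normal likelihood with fixed `sigma`, and a LASSO penalty of `lam = 2 * sigma**2 / tau`. The simulated data and the function names are hypothetical. If the two objectives differ only by a positive factor, they share the same minimizer.

```python
import numpy as np

def neg_log_posterior_lasso(beta, X, y, sigma, tau):
    """Negative log posterior in beta, up to an additive constant."""
    resid = y - X @ beta
    return np.sum(resid**2) / (2 * sigma**2) + np.sum(np.abs(beta)) / tau

def lasso_objective(beta, X, y, lam):
    """Residual sum of squares plus an L1 penalty with weight lam."""
    resid = y - X @ beta
    return np.sum(resid**2) + lam * np.sum(np.abs(beta))

# Simulated data (illustrative only)
rng = np.random.default_rng(0)
X_sim = rng.normal(size=(50, 3))
y_sim = X_sim @ np.array([1.0, 0.0, -2.0]) + rng.normal(scale=0.5, size=50)

sigma, tau = 0.5, 0.2
lam = 2 * sigma**2 / tau

# The ratio of the two objectives should equal 2 * sigma**2 for any beta
ratios = []
for _ in range(5):
    beta = rng.normal(size=3)
    ratios.append(lasso_objective(beta, X_sim, y_sim, lam)
                  / neg_log_posterior_lasso(beta, X_sim, y_sim, sigma, tau))
print(ratios)
```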

For ridge, we are assuming a normal prior on $\beta$ with mean $0$ and standard deviation $\tau$, so the posterior can be written as

$$p(\beta|y)\propto p(\beta)p(y|\beta)\propto\exp\left(-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\right)\exp\left(-\frac{1}{2\tau^2}\beta^T\beta\right)=\exp\left(-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)-\frac{1}{2\tau^2}\beta^T\beta\right)$$

$$\log p(\beta|y)=-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)-\frac{1}{2\tau^2}\beta^T\beta+\text{const}$$

Again, maximizing $\log p(\beta|y)$ is equivalent to minimizing

$$(y-X\beta)^T(y-X\beta)+\frac{\sigma^2}{\tau^2}\beta^T\beta=\sum_{i=1}^n(y_i-x_i^T\beta)^2+\frac{\sigma^2}{\tau^2}\sum_{j=1}^p\beta_j^2$$

which is precisely the ridge problem with penalty $\lambda=\sigma^2/\tau^2$.
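
As a quick numerical check, the ridge closed-form solution with `lam = sigma**2 / tau**2` should be a stationary point of the log posterior. The sketch below assumes independent Normal(0, `tau`) priors on the coefficients and a Normal likelihood with fixed `sigma`. The simulated data and the function names are hypothetical.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimizer of ||y - X beta||^2 + lam * ||beta||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def log_posterior_grad(beta, X, y, sigma, tau):
    """Gradient in beta of the log posterior under the assumptions above."""
    return X.T @ (y - X @ beta) / sigma**2 - beta / tau**2

# Simulated data (illustrative only)
rng = np.random.default_rng(1)
X_sim = rng.normal(size=(50, 3))
y_sim = X_sim @ np.array([1.0, 0.0, -2.0]) + rng.normal(scale=0.5, size=50)

sigma, tau = 0.5, 1.0
beta_ridge = ridge_closed_form(X_sim, y_sim, lam=sigma**2 / tau**2)

# The gradient should vanish (up to floating-point error) at the ridge solution
grad = log_posterior_grad(beta_ridge, X_sim, y_sim, sigma, tau)
print(beta_ridge, grad)
```

Because the log posterior is strictly concave in $\beta$ here, a zero gradient identifies its unique maximizer.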