If weights $w_i$ are not provided, they are assumed to be $1/N$. The objective also includes an offset $g$.

\begin{align*}
f(\beta_0, \beta, g) &= \frac{1}{2} \sum_{i=1}^N w_i \left(y_i - \beta_0 - x_i^T \beta - g_i \right)^2 + \lambda P_\alpha(\beta) \\
\frac{\partial f}{\partial \beta_0} &= 
-\sum_{i=1}^N w_i \left(y_i - \beta_0 - x_i^T \beta - g_i \right) \\
\beta_0^*(\beta) &= \frac{1}{\sum_i w_i} \sum_{i=1}^N w_i \left(y_i - x_i^T \beta - g_i \right)
\end{align*}
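As a quick numerical check of the intercept formula, the sketch below computes $\beta_0^*$ as the weighted mean of $y_i - x_i^T \beta - g_i$ and evaluates the intercept gradient at that value. The helper names `optimal_intercept` and `intercept_gradient` and the synthetic data are illustrative assumptions, not part of the original derivation.

```python
import numpy as np


def optimal_intercept(y, x, beta, w, g):
    """Weighted mean of y - x @ beta - g (hypothetical helper)."""
    partial = y - x @ beta - g
    return np.sum(w * partial) / np.sum(w)


def intercept_gradient(beta0, y, x, beta, w, g):
    """d f / d beta0 for the weighted squared-error loss with offset g."""
    return -np.sum(w * (y - beta0 - x @ beta - g))


rng = np.random.default_rng(0)
n, p = 50, 3
x = rng.normal(size=(n, p))
y = rng.normal(size=n)
g = rng.normal(size=n)       # offset
beta = rng.normal(size=p)
w = np.full(n, 1.0 / n)      # default weights 1/N

beta0_star = optimal_intercept(y, x, beta, w, g)
grad_at_opt = intercept_gradient(beta0_star, y, x, beta, w, g)
```

The gradient at `beta0_star` should vanish up to floating-point error, whatever values `beta` and `g` take.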

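From here on the offset is dropped for brevity (equivalently, replace $y_i$ by $y_i - g_i$), and the penalty is $P_\alpha(\beta) = \sum_j \left( \frac{1 - \alpha}{2} \beta_j^2 + \alpha |\beta_j| \right)$. When $\alpha > 0$ the penalty is not differentiable at $\beta_j = 0$, so the gradient is not defined there. If $\beta_j \neq 0$, then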
\begin{align*}
\frac{\partial f}{\partial \beta_j} &= 
- \sum_i w_i \left(y_i - \beta_0 - x_i^T \beta \right)  x_{ij}
+ \lambda \frac{\partial P_\alpha(\beta)}{\partial \beta_j} \\
&= 
- \sum_i w_i \left(y_i - \beta_0 - x_i^T \beta \right)  x_{ij}
+ \lambda \left( (1 - \alpha) \beta_j + \alpha \mathrm{sign}(\beta_j)
\right)
\end{align*}

The optimal $\beta_j$ will be zero if the right derivative at $\beta_j = 0$ is non-negative and the left derivative is non-positive:
$\left. \frac{\partial f}{\partial \beta_j}  \right|_+ \geq 0$
and $\left. \frac{\partial f}{\partial \beta_j}  \right|_- \leq 0$. That is,
\begin{align*}
- \lambda \alpha &\leq \sum_i w_i \left(y_i - \beta_0 - x_i^T \beta \right) x_{ij} \leq \lambda \alpha \\
\mathrm{abs} \left( \sum_i w_i \left(y_i - \beta_0 - x_i^T \beta \right) x_{ij} \right)  &\leq \lambda \alpha
\end{align*}
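One consequence of this inequality: starting from $\beta = 0$, every coefficient stays at zero once $\lambda \alpha \geq \max_j \left| \sum_i w_i (y_i - \beta_0) x_{ij} \right|$. The sketch below computes that threshold. The name `lambda_max` is a hypothetical helper, and the code assumes $\alpha > 0$ and no offset.

```python
import numpy as np


def lambda_max(y, x, w, alpha):
    """Smallest lambda for which beta = 0 satisfies the zero condition for all j.

    Hypothetical helper; assumes alpha > 0 and no offset.
    """
    beta0 = np.sum(w * y) / np.sum(w)    # optimal intercept when beta = 0
    score = x.T @ (w * (y - beta0))      # sum_i w_i (y_i - beta0) x_ij, one entry per j
    return np.max(np.abs(score)) / alpha


x = np.array([[1.0, 0.0], [-1.0, 1.0], [0.0, -1.0], [2.0, 1.0]])
y = np.array([1.0, -1.0, 0.5, 2.0])
w = np.full(len(y), 1.0 / len(y))
lam = lambda_max(y, x, w, alpha=0.5)
```

For any $\lambda$ at or above this value, the zero condition holds for every $j$ simultaneously, so the all-zero coefficient vector is optimal.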

When the optimal value of $\beta_j$ is not zero, we can find it by setting the gradient equal to zero:
\begin{align*}
0 &= 
- \sum_i w_i \left(
y_i - \beta_0 - \tilde{x}^T_{ij} \tilde{\beta}_j - x_{ij} \beta_j \right) x_{ij} 
+ \lambda \left( (1 - \alpha) \beta_j + \alpha \mathrm{sign}(\beta_j) \right) \\
\beta_j \left( \sum_i w_i x_{ij}^2 + \lambda(1 - \alpha) \right) &= 
\sum_i w_i \left(
y_i - \beta_0 - \tilde{x}^T_{ij} \tilde{\beta}_j \right) x_{ij}
- \lambda \alpha \mathrm{sign}(\beta_j) \\
\beta_j^*  &= 
\frac{ \sum_i w_i \left(
y_i - \beta_0 - \tilde{x}^T_{ij} \tilde{\beta}_j \right) x_{ij}
- \lambda \alpha \mathrm{sign}(\beta_j) }
{\sum_i w_i x_{ij}^2 
+ \lambda (1 - \alpha) }
\end{align*}

## Naive Updates (Section 2.1)

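Using the soft-thresholding operator $S(z, \gamma) = \mathrm{sign}(z) \left( |z| - \gamma \right)_+$, the zero and non-zero cases combine into a single update: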
\begin{align*}
\beta_j^*  &= \frac
{S \left( \sum_i w_i \left(y_i - \beta_0 - \tilde{x}_{ij}^T \tilde{\beta}_j \right)  x_{ij}, \lambda \alpha \right)}
{ \sum_i w_i x_{ij}^2 + \lambda (1 - \alpha) }
\end{align*}
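A minimal sketch of this update with general weights (no normalization of $x$ assumed) is given below, together with a brute-force check that the closed form does minimize the objective along coordinate $j$. The function names, the synthetic data, and the penalty form $P_\alpha(\beta) = \sum_j \left( \frac{1-\alpha}{2} \beta_j^2 + \alpha |\beta_j| \right)$ (inferred from the gradient used above) are assumptions for illustration.

```python
import numpy as np


def weighted_cw_update(y, x, j, beta0, beta, w, alpha, lambda_):
    """Closed-form minimizer over beta[j], all other coordinates held fixed."""
    partial_resid = y - beta0 - x @ beta + x[:, j] * beta[j]
    z = np.sum(w * partial_resid * x[:, j])
    shrunk = np.sign(z) * max(abs(z) - lambda_ * alpha, 0.0)  # S(z, lambda * alpha)
    return shrunk / (np.sum(w * x[:, j] ** 2) + lambda_ * (1 - alpha))


def objective(y, x, beta0, beta, w, alpha, lambda_):
    """Weighted squared error plus the elastic-net penalty."""
    resid = y - beta0 - x @ beta
    penalty = 0.5 * (1 - alpha) * np.sum(beta ** 2) + alpha * np.sum(np.abs(beta))
    return 0.5 * np.sum(w * resid ** 2) + lambda_ * penalty


rng = np.random.default_rng(1)
n, p, j = 40, 3, 1
x = rng.normal(size=(n, p))
y = rng.normal(size=n)
w = rng.uniform(0.5, 1.5, size=n) / n    # arbitrary positive weights
beta0, beta = 0.1, rng.normal(size=p)
alpha, lambda_ = 0.7, 0.05

b_star = weighted_cw_update(y, x, j, beta0, beta, w, alpha, lambda_)


def f_at(b):
    """Objective as a function of beta[j] alone."""
    trial = beta.copy()
    trial[j] = b
    return objective(y, x, beta0, trial, w, alpha, lambda_)


# The objective on a grid around b_star should never fall below f_at(b_star).
grid = np.linspace(b_star - 1.0, b_star + 1.0, 2001)
grid_min = min(f_at(b) for b in grid)
```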
In the "naive" optimizer, which is explained without weights, $x$ is normalized so that $\frac{1}{N}\sum_i x_{ij}^2 = 1$. For the sparse optimizer it is less clear what happens, but I think $x$ is scaled so that the same property still holds. I am also not sure how the normalization is handled when there are weights.

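Let
$z = \sum_i w_i x_{ij} (y_i - \beta_0 - \tilde{x}_{ij}^T \tilde{\beta}_j)$. To look at $z$ more closely, define the residuals $r_i = y_i - \beta_0 - x_i^T \beta$, so that $y_i - \beta_0 - \tilde{x}_{ij}^T \tilde{\beta}_j = r_i + x_{ij} \beta_j$.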
\begin{align*}
z &= \sum_i w_i x_{ij} (y_i - \beta_0 - \tilde{x}_{ij}^T \tilde{\beta}_j) \\
&= \sum_i w_i x_{ij} (r_i + x_{ij} \beta_j)\\
&= \sum_i w_i x_{ij} r_i + \beta_j \sum_i w_i x_{ij}^2
\end{align*}

If we have normalized so that $\sum_i w_i x_{ij}^2 = 1$, then we can simplify the above equations:
\begin{align*}
z &= \sum_i w_i x_{ij} r_i + \beta_j \\
\beta_j^* &= \frac{S(z, \lambda \alpha)}{1 + \lambda (1 - \alpha)}
\end{align*}

In [122]:
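import numpy as np

def soft_threshold(z, gamma):
    # S(z, gamma) = sign(z) * max(|z| - gamma, 0); assumed helper, defined here so the cell runs on its own
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def get_cw_update(y, x, j, beta, alpha, lambda_):
    # Update for beta[j]; assumes weights 1/N, no intercept, and mean(x[:, j] ** 2) == 1.
    prediction_not_j = x.dot(beta) - x[:, j] * beta[j]
    resid = y - prediction_not_j  # partial residual, excluding feature j
    n = len(y)
    mean_resid = x[:, j].dot(resid) / n  # z from the derivation, with w_i = 1/N
    numerator = soft_threshold(mean_resid, lambda_ * alpha)
    denominator = 1 + lambda_ * (1 - alpha)
    return numerator / denominator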

In [123]:
def cd_update(y, x, beta, alpha, lambda_):
    # One full cycle over the coordinates; beta is updated in place.
    for j in range(len(beta)):
        beta[j] = get_cw_update(y, x, j, beta, alpha, lambda_)
    return beta

In [124]:
def do_cd(y, x, alpha, lambda_, n_iters):
    beta = np.zeros(x.shape[1])
    for i in range(n_iters):
        beta = cd_update(y, x, beta, alpha, lambda_)
    return beta
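
The functions above recompute the full prediction for every coordinate. A common variant, sketched here under the same assumptions (weights $1/N$, no intercept, columns scaled so that $\frac{1}{N}\sum_i x_{ij}^2 = 1$), keeps the residual vector up to date instead, so each coordinate update costs $O(N)$. The name `cd_with_residuals` and the synthetic data are illustrative. With $\alpha = 0$ the problem reduces to ridge regression, so the result can be compared with the closed-form solution.

```python
import numpy as np


def cd_with_residuals(y, x, alpha, lambda_, n_iters):
    """Cyclic coordinate descent that updates the residual in place.

    Assumes weights 1/N, no intercept, and mean(x[:, j] ** 2) == 1 for every j.
    """
    n, p = x.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()               # residual for beta = 0
    for _ in range(n_iters):
        for j in range(p):
            # z = (1/N) sum_i x_ij r_i + beta_j, using the normalization of x
            z = x[:, j] @ resid / n + beta[j]
            new_bj = np.sign(z) * max(abs(z) - lambda_ * alpha, 0.0)
            new_bj /= 1 + lambda_ * (1 - alpha)
            resid -= x[:, j] * (new_bj - beta[j])  # keep resid == y - x @ beta
            beta[j] = new_bj
    return beta


rng = np.random.default_rng(2)
n, p = 200, 4
x = rng.normal(size=(n, p))
x /= np.sqrt(np.mean(x ** 2, axis=0))            # enforce mean(x_j ** 2) == 1
y = x @ np.array([1.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=n)

lambda_ = 0.3
beta_cd = cd_with_residuals(y, x, alpha=0.0, lambda_=lambda_, n_iters=200)
# Ridge closed form for (1 / 2N) ||y - x b||^2 + (lambda / 2) ||b||^2
beta_ridge = np.linalg.solve(x.T @ x / n + lambda_ * np.eye(p), x.T @ y / n)
```

On this well-conditioned example the two solutions should agree closely after a few hundred cycles. A large $\lambda$ with $\alpha = 1$ should drive every coefficient to exactly zero.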

## Derivatives

\begin{align*}
LL &= \sum_i w_i LL_i(\theta(x_i^T \beta))\\
\frac{\partial LL_i}{\partial \beta} &= LL_i'(\theta(x_i^T\beta)) \theta'(x_i^T\beta) x_i \\
&\equiv LL_i' \theta' x_i \\
\frac{\partial LL}{\partial \beta} &= \sum_i w_i LL_i' \theta' x_i \\
\frac{\partial^2 LL_i}{\partial \beta \partial \beta^T} &= \left(
LL_i'' \theta'^2 + LL_i' \theta'' \right) x_i x_i^T
\end{align*}
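
The second-derivative term comes from applying the chain rule twice: for $h(\eta) = LL_i(\theta(\eta))$ with $\eta = x_i^T \beta$, we get $h''(\eta) = LL_i'' \theta'^2 + LL_i' \theta''$. The sketch below checks this with central finite differences for a Gaussian loss composed with a log link. The test point and step size are arbitrary choices, and the function names are illustrative.

```python
import math


def loss(theta, y):
    return 0.5 * (y - theta) ** 2        # Gaussian loss, LL_i


def h(eta, y):
    return loss(math.exp(eta), y)        # composition with theta = exp(eta)


def h_second_analytic(eta, y):
    theta = math.exp(eta)                # theta = theta' = theta'' = exp(eta)
    d_loss = theta - y                   # LL_i'
    d2_loss = 1.0                        # LL_i''
    return d2_loss * theta ** 2 + d_loss * theta


eta, y, step = 0.3, 2.0, 1e-4
# Central second difference approximates h''(eta)
h_second_numeric = (h(eta + step, y) - 2 * h(eta, y) + h(eta - step, y)) / step ** 2
```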

Taylor expand:

\begin{align*}
LL_i(\beta) &\approx LL_i(\tilde{\beta}) + LL_i' \theta' x_i^T \left( \beta  - \tilde{\beta} \right)
+ \frac{1}{2} \left(
LL_i'' \theta'^2 + LL_i' \theta'' \right) \left(x_i^T \beta - x_i^T \tilde{\beta} \right)^2 \\
&= C(\tilde{\beta}) + LL_i' \theta' x_i^T \beta
+ \frac{1}{2} \left(
LL_i'' \theta'^2 + LL_i' \theta'' \right) \left((x_i^T \beta)^2 - 2 x_i^T \tilde{\beta} x_i^T \beta \right) \\
&= C(\tilde{\beta}) + \frac{1}{2} \left(
LL_i'' \theta'^2 + LL_i' \theta'' \right) \left((x_i^T \beta)^2 - 2 x_i^T \tilde{\beta} x_i^T \beta 
+ 2 \frac{LL_i' \theta'}{LL_i'' \theta'^2 + LL_i' \theta''} x_i^T \beta
\right) \\
&= C(\tilde{\beta}) + \frac{1}{2} \left(
LL_i'' \theta'^2 + LL_i' \theta'' \right) \left(x_i^T \beta - \left( x_i^T \tilde{\beta} 
- \frac{LL_i' \theta'}{LL_i'' \theta'^2 + LL_i' \theta''} \right)
\right)^2
\end{align*}

Call IRLS weights $\nu_i$ to avoid confusion with likelihood weights $w_i$.

\begin{align*}
\nu_i &= \frac{1}{2} \left(
LL_i'' \theta'^2 + LL_i' \theta'' \right) \\
z_i &=  x_i^T \tilde{\beta} 
- \frac{LL_i' \theta'}{LL_i'' \theta'^2 + LL_i' \theta''}  \\
&= x_i^T \tilde{\beta}
- \frac{LL_i' \theta'}{2 \nu_i}
\end{align*}

$$
LL(\beta) \approx C + \sum_i w_i \nu_i \left(z_i - x_i^T \beta \right)^2
$$

Helpful derivatives:

Gaussian:

\begin{align*}
LL_i &= \frac{1}{2} (y_i - \theta_i)^2 \\
LL_i' &= \theta_i - y_i \\
LL_i'' &= 1
\end{align*}

Identity link:
\begin{align*}
\theta &= \eta \\
\theta' &= 1 \\
\theta'' &= 0
\end{align*}

Log link:
\begin{align*}
\theta = \theta' = \theta'' &= e^{\eta}
\end{align*}

IRLS for Gaussian with identity link:
\begin{align*}
\nu_i &= \frac{1}{2} \\
z_i &= x_i^T \tilde{\beta} - \left( x_i^T \tilde{\beta} - y_i \right) \\
&= y_i \\
\min_\beta &\sum_i \frac{1}{2} w_i \left(y_i - x_i^T \beta \right)^2
\end{align*}
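
The definitions of $\nu_i$ and $z_i$ translate directly into code. The sketch below takes the derivative values as inputs and confirms that the Gaussian loss with identity link gives $\nu_i = \frac{1}{2}$ and $z_i = y_i$. The function name `irls_quantities` and its signature are assumptions made for illustration.

```python
import numpy as np


def irls_quantities(eta, d_loss, d2_loss, d_theta, d2_theta):
    """Return (nu, z) with nu = (LL'' theta'^2 + LL' theta'') / 2
    and z = eta - LL' theta' / (2 nu).

    All inputs are arrays evaluated at the current linear predictor
    eta = x @ beta_tilde.
    """
    nu = 0.5 * (d2_loss * d_theta ** 2 + d_loss * d2_theta)
    z = eta - d_loss * d_theta / (2 * nu)
    return nu, z


rng = np.random.default_rng(3)
y = rng.normal(size=5)
eta = rng.normal(size=5)                 # current x @ beta_tilde

# Gaussian loss with identity link: theta = eta, theta' = 1, theta'' = 0
theta = eta
nu, z = irls_quantities(
    eta,
    d_loss=theta - y,
    d2_loss=np.ones_like(y),
    d_theta=np.ones_like(y),
    d2_theta=np.zeros_like(y),
)
```

Other loss and link combinations can be checked by passing their derivative arrays, provided the resulting $\nu_i$ is non-zero.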
