# Part 1

In [15]:
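import pymc as pm
import numpy as np
import pandas as pd

# Load the data: the AVG column is the predictor, award is the binary outcome
data = pd.read_excel("MLB365.xlsx")
X = data['AVG'].values
y = data['award'].values

with pm.Model() as logistic_model:
    # Wide Normal priors on the intercept and slope
    intercept = pm.Normal('intercept', mu=0, sigma=10)
    slope = pm.Normal('slope', mu=0, sigma=10)

    # Linear predictor on the logit scale, mapped to a probability
    logit_p = intercept + slope * X
    p = pm.math.sigmoid(logit_p)

    # Bernoulli likelihood for the observed 0/1 outcomes
    y_obs = pm.Bernoulli('y_obs', p=p, observed=y)

    # Draw posterior samples with the default sampler settings
    trace = pm.sample()

pm.summary(trace)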


| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| intercept | -6.599 | 0.377 | -7.379 | -5.95 | 0.02 | 0.014 | 359.0 | 446.0 | 1.01 |
| slope | 13.08 | 1.346 | 10.722 | 15.817 | 0.07 | 0.049 | 378.0 | 418.0 | 1.01 |
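
To read the summary on the probability scale, the posterior means can be plugged into the logistic link. The sketch below uses only the `mean` values reported above; the three `AVG` values are hypothetical inputs chosen for illustration, not rows from the data. Note that plugging in posterior means gives a rough plug-in estimate, not the posterior mean of the probability itself.

```python
import math

# Posterior means copied from the summary table above
intercept_mean = -6.599
slope_mean = 13.08

def plug_in_probability(avg):
    """Plug-in estimate of P(award = 1) at a given AVG, using posterior means."""
    logit = intercept_mean + slope_mean * avg
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical AVG values, chosen only to illustrate the shape of the curve
example_avgs = [0.250, 0.300, 0.350]
plug_in_probs = [plug_in_probability(a) for a in example_avgs]

for avg, prob in zip(example_avgs, plug_in_probs):
    print(f"AVG = {avg:.3f} -> plug-in P(award = 1) = {prob:.3f}")
```

A fuller treatment would apply the sigmoid to every posterior draw in `trace` and summarize the resulting distribution, which also carries the uncertainty in `intercept` and `slope`.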


# Part 2

### Ridge Regression and Normal Prior
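The ridge regression loss function is equivalent, up to an additive constant, to the negative log posterior distribution when using unit-variance Normal noise and a Normal prior for the coefficients $\beta$ with mean 0 and variance $\sigma^2 = 1$:

$$-\log \left( \prod_{i=1}^{n} \mathcal{N}(y_i | x_i^T\beta, 1) \times \prod_{j=1}^{p} \mathcal{N}(\beta_j | 0, 1) \right)$$
Ignoring the normalizing constants, this is:
$$\sum_{i=1}^{n} \frac{1}{2}(y_i - x_i^T\beta)^2 + \frac{1}{2}\sum_{j=1}^{p}{\beta_j^2}$$
The factor $\frac{1}{2}$ on the penalty comes from the exponent of the Normal density, so this is the ridge objective with penalty weight $\lambda = 1$ when both terms carry the factor $\frac{1}{2}$.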
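
The equivalence can be checked numerically. The following is a minimal sketch, assuming unit-variance Normal noise and a standard Normal prior; the data are simulated and the helper names (`normal_logpdf`, `neg_log_posterior_normal`, `ridge_objective`) are made up for this illustration. If the two functions differ only by a constant, their gap should be the same for every $\beta$, namely $\frac{n+p}{2}\log(2\pi)$.

```python
import math
import random

def normal_logpdf(value, mean, sd):
    """Log density of a Normal(mean, sd^2) distribution."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - 0.5 * ((value - mean) / sd) ** 2

def linear_predictor(beta, row):
    return sum(b * x for b, x in zip(beta, row))

def neg_log_posterior_normal(beta, X, y):
    """Negative log of (Normal likelihood x Normal(0, 1) prior), constants included."""
    log_lik = sum(normal_logpdf(yi, linear_predictor(beta, xi), 1.0) for xi, yi in zip(X, y))
    log_prior = sum(normal_logpdf(b, 0.0, 1.0) for b in beta)
    return -(log_lik + log_prior)

def ridge_objective(beta, X, y):
    """Half the residual sum of squares plus half the squared L2 norm of beta."""
    rss = sum((yi - linear_predictor(beta, xi)) ** 2 for xi, yi in zip(X, y))
    return 0.5 * rss + 0.5 * sum(b ** 2 for b in beta)

# Simulated data, used only to exercise the two functions
rng = random.Random(0)
n, p = 20, 3
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]

# The gap between the two functions at several random beta values
ridge_gaps = []
for _ in range(5):
    beta = [rng.gauss(0, 2) for _ in range(p)]
    ridge_gaps.append(neg_log_posterior_normal(beta, X, y) - ridge_objective(beta, X, y))

# Normalizing constants: one 0.5*log(2*pi) per observation and per coefficient
ridge_constant = 0.5 * (n + p) * math.log(2 * math.pi)
print(ridge_gaps, ridge_constant)
```

Because the gap does not depend on $\beta$, minimizing the ridge objective and maximizing the posterior pick out the same coefficients.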

### Lasso Regression and Laplace Prior
The lasso regression loss function is equivalent, up to an additive constant, to the negative log posterior distribution when using unit-variance Normal noise and a Laplace prior for $\beta$ with mean 0 and scale $b = 1$:
$$-\log \left( \prod_{i=1}^{n} \mathcal{N}(y_i | x_i^T\beta, 1) \times \prod_{j=1}^{p} \text{Laplace}(\beta_j | 0, 1) \right)$$
Using the Laplace density $\text{Laplace}(\beta_j | 0, 1) = \frac{1}{2} e^{-|\beta_j|}$ and ignoring the normalizing constants, this is:
$$\sum_{i=1}^{n} \frac{1}{2}(y_i - x_i^T\beta)^2 + \sum_{j=1}^{p}  |\beta_j|$$
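
A numerical check of the Laplace case, as a minimal sketch under the same assumptions (unit-variance Normal noise, Laplace prior with location 0 and scale 1). The data are simulated and the helper names are hypothetical. The gap between the negative log posterior and the lasso objective should equal $\frac{n}{2}\log(2\pi) + p\log 2$ for every $\beta$.

```python
import math
import random

def normal_logpdf(value, mean, sd):
    """Log density of a Normal(mean, sd^2) distribution."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - 0.5 * ((value - mean) / sd) ** 2

def laplace_logpdf(value, loc, scale):
    """Log density of a Laplace(loc, scale) distribution."""
    return -math.log(2 * scale) - abs(value - loc) / scale

def linear_predictor(beta, row):
    return sum(b * x for b, x in zip(beta, row))

def neg_log_posterior_laplace(beta, X, y):
    """Negative log of (Normal likelihood x Laplace(0, 1) prior), constants included."""
    log_lik = sum(normal_logpdf(yi, linear_predictor(beta, xi), 1.0) for xi, yi in zip(X, y))
    log_prior = sum(laplace_logpdf(b, 0.0, 1.0) for b in beta)
    return -(log_lik + log_prior)

def lasso_objective(beta, X, y):
    """Half the residual sum of squares plus the L1 norm of beta."""
    rss = sum((yi - linear_predictor(beta, xi)) ** 2 for xi, yi in zip(X, y))
    return 0.5 * rss + sum(abs(b) for b in beta)

# Simulated data, used only to exercise the two functions
rng = random.Random(1)
n, p = 15, 4
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]

lasso_gaps = []
for _ in range(5):
    beta = [rng.gauss(0, 2) for _ in range(p)]
    lasso_gaps.append(neg_log_posterior_laplace(beta, X, y) - lasso_objective(beta, X, y))

# Normalizing constants: 0.5*log(2*pi) per observation, log(2) per coefficient
lasso_constant = 0.5 * n * math.log(2 * math.pi) + p * math.log(2)
print(lasso_gaps, lasso_constant)
```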


### "Bayesians do not optimize posterior distributions, they sample from them; but, the posterior distributions are nonetheless 'regularizations' of the likelihood through the prior."  
In Bayesian statistics, inference is the process of updating our beliefs in light of new data. The key distinction between Bayesian inference and other forms of statistical inference is how it treats parameter estimation: parameters are described by probability distributions rather than by a single best value.

Traditional frequentist inference often seeks the point estimates that maximize the likelihood function, or a penalized version of it as in ridge and lasso regression. These procedures are forms of optimization, where the goal is to find the set of parameters that best explains the data.

Bayesian inference does not condense the inference to a single point estimate. Instead, Bayesians compute the entire posterior distribution, which describes the probability of parameter values given the data and the model. From this distribution, one can derive point estimates, but the focus is on the full distribution, which captures the uncertainty surrounding the parameters.

Sampling from the posterior distribution, typically using algorithms like MCMC, is the primary method for doing Bayesian inference. This process does not involve optimization; rather, it generates a collection of parameter values that are representative of the posterior distribution.

The prior distribution plays a similar role to regularization in frequentist methods. A regularization term in frequentist optimization penalizes certain parameter values, for example to shrink large coefficients toward zero (ridge) or to encourage sparsity (lasso). In Bayesian inference, the prior distribution encodes similar biases by determining which parameter values are considered a priori more likely: the negative log prior enters the negative log posterior exactly where the penalty term enters the regularized loss. So, while Bayesians do not optimize, the combination of the likelihood with the prior can be seen as imposing a form of regularization. The prior guides the inference process, ensuring that the posterior distribution reflects both the information in the data and any a priori constraints or expectations encoded in the model.
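
The point can be illustrated with a toy one-coefficient model, $y_i = \beta x_i + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, 1)$ and prior $\beta \sim \mathcal{N}(0, 1)$. This is a minimal sketch: the five data points are made up, and the random-walk Metropolis sampler (with an arbitrarily chosen step size of 0.8) stands in for whatever MCMC algorithm would be used in practice. No optimizer is called, yet the draws center on a value that is shrunk toward zero relative to the maximum likelihood estimate.

```python
import math
import random

# Made-up data for illustration only
x = [0.5, -1.0, 1.5, 0.2, -0.7]
y = [1.1, -1.8, 3.2, 0.1, -1.6]

sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

# Closed forms for this conjugate model, used only as a reference
mle = sxy / sxx                      # maximizes the likelihood (no prior)
posterior_mean = sxy / (sxx + 1.0)   # Normal(0, 1) prior adds 1 to the precision

def log_posterior(beta):
    """Unnormalized log posterior: Normal log likelihood plus Normal(0, 1) log prior."""
    log_lik = -0.5 * sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
    log_prior = -0.5 * beta ** 2
    return log_lik + log_prior

# Random-walk Metropolis: propose a move, accept with the usual ratio
rng = random.Random(0)
beta = 0.0
draws = []
for step in range(22000):
    proposal = beta + rng.gauss(0.0, 0.8)
    log_u = math.log(1.0 - rng.random())  # 1 - U lies in (0, 1], so the log is defined
    if log_u < log_posterior(proposal) - log_posterior(beta):
        beta = proposal
    if step >= 2000:  # discard the first 2000 steps as burn-in
        draws.append(beta)

sample_mean = sum(draws) / len(draws)
print(f"MLE: {mle:.3f}")
print(f"Closed-form posterior mean: {posterior_mean:.3f}")
print(f"Mean of posterior draws: {sample_mean:.3f}")
```

In this model the posterior mean coincides with the ridge estimate for $\lambda = 1$, so the sampler recovers the regularized answer along with the spread of the draws around it.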