Q1) 1.1)

$ \text{The likelihood function $p(x|\theta)$ is}$ 
$$ p(x|\theta) = \frac{1}{\sqrt{8\pi}}\exp\left(-\frac{(x-\theta)^2}{8}\right)$$

$\text{The prior distribution for $\theta$ is}$
$$ p(\theta)= \frac{1}{3\sqrt{2\pi}}\exp \left(-\frac{(\theta-5)^2}{18}\right)$$

$\text{Now, according to Bayes' theorem,}$
$$ p(\theta|x) \propto p(x|\theta)p(\theta)$$
$$ \propto \exp\left(-\frac{(x-\theta)^2}{8}-\frac{(\theta-5)^2}{18}\right) $$
$\text{Substituting the observed value $x = 6$ and completing the square in $\theta$,}$
$$ p(\theta|x) \propto \exp\left(-\frac{(6-\theta)^2}{8}-\frac{(\theta-5)^2}{18}\right) $$
$$\propto \exp \left(-\frac{\left(\theta - \frac{74}{13}\right)^2}{2\times\frac{36}{13}}\right) $$
$$ \implies \theta \mid x \sim \mathcal{N}\left(\frac{74}{13}, \frac{36}{13}\right)$$
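
The conjugate update above can be checked with exact arithmetic. The following is a minimal sketch, assuming a prior variance of 9 and a likelihood variance of 4 (the values consistent with the stated posterior and with a = 1/9, b = n/4 in part 1.2); the function name `normal_posterior` is illustrative only.

```python
from fractions import Fraction

def normal_posterior(prior_mean, prior_var, noise_var, x):
    """Posterior of theta after one observation x ~ N(theta, noise_var),
    given a N(prior_mean, prior_var) prior (known-variance conjugate update)."""
    # Precisions (inverse variances) add up.
    precision = Fraction(1) / prior_var + Fraction(1) / noise_var
    post_var = 1 / precision
    # Posterior mean is the precision-weighted average of prior mean and x.
    post_mean = post_var * (Fraction(prior_mean) / prior_var + Fraction(x) / noise_var)
    return post_mean, post_var

# Assumed values: prior N(5, 9), likelihood variance 4, observation x = 6.
post_mean, post_var = normal_posterior(prior_mean=5, prior_var=9, noise_var=4, x=6)
print(post_mean, post_var)  # 74/13 36/13
```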


Q1) 1.2)

$$ a=\frac{1}{\sigma^2_{prior}}=\frac {1}{9}, \quad b=\frac{n}{\sigma^2}=\frac {n}{4}$$
$$ \mu_{post}= \frac{\frac{5}{9}+\frac{n}{4}\bar{x}}{\frac{9n+4}{36}} = \frac{20+9n\bar{x}}{9n+4}$$
$$ \sigma^2_{post}= \frac{36}{9n+4}$$

Q1) 1.3)

$$ \mu_{post}= \frac{20+9n\bar{x}}{9n+4}$$
$$\text{As n $\to \infty $, $\mu_{post} \to \bar{x}$ }$$
$$ \sigma^2_{post}= \frac{36}{9n+4}$$
$$\text{As n $\to \infty $, $\sigma^2_{post}\to 0$}$$
$\text{As more data points accumulate, the likelihood dominates the prior, so the posterior mean moves away from the prior mean (5) and towards the sample mean $\bar{x}$.} $
$\text{The shrinking variance indicates increasing certainty about the true value of $\theta$ as more data is received.}$
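
The limiting behaviour can be illustrated numerically. This is a small sketch using the formulas above with a = 1/9 and b = n/4; the sample mean of 6.0 and the helper name `posterior_params` are hypothetical choices made only for illustration.

```python
def posterior_params(n, xbar, prior_mean=5.0, prior_var=9.0, noise_var=4.0):
    """Posterior mean and variance after n observations with sample mean xbar."""
    a = 1.0 / prior_var   # prior precision
    b = n / noise_var     # total data precision
    mean = (a * prior_mean + b * xbar) / (a + b)
    var = 1.0 / (a + b)
    return mean, var

xbar = 6.0  # hypothetical sample mean, held fixed while n grows
for n in (1, 10, 100, 10_000):
    mean, var = posterior_params(n, xbar)
    print(f"n={n:>6}: mean={mean:.4f}, var={var:.6f}")
```

The printed mean approaches the sample mean and the variance shrinks towards zero as n increases.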

Q1) 1.4.1)

$$ \mu_{prior}= 100, \sigma^2_{prior}= 152$$
$\text{Since we have only one sample, $\bar{x}= x = 80$.}$
$\text{Therefore,}$
$$ \mu_{post}= \frac{\frac{100}{152} + \frac{80}{102}}{\frac{1}{152}+\frac{1}{102}}= 88.03$$


Q1) 1.4.2)

$\text{Similarly, for $x=150$}$
$$ \mu_{post}= \frac{\frac{100}{152} + \frac{150}{102}}{\frac{1}{152}+\frac{1}{102}}= 129.92$$
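
The two posterior means can be reproduced with a short precision-weighted average. This sketch takes the numbers exactly as they appear in the expressions above (prior variance 152, and 102 as the likelihood variance read off the displayed fractions); the function name is illustrative.

```python
def posterior_mean(x, prior_mean=100.0, prior_var=152.0, noise_var=102.0):
    """Posterior mean for a single observation x under a Gaussian prior."""
    w_prior = 1.0 / prior_var  # weight (precision) of the prior
    w_data = 1.0 / noise_var   # weight (precision) of the observation
    return (w_prior * prior_mean + w_data * x) / (w_prior + w_data)

for x in (80, 150):
    print(f"x={x}: posterior mean = {posterior_mean(x):.2f}")
```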

Q2)

$\text{The likelihood function is:}$
$$L(\mu, \sigma^2 \mid \{x_i\}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(x_i - \mu)^2}{2 \sigma^2} \right)$$

$\text{The log likelihood function is:}$
$$ \log L(\mu, \sigma^2 \mid \{x_i\}) = -\frac{n}{2} \log(2 \pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$

$\text{To find the Maximum Likelihood Estimates, \( \hat{\mu} \) and \( \hat{\sigma}^2 \),  we maximize the log-likelihood function.}$
$\text{We get,}$
$$ \hat{\mu}= \frac{1}{n} \sum_{i=1}^{n} x_i $$
$$ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$
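
The maximization step can be made explicit by setting the partial derivatives of the log-likelihood to zero. A brief worked sketch, using the same symbols as above:

$$ \frac{\partial \log L}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0 \implies \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
$$ \frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0 \implies \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 $$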

In [46]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

np.random.seed(0)
mu_true = 5.0
sigma_true = 2.0
n = 1000
data = np.random.normal(mu_true, sigma_true, n)

def neg_log_likelihood(params, data):
    # Negative Gaussian log-likelihood; params = (mu, sigma)
    mu, sigma = params
    n = len(data)
    return (n / 2) * np.log(2 * np.pi * sigma**2) + np.sum((data - mu)**2) / (2 * sigma**2)

initial_guess = [0, 1]

# Minimizing the negative log-likelihood yields the maximum likelihood estimates
result = minimize(neg_log_likelihood, initial_guess, args=(data,))

mu_mle, sigma_mle = result.x

print(f"True mean: {mu_true}, True standard deviation: {sigma_true}")
print(f"MLE of mean: {mu_mle}, MLE of standard deviation: {sigma_mle}")

True mean: 5.0, True standard deviation: 2.0
MLE of mean: 4.909486604140559, MLE of standard deviation: 1.9740663198660107


Q3)

In [47]:
import numpy as np
import math
from scipy.optimize import minimize

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_posterior(theta, x, mu, sigma, y):
    # Unnormalized log posterior: Bernoulli log-likelihood plus Gaussian log prior.
    # (Named log_posterior rather than map to avoid shadowing the built-in map.)
    z = np.dot(x, theta)
    log_likelihood = np.sum(y*np.log(sigmoid(z)) + (1-y)*np.log(1-sigmoid(z)))
    log_prior = -0.5 * np.sum(((theta-mu)/sigma)**2) - (len(theta)/2) * np.log(2 * math.pi * sigma**2)
    return log_likelihood + log_prior

def neg_map(theta, x, mu, sigma, y):
    # Objective for minimize: maximizing the posterior = minimizing its negative
    return -log_posterior(theta, x, mu, sigma, y)

np.random.seed(0)
n_samples = 1000
n_features = 4
X = np.random.randn(n_samples, n_features)

theta_true = np.array([-0.5, 1.0, 1.5, 1.0])

# Draw binary labels from the logistic model with the true parameters
y = (np.random.rand(n_samples) < sigmoid(np.dot(X, theta_true))).astype(int)

mu_prior = np.zeros(X.shape[1])
sigma_prior = 16
starting_guess = np.zeros(X.shape[1])

result = minimize(neg_map, starting_guess, args=(X, mu_prior, sigma_prior, y))
theta_estimated = result.x

print("True Parameters: ", theta_true)
print("Estimated Parameters: ", theta_estimated)



True Parameters:  [-0.5  1.   1.5  1. ]
Estimated Parameters:  [-0.55124223  0.86040467  1.5612228   0.94275723]


Q4)

4.1) 
Consider a single point x1. The possible labellings are 0 and 1.
The concept class shatters this set because the constant hypotheses h_0(x) = 0 and h_1(x) = 1 realize the labels 0 and 1 respectively.
Now consider two points x1 and x2. There are 2^2 = 4 possible labellings: (0,0), (0,1), (1,0), (1,1).
The concept class cannot shatter this set: a constant function assigns the same label to both points, so it realizes only (0,0) and (1,1) and never (0,1) or (1,0).
Therefore, the VC dimension of the constant function concept class is 1.
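
The argument can be checked by brute force. The following is a small sketch, assuming the concept class contains exactly the two constant classifiers; the helper names and the example points are hypothetical.

```python
from itertools import product

def constant_hypotheses():
    # The two constant classifiers: always 0 and always 1.
    return [lambda x: 0, lambda x: 1]

def is_shattered(points, hypotheses):
    """True if every 0/1 labelling of `points` is produced by some hypothesis."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return all(labels in realized for labels in product((0, 1), repeat=len(points)))

print(is_shattered([0.0], constant_hypotheses()))       # True
print(is_shattered([0.0, 1.0], constant_hypotheses()))  # False
```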


4.2) A linear classifier in d dimensions can be defined as h(x) = sign(w'x+b), where w is a d-dimensional vector and b is a scalar.
Consider a single point x1 $\in \mathbb{R}^d$. The class shatters this point because w and b can be chosen so that w'x1+b > 0 (label +1) or w'x1+b < 0 (label −1).
Consider two distinct points x1, x2 $\in \mathbb{R}^d$. There are 2^2 = 4 possible labellings, (+1,+1), (+1,−1), (−1,+1), (−1,−1), and each one is realized by some hyperplane, so the class shatters this set.
More generally, there exists a set of d+1 points that can be shattered, for example the origin together with the d standard basis vectors: for any labelling, a suitable w and b place each point on the required side of the hyperplane.
No set of d+2 points can be shattered. By Radon's theorem, any d+2 points in $\mathbb{R}^d$ can be split into two groups whose convex hulls intersect, and no hyperplane can separate those two groups, so the labelling that assigns +1 to one group and −1 to the other cannot be realized.
Therefore, the VC dimension of the concept class of linear functions in d dimensions is d+1.
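
The lower bound (some set of d+1 points is shattered) can be verified constructively. This sketch assumes the point set consisting of the origin and the d standard basis vectors, and builds an explicit separator for each labelling; the function names are illustrative, and the check does not cover the d+2 upper bound.

```python
from itertools import product
import numpy as np

def shattering_points(d):
    # Origin plus the d standard basis vectors: d + 1 points in R^d.
    return np.vstack([np.zeros(d), np.eye(d)])

def separator_for(labels):
    """Return (w, b) with sign(w.x + b) matching `labels` on the points above.
    labels[0] belongs to the origin, labels[1:] to the basis vectors."""
    labels = np.asarray(labels, dtype=float)
    b = 0.5 * labels[0]   # the origin evaluates to b, so sign(b) = labels[0]
    w = labels[1:] - b    # basis vector e_i evaluates to w_i + b = labels[i]
    return w, b

def all_labellings_realized(d):
    pts = shattering_points(d)
    for labels in product((-1, 1), repeat=d + 1):
        w, b = separator_for(labels)
        predicted = np.sign(pts @ w + b)
        if not np.array_equal(predicted, np.array(labels, dtype=float)):
            return False
    return True

print(all(all_labellings_realized(d) for d in (1, 2, 3, 4)))  # True
```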

Q5) 5.1)

$$ D_{KL}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$
$$ \therefore D_{KL}(P \parallel Q) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp \left( -\frac{(x - \mu_1)^2}{2\sigma_1^2} \right) \log \left( \frac{\frac{1}{\sqrt{2\pi\sigma_1^2}} \exp \left( -\frac{(x - \mu_1)^2}{2\sigma_1^2} \right)}{\frac{1}{\sqrt{2\pi\sigma_2^2}} \exp \left( -\frac{(x - \mu_2)^2}{2\sigma_2^2} \right)} \right) dx$$

$$ \therefore D_{KL}(P \parallel Q) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp \left( -\frac{(x - \mu_1)^2}{2\sigma_1^2} \right) \left[ \log \frac{\sigma_2}{\sigma_1} - \frac{(x - \mu_1)^2}{2\sigma_1^2} + \frac{(x - \mu_2)^2}{2\sigma_2^2} \right] dx
$$

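$ \text{Since $\mathbb{E}_P[(x-\mu_1)^2]=\sigma_1^2$ and $\mathbb{E}_P[(x-\mu_2)^2]=\sigma_1^2+(\mu_1-\mu_2)^2$, the integral evaluates to}$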

$$ D_{KL}(P \parallel Q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$
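
The closed form can be sanity-checked against a direct numerical evaluation of the integral. This is a rough sketch using a Riemann sum on a wide grid; the parameter values in the example and the function names are hypothetical.

```python
import numpy as np

def kl_gaussian_closed_form(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(P || Q) for P = N(mu1, sigma1^2), Q = N(mu2, sigma2^2)."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

def kl_gaussian_numeric(mu1, sigma1, mu2, sigma2, num=200_001):
    """Riemann-sum approximation of the KL integral on a grid centred on P."""
    x = np.linspace(mu1 - 12 * sigma1, mu1 + 12 * sigma1, num)
    log_p = -0.5 * np.log(2 * np.pi * sigma1**2) - (x - mu1)**2 / (2 * sigma1**2)
    log_q = -0.5 * np.log(2 * np.pi * sigma2**2) - (x - mu2)**2 / (2 * sigma2**2)
    dx = x[1] - x[0]
    return np.sum(np.exp(log_p) * (log_p - log_q)) * dx

# Hypothetical example: P = N(0, 1), Q = N(1, 4)
closed = kl_gaussian_closed_form(0.0, 1.0, 1.0, 2.0)
numeric = kl_gaussian_numeric(0.0, 1.0, 1.0, 2.0)
print(f"closed form: {closed:.6f}, numeric: {numeric:.6f}")
```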

Q5) 5.2)

In [48]:
import pandas as pd
df = pd.read_csv(r'C:\Users\91876\OneDrive\ドキュメント\Desktop\data_KL.csv',index_col=0)
df.head()

   P             Q
0  7.888609e-31  1.486720e-07
1  7.888609e-29  2.438962e-07
2  3.904861e-27  3.961301e-07
3  1.275588e-25  6.369829e-07
4  3.093301e-24  1.014086e-06


In [49]:
P = df['P'].values
Q = df['Q'].values

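# Discrete KL divergences; assumes P and Q are normalized and strictly positive
kl_div1 = np.sum(P * np.log(P / Q))  # D_KL(P || Q)
kl_div2 = np.sum(Q * np.log(Q / P))  # D_KL(Q || P)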

print(f"KL Divergence from P to Q: {kl_div1:.4f}")
print(f"KL Divergence from Q to P: {kl_div2:.4f}")

KL Divergence from P to Q: 0.3182
KL Divergence from Q to P: 0.8319


Q5) 5.3)

KL divergence measures how much one probability distribution P (the actual distribution) differs from a second distribution Q (the expected or approximating distribution). It is not a true mathematical distance because it is not symmetric, that is, in general $D_{KL}(P\parallel Q) \neq  D_{KL}(Q\parallel P) $, and it does not satisfy the triangle inequality.
It can be interpreted as the information lost when Q is used to approximate P.
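
The asymmetry is easy to see on a small discrete example. The sketch below uses two made-up three-outcome distributions (not the data from 5.2); the function name `kl_divergence` is illustrative.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical distributions chosen only to illustrate the asymmetry
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
print(f"D_KL(p || q) = {kl_divergence(p, q):.4f}")
print(f"D_KL(q || p) = {kl_divergence(q, p):.4f}")
```

The two printed values differ, while the divergence of a distribution from itself is zero.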