In [1]:
import numpy as np
import matplotlib.pyplot as plt

# Bernoulli naive Bayes

Run the cell below to define the following variables:

`X` = data matrix of shape $(n, d)$. All features are binary, taking values $0$ or $1$.

`y` = label vector. Labels are $0$ and $1$.

In [2]:
rng = np.random.default_rng(seed=1)
# Each feature column holds 50 class-0 draws followed by 50 class-1 draws
X1 = np.concatenate((rng.binomial(size=50, n=1, p=0.7), rng.binomial(size=50, n=1, p=0.2))).reshape(-1, 1)
X2 = np.concatenate((rng.binomial(size=50, n=1, p=0.6), rng.binomial(size=50, n=1, p=0.1))).reshape(-1, 1)
X3 = np.concatenate((rng.binomial(size=50, n=1, p=0.6), rng.binomial(size=50, n=1, p=0.2))).reshape(-1, 1)
X4 = np.concatenate((rng.binomial(size=50, n=1, p=0.8), rng.binomial(size=50, n=1, p=0.1))).reshape(-1, 1)

X = np.column_stack((X1, X2, X3, X4))

# Labels as a column vector of shape (100, 1): 50 zeros, then 50 ones
y = np.concatenate((np.zeros(50, dtype=int), np.ones(50, dtype=int))).reshape(-1, 1)

# Shuffle examples and labels with the same permutation
permute = rng.permuted(range(100))

X = X[permute]
y = y[permute]


## Question 1
If we train the naive Bayes model on the dataset, what will be the value of $\hat{p}$, the estimate for $P(Y=1)$?



In [6]:
# Enter your solution here
np.sum(y)/y.shape[0]

0.5
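
The value above is the fraction of examples labeled $1$, which is the maximum-likelihood estimate of the class prior. As a worked check, using the fact that the data-generation cell creates 50 examples per class out of $n = 100$:

$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{50}{100} = 0.5$$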

## Question 2
What will be the value of $\hat{p}_0^0$, the estimate of $P(f_0=1|y=0)$?  Write your answer correct to two decimal places.



In [17]:
# Enter your solution here
# Condition on y = 0, so divide by the number of examples labeled 0
np.sum(X[(y.T == 0)[0]][:, 0])/np.sum(y == 0)

0.68

## Question 3
What will be the value of $\hat{p}_0^1$, the estimate of $P(f_0=1|y=1)$?  Write your answer correct to two decimal places.



In [18]:
# Enter your solution here
np.sum(X[(y.T == 1)[0]][:, 0])/np.sum(y)

0.26

## Question 4
What will be the value of $\hat{p}_3^1$, the estimate of $P(f_3=1|y=1)$?  Write your answer correct to two decimal places.




In [19]:
# Enter your solution here
np.sum(X[(y.T == 1)[0]][:, 3])/np.sum(y)

0.12
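
Questions 2 to 4 each compute a single class-conditional estimate. The following is a minimal sketch of how all of them could be obtained in one step, shown on hypothetical toy data rather than on the notebook's `X` and `y`. The names `bernoulli_params`, `X_toy`, `y_toy`, and `p_toy` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Hypothetical toy data (not the notebook's X and y): 6 examples, 2 binary features
X_toy = np.array([[1, 0],
                  [1, 1],
                  [0, 0],
                  [0, 1],
                  [0, 1],
                  [1, 1]])
y_toy = np.array([0, 0, 0, 1, 1, 1]).reshape(-1, 1)  # same (n, 1) layout as y

def bernoulli_params(X, y):
    """Return a (2, d) array whose row k holds the estimates of P(f_j = 1 | y = k)."""
    labels = y.ravel()
    # The mean of a 0/1 column is the fraction of ones in that column
    return np.vstack([X[labels == k].mean(axis=0) for k in (0, 1)])

p_toy = bernoulli_params(X_toy, y_toy)
print(p_toy)
```

Under these assumptions, entry `[k, j]` of the result plays the role of $\hat{p}_j^k$, so calling the same function on `X` and `y` should reproduce the individual answers above.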

## Question 5

What will be the predicted label for the point $[1, 0, 1, 0]$? 



In [43]:
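# Enter your solution here
def predict(x, X, y):
    # Boolean masks selecting the examples of each class (y has shape (n, 1))
    mask0, mask1 = (y.T == 0)[0], (y.T == 1)[0]
    # Per feature, the fraction of class-k examples that agree with x:
    # p_j if x_j = 1 and 1 - p_j if x_j = 0; the product is the class-conditional likelihood
    p0 = np.prod(np.sum(X[mask0] == x, axis=0) / mask0.sum())
    p1 = np.prod(np.sum(X[mask1] == x, axis=0) / mask1.sum())

    # Compare likelihood * prior; True means the predicted label is 1 (ties go to 1)
    return (p0 * mask0.sum() / y.shape[0]) <= (p1 * mask1.sum() / y.shape[0])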

In [44]:
predict(np.array([[1,0,1,0]]), X, y)

True

## Question 6

What will be the predicted label for the point $[1, 0, 1, 1]$? 



In [45]:
# Enter your solution here
predict(np.array([[1, 0, 1, 1]]), X, y)

False
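
The `predict` function returns a boolean: `True` corresponds to label $1$ and `False` to label $0$. Multiplying many probabilities can underflow when the number of features is large, so a common alternative is to compare log-posteriors instead. The sketch below illustrates that idea under stated assumptions: `bernoulli_nb_predict`, `p_demo`, and the clipping constant `eps` are hypothetical and not part of the original notebook.

```python
import numpy as np

def bernoulli_nb_predict(x, p, prior1, eps=1e-9):
    """Predict 0 or 1 for a binary vector x.

    p      : (2, d) array, p[k, j] estimates P(f_j = 1 | y = k)
    prior1 : estimate of P(Y = 1)
    eps    : clipping constant (an assumption) that avoids log(0)
    """
    x = np.asarray(x)
    p = np.clip(p, eps, 1 - eps)
    # log P(x | y = k) = sum_j [x_j log p_kj + (1 - x_j) log(1 - p_kj)]
    log_lik = (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1)
    log_post = log_lik + np.log([1 - prior1, prior1])
    # Ties go to label 1, matching the `<=` comparison in predict
    return int(log_post[1] >= log_post[0])

# Hypothetical parameters, chosen only for illustration
p_demo = np.array([[0.8, 0.7],
                   [0.2, 0.1]])
label_a = bernoulli_nb_predict([1, 1], p_demo, prior1=0.5)  # class 0 is far more likely
label_b = bernoulli_nb_predict([0, 0], p_demo, prior1=0.5)  # class 1 is far more likely
print(label_a, label_b)
```

Because the logarithm is monotonic, this comparison selects the same class as comparing the products directly, except where clipping changes an estimate that is exactly $0$ or $1$.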

# Gaussian naive Bayes

Run the cell below to define the following variables:

`X_train` = training dataset of shape $(n, d)$. Each example is drawn from a multivariate Gaussian distribution.

`y_train` = label vector for the corresponding training examples. Labels are $0$ and $1$.

`X_test` = test dataset of shape $(m, d)$, where $m$ is the number of examples in the test dataset. Each example is drawn from a multivariate Gaussian distribution.

`y_test` = label vector for the corresponding test examples. Labels are $0$ and $1$.



In [46]:
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# generate artificial data points
X, y = make_blobs(n_samples = 100,
                  n_features=2, 
                  centers=[[5,5],[10,10]],
                  cluster_std=1.5,
                  random_state=2)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=123)

## Question 7

How many examples are there in the training dataset?



In [49]:
# Enter your solution here
X_train.shape[0]

80

## Question 8
How many features are there in the dataset?



In [50]:
# Enter your solution here
X_train.shape[1]

2

## Question 9

If we train the Gaussian naive Bayes model on the training dataset, what will be the value of $\hat{p}$, the estimate for $P(Y=1)$? Write your answer correct to two decimal places.





In [53]:
# Enter your solution here
p_hat = np.sum(y_train)/np.shape(y_train)[0]

In [54]:
p_hat

0.4875

## Question 10

If $\hat{\mu}_0 = [\mu_1, \mu_2, \ldots, \mu_d]$ is the estimate for $\mu_0$, the mean of the examples labeled $0$, what will be the value of $\mu_1+\mu_2+\ldots+\mu_d$? Write your answer correct to two decimal places.



In [75]:
X_train[(y_train == 0)].shape

(41, 2)

In [94]:
# Enter your solution here
mu_0 = np.sum(X_train[(y_train == 0)], axis=0) /  y_train[(y_train == 0)].shape[0]
mu_1 = np.sum(X_train[(y_train == 1)], axis=0) /  y_train[(y_train == 1)].shape[0]
mu = np.vstack((mu_0, mu_1))

In [121]:
mu

array([[ 4.55853975,  5.01739665],
       [10.30431548, 10.08580617]])

In [88]:
np.sum(mu_0)

9.575936394688135
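
As a worked check of the value above: the estimate is the average of the $n_0 = 41$ training examples labeled $0$, and the requested quantity is the sum of its components (here $\mu_1$ and $\mu_2$ denote the components of $\hat{\mu}_0$, not the code variable `mu_1`). Using the first row of `mu` printed above:

$$\hat{\mu}_0 = \frac{1}{n_0}\sum_{i:\,y_i = 0} x_i, \qquad \mu_1 + \mu_2 \approx 4.5585 + 5.0174 = 9.5759 \approx 9.58$$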

We will use a separate covariance matrix for each class. The estimate for $\Sigma_k$ will be the diagonal matrix

$\hat{\Sigma}_k = \text{diag}(\sigma_{k,1}^2, \ldots, \sigma_{k,d}^2)$, where $\sigma_{k,i}^2$ is the variance of the $i^{th}$ feature over the examples labeled $k$.
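
To make the covariance estimate concrete, here is a minimal sketch on hypothetical toy data, not the notebook's `X_train`. It assumes the $1/n$ (maximum-likelihood) normalisation, and the names `X_k`, `var_k`, `sigma_k`, and `full_cov` are illustrative.

```python
import numpy as np

# Hypothetical toy examples of a single class (not the notebook's X_train)
X_k = np.array([[1.0, 2.0],
                [3.0, 2.0],
                [5.0, 8.0]])

# Per-feature variance with the 1/n (maximum-likelihood) normalisation
var_k = X_k.var(axis=0)   # ddof=0 is NumPy's default
sigma_k = np.diag(var_k)  # diagonal covariance estimate

# Full covariance with the same 1/n normalisation, for comparison
centered = X_k - X_k.mean(axis=0)
full_cov = centered.T @ centered / X_k.shape[0]

print(sigma_k)
print(np.trace(sigma_k), np.trace(full_cov))
```

The trace depends only on the diagonal entries, so it is the same for the diagonal estimate and for a full covariance matrix computed with the same normalisation.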



## Question 11
What will be the value of $\text{trace}({\hat{\Sigma}}_0)$?  Write your answer correct to two decimal places.







In [122]:
X_0, X_1 = X_train[y_train == 0], X_train[y_train == 1]
# Keep only the diagonal (per-feature variances), as the naive Bayes assumption requires
sigma_0 = np.diag(np.diag((X_0 - mu[0, :]).T @ (X_0 - mu[0, :]) / X_0.shape[0]))
sigma_1 = np.diag(np.diag((X_1 - mu[1, :]).T @ (X_1 - mu[1, :]) / X_1.shape[0]))

In [123]:
# Enter your solution here
np.trace(sigma_0)

4.435204194501573

## Question 12

Once we have estimated all the parameters of the Gaussian naive Bayes model, assuming a different covariance matrix for each class, we predict the labels of the training examples. What will be the training accuracy?

Accuracy is defined as the proportion of correctly classified examples.  Write your answer correct to two decimal places.




In [125]:
(X_train - mu[1, :]).shape, sigma_0.shape

((80, 2), (2, 2))

In [174]:
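def pdf(X, mu, sigma):
    # Gaussian density N(mu, sigma) evaluated at each row of X; returns shape (n,)
    diff = X - mu
    # Squared Mahalanobis distance of each row from mu
    maha = np.sum((diff @ np.linalg.inv(sigma)) * diff, axis=1)
    norm = np.sqrt((2 * np.pi) ** X.shape[1] * np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm

In [206]:
def predict(X_test, mu, sigma_0, sigma_1, p_hat):
    # Log-posterior score of each class, up to a shared constant; p_hat estimates P(Y=1)
    p0 = np.log(pdf(X_test, mu[0, :], sigma_0)) + np.log(1 - p_hat)
    p1 = np.log(pdf(X_test, mu[1, :], sigma_1)) + np.log(p_hat)

    # Predict label 1 wherever its score is at least that of label 0
    return (p1 >= p0).astype(int)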

In [208]:
# Enter your solution here
np.mean(y_train == predict(X_train, mu, sigma_0, sigma_1, p_hat))

0.9875
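
Exponentiating a quadratic form and then taking its logarithm can overflow or underflow for points far from the class means. One way to avoid this is to compute the Gaussian log-density directly. The sketch below shows that approach on hypothetical parameters. `gaussian_log_pdf`, `gaussian_nb_predict`, and the `*_demo` values are illustrative names and numbers, not part of the original notebook.

```python
import numpy as np

def gaussian_log_pdf(X, mu, sigma):
    """Log-density of N(mu, sigma) at each row of X; returns shape (n,)."""
    diff = X - mu
    # Solve sigma @ z = diff.T instead of forming the inverse explicitly
    maha = np.sum(diff * np.linalg.solve(sigma, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(sigma)
    d = X.shape[1]
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

def gaussian_nb_predict(X, mus, sigmas, prior1):
    """Return 0/1 labels; mus and sigmas are indexed by class, prior1 estimates P(Y = 1)."""
    score0 = gaussian_log_pdf(X, mus[0], sigmas[0]) + np.log(1 - prior1)
    score1 = gaussian_log_pdf(X, mus[1], sigmas[1]) + np.log(prior1)
    return (score1 >= score0).astype(int)

# Hypothetical parameters and query points, chosen only for illustration
mus_demo = np.array([[0.0, 0.0], [10.0, 10.0]])
sigmas_demo = [np.eye(2), np.eye(2)]
X_demo = np.array([[0.5, -0.5], [9.0, 11.0], [100.0, 100.0]])
pred_demo = gaussian_nb_predict(X_demo, mus_demo, sigmas_demo, prior1=0.5)
print(pred_demo)
```

The last query point lies far from both means, where the plain density underflows to zero for both classes. The log-density stays finite, so the comparison remains meaningful.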

## Question 13

What will be the test accuracy?

Accuracy is defined as the proportion of correctly classified examples.  




In [209]:
# Enter your solution here
np.mean(y_test == predict(X_test, mu, sigma_0, sigma_1, p_hat))

0.9
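
As an optional cross-check of a hand-written classifier, scikit-learn's `GaussianNB` fits per-class means, per-feature variances, and class priors, and its `score` method returns the mean accuracy. The sketch below runs it on hypothetical, widely separated toy clusters rather than the notebook's data. The `*_demo` names are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical, widely separated toy clusters (not the notebook's data)
rng_demo = np.random.default_rng(0)
X_a = rng_demo.normal(loc=0.0, scale=1.0, size=(30, 2))
X_b = rng_demo.normal(loc=50.0, scale=1.0, size=(30, 2))
X_demo = np.vstack((X_a, X_b))
y_demo = np.concatenate((np.zeros(30, dtype=int), np.ones(30, dtype=int)))

clf = GaussianNB().fit(X_demo, y_demo)
acc_demo = clf.score(X_demo, y_demo)  # mean accuracy on the given data
print(acc_demo)
```

The same pattern could be applied as `GaussianNB().fit(X_train, y_train).score(X_test, y_test)`. `GaussianNB` adds a small smoothing term to the variances (its `var_smoothing` parameter), so its parameters may differ slightly from a plain maximum-likelihood fit.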