# Naive Bayes

Make sure you have downloaded:
- heart_processed.csv

This assignment asks you to implement naive Bayes using a custom likelihood and then compare it against sklearn's Gaussian naive Bayes. 
The implementation differs slightly from lecture and section. 
- It is streamlined to take advantage of vector multiplications and numpy functions, which helps if we want to scale up our naive Bayes prediction to higher dimensions. 
- However, you may need to familiarize yourself with the "dictionary" data structure.


## 0 Data
Load `heart_processed.csv` from the [Heart Failure Clinical Records Dataset](https://archive.ics.uci.edu/ml/datasets/Heart%2Bfailure%2Bclinical%2Brecords). It contains several log-scaled predictors for predicting the event of death, `DEATH_EVENT`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



In [2]:
dataset = pd.read_csv("heart_processed_log.csv", index_col=0)
X = dataset.drop("DEATH_EVENT", axis=1).values
y = dataset["DEATH_EVENT"].values

# split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# print the shapes of the training and testing sets
print('train shapes:')
print('\t X_train ->', X_train.shape)
print('\t y_train ->', y_train.shape)

print('test shapes:')
print('\t X_test ->', X_test.shape)
print('\t y_test ->', y_test.shape)

display(dataset)

train shapes:
	 X_train -> (209, 6)
	 y_train -> (209,)
test shapes:
	 X_test -> (90, 6)
	 y_test -> (90,)


| | age | creatinine_phosphokinase | ejection_fraction | platelets | serum_creatinine | serum_sodium | DEATH_EVENT |
|---|---|---|---|---|---|---|---|
| 0 | 4.317488 | 6.366470 | 2.995732 | 12.487485 | 0.641854 | 4.867534 | 1 |
| 1 | 4.007333 | 8.969669 | 3.637586 | 12.481270 | 0.095310 | 4.912655 | 1 |
| 2 | 4.174387 | 4.983607 | 2.995732 | 11.995352 | 0.262364 | 4.859812 | 1 |
| 3 | 3.912023 | 4.709530 | 2.995732 | 12.254863 | 0.641854 | 4.919981 | 1 |
| 4 | 4.174387 | 5.075174 | 2.995732 | 12.697715 | 0.993252 | 4.753590 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 294 | 4.127134 | 4.110874 | 3.637586 | 11.951180 | 0.095310 | 4.962845 | 0 |
| 295 | 4.007333 | 7.506592 | 3.637586 | 12.506177 | 0.182322 | 4.934474 | 0 |
| 296 | 3.806662 | 7.630461 | 4.094345 | 13.517105 | -0.223144 | 4.927254 | 0 |
| 297 | 3.806662 | 7.788626 | 3.637586 | 11.849398 | 0.336472 | 4.941642 | 0 |


Recall: naive Bayes chooses the class $C_k$ that maximizes the posterior
$$
P(C_k \lvert\,\boldsymbol{x}) = \frac{\pi(C_k)\,{\cal{}L}_{\!\boldsymbol{x}}(C_k)}{Z}.
$$
The normalizing constant $Z$ does not depend on $k$, so it suffices to maximize the numerator. We also assume that all $d$ features $x_i$ are independent given the class (the "naive" assumption). So we want to find the $k$ that attains
$$
\arg\max_k \, \pi(C_k)\,{\cal{}L}_{\!\boldsymbol{x}}(C_k) \quad = \quad \arg\max_k \, \left( \pi(C_k)\,\prod_{i=1}^d p(x_i \lvert C_k) \right).
$$
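As a small illustration of this decision rule, the sketch below uses made-up priors and per-feature likelihood values (hypothetical numbers, not taken from the heart dataset) for two classes and three features, and picks the class with the largest numerator.

```python
import numpy as np

# Hypothetical numbers for illustration only (not from the heart dataset).
prior = np.array([0.7, 0.3])              # pi(C_0), pi(C_1)

# p(x_i | C_k) for a single data point x with d = 3 features; row k is class k.
feature_likelihoods = np.array([
    [0.5, 0.2, 0.4],   # class 0
    [0.6, 0.5, 0.5],   # class 1
])

# Naive assumption: the joint likelihood is the product over features.
likelihood = feature_likelihoods.prod(axis=1)   # shape (2,)
unnormalized = prior * likelihood               # numerator of Bayes' rule

# Dividing by Z rescales both entries equally, so the argmax is unchanged.
posterior = unnormalized / unnormalized.sum()
k_hat = int(np.argmax(unnormalized))
print("predicted class:", k_hat)
print("posterior:", posterior)
```

Here class 1 wins even though its prior is smaller, because its likelihood (0.15) is large enough relative to that of class 0 (0.04) to outweigh the prior.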

## 1 Custom Naive Bayes Classifier with KDE
You will create a naive Bayes classifier:
- using the training data
- with KDE to approximate the likelihood
- with a Bernoulli prior

**Use only the training data ```X_train, y_train``` to fit the naive Bayes classifier.**

### 1.1 Prior
1. [2 pt] Compute ```prior```, a two element array. 
    - prior[0] is the probability of death event 0, $\pi(C_0)$
    - prior[1] is the probability of death event 1, $\pi(C_1)$ 
    - You should construct the prior probabilities based on frequency of death events from the training data. 
    - Tip: Use np.unique() with return_counts.
2. [1 pt] Print ```prior```.

In [3]:
num = np.unique(y_train, return_counts=True)[1]
prior = num / len(y_train) #TODO

print('The prior probabilities are:', prior)

The prior probabilities are: [0.67464115 0.32535885]


### 1.2 Likelihood (KDE)
1. [2 pt] Define dictionaries `kde0` and `kde1` which fulfill the following:
    - `kde0[i]` is the KDE object (created by calling `scipy.stats.gaussian_kde`) for feature i when the death event is 0. `kde1[i]` is defined likewise.
    - Make sure you index the correct rows of `X_train` when defining the KDEs.
    - Use bandwidth method 'scott'. (For fun, you can try 'silverman' and see what difference in result you get.)
    - As with all arrays you throw into sklearn or scipy, you may need to take transposes.
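To get a feel for `gaussian_kde` before building the dictionaries, here is a minimal sketch on a synthetic 1-D sample (the sample, its size, and the query grid are assumptions for illustration, not the heart data). It compares the 'scott' and 'silverman' bandwidth factors and shows that `.pdf()` returns one density value per query point.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic 1-D feature for illustration (not real data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=4.0, scale=0.2, size=50)

# A 1-D array is treated as one feature with 50 observations.
kde_scott = gaussian_kde(sample, bw_method='scott')
kde_silverman = gaussian_kde(sample, bw_method='silverman')

# .factor is the bandwidth multiplier that scales the sample standard deviation.
print("scott factor:    ", kde_scott.factor)
print("silverman factor:", kde_silverman.factor)

# .pdf accepts an array of query points and returns one density per point.
grid = np.linspace(3.0, 5.0, 5)
densities = kde_scott.pdf(grid)
print(densities.shape)
```

For a single feature, the two rules differ only by a constant in the factor, so the resulting densities are usually similar; the difference matters more for small samples.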

In [4]:
from scipy.stats import gaussian_kde
kde0 = {} 
kde1 = {} 

for i in range(X_train.shape[1]):
    X_train_death0 = X_train[y_train == 0][:, i]
    X_train_death1 = X_train[y_train == 1][:, i]

    # Compute KDE for death event 0
    kde0[i] = gaussian_kde(X_train_death0, bw_method='scott')

    # Compute KDE for death event 1
    kde1[i] = gaussian_kde(X_train_death1, bw_method='scott')

display(kde0) # Use this to check what you made for death event 0
display(kde1) # Use this to check what you made for death event 1


{0: <scipy.stats._kde.gaussian_kde at 0x2a241c5fe90>,
 1: <scipy.stats._kde.gaussian_kde at 0x2a240033590>,
 2: <scipy.stats._kde.gaussian_kde at 0x2a210b78a50>,
 3: <scipy.stats._kde.gaussian_kde at 0x2a241c5f3d0>,
 4: <scipy.stats._kde.gaussian_kde at 0x2a241ec0410>,
 5: <scipy.stats._kde.gaussian_kde at 0x2a241ec2790>}

{0: <scipy.stats._kde.gaussian_kde at 0x2a2406b8490>,
 1: <scipy.stats._kde.gaussian_kde at 0x2a229ae5350>,
 2: <scipy.stats._kde.gaussian_kde at 0x2a240519bd0>,
 3: <scipy.stats._kde.gaussian_kde at 0x2a241ec0850>,
 4: <scipy.stats._kde.gaussian_kde at 0x2a241ec2090>,
 5: <scipy.stats._kde.gaussian_kde at 0x2a241ec2550>}

2. [2 pt] Complete the code for ```compute_likelihood``` function.
    - The objects kde0[i] and kde1[i] have a method .pdf(), which you will use when computing the likelihood.
        - Read the documentation to understand how it works.
    - `likelihood0[j]` is the likelihood of seeing the $j$-th data point $\boldsymbol{x_j} = \left(\boldsymbol{x_j}_1, \dots, \boldsymbol{x_j}_d\right)$ for death event 0, i.e., ${\cal{}L}_{\!\boldsymbol{x_j}}(C_0) = \prod_{i=1}^d p(\boldsymbol{x_j}_i \lvert C_0)$
    - `likelihood1[j]` defined likewise.
    - You can loop over the kde objects kde[i] to populate the likelihood arrays.

(Your solution shouldn't be very complicated. A working solution needs only about 5-10 lines of code.)

In [5]:
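def compute_likelihood(x, kde0, kde1):
    # input:    x, a (# data) by (# features) array
    #           kde0 and kde1, kde dictionaries keyed by feature index
    # output:   likelihood, a (# data) by (# classes) array

    n_samples, n_features = x.shape
    likelihood0 = np.ones(n_samples)
    likelihood1 = np.ones(n_samples)

    # Naive assumption: multiply the per-feature densities p(x_i | C_k)
    for i in range(n_features):
        likelihood0 *= kde0[i].pdf(x[:, i])
        likelihood1 *= kde1[i].pdf(x[:, i])

    # Column 0 holds the class-0 likelihood, column 1 the class-1 likelihood
    likelihood = np.vstack((likelihood0, likelihood1)).T
    return likelihood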


### 1.3 Posterior
1. [2 pt] Complete the code for ```compute_posterior``` function. 
    - It should include calling the function ```compute_likelihood```.

In [6]:
def compute_posterior(x, prior, kde0, kde1):
    # input:    x, a (# data) by (# features) array of test data
    #           prior, a 1 by 2 array
    #           kde0 and kde1, kde dictionaries that will be used to compute the likelihood
    # output:   posterior, a (# data) by (# classes) array

    likelihood = compute_likelihood(x, kde0, kde1)

    # Calculate posterior probabilities
    posterior0 = prior[0] * likelihood[:, 0]
    posterior1 = prior[1] * likelihood[:, 1]

    # Normalize to get valid probabilities
    total_posterior = posterior0 + posterior1
    posterior = np.vstack((posterior0 / total_posterior, posterior1 / total_posterior)).T

   
    return posterior
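With many features, a product of small densities can underflow to zero. One common workaround, sketched below, is to work in log space: sum the log-densities, add the log prior, and normalize with `logsumexp`. This is only a sketch and not part of the assignment; the function name `compute_log_posterior` and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import gaussian_kde

def compute_log_posterior(x, prior, kde0, kde1):
    # Hypothetical log-space variant of compute_posterior (not part of the assignment).
    # Start every row from the log prior: log pi(C_0), log pi(C_1).
    log_joint = np.tile(np.log(prior), (x.shape[0], 1))
    for i in range(x.shape[1]):
        log_joint[:, 0] += kde0[i].logpdf(x[:, i])
        log_joint[:, 1] += kde1[i].logpdf(x[:, i])
    # Subtract log Z (computed stably) so each row of exp(result) sums to 1.
    return log_joint - logsumexp(log_joint, axis=1, keepdims=True)

# Usage on synthetic two-feature data (assumed for illustration).
rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, size=(40, 2))   # class 0 samples
x1 = rng.normal(2.0, 1.0, size=(20, 2))   # class 1 samples
toy_kde0 = {i: gaussian_kde(x0[:, i], bw_method='scott') for i in range(2)}
toy_kde1 = {i: gaussian_kde(x1[:, i], bw_method='scott') for i in range(2)}
toy_prior = np.array([40, 20]) / 60

x_new = np.array([[0.0, 0.1], [2.1, 1.9]])
log_post = compute_log_posterior(x_new, toy_prior, toy_kde0, toy_kde1)
toy_pred = np.argmax(log_post, axis=1)
print(np.exp(log_post))
print(toy_pred)
```

Because the logarithm is monotonic, taking the argmax of the log posterior selects the same class as taking the argmax of the posterior itself.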

### 1.4 Combine prior, likelihood, posterior
Now, we are ready to piece together all the code we prepared above.
1. [2 pt] Complete the code for ```naive_bayes_predict```.
    - Your code should include calling the ```compute_posterior``` function.
    - Computing y_pred should be a simple one line of code. You may consider using numpy functions that find the index of the largest entry on every row.
2. [1 pt] Complete the code for ```print_success_rates```.

In [7]:

def naive_bayes_predict(x, prior, kde0, kde1):
    # input:    x, a (# data) by (# features) array
    #           prior, a 1 by 2 array
    #           kde0 and kde1, kde dictionaries that will be used to compute the likelihood
    # output:   y_pred, an array of length (# data)
    #           y_pred[j] is the predicted class for the j-th data point in x
    posterior = compute_posterior(x, prior, kde0, kde1)

    # Choose the class with the highest posterior probability
    y_pred = np.argmax(posterior, axis=1)
    return y_pred


def print_success_rates(y_true,y_pred):
    n_success =  np.sum(y_true == y_pred)
    n_total = len(y_true)

    print("Number of correctly labeled points: %d of %d.  Accuracy: %.2f"  % (n_success, n_total, n_success/n_total))




### 1.5 Predict
1. [1 pt] Use your custom naive Bayes to:
    - predict *TRAINING* data
    - print the results with ```print_success_rates```

In [8]:
# TODO predict training data and print
y_pred = naive_bayes_predict(X_train, prior, kde0, kde1)
print_success_rates(y_train, y_pred)

Number of correctly labeled points: 171 of 209.  Accuracy: 0.82


2. [1 pt] Use your custom naive Bayes to:
    - predict *TEST* data
    - print the results with ```print_success_rates```

In [9]:
# TODO predict test data and print
y_pred = naive_bayes_predict(X_test, prior, kde0, kde1)
print_success_rates(y_test, y_pred)

Number of correctly labeled points: 67 of 90.  Accuracy: 0.74


## 2 sklearn Gaussian naive Bayes
Let's compare our custom naive Bayes with KDE to the sklearn Gaussian naive Bayes.

### 2.1 Train
1. [1 pt] Fit ```gnb``` using training data.

In [10]:
# run sklearn's version - read up on differences if interested
from sklearn.naive_bayes import GaussianNB

# fit Gaussian naive Bayes on the training data
gnb = GaussianNB()
gnb.fit(X_train, y_train)


### 2.2 Predict
1. [1 pt] Use sklearn naive Bayes to:
    - predict *TRAINING* data
    - print the results with ```print_success_rates```

In [11]:
# TODO predict training data and print
gnb = GaussianNB()
y_pred_train = gnb.fit(X_train, y_train).predict(X_train)
print_success_rates(y_train, y_pred_train)

Number of correctly labeled points: 160 of 209.  Accuracy: 0.77


2. [1 pt] Use sklearn naive Bayes to:
    - predict *TEST* data
    - print the results with ```print_success_rates```

In [12]:
# TODO predict test data and print
y_pred_test = gnb.predict(X_test)  # gnb was already fit on the training data above
print_success_rates(y_test, y_pred_test)

Number of correctly labeled points: 68 of 90.  Accuracy: 0.76


## 3 Discussion
### 3.1 random_state = 0
Use random_state=0 and respond to the following questions.

[2 pt] For **custom NB**, what is the difference between the training and test accuracy? Give an explanation for why it might be so.
    
**Ans:** The training accuracy (0.82) is about 8 percentage points higher than the test accuracy (0.74). Our custom NB learns the patterns and trends in the data from the training set, so when it is evaluated on that same data it can predict the occurrence of death from the features comparatively easily. The test data was not seen during fitting, so the learned patterns carry over less well and the accuracy is lower.


### 3.2 change random_state
Now, experiment with a range of random_state and respond to the following question.

[2 pt] Do your responses to 3.1 change? If so, describe how your responses change and why you changed them.
- (You do not need to artificially adjust your response to 3.1 to fit any new findings you made after changing random_state)

**Ans:** If I change the random state to 2, the testing accuracy decreases, because a different seed puts different rows into the training and test sets. The random state is just a seed that makes the split reproducible; a larger value does not mean more shuffling. For the other values I tried (e.g. changing the random state from 2 to 4 or 8), the trend between the training and testing accuracy remains the same, and the same thing occurs if I continue to increase the random state value.
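One way to make this experiment systematic is to loop over several seeds and record the gap between training and test accuracy. The sketch below does this for sklearn's `GaussianNB` on a synthetic dataset from `make_classification` (an assumed stand-in with the same shape as the heart data, 299 rows and 6 features), so the printed numbers are illustrative only and say nothing about the heart dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the heart data (assumption for illustration).
X_toy, y_toy = make_classification(
    n_samples=299, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

gaps = []
for seed in range(10):
    # random_state only seeds the shuffle; each seed gives a different split.
    Xtr, Xte, ytr, yte = train_test_split(
        X_toy, y_toy, test_size=0.3, random_state=seed
    )
    model = GaussianNB().fit(Xtr, ytr)
    train_acc = model.score(Xtr, ytr)
    test_acc = model.score(Xte, yte)
    gaps.append(train_acc - test_acc)
    print(f"seed={seed}: train={train_acc:.2f}, test={test_acc:.2f}")

print("mean gap:", np.mean(gaps), "std of gap:", np.std(gaps))
```

Looking at the mean and spread of the gap over many seeds gives a steadier picture than comparing two or three individual splits.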

### 3.3 Choice of model
[2 pt] Compare the **test** accuracy results for **custom NB and sklearn GNB**. Which model would you choose to use, and why? 


**Ans:** With random_state=0, the test accuracies are close: 0.74 for my custom NB (67 of 90) versus 0.76 for Gaussian NB (68 of 90), a difference of a single test point. That gap is too small to separate the two models, so I would choose custom NB: its KDE likelihood does not assume each feature is Gaussian, it can still learn if more data is provided, and I could add more features for it to learn from as well.


