## Prepare Python environment


In [None]:
# Installs required packages
!apt install libgraphviz-dev
!pip install pomegranate matplotlib pygraphviz

# Press "Restart Runtime" after running this cell, before going to the rest of the code.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.model_selection import train_test_split

%matplotlib inline

In [None]:
random_state=5 # use this to control randomness across runs e.g., dataset partitioning

## Preparing the dataset (2 points)

We will use the diabetes dataset from the UCI Machine Learning Repository. Details of this data can be found [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database). The objective is to predict whether or not a female patient has diabetes based on the diagnostic measurements included in the dataset.

The dataset consists of several medical predictor variables (features) and one target variable indicating whether the person has diabetes. The predictor variables are the number of pregnancies the patient has had, glucose level, blood pressure, skin thickness, insulin level, BMI, diabetes pedigree function, and age.

### Loading the dataset

In [None]:
# Names of the columns in the dataset: all features of the data, followed by the label.
col_names = ['pregnancies', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

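# Download the dataset (if not already present) and load it.
# Note: header=1 makes pandas treat the file's second line as the header row
# (replaced here by col_names), so lines up to and including it are not loaded as data.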
import os
if not os.path.exists('diabetes.csv'): 
    !wget https://raw.githubusercontent.com/JHA-Lab/ece364/main/dataset/diabetes.csv 
diabetes_data = pd.read_csv("diabetes.csv", header=1, names=col_names)

FEATURE_NAMES=diabetes_data.drop('label',axis=1).columns
# Display the first five instances in the dataset
diabetes_data.head(5)

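| | pregnancies | glucose | bp | skin | insulin | bmi | pedigree | age | label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 1 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 2 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 3 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 4 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |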


#### Use the `describe` function to display some statistics of the data. See [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) for details about this function.

In [None]:
# Display some statistics of the data
diabetes_data.describe()

In [None]:
diabetes_data.info()

### Extract target and descriptive features (1 point)


In [None]:
#split dataset into features and target variable
X = # TODO 
y = # TODO

In [None]:
# Convert data to numpy array
X = # TODO
y = # TODO

### Create training and test datasets (1 point)

Split the data into training and test sets using `train_test_split`. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for details. To get consistent results across runs, set `random_state` to the value defined earlier. Use 80% of the data for training and 20% for testing.
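As a rough illustration of the 80/20 split described above, here is a minimal sketch on a small synthetic array. The names `X_demo` and `y_demo` are made up for this example and are not part of the assignment data; the value `random_state=5` mirrors the setting defined earlier in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical): 10 instances, 2 features, binary labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows for testing; fixing random_state
# makes the shuffle, and therefore the partition, reproducible across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5
)

print(X_tr.shape, X_te.shape)
```

With 10 instances, 8 end up in the training portion and 2 in the test portion. The same call pattern should carry over to the real `X` and `y` once they are defined.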

In [None]:
X_train,X_test,y_train,y_test = # TODO

## Training probability-based classifiers (18 points)


### Exercise 1: Learning a Naive Bayes Model (9 points)

#### We will use the `pomegranate` library to train a Naive Bayes Model. Review ch.6 and see [here](https://pomegranate.readthedocs.io/en/latest/NaiveBayes.html) for more details. 

In [None]:
from pomegranate.distributions import NormalDistribution, ExponentialDistribution, DiscreteDistribution 
from pomegranate.NaiveBayes import NaiveBayes
from pomegranate.BayesClassifier import BayesClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import KBinsDiscretizer
import math

np.random.seed(random_state)

#### Exercise 1a: Fit naive bayes model using a single distribution type (2 points)

#### Train one Naive Bayes model using a normal distribution per feature, and another using an exponential distribution per feature. Hint: use `NormalDistribution` or `ExponentialDistribution` with `NaiveBayes.from_samples()` to fit the model to the data.

#### Report the training and test set accuracies for each model. Hint: use accuracy_score()
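To make concrete what a Naive Bayes model with one normal distribution per feature computes, here is a from-scratch NumPy sketch on toy data. It is not the `pomegranate` API and is not meant as the solution to this exercise; the function names `fit_gaussian_nb` and `predict_gaussian_nb` and the toy clusters are assumptions made for illustration only.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate a class prior and per-feature normal parameters for each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "log_prior": np.log(len(Xc) / len(X)),
            "mean": Xc.mean(axis=0),
            # A small floor keeps every variance strictly positive.
            "var": Xc.var(axis=0) + 1e-9,
        }
    return params

def predict_gaussian_nb(params, X):
    """Pick the class maximizing log prior + sum of per-feature log likelihoods."""
    classes = sorted(params)
    scores = []
    for c in classes:
        p = params[c]
        log_lik = -0.5 * (np.log(2 * np.pi * p["var"])
                          + (X - p["mean"]) ** 2 / p["var"])
        # Summing over features is where the conditional independence
        # assumption of Naive Bayes enters.
        scores.append(p["log_prior"] + log_lik.sum(axis=1))
    return np.array(classes)[np.argmax(np.column_stack(scores), axis=1)]

# Toy data: two well-separated clusters, 50 points each.
rng = np.random.default_rng(5)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                   rng.normal(6.0, 1.0, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

model = fit_gaussian_nb(X_toy, y_toy)
pred = predict_gaussian_nb(model, X_toy)
acc = float(np.mean(pred == y_toy))
print(acc)
```

Swapping the normal density for an exponential one would change only the `log_lik` line and the fitted parameters, which is essentially the choice this exercise asks you to compare.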


In [None]:
# TO DO

#### Exercise 1b: Fit a naive bayes model using different feature distributions (3 points)

#### Visualize the feature distributions (done for you below) to determine which distribution (normal or exponential) better models a specific feature. 

#### Train a Naive Bayes classifier using this set of feature-specific distributions. Hint: use NormalDistribution or ExponentialDistribution and NaiveBayes.from_samples() to fit the model to the data.

#### Report the training and test set accuracies for the model. Hint: use accuracy_score()

In [None]:
# Visualization code: one histogram per feature, laid out on a grid

num_cols=4
num_rows=math.ceil(len(FEATURE_NAMES)/num_cols)
# squeeze=False keeps ax two-dimensional even when there is only one row
fig,ax=plt.subplots(num_rows,num_cols,squeeze=False)

for ft_index in np.arange(X_train.shape[1]):
    ax[ft_index//num_cols,ft_index%num_cols].hist(X_train[:,ft_index], color='blue')
    ax[ft_index//num_cols,ft_index%num_cols].set_title(FEATURE_NAMES[ft_index])

fig.tight_layout()

In [None]:
# TODO: train a classifier

#### Comment on any performance difference between this model and the models trained in Ex. 1a. (1 point)

TO DO

#### Exercise 1c: Fit a naive bayes model on categorical features (2 points)

#### Besides fitting a naive bayes model on the continuous features, one can fit a naive bayes model on categorical features derived from binning the continuous features, and then compute a probability mass function for each categorical feature.

#### Bin the features using two strategies: equal-width binning (`strategy="uniform"`) and equal-frequency binning (`strategy="quantile"`). For each strategy, vary the number of bins among {3, 10, 50}. Hint: use [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer.get_params) with the `n_bins` and `strategy` parameters, and set `encode="ordinal"` to map each bin to an integer category.

#### For each binning setting tried above, fit a naive bayes model on the binned version of the training set. Hint: use DiscreteDistribution to model the categorical features and NaiveBayes.from_samples() to fit the model to the data.

#### Report the training and test set accuracy for each model trained and evaluated on binned versions of the training and test sets respectively. 

**Note** There may be some variability in the actual performance scores, but the overall trends should remain consistent.
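For intuition about how the two binning strategies differ, the sketch below computes bin edges by hand with NumPy on a small skewed sample. It is an illustration of the idea rather than a replacement for `KBinsDiscretizer`; the helper names `bin_edges` and `to_bins` and the sample values are invented for this example.

```python
import numpy as np

def bin_edges(values, n_bins, strategy):
    """Return n_bins + 1 edges: 'uniform' = equal width, 'quantile' = equal frequency."""
    if strategy == "uniform":
        return np.linspace(values.min(), values.max(), n_bins + 1)
    if strategy == "quantile":
        return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))
    raise ValueError(f"unknown strategy: {strategy}")

def to_bins(values, edges):
    """Map each value to an ordinal bin index in [0, n_bins - 1]."""
    # Only interior edges are used, so the minimum and maximum values
    # fall into the first and last bin respectively.
    return np.digitize(values, edges[1:-1], right=False)

# A skewed toy feature: most values are small, two are very large.
skewed = np.array([0, 1, 1, 2, 2, 3, 4, 4, 50, 100], dtype=float)

counts_by_strategy = {}
for strategy in ("uniform", "quantile"):
    edges = bin_edges(skewed, 3, strategy)
    counts_by_strategy[strategy] = np.bincount(to_bins(skewed, edges), minlength=3)
    print(strategy, counts_by_strategy[strategy])
```

On skewed data like this, equal-width binning crowds most instances into the first bin, while equal-frequency binning spreads them roughly evenly. That difference is worth keeping in mind when comparing the accuracies of the two strategies.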

In [None]:
# TODO

#### Briefly explain any performance difference between equal-width and equal-frequency binning. Also comment on the effect of increasing the number of bins (see ch.3). (1 point)

TO DO

### Exercise 2: Learning a Bayes Net (9 points)

#### We will use the `pomegranate` library to train a Bayes Net to assess whether relaxing the Naive Bayes assumption (i.e., that all features are independent given the target feature) could improve the classification model. Review ch.6 and see [here](https://pomegranate.readthedocs.io/en/latest/BayesianNetwork.html) for more details. 

#### Exercise 2a: Create a categorical version of the dataset (1 point)

#### Create categorical versions of the training and test sets by using equal-frequency binning with the number of bins set to 3 (as in Ex. 1c).

#### <u>Use these datasets for training and evaluating the bayes net models in the following exercises.</u> 

**Note** This is done because pomegranate currently supports Bayes nets only over categorical features.

In [None]:
# TODO

#### Exercise 2b: Construct a Bayes net (3 points)

#### Construct and train a Bayes net in which the pregnancy (feature) node is a parent of the diabetes (feature) node (only these 2 nodes should be in the net). Use construct_and_train_bayes_net (defined below) by passing in the binned training dataset and specifying the index of the parent feature node.

#### Construct and train another Bayes net in which the glucose (feature) node is a parent of the diabetes (feature) node (only these 2 nodes should be in the net). Use construct_and_train_bayes_net (defined below) by passing in the binned training dataset and specifying the index of the parent feature node.

#### Report the training and test accuracies of each Bayes Net. Use get_performance (defined below) by passing in the trained bayes net, binned datasets, and specifying the index of the parent feature node.
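For intuition about what a two-node net (one parent feature pointing to the diabetes node) learns, the sketch below estimates the conditional probability table P(label | parent bin) by counting, then predicts the most probable label for each bin. It uses plain Python on toy data, not `pomegranate` or the helper functions defined below; `fit_cpt`, `predict_from_cpt`, and the toy arrays are hypothetical names used only here.

```python
from collections import Counter, defaultdict

def fit_cpt(parent_bins, labels):
    """Estimate P(label | parent_bin) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for b, lab in zip(parent_bins, labels):
        counts[b][lab] += 1
    return {
        b: {lab: n / sum(c.values()) for lab, n in c.items()}
        for b, c in counts.items()
    }

def predict_from_cpt(cpt, parent_bins):
    """Return the most probable label for each parent bin.

    Assumes every bin was seen during fitting; an unseen bin raises KeyError.
    """
    return [max(cpt[b], key=cpt[b].get) for b in parent_bins]

# Toy binned feature (3 bins) and binary labels.
bins_toy = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
labels_toy = [0, 0, 1, 0, 1, 1, 1, 1, 1, 0]

cpt = fit_cpt(bins_toy, labels_toy)
preds = predict_from_cpt(cpt, bins_toy)
accuracy = sum(p == t for p, t in zip(preds, labels_toy)) / len(labels_toy)
print(cpt[2], accuracy)
```

A parent feature is informative to the extent that its bins have clearly different label distributions; if every row of the table looks alike, the net can do no better than predicting the majority class.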

In [None]:
from pomegranate import *

"""
X_train_binned: ndarray (# instances, # features) This is the binned version of the training set
y_train: 1darray (# instances,)
ind_chosen_parent_features: 1d numpy array encodes the indices of the features relative to FEATURE_NAMES. 
                            These indices correspond to features that are parent nodes of the diabetes node. 
ind_chosen_child_features: 1d numpy array encodes the indices of the features relative to FEATURE_NAMES. 
                            These indices correspond to features that are children nodes of the diabetes node.
                            
Returns a BayesianNetwork representing the trained bayes net
"""
def construct_and_train_bayes_net(X_train_binned,
                                  y_train,
                                  ind_chosen_parent_features=np.array([]), 
                                  ind_chosen_child_features=np.array([]),
                                ):
    # parent nodes of diabetes

    dist_by_parent_feature=[]
    state_by_parent_feature=[]
    if len(ind_chosen_parent_features)>0:
        parent_feature_names_chosen=FEATURE_NAMES[ind_chosen_parent_features]

        for ft_index in ind_chosen_parent_features:
            ft_dist=DiscreteDistribution.from_samples(X_train_binned[:,ft_index])
            dist_by_parent_feature.append(ft_dist)
            state_by_parent_feature.append(State(ft_dist, str(FEATURE_NAMES[ft_index])))
        dist_by_parent_feature=np.array(dist_by_parent_feature)
        state_by_parent_feature=np.array(state_by_parent_feature)


    # diabetes node
    if len(ind_chosen_parent_features)>0:
        X_train_parent_features_binned_with_labels=np.concatenate((X_train_binned[:,ind_chosen_parent_features],
                                                                   np.expand_dims(y_train,axis=1)),axis=1)
        diabetes_dist=ConditionalProbabilityTable.from_samples(X_train_parent_features_binned_with_labels)
        # temporary workaround to properly initialize the distribution
        diabetes_dist=ConditionalProbabilityTable(diabetes_dist.parameters[0],dist_by_parent_feature.tolist())
    else:
        diabetes_dist=DiscreteDistribution.from_samples(y_train)
    diabetes_state=State(diabetes_dist, "diabetes")

    # child nodes of diabetes

    dist_by_child_feature=[]
    state_by_child_feature=[]    
    if len(ind_chosen_child_features)>0:
        child_feature_names_chosen=FEATURE_NAMES[ind_chosen_child_features]

        for ft_index in ind_chosen_child_features:
            X_train_child_features_binned_with_labels=np.concatenate((np.expand_dims(y_train,axis=1),
                                                                        np.expand_dims(X_train_binned[:,ft_index],axis=1)),
                                                                     axis=1)
            ft_dist=ConditionalProbabilityTable.from_samples(X_train_child_features_binned_with_labels)
            ft_dist=ConditionalProbabilityTable(ft_dist.parameters[0],[diabetes_dist])
            dist_by_child_feature.append(ft_dist)
            state_by_child_feature.append(State(ft_dist, str(FEATURE_NAMES[ft_index])))
        dist_by_child_feature=np.array(dist_by_child_feature)
        state_by_child_feature=np.array(state_by_child_feature)


    pom_model = BayesianNetwork()
    pom_model.add_states(*list(state_by_parent_feature))
    pom_model.add_states(diabetes_state)
    pom_model.add_states(*list(state_by_child_feature))

    for parent_index in np.arange(len(ind_chosen_parent_features)):
        pom_model.add_edge(state_by_parent_feature[parent_index],diabetes_state)

    for child_index in np.arange(len(ind_chosen_child_features)):
        pom_model.add_edge(diabetes_state, state_by_child_feature[child_index])

    pom_model.bake()

    return pom_model


"""
pom_model: BayesianNetwork represents the trained bayes net model
X_train_binned: ndarray (# instances, # features) This is the binned training set
y_train: 1darray (# instances,)
X_test_binned: ndarray (# instances, # features) This is the binned test set
y_test: 1darray (# instances,)
ind_chosen_parent_features: 1d numpy array encodes the indices of the features relative to FEATURE_NAMES. 
                            These indices correspond to features that are parent nodes of the diabetes node. 
ind_chosen_child_features: 1d numpy array encodes the indices of the features relative to FEATURE_NAMES. 
                            These indices correspond to features that are children nodes of the diabetes node.
                            
Returns the training and test set accuracies attained by the bayes net model (pom_model)
"""
def get_performance(pom_model, X_train_binned, y_train, X_test_binned, y_test, 
                    ind_chosen_parent_features=np.array([]), ind_chosen_child_features=np.array([])):
    nones_array=np.expand_dims(np.array([None]*len(X_train_binned)),axis=1)
    ind_diabetes_node=len(ind_chosen_parent_features)
    if len(ind_chosen_parent_features)>0:
        X_train_binned_with_none=X_train_binned[:,ind_chosen_parent_features]
        X_train_binned_with_none=np.concatenate((X_train_binned_with_none,nones_array),axis=1)
    else:
        X_train_binned_with_none=nones_array

    if len(ind_chosen_child_features)>0:
        X_train_binned_with_none=np.concatenate((X_train_binned_with_none,
                                                X_train_binned[:,ind_chosen_child_features]),
                                               axis=1)
    pred_labels=np.array(pom_model.predict(X_train_binned_with_none),dtype='int64')[:,ind_diabetes_node]
    train_acc=accuracy_score(y_train, pred_labels)

    nones_array=np.expand_dims(np.array([None]*len(X_test_binned)),axis=1)
    if len(ind_chosen_parent_features)>0:
        X_test_binned_with_none=X_test_binned[:,ind_chosen_parent_features]
        X_test_binned_with_none=np.concatenate((X_test_binned_with_none,nones_array),axis=1)
    else:
        X_test_binned_with_none=nones_array

    if len(ind_chosen_child_features)>0:
        X_test_binned_with_none=np.concatenate((X_test_binned_with_none,
                                               X_test_binned[:,ind_chosen_child_features]),
                                               axis=1)
    pred_labels=np.array(pom_model.predict(X_test_binned_with_none),dtype='int64')[:,ind_diabetes_node]
    test_acc=accuracy_score(y_test, pred_labels)
    
    return train_acc, test_acc

    

In [None]:
# TODO

#### Comment on which feature seems more informative for predicting the presence of diabetes. (1 point)

TO DO

#### Exercise 2c: Construct a Bayes net with parent and children nodes (3 points)

#### Here, we'll implement a Bayes net with a structure similar to the one laid out in this [paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6470852).

#### Construct and train a Bayes net in which:
#### - the following features are all parents of the diabetes feature node: pregnancies, skin, bmi, pedigree, age
#### - the following features are all children of the diabetes feature node: glucose, bp, insulin
#### Use construct_and_train_bayes_net by passing in the binned training dataset and specifying the indices of the parent feature nodes and indices of the children feature nodes.

#### Report the training and test accuracy of the Bayes Net using get_performance by passing in the trained bayes net, binned datasets, and indices of the parent and children feature nodes.
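Since both helper functions take feature indices rather than names, one way to avoid off-by-one mistakes is to derive the indices from the names. The sketch below assumes the feature order matches `FEATURE_NAMES` as defined when the dataset was loaded; the variable names are illustrative.

```python
import numpy as np

# Feature order assumed to match FEATURE_NAMES in this notebook.
feature_names = ['pregnancies', 'glucose', 'bp', 'skin',
                 'insulin', 'bmi', 'pedigree', 'age']

parent_names = ['pregnancies', 'skin', 'bmi', 'pedigree', 'age']
child_names = ['glucose', 'bp', 'insulin']

# Look up each name's position; list.index raises ValueError on a typo,
# which surfaces naming mistakes early.
ind_parents = np.array([feature_names.index(n) for n in parent_names])
ind_children = np.array([feature_names.index(n) for n in child_names])

print(ind_parents, ind_children)
```

The resulting arrays could then be passed as `ind_chosen_parent_features` and `ind_chosen_child_features`.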

In [None]:
# TODO

#### Compare the performance of this Bayes net against the Bayes nets from Ex. 2b. (1 point)

TO DO