# Decision Making under Uncertainty - P1

Group members
- Mohammad Beigi
- Sagar Parekh

### Goal

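We model the heart-disease dataset (`heart.csv`) with a Bayesian network (BN) and use it to predict the `HeartDisease` label from the remaining attributes. We consider two approaches: learning only the parameters of a given BN structure, and learning the structure itself from the data.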

## Import Necessary Packages and Load the Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pgmpy.estimators import BDeuScore, BicScore
from pgmpy.estimators import HillClimbSearch, ExpectationMaximization, MaximumLikelihoodEstimator, BayesianEstimator
from pgmpy.models import BayesianNetwork
from sklearn.model_selection import train_test_split
import graphviz
# from pygobnilp.gobnilp import Gobnilp


In [2]:
raw_data = pd.read_csv('heart.csv')
raw_data.head()


Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


## Analyzing the Data

## Processing the Data

Discretize the continuous columns. Each tier is a half-open interval: the upper bound is inclusive and the lower bound is not.

- `Age` varies between 28 and 77. We divide it into 3 fixed tiers:
  - tier A: 27 to 43
  - tier B: 43 to 59
  - tier C: 59 to 77
- `RestingBP` is discretized into 2 tiers:
  - tier A: -1 to 100
  - tier B: 100 to 200
- `Cholesterol` varies between 0 and 603. We divide it into 6 fixed tiers:
  - tier A: -1 to 100
  - tier B: 100 to 200
  - tier C: 200 to 300
  - tier D: 300 to 400
  - tier E: 400 to 500
  - tier F: 500 to 603
- `MaxHR` varies between 60 and 202. We divide it into 3 fixed tiers:
  - tier A: 59 to 120
  - tier B: 120 to 170
  - tier C: 170 to 202
- `Oldpeak` varies between -2.6 and 6.2. We divide it into 4 tiers:
  - tier A: -2.7 to 0
  - tier B: 0 to 2.1
  - tier C: 2.1 to 4.2
  - tier D: 4.2 to 6.2
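
The half-open tiers described above map directly onto `pd.cut` with `right=True`. The following is a minimal sketch on made-up ages, not the notebook's own implementation; `age_edges` and `sample_ages` are illustrative names.

```python
import pandas as pd

# Bin edges for Age taken from the tiers listed above: (27, 43], (43, 59], (59, 77].
age_edges = [27, 43, 59, 77]
sample_ages = pd.Series([28, 43, 44, 59, 77])  # made-up values for illustration

# right=True makes each interval include its upper edge and exclude its lower edge.
age_tiers = pd.cut(sample_ages, bins=age_edges, labels=['A', 'B', 'C'], right=True)
print(age_tiers.tolist())  # ['A', 'A', 'B', 'B', 'C']
```

Values that fall outside every bin come back as `NaN`, which makes gaps in the tier definitions easy to spot.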


In [3]:
boundaries = {
             'Age': np.array([[27, 43],
                              [43, 59],
                              [59, 77]]),
             'RestingBP': np.array([[-1, 100],
                                    [100, 200]]),
             'Cholesterol': np.array([[-1, 100],
                                      [100, 200],
                                      [200, 300],
                                      [300, 400],
                                      [400, 500],
                                      [500, 603]]),
             'MaxHR': np.array([[59, 120],
                                [120, 170],
                                [170, 202]]),
             'Oldpeak': np.array([[-2.7, 0],
                                  [0, 2.1],
                                  [2.1, 4.2],
                                  [4.2, 6.2]])
             }

def discretize(series, boundaries):
    """Return a new Series of tier labels; each tier is the interval (low, high]."""
    tiers = ['A', 'B', 'C', 'D', 'E', 'F']
    # Work on an object-dtype copy so that raw_data is left untouched.
    result = series.astype(object)

    for i, (low, high) in enumerate(boundaries):
        result[(series > low) & (series <= high)] = tiers[i]

    return result


new_columns = {}
for label, series in raw_data.items():
    if label in boundaries:
        new_columns[label] = discretize(series, boundaries[label])
    else:
        new_columns[label] = series

data = pd.DataFrame(new_columns)
data.head()


Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,A,M,ATA,B,C,0,Normal,C,N,A,Up,0
1,B,F,NAP,B,B,0,Normal,B,N,B,Flat,1
2,A,M,ATA,B,C,0,ST,A,N,A,Up,0
3,B,F,ASY,B,C,0,Normal,A,Y,B,Flat,1
4,B,M,NAP,B,B,0,Normal,B,N,A,Up,0
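
Because a value that falls outside every tier is left as a number, it can help to confirm that each discretized column holds only tier labels. A small sketch, assuming a hypothetical helper named `unbinned_values` and a toy frame standing in for `data`:

```python
import pandas as pd

def unbinned_values(df, columns, labels=('A', 'B', 'C', 'D', 'E', 'F')):
    """Map each column to the values that are not tier labels (hypothetical helper)."""
    leftovers = {}
    for col in columns:
        stray = df.loc[~df[col].isin(list(labels)), col]
        if not stray.empty:
            leftovers[col] = stray.tolist()
    return leftovers

# Toy frame: 603 was never replaced by a tier label, so the column mixes str and int.
toy = pd.DataFrame({'Age': ['A', 'B', 'C'], 'Cholesterol': ['C', 603, 'B']})
print(unbinned_values(toy, ['Age', 'Cholesterol']))  # {'Cholesterol': [603]}
```

An empty dictionary means every listed column was fully discretized.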


Split the dataset into train and test data

In [4]:
train, test = train_test_split(data, test_size=0.2)
true_labels = test.HeartDisease

test = test.drop('HeartDisease', axis=1)
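
As written, the split changes on every run and is not guaranteed to preserve the share of positive cases. A sketch of a reproducible, stratified variant using the `random_state` and `stratify` parameters of `train_test_split`; the toy frame and the seed value are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the discretized data: 5 rows per class.
toy = pd.DataFrame({'Age': list('ABCABCABCA'),
                    'HeartDisease': [0, 1] * 5})

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy['HeartDisease'])

print(toy_test['HeartDisease'].value_counts().to_dict())
```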


In [5]:
train.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
131,B,M,ASY,B,C,0,Normal,B,Y,A,Flat,1
869,B,M,NAP,B,C,1,Normal,B,N,B,Up,0
745,C,F,ASY,B,C,0,Normal,B,Y,B,Flat,1
898,A,M,ATA,B,B,0,Normal,C,N,A,Up,0
462,B,M,ASY,B,C,0,Normal,A,Y,B,Down,1


In [6]:
test.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope
558,B,M,NAP,B,C,0,ST,B,Y,B,Flat
432,C,M,ASY,B,B,0,Normal,A,Y,C,Down
582,C,M,ASY,B,C,1,LVH,A,Y,A,Flat
258,B,F,NAP,B,B,0,Normal,A,N,B,Up
574,C,M,ASY,B,C,1,ST,A,Y,B,Flat


## Parameter Learning
In our first approach, we use the BN structure proposed in \cite{} and learn only its parameters from the given dataset.
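
For a fixed structure, parameter learning amounts to estimating one conditional probability table (CPT) per node. A minimal sketch of the maximum-likelihood estimate for a single edge on toy data; the structure of the cited model is not reproduced here, so the edge used below is purely hypothetical:

```python
import pandas as pd

# Toy data; the edge HeartDisease -> ExerciseAngina is assumed only for illustration.
toy = pd.DataFrame({
    'HeartDisease':   [1, 1, 1, 0, 0, 0, 0, 1],
    'ExerciseAngina': ['Y', 'Y', 'N', 'N', 'N', 'Y', 'N', 'Y'],
})

# Maximum-likelihood estimate of P(ExerciseAngina | HeartDisease): count each
# (parent, child) combination and normalize within each parent state (column).
cpt = pd.crosstab(toy['ExerciseAngina'], toy['HeartDisease'], normalize='columns')
print(cpt)
```

Each column of `cpt` is one conditional distribution and therefore sums to 1.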

## Learning BN from data
In our second approach, we learn the structure of the BN from the dataset.

#### Hill Climb Search

In [74]:
%%capture output
hc = HillClimbSearch(train)
best_model = hc.estimate(scoring_method=BicScore(train))
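
Hill climbing starts from an initial graph and greedily applies the single edge addition, removal, or reversal that most improves the score. With BIC, the score decomposes into one term per node: the log-likelihood of the node given its parents minus a penalty of $\frac{\log N}{2}$ per free parameter. A sketch of that per-node term, assuming the hypothetical helper name `local_bic` and toy data; it illustrates the formula and is not pgmpy's implementation:

```python
import numpy as np
import pandas as pd

def local_bic(df, node, parents):
    """BIC contribution of one node given its parents (illustrative sketch)."""
    n = len(df)
    r = df[node].nunique()                                 # number of node states
    q = int(np.prod([df[p].nunique() for p in parents]))   # parent configurations
    if parents:
        counts = df.groupby(list(parents) + [node]).size().reset_index(name='n_joint')
        counts['n_parent'] = counts.groupby(list(parents))['n_joint'].transform('sum')
    else:
        counts = pd.DataFrame({'n_joint': df[node].value_counts().to_numpy()})
        counts['n_parent'] = n
    # Log-likelihood: sum over observed combinations of N_joint * log(N_joint / N_parent).
    log_lik = (counts['n_joint'] * np.log(counts['n_joint'] / counts['n_parent'])).sum()
    return log_lik - 0.5 * np.log(n) * q * (r - 1)

# Toy data in which ExerciseAngina perfectly predicts HeartDisease.
toy = pd.DataFrame({'ExerciseAngina': ['Y', 'Y', 'N', 'N'],
                    'HeartDisease': [1, 1, 0, 0]})
without_edge = local_bic(toy, 'HeartDisease', [])
with_edge = local_bic(toy, 'HeartDisease', ['ExerciseAngina'])
print(with_edge > without_edge)  # True
```

Here the gain in likelihood outweighs the extra penalty, so a hill-climbing step would keep the edge.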


Visualize the learned BN

In [18]:
bn_hc = graphviz.Digraph(comment='BN learned from data using Hill Climb Search')

for node in best_model.nodes():
    bn_hc.node(node)
    
for edge in best_model.edges():
    bn_hc.edge(edge[0], edge[1])
    
bn_hc.render(view=True)

#### Parameter Learning

In [62]:
learned_BN = BayesianNetwork(best_model.edges())

estimator = BayesianEstimator(learned_BN, train)
estimator.get_parameters()

In [23]:
learned_BN.fit(train, estimator=BayesianEstimator)
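
A Bayesian estimator differs from plain counting by adding prior pseudo-counts before normalizing, which keeps unseen combinations from receiving probability zero. A minimal sketch with a uniform pseudo-count per state; `alpha`, the helper name, and the toy counts are assumptions, and the prior actually applied by `BayesianEstimator` depends on its settings:

```python
import numpy as np

def smoothed_distribution(counts, alpha=1.0):
    """Posterior-mean estimate with a uniform pseudo-count alpha per state."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * counts.size)

# Toy counts for a three-state variable whose last state was never observed.
observed = [6, 2, 0]
print(smoothed_distribution(observed))             # last state stays above zero
print(smoothed_distribution(observed, alpha=0.0))  # reduces to relative frequencies
```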

In [24]:
learned_BN.check_model()

True

Validate the learned BN on the test data 

In [21]:
# Keep only the columns that are nodes in the learned network before predicting.
learned_BN.predict(test[[c for c in test.columns if c in learned_BN.nodes()]])
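
Once `predict` succeeds, its output can be scored against `true_labels`, which was set aside before the label column was dropped. A sketch of an accuracy computation on toy values, assuming the predictions arrive as a DataFrame with a `HeartDisease` column:

```python
import pandas as pd

# Toy stand-ins for the model output and for the held-out labels, which keep
# the shuffled index of the test split.
toy_predictions = pd.DataFrame({'HeartDisease': [1, 0, 1, 1]})
toy_true_labels = pd.Series([1, 0, 0, 1], index=[558, 432, 582, 258])

# Compare by position so that the differing indexes do not misalign rows.
matches = toy_predictions['HeartDisease'].to_numpy() == toy_true_labels.to_numpy()
accuracy = matches.mean()
print(accuracy)  # 0.75
```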