In [2]:
# **Imports:**
# - `pandas` (`pd`): For data manipulation and analysis.
# - `numpy` (`np`): For numerical operations, used here to handle missing values (`NaN`).
# - `BayesianNetwork` from `pgmpy.models`: To define the structure of the Bayesian Network.
# - `MaximumLikelihoodEstimator` from `pgmpy.estimators`: For parameter learning using maximum likelihood estimation.
# - `VariableElimination` from `pgmpy.inference`: For making queries (inference) on the Bayesian Network.
# - `warnings`: To ignore warnings for a cleaner output.

import pandas as pd
import numpy as np
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination
import warnings

# Suppress all warnings (optional, for cleaner output; note this also hides deprecation notices)
warnings.filterwarnings("ignore")

# **Step 1: Load the Dataset**
# - `heart.csv` contains the heart disease dataset.
# - The dataset might have missing values represented by '?'.
# - We replace '?' with `NaN` for easier handling of missing data.
data = pd.read_csv('heart.csv')
data.replace('?', np.nan, inplace=True)

# Display the first few rows of the dataset to understand its structure.
print("Sample data:\n", data.head())
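
# **Sketch: checking for missing values after the '?' replacement**
# - A minimal illustration on a hypothetical three-row frame (`toy`), not the real `heart.csv`.
# - It assumes '?' is the only missing-value marker. A column containing '?' is read as strings,
#   so it is converted back to numbers after the replacement.
import numpy as np
import pandas as pd

toy = pd.DataFrame({'age': ['52', '?', '70'], 'chol': [212, 203, 174]})
toy = toy.replace('?', np.nan)
toy['age'] = pd.to_numeric(toy['age'])   # strings become floats, NaN stays NaN
missing_per_column = toy.isna().sum()    # number of missing entries in each column
toy_complete = toy.dropna()              # keep only fully observed rows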

# **Bayesian Network:**
# - A Bayesian Network is a probabilistic graphical model representing variables and their conditional dependencies using a directed acyclic graph (DAG).
# - Each node represents a variable (feature) in the dataset.
# - Edges (arrows) represent direct dependencies between variables.

# **Why use a Bayesian Network?**
# - It helps model the joint probability distribution of the dataset.
# - It allows us to perform inference, making predictions and understanding relationships between variables.

# **Step 2: Define the Structure of the Bayesian Network**
# - Here, we specify the structure of the network based on domain knowledge (medical insights about heart disease).
# - **Relations explained:**
#   - `('age', 'trestbps')`: Age can affect resting blood pressure (`trestbps`).
#   - `('age', 'chol')`: Age can influence cholesterol levels (`chol`).
#   - `('sex', 'trestbps')`: Sex (gender) can also affect resting blood pressure.
#   - `('chol', 'target')`: High cholesterol can increase the risk of heart disease (`target`).
#   - `('trestbps', 'target')`: High resting blood pressure is a risk factor for heart disease.
#   - `('fbs', 'target')`: Fasting blood sugar (`fbs`) levels can be linked to heart disease risk.
#   - `('target', 'thalach')`: If heart disease is present, it may affect the maximum heart rate (`thalach`).
#   - `('target', 'restecg')`: Heart disease can influence the resting electrocardiographic (`restecg`) results.

model = BayesianNetwork([
    ('age', 'trestbps'),
    ('age', 'chol'),
    ('sex', 'trestbps'),
    ('chol', 'target'),
    ('trestbps', 'target'),
    ('fbs', 'target'),
    ('target', 'thalach'),
    ('target', 'restecg')
])
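
# **Sketch: checking that the edge list forms a DAG**
# - A small standard-library check, independent of pgmpy, that the edges above contain no directed cycle.
# - `toy_edges`, `parents` and `topo_order` are names introduced here for illustration only.
from graphlib import TopologicalSorter, CycleError

toy_edges = [
    ('age', 'trestbps'), ('age', 'chol'), ('sex', 'trestbps'),
    ('chol', 'target'), ('trestbps', 'target'), ('fbs', 'target'),
    ('target', 'thalach'), ('target', 'restecg'),
]

# Map each node to the set of its parents (the variables it is conditioned on).
parents = {}
for parent, child in toy_edges:
    parents.setdefault(parent, set())
    parents.setdefault(child, set()).add(parent)

try:
    # static_order() lists every node after all of its parents and raises CycleError on a cycle.
    topo_order = list(TopologicalSorter(parents).static_order())
except CycleError:
    topo_order = None  # the edge list would not be a valid Bayesian Network structure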

# **Step 3: Parameter Learning using Maximum Likelihood Estimation (MLE)**
# - `MaximumLikelihoodEstimator`: Estimates the conditional probability table of each node from the data.
# - **Why MLE?**
#   - MLE picks the parameter values that maximize the likelihood of the observed data; for discrete variables these are the relative frequencies of each state given the states of its parents.
#   - It is suitable when the data is complete (no missing values, or the missing values have already been handled).

model.fit(data, estimator=MaximumLikelihoodEstimator)
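
# **Sketch: what MLE computes for one conditional probability table**
# - For discrete variables, the maximum likelihood estimate of P(child | parent) is the
#   relative frequency count(parent, child) / count(parent).
# - The records below are made up for illustration (hypothetical `fbs`/`target` pairs), not taken
#   from `heart.csv`, and use a single parent to keep the arithmetic short.
from collections import Counter

toy_records = [(0, 0), (0, 0), (0, 1), (0, 1), (0, 1), (1, 0), (1, 1), (1, 1)]  # (fbs, target)

pair_counts = Counter(toy_records)                       # count(fbs, target)
parent_counts = Counter(fbs for fbs, _ in toy_records)   # count(fbs)

# toy_cpd[(fbs, target)] estimates P(target | fbs)
toy_cpd = {
    (fbs, target): count / parent_counts[fbs]
    for (fbs, target), count in pair_counts.items()
}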

# **Step 4: Inference using Variable Elimination**
# - `VariableElimination`: An exact inference algorithm for answering queries on the Bayesian Network.
# - **Why use Variable Elimination?**
#   - It computes conditional probabilities by summing out, one at a time, the variables that are neither queried nor observed, instead of building the full joint distribution.

inference = VariableElimination(model)
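
# **Sketch: summing out a hidden variable by hand**
# - Variable elimination answers a query by multiplying the relevant factors and summing out
#   every variable that is neither queried nor observed.
# - The two-edge chain `a -> b -> c` and all probabilities below are hypothetical, chosen only
#   to make the arithmetic easy to follow; they are not learned from `heart.csv`.
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}  # p_b_given_a[a][b] = P(b | a)
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}  # p_c_given_b[b][c] = P(c | b)

def query_c_given_a(a):
    """Return P(c | a) as a dict by eliminating (summing over) b."""
    return {
        c: sum(p_b_given_a[a][b] * p_c_given_b[b][c] for b in (0, 1))
        for c in (0, 1)
    }

toy_posterior = query_c_given_a(1)  # P(c | a=1)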

# **Query 1: Probability of Heart Disease given Age = 45**
# - We query the probability of the `target` (heart disease) given the evidence that age is 45.
result1 = inference.query(variables=['target'], evidence={'age': 45})
print("\nProbability of Heart Disease given Age=45:\n", result1)

# **Query 2: Probability of Heart Disease given Cholesterol = 200**
# - We query the probability of the `target` (heart disease) given the evidence that cholesterol level (`chol`) is 200.
result2 = inference.query(variables=['target'], evidence={'chol': 200})
print("\nProbability of Heart Disease given Cholesterol=200:\n", result2)
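
# **Sketch: binning a continuous column before querying**
# - Evidence such as `{'age': 45}` can only be matched if 45 is one of the states learned for `age`;
#   grouping continuous values into a few bins is one common way to keep the state space small.
# - The bin edges and labels below are arbitrary assumptions for illustration, and `toy_age` is a
#   made-up series rather than the real `age` column.
import pandas as pd

toy_age = pd.Series([29, 45, 61, 77])
toy_age_group = pd.cut(
    toy_age,
    bins=[0, 40, 55, 70, 120],                 # right-inclusive intervals: (0, 40], (40, 55], ...
    labels=['<=40', '41-55', '56-70', '>70'],
)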

# **Summary:**
# - This code builds a Bayesian Network to model relationships between features in the heart disease dataset.
# - It uses Maximum Likelihood Estimation to learn parameters and Variable Elimination for efficient inference.
# - The network structure is based on medical knowledge of factors affecting heart disease.
# - This approach can be useful in healthcare applications to predict the risk of heart disease and understand the influence of various factors.


Sample data:
    age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   52    1   0       125   212    0        1      168      0      1.0      2   
1   53    1   0       140   203    1        0      155      1      3.1      0   
2   70    1   0       145   174    0        1      125      1      2.6      0   
3   61    1   0       148   203    0        1      161      0      0.0      2   
4   62    0   0       138   294    1        1      106      0      1.9      1   

   ca  thal  target  
0   2     3       0  
1   0     3       0  
2   0     3       0  
3   1     3       0  
4   3     2       0  

Probability of Heart Disease given Age=45:
 +-----------+---------------+
| target    |   phi(target) |
+===========+===============+
| target(0) |        0.4678 |
+-----------+---------------+
| target(1) |        0.5322 |
+-----------+---------------+

Probability of Heart Disease given Cholesterol=200:
 +-----------+---------------+
| target    |   phi(target) |
+===========+===============+
| target(0) |        0.5740 