# Heart Disease Prediction using Decision Tree Classifier


### Objectives:
1. Explore the heart disease dataset.
2. Apply Decision Tree Classifier to predict heart disease stages.
3. Tune the model for better performance.
4. Visualize the decision tree and key features.
5. Answer interview questions related to Decision Trees and feature encoding.
    

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# Load dataset
data = pd.read_csv('/mnt/data/heart_disease.csv')
data.head()
    

### Data Overview and Basic Statistics

In [None]:

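# Print column types and non-null counts, then summary statistics.
# Only the last expression in a cell is displayed automatically, so print explicitly.
data.info()
print(data.describe())

# Count missing values per column
print(data.isnull().sum())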
    


### Exploratory Data Analysis (EDA)
We will visualize the distributions of numeric features, check for outliers, and analyze correlations between features.
    

In [None]:

# 1. Plot histograms for numerical columns
# DataFrame.hist creates its own figure, so a separate plt.figure() call would only leave an empty one
data[['age', 'trestbps', 'chol', 'thalch', 'oldpeak']].hist(bins=15, figsize=(12, 10))
plt.suptitle('Distribution of Numeric Features')
plt.show()

# 2. Boxplot for 'trestbps' and 'chol' (outlier detection)
plt.figure(figsize=(12, 6))
sns.boxplot(x=data['trestbps'])
plt.title('Boxplot of Resting Blood Pressure')
plt.show()

plt.figure(figsize=(12, 6))
sns.boxplot(x=data['chol'])
plt.title('Boxplot of Cholesterol')
plt.show()

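# 3. Correlation matrix
# The categorical columns are still strings at this point, so restrict to numeric columns
plt.figure(figsize=(10, 8))
correlation_matrix = data.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')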
plt.title('Correlation Matrix')
plt.show()
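
The boxplots above flag outliers visually. As a complementary check, the rule that boxplot whiskers conventionally use (values beyond 1.5 times the interquartile range from the quartiles) can be computed directly. The sketch below is a minimal illustration; the helper name `iqr_outliers` and the sample readings are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical cholesterol readings used only for illustration
chol_demo = pd.Series([180, 195, 200, 210, 220, 230, 240, 250, 564])
outliers = iqr_outliers(chol_demo)
print(outliers)
```

On the real data, the same helper could be applied as `iqr_outliers(data['chol'].dropna())`, assuming the column is numeric.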
    


### Feature Engineering:
1. Handle missing values in the 'oldpeak' column.
2. Encode categorical variables ('sex', 'cp', 'restecg', etc.) for machine learning.
    

In [None]:

# Fill missing values in 'oldpeak' with the median
data['oldpeak'] = data['oldpeak'].fillna(data['oldpeak'].median())

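# Encode categorical columns with a separate LabelEncoder per column,
# keeping each fitted encoder so the integer codes can be mapped back later
categorical_columns = ['sex', 'cp', 'restecg', 'exang', 'slope', 'thal']
encoders = {}

for col in categorical_columns:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])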

data.head()
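
Label encoding assigns integers in sorted order, which implies an ordering that nominal columns such as 'cp' may not have. The sketch below contrasts it with one-hot encoding on a small hypothetical column; the category strings are illustrative assumptions and may not match the values in the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical chest-pain labels; the real 'cp' values may differ
demo = pd.DataFrame(
    {"cp": ["typical", "atypical", "non-anginal", "asymptomatic", "typical"]}
)

# Label encoding: one integer per category, assigned in sorted (alphabetical) order
label_encoded = LabelEncoder().fit_transform(demo["cp"])

# One-hot encoding: one indicator column per category, with no implied order
one_hot = pd.get_dummies(demo["cp"], prefix="cp")

print(label_encoded)
print(one_hot.columns.tolist())
```

Decision trees split on thresholds, so they can often cope with label-encoded nominal features, but one-hot encoding avoids relying on an arbitrary integer order.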
    


### Decision Tree Classifier:
We will split the data into training and testing sets, train a Decision Tree model, and evaluate its performance using accuracy, precision, recall, F1-score, and ROC-AUC.
    

In [None]:

# Splitting data into features (X) and target (y)
X = data.drop('num', axis=1)
y = data['num']

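# Splitting into training and testing sets (80-20 split),
# stratified so each class keeps its proportion in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predictions and evaluation (macro averaging weights every class equally)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

# ROC-AUC needs class probabilities; one-vs-rest assumes 'num' has more than two classes
y_proba = clf.predict_proba(X_test)
roc_auc = roc_auc_score(y_test, y_proba, multi_class='ovr', average='macro', labels=clf.classes_)

# Output the performance metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")
print(f"ROC-AUC: {roc_auc}")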
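
To make the effect of `average='macro'` concrete, the sketch below computes recall on a small set of hypothetical labels (not model output from this notebook). Macro averaging takes the unweighted mean of the per-class recalls, while weighted averaging scales each class by its support, so a dominant class can hide poor performance on rare ones.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true and predicted stages for ten patients (class 0 dominates)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 0, 2, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Per-class recalls are 5/6, 1/2, and 1/2
macro_recall = recall_score(y_true_demo, y_pred_demo, average="macro")
weighted_recall = recall_score(y_true_demo, y_pred_demo, average="weighted")

print(f"Macro recall:    {macro_recall:.3f}")
print(f"Weighted recall: {weighted_recall:.3f}")
```

Here the macro value is lower than the weighted one because the two minority classes are each recalled only half the time.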
    


### Hyperparameter Tuning:
We will tune the Decision Tree using hyperparameters like 'max_depth', 'min_samples_split', and 'criterion' to optimize the model's performance.
    

In [None]:

# Define the parameter grid
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)

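# Output best parameters, their mean cross-validated accuracy,
# and the accuracy of the refitted best model on the held-out test set
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_}")
print(f"Test accuracy (tuned): {grid_search.score(X_test, y_test)}")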
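
To see why 'max_depth' is worth tuning, the sketch below compares training and test accuracy at several depths. It uses synthetic data from `make_classification` as a stand-in (an assumption made so the example runs on its own; it is not the heart disease dataset), so the exact numbers carry no meaning for this problem.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Record (train accuracy, test accuracy) for each depth limit
results = {}
for depth in [1, 3, 5, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

An unrestricted tree fits the training set perfectly, noise included; a widening gap between training and test accuracy as depth grows is the usual sign of overfitting.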
    


### Decision Tree Visualization:
We will visualize the structure of the decision tree and the importance of the features in the model.
    

In [None]:

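# Visualize the Decision Tree
# feature_names is passed as a plain list because some scikit-learn releases reject a pandas Index
plt.figure(figsize=(15, 10))
plot_tree(clf, feature_names=list(X.columns), filled=True, rounded=True, class_names=True)
plt.show()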

# Feature importance plot
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(12, 6))
sns.barplot(x=importances[indices], y=X.columns[indices])
plt.title('Feature Importance')
plt.show()
    


## Interview Questions:
1. **What are some common hyperparameters of Decision Tree models and how do they affect the model's performance?**
   - **max_depth**: Caps the number of levels in the tree. A shallow tree is less likely to overfit but may underfit; leaving it as None lets the tree grow until its leaves are pure or too small to split.
   - **min_samples_split**: Minimum number of samples a node must contain before it can be split. Increasing it limits tree growth and so reduces overfitting.
   - **criterion**: The function used to measure the quality of a split ('gini' for Gini impurity or 'entropy' for Information Gain). The two often select similar splits, so this choice tends to matter less than the depth and sample-size limits.

2. **What is the difference between Label Encoding and One-Hot Encoding?**
   - **Label Encoding**: Assigns a unique integer to each category. This suits ordinal categories, whose order the integers can reflect, but for nominal (unordered) categories it introduces an artificial order that a model may treat as meaningful.
   - **One-Hot Encoding**: Creates one binary column per category, so every category stays distinct without implying any order. The cost is more columns, which grows with the number of categories.
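
For the 'criterion' hyperparameter in question 1, the two impurity measures can be computed by hand from the class counts in a node. The sketch below is a minimal illustration; the helper names `gini` and `entropy` are hypothetical and are not scikit-learn functions.

```python
import numpy as np

def gini(counts):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k))."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # empty classes contribute nothing to the sum
    return np.sum(p * np.log2(1.0 / p))

# An evenly mixed two-class node versus a pure node
print(gini([5, 5]), entropy([5, 5]))
print(gini([10, 0]), entropy([10, 0]))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed; a split is chosen to reduce the weighted impurity of the child nodes as much as possible.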
    