## Heart Disease Prediction using Decision Tree Classifier

## Libraries Used

The following libraries were used for the implementation of the Decision Tree Classifier:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


In [2]:
data = pd.read_csv(r"C:\Users\Shaik Sakhlaih\Downloads\Dataset Heart Disease.csv")
data

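|  | Unnamed: 0 | age | sex | chest pain type | resting bps | cholesterol | fasting blood sugar | resting ecg | max heart rate | exercise angina | oldpeak | ST slope | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 40 | 1 | 2 | 140 | 289.0 | 0 | 0 | 172 | 0 | 0.0 | 1 | 0 |
| 1 | 1 | 49 | 0 | 3 | 160 | 180.0 | 0 | 0 | 156 | 0 | 1.0 | 2 | 1 |
| 2 | 2 | 37 | 1 | 2 | 130 | 283.0 | 0 | 1 | 98 | 0 | 0.0 | 1 | 0 |
| 3 | 3 | 48 | 0 | 4 | 138 | 214.0 | 0 | 0 | 108 | 1 | 1.5 | 2 | 1 |
| 4 | 4 | 54 | 1 | 3 | 150 | 195.0 | 0 | 0 | 122 | 0 | 0.0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1043 | 297 | 68 | 0 | 3 | 120 | 211.0 | 0 | 0 | 115 | 0 | 1.5 | 1 | 1 |
| 1044 | 298 | 44 | 0 | 3 | 108 | 141.0 | 0 | 1 | 175 | 0 | 0.6 | 1 | 1 |
| 1045 | 299 | 52 | 1 | 1 | 128 | 255.0 | 0 | 1 | 161 | 1 | 0.0 | 2 | 0 |
| 1046 | 300 | 59 | 1 | 4 | 160 | 273.0 | 0 | 0 | 125 | 0 | 0.0 | 2 | 0 |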
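A column named `Unnamed: 0` commonly appears when a DataFrame was saved to CSV together with its index and then read back. Whether that is what happened with this file is an assumption. The sketch below reproduces the effect on a tiny in-memory frame (`original`, `reloaded`, and `reloaded_clean` are illustrative names) and shows how `index_col=0` keeps the saved index out of the columns.

```python
import io

import pandas as pd

# Tiny illustrative frame; the values are taken from the first rows shown above.
original = pd.DataFrame({"age": [40, 49, 37], "target": [0, 1, 0]})

# to_csv writes the index by default, with an empty header for that column.
buffer = io.StringIO()
original.to_csv(buffer)

buffer.seek(0)
reloaded = pd.read_csv(buffer)  # the saved index comes back as "Unnamed: 0"

buffer.seek(0)
reloaded_clean = pd.read_csv(buffer, index_col=0)  # use it as the index instead

print(list(reloaded.columns))
print(list(reloaded_clean.columns))
```

If the column carries no medical meaning, reading the file this way (or dropping the column after loading) keeps it from being treated as a feature.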


 

In this project, the following libraries were used:  

- **pandas**: For data manipulation and analysis.  
- **numpy**: For numerical computations.  
- **seaborn**: For data visualization (imported, but not used in the code shown here).  
- **sklearn.model_selection.train_test_split**: To split the data into training and testing sets.  
- **sklearn.tree.DecisionTreeClassifier**: To build the **Decision Tree** model.  
- **sklearn.metrics.accuracy_score**: To evaluate the accuracy of the model.  

## About the Dataset  

The dataset used in this project is a **Heart Disease** dataset, where the goal is to predict the presence or absence of heart disease in a patient. The dataset contains various medical features that can potentially be used to predict the outcome.  

### Dataset Overview:  

- **Number of Columns**: The loaded file has 13 columns: an index-like `Unnamed: 0` column, 11 medical features, and the target column.  
- **Target Variable**: The target variable is the **target** column, which indicates whether the patient has heart disease (1) or not (0).  

### Columns:  
- **Unnamed: 0**: An index-like column carried over from the CSV file; it is not a medical feature.  
- **age**: Age of the patient.  
- **sex**: Sex of the patient (1 = male, 0 = female).  
- **chest pain type**: Chest pain type (coded 1 to 4 in this file).  
- **resting bps**: Resting blood pressure.  
- **cholesterol**: Serum cholesterol in mg/dl.  
- **fasting blood sugar**: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false).  
- **resting ecg**: Resting electrocardiographic results (coded 0 to 2 in this file).  
- **max heart rate**: Maximum heart rate achieved.  
- **exercise angina**: Exercise-induced angina (1 = yes, 0 = no).  
- **oldpeak**: ST depression induced by exercise relative to rest.  
- **ST slope**: Slope of the peak exercise ST segment (coded 0 to 3 in this file).  
- **target**: Target variable (1 = heart disease, 0 = no heart disease).  

The `ca` (number of major vessels colored by fluoroscopy) and `thal` (thalassemia) columns found in the UCI Cleveland version of the heart disease data do not appear in the loaded file.  

### Relationship Between Columns:  
- **Age**, **sex**, and other medical features are the independent variables (**X**).  
- **Target** is the dependent variable (**Y**), where 1 indicates the presence of heart disease and 0 indicates the absence.  

## Data Analysis  

### Checking for Missing Values:  
To ensure the dataset is clean, we checked for null values.


In [3]:
data.isnull().sum()


Unnamed: 0             0
age                    0
sex                    0
chest pain type        0
resting bps            0
cholesterol            0
fasting blood sugar    0
resting ecg            0
max heart rate         0
exercise angina        0
oldpeak                0
ST slope               0
target                 0
dtype: int64

Result: There are no null values in the dataset, so no imputation or row removal is needed before modeling.
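A zero null count does not rule out other data-quality problems, such as repeated rows or values outside the expected range. The following is a minimal sketch of a few extra checks, run on a small hypothetical frame (`sample`) rather than the real file. Treating negative `oldpeak` values as worth flagging is an illustrative assumption, not a rule taken from the dataset's documentation.

```python
import pandas as pd

# Small hypothetical frame that mimics a few of the dataset's columns.
sample = pd.DataFrame({
    "age": [40, 49, 37, 49],
    "cholesterol": [289.0, 180.0, 283.0, 180.0],
    "oldpeak": [0.0, 1.0, -0.1, 1.0],
    "target": [0, 1, 0, 1],
})

null_counts = sample.isnull().sum()                      # missing values per column
n_duplicates = int(sample.duplicated().sum())            # fully repeated rows
n_negative_oldpeak = int((sample["oldpeak"] < 0).sum())  # values below zero

print(null_counts)
print("duplicate rows:", n_duplicates)
print("negative oldpeak values:", n_negative_oldpeak)
```

The same three expressions can be applied to `data` directly.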

### Descriptive Statistics:

In [4]:
data.describe()

|  | Unnamed: 0 | age | sex | chest pain type | resting bps | cholesterol | fasting blood sugar | resting ecg | max heart rate | exercise angina | oldpeak | ST slope | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1048.0 | 1048.0 | 1048.0 | 1048.0 | 1048.0 | 1048.0 | 1048.0 | 1048.0 | 1048.0 | 1048.0 | 1048.0 | 1048.0 | 1048.0 |
| mean | 390.841603 | 53.325382 | 0.734733 | 2.817748 | 132.61355 | 245.17271 | 0.162214 | 0.60687 | 142.918893 | 0.368321 | 0.942366 | 1.532443 | 0.496183 |
| std | 307.916633 | 9.397822 | 0.441686 | 1.118649 | 17.367605 | 57.101359 | 0.368823 | 0.763313 | 24.427115 | 0.482579 | 1.100429 | 0.611023 | 0.500224 |
| min | 0.0 | 28.0 | 0.0 | 1.0 | 92.0 | 85.0 | 0.0 | 0.0 | 69.0 | 0.0 | -0.1 | 0.0 | 0.0 |
| 25% | 130.75 | 46.0 | 0.0 | 2.0 | 120.0 | 208.0 | 0.0 | 0.0 | 125.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 50% | 262.0 | 54.0 | 1.0 | 3.0 | 130.0 | 239.0 | 0.0 | 0.0 | 144.0 | 0.0 | 0.6 | 2.0 | 0.0 |
| 75% | 657.25 | 60.0 | 1.0 | 4.0 | 140.0 | 275.0 | 0.0 | 1.0 | 162.0 | 1.0 | 1.6 | 2.0 | 1.0 |
| max | 1189.0 | 77.0 | 1.0 | 4.0 | 200.0 | 603.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 3.0 | 1.0 |


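This gives the basic statistics (count, mean, standard deviation, min, quartiles, max) of each column. Two details stand out: the mean of `target` is about 0.496, so the two classes are nearly balanced, and `oldpeak` has a minimum of -0.1.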
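For a 0/1 label, the column mean equals the share of rows labelled 1, which is why the `target` mean can be read as a class-balance indicator. The sketch below shows the relationship on a short hypothetical label series (`target_demo`), not on the real data.

```python
import pandas as pd

# Hypothetical labels standing in for data["target"].
target_demo = pd.Series([0, 1, 0, 1, 0, 1, 1, 0], name="target")

# Share of each class, ordered by label value.
class_share = target_demo.value_counts(normalize=True).sort_index()

print(class_share)
print("mean of labels:", target_demo.mean())
```

On the real frame, `data["target"].value_counts(normalize=True)` gives the same kind of breakdown.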

## Model Building


### Data Preprocessing:

- The target variable (`target`) was separated from the features.

- The features were assigned to `x`, and the target was assigned to `y`.

In [5]:
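# Features: every column except the label (this keeps the index-like "Unnamed: 0" column in x).
x = data.drop(['target'], axis=1)
# Label: 1 = heart disease, 0 = no heart disease.
y = data['target']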
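Dropping only `target` leaves the `Unnamed: 0` column among the features, where a tree could split on it even though it is not a medical measurement. A possible variant, shown on a small hypothetical frame (`sample`, `x_features`, and `y_labels` are illustrative names), removes it as well. This is a sketch of an alternative, not the preprocessing used in the notebook.

```python
import pandas as pd

# Hypothetical frame with the same kind of columns as the dataset.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "age": [40, 49, 37, 48],
    "sex": [1, 0, 1, 0],
    "target": [0, 1, 0, 1],
})

# Drop both the label and the index-like column from the feature matrix.
x_features = sample.drop(columns=["Unnamed: 0", "target"])
y_labels = sample["target"]

print(list(x_features.columns))
```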

### Splitting the Data:

We split the data into training and testing sets with an 80-20 ratio, fixing `random_state=0` so that the split is reproducible:

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
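To see what an 80-20 split produces for a dataset of this size, the sketch below applies `train_test_split` to synthetic arrays with 1048 rows (`x_demo` and `y_demo` are made-up stand-ins, not the real data). It also passes `stratify`, an option the notebook does not use, to keep the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset (1048).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(1048, 3))
y_demo = np.array([0, 1] * 524)  # perfectly balanced made-up labels

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("train shape:", x_tr.shape)
print("test shape:", x_te.shape)
print("share of class 1 (train, test):", y_tr.mean(), y_te.mean())
```

Stratifying matters most when classes are imbalanced. With labels as balanced as these, a plain random split is usually close to proportional anyway.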

### Model Creation:

We used the Decision Tree Classifier with the Gini index as the splitting criterion and limited the tree depth to 3 to reduce the risk of overfitting.

In [7]:
DTree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
DTree.fit(x_train, y_train)
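The Gini index of a node is 1 minus the sum of the squared class proportions at that node: it is 0 for a pure node and 0.5 for a 50-50 binary node. The sketch below computes it by hand and compares it with the root impurity that scikit-learn stores after fitting. The helper `gini_impurity` and the toy arrays are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)


# Synthetic toy data: one informative feature and one constant feature.
x_toy = np.array([[0, 5], [1, 5], [2, 5], [3, 5], [4, 5], [5, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
toy_tree.fit(x_toy, y_toy)

root_impurity = toy_tree.tree_.impurity[0]  # impurity stored for the root node

print("hand-computed Gini:", gini_impurity(y_toy))
print("root impurity from the fitted tree:", root_impurity)
print("depth actually used:", toy_tree.get_depth())
```

`max_depth` is an upper bound: the fitted tree stops earlier when its leaves are already pure, which `get_depth()` makes visible.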


### Predictions:
We made predictions on the test set: