In [None]:
Heart Disease Prediction Using Machine Learning
Task 3 – AI/ML Engineering Internship at DevelopersHub Corporation
Model Used: Random Forest Classifier  
Dataset: Heart Disease UCI (from Kaggle)  


In [None]:
1. Problem Statement

Heart disease is one of the leading causes of death worldwide. Early detection can significantly improve treatment outcomes. The objective of this project is to build a machine learning model that can predict whether a patient is likely to have heart disease based on various medical attributes.

Goals

- Understand patterns in heart-related medical features
- Visualize important trends and distributions
- Train and evaluate a machine learning classification model
- Provide a clear summary of results and key insights


In [None]:
2. Dataset Loading and Preprocessing

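We load the Heart Disease dataset, check its structure, clean it (if needed), and prepare it for model training. Preprocessing includes renaming columns, relabelling and one-hot encoding the categorical variables, and splitting the data into training and test sets. No feature scaling is applied, since tree-based models such as Random Forest do not depend on feature scale.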

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = pd.read_csv('heart.csv')
print("Shape:", data.shape)
print(data.head())
print(data.describe())
data.info()  # info() prints its summary directly and returns None

# Rename columns
column_map = {
    'cp': 'chest_pain_type',
    'trestbps': 'resting_blood_pressure',
    'chol': 'cholesterol',
    'fbs': 'fasting_blood_sugar',
    'restecg': 'rest_ecg',
    'thalach': 'max_heart_rate_achieved',
    'exang': 'exercise_induced_angina',
    'oldpeak': 'st_depression',
    'slope': 'st_slope',
    'ca': 'num_major_vessels',
    'thal': 'thalassemia'
}
data.rename(columns=column_map, inplace=True)

# Replace numeric codes with readable labels.
# NOTE: the keys below follow the original UCI coding (cp 1-4, slope 1-3,
# thal 3/6/7). Some Kaggle copies of heart.csv are re-indexed from 0, so
# compare the keys against data[col].unique() and adjust them if needed.
replace_map = {
    'sex': {0: 'female', 1: 'male'},
    'chest_pain_type': {1: 'typical angina', 2: 'atypical angina', 3: 'non-anginal pain', 4: 'asymptomatic'},
    'fasting_blood_sugar': {0: 'lower than 120mg/dl', 1: 'greater than 120mg/dl'},
    'rest_ecg': {0: 'normal', 1: 'ST-T wave abnormality', 2: 'left ventricular hypertrophy'},
    'exercise_induced_angina': {0: 'no', 1: 'yes'},
    'st_slope': {1: 'upsloping', 2: 'flat', 3: 'downsloping'},
    'thalassemia': {3: 'normal', 6: 'fixed defect', 7: 'reversible defect'}
}
data.replace(replace_map, inplace=True)

# Convert to categorical types
to_category = ['sex', 'chest_pain_type', 'fasting_blood_sugar', 'rest_ecg',
               'exercise_induced_angina', 'st_slope', 'thalassemia']
data[to_category] = data[to_category].astype('category')

# Prepare features and labels
y = data['target']
x = data.drop(columns='target')
x = pd.get_dummies(x, drop_first=True)

# Train-test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
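
The relabelling step only converts codes that appear as keys in `replace_map`; any other code stays in the column as a raw integer and ends up as its own dummy column. A minimal sketch of a check for this is shown below. The helper name `find_unmapped_codes` and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def find_unmapped_codes(df, mapping):
    """Return {column: sorted codes present in df but missing from mapping[column]}."""
    unmapped = {}
    for col, codes in mapping.items():
        if col not in df.columns:
            continue  # skip columns the frame does not have
        present = set(df[col].dropna().unique().tolist())
        extra = present - set(codes)
        if extra:
            unmapped[col] = sorted(extra)
    return unmapped

# Toy frame with made-up values: chest pain is coded 0-3 instead of 1-4
toy = pd.DataFrame({"sex": [0, 1, 1, 0], "chest_pain_type": [0, 1, 2, 3]})
toy_map = {
    "sex": {0: "female", 1: "male"},
    "chest_pain_type": {1: "typical angina", 2: "atypical angina",
                        3: "non-anginal pain", 4: "asymptomatic"},
}
problems = find_unmapped_codes(toy, toy_map)
print(problems)
```

In the notebook, the same call would be `find_unmapped_codes(data, replace_map)`, placed after the column rename and before `data.replace(...)`. An empty dictionary means every code has a label.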


In [None]:
3. Data Visualization and Exploration

Below we visualize the data to uncover insights:
- Distribution of age and gender
- Correlation heatmap
- Target class distribution
- Boxplots and violin plots for key features

sns.pairplot(data)
plt.show()

plt.figure(figsize=(15, 15))
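# Correlate numeric columns only; the relabelled categorical columns hold strings
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm')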
plt.title("Correlation Heatmap")
plt.show()

plt.figure(figsize=(10, 7))
sns.histplot(data['age'], kde=True)
plt.title('Distribution of Age')
plt.show()

gender_counts = data['sex'].value_counts()
labels = ['Male', 'Female']
size = [gender_counts['male'], gender_counts['female']]
colors = ['lightblue', 'lightgreen']
explode = [0, 0.01]
my_circle = plt.Circle((0, 0), 0.7, color='white')

plt.figure(figsize=(9, 9))
plt.pie(size, labels=labels, colors=colors, explode=explode, autopct='%.2f%%', shadow=True)
plt.gca().add_artist(my_circle)
plt.title('Distribution of Gender')
plt.legend()
plt.show()

sns.countplot(x='target', data=data, palette='pastel')
plt.title('Target Distribution')
plt.grid()
plt.show()

sns.boxplot(x='target', y='resting_blood_pressure', data=data)
plt.title('Resting BP vs Target')
plt.show()

sns.violinplot(x='target', y='cholesterol', data=data)
plt.title('Cholesterol vs Target')
plt.show()

dat = pd.crosstab(data['target'], data['rest_ecg'])
dat.div(dat.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True, figsize=(10, 7))
plt.title('ECG vs Target')
plt.show()

sns.lmplot(x='target', y='max_heart_rate_achieved', data=data)
plt.title('Max Heart Rate vs Target')
plt.show()

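# exercise_induced_angina holds the labels 'no'/'yes', so a regression plot
# cannot be fitted; a count plot split by target shows the relationship instead
sns.countplot(x='exercise_induced_angina', hue='target', data=data)
plt.title('Exercise Induced Angina vs Target')
plt.show()

# st_slope is categorical as well, so it is also shown as counts per target class
sns.countplot(x='st_slope', hue='target', data=data)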
plt.title('ST Slope vs Target')
plt.show()
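
The plots give a visual impression of which features separate the two classes. As a numeric complement, the sketch below ranks the numeric features by the absolute value of their Pearson correlation with the binary target (for a 0/1 target this equals the point-biserial correlation). The function name `rank_by_target_correlation` and the toy values are assumptions for illustration only; they are not taken from the dataset.

```python
import pandas as pd

def rank_by_target_correlation(df, target="target"):
    """Absolute Pearson correlation of each numeric feature with the target, descending."""
    numeric = df.select_dtypes(include="number")  # string/category columns are ignored
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)

# Toy frame with made-up values, standing in for the real data
toy = pd.DataFrame({
    "age": [40, 50, 60, 70, 45, 65],
    "max_heart_rate_achieved": [180, 150, 140, 120, 175, 130],
    "sex": ["male", "female", "male", "male", "female", "male"],
    "target": [1, 1, 0, 0, 1, 0],
})
ranking = rank_by_target_correlation(toy)
print(ranking)
```

Applied to the notebook's `data` frame, this only covers the columns that remain numeric after relabelling; the categorical features are better compared with the count plots and cross-tabulations above.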



In [None]:
4. Model Training and Evaluation

We train a Random Forest Classifier and evaluate its performance using accuracy, the confusion matrix, the classification report, and the ROC curve with its AUC score.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc

# Fix random_state so the reported scores are reproducible between runs
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:, 1]

print("Training Accuracy:", model.score(x_train, y_train))
print("Testing Accuracy:", model.score(x_test, y_test))

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.show()

print(classification_report(y_test, y_pred))

fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
print("AUC Score:", roc_auc)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle='--')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.grid()
plt.show()
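
A single train-test split on a small dataset can give a noisy estimate of performance. One common way to check how stable the scores are is stratified k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in, because the real `heart.csv` is not available here; the sample size, feature count, and variable names (`X_demo`, `y_demo`, `clf`) are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (sizes chosen arbitrarily, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

# Same hyperparameters as the model trained above
clf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)

# Stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="roc_auc")

print("AUC per fold:", scores.round(3))
print("Mean AUC: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

In the notebook, `X_demo` and `y_demo` would be replaced by `x` and `y`. A large spread between folds would suggest that the single-split accuracy should be read with caution.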


In [None]:
5. Results and Final Insights

- Best Model: Random Forest Classifier (n_estimators=50, max_depth=5)
- Training Accuracy: XX%
- Testing Accuracy: XX%
- AUC Score: XX

Key Observations:
- Features like age, cholesterol, chest pain type, and thalassemia strongly correlate with heart disease.
- The ROC curve shows a good trade-off between sensitivity and specificity.
- The model performs well without overfitting (small train-test gap).
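
The sensitivity and specificity mentioned above can be read directly from the confusion matrix. The sketch below assumes the layout that scikit-learn's `confusion_matrix` returns for labels 0 and 1, namely `[[TN, FP], [FN, TP]]`. The counts are hypothetical and are not results from this notebook.

```python
def sensitivity_specificity(cm):
    """Compute (sensitivity, specificity) from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    sensitivity = tp / (tp + fn)  # true positive rate: diseased patients correctly flagged
    specificity = tn / (tn + fp)  # true negative rate: healthy patients correctly cleared
    return sensitivity, specificity

# Hypothetical counts for illustration only
example_cm = [[20, 5], [4, 32]]
sens, spec = sensitivity_specificity(example_cm)
print(f"Sensitivity: {sens:.3f}, Specificity: {spec:.3f}")
```

With the notebook's own matrix, the call would be `sensitivity_specificity(cm.tolist())`. For a screening use case, sensitivity is usually the more important of the two, since a missed case is costlier than a false alarm.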

This model can be used for initial medical screening or as a decision-support system.
