# Team Heart_Bits: Heart Disease Analysis

## About the Dataset

This dataset contains 13 features related to heart disease.  
The goal is to classify the target variable (**disease / no disease**) using various machine learning algorithms and identify the most suitable one.

### Attribute Information

- **Age** - Age in years  
- **Sex** - 1 = Male; 0 = Female  
- **CP** - Chest pain type  
- **TRESTBPS** - Resting blood pressure (in mm Hg on admission to the hospital)  
- **CHOL** - Serum cholesterol in mg/dl  
- **FBS** - Fasting blood sugar > 120 mg/dl (1 = True; 0 = False)  
- **RESTECG** - Resting electrocardiographic results  
- **THALACH** - Maximum heart rate achieved  
- **EXANG** - Exercise induced angina (1 = Yes; 0 = No)  
- **OLDPEAK** - ST depression induced by exercise relative to rest  
- **SLOPE** - Slope of the peak exercise ST segment  
- **CA** - Number of major vessels (0–3) colored by fluoroscopy  
- **THAL** - Thalassemia type: 3 = Normal; 6 = Fixed defect; 7 = Reversible defect  
- **TARGET** - 1 = Disease; 0 = No disease


In [1]:
### STEP 1: Import Libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [2]:
### STEP 2: Load Dataset

heart_bits = pd.read_csv(r"C:\Users\user\Downloads\heart.csv")

heart_bits.head()

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\user\\Downloads\\heart.csv'
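
The traceback above comes from an absolute path that exists only on one machine. A minimal sketch of a more portable loader is shown here; the helper name `load_heart_data` and the default relative path `heart.csv` are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_heart_data(csv_path="heart.csv"):
    """Load the heart disease CSV, failing early with a readable message."""
    path = Path(csv_path).expanduser()
    if not path.is_file():
        # Report the resolved location so the user can see where pandas would look
        raise FileNotFoundError(
            f"Could not find {path.resolve()}; place the CSV next to the notebook "
            "or pass the correct path."
        )
    return pd.read_csv(path)
```

With the file placed next to the notebook, `heart_bits = load_heart_data()` would replace the hard-coded `pd.read_csv(...)` call.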

In [None]:
# Shape of dataset

heart_bits.shape

In [None]:
# Data Types and Info

heart_bits.info()

In [None]:
# Check for missing values

heart_bits.isnull().sum()

In [None]:
##Statistical summary

heart_bits.describe()

### Exploratory Data Analysis (EDA)


In [None]:
# Step1. Check Target Variable (Heart Disease Cases)

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='target', data=heart_bits)
plt.title("Heart Disease Presence (1 = Disease, 0 = No Disease)")
plt.show()


In [None]:
#Step2. Check the  Percentage 

heart_rate = heart_bits['target'].value_counts(normalize=True) * 100

heart_rate

In [None]:
##Step3. Check the age variable

plt.figure(figsize=(6,4))
sns.histplot(heart_bits['age'], kde=True)
plt.title("Age Distribution of Patients")
plt.xlabel("Age")
plt.show()

In [None]:
##Step4. Check  Chest pain vs Heart Diseases


plt.figure(figsize=(6,4))
sns.countplot(x='cp', hue='target', data=heart_bits)
plt.title("Chest Pain Type vs Heart Disease")
plt.xlabel("Chest Pain Type (0-3)")
plt.ylabel("Count")
plt.show()


In [None]:
##Step5. Check Gender vs Heart Disease

plt.figure(figsize=(6,4))
sns.countplot(x='sex', hue='target', data=heart_bits, palette='coolwarm')
plt.title('Gender vs Heart Disease')
plt.xlabel('0 = Female, 1 = Male')
plt.ylabel('Count')
plt.show()

In [None]:
###Step6. Check  Resting Blood Pressure
plt.figure(figsize=(6,4))
sns.histplot(heart_bits['trestbps'], bins=20, kde=True, color='coral')
plt.title('Resting Blood Pressure Distribution')
plt.xlabel('Resting Blood Pressure (mm Hg)')
plt.ylabel('Frequency')
plt.show()

In [None]:
###Step7. Check Maximum Heart rate vs Disease

plt.figure(figsize=(6,4))
sns.boxplot(x='target', y='thalach', data=heart_bits)
plt.title("Max Heart Rate Achieved vs Heart Disease")
plt.xlabel("Heart Disease (0/1)")
plt.ylabel("Max Heart Rate")
plt.show()


In [None]:
#Step8. Check Cholesterol vs Heart disease


plt.figure(figsize=(6,4))
sns.boxplot(x='target', y='chol', data=heart_bits)
plt.title("Cholesterol Levels vs Heart Disease")
plt.xlabel("Heart Disease (0/1)")
plt.ylabel("Cholesterol (mg/dl)")
plt.show()


In [None]:
##Step9. Check Resting blood pressure vs Heart disease

plt.figure(figsize=(6,4))
sns.boxplot(x='target', y='trestbps', data=heart_bits)
plt.title("Resting Blood Pressure vs Heart Disease")
plt.xlabel("Heart Disease (0/1)")
plt.ylabel("Resting BP (mm Hg)")
plt.show()


In [None]:
#Step10. Check Gender Distribution vs Heart disease

plt.figure(figsize=(6,4))
sns.countplot(x='target', hue='sex', data=heart_bits)
plt.title("Gender Distribution by Heart Disease Status (sex: 0 = Female, 1 = Male)")
plt.xlabel("Heart Disease (0/1)")
plt.ylabel("Count")
plt.show()


In [None]:
##Step11. Check Fasting blood sugar vs Heart disease

plt.figure(figsize=(6,4))
sns.countplot(x='fbs', hue='target', data=heart_bits)
plt.title("Fasting Blood Sugar (>120 mg/dl) vs Heart Disease")
plt.xlabel("1 = True, 0 = False")
plt.ylabel("Count")
plt.show()


In [None]:
##Step12. Check the Slope of ST vs Heart disease

plt.figure(figsize=(6,4))
sns.countplot(x='slope', hue='target', data=heart_bits)
plt.title("Slope of Peak Exercise ST Segment vs Heart Disease")
plt.xlabel("Slope of ST Segment")
plt.ylabel("Count")
plt.show()


In [None]:
#Step13. Check the  Correlation Matrix
plt.figure(figsize=(10,8))
sns.heatmap(heart_bits.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Feature Correlation Heatmap')
plt.show()

In [None]:
#Step14. Check the Pairplot of Selected Features
selected_features = ['age', 'chol', 'trestbps', 'thalach', 'target']
sns.pairplot(heart_bits[selected_features], hue='target', palette='husl')
plt.suptitle('Pairwise Relationships Among Key Features', y=1.02)
plt.show()

# Target Variable Analysis

### Objective:
Before performing statistical tests, let’s explore the target variable (`target`)  
to understand how heart disease is distributed across the dataset.


In [None]:
# Count and percentage of heart disease cases
target_counts = heart_bits['target'].value_counts()
target_percent = heart_bits['target'].value_counts(normalize=True) * 100

# Display as a small summary table
target_summary = pd.DataFrame({
    'Count': target_counts,
    'Percentage (%)': target_percent.round(2)
})
target_summary


In [None]:
#Visualizing target distribution
plt.figure(figsize=(6,4))
sns.countplot(x='target', data=heart_bits, palette='coolwarm')
plt.title("Heart Disease Presence (1 = Disease, 0 = No Disease)", fontsize=12, weight='bold')
plt.xlabel("Target")
plt.ylabel("Count")
plt.show()


**Findings:**
- The plot shows how many patients have heart disease (1) versus those who do not (0).
- The two classes are almost equal in size. A balanced target (close to 50/50) is helpful because the models are less likely to be biased toward a majority class, and accuracy remains a meaningful metric.



### Target vs Gender and Chest Pains

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.countplot(x='sex', hue='target', data=heart_bits, palette='viridis', ax=axes[0])
axes[0].set_title('Heart Disease by Gender')
axes[0].set_xlabel('0 = Female, 1 = Male')
axes[0].set_ylabel('Count')

sns.countplot(x='cp', hue='target', data=heart_bits, palette='coolwarm', ax=axes[1])
axes[1].set_title('Heart Disease by Chest Pain Type')
axes[1].set_xlabel('Chest Pain Type (0–3)')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()


**Findings:**
- A higher proportion of males (sex = 1) are affected by heart disease compared to females.
- Chest pain type (cp) shows a strong relationship with the target: certain pain types are more frequent in disease cases.
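
Count plots show absolute counts, so a statement about proportions is easier to check with a row-normalized crosstab. The sketch below uses a tiny made-up sample; the helper name `disease_rate_by_group` is an assumption introduced for illustration.

```python
import pandas as pd


def disease_rate_by_group(df, group_col, target_col="target"):
    """Percentage of each target class within every level of group_col (rows sum to 100)."""
    return pd.crosstab(df[group_col], df[target_col], normalize="index") * 100


# Made-up sample, only to show the shape of the output
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1],
    "target": [0, 0, 0, 1, 0, 1, 1, 1],
})
rates = disease_rate_by_group(toy, "sex")
print(rates.round(1))
```

On the real data, `disease_rate_by_group(heart_bits, 'sex')` and `disease_rate_by_group(heart_bits, 'cp')` would give the within-group percentages behind the two plots.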


### A Brief Overview of Target vs Continuous Features

In [None]:
numeric_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.boxplot(x='target', y='age', data=heart_bits, ax=axes[0], palette='pastel')
sns.boxplot(x='target', y='thalach', data=heart_bits, ax=axes[1], palette='Set2')
sns.boxplot(x='target', y='oldpeak', data=heart_bits, ax=axes[2], palette='coolwarm')
axes[0].set_title('Age vs Heart Disease')
axes[1].set_title('Max Heart Rate vs Heart Disease')
axes[2].set_title('Oldpeak vs Heart Disease')
plt.tight_layout()
plt.show()


**Findings:**
- Patients with heart disease generally have:
  - Slightly higher age
  - Lower maximum heart rate (thalach)
  - Higher ST depression (oldpeak)
  
These trends will be confirmed statistically in the next section.
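
Before running formal tests, the direction of each trend can be read from a per-class summary table. This is a small sketch on made-up numbers (the values below are not taken from the dataset):

```python
import pandas as pd

# Made-up values, only to illustrate the layout of the summary
toy = pd.DataFrame({
    "target":  [0, 0, 0, 1, 1, 1],
    "thalach": [170, 160, 165, 130, 140, 135],
    "oldpeak": [0.0, 0.5, 0.2, 2.0, 1.5, 2.5],
})

# One row per target class, mean and median for each continuous feature
summary = toy.groupby("target")[["thalach", "oldpeak"]].agg(["mean", "median"])
print(summary)
```

The same call on `heart_bits` with the `numeric_features` list would show which group has the higher mean for each feature.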


### Summary of Target Analysis Findings

- The target variable (heart disease) is roughly balanced, ideal for classification.  
- Gender (sex) and chest pain type (cp) show visible differences between disease groups.  
- Age, maximum heart rate (thalach), and ST depression (oldpeak) differ notably across target classes.  
- These patterns suggest potential significant relationships, which we will now test using:
  - Chi-Square test for categorical variables  
  - Independent t-test (Welch's) for continuous variables


# Statistical Analysis

Before training models, it’s important to check whether there are statistically significant relationships between the target (heart disease) and the other features.

We will use:

- Chi-Square Test: for categorical variables
- Independent T-Test (Welch's, unequal variances): for continuous variables

### Chi-Square Test (Categorical Features vs Target)

We test whether the distribution of each categorical variable differs significantly between the target groups (0 vs 1).

In [None]:
categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']

chi_square_results = {}

for col in categorical_features:
    contingency_table = pd.crosstab(heart_bits[col], heart_bits['target'])
    chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
    chi_square_results[col] = {'Chi2': chi2, 'p-value': p}

chi_square_df = pd.DataFrame(chi_square_results).T
chi_square_df


**Interpretation:**
- A *p-value < 0.05* suggests a **significant relationship** between that variable and the target (heart disease).
- Example: If cp (chest pain type) and thal (thalassemia) show low p-values, they likely play an important role in predicting heart disease.
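
To make the mechanics of the test concrete, here is a worked sketch on a made-up 2x2 table (the counts are illustrative, not from the dataset). One detail worth knowing: for 2x2 tables (one degree of freedom), `scipy.stats.chi2_contingency` applies Yates' continuity correction by default, which affects binary features such as `sex`, `fbs` and `exang`.

```python
import numpy as np
from scipy import stats

# Made-up counts: rows = a binary feature (0/1), columns = target (0/1)
table = np.array([[30, 10],
                  [10, 30]])

# Default call: Yates' correction is applied because the table is 2x2
chi2, p, dof, expected = stats.chi2_contingency(table)

# Same table without the continuity correction, for comparison
chi2_uncorrected, p_uncorrected, _, _ = stats.chi2_contingency(table, correction=False)

print("expected counts under independence:\n", expected)
print(f"chi2 (corrected) = {chi2:.2f}, chi2 (uncorrected) = {chi2_uncorrected:.2f}, dof = {dof}")
print("significant at 0.05:", p < 0.05)
```

The expected counts are what the table would look like if the feature and the target were independent; the statistic grows as the observed counts move away from them.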


### Independent T-Tests (Continuous Features vs Target)

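We test whether the mean values of numerical variables differ between those with and without heart disease. Because the code passes `equal_var=False`, this is Welch's t-test, which does not assume equal variances in the two groups.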

In [None]:
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

t_test_results = {}

for col in numerical_features:
    group0 = heart_bits[heart_bits['target'] == 0][col]
    group1 = heart_bits[heart_bits['target'] == 1][col]
    t_stat, p_val = stats.ttest_ind(group0, group1, equal_var=False)
    t_test_results[col] = {'T-statistic': t_stat, 'p-value': p_val}

t_test_df = pd.DataFrame(t_test_results).T
t_test_df


**Interpretation:**
- If p < 0.05, there’s a statistically significant difference in mean values between disease and no-disease groups.
- For instance:
  - Lower thalach (max heart rate) often indicates higher risk.
  - Higher oldpeak (ST depression) is usually seen in disease cases.
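
The p-value says whether the means differ, while the sign of the t-statistic says in which direction. The sketch below runs Welch's t-test on synthetic data (the group means, spreads and sizes are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic groups with different means and different variances
group0 = rng.normal(loc=140, scale=20, size=200)
group1 = rng.normal(loc=160, scale=10, size=150)

# equal_var=False selects Welch's t-test
t_stat, p_val = stats.ttest_ind(group0, group1, equal_var=False)

# The statistic is computed as mean(group0) - mean(group1) over its standard error,
# so a negative value means group0 has the lower mean
direction = "group0 < group1" if t_stat < 0 else "group0 >= group1"
print(f"t = {t_stat:.2f}, p = {p_val:.3g}, direction: {direction}")
```

In the loop above, `group0` is the no-disease group, so a positive t-statistic for a feature means its mean is higher among patients without disease.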


### Key Insights from the Statistical Analysis

- Chest Pain Type (cp), Thalassemia (thal) and Number of Major Vessels (ca) show the strongest association with heart disease (extremely low p-values close to 0).  
- Exercise-Induced Angina (exang), Slope, Resting ECG (restecg), and Sex also have highly significant relationships with heart disease (p < 0.001).  
- Fasting Blood Sugar (fbs) is not significantly associated with heart disease (p = 0.2186).  
- Overall, categorical variables with p < 0.05 are statistically significant and can be prioritized for predictive modeling and feature importance analysis.



# Feature Engineering

Feature engineering helps improve model performance by making the data more meaningful and suitable for machine learning algorithms. It ensures that all variables are in the right format, scale and representation for accurate predictions.

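In this step, we:
1. Separate features and target for modeling.  
2. Rename columns to improve readability and interpretability for modeling and deployment.  
3. Split the data into training (80%) and testing (20%) sets.  
4. Scale numeric features, fitting the scaler on the training set only, so that they contribute on a comparable scale during model training.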
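
An alternative way to organize these steps, shown here only as a sketch, is to wrap the scaler and the model in a scikit-learn `Pipeline` with a `ColumnTransformer`. The scaler is then refit on the training portion whenever the pipeline is fit, and raw inputs are scaled automatically at prediction time. The data below is synthetic, and the variable names (`demo`, `clf`, and so on) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the renamed feature table
demo = pd.DataFrame({
    "Age": rng.integers(30, 80, n),
    "Max_Heart_Rate": rng.integers(90, 200, n),
    "Sex": rng.integers(0, 2, n),
})
demo_target = (demo["Max_Heart_Rate"] < 140).astype(int)

# Scale only the continuous columns; pass the remaining columns through unchanged
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), ["Age", "Max_Heart_Rate"])],
    remainder="passthrough",
)
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, demo_target, test_size=0.2, random_state=0
)
clf.fit(Xd_train, yd_train)            # scaler statistics come from Xd_train only
demo_accuracy = clf.score(Xd_test, yd_test)
print(f"Held-out accuracy on synthetic data: {demo_accuracy:.2f}")
```

A pipeline like this can also be passed directly to `GridSearchCV` or `cross_val_score`, so that scaling is redone inside every fold.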


In [None]:
#Importing data preprocessing and modelling libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
#Separating features and target
y = heart_bits["target"]
X = heart_bits.drop('target', axis=1)

#Renaming columns for clarity
X = X.rename(columns={
    'age': 'Age',
    'sex': 'Sex',
    'cp': 'Chest_Pain_Type',
    'trestbps': 'Resting_BP',
    'chol': 'Serum_Cholesterol',
    'fbs': 'Fasting_Blood_Sugar',
    'restecg': 'Resting_ECG',
    'thalach': 'Max_Heart_Rate',
    'exang': 'Exercise_Angina',
    'oldpeak': 'ST_Depression',
    'slope': 'Slope_ST_Segment',
    'ca': 'Major_Vessels',
    'thal': 'Thalassemia'
})

#Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

#Copying the splits so the scaled values are written to independent DataFrames
X_train = X_train.copy()
X_test = X_test.copy()

#Scaling selected numerical columns (scaler is fit on the training set only)
cols_to_scale = ['Age', 'Resting_BP', 'Serum_Cholesterol', 'Max_Heart_Rate', 'ST_Depression']

scaler = StandardScaler()
X_train[cols_to_scale] = scaler.fit_transform(X_train[cols_to_scale])
X_test[cols_to_scale] = scaler.transform(X_test[cols_to_scale])

#Displaying updated feature names
X_train.head()


# Model Building

In this section, we train and evaluate different machine learning models to predict the presence of heart disease.  
The goal is to identify which algorithm performs best based on accuracy and other evaluation metrics.

In [None]:
#Importing models 
from sklearn.metrics import confusion_matrix,accuracy_score,roc_curve,classification_report
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

In [None]:
#Logistic Regression
lr = LogisticRegression()
model = lr.fit(X_train, y_train)
lr_predict = lr.predict(X_test)
lr_conf_matrix = confusion_matrix(y_test, lr_predict)
lr_acc_score = accuracy_score(y_test, lr_predict)
print("Confusion matrix")
print(lr_conf_matrix)
print("\n")
print("Accuracy of Logistic Regression:",lr_acc_score*100,'\n')
print(classification_report(y_test,lr_predict))

In [None]:
#Random Forest Classifier
rf = RandomForestClassifier(n_estimators=20, random_state=2,max_depth=5)
rf.fit(X_train,y_train)
rf_predicted = rf.predict(X_test)
rf_conf_matrix = confusion_matrix(y_test, rf_predicted)
rf_acc_score = accuracy_score(y_test, rf_predicted)
print("Confusion matrix")
print(rf_conf_matrix)
print("\n")
print("Accuracy of Random Forest:",rf_acc_score*100,'\n')
print(classification_report(y_test,rf_predicted))

In [None]:
#XGBoost Classifier
xgb = XGBClassifier(learning_rate=0.01, n_estimators=25, max_depth=15, gamma=0.6, subsample=0.52,
                    colsample_bytree=0.6, seed=27, reg_lambda=2, booster='dart',
                    colsample_bylevel=0.6, colsample_bynode=0.5)
xgb.fit(X_train, y_train)
xgb_predicted = xgb.predict(X_test)
xgb_conf_matrix = confusion_matrix(y_test, xgb_predicted)
xgb_acc_score = accuracy_score(y_test, xgb_predicted)
print("Confusion matrix")
print(xgb_conf_matrix)
print("\n")
print("Accuracy of Extreme Gradient Boost:",xgb_acc_score*100,'\n')
print(classification_report(y_test,xgb_predicted))

In [None]:
#DecisionTree Classifier
dt = DecisionTreeClassifier(criterion = 'entropy',random_state=0,max_depth = 6)
dt.fit(X_train, y_train)
dt_predicted = dt.predict(X_test)
dt_conf_matrix = confusion_matrix(y_test, dt_predicted)
dt_acc_score = accuracy_score(y_test, dt_predicted)
print("Confusion matrix")
print(dt_conf_matrix)
print("\n")
print("Accuracy of DecisionTreeClassifier:",dt_acc_score*100,'\n')
print(classification_report(y_test,dt_predicted))

In [None]:
#K-NeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
knn_predicted = knn.predict(X_test)
knn_conf_matrix = confusion_matrix(y_test, knn_predicted)
knn_acc_score = accuracy_score(y_test, knn_predicted)
print("Confusion matrix")
print(knn_conf_matrix)
print("\n")
print("Accuracy of K-NeighborsClassifier:",knn_acc_score*100,'\n')
print(classification_report(y_test,knn_predicted))

In [None]:
#Support Vector Classifier
svc =  SVC(kernel='rbf', C=2)
svc.fit(X_train, y_train)
svc_predicted = svc.predict(X_test)
svc_conf_matrix = confusion_matrix(y_test, svc_predicted)
svc_acc_score = accuracy_score(y_test, svc_predicted)
print("Confusion matrix")
print(svc_conf_matrix)
print("\n")
print("Accuracy of Support Vector Classifier:",svc_acc_score*100,'\n')
print(classification_report(y_test,svc_predicted))

### Model Optimization: Cross-Validation and Hyperparameter Tuning

After training the baseline models, the next step is to improve their performance and reliability through:

1. Cross-Validation:  
   - Ensures the model performs consistently across different subsets of data.  
   - We use 5-fold cross-validation, which splits the data into 5 parts, training on 4 and validating on 1 each time.

2. Hyperparameter Tuning:  
   - Searches for the best combination of parameters to maximize model performance.  
   - We use GridSearchCV, which tests all parameter combinations systematically.

We will perform this for our top-performing models:  
**Random Forest**, **Decision Tree**, **Support Vector Machine (SVM)** and **XGBoost.**
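
As a small illustration of what 5-fold cross-validation does, the sketch below prints the train/validation sizes of each fold and the per-fold accuracy on synthetic data. For classifiers, scikit-learn turns an integer `cv` into stratified folds; the explicit `StratifiedKFold` with shuffling used here is a choice made for this example, not something taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each fold trains on roughly 4/5 of the rows and validates on the remaining 1/5
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in cv.split(X_demo, y_demo)]
print("(train, validation) sizes per fold:", fold_sizes)

scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    X_demo, y_demo, cv=cv, scoring="accuracy",
)
print("Accuracy per fold:", np.round(scores, 3))
print(f"Mean: {scores.mean():.3f}, standard deviation: {scores.std():.3f}")
```

`GridSearchCV` repeats this procedure for every parameter combination and keeps the one with the highest mean score.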


In [None]:
# Importing required libraries
from sklearn.model_selection import GridSearchCV, cross_val_score

# Defining models and parameter grids
models_params = {
    'Random Forest': {
        'model': RandomForestClassifier(random_state=0),
        'params': {
            'n_estimators': [50, 100, 200],
            'max_depth': [3, 5, 7, None],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
    },
    'Decision Tree': {
        'model': DecisionTreeClassifier(random_state=0),
        'params': {
            'criterion': ['gini', 'entropy'],
            'max_depth': [3, 5, 7, None],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
    },
    'Support Vector Machine': {
        'model': SVC(probability=True),
        'params': {
            'C': [0.1, 1, 10],
            'kernel': ['linear', 'rbf', 'poly'],
            'gamma': ['scale', 'auto']
        }
    },
    'XGBoost': {
        'model': XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=0),
        'params': {
            'n_estimators': [50, 100, 200],
            'max_depth': [3, 5, 7],
            'learning_rate': [0.01, 0.1, 0.2],
            'subsample': [0.7, 1.0],
            'colsample_bytree': [0.7, 1.0]
        }
    }
}

# Running GridSearchCV for each model
best_models = {}
for model_name, mp in models_params.items():
    print(f"\n Running GridSearchCV for {model_name}...")
    grid_search = GridSearchCV(mp['model'], mp['params'], cv=5, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    print(f" Best Parameters for {model_name}: {grid_search.best_params_}")
    print(f" Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")
    
    best_models[model_name] = grid_search.best_estimator_

# Evaluating tuned models on the test set
print("\n\n **Model Performance After Tuning:**")
for name, model in best_models.items():
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"\n {name} Accuracy on Test Set: {acc*100:.2f}%")
    print(classification_report(y_test, y_pred))


### Results:

All models were optimized using GridSearchCV with 5-fold cross-validation to improve reliability and performance.  
The table below summarizes the best parameters and corresponding cross-validation accuracy for each model.

| Model | Best Parameters (Key Highlights) | Cross-Val Accuracy |
| :----- | :------------------------------- | ----------------: |
| **Random Forest** | n_estimators=50, max_depth=None, min_samples_split=2 | 0.9878 |
| **Decision Tree** | criterion='entropy', max_depth=None | 0.9854 |
| **SVM** | C=10, kernel='poly', gamma='auto' | 0.9561 |
| **XGBoost** | n_estimators=50, max_depth=5, learning_rate=0.2 | 0.9890 |

Key Insights:
- Random Forest and XGBoost achieved the best cross-validation performance, showing strong generalization power.  
- Decision Tree also performed well, with only slightly lower cross-validation accuracy (0.9854).  
- SVM performed competitively, but slightly below the ensemble models.  
- These tuned models were then evaluated on the test dataset to select the best one for deployment.


In [None]:
from sklearn.metrics import accuracy_score, f1_score

# Comparing tuned models on the test set
model_performance = {}

print("Model Performance Comparison:\n")

for name, model in best_models.items():
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    model_performance[name] = {'Accuracy': round(acc * 100, 2), 'F1-Score': round(f1, 4)}
    print(f"{name:25s} - Accuracy: {acc*100:.2f}% | F1-Score: {f1:.4f}")

# Converting to DataFrame and display
performance_df = (
    pd.DataFrame(model_performance)
    .T
    .sort_values(by='Accuracy', ascending=False)
)

display(performance_df)

# Identifying best model
best_model_name = performance_df.index[0]
best_model = best_models[best_model_name]

print(f"\n **Best Model for Deployment:** {best_model_name}")


In [None]:
# Use the best model from earlier
best_model_name = "Random Forest"  # or whichever you chose
best_model = best_models[best_model_name]

# Perform 10-fold cross-validation on the full dataset
cv_scores = cross_val_score(best_model, X, y, cv=10, scoring='accuracy')

print(f"Cross-Validation Scores ({best_model_name}): {cv_scores}")
print(f"Mean Accuracy: {cv_scores.mean()*100:.2f}%")
print(f"Standard Deviation: {cv_scores.std()*100:.2f}%")


**Model Evaluation and Validation**

After tuning and testing multiple models (Random Forest, Decision Tree, XGBoost, and Support Vector Machine), the Random Forest model achieved the highest and most consistent performance on the test set:
- Accuracy: 100.00%
- F1-Score: 1.0000

To ensure this was not due to overfitting, a 10-fold cross-validation was conducted on the full dataset, yielding:
- Mean Accuracy: 99.71%
- Standard Deviation: 0.88%

**Conclusion:**

The Random Forest model demonstrates excellent predictive power and stable performance across all validation folds, making it the most reliable model for deployment.
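
Scores this close to 100% can also occur when identical rows end up in both the training and the test split, which cross-validation on the same data would not reveal. Whether that applies to this dataset is not established here; the sketch below is one possible check, run on a made-up frame, with the helper name `duplicate_report` being an assumption for illustration.

```python
import pandas as pd


def duplicate_report(df):
    """Count exact duplicate rows; duplicates can leak between train and test splits."""
    n_dupes = int(df.duplicated().sum())
    return {
        "rows": len(df),
        "duplicate_rows": n_dupes,
        "unique_rows": len(df) - n_dupes,
    }


# Made-up frame with two repeated rows
toy = pd.DataFrame({"age": [52, 52, 61, 61, 45], "target": [0, 0, 1, 1, 0]})
report = duplicate_report(toy)
print(report)
```

If `duplicate_report(heart_bits)` showed many repeated rows, re-running the split and evaluation on `heart_bits.drop_duplicates()` would give a more conservative estimate.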

### Feature Importance Visualization

In [None]:
importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': best_models['Random Forest'].feature_importances_
}).sort_values(by='Importance', ascending=True)

# Plot
plt.figure(figsize=(10, 5))
plt.barh(importances['Feature'], importances['Importance'])
plt.title("Feature Importance (Random Forest)", fontsize=14, weight='bold')
plt.xlabel("Importance Score", fontsize=12)
plt.ylabel("Features", fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

**Findings:**
The Random Forest model identifies the most influential factors contributing to heart disease prediction.

- Chest Pain Type is the most significant predictor, strongly associated with heart disease presence.  
- Maximum Heart Rate Achieved and Number of Major Vessels also play critical roles in determining heart health.  
- Thalassemia, ST Depression and Serum Cholesterol show moderate importance, highlighting their contribution to cardiovascular risk.  
- Features like Fasting Blood Sugar and Resting ECG have minimal influence and may be less critical for prediction.

These insights suggest that chest pain type, heart rate and vessel count are the strongest indicators to prioritize in model interpretation and app input design.
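
The ranking above comes from the impurity-based `feature_importances_` attribute. A complementary check, sketched here on synthetic data rather than on the notebook's model, is permutation importance from `sklearn.inspection`, which measures how much held-out accuracy drops when one feature is shuffled.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the two informative features are columns 0 and 1
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set and record the drop in accuracy
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print("Features ranked by permutation importance:", ranking)
```

If both methods agree on the top features, the ranking is less likely to be an artifact of how tree impurity is computed.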


In [None]:
import pickle

# Save the best Random Forest model (the file is closed automatically by the with-block)
filename = 'random_forest_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(best_models["Random Forest"], f)

print("Random Forest model has been saved successfully as 'random_forest_model.sav'")


In [None]:
# Load the saved model back from disk
with open(filename, 'rb') as f:
    load_model = pickle.load(f)

In [None]:
X_test.head()

In [None]:
# Predict on one row of the scaled test set (the model was trained on scaled features)
load_model.predict(X_test.head(1))

In [None]:
# Define feature names used during model training
feature_names = [
    'Age', 'Sex', 'Chest_Pain_Type', 'Resting_BP', 'Serum_Cholesterol',
    'Fasting_Blood_Sugar', 'Resting_ECG', 'Max_Heart_Rate',
    'Exercise_Angina', 'ST_Depression', 'Slope_ST_Segment',
    'Major_Vessels', 'Thalassemia'
]

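# Example input in raw clinical units (replace with actual or user-provided values)
input_data = pd.DataFrame([[
    52,1,0,125,212,0,1,168,0,1,2,2,3
]], columns=feature_names)

# Apply the scaler fitted on the training set, since the model was trained on scaled columns
input_data[cols_to_scale] = scaler.transform(input_data[cols_to_scale])

# Predict using the best Random Forest model
prediction = best_models["Random Forest"].predict(input_data)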

# Convert numeric prediction to meaningful label
prediction_label = "Heart Disease" if prediction[0] == 1 else "No Heart Disease"

print(f"Prediction: {prediction_label}")
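
Because the model was trained on scaled columns, anything that loads the pickle later also needs the fitted scaler. One way to keep the two together, shown as a sketch on synthetic data, is to pickle a single pipeline object; the names `bundle` and `restored` are assumptions for illustration.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features and a target that depends on the first column
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Scaler and model travel together as one object
bundle = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=20, random_state=0),
)
bundle.fit(X_demo, y_demo)

# Round trip through pickle, in memory here; a file works the same way
payload = pickle.dumps(bundle)
restored = pickle.loads(payload)

same_predictions = bool((restored.predict(X_demo) == bundle.predict(X_demo)).all())
print("Restored pipeline reproduces the original predictions:", same_predictions)
```

With this layout an application can pass raw, unscaled inputs to `restored.predict(...)` without handling the scaler separately.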

