## Use the Heart Disease [Dataset](https://github.com/cksajil/DSAIRP25/blob/main/datasets/heart_disease.csv) to answer the following questions

## 1. Find the 5 most important features for predicting the target column

In [15]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier

# Load the data and separate the predictors from the target
df = pd.read_csv('heart_disease.csv')
X = df.drop('target', axis=1)
y = df['target']

# Fit a random forest; the fixed seed keeps the importances reproducible
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Rank features by impurity-based importance and keep the top 5
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
top_5_features = feature_importances.sort_values(ascending=False).head(5)
print("Top 5 important features to the target column:")
top_5_features


Top 5 important features to the target column:


| Feature | Importance |
| --- | --- |
| cp | 0.134201 |
| thalach | 0.120473 |
| ca | 0.116755 |
| oldpeak | 0.116151 |
| thal | 0.097043 |
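
As a sanity check on how this kind of ranking behaves, the sketch below fits the same type of model on synthetic data rather than on `heart_disease.csv`. The column names `signal`, `noise_1`, and `noise_2` and the helper `rank_features` are hypothetical. Only `signal` drives the label, so it should come out on top. Impurity-based importances are normalized to sum to 1, so they rank features relative to one another and are not standalone effect sizes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def rank_features(X: pd.DataFrame, y: pd.Series, seed: int = 42) -> pd.Series:
    """Return impurity-based importances sorted from most to least important."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)


# Synthetic data (an assumption for illustration, not the heart disease data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_1": rng.normal(size=300),
    "noise_2": rng.normal(size=300),
})
y_demo = (X_demo["signal"] > 0).astype(int)  # label depends on `signal` only

ranking = rank_features(X_demo, y_demo)
print(ranking)
```

If a ranking like the one above looks surprising on real data, comparing it against a held-out permutation importance is one common cross-check.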


## 2. Apply Box-Cox transformations to relevant features
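
For reference, the one-parameter Box-Cox transform of a value $x$ is usually written as below. The symbols $x$ and $\lambda$ are notation introduced here, not names from the notebook. Because of the logarithm at $\lambda = 0$, the transform requires $x > 0$, which is why the code only transforms columns whose values are all strictly positive.

```latex
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\
\ln x, & \lambda = 0.
\end{cases}
```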

In [23]:
from scipy import stats

# Candidate columns: numeric, excluding the target and these categorical/binary features
numerical_features = df.select_dtypes(include=np.number).columns.tolist()
features_to_transform = [f for f in numerical_features if f not in ['target', 'sex', 'fbs', 'exang', 'restecg', 'ca', 'thal']]

for feature in features_to_transform:
    # Box-Cox is only defined for strictly positive values
    if (df[feature] > 0).all():
        # boxcox returns the transformed data and the fitted lambda; the lambda is discarded here
        df[feature], _ = stats.boxcox(df[feature])
    else:
        print(f"Skipping Box-Cox transformation for '{feature}' as it contains non-positive values.")

print("\nDataFrame after applying Box-Cox transformations:")
print(df.head())

Skipping Box-Cox transformation for 'cp' as it contains non-positive values.
Skipping Box-Cox transformation for 'oldpeak' as it contains non-positive values.
Skipping Box-Cox transformation for 'slope' as it contains non-positive values.
Skipping Box-Cox transformation for 'age_binned' as it contains non-positive values.

DataFrame after applying Box-Cox transformations:
          age  sex  cp  trestbps      chol  fbs  restecg       thalach  exang  \
0  271.022426    1   0  0.326596  3.437567    0        1  31302.761017      0   
1  279.066773    1   0  0.329893  3.408243    1        0  26279.867742      1   
2  427.580007    1   0  0.330859  3.303150    0        1  16471.717088      1   
3  346.254970    1   0  0.331412  3.408243    0        1  28539.281170      0   
4  354.997982    0   0  0.329489  3.655097    1        1  11514.106856      0   

   oldpeak  slope  ca  thal  target  age_binned  age_boxcox  trestbps_boxcox  \
0      1.0      2   2     3       0           2  269.48828
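
The loop above discards the lambda that `stats.boxcox` fits for each column. A minimal sketch, on synthetic right-skewed data rather than the notebook's columns, of keeping that lambda so the transform can be inverted with `scipy.special.inv_boxcox`. All variable names here are illustrative.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Synthetic, strictly positive, right-skewed sample (illustrative only)
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=0.8, size=500)

# With lmbda left unset, boxcox fits lambda by maximum likelihood
transformed, fitted_lambda = stats.boxcox(skewed)

# Keeping lambda makes the transform reversible
recovered = inv_boxcox(transformed, fitted_lambda)

print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skewness before: {stats.skew(skewed):.3f}, after: {stats.skew(transformed):.3f}")
```

For columns that contain zeros or negative values, such as `oldpeak`, `scipy.stats.yeojohnson` is a related transform that accepts non-positive input. Whether it suits this exercise is a judgment call.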

## 3. Bin the age column and add the result as a new column to the dataset

In [17]:
# Split age into 5 equal-width bins; labels=False stores the integer bin index (0 to 4)
df['age_binned'] = pd.cut(df['age'], bins=5, labels=False)
print("\nDataFrame with Age Binning column:")
print(df[['age', 'age_binned']].head())


DataFrame with Age Binning column:
          age  age_binned
0  272.372422           2
1  280.429390           2
2  429.185698           4
3  347.725370           3
4  356.482692           3
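
With an integer `bins`, `pd.cut` splits the observed range into equal-width intervals, and `labels=False` returns the interval index instead of an interval label. The sketch below uses a small hypothetical list of ages (not taken from the dataset) and `retbins=True` to expose the bin edges.

```python
import pandas as pd

# Hypothetical ages for illustration; not rows from heart_disease.csv
ages = pd.Series([29, 40, 50, 60, 77], name="age")

# retbins=True additionally returns the array of bin edges
codes, edges = pd.cut(ages, bins=5, labels=False, retbins=True)

print(codes.tolist())  # bin index per age
print(edges)           # 6 edges delimit 5 equal-width bins
```

`pd.qcut` is the quantile-based counterpart when bins with roughly equal counts are preferred over bins of equal width.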


## 4. Find the most orthogonal feature to the 'chol' feature

In [18]:
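# Pairwise Pearson correlations between all columns
correlation_matrix = df.corr()
chol_correlations = correlation_matrix['chol'].drop('chol')
# "Most orthogonal" is taken to mean the absolute correlation closest to zero
most_orthogonal_feature = chol_correlations.abs().idxmin()
print(f"\nThe most orthogonal feature to 'chol' is: {most_orthogonal_feature}")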


The most orthogonal feature to 'chol' is: slope
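
The code above reads "most orthogonal" as "absolute Pearson correlation closest to zero". That reading rests on the fact that the Pearson correlation of two columns equals the cosine of the angle between their mean-centered vectors, so zero correlation means the centered vectors are orthogonal. A small sketch with made-up columns `x`, `a`, `b`, and `c`, where `centered_cosine` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

# Made-up columns for illustration
demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "a": [2, 4, 6, 8, 10],   # perfectly correlated with x
    "b": [1, -1, 0, -1, 1],  # centered dot product with x is zero
    "c": [5, 3, 4, 2, 1],    # strongly negatively correlated with x
})


def centered_cosine(u: pd.Series, v: pd.Series) -> float:
    """Cosine of the angle between the mean-centered vectors."""
    u_c = u - u.mean()
    v_c = v - v.mean()
    return float(u_c @ v_c / (np.linalg.norm(u_c) * np.linalg.norm(v_c)))


corr_with_x = demo.corr()["x"].drop("x")
most_orthogonal = corr_with_x.abs().idxmin()
print(most_orthogonal)
```

Taking the absolute value matters: without it, `idxmin` would pick the most negatively correlated column (`c` in this sketch) and not the one closest to zero.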
