### About Dataset

This is a multivariate dataset, meaning it involves several separate mathematical or statistical variables and is suited to multivariate numerical data analysis. It is composed of 14 attributes: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels and thalassemia, together with the target `num`, which is used as the dependent variable.

### Aims and objectives
We will settle on these after exploratory data analysis of the dataset. Let's start the project by importing all the libraries that we will need.

## Import Libraries

In [217]:
import pandas as pd
import numpy as np
# import preprocessing libraries
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# Import evaluation metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score

# import Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

### Load Dataset

In [218]:
df = pd.read_csv("heart_disease.csv")

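# Reorder columns: numerical features first, then categorical features, target (num) last.
# Columns not listed in new_order are dropped by this selection.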
new_order = ['age', 'trestbps', 'chol', 'thalch', 'oldpeak', 'ca', 'sex', 'exang', 'dataset', 'cp', 'restecg', 'slope', 'thal', 'num']
df = df[new_order]
df.head()

| | age | trestbps | chol | thalch | oldpeak | ca | sex | exang | dataset | cp | restecg | slope | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 145.0 | 233.0 | 150.0 | 2.3 | 0.0 | Male | False | Cleveland | typical angina | lv hypertrophy | downsloping | fixed defect | 0 |
| 1 | 67 | 160.0 | 286.0 | 108.0 | 1.5 | 3.0 | Male | True | Cleveland | asymptomatic | lv hypertrophy | flat | normal | 2 |
| 2 | 67 | 120.0 | 229.0 | 129.0 | 2.6 | 2.0 | Male | True | Cleveland | asymptomatic | lv hypertrophy | flat | reversable defect | 1 |
| 3 | 37 | 130.0 | 250.0 | 187.0 | 3.5 | 0.0 | Male | False | Cleveland | non-anginal | normal | downsloping | normal | 0 |
| 4 | 41 | 130.0 | 204.0 | 172.0 | 1.4 | 0.0 | Female | False | Cleveland | atypical angina | lv hypertrophy | upsloping | normal | 0 |


### Make independent and dependent variables

In [219]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values


### Dealing with missing values

In [220]:
# Impute missing values using median for numerical columns
imputer_numeric = SimpleImputer(strategy='median')
X[:, 0:6] = imputer_numeric.fit_transform(X[:, 0:6])

# Impute missing values using most_frequent for categorical columns
categorical_columns = [6, 7, 8, 9, 10, 11, 12]
imputer_categorical = SimpleImputer(strategy='most_frequent')
X[:, categorical_columns] = imputer_categorical.fit_transform(X[:, categorical_columns])
X

array([[63.0, 145.0, 233.0, ..., 'lv hypertrophy', 'downsloping',
        'fixed defect'],
       [67.0, 160.0, 286.0, ..., 'lv hypertrophy', 'flat', 'normal'],
       [67.0, 120.0, 229.0, ..., 'lv hypertrophy', 'flat',
        'reversable defect'],
       ...,
       [55.0, 122.0, 223.0, ..., 'st-t abnormality', 'flat',
        'fixed defect'],
       [58.0, 130.0, 385.0, ..., 'lv hypertrophy', 'flat', 'normal'],
       [62.0, 120.0, 254.0, ..., 'lv hypertrophy', 'flat', 'normal']],
      dtype=object)
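The imputers above are fitted on the full dataset before it is split, so the medians and modes also reflect rows that later end up in the test set. A minimal sketch of an alternative, assuming a small made-up frame (`toy` and its values are hypothetical, not taken from `heart_disease.csv`), fits the imputers on the training rows only and then applies them to the test rows:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the heart disease frame; the values are illustrative only.
toy = pd.DataFrame({
    "age": [63, 67, np.nan, 37, 41, 56, 62, 57],
    "chol": [233.0, 286.0, 229.0, np.nan, 204.0, 236.0, 268.0, 354.0],
    "sex": ["Male", "Male", np.nan, "Male", "Female", "Male", "Female", "Female"],
})

# Split first, so the test rows never influence the imputation statistics.
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

# Median for numerical columns, most frequent value for categorical columns.
imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "chol"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["sex"]),
])

train_imputed = imputer.fit_transform(toy_train)  # learn statistics on train only
test_imputed = imputer.transform(toy_test)        # reuse the same statistics on test

print(pd.DataFrame(train_imputed, columns=["age", "chol", "sex"]))
```

Whether this changes the reported accuracies in this notebook has not been checked; it only removes one source of optimistic bias.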

### Dealing with categorical data

In [221]:
labelEncoder_y = LabelEncoder()
y = labelEncoder_y.fit_transform(y)

# Label encode categorical columns in X (consider one-hot encoding if multiple categories)
labelEncoder_X = LabelEncoder()
for col in categorical_columns:
    X[:, col] = labelEncoder_X.fit_transform(X[:, col])

# One-hot encode specified categorical columns in X
cat_encode = [9, 10, 11, 12]  # Use integer indices instead of column names
ct = ColumnTransformer(
    [('one_hot_encode', OneHotEncoder(), cat_encode)],
    remainder='passthrough'
)
X = ct.fit_transform(X)
print(pd.DataFrame(X[:, :]))

      0    1    2    3    4    5    6    7    8    9   ...   12    13     14  \
0    0.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0  ...  0.0  63.0  145.0   
1    1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  ...  0.0  67.0  160.0   
2    1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  ...  1.0  67.0  120.0   
3    0.0  0.0  1.0  0.0  0.0  1.0  0.0  1.0  0.0  0.0  ...  0.0  37.0  130.0   
4    0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  ...  0.0  41.0  130.0   
..   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   ...    ...   
915  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0  ...  0.0  54.0  127.0   
916  0.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  1.0  0.0  ...  0.0  62.0  130.0   
917  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0  ...  0.0  55.0  122.0   
918  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  ...  0.0  58.0  130.0   
919  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  ...  0.0  62.0  120.0   

        15     16   17   18 19 20 21  
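`OneHotEncoder` accepts string categories directly, so the label-encoding pass is not strictly required for the columns that are one-hot encoded afterwards. The sketch below illustrates this on a hypothetical five-row frame (`toy`) that reuses a few values shown in `df.head()`; `sparse_threshold=0` is set here only to force a dense array for printing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mini-frame with one numerical and two categorical columns.
toy = pd.DataFrame({
    "age": [63, 67, 67, 37, 41],
    "cp": ["typical angina", "asymptomatic", "asymptomatic",
           "non-anginal", "atypical angina"],
    "thal": ["fixed defect", "normal", "reversable defect", "normal", "normal"],
})

# One-hot encode the string columns directly; pass the remaining columns through.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = ColumnTransformer(
    [("one_hot_encode", OneHotEncoder(handle_unknown="ignore"), ["cp", "thal"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

encoded = encoder.fit_transform(toy)
print(encoded.shape)  # 4 cp columns + 3 thal columns + age
print(encoded[0])
```

As in the output above, the one-hot columns come first and the passthrough columns are appended at the end.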


### Split data

In [222]:

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
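
If the classes in `num` are unevenly represented, a plain random split can leave some classes under-represented in the test set. A minimal sketch of a stratified split, using synthetic arrays (`X_demo`, `y_demo` are hypothetical and not derived from the dataset), could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced three-class target, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 50 + [1] * 30 + [2] * 20)

# stratify=y_demo keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("train class counts:", np.bincount(y_tr))
print("test class counts:", np.bincount(y_te))
```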


### Standardize data

In [223]:

# Feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# Print the first few rows of X_train for verification
print("X_train after preprocessing:")
print(pd.DataFrame(X_train[:5, :]))

X_train after preprocessing:
         0         1         2        3         4         5         6   \
0 -1.073334 -0.488935  1.817717 -0.20253 -0.508055  0.804072 -0.480384   
1 -1.073334  2.045262 -0.550141 -0.20253 -0.508055  0.804072 -0.480384   
2  0.931676 -0.488935 -0.550141 -0.20253 -0.508055 -1.243669  2.081666   
3 -1.073334 -0.488935  1.817717 -0.20253  1.968292 -1.243669 -0.480384   
4 -1.073334  2.045262 -0.550141 -0.20253 -0.508055  0.804072 -0.480384   

         7         8         9   ...        12        13        14        15  \
0 -0.261180  0.636066 -0.537556  ... -0.510171 -0.879721  0.159825  0.219439   
1 -0.261180  0.636066 -0.537556  ... -0.510171 -1.197839 -0.681599 -0.029345   
2  3.828782 -1.572164 -0.537556  ... -0.510171  1.028983 -0.345030 -1.835336   
3 -0.261180 -1.572164  1.860270  ... -0.510171 -0.985760  0.440299  0.330010   
4 -0.261180  0.636066 -0.537556  ... -0.510171 -0.137447  0.440299  0.219439   

         16        17        18        19    

### Train model on Decision Tree Classifier

In [224]:


# Initialize the Decision Tree model
dt_model = DecisionTreeClassifier()

# Fit the model on the training data
dt_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = dt_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Perform cross-validation
scores = cross_val_score(dt_model, X_train, y_train, cv=5)
mean_accuracy = scores.mean()

# Print the performance metrics
print('Decision Tree Model Accuracy:', accuracy)
print("Cross-validation Accuracy:", mean_accuracy)


Decision Tree Model Accuracy: 0.4945652173913043
Cross-validation Accuracy: 0.49863945578231295
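
Accuracy alone does not show which classes are being confused. `confusion_matrix` and `classification_report` are imported at the top of the notebook but not used, so here is a small sketch of how they could be applied. The label lists below are made up for illustration and are not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels for a three-class problem.
y_true_demo = [0, 0, 0, 1, 1, 2, 2, 0]
y_pred_demo = [0, 0, 1, 1, 0, 2, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print("Accuracy:", acc)
# Per-class precision, recall and F1; zero_division=0 avoids warnings
# when a class is never predicted.
print(classification_report(y_true_demo, y_pred_demo, zero_division=0))
```

In the notebook itself, the same calls would take `y_test` and `y_pred` from the cell above.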


### Train Model on KNN Classifier

In [225]:

# Initialize the KNN model
knn_model = KNeighborsClassifier()

# Fit the model on the training data
knn_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = knn_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Perform cross-validation
scores = cross_val_score(knn_model, X_train, y_train, cv=5)
mean_accuracy = scores.mean()

# Print the performance metrics
print('KNN Model Accuracy:', accuracy)
print("Cross-validation Accuracy:", mean_accuracy)


KNN Model Accuracy: 0.5054347826086957
Cross-validation Accuracy: 0.5665839308696452


### Train Model on SVM

In [226]:

# Initialize the SVM model
svm_model = SVC()

# Fit the model on the training data
svm_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Perform cross-validation
scores = cross_val_score(svm_model, X_train, y_train, cv=5)
mean_accuracy = scores.mean()

# Print the performance metrics
print('SVM Model Accuracy:', accuracy)
print("Cross-validation Accuracy:", mean_accuracy)


SVM Model Accuracy: 0.5271739130434783
Cross-validation Accuracy: 0.5842526199669058
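
The three training cells repeat the same fit, predict and cross-validate steps. One possible way to compare the models in a single loop is sketched below, under the assumption that scaling is wrapped in a pipeline so it is refitted inside each cross-validation fold. The data come from `make_classification` and are synthetic, so the scores say nothing about the heart disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the heart disease dataset.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}

cv_results = {}
for name, model in models.items():
    # The scaler is fitted on the training part of each fold only.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
    cv_results[name] = scores.mean()
    print(f"{name} cross-validation accuracy: {cv_results[name]:.3f}")
```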
