# Machine Learning Assignment: Heart Disease Prediction

#1. Select a healthcare dataset that you're enthusiastic about working on. Explain why you chose this dataset for the assignment and how it is impactful to you or your life experiences. Please reference the website with URL for the datasource.

Tasks:

1. Dataset Selection and Explanation
2. Statistical Tests for Feature Selection
3. Machine Learning Model: Logistic Regression
4. Performance Evaluation

## Dataset Selection and Explanation:

We have chosen the Heart Disease dataset for this predictive analysis. Heart disease is one of the leading causes of death worldwide, making early detection crucial for improving patient outcomes. This dataset contains several medical attributes used for predicting heart disease, such as age, sex, cholesterol levels, and more.

This dataset can be impactful because predictive models built on such data can help in early diagnosis, potentially saving lives by preventing severe complications.

**Data Source**: https://archive.ics.uci.edu/dataset/45/heart+disease


In [35]:
# Importing necessary libraries
import pandas as pd
import numpy as np

# Load the dataset
file_path = '/content/heart.csv'
heart = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
heart.head()


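| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |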



The dataset contains several clinical features, such as age, sex, and cholesterol level, which are used to predict heart disease. The `target` column is the label: 1 means the person has heart disease and 0 means they do not.
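Before modelling, it can help to confirm the structure described above. The following is a minimal sketch, not part of the original notebook: `summarize_target` is a hypothetical helper, and the `toy` frame is made-up data used only to show what the summary looks like. On the real data it would be called as `summarize_target(heart)`.

```python
import pandas as pd


def summarize_target(df: pd.DataFrame, target_col: str = "target") -> dict:
    """Return basic structure checks for a labelled dataset (hypothetical helper)."""
    return {
        "n_rows": len(df),
        "n_missing": int(df.isna().sum().sum()),          # total missing cells
        "n_duplicate_rows": int(df.duplicated().sum()),   # rows identical to an earlier row
        "class_counts": df[target_col].value_counts().to_dict(),
    }


# Made-up rows (NOT the real heart data), only to illustrate the output
toy = pd.DataFrame({
    "age": [52, 53, 52, 61],
    "chol": [212, 203, 212, 203],
    "target": [0, 0, 0, 1],
})

summary = summarize_target(toy)
print(summary)
```

Duplicate rows and class imbalance are worth checking early, because both can affect how test-set metrics should be read.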

#2. Use the statistical tests we learned in class (e.g., T-test, ANOVA, Chi-square, Mann-Whitney U test, the Levene test, etc.) to select the best features for your model. Do not use LASSO regression for feature selection.

## Statistical Tests for Feature Selection:

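Statistical tests:

1. T-test: applied to continuous features (more than 10 unique values), comparing the feature between the two `target` groups.
2. Chi-square test: applied to categorical features (10 or fewer unique values), using a contingency table of the feature against `target`.
3. ANOVA: not run separately, because with a binary target a one-way ANOVA is equivalent to the equal-variance t-test.

Features with a p-value < 0.05 were considered statistically significant and selected for the model.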

In [36]:
import pandas as pd
from scipy.stats import ttest_ind, chi2_contingency
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Load the dataset
df = pd.read_csv('heart.csv')

# Separate the features and target variable
X = df.drop('target', axis=1)
y = df['target']

# Convert categorical variables into numerical form
label_encoders = {}
for column in X.select_dtypes(include=['object', 'category']).columns:
    le = LabelEncoder()
    X[column] = le.fit_transform(X[column])
    label_encoders[column] = le

# List to store selected features
selected_features = []

# Chi-square test for categorical features, t-test for continuous features
for column in X.columns:
    if X[column].nunique() <= 10:  # Assuming categorical features have at most 10 unique values
        contingency_table = pd.crosstab(X[column], y)
        chi2, p, _, _ = chi2_contingency(contingency_table)
        if p < 0.05:
            selected_features.append(column)
    else:
        t_stat, p = ttest_ind(X[column][y == 0], X[column][y == 1])
        if p < 0.05:
            selected_features.append(column)

print("Selected features:", selected_features)

Selected features: ['age', 'sex', 'cp', 'trestbps', 'chol', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']



The features selected by the statistical tests (p < 0.05) are:

age, sex, cp (chest pain), trestbps (resting blood pressure), chol, restecg (resting ECG), thalach (maximum heart rate), exang (exercise-induced angina), oldpeak, slope, ca, and thal.

fbs (fasting blood sugar) did not meet the threshold and was excluded.
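The selection loop above calls `ttest_ind` with its default equal-variance assumption. As a possible variation, and not what the notebook ran, the sketch below uses Levene's test (one of the tests named in the assignment prompt) to decide between Student's and Welch's t-test. The helper name `feature_p_value`, the `max_levels` parameter, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, levene, ttest_ind


def feature_p_value(feature: pd.Series, target: pd.Series, max_levels: int = 10) -> float:
    """p-value for the association between one feature and a binary target."""
    if feature.nunique() <= max_levels:
        # Treated as categorical: chi-square test on the contingency table
        table = pd.crosstab(feature, target)
        _, p, _, _ = chi2_contingency(table)
        return float(p)

    # Treated as continuous: compare the two target groups
    group0 = feature[target == 0]
    group1 = feature[target == 1]
    _, p_levene = levene(group0, group1)
    equal_var = bool(p_levene >= 0.05)  # Welch's t-test if variances appear to differ
    _, p = ttest_ind(group0, group1, equal_var=equal_var)
    return float(p)


# Synthetic data (not the heart dataset): one informative and one uninformative feature
rng = np.random.default_rng(0)
target = pd.Series(np.repeat([0, 1], 100))
shifted = pd.Series(np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)]))
balanced_cat = pd.Series(np.tile([0, 1], 100))  # same 50/50 split in both groups

p_shifted = feature_p_value(shifted, target)
p_balanced = feature_p_value(balanced_cat, target)
print(f"shifted: p = {p_shifted:.3g}, balanced_cat: p = {p_balanced:.3g}")
```

With the 0.05 threshold used in this assignment, `shifted` would be selected and `balanced_cat` would not.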



#3. Select a machine learning algorithm for the predictive analysis.
## Machine Learning Model: Logistic Regression

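We train a Logistic Regression model on the selected features, then fit Decision Tree and Random Forest classifiers on the same train/test split for comparison.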


In [37]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Select the significant features from the previous step
X_selected = X[selected_features]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)

# Initialize and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Logistic Regression Results:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")

Logistic Regression Results:
Accuracy: 0.8051948051948052
Precision: 0.7633136094674556
Recall: 0.8657718120805369
F1 Score: 0.8113207547169812
Confusion Matrix:
[[119  40]
 [ 20 129]]


In [38]:
#Decision Tree Algorithm
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Select the significant features from the previous step
X_selected = X[selected_features]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)

# Initialize and train the Decision Tree model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Decision Tree Results:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")

Decision Tree Results:
Accuracy: 0.9707792207792207
Precision: 1.0
Recall: 0.9395973154362416
F1 Score: 0.9688581314878892
Confusion Matrix:
[[159   0]
 [  9 140]]


In [39]:
#Random Forest Algorithm
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Select the significant features from the previous step
X_selected = X[selected_features]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)

# Initialize and train the Random Forest model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Random Forest Results:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")


Random Forest Results:
Accuracy: 0.9902597402597403
Precision: 1.0
Recall: 0.9798657718120806
F1 Score: 0.9898305084745763
Confusion Matrix:
[[159   0]
 [  3 146]]
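The three cells above repeat the same fit, predict, and score steps. One way to avoid the repetition is sketched below; this is an illustrative refactor rather than the code that produced the results above. `evaluate` is a hypothetical helper, and the data comes from `make_classification`, not from the heart dataset, so the printed numbers will differ from the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_train, X_test, y_train, y_test) -> dict:
    """Fit a classifier and return its test-set metrics (hypothetical helper)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }


# Synthetic stand-in data; in the notebook this would be X_selected and y
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
split = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# train_test_split returns X_train, X_test, y_train, y_test in that order
results = {name: evaluate(model, *split) for name, model in models.items()}
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Splitting once and passing the same split to every model keeps the comparison on identical test rows.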


#4. Include performance evaluation metrics for your model

In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, mean_squared_error

# Reuses X_train, X_test, y_train, y_test from the split in the previous cells

# Initialize and train the Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# For 0/1 labels, MSE equals the misclassification rate (1 - accuracy); it is unconventional for classification
mse = mean_squared_error(y_test, y_pred)

# Print the results
print("Performance Metrics:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Confusion Matrix:\n{conf_matrix}")

Performance Metrics:
Accuracy: 0.8051948051948052
Precision: 0.7633136094674556
Recall: 0.8657718120805369
F1 Score: 0.8113207547169812
Mean Squared Error (MSE): 0.19480519480519481
Confusion Matrix:
[[119  40]
 [ 20 129]]
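Each metric above can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below assumes scikit-learn's layout for binary labels, where rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

# Confusion matrix reported for Logistic Regression above
conf_matrix = np.array([[119, 40],
                        [20, 129]])
tn, fp, fn, tp = conf_matrix.ravel()
total = conf_matrix.sum()

accuracy = (tp + tn) / total           # correct predictions over all predictions
precision = tp / (tp + fp)             # predicted positives that are truly positive
recall = tp / (tp + fn)                # true positives that were found
f1 = 2 * precision * recall / (precision + recall)
mse = (fp + fn) / total                # for 0/1 labels this is the error rate

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"MSE:       {mse:.4f}")
```

The values agree with the printed results: 248 of 308 test cases are classified correctly, and the 60 errors split into 40 false positives and 20 false negatives.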


## Conclusion:
In this analysis, we predicted heart disease using Logistic Regression, Decision Tree, and Random Forest models trained on the 12 features selected by the statistical tests. On the 30% held-out test set, Random Forest performed best (accuracy 0.990, F1 0.990), followed by Decision Tree (accuracy 0.971, F1 0.969) and Logistic Regression (accuracy 0.805, F1 0.811). Logistic Regression was the simplest model but the least accurate here. A single Decision Tree is easy to interpret, although such trees are generally prone to overfitting. Models like these could support the early detection of heart disease, leading to better patient outcomes.
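The overfitting concern for decision trees can be examined by comparing training accuracy with cross-validated accuracy. The sketch below is one possible check and was not part of the analysis above; it uses synthetic data from `make_classification`, so its numbers say nothing about the heart dataset. In the notebook, `X_selected` and `y` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

tree = DecisionTreeClassifier(random_state=42)

# Accuracy on the same rows the tree was trained on
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# 5-fold cross-validation: each fold is scored on rows the tree did not see
cv_scores = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"Training accuracy:        {train_acc:.3f}")
print(f"Cross-validated accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

A large gap between the two numbers suggests the tree is memorizing the training rows rather than learning patterns that generalize.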