Problem Statement:
You are a data scientist at a medical research facility. The facility wants you to
build a machine learning model that classifies, from a patient's data, whether the
patient is at risk of a heart attack.
Heart Disease Dataset:
UCI Heart Disease Dataset
(https://archive.ics.uci.edu/ml/datasets/Heart+Disease?spm=5176.100239.blogcont54260.8.TRNGoO)
Lab Environment:
Jupyter Notebooks
Domain:
Healthcare
Tasks To Be Performed:
1. Data Analysis:
a. Import the dataset
b. Get information about the dataset (mean, max, min, quartiles etc.)
c. Find the correlation between all fields
2. Data Visualization:
a. Visualize the number of patients having a heart disease and not having
a heart disease
b. Visualize the age and whether a patient has disease or not
c. Visualize correlation between all features using a heat map
3. Logistic Regression:
a. Build a simple logistic regression model:
i. Divide the dataset in 70:30 ratio
ii. Build the model on train set and predict the values on test set
iii. Build the confusion matrix and get the accuracy score
4. Decision Tree:
a. Build a decision tree model:
i. Divide the dataset in 70:30 ratio
ii. Build the model on train set and predict the values on test set
iii. Build the confusion matrix and calculate the accuracy
iv. Visualize the decision tree using the Graphviz package
5. Random Forest:
a. Build a Random Forest model:
i. Divide the dataset in 70:30 ratio
ii. Build the model on train set and predict the values on test set
iii. Build the confusion matrix and calculate the accuracy
iv. Visualize the model using the Graphviz package
6. Select the best model
a. Print the confusion matrix of all classifiers
b. Print the classification report of all classifiers
c. Calculate Recall, Precision, and F1 score of all the models
d. Visualize confusion matrix using heatmaps
e. Select the best model based on the highest accuracy

In [1]:
pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
Note: you may need to restart the kernel to use updated packages.


In [2]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 
  
# data (as pandas dataframes) 
X = heart_disease.data.features 
y = heart_disease.data.targets 
  
# metadata 
print(heart_disease.metadata) 
  
# variable information 
print(heart_disease.variables) 


{'uci_id': 45, 'name': 'Heart Disease', 'repository_url': 'https://archive.ics.uci.edu/dataset/45/heart+disease', 'data_url': 'https://archive.ics.uci.edu/static/public/45/data.csv', 'abstract': '4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 303, 'num_features': 13, 'feature_types': ['Categorical', 'Integer', 'Real'], 'demographics': ['Age', 'Sex'], 'target_col': ['num'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1989, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C52P4X', 'creators': ['Andras Janosi', 'William Steinbrunn', 'Matthias Pfisterer', 'Robert Detrano'], 'intro_paper': {'ID': 231, 'type': 'NATIVE', 'title': 'International application of a new probability algorithm for the diagnosis of coronary artery disease.', 'authors': 'R. Detrano, A. Jánosi, W. Steinbrunn, M

In [None]:
1. Data Analysis:
a. Import the dataset
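import pandas as pd

# Combine the features (X) and target (y) fetched above; 'num' > 0 means disease present
df = pd.concat([X, y], axis=1).dropna()  # drop the few rows with missing values
df['target'] = (df['num'] > 0).astype(int)
df = df.drop(columns='num')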

b. Get information about the dataset (mean, max, min, quartiles etc.)
# Get statistical summary
df.describe()

c. Find the correlation between all fields
# Calculate correlation matrix
correlation_matrix = df.corr()
print(correlation_matrix)
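
Since the dataset metadata reports missing values, it can help to look at data types and missing counts next to the correlations. The following is a minimal sketch: `summarize` is a hypothetical helper, the `target` column name follows the assumption used in this notebook, and the toy frame is made up purely for illustration (it is not patient data).

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "target") -> pd.DataFrame:
    """Per-column dtype, missing-value count, and correlation with the target."""
    numeric = df.select_dtypes("number")
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "corr_with_target": numeric.corr()[target],
    })


# Toy frame for illustration only; the values are invented, not real records
toy = pd.DataFrame({
    "age": [63, 45, 58, 39],
    "chol": [233.0, None, 286.0, 199.0],
    "target": [1, 0, 1, 0],
})

summary = summarize(toy)
print(summary)
```

In the notebook, `summarize(df)` would give the same overview for the real data.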


In [None]:
2. Data Visualization:
a. Visualize the number of patients having a heart disease and not having
a heart disease
import seaborn as sns
import matplotlib.pyplot as plt

# Count plot for heart disease presence (assuming 'target' column represents disease status)
sns.countplot(x='target', data=df)
plt.show()

b. Visualize the age and whether a patient has disease or not
# Boxplot to show the relationship between age and heart disease
sns.boxplot(x='target', y='age', data=df)
plt.show()

c. Visualize correlation between all features using a heat map
# Heatmap for correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()


In [None]:
3. Logistic Regression:
a. Build a simple logistic regression model:
i. Divide the dataset in 70:30 ratio
from sklearn.model_selection import train_test_split

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

ii. Build the model on train set and predict the values on test set
from sklearn.linear_model import LogisticRegression

# Logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

iii. Build the confusion matrix and get the accuracy score
from sklearn.metrics import confusion_matrix, accuracy_score

# Confusion matrix and accuracy
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print("Confusion Matrix:\n", conf_matrix)
print("Accuracy:", accuracy)
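
A possible variation on the model above: logistic regression usually converges more easily when the features are standardized, and a stratified split keeps the class balance the same in both sets. The sketch below assumes stand-in data from `make_classification` with the same shape as the dataset (303 rows, 13 features), so its numbers say nothing about the real data. The names `X_demo`, `y_demo`, and `scaled_model` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): synthetic, only mimics the dataset's shape
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

# 70:30 split, stratified so both sets keep the same class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# The scaler is fit on the training set only, then applied to the test set
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

y_hat = scaled_model.predict(X_te)
demo_conf = confusion_matrix(y_te, y_hat)
demo_acc = accuracy_score(y_te, y_hat)

print("Confusion Matrix:\n", demo_conf)
print("Accuracy:", demo_acc)
```

To try this on the real data, replace `X_demo` and `y_demo` with the notebook's `X` and `y`.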


In [None]:
4. Decision Tree:
a. Build a decision tree model:
i. Divide the dataset in 70:30 ratio
ii. Build the model on train set and predict the values on test set
from sklearn.tree import DecisionTreeClassifier

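# Decision tree model, trained on the same 70:30 split created in step 3
tree_model = DecisionTreeClassifier(random_state=42)  # fixed seed for reproducible results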
tree_model.fit(X_train, y_train)

# Predictions
y_pred_tree = tree_model.predict(X_test)

iii. Build the confusion matrix and calculate the accuracy
conf_matrix_tree = confusion_matrix(y_test, y_pred_tree)
accuracy_tree = accuracy_score(y_test, y_pred_tree)

print("Decision Tree Confusion Matrix:\n", conf_matrix_tree)
print("Decision Tree Accuracy:", accuracy_tree)

iv. Visualize the decision tree using the Graphviz package
from sklearn.tree import plot_tree

# Visualize the decision tree with scikit-learn's matplotlib-based plot_tree
plt.figure(figsize=(15, 10))
plot_tree(tree_model, filled=True, feature_names=list(X.columns))
plt.show()
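
The task asks for Graphviz, while `plot_tree` draws with matplotlib. One way to get Graphviz output is scikit-learn's `export_graphviz`, which returns the tree as DOT source. This is a sketch on stand-in data from `make_classification`; the feature names, class names, and `max_depth=3` are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Stand-in data (assumption): synthetic, only mimics the dataset's shape
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# A shallow tree keeps the exported graph readable
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# out_file=None makes export_graphviz return the DOT source as a string
dot_source = export_graphviz(
    demo_tree,
    out_file=None,
    feature_names=feature_names,
    class_names=["no disease", "disease"],
    filled=True,
    rounded=True,
)
print(dot_source[:200])

# Rendering requires the optional `graphviz` Python package and the Graphviz binaries:
#   import graphviz
#   graphviz.Source(dot_source)  # displays inline in Jupyter
```

The same call should work for a single tree of a random forest, for example by passing `rf_model.estimators_[0]` in place of `demo_tree`.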


In [None]:
5. Random Forest:
a. Build a Random Forest model:
i. Divide the dataset in 70:30 ratio
ii. Build the model on train set and predict the values on test set
from sklearn.ensemble import RandomForestClassifier

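# Random Forest model, trained on the same 70:30 split created in step 3
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)  # fixed seed for reproducible results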
rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)

iii. Build the confusion matrix and calculate the accuracy
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

print("Random Forest Confusion Matrix:\n", conf_matrix_rf)
print("Random Forest Accuracy:", accuracy_rf)

iv. Visualize the model using the Graphviz package
# Visualize one tree of the forest with scikit-learn's matplotlib-based plot_tree
plt.figure(figsize=(15, 10))
plot_tree(rf_model.estimators_[0], filled=True, feature_names=list(X.columns))
plt.show()


In [None]:
6. Select the best model
a. Print the confusion matrix of all classifiers
b. Print the classification report of all classifiers
from sklearn.metrics import classification_report

# Logistic Regression classification report
print("Logistic Regression Classification Report:\n", classification_report(y_test, y_pred))

# Decision Tree classification report
print("Decision Tree Classification Report:\n", classification_report(y_test, y_pred_tree))

# Random Forest classification report
print("Random Forest Classification Report:\n", classification_report(y_test, y_pred_rf))

c. Calculate Recall, Precision, and F1 score of all the models
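
The classification reports already contain these numbers, but collecting them in one table makes the models easier to compare. A minimal sketch, assuming binary labels (0 = no disease, 1 = disease): `metrics_table` is a hypothetical helper, and the toy labels and predictions are invented only to show the call.

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score


def metrics_table(y_true, predictions):
    """One row per model with precision, recall, and F1 for the positive class."""
    rows = {
        name: {
            "precision": precision_score(y_true, y_hat),
            "recall": recall_score(y_true, y_hat),
            "f1": f1_score(y_true, y_hat),
        }
        for name, y_hat in predictions.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy labels and predictions for illustration only; not results from the dataset
y_true = [0, 1, 1, 0, 1, 0]
toy_predictions = {
    "Logistic Regression": [0, 1, 1, 0, 0, 0],
    "Decision Tree": [1, 1, 0, 0, 1, 1],
    "Random Forest": [0, 1, 1, 1, 0, 0],
}

table = metrics_table(y_true, toy_predictions)
print(table)
```

In the notebook, the call would be `metrics_table(y_test, {"Logistic Regression": y_pred, "Decision Tree": y_pred_tree, "Random Forest": y_pred_rf})`. For a multi-class target, the three score functions would need an `average` argument such as `average="macro"`.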
d. Visualize confusion matrix using heatmaps
# Heatmap for confusion matrix (e.g., Random Forest)
sns.heatmap(conf_matrix_rf, annot=True, fmt='d', cmap='Blues')
plt.show()
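
The heatmap above covers only the Random Forest. To show all classifiers side by side, the same seaborn call can be repeated on a row of subplots. This is a sketch: `plot_confusion_heatmaps` is a hypothetical helper, and the toy labels and predictions are invented for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix


def plot_confusion_heatmaps(y_true, predictions):
    """Draw one confusion-matrix heatmap per model on a single row of axes."""
    n_models = len(predictions)
    fig, axes = plt.subplots(1, n_models, figsize=(5 * n_models, 4), squeeze=False)
    for ax, (name, y_hat) in zip(axes[0], predictions.items()):
        sns.heatmap(confusion_matrix(y_true, y_hat), annot=True, fmt="d",
                    cmap="Blues", cbar=False, ax=ax)
        ax.set_title(name)
        ax.set_xlabel("Predicted")
        ax.set_ylabel("Actual")
    fig.tight_layout()
    return fig


# Toy labels and predictions for illustration only; not results from the dataset
y_true = [0, 1, 1, 0, 1, 0]
toy_predictions = {
    "Logistic Regression": [0, 1, 1, 0, 0, 0],
    "Decision Tree": [1, 1, 0, 0, 1, 1],
    "Random Forest": [0, 1, 1, 1, 0, 0],
}

fig = plot_confusion_heatmaps(y_true, toy_predictions)
```

In the notebook, pass `y_test` and a dictionary of the three prediction arrays (`y_pred`, `y_pred_tree`, `y_pred_rf`).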

e. Select the best model based on the highest accuracy
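
One way to make the selection explicit is to compute every model's accuracy on the test set and take the maximum. A minimal sketch: `select_best_model` is a hypothetical helper, and the toy labels and predictions are invented only to demonstrate the call.

```python
from sklearn.metrics import accuracy_score


def select_best_model(y_true, predictions):
    """Return the name of the most accurate model and all accuracy scores."""
    scores = {name: accuracy_score(y_true, y_hat) for name, y_hat in predictions.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores


# Toy labels and predictions for illustration only; not results from the dataset
y_true = [0, 1, 1, 0, 1, 0]
toy_predictions = {
    "Logistic Regression": [0, 1, 1, 0, 0, 0],  # 5 of 6 correct
    "Decision Tree": [1, 1, 0, 0, 1, 1],        # 3 of 6 correct
    "Random Forest": [0, 1, 1, 1, 0, 0],        # 4 of 6 correct
}

best_name, all_scores = select_best_model(y_true, toy_predictions)
print("Accuracies:", all_scores)
print("Best model:", best_name)
```

In the notebook, the call would be `select_best_model(y_test, {"Logistic Regression": y_pred, "Decision Tree": y_pred_tree, "Random Forest": y_pred_rf})`. With roughly 90 test rows, small accuracy differences may not be meaningful, and for a screening task recall on the disease class may deserve a look alongside accuracy.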