### Dataset Overview

**Decision Support System for Healthcare Professionals: Determining Whether Patients Have Heart Disease**

The dataset used in this project is a subset of the Heart Disease dataset from the original Heart Disease
data repository.
The original dataset contains 76 attributes; the subset used here contains only 14.

Our field of interest is the "target" variable, which indicates whether the patient likely suffers from heart
disease.
The primary objective of this project is to create a web application that acts as a decision support system,
using data exploration and machine learning techniques to determine whether a patient has heart disease and
therefore needs treatment.

A second objective is to gain a practical understanding of how decision support systems are used and
implemented in an organizational context. The project also aims to test the student's Python programming and
problem-solving skills.

### Question 1 - SQLite Database Connection

**Create and set up a connection to a SQLite database that you will be reading your data from.**

In [1]:
# Import Libraries
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [2]:
# Load the dataset
data = pd.read_csv("heart.csv")
print(data.shape)
data.head()

(303, 1)


| | age;sex;cp;trestbps;chol;fbs;restecg;thalach;exang;oldpeak;slope;ca;thal;target |
|---|---|
| 0 | 63;1;3;145;233;1;0;150;0;2.3;0;0;1;1 |
| 1 | 37;1;2;130;250;0;1;187;0;3.5;0;0;2;1 |
| 2 | 41;0;1;130;204;0;0;172;0;1.4;2;0;2;1 |
| 3 | 56;1;1;120;236;0;1;178;0;0.8;2;0;2;1 |
| 4 | 57;0;0;120;354;0;1;163;1;0.6;2;0;2;1 |
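The shape `(303, 1)` and the semicolons inside the single column name suggest that `heart.csv` is semicolon-delimited rather than comma-delimited. A minimal sketch of checking this with the standard `csv` module, using two rows copied from the output above as an in-memory stand-in for the file (`sample` and `column_count` are illustrative names, not part of the project):

```python
import csv
import io

# Header and two data rows copied from the output above;
# an in-memory stand-in for heart.csv, not the real file
sample = (
    "age;sex;cp;trestbps;chol;fbs;restecg;thalach;exang;oldpeak;slope;ca;thal;target\n"
    "63;1;3;145;233;1;0;150;0;2.3;0;0;1;1\n"
    "37;1;2;130;250;0;1;187;0;3.5;0;0;2;1\n"
)

def column_count(text, delimiter):
    """Return how many fields csv.reader finds in the header for a given delimiter."""
    header = next(csv.reader(io.StringIO(text), delimiter=delimiter))
    return len(header)

comma_columns = column_count(sample, ",")      # whole header treated as one field
semicolon_columns = column_count(sample, ";")  # one field per attribute

print(comma_columns, semicolon_columns)
```

If the file really is semicolon-delimited, passing `sep=";"` to `pd.read_csv` should yield one column per attribute instead of a single wide column.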


In [3]:
# Create and set up the SQLite database

# Connect to a SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('heart_disease.db')

# Sanitize column names
data.columns = [c.replace(' ', '_').replace('-', '_') for c in data.columns]

# Ensure all data is in a compatible format
# data = data.astype(str)  # Uncomment if necessary

# Write the DataFrame to a SQL table
data.to_sql('heart_disease', conn, if_exists='replace', index=False)

# Verify the data was written correctly
query = 'SELECT * FROM heart_disease LIMIT 5;'
result = pd.read_sql(query, conn)
print(result)

# Close the connection
conn.close()


  age;sex;cp;trestbps;chol;fbs;restecg;thalach;exang;oldpeak;slope;ca;thal;target
0               63;1;3;145;233;1;0;150;0;2.3;0;0;1;1                             
1               37;1;2;130;250;0;1;187;0;3.5;0;0;2;1                             
2               41;0;1;130;204;0;0;172;0;1.4;2;0;2;1                             
3               56;1;1;120;236;0;1;178;0;0.8;2;0;2;1                             
4               57;0;0;120;354;0;1;163;1;0.6;2;0;2;1                             
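For reference, the write-then-read round trip against SQLite can be sketched with the standard library alone. The example below assumes an in-memory database and a hypothetical three-column table `heart_demo`, so it does not touch `heart_disease.db`; the rows reuse a few values from the output above purely for illustration.

```python
import sqlite3
from contextlib import closing

# Hypothetical rows (age, chol, target) for illustration only
rows = [(63, 233, 1), (37, 250, 1), (41, 204, 1)]

# closing() makes sure the connection is closed even if a statement fails
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE heart_demo (age INTEGER, chol INTEGER, target INTEGER)")
    # Parameterized inserts avoid assembling SQL strings by hand
    conn.executemany("INSERT INTO heart_demo VALUES (?, ?, ?)", rows)
    conn.commit()

    # Read the data back to confirm the write succeeded
    row_count = conn.execute("SELECT COUNT(*) FROM heart_demo").fetchone()[0]
    oldest = conn.execute(
        "SELECT age, chol FROM heart_demo ORDER BY age DESC LIMIT 1"
    ).fetchone()

print(row_count, oldest)
```

Comparing the row count read back from the database with the number of rows written is a quick check that nothing was lost along the way.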


### Question 2: Data Preprocessing and Visualization

Once you have established a connection to the database, you need to transform (preprocess) your data to
get it into a more consistent, accurate, and reliable format. This will then allow you to explore and visualize
the data to gain significant insights into it.

**2.1 Preprocessing and visualizing the data<br>
a. Perform any necessary cleaning and preprocessing of the data.**

In [4]:
# Check info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 1 columns):
 #   Column                                                                           Non-Null Count  Dtype 
---  ------                                                                           --------------  ----- 
 0   age;sex;cp;trestbps;chol;fbs;restecg;thalach;exang;oldpeak;slope;ca;thal;target  303 non-null    object
dtypes: object(1)
memory usage: 2.5+ KB


In [5]:
# Check for missing values
print(data.isnull().sum())


age;sex;cp;trestbps;chol;fbs;restecg;thalach;exang;oldpeak;slope;ca;thal;target    0
dtype: int64


In [6]:
# Describe data
data.describe()

| | age;sex;cp;trestbps;chol;fbs;restecg;thalach;exang;oldpeak;slope;ca;thal;target |
|---|---|
| count | 303 |
| unique | 302 |
| top | 38;1;2;138;175;0;1;173;0;0;2;4;2;1 |
| freq | 2 |


In [7]:
# Check columns
data.columns

Index(['age;sex;cp;trestbps;chol;fbs;restecg;thalach;exang;oldpeak;slope;ca;thal;target'], dtype='object')

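In [8]:
# The checks above show a single column: heart.csv is semicolon-delimited,
# so reload it with the correct separator before renaming anything
data = pd.read_csv("heart.csv", sep=";")

# Rename columns to descriptive names
data.rename(columns={
    'sex': 'gender',
    'cp': 'chest_pain_type',
    'trestbps': 'resting_blood_pressure',
    'chol': 'serum_cholesterol',
    'fbs': 'fasting_blood_sugar',
    'restecg': 'resting_electrocardiographic_results',
    'thalach': 'maximum_heart_rate_achieved',
    'exang': 'exercise_induced_angina',
    'oldpeak': 'ST_depression',
    'slope': 'slope_of_peak_exercise_ST_segment',
    'ca': 'number_of_major_vessels',
    'thal': 'thalassemia'
}, inplace=True)

# Replace the single-column table written earlier with the cleaned data
conn = sqlite3.connect('heart_disease.db')
data.to_sql('heart_disease', conn, if_exists='replace', index=False)
conn.close()


In [9]:
data.head()

| | age | gender | chest_pain_type | resting_blood_pressure | serum_cholesterol | fasting_blood_sugar | resting_electrocardiographic_results | maximum_heart_rate_achieved | exercise_induced_angina | ST_depression | slope_of_peak_exercise_ST_segment | number_of_major_vessels | thalassemia | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |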


**b. Plot the distribution of classes for the (8) categorical variables based on the target variable. Provide any observations that can be derived from these plots.**

In [10]:
categorical_vars = ['gender', 'chest_pain_type', 'fasting_blood_sugar', 'resting_electrocardiographic_results',
                    'exercise_induced_angina', 'slope_of_peak_exercise_ST_segment', 'number_of_major_vessels', 'thalassemia']

# Plotting the distribution of each categorical variable based on heart disease presence
for var in categorical_vars:
    plt.figure(figsize=(10, 6))
    sns.countplot(x=var, hue='target', data=data)
    plt.title(f'Distribution of {var} based on heart disease presence')
    plt.show()



Sex: Higher proportion of males diagnosed with heart disease.

Chest Pain Type (cp): Certain chest pain types (e.g., type 3) are more common in patients without heart disease.

Fasting Blood Sugar (fbs): A fasting blood sugar above 120 mg/dl is not a significant indicator of heart disease.

Resting ECG (restecg): Specific ECG results correlate with heart disease.

Exercise Induced Angina (exang): Patients with exercise-induced angina are more likely to have heart disease.

Slope of the Peak Exercise ST Segment (slope): Certain slopes are associated with heart disease.

Number of Major Vessels (ca): Higher number of major vessels colored by fluoroscopy is associated with heart disease.

Thalassemia (thal): Certain thalassemia types are more common in heart disease patients.
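Observations like these are easier to verify against numbers than against bar heights alone. A minimal sketch, assuming a small hypothetical DataFrame `demo` (made-up values, not the project data) with one categorical column and a binary `target`, of tabulating the share of each class per category with `pd.crosstab`:

```python
import pandas as pd

# Made-up values for illustration; they do not reflect the heart disease data
demo = pd.DataFrame({
    "exercise_induced_angina": [0, 0, 0, 0, 1, 1, 1, 1],
    "target":                  [1, 1, 1, 0, 0, 0, 0, 1],
})

# normalize="index" converts counts into row-wise proportions,
# i.e. the share of target=0 and target=1 within each category
shares = pd.crosstab(
    demo["exercise_induced_angina"], demo["target"], normalize="index"
)
print(shares)
```

Applied to each name in `categorical_vars`, a table like this would put a figure next to each statement about which categories are more common among patients with heart disease.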

**c. Plot the distribution of classes for the numeric variables based on the target variable. Provide any observations (at least 5) that can be derived from these plots.**

In [11]:
numeric_vars = ['age', 'resting_blood_pressure', 'serum_cholesterol', 'maximum_heart_rate_achieved', 'ST_depression']

# Plotting the distribution of each numeric variable based on heart disease presence
for var in numeric_vars:
    plt.figure(figsize=(10, 6))
    sns.histplot(data=data, x=var, hue='target', kde=True)
    plt.title(f'Distribution of {var} based on heart disease presence')
    plt.show()



Age: Older age groups have a higher likelihood of heart disease.

Resting Blood Pressure (trestbps): Higher resting blood pressure is common in patients with heart disease.

Cholesterol (chol): Higher cholesterol levels are more frequent in heart disease patients.

Maximum Heart Rate Achieved (thalach): Lower maximum heart rates are often observed in patients with heart disease.

ST Depression (oldpeak): Higher values of ST depression induced by exercise are linked with heart disease.
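For numeric variables, a per-class summary can back up what the histograms suggest. A minimal sketch, again on a hypothetical DataFrame `demo` with made-up ages, of comparing the mean and median between the two classes with `groupby`:

```python
import pandas as pd

# Made-up ages for illustration; they do not reflect the heart disease data
demo = pd.DataFrame({
    "age":    [40, 50, 60, 45, 55, 65],
    "target": [1, 1, 1, 0, 0, 0],
})

# One row per class, one column per statistic
summary = demo.groupby("target")["age"].agg(["mean", "median"])
print(summary)
```

Running the same aggregation over each name in `numeric_vars` would show whether the group differences are large enough to support each observation.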

### Question 3 - Modelling Heart Disease Prediction Problem Through Machine Learning

**3.1 Get your data ready for fitting a machine learning model on it by performing the appropriate preprocessing techniques.**

In [12]:
# Split Data into Features and Target:

from sklearn.model_selection import train_test_split

X = data.drop('target', axis=1)
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [13]:
# Standardize the Data:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
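
The scaler is fitted on the training split only and then reused on the test split, so no information from the test data leaks into the preprocessing. A minimal standard-library sketch of the same idea with hypothetical values; like `StandardScaler`, it uses the population standard deviation:

```python
from statistics import fmean, pstdev

# Hypothetical resting blood pressure values, for illustration only
train = [120.0, 130.0, 140.0]
test = [150.0]

mu = fmean(train)       # mean learned from the training data only
sigma = pstdev(train)   # population standard deviation (ddof = 0)

def standardize(values):
    """Apply z = (x - mu) / sigma using the training-set statistics."""
    return [(x - mu) / sigma for x in values]

train_scaled = standardize(train)
test_scaled = standardize(test)   # reuses mu and sigma; nothing is re-fitted

print(train_scaled, test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction, while the test values generally do not, which is expected.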


**3.2 Select 3 appropriate machine learning models for your heart disease prediction problem. Provide a short explanation of each chosen model as well as two advantages and disadvantages of each. Use the three models to fit your data and perform predictions on it, then determine which model performs the best. Save the model to disk.**

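**Logistic Regression:** A linear model that estimates the probability of the positive class by passing a weighted sum of the features through the logistic (sigmoid) function.

* Advantages: Simple and easy to interpret; works well for binary classification.
* Disadvantages: Assumes a linear relationship between the features and the log-odds of the outcome; can underperform when the true relationships are complex.

**Random Forest:** An ensemble of decision trees, each trained on a bootstrap sample of the data, whose predictions are aggregated into a single prediction.

* Advantages: Handles non-linear relationships well; less prone to overfitting than a single decision tree.
* Disadvantages: Can be slow with large datasets; less interpretable than logistic regression.

**Support Vector Machine (SVM):** A classifier that looks for the decision boundary with the largest margin between the classes, optionally in a feature space induced by a kernel function.

* Advantages: Effective in high-dimensional spaces; versatile through different kernel functions.
* Disadvantages: Can be memory-intensive on large datasets; sensitive to the choice of kernel and hyperparameters, which makes it hard to tune.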

In [14]:
# Fit and evaluate models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
lr_acc = accuracy_score(y_test, y_pred_lr)

# Random Forest (fixed seed so the comparison is reproducible)
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
rf_acc = accuracy_score(y_test, y_pred_rf)

# Support Vector Machine
svm = SVC()
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
svm_acc = accuracy_score(y_test, y_pred_svm)

# Determine the best model by test-set accuracy
accuracies = {'Logistic Regression': lr_acc, 'Random Forest': rf_acc, 'SVM': svm_acc}
best_model_name = max(accuracies, key=accuracies.get)
best_model = {'Logistic Regression': lr, 'Random Forest': rf, 'SVM': svm}[best_model_name]

print(f"Best model: {best_model_name} with accuracy {accuracies[best_model_name]:.3f}")

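
Accuracy alone can hide the kind of error that matters most in a screening context, where a missed case (false negative) is usually costlier than a false alarm. A minimal sketch, with hypothetical labels rather than the project's predictions, of deriving recall and precision from the confusion counts:

```python
# Hypothetical labels for illustration; 1 = heart disease, 0 = no heart disease
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # correctly flagged patients
fn = sum(t == 1 and p == 0 for t, p in pairs)  # missed patients
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
tn = sum(t == 0 and p == 0 for t, p in pairs)  # correctly cleared patients

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)       # share of actual cases that were detected
precision = tp / (tp + fp)    # share of flagged cases that were real

print(accuracy, recall, precision)
```

In this made-up example the accuracy looks acceptable while half of the actual cases are missed, which is why recall is worth reporting next to accuracy when comparing the three models. scikit-learn exposes the same quantities through `recall_score`, `precision_score` and `confusion_matrix` in `sklearn.metrics`.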

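In [15]:
# Save the best model together with the fitted scaler, so the app can pass raw feature values
import joblib
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(scaler, best_model)  # both steps are already fitted
joblib.dump(pipeline, 'best_model.pkl')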


['best_model.pkl']

### Question 4: Web Application Using Streamlit

Now that you have created and saved your model, you can deploy it in a web application using
Streamlit for medical practitioners to use.

* The functionality of your application
* The usability
* The design of the application
* The code design/structure
* Your code and application documentation
* Error-handling
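
For the error-handling point in particular, one option is to validate and encode the form inputs in a plain function before they reach the model, so the checks can be tested without running Streamlit. A minimal sketch; the function name `encode_inputs`, the accepted age range and the label-to-code mapping are assumptions for illustration and should be checked against the dataset's actual encoding:

```python
# Assumed label order; index in this list is used as the numeric code
CHEST_PAIN_TYPES = ["Typical Angina", "Atypical Angina", "Non-anginal Pain", "Asymptomatic"]

def encode_inputs(age, sex, chest_pain):
    """Validate raw form values and return them as numeric model features."""
    if not 0 < age <= 120:
        raise ValueError(f"Age must be greater than 0 and at most 120, got {age}")
    if sex not in ("Male", "Female"):
        raise ValueError(f"Unknown sex: {sex!r}")
    if chest_pain not in CHEST_PAIN_TYPES:
        raise ValueError(f"Unknown chest pain type: {chest_pain!r}")
    # Assumed encoding: 1 = male, 0 = female
    sex_code = 1.0 if sex == "Male" else 0.0
    return [float(age), sex_code, float(CHEST_PAIN_TYPES.index(chest_pain))]

features = encode_inputs(63, "Male", "Asymptomatic")

# Invalid input is rejected with a message the app can show to the user
try:
    encode_inputs(-5, "Male", "Asymptomatic")
except ValueError as exc:
    error_message = str(exc)

print(features, error_message)
```

In a Streamlit app, the caught message could be displayed with `st.error` instead of letting the exception surface as a traceback.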

In [None]:
%%writefile heart_disease_app.py
# Streamlit app: heart disease decision support
import joblib
import numpy as np
import streamlit as st


# Load the saved model
model = joblib.load('best_model.pkl')

# Title
st.title('Heart Disease Prediction')

# User inputs
age = st.number_input('Age')
sex = st.selectbox('Sex', ['Male', 'Female'])
cp = st.selectbox('Chest Pain Type', ['Typical Angina', 'Atypical Angina', 'Non-anginal Pain', 'Asymptomatic'])
trestbps = st.number_input('Resting Blood Pressure')
chol = st.number_input('Cholesterol')
fbs = st.selectbox('Fasting Blood Sugar > 120 mg/dl', ['False', 'True'])
restecg = st.selectbox('Resting ECG', ['Normal', 'ST-T wave abnormality', 'Left ventricular hypertrophy'])
thalach = st.number_input('Maximum Heart Rate Achieved')
exang = st.selectbox('Exercise Induced Angina', ['No', 'Yes'])
oldpeak = st.number_input('ST Depression Induced by Exercise')
slope = st.selectbox('Slope of the Peak Exercise ST Segment', ['Upsloping', 'Flat', 'Downsloping'])
ca = st.selectbox('Number of Major Vessels Colored by Fluoroscopy', ['0', '1', '2', '3'])
thal = st.selectbox('Thalassemia', ['Normal', 'Fixed Defect', 'Reversible Defect'])

# Predict
if st.button('Predict'):
    # Map categorical values to the numeric codes used in the dataset
    sex = 1 if sex == 'Male' else 0  # dataset encoding: 1 = male, 0 = female
    cp = ['Typical Angina', 'Atypical Angina', 'Non-anginal Pain', 'Asymptomatic'].index(cp)
    fbs = 1 if fbs == 'True' else 0
    restecg = ['Normal', 'ST-T wave abnormality', 'Left ventricular hypertrophy'].index(restecg)
    exang = 1 if exang == 'Yes' else 0
    slope = ['Upsloping', 'Flat', 'Downsloping'].index(slope)
    ca = int(ca)  # the select box returns a string
    # NOTE: index-based codes; confirm they match the thal encoding in heart.csv
    thal = ['Normal', 'Fixed Defect', 'Reversible Defect'].index(thal)

    # Feature order must match the column order used for training
    features = np.array([[age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal]])

    try:
        prediction = model.predict(features)
    except Exception as exc:
        # Show a readable message instead of a traceback
        st.error(f'Prediction failed: {exc}')
    else:
        if prediction[0] == 1:
            st.write('The patient is likely to have heart disease.')
        else:
            st.write('The patient is unlikely to have heart disease.')


In [None]:
# Install Streamlit (if needed) and run the app
!pip install streamlit
!streamlit run heart_disease_app.py

Thank you