<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Importing-Dependancies" data-toc-modified-id="Importing-Dependancies-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Importing Dependancies</a></span></li><li><span><a href="#Data-Cleaning-And-Preparation" data-toc-modified-id="Data-Cleaning-And-Preparation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Cleaning And Preparation</a></span><ul class="toc-item"><li><span><a href="#Loading-Data" data-toc-modified-id="Loading-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Loading Data</a></span></li><li><span><a href="#Dealing-with-NaN-values" data-toc-modified-id="Dealing-with-NaN-values-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Dealing with NaN values</a></span></li></ul></li><li><span><a href="#Exploratory-Data-Analysis" data-toc-modified-id="Exploratory-Data-Analysis-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Exploratory Data Analysis</a></span></li><li><span><a href="#Machine-learning-modeling" data-toc-modified-id="Machine-learning-modeling-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Machine learning modeling</a></span><ul class="toc-item"><li><span><a href="#Standard-Scaling-and-One-Hot-encoding" data-toc-modified-id="Standard-Scaling-and-One-Hot-encoding-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Standard Scaling and One-Hot encoding</a></span><ul class="toc-item"><li><span><a href="#One-Hot-Encoding-for-Categorical-Variables" data-toc-modified-id="One-Hot-Encoding-for-Categorical-Variables-4.1.1"><span class="toc-item-num">4.1.1&nbsp;&nbsp;</span>One-Hot Encoding for Categorical Variables</a></span></li><li><span><a href="#Standard-Scaling-Numerical-Variables" data-toc-modified-id="Standard-Scaling-Numerical-Variables-4.1.2"><span class="toc-item-num">4.1.2&nbsp;&nbsp;</span>Standard Scaling Numerical Variables</a></span></li></ul></li><li><span><a href="#Train-Test-Split" data-toc-modified-id="Train-Test-Split-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Train Test Split</a></span><ul class="toc-item"><li><span><a href="#Random-Forest-Model" data-toc-modified-id="Random-Forest-Model-4.2.1"><span class="toc-item-num">4.2.1&nbsp;&nbsp;</span>Random Forest Model</a></span></li><li><span><a href="#Support-Vector-Machines" data-toc-modified-id="Support-Vector-Machines-4.2.2"><span class="toc-item-num">4.2.2&nbsp;&nbsp;</span>Support Vector Machines</a></span></li></ul></li></ul></li></ul></div>

## Importing Dependancies

In [1]:
import pandas as pd
import numpy as np
import altair as alt


## Data Cleaning And Preparation

### Loading Data
In the code cell below, the following actions are preformed:
- Loading datasets to be used in the analysis
- concatenating the datasets into one dataset
- Assigning column names to the datasets
- Changing the DType of columns to numeric
- specifying which columns are categical 
- Displaying the first 5 rows of the dataframe 

In [2]:
'''Reading Cleveland Data'''
cleveland = pd.read_csv("./data/processed.cleveland.data.txt", header=None)

'''Reading Hungarian Data'''
hungarian = pd.read_csv("./data/processed.hungarian.data.txt", header=None)

'''Reading Switzerland Data'''
switzerland = pd.read_csv("./data/processed.switzerland.data.txt", header=None)

'''Reading Va Data'''
va = pd.read_csv("./data/processed.va.data.txt", header=None)

'''Concatenating the data together'''
data = pd.concat([cleveland, va, switzerland, hungarian])

'''Assigning Column names to daataframe'''
data.columns = [
    'age', 'sex', 'chest_pain', 'rest_bp', 'cholesterol', 'fasting_bs',
    'rest_ecg', 'max_heart_rate', 'exercise_angina', 'st_depression', 'slope',
    'fluoroscopy', 'defect', 'diagnosis'
]

'''Mutating the columns to be of type integer'''
for column in list(data.columns):
    if column in ["age", "rest_bp", "cholesterol", "max_heart_rate", "st_depression"]:
        data[column] = pd.to_numeric(data[column], errors="coerce")
    else:
        data[column] = pd.to_numeric(data[column],
                                     errors="coerce").astype("category")
data.head()

Unnamed: 0,age,sex,chest_pain,rest_bp,cholesterol,fasting_bs,rest_ecg,max_heart_rate,exercise_angina,st_depression,slope,fluoroscopy,defect,diagnosis
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


In [3]:
'''Obtaining some summary statistics from numerical variables'''
data.describe()

Unnamed: 0,age,rest_bp,cholesterol,max_heart_rate,st_depression
count,920.0,861.0,890.0,865.0,858.0
mean,53.51087,132.132404,199.130337,137.545665,0.878788
std,9.424685,19.06607,110.78081,25.926276,1.091226
min,28.0,0.0,0.0,60.0,-2.6
25%,47.0,120.0,175.0,120.0,0.0
50%,54.0,130.0,223.0,140.0,0.5
75%,60.0,140.0,268.0,157.0,1.5
max,77.0,200.0,603.0,202.0,6.2


In [4]:
'''Getting a glimpse of the DataType of every column in the datasert'''
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 920 entries, 0 to 293
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   age              920 non-null    float64 
 1   sex              920 non-null    category
 2   chest_pain       920 non-null    category
 3   rest_bp          861 non-null    float64 
 4   cholesterol      890 non-null    float64 
 5   fasting_bs       830 non-null    category
 6   rest_ecg         918 non-null    category
 7   max_heart_rate   865 non-null    float64 
 8   exercise_angina  865 non-null    category
 9   st_depression    858 non-null    float64 
 10  slope            611 non-null    category
 11  fluoroscopy      309 non-null    category
 12  defect           434 non-null    category
 13  diagnosis        920 non-null    category
dtypes: category(9), float64(5)
memory usage: 52.6 KB


### Dealing with NaN values

The cell below displays the percentages of missing values in each column of the dataset

In [5]:
df = {}
for column in data.columns:
    df[column] = data[column].isnull().mean() * 100

pd.DataFrame.from_dict(
    df, orient="index").rename(columns={
        0: "Percentage of Missing Values"
    }).sort_values(
        by="Percentage of Missing Values",
        ascending=False).style.background_gradient(axis=0).set_caption(
            "A Table Showing The Proportions of missing Values in columns")

Unnamed: 0,Percentage of Missing Values
fluoroscopy,66.413043
defect,52.826087
slope,33.586957
fasting_bs,9.782609
st_depression,6.73913
rest_bp,6.413043
max_heart_rate,5.978261
exercise_angina,5.978261
cholesterol,3.26087
rest_ecg,0.217391


From the table above, we see that:
- `Flouroscopy` has 66% of its values missing
- `defect` has 52.8% of its values missing
- `Slope` has 33.5% of its values missing

Due to the large amounts of missing values in these columns, we decided to drop them

Since `rest_ecg` only has two missing values, We will replace them with the mode. We believe this will not make a substantial impact on the outcome of our analysis.

In [8]:
'''Removing the columns with many missing values'''
final_df = data.copy().drop(["fluoroscopy", "defect", "slope"], axis=1)

'''Replacing the NaN value in the rest_ecg column with the mode of the colums'''
final_df['rest_ecg'].fillna(final_df['rest_ecg'].mode().iloc[0], inplace=True)

'''Looking at the head of the dataframe'''
final_df.head()

Unnamed: 0,age,sex,chest_pain,rest_bp,cholesterol,fasting_bs,rest_ecg,max_heart_rate,exercise_angina,st_depression,diagnosis
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,0


Having dealt with the three rows, we will just remove all rows with NaN values for our final dataset

In [9]:
heart = final_df.copy().dropna()
heart.head()

Unnamed: 0,age,sex,chest_pain,rest_bp,cholesterol,fasting_bs,rest_ecg,max_heart_rate,exercise_angina,st_depression,diagnosis
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,0


Checking if the final dataframe `heart_data` has any `NaN entries`

In [10]:
heart.isnull().any().reset_index().rename(columns={
    0: "has NaN entries?",
    "index": "Column"
}).style.set_caption("In the table below we see that all columns have no missing values")

Unnamed: 0,Column,has NaN entries?
0,age,False
1,sex,False
2,chest_pain,False
3,rest_bp,False
4,cholesterol,False
5,fasting_bs,False
6,rest_ecg,False
7,max_heart_rate,False
8,exercise_angina,False
9,st_depression,False


## Exploratory Data Analysis

The code cell below displays the distribution of age in the dataset

In [11]:
bar = alt.Chart(heart).mark_bar().encode(
    alt.X("age", bin=alt.Bin(step=3)), alt.Y("count()"),
    color="count()").properties(title="Distribution of Age")
bar

From the table above we see that `age` in the dataset roughly follows a normal distribution

In [12]:
from altair import X, Y, Chart
from altair_saver import save
alt.Chart(heart).mark_bar().encode(
    X("sex"),
    Y("count()"),
    tooltip="count()",
    color="sex").properties(title="Number of genders in the dataset")  # .interactive()

This graph shows that there is a class imbalance. class `0` is so underrepresented. This is a sampling bias that could potentially impact our model.

In [13]:
Chart(heart).mark_bar().encode(
    X("diagnosis"), Y("count()"), color="diagnosis",
    tooltip="count()").properties(title="Number of samples per diagnosis")

The graph above shows that diagnosis 4 is underrepresented. This is a form of sampling bias and it might affect the performance of our model. 

The code cell below plots boxplots in an attempt to find patterns in distributions of some numeric variables in different categories

In [14]:
cholesterol_vs_angina = Chart(heart).mark_boxplot().encode(
    Y("cholesterol:Q"), X("exercise_angina:O"),
    color="exercise_angina").properties(
        title="Excercise Angina vs Cholesterol Levels", width=200, height=300)

cholesterol_vs_rest = Chart(heart).mark_boxplot().encode(
    Y("cholesterol:Q"), X("rest_ecg:O"),
    color="rest_ecg").properties(title="Cholesterol Levels Vs rest ecg",
                                 width=200,
                                 height=300)

cholesterol_vs_fasting = Chart(heart).mark_boxplot().encode(
    Y("cholesterol:Q"), X("fasting_bs:O"),
    color="fasting_bs").properties(title="Cholesterol Levels Vs fasting_bs",
                                   width=200,
                                   height=300)

cholesterol_vs_angina | cholesterol_vs_rest | cholesterol_vs_fasting

In [15]:
heart_rate_vs_angina = Chart(heart).mark_boxplot().encode(
    Y("max_heart_rate:Q"), X("exercise_angina:O"),
    color="exercise_angina").properties(
        title="Excercise Angina vs Maximum Heart Rate", width=190, height=300)

heart_rate_vs_rest = Chart(heart).mark_boxplot().encode(
    Y("max_heart_rate:Q"), X("rest_ecg:O"),
    color="rest_ecg").properties(title="Max Heart Rate Vs rest ecg",
                                 width=190,
                                 height=300)

heart_rate_vs_fasting = Chart(heart).mark_boxplot().encode(
    Y("max_heart_rate:Q"), X("fasting_bs:O"),
    color="fasting_bs").properties(title="Max Heart Rate Levels Vs rest ecg",
                                   width=190,
                                   height=300)
heart_rate_vs_angina | heart_rate_vs_rest | heart_rate_vs_fasting

In [16]:
Chart(heart).mark_point().encode(
    X("age", scale=alt.Scale(zero=False)),
    Y("max_heart_rate"),
    color="max_heart_rate"
).properties(
    title="Maximum Heart Rate vs Age"
)

## Machine learning modeling
###  Standard Scaling and One-Hot encoding
#### One-Hot Encoding for Categorical Variables

In [69]:
'''Obtaining excercise angina dummy variables'''
angina_dummies = pd.get_dummies(heart["exercise_angina"],
                                prefix="exercise_angina")

'''Obtaining Rest ECG dummy variables'''
rest_dummies = pd.get_dummies(heart["rest_ecg"], prefix="rest_ecg")

'''Obtaining fasting_bs dummy variables'''
fasting_dummies = pd.get_dummies(heart["fasting_bs"], prefix="fasting")

'''Obtaining chest pain dummy variables'''
chest_pain_dummies = pd.get_dummies(heart["chest_pain"], prefix="chest_pain")

'''Obtaining sex dummy variables'''
sex_dummies = pd.get_dummies(heart["sex"], prefix="sex")

'''Concatenating the dummy columns to the dataset'''
heart_df = pd.concat([heart, angina_dummies,
                      rest_dummies, fasting_dummies,
                      chest_pain_dummies, sex_dummies], axis=1)

'''dropping the non_dummy categorical variables'''
new_heart = heart_df.drop(
    ["sex", "chest_pain", "fasting_bs", "rest_ecg", "exercise_angina"], axis=1)
new_heart.head()

Unnamed: 0,age,rest_bp,cholesterol,max_heart_rate,st_depression,diagnosis,exercise_angina_0.0,exercise_angina_1.0,rest_ecg_0.0,rest_ecg_1.0,rest_ecg_2.0,fasting_0.0,fasting_1.0,chest_pain_1.0,chest_pain_2.0,chest_pain_3.0,chest_pain_4.0,sex_0.0,sex_1.0
0,63.0,145.0,233.0,150.0,2.3,0,1,0,0,0,1,0,1,1,0,0,0,0,1
1,67.0,160.0,286.0,108.0,1.5,2,0,1,0,0,1,1,0,0,0,0,1,0,1
2,67.0,120.0,229.0,129.0,2.6,1,0,1,0,0,1,1,0,0,0,0,1,0,1
3,37.0,130.0,250.0,187.0,3.5,0,1,0,1,0,0,1,0,0,0,1,0,0,1
4,41.0,130.0,204.0,172.0,1.4,0,1,0,0,0,1,1,0,0,1,0,0,1,0


#### Standard Scaling Numerical Variables

In [70]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled = scaler.fit_transform(heart_df[['age', 'rest_bp', 'cholesterol',
                                        'max_heart_rate', 'st_depression']])
scaled_df = pd.DataFrame(scaled)
scaled_df.columns = scaler.feature_names_in_

new_heart[scaled_df.columns] = scaled_df
new_heart

Unnamed: 0,age,rest_bp,cholesterol,max_heart_rate,st_depression,diagnosis,exercise_angina_0.0,exercise_angina_1.0,rest_ecg_0.0,rest_ecg_1.0,rest_ecg_2.0,fasting_0.0,fasting_1.0,chest_pain_1.0,chest_pain_2.0,chest_pain_3.0,chest_pain_4.0,sex_0.0,sex_1.0
0,1.053690,0.659345,0.136456,0.436207,1.295248,0,1,0,0,0,1,0,1,1,0,0,0,0,1
1,1.479417,1.467619,0.703128,-1.190982,0.558727,2,0,1,0,0,1,1,0,0,0,0,1,0,1
2,1.479417,-0.687778,0.093688,-0.377388,1.571443,1,0,1,0,0,1,1,0,0,0,0,1,0,1
3,-1.713539,-0.148929,0.318218,1.869684,2.400029,0,1,0,1,0,0,1,0,0,0,1,0,0,1
4,-1.287811,-0.148929,-0.173610,1.288545,0.466662,0,1,0,0,0,1,1,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,0.308667,-0.687778,0.211299,1.172317,-0.822249,1,0,1,1,0,0,1,0,0,0,0,1,0,1
290,1.479417,1.036540,-0.088075,0.436207,-0.085729,1,0,1,0,1,0,1,0,0,0,1,0,1,0
291,0.202235,-0.041159,1.301875,1.056089,0.282532,1,0,1,1,0,0,0,1,0,0,0,1,0,1
292,-0.968516,-0.687778,-0.547828,0.203752,1.755573,1,0,1,1,0,0,1,0,0,1,0,0,1,0


### Train Test Split

In [81]:
from sklearn.model_selection import train_test_split

predictors = new_heart.copy().drop("diagnosis", axis=1)
y = new_heart["diagnosis"]

X_train, X_test, y_train, y_test = train_test_split(predictors,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)

- #### Random Forest Model

In [89]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = [
    {'n_estimators': list(range(1, 20)),
     'max_features': list(range(1, 20)),
     'max_depth': list(range(2, 20))}
]
rfc = RandomForestClassifier(random_state=42)
CrossVal_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CrossVal_rfc.fit(X_train, y_train)
CrossVal_rfc.best_params_

{'max_depth': 3, 'max_features': 3, 'n_estimators': 5}

In [114]:
from sklearn.metrics import accuracy_score, classification_report
'''Building the Random Forest Model with the best estimators'''
forest_clf = RandomForestClassifier(random_state=42, max_depth=3,
                                    max_features=3, n_estimators=5)

'''Fitting the model on the train set'''
forest_clf.fit(X_train, y_train)

y_pred = forest_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.610738255033557

#### Support Vector Machines

In [109]:
from sklearn.svm import SVC

# defining parameter range
svm_grid = {'C': [0.1, 1, 10, 100, 1000],
            'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
            'kernel': ['rbf', 'poly', 'sigmoid']}

svm_grid = GridSearchCV(SVC(random_state=42), param_grid=svm_grid, cv=5)

# fitting the model for grid search
svm_grid.fit(X_train, y_train)
svm_grid.best_params_

{'C': 1, 'gamma': 0.01, 'kernel': 'sigmoid'}

In [115]:
svm_clf = SVC(random_state=42, C=1, gamma=0.01, kernel='sigmoid')
svm_clf.fit(X_train, y_train)
y_pred_svm = svm_clf.predict(X_test)
accuracy_score(y_test, y_pred_svm)

0.5906040268456376