## Context
source: [1]
Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs.

People with cardiovascular disease or at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease) need early detection and management, and a machine learning model can be of great help here.
A dataset containing 918 observations is available (heart.csv). It contains 11 features that can be used to predict possible heart disease, followed by the output class:

    1. Age: age of the patient [years]
    2. Sex: sex of the patient [M: Male, F: Female]
    3. ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
    4. RestingBP: resting blood pressure [mm Hg]
    5. Cholesterol: serum cholesterol [mg/dl]
    6. FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
    7. RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria]
    8. MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
    9. ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
    10. Oldpeak: ST depression induced by exercise relative to rest [numeric value]
    11. ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
    12. HeartDisease: output class [1: heart disease, 0: Normal]


## Assignment
Create a machine learning model that predicts possible heart disease in a patient with high accuracy. The model must be a Support Vector Machine. Experiment with different kernel types and kernel parameters to achieve the highest accuracy. Take the following into account:
    - Some parameters must be transformed from categorical (e.g. male - female) to numerical before the SVM can process them. This can be handled with one-hot encoding. See https://www.geeksforgeeks.org/ml-one-hot-encoding/ or other websites on this topic.
    - Scaling of the numerical parameters will most likely be necessary.
    - Use *classification_report, confusion_matrix and ConfusionMatrixDisplay* from **sklearn.metrics** to investigate the performance of your model. The terms precision, recall and F1 score that are used in the classification report are explained in https://en.wikipedia.org/wiki/Precision_and_recall.
    - Don’t forget to split the dataset into a training set and a test set for validation purposes.
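The points above can be combined into a single scikit-learn pipeline. The following is a minimal sketch, not a reference solution. It assumes the column names listed in the Context section, and it uses a small randomly generated stand-in for heart.csv (the `make_fake_heart_data` helper and its labelling rule are hypothetical) so that it runs on its own. The kernels and parameter values in the grid are illustrative choices only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC


def make_fake_heart_data(n=300, seed=0):
    """Hypothetical stand-in for heart.csv: random values, same column names."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Age": rng.integers(28, 78, n),
        "Sex": rng.choice(["M", "F"], n),
        "ChestPainType": rng.choice(["TA", "ATA", "NAP", "ASY"], n),
        "RestingBP": rng.integers(90, 200, n),
        "Cholesterol": rng.integers(100, 400, n),
        "FastingBS": rng.integers(0, 2, n),
        "RestingECG": rng.choice(["Normal", "ST", "LVH"], n),
        "MaxHR": rng.integers(60, 203, n),
        "ExerciseAngina": rng.choice(["Y", "N"], n),
        "Oldpeak": rng.uniform(0.0, 4.0, n).round(1),
        "ST_Slope": rng.choice(["Up", "Flat", "Down"], n),
    })
    # Made-up rule so that the label depends on the features.
    df["HeartDisease"] = (
        (df["ExerciseAngina"] == "Y") | (df["Oldpeak"] > 2.5)
    ).astype(int)
    return df


data = make_fake_heart_data()  # for the real data: pd.read_csv("heart.csv")
X = data.drop(columns=["HeartDisease"])
y = data["HeartDisease"]

categorical = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]
numerical = ["Age", "RestingBP", "Cholesterol", "FastingBS", "MaxHR", "Oldpeak"]

# One-hot encode the categorical columns and scale the numerical ones.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numerical),
])
model = Pipeline([("prep", preprocess), ("svc", SVC())])

# Kernel types and kernel parameters to try (illustrative values).
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
    {"svc__kernel": ["poly"], "svc__C": [1, 10], "svc__degree": [2, 3]},
]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
print(cm)
```

Because the encoder and scaler sit inside the pipeline, they are fitted on the training folds only, so no information from the test set leaks into preprocessing. `ConfusionMatrixDisplay(confusion_matrix=cm).plot()` would draw the matrix if matplotlib is installed.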

In [3]:
# imports 
import numpy as np
import pandas as pd
import sklearn as sk

In [9]:
data = pd.read_csv('heart.csv')
print(data)

     Age Sex ChestPainType  RestingBP  Cholesterol  FastingBS RestingECG  \
0     40   M           ATA        140          289          0     Normal   
1     49   F           NAP        160          180          0     Normal   
2     37   M           ATA        130          283          0         ST   
3     48   F           ASY        138          214          0     Normal   
4     54   M           NAP        150          195          0     Normal   
..   ...  ..           ...        ...          ...        ...        ...   
913   45   M            TA        110          264          0     Normal   
914   68   M           ASY        144          193          1     Normal   
915   57   M           ASY        130          131          0     Normal   
916   57   F           ATA        130          236          0        LVH   
917   38   M           NAP        138          175          0     Normal   

     MaxHR ExerciseAngina  Oldpeak ST_Slope  HeartDisease  
0      172              N  

In [7]:
# show the unique values of the categorical columns Sex, ChestPainType, RestingECG, ExerciseAngina and ST_Slope
print(data['Sex'].unique()) 
print(data['ChestPainType'].unique()) 
print(data['RestingECG'].unique()) 
print(data['ExerciseAngina'].unique()) 
print(data['ST_Slope'].unique())

['M' 'F']
['ATA' 'NAP' 'ASY' 'TA']
['Normal' 'ST' 'LVH']
['N' 'Y']
['Up' 'Flat' 'Down']
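
The unique values above show two binary columns (Sex, ExerciseAngina) and three columns with three or four levels. As a small illustration of what one-hot encoding does to such columns, here is a sketch on a four-row, made-up frame (these are not rows from heart.csv). `dtype=int` and `drop_first=True` are optional choices, not requirements of the assignment.

```python
import pandas as pd

# Made-up rows that only reuse the category labels printed above.
toy = pd.DataFrame({
    "Sex": ["M", "F", "M", "F"],
    "ST_Slope": ["Up", "Flat", "Down", "Up"],
})

# One column per category, encoded as 0/1 integers.
full = pd.get_dummies(toy, columns=["Sex", "ST_Slope"], dtype=int)
print(full.columns.tolist())

# drop_first=True drops the first level of each column, so a binary
# column such as Sex becomes a single 0/1 column.
reduced = pd.get_dummies(toy, columns=["Sex", "ST_Slope"], dtype=int, drop_first=True)
print(reduced.columns.tolist())
```

Dropping one level per column removes redundant columns (Sex_F is always 1 - Sex_M), which keeps the feature matrix smaller without losing information.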


In [18]:
# one-hot encoding using pandas (optionally pass dtype=int to force 0/1 integer columns)
oneHotData = pd.get_dummies(data, columns=['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'])
print(oneHotData)

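# using sklearn (import the encoder explicitly; `import sklearn` alone does not guarantee that sklearn.preprocessing is loaded)
from sklearn.preprocessing import OneHotEncoder

categorical_columns = data.select_dtypes(include=['object']).columns.tolist()

enc = OneHotEncoder(sparse_output=False)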

one_hot_enc = enc.fit_transform(data[categorical_columns])

one_hot_df = pd.DataFrame(one_hot_enc, columns=enc.get_feature_names_out(categorical_columns))

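# note: `data` still holds the original string columns, so they appear in df_enc next to the one-hot columns
df_enc = pd.concat([data, one_hot_df], axis=1)

print(f'encoded: \n{df_enc}')
print(df_enc[:4])

# the result above is not as expected, so the binary columns are handled with numpy conditions instead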
sex_conditions = [
    (data['Sex'] == 'M'),
    (data['Sex'] == 'F')
]

exercise_conditions = [
    (data['ExerciseAngina'] == 'Y'),
    (data['ExerciseAngina'] == 'N')
]


encoded: 
     Age Sex ChestPainType  RestingBP  Cholesterol  FastingBS RestingECG  \
0     40   M           ATA        140          289          0     Normal   
1     49   F           NAP        160          180          0     Normal   
2     37   M           ATA        130          283          0         ST   
3     48   F           ASY        138          214          0     Normal   
4     54   M           NAP        150          195          0     Normal   
..   ...  ..           ...        ...          ...        ...        ...   
913   45   M            TA        110          264          0     Normal   
914   68   M           ASY        144          193          1     Normal   
915   57   M           ASY        130          131          0     Normal   
916   57   F           ATA        130          236          0        LVH   
917   38   M           NAP        138          175          0     Normal   

     MaxHR ExerciseAngina  Oldpeak  ... ChestPainType_NAP  ChestPainType_TA  

In [19]:
# training and testing
from sklearn.model_selection import train_test_split

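# positional slicing: columns 0-3 become the features and column 4 (Cholesterol in df_enc) becomes the target
X = df_enc.values[:,:4]
y = df_enc.values[:,4].astype(int)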
print('ClassLabels: \n', np.unique(y))
print(np.bincount(y))

print(df_enc[:4])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify = y)

print(np.bincount(y_train))

ClassLabels: 
 [  0  85 100 110 113 117 123 126 129 131 132 139 141 142 147 149 152 153
 156 157 159 160 161 163 164 165 166 167 168 169 170 171 172 173 174 175
 176 177 178 179 180 181 182 183 184 185 186 187 188 190 192 193 194 195
 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213
 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231
 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249
 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267
 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285
 286 287 288 289 290 291 292 293 294 295 297 298 299 300 302 303 304 305
 306 307 308 309 310 311 312 313 315 316 318 319 320 321 322 325 326 327
 328 329 330 331 333 335 336 337 338 339 340 341 342 344 347 349 353 354
 355 358 360 365 369 384 385 388 392 393 394 404 407 409 412 417 458 466
 468 491 518 529 564 603]
[172   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0  

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
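
The class labels printed above (0, 85, 100, ... 603) are serum cholesterol values rather than 0/1. In `df_enc`, positional column 4 is `Cholesterol`, so `df_enc.values[:, 4]` selects the wrong target, and stratifying on a column where some values occur only once raises the ValueError. Selecting the target by name avoids this. A minimal sketch on a made-up frame with the same leading column order (the values are illustrative, not taken from heart.csv):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame: same first five columns as heart.csv, target as last column.
rng = np.random.default_rng(0)
n = 20
df = pd.DataFrame({
    "Age": rng.integers(30, 70, n),
    "Sex": rng.choice(["M", "F"], n),
    "ChestPainType": rng.choice(["ATA", "NAP", "ASY", "TA"], n),
    "RestingBP": rng.integers(100, 180, n),
    "Cholesterol": rng.integers(100, 400, n),
    "HeartDisease": np.tile([0, 1], n // 2),
})

print(df.columns[4])  # the column that values[:, 4] would pick

# Select the target by name and use the remaining numeric columns as features.
y = df["HeartDisease"].to_numpy()
X = df.drop(columns=["HeartDisease"]).select_dtypes(include="number")

print(np.bincount(y))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
print(np.bincount(y_train))
```

The string columns are simply excluded here to keep the sketch short. In the notebook they would be one-hot encoded first so that they can be used as features.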