# Heart Attack Analysis & Prediction Dataset

In this task, you are asked to use `heart-data.csv` to train a support vector machine (SVM) that predicts heart attacks.

See `Data description.docx` or `Data description.pdf` for a description of the dataset.

# Reading Dataset

In [362]:
import pandas as pd

data = pd.read_csv('heart-data.csv')

data.head()

,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,Male,Non-anginal pain,145,233,High,Hypertrophy,150,No,2.3,Down-sloping,0.0,Fixed defect,1
1,37,Male,Atypical angina,130,250,Low,Normal,187,No,3.5,Down-sloping,0.0,Normal,1
2,41,Female,Typical angina,130,204,Low,Hypertrophy,172,No,1.4,Up-sloping,0.0,Normal,1
3,56,Male,Typical angina,120,236,Low,Normal,178,No,0.8,Up-sloping,0.0,Normal,1
4,57,Female,Asymptomatic,120,354,Low,Normal,163,Yes,0.6,Up-sloping,0.0,Normal,1


# TODO
1. Remove samples with missing data (there are **7 samples** with missing data).
2. Split the data into input and output.
3. Replace categorical values with numeric values (use numeric encoding or one-hot encoding, whichever suits the feature).
4. Split the dataset into train, validation, and test sets by calling `train_test_split` twice:
    - First call: use `test_size=0.20` and `random_state=0`.
    - Second call: use `test_size=0.25` and `random_state=0`.
5. Apply feature scaling using `MinMaxScaler`.
6. Train a support vector machine classifier with suitable hyperparameter values.
7. Print the training and validation accuracy. Try to achieve **validation accuracy > 82%**.
8. Evaluate the support vector machine on the test set and print the test accuracy.

# 1. Handling missing values

**The Size of the dataset**

In [363]:
data.shape

(303, 14)

**Basic Info about data**

In [364]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    object 
 2   cp        303 non-null    object 
 3   trtbps    303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    object 
 6   restecg   303 non-null    object 
 7   thalachh  303 non-null    int64  
 8   exng      303 non-null    object 
 9   oldpeak   303 non-null    float64
 10  slp       303 non-null    object 
 11  caa       298 non-null    float64
 12  thall     301 non-null    object 
 13  output    303 non-null    int64  
dtypes: float64(2), int64(5), object(7)
memory usage: 33.3+ KB


In [365]:
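# Count rows that contain at least one missing value (not individual missing cells)
print("Number of rows with missing values:", data.isnull().any(axis=1).sum())

Number of rows with missing values: 7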


In [366]:
mask = data.isnull().any(axis=1)
# Count the rows that contain at least one missing value
num_of_rows_with_nan = mask.sum()
# Print the fraction of rows with missing data (about 0.023, i.e. 2.3%)
print('the ratio of rows with missing data:', num_of_rows_with_nan/len(data))

the ratio of rows with missing data: 0.0231023102310231


In [367]:
# Remove rows with missing data
data_clean = data[~mask]

In [368]:
data_clean.shape

(296, 14)
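The mask-based removal above can also be written with `DataFrame.dropna`. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from `heart-data.csv`), showing that both approaches keep the same rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the heart data (values are made up for illustration).
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "caa": [0.0, np.nan, 2.0, 1.0],
    "thall": ["Normal", "Fixed defect", None, "Normal"],
})

# Boolean-mask approach, as in the cells above.
mask = toy.isnull().any(axis=1)
clean_by_mask = toy[~mask]

# Equivalent one-liner: drop every row that has at least one missing value.
clean_by_dropna = toy.dropna(axis=0, how="any")

print(clean_by_mask.equals(clean_by_dropna))
print(clean_by_dropna.shape)
```

Either form is fine here; the explicit mask is handy when you also want to count or inspect the rows being removed.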

# 2. Split the data into input and output

In [369]:
data_input = data_clean.drop(columns=['output'])
data_output = data_clean['output']

In [370]:
data_input.head()

,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall
0,63,Male,Non-anginal pain,145,233,High,Hypertrophy,150,No,2.3,Down-sloping,0.0,Fixed defect
1,37,Male,Atypical angina,130,250,Low,Normal,187,No,3.5,Down-sloping,0.0,Normal
2,41,Female,Typical angina,130,204,Low,Hypertrophy,172,No,1.4,Up-sloping,0.0,Normal
3,56,Male,Typical angina,120,236,Low,Normal,178,No,0.8,Up-sloping,0.0,Normal
4,57,Female,Asymptomatic,120,354,Low,Normal,163,Yes,0.6,Up-sloping,0.0,Normal


In [371]:
data_output.head()

0    1
1    1
2    1
3    1
4    1
Name: output, dtype: int64

# 3. Handling categorical data


In [372]:
# List the distinct values of each categorical column
print("sex:", data_input['sex'].unique())
print("cp:", data_input['cp'].unique())
print("fbs:", data_input['fbs'].unique())
print("restecg:", data_input['restecg'].unique())
print("slp:", data_input['slp'].unique())
print("exng:", data_input['exng'].unique())

sex: ['Male' 'Female']
cp: ['Non-anginal pain' 'Atypical angina' 'Typical angina' 'Asymptomatic']
fbs: ['High' 'Low']
restecg: ['Hypertrophy' 'Normal' 'ST-T wave abnormality']
slp: ['Down-sloping' 'Up-sloping' 'Flat']
exng: ['No' 'Yes']


## 3.1 Numeric encoding

In [373]:
data_input_encoded_1 = data_input.replace({
    'sex': {'Male': 0, 'Female': 1},
    'fbs': {'High': 1, 'Low': 0},
    'exng': {'No': 0, 'Yes': 1}
})

In [374]:
data_input_encoded_1.head()

,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall
0,63,0,Non-anginal pain,145,233,1,Hypertrophy,150,0,2.3,Down-sloping,0.0,Fixed defect
1,37,0,Atypical angina,130,250,0,Normal,187,0,3.5,Down-sloping,0.0,Normal
2,41,1,Typical angina,130,204,0,Hypertrophy,172,0,1.4,Up-sloping,0.0,Normal
3,56,0,Typical angina,120,236,0,Normal,178,0,0.8,Up-sloping,0.0,Normal
4,57,1,Asymptomatic,120,354,0,Normal,163,1,0.6,Up-sloping,0.0,Normal
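One risk with a manual mapping is a misspelled category that silently stays unencoded. As a hedged alternative, the sketch below uses `Series.map` on a made-up toy frame and checks that every value was covered. The `binary_maps` name is introduced here only for illustration.

```python
import pandas as pd

# Toy frame with the three binary columns (values are made up for illustration).
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "fbs": ["High", "Low", "Low"],
    "exng": ["No", "Yes", "No"],
})

# Same mappings as in the cell above; `binary_maps` is a hypothetical name.
binary_maps = {
    "sex": {"Male": 0, "Female": 1},
    "fbs": {"High": 1, "Low": 0},
    "exng": {"No": 0, "Yes": 1},
}

encoded = toy.copy()
for column, mapping in binary_maps.items():
    # Series.map yields NaN for any value that is missing from the mapping,
    # so a typo in a category name is caught by the check below.
    encoded[column] = toy[column].map(mapping)
    assert encoded[column].notna().all(), f"unmapped value in {column!r}"

print(encoded)
```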


## 3.2. One-hot encoding

In [375]:
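# One-hot encode the remaining text columns (cp, restecg, slp, thall).
# dtype=int keeps the 0/1 columns shown below; pandas >= 2.0 defaults to bool.
data_input_encoded_2 = pd.get_dummies(data_input_encoded_1, dtype=int)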

In [376]:
data_input_encoded_2.head()

,age,sex,trtbps,chol,fbs,thalachh,exng,oldpeak,caa,cp_Asymptomatic,...,cp_Typical angina,restecg_Hypertrophy,restecg_Normal,restecg_ST-T wave abnormality,slp_Down-sloping,slp_Flat,slp_Up-sloping,thall_Fixed defect,thall_Normal,thall_Reversable defect
0,63,0,145,233,1,150,0,2.3,0.0,0,...,0,1,0,0,1,0,0,1,0,0
1,37,0,130,250,0,187,0,3.5,0.0,0,...,0,0,1,0,1,0,0,0,1,0
2,41,1,130,204,0,172,0,1.4,0.0,0,...,1,1,0,0,0,0,1,0,1,0
3,56,0,120,236,0,178,0,0.8,0.0,0,...,1,0,1,0,0,0,1,0,1,0
4,57,1,120,354,0,163,1,0.6,0.0,1,...,0,0,1,0,0,0,1,0,1,0
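To make the column naming above concrete, here is a small sketch of one-hot encoding on a made-up toy frame. It names the column to encode explicitly through `columns=` (an optional choice, not what the notebook does) and shows that each category becomes its own `<column>_<value>` indicator.

```python
import pandas as pd

# Toy frame with one numeric and one multi-category column (made-up values).
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "slp": ["Down-sloping", "Up-sloping", "Flat"],
})

# Encode only the listed column; numeric columns pass through unchanged.
onehot = pd.get_dummies(toy, columns=["slp"], dtype=int)

print(onehot.columns.tolist())
print(onehot)
```

Each row has exactly one `1` among the `slp_*` columns, so no artificial ordering is imposed on the categories.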


# 4. Split into (train - validation - test)

In [377]:
from sklearn.model_selection import train_test_split

X, X_test, y, y_test = train_test_split(
    data_input_encoded_2, data_output, test_size=0.20, random_state=0
)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

In [378]:
print(X_train.shape)
print(y_train.shape)
print('---------------------')
print(X_val.shape)
print(y_val.shape)
print('---------------------')
print(X_test.shape)
print(y_test.shape)

(177, 22)
(177,)
---------------------
(59, 22)
(59,)
---------------------
(60, 22)
(60,)
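The two calls give roughly a 60/20/20 split, because the second call takes 25% of the remaining 80%. The arithmetic can be sketched as follows, assuming `train_test_split` rounds the held-out share up to a whole sample, which is consistent with the shapes printed above.

```python
import math

n_samples = 296  # rows left after dropping the 7 incomplete ones

# First split: 20% of all rows go to the test set (rounded up).
n_test = math.ceil(0.20 * n_samples)

# Second split: 25% of the remaining rows go to the validation set.
n_remaining = n_samples - n_test
n_val = math.ceil(0.25 * n_remaining)
n_train = n_remaining - n_val

print(n_train, n_val, n_test)  # 177 59 60
```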


# 5. Feature scaling (Normalization)

In [379]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_val_scaled =  scaler.transform(X_val)
X_test_scaled =  scaler.transform(X_test)
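The scaler is fitted on the training set only, so the validation and test sets are transformed with the training minimum and maximum. A minimal sketch with made-up numbers shows the formula `(x - min) / (max - min)` and why held-out values can land outside `[0, 1]`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data: training range is 120..145.
train = np.array([[120.0], [130.0], [145.0]])
val = np.array([[110.0], [150.0]])

scaler = MinMaxScaler().fit(train)

# Manual version of the transform, using the training min and max only.
train_min = train.min(axis=0)
train_max = train.max(axis=0)
manual = (val - train_min) / (train_max - train_min)

print(scaler.transform(val).ravel())
print(np.allclose(scaler.transform(val), manual))
```

Values slightly below 0 or above 1 in the validation or test set are expected and are not a sign of a bug.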

# 6. Using the Support Vector Machine Algorithm


In [380]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

## Linear SVM

In [401]:
svc = SVC(kernel='linear', random_state=0, C=100)
svc.fit(X_train_scaled, y_train)

y_pred_train = svc.predict(X_train_scaled)
y_pred_val = svc.predict(X_val_scaled)


print("Acc Training:",accuracy_score(y_train, y_pred_train))
print("Acc Validation:",accuracy_score(y_val, y_pred_val))


Acc Training: 0.9152542372881356
Acc Validation: 0.8305084745762712


## Poly SVM

In [408]:
svc = SVC(kernel='poly', degree=2, random_state=0, C=2.5)
svc.fit(X_train_scaled, y_train)

y_pred_train = svc.predict(X_train_scaled)
y_pred_val = svc.predict(X_val_scaled)

print("Acc Training:",accuracy_score(y_train, y_pred_train))
print("Acc Validation:",accuracy_score(y_val, y_pred_val))

Acc Training: 0.943502824858757
Acc Validation: 0.8305084745762712


## RBF SVM

In [442]:
svc = SVC(kernel='rbf', gamma=0.01, random_state=0, C=200)
svc.fit(X_train_scaled, y_train)

y_pred_train = svc.predict(X_train_scaled)
y_pred_val = svc.predict(X_val_scaled)

print("Acc Training:",accuracy_score(y_train, y_pred_train))
print("Acc Validation:",accuracy_score(y_val, y_pred_val))

Acc Training: 0.9265536723163842
Acc Validation: 0.847457627118644
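The values of `C` and `gamma` above were presumably found by trial and error. One way to make that search systematic is a small loop that keeps the combination with the best validation accuracy. The sketch below runs on synthetic data from `make_classification` as a stand-in for the heart data, and the candidate grids are arbitrary assumptions, not the values the notebook explored.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit the scaler on the training part only, as in step 5.
scaler = MinMaxScaler().fit(X_tr)
X_tr_s, X_va_s = scaler.transform(X_tr), scaler.transform(X_va)

# Hypothetical candidate grids; adjust to taste.
C_values = [1, 10, 100, 200]
gamma_values = [0.001, 0.01, 0.1]

best_params, best_acc = None, -1.0
for C, gamma in product(C_values, gamma_values):
    model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr_s, y_tr)
    acc = accuracy_score(y_va, model.predict(X_va_s))
    if acc > best_acc:  # keep the first combination with the highest accuracy
        best_params, best_acc = (C, gamma), acc

print("Best (C, gamma):", best_params, "validation accuracy:", best_acc)
```

The test set is deliberately left out of the loop so that it remains an unbiased final check.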


# 7. Accuracy of the Best Model (Training and Validation)
### Target: validation accuracy > 82%

In [467]:
rbf_svc = SVC(kernel='rbf', gamma=0.01, random_state=0, C=200)
rbf_svc.fit(X_train_scaled, y_train)

y_pred_train = rbf_svc.predict(X_train_scaled)
y_pred_val = rbf_svc.predict(X_val_scaled)

print("Acc Training:",accuracy_score(y_train, y_pred_train))
print("Acc Validation:",accuracy_score(y_val, y_pred_val))

Acc Training: 0.9265536723163842
Acc Validation: 0.847457627118644


# 8. Testing

In [468]:
# Import the metrics used to evaluate the model
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

In [469]:
# Predict on the held-out test set
y_pred_test = rbf_svc.predict(X_test_scaled)

In [470]:
def eval_model(y_actual, y_pred):
    """Print the confusion matrix, precision, recall, F1 score, and accuracy."""
    print("Confusion Matrix:\n", confusion_matrix(y_actual, y_pred))
    print("Precision:", precision_score(y_actual, y_pred))
    print("Recall:   ", recall_score(y_actual, y_pred))
    print("F1Score:  ", f1_score(y_actual, y_pred))
    print("Acc Test: ", accuracy_score(y_actual, y_pred))

In [471]:
eval_model(y_test, y_pred_test)

Confusion Matrix:
 [[26  6]
 [ 3 25]]
Precision: 0.8064516129032258
Recall:    0.8928571428571429
F1Score:   0.8474576271186439
Acc Test:  0.85
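The reported metrics can be checked by hand from the confusion matrix. In scikit-learn's layout, rows are the true classes and columns the predicted ones, so the matrix above reads as TN=26, FP=6, FN=3, TP=25. A short sketch of the arithmetic:

```python
# Counts read from the confusion matrix printed above.
tn, fp = 26, 6
fn, tp = 3, 25

precision = tp / (tp + fp)                   # 25 / 31
recall = tp / (tp + fn)                      # 25 / 28
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 51 / 60

print(f"Precision: {precision:.4f}")  # 0.8065
print(f"Recall:    {recall:.4f}")     # 0.8929
print(f"F1 score:  {f1:.4f}")         # 0.8475
print(f"Accuracy:  {accuracy:.4f}")   # 0.8500
```

Recall is higher than precision here: the model misses 3 of the 28 positive cases but raises 6 false alarms among the 32 negatives.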


# Save model

In [472]:
import pickle
with open('saved-model.pickle', 'wb') as f:
    pickle.dump(rbf_svc, f)
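The saved classifier expects inputs scaled with the same fitted `MinMaxScaler`, so it can be useful to store the scaler alongside the model. The sketch below illustrates one way to do that on made-up data. The file name `saved-pipeline.pickle` and the dictionary layout are assumptions for illustration, not part of the original task.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Made-up data standing in for the scaled heart features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

scaler = MinMaxScaler().fit(X_demo)
model = SVC(kernel="rbf", gamma=0.01, C=200).fit(scaler.transform(X_demo), y_demo)

# Hypothetical file name; written to a temporary directory for this demo.
path = os.path.join(tempfile.mkdtemp(), "saved-pipeline.pickle")
with open(path, "wb") as f:
    pickle.dump({"scaler": scaler, "model": model}, f)

# Later: load both objects and apply them in the same order as in training.
with open(path, "rb") as f:
    bundle = pickle.load(f)

restored_pred = bundle["model"].predict(bundle["scaler"].transform(X_demo))
original_pred = model.predict(scaler.transform(X_demo))
print(np.array_equal(restored_pred, original_pred))
```

Only unpickle files from sources you trust, and load them with the same scikit-learn version that wrote them where possible.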