#### Attribute Information:

- age
- sex   (male=1, female=0)
- chest pain type (4 values)
- resting blood pressure
- serum cholesterol in mg/dl
- fasting blood sugar > 120 mg/dl
- resting electrocardiographic results (values 0,1,2)
- maximum heart rate achieved
- exercise induced angina
- oldpeak = ST depression induced by exercise relative to rest
- the slope of the peak exercise ST segment
- number of major vessels (0-3) colored by fluoroscopy
- thal: 0 = normal; 1 = fixed defect; 2 = reversible defect

In [1]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt
%matplotlib inline


In [2]:
data=pd.read_csv('heart.csv')
data.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


In [3]:
data.slope.unique()

array([2, 0, 1], dtype=int64)

#### Checking if there are any null values

In [4]:
data.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [5]:
data.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0
mean,54.434146,0.69561,0.942439,131.611707,246.0,0.149268,0.529756,149.114146,0.336585,1.071512,1.385366,0.754146,2.323902,0.513171
std,9.07229,0.460373,1.029641,17.516718,51.59251,0.356527,0.527878,23.005724,0.472772,1.175053,0.617755,1.030798,0.62066,0.50007
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,48.0,0.0,0.0,120.0,211.0,0.0,0.0,132.0,0.0,0.0,1.0,0.0,2.0,0.0
50%,56.0,1.0,1.0,130.0,240.0,0.0,1.0,152.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,275.0,0.0,1.0,166.0,1.0,1.8,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [8]:
data.ca.unique()

array([2, 0, 1, 3, 4], dtype=int64)

#### Independent and dependent variables

In [3]:
X=data.iloc[:,:-1]
X.head()


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2


In [4]:
y=data.iloc[:,-1]
y.head()


0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

##### Here the target values 0 and 1 encode whether or not the patient has heart disease

In [14]:
y.unique()

array([0, 1], dtype=int64)

### Build a model

Split the data into training and test sets (60% train, 40% test)

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.4,random_state=0)
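To make the effect of `test_size=0.4` concrete, here is a minimal sketch that splits a stand-in frame with the same number of rows as `heart.csv` (1025). The synthetic columns, the `_demo` names, and the `stratify` argument are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: 1025 rows like heart.csv, but with made-up values (assumption).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "age": rng.integers(29, 78, size=1025),
    "chol": rng.integers(126, 565, size=1025),
})
y_demo = pd.Series(rng.integers(0, 2, size=1025), name="target")

# Same split parameters as the notebook. stratify is an optional extra
# (not used in the notebook) that keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0, stratify=y_demo
)

print(len(X_tr), len(X_te))  # 615 410
```

The 410 test rows agree with the total of the confusion matrix reported later in the notebook.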

Since this is a classification problem, we will use ensemble techniques to build our model.

We fit Random Forest, AdaBoost, Bagging and Gradient Boosting classifiers and select the one that gives the highest accuracy.
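The comparison described here can be sketched as a loop over candidate models. This is only an illustration: it runs on synthetic data from `make_classification` rather than `heart.csv`, and the `models` and `scores` dictionaries are hypothetical names introduced for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data (assumption): 13 features, binary target.
X_syn, y_syn = make_classification(n_samples=500, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.4, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit every candidate on the training part and score it on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Best on this synthetic data:", best)
```

Fixing `random_state` on each model makes the ranking reproducible between runs; the ranking on synthetic data says nothing about the ranking on the heart dataset.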

In [6]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

### Fitting the data 

##### Random Forest

In [119]:
#rfc=RandomForestClassifier(n_estimators=1,criterion='entropy')
#rfc.fit(X_train,y_train)

RandomForestClassifier(criterion='entropy', n_estimators=1)

##### AdaBoost Classifier

In [120]:
#abc=AdaBoostClassifier()
#abc.fit(X_train,y_train)

AdaBoostClassifier()

##### Bagging Classifier

In [7]:
bc=BaggingClassifier()
bc.fit(X_train,y_train)

BaggingClassifier()

##### Gradient Boosting Classifier

In [122]:
#gbc=GradientBoostingClassifier()
#gbc.fit(X_train,y_train)

GradientBoostingClassifier()

##### XGBoost Classifier 

In [123]:
#from xgboost import XGBClassifier
#xgb=XGBClassifier()
#xgb.fit(X_train,y_train)





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=12, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

##### Predicting on the test dataset

In [14]:
#y_pred_rfc=rfc.predict(X_test)
#y_pred_abc=abc.predict(X_test)
y_pred_bc=bc.predict(X_test)
#y_pred_gbc=gbc.predict(X_test)
#y_pred_xgb=xgb.predict(X_test)


##### Checking the accuracy score and confusion matrix of test and predicted values

In [15]:
#print('Accuracy score for Random forest Classifier is:   {}  '.format(accuracy_score(y_test,y_pred_rfc)))

#print('Accuracy score for AdaBoost Classifier is:   {}'.format(accuracy_score(y_test,y_pred_abc)))
print('Accuracy score for Bagging Classifier is:   {}'.format(accuracy_score(y_test,y_pred_bc)))
#print('Accuracy score for Gradient Boosting Classifier is:   {}'.format(accuracy_score(y_test,y_pred_gbc)))
#print('Accuracy score for XGBoost Classifier is:   {}'.format(accuracy_score(y_test,y_pred_xgb)))

Accuracy score for Bagging Classifier is:   0.9682926829268292


In [16]:
#print('Confusion matrix for Random forest Classifier is:')
#print(confusion_matrix(y_test,y_pred_rfc))

#print('Confusion matrix for AdaBoost Classifier is:')
#print(confusion_matrix(y_test,y_pred_abc))

print('Confusion matrix for Bagging Classifier is:')
print(confusion_matrix(y_test,y_pred_bc))

#print('Confusion matrix for Gradient Boosting Classifier is:')
#print(confusion_matrix(y_test,y_pred_gbc))

Confusion matrix for Bagging Classifier is:
[[186   3]
 [ 10 211]]
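As a cross-check, the accuracy can be recomputed by hand from the confusion matrix above. The sketch below relies on scikit-learn's convention that rows are true classes and columns are predicted classes; the precision and recall lines are an added illustration for class 1 and are not computed in the original notebook.

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class).
cm = np.array([[186, 3],
               [10, 211]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # correct predictions / all test rows
precision = tp / (tp + fp)        # of the rows predicted as 1, how many are 1
recall = tp / (tp + fn)           # of the rows that are 1, how many were found

print(f"test rows: {cm.sum()}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The recomputed accuracy, 397 correct out of 410, matches the 0.968 printed by `accuracy_score`.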


#### Of the selected models, the Bagging Classifier gives the highest accuracy, about 97 percent (0.968 on the test set), followed by the Gradient Boosting Classifier
- Hence we will continue with the Bagging Classifier model for prediction

#### Create a pickle file using serialization

In [17]:
import pickle
# Serialize the fitted Bagging classifier; the with-block closes the file automatically
with open('bagging.pkl','wb') as pickle_out:
    pickle.dump(bc, pickle_out)
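A quick way to confirm that a pickled model is usable is to load it back and compare predictions. The following is a minimal sketch under stated assumptions: it trains a small `BaggingClassifier` on synthetic data and writes to a temporary file named `bagging_demo.pkl` (a hypothetical name), so it does not touch the notebook's `bagging.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-in for the heart data (assumption).
X_syn, y_syn = make_classification(n_samples=200, n_features=13, random_state=0)
model = BaggingClassifier(random_state=0).fit(X_syn, y_syn)

# Write the model to a temporary location, then read it back.
path = os.path.join(tempfile.mkdtemp(), "bagging_demo.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X_syn), restored.predict(X_syn))
print("Predictions identical after reload:", same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.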

In [9]:
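# Sanity check on one row (row 243 belongs to the training split);
# X.iloc[[243]] keeps a one-row DataFrame with the original feature names
bc.predict(X.iloc[[243]])[0]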



0
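Scikit-learn estimators expect a 2D input even for a single sample. The sketch below shows one way to pass a single row while keeping the column names, and how to read the class probabilities alongside the label. The tiny frame, its made-up `target` rule, and the row index 5 are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier

# Small synthetic frame with two of the heart.csv column names (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(29, 78, size=100),
    "thalach": rng.integers(71, 203, size=100),
})
df["target"] = (df["thalach"] > 150).astype(int)  # made-up labelling rule

features = df.drop(columns="target")
clf = BaggingClassifier(random_state=0).fit(features, df["target"])

# Double brackets return a one-row DataFrame (2D) instead of a Series (1D).
row = features.iloc[[5]]
pred = clf.predict(row)[0]
proba = clf.predict_proba(row)[0]

print("row shape:", row.shape)
print("predicted class:", pred)
print("class probabilities:", proba)
```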

In [23]:
y_train.head()

243    0
641    0
35     0
81     0
159    1
Name: target, dtype: int64

In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


age          60.0
sex           1.0
cp            2.0
trestbps    140.0
chol        185.0
fbs           0.0
restecg       0.0
thalach     155.0
exang         0.0
oldpeak       3.0
slope         1.0
ca            0.0
thal          2.0
target        0.0
Name: 243, dtype: float64