# Heart disease classification

## Problem description

A dataset of hospital patients has been released, with some of their medical records processed into a CSV file. The task is to use this data to build a predictive model that detects heart disease in patients from their records with the best possible accuracy.

## Dataset description


The dataset originally contains 76 attributes, but only 13 were released and used as features. The dataset therefore has 14 columns (13 features and 1 target) and 303 rows.


## Description of Columns

* age (age in years)
* sex (1=male,0=female)
* cp (chest pain type)
* trestbps (resting blood pressure in mmHg on admission to the hospital)
* chol (serum cholesterol in mg/dl)
* fbs (fasting blood sugar > 120 mg/dl, 1=true, 0=false)
* restecg (resting electrocardiographic results)
* thalach (maximum heart rate achieved)
* exang (exercise-induced angina, 1=yes, 0=no)
* oldpeak (ST depression induced by exercise relative to rest)
* slope (the slope of the peak exercise ST segment)
* ca (the number of major vessels colored by fluoroscopy)
* thal (3=normal, 6=fixed defect,7=reversible defect)
* target (1=positive or 0=negative)
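
The 0/1 coding described above can be turned into a quick sanity check on a single record. The sketch below is illustrative only: the `BINARY_COLUMNS` set and the `check_record` helper are hypothetical names that do not appear in the original notebook, and the sample record is the first row printed by `dataset.head()` further down.

```python
# Columns that the description above codes as 0/1 (hypothetical helper constant).
BINARY_COLUMNS = {"sex", "fbs", "exang", "target"}

def check_record(record):
    """Return the names of binary columns whose value is not 0 or 1."""
    return sorted(
        name for name in BINARY_COLUMNS
        if name in record and record[name] not in (0, 1)
    )

# First row of the dataset as printed by dataset.head().
first_row = {
    "age": 63, "sex": 1, "cp": 3, "trestbps": 145, "chol": 233,
    "fbs": 1, "restecg": 0, "thalach": 150, "exang": 0, "oldpeak": 2.3,
    "slope": 0, "ca": 0, "thal": 1, "target": 1,
}

problems = check_record(first_row)
print(problems)  # an empty list means every binary column holds 0 or 1
```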

## Analyzing and visualizing the dataset

In [22]:
#importing all the necessary libraries
import pandas as pd #pandas for data loading and manipulation
import matplotlib.pyplot as plt #matplotlib for visualization
import numpy as np #numpy for numerical computing
import seaborn as sns #seaborn for plots in exploratory data analysis

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import xgboost as xgb
%matplotlib inline 

In [3]:
#importing the dataset
dataset= pd.read_csv('heart.csv')

In [4]:
dataset.head() #showing the first five rows of the dataset

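| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |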


In [5]:
dataset.tail() #showing the last five rows of the dataset

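| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |
| 302 | 57 | 0 | 1 | 130 | 236 | 0 | 0 | 174 | 0 | 0.0 | 1 | 1 | 2 | 0 |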


In [6]:
dataset.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

The dataset therefore has no missing entries: every column reports a null count of 0.
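
To see what a non-zero count would look like, here is a small sketch on a toy frame. The three rows are taken from the `head()` output above, with one `chol` value deliberately blanked out for illustration; that missing value is an assumption and is not present in the real data.

```python
import numpy as np
import pandas as pd

# Three rows copied from dataset.head(); the NaN in chol is injected on purpose.
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233, np.nan, 204],
    "target": [1, 1, 1],
})

# Same call as in the notebook: one count of missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True only when no column has a missing entry.
has_no_missing = bool((missing_per_column == 0).all())
print(has_no_missing)
```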

## Data preprocessing


In [7]:
X=dataset.drop('target',axis=1)
y=dataset['target']

In [8]:
X

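| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |
| 5 | 57 | 1 | 0 | 140 | 192 | 0 | 1 | 148 | 0 | 0.4 | 1 | 0 | 1 |
| 6 | 56 | 0 | 1 | 140 | 294 | 0 | 0 | 153 | 0 | 1.3 | 1 | 0 | 2 |
| 7 | 44 | 1 | 1 | 120 | 263 | 0 | 1 | 173 | 0 | 0.0 | 2 | 0 | 3 |
| 8 | 52 | 1 | 2 | 172 | 199 | 1 | 1 | 162 | 0 | 0.5 | 2 | 0 | 3 |
| 9 | 57 | 1 | 2 | 150 | 168 | 0 | 1 | 174 | 0 | 1.6 | 2 | 0 | 2 |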


In [9]:
y

0      1
1      1
2      1
3      1
4      1
5      1
6      1
7      1
8      1
9      1
10     1
11     1
12     1
13     1
14     1
15     1
16     1
17     1
18     1
19     1
20     1
21     1
22     1
23     1
24     1
25     1
26     1
27     1
28     1
29     1
      ..
273    0
274    0
275    0
276    0
277    0
278    0
279    0
280    0
281    0
282    0
283    0
284    0
285    0
286    0
287    0
288    0
289    0
290    0
291    0
292    0
293    0
294    0
295    0
296    0
297    0
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

For each algorithm, we will compare two versions of the data:
- Scaled data
- Unscaled data
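
Scaling here means min-max scaling (the `MinMaxScaler` imported earlier), which maps each feature to x' = (x - min) / (max - min) using the minimum and maximum seen in the training data. The sketch below reproduces that formula by hand. `fit_min_max` and `transform` are hypothetical helper names, and the `trestbps` values are the ones visible in the `head()` and `tail()` outputs above, used here only as a stand-in for a train/test split.

```python
def fit_min_max(values):
    """Learn the minimum and maximum of a training column."""
    return min(values), max(values)

def transform(values, lo, hi):
    """Apply x' = (x - lo) / (hi - lo) with already-learned bounds."""
    return [(v - lo) / (hi - lo) for v in values]

# trestbps from the first five rows (treated as training data for illustration).
train_bp = [145, 130, 130, 120, 120]
# trestbps from two of the last five rows (treated as test data for illustration).
test_bp = [140, 110]

lo, hi = fit_min_max(train_bp)
train_scaled = transform(train_bp, lo, hi)
test_scaled = transform(test_bp, lo, hi)

print(train_scaled)  # [1.0, 0.4, 0.4, 0.0, 0.0]
print(test_scaled)   # [0.8, -0.4]
```

The bounds come from the training values only, so a test value below the training minimum (110 here) maps outside the 0 to 1 range.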

In [15]:
#splitting the data into training and test parts
x_train,x_test,y_train,y_test=train_test_split(X,y,random_state=7,test_size=0.2)

#scaling
scaler=MinMaxScaler()
sc_x=scaler.fit(x_train)
x_train_scaled=sc_x.transform(x_train.values)
x_test_scaled=sc_x.transform(x_test.values)
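
With 303 rows and `test_size=0.2`, the split sizes can be worked out by hand. The sketch below assumes that a fractional test size is rounded up to the next whole row, which is consistent with the accuracy values reported later in this notebook (all of them are multiples of 1/61).

```python
import math

n_rows = 303
test_size = 0.2

# Assumption: a fractional test_size is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

# Each single test prediction moves the accuracy by this much.
accuracy_step = 1 / n_test

print(n_train, n_test)          # 242 61
print(round(accuracy_step, 4))  # 0.0164
```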

We will build only two kinds of models, chosen for their usual speed and performance on classification datasets. The model types are:
 * Random forest Classifier
 * XGBoost Classifier

In [21]:
#Randomforest
rclassifier1=RandomForestClassifier()
rclassifier2=RandomForestClassifier()
rclassifier1.fit(x_train,y_train)
y_preds=rclassifier1.predict(x_test)
rclassifier2.fit(x_train_scaled,y_train)
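#rclassifier2 is the model fitted on the scaled data, so it must make the scaled predictions
#(the scaled score printed below came from a run that used rclassifier1 here)
y_preds1=rclassifier2.predict(x_test_scaled)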
print('The accuracy score for Random Forest Unscaled is', accuracy_score(y_test,y_preds))
print('The accuracy score for Random Forest scaled is', accuracy_score(y_test,y_preds1))

The accuracy score for Random Forest Unscaled is 0.7377049180327869
The accuracy score for Random Forest scaled is 0.639344262295082


In [27]:
#XGBoost
xclassifier1=xgb.XGBClassifier()
xclassifier2=xgb.XGBClassifier()
xclassifier1.fit(x_train,y_train)
y_preds2=xclassifier1.predict(x_test)
xclassifier2.fit(x_train_scaled,y_train)
y_preds3=xclassifier2.predict(x_test_scaled)
print('The accuracy score for XGBoost Unscaled is', accuracy_score(y_test,y_preds2))
print('The accuracy score for XGBoost scaled is', accuracy_score(y_test,y_preds3))

The accuracy score for XGBoost Unscaled is 0.7213114754098361
The accuracy score for XGBoost scaled is 0.7213114754098361




The Random Forest classifier on unscaled data produced the best result (0.738), even without hyperparameter tuning. Its lower score on scaled data (0.639) should be read with caution, since tree-based models are generally insensitive to feature scaling. The XGBoost classifier scored the same (0.721) on both scaled and unscaled data, a close second to Random Forest. Intuitively, we should concentrate our efforts on the XGBoost classifier even though it performed slightly worse, as it seems to generalize better on this data.
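
Identical scores on scaled and unscaled data are what one would generally expect from tree-based models, because a split depends only on the ordering of feature values, and min-max scaling preserves that ordering. The toy sketch below illustrates this with a one-feature decision stump written from scratch. `best_split` is a hypothetical helper, not an XGBoost or scikit-learn function, and the `thalach` and `target` values are the ten rows visible in the `head()` and `tail()` outputs above.

```python
def best_split(values, labels):
    """Return the indices sent left by the threshold with the fewest errors.

    Candidate thresholds are midpoints between consecutive sorted values;
    each side of the split predicts its majority label.
    """
    ordered = sorted(set(values))
    best_errors, best_left = None, None
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        errors = 0
        for side in (left, right):
            ones = sum(labels[i] for i in side)
            errors += min(ones, len(side) - ones)  # minority count = mistakes
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, left
    return best_left, best_errors

# thalach and target from the ten rows shown by head() and tail().
thalach = [150, 187, 172, 178, 163, 123, 132, 141, 115, 174]
target = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Min-max scale the same feature.
lo, hi = min(thalach), max(thalach)
thalach_scaled = [(v - lo) / (hi - lo) for v in thalach]

left_raw, errors_raw = best_split(thalach, target)
left_scaled, errors_scaled = best_split(thalach_scaled, target)

print(left_raw == left_scaled, errors_raw == errors_scaled)  # True True
```

The threshold value itself changes with scaling, but the rows that fall on each side of it do not, so the predictions are the same.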

First, look for the best random_state for the train/test split by looping over the values 0 to 100 and recording the test accuracy of each split.

In [28]:
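play =[] #test accuracy for each random_state
for i in range(101):
    #a different random_state gives a different train/test split
    x_train_set,x_test_set,y_train_set,y_test_set=train_test_split(X,y,random_state=i,test_size=0.2)
    xclassifier_real=xgb.XGBClassifier()
    xclassifier_real.fit(x_train_set,y_train_set)
    y_preds6=xclassifier_real.predict(x_test_set)
    play.append(accuracy_score(y_test_set,y_preds6))
#index of the highest accuracy, which is also the random_state that produced it
print(play.index(max(play)))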




4




In [29]:
play[4]

0.9016393442622951
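
Because the test set holds 61 rows (20% of 303, assuming the fraction is rounded up), each accuracy value corresponds to a whole number of correct predictions. The short sketch below converts the two XGBoost scores reported in this notebook back into counts. The `correct_count` helper is a hypothetical name used only for illustration.

```python
N_TEST = 61  # assumed test-set size: 20% of 303 rows, rounded up

def correct_count(accuracy, n_test=N_TEST):
    """Convert an accuracy score back to a number of correct predictions."""
    return round(accuracy * n_test)

score_seed_7 = 0.7213114754098361  # XGBoost, random_state=7
score_seed_4 = 0.9016393442622951  # XGBoost, random_state=4

print(correct_count(score_seed_7))  # 44
print(correct_count(score_seed_4))  # 55
print(correct_count(score_seed_4) - correct_count(score_seed_7))  # 11
```

The gap between the two splits is 11 of 61 test rows, which gives a sense of how much the score depends on which rows land in the test set.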

Now retrain the model on the split produced by the best random_state (4).

In [33]:
#splitting the data into training and test parts
x_train_new,x_test_new,y_train_new,y_test_new=train_test_split(X,y,random_state=4,test_size=0.2)

#scaling
scaler=MinMaxScaler()
sc_x=scaler.fit(x_train_new) #fit the scaler on the new training split, not the old one
x_train_scaled_new=sc_x.transform(x_train_new.values)
x_test_scaled_new=sc_x.transform(x_test_new.values)

#XGBoost
xclassifier1=xgb.XGBClassifier()
xclassifier2=xgb.XGBClassifier()
xclassifier1.fit(x_train_new,y_train_new)
y_preds7=xclassifier1.predict(x_test_new)
xclassifier2.fit(x_train_scaled_new,y_train_new)
y_preds8=xclassifier2.predict(x_test_scaled_new)
print('The accuracy score for XGBoost Unscaled is', accuracy_score(y_test_new,y_preds7))
print('The accuracy score for XGBoost scaled is', accuracy_score(y_test_new,y_preds8))

The accuracy score for XGBoost Unscaled is 0.9016393442622951
The accuracy score for XGBoost scaled is 0.9016393442622951




At this point the result is satisfactory. With so little data, further tuning would risk overfitting the model.
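
To give a rough sense of how much uncertainty a 61-row test set leaves, the sketch below computes a normal-approximation standard error for the reported accuracy. This is a back-of-the-envelope estimate under the assumption that test predictions are independent. It is not part of the original analysis, and `accuracy_standard_error` is a hypothetical helper name.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)

accuracy = 0.9016393442622951  # reported XGBoost score for random_state=4
n_test = 61                    # assumed test-set size (20% of 303, rounded up)

se = accuracy_standard_error(accuracy, n_test)
print(round(se, 3))  # 0.038
```

A standard error of roughly 0.04 on a score of 0.90 is large relative to the differences between the models compared above, which fits the caution about the small amount of data.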


## The End 