# HEALTHCARE ANALYSIS ON HEART DISEASE DATA
Problem Statement: Health is real wealth. During the pandemic we all saw the brutal effects of COVID-19 on everyone, irrespective of status. You are required to analyze this health and medical data for better future preparation. This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.

Attribute Information:

1. age
2. sex
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0,1,2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

Find key metrics and factors and show the meaningful relationships between attributes.

Do your own research and come up with your findings.

Analysis of heart disease.

To predict whether the patient has heart disease or not.
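The problem statement describes a goal field valued from 0 to 4, while the prediction task here is binary. A minimal sketch of collapsing the five levels into a yes/no target follows; the column name `num` and the sample values are assumptions made for illustration, not taken from the notebook's CSV.

```python
import pandas as pd

# Hypothetical sample of the raw goal field (0 = no disease, 1-4 = disease present)
raw = pd.DataFrame({"num": [0, 2, 1, 0, 4, 3]})

# Collapse the five levels into a binary target: 1 if any disease is present
raw["target"] = (raw["num"] > 0).astype(int)
print(raw)
```

If the CSV already ships a 0/1 `target` column, this step is not needed.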

In [1]:
# IMPORTING DATA AND EXPLORING DATA

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from pylab import rcParams
rcParams['figure.figsize'] = 5,4
sns.set_style('whitegrid')

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

import warnings
warnings.filterwarnings('ignore')

In [2]:
hd = pd.read_csv("Heart disease.csv")
hd

FileNotFoundError: [Errno 2] No such file or directory: 'Heart disease.csv'

In [None]:
hd.head()

In [None]:
hd.info()

In [None]:
hd.shape

In [None]:
hd.columns

In [None]:
type(hd)

In [None]:
hd.isnull().sum()

In [None]:
hd.head()

# EDA (Exploratory Data Analysis)
Using Data Visualization

1.age : Patient's age in years

2.sex : Female or male (1 = male, 0 = female)

3.cp : Chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)

4.trestbps : Resting blood pressure

5.chol : Serum cholesterol in mg/dl

6.fbs : Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)

7.restecg : Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)

8.thalach : Maximum heart rate achieved

9.exang : Exercise induced angina (1 = yes; 0 = no)

10.oldpeak : ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)

11.slope : The slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)

12.ca : Number of major vessels (0-3) colored by fluoroscopy

13.thal : Thallium stress test result (3 = normal; 6 = fixed defect; 7 = reversible defect)

14.target : Heart disease (0 = no, 1 = yes)
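The coded values listed above can be mapped to readable labels before plotting. The sketch below uses made-up rows and assumes the coding scheme exactly as listed; some distributions of this dataset use different (for example zero-based) codes, so the mappings should be checked against `hd[column].unique()` first.

```python
import pandas as pd

# Made-up rows that follow the coding scheme listed above
sample = pd.DataFrame({"sex": [1, 0, 1], "cp": [1, 4, 3], "thal": [3, 7, 6]})

# One mapping per coded column (assumed from the attribute list)
code_labels = {
    "sex": {0: "female", 1: "male"},
    "cp": {1: "typical angina", 2: "atypical angina",
           3: "non-anginal pain", 4: "asymptomatic"},
    "thal": {3: "normal", 6: "fixed defect", 7: "reversible defect"},
}

decoded = sample.copy()
for column, mapping in code_labels.items():
    # Codes missing from the mapping become NaN, which makes mismatches visible
    decoded[column] = decoded[column].map(mapping)
print(decoded)
```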

In [None]:
hd.describe()

In [None]:
pd.set_option('display.float_format','{:.2f}'.format)

In [None]:
hd.describe()

In [None]:
hd['target'].value_counts()

In [None]:
hd.shape

In [None]:
hd['target'].value_counts()/hd.shape[0]*100

In [None]:
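# percentage of positive and negative heart disease cases
counts = hd['target'].value_counts()
# value_counts() sorts by frequency, so derive each label from its class value
labels = ['Yes' if value == 1 else 'No' for value in counts.index]

plt.pie(counts.values, labels=labels, autopct='%1.0f%%')
plt.title('Heart Disease')
plt.show()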

# Histogram plot for all features

In [None]:
import math

col = hd.columns[:13]
n_cols = 3
n_rows = math.ceil(len(col) / n_cols)  # subplot() needs integer grid sizes

plt.figure(figsize=(20, 15))
for j, i in enumerate(col):
    plt.subplot(n_rows, n_cols, j + 1)
    hd[i].hist(bins=20)
    plt.title(i)
plt.subplots_adjust(wspace=0.1, hspace=0.5)
plt.show()

In [None]:
sns.catplot(x='target', data=hd, hue='sex', kind='count')
# tick 0 is target 0 (healthy), tick 1 is target 1 (heart disease)
plt.xticks(np.arange(2), ("0-Healthy", "1-Not Healthy"), rotation=0);

In [None]:
hd['sex'].value_counts()

From the above histogram plots, age, cholesterol, resting blood pressure and maximum heart rate achieved appear to play a major role in the detection of heart disease.

From the count plot, more men than women have heart disease. Men also outnumber women in the healthy group, so these counts should be read alongside the number of men and women in the dataset (`hd['sex'].value_counts()` above).
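Raw counts by sex can mislead when one group is larger than the other. A small sketch of comparing rates instead of counts is given below; the toy data is invented for illustration and does not reflect the real dataset.

```python
import pandas as pd

# Invented toy data: 6 men (sex = 1) and 4 women (sex = 0)
toy = pd.DataFrame({
    "sex":    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "target": [1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
})

# Absolute counts per sex and target
counts = pd.crosstab(toy["sex"], toy["target"])

# Row-normalised table: share of each sex that falls into each target class
rates = pd.crosstab(toy["sex"], toy["target"], normalize="index")

print(counts)
print(rates)
```

In this toy example both sexes have three positive cases, yet the rate is higher for the smaller group, which is the kind of difference a count plot alone does not show.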

In [None]:
# number of patients for each chest pain type
hd['cp'].value_counts().plot(kind='bar',color=["deeppink","salmon"])
plt.xlabel("Chest pain type (cp)")
plt.xticks(rotation=0);

In [None]:
# Possibility of having Heart Disease
# Print the unique values of each column and collect the categorical ones (10 or fewer unique values)
categorical_values = []
for column in hd.columns:
    print('==============================')
    print(f"{column} : {hd[column].unique()}")
    if len(hd[column].unique()) <= 10:
        categorical_values.append(column)

In [None]:
plt.figure(figsize=(12,  12))
for i, column in enumerate(categorical_values, 1):
    plt.subplot(3, 3, i)
    sns.barplot(x=f"{column}", y='target', data=hd)
    plt.ylabel('Possibility to have heart disease')
    plt.xlabel(f'{column}')

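The possibility of having heart disease can be judged from the above bar plots: each bar shows the mean of target, that is, the proportion of patients with heart disease in that category.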
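The bar heights can also be computed as a table, together with the group sizes they are based on. This is a sketch on invented data; on the real data the same `groupby` call would be applied to `hd`.

```python
import pandas as pd

# Invented toy data for one categorical feature
toy = pd.DataFrame({
    "cp":     [1, 1, 2, 2, 3, 3, 3, 4],
    "target": [0, 0, 1, 0, 1, 1, 0, 1],
})

# mean = share of patients with heart disease, count = patients in the category
summary = toy.groupby("cp")["target"].agg(["mean", "count"])
print(summary)
```

Showing the count next to the mean helps to spot categories where the proportion rests on very few patients.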

In [None]:
# Comparing resting blood pressure with target, using every 15th row only

pd.crosstab(hd.trestbps[::15],hd.target).plot(kind="bar",figsize=(8,8),color=["deeppink","blue"])
plt.ylabel("patients");
plt.xticks(rotation=0);
plt.legend(['0', '1']);  

In [None]:
# Comparing chest pain type with target
#cp :Chest pain ( 1- Typical Angina, 2-Atypical Angina,3- Non-anginal pain, 4-Asymptomatic)

pd.crosstab(hd.cp[::15],hd.target).plot(kind="bar",figsize=(8,8),color=["deeppink","blue"])
plt.ylabel("patients");
plt.xticks(rotation=0);
plt.legend(['0', '1']); 


The above graph tells us that patients with type 3 chest pain tend to have heart disease, and very few patients with type 1 chest pain have heart disease.

In [None]:
# Comparing cholesterol with target
pd.crosstab(hd.chol[::15],hd.target).plot(kind="bar",figsize=(8,8),color=["deeppink","blue"])
plt.ylabel("patients");
plt.xticks(rotation=0);
plt.legend(['0', '1']); 

In [None]:
# Comparing maximum heart rate with target
pd.crosstab(hd.thalach[::15],hd.target).plot(kind="bar",figsize=(8,8),color=["deeppink","blue"])
plt.ylabel("patients");
plt.xticks(rotation=0);
plt.legend(['0', '1']);
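The crosstab plots above use every 15th row, so each bar is based on very few patients. One possible alternative for continuous features is to bin the values and use all rows. The sketch below uses invented values, and the bin edges are illustrative assumptions, not thresholds taken from the notebook.

```python
import pandas as pd

# Invented toy data
toy = pd.DataFrame({
    "chol":   [180, 195, 210, 230, 245, 260, 290, 310],
    "target": [0,   0,   1,   0,   1,   1,   1,   0],
})

# Illustrative bin edges; right=False makes each bin include its lower edge
bins = [0, 200, 240, 600]
band_labels = ["<200", "200-239", ">=240"]
toy["chol_band"] = pd.cut(toy["chol"], bins=bins, labels=band_labels, right=False)

# Cross-tabulate the bands against the target using every row
table = pd.crosstab(toy["chol_band"], toy["target"])
print(table)
```

Calling `table.plot(kind="bar")` would give a chart comparable to the ones above, with one bar group per band.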

In [None]:
# Finding heart disease patients by resting blood pressure and age using a scatter plot

plt.figure(figsize=(10,8))
plt.scatter(hd.age[hd.target==1],hd.trestbps[hd.target==1],color="Red")

plt.scatter(hd.age[hd.target==0],hd.trestbps[hd.target==0],color="Green")

plt.title("Heart disease as a function of age and resting blood pressure")
plt.xlabel("Age")
plt.ylabel("TRESTBPS")
plt.legend(["HeartDisease","No HeartDisease"]);

In [None]:
# To know the relation between various features

corr_matrix = hd.corr()
plt.figure(figsize=(20, 20))
sns.heatmap(corr_matrix, annot=True, cmap="RdYlGn", annot_kws={"size":15})

# OBSERVATION
1.Major features for having heart disease are: resting blood pressure, cholesterol, chest pain and maximum heart rate achieved.

2.The data is not imbalanced.

3.From the count plot, more men than women have heart disease. Men also outnumber women in the healthy group.

4.The graph tells us that patients with type 3 chest pain tend to have heart disease, and very few patients with type 1 chest pain have heart disease.
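One way to back the first observation with numbers is to rank features by the absolute value of their correlation with the target. The sketch below runs on synthetic data built so that one feature is related to the target and the other is not; the values are made up and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Synthetic features: "thalach" is constructed to depend on the target,
# "chol" is constructed to be unrelated to it
toy = pd.DataFrame({
    "thalach": 140 + 15 * target + rng.normal(0, 10, size=n),
    "chol": rng.normal(240, 40, size=n),
    "target": target,
})

# Absolute correlation of every feature with the target, strongest first
ranking = (
    toy.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Correlation only captures linear association, so this ranking is a rough screen and not a measure of feature importance.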

In [None]:
# creating a copy of dataset 
heart = hd.copy()

In [None]:
heart.shape

In [None]:
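# rename 'condition' to 'target' if the file uses that name (no effect when the column is already 'target')
heart = heart.rename(columns={'condition':'target'})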
heart.head()


# Let's divide our dataset: the training set is used for model training, and the test set is used to evaluate model performance

In [None]:
from sklearn.model_selection import train_test_split

x= heart.drop(columns= 'target')
y= heart.target

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=42)
print('X_train size: {}, X_test size: {}'.format(x_train.shape, x_test.shape))
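With a small dataset, a random split can leave the two subsets with different class proportions. A hedged sketch of a stratified split is shown below on synthetic arrays; `stratify` is a standard `train_test_split` parameter, but using it here is a suggestion and not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 60 of class 0 and 40 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# stratify=y keeps the class proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print("train share of class 1:", y_train.mean())
print("test share of class 1:", y_test.mean())
```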

# Feature scaling


In [None]:
from sklearn.preprocessing import StandardScaler

scaler= StandardScaler()
# fit the scaler on the training data only, then reuse its statistics for the test data
x_train_scaler= scaler.fit_transform(x_train)
x_test_scaler= scaler.transform(x_test)
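Scaling statistics should come from the training data only and then be reused unchanged on the test data. One way to make this hard to get wrong is a scikit-learn pipeline, sketched here on synthetic data from `make_classification`; the data and the choice of logistic regression are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 13 features and the binary target
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# fit() learns the scaling from X_train only;
# score() and predict() reuse those statistics on new data
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("test accuracy:", round(accuracy, 3))
```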

# Linear regression

In [None]:
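from sklearn.linear_model import LinearRegression
# LinearRegression is a regressor: it predicts a continuous value, not a class label
lr_clf= LinearRegression()
lr_clf.fit(x_train_scaler, y_train)
y_pred_lr= lr_clf.predict(x_test_scaler)
# score() returns R^2 here, not classification accuracy
lr_clf.score(x_test_scaler,y_test)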
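The score of a linear regression is the coefficient of determination R^2, so it is not directly comparable with the accuracies of the classifiers that follow. A small sketch on made-up data shows the difference, using an assumed cut-off of 0.5 to turn the continuous predictions into class labels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

# Tiny made-up dataset with a binary target
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

reg = LinearRegression().fit(X, y)
continuous = reg.predict(X)            # real-valued, can fall outside [0, 1]

# Assumed cut-off of 0.5 to obtain class labels
labels = (continuous >= 0.5).astype(int)

r2 = reg.score(X, y)                   # coefficient of determination
acc = accuracy_score(y, labels)        # share of correct class labels

print("predictions:", np.round(continuous, 3))
print("R^2:", round(r2, 3), "accuracy:", acc)
```

The two numbers differ even on the same predictions, which is why logistic regression is the more natural linear model for this task.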

# LOGISTIC REGRESSION


In [None]:
from sklearn.linear_model import LogisticRegression

logr_clf= LogisticRegression()
logr_clf.fit(x_train_scaler, y_train)
y_pred_lor= logr_clf.predict(x_test_scaler)
logr_clf.score(x_test_scaler,y_test)

In [None]:
print('Logistic Regression Classification Report\n', classification_report(y_test, y_pred_lor))
print('Accuracy: {}%\n'.format(round((accuracy_score(y_test, y_pred_lor)*100),2)))

In [None]:
cm = confusion_matrix(y_test, y_pred_lor)
cm
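For a medical screening task, the individual cells of the confusion matrix matter more than overall accuracy. The sketch below derives sensitivity and specificity from a confusion matrix, using invented labels and predictions.

```python
from sklearn.metrics import confusion_matrix

# Invented labels and predictions (1 = heart disease, 0 = healthy)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # share of diseased patients that were detected
specificity = tn / (tn + fp)   # share of healthy patients correctly cleared

print("sensitivity:", round(sensitivity, 3))
print("specificity:", round(specificity, 3))
```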

# Decision Tree classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_clf= DecisionTreeClassifier()
dt_clf.fit(x_train_scaler, y_train)
y_pred_dct= dt_clf.predict(x_test_scaler)
dt_clf.score(x_test_scaler,y_test)

In [None]:
print(' DT Classification Report\n', classification_report(y_test, y_pred_dct))
print('Accuracy: {}%\n'.format(round((accuracy_score(y_test, y_pred_dct)*100),2)))

In [None]:
cm = confusion_matrix(y_test, y_pred_dct)
cm

# Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_clf= RandomForestClassifier()
rf_clf.fit(x_train_scaler, y_train)
y_pred_rfc= rf_clf.predict(x_test_scaler)
rf_clf.score(x_test_scaler,y_test)

In [None]:
print('Random Forest Classification Report\n', classification_report(y_test, y_pred_rfc))
print('Accuracy: {}%\n'.format(round((accuracy_score(y_test, y_pred_rfc)*100),2)))

In [None]:
cm = confusion_matrix(y_test, y_pred_rfc)
cm

# SVM

In [None]:
from sklearn.svm import SVC

svc_clf= SVC()
svc_clf.fit(x_train_scaler, y_train)
y_pred_svc= svc_clf.predict(x_test_scaler)
svc_clf.score(x_test_scaler,y_test)

In [None]:
print(' SVM Classification Report\n', classification_report(y_test, y_pred_svc))
print('Accuracy: {}%\n'.format(round((accuracy_score(y_test, y_pred_svc)*100),2)))

In [None]:
cm = confusion_matrix(y_test, y_pred_svc)
cm
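Each model above is scored on a single train/test split, which can make the comparison sensitive to that one split. A possible complement is k-fold cross-validation, sketched below on synthetic data; the number of folds and the `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease features and target
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

# Each model is wrapped with its own scaler so scaling is refit inside every fold
models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "decision tree": make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42)),
    "random forest": make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42)),
    "svm": make_pipeline(StandardScaler(), SVC()),
}

# Mean accuracy over 5 folds for every model
results = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```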

In [None]:
import pickle

filename = 'Healthcare_Analysis_on_Heart_Disease.pkl'
# the with-block closes the file once the model is written
with open(filename, 'wb') as f:
    pickle.dump(logr_clf, f)
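The logistic regression model was trained on scaled inputs, so anyone loading the pickle later needs the same scaler as well. A sketch of saving the scaler and the model together as one pipeline, then loading it back, is given below; the toy data and the temporary file name are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up training set
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Scaler and classifier travel together in one object
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Hypothetical file location used only for this example
path = os.path.join(tempfile.gettempdir(), "example_heart_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored pipeline accepts raw (unscaled) inputs
same = np.array_equal(model.predict(X), restored.predict(X))
print("predictions match:", same)

os.remove(path)
```

Pickle files should only be loaded from trusted sources, and with a scikit-learn version compatible with the one used for saving.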