<img src="https://static1.cbrimages.com/wordpress/wp-content/uploads/2020/04/Netero-Featured.jpg?q=50&fit=crop&w=960&h=500&dpr=1.5" width=600   height=400 alt="Heart Picture which you can't see" style="margin: auto"/>

<h2 style="font-family:monospace">EDA on Heart Disease UCI</h2>



<h3 style="font-family:monospace">Overview</h3>

<ul style="font-family:monospace">
    <li>Slope results for various rest ECG (Electrocardiogram) results</li>
    <li>Patients having high FBS (Fasting Blood sugar) and how it varies with age</li>
    <li>Thallium test results and number of vessels colored by fluoroscopy</li>
    <li>Maximum Heart Rate achieved by patients and how it varies with age</li>
    <li>Cholesterol Levels vs Heart Disease</li>
    <li>Chest Pain Type and how many patients are having a heart disease for each type</li>
    <li>Finding outliers and scaling the data to build a decision tree</li>
    <li>Confusion Matrix and accuracy score of the decision tree classifier</li>
    <li>Feature importance according to the decision tree classifier and verification</li>
</ul>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier


%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv("../input/heart-disease-uci/heart.csv")
df.head(5)

In [None]:
df.describe()

In [None]:
print(df.shape)

<h2 style="font-family:monospace">Checking for any null values</h2>

In [None]:
print(df.isnull().sum())

<span style="font-family:monospace">No null values, we are saved! 🙏 There is no need to convert any null values or drop rows.</span>

<h2 style="font-family:monospace">Making Data Readable</h2>

<span style="font-family:monospace">At present, the dataframe contains nothing but numeric data. However, some features are categorical, and the raw numbers alone don't tell us what they mean 🤔. Let's convert them.</span>

<span style="font-family:monospace">Reference :: https://pubmed.ncbi.nlm.nih.gov/20494662/</span>

<ul style="font-family:monospace">
    <li>age: age (in years)</li>
    <li>sex: gender (1 = male; 0 = female)</li>
    <li>cp: chest pain type</li>
    <li>trestbps: resting blood pressure (in mmHg, upon admission to the hospital)</li>
    <li>chol: serum cholesterol in mg/dL</li>
    <li>fbs: fasting blood sugar > 120 mg/dL (likely to be diabetic); 1 = true; 0 = false</li>
    <li>restecg: resting electrocardiogram results</li>
    <li>thalach: maximum heart rate achieved</li>
    <li>exang: exercise induced angina (1 = yes; 0 = no)</li>
    <li>oldpeak: ST depression induced by exercise relative to rest (in mm, achieved by subtracting the lowest ST segment points during exercise and rest)</li>
    <li>slope: the slope of the peak exercise ST segment, ST-T abnormalities are considered to be a crucial indicator for identifying presence of ischaemia</li>
    <li>ca: number of major vessels (0-3) colored by fluoroscopy</li>
    <li>thal: thallium test</li>
    <li>target: does that person have disease or not? 0 = no disease, 1 = disease</li>
</ul>

<span style="font-family:monospace">The confusing ones 😲..</span>


<h3 style="font-family:monospace">Chest Pain Type (cp)</h3>

| No. | Chest Pain Type | Criteria |
| --------------- | --------------- | --------------- |
| 0 | Typical Angina | All criteria present |
| 1 | Atypical Angina | 2 of 3 criteria present |
| 2 | Non Anginal Pain | One or fewer criteria present |
| 3 | Asymptomatic | None of the criteria are satisfied |

<ul style="font-family:monospace">
    <li>Angina: Discomfort that is noted when the heart does not get enough blood or oxygen</li>
    <li>Non Anginal Pain: Pain in the chest that is not caused by heart disease or a heart attack, usually related to digestive tract</li>
    <li>Asymptomatic: Silent killer which shows no symptoms</li>
</ul>

<h3 style="font-family:monospace">Resting ElectroCardiogram Results (restecg)</h3>

| No. | Results |
| --------------- | ------------------- |
| 0 | Normal |
| 1 | Having ST-T wave abnormality |
| 2 | Showing probable or definite left ventricular hypertrophy by Estes' criteria |

<ul style="font-family:monospace">
    <li>Left Ventricular Hypertrophy: A heart's left pumping chamber that has thickened and may not be pumping efficiently</li>
    <li>ST-T wave abnormality: an ST segment abnormality (elevation or depression) can indicate myocardial ischaemia or infarction, i.e. a heart attack</li>
</ul>


<h3 style="font-family:monospace">Slope of peak exercise ST segment (slope)</h3>

| No. | Slope |
| --------------- | ------------------- |
| 0 | Upsloping |
| 1 | Flat |
| 2 | Downsloping |

<img src="https://litfl.com/wp-content/uploads/2018/10/ST-segment-depression-upsloping-downsloping-horizontal.png" style="margin:auto" alt="slope-img" width=500>

<span style="font-family:monospace">Horizontal or downsloping ST depression ≥ 0.5 mm at the J-point in ≥ 2 contiguous leads indicates myocardial ischaemia or blockage of arteries, which can eventually lead to heart disease</span>


<h3 style="font-family:monospace">Thallium Testing (thal)</h3>

| No. | Results | Meaning |
| --------------- | ------------------- | ------------------- |
| 1 | Normal | Passed the thallium test and condition is normal |
| 2 | Fixed Defect | Heart tissue can't absorb thallium both under stress and at rest |
| 3 | Reversible Defect | Heart tissue is unable to absorb thallium only during the exercise portion of the test |


<span style="font-family:monospace">A thallium stress test is a nuclear medicine study that shows your physician how well blood flows through your heart muscle while you're exercising or at rest. You're basically in trouble if the result is a fixed defect or a reversible defect, with the fixed defect being the worse of the two.</span>
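
<span style="font-family:monospace">A minimal sketch of how code tables like the ones above could be kept as lookup dictionaries and applied with pandas' Series.map. The toy frame and the *_LABELS names are made up for illustration and are not part of this notebook; only the code-to-label pairs are taken from the feature list and the chest pain table.</span>

```python
import pandas as pd

# Toy frame standing in for heart.csv (values are made up for illustration)
toy = pd.DataFrame({"sex": [1, 0, 1], "cp": [0, 2, 3], "fbs": [0, 1, 0]})

# Hypothetical lookup tables built from the feature list and tables above
SEX_LABELS = {0: "Female", 1: "Male"}
CP_LABELS = {
    0: "Typical Angina",
    1: "Atypical Angina",
    2: "Non Anginal Pain",
    3: "Asymptomatic",
}
# fbs == 1 means "fasting blood sugar > 120 mg/dL" is true
FBS_LABELS = {0: "<= 120 mg/dL", 1: "> 120 mg/dL"}

# map() looks every code up in its dictionary and returns the label
readable = toy.assign(
    sex=toy["sex"].map(SEX_LABELS),
    cp=toy["cp"].map(CP_LABELS),
    fbs=toy["fbs"].map(FBS_LABELS),
)
print(readable)
```

<span style="font-family:monospace">With map, a code that is missing from its dictionary comes out as NaN, which makes undocumented values easy to spot.</span>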


In [None]:
df.loc[df["sex"]==0,"sex"] = "Female"
df.loc[df["sex"]==1,"sex"] = "Male"

df.loc[df["cp"] == 0,"cp"] = "Typical Angina"
df.loc[df["cp"] == 1,"cp"] = "Atypical Angina"
df.loc[df["cp"] == 2,"cp"] = "Non Anginal Pain"
df.loc[df["cp"] == 3,"cp"] = "Asymptomatic"

df.loc[df["restecg"] == 0,"restecg"] = "Normal"
df.loc[df["restecg"] == 1,"restecg"] = "ST-T Wave Abnormality"
df.loc[df["restecg"] == 2,"restecg"] = "Left Ventricular Hypertrophy"

df.loc[df["slope"] == 0,"slope"] = "Upsloping"
df.loc[df["slope"] == 1,"slope"] = "Flat"
df.loc[df["slope"] == 2,"slope"] = "Downsloping"

df.loc[df["thal"] == 1,"thal"] = "Normal"
df.loc[df["thal"] == 2,"thal"] = "Fixed Defect"
df.loc[df["thal"] == 3,"thal"] = "Reversible Defect"

# fbs == 1 means fasting blood sugar is above 120 mg/dL (see the feature list)
df.loc[df["fbs"] == 0,"fbs"] = "< 120mg/dL"
df.loc[df["fbs"] == 1,"fbs"] = "> 120mg/dL"

df.loc[df["exang"] == 0,"exang"] = "No"
df.loc[df["exang"] == 1,"exang"] = "Yes"

df.loc[df["target"] == 0,"target"] = "No heart disease found"
df.loc[df["target"] == 1,"target"] = "Has heart disease"

In [None]:
df.head(5)

In [None]:
print(df["thal"].unique())

# replacing 0 - causes problems in pre processing
df.loc[df["thal"]==0,"thal"] = "Not taken the test"

<h2 style="font-family:monospace">Inspection Time! 🤓</h2>
<span style="font-family:monospace">Get your lab coats ready!</span>


<h2 style="font-family:monospace">Relation between Attributes</h2>

In [None]:
plt.figure(figsize=(10,8))

# Only the numeric columns can be correlated; the categorical ones now hold strings
corr = df.corr(numeric_only=True)

tick_labels = ["Age","Resting BP","Cholesterol","Max Heart Rate","Old Peak","Vessels colored"]

# Getting the upper triangle of the correlation matrix
matrix = np.triu(corr)

# using the upper triangle matrix as mask 
corr_heatmap = sns.heatmap(corr,
            annot=True, 
            mask=matrix, 
            cmap="viridis",
            xticklabels=tick_labels,
            yticklabels=tick_labels
           )
plt.yticks(rotation=0)
plt.show()

<span style="font-family:monospace">Amazing! We can observe several correlations involving age and blood pressure in this heatmap.</span>

<ul style="font-family:monospace">
    <li>Age is positively correlated with almost everything except maximum heart rate: cholesterol, BP and the number of vessels colored by fluoroscopy all tend to increase with age</li>
    <li>Maximum heart rate achieved by a patient is negatively correlated with old peak (ST depression induced by exercise relative to rest) and with vessels colored</li>
</ul>
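
<span style="font-family:monospace">As a rough sketch of what the heatmap cell does under the hood, the snippet below builds a correlation matrix from numeric columns only and hides the upper triangle. The toy data (age, thalach, sex) is randomly generated for illustration and is not the heart dataset; numeric_only assumes pandas 1.5 or newer.</span>

```python
import numpy as np
import pandas as pd

# Made-up data: heart rate loosely decreasing with age, plus one text column
rng = np.random.default_rng(0)
age = rng.integers(30, 78, size=200)
toy = pd.DataFrame({
    "age": age,
    "thalach": 220 - age + rng.normal(0, 10, size=200),
    "sex": rng.choice(["Male", "Female"], size=200),
})

# Text columns are skipped, so only age and thalach are correlated
corr = toy.corr(numeric_only=True)

# Boolean mask that is True on the diagonal and above it
mask = np.triu(np.ones_like(corr, dtype=bool))

# Masked cells become NaN, leaving only the lower triangle
lower = corr.mask(mask)
print(lower)
```

<span style="font-family:monospace">Passing the same boolean mask to sns.heatmap hides those cells in the plot, so every pair of features is shown only once.</span>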


In [None]:
### ST-T slopes for various Rest ECG results

colors = ["#FFF338","#0CECDD"]
title_style = {
    "fontname":"monospace",
    "fontsize":25
}
plt.figure(figsize=(30,8))

slopes_st_t = df.loc[df["restecg"]=="ST-T Wave Abnormality"]
slopes_ventricular = df.loc[df["restecg"]=="Left Ventricular Hypertrophy"] 
slopes_normal = df.loc[df["restecg"]=="Normal"]


plt.subplot(1,3,1)
plt.title("ST-T Wave Abnormality",fontdict=title_style)
sns.countplot(x="slope",hue="sex",data=slopes_st_t,palette=colors)

plt.subplot(1,3,2)
plt.title("Ventricular Hypertrophy",fontdict=title_style)
sns.countplot(x="slope",hue="sex",data=slopes_ventricular,palette=colors)

plt.subplot(1,3,3)
plt.title("Normal",fontdict=title_style)
sns.countplot(x="slope",hue="sex",data=slopes_normal,palette=colors)

plt.show()

<span style="font-family:monospace">Interestingly, there are no downsloping results when the restecg results showed left ventricular hypertrophy. It is quite peculiar that all of the patients with a flat slope in the ventricular hypertrophy group are male and all of those with an upsloping slope are female 🤔. The count of males is higher in most groups, perhaps because the majority of patients are male. Let's find that out</span>

In [None]:
plt.title("Count of males vs females")
sns.countplot(x="sex",data=df,palette="magma")
plt.show()

<span style="font-family:monospace">As expected, there are considerably more males than females among the patients. But how many of them are probably diabetic? We can tell by looking at fbs, the fasting blood sugar test, which measures sugar levels in the blood after a period without eating (fun fact: eating causes sugar levels to spike and gives inaccurate results, hence the fasting before the test)</span>

In [None]:
plt.title("FBS Count of Patients",fontdict={"fontname":"monospace","fontsize": 20})
sns.histplot(x="age",
             hue="fbs",
             data=df,
             element="poly",
            )
plt.show()

<span style="font-family:monospace">We can see that most of the patients admitted have fasting blood sugar levels of 120 mg/dL or less (fbs = 0), so only a minority are likely to be diabetic. Most of the patients also lie in the age group of 50-65 ish. Let's see</span>
<ul style="font-family:monospace">
    <li>Results of the Thallium test for the patients who have exercise induced angina</li>
    <li>Fluoroscopy test results based on different thallium test results</li>
    <li>The average maximum heart rate achieved by the patients admitted</li>
</ul>

In [None]:
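# thal == 0 was relabelled "Not taken the test" earlier, so filter on that label
exercise_induced_angina = df.loc[(df["exang"]=="Yes") & (df["thal"]!="Not taken the test")]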
age_unique=sorted(df["age"].unique())

kwargs = dict(s=10)
plt.figure(figsize=(20,5))
# THALLIUM RESULTS
plt.subplot(1,2,1)
plt.title("Thallium Results",fontdict={"fontname":"monospace","fontsize":15})
sns.countplot(data=exercise_induced_angina,
              x="thal",
              palette="magma"
             )

# FLUOROSCOPY RESULTS
plt.subplot(1,2,2)
plt.title("Fluoroscopy - Vessels colored",fontdict={"fontname":"monospace","fontsize":15})
sns.histplot(x="ca",hue="thal",data=exercise_induced_angina,palette="viridis",multiple="stack")
plt.show()


# MAX HEART RATE ACHIEVED
plt.figure(figsize=(20,5))
plt.title("Maximum Heart Rate achieved by all patients admitted",fontdict={"fontname":"monospace","fontsize":15})
plt.ylabel("Maximum heart rate")

# Plotting the mean thalach per age, else it would show blue patches of range
# groupby sorts by age, so the order matches age_unique
mean_thalach = df.groupby('age')['thalach'].mean().values

sns.pointplot(x=age_unique,y=mean_thalach,markers=['o'],scale=0.5,color="purple")

plt.show()

<span style="font-family:monospace">Most of the people who had exercise induced angina showed a reversible defect in the thallium test. That fits, since a reversible defect means the heart tissue is unable to absorb thallium only during the exercise portion of the test. Next comes the fixed defect (no absorption under both stress and rest), and only a very small percentage of these patients got a normal result</span>

<span style="font-family:monospace">More vessels colored implies blood flow is proper, and hence there are fewer admitted patients at the higher counts. The fixed defect thallium result is most visible in patients with 0 colored vessels</span>


<span style="font-family:monospace">Average thalach (Maximum heart rate achieved) is decreasing with age 📉</span> 
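
<span style="font-family:monospace">One way to put a number on this trend is to average thalach per age and fit a straight line through the averages. The sketch below uses a tiny made-up table rather than the real data, and the straight-line fit with np.polyfit is my own assumption about how to summarise the trend, not something the plot above computes.</span>

```python
import numpy as np
import pandas as pd

# Made-up ages and maximum heart rates, two patients per age
toy = pd.DataFrame({
    "age": [40, 40, 50, 50, 60, 60],
    "thalach": [180, 170, 165, 155, 150, 140],
})

# Mean maximum heart rate for every distinct age (sorted by age)
mean_by_age = toy.groupby("age")["thalach"].mean()

# Degree-1 polynomial fit: slope is the change in mean thalach per year of age
slope, intercept = np.polyfit(
    mean_by_age.index.to_numpy(), mean_by_age.to_numpy(), deg=1
)
print(mean_by_age)
print("beats per minute per year:", round(slope, 2))
```

<span style="font-family:monospace">A negative slope means the average maximum heart rate drops as age goes up.</span>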

<h3 style="font-family:monospace">Cholesterol and Heart Disease</h3>

<img src="https://i.insider.com/5f19dd2ef0f41940574e24a5?width=750&format=jpeg&auto=webp" style="margin:auto" alt="slope-img" width=500>

In [None]:
# Cholesterol is in mg/dL in this dataset, hence we can shade the usual ranges and compare them with the heart disease diagnosis
kwargs=dict(s=30)
fig, ax = plt.subplots(nrows=1,figsize=(10,8))
plt.title("Cholesterol Levels vs Heart Disease",fontdict={"fontname":"monospace","fontsize": 20})
sns.scatterplot(x="age",
            y="chol",
            hue="target",
            data=df,
            palette="plasma",
            **kwargs)

ax.axhspan(0,200,alpha=0.2,color='green')
ax.axhspan(200,239,alpha=0.2,color='yellow')
ax.axhspan(239,600,alpha=0.2,color='red')

plt.show()

<span style="font-family:monospace">Quite an unusual result! The only thing that makes sense here is that most of the patients have cholesterol above the ideal range. However, patients with and without heart disease are spread out across all cholesterol levels.</span>

<ul style="font-family:monospace">
    <li>A few patients with high cholesterol have been diagnosed as not having a heart disease</li>
    <li>The youngest patient admitted for a check on cardiovascular disease is around 28-29 ish and the oldest one is around 77-78 ish</li>
    <li>Age might contribute slightly to heart disease (angina), as the points are slightly more concentrated to the right. Still, they are distributed across all ages, implying age doesn't really matter for a cardiovascular disease to occur, another reason to enjoy the present :p 💃</li>
</ul>
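
<span style="font-family:monospace">To go from eyeballing the shaded bands to actual counts, the cholesterol values can be binned with pd.cut and cross-tabulated against the diagnosis. This is only a sketch on made-up numbers; the band edges (200 and 239) are the ones shaded in the plot above, while the band names and the toy frame are my own.</span>

```python
import pandas as pd

# Made-up cholesterol readings (mg/dL) and diagnoses (1 = disease, 0 = none)
toy = pd.DataFrame({
    "chol": [180, 210, 250, 300, 199, 239],
    "target": [0, 1, 1, 0, 1, 0],
})

# Intervals are right-closed: (0, 200], (200, 239], (239, 600]
bands = pd.cut(
    toy["chol"],
    bins=[0, 200, 239, 600],
    labels=["up to 200", "200-239", "above 239"],
)

# Rows are cholesterol bands, columns are the diagnosis
table = pd.crosstab(bands, toy["target"])
print(table)
```

<span style="font-family:monospace">If the rows of such a table look alike, cholesterol on its own is not separating the two groups well, which matches what the scatter plot suggests.</span>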

In [None]:
plt.title("Patient's age diagnosed with heart disease",fontdict={"fontname":"monospace","fontsize": 15})
have_heart_disease = df.loc[df["target"]=="Has heart disease"]
sns.swarmplot(x="age",data=have_heart_disease)
plt.show()

<span style="font-family:monospace">Heart disease is unbiased when it comes to age. It affects everyone, how generous</span>


<h3 style="font-family:monospace">Chest Pains</h3> 

In [None]:
plt.figure(figsize=(10,5))
plt.title("Chest Pain and Heart Disease",fontdict={"fontname":"monospace","fontsize": 20})
sns.histplot(x="cp",
             data=df,
             hue="target",
             multiple="stack",
             palette="terrain")
plt.show()

<span style="font-family:monospace">Most of the patients suffering from non-anginal and atypical anginal chest pain turned out to have a heart disease, which suggests a higher risk for these types. Among patients with asymptomatic chest pain, the split between having and not having a heart disease was roughly equal, and most of the patients admitted with typical angina did not have a heart disease.</span>

<span style="font-family:monospace">However, typical and atypical angina are equally deadly, and it can't be said for sure that one carries a lesser risk of heart disease than the other. It's just that typical angina shows more visible heart-related symptoms</span>
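
<span style="font-family:monospace">The stacked bars show counts, but the comparison above is really about proportions. A small sketch of how the share of patients with heart disease per chest pain type could be computed, using a made-up frame where target is still 0/1 (in this notebook it has already been replaced by text labels at this point):</span>

```python
import pandas as pd

# Made-up patients: four per chest pain type, target 1 = has heart disease
toy = pd.DataFrame({
    "cp": ["Typical Angina"] * 4 + ["Non Anginal Pain"] * 4,
    "target": [1, 0, 0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones in each group
disease_share = toy.groupby("cp")["target"].mean()
print(disease_share)
```

<span style="font-family:monospace">Comparing shares instead of raw counts avoids being misled by chest pain types that simply have more patients.</span>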

<h2 style="font-family:monospace">Let's build a tree 🌲</h2>

<h3 style="font-family:monospace">Outliers in features</h3>

In [None]:
plt.figure(figsize=(20,15))

plt.subplot(2,4,1)
sns.boxplot(y="age",data=df)

plt.subplot(2,4,2)
sns.boxplot(y="trestbps",data=df)

plt.subplot(2,4,3)
sns.boxplot(y="chol",data=df)

plt.subplot(2,4,4)
sns.boxplot(y="thalach",data=df)

plt.subplot(2,4,5)
sns.boxplot(y="oldpeak",data=df)

plt.subplot(2,4,6)
sns.boxplot(y="ca",data=df)

plt.show()

<span style="font-family:monospace">Let's scale em up a bit and preprocess our data for modelling a decision tree</span> 
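
<span style="font-family:monospace">Before scaling, the points drawn beyond the whiskers in the box plots above can also be listed in code. The sketch below applies the usual 1.5 × IQR rule that box plots use by default; iqr_outliers is a hypothetical helper and the cholesterol values are made up.</span>

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up cholesterol readings with one extreme value
chol = pd.Series([200, 210, 220, 230, 240, 250, 564])

outliers = iqr_outliers(chol)
print(outliers.tolist())
```

<span style="font-family:monospace">Running the helper on each numeric column would give the outlier count per feature.</span>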


In [None]:
# in case we need non encoded data later

not_encoded_df = df.copy()

In [None]:
le = LabelEncoder()

df["sex"] = le.fit_transform(df["sex"])
df["cp"] = le.fit_transform(df["cp"])
df["fbs"] = le.fit_transform(df["fbs"])
df["restecg"] = le.fit_transform(df["restecg"])
df["exang"] = le.fit_transform(df["exang"])
df["thal"] = le.fit_transform(df["thal"])
df["target"] = le.fit_transform(df["target"])

X = df.iloc[:,[0,1,2,3,4,5,6,7,8,11,12]].values
y = df.iloc[:,13].values

In [None]:
def accuracy(y_test,y_pred):
    # Error metrics computed on the encoded 0/1 labels, plus the plain accuracy score
    mae = mean_absolute_error(y_test,y_pred)
    mse = mean_squared_error(y_test,y_pred)
    rmse = np.sqrt(mse)
    rmsle = np.sqrt(mean_squared_log_error(y_test,y_pred))
    acc = accuracy_score(y_test,y_pred)
    print("Mean absolute error : ",mae)
    print("Root mean squared error : ",rmse)
    print("Root mean squared log error : ",rmsle)
    print("Accuracy percentage : ",acc * 100)

In [None]:
# Splitting the data into training and testing data

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3)

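<span style="font-family:monospace">To put the features on a common scale we use a scaler. Specifically, I'm using the standard scaler here, which rescales each feature so that its mean becomes 0 and its standard deviation becomes 1. Scaling does not remove outliers, and a decision tree splits on thresholds, so this step is not expected to change the tree's accuracy much; it mainly matters if a scale-sensitive model is tried on the same data. The scaler is fit on the training data only and then applied to both the training and testing data</span>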
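
<span style="font-family:monospace">A tiny sketch of what the standard scaler does, on made-up numbers rather than the heart data: the mean and standard deviation are learned from the training rows only and then reused for the test rows.</span>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows with two features on very different scales
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

# fit() learns the per-feature mean and standard deviation from the training rows
scaler = StandardScaler().fit(X_train_toy)

# transform() applies (x - mean) / std using those training statistics
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

<span style="font-family:monospace">The test row is not centred on 0 because it is scaled with the training statistics, which is the point: the test data should not influence the preprocessing.</span>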

In [None]:
ss = StandardScaler()

X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

<h3 style="font-family:monospace">Decision Tree Classifier</h3> 

In [None]:
dtc = DecisionTreeClassifier()
dtc = dtc.fit(X_train,y_train)

In [None]:
dtc_pred = dtc.predict(X_test)
accuracy(y_test,dtc_pred)

<span style="font-family:monospace">Let's build the confusion matrix now</span> 
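
<span style="font-family:monospace">As a reminder of how scikit-learn lays the matrix out: rows are the actual classes and columns are the predicted ones. The sketch below uses made-up labels, not the model's predictions, to show how the four cells can be unpacked and turned into accuracy, precision and recall.</span>

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels (1 = has heart disease)
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows = actual class, columns = predicted class
cm = confusion_matrix(y_true_toy, y_pred_toy)

# For a binary problem, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

accuracy_toy = (tp + tn) / cm.sum()
precision_toy = tp / (tp + fp)  # of the predicted positives, how many were right
recall_toy = tp / (tp + fn)     # of the actual positives, how many were found
print(cm)
print(accuracy_toy, precision_toy, recall_toy)
```

<span style="font-family:monospace">For a medical screening task, recall is often worth watching alongside accuracy, since a missed heart disease case is usually costlier than a false alarm.</span>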

In [None]:
confusion = confusion_matrix(y_test, dtc_pred)
sns.heatmap(confusion,
            annot=True,
            cmap="Blues"
           )
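# confusion_matrix puts actual classes on the rows (y axis) and predictions on the columns (x axis)
plt.xlabel("Predicted values")
plt.ylabel("Actual values")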
plt.show()

<h3 style="font-family:monospace">Which factors matter the most?</h3> 

In [None]:
plt.figure(figsize=(20,5))
plt.title("Feature importance according to Decision Tree Classifier",fontdict={"fontname":"monospace","fontsize": 15})

# X was built from columns 0-8, 11 and 12, so oldpeak and slope are not among the model's features
feature_names = np.array(["age","sex","cp","trestbps","chol","fbs","restecg","thalach","exang","ca","thal"])

sorted_feature_importance = dtc.feature_importances_.argsort()
plt.barh(feature_names[sorted_feature_importance],
dtc.feature_importances_[sorted_feature_importance],
color='purple')

plt.show()

<h3 style="font-family:monospace">Verification</h3> 

In [None]:
## Verification
not_encoded_df["count"]=1

print(not_encoded_df.groupby(["slope","target"]).count()["count"])

<span style="font-family:monospace">Clearly, grouping rows by slope shows a huge difference in the target counts (107 and 35). Hence we can say it plays a major role in classifying the data.</span>

In [None]:
print(not_encoded_df.groupby(["cp","target"]).count()["count"])

In [None]:
print(not_encoded_df.groupby(["fbs","target"]).count()["count"])

<span style="font-family:monospace">The values here are very close (22 and 23), and hence this feature is not a good fit for categorizing the data. It will be used last, if need be, in the decision tree</span>
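
<span style="font-family:monospace">The idea behind these checks can be made a bit more precise with Gini impurity, the default split criterion of DecisionTreeClassifier. The sketch below is an assumption-laden illustration on made-up data: gini and weighted_gini are hypothetical helpers, and they score a multi-way split per category, whereas the scikit-learn tree makes binary threshold splits on the encoded values.</span>

```python
import pandas as pd

def gini(counts):
    """Gini impurity of one group, given its class counts."""
    proportions = counts / counts.sum()
    return 1.0 - (proportions ** 2).sum()

def weighted_gini(df, feature, target):
    """Average impurity of the groups formed by `feature`, weighted by group size."""
    table = pd.crosstab(df[feature], df[target])
    weights = table.sum(axis=1) / len(df)
    return (table.apply(gini, axis=1) * weights).sum()

# Made-up data: one feature separates the classes perfectly, the other not at all
toy = pd.DataFrame({
    "informative": ["a"] * 4 + ["b"] * 4,
    "uninformative": ["x", "y"] * 4,
    "target": [1, 1, 1, 1, 0, 0, 0, 0],
})

print(weighted_gini(toy, "informative", "target"))
print(weighted_gini(toy, "uninformative", "target"))
```

<span style="font-family:monospace">The lower the weighted impurity, the cleaner the groups. A feature whose groups have near-equal class counts, like the 22 and 23 above, leaves the impurity close to its maximum of 0.5 for that group.</span>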

<span style="font-family:monospace">That's it. I hope you liked my notebook. This is one of my first few notebooks and of course I will have made mistakes that went unnoticed. If you've found one, I would love to hear about it in the comments. Peace ✌</span>