![image.png](attachment:image.png)

# Introduction:

Heart failure, sometimes known as congestive heart failure, occurs when your heart muscle doesn't pump blood as well as it should. Certain conditions, such as narrowed arteries in your heart (coronary artery disease) or high blood pressure, gradually leave your heart too weak or stiff to fill and pump efficiently.

Not all conditions that lead to heart failure can be reversed, but treatments can improve the signs and symptoms of heart failure and help you live longer. Lifestyle changes — such as exercising, reducing sodium in your diet, managing stress and losing weight — can improve your quality of life.

One way to prevent heart failure is to prevent and control conditions that cause heart failure, such as coronary artery disease, high blood pressure, diabetes or obesity.

### Heart Anatomy:
![image.png](attachment:image.png)

## Heart Failure: 

Heart failure is a leading cause of death in many countries.

### What does Heart Failure look like?
![image.png](attachment:image.png)

## Symptoms:

**Heart failure can be ongoing (chronic), or your condition may start suddenly (acute).**

Heart failure signs and symptoms may include:
*  Shortness of breath (dyspnea) when you exert yourself or when you lie down
*  Fatigue and weakness
*  Swelling (edema) in your legs, ankles and feet
*  Rapid or irregular heartbeat
*  Reduced ability to exercise
*  Persistent cough or wheezing with white or pink blood-tinged phlegm
*  Increased need to urinate at night
*  Swelling of your abdomen (ascites)
*  Very rapid weight gain from fluid retention
*  Lack of appetite and nausea
*  Difficulty concentrating or decreased alertness
*  Sudden, severe shortness of breath and coughing up pink, foamy mucus
*  Chest pain if your heart failure is caused by a heart attack

## Risk factors

**A single risk factor may be enough to cause heart failure, but a combination of factors also increases your risk.**

### Risk factors include:

* <ins>***High blood pressure:***</ins> Your heart works harder than it has to if your blood pressure is high.

* <ins>***Coronary artery disease:***</ins> Narrowed arteries may limit your heart's supply of oxygen-rich blood, resulting in weakened heart muscle.

* <ins>***Heart attack:***</ins> A heart attack is a form of coronary disease that occurs suddenly. Damage to your heart muscle from a heart attack may mean your heart can no longer pump as well as it should.

* <ins>***Diabetes:***</ins> Having diabetes increases your risk of high blood pressure and coronary artery disease.

* <ins>***Some diabetes medications:***</ins> The diabetes drugs rosiglitazone (Avandia) and pioglitazone (Actos) have been found to increase the risk of heart failure in some people. Don't stop taking these medications on your own, though. If you're taking them, discuss with your doctor whether you need to make any changes.

* <ins>***Certain medications:***</ins> Some medications may lead to heart failure or heart problems. Medications that may increase the risk of heart problems include nonsteroidal anti-inflammatory drugs (NSAIDs); certain anesthesia medications; some anti-arrhythmic medications; certain medications used to treat high blood pressure, cancer, blood conditions, neurological conditions, psychiatric conditions, lung conditions, urological conditions, inflammatory conditions and infections; and other prescription and over-the-counter medications.

  Don't stop taking any medications on your own. If you have questions about the medications you're taking, ask your doctor whether any changes are recommended.

* <ins>***Sleep apnea:***</ins> The inability to breathe properly while you sleep at night results in low blood oxygen levels and increased risk of abnormal heart rhythms. Both of these problems can weaken the heart.

* <ins>***Congenital heart defects:***</ins> Some people who develop heart failure were born with structural heart defects.

* <ins>***Valvular heart disease:***</ins> People with valvular heart disease have a higher risk of heart failure.

* <ins>***Viruses:***</ins> A viral infection may have damaged your heart muscle.

* <ins>***Alcohol use:***</ins> Drinking too much alcohol can weaken heart muscle and lead to heart failure.

* <ins>***Tobacco use:***</ins> Using tobacco can increase your risk of heart failure.

* <ins>***Obesity:***</ins> People who are obese have a higher risk of developing heart failure.

* <ins>***Irregular heartbeats:***</ins> These abnormal rhythms, especially if they are very frequent and fast, can weaken the heart muscle and cause heart failure.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Importing Libraries:

In [None]:
# Libraries for data-visualization
from pandas.plotting import scatter_matrix
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
!pip install RapidPlot
import RapidPlot

# Libraries for interactive plotting
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

# Library for automatic EDA
import pandas_profiling

# Library for pre-processing:
from sklearn.preprocessing import StandardScaler

# Library for Dimensionality-Reduction:
from sklearn.decomposition import PCA

# Libraries for modelling
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
!pip install catboost
from catboost import CatBoostClassifier


# Model Selection:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.base import BaseEstimator, TransformerMixin

# Libraries for model evaluation

# --CLASSIFICATION:
from sklearn.metrics import roc_auc_score, confusion_matrix, precision_recall_curve, roc_curve, precision_score, recall_score, f1_score, accuracy_score

# Library for plotting confusion matrix
from mlxtend.plotting import plot_confusion_matrix

# Library for converting python equation to Latex (markdown)
%pip install handcalcs
import handcalcs.render

# Miscellaneous libraries
from IPython.display import display

## Loading Dataset:

In [None]:
hf_df = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
hf_df

### Importance of "creatinine_phosphokinase":<br>
When the total CPK level is very high, it most often means there has been injury or stress to muscle tissue, the heart, or the brain.

Muscle tissue injury is most likely. When a muscle is damaged, CPK leaks into the bloodstream. Finding which specific form of CPK is high helps determine which tissue has been damaged.


### Importance of "serum_creatinine":<br>
A creatinine test reveals important information about your kidneys.

Creatinine is a chemical waste product that's produced by your muscle metabolism and to a smaller extent by eating meat. Healthy kidneys filter creatinine and other waste products from your blood. The filtered waste products leave your body in your urine.

If your kidneys aren't functioning properly, an increased level of creatinine may accumulate in your blood. A serum creatinine test measures the level of creatinine in your blood and provides an estimate of how well your kidneys filter (glomerular filtration rate). A creatinine urine test can measure creatinine in your urine.


### Importance of "serum_sodium":<br>
Measurement of serum sodium is routine in assessing electrolyte, acid-base, and water balance, as well as renal function. Sodium accounts for approximately 95% of the osmotically active substances in the extracellular compartment, provided that the patient is not in renal failure or does not have severe hyperglycemia. 

In [None]:
report = pandas_profiling.ProfileReport(hf_df)

In [None]:
display(report)

## Report Insights:
We will look only at the numerical features here. Since the dataset is not heavily imbalanced, the report adds little for the categorical features.

### Age:
* The data we have consists mainly of seniors (average=60 yrs).

### Creatinine Phosphokinase:
* The normal creatine phosphokinase (CPK) level is 10-120 mcg/L, according to https://www.mountsinai.org/health-library/tests/creatine-phosphokinase-test.
* Our average is _581_ mcg/L, which is far above the normal range.

### Ejection Fraction:<br>(https://www.mayoclinic.org/ejection-fraction/expert-answers/faq-20058286)
* Ejection fraction of 55 percent or higher is considered normal.
* Ejection fraction of 50 percent or lower is considered reduced.
* Ejection fraction between 50 and 55 percent is usually considered "borderline."
* We have an average of 38%.

### Platelets:
* A normal platelet count ranges from 150,000 to 450,000 platelets per microliter of blood. Having more than 450,000 platelets is a condition called thrombocytosis; having less than 150,000 is known as thrombocytopenia.
* We have an average of 263,358, which is within the normal range.

### Serum Creatinine:
* The normal range for creatinine in the blood may be 0.84 to 1.21 milligrams per deciliter (by https://www.mayoclinic.org/tests-procedures/creatinine-test/about/pac-20384646)
* We have an avg. of 1.39 mg/dL (just above the normal range).

### Serum Sodium:
* A normal blood sodium level is between 135 and 145 milliequivalents per liter (by https://www.mayoclinic.org/diseases-conditions/hyponatremia/symptoms-causes/syc-20373711)
* We have an avg. of 136.625 mEq/L.

No comments on follow-up period.
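The comparisons above can also be done programmatically. Here is a minimal sketch that checks each average against the reference range quoted in these notes. The dictionary and function names (`NORMAL_RANGES`, `REPORTED_MEANS`, `range_status`) are hypothetical, and the numbers are simply the ones listed above; in the notebook, the averages could presumably be taken from `hf_df[list(NORMAL_RANGES)].mean()` instead.

```python
# Reference ranges as quoted in the notes above (low, high).
NORMAL_RANGES = {
    "creatinine_phosphokinase": (10, 120),    # mcg/L
    "platelets": (150_000, 450_000),          # per microliter
    "serum_creatinine": (0.84, 1.21),         # mg/dL
    "serum_sodium": (135, 145),               # mEq/L
}

# Averages as quoted in the notes above.
REPORTED_MEANS = {
    "creatinine_phosphokinase": 581,
    "platelets": 263_358,
    "serum_creatinine": 1.39,
    "serum_sodium": 136.625,
}


def range_status(value, low, high):
    """Return 'below', 'within' or 'above' relative to the range [low, high]."""
    if value < low:
        return "below"
    if value > high:
        return "above"
    return "within"


status = {
    name: range_status(REPORTED_MEANS[name], *NORMAL_RANGES[name])
    for name in NORMAL_RANGES
}

for name, label in status.items():
    print(f"{name}: mean={REPORTED_MEANS[name]} is {label} the normal range")
```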

# EDA

In [None]:
cat_variables = np.array(['anaemia', 'diabetes', 'high_blood_pressure', 'smoking', 'DEATH_EVENT', 'sex'])
num_variables = np.array(['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time'])

In [None]:
death_labels = ['Male - Survived', 'Female - Survived', 'Male - Not Survived', 'Female - Not Survived']
anaemia_labels = ['Male - No Anaemia', 'Female - No Anaemia', 'Male - Anaemia', 'Female - Anaemia']
diabetes_labels = ['Male - No Diabetes', 'Female - No Diabetes', 'Male - Diabetes', 'Female - Diabetes']
bp_labels = ['Male - No High BP', 'Female - No High BP', 'Male - High BP', 'Female - High BP']
smoking_labels = ['Male - Non-Smokers', 'Female - Non-Smokers', 'Male - Smokers', 'Female - Smokers']

labels = [anaemia_labels, diabetes_labels, bp_labels, smoking_labels, death_labels]

male = hf_df[hf_df.sex == 1]
female = hf_df[hf_df.sex == 0]

fig = make_subplots(rows=1, cols=len(labels), specs=[[{"type": "pie"}]*len(labels)])

cat_counts = list()
for cat_var in cat_variables[:-1]:
    cat_counts.append([np.sum(male[cat_var] == 0), np.sum(female[cat_var] == 0),
                       np.sum(male[cat_var] == 1), np.sum(female[cat_var] == 1)
                     ])
    

for i, label_counts in enumerate(zip(cat_variables[:-1], labels, cat_counts)):
    cat_var = label_counts[0]
    cat_labels = label_counts[1]
    cat_counts = label_counts[2]
    fig.add_trace(go.Pie(values=cat_counts,
                         labels=cat_labels,
                         #domain=dict(x=[0, 0.5]),
                         name=f'{cat_var}'), 
                         row=1, col=i+1)     

fig.update_layout(title='Gender Composition for Various Categorical Variables')
fig.show() 
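The four counts behind each pie chart (sex crossed with a 0/1 variable) can also be read off a cross-tabulation. This is only a sketch: `toy_df` is a made-up stand-in for `hf_df`, and its values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for hf_df; the values are made up for illustration only.
toy_df = pd.DataFrame({
    "sex":     [1, 1, 0, 0, 1, 0],   # 1 = male, 0 = female (as in the cell above)
    "anaemia": [0, 1, 0, 1, 1, 0],
})

# Rows are anaemia values (0/1), columns are sex values (0/1).
counts = pd.crosstab(toy_df["anaemia"], toy_df["sex"])
print(counts)

# Same ordering as the label lists above:
# male without, female without, male with, female with.
pie_values = [counts.loc[0, 1], counts.loc[0, 0], counts.loc[1, 1], counts.loc[1, 0]]
print(pie_values)
```

With the real data, the same call on `hf_df[cat_var]` and `hf_df['sex']` should give the numbers that the loop above computes with `np.sum`.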

In [None]:
anaemia_labels = ['Smoker - No Anaemia', 'Non-Smoker - No Anaemia', 'Smoker - Anaemia', 'Non-Smoker - Anaemia']
diabetes_labels = ['Smoker - No Diabetes', 'Non-Smoker - No Diabetes', 'Smoker - Diabetes', 'Non-Smoker - Diabetes']
bp_labels = ['Smoker - No High BP', 'Non-Smoker - No High BP', 'Smoker - High BP', 'Non-Smoker - High BP']

labels = [anaemia_labels, diabetes_labels, bp_labels]

smoker = hf_df[hf_df.smoking == 1]
non_smoker = hf_df[hf_df.smoking == 0]

fig = make_subplots(rows=1, cols=len(labels), specs=[[{"type": "pie"}]*len(labels)])

cat_counts = list()
for cat_var in cat_variables[:-1]:
    cat_counts.append([np.sum(smoker[cat_var] == 0), np.sum(non_smoker[cat_var] == 0),
                       np.sum(smoker[cat_var] == 1), np.sum(non_smoker[cat_var] == 1)
                     ])
    
for i, label_counts in enumerate(zip(['anaemia', 'diabetes', 'high blood pressure'], labels, cat_counts)):
    cat_var = label_counts[0]
    cat_labels = label_counts[1]
    cat_counts = label_counts[2]
    fig.add_trace(go.Pie(values=cat_counts,
                         labels=cat_labels,
                         #domain=dict(x=[0, 0.5]),
                         name=f'{cat_var}'), 
                         row=1, col=i+1)     

fig.update_layout(title='Smoker Composition for Various Categorical Variables')
fig.show() 

In [None]:
fig = px.imshow(hf_df.corr())
fig.show()

From the heatmap, we can see that **'ejection_fraction', 'serum_creatinine', 'serum_sodium', and 'time'** have the strongest correlation with DEATH_EVENT. <br>
We will still check a few of the categorical variables, hoping to find some more interesting relations.
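Reading the strongest correlations off a heatmap by eye is error-prone, so it may help to rank them numerically. The sketch below uses a small made-up frame (`toy_df`, with invented values) in place of `hf_df`. Keep in mind that a linear correlation is only a rough proxy for how useful a feature is.

```python
import pandas as pd

# Toy stand-in for hf_df; the values are made up for illustration only.
toy_df = pd.DataFrame({
    "time":             [10, 20, 200, 250, 30, 180],
    "serum_creatinine": [2.1, 1.9, 1.0, 0.9, 1.0, 1.8],
    "DEATH_EVENT":      [1, 1, 0, 0, 1, 0],
})

# Correlation of every feature with the target, excluding the target itself.
target_corr = toy_df.corr()["DEATH_EVENT"].drop("DEATH_EVENT")

# Rank by absolute value so strong negative correlations are not overlooked.
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```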

Below, I am using the ***Plotter class***, which I created to quickly skim over the relations of a particular feature<br>and zoom in on the ones that can give better insights. It is part of ***RapidPlot*** (`!pip install RapidPlot`), a library that I created and uploaded to PyPI. It contains only 1 class with 4 functions so far ;)

In [None]:
plot_maker = RapidPlot.Plotter()
plot_maker.plotter(hf_df, 'age', ['ejection_fraction', 'serum_creatinine', 'serum_sodium', 'time'], ['anaemia', 'diabetes', 'sex', 'DEATH_EVENT'], 30, 30)

Relations:
* age v/s ejection fraction for DEATH EVENT,
* age v/s time for DEATH EVENT,
* age v/s serum creatinine for DEATH EVENT<br><br>
can give some insights, so let's zoom in on them. The remaining plots told no story.

In [None]:
age_min = np.min(hf_df.age)
age_max = np.max(hf_df.age)

fig = px.scatter(x=hf_df.age, y=hf_df.ejection_fraction, color=hf_df['DEATH_EVENT'].astype('bool'))
fig.update_layout(
    xaxis=dict(
        tickmode='array',
        tickvals=np.int64(np.linspace(age_min, age_max, 10)),
        title='Age',
    ),
    yaxis=dict(
        title='Ejection Fraction',
    ),
    title='Age v/s Ejection Fraction for Deaths'
)

fig.add_trace(go.Scatter(x=[40-3, 40-3, 75+3, 75+3, 40-3], y=[10, 30+3, 30+3, 10, 10], fillcolor='yellow', fill='toself', opacity=0.15, name='Age(40-75)'))

We can see that most of the people in the 40-75 age group who died had a low ejection fraction.

In [None]:
age_min = np.min(hf_df.age)
age_max = np.max(hf_df.age)

fig = px.scatter(x=hf_df.age, y=hf_df.time, color=hf_df['DEATH_EVENT'].astype('bool'))
fig.update_layout(
    xaxis=dict(
        tickmode='array',
        tickvals=np.int64(np.linspace(age_min, age_max, 10)),
        title='Age',
    ),
    yaxis=dict(
        title='Follow-up period',
    ),
    title='Age v/s Follow-up time for Deaths'
)

fig.add_trace(go.Scatter(x=[43, 43, 97, 97, 43], y=[0-3, 65, 65, 0-3, 0-3], fillcolor='yellow', fill='toself', opacity=0.15, name='Age(45-95) approx.'))

We can see that the patients who died had a shorter follow-up period than most of those who survived. One reading is that their condition was more serious, or that complications arose after the treatment ended. Note also that follow-up stops when a patient dies, so a short follow-up time is partly a consequence of the outcome itself.

In [None]:
age_min = np.min(hf_df.age)
age_max = np.max(hf_df.age)

fig = px.scatter(x=hf_df.age, y=hf_df.serum_creatinine, color=hf_df['DEATH_EVENT'].astype('bool'))
fig.update_layout(
    xaxis=dict(
        tickmode='array',
        tickvals=np.int64(np.linspace(age_min, age_max, 10)),
        title='Age',
    ),
    yaxis=dict(
        title='Serum Creatinine',
    ),
    title='Age v/s Serum Creatinine for Deaths'
)

The points are too crowded to say much; all we can say is that the higher the serum creatinine, the higher the chance of death (which we already knew ;) ).

# Model-Selection:

In [None]:
X = hf_df.drop(columns='DEATH_EVENT')
y = hf_df['DEATH_EVENT']

X_train, X_test_cv, y_train, y_test_cv = train_test_split(X, y, test_size=0.4)
X_cv, X_test, y_cv, y_test = train_test_split(X_test_cv, y_test_cv, test_size=0.5)
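The split above is 60% train, 20% validation and 20% test, but it is neither seeded nor stratified, so the class balance and the scores can change from run to run. As a sketch of one possible alternative (not what this notebook does), `stratify` and `random_state` can be passed to `train_test_split`. The arrays `X_toy` and `y_toy` are synthetic stand-ins for `X` and `y`, and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 30% positive class (made up for illustration).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 30 + [0] * 70)

# First split: 60% train, 40% held out, keeping the class ratio in both parts.
X_train_toy, X_rest_toy, y_train_toy, y_rest_toy = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=3
)

# Second split: divide the held-out 40% evenly into validation and test sets.
X_cv_toy, X_test_toy, y_cv_toy, y_test_toy = train_test_split(
    X_rest_toy, y_rest_toy, test_size=0.5, stratify=y_rest_toy, random_state=3
)

print(len(y_train_toy), len(y_cv_toy), len(y_test_toy))
print(y_train_toy.mean(), y_cv_toy.mean(), y_test_toy.mean())
```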

In [None]:
rnd_state=3

model_list = [LogisticRegression(random_state=rnd_state),
              SGDClassifier(random_state=rnd_state),
              SVC(random_state=rnd_state),
              KNeighborsClassifier(),
              GaussianNB(),
              DecisionTreeClassifier(random_state=rnd_state),
              RandomForestClassifier(random_state=rnd_state),
              GradientBoostingClassifier(random_state=rnd_state),
             ]

In [None]:
Main_Pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.99)),
    ('model', LogisticRegression()),
])

# param_grid for fast execution
params_grid = [{

    'model': model_list,  
}]

In [None]:
main_grid_f1 = GridSearchCV(Main_Pipeline, params_grid, scoring='f1', cv=2, verbose=2)
main_grid_f1.fit(X_train, y_train)

In [None]:
main_grid_f1.best_estimator_
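`best_estimator_` shows only the winner. Because the grid swaps the whole `model` step, `cv_results_` also holds a mean score for every candidate, which makes it easier to see how close the runners-up were. A minimal sketch on synthetic data (the names `X_toy`, `y_toy`, `pipe`, `grid` and `summary` are hypothetical, and only two candidate models are used to keep it short):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, standing in for X_train / y_train.
X_toy, y_toy = make_classification(n_samples=120, n_features=6, random_state=3)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),   # placeholder, replaced by the grid
])

# Treat the estimator itself as a hyperparameter, as in the cells above.
grid = GridSearchCV(
    pipe,
    [{"model": [LogisticRegression(), GaussianNB()]}],
    scoring="f1",
    cv=2,
)
grid.fit(X_toy, y_toy)

# One row per candidate model with its mean cross-validated F1 score.
summary = pd.DataFrame({
    "model": [type(m).__name__ for m in grid.cv_results_["param_model"]],
    "mean_f1": grid.cv_results_["mean_test_score"],
}).sort_values("mean_f1", ascending=False)
print(summary)
```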

In [None]:
# Number of principal components PCA kept to explain 99% of the variance
main_grid_f1.best_estimator_.named_steps['pca'].n_components_

In [None]:
y_pred_main = main_grid_f1.best_estimator_.predict(X_test)
f1_score(y_test, y_pred_main)

In [None]:
accuracy_score(y_test, y_pred_main)

## HyperParameter Tuning:

In [None]:
best_model = main_grid_f1.best_estimator_
best_model

In [None]:
best_param_grid = [{
    'model__penalty': ['l1', 'l2', 'elasticnet'],
    'model__tol': [1e-5, 2e-5, 1e-4, 2e-4, 1e-3],
    #'model__C': np.r_[1, np.array([C for C in np.random.uniform(0.001, 1, 5) if C != 1])], # C=1 is the default value, and we want to include it but not get repeated
    'model__max_iter': np.r_[100, np.array([max_iter for max_iter in np.random.randint(80, 200, 5) if max_iter != 100])], # Same reasoning as for C: keep the default (100) without repeating it
}]
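One caveat about this grid, assuming the selected model is `LogisticRegression` (the notebook output does not record which estimator won): its default `lbfgs` solver supports only the `l2` penalty, so the `l1` and `elasticnet` candidates would fail to fit and be scored as NaN. A sketch of one way around this on synthetic data is to use the `saga` solver and give `elasticnet` its required `l1_ratio`. The names and values below are hypothetical, and newer scikit-learn releases may emit deprecation warnings for `penalty`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, standing in for X_cv / y_cv.
X_toy, y_toy = make_classification(n_samples=120, n_features=6, random_state=3)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    # saga supports l1, l2 and elasticnet; a higher max_iter helps it converge.
    ("model", LogisticRegression(solver="saga", max_iter=2000)),
])

# Two sub-grids, so l1_ratio is only searched together with elasticnet.
grid_spec = [
    {"model__penalty": ["l1", "l2"]},
    {"model__penalty": ["elasticnet"], "model__l1_ratio": [0.25, 0.5, 0.75]},
]

# error_score="raise" surfaces invalid combinations instead of hiding them as NaN.
search = GridSearchCV(pipe, grid_spec, scoring="f1", cv=3, error_score="raise")
search.fit(X_toy, y_toy)
print(search.best_params_)
```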

In [None]:
best_grid = GridSearchCV(best_model, best_param_grid, scoring='f1', cv=3, verbose=2)
best_grid.fit(X_cv, y_cv)

In [None]:
best_grid.best_estimator_

In [None]:
y_pred_tuned = best_grid.best_estimator_.predict(X_test)
print(f1_score(y_test, y_pred_tuned))
print(accuracy_score(y_test, y_pred_tuned))

Looks like hyperparameter tuning improved the model. Kudos!

### Confusion Matrices for both the models:

In [None]:
conf_mat_ori = confusion_matrix(y_test, y_pred_main)
conf_mat_tuned = confusion_matrix(y_test, y_pred_tuned)

In [None]:
plot_confusion_matrix(conf_mat=conf_mat_ori)
plot_confusion_matrix(conf_mat=conf_mat_tuned);

#### The first matrix is for the original best model (before hyperparameter tuning); the second is for the tuned best model.
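To relate the matrices to the F1 and accuracy scores printed earlier, the metrics can be recomputed directly from the four cells. The sketch below uses a made-up matrix (`conf_mat_toy`, not the notebook's results) in scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix, laid out as [[TN, FP], [FN, TP]].
conf_mat_toy = np.array([[35, 5],
                         [8, 12]])

tn, fp, fn, tp = conf_mat_toy.ravel()

precision = tp / (tp + fp)               # of predicted deaths, how many were real
recall = tp / (tp + fn)                  # of real deaths, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / conf_mat_toy.sum()

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Applying the same arithmetic to `conf_mat_ori` and `conf_mat_tuned` should reproduce the values returned by `f1_score` and `accuracy_score` above.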