# 💔 Heart Failure Prediction - Machine Learning Classifier
Building with Python and Scikit-learn

<a id='indice'></a>
### Index
[Problem Definition](#introduction)<br>
[Step 0.0:  Setting the Engine Tools](#step0)<br>
$\;\;\;\;\;$[Step 0.1:  Importing Required Python Libraries](#step0.1)<br>
$\;\;\;\;\;$[Step 0.2:  Display Setting](#step0.2)<br>
$\;\;\;\;\;$[Step 0.3: Defining Functions and Dictionaries](#step0.3)<br>
[Step 1.0: Data Extraction](#step1)<br>
$\;\;\;\;\;$[Step 1.1: Downloading the Data](#step1.1)<br>
$\;\;\;\;\;$[Step 1.2: Loading the DataFrame](#step1.2)<br>
$\;\;\;\;\;$[Step 1.3: DataFrame Display](#step1.3)<br>
[Step 2.0: Data Preparation](#step2)<br>
$\;\;\;\;\;$[Step 2.1: DataFrame Description](#step2.1)<br>
$\;\;\;\;\;$[Step 2.2: Quantifying Cardinality](#step2.2)<br>
$\;\;\;\;\;$[Step 2.3: Removing Duplicates](#step2.3)<br>
$\;\;\;\;\;$[Step 2.4: Removing Irrelevant Data](#step2.4)<br>
$\;\;\;\;\;$[Step 2.5: Fixing Structural Errors](#step2.5)<br>
$\;\;\;\;\;$[Step 2.6: Detecting Outliers](#step2.6)<br>
$\;\;\;\;\;$[Step 2.7: Handling Missing Data](#step2.7)<br>
$\;\;\;\;\;$[Step 2.8: One Hot Encoding](#step2.8)<br>
[Step 3.0: Data Exploration/Visualization](#step3)<br>
$\;\;\;\;\;$[Step 3.1: Descriptive Statistics](#step3.1)<br> 
$\;\;\;\;\;$[Step 3.2: Feature Selection ](#step3.2)<br> 
$\;\;\;\;\;\;\;\;\;\;$[Step 3.2.1: Univariate Selection](#step3.2.1)<br> 
$\;\;\;\;\;\;\;\;\;\;$[Step 3.2.2: Feature Importance](#step3.2.2)<br> 
$\;\;\;\;\;$[Step 3.3: Features Correlation](#step3.3)<br> 
[Step 4.0: Predictive Modeling](#step4)<br>
[Step 5.0: Model Validation](#step5)<br>
[Deployment of the Solution](#deployment)<br>
[Acknowledgements](#acknowledgements)

<a id='introduction'></a>
## Problem Definition

Owing to unhealthy eating habits and sedentary lifestyles, heart disease (any condition that affects the structure or function of the heart) is one of the major health concerns of our society. In 2020, about 697,000 Americans died from heart disease ([CDC heart disease](https://www.cdc.gov/heartdisease/about.htm)), which is one in every five deaths. It is known that early detection can reduce the mortality rate from these diseases. Although there are instruments and clinical tests that can predict heart disease, the high cost and difficulty of continually monitoring every patient call for alternative solutions. In this work, we will use data analysis and machine learning tools to find the characteristics associated with the presence or absence of heart disease in a person. The data<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1) combine five datasets that were already available independently ([UCI Machine Learning databases](https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/)). The combined dataset contains 918 observations<a name="cite_ref-2"></a>[<sup>[2]</sup>](#cite_note-2) with 12 features, including the target feature (see the table below). 

|Index| Feature |Description | Domain |
|:-|  :-|:- | :- | 
|0|Age  |age of the patient    | years|
|1|Sex       |sex of the patient|M: Male, F: Female|
|2|ChestPainType|chest pain type| TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic|
|3|RestingBP|resting blood pressure (mm Hg)| --- |
|4|Cholesterol|serum cholesterol (mg/dl)| --- |
|5|FastingBS|fasting blood sugar| 1: if FastingBS > 120 mg/dl, 0: otherwise|
|6|RestingECG|resting electrocardiogram results  |Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria |
|7|MaxHR|maximum heart rate achieved | Numeric value between 60 and 202|
|8|ExerciseAngina|exercise-induced angina |Y: Yes, N: No|
|9|Oldpeak|ST depression induced by exercise relative to rest	| Numeric value measured in depression|
|10|ST_Slope|the slope of the peak exercise ST segment|Up: upsloping, Flat: flat, Down: downsloping |
|11|HeartDisease|output class  | 1: heart disease, 0: normal|

Note that the "HeartDisease" field refers to the presence of heart disease in the patient. We will examine trends and correlations within the available data to determine which features are most relevant to the presence of heart disease. Finally, we will compare different machine learning classification algorithms and pick the most effective one as a "heart disease classifier".


<a name="cite_note-1"></a>1. [](#cite_ref-1) fedesoriano. (September 2021). Heart Failure Prediction Dataset. Retrieved [August 7, 2022] from https://www.kaggle.com/fedesoriano/heart-failure-prediction.

<a name="cite_note-2"></a>2. [](#cite_ref-2) Cleveland: 303 observations, Hungarian: 294 observations, Switzerland: 123 observations, Long Beach VA: 200 observations and Statlog (Heart) Data Set: 270 observations<br>

___

[back to the top](#indice)<br>

<a id='step0'></a>
## Step 0.0:  Setting the Engine Tools

<a id='step0.1'></a>
### Step 0.1: Importing Required Python Libraries

#### Packages

In [None]:
import pandas  as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy   as np  # linear algebra
import seaborn as sns # statistical data visualization 

In [None]:
import os # OS routines
import matplotlib.pyplot as plt # a MATLAB-like way of plotting (a state-based interface to matplotlib). 

In [None]:
import opendatasets as od # collection of downloadable  datasets

#### Functions from Scikit-Learn Modules

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics         import accuracy_score
from sklearn.linear_model    import LogisticRegression, SGDClassifier
from sklearn.tree            import DecisionTreeClassifier
from sklearn.ensemble        import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors       import KNeighborsClassifier
from sklearn.naive_bayes     import GaussianNB
from sklearn.svm             import LinearSVC, SVC
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing   import StandardScaler, LabelEncoder

<a id='step0.2'></a>
### Step 0.2: Display Setting

In [None]:
# Disabling Scientific Notation 
pd.options.display.float_format = '{:.2f}'.format

In [None]:
# Opening Figures on Screen
%matplotlib inline  

In [None]:
# Setting Plots Parameters
sns.set_style('whitegrid') 

<a id='step0.3'></a>
### Step 0.3: Defining Functions and Dictionaries

In [None]:
def detect_outliers(data): 
    '''Returns a list of the outliers detected (points that lie outside the range defined by the quartiles +/- 1.5 * IQR, where IQR is the interquartile range):
    
    detect_outliers(data)
    >>> outliers
    '''
    outliers = []
    for col in nocategorical_features:
        q1, q3= np.percentile(sorted(data[col]),[25,75])
        iqr = q3 - q1
       
        lower_bound = q1 - (1.5 * iqr) 
        upper_bound = q3 + (1.5 * iqr)
        
        outliers.append(data[(data[col]<lower_bound ) | (data[col]> upper_bound)].index.values)
   
    outliers = tuple(outliers)
    outliers = flatten(outliers)
    return outliers      

In [None]:
detect_outliers

In [None]:
def drop_outliers(data, outliers): 
    '''Returns the data without the outliers rows:
    
    drop_outliers(data, outliers)
    >>> data
    '''
    data = data.drop(data.index[outliers])  # use the DataFrame passed in, not the global df
    data.reset_index(drop=True, inplace=True)
    return data


In [None]:
def flatten(xss):
    '''Returns a single list from a composed list:
    
    flatten(xss)
    >>> [x for xs in xss for x in xs]
    '''  
    return [x for xs in xss for x in xs]

In [None]:
def gradient_statistics(data):
    '''
    Returns a background gradient of the statistics:
    
    gradient_statistics(data)
    >>> data.describe().style.background_gradient(cmap='Reds')
    '''
    return data.describe().style.background_gradient(cmap='Reds')

In [None]:
def boxenplot_generate(data):
    '''
    Draws a boxenplot (enhanced box plot) of the given features:
    
    boxenplot_generate(data)
    >>> plt.figure(figsize=(12,6))
    >>> plt.xlabel("Non-Categorical Features") # add x label
    >>> plt.ylabel("Entries") # add y label
    >>> plt.title("Shape of the Distribution") # add a plot title
    >>> sns.boxenplot(data=data)
    '''
    plt.figure(figsize=(12,6))
    plt.xlabel("Non-Categorical Features") # add x label
    plt.ylabel("Entries") # add y label
    plt.title("Shape of the Distribution") # add a plot title
    sns.boxenplot(data=data)

In [None]:
def plotbar_generate(data):
    '''
    Draws a bar plot of the number of unique values per feature:
    
    plotbar_generate(data)
    >>> data.nunique().plot.bar(figsize=(8,4), color=plt.cm.Paired(np.arange(len(data.columns)))) # change the figure size with the figsize argument
    >>> plt.xlabel('Features') # add x label
    >>> plt.ylabel('Number of Unique Categories') # add y label
    >>> plt.title('Features Cardinality') # add a plot title
    '''
    # one color per feature (bar), not one per row
    data.nunique().plot.bar(figsize=(8,4), color=plt.cm.Paired(np.arange(len(data.columns)))) # change the figure size with the figsize argument
    plt.xlabel('Features') # add x label
    plt.ylabel('Number of Unique Categories') # add y label
    plt.title('Features Cardinality') # add a plot title

In [None]:
def cardinality_quantifying(data, num_cord=10):
    '''
    Returns the categorical columns with relatively low cardinality (< 10), 
    nocategorical columns, numerical column and object columns:
    
    cardinality_quantifying(data)
    >>> categorical_features, nocategorical_features, numerical_features, object_features
    '''
    numerical_features = data.select_dtypes("number").columns
    object_features = data.select_dtypes("object").columns
    numerical_features= list(set(numerical_features))
    object_features= list(set(object_features))
   
    categorical_features = []
    nocategorical_features = []

    for col in data.columns:
        if data[col].nunique() < num_cord:
            categorical_features.append(col)
        else:
            nocategorical_features.append(col)            
    
    return categorical_features, nocategorical_features, numerical_features, object_features

In [None]:
def Scaling(df_train, df_test):
    '''
    Returns a standard scaled tuple of DataFrame:
    
    Scaling(df_train, df_test)
    >>> df_train, df_test
    '''
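    scaler = StandardScaler()
    # work on copies so the caller's DataFrames are not modified (avoids SettingWithCopyWarning)
    df_train, df_test = df_train.copy(), df_test.copy()
    nocategorical_features = cardinality_quantifying(df_train)[1]
    # fit on the training split only, then apply the same transformation to the test split
    df_train[nocategorical_features] = scaler.fit_transform(df_train[nocategorical_features])
    df_test[nocategorical_features] = scaler.transform(df_test[nocategorical_features])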

    return df_train, df_test
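
The `Scaling` helper fits the scaler on the training split and only applies it to the test split. Below is a minimal sketch of why the order matters, using hypothetical toy arrays (`toy_train` and `toy_test` are illustrative names, not part of this notebook's data): the test values are standardized with the training mean and standard deviation, so no information from the test split leaks into the fit.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy splits with a single feature
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0]])

toy_scaler = StandardScaler().fit(toy_train)    # statistics come from the training split only
train_scaled = toy_scaler.transform(toy_train)  # mean 0 and unit variance on the training split
test_scaled = toy_scaler.transform(toy_test)    # scaled with the *training* mean and std

print(toy_scaler.mean_, toy_scaler.scale_)
print(test_scaled)
```

The scaled test value lies well outside the training range, which is the expected behavior: the test split is measured against the training distribution.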

In [None]:
def MLC_report(data, dict_classifiers):
    '''
    Returns a report classification DataFrame:
    
    MLC_report(data, dict_classifiers)
    >>> scores_data
    '''
    scores_dict = {'models': [],
                   'train accuracy (%)':[],
                  # 'test accuracy (%)':[],
                   'pred accuracy (%)':[]
                  }
    models_index = list(scores_dict)[0]
    train_index = list(scores_dict)[1]
#     test_index = list(scores_dict)[2]
    pred_index = list(scores_dict)[2]
    
    #nocategorical_features=cardinality_quantifying(data)[1]

    for model, model_instantiation in dict_classifiers.items():
        
        X = data.drop(TARGET_col, axis=1)
        y = data[TARGET_col]

        X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42)
        
        X_train, X_test = Scaling(X_train, X_test)
        

        model_instantiation.fit(X_train,y_train)

        score_train = model_instantiation.score(X_train,y_train)
        # print(f"accuracy train: {score_train * 100:.2f}%")

#         score_test = model_instantiation.score(X_test,y_test)
#         #print(f"accuracy (test): {score_test * 100:.2f}%")
        
        y_pred = model_instantiation.predict(X_test)
        score_pred = accuracy_score(y_test,y_pred)
        #print(f"accuracy (pred): {score_pred * 100:.2f}%")
        
        scores_dict[train_index].append(round(score_train*100,2))
#         scores_dict[test_index].append(round(score_test*100,2))
        scores_dict[pred_index].append(round(score_pred*100,2))
        scores_dict[models_index].append(model)
        
    scores_data = pd.DataFrame(scores_dict)
    scores_data.sort_values(by=pred_index, ascending=False, inplace=True)
    return scores_data

In [None]:
def features_score1(data):
    '''
    Returns a DataFrame with the chi-squared score of each feature:
    
    features_score1(data)
    >>> featureScores
    '''
#     X = data.iloc[:,0:13]  #independent columns
#     y = data.iloc[:,-1]    #target column 
    X = data.drop(TARGET_col, axis=1)
    y = data[TARGET_col]
    #apply SelectKBest class to extract top best features
    bestfeatures = SelectKBest(score_func=chi2, k=10)
    fit = bestfeatures.fit(X,y)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(X.columns)
        
    #concat two dataframes for better visualization 
    featureScores = pd.concat([dfcolumns,dfscores],axis=1)
    featureScores.columns = ['Specs','Score']  #naming the dataframe columns
    return featureScores

In [None]:
def features_score(data):
    '''
    Prints the chi-squared scores and the 12 highest-scoring features:
    
    features_score(data)
    >>> None
    '''

    X = data.drop(TARGET_col, axis=1)
    y = data[TARGET_col]
   
    #apply SelectKBest class to extract top best features
    bestfeatures = SelectKBest(score_func=chi2, k=10)
    fit = bestfeatures.fit(X,y)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(X.columns)
    
    print(fit.scores_)
    
    #concat two dataframes for better visualization 
    featureScores = pd.concat([dfcolumns,dfscores],axis=1)
    featureScores.columns = ['Specs','Score']  #naming the dataframe columns
    print(featureScores.nlargest(12,'Score'))  #print best features

In [None]:
def features_importance(data):
    '''
    Prints and plots the feature importances estimated by an Extra Trees classifier:
    
    features_importance(data)
    >>> None
    '''
    model = ExtraTreesClassifier()
    # X = data.iloc[:,0:13]  #independent columns
    # y = data.iloc[:,-1]    #target column 
    X = data.drop(TARGET_col, axis=1)  # use the DataFrame passed in, not the global df
    y = data[TARGET_col]
    model.fit(X,y)
    print(model.feature_importances_) #use inbuilt class feature_importances of tree based classifiers
    #plot graph of feature importances for better visualization
    feat_importances = pd.Series(model.feature_importances_, index=X.columns)
    feat_importances.nlargest(13).plot(kind='barh')
    plt.show()

In [None]:
def distribution_feature(data):
    '''
    Plots the distribution of each feature, split by the target class:
    
    distribution_feature(data)
    >>> None
    '''
    for col in data.columns:
        if col != TARGET_col:
            plt.figure(figsize=(8,4))
            plt.title(f"Distribution of Heart Diseases by {col}", fontsize=16)
            if col in categorical_features:
                sns.countplot(x=data[col], hue=data[TARGET_col])
            else:
                sns.histplot(data = data, x= data[col], hue=TARGET_col, kde=True)  

In [None]:
# fig, ax = plt.subplots(nrows = 2,ncols = 2,figsize = (10,9.75))
# for i in range(len(numerical_features) - 1):
#     plt.subplot(2,2,i+1)
#     sns.distplot(data[numerical_features[i]])
#     title = 'Distribution : ' + numerical_features[i]
#     plt.title(title)
# plt.show()

# plt.figure(figsize = (4.75,4.55))
# sns.distplot(df1[numerical_features[len(numerical_features) - 1]],kde_kws = {'bw' : 1})
# title = 'Distribution : ' + numerical_features[len(numerical_features) - 1]
# plt.title(title);

In [None]:
dict_classifiers = {
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "Decision Tree Classifier": DecisionTreeClassifier(),
    "Random Forest Classifier": RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0),
    "K Neighbors Classifier": KNeighborsClassifier(),
    "SGD Classifier": SGDClassifier(),
    "Gaussian NB": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000,solver='lbfgs'),
    "Linear SVM": SVC(probability=True, kernel='linear'),
    "Linear SVC": LinearSVC(dual=False),
}

[back to the top](#indice)<br>

___

<a id='step1'></a>
## Step 1.0:  Data Extraction

<a id='step1.1'></a>
### Step 1.1: Downloading the Data

In [None]:
od.download("https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction")

<a id='step1.2'></a>
### Step 1.2: Loading the DataFrame

The CSV files can be loaded into a DataFrame as follows:

In [None]:
# Path of the file to read
file_path = "~/PythonProgramming/JupyterProjects/Kaggle/kaggleDataSets/HeartDiseaseDataset/heart-failure-prediction/heart.csv"
data = pd.read_csv(file_path) # reading csv file 
print(f'DataFrame successfully loaded!\npath: "{file_path}"')

<a id='step1.3'></a>
### Step 1.3: DataFrame Display

After loading the data, print a sample and some additional information to see what we're working with:

In [None]:
display(data)

There are a total of 918 entries and 12 features. Note that the DataFrame is correctly indexed, that is, it has no separate index feature.

<b>TIP</b><br>Below are some helpful methods:
```python
data.head() # display the first five rows
data.tail() # display the last five rows
data.shape  # check out the dimension of the dataset
```

[back to the top](#indice)<br>

___

<a id='step2'></a>
## Step 2.0: Data Preparation

In this section, we will focus on preparing the data: cleaning, normalizing, and transforming it into a prepared format, normally tabular, suitable for the analysis methods planned during the design phase. Because this dataset has few features, we can also check it manually.

Before starting the next step, let's make a copy to keep the original data safe and save the target name for later use inside some functions:

In [None]:
df = data.copy(deep=True) # making a deep copy of DataFrame and save it to df
TARGET_col = 'HeartDisease' # saving the target name

<a id='step2.1'></a>
### Step 2.1: DataFrame Description

Printing general information, including the index dtype and columns, non-null counts and memory usage:

In [None]:
df.info()

The DataFrame uses 86.2 KB of memory. No columns have missing values (Non-Null Count = 918 entries). Note that there are five object-type features, which will need to be addressed later.

<b>TIP</b><br>Below are some helpful methods:
```python
data.dtypes                  # look at the data types for each column  
data.columns.values          # return an array of column names
data.columns.values.tolist() # return a list of column names
```

<a id='step2.2'></a>
### Step 2.2: Quantifying Cardinality 

Before starting the preparation of the data itself, let's classify the features as categorical or non-categorical and as numeric or non-numeric. Recall that the cardinality of a feature is its number of unique values:

In [None]:
plotbar_generate(df)

We will treat features with a cardinality lower than 10 as categorical; all others will be considered non-categorical. We will also separate numeric from non-numeric features.

In [None]:
categorical_features, nocategorical_features, numerical_features, object_features = cardinality_quantifying(data, 10)

Let's see the classification result:

In [None]:
print("categorical:\n", categorical_features)
print("no categorical:\n", nocategorical_features)
print("numerical:\n", numerical_features)
print("object:\n", object_features)
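
As a minimal sketch of the cardinality rule described above, the split can be reproduced on a small hypothetical DataFrame (`toy_card` and its values are invented for illustration and are not taken from the heart dataset). It only assumes that `nunique()` is compared against the same threshold of 10:

```python
import pandas as pd

# Hypothetical toy data: one high-cardinality column and two low-cardinality ones
toy_card = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39, 45, 58, 61, 52, 43, 60],
    "Sex": ["M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "M", "F"],
    "FastingBS": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
})

threshold = 10                     # same cut-off used in this notebook
cardinality = toy_card.nunique()   # number of distinct values per column

low_card = cardinality[cardinality < threshold].index.tolist()    # treated as categorical
high_card = cardinality[cardinality >= threshold].index.tolist()  # treated as non-categorical

print("categorical:", low_card)
print("non-categorical:", high_card)
```

Note that the rule looks only at the number of distinct values, so a numeric flag such as "FastingBS" is classified as categorical even though its dtype is numeric.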

Next, we will work through some of the central data cleaning steps for machine learning (see [What Is Data Cleaning: A Practical Guide](https://deepchecks.com/what-is-data-cleaning/)). We will use tools and automation to reduce unnecessary overhead.

<a id='step2.3'></a>
### Step 2.3: Removing Duplicates

Duplicate entries are problematic in data analysis. An entry that appears more than once is given disproportionate weight during training, which can produce good results on the training data and serious discrepancies at test time. Let's start by checking whether there are duplicate entries:

In [None]:
df.duplicated().sum()

There are no duplicate entries. 

<b>TIP</b><br>If there are duplicate entries it is necessary to remove them: 
```python
data.drop_duplicates(inplace=True)  # remove duplicates, if any.
data.reset_index(drop=True, inplace=True) # reset the DataFrame index
```
Setting `inplace=True` applies the changes to the same DataFrame, and `drop=True` discards the old index that `reset_index()` would otherwise keep as a new column.

Below are some useful methods for dropping rows and columns:
```python
subset_1 = data.drop(data.index[[1,7,9]])  # drop the 2nd, 8th, and 10th rows
subset_2 = data.drop(data.index[range(1,11)])  # drop the 2nd through 11th rows
subset_3 = data.drop(["col1", "col2"], axis=1)  # drop col1 and col2
subset_4 = data.iloc[:100, :]  # a subset of the first 100 rows of the original data
subset_5 = data.iloc[:, :3]  # a subset of the first 3 columns of the original data
subset_6 = data.iloc[:100, :15]  # a subset of the first 100 rows and the first 15 columns
subset_7 = data[["col1", "col2", "col3"]]  # a subset contains features col1, col2, and col3
subset_8 = data.sample(n=1000) # a random sample of size 1000 without replacement (replace = False (Default))
subset_9 = data.sample(frac=0.1, replace=True) # a random sample of 10% of the original data with replacement
```

<a id='step2.4'></a>
### Step 2.4: Removing Irrelevant Data 
It is not uncommon for data to come from diverse sources, so there may be unnecessary entries or even entries that don't belong. On a first analysis, we don't see any irrelevant data in the DataFrame.

<a id='step2.5'></a>
### Step 2.5: Fixing Structural Errors

It is not uncommon to see data with inconsistent naming (for example, a stray underscore or capital letter) or incorrectly filled values (for example, a categorical value outside the predefined range). Let's see if the unique values of the categorical features include any "strange" entry:

In [None]:
print("Sex:", df['Sex'].unique())
print("ChestPainType:", df['ChestPainType'].unique())
print("FastingBS:", df['FastingBS'].unique())
print("RestingECG:", df['RestingECG'].unique())
print("ExerciseAngina:", df['ExerciseAngina'].unique())
print("ST_Slope:", df['ST_Slope'].unique())
print("HeartDisease:", df['HeartDisease'].unique())

Note that there are no structural errors. For all features, the entries match the expected range (see the table in the section [Problem Definition](#introduction)). Now let's see the range of values of the non-categorical features:

In [None]:
print("Range of the values of the non-categorical features\n")
for col in df[nocategorical_features].columns:
    print(f"{col}: {df[col].min()} - {df[col].max()}")

Note that the "Oldpeak" feature has negative values, which are incorrect entries given its domain of 0-6.2 (see [A Hybrid Classification System for Heart Disease Diagnosis Based on the RFRS Method](https://www.hindawi.com/journals/cmmm/2017/8272091/)).

In [None]:
print('Total number of "Oldpeak" negative entries:', df.loc[(df["Oldpeak"]<0)].shape[0])

We will treat these negative entries later.

<a id='step2.6'></a>
### Step 2.6: Detecting Outliers

Outlier detection requires a deeper understanding of what the data should look like and of when entries should be ignored because they are inaccurate. To detect unwanted outliers, we need to explore the ranges and possible values of the numerical and categorical entries. For the categorical data, we saw that there are no outliers. Let's display the descriptive statistics with a background color gradient for each feature:

In [None]:
gradient_statistics(df)

Note that there is evidence of some outliers. The last quartiles of the "RestingBP", "Cholesterol" and "MaxHR" features seem to break out of their respective growth trends, and the first quartiles of the "RestingBP" and "Cholesterol" features seem to break out of their respective decreasing trends:

In [None]:
print('Total number of "RestingBP" null entries:', df.loc[(df["RestingBP"]==0)].shape[0])

In [None]:
print('Total number of "Cholesterol" null entries:', df.loc[(df["Cholesterol"]==0)].shape[0])

This first analysis shows at least 173 entries outside the expected range of the "RestingBP" and "Cholesterol" features. Drawing an enhanced box plot for the non-categorical features provides more information about the shape of the distribution, particularly in the tails:

In [None]:
boxenplot_generate(df[nocategorical_features])

Note how clear the existence of outliers is, mainly the zero values of the "Cholesterol" feature. Now we will use the interquartile range (IQR) to see when a value is too far from the middle of the data. A point is considered an outlier if it falls more than 1.5 times the interquartile range above the third quartile or below the first quartile.

In [None]:
outliers=detect_outliers(df)
print("There are", len(outliers), "outliers.")
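
To make the 1.5 × IQR rule concrete, here is a minimal sketch on a hypothetical one-dimensional sample (`sample` is an invented array, not a column of the heart dataset). It computes the same bounds as `detect_outliers`, but for a single array instead of looping over features:

```python
import numpy as np

# Hypothetical sample with one obviously extreme value
sample = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

q1, q3 = np.percentile(sample, [25, 75])  # first and third quartiles
iqr = q3 - q1                             # interquartile range

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Points outside [lower_bound, upper_bound] are flagged as outliers
flagged = sample[(sample < lower_bound) | (sample > upper_bound)]

print("bounds:", lower_bound, upper_bound)
print("flagged:", flagged)
```

Because the bounds are derived from the quartiles, a single extreme value barely moves them, which is why this rule is often preferred over mean-and-standard-deviation thresholds for skewed data.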

Now let's remove the outliers:

In [None]:
df=drop_outliers(df,outliers)

Again generating the background color of the descriptive statistics:

In [None]:
df.describe().style.background_gradient(cmap='Reds')

Note that the data still seem scattered. We will keep the data as they are, since we have already removed the outliers once. A further study could examine the consequences of performing another round of outlier removal. Some negative values of the "Oldpeak" feature remain. Let's remove them manually:

In [None]:
df = df.drop(df[df["Oldpeak"] < 0].index)
df.reset_index(drop=True, inplace=True)
print('Total number of "Oldpeak" negative entries:', df.loc[(df["Oldpeak"]<0)].shape[0])

<b>TIP</b><br>Below is a useful method for dropping rows with multiple conditions: 
```python
data = data.drop(data[(data["col1"] < 0) & (data["col2"] == "M")].index) # removes the rows where "col1" is negative and "col2" is "M"
data.reset_index(drop=True, inplace=True) # resets the DataFrame index
```

Drawing an enhanced box plot for the non-categorical features to see the final results:

In [None]:
boxenplot_generate(df[nocategorical_features])

<a id='step2.7'></a>
### Step 2.7: Handling Missing Data

Handling missing data is an important step in machine learning data cleaning. Let's recheck whether there are any missing values:

In [None]:
df.isnull().values.any()

As expected, there are no missing values.

<b> TIP</b><br>Below are examples of how to use some useful methods to deal with Missing Values. 

Check information of the missing values:
```python
data.isnull()  # checking missing values
data["col1"].isnull().sum()  # return the number of missing values in col1
data.notnull()  # checking non-missing values
data.isnull().values.any()  # only want to know if there are any missing values
data["col1"].isnull().values.any()  # only want to know if there are any missing values in col1
data.notnull().sum()  # number of non-missing values for each variable
data.isnull().sum().sum()  # total number of missing values in the data
```
Get information without missing values:

```python
data[data["col1"].notnull()]  # rows with no missing value in col1
data[data["col1"].notnull() & data["col2"].notnull()] # rows with no missing values in col1 and col2
no_missing = data.dropna()  # drop rows with missing values and assign the result to no_missing
clean_missing_rows = data.dropna(how="all")  # drop rows where all cells are NA and assign the result to clean_missing_rows
data.dropna(axis=1, how="all")  # drop columns if they only contain missing values
data.dropna(thresh=25)  # drop rows that contain less than 25 non-missing values
```
Fill in missing values:

```python
Fill_no = data.fillna(1000000)  # fill in missing with 1000000 and save the data to Fill_no
Fill_str = data["col1"].fillna("missing")  # fill in missing with a string "missing" and save the data to Fill_str
data["col1"] = data["col1"].fillna(data["col1"].mean())  # fill missing values with the sample mean and save the changes to the original data

```

<a id='step2.8'></a>
### Step 2.8: One Hot Encoding

Most machine learning algorithms cannot handle categorical (non-numeric) features directly, so we need to convert them to numeric features. A common way to do this is One Hot Encoding (OHE). To facilitate the analysis in the visualization step, we will also save a copy of the DataFrame without the OHE:

In [None]:
# df['Sex'] = LabelEncoder().fit_transform(df['Sex'])
# df['ChestPainType'] = LabelEncoder().fit_transform(df['ChestPainType'])
# df['RestingECG'] = LabelEncoder().fit_transform(df['RestingECG'])
# df['ExerciseAngina'] = LabelEncoder().fit_transform(df['ExerciseAngina'])
# df['ST_Slope'] = LabelEncoder().fit_transform(df['ST_Slope'])

In [None]:
df_noOHE = df.copy()

As seen in [step2.2](#step2.2), two of the five object-type features, "Sex" and "ExerciseAngina", are binary, i.e. they take only two values. We can therefore encode them manually using 0 and 1:

In [None]:
df['Sex'] = np.where(df['Sex'] == "M", 0, 1)
df['ExerciseAngina'] = np.where(df['ExerciseAngina'] == "N", 0, 1)

From this point on, we have 0 for "M" (male) and 1 for "F" (female) in the "Sex" feature, and 0 for "N" (no) and 1 for "Y" (yes) in the "ExerciseAngina" feature.

For features with three or more categories, we will use the pandas get_dummies function. This function creates a new feature per label. For example, ChestPainType has 4 labels, so 4 new features are created with values 0 or 1. As one of these features is redundant, the first one is dropped. To facilitate the analysis in the next step, we will also save a copy of the complete DataFrame, that is, one that keeps the first feature:

In [None]:
df_OHE = pd.get_dummies(df, columns=['ChestPainType', 'RestingECG', 'ST_Slope'], drop_first=False)

In [None]:
df=pd.get_dummies(df, columns=['ChestPainType', 'RestingECG', 'ST_Slope'], drop_first=True)

As the final result of the preparation step, we have a DataFrame with a total of 701 entries and 16 features, using 39.8 KB of memory (for comparison, see [step2.1](#step2.1)). 
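
The effect of `drop_first` can be seen on a small hypothetical example (`toy_slope` is an invented DataFrame that only reuses the "ST_Slope" labels for illustration). This is a sketch of the two `get_dummies` calls used above, not a replacement for them:

```python
import pandas as pd

# Hypothetical toy column with three labels
toy_slope = pd.DataFrame({"ST_Slope": ["Up", "Flat", "Down", "Flat"]})

# One indicator column per label
full = pd.get_dummies(toy_slope, columns=["ST_Slope"], drop_first=False)

# Same encoding, but the first (alphabetically sorted) label is dropped
reduced = pd.get_dummies(toy_slope, columns=["ST_Slope"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

In the full encoding every row has exactly one active indicator, so any one column can be recovered from the others; dropping it removes that redundancy, and a row with all remaining indicators off corresponds to the dropped label.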

[back to the top](#indice)<br>

___

In [None]:
yes = df[df['HeartDisease'] == 1].describe().T
no = df[df['HeartDisease'] == 0].describe().T

fig,ax = plt.subplots(nrows = 1,ncols = 2,figsize = (5,5))
plt.subplot(1,2,1)
sns.heatmap(yes[['mean']],annot = True,cmap = 'plasma',linewidths = 0.4,linecolor = 'black',cbar = False,fmt = '.2f')
plt.title('Heart Disease');

plt.subplot(1,2,2)
sns.heatmap(no[['mean']],annot = True,cmap = 'plasma',linewidths = 0.4,linecolor = 'black',cbar = False,fmt = '.2f')
plt.title('No Heart Disease');

fig.tight_layout(pad = 2)

<a id='step3'></a>
## Step 3.0: Data Exploration/Visualization

In this section, we will examine the data graphically and statistically to find patterns, connections, and relationships. Data visualization is one of the best tools to highlight possible patterns.

<a id='step3.1'></a>
### Step 3.1: Descriptive Statistics 
Let's generate some descriptive statistics, excluding NaN values:

In [None]:
df.describe().T

All the features are well distributed, except for the "Cholesterol" feature, which still seems to be widely scattered about its center; this can be examined in a further study.

<a id='step3.2'></a>
### Step 3.2: Feature Selection 

<a id='step3.2.1'></a>
#### Step 3.2.1: Univariate Selection

The scikit-learn library provides the SelectKBest class, which selects the k highest-scoring features of a dataset according to a chosen statistical test.
Below, we use the chi-squared (chi2) statistical test, which requires non-negative features, to select the 13 best features from the dataset.
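
The `features_score` helper used below is defined earlier in the notebook and is not reproduced here. As a rough, self-contained sketch of the underlying technique, the following applies `SelectKBest` with `chi2` to synthetic data. The data, the column names and the `chi2_scores` table are assumptions made for illustration only, not the notebook's actual helper or results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the prepared DataFrame (assumption: 15 features, binary target)
X_raw, y_demo = make_classification(
    n_samples=200, n_features=15, n_informative=5, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_raw.shape[1])]

# chi2 only accepts non-negative inputs, so shift every column to start at 0
X_demo = pd.DataFrame(X_raw - X_raw.min(axis=0), columns=feature_names)

# Score every feature against the target and keep the 13 highest-scoring ones
selector = SelectKBest(score_func=chi2, k=13)
selector.fit(X_demo, y_demo)

# Table of scores, highest first
chi2_scores = (
    pd.DataFrame({"Feature": feature_names, "Score": selector.scores_})
    .sort_values("Score", ascending=False)
    .reset_index(drop=True)
)
selected = X_demo.columns[selector.get_support()].tolist()
print(chi2_scores.head(13))
```

Because the chi-squared score depends on the scale of each feature, the ranking is only indicative when features are measured in very different units.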

In [None]:
features_score(df)

In [None]:
features_score(df_OHE)

<a id='step3.2.2'></a>
#### Step 3.2.2: Feature Importance

The importance of each feature of the dataset can be obtained from the feature importance property of a fitted model.
Feature importance assigns a score to every feature of the data: the higher the score, the more relevant the feature is to the output variable. Feature importance is built into tree-based classifiers. The Extra Trees classifier will be used to extract the top features of the dataset.
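
The `features_importance` helper called below is defined elsewhere in the notebook. The following is only a minimal sketch of the idea, assuming scikit-learn's `ExtraTreesClassifier` and its `feature_importances_` attribute, run on synthetic data with made-up feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the encoded DataFrame (assumption: 8 features, binary target)
X_raw, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_raw.shape[1])]

# Fit an ensemble of extremely randomized trees
model = ExtraTreesClassifier(n_estimators=100, random_state=42)
model.fit(X_raw, y_demo)

# Impurity-based importances: one non-negative score per feature, summing to 1
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```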

In [None]:
features_importance(df_OHE)

<a id='step3.3'></a>
### Step 3.3: Features Correlation

Correlation shows whether the features are related to each other or to the target variable. Correlation can be positive (as the value of a feature increases, the value of the target variable increases) or negative (as the value of a feature increases, the value of the target variable decreases). From the heatmap we can observe that chest pain ('cp') is highly related to the target variable. Compared with the relationships between other pairs of variables, chest pain contributes the most to predicting the presence of heart disease. A heart attack is a medical emergency. It usually occurs when a blood clot blocks blood flow to the heart. Without blood, the tissue loses oxygen and dies, causing chest pain.

A correlation can be positive, meaning both features move in the same direction, or negative, meaning that when one feature's value increases, the other's decreases. Correlation can also be neutral or zero, meaning that the features have no linear relationship.
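
To make the three cases concrete, here is a small sketch on a toy DataFrame. The column names and values are invented for illustration. It uses `DataFrame.corr()`, which computes the Pearson correlation by default, the same call used for the heatmaps in this step.

```python
import pandas as pd

# Toy data: one column rises with x, one falls with x, one is linearly unrelated to x
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "moves_with_x": [2, 4, 6, 8, 10],
    "moves_against_x": [10, 8, 6, 4, 2],
    "unrelated_to_x": [1, -2, 0, 2, -1],
})

# Pairwise Pearson correlation matrix
corr = toy.corr()

# Correlation of every column with x: positive, negative and zero cases
print(corr["x"].round(2))
```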

In [None]:
plt.figure(figsize=(16,7))
sns.heatmap(df.corr(), annot=True, cmap = 'Blues') # plot rectangular data as a color-encoded matrix

In [None]:
plt.figure(figsize=(16,7))
sns.heatmap(df_OHE.corr(), annot=True, cmap = 'Blues') # plot rectangular data as a color-encoded matrix

We can see that none of the features were found to be strongly correlated with the target. There are four features (“cp”, “restecg”, “thalach”, “slope”) positively correlated with the target and nine negatively correlated.

In [None]:
distribution_feature(df_noOHE)

#### Population distribution of the dataset

In [None]:
plt.figure(figsize=(8,4))
plt.title("Distribution", fontsize=16)
sns.histplot(data=df, x='Age', stat='density', kde=True)  # density histogram of Age with a KDE curve

In [None]:
# f, axes = plt.subplots(2, 2, figsize=(7,7), sharex=True)

# sns.distplot(df[df['Sex'] == 1]['Age'], axlabel="All Men", ax=axes[0,0]) # Men
# sns.distplot(df[df['Sex'] == 0]['Age'], axlabel="All Women", ax=axes[0,1]) # Women

# # Men and women who have heart disease
# # df[df['Sex'] == 1] can be read as: "rows where the value in the 'Sex' column equals 1."
# # Further conditions can be combined with this one.

# homens_com_doenca_cardiaca = df[(df['Sex'] == 1) & (df['HeartDisease'] == 1)]['Age']  # men with heart disease
# mulheres_com_doenca_cardiaca = df[(df['Sex'] == 0) & (df['HeartDisease'] == 1)]['Age']  # women with heart disease

# sns.distplot(homens_com_doenca_cardiaca, axlabel="Men with Heart Disease", ax=axes[1,0]) 
# sns.distplot(mulheres_com_doenca_cardiaca, axlabel="Women with Heart Disease", ax=axes[1,1]) 

### Visualization of the relationship between Age, cholesterol and the target variable

In [None]:
sns.relplot(x='Cholesterol', y='Age', hue='HeartDisease', data=df)  # column names as used in this dataset

[back to the top](#indice)<br>

___

<a id='step4'></a>
## Step 4.0: Predictive Modeling

Predictive modeling is a process used in data analysis to create or choose a suitable statistical model to predict the probability of a result.
After exploring the data, you have all the information needed to develop the mathematical model that encodes the relationship between the data. These models are useful for understanding the system under study, and they are used for two main purposes. The first is to make predictions about the data values produced by the system; in this case, you will be dealing with regression models. The second is to classify new data products; in this case, you will be using classification models or clustering models.
In fact, it is possible to divide the models according to the type of result that they produce:

- Classification models: if the result obtained by the model type is categorical.
- Regression models: if the result obtained by the model type is numeric.
- Clustering models: if the result obtained by the model type is descriptive.
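
The three model types listed above can be contrasted on a small synthetic example. This is an illustrative sketch only: the data are randomly generated, and `LogisticRegression`, `LinearRegression` and `KMeans` are chosen here as representative scikit-learn estimators, not as the models used later in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# Two well-separated groups of 2-D points (synthetic data)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)   # categorical target
values = 2.0 * X_demo[:, 0] + 1.0        # numeric target

# Classification: the prediction is a category (0 or 1)
class_pred = LogisticRegression().fit(X_demo, labels).predict(X_demo)

# Regression: the prediction is a number
value_pred = LinearRegression().fit(X_demo, values).predict(X_demo)

# Clustering: no target is given; the model describes the data by grouping it
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

print(class_pred[:5], value_pred[:5].round(2), cluster_ids[:5])
```

Heart disease prediction falls into the first category, since the target `HeartDisease` takes the values 0 or 1.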

In [None]:
categorical_features[:-1]

In [None]:
X = df.drop(TARGET_col, axis=1)
y = df[TARGET_col]

In [None]:
X

In [None]:
y

In [None]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42)

In [None]:
dict_classifiers["Logistic Regression"].fit(X_train, y_train)  # fit on the training split only

In [None]:
y_pred = dict_classifiers["Logistic Regression"].predict(X_test)
y_pred

In [None]:
accuracy_score(y_test,y_pred)

___

[back to the top](#indice)<br>

___

<a id='step5'></a>
## Step 5.0: Model Validation


(To be reviewed.)

A maximum classification accuracy of 92.59% was achieved according to a jackknife cross-validation scheme. The results demonstrate that the performance of the proposed system is superior to the performances of previously reported classification techniques.
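
A jackknife cross-validation scheme is usually understood as leave-one-out cross-validation: each observation is held out once while the model is trained on all the others. The sketch below illustrates that scheme under this assumption, on synthetic data with a logistic regression. It does not reproduce the `MLC_report` helper or the accuracy figure quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (assumption: 60 samples, 6 features, binary target)
X_demo, y_demo = make_classification(
    n_samples=60, n_features=6, n_informative=3, random_state=42
)

# One fold per observation: train on n - 1 samples, test on the one left out
loo = LeaveOneOut()
fold_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=loo, scoring="accuracy"
)

# Each fold scores 0 or 1; the mean is the leave-one-out accuracy estimate
loo_accuracy = fold_scores.mean()
print(f"{len(fold_scores)} folds, mean accuracy = {loo_accuracy:.3f}")
```

Leave-one-out fits the model once per observation, so on larger datasets a k-fold scheme is often a cheaper alternative.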

In [None]:
scores_data=MLC_report(df, dict_classifiers)
scores_data

In [None]:
scores_data['models'][0]

[back to the top](#indice)<br>

___

<a id='deployment'></a>
## Deployment of the Solution 


(To be reviewed.)

<a id='acknowledgements'></a>
## Acknowledgements
Creators:
1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Donor: David W. Aha (aha '@' ics.uci.edu) (714) 856-8779

[back to the top](#indice)<br>

In [None]:
df1=features_score1(df)

In [None]:
df1

In [None]:
products_list = df1.nlargest(12,'Score').values.tolist()
print(products_list)

In [None]:
best_features =[]
for prod in range(10):
    best_features.append(products_list[prod][0])

In [None]:
best_features

In [None]:
products_list[0][1]

In [None]:
best_features.append(TARGET_col)

In [None]:
best_features

In [None]:
df

In [None]:
df[best_features]

In [None]:
scores_data = MLC_report(df[best_features], dict_classifiers)
scores_data

In [None]:
sex = data[data['HeartDisease'] == 1]['Sex'].value_counts()
sex = [sex[0] / sum(sex) * 100, sex[1] / sum(sex) * 100]

cp = data[data['HeartDisease'] == 1]['ChestPainType'].value_counts()
cp = [cp[0] / sum(cp) * 100,cp[1] / sum(cp) * 100,cp[2] / sum(cp) * 100,cp[3] / sum(cp) * 100]

fbs = data[data['HeartDisease'] == 1]['FastingBS'].value_counts()
fbs = [fbs[0] / sum(fbs) * 100,fbs[1] / sum(fbs) * 100]

restecg = data[data['HeartDisease'] == 1]['RestingECG'].value_counts()
restecg = [restecg[0] / sum(restecg) * 100,restecg[1] / sum(restecg) * 100,restecg[2] / sum(restecg) * 100]

exang = data[data['HeartDisease'] == 1]['ExerciseAngina'].value_counts()
exang = [exang[0] / sum(exang) * 100,exang[1] / sum(exang) * 100]

slope = data[data['HeartDisease'] == 1]['ST_Slope'].value_counts()
slope = [slope[0] / sum(slope) * 100,slope[1] / sum(slope) * 100,slope[2] / sum(slope) * 100]

In [None]:
fig, ax = plt.subplots(nrows = 3,ncols = 2,figsize = (15,15))  # 3 x 2 grid to match the six pie charts below
colors = ['#F3ED13','#451FA4']

plt.subplot(3,2,1)
plt.pie(sex,labels = ['Male','Female'],autopct='%1.1f%%',startangle = 90,explode = (0.1,0),colors = colors)
plt.title('Sex');

plt.subplot(3,2,2)
plt.pie(cp,labels = ['ASY', 'NAP', 'ATA', 'TA'],autopct='%1.1f%%',startangle = 90,explode = (0,0.1,0.1,0.1))
plt.title('ChestPainType');

plt.subplot(3,2,3)
plt.pie(fbs,labels = ['FBS < 120 mg/dl','FBS > 120 mg/dl'],autopct='%1.1f%%',startangle = 90,explode = (0.1,0),colors = colors)
plt.title('FastingBS');

plt.subplot(3,2,4)
plt.pie(restecg,labels = ['Normal','ST','LVH'],autopct='%1.1f%%',startangle = 90,explode = (0,0.1,0.1))
plt.title('RestingECG');

plt.subplot(3,2,5)
plt.pie(exang,labels = ['Angina','No Angina'],autopct='%1.1f%%',startangle = 90,explode = (0.1,0),colors = colors)
plt.title('ExerciseAngina');

plt.subplot(3,2,6)
plt.pie(slope,labels = ['Flat','Up','Down'],autopct='%1.1f%%',startangle = 90,explode = (0,0.1,0.1))
plt.title('ST_Slope');

In [2]:
!jupyter nbconvert --to pdf HeartFailurePredictionDataset.ipynb
