**Each phase of the process:**
1. [Business understanding](#Businessunderstanding)
    1. [What Questions Are We Trying to Answer?](#QA)
        1. [What are the Desired Outputs](#Desiredoutputs)
2. [Data Understanding](#Dataunderstanding)
    1. [Initial Data Report](#Datareport)
    2. [Describe Data](#Describedata)
    3. [Initial Data Exploration](#Exploredata) 
    4. [Verify Data Quality](#Verifydataquality)
        1. [Missing Data](#MissingData) 
        2. [Outliers](#Outliers)
3. [Data Preparation](#Datapreparation)
    1. [Select Your Data](#Selectyourdata)
    2. [Cleanse the Data](#Cleansethedata)
        1. [Label Encoding](#labelEncoding)
        2. [Drop Unnecessary Columns](#DropCols)
        3. [Altering Datatypes](#AlteringDatatypes)
        4. [Dealing With Zeros](#DealingZeros)
    3. [Construct Required Data](#Constructrequireddata)
    4. [Integrate Data](#Integratedata)

## 1.1 What Questions Are We Trying To Answer? <a class="anchor" id="QA"></a>


In this project we use the Heart Disease dataset from Kaggle and apply several ML models (KNN, Logistic Regression, and Decision Tree) to predict heart disease in patients from the characteristics provided in the dataset. The goal is not to find the best algorithm, but to see how each of them performs.

# Libraries Used in This Project

 - pandas
 - NumPy
 - Matplotlib
 - Seaborn
 - Scikit-Learn

In [None]:
# Manipulation
import pandas as pd
import numpy as np

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# ML
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV


# Setting Configs
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.float_format', lambda x: '%.3f' % x)
plt.style.use('ggplot')

# 2. Stage Two - Data Understanding <a class="anchor" id="Dataunderstanding"></a>

## Dataset Description 

This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments refer to using a subset of 14 of them. The "target" field refers to the presence of heart disease in the patient. It is integer valued 0 = no disease and 1 = disease.

**The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.**

https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset


## Attribute Information

- Age (age in years)
- Sex (1 = male; 0 = female)
- CP (chest pain type)
- TRESTBPS (resting blood pressure (in mm Hg on admission to the hospital))
- CHOL (serum cholesterol in mg/dl)
- FBS (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- RESTECG (resting electrocardiographic results)
- THALACH (maximum heart rate achieved)
- EXANG (exercise induced angina (1 = yes; 0 = no))
- OLDPEAK (ST depression induced by exercise relative to rest)
- SLOPE (the slope of the peak exercise ST segment)
- CA (number of major vessels (0-3) colored by fluoroscopy)
- THAL (3 = normal; 6 = fixed defect; 7 = reversible defect)
- TARGET (1 or 0)

## 2.1 Describe Data <a class="anchor" id="Describedata"></a>

In [None]:
# Loading Dataset
heart = pd.read_csv('./datasets/heart.csv')

# Inspecting Dataset
heart.head()

In [None]:
# Dataset info
print('#' * 50)
print('Total Rows:', heart.shape[0])
print('Total Columns:', heart.shape[1])
print('#' * 50, '\n')
heart.info(memory_usage=False)
print('\n')
print('#' * 50)

In [None]:
#Descriptive Statistics
print('#' * 64)
print('Descriptive Statistics')
print('#' * 64)
heart.select_dtypes(exclude='object').describe()

## 2.3 Verify Data Quality <a class="anchor" id="Verifydataquality"></a>

### 2.3.1. Missing Data <a class="anchor" id="MissingData"></a>

In [None]:
# Count missing (NaN) values per column
print('#' * 35)
print('Checking NA Values ')
print('#' * 35)
print(heart.isna().sum())
print('#' * 35,)

In [None]:
# isnull() is an alias of isna(), so this repeats the check above
print('#' * 35)
print('Checking NULL Values ')
print('#' * 35)
print(heart.isnull().sum())
print('#' * 35,)

### 2.3.2. Outliers <a class="anchor" id="Outliers"></a>

At this point, we may also want to remove outliers. These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values. For this project, we will remove anomalies based on the definition of extreme outliers:

https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm

- Below the first quartile − 3 ∗ interquartile range
- Above the third quartile + 3 ∗ interquartile range
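
The two rules above can be sketched as a small helper. This is a minimal illustration, not the check used in this notebook: the function name `extreme_outlier_mask`, the multiplier argument `k`, and the toy values are all assumptions made for the example.

```python
import pandas as pd


def extreme_outlier_mask(df: pd.DataFrame, k: float = 3.0) -> pd.DataFrame:
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR, column by column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns
    return (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)


# Toy data (hypothetical, not taken from the heart dataset):
# the last cholesterol value is deliberately implausible
toy = pd.DataFrame({
    "age": [45, 52, 58, 61, 49, 55, 63, 50],
    "chol": [210, 230, 245, 198, 260, 225, 240, 900],
})

mask = extreme_outlier_mask(toy)
outlier_counts = mask.sum()  # number of flagged values per column
print(outlier_counts)
```

Assuming the dataset is loaded as `heart`, the same helper could be called as `extreme_outlier_mask(heart).sum()` to count flagged values per column.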



**We did not find any outliers or odd figures in our dataset that could affect our analysis.**

## 2.4 Initial Data Exploration  <a class="anchor" id="Exploredata"></a>

### 2.4.1 Distributions  <a class="anchor" id="Distributions"></a>

In [None]:
plt.figure(figsize=(16, 8))
plt.suptitle('Age Distribution by Sex',
             fontweight='heavy',
             fontsize=16, fontfamily='sans-serif'
            )
# Left panel: age histogram for female patients (sex == 0)
plt.subplot(1, 2, 1)
sns.histplot(data=heart[heart.sex == 0], x='age', color='#3597e8', label='Female')
plt.legend()

# Right panel: age histogram for male patients (sex == 1)
plt.subplot(1, 2, 2)
sns.histplot(data=heart[heart.sex == 1], x='age', color='#ff964f', label='Male')
plt.legend()
plt.show()

In [None]:
sns.countplot(x='sex', data=heart)
plt.xlabel("Sex (0 = Female, 1 = Male)")
plt.show()

In [None]:
pd.crosstab(heart.cp, heart.target).plot(kind='bar')
plt.title("Heart Disease Frequency by Chest Pain Type")
plt.xlabel("Chest Pain Type")
plt.ylabel("Number of Patients (by target: 0 = no disease, 1 = disease)");

In [None]:
heart['age'].plot.hist()

In [None]:
plt.figure(figsize=(8,8))
sns.heatmap(heart.corr(), linewidths=0.1, cmap='coolwarm')

In [None]:
sns.boxplot(data=heart, x='target', y='age', hue='sex');

## 3.2 Clean The Data <a class="anchor" id="Cleansethedata"></a>

In [None]:
# Fixing and inspecting datatypes
print('#' * 35)
print('Fixing Data Type')
print('#' * 35)
fix_dtype          = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
heart[fix_dtype]   = heart[fix_dtype].astype(object)
print(heart[fix_dtype].dtypes)
print('#' * 35)

In [None]:
# One-hot encoding: turn each categorical column into 0/1 indicator columns
# drop_first=True drops the first level of each column to avoid redundant indicators
df = pd.get_dummies(heart, columns=['cp', 'thal', 'slope'], drop_first=True)

# Inspecting new dataset
df.head()
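
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The values are hypothetical and only the `cp` column is encoded; the first level (`cp == 0`) becomes the implicit baseline, represented by all indicator columns being zero.

```python
import pandas as pd

# Toy frame (hypothetical values): one categorical column, one numeric column
toy = pd.DataFrame({"cp": [0, 1, 2, 3, 1], "age": [63, 37, 41, 56, 57]})

# Four chest-pain levels produce three indicator columns; level 0 is dropped
encoded = pd.get_dummies(toy, columns=["cp"], drop_first=True)

encoded_columns = encoded.columns.tolist()
print(encoded_columns)
print(encoded)
```

Untouched columns such as `age` are kept as they are and appear before the new indicator columns.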

In [None]:
# Divide dataset in 2: Dependent (Target) variable and Independent Variables
X = df.loc[:, df.columns != 'target'].values

y = df.target.values

# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1984)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

In [None]:
# Inspecting X and y
print(X.shape, y.shape)

# ML part

# Logistic Regression

In [None]:
# Build the steps
steps = [('scaler', StandardScaler()),
         ('logreg', LogisticRegression())]
         
pipeline = Pipeline(steps)

# Create the parameter space
parameters = {"logreg__C": np.linspace(0.001, 1.0, 20)}

# Instantiate the grid search object
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training data
cv.fit(X_train, y_train)
print(cv.best_score_, "\n", cv.best_params_)

# Make predictions
y_pred = cv.predict(X_test)
print("Predictions: {}, Actual Values: {}".format(y_pred[:2], y_test[:2]))

In [None]:
# Compute test-set accuracy (for a classifier, score() returns accuracy, not R-squared)
accuracy = cv.score(X_test, y_test)

# Compute RMSE on the predicted class labels
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Print the metrics
print("Accuracy: {}".format(accuracy))
print("RMSE: {}".format(rmse))

In [None]:
# Create a KFold object
kf = KFold(n_splits=6, shuffle=True, random_state=5)

reg = LogisticRegression()

# Compute 6-fold cross-validation scores
cv_scores = cross_val_score(reg ,X, y, cv=kf)

# Print scores
print(cv_scores)

In [None]:
# Print the mean
print(np.mean(cv_scores))

# Print the standard deviation
print(np.std(cv_scores))

# Print the empirical 2.5%-97.5% range of the fold scores (not a formal confidence interval)
print(np.quantile(cv_scores, [0.025, 0.975]))

In [None]:
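# Use predicted probabilities of the positive class rather than hard labels for ROC AUC
roc_auc_score(y_test, cv.predict_proba(X_test)[:, 1])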

# KNN

In [None]:
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the model to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
# Set up dictionaries to store train and test accuracies for each k
neighbors = np.arange(1, 40)
train_accuracies = {}
test_accuracies = {}

for neighbor in neighbors:
    # Set up a k-NN classifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=neighbor)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)

    # Compute accuracy on the training and test sets
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)

print(neighbors, '\n', train_accuracies, '\n', test_accuracies)


In [None]:
# Add a title
plt.title("KNN: Varying Number of Neighbors")

# Plot training accuracies
plt.plot(neighbors, list(train_accuracies.values()), label="Training Accuracy")

# Plot test accuracies
plt.plot(neighbors, list(test_accuracies.values()), label="Testing Accuracy")

plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")

# Display the plot
plt.show()

# Linear Regression

# Decision Tree

# Model Comparison

In [None]:
# Create models dictionary
models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree Classifier": DecisionTreeClassifier(),
}
results = []

# Use the same shuffled 6-fold split for every model
kf = KFold(n_splits=6, random_state=1984, shuffle=True)

# Loop through the models' values
for model in models.values():
    # Perform cross-validation on the scaled training data
    cv_results = cross_val_score(model, X_train_scaled, y_train, cv=kf)
    results.append(cv_results)

# Compare the distribution of fold accuracies per model
plt.boxplot(results)
plt.xticks(range(1, len(models) + 1), list(models.keys()))
plt.show()

In [None]:
# Test set performance

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    print("{} Test Set Accuracy: {:.2f}% ".format(name, test_score * 100))

In [None]:
for name, model in models.items():

  # Fit the model to the training data
  model.fit(X_train_scaled, y_train)

  # Make predictions on the test set
  y_pred = model.predict(X_test_scaled)

  # Calculate the test RMSE on the predicted class labels
  test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
  print("Model {} \n".format(name))
  print("Test Set RMSE: {} \n".format(test_rmse))
  print("Classification Report: \n {}".format(classification_report(y_test, y_pred)))

In [None]:
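import graphviz
from sklearn import tree

# Export the fitted decision tree (not the logistic-regression grid search) as DOT data
dot_data = tree.export_graphviz(models["Decision Tree Classifier"], out_file=None,
                                feature_names=list(df.columns.drop('target')),
                                class_names=['no disease', 'disease'],  # target 0 and 1, in ascending order
                                filled=True)

# Draw graph
graph = graphviz.Source(dot_data, format="png")
graph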

In [None]:
df.loc[:, df.columns != 'target'].columns