# IBM Unsupervised Learning Capstone Project

*by Lucian Popa*

**February, 2022**

## Introduction

The heart is an amazing organ. It continuously pumps oxygen- and nutrient-rich blood throughout your body to sustain life. This fist-sized powerhouse beats (expands and contracts) 100,000 times per day, pumping 23,000 liters (5,000 gallons) of blood. To work properly, the heart (just like any other muscle) needs a good blood supply.   
According to the WHO, cardiovascular diseases are the number one cause of death worldwide. They kill about seventeen million people every year, heart disease in particular. Prevention is better than cure: if we can evaluate each person's risk of heart disease, then not only patients but everyone can act earlier to keep illness away.

A heart attack (also known as myocardial infarction; MI) is defined as the sudden blockage of blood flow to a portion of the heart. Some of the heart muscle begins to die during a heart attack, and without early medical treatment, the loss of the muscle could be permanent. 

Conditions such as high blood pressure, high blood cholesterol, obesity, and diabetes can raise the risk of a heart attack.  Behaviors such as an unhealthy diet, low levels of physical activity, smoking, and excessive alcohol consumption can contribute to the conditions that can cause heart attacks.  Some factors, such as age and family history of heart disease, cannot be modified but are associated with a higher risk of a heart attack.

## Dataset

To explore a person's risk of having a heart attack, the [Heart Attack Analysis & Prediction Dataset](https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset) from *kaggle.com* was used. It consists of the following features:

+ Age of the patient (age in years)
+ Sex of the patient (sex; 1 = male, 0 = female)
+ Exercise-induced angina (exang; 1 = yes, 0 = no)
+ Number of major vessels (ca; 0-3)
+ Chest pain type (cp; Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
+ Resting blood pressure (trestbps; in mm Hg on admission to the hospital)
+ Cholesterol levels (chol; in mg/dl)
+ Fasting blood sugar (fbs; if > 120 mg/dl, 1 = true; 0 = false)
+ Resting electrocardiographic results (restecg; 0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
+ Maximum heart rate achieved (thalach)
+ Chance of heart attack (target: Heart disease)
+ A blood disorder called thalassemia (thal; 1 = normal; 2 = fixed defect; 3 = reversible defect)
+ Previous peak (oldpeak; ST depression induced by exercise relative to rest - ‘ST’ relates to positions on the ECG plot)
+ Slope (slope; the slope of the peak exercise ST segment, Value 1: upsloping, Value 2: flat, Value 3: downsloping)
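
Because several of these features are stored as integer codes, it can help to decode them into readable labels before exploring the data. The following is a minimal sketch under stated assumptions: the helper `decode_categories` and the toy rows are hypothetical illustrations (not taken from the dataset), and only the `sex` and `fbs` codes listed above are mapped.

```python
import pandas as pd

# Code-to-label mappings taken from the feature list above.
SEX_LABELS = {1: "male", 0: "female"}
FBS_LABELS = {1: "> 120 mg/dl", 0: "<= 120 mg/dl"}


def decode_categories(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with readable labels for the `sex` and `fbs` columns."""
    decoded = frame.copy()
    decoded["sex"] = decoded["sex"].map(SEX_LABELS)
    decoded["fbs"] = decoded["fbs"].map(FBS_LABELS)
    return decoded


# Made-up rows for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({"age": [50, 61], "sex": [1, 0], "fbs": [0, 1]})
decoded_toy = decode_categories(toy)
print(decoded_toy)
```

Working on a copy keeps the numeric codes available for the clustering algorithms, which need numeric input.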

## The goal

The aim of this project is to apply unsupervised learning techniques to assess whether an individual is at risk of a heart attack. More specifically, after some feature engineering and exploratory data analysis, the k-means and agglomerative clustering algorithms will be explored.
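
To make the two algorithms concrete, here is a minimal sketch of how k-means and agglomerative clustering might be run and compared with a silhouette score. It uses synthetic data from `make_blobs` as a stand-in for the patient features. The choice of two clusters (at risk vs. not at risk), the Ward linkage, and all variable names are assumptions for illustration, not the project's actual settings.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the patient features (assumption: not the real data).
X, _ = make_blobs(n_samples=200, centers=2, n_features=5, random_state=0)

# Both algorithms are distance-based, so features are put on a common scale.
X_scaled = StandardScaler().fit_transform(X)

# k-means: partitions the points around k centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Agglomerative clustering: merges points bottom-up (Ward linkage assumed here).
agglo_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X_scaled)

# The silhouette score lies in [-1, 1]; higher means better-separated clusters.
kmeans_sil = silhouette_score(X_scaled, kmeans_labels)
agglo_sil = silhouette_score(X_scaled, agglo_labels)
print(f"k-means silhouette:       {kmeans_sil:.3f}")
print(f"agglomerative silhouette: {agglo_sil:.3f}")
```

Since no labels are used during fitting, the cluster numbers themselves carry no meaning; any link between a cluster and heart attack risk has to be checked afterwards.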

In [2]:
# Importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

from sklearn.metrics import (
    accuracy_score,
    auc,
    average_precision_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_recall_curve,
    roc_curve,
    silhouette_score,
)
from sklearn.model_selection import cross_val_score

from sklearn.decomposition import PCA 
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from scipy.cluster import hierarchy

from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedShuffleSplit

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [5]:
# Import the dataset and load it into a DataFrame
data = pd.read_csv("data/heart.csv")

# make a copy of the original data
df = data.copy()
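
A few quick checks right after loading can catch missing values and duplicate rows before any feature engineering. The sketch below is illustrative only: `quick_report` is a hypothetical helper, and the toy frame is made up so that the example runs without `data/heart.csv`.

```python
import pandas as pd


def quick_report(frame: pd.DataFrame) -> dict:
    """Summarize the size, missing values, and duplicate rows of a DataFrame."""
    return {
        "shape": frame.shape,
        "missing_values": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Made-up frame with one missing value and one duplicated row (illustration only).
toy = pd.DataFrame({"age": [63, 37, 37, None], "chol": [233, 250, 250, 204]})
report = quick_report(toy)
print(report)
```

The same helper could be called as `quick_report(df)` on the DataFrame loaded above.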

### Data cleaning and feature engineering

In [6]:
df.head()

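| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |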
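
The columns in this preview sit on very different scales (for example `chol` in the hundreds and `oldpeak` below 4), which matters for distance-based clustering. As a hedged sketch, the five rows shown by `df.head()` can be standardized as follows. Treating `age`, `trestbps`, `chol`, `thalach`, and `oldpeak` as the continuous features is an assumption, as is the choice of `StandardScaler` over the other imported scalers.

```python
import io

import pandas as pd
from sklearn.preprocessing import StandardScaler

# The five rows printed by df.head(), without the index column.
csv_text = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Assumed split: these five columns are treated as continuous features.
continuous_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]

# Standardize each continuous column to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(sample[continuous_cols]),
    columns=continuous_cols,
    index=sample.index,
)
print(scaled.round(2))
```

With only five rows the scaled values are not meaningful in themselves; on the full DataFrame the same call would use the statistics of all patients.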
