<a href="https://colab.research.google.com/github/JABU-2022/heart-disease-analysis/blob/main/heart_disease_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ABOUT AUTHOR

Hi! I am Jaques BUTERA, a data science student at WorldQuant University (WQU) and a software student majoring in AI/ML at African Leadership University (ALU). As a data enthusiast, I love working with data and exploring what it reveals. My experience covers data wrangling, data visualization, data analysis, and building and evaluating machine learning models.

Dive into this notebook with me to explore the data. If you like it, don't forget to upvote it.


# **Heart Disease Dataset Analysis**

## **Identifying High-Risk Patients for Heart Disease: Clustering Analysis and Risk Factor Identification**

## **Notebook Content**

## 1. Introduction
- **Overview of the heart disease dataset**
- **Objectives and goals of the analysis**

## 2. Loading the Dataset
- **Import necessary libraries**
- **Load the dataset**
- **Display the first few rows of the dataframe**

## 3. Exploratory Data Analysis (EDA)
- **Summary statistics**
- **Check for missing values**
- **Data types of each column**
- **Distribution of target variable**
- **Visualizations (histograms, boxplots, correlation matrix)**

## 4. Data Preprocessing
- **Handle missing values**
- **Scale numerical features**
- **Encode categorical variables (e.g., OrdinalEncoder)**
- **Split the data into features and target variable**

## 5. Clustering
- **K-means clustering**
  - Apply K-means
  - Determine optimal number of clusters (Elbow method, silhouette score)
- **Hierarchical clustering**
  - Apply hierarchical clustering
  - Visualize dendrogram
- **DBSCAN clustering**
  - Apply DBSCAN
  - Tune parameters (epsilon, minimum samples)

## 6. Dimensionality Reduction for Visualization
- **Principal Component Analysis (PCA)**
  - Apply PCA
  - Visualize clusters in 2D/3D space
- **t-Distributed Stochastic Neighbor Embedding (t-SNE)**
  - Apply t-SNE
  - Visualize clusters in 2D space

## 7. Gaussian Mixture Models (GMMs)
- **Apply GMM to identify clusters**
- **Analyze risk factors associated with heart disease**
- **Compare GMM clusters with other clustering methods**

## 8. Evaluation of Clustering Performance
- **Silhouette score**
- **Davies-Bouldin index**
- **Compare performance of K-means, hierarchical clustering, DBSCAN, and GMMs**

## 9. Conclusion
- **Summary of findings**
- **Recommendations based on clustering results**
- **Potential improvements and future work**

# **Here is the workflow, starting with the introduction**

**1. INTRODUCTION**

## Dataset Information

### Overview
The original database contains 76 attributes, but most published studies use a subset of 14. Most research has focused on the Cleveland database, which is widely used in machine learning work on heart disease. The "goal" field (the `num` column in the CSV loaded below) indicates the presence of heart disease as an integer from 0 (no presence) to 4. Previous experiments have mainly aimed at distinguishing presence (values 1, 2, 3, 4) from absence (value 0).
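
The presence/absence split described above can be sketched as follows. This is a minimal illustration on made-up values, not part of the original analysis: `toy` and `has_disease` are hypothetical names, and `num` is assumed to be the target column as it appears in the loaded CSV.

```python
import pandas as pd

# Toy stand-in for the target column (0 = no disease, 1-4 = disease present).
toy = pd.DataFrame({"num": [0, 2, 1, 0, 4, 3]})

# Collapse the five-level label into a binary flag: 1 if any disease is present.
toy["has_disease"] = (toy["num"] > 0).astype(int)

print(toy)
```

Keeping the original `num` column alongside the binary flag leaves both views of the label available for later comparison with cluster assignments.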

### Additional Information
- The dataset has been anonymized: patient names and social security numbers were replaced with dummy values to protect privacy.
- The dataset contains missing values that need to be addressed during preprocessing.
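
Before deciding how to address missing values, it helps to see how many each column has. The sketch below uses a small made-up frame; `toy`, `missing_count`, `missing_share`, and `summary` are hypothetical names, and the same two calls would apply to the real dataframe.

```python
import numpy as np
import pandas as pd

# Small made-up frame with gaps in both numeric and text columns.
toy = pd.DataFrame({
    "chol": [233.0, np.nan, 229.0, 250.0],
    "ca": [0.0, 3.0, np.nan, np.nan],
    "thal": ["fixed defect", "normal", None, "normal"],
})

# Number of missing entries per column.
missing_count = toy.isna().sum()

# Fraction of rows that are missing per column.
missing_share = toy.isna().mean().round(2)

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary)
```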



**2. Loading the Dataset**

**2.1 Import the necessary libraries**

In [28]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder

In [2]:
# Mount the drive
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


**2.2 Load the dataset**

In [3]:
from google.colab import files

uploaded = files.upload()


Saving heart_disease_uci.csv to heart_disease_uci.csv


### **2.3 Display the first few rows of the dataframe**

In [42]:
df_old = pd.read_csv('heart_disease_uci.csv')

df_old.head()
#df_old.shape

Unnamed: 0,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,4,37,Male,Cleveland,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,5,41,Female,Cleveland,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0


## **3. Exploratory Data Analysis (EDA)**

In [37]:
# List of columns to delete
columns_to_delete = ['id', 'dataset']
# Drop columns from the DataFrame
df_old.drop(columns=columns_to_delete, inplace=True)


In [38]:
# Work on a copy so later in-place edits do not also change df_old
df = df_old.copy()
df.shape

(920, 14)

In [39]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,63,Male,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,67,Male,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,67,Male,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
3,37,Male,non-anginal,130.0,250.0,False,normal,187.0,False,3.5,downsloping,0.0,normal,0
4,41,Female,atypical angina,130.0,204.0,False,lv hypertrophy,172.0,False,1.4,upsloping,0.0,normal,0


In [10]:
# Drop every row that has at least one missing value
df.dropna(inplace=True)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 299 entries, 0 to 748
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       299 non-null    int64  
 1   sex       299 non-null    object 
 2   cp        299 non-null    object 
 3   trestbps  299 non-null    float64
 4   chol      299 non-null    float64
 5   fbs       299 non-null    object 
 6   restecg   299 non-null    object 
 7   thalch    299 non-null    float64
 8   exang     299 non-null    object 
 9   oldpeak   299 non-null    float64
 10  slope     299 non-null    object 
 11  ca        299 non-null    float64
 12  thal      299 non-null    object 
 13  num       299 non-null    int64  
dtypes: float64(5), int64(2), object(7)
memory usage: 35.0+ KB
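
Dropping every incomplete row leaves 299 of the 920 rows, as the `df.info()` output shows. One alternative, sketched here as an assumption rather than the method used in this notebook, is to impute: fill numeric gaps with the column median and text gaps with the most frequent value. The frame `toy` and the names `dropped`, `imputed`, `num_cols`, and `cat_cols` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame with one incomplete row.
toy = pd.DataFrame({
    "age": [63, 67, 37, 41],
    "chol": [233.0, np.nan, 250.0, 204.0],
    "thal": ["normal", None, "normal", "fixed defect"],
})

# Option A: drop any row that has a missing value.
dropped = toy.dropna()

# Option B: keep all rows and fill the gaps instead.
imputed = toy.copy()
num_cols = imputed.select_dtypes(include="number").columns
cat_cols = imputed.select_dtypes(exclude="number").columns

# Numeric columns: fill with each column's median.
imputed[num_cols] = imputed[num_cols].fillna(imputed[num_cols].median())

# Text columns: fill with each column's most frequent value.
for col in cat_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(len(toy), len(dropped), len(imputed))
```

Imputation keeps more rows for clustering but introduces values that were never measured, so the choice depends on how much data loss is acceptable.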


In [29]:

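# Ordinal-encode every text (object) column; the encoder is refitted per column
enc = OrdinalEncoder()

for col in df:
    if df[col].dtype == 'object':
        # Cast to str first so values such as True/False are treated uniformly
        df[col] = df[col].astype(str)
        # fit_transform expects and returns a 2-D array; flatten it back to one column
        df[col] = enc.fit_transform(df[col].values.reshape(-1,1)).ravel()

df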

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,63.0,1.0,3.0,145.0,233.0,1.0,0.0,150.0,0.0,2.3,0.0,0.0,0.0,0.0
1,67.0,1.0,0.0,160.0,286.0,0.0,0.0,108.0,1.0,1.5,1.0,3.0,1.0,2.0
2,67.0,1.0,0.0,120.0,229.0,0.0,0.0,129.0,1.0,2.6,1.0,2.0,2.0,1.0
3,37.0,1.0,2.0,130.0,250.0,0.0,1.0,187.0,0.0,3.5,0.0,0.0,1.0,0.0
4,41.0,0.0,1.0,130.0,204.0,0.0,0.0,172.0,0.0,1.4,2.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299,68.0,1.0,0.0,144.0,193.0,1.0,1.0,141.0,0.0,3.4,1.0,2.0,2.0,2.0
300,57.0,1.0,0.0,130.0,131.0,0.0,1.0,115.0,1.0,1.2,1.0,1.0,2.0,3.0
301,57.0,0.0,1.0,130.0,236.0,0.0,0.0,174.0,0.0,0.0,1.0,1.0,1.0,1.0
508,47.0,1.0,0.0,150.0,226.0,0.0,1.0,98.0,1.0,1.5,1.0,0.0,2.0,1.0
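
Because the loop above refits one encoder per column, only the last column's category list remains on the encoder afterwards. A possible variant, shown on a made-up frame, fits a single `OrdinalEncoder` on all text columns at once so the code-to-label mapping for every column stays available through `categories_`. The names `toy`, `cat_cols`, `encoder`, `encoded`, and `mapping` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up frame with two text columns and one numeric column.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "cp": ["typical angina", "asymptomatic", "non-anginal"],
    "age": [63, 41, 37],
})

# Columns to encode, listed explicitly for this illustration.
cat_cols = ["sex", "cp"]

# Fit one encoder on all text columns together.
encoder = OrdinalEncoder()
encoded = toy.copy()
encoded[cat_cols] = encoder.fit_transform(toy[cat_cols])

# categories_ holds one sorted label array per column; a label's position is its code.
mapping = {col: list(cats) for col, cats in zip(cat_cols, encoder.categories_)}

print(encoded)
print(mapping)
```

With the mapping at hand, cluster summaries can be reported with the original labels (for example "Male" rather than 1.0).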


In [30]:
def convert_int_to_float(df):
    """
    Convert all integer columns in a DataFrame to float.

    Parameters:
    df (pd.DataFrame): The input DataFrame.

    Returns:
    pd.DataFrame: The DataFrame with integer columns converted to float.
    """
    # Iterate through each column in the DataFrame
    for col in df.columns:
        # Check if the column is of integer type
        if pd.api.types.is_integer_dtype(df[col]):
            # Convert the column to float
            df[col] = df[col].astype(float)
    return df
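
The helper above modifies the dataframe it receives and also returns it. If the original should stay untouched, a non-mutating variant might look like the sketch below; `ints_to_float` and `toy` are hypothetical names introduced only for this illustration.

```python
import pandas as pd


def ints_to_float(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with integer columns cast to float64."""
    # Collect the integer columns using the same check as the helper above.
    int_cols = [col for col in frame.columns
                if pd.api.types.is_integer_dtype(frame[col])]
    # astype with a column-to-dtype mapping returns a new DataFrame.
    return frame.astype({col: "float64" for col in int_cols})


# Made-up frame with an integer, a float, and a text column.
toy = pd.DataFrame({
    "age": [63, 67],
    "chol": [233.0, 286.0],
    "sex": ["Male", "Female"],
})

converted = ints_to_float(toy)
print(converted.dtypes)
```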

In [31]:
df.shape

(299, 14)