<img src = "images/Heart.avif" alt = "Unsupervised Learning" />

# 0729 I06 SVM and Random Forest Bin Liao.
This analysis focuses on the Personal Key Indicators of Heart Disease dataset. Our goal is to understand the dataset's characteristics, perform exploratory data analysis, and build models that predict the likelihood of heart disease.

## Dataset Description

The Personal Key Indicators of Heart Disease dataset was obtained from [Kaggle](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?datasetId=1936563). It includes a variety of personal health indicators that may be linked to heart disease: Body Mass Index (BMI), smoking status, alcohol drinking habits, stroke history, mental and physical health conditions, difficulty in walking, sex, age category, race, diabetes status, physical activity level, general health perception, sleep time, asthma, kidney disease, and skin cancer.

In [42]:
# Project initialization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
%matplotlib inline

In [43]:
# Load data
url = 'https://raw.githubusercontent.com/Bencool/MBA6636-Business-Analytics/main/datas/heart_2020_cleaned.csv'

heart_df = pd.read_csv(url)
heart_df.head()

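| | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No | 16.6 | Yes | No | No | 3.0 | 30.0 | No | Female | 55-59 | White | Yes | Yes | Very good | 5.0 | Yes | No | Yes |
| 1 | No | 20.34 | No | No | Yes | 0.0 | 0.0 | No | Female | 80 or older | White | No | Yes | Very good | 7.0 | No | No | No |
| 2 | No | 26.58 | Yes | No | No | 20.0 | 30.0 | No | Male | 65-69 | White | Yes | Yes | Fair | 8.0 | Yes | No | No |
| 3 | No | 24.21 | No | No | No | 0.0 | 0.0 | No | Female | 75-79 | White | No | No | Good | 6.0 | No | No | Yes |
| 4 | No | 23.71 | No | No | No | 28.0 | 0.0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8.0 | No | No | No |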


In [44]:
# Get the dataset description.
heart_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   HeartDisease      319795 non-null  object 
 1   BMI               319795 non-null  float64
 2   Smoking           319795 non-null  object 
 3   AlcoholDrinking   319795 non-null  object 
 4   Stroke            319795 non-null  object 
 5   PhysicalHealth    319795 non-null  float64
 6   MentalHealth      319795 non-null  float64
 7   DiffWalking       319795 non-null  object 
 8   Sex               319795 non-null  object 
 9   AgeCategory       319795 non-null  object 
 10  Race              319795 non-null  object 
 11  Diabetic          319795 non-null  object 
 12  PhysicalActivity  319795 non-null  object 
 13  GenHealth         319795 non-null  object 
 14  SleepTime         319795 non-null  float64
 15  Asthma            319795 non-null  object 
 16  KidneyDisease     31

The summary shows only two data types: float64 (numeric) and object (string). Let's introduce columns of type int (numeric) and boolean.

- An Integer column: We can convert the **SleepTime** column to int, assuming that sleep time is typically measured in whole hours.
- Boolean columns: the following columns can be converted to boolean:
    - **Smoking**: The 'Smoking' column could be converted to True for 'Yes' and False for 'No', indicating whether the individual is a smoker.
    - **AlcoholDrinking**: Similarly, the 'AlcoholDrinking' column could be converted to True for 'Yes' and False for 'No', indicating whether the individual drinks alcohol.
    - **Stroke**: The 'Stroke' column could be converted to True for 'Yes' and False for 'No', indicating whether the individual has had a stroke.

We will create a new categorical column **BMIRange** based on the following BMI definition:
- If your BMI is less than 18.5, it falls within the underweight range.
- If your BMI is 18.5 to 24.9, it falls within the healthy weight range.
- If your BMI is 25.0 to 29.9, it falls within the overweight range.
- If your BMI is 30.0 or higher, it falls within the obese range.

In [47]:
# Convert 'SleepTime' to int
heart_df['SleepTime'] = heart_df['SleepTime'].astype(int)

# Convert 'Smoking' to boolean
heart_df['Smoking'] = heart_df['Smoking'].map({'Yes': True, 'No': False})

# Convert 'AlcoholDrinking' to boolean
heart_df['AlcoholDrinking'] = heart_df['AlcoholDrinking'].map({'Yes': True, 'No': False})

# Convert 'Stroke' to boolean
heart_df['Stroke'] = heart_df['Stroke'].map({'Yes': True, 'No': False})

# BMI Range
def get_bmi_range(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif 18.5 <= bmi < 25:
        return 'Healthy Weight'
    elif 25 <= bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

heart_df['BMIRange'] = heart_df['BMI'].apply(get_bmi_range)
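
The same binning can also be expressed without a row-by-row `apply`, which may matter on a frame of roughly 320,000 rows. The following is a minimal sketch using `pd.cut` with left-closed bins so that the boundaries match `get_bmi_range` above. The names `sample_bmi`, `bmi_bins`, and `bmi_labels` are illustrative, and the sample values are a small stand-in rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in BMI values (hypothetical sample, not the full dataset)
sample_bmi = pd.Series([16.6, 18.5, 24.99, 25.0, 29.9, 30.0, 94.85], name="BMI")

# Left-closed bins: [-inf, 18.5), [18.5, 25), [25, 30), [30, inf)
bmi_bins = [-np.inf, 18.5, 25, 30, np.inf]
bmi_labels = ["Underweight", "Healthy Weight", "Overweight", "Obese"]

# right=False makes each bin include its lower edge, as in get_bmi_range
bmi_range = pd.cut(sample_bmi, bins=bmi_bins, labels=bmi_labels, right=False)
print(bmi_range.tolist())
```

Note that `pd.cut` returns a categorical column rather than plain strings; applying `.astype(str)` afterwards would give an `object` column like the one produced by `apply`.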


In [48]:
# Check data types after the column updates:
heart_df.dtypes

HeartDisease         object
BMI                 float64
Smoking                bool
AlcoholDrinking        bool
Stroke                 bool
PhysicalHealth      float64
MentalHealth        float64
DiffWalking          object
Sex                  object
AgeCategory          object
Race                 object
Diabetic             object
PhysicalActivity     object
GenHealth            object
SleepTime             int64
Asthma               object
KidneyDisease        object
SkinCancer           object
BMIRange             object
dtype: object

## Data Preprocessing

To prepare the dataset for modeling, we perform exploratory data analysis with pandas and scikit-learn, plus other analysis libraries where needed. The analysis can include five-number summaries, histograms, boxplots, and checks for missing values and outliers. We then preprocess the data as necessary before applying SVM and Random Forest (RF).
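
One common way to check for outliers is the 1.5 × IQR rule that boxplots also use. The sketch below is an illustration under that assumption, not necessarily the check this notebook applies. The helper `iqr_outlier_counts` and the frame `demo_df` are hypothetical names, and the values are a small stand-in for `heart_df`.

```python
import pandas as pd

def iqr_outlier_counts(df, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # Boolean mask of values beyond either fence, summed to a count
        counts[col] = int(((df[col] < lower) | (df[col] > upper)).sum())
    return pd.Series(counts, name="outlier_count")

# Stand-in data (hypothetical sample, not the full dataset)
demo_df = pd.DataFrame({
    "BMI": [16.6, 20.34, 26.58, 24.21, 23.71, 94.85],
    "SleepTime": [5, 7, 8, 6, 8, 7],
})

outliers = iqr_outlier_counts(demo_df, ["BMI", "SleepTime"])
print(outliers)
```

On the real data, the same call could be made as `iqr_outlier_counts(heart_df, ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime'])`. Whether flagged values should be dropped, capped, or kept is a separate modeling decision.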


In [49]:
# General description of the dataset
heart_df.describe(include='all').T

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HeartDisease | 319795.0 | 2.0 | No | 292422.0 | | | | | | | |
| BMI | 319795.0 | | | | 28.325399 | 6.3561 | 12.02 | 24.03 | 27.34 | 31.42 | 94.85 |
| Smoking | 319795.0 | 2.0 | False | 187887.0 | | | | | | | |
| AlcoholDrinking | 319795.0 | 2.0 | False | 298018.0 | | | | | | | |
| Stroke | 319795.0 | 2.0 | False | 307726.0 | | | | | | | |
| PhysicalHealth | 319795.0 | | | | 3.37171 | 7.95085 | 0.0 | 0.0 | 0.0 | 2.0 | 30.0 |
| MentalHealth | 319795.0 | | | | 3.898366 | 7.955235 | 0.0 | 0.0 | 0.0 | 3.0 | 30.0 |
| DiffWalking | 319795.0 | 2.0 | No | 275385.0 | | | | | | | |
| Sex | 319795.0 | 2.0 | Female | 167805.0 | | | | | | | |
| AgeCategory | 319795.0 | 13.0 | 65-69 | 34151.0 | | | | | | | |


In [50]:
# Check for missing values
missing_values = heart_df.isna().sum()
print("\nMissing values:")
print(missing_values)


Missing values:
HeartDisease        0
BMI                 0
Smoking             0
AlcoholDrinking     0
Stroke              0
PhysicalHealth      0
MentalHealth        0
DiffWalking         0
Sex                 0
AgeCategory         0
Race                0
Diabetic            0
PhysicalActivity    0
GenHealth           0
SleepTime           0
Asthma              0
KidneyDisease       0
SkinCancer          0
BMIRange            0
dtype: int64


There are no missing values in the dataset.

In [53]:
# Five-number summary of the numerical columns
heart_df.describe()

| | BMI | PhysicalHealth | MentalHealth | SleepTime |
|---|---|---|---|---|
| count | 319795.0 | 319795.0 | 319795.0 | 319795.0 |
| mean | 28.325399 | 3.37171 | 3.898366 | 7.097075 |
| std | 6.3561 | 7.95085 | 7.955235 | 1.436007 |
| min | 12.02 | 0.0 | 0.0 | 1.0 |
| 25% | 24.03 | 0.0 | 0.0 | 6.0 |
| 50% | 27.34 | 0.0 | 0.0 | 7.0 |
| 75% | 31.42 | 2.0 | 3.0 | 8.0 |
| max | 94.85 | 30.0 | 30.0 | 24.0 |


In [55]:
# Categorical Data Description
heart_df.describe(include=['object', 'bool']).T

| | count | unique | top | freq |
|---|---|---|---|---|
| HeartDisease | 319795 | 2 | No | 292422 |
| Smoking | 319795 | 2 | False | 187887 |
| AlcoholDrinking | 319795 | 2 | False | 298018 |
| Stroke | 319795 | 2 | False | 307726 |
| DiffWalking | 319795 | 2 | No | 275385 |
| Sex | 319795 | 2 | Female | 167805 |
| AgeCategory | 319795 | 13 | 65-69 | 34151 |
| Race | 319795 | 6 | White | 245212 |
| Diabetic | 319795 | 4 | No | 269653 |
| PhysicalActivity | 319795 | 2 | Yes | 247957 |


In [57]:
# Unique values for categorical columns:
print("Unique Values for categorical columns:")
for col in heart_df.select_dtypes(include=['object','bool']):
  print(f"  - {col}: {heart_df[col].unique()}\n")

Unique Values for categorical columns:
  - HeartDisease: ['No' 'Yes']

  - Smoking: [ True False]

  - AlcoholDrinking: [False  True]

  - Stroke: [False  True]

  - DiffWalking: ['No' 'Yes']

  - Sex: ['Female' 'Male']

  - AgeCategory: ['55-59' '80 or older' '65-69' '75-79' '40-44' '70-74' '60-64' '50-54'
 '45-49' '18-24' '35-39' '30-34' '25-29']

  - Race: ['White' 'Black' 'Asian' 'American Indian/Alaskan Native' 'Other'
 'Hispanic']

  - Diabetic: ['Yes' 'No' 'No, borderline diabetes' 'Yes (during pregnancy)']

  - PhysicalActivity: ['Yes' 'No']

  - GenHealth: ['Very good' 'Fair' 'Good' 'Poor' 'Excellent']

  - Asthma: ['Yes' 'No']

  - KidneyDisease: ['No' 'Yes']

  - SkinCancer: ['Yes' 'No']

  - BMIRange: ['Underweight' 'Healthy Weight' 'Overweight' 'Obese']
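
The unique values above suggest that some columns are ordinal (for example **GenHealth**) while others are plain Yes/No flags. SVM and Random Forest implementations in scikit-learn expect numeric inputs, so these columns need encoding at some point. The sketch below shows one possible encoding and is not necessarily the one used later in this notebook. The ordering in `gen_health_order` is an assumption (worst to best), and `demo` is a small stand-in frame. An explicit order is used because `LabelEncoder` assigns codes by sorted label, which would not follow the health scale.

```python
import pandas as pd

# Assumed ordering of general health, from worst to best
gen_health_order = ["Poor", "Fair", "Good", "Very good", "Excellent"]

# Stand-in data (hypothetical sample, not the full dataset)
demo = pd.DataFrame({
    "GenHealth": ["Very good", "Fair", "Good", "Poor", "Excellent"],
    "Asthma": ["Yes", "No", "Yes", "No", "No"],
})

# Ordered categorical -> integer codes 0..4 that respect the assumed order
demo["GenHealthCode"] = pd.Categorical(
    demo["GenHealth"], categories=gen_health_order, ordered=True
).codes

# Binary Yes/No column -> 1/0
demo["AsthmaFlag"] = demo["Asthma"].map({"Yes": 1, "No": 0})

print(demo)
```

Nominal columns with no natural order, such as **Race**, are usually better served by one-hot encoding (for example `pd.get_dummies`) than by integer codes.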

