
# Support Vector Machine (SVM) Analysis for Predicting Coronary Heart Disease (CHD)
    


This notebook demonstrates how to use a Support Vector Machine (SVM) to predict the likelihood of a person having Coronary Heart Disease (CHD). We will also cover the important step of feature scaling, which is crucial for the performance of SVM models.
    


## 1. Data Loading and Preparation
First, we load the data, select the features and target variable, and prepare them for modeling. The target variable is `CHDDXY1`, and the predictors are `AGEY1X`, `ADSEX4`, `PRVEVY1`, and `total_comorbidities`. We drop rows with missing values and convert the target into a binary format (1 for a CHD diagnosis, 0 otherwise).
    

In [1]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    
# Load the dataset
df = pd.read_csv('CVD_data2.csv')
    
# Select relevant columns and drop rows with missing values
# (.copy() gives an independent frame, so adding a column below is unambiguous)
svm_vars = ['CHDDXY1', 'AGEY1X', 'ADSEX4', 'PRVEVY1', 'total_comorbidities']
df_svm = df[svm_vars].dropna().copy()

# Binarize the target variable 'CHDDXY1' (assuming 1 is 'Yes' and 2 is 'No');
# note that every value other than 1 is mapped to 0
df_svm['CHD_Diagnosis'] = df_svm['CHDDXY1'].apply(lambda x: 1 if x == 1 else 0)
df_svm = df_svm.drop('CHDDXY1', axis=1)
    
print("Cleaned DataFrame for SVM:")
df_svm.info()
print("\nTarget variable distribution:")
print(df_svm['CHD_Diagnosis'].value_counts())
print("\nDataFrame Head:")
print(df_svm.head())
    

Cleaned DataFrame for SVM:
<class 'pandas.core.frame.DataFrame'>
Index: 3489 entries, 2 to 6740
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   AGEY1X               3489 non-null   float64
 1   ADSEX4               3489 non-null   float64
 2   PRVEVY1              3489 non-null   float64
 3   total_comorbidities  3489 non-null   int64  
 4   CHD_Diagnosis        3489 non-null   int64  
dtypes: float64(3), int64(2)
memory usage: 163.5 KB

Target variable distribution:
CHD_Diagnosis
0    3243
1     246
Name: count, dtype: int64

DataFrame Head:
    AGEY1X  ADSEX4  PRVEVY1  total_comorbidities  CHD_Diagnosis
2     62.0     1.0      1.0                    1              0
3     67.0     1.0      2.0                    1              0
5     79.0     2.0      2.0                    0              0
9     70.0     1.0      1.0                    0              0
11    71.0     2.0      1.0               
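
Because the binarization treats every value other than 1 as "No", any unexpected code in `CHDDXY1` would silently be counted as a negative case. A minimal sketch of checking the raw codes first is shown here on a small made-up frame; the values, including the `-8` "unknown" code, are hypothetical and not taken from `CVD_data2.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the raw column; -8 is an invented "unknown" code
raw = pd.DataFrame({"CHDDXY1": [1, 2, 2, -8, 1, 2]})

# Inspect which codes actually occur before collapsing them to 0/1
print(raw["CHDDXY1"].value_counts().sort_index())

# Keep only rows with the two expected codes, then binarize (1 = Yes, 2 = No)
valid = raw[raw["CHDDXY1"].isin([1, 2])].copy()
valid["CHD_Diagnosis"] = (valid["CHDDXY1"] == 1).astype(int)
print(valid)
```

Whether rows with other codes should be dropped or kept as negatives depends on the dataset's codebook, so this filter is only one possible choice.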


## 2. Feature Scaling
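SVMs are sensitive to the scale of the features because the margin is computed from distances between points: a feature with a wide range (such as age in years) can dominate one with a narrow range (such as a two-valued code). We therefore standardize each feature to zero mean and unit variance with `StandardScaler` before training.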
    

In [2]:

# Define predictors and target
X = df_svm[['AGEY1X', 'ADSEX4', 'PRVEVY1', 'total_comorbidities']]
y = df_svm['CHD_Diagnosis']
    
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
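
The cell above fits the scaler on all rows, so the means and variances it learns include the rows that later become the test set. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data from `make_classification`; the data, the variable names, and the `stratify` option are assumptions for illustration and are not part of this notebook's analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the CHD dataset): 4 features, roughly 7% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.93], random_state=42)

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

# The pipeline fits the scaler on the training rows only, then reuses
# those means and variances when transforming the test rows
pipe = make_pipeline(StandardScaler(), SVC(kernel='linear'))
pipe.fit(X_tr, y_tr)
print("Test accuracy:", pipe.score(X_te, y_te))
```

With a pipeline, the same object can be passed to cross-validation or a grid search without the scaler ever seeing held-out rows.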
    


## 3. SVM Modeling
We split the scaled data into training (70%) and test (30%) sets, train a linear-kernel SVM on the training set, and evaluate its performance on the test set.
    

In [3]:

# Split the scaled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
    
# Create and train the SVM model
# We will use a linear kernel as a starting point
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train, y_train)
    
# Make predictions
y_pred = svm_model.predict(X_test)
    
# Evaluate the model
print("--- SVM Model Performance ---")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
    

--- SVM Model Performance ---
Accuracy: 0.9627507163323782

Confusion Matrix:
 [[960   6]
 [ 33  48]]

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98       966
           1       0.89      0.59      0.71        81

    accuracy                           0.96      1047
   macro avg       0.93      0.79      0.85      1047
weighted avg       0.96      0.96      0.96      1047
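
The figures in the report can be recomputed directly from the confusion matrix, which also makes the majority-class baseline visible. A short sketch using the counts printed above (the variable names are chosen here for illustration):

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
cm = np.array([[960, 6],
               [33, 48]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)       # of predicted CHD cases, how many are real
recall = tp / (tp + fn)          # of real CHD cases, how many were found
baseline = (tn + fp) / cm.sum()  # accuracy of always predicting "no CHD"

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} baseline={baseline:.3f}")
# accuracy=0.963 precision=0.889 recall=0.593 baseline=0.923
```

Always predicting "no CHD" would already score about 0.92 on this test set, so the class-1 precision and recall are more informative here than overall accuracy.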




## 4. Conclusion
This notebook walked through using an SVM for a binary classification task: we prepared the data, scaled the features, and trained a linear SVM to predict Coronary Heart Disease. On the test set the model reached about 96% accuracy, but recall for the CHD class was only 0.59, so roughly four in ten true cases were missed.

The model can be tuned further by experimenting with different kernels (e.g., 'rbf', 'poly') and by optimizing the hyperparameters of the SVM (e.g., `C` and `gamma`).
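
One way to explore kernels, `C`, and `gamma` together is a cross-validated grid search. The sketch below runs on synthetic data rather than the CHD dataset; the grid values are arbitrary starting points, and `class_weight="balanced"` and F1 scoring are assumptions added here to account for class imbalance, not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the CHD dataset), roughly 10% positives
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.9], random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC(class_weight="balanced")),
])

# Illustrative grid; gamma is ignored by the linear kernel
param_grid = {
    "svm__kernel": ["linear", "rbf"],
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.1],
}

# Score on F1 of the positive class rather than accuracy
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

The best parameters found on synthetic data say nothing about the CHD data; the point is only the structure of the search.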
    