# Support Vector Machine (SVM) Analysis for Predicting Coronary Heart Disease (CHD)
    


This notebook demonstrates how to use a Support Vector Machine (SVM) to predict the likelihood of a person having Coronary Heart Disease (CHD). We will cover the important step of feature scaling and compare the model's performance on unscaled, standardized, and normalized data.
    


## 1. Data Loading and Preparation
First, we load the data, select the features and target variable, and prepare them for modeling. The target variable is `CHDDXY1`, and the predictors are `AGEY1X`, `ADSEX4`, `PRVEVY1`, and `total_comorbidities`. We drop rows with missing values and convert the target into a binary format (1 for a CHD diagnosis, 0 otherwise).
    

In [1]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    
# Load the dataset
df = pd.read_csv('CVD_data2.csv')
    
# Select relevant columns and drop rows with missing values
svm_vars = ['CHDDXY1', 'AGEY1X', 'ADSEX4', 'PRVEVY1', 'total_comorbidities']
df_svm = df[svm_vars].dropna().copy()
    
# Binarize the target variable 'CHDDXY1' (assuming 1 is 'Yes' and 2 is 'No').
# Any value other than 1 is mapped to 0.
df_svm['CHD_Diagnosis'] = (df_svm['CHDDXY1'] == 1).astype('int64')
df_svm = df_svm.drop('CHDDXY1', axis=1)
    
# Define predictors and target
X = df_svm[['AGEY1X', 'ADSEX4', 'PRVEVY1', 'total_comorbidities']]
y = df_svm['CHD_Diagnosis']
    
print("Cleaned DataFrame for SVM:")
df_svm.info()
print("\nTarget variable distribution:")
print(df_svm['CHD_Diagnosis'].value_counts())
    

Cleaned DataFrame for SVM:
<class 'pandas.core.frame.DataFrame'>
Index: 3489 entries, 2 to 6740
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   AGEY1X               3489 non-null   float64
 1   ADSEX4               3489 non-null   float64
 2   PRVEVY1              3489 non-null   float64
 3   total_comorbidities  3489 non-null   int64  
 4   CHD_Diagnosis        3489 non-null   int64  
dtypes: float64(3), int64(2)
memory usage: 163.5 KB

Target variable distribution:
CHD_Diagnosis
0    3243
1     246
Name: count, dtype: int64
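
The binarization above treats every value other than 1 as "no CHD". If the raw column also held survey non-response codes (for example negative values, which is an assumption about this dataset and not something the notebook confirms), those rows would silently become negatives. A minimal sketch of a stricter mapping, using a small made-up DataFrame rather than `CVD_data2.csv`:

```python
import pandas as pd

# Hypothetical raw codes: 1 = yes, 2 = no, negative = non-response (assumed).
raw = pd.DataFrame({"CHDDXY1": [1, 2, 2, -8, 1, -1, 2]})

# Inspect which codes actually occur before deciding how to map them.
print(raw["CHDDXY1"].value_counts().sort_index())

# Map only the two expected answers; every other code becomes NaN.
mapped = raw["CHDDXY1"].map({1: 1, 2: 0})

# Drop rows whose answer is neither yes nor no, then store the target as integers.
clean = raw.assign(CHD_Diagnosis=mapped).dropna(subset=["CHD_Diagnosis"])
clean = clean.astype({"CHD_Diagnosis": "int64"})

print(clean["CHD_Diagnosis"].value_counts().sort_index())
```

Whether such codes exist in `CHDDXY1` is worth checking with `value_counts()` on the real column before choosing between the two mappings.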



## 2. Model with Unscaled Data
As a baseline, we'll first train an SVM on the original, unscaled data.
    

In [2]:

# Split the unscaled data
X_train_unscaled, X_test_unscaled, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
# Create and train the SVM model
svm_model_unscaled = SVC(kernel='linear', random_state=42)
svm_model_unscaled.fit(X_train_unscaled, y_train)
    
# Make predictions
y_pred_unscaled = svm_model_unscaled.predict(X_test_unscaled)
    
# Evaluate the model
print("--- SVM Model Performance (Unscaled Data) ---")
print("Accuracy:", accuracy_score(y_test, y_pred_unscaled))
print("\nClassification Report:\n", classification_report(y_test, y_pred_unscaled))
    

--- SVM Model Performance (Unscaled Data) ---
Accuracy: 0.9627507163323782

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98       966
           1       0.89      0.59      0.71        81

    accuracy                           0.96      1047
   macro avg       0.93      0.79      0.85      1047
weighted avg       0.96      0.96      0.96      1047
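
`confusion_matrix` is imported above but never called. The counts in the sketch below are not notebook output; they are reconstructed from the printed report, assuming the rounded class-1 precision (0.89) and recall (0.59) correspond to 48 true positives, 6 false positives, and 33 missed cases out of 81. That reconstruction reproduces the printed accuracy (1008 correct out of 1047), and it allows a comparison with a trivial classifier that always predicts "no CHD":

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Reconstructed counts (an inference from the rounded report, not notebook output).
tn, fp, fn, tp = 960, 6, 33, 48

# Rebuild label arrays that have exactly these counts.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
print("accuracy:", accuracy_score(y_true, y_pred))
print("recall (class 1):", recall_score(y_true, y_pred))

# Baseline that always predicts the majority class (0).
y_majority = np.zeros_like(y_true)
print("majority-class accuracy:", accuracy_score(y_true, y_majority))
```

Under these assumptions the always-negative baseline already scores 966/1047, about 0.923, so the SVM's 0.963 is a smaller gain than the headline accuracy suggests.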




## 3. Model with Standardized Data
Next, we standardize the features (rescaling each to zero mean and unit variance) and check how this affects performance.
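
As a small illustration of what standardization does, the sketch below applies the z-score formula `z = (x - mean) / std` by hand to a made-up column of ages and checks that `StandardScaler` gives the same result. The values are illustrative and not drawn from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages, shaped as a single feature column.
ages = np.array([[30.0], [45.0], [60.0], [75.0]])

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std().
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

scaled = StandardScaler().fit_transform(ages)

# The two computations should agree, and the scaled column has mean 0 and std 1.
print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```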
    

In [3]:

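# Scale the features using StandardScaler
# Note: the scaler is fit on all of X before the split, so test-set statistics
# influence the scaling; fitting on the training split only is the safer practice.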
scaler_standard = StandardScaler()
X_standardized = scaler_standard.fit_transform(X)
    
# Split the standardized data
X_train_std, X_test_std, y_train_std, y_test_std = train_test_split(X_standardized, y, test_size=0.3, random_state=42)
    
# Create and train the SVM model on standardized data
svm_model_standardized = SVC(kernel='linear', random_state=42)
svm_model_standardized.fit(X_train_std, y_train_std)
    
# Make predictions
y_pred_std = svm_model_standardized.predict(X_test_std)
    
# Evaluate the model
print("--- SVM Model Performance (Standardized Data) ---")
print("Accuracy:", accuracy_score(y_test_std, y_pred_std))
print("\nClassification Report:\n", classification_report(y_test_std, y_pred_std))
    

--- SVM Model Performance (Standardized Data) ---
Accuracy: 0.9627507163323782

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98       966
           1       0.89      0.59      0.71        81

    accuracy                           0.96      1047
   macro avg       0.93      0.79      0.85      1047
weighted avg       0.96      0.96      0.96      1047
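
Fitting a scaler on the full feature matrix before splitting lets the test rows contribute to the mean and variance used for scaling. A common way to avoid this is to split first and let a pipeline fit the scaler on the training rows only. The sketch below uses synthetic data from `make_classification` as a stand-in for the CHD features (the sample size and class weights are assumptions), so its score says nothing about the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 4 features, roughly 7% positives (assumed to mimic the imbalance).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.93, 0.07], random_state=42,
)

# Split first so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The pipeline fits StandardScaler on X_tr only, then trains the linear SVM.
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pipe.fit(X_tr, y_tr)

# At prediction time the pipeline reuses the training-set mean and variance.
print("test accuracy:", pipe.score(X_te, y_te))
```

Swapping `StandardScaler()` for `MinMaxScaler()` in `make_pipeline` gives the equivalent leakage-free setup for normalization.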




## 4. Model with Normalized Data
Finally, we'll apply Normalization (scaling to a [0, 1] range) and compare its effect.
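
For reference, min-max normalization maps each value to `(x - min) / (max - min)`. The sketch below checks that formula against `MinMaxScaler` on a made-up column of comorbidity counts; the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up comorbidity counts, shaped as a single feature column.
counts = np.array([[0.0], [1.0], [3.0], [4.0]])

# Manual min-max scaling: x' = (x - min) / (max - min).
col_min = counts.min(axis=0)
col_max = counts.max(axis=0)
manual = (counts - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(counts)

# The smallest value maps to 0, the largest to 1, and the rest fall in between.
print(scaled.ravel())
print(np.allclose(manual, scaled))
```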
    

In [4]:

# Scale the features using MinMaxScaler
# Note: as fit here, the min and max come from all of X, including future test rows.
scaler_minmax = MinMaxScaler()
X_normalized = scaler_minmax.fit_transform(X)
    
# Split the normalized data
X_train_norm, X_test_norm, y_train_norm, y_test_norm = train_test_split(X_normalized, y, test_size=0.3, random_state=42)
    
# Create and train the SVM model on normalized data
svm_model_normalized = SVC(kernel='linear', random_state=42)
svm_model_normalized.fit(X_train_norm, y_train_norm)
    
# Make predictions
y_pred_norm = svm_model_normalized.predict(X_test_norm)
    
# Evaluate the model
print("--- SVM Model Performance (Normalized Data) ---")
print("Accuracy:", accuracy_score(y_test_norm, y_pred_norm))
print("\nClassification Report:\n", classification_report(y_test_norm, y_pred_norm))
    

--- SVM Model Performance (Normalized Data) ---
Accuracy: 0.9627507163323782

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98       966
           1       0.89      0.59      0.71        81

    accuracy                           0.96      1047
   macro avg       0.93      0.79      0.85      1047
weighted avg       0.96      0.96      0.96      1047




## 5. Conclusion
This notebook walked through using an SVM for a binary classification task and compared its performance on unscaled data against data scaled with standardization and with normalization.
    
In this specific case, scaling made no measurable difference: all three models reached the same accuracy (about 0.963) and produced identical classification reports. Accuracy alone is flattering here, because only 246 of the 3,489 cases are positive. Recall for the CHD class was 0.59 in every run, so about 41% of the true CHD cases in the test set were missed.
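
Matching accuracy does not by itself show that the models make the same predictions row by row. A quick check is to compare the prediction arrays element by element. The sketch below does this on synthetic data (a stand-in, not the CHD dataset; `fit_predict` is a hypothetical helper); in the notebook the same comparison could be applied to `y_pred_unscaled`, `y_pred_std`, and `y_pred_norm`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 4 features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

def fit_predict(scaler=None):
    """Train a linear SVM, optionally scaling with a scaler fit on the training split."""
    if scaler is None:
        A_tr, A_te = X_tr, X_te
    else:
        A_tr = scaler.fit_transform(X_tr)
        A_te = scaler.transform(X_te)
    return SVC(kernel="linear").fit(A_tr, y_tr).predict(A_te)

preds = {
    "unscaled": fit_predict(),
    "standardized": fit_predict(StandardScaler()),
    "normalized": fit_predict(MinMaxScaler()),
}

# Fraction of test rows on which each scaled model agrees with the unscaled one.
for name in ("standardized", "normalized"):
    agreement = np.mean(preds[name] == preds["unscaled"])
    print(f"{name} vs unscaled agreement: {agreement:.3f}")
```

An agreement of 1.000 means the two models label every test row identically; anything lower shows that equal accuracy was reached with different individual predictions.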
    