# Metabolic Syndrome Prediction

This notebook predicts metabolic syndrome with several machine learning models. The dataset mixes categorical and numerical features, so we first impute missing values, encode and scale the features, and then tune and evaluate five classifiers on a held-out test set.

**Dataset**: Metabolic Syndrome.csv

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, RobustScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report

In [2]:
# Load the dataset
raw_df = pd.read_csv("Metabolic Syndrome.csv")

# Display dataset information and check for duplicates
print(raw_df.info())
print(f"\nThe number of duplicated rows: {raw_df.duplicated().sum()}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2401 entries, 0 to 2400
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   seqn               2401 non-null   int64  
 1   Age                2401 non-null   int64  
 2   Sex                2401 non-null   object 
 3   Marital            2193 non-null   object 
 4   Income             2284 non-null   float64
 5   Race               2401 non-null   object 
 6   WaistCirc          2316 non-null   float64
 7   BMI                2375 non-null   float64
 8   Albuminuria        2401 non-null   int64  
 9   UrAlbCr            2401 non-null   float64
 10  UricAcid           2401 non-null   float64
 11  BloodGlucose       2401 non-null   int64  
 12  HDL                2401 non-null   int64  
 13  Triglycerides      2401 non-null   int64  
 14  MetabolicSyndrome  2401 non-null   int64  
dtypes: float64(5), int64(7), object(3)
memory usage: 281.5+ KB
None

The num

In [3]:
# Separate categorical and numerical data
cat_data = raw_df.select_dtypes('object')
num_data = raw_df.select_dtypes(['float64', 'int64']).iloc[:, 1:-1]
y = raw_df['MetabolicSyndrome']

# Handle missing values in categorical data using SimpleImputer
si = SimpleImputer(strategy='most_frequent')
cat_imp = si.fit_transform(cat_data)
catimp = pd.DataFrame(cat_imp, columns=cat_data.columns)

# Encode categorical data using LabelEncoder and OneHotEncoder
le = LabelEncoder()
cat_data_bin = le.fit_transform(cat_data['Sex'])
cat_data_bin = pd.DataFrame(cat_data_bin, columns=['Sex'])

ohe = OneHotEncoder()
ohe_data = ohe.fit_transform(catimp[['Marital', 'Race']]).toarray()

# Combine the label-encoded Sex column, the numerical features, and the one-hot encoded columns
# (the target y is kept separate)
final = np.concatenate([cat_data_bin.values, num_data.values, ohe_data], axis=1)
final = pd.DataFrame(final)
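
The cell above fits the imputer and encoders on the full dataset before the train/test split. A minimal sketch of an alternative, assuming scikit-learn's `ColumnTransformer`, fits the same steps on the training rows only. The toy rows are hypothetical stand-ins for the real CSV, and `Sex` is one-hot encoded here rather than label encoded, which is a simplification and not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows that mimic a few of the notebook's columns (not real data).
toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Female", "Male"],
    "Marital": ["Married", np.nan, "Single", "Married"],
    "Race": ["White", "Asian", "White", "Black"],
    "BMI": [30.5, 23.3, np.nan, 28.0],
})
train, test = toy.iloc[:3], toy.iloc[3:]

# Impute the most frequent category, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Numerical columns pass through untouched; their NaNs are left for a later imputer.
preprocess = ColumnTransformer(
    [("cat", categorical, ["Sex", "Marital", "Race"])],
    remainder="passthrough",
    sparse_threshold=0.0,  # always return a dense array
)

x_train_toy = preprocess.fit_transform(train)  # statistics learned from training rows only
x_test_toy = preprocess.transform(test)        # unseen category "Black" becomes all zeros
print(x_train_toy.shape, x_test_toy.shape)
```

With `handle_unknown="ignore"`, a category that appears only in the test rows is encoded as all zeros instead of raising an error.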

In [4]:
# Split the data into training and testing sets
x = final.values
y = y.values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=444)

# Scale the data using RobustScaler
scaler = RobustScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# Handle missing values in scaled data using KNNImputer
kni = KNNImputer(n_neighbors=5)
x_train_imp = kni.fit_transform(x_train_scaled)
x_test_imp = kni.transform(x_test_scaled)
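
The scaler and the KNN imputer above are fit once on the whole training set, so the cross-validation folds used in the later search cells have already influenced the preprocessing. One possible variant, sketched here on synthetic stand-in data (not the notebook's dataset), wraps the steps in a `Pipeline` so that each fold refits them. The step names `scale`, `impute`, and `model` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data with about 5% of the values knocked out.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=444)
rng = np.random.default_rng(444)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("scale", RobustScaler()),              # NaNs are ignored in fit and kept in transform
    ("impute", KNNImputer(n_neighbors=5)),  # distances are computed on scaled features
    ("model", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as "<step>__<parameter>".
search = GridSearchCV(pipe, {"model__n_neighbors": [3, 5, 7]}, cv=5, scoring="f1")
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Scaling before imputing mirrors the order used in the cell above: `KNNImputer` relies on distances, so features with large ranges would otherwise dominate the neighbor search.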

In [5]:
# K-Nearest Neighbors (KNN) Model
knc = KNeighborsClassifier()
params = {
    'n_neighbors': range(1, 30, 2),
    'weights': ['uniform', 'distance']
}
gclf = GridSearchCV(knc, param_grid=params, cv=5, scoring='f1')
gclf.fit(x_train_imp, y_train)
print(f"Best params for KNN: {gclf.best_params_}")

best_model = gclf.best_estimator_
y_pred = best_model.predict(x_test_imp)
print(f"KNN Test Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))

Best params for KNN: {'n_neighbors': 5, 'weights': 'uniform'}
KNN Test Accuracy: 0.817047817047817
              precision    recall  f1-score   support

           0       0.85      0.88      0.87       323
           1       0.74      0.68      0.71       158

    accuracy                           0.82       481
   macro avg       0.80      0.78      0.79       481
weighted avg       0.81      0.82      0.82       481
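
The averaged rows of the report above can be recomputed from the per-class rows. The sketch below hardcodes the KNN values printed above; because those values are rounded to two decimals, the recomputed numbers match the report only approximately.

```python
# Per-class values copied from the KNN classification report above.
support = {0: 323, 1: 158}
f1 = {0: 0.87, 1: 0.71}
precision_1, recall_1 = 0.74, 0.68

# F1 is the harmonic mean of precision and recall.
f1_class_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(f1[c] * support[c] for c in support) / sum(support.values())

print(round(f1_class_1, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average treats both classes equally, while the weighted average leans toward class 0 because it has about twice the support.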



In [6]:
# Support Vector Classifier (SVC) Model
svclass = SVC(class_weight='balanced')
params_svc = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    'kernel': ['poly', 'rbf'],
    'gamma': [0.001, 0.01, 0.1, 1]
}
svgrid = RandomizedSearchCV(svclass, param_distributions=params_svc, cv=5, scoring='f1', random_state=444, n_iter=5)
svgrid.fit(x_train_imp, y_train)
print(f"Best params for SVC: {svgrid.best_params_}")

best_svc_model = svgrid.best_estimator_
y_svc_pred = best_svc_model.predict(x_test_imp)
print(f"SVC Test Accuracy: {accuracy_score(y_test, y_svc_pred)}")
print(classification_report(y_test, y_svc_pred))

Best params for SVC: {'kernel': 'rbf', 'gamma': 0.01, 'C': 10}
SVC Test Accuracy: 0.817047817047817
              precision    recall  f1-score   support

           0       0.92      0.80      0.85       323
           1       0.68      0.85      0.75       158

    accuracy                           0.82       481
   macro avg       0.80      0.82      0.80       481
weighted avg       0.84      0.82      0.82       481



In [7]:
# Decision Tree Model
dte = DecisionTreeClassifier(class_weight='balanced')
params_dte = {
    'max_depth': range(3, 10),
    'max_leaf_nodes': range(3, 10),
    'criterion': ['gini', 'entropy']
}
tgs = GridSearchCV(dte, param_grid=params_dte, cv=5, scoring='f1')
tgs.fit(x_train_imp, y_train)
print(f"Best params for Decision Tree: {tgs.best_params_}")

best_dt_model = tgs.best_estimator_
y_pred_dt = best_dt_model.predict(x_test_imp)
print(f"Decision Tree Test Accuracy: {accuracy_score(y_test, y_pred_dt)}")
print(classification_report(y_test, y_pred_dt))

Best params for Decision Tree: {'criterion': 'gini', 'max_depth': 5, 'max_leaf_nodes': 9}
Decision Tree Test Accuracy: 0.8378378378378378
              precision    recall  f1-score   support

           0       0.93      0.82      0.87       323
           1       0.71      0.87      0.78       158

    accuracy                           0.84       481
   macro avg       0.82      0.85      0.83       481
weighted avg       0.85      0.84      0.84       481



In [None]:
# Logistic Regression Model
log_reg = LogisticRegression(class_weight='balanced')
params_log_reg = {
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
    'max_iter': [200]
}
tgs_log_reg = GridSearchCV(log_reg, param_grid=params_log_reg, cv=5, scoring='f1')
tgs_log_reg.fit(x_train_imp, y_train)
print(f"Best params for Logistic Regression: {tgs_log_reg.best_params_}")

best_lr_model = tgs_log_reg.best_estimator_
y_pred_lr = best_lr_model.predict(x_test_imp)
print(f"Logistic Regression Test Accuracy: {accuracy_score(y_test, y_pred_lr)}")
print(classification_report(y_test, y_pred_lr))

Best params for Logistic Regression: {'max_iter': 200, 'penalty': 'l1', 'solver': 'liblinear'}
Logistic Regression Test Accuracy: 0.8087318087318087
              precision    recall  f1-score   support

           0       0.91      0.80      0.85       323
           1       0.67      0.83      0.74       158

    accuracy                           0.81       481
   macro avg       0.79      0.81      0.79       481
weighted avg       0.83      0.81      0.81       481



In [None]:
# Random Forest Model
rf = RandomForestClassifier(class_weight='balanced')
param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [2, 3],
    'max_features': ['log2', 'sqrt'],
    'criterion': ['gini', 'entropy'],
    'max_leaf_nodes': [4, 6],
    'min_samples_leaf': [15, 16],
    'min_samples_split': [15, 16]
}
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=444)
rfclf = GridSearchCV(rf, param_grid=param_grid_rf, cv=stratified_kfold, scoring='f1')
rfclf.fit(x_train_imp, y_train)
print(f"Best params for Random Forest: {rfclf.best_params_}")

best_rf_model = rfclf.best_estimator_
y_pred_rf = best_rf_model.predict(x_test_imp)
print(f"Random Forest Test Accuracy: {accuracy_score(y_test, y_pred_rf)}")
print(classification_report(y_test, y_pred_rf))

Best params for Random Forest: {'criterion': 'gini', 'max_depth': 3, 'max_features': 'log2', 'max_leaf_nodes': 6, 'min_samples_leaf': 16, 'min_samples_split': 16, 'n_estimators': 200}
Random Forest Test Accuracy: 0.8295218295218295
              precision    recall  f1-score   support

           0       0.93      0.81      0.86       323
           1       0.69      0.87      0.77       158

    accuracy                           0.83       481
   macro avg       0.81      0.84      0.82       481
weighted avg       0.85      0.83      0.83       481



## Conclusion

In this notebook, we preprocessed the dataset, handled missing values, and evaluated several machine learning models for predicting metabolic syndrome. The models included K-Nearest Neighbors (KNN), Support Vector Classifier (SVC), Decision Tree, Logistic Regression, and Random Forest. 

### Results:
- **Best Model**: The **Decision Tree** model performed the best with an accuracy of **83.78%** and an F1-score of **0.87** for class 0 (no metabolic syndrome) and **0.78** for class 1 (metabolic syndrome).
- **Runner-Up**: The **Random Forest** model also performed well, achieving an accuracy of **82.95%** and an F1-score of **0.86** for class 0 and **0.77** for class 1.

### Model Comparison

| Model               | Accuracy | F1-Score (Class 0) | F1-Score (Class 1) |
|---------------------|----------|--------------------|--------------------|
| KNN                 | 81.70%   | 0.87               | 0.71               |
| SVC                 | 81.70%   | 0.85               | 0.75               |
| Decision Tree       | 83.78%   | 0.87               | 0.78               |
| Logistic Regression | 80.87%   | 0.85               | 0.74               |
| Random Forest       | 82.95%   | 0.86               | 0.77               |
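
To put the accuracy gaps in the table into perspective, the sketch below converts each accuracy back into a count of correct test predictions and computes a rough 95% interval for the Decision Tree. The normal approximation used here is an assumption made for illustration, not part of the original analysis.

```python
import math

n_test = 481  # total support in the classification reports
accuracy = {
    "KNN": 0.817047817047817,
    "SVC": 0.817047817047817,
    "Decision Tree": 0.8378378378378378,
    "Logistic Regression": 0.8087318087318087,
    "Random Forest": 0.8295218295218295,
}

# Accuracy times test-set size recovers the number of correct predictions.
correct = {name: round(acc * n_test) for name, acc in accuracy.items()}

# Normal-approximation 95% interval for the Decision Tree accuracy.
p = accuracy["Decision Tree"]
half_width = 1.96 * math.sqrt(p * (1 - p) / n_test)
low, high = p - half_width, p + half_width

print(correct)
print(f"Decision Tree 95% interval: {low:.3f} to {high:.3f}")
```

The Decision Tree and the Random Forest differ by four test samples, and under this approximation every accuracy in the table falls inside the Decision Tree's interval, so the ranking could change on a different split.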

### Insights:
- The tuned Decision Tree is small (`max_depth=5`, `max_leaf_nodes=9`), which limits its capacity to overfit while still allowing it to capture non-linear relationships. The notebook does not report training scores, so the size of the train/test gap is not measured directly.
- The Random Forest model was slightly less accurate (82.95% versus 83.78%) but matched the Decision Tree's class 1 recall of 0.87, making it a reasonable alternative for this classification task.
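
Whether a tree of this size overfits can be checked by comparing training and test scores. The following is a minimal sketch on synthetic stand-in data (not the notebook's dataset), reusing the tuned values `max_depth=5` and `max_leaf_nodes=9`. The `random_state` and the class weights passed to `make_classification` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.67, 0.33], random_state=444
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=444
)

tree = DecisionTreeClassifier(
    max_depth=5, max_leaf_nodes=9, class_weight="balanced", random_state=444
)
tree.fit(X_tr, y_tr)

# A large gap between the two scores would point to overfitting.
train_f1 = f1_score(y_tr, tree.predict(X_tr))
test_f1 = f1_score(y_te, tree.predict(X_te))
print(f"train F1={train_f1:.3f}, test F1={test_f1:.3f}, gap={train_f1 - test_f1:.3f}")
```

Applied to the notebook's own data, the same comparison would use `best_dt_model` with `x_train_imp` and `x_test_imp`.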

On this single 80/20 train/test split, all five models reach between roughly 81% and 84% test accuracy in predicting metabolic syndrome, with the Decision Tree model emerging as the top performer.