# Can Survey Questions from the BRFSS Provide Accurate Predictions of Whether an Individual Has Diabetes?

## Step 1: Data Understanding
We will:
- Load and explore the dataset.
- Identify the target variable (`Diabetes_012`, where 0 = no diabetes, 1 = pre-diabetes, 2 = diabetes) and the features (survey responses).

## Step 2: Data Preprocessing
- Handle missing values, if any.
- Normalize/standardize features for modeling.
- Split the data into training and testing sets.

## Step 3: Build Predictive Models
- Start with a basic logistic regression model.
- Expand to other classifiers like decision trees, random forests, or XGBoost for comparison.

## Step 4: Evaluate Model Accuracy
- Use metrics such as accuracy, precision, recall, and ROC-AUC to assess prediction quality.

## Step 5: Report Findings
- Summarize whether survey responses can effectively predict diabetes.


### Step 1: Load Libraries and Dataset

In [33]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Load the dataset (replace the file path with your actual dataset location)
file_path = "/Users/maralbarkhordari/Desktop/Diabetes Health Indicators Dataset/diabetes_012_health_indicators_BRFSS2015.csv"
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
data.head()


Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0



### Step 2: Preprocess the Data

In [None]:
# Separate features (X) and target (y)
X = data.drop(columns=['Diabetes_012'])  # Drop target column
y = data['Diabetes_012']  # Target column

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features for better performance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
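
The scaling step fits the scaler on the training split only and reuses those statistics for the test split, which keeps test-set information out of training. A minimal pure-Python sketch of that idea follows. The split of BMI values is hypothetical (the numbers are borrowed from the preview rows above), and the population standard deviation is used because that is what `StandardScaler` divides by.

```python
from statistics import mean, pstdev

# Hypothetical split of BMI values borrowed from the preview rows above.
train_bmi = [40.0, 25.0, 28.0, 27.0]
test_bmi = [24.0]

# Statistics are computed from the training split only.
mu = mean(train_bmi)
sigma = pstdev(train_bmi)  # population standard deviation (ddof=0)

train_scaled = [(x - mu) / sigma for x in train_bmi]
test_scaled = [(x - mu) / sigma for x in test_bmi]  # reuses the training mu and sigma

print(train_scaled, test_scaled)
```

After this transformation the training values have mean 0 and standard deviation 1, while the test values are only approximately standardized.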


### Step 3: Train a Logistic Regression Model

In [None]:
# Initialize and train the logistic regression model
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)


### Step 4: Evaluate the Model

In [None]:
# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Classification report
class_report = classification_report(y_test, y_pred, digits=3)
print("\nClassification Report:")
print(class_report)

# ROC-AUC score (multiclass setting)
roc_auc_multiclass = roc_auc_score(y_test, y_pred_proba, multi_class='ovr')
print(f"\nROC-AUC Score (Multiclass): {roc_auc_multiclass:.3f}")


Confusion Matrix:
[[41754     0  1041]
 [  871     0    73]
 [ 5714     0  1283]]





Classification Report:
              precision    recall  f1-score   support

         0.0      0.864     0.976     0.916     42795
         1.0      0.000     0.000     0.000       944
         2.0      0.535     0.183     0.273      6997

    accuracy                          0.848     50736
   macro avg      0.466     0.386     0.396     50736
weighted avg      0.802     0.848     0.811     50736


ROC-AUC Score (Multiclass): 0.782


# Interpretation of Results

## 1. Confusion Matrix
The confusion matrix shows how well the model predicted each class:

- **Rows:** True labels.
- **Columns:** Predicted labels.

Each row corresponds to one true class:

- **Class 0 (No Diabetes):**
  - Correctly classified as Class 0: **41,754**
  - Incorrectly classified as Class 2: **1,041**
  - Misclassified as Class 1: **0**

- **Class 1 (Pre-Diabetes):**
  - Correctly classified as Class 1: **0**
  - Misclassified as Class 0: **871**
  - Misclassified as Class 2: **73**

- **Class 2 (Diabetes):**
  - Correctly classified as Class 2: **1,283**
  - Misclassified as Class 0: **5,714**
  - Misclassified as Class 1: **0**

## 2. Classification Report
- **Precision:** Proportion of true positives out of all predicted positives for a class.
- **Recall:** Proportion of true positives out of all actual positives for a class.
- **F1-Score:** Harmonic mean of precision and recall (balancing false positives and false negatives).

### Key Observations:
- **Class 0** has the highest performance:
  - Precision = **0.864**
  - Recall = **0.976**
- **Class 1** predictions are completely ineffective:
  - Precision = **0.000**
  - Recall = **0.000**
- **Class 2** predictions are moderate but limited:
  - Precision = **0.535**
  - Recall = **0.183**
- The **macro average** (unweighted mean of the per-class metrics) indicates poor performance for minority classes (Classes 1 and 2).
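
The precision, recall, and F1 values in the report can be reproduced directly from the confusion matrix. The sketch below does this in plain Python. The helper `per_class_metrics` is a hypothetical name introduced here, and treating an undefined ratio as 0.0 is an assumption that matches how the report prints Class 1.

```python
# Confusion matrix from the baseline logistic regression output above
# (rows = true class, columns = predicted class).
conf = [
    [41754, 0, 1041],
    [871, 0, 73],
    [5714, 0, 1283],
]

def per_class_metrics(matrix):
    """Return {class_index: (precision, recall, f1)} for a square confusion matrix."""
    n = len(matrix)
    result = {}
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column total
        actual_k = sum(matrix[k])                          # row total
        # Assumption: undefined ratios (no predictions for a class) are reported as 0.0.
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        result[k] = (precision, recall, f1)
    return result

metrics = per_class_metrics(conf)
for k, (p, r, f) in metrics.items():
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Class 1 has no predicted samples at all (its column sums to zero), which is why its precision is undefined and shown as 0.000.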

## 3. ROC-AUC Score (Multiclass): **0.782**
- The one-vs-rest ROC-AUC score of **0.782** suggests that the predicted probabilities rank cases reasonably well across the three classes.
- However, the low recall for Classes 1 and 2 shows that the default predictions rarely select these minority classes.
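
With `multi_class='ovr'` and the default macro averaging, the multiclass ROC-AUC is the unweighted mean of three binary AUCs, one per class against the rest. The sketch below illustrates that calculation in plain Python on a tiny hypothetical example. The labels, probabilities, and the helper names `binary_auc` and `ovr_macro_auc` are made up for illustration and are not taken from the notebook.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ovr_macro_auc(y_true, proba, classes):
    """Unweighted mean of the one-vs-rest AUCs, one per class."""
    aucs = []
    for k, c in enumerate(classes):
        labels = [1 if y == c else 0 for y in y_true]  # class c against the rest
        scores = [row[k] for row in proba]             # predicted probability of class c
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / len(aucs)

# Hypothetical labels and predicted probabilities (each row sums to 1).
y_true = [0, 0, 1, 2]
proba = [
    [0.8, 0.1, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
macro_auc = ovr_macro_auc(y_true, proba, classes=[0, 1, 2])
print(round(macro_auc, 3))
```

Because each class contributes equally to the mean, a rare class such as pre-diabetes influences this score as much as the majority class does.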


## Summary
### Strengths:
- The model performs well in predicting **Class 0 (No Diabetes)**, which is the majority class.
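- The **ROC-AUC score** (0.782) indicates that the predicted probabilities carry useful signal, even though the default predictions rarely select the minority classes.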

### Weaknesses:
- Poor performance for **Classes 1 (Pre-Diabetes)** and **2 (Diabetes)**.
- Class imbalance is likely causing the model to focus on the majority class.

### Current Status:
- **Findings so far:** The model demonstrates strong performance for predicting individuals without diabetes (Class 0). However, it struggles with minority classes (pre-diabetes and diabetes).
- **Conclusion at this stage:** The survey questions provide some predictive power, especially for identifying individuals without diabetes. However, the model's poor performance on minority classes indicates the need for further refinement.
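
To put the 0.848 accuracy in context, it helps to compare it with a trivial baseline that always predicts the majority class. The sketch below computes that baseline from the support counts in the classification report. The variable names are illustrative.

```python
# Test-set class counts (the "support" column of the classification report).
support = {0: 42795, 1: 944, 2: 6997}
total = sum(support.values())

# A trivial classifier that always predicts the most frequent class.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# Accuracy of the logistic regression: diagonal of its confusion matrix over the total.
model_accuracy = (41754 + 0 + 1283) / total

print(f"majority-class baseline: {baseline_accuracy:.3f}")
print(f"logistic regression:     {model_accuracy:.3f}")
```

The model improves on the always-Class-0 baseline by well under one percentage point, which is why accuracy alone says little about performance on this dataset.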

## Handling Class Imbalance

We will use SMOTE (Synthetic Minority Oversampling Technique) to balance the classes in the training dataset. SMOTE generates synthetic samples for the minority classes so that the model sees all three classes equally often during training. The test set is left untouched.
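
The core idea behind SMOTE is linear interpolation between a minority-class sample and one of its nearest neighbors from the same class. The sketch below shows only that interpolation step on hypothetical feature rows. It is an illustration of the idea, not the `imblearn` implementation, which also performs the neighbor search and picks neighbors at random.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Create a synthetic point on the segment between x and a same-class neighbor."""
    lam = rng.random()  # interpolation factor, uniform in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
x = [1.0, 0.0, 27.0]         # hypothetical minority-class row (two binary flags, BMI)
neighbor = [1.0, 1.0, 31.0]  # hypothetical nearest neighbor from the same class
synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

One consequence worth keeping in mind is that a binary survey answer on which the two rows disagree becomes a fractional value in the synthetic row, so synthetic samples do not always look like real respondents.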

### Step 1: Install and Import Required Libraries

In [None]:
# Install imbalanced-learn if not already installed
# !pip install imbalanced-learn

from imblearn.over_sampling import SMOTE
from collections import Counter
print("imblearn and SMOTE imported successfully!")


imblearn and SMOTE imported successfully!


### Step 2: Apply SMOTE

In [None]:
# Display class distribution before balancing
print("Class distribution before SMOTE:", Counter(y_train))

# Apply SMOTE to oversample the minority classes
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)

# Display class distribution after balancing
print("Class distribution after SMOTE:", Counter(y_train_balanced))


Class distribution before SMOTE: Counter({0.0: 170908, 2.0: 28349, 1.0: 3687})
Class distribution after SMOTE: Counter({0.0: 170908, 2.0: 170908, 1.0: 170908})
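
The class counts printed before SMOTE can also be turned into inverse-frequency class weights, which is the alternative to oversampling mentioned under the next steps later in this notebook. The sketch below applies the formula scikit-learn documents for `class_weight='balanced'`, `n_samples / (n_classes * count)`. It only computes the weights and does not retrain any model.

```python
from collections import Counter

# Training-set class counts before SMOTE, copied from the output above.
counts = Counter({0.0: 170908, 2.0: 28349, 1.0: 3687})

n_samples = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: rare classes receive proportionally larger weights.
class_weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label in sorted(class_weights):
    print(f"class {label}: weight {class_weights[label]:.3f}")
```

Under this scheme a pre-diabetes sample would count roughly 46 times as much as a no-diabetes sample, which gives a sense of how severe the imbalance is.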


### Step 3: Retrain Logistic Regression Model

In [None]:
# Retrain the logistic regression model on the balanced data
log_reg_balanced = LogisticRegression(max_iter=1000, random_state=42)
log_reg_balanced.fit(X_train_balanced, y_train_balanced)

# Evaluate the model on the test set
y_pred_balanced = log_reg_balanced.predict(X_test_scaled)
y_pred_proba_balanced = log_reg_balanced.predict_proba(X_test_scaled)

# Confusion matrix
print("Confusion Matrix (Balanced):")
print(confusion_matrix(y_test, y_pred_balanced))

# Classification report
print("\nClassification Report (Balanced):")
print(classification_report(y_test, y_pred_balanced, digits=3))

# ROC-AUC score
roc_auc_balanced = roc_auc_score(y_test, y_pred_proba_balanced, multi_class='ovr')
print(f"\nROC-AUC Score (Balanced): {roc_auc_balanced:.3f}")


Confusion Matrix (Balanced):
[[28260  7466  7069]
 [  269   304   371]
 [ 1116  1813  4068]]

Classification Report (Balanced):
              precision    recall  f1-score   support

         0.0      0.953     0.660     0.780     42795
         1.0      0.032     0.322     0.058       944
         2.0      0.353     0.581     0.440      6997

    accuracy                          0.643     50736
   macro avg      0.446     0.521     0.426     50736
weighted avg      0.853     0.643     0.720     50736


ROC-AUC Score (Balanced): 0.774


# Interpretation of Results After Handling Class Imbalance

## 1. Confusion Matrix (Balanced)

### Class 0 (No Diabetes):
- **Correct Predictions:** 28,260  
- **Misclassified as Class 1 (Pre-Diabetes):** 7,466  
- **Misclassified as Class 2 (Diabetes):** 7,069  

### Class 1 (Pre-Diabetes):
- **Correct Predictions:** 304 (still low, but better than before)  
- **Misclassified as Class 0:** 269  
- **Misclassified as Class 2:** 371  

### Class 2 (Diabetes):
- **Correct Predictions:** 4,068  
- **Misclassified as Class 0:** 1,116  
- **Misclassified as Class 1:** 1,813  



## 2. Classification Report (Balanced)

### Class 0 (No Diabetes):
- **Precision:** 0.953 (up from 0.864)
- **Recall:** 0.660 (down from 0.976, because many Class 0 cases are now predicted as Class 1 or Class 2)

### Class 1 (Pre-Diabetes):
- **Recall:** Improved (0 → 0.322)
- **Precision:** Still very poor (0.032)

### Class 2 (Diabetes):
- **Recall:** Substantial improvement (0.183 → 0.581)
- **Precision:** Dropped (0.535 → 0.353)

### Averages:
- **Macro Average:** Recall improves (0.386 → 0.521) while precision dips (0.466 → 0.446), giving a slightly higher F1-score (0.396 → 0.426)
- **Weighted Average:** Weighted recall equals overall accuracy (64.3%) and is heavily weighted by Class 0
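
The relationship between the averages and accuracy can be checked from the balanced confusion matrix. The sketch below shows that the support-weighted recall is identical to overall accuracy, while the macro recall treats each class equally. The variable names are illustrative.

```python
# Confusion matrix of the SMOTE-trained logistic regression
# (rows = true class, columns = predicted class).
conf_balanced = [
    [28260, 7466, 7069],
    [269, 304, 371],
    [1116, 1813, 4068],
]

support = [sum(row) for row in conf_balanced]  # true samples per class
recall = [conf_balanced[k][k] / support[k] for k in range(3)]
total = sum(support)

macro_recall = sum(recall) / len(recall)                               # every class counts equally
weighted_recall = sum(r * s for r, s in zip(recall, support)) / total  # weighted by class size
accuracy = sum(conf_balanced[k][k] for k in range(3)) / total

print(f"macro recall:    {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
print(f"accuracy:        {accuracy:.3f}")
```

Because Class 0 supplies most of the test samples, the weighted figure mostly tracks Class 0 recall, whereas the macro figure is more sensitive to the minority classes.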



## 3. ROC-AUC Score (Balanced)
- **Score:** 0.774  
- Slightly lower than before balancing (0.782 → 0.774). Oversampling mostly changes which class the model favors at the decision threshold rather than how well its probabilities rank cases, so a small change in ROC-AUC is not surprising.



## Insights

### Strengths:
- Improved **recall** for minority classes, especially **Class 2 (Diabetes)**.  
- **Class 0 (No Diabetes)** maintains strong **precision**.  
- Addressed the prediction gap for minority classes partially.  

### Weaknesses:
- **Class 1 (Pre-Diabetes):** Very challenging to predict, with poor precision and F1-score.  
- Overall accuracy dropped as the model focuses more on minority classes.  



## Conclusion for the Question
Balancing the training set raised recall for the minority classes at the cost of precision and overall accuracy. The BRFSS survey questions therefore carry useful signal for identifying diabetes risk, but further refinements are necessary for robust predictions, especially for Class 1.





## Next Step: Move to XGBoost
XGBoost is a gradient-boosted tree model that can capture non-linear relationships and feature interactions that logistic regression cannot. Here it is trained on the same SMOTE-balanced training set.


### Step 1: Install and Import XGBoost

In [None]:
# Install XGBoost if not already installed
# !pip install xgboost

from xgboost import XGBClassifier


### Step 2: Train XGBoost

In [None]:
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Convert target variable to NumPy array (ensure compatibility)
y_train_balanced = np.array(y_train_balanced)

# Initialize and train the XGBoost model
xgb = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42)
xgb.fit(X_train_balanced, y_train_balanced)

# Make predictions on the test set
y_pred_xgb = xgb.predict(X_test_scaled)
y_pred_proba_xgb = xgb.predict_proba(X_test_scaled)

# Confusion matrix
print("Confusion Matrix (XGBoost):")
print(confusion_matrix(y_test, y_pred_xgb))

# Classification report
print("\nClassification Report (XGBoost):")
print(classification_report(y_test, y_pred_xgb, digits=3))

# ROC-AUC score
roc_auc_xgb = roc_auc_score(y_test, y_pred_proba_xgb, multi_class='ovr')
print(f"\nROC-AUC Score (XGBoost): {roc_auc_xgb:.3f}")


Parameters: { "use_label_encoder" } are not used.



Confusion Matrix (XGBoost):
[[41242     0  1553]
 [  811     0   133]
 [ 5226     0  1771]]

Classification Report (XGBoost):




              precision    recall  f1-score   support

         0.0      0.872     0.964     0.916     42795
         1.0      0.000     0.000     0.000       944
         2.0      0.512     0.253     0.339      6997

    accuracy                          0.848     50736
   macro avg      0.462     0.406     0.418     50736
weighted avg      0.806     0.848     0.819     50736


ROC-AUC Score (XGBoost): 0.764


# Interpretation of XGBoost Results

## 1. Classification Report

- **Class 0 (No Diabetes):**  
  Excellent performance with high precision (0.872) and recall (0.964), resulting in a strong F1-score (0.916).

- **Class 1 (Pre-Diabetes):**  
  No improvement; the model fails to predict pre-diabetes cases (precision = 0.000, recall = 0.000).

- **Class 2 (Diabetes):**  
  Moderate precision (0.512) and low recall (0.253): recall is higher than in the initial logistic regression (0.183) but well below the SMOTE logistic regression (0.581).

## 2. ROC-AUC Score

- **ROC-AUC Score (0.764):**  
  Slightly lower than both the SMOTE logistic regression model (0.774) and the initial logistic regression (0.782). This indicates the model is still struggling to separate minority classes effectively.



## Insights

### Strengths:
- The model excels at predicting **Class 0 (No Diabetes)**, reflecting the dominant class in the dataset.
- Slight improvement in predicting **Class 2 (Diabetes)** compared to the initial Logistic Regression without SMOTE.

### Weaknesses:
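- The model continues to ignore **Class 1 (Pre-Diabetes)**, even though it was trained on SMOTE-balanced data. Likely causes are the overlap in features between Classes 0 and 1 and the scarcity of real pre-diabetes cases, which synthetic samples may not represent well.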
- The overall performance (macro and weighted averages) is heavily influenced by the dominant class.



## Conclusions for the Question

**"Can survey questions from the BRFSS provide accurate predictions of whether an individual has diabetes?"**

- The survey questions are effective in predicting **no diabetes cases (Class 0)**.
- Predictions for **pre-diabetes (Class 1)** remain ineffective, likely due to the scarcity of real pre-diabetes cases and overlapping features.
- Predictions for **diabetes cases (Class 2)** improve on the initial model's recall (0.183 → 0.253) but still miss most cases and require further refinement.







## Next Steps

### Enhance the Model:
- Perform hyperparameter tuning on XGBoost to optimize performance for minority classes.
- Use custom class weights to emphasize minority classes during training.

### Guide to Hyperparameter Tuning for XGBoost

Hyperparameter tuning can improve the model's performance. We use `RandomizedSearchCV` from `sklearn` to search for good hyperparameters for the XGBoost model; `GridSearchCV` is the exhaustive alternative.

### Step 1: Define the Search Space

XGBoost has several hyperparameters. Here are the key ones to tune:

- **`n_estimators`**: Number of boosted trees in the model.
- **`max_depth`**: Maximum depth of each tree (controls model complexity).
- **`learning_rate`**: Shrinkage factor applied to each new tree's contribution; smaller values learn more slowly and usually need more trees.
- **`subsample`**: Fraction of training rows sampled for each tree.
- **`colsample_bytree`**: Fraction of features sampled for each tree.
- **`gamma`**: Minimum loss reduction required to split a node.
- **`reg_lambda` (L2 regularization)**: Penalty on large leaf weights.





### Step 2: Set Up RandomizedSearchCV
`RandomizedSearchCV` evaluates only a random sample of `n_iter` parameter combinations, so it typically runs faster than `GridSearchCV`, which evaluates every combination.
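
To make the runtime difference concrete, the sketch below counts how many model fits each search strategy would need. The grid values, `n_iter`, and `cv` mirror the settings used in this notebook's search cell, and the variable names are illustrative.

```python
from math import prod

# Same candidate values as the search space used in this notebook.
param_grid = {
    'n_estimators': [100, 150],
    'max_depth': [3, 5],
    'learning_rate': [0.1, 0.2],
    'subsample': [0.8],
    'colsample_bytree': [0.8],
    'gamma': [0, 0.1],
    'reg_lambda': [1, 5],
}

n_combinations = prod(len(values) for values in param_grid.values())

n_iter = 5  # combinations sampled by the randomized search
cv = 2      # cross-validation folds

grid_search_fits = n_combinations * cv
random_search_fits = n_iter * cv

print(f"combinations in the grid: {n_combinations}")
print(f"fits for an exhaustive grid search: {grid_search_fits}")
print(f"fits for the randomized search:     {random_search_fits}")
```

The trade-off is coverage: with five samples out of the full grid, the best combination found is not guaranteed to be the best one in the grid.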

In [35]:
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Use scikit-learn's built-in one-vs-rest ROC-AUC scorer (macro-averaged, based on predict_proba).
# This avoids make_scorer(..., needs_proba=True), which newer scikit-learn releases deprecate.
scoring = 'roc_auc_ovr'

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 150],
    'max_depth': [3, 5],
    'learning_rate': [0.1, 0.2],
    'subsample': [0.8],
    'colsample_bytree': [0.8],
    'gamma': [0, 0.1],
    'reg_lambda': [1, 5],
}

# Initialize the XGBoost model
xgb = XGBClassifier(eval_metric='mlogloss', random_state=42)

# Set up RandomizedSearchCV with fewer combinations for faster runtime
random_search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_grid,
    n_iter=5,  # Reduce number of parameter combinations to limit runtime
    scoring=scoring,
    cv=2,  # Reduce cross-validation folds
    verbose=2,
    random_state=42,
    n_jobs=-1  # Use all available cores
)

# Fit RandomizedSearchCV
random_search.fit(X_train_balanced, y_train_balanced)

# Display the best parameters
print("Best Parameters:", random_search.best_params_)




Fitting 2 folds for each of 5 candidates, totalling 10 fits
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=5, n_estimators=100, reg_lambda=5, subsample=0.8; total time=  43.1s
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=5, n_estimators=100, reg_lambda=5, subsample=0.8; total time=  43.3s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.2, max_depth=5, n_estimators=150, reg_lambda=5, subsample=0.8; total time= 1.1min
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.2, max_depth=5, n_estimators=150, reg_lambda=5, subsample=0.8; total time= 1.1min
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=3, n_estimators=100, reg_lambda=1, subsample=0.8; total time=  40.0s
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.2, max_depth=3, n_estimators=100, reg_lambda=1, subsample=0.8; total time=  40.3s
[CV] END colsample_bytree=0.8, gamma=0.1, learning_rate=0.1, max_depth=3, n_estimators=100, reg_lambda=5

### Step 3: Evaluate the Tuned Model
Once the best parameters are identified, use them to train a new XGBoost model:

In [36]:
# Retrieve the best parameters
best_params = random_search.best_params_

# Train the final model with the best parameters
xgb_tuned = XGBClassifier(**best_params, eval_metric='mlogloss', random_state=42)
xgb_tuned.fit(X_train_balanced, y_train_balanced)

# Make predictions
y_pred_xgb_tuned = xgb_tuned.predict(X_test_scaled)
y_pred_proba_xgb_tuned = xgb_tuned.predict_proba(X_test_scaled)

# Confusion matrix
print("Confusion Matrix (Tuned XGBoost):")
print(confusion_matrix(y_test, y_pred_xgb_tuned))

# Classification report
print("\nClassification Report (Tuned XGBoost):")
print(classification_report(y_test, y_pred_xgb_tuned, digits=3))

# ROC-AUC score
roc_auc_xgb_tuned = roc_auc_score(y_test, y_pred_proba_xgb_tuned, multi_class='ovr')
print(f"\nROC-AUC Score (Tuned XGBoost): {roc_auc_xgb_tuned:.3f}")


Confusion Matrix (Tuned XGBoost):
[[41096     0  1699]
 [  812     0   132]
 [ 5088     0  1909]]

Classification Report (Tuned XGBoost):




              precision    recall  f1-score   support

         0.0      0.874     0.960     0.915     42795
         1.0      0.000     0.000     0.000       944
         2.0      0.510     0.273     0.356      6997

    accuracy                          0.848     50736
   macro avg      0.462     0.411     0.424     50736
weighted avg      0.808     0.848     0.821     50736


ROC-AUC Score (Tuned XGBoost): 0.768


# Interpretation of Tuned XGBoost Results

## 1. Confusion Matrix

- **Class 0 (No Diabetes)**:
  - Correctly classified as Class 0: **41,096**.
  - Misclassified as Class 2: **1,699**.
  - No misclassification as Class 1.

- **Class 1 (Pre-Diabetes)**:
  - Correctly classified as Class 1: **0** (the model fails to predict any Class 1 instances).
  - Misclassified as Class 0: **812**.
  - Misclassified as Class 2: **132**.

- **Class 2 (Diabetes)**:
  - Correctly classified as Class 2: **1,909**.
  - Misclassified as Class 0: **5,088**.
  - No misclassification as Class 1.



## 2. Classification Report

| Metric       | Class 0 (No Diabetes) | Class 1 (Pre-Diabetes) | Class 2 (Diabetes) | Macro Avg | Weighted Avg |
|--------------|-----------------------|------------------------|--------------------|-----------|--------------|
| **Precision**| 0.874                 | 0.000                  | 0.510              | 0.462     | 0.808        |
| **Recall**   | 0.960                 | 0.000                  | 0.273              | 0.411     | 0.848        |
| **F1-Score** | 0.915                 | 0.000                  | 0.356              | 0.424     | 0.821        |

- **Class 0 (No Diabetes)**:
  - Excellent performance with high precision (**0.874**), recall (**0.960**), and F1-score (**0.915**).

- **Class 1 (Pre-Diabetes)**:
  - The model completely fails to predict any pre-diabetes cases, with all metrics at **0.000**.

- **Class 2 (Diabetes)**:
  - Moderate precision (**0.510**), poor recall (**0.273**), and low F1-score (**0.356**).

- **Macro Average**:
  - Shows low overall performance due to poor metrics for Classes 1 and 2.

- **Weighted Average**:
  - Heavily influenced by the dominant Class 0, inflating the overall metrics.



## 3. ROC-AUC Score

- **ROC-AUC Score (0.768)**: Indicates reasonable ranking ability averaged over the one-vs-rest comparisons. The thresholded predictions, however, remain heavily skewed towards Class 0.



## Strengths

- **Class 0 (No Diabetes)**:
  - The model performs exceptionally well in predicting this majority class.

- **Improvement for Class 2 (Diabetes)**:
  - Moderate precision indicates some success in identifying diabetes cases, but recall remains low.


## Weaknesses

- **Class 1 (Pre-Diabetes)**:
  - The model completely fails to predict pre-diabetes cases, likely due to:
    - The scarcity of real pre-diabetes cases (3,687 in the training split before SMOTE, against 170,908 without diabetes).
    - Overlap of features between Classes 0 and 1.

- **Class 2 (Diabetes)**:
  - The model struggles with recall, missing a large proportion of actual diabetes cases.



## Can We Consider This a Success?

- **For the Question ("Can Survey Questions from the BRFSS Provide Accurate Predictions of Whether an Individual Has Diabetes?")**:
  - The model identifies no diabetes (Class 0) reliably (recall 0.960) but detects only about a quarter of diabetes cases (Class 2, recall 0.273). The survey questions are therefore informative, but these models do not yet predict diabetes accurately.
  - The failure to predict pre-diabetes (Class 1) raises concerns about the dataset's ability to differentiate borderline cases.
