
[Opening Slide: Predicting Diabetes Status Using CDC Health Indicators by Ryan Talbot]
Hello! My name is… and today I’m excited to present my machine learning project titled 'Predicting Diabetes Status Using CDC Health Indicators'. In the next 10 minutes, I’ll walk you through the problem I’ll be addressing, the machine learning approach I used, and the key results I achieved.

[Slide 2: Problem Description]
Diabetes is a chronic condition affecting millions worldwide. Early prediction and intervention are crucial for improving patient outcomes and reducing healthcare costs. However, accurately identifying individuals at risk—whether they are diabetic, pre-diabetic, or healthy—remains a significant challenge. This is where my project comes in.

[Slide 3: Objectives]
My project has two main objectives:
Predictive Modeling: Use a supervised machine learning model to classify individuals as either having diabetes or pre-diabetes, or having neither, based on various health indicators and lifestyle factors.
Model Evaluation: Assess the performance of this model using appropriate metrics to ensure its reliability and effectiveness.

[Slide 4: Dataset Overview]
I utilized the CDC Diabetes Health Indicators Dataset, which includes comprehensive healthcare statistics and lifestyle survey information. Let me give you a quick overview:
Number of Instances: 253,680
Number of Features: 21
Target Variable: Diabetes_binary, where 0 indicates no diabetes and 1 indicates pre-diabetes or diabetes.
The features encompass demographics like sex, age, education level, income, and health indicators such as BMI, blood pressure, cholesterol levels, smoking status, and physical activity, among others.

[Slide 5: Data Preprocessing]
Before building my models, I performed several data preprocessing steps:
Handling Missing Values: I addressed any missing data through appropriate imputation techniques to ensure data quality.
Categorical Variable Encoding: Categorical features were encoded using methods like One-Hot Encoding to make them suitable for machine learning algorithms.
Test-Train Split: The dataset was split into training and testing sets to evaluate model performance effectively.
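
As a rough illustration of the split step, here is a minimal sketch of a stratified train/test split using only the Python standard library. The 80/20 ratio, the stratification, the seed, and the helper name `stratified_split` are assumptions made for illustration; the talk does not state how the split was actually done, and libraries such as scikit-learn provide an equivalent via `train_test_split` with its `stratify` argument.

```python
import random
from collections import defaultdict

def stratified_split(rows, labels, test_fraction=0.2, seed=42):
    """Split rows into train/test so each class keeps roughly its original share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(rows[i], labels[i]) for i in train_idx]
    test = [(rows[i], labels[i]) for i in test_idx]
    return train, test

# Hypothetical toy data: 90 negatives and 10 positives (not the real CDC records).
rows = [[float(i)] for i in range(100)]
labels = [0] * 90 + [1] * 10
train, test = stratified_split(rows, labels)
n_pos_test = sum(1 for _, y in test if y == 1)
print(len(train), len(test), n_pos_test)
```

Stratifying matters most when one class is rare, because a purely random split could leave the test set with too few positive cases to evaluate recall reliably.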

[Slide 6: Addressing Class Imbalance with SMOTE]
One of the key challenges I faced was class imbalance—where the number of non-diabetic cases significantly outweighed diabetic cases. To tackle this, I employed SMOTE, which stands for Synthetic Minority Oversampling Technique. SMOTE generates synthetic samples for the minority class, helping the model learn to recognize diabetic cases more effectively.
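
To make the idea concrete, the following is a simplified sketch of the interpolation step at the heart of SMOTE, written with the standard library only. It is not the implementation used in the project (the talk does not name one), and the function name `smote`, the neighbour count `k`, and the toy data are all hypothetical. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical two-feature minority samples; illustrative values, not real data.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 2.6], [2.2, 1.9]]
new_points = smote(minority, n_synthetic=4, k=3)
print(new_points)
```

Oversampling of this kind is usually applied to the training split only, so that the test set keeps the real class distribution; whether the project did so is not stated here.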

[Slide 7: Model Selection and Training]
For my predictive modeling, I chose the Random Forest classifier due to its robustness and ability to handle feature interactions well. I trained two versions of the model:
Without SMOTE: To establish a baseline performance.
With SMOTE: To assess the impact of handling class imbalance on model performance.

[Slide 8: Results Without SMOTE]
Let’s look at the results without applying SMOTE:
Accuracy: 87%
Class 0 (No Diabetes): Precision 88%, Recall 98%, F1-Score 93%
Class 1 (Diabetes): Precision 57%, Recall 14%, F1-Score 23%
AUC Score: 0.77
As you can see, while the model performs excellently for non-diabetic cases, it struggles with diabetic case identification, with a recall of only 14%. This means most diabetic cases are being missed.
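
A small sketch may help show how high accuracy and low recall can coexist on imbalanced data. The confusion-matrix counts below are hypothetical, chosen only to resemble the pattern on this slide; they are not the project's actual confusion matrix, and `class_metrics` is an illustrative helper name.

```python
def class_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical counts for 1,000 people, 140 of whom have diabetes.
# Illustrative numbers only, not the confusion matrix from the talk.
accuracy, precision, recall, f1 = class_metrics(tp=20, fp=10, fn=120, tn=850)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Because the 860 negatives dominate the total, getting most of them right keeps accuracy high even though 120 of the 140 positive cases are missed.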

[Slide 9: Results With SMOTE]
After applying SMOTE, recall for diabetic cases improved substantially, at the cost of lower overall accuracy:
Accuracy: 79%
Class 0 (No Diabetes): Precision 91%, Recall 84%, F1-Score 87%
Class 1 (Diabetes): Precision 32%, Recall 48%, F1-Score 39%
AUC Score: 0.77
The recall for diabetic cases increased from 14% to 48%, meaning the model is now correctly identifying almost half of the diabetic individuals. However, this comes with a trade-off—a decrease in precision, leading to more false positives.
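
As a quick sanity check, the F1-scores reported for class 1 can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The sketch below uses only the figures from the two results slides; small differences from the reported values are expected because the inputs are already rounded to whole percentages.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 (precision, recall, reported F1) as stated on the two results slides.
reported = {
    "without SMOTE": (0.57, 0.14, 0.23),
    "with SMOTE": (0.32, 0.48, 0.39),
}

recomputed = {}
for name, (precision, recall, reported_f1) in reported.items():
    recomputed[name] = f1_score(precision, recall)
    print(f"{name}: recomputed F1={recomputed[name]:.3f}, "
          f"reported F1={reported_f1:.2f}")
```

Both recomputed values land within one percentage point of the reported ones, which is consistent with rounding.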

[Slide 10: Visual Comparison - Confusion Matrices and ROC Curves]
Here you can see a side-by-side comparison of the confusion matrices and ROC curves for both models. The model with SMOTE shows a reduction in false negatives, which is crucial for healthcare applications where missing diabetic cases can have severe consequences. The AUC is 0.77 for both models, which suggests that SMOTE shifted the balance between false negatives and false positives rather than changing the model's overall discriminatory power.

[Slide 11: Key Observations]
From my analysis, several key observations emerge:
Impact of Class Imbalance: Without addressing class imbalance, the model was biased towards the majority class, leading to poor recall for diabetic cases.
Trade-Off Between Precision and Recall: Applying SMOTE improved recall but decreased precision, highlighting the balance between reducing false negatives and managing false positives.
Feature Importance: Features like BMI, high blood pressure, and cholesterol levels were significant predictors of diabetes, aligning with clinical knowledge.

[Slide 12: Final Summary]
In summary, my Random Forest model identified non-diabetic individuals well but initially struggled with diabetic cases because of class imbalance. Applying SMOTE raised recall for diabetic cases from 14% to 48%, while precision fell from 57% to 32%. The AUC of 0.77 for both models indicates moderate overall effectiveness, suggesting room for further optimization.

[Slide 15: Conclusion]
In conclusion, my project successfully developed a machine learning model to predict diabetes status, addressing the critical challenge of class imbalance with SMOTE. While the model shows promise, especially in improving recall for diabetic cases, further enhancements are necessary to optimize its precision and overall effectiveness. Thank you for your attention!



