
### 2. Data Loading and Preprocessing

**Data Loading:**

- **Loading the Data:** Add code and explanations for loading the dataset into your working environment.

**Data Preprocessing:**

- **Handling Missing Values:** Describe how you dealt with any missing or null values in the dataset.
- **Data Types and Encoding:** Explain any conversions or encoding performed on categorical variables.
- **Feature Scaling:** If applicable, mention any scaling or normalization techniques applied to the features.
- **Class Imbalance:** Discuss whether the target variable is imbalanced and how you addressed it (e.g., resampling techniques like SMOTE).

### 3. Exploratory Data Analysis (EDA)

**Statistical Summaries:**

- Provide descriptive statistics for the dataset (mean, median, mode, standard deviation).
- Discuss any noteworthy observations from the data summary.

**Visualizations:**

- **Univariate Analysis:** Histograms or box plots for individual features to understand their distributions.
- **Bivariate Analysis:** Scatter plots or correlation heatmaps to explore relationships between features.
- **Target Variable Distribution:** Analyze the distribution of the target variable to check for class imbalance.

**Insights:**

- Highlight any significant patterns or anomalies discovered during EDA that could impact model performance or require further investigation.

### 4. Model Building and Training

**Model Selection:**

- **Justification:** Explain why you chose the Random Forest classifier over other algorithms.
- **Algorithm Explanation:** Provide a brief overview of how Random Forest works.

**Hyperparameter Tuning:**

- Detail the hyperparameters you tuned and the rationale behind choosing specific values.
- Include code snippets or tables showing the results of grid search or cross-validation.

**Training Process:**

- Discuss the train-test split strategy, including the proportion of data used for training and testing.
- Mention any cross-validation techniques used to ensure model robustness.

### 5. Results

**Performance Metrics:**

- **Accuracy, Precision, Recall, F1-Score:** Provide these metrics to give a comprehensive evaluation of model performance.
- **Confusion Matrix:** Include a confusion matrix to visualize true vs. predicted classifications.
- **ROC Curve and AUC:** Present the ROC curve and discuss the AUC score to evaluate the model's discriminative ability.

**Feature Importance:**

- Include a plot showing feature importances as determined by the Random Forest model.
- Discuss how these features contribute to the model's predictions.

**Validation:**

- If possible, validate your model on a separate validation set or through cross-validation to assess its generalizability.

### 6. Discussion and Conclusion

**Interpretation of Results:**

- **Deep Dive into Features:** Elaborate on how the top features contribute to diabetes risk based on domain knowledge.
- **Comparison with Literature:** Compare your findings with existing research to validate your results.



# Predicting Diabetes Status Using CDC Diabetes Health Indicators

## Table of Contents
1. [Problem Description](#1.Problem-Description)
2. [Data Loading and Preprocessing](#Data-Loading-and-Preprocessing)
3. [Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis-EDA)
4. [Model Building and Training](#4.Model-Building-and-Training)
5. [Results](#Results)
6. [Discussion and Conclusion](#Discussion-and-Conclusion)


---

# 1. Problem Description

Diabetes is a chronic condition that affects millions of individuals worldwide. Early prediction and intervention can significantly improve patient outcomes and reduce healthcare costs. This project aims to build a supervised machine learning model to predict whether an individual has diabetes, is pre-diabetic, or healthy based on various health indicators and lifestyle factors.

### Objectives
- **Predictive Modeling**: Develop a model to classify individuals into diabetic, pre-diabetic, or healthy categories.
- **Feature Importance**: Identify key health indicators that contribute most to diabetes prediction.
- **Model Evaluation**: Assess the performance of the model using appropriate metrics.

### Dataset Overview
The [CDC Diabetes Health Indicators Dataset](https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators) contains healthcare statistics and lifestyle survey information about individuals, including demographics, lab test results, and survey responses related to health behaviors.

- **Number of Instances**: 253,680
- **Number of Features**: 21
- **Target Variable**: `Diabetes_binary` (0 = No Diabetes, 1 = Prediabetes or Diabetes)
- **Features Include**:
  - Demographics: Sex, Age, Education Level, Income
  - Health Indicators: BMI, High Blood Pressure, High Cholesterol, Smoking Status, Physical Activity, etc.
    - ['ID', 'Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI',
       'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits',
       'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost',
       'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age',
       'Education', 'Income']
---


# 2. Data Loading and Preprocessing

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from io import StringIO

# Machine Learning Libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Set visualizations style
sns.set(style="whitegrid")



In [2]:
!pip3 install -U ucimlrepo 

[33mDEPRECATION: beakerx-base 2.2.0 has a non-standard dependency specifier ipywidgets<8pandas,>=7.5.1. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of beakerx-base or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063[0m[33m
[0m

In [3]:
# Load Dataset
data_url = 'https://archive.ics.uci.edu/static/public/891/data.csv'
response = requests.get(data_url)
data = pd.read_csv(StringIO(response.text))

# Display first few rows
data.head()


Unnamed: 0,ID,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0,0,1,1,1,40,1,0,0,0,...,1,0,5,18,15,1,0,9,4,3
1,1,0,0,0,0,25,1,0,0,1,...,0,1,3,0,0,0,0,7,6,1
2,2,0,1,1,1,28,0,0,0,0,...,1,1,5,30,30,1,0,9,4,8
3,3,0,1,0,1,27,0,0,0,1,...,1,0,2,0,0,0,0,11,3,6
4,4,0,1,1,1,24,0,0,0,1,...,1,0,2,3,0,0,0,11,5,4


---

# 3. Exploratory Data Analysis (EDA)

### Understanding the Data

In [4]:
# Display dataset information
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 23 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   ID                    253680 non-null  int64
 1   Diabetes_binary       253680 non-null  int64
 2   HighBP                253680 non-null  int64
 3   HighChol              253680 non-null  int64
 4   CholCheck             253680 non-null  int64
 5   BMI                   253680 non-null  int64
 6   Smoker                253680 non-null  int64
 7   Stroke                253680 non-null  int64
 8   HeartDiseaseorAttack  253680 non-null  int64
 9   PhysActivity          253680 non-null  int64
 10  Fruits                253680 non-null  int64
 11  Veggies               253680 non-null  int64
 12  HvyAlcoholConsump     253680 non-null  int64
 13  AnyHealthcare         253680 non-null  int64
 14  NoDocbcCost           253680 non-null  int64
 15  GenHlth               253680 non-n

In [6]:
data.columns

Index(['ID', 'Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI',
       'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits',
       'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost',
       'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age',
       'Education', 'Income'],
      dtype='object')

In [None]:
# Summary statistics
data.describe()


### Checking for Missing Values

In [None]:
# Check for missing values
missing_values = data.isnull().sum()
missing_values[missing_values > 0]


### Distribution of Target Variable

In [None]:
# Distribution of the target variable
plt.figure(figsize=(6,4))
sns.countplot(x='Diabetes_binary', data=data)
plt.title('Distribution of Diabetes Status')
plt.xlabel('Diabetes Binary (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()


### Correlation Analysis

In [None]:
# Compute correlation matrix
corr_matrix = data.corr()

# Plot heatmap
plt.figure(figsize=(12,10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()


#### 1. **Target Variable (`Diabetes_binary`) Correlations**
   - **Strong Positive Correlations**:
     - `HighBP` (High Blood Pressure) shows a moderate correlation with `Diabetes_binary` (~0.26). This suggests individuals with high blood pressure are more likely to have diabetes.
     - `HighChol` (High Cholesterol) also exhibits a moderate positive correlation with `Diabetes_binary` (~0.22), indicating a potential link between high cholesterol levels and diabetes.
     - `GenHlth` (General Health) correlates positively (~0.29). Poor general health appears to be a strong indicator of diabetes.

   - **Moderate Negative Correlations**:
     - `PhysActivity` (Physical Activity) has a weak negative correlation (~-0.13). Individuals with higher physical activity levels tend to have a lower likelihood of diabetes.
     - `Income` correlates negatively (~-0.17), suggesting that higher income levels may be associated with a reduced likelihood of diabetes.

#### 2. **Strong Inter-Feature Correlations**
   - **`PhysHlth` (Physical Health) and `GenHlth`**: Correlation is very high (~0.52). This indicates a strong relationship between perceived general health and the physical health days reported by respondents.
   - **`Age` and `HighBP`**: A moderately strong correlation (~0.34) suggests that older individuals are more likely to have high blood pressure.
   - **`Age` and `Diabetes_binary`**: Correlation (~0.18) suggests that age is a factor influencing diabetes, albeit weaker than some health indicators.

#### 3. **Features with Weak or No Correlation**
   - Variables like `Smoker`, `Veggies`, and `Fruits` exhibit minimal or near-zero correlation with `Diabetes_binary`, suggesting these lifestyle factors may have limited direct influence in this dataset.
   - `AnyHealthcare` and `HvyAlcoholConsump` also have weak correlations, indicating healthcare access or heavy alcohol consumption might not strongly predict diabetes in this dataset.

#### 4. **Insights on Multicollinearity**
   - Some features show significant correlations with each other, such as `PhysHlth` and `GenHlth` or `Age` and `HighBP`. These relationships may lead to multicollinearity issues, which could impact model performance if not addressed (e.g., through feature selection or dimensionality reduction).

### Feature Distributions

In [None]:
# Distribution of numerical features
numerical_features = ['BMI', 'Age', 'GenHlth', 'MentHlth', 'PhysHlth']

data[numerical_features].hist(bins=30, figsize=(15,10))
plt.suptitle('Distribution of Numerical Features')
plt.show()


From the visualization of the numerical feature distributions, I have the following comments:

#### **BMI (Body Mass Index)**
   - The distribution of BMI is unimodal and slightly right-skewed, with most values concentrated between 20 and 40.
   - There are a few outliers with BMI exceeding 60, which might represent individuals with severe obesity.
   - The skewness suggests that the majority of individuals have a healthy BMI or are moderately overweight.

#### **Age**
   - Age appears to be evenly distributed across the predefined age categories, with slightly higher frequencies in the middle-age groups (values 8–10 in this representation, corresponding to age ranges of 40–59 based on dataset coding).
   - The distribution suggests a good representation of both younger and older individuals, which helps in building a generalized model.
   - Fewer samples in the youngest and oldest age brackets may slightly reduce their predictive power.

#### **GenHlth (General Health Rating)**
   - The distribution of `GenHlth` is skewed toward lower values (better health ratings), with a significant concentration at values 1 and 2.
   - This indicates that most individuals report their health as "good" or "very good," with relatively few reporting "poor" general health (value 5).
   - The skewness highlights a potential imbalance in health ratings, which might influence how this feature correlates with diabetes.

#### **MentHlth (Mental Health)**
   - The distribution of `MentHlth` is heavily right-skewed, with the majority of individuals reporting few or no days of poor mental health in the past month (value near 0).
   - A small proportion of individuals reported significantly worse mental health (values closer to 30 days), representing a minority of the sample.
   - This feature has a long tail and a concentration near 0, which suggests it may require normalization or transformation for effective modeling.

#### **PhysHlth (Physical Health)**
   - Similar to `MentHlth`, the distribution of `PhysHlth` is highly skewed to the right, with most individuals reporting very few days of poor physical health.
   - There is a noticeable concentration at higher values (near 30 days), likely reflecting individuals with chronic health issues or disabilities.
   - This feature also exhibits a long tail, which might require scaling or special handling to ensure its impact is proportional during model training.

---

#### General Insights:
 **Skewness**:
   - `MentHlth` and `PhysHlth` are highly right-skewed, indicating the presence of a large proportion of healthy individuals with occasional outliers representing worse health.

**Diversity in Data**:
   - The `BMI` and `Age` features display broader distributions, which indicate diverse representation across different categories.

**Outliers**:
   - Both `BMI` and `PhysHlth` show extreme values (e.g., BMI > 60 or PhysHlth = 30).



### Categorical Features Analysis

In [None]:
# Categorical features
categorical_features = ['HighBP', 'HighChol', 'Smoker', 'Stroke', 
                        'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 
                        'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 
                        'NoDocbcCost', 'DiffWalk', 'Sex', 'Education', 'Income']

plt.figure(figsize=(20, 15))
for idx, feature in enumerate(categorical_features):
    plt.subplot(5, 4, idx+1)
    sns.countplot(x=feature, hue='Diabetes_binary', data=data)
    plt.title(f'Diabetes Status by {feature}')
    plt.legend(title='Diabetes Binary')
plt.tight_layout()
plt.show()


Based on the categorical feature analysis above

---

#### **1. HighBP (High Blood Pressure)**
- A significantly larger proportion of individuals with diabetes (`Diabetes_binary = 1`) have high blood pressure (`HighBP = 1`) compared to those without diabetes.
- This indicates a strong association between high blood pressure and diabetes.

---

#### **2. HighChol (High Cholesterol)**
- A higher proportion of individuals with diabetes have high cholesterol (`HighChol = 1`).

---

#### **3. Smoker**
- There is no clear difference in the proportions of smokers (`Smoker = 1`) between individuals with and without diabetes.
- This suggests that smoking might not be a significant predictor of diabetes in this dataset.

---

#### **4. Stroke**
- A higher proportion of individuals who have experienced a stroke (`Stroke = 1`) also have diabetes.

---

#### **5. HeartDiseaseorAttack**
- A higher proportion of individuals with diabetes have a history of heart disease or attack (`HeartDiseaseorAttack = 1`).

---

#### **6. PhysActivity (Physical Activity)**
- Individuals without diabetes (`Diabetes_binary = 0`) are more likely to report regular physical activity (`PhysActivity = 1`).

---

#### **7. Fruits and Veggies**
- The distributions of fruit consumption (`Fruits = 1`) and vegetable consumption (`Veggies = 1`) show little variation between individuals with and without diabetes.
- This might suggest that these factors, have a limited direct impact on diabetes prediction.

---

#### **8. HvyAlcoholConsump (Heavy Alcohol Consumption)**
- Heavy alcohol consumption (`HvyAlcoholConsump = 1`) is rare in both diabetic and non-diabetic groups.
- This feature may have limited predictive value for this dataset.

---

#### **9. AnyHealthcare**
- Almost everyone reports having healthcare coverage (`AnyHealthcare = 1`), making this feature unlikely to be a strong predictor.

---

#### **10. NoDocbcCost (Couldn’t See a Doctor Due to Cost)**
- Individuals with diabetes are more likely to report not being able to see a doctor due to cost (`NoDocbcCost = 1`).
- This suggests that economic barriers to healthcare access may be associated with diabetes prevalence.

---

#### **11. DiffWalk (Difficulty Walking)**
- A large proportion of individuals with diabetes report difficulty walking or climbing stairs (`DiffWalk = 1`).
- This feature shows a clear association with diabetes.

---

#### **12. Sex**
- There is no significant difference in diabetes prevalence between males (`Sex = 1`) and females (`Sex = 0`).

---

#### **13. Education**
- Lower education levels (e.g., `Education = 1` or `2`) are more common among individuals with diabetes compared to higher education levels.
- This highlights a potential socioeconomic factor in diabetes prevalence.

---

#### **14. Income**
- Individuals with higher income levels (`Income = 6`, `7`, `8`) are less likely to have diabetes compared to those with lower income.
- This supports the idea that economic factors and access to resources may influence diabetes risk.

---

#### General Observations:
- **Strong Predictors**:
  - `HighBP`, `HighChol`, `DiffWalk`, `HeartDiseaseorAttack`, and `PhysActivity` show clear differences between the diabetic and non-diabetic groups, making them likely strong predictors.
- **Weak Predictors**:
  - Features like `Smoker`, `Fruits`, `Veggies`, and `HvyAlcoholConsump` exhibit little variation between the groups and may have limited predictive power.
- **Socioeconomic Factors**:
  - Features like `Education`, `Income`, and `NoDocbcCost` suggest that socioeconomic status plays a role in diabetes prevalence.


___

# 4. Model Building and Training 

### Data Preprocessing

In [None]:
# Drop ID column as it's not useful for modeling
data.drop('ID', axis=1, inplace=True)

# Encode categorical variables
label_encoder = LabelEncoder()
for column in categorical_features:
    data[column] = label_encoder.fit_transform(data[column])

# Feature and Target Separation
X = data.drop('Diabetes_binary', axis=1)
y = data['Diabetes_binary']

# Feature Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


### Train-Test Split

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")


### Handling Class Imbalance

In [None]:
# Check class distribution
sns.countplot(x=y_train)
plt.title('Training Set Class Distribution')
plt.show()



### Model Selection and Training

In [None]:
# Initialize Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Hyperparameter Tuning using Grid Search
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2, scoring='accuracy')

# Fit Grid Search
grid_search.fit(X_train, y_train)

# Best Parameters
print(f"Best Parameters: {grid_search.best_params_}")


### Final Model Training

In [None]:
# Best estimator from Grid Search
best_rf = grid_search.best_estimator_

# Train the model
best_rf.fit(X_train, y_train)


---

# 5. Results

### Model Performance on Test Set

In [None]:
# Predictions
y_pred = best_rf.predict(X_test)
y_pred_proba = best_rf.predict_proba(X_test)[:,1]

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {accuracy:.2f}")

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


### Confusion Matrix

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Diabetes', 'Diabetes'], yticklabels=['No Diabetes', 'Diabetes'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()


### ROC Curve and AUC

In [None]:
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc_score = roc_auc_score(y_test, y_pred_proba)

plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f'Random Forest (AUC = {auc_score:.2f})')
plt.plot([0,1], [0,1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()


### Feature Importance

In [None]:
# Feature Importance
importances = best_rf.feature_importances_
feature_names = X.columns
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False).head(10)

# Plot
plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df, palette='viridis')
plt.title('Top 10 Feature Importances')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.show()


---

# 6. Discussion and Conclusion

The Random Forest classifier achieved an accuracy of **{accuracy:.2f}** on the test set, indicating a strong ability to distinguish between individuals with and without diabetes based on the provided health indicators.

### Key Findings

- **Top Features**: The most significant features contributing to diabetes prediction include:
  - **BMI**: Higher Body Mass Index is strongly associated with diabetes.
  - **HighBP**: Presence of high blood pressure increases diabetes risk.
  - **HighChol**: High cholesterol levels are indicative of diabetes.
  - **Age**: Older age groups show a higher prevalence of diabetes.
  - **GenHlth**: General health status correlates with diabetes status.
  
- **Model Performance**: The model exhibits high accuracy and a robust AUC score, demonstrating its effectiveness in predicting diabetes status.

### Recommendations for Improvement

- **Hyperparameter Tuning**: Further fine-tuning of hyperparameters using more extensive grid search or randomized search could enhance model performance.
  
- **Feature Engineering**: Creating new features or transforming existing ones (e.g., interaction terms) might capture more complex relationships.
  
- **Alternative Models**: Exploring other algorithms such as Support Vector Machines (SVM), Gradient Boosting Machines (e.g., XGBoost), or Neural Networks could potentially yield better performance.
  
- **Handling Imbalanced Data**: Although the dataset is large, ensuring balanced classes through techniques like SMOTE could be beneficial, especially if class imbalance is detected in specific subsets.

### Conclusion

This supervised learning approach effectively utilizes health indicators to predict diabetes likelihood. The Random Forest model not only provides high accuracy but also offers insights into the most influential factors affecting diabetes risk. Implementing the recommended improvements can further enhance the model's predictive capabilities and reliability.

---