# Predicting HbA1c Levels in Hispanic Males
This notebook builds a supervised regression model to predict HbA1c levels using demographic and health features. The model is trained specifically on Hispanic males.

In [None]:
!pip install kagglehub pandas scikit-learn matplotlib seaborn --quiet
import kagglehub
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


In [None]:
# Download dataset from KaggleHub
path = kagglehub.dataset_download("marshalpatel3558/diabetes-prediction-dataset")
print("Path to dataset files:", path)


In [None]:
# Load the CSV file (adjust the filename as needed)
df = pd.read_csv(f"{path}/diabetes_prediction_dataset.csv")
df.head()


In [None]:
# Filter for Hispanic males only
df = df[(df['gender'] == 'Male') & (df['ethnicity'].str.contains("Hispanic", case=False, na=False))]
df = df.dropna(subset=['HbA1c_level'])  # Ensure target is available
df.head()
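
One thing worth checking in the filter above: `str.contains("Hispanic")` is a substring match, so a label such as "Non-Hispanic White" would also pass. Whether the dataset actually uses labels like that is not known here. The toy values below are made up for illustration, and running `df['ethnicity'].unique()` on the real data before filtering would settle it. A minimal sketch comparing the substring match with a stricter prefix match:

```python
import pandas as pd

# Toy data with made-up labels; the real column values may differ.
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Male"],
    "ethnicity": ["Hispanic", "Non-Hispanic White", "hispanic or latino", "Hispanic", None],
})

is_male = toy["gender"] == "Male"

# Substring match, as in the cell above: also keeps "Non-Hispanic White".
substring_mask = is_male & toy["ethnicity"].str.contains("Hispanic", case=False, na=False)

# Stricter alternative: the label must start with "hispanic"
# after normalising whitespace and case. Missing labels are excluded.
normalized = toy["ethnicity"].str.strip().str.lower()
prefix_mask = is_male & normalized.str.startswith("hispanic", na=False)

substring_rows = toy[substring_mask]
prefix_rows = toy[prefix_mask]

print("Substring match keeps:", substring_rows["ethnicity"].tolist())
print("Prefix match keeps:   ", prefix_rows["ethnicity"].tolist())
```

If the real column only contains plain labels such as "Hispanic", both masks select the same rows and the original filter is fine as written.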


In [None]:
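# Select features and target
features = ['age', 'bmi', 'physical_activity']
# Drop rows with missing feature values; LinearRegression cannot fit on NaN
df = df.dropna(subset=features)
X = df[features]
y = df['HbA1c_level']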

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")
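
MSE on its own is hard to read because it is in squared HbA1c units and has no reference point. Two common companions are the RMSE (same units as the target) and a comparison against a baseline that always predicts the mean, which is what R² summarises. The sketch below works these out by hand on made-up numbers, not on the dataset. The helper `mse` and all values are hypothetical. In the notebook itself, `sklearn.metrics.r2_score(y_test, y_pred)` would give the R² directly.

```python
import math
from statistics import mean

# Made-up HbA1c values (in %), purely for illustration.
y_true = [5.0, 6.0, 7.0, 8.0]
y_hat = [5.5, 6.0, 6.5, 8.0]


def mse(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    return mean([(a - p) ** 2 for a, p in zip(actual, predicted)])


# Error of the (hypothetical) model predictions.
mse_model = mse(y_true, y_hat)
rmse_model = math.sqrt(mse_model)  # back in the units of the target

# Baseline: always predict the mean of the actual values.
baseline_pred = [mean(y_true)] * len(y_true)
mse_baseline = mse(y_true, baseline_pred)

# R^2 is the fraction of the baseline error that the model removes.
r2 = 1 - mse_model / mse_baseline

print(f"MSE:  {mse_model:.3f}")
print(f"RMSE: {rmse_model:.3f}")
print(f"Baseline MSE: {mse_baseline:.3f}")
print(f"R^2:  {r2:.3f}")
```

An R² near 0 would mean the model does little better than predicting the average HbA1c for everyone, regardless of how small the MSE looks.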


In [None]:
# Plot predictions vs actual values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.xlabel("Actual HbA1c Level")
plt.ylabel("Predicted HbA1c Level")
plt.title("Predicted vs Actual HbA1c for Hispanic Males")
# Reference line y = x: points on this line are perfect predictions
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.show()


## Reflection

This notebook demonstrates a basic approach to using supervised learning for predicting HbA1c levels among Hispanic males. It highlights:

- Importance of filtering data to target a specific demographic
- Use of linear regression with three features for interpretability
- Evaluation using MSE for quantitative model assessment
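
The interpretability point can be made concrete by reading the fitted coefficients: each one is the expected change in the prediction per unit change in that feature, with the others held fixed. The sketch below uses synthetic data with a made-up relationship, not the Kaggle dataset. It also assumes `physical_activity` is numeric, which the notebook does not confirm. On the real model, the same idea applies to `model.coef_` and `model.intercept_`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features; ranges are arbitrary and only for illustration.
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "age": rng.uniform(20, 80, n),
    "bmi": rng.uniform(18, 40, n),
    "physical_activity": rng.uniform(0, 10, n),
})

# Made-up linear relationship plus a little noise.
y_demo = (
    4.0
    + 0.02 * X_demo["age"]
    + 0.05 * X_demo["bmi"]
    - 0.03 * X_demo["physical_activity"]
    + rng.normal(0, 0.1, n)
)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its feature name for easy reading.
coefs = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefs)
print("Intercept:", demo_model.intercept_)
```

Because the features are on different scales, coefficient sizes are not directly comparable as measures of importance unless the features are standardised first.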

With further validation, a model like this could support early intervention efforts and better health outcomes for underserved communities.
