# Predicting HbA1c Levels in Hispanic Males
This notebook builds a supervised regression model that predicts HbA1c levels from age, BMI, and physical activity. The model is trained and evaluated only on the rows for Hispanic males.

In [1]:
%pip install kagglehub pandas scikit-learn matplotlib seaborn
import kagglehub
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


Note: you may need to restart the kernel to use updated packages.


In [3]:
# Download dataset from KaggleHub
path = kagglehub.dataset_download("marshalpatel3558/diabetes-prediction-dataset")
print("Path to dataset files:", path)


Downloading from https://www.kaggle.com/api/v1/datasets/download/marshalpatel3558/diabetes-prediction-dataset?dataset_version_number=1...


100%|██████████| 319k/319k [00:00<00:00, 13.3MB/s]

Extracting files...
Path to dataset files: /home/codespace/.cache/kagglehub/datasets/marshalpatel3558/diabetes-prediction-dataset/versions/1





In [5]:
import os

# List the files in the download directory to find the CSV
files = os.listdir(path)
print("Files in the dataset directory:", files)

# Pick the CSV from the listing rather than hardcoding its name
csv_file = [file for file in files if file.endswith('.csv')][0]
df = pd.read_csv(os.path.join(path, csv_file))
df.head()


Files in the dataset directory: ['diabetes_dataset.csv']


FileNotFoundError: [Errno 2] No such file or directory: '/home/codespace/.cache/kagglehub/datasets/marshalpatel3558/diabetes-prediction-dataset/versions/1/diabetes_prediction_dataset.csv'

In [None]:
# Filter for Hispanic males only
df = df[(df['gender'] == 'Male') & (df['ethnicity'].str.contains("Hispanic", case=False, na=False))]
df = df.dropna(subset=['HbA1c_level'])  # Ensure target is available
df.head()
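
The filter above assumes the CSV contains columns named `gender`, `ethnicity`, and `HbA1c_level`. A minimal sketch of a guard that checks for those columns before filtering is shown here; `filter_subgroup` and `REQUIRED_COLUMNS` are hypothetical names, and the `toy` frame is synthetic data for illustration, not rows from the Kaggle dataset.

```python
import pandas as pd

# Column names taken from the filter cell above; the real CSV may use different ones
REQUIRED_COLUMNS = ["gender", "ethnicity", "HbA1c_level"]


def filter_subgroup(frame, gender="Male", ethnicity_pattern="Hispanic"):
    """Hypothetical helper: validate columns, then return matching rows as a copy."""
    missing = [col for col in REQUIRED_COLUMNS if col not in frame.columns]
    if missing:
        raise KeyError(f"Missing columns: {missing}; available: {list(frame.columns)}")
    mask = (frame["gender"] == gender) & frame["ethnicity"].str.contains(
        ethnicity_pattern, case=False, na=False
    )
    # Drop rows without a target value and copy to avoid chained-assignment warnings
    return frame.loc[mask].dropna(subset=["HbA1c_level"]).copy()


# Synthetic rows for illustration only
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Male"],
    "ethnicity": ["Hispanic", "Hispanic", "Asian", "hispanic"],
    "HbA1c_level": [5.6, 6.1, 5.9, None],
})
subset = filter_subgroup(toy)
print(len(subset), "row(s) remain after filtering")
```

Failing early with the list of available columns makes a naming mismatch easier to diagnose than a bare `KeyError` raised from inside the boolean mask.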


In [None]:
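# Select features and target
features = ['age', 'bmi', 'physical_activity']
# LinearRegression cannot fit rows with missing values, so drop them first
df = df.dropna(subset=features)
X = df[features]
y = df['HbA1c_level']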

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")
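
An MSE value is easier to interpret next to a baseline. The sketch below compares a linear model against a predictor that always outputs the training mean, and also reports RMSE (in the same units as the target) and R². It runs on synthetic data generated here purely for illustration; the `_demo` names and the coefficients are assumptions, not values from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for three numeric features and an HbA1c-like target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.5 + 0.4 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The baseline ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

results = {}
for name, estimator in [("baseline", baseline), ("linear", linear)]:
    pred = estimator.predict(X_te)
    mse_val = mean_squared_error(y_te, pred)
    results[name] = {
        "mse": mse_val,
        "rmse": float(np.sqrt(mse_val)),
        "r2": r2_score(y_te, pred),
    }
    print(f"{name}: MSE={mse_val:.3f}  RMSE={results[name]['rmse']:.3f}  R2={results[name]['r2']:.3f}")
```

If the linear model's MSE is not clearly below the baseline's, the chosen features carry little signal for the target.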


In [None]:
# Plot predictions vs actual values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.xlabel("Actual HbA1c Level")
plt.ylabel("Predicted HbA1c Level")
plt.title("Predicted vs Actual HbA1c for Hispanic Males")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.show()


## Reflection

This notebook demonstrates a basic approach to predicting HbA1c levels among Hispanic males with supervised learning. In outline, it covers:

- Filtering the data to a specific demographic so the model is fit to that group alone
- Using linear regression because it is simple to apply and its coefficients are easy to inspect
- Evaluating predictions with the mean squared error (MSE)

Once validated, a model like this could help support early intervention efforts and better health outcomes for minority communities.
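
Filtering to one demographic can leave relatively few rows, so the MSE from a single 80/20 split may vary a lot with the random seed. One possible refinement, sketched here on synthetic data, is k-fold cross-validation, which averages the MSE over several splits. The `_cv` names, the sample size of 60, and the coefficients are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a small filtered subgroup (60 rows, 3 numeric features)
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(60, 3))
y_cv = 5.5 + 0.3 * X_cv[:, 0] + rng.normal(scale=0.2, size=60)

# Shuffle before splitting so each fold mixes rows from across the file
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports negated MSE so that larger scores are always better
neg_mse = cross_val_score(
    LinearRegression(), X_cv, y_cv, cv=folds, scoring="neg_mean_squared_error"
)
fold_mse = -neg_mse

print("MSE per fold:", np.round(fold_mse, 3))
print(f"Mean MSE: {fold_mse.mean():.3f} (std {fold_mse.std():.3f})")
```

The spread across folds gives a rough sense of how much a single-split MSE could move by chance.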
