<a href="https://colab.research.google.com/github/Atoms919/ML_project_diabetes_health_indicator/blob/Tom/ML_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
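# Uncomment and run once if kagglehub is not installed
# (otherwise the import cell below raises ModuleNotFoundError):
# %pip install "kagglehub[pandas-datasets]"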

### Importing libraries and the dataset

In [None]:
import kagglehub
from kagglehub import KaggleDatasetAdapter
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix

ModuleNotFoundError: No module named 'kagglehub'

In [None]:
file_path = "diabetes_dataset.csv"

# Load the latest version
# (dataset_load replaces the deprecated load_dataset in recent kagglehub releases)
diabetes = kagglehub.dataset_load(
  KaggleDatasetAdapter.PANDAS,
  "mohankrishnathalla/diabetes-health-indicators-dataset",
  file_path,
  # Provide any additional arguments like
  # sql_query or pandas_kwargs. See the
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

diabetes.head()

### Analysing the data

Check structure: shape and data types

In [None]:
print("Shape : ", diabetes.shape)

diabetes.info()

Check for missing values

In [None]:
print(diabetes.isnull().sum())

Count the number of distinct values in each object-type column

In [None]:
unique_counts_object = diabetes.select_dtypes(include=['object']).nunique()

print("Number of unique values in object-type columns:")
print(unique_counts_object)

Convert the object-type columns into numerical columns with one-hot encoding

In [None]:
categorical_cols = ['gender', 'ethnicity', 'education_level', 'income_level', 'employment_status', 'smoking_status', 'diabetes_stage']

diabetes_numerical = pd.get_dummies(diabetes, columns=categorical_cols, drop_first=True)

print(diabetes_numerical.columns)
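
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the dataset; only the column name `gender` mirrors the notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; the real dataset has many more columns.
toy = pd.DataFrame({
    "age": [34, 51, 47],
    "gender": ["Female", "Male", "Other"],
})

# drop_first=True drops the first category in sorted order ("Female" here).
# That category becomes the implicit baseline: a row where every remaining
# gender_* dummy is 0/False belongs to it.
encoded = pd.get_dummies(toy, columns=["gender"], drop_first=True)

print(encoded.columns.tolist())
# ['age', 'gender_Male', 'gender_Other']
```

This is why names such as `gender_Male` and `gender_Other` appear in the encoded frame while `gender_Female` does not.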

Heatmap

In [None]:
corr = diabetes_numerical.corr()

plt.figure(figsize=(30, 15))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

Dropping columns that aren't useful for linear regression

In [None]:
columns_to_drop = [
    'education_level_Highschool',
    'education_level_No formal',
    'education_level_Postgraduate',
    'employment_status_Unemployed',
    'employment_status_Student',
    'employment_status_Retired',
    'income_level_Low',
    'income_level_Lower-Middle',
    'income_level_Middle',
    'income_level_Upper-Middle',
    'sleep_hours_per_day',
    'smoking_status_Never',
    'screen_time_hours_per_day',
    'alcohol_consumption_per_week',
    'ethnicity_Other',
    'ethnicity_White',
    'ethnicity_Hispanic',
    'ethnicity_Black',
    'gender_Male',
    'gender_Other',
    'hdl_cholesterol',
    'heart_rate',
    'smoking_status_Former',
    'diabetes_stage_No Diabetes',
    'diabetes_stage_Pre-Diabetes',
    'diabetes_stage_Type 1',
    'diabetes_stage_Type 2'
  ]

diabetes_numerical_Linear_Regression = diabetes_numerical.drop(columns_to_drop, axis=1)
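
The notebook does not state how the list above was chosen. One possible way to build such a list from the correlation matrix is sketched below on synthetic data. The column names, the data, and the `min_abs_corr` cut-off are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): one informative column, one pure-noise column.
rng = np.random.default_rng(0)
n = 5000
signal = rng.normal(size=n)
frame = pd.DataFrame({
    "strong_feature": signal + rng.normal(scale=0.5, size=n),
    "noise_feature": rng.normal(size=n),
    "target": (signal > 0).astype(int),
})

# Hypothetical cut-off: columns whose absolute correlation with the target
# falls below this value are treated as uninformative.
min_abs_corr = 0.1

# Absolute correlation of every feature with the target (target itself excluded).
target_corr = frame.corr()["target"].drop("target").abs()
weak_columns = target_corr[target_corr < min_abs_corr].index.tolist()

reduced = frame.drop(columns=weak_columns)
print("Dropped:", weak_columns)
print("Kept:   ", reduced.columns.tolist())
```

A correlation filter like this only captures linear, one-variable-at-a-time relationships, so it is a rough screening step rather than a full feature-selection method.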

New heatmap

In [None]:
corr = diabetes_numerical_Linear_Regression.corr()

plt.figure(figsize=(15, 10))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

### Linear Regression

In [None]:
predictors = [col for col in diabetes_numerical_Linear_Regression.columns if col != 'diagnosed_diabetes']
X = diabetes_numerical_Linear_Regression[predictors]
y = diabetes_numerical_Linear_Regression['diagnosed_diabetes']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)
print("R^2 score:", lr.score(X_test, y_test))

threshold = 0.5
y_pred_binary = (y_pred >= threshold).astype(int)

cm = confusion_matrix(y_test, y_pred_binary)

tn, fp, fn, tp = cm.ravel()
print("\nConfusion Matrix:\n", cm)

precision = precision_score(y_test, y_pred_binary)
recall = recall_score(y_test, y_pred_binary)
f1 = f1_score(y_test, y_pred_binary)
accuracy = accuracy_score(y_test, y_pred_binary)

print(f"\nPrecision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")
print(f"Accuracy:  {accuracy:.2f}")

**Conclusion**

The linear regression model explains about 46.4% of the variance in the `diagnosed_diabetes` outcome (R² ≈ 0.464), which is moderate. For a complex health outcome like diabetes, which is influenced by many factors, this is a reasonable starting result.
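
For reference, the R² value reported by `lr.score` is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand; the helper `r_squared` and the four data points are hypothetical and unrelated to the notebook's results.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after using the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical binary outcomes and continuous predictions.
y_true = [0, 1, 1, 0]
y_pred = [0.25, 0.75, 0.75, 0.25]

print(r_squared(y_true, y_pred))  # 0.75
```

An R² of 0.75 here means the predictions remove 75% of the squared error that a constant mean prediction would make.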

With a 0.5 threshold on its predictions, the model reaches a precision of 0.87 and a recall of 0.89. It therefore identifies most true diabetes cases while keeping false positives relatively rare among its positive predictions. Both matter in healthcare: low recall means missed diagnoses, and low precision means unnecessary interventions.
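
The relationship between these scores and the confusion-matrix counts (`tn`, `fp`, `fn`, `tp`) can be sketched as follows. The function name and the counts are hypothetical and are not the notebook's actual results.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tn=35, fp=5, fn=15, tp=45)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The argument order matches the order in which `confusion_matrix(...).ravel()` returns the counts for a binary problem.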

Although precision and recall are already high, a more complex model could improve them further, since a substantial part of the variability remains unexplained.
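
As one possible next step (a suggestion, not something the notebook does), a classifier such as logistic regression could be swapped in for `LinearRegression`, since it models the probability of the binary outcome directly. The sketch below uses synthetic data from `make_classification` as a stand-in for the diabetes features, so its scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 8 numeric features, binary target.
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# predict_proba returns class probabilities; column 1 is P(class == 1).
proba = clf.predict_proba(X_te)[:, 1]

# Same 0.5 threshold as in the linear regression cell.
y_hat = (proba >= 0.5).astype(int)

print(f"Precision: {precision_score(y_te, y_hat):.2f}")
print(f"Recall:    {recall_score(y_te, y_hat):.2f}")
```

Unlike linear regression outputs, these probabilities always lie between 0 and 1, which makes the choice of threshold easier to interpret.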