<a href="https://colab.research.google.com/github/SanchezJoseAntonio/Prediction_risk_diabetes/blob/main/Prediction_risk_diabetes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import kagglehub
from kagglehub import KaggleDatasetAdapter
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.inspection import permutation_importance
from lightgbm import LGBMRegressor

In [2]:
file_path = "diabetes.csv"
# Load the latest version
df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "uciml/pima-indians-diabetes-database",
  file_path,
)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

Some of these values are biologically impossible. The minimum is 0 for SkinThickness and BloodPressure, among other columns, which means that at least one observation records a value of 0 for each of them. These values are clearly incorrect.


In [None]:
df.duplicated().sum()

In [None]:
sns.boxplot(df["BloodPressure"]);

In [None]:
df[df["BloodPressure"] < 40].head(5)

It's clear that missing values are simply encoded as 0. I'll replace them with NaN so they are easier to find and handle. I will update the continuous columns whose minimum is 0.

In [None]:
cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[cols]=df[cols].replace(0, np.nan)

In [None]:
df.isna().sum()

The number of missing values for skin thickness and insulin is very high. However, insulin is closely related to diabetes, and skin thickness is closely related to obesity, a major risk factor for diabetes. Therefore, I will not drop these features.


On the other hand, missingness can be informative (e.g., the test may or may not have been performed for a reason). This is why, for insulin and skin thickness, I will add a column flagging whether the value was missing. This ensures that the information remains available to the algorithm later on, even after the missing values have been handled.

In [None]:
df["Insulin_NA"]=df['Insulin'].isna().astype('category')
df["SkinThickness_NA"]=df['SkinThickness'].isna().astype('category')
df["Missing_Total"]=df.isna().sum(axis=1)
df["Missing_Total"].sort_values(ascending=False)

In [None]:
df.drop(df.index[df["Missing_Total"] > 3], axis=0, inplace=True)  # Delete rows with more than 3 missing values
df.drop("Missing_Total",axis=1,inplace=True)
df.shape

In [None]:
df2 = df.copy()

In [None]:
df.head()

In [None]:
sns.pairplot(df, hue="Outcome");

# Handling missing data: replacing by median
I will replace the missing values with the median of each column. I will first split the dataset into training and test sets and compute the medians on the training set only, so there is no data leakage.

In [None]:
X = df.drop(labels="Outcome",axis=1)
y = df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=42, stratify=y)


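cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Work on explicit copies so the column assignments below are unambiguous
X_train, X_test = X_train.copy(), X_test.copy()

for col in cols:
  median = X_train[col].median()  # Computed on the training data only
  X_train[col] = X_train[col].fillna(median)
  X_test[col] = X_test[col].fillna(median)  # Impute the test set with the training median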

In [None]:
X_test.isna().sum()
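The same train-only median imputation can also be written with scikit-learn's `SimpleImputer`. The sketch below is an alternative to the loop above, not the approach this notebook relies on. The toy frame, its made-up values, and the names `toy`, `imputer`, `train_filled` and `test_filled` are assumptions introduced purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame standing in for two of the diabetes features (values are made up).
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0, 116.0, 78.0, np.nan],
    "BMI": [33.6, 26.6, 23.3, np.nan, 43.1, 25.6, 31.0, 35.3],
})
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

# Fit on the training split only, so the test split never influences the medians.
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(
    imputer.fit_transform(toy_train), columns=toy.columns, index=toy_train.index
)
# Reuse the medians learned from the training split on the test split.
test_filled = pd.DataFrame(
    imputer.transform(toy_test), columns=toy.columns, index=toy_test.index
)

print(train_filled.isna().sum())
print(test_filled.isna().sum())
```

Keeping the fitted imputer around makes it easy to apply exactly the same medians to any new data later on.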

# Checking feature importance
I will evaluate feature importance with a random forest. This can help me identify irrelevant features and simplify the model. Because impurity-based importance is biased toward high-cardinality features, I will use permutation importance instead.

In [None]:
def imp_rf(X_train, X_test, y_train, y_test, random_state=25, n_estimators=200, class_weight="balanced"):
  # Fit a class-weighted random forest and return its permutation importances.
  # Note: the importances are computed on the training split; X_test and y_test are not used here.
  rf = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state, class_weight=class_weight)
  model = rf.fit(X_train, y_train)
  result = permutation_importance(model, X_train, y_train, n_repeats=10, random_state=random_state)
  return result


In [None]:
importances = imp_rf(X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test).importances_mean
ranking = pd.DataFrame(zip(X_train.columns, importances))
ranking.sort_values(by=1, axis=0)

Surprisingly, the columns that flag missing values for skin thickness and insulin have a permutation importance of 0 or almost 0. A likely reason is that, during pre-processing, I replaced the NA values with the median: every imputed row carries the same value in the insulin column, so the trees can pick up that information from the insulin column itself rather than from the missing-value indicator column.
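The ranking above is computed on the training split. As a point of comparison, here is a minimal sketch of a permutation-importance ranking computed on held-out data and labelled by feature name. It uses synthetic data, and the names `signal`, `noise`, `demo_X`, `demo_y`, `forest` and `demo_ranking` are hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: "signal" fully determines the label, "noise" is unrelated to it.
rng = np.random.default_rng(0)
n = 400
demo_X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo_y = (demo_X["signal"] > 0).astype(int)

Xtr, Xte, ytr, yte = train_test_split(demo_X, demo_y, random_state=0, stratify=demo_y)
forest = RandomForestClassifier(
    n_estimators=100, random_state=0, class_weight="balanced"
).fit(Xtr, ytr)

# Permute each feature on the held-out split and measure the drop in score.
result = permutation_importance(forest, Xte, yte, n_repeats=10, random_state=0)

# Index by feature name so the ranking is readable without column positions.
demo_ranking = pd.Series(result.importances_mean, index=Xte.columns).sort_values(ascending=False)
print(demo_ranking)
```

Scoring on held-out data shows which features the model relies on to generalize, whereas scoring on the training split can also reward features the forest has merely memorized.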



# Handling missing data with lightgbm
I will now use LightGBM to handle the missing data by predicting each missing value from the other features.
To avoid data leakage, I will split the data into training and test sets, fit the imputation model on the training set only, and apply it to both.

In [None]:
df2["NA_Total"]=df2.isna().sum(axis=1)
df2.isna().sum()

From the previous results, I know that glucose and BMI have especially high importance, so imputing these two variables well is particularly relevant. I can also see that they are missing in 5 and 4 rows, respectively. Let's examine those rows closely.

In [None]:
df2[df2["BMI"].isna()]

In [None]:
df2[df2["Glucose"].isna()]

I see two observations with missing BMI that contain 3 missing values each. I will drop them, as they could introduce noise into the model.

In [None]:
df2.drop([9,684], axis=0, inplace=True)
df2.drop("NA_Total", axis=1, inplace=True)
df2[df2["BMI"].isna()]

I will now impute the rest of the missing values. To do so, I will first split the data into training and test sets.

In [None]:
X = df2.drop(labels="Outcome",axis=1)
y = df2["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=33, stratify=y)

In [None]:
na_cols = X_train.isna().sum()
# Sort from the fewest NAs to the most, so the columns with the most NAs can use more information
na_cols = na_cols[na_cols > 0].sort_values()
na_cols

In [None]:
X = df2.drop(labels="Outcome",axis=1)
y = df2["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=33, stratify=y)
for col in na_cols.index:  # Iterate over column names; iterating the Series itself would yield the NA counts
