# Linear Regression for Diabetes Prediction

## Step 1: Define our problem

We want to use the features in the Kaggle healthcare diabetes dataset (https://www.kaggle.com/datasets/nanditapore/healthcare-diabetes?resource=download)
to predict the binary "Outcome" column (0/1), which indicates diabetes, using Linear Regression.

### Import relevant packages

In [71]:
#for data wrangling
import pandas as pd
#for scaling the data
from sklearn.preprocessing import StandardScaler
#for our dataset
import kagglehub
#for splitting the data
from sklearn.model_selection import train_test_split
#load our model
from sklearn.linear_model import LinearRegression

## Step 2: Explore our data

In [29]:
#read our data into a dataframe
df = pd.read_csv('Healthcare-Diabetes.csv', encoding = 'latin-1')
#sneak peek at the first five rows
df.head()

Unnamed: 0,Id,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,1,6,148,72,35,0,33.6,0.627,50,1
1,2,1,85,66,29,0,26.6,0.351,31,0
2,3,8,183,64,0,0,23.3,0.672,32,1
3,4,1,89,66,23,94,28.1,0.167,21,0
4,5,0,137,40,35,168,43.1,2.288,33,1


In [72]:
#summary statistics (count, mean, std, quartiles) for each numeric column
df.describe()

Unnamed: 0,Id,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,2768.0,2768.0,2768.0,2768.0,2768.0,2768.0,2768.0,2768.0,2768.0,2768.0
mean,1384.5,3.742775,121.102601,69.134393,20.824422,80.12789,32.137392,0.471193,33.132225,0.343931
std,799.197097,3.323801,32.036508,19.231438,16.059596,112.301933,8.076127,0.325669,11.77723,0.475104
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,692.75,1.0,99.0,62.0,0.0,0.0,27.3,0.244,24.0,0.0
50%,1384.5,3.0,117.0,72.0,23.0,37.0,32.2,0.375,29.0,0.0
75%,2076.25,6.0,141.0,80.0,32.0,130.0,36.625,0.624,40.0,1.0
max,2768.0,17.0,199.0,122.0,110.0,846.0,80.6,2.42,81.0,1.0


In [77]:
#most common value of our Outcome response variable; the classes are imbalanced (0 dominates), so we stratify the split later
df["Outcome"].mode()

0    0
Name: Outcome, dtype: int64
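
The mode only tells us which class is most common, not by how much. A minimal sketch of a fuller check, using a made-up toy series in place of `df["Outcome"]` (the values below are illustrative, not taken from the dataset): `value_counts` gives the count per class, and `normalize=True` gives the shares. For a 0/1 target the share of 1s equals the mean, which `df.describe()` reports as about 0.34 for this dataset.

```python
import pandas as pd

# Toy stand-in for df["Outcome"]; the values are made up for illustration
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Outcome")

# Absolute number of rows per class
counts = outcome.value_counts()
# Share of each class; for a 0/1 target the share of 1s equals the mean
shares = outcome.value_counts(normalize=True)

print(counts)
print(shares)
print("mean of Outcome:", outcome.mean())
```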

## Step 3: Scale and Split our Data

In [46]:
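#separate the features (X) from the target (y)
#note: X still contains the "Id" column, which is a row identifier rather than a predictor
X = df.drop(columns = ["Outcome"])
y = df["Outcome"]
scaler = StandardScaler()
#standardize the features because their numeric ranges differ widely
X_scaled = scaler.fit_transform(X)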

In [38]:
#split our data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, 
                                                    random_state = 42, #for reproducible splitting
                                                    test_size = 0.2, # 20% test size
                                                    stratify = y) #preserves the 0/1 class proportions in both train and test

In [42]:
#validate shape
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
print("y Train shape:", y_train.shape)
print("y Test shape:", y_test.shape)

Train shape: (2214, 9)
Test shape: (554, 9)
y Train shape: (2214,)
y Test shape: (554,)
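
The cells above scale the full feature matrix before splitting, so the scaler's mean and standard deviation are computed partly from rows that later land in the test set. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the `*_demo` and `X_tr`/`X_te` names are hypothetical and the numbers are random, not the Kaggle dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Kaggle dataset): 200 rows, 3 features
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn mean/std from train only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the train statistics

print("train column means:", X_tr_scaled.mean(axis=0).round(6))
print("test column means: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1 by construction; the test columns are only approximately standardized, which is expected because they were scaled with the training statistics.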


In [53]:
#create the model and fit it on the training set
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression parameters:
Parameter,Value
fit_intercept,True
copy_X,True
tol,1e-06
n_jobs,None
positive,False


## Step 4: Evaluate our model

In [65]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np

In [69]:
y_pred = lr.predict(X_test)

In [70]:
r2 = r2_score(y_test, y_pred)
print("R²:", r2)
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)
mae = mean_absolute_error(y_test, y_pred)
print("MAE:", mae)
rmse = np.sqrt(mse)
print("RMSE:", rmse)

R²: 0.288688905702674
MSE: 0.1606867419779891
MAE: 0.33743112306825396
RMSE: 0.40085750832183387
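
The metrics above are regression metrics. To get an actual classification accuracy from a linear regression, the continuous predictions have to be converted to classes first. A minimal sketch, assuming a cutoff of 0.5 (a choice made here for illustration, not something the notebook specifies) and using made-up arrays in place of `y_test` and `y_pred`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and continuous predictions, standing in for y_test / y_pred
y_true_demo = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_cont_demo = np.array([0.10, 0.62, 0.80, 0.35, -0.05, 1.10, 0.40, 0.55])

# Assumed cutoff: predict class 1 when the regression output is >= 0.5
y_class_demo = (y_cont_demo >= 0.5).astype(int)

acc = accuracy_score(y_true_demo, y_class_demo)
cm = confusion_matrix(y_true_demo, y_class_demo)  # rows: true, cols: predicted

print("Accuracy:", acc)
print(cm)
```

On the notebook's own data the same idea would read `(y_pred >= 0.5).astype(int)` followed by `accuracy_score(y_test, ...)`.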


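## Conclusion

### Why Linear Regression Struggles

The target is binary (0/1), so this is a classification problem, and linear regression is not designed for it. The test R² of about 0.29 means the model explains only around 29% of the variance in "Outcome".

The outputs are continuous and unbounded, so predictions can fall outside [0, 1] and cannot be read directly as probabilities.

The assumptions behind linear regression are violated by a binary outcome: for any given prediction the residual can take only two values, so the residuals are neither normally distributed nor constant in variance.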
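
The points above can be illustrated on synthetic data. The sketch below fits both a linear regression and a logistic regression (a standard classifier for 0/1 targets, swapped in here for comparison; the notebook itself does not use it) to a made-up one-feature problem, then compares the range of their outputs. All names and data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic one-feature data (not the Kaggle dataset): larger x means
# a higher chance of class 1
rng = np.random.default_rng(0)
x_demo = rng.uniform(-3.0, 3.0, size=(300, 1))
p_true = 1.0 / (1.0 + np.exp(-2.0 * x_demo[:, 0]))
y_demo = (rng.uniform(size=300) < p_true).astype(int)

lin = LinearRegression().fit(x_demo, y_demo)
log = LogisticRegression().fit(x_demo, y_demo)

# Evaluate both models on a grid that extends past the training range
grid = np.linspace(-6.0, 6.0, 13).reshape(-1, 1)
lin_out = lin.predict(grid)              # unbounded straight line
log_out = log.predict_proba(grid)[:, 1]  # estimated probability of class 1

print("linear   min/max:", lin_out.min(), lin_out.max())
print("logistic min/max:", log_out.min(), log_out.max())
```

The linear model's outputs leave the [0, 1] interval at the extremes, while the logistic model's outputs stay inside it because they pass through a sigmoid.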