Step 1: Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

Step 2: Load Diabetes dataset

In [2]:
# Load the dataset
df = pd.read_csv('diabetes_dirty.csv')

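# Display first 5 rows
# (a notebook only displays the last expression of a cell, so wrap this
# in print() or display() to see it from the middle of a cell)
df.head()

# Get info about dataset
df.info()
df.describe()  # not displayed either, for the same reason
print(df.columns)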

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   AGE          442 non-null    int64  
 1   SEX          442 non-null    int64  
 2   BMI          442 non-null    float64
 3   BP           442 non-null    float64
 4   S1           442 non-null    int64  
 5   S2           442 non-null    float64
 6   S3           442 non-null    float64
 7   S4           442 non-null    float64
 8   S5           442 non-null    float64
 9   S6           442 non-null    int64  
 10  PROGRESSION  442 non-null    int64  
dtypes: float64(6), int64(5)
memory usage: 38.1 KB
Index(['AGE', 'SEX', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6',
       'PROGRESSION'],
      dtype='object')
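
The `df.info()` output above reports 442 non-null values in every column, so no values are missing. Because the file name suggests the data may need cleaning, it can still be worth counting missing values and duplicate rows explicitly. The sketch below uses a small made-up frame (`demo`) in place of `diabetes_dirty.csv`; the column names are borrowed from the output above, and the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real file; values are invented.
demo = pd.DataFrame({
    "AGE": [59, 48, 72, 48],
    "BMI": [32.1, 21.6, np.nan, 21.6],
    "PROGRESSION": [151, 75, 141, 75],
})

# Count NaN values in each column.
missing_per_column = demo.isna().sum()

# Count rows that are identical to an earlier row.
duplicate_rows = int(demo.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

Running the same two checks on `df` would show whether any cleaning is needed before modelling.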


Step 4: Separate independent variables (x) and dependent variable (y)

In [3]:
# The target column is 'PROGRESSION' (upper case, as listed in df.columns above)
x = df.drop('PROGRESSION', axis=1)
y = df['PROGRESSION']

Step 5: Split the dataset into training (80%) and testing (20%) sets

In [4]:
# No random_state is passed, so the split (and every number below) will differ between runs
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=True)
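
If reproducible numbers are wanted, `train_test_split` accepts a `random_state` argument. The sketch below shows, on synthetic stand-in data, that the same seed yields the same split; the seed value 42 and the names `x_demo`, `y_demo`, `split_a`, and `split_b` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows, 2 features (not the diabetes data).
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same seed produces an identical split on every call.
split_a = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(x_demo, y_demo, test_size=0.2, random_state=42)

# Index 1 of the returned list is the test portion of x.
same_test_rows = np.array_equal(split_a[1], split_b[1])
print("Test sets identical:", same_test_rows)
print("Train/test sizes:", len(split_a[0]), len(split_a[1]))
```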

Step 6: Investigate the need for scaling

In [5]:
print("\nBasic statistics of features before scaling:")
print(x_train.describe())


Basic statistics of features before scaling:
              AGE         SEX         BMI          BP          S1          S2  \
count  353.000000  353.000000  353.000000  353.000000  353.000000  353.000000   
mean    47.974504    1.473088   26.319263   94.216176  189.246459  115.933428   
std     13.163393    0.499984    4.500748   13.734541   34.705644   31.070428   
min     19.000000    1.000000   18.000000   62.000000   97.000000   43.400000   
25%     37.000000    1.000000   23.100000   84.000000  165.000000   95.000000   
50%     49.000000    1.000000   25.700000   93.000000  186.000000  113.400000   
75%     58.000000    2.000000   29.400000  104.000000  209.000000  134.600000   
max     79.000000    2.000000   42.200000  131.000000  301.000000  242.400000   

               S3          S4          S5          S6  
count  353.000000  353.000000  353.000000  353.000000  
mean    49.907932    4.061416    4.626146   90.600567  
std     13.136116    1.279639    0.515481   11.129746  


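The features span very different ranges: S5 has a standard deviation of about 0.5, while S1 has one of about 34.7. Scaling puts them on a common scale. For ordinary least squares this does not change the predictions, but it makes the coefficients directly comparable, and it matters more for regularized or gradient-based models.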
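
One way to see what scaling does for ordinary least squares is to fit the same model on raw and on standardized features and compare the results. The sketch below does this on synthetic data (not the diabetes data); all names and numbers in it are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales.
x_demo = np.column_stack([rng.normal(50, 13, 200), rng.normal(4.6, 0.5, 200)])
y_demo = 2.0 * x_demo[:, 0] + 30.0 * x_demo[:, 1] + rng.normal(0, 5, 200)

x_tr, x_te = x_demo[:160], x_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# Fit once on raw features and once on standardized features.
raw_model = LinearRegression().fit(x_tr, y_tr)
scaler = StandardScaler().fit(x_tr)
scaled_model = LinearRegression().fit(scaler.transform(x_tr), y_tr)

r2_raw = r2_score(y_te, raw_model.predict(x_te))
r2_scaled = r2_score(y_te, scaled_model.predict(scaler.transform(x_te)))

print("R2 raw:   ", r2_raw)
print("R2 scaled:", r2_scaled)
print("Raw coefficients:   ", raw_model.coef_)
print("Scaled coefficients:", scaled_model.coef_)
```

The two R-squared values agree up to floating-point error, while the coefficients differ: after standardization each coefficient is expressed per standard deviation of its feature.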

Step 7: Apply MinMaxScaler and StandardScaler

In [6]:
minmax_scaler = MinMaxScaler()
standard_scaler = StandardScaler()

In [7]:
# Fit each scaler on the training set only, then reuse it on the test set to avoid data leakage
x_train_minmax = minmax_scaler.fit_transform(x_train)
x_test_minmax = minmax_scaler.transform(x_test)

x_train_standard = standard_scaler.fit_transform(x_train)
x_test_standard = standard_scaler.transform(x_test)

# Proceed with StandardScaler: it centers each feature (mean 0) and scales it to unit variance
x_train_scaled = x_train_standard
x_test_scaled = x_test_standard
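
As a check on what the two scalers compute, the transformations can be reproduced by hand. The sketch below uses a small invented training column (`x_tr`) and compares the manual formulas with the scikit-learn output.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Small invented training column (one feature).
x_tr = np.array([[18.0], [23.1], [25.7], [29.4], [42.2]])

# MinMaxScaler: (x - min) / (max - min), using the training min and max.
manual_minmax = (x_tr - x_tr.min(axis=0)) / (x_tr.max(axis=0) - x_tr.min(axis=0))

# StandardScaler: (x - mean) / std, using the population std (ddof=0).
manual_standard = (x_tr - x_tr.mean(axis=0)) / x_tr.std(axis=0)

minmax_matches = np.allclose(manual_minmax, MinMaxScaler().fit_transform(x_tr))
standard_matches = np.allclose(manual_standard, StandardScaler().fit_transform(x_tr))
print("MinMax matches:  ", minmax_matches)
print("Standard matches:", standard_matches)
```

Test rows are transformed with the training statistics, so scaled test values can fall outside [0, 1] under `MinMaxScaler`.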

Step 8: Train a multiple linear regression model

In [8]:
model = LinearRegression()
model.fit(x_train_scaled, y_train)
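
Scaling and fitting can also be combined into one object, which keeps the scaler from ever being fitted on test data by accident. The following is a minimal sketch using `make_pipeline` on synthetic data; it is an alternative arrangement, not the approach used in this notebook, and the names `x_demo`, `y_demo`, and `pipeline` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic data: 100 rows, 3 features, a linear target with a little noise.
x_demo = rng.normal(size=(100, 3))
y_demo = x_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# The pipeline fits the scaler and the regression on the training rows only.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(x_demo[:80], y_demo[:80])

# score() scales the held-out rows with the training statistics, then returns R-squared.
test_r2 = pipeline.score(x_demo[80:], y_demo[80:])
print("Test R2:", test_r2)
```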

Step 9: Print the model intercept and coefficients

In [9]:
print("\nModel Intercept:", model.intercept_)
print('Model Co-efficient:', model.coef_)


Model Intercept: 149.371104815864
Model Co-efficient: [  0.43173753 -10.83023146  21.55710633  18.3618367  -46.39086961
  29.55538306   6.83028374   7.4709764   37.76758745   2.32775   ]
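
Because the features were standardized on the training set, the intercept above (about 149.37) equals the mean of `y_train`, and each coefficient is the change in the prediction per one standard deviation of its feature. The sketch below illustrates both points and pairs the coefficients with column names for easier reading. The data is invented; only the column names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Invented data; only the column names come from the notebook.
x_demo = pd.DataFrame({
    "BMI": rng.normal(26, 4.5, 150),
    "BP": rng.normal(94, 14, 150),
    "S5": rng.normal(4.6, 0.5, 150),
})
y_demo = 5 * x_demo["BMI"] + 1 * x_demo["BP"] + 40 * x_demo["S5"] + rng.normal(0, 10, 150)

x_scaled = StandardScaler().fit_transform(x_demo)
model_demo = LinearRegression().fit(x_scaled, y_demo)

# With centered features, the OLS intercept equals the mean of the training target.
intercept_is_mean = np.isclose(model_demo.intercept_, y_demo.mean())

# Pair each coefficient with its column name and sort by absolute size.
coef_table = pd.Series(model_demo.coef_, index=x_demo.columns)
coef_table = coef_table.reindex(coef_table.abs().sort_values(ascending=False).index)

print(coef_table)
print("Intercept equals mean of y:", intercept_is_mean)
```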


Step 10: Generate predictions using the test set

In [10]:
y_pred = model.predict(x_test_scaled)

Step 11: Evaluate model performance with R-squared

In [11]:
r2 = r2_score(y_test, y_pred)
print("\nR-squared score on test data:", r2)

# Interpretation note (rule-of-thumb thresholds, not a formal test)
if r2 >= 0.75:
    print("Good model fit.")
elif r2 >= 0.5:
    print("Moderate fit. Consider checking for multicollinearity or feature engineering.")
else:
    print("Weak fit. Consider improving data quality, feature selection, or trying a different model.")


R-squared score on test data: 0.483548993101371
Weak fit. Consider improving data quality, feature selection, or trying a different model.
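
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the observed targets. A score of about 0.48 therefore means the model removes roughly 48% of the squared error that a predict-the-mean baseline would make on this test set. The sketch below computes the score by hand on invented numbers and compares it with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented targets and predictions for illustration.
y_true = np.array([150.0, 80.0, 200.0, 120.0, 95.0])
y_hat = np.array([140.0, 100.0, 180.0, 130.0, 90.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print("Manual R2: ", r2_manual)
print("sklearn R2:", r2_score(y_true, y_hat))
```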
