## Problem Definition

The goal of this project is to predict the **diabetes progression score** from features such as **age**, **blood pressure**, **insulin levels**, and other health metrics. The task is approached as a **regression problem**: the models output a continuous score for the target variable, which is the `Outcome` column of the dataset. In the rows previewed below, `Outcome` takes the values 0 and 1, so the predicted score is best read as a continuous estimate of that label.

Given the data, we aim to build a machine learning model that predicts this target from the input features.


In [1]:
import pandas as pd # Import for data handling.
import numpy as np # Import for numerical data.

In [2]:
df = pd.read_csv('data/diabetes.csv') # Load the dataset.
df.head() # Display the first 5 rows.

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


In [3]:
df.info() # Print column types and non-null counts.
df.isnull().sum() # Count missing values per column.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
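
`isnull()` reports no missing values, yet the preview above shows `Insulin` and `SkinThickness` entries of 0, which are not plausible measurements. The following is a minimal sketch, assuming that a 0 in such columns stands for an unrecorded value (the source does not say so). It re-types the five previewed rows so it runs on its own; `zero_as_missing` is a hypothetical list chosen for illustration.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(), re-typed so the sketch is self-contained.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Hypothetical choice: columns where 0 is assumed to mean "not recorded".
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column; isnull() cannot see them because 0 is a valid number.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# Optionally mark the zeros as NaN so isnull() and imputers can handle them.
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isnull().sum())
```

Whether to impute or drop such values is a modeling decision. The sketch only makes them visible.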

In [None]:
from sklearn.preprocessing import StandardScaler # Import scaler for standardization.

features = df.drop('Outcome', axis=1) # Feature columns: everything except the target, Outcome.

scaler = StandardScaler() # Rescales each feature to zero mean and unit variance.

normalized_features = scaler.fit_transform(features) # Fit the scaler and put all features on the same scale.
df_normalized = pd.DataFrame(normalized_features, columns=features.columns) # Put the scaled features in a new table.

df_normalized['Outcome'] = df['Outcome'] # Add the Outcome column back to the table.
print(df_normalized.head()) # Show the first 5 rows of the new table.

   Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  \
0     0.639947  0.848324       0.149641       0.907270 -0.692891  0.204013   
1    -0.844885 -1.123396      -0.160546       0.530902 -0.692891 -0.684422   
2     1.233880  1.943724      -0.263941      -1.288212 -0.692891 -1.103255   
3    -0.844885 -0.998208      -0.160546       0.154533  0.123302 -0.494043   
4    -1.141852  0.504055      -1.504687       0.907270  0.765836  1.409746   

   DiabetesPedigreeFunction       Age  Outcome  
0                  0.468492  1.425995        1  
1                 -0.365061 -0.190672        0  
2                  0.604397 -0.105584        1  
3                 -0.920763 -1.041549        0  
4                  5.484909 -0.020496        1  
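
`StandardScaler` transforms each column as z = (x - mean) / std, using the population standard deviation (`ddof=0`). As a small worked illustration, the sketch below applies that formula with NumPy to the five previewed `Glucose` values. Because it uses only five rows, its numbers differ from the output above, which is computed over all 768 rows.

```python
import numpy as np

# Glucose values from the five previewed rows (illustrative subset only).
glucose = np.array([148, 85, 183, 89, 137], dtype=float)

# StandardScaler computes z = (x - mean) / std with the population std (ddof=0).
mean = glucose.mean()
std = glucose.std(ddof=0)
z = (glucose - mean) / std

print(z)
print(z.mean(), z.std(ddof=0))  # approximately 0 and 1
```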


In [None]:
from sklearn.model_selection import train_test_split # Import to split data into training and testing.

X = df_normalized.drop('Outcome', axis=1) # X is all the columns except Outcome: the model inputs.
y = df_normalized['Outcome']  # y is the Outcome column: the value the model needs to predict.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Split the data: 80% for training, 20% for testing.

print(f"Training Features Shape: {X_train.shape}") # Rows and columns in training features
print(f"Testing Features Shape: {X_test.shape}") # Rows and columns in testing features
print(f"Training Target Shape: {y_train.shape}") # Rows in training target
print(f"Testing Target Shape: {y_test.shape}") # Rows in testing target

Training Features Shape: (614, 8)
Testing Features Shape: (154, 8)
Training Target Shape: (614,)
Testing Target Shape: (154,)
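
The cells above fit the scaler on all 768 rows before splitting, so the test rows influence the scaling statistics. One common alternative is to split first and let a `Pipeline` fit the scaler on the training rows only. The sketch below shows this on synthetic stand-in data: the `make_regression` data and the `_demo` names are assumptions for illustration, not the notebook's dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data: 768 rows, 8 features.
X_demo, y_demo = make_regression(n_samples=768, n_features=8, noise=10.0, random_state=42)

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits StandardScaler on X_tr only, then reuses it on X_te at predict time.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
print(f"R-squared on held-out rows: {pipe.score(X_te, y_te):.3f}")
```

A side benefit of this design is that the scaler and the model travel together as one object, so they cannot get out of sync when saved or reused.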


In [None]:
from sklearn.linear_model import LinearRegression # Import Linear Regression model.
from sklearn.metrics import mean_squared_error, r2_score # Evaluate the model.

model = LinearRegression() # Create the Linear Regression model.

model.fit(X_train, y_train) # Learn the relationship between X_train and y_train.

y_pred = model.predict(X_test) # Predict the outcome using the test features.

mse = mean_squared_error(y_test, y_pred) # Average squared difference between predictions and actual values.
r2 = r2_score(y_test, y_pred) # Share of the target's variance that the model explains.

print(f"Mean Squared Error: {mse}") # Print how far the predictions are from the actual values.
print(f"R-squared: {r2}") # Print how well the model fits the data.

Mean Squared Error: 0.17104527280850101
R-squared: 0.25500281176741757
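
For reference, the two metrics can be computed by hand: MSE is the mean of the squared residuals, and R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below uses hypothetical targets and predictions chosen only to illustrate the formulas.

```python
import numpy as np

# Hypothetical targets and predictions, chosen only to illustrate the formulas.
y_true = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
y_hat = np.array([0.8, 0.3, 0.6, 0.1, 0.7])

# MSE: average squared difference between prediction and target.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MSE: {mse_manual:.4f}")
print(f"R-squared: {r2_manual:.4f}")
```

Read this way, the R-squared of about 0.26 reported above means the linear model explains roughly a quarter of the variance in the test targets.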


In [None]:
from sklearn.ensemble import RandomForestRegressor # Import Random Forest Regressor

rf_model = RandomForestRegressor(n_estimators=100, random_state=42) # Uses 100 decision trees in the forest. More trees usually give more stable predictions, with diminishing returns.
# random_state=42 makes the results reproducible.
rf_model.fit(X_train, y_train) # Train the model using the training data.

y_rf_pred = rf_model.predict(X_test) # Predicts target values.

rf_mse = mean_squared_error(y_test, y_rf_pred) # Average squared difference between predictions and actual values.
rf_r2 = r2_score(y_test, y_rf_pred) # Share of the target's variance that the model explains.

print(f"Random Forest Mean Squared Error: {rf_mse}") # Display MSE.
print(f"Random Forest R-squared: {rf_r2}") # Display R-Squared.

Random Forest Mean Squared Error: 0.1710025974025974
Random Forest R-squared: 0.2551886868686867


In [None]:
from sklearn.svm import SVR # Import Support Vector Regression.

svr_model = SVR(kernel='rbf')  # Create the SVR model with an RBF kernel.

svr_model.fit(X_train, y_train) # Train the SVR model.
y_svr_pred = svr_model.predict(X_test) # Predict target values for the test set.

svr_mse = mean_squared_error(y_test, y_svr_pred) # Average squared difference between predictions and actual values.
svr_r2 = r2_score(y_test, y_svr_pred) # Share of the target's variance that the model explains.

print(f"Support Vector Regression Mean Squared Error: {svr_mse}") # Display MSE.
print(f"Support Vector Regression R-squared: {svr_r2}") # Display R-squared.

Support Vector Regression Mean Squared Error: 0.1802369656341436
Support Vector Regression R-squared: 0.2149678830157299
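
The three cells above repeat the same fit, predict, and score steps. A small helper can run them for any number of models and collect the results in one place. This is a sketch on synthetic stand-in data: `evaluate`, `candidates`, and the `_demo` names are hypothetical, not part of the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model and return (MSE, R-squared) on the test split."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return mean_squared_error(y_test, predictions), r2_score(y_test, predictions)


candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "SVR (RBF)": SVR(kernel="rbf"),
}

# Evaluate every candidate on the same split so the scores are comparable.
results = {name: evaluate(m, X_tr, y_tr, X_te, y_te) for name, m in candidates.items()}
for name, (mse_value, r2_value) in results.items():
    print(f"{name}: MSE={mse_value:.3f}, R-squared={r2_value:.3f}")
```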


In [None]:
from sklearn.model_selection import GridSearchCV # Import the hyperparameter search.
from sklearn.ensemble import GradientBoostingRegressor # Import the model.
from sklearn.metrics import mean_squared_error, r2_score # Already imported above; repeated so this cell runs on its own.

param_grid = {
    'n_estimators': [100, 200], # Number of boosting stages (trees) the model builds.
    'learning_rate': [0.01, 0.1], # How much each tree contributes to the final prediction.
    'max_depth': [3, 5] # Maximum depth of each tree; controls complexity.
}

grid_search = GridSearchCV( # Tests every combination of the settings above.
    estimator=GradientBoostingRegressor(), # The model.
    param_grid=param_grid, # The settings to try.
    cv=3, # 3-fold cross-validation on the training data.
    n_jobs=-1, # Use all available CPU cores.
    verbose=2 # Show details while the search is progressing.
)

grid_search.fit(X_train, y_train) # Use the training data to try the different combinations of settings.
best_gbr = grid_search.best_estimator_ # The best model that the grid search found.
y_pred = best_gbr.predict(X_test) # Make predictions on the test data.

mse = mean_squared_error(y_test, y_pred) # Average squared difference between predictions and actual values.
r2 = r2_score(y_test, y_pred) # Share of the target's variance that the model explains.

print("Best Parameters:", grid_search.best_params_) # Show which settings worked best.
print(f"Mean Squared Error: {mse}") # Display MSE.
print(f"R-squared: {r2}") # Display R-squared.

Fitting 3 folds for each of 8 candidates, totalling 24 fits
Best Parameters: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 200}
Mean Squared Error: 0.1664337228518826
R-squared: 0.2750886738006889
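
The grid holds 2 x 2 x 2 = 8 parameter combinations, and each is fit on 3 folds, which gives the 24 fits reported above. With no `scoring` argument, `GridSearchCV` ranks candidates by the estimator's own `score` method, which is R-squared for a regressor. The sketch below shows how the per-candidate scores can be inspected. It runs on synthetic stand-in data with a smaller grid, and it sets `random_state=42` on the estimator for reproducibility, which the notebook does not do.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

param_grid_demo = {
    "n_estimators": [50, 100],
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5],
}

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_grid=param_grid_demo,
    cv=3,
)
search.fit(X_demo, y_demo)

# With no scoring argument, each validation fold is scored by R-squared.
print("Best parameters:", search.best_params_)
print(f"Best mean CV R-squared: {search.best_score_:.3f}")

# One row per candidate, sorted from best to worst mean validation score.
cv_table = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(cv_table.sort_values("rank_test_score"))
```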


In [None]:
from sklearn.metrics import mean_squared_error, r2_score # Import the evaluation metrics.

y_pred = best_gbr.predict(X_test) # Predict on the test features with the tuned model.

mse = mean_squared_error(y_test, y_pred) # Average squared difference between predictions and actual values.
r2 = r2_score(y_test, y_pred) # Calculate the R-squared score.

print("Optimized Model Mean Squared Error:", mse) # Display the MSE.
print("Optimized Model R-squared:", r2) # Display the R-squared.

Optimized Model Mean Squared Error: 0.1664337228518826
Optimized Model R-squared: 0.2750886738006889


In [None]:
import joblib # Import to save and load the model so it can be reused.
joblib.dump(best_gbr, 'gradient_boosting_model.pkl') # Save the tuned model.

['gradient_boosting_model.pkl']

In [None]:
import joblib # Used to save and load models.
import pandas as pd # Used to work with data.
from sklearn.preprocessing import StandardScaler # Used to standardize the features.
from sklearn.ensemble import GradientBoostingRegressor # The learning model.

df = pd.read_csv('data/diabetes.csv') # Load the dataset.

X = df.drop(columns='Outcome') # Features: all columns except Outcome.
y = df['Outcome'] # The outcome we want the model to predict.

scaler = StandardScaler() # Standardizes the features.
X_scaled = scaler.fit_transform(X) # Learn each feature's mean and standard deviation, then scale.

model = GradientBoostingRegressor() # Create a model with default settings (not the tuned parameters found above).
model.fit(X_scaled, y) # Train the model on all rows to predict the target.

joblib.dump(model, 'gradient_boosting_model.pkl') # Save the model to gradient_boosting_model.pkl; this overwrites the tuned model saved in the previous cell.
joblib.dump(scaler, 'scaler.pkl') # Save the scaler to a file named scaler.pkl.

print("Model and Scaler have been saved successfully.") # Confirmation message.

Model and Scaler have been saved successfully.
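
To reuse the saved files, both objects are loaded with `joblib.load`, and new inputs go through the scaler before they reach the model. The sketch below is self-contained under stated assumptions: it trains on synthetic stand-in data, writes to a temporary directory instead of the notebook's working directory, and uses hypothetical `_demo` names.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in training data: 8 features and a 0/1 target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 8))
y_demo = (X_demo[:, 1] > 0).astype(int)

scaler_demo = StandardScaler().fit(X_demo)
model_demo = GradientBoostingRegressor(random_state=42).fit(scaler_demo.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "gradient_boosting_model.pkl")
    scaler_path = os.path.join(tmp_dir, "scaler.pkl")
    joblib.dump(model_demo, model_path)
    joblib.dump(scaler_demo, scaler_path)

    # Load both objects back, as a separate script or service would.
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# A new record needs the same 8 features, in the same order, as the training data.
new_record = rng.normal(size=(1, 8))
prediction = loaded_model.predict(loaded_scaler.transform(new_record))
print(prediction)
```

Applying the loaded scaler before predicting matters: the model was trained on scaled features, so raw inputs would be on the wrong scale.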
