# Diabetes dataset analysis and modeling

This notebook follows the requested steps for the `load_diabetes` dataset:

1. Data collection
2. Data validation
3. Data preprocessing (cleaning)
4. Data splitting (train/test)
5. Model selection
6. Model training
7. Model testing / evaluation
8. Model prediction
9. Model deployment (save & download)
10. Model monitoring & maintenance

**Bonus:** Automation & Pipeline Tools suggestions

---


## 1) Data collection
Load the dataset via `sklearn.datasets.load_diabetes` and convert to a pandas DataFrame.

In [1]:
from sklearn.datasets import load_diabetes
import pandas as pd
diabetes = load_diabetes(as_frame=True)
X = diabetes.data.copy()
y = diabetes.target.copy()
print('X shape:', X.shape)
print('y shape:', y.shape)
X.head()

X shape: (442, 10)
y shape: (442,)


Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641


## 2) Data validation
Check for missing values, duplicate rows, dtypes, and basic statistics.

In [2]:
# Basic validation
print('Columns and dtypes:')
print(X.dtypes)
print('\nMissing values per column:')
print(X.isna().sum())
print('\nAny duplicate rows in X? ', X.duplicated().any())
print('\nTarget (y) stats:')
print(y.describe())
# Quick correlation check (features vs target)
corr_with_target = X.join(y.rename('target')).corr()['target'].sort_values(ascending=False)
corr_with_target

Columns and dtypes:
age    float64
sex    float64
bmi    float64
bp     float64
s1     float64
s2     float64
s3     float64
s4     float64
s5     float64
s6     float64
dtype: object

Missing values per column:
age    0
sex    0
bmi    0
bp     0
s1     0
s2     0
s3     0
s4     0
s5     0
s6     0
dtype: int64

Any duplicate rows in X?  False

Target (y) stats:
count    442.000000
mean     152.133484
std       77.093005
min       25.000000
25%       87.000000
50%      140.500000
75%      211.500000
max      346.000000
Name: target, dtype: float64


target    1.000000
bmi       0.586450
s5        0.565883
bp        0.441482
s4        0.430453
s6        0.382483
s1        0.212022
age       0.187889
s2        0.174054
sex       0.043062
s3       -0.394789
Name: target, dtype: float64

## 3) Data preprocessing (data cleaning)
- No missing values in the dataset, but we'll create a reproducible preprocessing pipeline with `StandardScaler`.
- We'll also show simple outlier detection (IQR method) and option to remove outliers (commented out by default).

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import numpy as np

# Example of simple outlier detection using IQR (not applied by default)
def detect_outliers_iqr(df, factor=1.5):
    outlier_index = set()
    for col in df.columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - factor * iqr
        upper = q3 + factor * iqr
        outlier_index.update(df[(df[col] < lower) | (df[col] > upper)].index.tolist())
    return sorted(list(outlier_index))

outliers = detect_outliers_iqr(X)
print('Number of detected outlier rows (IQR method):', len(outliers))

# Preprocessing pipeline: Standard scaling for all features (diabetes data is numeric)
numeric_features = X.columns.tolist()
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features)])

# Show how to fit-transform a small sample (not changing original X)
X_scaled_sample = preprocessor.fit_transform(X.iloc[:5])
X_scaled_sample

Number of detected outlier rows (IQR method): 31


array([[ 5.32713604e-01,  1.22474487e+00,  1.35772039e+00,
         1.11474080e+00, -1.14995688e+00, -1.00587806e+00,
        -8.54282143e-01,  3.71647064e-17,  8.90712800e-01,
         6.98771243e-01],
       [-1.64887544e-01, -8.16496581e-01, -1.18800534e+00,
        -8.86076023e-01,  3.30185639e-01, -3.83428765e-01,
         1.82581870e+00, -1.58113883e+00, -1.65448141e+00,
        -1.81680523e+00],
       [ 1.35715132e+00,  1.22474487e+00,  9.69800278e-01,
        -2.85830975e-02, -1.20688544e+00, -9.80980087e-01,
        -6.03022689e-01,  3.71647064e-17,  3.99027555e-01,
         4.19262746e-01],
       [-1.68692641e+00, -8.16496581e-01, -2.90940083e-01,
        -1.31482249e+00,  1.18411402e+00,  1.37187824e+00,
        -6.86775840e-01,  1.58113883e+00,  9.70907452e-01,
         9.78279740e-01],
       [-3.80509717e-02, -8.16496581e-01, -8.48575243e-01,
         1.11474080e+00,  8.42542666e-01,  9.98408667e-01,
         3.18261975e-01,  3.71647064e-17, -6.06166398e-01,
        -2.

## 4) Data splitting (train / test)
We'll use `train_test_split` with a fixed random seed for reproducibility.

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)

Train shape: (353, 10) Test shape: (89, 10)


## 5) Model selection (choosing candidate algorithms)
We'll compare:
- Linear Regression
- Ridge Regression
- Lasso Regression
- Random Forest Regressor
- Gradient Boosting Regressor

We'll use `GridSearchCV` (with a modest parameter grid) to find a strong candidate.

In [10]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline

# Candidate pipelines (with preprocessing)
models = {
    'LinearRegression': Pipeline([('pre', preprocessor), ('model', LinearRegression())]),
    'Ridge': Pipeline([('pre', preprocessor), ('model', Ridge())]),
    'Lasso': Pipeline([('pre', preprocessor), ('model', Lasso(max_iter=10000))]),
    'RandomForest': Pipeline([('pre', preprocessor), ('model', RandomForestRegressor(random_state=42))]),
    'GradientBoosting': Pipeline([('pre', preprocessor), ('model', GradientBoostingRegressor(random_state=42))])
}

param_grids = {
    'LinearRegression': {},
    'Ridge': {'model__alpha': [0.1, 1.0, 10.0]},
    'Lasso': {'model__alpha': [0.001, 0.01, 0.1, 1.0]},
    'RandomForest': {'model__model__n_estimators': [50, 100], 'model__model__max_depth': [None, 3, 5]},
    'GradientBoosting': {'model__model__n_estimators': [50, 100], 'model__model__learning_rate': [0.01, 0.1], 'model__model__max_depth': [3, 5]}
}

# Note: For RandomForest and GradientBoosting, we placed the estimator under 'model' step;
# depending on sklearn version, param names reflect the pipeline step name 'model'.


## 6) Model training (Grid search for best hyperparameters)
We'll run GridSearchCV on each candidate and pick the best model by cross-validated R² (default scoring).

In [6]:
best_estimators = {}
best_scores = {}

for name, pipeline in models.items():
    print('\nTraining / tuning:', name)
    param_grid = param_grids.get(name, {})
    # Use 3-fold CV to keep runtime modest; adjust cv for better tuning on real runs.
    gs = GridSearchCV(pipeline, param_grid=param_grid, cv=3, scoring='r2', n_jobs=-1)
    gs.fit(X_train, y_train)
    best_estimators[name] = gs.best_estimator_
    best_scores[name] = gs.best_score_
    print('Best CV r2:', gs.best_score_)
    print('Best params:', gs.best_params_)
    
# Summary of CV results
sorted_best = sorted(best_scores.items(), key=lambda x: x[1], reverse=True)
print('\nModel ranking by CV R2:')
for name, score in sorted_best:
    print(f'{name}: {score:.4f}')


Training / tuning: LinearRegression
Best CV r2: 0.4878823490667685
Best params: {}

Training / tuning: Ridge
Best CV r2: 0.48793455930940804
Best params: {'model__alpha': 0.1}

Training / tuning: Lasso
Best CV r2: 0.4878781553976899
Best params: {'model__alpha': 0.001}

Training / tuning: RandomForest


ValueError: Invalid parameter 'model' for estimator RandomForestRegressor(random_state=42). Valid parameters are: ['bootstrap', 'ccp_alpha', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'monotonic_cst', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'].

## 7) Model testing / evaluation
Evaluate the top models on the test set using RMSE, MAE, and R².

In [7]:
def evaluate_model(model, X_test, y_test):
    preds = model.predict(X_test)
    mse = mean_squared_error(y_test, preds)
    rmse = mse ** 0.5
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    return {'rmse': rmse, 'mse': mse, 'mae': mae, 'r2': r2}

results = {}
for name, estimator in best_estimators.items():
    results[name] = evaluate_model(estimator, X_test, y_test)

import pandas as pd
results_df = pd.DataFrame(results).T.sort_values('r2', ascending=False)
results_df

Unnamed: 0,rmse,mse,mae,r2
Ridge,53.842869,2899.054556,42.796235,0.452818
Lasso,53.851726,2900.008373,42.794066,0.452638
LinearRegression,53.853446,2900.193628,42.794095,0.452603


## 8) Model prediction
Use the best model to predict on new samples (here: the test set first 5 rows).

In [8]:
# Choose best model by test R2
best_model_name = results_df.index[0]
best_model = best_estimators[best_model_name]
print('Best model on test set:', best_model_name)

# Example predictions
sample_X = X_test.iloc[:5]
sample_y = y_test.iloc[:5]
preds = best_model.predict(sample_X)
pd.DataFrame({'y_true': sample_y, 'y_pred': preds})

Best model on test set: Ridge


Unnamed: 0,y_true,y_pred
287,219.0,139.585031
211,70.0,179.575695
72,202.0,134.249708
321,230.0,291.51177
73,111.0,123.710442


## 9) Model deployment (model download)
Save the best model as a joblib file which can be downloaded and loaded later.

In [12]:
import joblib, os
model_path = 'C:/Users/ASUS/AI_Internship_OCAC_2/08_08_25/diabetes_best_model.joblib'
joblib.dump(best_model, model_path)
print('Saved best model to:', model_path)
# Check file exists
os.path.exists(model_path)

Saved best model to: C:/Users/ASUS/AI_Internship_OCAC_2/08_08_25/diabetes_best_model.joblib


True

## 10) Model monitoring and maintenance
Recommendations:
- Track performance metrics (RMSE, MAE, R²) over time.
- Monitor data drift (feature distribution changes) and label drift.
- Retrain on new data periodically or when performance degrades.
- Use validation & production holds for safe deployment.

**Suggested tools:** MLflow (experiment tracking), Prometheus/Grafana (metrics), Seldon/TFServing or FastAPI (model serving), Great Expectations (data validation).

### Bonus: Automation & Pipeline Tools
- Convert the preprocessing+model into a sklearn `Pipeline` (already used here).
- CI/CD: use GitHub Actions to run tests, training and push model artifacts to a model registry.
- Orchestrate with Airflow or Prefect to schedule retraining and monitoring jobs.
- For production features: use feature stores (Feast) and model registries.

---

End of notebook.