> # **Practical Lab 2 - Multivariate Linear Regression, Non-Parametric Models and Cross-Validation**

**CSCN8010 Foundations of Machine Learning Frameworks**

**Student Name: Hasyashri Bhatt**

**Student Number: 9028501**


---
### **Introduction**

---

> #### **What is Diabetes Progression?**


Diabetes mellitus is a chronic metabolic disorder that affects how the body processes blood glucose (sugar). Over time, poorly managed diabetes can lead to severe complications such as cardiovascular disease, nerve damage, kidney failure, and vision loss.

**Diabetes progression** refers to the worsening of the disease over time, especially in terms of organ function and glucose control. Accurately predicting this progression can allow healthcare providers to intervene earlier and personalize treatment plans for high-risk patients.

In this lab, we focus on **quantifying disease progression one year after baseline** using clinical and physiological variables collected from patients.

> #### **Why is it Important to Predict?**

Predicting diabetes progression is essential for:

- **Early detection** of patients who are at higher risk.
- **Efficient allocation** of medical resources and personalized care.
- **Improving patient outcomes** by enabling timely interventions.
- **Assisting physicians** with data-driven screening tools in clinical decision-making.

By training machine learning models on relevant medical features, we aim to develop a predictive system that can serve as a support tool in healthcare settings.

> #### **Dataset Source**

We use the **Diabetes Dataset** from Scikit-learn's built-in collection. It is a well-known dataset in the machine learning community, often used for regression tasks.

- **Source**: Scikit-learn's `load_diabetes()` function: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html
- **Features**: 10 numeric baseline attributes: age, sex, BMI, average blood pressure, and six blood serum measurements (`s1` to `s6`)
- **Target**: Disease progression measured as a continuous score one year after the baseline
- **Preprocessing**: Each feature is mean-centered and scaled by its standard deviation times the square root of the number of samples, so every column has zero mean and a sum of squares equal to 1 (not unit variance)
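
The scaling of the features can be checked directly. The sketch below is only a sanity check, not part of the lab pipeline; the variable names (`features`, `col_means`, `col_sum_sq`, `col_var`) are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Load only the feature matrix as a pandas DataFrame
features = load_diabetes(as_frame=True).data

col_means = features.mean(axis=0)          # per-column mean
col_sum_sq = (features ** 2).sum(axis=0)   # per-column sum of squares
col_var = features.var(axis=0)             # per-column sample variance

print("largest |mean|:     ", np.abs(col_means).max())
print("sum of squares range:", col_sum_sq.min(), col_sum_sq.max())
print("largest variance:   ", col_var.max())
```

If the columns were scaled to unit variance, the last line would print a value close to 1; a much smaller value indicates the sum-of-squares scaling described in the scikit-learn documentation.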

---

### **Objective**

---

The objective of this project is to build and evaluate machine learning models that can **predict the progression of diabetes one year after baseline measurements**. 

We will compare four modeling techniques:

1. Univariate Polynomial Regression
2. Multivariate Polynomial Regression
3. Decision Trees
4. k-Nearest Neighbors (kNN)
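
As a rough sketch of how these four model families might be set up in scikit-learn: the polynomial degree, the choice of `bmi` as the single feature, `max_depth`, and `n_neighbors` below are illustrative assumptions, not values taken from this lab. The models are fit on the full dataset only to show the API; a proper evaluation needs held-out data.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
all_features = list(X.columns)

# Each entry maps a name to (estimator, columns it is trained on).
# All hyperparameters here are assumed values for illustration.
candidate_models = {
    "univariate_poly": (
        make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
        ["bmi"],  # assumed single feature
    ),
    "multivariate_poly": (
        make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
        all_features,
    ),
    "decision_tree": (
        DecisionTreeRegressor(max_depth=3, random_state=42),
        all_features,
    ),
    "knn": (KNeighborsRegressor(n_neighbors=5), all_features),
}

# Fit every candidate on its own feature subset
fitted = {
    name: model.fit(X[cols], y)
    for name, (model, cols) in candidate_models.items()
}
```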

Each model will be evaluated using three metrics:
- **R-squared (R²)**
- **Mean Absolute Error (MAE)**
- **Mean Absolute Percentage Error (MAPE)**
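
To make the three metrics concrete, here is a minimal NumPy sketch that computes them from their definitions on a small made-up example. The helper name `regression_metrics` and the toy values are hypothetical; in practice `sklearn.metrics` offers `r2_score`, `mean_absolute_error`, and `mean_absolute_percentage_error` for the same purpose.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, MAE and MAPE (as a fraction, not a percentage)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residuals)),
        "mape": np.mean(np.abs(residuals / y_true)),  # assumes no zero targets
    }

# Made-up targets and predictions, for illustration only
metrics = regression_metrics([100, 200, 300], [110, 190, 330])
print(metrics)  # r2 = 0.945, mae ≈ 16.67, mape ≈ 0.0833
```

MAPE divides by the true value, so it is only meaningful when the targets are never zero.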

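We will use a train-validation-test split to evaluate the models fairly: each model is fit on the training set, the models are compared on the validation set, and the selected model is scored once on the held-out test set.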
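
One common way to obtain three disjoint subsets is to call `train_test_split` twice. The sketch below assumes proportions of roughly 70% / 15% / 15% and a fixed `random_state`; both are illustrative choices, not values specified by the lab.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Step 1: hold out about 15% of the data as the final test set (assumed share)
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

# Step 2: split the remainder so validation is about 15% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.15 / 0.85, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Hyperparameters would then be tuned using only the training and validation sets, leaving the test set untouched until the final comparison.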

---

># **PART - 1**

>**1. Get The Data**

We first import the libraries needed to load the diabetes dataset from scikit-learn and to process it in the steps below.

In [2]:
# Import necessary libraries
import pandas as pd 
from sklearn.datasets import load_diabetes

**Load the diabetes dataset from scikit-learn with `load_diabetes()`. Passing `as_frame=True` returns the data as pandas objects rather than NumPy arrays. We then separate the features (`X`) from the target (`y`) and combine them into a single DataFrame so that later operations are easier, renaming the target column to `disease_progression`.**

In [14]:
# Load the diabetes dataset as a pandas DataFrame and not as a numpy array
diabetes = load_diabetes(as_frame=True)

# Separate features and target
X = diabetes.data
y = diabetes.target

# Combine X and y into one DataFrame for easier manipulation, renaming the target to 'disease_progression'
df = pd.concat([X, y.rename("disease_progression")], axis=1)

**Inspect the dataset to know more about it**

In [15]:
# Show first few rows of the DataFrame

df.head()

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | disease_progression |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |
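
Beyond the first few rows, a few more quick checks can help confirm the structure of the data. The sketch below rebuilds the combined DataFrame so it runs on its own; the checks are suggestions, not steps from the original lab.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)
df = pd.concat(
    [diabetes.data, diabetes.target.rename("disease_progression")], axis=1
)

n_rows, n_cols = df.shape                 # number of patients and columns
n_missing = int(df.isna().sum().sum())    # total count of missing values

print(f"rows={n_rows}, columns={n_cols}, missing values={n_missing}")
print(df.dtypes.value_counts())                 # column types
print(df["disease_progression"].describe())     # range and spread of the target
```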


>**2. Problem Statement**

We are tasked with developing predictive models for diabetes progression, defined as a quantitative measurement of disease severity **one year after the baseline**. The dataset provides 10 numeric clinical features for each patient, such as body mass index (BMI), blood pressure, and blood serum measurements.

The goal is to accurately estimate the **future disease progression score** for each individual, based on their baseline characteristics.

This is a **supervised regression** task:

- **Input Features (X)**: 10 continuous, normalized medical indicators
- **Output Variable (y)**: `disease_progression` (continuous target)

The target value is a **medical score** that reflects the change in the patient's diabetic condition after one year.

The solution must be:

- **Interpretable** (for medical relevance)
- **Accurate** (to assist in early detection and care)
- **Generalizable** (to unseen patients)

Ultimately, we aim to develop a model that can serve as a **screening tool** to help physicians identify patients at higher risk of disease deterioration.