# Practical Task 2: Diabetes Progression Multiple Linear Regression
This notebook analyses the `diabetes_dirty.csv` dataset to model diabetes 
progression based on various physiological attributes. 

**Process Overview:**
1. Data cleaning and preprocessing.
2. Train/test split (80/20).
3. Feature scaling (standardisation with `StandardScaler`).
4. Multiple Linear Regression modelling.
5. Model evaluation using the $R^2$ score.

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load the dataset
try:
    df = pd.read_csv('diabetes_dirty.csv')
except FileNotFoundError:
    print("Error: Ensure 'diabetes_dirty.csv' is in your folder.")
    raise  # stop here: every later cell depends on df_clean

# Remove rows with missing values to clean the 'dirty' data
df_clean = df.dropna()

print("Step 2 Complete: Data loaded and rows with missing values removed.")
print(f"Cleaned dataset shape: {df_clean.shape}")

Step 2 Complete: Data loaded and rows with missing values removed.
Cleaned dataset shape: (442, 11)
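
Before dropping rows, it can help to see how much is actually missing and how many rows `dropna()` removes. The sketch below runs on a small made-up frame; the values and the `toy` name are hypothetical and are not taken from `diabetes_dirty.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real CSV
toy = pd.DataFrame({
    "AGE": [59, 48, np.nan, 24],
    "BMI": [32.1, 21.6, 30.5, np.nan],
    "Y": [151, 75, 141, 206],
})

missing_per_column = toy.isna().sum()   # number of NaN values in each column
rows_before = len(toy)
toy_clean = toy.dropna()                # drops every row that contains a NaN
rows_dropped = rows_before - len(toy_clean)

print(missing_per_column)
print(f"Rows dropped: {rows_dropped} of {rows_before}")
```

Running the same two checks on `df` would show whether the cleaning step removed anything at all.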


In [11]:
# 3. Separate the independent variables (X) from the dependent variable (y)
# X is every column except the last; y is the final (progression) column
dependent_variable = df_clean.columns[-1] 
X = df_clean.drop(columns=[dependent_variable]) 
y = df_clean[dependent_variable]

# 4. Generate training (80%) and test (20%) sets 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Step 3 & 4 Complete: Data split into {len(X_train)} training and {len(X_test)} test samples.")

Step 3 & 4 Complete: Data split into 353 training and 89 test samples.
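
The 353/89 split follows from how a fractional `test_size` is turned into a row count: to my understanding, scikit-learn rounds the test size up and gives the remainder to the training set. A quick arithmetic check under that assumption:

```python
import math

n_samples = 442   # rows in the cleaned dataset
test_size = 0.2   # fraction requested for the test set

n_test = math.ceil(test_size * n_samples)   # 88.4 rounds up to 89
n_train = n_samples - n_test                # the rest goes to training

print(n_train, n_test)   # 353 89
```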


In [12]:
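# 5. Standardise the features so their coefficients are on a comparable scale.
# (Ordinary least squares predictions do not depend on feature scale,
# but the size of each coefficient does.)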
scaler = StandardScaler()

# Fit only on training data; transform both training and test data 
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Step 5 Complete: Features standardised using StandardScaler.")

Step 5 Complete: Features standardised using StandardScaler.
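
What `fit_transform` and `transform` do here can be sketched by hand: the mean and standard deviation come from the training rows only and are then reused on the test rows. The arrays below are hypothetical toy data, not values from the notebook.

```python
import numpy as np

# Hypothetical toy data: two features on different scales
X_train_toy = np.array([[20.0, 100.0], [30.0, 120.0], [40.0, 140.0]])
X_test_toy = np.array([[50.0, 110.0]])

# Statistics are computed from the training set only
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)   # population std (ddof=0), as StandardScaler uses

X_train_z = (X_train_toy - train_mean) / train_std
X_test_z = (X_test_toy - train_mean) / train_std   # reuse training statistics

print(X_train_z.mean(axis=0))   # approximately 0 for each feature
print(X_train_z.std(axis=0))    # approximately 1 for each feature
print(X_test_z)
```

The test rows generally do not end up with mean 0 and standard deviation 1, which is expected: they are measured against the training distribution.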


In [13]:
# 6. Generate a multiple linear regression model 
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# 7. Print intercept and coefficients 
print(f"Model Intercept: {model.intercept_:.4f}")
print("\nFeature Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.4f}")

# 8. Generate predictions for the test set 
y_pred = model.predict(X_test_scaled)
print("\nStep 6, 7 & 8 Complete: Model trained and predictions generated.")

Model Intercept: 153.7365

Feature Coefficients:
AGE: 1.7538
SEX: -11.5118
BMI: 25.6071
BP: 16.8289
S1: -44.4489
S2: 24.6410
S3: 7.6770
S4: 13.1388
S5: 35.1612
S6: 2.3514

Step 6, 7 & 8 Complete: Model trained and predictions generated.
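
Because the features were standardised, each coefficient above is the change in predicted progression per one standard deviation of that feature, and the intercept equals the mean of the training target. The sketch below shows how such coefficients map back to original units; it uses synthetic data, and the `toy_*` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: y = 3*x0 + 0.5*x1 + 10 plus small noise
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.normal(50, 10, 200), rng.normal(120, 30, 200)])
y_toy = 3.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + 10.0 + rng.normal(0, 1, 200)

toy_scaler = StandardScaler()
toy_model = LinearRegression().fit(toy_scaler.fit_transform(X_toy), y_toy)

# Per-original-unit coefficients: divide by each feature's standard deviation
coef_original = toy_model.coef_ / toy_scaler.scale_
# Undo the centring to recover the intercept in original units
intercept_original = toy_model.intercept_ - np.sum(coef_original * toy_scaler.mean_)

print(coef_original)        # close to [3.0, 0.5]
print(intercept_original)   # close to 10.0
print(np.isclose(toy_model.intercept_, y_toy.mean()))
```

The recovered values match what a fit on the unscaled features would give, so standardising changes how the coefficients read, not what the model predicts.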


In [14]:
# 9. Compute R-squared (R2) score 
r2 = r2_score(y_test, y_pred)

print(f"Model R-squared Score: {r2:.4f}")
print(f"This indicates that {r2*100:.2f}% of the variance is explained by the model.")

Model R-squared Score: 0.4526
This indicates that 45.26% of the variance is explained by the model.
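
The score reported by `r2_score` is $R^2 = 1 - SS_{res} / SS_{tot}$. A minimal hand-rolled version, with a hypothetical helper name `r_squared` and made-up numbers, makes the two sums explicit:

```python
import numpy as np

def r_squared(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)           # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical values, not from the notebook
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_hat_toy = [2.5, 5.5, 6.5, 9.5]

print(r_squared(y_true_toy, y_hat_toy))
print(r_squared(y_true_toy, [6.0, 6.0, 6.0, 6.0]))   # always predicting the mean
```

Always predicting the mean of the evaluated targets scores 0, so a test-set score of 0.4526 should be read against that baseline.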


### **Final Reflection and Model Interpretation**

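* **Data Cleaning:** Dropping rows with null values from `diabetes_dirty.csv` 
    left 442 complete rows and 11 columns. `dropna()` only handles missing 
    values; it does not detect outliers or inconsistent units.
* **Feature Scaling:** We applied `StandardScaler`, fitted on the training set 
    only, so that each coefficient is the change in predicted progression per 
    one standard deviation of its feature and can be compared across features 
    (e.g., Blood Pressure vs BMI). For ordinary least squares, scaling changes 
    the coefficients but not the predictions or the $R^2$ score. On this scale, 
    S1 (-44.45), S5 (35.16) and BMI (25.61) carry the largest coefficients.
* **R-squared Insight:** Our test-set $R^2$ score of **0.4526** means the model 
    explains about 45% of the variance in diabetes progression (y) on the 89 
    held-out samples, so more than half of the variance remains unexplained.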