# 🚀 Linear Regression



---
## 📦 Step 0: Importing Our Tools

Before we build anything, we need the right tools. In Python, we don't write everything from scratch; we import libraries created by smart people around the world!

### 🧑‍🏫 Modules
*   `numpy` (Numerical Python): Think of this as a super-powered calculator. It handles complex math and large grids of numbers (arrays/matrices) extremely fast.
*   `pandas`: This is basically Excel on steroids. It lets us load data, manipulate columns/rows, and view it nicely in a table (DataFrame).
*   `matplotlib.pyplot`: This is our paintbrush. We use it to draw charts and graphs so we can *see* our data and errors visually.
*   `sklearn` (Scikit-Learn): The ultimate Machine Learning toolbox! It contains everything we need:
    *   `load_diabetes`: A built-in dataset we can use instantly.
    *   `train_test_split`: A tool to randomly chop our data into "study" and "exam" sets.
    *   `mean_absolute_error, mean_squared_error, r2_score`: Our "grading tools" to see how well the model performed.
    *   `LinearRegression`: The actual ML algorithm we will train.
    *   `RidgeCV, LassoCV`: Advanced forms of linear regression to stop the model from memorizing too much (Regularization).
    *   `StandardScaler`: A tool to put all numerical features on the same "scale" by shifting each one to mean 0 and standard deviation 1 (e.g., so heights in cm and weights in kg become comparable).
    *   `Pipeline`: Chains steps such as scaling and a model into one object, so we can fit and predict with a single call.
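
To make the idea of putting features on the same scale concrete, here is a minimal sketch of the arithmetic `StandardScaler` performs, done by hand with NumPy. The height and weight values are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: column 0 is height in cm, column 1 is weight in kg.
X = np.array([
    [150.0, 50.0],
    [160.0, 60.0],
    [170.0, 70.0],
    [180.0, 80.0],
])

# Standardize each column: subtract its mean, divide by its standard deviation.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)
X_scaled = (X - col_mean) / col_std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

After this step, a change of "1" means "one standard deviation" for every feature, whatever its original unit was.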


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Scikit-learn datasets and tools
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Set plot style for better visuals
try:
    plt.style.use("seaborn-v0_8-whitegrid")
except OSError:
    plt.style.use("ggplot")

# Convenience: show more columns in dataframe printing
pd.set_option("display.max_columns", 50)


---
## 1️⃣ Introduction to the Problem & The Data

**💡 Concept:** Machine learning is about finding patterns in data to predict the future. Here, we are trying to predict a **continuous number** (disease progression of diabetes). This makes it a **Regression** problem, not a Classification problem.

Let's load our data and take a look! We will use the built-in Scikit-Learn **Diabetes dataset**.

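### 📊 Understanding the Columns (Features)
The dataset contains 442 patients, and for each patient, we have 10 baseline variables (features). Scikit-Learn ships these features already mean-centered and scaled, which is why the table further down shows small values around zero instead of raw ages or BMI readings: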
1.  **age**: Age (in years)
2.  **sex**: Sex
3.  **bmi**: Body mass index
4.  **bp**: Average blood pressure
5.  **s1 (tc)**: Total serum cholesterol
6.  **s2 (ldl)**: Low-density lipoproteins
7.  **s3 (hdl)**: High-density lipoproteins
8.  **s4 (tch)**: Total cholesterol / HDL
9.  **s5 (ltg)**: Log of serum triglycerides level
10. **s6 (glu)**: Blood sugar level

**Target Variable (What we want to predict!):**
*   **Target**: A quantitative measure of disease progression one year after baseline. (A higher number means the diabetes got worse).


In [8]:
# Load the built-in Diabetes dataset
data = load_diabetes(as_frame=True)

# Features (X) and Target (y)
X_df = data.data.copy()
y = data.target.copy()

# Combine for easier EDA
df = X_df.copy()
df["target"] = y

print(f"Dataset has {df.shape[0]} rows (patients) and {X_df.shape[1]} features (columns).")
display(df.head())


Dataset has 442 rows (patients) and 10 features (columns).


|  | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |
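
The feature values above are tiny because, according to the dataset description, each column was mean-centered and then divided by its standard deviation times the square root of the number of samples, so the squares in each column sum to 1. Here is a small sketch of that transformation applied to made-up "raw ages" (these values are an assumption for illustration, not the real patient data).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw ages for 442 patients (not the real data).
raw_age = rng.uniform(20, 80, size=442)

# Mean-center, then divide by (standard deviation * sqrt(number of samples)).
centered = raw_age - raw_age.mean()
scaled = centered / (raw_age.std() * np.sqrt(raw_age.size))

print(scaled[:3])            # small values around zero, like the table above
print(np.sum(scaled ** 2))   # approximately 1.0
```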



---
## 2️⃣ Train/Test Split & Baseline Model

**💡 Concept:** We never test a model on the exact same data it practiced with! If we do, it might just memorize the answers. We split our data into **Training** (practice) and **Testing** (final evaluation).


In [9]:
# Split the data! 80% for training (homework), 20% for testing (exam)
X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, test_size=0.2, random_state=42
)

print(f"Training on {X_train.shape[0]} samples. Testing on {X_test.shape[0]} samples.")


Training on 353 samples. Testing on 89 samples.
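
What the split above does can be sketched by hand: shuffle the row indices, then cut them into two groups. The helper name `simple_train_test_split` is made up for this illustration, and it will not pick the same rows as Scikit-Learn's `random_state=42`, because the two use different random number generators. Scikit-Learn rounds the test count up, which is how 442 × 0.2 = 88.4 becomes 89.

```python
import numpy as np

def simple_train_test_split(n_samples, test_size=0.2, seed=42):
    """Return shuffled (train_idx, test_idx) index arrays (illustrative helper)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # Round the test count up, as Scikit-Learn does for a fractional test_size.
    n_test = int(np.ceil(n_samples * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = simple_train_test_split(442)
print(len(train_idx), len(test_idx))  # 353 89
```

No index lands in both groups, which is the whole point: the "exam" rows are never seen during "study".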


### The Baseline (The "Dumb" Model)

**💡 Concept:** Before we do fancy AI, what if we just guessed the **average** disease progression for EVERY patient? We call this our baseline. If our AI cannot beat the average guess, our AI is useless!



In [10]:
# Guess the average of the training set for every single test patient
mean_guess = y_train.mean()
baseline_predictions = np.full(shape=y_test.shape, fill_value=mean_guess)

baseline_mae = mean_absolute_error(y_test, baseline_predictions)

print(f"Guessing the average ({mean_guess:.1f}) gives a Mean Absolute Error (MAE) of: {baseline_mae:.2f}")
print("Our ML model MUST get an error lower than this to be considered useful!")


Guessing the average (153.7) gives a Mean Absolute Error (MAE) of: 64.01
Our ML model MUST get an error lower than this to be considered useful!


---
## 3️⃣ Building the ML Model (Linear Regression)

**💡 Concept:** Linear regression tries to draw the "Best Fit Line" through our data. With one feature that is a literal line; with our 10 features it is the same idea in more dimensions. Mathematically, it's finding one weight (w) for each feature, plus an intercept.
Equation: `Target = (w1 * age) + (w2 * sex) + (w3 * bmi) + ... + (w10 * s6) + intercept`
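
The equation is just a weighted sum, so a prediction is a single dot product. Here is a tiny sketch with three features; the weights, intercept, and patient values are all hypothetical numbers chosen for easy arithmetic, not the ones the model learns.

```python
import numpy as np

# Hypothetical weights for three features (age, bmi, bp) and a hypothetical intercept.
weights = np.array([2.0, 500.0, 300.0])
intercept = 150.0

# One hypothetical patient, using small scaled values like those in the dataset.
patient = np.array([0.04, 0.06, 0.02])

# Target = (w1 * age) + (w2 * bmi) + (w3 * bp) + intercept
prediction = patient @ weights + intercept
print(prediction)  # about 186.08
```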

**🤔 Frequent Doubt:**
> *"How does the computer know what the 'w' (weights) are? Does it just guess?"*
> **Explanation:** "Many ML models do start with guesses (random values or zeros), look at how wrong the predictions are (the Error), and repeatedly adjust the weights to make the error smaller. That loop is called gradient descent. Scikit-Learn's `LinearRegression` takes a shortcut: for this problem the weights with the smallest squared error can be calculated directly, so it solves for them in one step instead of iterating."
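
The guess-and-adjust loop described above is known as gradient descent. For ordinary least squares there is also a direct solution, and Scikit-Learn's documentation describes `LinearRegression` as plain least squares computed with a direct solver. Here is a minimal sketch comparing the two approaches on synthetic data; the data, learning rate, and iteration count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with known weights (all values made up for illustration).
X = rng.normal(size=(200, 3))
true_w = np.array([3.0, -2.0, 0.5])
true_b = 4.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is learned as one more weight.
X_aug = np.column_stack([X, np.ones(len(X))])

# Approach 1: direct least-squares solution. No starting guess, no loop.
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_direct, b_direct = solution[:-1], solution[-1]

# Approach 2: gradient descent. Start at zeros, then repeatedly step
# against the gradient of the mean squared error.
w_gd = np.zeros(X_aug.shape[1])
learning_rate = 0.1
for _ in range(2000):
    residual = X_aug @ w_gd - y
    gradient = 2 * X_aug.T @ residual / len(y)
    w_gd -= learning_rate * gradient

print("direct:          ", w_direct, b_direct)
print("gradient descent:", w_gd[:-1], w_gd[-1])
```

Both routes land on essentially the same weights; the direct solve simply gets there without iterating.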


In [11]:
# 1. Initialize the model
lin_reg = LinearRegression()

# 2. Train the model (Model learns the weights here!)
lin_reg.fit(X_train, y_train)

# 3. Make predictions on the Test set (The Final Exam)
y_pred = lin_reg.predict(X_test)

# 4. Evaluate how well it did
mae = mean_absolute_error(y_test, y_pred)
# mean_squared_error returns the MSE; take the square root to get the RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"🔥 ML Model MAE: {mae:.2f} (Beat the baseline of {baseline_mae:.2f}!)")
print(f"🔥 ML Model RMSE: {rmse:.2f}")
print(f"🔥 ML Model R-Squared: {r2:.2f}")


🔥 ML Model MAE: 42.79 (Beat the baseline of 64.01!)
🔥 ML Model RMSE: 53.85
🔥 ML Model R-Squared: 0.45


### Evaluating Performance (Metrics)


**💡 Concept:** 
- **MAE (Mean Absolute Error):** Easy to explain. The average size of our errors, ignoring whether we guessed too high or too low.
- **RMSE (Root Mean Squared Error):** Punishes large errors heavily (because errors are squared before averaging, then the square root brings the result back to the target's units). Useful if being slightly wrong is okay, but being *very* wrong is disastrous.
- **R² (R-Squared):** The fraction of the target's variance the model explains. 1.0 is a perfect score! 0 is no better than always guessing the average, and a negative score means the model did worse than that guess. (Here it's 0.45, meaning our features explain about 45% of the variation in disease progression on the test set. The rest is random noise, or maybe we need better features like diet/genetics to get a higher score).
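
To see exactly what each metric computes, here is a small worked example by hand with NumPy. The true values and predictions are made-up numbers picked so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical true values and predictions (made-up numbers).
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 230.0, 250.0])

errors = y_true - y_hat  # [-10, 10, -30, 0]

# MAE: average of the absolute errors -> (10 + 10 + 30 + 0) / 4
mae = np.mean(np.abs(errors))

# MSE: average of the squared errors -> (100 + 100 + 900 + 0) / 4
mse = np.mean(errors ** 2)

# RMSE: square root of the MSE, back in the target's units
rmse = np.sqrt(mse)

# R^2: 1 - (model's squared error) / (squared error of guessing the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
# MAE=12.50  MSE=275.00  RMSE=16.58  R2=0.912
```

Notice how the single error of 30 dominates the MSE and RMSE, while the MAE treats it as just one of four errors.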


In [12]:
# Let's peek at the "Weights" the model learned
weights_df = pd.DataFrame({
    "Feature": X_train.columns,
    "Weight": lin_reg.coef_
}).sort_values(by="Weight", key=abs, ascending=False)

print("Top 5 most impactful features:")
display(weights_df.head())


Top 5 most impactful features:


|  | Feature | Weight |
|---|---|---|
| 4 | s1 | -931.488846 |
| 8 | s5 | 736.198859 |
| 2 | bmi | 542.428759 |
| 5 | s2 | 518.062277 |
| 3 | bp | 347.703844 |


---
## 5️⃣ Regularization: Keeping the Model in Check (Ridge & Lasso)

**💡 Concept:** Sometimes models memorize the training data too well (Overfitting) and assign crazy high weights to certain features. **Regularization** acts as a penalty fee: the model pays a fine that grows with the size of its weights. Ridge (L2) charges for the squared size of each weight, while Lasso (L1) charges for the absolute size.
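
In symbols, the two "penalty fees" can be written as below. This follows the form given in the Scikit-Learn documentation for `Ridge` and `Lasso`; the notation is introduced here for illustration: $X$ is the feature matrix, $y$ the target, $w$ the weights, $n$ the number of training samples, and $\alpha$ the size of the fine. The intercept is left out to keep things short.

$$
\text{Ridge (L2):}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso (L1):}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

In both cases the first term is the usual prediction error and the second term is the fine. A larger $\alpha$ means a bigger fine and smaller weights.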


Let's test this! Ridge only shrinks weights toward zero, while Lasso's L1 penalty can push some of them all the way to exactly 0. We'll see if Lasso does that for any of our 10 features.


In [15]:
# We use CV (Cross-Validation) versions so Scikit-Learn automatically finds the best "fine" (alpha/lambda) to charge the model.
alphas = np.logspace(-4, 4, 100)

# Ridge (L2)
ridge_model = Pipeline([
    ("scaler", StandardScaler()), # Always scale before regularizing!
    ("ridge", RidgeCV(alphas=alphas, cv=5))
])

# Lasso (L1)
lasso_model = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", LassoCV(alphas=alphas, cv=5, max_iter=20000))
])

ridge_model.fit(X_train, y_train)
lasso_model.fit(X_train, y_train)

# Compare how many features Lasso "kicked out" (set weight to 0)
lasso_weights = lasso_model.named_steps["lasso"].coef_
# Count weights that are zero, allowing for tiny floating-point leftovers
zero_weights_count = int(np.sum(np.abs(lasso_weights) < 1e-7))

print(f"Lasso set {zero_weights_count} out of {X_train.shape[1]} feature weights to EXACTLY Zero!")
print("It automatically did Feature Selection for us! 🎩✨")


Lasso set 3 out of 10 feature weights to EXACTLY Zero!
It automatically did Feature Selection for us! 🎩✨
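
Why does Lasso produce exact zeros while Ridge does not? In the simplified textbook case where the features are uncorrelated and on the same scale, each penalty has a simple effect on the unpenalized weights: the L2 penalty divides every weight by a constant, while the L1 penalty subtracts a fixed amount and clips at zero (soft-thresholding). The sketch below shows only that simplified case, with hypothetical weights and a simplified penalty scale. It is not what `RidgeCV` or `LassoCV` run internally.

```python
import numpy as np

def ridge_shrink(w, penalty):
    """L2-style shrinkage: scale every weight toward zero (illustrative helper)."""
    return w / (1.0 + penalty)

def lasso_shrink(w, penalty):
    """L1-style soft-thresholding: subtract the penalty, clip at zero (illustrative helper)."""
    return np.sign(w) * np.maximum(np.abs(w) - penalty, 0.0)

# Hypothetical unpenalized weights: two large, two small.
unpenalized = np.array([5.0, -3.0, 0.4, -0.2])
penalty = 0.5

ridge_w = ridge_shrink(unpenalized, penalty)
lasso_w = lasso_shrink(unpenalized, penalty)

print("ridge:", ridge_w)  # every weight smaller, none exactly zero
print("lasso:", lasso_w)  # the two small weights become exactly zero
print("zeros under lasso:", int(np.sum(lasso_w == 0)))  # 2
```

Small weights cannot "afford" the fixed L1 fee, so they drop out entirely. That is the feature selection effect seen in the output above.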
