# Supervised Learning -> Linear Regression

The first four sections (1-4) are identical across all regression models.
Keeping this part unchanged allows easy model comparison
and prevents pipeline-related mistakes.
Everything after that depends on the specific model.

1. Project setup and common pipeline
2. Dataset loading
3. Train-test split
4. Feature scaling (why we do it)
____
5. What is this model? (Intuition)
6. Model training
7. Model parameters interpretation
8. Predictions
9. Model evaluation
10. When to use it and when not to
11. Model persistence
12. Mathematical formulation (deep dive)

-----------------------------------------------------

## How this notebook should be read

This notebook is designed to be read **top to bottom**.

Before every code cell, you will find a short explanation describing:
- what we are about to do
- why this step is necessary
- how it fits into the overall process

The goal is not just to run the code,
but to understand what is happening at each step
and be able to adapt it to your own data.

-----------------------------------------------------

## What is Linear Regression? 

Linear Regression is one of the simplest and most widely used machine learning models.

At a high level, Linear Regression tries to **draw a straight line through the data**
in such a way that the line is as close as possible to all data points.

Each data point represents:
- some input values (features)
- a real observed output (target)

The model:
- places a line (or a plane, or a higher-dimensional surface)
- measures the vertical distance between each data point and that line
- adjusts the line to minimize the sum of the squared distances over all points

These distances are the prediction errors (also called residuals).

If the line is too high:
- predictions are systematically too large

If the line is too low:
- predictions are systematically too small

By minimizing the total error, the model learns the **overall trend** in the data.

Once the line is learned, the model can:
- take new input values
- evaluate the learned line at those values
- return a predicted output value
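
The idea above can be sketched in a few lines of NumPy. This is only an illustration on made-up data (the values, the true slope of 3 and the intercept of 2 are assumptions chosen for the example), not the training code used later in the notebook.

```python
import numpy as np

# Toy data (hypothetical): one feature, target roughly 3 * x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=50)

# Design matrix: a column of ones lets the line have an intercept.
A = np.column_stack([x, np.ones_like(x)])

# Least squares: slope and intercept that minimize the sum of squared errors.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# Prediction errors (residuals) of the fitted line on the data it was fitted to.
residuals = y - (slope * x + intercept)

# Prediction for a new input: evaluate the learned line at that value.
x_new = np.array([4.0])
y_pred = slope * x_new + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={y_pred[0]:.2f}")
```

The fitted slope and intercept should land close to the values used to generate the data, and the residuals average out to zero because the line includes an intercept.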

-----------------------------------------------------

## Why we start with intuition

Understanding the intuition first is important because:

- the code will directly reflect this idea
- every training step is just an automated way of adjusting the line
- every prediction is just a value read off that line

If this mental model is clear,
the rest of the notebook becomes much easier to follow.

-----------------------------------------------------

## What you should expect from the results

Before training the model, it is important to set expectations.

With Linear Regression, you should expect:
- reasonable but not perfect predictions
- a simple and interpretable model
- performance that captures general trends, not complex patterns

If the model performs poorly:
- it does not necessarily mean something is wrong
- it may simply mean the relationship is not linear

This model is often used as:
- a baseline
- a reference point for more complex models

____________________________________________


## 1. Project setup and common pipeline

In this section we prepare everything that is shared across all regression models.

The goal is to:
- set up a clean and reproducible environment
- import all common dependencies
- define a standard pipeline that will not change when switching models

This part of the notebook is intentionally kept simple and consistent.
If something changes here, it should change in all models.


### Why having a common pipeline matters

Using the same pipeline for all models allows us to:
- compare models fairly
- avoid data leakage
- reduce implementation errors
- focus on understanding the model instead of debugging the setup

From this point on, every model will start from the exact same data preparation steps.
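
As a minimal sketch of what "avoiding data leakage" means in practice, the example below fits a scaler on the training rows only and then reuses it on the test rows. The data is synthetic and the names `X_demo`, `y_demo`, `X_tr` and `X_te` are assumptions made for this illustration; scaling itself is covered later in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 200 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler = StandardScaler()
# Mean and standard deviation are learned from the training rows only.
X_tr_scaled = scaler.fit_transform(X_tr)
# The test rows are transformed with the training statistics; the scaler is not refitted.
X_te_scaled = scaler.transform(X_te)
```

Calling `fit_transform` on the full dataset before splitting would let test-set statistics influence preprocessing, which is the kind of mistake a fixed pipeline helps prevent.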


```python
# Used for numerical operations and data manipulation.
import numpy as np
import pandas as pd

# Provides a real-world regression dataset for demonstration purposes.
from sklearn.datasets import fetch_california_housing

# Ensures a clean separation between training and test data.
from sklearn.model_selection import train_test_split

# Applies feature scaling to ensure numerical stability and pipeline consistency.
from sklearn.preprocessing import StandardScaler

# Used later to assess model performance.
from sklearn.metrics import mean_squared_error, r2_score

# Used to save trained models and preprocessing objects.
import joblib
```



____________________________________

## 2. Dataset loading

In this section we load the dataset that will be used throughout the notebook.

The purpose of this step is to:
- obtain real data to work with
- clearly separate input features from the target variable
- create a structure that can be easily replaced with custom datasets

At this stage, no modeling is involved.
We are only defining what the model will learn from.


### About the dataset

We use the California Housing dataset, a classic regression dataset.

Each row represents a census block group (a small housing district) in California.
Each column represents a numerical feature describing that district (8 features in total).
The target variable is a continuous value: the median house value of the district,
expressed in units of $100,000.


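```python
# as_frame=True returns pandas objects instead of NumPy arrays.
data = fetch_california_housing(as_frame=True)

X = data.data    # input features (DataFrame)
y = data.target  # target variable (Series)
```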


### What we obtained

- `X`  
  A table of input features used by the model to make predictions.

- `y`  
  The target variable we want the model to predict.

This separation is fundamental:
- the model learns patterns from `X`
- the model is evaluated by comparing predictions to `y`

Adapting this step to your own data:
- `X` should contain only feature columns
- `y` should contain the target variable
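
For a custom dataset, the same separation might look like the sketch below. The table, its column names and the `target_column` variable are hypothetical; in practice the DataFrame would typically come from a file or a database.

```python
import pandas as pd

# Hypothetical table standing in for your own data.
df = pd.DataFrame({
    "size_m2": [50, 80, 120, 65],
    "rooms": [2, 3, 5, 3],
    "price": [150_000, 230_000, 390_000, 200_000],
})

target_column = "price"  # assumed name of the column to predict

# Features: every column except the target.
X_own = df.drop(columns=[target_column])
# Target: the single column we want the model to predict.
y_own = df[target_column]
```

Dropping the target by name, rather than selecting features by position, keeps the target from ending up inside `X` by accident if the column order changes.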

____________________________________

## 3. Train-test split

In this section we split the dataset into two separate parts:
- a training set
- a test set

This step is fundamental in machine learning and applies to almost every model.


### Why we split the data

The goal of a machine learning model is not to memorize data,
but to make good predictions on **new, unseen data**.

If we trained and evaluated the model on the same data:
- we would get overly optimistic results
- we would not know how well the model generalizes

By splitting the data:
- the training set is used to learn patterns
- the test set is used only for evaluation
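
The gap between training and test performance can be made visible with a small experiment. The sketch below uses synthetic data that was deliberately built to be easy to memorize (few rows, many irrelevant features); the sizes and names are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 40 rows, 30 features, only the first one matters.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 30))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_tr, y_tr)

# Score on the rows the model was fitted to versus rows it has never seen.
r2_train = r2_score(y_tr, model.predict(X_tr))
r2_test = r2_score(y_te, model.predict(X_te))

print(f"R2 on training data: {r2_train:.3f}")
print(f"R2 on test data:     {r2_test:.3f}")
```

With almost as many features as training rows, the model can fit the training set nearly perfectly, so the training score alone would paint a far too optimistic picture.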


```python
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
```


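### What these parameters mean

- `test_size=0.2`  
  20% of the data is reserved for testing,  
  80% is used for training.

- `random_state=42`  
  Fixes the seed of the random shuffle,  
  so the same rows end up in each set every time the notebook is run.

The split ratio can be changed depending on the size of the dataset.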
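
The effect of `test_size` and `random_state` can be checked on a small stand-in array. The data below is hypothetical and only serves to make the resulting shapes easy to read.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical): 100 rows, 2 features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Same call twice with the same seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # 80 training rows and 20 test rows

# With a fixed random_state, both calls select exactly the same rows.
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```

Without a fixed `random_state`, each run would shuffle the rows differently, and metrics would vary slightly from run to run, which makes model comparison harder.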

### What we have after this step

- `X_train`, `y_train`  
  Data used to train the model.

- `X_test`, `y_test`  
  Data kept aside and used only for evaluation.

From this point on:
- the model must never see the test data during training
- all learning happens exclusively on the training set
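
One quick sanity check, sketched here on a hypothetical DataFrame, is to confirm that no row appears in both sets. `train_test_split` keeps the original pandas index, so the two index sets can be compared directly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical): 50 rows with a default integer index.
X_demo = pd.DataFrame({"feature": range(50)})
y_demo = pd.Series(range(50), name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Row labels that appear in both the training and the test set.
shared_rows = set(X_tr.index) & set(X_te.index)
print(len(shared_rows))
```

An empty intersection confirms that the two sets are disjoint. This check assumes the index labels are unique; duplicated rows in the source data would need to be handled before splitting.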
