# Deep Learning – Regression (scikit-learn)

This notebook is part of the **ML-Methods** project.

It introduces **Deep Learning for supervised regression**
using the scikit-learn implementation of neural networks.

As with all other notebooks in this project,
the initial sections focus on data preparation
and are intentionally repeated.

This ensures:
- conceptual consistency
- fair comparison across models
- a unified learning pipeline


___

## Notebook Roadmap (standard ML-Methods)

1. Project setup and common pipeline  
2. Dataset loading  
3. Train-test split  
4. Feature scaling (why we do it)  

----------------------------------

5. What is this model? (Intuition)  
6. Model training  
7. Model behavior and key parameters  
8. Predictions  
9. Model evaluation  
10. When to use it and when not to  
11. Model persistence  
12. Mathematical formulation (deep dive)  
13. Final summary – Code only


___
## How this notebook should be read

This notebook is designed to be read **top to bottom**.

Before every code cell, you will find a short explanation describing:
- what we are about to do
- why this step is necessary
- how it fits into the overall process

The goal is not just to run the code,
but to understand how **deep learning regression**
differs from classical regression models
and how it fits into the supervised learning pipeline.


___
## What is Deep Learning (in this context)?

In this notebook,
Deep Learning refers to **neural networks with multiple layers**
used to solve **regression problems**.

Unlike classification:
- the target is a continuous value
- there are no class labels
- the model predicts a real number

The neural network learns a function:

input features → continuous output


___
## What do we want to achieve?

Our objective is to train a model that:
- receives a vector of numerical features
- processes them through multiple layers
- outputs a single continuous value

The model learns how combinations of input features
map to a numerical target.

This is useful when:
- relationships are non-linear
- classical linear regression is insufficient
- feature interactions are complex


___
## Why use scikit-learn for Deep Learning regression?

scikit-learn provides a high-level abstraction
for neural networks through `MLPRegressor`.

This allows us to:
- reuse the same ML pipeline as classical models
- focus on concepts rather than low-level training details
- understand *what* deep learning regression does
  before implementing it manually

This notebook acts as a **bridge**
between classical regression
and full deep learning frameworks
such as PyTorch and TensorFlow.


___
## What you should expect from the results

With Deep Learning (scikit-learn regression),
you should expect:

- ability to model non-linear relationships
- improved performance on complex patterns
- sensitivity to feature scaling
- longer training times than linear models

However:
- interpretability is low
- hyperparameter tuning is important
- the model behaves as a black box


___
## 1. Project setup and common pipeline

In this section we set up the common pipeline
used across regression models in this project.

Although this notebook uses a neural network,
the data preparation steps
remain identical to other regression approaches.


In [2]:
# ====================================
# Common imports used across regression models
# ====================================

import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score
)

from pathlib import Path
import joblib
import matplotlib.pyplot as plt


### What changes compared to classical regression

Compared to linear regression:
- the target remains continuous
- the pipeline remains identical
- the model becomes non-linear

The main difference lies in the model itself,
not in the surrounding workflow.

In the next section,
we will load the dataset
used for the regression task.


___
## 2. Dataset loading

In this section we load the dataset
used for the deep learning regression task.

We use a **regression-specific dataset**
with a continuous target variable,
which allows us to evaluate how neural networks
behave when predicting real-valued outputs.


In [3]:
# ====================================
# Dataset loading
# ====================================

data = fetch_california_housing(as_frame=True)

X = data.data
y = data.target


### Inputs and target

- `X` contains the input features
- `y` contains the target variable

This is a **supervised regression problem**:
- each sample has a continuous target value
- the goal is to predict a real number, not a class

### Why this dataset

The California Housing dataset is well suited
for regression because:
- it contains numerical features
- relationships are non-linear
- target values are continuous

This makes it a good benchmark
for comparing classical regression
and deep learning regression models.

At this stage:
- data is still in pandas format
- no preprocessing has been applied yet

In the next section,
we will split the dataset
into training and test sets.


___
## 3. Train-test split

In this section we split the dataset
into training and test sets.

This step allows us to evaluate
how well the neural network regressor
generalizes to unseen data.


In [4]:
# ====================================
# Train-test split
# ====================================

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)


### Why this step is essential

A regression model must be evaluated
on data it has never seen during training.

By splitting the data:
- the training set is used to learn the mapping
  from features to target values
- the test set is used only for evaluation

This prevents overly optimistic results
and reflects real-world performance.

### Choice of split ratio

An 80 / 20 split is a common default:
- enough data to train the model
- enough data to reliably evaluate predictions

At this point:
- training and test data are separated
- no preprocessing has been applied yet
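
As a quick sanity check of the 80 / 20 split,
the sketch below applies `train_test_split` to placeholder arrays
with the same shape as the California Housing features
(20,640 samples, 8 features).
The names `X_demo` and `y_demo` are illustrative stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the California Housing features
# (20,640 samples, 8 features); the values themselves are irrelevant here.
X_demo = np.zeros((20640, 8))
y_demo = np.zeros(20640)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Number of rows that end up in each subset
print(X_tr.shape, X_te.shape)
```

Under these assumptions, roughly four fifths of the rows are used for training
and the remaining fifth is held out for evaluation.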

In the next section,
we will apply **feature scaling**.

For deep learning regression,
this step is **mandatory**.


___
## 4. Feature scaling (why we do it)

In this section we apply feature scaling
to the input data.

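For deep learning regression models,
feature scaling is **mandatory in practice**:
the model will run on unscaled data,
but training is typically slow, unstable,
or ends in a poor solution.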

Neural networks are trained using gradient-based optimization,
which is highly sensitive to the scale of input features.


In [5]:
# ====================================
# Feature scaling
# ====================================

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


### Why we use standardization here

We use **standardization** for feature scaling
because neural networks rely on gradients
to update their parameters.

Standardization:
- centers features around zero
- ensures comparable variance across features
- improves numerical stability during training

This helps:
- gradients behave more predictably
- optimization converge faster
- training remain stable across layers
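
The effect of standardization can be checked numerically.
The following is a minimal sketch on synthetic data
(two made-up features on very different scales, not the housing data):
after scaling, the training features have mean close to 0 and standard deviation close to 1,
and the test set is transformed with the statistics learned on the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (illustrative values)
X_train_demo = np.column_stack([
    rng.normal(loc=5.0, scale=2.0, size=1000),       # small-scale feature
    rng.normal(loc=1500.0, scale=400.0, size=1000),  # large-scale feature
])
X_test_demo = np.column_stack([
    rng.normal(loc=5.0, scale=2.0, size=200),
    rng.normal(loc=1500.0, scale=400.0, size=200),
])

scaler_demo = StandardScaler()
X_train_std = scaler_demo.fit_transform(X_train_demo)  # learns mean and std
X_test_std = scaler_demo.transform(X_test_demo)        # reuses training statistics

print("train mean:", X_train_std.mean(axis=0).round(3))
print("train std: ", X_train_std.std(axis=0).round(3))
print("test mean: ", X_test_std.mean(axis=0).round(3))
```

The test mean is close to zero but not exactly zero,
because the test set is scaled with the training mean and standard deviation.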

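At this stage:
- the features are standardized and ready for training
- the scaler was fitted on the training set only,
  so no test-set statistics leak into training
- `X_train_scaled` and `X_test_scaled` are NumPy arrays
  (by default, `StandardScaler` returns arrays even when given a DataFrame)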

In the next section,
we will explain **what this model is**
and how a neural network performs **regression**
using scikit-learn.


___
## 5. What is this model? (Deep Learning Regression)

Before training the model,
it is important to understand
what a neural network does
when it is used for **regression**.

Unlike classification,
the goal is not to assign a class label,
but to predict a **continuous numerical value**.


### What problem are we solving?

In a regression problem:
- each input sample is described by multiple features
- the target is a real number
- predictions can take infinitely many values

The model learns a function:

input features → continuous output

The objective is to make predictions
as close as possible to the true values.


### How does a neural network perform regression?

A neural network performs regression by:

1. Receiving a vector of input features  
2. Combining features through weighted sums  
3. Applying non-linear transformations  
4. Producing a single numerical output  

Each layer transforms the input
into a representation
that is more informative for prediction.


### What is different from linear regression?

Linear regression assumes:
- a linear relationship between inputs and target
- a single global equation

Neural network regression:
- does not assume linearity
- learns complex, non-linear relationships
- adapts to feature interactions automatically

This makes neural networks
more expressive than linear models.


### Why no activation in the output layer?

In regression:
- the output represents a real value
- there is no notion of class probability

For this reason:
- the output layer uses a linear (identity) activation,
  which leaves the weighted sum unchanged
- the model can predict any real number

Non-linear activations are used only
in the hidden layers.
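
This can be verified on a fitted `MLPRegressor`,
which exposes the output activation as the attribute `out_activation_`.
The sketch below uses a small synthetic dataset and a deliberately short training run;
the data and the model size are illustrative assumptions.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1]  # synthetic continuous target

demo_model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=50, random_state=0)
with warnings.catch_warnings():
    # A short run may not converge; that is irrelevant for this check
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    demo_model.fit(X_demo, y_demo)

print(demo_model.activation)       # hidden-layer activation
print(demo_model.out_activation_)  # output-layer activation
```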


### How learning happens conceptually

Learning follows an iterative process:

1. The model makes a numerical prediction  
2. The prediction is compared to the true value  
3. An error is computed  
4. Model parameters are adjusted  
5. The process repeats  

Over time,
the model reduces the prediction error
and improves accuracy.
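
The five steps above can be sketched in a few lines of plain Python.
To keep the loop readable, this example swaps the neural network
for a one-feature linear model trained with plain gradient descent;
the data, learning rate, and number of steps are illustrative assumptions.

```python
# Toy data generated from y = 2x + 1 (illustrative)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0  # model parameters, starting from zero
learning_rate = 0.05
n = len(xs)

for step in range(2000):
    # 1. make predictions
    preds = [w * x + b for x in xs]
    # 2-3. compare with the true values and compute the errors
    errors = [p - y for p, y in zip(preds, ys)]
    # 4. adjust parameters along the negative gradient of the MSE
    grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    grad_b = 2.0 / n * sum(errors)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    # 5. repeat

print(round(w, 4), round(b, 4))
```

The loop recovers a slope close to 2 and an intercept close to 1.
A neural network follows the same cycle, with many more parameters
and gradients computed by backpropagation.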


### Why neural networks can overfit in regression

Neural networks have high expressive power.

This means:
- they can fit training data extremely well
- they may capture noise instead of structure

Overfitting occurs when:
- training error becomes very small
- test error stops improving or increases

This behavior is common
in deep learning regression models,
especially on small datasets.


### Key takeaway

Deep Learning regression models:
- predict continuous values
- learn non-linear feature relationships
- require careful scaling and evaluation

They are powerful tools
when classical regression models
are not expressive enough.

In the next section,
we will train the neural network regressor
using scikit-learn.


___
## 6. Model training (Deep Learning Regression)

In this section we train a neural network regressor
using scikit-learn.

Unlike ordinary least-squares linear regression,
which can be solved in closed form,
this model is trained iteratively:
it repeatedly adjusts its weights and biases
to reduce prediction error.


In [6]:
# ====================================
# Model initialization
# ====================================

mlp_regressor = MLPRegressor(
    hidden_layer_sizes=(64, 32),
    activation="relu",
    solver="adam",
    learning_rate_init=0.001,
    max_iter=500,
    random_state=42
)


In [9]:
# ====================================
# Model training
# ====================================

mlp_regressor.fit(X_train_scaled, y_train)


Parameters of the fitted `MLPRegressor`, as displayed in the cell output:

| Parameter | Value |
|---|---|
| `loss` | `'squared_error'` |
| `hidden_layer_sizes` | `(64, ...)` |
| `activation` | `'relu'` |
| `solver` | `'adam'` |
| `alpha` | `0.0001` |
| `batch_size` | `'auto'` |
| `learning_rate` | `'constant'` |
| `learning_rate_init` | `0.001` |
| `power_t` | `0.5` |
| `max_iter` | `500` |


### What we just did (step by step)

#### 1. Defining the model architecture

We created a neural network with:
- two hidden layers
- 64 neurons in the first layer
- 32 neurons in the second layer



#### 2. Activation function

We use the ReLU activation function
in the hidden layers.

ReLU:
- introduces non-linearity
- helps the model learn complex patterns
- is cheap to compute and does not saturate for positive inputs,
  which helps gradients flow during training

The output layer uses a **linear activation**,
which is appropriate for regression.


#### 3. Optimization algorithm

The model uses the **Adam** optimizer.

Adam:
- adapts learning rates automatically
- handles noisy gradients well
- is a strong default choice for neural networks


#### 4. Training iterations

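The parameter `max_iter=500`
sets the maximum number of training iterations.
With the `adam` solver, one iteration is one **epoch**,
that is, one full pass over the training data
split into mini-batches.

For each mini-batch, the model:
- computes predictions
- measures prediction error
- updates model parameters

Training stops when:
- the maximum number of epochs is reached
- or the training loss stops improving by more than `tol`
  for `n_iter_no_change` consecutive epochs (10 by default)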
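
How long training actually ran can be inspected after fitting.
The sketch below uses a small synthetic dataset (an illustrative assumption)
and reads two attributes of the fitted `MLPRegressor`:
`n_iter_`, the number of epochs run, and `loss_curve_`, the training loss per epoch.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1]  # synthetic target

demo_model = MLPRegressor(
    hidden_layer_sizes=(16,), solver="adam", max_iter=200, random_state=0
)
with warnings.catch_warnings():
    # Reaching max_iter raises a ConvergenceWarning; silence it for the demo
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    demo_model.fit(X_demo, y_demo)

print("epochs run:", demo_model.n_iter_)
print("first loss:", demo_model.loss_curve_[0])
print("last loss: ", demo_model.loss_curve_[-1])
```

If `n_iter_` equals `max_iter`, training was cut off by the iteration limit
rather than by convergence.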


#### 5. What scikit-learn handles internally

scikit-learn automatically performs:
- forward passes
- loss computation (squared error plus a small L2 penalty controlled by `alpha`)
- gradient calculation
- parameter updates

This makes the training process:
- concise
- robust
- easy to use

At the cost of:
- reduced control
- limited customization


### Key takeaway

scikit-learn allows us to train
a deep learning regression model
with very little code.

The core learning principles
are identical to PyTorch and TensorFlow,
but the implementation is fully abstracted.

In the next section,
we will analyze **model behavior**
and the most important parameters
that influence regression performance.


___
## 7. Model behavior and key parameters

In this section we analyze how the deep learning regressor
behaves during training
and which parameters most strongly influence its performance.

Unlike linear regression,
the behavior of a neural network
emerges from the interaction of architecture,
optimization, and data.


### Model capacity and architecture

The architecture determines
how expressive the model is.

In this notebook, the network has:
- two hidden layers
- 64 neurons in the first layer
- 32 neurons in the second layer

This gives the model enough capacity
to learn complex, non-linear relationships.

However:
- higher capacity increases the risk of overfitting
- smaller datasets are more sensitive to model size
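
To make "capacity" concrete, the sketch below counts the trainable parameters
(weights and biases) of a fully connected network.
`count_parameters` is a hypothetical helper written for this illustration,
and the input size of 8 assumes the eight California Housing features.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Count weights and biases of a fully connected network."""
    layer_sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total


# Architecture used in this notebook: 8 inputs -> 64 -> 32 -> 1 output
print(count_parameters(8, (64, 32)))

# A deeper variant, for comparison
print(count_parameters(8, (128, 64, 32, 16)))
```

The notebook's architecture has 2,689 parameters,
while the deeper variant has 12,033:
model size grows quickly with depth and width.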


#### Architecture choices in real-world scenarios

In real-world problems,
the architecture of a neural network
is often adjusted based on:
- dataset size
- feature complexity
- problem difficulty

In such cases,
it is common to increase model depth
by adding more hidden layers.


#### Example of a deeper architecture (conceptual)

For example, a deeper regression network
could be defined with additional hidden layers,
such as:

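```python
from sklearn.neural_network import MLPRegressor

# Conceptual example: four hidden layers instead of two
deeper_regressor = MLPRegressor(
    hidden_layer_sizes=(128, 64, 32, 16),
    activation="relu",
    solver="adam",
)
```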


### Role of hidden layers in regression

Hidden layers allow the model to:
- combine input features in non-linear ways
- capture interactions between variables
- approximate complex regression functions

Each layer transforms the input
into a representation
that makes the target value easier to predict.

Deeper representations
do not guarantee better performance,
but they increase expressive power.


### Optimization behavior

The model is trained using gradient-based optimization.

Key aspects of this process:
- the model minimizes prediction error
- parameters are updated iteratively
- learning happens through many small adjustments

The Adam optimizer:
- adapts learning rates automatically
- speeds up convergence
- works well across many regression problems


### Effect of training iterations

The number of training iterations
controls how long the model learns.

- too few iterations → underfitting
- too many iterations → overfitting

Because neural networks are flexible,
they can fit training data very closely
if allowed to train for too long.
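
One way to limit training duration automatically is early stopping,
which `MLPRegressor` supports through the `early_stopping` parameter.
This is not used in the notebook's model; the sketch below is an optional variation
on synthetic data, with illustrative values for the other settings.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = (
    np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)
)

early_model = MLPRegressor(
    hidden_layer_sizes=(32, 16),
    max_iter=500,
    early_stopping=True,      # hold out part of the training data
    validation_fraction=0.1,  # 10% of the training data used for validation
    n_iter_no_change=10,      # patience, in epochs
    random_state=0,
)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    early_model.fit(X_demo, y_demo)

print("epochs run:", early_model.n_iter_)
print("best validation R2:", early_model.best_validation_score_)
```

With early stopping enabled, training ends once the validation score
has not improved for `n_iter_no_change` consecutive epochs,
even if `max_iter` has not been reached.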


### Overfitting in regression

Overfitting in regression occurs when:
- training error becomes very small
- test error stops improving or increases

The model may start fitting:
- noise
- outliers
- dataset-specific patterns

This behavior is common
when deep models are applied
to limited datasets.


### Sensitivity to feature scaling

Neural network regressors
are highly sensitive to feature scale.

Without proper scaling:
- gradients become unstable
- convergence slows down
- training may fail completely

Standardization is therefore
an essential part of the pipeline,
not an optional preprocessing step.
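
Because scaling and model belong together,
an alternative to scaling by hand is to wrap both steps in a scikit-learn pipeline.
This notebook keeps the steps separate for clarity;
the sketch below is only an illustration on synthetic data with made-up feature scales.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales (illustrative)
X_demo = rng.normal(size=(400, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = X_demo[:, 0] + X_demo[:, 1] / 100.0 + X_demo[:, 2] / 10000.0

# Scaler and model are fitted together and always applied in the same order
pipeline_demo = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=300, random_state=0),
)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    pipeline_demo.fit(X_demo, y_demo)

# The pipeline accepts raw, unscaled inputs and scales them internally
preds_demo = pipeline_demo.predict(X_demo[:5])
print(preds_demo.shape)
```

With a pipeline, it is not possible to forget the scaling step at prediction time.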


### Comparison with classical regression models

Compared to linear regression:
- neural networks are more expressive
- they capture non-linear relationships
- they require more careful tuning

The performance gain
comes at the cost of:
- reduced interpretability
- higher computational complexity


### Key takeaway

The behavior of a deep learning regressor
is determined by:
- model architecture
- training duration
- optimization strategy
- data preprocessing

Understanding these factors
helps explain why the model performs well
or why it may fail to generalize.

In the next section,
we will use the trained model
to generate predictions
and evaluate regression performance.


___
## 8. Predictions

In this section we use the trained neural network
to generate predictions on unseen test data.

For regression models,
predictions are **continuous numerical values**,
not class labels.


In [10]:
# ====================================
# Predictions
# ====================================

y_pred = mlp_regressor.predict(X_test_scaled)


### What the model outputs

The neural network outputs:
- one numerical value per input sample
- representing the predicted target value

Each prediction corresponds to:
- a continuous estimate
- not constrained to a fixed range
- directly comparable to the true target

### Interpretation of predictions

In regression:
- the goal is not exact equality
- small differences are expected
- performance is evaluated using error metrics

Predictions should be analyzed:
- numerically (error metrics)
- visually (optional plots)
- comparatively (against baselines)
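
A simple baseline for the comparative check is a model
that always predicts the mean of the training targets.
The sketch below builds such a baseline with `DummyRegressor` on synthetic data;
the dataset is an illustrative assumption, not the housing data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = (
    np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
y_base = baseline.predict(X_te)

print("baseline MAE:", mean_absolute_error(y_te, y_base))
print("baseline R2: ", r2_score(y_te, y_base))
```

A trained regressor is only useful if it clearly beats these baseline numbers
on the same test set.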

At this point, we have:
- true target values (`y_test`)
- predicted target values (`y_pred`)

In the next section,
we will evaluate regression performance
using standard regression metrics.


___
## 9. Model evaluation

In this section we evaluate the performance
of the deep learning regression model
on unseen test data.

For regression problems,
evaluation focuses on **prediction error**
rather than classification accuracy.


In [12]:
# ====================================
# Regression evaluation metrics
# ====================================

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)

mse, mae, r2, rmse 


(0.2742889195569193,
 0.35050323257082144,
 0.7906844930770192,
 np.float64(0.5237259966403418))

### How to read these results together

No single metric is sufficient
to fully evaluate a regression model.

Each metric answers
a different and complementary question.


### Mean Squared Error (MSE)

The Mean Squared Error measures
the average squared difference
between predicted and true values.

- penalizes large errors strongly
- sensitive to outliers
- lower values indicate better performance


### Mean Absolute Error (MAE)

The Mean Absolute Error measures
the average absolute difference
between predictions and true values.

- easier to interpret than MSE
- less sensitive to outliers
- expressed in the same units as the target


### R² score (coefficient of determination)

The R² score measures
how much variance in the target variable
is explained by the model.

- R² = 1 → perfect prediction
- R² = 0 → the model performs like a constant predictor that always outputs the mean of the target
- R² < 0 → the model performs worse than that mean baseline

Higher values indicate better explanatory power.


### Root Mean Squared Error (RMSE)

In many regression problems,
RMSE is one of the most important metrics.

RMSE:
- is expressed in the same unit as the target
- provides a direct measure of prediction error
- is easier to interpret than MSE

For this reason,
RMSE is often preferred
when communicating model performance.


### Role of each metric

- **RMSE**  
  Measures the typical prediction error  
  in the same unit as the target variable.  
  It is often the most intuitive metric
  for practical interpretation.

- **MAE**  
  Measures the average absolute error.  
  It is less sensitive to outliers
  and provides a robust view of performance.

- **R² score**  
  Measures how much of the target variance
  is explained by the model.  
  It reflects overall fit quality,
  not absolute error magnitude.
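
To see how the metrics relate to each other,
they can be computed by hand on a tiny example.
The four target values and predictions below are made up for illustration.

```python
import math

y_true_demo = [3.0, 5.0, 2.0, 7.0]  # illustrative true values
y_pred_demo = [2.5, 5.0, 3.0, 8.0]  # illustrative predictions
n = len(y_true_demo)

errors = [t - p for t, p in zip(y_true_demo, y_pred_demo)]

mse_demo = sum(e ** 2 for e in errors) / n   # mean squared error
rmse_demo = math.sqrt(mse_demo)              # back in the target's unit
mae_demo = sum(abs(e) for e in errors) / n   # mean absolute error

# R² compares the model's squared error with that of the mean predictor
mean_true = sum(y_true_demo) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true_demo)
r2_demo = 1.0 - ss_res / ss_tot

print(mse_demo, rmse_demo, mae_demo, round(r2_demo, 4))
# → 0.5625 0.75 0.625 0.8475
```

RMSE (0.75) is larger than MAE (0.625) here
because squaring gives the two errors of size 1 more weight than the error of size 0.5.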


### Why RMSE is especially important

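RMSE is often the metric reported
to non-technical stakeholders because:
- it is expressed in the same unit as the target
- it reflects typical prediction error
- it penalizes large errors more strongly than MAE

In this notebook, the target is the median house value
in units of 100,000 US dollars,
so an RMSE of about 0.52 corresponds to a typical error
of roughly 52,000 US dollars,
while the MAE of about 0.35 corresponds to roughly 35,000 US dollars.
The gap between the two indicates
that some predictions are off by considerably more than the average.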


### Interpreting metrics together

A good regression model typically shows:
- low RMSE
- low MAE
- high R²

However:
- a high R² does not guarantee low error
- a low error does not guarantee strong generalization

Metrics must always be interpreted
together and on unseen test data.


### Key takeaway

Deep learning regression models
must be evaluated using multiple metrics.

RMSE and MAE describe **how much the model errs**,
while R² describes **how well the model explains the data**.

Using these metrics together
provides a complete and reliable
assessment of regression performance.

In the next section,
we will discuss when deep learning regression
is an appropriate choice
and when simpler models may be preferable.


___
## 10. When to use it and when not to

Deep Learning regression models are powerful,
but they are not always the best choice.

Choosing this approach depends on:
- data complexity
- dataset size
- performance requirements
- interpretability constraints


### When to use Deep Learning for regression

Deep learning regression is a good choice when:

- the relationship between features and target is non-linear
- feature interactions are complex
- classical linear models underperform
- prediction accuracy is more important than interpretability
- sufficient training data is available

It is particularly useful for:
- complex tabular data
- problems with hidden patterns
- scenarios where flexibility is required


### When NOT to use Deep Learning for regression

Deep learning regression may not be ideal when:

- the dataset is small
- the relationship is approximately linear
- model interpretability is critical
- training time or resources are limited

In these cases,
simpler regression models
are often more efficient and reliable.


### Practical warning signs

You should be cautious if:

- training error is very low
- test error does not improve
- model complexity grows quickly
- performance gains are marginal

These are common indicators
that deep learning may be unnecessary
for the problem at hand.


### Comparison with classical regression

Compared to linear regression:
- deep learning models are more expressive
- they can capture non-linear relationships
- they require more tuning and care

The performance gain
comes at the cost of:
- reduced interpretability
- increased computational complexity


### Key takeaway

Deep Learning regression models
are powerful tools for complex problems,
but they should not be used by default.

Model choice should always balance:
- accuracy
- complexity
- interpretability
- maintainability

In the next section,
we will save the trained model
and complete the regression pipeline.


___
## 11. Model persistence

In this section we save the trained deep learning
regression model and the preprocessing steps
used during training.

Saving the model allows us to:
- reuse it without retraining
- ensure reproducibility
- separate training from inference


In [None]:
# ====================================
# Model persistence
# ====================================

model_dir = Path("models/supervised_learning/regression/deep_learning_sklearn")
model_dir.mkdir(parents=True, exist_ok=True)

# Save trained model
joblib.dump(mlp_regressor, model_dir / "mlp_regressor.joblib")

# Save scaler
joblib.dump(scaler, model_dir / "scaler.joblib")


### What we have saved

We saved:
- the trained neural network regressor
- the feature scaler used during preprocessing

These components together form
the complete regression pipeline.


### Why saving the scaler is essential

Neural networks are highly sensitive
to feature scaling.

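Using a different scaler,
or refitting the scaler on new data,
would shift the inputs away from the distribution
the network was trained on
and lead to unreliable predictions.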

Saving the scaler ensures
that new data is transformed
in exactly the same way
as during training.


### How the model can be reused

To reuse the model:
1. load the scaler
2. apply it to new input data
3. load the trained regressor
4. generate predictions

This guarantees consistency
between training and inference.
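
The four steps above can be sketched as follows.
To stay self-contained, the example first trains and saves a small stand-in model
on synthetic data in a temporary folder;
in the notebook, the files would be the ones written to `model_dir`.
All `*_demo` names are illustrative.

```python
import tempfile
import warnings
from pathlib import Path

import joblib
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] - X_demo[:, 1]

# Train a small stand-in model and scaler
scaler_demo = StandardScaler().fit(X_demo)
model_demo = MLPRegressor(hidden_layer_sizes=(8,), max_iter=100, random_state=0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    model_demo.fit(scaler_demo.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    joblib.dump(model_demo, demo_dir / "mlp_regressor.joblib")
    joblib.dump(scaler_demo, demo_dir / "scaler.joblib")

    # Inference side
    loaded_scaler = joblib.load(demo_dir / "scaler.joblib")        # 1. load the scaler
    X_new_scaled = loaded_scaler.transform(X_demo[:5])             # 2. scale new inputs
    loaded_model = joblib.load(demo_dir / "mlp_regressor.joblib")  # 3. load the regressor
    y_new = loaded_model.predict(X_new_scaled)                     # 4. predict

print(y_new)
```

The reloaded objects reproduce the predictions of the original ones,
provided the same scikit-learn version is used for saving and loading.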


___
## 12. Mathematical formulation (deep dive)

This section provides a conceptual and mathematical view
of deep learning regression.

The goal is to understand **what is optimized**
and **how predictions are produced**,
without going into low-level implementation details.


### Representation of the data

The regression dataset is represented as:

$$
\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}
$$

where:
- $x_i \in \mathbb{R}^d$ is a feature vector
- $y_i \in \mathbb{R}$ is a continuous target value


### Neural network as a function approximator

A neural network learns a function:

$$
\hat{y} = f(x; \theta)
$$

where:
- $x$ is the input feature vector
- $\theta$ represents all weights and biases
- $\hat{y}$ is the predicted value

The function $f$ is **non-linear**
and composed of multiple layers.


### Layer-wise transformation

Each hidden layer applies a transformation of the form:

$$
h = \sigma(Wx + b)
$$

where:
- $W$ is the weight matrix
- $b$ is the bias vector
- $\sigma$ is a non-linear activation function (ReLU)

This process is repeated across layers,
building increasingly abstract representations.


### Output layer for regression

For regression,
the output layer is linear:

$$
\hat{y} = W_{\text{out}} h + b_{\text{out}}
$$

This allows the model to predict
any real-valued number.
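
The layer-wise and output formulas can be checked against scikit-learn.
The sketch below fits a small `MLPRegressor` on synthetic data
and recomputes its predictions from the learned weights (`coefs_`) and biases (`intercepts_`).
`manual_forward` is a hypothetical helper written for this illustration.
Note that scikit-learn stores samples as rows,
so the product is written `h @ W` rather than $Wx$.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1]  # synthetic target

model_demo = MLPRegressor(
    hidden_layer_sizes=(16, 8), activation="relu", max_iter=100, random_state=0
)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    model_demo.fit(X_demo, y_demo)


def manual_forward(model, X):
    """Recompute predictions from the learned weights and biases."""
    h = X
    # Hidden layers: h = ReLU(h W + b)
    for W, b in zip(model.coefs_[:-1], model.intercepts_[:-1]):
        h = np.maximum(0.0, h @ W + b)
    # Output layer: linear, no activation
    out = h @ model.coefs_[-1] + model.intercepts_[-1]
    return out.ravel()


y_manual = manual_forward(model_demo, X_demo[:5])
y_sklearn = model_demo.predict(X_demo[:5])
print(np.allclose(y_manual, y_sklearn))
```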


### Loss function

The model is trained by minimizing
the **Mean Squared Error (MSE)**:

$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Because errors are squared,
this loss penalizes large errors
much more strongly than small ones.
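
As a hedged note on what scikit-learn optimizes in practice:
according to the `MLPRegressor` parameter documentation,
the `'squared_error'` loss is implemented as a half squared error,
and an L2 penalty with strength `alpha` is added, divided by the sample size.
Under that reading, and writing $W$ for all weight matrices
and $\alpha$ for the `alpha` parameter (the exact form below is an assumption
based on those descriptions), the objective is:

$$
L(\theta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \frac{\alpha}{2n} \lVert W \rVert_2^2
$$

The factor $\frac{1}{2}$ simplifies the gradient with respect to each prediction:

$$
\frac{\partial L}{\partial \hat{y}_i} = \frac{1}{n} (\hat{y}_i - y_i)
$$

With the default `alpha=0.0001`, the penalty term is small
compared to the error term.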


### Optimization process

Training consists of:
- computing predictions
- measuring error via the loss function
- updating parameters using gradients

An optimizer (e.g. Adam)
adjusts the parameters iteratively
to minimize the loss.
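
To illustrate how Adam adjusts a parameter,
the sketch below applies the standard Adam update rule
to a one-dimensional toy loss $(\theta - 3)^2$.
The loss, learning rate, and number of steps are illustrative assumptions;
the decay rates are the commonly used defaults.

```python
import math


def loss_gradient(theta):
    """Gradient of the toy loss (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


theta = 0.0
learning_rate = 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
m, v = 0.0, 0.0  # running averages of the gradient and its square

for t in range(1, 501):
    g = loss_gradient(theta)
    m = beta1 * m + (1.0 - beta1) * g      # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * g * g  # second moment (mean of squared gradients)
    m_hat = m / (1.0 - beta1 ** t)         # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    theta -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)

print(round(theta, 3))
```

The parameter moves from 0 towards the minimum at 3.
Dividing by the square root of the second moment
is what gives each parameter its own effective step size.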


### Bias–variance perspective

Deep learning regression models:
- have low bias (high flexibility)
- may have high variance if over-parameterized

Generalization depends on:
- model capacity
- data size
- training duration
- regularization effects


### Final takeaway

Deep learning regression can be viewed as:
- learning a non-linear function
- minimizing prediction error
- approximating complex relationships

The mathematical principles are simple,
but their composition yields powerful models.


___
## 13. Final summary – Code only

The following cell contains the complete regression pipeline
from data loading to model persistence.

No explanations are provided here on purpose.

This section is intended for:
- quick execution
- reference
- reuse in scripts or applications


In [None]:
# ====================================
# Imports
# ====================================

import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score
)

from pathlib import Path
import joblib
import matplotlib.pyplot as plt


# ====================================
# Dataset loading
# ====================================

data = fetch_california_housing(as_frame=True)

X = data.data
y = data.target


# ====================================
# Train-test split
# ====================================

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)


# ====================================
# Feature scaling
# ====================================

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# ====================================
# Model initialization
# ====================================

mlp_regressor = MLPRegressor(
    hidden_layer_sizes=(64, 32),
    activation="relu",
    solver="adam",
    learning_rate_init=0.001,
    max_iter=500,
    random_state=42
)


# ====================================
# Model training
# ====================================

mlp_regressor.fit(X_train_scaled, y_train)


# ====================================
# Predictions
# ====================================

y_pred = mlp_regressor.predict(X_test_scaled)


# ====================================
# Model evaluation
# ====================================

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

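# A bare expression in the middle of a cell or script displays nothing,
# so the metrics are printed explicitly.
print(f"MSE: {mse:.4f}  RMSE: {rmse:.4f}  MAE: {mae:.4f}  R2: {r2:.4f}")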


# ====================================
# Model persistence
# ====================================

model_dir = Path("models/supervised_learning/regression/deep_learning_sklearn")
model_dir.mkdir(parents=True, exist_ok=True)

joblib.dump(mlp_regressor, model_dir / "mlp_regressor.joblib")
joblib.dump(scaler, model_dir / "scaler.joblib")
