# **Linear Regression - rice_ml**
This notebook demonstrates how to use the LinearRegression class from the rice_ml package. It walks through a standard use case and analyzes the results along the way.

Note that for robust model selection, k-fold cross-validation and deeper hyperparameter tuning are recommended. Since the main goal of this example is to demonstrate the class, we will skip in-depth hyperparameter tuning and use the same random state (42) for every run so that the results are comparable.

This notebook shows how to:
- Use 'LinearRegression' from 'rice_ml'
- Prepare and normalize data using 'rice_ml'
- Evaluate linear regression on a regression task

## Table of Contents
- [Algorithm](#algorithm)
- [Data Preparation](#data-preparation)
- [Linear Regression](#linear-regression)
  - [Model Training](#model-training)
  - [Results](#results)

## Algorithm
Linear regression is a supervised learning algorithm that models the relationship between a set of input features and a continuous target variable by fitting a linear equation to the data. The model assumes that the target can be expressed as a weighted sum of the input features plus a bias term.

The algorithm learns the optimal coefficients by minimizing a loss function, typically mean squared error (MSE), which measures the average squared difference between predicted and actual values.
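
In symbols, using notation introduced here for illustration (features $x_{ij}$, weights $w_j$, bias $b$, $p$ features, and $n$ training samples), the model and its loss can be written as:

$$
\hat{y}_i = b + \sum_{j=1}^{p} w_j x_{ij}, \qquad
\mathrm{MSE}(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$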

During training, linear regression adjusts its weights so that the predicted values lie as close as possible to the observed data points in a least-squares sense. Each feature contributes linearly to the final prediction, scaled by its learned coefficient.
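
The least-squares fit described above can be sketched in plain NumPy. This is a minimal illustration on synthetic data, not necessarily how `LinearRegression` in rice_ml is implemented (it may use a different solver, such as gradient descent). The data, variable names, and the use of `np.linalg.lstsq` are assumptions made for the example.

```python
import numpy as np

# Synthetic data (hypothetical): y = 3*x1 - 2*x2 + 5 plus small noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is learned as an extra weight.
X_design = np.column_stack([X, np.ones(len(X))])

# Find theta that minimizes ||X_design @ theta - y||^2.
theta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
weights, intercept = theta[:-1], theta[-1]

print("weights:", weights.round(2))
print("intercept:", round(float(intercept), 2))
```

The recovered weights and intercept should land close to the values used to generate the data (3, -2, and 5).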

To make a prediction, the model computes the dot product of the input features and the learned weights, then adds the intercept term. The resulting value is returned as the predicted output.
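
As a small worked example of the prediction step, the sketch below uses made-up weights and an intercept (these values are hypothetical, not taken from the fitted model in this notebook):

```python
import numpy as np

# Hypothetical learned parameters for a model with three features.
weights = np.array([0.5, -1.0, 2.0])
intercept = 0.25

# Two samples, one per row.
X_new = np.array([
    [1.0, 2.0, 3.0],
    [0.0, 1.0, 0.5],
])

# Prediction = dot product of features and weights, plus the intercept.
y_hat = X_new @ weights + intercept
print(y_hat)  # [4.75 0.25]
```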

Linear regression is simple, fast, and highly interpretable, making it a strong baseline model. However, its performance depends heavily on the assumption of linearity and can degrade when relationships between features and targets are nonlinear or when strong outliers are present.

![Linear Regression Example](../images/lin_reg.webp)
Source: [SSDSI](https://sixsigmadsi.com/glossary/simple-linear-regression/)


### Pros vs Cons
#### Pros
- Very interpretable (coefficients show how each feature affects the prediction, and indicate feature importance when features are on comparable scales)
- Fast to train and to predict
- Works very well with approximately linear relationships
#### Cons
- Assumes linear relationships (cannot capture nonlinear relationships unless the features are transformed first)
- Sensitive to outliers
- Can often require careful feature engineering and transformations

## Data Preparation
We will be using the California Housing data from 'sklearn'. It contains 1990 census information for California districts. Given demographic and geographic features, we can use it to predict the median house value for a district. The target is expressed in units of 100,000 USD.

In [1]:
from sklearn.datasets import fetch_california_housing
import numpy as np

data = fetch_california_housing()

X = data.data
y = data.target
feature_names = data.feature_names

print("X shape:", X.shape)
print("y shape:", y.shape)
print("Features:", feature_names)


X shape: (20640, 8)
y shape: (20640,)
Features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']


## Linear Regression

### Model Training

In [2]:
from rice_ml.supervised_learning.linear_regression import LinearRegression
from rice_ml.utilities.preprocess import train_test_split, normalize

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

model = LinearRegression()

model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
print(f"R2 score: {r2:.3f}")

y_pred = model.predict(X_test)

mse = ((y_test - y_pred) ** 2).mean()
print(f"RMSE: {np.sqrt(mse):.3f}")

R2 score: 0.610
RMSE: 0.725


The base model reaches an R^2 score of 0.610 on the test set, so it explains around 61% of the variance in median house value. The target is in units of 100,000 USD, so the RMSE of 0.725 corresponds to a typical prediction error of roughly 72,500 USD.

Linear regression is simple and very interpretable, and it is often used as a base model for comparison. This is a reasonable result and a useful baseline to compare against other models on this data.
Next we will look at normalization.
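
To make the two metrics concrete, here is a minimal sketch that computes R^2 and RMSE by hand on a tiny made-up example and converts the RMSE to USD. It uses the standard definition of R^2; whether `model.score` in rice_ml uses exactly this formula is an assumption.

```python
import numpy as np

# Small made-up example, in the same units as the target (100,000 USD).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.0])

ss_res = ((y_true - y_hat) ** 2).sum()          # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r2 = 1.0 - ss_res / ss_tot

rmse = np.sqrt(((y_true - y_hat) ** 2).mean())
rmse_usd = rmse * 100_000  # convert from units of 100,000 USD to USD

print(f"R2: {r2:.3f}")
print(f"RMSE: {rmse:.3f} ({rmse_usd:,.0f} USD)")
```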

#### Z-Score Normalization

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

# Normalize on the training data to avoid data leakage.
X_train, stats = normalize(X_train, method="zscore", return_stats=True)
X_test = normalize(X_test, method="zscore", stats=stats)

model = LinearRegression()

model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
print(f"R2 score: {r2:.3f}")

y_pred = model.predict(X_test)

mse = ((y_test - y_pred) ** 2).mean()
print(f"RMSE: {np.sqrt(mse):.3f}")

R2 score: 0.610
RMSE: 0.725
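
The z-score step used above can be sketched in plain NumPy to show why the statistics come from the training split only. This is an illustration on synthetic data, not the actual implementation of `normalize` in rice_ml (for instance, whether it uses the population or the sample standard deviation is not shown here).

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical train/test feature matrices on very different scales.
X_train = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(500, 2))
X_test = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(100, 2))

# Compute statistics on the training split only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Apply the same training statistics to both splits.
X_train_z = (X_train - mean) / std
X_test_z = (X_test - mean) / std

print("train:", X_train_z.mean(axis=0).round(3), X_train_z.std(axis=0).round(3))
print("test: ", X_test_z.mean(axis=0).round(3), X_test_z.std(axis=0).round(3))
```

The training split ends up with mean 0 and standard deviation 1 per feature. The test split is only approximately standardized, which is expected: it is transformed with statistics it did not contribute to.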


### Results
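Normalization does not change the results: R^2 and RMSE are identical to the unnormalized model to the three decimals shown. This is expected, because a least-squares fit with an intercept produces the same predictions when the features are shifted and rescaled. It is still recommended to normalize your data, because it puts the coefficients on a common scale, which makes them much easier to compare as a measure of feature importance.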
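
The effect of z-score normalization on a least-squares fit can be checked on synthetic data. The sketch below uses a hypothetical helper `fit_ols` built on `np.linalg.lstsq` as a stand-in for the model, so it illustrates the general behavior rather than the rice_ml implementation.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (weights, intercept)."""
    A = np.column_stack([X, np.ones(len(X))])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta[:-1], theta[-1]

rng = np.random.default_rng(42)
# Hypothetical features on very different scales.
X = rng.normal(loc=[5.0, 2000.0], scale=[1.0, 500.0], size=(300, 2))
y = 0.8 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(scale=0.2, size=300)

X_train, X_test = X[:240], X[240:]
y_train = y[:240]

# Fit on raw features.
w_raw, b_raw = fit_ols(X_train, y_train)
pred_raw = X_test @ w_raw + b_raw

# Fit on z-scored features, using training statistics only.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
w_z, b_z = fit_ols((X_train - mean) / std, y_train)
pred_z = ((X_test - mean) / std) @ w_z + b_z

print("same predictions:", np.allclose(pred_raw, pred_z))
print("raw weights:   ", w_raw)
print("scaled weights:", w_z)
```

The predictions match, but the weights differ: each scaled weight equals the raw weight multiplied by that feature's standard deviation. On the raw features the second weight looks negligible only because that feature has a much larger scale.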