# Linear Regression

This notebook demonstrates how to use the `linear_regression` module from the `rice2025.supervised_learning` library.

## Setup
Import the necessary modules and load the data. This example uses the California housing dataset from scikit-learn.

The housing dataset is a regression dataset that has:

- **Samples:** 20,640
- **Features:** 8
- **Target:** Median house value (in units of $100,000)

**Goal:** Predict house value from demographic and geographic features.

In [231]:
# import library
from rice2025.supervised_learning import linear_regression
import rice2025.utilities as util

# load dataset
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X = data.data
y = data.target

## Data Pre-Processing
Before training, we split the dataset into **training** and **test** sets using `util.train_test_split`, holding out 20% of the samples for testing (`test_size=.2`). Printing the shape of each split verifies the result. We then scale the features of each split with the `scale` function.

In [232]:
# split dataset
X_train, X_test, y_train, y_test = util.train_test_split(X, y, test_size=.2)
print(f"Train size: {X_train.shape}, Test size: {X_test.shape}")

# scale dataset
X_train = util.scale(X_train)
X_test = util.scale(X_test)


Train size: (16512, 8), Test size: (4128, 8)
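
The cell above scales the training and test sets with two separate calls. How `util.scale` computes its statistics is not shown in this notebook, so the following is only a sketch of a common alternative: standardizing both splits with the mean and standard deviation of the training set, so the test set is transformed exactly as the training set was. The helper `standardize_with_train_stats` and the demo arrays are hypothetical and are not part of `rice2025`.

```python
import numpy as np

def standardize_with_train_stats(X_train, X_test):
    """Standardize both splits using statistics from the training set only.

    Hypothetical helper for illustration; not part of rice2025.
    """
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant features unscaled
    return (X_train - mean) / std, (X_test - mean) / std

# Synthetic stand-in data (assumption: any 2-D float arrays work the same way)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

X_train_scaled, X_test_scaled = standardize_with_train_stats(X_train_demo, X_test_demo)
print(X_train_scaled.mean(axis=0).round(6))  # close to 0 for every feature
print(X_train_scaled.std(axis=0).round(6))   # close to 1 for every feature
```

With this approach the test features are generally not exactly zero-mean, which is expected: they are measured against the training distribution.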


## Initializing and Training the Linear Regression Model

The `LinearRegression` class can be initialized without any parameters.
Use the `fit()` method to train the model on the training data.

In [233]:
model = linear_regression.LinearRegression()
model.fit(X_train, y_train)
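
To give a sense of what fitting a linear regression involves, here is a minimal sketch of ordinary least squares using NumPy. This notebook does not show how `rice2025` implements `fit()` (it may use a closed-form solution, gradient descent, or something else), so the functions `fit_ols` and `predict_ols` below are hypothetical illustrations rather than the library's code.

```python
import numpy as np

def fit_ols(X, y):
    """Fit intercept and weights by ordinary least squares (illustrative only)."""
    # Prepend a column of ones so the first coefficient acts as the intercept
    X_b = np.column_stack([np.ones(X.shape[0]), X])
    coef, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return coef[0], coef[1:]

def predict_ols(X, intercept, weights):
    """Predict targets as intercept + X @ weights."""
    return intercept + X @ weights

# Synthetic, noiseless data generated from known coefficients
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
true_weights = np.array([2.0, -3.0])
y_demo = 1.5 + X_demo @ true_weights

intercept, weights = fit_ols(X_demo, y_demo)
print(intercept, weights)  # recovers roughly 1.5 and [2.0, -3.0]
```

Because the demo data has no noise, the fitted coefficients match the generating ones up to floating-point error; on real data such as the housing set, the fit minimizes the squared error without reproducing the targets exactly.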

## Making Predictions
Once the model is trained, the `predict()` method can be used to predict target values for new data points.

In [234]:
y_pred = model.predict(X_test)

## Evaluating the Model

We evaluate the predictions with two metrics from scikit-learn's `sklearn.metrics` module:
- **Mean Squared Error (MSE)**
- **R² Score** (coefficient of determination)

In [235]:
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.3f}")
print(f"R² Score: {r2:.3f}")

Mean Squared Error: 0.505
R² Score: 0.617
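
For reference, both metrics can be computed directly from their definitions: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below uses NumPy with small made-up arrays; the helper names `mse_manual` and `r2_manual` are hypothetical.

```python
import numpy as np

def mse_manual(y_true, y_pred):
    """Mean of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2_manual(y_true, y_pred):
    """1 - SS_res / SS_tot, the coefficient of determination."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Made-up values for illustration only
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.5, 2.0, 2.5, 4.0])

print(f"MSE: {mse_manual(y_true_demo, y_pred_demo):.3f}")  # MSE: 0.125
print(f"R²: {r2_manual(y_true_demo, y_pred_demo):.3f}")    # R²: 0.900
```

An R² of 0.900 here means the predictions account for 90% of the variance of the demo targets, which is the same reading applied to the housing result.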


The MSE of 0.505 and R² of 0.617 are typical for linear regression on the California housing dataset, as housing prices have complex non-linear relationships that simple linear models cannot fully capture. This R² score indicates the model explains approximately 62% of the variance in housing prices on the test set, with the remaining variance requiring more sophisticated modeling techniques.