### Install Packages and Import Dataset

We’re continuing to work with the same dataset of 932 real estate transactions in Sacramento, California, which includes features like property location, size, and type. The target remains Price, which we’ll aim to predict based on these features.

This dataset was obtained from [SpatialKey](https://support.spatialkey.com/spatialkey-sample-csv-data/).


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import set_config

# Output dataframes instead of arrays
set_config(transform_output="pandas")

KNN regression has its benefits—it’s simple to understand and can capture complex, nonlinear relationships in data. It works well when the data has patterns that are best explained by nearby neighbors. However, KNN regression also has its downsides. It struggles to make predictions for values outside the range of the training data, meaning it can’t effectively handle cases where the target variable extends beyond what’s been observed. Additionally, as the dataset grows larger, KNN becomes computationally slower since it has to calculate distances for every new prediction.
 
Because of these limitations, especially when we need to generalize beyond the training data or handle larger datasets more efficiently, we often turn to linear regression as an alternative. Linear regression offers a more scalable approach and provides a way to make predictions across a wider range of values.
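
To make the extrapolation point concrete, here is a minimal sketch on made-up toy data (not the Sacramento dataset). The values, the choice of 3 neighbors, and the variable names are all assumptions for illustration. It compares how the two models behave when asked about a point far outside the training range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy training data (hypothetical): x from 0 to 9, with y = 2x exactly
X_train = np.arange(10).reshape(-1, 1)
y_train = 2 * X_train[:, 0]

# A query point far outside the training range
X_new = np.array([[100]])

knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

knn_pred = knn.predict(X_new)[0]   # average of y at the 3 nearest x values (7, 8, 9)
lin_pred = lin.predict(X_new)[0]   # extends the fitted line out to x = 100

print(f"KNN: {knn_pred:.1f}, linear: {lin_pred:.1f}")
```

KNN can only average target values it has already seen, so its prediction stays near the edge of the training data, while the linear model follows the trend beyond it. Whether that extrapolation is trustworthy still depends on the trend actually holding out there.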

This notebook starts with simple linear regression, which uses one predictor and one outcome, and then moves on to multivariable linear regression.

### Well... what is linear regression then?

Linear regression is a method for finding the straight line that best describes the relationship between two variables. For example, if you have data on how much time students spend studying and their test scores, linear regression lets you draw a line showing how scores change with study time. The line summarizes the overall pattern (students who study more tend to get better scores) and allows us to predict a test score from study time. It’s called "linear" because it models the relationship as a straight line.

![linear regression](./images/linear_regression.gif)

This method can be applied to other examples too, like the housing data we’re using, where linear regression helps predict house prices based on features like size or location.

#### Our question is predictive: 
**Can we use the size of a house in the Sacramento, CA area to predict its sale price?**

The equation for the straight line is:

$$
\text{House sale price} = b_0 + b_1 \times (\text{house size})
$$

where:

- $b_0$ is the predicted price when the house size is 0 (the intercept).
- $b_1$ is how much the predicted price increases for each additional square foot of house size (the slope).

Using data to find the line of best fit means finding the coefficients $b_0$ and $b_1$, which define the line. You can think of $b_0$ as the base price and $b_1$ as the price increase per square foot.

In [2]:
sacramento = pd.read_csv("dataset/sacramento.csv")
sacramento

Unnamed: 0,street,city,zip,state,beds,baths,sq__ft,type,sale_date,price,latitude,longitude
0,1005 MORENO WAY,SACRAMENTO,95838,CA,3,2,1410,Residential,Fri May 16 00:00:00 EDT 2008,180000,38.646206,-121.442767
1,10105 MONTE VALLO CT,SACRAMENTO,95827,CA,4,2,1578,Residential,Fri May 16 00:00:00 EDT 2008,190000,38.573917,-121.316916
2,10133 NEBBIOLO CT,ELK GROVE,95624,CA,4,3,2096,Residential,Fri May 16 00:00:00 EDT 2008,289000,38.391085,-121.347231
3,10165 LOFTON WAY,ELK GROVE,95757,CA,3,2,1540,Residential,Fri May 16 00:00:00 EDT 2008,266510,38.387708,-121.436522
4,10254 JULIANA WAY,SACRAMENTO,95827,CA,4,2,2484,Residential,Fri May 16 00:00:00 EDT 2008,331200,38.568030,-121.309966
...,...,...,...,...,...,...,...,...,...,...,...,...
808,9507 SEA CLIFF WAY,ELK GROVE,95758,CA,4,2,2056,Residential,Wed May 21 00:00:00 EDT 2008,285000,38.410992,-121.479043
809,9570 HARVEST ROSE WAY,SACRAMENTO,95827,CA,5,3,2367,Residential,Wed May 21 00:00:00 EDT 2008,315537,38.555993,-121.340352
810,9723 TERRAPIN CT,ELK GROVE,95757,CA,4,3,2354,Residential,Wed May 21 00:00:00 EDT 2008,335750,38.403492,-121.430224
811,9837 CORTE DORADO CT,ELK GROVE,95624,CA,4,2,1616,Residential,Wed May 21 00:00:00 EDT 2008,227887,38.400676,-121.381010


This question guides our initial exploration. The columns we are interested in are:
- **sq__ft**: house size, in livable square feet
- **price**: house sale price, in US dollars (USD)

### How do we perform it?

We can perform simple linear regression in Python using scikit-learn much like we did for KNN regression. Instead of using a `KNeighborsRegressor` model, we create a `LinearRegression` model. Unlike KNN, we don't need to pick a $K$ value or use cross-validation to fine-tune the model.

Here's how we can predict house sale prices based on house size using simple linear regression with the full Sacramento real estate dataset.

### Training, evaluating, and tuning the model

#### **Step 1:** Split the dataset into training and test sets.

> **Note**: 
>
> Even though we’re using a different model, like linear regression, that doesn’t mean cross-validation goes away. Cross-validation is still useful and can be applied to various models, ensuring they perform well on unseen data by testing on multiple subsets. For more information on this please take a look at [classification_2.ipynb](./Classification-2.ipynb).
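
As a rough sketch of what this split can look like, the snippet below uses `train_test_split` on a small synthetic stand-in for the `sacramento` DataFrame so that it runs on its own. The 75/25 proportion, the `random_state` value, and the `demo` names are assumptions, not necessarily the settings used to produce the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the sacramento DataFrame (100 made-up houses)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"sq__ft": rng.integers(800, 4000, size=100)})
demo["price"] = 10_000 + 140 * demo["sq__ft"] + rng.normal(0, 50_000, size=100)

# Hold out 25% of the rows for testing (assumed proportion);
# random_state makes the shuffle reproducible
demo_train, demo_test = train_test_split(demo, train_size=0.75, random_state=42)

print(len(demo_train), len(demo_test))
```

With the real data, the same call on `sacramento` would produce the `sacramento_train` and `sacramento_test` DataFrames used in the later cells.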

#### **Step 2:** Fit the linear regression model.

Here, we extract the slope of the line via the `coef_[0]` property, as well as the intercept of the line via the `intercept_` property.
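
The fitting step might look like the sketch below. To keep it self-contained, it uses a tiny, noise-free, made-up training set built from a known line, so we can check that `coef_[0]` and `intercept_` recover the slope and intercept we put in. The `demo_` names and the data are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free training data built from a known line:
# price = 7,000 + 139 * size
demo_train = pd.DataFrame({"sq__ft": [1000, 1500, 2000, 2500, 3000]})
demo_train["price"] = 7_000 + 139 * demo_train["sq__ft"]

# Fit on the training data only; X must be 2-D, hence the double brackets
demo_lm = LinearRegression()
demo_lm.fit(demo_train[["sq__ft"]], demo_train["price"])

slope = demo_lm.coef_[0]        # one coefficient per predictor
intercept = demo_lm.intercept_  # a single intercept for the whole model

print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
```

On real, noisy data the fitted values will not match any "true" line exactly. They are the values that minimize the squared prediction error on the training set.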

Our coefficients are:

- Intercept ($b_0$): 7,069
- Slope ($b_1$): 139

This means the equation of the line of best fit is:

$$
\text{House sale price} = 7,069 + 139 \times (\text{house size})
$$

**So for each additional square foot of house size, the predicted price increases by about $139.**
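
As a quick worked example, we can plug a house size into the fitted equation by hand. The helper name `predict_price` is ours, introduced only for illustration, and the coefficients are the rounded values reported above.

```python
b0 = 7_069   # intercept (rounded), in USD
b1 = 139     # slope (rounded), in USD per square foot

def predict_price(size_sqft):
    """Predicted sale price from the line of best fit."""
    return b0 + b1 * size_sqft

price_2000 = predict_price(2000)
print(price_2000)                        # 7,069 + 139 * 2,000 = 285,069
print(predict_price(2001) - price_2000)  # one extra square foot adds 139
```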

#### **Step 3:** Finally, we predict on the test data set to assess how well our model does.

Our final model’s test error, measured by the root mean squared prediction error (RMSPE), is $72,549. Because RMSPE is in the same units as the target (US dollars), it tells us roughly how far off our predictions typically are. But does that make the model "good" at predicting house prices based on home size? That depends on how precise you need the predictions to be for your purpose!

For example, a real estate investor dealing with multimillion-dollar properties might not mind this level of error, but for a first-time homebuyer with a 300,000 USD budget, being off by $72,549 could be a big deal.
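
To show how an RMSPE value is computed, here is a small sketch with three made-up houses. The prices are hypothetical and chosen so the arithmetic is easy to follow. It uses the same `mean_squared_error(...) ** (1/2)` pattern as the rest of this notebook and checks it against the formula written out by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted sale prices (USD) for three test-set houses
y_true = np.array([200_000, 300_000, 250_000])
y_pred = np.array([210_000, 280_000, 250_000])

# RMSPE: square the errors, average them, then take the square root
rmspe = mean_squared_error(y_true=y_true, y_pred=y_pred) ** (1 / 2)
rmspe_by_hand = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(round(rmspe, 2), round(rmspe_by_hand, 2))
```

The errors here are 10,000, 20,000 and 0 dollars, yet the RMSPE comes out near 12,910 rather than their plain average of 10,000, because squaring gives larger misses more weight.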

To visualize the simple linear regression model, we can plot the predicted house sale prices across all possible house sizes. Since the model is a straight line, we only need to calculate the predicted prices at the smallest and largest house sizes, then draw a line between them. By overlaying this line on a scatter plot of the actual housing prices, we can visually check how well the model fits the data.

Don't worry about the details of this plot. This is simply depicting the predicted values of house price (red line) for the final linear regression model.
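
For readers who do want the details, one possible way to draw such a plot is sketched below. It runs on synthetic stand-in data, and the variable names, colors, and axis labels are assumptions rather than the notebook's original plotting code.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data: 80 made-up houses scattered around a line
rng = np.random.default_rng(1)
demo = pd.DataFrame({"sq__ft": rng.integers(800, 4000, size=80)})
demo["price"] = 7_000 + 139 * demo["sq__ft"] + rng.normal(0, 40_000, size=80)

demo_lm = LinearRegression().fit(demo[["sq__ft"]], demo["price"])

# A straight line is fixed by two points, so predict only at the
# smallest and largest house sizes
endpoints = pd.DataFrame({"sq__ft": [demo["sq__ft"].min(), demo["sq__ft"].max()]})
endpoints["predicted"] = demo_lm.predict(endpoints[["sq__ft"]])

fig, ax = plt.subplots()
ax.scatter(demo["sq__ft"], demo["price"], alpha=0.4, label="actual prices")
ax.plot(endpoints["sq__ft"], endpoints["predicted"], color="red", label="linear model")
ax.set_xlabel("House size (square feet)")
ax.set_ylabel("Sale price (USD)")
ax.legend()
```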



### Cross-validation

Now, let's look at how we can implement cross-validation for linear regression. Using cross-validation provides a more reliable and comprehensive evaluation of your model's performance compared to a single train-test split!

To perform 5-fold cross-validation in Python using `scikit-learn`, we need to follow these steps:
1. Set cv=5 for 5 folds.
2. Provide the predictors and response as X and y.
3. Use the `cross_validate` function from scikit-learn.
4. Convert the results into a pandas DataFrame for better visualization.


In [9]:
lm = LinearRegression()
# cross_validate fits a fresh clone of lm on each fold, so we get one slope and
# one intercept per fold; lm itself is left unfitted
returned_dictionary = cross_validate(
    estimator=lm,
    cv=5,    # number of cross-validation folds
    X=sacramento[["sq__ft"]],
    y=sacramento["price"],
    scoring="neg_root_mean_squared_error",    # or scoring="r2"
    return_estimator=True    # keep the fitted model from each fold
)

# Initialize an empty list to hold your coefficients
coefficients_list = []
intercept_list = []
# Loop over all the models in your results, extracting the model coefficients like you would with a scikit-learn attribute
for model in returned_dictionary['estimator']:
    coefficients_list.append(model.coef_)
    intercept_list.append(model.intercept_)

coefficients_list, intercept_list

([array([134.38969389]),
  array([132.17390252]),
  array([131.86015559]),
  array([132.47892322]),
  array([142.81234371])],
 [14460.38087747668,
  18680.436829069688,
  23561.545038647542,
  17172.844691855542,
  6638.57040192993])
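
The five slightly different slopes and intercepts above can be puzzling at first. The sketch below, on synthetic data with hypothetical names, illustrates what is going on: `cross_validate` fits a separate copy of the estimator on each fold's training portion, and the estimator object we passed in is never fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Hypothetical data: 100 made-up houses scattered around a line
rng = np.random.default_rng(0)
X = rng.uniform(800, 4000, size=(100, 1))
y = 7_000 + 139 * X[:, 0] + rng.normal(0, 40_000, size=100)

demo_lm = LinearRegression()
results = cross_validate(demo_lm, X, y, cv=5, return_estimator=True)

n_models = len(results["estimator"])        # one fitted model per fold
original_is_fit = hasattr(demo_lm, "coef_") # the object we passed in stays unfitted

print(n_models, original_is_fit)
```

Each fold's model sees a different 80% of the rows, which is why the coefficients vary a little from fold to fold. To get a single final model, we would still call `fit` ourselves.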

The `test_score` column displays the negative root mean squared prediction error (RMSPE) for each fold. scikit-learn reports the negated value so that a higher score is always better.

To obtain the actual (nonnegative) RMSPE values, we take the absolute value of the `test_score` column.


In [18]:
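# Convert the cross-validation results to a DataFrame, dropping the fitted models
cv_5_df = pd.DataFrame(returned_dictionary).drop(columns="estimator")
cv_5_df["test_score"] = cv_5_df["test_score"].abs()
cv_5_df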

Unnamed: 0,fit_time,score_time,test_score
0,0.004998,0.003004,81369.919847
1,0.003954,0.003374,97590.23634
2,0.001931,0.0,61790.733828
3,0.0,0.011874,92026.28301
4,0.004997,0.002382,75474.94749


We can then aggregate the 5 fold scores to compute their mean and the standard error of the mean, which give the estimated RMSPE and the uncertainty around that estimate.


In [19]:
cv_5_metrics = cv_5_df.agg(["mean","sem"])
cv_5_metrics

Unnamed: 0,fit_time,score_time,test_score
mean,0.003176,0.004127,81650.424103
sem,0.000972,0.002024,6302.216095


The RMSPE of $81,650 suggests that the model's price predictions typically deviate from the actual prices by about this amount. The standard error of about 6,302 indicates the uncertainty around this estimate: the mean plus or minus one standard error ranges from $75,348 to $87,952.
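
To see where these two numbers come from, we can recompute them by hand from the five fold scores printed above. This is a small sketch assuming the usual definition of the standard error of the mean (the sample standard deviation divided by the square root of the number of folds), which is what pandas' `sem` computes by default.

```python
import numpy as np

# The five per-fold RMSPE values from the table above
fold_rmspe = np.array([81369.919847, 97590.23634, 61790.733828,
                       92026.28301, 75474.94749])

mean_rmspe = fold_rmspe.mean()

# Standard error of the mean: sample standard deviation (ddof=1) / sqrt(n)
sem_rmspe = fold_rmspe.std(ddof=1) / np.sqrt(len(fold_rmspe))

print(round(mean_rmspe), round(sem_rmspe))
```

With only five folds the standard error is itself a rough estimate, so the range is best read as an indication of scale rather than a precise interval.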

Since this estimate is based on cross-validation, it represents the model's error across different subsets of the data, rather than one random train-test split, which helps provide a more robust estimate of its performance compared to a single holdout test set!


These steps can also be repeated for $R^2$, which is our other main metric for model evaluation.

In [20]:
returned_dictionary2 = cross_validate(
    estimator=lm,
    cv=5,    # setting up the cross validation number
    X=sacramento[["sq__ft"]],
    y=sacramento["price"],
    scoring="r2" 
)

cv_5_df2 = pd.DataFrame(returned_dictionary2)    # convert to a pandas DataFrame

# Aggregate to obtain the mean and standard error across all 5 folds
cv_5_metrics2 = cv_5_df2.agg(["mean", "sem"])
cv_5_metrics2


Unnamed: 0,fit_time,score_time,test_score
mean,0.005417,0.001799,0.507727
sem,0.002592,0.0008,0.020646


This means our predictor variable (square footage) explains roughly 51% ± 2% of the variance in housing prices.
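
For intuition about what $R^2$ measures, the sketch below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares the result with scikit-learn's `r2_score`. The four prices are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted sale prices (USD)
y_true = np.array([200_000, 300_000, 250_000, 400_000])
y_pred = np.array([220_000, 280_000, 260_000, 370_000])

ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean price
r2_manual = 1 - ss_res / ss_tot

print(round(r2_manual, 4), round(r2_score(y_true, y_pred), 4))
```

An $R^2$ of 1 would mean perfect predictions, while 0 means the model does no better than always predicting the mean price.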

By using cross-validation, we evaluate the model on different subsets of the data, rather than relying on a single train-test split. This reduces the risk that our performance estimate looks unusually good or bad just because of how one particular split happened to fall, and it gives a better sense of how the model generalizes!


In [21]:
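# What if we evaluate the model on the whole dataset?
# Note: cross_validate only fits clones, so lm must already have been fit with lm.fit(...) before this cell runs
# Make predictions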
sacramento["predicted"] = lm.predict(sacramento[["sq__ft"]])

# calculate RMSPE_whole_set
RMSPE_whole_set = mean_squared_error(
    y_true=sacramento["price"],
    y_pred=sacramento["predicted"]
)**(1/2)

RMSPE_whole_set

81958.20486381615

In [22]:
# Calculate R² on the whole dataset
r2_whole_set = r2_score(
    y_true=sacramento["price"],
    y_pred=sacramento["predicted"]
)

r2_whole_set

0.5300873163598303

### Multivariable linear regression

Wouldn't it be nice if we could consider more than just one factor when making predictions? Multivariable linear regression lets us do just that by including multiple predictors. In the real world, outcomes like house prices depend on more than just one variable. For example, not only does the size of the house matter, but so do factors like the number of bedrooms, the location, and even the property’s condition.

With multivariable linear regression, we can take several of these into account, which can give a more accurate and realistic model. Here, we'll use the Sacramento real estate data to include both house size and number of bedrooms to predict sale prices. This opens the door to countless possibilities where we can model more complex relationships in various fields.

The equation for the multivariable regression is:

$$
\text{House sale price} = b_0 + b_1 \times (\text{house size}) + b_2 \times (\text{number of bedrooms})
$$

where:

- $b_0$ is the predicted price when both house size and number of bedrooms are 0 (the intercept).
- $b_1$ is how much the predicted price increases for each additional square foot, holding the number of bedrooms fixed (the slope for house size).
- $b_2$ is how much the predicted price changes for each additional bedroom, holding house size fixed (the slope for number of bedrooms).

Using scikit-learn, we can easily include both predictors and fit the model as before.

#### **Step 1:** Fit the linear regression model on the training data.

In [23]:
# Multivariable Linear Regression (using both square footage and number of bedrooms as predictors)
mlm = LinearRegression()

mlm.fit(
    sacramento_train[["sq__ft", "beds"]],  # Two predictors: square footage and number of bedrooms
    sacramento_train["price"]  # Target variable: house prices
)

# Comparison: This is how simple linear regression would look, using only square footage
# lm.fit(
#    sacramento_train[["sq__ft"]],  # Single predictor: square footage
#    sacramento_train["price"]  # Target variable: house prices
# )

A multivariable linear regression model has one slope (coefficient) per predictor and a single intercept, which together describe the best fit mathematically. In scikit-learn, we can extract these values from the model as follows:

- Slopes (coefficients): obtained from the `coef_` attribute of the model.
- Intercept: obtained from the `intercept_` attribute of the model.

In [24]:
mlm.coef_

array([   167.62705043, -30687.56006942])

In [25]:
mlm.intercept_

62336.77007332351

Since we passed `sacramento_train[["sq__ft", "beds"]]` when training, `mlm.coef_[0]` corresponds to square feet and `mlm.coef_[1]` corresponds to beds.

Given the model output values:

- Intercept ($b_0$): 62,336
- Slope for house size ($b_1$): 167
- Slope for number of bedrooms ($b_2$): -30,687

The equation of the plane of best fit is:

$$
\text{House sale price} = 62,336 + 167 \times (\text{house size}) -30,687 \times (\text{number of bedrooms})
$$

This equation describes how the house sale price is predicted based on both house size and the number of bedrooms.
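
As a worked example, we can evaluate the equation by hand using the unrounded coefficients printed above. The helper `predict_price` and the example house (2,000 square feet, 3 or 4 bedrooms) are ours, chosen only for illustration.

```python
# Coefficients copied from the mlm.intercept_ and mlm.coef_ outputs above
b0 = 62336.77007332351   # intercept
b1 = 167.62705043        # slope for house size (USD per square foot)
b2 = -30687.56006942     # slope for number of bedrooms (USD per bedroom)

def predict_price(size_sqft, beds):
    """Predicted sale price from the plane of best fit."""
    return b0 + b1 * size_sqft + b2 * beds

three_bed = predict_price(2000, 3)
four_bed = predict_price(2000, 4)

print(round(three_bed, 2))             # about 305,528
print(round(four_bed - three_bed, 2))  # the difference equals b2
```

The negative $b_2$ does not mean bedrooms make a house less valuable in general. It says that, among houses of the same size, the model predicts a lower price for the one with more bedrooms.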

#### **Step 2:** Make predictions on the test data set to assess the quality of our model.

In [26]:
# Predict house prices using the multivariable linear regression model (mlm) with two predictors: square footage and number of bedrooms.
# This is different from earlier examples where only square footage was used as a predictor.
sacramento_test["predicted"] = mlm.predict(sacramento_test[["sq__ft", "beds"]])

# Calculate RMSPE for the multivariable model.
lm_mult_test_RMSPE = mean_squared_error(
    y_true=sacramento_test["price"],
    y_pred=sacramento_test["predicted"]
)**(1/2)

lm_mult_test_RMSPE

74441.52046148347

In [27]:
# Calculate R² 
lm_mult_test_r2 = r2_score(
    y_true=sacramento_test["price"],
    y_pred=sacramento_test["predicted"]
)

lm_mult_test_r2

0.4925271118728288

Our model’s test error as assessed by RMSPE is $74,441 and $R^2$ is 0.49.


Once again, it would be best practice to perform cross-validation on our entire dataset, so our results aren't overly reliant on a single train-test split. By doing cross-validation, we evaluate the model's performance on different subsets of the data, ensuring that we obtain a more robust and reliable estimate of how the model will perform on unseen data. This also helps mitigate the risk of overfitting to a specific subset, making the model's performance assessment more generalizable to the broader dataset.

These steps are identical to above!

In [28]:
#scoring method as neg_root_mean_squared_error

returned_dictionary_mlm = cross_validate(
    estimator=mlm, 
    cv=5,    # setting up the cross validation number
    X=sacramento[["sq__ft", "beds"]],
    y=sacramento["price"],
    scoring="neg_root_mean_squared_error" 
)

cv_5_df_mlm = pd.DataFrame(returned_dictionary_mlm)    # Converting it to pandas DataFrame
cv_5_df_mlm["test_score"] = cv_5_df_mlm["test_score"].abs()

cv_5_df_mlm

Unnamed: 0,fit_time,score_time,test_score
0,0.004999,0.002999,79862.10441
1,0.004001,0.001998,95393.067119
2,0.003986,0.002999,61298.854877
3,0.004002,0.002999,92691.983589
4,0.006,0.003392,73940.758792


In [29]:
#aggregate to obtain the mean and standard error across all 5 folds
cv_5_metrics_mlm = cv_5_df_mlm.agg(["mean","sem"])
cv_5_metrics_mlm

Unnamed: 0,fit_time,score_time,test_score
mean,0.004597,0.002878,80637.353757
sem,0.000401,0.000233,6254.870549


Our model’s cross-validated error, measured by Root Mean Squared Prediction Error (RMSPE), is $80,637 with a standard error of ± $6,255.


In [30]:
#scoring method as r2

returned_dictionary_mlm2 = cross_validate(
    estimator=mlm, 
    cv=5,    # setting up the cross validation number
    X=sacramento[["sq__ft", "beds"]],
    y=sacramento["price"],
    scoring="r2" 
)

cv_5_df_mlm2 = pd.DataFrame(returned_dictionary_mlm2)    # Converting it to pandas DataFrame

cv_5_df_mlm2


Unnamed: 0,fit_time,score_time,test_score
0,0.003195,0.0,0.456312
1,0.0,0.0,0.55606
2,0.017091,0.003,0.559608
3,0.002999,0.003,0.483592
4,0.003039,0.00601,0.543327


In [31]:
#aggregate to obtain the mean and standard error across all 5 folds
cv_5_metrics_mlm2 = cv_5_df_mlm2.agg(["mean","sem"])
cv_5_metrics_mlm2

Unnamed: 0,fit_time,score_time,test_score
mean,0.005265,0.002402,0.51978
sem,0.003016,0.001124,0.02097


This means our predictor variables (house size and number of bedrooms) explain roughly 52% ± 2% of the variance in housing prices.

### Conclusion

In this notebook, we worked through several steps to predict housing prices from house size (and later number of bedrooms), using a data set of 932 real estate transactions in Sacramento, California. Here's a summary of what we covered:

1. **Simple Linear Regression:** We implemented simple linear regression and evaluated its performance on a test dataset. We then applied cross-validation to the entire dataset to assess how well the model generalizes to unseen data, ensuring that the model's performance was not dependent on a single train-test split.

2. **Multiple Linear Regression:** We implemented multiple linear regression. Similar to the simple linear regression step, we performed a train-test split and evaluated the model's performance on a test dataset, followed by cross-validation on the entire dataset.

By applying cross-validation to both the simple and multiple linear regression models, we obtained performance estimates that do not hinge on a single train-test split.

We hope this notebook has provided a practical understanding of regression, model evaluation, and the application of machine learning algorithms like linear regression. Feel free to experiment further with the dataset or the code to enhance your learning!