# Applying the Least Squares Method with Expanded Data

In previous sections, we explored how linear regression could be applied to a fixed set of cabins using the least squares method. Now, we will extend this concept by working with additional data to better simulate a real-world scenario, where the data size and number of features can vary.

Here, we'll walk through a practical example where we use the least squares method to fit a linear regression model. The dataset consists of five cabins, each described by features like size, sauna size, distance to water, number of indoor bathrooms, and proximity to neighbors. We aim to estimate the coefficients that best explain the relationship between these features and the price of each cabin.

## Dataset Overview

We start with the following dataset of cabin characteristics and their corresponding prices:

| Cabin   | Size (sqm) | Sauna Size (sqm) | Distance to Water (m) | Number of Indoor Bathrooms | Proximity to Neighbors (m) | Price (€) |
|---------|------------|------------------|-----------------------|----------------------------|----------------------------|-----------|
| Cabin 1 | 25         | 2                | 50                    | 1                          | 500                        | 127,900   |
| Cabin 2 | 39         | 3                | 10                    | 1                          | 1000                       | 222,100   |
| Cabin 3 | 13         | 2                | 13                    | 1                          | 1000                       | 143,750   |
| Cabin 4 | 82         | 5                | 20                    | 2                          | 120                        | 268,000   |
| Cabin 5 | 130        | 6                | 10                    | 2                          | 600                        | 460,700   |

We use these features as the input \( X \), and the prices as the output \( y \). Our goal is to fit a linear regression model that captures the relationship between these features and the price of the cabins.


## Applying the Least Squares Method

The least squares method allows us to estimate the coefficients that minimize the sum of squared errors (SSE) between the predicted cabin prices and the actual prices.
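Written out, the quantity being minimized can be expressed as follows. The notation is an assumption introduced here for illustration: \( n \) is the number of cabins, \( x_i \) is the feature row of cabin \( i \), \( y_i \) is its price, and \( c \) is the coefficient vector.

\[
\mathrm{SSE}(c) = \sum_{i=1}^{n} \left( y_i - x_i^{\top} c \right)^2 = \lVert y - Xc \rVert_2^2
\]

Each term is the squared gap between one actual price and the price the model predicts for that cabin.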

We can calculate the coefficient estimates using NumPy's `linalg.lstsq` function, which solves for the coefficient vector \( c \) by minimizing the SSE. The inputs are the feature matrix \( X \) and the price vector \( y \):

In [1]:
import numpy as np

# Input data: cabin features (X) and prices (y)
x = np.array([
             [25, 2, 50, 1, 500], 
             [39, 3, 10, 1, 1000], 
             [13, 2, 13, 1, 1000], 
             [82, 5, 20, 2, 120], 
             [130, 6, 10, 2, 600]
            ])   
y = np.array([127900, 222100, 143750, 268000, 460700])

# Estimate the coefficients using the least squares method
c = np.linalg.lstsq(x, y, rcond=None)[0]
print(c)
print(x @ c)

[3000.  200.  -50. 5000.  100.]
[127900. 222100. 143750. 268000. 460700.]


The first line of output holds the estimated coefficients, one per feature in column order. The second line holds the model's predicted prices for the five cabins in the dataset.

## Interpreting the Results

- The **first coefficient** (approximately 3000) corresponds to the cabin size in square meters. According to the model, each additional square meter raises the predicted price by about €3000, with all other features held fixed.

- The **third coefficient** (approximately −50) corresponds to the distance to water. Each additional meter from the water lowers the predicted price by about €50, with the other features held fixed; each meter closer to the water raises it by the same amount.
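The per-unit reading of the coefficients can be checked with a small sketch. It reuses the coefficient values printed above; the cabin below is a hypothetical example with made-up feature values, not a row from the dataset.

```python
import numpy as np

# Coefficients reported by the five-cabin fit, in column order:
# size, sauna size, distance to water, bathrooms, distance to neighbors
c = np.array([3000.0, 200.0, -50.0, 5000.0, 100.0])

# Hypothetical cabin (illustrative values only, not part of the dataset)
cabin = np.array([50.0, 3.0, 30.0, 1.0, 400.0])

# The same cabin with one extra square meter, and with one extra meter to the water
bigger = cabin + np.array([1.0, 0.0, 0.0, 0.0, 0.0])
farther = cabin + np.array([0.0, 0.0, 1.0, 0.0, 0.0])

base_price = cabin @ c
size_effect = bigger @ c - base_price      # change from one more square meter
water_effect = farther @ c - base_price    # change from one more meter to the water

print(base_price)
print(size_effect)
print(water_effect)
```

Because the model is linear, the two differences equal `c[0]` and `c[2]` regardless of which cabin is used as the starting point.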

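Interestingly, the predicted prices for the five cabins match the actual prices exactly. This happens because the model has five coefficients (one per feature, with no intercept term) and there are five observations, so \( X \) is a square matrix. Because this particular \( X \) is also invertible, the system \( Xc = y \) has exactly one solution, and that solution reproduces every price with zero error. Such a perfect fit does not by itself indicate how well the model would predict the price of a new cabin.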
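One way to confirm this reasoning is to check the rank of the feature matrix and solve the square system directly. The sketch below assumes the same five-cabin data as above; `np.linalg.matrix_rank` and `np.linalg.solve` are used here for illustration and were not part of the earlier example.

```python
import numpy as np

# Five cabins, five features: a square feature matrix
x = np.array([
    [25, 2, 50, 1, 500],
    [39, 3, 10, 1, 1000],
    [13, 2, 13, 1, 1000],
    [82, 5, 20, 2, 120],
    [130, 6, 10, 2, 600],
])
y = np.array([127900, 222100, 143750, 268000, 460700])

# A square matrix with full rank is invertible, so x @ c = y has exactly one solution
rank = np.linalg.matrix_rank(x)
print(x.shape, rank)

# Solve the square system directly and compare with the least squares estimate
c_exact = np.linalg.solve(x, y)
c_lstsq = np.linalg.lstsq(x, y, rcond=None)[0]

print(np.allclose(c_exact, c_lstsq))   # both approaches give the same coefficients
print(np.allclose(x @ c_exact, y))     # predictions reproduce the prices
```

If the rank were lower than the number of features, `np.linalg.solve` would raise an error and the coefficients would not be uniquely determined.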

## Adding More Data to the Model

Now, let's see what happens when we add more cabins to the dataset. By introducing additional data points, we expect the model to adjust, as it attempts to find a better fit for the larger dataset.

We'll simulate this by adding one more cabin to the data:

In [2]:
import numpy as np
from io import StringIO

# Simulated whitespace-separated input with six cabins (the last column is the price)
input_string = '''
25 2 50 1 500 127900
39 3 10 1 1000 222100
13 2 13 1 1000 143750
82 5 20 2 120 268000
130 6 10 2 600 460700
115 6 10 1 550 407000
'''

np.set_printoptions(precision=1)    # Set output precision for easier reading
 
def fit_model(input_file):

    # Read the whitespace-separated input into a 2-D array (one row per cabin)
    data = np.genfromtxt(input_file, skip_header=0)

    # Split data into features (x: every column but the last) and prices (y: last column)
    x = data[:, :-1]
    y = data[:, -1]

    # Estimate the coefficients using the least squares method
    c = np.linalg.lstsq(x, y, rcond=None)[0]
 
    print(c)
    print(x @ c)

# Simulate reading a file
input_file = StringIO(input_string)
fit_model(input_file)


[2989.6  800.6  -44.8 3890.8   99.8]
[127907.6 222269.8 143604.5 268017.6 460686.6 406959.9]


## Observing the Changes in the Model

By adding the sixth cabin to the dataset, we observe changes in both the estimated coefficients and the predicted prices. For instance, the estimated effect of cabin size on price changed from approximately €3000/m² to €2989.6/m². The predicted prices for the original five cabins also changed slightly and no longer match the actual prices exactly: Cabin 2, for example, is now predicted at €222,269.8 against an actual price of €222,100.

With six observations and only five coefficients, the system \( Xc = y \) is overdetermined, and in general no coefficient vector reproduces every price exactly. The least squares method instead picks the coefficients with the smallest SSE across all six cabins, which leads to slight adjustments in the coefficient estimates and predicted values.
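The remaining error can be inspected directly. The following sketch assumes the same six-cabin data as above and computes the per-cabin errors and the SSE by hand, then compares the result with the residual sum that `np.linalg.lstsq` returns as its second value.

```python
import numpy as np

# Six cabins, five features: more observations than coefficients
x = np.array([
    [25, 2, 50, 1, 500],
    [39, 3, 10, 1, 1000],
    [13, 2, 13, 1, 1000],
    [82, 5, 20, 2, 120],
    [130, 6, 10, 2, 600],
    [115, 6, 10, 1, 550],
])
y = np.array([127900, 222100, 143750, 268000, 460700, 407000])

# lstsq returns the coefficients, the sum of squared residuals, the rank of x,
# and the singular values of x (unused here)
c, residuals, rank, _ = np.linalg.lstsq(x, y, rcond=None)

# Per-cabin error and the sum of squared errors, computed by hand
errors = y - x @ c
sse = np.sum(errors ** 2)

print(x.shape, rank)
print(errors)
print(sse)
```

The errors are small relative to the prices but not zero, which is the practical difference between this fit and the exact five-cabin fit.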

## Conclusion

This example demonstrates the power of the least squares method in fitting linear regression models. As the dataset expands, the model adapts to account for additional variation in the data, leading to updated predictions and coefficient estimates. This showcases how linear regression provides a flexible framework for predictive modeling, even as the number of data points or features changes.

Through this exercise, we have seen that:

- The least squares method fits the data exactly when the number of coefficients matches the number of observations and the feature matrix is invertible.

- When more observations than coefficients are available, an exact fit is generally no longer possible, and the model settles on the coefficients that minimize the SSE across all observations.

This flexibility makes linear regression a fundamental tool in machine learning and data analysis, especially for tasks where interpretability and simplicity are essential.