# Linear Regression: A Core Tool in Predictive Modeling

In modern data science, terminology shifts with the audience. With investors, the broad term Artificial Intelligence (AI) tends to attract attention, while narrower terms such as Machine Learning (ML) matter more when hiring people with specific expertise. Whatever the label, one of the most frequently applied methods in practical machine learning is linear regression, especially for predictive tasks.

Linear regression is fundamental in tasks where we need to predict continuous numerical values, given a set of input features. While it might appear simple compared to more advanced techniques like neural networks, linear regression is both interpretable and powerful in the right settings.

## Types of Machine Learning and Their Applications

Before we delve into the details of linear regression, it’s essential to outline the broader categories of machine learning, as this provides context for where regression fits in.

### Supervised Learning

In supervised learning, the model is trained on labeled data, where both inputs and the corresponding outputs are known. The goal is to build a model that can predict the output for unseen inputs. For example, if we have data on apartment sales that include features like apartment size and the number of floors, we can use this data to build a model to predict future apartment prices.

Supervised learning can take two main forms:

- **Classification**: The goal is to assign a label to each input, such as determining if an image contains a dog, cat, or parrot.

- **Regression**: The goal is to predict a continuous value, such as house prices, based on historical data of features like square footage and location.

### Unsupervised Learning

Unsupervised learning deals with data that does not have labeled outputs. Instead, the model’s goal is to identify hidden patterns or structures. For instance, a video streaming service might group users based on their viewing habits, identifying clusters of users with similar tastes without predefined labels.

### Reinforcement Learning

In reinforcement learning, the focus is on training agents to make decisions that maximize some cumulative reward. The agent interacts with an environment and learns from the feedback it receives after each action. A well-known example is AlphaGo, a system trained using reinforcement learning to master the game of Go, surpassing human champions.

## Why Linear Regression?

Linear regression specifically falls under the umbrella of supervised learning, where the goal is to predict a continuous outcome based on input features. This method is particularly useful when the relationship between inputs and outputs can be approximated by a linear function.

In regression tasks, each feature contributes to the final predicted value by being multiplied by a coefficient. The challenge is to find the optimal coefficients that minimize the prediction error when compared to the actual data.

### Key Terminology

- **Features (Inputs)**: These are the variables that influence the predicted outcome. In a cabin pricing model, examples include cabin size, distance to water, and number of bathrooms.

- **Response (Output)**: This is the value that the model aims to predict, such as the cabin price.


## Understanding the Linear Regression Formula

In linear regression, the prediction is a weighted sum of the input features. For example, the price of a cabin can be predicted using a formula like:

$$ \text{price} = c_1 \times \text{feature}_1 + c_2 \times \text{feature}_2 + c_3 \times \text{feature}_3 + \dots + \text{intercept} $$

Where:

- **c1, c2, c3**, ... are the coefficients that represent how much each feature contributes to the price.

- The **intercept** adjusts the prediction by adding a constant, irrespective of the feature values.
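
As a minimal sketch of the formula above, the weighted sum can be written as a small Python function. The function name `predict_price` and the intercept value of 10,000 are illustrative assumptions, not values from a fitted model; the feature and coefficient values are the ones used in the cabin example that follows.

```python
def predict_price(features, coefficients, intercept=0):
    """Return the weighted sum of the features plus a constant intercept."""
    if len(features) != len(coefficients):
        raise ValueError("features and coefficients must have the same length")
    return sum(c * x for c, x in zip(coefficients, features)) + intercept


# Cabin features: size, sauna size, distance to water, bathrooms, proximity of neighbors
features = [66, 5, 15, 2, 500]
coefficients = [3000, 200, -50, 5000, 100]

without_intercept = predict_price(features, coefficients)
# Hypothetical intercept of 10,000, chosen only to show its effect
with_intercept = predict_price(features, coefficients, intercept=10_000)

print(without_intercept)  # 258250
print(with_intercept)     # 268250
```

The intercept shifts every prediction by the same amount, regardless of the feature values.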

## Predicting Cabin Prices: A Practical Example

Let’s walk through a practical example where we predict the price of a cabin based on its features. We will use linear regression to understand how factors like cabin size and distance to water influence the price.

In [1]:
# input values for one mökki (cabin): size, size of sauna, distance to water, number of indoor bathrooms,
# proximity of neighbors

x = [66, 5, 15, 2, 500]
c = [3000, 200 , -50, 5000, 100]     # coefficient values

# Calculating the predicted price
prediction = c[0]*x[0] + c[1]*x[1] + c[2]*x[2] + c[3]*x[3] + c[4]*x[4]

print(prediction)

258250


In this case, the predicted price of a 66 sqm cabin with the given features is €258,250. This model assumes a linear relationship between the features and the cabin price.

## Scaling the Prediction to Multiple Cabins

While the above example predicts the price for a single cabin, we can extend the model to handle multiple properties at once. Here’s how the same linear regression model can be applied to predict prices for several cabins:

The function below processes any number of cabins, each described by any number of features (five here). It assumes that every list in `X` has the same number of elements as the coefficient list `c`.

In [2]:
# input values for three mökkis: size, size of sauna, distance to water, number of indoor bathrooms, 
# proximity of neighbors
X = [[66, 5, 15, 2, 500], 
     [21, 3, 50, 1, 100], 
     [120, 15, 5, 2, 1200]]
c = [3000, 200, -50, 5000, 100]    # coefficient values

def predict(X, c):
    for cabin in X:
        price = sum([c[i] * cabin[i] for i in range(len(c))])           
        print(price)

predict(X, c)

258250
76100
492750


## Leveraging NumPy for Efficient Computations

When working with larger datasets, efficiency becomes a concern. To optimize our linear regression computations, we can use the NumPy library, which simplifies matrix operations and improves performance.

In [3]:
import numpy as np

# Features and coefficients as NumPy arrays
x = np.array([66, 5, 15, 2, 500])
c = np.array([3000, 200 , -50, 5000, 100])

# Compute the predicted price using the dot product
print(x @ c)


258250


This method uses the dot product to calculate the price. For one- and two-dimensional arrays like these, NumPy’s `@` operator (matrix multiplication) gives the same result as `np.dot()`.

To handle multiple cabins, we can extend this approach:

In [4]:
import numpy as np

# Features for two cabins
x = np.array([[66, 5, 15, 2, 500], 
              [21, 3, 50, 1, 100]])
c = np.array([3000, 200 , -50, 5000, 100])

# Compute the predictions for both cabins
print(x @ c)

[258250  76100]


This approach efficiently handles multiple predictions at once, using NumPy to manage the array operations.
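
To make the connection between the loop-based and the NumPy versions explicit, the sketch below computes the prices for the three cabins from the earlier example both ways and checks that they agree. The helper name `predict_loop` is an assumption introduced here for illustration.

```python
import numpy as np

X = np.array([[66, 5, 15, 2, 500],
              [21, 3, 50, 1, 100],
              [120, 15, 5, 2, 1200]])
c = np.array([3000, 200, -50, 5000, 100])


def predict_loop(X, c):
    """Compute one weighted sum per cabin with explicit Python loops."""
    return [sum(c_j * x_j for c_j, x_j in zip(c, row)) for row in X]


loop_prices = predict_loop(X, c)
vector_prices = X @ c  # (3, 5) matrix times (5,) vector gives a (3,) vector

print(X.shape, c.shape, vector_prices.shape)       # (3, 5) (5,) (3,)
print(vector_prices)                               # [258250  76100 492750]
print(np.array_equal(loop_prices, vector_prices))  # True
```

Each row of `X` is multiplied element by element with `c` and summed, which is exactly what the loop does one cabin at a time.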

## Conclusion

Linear regression serves as an essential tool in predictive modeling, offering both simplicity and interpretability. By exploring this method, we can understand how various features affect predictions, making it well suited to applications like real estate pricing. While more complex models exist, linear regression remains a robust and effective technique, especially when the relationships between features and outputs are approximately linear.

---

# Exploring Least Squares in Linear Regression

In the previous section, we calculated price estimates directly using predefined coefficient values. While this approach is useful for prediction, it doesn’t leverage the full power of linear regression. The true strength of linear regression emerges when we flip the problem: rather than knowing the coefficients and predicting prices, we can use data to estimate the coefficients themselves. This allows us to determine how each feature influences the final price, which is the key to understanding relationships in data.

## Why Can't We Always Get Perfect Predictions?

In reality, it is nearly impossible to find coefficients that perfectly predict the prices for every data point. There are numerous reasons why this happens:

- **External factors**: Prices are affected by factors outside of the model, such as market trends, location desirability, and the economic climate.

- **Data noise**: Random variations in the data may introduce unpredictable fluctuations that are difficult for any model to capture.

- **Confounding variables**: Some features may have hidden relationships that aren’t accounted for in the model, making predictions less reliable.

- **Selection bias**: The data used to build the model may not represent the full population or all possible scenarios, leading to inaccuracies.

For these reasons, linear regression models will usually make approximate predictions rather than exact ones. Therefore, it's important to critically assess how well the model reflects the true relationships in the data and understand its limitations.

## Estimating Coefficients with the Least Squares Method

One of the most widely used techniques for estimating the coefficients in a linear regression model is the least squares method. This method, first published by Adrien-Marie Legendre in 1805, minimizes the sum of the squared differences between the actual observed values and the values predicted by the model.

Given a dataset with known input features \( X \) and known output values \( y \), the goal is to find the coefficient vector \( c \) that minimizes the sum of squared errors (SSE):

$$ \text{SSE} = \sum (y_{\text{actual}} - y_{\text{predicted}})^2 $$

The coefficients that minimize the SSE are those that make the model's predictions as close as possible to the true observed values.
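
Written out for a dataset with \( n \) observations, where \( x_i \) is the feature vector of observation \( i \) and \( y_i \) its actual output (the index \( i \) and the symbol \( n \) are notation introduced here for illustration, and the intercept is left out, as in the code examples), the same quantity can be expressed directly in terms of the coefficients:

$$ \text{SSE}(c) = \sum_{i=1}^{n} \left( y_i - x_i \cdot c \right)^2 = \lVert y - Xc \rVert^2 $$

Least squares chooses the \( c \) for which this value is smallest.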

## Practical Example: Finding the Best Coefficient Set

To illustrate this concept, we will calculate the sum of squared errors for several different sets of coefficient values and identify which set provides the best fit for the data. This is a simplified example of the least squares method, where instead of finding the global optimum, we evaluate a fixed number of alternatives.

In [None]:
import numpy as np

# Data: Features (X) and actual prices (y)
X = np.array([[66, 5, 15, 2, 500], 
              [21, 3, 50, 1, 100], 
              [120, 15, 5, 2, 1200]])

y = np.array([250000, 60000, 525000])

# Alternative sets of coefficient values
c = np.array([[3000, 200, -50, 5000, 100], 
              [2000, -250, -100, 150, 250], 
              [3000, -100, -150, 0, 150]])

def find_best(X, y, c):
    smallest_error = np.inf  # Initialize with infinity to find minimum
    best_index = -1  # To track the best set of coefficients
    
    for i, coeff in enumerate(c):
        # Predict prices using current coefficient set
        predictions = X @ coeff
        
        # Calculate sum of squared errors (SSE)
        sse = np.sum((y - predictions) ** 2)
        
        # Update best index if current set has a smaller error
        if sse < smallest_error:
            smallest_error = sse
            best_index = i
    
    print("The best set of coefficients is set %d" % best_index)

find_best(X, y, c)


## How the Least Squares Method Works

The least squares method tries to find the best-fitting line (or hyperplane in higher dimensions) that minimizes the error between the actual data points and the predicted values. The key idea is to adjust the coefficients so that the sum of squared errors across all data points is as small as possible.

In the example above, we are comparing three different sets of coefficients. For each set, we compute the predictions by multiplying the input features \( X \) by the coefficients \( c \), and then calculate the sum of squared errors for the difference between the actual prices and the predicted prices.
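
The same comparison can also be done without an explicit loop. The sketch below, which reuses the data from the example above, computes the SSE for all three coefficient sets at once; the variable names `predictions`, `sse`, and `best_index` are choices made here for illustration.

```python
import numpy as np

X = np.array([[66, 5, 15, 2, 500],
              [21, 3, 50, 1, 100],
              [120, 15, 5, 2, 1200]], dtype=float)
y = np.array([250000, 60000, 525000], dtype=float)
c = np.array([[3000, 200, -50, 5000, 100],
              [2000, -250, -100, 150, 250],
              [3000, -100, -150, 0, 150]], dtype=float)

# Column j of `predictions` holds the prices predicted by coefficient set j
predictions = X @ c.T

# Squared error per cabin and per set, summed over the cabins (rows)
sse = np.sum((y[:, np.newaxis] - predictions) ** 2, axis=0)
best_index = int(np.argmin(sse))

for i, value in enumerate(sse):
    print(f"set {i}: SSE = {value:,.0f}")
print("best set:", best_index)
```

With this data, the second set (index 1) has the smallest SSE, roughly 1.4 × 10^8 compared with about 1.4 × 10^9 and 6.8 × 10^8 for the other two.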

## Visualizing the Fit

To better understand how well each coefficient set fits the data, it is helpful to visualize the predictions and compare them to the actual prices. If the model is a good fit, the predicted values should lie close to the actual data points when plotted on a graph. By visualizing this relationship, we can determine whether a linear model is appropriate for the data at hand.

## Conclusion

The least squares method is a cornerstone technique in both statistics and machine learning for fitting linear models. By minimizing the sum of squared errors, it provides a simple yet effective way to estimate the coefficients that best explain the relationships in the data. While it may not always provide a perfect fit, especially in the presence of noise or bias, it is a powerful tool for understanding how features contribute to predictions.

As we continue exploring linear regression, we will see how this method can be applied to more complex datasets and scenarios, extending beyond basic cabin price predictions to real-world applications.

# Applying the Least Squares Method with Expanded Data

In previous sections, we explored how linear regression could be applied to a fixed set of cabins using the least squares method. Now, we will extend this concept by working with additional data to better simulate a real-world scenario, where the data size and number of features can vary.

Here, we'll walk through a practical example where we use the least squares method to fit a linear regression model. The dataset consists of five cabins, each described by features like size, sauna size, distance to water, number of indoor bathrooms, and proximity to neighbors. We aim to estimate the coefficients that best explain the relationship between these features and the price of each cabin.

## Dataset Overview

We start with the following dataset of cabin characteristics and their corresponding prices:

| Cabin   | Size (sqm) | Sauna Size (sqm) | Distance to Water (m) | Number of Indoor Bathrooms | Proximity to Neighbors (m) | Price (€) |
|---------|------------|------------------|-----------------------|----------------------------|----------------------------|-----------|
| Cabin 1 | 25         | 2                | 50                    | 1                          | 500                        | 127,900   |
| Cabin 2 | 39         | 3                | 10                    | 1                          | 1000                       | 222,100   |
| Cabin 3 | 13         | 2                | 13                    | 1                          | 1000                       | 143,750   |
| Cabin 4 | 82         | 5                | 20                    | 2                          | 120                        | 268,000   |
| Cabin 5 | 130        | 6                | 10                    | 2                          | 600                        | 460,700   |

We use these features as the input \( X \), and the prices as the output \( y \). Our goal is to fit a linear regression model that captures the relationship between these features and the price of the cabins.


## Applying the Least Squares Method

The least squares method allows us to estimate the coefficients that minimize the sum of squared errors (SSE) between the predicted cabin prices and the actual prices.

We can calculate the coefficient estimates using NumPy's `linalg.lstsq` function, which solves for the coefficient vector \( c \) by minimizing the SSE. The inputs are the feature matrix \( X \) and the price vector \( y \):

In [1]:
import numpy as np

# Input data: cabin features (X) and prices (y)
x = np.array([
             [25, 2, 50, 1, 500], 
             [39, 3, 10, 1, 1000], 
             [13, 2, 13, 1, 1000], 
             [82, 5, 20, 2, 120], 
             [130, 6, 10, 2, 600]
            ])   
y = np.array([127900, 222100, 143750, 268000, 460700])

# Estimate the coefficients using the least squares method
c = np.linalg.lstsq(x, y, rcond=None)[0]
print(c)
print(x @ c)

[3000.  200.  -50. 5000.  100.]
[127900. 222100. 143750. 268000. 460700.]


The output consists of the estimated coefficients for each feature, as well as the predicted prices for the cabins in the dataset.

## Interpreting the Results

- The **first coefficient** (approximately 3000) corresponds to the cabin size in square meters. With the other features held fixed, each additional square meter raises the predicted price by €3000.

- The **third coefficient** (approximately −50) shows that for each meter the cabin is farther from the water, the price decreases by €50. Conversely, moving closer to water increases the price by the same amount per meter.

Interestingly, the predicted prices for the five cabins match the actual prices exactly. This happens because the number of observations (five cabins) equals the number of coefficients being estimated (one per feature). When the feature rows are also linearly independent, as they are here, the five equations in five unknowns have an exact solution, so the model fits the data perfectly.
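
One way to check this explanation, sketched below with the same five cabins, is to treat the data as five linear equations in five unknowns and solve them directly. Using `np.linalg.solve` here is an illustrative choice, not the method the example above relies on; it only works because the feature matrix is square and its rows are linearly independent.

```python
import numpy as np

x = np.array([[25, 2, 50, 1, 500],
              [39, 3, 10, 1, 1000],
              [13, 2, 13, 1, 1000],
              [82, 5, 20, 2, 120],
              [130, 6, 10, 2, 600]], dtype=float)
y = np.array([127900, 222100, 143750, 268000, 460700], dtype=float)

# Full rank (5) means the five equations have exactly one solution
print(np.linalg.matrix_rank(x))

c_exact = np.linalg.solve(x, y)                 # solve x @ c = y directly
c_lstsq = np.linalg.lstsq(x, y, rcond=None)[0]  # least squares estimate

print(np.allclose(c_exact, c_lstsq))  # both approaches agree
print(np.allclose(x @ c_exact, y))    # the fit reproduces every price
```

If two cabins had proportional feature rows, or if there were fewer cabins than features, the system would no longer have a unique solution and `np.linalg.solve` would not be applicable.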

## Adding More Data to the Model

Now, let's see what happens when we add more cabins to the dataset. By introducing additional data points, we expect the model to adjust, as it attempts to find a better fit for the larger dataset.

We'll simulate this by adding one more cabin to the data:

In [2]:
import numpy as np
from io import StringIO

# Simulated whitespace-separated input with six cabins (five features, then the price)
input_string = '''
25 2 50 1 500 127900
39 3 10 1 1000 222100
13 2 13 1 1000 143750
82 5 20 2 120 268000
130 6 10 2 600 460700
115 6 10 1 550 407000
'''

np.set_printoptions(precision=1)    # Set output precision for easier reading
 
def fit_model(input_file):

    # Read the whitespace-separated input
    data = np.genfromtxt(input_file, skip_header=0)

    # Split data into features (x: every column but the last) and prices (y: last column)
    x = data[:, :-1]
    y = data[:, -1]

    # Estimate the coefficients using the least squares method
    c = np.linalg.lstsq(x, y, rcond=None)[0]
 
    print(c)
    print(x @ c)

# Simulate reading a file
input_file = StringIO(input_string)
fit_model(input_file)


[2989.6  800.6  -44.8 3890.8   99.8]
[127907.6 222269.8 143604.5 268017.6 460686.6 406959.9]


## Observing the Changes in the Model

By adding the sixth cabin to the dataset, we observe changes in both the estimated coefficients and the predicted prices. For instance, the effect of cabin size on price changed from approximately €3000/m² to €2989.6/m², while the sauna-size coefficient moved much further, from about 200 to 800.6. The predicted prices for the original five cabins also changed slightly and no longer match the actual prices exactly.

This is a result of incorporating more data. With six observations but only five coefficients, there is in general no set of coefficients that reproduces every price, so the least squares method settles on the set with the smallest sum of squared errors. That compromise shifts the coefficient estimates and the predicted values.
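
The effect can be quantified with a short sketch that refits the six-cabin data and compares the resulting SSE with the SSE obtained by keeping the five-cabin coefficients. `np.linalg.lstsq` returns a tuple whose second element holds the sum of squared residuals when the system has more rows than columns and full column rank; the variable names below are illustrative.

```python
import numpy as np

x = np.array([[25, 2, 50, 1, 500],
              [39, 3, 10, 1, 1000],
              [13, 2, 13, 1, 1000],
              [82, 5, 20, 2, 120],
              [130, 6, 10, 2, 600],
              [115, 6, 10, 1, 550]], dtype=float)
y = np.array([127900, 222100, 143750, 268000, 460700, 407000], dtype=float)

# Coefficients that fit the first five cabins exactly
c_old = np.array([3000, 200, -50, 5000, 100], dtype=float)

# lstsq returns (coefficients, residual sum of squares, rank, singular values)
c_new, residuals, rank, singular_values = np.linalg.lstsq(x, y, rcond=None)

sse_old = np.sum((y - x @ c_old) ** 2)
sse_new = np.sum((y - x @ c_new) ** 2)

print(rank)              # number of independent feature columns
print(sse_old, sse_new)  # the refitted coefficients give the smaller SSE
print(np.isclose(residuals[0], sse_new, rtol=1e-3))
```

Keeping the old coefficients leaves the entire error on the sixth cabin (its prediction is €1,300 off), whereas the refitted model spreads a smaller total squared error over all six cabins.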

## Conclusion

This example demonstrates the power of the least squares method in fitting linear regression models. As the dataset expands, the model adapts to account for additional variation in the data, leading to updated predictions and coefficient estimates. This showcases how linear regression provides a flexible framework for predictive modeling, even as the number of data points or features changes.

Through this exercise, we have seen that:

- The least squares method can fit the data perfectly when the number of coefficients matches the number of observations and the observations are linearly independent.

- When more data is added, the model adjusts, leading to more nuanced predictions that account for the new information.

This flexibility makes linear regression a fundamental tool in machine learning and data analysis, especially for tasks where interpretability and simplicity are essential.

---

# Does More Data Always Improve Predictions?

In machine learning, it is commonly believed that more data leads to better predictions. After all, companies with access to vast amounts of data can build more sophisticated models that provide highly accurate predictions, such as predicting user behavior. However, in certain cases, adding more data can lead to less accurate predictions.

This phenomenon was observed in our previous example: when we had only five cabins in the dataset, the model reproduced their prices perfectly. But when we added a sixth cabin, the predictions no longer matched the known prices exactly. This leads to an important distinction in machine learning between **training data** and **test data**.

In the previous example, we evaluated the prediction accuracy using the same data that we trained the model on. This scenario, while ideal for demonstration purposes, does not reflect a real-world use case. In practice, we are more concerned with how well the model performs on new data that it hasn't seen before—this is where **test data** comes into play.

## The Importance of Training and Testing in Machine Learning

When building a predictive model, it is crucial to evaluate its performance on unseen data. This is why machine learning workflows often involve splitting the dataset into two parts:

- **Training data**: This is used to fit the model and learn the relationships between the input features and the output.

- **Test data**: This is used to assess the model's ability to generalize to new, unseen instances. The test data contains the actual outputs (prices in this case), but the model should not use this information when making predictions.
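
A minimal sketch of such a split is shown below. The helper `split_rows`, its `n_test` and `seed` parameters, and the random shuffling are assumptions made for illustration; the example in the next section uses a fixed, predefined split instead. The rows are the eight cabins that appear in this section's training and test tables, with the price in the last column.

```python
import numpy as np


def split_rows(data, n_test, seed=0):
    """Shuffle the rows and hold out n_test of them as test data."""
    if not 0 < n_test < len(data):
        raise ValueError("n_test must be between 1 and len(data) - 1")
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]
    return shuffled[:-n_test], shuffled[-n_test:]


data = np.array([[25, 2, 50, 1, 500, 127900],
                 [39, 3, 10, 1, 1000, 222100],
                 [13, 2, 13, 1, 1000, 143750],
                 [82, 5, 20, 2, 120, 268000],
                 [130, 6, 10, 2, 600, 460700],
                 [115, 6, 10, 1, 550, 407000],
                 [36, 3, 15, 1, 850, 196000],
                 [75, 5, 18, 2, 540, 290000]], dtype=float)

train, test = split_rows(data, n_test=2)

# Features are every column but the last; the price is the last column
x_train, y_train = train[:, :-1], train[:, -1]
x_test, y_test = test[:, :-1], test[:, -1]

print(x_train.shape, y_train.shape)  # (6, 5) (6,)
print(x_test.shape, y_test.shape)    # (2, 5) (2,)
```

Shuffling before splitting is a common precaution against any ordering in the file; with only eight cabins, which rows land in the test set can change the measured error considerably.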

## Practical Example: Predicting Cabin Prices Using Separate Training and Test Sets

In this exercise, we will simulate a real-world scenario where the goal is to predict the prices of cabins based on their features. We will use two separate datasets:

- **Training data**: A set of six cabins with known features and prices.
- **Test data**: A separate set of two cabins, for which the actual prices are provided but will not be used in the prediction process.

The model will first be trained on the training dataset to estimate the regression coefficients, and then use these coefficients to predict the prices of the cabins in the test dataset.

### Training Data

| Cabin   | Size (sqm) | Sauna Size (sqm) | Distance to Water (m) | Number of Indoor Bathrooms | Proximity to Neighbors (m) | Price (€) |
|---------|------------|------------------|-----------------------|----------------------------|----------------------------|-----------|
| Cabin 1 | 25         | 2                | 50                    | 1                          | 500                        | 127,900   |
| Cabin 2 | 39         | 3                | 10                    | 1                          | 1000                       | 222,100   |
| Cabin 3 | 13         | 2                | 13                    | 1                          | 1000                       | 143,750   |
| Cabin 4 | 82         | 5                | 20                    | 2                          | 120                        | 268,000   |
| Cabin 5 | 130        | 6                | 10                    | 2                          | 600                        | 460,700   |
| Cabin 6 | 115        | 6                | 10                    | 1                          | 550                        | 407,000   |

### Test Data

| Cabin   | Size (sqm) | Sauna Size (sqm) | Distance to Water (m) | Number of Indoor Bathrooms | Proximity to Neighbors (m) | Price (€) |
|---------|------------|------------------|-----------------------|----------------------------|----------------------------|-----------|
| Cabin 1 | 36         | 3                | 15                    | 1                          | 850                        | 196,000   |
| Cabin 2 | 75         | 5                | 18                    | 2                          | 540                        | 290,000   |


## Implementing the Model

We'll implement the solution by first reading the training and test data, then using the least squares method to estimate the regression coefficients from the training data. Finally, we'll use these coefficients to predict the prices of the cabins in the test dataset.

In [3]:
import numpy as np
from io import StringIO

# Define the training data
train_string = '''
25 2 50 1 500 127900
39 3 10 1 1000 222100
13 2 13 1 1000 143750
82 5 20 2 120 268000
130 6 10 2 600 460700
115 6 10 1 550 407000
'''

# Define the test data
test_string = '''
36 3 15 1 850 196000
75 5 18 2 540 290000
'''

def main():
    np.set_printoptions(precision=1)    # Set output precision for easier reading
    
    # Load the training data
    train_data = np.genfromtxt(StringIO(train_string), skip_header=0)
    
    # Read in the training data and separate it to x_train and y_train
    x_train = train_data[:, :-1] 
    y_train = train_data[:, -1]

    # Use the least squares method to fit the data and get the coefficients
    c = np.linalg.lstsq(x_train, y_train, rcond=None)[0]

    # Read in the test data and separate x_test from it
    test_data = np.genfromtxt(StringIO(test_string), skip_header=0)
    x_test = test_data[:, :-1]

    # Print out the linear regression coefficients
    print(c)

    # Print out the predicted prices for the two new cabins in the test data set
    print(x_test @ c)


main()


[2989.6  800.6  -44.8 3890.8   99.8]
[198102.4 289108.3]


## Interpreting the Output

After running the program, we get two sets of outputs:

- The **estimated coefficients** for the linear regression model based on the training data.

- The **predicted prices** for the cabins in the test set.

The predicted prices (about €198,102 and €289,108) are close to, but not the same as, the actual prices in the test data (€196,000 and €290,000), because the model has not seen these data points before. This discrepancy highlights the importance of generalization: how well a model trained on one set of data performs on new, unseen data.
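
To put a number on this discrepancy, the sketch below repeats the fit and compares the error on the training cabins with the error on the test cabins. The root mean squared error (RMSE) and the helper `rmse` are choices made here for illustration; the text above does not prescribe a particular error measure.

```python
import numpy as np

train = np.array([[25, 2, 50, 1, 500, 127900],
                  [39, 3, 10, 1, 1000, 222100],
                  [13, 2, 13, 1, 1000, 143750],
                  [82, 5, 20, 2, 120, 268000],
                  [130, 6, 10, 2, 600, 460700],
                  [115, 6, 10, 1, 550, 407000]], dtype=float)
test = np.array([[36, 3, 15, 1, 850, 196000],
                 [75, 5, 18, 2, 540, 290000]], dtype=float)

x_train, y_train = train[:, :-1], train[:, -1]
x_test, y_test = test[:, :-1], test[:, -1]


def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the prices (euros)."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Fit on the training data only; the test prices are used solely for evaluation
c = np.linalg.lstsq(x_train, y_train, rcond=None)[0]

test_predictions = x_test @ c
print(test_predictions - y_test)  # per-cabin test errors
print("train RMSE:", rmse(y_train, x_train @ c))
print("test RMSE: ", rmse(y_test, test_predictions))
```

With this data the test error comes out noticeably larger than the training error, which is the usual pattern: a model tends to look better on the data it was fitted to.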

## Understanding Model Accuracy

In our example, the five-cabin model fit its training data perfectly (it had as many coefficients as data points), and the six-cabin model still fits its training data very closely. Neither result tells us how the model behaves on new cabins; its performance on the test data is a better indicator of real-world applicability. This is why evaluating accuracy on a separate test set is common practice in machine learning: it helps detect overfitting, where the model is too finely tuned to the training data and performs poorly on new data.

## Conclusion

This exercise demonstrated how separating data into training and test sets is crucial for assessing a model's ability to generalize to new data. By training the model on one dataset and testing it on another, we simulate real-world applications where the model needs to make predictions on unseen data.

The linear regression method, combined with the least squares approach, provides a solid foundation for predictive modeling. As datasets grow larger and more complex, techniques like these will continue to be essential in building robust, interpretable models.
