# Introduction to Linear Regression
Linear regression is a statistical method used to model the relationship between a dependent variable (target) and one or more independent variables (features). It is widely used for prediction and forecasting in various fields such as economics, finance, and science.

## 1. What is Linear Regression?
Linear regression assumes that the dependent variable changes linearly with the independent variable(s), so the relationship can be represented by a straight line (or, with several features, a hyperplane). It aims to find the best-fitting linear equation that describes this relationship.

### Applications:
- **Predicting Exam Scores:** Imagine you have data on students' study hours and their corresponding exam scores. You can use linear regression to predict a student's exam score based on the number of hours they studied.

- **Forecasting House Prices:** Suppose you have data on house sizes (in square feet) and their selling prices. You can use linear regression to predict the selling price of a house based on its size.

- **Estimating Gas Mileage:** If you have data on cars' engine sizes and their corresponding gas mileage, you can use linear regression to predict a car's gas mileage based on its engine size.

## 2. Theory Behind Linear Regression

Recall the equation for a straight line from your early math classes:

$$ y = mx + b$$

The equation represents a straight line where $m$ is the slope and $b$ is the y-intercept.

Here's a Python example that uses matplotlib to plot the line $y = mx + b$ and ipywidgets sliders to let you adjust the values of $m$ and $b$.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider, Checkbox

In [2]:
# Function to plot the line
def plot_line(m, b):
    x_vals = np.linspace(0, 10, 100)
    y_vals = m * x_vals + b
    plt.plot(x_vals, y_vals, color='red', label=f'y = {m:.2f}x + {b:.2f}')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.title('y = mx + b')
    plt.legend()
    plt.grid(True)
    plt.show()

# Define sliders for m and b
m_slider = FloatSlider(min=-10, max=10, step=1, value=0, description='Slope (m)')
b_slider = FloatSlider(min=-10, max=10, step=1, value=0, description='Intercept (b)')

# Create interactive plot
interact(plot_line, m=m_slider, b=b_slider)


Now imagine we have data on house sizes (in square feet) and their selling prices. Below, we generate some synthetic data for house sizes and selling prices.

In [3]:
# Generate some random data for house sizes and selling prices
np.random.seed(0)
house_sizes = np.random.randint(1000, 3000, 50)  # House sizes in square feet
prices = 50 * house_sizes + np.random.normal(0, 10000, 50)  # Selling prices

# Create a DataFrame
df = pd.DataFrame({'House Size (sqft)': house_sizes, 'Selling Price': prices})

# Display the DataFrame
df

| | House Size (sqft) | Selling Price |
|---|---|---|
| 0 | 1684 | 58670.101842 |
| 1 | 1559 | 84486.185954 |
| 2 | 2653 | 141294.361989 |
| 3 | 2216 | 103378.349796 |
| 4 | 1835 | 114447.54624 |
| 5 | 1763 | 73606.343254 |
| 6 | 2731 | 137007.585173 |
| 7 | 2383 | 117278.1615 |
| 8 | 2033 | 116977.792144 |
| 9 | 2747 | 152043.587699 |


Now let's plot these data points in a scatter plot. The checkboxes optionally overlay five candidate lines with different slopes.

In [4]:
# Function to plot the data and multiple lines with different fits
def plot_data_and_lines(show_lines, show_legend):
    plt.figure(figsize=(10, 6))
    plt.scatter(house_sizes, prices, color='blue', label='Data')

    # Plot multiple lines with different fits
    if show_lines:
        for i in range(-50, 51, 25):  # Generate 5 lines with different fits
            m = i
            b = np.mean(prices) - m * np.mean(house_sizes)  # Calculate intercept
            x_vals = np.linspace(min(house_sizes), max(house_sizes), 100)
            y_vals = m * x_vals + b
            plt.plot(x_vals, y_vals, label=f'y = {m:.2f}x + {b:.2f}')

    # Show legend if specified
    if show_legend:
        plt.legend()

    plt.xlabel('House Size (sqft)')
    plt.ylabel('Selling Price')
    plt.title('House Size vs Selling Price')
    plt.grid(True)
    plt.show()

# Checkbox widget to show/hide the lines
lines_checkbox = Checkbox(value=False, description='Show Lines')  # Default value set to False

# Checkbox widget to show/hide the legend
legend_checkbox = Checkbox(value=False, description='Show Legend')  # Default value set to False

# Function to update the plot when checkboxes are toggled
def update_plot(show_lines, show_legend):
    plot_data_and_lines(show_lines, show_legend)

# Create interactive plot with checkboxes
interact(update_plot, show_lines=lines_checkbox, show_legend=legend_checkbox)


Let's imagine five data scientists are working with the same dataset. If each scientist draws a different line of fit, how do they decide which line is best?

How can we find a linear equation that best represents the relationship between the dependent variable, *price*, and the independent variable, *size*? In other words, how do we find the **line of best fit**?

### Line of Best Fit:
The line of best fit represents the linear relationship between the independent variable (predictor) and the dependent variable (response). In our case, the independent variable is the house size (size) and the dependent variable is the selling price (price). 

This line is typically found through linear regression, which chooses the slope and intercept that minimize **the differences between the observed values and the values predicted by the line**, taken over all data points.

### What are Residuals?
Residuals, denoted as $ε$ (epsilon), are the differences between the observed values ($y$) and the values predicted by the model ($\hat{y}$). In other words, they represent the error in the model's predictions. Mathematically, residuals can be expressed as:

$$\begin{aligned}
ε_{i} &= y_{i} - \hat{y}_{i} \\
&= y_{i} - (mx_{i} + b)
\end{aligned}$$

where:
- $ε_{i}$ is the error or residual for the $i$ th data point

- $y_{i}$ is the observed (actual) value for the $i$ th data point

- $\hat{y}_{i}$ is the predicted value by the model for the $i$ th data point

- $m$ represents the slope of the line in a linear regression model

- $x_{i}$ represents the value of the independent variable for the $i$ th data point

- $b$ represents the y-intercept of the line in a linear regression model
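The definition above translates directly into a few lines of NumPy. The sketch below is only an illustration: the function name `residuals`, the toy data, and the chosen line ($m = 2$, $b = 1$) are all hypothetical and are not part of the housing example.

```python
import numpy as np

def residuals(x, y, m, b):
    """Return the residual y_i - (m * x_i + b) for every data point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return y - (m * x + b)

# Hypothetical toy data and line, chosen only for illustration
x = [1, 2, 3, 4]
y = [3, 5, 6, 9]
eps = residuals(x, y, m=2.0, b=1.0)

# The line predicts 3, 5, 7, 9, so only the third point is off (by -1)
print(eps)
```

A negative entry means the point lies below the line, and a zero entry means the line passes exactly through the point.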

A residual is a measure of how well a line fits an individual data point. Consider this simple data set with a line of fit drawn through it.

<p align="center">
  <img src="/workspaces/themarisolhernandez-4geeks-ds-lessons/imgs/residual1.png" alt="Alt text" width="400" height="400">
</p>

Notice how the point **(2, 8)** is **<span style="color:green">4</span>** units above the line:

<p align="center">
  <img src="/workspaces/themarisolhernandez-4geeks-ds-lessons/imgs/residual2.png" alt="Alt text" width="400" height="400">
</p>

This vertical distance is known as a **residual**. For data points above the line, the residual is positive, and for data points below the line, the residual is negative.

For example, the residual for the point **(4, 3)** is **<span style="color:red">-2</span>**.

<p align="center">
  <img src="/workspaces/themarisolhernandez-4geeks-ds-lessons/imgs/residual3.png" alt="Alt text" width="400" height="400">
</p>

The closer a data point's residual is to 0, the better the line fits that point. In this case, the line fits the point (4, 3) better than (2, 8), because $|-2| < |4|$.
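As a quick check, the two residuals quoted above can be reproduced by hand. Both values are consistent with exactly one line, $\hat{y} = 0.5x + 3$. That equation is inferred from the numbers in the text rather than read from the figures, so treat it as an assumption:

$$ε_{(2,\,8)} = 8 - (0.5 \cdot 2 + 3) = 8 - 4 = 4$$

$$ε_{(4,\,3)} = 3 - (0.5 \cdot 4 + 3) = 3 - 5 = -2$$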

### Visualizing Residuals
We can further explore residuals by visualizing how they relate to our linear regression model for our housing dataset. Here, we can adjust the slope ($m$) and intercept ($b$) of the regression line and observe the corresponding residuals.

In [8]:
# Function to plot the data, line, and residuals
def plot_data_line_residuals(m, b, show_line, show_residuals):
    plt.figure(figsize=(12, 6))
    plt.scatter(house_sizes, prices, color='blue', label='Data')

    # Calculate predicted prices using the selected m and b
    predicted_prices = m * house_sizes + b
    
    # Plot the line if show_line is True
    if show_line:
        plt.plot(house_sizes, predicted_prices, color='red', label=f'y = {m}x + {b}')

    # Plot dashed lines representing residuals if show_residuals is True
    if show_residuals:
        for i in range(len(house_sizes)):
            plt.plot([house_sizes[i], house_sizes[i]], [prices[i], predicted_prices[i]], color='green', linestyle='--', linewidth=0.8)

        # Add legend for residuals if not already added
        handles, labels = plt.gca().get_legend_handles_labels()
        if 'Residuals' not in labels:
            plt.plot([], [], color='green', linestyle='--', label='Residuals')

    plt.xlabel('House Size (sqft)')
    plt.ylabel('Selling Price')
    plt.title('House Size vs Selling Price')
    plt.legend()
    plt.grid(True)
    plt.show()

# Define sliders for m and b
m_slider = FloatSlider(min=-100, max=100, step=1, value=0, description='Slope (m)')
b_slider = FloatSlider(min=-50000, max=50000, step=1000, value=0, description='Intercept (b)')

# Checkbox widget to show/hide the line
line_checkbox = Checkbox(value=False, description='Show Line')  # Default value set to False

# Checkbox widget to show/hide the residuals
residuals_checkbox = Checkbox(value=False, description='Show Residuals')  # Default value set to False

# Function to update the plot when checkboxes or sliders are adjusted
def update_plot(show_line, show_residuals, m, b):
    plot_data_line_residuals(m, b, show_line, show_residuals)

# Create interactive plot with sliders and checkboxes
interact(update_plot, show_line=line_checkbox, show_residuals=residuals_checkbox, m=m_slider, b=b_slider)



So how do we know we've found the coefficients for the **line of best fit**?

### Determining Coefficients for the Line of Best Fit
#### **Method of Least Squares**
In linear regression, the coefficients for the line of best fit are determined using the **method of least squares**. This method aims to <u>minimize</u> the sum of the squared differences between the observed values and the values predicted by the regression line.

#### **Mathematical Formulation**
Given a set of $n$ data points $(x_{i}, y_{i})$, where $x_{i}$ represents the independent variable and $y_{i}$ represents the corresponding dependent variable, the line of best fit is represented by the equation:

$$ \hat{y}_{i} = mx_{i} + b$$

where:
- $m$ is the slope of the line (coefficient for the independent variable $x$)

- $x_{i}$ represents the value of the independent variable for the $i$ th data point

- $b$ is the y-intercept of the line

The goal is to find the values of $m$ and $b$ that minimize the sum of the squared errors, denoted as $SSE$:

$$SSE = \sum_{i=1}^{n}ε_{i}^2 = \sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^2 = \sum_{i=1}^{n}(y_{i} - (mx_{i} + b))^2$$

where:
- $\hat{y}_{i}$ is the predicted value of $y_{i}$ at the $i$ th data point
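To make the formula concrete, here is a minimal sketch that scores the five candidate slopes from the earlier scatter plot by their $SSE$. It regenerates the synthetic housing data with the same seed so that it runs on its own. The helper name `sse` and the `scores` dictionary are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Recreate the synthetic housing data used above (same seed and formula)
np.random.seed(0)
house_sizes = np.random.randint(1000, 3000, 50)
prices = 50 * house_sizes + np.random.normal(0, 10000, 50)

def sse(m, b, x, y):
    """Sum of squared errors for the line y_hat = m * x + b."""
    return float(np.sum((y - (m * x + b)) ** 2))

# Score the five candidate slopes from the earlier plot. As in that plot,
# each intercept is chosen so the line passes through (mean size, mean price).
scores = {}
for m in range(-50, 51, 25):
    b = prices.mean() - m * house_sizes.mean()
    scores[m] = sse(m, b, house_sizes, prices)
    print(f"m = {m:>4}: SSE = {scores[m]:.3e}")

best_m = min(scores, key=scores.get)
print("Lowest SSE among the candidates: m =", best_m)
```

Because the data were generated with a true slope of 50, the candidate with $m = 50$ has the smallest $SSE$ of the five. This only ranks the candidates that were tried; it does not by itself find the overall minimum.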

With the same housing dataset, we can adjust the slope ($m$) and intercept ($b$) of the regression line and observe how the sum of squared errors changes.

In [9]:
# Function to plot the data, line, residuals, and SSE
def plot_data_line_residuals_sse(m, b, show_line, show_residuals):
    plt.figure(figsize=(12, 6))
    plt.scatter(house_sizes, prices, color='blue', label='Data')

    # Calculate predicted prices using the selected m and b
    predicted_prices = m * house_sizes + b
    
    # Calculate residuals
    residuals = prices - predicted_prices
    
    # Calculate the sum of squared residuals (displayed in the title only when the line is shown)
    sum_squared_residuals = np.sum(residuals**2)

    # Plot the line if show_line is True
    if show_line:
        plt.plot(house_sizes, predicted_prices, color='red', label=f'y = {m}x + {b}')

    # Plot dashed lines representing residuals if show_residuals is True
    if show_residuals:
        for i in range(len(house_sizes)):
            plt.plot([house_sizes[i], house_sizes[i]], [prices[i], predicted_prices[i]], color='green', linestyle='--', linewidth=0.8)

        # Add legend for residuals if not already added
        handles, labels = plt.gca().get_legend_handles_labels()
        if 'Residuals' not in labels:
            plt.plot([], [], color='green', linestyle='--', label='Residuals')

    plt.xlabel('House Size (sqft)')
    plt.ylabel('Selling Price')
    title = 'House Size vs Selling Price'
    if show_line:
        title += f'\nSum of Squared Errors: {sum_squared_residuals:.2f}'
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()

# Define sliders for m and b
m_slider = FloatSlider(min=-100, max=100, step=1, value=0, description='Slope (m)')
b_slider = FloatSlider(min=-50000, max=50000, step=1000, value=0, description='Intercept (b)')

# Checkbox widget to show/hide the line
line_checkbox = Checkbox(value=False, description='Show Line')  # Default value set to False

# Checkbox widget to show/hide the residuals
residuals_checkbox = Checkbox(value=False, description='Show Residuals')  # Default value set to False

# Function to update the plot when checkboxes or sliders are adjusted
def update_plot(show_line, show_residuals, m, b):
    plot_data_line_residuals_sse(m, b, show_line, show_residuals)

# Create interactive plot with sliders and checkboxes
interact(update_plot, show_line=line_checkbox, show_residuals=residuals_checkbox, m=m_slider, b=b_slider)
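The sliders can only approximate the minimum of the $SSE$. For a single feature, the least squares problem has a standard closed-form solution:

$$m = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{n}(x_{i} - \bar{x})^2}, \qquad b = \bar{y} - m\bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the means of the $x_{i}$ and $y_{i}$. The sketch below applies these formulas to the same synthetic data and cross-checks the result against `np.polyfit`. The names `m_hat`, `b_hat`, and `sse_hat` are illustrative and do not appear in the original notebook.

```python
import numpy as np

# Recreate the synthetic housing data used above (same seed and formula)
np.random.seed(0)
house_sizes = np.random.randint(1000, 3000, 50)
prices = 50 * house_sizes + np.random.normal(0, 10000, 50)

# Closed-form least squares estimates for one feature
x_mean, y_mean = house_sizes.mean(), prices.mean()
m_hat = np.sum((house_sizes - x_mean) * (prices - y_mean)) / np.sum((house_sizes - x_mean) ** 2)
b_hat = y_mean - m_hat * x_mean

# SSE at the fitted line; no other (m, b) pair gives a smaller value
sse_hat = np.sum((prices - (m_hat * house_sizes + b_hat)) ** 2)

# Cross-check with NumPy's degree-1 polynomial fit (returns slope, then intercept)
m_np, b_np = np.polyfit(house_sizes, prices, deg=1)

print(f"closed form: m = {m_hat:.3f}, b = {b_hat:.1f}, SSE = {sse_hat:.3e}")
print(f"np.polyfit : m = {m_np:.3f}, b = {b_np:.1f}")
```

The fitted slope should land near 50, the value used to generate the data, although the noise keeps it from matching exactly.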

