<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Examples.png"  style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Examples: Least squares regression
© ExploreAI Academy

## Learning objectives

By the end of this train we will:
- Understand what least squares regression is and how we use it to calculate the line of best fit.
- Understand the mathematical techniques used in least squares regression.
- Know how least squares regression is implemented using scikit-learn.

## Least squares regression

Least squares is a method used in regression analysis to find the best-fitting straight line through a set of data points. It does this by minimising the sum of the squared residuals, i.e. the differences between the observed values and those predicted by the line:
$$Q = \sum_{i=1}^n(y_i-\hat{y}_i)^2$$

The formulae for the slope, \(m\), and the intercept, \(c\), are determined by minimising the equation for the sum of the squared prediction errors:   
$$Q = \sum_{i=1}^n(y_i-(m x_i+c))^2$$

Optimal values for \(m\) and \(c\) are found by differentiating \(Q\) with respect to \(m\) and \(c\), setting both equal to 0, and then solving for \(m\) and \(c\).   
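
As a sketch of that step, the two partial derivatives set to zero (often called the normal equations) are:

$$\frac{\partial Q}{\partial m} = -2\sum_{i=1}^n x_i\,(y_i-(m x_i+c)) = 0$$

$$\frac{\partial Q}{\partial c} = -2\sum_{i=1}^n (y_i-(m x_i+c)) = 0$$

Dividing the second condition by $-2n$ gives $\bar{y} - m\bar{x} - c = 0$, which fixes the intercept in terms of the slope. Substituting that into the first condition leaves a single equation in \(m\).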
   
The equations for \(m\) and \(c\) are:   
   
$$m = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}$$   
   
and:   
   
$$c = \bar{y} - m \bar{x}$$

where $\bar{y}$ and $\bar{x}$ are the mean values of \(y\) and \(x\) in our dataset, respectively.
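
To make the formulas concrete, here is a minimal sketch on a small made-up dataset (the five points are illustrative, not taken from the exports data, and the names `m_toy` and `c_toy` are chosen here for the example). It evaluates the two formulas directly and compares the result with NumPy's `np.polyfit`, which fits a degree-1 polynomial by least squares.

```python
import numpy as np

# Toy data, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope and intercept from the closed-form least squares formulas
m_toy = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
c_toy = y_bar - m_toy * x_bar

# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept]
m_ref, c_ref = np.polyfit(x, y, deg=1)

print("Formula:", m_toy, c_toy)
print("polyfit:", m_ref, c_ref)
```

For these points the slope works out to 0.6 and the intercept to 2.2, up to floating-point rounding.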

## Examples

### Example 1
   
Let's calculate these values in Python, where \(c\) is the intercept and \(m\) is the slope.


In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rc

# Load dataset and set the first column as the index
df = pd.read_csv('https://github.com/Explore-AI/Public-Data/blob/master/exports%20ZAR-USD-data.csv?raw=true', index_col=0)

# Rename columns to 'Y' for the dependent variable and 'X' for the independent variable
df.columns = ['Y', 'X']

In [None]:
# Extract values of X and Y as numpy arrays for mathematical operations
X = df.X.values
Y = df.Y.values

# Calculate mean of X and Y
x_bar = np.mean(X)
y_bar = np.mean(Y)

# Calculate the slope (m) of the regression line using the least squares method
m = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)

# Calculate the intercept (c) of the regression line
c = y_bar - m * x_bar

# Output the calculated slope and intercept
print("Slope = ", m)
print("Intercept = ", c)

Now we'll plot the line we've just calculated the coefficients for.

In [None]:
# Helper that generates y-values for a given list of x-values, using a given slope and intercept
def gen_y(x_list, m, c):
    y_gen = []
    for x_i in x_list:
        y_i = m*x_i + c
        y_gen.append(y_i)

    return y_gen

# Generate y-values for the x-values in the dataset from the calculated slope and intercept
# (vectorised equivalent of calling gen_y(df.X, m, c))
y_gen = m * df.X + c

# Plot the original data points as a scatter plot
plt.scatter(df.X, df.Y)

# Plot the regression line using the generated y-values
plt.plot(df.X, y_gen, color='red')

plt.show()

In an array called `errors2`, we'll store the new error values.

In [None]:
errors2 = np.array(y_gen - df.Y) # Calculate the residuals by subtracting the observed Y values from the generated Y values
print(np.round(errors2, 2)) # Print the residuals, rounded to 2 decimal places

Finally, let's plot the errors on a histogram again.

In [None]:
plt.hist(errors2)
plt.show()

In [None]:
# Calculate the Residual Sum of Squares (RSS) by squaring the residuals and summing them up
print("Residual sum of squares:", (errors2 ** 2).sum())

Here we can see our RSS has improved from ~867, in our previous example, down to ~321.  
Furthermore, if we calculate the sum of the errors we find that the value is close to 0.

In [None]:
# Sum the residuals, rounding to 11 decimal places to suppress floating-point noise
np.round(errors2.sum(),11)

----
Intuitively, this should make sense as it is an indication that the sum of the positive errors is equal to the sum of the negative errors. The line fits in the 'middle' of the data.
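
This zero-sum property can be checked on a small made-up dataset. The sketch below (toy values, not the exports data; the helper `rss` is defined only for this example) fits a line with the closed-form formulas, confirms that the residuals sum to approximately zero, and shows that nudging the intercept away from the least squares value increases the RSS.

```python
import numpy as np

# Toy data, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least squares slope and intercept
m_fit = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c_fit = y.mean() - m_fit * x.mean()

def rss(m, c):
    """Residual sum of squares of the line y = m*x + c on the toy data."""
    return np.sum((y - (m * x + c)) ** 2)

residuals = y - (m_fit * x + c_fit)
print("Sum of residuals:", np.round(residuals.sum(), 11))
print("RSS at the least squares fit:", rss(m_fit, c_fit))
print("RSS with the intercept shifted by 0.5:", rss(m_fit, c_fit + 0.5))
```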

## Linear regression in scikit-learn

Now that you understand how least squares linear regression works, let's implement it using scikit-learn.

We'll start by importing the `LinearRegression` class.

In [None]:
# Import the LinearRegression class from scikit-learn's linear_model module
from sklearn.linear_model import LinearRegression

We can take a peek under the hood by using the IPython help operator (`?`). This displays the documentation, including the parameters and attributes, of any function or object.

We're going to need to create a `LinearRegression()` object, so let's first take a look at the documentation for that object:

In [None]:
LinearRegression?

Let's create a `LinearRegression()` object with all the default parameters.

In [None]:
# Initialise the LinearRegression model
lm = LinearRegression()

At this stage, all we have done is initialise a model of the form: $y = mx+c$ 

But we haven't _fitted the model_, i.e. used the data to calculate the model parameters $m$ and $c$.
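
One way to see the difference between an initialised and a fitted model is to check for the learned attributes. In scikit-learn, attributes ending in an underscore, such as `coef_` and `intercept_`, are only set once `.fit()` has been called. The sketch below uses a separate model object (named `lm_demo` here for illustration) and made-up data, so it does not interfere with `lm`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

lm_demo = LinearRegression()  # hypothetical name, separate from lm

# Before fitting there are no learned parameters
fitted_before = hasattr(lm_demo, "coef_")
print("Has coef_ before fit:", fitted_before)

# Toy data lying exactly on y = 2x + 1; X must be 2D with shape (n_samples, n_features)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
lm_demo.fit(X_demo, y_demo)

fitted_after = hasattr(lm_demo, "coef_")
print("Has coef_ after fit:", fitted_after)
print("Slope:", lm_demo.coef_[0], "Intercept:", lm_demo.intercept_)
```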

### Fitting the linear model

With the object created, we will then need to fit the model to our data. This is done using the `.fit()` function.

In [None]:
lm.fit?

We can see that the `.fit()` function requires two parameters (`X` and `y`), with an optional third parameter, `sample_weight`.   

The `sample_weight` parameter would be useful in situations where the observations in our data have unequal errors - think weight vs height of university students where some students were weighed with an older analogue scale and others were weighed with a new digital scale.   

We have no reason to believe that any of our data is any more or less trustworthy than the rest, so we'll leave out the optional weights parameter.
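
For comparison, here is a hedged sketch of what passing weights would look like. The data and the weights are made up: four points lie exactly on $y = 2x + 1$, and a fifth is treated as a less trustworthy measurement and given a small weight. How weights should be chosen in practice depends on what is known about the measurement errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: the last observation is far off the line y = 2x + 1
X_w = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_w = np.array([1.0, 3.0, 5.0, 7.0, 20.0])

# Hypothetical weights: trust the last observation much less than the others
weights = np.array([1.0, 1.0, 1.0, 1.0, 0.01])

unweighted = LinearRegression().fit(X_w, y_w)
weighted = LinearRegression().fit(X_w, y_w, sample_weight=weights)

print("Unweighted slope:", unweighted.coef_[0])
print("Weighted slope:  ", weighted.coef_[0])
```

The weighted fit should land much closer to the slope of 2 followed by the first four points, because the down-weighted observation contributes less to the sum being minimised.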

In [None]:
# Reshape the X array to a 2D array as required by scikit-learn, converting from pandas Series to numpy array if necessary
X = df.X.values[:, np.newaxis]
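
As a quick illustration of what this reshaping does, the sketch below uses a small made-up array rather than `df.X`. Indexing with `np.newaxis` adds a second axis, turning shape `(n,)` into `(n, 1)`; `reshape(-1, 1)` is an equivalent spelling.

```python
import numpy as np

x_1d = np.array([10.0, 20.0, 30.0])  # toy values, shape (3,)

x_2d = x_1d[:, np.newaxis]           # one column, one row per sample
x_2d_alt = x_1d.reshape(-1, 1)       # same result via reshape

print(x_1d.shape, x_2d.shape)
print(np.array_equal(x_2d, x_2d_alt))
```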

In [None]:
# Fit the linear model to the data
lm.fit(X, df.Y)

If needed, the model parameters found by the `.fit()` function can be obtained as follows: 

In [None]:
# Extract the slope (coefficient) and intercept from the fitted model
m = lm.coef_[0]
c = lm.intercept_

In [None]:
# Print the slope and intercept
print("Slope:\t\t", m)
print("Intercept:\t", c)

### Getting model predictions

To obtain $y$ values from our linear regression model we use the `.predict()` function. Given an array of $x$ values, this function evaluates the fitted model at those $x$ values and returns the corresponding $y$ values. Note that in this case, the `.predict()` function does exactly what the `gen_y()` function we created earlier does. We will explore the concept of prediction in depth in later tutorials.
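
To illustrate that equivalence, the sketch below fits a model on made-up data (separate from `lm` and `df`) and checks that `.predict()` returns the same values as evaluating $mx + c$ by hand with the fitted `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data, invented for illustration only
X_p = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_p = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

model = LinearRegression().fit(X_p, y_p)

# Evaluate the fitted line by hand: y = m*x + c
manual = model.coef_[0] * X_p[:, 0] + model.intercept_
predicted = model.predict(X_p)

print(np.allclose(manual, predicted))

# predict also works for x-values that were not in the training data
print(model.predict(np.array([[6.0]])))
```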

In [None]:
# Use the fitted model to generate Y values from the X values
# (note: this rebinds the name gen_y, which previously referred to the helper function)
gen_y = lm.predict(X)

In [None]:
# plot the results
plt.scatter(X, df.Y)  # Plot the original data
plt.plot(X, gen_y, color='red')  # Plot the line connecting the generated y-values

# Label the axes
plt.ylabel("ZAR/USD")
plt.xlabel("Value of Exports (ZAR, millions)")

plt.show()

## Assessing the model accuracy
We can measure the overall error of the fit by calculating the **Residual Sum of Squares**:
   
$$RSS = \sum_{i=1}^n(y_i-\hat{y}_i)^2$$

In [None]:
# Calculate and print the Residual Sum of Squares (RSS) for the fitted model
print("Residual sum of squares:", ((gen_y - df.Y) ** 2).sum())

### Scikit-learn error metrics
Scikit-learn also has implementations of common error metrics, which make it easier for us to assess the fit of our model.

In addition to RSS, there are some other metrics we can use:

**Mean Squared Error (MSE)** measures the average of the squares of the errors between actual and predicted values in a linear regression model. It assesses the fit of the model by quantifying the average squared deviation of the predicted values from the observed values, with lower values indicating a better fit.
$$MSE = \frac{RSS}{n}$$   
$$MSE = \frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$$   
   
**R squared ($R^2$)** quantifies the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a linear regression. It assesses the strength of the relationship between the model's predictions and the actual data, with values closer to 1 indicating a stronger relationship.
$$R^2 = 1 - \frac{RSS}{TSS}$$   
$$R^2 = 1 - \frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}$$
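
Both formulas are short enough to evaluate by hand. The sketch below does so with NumPy on made-up observed and predicted values (not the exports data; all variable names are chosen for this example). Scikit-learn's metric functions should return the same numbers for the same inputs.

```python
import numpy as np

# Toy observed values and the predictions of a fitted line (illustrative only)
y_obs = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
y_hat = np.array([2.8, 3.4, 4.0, 4.6, 5.2])

rss = np.sum((y_obs - y_hat) ** 2)          # residual sum of squares
tss = np.sum((y_obs - y_obs.mean()) ** 2)   # total sum of squares

mse = rss / len(y_obs)
r_squared = 1 - rss / tss

print("MSE:", mse)
print("R squared:", r_squared)
```

Here RSS is 2.4 and TSS is 6, so the MSE is 0.48 and $R^2$ is 0.6, up to floating-point rounding.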

We can compute these metrics using scikit-learn as follows:

In [None]:
# Import metrics from scikit-learn
from sklearn import metrics

In [None]:
# Calculate and print the Mean Squared Error (MSE) between the observed and predicted Y values
print('MSE:', metrics.mean_squared_error(df.Y, gen_y))

In [None]:
# Calculate and print the RSS by multiplying the MSE by the number of observations
print("Residual sum of squares:", metrics.mean_squared_error(df.Y, gen_y)*len(X)) 

In [None]:
# Calculate and print the R-squared value, a measure of how well the observed values are replicated by the model
print('R_squared:', metrics.r2_score(df.Y, gen_y))


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px"/>
</div>