<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Examples.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Examples: Introduction to simple linear regression
© ExploreAI Academy

## Learning objectives

In this train we will:
- Explore the fundamentals of simple linear regression and how it uses the relationship between two variables to predict outcomes.
- Learn how to create scatter plots and find the line of best fit in Python.



## What is simple linear regression

Simple linear regression is a statistical method that allows us to understand and predict the relationship between **two continuous** variables. One variable, denoted as `y`, is considered the dependent variable, and the other, denoted as `x`, is considered the independent variable. The goal of simple linear regression is to model the linear relationship between these two variables through a linear equation of the form: 

```y = mx + c```

In this equation:
- `y` is the dependent variable we aim to predict or understand.
- `x` is the independent variable used for prediction.
- `m` is the slope of the line, representing how much `y` changes for a one-unit change in `x`.
- `c` is the y-intercept, indicating the value of `y` when `x` is `0`.
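
To make the roles of `m` and `c` concrete, here is a minimal sketch that evaluates the equation for a few inputs. The function name `predict` and the values `m = 2` and `c = 1` are made up for illustration and do not come from any dataset.

```python
def predict(x, m, c):
    """Return the value of y on the line y = mx + c."""
    return m * x + c

# Hypothetical slope and intercept, chosen only for illustration
m, c = 2.0, 1.0

y_at_0 = predict(0, m, c)   # equals c, the y-intercept
y_at_3 = predict(3, m, c)
y_at_4 = predict(4, m, c)

print(y_at_0)           # 1.0
print(y_at_4 - y_at_3)  # 2.0, a one-unit change in x changes y by m
```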

## Why we use it
Simple linear regression is widely used for several key reasons:
- **Prediction:** It enables the prediction of the dependent variable's value based on the independent variable. For instance, predicting the yield of a crop based on the altitude it is grown at.
- **Inference:** It provides insights into the relationship between the two variables, including the direction and strength of this relationship. It helps in understanding how significant the independent variable is in explaining variations in the dependent variable.
- **Simplicity:** Its straightforwardness makes it an accessible tool for modeling linear relationships, offering a clear starting point for regression analysis before considering more complex models.

## When we use it
Simple linear regression is applicable under certain conditions:
- **Linear relationship:** The method assumes a linear relationship between the independent and dependent variables. Preliminary scatter plot analysis can help verify this assumption.
- **Homoscedasticity:** The variability (or spread) of the dependent variable should be consistent across all values of the independent variable.
- **Independence:** Each observation's residuals should be independent, ensuring no correlation between them for reliable statistical inference.
- **Normality:** For purposes of statistical inference, the residuals of the model should ideally follow a normal distribution.
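
As a rough illustration of the residual-based conditions above, the sketch below computes residuals for a small made-up dataset and compares their spread in the lower and upper halves of `x`. The data, the line parameters, and the comparison of two halves are all assumptions for illustration. This is an informal check, not a formal statistical test.

```python
import numpy as np

# Made-up data: points scattered around the line y = 2x + 1
x = np.arange(10, dtype=float)
noise = np.array([0.5, -0.5] * 5)
y = 2 * x + 1 + noise

# Residuals relative to the (here, known) line
residuals = y - (2 * x + 1)

# Rough homoscedasticity check: similar spread at low and high x
spread_low = residuals[x < 5].std()
spread_high = residuals[x >= 5].std()

print("Mean residual:", residuals.mean())
print("Spread (low x):", spread_low)
print("Spread (high x):", spread_high)
```

If the two spreads differed greatly, that would hint that the variability of the dependent variable changes with `x`.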

## Implementing simple linear regression in Python

 We can use libraries such as NumPy, pandas, and scikit-learn to model the relationships between variables in Python. This allows us to uncover insights and predict outcomes using the model.

## Examples

### Example 1

First, we will import the Python libraries which we will need to work with.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rc

**Next** we will load our data as a pandas DataFrame.

In [None]:
df = pd.read_csv('https://github.com/Explore-AI/Public-Data/blob/master/exports%20ZAR-USD-data.csv?raw=true', index_col=0)

In [None]:
# view the first 10 rows of the DataFrame
df.head(10)

It looks like we have monthly data for both the Rand/Dollar exchange rate and the value of South African exports.

In this case, we will choose the Rand/Dollar exchange rate as our dependent variable and the monthly value of exports (in Rands) as our independent variable - i.e. we are going to use exports as a predictor for currency value.

For ease of future reference, let's change our column names to reflect the notation which we'll be following going forward:

In [None]:
df.columns = ['Y', 'X'] # Rename the columns of the dataframe to 'Y' and 'X'
df.head() # Display the first few rows of the dataframe to verify the column renaming
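
Assigning a list to `df.columns` relies on the columns being in the expected order. As a hedged alternative, the sketch below renames columns with an explicit mapping, so each new name is tied to a specific old name. The DataFrame `toy`, its original column names, and its values are all hypothetical.

```python
import pandas as pd

# Toy DataFrame; the column names and values here are hypothetical
toy = pd.DataFrame({
    "ZAR/USD": [7.5, 7.6, 7.4],
    "Value of Exports (ZAR)": [39000.0, 46000.0, 47000.0],
})

# Map each old name to its new name explicitly
toy = toy.rename(columns={"ZAR/USD": "Y", "Value of Exports (ZAR)": "X"})
print(list(toy.columns))  # ['Y', 'X']
```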

**Let's create a few plots of the data** using the matplotlib library.

 Let's create a plot to visualise the exchange rate of the South African Rand (ZAR) against the US Dollar (USD) over time. This provides insights into currency value fluctuations across the observed months.

In [None]:
# Plotting the ZAR/USD exchange rate over time
plt.plot(np.arange(len(df.Y)), df.Y) # Create a line plot for the ZAR/USD values
plt.title("ZAR/USD over time") # Title of the plot
plt.xlabel("Months") # Label for the x-axis 
plt.ylabel("ZAR/USD") # Label for the y-axis
plt.show() # Display the plot

 Let's create a plot to visualise the value of exports from South Africa, measured in millions of ZAR. This helps show us any trends in export activity and economic performance over the same period.

In [None]:
# Plotting the value of exports over time
plt.plot(np.arange(len(df.X)), df.X) # Create a line plot for the export values
plt.title("Value of Exports over time") # Title of the plot
plt.xlabel("Months") # Label for the x-axis
plt.ylabel("Exports (ZAR, millions)") # Label for the y-axis
plt.show() # Display the plot

### Example 2
Let's try and create something a little more visually appealing than the two plots above.

- We'll plot both data series on the same graph
- We'll assign two separate y-axes: one for each series
- We'll display a legend near the top of the plot

In [None]:
rc('mathtext', default='regular') # Set the default style for math text in plots to regular text
fig = plt.figure() # Create a blank figure object to hold the plots

# Add a subplot to the figure, '111' means 1x1 grid, first subplot
ax = fig.add_subplot(111) 

# Plot the ZAR/USD exchange rate data on the primary y-axis using an orange line
ax.plot(np.arange(len(df.Y)), df.Y, '-', label = 'ZAR/USD', color='orange') 

# Create a second y-axis (twin of the first) to accommodate another data set on the same x-axis
ax2 = ax.twinx() 

# Plot the Value of Exports data on the secondary y-axis
ax2.plot(np.arange(len(df.X)), df.X, '-', label = 'Exports (ZAR)')

# Add legends to distinguish the two lines, with custom locations for clarity
ax.legend(loc=2)  # Legend for the primary axis (ZAR/USD) at the top left (location 2)
ax2.legend(loc=9)  # Legend for the secondary axis (Exports) at the top center (location 9)

ax.grid() # Enable grid for better readability of the plot

# Set the labels for the x-axis and both y-axes to indicate the data being plotted
ax.set_xlabel("Months")
ax.set_ylabel("ZAR/USD")
ax2.set_ylabel("Exports (ZAR, millions)")

plt.show() # Display the plot

## Let's get started with regression
Although the plots above are easy to interpret, they show more variables than we need. We have our Y, our X, and time (months), but we're only trying to model ZAR/USD as a *function* of Exports.

**Note:** We are **not** trying to model the exchange rate as a function of time.
   
### Example 3
Let's try and show the relationship between the two variables using a scatter plot.

In [None]:
# Plot the original dataset with 'X' as Value of Exports and 'Y' as ZAR/USD exchange rate using a scatter plot
plt.scatter(df['X'], df['Y'])
plt.ylabel("ZAR/USD")
plt.xlabel("Value of Exports (ZAR, millions)")
plt.show()

From the plot above, we can see that there possibly exists a linear relationship between the two variables: **Value of Exports** and **ZAR/USD**, due to the linear trend of the data.
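
A visual impression of linearity can be backed up with a number. The sketch below computes the Pearson correlation coefficient with `np.corrcoef`. The arrays are made-up stand-ins for the exports and exchange rate columns, not values from the dataset.

```python
import numpy as np

# Made-up values standing in for exports (x) and exchange rate (y)
x = np.array([40000.0, 50000.0, 60000.0, 70000.0, 80000.0])
y = np.array([6.0, 7.5, 9.5, 10.5, 13.0])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]
print("Pearson correlation:", round(r, 3))
```

A value of `r` close to 1 or -1 is consistent with a strong linear relationship, while a value near 0 suggests little linear association.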

We would like to fit a line to this data so that we can make predictions of Y for new values of X. To do this, we need to calculate values for the slope, $m$, and the y-intercept, $c$, such that the line represented by $y = mx + c$ lies as close as possible to the points shown in the scatter plot above.

## Finding the best fit line
Let's say we have estimated some values for $m$ and $c$. We could plug in all of our values of X to find the corresponding values of Y. These *new* values of Y could be compared to the *actual* values of Y to assess the fit of the line. This becomes tedious as the number of data points increases.

### Example 4
   
Looking at the data, we can guess the values of the slope and intercept of the line. We'll use a rough estimate of the slope as $\frac{rise}{run} = \frac{16}{80000} = 0.0002$. For the intercept, we'll just take a guess and call it $-3$.   
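
The arithmetic behind this estimate can be checked in a few lines. The rise of 16, the run of 80000, and the intercept of -3 come from the estimate above. The export value `x_example` is an assumption chosen only to illustrate a prediction.

```python
# Rough slope estimate from the rise and run read off the scatter plot
rise = 16
run = 80000
m = rise / run   # approximately 0.0002

# Guessed intercept
c = -3

# Predicted exchange rate for an illustrative export value of 60000
x_example = 60000
y_example = m * x_example + c
print(round(m, 4), round(y_example, 2))
```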
   
Let's plot a line with values of $m = 0.0002$ and $c = -3$.
   
First, we will need to generate some values of y using the following formula:
   
$$\hat{y}_i = mx_i + c$$   
   

In [None]:
# Define a function to generate predicted values of 'Y' (ZAR/USD) for a given list of 'X' (Value of Exports),
# based on linear equation parameters 'm' (slope) and 'c' (intercept)

def gen_y(x_list, m, c):
    y_gen = []
    for x_i in x_list:  # Loop through each value in the list of 'X'
        y_i = m*x_i + c # Calculate the corresponding 'Y' value using the linear equation
        y_gen.append(y_i) # Append the calculated 'Y' value to the list
    
    return y_gen # Return the list of generated 'Y' values

# Generate the predicted 'Y' values using the 'gen_y' function with specified 'm' and 'c' parameters
y_gen = gen_y(df.X, 0.0002, -3)

# Plot the original data and the linear regression line
plt.scatter(df.X, df.Y) # Plot the original data as a scatter plot
plt.plot(df.X, y_gen, color='red') # Plot the generated 'Y' values as a line in red to represent the linear model
plt.ylabel("ZAR/USD")
plt.xlabel("Value of Exports (ZAR, millions)")
plt.show()

At first glance, it looks like the fit of this line is not too bad!   
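
The loop in `gen_y` makes each step explicit, but NumPy can apply the same equation to every element at once. The sketch below compares the two approaches on made-up export values. The function names `gen_y_loop` and `gen_y_vectorised` are hypothetical and used only for this comparison.

```python
import numpy as np

def gen_y_loop(x_list, m, c):
    """Loop-based version, mirroring the function above."""
    return [m * x_i + c for x_i in x_list]

def gen_y_vectorised(x_values, m, c):
    """Vectorised version: NumPy applies the equation to every element."""
    return m * np.asarray(x_values, dtype=float) + c

# Made-up export values for illustration
x_demo = [40000, 60000, 80000]

loop_result = gen_y_loop(x_demo, 0.0002, -3)
vector_result = gen_y_vectorised(x_demo, 0.0002, -3)

print(np.allclose(loop_result, vector_result))  # True
```

Both versions produce the same predictions. The vectorised form is usually shorter and faster on large datasets.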

### Example 5
   
**Let's check the calculated fit of the line** by measuring how far the true y-values of each point are from their corresponding y-value on the line.   
   
We'll use the equation below to calculate the error of each generated value of y:   
   
$$e_i = y_i - \hat{y}_i$$   
   

In [None]:
errors = np.array(df.Y - y_gen)
np.round(errors, 2)

When looking at the errors, we see that some of our predictions are off by almost R6!

Let's take a look at a distribution of our errors (using a histogram).

In [None]:
plt.hist(errors)
plt.show()

In addition to having some very large errors, we can also see that most of the errors are negative numbers. Ideally, we want our errors to be evenly distributed on either side of zero: a line that best fits the data should show no such bias.
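
Bias of this kind can also be summarised numerically. The sketch below computes the mean error and the share of negative errors. The array `errors_demo` is made up for illustration and does not reproduce the errors calculated above.

```python
import numpy as np

# Made-up errors (actual minus predicted) for illustration
errors_demo = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, -3.0])

mean_error = errors_demo.mean()            # near zero for an unbiased line
share_negative = (errors_demo < 0).mean()  # fraction of negative errors

print("Mean error:", mean_error)
print("Share of negative errors:", share_negative)
```

A mean error far from zero, or a share of negative errors far from one half, suggests the line sits systematically above or below the data.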
   
We can measure the overall error of the fit by calculating the **Residual Sum of Squares**:
   
$$RSS = \sum_{i=1}^n(y_i-\hat{y}_i)^2$$

The RSS finds the difference between the y-value of each data point and our estimated line (which may be either negative or positive), squares the difference, and then adds up all the squared differences. In other words, it's the sum of the squares of all the errors we calculated before.
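
A small worked example may help make the steps concrete. The actual and predicted values below are made up so that the arithmetic is easy to follow by hand.

```python
import numpy as np

# Made-up actual and predicted values
y_actual = np.array([3.0, 5.0, 7.0])
y_predicted = np.array([2.5, 5.5, 6.0])

errors_demo = y_actual - y_predicted  # [0.5, -0.5, 1.0]
squared_errors = errors_demo ** 2     # [0.25, 0.25, 1.0]
rss = squared_errors.sum()

print("RSS:", rss)  # RSS: 1.5
```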

In [None]:
print("Residual sum of squares:", (errors ** 2).sum())

An RSS value of `867` indicates that the line's predictions deviate substantially from the actual data points. In general, a high RSS suggests that the line does not capture the underlying pattern well, which leads to inaccurate predictions and implies a poor fit to the observed data. In this case, however, the RSS is high because we guessed the values of `m` and `c`. Instead of guessing, we can use mathematical techniques to find the optimal values of `m` and `c`, which can significantly improve the fit.
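
One common technique of this kind is ordinary least squares, which chooses the `m` and `c` that minimise the RSS. The sketch below uses `np.polyfit` on made-up data to compare a guessed line with a least-squares line. It is an illustration under these assumptions, and not necessarily the approach taken later in this course.

```python
import numpy as np

def rss(y, y_hat):
    """Residual sum of squares between actual and predicted values."""
    return float(((y - y_hat) ** 2).sum())

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# A guessed line
m_guess, c_guess = 1.5, 1.0
rss_guess = rss(y, m_guess * x + c_guess)

# Least-squares line: polyfit with degree 1 returns [slope, intercept]
m_fit, c_fit = np.polyfit(x, y, 1)
rss_fit = rss(y, m_fit * x + c_fit)

print("Guessed line RSS:", round(rss_guess, 3))
print("Fitted line RSS:", round(rss_fit, 3))
```

The fitted line should never have a larger RSS than a guessed line on the same data, because least squares minimises exactly this quantity.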

**Note:** The RSS is influenced by the scale of the data being forecasted, meaning that for forecasts involving large values, such as in billions, the RSS will also likely be a large number, reflecting the squared deviations of these large values.
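
The effect of scale can be demonstrated directly. In the sketch below, the same made-up values are expressed in units 1000 times smaller, so every number becomes 1000 times larger. The RSS then grows by a factor of 1000 squared, even though the quality of the fit is unchanged.

```python
import numpy as np

# Made-up actual and predicted values
y_actual = np.array([3.0, 5.0, 7.0])
y_predicted = np.array([2.5, 5.5, 6.0])

rss_original = ((y_actual - y_predicted) ** 2).sum()

# Same values in units 1000 times smaller (numbers 1000 times larger)
scale = 1000
rss_scaled = ((scale * y_actual - scale * y_predicted) ** 2).sum()

print(rss_original, rss_scaled)
print(rss_scaled / rss_original)  # 1000000.0
```

For this reason, RSS values are only directly comparable between models fitted to the same data in the same units.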


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>