# [CPSC 322]() Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/) |
[Sophina Luitel](https://www.gonzaga.edu/school-of-engineering-applied-science/faculty/detail/sophina-luitel-phd-0dba6a9d)

---

# Linear Regression
What are our learning objectives for this lesson?
* Calculate a least squares linear regression line

Content used in this lesson is based upon information in the following sources:
* Dr. Gina Sprint's Data Science Algorithms notes, Fall 2024

# Linear Regression 
Imagine you’re a social media influencer (or trying to become one). You want to figure out how many likes you’ll get on a post based on how many followers you have. So, you gather data from your past posts and start noticing a trend -- more followers usually means more likes.

That’s where linear regression comes in! It helps you draw a straight line through your data to predict likes for future posts.

## Breaking It Down:

* Input (independent variable): Number of followers -- because that’s what you control or know.

* Output (dependent variable): Number of likes -- because it depends on how many followers you have.

* The goal: learn the relationship between followers and likes so we can predict future post performance.

<img src="https://raw.githubusercontent.com/DataScienceAlgorithms/M4_MLAlgorithmsIntro/main/figures/insta.png" alt="Linear Regression" width ="600"/>


## Best Fit Line in Linear Regression
In linear regression, the best-fit line is the straight line that nails the relationship between your input and output. It’s the line that keeps the gaps between the real data points and its predictions as tiny as possible.

In simple linear regression, the goal is to find the best-fit line:

$$
\hat{y} = m x + b
$$

where:  
- $\hat{y}$ is the predicted value (dependent variable)  
- $x$ is the input (independent variable)  
- $m$ is the slope of the line (how much $\hat{y}$ changes when $x$ increases by 1)  
- $b$ is the intercept (the value of $y$ when $x = 0$)

---
### Least Squares Method

To calculate the best-fit line, we use the **Least Squares Method**. This method finds the line that minimizes the sum of the squared differences between the actual values ($y$) and the predicted values ($\hat{y}$). These differences (before squaring) are called **residuals**.  
Each residual is the difference between an actual value and the predicted value:

$$
e_i = y_i - \hat{y}_i
$$

The basic least squares approach:
1. Calculate the mean $\bar{x}$ of the $x$ values and the mean $\bar{y}$ of the $y$ values
    * note that the least squares line always passes through the point ($\bar{x}$, $\bar{y}$)
2. Calculate the slope $m$ from the deviations of each point from the means.
3. Calculate the y-intercept $b$ from the slope and the two means.
     



The Least Squares Method minimizes the sum of squared residuals, also known as the Sum of Squared Errors (SSE):

$$
\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

This is why it’s called the "least squares" method. We are finding the line that makes this total squared error as small as possible.

In simple terms, the least squares method helps us find the values of $m$ and $b$ that make our predictions as close as possible to the real data points.
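
To make the SSE idea concrete, here is a minimal sketch that scores two candidate lines on a small toy dataset. The data values and the helper name `sse` are assumptions made up for illustration; they are not from the lesson. The second candidate happens to be the least squares solution for this toy data, so it should score lower.

```python
# Hypothetical toy data (not from the lesson)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def sse(x, y, m, b):
    """Sum of squared residuals for the line y_hat = m * x + b."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))


# Compare two candidate lines; the smaller SSE is the better fit
sse_candidate = sse(x, y, m=1.0, b=1.0)
sse_least_squares = sse(x, y, m=0.6, b=2.2)
print(sse_candidate, sse_least_squares)
```

Any other choice of $m$ and $b$ for this data gives an SSE at least as large as the second value.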


### Formulas

**Slope (m):**

$$
m = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}
$$

**Intercept (b):**

$$
b = \bar{y} - m \times \bar{x}
$$
where:

- $x_i, y_i$ are data points  
- $\bar{x}, \bar{y}$ are means of $x$ and $y$
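
The two formulas translate almost line for line into Python. The sketch below uses a made-up toy dataset and a hypothetical helper name, `fit_line`; it is one possible way to organize the calculation, not the course's reference implementation.

```python
def fit_line(x, y):
    """Return (m, b) for the least squares line through parallel lists x and y."""
    n = len(x)
    x_avg = sum(x) / n
    y_avg = sum(y) / n
    # Numerator and denominator of the slope formula
    numerator = sum((x[i] - x_avg) * (y[i] - y_avg) for i in range(n))
    denominator = sum((x[i] - x_avg) ** 2 for i in range(n))
    m = numerator / denominator
    # The intercept follows from the line passing through (x_avg, y_avg)
    b = y_avg - m * x_avg
    return m, b


# Hypothetical toy data (not from the lesson)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
m, b = fit_line(x, y)
print(m, b)  # approximately 0.6 and 2.2
```

Here $\bar{x} = 3$ and $\bar{y} = 4$, the numerator is 6, and the denominator is 10, so $m = 0.6$ and $b = 4 - 0.6 \times 3 = 2.2$.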


---



### Understanding Relationship in Data

Before computing the best-fit line, we can assess how the variables relate to each other using metrics such as covariance and the correlation coefficient.

#### Covariance

Covariance measures how two variables change together:

$$
\text{cov} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n}
$$
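
As a rough sketch of the covariance formula, the function below divides by $n$ exactly as written above. The helper name `covariance` and the toy data are assumptions for illustration only.

```python
def covariance(x, y):
    """Population covariance of parallel lists x and y (divides by n)."""
    n = len(x)
    x_avg = sum(x) / n
    y_avg = sum(y) / n
    return sum((xi - x_avg) * (yi - y_avg) for xi, yi in zip(x, y)) / n


# Hypothetical toy data (not from the lesson)
x = [1, 2, 3, 4, 5]
cov_pos = covariance(x, [2, 4, 5, 4, 5])  # y tends to rise with x
cov_neg = covariance(x, [5, 4, 3, 2, 1])  # y falls as x rises
print(cov_pos, cov_neg)
```

A positive value means the variables tend to increase together, and a negative value means one tends to decrease as the other increases. If you compare against `numpy.cov`, note that it divides by $n - 1$ unless you pass `bias=True`.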


#### Correlation Coefficient $r$

The correlation coefficient $r$ measures how strong and how linear the relationship is:

$$
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}
$$

- Note: the denominator is built from the same deviations as the numerator, but squared so the signs drop out.
- $r = 1$: perfect positive linear correlation  
- $r = -1$: perfect negative linear correlation  
- $r = 0$: no linear relationship  

You can also calculate the correlation from the covariance:

$$
r = \frac{\text{cov}}{\sigma_x \sigma_y}
$$

where $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$ (dividing by $n$, to match the covariance formula above).

You can also use this alternative formula for the **slope**:

$$
m = r \cdot \frac{\sigma_y}{\sigma_x}
$$
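
The sketch below checks on toy data that the two ways of computing $r$ agree and that the alternative slope formula gives the same slope as the sums-of-deviations formula. The data and the helper name `correlation` are assumptions for illustration. To stay within the standard library it uses `statistics.pstdev` as a stand-in for `numpy.std`; both compute the population standard deviation by default.

```python
import math
import statistics


def correlation(x, y):
    """Correlation coefficient r of parallel lists x and y."""
    n = len(x)
    x_avg = sum(x) / n
    y_avg = sum(y) / n
    numerator = sum((xi - x_avg) * (yi - y_avg) for xi, yi in zip(x, y))
    denominator = math.sqrt(
        sum((xi - x_avg) ** 2 for xi in x) * sum((yi - y_avg) ** 2 for yi in y)
    )
    return numerator / denominator


# Hypothetical toy data (not from the lesson)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_avg, y_avg = sum(x) / n, sum(y) / n

r = correlation(x, y)

# Same value via r = cov / (sigma_x * sigma_y), using population statistics
cov_xy = sum((xi - x_avg) * (yi - y_avg) for xi, yi in zip(x, y)) / n
sigma_x = statistics.pstdev(x)
sigma_y = statistics.pstdev(y)
r_from_cov = cov_xy / (sigma_x * sigma_y)

# Alternative slope formula: m = r * sigma_y / sigma_x
m_from_r = r * sigma_y / sigma_x
print(r, r_from_cov, m_from_r)
```

For this data $r \approx 0.775$, and the slope from $r$ comes out to about 0.6.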


### Assessing Model Fit

#### Standard Error of the Estimate

The standard error measures the spread of the residuals (how far off predictions are):

$$
\text{stderr} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}}
$$

- $(y_i - \hat{y}_i)$ is a residual
- Standard error is like the standard deviation of residuals
- Lower values indicate a better fit
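
Here is a minimal sketch of the standard error calculation, dividing by $n$ as in the formula above. The helper name `standard_error`, the toy data, and the line parameters are assumptions for illustration; the parameters are the least squares values for this particular toy data.

```python
import math


def standard_error(y, y_pred):
    """Square root of the mean squared residual (divides by n)."""
    n = len(y)
    return math.sqrt(sum((yi - ypi) ** 2 for yi, ypi in zip(y, y_pred)) / n)


# Hypothetical toy data and fitted line (not from the lesson)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
m, b = 0.6, 2.2

y_pred = [m * xi + b for xi in x]
stderr = standard_error(y, y_pred)
print(stderr)  # roughly 0.69
```

The result is in the same units as $y$, so here a typical prediction is off by a bit less than one unit.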

---

#### Tips for Calculation

- Use `numpy.std(x)` for standard deviation (by default it divides by $n$, matching the formulas above)
- Watch out for integer division: use `/`, not `//`
- Use list comprehensions for compact calculations:

```python
sum([(x[i] - x_avg) * (y[i] - y_avg) for i in range(n)])
```

---
Q: What does it mean if there is a strong (linear) correlation?

A strong linear correlation between two variables means they change together in a consistent, predictable way. This can have several implications:

- One variable might be unnecessary, since it can be inferred from the other — they carry similar information.
- One variable can be used to predict the other, which is especially helpful if one is a target or class label.
- This relationship makes linear regression a useful tool, as it models how one variable depends on another.

In short: Strong linear correlation suggests predictability, possible redundancy, and supports using regression techniques.


# Practice: Simple Linear Regression Class

Understand the concept of Simple Linear Regression and explore a Python implementation using a custom class.

1. Create a folder and file
   - Create a folder named `ClassificationFun` and create a `main.py` file inside it.
   - We will take some quick notes here on programming interfaces for ML algorithms.

2. Download the Linear Regressor module
   - Download `my_simple_linear_regressor.py` from:
     [https://github.com/DataScienceAlgorithms/M4_MLAlgorithmsIntro/ClassificationFun](https://github.com/DataScienceAlgorithms/M4_MLAlgorithmsIntro/ClassificationFun)
   - Read through the module to understand how `fit()` and `predict()` are implemented.

3. Create a MySimpleLinearRegressor object
   - In `main.py`, create a `MySimpleLinearRegressor` object.
   - Call `fit()` using our training data (`y = 2x + some noise`) and then `predict()` for an unseen instance.
   - Example:

```python
import numpy as np
from my_simple_linear_regressor import MySimpleLinearRegressor

# Prepare training data
X_train = [[val] for val in range(0, 100)]
y_train = [row[0] * 2 + np.random.normal(0, 25) for row in X_train]
```
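
If you want to experiment before downloading the module, the sketch below shows what a class with a `fit()`/`predict()` interface might look like. It is a hypothetical stand-in named `SketchSimpleLinearRegressor`, not the course's `MySimpleLinearRegressor`; the method signatures, attribute names, and the use of `random.gauss` in place of `np.random.normal` are all assumptions.

```python
import random


class SketchSimpleLinearRegressor:
    """Hypothetical stand-in; NOT the course's MySimpleLinearRegressor."""

    def __init__(self):
        self.slope = None
        self.intercept = None

    def fit(self, X_train, y_train):
        """Fit a least squares line; X_train is a list of one-element rows."""
        x = [row[0] for row in X_train]
        n = len(x)
        x_avg = sum(x) / n
        y_avg = sum(y_train) / n
        numerator = sum((x[i] - x_avg) * (y_train[i] - y_avg) for i in range(n))
        denominator = sum((x[i] - x_avg) ** 2 for i in range(n))
        self.slope = numerator / denominator
        self.intercept = y_avg - self.slope * x_avg

    def predict(self, X_test):
        """Return one prediction per row of X_test."""
        return [self.slope * row[0] + self.intercept for row in X_test]


# Same shape of training data as above: y = 2x + some noise
random.seed(0)  # fixed seed so the run is repeatable
X_train = [[val] for val in range(0, 100)]
y_train = [row[0] * 2 + random.gauss(0, 25) for row in X_train]

reg = SketchSimpleLinearRegressor()
reg.fit(X_train, y_train)
y_pred = reg.predict([[150]])  # an unseen instance
print(reg.slope, reg.intercept, y_pred)
```

The fitted slope should land near 2, since that is the slope used to generate the data, and the prediction for 150 should land in the neighborhood of 300.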

#### Gradient Descent 

Gradient descent is a fundamental optimization technique behind the learning process of many algorithms, including linear regression, logistic regression, support vector machines, and neural networks. It minimizes a model's cost function by iteratively adjusting the model parameters to reduce the difference between predicted and actual values, which improves the model's performance.


```python
import numpy as np
import matplotlib.pyplot as plt

# Sample data
X_train = [[val] for val in range(0, 100)]
y_train = [row[0] * 2 + np.random.normal(0, 25) for row in X_train]

# Initialize parameters
m = 0  # slope
b = 0  # intercept
learning_rate = 0.01
epochs = 50
n = len(X_train)

# Gradient Descent
```
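
The setup above stops at the `# Gradient Descent` comment. One way the update loop might look is sketched below, using mean squared error as the cost function. This is an assumption about the intended continuation, not the original cell. Two changes are deliberate and also assumptions: the data is generated with `random.gauss` so the sketch needs only the standard library, and the learning rate is lowered to `0.0001`, because with unscaled `x` values from 0 to 99 a rate of `0.01` makes the updates overshoot and diverge.

```python
import random

# Sample data: y = 2x + some noise (same shape as the setup above)
random.seed(0)  # fixed seed so the run is repeatable
X_train = [[val] for val in range(0, 100)]
y_train = [row[0] * 2 + random.gauss(0, 25) for row in X_train]
x = [row[0] for row in X_train]
n = len(x)

# Initialize parameters
m = 0.0  # slope
b = 0.0  # intercept
learning_rate = 0.0001  # assumption: smaller than 0.01 to keep updates stable
epochs = 50


def mse(m, b):
    """Mean squared error of the line y_hat = m * x + b on the training data."""
    return sum((y_train[i] - (m * x[i] + b)) ** 2 for i in range(n)) / n


mse_history = [mse(m, b)]

# Gradient descent: step m and b against the gradient of the MSE
for _ in range(epochs):
    errors = [y_train[i] - (m * x[i] + b) for i in range(n)]
    grad_m = (-2 / n) * sum(x[i] * errors[i] for i in range(n))
    grad_b = (-2 / n) * sum(errors)
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b
    mse_history.append(mse(m, b))

print(m, b, mse_history[-1])
```

The slope should approach 2 within a few epochs, while the intercept moves very slowly at this learning rate. Scaling `x` before training is a common way to let both parameters converge at a similar pace.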


