# Simple Linear Regression
---

1.   **[Introduction to Linear Regression](#1.-Introduction-to-Linear-Regression)**
1.   **[Foundations of Linear Regression](#2.-Foundations-of-Linear-Regression)**
1.   **[Model Assumptions](#3.-Model-Assumptions)**
1.   **[Exploratory Data Analysis](#4.-Exploratory-Data-Analysis)**
1.   **[Model Construction](#5.-Model-Construction)**
1.   **[Model Evaluation](#6.-Model-Evaluation)**

<a name="1. Definitions"></a>
### 1. Introduction to Linear Regression

#### 1.1 Definitions

**Simple Linear Regression |** Technique that estimates the linear relationship between a `continuous` dependent variable and `one independent` variable. 

**Dependent variable (y) |** The variable a given model estimates, also referred to as a response or outcome variable.

**Independent variable (x) |** A variable that explains trends in the dependent variable, also referred to as an explanatory or predictor variable.

**Simple Linear Regression Formula |** $y = \text{intercept} + \text{slope} \cdot x$

**Slope |** The amount that `y` increases or decreases per one-unit increase of `x`.

**Intercept |** The value of `y`, the dependent variable, when `x`, the independent variable, equals 0.
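
As a small illustration of how the slope and intercept combine, the sketch below evaluates the formula for made-up values. The function name `predict_y` and the numbers are hypothetical, chosen only for this example.

```python
def predict_y(x, intercept, slope):
    """Return the value of y on the line y = intercept + slope * x."""
    return intercept + slope * x


# Hypothetical line: intercept of 2.0 and slope of 0.5
intercept, slope = 2.0, 0.5

y_at_zero = predict_y(0, intercept, slope)  # at x = 0, y equals the intercept
y_at_four = predict_y(4, intercept, slope)
y_at_five = predict_y(5, intercept, slope)

# A one-unit increase in x changes y by exactly the slope
change_per_unit = y_at_five - y_at_four
print(y_at_zero, y_at_four, y_at_five, change_per_unit)
```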



#### 1.2 Mathematical Linear Regression

**Linear Regression Equation |** $y = \beta_0 + \beta_1 x + \epsilon$

Parameters are properties of a population, so their true values can never be known unless the entire population is observed.

- Estimates of the parameters are calculated from sample data.
- Estimates are denoted with a hat (^), for example $\hat{\beta_1}$.

**Linear Regression Estimation |** $\hat{y} = \hat{\beta_0} + \hat{\beta_1} x$

**Regression Coefficients |** The estimated betas in a regression model. Represented as $\hat{\beta_i}$

**Ordinary Least Squares Estimation (OLS) |** Common way to calculate linear regression coefficients $\hat{\beta_i}$

**Loss Function |** A function that measures the distance between the observed values and the model's estimated values 


---
### 2. Foundations of Linear Regression

#### 2.1 Ordinary Least Squares Estimation


Ordinary least squares (OLS) is a method used in linear regression analysis to estimate the unknown parameters of the linear regression model. The goal of OLS estimation is to find the values of the regression coefficients that minimize the sum of the squared errors between the predicted values and the actual values of the dependent variable.

**Best Fit Line |** The line that fits the data best by minimizing some loss function or error

**Predicted values |** The estimated (y) values for each (x) calculated by a model

**Residual |** The difference between observed or actual values and the predicted values of the regression line 
- Residual = Observed - Predicted ---> $\epsilon_i = y_i - \hat{y_i}$

**Sum of Squared Residuals (SSR) |** The sum of the squared differences between each observed value and its associated predicted value 
- $SSR = \sum\limits_{i=1}^{n}(Observed - Predicted)^2$
- $SSR = \sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2$
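
To make the SSR definition concrete, the sketch below computes residuals and the SSR for two candidate lines on a tiny made-up dataset. The data values and the helper name `sum_squared_residuals` are assumptions for illustration only.

```python
import numpy as np

# Made-up observations for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])


def sum_squared_residuals(intercept, slope):
    """SSR of the line y_hat = intercept + slope * x on the data above."""
    predicted = intercept + slope * x
    residuals = y - predicted  # observed - predicted
    return float(np.sum(residuals ** 2))


# Two hypothetical candidate lines
ssr_line_a = sum_squared_residuals(intercept=0.0, slope=2.0)
ssr_line_b = sum_squared_residuals(intercept=1.0, slope=1.5)
print(ssr_line_a, ssr_line_b)
```

The line with the smaller SSR (here, line A) fits these points more closely.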

**Ordinary Least Squares (OLS) |** A method that minimizes the sum of the squared residuals to estimate parameters in a linear regression model
- Used to calculate: $\hat{y} = \hat{\beta_0} + \hat{\beta_1} x$
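
For simple linear regression, minimizing the SSR has a well-known closed-form solution: $\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$ and $\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$. The sketch below applies these formulas to a small made-up dataset (the numbers are assumptions for illustration) and cross-checks the result against NumPy's `np.polyfit`.

```python
import numpy as np

# Made-up observations for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

x_mean, y_mean = x.mean(), y.mean()

# Closed-form OLS estimates for one predictor
beta_1_hat = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0_hat = y_mean - beta_1_hat * x_mean

# Cross-check: a degree-1 least-squares polynomial fit returns [slope, intercept]
slope_check, intercept_check = np.polyfit(x, y, deg=1)

print(f"slope: {beta_1_hat:.2f}, intercept: {beta_0_hat:.2f}")
```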

---
### 3. Model Assumptions

Model assumptions are statements about the data that must be true in order to justify the use of a particular modeling technique



#### 3.1 Linear Regression Assumptions
- **Linearity**
- **Normality**
- **Independent Observations**
- **Homoscedasticity**


##### 3.1.1 Linearity

**Each predictor variable $(x_i)$ is linearly related to the outcome variable $(y)$**

##### 3.1.2 Normality

**The residuals (errors) are normally distributed.**
- Can only be checked after the model is built, because the residuals must be computed from the fitted model.
- Checked using a quantile-quantile plot (Q-Q plot).
    - If the points on the plot form a straight diagonal line, normality can be assumed.

##### 3.1.3 Independent Observation 

**Each observation in the dataset is independent**

##### 3.1.4 Homoscedasticity

**The variation of the residuals (errors) is constant or similar across the model**
- Homoscedasticity means having the same scatter


#### 3.2 Assumption Violations



##### 3.2.1 Linearity
**Transform one or both of the variables**, such as taking the logarithm.
- For example, if measuring the relationship between years of education and income, take the logarithm of the income variable and check if that helps the linear relationship.
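
A rough sketch of this idea follows, using synthetic data in which income is assumed to grow exponentially with years of education. The data-generating numbers and variable names are invented for illustration. The Pearson correlation is used here as a simple measure of how linear each relationship is.

```python
import numpy as np

# Synthetic data (an assumption for illustration):
# income grows exponentially with years of education, with multiplicative noise
rng = np.random.default_rng(seed=0)
years_education = np.linspace(8, 20, 200)
income = np.exp(7.0 + 0.25 * years_education + rng.normal(0, 0.05, size=200))

# Pearson correlation measures the strength of the linear relationship
corr_raw = np.corrcoef(years_education, income)[0, 1]
corr_log = np.corrcoef(years_education, np.log(income))[0, 1]

print(f"raw income: {corr_raw:.3f}, log(income): {corr_log:.3f}")
```

With data like this, the log-transformed outcome is closer to a straight-line relationship with the predictor. Real data may not respond to the transformation as cleanly.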

##### 3.2.2 Normality
**Transform one or both variables.** Most commonly, this would involve taking the logarithm of the outcome variable.
- When the outcome variable is right skewed, the normality of the residuals can be affected. Taking the logarithm of the outcome variable can sometimes help with this assumption.
- When transforming a variable, reconstruct the model and recheck the normality assumption. If the assumption is still not satisfied, continue troubleshooting the issue.

##### 3.2.3 Independent Observation 
**Take just a subset of the available data.**
- If, for example, the data comes from a survey that includes responses from several people in the same household, those responses may be correlated. Correct for this by keeping the data of only one person per household.
- As another example, consider data on bike rentals over a time period. If data is collected every 15 minutes, the number of bikes rented out at 8:00 a.m. might correlate with the number rented out at 8:15 a.m. The observations may be closer to independent if the data is sampled once every 2 hours instead of once every 15 minutes.
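
Both subsetting ideas above can be sketched with pandas. The DataFrames, column names (`household_id`, `bikes_rented`, and so on), and values below are hypothetical. Subsetting reduces obvious dependence but does not by itself guarantee independent observations.

```python
import pandas as pd

# Hypothetical survey: two households contributed more than one respondent
survey = pd.DataFrame({
    "household_id": [1, 1, 2, 3, 3, 3],
    "respondent": ["a", "b", "c", "d", "e", "f"],
    "score": [7, 8, 5, 6, 6, 7],
})

# Keep only the first respondent from each household
one_per_household = survey.drop_duplicates(subset="household_id", keep="first")

# Hypothetical bike-rental counts recorded every 15 minutes for one day
timestamps = pd.date_range("2024-01-01 00:00", periods=96, freq="15min")
rentals = pd.DataFrame({"timestamp": timestamps, "bikes_rented": range(96)})

# Keep every 8th row: 8 intervals of 15 minutes = 2 hours
every_two_hours = rentals.iloc[::8]

print(len(one_per_household), len(every_two_hours))
```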

##### 3.2.4 Homoscedasticity
**Define a different outcome variable.**
- If interested in understanding how a city's population correlates with the number of restaurants in that city, note that some cities are far more populous than others, so the spread of the residuals may grow with city size. It is therefore possible to redefine the outcome variable as the ratio of population to restaurants instead.

**Transform the Y variable.**
- As with the above assumptions, sometimes taking the logarithm or transforming the Y variable in another way can potentially fix inconsistencies with the homoscedasticity assumption.
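
The effect of a log transformation on the residual spread can be sketched with synthetic data. The example assumes multiplicative noise, so the scatter of `y` grows with its level. The helper `spread_ratio` is a hypothetical, informal diagnostic written for this example, not a standard test.

```python
import numpy as np

# Synthetic data (an assumption for illustration): multiplicative noise,
# so the scatter of y grows as y gets larger
rng = np.random.default_rng(seed=0)
x = np.linspace(0, 10, 2000)
y = np.exp(2.0 + 0.1 * x + rng.normal(0, 0.3, size=x.size))


def spread_ratio(x_values, y_values):
    """Fit a line, then compare residual spread in the upper vs lower half of x."""
    slope, intercept = np.polyfit(x_values, y_values, deg=1)
    residuals = y_values - (intercept + slope * x_values)
    midpoint = np.median(x_values)
    lower = residuals[x_values < midpoint]
    upper = residuals[x_values >= midpoint]
    return upper.std() / lower.std()


# A ratio near 1 suggests similar scatter across the range of x
ratio_raw = spread_ratio(x, y)
ratio_log = spread_ratio(x, np.log(y))
print(f"raw y: {ratio_raw:.2f}, log(y): {ratio_log:.2f}")
```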

---
### 4. Exploratory Data Analysis

#### 4.1 Imports

In [None]:
# Import relevant Python libraries and modules

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.formula.api import ols
import statsmodels.api as sm

# Load the dataset into a DataFrame and save in a variable

data = pd.read_csv("example_file.csv")

#### 4.2 Data Exploration

In [None]:
# Display the first 10 rows of the data
data.head(10)

In [None]:
# Display number of rows, number of columns
data.shape

#### 4.2.1 Missing Data

##### 4.2.1.1 Check for missing data

In [None]:
# Step 1. Start with .isna() to get booleans indicating whether each value in the data is missing
data.isna()

In [None]:
# Step 2. Use .any(axis=1) to get booleans indicating whether there are any missing values along the columns in each row
data.isna().any(axis=1)

In [None]:
# Step 3. Use .sum() to get the number of rows that contain missing values
data.isna().any(axis=1).sum()

##### 4.2.1.2 Drop missing data

In [None]:
# Step 1. Use .dropna(axis=0) to indicate that you want rows which contain missing values to be dropped
# Step 2. To update the DataFrame, reassign it to the result
data = data.dropna(axis=0)

In [None]:
# Check to make sure that the data does not contain any rows with missing values now

# Step 1. Start with .isna() to get booleans indicating whether each value in the data is missing
# Step 2. Use .any(axis=1) to get booleans indicating whether there are any missing values along the columns in each row
# Step 3. Use .sum() to get the number of rows that contain missing values
data.isna().any(axis=1).sum()
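
The check, drop, and recheck steps above can be tried on a toy DataFrame. The DataFrame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: rows 1 and 2 each contain one missing value
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [2.0, 5.0, np.nan, 8.0],
})

# Count rows that contain at least one missing value
rows_with_missing = toy.isna().any(axis=1).sum()

# Drop those rows, then confirm none remain
toy_clean = toy.dropna(axis=0)
rows_with_missing_after = toy_clean.isna().any(axis=1).sum()

print(rows_with_missing, len(toy_clean), rows_with_missing_after)
```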

#### 4.2.2 Model Assumptions | Initial Check

Only linearity can be checked at this point. The remaining assumptions are checked after the model is constructed.

In [None]:
# Create plot of pairwise relationships to check linearity
sns.pairplot(data)

---
### 5. Model Construction

In [None]:
# Select relevant columns (independent variable first, then dependent variable)
# "Ad_Spend" and "Sales" are example names; replace them with your own column names
# Save resulting DataFrame in a separate variable to prepare for regression
ols_data = data[["Ad_Spend", "Sales"]]

# Display first 10 rows of the new DataFrame
ols_data.head(10)

In [None]:
# Write the linear regression formula as "Y ~ X", replacing Y and X with the corresponding column names
# Example names used here: Sales (dependent variable) and Ad_Spend (independent variable)
# Save it in a variable
ols_formula = "Sales ~ Ad_Spend"

# Implement OLS approach for linear regression
OLS = ols(formula=ols_formula, data=ols_data)

# Fit the model to the data
# Save the fitted model in a variable
model = OLS.fit()

---
### 6. Model Evaluation

In [None]:
# Get summary of results
model.summary()

#### 6.1 Summary Analysis

**Questions to consider based on the results:**
1. What is the y-intercept?
2. What is the slope? 
3. What is the linear equation if written to express the relationship between (x) and (y) in the form of y = slope * x + y-intercept?
4. What do you think the slope in this context means?
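
One possible way to answer these questions in code is to read the coefficients from the fitted model's `params` attribute. The sketch below is self-contained and uses synthetic data with hypothetical column names (`Ad_Spend`, `Sales`) and an assumed true relationship of `Sales = 2 + 3 * Ad_Spend` plus noise.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(seed=0)
ad_spend = np.linspace(0, 10, 200)
sales = 2.0 + 3.0 * ad_spend + rng.normal(0, 1.0, size=200)
demo_data = pd.DataFrame({"Ad_Spend": ad_spend, "Sales": sales})

# Fit the model with the formula interface
demo_model = ols(formula="Sales ~ Ad_Spend", data=demo_data).fit()

# params is indexed by "Intercept" and by the predictor's column name
intercept = demo_model.params["Intercept"]
slope = demo_model.params["Ad_Spend"]

print(f"Sales = {slope:.2f} * Ad_Spend + {intercept:.2f}")
```

In this setup, the slope estimates how much `Sales` changes, on average, for a one-unit increase in `Ad_Spend`.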

#### 6.2 Model Assumptions | Final Check

##### 6.2.1 Normality Check

**Distribution Visualization**

In [None]:
# Get the residuals from the model
residuals = model.resid

In [None]:
# Visualize the distribution of the residuals
ax = sns.histplot(residuals)
ax.set_xlabel("Residual Value")
ax.set_title("Histogram of Residuals")
plt.show()

**Question to answer:**

1. Based on the visualization above, is the distribution of the residuals normal?

**Q-Q Plot Visualization**

In [None]:
# Create a Q-Q plot 
sm.qqplot(residuals, line='s')
plt.title("Q-Q plot of Residuals")
plt.show()

**Question to answer:**

1. Do the points on the Q-Q plot closely follow a straight diagonal line trending upward?
    - If yes, the normality assumption is plausibly met.

##### 6.2.2 Independent Observation and Homoscedasticity Check

In [None]:
# Get fitted values (the model's predictions for the data it was fitted on)
fitted_values = model.fittedvalues

In [None]:
# Create a scatterplot of residuals against fitted values
ax = sns.scatterplot(x=fitted_values, y=residuals)
ax.axhline(0)
ax.set_xlabel("Fitted Values")
ax.set_ylabel("Residuals")
plt.show()

**Question to answer:**

1. Do the data points form a random cloud around zero, with no explicit pattern and roughly constant vertical spread?
    - If yes, the independent observation and homoscedasticity assumptions are plausibly met.