# An Introduction to Linear Regression

Linear regression is a foundational statistical technique that's used to establish a mathematical relationship between a dependent variable (often referred to as the response variable) and one or more independent variables (referred to as predictors). It operates under the assumption that this relationship can be approximated by a straight line.

The core formula for linear regression is expressed as follows:

\begin{equation}
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon,
\end{equation}

Where:
- $Y$ represents the dependent variable (response).
- $X_1, X_2, \ldots, X_p$ are the independent variables (predictors).
- $\beta_0, \beta_1, \ldots, \beta_p$ are the coefficients (regression coefficients or model parameters).
- $\varepsilon$ denotes the error term, accounting for unexplained variability.

The objective of linear regression is to choose the coefficients $\beta_0, \beta_1, \ldots, \beta_p$ so that the model fits the observed data as closely as possible. The most common approach to estimating these coefficients is the least-squares method, which minimizes the residual sum of squares (RSS) [James et al., 2023]:

\begin{equation}
\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\end{equation}

Here:
- $y_i$ is the observed value of the dependent variable.
- $\hat{y}_i$ is the predicted value based on the model.
- $n$ is the number of data points.
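As a small illustration of the RSS formula, the sketch below computes it for four observations and predictions. The numbers are made up for this example and are not taken from any dataset in this chapter.

```python
import numpy as np

# Hypothetical observed responses and model predictions (illustrative values only)
y = np.array([3.0, 5.0, 7.5, 9.0])
y_hat = np.array([2.8, 5.1, 7.0, 9.4])

# Residuals are observed minus predicted values
residuals = y - y_hat

# RSS is the sum of the squared residuals
rss = float(np.sum(residuals ** 2))
print(f"RSS = {rss:.2f}")
```

The residuals here are 0.2, -0.1, 0.5, and -0.4, so the RSS is 0.04 + 0.01 + 0.25 + 0.16 = 0.46.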

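The coefficients are estimated from a training dataset with known values of $Y$ and $X_1, \ldots, X_p$. The aim is to find the $\beta$ values that minimize the RSS. For ordinary least squares this minimization has a closed-form solution (the normal equations), which software typically solves with matrix factorizations rather than iterative optimization.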

After the coefficients are estimated, the linear regression model can be applied to make predictions for new data points. Given the values of the independent variables $X_1, X_2, \ldots, X_p$, the predicted value of the dependent variable $Y$ is calculated using this equation [James et al., 2023]:

\begin{equation}
\hat{Y} = \hat{\beta_0} + \hat{\beta_1} X_1 + \hat{\beta_2} X_2 + \ldots + \hat{\beta_p} X_p,
\end{equation}
where $\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_p}$ denote the estimated values of the unknown coefficients $\beta_0, \beta_1, \ldots, \beta_p$.
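The prediction equation can be sketched in a few lines of NumPy. The helper name `predict` and the coefficient values below are assumptions made for illustration only; they are not estimates from real data.

```python
import numpy as np

def predict(beta_hat, X_new):
    """Return beta_0 + beta_1 * x_1 + ... + beta_p * x_p for each row of X_new."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    # The first entry is the intercept; the remaining entries multiply the predictors
    return beta_hat[0] + X_new @ beta_hat[1:]

# Hypothetical estimated coefficients: intercept 1.0, slopes 2.0 and -0.5
beta_hat = [1.0, 2.0, -0.5]

# Two new observations with two predictors each
X_new = [[3.0, 4.0], [0.0, 0.0]]
y_new = predict(beta_hat, X_new)
print(y_new)
```

For the first row the prediction is 1.0 + 2.0 * 3.0 - 0.5 * 4.0 = 5.0, and for the second row, where all predictors are zero, it equals the intercept.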

Linear regression accommodates both simple linear regression (with a single predictor) and multiple linear regression (involving multiple predictors). It's extensively employed across diverse fields, including statistics, economics, social sciences, and machine learning, for tasks like prediction, forecasting, and understanding the relationships between variables.


---

<font color='Red'><b>Note:</b></font>

In this context, the hat symbol (ˆ) is employed to signify the estimated value for an unknown parameter or coefficient, or to represent the predicted value of the response.

---

## Estimating Beta Coefficients in Linear Regression using Least Squares
The estimation of beta coefficients in linear regression, also known as the model parameters or regression coefficients, is a key step in fitting the regression model to the data. The most commonly used method for estimating these coefficients is the least squares approach. The goal is to find the values of the coefficients that minimize the sum of squared residuals (RSS), which measures the discrepancy between the observed values and the values predicted by the model.

Here's a step-by-step overview of how beta coefficients are estimated using the least squares method:

1. **Formulate the Objective**: The linear regression model is defined as:
   
   \begin{equation}
   Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon
   \end{equation}

   The goal is to find the values of $\beta_0, \beta_1, \ldots, \beta_p$ that minimize the RSS:

   \begin{equation}
   \text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
   \end{equation}

   Where:
   - $y_i$ is the observed value of the dependent variable.
   - $\hat{y}_i$ is the predicted value of the dependent variable based on the model.
   - $n$ is the number of data points.

2. **Differentiate the RSS**: Take the partial derivatives of the RSS with respect to each coefficient $\beta_j$ (for $j=0$ to $p$) and set them equal to zero. This step finds the values of the coefficients that minimize the RSS.

3. **Solve the Equations**: The equations obtained from differentiation are a set of linear equations in terms of the coefficients. You can solve these equations to obtain the estimated values of $\beta_0, \beta_1, \ldots, \beta_p$.

4. **Interpretation**: Once the coefficients are estimated, you can interpret their values. Each coefficient $\beta_j$ represents the change in the mean response for a one-unit change in the corresponding predictor $X_j$, while holding all other predictors constant.

5. **Assumptions**: It's important to note that linear regression makes certain assumptions about the data, such as linearity, independence of errors, constant variance of errors (homoscedasticity), and normality of errors. These assumptions should be checked to ensure the validity of the regression results.

6. **Implementation**: Because the least-squares problem is linear in the coefficients, software usually solves it with matrix factorizations (for example, QR or SVD); iterative numerical optimization is mainly needed for very large datasets or for extensions of the model. Packages such as Python's `scikit-learn` [Pedregosa et al., 2011; scikit-learn Developers, 2023] and `statsmodels` [Seabold and Perktold, 2010] provide functions for fitting linear regression models and estimating the coefficients.

The resulting estimated coefficients represent the best-fitting linear relationship between the dependent variable and the independent variables in the least squares sense.
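The differentiate-and-solve steps above can be sketched as follows. Setting the partial derivatives of the RSS to zero yields the linear system $X^T X \beta = X^T y$, where $X$ is the matrix of predictors with a leading column of ones. The data below are synthetic and noise-free (an assumption made so that the true coefficients are recovered exactly), and the variable names are illustrative.

```python
import numpy as np

# Synthetic, noise-free data generated from y = 2 + 3*x1 - 1*x2 (illustrative only)
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 - 1.0 * x2

# Design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack((np.ones(n), x1, x2))

# Solve the linear system obtained by setting the partial derivatives to zero
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient of the RSS with respect to beta, evaluated at the solution
gradient = -2.0 * X.T @ (y - X @ beta_hat)

print("Estimated coefficients:", beta_hat)
print("Largest gradient entry:", np.abs(gradient).max())
```

The gradient is zero up to floating-point error at the solution, which is the condition that step 2 imposes.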

## Simple Linear Regression: Modeling a Single Variable Relationship

Simple linear regression is a fundamental statistical technique that focuses on modeling the relationship between two variables: a dependent variable (often referred to as the response variable) and a single independent variable (commonly known as the predictor). This approach operates under the assumption of a linear relationship, implying that the connection between these variables can be approximated by a straight line.

The core equation for simple linear regression, as described in [James et al., 2023], is expressed as follows:

\begin{equation}
Y = \beta_0 + \beta_1 X + \varepsilon,
\end{equation}

Where:
- $Y$ represents the dependent variable (response).
- $X$ is the independent variable (predictor).
- $\beta_0$ denotes the intercept, the expected value of $Y$ when $X$ equals zero.
- $\beta_1$ represents the slope, the average change in $Y$ for a one-unit increase in $X$.
- $\varepsilon$ is the error term, accounting for the unexplained variability within the model.

The objective of simple linear regression is to estimate the coefficients $\beta_0$ and $\beta_1$ in a manner that the line best fits the observed data. The widely-used method for estimating these coefficients is the least-squares approach, which minimizes the sum of squared residuals:

\begin{equation}
\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\end{equation}

Here:
- $y_i$ is the observed value of the dependent variable ($Y$) for the $i$th data point.
- $\hat{y}_i$ is the predicted value of the dependent variable ($Y$) for the $i$th data point based on the model.
- $n$ is the number of data points.


<center>
<img src="https://raw.githubusercontent.com/HatefDastour/hatefdastour.github.io/master/_notes/Introduction_to_Digital_Engineering/_images/LinearReg.png" alt="picture" width="700">

The dashed green lines are the residuals; the sum of their squared lengths is $\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
</center>

<br>
<br>
Once the coefficients $\beta_0$ and $\beta_1$ are estimated, the simple linear regression model can be represented as:

\begin{equation}
\hat{Y} = \hat{\beta_0} + \hat{\beta_1} X,
\end{equation}
where $\hat{\beta_0}$ and $\hat{\beta_1}$ denote the estimated values of the unknown coefficients $\beta_0$ and $\beta_1$.

This line serves as the best linear fit to the data and is utilized for making predictions regarding new data points. Given the value of the independent variable ($X$) for a new data point, the equation enables us to calculate the predicted value of the dependent variable ($Y$).

Simple linear regression finds broad applications across diverse fields, including economics, finance, social sciences, and engineering. It is particularly effective when a clear linear relationship exists between the two variables, enabling predictions and understanding the impact of the independent variable on the dependent variable, as highlighted in [James et al., 2023].

Nevertheless, interpreting the results of a simple linear regression analysis demands careful consideration. Factors such as goodness of fit (e.g., R-squared), statistical significance of the coefficients, and the validity of assumptions (e.g., linearity, independence of errors, normality of errors) should be taken into account to draw meaningful conclusions from the model. Furthermore, it's essential to recognize that simple linear regression may not be suitable if the relationship between variables is nonlinear or if multiple predictors influence the dependent variable. In such cases, other approaches, such as multiple linear regression or more complex regression models, may be more appropriate, as discussed in [James et al., 2023].
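As one example of a goodness-of-fit measure, R-squared can be computed as $R^2 = 1 - \text{RSS}/\text{TSS}$, where $\text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares. The sketch below uses five made-up data points and NumPy's `np.polyfit` for the least-squares line; the data are assumptions for illustration.

```python
import numpy as np

# Hypothetical data that lie close to a straight line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 6.2, 7.9, 10.1])

# Degree-1 least-squares fit; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

# R-squared compares the residual variation with the total variation in y
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - rss / tss

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.4f}")
```

A value of R-squared close to 1 means the line accounts for most of the variation in the response; it does not, by itself, confirm that the model assumptions hold.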

### Finding the intercept and the slope

Finding the values of the coefficients $\beta_0$ (the intercept) and $\beta_1$ (the slope) in simple linear regression involves minimizing the residual sum of squares (RSS). The RSS is the sum, over all data points, of the squared differences between the observed values ($y_i$) and the values predicted by the linear regression model ($\hat{y}_i$). Mathematically, the objective is to find $\beta_0$ and $\beta_1$ that minimize the following expression:

\begin{equation} \text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \end{equation}

Here's a step-by-step mathematical derivation for finding $\beta_0$ and $\beta_1$:

1. **Define the Linear Regression Model**: Start with the basic linear regression equation:

\begin{equation} Y = \beta_0 + \beta_1 X + \varepsilon \end{equation}

2. **Calculate the Predicted Values**: The predicted value of Y (Ŷ) for each data point is given by:

\begin{equation} \hat{y}_i = \beta_0 + \beta_1 x_i \end{equation}

3. **Define the Residuals**: The residuals (eᵢ) are the differences between the observed values and the predicted values:

\begin{equation} e_i = y_i - \hat{y}_i \end{equation}

4. **Formulate the Objective Function**: The goal is to minimize the sum of squared residuals:

\begin{equation} \text{RSS} = \sum_{i=1}^{n} e_i^2 \end{equation}

5. **Find the Partial Derivatives**: Compute the partial derivatives of the RSS with respect to $\beta_0$ and $\beta_1$:

\begin{equation} \frac{\partial}{\partial \beta_0}\text{RSS} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \end{equation}

\begin{equation} \frac{\partial}{\partial \beta_1}\text{RSS} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) \end{equation}

6. **Set the Derivatives to Zero and Solve**: Set the partial derivatives to zero and solve the resulting system of equations for $\beta_0$ and $\beta_1$:

\begin{align}
\begin{cases}
\displaystyle{\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0},
\\
\displaystyle{\sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0}.
\end{cases}
\end{align}
It follows that

\begin{align}\begin{cases}
\displaystyle{\sum_{i = 1}^{n} y_{i}} & = \displaystyle{\beta_{0}\, \sum_{i = 1}^{n} 1 + \beta_{1}\, \sum_{i = 1}^{n} x_{i}},
\\
\displaystyle{\sum_{i = 1}^{n} x_{i}\,y_{i}} &= \displaystyle{\beta_{0}\,\sum_{i = 1}^{n}x_{i} + \beta_{1}\,\sum_{i = 1}^{n}x_{i}^{2}}.
\end{cases}\end{align}

Since $\sum_{i=1}^{n} 1 = n$, the above linear system can be expressed in the following matrix form,

\begin{align}\begin{bmatrix}n & \sum_{i = 1}^{n} x_{i}\\ \sum_{i = 1}^{n} x_{i} & \sum_{i = 1}^{n} x_{i}^{2}\end{bmatrix}
\begin{bmatrix}\beta_{0}\\ \beta_{1}\end{bmatrix}
= \begin{bmatrix}\sum_{i = 1}^{n}y_{i}\\ \sum_{i = 1}^{n}x_{i}y_{i}\end{bmatrix}.\end{align}

Once the optimal values of $\beta_0$ and $\beta_1$ are found, the simple linear regression model is defined as:

\begin{equation} \hat{Y} = \hat{\beta_0} + \hat{\beta_1} X \end{equation}

Where:
- $\hat{\beta_0}$ is the estimated intercept.
- $\hat{\beta_1}$ is the estimated slope.

This model represents the best-fit line that minimizes the sum of squared residuals and can be used for making predictions and understanding the relationship between the variables.
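The two-equation system from the derivation can be checked numerically. In the sketch below, the top-left entry of the coefficient matrix is the number of data points, and the remaining entries are the sums of $x$, $x^2$, $y$, and $xy$. The four data points are made up and lie exactly on the line $y = 1 + 2x$, so the solution should recover that intercept and slope.

```python
import numpy as np

# Hypothetical data lying exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Coefficient matrix and right-hand side of the normal equations
A = np.array([[len(x), x.sum()],
              [x.sum(), (x ** 2).sum()]])
b = np.array([y.sum(), (x * y).sum()])

# Solve the 2x2 linear system for the intercept and the slope
beta_0, beta_1 = np.linalg.solve(A, b)
print(f"intercept = {beta_0:.4f}, slope = {beta_1:.4f}")
```

Here the system reads $4\beta_0 + 6\beta_1 = 16$ and $6\beta_0 + 14\beta_1 = 34$, whose solution is $\beta_0 = 1$ and $\beta_1 = 2$.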

### Finding the intercept and the slope (Vector Format)

Given:
- $ X $: The matrix of predictor variables (including an additional column of ones for the intercept).
- $ y $: The vector of response variable values.
- $ \beta $: The vector of coefficients ($\beta_0$ and $\beta_1$).

1. **Define the Linear Regression Model**:

The linear regression model can be represented in matrix form as:

\begin{equation} y = X \beta + \varepsilon \end{equation}

Where:
- $ y $ is the vector of observed response values.
- $ X $ is the matrix of predictor variables (including an additional column of ones for the intercept).
- $ \beta $ is the vector of coefficients ($\beta_0$ and $\beta_1$).
- $ \varepsilon $ is the vector of error terms.

2. **Calculate the Predicted Values**:

The predicted values of $ y $ ($\hat{y}$) can be calculated as:

\begin{equation} \hat{y} = X \beta \end{equation}

3. **Define the Residuals**:

The residuals can be defined as the difference between the observed values $ y $ and the predicted values $ \hat{y} $:

\begin{equation} e = y - \hat{y} \end{equation}

4. **Formulate the Objective Function (Sum of Squared Residuals)**:

The objective is to minimize the sum of squared residuals:

\begin{equation} \text{RSS} = e^T e = (y - \hat{y})^T (y - \hat{y}) \end{equation}

5. **Find the Optimal Coefficients**:

The optimal coefficients are found by minimizing the RSS. Setting the gradient of the RSS with respect to $ \beta $ to zero gives the normal equations $ X^T X \beta = X^T y $. When $ X^T X $ is invertible, the minimizer is:

\begin{equation} \hat{\beta} = (X^T X)^{-1} X^T y \end{equation}

Where:
- $ X^T $ is the transpose of the matrix $ X $.
- $ (X^T X)^{-1} $ is the inverse of the matrix $ X^T X $.
- $ X^T y $ is the product of the matrix $ X^T $ and the vector $ y $.

6. **Use the Optimal Coefficients for Prediction**:

Once you have the optimal coefficients $ \hat{\beta} $, you can use them to predict new values of $ y $ based on new values of $ X $:

\begin{equation} \hat{y}_{\text{new}} = X_{\text{new}} \hat{\beta} \end{equation}

Where $ X_{\text{new}} $ is the matrix of new predictor variables.
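The matrix steps above can be sketched with NumPy as follows. The data are synthetic (generated with an assumed intercept of 4.0 and slope of -0.5 plus noise), and the variable names are illustrative. The explicit inverse mirrors the formula for $\hat{\beta}$, while `np.linalg.lstsq` solves the same least-squares problem without forming $(X^T X)^{-1}$, which is generally the more numerically stable choice.

```python
import numpy as np

# Synthetic data: y = 4.0 - 0.5 * x + noise (illustrative only)
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=30)
y = 4.0 - 0.5 * x + rng.normal(scale=0.3, size=30)

# Design matrix with a column of ones for the intercept
X = np.column_stack((np.ones_like(x), x))

# Textbook formula: beta_hat = (X^T X)^{-1} X^T y
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same estimate from a least-squares solver that avoids the explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict for two new predictor values; X_new needs the same column of ones
X_new = np.column_stack((np.ones(2), [2.0, 8.0]))
y_new = X_new @ beta_lstsq

print("Inverse formula:", beta_inv)
print("lstsq solver:   ", beta_lstsq)
print("Predictions:    ", y_new)
```

Both approaches give the same coefficients up to floating-point error on a well-conditioned problem like this one.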

## Example: Boston House-Price Data

<font color='Blue'><b>Example</b></font>. To illustrate these concepts, we will employ the Boston Housing dataset, a staple of regression analysis that can be accessed from the link: http://lib.stat.cmu.edu/datasets/boston. Scikit-learn, which is widely used in machine learning and data science, offers a range of tools for this kind of analysis. The dataset contains information about various features related to housing prices in different neighborhoods in Boston.

For our purposes, we will focus exclusively on two variables: `LSTAT` and `MEDV`.


| Variable |                               Description                               |
|:--------:|:-----------------------------------------------------------------------:|
|   LSTAT  |                     \% lower status of the population                   |
|   MEDV   |             Median value of owner-occupied homes in \$1000's            |

In [None]:
import numpy as np
import pandas as pd

_url = "http://lib.stat.cmu.edu/datasets/boston"
columns = 12 *['_'] + ['LSTAT', 'MEDV']

# sep=r'\s+' splits on whitespace (delim_whitespace is deprecated in recent pandas)
Boston = pd.read_csv(filepath_or_buffer= _url, sep=r'\s+', skiprows=21,
                 header=None)

#Flatten all the values into a single long list and remove the nulls
values_w_nulls = Boston.values.flatten()
all_values = values_w_nulls[~np.isnan(values_w_nulls)]

#Reshape the values to have 14 columns and make a new df out of them
Boston = pd.DataFrame(data = all_values.reshape(-1, len(columns)),
                      columns = columns)
Boston = Boston.drop(columns=['_'])
display(Boston)

To start, we will fit a simple linear regression model with `MEDV` as the response variable and `LSTAT` as the single predictor. We first compute the fit by hand and then repeat it with the `sm.OLS()` function from the `statsmodels` library.

### Loading the Data

With the Boston Housing dataset loaded, we select `LSTAT` as the predictor and `MEDV` as the response.

In [None]:
# Select predictor and response variables
X = Boston['LSTAT'].values
y = Boston['MEDV'].values

### Creating the Design Matrix

After selecting the `LSTAT` variable as the predictor, you add a constant term to create the design matrix `X`.

\begin{equation} X = \begin{bmatrix}
1 & \text{LSTAT}_1 \\
1 & \text{LSTAT}_2 \\
\vdots & \vdots \\
1 & \text{LSTAT}_n \\
\end{bmatrix} \end{equation}

Here, $ n $ is the number of observations in the dataset.

In [None]:
# Add a column of ones to the predictor matrix for the intercept term
X_with_intercept = np.column_stack((np.ones(len(X)), X))
print(X_with_intercept)

### Fitting the Linear Regression Model

We fit the simple linear regression model: $ y = \beta_0 + \beta_1 \cdot \text{LSTAT} $, where $ y $ is the response variable `MEDV` and $\text{LSTAT}$ is the predictor variable.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.style.use('https://raw.githubusercontent.com/HatefDastour/ENGG_680/main/Files/mystyle.mplstyle')

def my_linear_regression(X, y):
    """
    Fit a simple linear regression and return coefficients, predicted values, and residuals.

    Parameters:
    X (numpy.ndarray): Design matrix with shape (n, 2): a column of ones
        followed by the single predictor.
    y (numpy.ndarray): Target vector with shape (n,).

    Returns:
    float: Intercept of the linear regression model.
    float: Slope for the predictor.
    numpy.ndarray: Array of predicted values.
    numpy.ndarray: Array of residuals.
    """

    # Solve the normal equations (X^T X) beta = X^T y without forming the inverse
    coeff_matrix = np.linalg.solve(X.T @ X, X.T @ y)

    # Extract coefficients (this unpacking assumes exactly two columns in X)
    intercept, slope = coeff_matrix

    # Calculate the predicted values
    y_pred = X @ coeff_matrix

    # Calculate the residuals
    residuals = y - y_pred

    return intercept, slope, y_pred, residuals

intercept, slope, y_pred, residuals = my_linear_regression(X_with_intercept, y)

fig, ax = plt.subplots(1, 2, figsize=(9.5, 4.5), sharey = False)
ax = ax.ravel()
# regplot
_ = sns.scatterplot(x = X, y = y, ax = ax[0],
                    fc= 'SkyBlue', ec = 'k', s = 50, label='Original Data')
_ = ax[0].set(xlim = [-10, 50], ylim = [0, 60], title = 'Linear Regression')
xlim = ax[0].get_xlim()
ylim = [slope * xlim[0] + intercept, slope * xlim[1] + intercept]
_ = ax[0].plot(xlim, ylim, linestyle = 'dashed', color = 'red', lw = 3, label='Fitted Line')
_ = ax[0].legend()

# residuals
_ = ax[1].scatter(y_pred, residuals, fc='none', ec = 'DarkRed', s = 50)
_ = ax[1].axhline(0, c='k', ls='--')
_ = ax[1].set(aspect = 'auto',
              xlabel = 'Fitted value',
              ylabel = 'Residual',
              xlim = [0, 40], ylim = [-20, 30], title = 'Residuals for Linear Regression')
plt.tight_layout()

# Print the coefficients
print(f'Intercept: {intercept:.8f}')
print(f'Slope: {slope:.8f}')

### Modeling through statsmodels api

Now, let's summarize the steps above and repeat the fit through the [statsmodels API](https://www.statsmodels.org/stable/index.html) [Seabold and Perktold, 2010]:

In [None]:
import statsmodels.api as sm

# Create the model matrix manually
X = Boston['LSTAT']  # Predictor variable (lstat)
# Response variable (medv)
y = Boston['MEDV']

X = sm.add_constant(X)  # Add an intercept term
display(X)

# Fit the simple linear regression model
Results = sm.OLS(y, X).fit()

# Print the model summary
print(Results.summary())

The output is from the summary of a fitted linear regression model, which provides valuable information about the model's coefficients, their standard errors, t-values, p-values, and confidence intervals. Let's break down each component:

1. `const` and `LSTAT`: These are the variable names in the model. `const` represents the intercept (constant term), and `LSTAT` represents the predictor variable.

2. `coef`: This column displays the estimated coefficients of the linear regression model. For `const`, the estimated coefficient is 34.5538, and for `LSTAT`, it is -0.9500.

3. `std err`: This column represents the standard errors of the coefficient estimates. It quantifies the uncertainty in the estimated coefficients. For `const`, the standard error is 0.563, and for `LSTAT`, it is 0.039.

4. `t`: The t-values are obtained by dividing the coefficient estimates by their standard errors. The t-value measures how many standard errors the coefficient estimate is away from zero. For `const`, the t-value is 61.415, and for `LSTAT`, it is -24.528.

5. `P>|t|`: This column provides the p-values associated with the t-values. The p-value is the probability of observing a t-value at least as extreme as the one calculated, assuming the null hypothesis that the coefficient is equal to zero. Lower p-values indicate stronger evidence against the null hypothesis. In this case, both coefficients have extremely low p-values (close to 0), indicating that they are statistically significant.

6. `[0.025 0.975]`: These are the lower and upper bounds of the 95% confidence intervals for the coefficients. The confidence intervals provide a range of plausible values for the true population coefficients. For `const`, the confidence interval is [33.448, 35.659], and for `LSTAT`, it is [-1.026, -0.874].

The coefficient estimates indicate how the response variable (`MEDV`) is expected to change for a one-unit increase in the predictor variable (`LSTAT`): each additional percentage point of `LSTAT` is associated with a decrease of about 0.95 in `MEDV`, that is, roughly \$950 in median home value. The low p-values and the confidence intervals that exclude zero indicate that both the intercept and the `LSTAT` coefficient are statistically significant.
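To make the `std err` and `t` columns more concrete, the sketch below reproduces them by hand with NumPy, assuming the classical OLS formulas with constant error variance: $\hat{\sigma}^2 = \text{RSS}/(n - k)$, where $k$ is the number of columns of the design matrix, and the standard errors are the square roots of the diagonal of $\hat{\sigma}^2 (X^T X)^{-1}$. The data are synthetic, not the Boston data, and the variable names are illustrative.

```python
import numpy as np

# Synthetic data: y = 30 - 1.0 * x + noise (illustrative only)
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0.0, 30.0, size=n)
y = 30.0 - 1.0 * x + rng.normal(scale=5.0, size=n)

# Design matrix and least-squares estimate
X = np.column_stack((np.ones(n), x))
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Estimate the error variance from the residuals
residuals = y - X @ beta_hat
dof = n - X.shape[1]                      # degrees of freedom: n minus number of coefficients
sigma2_hat = residuals @ residuals / dof

# Standard errors and t-values of the coefficients
std_err = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_values = beta_hat / std_err

for name, b, se, t in zip(["const", "x"], beta_hat, std_err, t_values):
    print(f"{name:>5}: coef = {b:8.4f}, std err = {se:.4f}, t = {t:8.3f}")
```

Each t-value is simply the coefficient divided by its standard error, which is how the `t` column in the summary table relates to the `coef` and `std err` columns.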

In [None]:
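from io import StringIO

# OLS regression results
def Reg_Result(Inp):
    # Wrap the HTML in StringIO; recent pandas versions deprecate passing literal HTML strings
    Temp = pd.read_html(StringIO(Inp.summary().tables[1].as_html()), header=0, index_col=0)[0]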
    display(Temp.style\
    .format({'coef': '{:.4e}', 'P>|t|': '{:.4e}', 'std err': '{:.4e}'})\
    .bar(subset=['coef'], align='mid', color='Lime')\
    .set_properties(subset=['std err'], **{'background-color': 'DimGray', 'color': 'White'}))

Reg_Result(Results)

## Linear Regression through sklearn API

One can also run a simple linear regression with the `LinearRegression` model available in the `sklearn` library [scikit-learn Developers, 2023], as follows:

In [None]:
from sklearn.linear_model import LinearRegression
# Prepare the predictor and response variables
X = Boston[['LSTAT']]  # Predictor variable (lstat)
y = Boston['MEDV']  # Response variable (medv)

# Create and fit the Linear Regression model
reg = LinearRegression()
_ = reg.fit(X, y)

# Print the model coefficients and intercept
print(f"Intercept (const): {reg.intercept_}")
print(f"Coefficient (LSTAT): {reg.coef_[0]}")

In this code, we use `sklearn` to perform the simple linear regression. Using the dataset loaded earlier, we prepare the predictor variable `X` (`LSTAT`) and the response variable `y` (`MEDV`), create an instance of the `LinearRegression` model, and fit it to the data using `reg.fit(X, y)`.

After fitting the model, `reg.coef_` gives the coefficient for the predictor variable `LSTAT`, and `reg.intercept_` gives the intercept (constant term). These values should match the ones we got from the `statsmodels` library earlier.

<font color='Blue'><b>Example:</b></font>
The dataset used in this example comprises monthly mean temperature data for 'CALGARY INT'L A'. Our objective is to identify linear trendlines for each of the 12 months, spanning the period from 1881 to 2012.

In [None]:
import pandas as pd
Link = 'https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=2205&Year=2000&Month=1&Day=1&time=&timeframe=3&submit=Download+Data'

df = pd.read_csv(Link, usecols = ['Date/Time', 'Year', 'Month' , 'Mean Temp (°C)'])
df['Date/Time'] = pd.to_datetime(df['Date/Time'])
df = df.rename(columns = {'Date/Time':'Date'})
display(df)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import calendar

def _regplot(ax, df_sub, X, reg, *args, **kwargs):
    '''
    Plots the regression line on the given axis.

    Parameters:
    - ax: matplotlib axis
        The axis to plot on.
    - df_sub: pandas DataFrame
        Subset of data for a specific month.
    - X: numpy array
        Independent variable (index values).
    - reg: regression model
        The regression model used for prediction.
    - *args, **kwargs: additional plot arguments
        Additional arguments for the plot function.
    '''
    y_hat = reg.predict(X)
    ax.plot(df_sub.Date, y_hat, *args, **kwargs)
    ax.set_title(f'Slope = {reg.coef_.ravel()[0]:.3}', fontsize=14)

fig, axes = plt.subplots(12, 1, figsize=(9.5, 20), sharex=True)
for m, ax in enumerate(axes, start=1):
    # Subset data for a specific month
    df_sub = df.loc[df.Month == m].reset_index(drop=True)

    # Scatter plot of the monthly mean temperature
    ax.scatter(df_sub['Date'], df_sub['Mean Temp (°C)'],
               fc='Blue', ec='k', s=15, label = 'Mean Temp (°C)')

    # Set axis label as the month name
    ax.set_ylabel(calendar.month_name[m], weight='bold')
    ax.grid(True)

    # Remove rows with missing values
    df_sub = df_sub.dropna()

    # Prepare data for linear regression
    X = df_sub.index.values.reshape(-1, 1)
    y = df_sub['Mean Temp (°C)'].values.reshape(-1, 1)

    # Create and fit a Linear Regression model
    reg = LinearRegression()
    _ = reg.fit(X, y)

    # Plot the fitted line
    _regplot(ax=ax, df_sub=df_sub, X=X, reg=reg,
             linestyle='dashed', color='red', lw=1.5, label='Fitted Line')
    ax.legend()

    # Set four y-ticks
    yticks = np.linspace(df_sub['Mean Temp (°C)'].min()-3, df_sub['Mean Temp (°C)'].max()+3, 4)
    yticks = np.round(yticks, 1)
    ax.set_yticks(yticks)

fig.suptitle("""Monthly Mean Temperature (°C) Trends at CALGARY INT'L A: Linear Analysis""", y = 0.99,
            weight = 'bold', fontsize = 16)
plt.tight_layout()

---
<font color='Red'><b>Note:</b></font>

Please note that all the slopes shown in the figure above are rounded to three significant digits.

---
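The rounding comes from the format specification used in the plot titles. As a brief sketch, a precision without a type code (`.3`) keeps three significant digits, whereas `.3f` keeps three digits after the decimal point. The value below is the January slope reported later in this section.

```python
slope = 0.028849  # January slope reported later in this section

# Precision without a type code: three significant digits
three_significant = f'{slope:.3}'

# Fixed-point format: three digits after the decimal point
three_decimals = f'{slope:.3f}'

print(three_significant)  # 0.0288
print(three_decimals)     # 0.029
```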

## Analyzing Long-Term Trends

When analyzing trends in data, it's important to ask whether an observed trendline is statistically significant, and p-values play a crucial role in answering that question. A p-value is a statistical measure that helps determine the significance of a trend or relationship observed in a dataset. For the mean monthly temperature trendlines at Calgary International Airport, this means performing a linear regression analysis for each month and calculating the p-value of each trendline's slope.

**Note on P-Values and Significance of Trendlines:**

In the analysis of the mean monthly temperature trends at Calgary International Airport, we have calculated the slopes of linear trendlines to understand how temperatures are changing over time. However, to determine whether these observed trends are statistically significant, we need to consider p-values.

- **P-Values**: A p-value measures the probability of obtaining the observed trendlines (or more extreme ones) if there were no actual trends in the data. Lower p-values indicate stronger evidence against the null hypothesis, suggesting that the observed trend is not due to random chance.

- **Significance**: To assess the significance of these trendlines, we typically perform linear regression analysis. In this analysis, each month's temperature data is regressed against time (the independent variable). The p-value associated with each trendline quantifies the likelihood of observing the reported trend under the assumption that there is no true linear relationship between time and temperature.

- **Interpretation**: A low p-value (typically, below a significance level like 0.05) suggests that the trend is statistically significant. In other words, it provides evidence that the temperature changes observed for that month are not just random fluctuations.

- **Caution**: It's important to be cautious when interpreting p-values. While a low p-value indicates statistical significance, it does not by itself establish the practical or real-world significance of the trend. Additionally, trends can be affected by many factors, and further analysis is needed to understand the underlying causes of temperature changes.

Several Python packages can calculate p-values for linear regression. Commonly used ones include:
* StatsModels (the `pvalues` attribute of a fitted `OLS` model)
* SciPy (`scipy.stats.linregress`)

<font color='Blue'><b>Example:</b></font>

In [None]:
import matplotlib.pyplot as plt
from scipy import stats
import calendar

slope_list = []
month_list = []
p_val_list = []

for m in range(1, 13):
    df_sub = df.loc[df.Month == m].reset_index(drop=True).dropna()
    X = df_sub.index.values
    y = df_sub['Mean Temp (°C)'].values
    slope, _, _, p_value, _ = stats.linregress(X, y)
    slope_list.append(slope)
    month_list.append(calendar.month_name[m])
    p_val_list.append(p_value)
    del df_sub, X, y, slope, p_value

df_results = pd.DataFrame({'Slope':slope_list, 'p-value':p_val_list}, index = month_list)

The following table reports the slopes of the linear trendlines for the mean monthly temperature at the Calgary International Airport weather station from January 1881 to July 2012, together with p-values for assessing the significance of these trends.

In [None]:
display(df_results)

Each slope corresponds to a specific month and represents the rate of change in temperature over time, expressed in degrees Celsius per year. Below, each slope is interpreted and its significance is assessed at the 90%, 95%, and 99% confidence levels, that is, by comparing its p-value with 0.10, 0.05, and 0.01.

1. **January (0.028849):** The positive slope indicates an increasing trend in mean temperatures in January over the years. With a p-value of 0.019986, this trend is significant at the 90% and 95% confidence levels, but not at the 99% level.

2. **February (0.036869):** This positive slope suggests a significant increase in mean temperatures in February. The very low p-value (0.000755) indicates significance even at a 99% confidence level.

3. **March (0.018809):** The positive slope implies a gradual rise in March temperatures. This trend is significant at the 90% and 95% confidence levels (p-value: 0.025253), but not at the 99% level.

4. **April (-0.003040):** The negative slope suggests a slight decrease in mean temperatures in April, although it is not statistically significant at any of the three confidence levels. The p-value is relatively high (0.601981).

5. **May (-0.000235):** Similarly, the small negative slope in May temperatures is not statistically significant at any confidence level (p-value: 0.948623).

6. **June (0.002036):** The positive slope indicates a minor increase in June temperatures, but it's not statistically significant at any confidence level (p-value: 0.517881).

7. **July (0.004418):** This positive slope suggests a small increase in July temperatures, but the p-value (0.142400) exceeds 0.10, so the trend is not statistically significant even at a 90% confidence level.

8. **August (0.005177):** Like July, August shows a slight increase in temperatures that is not statistically significant at a 90% confidence level (p-value: 0.151559).

9. **September (0.008547):** The positive slope for September indicates a somewhat larger increase in temperatures. This trend is significant at a 90% confidence level (p-value: 0.067413), but not at the 95% level.

10. **October (-0.001246):** The small negative slope in October temperatures is not statistically significant at any of the confidence levels (p-value: 0.796791).

11. **November (0.003791):** This positive slope suggests a slight increase in November temperatures, but it is not statistically significant at any confidence level (p-value: 0.684526).

12. **December (0.000463):** The positive slope for December indicates a very slight increase in temperatures, but it's not statistically significant at any confidence level (p-value: 0.962145).

These slopes provide insights into the long-term temperature trends for each month at the Calgary International Airport weather station. Positive slopes indicate warming trends, negative slopes indicate cooling trends, and the magnitude of the slope reflects the rate of change. Only the January, February, and March trends (and, at the 90% level, the September trend) are statistically distinguishable from zero.
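The comparison of each p-value with the thresholds 0.10, 0.05, and 0.01 can be sketched as follows. The p-values are the ones listed above; the helper name `strictest_level` is an assumption made for illustration. For each month it reports the strictest of the 90%, 95%, and 99% confidence levels at which the trend is significant, or `None` if the trend is not significant at any of them.

```python
# p-values reported above for each month's trendline
p_values = {
    'January': 0.019986, 'February': 0.000755, 'March': 0.025253,
    'April': 0.601981, 'May': 0.948623, 'June': 0.517881,
    'July': 0.142400, 'August': 0.151559, 'September': 0.067413,
    'October': 0.796791, 'November': 0.684526, 'December': 0.962145,
}

# Confidence levels paired with their significance thresholds, strictest first
LEVELS = (('99%', 0.01), ('95%', 0.05), ('90%', 0.10))

def strictest_level(p_value):
    """Return the strictest confidence level at which p_value is significant, or None."""
    for label, alpha in LEVELS:
        if p_value < alpha:
            return label
    return None

significance = {month: strictest_level(p) for month, p in p_values.items()}

for month, level in significance.items():
    print(f"{month:>9}: {level if level else 'not significant'}")
```

A trend that is significant at a given level is also significant at every less strict level, so a month labelled 95% is significant at 90% as well.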