# Regression Analysis:
Regression analysis is a statistical method used to examine the relationship between one or more independent variables and a dependent variable. In simple terms, it helps us understand how the value of the dependent variable changes when one or more independent variables are varied.

## Linear Regression:
Linear regression is a fundamental technique in regression analysis, particularly when there is a linear relationship between the independent and dependent variables. The basic premise of linear regression is to find the best-fitting straight line that describes the relationship between the independent variable(s) and the dependent variable.

Consider a simple linear equation where we have one independent variable (x) and one dependent variable (y). The equation of the line can be represented as:

$$y = b + m x$$


Here,  

- $y$ represents the dependent variable.  
- $x$ represents the independent variable.  
- $m$ is the slope of the line, indicating the rate of change of $y$ with respect to $x$.  
- $b$ is the intercept or bias term, representing the value of $y$ when $x$ is zero.  

#### How it Works:
In linear regression, we aim to minimize the difference between the observed values of the dependent variable and the values predicted by the linear model. This difference is often referred to as the error or residual. By adjusting the parameters (slope and intercept) of the linear model, we strive to minimize this error.  

We achieve this by fitting a line through the data points and then optimizing the model parameters to minimize the overall error. This optimization process typically involves techniques such as the method of least squares or gradient descent.

![](./imgs/linear-regression.png)
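
As a rough illustration of the least-squares idea described above, the sketch below fits $y = b + m x$ using the standard closed-form expressions for the slope and intercept. The helper name `fit_line` and the toy data are assumptions made for this example only.

```python
import numpy as np

# Hypothetical toy data that lies exactly on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

def fit_line(x, y):
    """Return (b, m) minimizing the sum of squared residuals of y = b + m*x."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    b = y_mean - m * x_mean
    return b, m

b, m = fit_line(x, y)
residuals = y - (b + m * x)   # observed minus predicted values
print(b, m)                   # intercept 1.0 and slope 2.0 for this data
```

Because the toy data has no noise, all residuals are zero; with real data the same formulas give the line with the smallest total squared residual.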

## Multivariable Regression:
In real-world scenarios, the relationship between the dependent variable and independent variables may not be as simple as in the case of simple linear regression. Multivariable regression allows us to model situations where the dependent variable is influenced by multiple independent variables.

The equation for multivariable regression can be expressed as:

$$ y = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \ldots + \theta_nx_n $$

Here,

- $\theta_0$ represents the intercept term.
- $\theta_1, \theta_2, \ldots, \theta_n$ are the coefficients associated with each independent variable.
- $x_1, x_2, \ldots, x_n$ represent the independent variables.

## Assumptions in Regression Analysis:

1. **Linearity Assumption:**
   
    It is assumed that there exists a linear relationship between the independent variables and the dependent variable. This implies that a unit change in an independent variable changes the dependent variable by a constant amount, whatever the variable's starting value. To check this assumption, one common method is a pair plot, which shows the relationship between each independent variable and the dependent variable. If the relationships appear roughly linear, the assumption is considered reasonable.
    
    
2. **No Multicollinearity:**

    Multicollinearity occurs when independent variables in a regression model are highly correlated with each other. This can make the model coefficients unstable and hard to interpret. To detect multicollinearity, scatter plots can be used to visualize the relationships between pairs of independent variables. Additionally, the Variance Inflation Factor (VIF) statistic is employed. The cut-offs are rules of thumb and vary between sources: a VIF of 4 or less is commonly read as no significant multicollinearity, while a VIF of 5 or more (some sources use 10) is commonly read as serious multicollinearity.
    
    
3. **Homoskedasticity:**

    Homoskedasticity refers to the assumption that the error terms have constant variance across all levels of the independent variables. In other words, the spread of the residuals should remain consistent as the predicted values change. This assumption can be examined by plotting the residuals against the predicted values. A homoskedastic plot would show a uniform spread of points around the horizontal line at zero, without any discernible pattern or funnel shape.
    
    
4. **No Autocorrelation:**

    Autocorrelation occurs when the residuals of the regression model are correlated with one another, which is most common in time-ordered data. This means the error terms are not independent of each other, violating an assumption of regression analysis. A common way to detect autocorrelation is to plot the residuals against time or observation order: if the residuals appear randomly scattered around zero, with no runs or cycles, the assumption of no autocorrelation is plausible. The Durbin-Watson statistic is a common numerical check, with values near 2 indicating little first-order autocorrelation.
    
    
5. **Normality of Error Terms:**

    It is assumed that the error terms in the regression model are normally distributed. This assumption can be assessed using a Q-Q plot (Quantile-Quantile plot), which compares the distribution of the residuals to a theoretical normal distribution. A linear relationship in the Q-Q plot suggests that the residuals follow a normal distribution, validating this assumption. If the Q-Q plot deviates significantly from linearity, it indicates a departure from normality in the error terms.
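
The multicollinearity check from the list above can be sketched in a few lines. This is a minimal example, assuming the usual definition $\text{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the remaining predictors. The function name `vif` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (samples x predictors)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    factors = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the other predictors plus an intercept.
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = np.sum(residual ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Synthetic predictors: x3 is almost a copy of x1, x2 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)   # large values for x1 and x3, a value near 1 for x2
```

In this setup the two nearly duplicated predictors get very large VIF values, while the independent one stays close to 1.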



## Regression Analysis: A Three-Stage Process

Regression analysis is commonly employed in various fields, including economics, finance, and social sciences, to understand the nature and strength of the associations between variables. The process of regression analysis typically involves three main stages:

1. **Correlation and Directionality Analysis**: 
   In this stage, the focus is on exploring the correlation between the independent and dependent variables and determining the directionality of the relationship. Correlation analysis measures the strength and direction of the linear relationship between variables, ranging from -1 to +1. A positive correlation indicates that as one variable increases, the other variable also tends to increase, while a negative correlation suggests an inverse relationship.

2. **Model Estimation (Fitting the Line)**:
   Once the correlation and directionality have been assessed, the next step involves estimating the regression model. This entails fitting a line or curve to the data points that best represents the relationship between the independent and dependent variables. The objective is to find the line that minimizes the difference between the observed data points and the predicted values generated by the model. This process typically employs various regression techniques, such as linear regression, logistic regression, or polynomial regression, depending on the nature of the data and the research question.

3. **Model Evaluation**:
   The final stage of regression analysis is evaluating the validity and usefulness of the model. This involves assessing how well the model fits the data and whether it provides meaningful insights into the relationship between the variables. Common measures used for model evaluation include R-squared (coefficient of determination), which indicates the proportion of the variance in the dependent variable that is explained by the independent variables, and hypothesis testing to determine the statistical significance of the coefficients. Additionally, diagnostic tests such as residual analysis and multicollinearity checks are conducted to identify any potential issues or violations of the regression assumptions.


## Covariance and Correlation Analysis

Covariance and correlation analysis are fundamental statistical tools used to quantify the relationship between two variables. They provide insights into the direction and strength of their association, aiding in understanding patterns and dependencies in data.

#### Covariance:

Covariance measures the directional relationship between two variables. It indicates whether the variables tend to move in the same direction (positive covariance) or opposite directions (negative covariance). Mathematically, covariance between two variables $x$ and $y$ is calculated as:

$$\text{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{n}$$

Where:
- $n$ represents the number of observations.
- $x_i$ and $y_i$ denote individual data points of variables $x$ and $y$, respectively.
- $\overline{x}$ and $\overline{y}$ are the means of variables $x$ and $y$, respectively.


This formula computes the average of the products of each pair's deviations from their respective means, providing insight into the joint variability of the variables. The sign of the covariance gives the direction of the relationship, but its magnitude depends on the units of $x$ and $y$, so it is hard to interpret on its own.
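
A small sketch of the covariance formula, dividing by $n$ as in the equation above. The function name `covariance` and the sample values are assumptions for illustration. Note that `np.cov` divides by $n - 1$ by default, so `bias=True` is passed to match the formula used here.

```python
import numpy as np

def covariance(x, y):
    """Population covariance: mean product of deviations from the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / len(x)

# Hypothetical sample values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = covariance(x, y)
cov_np = np.cov(x, y, bias=True)[0, 1]   # bias=True divides by n, not n - 1
print(cov_xy, cov_np)                    # both are 1.2 (up to rounding)
```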

#### Correlation:

Correlation analysis extends covariance by standardizing the measure to a range between -1 and +1, irrespective of the scale of the variables. It not only indicates the direction of the relationship but also quantifies its strength. The correlation coefficient ($\rho$) is calculated as:

$$ \rho = \frac{\text{Cov}(x, y)}{\sigma_x \sigma_y} $$

Where:
- $\sigma_x$ and $\sigma_y$ are the standard deviations of variables $x$ and $y$ respectively.

The correlation coefficient ranges from -1 to +1:
- $\rho = 1$ implies a perfect positive linear relationship.
- $\rho = -1$ implies a perfect negative linear relationship.
- $\rho = 0$ implies no linear relationship between the variables.

Correlation analysis provides valuable insights into the nature and strength of associations between variables, aiding in decision-making processes across various domains.
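
The correlation coefficient can be sketched directly from its definition. The function name `correlation` and the sample values below are assumptions for illustration. `np.std` uses the population standard deviation by default, which matches a covariance that divides by $n$.

```python
import numpy as np

def correlation(x, y):
    """Correlation coefficient: covariance divided by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())   # np.std defaults to ddof=0

# Hypothetical sample values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

rho = correlation(x, y)
rho_np = np.corrcoef(x, y)[0, 1]       # NumPy's built-in equivalent

rho_up = correlation(x, 2 * x + 1)     # perfect positive linear relationship
rho_down = correlation(x, -x)          # perfect negative linear relationship
print(rho, rho_up, rho_down)
```

Rescaling either variable (for example, changing its units) leaves `rho` unchanged, which is the practical advantage over covariance.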

## Ordinary Least Squares Method for Parameter Estimation

In the context of parameter estimation, let's consider $\theta_0, \theta_1, ..., \theta_n$ as weight parameters and $x_0, x_1, x_2, ..., x_n$ as independent variables. (Note: $\theta_0$ can be interpreted as bias, and $x_0 = 1$).

**Parameters Representation:**
- $\theta$ as a column matrix:
  $ \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}_{(n+1, 1)} $
- $X$ as a matrix of independent variables:
  $ X = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,n} \\ 1 & x_{2,1} & \cdots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m,1} & \cdots & x_{m,n} \end{bmatrix}_{(m , n+1)} $

**Dependent Variable Representation:**
- $Y$ as a column matrix:
  $ Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}_{(m , 1)} $

We represent the linear relationship between the independent variables $x_i$ and the dependent variable $Y$ as:

$$Y = X \cdot \theta$$

Here, $\theta$ denotes the parameter vector, and $X$ represents the design matrix comprising the independent variables.

In reality, it might not be possible to find a value of $\theta$ that satisfies the equation exactly for each of the $m$ observations. Hence, the aim is to find $\theta$ such that $X \cdot \theta$ is as close to $Y$ as possible.

We replace $Y$ by $Y_p$, which denotes the predicted value of $Y$:

$$Y_p = X \cdot \theta$$

Now we need to find $\theta$ such that $Y_p$ is as close to $Y$ as possible, i.e., $(Y - Y_p)$ should be as close to $0$ as possible, judged across all the rows of data. A value of $\theta$ that reduces the error for one row may increase it for another, so we consider the total (or average) error across all the rows.  
If we simply added up the errors, a large positive error on one row could be cancelled by a large negative error on another. To avoid this, we use the square of each error, which is never negative, so errors from different rows cannot cancel each other.  


Since $Y$ and $Y_p$ are both column matrices, $Y - Y_p$ is also a column matrix. The product $(Y - Y_p)^T (Y - Y_p)$, where $(Y - Y_p)^T$ is the transpose of $(Y - Y_p)$, is a single number equal to the sum of the squares of the elements of $(Y - Y_p)$.  
So the requirement is now to minimize $(Y - Y_p)^T (Y - Y_p)$.  

Since $Y_p = X \cdot \theta$, the above requirement can be rewritten as minimize $(Y - X \cdot \theta)^T (Y - X \cdot \theta)$.

The loss function or the cost function, which quantifies the error between the predicted values and the actual values, is now given by:

$$J(\theta) = (Y - X \cdot \theta)^T \cdot (Y - X \cdot \theta)$$


Differential calculus tells us that at the minimum of a smooth function, the slope is 0. Using matrix differential calculus, the derivative of $J(\theta)$ with respect to $\theta$ is $2X^TX \cdot \theta - 2X^T Y$. Setting it to zero gives:

$$2X^TX\theta - 2X^TY = 0$$
$$X^TX\theta = X^TY$$
$$\theta = (X^TX)^{-1} X^TY$$
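
The gradient used above can be verified with a short expansion. This is a sketch that relies on two standard matrix-calculus identities, $\frac{\partial}{\partial \theta}(\theta^T a) = a$ and $\frac{\partial}{\partial \theta}(\theta^T A \theta) = 2A\theta$ for symmetric $A$, and on the fact that the scalar $Y^T X \theta$ equals its own transpose $\theta^T X^T Y$:

$$J(\theta) = Y^TY - 2\theta^TX^TY + \theta^TX^TX\theta$$

$$\frac{\partial J(\theta)}{\partial \theta} = -2X^TY + 2X^TX\theta$$

Since $X^TX$ is symmetric, the second identity applies directly, and the result matches the expression set to zero above.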



The parameters $\theta$ that minimize this loss function therefore satisfy the normal equations $X^T X \cdot \theta = X^T \cdot Y$, whose closed-form solution is $\theta = (X^T X)^{-1} \cdot X^T \cdot Y$.  
However, there are situations where the matrix $X^T X$ is not invertible (for example, when independent variables are perfectly collinear), or where the computational cost of inverting it is prohibitive because the number of independent variables is large. In such cases, an alternative method known as Gradient Descent is employed for parameter estimation.
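
The closed-form solution can be sketched with NumPy as follows. The synthetic data, the variable names, and the "true" coefficients are assumptions for illustration. The example solves the linear system $X^TX\theta = X^TY$ with `np.linalg.solve` rather than forming the inverse explicitly, which is generally the more numerically stable choice, and compares the result with `np.linalg.lstsq`, which also copes with a rank-deficient $X$.

```python
import numpy as np

# Synthetic data: m observations, n independent variables (illustrative values).
rng = np.random.default_rng(42)
m, n = 100, 3
features = rng.normal(size=(m, n))
X = np.column_stack([np.ones(m), features])    # first column is x_0 = 1
true_theta = np.array([4.0, 2.0, -1.0, 0.5])   # hypothetical ground truth
Y = X @ true_theta + 0.1 * rng.normal(size=m)  # add a little noise

# Solve the normal equations X^T X theta = X^T Y.
theta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Least-squares solver that does not require X^T X to be invertible.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(theta_normal)   # close to true_theta, up to the added noise
```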

## Gradient Descent for Parameter Optimization

Gradient descent is a fundamental optimization algorithm used in machine learning and optimization tasks to minimize a function iteratively. The objective is to find the optimal parameters of a model that minimize a given cost or loss function. In the context of machine learning, these parameters are often weights or coefficients associated with features in a predictive model.

#### Overview

Gradient descent operates by iteratively adjusting the parameters in the direction of the steepest descent of the loss function. The process continues until a minimum of the loss function is reached (or closely approached), indicating convergence. For convex loss functions, such as the squared-error loss of linear regression, every local minimum is also a global minimum, so gradient descent cannot get trapped in a worse local minimum.

#### Mathematical Representation

The update rule for each parameter $ \theta_i $ in gradient descent is given by:

$$ \theta_i = \theta_i - \alpha \frac{\partial J(\theta)}{\partial \theta_i} $$

Where:
- $ \theta_i $ is the $ i^{th} $ parameter to be updated.
- $ \alpha $ is the learning rate, determining the size of the step taken in each iteration.
- $ J(\theta) $ is the loss function or cost function being minimized.

#### Key Components

- **Learning Rate ( $ \alpha $ ):** The learning rate is a hyperparameter that controls the size of the steps taken during optimization. It influences the convergence speed and the stability of the algorithm. A larger learning rate may make the algorithm converge faster but risks overshooting the minimum or even diverging, while a smaller learning rate leads to slower convergence but greater stability.

- **Partial Derivative ( $ \frac{\partial J(\theta)}{\partial \theta_i} $ ):** This term represents the rate of change of the loss function with respect to the $ i^{th} $ parameter $ \theta_i $. It indicates the direction in which the parameter should be adjusted to decrease the loss. Calculating this derivative requires knowledge of the specific form of the loss function being minimized.

#### Iterative Process

1. **Initialization:** The process begins by initializing the parameters $ \theta $ to some initial values. These initial values can be chosen randomly or set to some predefined values.

2. **Gradient Computation:** In each iteration, the partial derivatives of the loss function with respect to each parameter $ \theta_i $ are computed. This gradient represents the direction of steepest ascent.

3. **Parameter Update:** Using the computed gradients, the parameters are updated according to the update rule given earlier. The learning rate determines the size of the step taken in the direction of the negative gradient. **Note:** All the parameters are updated simultaneously. Storing the new values in temporary variables ensures that every partial derivative $\frac{\partial J(\theta)}{\partial \theta_i}$ is evaluated at the parameter values from the start of the iteration, so that no gradient is affected by an update made earlier in the same iteration.

$$\text{temp}_{\theta_0} = \theta_0 - \alpha \frac{\partial J(\theta)}{\partial \theta_0}$$

$$\text{temp}_{\theta_1} = \theta_1 - \alpha \frac{\partial J(\theta)}{\partial \theta_1}$$

$$\ldots$$

$$\text{temp}_{\theta_n} = \theta_n - \alpha \frac{\partial J(\theta)}{\partial \theta_n}$$

$$\theta_0 = \text{temp}_{\theta_0}$$

$$\theta_1 = \text{temp}_{\theta_1}$$

$$\ldots$$

$$\theta_n = \text{temp}_{\theta_n}$$
 

4. **Convergence Check:** The process continues iteratively until a convergence criterion is met, such as reaching a predefined number of iterations or until the change in the loss function becomes negligible.
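
The four steps above can be sketched for linear regression as follows. This is a minimal example under stated assumptions: the loss is the mean squared error $J(\theta) = \frac{1}{m}(Y - X\theta)^T(Y - X\theta)$, whose gradient is $\frac{2}{m}X^T(X\theta - Y)$, and the function name, learning rate, tolerance, and synthetic data are all illustrative choices.

```python
import numpy as np

def gradient_descent(X, Y, alpha=0.1, max_iters=5000, tol=1e-12):
    """Minimize the mean squared error of Y ~ X @ theta by gradient descent."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])               # 1. initialization
    prev_loss = np.inf
    for _ in range(max_iters):
        error = X @ theta - Y
        loss = (error @ error) / m
        if abs(prev_loss - loss) < tol:        # 4. convergence check
            break
        prev_loss = loss
        gradient = (2.0 / m) * (X.T @ error)   # 2. all partial derivatives
        # 3. Vectorized update: every component of theta is updated from the
        #    same gradient, so the update is simultaneous by construction.
        theta = theta - alpha * gradient
    return theta

# Synthetic data roughly following y = 3 + 2x (illustrative values).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
X = np.column_stack([np.ones_like(x), x])
Y = 3.0 + 2.0 * x + 0.05 * rng.normal(size=200)

theta_gd = gradient_descent(X, Y)
theta_exact = np.linalg.solve(X.T @ X, X.T @ Y)   # closed-form reference
print(theta_gd, theta_exact)
```

Computing the whole gradient vector before touching `theta` plays the same role as the temporary variables: no partial derivative sees a parameter that was already updated in the current iteration.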

#### Conclusion

Gradient descent is a versatile optimization algorithm widely used in machine learning for finding optimal parameters. By iteratively adjusting parameters in the direction of the negative gradient of the loss function, gradient descent converges to a local minimum (the global minimum when the loss is convex), provided the learning rate is suitable, allowing models to learn from data. Proper tuning of hyperparameters, such as the learning rate, is crucial for the algorithm's effectiveness and stability.

## Loss functions:

In machine learning, loss functions are pivotal for training algorithms to learn from data and make accurate predictions. They quantify the disparity between actual observations and predictions generated by the model. Here, we'll delve into three common loss functions used for regression tasks:

Suppose we have $m$ samples and $n + 1$ coefficients, $\theta_0$ to $\theta_n$, including the bias term $\theta_0$.

1. **Mean Squared Error (MSE) or L2 Loss:**
   - The MSE, also known as L2 loss, measures the average squared difference between actual $ y_i $ and predicted $ y_i^p $ values across all data points.
   - It's computed as: 
     $$ MSE = \frac{\sum_{i=1}^m(y_i - y_i^p)^2}{m} $$
   - MSE penalizes larger errors more heavily because of the squaring operation, which makes it sensitive to outliers.

2. **Mean Absolute Error (MAE) or L1 Loss:**
   - MAE, also termed L1 loss, calculates the average absolute difference between actual and predicted values.
   - The formula for MAE is: 
     $$ MAE = \frac{\sum_{i=1}^m|y_i - y_i^p|}{m} $$
   - Unlike MSE, MAE is more robust to outliers since it considers the absolute differences.

3. **Huber Loss $L_{\delta}(y, f(x))$:**
   - Huber loss offers a compromise between MSE and MAE, providing robustness to outliers while still being differentiable.
   - Defined as:
     $$ L_{\delta}(y, f(x)) = 
     \begin{cases} 
     \frac{1}{2} (y - f(x))^2, & \text{if } |y - f(x)| \leq \delta \\ 
     \delta \left(|y - f(x)| - \frac{1}{2} \delta\right), & \text{otherwise} 
     \end{cases} $$
   - It applies quadratic behavior when the absolute error is small $(|y - f(x)| \leq \delta)$, akin to MSE, and linear behavior otherwise.

Understanding the nature and characteristics of these loss functions is crucial for selecting the appropriate one based on the problem's requirements and data characteristics. While MSE is sensitive to outliers due to the squaring, MAE offers robustness, and Huber loss provides a balanced approach catering to both scenarios.
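
The three losses can be compared on a small example. This is a sketch with hypothetical values: the function names, the default $\delta = 1$, and the data are assumptions for illustration, and the Huber loss is averaged over the samples in the same way as MSE and MAE.

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error (L2 loss)."""
    return np.mean((y - y_pred) ** 2)

def mae(y, y_pred):
    """Mean absolute error (L1 loss)."""
    return np.mean(np.abs(y - y_pred))

def huber(y, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    abs_err = np.abs(y - y_pred)
    quadratic = 0.5 * abs_err ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.mean(np.where(abs_err <= delta, quadratic, linear))

# Hypothetical values; the last prediction is a deliberate outlier.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 10.0])

print(mse(y, y_pred))     # 9.3125, dominated by the squared outlier error
print(mae(y, y_pred))     # 1.875
print(huber(y, y_pred))   # 1.53125
```

The single outlier contributes 36 of the 37.25 total squared error, whereas it enters MAE and Huber loss only linearly.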

## Goodness of Fit:

The R-squared statistic, also known as the *coefficient of determination*, provides a measure of how well a linear regression model fits the observed data. It quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables. In essence, R-squared indicates the percentage of the response variable variation explained by the regression model.

Mathematically, R-squared is calculated as the ratio of the explained variation to the total variation:


$$ R^2 = \frac{\text{Explained Variance}}{\text{Total Variance}}= \frac{{\sum_{i=1}^{m}(y_i - \bar{y})^2} - {\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}}{\sum_{i=1}^{m}(y_i - \bar{y})^2}  $$
$$ R^2 = 1 - \frac{\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{m}(y_i - \bar{y})^2} $$

Where:
- $ y_i $ represents the observed values of the dependent variable.
- $ \hat{y}_i $ represents the predicted values of the dependent variable from the regression model.
- $ \bar{y} $ represents the mean of the observed values of the dependent variable.


For a least-squares model with an intercept, evaluated on the data it was fitted to, R-squared values range from 0% to 100%:

- A value of 0% indicates that the model fails to explain any of the variability in the response data around its mean.
- A value of 100% indicates that the model perfectly explains all the variability in the response data around its mean.

It's crucial to note that adding more independent variables or predictors to a regression model typically increases the R-squared value. However, this can lead to a phenomenon known as overfitting, where the model becomes excessively complex and performs well on the training data but poorly on new, unseen data.

To address this issue, *Adjusted R-squared* is employed. Adjusted R-squared accounts for the number of predictors in the model and penalizes excessive complexity, which makes it a fairer basis for comparing models with different numbers of predictors.

Adjusted R-squared is calculated using the formula:

$$\text{Adjusted }R^2 = 1 - \frac{(1-R^2)(m-1)}{m-n-1}$$

Where:
- $m$ is the sample size.
- $n$ is the number of predictors.
- $R^2$ is the sample R-squared.

Adjusted R-squared penalizes the addition of unnecessary predictors, helping to mitigate the risk of overfitting and providing a more accurate assessment of the model's goodness of fit.
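
Both statistics follow directly from the formulas above. The sketch below uses hypothetical observed and predicted values and assumes a model with a single predictor ($n = 1$); the function names are illustrative.

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, m, n):
    """Adjusted R-squared for m samples and n predictors."""
    return 1.0 - (1.0 - r2) * (m - 1) / (m - n - 1)

# Hypothetical observed values and model predictions.
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9, 11.0])

r2 = r_squared(y, y_pred)
adj_r2 = adjusted_r_squared(r2, m=len(y), n=1)
print(r2, adj_r2)   # approximately 0.9975 and 0.9967
```

For the same $R^2$, increasing `n` lowers the adjusted value, which is how the extra predictors are penalized.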

## Regularization Techniques:

Regularization techniques are essential methods in machine learning to mitigate overfitting, a common issue where a model learns to perform well on the training data but fails to generalize to unseen data. These techniques impose constraints on the model parameters during training, discouraging it from becoming overly complex or flexible.

#### L1 Regularization (Lasso):

L1 regularization, also known as Lasso regularization, introduces a penalty term to the loss function based on the absolute values of the model coefficients. It is formulated as:

$$ L1 \text{ or } L_{lasso} = \frac{1}{m}\sum_{i=1}^m\bigg(y_i - \sum_{j=0}^n\theta_jx_{ij}\bigg)^2 + \lambda\sum_{j=0}^n|\theta_j| $$

Where:
- $y_i$ represents the observed target value for the $i^{th}$ sample.
- $x_{ij}$ denotes the $j^{th}$ feature of the $i^{th}$ sample.
- $\theta_j$ are the model coefficients (parameters) associated with each feature.
- $\lambda$ is the regularization parameter that controls the strength of the penalty term.

L1 regularization encourages sparsity in the model by driving some coefficients to exactly zero, effectively selecting a subset of features and discarding the rest.

#### L2 Regularization (Ridge):

L2 regularization, also known as Ridge regularization, imposes a penalty term based on the square of the model coefficients. The formulation is:

$$ L2 \text{ or } L_{ridge} = \frac{1}{m}\sum_{i=1}^m\bigg(y_i - \sum_{j=0}^n\theta_jx_{ij}\bigg)^2 + \lambda\sum_{j=0}^n\theta_j^2 $$

Similar to L1 regularization, $\lambda$ controls the strength of the penalty term. However, L2 regularization tends to distribute the penalty more evenly across all coefficients, often resulting in smaller but non-zero coefficients.
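
Unlike Lasso, the Ridge objective has a closed-form minimizer, which makes the shrinking effect easy to demonstrate. Setting the gradient of the Ridge loss as written above to zero gives $(X^TX + m\lambda I)\theta = X^TY$. The sketch below assumes exactly that formulation, including the penalty on $\theta_0$ (many practical implementations leave the intercept unpenalized); the function name `ridge_fit` and the synthetic data are illustrative.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Minimize (1/m)*||Y - X theta||^2 + lam*||theta||^2 in closed form."""
    m, p = X.shape
    # Zero gradient condition: (X^T X + m*lam*I) theta = X^T Y.
    return np.linalg.solve(X.T @ X + m * lam * np.eye(p), X.T @ Y)

# Synthetic data with hypothetical true coefficients.
rng = np.random.default_rng(1)
m = 50
features = rng.normal(size=(m, 3))
X = np.column_stack([np.ones(m), features])   # column of ones for theta_0
Y = X @ np.array([1.0, 2.0, -3.0, 0.5]) + 0.1 * rng.normal(size=m)

theta_ols = ridge_fit(X, Y, lam=0.0)     # no penalty: ordinary least squares
theta_ridge = ridge_fit(X, Y, lam=1.0)   # penalized: coefficients shrink

print(theta_ols)
print(theta_ridge)   # smaller in magnitude overall, but not exactly zero
```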

#### Choosing the Regularization Parameter:

The choice of the regularization parameter $\lambda$ is crucial as it balances between fitting the training data well and keeping the model simple to generalize better. It is typically determined through techniques like cross-validation, where the dataset is split into training and validation sets. K-Fold cross-validation is a commonly used method, where the data is divided into K subsets, and the model is trained K times, each time using a different subset as the validation set and the remaining as the training set. The average performance across all folds is then used to select the optimal $\lambda$ value.
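
The K-Fold procedure can be sketched with plain NumPy. This is a minimal example under stated assumptions: the model is Ridge regression fitted in closed form, the score is the validation mean squared error, and the function names, candidate $\lambda$ values, and synthetic data are all illustrative.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Closed-form minimizer of (1/m)*||Y - X theta||^2 + lam*||theta||^2."""
    m, p = X.shape
    return np.linalg.solve(X.T @ X + m * lam * np.eye(p), X.T @ Y)

def k_fold_mse(X, Y, lam, k=5, seed=0):
    """Average validation MSE over k folds for a given lambda."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(Y))      # shuffle once before splitting
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]                 # fold i is the validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta = ridge_fit(X[train_idx], Y[train_idx], lam)
        scores.append(np.mean((Y[val_idx] - X[val_idx] @ theta) ** 2))
    return float(np.mean(scores))

# Synthetic data: only some of the five features actually matter.
rng = np.random.default_rng(7)
m = 60
features = rng.normal(size=(m, 5))
X = np.column_stack([np.ones(m), features])
Y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.0, 0.0]) + 0.5 * rng.normal(size=m)

candidates = [0.0, 0.001, 0.01, 0.1, 1.0, 10.0]
cv_scores = {lam: k_fold_mse(X, Y, lam) for lam in candidates}
best_lam = min(cv_scores, key=cv_scores.get)   # lowest average validation MSE
print(best_lam, cv_scores[best_lam])
```

Using the same `seed` for every candidate keeps the folds identical across $\lambda$ values, so the scores are compared on the same splits.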

By incorporating regularization techniques such as L1 and L2 regularization, along with proper parameter tuning using techniques like cross-validation, we can effectively prevent overfitting and build models that generalize well to unseen data.