1. [Linear Regression](#linear_reg)
   1. [Assumption](#assumption)
   2. [Accuracy metrics](#accuracy_metrics)
   3. [Regularization](#regularization)
2.  [Non-Linear Regression](#non_linear)
    1.  [Logistic Regression](#logistics)  

<a id='linear_reg'></a>
# Linear Regression

<a id="assumption"></a>
### Assumption

1. **Linearity**: There should be a linear relationship between the independent variable(s) and the dependent variable.
2. **Independence**: The residuals (errors) should be independent. This means that the residuals of one observation are not correlated with the residuals of another.
3. **Homoscedasticity**: The residuals should have constant variance at every level of the independent variable(s). This means that the spread of the residuals should be roughly the same across all levels of the independent variable(s).
4. **Normality**: The residuals of the model should be normally distributed.
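
The assumptions above can be probed numerically from the residuals of a fitted line. The block below is a minimal sketch, assuming NumPy only; the helper name `residual_diagnostics`, the synthetic data, and the rules of thumb in the comments are illustrative assumptions, not part of the original notebook. Linearity is usually judged separately, by plotting residuals against fitted values.

```python
import numpy as np

def residual_diagnostics(x, y):
    """Fit a straight line and return simple residual summaries (illustrative helper)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)

    # Independence: Durbin-Watson statistic. Values near 2 suggest little
    # first-order correlation between consecutive residuals.
    durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

    # Homoscedasticity: ratio of residual variance in the upper half of x
    # to the lower half. Values near 1 suggest constant variance.
    order = np.argsort(x)
    half = len(x) // 2
    variance_ratio = residuals[order][half:].var() / residuals[order][:half].var()

    # Normality: sample skewness, which is near 0 for symmetric residuals.
    centered = residuals - residuals.mean()
    skewness = np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

    return {
        "mean": residuals.mean(),
        "durbin_watson": durbin_watson,
        "variance_ratio": variance_ratio,
        "skewness": skewness,
    }

# Synthetic data (assumed for illustration): a line plus Gaussian noise
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 200)
y_demo = 10 * x_demo + 2 + rng.normal(scale=5.0, size=x_demo.size)

diagnostics = residual_diagnostics(x_demo, y_demo)
print(diagnostics)
```

These summaries are rough screens rather than formal tests; a residual plot is still the most direct check.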

In [1]:
import sys
import numpy as np
import pandas as pd
sys.path.append(r'../../')
from utils.plots import linear_plot,scatter_plot

# np.random.seed returns None, so there is nothing useful to assign
np.random.seed(42)

In [2]:
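m = 10          # true slope
b = 2           # true intercept
noise_max = 50  # exclusive upper bound of the integer noise

x = np.arange(start=0, stop=10, step=1)
# Uniform integer noise in [0, noise_max); it is not zero-mean, so it shifts the fitted intercept
noise = np.random.randint(low=0, high=noise_max, size=len(x))

y = m * x + b + noise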

In [3]:
# first order polynomial - simple linear regression
coeff = np.polyfit(x=x, y=y, deg=1)
y_pred = (coeff[0] * x**1) + coeff[1]
linear_plot(x, y, y_pred).show()

In [4]:
# third order polynomial
coeff = np.polyfit(x=x, y=y, deg=3)
y_pred = (coeff[0] * x**3) + (coeff[1] * x**2) + (coeff[2] * x**1) + coeff[3]
linear_plot(x, y, y_pred).show()

In [5]:
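# eighth-order polynomial evaluated with numpy's polyval function
# (with only 10 points this nearly interpolates the data, i.e. it overfits)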
coeff = np.polyfit(x=x, y=y, deg=8)
y_pred = np.round(np.polyval(coeff, x),2)
linear_plot(x, y, y_pred).show()

<a id='accuracy_metrics'></a>
#### Accuracy metrics

1. **MAE (Mean Absolute Error)**
   - Pros:
     - The MAE is expressed in the same unit as the output variable.
     - Robust to outliers.
   - Cons:
     - Not differentiable at zero error, so gradient-based optimization needs subgradients or a smooth approximation (e.g. Huber loss).

2. **MSE (Mean Squared Error)**
   - Pros:
     - Differentiable and can be used as a loss function.
   - Cons:
     - Output is in squared units.
     - Not robust to outliers due to squared differences.

3. **RMSE (Root Mean Squared Error)**
   - Pros:
     - Output is in the same unit as the target variable.

4. **RMSLE (Root Mean Squared Log Error)**
   - Pros:
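     - The logarithm dampens large absolute errors, so it measures relative rather than absolute error.
     - Useful when underestimation is costlier than overestimation, because underestimates are penalized more heavily.
   - Cons:
     - The penalty is asymmetric, and it requires non-negative targets and predictions.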

5. **MAPE (Mean Absolute Percentage Error)**
   - Pros:
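     - Expressed as a percentage, so errors are comparable across high and low magnitude values.
   - Cons:
     - Undefined when an actual value is zero, and inflated when actual values are close to zero.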

6. **R2 (R-Squared)**
   $R^2 = 1 - \frac{{SS_{\text{res}}}}{{SS_{\text{tot}}}}$
   - Pros:
     - Compares regression line to mean line.
     - Useful for model comparison.
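     - Measures how much of the variance in the target is explained by the model. Typically between 0 and 1 (1 being best), but it can be negative when the model fits worse than predicting the mean.
   - Cons:
     - Never decreases on the training data when features are added, even useless ones, so it can reward unnecessary complexity.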

7. **Adj R2 (Adjusted R-Squared)**
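   $\text{Adjusted } R^2 = 1 - \left(1 - R^2\right) \cdot \frac{n - 1}{n - k - 1}$, where $n$ is the number of observations and $k$ is the number of features.
   - Pros:
     - Accounts for the number of features, so it suits comparing models of different sizes.
     - Decreases when an added feature does not improve the fit enough to offset the penalty.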
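
The metrics above can be computed directly from their definitions. The block below is a minimal NumPy sketch; the helper name `regression_metrics` and the four-point toy data are assumptions made for illustration. RMSLE is computed with `log1p` and assumes non-negative values, and MAPE assumes no actual value is zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Compute the regression metrics listed above (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    # log1p(v) = log(1 + v); requires y_true and y_pred to be non-negative
    rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
    # Percentage error; requires every y_true to be non-zero
    mape = np.mean(np.abs(errors / y_true)) * 100

    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"mae": mae, "mse": mse, "rmse": rmse, "rmsle": rmsle,
            "mape": mape, "r2": r2, "adj_r2": adj_r2}

# Toy data assumed for illustration
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 9.0])

metrics = regression_metrics(y_true_demo, y_pred_demo, n_features=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For this toy data the residual sum of squares is 2 and the total sum of squares is 20, so $R^2 = 0.9$ and, with $n = 4$ and $k = 1$, adjusted $R^2 = 0.85$.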

[Reference 1](https://www.analyticsvidhya.com/blog/2021/05/know-the-best-evaluation-metrics-for-your-regression-model/)
[Reference 2](https://www.linkedin.com/pulse/regression-metrics-all-why-mse-aishwarya-b/)

<a id="regularization"></a>
## Regularization

**Regularization** helps manage the bias-variance trade-off: it limits model complexity to reduce overfitting, at the risk of underfitting when applied too strongly.
1. **Bias-Variance Trade-off**:
    - Polynomial regression aims to find a balance between **bias** (underfitting) and **variance** (overfitting).
    - High-degree polynomials can fit the training data perfectly but may generalize poorly to unseen data (overfitting).
    - Regularization helps control this trade-off.

2. **Why Regularization?**:
    - When fitting polynomials, we often face a dilemma:
        - **Low-degree polynomials** (e.g., linear or quadratic) may underfit the data.
        - **High-degree polynomials** (e.g., cubic or higher) may overfit the data.
    - Regularization provides a way to address this by introducing a **penalty term**.

3. **Penalty Term**:
    - Regularization adds a penalty to the loss function.
    - The total loss becomes: **Loss = Loss Function + λ · Penalty**, where λ ≥ 0 sets the regularization strength.
    - The penalty discourages large coefficients, preventing overfitting.

4. **Types of Regularization**:
    - **L2 (Ridge) Regularization**:
        - Adds the sum of squared coefficients to the loss function.
        - Encourages small coefficients.
        - Helps prevent overfitting.
    - **L1 (Lasso) Regularization**:
        - Adds the sum of absolute coefficients to the loss function.
        - Encourages sparse models (sets some coefficients to exactly zero).
        - Useful for feature selection.
    - **Elastic Net Regularization**:
        - Combines L1 and L2 regularization.
        - Balances between sparsity and smoothness.

5. **Effect on Coefficients**:
    - Regularization shrinks the coefficients toward zero.
    - Smaller coefficients lead to simpler models.
    - It helps prevent overfitting by reducing the model's complexity.

6. **Continuous Complexity Range**:
    - Regularization provides a **continuous range** of complexity parameters.
    - Unlike choosing a fixed polynomial degree, you can fine-tune the regularization strength.
    - This flexibility allows finding the right balance between bias and variance.
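
The shrinkage effect described above can be seen with a closed-form ridge (L2) fit on polynomial features. This is a minimal sketch under stated assumptions: the helper `ridge_polyfit`, the synthetic data on $[-1, 1]$ (chosen to keep the polynomial features well conditioned), and the `alpha` values are all illustrative and not part of the original notebook. The intercept is left unpenalized, which is a common but not universal convention.

```python
import numpy as np

def ridge_polyfit(x, y, degree, alpha):
    """Closed-form ridge fit on polynomial features (illustrative helper)."""
    # Columns are x**0, x**1, ..., x**degree
    X = np.vander(x, N=degree + 1, increasing=True)
    penalty = alpha * np.eye(degree + 1)
    penalty[0, 0] = 0.0  # do not penalize the intercept
    # Solve (X^T X + alpha * I) w = X^T y
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# Synthetic data assumed for illustration: a noisy straight line
rng = np.random.default_rng(42)
x_demo = np.linspace(-1, 1, 20)
y_demo = 10 * x_demo + 2 + rng.normal(scale=2.0, size=x_demo.size)

coefficient_norms = {}
for alpha in (0.0, 0.1, 10.0):
    coeffs = ridge_polyfit(x_demo, y_demo, degree=8, alpha=alpha)
    coefficient_norms[alpha] = np.linalg.norm(coeffs[1:])
    print(f"alpha={alpha:>5}: L2 norm of non-intercept coefficients = "
          f"{coefficient_norms[alpha]:.3f}")
```

Increasing `alpha` trades a slightly worse fit on the training points for smaller coefficients, which is the continuous complexity control the list describes.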


### Methods to detect overfitting
1. **Visual Inspection:** Plot the fitted line against the data points.
2. **Cross-Validation:** Use techniques like k-fold cross-validation to assess model performance on unseen data.
3. **Learning Curves:**
   - Plot the model’s performance (e.g., accuracy or loss) against the size of the training dataset.
   - If the training performance keeps improving while the validation performance plateaus or worsens, overfitting could be occurring.
4. **Feature Importance Analysis:** If a few features dominate, it might indicate overfitting.
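
The cross-validation check from the list above can be sketched with NumPy alone. The helper `kfold_mse`, the synthetic data, and the polynomial degrees tried here are assumptions for illustration. A training error that keeps falling while the validation error stalls or rises is the pattern to look for.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average training and validation MSE of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    train_errors, val_errors = [], []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds only
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        train_pred = np.polyval(coeffs, x[train_idx])
        val_pred = np.polyval(coeffs, x[val_idx])
        train_errors.append(np.mean((train_pred - y[train_idx]) ** 2))
        val_errors.append(np.mean((val_pred - y[val_idx]) ** 2))
    return np.mean(train_errors), np.mean(val_errors)

# Synthetic data assumed for illustration: a noisy straight line
rng = np.random.default_rng(42)
x_demo = np.linspace(-1, 1, 40)
y_demo = 10 * x_demo + 2 + rng.normal(scale=1.0, size=x_demo.size)

cv_results = {}
for degree in (1, 3, 9):
    cv_results[degree] = kfold_mse(x_demo, y_demo, degree)
    train_mse, val_mse = cv_results[degree]
    print(f"degree={degree}: train MSE={train_mse:.3f}, validation MSE={val_mse:.3f}")
```

Because the same `seed` produces the same folds for every degree, the errors are directly comparable across models.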


<a id="non_linear"></a>
## Non-Linear Regression

<a id="logistics"></a>
<div style="text-align: left;">

## Logistic Regression

### Classification Problems

* Email: spam / not spam
* Online transactions: fraudulent (yes/no)?
* Tumor: malignant / benign?

#### Binary classification
y ∈ {0, 1}

- 0: "Negative class" (e.g. benign tumor)
- 1: "Positive class" (e.g. malignant tumor)

#### Multi-class classification
y ∈ {0, 1, 2, 3, 4, 5}, or more generally one label for each of N classes

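Linear regression is a poor fit for classification problems: its output is unbounded, so it cannot be read as a probability, and a few extreme points can shift the decision threshold.

Logistic regression instead models the probability of the positive class as: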

**$P(Y=1|X) = \sigma(z) = \frac{1}{1 + e^{-z}}$**
<br>Here: $z = w_0 + w_1x_1 + w_2x_2 + \cdots + w_nx_n$
<br>where:  $w_0, w_1, \ldots, w_n$  are the weights (or coefficients) of the model.
<br>$\sigma(z)$  is the sigmoid function, which converts  z  into a probability.

1. **Log-Odds (Logit)**
    Logistic regression models the log of odds of the dependent variable being 1 (i.e.,  Y = 1 ) as a linear combination of the input features. 
    <br>Defined as: 

    $\text{Logit}(P) = \ln\left(\frac{P}{1-P}\right)$
    <br>Here:<br>
    - $P = P(Y=1|X)$ , the probability of the positive class.<br>
    - $1-P = P(Y=0|X)$ , the probability of the negative class.
    <br>This equation can be rewritten as: $\ln\left(\frac{P}{1-P}\right) = z = w_0 + w_1x_1 + w_2x_2 + \cdots + w_nx_n$<br>
    Hence, logistic regression predicts the log-odds of the target variable.

2. **Why Called Regression?**
    <br> Logistic regression is called regression because it fits a linear model to predict the log-odds (a continuous value). However, the final output is converted into a probability using the sigmoid function, making it suitable for classification tasks.

3. **Interpretation of the Log-Odds**
    <br>The odds represent the ratio of the probability of success ( P ) to the probability of failure ( 1-P ). The log of the odds transforms the multiplicative relationship of odds into an additive relationship. It linearizes the model, allowing coefficients to be interpreted as the effect of a one-unit change in the feature on the log-odds.

4. **Why the Negative Sign in Sigmoid Function?**
    <br>Sigmoid function is defined as: $\sigma(z) = \frac{1}{1 + e^{-z}}$
    <br>The negative sign in  $e^{-z}$  is what gives the sigmoid the following characteristics:<br>
    - When  $z \to +\infty ,  e^{-z} \to 0$ , so  $\sigma(z) \to 1$  (probability of success).
    - When  $z \to -\infty ,  e^{-z} \to \infty$ , so  $\sigma(z) \to 0$  (probability of failure).
    - This negative exponent ensures the sigmoid is a monotonically increasing function, mapping all values of  z  to a range between 0 and 1.
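
The relationship between the sigmoid and the log-odds described above can be checked numerically. This is a minimal sketch; the `logit` helper and the sample values of `z` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

def logit(p):
    """Log-odds of a probability p; the inverse of the sigmoid."""
    return np.log(p / (1 - p))

z_values = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(z_values)

# Applying the logit to the sigmoid output recovers the linear score z
recovered_z = logit(probabilities)

# Odds P / (1 - P): a one-unit increase in z multiplies the odds by e
odds = probabilities / (1 - probabilities)

for z, p, o in zip(z_values, probabilities, odds):
    print(f"z={z:+.1f}  P(Y=1)={p:.4f}  odds={o:.4f}")
```

Since $z$ is linear in the features, each coefficient $w_j$ is the change in log-odds for a one-unit change in $x_j$, which is why the odds scale by $e^{w_j}$.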

### Cost Function

The squared-error cost function used for linear regression is not a good choice for logistic regression: once the sigmoid is involved, it becomes non-convex in the weights. Running gradient descent on a non-convex function can converge to a local minimum that differs from the global minimum.

A good cost function _J_ that is guaranteed to be convex (and also non-negative) is:

	Cost(h(x), y) = |   -log(h(x)), if y=1
	                | -log(1-h(x)), if y=0

When plotting the two functions, we can see that for _y=1_ the cost grows without bound as _h(x)_ approaches 0, and for _y=0_ it grows without bound as _h(x)_ approaches 1. This is what we want: confident but wrong predictions are penalized heavily.
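
The convexity argument in this section can be checked numerically on a one-parameter example. The sketch below assumes a single training sample with $x = 1$ and $y = 1$ and a grid of candidate weights (both chosen for illustration), then inspects second differences as a stand-in for curvature.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One training sample (x = 1, y = 1) and a grid of candidate weights w
w_grid = np.linspace(-6, 6, 121)
h = sigmoid(w_grid * 1.0)

squared_error = (h - 1.0) ** 2   # linear-regression style cost
log_loss = -np.log(h)            # Cost(h(x), y) for y = 1

# Second differences approximate curvature on an evenly spaced grid:
# a convex function never has a negative second difference.
squared_error_curvature = np.diff(squared_error, n=2)
log_loss_curvature = np.diff(log_loss, n=2)

print("squared error has negative curvature somewhere:",
      bool((squared_error_curvature < 0).any()))
print("log loss curvature is positive on the whole grid:",
      bool((log_loss_curvature > 0).all()))
```

For this sample, the squared error is concave wherever $\sigma(w) < 1/3$, while the second derivative of the log loss is $\sigma(w)(1 - \sigma(w))$, which is always positive.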


### Simplified Cost Function and Gradient Descent

We can further simplify both _Cost(h(x), y)_ and _J(Θ)_:

	Cost(h(x), y) = -y * log(h(x)) - (1-y) * log(1 - h(x))
	
	# replacing these values on our original cost function J, we have:
	
	J(Θ) = -(1/m) * ∑( y * log(h(x)) + (1-y) * log(1 - h(x)) )

    
In logistic regression, the loss function commonly used is called Log Loss (also known as logistic loss or cross-entropy loss). This function measures the performance of a classification model whose output is a probability value between 0 and 1.<br>
The formula for Log Loss is:

$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)]$

Where:

- ${N}$ is the number of observations.
- ${y_i}$ is the actual label (0 or 1).
- ${p_i}$ is the predicted probability of the observation being in class 1.

This loss function penalizes incorrect predictions more heavily as the predicted probability diverges from the actual label. For example, predicting a probability close to 0 when the actual label is 1 results in a high loss.
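
A small worked example makes the penalty concrete. The block below is a sketch; the helper name `binary_log_loss`, the clipping constant `eps`, and the toy labels and probabilities are assumptions for illustration.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean log loss for binary labels and predicted probabilities of class 1."""
    # Clip so that log(0) is never evaluated
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 0])

confident_right = np.array([0.9, 0.1, 0.9, 0.1])  # high probability on the true class
unsure = np.array([0.5, 0.5, 0.5, 0.5])           # no information
confident_wrong = np.array([0.1, 0.9, 0.1, 0.9])  # high probability on the wrong class

losses = {
    "confident and right": binary_log_loss(y_true, confident_right),
    "unsure": binary_log_loss(y_true, unsure),
    "confident and wrong": binary_log_loss(y_true, confident_wrong),
}
for name, value in losses.items():
    print(f"{name}: {value:.4f}")
```

The three losses equal $-\ln 0.9 \approx 0.105$, $\ln 2 \approx 0.693$, and $-\ln 0.1 \approx 2.303$, so a confident wrong prediction costs roughly 22 times as much as a confident correct one.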

[Reference 1](https://developers.google.com/machine-learning/crash-course/logistic-regression)


In [6]:
weights = [40, 50, 75, 36, 90, 122, 115, 40, 130, 80, 170]
label = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

data = pd.DataFrame({"weight": weights, "obese": label})

X = data[['weight']].values
y = data['obese'].values.reshape(-1, 1)  # Reshape y to be a column vector

In [7]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Initialize weights and bias
def initialize_weights(n_features):
    W = np.zeros((n_features, 1))  # Weight vector
    b = 0.0  # Bias term
    return W, b

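# Compute loss (Binary Cross-Entropy)
def compute_loss(y, y_pred, eps=1e-15):
    m = y.shape[0]
    # Clip predictions away from 0 and 1 so np.log never receives 0
    y_pred = np.clip(y_pred, eps, 1 - eps)
    loss = -1 / m * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
    return loss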

# Compute gradients
def compute_gradients(X, y, y_pred):
    m = X.shape[0]
    dW = 1 / m * np.dot(X.T, (y_pred - y))
    db = 1 / m * np.sum(y_pred - y)
    return dW, db

# Train logistic regression model
def train(X, y, learning_rate=0.01, num_iterations=1000, tolerance=0.01):
    n_features = X.shape[1]

    W, b = initialize_weights(n_features)
    prev_loss = float("inf")

    for i in range(num_iterations):
        
        # Linear model
        z = np.dot(X, W) + b
        
        # Sigmoid function
        y_pred = sigmoid(z)
        
        # Compute gradients
        dW, db = compute_gradients(X, y, y_pred)
        
        # Update weights and bias
        W -= learning_rate * dW
        b -= learning_rate * db
        
        # Compute current loss
        loss = compute_loss(y, y_pred)

        # Check for early stopping
        if prev_loss - loss < tolerance * prev_loss:
            print(f"Early stopping at iteration {i}: Loss = {loss:.4f}")
            break
        prev_loss = loss

    return W, b

W, b = train(X, y, learning_rate=0.001, num_iterations=100000, tolerance=1e-10)

In [8]:
z = np.dot(X, W) + b
y_sigmoid = sigmoid(z)

threshold = 0.5
# Vectorized thresholding: probabilities above the threshold become class 1
y_predicted = (y_sigmoid > threshold).astype(int).ravel()

In [9]:
X_new = np.random.randint(10, 150, 10000).reshape(-1,1)
z = np.dot(X_new, W) + b
y_sigmoid = sigmoid(z)

In [10]:
new_data = pd.concat(
    [
        pd.DataFrame(X_new, columns=["feature"]),
        pd.DataFrame(y_sigmoid, columns=["label"]),
    ],
    axis=1,
)
scatter_plot(data, x1='weight', y1='obese', data2=new_data, x2='feature', y2='label')