# Regression

* Regression is a type of supervised learning where the goal is to predict **continuous numerical values** based on input features.

##  Types of Regression

## 1. Linear Regression

 * Linear Regression is a supervised machine learning algorithm used for predictive modeling. 

 * It establishes a linear relationship between the dependent variable (target) and one or more independent variables (features).

 * It fits the line (or, with several features, the hyperplane) that best describes the observed data.

![image.png](attachment:image.png)


**Purpose:**
 
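 * Regression is used for prediction and forecasting, including modeling trends in time series.

 * It quantifies how strongly the target is associated with each feature. On its own it does not establish causation; causal claims need an experimental design or additional assumptions.
 * The target in regression analysis is a continuous variable.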



![image.png](attachment:image.png)



### Working Mechanism

* The goal of linear regression is to find a **straight line that minimizes the error (the difference) between the observed data points and the predicted values**.
This line helps us predict the dependent variable for new, unseen data.

* The model assumes a linear relationship:

​	![image.png](attachment:image.png)
 

* Training involves:
 
  * Learning the optimal values of `β` (the coefficients, also called weights).
  
  * These are estimated by minimizing the **cost function (commonly Mean Squared Error)**.

* Prediction is simply plugging new `X` values into the learned equation.

* Two common ways to optimize:
    * **Normal Equation (Closed-form):** Directly computes `β` without iteration.

    * **Gradient Descent:** Iteratively updates `β` to minimize the cost function.
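The two optimization routes listed above can be compared on a small synthetic problem. The sketch below is illustrative only: the data, the learning rate `lr`, the iteration count `n_iters`, and the helper names `add_intercept`, `fit_normal_equation`, and `fit_gradient_descent` are assumptions made for this example and do not come from the original notebook.

```python
import numpy as np


def add_intercept(X):
    """Prepend a column of ones so the first coefficient acts as the intercept."""
    return np.column_stack([np.ones(len(X)), X])


def fit_normal_equation(X, y):
    """Closed form: solve (X^T X) beta = X^T y without iterating."""
    Xb = add_intercept(X)
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)


def fit_gradient_descent(X, y, lr=0.1, n_iters=5000):
    """Batch gradient descent on the mean squared error."""
    Xb = add_intercept(X)
    n = len(y)
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        residuals = Xb @ beta - y                   # predictions minus targets
        gradient = (2.0 / n) * (Xb.T @ residuals)   # gradient of the MSE
        beta -= lr * gradient                       # step against the gradient
    return beta


# Synthetic data (assumed for illustration): y = 3 + 2*x1 - 1*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

beta_closed = fit_normal_equation(X, y)
beta_gd = fit_gradient_descent(X, y)

print("Closed form     :", np.round(beta_closed, 3))
print("Gradient descent:", np.round(beta_gd, 3))
```

On well-scaled data like this, both routes should land on essentially the same coefficients. Gradient descent tends to be preferred when the number of features is large enough that solving the linear system directly becomes expensive.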



### Types of Linear Regression


![image.png](attachment:image.png)


![image-2.png](attachment:image-2.png)

### Mathematical Foundation

![image.png](attachment:image.png)


- **Gradient Descent** is an iterative optimization algorithm used to **minimize the loss function in machine learning models**: at each step it moves the parameters a small distance against the gradient of the loss. For a deeper understanding, refer to the official Deep Learning documentation on Gradient Descent.
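For reference, the standard formulation of linear regression with a Mean Squared Error cost can be written as follows. The notation is an assumption of this sketch and may differ from the figure above: $n$ is the number of observations, $p$ the number of features, $\alpha$ the learning rate, and $X$ is the feature matrix with a leading column of ones so that $\beta_0$ is the intercept.

```latex
\begin{aligned}
\hat{y} &= \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p = X\beta \\
J(\beta) &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
          = \frac{1}{n}\lVert y - X\beta \rVert^2 \\
\nabla_\beta J(\beta) &= -\frac{2}{n}\,X^\top\left(y - X\beta\right) \\
\text{Gradient descent:}\quad \beta &\leftarrow \beta - \alpha\,\nabla_\beta J(\beta) \\
\text{Normal equation:}\quad \hat{\beta} &= \left(X^\top X\right)^{-1} X^\top y
\end{aligned}
```

Setting the gradient to zero gives $X^\top X \beta = X^\top y$, which is where the normal equation comes from; it requires $X^\top X$ to be invertible.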


### Assumptions of Linear Regression

1. **Linearity:**

   The dependent variable (Y) has a linear relationship with the independent variables (X). This means that the change in Y is proportional to the change in X.

2. **Independence of Errors:**

   The residuals (errors) should be independent of each other. In other words, the error associated with one observation should not influence the error of another.

3. **Homoscedasticity (Constant Variance of Errors):**

   The variance of the residuals should remain constant across all levels of the independent variables. If the variance changes systematically (for example, the residuals spread out as the predicted value grows), the assumption is violated; this is called heteroscedasticity.

4. **Normality of Errors:**

    The residuals should follow a normal distribution. This is particularly important for constructing confidence intervals and hypothesis testing.

5. **No Multicollinearity (for Multiple Regression):**

    The independent variables should not be highly correlated with one another. Strong correlations between predictors can make it difficult to isolate the effect of each variable.

6. **No Autocorrelation:**

   The residuals should not display systematic patterns over time. This is the time-ordered form of the independence assumption above and is especially critical in time series data, where autocorrelated errors often indicate that the model is misspecified.

7. **Additivity:**

   The combined effect of all independent variables on the dependent variable is simply the sum of their individual effects. The model does not capture interactions or non-linear combinations of predictors unless they are added explicitly as features.


   ![image.png](attachment:image.png)
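Two of the assumptions above, no autocorrelation and no multicollinearity, have simple numeric diagnostics. The sketch below implements the Durbin-Watson statistic and the variance inflation factor (VIF) from their textbook definitions on synthetic data. The helper names and the data are assumptions made for this example; libraries such as statsmodels ship equivalents that would normally be preferred.

```python
import numpy as np


def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest little first-order autocorrelation."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


# Synthetic data (assumed): x3 is nearly a copy of x1, so both should show a high VIF
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + rng.normal(scale=0.1, size=300)
X = np.column_stack([x1, x2, x3])

independent_errors = rng.normal(size=300)
trending_errors = np.cumsum(independent_errors)  # a random walk is strongly autocorrelated

print("VIF per column:", np.round(variance_inflation_factors(X), 1))
print("DW, independent errors:", round(durbin_watson(independent_errors), 2))
print("DW, autocorrelated errors:", round(durbin_watson(trending_errors), 2))
```

As a rule of thumb rather than a hard threshold, a VIF above roughly 5 to 10 is usually read as a sign of problematic multicollinearity, and a Durbin-Watson value well below 2 as a sign of positive autocorrelation.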

### Evaluation Metrics


![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)
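The metrics reported in this notebook's example (MSE, RMSE, MAE, R² and RSS) all follow directly from the residuals. The sketch below computes them from scratch on four made-up values so the formulas are visible; the function name `regression_metrics` and the numbers are assumptions for illustration only.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rss = np.sum(residuals ** 2)                   # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    mse = rss / len(y_true)
    return {
        "RSS": rss,
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(residuals)),
        "R2": 1.0 - rss / tss,
    }


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]
for name, value in regression_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.4f}")
```

For these values the residuals are 0.5, 0, -0.5 and -1, giving RSS = 1.5, MSE = 0.375, MAE = 0.5 and R² = 1 - 1.5/20 = 0.925. MSE and RSS are in squared target units, while RMSE and MAE are in the same units as the target, which makes them easier to interpret.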

### Example:


```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Step 1: Load the dataset
df = pd.read_csv("Melbourne_housing_FULL.csv")  # Replace with actual path

# Step 2: Preprocessing
# Drop rows where Price is missing (since it's the target)
df = df.dropna(subset=["Price"])

# Select numerical features only for simplicity
numeric_features = ["Rooms", "Bedroom2", "Bathroom", "Car", "Landsize", "BuildingArea", "Distance"]
# Fill missing feature values with each column's median.
# Note: the medians are computed on the full dataset, before the split below,
# so the test rows influence the imputation slightly. A stricter workflow
# would compute the medians on the training split only.
df[numeric_features] = df[numeric_features].fillna(df[numeric_features].median())

X = df[numeric_features]
y = df["Price"]

# Step 3: Train-test split (80% train, 20% test, fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Fit Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)

# Step 5: Make predictions on the held-out test set
y_pred = model.predict(X_test)

# Step 6: Evaluate
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rss = np.sum(np.square(y_test - y_pred))

# Step 7: Print metrics
print("🔍 Evaluation Metrics:")
print(f"Mean Squared Error (MSE): {mse:,.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:,.2f}")
print(f"Mean Absolute Error (MAE): {mae:,.2f}")
print(f"R² Score: {r2:.4f}")
print(f"Residual Sum of Squares (RSS): {rss:,.2f}")
```


```text
🔍 Evaluation Metrics:
Mean Squared Error (MSE): 273,966,010,891.83
Root Mean Squared Error (RMSE): 523,417.63
Mean Absolute Error (MAE): 348,764.40
R² Score: 0.3589
Residual Sum of Squares (RSS): 1,493,114,759,360,469.00
```
