# **Environment Setup and Data Loading**

### This block imports `pandas` for data manipulation and `NumPy` for numerical operations. We then load the dataset and inspect its columns and shape to determine how many inputs (features) the model must handle.

In [2]:
import pandas as pd
import numpy as np

# Load dataset
data = pd.read_csv("multiple_linear_regression_dataset.csv")

# Inspect data
print("First 5 rows:\n", data.head())
print("\nColumn Names:", data.columns)
print("Shape of Dataset:", data.shape)

# Think About This (Step 1):
# - Inputs: 'age' and 'experience'
# - Output: 'income'
# - Features: The model needs to handle 2 features

First 5 rows:
    age  experience  income
0   25           1   30450
1   30           3   35670
2   47           2   31580
3   32           5   40130
4   43          10   47830

Column Names: Index(['age', 'experience', 'income'], dtype='object')
Shape of Dataset: (20, 3)
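
The feature count can also be read off the frame programmatically rather than by eye. This is a minimal sketch: the `sample` frame is a stand-in rebuilt from the five rows printed above, and treating `income` as the target column is the assumption stated in the cell's comments.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: the five rows printed by data.head()
sample = pd.DataFrame({
    "age": [25, 30, 47, 32, 43],
    "experience": [1, 3, 2, 5, 10],
    "income": [30450, 35670, 31580, 40130, 47830],
})

target = "income"  # assumed target column
feature_names = [c for c in sample.columns if c != target]
n_inputs = len(feature_names)

print(feature_names)  # ['age', 'experience']
print(n_inputs)       # 2

# Count missing values across the whole frame before training on it
n_missing = int(sample.isna().sum().sum())
print(n_missing)      # 0
```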


# **Data Preprocessing**
### We separate the dataset into the independent variables ($X$) and the target variable ($y$). This separation is essential: the model learns a mapping from $X$ to $y$, so if inputs and outputs are mixed, the learned weights have no meaning.

In [4]:
# Inputs (features)
X = data[["age", "experience"]].values

# Output (target)
y = data["income"].values

# Think About This (Step 2):
# - X is a 2-D array with 2 columns, one per independent feature
# - y is a 1-D array with one value per row because we are predicting a single numeric value
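
A quick shape check makes the split concrete. The sketch below uses the five preview rows as a stand-in for the full dataset; the `X_demo` and `y_demo` names are illustrative and not part of the notebook.

```python
import numpy as np

# Five rows from the printed preview, used as a stand-in for the full dataset
X_demo = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10]])
y_demo = np.array([30450, 35670, 31580, 40130, 47830])

print(X_demo.shape)  # (5, 2): rows x features
print(y_demo.shape)  # (5,): one target per row

# Every row of X needs exactly one target value
rows_match = X_demo.shape[0] == y_demo.shape[0]
print(rows_match)    # True
```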

# **Parameter Initialization**
### We initialize the weights and bias. Learning can start from any imperfect guess, so we begin with zeros for simplicity. We need exactly one weight for every input feature so that each input can have its own level of "importance".

In [5]:
# Number of features
n_features = X.shape[1]

# Initialize weights (one per feature) and bias (baseline offset)
w = np.zeros(n_features)
b = 0.0

# Think About This (Step 3):
# - One weight per feature: Needed to calculate the specific importance of each input
# - Bias: Separate because it represents the baseline income offset when all inputs are zero
# - Large initial values: Risky because they start far from a good solution, so early losses and gradients can be huge and the updates may diverge

# **The Forward Pass (Prediction)**
### This defines the linear neuron. We use the formula $\hat{y} = X \cdot w + b$. There is no activation function (like Sigmoid) because regression needs the raw numeric magnitude; squashing it into a fixed range only makes sense for a yes/no decision.

In [6]:
def predict(X, w, b):
    # Returns predicted values: y_hat = w1*x1 + w2*x2 + b
    y_hat = X.dot(w) + b
    return y_hat

# Think About This (Step 4):
# - No activation: Because we need a continuous number, not a 0/1 category
# - y_hat values: Can be any real number (continuous)
# - vs Logistic Regression: Logistic uses a Sigmoid to squash output between 0 and 1
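
To see the formula at work, here is a small worked example. The weights and bias are hypothetical round numbers chosen so the arithmetic is easy to follow; they are not the values the model learns.

```python
import numpy as np

def predict(X, w, b):
    # Same forward pass as above: one dot product per row, plus the bias
    return X.dot(w) + b

X_demo = np.array([[25.0, 1.0], [30.0, 3.0]])  # two rows of (age, experience)
w_demo = np.array([1000.0, 500.0])             # hypothetical weights
b_demo = 2000.0                                # hypothetical bias

y_hat_demo = predict(X_demo, w_demo, b_demo)
# row 0: 25*1000 + 1*500 + 2000 = 27500
# row 1: 30*1000 + 3*500 + 2000 = 33500
print(y_hat_demo)
```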

# **Loss Function (MSE)**

### We implement Mean Squared Error (MSE): $\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$. It summarizes, in a single number, how far our predictions are from the actual values.

In [7]:
def mean_squared_error(y, y_hat):
    # Calculates the average of the squared errors
    loss = ((y_hat - y) ** 2).mean()
    return loss

# Think About This (Step 5):
# - Why square?: To remove signs (penalize over/under-estimates equally) and amplify large errors
# - Very wrong prediction: MSE will penalize it heavily due to the squaring
# - Why not absolute error?: Squared error is differentiable everywhere, while absolute error has a kink at zero, so its gradient is less convenient
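
The effect of squaring is easiest to see on a tiny made-up example where one prediction is badly wrong. The numbers below are illustrative only, and mean absolute error is computed purely for comparison.

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 400.0])  # the last prediction is off by 100

errors = y_pred - y_true                  # [10, -10, 100]
mse = (errors ** 2).mean()                # (100 + 100 + 10000) / 3 = 3400
mae = np.abs(errors).mean()               # (10 + 10 + 100) / 3 = 40

# Share of the total squared error contributed by the single large miss
share = errors[-1] ** 2 / (errors ** 2).sum()  # 10000 / 10200, about 0.98

print(mse, mae, round(share, 2))
```

The one large miss accounts for about 98% of the squared error, which is what "amplify large errors" means in practice.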

# **Gradient Computation**

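### Gradients tell the model how to adjust the weights and bias to reduce the loss. Each gradient points in the direction that increases the loss, so we later step the opposite way; its size indicates how strongly the loss responds to that parameter.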

In [8]:
def compute_gradients(X, y, y_hat):
    N = len(y)
    # Gradient of MSE w.r.t weights: (2/N) * X^T * (error)
    dw = (2/N) * X.T.dot(y_hat - y)
    # Gradient of MSE w.r.t bias: (2/N) * sum(error)
    db = (2/N) * (y_hat - y).sum()
    return dw, db

# Think About This (Step 6):
# - X in dw: Because weights are tied to specific features; bias is an independent offset
# - Error term: Appears because gradients are calculated based on the difference between prediction and truth
# - Error is zero: If every error is zero, both gradients are zero, so updates stop (the model fits the training data exactly)
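
One way to gain confidence in hand-derived gradients is to compare them with finite differences. This is a sketch on small made-up data; the helper `loss_at` and the step size `eps` are illustrative choices, not part of the notebook.

```python
import numpy as np

def mean_squared_error(y, y_hat):
    return ((y_hat - y) ** 2).mean()

def compute_gradients(X, y, y_hat):
    N = len(y)
    dw = (2 / N) * X.T.dot(y_hat - y)
    db = (2 / N) * (y_hat - y).sum()
    return dw, db

# Small made-up data and parameters, chosen only for the check
X_demo = np.array([[1.0, 2.0], [3.0, 0.5], [2.0, 4.0]])
y_demo = np.array([3.0, 1.0, 5.0])
w_demo = np.array([0.3, -0.2])
b_demo = 0.1

dw, db = compute_gradients(X_demo, y_demo, X_demo.dot(w_demo) + b_demo)

def loss_at(w, b):
    # Loss as a function of the parameters, with the data held fixed
    return mean_squared_error(y_demo, X_demo.dot(w) + b)

# Central differences: nudge one parameter at a time by +/- eps
eps = 1e-6
num_dw = np.zeros_like(w_demo)
for j in range(len(w_demo)):
    step = np.zeros_like(w_demo)
    step[j] = eps
    num_dw[j] = (loss_at(w_demo + step, b_demo) - loss_at(w_demo - step, b_demo)) / (2 * eps)
num_db = (loss_at(w_demo, b_demo + eps) - loss_at(w_demo, b_demo - eps)) / (2 * eps)

weights_match = np.allclose(dw, num_dw, atol=1e-5)
bias_matches = np.isclose(db, num_db, atol=1e-5)
print(weights_match, bias_matches)
```

If either comparison fails, the analytic formula or its implementation is the first place to look.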

# **Parameter Update (Gradient Descent)**
### We update the parameters using the subtraction rule: $w = w - \eta \cdot dw$ and $b = b - \eta \cdot db$. The learning rate ($\eta$) controls the size of each update step.

In [9]:
def update_parameters(w, b, dw, db, lr):
    # Move parameters in the opposite direction of the gradient to reduce loss
    w = w - lr * dw
    b = b - lr * db
    return w, b
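
A single update step with hypothetical gradients shows the sign convention: a negative gradient raises the parameter and a positive gradient lowers it.

```python
import numpy as np

def update_parameters(w, b, dw, db, lr):
    # Same rule as above: step against the gradient
    w = w - lr * dw
    b = b - lr * db
    return w, b

w0 = np.array([0.0, 0.0])
b0 = 0.0
dw_demo = np.array([-10.0, 4.0])  # hypothetical gradients
db_demo = -2.0

w1, b1 = update_parameters(w0, b0, dw_demo, db_demo, lr=0.1)
# w1 = [0 - 0.1*(-10), 0 - 0.1*4] = [1.0, -0.4]
# b1 = 0 - 0.1*(-2) = 0.2
print(w1, b1)
```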

# **Training Loop**

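### This loop repeatedly predicts, calculates the loss, and updates the weights for a set number of epochs. We use a very small learning rate because the features are unscaled: for a linear model with MSE loss, the largest stable step size is set by the magnitude of the inputs (ages in the tens here), not by the size of the income values.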

In [10]:
# Hyperparameters
lr = 0.0001  # Small learning rate keeps updates stable with unscaled features
epochs = 1000

for epoch in range(epochs):
    # 1. Forward pass
    y_hat = predict(X, w, b)

    # 2. Compute loss
    loss = mean_squared_error(y, y_hat)

    # 3. Compute gradients
    dw, db = compute_gradients(X, y, y_hat)

    # 4. Update weights and bias
    w, b = update_parameters(w, b, dw, db, lr)

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")

# Think About This (Step 8):
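# - Loss trend: Falls at every logged epoch, steeply at first and then slowly; it is still decreasing at epoch 900, so training has not fully converged
# - If loss increases: The learning rate is probably too high, causing the updates to overshoot the minimum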

Epoch 0, Loss: 1727049635.0
Epoch 100, Loss: 66491868.55311352
Epoch 200, Loss: 61752567.201190114
Epoch 300, Loss: 58616531.07847049
Epoch 400, Loss: 56528801.53951118
Epoch 500, Loss: 55126542.02946697
Epoch 600, Loss: 54172526.94885703
Epoch 700, Loss: 53511656.14292054
Epoch 800, Loss: 53042523.72795741
Epoch 900, Loss: 52698829.56325033
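
The slow tail in the log above is typical of gradient descent on unscaled features. A common remedy, which this notebook does not use, is to standardize each feature first. The sketch below is an assumption-laden illustration: it runs on the five preview rows only, uses a hypothetical learning rate of `0.1`, and compares the result with NumPy's closed-form least-squares solution.

```python
import numpy as np

# Five rows from the printed preview, used as a stand-in for the full dataset
X_demo = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10]], dtype=float)
y_demo = np.array([30450, 35670, 31580, 40130, 47830], dtype=float)

# Standardize each feature column to zero mean and unit variance
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)
X_std = (X_demo - mu) / sigma

w_s = np.zeros(X_std.shape[1])
b_s = 0.0
lr_s = 0.1  # assumed value, 1000x larger than the 0.0001 used above
N = len(y_demo)

for _ in range(1000):
    err = X_std.dot(w_s) + b_s - y_demo
    w_s = w_s - lr_s * (2 / N) * X_std.T.dot(err)
    b_s = b_s - lr_s * (2 / N) * err.sum()

# Reference: closed-form least squares on the same standardized features
A = np.column_stack([X_std, np.ones(N)])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

converged = np.allclose(np.append(w_s, b_s), coef)
print(converged)
```

With this approach a new input must be standardized with the same `mu` and `sigma` before prediction, and the learned weights are expressed per standard deviation rather than per year.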


# **Final Evaluation**

### After training, we check the final weights and bias and use the model to predict the income of a new candidate.

In [12]:
print(f"Final Weights (age, experience): {w}")
print(f"Final Bias: {b}")

# Example prediction:
new_candidate = np.array([50, 15])
predicted_income = new_candidate.dot(w) + b
print(f"Predicted income for [50, 15]: {predicted_income}")

# Think About This (Step 9):
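# - Reasonable?: Plausible. The candidate is older and more experienced than any of the five rows shown, and the prediction (about 59,125) is above their highest income (47,830). The training loss near 5.3e7 implies typical errors of roughly 7,000, so treat it as a rough estimate
# - Better than threshold rules?: Yes, because it produces a continuous estimate that scales with the inputs instead of a few fixed buckets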

Final Weights (age, experience): [ 764.75405919 1371.03430441]
Final Bias: 321.73641174472493
Predicted income for [50, 15]: 59124.953937381215
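
The prediction can be reproduced by hand from the printed parameters, which also shows how much each term contributes. The values below are copied from the output above; the variable names are illustrative.

```python
import numpy as np

# Values copied from the printed output above
w_final = np.array([764.75405919, 1371.03430441])
b_final = 321.73641174472493
new_candidate = np.array([50, 15])

age_part = new_candidate[0] * w_final[0]         # 50 * 764.75...,  about 38237.70
experience_part = new_candidate[1] * w_final[1]  # 15 * 1371.03..., about 20565.51
total = age_part + experience_part + b_final     # about 59124.95

print(round(float(total), 2))
```

Most of the estimate comes from the age term, while the bias contributes only a few hundred.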
