In [2]:
import pandas as pd
import numpy as np

data = pd.read_csv("multiple_linear_regression_dataset.csv")

print(data.head())
print(data.columns)
print(data.shape)

   age  experience  income
0   25           1   30450
1   30           3   35670
2   47           2   31580
3   32           5   40130
4   43          10   47830
Index(['age', 'experience', 'income'], dtype='object')
(20, 3)


In [None]:
"""
Inputs: age and experience
These features are used to predict income (salary).

Output: income
This is the value the model tries to predict.

Number of features: 2
So the model needs two weights, one per feature, plus a single bias.
"""

In [5]:
# Inputs (features)
X = data[["age", "experience"]].values

# Output (target)
y = data["income"].values

In [None]:
"""
Shape of X: (20, 2)
There are 20 samples and 2 input features (age and experience).

Shape of y: (20,)
There are 20 target values, one salary for each sample.

X has 2 columns because the model uses two input features.
y is a 1-D array with no column axis because we predict a single output (income) per sample.
"""
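The shapes described above can be checked directly. The sketch below is only an illustration: it uses the five rows printed by data.head(), typed in by hand so it runs without the CSV, and the names sample, X_sample and y_sample are made up for this check.

```python
import numpy as np

# The five rows printed by data.head() above (not the full 20-row dataset).
sample = np.array([
    [25, 1, 30450],
    [30, 3, 35670],
    [47, 2, 31580],
    [32, 5, 40130],
    [43, 10, 47830],
])

X_sample = sample[:, :2]   # columns: age, experience
y_sample = sample[:, 2]    # column: income

print(X_sample.shape)  # (5, 2): samples x features
print(y_sample.shape)  # (5,): one target per sample
print(y_sample.ndim)   # 1: y has no second axis
```

With the full dataset the first dimension would be 20 instead of 5; the second dimension of X stays 2.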

In [6]:
n_features = X.shape[1]
w = np.zeros(n_features)
b = 0.0

In [None]:
"""
We need one weight per feature because each input should have its own importance.
Each weight tells how strongly that feature affects the salary prediction.

Bias is separate because it shifts the whole prediction up or down by a constant.
Without it, the prediction would be forced to 0 whenever all inputs are zero.

Initializing with large values is risky because the first predictions may be very large,
the loss can explode, and training may become unstable or slow to converge.
Zeros (or small values) are a safe starting point here: the MSE loss of a linear
model is convex, so the starting point does not decide which minimum is reached.
"""
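A small sketch of the role of the bias: with all inputs at zero, the weighted sum vanishes and the prediction equals b. The weights and bias below (w_demo, b_demo) are hypothetical values chosen for illustration, not values learned in this notebook.

```python
import numpy as np

def predict(X, w, b):
    return X.dot(w) + b

w_demo = np.array([2.0, 3.0])   # hypothetical weights
b_demo = 10.0                   # hypothetical bias
zeros = np.zeros((1, 2))        # one sample with age = 0, experience = 0

print(predict(zeros, w_demo, b_demo))  # [10.]  -> only the bias is left
print(predict(zeros, w_demo, 0.0))     # [0.]   -> without a bias the output is stuck at 0
```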

In [7]:
def predict(X, w, b):
    y_hat = X.dot(w) + b
    return y_hat

In [None]:
"""
There is no activation function because this is a regression problem.
We want to predict a number (salary), not a class label.
Activation functions like step or sigmoid restrict the output range,
which is not suitable for numeric prediction.

y_hat can take any real value.
It can be small or large, positive or negative, whole or fractional,
depending on the inputs and the parameters.

This is different from logistic regression because logistic regression
uses a sigmoid function and outputs probabilities between 0 and 1.
Here we directly output a number without any restriction.
"""
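To make the contrast concrete, the sketch below passes the same raw scores through the identity (what this model does) and through a sigmoid (what logistic regression would do). The sigmoid helper and the scores in z are assumptions added for illustration; they are not part of the notebook's model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores X.dot(w) + b, including one salary-sized value.
z = np.array([-5.0, 0.0, 5.0, 40000.0])

linear_out = z             # linear regression returns the score unchanged
logistic_out = sigmoid(z)  # logistic regression squashes it into [0, 1]

print(linear_out)
print(logistic_out)
```

A salary-sized score such as 40000 survives in the linear output but saturates at 1.0 after the sigmoid, which is why no activation is applied here.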

In [8]:
def mean_squared_error(y, y_hat):
    loss = ((y_hat - y) ** 2).mean()
    return loss

In [None]:
"""
We square the error so that positive and negative errors do not cancel and so
that large mistakes are penalized more strongly: doubling an error quadruples its penalty.

If one prediction is very wrong, its squared error becomes very large,
so the total loss increases a lot and the gradient pushes hardest on that sample.
The flip side is that MSE is sensitive to outliers.

We do not use absolute error here because it is not differentiable at zero and
its gradient has the same magnitude for small and large errors. Squared error is
smooth, and its gradient shrinks as predictions improve, which suits gradient descent.
"""
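A short numeric comparison of squared and absolute error, assuming four made-up targets where three predictions are off by 1 and one is off by 10. The values are illustrative only.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([11.0, 21.0, 31.0, 50.0])   # last prediction is off by 10

mse = ((y_pred - y_true) ** 2).mean()
mae = np.abs(y_pred - y_true).mean()

print(mse)  # 25.75 -> dominated by the single large error (100 of the 103 total)
print(mae)  # 3.25  -> the large error counts only in proportion to its size
```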

In [9]:
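def compute_gradients(X, y, y_hat):
    N = len(y)
    error = y_hat - y                 # prediction error, shape (N,)
    dw = (2 / N) * X.T.dot(error)     # gradient of MSE w.r.t. w, one entry per feature
    db = (2 / N) * error.sum()        # gradient of MSE w.r.t. b, a scalar
    return dw, db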

In [None]:
"""
X appears in dw because weights are multiplied with the input features.
So the gradient of weights depends on both the error and the input values.

X does not appear in db because bias is not multiplied by any feature.
Bias is just a constant shift, so its gradient depends only on the error.

The error term appears everywhere because learning is based on how wrong
the prediction is. Bigger error means bigger correction in weights and bias.

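If every error is zero, the gradients are zero, the weights do not change,
and learning stops because the model already fits the data exactly.
Gradients can also reach zero while some errors remain: that is the best
linear fit, where no change in w or b reduces the loss further.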
"""
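One way to gain confidence in the gradient formulas is to compare them with finite differences of the loss. The sketch below repeats the notebook's three functions so it runs on its own; the data and parameters (X_chk, y_chk, w_chk, b_chk) and the step eps are made-up values for the check.

```python
import numpy as np

def predict(X, w, b):
    return X.dot(w) + b

def mean_squared_error(y, y_hat):
    return ((y_hat - y) ** 2).mean()

def compute_gradients(X, y, y_hat):
    N = len(y)
    dw = (2 / N) * X.T.dot(y_hat - y)
    db = (2 / N) * (y_hat - y).sum()
    return dw, db

# Small made-up problem, used only for the check.
X_chk = np.array([[1.0, 2.0], [3.0, 1.0], [0.5, 4.0]])
y_chk = np.array([3.0, 5.0, 2.0])
w_chk = np.array([0.3, -0.2])
b_chk = 0.1

dw, db = compute_gradients(X_chk, y_chk, predict(X_chk, w_chk, b_chk))

# Central finite differences: nudge one parameter at a time and watch the loss.
eps = 1e-6
num_dw = np.zeros_like(w_chk)
for j in range(len(w_chk)):
    step = np.zeros_like(w_chk)
    step[j] = eps
    loss_plus = mean_squared_error(y_chk, predict(X_chk, w_chk + step, b_chk))
    loss_minus = mean_squared_error(y_chk, predict(X_chk, w_chk - step, b_chk))
    num_dw[j] = (loss_plus - loss_minus) / (2 * eps)

loss_plus = mean_squared_error(y_chk, predict(X_chk, w_chk, b_chk + eps))
loss_minus = mean_squared_error(y_chk, predict(X_chk, w_chk, b_chk - eps))
num_db = (loss_plus - loss_minus) / (2 * eps)

print("dw matches:", np.allclose(dw, num_dw, atol=1e-5))
print("db matches:", np.isclose(db, num_db, atol=1e-5))
```

If either comparison fails after a change to compute_gradients, the analytic formula is the first place to look.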

In [10]:
def update_parameters(w, b, dw, db, lr):
    w = w - lr * dw
    b = b - lr * db
    return w, b

In [11]:
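lr = 0.0001      # learning rate (step size)
epochs = 1000    # number of gradient descent steps over the full dataset
for epoch in range(epochs):
    y_hat = predict(X, w, b)                      # forward pass
    loss = mean_squared_error(y, y_hat)           # loss before this epoch's update
    dw, db = compute_gradients(X, y, y_hat)       # gradients of the loss
    w, b = update_parameters(w, b, dw, db, lr)    # one gradient descent step
    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")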

Epoch 0, Loss: 1727049635.0
Epoch 100, Loss: 66491868.55311352
Epoch 200, Loss: 61752567.201190114
Epoch 300, Loss: 58616531.07847049
Epoch 400, Loss: 56528801.53951118
Epoch 500, Loss: 55126542.02946697
Epoch 600, Loss: 54172526.94885703
Epoch 700, Loss: 53511656.14292054
Epoch 800, Loss: 53042523.72795741
Epoch 900, Loss: 52698829.56325033


In [None]:
"""
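Yes, loss should decrease over time if learning is correct.
This means the model is improving and predictions are getting closer to actual values.
In the log above the loss drops sharply in the first 100 epochs and is still
falling slowly at epoch 900, so training has not fully converged after 1000 epochs.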

If loss increases, it usually means the learning rate is too high
or the gradients are wrong. The updates may be too large and unstable.

Learning rate and epochs work together:
learning rate controls how big each step is,
epochs control how many steps we take.
Small learning rate needs more epochs,
large learning rate needs fewer epochs but may be unstable.
"""
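A common way to ease the trade-off between learning rate and epochs is to standardize the features first. This is a technique the notebook does not use; the sketch below is an assumption-laden illustration, run only on the five rows printed by data.head(), with an assumed learning rate of 0.1. It compares gradient descent against the closed-form least-squares fit on the same standardized data.

```python
import numpy as np

# The five rows printed by data.head(), typed in by hand (not the full dataset).
X_s = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10]], dtype=float)
y_s = np.array([30450, 35670, 31580, 40130, 47830], dtype=float)

# Standardize each feature: zero mean, unit standard deviation.
mu = X_s.mean(axis=0)
sigma = X_s.std(axis=0)
X_std = (X_s - mu) / sigma

w_s = np.zeros(X_std.shape[1])
b_s = 0.0
lr_s = 0.1          # assumed value, far larger than the 0.0001 used above
N = len(y_s)

for _ in range(1000):
    err = X_std.dot(w_s) + b_s - y_s
    w_s -= lr_s * (2 / N) * X_std.T.dot(err)
    b_s -= lr_s * (2 / N) * err.sum()

# Closed-form least-squares solution on the same standardized data.
A = np.column_stack([X_std, np.ones(N)])
theta, *_ = np.linalg.lstsq(A, y_s, rcond=None)

print("gradient descent:", w_s, b_s)
print("least squares:   ", theta[:2], theta[2])
```

On this tiny sample the two results agree, so 1000 epochs are enough once the features share a scale. The learned weights are in standardized units, so any new input would have to be transformed with the same mu and sigma before predicting.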

In [12]:
print("Final weights:", w)
print("Final bias:", b)
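# Feature order must match the columns of X: [age, experience].
# As written, this candidate has age 4.5 and experience 68.
new_candidate = np.array([4.5, 68])
predicted_salary = predict(new_candidate, w, b)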
print("Predicted salary:", predicted_salary)

Final weights: [ 764.75405919 1371.03430441]
Final bias: 321.73641174472493
Predicted salary: 96993.4623777421


In [None]:
"""
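A prediction is reasonable if it is close to the income values seen in the
dataset and is not extremely large or negative. Here the predicted 96,993 is
about twice the largest income in the rows shown by data.head() (47,830).
The input [4.5, 68] is read as age 4.5 and experience 68, far outside the
ages and experience levels in those rows, so the model is extrapolating
and the value should not be trusted.

Within the range of the training data, the model interpolates smoothly because
linear regression is a continuous linear function of its inputs. Small changes
in inputs lead to small changes in output.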

This is better than threshold rules because threshold rules give only
fixed or step-like outputs, while regression gives precise numeric
predictions and adapts smoothly to new inputs.
"""
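Because the model only sees positions in an array, the order of the values in a new input matters. The sketch below uses the final weights and bias printed above and a hypothetical helper, predict_candidate, that builds the input by feature name. The second candidate (age 68, experience 4.5) is an assumption about what might have been intended, not something stated in the notebook.

```python
import numpy as np

# Final parameters copied from the printed output above.
w_final = np.array([764.75405919, 1371.03430441])
b_final = 321.73641174472493

FEATURE_ORDER = ["age", "experience"]  # must match the columns used to build X

def predict_candidate(candidate, w, b):
    """Hypothetical helper: candidate is a dict keyed by feature name."""
    x = np.array([candidate[name] for name in FEATURE_ORDER], dtype=float)
    return x.dot(w) + b

# The input exactly as the notebook passes it: age 4.5, experience 68.
as_written = predict_candidate({"age": 4.5, "experience": 68}, w_final, b_final)

# The same two numbers with the roles swapped (assumed alternative reading).
swapped = predict_candidate({"age": 68, "experience": 4.5}, w_final, b_final)

print(as_written)  # about 96993.46, the value printed above
print(swapped)
```

The two results differ considerably, which shows how much the ordering of the input vector affects the prediction. Both inputs lie outside the ranges visible in data.head(), so neither number should be read as a reliable salary estimate.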