In [1]:
import pandas as pd
df = pd.read_csv("multiple_linear_regression_dataset.csv")

In [2]:
df.head()

   age  experience  income
0   25           1   30450
1   30           3   35670
2   47           2   31580
3   32           5   40130
4   43          10   47830


In [3]:
df.columns

Index(['age', 'experience', 'income'], dtype='object')

In [4]:
df.shape

(20, 3)

In [5]:
#Which columns are inputs?
#age, experience

#Which column is the output?
#income

#How many features does your model need to handle?
#2

In [6]:
x = df[["age", "experience"]].values
y = df["income"].values

In [7]:
#What is the shape of X ?
#(20, 2)

#What is the shape of y ?
#(20,)

#Why does X have 2 columns but y only one?
#X has 2 columns because each row holds two input features (age and experience). y is a 1-D array of shape (20,) because each row has a single target value (income) that we are trying to predict.
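To make the shapes concrete, here is a small sketch that uses only the five rows printed by df.head(), not the full 20-row dataset. The names X_head, y_head and y_column are illustrative and do not appear in the notebook. The last lines show a pitfall that the 1-D shape of y avoids: subtracting a 1-D array from a column vector silently broadcasts to a square matrix.

```python
import numpy as np

# The five rows shown by df.head() (illustrative subset, not the full dataset).
X_head = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10]], dtype=float)
y_head = np.array([30450, 35670, 31580, 40130, 47830], dtype=float)

print(X_head.shape)  # (5, 2): one row per sample, one column per feature
print(y_head.shape)  # (5,):   one target value per sample

# Pitfall: a column-vector target broadcasts against 1-D predictions.
y_column = y_head.reshape(-1, 1)   # shape (5, 1)
bad_diff = y_column - y_head       # broadcasts to every pairwise difference
print(bad_diff.shape)  # (5, 5), not the (5,) residual vector we want
```

Keeping y one-dimensional means y - y_hat stays a plain vector of residuals, assuming the predictions are 1-D as well.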

In [8]:
import numpy as np
n_features = x.shape[1]
w = np.zeros(n_features)
b = 0.0

In [9]:
#Why do we need one weight per feature?
#We need one weight per feature because each feature contributes differently to the prediction of the output variable. The weights allow us to assign different levels of importance to each feature in the linear regression model.

#Why is Bias separate?
#The bias is separate because it is not multiplied by any feature. It acts as the intercept: it lets the model predict a nonzero value when all input features are zero and shifts the regression plane up or down so it can fit the data better.

#Would initializing with large values be risky?
#Yes. Large initial weights start the model far from the optimum, so the initial loss and gradients are very large. With a fixed learning rate, the first updates can overshoot the minimum and the loss may diverge, or training may simply need many more epochs to converge.
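A minimal sketch of what the starting point does to the first loss value, again using only the five df.head() rows as stand-in data. The helper initial_loss and the large starting weights of 1e4 are hypothetical choices for illustration. With zero initialization every prediction is 0, so the first loss is just the mean of y squared, which explains why the loss at epoch 0 is so large for incomes in the tens of thousands.

```python
import numpy as np

# Illustrative subset: the five rows shown by df.head().
X_head = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10]], dtype=float)
y_head = np.array([30450, 35670, 31580, 40130, 47830], dtype=float)

def initial_loss(w, b):
    """MSE of the untrained model for the given starting parameters."""
    y_hat = X_head @ w + b
    return np.mean((y_head - y_hat) ** 2)

loss_zero = initial_loss(np.zeros(2), 0.0)
loss_large = initial_loss(np.array([1e4, 1e4]), 0.0)  # hypothetical large start

# With w = 0 and b = 0 the model predicts 0, so the loss equals mean(y**2).
print(np.isclose(loss_zero, np.mean(y_head ** 2)))  # True
# A large start is even further from the data than predicting 0.
print(loss_large > loss_zero)  # True
```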

In [10]:
def predict(x, w, b):
    y_hat = np.dot(x, w) + b
    return y_hat

In [11]:
#Why is there no activation function?
#The target (income) is a continuous, unbounded value. An activation function would squash or warp the output range, so the model returns the raw linear combination (the identity activation).

# What kind of values can y_hat take?
# y_hat can take any real value, positive or negative, which is appropriate for regression tasks.

# How is this different from logistic regression?
# Logistic regression is used for classification. It passes the same linear output through a sigmoid so that the result is a probability between 0 and 1.
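The difference in output range can be sketched in a few lines. The sigmoid helper below is not part of the notebook; it is the standard logistic function, included only to contrast with the identity output used by predict.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])   # example linear outputs (x . w + b)

linear_output = z                # linear regression returns z unchanged
logistic_output = sigmoid(z)     # logistic regression maps z to a probability

print(linear_output)    # any real value is possible
print(logistic_output)  # always strictly between 0 and 1; sigmoid(0) is 0.5
```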

In [12]:
def mean_squared_error(y, y_hat):
    loss = np.mean((y - y_hat) ** 2)
    return loss

In [13]:
#Why square the error?
#Squaring the error ensures that all errors are positive and gives more weight to larger errors, which can help the model to focus on minimizing larger mistakes during training.

#What happens if one prediction is very wrong?
#If one prediction is very wrong, it will have a large squared error, which can significantly increase the overall mean squared error. This can make the model more sensitive to outliers and may lead to a less accurate fit for the majority of the data points.

#Why not just take the absolute value of the error?
#Taking the absolute value of the error (mean absolute error) is another option, but it does not penalize larger errors as much as mean squared error. Mean squared error can be more effective in certain cases where we want to give more importance to larger errors, while mean absolute error can be more robust to outliers.
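A small worked comparison of the two losses, using made-up numbers rather than the income data. Three predictions are off by 1 in both cases; in the second case one prediction is off by 20. The variable names are illustrative.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0])
y_clean = np.array([11.0, 11.0, 12.0, 12.0])    # every prediction off by 1
y_outlier = np.array([11.0, 11.0, 12.0, 33.0])  # last prediction off by 20

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

print(mse(y_true, y_clean), mae(y_true, y_clean))      # 1.0 1.0
print(mse(y_true, y_outlier), mae(y_true, y_outlier))  # 100.75 5.75
```

One bad prediction multiplies the MSE by about 100 but the MAE by less than 6, which is the sense in which MSE is more sensitive to outliers.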

In [14]:
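def compute_gradients(x, y, y_hat):
    # Gradients of the MSE loss with respect to the weights and the bias.
    n = len(y)
    # d(loss)/dw: each residual is scaled by the feature values of its row.
    dw = (-2/n) * np.dot(x.T, (y - y_hat))
    # d(loss)/db: the bias shifts every prediction equally, so the residuals are just summed.
    db = (-2/n) * np.sum(y - y_hat)
    return dw, db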

In [15]:
#Why does X appear in dw but not in db ?
#Each weight multiplies a feature, so the derivative of y_hat with respect to w_j is x_j, and the chain rule carries x into dw. The bias is added as a constant, so the derivative of y_hat with respect to b is 1 and db is just the scaled sum of the errors. The features still influence db indirectly, through y_hat.

#Why does the error term appear everywhere?
#The error term appears everywhere because it represents the difference between the true values and the predicted values, which is the basis for calculating both the loss and the gradients. The model uses this error to adjust the weights and bias in order to minimize the loss during training.

#What happens if the error is zero?
#It means that the model's predictions perfectly match the true values, resulting in a mean squared error of zero. In this case, the gradients (dw and db) would also be zero, indicating that there is no need to update the weights and bias further, as the model has already achieved optimal performance on the training data.
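One way to gain confidence in the gradient formulas is a finite-difference check: nudge each parameter slightly, measure how the loss changes, and compare with the analytic gradient. The sketch below repeats the notebook's three functions so it runs on its own; the tiny dataset, the starting parameters and the names ending in _demo or starting with num_ are made up for illustration.

```python
import numpy as np

def predict(x, w, b):
    return np.dot(x, w) + b

def mean_squared_error(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def compute_gradients(x, y, y_hat):
    n = len(y)
    dw = (-2 / n) * np.dot(x.T, (y - y_hat))
    db = (-2 / n) * np.sum(y - y_hat)
    return dw, db

# Made-up data and parameters, chosen only to exercise the formulas.
x_demo = np.array([[1.0, 2.0], [3.0, 1.0], [0.0, 4.0]])
y_demo = np.array([5.0, 4.0, 7.0])
w_demo = np.array([0.5, -0.3])
b_demo = 0.1

dw, db = compute_gradients(x_demo, y_demo, predict(x_demo, w_demo, b_demo))

def loss_at(w, b):
    return mean_squared_error(y_demo, predict(x_demo, w, b))

# Central differences: (loss(p + eps) - loss(p - eps)) / (2 * eps)
eps = 1e-6
num_dw = np.zeros_like(w_demo)
for j in range(len(w_demo)):
    step = np.zeros_like(w_demo)
    step[j] = eps
    num_dw[j] = (loss_at(w_demo + step, b_demo) - loss_at(w_demo - step, b_demo)) / (2 * eps)
num_db = (loss_at(w_demo, b_demo + eps) - loss_at(w_demo, b_demo - eps)) / (2 * eps)

print(np.allclose(dw, num_dw, atol=1e-5))  # True
print(np.isclose(db, num_db, atol=1e-5))   # True
```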

In [16]:
def update_parameters(w, b, dw, db, lr):
    w = w - lr * dw
    b = b - lr * db
    return w, b

In [17]:
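lr = 0.0001    # learning rate (step size of each update)
epochs = 1000  # number of full passes over the training data
for epoch in range(epochs):
    y_hat = predict(x, w, b)                    # forward pass with the current parameters
    loss = mean_squared_error(y, y_hat)         # loss before this epoch's update
    dw, db = compute_gradients(x, y, y_hat)     # gradients of the loss
    w, b = update_parameters(w, b, dw, db, lr)  # one gradient descent step
    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")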

Epoch 0, Loss: 1727049635.0
Epoch 100, Loss: 66491868.553113505
Epoch 200, Loss: 61752567.2011901
Epoch 300, Loss: 58616531.07847047
Epoch 400, Loss: 56528801.53951118
Epoch 500, Loss: 55126542.02946697
Epoch 600, Loss: 54172526.94885705
Epoch 700, Loss: 53511656.14292053
Epoch 800, Loss: 53042523.72795741
Epoch 900, Loss: 52698829.56325033


In [18]:
#Does loss decrease over time?
#Yes. The loss falls from about 1.73e9 at epoch 0 to about 5.27e7 at epoch 900 as the weights and bias are updated along the negative gradient. The improvement per 100 epochs keeps shrinking, but the loss is still decreasing at epoch 900, so the model has not fully converged after 1000 epochs.

#What happens if it increases?
#If the loss increases, it may indicate that the learning rate is too high, causing the model to overshoot the optimal parameters during updates. This can lead to divergence and a failure to converge towards a minimum loss. In such cases, it may be necessary to reduce the learning rate or implement techniques like learning rate decay to stabilize training.

#How do learning rate and epochs interact?
#The learning rate sets the size of each step towards the minimum, while the number of epochs sets how many steps are taken. In this full-batch loop, one epoch is exactly one parameter update. A higher learning rate may need fewer epochs to converge, but it becomes unstable if it is too high. A lower learning rate is more stable but needs more epochs. The two have to be balanced: the learning rate must be small enough to be stable, and the number of epochs large enough for the loss to level off.
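The effect of the learning rate can be sketched on the five rows printed by df.head() (an illustrative subset, not the full dataset). The train helper and the two learning rates below are assumptions chosen for this subset: with unscaled ages in the 25 to 47 range, a step size of 1e-3 is already too large and the loss grows, while 1e-4 decreases it.

```python
import numpy as np

# Illustrative subset: the five rows shown by df.head().
X_head = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10]], dtype=float)
y_head = np.array([30450, 35670, 31580, 40130, 47830], dtype=float)

def train(lr, epochs):
    """Full-batch gradient descent from zeros; returns the loss at each epoch."""
    w = np.zeros(X_head.shape[1])
    b = 0.0
    n = len(y_head)
    losses = []
    for _ in range(epochs):
        error = y_head - (X_head @ w + b)
        losses.append(np.mean(error ** 2))
        # Same update as w - lr * dw with dw = (-2/n) * X.T @ error
        w = w + lr * (2 / n) * (X_head.T @ error)
        b = b + lr * (2 / n) * np.sum(error)
    return losses

losses_small = train(lr=1e-4, epochs=50)
losses_large = train(lr=1e-3, epochs=50)

print(losses_small[-1] < losses_small[0])  # True: loss decreases
print(losses_large[-1] > losses_large[0])  # True: loss blows up
```

The stable range depends on the scale of the features, so the exact threshold on the full dataset may differ from this five-row subset.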

In [19]:
print(f"Final weights: {w}")
print(f"Final bias: {b}")
new_candidate = np.array([[30, 5]])
predicted_income = predict(new_candidate, w, b)
print(f"Predicted income: {predicted_income[0]}")

Final weights: [ 764.75405919 1371.03430441]
Final bias: 321.7364117447249
Predicted income: 30119.529709500806


In [20]:
#Is the prediction reasonable?
#The prediction of about 30,120 is in the right order of magnitude, but it looks low. In the rows shown by df.head(), a 30-year-old with 3 years of experience earns 35,670 and a 32-year-old with 5 years earns 40,130, so a 30-year-old with 5 years of experience would be expected to land in that neighborhood. The underestimate is consistent with training not having converged yet: the loss was still falling at epoch 900.

#Does it interpolate smoothly?
#Since the model is a linear regression, it will interpolate smoothly between the training data points. The predicted income for a 30-year-old with 5 years of experience will be a linear combination of the weights and bias learned from the training data, which allows for smooth interpolation.

#Why is this better than threshold rules?
#Linear regression provides a continuous output, which allows for more nuanced predictions compared to threshold rules that classify inputs into discrete categories. This can be particularly beneficial when the relationship between the input features and the output variable is not strictly binary, allowing for more accurate modeling of real-world data.
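One way to judge a gradient descent prediction is to compare it with the closed-form least-squares fit. The sketch below does this on the five df.head() rows only, so its numbers will not match the notebook's 20-row results. Standardizing the features and the learning rate of 0.1 are assumptions added here, not steps from the notebook; they are used because gradient descent converges much faster on features with comparable scales.

```python
import numpy as np

# Illustrative subset: the five rows shown by df.head().
X_head = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10]], dtype=float)
y_head = np.array([30450, 35670, 31580, 40130, 47830], dtype=float)
candidate = np.array([30.0, 5.0])  # age 30, 5 years of experience

# Closed-form least squares, with a column of ones for the intercept.
A = np.column_stack([X_head, np.ones(len(X_head))])
coef, *_ = np.linalg.lstsq(A, y_head, rcond=None)
pred_ls = candidate @ coef[:2] + coef[2]

# Gradient descent on standardized features (zero mean, unit variance).
mean, std = X_head.mean(axis=0), X_head.std(axis=0)
X_scaled = (X_head - mean) / std
w, b, lr, n = np.zeros(2), 0.0, 0.1, len(y_head)
for _ in range(1000):
    error = y_head - (X_scaled @ w + b)
    w = w + lr * (2 / n) * (X_scaled.T @ error)
    b = b + lr * (2 / n) * np.sum(error)

# The candidate must be scaled with the same mean and std as the training data.
pred_gd = ((candidate - mean) / std) @ w + b

print(np.isclose(pred_gd, pred_ls))  # True: both methods agree
```

When the two predictions agree, gradient descent has reached the least-squares optimum; a large gap would indicate that training stopped early.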