
machine_learning/linear_regression.py doesn't give optimal coefficients #8847

Open · tianyizheng02 opened this issue Jun 30, 2023 · 12 comments
Labels: enhancement

@tianyizheng02
Contributor

tianyizheng02 commented Jun 30, 2023

Feature description

Related to issue #8844

TL;DR: the current implementation doesn't give optimal solutions, it calculates the SSE incorrectly, and we should add an implementation of a numerical method that actually gives optimal solutions.

In machine_learning/linear_regression.py, add the following code at the bottom of the main() function:

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Split the dataset into a feature column X and a target vector y
data = np.asarray(data.astype(float))
X = data[:, 0].reshape(-1, 1)
y = data[:, 1]

# Fit an OLS model as the reference solution
reg = LinearRegression().fit(X, y)
print(f"Sklearn coefficients: {reg.intercept_}, {reg.coef_}")

# Report the error measures: SSE, MSE, and half of the MSE
# (y is 1-D, so the .T transpose is a no-op)
sse = np.sum(np.square(reg.predict(X) - y.T))
print(f"{sse = }")
print(f"mse = {sse / len(y)}")
print(f"half mse = {sse / (2 * len(y))}")

# Plot the data with both regression lines
plt.scatter(X, y, color="lightgray")
plt.axline(xy1=(0, theta[0, 0]), slope=theta[0, 1], color="red", label="Gradient descent")
# NOTE: some matplotlib versions need a scalar here, i.e. slope=reg.coef_[0]
# (see the comments further down in this thread)
plt.axline(xy1=(0, reg.intercept_), slope=reg.coef_, color="blue", label="Sklearn")
plt.legend()
plt.show()

This code performs ordinary least squares (OLS) linear regression using sklearn as a point of reference to compare the current implementation against. It then calculates the sum of squared errors (SSE), mean squared error (MSE), and half of the MSE. To compare the outputs visually, the code uses matplotlib to plot the sklearn regression line and the regression line produced by the current implementation.
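For reference, the three error measures are related as follows (standard definitions, with $n$ data points, predictions $\hat{y}_i$, and targets $y_i$):

$$\mathrm{SSE} = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2, \qquad \mathrm{MSE} = \frac{\mathrm{SSE}}{n}, \qquad \text{half MSE} = \frac{\mathrm{SSE}}{2n}$$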

The code produces the following command line output:

...
At Iteration 100000 - Error is 128.03882
Resultant Feature vector: 
-9.34325
1.53067
Sklearn coefficients: -15.547901662158367, [1.6076036]
sse = 253795.17406773588
mse = 253.79517406773587
half mse = 126.89758703386794

As we can see, what the implementation calculates as the SSE (128.03882) is actually half of the MSE, meaning that the sum_of_square_error function is incorrect and needs to be fixed. Why the implementation uses half of the MSE, I have no clue; possibly it follows the common gradient-descent convention shown below.
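Under that convention, the cost is halved so that the factor of 2 produced by differentiating the squared error cancels (standard notation, with hypothesis $h_\theta$ over $n$ samples):

$$J(\theta) = \frac{1}{2n}\sum_{i=1}^{n}\big(h_\theta(x_i) - y_i\big)^2 \qquad\Longrightarrow\qquad \frac{\partial J}{\partial \theta_j} = \frac{1}{n}\sum_{i=1}^{n}\big(h_\theta(x_i) - y_i\big)\,x_{ij}$$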

Furthermore, we can see that both the regression coefficients and the errors are slightly off. This is because the current implementation works via gradient descent, meaning that it can only approximate the OLS regression coefficients. Meanwhile, libraries like numpy and sklearn calculate the mathematically optimal coefficients using numerical methods.
[Figure: linear_regression_plot — scatter of the data with the gradient descent line (red) and the sklearn line (blue)]
Although using gradient descent to perform linear regression does work, it's still suboptimal, and (AFAIK) it's not how linear regression is actually performed in practice. We can still include this implementation, but we should definitely also include an implementation of an optimal numerical method.
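For concreteness, here's a minimal sketch of what such an implementation could look like (the function name ols_linear_regression is just illustrative). With a column of ones prepended to $X$ for the intercept, the closed-form solution is $\theta = (X^\top X)^{-1} X^\top y$, but np.linalg.lstsq solves the same least-squares problem via SVD, which is more numerically stable than forming the inverse explicitly:

import numpy as np

def ols_linear_regression(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Ordinary least squares: return [intercept, slope_1, ..., slope_k]."""
    # Prepend a column of ones so the first coefficient is the intercept
    X_aug = np.column_stack((np.ones(len(X)), X))
    # Solve min ||X_aug @ theta - y||^2; lstsq uses SVD under the hood
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return theta

On the dataset above, this should reproduce sklearn's coefficients (about -15.548 and 1.608) up to floating-point error.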

tianyizheng02 added the enhancement label Jun 30, 2023
@Madhav-2808

This comment was marked as off-topic.

@tianyizheng02
Contributor Author

> Assign this task to me :)

@Madhav-2808 We don't assign tasks in this repo. If you want to work on it, then just make a PR.

@liakdimi1

@tianyizheng02 I tried to run your code to reproduce your graph, but I get this error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (2, 2) + inhomogeneous part.

@tianyizheng02
Contributor Author

@liakdimi1 I'm not able to recreate your error. Can you show your code and/or the error traceback?

@liakdimi1

@tianyizheng02 I changed slope=reg.coef_ to slope=reg.coef_[0] in the plt.axline call, which fixed it. (For a single feature, reg.coef_ is an array of shape (1,), so indexing it gives plt.axline the scalar slope it expects.)

liakdimi1 added a commit to liakdimi1/Python that referenced this issue Jul 22, 2023
@JrX970

JrX970 commented Sep 24, 2023

If the problem is just focusing on the univariate case (as plotted), wouldn't it be worthwhile ditching the iterative method and just doing a simple "sum of rectangular area over sum of square area" calculation?
Especially considering that you would have to manually edit the code to increase the number of features used, apply any data transformation, or alter the design matrix.
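For concreteness, here's a minimal sketch of that univariate closed form, reading "sum of rectangles over sum of squares" as the covariance sum over the variance sum (the function name univariate_ols is just illustrative):

import numpy as np

def univariate_ols(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Closed-form OLS for a single feature: returns (intercept, slope)."""
    x_mean, y_mean = x.mean(), y.mean()
    # slope = sum((x - xbar) * (y - ybar)) / sum((x - xbar) ** 2)
    slope = float(np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2))
    # The OLS line passes through the point of means
    intercept = float(y_mean - slope * x_mean)
    return intercept, slope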

@mahi01agarwal

This comment was marked as off-topic.

mahi01agarwal mentioned this issue Sep 28, 2023
@ssom01-github

This comment was marked as off-topic.

@tianyizheng02
Contributor Author

@mahi01agarwal @ssom01-github Please read the contributing guidelines. As it says, we do not assign issues, so do not ask for permission to work on an issue. If you want to contribute, just open a PR. However, at the moment, please wait until October 1 in your time zone to open a PR because we're currently not accepting new PRs ahead of Hacktoberfest.

mahi01agarwal mentioned this issue Sep 30, 2023
@UjjwalPardeshi

This comment has been minimized.

@Ademsk1

Ademsk1 commented Oct 18, 2023

Wondering whether we should split this out into

  • univariate_linear_regression.py => numerical method
  • multivariate_linear_regression.py => gradient descent approach

and deprecate the linear_regression.py file we have. That way individuals can choose.

I can't see any benefit to using this model for the univariate case, and I'm guessing that if someone importing this file never bothers to check it, they won't know they could be wasting resources.

Before that's done, we should definitely examine why this gradient descent isn't optimal. I'll try and check it out over the weekend :D

TheAlgorithms deleted a comment Mar 27, 2024
@smruthi-sumanth

Hello! My PR #11311 attempts to rectify this issue by implementing OLS instead of gradient descent. Can someone please review my code?
