In [1]:
import pandas as pd
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [2]:
data = pd.read_csv("train.csv")

In [3]:
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [5]:
linear_reg = LinearRegression()
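# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call only runs on older versions.
ridge_reg = Ridge(alpha=0.05, normalize=True)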

In [6]:
linear_reg.fit(X_train, y_train)
ridge_reg.fit(X_train, y_train)

If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:

from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(with_mean=False), Ridge())

If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:

kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)

Set parameter alpha to: original_alpha * n_samples. 


Ridge(alpha=0.05, normalize=True)
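
The warning above suggests moving the scaling into a pipeline and rescaling alpha. A minimal sketch of that suggestion follows. It uses synthetic data rather than train.csv, and the names X_demo, y_demo and ridge_pipeline are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (an assumption, not the real train.csv).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

original_alpha = 0.05
n_samples = X_demo.shape[0]

# Scaling is done by the pipeline; alpha is rescaled as the warning suggests.
ridge_pipeline = make_pipeline(
    StandardScaler(with_mean=False),
    Ridge(alpha=original_alpha * n_samples),
)
ridge_pipeline.fit(X_demo, y_demo)
demo_pred = ridge_pipeline.predict(X_demo)
```

On the real data, n_samples would be the number of rows in X_train, since that is what the model is fitted on.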

In [7]:
linear_pred = linear_reg.predict(X_test)
ridge_pred = ridge_reg.predict(X_test)

In [8]:
linear_mse = mean_squared_error(y_test, linear_pred)
ridge_mse = mean_squared_error(y_test, ridge_pred)

In [9]:
print(f"MSE without Ridge: {linear_mse}")
print(f"MSE with Ridge : {ridge_mse}")

MSE without Ridge: 5116399803.951063
MSE with Ridge : 4465770299.03716
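
MSE values this large are hard to read because they are in squared price units. A small sketch, using only the two numbers printed above, converts them to RMSE (same units as SalePrice) and computes the relative reduction. The variable names are chosen here for illustration. Because train_test_split was called without a random_state, a rerun of the notebook will give different numbers.

```python
import math

# MSE values copied from the output above.
linear_mse_reported = 5116399803.951063
ridge_mse_reported = 4465770299.03716

# RMSE is in the same units as SalePrice, which makes it easier to interpret.
linear_rmse = math.sqrt(linear_mse_reported)
ridge_rmse = math.sqrt(ridge_mse_reported)

# Relative reduction in MSE on this particular train/test split.
relative_reduction = (linear_mse_reported - ridge_mse_reported) / linear_mse_reported

print(f"RMSE without Ridge: {linear_rmse:.0f}")
print(f"RMSE with Ridge   : {ridge_rmse:.0f}")
print(f"Relative MSE reduction: {relative_reduction:.1%}")
```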


Underfitting describes a model that is too simple to capture the patterns in the training dataset. It fits the data poorly and performs badly on both the training data and new data.

Overfitting describes a model that is too closely tailored to its training dataset and tries to fit every data point, including noise. Such a model fails to generalize to other data because it has learned patterns that are specific to the training set.

Variance refers to the sensitivity of a model to the specific dataset it was trained on. Variance is high for an overfit model and low for an underfit one. Bias, on the other hand, refers to the error caused by a model being too simple to capture the complexity of the data. Bias is high for an underfit model and low for an overfit one.

There is a trade-off between bias and variance: as model complexity grows, variance tends to increase while bias decreases, and vice versa. A model that performs well needs to balance the two.
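
To make underfitting and overfitting concrete, here is a small sketch that fits polynomials of increasing degree to noisy synthetic data and compares training and test error. The data, the degrees (1, 3, 9) and all names are assumptions chosen for illustration; they are not part of the notebook above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples from a sine curve (synthetic, for illustration only)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_fit, y_fit = make_data(30)      # small training set
x_new, y_new = make_data(200)     # held-out data

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit; a higher degree means a more flexible model.
    coefs = np.polyfit(x_fit, y_fit, degree)
    train_mse[degree] = mse(y_fit, np.polyval(coefs, x_fit))
    test_mse[degree] = mse(y_new, np.polyval(coefs, x_new))
    print(f"degree={degree}: train MSE={train_mse[degree]:.4f}, "
          f"test MSE={test_mse[degree]:.4f}")
```

Training error can only go down as the degree grows, because each lower-degree fit is a special case of the higher-degree one. Test error has no such guarantee, and the gap between the two is the usual symptom of overfitting.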

The tuning parameter lambda (λ) specifies how strongly we penalize the flexibility of the model. The larger λ is, the more the coefficients are shrunk toward zero, especially those of less predictive features. In scikit-learn's Ridge, this parameter is called alpha.
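
The shrinking effect can be seen with the closed-form ridge solution. The sketch below uses synthetic data and omits the intercept for brevity; the helper ridge_coefficients and the λ values are hypothetical choices for illustration, not code from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
true_w = np.array([3.0, 0.0, -2.0, 0.5])
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=50)

def ridge_coefficients(X, y, lam):
    """Minimise ||y - Xw||^2 + lam * ||w||^2 (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

norms = []
for lam in (0.0, 1.0, 10.0, 100.0):
    w = ridge_coefficients(X_demo, y_demo, lam)
    norms.append(float(np.linalg.norm(w)))
    print(f"lambda={lam:>5}: coefficients={np.round(w, 3)}, L2 norm={norms[-1]:.3f}")
```

With λ = 0 this reduces to ordinary least squares; as λ grows, the norm of the coefficient vector decreases.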

The L1 regularization penalty is the sum of the absolute values of the coefficients; it restricts, or penalizes, the size of the coefficients. The L2 penalty instead sums the squared coefficients. The same two norms can also be used as loss functions on the prediction errors. When there are outliers in the dataset, the L2 loss is a poor choice because squaring the differences between the actual and predicted values makes the outliers dominate the error, while the L1 loss is much less sensitive to them.
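
A tiny worked example may help separate the two uses of L1 and L2. The coefficient and residual values below are made up for illustration.

```python
# Penalties on a (hypothetical) coefficient vector.
coefficients = [3.0, -4.0, 0.5]
l1_penalty = sum(abs(w) for w in coefficients)   # 3 + 4 + 0.5
l2_penalty = sum(w ** 2 for w in coefficients)   # 9 + 16 + 0.25

# Losses on (hypothetical) residuals; the last one is an outlier.
residuals = [1.0, -2.0, 1.5, 20.0]
absolute_loss = sum(abs(r) for r in residuals)
squared_loss = sum(r ** 2 for r in residuals)

# Share of the total loss that comes from the single outlier.
outlier_share_abs = abs(residuals[-1]) / absolute_loss
outlier_share_sq = residuals[-1] ** 2 / squared_loss

print(f"L1 penalty: {l1_penalty}, L2 penalty: {l2_penalty}")
print(f"Outlier share of absolute loss: {outlier_share_abs:.1%}")
print(f"Outlier share of squared loss : {outlier_share_sq:.1%}")
```

The outlier accounts for a larger share of the squared loss than of the absolute loss, which is why a model trained with squared error is pulled harder toward it.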

The general idea for solving overfitting and high variance is to make the model less complex. Regularization does this by discouraging the model from learning overly complex patterns.

Ridge regression adds an L2 penalty on the coefficients to the least-squares loss. Lasso regression adds an L1 penalty instead.
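
Written out, the two objectives differ only in the penalty term. The notation below is an assumption made for this sketch: w is the coefficient vector, x_i and y_i are the features and target of sample i, and λ is the tuning parameter.

```latex
\text{Ridge:}\quad \min_{w} \; \sum_{i=1}^{n} \left( y_i - x_i^{\top} w \right)^2 + \lambda \sum_{j=1}^{p} w_j^{2}
\qquad
\text{Lasso:}\quad \min_{w} \; \sum_{i=1}^{n} \left( y_i - x_i^{\top} w \right)^2 + \lambda \sum_{j=1}^{p} \lvert w_j \rvert
```

scikit-learn calls λ alpha, and its Lasso additionally scales the squared-error term by 1/(2n), so alpha values are not directly comparable between the two estimators.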
