**Learning Curves for Linear Regression**

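Learning curves help us understand how a model's performance changes as we increase the training data size. A plot of training error and validation error versus training set size can provide insights into whether a model is underfitting, overfitting, or performing well. When both errors plateau at a high value close to each other, the model is likely underfitting; a persistent gap between a low training error and a higher validation error points to overfitting.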

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection  import train_test_split
from sklearn.metrics          import mean_squared_error 
from sklearn.datasets         import fetch_openml
import plotly.express as px


from sklearn.compose       import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline      import Pipeline
from sklearn.impute        import SimpleImputer


In [8]:
# Load the King County house sales dataset (OpenML name: house_sales)
housing_sale = fetch_openml(name='house_sales', version=1, parser='pandas', target_column='price')
# Drop the date column into a new frame; an in-place drop on the fetched data triggers a SettingWithCopyWarning
X = housing_sale.data.drop(columns='date')
y = housing_sale.target


In [10]:
X

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,3,1.00,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639
2,2,1.00,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,4,3.00,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,3,2.00,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21608,3,2.50,1530,1131,3.0,0,0,3,8,1530,0,2009,0,98103,47.6993,-122.346,1530,1509
21609,4,2.50,2310,5813,2.0,0,0,3,8,2310,0,2014,0,98146,47.5107,-122.362,1830,7200
21610,2,0.75,1020,1350,2.0,0,0,3,7,1020,0,2009,0,98144,47.5944,-122.299,1020,2007
21611,3,2.50,1600,2388,2.0,0,0,3,8,1600,0,2004,0,98027,47.5345,-122.069,1410,1287


In [3]:
# Display data types to check for categorical features
print(X.dtypes)

bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object


In [4]:
categorical_features = X.select_dtypes(include=['object','category']).columns.tolist()
numeric_features     = X.select_dtypes(include=['int64','float64']).columns.tolist()

# Preprocessor for numeric data
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessor for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing for both types of data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

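# Preprocess the data
# Note: fitting on the full dataset before splitting lets test-set statistics
# (medians, means, variances) leak into the imputer and scaler; fitting on the
# training split alone would be the stricter approach.
X_processed = preprocessor.fit_transform(X)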

In [5]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)
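A stricter variant splits the raw data first and fits the preprocessing on the training portion only, so that no test-set statistics reach the imputer or scaler. The sketch below illustrates the idea on synthetic data; the `make_regression` data and every `demo`-style name are assumptions made for illustration and are not part of the notebook's own pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (illustrative only)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)

# Split the raw data before any statistics are computed
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# With preprocessing inside the pipeline, fit() only ever sees training rows
demo_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
demo_pipeline.fit(X_tr, y_tr)

# The scaler's means come from the training split alone
scaler_means = demo_pipeline.named_steps['scaler'].mean_
demo_r2 = demo_pipeline.score(X_te, y_te)
print("Test R^2:", demo_r2)
```

Passing a pipeline like this to a learning-curve routine also means the scaler is refit on each training subset, which matches how the model would be used in practice.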

**A Function to Plot the Learning Curves**

- The goal of any ML workflow is a model that generalizes well to unseen data.

- For each training-set size, fit the model on that subset, then measure its error on the same subset (training error) and on the held-out test set (validation error).

In [6]:
def plot_learning_curves_plotly(model, X_train, y_train, X_val, y_val, start_size=20, end_size=200):
    """Plot training and validation RMSE for training-set sizes in [start_size, end_size)."""
    train_errors, val_errors = [], []
    for m in range(start_size, end_size):
        # Fit the model on the first m training samples
        model.fit(X_train[:m], y_train[:m])

        # Predict on that training subset and on the full validation set
        y_train_predict = model.predict(X_train[:m])  # training subset
        y_val_predict = model.predict(X_val)          # validation (test) set

        # Calculate mean squared error for both sets
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))  # training MSE
        val_errors.append(mean_squared_error(y_val, y_val_predict))            # validation MSE

    # Convert errors to square root (Root Mean Squared Error)
    train_errors = np.sqrt(train_errors)
    val_errors   = np.sqrt(val_errors)

    # Prepare data for Plotly Express
    data = {
        "Training Set Size": list(range(start_size, end_size)),
        "Training Error"   : train_errors,
        "Validation|Test Error" : val_errors
    }

    # Plot the learning curves
    fig = px.line(
        data_frame=data,
        x="Training Set Size",
        y=["Training Error", "Validation|Test Error"],
        labels={"value": "Root Mean Squared Error (RMSE)", "variable": "Error Type"},
        title="Learning Curves"
    )
    fig.show()
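As an alternative to the hand-written loop, scikit-learn ships a `learning_curve` utility that refits the estimator at several training sizes and scores each fit with cross-validation. The sketch below is a minimal illustration on synthetic data; the data, the five training sizes, and the variable names are assumptions chosen for the example, not values taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic regression data (illustrative only)
X_lc, y_lc = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(),
    X_lc, y_lc,
    train_sizes=np.linspace(0.1, 1.0, 5),   # fractions of the largest training fold
    cv=5,
    scoring='neg_root_mean_squared_error',
)

# Scores are negated RMSE values, so flip the sign and average over the folds
train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)
for n, tr, va in zip(sizes, train_rmse, val_rmse):
    print(f"n={int(n):4d}  train RMSE={tr:8.2f}  validation RMSE={va:8.2f}")
```

Averaging over folds gives smoother curves than a single train/test split, at the cost of fitting the model several times per training size.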



In [12]:
# Apply to linear regression without regularization
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

plot_learning_curves_plotly(lin_reg, X_train, y_train, X_test, y_test)

**Let's Train a Linear Model Without Regularization**

In [178]:
lr_model  = LinearRegression()
lr_model.fit(X_train,y_train)
y_pred= lr_model.predict(X_test)

print("Linear Regression RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))


Linear Regression RMSE: 212539.51663817756


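**Regularized Linear Models**

Ridge Regression - Applies L2 regularization: a penalty proportional to the sum of squared coefficients, scaled by alpha, shrinks the weights and helps prevent overfitting.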

In [182]:
from sklearn.linear_model import Ridge

# Ridge Regression
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train, y_train)
y_pred_ridge = ridge_reg.predict(X_test)

print("Ridge Regression RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_ridge)))



Ridge Regression RMSE: 212540.13987359506
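To make the effect of the L2 penalty concrete, the sketch below compares scikit-learn's `Ridge` with the closed-form ridge solution on a small synthetic problem, and shows how a much larger `alpha` shrinks the coefficients. The data, the true weights, and the `alpha` values are assumptions made for illustration; the intercept is switched off so the closed form stays simple.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Small synthetic problem with known weights (illustrative only)
rng = np.random.default_rng(0)
X_r = rng.normal(size=(100, 3))
y_r = X_r @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

alpha = 1.0
# Closed-form ridge solution without an intercept: w = (X^T X + alpha * I)^(-1) X^T y
w_closed = np.linalg.solve(X_r.T @ X_r + alpha * np.eye(3), X_r.T @ y_r)

# Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2, so the two solutions should agree
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_r, y_r).coef_

# A much larger alpha pulls the coefficient vector toward zero
w_strong = Ridge(alpha=1000.0, fit_intercept=False).fit(X_r, y_r).coef_
print("closed form:", w_closed)
print("sklearn    :", w_sklearn)
print("norm at alpha=1000:", np.linalg.norm(w_strong))
```

With `alpha=1.0` and well-conditioned features, the penalty is small relative to the data term, which is consistent with the Ridge RMSE above being almost identical to the unregularized one.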


In [183]:
plot_learning_curves_plotly(ridge_reg, X_train, y_train, X_test, y_test)

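Lasso Regression - Applies L1 regularization: a penalty proportional to the sum of absolute coefficient values, which can drive some coefficients exactly to zero.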

In [184]:
from sklearn.linear_model import Lasso

# Lasso Regression
lasso_reg = Lasso(alpha=0.1,max_iter=50000)
lasso_reg.fit(X_train, y_train)
y_pred_lasso = lasso_reg.predict(X_test)

print("Lasso Regression RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lasso)))



Lasso Regression RMSE: 212539.5260270513



Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.007e+13, tolerance: 2.259e+11
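The sparsity that L1 regularization can induce is easier to see on a problem where only a few features matter. The following sketch uses synthetic data with 3 informative features out of 10 and counts the coefficients that are exactly zero; the data, `alpha=1.0`, and all variable names are assumptions for illustration and say nothing about which housing features Lasso would drop.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# 10 features, only 3 of which actually influence the target (illustrative only)
X_l, y_l = make_regression(n_samples=200, n_features=10, n_informative=3,
                           noise=5.0, random_state=1)
X_l = StandardScaler().fit_transform(X_l)

ols_coef = LinearRegression().fit(X_l, y_l).coef_
lasso_coef = Lasso(alpha=1.0).fit(X_l, y_l).coef_

# Ordinary least squares keeps every coefficient non-zero; Lasso can zero some out
n_zero_ols = int(np.sum(ols_coef == 0))
n_zero_lasso = int(np.sum(lasso_coef == 0))
print("Zero coefficients (OLS):  ", n_zero_ols)
print("Zero coefficients (Lasso):", n_zero_lasso)
```

How strongly `alpha` acts depends on the scale of the target, so a value that is meaningful for this synthetic target may be negligible for house prices in the hundreds of thousands.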



Elastic Net - Combines the L1 and L2 penalties; l1_ratio sets the balance between them.

In [186]:
from sklearn.linear_model import ElasticNet

# Elastic Net Regression
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X_train, y_train)
y_pred_elastic = elastic_net.predict(X_test)

print("Elastic Net Regression RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_elastic)))


Elastic Net Regression RMSE: 213354.04216794018
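For reference, the objective that scikit-learn documents for `ElasticNet` can be written as follows, where $n$ is the number of training samples, $\alpha$ is `alpha`, and $\rho$ is `l1_ratio` (the symbols $w$ and $\rho$ are notation chosen here; the intercept is not penalized and is omitted for brevity):

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

With `alpha=0.1` and `l1_ratio=0.5` as in the cell above, the L1 term has weight $0.1 \times 0.5 = 0.05$ and the L2 term has weight $0.1 \times 0.5 / 2 = 0.025$. Setting $\rho = 1$ recovers the Lasso objective, and $\rho = 0$ leaves only the L2 penalty.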


In [187]:
plot_learning_curves_plotly(elastic_net, X_train, y_train, X_test, y_test)

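Next, we use SGDRegressor with early stopping. Stochastic gradient descent is sensitive to feature scale, so the features are standardized first. With early_stopping=True, the model holds out a fraction of the training data (validation_fraction) and stops once the validation score has not improved by at least tol for n_iter_no_change consecutive epochs.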

In [188]:
from sklearn.linear_model import SGDRegressor

# Scaling features for SGDRegressor
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# SGD Regressor with Early Stopping
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, early_stopping=True, validation_fraction=0.1, n_iter_no_change=10, random_state=42)
sgd_reg.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred_sgd = sgd_reg.predict(X_test_scaled)
print("SGD Regressor with Early Stopping RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_sgd)))


SGD Regressor with Early Stopping RMSE: 213250.34589183502
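One way to keep the scaling and the regressor together is to wrap both in a pipeline, so the scaler is refit whenever the model is refit. The sketch below shows this on synthetic data and reads back how many epochs ran before early stopping ended training; the data and the names with an `_s` suffix are assumptions for illustration, while the `SGDRegressor` settings mirror the cell above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only)
X_s, y_s = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=42)
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(
    X_s, y_s, test_size=0.2, random_state=42)

# Scaling and SGD in one estimator; same early-stopping settings as above
sgd_pipeline = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, early_stopping=True,
                 validation_fraction=0.1, n_iter_no_change=10, random_state=42),
)
sgd_pipeline.fit(X_s_train, y_s_train)

# n_iter_ reports the number of epochs actually run before stopping
epochs_run = sgd_pipeline.named_steps['sgdregressor'].n_iter_
sgd_r2 = sgd_pipeline.score(X_s_test, y_s_test)
print("Epochs run:", epochs_run, " Test R^2:", round(sgd_r2, 3))
```

A pipeline like this can be passed directly to a learning-curve function, since `fit` and `predict` then handle the scaling internally for each training subset.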


In [189]:
# Use the same scaled features the SGD model was trained and evaluated on
plot_learning_curves_plotly(sgd_reg, X_train_scaled, y_train, X_test_scaled, y_test)