# Overview

Linear regression is one of the basic supervised learning algorithms in machine learning. It predicts the value of a continuous target variable from one or more predictor variables.
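
As a small illustration of the idea, the sketch below fits a line to four noise-free points generated from y = 2x + 1. The toy data and the names `X_toy`, `y_toy`, and `toy` are assumptions made for this example only and are not part of the housing workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Four points that lie exactly on the line y = 2x + 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy.ravel() + 1.0

toy = LinearRegression().fit(X_toy, y_toy)

# The fitted slope and intercept recover the generating line
print(f"slope={toy.coef_[0]:.2f}, intercept={toy.intercept_:.2f}")
# Prediction for an unseen input x = 4
print(f"prediction at x=4: {toy.predict(np.array([[4.0]]))[0]:.2f}")
```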

# 1.Data Preparation

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import pandas as pd

ds=fetch_california_housing(as_frame=True)

In [2]:
ds=ds.frame

In [3]:
ds.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [4]:
# Handle missing values (if any) by filling them with the column mean
ds.fillna(ds.mean(), inplace=True)

In [5]:
# define input features and target variable
x=ds.drop("MedHouseVal", axis=1)
y=ds["MedHouseVal"]

In [6]:
# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# 2.Model Training

This guide uses two of scikit-learn's linear model classes:
* LinearRegression
* Ridge

In [7]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [8]:
# Create and train the Linear Regression model
lr=LinearRegression()
lr.fit(x_train_scaled, y_train)

In [9]:
# Create and train the Ridge Regression model
rr = Ridge(alpha=1.0) 
rr.fit(x_train_scaled, y_train)

# 3.Model Evaluation

After training, it's crucial to evaluate each model's performance using several metrics and techniques:

### R-squared(Coefficient of Determination)

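R-squared is the proportion of the variance in the target variable that is predictable from the input features. On the training data of a least-squares fit it lies between 0 and 1, with 1 indicating a perfect fit; on held-out data it can be negative when the model does worse than always predicting the mean.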
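
To make the definition concrete, here is a minimal sketch that computes R-squared by hand as 1 - SS_res / SS_tot. The helper `r_squared` and the toy values are assumptions for illustration; in practice `sklearn.metrics.r2_score` does this job. The last case shows that the score can drop below 0 when predictions are worse than predicting the mean.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(y_true, y_true)                  # predictions equal the targets
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])   # always predict the mean (6.0)
worse = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])      # predictions in reversed order

print(perfect, baseline, worse)  # 1.0 0.0 -3.0
```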

### Mean Squared Error

MSE measures the average of squared differences between the actual and the predicted values. The lower the value of MSE, the better the performance of the model.

### Mean Absolute Error

MAE is the average absolute difference between the predicted and the actual values. It is less sensitive to outliers than MSE, because errors are not squared.
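
The difference in outlier sensitivity can be seen in a small hand-computed sketch. The helpers `mse` and `mae` and the toy values below are assumptions for illustration, not part of the notebook's pipeline.

```python
def mse(y_true, y_pred):
    """Mean of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [2.0, 4.0, 6.0, 8.0]
y_close = [3.0, 5.0, 7.0, 9.0]      # every prediction is off by 1
y_outlier = [2.0, 4.0, 6.0, 18.0]   # three exact predictions, one off by 10

mse_close, mae_close = mse(y_true, y_close), mae(y_true, y_close)
mse_out, mae_out = mse(y_true, y_outlier), mae(y_true, y_outlier)

print(mse_close, mae_close)  # 1.0 1.0
print(mse_out, mae_out)      # 25.0 2.5
```

A single large error multiplies MSE by 25 here, while MAE grows only by a factor of 2.5.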

### Cross-Validation

Cross-validation techniques such as K-fold cross-validation give a more reliable estimate of model performance than a single split: the data are divided into K subsets (folds), and the model is trained on K-1 folds and tested on the remaining one, rotating until every fold has served as the test set.
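
The fold rotation can be written out explicitly with `KFold`. This is a minimal sketch on synthetic data (the arrays `X_syn` and `y_syn` and the coefficient values are assumptions for illustration); `cross_val_score` performs the same loop in one call.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data: 100 samples, 3 features, small noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 3))
y_syn = X_syn @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X_syn):
    # Train on four folds, score (R-squared) on the held-out fold
    model = LinearRegression().fit(X_syn[train_idx], y_syn[train_idx])
    fold_scores.append(model.score(X_syn[val_idx], y_syn[val_idx]))

print(f"mean R-squared: {np.mean(fold_scores):.3f} (std {np.std(fold_scores):.3f})")
```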


## 3.1 Evaluate the Linear Regression model

In [10]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import cross_val_score

# evaluate the Linear Regression Model
y_pred_lin = lr.predict(x_test_scaled)

In [11]:
r2_lin = r2_score(y_test, y_pred_lin)
print(f"R-squared: {r2_lin:.2f}")

R-squared: 0.58


In [12]:
mse_lin = mean_squared_error(y_test, y_pred_lin)
print(f"Mean Squared Error: {mse_lin:.2f}")

Mean Squared Error: 0.56


In [13]:
mae_lin = mean_absolute_error(y_test, y_pred_lin)
print(f"Mean Absolute Error: {mae_lin:.2f}")

Mean Absolute Error: 0.53


## 3.2 Evaluate the Ridge Regression model

In [14]:
y_pred_ridge=rr.predict(x_test_scaled)

In [15]:
r2_ridge=r2_score(y_test, y_pred_ridge)
print(f"R-squared: {r2_ridge:.2f}")

R-squared: 0.58


In [16]:
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
print(f"Mean Squared Error: {mse_ridge:.2f}")

Mean Squared Error: 0.56


In [17]:
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
print(f"Mean Absolute Error: {mae_ridge:.2f}")

Mean Absolute Error: 0.53


## 3.3 Cross-validation

In [18]:
cv_scores_lin = cross_val_score(lr, x_train_scaled, y_train, cv=5)
print("\nCross-Validation Scores (Linear Regression):", cv_scores_lin)


Cross-Validation Scores (Linear Regression): [0.62011512 0.61298876 0.6134416  0.61069973 0.60017477]


In [19]:
cv_scores_ridge=cross_val_score(rr, x_train_scaled, y_train, cv=5)
print("Cross-Validation Scores (Ridge Regression):", cv_scores_ridge)

Cross-Validation Scores (Ridge Regression): [0.62010998 0.61298705 0.61343295 0.61070059 0.60018886]
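
In the cells above, the scaler was fitted on the whole training set before cross-validation, so each validation fold also contributed to the scaling statistics. One common way to avoid this, sketched here on synthetic data, is to wrap the scaler and the model in a pipeline so that scaling is refitted inside every fold. The names `X_demo`, `y_demo`, and `pipe` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(200, 3))
y_demo = (2.0 * X_demo[:, 0] + 0.3 * X_demo[:, 1] - 0.01 * X_demo[:, 2]
          + rng.normal(scale=0.5, size=200))

# The scaler is fitted only on the training part of each fold
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")

print(f"mean R-squared: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the mean and standard deviation of the fold scores is usually easier to compare across models than the raw array.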


# 4.Handling Overfitting and Underfitting

**Overfitting** occurs when a model learns the training data too well and fails to generalize to new, unseen data.

**Underfitting** happens when a model is too simple to capture the underlying patterns in the data.
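
A common way to spot either problem is to compare the training score with the score on held-out data. The sketch below is an illustration on synthetic data (a noisy sine curve, with hypothetical names `x_tr`, `x_te`, and `results`): it fits polynomials of increasing degree and records both scores. A low score on both sets typically suggests underfitting, and a high training score with a clearly lower test score typically suggests overfitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def true_curve(x):
    """Non-linear ground truth used to generate the synthetic targets."""
    return np.sin(3 * x).ravel()

rng = np.random.default_rng(0)
x_tr = rng.uniform(-1, 1, size=30).reshape(-1, 1)    # small training set
x_te = rng.uniform(-1, 1, size=200).reshape(-1, 1)   # larger held-out set
y_tr = true_curve(x_tr) + rng.normal(scale=0.2, size=30)
y_te = true_curve(x_te) + rng.normal(scale=0.2, size=200)

results = {}
for degree in (1, 5, 15):
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          LinearRegression())
    model.fit(x_tr, y_tr)
    # Store (training R-squared, held-out R-squared) for this degree
    results[degree] = (model.score(x_tr, y_tr), model.score(x_te, y_te))
    print(f"degree={degree:2d}  train={results[degree][0]:.3f}  test={results[degree][1]:.3f}")
```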

In [20]:
# Try a different regularization strength; note that alpha=0.5 regularizes less than alpha=1.0 (a larger alpha is what counters overfitting)
ridge_reg_alpha = Ridge(alpha=0.5) 
ridge_reg_alpha.fit(x_train_scaled, y_train)

In [21]:
# Evaluate the model with adjusted alpha
y_pred_ridge_alpha = ridge_reg_alpha.predict(x_test_scaled)

In [22]:
r2_ridge_alpha = r2_score(y_test, y_pred_ridge_alpha)
print(f"R-squared: {r2_ridge_alpha:.2f}")

R-squared: 0.58


In [23]:
mse_ridge_alpha = mean_squared_error(y_test, y_pred_ridge_alpha)
print(f"Mean Squared Error: {mse_ridge_alpha:.2f}")

Mean Squared Error: 0.56


In [24]:
mae_ridge_alpha = mean_absolute_error(y_test, y_pred_ridge_alpha)
print(f"Mean Absolute Error: {mae_ridge_alpha:.2f}")

Mean Absolute Error: 0.53
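
Instead of trying one `alpha` at a time, a grid of values can be compared. The following sketch, on synthetic data with assumed names `X_demo`, `y_demo`, and `alphas`, shows that the coefficient norm shrinks as `alpha` grows, and uses `RidgeCV` to pick an `alpha` from the grid by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=1.0, size=60)

alphas = [0.01, 1.0, 100.0, 10000.0]

# A larger alpha means a stronger penalty and therefore smaller coefficients
coef_norms = [np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_) for a in alphas]
for a, norm in zip(alphas, coef_norms):
    print(f"alpha={a:>8}: ||coef|| = {norm:.3f}")

# RidgeCV evaluates every candidate alpha and keeps the best one in `alpha_`
search = RidgeCV(alphas=alphas).fit(X_demo, y_demo)
print("selected alpha:", search.alpha_)
```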


# 5.Polynomial Regression

Polynomial regression is linear regression fitted on polynomial terms of the features, which lets the model capture non-linear relationships between the features and the target variable.

In [25]:
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features
poly = PolynomialFeatures(degree=2)  # Adjust the degree as needed
X_train_poly = poly.fit_transform(x_train_scaled)
X_test_poly = poly.transform(x_test_scaled)



In [26]:
# Train the polynomial regression model
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)

In [27]:
# Evaluate the polynomial regression model
y_pred_poly = poly_reg.predict(X_test_poly)

In [28]:
r2_poly = r2_score(y_test, y_pred_poly)
print(f"R-squared: {r2_poly:.2f}")

R-squared: 0.65


In [29]:
mse_poly = mean_squared_error(y_test, y_pred_poly)
print(f"Mean Squared Error: {mse_poly:.2f}")

Mean Squared Error: 0.46


In [30]:
mae_poly = mean_absolute_error(y_test, y_pred_poly)
print(f"Mean Absolute Error: {mae_poly:.2f}")

Mean Absolute Error: 0.47


# 6.Regularization Techniques

Regularization is a technique used to prevent overfitting by adding a penalty term to the cost function.


### Ridge Regression

Ridge regression adds an L2 penalty (the sum of squared coefficients, scaled by `alpha`) to the cost function. It shrinks all coefficients toward zero, but rarely makes any of them exactly zero.
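
For intuition, the L2-penalized least-squares problem has a closed-form solution. The sketch below assumes no intercept and synthetic data (the names `X_demo`, `y_demo`, and `w_closed` are illustrative) and checks the formula against scikit-learn's `Ridge` with `fit_intercept=False`.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
alpha = 1.0

# Minimizing ||y - Xw||^2 + alpha * ||w||^2 gives
# w = (X^T X + alpha * I)^(-1) X^T y   (no intercept term)
w_closed = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(3), X_demo.T @ y_demo)

w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_

print(np.allclose(w_closed, w_sklearn, atol=1e-6))
```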


### Lasso Regression

Lasso regression uses an L1 penalty (the sum of absolute coefficient values), which can produce sparse models in which some coefficients are exactly zero, effectively performing feature selection.
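
The sparsity effect can be illustrated on synthetic data where only two of five features influence the target. This is a sketch under that assumption (names such as `X_demo`, `true_coef`, `lasso_demo`, and `ridge_demo` are illustrative); it counts how many coefficients each model sets exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, 0.0, 0.0, -2.0, 0.0])   # features 1, 2 and 4 are irrelevant
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))

print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
print("Ridge coefficients:", np.round(ridge_demo.coef_, 3))
print("exact zeros (Lasso, Ridge):", n_zero_lasso, n_zero_ridge)
```

Lasso also shrinks the coefficients it keeps, so a large `alpha` can lower accuracy even when the selected features are the right ones.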

In [31]:
from sklearn.linear_model import Lasso

# Train the Lasso Regression model
lasso_reg = Lasso(alpha=0.1) 

In [32]:
lasso_reg.fit(x_train_scaled, y_train)

In [33]:
# Evaluate the Lasso Regression model
y_pred_lasso = lasso_reg.predict(x_test_scaled)

In [34]:
r2_lasso = r2_score(y_test, y_pred_lasso)
print(f"R-squared: {r2_lasso:.2f}")

R-squared: 0.48


In [35]:
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
print(f"Mean Squared Error: {mse_lasso:.2f}")

Mean Squared Error: 0.68


In [36]:
mae_lasso = mean_absolute_error(y_test, y_pred_lasso)
print(f"Mean Absolute Error: {mae_lasso:.2f}")

Mean Absolute Error: 0.62


# Acknowledgement

* https://medium.com/gopenai/the-ultimate-guide-to-linear-regression-with-scikit-learn-8002b42ccaea