## Linear Regression Assignment:
### California Housing Price Prediction
---
### Part 1: Data Loading, Preprocessing, and Initial Linear Regression
---


#### Load and Prepare the Data
1. Load the **California Housing dataset** from `sklearn.datasets.fetch_california_housing`.
2. Examine the features and the target variable (median house value).
3. Split the data into training and testing sets (e.g., 80% train, 20% test).
4. **Standardize** the feature data using `StandardScaler` from `sklearn.preprocessing`: fit the scaler on the training set only, then apply the same scaling to the test set so that no test-set statistics leak into training.



In [43]:
# Import required libraries
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [32]:
# Load the California Housing dataset
california = fetch_california_housing()
X = california.data        # Features
y = california.target      # Target (Median House Value)

# Convert to DataFrame for better visualization
feature_names = california.feature_names
df = pd.DataFrame(X, columns=feature_names)
df['Target'] = y

# Display the first few rows
print("First 5 rows of the dataset:")
print(df.head())

First 5 rows of the dataset:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  Target  
0    -122.23   4.526  
1    -122.22   3.585  
2    -122.24   3.521  
3    -122.25   3.413  
4    -122.25   3.422  


In [33]:
# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [34]:
# Standardize the feature data
scaler = StandardScaler()

# Fit the scaler on the training data
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same scaling to the test data
X_test_scaled = scaler.transform(X_test)

# Print shapes to confirm
print("\nTraining features shape:", X_train_scaled.shape)
print("Testing features shape:", X_test_scaled.shape)


Training features shape: (16512, 8)
Testing features shape: (4128, 8)
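The scaling step above can be sketched from scratch to show why the scaler is fitted on the training data only. This is a minimal illustration, not scikit-learn's implementation: the helper names `fit_scaler` and `apply_scaler` are hypothetical, and the tiny train/test split of `MedInc` values (taken from the five rows printed earlier) is an assumption made purely for the example. It uses the population standard deviation, which is the convention `StandardScaler` follows.

```python
import statistics


def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population std (ddof=0)
    return mean, std


def apply_scaler(column, mean, std):
    """Standardize values using statistics learned elsewhere."""
    return [(value - mean) / std for value in column]


# Illustrative split of the MedInc values printed by df.head() (assumption).
train_income = [8.3252, 8.3014, 7.2574, 5.6431]
test_income = [3.8462]

# Fit on the training values only, then reuse the same statistics for the test values.
mean, std = fit_scaler(train_income)
train_scaled = apply_scaler(train_income, mean, std)
test_scaled = apply_scaler(test_income, mean, std)

print("Scaled training values:", [round(v, 3) for v in train_scaled])
print("Scaled test values:", [round(v, 3) for v in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction. The test values generally do not, because they are transformed with the training statistics rather than their own.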


---
#### Train the Simple Linear Regression Model

1. Instantiate and train a standard **Linear Regression** model (`sklearn.linear_model.LinearRegression`) using the standardized training data.

2. Make predictions on the test set.


In [35]:
# Instantiate the Linear Regression model
lr_model = LinearRegression()

# Train the model on standardized training data
lr_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = lr_model.predict(X_test_scaled)

# Evaluate the model (optional but recommended)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Linear Regression Model Trained Successfully.")
print(f"Mean Squared Error on Test Set: {mse:.4f}")
print(f"R^2 Score on Test Set: {r2:.4f}")

Linear Regression Model Trained Successfully.
Mean Squared Error on Test Set: 0.5559
R^2 Score on Test Set: 0.5758


---
#### Evaluate the Simple Linear Regression Model (10)

1. Calculate and report the following evaluation metrics for the test set predictions:

   - Mean Absolute Error (MAE)  
   - Mean Squared Error (MSE)  
   - Root Mean Squared Error (RMSE) (Calculate this from the MSE)  
   - R-squared score

2. Briefly interpret the meaning of the **R-Squared** in the context of this problem.


In [37]:
# Evaluate the Simple Linear Regression Model

from sklearn.metrics import mean_absolute_error
import numpy as np

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Print the results
print("Linear Regression Model Evaluation on Test Set:")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R-squared Score (R²): {r2:.4f}")

# Interpretation of R-squared
print("\nInterpretation of R-squared:")
print(f"R² = {r2:.4f} means that approximately {r2*100:.2f}% of the variance in the median house prices "
      "can be explained by the linear relationship with the given features. "
      "The closer R² is to 1, the better the model fits the data.")


Linear Regression Model Evaluation on Test Set:
Mean Absolute Error (MAE): 0.5332
Mean Squared Error (MSE): 0.5559
Root Mean Squared Error (RMSE): 0.7456
R-squared Score (R²): 0.5758

Interpretation of R-squared:
R² = 0.5758 means that approximately 57.58% of the variance in the median house prices can be explained by the linear relationship with the given features. The closer R² is to 1, the better the model fits the data.


### **Explanation**:
- Mean Absolute Error (MAE): Average absolute difference between predicted and actual house values.
- Mean Squared Error (MSE): Average squared difference; penalizes large errors more than MAE.
- Root Mean Squared Error (RMSE): Square root of MSE; same units as the target variable, which makes it easier to interpret. The target is expressed in units of 100,000 US dollars, so an RMSE of 0.7456 corresponds to a typical error of roughly 74,600 US dollars.
- R-squared (R²): Proportion of variance in the target explained by the features.
- Here, R² = 0.5758 means that about 58% of the variation in median house values on the test set is explained by this linear regression model.
- The remaining 42% or so is due to factors not captured by the model (e.g., neighborhood specifics, economic conditions).
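The four definitions above can be written out directly. The following is a from-scratch sketch on toy numbers (the `toy_true` and `toy_pred` lists and the `regression_metrics` helper are made up for illustration); the notebook itself uses the `sklearn.metrics` functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R-squared from two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R-squared compares the model's squared error with that of
    # always predicting the mean of the true values.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Toy values chosen so the arithmetic is easy to check by hand.
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]

metrics = regression_metrics(toy_true, toy_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these toy values the errors are 0.5 in magnitude on two of the four points, giving MAE = 0.25, MSE = 0.125, RMSE ≈ 0.3536, and R² = 1 - 0.5/5.0 = 0.9.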

---
### Part 2: Regularized Linear Models (Lasso and Ridge)
---

#### Task 2.1: Implement Ridge Regression (15)

1. Instantiate and train a **Ridge Regression** model (`sklearn.linear_model.Ridge`). Start with a regularization strength (alpha) of alpha = 1.0.

2. Make predictions on the test set.

3. Calculate and report the same four metrics (MAE, MSE, RMSE, R-Squared) for the Ridge model’s predictions.


In [None]:
# Import the regularized linear models
# (pandas, numpy, and the metric functions were imported in earlier cells)
from sklearn.linear_model import Lasso, Ridge

In [None]:
# Instantiate Ridge Regression with alpha = 1.0
ridge_model = Ridge(alpha=1.0)

# Train the Ridge Regression model on standardized training data
ridge_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred_ridge = ridge_model.predict(X_test_scaled)

# Evaluate Ridge Regression model
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
rmse_ridge = np.sqrt(mse_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)

# Print results
print("Ridge Regression Model Evaluation on Test Set:")
print(f"Mean Absolute Error (MAE): {mae_ridge:.4f}")
print(f"Mean Squared Error (MSE): {mse_ridge:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_ridge:.4f}")
print(f"R-squared Score (R²): {r2_ridge:.4f}")

Ridge Regression Model Evaluation on Test Set:
Mean Absolute Error (MAE): 0.5332
Mean Squared Error (MSE): 0.5559
Root Mean Squared Error (RMSE): 0.7456
R-squared Score (R²): 0.5758


---
#### Task 2.2: Implement Lasso Regression

1. Instantiate and train a **Lasso Regression** model (`sklearn.linear_model.Lasso`). Start with a regularization strength of alpha = 0.01.

2. Make predictions on the test set.

3. Calculate and report the same four metrics (MAE, MSE, RMSE, R-Squared) for the Lasso model’s predictions.

4. **Analyze the Coefficients**: Print the coefficients learned by the Lasso model. Briefly explain the primary difference between **Ridge** and **Lasso** in terms of how they affect the model’s coefficients.



In [None]:
# Instantiate Lasso Regression with alpha = 0.01
lasso_model = Lasso(alpha=0.01, max_iter=10000)  # Increased max_iter to ensure convergence

# Train the Lasso Regression model
lasso_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred_lasso = lasso_model.predict(X_test_scaled)

# Evaluate the Lasso Regression model
mae_lasso = mean_absolute_error(y_test, y_pred_lasso)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
rmse_lasso = np.sqrt(mse_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)

# Print evaluation metrics
print("Lasso Regression Model Evaluation on Test Set:")
print(f"Mean Absolute Error (MAE): {mae_lasso:.4f}")
print(f"Mean Squared Error (MSE): {mse_lasso:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_lasso:.4f}")
print(f"R-squared Score (R²): {r2_lasso:.4f}")

Lasso Regression Model Evaluation on Test Set:
Mean Absolute Error (MAE): 0.5353
Mean Squared Error (MSE): 0.5483
Root Mean Squared Error (RMSE): 0.7404
R-squared Score (R²): 0.5816


In [None]:
# Analyze the coefficients
coefficients = lasso_model.coef_
print("\nLasso Model Coefficients:")
for feature, coef in zip(feature_names, coefficients):
    print(f"{feature}: {coef:.4f}")


Lasso Model Coefficients:
MedInc: 0.8010
HouseAge: 0.1271
AveRooms: -0.1628
AveBedrms: 0.2062
Population: -0.0000
AveOccup: -0.0306
Latitude: -0.7901
Longitude: -0.7557


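### **Analysis**:
- Ridge Regression (L2 regularization) shrinks coefficients toward zero but generally keeps all features in the model.
- Lasso Regression (L1 regularization) can shrink some coefficients exactly to zero, effectively performing feature selection. In the output above, the `Population` coefficient is zero to four decimal places, which is consistent with Lasso dropping that feature at alpha = 0.01.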

---
### Part 3: Analysis and Observation

---

#### Task 3: Final Comparison and Conclusion

1. Create a summary table comparing the **R-Squared scores**, **RMSE values**, and **MAE values** for the three models:

| Model | R2 Score | RMSE | MAE |
|-------|---------|------|-----|
| Simple Linear Regression |  |  |  |
| Lasso |  |  |  |
| Ridge |  |  |  |

2. Based on the results, which model performs the best for predicting median house values in California? Justify your choice by referencing the evaluation metrics.

3. Briefly explain the role of regularization (both Ridge and Lasso) in the context of preventing overfitting.


In [42]:
# Analysis and Observation
# Create a summary table comparing metrics
summary_data = {
    "Model": ["Simple Linear Regression", "Lasso Regression", "Ridge Regression"],
    "R2 Score": [r2, r2_lasso, r2_ridge],
    "RMSE": [rmse, rmse_lasso, rmse_ridge],  # reuse the RMSE computed in Part 1
    "MAE": [mae, mae_lasso, mae_ridge]
}

summary_df = pd.DataFrame(summary_data)
print("Summary Table of Model Performance:\n")
print(summary_df)

# Determine the best model
best_model_index = summary_df['R2 Score'].idxmax()
best_model = summary_df.loc[best_model_index, "Model"]
best_r2 = summary_df.loc[best_model_index, "R2 Score"]
best_rmse = summary_df.loc[best_model_index, "RMSE"]

print(f"\nBased on the evaluation metrics, the best performing model is: {best_model}")
print(f"It has the highest R² score of {best_r2:.4f} and a relatively low RMSE of {best_rmse:.4f}.")

Summary Table of Model Performance:

                      Model  R2 Score      RMSE       MAE
0  Simple Linear Regression  0.575788  0.745581  0.533200
1          Lasso Regression  0.581615  0.740442  0.535326
2          Ridge Regression  0.575816  0.745557  0.533193

Based on the evaluation metrics, the best performing model is: Lasso Regression
It has the highest R² score of 0.5816 and a relatively low RMSE of 0.7404.
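To put the ranking in perspective, the gaps between the models can be computed from the summary table. The sketch below copies the printed values by hand; the `results` dictionary and the `gaps_vs_baseline` helper are introduced here for illustration only.

```python
# Metric values copied from the summary table above (as printed).
results = {
    "Simple Linear Regression": {"R2": 0.575788, "RMSE": 0.745581, "MAE": 0.533200},
    "Lasso Regression": {"R2": 0.581615, "RMSE": 0.740442, "MAE": 0.535326},
    "Ridge Regression": {"R2": 0.575816, "RMSE": 0.745557, "MAE": 0.533193},
}


def gaps_vs_baseline(results, baseline_name):
    """Return each model's metric minus the baseline model's metric."""
    baseline = results[baseline_name]
    return {
        model: {name: round(value - baseline[name], 6)
                for name, value in metrics.items()}
        for model, metrics in results.items()
    }


gaps = gaps_vs_baseline(results, "Simple Linear Regression")
for model, diff in gaps.items():
    formatted = "  ".join(f"{name}: {value:+.6f}" for name, value in diff.items())
    print(f"{model:<26} {formatted}")
```

Relative to plain linear regression, Lasso gains about 0.006 in R² and lowers RMSE by about 0.005, but its MAE is about 0.002 higher. Ridge differs from the baseline only in the fifth decimal place. These gaps come from a single train/test split, so the ranking is best read as tentative.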


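### **Role of Regularization**:
Regularization techniques such as Ridge (L2) and Lasso (L1) help prevent overfitting by adding a penalty on coefficient size to the loss function. The penalty discourages overly large coefficients, which can otherwise let the model fit noise in the training data. Ridge shrinks all coefficients toward zero, while Lasso can shrink some coefficients exactly to zero, effectively performing feature selection. Both aim to improve generalization to unseen data. In this experiment the three models score within about 0.006 R² of each other on the test set, which suggests the unregularized model was not overfitting much to begin with.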
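For reference, the penalized objectives can be written out as follows. This follows the form given in the scikit-learn documentation for `Ridge` and `Lasso`, with the intercept omitted for brevity; the symbols are introduced here and do not appear elsewhere in this notebook: $X$ is the standardized feature matrix, $y$ the target vector, $w$ the coefficient vector, $n$ the number of training samples, and $\alpha$ the regularization strength.

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The two objectives scale the data-fit term differently, so the `alpha` values are not directly comparable. In Ridge the squared error is summed over all 16,512 training samples, which is one plausible reason why `alpha = 1.0` barely changes the results here, whereas in Lasso the squared error is averaged, so even `alpha = 0.01` has a visible effect.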

---

Author: **Rezaul Islam**