In [13]:
import pandas as pd
import numpy as np


In [14]:
df=pd.read_csv('housing.csv')
df

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


In [15]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [16]:
#Replace missing values with the median value of the column
#Assign the result back instead of calling fillna(..., inplace=True) on a column selection
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())
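The median fill can be checked on a small frame before running it on the full data. This is a minimal sketch, assuming a toy frame named `toy` with made-up values; it assigns the filled column back rather than relying on an in-place call on a column selection.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data; the values are illustrative only.
toy = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, 235.0, np.nan]})

# Median of the observed values (NaN entries are skipped by default).
median_value = toy["total_bedrooms"].median()

# Assign the filled column back to the frame instead of using inplace=True
# on the intermediate toy["total_bedrooms"] object.
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(median_value)

print(median_value)
print(toy["total_bedrooms"].isnull().sum())
```

After the assignment the column has no missing entries, and the two gaps hold the median of the three observed values.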


In [17]:
df.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

In [18]:
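from sklearn.preprocessing import LabelEncoder
#Encode the categorical column as integer codes. LabelEncoder assigns codes in
#sorted label order, so the linear models below treat the categories as ordinal.
labelencoder = LabelEncoder()
df['ocean_proximity'] = labelencoder.fit_transform(df['ocean_proximity'])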
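To see which integer each category receives, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses only the two labels visible in the table output (`NEAR BAY` and `INLAND`); the list `labels` and the dictionary `mapping` are illustrative names, not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Two category labels that appear in the table output; list order is arbitrary.
labels = ["NEAR BAY", "INLAND", "NEAR BAY"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the labels in sorted order; a label's code is its position there.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

print(mapping)
print(codes.tolist())
```

Because the codes follow alphabetical order rather than any real ordering of the categories, a linear model reads them as a numeric scale. That is worth keeping in mind when interpreting the `ocean_proximity` coefficient.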

In [19]:
from math import ceil, log2
#Number of bins for the target, chosen with Sturges' rule: ceil(log2(n) + 1)
n = len(df)
bins = ceil(log2(n) + 1)
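As a quick check of the bin count, the same formula can be evaluated on its own. This is a sketch under one assumption: the row count is taken to be about 20,640, inferred from the last index shown in the table (20638), since the notebook does not print `len(df)`. The helper name `sturges_bins` is hypothetical.

```python
from math import ceil, log2


def sturges_bins(n: int) -> int:
    """Number of bins suggested by Sturges' rule for n observations."""
    return ceil(log2(n) + 1)


# Assumed row count: the table's last visible index is 20638, so n is
# taken to be roughly 20,640 here.
n_assumed = 20640

print(log2(n_assumed))
print(sturges_bins(n_assumed))
```

Under that assumption the rule gives 16 equal-width bins, and the result is the same for any row count between 16,385 and 32,768.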

In [20]:
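#Bin the target so the train/test split can be stratified on it;
#the helper 'bins' column is excluded from the features
df['bins'] = pd.cut(df['median_house_value'], bins=bins)
X = df.drop(columns=['median_house_value', 'bins'])
y = df['median_house_value']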

In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=df['bins'], test_size=0.2, random_state=42)
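Stratifying on a binned target is meant to keep the target distribution similar in the train and test sets. The sketch below checks this on synthetic data (a uniform target of 1,000 values, five bins), not on the housing data. The names `bin_ids`, `train_share`, `test_share` and `max_gap` are illustrative, and `labels=False` is used so that `pd.cut` returns plain integer bin ids.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic continuous target; values are illustrative only.
rng = np.random.default_rng(0)
y = pd.Series(rng.uniform(0, 100, size=1000))

# Equal-width bins over the target; labels=False yields integer bin ids.
bin_ids = pd.cut(y, bins=5, labels=False)

# Stratify the split on the bin ids, as the notebook does with df['bins'].
y_train, y_test = train_test_split(
    y, stratify=bin_ids, test_size=0.2, random_state=42
)

# Share of each bin in the train and test sets.
train_share = bin_ids.loc[y_train.index].value_counts(normalize=True).sort_index()
test_share = bin_ids.loc[y_test.index].value_counts(normalize=True).sort_index()

# Largest difference in bin share between the two sets.
max_gap = (train_share - test_share).abs().max()
print(max_gap)
```

A small `max_gap` indicates that each value range of the target is represented in nearly the same proportion on both sides of the split.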

In [22]:
#Train a regression model using Ridge/Lasso
from sklearn.linear_model import Ridge
ridge = Ridge()
ridge.fit(X_train, y_train)

#Predict the test set
y_pred = ridge.predict(X_test)

In [23]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
MAE_ridge = mean_absolute_error(y_test, y_pred)
MSE_ridge = mean_squared_error(y_test, y_pred)
RMSE_ridge = np.sqrt(MSE_ridge)

print('MAE:', MAE_ridge)
print('MSE:', MSE_ridge)
print('RMSE:', RMSE_ridge)

MAE: 50762.90344380258
MSE: 4747014469.976564
RMSE: 68898.58104472519
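The three metrics can also be computed by hand, which makes their relationship explicit. This is a minimal sketch on four made-up values, not on the housing predictions; `y_true` and `y_hat` are illustrative names.

```python
from math import sqrt

# Illustrative values only, not taken from the housing data.
y_true = [100.0, 200.0, 300.0, 400.0]
y_hat = [110.0, 190.0, 330.0, 400.0]

# Per-sample errors: -10, 10, -30, 0
errors = [t - p for t, p in zip(y_true, y_hat)]

mae = sum(abs(e) for e in errors) / len(errors)  # mean of |error|
mse = sum(e * e for e in errors) / len(errors)   # mean of squared error
rmse = sqrt(mse)                                 # back in the units of y

print(mae, mse, rmse)
```

MSE squares each error, so the single 30-unit miss dominates it. RMSE returns to the target's units and is never smaller than MAE, which is consistent with the RMSE of about 68,900 sitting above the MAE of about 50,800 in the output above.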


In [24]:
#Lasso
from sklearn.linear_model import Lasso
lasso = Lasso()
lasso.fit(X_train, y_train)

#Predict the test set
y_pred = lasso.predict(X_test)

In [25]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
MAE_lasso = mean_absolute_error(y_test, y_pred)
MSE_lasso = mean_squared_error(y_test, y_pred)
RMSE_lasso = np.sqrt(MSE_lasso)

print('MAE:', MAE_lasso)
print('MSE:', MSE_lasso)
print('RMSE:', RMSE_lasso)

MAE: 50762.89938594224
MSE: 4747001205.302423
RMSE: 68898.48478234063


In [None]:
#Compare the MAE, MSE and RMSE values and write your findings
print('Ridge Regression')
print('MAE:', MAE_ridge)
print('MSE:', MSE_ridge)
print('RMSE:', RMSE_ridge)

print('Lasso Regression')
print('MAE:', MAE_lasso)
print('MSE:', MSE_lasso)
print('RMSE:', RMSE_lasso)

#Lasso regression has marginally lower MAE, MSE and RMSE values than Ridge regression, but the gaps are negligible
#(about 0.1 in RMSE on an error of roughly 68,900), so neither model is clearly better for this dataset.

# **Comparison of Ridge and Lasso Models**

## **1. Mean Absolute Error (MAE)**
- **Ridge:** 50762.90  
- **Lasso:** 50762.90  
- **Observation:** Both models have nearly identical MAE (Lasso is lower by about 0.004), indicating similar average absolute errors.

## **2. Mean Squared Error (MSE)**
- **Ridge:** 4747014469.98  
- **Lasso:** 4747001205.30  
- **Observation:** The Lasso model has a slightly lower MSE (by about 13,265, roughly 0.0003% of the total), a gap too small to indicate a real performance difference.

## **3. Root Mean Squared Error (RMSE)**
- **Ridge:** 68898.58  
- **Lasso:** 68898.48  
- **Observation:** RMSE values differ by about 0.1 on an error of roughly 68,900, meaning both models have nearly identical predictive performance.
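
One way to judge whether these gaps matter is to express each as a fraction of the metric itself. The sketch below copies the printed metric values into two dictionaries; `gaps` and `relative_gaps` are illustrative names.

```python
# Metric values copied from the printed cell outputs.
ridge = {"MAE": 50762.90344380258, "MSE": 4747014469.976564, "RMSE": 68898.58104472519}
lasso = {"MAE": 50762.89938594224, "MSE": 4747001205.302423, "RMSE": 68898.48478234063}

gaps = {}
relative_gaps = {}
for name in ridge:
    gaps[name] = ridge[name] - lasso[name]          # positive means Lasso is lower
    relative_gaps[name] = gaps[name] / ridge[name]  # gap as a fraction of the Ridge value
    print(f"{name}: gap={gaps[name]:.4f}, relative={relative_gaps[name]:.2e}")
```

Every gap is positive, so Lasso is lower on all three metrics, but each relative gap is below one part in 100,000. A difference that small on a single train/test split is not evidence that one model is better.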

 

## **Key Findings**
- Both models yield very similar results, with **Lasso showing slightly lower errors** on all three metrics (MAE, MSE and RMSE).
- The near-identical metrics suggest **minimal impact of regularization** at the default penalty strength (`alpha=1.0` for both `Ridge()` and `Lasso()`), so both fits are likely close to an unregularized least-squares fit.
- If interpretability is important, **Lasso is preferable** as it can shrink some coefficients exactly to zero, simplifying the model.
- If retaining all features is necessary, **Ridge is better** as it penalizes large coefficients without eliminating them.
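
The difference between the two penalties is easiest to see in the textbook special case of an orthonormal design, where each coefficient has a closed form. The sketch below assumes the objectives ½‖y − Xb‖² + λ‖b‖₁ for Lasso and ½‖y − Xb‖² + (λ/2)‖b‖² for Ridge, with XᵀX = I. This penalty scaling differs from scikit-learn's `alpha`, and the function names and coefficient values are hypothetical.

```python
def ridge_shrink(beta_ols: float, lam: float) -> float:
    """Ridge coefficient under an orthonormal design: proportional shrinkage."""
    return beta_ols / (1.0 + lam)


def lasso_shrink(beta_ols: float, lam: float) -> float:
    """Lasso coefficient under an orthonormal design: soft-thresholding."""
    magnitude = max(abs(beta_ols) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0  # coefficients no larger than lam are removed entirely
    return magnitude if beta_ols > 0 else -magnitude


# Hypothetical least-squares coefficients and penalty strength.
ols_coefficients = [3.0, -0.5, 0.2]
lam = 1.0

ridge_coefs = [ridge_shrink(b, lam) for b in ols_coefficients]
lasso_coefs = [lasso_shrink(b, lam) for b in ols_coefficients]

print(ridge_coefs)
print(lasso_coefs)
```

Ridge scales every coefficient toward zero but leaves all of them nonzero, while Lasso subtracts a fixed amount and sets the two small coefficients exactly to zero. This is the mechanism behind the feature-selection behavior described in the list above.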


## **Conclusion**
Since the performance differences are negligible, the choice depends on the goal:  
- **Lasso:** If feature selection and sparsity are needed.  
- **Ridge:** If keeping all features while reducing variance is important.
