# **Lasso Regression**

Lasso Regression is a linear regression technique that uses L1 regularization to reduce overfitting: it adds a penalty proportional to the sum of the absolute values of the coefficients, which discourages large coefficients. It is particularly useful on high-dimensional datasets where feature selection is important.

The L1 penalty can shrink some regression coefficients to exactly zero, effectively removing those features from the model. This is what makes Lasso a good fit for datasets with many features, some of which may be irrelevant or redundant.
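For reference, the objective minimized by scikit-learn's `Lasso` can be written as below. The symbols are chosen here for illustration: $n$ is the number of samples, $X$ the feature matrix, $y$ the target, $w$ the coefficient vector, and $\alpha$ the regularization strength.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because the $\ell_1$ term is not differentiable at zero, the minimizer can sit exactly at $w_j = 0$ for some features, which is where the feature-selection effect comes from.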

When to use Lasso Regression: 
- When you suspect that many of your features are irrelevant or redundant. 
- When you want to build a model that is both accurate and easy to interpret. 
- When you are dealing with high-dimensional data (many features). 
- When you want to perform automatic feature selection. 
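The automatic feature selection described in the list above can be illustrated with a minimal sketch. It uses synthetic data from `make_regression` rather than the notebook's dataset, and the names `X_demo`, `y_demo`, and the choice `alpha=5.0` are assumptions made only for this illustration. It compares how many coefficients ordinary least squares and Lasso set to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 20 features, only 5 informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit an unpenalized model and a Lasso model on the same data.
ols = LinearRegression().fit(X_demo, y_demo)
lasso = Lasso(alpha=5.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each model.
ols_zeros = int(np.sum(ols.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print("Coefficients set to zero by OLS:  ", ols_zeros)
print("Coefficients set to zero by Lasso:", lasso_zeros)
```

On data like this, OLS typically assigns a small non-zero weight to every feature, while Lasso drops most of the uninformative ones.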

In [107]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

In [108]:
df = pd.read_csv('MagicBricks.csv')
df.dropna(inplace=True)
df.head()

Unnamed: 0,Area,BHK,Bathroom,Furnishing,Locality,Parking,Price,Status,Transaction,Type,Per_Sqft
1,750.0,2,2.0,Semi-Furnished,"J R Designers Floors, Rohini Sector 24",1.0,5000000,Ready_to_move,New_Property,Apartment,6667.0
2,950.0,2,2.0,Furnished,"Citizen Apartment, Rohini Sector 13",1.0,15500000,Ready_to_move,Resale,Apartment,6667.0
3,600.0,2,2.0,Semi-Furnished,Rohini Sector 24,1.0,4200000,Ready_to_move,Resale,Builder_Floor,6667.0
4,650.0,2,2.0,Semi-Furnished,Rohini Sector 24 carpet area 650 sqft status R...,1.0,6200000,Ready_to_move,New_Property,Builder_Floor,6667.0
5,1300.0,4,3.0,Semi-Furnished,Rohini Sector 24,1.0,15500000,Ready_to_move,New_Property,Builder_Floor,6667.0


In [109]:
# df.info() prints its report itself and returns None, so it is not wrapped in print()
df.info()
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
Index: 1005 entries, 1 to 1258
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Area         1005 non-null   float64
 1   BHK          1005 non-null   int64  
 2   Bathroom     1005 non-null   float64
 3   Furnishing   1005 non-null   object 
 4   Locality     1005 non-null   object 
 5   Parking      1005 non-null   float64
 6   Price        1005 non-null   int64  
 7   Status       1005 non-null   object 
 8   Transaction  1005 non-null   object 
 9   Type         1005 non-null   object 
 10  Per_Sqft     1005 non-null   float64
dtypes: float64(4), int64(2), object(5)
memory usage: 94.2+ KB
               Area          BHK     Bathroom      Parking         Price  \
count   1005.000000  1005.000000  1005.000000  1005.000000  1.005000e+03   
mean    1504.301968     2.791045     2.575124     1.697512  2.224030e+07   
std     1729.104830     0.961469     1.088503   

In [110]:
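# Price is the target. Per_Sqft is dropped too, presumably because a
# price-per-area column would leak the target into the features.
X = df.drop(columns=['Per_Sqft', 'Price'])
y = df['Price']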

In [111]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

In [112]:
from sklearn.preprocessing import TargetEncoder
# First, we convert the string columns into numbers using TargetEncoder.
target_columns = ['Furnishing', 'Locality', 'Status', 'Transaction', 'Type']
target_encoder = TargetEncoder(target_type='continuous')

# Make copies to work with
X_train_encoded = X_train.copy()
X_test_encoded = X_test.copy()

# Fit and transform the training data
X_train_encoded[target_columns] = target_encoder.fit_transform(X_train[target_columns], y_train)

# Just transform the test data
X_test_encoded[target_columns] = target_encoder.transform(X_test[target_columns])

# At this point, X_train_encoded and X_test_encoded are fully numerical.

# Scale the features. Now that every column is numeric, StandardScaler can be applied.
# Lasso applies the same penalty to every coefficient, so features should be on a comparable scale.
scaler = StandardScaler()

# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train_encoded)

# Just transform the test data
X_test_scaled = scaler.transform(X_test_encoded)
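The fit-on-train, transform-on-test pattern used in the cell above can be checked on a small example. This is a minimal sketch on random data; `train`, `test`, and `demo_scaler` are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in data (assumption for illustration), not the notebook's dataset.
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
test = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from the training data only
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

print("train mean after scaling:", train_scaled.mean(axis=0))
print("test mean after scaling: ", test_scaled.mean(axis=0))
```

The scaled training columns have mean zero by construction. The scaled test columns generally do not, because they are shifted by the training mean; that is expected and keeps test information out of the preprocessing.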

In [113]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((804, 9), (201, 9), (804,), (201,))

In [114]:
X_train_scaled.shape, X_test_scaled.shape


((804, 9), (201, 9))

In [115]:
# alpha is the regularization strength of Lasso.
# A higher alpha shrinks the coefficients more (and sets more of them to zero);
# a lower alpha shrinks them less.
# Lasso() with no arguments uses the default alpha=1.0.
model = Lasso()
model.fit(X_train_scaled, y_train)
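To make the effect of `alpha` concrete, here is a minimal sketch that refits Lasso for several values and counts the non-zero coefficients. It runs on synthetic `make_regression` data, not on the notebook's dataset, and the alpha values and the name `nonzero_counts` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (assumption for illustration): 20 features, only 5 informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Refit with increasing regularization strength and count surviving features.
nonzero_counts = {}
for alpha in [0.01, 1.0, 10.0, 1e6]:
    fitted = Lasso(alpha=alpha, max_iter=10000).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.count_nonzero(fitted.coef_))
    print(f"alpha={alpha:>9}: {nonzero_counts[alpha]} non-zero coefficients")
```

How strong a given `alpha` is depends on the scale of the target. With a target measured in millions, as `Price` is here, `alpha=1.0` is likely a very weak penalty, so inspecting `model.coef_` is a reasonable way to check whether any feature was actually dropped.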

In [116]:
y_pred = model.predict(X_test_scaled)
display(pd.DataFrame({'Actual': y_test, 'Predicted': y_pred}))

Unnamed: 0,Actual,Predicted
1160,16500000,1.196421e+07
1158,15600000,1.226170e+07
1122,32500000,2.418845e+07
118,28000000,3.186786e+07
1100,4600000,1.129761e+07
...,...,...
1085,19000000,3.086770e+07
1007,3800000,-5.365405e+05
294,55000000,5.485365e+07
992,1700000,-9.415403e+06


In [117]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# For a regressor, model.score() returns R^2 (not a classification accuracy),
# so it matches r2_score() below.
print("R2 (model.score):", model.score(X_test_scaled, y_test))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score:", r2_score(y_test, y_pred))

R2 (model.score): 0.6664461930079986
MAE: 8090540.067561318
MSE: 207170529788985.94
RMSE: 14393419.669730539
R2 Score: 0.6664461930079986
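The metrics above can be reproduced by hand, which makes clear what each one measures. This is a minimal sketch on four made-up values (`y_true_demo` and `y_pred_demo` are hypothetical, not taken from the dataset), cross-checked against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration only.
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true_demo - y_pred_demo
mae = np.mean(np.abs(errors))                              # mean absolute error
mse = np.mean(errors ** 2)                                 # mean squared error
rmse = np.sqrt(mse)                                        # same units as the target
ss_res = np.sum(errors ** 2)                               # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                                 # share of variance explained

print("MAE:", mae)    # 0.875
print("MSE:", mse)    # 1.3125
print("RMSE:", rmse)
print("R2:", r2)

# Cross-check against scikit-learn.
assert np.isclose(mae, mean_absolute_error(y_true_demo, y_pred_demo))
assert np.isclose(mse, mean_squared_error(y_true_demo, y_pred_demo))
assert np.isclose(r2, r2_score(y_true_demo, y_pred_demo))
```

MAE and RMSE are in the units of the target, so for this notebook they are read as rupee amounts, while R2 is unitless.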


In [118]:
# max_iter caps the number of iterations the solver may run.
# tol is the convergence tolerance: the solver stops once its updates and optimality gap fall below it.
# A larger tol lets training stop earlier at a less precise solution; a smaller tol may need more iterations.
param_grid = {
    'alpha': [0.01, 0.1, 0.5, 1, 10, 100],
    'max_iter': [1000, 5000, 10000],
    'tol': [1e-4, 1e-5, 1e-6]
}
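The interaction between `tol` and the number of iterations can be observed through the fitted model's `n_iter_` attribute. This is a minimal sketch on synthetic data; the two tolerance values and the names `loose` and `tight` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (assumption for illustration), not the notebook's dataset.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Same model, two different convergence tolerances.
loose = Lasso(alpha=1.0, tol=1e-2, max_iter=10000).fit(X_demo, y_demo)
tight = Lasso(alpha=1.0, tol=1e-8, max_iter=10000).fit(X_demo, y_demo)

# n_iter_ reports how many solver iterations were run before stopping.
print("iterations with tol=1e-2:", loose.n_iter_)
print("iterations with tol=1e-8:", tight.n_iter_)
```

If a fit stops because it reached `max_iter` rather than `tol`, scikit-learn emits a `ConvergenceWarning`, which is the usual signal to raise `max_iter` or loosen `tol`.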

In [119]:
from sklearn.model_selection import GridSearchCV

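# 10-fold cross-validation over all 54 parameter combinations.
# With no scoring argument, candidates are ranked by the estimator's own score (R^2 for Lasso).
model2 = GridSearchCV(model, param_grid, cv=10, n_jobs=4, verbose=2)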
model2.fit(X_train_scaled, y_train)

Fitting 10 folds for each of 54 candidates, totalling 540 fits
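One possible variation on the search above is to wrap the scaler and the model in a `Pipeline`, so that scaling is refit inside each cross-validation fold, and then to inspect `best_params_` and `best_score_`. The sketch below is not the notebook's method: it uses synthetic data, a 5-fold search over `alpha` only, and hypothetical names (`pipe`, `search`, `X_tr`, `X_te`).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration), not the notebook's dataset.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=4, noise=15.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=30)

# The scaler is part of the estimator, so each CV fold fits it on its own training split.
pipe = Pipeline([("scaler", StandardScaler()), ("lasso", Lasso(max_iter=10000))])
alphas = [0.01, 0.1, 1, 10, 100]
search = GridSearchCV(pipe, {"lasso__alpha": alphas}, cv=5)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("best CV R^2:", search.best_score_)
print("test R^2:   ", search.score(X_te, y_te))
```

Printing `best_params_` shows which candidate the search selected; calling `score` or `predict` on the fitted search object uses that best estimator, refit on the full training set.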


In [120]:
y_pred_2 = model2.predict(X_test_scaled)
display(pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_2}))

Unnamed: 0,Actual,Predicted
1160,16500000,1.196420e+07
1158,15600000,1.226160e+07
1122,32500000,2.418870e+07
118,28000000,3.186796e+07
1100,4600000,1.129797e+07
...,...,...
1085,19000000,3.086774e+07
1007,3800000,-5.361395e+05
294,55000000,5.485373e+07
992,1700000,-9.415141e+06


In [None]:
# Score the tuned model (model2); calling model.score() here would only repeat the baseline result.
print("R2 2 (model2.score):", model2.score(X_test_scaled, y_test))
print("MAE 2:", mean_absolute_error(y_test, y_pred_2))
print("MSE 2:", mean_squared_error(y_test, y_pred_2))
print("RMSE 2:", np.sqrt(mean_squared_error(y_test, y_pred_2)))
print("R2 Score 2:", r2_score(y_test, y_pred_2))

R2 2 (model2.score): 0.6664459764822641
MAE 2: 8090507.224919359
MSE 2: 207170664273288.56
RMSE 2: 14393424.341458447
R2 Score 2: 0.6664459764822641
