# Lasso Regression for House Price Prediction


This notebook demonstrates Lasso regression for predicting house prices.
In the later sections, we also tune the regularization strength with GridSearchCV.


## Data Preparation

In [8]:
import pickle
import pathlib
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Loading data
DATA_DIR = pathlib.Path.cwd().parent / 'data'
clean_data_path = DATA_DIR / 'processed' / 'ames_clean.pkl'
with open(clean_data_path, 'rb') as file:
    data = pickle.load(file)

# Define target and features
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [9]:
# Check which columns contain 'Other' as a value, to fix errors raised in the pipeline
for col in X_train.columns:
    if 'Other' in X_train[col].values:
        print(f"'Other' found in column: {col}")

'Other' found in column: MS.SubClass
'Other' found in column: Roof.Style
'Other' found in column: Mas.Vnr.Type
'Other' found in column: Foundation
'Other' found in column: Sale.Type
'Other' found in column: Exterior
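
A possible follow-up check, assuming the pipeline errors come from columns that mix value types (for example numeric codes next to the `'Other'` label): list every column that holds more than one Python type. The helper `find_mixed_type_columns` and the toy frame are hypothetical and only stand in for `X_train`.

```python
import pandas as pd

def find_mixed_type_columns(df):
    """Return {column: set of Python type names} for columns holding more than one type."""
    mixed = {}
    for col in df.columns:
        # Collect the type name of every non-missing value in the column
        type_names = {type(v).__name__ for v in df[col].dropna()}
        if len(type_names) > 1:
            mixed[col] = type_names
    return mixed

# Toy frame standing in for X_train (values are made up for illustration)
toy = pd.DataFrame({
    'MS.SubClass': [20, 60, 'Other'],
    'Roof.Style': ['Gable', 'Hip', 'Other'],
})
mixed_cols = find_mixed_type_columns(toy)
print(mixed_cols)
```

On the real data, the same call would be `find_mixed_type_columns(X_train)`.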



Lasso regression is a linear regression with an L1 regularization term: a penalty proportional to the sum of the absolute values of the coefficients.
This penalty shrinks coefficients toward zero and can set some of them exactly to zero, which yields simpler models that often generalize better.
The strength of the regularization is controlled by the `alpha` hyperparameter.
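
To illustrate the effect of `alpha`, here is a minimal sketch on synthetic data (not the Ames dataset) that counts how many coefficients stay non-zero as the penalty grows. The data, the variable names, and the list of `alpha` values are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (illustrative only): 10 features, only the first 3 affect the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

# Fit one model per alpha and count the coefficients that are not exactly zero
nonzero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(lasso.coef_ != 0))
    print(f"alpha={alpha:>5}: {nonzero_counts[alpha]} non-zero coefficients")
```

A larger `alpha` leaves fewer active features; a penalty that is large enough removes all of them and the model predicts only the intercept.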


In [10]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# List of categorical columns
categorical_cols = [cname for cname in X_train.columns if X_train[cname].dtype == "category"]

# ColumnTransformer that applies OneHotEncoder only to the categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('one_hot_encoder', OneHotEncoder(drop='first'), categorical_cols)  # drop='first' to avoid collinearity
    ],
    remainder='passthrough'  # automatically passthrough remaining columns
)

# Build the pipeline: preprocessing followed by the Lasso model
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('lasso', Lasso())
                       ])

model.fit(X_train, y_train)

TypeError: Encoders require their input argument must be uniformly strings or numbers. Got ['int', 'str']
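
The `TypeError` above indicates that at least one categorical column mixes integers and strings, which `OneHotEncoder` rejects. One possible fix, sketched here on toy data, is to cast the affected columns to strings before fitting. The helper `cast_columns_to_str` and the toy values are hypothetical; which columns need the cast in the real dataset is an assumption to verify.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def cast_columns_to_str(df, columns):
    """Return a copy of df with the given columns converted to string-valued categories."""
    out = df.copy()
    for col in columns:
        # Note: astype(str) also turns missing values into a literal string
        out[col] = out[col].astype(str).astype('category')
    return out

# Toy data standing in for X_train / y_train (values are made up)
X_toy = pd.DataFrame({
    'MS.SubClass': [20, 60, 'Other', 20, 60, 'Other'],
    'Lot.Area': [8450.0, 9600.0, 11250.0, 9550.0, 14260.0, 14115.0],
})
y_toy = pd.Series([208500.0, 181500.0, 223500.0, 140000.0, 250000.0, 143000.0])

X_fixed = cast_columns_to_str(X_toy, ['MS.SubClass'])
categorical_cols = [c for c in X_fixed.columns if X_fixed[c].dtype == "category"]

preprocessor = ColumnTransformer(
    transformers=[('one_hot_encoder', OneHotEncoder(drop='first'), categorical_cols)],
    remainder='passthrough',
)
toy_model = Pipeline(steps=[('preprocessor', preprocessor), ('lasso', Lasso())])
toy_model.fit(X_fixed, y_toy)  # no TypeError: the column now holds only strings
toy_predictions = toy_model.predict(X_fixed)
```

On the real data, the same cast would have to be applied to both `X_train` and `X_test` so that the encoder sees consistent values at fit and predict time.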

## Hyperparameter Tuning with GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

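# Hyperparameter grid; the 'lasso__' prefix targets the 'lasso' step of the pipeline
param_grid = {'lasso__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# Grid search over the full pipeline, so the categorical columns are encoded inside each CV fold
# (a bare Lasso() cannot be fit on the raw categorical columns of X_train)
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')

grid_search.fit(X_train, y_train)


In [None]:
best_alpha = grid_search.best_params_['lasso__alpha']
best_alpha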
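
Because the search uses `scoring='neg_mean_squared_error'`, the stored scores are negated MSE values. A minimal sketch of turning them into RMSE values per `alpha`, shown on synthetic data from `make_regression` rather than the Ames dataset; `demo_search` and the other names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

demo_search = GridSearchCV(
    Lasso(),
    {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring='neg_mean_squared_error',
)
demo_search.fit(X_demo, y_demo)

# Scores are negated MSE, so flip the sign before taking the square root
cv_rmse = np.sqrt(-demo_search.cv_results_['mean_test_score'])
for params, score in zip(demo_search.cv_results_['params'], cv_rmse):
    print(f"alpha={params['alpha']:>7}: CV RMSE = {score:.2f}")

best_cv_rmse = np.sqrt(-demo_search.best_score_)
```

The value computed this way is the square root of the mean MSE across folds, which is not exactly the mean of the per-fold RMSE values.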

## Model Evaluation

In [None]:
# Refit the full pipeline (encoder + Lasso) with the best alpha
best_lasso = model.set_params(lasso__alpha=best_alpha)
best_lasso.fit(X_train, y_train)

# RMSE on the held-out test set
from sklearn.metrics import mean_squared_error

predictions = best_lasso.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
rmse
