# Ames Housing Step-by-step - Extra - Use Pipelines with LASSO

Pieter Overdevest  
2024-02-09

For suggestions/questions regarding this notebook, please contact
[Pieter Overdevest](https://www.linkedin.com/in/pieteroverdevest/)
(pieter@innovatewithdata.nl).

### How to work with this Jupyter Notebook yourself?

- Get a copy of the repository ('repo') [machine-learning-with-python-explainers](https://github.com/EAISI/machine-learning-with-python-explainers) from EAISI's GitHub site, either by cloning the repo or by downloading the zip file. Both options are explained in this YouTube video by [Coderama](https://www.youtube.com/watch?v=EhxPBMQFCaI).

- Copy the folder 'ames-housing-pieter\' located in the folder 'example-solutions\' to your own project folder.

### Import packages

In [2]:
# Third party packages.
import pandas as pd     # Data handling
import numpy as np      # Numeric calculations
import pickle           # Save and load data

from sklearn.pipeline           import Pipeline, make_pipeline  # Pipeline
from sklearn.compose            import ColumnTransformer        # Column transformer
from sklearn.impute             import SimpleImputer            # Imputation
from sklearn.model_selection    import train_test_split         # Split data
from sklearn.preprocessing      import StandardScaler           # Scale data
from sklearn.preprocessing      import OneHotEncoder            # One hot encoding
from sklearn.decomposition      import PCA                      # PCA
from sklearn.linear_model       import LassoCV                  # Lasso regression


# Setting sklearn parameter for Pipeline visualization.
import sklearn 
sklearn.set_config(display="diagram")

# Pieter's utils package.
import utils_pieter as up

Done!


### Load objects from 'ames_housing_pieter'

The solution notebooks 'ames-housing-pieter-exercise-1-2-3.ipynb', 'ames-housing-pieter-exercise-4.ipynb', and 'ames-housing-pieter-exercise-5-6.ipynb' in the 'ames_housing_pieter\\' folder each generate a so-called [pickle file](https://docs.python.org/3/library/pickle.html): 'dc-ames-housing-pieter-exercise-1-2-3.pkl', 'dc-ames-housing-pieter-exercise-4.pkl', and 'dc-ames-housing-pieter-exercise-5-6.pkl', respectively. These files are located in the 'data\\' folder. Copy the 'data\\' folder to the project folder where you are running this notebook. Then run the cell below, which reads the first two pickle files and stores the data we need in Python objects.

In [3]:
with open('data/dc-ames-housing-pieter-exercise-1-2-3.pkl', 'rb') as pickle_file:
    dc_exercise_1_2_3 = pickle.load(pickle_file)

df_orig = dc_exercise_1_2_3['df_orig']

with open('data/dc-ames-housing-pieter-exercise-4.pkl', 'rb') as pickle_file:
    dc_exercise_4 = pickle.load(pickle_file)

l_df_X_names  = dc_exercise_4['l_df_X_names']
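For readers new to pickle files, the save-and-load round trip can be sketched as follows. The dictionary contents and the file name are made up for illustration; the real files are produced by the exercise notebooks.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the dictionary saved by an exercise notebook.
dc_example = {"l_df_X_names": ["Lot Area", "Overall Qual"]}

with tempfile.TemporaryDirectory() as s_dir:
    path_file = Path(s_dir) / "dc-example.pkl"  # hypothetical file name

    # Save the dictionary to disk ...
    with open(path_file, "wb") as pickle_file:
        pickle.dump(dc_example, pickle_file)

    # ... and read it back, in the same way as the cell above.
    with open(path_file, "rb") as pickle_file:
        dc_loaded = pickle.load(pickle_file)

print(dc_loaded["l_df_X_names"])
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.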

### Let's get started

In this extra notebook, we repeat, more or less, what we did in the Ames Housing exercises, but now using pipelines to demonstrate their use; see also Python Explainer 'pipelines'. Besides [`Pipeline()`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and [`make_pipeline()`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html), we use [`ColumnTransformer()`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) to process numerical and categorical variables differently.

We start by defining separate pipelines for numerical and categorical variables. Here, we use `Pipeline()` instead of `make_pipeline()`, so that we can name the steps ourselves and refer to them later on (["Are you using Pipeline in Scikit-Learn?" by Ankit Goel](https://towardsdatascience.com/are-you-using-pipeline-in-scikit-learn-ac4cd85cb27f)). Imputation is performed by scikit-learn's [`SimpleImputer()`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html). Note that we are only defining the steps of each pipeline; there is no reference to the data yet.

In [4]:
# Create pipeline for numerical variables.
pl_Num = Pipeline([

        ( "impute", SimpleImputer(missing_values = np.nan, strategy = "median") ),

        ( "scale",  StandardScaler() )
])

# Create pipeline for categorical variables.
pl_Cat = Pipeline([

        ( "impute", SimpleImputer(missing_values = np.nan, strategy = "most_frequent") ),

        ( "onehot", OneHotEncoder() )
])
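To make the point that a pipeline is only a recipe until it sees data, here is a minimal sketch that rebuilds the numerical pipeline and applies it to a made-up toy array. The names `pl_num_demo` and `m_toy` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same steps as pl_Num, rebuilt here so the sketch is self-contained.
pl_num_demo = Pipeline([
    ("impute", SimpleImputer(missing_values=np.nan, strategy="median")),
    ("scale",  StandardScaler()),
])

# Toy data (made up): one column with a missing value.
m_toy = np.array([[1.0], [2.0], [np.nan], [4.0], [10.0]])

# Only now does the pipeline see data: each step is fitted and then applied.
m_toy_out = pl_num_demo.fit_transform(m_toy)

# The median learned by the 'impute' step (the median of 1, 2, 4, 10 is 3).
print(pl_num_demo.named_steps["impute"].statistics_)

# The imputed and scaled values have mean 0 and standard deviation 1.
print(m_toy_out.ravel())
```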

The parameter `transformers` of [`ColumnTransformer()`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) receives a list of tuples, each containing (1) a name, (2) a pipeline, and (3) the name(s) of the variables to which that pipeline is applied; see below. Each tuple thus specifies how the variables listed in it are transformed.

Note that `remainder = 'drop'` tells the column transformer to drop all variables that are not mentioned in any of the tuples.

In [7]:
# Define column transformer object.
ct_ColumnTransformer = ColumnTransformer(
    
    transformers = [
        
        # Tuples containing pipeline name, pipeline, and variable names for each transformation to take place.
        ('num', pl_Num, l_df_X_names),

        # As we pull the data through a pipeline, it is easier to add more categorical variables.
        ('cat', pl_Cat, ['Neighborhood'] + ['Pool QC'])
        #('cat', pl_Cat, ['Neighborhood'])
    ],
    
    remainder = 'drop',
    verbose   = True
)

ct_ColumnTransformer
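The effect of `remainder = 'drop'` can be seen in a small sketch on a made-up data frame. For brevity, single transformers are used here instead of pipelines; the column names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data frame (made up): one numerical, one categorical, and one unused column.
df_toy = pd.DataFrame({
    "area":  [50.0, 80.0, 120.0, 65.0],
    "zone":  ["A", "B", "A", "C"],
    "notes": ["x", "y", "z", "w"],   # not listed below, so it is dropped
})

ct_demo = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["area"]),
        ("cat", OneHotEncoder(),  ["zone"]),
    ],
    remainder="drop",
)

m_toy = ct_demo.fit_transform(df_toy)

# One scaled column plus three one-hot columns (A, B, C); 'notes' is gone.
print(m_toy.shape)
```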

As with scaling and one-hot encoding in the exercises, we use `fit_transform()` to pull a data frame through the column transformer. With scaling, we performed `fit()` and `transform()` separately, so that the scaling parameters derived from the train set could be applied to the test set. With one-hot encoding, we performed both steps in one go using `fit_transform()`. Here, we take the latter approach and apply it to the original data (`df_orig`), resulting in `m_X_transformed`. Note that this means the imputation and scaling parameters are derived from all observations, including those that end up in the test set further down.

Does the number of columns in `m_X_transformed` correspond to what we expect? Yes, 38 + 28 + 4 = 70!

In [5]:
m_X_transformed = ct_ColumnTransformer.fit_transform(df_orig)

print(f"\nNumber of columns in the resulting array:        {m_X_transformed.shape[1]}")
print(f"Number of numerical variables (excl. 'SalePrice'): {len(l_df_X_names)}")
print(f"Number of unique values in 'Neighborhood':         {len(df_orig['Neighborhood'].value_counts())}")
print(f"Number of unique values in 'Pool QC':              {len(df_orig['Pool QC'].value_counts())}")

[ColumnTransformer] ........... (1 of 2) Processing num, total=   0.0s
[ColumnTransformer] ........... (2 of 2) Processing cat, total=   0.0s

Number of columns in the resulting array:        70
Number of numerical variables (excl. 'SalePrice'): 38
Number of unique values in 'Neighborhood':         28
Number of unique values in 'Pool QC':              4
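For comparison, the route with separate `fit()` and `transform()` calls can be sketched on a made-up data frame: the transformer is fitted on the train set only and then applied to the test set. This is not what the notebook does. The option `handle_unknown="ignore"` is an addition for this sketch, so that a category occurring only in the test set does not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up).
df_toy = pd.DataFrame({
    "area": [50.0, 80.0, 120.0, 65.0, 95.0, 70.0],
    "zone": ["A", "B", "A", "C", "B", "A"],
})

df_train, df_test = train_test_split(df_toy, test_size=0.33, random_state=42)

ct_demo = ColumnTransformer([
    ("num", StandardScaler(), ["area"]),
    # Categories not seen during fit are encoded as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["zone"]),
])

# Learn means, standard deviations, and categories from the train set only ...
m_train = ct_demo.fit_transform(df_train)

# ... and reuse them, unchanged, on the test set.
m_test = ct_demo.transform(df_test)

print(m_train.shape, m_test.shape)
```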


Next on our list of activities is to create a list of variable names, so we can convert the array `m_X_transformed` to a data frame and assign the corresponding variable names. For the numerical part this is easy, as the numerical variable names are already stored in `l_df_X_names`. 

For the one-hot encoded categorical variables, `Neighborhood` and `Pool QC`, this is a bit trickier. The object `ct_ColumnTransformer` contains the attribute `named_transformers_`. This has two elements, 'num' and 'cat', i.e., the names we gave to the respective transformers. From the 'cat' pipeline, we select the one-hot encoding step by the name we gave it, 'onehot'. Here we see the benefit of `Pipeline()` over `make_pipeline()`: we can use the names that we gave to each step. With `make_pipeline()`, we do not name the steps ourselves; scikit-learn generates the names for us. This makes it more cumbersome to refer to individual steps at a later stage, as we do here. See also Python Explainer 'pipeline'.

We assign the variable names derived from `ct_ColumnTransformer` to object `l_df_cat_transformed_names`, and we observe that they match what we derived above.

In [6]:
l_df_cat_transformed_names = (
    ct_ColumnTransformer
    .named_transformers_['cat']['onehot']
    .get_feature_names_out(['neighborhood', 'pool_qc'])
    #.get_feature_names_out(['neighborhood'])
    .tolist()
)
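The difference in step names between `Pipeline()` and `make_pipeline()` can be checked with a short sketch. The objects `pl_named` and `pl_auto` are illustrative and not part of the notebook.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Step names chosen by us.
pl_named = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder()),
])

# Step names generated by scikit-learn from the lowercased class names.
pl_auto = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder())

l_named = [name for name, _ in pl_named.steps]
l_auto  = [name for name, _ in pl_auto.steps]

print(l_named)  # ['impute', 'onehot']
print(l_auto)   # ['simpleimputer', 'onehotencoder']
```

With the generated names, the lookup would read `['onehotencoder']` instead of `['onehot']`.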

We concatenate the numerical variable names and the names that follow from the two categorical variables, `Neighborhood` and `Pool QC`.

In [16]:
l_df_X_transformed_names = l_df_X_names + l_df_cat_transformed_names

l_df_X_transformed_names

['Order',
 'PID',
 'MS SubClass',
 'Lot Frontage',
 'Lot Area',
 'Overall Qual',
 'Overall Cond',
 'Year Built',
 'Year Remod/Add',
 'Mas Vnr Area',
 'BsmtFin SF 1',
 'BsmtFin SF 2',
 'Bsmt Unf SF',
 'Total Bsmt SF',
 '1st Flr SF',
 '2nd Flr SF',
 'Low Qual Fin SF',
 'Gr Liv Area',
 'Bsmt Full Bath',
 'Bsmt Half Bath',
 'Full Bath',
 'Half Bath',
 'Bedroom AbvGr',
 'Kitchen AbvGr',
 'TotRms AbvGrd',
 'Fireplaces',
 'Garage Yr Blt',
 'Garage Cars',
 'Garage Area',
 'Wood Deck SF',
 'Open Porch SF',
 'Enclosed Porch',
 '3Ssn Porch',
 'Screen Porch',
 'Pool Area',
 'Misc Val',
 'Mo Sold',
 'Yr Sold',
 'neighborhood_Blmngtn',
 'neighborhood_Blueste',
 'neighborhood_BrDale',
 'neighborhood_BrkSide',
 'neighborhood_ClearCr',
 'neighborhood_CollgCr',
 'neighborhood_Crawfor',
 'neighborhood_Edwards',
 'neighborhood_Gilbert',
 'neighborhood_Greens',
 'neighborhood_GrnHill',
 'neighborhood_IDOTRR',
 'neighborhood_Landmrk',
 'neighborhood_MeadowV',
 'neighborhood_Mitchel',
 'neighborhood_NAmes',


Let's do our checks and balances to confirm that the dimensions of our data match our expectations, see below. The transformed data matrix `m_X_transformed` has as many columns as there are elements in `l_df_X_transformed_names`, which we created from the numerical variables in `l_df_X_names` and the unique values in `Neighborhood` and `Pool QC`.

In [8]:
print(f"Number of columns in the resulting array: {m_X_transformed.shape[1]}")
print(f"Length of 'l_df_X_transformed_names':     {len(l_df_X_transformed_names)}")


Number of columns in the resulting array: 70
Length of 'l_df_X_transformed_names':     70


Now, we construct data frame `df_X_transformed` from matrix/2D array `m_X_transformed` and list `l_df_X_transformed_names`.

In [9]:
df_X_transformed = pd.DataFrame(m_X_transformed, columns = l_df_X_transformed_names)

df_X_transformed.head(5)

| | Order | PID | MS SubClass | Lot Frontage | Lot Area | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Mas Vnr Area | ... | neighborhood_Sawyer | neighborhood_SawyerW | neighborhood_Somerst | neighborhood_StoneBr | neighborhood_Timber | neighborhood_Veenker | pool_qc_Ex | pool_qc_Fa | pool_qc_Gd | pool_qc_TA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.73146 | -0.997164 | -0.877005 | 3.375742 | 2.744381 | -0.067254 | -0.506718 | -0.375537 | -1.163488 | 0.061046 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | -1.730277 | -0.996904 | -0.877005 | 0.514952 | 0.187097 | -0.776079 | 0.393091 | -0.342468 | -1.115542 | -0.566039 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | -1.729095 | -0.996899 | -0.877005 | 0.56185 | 0.522814 | -0.067254 | 0.393091 | -0.441674 | -1.25938 | 0.03865 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | -1.727913 | -0.996888 | -0.877005 | 1.124628 | 0.128458 | 0.641571 | -0.506718 | -0.110988 | -0.779919 | -0.566039 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | -1.726731 | -0.992903 | 0.061285 | 0.233563 | 0.467348 | -0.776079 | -0.506718 | 0.848 | 0.658466 | -0.566039 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


We split the predictor data using Scikit-Learn's `train_test_split()` ([*ref*](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)).

In [10]:
df_X_train, df_X_test, ps_y_log_train, ps_y_log_test = train_test_split(
    df_X_transformed,
    np.log(df_orig.SalePrice),
    test_size    = 0.3,
    random_state = 42
)

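Instead of using scikit-learn's `Lasso()` to build a single model, we use `LassoCV()` to build a series of LASSO models, one for each value of $\alpha$, and to select the best one by cross-validation. By default, `LassoCV()` tries 100 different values for $\alpha$, as set by the parameter `n_alphas`. Below, we provide our own list of values through the parameter `alphas` instead.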

In [11]:
# Define the log of the lower and upper bounds of the alpha range.
n_alphas_min_log = np.log(0.0001)
n_alphas_max_log = np.log(0.01)
n_step = (n_alphas_max_log - n_alphas_min_log) / 100

# List of alphas, evenly spaced on a log scale and rounded to five decimals.
l_alpha = [round(np.e**i,5) for i in np.arange(n_alphas_min_log, n_alphas_max_log, n_step)]


mo_lasso = LassoCV(
    
    cv           = 5,       # Number of folds.
    random_state = 42,      # Seed; only used when selection = 'random' (the default is 'cyclic').
    max_iter     = 100000,  # Max number of iterations.

    # We can enforce for which alphas a Lasso model is fitted.
    # In case we do not provide a list, LassoCV() will select 100 values.
    alphas       = l_alpha

).fit(
    
    df_X_train,
    ps_y_log_train
)
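As an alternative to the list comprehension above, a log-spaced grid can also be sketched with NumPy's `geomspace()`. This is not the notebook's code: the values below are not rounded to five decimals, so the grid is similar to, but not identical to, `l_alpha`. The name `a_alpha` is illustrative.

```python
import numpy as np

# 100 log-spaced values from 0.0001 up to, but excluding, 0.01.
a_alpha = np.geomspace(0.0001, 0.01, num=100, endpoint=False)

# On a log scale the values are evenly spaced, so consecutive values
# differ by a constant factor.
a_ratio = a_alpha[1:] / a_alpha[:-1]

print(len(a_alpha), a_alpha[0], a_alpha[-1])
print(a_ratio.min(), a_ratio.max())
```

The array can be passed to `LassoCV()` through the parameter `alphas` in the same way as a list.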

From the `mo_lasso` object we can extract the intercept and coefficients of the best model. What explains the number of coefficients equal to zero? How can we increase the number of coefficients equal to zero?

In [12]:
print(f"intercept: {mo_lasso.intercept_:,.0f}")

pd.set_option("display.max_rows", 50)
pd.set_option("display.min_rows", 40)

pd.DataFrame(

    {
        'coef':     mo_lasso.coef_,
        'coef_abs': abs(mo_lasso.coef_),
        'variable':  df_X_train.columns
    }

).sort_values(

    'coef_abs',
    ascending=False
)

intercept: 12


| | coef | coef_abs | variable |
|---|---|---|---|
| 68 | -0.454732 | 0.454732 | pool_qc_Gd |
| 48 | 0.394561 | 0.394561 | neighborhood_GrnHill |
| 51 | -0.178800 | 0.178800 | neighborhood_MeadowV |
| 40 | -0.149644 | 0.149644 | neighborhood_BrDale |
| 57 | 0.129602 | 0.129602 | neighborhood_NridgHt |
| 49 | -0.119768 | 0.119768 | neighborhood_IDOTRR |
| 44 | 0.114031 | 0.114031 | neighborhood_Crawfor |
| 5 | 0.112157 | 0.112157 | Overall Qual |
| 63 | 0.111212 | 0.111212 | neighborhood_StoneBr |
| 39 | -0.105308 | 0.105308 | neighborhood_Blueste |
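As a hint for the questions above, the following sketch counts the coefficients that are exactly zero for a few values of $\alpha$. It uses made-up synthetic data, not the Ames Housing data, and the chosen values of $\alpha$ are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (made up): only the first two of ten predictors matter.
rng = np.random.default_rng(42)
m_X = rng.standard_normal((200, 10))
v_y = 3.0 * m_X[:, 0] - 2.0 * m_X[:, 1] + rng.standard_normal(200)

# Count the coefficients that are exactly zero for increasing alpha.
dc_n_zero = {}
for n_alpha in [0.001, 0.5, 10.0]:
    mo_demo = Lasso(alpha=n_alpha, max_iter=100000).fit(m_X, v_y)
    dc_n_zero[n_alpha] = int(np.sum(mo_demo.coef_ == 0))

print(dc_n_zero)
```

For a sufficiently large $\alpha$, the penalty dominates and all coefficients are set to zero.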


The best model is the one with the lowest mean squared error (MSE), averaged over the cross-validation folds, and therefore also the lowest RMSE. The alpha for that model can be obtained through the attribute `alpha_`.

In [13]:
print(f"Lowest RMSE found at alpha: {mo_lasso.alpha_:.5f}")

Lowest RMSE found at alpha: 0.00013
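How the selected alpha relates to the cross-validation errors can be sketched on made-up synthetic data, using the attributes `mse_path_` and `alphas_` of a fitted `LassoCV()` object. The short list of alphas is arbitrary and chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data (made up).
rng = np.random.default_rng(42)
m_X = rng.standard_normal((200, 10))
v_y = 3.0 * m_X[:, 0] - 2.0 * m_X[:, 1] + rng.standard_normal(200)

mo_demo = LassoCV(cv=5, alphas=[0.001, 0.01, 0.1, 1.0]).fit(m_X, v_y)

# mse_path_ has one row per alpha and one column per fold.
v_mse_mean = mo_demo.mse_path_.mean(axis=1)

# The selected alpha is the one with the lowest mean cross-validated MSE.
n_alpha_best = mo_demo.alphas_[np.argmin(v_mse_mean)]

print(n_alpha_best, mo_demo.alpha_)
```

If the selected alpha sits at the edge of the grid, it may be worth extending the grid in that direction.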


The `mo_lasso` object holds the properties of the best model, i.e., the one with the lowest cross-validated RMSE, so we can use it directly to make predictions on the test set.

In [14]:
ps_y_log_pred = mo_lasso.predict(df_X_test)

Evaluating the performance of the model on the test set.

In [15]:
up.f_evaluate_results(
    ps_y_true = ps_y_log_test,
    ps_y_pred = ps_y_log_pred
)

Performance Metrics:
MAE:  0.086
MSE:  0.016
RMSE: 0.125
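The function `up.f_evaluate_results()` comes from Pieter's utils package, and its implementation is not shown here. Assuming it reports the standard definitions of MAE, MSE, and RMSE, the same metrics can be sketched with scikit-learn on made-up toy values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy values (made up), standing in for ps_y_log_test and ps_y_log_pred.
v_y_true = np.array([1.0, 2.0, 3.0])
v_y_pred = np.array([1.0, 2.0, 4.0])

n_mae  = mean_absolute_error(v_y_true, v_y_pred)
n_mse  = mean_squared_error(v_y_true, v_y_pred)
n_rmse = np.sqrt(n_mse)

print(f"MAE:  {n_mae:.3f}")
print(f"MSE:  {n_mse:.3f}")
print(f"RMSE: {n_rmse:.3f}")
```

Keep in mind that the metrics above are on the log scale of `SalePrice`. Since exp(0.125) is about 1.13, an RMSE of 0.125 corresponds to a typical error of roughly 13% in the sale price.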
