# Ames Housing Step-by-step - Appendix Pipelines - Use Pipelines with LASSO

Pieter Overdevest  
2024-02-09

For suggestions/questions regarding this notebook, please contact
[Pieter Overdevest](https://www.linkedin.com/in/pieteroverdevest/)
(pieter@innovatewithdata.nl).

### How to work with this Jupyter Notebook yourself?

- Get a copy of the repository ('repo') [machine-learning-with-python-explainers](https://github.com/EAISI/machine-learning-with-python-explainers) from EAISI's GitHub site. This can be done by either cloning the repo or simply downloading the zip-file. Both options are explained in this Youtube video by [Coderama](https://www.youtube.com/watch?v=EhxPBMQFCaI).

- Copy the folders 'ames_housing_pieter\' and 'utils_pieter\' to your own project folder.

### Import packages

In [1]:
# Load packages and assign to a shorter alias.
import pandas as pd
import numpy as np

# Pieter's utils package.
import utils_pieter as up

### Let's get started

In this section, we repeat - more or less - what we did above, but now using pipelines to demonstrate their use, see also Python Explainer 'pipelines'. Besides the functions [`Pipeline()`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and [`make_pipeline()`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html), we make use of the [`ColumnTransformer()`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) function to process numerical and categorical variables differently. We start out by defining separate pipelines for numerical and categorical variables. Here, we use the `Pipeline()` function - instead of the `make_pipeline()` function - so we can refer to the individual transformers by name later on (["Are you using Pipeline in Scikit-Learn?" by Ankit Goel](https://towardsdatascience.com/are-you-using-pipeline-in-scikit-learn-ac4cd85cb27f)). Imputation is performed by Scikit-Learn's [`SimpleImputer()`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) function.
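To make the naming difference concrete, here is a minimal sketch (not part of the original notebook) that builds the same two-step pipeline with both functions and lists the resulting step names. The variable names `pl_named` and `pl_auto` are illustrative only.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline(): we choose the step names ourselves.
pl_named = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# make_pipeline(): step names are generated from the class names.
pl_auto = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Collect the step names of both pipelines.
l_named_steps = [name for name, _ in pl_named.steps]
l_auto_steps = [name for name, _ in pl_auto.steps]

print(l_named_steps)  # ['impute', 'scale']
print(l_auto_steps)   # ['simpleimputer', 'standardscaler']
```

Both pipelines behave identically; the only difference is how a step is addressed afterwards, e.g. `pl_named["scale"]` versus `pl_auto["standardscaler"]`.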

In [121]:
# Setting sklearn parameter for Pipeline visualization.
import sklearn 
sklearn.set_config(display="diagram")

# Import functions from modules.
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [122]:
# Create pipeline for numerical variables.
pl_Num = Pipeline([

        ( "impute", SimpleImputer(missing_values = np.nan, strategy = "median") ),

        ( "scale",  StandardScaler() )
])

# Create pipeline for categorical variables.
pl_Cat = Pipeline([

        ( "impute", SimpleImputer(missing_values = np.nan, strategy = "most_frequent") ),

        ( "onehot", OneHotEncoder() )
])

The parameter `transformers` in the function [`ColumnTransformer()`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) receives a list of tuples, each containing (1) a transformer name, (2) a transformer, and (3) the variable name(s) it applies to, see below. Each tuple specifies how the variables listed in that tuple are transformed.

Note, `remainder = 'drop'` tells the transformer to drop the variables not mentioned in the tuples.
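The effect of the tuples and of `remainder = 'drop'` can be checked on a small example. The sketch below uses a made-up data frame (`df_toy`, with hypothetical columns `area`, `district`, and `comment`), not the Ames data. It sets `sparse_threshold = 0` only to force a dense array for easy inspection.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one numerical, one categorical, and one column we do not mention.
df_toy = pd.DataFrame({
    "area":     [50.0, 80.0, 120.0, 65.0],
    "district": ["north", "south", "north", "east"],
    "comment":  ["a", "b", "c", "d"],
})

ct_toy = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["area"]),
        ("cat", OneHotEncoder(), ["district"]),
    ],
    remainder="drop",      # 'comment' is not listed in any tuple, so it is dropped.
    sparse_threshold=0,    # Always return a dense array.
)

m_toy = ct_toy.fit_transform(df_toy)

# 1 scaled column + 3 one-hot columns (east, north, south) = 4 columns.
print(m_toy.shape)
```

With `remainder = 'passthrough'` the `comment` column would be appended unchanged instead of being dropped.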

In [123]:
# Import module.
from sklearn.compose import ColumnTransformer

# Define transformer object.
pl_ColumnTransformer = ColumnTransformer(
    
    transformers = [
        
        # Tuples containing transformer name, transformer, and variable names for each transformation to take place.
        ('num', pl_Num, l_df_X_names),

        # As we pull the data through a pipeline, it is easier to add more categorical variables.
        ('cat', pl_Cat, ['Neighborhood'] + ['Pool QC'])
    ],
    
    remainder = 'drop',
    verbose   = True
)

As with scaling and one-hot encoding, we use the `fit_transform()` function to pull a data frame through the pipeline. With scaling, we performed `fit()` and `transform()` separately, so that the scaling derived from the train set could be applied to the test set; with one-hot encoding, we performed both steps in one go by using `fit_transform()`. Here, we apply the latter approach to the original data, `df_orig`, resulting in `m_X_transformed`.
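The difference between the two approaches can be illustrated with a minimal sketch on made-up numbers (the arrays `m_train` and `m_test` are assumptions for illustration, not notebook objects):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

m_train = np.array([[1.0], [2.0], [3.0]])
m_test = np.array([[4.0]])

# Two steps: learn mean and standard deviation from the train set only ...
scaler = StandardScaler()
scaler.fit(m_train)

# ... and apply that same scaling to both train and test set.
m_train_scaled = scaler.transform(m_train)
m_test_scaled = scaler.transform(m_test)

# One step: fit and transform the same data in one go.
m_train_scaled_2 = StandardScaler().fit_transform(m_train)

print(scaler.mean_)       # Mean learned from the train set.
print(m_test_scaled)      # Test value scaled with the train-set statistics.
```

The one-step and two-step results on the train set are identical; the two-step form is only needed when the fitted transformer must be reused on other data.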

Does the number of columns in `m_X_transformed` correspond to what we expect?

Why are we not able to do the calculation for 'Pool QC' in the same notation as for 'Neighborhood'?

In [None]:
m_X_transformed = pl_ColumnTransformer.fit_transform(df_orig)

print('')
print(f"Number of columns in the resulting array:          {m_X_transformed.shape[1]}")
print(f"Number of numerical variables (excl. 'SalePrice'): {len(l_df_X_names)}")
print(f"Number of unique values in 'Neighborhood':         {len(df_orig['Neighborhood'].value_counts())}")
print(f"Number of unique values in 'Pool QC':              {len(df_orig['Pool QC'].value_counts())}")

Next on our list of activities is to create a list of variable names, so we can convert the array `m_X_transformed` to a data frame and assign the corresponding variable names. For the numerical part this is easy, as the numerical variable names are stored in `l_df_X_names`. 

For the one-hot encoded variables - `Neighborhood` and `Pool QC` - this is a bit trickier. The object `pl_ColumnTransformer` contains the attribute `named_transformers_`. This has two elements, 'num' and 'cat', i.e., the names we gave to the respective transformers. Here we see the benefit of `Pipeline()` over `make_pipeline()`, since we can make use of the names that we gave to each step in the pipeline. With `make_pipeline()` we do not (have to) give names to the individual transformers; scikit-learn creates them for us from the lowercased class names. However, this makes it more cumbersome to refer to individual transformers if needed at a later stage, as we see here. See also Python Explainer 'pipeline' in the syllabus.

We assign the variable names derived from `pl_ColumnTransformer` to object `v_df_cat_transformed_names`, and we observe that they match what we derived above.

In [None]:
v_df_cat_transformed_names = pl_ColumnTransformer.named_transformers_['cat']['onehot'].get_feature_names_out(['neighborhood', 'pool_qc'])

print(v_df_cat_transformed_names)

We continue and append the numerical variable names and those that follow from the two categorical variables, `Neighborhood` and `Pool QC`.

In [None]:
v_df_X_transformed_names = np.append(l_df_X_names, v_df_cat_transformed_names)

Let's do our checks and balances to confirm that the dimensions of our data match our expectations, see below. The transformed data matrix `m_X_transformed` should consist of as many columns as there are elements in `v_df_X_transformed_names`, which we created from the numerical variables in `df_X` and the unique values in `Neighborhood` and `Pool QC`.

In [None]:
print(f"Number of columns in the resulting array: {m_X_transformed.shape[1]}")
print(f"Length of 'v_df_X_transformed_names':     {len(v_df_X_transformed_names)}")


Now, we are ready to construct data frame `df_X_transformed` that follows from the ColumnTransformer.

In [None]:
# Convert the 2D array to a data frame:
df_X_transformed = pd.DataFrame(m_X_transformed, columns = v_df_X_transformed_names)

df_X_transformed.head(5)

We split the predictor data using Scikit-Learn's `train_test_split()` ([*ref*](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)).
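The helper `f_train_test_split()` used in the next cell is not shown in this section. As a hedged illustration of what a call to scikit-learn's `train_test_split()` itself looks like, here is a minimal sketch on made-up data; the values for `test_size` and `random_state` are assumptions, not necessarily those used by the helper.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy predictor data and response (10 observations).
df_X_toy = pd.DataFrame({"x1": np.arange(10), "x2": np.arange(10) * 2})
ps_y_toy = pd.Series(np.arange(10) * 3, name="y")

# Split predictors and response in the same way; 30% goes to the test set.
df_X_tr, df_X_te, ps_y_tr, ps_y_te = train_test_split(
    df_X_toy,
    ps_y_toy,
    test_size=0.3,
    random_state=42,
)

print(df_X_tr.shape, df_X_te.shape)
```

Because predictors and response are passed together, their rows stay aligned: the index of `df_X_tr` equals the index of `ps_y_tr`.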

In [None]:
df_X_train, df_X_test, ps_y_log_train, ps_y_log_test = f_train_test_split(df_X_transformed, ps_y_log)

Instead of using Scikit-Learn's `Lasso()` function to build a single model, we use `LassoCV()` to build a series of LASSO models. By default, `LassoCV()` tries 100 different values for $\alpha$, controlled by the input parameter `n_alphas`. Alternatively, we can provide a list of $\alpha$'s ourselves through the parameter `alphas` (commented out in the cell below), following an example given in Scikit-Learn's documentation ([ref](https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_model_selection.html#sphx-glr-auto-examples-linear-model-plot-lasso-model-selection-py)).
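As a minimal sketch of providing our own list of $\alpha$'s, the example below fits `LassoCV()` on synthetic data from `make_regression()`. The data, the list `l_alphas`, and the object name `mo_toy` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic regression data: 10 predictors, of which 3 carry signal.
m_X, v_y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=42
)

# Our own grid of alphas instead of the default 100 values.
l_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

mo_toy = LassoCV(alphas=l_alphas, cv=5, max_iter=100000).fit(m_X, v_y)

# One row per alpha, one column per fold.
print(mo_toy.mse_path_.shape)

# The alpha of the selected model is one of the values we provided.
print(mo_toy.alpha_)
```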

In [None]:
# Import module.
from sklearn.linear_model import LassoCV

# Define log of lower border of alphas range.
#n_alphas_min = 2

mo_lasso = LassoCV(
    
    # Number of folds.
    cv           = 5,

    # Fixing random_state ensures the results are reproducible (only has an effect when selection = 'random').
    random_state = 42,

    # Use any CPU available.
    #n_jobs       = -1,

    # Max number of iterations.
    max_iter     = 100000,

    # We can enforce for which alphas a Lasso model is fitted.
    # In case we do not provide a list, LassoCV() will select 100 values.
    #alphas       = [round(10**i) for i in np.arange(n_alphas_min, n_alphas_min+3, 0.2)]

).fit(
    
    df_X_train,
    ps_y_log_train
)

From the `mo_lasso` object we can extract the intercept and coefficients of the best model. What explains the number of coefficients equal to zero? How can we increase the number of coefficients equal to zero?
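As a hint for the second question, the sketch below fits `Lasso()` for a few values of $\alpha$ on synthetic data and counts the coefficients that are exactly zero. The data and the chosen $\alpha$ values are assumptions for illustration; only the first two predictors carry signal.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)

# 10 predictors; the response depends on the first two only.
m_X = rng.normal(size=(200, 10))
v_y = 3.0 * m_X[:, 0] - 2.0 * m_X[:, 1] + rng.normal(scale=0.5, size=200)

# Count the coefficients that are exactly zero for each alpha.
d_n_zero = {}
for alpha in [0.001, 0.1, 1.0, 10.0]:
    mo = Lasso(alpha=alpha, max_iter=100000).fit(m_X, v_y)
    d_n_zero[alpha] = int(np.sum(mo.coef_ == 0))

print(d_n_zero)
```

A stronger penalty pushes more coefficients to zero; for a sufficiently large $\alpha$ all coefficients are zero and only the intercept remains.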

In [None]:
print(f"intercept: {mo_lasso.intercept_:.4f}")

pd.DataFrame(

    {
        'coef':     mo_lasso.coef_,
        'coef_abs': abs(mo_lasso.coef_),
        'variable':  df_X_train.columns
    }

).sort_values(

    'coef_abs',
    ascending=False
)

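The best model is selected based on the lowest mean squared error (MSE) averaged over the cross-validation folds, which is equivalent to selecting on the RMSE derived from that average. The alpha for that model can be obtained through the attribute `alpha_`.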
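The selection can be reproduced by hand from the attribute `mse_path_`, which holds the MSE per alpha and per fold. The following is a sketch on synthetic data (the data and the name `mo_toy` are assumptions, not notebook objects):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

m_X, v_y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=42
)

mo_toy = LassoCV(cv=5, max_iter=100000).fit(m_X, v_y)

# Average the MSE over the folds, giving one value per alpha.
v_mse_mean = mo_toy.mse_path_.mean(axis=1)

# Position of the alpha with the lowest cross-validated MSE.
i_best = int(np.argmin(v_mse_mean))

print(f"alpha at lowest mean MSE: {mo_toy.alphas_[i_best]:.5f}")
print(f"alpha_ reported by model: {mo_toy.alpha_:.5f}")
print(f"corresponding RMSE:       {np.sqrt(v_mse_mean[i_best]):.3f}")
```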

In [None]:
print(f"Lowest cross-validated error found at alpha: {mo_lasso.alpha_:.5f}")

The `mo_lasso` object holds the properties of the best model, i.e., the one resulting in the lowest cross-validated error, so we can use it directly to predict on the test set.

In [None]:
ps_y_log_pred = mo_lasso.predict(df_X_test)

The same three primary metrics are used to evaluate the best LASSO model. The RMSE is considerably lower than when we limited the predictor data to numerical variables only, see above. We observe that by adding `Neighborhood` (and `Pool QC`) to the model, it can explain a larger part of the variance in the data.

In [None]:
up.f_evaluate_results(
    ps_y_true = ps_y_log_test,
    ps_y_pred = ps_y_log_pred
)

### Appendix B - Determine number of principal components to explain 90% of variance in the data

In this section, we determine the number of principal components (PCs) needed to explain 90% of the variance in the data using a pipeline ([ref](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60)), see also 'ML Explainer' 'pca'. For this, we need access to the `explained_variance_ratio_` attribute of a fitted PCA object. Here, we keep the PCA step outside the pipeline, so the attribute is directly available; if PCA were a named step of a fitted pipeline, the attribute could also be reached through the pipeline's `named_steps`. In case we are only interested in the principal components themselves, we can include `PCA(n_components = ...)` in the pipeline.
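As a hedged alternative to the approach taken below, this sketch places PCA inside a pipeline and reads `explained_variance_ratio_` from the fitted step. It also uses the option to pass a fraction to `n_components`, in which case PCA keeps just enough components to exceed that share of the variance. The toy data and the names `pl_pca` and `m_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Toy data: 6 variables, of which the second nearly duplicates the first.
m_toy = rng.normal(size=(100, 6))
m_toy[:, 1] = m_toy[:, 0] + 0.1 * rng.normal(size=100)

pl_pca = Pipeline([
    ("scale", StandardScaler()),
    ("pca",   PCA(n_components=0.90)),  # Keep enough PCs to explain > 90%.
])

m_pc_toy = pl_pca.fit_transform(m_toy)

# Access the fitted PCA step by its name to read its attributes.
pca_fitted = pl_pca.named_steps["pca"]
v_ratio = pca_fitted.explained_variance_ratio_

print(pca_fitted.n_components_)
print(v_ratio.sum())
```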

In [None]:
pl_impute_scale_numerical = make_pipeline(
    
    SimpleImputer(missing_values = np.nan, strategy = "median"),

    StandardScaler()

    )

The pipeline is applied to the numerical variables in the original data (`df_orig`).

In [None]:
m_X_transformed = pl_impute_scale_numerical.fit_transform(df_orig[l_df_X_names])

We define a PCA object `pca_`, anticipating that 25 principal components will cover at least 90% of the variance.

In [None]:
# Import module.
from sklearn.decomposition import PCA

# Define PCA object.
pca_ = PCA(n_components=25)

# Matrix containing the principal components.
m_pc = pca_.fit_transform(m_X_transformed)

The attribute `explained_variance_ratio_` holds, for each principal component, the fraction of the total variance that it explains, i.e., the additional variance explained by adding that component.

In [None]:
v_pca_ames = pca_.explained_variance_ratio_
v_pca_ames

Using a list comprehension, we demonstrate that 23 PCs are needed to explain at least 90% of the total variance in the data. In other words, the 23 PCs contain 90% of the information present in the 38 variables. Note that `enumerate()` starts counting at 0, so index `i` corresponds to the first `i + 1` PCs. First, we show the outcome of `np.cumsum(v_pca_ames)`:
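Instead of reading the number off the list, it can also be computed. The sketch below uses a made-up vector of explained variance ratios (`v_ratio_toy`) to keep the arithmetic easy to follow; the same two lines would apply to `v_pca_ames`.

```python
import numpy as np

# Made-up explained variance ratios of four principal components.
v_ratio_toy = np.array([0.50, 0.30, 0.15, 0.05])

# Cumulative explained variance: roughly 0.50, 0.80, 0.95, 1.00.
v_cum = np.cumsum(v_ratio_toy)

# np.argmax() returns the first position where the condition is True;
# we add 1 because positions start at 0.
n_pc_needed = int(np.argmax(v_cum >= 0.90)) + 1

print(n_pc_needed)  # 3
```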

In [None]:
print(len(l_df_X_names))

np.cumsum(v_pca_ames)

In [None]:
[[i,j] for i,j in enumerate(np.cumsum(v_pca_ames))]

Let's put the following in one table to compare the order of the numerical variables in the data, see table below:

1. **Pearson Correlation coefficients** of the numerical variables with `SalePrice`.

2. **LASSO coefficients** fitting the numerical variables to a LASSO model predicting `SalePrice`.

3. **PCA loadings** of the first principal component ('PC1'). Note, this is independent of `SalePrice`.

Two housekeeping remarks:
1. We make use of `reset_index(drop=True)` to concatenate the data frames as they are offered. Without it, the data frames would be joined using the original index and all variable names would end up in the same row, driven by the index of the first data frame, the correlation coefficients in this case.

2. In 'Step 3 - Split the data' in section 'Estimate a LASSO model' two scenarios can be chosen (A and B). Subsequently, in step 6 of the same section, `df_lasso_coefficients` is calculated. To allow for a proper comparison of the correlation coefficients, the LASSO coefficients, and the PCA loadings, please ensure you ran scenario A and steps 3-6 accordingly.
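The first housekeeping remark can be illustrated with a minimal sketch on two made-up data frames (`df_a` and `df_b` are assumptions for illustration), each sorted by its own criterion:

```python
import pandas as pd

# First table, already in its desired order.
df_a = pd.DataFrame({"name_a": ["x", "y", "z"], "corr": [0.9, 0.5, 0.1]})

# Second table, sorted by its own column; the original index travels along.
df_b = pd.DataFrame(
    {"name_b": ["x", "y", "z"], "coef": [0.2, 0.8, 0.4]}
).sort_values("coef", ascending=False)

# Without reset_index(): rows are matched on the index, undoing the sort of df_b.
df_by_index = pd.concat([df_a, df_b], axis=1)

# With reset_index(drop=True): rows are placed side by side as offered.
df_by_position = pd.concat([df_a, df_b.reset_index(drop=True)], axis=1)

print(df_by_index)
print(df_by_position)
```

In `df_by_index` the column `name_b` reads x, y, z again, whereas in `df_by_position` it keeps the sorted order y, z, x.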

The table below allows us to investigate the order of the numerical variables in each of the three analyses. Comparing the Pearson correlation coefficients and the LASSO coefficients shows that the first two variables occur at the top of each list. We expect variables that have a high correlation with `SalePrice` to also end up high in the LASSO coefficient table. *Question: Why?*

However, the similarity does not continue all the way down. `Garage Area` is high in the list of correlation coefficients, yet it is somewhere in the middle of the list of LASSO coefficients. *Why is this the case?* When two predictors are highly correlated with each other (not shown in the table), the information is sufficiently captured by one of the two. Since LASSO wants to include as few variables as possible (regularization, remember $\alpha$), it will choose one of the two. We observe the same for `Gr Liv Area` and `1st Flr SF`. The table suggests that the variable with the higher correlation with `SalePrice` is chosen for LASSO and the other one is *punished* by giving it a lower LASSO coefficient. So, correlations between variables cause the order in the two lists to differ.

We observe no similarity in the order of the variables between the loadings of the first principal component (PC1) on the one hand and the correlations and LASSO coefficients on the other hand. Possibly, this is explained by PCA only focussing on the predictor data (X), whereas correlations and LASSO depend on the relation between the predictor data (X) and `SalePrice` (y).

In [None]:
if df_lasso_coefficients.shape[0] != 38:
    print("Before proceeding, run Scenario A ('numerical only') in Step 3 of section 'Estimate a LASSO model', and run steps 3-6 above.")

In [None]:
pd.concat(
        
    [   # Correlation coefficients. Note, we remove the first row containing correlation of SalePrice with itself.
        df_corr_table.tail(-1).reset_index(drop=True),

        # LASSO coefficients.
        df_lasso_coefficients.reset_index(drop=True),
          
        # PCA Loadings.  
        pd.DataFrame({
            
            'name': l_df_X_names,
            # Note, components_[0] holds the first principal component (PC1), as indexing starts at 0.
            'pc1_loading': ["{:.2f}".format(x) for x in pca_.components_[0]],
            'pc1_loading_abs': ["{:.2f}".format(abs(x)) for x in pca_.components_[0]]

        }).sort_values(
            
            by = 'pc1_loading_abs',
            ascending = False
            
        ).reset_index(drop=True)
    ],
    
    axis = 1
)