# Ames Housing Step-by-step - Appendix PCA - Determine the number of PCs to explain 90% of variance in the data

Pieter Overdevest  
2024-02-09

For suggestions/questions regarding this notebook, please contact
[Pieter Overdevest](https://www.linkedin.com/in/pieteroverdevest/)
(pieter@innovatewithdata.nl).

### How to work with this Jupyter Notebook yourself?

- Get a copy of the repository ('repo') [machine-learning-with-python-explainers](https://github.com/EAISI/machine-learning-with-python-explainers) from EAISI's GitHub site. This can be done by either cloning the repo or simply downloading the zip file. Both options are explained in this YouTube video by [Coderama](https://www.youtube.com/watch?v=EhxPBMQFCaI).

- Copy the folders 'ames_housing_pieter\' and 'utils_pieter\' to your own project folder.

### Import packages

In [1]:
# Load packages and assign to a shorter alias.
import pandas as pd
import numpy as np

# Pieter's utils package.
import utils_pieter as up

### Let's get started

In this appendix, we determine the number of principal components (PCs) needed to explain 90% of the variance in the data, using a pipeline ([ref](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60)); see also 'ML Explainer' 'pca'. To do so, we need access to the `explained_variance_ratio_` attribute of a fitted PCA object. Here, we fit the PCA object outside the pipeline, so the attribute is directly available. (A PCA step inside a fitted pipeline can also be reached through the pipeline's `named_steps`.) If we are only interested in the principal components themselves, we can include `PCA(n_components = ...)` in the pipeline.

In [None]:
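# Import the pipeline building blocks.
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Impute missing values with the median, then standardize each variable.
pl_impute_scale_numerical = make_pipeline(

    SimpleImputer(missing_values = np.nan, strategy = "median"),

    StandardScaler()

    )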

The pipeline is applied to the numerical variables in the original data (`df_orig`).

In [None]:
m_X_transformed = pl_impute_scale_numerical.fit_transform(df_orig[l_df_X_names])

We define a PCA object `pca_`, anticipating that 25 principal components will cover at least 90% of the variance.

In [None]:
# Import module.
from sklearn.decomposition import PCA

# Define PCA object.
pca_ = PCA(n_components=25)

# Matrix containing the principal component scores (the data projected onto the PCs).
m_pc = pca_.fit_transform(m_X_transformed)
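
As an alternative to guessing the number of components up front, scikit-learn's `PCA` also accepts a fraction between 0 and 1 for `n_components` and then selects the number of components itself. The sketch below illustrates this on synthetic data; the data and all names ending in `_demo` are assumptions made for illustration and are not part of the Ames notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (hypothetical, not the Ames data): 10 variables
# driven by 3 latent factors plus some noise.
rng = np.random.default_rng(42)
m_latent = rng.normal(size=(300, 3))
m_mixing = rng.normal(size=(3, 10))
m_X_demo = m_latent @ m_mixing + 0.3 * rng.normal(size=(300, 10))

# A float between 0 and 1 asks PCA for the smallest number of PCs whose
# cumulative explained variance exceeds that fraction.
pca_demo = PCA(n_components=0.90, svd_solver="full")
m_pc_demo = pca_demo.fit_transform(m_X_demo)

# Number of PCs that was selected, and the variance they explain together.
print(pca_demo.n_components_)
print(pca_demo.explained_variance_ratio_.sum())
```

The selected number is available as `n_components_` after fitting. The explicit approach used in this appendix has the advantage that the full cumulative curve can be inspected.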

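The attribute `explained_variance_ratio_` holds, for each principal component, the fraction of the total variance that it explains, that is, the additional variance explained by adding that component. The values are sorted from largest to smallest.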

In [None]:
v_pca_ames = pca_.explained_variance_ratio_
v_pca_ames
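
For completeness, the same attribute can be read from a PCA step that sits inside a pipeline. The sketch below assumes a pipeline built with `make_pipeline`, which names each step after its lowercased class name (so the PCA step is called `"pca"`). The synthetic data and the names `pl_demo`, `m_X_demo` and `v_ratio_demo` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 6 variables, two of them strongly
# correlated, and a few missing values in the third column.
rng = np.random.default_rng(0)
m_X_demo = rng.normal(size=(200, 6))
m_X_demo[:, 1] = m_X_demo[:, 0] + 0.1 * rng.normal(size=200)
m_X_demo[::25, 2] = np.nan

# Impute, scale, and apply PCA in a single pipeline.
pl_demo = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy="median"),
    StandardScaler(),
    PCA(n_components=4),
)
m_pc_demo = pl_demo.fit_transform(m_X_demo)

# Reach into the fitted pipeline to get the PCA step and its attribute.
v_ratio_demo = pl_demo.named_steps["pca"].explained_variance_ratio_
print(v_ratio_demo)
```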

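Using a list comprehension, we show that 23 PCs are needed to explain at least 90% of the total variance in the data. In other words, the 23 PCs contain 90% of the information present in the 38 variables. Note that `enumerate` starts counting at 0, so index `i` in the list corresponds to the first `i + 1` PCs. First, we show the number of variables and the outcome of `np.cumsum(v_pca_ames)`: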

In [None]:
print(len(l_df_X_names))

np.cumsum(v_pca_ames)

In [None]:
[[i,j] for i,j in enumerate(np.cumsum(v_pca_ames))]
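
Instead of reading the number of PCs off the list by eye, it can also be computed. The following is a minimal sketch using a made-up vector of explained variance ratios (`v_ratio_demo` is hypothetical and not the Ames result); it assumes the 90% threshold is actually reached by the components at hand.

```python
import numpy as np

# Made-up explained variance ratios for five PCs (hypothetical values).
v_ratio_demo = np.array([0.40, 0.25, 0.15, 0.12, 0.08])

# Cumulative explained variance: roughly 0.40, 0.65, 0.80, 0.92, 1.00.
v_cumulative_demo = np.cumsum(v_ratio_demo)

# np.argmax returns the index of the first True value; add one because
# indices start at 0 while we count PCs starting at 1.
n_pc_needed = int(np.argmax(v_cumulative_demo >= 0.90)) + 1

print(n_pc_needed)  # 4
```

Note that `np.argmax` returns 0 when no value reaches the threshold, so it is worth checking that the last cumulative value is at least 0.90 before trusting the result.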

Let's put the following three results side by side in one table, so we can compare how each of them orders the numerical variables in the data:

1. **Pearson Correlation coefficients** of the numerical variables with `SalePrice`.

2. **LASSO coefficients** fitting the numerical variables to a LASSO model predicting `SalePrice`.

3. **PCA loadings** of the first principal component ('PC1'). Note that this is independent of `SalePrice`.

Two housekeeping remarks:
1. We make use of `reset_index(drop=True)` to concatenate the data frames as they are offered. Without it, the data frames would be joined on their original index, and each variable name would end up in the same row across all three tables, in the order given by the index of the first data frame (the correlation coefficients in this case).

2. In 'Step 3 - Split the data' in section 'Estimate a LASSO model', two scenarios can be chosen (A and B). Subsequently, in step 6 of the same section, `df_lasso_coefficients` is calculated. To allow for a proper comparison of the correlation coefficients, the LASSO coefficients, and the PCA loadings, please ensure you ran scenario A and steps 3-6 accordingly.
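
The effect of `reset_index(drop=True)` described in the first remark can be illustrated with two tiny data frames. This is a minimal sketch; `df_a`, `df_b` and their contents are made up for illustration.

```python
import pandas as pd

# First table: already in its own order, with index 0, 1, 2.
df_a = pd.DataFrame({"name_a": ["x", "y", "z"], "val_a": [3, 2, 1]})

# Second table: sorted on its own values, so its index becomes 1, 2, 0.
df_b = pd.DataFrame(
    {"name_b": ["x", "y", "z"], "val_b": [1, 3, 2]}
).sort_values(by="val_b", ascending=False)

# Without resetting the index, rows are aligned on the index labels,
# so the sorted order of df_b is lost.
df_aligned = pd.concat([df_a, df_b], axis=1)

# With reset_index(drop=True), rows are placed side by side by position,
# so each table keeps its own sorted order.
df_side_by_side = pd.concat(
    [df_a.reset_index(drop=True), df_b.reset_index(drop=True)],
    axis=1,
)

print(df_aligned)
print(df_side_by_side)
```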

The table below lets us investigate the order of the numerical variables in each of the three analyses. Comparing the Pearson correlation coefficients and the LASSO coefficients shows that the same two variables occur at the top of both lists. We also expect variables that have a high correlation with `SalePrice` to end up high in the LASSO coefficient table. *Question: Why?* However, the agreement does not continue all the way down. `Garage Area` is high in the list of correlation coefficients, but it sits somewhere in the middle of the list of LASSO coefficients. *Why is this the case?* `Garage Area` is highly correlated with another predictor (not shown in the table), so the information is sufficiently captured by one of the two. Since LASSO tries to include as few variables as possible (regularization, remember $\alpha$), it favors one of the two. We observe the same for `Gr Liv Area` and `1st Flr SF`. The table suggests that the variable with the higher correlation with `SalePrice` is chosen by LASSO and the other one is *punished* with a lower LASSO coefficient. So, correlations between predictor variables cause the order in the two lists to differ.
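
The behaviour described above can be reproduced on synthetic data. In the sketch below, two predictors are strongly correlated with each other and with the target, yet LASSO puts most of the weight on only one of them. The data, the value of `alpha`, and all names are assumptions chosen for illustration; they are not taken from the Ames analysis.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (hypothetical): x2 is a noisy copy of x1, and the
# target depends on x1 only.
rng = np.random.default_rng(0)
n_obs = 500
x1 = rng.normal(size=n_obs)
x2 = x1 + 0.3 * rng.normal(size=n_obs)
y = 3.0 * x1 + 0.5 * rng.normal(size=n_obs)

m_X_demo = np.column_stack([x1, x2])

# Both predictors correlate strongly with the target ...
v_corr_with_y = [np.corrcoef(m_X_demo[:, j], y)[0, 1] for j in range(2)]

# ... but LASSO assigns most of the weight to one of them.
lasso_demo = Lasso(alpha=0.1).fit(m_X_demo, y)

print(v_corr_with_y)
print(lasso_demo.coef_)
```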

We observe no similarity between the order of the variables in the loadings of the first principal component (PC1) on the one hand, and in the correlations and LASSO coefficients on the other hand. Possibly, this is explained by PCA focusing only on the predictor data (X), whereas correlations and LASSO depend on the relation between the predictor data (X) and `SalePrice` (y).

In [None]:
if df_lasso_coefficients.shape[0] != 38:
    print("Before proceeding, run Scenario A ('numerical only') in Step 3 of section 'Estimate a LASSO model', and run steps 3-6 above.")

In [None]:
pd.concat(
        
    [   # Correlation coefficients. Note, we remove the first row containing correlation of SalePrice with itself.
        df_corr_table.tail(-1).reset_index(drop=True),

        # LASSO coefficients.
        df_lasso_coefficients.reset_index(drop=True),
          
        # PCA Loadings.  
        pd.DataFrame({
            
            'name': l_df_X_names,
            # Note: components_[0] is the first principal component (PC1).
            'pc1_loading': ["{:.2f}".format(x) for x in pca_.components_[0]],
            'pc1_loading_abs': ["{:.2f}".format(abs(x)) for x in pca_.components_[0]]

        }).sort_values(
            
            by = 'pc1_loading_abs',
            ascending = False
            
        ).reset_index(drop=True)
    ],
    
    axis = 1
)
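
The PC1 loadings table can also be built with numeric values, so that sorting happens on the numbers rather than on formatted strings and rounding is applied only at the end. The sketch below shows this variant on synthetic data; the data and the names `l_names_demo`, `m_X_demo`, `pca_demo` and `df_loadings_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic data (hypothetical): four variables, the second one
# strongly related to the first.
rng = np.random.default_rng(1)
l_names_demo = ["var_a", "var_b", "var_c", "var_d"]
m_X_demo = rng.normal(size=(100, 4))
m_X_demo[:, 1] += 2.0 * m_X_demo[:, 0]

pca_demo = PCA(n_components=2).fit(m_X_demo)

df_loadings_demo = (
    pd.DataFrame({
        "name": l_names_demo,
        # Row 0 of components_ holds the loadings of PC1.
        "pc1_loading": pca_demo.components_[0],
    })
    # Absolute loading, kept numeric so it sorts correctly.
    .assign(pc1_loading_abs=lambda df: df["pc1_loading"].abs())
    .sort_values(by="pc1_loading_abs", ascending=False)
    .reset_index(drop=True)
    # Round for display only; the non-numeric 'name' column is left as-is.
    .round(2)
)

print(df_loadings_demo)
```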