# Ames Housing Step-by-step - Extra - Determine the number of PCs to explain 90% of variance in the data

Pieter Overdevest  
2024-02-09

For suggestions/questions regarding this notebook, please contact
[Pieter Overdevest](https://www.linkedin.com/in/pieteroverdevest/)
(pieter@innovatewithdata.nl).

### How to work with this Jupyter Notebook yourself?

- Get a copy of the repository ('repo') [machine-learning-with-python-explainers](https://github.com/EAISI/machine-learning-with-python-explainers) from EAISI's GitHub site. This can be done by either cloning the repo or simply downloading the zip-file. Both options are explained in this Youtube video by [Coderama](https://www.youtube.com/watch?v=EhxPBMQFCaI).

- Copy the folder 'ames-housing-pieter\' located in the folder 'example-solutions\' to your own project folder.

### Import packages

In [1]:
# Third party packages.
import pandas as pd     # Data handling
import numpy as np      # Numeric calculations
import pickle           # Save and load data

from sklearn.pipeline           import make_pipeline    # Pipeline
from sklearn.impute             import SimpleImputer    # Imputation
from sklearn.preprocessing      import StandardScaler   # Scale data
from sklearn.decomposition      import PCA              # PCA

# Setting sklearn parameter for Pipeline visualization.
import sklearn 
sklearn.set_config(display="diagram")

### Load objects from 'ames_housing_pieter'

The solution notebooks 'ames-housing-pieter-exercise-1-2-3.ipynb', 'ames-housing-pieter-exercise-4.ipynb', and, 'ames-housing-pieter-exercise-5-6.ipynb' in the 'ames_housing_pieter\\' folder each generate a so-called [pickle file](https://docs.python.org/3/library/pickle.html), 'dc-ames-housing-pieter-exercise-1-2-3.pkl', 'dc-ames-housing-pieter-exercise-4.pkl', and 'dc-ames-housing-pieter-exercise-5-6.pkl', resp. These are located in the 'data\\' folder. Copy the 'data\\' folder to the same project folder where you are running this notebook. Then, run the cell below, which will read the pickle files and store the data in python objects.

In [2]:
with open('data/dc-ames-housing-pieter-exercise-1-2-3.pkl', 'rb') as pickle_file:
    dc_exercise_1_2_3 = pickle.load(pickle_file)

df_orig = dc_exercise_1_2_3['df_orig']

with open('data/dc-ames-housing-pieter-exercise-4.pkl', 'rb') as pickle_file:
    dc_exercise_4 = pickle.load(pickle_file)

l_df_X_names  = dc_exercise_4['l_df_X_names']
df_corr_table = dc_exercise_4['df_corr_table']

with open('data/dc-ames-housing-pieter-exercise-5-6.pkl', 'rb') as pickle_file:
    dc_exercise_5_6 = pickle.load(pickle_file)

df_lasso_coefficients = dc_exercise_5_6['df_lasso_coefficients']

### Let's get started

In this extra notebook, we determine the number of principle components (PC) to explain 90% of the variance in the data using a pipeline, see also 'ML Explainer' 'how-does-pca-work'. So, we need access to the `explained_variance_ratio_` attribute of a PCA object. As far as I know, this can only be done outside the pipeline. In case we are only interested in the principle components themselves, we can include `PCA(n_components = ...)` in the pipeline.

In [3]:
pl_impute_scale_numerical = make_pipeline(
    
    SimpleImputer(missing_values = np.nan, strategy = "median"),

    StandardScaler()

)

pl_impute_scale_numerical

The pipeline is applied to the numerical variables in the original data (`df_orig`).

In [4]:
m_X_transformed = pl_impute_scale_numerical.fit_transform(df_orig[l_df_X_names])

We define a PCA object `pca_` considering as many principle components as there are columns in df_X.

In [5]:
# Define PCA object.
pca_ = PCA(n_components = len(l_df_X_names))

# Matrix containing the principle components.
m_pc = pca_.fit_transform(m_X_transformed)

The attribute `explained_variance_ratio_` holds the additional variance that is explained by adding another principle component.

In [6]:
v_pca_ames = pca_.explained_variance_ratio_

print(v_pca_ames)
print("\n")
print(np.cumsum(v_pca_ames))


[1.91550559e-01 8.49988724e-02 6.82715934e-02 5.39498331e-02
 5.19202588e-02 3.84647664e-02 3.11401688e-02 3.04748704e-02
 2.98660081e-02 2.89683027e-02 2.79590779e-02 2.70137805e-02
 2.61827771e-02 2.52963579e-02 2.46915451e-02 2.41494505e-02
 2.33939412e-02 2.25889892e-02 2.24758299e-02 1.88902344e-02
 1.83763134e-02 1.70526853e-02 1.60675142e-02 1.49319374e-02
 1.45676739e-02 1.14141619e-02 9.63615941e-03 8.61665972e-03
 7.61873506e-03 7.22906925e-03 6.47158986e-03 5.63124983e-03
 4.07465189e-03 3.50081163e-03 2.39715405e-03 1.66077925e-04
 3.39002729e-07 3.77766868e-32]


[0.19155056 0.27654943 0.34482102 0.39877086 0.45069112 0.48915588
 0.52029605 0.55077092 0.58063693 0.60960523 0.63756431 0.66457809
 0.69076087 0.71605723 0.74074877 0.76489822 0.78829216 0.81088115
 0.83335698 0.85224722 0.87062353 0.88767622 0.90374373 0.91867567
 0.93324334 0.9446575  0.95429366 0.96291032 0.97052906 0.97775813
 0.98422972 0.98986097 0.99393562 0.99743643 0.99983358 0.99999966
 1.         1. 

Using a list comprehension, we demonstrate that 23 PC's are needed to explain at least 90% of the total variance in the data. In other words the first 23 PC's contain 90% of the information present in the 38 variables. To show the outcome of `np.cumsum(v_pca_ames)`:

In [7]:
[[i,j] for i,j in enumerate(np.cumsum(v_pca_ames))]

[[0, 0.19155055858230147],
 [1, 0.2765494309892062],
 [2, 0.3448210243845698],
 [3, 0.39877085743829516],
 [4, 0.45069111622220276],
 [5, 0.4891558826235508],
 [6, 0.5202960514477148],
 [7, 0.5507709218666347],
 [8, 0.5806369299934158],
 [9, 0.6096052327331832],
 [10, 0.6375643106235788],
 [11, 0.6645780911103369],
 [12, 0.6907608682250982],
 [13, 0.7160572261435734],
 [14, 0.7407487712815731],
 [15, 0.7648982217360544],
 [16, 0.7882921629500373],
 [17, 0.8108811521477598],
 [18, 0.8333569820344573],
 [19, 0.8522472163995976],
 [20, 0.8706235298223807],
 [21, 0.8876762150740928],
 [22, 0.9037437292664038],
 [23, 0.9186756666236768],
 [24, 0.9332433404988875],
 [25, 0.9446575023708684],
 [26, 0.9542936617774621],
 [27, 0.9629103214998656],
 [28, 0.9705290565593426],
 [29, 0.9777581258139452],
 [30, 0.9842297156743187],
 [31, 0.9898609654995321],
 [32, 0.9939356173918497],
 [33, 0.9974364290176843],
 [34, 0.9998335830718637],
 [35, 0.9999996609972709],
 [36, 0.9999999999999996],
 [37, 0.

Let's put the following in one table to compare the order of the numerical variables in the data, see table below:

1. **Pearson Correlation coefficients** of the numerical variables with `SalePrice`.

2. **LASSO coefficients** fitting the numerical variables to a LASSO model predicting `SalePrice`.

3. **PCA loadings** of the first principle component ('PC1'). Note, these are independent of `SalePrice`.

Two housekeeping remarks:
1. We make use of `reset_index(drop=True)` to reset the index and thus to concatenate the data frames as there are offered. Without it, the data frames would be joined using the original index and all variable names would end up in the same row driven by the index of the first data frame, the correlation coefficients in this case.

2. In 'b3. Split the data' in section 'Estimate a LASSO model' of 'ames-housing-pieter-exercise-5-6.ipynb', two scenario's can be chosen (A and B). Subsequently, in 'b6. Interpret the coefficients' of the same section, `df_lasso_coefficients` is calculated. To allow for a proper comparison of the correlation coefficients, the LASSO coefficients, and the PCA loadings, please ensure you ran scenario A and steps 3-6 accordingly.

The table below allows us to investigate the numerical variable order in each of the three analysis. Comparing the Pearson correlation coefficients and the LASSO coefficients shows that the first two variables occur at the top of each list. We also expect variables that have a high correlation with the `SalePrice` to also end up high in the LASSO coefficient table. *Question: Why?* However, the equality does not continue all the way down. `Garage Area` is high in the list of correlation coefficients, however, it is somewhere in the middle in the list of LASSO coefficients. *Why is this the case?* Since LASSO wants to include as few variables as possible (regularization, remember $\alpha$) it will choose one of the two. Both are highly correlated (not shown in the table), so the information is sufficiently captured in one of the two.  We observe the same for `Gr Liv Area` and `1st Flr SF`. The table suggests that the variable with the higher correlation with `SalePrice` is chosen for LASSO and the other one is *punished* by giving it a lower LASSO coefficient. So, correlations between variables causes the order in the two lists to differ.

We observe no similarity between the order in variables between the loadings in the first principle component (PC1) on the one hand and the correlations and LASSO coefficients on the other hand. Possibly, this is explained by PCA only focussing on the predictor data (X), where correlations and LASSO depend on the relation between the predictor data (X) and `SalePrice` (y).

In [8]:
if df_lasso_coefficients.shape[0] != 38:
    
    print(
        "Before proceeding, run Scenario A ('numerical only') in 'b3. Split the data' "
        "in section 'Estimate a LASSO model' of 'ames-housing-pieter-exercise-5-6.ipynb', "
        "run the subsequent steps, and store the data objects in a pickle file as given."
    )

In [11]:
pd.concat(
        
    [   # Correlation coefficients. Note, we remove the first row containing correlation of SalePrice with itself.
        df_corr_table.tail(-1).reset_index(drop=True),

        #LASSO coefficients.
        df_lasso_coefficients.reset_index(drop=True),
          
        # PCA Loadings.  
        pd.DataFrame({
            
            'name': l_df_X_names,
            'pc1_loading': ["{:.2f}".format(x) for x in pca_.components_[1]],
            'pc1_loading_abs': ["{:.2f}".format(abs(x)) for x in pca_.components_[1]]

        }).sort_values(
            
            by = 'pc1_loading_abs',
            ascending = False
            
        ).reset_index(drop=True)
    ],
    
    axis = 1
)

Unnamed: 0,name,corr,corr_abs,name.1,lasso coeff,lasso_coeff_abs,name.2,pc1_loading,pc1_loading_abs
0,SalePrice,0.949,0.949,Gr Liv Area,0.127046,0.127046,2nd Flr SF,0.43,0.43
1,Overall Qual,0.828,0.828,Overall Qual,0.121715,0.121715,Bedroom AbvGr,0.37,0.37
2,Gr Liv Area,0.711,0.711,Year Built,0.091054,0.091054,TotRms AbvGrd,0.33,0.33
3,Garage Cars,0.675,0.675,Overall Cond,0.057199,0.057199,BsmtFin SF 1,-0.3,0.3
4,Garage Area,0.654,0.654,Total Bsmt SF,0.048916,0.048916,Bsmt Full Bath,-0.29,0.29
5,Total Bsmt SF,0.648,0.648,BsmtFin SF 1,0.034148,0.034148,Gr Liv Area,0.25,0.25
6,1st Flr SF,0.621,0.621,Year Remod/Add,0.023247,0.023247,Half Bath,0.23,0.23
7,Year Built,0.617,0.617,Garage Cars,0.020776,0.020776,Total Bsmt SF,-0.2,0.2
8,Year Remod/Add,0.588,0.588,Lot Area,0.019679,0.019679,Kitchen AbvGr,0.19,0.19
9,Full Bath,0.575,0.575,Fireplaces,0.017944,0.017944,Full Bath,0.17,0.17
