# Derived features: using PCA on a subset of columns

The modelling tools in `ISLP` support derived features: a transformer, such as PCA,
is fitted to a subset of columns and its output is used as new features.

In [1]:
import numpy as np
from ISLP import load_data
from ISLP.models import ModelSpec, pca, Variable, derived_variable
from sklearn.decomposition import PCA

In [2]:
Carseats = load_data('Carseats')
Carseats.columns

Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'ShelveLoc', 'Age', 'Education', 'Urban', 'US'],
      dtype='object')

Let's create a `ModelSpec` that is aware of all of the relevant columns.

In [3]:
design = ModelSpec(Carseats.columns.drop(['Sales'])).fit(Carseats)

Suppose we want a `Variable` representing the first three principal components of the
features `['CompPrice', 'Income', 'Advertising', 'Population', 'Price']`.

We first make a `Variable` that represents these five feature columns; a `PCA` encoder
can then be used to build a new `Variable` that returns the first three principal components.
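For intuition about what the three whitened components contain, here is a minimal NumPy sketch of whitened PCA on a block of columns. The function name `whitened_pca` and the toy data are illustrative assumptions, not part of `ISLP` or `sklearn`; the result should agree with `PCA(n_components=3, whiten=True)` only up to the sign of each component.

```python
import numpy as np

def whitened_pca(X, n_components):
    """Project X onto its leading principal components, scaled to unit variance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T             # unwhitened component scores
    # Whitening divides each score column by its sample standard deviation,
    # S / sqrt(n - 1), so every output column has unit variance.
    return scores / (S[:n_components] / np.sqrt(X.shape[0] - 1))

# Hypothetical toy data: 50 rows, 5 columns on different scales
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 5)) * np.array([1.0, 5.0, 2.0, 10.0, 3.0])
Z = whitened_pca(toy, n_components=3)
print(Z.shape)
```

The whitened columns are uncorrelated with unit sample variance, which is why `whiten=True` is convenient when the components feed a model that is sensitive to feature scale.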

In [4]:
grouped = Variable(('CompPrice', 'Income', 'Advertising', 'Population', 'Price'), name='grouped', encoder=None)
sklearn_pca = PCA(n_components=3, whiten=True)

We can now fit `sklearn_pca` and create our new variable.

In [5]:
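# Fit the PCA encoder on the five grouped columns
sklearn_pca.fit(design.build_columns(Carseats, grouped)[0])
# Wrap the fitted encoder in a derived variable over the same columns
pca_var = derived_variable(['CompPrice', 'Income', 'Advertising', 'Population', 'Price'],
                           name='pca(grouped)', encoder=sklearn_pca)
# Build the three principal-component columns
derived_features, _ = design.build_columns(Carseats, pca_var)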



In [6]:
design.build_columns(Carseats, grouped)[0]

     CompPrice  Income  Advertising  Population  Price
0          138      73           11         276    120
1          111      48           16         260     83
2          113      35           10         269     80
3          117     100            4         466     97
4          141      64            3         340    128
...        ...     ...          ...         ...    ...
395        138     108           17         203    128
396        139      23            3          37    120
397        162      26           12         368    159
398        100      79            7         284     95


## Helper function

The `pca` function wraps these steps in a single call for convenience.

In [7]:
group_pca = pca(['CompPrice', 'Income', 'Advertising', 'Population', 'Price'], 
                n_components=3, 
                whiten=True, 
                name='grouped')

In [8]:
pca_design = ModelSpec([group_pca], intercept=False)
ISLP_features = pca_design.fit_transform(Carseats)
ISLP_features.columns



Index(['pca(grouped, n_components=3, whiten=True)[0]',
       'pca(grouped, n_components=3, whiten=True)[1]',
       'pca(grouped, n_components=3, whiten=True)[2]'],
      dtype='object')
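The column labels above follow a readable pattern: the variable name, the keyword arguments, and a component index. The sketch below reproduces that pattern with plain string formatting, which can be handy for selecting these columns later. `component_labels` is a hypothetical helper written for illustration; it is not necessarily how `ISLP` builds its names.

```python
def component_labels(name, params):
    """Build labels of the form 'pca(name, key=value, ...)[i]' (hypothetical helper)."""
    args = ", ".join([name] + [f"{key}={value!r}" for key, value in params.items()])
    # One label per requested component
    return [f"pca({args})[{i}]" for i in range(params["n_components"])]

labels = component_labels("grouped", {"n_components": 3, "whiten": True})
for label in labels:
    print(label)
```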

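## Direct comparison

We check that the features built by `ISLP` agree with those computed directly by `sklearn`, and with the `derived_features` constructed earlier.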

In [9]:
X = np.asarray(Carseats[['CompPrice', 'Income', 'Advertising', 'Population', 'Price']])
sklearn_features = sklearn_pca.fit_transform(X)

In [10]:
np.linalg.norm(ISLP_features - sklearn_features), np.linalg.norm(ISLP_features - np.asarray(derived_features))

(4.073428490498941e-14, 0.0)
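A norm of order `1e-14` is floating-point roundoff rather than a real discrepancy, while `0.0` means the two arrays are identical. When a check like this goes into a test, an explicit tolerance is usually easier to read than a raw norm. The sketch below shows one way to write it; `agree_within`, its default tolerance, and the demo arrays are assumptions made for illustration.

```python
import numpy as np

def agree_within(A, B, atol=1e-10):
    """Return True if A and B have the same shape and differ by at most atol entrywise."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return A.shape == B.shape and bool(np.allclose(A, B, rtol=0.0, atol=atol))

rng = np.random.default_rng(1)
reference = rng.normal(size=(400, 3))   # hypothetical stand-in for one feature matrix
roundoff = reference + 1e-14            # same values up to floating-point noise
shifted = reference + 1e-3              # a genuinely different result

print(agree_within(reference, roundoff))
print(agree_within(reference, shifted))
```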