How to project new data using existing VSP loadings? #68
And to follow up, I have tried a couple of approaches without success. And I'm testing this by "fitting" VSP on …

---
Can you share a code example? My guess is that you aren't matching the pre-processing and post-processing steps, and I can provide some pointers on how to do this.

---
Hi @alexpghayes, thanks for the prompt response, I appreciate it. Here's our current code as a sklearn transformer:

```python
from typing import Any, Literal, Optional, Union

import numpy as np
import numpy.typing as npt
from sklearn.decomposition import PCA


class VSP(PCA):  # type:ignore[misc]
    """
    R VSP implementation is at: https://github.com/RoheLab/vsp
    """

    def __init__(
        self,
        n_components: Union[None, int, float, str] = None,
        *,
        copy: bool = True,
        svd_solver: str = "auto",
        tol: float = 0.0,
        iterated_power: str = "auto",
        random_state: Optional[float] = None,
        rotation: Literal["varimax", "quartimax"] = "varimax",
    ):
        """
        NOTE:
         * R tol/eps is higher by default
         * VSP will always use svds
           https://github.com/RoheLab/vsp/blob/d57d6e1858444035e1ab78fed371eb68ebdee0eb/R/vsp.R#L126
         * VSP is whitening by default?
           https://github.com/RoheLab/vsp/blob/d57d6e1858444035e1ab78fed371eb68ebdee0eb/R/vsp.R#L141-L142
        """
        super().__init__(
            n_components=n_components,
            copy=copy,
            whiten=True,
            svd_solver=svd_solver,
            tol=tol,
            iterated_power=iterated_power,
            random_state=random_state,
        )
        self.rotation = rotation

    @classmethod
    def _ortho_matrix(
        cls,
        components: Any,
        method: str = "varimax",
        tol: float = 1e-6,
        max_iter: int = 100,
    ) -> npt.ArrayLike:
        """
        Returns rotation matrix.

        `components` is either components or scores. See:
        https://github.com/RoheLab/vsp/blob/d57d6e1858444035e1ab78fed371eb68ebdee0eb/R/vsp.R#L141

        This is copied from sklearn.decomposition._factor_analysis._ortho_rotation
        but doesn't perform the final dot product, which we do inside fit/transform.
        """
        nrow, ncol = components.shape
        rotation_matrix = np.eye(ncol)
        var = 0
        for _ in range(max_iter):
            comp_rot = np.dot(components, rotation_matrix)
            if method == "varimax":
                tmp = comp_rot * np.transpose((comp_rot**2).sum(axis=0) / nrow)
            elif method == "quartimax":
                tmp = 0
            u, s, v = np.linalg.svd(np.dot(components.T, comp_rot**3 - tmp))
            rotation_matrix = np.dot(u, v)
            var_new = np.sum(s)
            if var != 0 and var_new < var * (1 + tol):
                break
            var = var_new
        return rotation_matrix

    def _fit(
        self, X: npt.ArrayLike
    ) -> tuple[npt.ArrayLike, npt.ArrayLike, npt.ArrayLike]:
        U, S, _ = super()._fit(X)
        # NOTE: R_v is NOT normalized, as opposed to default R's varimax(rawLoadings)$rot
        R_v = self._ortho_matrix(self.components_.T, method=self.rotation, tol=self.tol)
        U_c = U[:, : self.n_components_]
        R_u = self._ortho_matrix(U_c, method=self.rotation, tol=self.tol)
        # NOTE: same as varimax(rawLoadings, normalize = F)$loadings, + sqrt(n)
        Y = np.sqrt(self.n_features_) * np.dot(self.components_.T, R_v)
        self.rot_components_ = Y.T
        self.R_v_ = R_v
        self.R_u_ = R_u
        return U, S, Y

    def fit_transform(
        self, X: npt.ArrayLike, y: Optional[npt.ArrayLike] = None
    ) -> npt.ArrayLike:
        U, S, Vt = self._fit(X)
        U = U[:, : self.n_components_]  # type:ignore
        # TODO: make_skew_positive(fa)
        return np.sqrt(self.n_samples_) * np.dot(U, self.R_u_)

    def transform(self, X: npt.ArrayLike) -> npt.ArrayLike:
        # TODO: ???
        raise NotImplementedError()

    def inverse_transform(self, X: npt.ArrayLike) -> npt.ArrayLike:
        raise NotImplementedError()
```

So, to be clear, here's usage on Iris:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()["data"]
x = StandardScaler().fit_transform(iris)
# NOTE: use 1e-5 tol to match R
x_scores = VSP(2, tol=1e-5).fit_transform(x)
# would like to:
VSP(2, tol=1e-5).fit(x).transform(x)
# or `new_x`, like `.transform(new_x)`
```

---
And the R reference:

```r
library(datasets)
library(vsp)

data(iris)

x = scale(data.matrix(iris[1:4]), scale = TRUE)
x_scores = vsp(x, rank=2, degree_normalize = FALSE)$Z
```

---
Roughly, you'll want something like:

```python
def transform(self, X: npt.ArrayLike) -> npt.ArrayLike:
    return np.dot(super().transform(X), self.R_u_)
```

It looks like there is a little bit of a difference between …
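For intuition on why rotating the whitened scores handles new data: with `whiten=True`, `PCA.transform` computes `X @ V / sqrt(explained_variance)`, which on the training matrix equals `U * sqrt(n - 1)` — the same whitened scores that `fit_transform` rotates. Here's a toy numpy check of that identity, with a random orthogonal matrix standing in for `R_u_` (an illustrative sketch, not the thread's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)  # PCA centers its input

n, k = X.shape[0], 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# whitened scores, as PCA(whiten=True).fit_transform computes them
scores = U[:, :k] * np.sqrt(n - 1)

# a random orthogonal matrix standing in for R_u_
R, _ = np.linalg.qr(rng.normal(size=(k, k)))

# the transform route: X @ V / sqrt(explained_variance), then rotate,
# i.e. np.dot(super().transform(X), self.R_u_)
ev = S[:k] ** 2 / (n - 1)
via_transform = (X @ Vt[:k].T / np.sqrt(ev)) @ R

assert np.allclose(scores @ R, via_transform)
```

Because the rotation is applied after whitening, any remaining mismatch has to come from the whitening scale itself, not from the rotation.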
---

@alexpghayes this is great, thank you so much! The difference is too big to be numerical error 🤔

---
It might come from the transformation of singular values into variance explained and back in the whitening logic. Personally I found the …

---
@alexpghayes I think you are right, the problem was:

```python
explained_variance_ = (S**2) / (n_samples - 1)
```

If we adjust the denominator to:

```python
explained_variance_ = ((S**2) / self.n_samples_)[: self.n_components_]
```

the results are the same.
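The size of that mismatch follows directly from the two denominators; here's a quick check of the two conventions on toy data (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
X = X - X.mean(axis=0)

n = X.shape[0]
S = np.linalg.svd(X, compute_uv=False)

ev_sklearn = (S**2) / (n - 1)  # sklearn PCA's convention (ddof=1)
ev_vsp = (S**2) / n            # the denominator that made the results match

# the two differ by the constant factor n / (n - 1); since whitening divides
# by sqrt(explained_variance), every whitened score ends up rescaled by
# sqrt((n - 1) / n) under sklearn's default
ratio = ev_sklearn / ev_vsp
assert np.allclose(ratio, n / (n - 1))
```

That constant rescaling of every score explains a systematic difference rather than numerical noise.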
---

Thanks again @alexpghayes, will close this now.

---
@ravwojdyla Been looking at this thread and wondering if you'd be willing/able to share any updates to your Python code here (for academic research). The only return would be that we are trying to get this to work with sparse input (afaik sklearn …)

---
@kennyjoseph sure, here it is:

```python
from typing import Any, Literal

import numpy as np
import numpy.typing as npt
from sklearn.decomposition import PCA


class VSP(PCA):  # type:ignore[misc]
    """
    R VSP implementation is at: https://github.com/RoheLab/vsp
    """

    def __init__(
        self,
        n_components: None | int | float | str = None,
        *,
        copy: bool = True,
        svd_solver: str = "auto",
        tol: float = 0.0,
        iterated_power: str = "auto",
        random_state: float | None = None,
        rotation: Literal["varimax", "quartimax"] = "varimax",
    ):
        """
        NOTE:
         * R tol/eps is higher by default
         * VSP will always use svds
           https://github.com/RoheLab/vsp/blob/d57d6e1858444035e1ab78fed371eb68ebdee0eb/R/vsp.R#L126
         * VSP is whitening by default?
           https://github.com/RoheLab/vsp/blob/d57d6e1858444035e1ab78fed371eb68ebdee0eb/R/vsp.R#L141-L142
        """
        super().__init__(
            n_components=n_components,
            copy=copy,
            whiten=True,
            svd_solver=svd_solver,
            tol=tol,
            iterated_power=iterated_power,
            random_state=random_state,
        )
        self.rotation = rotation

    @classmethod
    def _ortho_matrix(
        cls,
        components: Any,
        method: str = "varimax",
        tol: float = 1e-6,
        max_iter: int = 100,
    ) -> npt.ArrayLike:
        """
        Returns rotation matrix.

        `components` is either components or scores. See:
        https://github.com/RoheLab/vsp/blob/d57d6e1858444035e1ab78fed371eb68ebdee0eb/R/vsp.R#L141

        This is copied from sklearn.decomposition._factor_analysis._ortho_rotation
        but doesn't perform the final dot product, which we do inside fit/transform.
        """
        nrow, ncol = components.shape
        rotation_matrix = np.eye(ncol)
        var = 0
        for _ in range(max_iter):
            comp_rot = np.dot(components, rotation_matrix)
            if method == "varimax":
                tmp = comp_rot * np.transpose((comp_rot**2).sum(axis=0) / nrow)
            elif method == "quartimax":
                tmp = 0
            u, s, v = np.linalg.svd(np.dot(components.T, comp_rot**3 - tmp))
            rotation_matrix = np.dot(u, v)
            var_new = np.sum(s)
            if var != 0 and var_new < var * (1 + tol):
                break
            var = var_new
        return rotation_matrix

    def _fit(
        self, X: npt.ArrayLike
    ) -> tuple[npt.ArrayLike, npt.ArrayLike, npt.ArrayLike]:
        U, S, _ = super()._fit(X)
        explained_variance = (S**2) / self.n_samples_
        total_var = explained_variance.sum()
        explained_variance_ratio = explained_variance / total_var
        # NOTE: R_v is NOT normalized, as opposed to default R's varimax(rawLoadings)$rot
        R_v = self._ortho_matrix(self.components_.T, method=self.rotation, tol=self.tol)
        U_c = U[:, : self.n_components_]
        R_u = self._ortho_matrix(U_c, method=self.rotation, tol=self.tol)
        # NOTE: same as varimax(rawLoadings, normalize = F)$loadings, but also includes
        # the multiplication by sqrt(n)
        Y = np.sqrt(self.n_features_) * np.dot(self.components_.T, R_v)
        self.rot_components_ = Y.T
        self.R_v_ = R_v
        self.R_u_ = R_u
        self.explained_variance_ = explained_variance[: self.n_components_]
        self.total_var_ = total_var
        self.explained_variance_ratio_ = explained_variance_ratio[: self.n_components_]
        return U, S, Y

    def fit_transform(
        self, X: npt.ArrayLike, y: npt.ArrayLike | None = None
    ) -> npt.ArrayLike:
        U, S, Vt = self._fit(X)
        U = U[:, : self.n_components_]  # type:ignore
        # TODO: make_skew_positive(fa)
        return np.sqrt(self.n_samples_) * np.dot(  # type:ignore[no-any-return]
            U, self.R_u_
        )

    def transform(self, X: npt.ArrayLike) -> npt.ArrayLike:
        return np.dot(super().transform(X), self.R_u_)  # type:ignore[no-any-return]

    def inverse_transform(self, X: npt.ArrayLike) -> npt.ArrayLike:
        raise NotImplementedError()
```
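Putting it together, the whole fit-then-project path can be sketched end to end in plain numpy (an illustrative toy using the `1/n` variance convention from the fix above and a minimal varimax; `varimax` here is a local helper, not a library API):

```python
import numpy as np

def varimax(components, tol=1e-6, max_iter=100):
    """Return a varimax rotation matrix (same iteration as _ortho_matrix)."""
    nrow, ncol = components.shape
    R = np.eye(ncol)
    var = 0.0
    for _ in range(max_iter):
        comp_rot = components @ R
        tmp = comp_rot * ((comp_rot**2).sum(axis=0) / nrow)
        u, s, vt = np.linalg.svd(components.T @ (comp_rot**3 - tmp))
        R = u @ vt
        var_new = s.sum()
        if var != 0 and var_new < var * (1 + tol):
            break
        var = var_new
    return R

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 4))
X_new = rng.normal(size=(5, 4))
k, n = 2, X_train.shape[0]

# "fit": center, SVD, varimax rotation of the whitened left singular vectors
mean = X_train.mean(axis=0)
U, S, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
scores = U[:, :k] * np.sqrt(n)  # whitened scores under the 1/n convention
R_u = varimax(scores)
Z_fit = scores @ R_u            # what fit_transform returns

# "transform": center new rows with the *training* mean, whiten, rotate
ev = S[:k] ** 2 / n
Z_new = ((X_new - mean) @ Vt[:k].T / np.sqrt(ev)) @ R_u

# sanity check: transforming the training rows reproduces the fit-time scores
Z_train = ((X_train - mean) @ Vt[:k].T / np.sqrt(ev)) @ R_u
assert np.allclose(Z_train, Z_fit)
```

The sanity check at the end is exactly the `fit(x).transform(x)` vs `fit_transform(x)` agreement the thread was after.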
---

Ah awesome, thanks so much!

---
@kennyjoseph np, if you have any updates or notice any issues please let us know; we'd also love to hear updates about the sparse input.

---
Thank you so much for a great paper and this library!

I'm limited to Python and have reimplemented VSP atop sklearn's PCA by adding the varimax rotation. I get the same results by following your exact implementation in Python. But I have a follow-up use case: what should I do when I want to project new data using existing VSP loadings? This question is related to https://stats.stackexchange.com/questions/59213/how-to-compute-varimax-rotated-principal-components-in-r but in the context of VSP.

Would appreciate your help or hints.