Error while applying .transform() #117

Closed
nico695 opened this issue Aug 4, 2021 · 3 comments

nico695 commented Aug 4, 2021

This is the same error that is documented in #56.

I tried downgrading to version 0.7.0 through the repository linked in that thread, but it still raises the same dimensionality error.

Here is the code:

import numpy as np
import pandas as pd
from prince import FAMD

# Two numeric columns (A, B) and four categorical columns (C-F), each with five levels.
X_n = pd.DataFrame(np.random.rand(10000, 2), columns=list('AB'))
X_c = pd.DataFrame(np.random.choice(list('abcde'), size=(10000, 4), replace=True), columns=list('CDEF'))
X = pd.concat([X_n, X_c], axis=1)

famd = FAMD(n_components=6, n_iter=100)
famd.fit(X)

# Transforming a 9-row slice of the training data raises the ValueError.
famd.transform(X.iloc[1:10, :])

I get the same error in versions 0.7.0 and 0.7.1:

ValueError: shapes (9,20) and (22,6) not aligned: 20 (dim 1) != 22 (dim 0)

@christophe-williams

I've run into this issue a few times, and it looks like it comes from how dummies are generated in _build_X_global. When the dataset you are transforming does not contain every category level present in the larger original dataset, the resulting dummified dataset has fewer columns (in this case, 20 rather than 22).
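To illustrate the column mismatch, here is a minimal sketch that uses pandas.get_dummies directly. It assumes the dummies are built in a similar way inside the library, which I have not checked against the source; the tiny `train` and `new` frames are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: column C has three levels, column D has two.
train = pd.DataFrame({"C": ["a", "b", "c", "a"], "D": ["x", "y", "x", "y"]})

# The first two rows never contain level "c" of column C.
new = train.iloc[:2]

train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new)

# One dummy column per level that is actually present in each frame.
print(list(train_dummies.columns))
print(list(new_dummies.columns))
```

The second frame is one column short, so a matrix product against weights fitted on the first frame fails with a shape error like the one reported above.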

My suggested fix for this (and for #56 and #116) is to store the dummified columns in the FAMD and MFA models. If a new dataset being transformed contains only a subset of the categorical values, its dummified dataset would then still get the full set of columns, with one or more of them all zeroes. If a new dataset contains categorical values that were not seen during fitting, the transform should probably throw an error.
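A rough sketch of that idea, written against plain pandas rather than the library internals. The helper names `fit_dummy_columns` and `transform_dummies` are hypothetical and do not exist in prince; this only shows the store-and-reindex logic.

```python
import pandas as pd


def fit_dummy_columns(X_cat):
    """Remember the dummy column names produced by the training data."""
    return pd.get_dummies(X_cat, dtype=float).columns


def transform_dummies(X_cat, fitted_columns):
    """Dummify new data and align it to the columns seen at fit time."""
    dummies = pd.get_dummies(X_cat, dtype=float)
    # Levels that were never seen during fitting cannot be projected.
    unseen = dummies.columns.difference(fitted_columns)
    if len(unseen) > 0:
        raise ValueError(f"Unseen categorical values: {list(unseen)}")
    # Missing levels become all-zero columns, in the fitted column order.
    return dummies.reindex(columns=fitted_columns, fill_value=0.0)


# Hypothetical toy data for illustration.
train = pd.DataFrame({"C": ["a", "b", "c", "a"], "D": ["x", "y", "x", "y"]})
columns = fit_dummy_columns(train)

aligned = transform_dummies(train.iloc[:2], columns)
print(aligned.shape)
```

With the columns aligned this way, the transformed matrix always has the width expected by the fitted components, and unknown levels fail early with a clear message instead of a shape error.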

sibmike commented Sep 28, 2021

I had the same issue, so before fitting MCA I had to make sure that my train, validation, and test sets all contain every level of each categorical variable, and drop the columns where they don't:

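# Flag each categorical column whose set of levels is identical in all three splits.
keep = []
for clmn in X_train_cat.columns:
    train_cats = set(X_val_train_cat[clmn].unique())
    val_cats = set(X_val_test_cat[clmn].unique())
    test_cats = set(X_test_cat[clmn].unique())
    keep.append(train_cats == val_cats == test_cats)

# Boolean mask over the column index: only the consistent columns survive.
keep_columns = X_train_cat.columns[keep]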

But that's obviously an awkward temporary workaround, just to make it work. The stored dummy columns fix that @christophe-williams suggested would be nice to have.
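A less lossy variant of this workaround might be to pin the category levels instead of dropping columns. This is only a sketch in plain pandas: `with_fixed_levels` is a hypothetical helper, and whether the library keeps the categorical dtype intact internally is an assumption I have not verified.

```python
import pandas as pd

# Hypothetical toy splits; the test split lacks level "c" of C and level "y" of D.
train = pd.DataFrame({"C": ["a", "b", "c", "a"], "D": ["x", "y", "x", "y"]})
test = pd.DataFrame({"C": ["a", "b"], "D": ["x", "x"]})

# Record the levels seen in the training split.
levels = {col: sorted(train[col].unique()) for col in train.columns}


def with_fixed_levels(df, levels):
    """Cast each column to a categorical dtype with the training levels."""
    out = df.copy()
    for col, cats in levels.items():
        # Values outside `cats` become NaN, so check for unseen levels first.
        out[col] = pd.Categorical(out[col], categories=cats)
    return out


train_d = pd.get_dummies(with_fixed_levels(train, levels))
test_d = pd.get_dummies(with_fixed_levels(test, levels))

# get_dummies emits a column for every declared category, observed or not.
print(list(test_d.columns))
```

Both splits then produce the same dummy columns, without discarding any variable.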

@MaxHalford
Owner

Hello there 👋

I apologise for not answering earlier. I was not maintaining Prince anymore. However, I have just refactored the entire codebase. This refactoring should have fixed many bugs.

I don’t have the time and energy to check whether this fixes your issue, but there is a good chance it does. Feel free to reopen this issue if the problem persists after installing the new version (0.8.0 or later).
