
FAMD implementation #16

Closed
thusithaC opened this issue Dec 17, 2017 · 10 comments

@thusithaC

Hi,

Any updates on the FAMD? I'm trying to get some statistical analysis work done using Python, but unfortunately can't find many tools. I appreciate the effort you have put into this package, though!

@MaxHalford
Owner

Hey, I just got back into the project and refactored all the code. FAMD is, I promise, imminent.

@mlisovyi

mlisovyi commented Aug 9, 2018

Is the FAMD implementation complete by now? It seems to be a high-level wrapper around MFA treating each feature as a separate group. While this seems to be reasonable for categorical features, what does it mean for numerical features?
PS Thanks a lot for developing this package! It is awesome to have these tools with a convenient interface. I was not able to find another Python package with an FAMD implementation.

@MaxHalford
Owner

I'm pretty sure the implementation is wrong. Instead of using one group per variable, I should be using one group for the numerical variables and one group for the categorical ones. I'll fix this ASAP.

No worries! It is a bit difficult to find reference implementations to compare with so sometimes I get things wrong. FactoMineR in R is nice but the source code is very difficult to read and there are barely any comments.
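For reference, the textbook FAMD recipe (not prince's actual code, just a toy sketch under standard assumptions): standardize the numeric columns, one-hot encode the categorical ones with each dummy column weighted by 1/sqrt of its proportion of ones, then run an ordinary PCA on the concatenated table. The function name and the SVD-based PCA are illustrative choices, not part of prince's API.

```python
import numpy as np
import pandas as pd

def famd_sketch(df: pd.DataFrame, n_components: int = 2) -> np.ndarray:
    """Toy FAMD: PCA on a mixed table where numeric columns are
    z-scored and one-hot dummies are weighted by 1 / sqrt(p_j)."""
    num = df.select_dtypes(np.number)
    cat = df.select_dtypes(exclude=np.number)

    # Numeric part: center and scale to unit variance
    Z_num = (num - num.mean()) / num.std(ddof=0)

    # Categorical part: one-hot, weight each dummy, then center
    dummies = pd.get_dummies(cat).astype(float)
    p = dummies.mean()            # proportion of ones per dummy column
    Z_cat = dummies / np.sqrt(p)
    Z_cat = Z_cat - Z_cat.mean()

    # Principal components of the combined table via SVD
    Z = pd.concat([Z_num, Z_cat], axis=1).to_numpy()
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_components] * s[:n_components]
```

The weighting makes rare categories count more per cell, which is what puts categorical and numeric variables on a comparable footing before the PCA step.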

@MaxHalford reopened this Aug 9, 2018
@mlisovyi

mlisovyi commented Aug 9, 2018

Indeed, there are packages in R, but I have 0 knowledge of R so far, unfortunately :(

I also bumped into the GLRM approach, which is documented here: https://web.stanford.edu/~boyd/papers/glrm.html. The paper is long, but the main idea is that instead of an eigenvector decomposition, they solve a minimisation problem with a loss function that differs between numerical and categorical features. They also provide Python, Julia and Spark implementations. The native Python implementation is not advised for medium or large datasets (I think they mention O(100x100), but one should look it up in the paper). But there is a Python wrapper around the Julia implementation available here: https://github.com/udellgroup/pyglrm, which is claimed to work on large (in-memory) datasets. I do not have hands-on experience with it, but maybe it would be useful for you.
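To make the minimisation idea concrete, here is a bare-bones alternating least squares sketch for the simplest GLRM instance, a quadratic loss with no regularizers. The full GLRM framework swaps in a different loss per column (e.g. hinge or logistic for categorical features), which this toy version does not do; the function name and iteration count are made up for illustration.

```python
import numpy as np

def glrm_quadratic(A: np.ndarray, k: int, n_iters: int = 50):
    """Alternating least squares for min ||A - X @ Y||_F^2
    over rank-k factors X (m x k) and Y (k x n)."""
    rng = np.random.default_rng(0)
    Y = rng.standard_normal((k, A.shape[1]))
    for _ in range(n_iters):
        # Fix Y, solve the least-squares problem for X
        X = np.linalg.lstsq(Y.T, A.T, rcond=None)[0].T
        # Fix X, solve the least-squares problem for Y
        Y = np.linalg.lstsq(X, A, rcond=None)[0]
    return X, Y
```

Each half-step is a convex least-squares solve, which is why the scheme scales well and why per-column losses can be swapped in without changing the overall structure.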

@MaxHalford
Owner

Okay the implementation should be good in version 0.4.5. I'll close this issue once everyone seems happy with it and once I've added some more documentation.

Thanks for the paper, I didn't know about it.

@Arne-He

Arne-He commented Aug 19, 2018

Hi,

what is the intended behaviour on datasets containing all-zero columns?
If I run FAMD on a mixed dataset like the one below (based on the docs), it crashes.

...
data=[
         ['A', 'A', 'A', 2, 5, 7, 0, 3, 6, 7],
         ['A', 'A', 'A', 4, 4, 4, 0, 4, 4, 3],
         ['B', 'A', 'B', 5, 2, 1, 0, 7, 1, 1],
         ['B', 'A', 'B', 7, 2, 1, 0, 2, 2, 2],
         ['B', 'B', 'B', 3, 5, 6, 0, 2, 6, 6],
         ['B', 'B', 'A', 3, 5, 4, 0, 1, 7, 5]
     ],
...

@MaxHalford
Owner

Hey @Arne-He,

The data you provided crashes because one column only contains zeros. This causes a division by zero in the following piece of code in MFA.py:

if self.normalize:
    # Scale continuous variables to unit variance
    num = X.select_dtypes(np.number).columns
    normalize = lambda x: x / np.sqrt((x ** 2).sum())
    X.loc[:, num] = (X.loc[:, num] - X.loc[:, num].mean()).apply(normalize, axis='rows')

We should however be checking for this. I've started a dev branch where it's fixed. It will be available in Prince's next release.
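One possible guard (a sketch, not the actual fix on the dev branch) is to detect columns whose centered norm is zero and leave them untouched, so the normalization never divides by zero:

```python
import numpy as np
import pandas as pd

def scale_numeric(X: pd.DataFrame) -> pd.DataFrame:
    """Center numeric columns and scale them to unit norm,
    skipping constant (e.g. all-zero) columns to avoid 0 / 0."""
    X = X.copy()
    num = X.select_dtypes(np.number).columns
    centered = X[num] - X[num].mean()
    norms = np.sqrt((centered ** 2).sum())
    # Constant columns have norm 0; dividing by 1 keeps their
    # centered values (all zeros) instead of producing NaNs
    safe = norms.replace(0, 1)
    X.loc[:, num] = centered / safe
    return X
```

An alternative design would be to raise a warning or drop such columns outright, since a constant column carries no information for the decomposition anyway.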

@Arne-He

Arne-He commented Aug 20, 2018

Thanks for the quick reply and fix!

@srinikprem

Hi,

How can we determine the variance explained by each original variable in a given FAMD component?

@MaxHalford
Owner

Hey @srinikprem,

This hasn't been implemented yet, I'm sorry.

By the way, I'm going to close this issue because it seems to be going stale. Feel free to open new issues if you have questions or bugs. I'm swamped at the moment, but I plan to get back to Prince and implement some more features. Please try to be precise in your requests.
