LCA - Data order #61

Open
yuanjames opened this issue May 10, 2024 · 4 comments
Labels
question Further information is requested

Comments

@yuanjames

Hi,

I recently ran a series of experiments and found that the results change when I shuffle the data (all other settings being the same).

My understanding is that LCA should give the same results regardless of row order, but perhaps shuffling the data changes the convergence. Am I right? If we want the same results, we may need to change the LCA parameters.

@sachaMorin
Collaborator

By "other settings same", do you mean you're shuffling the order of the variables?

Shuffling should not affect the fit quality of the overall model, but it could affect the order of the parameters. It would be really helpful if you could provide a minimal example that reproduces what you observed, perhaps with one of the datasets in stepmix.datasets.
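To tell a pure reordering of the classes apart from a genuinely different clustering, the two sets of predictions can be compared with a measure that ignores how the classes are numbered. The following is a minimal sketch using the adjusted Rand index, written in plain Python. The function name `adjusted_rand_index` and the toy label lists are illustrative assumptions and are not part of StepMix.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples.

    Hypothetical helper for illustration; it only looks at which samples
    are grouped together, not at the label values themselves.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: nothing to disagree on

    # Pairs of samples grouped together in both labelings, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs
    max_index = (together_a + together_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (together_both - expected) / (max_index - expected)


run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]  # same grouping, classes renumbered
run_3 = [0, 1, 1, 1, 2, 2]  # one sample assigned to a different class

print(adjusted_rand_index(run_1, run_2))  # 1.0: identical up to relabeling
print(adjusted_rand_index(run_1, run_3))  # below 1.0: the grouping itself changed
```

A score of 1.0 means the two runs differ only in class numbering. scikit-learn provides the same measure as `sklearn.metrics.adjusted_rand_score`. Both label lists must refer to the same samples in the same row order, so predictions made on shuffled data need to be realigned first.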

@yuanjames
Author

yuanjames commented May 10, 2024

Sorry, I just realised that I made a mistake yesterday, so I have updated the example I used. Please take a look, @sachaMorin.

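import pandas as pd
from sklearn.datasets import load_iris
from stepmix.stepmix import StepMix

df, target = load_iris(return_X_y=True, as_frame=True)
df['iris_flower_type'] = target.map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
df = df.sample(frac=1)  # shuffle the rows (unseeded, so the order differs on every run)
continuous_features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
                       'petal width (cm)']
continuous_data = df[continuous_features]
continuous_data = continuous_data.sample(frac=1)  # shuffle the rows a second time
model = StepMix(n_components=5, measurement="continuous", verbose=1, random_state=123)
model.fit(continuous_data)
# predict() returns a plain array in continuous_data's row order, which differs from df's
# after the second shuffle; wrapping it in a Series aligns the predictions with df by index.
df['continuous_pred'] = pd.Series(model.predict(continuous_data), index=continuous_data.index)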

Every time I run the code, it gives me different crosstab results. For example,

No. 1

continuous_pred    0    1    2    3    4
                   0   50    0    0    0
                  23    0   23    4    0
                   1    0    0   20   29

No. 2

continuous_pred    0    1    2    3    4
                   0    0   50    0    0
                  24   16    0   10    0
                  15    0    0    0   35

@yuanjames
Author

yuanjames commented May 11, 2024

If my understanding is correct, an LCA model with fixed hyperparameters that always converges after the data is shuffled should give the same crosstab results regardless of row order. If the model fails to converge, however, shuffling can change the results.

I tried n_components = 2 and 3, and shuffling did not change the results. Once I changed it to 5, as in the example above, the results changed. Am I correct?

@sachaMorin
Collaborator

sachaMorin commented May 22, 2024

Looking at your previous results, the clusterings still look good. Each cluster captures a class (or a part of it if you have more clusters than classes).

It's also possible that this is caused by numerical issues. For example, the sum of an ndarray may actually vary slightly if you shuffle the elements due to the summing order. See the following program:

import numpy as np
np.random.seed(123)
a = np.random.random(100)
b = np.copy(a)
np.random.shuffle(b)
sum_a = np.sum(a)
sum_b = np.sum(b)
print(sum_a)
print(sum_b)
print(sum_a == sum_b)

Output:

50.14288800514812
50.142888005148116
False

Given the numerous sums and means taken in the StepMix estimation, those small differences can compound over time and could potentially explain what we're seeing here. I'm not sure and would be interested in seeing how other libraries behave.
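The order dependence comes from left-to-right floating-point accumulation in general, not from NumPy specifically. The following pure-Python sketch uses a hand-picked three-element example where the effect is large, and contrasts it with `math.fsum`, which returns a correctly rounded sum and is therefore insensitive to ordering. The helper `naive_sum` is a hypothetical name introduced for illustration; the sketch makes no claim about how StepMix accumulates its sums.

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation (hypothetical helper for illustration).

    Written as an explicit loop because the built-in sum() uses compensated
    summation for floats in recent Python versions.
    """
    total = 0.0
    for v in values:
        total += v
    return total


values = [1e16, 1.0, -1e16]
reordered = [1e16, -1e16, 1.0]  # same numbers, different order

# Adding 1.0 to 1e16 is lost to rounding, so the first order ends at 0.0.
print(naive_sum(values))      # 0.0
print(naive_sum(reordered))   # 1.0

# math.fsum tracks the lost low-order bits and gives the same answer for both.
print(math.fsum(values))      # 1.0
print(math.fsum(reordered))   # 1.0
```

In an iterative estimator, exact summation of this kind is usually too slow to apply everywhere, so small order-dependent differences between shuffled and unshuffled runs are generally to be expected.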

@sachaMorin sachaMorin added the question Further information is requested label May 22, 2024