LCA - Data order #61

yuanjames · 2024-05-10T19:38:53Z

Hi,

I recently ran a series of experiments and found something tricky: the results changed when I shuffled the data, with all other settings kept the same.

My expectation is that LCA should give the same results regardless of row order, but perhaps the shuffled data changes the convergence. Am I right? If we want identical results, do we need to change the parameters of LCA?

sachaMorin · 2024-05-10T22:34:05Z

When you say you shuffled the data with the other settings the same, do you mean you’re shuffling the order of the variables?

Shuffling should not affect the fit quality of the overall model, but could affect the order of the parameters. It would be really helpful if you could provide a minimal example that reproduces what you observed, perhaps with one of the datasets in stepmix.datasets.
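To tell apart "same clustering, different parameter order" from a genuinely different clustering, one option is to compare two label vectors while ignoring which integer names each cluster. The sketch below is only an illustration in plain Python; `same_partition` is a hypothetical helper, not part of StepMix.

```python
def same_partition(labels_a, labels_b):
    """Return True if two labelings group the samples identically,
    ignoring which integer is used for each cluster."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # setdefault stores the first pairing seen and returns it afterwards;
        # any later disagreement means the two partitions differ.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]  # same grouping, clusters numbered differently
run_3 = [0, 0, 0, 1, 2, 2]  # third sample moved to another cluster

print(same_partition(run_1, run_2))  # True
print(same_partition(run_1, run_3))  # False
```

When the rows were shuffled between runs, the two label vectors have to be put back in the same sample order (for example by sorting on the original index) before comparing them this way.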

yuanjames · 2024-05-10T22:54:27Z

Sorry, I just realised that I made a mistake yesterday, so I have updated the example I used. Please check, @sachaMorin.

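import pandas as pd
from sklearn.datasets import load_iris
from stepmix.stepmix import StepMix

df, target = load_iris(return_X_y=True, as_frame=True)
df['iris_flower_type'] = target.map({0:'setosa', 1:'versicolor', 2:'virginica'})
df = df.sample(frac=1) # shuffle
continuous_features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
                 'petal width (cm)']
continuous_data = df[continuous_features]
continuous_data = continuous_data.sample(frac=1) # second shuffle: row order now differs from df
model = StepMix(n_components=5, measurement="continuous", verbose=1, random_state=123)
model.fit(continuous_data)
# predict() returns an array in continuous_data's row order, so align it with df by index
df['continuous_pred'] = pd.Series(model.predict(continuous_data), index=continuous_data.index)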

Every time I run the code, it gives me different crosstab results. For example,

No 1.

continuous_pred	0	1	2	3	4

0 | 50 | 0 | 0 | 0
23 | 0 | 23 | 4 | 0
1 | 0 | 0 | 20 | 29

No 2.

continuous_pred	0	1	2	3	4

0 | 0 | 50 | 0 | 0
24 | 16 | 0 | 10 | 0
15 | 0 | 0 | 0 | 35

yuanjames · 2024-05-11T08:43:48Z

If my understanding is correct, an LCA model with fixed hyperparameters that always converges after the data are shuffled should give the same crosstab results regardless of the shuffle. If the model cannot converge, however, shuffling can change the results.

I tried n_components = 2 or 3 and shuffling did not change the results. Once I changed it to 5, as in the example above, the results changed. Am I correct?
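One possible contributing factor, stated here as an assumption rather than confirmed StepMix behaviour: if an estimator picks its starting points by drawing row positions from a seeded random generator, a fixed seed fixes the positions but not the data points sitting at those positions once the rows are reordered. The plain-Python sketch below uses a hypothetical `pick_initial_centers` helper to show the effect, with a simple reversal standing in for a shuffle.

```python
import random


def pick_initial_centers(data, k, seed):
    """Hypothetical initializer: use k randomly chosen rows as starting centers."""
    rng = random.Random(seed)
    chosen_rows = rng.sample(range(len(data)), k)  # same positions for the same seed
    return [data[i] for i in chosen_rows]


data = [float(i) for i in range(20)]
reordered = data[::-1]  # same values, different row order

centers_original = pick_initial_centers(data, k=5, seed=123)
centers_reordered = pick_initial_centers(reordered, k=5, seed=123)

# Same seed and same row positions, but different starting points.
print(centers_original)
print(centers_reordered)
print(centers_original == centers_reordered)  # False
```

Different starting points can lead an iterative estimator to different local optima, which would matter more when there are more components than well-separated groups in the data.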

sachaMorin · 2024-05-22T20:25:00Z

Looking at your previous results, the clusterings still look good. Each cluster captures a class (or a part of it if you have more clusters than classes).

It's also possible that this is caused by numerical issues. For example, the sum of an ndarray may actually vary slightly if you shuffle the elements due to the summing order. See the following program:

import numpy as np
np.random.seed(123)
a = np.random.random(100)
b = np.copy(a)
np.random.shuffle(b)
sum_a = np.sum(a)
sum_b = np.sum(b)
print(sum_a)
print(sum_b)
print(sum_a == sum_b)

Output:

50.14288800514812
50.142888005148116
False

Given the numerous sums and means taken in the StepMix estimation, those small differences can compound over time and could potentially explain what we're seeing here. I'm not sure and would be interested in seeing how other libraries behave.
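The same order dependence can be reproduced without NumPy. The sketch below is only an illustration, with values chosen to exaggerate the effect: a plain left-to-right accumulation gives different totals for two orderings of the same three numbers, while `math.fsum` from the standard library, which tracks the rounding error, returns the same total for both. The `naive_sum` helper is hypothetical and written out as an explicit loop so the rounding at each step is visible.

```python
import math


def naive_sum(values):
    """Accumulate left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [1e16, 1.0, -1e16]
reordered = [1e16, -1e16, 1.0]  # same numbers, different order

# 1e16 + 1.0 rounds back to 1e16, so the 1.0 is lost in the first ordering.
print(naive_sum(values))     # 0.0
print(naive_sum(reordered))  # 1.0

# fsum compensates for rounding and is order independent here.
print(math.fsum(values) == math.fsum(reordered))  # True
```

In a real dataset the discrepancy per sum is far smaller, as in the NumPy output above, but an iterative procedure repeats such sums many times.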

sachaMorin added the "question" label (Further information is requested) on May 22, 2024
