ValueError: The condensed distance matrix must contain only finite values. #472

Open
sungla55guy opened this issue Aug 2, 2023 · 4 comments

Comments

@sungla55guy

Hi, I'm using generate_report with an LGBMClassifier for binary classification. My data has categorical features and missing values, which LightGBM handles natively. The dashboard runs fine, but when I try to generate a report with the following code:

xpl.generate_report(
    output_file='report.html', 
    project_info_file='model.yml',
    x_train=X_train,
    y_train=y_train,
    y_test=y_test,
    title_story="CCA Default Risk",
    metrics=[
        {
            'path': 'sklearn.metrics.f1_score',
            'name': 'f1 score',
        },
        {
            'path': 'sklearn.metrics.balanced_accuracy',
            'name': 'Balanced Accuracy',
        },
        {
            'path': 'sklearn.metrics.roc_auc',
            'name': 'ROC AUC',
        }
    ]
)

I get the following error:

PapermillExecutionError: 
---------------------------------------------------------------------------
Exception encountered at "In [8]":
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 report.display_dataset_analysis()

File ~\Miniconda3\envs\pandas2\lib\site-packages\shapash\report\project_report.py:284, in ProjectReport.display_dataset_analysis(self, global_analysis, univariate_analysis, target_analysis, multivariate_analysis)
    282 if multivariate_analysis:
    283     print_md("### Multivariate analysis")
--> 284     fig_corr = self.explainer.plot.correlations(
    285         self.df_train_test,
    286         facet_col='data_train_test',
    287         max_features=20,
    288         width=900 if len(self.df_train_test['data_train_test'].unique()) > 1 else 500,
    289         height=500,
    290     )
    291     print_html(plotly.io.to_html(fig_corr))
    292 print_md('---')

File ~\Miniconda3\envs\pandas2\lib\site-packages\shapash\explainer\smart_plotter.py:2296, in SmartPlotter.correlations(self, df, max_features, features_to_hide, facet_col, how, width, height, degree, decimals, file_name, auto_open)
   2294 if len(list_features) == 0:
   2295     top_features = compute_top_correlations_features(corr=corr, max_features=max_features)
-> 2296     corr = cluster_corr(corr.loc[top_features, top_features], degree=degree)
   2297     list_features = list(corr.columns)
   2299 fig.add_trace(
   2300     go.Heatmap(
   2301         z=corr.loc[list_features, list_features].round(decimals).values,
   (...)
   2308         hovertemplate=hovertemplate,
   2309     ), row=1, col=i+1)

File ~\Miniconda3\envs\pandas2\lib\site-packages\shapash\explainer\smart_plotter.py:2244, in SmartPlotter.correlations.<locals>.cluster_corr(corr, degree, inplace)
   2241     return corr
   2243 pairwise_distances = sch.distance.pdist(corr**degree)
-> 2244 linkage = sch.linkage(pairwise_distances, method='complete')
   2245 cluster_distance_threshold = pairwise_distances.max()/2
   2246 idx_to_cluster_array = sch.fcluster(linkage, cluster_distance_threshold, criterion='distance')

File ~\Miniconda3\envs\pandas2\lib\site-packages\scipy\cluster\hierarchy.py:1064, in linkage(y, method, metric, optimal_ordering)
   1061     raise ValueError("`y` must be 1 or 2 dimensional.")
   1063 if not np.all(np.isfinite(y)):
-> 1064     raise ValueError("The condensed distance matrix must contain only "
   1065                      "finite values.")
   1067 n = int(distance.num_obs_y(y))
   1068 method_code = _LINKAGE_METHODS[method]

ValueError: The condensed distance matrix must contain only finite values.
Python version :
3.9.16
Shapash version :
2.3.5
Operating System :
Windows 10

@guillaume-vignal
Collaborator

Thank you for reporting this bug. We'll fix it soon.
Best regards.

@ThomasBouche
Collaborator

Hi,

We have fixed this issue. You can try the new version of shapash, 2.3.7.

@ekamioka

ekamioka commented Oct 12, 2023

Hello @ThomasBouche , thanks for working on the issue.

I am afraid the issue is still present. I have just hit the same problem with version 2.3.7.

I think I understand the problem. The pandas DataFrame received as corr contains NaNs. As a result, pairwise_distances contains only NaNs, which triggers the error.
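
To illustrate the point above, here is a minimal sketch outside shapash. The 3x3 `corr` matrix is made up for the example (feature `c` stands in for a column whose correlations are all NaN), and the `corr**degree` step from the traceback is left out because the default `degree` is not shown in this thread. Only `pdist` and `linkage` from SciPy are used, as in the traceback.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Hypothetical correlation matrix: feature "c" has NaN correlations everywhere.
corr = pd.DataFrame(
    [[1.0, 0.5, np.nan],
     [0.5, 1.0, np.nan],
     [np.nan, np.nan, np.nan]],
    index=list("abc"),
    columns=list("abc"),
)

# Every row contains a NaN in column "c", so every pairwise
# Euclidean distance between rows is NaN as well.
pairwise_distances = pdist(corr)
all_nan = bool(np.isnan(pairwise_distances).all())

# linkage rejects non-finite distances with the ValueError from the report.
try:
    sch.linkage(pairwise_distances, method="complete")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(all_nan)
print(error_message)
```

A single NaN row and column in corr is enough to make the whole condensed distance matrix NaN, since each distance is computed across all columns.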

Looking at the compute_corr function that generates the corr matrix, we can see that df.corr() produces NaNs due to the presence of constant columns: the standard deviation of a constant column is zero, which results in a division by zero in the correlation calculation.
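
A small sketch of that behaviour, using a made-up DataFrame (the column names `x`, `y`, and `const` are hypothetical). The last two lines show one possible workaround on the user side, dropping columns with a single distinct value before computing correlations. This is an assumption for illustration, not the fix applied in shapash.

```python
import pandas as pd

# Hypothetical data: "const" holds the same value in every row.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 1.0, 4.0, 3.0],
    "const": [5.0, 5.0, 5.0, 5.0],
})

# The zero standard deviation of "const" makes its correlations NaN.
corr = df.corr()
const_is_nan = bool(corr["const"].isna().all())

# Possible workaround (assumption): keep only columns with more than
# one distinct non-missing value, then compute the correlations.
varying = df.loc[:, df.nunique(dropna=True) > 1]
clean_corr = varying.corr()

print(corr)
print(clean_corr)
```

With the constant column removed, the correlation matrix contains only finite values, so it could be passed on to a clustering step without hitting the error.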

ekamioka added a commit to ekamioka/shapash that referenced this issue Oct 12, 2023
@ThomasBouche
Collaborator

Hello,
Do you have an example so that I can reproduce the error?
I tried to create an error with constant values, but it didn't create an error.

Furthermore, in the context of a machine learning model, in what cases does a feature have constant values?
