Error gracefully when on categorical matrices with missing data (segfault is bad). #328

Closed
jtilly opened this issue Nov 12, 2020 · 1 comment
jtilly commented Nov 12, 2020

This produces a segfault:

import numpy as np
import pandas as pd

from quantcore.matrix import from_pandas
from quantcore.glm import GeneralizedLinearRegressor

df = pd.DataFrame(
    {
        "x1": [4, 4, 5, 6, 0, np.nan],
    }
)

y = np.array(
    [
        1.1,
        0.8,
        1.2,
        1.0,
        2.7,
        0.5,
    ]
)

split_matrix = from_pandas(df.astype("category"))
model = GeneralizedLinearRegressor(family="gamma", alpha=1.0)
model.fit(X=split_matrix, y=y)
model.coef_

This works:

split_matrix = from_pandas(df.fillna("missing").astype("category"))
model = GeneralizedLinearRegressor(family="gamma", alpha=1.0)
model.fit(X=split_matrix, y=y)
model.coef_
# array([ 0.15917267, -0.05053825,  0.00268371, -0.02201278, -0.08930535])

I think it's okay not to support this, but it should fail more gracefully.

The underlying problem is that astype("category") doesn't represent missing values as a separate level:

split_matrix
# [4.0, 4.0, 5.0, 6.0, 0.0, NaN]
# Categories (4, float64): [0.0, 4.0, 5.0, 6.0]  # <- 4 categories, not 5
split_matrix.matvec(np.array([1, 2, 3, 4]))
# array([2, 2, 3, 4, 1, 0])
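
For illustration, the same behaviour can be reproduced with pandas alone, without quantcore. This is a minimal sketch; the variable names are made up for the example. It shows that pandas gives the NaN row the sentinel code -1 instead of its own category, and that filling the missing value first produces a regular fifth level:

```python
import numpy as np
import pandas as pd

# Same column as in the report above.
s = pd.Series([4, 4, 5, 6, 0, np.nan]).astype("category")

# NaN is not a category; pandas encodes it with the sentinel code -1.
codes = s.cat.codes.to_numpy()
n_categories = len(s.cat.categories)
print(codes)         # [ 1  1  2  3  0 -1]
print(n_categories)  # 4

# Filling the missing value first turns it into a regular fifth level.
filled = pd.Series([4, 4, 5, 6, 0, np.nan]).fillna("missing").astype("category")
filled_codes = filled.cat.codes.to_numpy()
n_filled_categories = len(filled.cat.categories)
print(n_filled_categories)  # 5
```

Any code that uses these integer codes as array indices without checking for -1 will read from the wrong position, which is presumably what goes wrong in the segfaulting call above.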
ElizabethSantorellaQC self-assigned this Nov 12, 2020
tbenthompson changed the title from "Categorical matrices can't have missings" to "Error gracefully when on categorical matrices with missing data (segfault is bad)." Mar 18, 2021
tbenthompson added this to quantcore.glm in release Mar 18, 2021
@tbenthompson

Closing because this fails gracefully now:

Traceback (most recent call last):
  File "/home/tbent/Dropbox/active/quantco/quantcore.glm/fail.py", line 24, in <module>
    split_matrix = from_pandas(df.astype("category"))
  File "/home/tbent/.miniconda3/envs/quantcore.glm/lib/python3.9/site-packages/quantcore/matrix/constructor.py", line 95, in from_pandas
    cat = CategoricalMatrix(coldata, dtype=dtype)
  File "/home/tbent/.miniconda3/envs/quantcore.glm/lib/python3.9/site-packages/quantcore/matrix/categorical_matrix.py", line 40, in __init__
    raise ValueError("Categorical data can't have missing values.")
ValueError: Categorical data can't have missing values.
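
A check of this kind could look roughly like the sketch below. The helper name `check_no_missing` is hypothetical and the code uses only pandas; it is not taken from quantcore.matrix, whose actual implementation is not shown in this thread.

```python
import numpy as np
import pandas as pd


def check_no_missing(cat_vec):
    """Hypothetical helper: reject categorical data that contains missing values."""
    cat = pd.Categorical(cat_vec)
    # pandas marks missing entries with the sentinel code -1.
    if (cat.codes == -1).any():
        raise ValueError("Categorical data can't have missing values.")
    return cat


# Passes: no missing values.
ok = check_no_missing([4, 4, 5, 6, 0])

# Raises ValueError: the last entry is missing.
try:
    check_no_missing([4, 4, 5, 6, 0, np.nan])
    raised = False
except ValueError:
    raised = True
```

Validating in the constructor means the error surfaces at from_pandas, before any compiled routine can index with an invalid code.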

release automation moved this from quantcore.glm fixes to Done Sep 23, 2021