Numerical columns treated as categorical #12

Closed
alexandersmedley opened this issue May 15, 2020 · 6 comments

@alexandersmedley

Hi guys,

I heard about PPS through your article and was curious to test it, so I tried applying it to some data I've been working on.

Unfortunately, I get numerous warning messages when calculating the PPS matrix:

Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=4.

UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.

My guess is that pps considers my data to be categorical and therefore tries to apply classification with a huge number of labels.

Looking at how pps determines whether the data is numerical or categorical, I cannot find a reason why it would consider my data categorical:

  • The dtypes are int or float
  • The number of unique values is higher than 15 (except for 1 column which is equal to 15, but changing the NUMERIC_AS_CATEGORIC_BREAKPOINT constant to 10 does not resolve the problem)
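
Both checks can be run side by side with a small pandas helper. This is only a sketch: `summarize_columns`, the toy data, and the `<=` comparison against the breakpoint are illustrative assumptions, not part of ppscore.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_columns(df: pd.DataFrame, breakpoint: int = 15) -> pd.DataFrame:
    """Hypothetical helper: report dtype and unique-value count per column."""
    n_unique = df.nunique()
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_unique": n_unique,
            # Check each column's dtype object rather than the column itself.
            "is_numeric": pd.Series(
                [is_numeric_dtype(t) for t in df.dtypes], index=df.columns
            ),
            # Assumption: a column is "at risk" when its unique count
            # does not exceed the breakpoint.
            "at_or_below_breakpoint": n_unique <= breakpoint,
        }
    )


# Toy data standing in for the real CSV.
toy = pd.DataFrame(
    {
        "many_values": range(20),
        "few_values": [0, 1] * 10,
        "labels": list("ab") * 10,
    }
)
summary = summarize_columns(toy)
print(summary)
```

Columns that are numeric but flagged in the last column are the ones worth inspecting first.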

Also, if I try to force the PPS score to be calculated using task = 'regression', I get the following error:

'DataFrame' object has no attribute 'dtype'

Here is my code:

import pandas as pd
import ppscore as pps

df = pd.read_csv('seattle_building_energy_benchmark.csv', sep = ';')

df.dtypes

df.nunique()

pps.NUMERIC_AS_CATEGORIC_BREAKPOINT = 10

for col in df.columns: 
    print(col)
    pps.score(df, x = 'YearBuilt', y = col, task = None)

for col in df.columns: 
    print(col)
    pps.score(df, x = 'YearBuilt', y = col, task = 'regression')

pps.matrix(df)

Is there something I am missing? If not, would you like me to share the data with you? (I do not know which sharing method is most convenient for you.)

@8080labs
Owner

Hi Alex,

It would be great if you could share your data. You can choose any method that works for you, e.g. a Google Drive link, another file upload, or a pointer to the original source.

There is definitely some work to be done on our side to clarify or catch the errors that are currently raised from sklearn.

The error "'DataFrame' object has no attribute 'dtype'" seems strange to me. It looks like .dtype is being called internally on a DataFrame instead of a Series. Is it possible that there are columns with the same name?

By the way, why are you using YearBuilt to predict all the other columns (x is YearBuilt)? Don't you want the opposite, i.e. to predict YearBuilt from the other columns (y is YearBuilt)?
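
One way to check for repeated column names is pandas' `Index.duplicated()`. The snippet below is a minimal sketch on a toy frame; the data is made up for illustration and is not taken from the dataset discussed here.

```python
import pandas as pd

# Toy frame with a deliberately repeated column name (illustrative only).
df = pd.DataFrame(
    [[1900, 1, 1950], [2000, 2, 1980]],
    columns=["YearBuilt", "NumberofBuildings", "YearBuilt"],
)

# duplicated() marks every occurrence of a name after the first one.
duplicated_names = df.columns[df.columns.duplicated()].tolist()
print(duplicated_names)

# is_unique gives a quick yes/no answer for the whole header.
print(df.columns.is_unique)
```

An empty list and `is_unique` being true would rule out duplicated names in the input file.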

Cheers,
Florian

@alexandersmedley
Author

Hi Florian!

Thanks for your answer. I've uploaded the CSV and the Jupyter notebook to Google Drive. Here is the link: https://drive.google.com/open?id=127BwkUTcKF18_Kh599jHlsMVIFnpwU3S

The original data is available on Kaggle here:
https://www.kaggle.com/city-of-seattle/sea-building-energy-benchmarking#2015-building-energy-benchmarking.csv

The data I am using was obtained from the original by combining the 2015 and 2016 tables and cleaning them.

Yes, I also found it strange that .dtype appears to be called on a DataFrame. It's particularly odd because that happens only when task is specified, not when task = None. The columns all have unique names.

Using 'YearBuilt' as the predictor was just a test! I wanted to understand which variable was being treated as categorical. My initial objective was to calculate the whole PPS matrix and compare it to the correlation matrix to see if it could provide more insight, like you did in your article.

Cheers,

Alexander

@8080labs
Owner

Hi Alexander,

Thank you for sending over the data.

The first error appeared because overriding the categorical breakpoint does not seem to take effect. I will have to look into this again. It failed for the following code:
pps.score(df, x = 'YearBuilt', y = "NumberofBuildings")
It worked when the task was passed explicitly:
pps.score(df, x = 'YearBuilt', y = "NumberofBuildings", task="regression")

The second error occurred because the for loop eventually made the following call:
pps.score(df, x = 'YearBuilt', y = "YearBuilt", task="regression")
Internally, this resulted in a DataFrame with two identical columns, hence the error. This needs to be fixed.
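
The mechanism can be reproduced with pandas alone. The sketch below uses a toy frame and assumes the internal selection is equivalent to `df[[x, y]]`; it also shows a simple guard for the loop that skips the feature column itself.

```python
import pandas as pd

# Toy data standing in for the real dataset.
df = pd.DataFrame(
    {"YearBuilt": [1900, 1950, 2000], "NumberofBuildings": [1, 2, 1]}
)
feature = "YearBuilt"

# Selecting the same name twice yields a frame with two identical columns,
# so indexing by that name returns a DataFrame rather than a Series.
pair = df[[feature, feature]]
try:
    pair[feature].dtype
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# Guard for the caller: leave the feature out of the list of targets.
targets = [col for col in df.columns if col != feature]
print(targets)
```

Until the x = y case is handled inside the library, filtering the targets like this avoids triggering the error from a loop over all columns.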

Cheers,
Florian

@alexandersmedley
Author

Hi Florian,

I'm happy to hear the data helped you identify the problems :)

I suspected the categorical breakpoint might not work, but I couldn't be sure because the for loop was behaving strangely. I didn't anticipate the x = y case!

Thanks again for providing this package and taking the time to update and support it.

Cheers,

Alexander

@8080labs
Owner

Yes, the data was very helpful - thank you for that!

@FlorianWetschoreck FlorianWetschoreck self-assigned this Jul 13, 2020
@FlorianWetschoreck
Collaborator

There have been two issues here:

  • inferring the task, which in the future will be based on the dtype and not on numeric breakpoints
  • allowing calls where x and y are the same column, e.g. pps.score(df, x = 'YearBuilt', y = "YearBuilt")
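
As a rough illustration of what dtype-based task inference could look like, here is a sketch. The function name `infer_task_from_dtype` and its rules are assumptions made for this example; they are not the ppscore implementation.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def infer_task_from_dtype(series: pd.Series) -> str:
    """Hypothetical rule: numeric targets are regression, the rest classification."""
    # Booleans count as numeric in pandas, so handle them first.
    if is_bool_dtype(series):
        return "classification"
    if is_numeric_dtype(series):
        return "regression"
    return "classification"


# A numeric column with few unique values is still treated as regression.
few_unique_numbers = pd.Series([1, 1, 2, 2])
print(infer_task_from_dtype(few_unique_numbers))
print(infer_task_from_dtype(pd.Series(["a", "b", "a"])))
```

Under a rule like this, a user who wants classification on an integer column would convert it explicitly, for example with `astype("category")` or `astype(str)`.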
