Numerical columns treated as categorical #12

alexandersmedley · 2020-05-15T10:01:54Z

Hi guys,

I heard of PPS through your article and was curious to test it, so I tried it on some data I've been working on.

Unfortunately, I get numerous warnings when calculating the pps matrix:

Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=4.

UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.

My guess is pps is considering my data to be categorical and therefore trying to apply classification with a huge number of labels.
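To make that guess concrete, here is a minimal sketch using only the standard library. It assumes a numeric target is being read as class labels and lists the classes that have fewer members than the number of cross-validation splits, which is the condition the first warning complains about. The helper name `classes_too_small` and the sample values are made up for illustration and are not part of ppscore or scikit-learn.

```python
from collections import Counter


def classes_too_small(labels, n_splits=4):
    """Return the classes that have fewer members than n_splits."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < n_splits}


# Toy numeric target: read as class labels, most values form a class of their own.
energy_use = [12.5, 48.1, 48.1, 48.1, 48.1, 7.3, 90.2]
rare = classes_too_small(energy_use, n_splits=4)
print(rare)  # every class listed here has too few members for 4 splits
```

If a continuous column goes through a check like this, almost every value ends up in the "too small" set, which would be consistent with the warnings above.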

Looking at how pps determines whether the data is numerical or categorical, I cannot find a reason why it would consider my data categorical:

The dtypes are int or float
The number of unique values is higher than 15 (except for 1 column which is equal to 15, but changing the NUMERIC_AS_CATEGORIC_BREAKPOINT constant to 10 does not resolve the problem)
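The two checks above can be sketched as a small diagnostic. This is only an assumption about the kind of rule described in this issue (numeric dtype plus a unique-value threshold), not ppscore's actual implementation. The helper `guess_task`, its `max_categories` parameter, the column name `SiteEnergyUse`, and the data are all hypothetical.

```python
import pandas as pd


def guess_task(series, max_categories=15):
    """Hypothetical rule: numeric columns with many distinct values are regression targets."""
    if pd.api.types.is_numeric_dtype(series) and series.nunique() > max_categories:
        return "regression"
    return "classification"


# Toy data, not the Seattle dataset.
df = pd.DataFrame({
    "SiteEnergyUse": [i * 1.5 for i in range(40)],    # 40 distinct floats
    "NumberofBuildings": [i % 5 for i in range(40)],  # only 5 distinct ints
})

tasks = {col: guess_task(df[col]) for col in df.columns}
print(tasks)
```

Running a check like this over every column is a quick way to see which ones sit at or below the threshold and would therefore be treated as categorical under such a rule.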

Also, if I try to force the pps score to be calculated using task = 'regression', I get the following error:

'DataFrame' object has no attribute 'dtype'

Here is my code:

import pandas as pd
import ppscore as pps

df = pd.read_csv('seattle_building_energy_benchmark.csv', sep = ';')

df.dtypes

df.nunique()

pps.NUMERIC_AS_CATEGORIC_BREAKPOINT = 10

for col in df.columns: 
    print(col)
    pps.score(df, x = 'YearBuilt', y = col, task = None)

for col in df.columns: 
    print(col)
    pps.score(df, x = 'YearBuilt', y = col, task = 'regression')

pps.matrix(df)

Is there something I am missing? If not, would you like me to share the data with you? (I do not know which sharing method is most convenient for you.)

8080labs · 2020-05-18T05:59:08Z

Hi Alex,

It would be great if you could share your data. You can choose any method that works for you, e.g. a Google Drive link, another file upload, or sending us the original source.

There is definitely some work to be done from our side to further clarify or catch the errors that are currently raised from sklearn.

The error "'DataFrame' object has no attribute 'dtype'" seems strange to me. It looks like .dtype is being called internally on a DataFrame instead of a Series. Is it possible that there are columns with the same name?

By the way, why are you using YearBuilt to predict all the other columns (x is YearBuilt)? Don't you want to do the opposite and predict YearBuilt from the other columns (y is YearBuilt)?

Cheers,
Florian

alexandersmedley · 2020-05-18T07:42:59Z

Hi Florian!

Thanks for your answer. I've uploaded the csv and jupyter notebook to Google Drive. Here is the link: https://drive.google.com/open?id=127BwkUTcKF18_Kh599jHlsMVIFnpwU3S

The original data is available on Kaggle here:
https://www.kaggle.com/city-of-seattle/sea-building-energy-benchmarking#2015-building-energy-benchmarking.csv

The data I am using was obtained from the original by combining the 2015 and 2016 tables and cleaning them.

Yes, I also found it strange that .dtype looks like it's being called on a DataFrame. It's particularly weird because it happens only when task is specified, not when task = None. All the columns have unique names.

Using 'YearBuilt' to predict was just a test! I wanted to understand which variable was being treated as categorical. My initial objective was to calculate the whole pps matrix and compare it to the correlation matrix to see if it could provide more insight, like you did in your article.

Cheers,

Alexander

8080labs · 2020-05-18T15:38:09Z

Hi Alexander,

thank you for sending over the data.

The first error appeared because overriding the categorical breakpoint does not seem to work. I will have to look into this again. It failed for the following code:
pps.score(df, x = 'YearBuilt', y = "NumberofBuildings")
And it worked when explicitly passing the task:
pps.score(df, x = 'YearBuilt', y = "NumberofBuildings", task="regression")

The second error is due to the fact that the for-loop resulted in the following:
pps.score(df, x = 'YearBuilt', y = "YearBuilt", task="regression")
Internally, this resulted in a dataframe with two identical columns, and hence we saw the error. This needs to be fixed.
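The effect of two identical columns can be reproduced with plain pandas. The sketch below is an illustration only: how ppscore builds its internal frame is an assumption here, and the toy values are made up. It shows that selecting a duplicated column label returns a DataFrame rather than a Series, and that a DataFrame has `dtypes` but no `dtype` attribute.

```python
import pandas as pd

# Toy frame, not the Seattle dataset.
df = pd.DataFrame({"YearBuilt": [1990, 2001, 2015], "NumberofBuildings": [1, 2, 1]})

# Selecting the same column as both feature and target gives two identically named columns.
pair = df[["YearBuilt", "YearBuilt"]]

# With duplicate labels, label-based selection returns a DataFrame instead of a Series.
selected = pair["YearBuilt"]
print(type(selected).__name__)

try:
    selected.dtype
    message = None
except AttributeError as exc:
    message = str(exc)
    print(message)
```

This matches the error reported above, which only shows up when x and y name the same column.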

Cheers,
Florian

alexandersmedley · 2020-05-20T15:19:25Z

Hi Florian,

I'm happy to learn the data helped you identify the problems :)

I suspected the categorical breakpoint might not work, but couldn't be sure as the for loop was acting weird. I didn't anticipate the x = y case!
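Until the x = y case is handled, one possible workaround is to skip the feature column inside the loop. The sketch below is a hypothetical helper, not part of ppscore. It takes the scoring function as an argument so that it runs without the library installed, and the column name `FloorArea` is a placeholder.

```python
def score_all_targets(columns, x, score_fn):
    """Score x against every other column, skipping the x == y case."""
    results = {}
    for y in columns:
        if y == x:
            continue  # scoring a column against itself triggered the error above
        results[y] = score_fn(x, y)
    return results


columns = ["YearBuilt", "NumberofBuildings", "FloorArea"]

# Stand-in scoring function that only records which pair was requested.
calls = score_all_targets(columns, "YearBuilt", lambda x, y: f"{x} -> {y}")
print(calls)
```

With ppscore, `score_fn` could be something like `lambda x, y: pps.score(df, x=x, y=y, task="regression")`, using the same call shape as in the code earlier in this thread.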

Thanks again for providing this package and taking the time to update and support it.

Cheers,

Alexander

8080labs · 2020-05-20T15:32:32Z

Yes, the data was very helpful - thank you for that!

FlorianWetschoreck · 2020-07-14T08:35:38Z

There have been two issues here:

inferring the task, which in the future will be based on the dtype and not on numeric_breakpoints
allowing pps.score(df, x = 'YearBuilt', y = "YearBuilt")

FlorianWetschoreck self-assigned this Jul 13, 2020

FlorianWetschoreck closed this as completed Jul 14, 2020
