Modeling probabilistic binary outcome with npreg #34

Closed
mathieudavidrun opened this issue Apr 21, 2022 · 3 comments
@mathieudavidrun
Hello,

I am comparing two approaches to predicting the probability of a binary outcome: a parametric approach using classical logistic regression (logit) and a nonparametric approach using npreg. To be more precise, the binary outcome is the forecast of cloud presence in front of the sun (0 = cloud / 1 = no cloud), and the 3 predictors are 1) the binary deterministic forecast of cloud presence derived from a sky imager (a camera that records cloud motion), 2) the current sky opacity (i.e., the clear sky index, a continuous variable), and 3) the average sky opacity over the past 5 minutes (also a continuous variable). The forecast horizon ranges from 1 minute to 30 minutes.

When I use the npreg function to fit my model, the predictions do not range between 0 and 1 but between 1 and 2. When I subtract 1 from the predicted values, the results are awesome and clearly better than the logit model. But I do not understand why the predictions fall in the interval [1, 2] instead of [0, 1]. Is anyone aware of this particular behavior of npreg?

I also tried the npconmode function, but it does not seem suitable for generating predictions because it is not compatible with the predict function. Is there any other approach to creating predictions from npconmode?

Thanks

Mathieu

My code is below; the data are in the attached files.

library(np)

#---------------

# Load data

#---------------
train_set <- read.csv('train_set.txt') #training
test_set <- read.csv('test_set.txt') #test

#---------------

# logit

#---------------
#Train
CPlogit <- glm(factor(CP_400th) ~ CSKt + avg + factor(cloudiness_forecast_th), family = binomial(link = "logit"), data = train_set)

#Predict
fcst_logit <- predict(CPlogit, newdata = data.frame(CSKt = test_set$CSKt, avg = test_set$avg, cloudiness_forecast_th = factor(test_set$cloudiness_forecast_th)), type = "response")

#Confusion matrix
cm_logit <- table(test_set$CP_400th, ifelse(fcst_logit>0.5, 1, 0))
print(cm_logit)

#------------------

# Nonparametric

#------------------
CPnp <- npreg(factor(CP_400th) ~ CSKt + avg + factor(cloudiness_forecast_th), data = train_set )

#Predict
fcst_np <- predict(CPnp, newdata = data.frame(CSKt = test_set$CSKt, avg = test_set$avg, cloudiness_forecast_th = factor(test_set$cloudiness_forecast_th)), type = "response")

#Confusion matrix
cm_np <- table(test_set$CP_400th, ifelse(fcst_np-1>0.5, 1, 0))
print(cm_np)

@JeffreyRacine (Owner) commented Apr 21, 2022

Greetings! What a cool data set to play with...

First, npconmode() and predict: due to its nature, npconmode() requires a slightly different approach from conditional mean estimation, but its use is really quite simple once you see an example (?npconmode documents the use of newdata in the function call, for what that is worth; see the "Arguments" section). So for npconmode, here is what you want to do...

# npconmode
# Read data

train_set <- read.csv('train_set.txt') #training
test_set <- read.csv('test_set.txt') #test

# For simplicity cast variables according to type in the train/test datasets

test_set$CP_400th <- factor(test_set$CP_400th)
test_set$cloudiness_forecast_th <- factor(test_set$cloudiness_forecast_th)

train_set$CP_400th <- factor(train_set$CP_400th)
train_set$cloudiness_forecast_th <- factor(train_set$cloudiness_forecast_th)

# Estimate conditional mode model

foo <- npconmode(CP_400th ~ CSKt + avg + cloudiness_forecast_th,
                 data = train_set,
                 newdata = test_set)

summary(foo)

## Compare correct classification ratios for npreg(), npconmode() and glm(,binomial(link = "logit"))

sum(diag(cm_np)/sum(cm_np))
sum(diag(foo$confusion.matrix)/sum(foo$confusion.matrix))
sum(diag(cm_logit)/sum(cm_logit))

Appending these to your example delivers the following:

> summary(foo)

Conditional Mode data: 299 training points, and 21604 evaluation points,
 in 4 variable(s)
(1 dependent variable(s), and 3 explanatory variable(s))

                          CP_400th
Dep. Var. Bandwidth(s): 0.01635493
                              CSKt      avg cloudiness_forecast_th
Exp. Var. Bandwidth(s): 0.06410095 48337.54             0.04977176

Bandwidth Type: Fixed

Confusion Matrix
      Predicted
Actual     0     1
     0  7611   713
     1  1462 11818

Overall Correct Classification Ratio:  0.8993242
Correct Classification Ratio By Outcome:
        0         1 
0.9143441 0.8899096 

McFadden-Puig-Kerschner performance measure:  0.8936554

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 2

Unordered Categorical Kernel Type: Aitchison and Aitken
No. Unordered Categorical Explanatory Vars.: 1
No. Unordered Categorical Dependent Vars.: 1


> ## Compare correct classification ratios
> 
> sum(diag(cm_np)/sum(cm_np))
[1] 0.9053879

> sum(diag(foo$confusion.matrix)/sum(foo$confusion.matrix))
[1] 0.8993242

> sum(diag(cm_logit)/sum(cm_logit))
[1] 0.6796889

So, for this example and split of the data into a test/train set it appears both nonparametric methods deliver roughly 90% correct classification ratios while the logit delivers roughly 70%... given the enormity of your test data set this is probably fairly robust...

Second, npreg()... the data you sent has integer values for the dependent variable...

> train_set <- read.csv('train_set.txt') #training
> class(train_set$CP_400th)
[1] "integer"
> unique(train_set$CP_400th)
[1] 0 1

These run just fine for me using your code, so I conjecture that on some other data set (prior to a transformation) your data was coded as integer with values 1, 2. When you run npreg() you estimate a conditional mean, so if Y ∈ {1, 2} then E[Y|X=x] = 1·Pr(Y=1|X=x) + 2·Pr(Y=2|X=x), which lies in [1, 2] for the local constant estimator (the npreg() default). If you recode so that Y ∈ {0, 1}, then E[Y|X=x] = 0·Pr(Y=0|X=x) + 1·Pr(Y=1|X=x) = Pr(Y=1|X=x), which is what the logit model would estimate. Now, npconmode() admits factors for Y as it estimates the conditional probabilities directly (i.e., Pr(Y=1|X=x) and Pr(Y=2|X=x)) and computes the conditional mode, so, unlike estimation of the conditional mean, no transformation of a non-binary Y is needed.

In short, subtracting 1 from Y ∈ {1, 2} is exactly what you should do when computing regression-based binary predictions, which is what you seem to be doing.
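To see the shift concretely, here is a tiny base-R check (the probability value is made up purely for illustration):

```r
# With Y coded {1, 2}, the conditional mean is
#   E[Y|X=x] = 1*Pr(Y=1|X=x) + 2*Pr(Y=2|X=x) = 1 + Pr(Y=2|X=x),
# so subtracting 1 recovers the conditional probability that Y = 2.
p2 <- 0.7                       # hypothetical Pr(Y = 2 | X = x)
ey_12 <- 1 * (1 - p2) + 2 * p2  # conditional mean with Y in {1, 2}
ey_01 <- 0 * (1 - p2) + 1 * p2  # conditional mean after recoding Y to {0, 1}
stopifnot(all.equal(ey_12 - 1, ey_01))  # the mean shifts by exactly 1
```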

Hope this helps! Great illustration by the way!

@mathieudavidrun (Author)

Dear Jeffrey,

Many thanks for your prompt and very useful answers. The data set provided here is a reduced version of the complete data set I plan to use (2 years of solar data with a minute time step).

For npconmode, how can I access the predicted outcomes on the test set (i.e., the conditional probabilities)? I want to compute error indices that are not provided by the npconmode function and that cannot be derived from the contingency table. More precisely, I need the Brier score, its decomposition, and the ROC diagram, which are very important for comparing probabilistic forecasts.

Last, is npconmode available in the npRmpi library?

@JeffreyRacine (Owner) commented Apr 22, 2022

Greetings,

npconmode() is based on npcdensbw() and npcdens()… if you don’t provide the npcdensbw() bandwidth object it is automatically computed by npconmode() and available via foo$bws (i.e., if you invoked foo <- npconmode()).

You can feed this bw object to npcdens() to generate the fitted and predicted probabilities (or re-run cross-validation by directly calling npcdensbw()), and use these to generate ROC curves etc. Note that npcdens() estimates f(y|x), so it uses your sample y values by default (i.e., if y_1=0 and y_2=1 you are estimating f(y_1|x_1)=f(0|x_1) and f(y_2|x_2)=f(1|x_2), etc.).
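For what it is worth, a minimal sketch of that route (assuming foo was fit with npconmode() as above, and using the variable names from this thread; untested here, so treat it as a starting point):

```r
# Sketch: evaluate Pr(CP_400th = 1 | X = x) on the test set by fixing y at 1
# for every evaluation point, then score the resulting probability forecasts.
library(np)

fhat <- npcdens(bws = foo$bws)  # conditional density using the cross-validated bandwidths

eval_set <- test_set
eval_set$CP_400th <- factor(1, levels = levels(train_set$CP_400th))
p1 <- predict(fhat, newdata = eval_set)  # estimated Pr(CP_400th = 1 | X = x)

# Brier score against the observed 0/1 outcomes
y <- as.numeric(as.character(test_set$CP_400th))
brier <- mean((p1 - y)^2)
```

With p1 in hand you can also feed (p1, y) to any ROC routine.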

Yes, npconmode is available in npRmpi. However, if computation is an issue you might want to try trees instead (they can reduce computation time in some cases) together with bounded-support kernel functions (option cxkertype="epanechnikov"); set the global option options(np.tree=TRUE) after you load the np library. See the FAQ for more info on trees and ?npcdensbw for kernel options. Also, the corresponding option for npreg() is ckertype="epanechnikov" (in npcdens() we need to distinguish kernels for Y and X, hence the "cxkertype" argument; your Y is a factor, so there is no need to set cykertype for your illustration).
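A short sketch of those speed-up options (argument names as documented in ?npcdensbw; untested on your data):

```r
library(np)
options(np.tree = TRUE)  # enable tree-based evaluation (see the np FAQ)

# Bounded-support kernel for the continuous predictors in the conditional
# density/mode setting; for npreg() the analogous option is ckertype.
bw <- npcdensbw(CP_400th ~ CSKt + avg + cloudiness_forecast_th,
                data = train_set,
                cxkertype = "epanechnikov")
```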

Hope this helps!
