-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modeling probabilistic binary outcome with npreg #34
Comments
Greetings! What a cool data set to play with... First, npconmode() and predict... due to its nature npconmode() requires a slightly different approach from conditional mean estimation but its use is really super simple once you see an example (?npconmode indicates the use of newdata in the function call for what that is worth [see "Arguments" section in ?npconmode])... so for npconmode here is what you want to do...
Appending these to your example delivers the following:
So, for this example and split of the data into a test/train set it appears both nonparametric methods deliver roughly 90% correct classification ratios while the logit delivers roughly 70%... given the enormity of your test data set this is probably fairly robust... Second, npreg()... the data you sent has integer values for the dependent variable...
These run just fine for me using your code, so I conjecture that on some other data set (prior to a transformation) your data was coded as integer with values 1, 2. When you run npreg() you estimate a conditional mean, so if Y\in{1,2} then E[Y|X=x]=1Pr(Y=1|X=x)+2Pr(Y=2|X=x) which will lie in [1,2] for the local constant estimator npreg() (default). If you recode so that Y\in{0,1} then E[Y|X=x]=0Pr(Y=0|X=x)+1Pr(Y=1|X=x) = Pr(Y=1|X=x) which is what the logit model would estimate. Now, npconmode() admits factors for Y as it estimates the conditional probabilities directly (i.e., Pr(Y=1|X=x) and Pr(Y=2|X=x)) and computes the conditional mode so, unlike estimation of the conditional mean, no transformation for non binary Y is needed. In short, subtracting 1 from Y\in{1,2} is exactly what you should do when computing regression-based binary predictions which is what you seem to be doing. Hope this helps! Great illustration by the way! |
Dear Jeffrey, Many thanks for your prompt and very useful answers. The data set provided here is a reduced version of the complete data set I plan to use (2 years of solar data with a minute time step). For the npconmode, how can I access to the predicted outcomes of the test set (i.e. the conditional probabilities)? Indeed, I want to compute indices of error that are not provided by npconmode function and that cannot be derived from the contengency table. More precisely, I need the Brier score, its decomposition and the ROC diagram, which are very important to compare probabilistic forecasts Last, is npcomp mode available in the library npRmpi? |
Greetings, npconmode() is based on npcdensbw() and npcdens()… if you don’t provide the npcdensbw() bandwidth object it is automatically computed by npconmode() and available via foo$bws (i.e., if you invoked foo <- npconmode()). You can feed this bw object to npcdens() to generate the fitted & predicted probabilities (or re-run cross-validation by directly calling npcdensbw()), and use this to generate ROC curves etc. Note that npcdens() estimates f(y|x) so it uses your sample y values by default (i.e., if y_1=0 and y_2=1 you are estimating f(y_1|x_1)=f(0|x_1) and f(y_2|x_2)=f(1|x_2), etc.) Yes, npconmode is available in npRmpi… however, if computation is an issue you might want to try trees instead (can reduce computation time in some cases) which use bounded support kernel functions (option cxkertype=”epanechnikov”) and set global options options(np.tree=TRUE) after you load the np library... see the FAQ for more info on trees and see ?npcdensbw() for kernel options... also, the option for npreg() to set a bounded support kernel is ckertype="epanechnikov" (in npcdens() we need to distinguish kernels for Y and X hence the "cxkertype" argument - your Y is a factor so no need to set cykertype for your illustration). Hope this helps! |
Hello,
I am doing a comparison between two approaches to predict the probability of a binary outcome: the parametric approach with a classical logistic regression (logit) and the nonparametric approach with npreg. To be more precise, the binary outcome is the forecast of cloud presence of in front of the sun (0 = cloud / 1 = no cloud) and the 3 predictors are 1) the binary deterministic forecast of cloud presence derived from a sky imager (a camera that records the cloud motions), 2) the current sky opacity (i.e. the clear sky index, continuous variable), 3) and the average sky opacity over the 5 past minutes (also a continuous variable). The horizon of the forecast ranges from 1 minute to 30 minutes.
When I use the npreg fonction to fit my model, the predictions does not range between 0 and 1 but between 1 and 2. When I substract 1 to the predicted values, the results are awesome and clearly better than the logit model. But I do not understand why the predictions fall in the interval [1;2] instead of [0;1]. Is anyone aware of this particular behavior of npreg?
I also tried the npconmode function but, it seems to not be suitable to generate predictions because it is not compatible with the "predict" function. Is there any other approaches to create prediction from npconmode ?
Thanks
Mathieu
Below my code and in the attached pieces the data.
library(np)
#---------------
Load data
#---------------
train_set <- read.csv('train_set.txt') #training
test_set <- read.csv('test_set.txt') #test
test_set.txt
train_set.txt
#---------------
logit
#---------------
#Train
CPlogit <- glm(factor(CP_400th) ~ CSKt + avg + factor(cloudiness_forecast_th), family = binomial(link = "logit"), data = train_set)
#Predict
fcst_logit <- predict(CPlogit, newdata = data.frame(CSKt = test_set$CSKt, avg = test_set$avg, cloudiness_forecast_th = factor(test_set$cloudiness_forecast_th)), type = "response")
#Confusion matrix
cm_logit <- table(test_set$CP_400th, ifelse(fcst_logit>0.5, 1, 0))
print(cm_logit)
#------------------
Non parametric
#------------------
CPnp <- npreg(factor(CP_400th) ~ CSKt + avg + factor(cloudiness_forecast_th), data = train_set )
#Predict
fcst_np <- predict(CPnp, newdata = data.frame(CSKt = test_set$CSKt, avg = test_set$avg, cloudiness_forecast_th = factor(test_set$cloudiness_forecast_th)), type = "response")
#Confusion matrix
cm_np <- table(test_set$CP_400th, ifelse(fcst_np-1>0.5, 1, 0))
print(cm_np)
The text was updated successfully, but these errors were encountered: