# Regression and Quantile Regression
In regression, the target is on a continuous scale. When the model is trained with the commonly used L2 (squared error) loss, its prediction can be seen as an estimate of the conditional mean $E(Y|X)$. It is usually a point prediction, although some techniques, such as Bayesian neural networks or quantile regression, give prediction intervals. With conformal prediction, we can turn a point prediction into a prediction interval that comes with a guarantee of covering the true outcome of future observations with a chosen probability, or we can adjust already predicted intervals so that they reach the desired coverage. We will try conformal prediction starting from both kinds of approaches.
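To illustrate how the choice of loss determines what a model predicts, here is a minimal sketch in plain Python. It searches for the best constant prediction on a small made-up sample, once under the squared loss and once under the pinball loss, which is the standard loss for quantile regression. The function names, the toy sample, and the grid of candidates are assumptions made for this illustration only.

```python
def squared_loss(c, ys):
    """Mean squared error of the constant prediction c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)


def pinball_loss(c, ys, tau):
    """Mean pinball (quantile) loss of the constant prediction c at level tau."""
    return sum(max(tau * (y - c), (tau - 1) * (y - c)) for y in ys) / len(ys)


# Toy, right-skewed sample (made-up numbers, not the rent data)
ys = [1.0, 2.0, 2.5, 3.0, 10.0]

# Candidate constant predictions on a grid from 0.00 to 12.00
candidates = [i / 100 for i in range(1201)]

# Pick the candidate with the smallest loss under each criterion
best_l2 = min(candidates, key=lambda c: squared_loss(c, ys))
best_median = min(candidates, key=lambda c: pinball_loss(c, ys, 0.5))
best_q70 = min(candidates, key=lambda c: pinball_loss(c, ys, 0.7))

print("squared loss minimizer:", best_l2)
print("sample mean:", sum(ys) / len(ys))
print("pinball minimizer (tau=0.5):", best_median)
print("pinball minimizer (tau=0.7):", best_q70)
```

On this sample, the squared loss is minimized at the sample mean (3.7), while the pinball loss with `tau=0.5` is minimized at the median (2.5). A regression model does the same thing conditionally on the features: the squared loss targets the mean of the target, and the pinball loss targets a chosen quantile.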

In [1]:
import os
import wget
import zipfile
from os.path import exists
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from mapie.classification import MapieClassifier
from mapie.metrics import classification_coverage_score
from mapie.metrics import classification_mean_width_score

In [2]:
# import libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
# read the rent index data
rent = pd.read_csv("http://www.bamlss.org/misc/rent99.raw", sep=" ")
y = rent["rentsqm"]
X = rent.drop(["rent", "rentsqm", "cheating"], axis=1)

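# hold out 2000 rows; the remaining rows form the training set
X_train, X_rest1, y_train, y_rest1 = train_test_split(X, y, test_size=2000, random_state=2)
# from the held-out rows, keep 500 for testing and pass 1500 on
X_test, X_rest2, y_test, y_rest2 = train_test_split(X_rest1, y_rest1, test_size=1500, random_state=2)
# split the remaining 1500 rows into 1000 for calibration and 500 "new" observations
X_calib, X_new, y_calib, y_new = train_test_split(X_rest2, y_rest2, test_size=500, random_state=2)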

#data sizes
print(f"data sizes: train: {len(X_train)}, test: {len(X_test)}, calibration: {len(X_calib)}, new: {len(X_new)}")

data sizes: train: 1082, test: 500, calibration: 1000, new: 500


In [3]:
params = {"n_estimators":[10,50,100,500,1000],"max_depth": [None,1,2,5,10],
          "min_samples_split":[2,5,10],
          "min_samples_leaf":[1,2,4]}

#the model is a random forest
model = RandomForestRegressor()
#Create the random search object using 5-fold cross-validation
random_search = RandomizedSearchCV(estimator= model, param_distributions=params,cv=5,n_iter=10,random_state=0)
# fit the random search to the data
random_search.fit(X_train,y_train)

In [4]:
#refit the model with the best parameters found with random_search
model= RandomForestRegressor(**random_search.best_params_,random_state=1)
model.fit(X_train,y_train)

In [5]:
from sklearn.metrics import mean_absolute_error
y_pred= model.predict(X_test)
mae= mean_absolute_error(y_test,y_pred)
print(round(mae,2))

1.59


This means that, on the test data, the predictions are off by 1.59 euros per square meter on average. Again, we can use the MAPIE library to conformalize the model.

In [6]:
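from mapie.regression import MapieRegressor
# cv="prefit": the random forest is already trained, so MAPIE does not refit it
mapie_reg = MapieRegressor(estimator=model, cv="prefit")
# fit() only calibrates the intervals, using the held-out calibration data
mapie_reg.fit(X_calib, y_calib)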

In [7]:
# alpha=1/3 requests prediction intervals with a target coverage of 1 - alpha, about 67%
y_pred, y_pis = mapie_reg.predict(X_new, alpha=1/3)

In [8]:
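# inspect the first new observation
print(X_new.iloc[0])
print("predicted rent: {:.2f}".format(y_pred[0]))
# y_pis has shape (n_samples, 2, n_alpha); flattening one row gives [lower, upper]
interval = y_pis[0].flatten()
print("67% interval :[{:.2f},{:.2f}]".format(interval[0], interval[1]))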

area          72.0
yearc       1970.0
location       1.0
bath           0.0
kitchen        0.0
district    1712.0
Name: 690, dtype: float64
predicted rent: 6.79
67% interval :[4.90,8.69]
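
As a rough sketch of what calibrating a pre-fitted model involves, the following plain-Python example implements split conformal prediction with absolute residuals as the conformity score. It is an illustration under stated assumptions, not MAPIE's actual code: the helper names (`conformal_quantile`, `split_conformal_intervals`), the toy `predict` function, and the synthetic data are all made up for this example.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:  # too few calibration points for this alpha
        return math.inf
    return sorted(scores)[k - 1]


def split_conformal_intervals(y_calib, pred_calib, pred_new, alpha):
    """Symmetric intervals pred +/- q built from absolute calibration residuals."""
    scores = [abs(y - p) for y, p in zip(y_calib, pred_calib)]
    q = conformal_quantile(scores, alpha)
    return [(p - q, p + q) for p in pred_new]


# Hypothetical "pre-fitted" model: it simply predicts 2 * x
def predict(x):
    return 2.0 * x


rng = random.Random(0)


def sample(n):
    """Draw synthetic data where y equals the model prediction plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [predict(x) + rng.gauss(0, 1.5) for x in xs]
    return xs, ys


x_cal, y_cal = sample(1000)  # calibration set
x_new, y_new = sample(500)   # unseen observations

intervals = split_conformal_intervals(
    y_cal,
    [predict(x) for x in x_cal],
    [predict(x) for x in x_new],
    alpha=1 / 3,
)

# Share of new observations whose true value falls inside its interval
coverage = sum(lo <= y <= hi for y, (lo, hi) in zip(y_new, intervals)) / len(y_new)
print(f"empirical coverage: {coverage:.2f}")
```

With this score, every interval has the same half-width `q` and is centered on the point prediction. That matches the shape of the interval printed above, which extends about 1.9 euros per square meter on either side of the predicted rent of 6.79. The empirical coverage on the synthetic data should land close to the target of 1 - alpha, although it varies with the random sample.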
