What will happen if one uses a regressor for a classification problem or a classifier for a regression problem?

Let us use hyperparameter optimization (also called hyperparameter tuning) with GridSearchCV to approach this problem.

In [1]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC, SVR
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Lasso, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor

Classification and regression are both predictive modeling tasks. Classification means predicting one of a fixed set of labeled classes. For instance, categorizing iris flowers into three classes labeled 0, 1 and 2 calls for a classifier, while predicting a house price, car price or salary is a regression problem. Regression involves predicting a numeric quantity, which may be continuous or discrete.


It is possible to use some classifiers for a regression problem, but the predicted value will be in the form of a probability for a labeled class.
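To make the distinction concrete, here is a minimal sketch that fits one classifier and one regressor on the iris data. The choice of `LogisticRegression`, `LinearRegression` and `max_iter=1000` is an assumption made for illustration only. The classifier returns class labels and, through `predict_proba`, one probability per class; the regressor treats the labels as plain numbers and returns real values.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

X, y = load_iris(return_X_y=True)

# A classifier predicts one of the existing labels (0, 1 or 2) ...
clf = LogisticRegression(max_iter=1000).fit(X, y)
labels = clf.predict(X[:3])
# ... and can also report a probability for each class.
probabilities = clf.predict_proba(X[:3])

# A regressor treats the labels as ordinary numbers and predicts real values.
reg = LinearRegression().fit(X, y)
values = reg.predict(X[:3])

print(labels)
print(probabilities.shape)  # (3, 3): three samples, three classes
print(values)
```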

In this hyperparameter tuning exercise we want to see how regressors and classifiers perform when they are applied to the same dataset.

In [2]:
from sklearn.datasets import load_digits
digit = load_digits()

In [3]:
dir(digit)

['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']

In [5]:
model_params={
    'svm': {
        'model':SVC(gamma='auto'),
        'parameters':{
            'C':[2,5,10],
            'kernel':['rbf','linear']
        }
    },
    'RFC':{
        'model':RandomForestClassifier(),
        'parameters':{
            'n_estimators':[1,5,10]
        }
    },
    'logistic':{
        'model':LogisticRegression(solver='liblinear',multi_class='auto'),
        'parameters':{
            'C':[1,5,10]
        }
    },
    'GuassNB':{
        'model':GaussianNB(),
        'parameters':{}
    },
    'mm':{
        'model':MultinomialNB(),
        'parameters':{}
    },
    'DtreeC':{
        'model':DecisionTreeClassifier(),
        'parameters':{
            'criterion':['gini','entropy']
        }
    },
    'LR': {
        'model': LinearRegression(),
        'parameters': {
            # 'normalize' was removed in scikit-learn 1.2; this grid needs an older release
            'normalize': [True, False]
        }
    },
    'lasso': {
        'model': Lasso(),
        'parameters': {
            'alpha': [1, 5, 10],
            'selection': ['random', 'cyclic']
        }
    },
    'svr': {
        'model': SVR(),
        'parameters': {
            'gamma': ['auto', 'scale']
        }
    },
    'DtreeR': {
        'model': DecisionTreeRegressor(),
        'parameters': {
            # 'mse' was renamed 'squared_error' in scikit-learn 1.0 and removed in 1.2
            'criterion': ['mse', 'friedman_mse'],
            'splitter': ['best', 'random']
        }
    },
    'RFR': {
        'model': RandomForestRegressor(criterion='mse'),
        'parameters': {
            'n_estimators': [1, 5, 10]
        }
    },
    'KNNR': {
        'model': KNeighborsRegressor(algorithm='auto'),
        'parameters': {
            'n_neighbors': [1, 5, 10]
        }
    }
}

In [6]:
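from timeit import default_timer as timer

start = timer()
scores = []

# Grid-search every model on the digits data with 5-fold cross-validation.
# best_score_ uses each estimator's default scorer:
# accuracy for classifiers, R^2 for regressors.
for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'], mp['parameters'], cv=5, return_train_score=False)
    clf.fit(digit.data, digit.target)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })

df = pd.DataFrame(scores)[['model', 'best_score', 'best_params']]

end = timer()
print(end - start)  # elapsed time in seconds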

152.867448293


In [7]:
df

   model     best_score  best_params
0  svm       0.947697    {'C': 2, 'kernel': 'linear'}
1  RFC       0.907646    {'n_estimators': 10}
2  logistic  0.922114    {'C': 1}
3  GuassNB   0.806928    {}
4  mm        0.87035     {}
5  DtreeC    0.812498    {'criterion': 'entropy'}
6  LR        0.506557    {'normalize': False}
7  lasso     0.419144    {'alpha': 1, 'selection': 'random'}
8  svr       0.798344    {'gamma': 'scale'}
9  DtreeR    0.582211    {'criterion': 'friedman_mse', 'splitter': 'best'}
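One caveat when reading the scores: with no `scoring` argument, GridSearchCV uses each estimator's own default scorer, which is accuracy for classifiers and R^2 for regressors, so the two groups are not on the same scale. The sketch below is one possible way to put a regressor on the accuracy scale, by rounding its out-of-fold predictions to the nearest digit label. The choice of `KNeighborsRegressor` with `n_neighbors=5` and the rounding rule are assumptions made for illustration, not part of the original experiment.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

X, y = load_digits(return_X_y=True)

# Out-of-fold predictions from a regressor trained on the digit labels.
reg = KNeighborsRegressor(n_neighbors=5)
pred = cross_val_predict(reg, X, y, cv=5)

# R^2 is the kind of score GridSearchCV reports by default for a regressor.
r2 = r2_score(y, pred)

# Round to the nearest valid label (0 to 9) so the predictions can be
# scored with accuracy, the default metric for classifiers.
rounded = np.clip(np.rint(pred), 0, 9).astype(int)
acc = accuracy_score(y, rounded)

print(f"R^2: {r2:.3f}  accuracy after rounding: {acc:.3f}")
```

Comparing `acc` with the classifiers' best_score values gives a like-for-like comparison, whereas comparing R^2 with accuracy does not.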


In [8]:
# Load a second classification dataset to check whether the same pattern holds
from sklearn.datasets import load_iris
iris = load_iris()

In [9]:
x=iris.data
y=iris.target


In [10]:
from timeit import default_timer as timer

start = timer()
scores = []

# Repeat the same grid search on the iris data.
for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'], mp['parameters'], cv=5, return_train_score=False)
    clf.fit(x, y)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })

df = pd.DataFrame(scores)[['model', 'best_score', 'best_params']]

end = timer()
print(end - start)  # elapsed time in seconds

5.123784892999993


In [11]:
df

   model     best_score  best_params
0  svm       0.98        {'C': 2, 'kernel': 'rbf'}
1  RFC       0.966667    {'n_estimators': 5}
2  logistic  0.966667    {'C': 5}
3  GuassNB   0.953333    {}
4  mm        0.953333    {}
5  DtreeC    0.96        {'criterion': 'gini'}
6  LR        0.322561    {'normalize': False}
7  lasso     -0.638527   {'alpha': 1, 'selection': 'cyclic'}
8  svr       0.357821    {'gamma': 'auto'}
9  DtreeR    0.51        {'criterion': 'mse', 'splitter': 'best'}


In [12]:
# Load a dataset with a continuous target for the regression experiment
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dfautoinsurance = pd.read_csv(url, header=None)
dfautoinsurance

     0    1
0    108  392.5
1    19   46.2
2    13   15.7
3    124  422.2
4    40   119.4
...  ...  ...
58   9    87.4
59   31   209.8
60   14   95.5
61   53   244.6


In [13]:
df=dfautoinsurance.values

In [14]:
x = df[:, :-1]
y = df[:, -1]
y

array([392.5,  46.2,  15.7, 422.2, 119.4, 170.9,  56.9,  77.5, 214. ,
        65.3,  20.9, 248.1,  23.5,  39.6,  48.8,   6.6, 134.9,  50.9,
         4.4, 113. ,  14.8,  48.7,  52.1,  13.2, 103.9,  77.5,  11.8,
        98.1,  27.9,  38.1,   0. ,  69.2,  14.6,  40.3, 161.5,  57.2,
       217.6,  58.1,  12.6,  59.6,  89.9, 202.4, 181.3, 152.8, 162.8,
        73.4,  21.3,  92.6,  76.1,  39.9, 142.1,  93. ,  31.9,  32.1,
        55.6, 133.3, 194.5, 137.9,  87.4, 209.8,  95.5, 244.6, 187.5])

In [15]:
# Keep only the regressors, since the classifiers cannot carry out regression
# on a continuous target. The entries are the same ones defined in model_params.
regressor_names = ['LR', 'lasso', 'svr', 'DtreeR', 'RFR', 'KNNR']
model_paramsR = {name: model_params[name] for name in regressor_names}
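The reason the classifiers are dropped can be checked directly. As a minimal sketch, the snippet below tries to fit a `LogisticRegression` on a small made-up dataset with a continuous target (the `X_toy` and `y_toy` values are hypothetical and stand in for the insurance data). scikit-learn raises a `ValueError` because the target is not a set of class labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with a continuous target, standing in for the
# insurance data loaded above.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([1.5, 3.1, 4.4, 6.2])

try:
    LogisticRegression().fit(X_toy, y_toy)
    error_message = None
except ValueError as exc:
    # The classifier rejects a target that is not made of class labels.
    error_message = str(exc)

print(error_message)
```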

In [16]:
from timeit import default_timer as timer

start = timer()
scoresR = []

# Grid-search the regressors only; best_score_ is the mean cross-validated R^2.
for model_name, mp in model_paramsR.items():
    clf = GridSearchCV(mp['model'], mp['parameters'], cv=5, return_train_score=False)
    clf.fit(x, y)
    scoresR.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })

dfR = pd.DataFrame(scoresR)[['model', 'best_score', 'best_params']]

end = timer()
print(end - start)  # elapsed time in seconds

2.445819011000026


In [17]:
dfR

   model   best_score  best_params
0  LR      0.644305    {'normalize': True}
1  lasso   0.644452    {'alpha': 10, 'selection': 'random'}
2  svr     -0.304826   {'gamma': 'scale'}
3  DtreeR  0.32573     {'criterion': 'mse', 'splitter': 'best'}
4  RFR     0.421839    {'n_estimators': 10}
5  KNNR    0.47836     {'n_neighbors': 10}


Back to the question: what happens if one uses a regressor for a classification problem or a classifier for a regression problem?

From the results above it is clear that:
1. Regressors can be run on a classification problem, because the class labels are numbers, but classifiers cannot be used on a regression problem with a continuous target.

2. Regressors perform very poorly on a classification problem. Note that GridSearchCV scores classifiers by accuracy and regressors by R^2 by default, so the two groups of best_score values are not on the same scale.

3. Classifier performance depends on the dataset. For instance, comparing the digit classification problem with the iris flower classification, SVM scored highest in both cases but with different kernels: 'linear' for the digits and 'rbf' for iris. In fact, most of the other tuned models had a different set of best_params in each classification problem.

4. Use classifiers for classification problems and regressors for regression problems.
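The kernel observation in point 3 can be inspected more closely through `cv_results_`, which stores the mean validation score of every candidate rather than only the best one. The following is a rough sketch that fixes `C=2` (the value selected on both datasets above) and compares only the two kernels; the `kernel_scores` table and its column names are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_digits, load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rows = []
for name, loader in [("digits", load_digits), ("iris", load_iris)]:
    X, y = loader(return_X_y=True)
    search = GridSearchCV(SVC(gamma="auto", C=2), {"kernel": ["rbf", "linear"]}, cv=5)
    search.fit(X, y)
    # cv_results_ holds the mean validation accuracy for every candidate.
    for kernel, score in zip(search.cv_results_["param_kernel"],
                             search.cv_results_["mean_test_score"]):
        rows.append({"dataset": name, "kernel": kernel, "mean_accuracy": score})

kernel_scores = pd.DataFrame(rows)
print(kernel_scores)
```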

Hyperparameter optimization is very important.