Regression trees work much like classification trees, but they are used for numeric estimation problems, where the target variable is numeric or continuous: for example, predicting the price of a product from several input variables.

The regression approach most commonly used for classification problems is Logistic Regression. Remember that the predictors (input variables) may be categorical or numeric in either case; it is the target variable that determines the type of prediction required.

We will not re-examine the concept of regression in this course, because students are expected to have learned regression algorithms in statistics-related courses such as Statistical Methods for Data Science.


To demonstrate how Python can be used to build a classifier with the regression approach, the following code applies scikit-learn's LogisticRegression() class to the customer churn data set.

In [1]:
#Importing necessary Libraries
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd

#Loading Dataset
data = pd.read_csv('ChurnFinal.csv')

#Generating Matrix of Features
df_inputs = pd.get_dummies(data[['Gender', 'Age', 'PostalCode', 'Cash', 
        'CreditCard', 'Cheque', 'SinceLastTrx', 'SqrtTotal', 'SqrtMax', 'SqrtMin']])
df_label = data['Churn']

#Splitting dataset into training and testing dataset
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(df_inputs, df_label, 
                stratify=df_label, test_size=0.3, random_state=7)

# standardize inputs to zero mean and unit variance (scaler is fitted on the training data only)
from sklearn.preprocessing import StandardScaler  
scaler = StandardScaler()  
scaler.fit(X_train)                     
X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test)      

from sklearn.linear_model import LogisticRegression
lg = LogisticRegression(solver='liblinear',  random_state=7, max_iter=300)
lg.fit(X_train, Y_train)

# obtain the model intercept and coefficient of each input attribute
print(f"intercept: {np.round(lg.intercept_,4)}")
fieldList = np.array(list(df_inputs)).reshape(-1,1)
coeffs = np.reshape(np.round(lg.coef_,2),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Attribute','Coefficient']))

# predict the target for the test set so predicted and actual values can be compared
y_predict = lg.predict(X_test)

# assess the model performance
from sklearn import metrics 
print("Model Accuracy     : ", round(metrics.accuracy_score(Y_test, y_predict),4))

intercept: [0.0295]
        Attribute Coefficient
0             Age        0.59
1      PostalCode       -0.01
2            Cash        0.05
3      CreditCard        -0.4
4          Cheque       -0.14
5    SinceLastTrx        0.69
6       SqrtTotal       -0.25
7         SqrtMax       -0.19
8         SqrtMin        0.12
9   Gender_female        0.15
10    Gender_male       -0.15
Model Accuracy     :  0.7625


Observe the model accuracy printed on the console as Model Accuracy     :  0.7625. This means that the model correctly classifies the churn outcome ('yes' or 'no') for 76.25% of the customers in the test set.
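Accuracy is simply the fraction of predictions that match the actual labels, and it is easier to judge when set against a baseline that always predicts the most frequent class. The sketch below works through that arithmetic by hand. The labels are invented for illustration and are not taken from ChurnFinal.csv; the helper name `accuracy` is likewise hypothetical.

```python
from collections import Counter

# Toy labels, invented for illustration only (NOT from ChurnFinal.csv).
actual    = ["yes", "no", "no",  "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "yes", "no",  "no", "no", "yes", "no"]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the actual label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Baseline: always predict the most frequent actual label.
majority_label = Counter(actual).most_common(1)[0][0]
baseline = [majority_label] * len(actual)

model_acc = accuracy(actual, predicted)
baseline_acc = accuracy(actual, baseline)

print("Model accuracy    :", model_acc)
print("Baseline accuracy :", baseline_acc)
```

On a real data set, the same comparison could be made by checking the share of the majority class in Y_test before reading too much into a single accuracy figure.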

The output also shows the intercept, [0.0295] (about 0.03), and the coefficient of each input attribute, rounded to 2 decimal places: Age = 0.59, PostalCode = -0.01, Cash = 0.05, CreditCard = -0.40, Cheque = -0.14, SinceLastTrx = 0.69, SqrtTotal = -0.25, SqrtMax = -0.19, SqrtMin = 0.12, Gender_female = 0.15, and Gender_male = -0.15. Logistic Regression models the log-odds of the target rather than the target itself, so the fitted model is represented mathematically as follows, where p is the predicted probability of the positive class (churn) and each input is the standardized value produced by StandardScaler:

       ln(p / (1 - p)) = 0.03 + 0.59(Age) - 0.01(PostalCode) + 0.05(Cash) - 0.40(CreditCard) - 0.14(Cheque)

                        + 0.69(SinceLastTrx) - 0.25(SqrtTotal) - 0.19(SqrtMax) + 0.12(SqrtMin) + 0.15(Gender_female) - 0.15(Gender_male)
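A Logistic Regression model turns its weighted sum into a probability by passing it through the logistic (sigmoid) function. The following is a minimal sketch of that step using the rounded intercept and coefficients printed in the output above. It assumes that the inputs are already standardized (z-scores) and that the positive class is churn = 'yes'; the function name `churn_probability` and both example customers are hypothetical.

```python
import math

# Intercept and coefficients as printed in the model output (rounded values).
intercept = 0.0295
coefficients = {
    "Age": 0.59, "PostalCode": -0.01, "Cash": 0.05, "CreditCard": -0.40,
    "Cheque": -0.14, "SinceLastTrx": 0.69, "SqrtTotal": -0.25,
    "SqrtMax": -0.19, "SqrtMin": 0.12,
    "Gender_female": 0.15, "Gender_male": -0.15,
}


def churn_probability(z_scores):
    """Apply the sigmoid to the weighted sum of standardized inputs.

    Attributes missing from z_scores are treated as 0, i.e. at the
    training-set mean (an assumption made for this illustration).
    """
    log_odds = intercept + sum(
        coef * z_scores.get(name, 0.0) for name, coef in coefficients.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical customer with every standardized input at its mean.
p_avg = churn_probability({})

# Hypothetical customer one standard deviation above the mean on SinceLastTrx.
p_lapsed = churn_probability({"SinceLastTrx": 1.0})

print("Average customer :", round(p_avg, 4))
print("Lapsed customer  :", round(p_lapsed, 4))
```

Because SinceLastTrx has a positive coefficient, the second probability is higher than the first. With the fitted model itself, the equivalent probabilities would come from lg.predict_proba, which uses the unrounded coefficients.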

