**Chapter 9.9: Numerical Responses for Decision Trees and Classification Algorithms**

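Classification means sorting outcomes into *categories*. Decision trees can also handle a *numerical* response: the tree still sorts observations into groups (its leaves), but each group's prediction is a number, the average response of the training rows in that group, rather than a category label.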

Pay close attention: the class imported here is DecisionTree*Regressor*, not DecisionTree*Classifier*.

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor # Import Decision Tree Regressor
from sklearn.model_selection import train_test_split # Import train_test_split function

df = pd.read_csv('https://www.ishelp.info/data/insurance.csv')
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [2]:
df = pd.get_dummies(df, columns=['sex', 'region', 'smoker'], drop_first=True, dtype=int)
df

Unnamed: 0,age,bmi,children,charges,sex_male,region_northwest,region_southeast,region_southwest,smoker_yes
0,19,27.900,0,16884.92400,0,0,0,1,1
1,18,33.770,1,1725.55230,1,0,1,0,0
2,28,33.000,3,4449.46200,1,0,1,0,0
3,33,22.705,0,21984.47061,1,1,0,0,0
4,32,28.880,0,3866.85520,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...
1333,50,30.970,3,10600.54830,1,1,0,0,0
1334,18,31.920,0,2205.98080,0,0,0,0,0
1335,18,36.850,0,1629.83350,0,0,1,0,0
1336,21,25.800,0,2007.94500,0,0,0,1,0
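
The dummy coding above drops one level per categorical column, which then acts as the baseline. A minimal sketch of that behavior, assuming a made-up three-row frame (`toy` and its values are hypothetical, not taken from the insurance data):

```python
import pandas as pd

# Hypothetical three-row frame; the values are made up for illustration
toy = pd.DataFrame({
    'age': [19, 18, 28],
    'smoker': ['yes', 'no', 'no'],
    'region': ['southwest', 'southeast', 'northwest'],
})

# drop_first=True drops the first level (in sorted order) of each column:
# 'no' for smoker and 'northwest' for region in this toy frame.
# A row with 0 in every remaining dummy column belongs to the dropped level.
encoded = pd.get_dummies(toy, columns=['smoker', 'region'], drop_first=True, dtype=int)
print(encoded)
```

In the third row both region columns are 0, which is how the dropped level is represented.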


**Response:** *charges*

In [3]:
y = df.charges
X = df.drop(columns=['charges'])

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
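
To see what `test_size=0.3` and `random_state=1` do, here is a small sketch on hypothetical data (`X_demo` and `y_demo` are made-up arrays of 10 rows, not the insurance data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, one feature
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10) * 2.0

# test_size=0.3 holds out 30% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
print(len(X_tr), len(X_te))  # 7 3

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
print(np.array_equal(X_te, X_te2))  # True
```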

**Pay close attention**: DecisionTree*Regressor()* not *Classifier()*

As before, we are feeding our dataset into an algorithm. This algorithm is not a y = mx + b type of model. Instead, it learns a series of yes/no splits on the other variables, sorts each person into a group (a leaf of the tree), and predicts that group's average insurance charges.
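
A minimal sketch of how a regression tree turns groups into numeric predictions, assuming a made-up dataset with a single 0/1 feature (think of it as a smoker flag; the numbers are hypothetical). With `max_depth=1` the tree makes one split, so it has two leaves, and each leaf predicts the mean response of its training rows:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: one 0/1 feature and a numeric response
X_demo = np.array([[0], [0], [0], [1], [1], [1]])
y_demo = np.array([1000.0, 2000.0, 3000.0, 20000.0, 30000.0, 40000.0])

# max_depth=1 allows a single split, so the tree has two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(X_demo, y_demo)

# Each prediction equals the mean of the training responses in that leaf
pred = stump.predict(np.array([[0], [1]]))
print(pred)
print(y_demo[:3].mean(), y_demo[3:].mean())
```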

In [4]:
# Create Decision Tree Regressor object
reg = DecisionTreeRegressor()

# Train Decision Tree Regressor
reg = reg.fit(X_train,y_train)

# Predict charges for the test dataset
y_pred = reg.predict(X_test)
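
`DecisionTreeRegressor()` with default settings places no limit on tree depth, so the tree can keep splitting until it reproduces the training data almost exactly. The sketch below illustrates this on synthetic data (the data, the `max_depth=3` value, and the variable names are assumptions chosen for illustration, not settings from this chapter):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy linear relationship (hypothetical, for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Unrestricted tree versus a tree limited to three levels of splits
full_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
small_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)

full_train_r2 = full_tree.score(X_tr, y_tr)
small_train_r2 = small_tree.score(X_tr, y_tr)

print(f'Depth:\t\tfull={full_tree.get_depth()}, small={small_tree.get_depth()}')
print(f'Train R squared:\tfull={full_train_r2:.3f}, small={small_train_r2:.3f}')
print(f'Test R squared:\tfull={full_tree.score(X_te, y_te):.3f}, small={small_tree.score(X_te, y_te):.3f}')
```

A near-perfect training score alongside a lower test score is a sign of overfitting, which is why scores should be compared on the test set.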

**Residuals**

In [5]:
# View the predicted versus actual values in a DataFrame
output_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
output_df

Unnamed: 0,Actual,Predicted
559,1646.42970,1263.24900
1087,11353.22760,11945.13270
1020,8798.59300,8457.81800
460,10381.47870,10702.64240
802,2103.08000,1964.78000
...,...,...
323,11566.30055,10982.50130
1268,1880.48700,2020.17700
134,2457.21115,2257.47525
1274,17043.34140,17085.26760
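
A residual is the actual value minus the predicted value. As a small sketch, the first five rows of the output above can be retyped by hand and a residual column added (the frame name `demo` is hypothetical; in the notebook the same subtraction could be applied to `output_df`):

```python
import pandas as pd

# First five rows of the Actual/Predicted output shown above
demo = pd.DataFrame(
    {
        'Actual': [1646.42970, 11353.22760, 8798.59300, 10381.47870, 2103.08000],
        'Predicted': [1263.24900, 11945.13270, 8457.81800, 10702.64240, 1964.78000],
    },
    index=[559, 1087, 1020, 460, 802],
)

# Positive residual: the model predicted too low; negative: too high
demo['Residual'] = demo['Actual'] - demo['Predicted']
print(demo)
```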


To compare different algorithms or models with one another, we use **evaluation metrics**: statistics that summarize how closely a model's predictions match the actual values, such as those below.

In [6]:
# For a regressor, score() returns R squared on the data passed in
reg.score(X_test, y_test)

0.6880296559691041

In [7]:
from sklearn import metrics

print(f'R squared:\t{metrics.r2_score(y_test, y_pred)}')
print(f'MAE:\t\t{metrics.mean_absolute_error(y_test, y_pred)}')
print(f'RMSE:\t\t{metrics.mean_squared_error(y_test, y_pred)**(1/2)}')

R squared:	0.6880296559691041
MAE:		3236.6024528606968
RMSE:		6649.165301396842
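
To make the three metrics concrete, here is a sketch that computes them by hand with NumPy on a tiny made-up example (`y_true` and `y_hat` are hypothetical values chosen so the arithmetic is easy to follow):

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat                      # [1, 0, -1, -2]

mae = np.mean(np.abs(errors))                # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))         # square root of the average squared error

ss_res = np.sum(errors ** 2)                 # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f'R squared:\t{r2}')
print(f'MAE:\t\t{mae}')
print(f'RMSE:\t\t{rmse}')
```

MAE and RMSE are in the same units as the response (dollars of charges in this chapter), while R squared is unitless.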


Compare the above numbers to a basic **multiple linear regression** model, fit below on the same training and test split.

In [8]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg = reg.fit(X_train,y_train)
y_pred = reg.predict(X_test)

print(f'R squared:\t{metrics.r2_score(y_test, y_pred)}')
print(f'MAE:\t\t{metrics.mean_absolute_error(y_test, y_pred)}')
print(f'RMSE:\t\t{metrics.mean_squared_error(y_test, y_pred)**(1/2)}')

R squared:	0.7405989316927213
MAE:		4139.932064766011
RMSE:		6063.122656850449
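
Notice that the two models do not rank the same way on every metric: the decision tree has the lower MAE (3236.60 versus 4139.93), while the linear regression has the lower RMSE (6063.12 versus 6649.17) and the higher R squared. One possible explanation, offered here as an assumption rather than something verified on this dataset, is that the tree is usually closer but occasionally misses by a lot, because RMSE penalizes large errors more heavily than MAE does. The sketch below uses two made-up sets of errors to show the effect:

```python
import numpy as np

# Two hypothetical sets of prediction errors with the same average size
errors_even = np.array([2.0, 2.0, 2.0, 2.0])   # every prediction is off by 2
errors_spiky = np.array([0.0, 0.0, 0.0, 8.0])  # mostly exact, one large miss

def mae(errors):
    """Mean absolute error of an array of errors."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Root mean squared error of an array of errors."""
    return np.sqrt(np.mean(errors ** 2))

print(f'MAE:\teven={mae(errors_even)}, spiky={mae(errors_spiky)}')
print(f'RMSE:\teven={rmse(errors_even)}, spiky={rmse(errors_spiky)}')
```

Both sets have an MAE of 2, but the set with one large miss has double the RMSE, so which model counts as "better" depends on how costly large errors are.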
