# Decision Tree Regression (CART)
> A tutorial on how to run decision tree regression in Python with scikit-learn

- toc: true 
- badges: true
- comments: true
- author: Kai Lewis
- categories: [jupyter, CART, regression, decisiontree]

# About

Decision trees are one of the best-known *supervised* learning methods, used for both **classification** and **regression**. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A decision tree can be seen as a piecewise constant approximation.

The CART algorithm provides the foundation for important methods like bagged decision trees, random forests and boosted decision trees. The output of a CART is a decision tree where each fork is a split in a predictor variable and each end node contains a prediction for the response variable. Essentially, it asks a sequence of if-else questions about individual features in order to split the dataset into groups with similar response values. In regression, each end node then predicts a constant, typically the mean response of the training samples that reach it.
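
To make the "piecewise constant" idea concrete, here is a minimal sketch on synthetic one-dimensional data (the data, the variable names and the choice of `max_depth=2` are assumptions for illustration, not part of the auto-mpg analysis). A tree of depth 2 has at most four end nodes, so it can only ever output at most four distinct values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic 1-D data (an assumption for illustration, not the auto-mpg data)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=200)).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

# A shallow tree: at most 2**2 = 4 leaves, so at most 4 distinct predictions
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

# Predict on a dense grid; the output is a step function of x
grid = np.linspace(0, 10, 1000).reshape(-1, 1)
predictions = tree.predict(grid)
n_distinct = len(np.unique(predictions))

# The learned if-else rules, one line per split or leaf
rules = export_text(tree, feature_names=["x"])
print(rules)
print("distinct predicted values:", n_distinct)
```

The printed rules are the series of if-else questions described above, and each leaf line carries the constant value predicted for that region.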

## Advantages of CARTs
- Decision tree models can be used for both classification and regression problems
- Outputs of decision trees are easily understood
- Data pre-processing is easier, as CARTs don't require normalisation of the data (features may each have different scales)
- Not strongly affected by outliers or missing values
- Can handle both numerical and categorical variables (although scikit-learn's implementation expects categorical variables to be encoded numerically)
- Non-parametric method, so it makes no assumptions about the underlying distributions of the data
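
The point about normalisation can be checked with a small experiment. The sketch below uses synthetic data and an arbitrary rescaling factor (both are assumptions for illustration): because a tree only compares each feature against a threshold, putting one feature on a much larger scale should leave the predictions unchanged.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with two features on the same scale (illustrative only)
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(150, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.05, size=150)

# Put the first feature on a much larger scale (factor chosen arbitrarily)
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1024.0

# Fit the same model on the original and on the rescaled features
tree_a = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_b = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_rescaled, y)

# Compare predictions made on correspondingly scaled inputs
same_predictions = np.allclose(tree_a.predict(X), tree_b.predict(X_rescaled))
print("identical predictions:", same_predictions)
```

The split thresholds differ between the two trees, but they partition the samples in the same way, which is why scaling is optional for this family of models.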

## Disadvantages of CARTs
- Overfitting is common in decision trees, as the algorithm keeps adding splits that reduce training set error but can increase test set error. This can be mitigated by constraining model parameters such as the maximum depth.
- Small changes in the data tend to cause large differences in tree structure
- Training can take longer than for some other algorithms
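
The overfitting behaviour and the effect of constraints can be sketched as follows. The data are synthetic, and the helper `fit_and_score` and the chosen constraint values are hypothetical names and settings introduced only for this illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data (illustrative only)
rng = np.random.default_rng(42)
X = np.linspace(0, 6, 300).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def fit_and_score(**params):
    """Fit a tree with the given constraints and report size and errors."""
    model = DecisionTreeRegressor(random_state=0, **params).fit(X_train, y_train)
    return {
        "leaves": model.get_n_leaves(),
        "train_mse": mean_squared_error(y_train, model.predict(X_train)),
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
    }

# No constraints: the tree keeps splitting until it memorises the training set
unconstrained = fit_and_score()
# Constrained: limit the depth and require several samples per leaf
constrained = fit_and_score(max_depth=3, min_samples_leaf=5)

print("unconstrained:", unconstrained)
print("constrained:  ", constrained)
```

The unconstrained tree drives the training error to essentially zero by growing one leaf per training sample, while the constrained tree stays small. Comparing the two `test_mse` values gives a quick indication of how much the extra splits help or hurt on unseen data.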




# Import packages
The packages we need for the downstream analysis

In [1]:
from matplotlib import pyplot as plt
import plotly.express as px
import seaborn as sns

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

import pandas as pd
import numpy as np
import os

# Exploratory Data Analysis

In [28]:
# Import dataset
mpg_df = pd.read_csv("./Datasets/auto-mpg.csv")

mpg_df.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |


In [29]:
mpg_df.describe()

| | mpg | cylinders | displacement | weight | acceleration | model year | origin |
|---|---|---|---|---|---|---|---|
| count | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 | 398.0 |
| mean | 23.514573 | 5.454774 | 193.425879 | 2970.424623 | 15.56809 | 76.01005 | 1.572864 |
| std | 7.815984 | 1.701004 | 104.269838 | 846.841774 | 2.757689 | 3.697627 | 0.802055 |
| min | 9.0 | 3.0 | 68.0 | 1613.0 | 8.0 | 70.0 | 1.0 |
| 25% | 17.5 | 4.0 | 104.25 | 2223.75 | 13.825 | 73.0 | 1.0 |
| 50% | 23.0 | 4.0 | 148.5 | 2803.5 | 15.5 | 76.0 | 1.0 |
| 75% | 29.0 | 8.0 | 262.0 | 3608.0 | 17.175 | 79.0 | 2.0 |
| max | 46.6 | 8.0 | 455.0 | 5140.0 | 24.8 | 82.0 | 3.0 |


In [30]:
mpg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB
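
The `info()` output lists `horsepower` as `object` even though every value is non-null, which usually means some entries are strings that do not parse as numbers. A minimal way to find such entries is sketched below on a toy column; the `"?"` placeholder is an assumption about what the offending values might look like, so the same check should be run on `mpg_df` to see what the file actually contains.

```python
import pandas as pd

# Toy stand-in for the horsepower column; the "?" placeholder is an assumption
toy = pd.DataFrame({"horsepower": ["130", "165", "?", "150"]})

# Try to parse every entry; anything that is not a number becomes NaN
as_numbers = pd.to_numeric(toy["horsepower"], errors="coerce")

# Entries that were present but failed to parse are the culprits
bad_mask = as_numbers.isna() & toy["horsepower"].notna()
bad_values = toy.loc[bad_mask, "horsepower"].unique().tolist()
print(bad_values)
```

Once the non-numeric entries are known, they can be treated as missing values and either dropped or imputed before modelling.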


## Dataset transformations



In [None]:


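# Convert horsepower from object to float; entries that are not numbers become NaN
mpg_df['horsepower'] = pd.to_numeric(mpg_df['horsepower'], errors='coerce')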

In [25]:
# Define a function to scale columns to the same range [0,1]

def scale(a):
    b = (a-a.min())/(a.max()-a.min())
    return b
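
As a sanity check, the `scale` helper above can be compared with scikit-learn's `MinMaxScaler`, which applies the same min-max formula with its default settings. The sketch below redefines `scale` so that it runs on its own and uses the five `weight` values shown in the `head()` output; treating these five rows as a test sample is purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale(a):
    """Min-max scale a Series to the range [0, 1]."""
    return (a - a.min()) / (a.max() - a.min())

# Weights taken from the first five rows of the dataset
weight = pd.Series([3504, 3693, 3436, 3433, 3449], name="weight")

manual = scale(weight)
# MinMaxScaler expects a 2-D input, hence to_frame(); ravel() flattens the result
with_sklearn = MinMaxScaler().fit_transform(weight.to_frame()).ravel()

print(manual.round(3).tolist())
print(np.allclose(manual.to_numpy(), with_sklearn))
```

One practical difference is that a fitted `MinMaxScaler` remembers the training minimum and maximum, so the same transformation can later be applied to new data.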

In [27]:
mpg_scale = mpg_df.copy()

In [None]:
mpg_scale['displacement'] = scale(mpg_scale['displacement'])
mpg_scale['horsepower'] = scale(mpg_scale['horsepower'])
mpg_scale['acceleration'] = scale(mpg_scale['acceleration'])
mpg_scale['weight'] = scale(mpg_scale['weight'])
mpg_scale['mpg'] = scale(mpg_scale['mpg'])

In [None]:

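# Convert the remaining object column to a category
# (horsepower is numeric and is converted to float in the cell above)
mpg_df['car name'] = mpg_df['car name'].astype('category')

# Map the numeric origin codes to country names
mpg_df['Country_code'] = mpg_df['origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})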
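
scikit-learn's `DecisionTreeRegressor` expects numeric input, so a string column such as `Country_code` has to be encoded before it can be used as a feature. One common option is one-hot encoding, sketched below on a toy frame; the rows are made up for illustration and `origin_names` is a hypothetical variable name, while the 1/2/3 codes follow the mapping used above.

```python
import pandas as pd

# Toy frame with made-up rows; origin codes 1/2/3 follow the mapping above
toy = pd.DataFrame({"origin": [1, 3, 2, 1], "weight": [3504, 2130, 2372, 3693]})

# Map the numeric codes to readable labels
origin_names = {1: "USA", 2: "Europe", 3: "Japan"}
toy["Country_code"] = toy["origin"].map(origin_names)

# One indicator column per country, replacing the string column
encoded = pd.get_dummies(toy, columns=["Country_code"], dtype=int)
print(encoded)
```

The readable labels are convenient for plots and summaries, while the indicator columns (or simply the original `origin` codes) are what a tree model can consume.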

**TODO:** add some plots of the dataset.