# Regression with CART Trees - Lab

## Introduction

In this lab, we'll use what we learned in the previous lesson to build a model for the [Petrol Consumption Dataset](https://www.kaggle.com/harinir/petrol-consumption) from Kaggle. The model predicts petrol consumption from features such as petrol tax, average income, paved highways, and the share of the population holding a driver's licence.

## Objectives

In this lab you will: 

- Fit a decision tree regression model with scikit-learn

## Import necessary libraries 

In [1]:
# Import libraries 
import pandas as pd  
import numpy as np  
from sklearn.model_selection import train_test_split 

## The dataset 

- Import the `'petrol_consumption.csv'` dataset 
- Print the first five rows of the data 
- Print the dimensions of the data 

In [2]:
# Import the dataset
dataset = pd.read_csv('https://raw.githubusercontent.com/Patriciangugi/dsc-regression-cart-trees-lab/master/petrol_consumption.csv', header=None)

In [3]:
# Print the first five rows
print(dataset.head())



            0               1               2                             3  \
0  Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)   
1           9            3571            1976                         0.525   
2           9            4092            1250                         0.572   
3           9            3865            1586                          0.58   
4         7.5            4870            2351                         0.529   

                    4  
0  Petrol_Consumption  
1                 541  
2                 524  
3                 561  
4                 414  


In [4]:
# Print the dimensions of the data
print("Dimensions of the dataset:", dataset.shape)

Dimensions of the dataset: (49, 5)


- Print the summary statistics of all columns in the data: 

In [5]:
# Describe the dataset
print("Summary Statistics:\n", dataset.describe())

# Get general information about the dataset
print("\nDataset Info:")
dataset.info()


Summary Statistics:
          0     1     2      3    4
count   49    49    49     49   49
unique  10    48    48     40   44
top      7  5126  7834  0.563  577
freq    19     2     2      2    2

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       49 non-null     object
 1   1       49 non-null     object
 2   2       49 non-null     object
 3   3       49 non-null     object
 4   4       49 non-null     object
dtypes: object(5)
memory usage: 2.0+ KB
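
The summary above reports every column as `object`, and the first printed row holds the column names rather than data. This suggests that `header=None` caused the header line to be read as an ordinary row. The sketch below illustrates the effect on a hypothetical three-line CSV (the `csv_text` string is made up for illustration and is not the real file):

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for the real file
csv_text = "Petrol_tax,Petrol_Consumption\n9.0,541\n7.5,414\n"

# header=None treats the header line as data, so the columns are read as text
raw = pd.read_csv(io.StringIO(csv_text), header=None)
print(raw.shape)
print(raw.dtypes)

# The default behaviour uses the first line as column names and parses numbers
parsed = pd.read_csv(io.StringIO(csv_text))
print(parsed.shape)
print(parsed.dtypes)
```

With the default settings the frame has one row fewer, the columns carry their real names, and `describe()` would report numeric statistics such as the mean and standard deviation.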


## Create training and test sets

- Assign the target column `'Petrol_Consumption'` to `y` 
- Assign the remaining independent variables to `X` 
- Split the data into training and test sets using an 80/20 split 
- Set the random state to 42 
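
As a rough illustration of what an 80/20 split with a fixed random state does, the sketch below uses synthetic data (the arrays `X_demo` and `y_demo` are assumptions for illustration, sized at 48 rows to mirror the 49 rows reported above minus the header row):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 48 rows, 4 features (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(48, 4))
y_demo = rng.normal(size=48)

# test_size=0.2 holds out 20% of the rows; the test count is rounded up
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Reusing the same random_state reproduces exactly the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```

Fixing `random_state` matters here because, with so few rows, different splits can produce noticeably different error metrics.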

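In [11]:
# The file was read with header=None, so the column names ended up in row 0.
# Promote that row to the header and convert the remaining values to numbers.
dataset.columns = dataset.iloc[0].tolist()
dataset = dataset.iloc[1:].reset_index(drop=True).astype(float)

# Separate the target from the features
X = dataset.drop(columns=['Petrol_Consumption'])
y = dataset['Petrol_Consumption']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)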

## Create an instance of CART regressor and fit the data to the model 

As mentioned earlier, for a regression task we'll use a different `sklearn` class than we did for the classification task. The class we'll be using here is the `DecisionTreeRegressor` class, as opposed to the `DecisionTreeClassifier` from before.
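
To make the difference from classification concrete, here is a minimal sketch on toy data (the arrays `X_toy` and `y_toy` are made up for illustration). A regression tree predicts a continuous value, namely the mean target of the training samples that fall into the same leaf:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data with two clearly separated groups (illustration only)
X_toy = np.array([[1], [2], [3], [10], [11], [12]], dtype=float)
y_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

# max_depth=1 allows a single split, giving exactly two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(X_toy, y_toy)

# Each prediction is the mean target of the leaf the sample lands in
preds = stump.predict(np.array([[2.5], [9.0]]))
print(preds)  # leaf means: 2.0 for the low group, 11.0 for the high group
```

Deeper trees create more leaves and therefore more distinct predicted values, which is also why an unrestricted tree can overfit a small dataset.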

In [12]:
# Import the DecisionTreeRegressor class 
from sklearn.tree import DecisionTreeRegressor

# Instantiate and fit a regression tree model to training data 
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)



## Make predictions and calculate the MAE, MSE, and RMSE

Use the above model to generate predictions on the test set. 

Just as with decision trees for classification, there are several commonly used metrics for evaluating the performance of our model. The most common metrics are:

* Mean Absolute Error (MAE)
* Mean Squared Error (MSE)
* Root Mean Squared Error (RMSE)

If these look familiar, that is because they are common evaluation metrics for any regression model, and regression trees are no exception.

Since these are common evaluation metrics, `sklearn` has functions for each of them that we can use to make our job easier. You'll find these functions inside the `metrics` module. In the cell below, calculate each of the three evaluation metrics. 
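
As a quick check of what each metric measures, the sketch below computes all three by hand with NumPy on made-up values (`y_true` and `y_hat` are illustrative, not taken from the dataset) and compares them with the `sklearn.metrics` functions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true values and predictions (illustration only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_hat                   # 1, 0, -2, -1
mae_manual = np.mean(np.abs(errors))      # (1 + 0 + 2 + 1) / 4 = 1.0
mse_manual = np.mean(errors ** 2)         # (1 + 0 + 4 + 1) / 4 = 1.5
rmse_manual = np.sqrt(mse_manual)         # square root of the MSE

# The library functions should agree with the manual calculations
print(mae_manual, mean_absolute_error(y_true, y_hat))
print(mse_manual, mean_squared_error(y_true, y_hat))
print(rmse_manual)
```

MAE and RMSE are expressed in the same units as the target, while MSE is in squared units and penalizes large errors more heavily.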

In [13]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Make predictions on the test set
y_pred = regressor.predict(X_test)

# Evaluate these predictions
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print('Mean Absolute Error:', mae)  
print('Mean Squared Error:', mse)  
print('Root Mean Squared Error:', rmse)

## Level Up (Optional)

- Review the hyperparameters of the regression tree, check their valid ranges in the official scikit-learn documentation, and try a simple optimization by fitting several trees in a loop with different settings 

- Use a dataset that you are familiar with and run tree regression to see if you can interpret the results 

- Check for outliers, try normalization and see the impact on the output 
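
One possible way to approach the first item is sketched below. It loops over `max_depth` on synthetic data (the arrays `X_syn` and `y_syn`, the depth range, and the noise level are all assumptions chosen for illustration) and records the held-out RMSE for each setting:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine curve (illustration only, not the petrol data)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Fit one tree per depth and record the RMSE on the held-out rows
results = {}
for depth in range(1, 11):
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    results[depth] = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

best_depth = min(results, key=results.get)
print(best_depth, results[best_depth])
```

Other hyperparameters such as `min_samples_split` and `min_samples_leaf` can be explored the same way. Choosing a setting on the test set is a simplification; cross-validation on the training data is generally a safer basis for the choice.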

## Summary 

In this lab, you built a workflow to train a regression tree and predict values for unseen data. A vanilla tree with default hyperparameters often leaves room for improvement, so the model usually needs further tuning (hyperparameter optimization and pruning, in the case of trees). 