7 interesting facts about decision trees:

1. They do not need numerical input features to be scaled. Splits depend only on the ordering of values within each feature, so rescaling a feature does not change how the data is partitioned.
2. Decision trees can handle categorical features in raw text format (Scikit-Learn doesn't support this and needs the categories encoded first; TensorFlow's trees implementation does).
3. Unlike many other complex learning algorithms, decision trees produce results that can be interpreted: each prediction can be traced through a sequence of threshold rules. It's fair to say that decision trees are not black-box models.
4. While most models suffer from missing values, decision trees can cope with them, although whether missing values are accepted directly depends on the implementation.
5. Trees can handle imbalanced datasets, typically by adjusting the class weights.
6. Trees can provide feature importances, that is, how much each feature contributed to the splits learned during training.
7. Trees are the basic building blocks of ensemble methods such as random forests and gradient boosting machines.
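
Two of the facts above (no need for scaling, and feature importances) can be checked with a quick sketch. It uses a small synthetic dataset, not the CPU data loaded below. The data, the scaling factors, and the `max_depth=3` setting are illustrative assumptions only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration): only feature 0 drives the target
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=200)

# Rescale every column by a different positive factor
X_scaled = X * np.array([1000.0, 0.5, 10.0])

# Fit the same tree on the raw and on the rescaled features
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_scaled, y)

# Positive rescaling keeps the ordering of values, so the predictions should match
same_predictions = np.allclose(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print("Same predictions after rescaling:", same_predictions)

# Impurity-based importances, one value per feature, summing to 1
importances = tree_raw.feature_importances_
print("Feature importances:", importances)
```

The thresholds stored in the two trees differ, but the training samples are partitioned in the same way, which is why a `StandardScaler` step is optional for tree models.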

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Fetch the CPU performance dataset from OpenML as a pandas DataFrame

machine_cpu = fetch_openml(name='machine_cpu', as_frame=True)

In [3]:
type(machine_cpu)

sklearn.utils._bunch.Bunch

In [4]:
machine_cpu.data.shape

(209, 6)

In [5]:
print(machine_cpu.DESCR)

**Author**:   
**Source**: Unknown -   
**Please cite**:   

The problem concerns Relative CPU Performance Data. More information can be obtained in the UCI Machine
 Learning repository (http://www.ics.uci.edu/~mlearn/MLSummary.html).
 The used attributes are :
 MYCT: machine cycle time in nanoseconds (integer)
 MMIN: minimum main memory in kilobytes (integer)
 MMAX: maximum main memory in kilobytes (integer)
 CACH: cache memory in kilobytes (integer)
 CHMIN: minimum channels in units (integer)
 CHMAX: maximum channels in units (integer)
 PRP: published relative performance (integer) (target variable)
 
 Original source: UCI machine learning repository. 
 Source: collection of regression datasets by Luis Torgo (ltorgo@ncc.up.pt) at
 http://www.ncc.up.pt/~ltorgo/Regression/DataSets.html
 Characteristics: 209 cases; 6 continuous variables

Downloaded from openml.org.


In [6]:
# Displaying feature names

machine_cpu.feature_names

['MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX']

In [7]:
# Getting the whole dataframe

machine_cpu.frame

      MYCT    MMIN     MMAX   CACH  CHMIN  CHMAX  class
0    125.0   256.0   6000.0  256.0   16.0  128.0  198.0
1     29.0  8000.0  32000.0   32.0    8.0   32.0  269.0
2     29.0  8000.0  32000.0   32.0    8.0   32.0  220.0
3     29.0  8000.0  32000.0   32.0    8.0   32.0  172.0
4     29.0  8000.0  16000.0   32.0    8.0   16.0  132.0
...    ...     ...      ...    ...    ...    ...    ...
204  124.0  1000.0   8000.0    0.0    1.0    8.0   42.0
205   98.0  1000.0   8000.0   32.0    2.0    8.0   46.0
206  125.0  2000.0   8000.0    0.0    2.0   14.0   52.0
207  480.0   512.0   8000.0   32.0    0.0    0.0   67.0


In [8]:
machine_data = machine_cpu.data
machine_labels = machine_cpu.target

In [9]:
type(machine_data)

pandas.core.frame.DataFrame

In [10]:
type(machine_labels)

pandas.core.series.Series