# Decision Tree Implementation (Python)

***

1. Classification
2. Regression

## 2. Regression

### Tools & Libraries

- **Pandas** (data analysis and manipulation)
- **NumPy** (multidimensional arrays, matrices, and numerical computation)
- **Matplotlib** (visualization)
- **Scikit-Learn** (machine learning)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Dataset

***

**Source:**
*FSU (Petrol Consumption) Data Set*

In [2]:
dataset = pd.read_csv('./Resources/petrol_consumption.csv')

#### Information

For one year, the consumption of petrol was measured in 48 states. The relevant variables are the petrol tax, the per capita income, the number of miles of paved highway, and the proportion of the population with driver's licenses.

In [3]:
def data_info(daten):
    print('Shape :')
    print(daten.shape)
    print('\nHead :')
    print(daten.head())

data_info(dataset)
dataset.describe()

Shape :
(48, 5)

Head :
   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  \
0         9.0            3571            1976                         0.525   
1         9.0            4092            1250                         0.572   
2         9.0            3865            1586                         0.580   
3         7.5            4870            2351                         0.529   
4         8.0            4399             431                         0.544   

   Petrol_Consumption  
0                 541  
1                 524  
2                 561  
3                 414  
4                 410  


| | Petrol_tax | Average_income | Paved_Highways | Population_Driver_licence(%) | Petrol_Consumption |
|---|---|---|---|---|---|
| count | 48.0 | 48.0 | 48.0 | 48.0 | 48.0 |
| mean | 7.668333 | 4241.833333 | 5565.416667 | 0.570333 | 576.770833 |
| std | 0.95077 | 573.623768 | 3491.507166 | 0.05547 | 111.885816 |
| min | 5.0 | 3063.0 | 431.0 | 0.451 | 344.0 |
| 25% | 7.0 | 3739.0 | 3110.25 | 0.52975 | 509.5 |
| 50% | 7.5 | 4298.0 | 4735.5 | 0.5645 | 568.5 |
| 75% | 8.125 | 4578.75 | 7156.0 | 0.59525 | 632.75 |
| max | 10.0 | 5342.0 | 17782.0 | 0.724 | 968.0 |


**Attributes :**
```
#    There are 48 rows of data.  The data include:
#
#      I,  the index;
#      A0, 1;
#      A1, the petrol tax (cents per gallon);
#      A2, the per capita income;
#      A3, the number of miles of paved highway;
#      A4, the proportion of drivers;
#      B,  the consumption of petrol (millions of gallons).
#
#    We seek a model of the form
#
#      B = A0 * X0 + A1 * X1 + A2 * X2 + A3 * X3 + A4 * X4.
```

### Preprocessing

***

<u>**Attribute-Label Split.**</u>

- Attribute set: $X$ with corresponding labels: $y$.

In [4]:
X = dataset.drop('Petrol_Consumption', axis=1)  # All columns except the target
y = dataset['Petrol_Consumption']  # Target column

<u>**Train-Test Split.**</u>

- 20% (test) and 80% (train)


In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print('\nTrain Data Shape :')
print(X_train.shape)

print('\nTest Data Shape :')
print(X_test.shape)


Train Data Shape :
(38, 4)

Test Data Shape :
(10, 4)
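
The 38/10 split above follows from how `train_test_split` sizes the test set: with a fractional `test_size`, the test count is rounded up, so 0.2 × 48 = 9.6 becomes 10 rows. A minimal sketch on placeholder arrays (the `X_demo` and `y_demo` names and values are stand-ins, not the petrol data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the attribute set (48 rows, 4 columns).
X_demo = np.arange(48 * 4).reshape(48, 4)
y_demo = np.arange(48)

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (38, 4) (10, 4)
```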


### Model

***

- Scikit-Learn (`sklearn.tree` module)
- `DecisionTreeRegressor` (class)

#### Training

- `fit` method of the regressor class

In [6]:
from sklearn.tree import DecisionTreeRegressor

regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
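
As background for the training step, a regression tree predicts the mean target value of the training samples that fall into the same leaf. The sketch below illustrates this with a depth-1 tree on made-up data (`X_toy`, `y_toy`, and `stump` are hypothetical names, not part of the notebook):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: two well-separated groups of points.
X_toy = np.array([[1], [2], [3], [10], [11], [12]])
y_toy = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

# A depth-1 tree makes a single split, giving two leaves.
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(X_toy, y_toy)

# Each prediction is the mean target of the leaf the sample lands in:
# 6.0 for the low group and 21.0 for the high group.
pred = stump.predict(np.array([[2.5], [11.5]]))
print(pred)
```

With the default `max_depth=None` used in the notebook, the tree keeps splitting until the leaves are pure or too small to split, so it can fit the training data very closely.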

#### Prediction

- `predict` method of the regressor class

In [7]:
y_pred = regressor.predict(X_test)

#### Testing

In [8]:
dataFrame = pd.DataFrame({'Actual':y_test, 'Predicted': y_pred})
dataFrame

| | Actual | Predicted |
|---|---|---|
| 29 | 534 | 547.0 |
| 4 | 410 | 414.0 |
| 26 | 577 | 574.0 |
| 30 | 571 | 554.0 |
| 32 | 577 | 631.0 |
| 37 | 704 | 640.0 |
| 34 | 487 | 628.0 |
| 40 | 587 | 649.0 |
| 7 | 467 | 414.0 |
| 10 | 580 | 498.0 |


#### Evaluation Metrics

- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)

In [9]:
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 49.3
Mean Squared Error: 4075.3
Root Mean Squared Error: 63.8380764121
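
The three metrics can also be checked by hand. The sketch below recomputes them with NumPy from the Actual and Predicted values listed in the testing table. The variable names are illustrative.

```python
import numpy as np

# Actual and predicted values copied from the testing table.
actual = np.array([534, 410, 577, 571, 577, 704, 487, 587, 467, 580])
predicted = np.array([547, 414, 574, 554, 631, 640, 628, 649, 414, 498])

errors = actual - predicted

mae = np.mean(np.abs(errors))   # average magnitude of the errors
mse = np.mean(errors ** 2)      # average squared error (penalizes large misses)
rmse = np.sqrt(mse)             # back in the units of the target

print('MAE :', mae)    # 49.3
print('MSE :', mse)    # 4075.3
print('RMSE:', rmse)   # approximately 63.838
```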


### Interpretation

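The mean absolute error of the model is 49.3, which is less than 10% (about 8.5%) of the mean of all the values in the 'Petrol_Consumption' column (576.77). Because the regressor is created without a fixed `random_state`, these figures can vary slightly between runs.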
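
The comparison can be made explicit with a quick calculation. This sketch takes the MAE printed in the evaluation output (49.3) and the column mean reported by `dataset.describe()` (576.770833). The variable names are illustrative.

```python
mae = 49.3                      # Mean Absolute Error from the evaluation output
mean_consumption = 576.770833   # mean of 'Petrol_Consumption' from describe()

# MAE expressed as a fraction of the mean target value.
relative_error = mae / mean_consumption

print(f"MAE is {relative_error:.1%} of the mean consumption")  # about 8.5%
```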

***