# Analysis Using GBRT on Central Car Auction Data

**Data Source**
Webpage:  http://www.centralcarauctions.com/trade/vehicles/price-guide/price-guide?page=1

Date Accessed: July 2014 

Analysis Method: Gradient Boosted Regression Trees (GBRT)

Steps in this notebook: 

1. Read in the preprocessed, encoded data.
2. Perform a basic GBRT run and view Feature Importance.


## Imports and Setup

In [2]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import ensemble, cross_validation as cv
from sklearn.metrics import mean_squared_error, mean_absolute_error

%matplotlib inline

In [3]:
# Coloring / Styling
sns.set_style('darkgrid')
sns.set_context('talk')

## Read in  Data

In [35]:
# Read in Data file 
df_enc = pd.read_csv('cca_data_enc.csv')

In [31]:
# Select features
features = ['model', 'year_ord', 'mileage', 'MOT_ord', 'make', 'class']
target = ['price']

# Extract Values
X = df_enc[features].values.astype(float)
y = df_enc[target].values.ravel().astype(float)

In [40]:
df_enc[df_enc[features].isnull().any(axis=1)]

| Unnamed: 0 | make | model | trim | class | year | mileage | MOT | price | MOT_ord | year_ord |
|---|---|---|---|---|---|---|---|---|---|---|
| 637 | 11 | 315 | D - 1867CC | 27 | 2004-03-01 00:00:00 | NaN | 2015-09-01 00:00:00 | 825 | 8 | 73 |
| 2622 | 33 | 313 | DYN DCI 106 - 1461CC | 18 | 2007-07-01 00:00:00 | NaN | 2015-03-01 00:00:00 | 800 | 2 | 113 |
| 3518 | 43 | 146 | SXI 16V - 1199CC | 6 | 2004-03-01 00:00:00 | NaN | 2015-08-01 00:00:00 | 160 | 7 | 73 |
| 3700 | 43 | 388 | 16V CLUB - 1598CC | 18 | 2000-07-01 00:00:00 | NaN | 2015-08-01 00:00:00 | 270 | 7 | 29 |
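The four rows above have no `mileage` value, and the `GradientBoostingRegressor` fits later in this notebook fail with a NaN error. One possible remedy, sketched here on a small made-up frame (`df_demo` is a hypothetical stand-in for `df_enc`, not the real data), is to drop rows with missing feature values before extracting `X` and `y`. Imputing the missing mileage would be an alternative; dropping is shown only because it is the simplest option.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_enc; the values are illustrative only.
df_demo = pd.DataFrame({
    'model':    [315, 313, 146, 388],
    'year_ord': [73, 113, 73, 29],
    'mileage':  [np.nan, 81112.0, 59787.0, np.nan],
    'MOT_ord':  [8, 2, 7, 7],
    'make':     [11, 33, 43, 43],
    'class':    [27, 18, 6, 18],
    'price':    [825, 800, 160, 270],
})

features = ['model', 'year_ord', 'mileage', 'MOT_ord', 'make', 'class']
target = 'price'

# Keep only the rows where every selected feature is present.
df_clean = df_demo.dropna(subset=features)

# Extract the arrays from the cleaned frame rather than the original one.
X = df_clean[features].values.astype(float)
y = df_clean[target].values.astype(float)

print("Rows before: {}, rows after: {}".format(len(df_demo), len(df_clean)))
```

Applied to the real data, the same `dropna(subset=features)` call would go between reading the CSV and the feature extraction cell.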


## Investigate the Loss Function

This section primarily looks at the effect of the different loss functions:

* Least absolute deviation (`lad`)
* Least squares (`ls`)
* A combination of the two above (`huber`)
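As a rough illustration of how the three losses listed above treat a residual, the sketch below writes them out per sample in NumPy. The function names and the fixed threshold `delta=1.0` are assumptions made for this example; scikit-learn's `huber` loss instead derives its threshold from a quantile of the absolute residuals, controlled by the `alpha` parameter, and may scale the losses differently.

```python
import numpy as np

def absolute_loss(residual):
    # Least absolute deviation: grows linearly, so outliers have limited pull.
    return np.abs(residual)

def squared_loss(residual):
    # Least squares (written with a factor of 0.5): grows quadratically.
    return 0.5 * residual ** 2

def huber_loss(residual, delta=1.0):
    # Quadratic for small residuals, linear beyond the threshold delta.
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print("absolute:", absolute_loss(residuals))
print("squared: ", squared_loss(residuals))
print("huber:   ", huber_loss(residuals))
```

For the large residuals (±3) the Huber value sits between the absolute and squared values, which is why it is often described as a compromise between the two.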

Each loss function is investigated by running 5-fold cross-validation and visualising the residuals on each test fold.

Performance is measured using the mean squared error (MSE) and mean absolute error (MAE) of the predictions on the test fold.

The huber loss function performed best on this data, with a combined mean absolute error of **715.04** and a standard deviation of **52.94**.
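The notebook relies on the older `sklearn.cross_validation` module, which later scikit-learn releases replaced with `sklearn.model_selection`. The sketch below shows the same 5-fold procedure with the newer API, on synthetic data (`X_demo` and `y_demo` are made up for this example) and with fewer trees so that it runs quickly. It is an illustration of the procedure under those assumptions, not a reproduction of the results quoted above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data, used only to make the sketch self-contained.
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = 1000 * X_demo[:, 0] - 500 * X_demo[:, 1] + rng.normal(scale=50, size=200)

# In the newer API, KFold takes n_splits and the data is passed to split().
kf = KFold(n_splits=5, shuffle=True, random_state=111)

fold_mae = []
for train_index, test_index in kf.split(X_demo):
    est = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                    min_samples_split=5, learning_rate=0.1,
                                    loss='huber')
    est.fit(X_demo[train_index], y_demo[train_index])

    # Predict once per fold and reuse the predictions for the metric.
    y_pred = est.predict(X_demo[test_index])
    fold_mae.append(mean_absolute_error(y_demo[test_index], y_pred))

fold_mae = np.array(fold_mae)
print("Mean MAE over folds: %0.2f   Std: %0.2f" % (fold_mae.mean(), fold_mae.std()))
```

Only `huber` is used here because the `ls` and `lad` names were renamed in later scikit-learn releases (to `squared_error` and `absolute_error`), so the appropriate spelling depends on the installed version.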

In [32]:
### Model Batch Parameters
params_list = [{'n_estimators': 1000, 'max_depth': 3, 'min_samples_split': 5,
          'learning_rate': 0.1, 'loss': 'lad'}, 
               {'n_estimators': 1000, 'max_depth': 3, 'min_samples_split': 5,
          'learning_rate': 0.1, 'loss': 'ls'}, 
               {'n_estimators': 1000, 'max_depth': 3, 'min_samples_split': 5,
          'learning_rate': 0.1, 'loss': 'huber'}]

### Cross Validation 

In [33]:
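# Number of cross-validation folds
folds = 5

for param_no, params in enumerate(params_list):

    # Initialise the score lists inside the loop so that the averages
    # printed below cover the current loss function only
    MSE_list = []
    MAE_list = []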
    
    print('\nFor Parameter set {}: Loss function = {}\n'.format(param_no + 1, params['loss']))
    
    plt.figure(figsize=(12,6))
    
    # K fold Cross Validation 
    kf = cv.KFold(X.shape[0], n_folds=folds, shuffle=True, random_state=111)
    fold_count = 0    
    for train_index, test_index in kf:    
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
       
        # Initialise Regressor and fit training data 
        est = ensemble.GradientBoostingRegressor(**params)
        est.fit(X_train, y_train)

        # Log Errors Over all consecutive iterations  
        train_error = est.train_score_
        test_error = np.zeros((params['n_estimators'],), dtype=np.float64)
        for i, y_pred in enumerate(est.staged_predict(X_test)):
            test_error[i] = est.loss_(y_test, y_pred)
        
        # Calculate mean of absolute error 
        mae = np.abs(y_test - est.predict(X_test)).mean()
        # Calculate Std of absolute error 
        std_ae = np.abs(y_test - est.predict(X_test)).std()
        
        print("Fold %i: Mean Absolute Error: %0.2f   Std: %0.2f" % (fold_count, mae, std_ae))
        plt.plot(y_test - est.predict(X_test))
        plt.title('Loss Function = {}'.format(params['loss']))
        plt.xlabel('Test Instance')
        plt.ylabel('Residual')
        # Calculating mean results
        mse = mean_squared_error(y_test, est.predict(X_test))
        MSE_list.append(mse)
        MAE_list.append(mae)
        
        fold_count += 1
    
    CV_MSE = np.array(MSE_list)
    CV_MAE = np.array(MAE_list)
    print("\nAveraged Mean Absolute Error Scores: %0.2f    Std:  %0.2f" % (CV_MAE.mean(), CV_MAE.std()))


For Parameter set 1: Loss function = lad



ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

<matplotlib.figure.Figure at 0x108c30898>

In [34]:
%debug

> /Users/chris/miniconda3/lib/python3.4/site-packages/sklearn/utils/validation.py(43)_assert_all_finite()
     42         raise ValueError("Input contains NaN, infinity"
---> 43                          " or a value too large for %r." % X.dtype)
     44 
ipdb> X
array([[  4.30000000e+01,   1.33000000e+02,   3.86460000e+04,
          2.00000000e+00,   0.00000000e+00,   6.00000000e+00],
       [  1.10000000e+01,   1.06000000e+02,   8.11120000e+04,
          9.00000000e+00,   1.00000000e+00,   6.00000000e+00],
       [  1.10000000e+01,   8.60000000e+01,   5.97870000e+04,
          4.00000000e+00,   1.00000000e+00,   1.70000000e+01],
       ..., 
       [  3.77000000e+02,   1.09000000e+02,   7.40540000e+04,
          4.00000000e+00,   4.50000000e+01,   1.60000000e+01],
       [  3.77000000e+02,   1.09000000e+02,   7.79690000e+04,
          1.30000000e+01,   4.50000000e+01,   1.60000000e+01],


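## Investigating Deviance and Importance Plots

Deviance plots track the loss on both the training set and the test set over successive boosting iterations.

In addition, the importance of each feature is extracted from the fitted model and plotted relative to the most important feature. In scikit-learn this importance reflects how much the splits on a feature reduce the training criterion across all trees, not simply how often the feature is split on.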
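A minimal sketch of both quantities follows, again on synthetic data (`X_demo`, `y_demo` and the reduced tree count are assumptions for this example). It computes the per-iteration curves with `staged_predict` and an explicit metric, which avoids depending on the estimator's `loss_` attribute, since that attribute is not available in all scikit-learn versions. The default loss is used so that the example does not depend on version-specific loss names.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: the third feature is pure noise and carries no signal.
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(300, 3))
y_demo = 1000 * X_demo[:, 0] - 500 * X_demo[:, 1] + rng.normal(scale=50, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

n_estimators = 100
est = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=3,
                                learning_rate=0.1, random_state=0)
est.fit(X_tr, y_tr)

# One error value per boosting iteration, for each data split.
train_curve = np.array([mean_squared_error(y_tr, p)
                        for p in est.staged_predict(X_tr)])
test_curve = np.array([mean_squared_error(y_te, p)
                       for p in est.staged_predict(X_te)])

# Importances scaled so that the most important feature scores 100.
importance = est.feature_importances_
relative = 100.0 * (importance / importance.max())
order = np.argsort(relative)[::-1]

print("Final train MSE: %.1f, final test MSE: %.1f" % (train_curve[-1], test_curve[-1]))
print("Features ranked by relative importance:", order)
```

A test curve that flattens or rises while the training curve keeps falling is the usual sign that further boosting iterations are overfitting.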

In [8]:
# Only look at one Partition of the  Data 
X_train, X_test, y_train, y_test = cv.train_test_split(X, y, random_state=0)

### Model Batch Parameters
params_list = [{'n_estimators': 1000, 'max_depth': 3, 'min_samples_split': 5,
          'learning_rate': 0.1, 'loss': 'lad'}, 
               {'n_estimators': 1000, 'max_depth': 3, 'min_samples_split': 5,
          'learning_rate': 0.1, 'loss': 'ls'}, 
               {'n_estimators': 1000, 'max_depth': 3, 'min_samples_split': 5,
          'learning_rate': 0.1, 'loss': 'huber'}]

for params in params_list:
    
    # Initialise Regressor and fit training data 
    est = ensemble.GradientBoostingRegressor(**params)
    est.fit(X_train, y_train)

    # Log Errors Over all consecutive iterations  
    train_error = est.train_score_
    test_error = np.zeros((params['n_estimators'],), dtype=np.float64)
    for i, y_pred in enumerate(est.staged_predict(X_test)):
        test_error[i] = est.loss_(y_test, y_pred)
    
    mse = mean_squared_error(y_test, est.predict(X_test))
    mae = mean_absolute_error(y_test, est.predict(X_test))
    print("MAE: %.4f,   MSE: %.4f" % (mae, mse))
    print("Loss Function: {}".format(params['loss']))
    MSE_list.append(mse)
    MAE_list.append(mae)    

    # Plotting #
    # -------- #
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    plt.title('Deviance')
    plt.plot(np.arange(params['n_estimators']) + 1, est.train_score_, 'b-',
             label='Training Set Deviance')
    plt.plot(np.arange(params['n_estimators']) + 1, test_error, 'r-',
             label='Test Set Deviance')
    plt.legend(loc='upper right')
    plt.xlabel('Boosting Iterations')
    plt.ylabel('Deviance')

    # Plot feature importance
    feature_importance = est.feature_importances_
    # make importances relative to max importance
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    sorted_idx = np.argsort(feature_importance)
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.subplot(1, 2, 2)
    plt.barh(pos, feature_importance[sorted_idx], align='center')
    plt.yticks(pos, np.array(features)[sorted_idx])
    plt.xlabel('Relative Importance')
    plt.title('Variable Importance')
    plt.show()

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').