# Introduction: Model Evaluation

According to the [No Free Lunch Theorem](https://en.wikipedia.org/wiki/No_free_lunch_theorem) ([source 1](https://www.mitpressjournals.org/doi/abs/10.1162/neco.1996.8.7.1341), [source 2](https://ti.arc.nasa.gov/m/profile/dhw/papers/78.pdf)), no algorithm will beat every other algorithm across all learning tasks. What this means for machine learning is there is no single superior algorithm for every dataset. Therefore, to find the best model for a given task, we have to evaluate multiple models and determine which one achieves the highest generalization performance on the test set. Machine learning is a largely empirical field, and finding the best model remains almost entirely a process of experimentation.

With that in mind, in this notebook we will implement and evaluate several machine learning models for the building energy prediction task. This is a supervised regression problem where we are asked to build a model to predict energy consumption based on historical energy usage and weather conditions. The models will be tested on the final 6 months of data with the entire preceding data used for training. [Scikit-Learn](http://scikit-learn.org/stable/index.html) will be used to implement the models in this notebook which allows us to use the same syntax for building a diverse array of algorithms. We will show an example of building a machine learning model and then write a function that can train and evaluate any model. Here we will evaluate six models for two datasets, but the function will be applied to hundreds of buildings to determine which model performs the best on average for this task. The model with the best test set performance can then be further developed through the process of hyperparameter tuning. 

## Metric: Mean Absolute Percentage Error

In order to compare models, we need a single metric to assess performance. There are many choices for regression, including the [root mean squared error](https://en.wikipedia.org/wiki/Root-mean-square_deviation) (RMSE). RMSE is conveniently measured in the same units as the target, but suffers from the fact that it depends on the magnitude of the targets and therefore cannot be compared between buildings that have different levels of energy use. To compare performance across buildings, we need a metric that normalizes the error by the magnitude of the target. The mean absolute percentage error (MAPE) is one such metric. This performance measure has been used in previous electricity prediction studies ([here](https://www.sciencedirect.com/science/article/pii/S0306261914003596) and [here](https://www.sciencedirect.com/science/article/pii/S0378778812001582)), and can be used to compare predictions between buildings because it measures the error normalized by the magnitude of the energy usage. The MAPE is calculated as

$$\mbox{MAPE} = 100\% * \frac{1}{n}\sum_{i=1}^n  \left|\frac{y_i-\hat{y}_i}{y_i}\right|$$

where $n$ is the number of observations, $y_i$ is the actual prediction for observation $i$, and $\hat{y}_i$ is the predicted value. The MAPE represents the average percentage error for predictions. In addition to the MAPE, we can calculate the time it takes for the model to learn. Although time is not a significant consideration when massive computing power is available (through the [CWRU HPC](https://sites.google.com/a/case.edu/hpcc/), it is still an interesting comparison to make. 

## Models to Evaluate

We will evaluate the following models in this notebook. 
 
1. Elastic Net Linear Regression with `l1_ratio = 0.5`
2. K-Nearest Neighbors Regression with `k = 10`
3. Support Vector Machine Regression with Radial Basis Function Kernel
4. Random Forest Regression with 100 decision trees 
5. Extra Trees Regression with 100 decision trees
6. AdaBoost Regression with 1000 decision trees as the base learner

Two other models with be evaluated in additional notebooks:

1. Gradient Boosting Machine
2. Deep Fully Connected Neural Networks

The six models developed here can all be implemented in Scikit-Learn. We will focus on using the models rather than the theory behind them, but additional resources will be linked to for those interested in learning more.

### Imports

We will use a standard stack of data science libraries: `pandas`, `numpy`, `sklearn`, `matplotlib`. See the `requirements.txt` file for the correct version of these libraries to install. 

In [1]:
# numpy and pandas for data manipulation
import pandas as pd
import numpy as np

# Sklearn preprocessing functionality
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Matplotlib for visualizations
import matplotlib.pyplot as plt

# Adjust default font size 
plt.rcParams['font.size'] = 18

### Read in Dataset and Apply Feature Preprocessing

Let's read in an example dataset and preprocess the features using the function we developed earlier. This function takes in a building energy dataframe, a number of days to use for testing, and a boolean for whether or not to scale the features. We will use 183 days (6 months) for testing and will choose to scale the features because we are using models that depend on the distance between observations to make predictions. [Feature scaling](https://en.wikipedia.org/wiki/Feature_scaling) is a best practice when comparing multiple algorithms so that the range of the features does not impact model performance. 

In [2]:
# Import the feature preprocessing
from utilities import preprocess_data

df = pd.read_csv('../data/f-APS_weather.csv')

# Preprocess the data for machine learning
train, train_targets, test, test_targets = preprocess_data(df, test_days = 183, scale = True)

train.head()

Unnamed: 0,timestamp,biz_day,week_day_end,ghi,dif,gti,temp,rh,pwat,ws,...,yday_cos,month_sin,month_cos,wday_sin,wday_cos,num_time_sin,num_time_cos,sun_rise_set_neither,sun_rise_set_rise,sun_rise_set_set
0,0.0,1.0,0.0,0.0,0.0,0.0,0.394397,0.245303,0.069231,0.295082,...,0.629749,0.066987,0.75,1.0,0.25,0.5,1.0,1.0,0.0,0.0
1,1.1e-05,1.0,0.0,0.0,0.0,0.0,0.390086,0.250522,0.067949,0.295082,...,0.629749,0.066987,0.75,1.0,0.25,0.53305,0.998907,1.0,0.0,0.0
2,2.2e-05,1.0,0.0,0.0,0.0,0.0,0.387931,0.256785,0.067949,0.295082,...,0.629749,0.066987,0.75,1.0,0.25,0.565955,0.995631,1.0,0.0,0.0
3,3.4e-05,1.0,0.0,0.0,0.0,0.0,0.383621,0.262004,0.067949,0.303279,...,0.629749,0.066987,0.75,1.0,0.25,0.598572,0.990187,1.0,0.0,0.0
4,4.5e-05,1.0,0.0,0.0,0.0,0.0,0.37931,0.268267,0.067949,0.303279,...,0.629749,0.066987,0.75,1.0,0.25,0.630758,0.9826,1.0,0.0,0.0


The data is ready for machine learning. All of the values are numeric, the features have been scaled to between 0 and 1, and there are no missing values. 

## Machine Learning in Scikit-Learn

For each model, we will calculate three stats: the mean absolute percentage error, the time to train the model, and the time to make predictions. MAPE is calculated from the definition of the metric using the predicted values and the known test targets. The training and testing times are determined using the `default_timer` class from the [`timeit` module](https://docs.python.org/2/library/timeit.html). The `default_timer` [automatically adjusts for operating system](https://stackoverflow.com/questions/7370801/measure-time-elapsed-in-python) and provides the best timer accordingly. We will measure the wall-clock time, which is the total time to execute a program. This [differs from the CPU time](https://www.pythoncentral.io/measure-time-in-python-time-time-vs-time-clock/) (also called execution time) which measures the time a CPU spent executing a specific program. Since we are interested in the time out of curiosity rather than for model selection, the minor distinctions in time calculation are not a significant concern. 

All of the models in this notebook will be built using Scikit-Learn. This open-source Python library benefits from a consistent syntax for implementing models which makes it extremely simple to quickly implement many models. There are three steps to using a machine learning model in Scikit-Learn:

1. Instantiate the model, specifying the hyperparameters
2. `fit`: train the model on the training data
3. `predict`: make predictions on the testing data

These same three steps apply to every model in Scikit-Learn. 

## Model Hyperparameters

Model hyperparameters can best be thought of as settings for a machine learnig model. [In contrast to model parameters](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/), which are learned by the model during training, model hyperparameters are set by the developer before training. As an example, the number of trees in a random forest is a model hyperparameter while the weights learned during linear regression are model parameters. The choice of hyperparameters can have a significant effect both on the model's training and on the test performance. Much like choosing the best model is an experimental process, optimizing the hyperparameters of a selected model requires evaluating many different combinations to find the one that performs the best. This can be an extremely time-intensive and computationally expensive process. 

For this notebook, we will stick with the Scikit-Learn hyperparameters for the most part except where noted. [Scikit-Learn aims](https://arxiv.org/abs/1309.0238) to provide a set of reasonable default hyperparameters designed to get practicioners up and running with a decent model quickly. However, these hyperparameters are likely to be nowhere near optimal, and tuning them is recommended after a working system is built. If we had unlimited time, we would try out not only multiple models, but also work on optimizing the hyperparameters of each model for the problem. For now, we will stick with the default hyperparameters for comparing models and then optimize the hyperparameters of the model that performs the best. We will note that this might not be a fair comparison because of the model dependence on hyperparameters, but it is a limitation we have to work with. (Note that for models where it is applicable, we set `n_jobs=-1` to use all available cores on the machine.)

### Approach 

In this notebook, we focus on implementing the algorithms rather than explaining how they work. Two extremely good resources for learning the theory of these models are [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/), and [Hands-On Machine Learning in Python with Scikit-Learn and TensorFlow](http://shop.oreilly.com/product/0636920052289.do). Both of these books provide the theory as well as how to implement the methods in R and Python respectively. Machine learning is only mastered through doing, and both these books place a large emphasis on practicing the techniques described. With those considerations in mind, let's get to modeling! 

For the first model, we will walk through the modeling steps individually, and then we will refactor the steps into a function that can be re-used for any model. 

## ElasticNet Linear Regression

[Regularization and Variable Selection via the Elastic Net](https://web.stanford.edu/~hastie/Papers/elasticnet.pdf)

ElasticNet Linear Regression is a regularized version of Linear Regression. It is a mix of two regularization methods: Lasso Regression and Ridge Regression. The blend of these two methods is controlled by the `l1_ratio` hyperparameter with the overall amount of regularization controlled by the `alpha` hyperparameter. We will use the default values of these hyperparameters which are `l1_ratio = 0.5` and `alpha = 1.0`. Unlike ordinary least squares (OLS) linear regression, which has an analytical solution, ElasticNet must be solved by coordinate descent. The equation for the parameter matrix $\beta$ in ElasticNet is:

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}} (\| y-X \beta \|^2 + \alpha \lambda_1 \|\beta\|_1 + \alpha (1 - \lambda_1) \|\beta\|_2^2)$$

where $\lambda_1$ multiplies the L1 norm of the parameter matrix and $1 - \lambda_1$ multiplies the L2 norm of the parameter matrix. $\alpha$ controls the overall amount of regularization, and if $\alpha = 0$, then ElasticNet simplifies to ordinary least squares regression with an objective function of Mean Squared Error (MSE). 

In [3]:
from sklearn.linear_model import ElasticNet

# Timing utilities
from timeit import default_timer as timer

In [4]:
# Set up the model with default hyperparameters
model = ElasticNet(alpha = 1.0, l1_ratio=0.5)
model

ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

### Training 

Training in Scikit-Learn is done using the `fit` method (function) of a model. This takes in the training features and the targets. 

In [5]:
# Start the timer
train_start = timer()

# Train the model
model.fit(train, train_targets)

# Stop the timer
train_end = timer()

train_time = train_end - train_start

### Testing 

Making predictions in Scikit-Learn using the `predict` method of a model. It takes in only the testing features.

In [6]:
# Start the time
test_start = timer()

# Make predictions on testing data
predictions = model.predict(test)

# Stop the time
test_end = timer()

test_time = test_end - test_start

# Calculate the mape
mape = 100 * np.mean( abs(predictions - test_targets) / test_targets)

In [7]:
print("Training time (seconds): %0.4f" % train_time)
print("Prediction time (seconds): %0.4f" % test_time)
print("Testing MAPE: %0.2f" % mape)

Training time (seconds): 0.0936
Prediction time (seconds): 0.0048
Testing MAPE: 56.97


The three results can be recorded in a numpy array along with the name of the model. When we get results from multiple models, we can stack the results and then record them in a dataframe. 

In [8]:
# Record the results in a numpy array
results = np.array(['ElasticNet', train_time, test_time, mape])

results

array(['ElasticNet', '0.09360281483673485', '0.004777918515813356',
       '56.973741837828975'], dtype='<U20')

# Modeling Function

Let's take the individual steps and put them into a single function. The function will operate the exact same for each model allowing for an efficient standardized workflow. This function takes in a model, training features, training labels, testing features, and testing labels, and returns the 3 numeric results along with the name of the model in a numpy array. The function will be able to train and evaluate any Scikit-Learn supervised regression model. 

In [9]:
def implement_model(model, train, training_targets, test, testing_targets, model_name):
    """Train a machine learning model and make predictions on a test set
    
    Parameters
    --------
    model : Scikit-Learn model object
        Model to use for training and making predictions
    
    train : dataframe, shape = [n_training_samples, n_features]
        Set of training features for training a model
    
    train_targets : array, shape = [n_training_samples]
        Array of training targets for training a model
        
    test : dataframe, shape = [n_testing_samples, n_features]
        Set of testing features for making predictions with a model
    
    test_targets : array, shape = [n_testing_samples]
        Array of testing targets for evaluating the model predictions
        
    model_name : string
        Name of the model used for returning results
        
    Returns
    --------
    
    results : array, shape = [4]
        Numpy array of results. 
        First entry is the model, second is the training time,
        third is the testing time, and fourth is the MAPE. All entries
        are in strings and so will need to be converted to numbers.
    
    """
    
    # Preprocess the data for machine learning
    train, train_targets, test, test_targets = preprocess_data(df, test_days = 183, scale = True)
    
    train_start = timer()
    
    # Start the timer
    train_start = timer()

    # Train the model
    model.fit(train, train_targets)

    # Calculate training time
    train_end = timer()
    train_time = train_end - train_start

    # Start test timer
    test_start = timer()

    # Make predictions
    predictions = model.predict(test)

    # Calculate testing time
    test_end = timer()
    test_time = test_end - test_start

    # Calculate the mape
    mape = 100 * np.mean( abs(predictions - test_targets) / test_targets)

    # Record the results
    results = [model_name, train_time, test_time, mape]
    
    return results


In [10]:
elasticnet_results = implement_model(model, train, train_targets, test, 
                                     test_targets, model_name = 'elasticnet')
elasticnet_results

['elasticnet', 0.08054083317867355, 0.0008706562336163737, 56.973741837828975]

Now we can use this function for any Scikit-Learn supervised regression machine learning model. We will go through the rest of the models and use the function to evaluate each one.

## K-Nearest Neighbors Regression

[K-Nearest Neighbors](https://link.springer.com/chapter/10.1007/978-3-642-38652-7_2)

K-Nearest Neighbors is a non-parameteric method that makes predictions for a new observation based on the K nearest observations as determined by a distance measure. For this implementation, the distance measure is the L2 norm or [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) (the Minkowski measure with p = 2). There is no actual learning done during the training phase of K-Nearest Neighbors because the work of calculating the nearest neighbors and making an estimation is done at testing time (this is an example of a ["lazy learner"](https://en.wikipedia.org/wiki/Lazy_learning)). K-Nearest Neighbors is extremely sensitive to the `n_neighbors` hyperparameter which we have set at 10 (up from the default 5). 

In [11]:
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors = 10)
model

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=10, p=2,
          weights='uniform')

In [12]:
# Call the model function
knn_results = implement_model(model, train, train_targets, test, 
                              test_targets, model_name = 'KNN')
knn_results

['KNN', 23.473687023886598, 3.4355423598204986, 23.656239736972875]

## Support Vector Machine

[Support Vector Machine](https://ieeexplore.ieee.org/abstract/document/708428/)

The support vector machine works by transforming the data to a new high-dimensional feature space using a kernel. This procedure, called the kernel trick, is designed to make the original data, which may not be linearly separable, linearly separable in the high-dimensional feature space. The support vector machine is highly sensitive to a number of hyperparameters: the `kernel`, `gamma`, `C` (the error term), and `epsilon`. For this implementation, we will use the default hyperparameters in Scikit-Learn which are:

* `kernel=rbf`: Gaussian Radial Basis Function kernel
* `gamma=auto`: Gamma is equal to $\frac{1}{\text{n_features}}$
* `C=1.0`: penalty parameter of the error term
* `epsilon=0.1`: the epsilon tube within which no penalty is associated in the training loss

In [13]:
from sklearn.svm import SVR

model = SVR(C=)
model

In [15]:
svm_results = implement_model(model, train, train_targets, test, 
                              test_targets, model_name = 'svm')
svm_results

['svm', 531.2380162893111, 59.102470593970565, 24.826293891805097]

## Random Forest

[Random Forest](https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf)

The random forest is an ensemble method which makes one powerful estimator out of a number of simpler models, in this case individual decision tree regressors. The random forest is an example of a [bagging (bootstrap aggregating) ensemble method](https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/) in which all of the individual learners are trained independently and then predictions are made by averaging the predictions of the individuals. This allows for efficient training because the indidual decision tree regressors can be trained in parallel.

The random forest is fast to train, can be interpreted via the model feature importances, is very accurate, and has only one main hyperparameter to adjust, the number of trees in the forst. Other hyperparameters can be used to control the amount of overfitting by modifying the individuals in the forest. A random forest has less variance than a single decision tree which is very prone to overfitting. The "random" in the name comes from the fact that the model only trains on a subsample of the training examples (chosen with replacement called "bootstrapping") and only on a subset of the features. Both of these behaviors can be controlled through the model hyperparameters.

For this case, we will increase the number of trees to 100 (`n_estimators`) and leave the rest of the model hyperparameters at the default values.

In [16]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators = 100, n_jobs = -1)
model

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [17]:
rf_results = implement_model(model, train, train_targets, test, 
                             test_targets, model_name='rf')
rf_results

['rf', 18.59618313886108, 0.20558733802420193, 15.739369033234297]

## ExtraTrees Regressor

[Extremely Randomized Trees](https://link.springer.com/article/10.1007/s10994-006-6226-1)

The extra trees regressor is another bagging ensemble method. As with the random forest, the individual predictors are decision tree regressors and the ensemble makes a prediction by averaging the predictions of all the individuals. The difference between the extra trees model and a random forest is that the node splits are made on a random value of the feature in the extra trees ensemble. In other words, the values for splits of nodes are chosen randomly rather than by trying out all possible values of a feature as in a random forest. The "extra" means "extra random" in reference to this behavior. An additional difference is that the training observations are not sampled for extra trees (although this behavior can be changed through the [`bootstrap` hyperparameter in the function call](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html)). 

The extra trees model can be slightly faster than the random forest for training because the splits are chosen randomly instead of evaluating all splits for the optimal split. The idea behind extra trees is that the model will have lower variance than the random forest which can improve generalization performance on the test set. We will use 100 trees (`n_estimators`) and keep the other hyperparameters at the defaults. 

In [18]:
from sklearn.ensemble import ExtraTreesRegressor

model = ExtraTreesRegressor(n_estimators=100, n_jobs = -1)
model

ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
          oob_score=False, random_state=None, verbose=0, warm_start=False)

In [19]:
et_results = implement_model(model, train, train_targets, test, 
                             test_targets, model_name = 'et')
et_results

['et', 9.403137355856302, 0.09805060240341845, 18.385035628942195]

## AdaBoost

[Explaining AdaBoost](https://link.springer.com/chapter/10.1007/978-3-642-41136-6_5)

The AdaBoost - standing for Adaptive Boosting - algorithm is another ensemble method. However, unlike the random forest and extra trees, it is a [boosting ensemble](https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/) and not a bagging ensemble. The difference is that the individual learners are not trained independently, but rather sequentially, with each learner learning from the mistakes of the previous. In AdaBoost, this takes the form of re-weighting the observations the previous learner got most wrong when training the next. AdaBoost trains on a random subset of the training examples, and the training examples with greater weight are more likely to be included in the training set. This makes the examples that are more difficult - missed more often - more likely to appear in the training set. Moreover, while bagging methods simply average the predictions of each individual, boosting methods weight the predictions of each individual based on error rates. Better learners are given exponentially more weight in the final prediction. 

Boosting methods can be built on any individual learner, but the most common choice is the decision tree. These decision trees are kept very small (sometimes decision trees with only one level are used which are called decision stumps) and on their own are weak learners. Weak learners perform better than random guessing, but are not very accurate. However, by adding weak learners to the ensemble sequentially, the overall model is a strong classifier. Boosting is a general method that can be applied to any weak learner and AdaBoost was one of the first successful implementations. In later work, we will look at another boosting method called Gradient Boosting which is now considered more capable than AdaBoost. We will set the following hyperparameters in the AdaBoost ensemble:

* `n_estimators=1000`: number of decision trees. This is set high because the individual learners are weak
* `learning_rate=0.05`: the contribution of each learner to the ensemble. There is a tradeoff between `n_estimators` and the `learning_rate` with a lower learning rate complementing a higher number of weak learners.

The other hyperparameters will be set at the default values. The default base learner for AdaBoost is the `DecisionTreeRegressor`. AdaBoost cannot be trained in parallel which means we cannot set the `n_jobs` training argument.

In [20]:
from sklearn.ensemble import AdaBoostRegressor

model = AdaBoostRegressor(n_estimators = 1000, learning_rate = 0.05)
model

AdaBoostRegressor(base_estimator=None, learning_rate=0.05, loss='linear',
         n_estimators=1000, random_state=None)

In [21]:
adaboost_results = implement_model(model, train, train_targets, test, 
                                    test_targets, model_name = 'adaboost')
adaboost_results

['adaboost', 133.36641199829296, 0.9832505582719477, 36.654214548779]

# Implementing all Models

We can use the function we developed to train and evaluate all of the machine learning models. This function will take in a building energy dataframe and return a dataframe of metrics for all the models developed here. The function will call the `implement_model` function once for each model.  

In [22]:
def evaluate_models(df):
    """Evaluate scikit-learn machine learning models
    on a building energy dataset. More models can be added
    to the function as required. 
    
    
    Parameters
    --------
    df : dataframe
        Building energy dataframe. Each row must have one observation
        and the columns must contain the features. The dataframe
        needs to have an "elec_cons" column to be used as targets. 
    
    Return
    --------
    results : dataframe, shape = [n_models, 4]
        Modeling metrics. A dataframe with columns:
        model, train_time, test_time, mape. Used for comparing
        models for a given building dataset
        
    """
    try:
        # Preprocess the data for machine learning
        train, train_targets, test, test_targets = preprocess_data(df, test_days = 183, scale = True)
    except Exception as e:
        print('Error processing data: ', e)
        return
        
    # elasticnet
    model = ElasticNet(alpha = 1.0, l1_ratio=0.5)
    elasticnet_results = implement_model(model, train, train_targets, test, 
                                         test_targets, model_name = 'elasticnet')
    
    # knn
    model = KNeighborsRegressor()
    knn_results = implement_model(model, train, train_targets, test, 
                                  test_targets, model_name = 'knn')
    
    # svm
    model = SVR()
    svm_results = implement_model(model, train, train_targets, test, 
                                   test_targets, model_name = 'svm')
    
    # rf
    model = RandomForestRegressor(n_estimators = 100, n_jobs = -1)
    rf_results = implement_model(model, train, train_targets, test, 
                                  test_targets, model_name = 'rf')
    
    # et
    model = ExtraTreesRegressor(n_estimators=100, n_jobs = -1)
    et_results = implement_model(model, train, train_targets, test, 
                                  test_targets, model_name = 'et')
    
    # adaboost
    model = AdaBoostRegressor(n_estimators = 1000, learning_rate = 0.05, 
                              loss = 'exponential')
    adaboost_results = implement_model(model, train, train_targets, test, 
                                       test_targets, model_name = 'adaboost')
    
    # Put the results into a single array (stack the rows)
    results = np.vstack((elasticnet_results, knn_results, svm_results,
                         rf_results, et_results, adaboost_results))
    
    # Convert the results to a dataframe
    results = pd.DataFrame(results, columns = ['model', 'train_time', 'test_time', 'mape'])
    
    # Convert the numeric results to numbers
    results.iloc[:, 1:] = results.iloc[:, 1:].astype(np.float32)
    
    return results

#### Test the Function

Here we will test the function on two datasets. Later we will want to run the function on hundreds of buildings to get an accurate measure of the performance of the models.

In [23]:
results = evaluate_models(df)
results

Unnamed: 0,model,train_time,test_time,mape
0,elasticnet,0.0661613,0.0013945,56.9737
1,knn,20.9769,2.96457,23.7799
2,svm,525.235,57.757,24.8263
3,rf,18.9462,0.104385,15.7012
4,et,10.7384,0.211137,18.9027
5,adaboost,283.145,2.09613,39.5748


In [24]:
# Write results to disk
results.to_csv('../data/APS_modeling_results.csv', index = False)

In [25]:
df_new = pd.read_csv('../data/f-Kansas_weather.csv')

# Evaluate models on another dataset
results_new = evaluate_models(df_new)
results_new

Unnamed: 0,model,train_time,test_time,mape
0,elasticnet,0.0684406,0.00101819,56.9737
1,knn,21.5241,2.95038,23.7799
2,svm,528.563,58.6768,24.8263
3,rf,19.6616,0.104487,16.041
4,et,10.9921,0.104274,18.3991
5,adaboost,284.092,2.09143,39.2332


In [26]:
results_new.to_csv('../data/Kansas_modeling_results.csv', index = False)

From these preliminary results, it appears the random forest and extra trees regressor significantly outperform the competition. At this point it is too early to have any takeaways, and we will have to evaluate hundreds of buildings to make meaningful comparisons.

# Conclusions


We can now use the `evaluate_models` function to evaluate the six models across hundreds of buildings. This process cannot be done in a notebook, but we will look at the results in a future notebook. The next step after evaluating all the models is to select the best performer for further development in the process known as hyperparameter optimization. I will see you in the next notebook! 