<script>
    document.querySelector('head').innerHTML += '<style>.slides { zoom: 1. !important; }</style>';
</script>

# Week 9 - Tree-Based Methods
### Dr. David Elliott

1.8. [Hyperparameters](#hypers)
    
1.9. [Imbalanced Data](#imb)
    
1.10. [Strengths and Limitations](#adv_lim)

__TODO__
- The following has good hyperparam advice for trees, bagging, and forests so use that! https://bradleyboehmke.github.io/HOML/random-forest.html
- compare models together to demonstrate pros and cons
- add stuff from pg. 552 in ML prob perspective

# Hyperparameters <a id='hypers'></a>

> The important parameters to adjust are n_estimators, max_features, and possibly pre-pruning options like max_depth. For n_estimators, larger is always better. Averaging more trees will yield a more robust ensemble by reducing overfitting. However, there are diminishing returns, and more trees need more memory and more time to train. A common rule of thumb is to build “as many as you have time/memory for.” As described earlier, max_features determines how random each tree is, and a smaller max_features reduces overfitting. In general, it’s a good rule of thumb to use the default values: max_features=sqrt(n_features) for classification and max_features=n_features for regression. Adding max_features or max_leaf_nodes might sometimes improve performance. It can also drastically reduce space and time requirements for training and prediction.

Müller, A. C., & Guido, S. (2016). Introduction to machine learning with Python: a guide for data scientists. O'Reilly Media, Inc.
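
The advice quoted above can be checked empirically. The following is a minimal sketch, assuming a synthetic dataset from `make_classification` (not the data used elsewhere in these notes; the names `X_demo` and `y_demo` are illustrative), that compares cross-validated accuracy as `n_estimators` and `max_features` are varied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic data: an assumption for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=16,
                                     n_informative=4, random_state=42)

results = {}
for n_estimators in [5, 50, 200]:
    # "sqrt" considers a random subset of features at each split,
    # None considers every feature (less randomness per tree)
    for max_features in ["sqrt", None]:
        forest = RandomForestClassifier(n_estimators=n_estimators,
                                        max_features=max_features,
                                        random_state=42)
        scores = cross_val_score(forest, X_demo, y_demo,
                                 cv=5, scoring="accuracy")
        results[(n_estimators, max_features)] = scores.mean()
        print("n_estimators=%3d, max_features=%-4s: CV accuracy %.3f +/- %.3f"
              % (n_estimators, max_features, scores.mean(), scores.std()))
```

On most datasets the gain from adding trees flattens out quickly, which is the "diminishing returns" the quote refers to; the exact numbers depend on the data and the random seed.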

# Imbalanced Data <a id='imb'></a>

__TODO__
- discuss general ensembles for imbalances as well.

#### Forest
It is also worth noting that we have been handling the class imbalance in this data by setting `class_weight='balanced'`, which assigns more importance to getting ictal data predictions correct. We can, however, also undersample using a balanced random forest. Which approach performs better generally depends on the amount of data you are training on: with a small dataset, class weighting tends to do better, whereas with a very large dataset, undersampling will likely work better. In the example below the two classes are nearly balanced, so both approaches achieve the same score.
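
To make the effect of `class_weight='balanced'` concrete, here is a small sketch of the weights scikit-learn derives, using the heuristic `n_samples / (n_classes * np.bincount(y))`. The label vector `y_demo` is constructed by hand to have the class counts reported for this training set (113 and 99); it is a stand-in for illustration, not the real `y_train`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# stand-in labels with 113 samples of class 0 and 99 of class 1
y_demo = np.array([0] * 113 + [1] * 99)
classes = np.unique(y_demo)

# weights as computed by scikit-learn for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced",
                               classes=classes, y=y_demo)

# the same heuristic written out by hand
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print({int(c): round(float(w), 3) for c, w in zip(classes, weights)})
# → {0: 0.938, 1: 1.071}
```

With counts this close the two weights barely differ from 1, which suggests why weighting and undersampling would be expected to behave similarly on this data.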

__NOTE__
- not sure this imbalance is going to do much, use abalone_data?

In [52]:
from collections import Counter

print(Counter(y_train))

Counter({0: 113, 1: 99})


In [53]:
from imblearn.ensemble import BalancedRandomForestClassifier

bal_f = BalancedRandomForestClassifier(criterion='gini',
                                       n_estimators=1000,
                                       max_features = 'sqrt',
                                       random_state=42,
                                       n_jobs=-1)

rf_dict = {'Random Forest':rf, 'Balanced Random Forest':bal_f}

for classifier_name in rf_dict:
    scores = cross_val_score(estimator=rf_dict[classifier_name], 
                             X=X_train, 
                             y=y_train, 
                             scoring = 'accuracy',
                             cv=StratifiedKFold(),
                             n_jobs=-1)

    print(color.BOLD+color.UNDERLINE+classifier_name+color.END)
    #print('CV accuracy scores: %s' % scores)
    print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

Random Forest
CV accuracy: 0.995 +/- 0.010
Balanced Random Forest
CV accuracy: 0.995 +/- 0.010


# Strengths and Limitations <a id='adv_lim'></a>

## Trees

__Pros__
- Easy to explain
    - Trees can be displayed graphically in an interpretable manner.
- Inherently multiclass
- Can handle different types of predictors*.
    - Independent of feature scaling

__Cons__

- Comparatively low predictive accuracy
    - Easy to overfit
    - Require pruning

- High variance
    - A small change in the data can cause a large change in the estimated tree.
    - Model affected by the rotation of the data

---
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.

https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/06-trees__notes.pdf

> Some advantages of decision trees are:
>
> Simple to understand and to interpret. Trees can be visualised.
>
> Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be created and blank values to be removed. Note however that this module does not support missing values.
>
> The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
>
> Able to handle both numerical and categorical data. However scikit-learn implementation does not support categorical variables for now. Other techniques are usually specialised in analysing datasets that have only one type of variable. See algorithms for more information.
>
> Able to handle multi-output problems.
>
> Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.
>
> Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
>
> Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
>
> The disadvantages of decision trees include:
>
> Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
>
> Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
>
> Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations as seen in the above figure. Therefore, they are not good at extrapolation.
>
> The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
>
>There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
>
> Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

https://scikit-learn.org/stable/modules/tree.html

> As discussed earlier, the parameters that control model complexity in decision trees are the pre-pruning parameters that stop the building of the tree before it is fully developed. Usually, picking one of the pre-pruning strategies—setting either max_depth, max_leaf_nodes, or min_samples_leaf—is sufficient to prevent overfitting.
> Decision trees have two advantages over many of the algorithms we’ve discussed so far: the resulting model can easily be visualized and understood by nonexperts (at least for smaller trees), and the algorithms are completely invariant to scaling of the data. As each feature is processed separately, and the possible splits of the data don’t depend on scaling, no preprocessing like normalization or standardization of features is needed for decision tree algorithms. In particular, decision trees work well when you have features that are on completely different scales, or a mix of binary and continuous features.
> The main downside of decision trees is that even with the use of pre-pruning, they tend to overfit and provide poor generalization performance. Therefore, in most applications, the ensemble methods we discuss next are usually used in place of a single decision tree.

Müller, A. C., & Guido, S. (2016). Introduction to machine learning with Python: a guide for data scientists. O'Reilly Media, Inc.
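
The pre-pruning options mentioned in the quote can be compared directly. This is a minimal sketch on synthetic, deliberately noisy data (an assumption for illustration; `flip_y=0.1` randomly reassigns about 10% of the labels), showing how `max_depth` and `min_samples_leaf` limit tree size relative to a fully grown tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic noisy data, for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=3, flip_y=0.1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

trees = {
    "fully grown": DecisionTreeClassifier(random_state=0),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
    "min_samples_leaf=20": DecisionTreeClassifier(min_samples_leaf=20,
                                                  random_state=0),
}

summary = {}
for name, clf in trees.items():
    clf.fit(X_tr, y_tr)
    # (number of leaves, training accuracy, held-out accuracy)
    summary[name] = (clf.get_n_leaves(),
                     clf.score(X_tr, y_tr),
                     clf.score(X_te, y_te))
    print("%-20s leaves=%3d  train acc=%.3f  test acc=%.3f"
          % ((name,) + summary[name]))
```

The fully grown tree memorises the training set (training accuracy of 1.0), including the flipped labels, while the pre-pruned trees use far fewer leaves. How the held-out accuracies compare will vary with the data.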

__NOTES__

- Decision trees arguably mirror human decision-making more closely than the regression and classification approaches previously discussed.

### (+) Easy to explain

As well as being straightforward to plot, decision trees allow us to assess the _importance_ of each feature for classifying the data.

The importance (or Gini importance) of a feature is the normalized total reduction of the criterion (e.g. Gini) brought by that feature.

---
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
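
As a sketch of what "normalized total reduction of the criterion" means, the importances can be recomputed by hand from a fitted tree's internal arrays and compared with `feature_importances_`. The iris data is used here only as a convenient stand-in (an assumption for illustration), and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_demo, y_demo)
t = clf.tree_

manual = np.zeros(X_demo.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:
        # leaf node: no split, so no impurity reduction
        continue
    # weighted impurity of the node minus that of its two children
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    manual[t.feature[node]] += decrease

# normalise so the importances sum to one
manual /= manual.sum()

print(np.round(manual, 3))
print(np.round(clf.feature_importances_, 3))
```

Each feature's importance is therefore the share of the total (sample-weighted) Gini reduction contributed by the splits made on that feature.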

In [None]:
__Notes__
- Let's fit all the penguins on the full data (without categories) and see what features are used to split the data and where they are in the tree.
- For bagging regression trees we can use the RSS.

In [None]:
DT_g = DecisionTreeClassifier(criterion='gini',
                               random_state=42)
#DT_g.fit(datasets['bin']['X'], datasets['bin']['y'])

In [None]:
def tree_feat_import(DT, X, y, feat_names, class_names, title):

    DT.fit(X, y)
    # get the importances for the features
    importances = DT.feature_importances_

    importances_series = pd.Series(importances,index=feat_names).sort_values(ascending = False)
    
    fig, axes = plt.subplots(nrows = 2, figsize=(8,8))
    axes = axes.flatten()
    
    with plt.style.context("classic"):
        plt.sca(axes[0])
        tp = tree.plot_tree(DT,
                       feature_names=feat_names, 
                       class_names=class_names,
                       filled = True)
    
    plt.sca(axes[1])
    # plot the important features
    importances_series.plot.barh(legend =False, grid=False)
    plt.title(title)

    #plt.xticks(rotation=45,ha='right')
    plt.tight_layout()

    #plt.savefig('forest_importances.png', dpi=300)
    plt.show()

    # summarize feature importance
    for i,v in enumerate(importances):
        print(color.BOLD+feat_names[i]+color.END+": %.3f" % (v))

    print(color.BOLD+"total: "+color.END + str(round(sum(importances),2)))

In [None]:
tree_feat_import(DT_g, datasets['bin']['X'], datasets['bin']['y'], 
                 datasets['bin']['feats'],datasets['bin']['class'],
                 'Feature Importances for Classifying Adelie and Gentoo Penguins')

In [None]:
if DTREEVIS:
    viz = dtreeviz(DT_g, datasets['bin']['X'], datasets['bin']['y'], 
                   feature_names=datasets['bin']['feats'],
                   class_names=datasets['bin']['class'],
                   orientation ='LR', 
                   colors = col_dict, # doesnt seem to do much..
                   scale=2.0
                  )
    display(viz)

### (+) Inherently multiclass
Tree-based classifiers are inherently multiclass...

__[Insert about multiclass]__
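
As a rough illustration (using the iris data that ships with scikit-learn as a stand-in, which is an assumption of this sketch), a single `DecisionTreeClassifier` can be fit to three classes directly, without a one-vs-rest or one-vs-one wrapper. Each leaf simply stores the class proportions of the training samples that reach it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# one tree, three classes, no multiclass wrapper needed
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_demo, y_demo)

# one probability column per class; each row sums to one
proba = clf.predict_proba(X_demo[:3])
print(clf.classes_)
print(proba)
```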

In [None]:
DT_g.fit(datasets['multi']['X'], datasets['multi']['y'])

In [64]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

names = ["Sex", "Length", "Diameter", "Height", "Whole weight", 
         "Shucked weight", "Viscera weight", "Shell weight", "Rings"]

df = pd.read_csv("abalone_data.csv", names=names)
y_labels = df["Sex"]
X = df.drop("Sex", axis=1)

# map each sex label to an integer (order follows first appearance in the data)
labels_multi = dict(zip(y_labels.unique(), range(3)))

# make a binary version - infants (1) vs. adults (0)
# every label is set explicitly so the result does not depend on row order
labels_bin = {'M': 0, 'F': 0, 'I': 1}

# replace the labels so they are now binary
y_bin = y_labels.map(labels_bin)

X_train, X_test, y_train, y_test = train_test_split(X.values, y_bin.values, test_size = 0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.2, random_state=42)

In [None]:
fig, axes = plt.subplots(figsize=(10,5))
with plt.style.context("classic"):
    tree.plot_tree(DT_g,
                   feature_names=datasets['multi']['feats'], 
                   class_names=datasets['multi']['class'], 
                   filled = True)
    plt.show()
print(datasets['multi']['class'])

__Extra__

Unsurprisingly, more features are needed to separate the classes in the multi-class problem than in the binary one, as can be seen below.

In [None]:
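# class_names must be passed as well, otherwise the title lands in the wrong argument
tree_feat_import(DT_g, datasets['multi']['X'], datasets['multi']['y'], 
                 datasets['multi']['feats'], datasets['multi']['class'],
                 'Feature Importances for Classifying Adelie, Gentoo, and Chinstrap Penguins')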

### Extra: Categorical Features and Sklearn

You may also be wondering: where are my previous data visualisations of the categorical data before this? Well, Sklearn's CART implementation currently _"does not support categorical variables"_<sup>Web2</sup>. This means:

- Do not use label encoding with `DecisionTreeClassifier()` if your categorical data is __not ordinal__: the encoded values will be treated as numeric, so you'll end up with splits that do not make sense<sup>Web1</sup>.

- Using a `OneHotEncoder` is currently the only valid way with sklearn. It allows arbitrary splits that do not depend on the label ordering, but it is computationally expensive and can degrade the performance of decision trees because it leads to sparse features, which can also distort feature importances<sup>Web1</sup>.
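
The difference between the two encodings can be seen on a toy example. This is a minimal sketch with hypothetical data: one unordered categorical feature with levels `a`, `b` and `c`, where only the "middle" level `b` belongs to the positive class. A depth-1 tree is used so that a single split has to do all the work.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# hypothetical toy data: 10 samples of each level, only "b" is class 1
X_cat = np.array([["a"], ["b"], ["c"]] * 10)
y_cat = np.array([0, 1, 0] * 10)

# integer codes impose an artificial order: a -> 0, b -> 1, c -> 2
X_ordinal = OrdinalEncoder().fit_transform(X_cat)
# one indicator column per level, no order implied
X_onehot = OneHotEncoder().fit_transform(X_cat)

stump = DecisionTreeClassifier(max_depth=1, random_state=42)
acc_ordinal = stump.fit(X_ordinal, y_cat).score(X_ordinal, y_cat)
acc_onehot = stump.fit(X_onehot, y_cat).score(X_onehot, y_cat)

print("ordinal codes: %.3f" % acc_ordinal)
print("one-hot codes: %.3f" % acc_onehot)
```

With integer codes a single threshold cannot isolate `b` from both `a` and `c`, so the stump gets only two thirds of the samples right, whereas the one-hot indicator for `b` separates the classes in one split. A deeper tree could still isolate `b` from the integer codes with two splits, but only by working around an ordering that has no meaning.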

__Solution__

Currently the best way of handling categorical features is to use `catboost`. `catboost` is a boosting classifier as we discuss later, which can handle categorical inputs. If we want a tree then we can do something like the example below. However, while making these materials I haven't used `catboost` much so I'm still trying to figure out how to understand the categorical splits using this package - so I'll have to leave you to do your own digging to understand the tree below.

---
Web1. https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree
Web2. https://scikit-learn.org/stable/modules/tree.html

In [None]:
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score
test = penguins_rm
X = test.drop('species', axis=1)[["sex", "island"]]
y = test['species']

# np.float was removed in NumPy 1.24; the builtin float is equivalent here
categorical_features_indices = np.where(X.dtypes != float)[0]

pool = Pool(X, y, cat_features=categorical_features_indices, feature_names=list(X.columns))


model = CatBoostClassifier(
    custom_loss=['Accuracy'],
    random_seed=42,
    logging_level='Info',
    iterations = 1,
    max_depth=2,
    max_ctr_complexity = 0,
)

model.fit(pool,
)

plot = model.plot_tree(tree_idx=0, pool = pool)

display(plot)

print("Leaf 1 prediction: " + str(np.argmax([0.830,-0.415, -0.415])))
print("Leaf 2 prediction: " + str(np.argmax([0,0,0])))
print("Leaf 3 prediction: " + str(np.argmax([0.169, 0.255,-0.424])))
print("Leaf 4 prediction: " + str(np.argmax([-0.088, -0.2473,0.561])))

### (-) High variance

__TODO__
- more demonstration of this

#### Rotation of the Data
As can be seen from the decision boundary, a decision tree is quite boxy. Furthermore, the decision boundary the model learns is affected by the rotation of the data, because each split in a DT is a straight line perpendicular to one of the feature axes.
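
The sensitivity to rotation can be sketched on synthetic data (an assumption for illustration; this is not the penguin data used in the plots). Two classes are separated by a vertical line, which a tree captures with one split. Rotating the same points by 45 degrees makes the boundary diagonal, and the tree then needs a staircase of splits.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X_plain = rng.uniform(-0.5, 0.5, size=(200, 2))
# classes are split by the vertical line x0 = 0
y_demo = (X_plain[:, 0] > 0).astype(int)

# rotate every point by 45 degrees; the labels are unchanged
angle = np.pi / 4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
X_rotated = X_plain @ rotation.T

tree_plain = DecisionTreeClassifier(random_state=42).fit(X_plain, y_demo)
tree_rotated = DecisionTreeClassifier(random_state=42).fit(X_rotated, y_demo)

print("leaves (original):", tree_plain.get_n_leaves())
print("leaves (rotated): ", tree_rotated.get_n_leaves())
```

Both trees fit the training data perfectly, but the rotated version is more complex than it needs to be and its boundary is less likely to generalise.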

In [None]:
regions_tree(DT_g1, datasets['blbd']['X'], datasets['blbd']['y'], datasets['blbd']['feats'], 
             datasets['blbd']['class'], [[37, 22],[51.5, 22]], r_labels, tp_labels)
regions_tree(DT_g, datasets['blbd']['X'], datasets['blbd']['y'], datasets['blbd']['feats'], 
             datasets['blbd']['class'], [[57.5, 19.5],[57.5, 14.5]])

If there is a highly non-linear and complex relationship between the features and the response then decision trees may outperform classical approaches.

However if the relationship between the features and the response is well approximated by a linear model, then an approach such as linear regression will likely work well.

---
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.

__Notes__

- Linear regression assumes a model of the form $f(X)=\beta_0+\sum_{j=1}^p X_j\beta_j$,

- Regression trees assume a model of the form $f(X)=\sum_{m=1}^M c_m\cdot 1_{(X\in R_m)}$, where $R_1, \ldots, R_M$ represent a partition of feature space.

- The relative performances of tree-based and classical approaches can be assessed by estimating the test error, using either cross-validation or the validation set approach.
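
The comparison described in these notes can be sketched with cross-validation. The data below is synthetic (an assumption for illustration): one target follows the linear form, the other is constant within regions of the feature space, matching the two model forms above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
noise = rng.normal(scale=0.3, size=300)

targets = {
    # f(X) = beta_0 + beta_1 * X
    "linear": 2 * X_demo[:, 0] + noise,
    # f(X) is constant within each region R_m
    "piecewise constant": 3 * (np.abs(X_demo[:, 0]) < 1) + noise,
}
models = {
    "linear regression": LinearRegression(),
    "regression tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# estimate test performance (R^2) with 5-fold cross-validation
cv_r2 = {}
for t_name, y_t in targets.items():
    for m_name, model in models.items():
        scores = cross_val_score(model, X_demo, y_t, cv=5, scoring="r2")
        cv_r2[(t_name, m_name)] = scores.mean()
        print("%-18s target, %-17s: mean CV R^2 = %.3f"
              % (t_name, m_name, scores.mean()))
```

Each model does best on the target that matches its assumed form, which is the point made in the text: neither approach dominates, and the estimated test error is what should decide between them.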

In [1]:
from IPython.display import Image

Image("fig_8-7.png")

## Forests

Random forests offer relatively good performance “out of the box” and are easy to use (not much tuning is required to get good results); see https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/L07/07-ensembles__notes.pdf

Advantage:
- _"Random forests are an effective tool in prediction. Because of the Law of Large Numbers they do not overfit. Injecting the right kind of randomness makes them accurate classifiers and regressors"_ Breiman L: Random forests. Machine Learning 2001, 45: 5–32.
    - although note the above statement has been questioned in Segal MR: Machine Learning Benchmarks and Random Forest Regression. Technical Report, Center for Bioinformatics & Molecular Biostatistics, University of California, San Francisco 2004.
    
An advantage of ExtraTrees is that it is faster to train than a random forest: finding the best threshold for each feature at every node is one of the most time-consuming parts of growing a tree, and ExtraTrees skips that search by using random thresholds instead (Géron, 2017).
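
A rough way to see this is to time both ensembles on the same data. This is a sketch on synthetic data (an assumption for illustration); timings depend on the machine and the dataset, so no particular result is claimed here.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=5, random_state=42)

fit_seconds = {}
for name, Model in [("Random Forest", RandomForestClassifier),
                    ("Extra Trees", ExtraTreesClassifier)]:
    clf = Model(n_estimators=100, random_state=42)
    start = perf_counter()
    clf.fit(X_demo, y_demo)
    fit_seconds[name] = perf_counter() - start
    print("%s: fit in %.2f seconds" % (name, fit_seconds[name]))
```

Note that the two classes also differ in their defaults (for example, `ExtraTreesClassifier` does not bootstrap the samples by default), so this is not a perfectly controlled comparison.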

Disadvantage:
- Not good for very high-dimensional sparse data<sup>1</sup>.

---
1. Müller, A. C., & Guido, S. (2016). Introduction to machine learning with Python: a guide for data scientists. O'Reilly Media, Inc.

> Random forests for regression and classification are currently among the most widely used machine learning methods. They are very powerful, often work well without heavy tuning of the parameters, and don’t require scaling of the data.
>
> Essentially, random forests share all of the benefits of decision trees, while making up for some of their deficiencies. One reason to still use decision trees is if you need a compact representation of the decision-making process. It is basically impossible to interpret tens or hundreds of trees in detail, and trees in random forests tend to be deeper than decision trees (because of the use of feature subsets). Therefore, if you need to summarize the prediction making in a visual way to nonexperts, a single decision tree might be a better choice.
>
> While building random forests on large datasets might be somewhat time consuming, it can be parallelized across multiple CPU cores within a computer easily. If you are using a multi-core processor (as nearly all modern computers do), you can use the n_jobs parameter to adjust the number of cores to use. Using more CPU cores will result in linear speed-ups (using two cores, the training of the random forest will be twice as fast), but specifying n_jobs larger than the number of cores will not help. You can set n_jobs=-1 to use all the cores in your computer.
>
> You should keep in mind that random forests, by their nature, are random, and setting different random states (or not setting the random_state at all) can drastically change the model that is built. The more trees there are in the forest, the more robust it will be against the choice of random state. If you want to have reproducible results, it is important to fix the random_state.
>
> Random forests don’t tend to perform well on very high dimensional, sparse data, such as text data. For this kind of data, linear models might be more appropriate. Random forests usually work well even on very large datasets, and training can easily be parallelized over many CPU cores within a powerful computer. However, random forests require more memory and are slower to train and to predict than linear models. If time and memory are important in an application, it might make sense to use a linear model instead.

Müller, A. C., & Guido, S. (2016). Introduction to machine learning with Python: a guide for data scientists. O'Reilly Media, Inc.

Let's look at how a decision boundary created by a bagged tree could generalise better than that of a single tree.

In [None]:
def hyper_search(model, params, X, y, save_path, n_iter=60, random_state=42, overwrite=False):
    if os.path.exists(save_path) and overwrite==False:
        #load the model
        models = joblib.load(save_path)
    else:
        # use a grid search only if every parameter is given as a list
        if all(isinstance(x, list) for x in params.values()):
            search_type = "Gridsearch"
            models = GridSearchCV(model, param_grid=params)
            # number of candidates = size of the full parameter grid
            n_iter = len(list(itertools.product(*params.values())))
        else:
            search_type = "Randomsearch"
            models = RandomizedSearchCV(model, param_distributions=params,
                                        n_iter=n_iter, random_state=random_state)
        
        start = time()
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            models.fit(X, y)
        
        print(search_type + " took %.2f seconds for %d candidates" % ((time() - start), n_iter))
        joblib.dump(models, save_path)
    
    return models

cancer_features = ['mean radius','mean smoothness']
# specify parameters and distributions to sample from
param_grid = {"min_samples_leaf":list(range(1,15))}

lsamples_gs = hyper_search(DT, param_grid, X[cancer_features].values, y,
                           os.path.join(os.getcwd(), "Models", "lsamples_gs_object.pkl"))

pd.DataFrame(lsamples_gs.cv_results_).sort_values("rank_test_score")[["param_min_samples_leaf", "mean_test_score", "std_test_score"]].head()

In [None]:
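# specify parameters and distributions to sample from
# note: scipy's uniform(loc, scale) samples from [loc, loc + scale],
# and a float max_samples must lie in (0, 1]
param_grid = {"n_estimators":range(2,1000),
              "max_samples":uniform(0.05, 0.95)}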

rf_gs = hyper_search(RF, param_grid, X[cancer_features].values, y,
                     os.path.join(os.getcwd(), "Models", "rf_gs_object.pkl"), 
                     n_iter=15, random_state=1, overwrite=False)

pd.DataFrame(rf_gs.cv_results_).sort_values("rank_test_score")[["param_n_estimators", "param_max_samples", "mean_test_score", "std_test_score"]].head()

In [None]:
tree_dict = {'Tree':lsamples_gs.best_estimator_, 'Forest':rf_gs.best_estimator_}

fig, axes = plt.subplots(ncols=2, figsize=(15, 5))
for i, classifier_name in enumerate(tree_dict):
    plt.sca(axes[i])

    plot_decision_regions(X[cancer_features].values, y.values,
                          clf = tree_dict[classifier_name])

    plt.xlabel(cancer_features[0]) 
    plt.ylabel(cancer_features[1])

    plt.title(classifier_name)
    plt.ylim([0.05,0.18])
    plt.xlim([10,20])
plt.show()

### Dimension Reduction
#### Model Stacking
A method growing in popularity is model stacking, where the output of one model is used as the input to another. This allows nonlinearities to be captured in the first model, with the potential to use a simple linear model as the last layer. Deep learning is an example of model stacking: neural networks are often layered on top of one another to optimize both the features and the classifier simultaneously<sup>1</sup>.

---

1. Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. " O'Reilly Media, Inc.".

#### Feature Importances

An example of model stacking is to use the output of a decision tree–type model as input to a linear classifier. We can obtain the importance of each feature as the average impurity decrease computed over all decision trees in the forest, without making any assumption about whether the classes are linearly separable. However, if features are highly correlated, one feature may be ranked highly while the information in the others is not fully captured<sup>1</sup>.
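
The caveat about correlated features can be sketched with an extreme case: an exact duplicate of an informative feature. The data is synthetic (an assumption for illustration), with one signal feature and one pure-noise feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(42)
x_signal = rng.normal(size=500)
x_noise = rng.normal(size=500)
# the class depends on the signal feature only
y_demo = (x_signal > 0).astype(int)

X_two = np.column_stack([x_signal, x_noise])
# the last column is an exact copy of the first
X_three = np.column_stack([x_signal, x_noise, x_signal])

forest = RandomForestClassifier(n_estimators=200, random_state=42)
imp_two = forest.fit(X_two, y_demo).feature_importances_
imp_three = forest.fit(X_three, y_demo).feature_importances_

print("without duplicate:", np.round(imp_two, 3))
print("with duplicate:   ", np.round(imp_three, 3))
```

The importance credited to the signal feature drops once its copy is present, because the trees can split on either column. The feature is no less informative; its importance is simply shared, which is why impurity-based rankings should be read with care when features are correlated.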

---

1. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.

https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e

Rather than manually setting a threshold as we have done (looking at the top 30), we can put the forest in a pipeline and use the `SelectFromModel` meta-transformer from Scikit-learn. With this we can still provide a numeric threshold, or we can use a heuristic such as the mean or median of the feature importances<sup>1</sup>.

---

1. http://scikit-learn.org/stable/modules/feature_selection.html
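
As a sketch of how the threshold choice changes the number of features kept, the following fits `SelectFromModel` with the `'mean'` and `'median'` heuristics and with a numeric cut-off. The data is synthetic and the value `0.2` is an arbitrary illustrative choice, both assumptions of this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# synthetic data with 10 features, 3 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=3, random_state=42)

n_selected = {}
for threshold in ["mean", "median", 0.2]:
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold=threshold)
    selector.fit(X_demo, y_demo)
    # get_support() is a boolean mask over the input features
    n_selected[threshold] = int(selector.get_support().sum())
    print("threshold=%-6s keeps %d of %d features"
          % (threshold, n_selected[threshold], X_demo.shape[1]))
```

Features whose importance is greater than or equal to the threshold are kept, so `'median'` retains at least half of the features, while `'mean'` is usually stricter when a few features dominate.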

In [51]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

svm = SVC(kernel='rbf', random_state=42, class_weight = 'balanced')
rf = RandomForestClassifier(criterion='gini',
                            n_estimators=100,
                            max_features = 'sqrt',
                            random_state=42,
                            class_weight = 'balanced',
                            n_jobs=-1)

rf_svm = Pipeline([
  ('feature_selection', SelectFromModel(rf, threshold = 'mean')),
  ('classification', svm)
])

svm_dict = {'SVM':svm, 'Forest SVM':rf_svm}

for classifier_name in svm_dict:
    scores = cross_val_score(estimator=svm_dict[classifier_name], 
                             X=X_train, 
                             y=y_train, 
                             scoring = 'accuracy',
                             cv=StratifiedKFold(),
                             n_jobs=-1)

    print(color.BOLD+color.UNDERLINE+classifier_name+color.END)
    #print('CV accuracy scores: %s' % scores)
    print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

SVM
CV accuracy: 0.906 +/- 0.039
Forest SVM
CV accuracy: 0.986 +/- 0.012


## Permutation Importance

_"impurity-based feature importances can be misleading for high cardinality features (many unique values). See `sklearn.inspection.permutation_importance` as an alternative"_ https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
- _"The permutation feature importance is defined to be the decrease in a model score when a single feature value is randomly shuffled<sup>9</sup>. This procedure breaks the relationship between the feature and the target, thus the drop in the model score is indicative of how much the model depends on the feature. This technique benefits from being model agnostic and can be calculated many times with different permutations of the feature."_ https://scikit-learn.org/stable/modules/permutation_importance.html
- _"Its validation performance, measured via the score, is significantly larger than the chance level. This makes it possible to use the permutation_importance function to probe which features are most predictive"_ https://scikit-learn.org/stable/modules/permutation_importance.html
- _"Using a held-out set makes it possible to highlight which features contribute the most to the generalization power of the inspected model. Features that are important on the training set but not on the held-out set might cause the model to overfit."_ https://scikit-learn.org/stable/modules/permutation_importance.html
---
https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html#sklearn.inspection.permutation_importance
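
The procedure described in the quotes can be sketched as follows, using `sklearn.inspection.permutation_importance` on a held-out set. The data is synthetic (an assumption for illustration): only the first of three features carries any signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X_demo = rng.normal(size=(600, 3))
# only the first feature determines the class
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

# check the model is well above chance before interpreting importances
print("held-out accuracy: %.3f" % forest.score(X_te, y_te))

# shuffle each feature 10 times and record the drop in held-out accuracy
result = permutation_importance(forest, X_te, y_te,
                                n_repeats=10, random_state=42)
for i in range(X_demo.shape[1]):
    print("feature %d: %.3f +/- %.3f"
          % (i, result.importances_mean[i], result.importances_std[i]))
```

Shuffling the informative feature should cost the model most of its accuracy, while shuffling the noise features should change the score very little.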

__NOTES__
- _"Warning: Features that are deemed of low importance for a bad model (low cross-validation score) could be very important for a good model. Therefore it is always important to evaluate the predictive power of a model using a held-out set (or better with cross-validation) prior to computing importances. Permutation importance does not reflect to the intrinsic predictive value of a feature by itself but how important this feature is for a particular model."_ https://scikit-learn.org/stable/modules/permutation_importance.html

_"As concluding remarks about ensemble techniques, it is worth noting that ensemble learning increases the computational complexity compared to individual classifiers. In practice, we need to think carefully about whether we want to pay the price of increased computational costs for an often relatively modest improvement in predictive performance._

_An often-cited example of this tradeoff is the famous \$1 million Netflix Prize, which was won using ensemble techniques. The details about the algorithm were published in The BigChaos Solution to the Netflix Grand Prize by A. Toescher, M. Jahrer, and R. M. Bell, Netflix Prize documentation, 2009, which is available at http://www.stat.osu.edu/~dmsl/GrandPrize2009_BPC_BigChaos.pdf. The winning team received the \$1 million grand prize money; however, Netflix never implemented their model due to its complexity, which made it infeasible for a real-world application:_

_"We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment." http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html"_ Python Machine Learning


# References
1. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: Springer, 2013.
2. Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081
3. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Inc.
4. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.