In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from pyearth import Earth
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

<img style="float:right" src="https://www.washington.edu/brand/files/2014/09/W-Logo_Purple_Hex.png" width="60px"/>
# Traits and Range Shifts: Parameter Optimization + Feature Engineering

## <small>work by [Tony Cannistra](http://www.github.com/acannistra) and the [Buckley Lab](http://faculty.washington.edu/lbuckley) at the University of Washington</small>

In experiments with several regression techniques that aim to harness the predictive value of physiological traits, two proved best: Multivariate Adaptive Regression Splines (MARS) and Support Vector Regression (SVR).

We hone the predictions of these two methods for this application with two optimizations: a grid search over hyperparameters and feature engineering.


### Contents
1. [Grid Search Background](#Grid-Search-Background)
1. [MARS Grid Search](#MARS)
1. [SVR Parameter Grid Search](#SVR)
1. [Feature Engineering](#Feature-Engineering)

## Grid Search Background

Many algorithms have hyperparameters that can be tuned to improve performance. Finding good values by hand is arduous, especially when there are many hyperparameters or each one has a wide range of plausible values.

Scikit-Learn provides two classes for hyperparameter search: [`model_selection.GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) and [`model_selection.RandomizedSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV). `GridSearchCV` searches exhaustively over every combination of the parameter values it is given, using cross-validation and a scoring function to pick the best combination. `RandomizedSearchCV` instead samples a fixed number of candidate parameter settings from given distributions, which is cheaper when the parameter space is large.
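
As a rough illustration of the difference between the two searches, the sketch below tunes a single hyperparameter both ways. It is not part of the analysis: the data are synthetic (`make_regression`), `Ridge` is only a fast stand-in estimator, and the `alpha` range is an arbitrary assumption.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV

# Synthetic stand-in data (assumption: not the plants dataset).
X, y = make_regression(n_samples=120, n_features=10, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Exhaustive search: every listed value is cross-validated.
grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=cv,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

# Randomized search: n_iter values are sampled from a distribution.
rand = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": loguniform(1e-2, 1e2)},
    n_iter=10,
    cv=cv,
    scoring="neg_mean_squared_error",
    random_state=0,
)
rand.fit(X, y)

# Scores are negated MSE, so flip the sign to report MSE.
print("grid:  ", grid.best_params_, -grid.best_score_)
print("random:", rand.best_params_, -rand.best_score_)
```

The grid evaluates exactly the five listed values, while the randomized search evaluates ten sampled ones; with more hyperparameters the grid grows multiplicatively but `n_iter` stays fixed.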



---


In [74]:
Data = {} ## master dictionary containing all data transformations

responseVar = "migration_m"

plants_master = pd.read_csv("../data/plants5.csv")


drop_features = ["Taxon",
                 "migr_sterr_m", 
                 "shift + 2SE", 
                 'signif_shift',
                 "signif_shift2",
                 "dispmode01",
                 "DispModeEng", ## what is this
                ]

categorical_features = ["oceanity",
                        "dispersal_mode",
                        "BreedSysCode",
                        "Grime"]

##
# One-hot encode the categorical variables; columns with missing values are dropped below.
##

plants = pd.get_dummies(plants_master, columns=categorical_features)

# drop features we don't want
features = plants.drop(drop_features, axis=1)

# drop every feature (column) that contains any missing value
## axis=1 drops columns with any NaN; axis=0 would drop rows instead
features.dropna(axis=1, inplace=True)

# extract and remove target variable
target   = features[responseVar]
features.drop([responseVar], inplace=True, axis=1)

print("Features: ",features.columns.values)
print("Examples:", len(features))

Features:  ['Bio1_mean_nosyn' 'Bio1_std_nosyn' 'Bio1_var_nosyn' 'Bio1_mean_inclsyn'
 'Bio1_std_inclsyn' 'Bio1_var_inclsyn' 'oceanity_ks' 'oceanity_o'
 'oceanity_os' 'oceanity_sks' 'oceanity_so' 'oceanity_sos'
 'dispersal_mode_animal' 'dispersal_mode_gravity' 'dispersal_mode_water'
 'dispersal_mode_wind' 'BreedSysCode_1.0' 'BreedSysCode_2.0'
 'BreedSysCode_3.0' 'BreedSysCode_4.0' 'Grime_c' 'Grime_cs' 'Grime_csr'
 'Grime_r' 'Grime_s' 'Grime_sr']
Examples: 133


## MARS
[mars]: http://www.jstor.org/stable/2241837?origin=JSTOR-pdf&seq=1#page_scan_tab_contents "Friedman 1991"
[leathwick]: http://www.web.stanford.edu/~hastie/Papers/Ecology/fwb_1448.pdf "Leathwick et al. 2005"
The multivariate adaptive regression splines approach [(Friedman 1991)][mars] fits piecewise-linear basis functions in order to better approximate nonlinear relationships. It has seen limited use in ecology, with the first published example in [Leathwick et al. 2005][leathwick]. It offers our best bet for a benchmark that has any hope of capturing these nonlinear relationships.
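
To make "piecewise-linear basis functions" concrete, here is a minimal NumPy sketch of the mirrored hinge pair that MARS builds on. It is only an illustration: the knot is fixed by hand at 0 and the data are synthetic, whereas a real MARS fit (such as `Earth`) searches for knots and prunes terms itself. The helper `hinge_basis` is a hypothetical name, not part of any library.

```python
import numpy as np

def hinge_basis(x, knot):
    """Return the mirrored hinge pair max(0, x - knot), max(0, knot - x)."""
    return np.maximum(0.0, x - knot), np.maximum(0.0, knot - x)

# Synthetic target with a kink at x = 0 (assumption: toy data only).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = np.abs(x) + rng.normal(scale=0.1, size=x.size)

# Least-squares fit on an intercept plus the two hinge functions.
right, left = hinge_basis(x, knot=0.0)
design = np.column_stack([np.ones_like(x), right, left])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
hinge_mse = np.mean((design @ coef - y) ** 2)

# A single straight line for comparison.
line = np.column_stack([np.ones_like(x), x])
line_coef, *_ = np.linalg.lstsq(line, y, rcond=None)
line_mse = np.mean((line @ line_coef - y) ** 2)

print(f"straight line MSE: {line_mse:.3f}")
print(f"hinge pair MSE:    {hinge_mse:.3f}")
```

The hinge pair can bend at the knot, so it tracks the kinked target far more closely than the straight line does.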

### Benchmarking 
Nothing fancy: how does the algorithm perform with its default parameters?

In [72]:
mars = Earth()
-cross_val_score(mars, features, target, cv=LeaveOneOut(), scoring='neg_mean_squared_error', n_jobs=-1).mean()


22.422410259994923

In [73]:
-cross_val_score(mars, features, target, cv=KFold(2), scoring='neg_mean_squared_error', n_jobs=-1).mean()

23.767335622318079

In [80]:
criteria = ('rss', 'gcv', 'nb_subsets')

mars = Earth(max_degree=3, feature_importance_type=criteria)
mars.fit(features, target)

print(-cross_val_score(mars, features, target, cv=LeaveOneOut(), scoring='neg_mean_squared_error', n_jobs=-1).mean())

KeyboardInterrupt: 

## SVR

### Benchmarking

In [86]:
from sklearn.svm import SVR

base = SVR()
errs = cross_val_score(base, features, target, cv=LeaveOneOut(), scoring='neg_mean_squared_error', n_jobs=-1)
-errs.mean()

20.482634275402656

Pretty good. Let's see if we can do better. 

### Parameter Grid Search
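One way such a search could look is sketched below. Everything specific here is an assumption rather than the notebook's actual search: the data are synthetic with the same shape as our feature table (133 examples, 26 features), the grid values are arbitrary starting points, scaling with `StandardScaler` is added because SVR is sensitive to feature scale, and 5-fold cross-validation replaces leave-one-out to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for (features, target); swap in the real data to use it.
X, y = make_regression(n_samples=133, n_features=26, noise=5.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Scale inside the pipeline so each fold is scaled using only its training part.
pipe = make_pipeline(StandardScaler(), SVR())

# Baseline: default SVR hyperparameters.
baseline_mse = -cross_val_score(
    pipe, X, y, cv=cv, scoring="neg_mean_squared_error"
).mean()

# Hypothetical grid; make_pipeline names the SVR step "svr".
param_grid = {
    "svr__kernel": ["rbf", "linear"],
    "svr__C": [0.1, 1, 10, 100],
    "svr__epsilon": [0.01, 0.1, 1.0],
}
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="neg_mean_squared_error")
search.fit(X, y)

print("baseline MSE:", baseline_mse)
print("best MSE:    ", -search.best_score_)
print("best params: ", search.best_params_)
```

Because the default settings (`rbf`, `C=1`, `epsilon=0.1`) are one of the grid points, the best cross-validated error from the search can never be worse than the baseline on the same folds.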

---

[featEng]: http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/ "Source"

## Feature Engineering
> Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. [source][featEng]

This applies to our problem in several ways. We have already done the most straightforward piece of feature engineering: encoding each categorical feature as a set of binary features, one per category. This is known as *one-hot* encoding.
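
A small sketch of what that encoding does, using `pd.get_dummies` as in the data-loading cell above. The toy table is made up for illustration; `species` and `seed_mass` are hypothetical columns, not fields of the plants dataset.

```python
import pandas as pd

# Toy table (assumption: illustrative values only).
toy = pd.DataFrame({
    "species": ["a", "b", "c"],
    "dispersal_mode": ["wind", "animal", "wind"],
    "seed_mass": [1.2, 0.4, 3.1],
})

# Each category of dispersal_mode becomes its own binary column.
encoded = pd.get_dummies(toy, columns=["dispersal_mode"])
print(encoded.columns.tolist())
print(encoded)
```

The original `dispersal_mode` column is replaced by `dispersal_mode_animal` and `dispersal_mode_wind`, which is the same naming pattern seen in the feature list printed earlier.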


A much more interesting approach to feature engineering is the creation of new features by combining existing ones, for example by multiplying pairs of features together. 

The challenge is that creating pairwise products of, say, 20 features takes us from 20 features to as many as 400 (210 of them distinct, since the order of a product does not matter). Fitting any model to that many features is likely to overfit, particularly a linear model with only 133 examples to learn from. We can combat this with regularization. 
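
The feature counts can be checked with scikit-learn's `PolynomialFeatures`, which is one possible tool for generating pairwise products (an assumption; the notebook does not say which tool is used). The input matrix is random noise with 133 rows, purely to inspect shapes.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 20
# Random placeholder data (assumption: only the shape matters here).
X = np.random.default_rng(0).normal(size=(133, n_features))

# degree=2 adds every squared term and every pairwise product.
expander = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = expander.fit_transform(X)

n_ordered_pairs = n_features ** 2                      # counts a*b and b*a separately
n_unique_products = n_features * (n_features + 1) // 2  # unordered pairs plus squares

print(n_ordered_pairs)     # 400
print(n_unique_products)   # 210
print(X_expanded.shape)    # (133, 230): 20 originals + 210 products
```

Even counting only distinct products, the expanded table has more columns than the dataset has examples.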

There are two interesting forms of regularization that we can use for this project: ridge regression and LASSO regression. 
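
The sketch below contrasts the two penalties on an expanded feature set. It is a toy under stated assumptions: the data are synthetic, the target depends on one raw feature and one pairwise product only, and the `alpha` values are arbitrary rather than tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(133, 8))
# Synthetic target: one raw feature plus one pairwise product, plus noise.
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=133)

def fit_coefficients(model):
    """Expand to degree-2 features, standardize, fit, and return the coefficients."""
    pipe = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        StandardScaler(),
        model,
    )
    pipe.fit(X, y)
    return pipe[-1].coef_

ridge_coef = fit_coefficients(Ridge(alpha=1.0))   # L2 penalty: shrinks coefficients
lasso_coef = fit_coefficients(Lasso(alpha=0.1))   # L1 penalty: can zero them out

print("expanded features:         ", ridge_coef.size)
print("nonzero ridge coefficients:", np.count_nonzero(ridge_coef))
print("nonzero lasso coefficients:", np.count_nonzero(lasso_coef))
```

Ridge keeps every expanded feature with a shrunken weight, while LASSO drives most weights to exactly zero, so it doubles as a feature selector for the engineered products.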
