## DRY Cross Validation

Recall that "DRY" stands for "**D**on't **R**epeat **Y**ourself." In this set of notes, we'll see how to write a function that can partially automate the selection of complexity parameters. 

In a recent Discussion, you wrote code to select complexity parameters for your model. Your code might have looked something like this: 

In [54]:
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import tree, preprocessing, linear_model
import numpy as np
from sklearn.model_selection import cross_val_score

url = "https://philchodrow.github.io/PIC16A/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)
penguins = penguins[['Species', 'Flipper Length (mm)', 'Body Mass (g)', 'Sex']]

# drop rows with missing values, and rows whose Sex is recorded as "."
penguins = penguins.dropna()
penguins = penguins[penguins["Sex"] != "."]

X = penguins.drop(['Species'], axis = 1)
y = penguins['Species']

# encode the text columns as integers; le is re-fit for each column
le = preprocessing.LabelEncoder()
X['Sex'] = le.fit_transform(X['Sex'])
y = le.fit_transform(y)

In [55]:
best_score = 0

for d in range(1,30):
    T = tree.DecisionTreeClassifier(max_depth = d)
    cv_score = cross_val_score(T, X, y, cv=10).mean()
    if cv_score > best_score:
        best_depth = d
        best_score = cv_score
print("best depth is " + str(best_depth))

best depth is 24


That's all well and good for handling one model, but I've asked you to do three! How can we make this work?

## Version 1

First, let's write a function that will allow us to select a depth for a decision tree. 

In [56]:
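def select_DT_depth(X, y, possible_depths):
    """
    Return the depth in possible_depths with the highest mean
    10-fold cross-validation score, along with that score.
    """
    best_depth, best_score = None, 0
    for d in possible_depths:
        T = tree.DecisionTreeClassifier(max_depth = d)
        cv_score = cross_val_score(T, X, y, cv=10).mean()
        if cv_score > best_score:
            best_depth = d
            best_score = cv_score
    
    return best_depth, best_score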

In [57]:
depth, score = select_DT_depth(X, y, range(1, 30))
print("Best depth was " + str(depth) + " with score " + str(score) + ".")

Best depth was 22 with score 0.8017825311942959.


This works just fine, but we have an issue: other models have *different* complexity parameters. How can we write a function that will work both for decision trees, where the complexity parameter is called `max_depth`, and for logistic regression, where the complexity parameter is called `C`? 

As a warmup, let's consider this mini-problem:

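> Suppose we have a function `g` that accepts multiple keyword arguments. Write a function `f`, taking a keyword name and a value as its two arguments, such that the two calls below return the same result: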

```python
f("captain", "Burnham") == g(captain = "Burnham") 
```



**Hint**: Week 2. 
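
One possible solution is sketched below. It relies on dictionary unpacking: a dictionary `{key: value}` preceded by `**` is passed to a function as the keyword argument `key = value`. The body of `g` here is a hypothetical stand-in (it just returns the keyword arguments it receives), since the problem does not say what `g` does.

```python
def g(**kwargs):
    # hypothetical stand-in for g: return the keyword arguments received
    return kwargs

def f(key, value):
    # build a one-entry dictionary and unpack it as a keyword argument,
    # so that f("captain", "Burnham") calls g(captain = "Burnham")
    return g(**{key: value})

print(f("captain", "Burnham") == g(captain = "Burnham"))
```

The keyword name is an ordinary string inside `f`, so the caller can choose it at run time.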

Ok, now let's use this idea to write a simple function for selecting a model complexity from some supplied possibilities. 

In [52]:
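def select_complexity(model, X, y, complexity_kw, possible_complexities, **kwargs):
    """
    Return the value in possible_complexities with the highest mean 10-fold
    cross-validation score, along with that score. 
    model is a model class (not an instance), complexity_kw is the name of 
    its complexity parameter, and **kwargs are passed on to the model unchanged.
    """
    best_C, best_score = None, 0
    for C in possible_complexities:
        # unpack {name : value} as a keyword argument to the model
        comp = {complexity_kw : C}
        m = model(**comp, **kwargs)
        
        cv_score = cross_val_score(m, X, y, cv=10).mean()
        if cv_score > best_score:
            best_C = C
            best_score = cv_score
            
    return best_C, best_score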

In [58]:
select_complexity(linear_model.LogisticRegression, X, y, "C", 10.0**np.arange(-5, 5), solver = "liblinear")

(100.0, 0.7746880570409982)

In [61]:
select_complexity(tree.DecisionTreeClassifier, X, y, "max_depth", range(1, 30))

(9, 0.7958110516934046)
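
The decision tree runs above report different best depths (24, 22, and 9) even though they search the same depths on the same data. A likely reason is that `DecisionTreeClassifier` breaks ties between equally good splits at random unless its `random_state` argument is fixed, so the cross-validation scores can change slightly from run to run. The sketch below shows how `random_state` can be forwarded through `**kwargs` to make the search repeatable. It repeats a compact copy of `select_complexity` so that it runs on its own, and it uses scikit-learn's built-in iris data as a stand-in for the penguins (an assumption made only to avoid the download).

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

def select_complexity(model, X, y, complexity_kw, possible_complexities, **kwargs):
    # compact copy of the function defined above
    best_C, best_score = None, 0
    for C in possible_complexities:
        m = model(**{complexity_kw : C}, **kwargs)
        cv_score = cross_val_score(m, X, y, cv=10).mean()
        if cv_score > best_score:
            best_C, best_score = C, cv_score
    return best_C, best_score

# stand-in data (assumption): iris instead of the penguins
X_demo, y_demo = load_iris(return_X_y = True)

# random_state = 0 is passed on to DecisionTreeClassifier via **kwargs
first  = select_complexity(tree.DecisionTreeClassifier, X_demo, y_demo,
                           "max_depth", range(1, 10), random_state = 0)
second = select_complexity(tree.DecisionTreeClassifier, X_demo, y_demo,
                           "max_depth", range(1, 10), random_state = 0)

print(first == second)
```

With the seed fixed, both calls should return the same depth and score. Because only strictly better scores replace the current best, ties between depths go to the smallest depth tried first.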

We can even use this for cases in which each possible complexity value is a structured object, like a list or tuple. For example, `MLPClassifier` takes `hidden_layer_sizes` as a tuple with one entry per hidden layer. 

In [69]:
from sklearn import neural_network
from itertools import product

layer_sizes = [10, 50, 100]
layer_configs = product(layer_sizes, layer_sizes, layer_sizes) # each parameter is actually a spec for 3 layers

select_complexity(neural_network.MLPClassifier, X, y, "hidden_layer_sizes", layer_configs, solver = "adam", max_iter = 1000000)

((50, 10, 100), 0.5705882352941176)
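
One caveat about the cell above: `product` returns an iterator, which can only be looped over once. After `select_complexity` has consumed `layer_configs`, a second call with the same object would have no candidates left to try. The short sketch below illustrates this and shows one way around it, which is to store the configurations in a list. The name `reusable_configs` is introduced here only for illustration.

```python
from itertools import product

layer_sizes = [10, 50, 100]

configs = product(layer_sizes, layer_sizes, layer_sizes)
first_pass  = list(configs)   # consumes the iterator
second_pass = list(configs)   # nothing is left the second time
print(len(first_pass), len(second_pass))   # 27 0

# a list can be looped over as many times as needed
reusable_configs = list(product(layer_sizes, repeat = 3))
print(len(reusable_configs))               # 27
```

Passing `reusable_configs` instead of `layer_configs` would let the same set of configurations be reused across several calls.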