# Cross Validation Notes

### Review: Why use train and test data?
- gives estimate of performance on new dataset
- serves as check on over-fitting

In [1]:
from sklearn.model_selection import train_test_split

In [3]:
from sklearn import model_selection

### Train, Transform, and Predicting Process
1) train test split
2) pca
3) svm (or other method)

#### Flow:
1) pca.fit(train features)

2) train features = pca.transform(train features)

3) svc.fit(train features, train labels)

4) test features = pca.transform(test features)

5) pred = svc.predict(test features)
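
A minimal runnable sketch of the five steps above. The iris dataset, `n_components=2`, the `rbf` kernel, and the 70/30 split are assumptions chosen for illustration, not part of the original notes. The key point is that PCA is fit on the training features only, and the same fitted transform is then reused on the test features.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in dataset (assumption): any feature matrix and label vector would do.
X, y = load_iris(return_X_y=True)
features_train, features_test, labels_train, labels_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 1) Fit PCA on the training features only.
pca = PCA(n_components=2)
pca.fit(features_train)

# 2) Transform the training features.
features_train_pca = pca.transform(features_train)

# 3) Train the classifier on the transformed training features.
svc = SVC(kernel="rbf")
svc.fit(features_train_pca, labels_train)

# 4) Transform the test features with the PCA fitted in step 1 (no refit).
features_test_pca = pca.transform(features_test)

# 5) Predict on the transformed test features.
pred = svc.predict(features_test_pca)
accuracy = svc.score(features_test_pca, labels_test)
print(accuracy)
```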

## K-fold Cross Validation

#### Problems with splitting testing and training data:
- trade-off between the amount of data to train vs. to test

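#### Solution: partition data into k bins of equal size
- run k separate experiments; in each one:
  - pick one bin as the test set
  - train on the remaining k-1 bins
  - test on the testing set
- average the test results from the k experiments
- max accuracy: every data point gets used for both training and testing

##### Tradeoff: k-fold can be expensive (k training runs); a single train/test split is quicker (min training time, min run time)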
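
A rough sketch of the k-fold procedure, written as an explicit loop. The iris dataset, the linear `SVC`, and `k = 5` are assumptions for illustration only. Each fold trains a fresh classifier, which is where the extra cost comes from.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Stand-in dataset and classifier (assumptions, not from the notes).
X, y = load_iris(return_X_y=True)
k = 5
cv = KFold(n_splits=k, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in cv.split(X):
    # One experiment: train on k-1 bins, test on the held-out bin.
    clf = SVC(kernel="linear")
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

# Summarize the k experiments by averaging their test accuracies.
mean_accuracy = np.mean(scores)
print(scores, mean_accuracy)
```

`sklearn.model_selection.cross_val_score` wraps the same loop in one call if the per-fold indices are not needed.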

## In sklearn

There's a simple way to randomize the events in sklearn k-fold CV: set the shuffle flag to true.

Then you'd go from something like this:


In [None]:
cv = model_selection.KFold(n_splits=2)
# n_splits: how many folds to look at
# the dataset itself is passed later, to cv.split(...)

# To something like this:

cv = model_selection.KFold(n_splits=2, shuffle=True)

#### Gives 2 lists per fold:
1) indices to use in train set

2) indices to use in test set

Implementation would look like this:

In [None]:
for train_indices, test_indices in cv.split(word):
    features_train = [word[ii] for ii in train_indices]
    features_test = [word[ii] for ii in test_indices]
    labels_train = [labels[ii] for ii in train_indices]
    labels_test = [labels[ii] for ii in test_indices]
    
    print(train_indices)
    print(labels_train)
    print(labels_test)

### Problem: can get errors if data has imbalanced classes or is not randomized
ex: all of one label in the first half of the dataset, then none in the second half

To fix this, we set the shuffle parameter to True:

In [5]:
cv = model_selection.KFold(n_splits=2, shuffle=True)

## GridSearchCV

GridSearchCV is a way of systematically working through multiple combinations of parameter tunes, cross-validating as it goes to determine which tune gives the best performance. It can work through many combinations in only a couple of extra lines of code (thanks Udacity!)

sklearn example: 

In [6]:
from sklearn.model_selection import GridSearchCV

#### Parameters overview:

In [None]:
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}

A dictionary of the parameters, and the possible values they may take. In this case, they're playing around with the kernel (possible choices are 'linear' and 'rbf'), and C (possible choices are 1 and 10).



Then a 'grid' of all the following combinations of values for (kernel, C) is automatically generated:

- ('rbf', 1)
- ('rbf', 10)
- ('linear', 1)
- ('linear', 10)

Each is used to train an SVM, and the performance is then assessed using cross-validation.
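
To see the grid explicitly, sklearn's `ParameterGrid` expands the same kind of dictionary into its individual combinations. This is only a sketch for inspecting the grid; GridSearchCV builds it for you, so this step is not needed in practice.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# Each element of ParameterGrid is a dict such as {'C': 1, 'kernel': 'linear'}.
combos = [(p['kernel'], p['C']) for p in ParameterGrid(parameters)]

for kernel, C in combos:
    print(kernel, C)
```

Two kernels times two values of C gives four combinations, each of which gets its own cross-validated fit.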

In [None]:
from sklearn import svm
svr = svm.SVC()

This looks kind of like creating a classifier, just like we've been doing since the first lesson. But note that the "clf" isn't made until the next line--this is just saying what kind of algorithm to use. Another way to think about this is that the "classifier" isn't just the algorithm in this case, it's algorithm plus parameter values. Note that there's no monkeying around with the kernel or C; all that is handled in the next line.

In [None]:
clf = GridSearchCV(svr, parameters)

The classifier is being created. We pass the algorithm (svr) and the dictionary of parameters to try (parameters) and it generates a grid of parameter combinations to try.

In [None]:
# iris comes from sklearn.datasets.load_iris()
clf.fit(iris.data, iris.target)

The fit function now tries all the parameter combinations, and returns a fitted classifier that's automatically tuned to the optimal parameter combination.

You can now access the parameter values via: 

In [None]:
clf.best_params_
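
Putting the cells above together, an end-to-end sketch might look like the following. The explicit `cv=5` is an assumption added here for clarity; the cells above leave it at the library default. `best_score_` and `cv_results_` are shown as extra attributes that are useful alongside `best_params_`.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# The algorithm to tune; kernel and C are left to the grid search.
svr = svm.SVC()

# Try every (kernel, C) combination with 5-fold cross-validation.
clf = GridSearchCV(svr, parameters, cv=5)
clf.fit(iris.data, iris.target)

print(clf.best_params_)                      # winning parameter combination
print(clf.best_score_)                       # its mean cross-validated accuracy
print(clf.cv_results_['mean_test_score'])    # one mean score per combination
```

After fitting, `clf.predict(...)` uses the best combination, refit on the full data passed to `fit`.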