# Scikit-Learn
## The Data Scientist's Secret Weapon

This is essentially the tutorial from scikit-learn's website, with a little added commentary. Scikit-learn is an interesting library: I don't always see it in competition submissions, but nearly everybody has used it because it is an excellent learning tool. It even comes with datasets, so you can start analyzing right off the bat.

In [33]:
from sklearn import datasets, metrics, svm
from sklearn.tree import DecisionTreeClassifier
import IPython.display

### Loading An Example Dataset

In [34]:
# load the iris dataset
iris = datasets.load_iris()
# load the digits dataset
digits = datasets.load_digits()

### Show The Data Used For Classification

In [35]:
digits.data

array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
       [  0.,   0.,   0., ...,  10.,   0.,   0.],
       [  0.,   0.,   0., ...,  16.,   9.,   0.],
       ..., 
       [  0.,   0.,   1., ...,   6.,   0.,   0.],
       [  0.,   0.,   2., ...,  12.,   0.,   0.],
       [  0.,   0.,  10., ...,  12.,   1.,   0.]])

### Show The Values To Be Predicted

In [51]:
type(digits.target)

numpy.ndarray

### Shape Of The Data Arrays

The data is always a 2D array of shape (n_samples, n_features), although the original data may have had a different shape. In the case of the digits, each original sample is an image of shape $(8, 8)$, which is flattened into 64 features in `digits.data`. The original image can be accessed using:

In [37]:
digits.images[0]

array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
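To make the relationship between the two arrays concrete, here is a minimal sketch that flattens the images and compares the result with `digits.data`. The variable names `flat` and `same` are only for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image row by row into a 64-element feature vector.
n_samples = len(digits.images)
flat = digits.images.reshape((n_samples, -1))

print(digits.images.shape)  # (n_samples, 8, 8)
print(flat.shape)           # (n_samples, 64)

# The flattened images should match the 2D array used for classification.
same = np.array_equal(flat, digits.data)
print(same)
```

The same reshape is what you would apply to your own image data before handing it to an estimator.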

In [38]:
IPython.display.IFrame(src="https://en.wikipedia.org/wiki/Support_vector_machine", width="100%", height=300)

In [39]:
clf = svm.SVC(gamma=0.001, C=100.)
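With the estimator configured, the usual next step is to fit it and predict. The following is a minimal sketch, assuming we hold out only the last sample for a quick sanity check; a proper evaluation would use a larger held-out set.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.)

# Train on every sample except the last one, which is held out.
clf.fit(digits.data[:-1], digits.target[:-1])

# predict expects a 2D array, so slice with [-1:] instead of indexing with [-1].
predicted = clf.predict(digits.data[-1:])
print("predicted:", predicted[0], "actual:", digits.target[-1])
```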

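Rather than picking `gamma` and `C` by hand, you can tune them with what is known as grid search: every combination from a set of candidate values is tried, and the one that scores best under cross-validation is kept. Here is a simple example. There are several ways of implementing a grid search in sklearn.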

In [40]:
IPython.display.IFrame(src="https://en.wikipedia.org/wiki/Hyperparameter_optimization", width="100%", height=300)

In [41]:
# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

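# SVC does not accept a parameter grid; GridSearchCV wraps the estimator and searches the grid
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(svm.SVC(), tuned_parameters)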
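To see the search run end to end, here is a minimal sketch using `GridSearchCV` from `sklearn.model_selection`. The 50/50 train/test split, `random_state=0`, and `cv=5` are assumptions chosen for illustration, not settings taken from the text above.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

digits = datasets.load_digits()

# Hold out half of the data so the tuned model is scored on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

# Same candidate grid as above: an RBF kernel over gamma and C, and a linear kernel over C.
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# Each candidate is scored with 5-fold cross-validation on the training half.
search = GridSearchCV(svm.SVC(), tuned_parameters, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
test_score = search.score(X_test, y_test)
print(test_score)
```

After fitting, `search` behaves like the best estimator it found, so `search.predict(...)` can be used directly.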