We will use [scikit-learn](http://scikit-learn.org/stable/index.html).

<img src='files/resources/scikit-learn-logo-small.png' align='left'><h2>Machine Learning in Python</h2>

<br>
* Simple and efficient tools for data mining and data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license

## Loading Data

We always start with a data set:

In [1]:
from sklearn import datasets
iris = datasets.load_iris()

The features are stored in `iris.data`, a 2-D array of shape (n_samples, n_features):

In [69]:
print('Data = ' + str(iris.data[0:5,]))
print('\nShape = ' + str(iris.data.shape))

Data = [[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]

Shape = (150, 4)


The target (the class label of each sample) is stored in `iris.target`, which is usually a 1-D array of shape (n_samples,):

In [70]:
print('Target = ' + str(iris.target))
print('\nShape = ' + str(iris.target.shape))

Target = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

Shape = (150,)
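
The integer targets are easier to read once they are mapped back to species names. The following is a small sketch that uses the `feature_names` and `target_names` attributes of the loaded dataset; the variable `first_labels` is introduced here only for illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Names of the four feature columns and of the three classes
print(iris.feature_names)
print(iris.target_names)

# Map the first five integer targets back to their species names
first_labels = iris.target_names[iris.target[:5]]
print(first_labels)
```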


## Training / Test Data

We want to fit a classifier on a portion of this data (the training data) and retain the rest for model testing (the test data). The scikit-learn API includes a function, `train_test_split`, that splits the data randomly:

In [59]:
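# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split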

data_train, data_test, target_train, target_test = train_test_split(iris.data, iris.target, 
                                                                    train_size = 0.6, stratify=iris.target,
                                                                    random_state = 72) 

print('Training data shape   = ' + str(data_train.shape))
print('Testing data shape    = ' + str(data_test.shape))
print('Training target shape = ' + str(target_train.shape))
print('Testing target shape  = ' + str(target_test.shape))

Training data shape   = (90, 4)
Testing data shape    = (60, 4)
Training target shape = (90,)
Testing target shape  = (60,)
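
Because the split above passes `stratify=iris.target`, each class should appear in the training and test sets in the same proportion as in the full data set. One way to check this, sketched here with `numpy.bincount` (the names `train_counts` and `test_counts` are illustrative), is to count the labels in each part:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

data_train, data_test, target_train, target_test = train_test_split(
    iris.data, iris.target,
    train_size=0.6, stratify=iris.target, random_state=72)

# Count how many samples of each class ended up in each part
train_counts = np.bincount(target_train)
test_counts = np.bincount(target_test)

print('Training class counts = ' + str(train_counts))
print('Testing class counts  = ' + str(test_counts))
```

With 50 samples per class and a 60/40 split, each class contributes 30 training samples and 20 test samples.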


## Selecting a Model and Hyperparameters

We want to classify the iris flowers into 3 categories, so we need a multi-class classifier. We will use a support vector classifier (`SVC`). We will also set the model's hyperparameters; initially we will use a linear kernel.

<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'> Note that each machine learning algorithm has its own set of hyperparameters.  
The [scikit-learn API reference](http://scikit-learn.org/stable/index.html) is a great place to learn about them.

In [60]:
from sklearn import svm

clf = svm.SVC(kernel='linear')
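
Hyperparameters are passed to the estimator's constructor and can be inspected or changed before fitting. The sketch below uses `get_params()` and `set_params()`, which scikit-learn estimators provide; the specific values (`C=10.0`, `gamma='scale'`, `C=0.5`) are arbitrary choices for illustration, not recommendations.

```python
from sklearn import svm

# Two candidate models that differ only in their hyperparameters
linear_clf = svm.SVC(kernel='linear')
rbf_clf = svm.SVC(kernel='rbf', C=10.0, gamma='scale')

# Inspect the hyperparameters of an (unfitted) model
print(linear_clf.get_params()['kernel'])
print(rbf_clf.get_params()['C'])

# Change a hyperparameter after construction
linear_clf.set_params(C=0.5)
print(linear_clf.C)
```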

## Fitting the Model

At this point we have only defined the model; it has not been fitted yet. Now let's fit it to the training data using the `fit()` method.  

<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'> Every estimator (model) in scikit-learn has a `fit()` method, whether it is a regression, classification or clustering algorithm.  
For other methods check the [scikit-learn API reference](http://scikit-learn.org/stable/index.html).

In [5]:
clf.fit(X=data_train, y=target_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
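
Fitting stores what the model has learned in attributes whose names end with an underscore. As a rough sketch, assuming the same data split and linear kernel as above, a few of the fitted attributes of `SVC` can be inspected like this:

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
data_train, data_test, target_train, target_test = train_test_split(
    iris.data, iris.target,
    train_size=0.6, stratify=iris.target, random_state=72)

clf = svm.SVC(kernel='linear')
clf.fit(X=data_train, y=target_train)

# Class labels seen during fitting
print(clf.classes_)

# Number of support vectors found for each class
print(clf.n_support_)

# With a linear kernel, one weight vector per pair of classes:
# 3 class pairs x 4 features
print(clf.coef_.shape)
```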

## Predicting with the Model

Now we can use the fitted model to predict unknown values.

<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'> Every classifier and regressor in scikit-learn has a `predict()` method, which always requires an `X` argument.  
For other methods check the [scikit-learn API reference](http://scikit-learn.org/stable/index.html).

In [6]:
pred = clf.predict(X=data_test)

print(pred)

[2 1 1 1 1 2 1 0 1 0 0 0 0 0 0 1 2 1 2 2 1 1 1 0 0 0 2 2 1 0 2 2 2 0 1 2 0
 2 1 0 1 1 1 0 2 0 2 2 2 0 2 2 0 2 0 1 1 0 1 1]
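
`predict()` expects a 2-D array even for a single sample, so one flower has to be passed as a single row. The sketch below assumes the same split and model as above; the measurements in `new_flower` are made-up values used only for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
data_train, data_test, target_train, target_test = train_test_split(
    iris.data, iris.target,
    train_size=0.6, stratify=iris.target, random_state=72)

clf = svm.SVC(kernel='linear')
clf.fit(X=data_train, y=target_train)

# Hypothetical measurements for one flower, shaped (1, n_features)
new_flower = np.array([[6.0, 2.9, 4.5, 1.5]])

new_pred = clf.predict(X=new_flower)

# Translate the predicted integer label into a species name
predicted_name = iris.target_names[new_pred[0]]
print(new_pred, predicted_name)
```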


## Measuring Performance

We can use metrics from sklearn to quantify performance:

In [3]:
from sklearn.metrics import accuracy_score

# accuracy_score expects the true labels first, then the predictions
print("accuracy = {0:4.1f}% ".format(accuracy_score(target_test, pred)*100))

accuracy = 98.3% 
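
Accuracy is a single number; a confusion matrix shows which classes are confused with which. The following sketch, assuming the same split and model as above, uses `confusion_matrix` and `classification_report` from `sklearn.metrics`:

```python
from sklearn import datasets, svm
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
data_train, data_test, target_train, target_test = train_test_split(
    iris.data, iris.target,
    train_size=0.6, stratify=iris.target, random_state=72)

clf = svm.SVC(kernel='linear')
clf.fit(X=data_train, y=target_train)
pred = clf.predict(X=data_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(target_test, pred)
print(cm)

# Per-class precision, recall and F1 score
print(classification_report(target_test, pred,
                            target_names=iris.target_names))
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total number of test samples equals the accuracy.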


## More information

Start with the [scikit-learn website](http://scikit-learn.org/stable/index.html).

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

data_train, data_test, target_train, target_test = train_test_split(
    iris.data, iris.target,
    train_size=0.6, stratify=iris.target, random_state=72)

from sklearn import svm

clf = svm.SVC(kernel='linear')

clf.fit(X=data_train, y=target_train)

pred = clf.predict(X=data_test)

After fitting a model you can serialize it to disk with the `joblib` library (older scikit-learn releases bundled it as `sklearn.externals.joblib`, which has since been removed):

In [7]:
import joblib
joblib.dump(clf, 'filename.pkl')
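
A saved model is only useful if it can be loaded again. As a minimal sketch, the round trip below dumps a fitted classifier with `joblib.dump`, restores it with `joblib.load`, and checks that both give the same predictions. The file name `iris_svc.pkl` and the temporary directory are illustrative choices, and for brevity the model is fitted on the full data set here.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear').fit(iris.data, iris.target)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'iris_svc.pkl')  # hypothetical file name
    joblib.dump(clf, path)          # write the fitted model to disk
    restored = joblib.load(path)    # read it back as a new object

# The restored model should predict exactly like the original
same = np.array_equal(clf.predict(iris.data), restored.predict(iris.data))
print(same)
```

Files written this way are pickle-based, so only load files from sources you trust, and ideally with the same scikit-learn version that created them.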