# Introduction to scikit-learn

Scikit-learn is a Python library for machine learning. It implements many useful modules for preprocessing, feature extraction, classification and regression, and model selection and evaluation.

## Numpy
To use it effectively you need basic familiarity with numpy. You can review the most important numpy concepts by going through the notebooks for the [Data Processing Advanced Course](https://github.com/tcsai/data-proc-adv).

The handy thing is that you can create numpy arrays from Python lists.


In [31]:
import numpy
y_list = [1, 2, 1]
X_list = [[0.5, 0.4],
          [0.3, 2.3],
          [0.4, 0.5 ]]
y = numpy.array(y_list)
X = numpy.array(X_list)
print(y.shape)
print(X.shape)    
print(X[2,1]) # third row, second column
print(X[0,:]) # first row
print(X[:,1]) # second column


(3,)
(3, 2)
0.5
[ 0.5  0.4]
[ 0.4  2.3  0.5]


In [32]:
Z = numpy.array([[[3,4],[4,5]],[[3,4],[4,5]]])
print(Z[:,0,0])

[3 3]


## Scikit-learn
Let's do a guided tour of some of the functionality in scikit-learn.

### Datasets

Scikit-learn includes some sample datasets:


In [33]:
import sklearn.datasets as datasets

iris = datasets.load_iris()
boston = datasets.load_boston()  # removed in scikit-learn 1.2; this cell needs an older version
cancer = datasets.load_breast_cancer()

In [34]:
print(iris.DESCR)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

In [35]:
print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

Let's extract some basic info from the datasets themselves.

In [36]:
print(iris.target.shape)
print(iris.data.shape)

(150,)
(150, 4)


In [37]:
print(boston.target.shape)
print(boston.data.shape)

(506,)
(506, 13)


In [38]:
print(iris.target.dtype)
print(boston.target.dtype)

int64
float64


#### Exercise 4.1

Find out the following information for the breast cancer dataset:
- number of examples
- number of features
- type of target
- number of unique values of target
- the minimum, mean, and maximum of each feature
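
As a hint, here is a minimal sketch of the same kind of inspection on the iris dataset instead of the breast cancer one. The names `iris_demo`, `X_iris` and `y_iris` are introduced only for this illustration; the key idea is that `axis=0` reduces over rows and so gives one value per feature.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_demo = load_iris()
X_iris, y_iris = iris_demo.data, iris_demo.target

print(X_iris.shape[0])          # number of examples
print(X_iris.shape[1])          # number of features
print(y_iris.dtype)             # type of target
print(np.unique(y_iris).size)   # number of unique target values

# axis=0 reduces over rows, giving one value per feature (column)
print(X_iris.min(axis=0))
print(X_iris.mean(axis=0))
print(X_iris.max(axis=0))
```

The same calls should carry over to the `cancer` object loaded earlier.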

### Training, validation, test

If your data does not come with predefined splits into training, validation and test sets, you will need to define the splits yourself.

This is how to do it with scikit-learn.

In [39]:
from sklearn.model_selection import train_test_split

First we'll split off our final test set.

In [40]:
X_rest, X_test, y_rest, y_test = \
   train_test_split(boston.data, 
                    boston.target, 
                    test_size=0.25,
                    random_state=123)  # random seed makes this reproducible


Now we can split the rest into training and validation data.

In [41]:
X_train, X_val, y_train, y_val = train_test_split(X_rest, 
                                                  y_rest, 
                                                  test_size=y_test.shape[0],
                                                  random_state=123)

In [42]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(252, 13)
(127, 13)
(127, 13)


In [43]:
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(252,)
(127,)
(127,)
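
The sizes above follow from how `test_size` is interpreted: a float is a fraction of the rows (rounded up for the test part), while an integer is an absolute number of rows. The sketch below reproduces the two-step split on placeholder arrays of zeros with the same shape as the Boston data; the `*_f` names are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the Boston dataset (values are irrelevant here).
X_fake = np.zeros((506, 13))
y_fake = np.zeros(506)

# Step 1: a float test_size holds out a fraction of the rows.
X_rest_f, X_test_f, y_rest_f, y_test_f = train_test_split(
    X_fake, y_fake, test_size=0.25, random_state=123)

# Step 2: an integer test_size holds out exactly that many rows,
# so the validation set gets the same size as the test set.
X_train_f, X_val_f, y_train_f, y_val_f = train_test_split(
    X_rest_f, y_rest_f, test_size=len(y_test_f), random_state=123)

print(X_train_f.shape[0], X_val_f.shape[0], X_test_f.shape[0])
```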


### Preprocessing

There are various ways you can pre-process your data, for example:
- normalize/standardize features
- add feature interactions
- run dimensionality reduction

We'll standardize the features and add interactions to the Boston dataset.
The preprocessing classes share a similar interface. We will use two of its methods:

- `fit_transform`: gather and store necessary statistics, and transform given data
- `transform`: transform given data using stored statistics

We run `fit_transform` on the training data only, and `transform` on the validation and test data, so that no information from held-out data leaks into the stored statistics.
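
To make the difference between the two methods concrete, here is a small sketch on made-up toy arrays (`X_tr` and `X_new` are not part of the notebook's data). It shows that `StandardScaler` stores the training mean and standard deviation in `mean_` and `scale_`, and that `transform` only reuses them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): 4 training rows, 2 new rows.
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_new = np.array([[2.5, 25.0], [5.0, 50.0]])

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learns mean/std from X_tr, then scales it

# The learned statistics are stored on the fitted object.
print(scaler.mean_)    # per-feature mean of the training data
print(scaler.scale_)   # per-feature standard deviation of the training data

# transform reuses the training statistics; it does not re-estimate them.
X_new_scaled = scaler.transform(X_new)
manual = (X_new - scaler.mean_) / scaler.scale_
print(np.allclose(X_new_scaled, manual))
```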


**Z-score features**

In [44]:
from sklearn.preprocessing import StandardScaler
zscore = StandardScaler()
X_train_z = zscore.fit_transform(X_train)
X_val_z   = zscore.transform(X_val)

In [45]:
print(numpy.mean(X_train_z, axis=0))
print(numpy.std(X_train_z, axis=0))

[  1.60145266e-16  -1.14987385e-16   1.20802839e-15  -1.35693925e-16
   4.69245454e-15  -2.40250941e-15   1.89351951e-15  -1.05328004e-15
   1.46267478e-16  -2.46716228e-17   1.49736925e-14   3.19695769e-15
   4.37921304e-16]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]


In [46]:
from sklearn.preprocessing import PolynomialFeatures
inter = PolynomialFeatures(degree=2, 
                           interaction_only=True, 
                           include_bias=False)
X_train_zi = inter.fit_transform(X_train_z)
X_val_zi = inter.transform(X_val_z)
print(X_train_z.shape)
print(X_train_zi.shape)

(252, 13)
(252, 91)
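
The jump from 13 to 91 columns can be checked by hand: with `interaction_only=True` and `degree=2`, the output keeps the original features and adds one product per pair of distinct features. The sketch below uses a single made-up row with three features; `inter_demo` is a name chosen here to avoid overwriting `inter`.

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

row = np.array([[1.0, 2.0, 3.0]])  # one example with features x0, x1, x2
inter_demo = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
expanded = inter_demo.fit_transform(row)
print(expanded)  # x0, x1, x2 followed by x0*x1, x0*x2, x1*x2

# Column count for n features: n originals plus one product per feature pair.
n = 13
print(n + comb(n, 2))  # 13 + 78 = 91
```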


### Fit a model

Now let's fit a linear regression model to the three versions of the dataset:
- original
- z-scored
- z-scored and interactions

Regression and classification models all have the same interface:
- `fit`: train model on given data
- `predict`: return predictions on given data

In [47]:
from sklearn.linear_model import LinearRegression

model    = LinearRegression()
model_z  = LinearRegression()
model_zi = LinearRegression()

model.fit(   X_train,    y_train)
model_z.fit( X_train_z,  y_train)
model_zi.fit(X_train_zi, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

We can now make predictions.

In [22]:
y_pred    = model.predict(X_val)
y_pred_z  = model_z.predict(X_val_z)
y_pred_zi = model_zi.predict(X_val_zi)


And check the error.

In [24]:
from sklearn.metrics import mean_absolute_error

In [25]:
print(mean_absolute_error(y_val, y_pred))
print(mean_absolute_error(y_val, y_pred_z))
print(mean_absolute_error(y_val, y_pred_zi))

3.57577432983
3.57577432983
2.62890923399


We can see that for this particular implementation of linear regression, z-scoring makes no difference. Adding interactions improves performance substantially. Let's also compute the proportion of variance explained.
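
The identical errors for the raw and z-scored features are expected for unregularized least squares with an intercept: rescaling and shifting the features changes the coefficients but not the fitted predictions. A minimal sketch on synthetic data (all names and values below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Three synthetic features on very different scales.
X_demo = rng.normal(size=(100, 3)) * np.array([1.0, 50.0, 0.1])
y_demo = X_demo @ np.array([2.0, -0.1, 30.0]) + rng.normal(size=100)

X_fit, X_hold = X_demo[:80], X_demo[80:]
y_fit = y_demo[:80]

scaler = StandardScaler().fit(X_fit)
pred_raw = LinearRegression().fit(X_fit, y_fit).predict(X_hold)
pred_z = (LinearRegression()
          .fit(scaler.transform(X_fit), y_fit)
          .predict(scaler.transform(X_hold)))

# Predictions agree up to floating-point error.
print(np.allclose(pred_raw, pred_z))
```

This invariance does not generally hold for regularized models such as ridge or lasso, where feature scale affects the penalty.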

In [26]:
from sklearn.metrics import r2_score

In [27]:
print(r2_score(y_val, y_pred))
print(r2_score(y_val, y_pred_zi))

0.703732092321
0.860314645433


So adding feature interactions cuts the unexplained variance roughly in half (from about 0.30 to about 0.14).
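
To see where that statement comes from, recall that R² is one minus the ratio of the residual sum of squares to the total sum of squares, so `1 - R²` is the unexplained share. The sketch below first checks the definition against `r2_score` on made-up toy values, then applies it to the two validation scores printed above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values, made up for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))

# Unexplained share for the two validation scores reported above.
unexplained_base = 1.0 - 0.703732092321
unexplained_inter = 1.0 - 0.860314645433
ratio = unexplained_inter / unexplained_base
print(ratio)  # about 0.47
```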

### Classification

For classification, commonly used evaluation metrics include:
- `accuracy_score`
- `f1_score`

Check the documentation for `f1_score`, and figure out how to compute the micro- and macro-averaged versions.

In [30]:
from sklearn.metrics import accuracy_score, f1_score
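
As a starting point for the documentation lookup, here is a small sketch on made-up labels. It assumes single-label multiclass targets and uses the `average` parameter of `f1_score`; the label lists are invented for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up true and predicted labels for three classes.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")    # pools all decisions
f1_macro = f1_score(y_true, y_pred, average="macro")    # unweighted mean over classes
f1_per_class = f1_score(y_true, y_pred, average=None)   # one score per class

print(acc, f1_micro, f1_macro)
print(f1_per_class)
```

In this single-label setting the micro-averaged F1 equals accuracy, while the macro average weights each class equally regardless of its frequency.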

#### Exercise 4.2

Carry out the following steps for the `iris` dataset:

- split into train, validation and test set
- z-score features
- add polynomial features
- fit a Decision Tree model on three versions of the training data
- evaluate each model according to accuracy and f1_score

#### Exercise 4.3

Carry out the following steps for the `boston` dataset:

- Split into train and test data
- Add polynomial features
- Instead of using a validation set for model selection, use 10-fold cross-validation. Figure out how to do this using the scikit-learn `KFold` class: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold
   - for each fold, fit a Decision Tree to both versions of the training data and record the error
- Report the mean error, as well as its standard deviation across all folds
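
As a hint on the mechanics only, the sketch below shows how `KFold.split` yields index arrays on a tiny made-up array (`X_toy`), using 5 folds rather than 10 to keep the output short. Fitting the models and recording the errors is left to the exercise.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # 10 made-up examples, 2 features

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for train_idx, val_idx in kf.split(X_toy):
    # train_idx and val_idx are integer index arrays into X_toy;
    # X_toy[train_idx] and X_toy[val_idx] select the rows for this fold.
    fold_sizes.append((len(train_idx), len(val_idx)))

print(fold_sizes)
```

Each example lands in the validation part of exactly one fold, so per-fold errors can be collected in a list and summarized with `numpy.mean` and `numpy.std`.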
