# sklearn package learning notes
Functions and methods of the sklearn package, learned from Udacity.

## GaussianNB
Gaussian Naive Bayes classifier

[`sklearn.naive_bayes.`__GaussianNB__](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)

__Example__

In [4]:
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X, Y)
print(clf.predict([[-0.8, -1]]))

[1]


## preprocessing module
The [`sklearn.preprocessing`](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) module includes scaling, centering, normalization, binarization and imputation methods.

In [5]:
from sklearn import preprocessing
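
As a small illustration of the scaling and centering mentioned above, the sketch below standardizes a made-up feature matrix with `preprocessing.StandardScaler`. The array `X` is invented for this example; after the transform each column should have mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn import preprocessing

# Made-up feature matrix: 3 samples, 3 features
X = np.array([[1.0, -1.0, 2.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.0, -1.0]])

# fit() learns the per-column mean and standard deviation,
# transform() subtracts the mean and divides by the standard deviation
scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # per-column mean after scaling
print(X_scaled.std(axis=0))   # per-column standard deviation after scaling
```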

### Label Encoder
__Example__

In [24]:
import pandas as pd
# creating sample data
sample_data = {'name': ['Ray', 'Adam', 'Jason', 'Varun', 'Xiao'],
               'health': ['fit', 'slim', 'obese', 'fit', 'slim'],
               'test': ['pass', 'fail', 'fail', 'fail', 'pass']}
# storing sample data in the form of a dataframe
data = pd.DataFrame(sample_data)
# fit the encoder on the 'health' column
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(data['health'])
# transform the labels into integer codes
label_encoder.transform(data['health'])
# label_encoder.fit_transform(data['health']) does both steps at once
label_encoder.fit_transform(data['test'])
# encode every column (the encoder is refit on each column in turn)
data_n = data.apply(label_encoder.fit_transform)
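
To see which integer stands for which label, a fitted encoder exposes `classes_`, and `inverse_transform` maps codes back to labels. The sketch below reuses the `health` values from the sample data as a plain list. Note that when one encoder is refit on several columns, as in `data.apply(label_encoder.fit_transform)`, only the mapping of the last column remains stored in it.

```python
from sklearn.preprocessing import LabelEncoder

# The 'health' column from the sample data, as a plain list
health = ['fit', 'slim', 'obese', 'fit', 'slim']

encoder = LabelEncoder()
codes = encoder.fit_transform(health)

# classes_ holds the labels in sorted order; a label's code is its index there
print(encoder.classes_.tolist())  # ['fit', 'obese', 'slim']
print(codes.tolist())             # [0, 2, 1, 0, 2]

# inverse_transform maps the integer codes back to the original labels
restored = encoder.inverse_transform(codes)
print(restored.tolist())
```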

### One-hot Encoder
__Example__

In [10]:
pd.get_dummies(data['health'])

|   | fit | obese | slim |
|---|-----|-------|------|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |


In [11]:
ohe = preprocessing.OneHotEncoder() # creating OneHotEncoder object
label_encoded_data = label_encoder.fit_transform(data['health'])
ohe.fit_transform(label_encoded_data.reshape(-1,1))

<5x3 sparse matrix of type '<type 'numpy.float64'>'
	with 5 stored elements in Compressed Sparse Row format>

In [27]:
ohe = preprocessing.OneHotEncoder()
Xt = ohe.fit_transform(data_n)
print(Xt)

  (0, 9)	1.0
  (0, 5)	1.0
  (0, 0)	1.0
  (1, 8)	1.0
  (1, 3)	1.0
  (1, 2)	1.0
  (2, 8)	1.0
  (2, 4)	1.0
  (2, 1)	1.0
  (3, 8)	1.0
  (3, 6)	1.0
  (3, 0)	1.0
  (4, 9)	1.0
  (4, 7)	1.0
  (4, 2)	1.0
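
The cells above label-encode the strings before one-hot encoding them. As a hedged alternative, assuming sklearn 0.20 or later, `OneHotEncoder` accepts string categories directly, so the intermediate `LabelEncoder` step can be skipped. The sketch uses the `health` values as a plain list of single-feature rows.

```python
from sklearn.preprocessing import OneHotEncoder

# One categorical feature per row (the 'health' values from the sample data)
health = [['fit'], ['slim'], ['obese'], ['fit'], ['slim']]

ohe = OneHotEncoder()
encoded = ohe.fit_transform(health)  # sparse matrix by default
dense = encoded.toarray()            # convert to a dense array for display

# categories_ lists, per feature, the category behind each output column
print(ohe.categories_[0].tolist())
print(dense)
```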


## Data Split
In sklearn version 0.17, `train_test_split` lives in the [sklearn.cross_validation](http://scikit-learn.org/0.17/modules/cross_validation.html) module.

From sklearn version 0.18 on, it lives in `sklearn.model_selection`: [sklearn.model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). The `sklearn.cross_validation` module was deprecated in 0.18 and removed in 0.20.

In [1]:
import numpy as np
from sklearn import cross_validation
# for version 0.18
# from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape
# output ((150, 4), (150,))
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

X_train.shape, y_train.shape
# output ((90, 4), (90,))
X_test.shape, y_test.shape
# output ((60, 4), (60,))

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)                           
# output 0.96...

0.96666666666666667
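
For sklearn 0.18 and later, the same split can be sketched with `sklearn.model_selection`. Only the import and the call site change; the data, split ratio and classifier are the ones used in the cell above.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold out 40% of the samples for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

print(X_train.shape, X_test.shape)

# Train a linear SVM on the training part and score it on the held-out part
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
```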

### $K$-Fold cross validation
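Split the data into $k$ folds. Each fold is used once as the test set while the remaining $k-1$ folds are used for training.

`cv = KFold(len(authors), 2, shuffle=True)` is the sklearn 0.17 signature (number of samples, then number of folds). From 0.18 on, use `cv = KFold(n_splits=2, shuffle=True)` and iterate over `cv.split(X)`.

`StratifiedKFold()` works the same way but preserves the class proportions in every fold.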
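
A minimal sketch of both splitters, assuming the sklearn 0.18+ API in `sklearn.model_selection`. The arrays `X` and `y` are made up for illustration: six samples, three per class.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(12).reshape(6, 2)   # 6 samples, 2 features (made-up data)
y = np.array([0, 0, 0, 1, 1, 1])  # 3 samples per class

# Plain k-fold: every sample lands in exactly one test fold
kf = KFold(n_splits=3, shuffle=True, random_state=0)
kf_test_sizes = []
for train_idx, test_idx in kf.split(X):
    kf_test_sizes.append(len(test_idx))
    print("train:", train_idx, "test:", test_idx)

# Stratified k-fold: each test fold keeps the class proportions of y
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
skf_test_labels = []
for train_idx, test_idx in skf.split(X, y):
    skf_test_labels.append(sorted(y[test_idx].tolist()))
    print("test labels:", y[test_idx])
```

With three samples per class and three folds, every stratified test fold contains one sample of each class, whereas plain `KFold` gives no such guarantee.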

## Evaluation Metrics (discrete data)
__Metric__: a quantity that shows how accurate the model's predictions are.

### Accuracy
> accuracy = number of correctly identified instances / all instances

`sklearn.metrics.accuracy_score()` function. [link](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score)
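
A short sketch of the definition above, with made-up labels: the value returned by `accuracy_score` should equal the fraction of matching entries counted by hand.

```python
from sklearn.metrics import accuracy_score

# Made-up labels: 5 of the 6 predictions match the truth
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)

# Same quantity by hand: correctly identified instances / all instances
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(acc, manual)
```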



### Confusion matrix
> A confusion matrix $C$ is such that $C_{ij}$ is equal to the number of observations known to be in group $i$ but predicted to be in group $j$. In `sklearn`, for binary labels 0 and 1, $C_{00}$ is True Negative (TN), $C_{01}$ is False Positive (FP), $C_{10}$ is False Negative (FN), $C_{11}$ is True Positive (TP).

`sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)`. [link](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix)
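
The layout described above can be checked on a small made-up binary example. Rows are the true labels and columns the predicted labels, so `ravel()` returns the entries in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Made-up binary labels
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

C = confusion_matrix(y_true, y_pred)
print(C)

# Row i = true class, column j = predicted class
tn, fp, fn, tp = C.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```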

### Recall and Precision

`sklearn.metrics.recall_score()` [link](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score)

`sklearn.metrics.precision_score()` [link](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score)

> __Recall__: If $N$ samples truly belong to class $i$ and the model predicts $N'$ of them as $i$, then Recall = $N'/N$.
$$
\text{Recall}= \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{C_{ii}}{\sum_j C_{ij}} .
$$
__Precision__: If the model predicts class $i$ for $N$ samples and $N'$ of those predictions are correct, then Precision = $N'/N$.
$$
\text{Precision}= \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{C_{ii}}{\sum_j C_{ji}}
$$
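
The two formulas can be checked against a confusion matrix directly. The sketch below uses made-up three-class labels: recall divides the diagonal by the row sums, precision divides it by the column sums, and `average=None` asks sklearn for one score per class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up three-class labels
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2, 2, 2, 1]

C = confusion_matrix(y_true, y_pred)

# Recall_i = C_ii / sum_j C_ij (row sums: all samples truly in class i)
recall_manual = np.diag(C) / C.sum(axis=1)
# Precision_i = C_ii / sum_j C_ji (column sums: all samples predicted as i)
precision_manual = np.diag(C) / C.sum(axis=0)

# average=None returns one score per class instead of a single average
recall = recall_score(y_true, y_pred, average=None)
precision = precision_score(y_true, y_pred, average=None)

print(recall, recall_manual)
print(precision, precision_manual)
```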

### F1 Score
`sklearn.metrics.f1_score()` [link](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score)

> F1 = 2 $\times$ (precision $\times$ recall) / (precision + recall)
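
A quick check of the formula on made-up binary labels. With its default settings, `f1_score` treats label 1 as the positive class, and its result should match the value computed from precision and recall.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up binary labels: TP = 2, FP = 1, FN = 2
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * (precision * recall) / (precision + recall)
f1 = f1_score(y_true, y_pred)

print(precision, recall, f1, f1_manual)
```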

## Evaluation Metrics (continuous data)

### Mean Absolute Error
`sklearn.metrics.mean_absolute_error()` [link](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error)

### Mean Squared Error
`sklearn.metrics.mean_squared_error()` [link](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error)

### R2 score
`sklearn.metrics.r2_score()` [link](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score)

### Explained variance score
`sklearn.metrics.explained_variance_score()` [link](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html#sklearn.metrics.explained_variance_score)
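
The four regression metrics above can be compared with hand-written NumPy versions. The values in `y_true` and `y_pred` are made up for illustration, and the manual formulas are the standard definitions, written out here as a sketch.

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Made-up targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
residual = y_true - y_pred

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
ev = explained_variance_score(y_true, y_pred)

# Hand-written versions of the same quantities
mae_manual = np.mean(np.abs(residual))
mse_manual = np.mean(residual ** 2)
# R2 = 1 - (residual sum of squares) / (total sum of squares)
r2_manual = 1 - np.sum(residual ** 2) / np.sum((y_true - y_true.mean()) ** 2)
# Explained variance = 1 - Var(residual) / Var(y_true)
ev_manual = 1 - np.var(residual) / np.var(y_true)

print(mae, mse, r2, ev)
```

The R2 score and the explained variance differ only when the residuals have a non-zero mean, as they do in this example.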

## Learning Curve
The learning curve functionality in sklearn shows how a model's training and validation scores change with the number of training points. This helps to judge whether the model is performing well, for example whether it is underfitting or overfitting.
```python
# sklearn 0.17:
# from sklearn.learning_curve import learning_curve
# sklearn 0.18 and later:
from sklearn.model_selection import learning_curve
learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
```
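1. `estimator` is the model, e.g. `GaussianNB()`.
2. `X`: features; `y`: labels.
3. `cv` is the cross-validation generator, which splits the data into training and test sets, e.g. `KFold()` or `ShuffleSplit()`. Note that `train_test_split()` returns the split arrays themselves and cannot be passed as `cv`.
4. `n_jobs`: the number of jobs to run in parallel.
5. `train_sizes`: the numbers (or fractions) of training examples used to generate the curve.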
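
Putting the parameters above together, a minimal sketch on the iris data with `GaussianNB`, assuming sklearn 0.19 or later (for the `shuffle` and `random_state` arguments of `learning_curve`). The fold count and the training sizes are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# 5-fold CV on 150 samples leaves 120 samples for training in each split
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# shuffle=True draws each training subset at random; without it the first
# samples would be taken, and iris is sorted by class
train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, cv=cv, n_jobs=1,
    train_sizes=[24, 48, 72, 96, 120],
    shuffle=True, random_state=0)

# One row per training size, one column per CV fold
print(train_sizes)
print(train_scores.mean(axis=1))  # mean training accuracy per size
print(test_scores.mean(axis=1))   # mean validation accuracy per size
```

Plotting the two mean-score arrays against `train_sizes` gives the learning curve itself.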