# Scikit-learn Pipelines

A scikit-learn `Pipeline` allows us to fit a model including an arbitrary number of transformation steps and apply it to make predictions about new data.

In [1]:
import pandas as pd
# Load the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI repository
df = pd.read_csv(
'https://archive.ics.uci.edu/ml/'
'machine-learning-databases'
'/breast-cancer-wisconsin/wdbc.data',
header=None
)

In [2]:
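# Columns 2 onward hold the 30 features; column 1 holds the diagnosis label
from sklearn.preprocessing import LabelEncoder
X = df.loc[:, 2:].values
y = df.loc[:, 1].values
# Encode the string class labels as integers
le = LabelEncoder()
y = le.fit_transform(y)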
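
To see which integer each class label receives, the fitted encoder can be inspected. The sketch below uses a small hand-made label array (`toy_labels`, an assumption standing in for column 1 of `df`) rather than the real data. `LabelEncoder` sorts the classes, so `'B'` (benign) maps to 0 and `'M'` (malignant) maps to 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels standing in for the diagnosis column of df
toy_labels = np.array(['M', 'B', 'B', 'M', 'B'])

toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_labels)

# classes_ holds the original labels in sorted order; the position is the code
class_names = toy_le.classes_.tolist()
codes_for_m_and_b = toy_le.transform(['M', 'B']).tolist()

print(class_names)          # ['B', 'M']
print(codes_for_m_and_b)    # [1, 0]
print(toy_encoded.tolist()) # [1, 0, 0, 1, 0]
```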

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1)

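Let's suppose now that we want to compress the data from the original 30-dimensional feature space to 2 dimensions with PCA. <br>
Instead of fitting and transforming the training and test datasets in separate steps, we can chain `StandardScaler`, `PCA`, and `LogisticRegression` in a pipeline.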

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression())
pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)
test_acc = pipe_lr.score(X_test, y_test)
print(f'Test accuracy: {test_acc:.3f}')

Test accuracy: 0.942


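The `make_pipeline` function takes an arbitrary number of scikit-learn transformers (objects that implement `fit` and `transform`), followed by an estimator that implements `fit` and `predict`. <br>
In this case we have two transformers (`StandardScaler`, `PCA`) and one estimator (`LogisticRegression`). <br>
Calling `pipe_lr.fit` runs, in order: `StandardScaler` fit and transform -> `PCA` fit and transform -> `LogisticRegression` fit. Calling `pipe_lr.predict` passes the data through the two `transform` steps and then through the `predict` method of `LogisticRegression`. <br>
N.B. if we want to use the `predict` method, the last pipeline element has to be an estimator.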
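
The order of operations described above can be checked by running the three steps by hand and comparing the result with the pipeline. This is a minimal sketch, not the notebook's own code: it assumes the copy of the dataset bundled with scikit-learn (`load_breast_cancer`) so that it runs offline, and the `*_demo` and `Xd_*` names are made up for the illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled copy of the dataset is used; note that its labels
# are coded malignant=0, benign=1, unlike the LabelEncoder output above.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=1)

# Manual version: every step is fitted on the training data only
sc = StandardScaler()
pca = PCA(n_components=2)
lr = LogisticRegression()
Xd_train_pca = pca.fit_transform(sc.fit_transform(Xd_train))
lr.fit(Xd_train_pca, yd_train)
# At prediction time the transformers only transform; they are not refitted
manual_pred = lr.predict(pca.transform(sc.transform(Xd_test)))

# Pipeline version: the same steps wrapped in one object
pipe_demo = make_pipeline(StandardScaler(),
                          PCA(n_components=2),
                          LogisticRegression())
pipe_demo.fit(Xd_train, yd_train)
pipe_pred = pipe_demo.predict(Xd_test)

same_predictions = np.array_equal(manual_pred, pipe_pred)
print(f'Identical predictions: {same_predictions}')
```

Besides being shorter, the pipeline version makes it harder to accidentally fit a transformer on the test data.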

# Holdout

We split our initial dataset into separate training and test datasets: the former is used for model training, the latter to estimate generalization performance. <br>
Model selection -> we want to select the optimal values of the tuning parameters (hyperparameters). Many people use the test dataset for model selection, but this is not good machine learning practice: if the same test data is reused for every tuning decision, it effectively becomes part of the training data and the performance estimate turns out too optimistic. <br>
Holdout method -> separate the data into three parts: a training, a validation, and a test dataset. <br>
The training dataset is used to fit the different models, and the performance on the validation dataset is then used for model selection.
<br> A disadvantage is that the performance estimate is sensitive to how we partition the training dataset into training and validation subsets.
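
A three-way holdout split can be sketched with two calls to `train_test_split`. The example below is illustrative only: it assumes synthetic data from `make_classification`, a hypothetical list of candidate values for the `C` hyperparameter of `LogisticRegression`, and a 60/20/20 split. Refitting on training plus validation data before the final test is one common choice, not the only one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data stands in for a real dataset
X_h, y_h = make_classification(n_samples=500, n_features=10, random_state=1)

# First split off the test set, then split the rest into training/validation
X_rest, X_te, y_rest, y_te = train_test_split(
    X_h, y_h, test_size=0.20, stratify=y_h, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=1)

# Model selection: compare candidate values of C on the validation set only
candidate_C = [0.01, 0.1, 1.0, 10.0]  # hypothetical search grid
val_scores = {}
for C in candidate_C:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C))
    model.fit(X_tr, y_tr)
    val_scores[C] = model.score(X_val, y_val)
best_C = max(val_scores, key=val_scores.get)

# Final estimate: refit on training + validation data, score once on test data
final_model = make_pipeline(StandardScaler(), LogisticRegression(C=best_C))
final_model.fit(X_rest, y_rest)
test_score = final_model.score(X_te, y_te)
print(f'Best C: {best_C}, test accuracy: {test_score:.3f}')
```

Changing `random_state` in the second split changes which samples end up in the validation set, and with it possibly the selected `C`, which is the sensitivity mentioned above.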


# K-Fold Cross-Validation

Randomly split the training dataset into k folds without replacement. k-1 folds are used for model training, and the remaining fold is the test fold, which is used for performance evaluation. <br>
This procedure is repeated k times so that we obtain k models and k performance estimates, which are then averaged.
<br> We use this for model tuning, that is, for finding the hyperparameter values that yield a satisfying generalization performance.
<br> Once we have found the best hyperparameter values, we can retrain the model on the entire training dataset and obtain a final performance estimate using the independent test dataset.
<br> All test folds are disjoint: there is no overlap between test folds.
<br> In summary, k-fold cross-validation makes better use of the dataset than the holdout method with a validation set, since in k-fold cross-validation all data points are used for evaluation.
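
The splitting scheme can be made concrete on a toy example. The sketch below assumes ten hypothetical samples identified only by their index and uses the plain `KFold` splitter with k = 5 to show that every sample lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples, identified by their index
toy_X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=1)
test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(toy_X), start=1):
    # k-1 folds (8 samples) train the model, 1 fold (2 samples) evaluates it
    print(f'Fold {fold}: train={train_idx.tolist()}, test={test_idx.tolist()}')
    test_folds.append(test_idx)

# Collect all test indices: each sample index should appear exactly once
all_test_idx = np.sort(np.concatenate(test_folds))
print(all_test_idx.tolist())
```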


In [6]:
# Requires pipe_lr, X_train, and y_train from the cells above
import numpy as np
from sklearn.model_selection import StratifiedKFold
# Generator of (train_indices, test_indices) pairs, one pair per fold
kfold = StratifiedKFold(n_splits=10).split(X_train, y_train)
scores = []
for k, (train, test) in enumerate(kfold):
    # Refit the whole pipeline on the k-1 training folds, score on the held-out fold
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print(f'Fold: {k+1:02d}, '
          f'Class distr.: {np.bincount(y_train[train])}, '
          f'Acc.: {score:.3f}')

Fold: 01, Class distr.: [225 133], Acc.: 0.925
Fold: 02, Class distr.: [225 133], Acc.: 0.950
Fold: 03, Class distr.: [225 133], Acc.: 0.950
Fold: 04, Class distr.: [225 133], Acc.: 0.950
Fold: 05, Class distr.: [225 133], Acc.: 0.875
Fold: 06, Class distr.: [225 133], Acc.: 1.000
Fold: 07, Class distr.: [225 133], Acc.: 0.925
Fold: 08, Class distr.: [225 133], Acc.: 0.925
Fold: 09, Class distr.: [225 134], Acc.: 0.974
Fold: 10, Class distr.: [225 134], Acc.: 0.974


In [7]:
mean_acc = np.mean(scores)
std_acc = np.std(scores)
print(f'\nCV accuracy: {mean_acc:.3f} +/- {std_acc:.3f}')


CV accuracy: 0.945 +/- 0.033


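Scikit-learn also provides a k-fold cross-validation scorer, `cross_val_score`. With an integer `cv` and a classifier it uses stratified k-fold splits, so it reproduces the scores of the manual loop above.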

In [9]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, cv=10, n_jobs=1)
print(f'CV accuracy scores: {scores}')
print(f'CV accuracy: {np.mean(scores):.3f} 'f'+/- {np.std(scores):.3f}')

CV accuracy scores: [0.925      0.95       0.95       0.95       0.875      1.
 0.925      0.925      0.97435897 0.97435897]
CV accuracy: 0.945 +/- 0.033
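
Cross-validation scores are typically used to compare hyperparameter settings before the final test. The following sketch illustrates that workflow under a few assumptions: it uses the copy of the dataset bundled with scikit-learn so that it runs offline, and a hypothetical list of candidate values for the `C` parameter of `LogisticRegression`. The setting with the best mean CV accuracy is refitted on the entire training dataset and evaluated once on the test dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled copy of the dataset (labels: malignant=0, benign=1)
X_cv, y_cv = load_breast_cancer(return_X_y=True)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cv, y_cv, test_size=0.30, stratify=y_cv, random_state=1)

def build_pipe(C):
    """Same pipeline as above, with the regularization parameter C exposed."""
    return make_pipeline(StandardScaler(),
                         PCA(n_components=2),
                         LogisticRegression(C=C))

# Model tuning: mean 10-fold CV accuracy for each candidate value
cv_candidates = [0.01, 0.1, 1.0, 10.0]  # hypothetical search grid
cv_means = {}
for C in cv_candidates:
    fold_scores = cross_val_score(build_pipe(C), Xc_train, yc_train, cv=10)
    cv_means[C] = np.mean(fold_scores)
    print(f'C={C}: CV accuracy {cv_means[C]:.3f} +/- {np.std(fold_scores):.3f}')
best_cv_C = max(cv_means, key=cv_means.get)

# Retrain on the entire training dataset, then use the test dataset once
final_pipe = build_pipe(best_cv_C)
final_pipe.fit(Xc_train, yc_train)
final_test_acc = final_pipe.score(Xc_test, yc_test)
print(f'Selected C: {best_cv_C}, test accuracy: {final_test_acc:.3f}')
```

The test dataset plays no part in choosing `C`, so the final accuracy remains an independent estimate.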
