# Pipeline and Composite Estimators

Data transformations can be automated using the Pipeline class provided in sklearn. A Pipeline chains multiple estimators into one. This is useful because there is often a fixed sequence of steps in processing the data, for example feature selection, normalization, and classification. Pipeline serves multiple purposes here:

1. Convenience and encapsulation: You only have to call fit and predict once on your data to fit a whole sequence of estimators.

2. Joint parameter selection: You can grid search over parameters of all estimators in the pipeline at once.

3. Safety: Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation by ensuring that the same samples are used to train the transformers and predictors.
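
The three points above can be seen together in a short sketch. The iris dataset, the step names, and the parameter values below are assumptions chosen only for illustration. The whole chain is fitted with one call, the grid addresses parameters of two different steps at once, and each cross-validation split refits the scaler and PCA on its training folds only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative three-step pipeline: scaling, dimensionality reduction, classifier
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce_dim", PCA()),
    ("clf", SVC()),
])

# Parameters of any step are addressed as <step name>__<parameter name>
param_grid = {
    "reduce_dim__n_components": [2, 3],
    "clf__C": [0.1, 1, 10],
}

# Within each CV split, the transformers are fitted on the training folds only
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2f}")
```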

### A. Building a Pipeline
 
The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and the value is an estimator object:

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA

estimators = [('reduce_dim', PCA()), ('clf',SVC())]
pipe = Pipeline(estimators)
pipe

In [2]:
## Accessing pipeline steps (slicing returns a sub-pipeline)
pipe[:1]

In [3]:
pipe[-1:]
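
Besides slicing, individual steps can be reached by position or by name. The following is a minimal sketch that reuses the step names from the cells above; it only illustrates the difference between getting a sub-pipeline and getting the estimator itself.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

# Slicing returns a new Pipeline holding the selected steps
head = pipe[:1]
print(type(head).__name__)

# Integer or name indexing returns the estimator object itself
first = pipe[0]
by_name = pipe["reduce_dim"]
via_named_steps = pipe.named_steps["reduce_dim"]
print(first is by_name, by_name is via_named_steps)

# The raw list of (name, estimator) tuples is kept in the steps attribute
print(pipe.steps[1][0])
```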

In [5]:
## Tracking features in a pipeline
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
iris = load_iris()
pipe = Pipeline(steps=[
    ('select', SelectKBest(k=2)),
    ('clf',LogisticRegression())
])
pipe.fit(iris.data, iris.target)
pipe[:-1].get_feature_names_out()

array(['x2', 'x3'], dtype=object)

In [6]:
# Accessing nested parameters (<step name>__<parameter name>)
pipe = Pipeline(steps=[("reduce_dim", PCA()), ("clf", SVC())])
pipe.set_params(clf__C=10)
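
A value set this way can be read back either from get_params or from the step itself, and the same mechanism can replace a whole step. The sketch below assumes the same two-step pipeline; the string 'passthrough' tells the pipeline to skip that step.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline(steps=[("reduce_dim", PCA()), ("clf", SVC())])
pipe.set_params(clf__C=10)

# The nested value is visible both on the pipeline and on the step
print(pipe.get_params()["clf__C"])
print(pipe.named_steps["clf"].C)

# A step name without a parameter suffix replaces the whole step;
# 'passthrough' makes the pipeline skip it
pipe.set_params(reduce_dim="passthrough")
print(pipe.get_params()["reduce_dim"])
```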

### B. Caching Transformers: Avoid Repeated Computation

Fitting transformers may be computationally expensive. With its memory parameter set, Pipeline will cache each transformer after calling fit. This avoids refitting the transformers within a pipeline when the parameters and input data are identical.


In [3]:
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

estimators = [('reduce_dim', PCA()), ('clf', SVC())]
cachedir = mkdtemp()
pipe = Pipeline(estimators, memory=cachedir)
# Remove the cache directory with rmtree(cachedir) once it is no longer needed
pipe
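
One side effect of caching is worth a small sketch. With memory set, the pipeline fits a clone of each transformer, so the instance passed in is expected to stay unfitted and the fitted copy has to be reached through the pipeline. The dataset and n_components=2 are assumptions for illustration.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pca = PCA(n_components=2)
cachedir = mkdtemp()
try:
    cached_pipe = Pipeline([("reduce_dim", pca), ("clf", SVC())],
                           memory=cachedir)
    cached_pipe.fit(X, y)

    # The original instance was cloned before fitting, so it has no fitted attributes
    print(hasattr(pca, "components_"))

    # The fitted transformer is available through the pipeline
    print(cached_pipe.named_steps["reduce_dim"].components_.shape)
finally:
    # Clear the cache directory when it is no longer needed
    rmtree(cachedir)
```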


### C. Transforming the Target in Regression

TransformedTargetRegressor transforms the target y before fitting a regression model. The predictions are mapped back to the original space via an inverse transform. It takes as arguments the regressor that will be used for prediction and the transformer that will be applied to the target variable:

In [10]:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import QuantileTransformer
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=20640,
                n_features=8, noise=100.0,
                random_state=0)

y = np.exp( 1 + (y - y.min()) * (4 / (y.max() - y.min())))
X, y = X[:2000, :], y[:2000]  # select a subset of data
transformer = QuantileTransformer(output_distribution='normal')
regressor = LinearRegression()
regr = TransformedTargetRegressor(regressor=regressor,
                                  transformer=transformer)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
regr.fit(X_train, y_train) 

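# Score of the regressor fitted on the transformed target
print(f"R2 score: {regr.score(X_test, y_test):.2f}")
# Baseline: the same regressor fitted on the raw target
raw_target_regr = LinearRegression().fit(X_train, y_train)
print(f"R2 score: {raw_target_regr.score(X_test, y_test):.2f}")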


R2 score: 0.67
R2 score: 0.64
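
To make the sequence described above concrete, the sketch below performs the same three steps by hand and compares the result with the composite estimator. StandardScaler and the small synthetic dataset are assumptions picked only because they are deterministic and easy to check; they are not the transformer or data used in the cell above.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Composite estimator: transform y, fit the regressor, invert the predictions
ttr = TransformedTargetRegressor(regressor=LinearRegression(),
                                 transformer=StandardScaler())
ttr.fit(X, y)

# The same steps written out manually
scaler = StandardScaler().fit(y.reshape(-1, 1))                 # 1. fit transformer on y
y_scaled = scaler.transform(y.reshape(-1, 1)).ravel()
reg = LinearRegression().fit(X, y_scaled)                       # 2. fit on transformed y
manual_pred = scaler.inverse_transform(
    reg.predict(X).reshape(-1, 1)).ravel()                      # 3. map predictions back

print(np.allclose(ttr.predict(X), manual_pred))
```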


In [11]:
def func(x):
    return np.log(x)
def inverse_func(x):
    return np.exp(x)

In [16]:
regr = TransformedTargetRegressor(regressor=regressor,
                                  func=func,
                                  inverse_func=inverse_func)
regr.fit(X_train, y_train)




In [17]:
print(f"R2 score: {regr.score(X_test, y_test):.2f}")


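R2 score: -3.02

This value is identical to the score of the mismatched-inverse run below, and the cell numbers (15, then 16 and 17) suggest it was computed after inverse_func had been redefined as the identity. Rerun the cells in order to get the score for the log/exp pair.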


In [15]:
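# Deliberately use a function that is NOT the inverse of func.
# Note: this redefines inverse_func from In [11]; rerun that cell to restore np.exp.
def inverse_func(x):
    return x
regr = TransformedTargetRegressor(regressor=regressor,
                                  func=func,
                                  inverse_func=inverse_func,
                                  # skip the check that func and inverse_func are inverses
                                  check_inverse=False)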
regr.fit(X_train, y_train)


In [18]:
print(f"R2 score: {regr.score(X_test, y_test):.2f}")


R2 score: -3.02
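
With the default check_inverse=True, a mismatched pair like this is flagged at fit time with a warning. The sketch below captures that warning; the identity helper and the synthetic strictly positive target are hypothetical and serve only to trigger the check.

```python
import warnings

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
y = np.exp((y - y.min()) / (y.max() - y.min()))  # strictly positive target


def identity(x):
    # Hypothetical helper: deliberately not the inverse of np.log
    return x


mismatched = TransformedTargetRegressor(regressor=LinearRegression(),
                                        func=np.log,
                                        inverse_func=identity)

# Record warnings raised while fitting with the default check_inverse=True
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mismatched.fit(X, y)

messages = [str(w.message) for w in caught
            if issubclass(w.category, UserWarning)]
for message in messages:
    print(message)
```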


### D. Feature Union: Composite Feature spaces

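FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. Each transformer is fitted to the data independently, and the feature matrices they produce are concatenated side by side.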

When you want to apply different transformations to each field of the data, see the related class ColumnTransformer.

FeatureUnion serves the same purposes as Pipeline - convenience and joint parameter estimation and validation.

In [19]:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA

estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
combined = FeatureUnion(estimators)
combined

In [20]:
combined.set_params(kernel_pca='drop')
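
The effect of combining and dropping transformers is easiest to see in the output shape. The following sketch uses the iris data and two small transformers, both assumptions chosen for illustration, so that the column counts are easy to follow.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("select", SelectKBest(k=1)),
])

# Outputs are concatenated column-wise: 2 PCA components + 1 selected feature
X_combined = union.fit_transform(X, y)
print(X_combined.shape)  # (150, 3)

# Setting a transformer to 'drop' removes its columns from the output
union.set_params(select="drop")
X_pca_only = union.fit_transform(X, y)
print(X_pca_only.shape)  # (150, 2)
```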
