### 1 — Pipeline (core)
What it is

A sequential container of transformers (all but last) and a final estimator. It chains transforms and a final fit/predict call so preprocessing + model act as one object.

Constructor

In [None]:
from sklearn.pipeline import Pipeline
pipe = Pipeline(steps=[('name1', transformer1), ('name2', transformer2), ..., ('final', estimator)])


steps: ordered list of (name, object) pairs.

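Names must be unique, must not contain a double underscore (__), and must not clash with Pipeline constructor argument names such as steps or memory.

Every step except the last must implement .fit and .transform (.fit_transform is used when available). The last step only needs .fit; whether the pipeline exposes .predict, .transform, and so on depends on what the last step implements. A step can also be set to 'passthrough' to skip it.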

Key behavior (fit / transform / predict)

pipe.fit(X, y):

For each intermediate step t: X = t.fit_transform(X, y) if t defines it; otherwise t.fit(X, y) followed by X = t.transform(X).

Call final.fit(X, y) on the transformed data.

pipe.predict(X):

Transform through all transformers (.transform) in order.

Call final.predict(X_transformed).

pipe.transform(X) works only if the final estimator has .transform (then pipeline returns final .transform output).

pipe.fit_transform(X, y) fits every step as above and returns the final step’s transformed output; like pipe.transform, it is available only if the final step is a transformer.
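
The fit and predict flow described above can be checked by doing the same steps by hand. The following is a minimal sketch on synthetic data; the names X_demo, y_demo, and pipe_demo are illustrative and not part of the original notes.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, purely illustrative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Pipeline version
pipe_demo = Pipeline([("scaler", StandardScaler()), ("model", Ridge(alpha=1.0))])
pipe_demo.fit(X_demo, y_demo)

# The same flow written out manually
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo, y_demo)   # intermediate step: fit + transform
model = Ridge(alpha=1.0).fit(X_scaled, y_demo)    # final step: fit only

# predict: transform through the fitted scaler, then call the final predict
manual_pred = model.predict(scaler.transform(X_demo))
same = np.allclose(pipe_demo.predict(X_demo), manual_pred)
print(same)
```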

Important methods & what they do

.fit(X, y=None, **fit_params): fit every transformer in order, then the final estimator.

.fit_transform(X, y=None, **fit_params): fit the pipeline and return the transformed data; requires a final step that implements .transform.

.transform(X): apply every step’s .transform, including the final step, which must therefore implement .transform.

.predict(X): transform X through all intermediate steps, then call the final step’s .predict.

.predict_proba(X) / .decision_function(X): delegated to the final estimator if it supports them.

.score(X, y): delegated to the final estimator’s .score after transforming X.

.get_params(deep=True): returns all parameters, including nested ones (used by GridSearchCV). Parameter names use step__param notation.

.set_params(**params): set parameters; use nested names, e.g. preproc__num__imputer__strategy='median'.

.named_steps: dict-like access to a step by name, e.g. pipe.named_steps['preproc'].

.steps: list of (name, estimator) pairs in order.
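
As a small illustration of the parameter methods listed above, the sketch below reads the nested names and changes a few of them. The pipeline and the name pipe_demo are assumptions made for the example.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_demo = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

# Nested parameter names follow the <step>__<param> pattern
nested_names = sorted(k for k in pipe_demo.get_params() if "__" in k)

# Change parameters of individual steps
pipe_demo.set_params(imputer__strategy="median", model__alpha=0.1)

# Using the step name itself as the key replaces the whole step
pipe_demo.set_params(scaler="passthrough")

print(pipe_demo.named_steps["imputer"].strategy)
print(pipe_demo.steps[1])
```

Replacing a step through set_params is what allows a grid search to switch whole transformers on or off, not just tune their parameters.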

Attributes created after fit

pipe.named_steps['step_name'] is the fitted transformer/estimator instance.

If the final estimator exposes fitted attributes (e.g., coef_), access them with pipe.named_steps['final'].coef_. The pipeline also supports indexing, so pipe['final'].coef_ and pipe[-1].coef_ are equivalent; the Pipeline object itself does not forward attributes such as coef_.
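
A short sketch of reading fitted attributes from the steps, assuming a two-step pipeline on synthetic data (all names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1]

pipe_demo = Pipeline([("scaler", StandardScaler()), ("final", Ridge())])
pipe_demo.fit(X_demo, y_demo)

# Three ways to reach the fitted final estimator
coef_by_name = pipe_demo.named_steps["final"].coef_
coef_by_key = pipe_demo["final"].coef_
coef_by_pos = pipe_demo[-1].coef_

# Fitted attributes of intermediate steps are reachable the same way
scaler_means = pipe_demo["scaler"].mean_
```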

memory argument

In [None]:
Pipeline(steps=..., memory='cache_dir')


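Caches the fitted transformers (not the final estimator) so that a repeated fit with identical parameters and input data reloads them instead of refitting, which is useful in GridSearchCV.

Accepts a directory path or a joblib.Memory object and requires picklable transformers. With caching enabled the pipeline fits clones of the transformers, so inspect fitted steps through pipe.named_steps rather than the original objects. Use it for expensive transforms.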

verbose argument

If True, prints the time elapsed while fitting each step.

Example: simple pipeline

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)


### 2 — make_pipeline (helper)
What

Convenience constructor that gives automatic step names based on class names (lowercased).

In [None]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(SimpleImputer(), StandardScaler(), Ridge())
# step names: 'simpleimputer', 'standardscaler', 'ridge'


Use case

When you don’t care about custom step names. Works identically to Pipeline.
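
One detail worth seeing once is how the automatic names handle repeated classes. A minimal sketch (the particular combination of steps is only an illustration):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# StandardScaler appears twice, so its generated names get numeric suffixes
pipe_demo = make_pipeline(
    StandardScaler(), PolynomialFeatures(degree=2), StandardScaler(), Ridge()
)
step_names = [name for name, _ in pipe_demo.steps]
print(step_names)
# ['standardscaler-1', 'polynomialfeatures', 'standardscaler-2', 'ridge']
```

These generated names are the ones to use in nested parameter keys, for example ridge__alpha.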

### 3 — FeatureUnion and make_union
What

Concatenate outputs of multiple transformer branches (horizontal concatenation). Useful when you want to apply different transforms to the same input and combine results (e.g., BOW + TF-IDF).

In [None]:
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion
union = FeatureUnion([
    ('pca', PCA(n_components=5)),
    ('select', SelectKBest(k=10))
])


Behavior

For each transformer, call .fit_transform(X, y), then horizontally stack the outputs (NumPy arrays or sparse matrices).

Order in output respects the order you pass transformers.

n_jobs supported to run transforms in parallel.
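
The stacking behavior can be verified on a small synthetic example. This sketch assumes a regression target and uses f_regression as the scoring function for SelectKBest; both choices are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import FeatureUnion

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 8))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 3] + rng.normal(scale=0.1, size=60)

union_demo = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("select", SelectKBest(score_func=f_regression, k=3)),
])
X_union = union_demo.fit_transform(X_demo, y_demo)

# Columns come out in the order the transformers were listed:
# the first 2 columns are the PCA components, the last 3 the selected features
pca_part = union_demo.transformer_list[0][1].transform(X_demo)
print(X_union.shape)  # (60, 5)
```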

make_union

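Like make_pipeline: step names are generated automatically from the class names.

In [None]:
from sklearn.pipeline import make_union
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
fu = make_union(PCA(n_components=5), SelectKBest(k=5))  # names: 'pca', 'selectkbest'
X_new = fu.fit_transform(X, y)  # SelectKBest is supervised, so y is required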


### 4 — FunctionTransformer
What

Wraps an arbitrary function into a transformer so it can be used in pipelines.

In [None]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
log_tr = FunctionTransformer(np.log1p, validate=False)


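validate=False (the default) passes the input through unchanged, so a DataFrame reaches the function as a DataFrame; set validate=True to have the input checked and converted to a 2D NumPy array first.

Pass an inverse_func argument if you want .inverse_transform to undo the transformation.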

Use case

Quick transforms without writing a custom class.
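
A minimal sketch of a FunctionTransformer with an inverse, pairing np.log1p with np.expm1 (the array X_demo is made up for the example):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p and expm1 are exact inverses of each other
log_tr = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X_demo = np.array([[0.0, 1.0], [9.0, 99.0]])
X_log = log_tr.fit_transform(X_demo)        # applies np.log1p
X_back = log_tr.inverse_transform(X_log)    # applies np.expm1

print(np.allclose(X_back, X_demo))
```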

### 5 — Custom transformers (BaseEstimator + TransformerMixin)
Why

When you need domain-specific feature engineering or complex logic. Implementing fit and transform allows insertion into pipelines.

Pattern:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, param=...):
        # store constructor arguments unchanged so get_params() and clone() work
        self.param = param

    def fit(self, X, y=None):
        # learn from X and store the results in attributes ending with "_"
        return self

    def transform(self, X):
        # replace this placeholder with the real transformation
        X_transformed = X
        return X_transformed
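
To make the pattern concrete, here is one possible transformer that follows it. QuantileClipper is a hypothetical class written for this example, not something provided by scikit-learn: it learns per-column quantiles in fit and clips to them in transform.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone

class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical example: clip each column to quantiles learned in fit."""

    def __init__(self, lower=0.05, upper=0.95):
        # constructor arguments are stored as-is, without validation or copies
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # learned state goes into attributes with a trailing underscore
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.low_, self.high_)

X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
clipper = QuantileClipper(lower=0.0, upper=0.75)
X_clipped = clipper.fit_transform(X_demo)   # fit_transform comes from TransformerMixin
params = clipper.get_params()               # get_params comes from BaseEstimator
unfitted_copy = clone(clipper)              # works because __init__ only stores its arguments
print(X_clipped.ravel())
```

Keeping __init__ free of logic is what lets get_params, set_params, and clone work, which in turn is what GridSearchCV relies on.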


### 6 — Integration with ColumnTransformer & pipeline patterns

Usually you combine ColumnTransformer (apply different pipelines to different column sets) with a top-level Pipeline:

In [None]:
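from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# num_pipeline, cat_pipeline, numeric_cols and categorical_cols are assumed to be defined already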

preprocessor = ColumnTransformer([
    ('num', num_pipeline, numeric_cols),
    ('cat', cat_pipeline, categorical_cols)
], remainder='drop')

full_pipe = Pipeline([
    ('pre', preprocessor),
    ('model', RandomForestRegressor())
])


Then use full_pipe with fit / predict / GridSearchCV.
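
For a version that runs end to end, the sketch below fills in the undefined pieces with assumptions: a NumPy array whose third column holds category codes, column positions instead of column names, and small sub-pipelines for each group.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 30
# Columns 0 and 1 are numeric; column 2 holds category codes 0, 1, 2
X_demo = np.column_stack([rng.normal(size=n), rng.normal(size=n), np.arange(n) % 3])
X_demo[0, 0] = np.nan                      # one missing value for the imputer
y_demo = rng.normal(size=n)

numeric_cols, categorical_cols = [0, 1], [2]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipeline = Pipeline([("ohe", OneHotEncoder(handle_unknown="ignore"))])

preprocessor = ColumnTransformer([
    ("num", num_pipeline, numeric_cols),
    ("cat", cat_pipeline, categorical_cols),
], remainder="drop")

full_pipe = Pipeline([
    ("pre", preprocessor),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
full_pipe.fit(X_demo, y_demo)

# 2 scaled numeric columns + 3 one-hot columns
n_features_out = full_pipe["pre"].transform(X_demo).shape[1]
preds = full_pipe.predict(X_demo)
```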

### 7 — Pipeline methods in detail (practical list)

1. fit(X, y=None, **fit_params): fit all transformers and the final estimator.

2. fit_transform(X, y=None, **fit_params): fit every step and return the transformed output of the last step; available only if the last step is a transformer.

3. transform(X): apply the transforms of every step; the final step must implement transform.

4. predict(X): transform, then call the final step’s predict.

5. predict_proba(X), decision_function(X), etc.: delegated if the final estimator supports them.

6. score(X, y): delegated to the final estimator’s score (after transform).

7. set_params(**kwargs): set nested params with step__param=value, e.g. model__alpha=0.1.

8. get_params(deep=True): returns the nested param dict.

9. named_steps: access steps by name, e.g. pipe.named_steps['imputer'].

10. steps: list of (name, estimator) pairs.

11. memory (constructor argument): cache directory or joblib.Memory object used to cache fitted transformers.

12. verbose (constructor argument): prints the time taken by each step during fit.

### 8 — GridSearchCV & pipeline: parameter naming

When tuning hyperparams inside a pipeline, use step__param:

In [None]:
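from sklearn.model_selection import GridSearchCV

param_grid = {
    'pre__num__imputer__strategy': ['median', 'mean'],
    # full_pipe's model is a RandomForestRegressor, which has n_estimators but no alpha
    'model__n_estimators': [100, 300]
}
gs = GridSearchCV(full_pipe, param_grid, cv=5)
gs.fit(X_train, y_train)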


get_params() shows all available names — very useful for building param_grid.
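
One way to use get_params() when building a grid is to check the keys before running the search. The sketch below does that on a small synthetic problem; the pipeline, data, and grid values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo @ np.array([1.0, 0.0, -1.0, 2.0]) + rng.normal(scale=0.1, size=60)

pipe_demo = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

param_grid_demo = {
    "imputer__strategy": ["mean", "median"],
    "model__alpha": [0.01, 1.0, 100.0],
}

# Every grid key should be one of the names reported by get_params()
valid_names = set(pipe_demo.get_params().keys())
unknown = [k for k in param_grid_demo if k not in valid_names]

gs_demo = GridSearchCV(pipe_demo, param_grid_demo, cv=3)
gs_demo.fit(X_demo, y_demo)
best = gs_demo.best_params_
n_candidates = len(gs_demo.cv_results_["params"])  # 2 strategies x 3 alphas
```

A misspelled key would otherwise surface only as an invalid-parameter error once the search starts fitting.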

### 9 — Caching expensive steps

If a transformer (e.g., expensive feature extraction) takes long and is reused across different parameter combinations, enable caching:

In [None]:
pipe = Pipeline(steps=[('pre', preprocessor), ('clf', model)], memory='cache_dir')


The cache is stored under 'cache_dir'. To clear it, delete that directory.
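
A self-contained sketch of caching that cleans up after itself, assuming a temporary directory is acceptable as the cache location (the steps and data are illustrative):

```python
import os
import shutil
import tempfile

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=50)

cache_dir = tempfile.mkdtemp()
try:
    cached_pipe = Pipeline(
        [("pca", PCA(n_components=3)), ("model", Ridge())],
        memory=cache_dir,
    )
    cached_pipe.fit(X_demo, y_demo)
    # Only the model parameter changes, so the fitted PCA can be loaded from the cache
    cached_pipe.set_params(model__alpha=10.0).fit(X_demo, y_demo)
    cached_pred = cached_pipe.predict(X_demo)
    pca_is_fitted = hasattr(cached_pipe.named_steps["pca"], "components_")
finally:
    shutil.rmtree(cache_dir)  # clearing the cache is just deleting the directory
```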

### 10 — Best practices & gotchas

1. Fit only on training data: always fit pipeline on training set. Use pipeline inside cross-validation so imputation/encoding is learned per fold.

2. Use handle_unknown='ignore' for OneHotEncoder to avoid errors on unseen categories.

3. Final estimator must be last: if final step is a transformer, .predict won’t be available.

4. Preserve feature names: ColumnTransformer.get_feature_names_out() helps map transformed columns back to names.

5. Memory + non-picklable objects: caching requires picklable transformers.

6. Avoid heavy nesting of parallelism: if model uses n_jobs=-1 and GridSearch also uses n_jobs=-1, you may oversubscribe CPUs.

7. When pipelines return sparse matrices (e.g., many OHE features), ensure downstream estimator accepts sparse input (many do).

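8. Partial fit: Pipeline does not expose partial_fit itself. Incremental learning with an estimator that supports partial_fit therefore requires transformers that are stateless or already fitted, with partial_fit called on the final estimator directly.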
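
Point 2 in the list above is easy to demonstrate in isolation. This sketch fits two encoders on made-up color labels and then transforms a category that was never seen during fit.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["green"], ["red"]], dtype=object)
test = np.array([["blue"]], dtype=object)   # category not present in train

# Default behavior: an unseen category raises an error at transform time
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError:
    strict_failed = True

# handle_unknown='ignore': the unseen category is encoded as all zeros
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(test).toarray()
print(strict_failed, encoded)
```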

### 11 — Short cheat-sheet examples

In [None]:
pipe = make_pipeline(SimpleImputer(), StandardScaler(), Ridge())
pipe.fit(X_train, y_train)


Grid search over pipeline

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {'ridge__alpha': [0.1, 1.0, 10.0]}
gs = GridSearchCV(pipe, param_grid, cv=5)
gs.fit(X_train, y_train)


Pipeline with ColumnTransformer

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor  # third-party package, installed separately
num_pipe = Pipeline([('imputer', SimpleImputer()), ('scaler', StandardScaler())])
# fill_value only takes effect with strategy='constant'
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='None')), ('ohe', OneHotEncoder(handle_unknown='ignore'))])
pre = ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)], remainder='drop')
full = Pipeline([('pre', pre), ('model', XGBRegressor())])
full.fit(X_train, y_train)


Access inner estimator

In [None]:
full.named_steps['model'].feature_importances_
