This notebook explores sklearn [Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), some dsmatch Transformers, and demonstrates the basic principles of building our sklearn models as series of several transformers.

**Author:** Tom McTavish

**Date:** March 22, 2022

# Introduction

A sklearn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) is a series of [Transformers](https://scikit-learn.org/stable/data_transforms.html) that take a 1D or 2D array (or Series or DataFrame) and perform some sort of transformation to it.

A model may need to perform several different operations as it computes outputs. When architecting these operations as individual Transformer objects and placing them into the Pipeline architecture, we can take advantage of the various mechanisms to visualize, interrogate, and understand the model.

In [1]:
# %pip install -e "git+ssh://git@bitbucket.org/dhigroupinc/dhi-match-datascience.git@MATCH-2391-create-submodel-demo#egg=dhi-dsmatch[training]&subdirectory=src/dhi-dsmatch"
import os
import sys

new_path = [os.path.abspath(os.path.join(os.getcwd(), '..', '..', 'src', 'dhi-dsmatch'))]
new_path.extend(sys.path)
sys.path = new_path

In [2]:
# %pip install -U pandas
# %pip install -U scipy
# %pip install -U scikit-learn
# %pip install -U numpy
# %pip install -U matplotlib

In [3]:
import numpy as np
import pandas as pd
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.base import TransformerMixin
from sklearn.preprocessing import FunctionTransformer
from sklearn import set_config
set_config(display='diagram')
from IPython.core.display import HTML

In [4]:
from dhi.dsmatch.sklearnmodeling.models.applytransformer import ApplyTransformer
from dhi.dsmatch.sklearnmodeling.functiontransformermapper import applymap, applyrows, try_member_func
from dhi.dsmatch.sklearnmodeling.models.featureuniondataframe import FeatureUnionDataFrame
from dhi.dsmatch.sklearnmodeling.models.columntransformerdataframe import ColumnTransformerDataFrame
from dhi.dsmatch.sklearnmodeling.models.mixins import FilterFunctionTransformer
from dhi.dsmatch.sklearnmodeling.models.pipelinehelpers import FeatureNamesPipeline
from dhi.dsmatch.sklearnmodeling.models.pipelinehelpers import findall_transformertype, leftjoin_pipeline
from dhi.dsmatch.sklearnmodeling.transformercontext import intermediate_transforms

# Create a simple dataset.

In [5]:
df = pd.DataFrame({
    'first_name': ['Anne', 'Bob', 'Charlie', 'Bob'],
    'last_name': ['Bancroft', 'Dylan', 'Chaplin', 'Marley'],
    'age': [20, 21, 22, 23],
})

df

Unnamed: 0,first_name,last_name,age
0,Anne,Bancroft,20
1,Bob,Dylan,21
2,Charlie,Chaplin,22
3,Bob,Marley,23


# Create some Transformers and look at their functionality.

With our simple dataset, let's build some simple string manipulation Transformers.

Here we use the [ApplyTransformer](https://bitbucket.org/dhigroupinc/dhi-match-datascience/src/master/src/dhi-dsmatch/dhi/dsmatch/sklearnmodeling/models/applytransformer.py) with [applymap](https://bitbucket.org/dhigroupinc/dhi-match-datascience/src/master/src/dhi-dsmatch/dhi/dsmatch/sklearnmodeling/functiontransformermapper.py) that apply a function across all cells of the DataFrame passed to it and will parallelize by splitting the DataFrame by rows if we are fitting or transforming a lot of data.

In [6]:
lowercase_tx = ApplyTransformer(apply_func=applymap, func=str.lower)
lowercase_tx.transform(df[['first_name', 'last_name']])

Unnamed: 0,first_name,last_name
0,anne,bancroft
1,bob,dylan
2,charlie,chaplin
3,bob,marley


Note that rather than sending specific columns in the `transform()` method, the ApplyTransformer (and other objects in dsmatch) use the [FilterTransformer](https://bitbucket.org/dhigroupinc/dhi-match-datascience/src/master/src/dhi-dsmatch/dhi/dsmatch/sklearnmodeling/models/mixins.py) to specify which columns to transform and we can send the whole DataFrame to the model's transform method. Since many transformers including ApplyTransformer may alter the original object even though it returns that object, we will send a copy of our original dataframe to these examples.

### Filtering with keys as a list

Here, we specify the keys to be modified, which replaces their values.

In [7]:
lowercase_tx = ApplyTransformer(applymap, str.lower, keys=['first_name', 'last_name'])
lowercase_tx.transform(df.copy())  # Pass the whole DataFrame                 

Unnamed: 0,first_name,last_name,age
0,anne,bancroft,20
1,bob,dylan,21
2,charlie,chaplin,22
3,bob,marley,23


### Append by specifying input/output pairs as a dict

Here we make `keys` a dict of `'<input>':'<output>'` pairs. This appends the output columns.

In [8]:
lowercase_tx = ApplyTransformer(
    applymap, 
    str.lower, 
    keys={
        'first_name': 'first_lower',
        'last_name': 'last_lower'
    }
)
lowercase_tx.transform(df.copy())  # Pass the whole DataFrame                 

Unnamed: 0,first_name,last_name,age,first_lower,last_lower
0,Anne,Bancroft,20,anne,bancroft
1,Bob,Dylan,21,bob,dylan
2,Charlie,Chaplin,22,charlie,chaplin
3,Bob,Marley,23,bob,marley


### Filter output by specifying `feature_names_out`

The ApplyTransformer (and other dsmatch transformers) also incorporates the [FeatureNamesMixin](https://bitbucket.org/dhigroupinc/dhi-match-datascience/src/master/src/dhi-dsmatch/dhi/dsmatch/sklearnmodeling/models/mixins.py), which allows us to specify the output columns we want to keep.

In [9]:
lowercase_tx = ApplyTransformer(
    applymap, 
    str.lower, 
    keys={'first_name': 'first_lower'},
    feature_names_out=['last_name', 'first_lower']
)
lowercase_tx.transform(df.copy())  # Pass the whole DataFrame

Unnamed: 0,last_name,first_lower
0,Bancroft,anne
1,Dylan,bob
2,Chaplin,charlie
3,Marley,bob


## Conjoining with keys of `keys` as a tuple

#### FilterFunctionTransformer

The FilterFunctionTransformer also takes keys. It is a sklearn [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html), but also takes `keys` for filtering inputs and `feature_names_out` for specifying outputs. This operates on the whole DataFrame, not just individual columns or cells in the DataFrame. In this example, the keys of the `keys` parameter are tuples so we can provide multiple columns as input to the transformation, but it provides only a single column as output, which we name `concat`.

In [10]:
def strcat_cols(df, fill=''):
    return [f'{fill}'.join(row[1:]) for row in df.itertuples()]

concat_tx = FilterFunctionTransformer(
    strcat_cols, 
    keys={('first_name', 'last_name'): 'concat'},
    feature_names_out=FilterFunctionTransformer.calling_feature_names_out,
    kw_args=dict(fill=' ')
)
concat_tx.called_feature_names_out_ = ['age', 'concat']
concat_tx.transform(df.copy())

Unnamed: 0,age,concat
0,20,Anne Bancroft
1,21,Bob Dylan
2,22,Charlie Chaplin
3,23,Bob Marley


## Splitting with values of `keys` as a tuple

The following example shows that if a column splits into multiple columns, those outputs can be specified as a tuple in the value field.

In [11]:
def split_col(c):
    return c.apply([str.upper, str.lower])

concat_tx = FilterFunctionTransformer(
    split_col, 
    keys={'first_name': ('upper', 'lower')},
    feature_names_out=FilterFunctionTransformer.calling_feature_names_out,
)
concat_tx.called_feature_names_out_ = ['age', 'upper', 'lower']
concat_tx.transform(df.copy())

Unnamed: 0,age,upper,lower
0,20,ANNE,anne
1,21,BOB,bob
2,22,CHARLIE,charlie
3,23,BOB,bob


We now go back to creating a bank of different transformers that we'll assemble into pipelines.

In [12]:
lowercase_tx = ApplyTransformer(applymap, str.lower)
uppercase_tx = ApplyTransformer(applymap, str.upper)
uppercase_tx.transform(df[['first_name', 'last_name']])

Unnamed: 0,first_name,last_name
0,ANNE,BANCROFT
1,BOB,DYLAN
2,CHARLIE,CHAPLIN
3,BOB,MARLEY


### Reverse v1

Here's a reversing transformer that uses a lambda function.

In [13]:
reverse_tx = ApplyTransformer(applymap, lambda x: x[::-1])
reverse_tx.transform(df[['first_name', 'last_name']])

Unnamed: 0,first_name,last_name
0,ennA,tforcnaB
1,boB,nalyD
2,eilrahC,nilpahC
3,boB,yelraM


# Avoid Lambda Functions because they aren't pickleable.

While this transformer works as expected, models should avoid lambda functions because lambda functions in python are not pickleable and therefore the model cannot be saved and reloaded from disk. This is seen in the error below.

In [14]:
with open('temp.joblib', 'wb') as f:
    joblib.dump(reverse_tx, f)

PicklingError: Can't pickle <function <lambda> at 0x7f88faac6af0>: it's not found as __main__.<lambda>

## Make simple functions instead of lambda functions.

A simple workaround to not using lambda functions is to simply make a function that does the work instead.

### Reverse v2.

In [15]:
def reverse_text(x: str) -> str:
    return x[::-1]

reverse_tx = ApplyTransformer(applymap, reverse_text)
reverse_tx.transform(df[['first_name', 'last_name']])

Unnamed: 0,first_name,last_name
0,ennA,tforcnaB
1,boB,nalyD
2,eilrahC,nilpahC
3,boB,yelraM


### Write to disk and reload to confirm that the pickle works.

In [16]:
with open('temp.joblib', 'wb') as f:
    joblib.dump(reverse_tx, f)
    
with open('temp.joblib', 'rb') as f:
    reverse_fromdisk_tx = joblib.load(f)
    
reverse_fromdisk_tx.transform(df[['first_name', 'last_name']])

Unnamed: 0,first_name,last_name
0,ennA,tforcnaB
1,boB,nalyD
2,eilrahC,nilpahC
3,boB,yelraM


### We'll build a few more simple tranformers to work with.

In [17]:
def camelcase(x: str) -> str:
    return x[0].upper() + x[1:].lower()

camelcase_tx = ApplyTransformer(applymap, camelcase)
camelcase_tx.transform(df[['first_name', 'last_name']])

Unnamed: 0,first_name,last_name
0,Anne,Bancroft
1,Bob,Dylan
2,Charlie,Chaplin
3,Bob,Marley


In [18]:
def space_text(x: str) -> str:
    return ' '.join(x)
    
spacer_tx = ApplyTransformer(applymap, space_text)
spacer_tx.transform(df[['first_name', 'last_name']])

Unnamed: 0,first_name,last_name
0,A n n e,B a n c r o f t
1,B o b,D y l a n
2,C h a r l i e,C h a p l i n
3,B o b,M a r l e y


In [19]:
def prepend_text(x: str, prepend_str: str='PREPEND_') -> str:
    return prepend_str + x

prepend_tx = ApplyTransformer(applymap, prepend_text)
prepend_tx.transform(df[['first_name', 'last_name']])

Unnamed: 0,first_name,last_name
0,PREPEND_Anne,PREPEND_Bancroft
1,PREPEND_Bob,PREPEND_Dylan
2,PREPEND_Charlie,PREPEND_Chaplin
3,PREPEND_Bob,PREPEND_Marley


In [20]:
def repeat_factor(x, n=3):
    return x*n

factor_tx = ApplyTransformer(applymap, repeat_factor)
factor_tx.transform(df[['first_name', 'last_name']])

Unnamed: 0,first_name,last_name
0,AnneAnneAnne,BancroftBancroftBancroft
1,BobBobBob,DylanDylanDylan
2,CharlieCharlieCharlie,ChaplinChaplinChaplin
3,BobBobBob,MarleyMarleyMarley


# Multiple column outputs.

Here we explore what happens with a transformer that may provide more than one column for the given column passed in.

Here, we use the Pandas [`agg()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html) function. We see that the columns are now a [MultiIndex](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html), keeping the name of the input columns at the highest level and taking the function name as the output. 

In [21]:
def len_squared(x):
    return len(x)**2

df_ = df[['first_name', 'last_name']].agg([len, len_squared])
df_

Unnamed: 0_level_0,first_name,first_name,last_name,last_name
Unnamed: 0_level_1,len,len_squared,len,len_squared
0,4,16,8,64
1,3,9,5,25
2,7,49,7,49
3,3,9,6,36


### Transformer versions of `agg()`

We present two different transformer versions of the above.

In [22]:
tx = FunctionTransformer(
    try_member_func, 
    kw_args=dict(member_func='agg', fkwargs=dict(func=[len, len_squared]))
)
Xt = tx.transform(df[['first_name', 'last_name']])
Xt

Unnamed: 0_level_0,first_name,first_name,last_name,last_name
Unnamed: 0_level_1,len,len_squared,len,len_squared
0,4,16,8,64
1,3,9,5,25
2,7,49,7,49
3,3,9,6,36


In [23]:
pd.DataFrame(Xt.values, columns=pd.MultiIndex.from_tuples(Xt.columns.values.tolist()))

Unnamed: 0_level_0,first_name,first_name,last_name,last_name
Unnamed: 0_level_1,len,len_squared,len,len_squared
0,4,16,8,64
1,3,9,5,25
2,7,49,7,49
3,3,9,6,36


Here's another, perhaps better version. It is better because ApplyTransformer can operate in parallel, splitting the data into chunks. Where this gets confusing is that there is a dict with `fkwargs` as a key whose value is another dict called `fkwargs`. This is because ApplyTransformer takes a `fkwargs` argument as does `try_member_func`.

In [24]:
agg_tx = ApplyTransformer(
    try_member_func, 
    'agg', 
    fkwargs=dict(
        fkwargs=(dict(func=[len, len_squared]))
    )
)
df2 = agg_tx.transform(df[['first_name', 'last_name']])
df2

Unnamed: 0_level_0,first_name,first_name,last_name,last_name
Unnamed: 0_level_1,len,len_squared,len,len_squared
0,4,16,8,64
1,3,9,5,25
2,7,49,7,49
3,3,9,6,36


# Multiple column inputs

Let's now consider the case of a transformer that operates on a whole row. In this case, we will use the `applyrows()` function.

In [25]:
def len_all(x):  # x is a row of elements
    return len(''.join(x))

lenall_tx = ApplyTransformer(
    applyrows, 
    len_all, 
    keys={('first_name', 'last_name'): 'len_all'},
    feature_names_out=['len_all']
)
lenall_tx.transform(df)

Unnamed: 0,len_all
0,12
1,8
2,14
3,9


# Splitting with FeatureUnionDataFrame

A [FeatureUnion](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) is an sklearn object that concatenates the results of *n* Transformers. Let's take the Transformers above and put them into one FeatureUnion. Additionally, we use our subclass of FeatureUnion, [FeatureUnionDataFrame](https://bitbucket.org/dhigroupinc/dhi-match-datascience/src/836c7787da56fc5697315d07697b4fe243d28908/src/dhi-dsmatch/dhi/dsmatch/sklearnmodeling/models/featureuniondataframe.py?at=feature%2FMATCH-2084-refactor-composite-model-to-u) to place the outputs into a DataFrame instead of a 2D numpy array so we keep column labels.

Additionally, these lines at the top of our notebook:

```python
from sklearn import set_config
set_config(display='diagram')
```
give us an interactive visualization of our model. (Click on the ApplyTransformers to see some settings.)

In [26]:
fu_tx = FeatureUnionDataFrame(
    [
#         ('camel', camelcase_tx), 
        ('lower', lowercase_tx), 
        ('upper', uppercase_tx),
        ('ind_lengths', agg_tx),
        ('combined', lenall_tx),
        ('factor', factor_tx),
#         ('space', spacer_tx),
#         ('reverse', reverse_tx),
#         ('prepend', prepend_tx),                              
    ]
)

fu_tx

And now, the output, when we run the DataFrame through our FeatureUnionDataFrame.

In [27]:
Xs = fu_tx.transform(df[['first_name', 'last_name']])
Xs

Unnamed: 0,lower__first_name,lower__last_name,upper__first_name,upper__last_name,"ind_lengths__('first_name', 'len')","ind_lengths__('first_name', 'len_squared')","ind_lengths__('last_name', 'len')","ind_lengths__('last_name', 'len_squared')",combined__len_all,factor__first_name,factor__last_name,factor__len_all
0,anne,bancroft,ANNE,BANCROFT,4,16,8,64,12,AnneAnneAnne,BancroftBancroftBancroft,36
1,bob,dylan,BOB,DYLAN,3,9,5,25,8,BobBobBob,DylanDylanDylan,24
2,charlie,chaplin,CHARLIE,CHAPLIN,7,49,7,49,14,CharlieCharlieCharlie,ChaplinChaplinChaplin,42
3,bob,marley,BOB,MARLEY,3,9,6,36,9,BobBobBob,MarleyMarleyMarley,27


# Fixing column names through `feature_names_out`.

Column names using the `FeatureUnionDataFrame` or `ColumnTransformerDataFrame` uses the transformers' `get_feature_names_out()` function. 

In [28]:
fu_tx.get_feature_names_out()

array(['lower__first_name', 'lower__last_name', 'upper__first_name',
       'upper__last_name', "ind_lengths__('first_name', 'len')",
       "ind_lengths__('first_name', 'len_squared')",
       "ind_lengths__('last_name', 'len')",
       "ind_lengths__('last_name', 'len_squared')", 'combined__len_all',
       'factor__first_name', 'factor__last_name', 'factor__len_all'],
      dtype=object)

As seen above, some of these names now look a little off. In particular, the `agg_tx` transformer provided a hierarchical column index as shown in the cell below.

In [29]:
df_ = agg_tx.transform(df[['first_name', 'last_name']])
df_.columns

MultiIndex([('first_name',         'len'),
            ('first_name', 'len_squared'),
            ( 'last_name',         'len'),
            ( 'last_name', 'len_squared')],
           )

`feature_names` are string-formatted versions of the column index.

In [30]:
agg_tx.get_feature_names_out()

["('first_name', 'len')",
 "('first_name', 'len_squared')",
 "('last_name', 'len')",
 "('last_name', 'len_squared')"]

In [31]:
agg_tx = ApplyTransformer(
    try_member_func, 
    'agg', 
    feature_names_out=[
        'first_len', 
        'first_lensquared', 
        'last_len', 
        'last_lensquared'
    ],
    fkwargs=dict(
        fkwargs=(dict(func=[len, len_squared]))
    )
)
agg_tx.transform(df[['first_name', 'last_name']])

Unnamed: 0,first_len,first_lensquared,last_len,last_lensquared
0,4,16,8,64
1,3,9,5,25
2,7,49,7,49
3,3,9,6,36


Note that calling the transformer directly does not change the output -- it still works the same as if one executed Pandas `agg()`. However, in the context of FeatureUnions and these cases where we have to concatenate results, we flatten the column names and we can use the `feature_names`. Here, we replace the `agg_tx` sub-tranformer with the one we just created and see that it is using the `feature_names` we just ascribed.

In [32]:
fu_tx.transformer_list[2] = ('ind_lengths', agg_tx)
Xs = fu_tx.transform(df[['first_name', 'last_name']])
Xs

Unnamed: 0,lower__first_name,lower__last_name,upper__first_name,upper__last_name,ind_lengths__first_len,ind_lengths__first_lensquared,ind_lengths__last_len,ind_lengths__last_lensquared,combined__len_all,factor__first_name,factor__last_name,factor__len_all
0,anne,bancroft,ANNE,BANCROFT,4,16,8,64,12,AnneAnneAnne,BancroftBancroftBancroft,36
1,bob,dylan,BOB,DYLAN,3,9,5,25,8,BobBobBob,DylanDylanDylan,24
2,charlie,chaplin,CHARLIE,CHAPLIN,7,49,7,49,14,CharlieCharlieCharlie,ChaplinChaplinChaplin,42
3,bob,marley,BOB,MARLEY,3,9,6,36,9,BobBobBob,MarleyMarleyMarley,27


In [33]:
fu_tx.get_feature_names_out()

array(['lower__first_name', 'lower__last_name', 'upper__first_name',
       'upper__last_name', 'ind_lengths__first_len',
       'ind_lengths__first_lensquared', 'ind_lengths__last_len',
       'ind_lengths__last_lensquared', 'combined__len_all',
       'factor__first_name', 'factor__last_name', 'factor__len_all'],
      dtype=object)

Keeping the string-formatted version allows us to rebuild the hierarchy if we want. Alternatively, we can explicitly set the `feature_names` to be our flattened list.

# Pipeline

Let's put our Transformers into a Pipeline. 

> **Note** that the last step in pipelines should be a `passthrough` if the transformer does not implement `fit()`. It's probably easiest to get in the habit of always having a last step of `passthrough`.

In [34]:
pipe_tx = Pipeline(
    [
        ('lower', lowercase_tx), 
        ('reverse', reverse_tx), 
        ('upper', uppercase_tx), 
        ('factor', factor_tx),
        ('camel', camelcase_tx),
        ('space', spacer_tx),
        ('prepend', prepend_tx),
        ('last', 'passthrough')
    ]
) 
pipe_tx

And the output of this model:

In [35]:
pipe_tx.transform(df[['first_name', 'last_name']])

Unnamed: 0,first_name,last_name
0,PREPEND_E n n a e n n a e n n a,PREPEND_T f o r c n a b t f o r c n a b t f o ...
1,PREPEND_B o b b o b b o b,PREPEND_N a l y d n a l y d n a l y d
2,PREPEND_E i l r a h c e i l r a h c e i l r a h c,PREPEND_N i l p a h c n i l p a h c n i l p a h c
3,PREPEND_B o b b o b b o b,PREPEND_Y e l r a m y e l r a m y e l r a m


# Intermediate results

To see if this is working correctly, we can use the [intermediate_transforms context](https://bitbucket.org/dhigroupinc/dhi-match-datascience/src/master/src/dhi-dsmatch/dhi/dsmatch/sklearnmodeling/transformercontext.py), which shows us the transformations of each step in the pipeline.

In [36]:
with intermediate_transforms(pipe_tx):
    Xt = pipe_tx.transform(df[['first_name', 'last_name']])
    intermediate_results = pipe_tx.intermediate_results__

for k, v in intermediate_results.items():
    display(HTML(f'<h3>{k}</h3>'), v)

Unnamed: 0,first_name,last_name
0,anne,bancroft
1,bob,dylan
2,charlie,chaplin
3,bob,marley


Unnamed: 0,first_name,last_name
0,enna,tforcnab
1,bob,nalyd
2,eilrahc,nilpahc
3,bob,yelram


Unnamed: 0,first_name,last_name
0,ENNA,TFORCNAB
1,BOB,NALYD
2,EILRAHC,NILPAHC
3,BOB,YELRAM


Unnamed: 0,first_name,last_name
0,ENNAENNAENNA,TFORCNABTFORCNABTFORCNAB
1,BOBBOBBOB,NALYDNALYDNALYD
2,EILRAHCEILRAHCEILRAHC,NILPAHCNILPAHCNILPAHC
3,BOBBOBBOB,YELRAMYELRAMYELRAM


Unnamed: 0,first_name,last_name
0,Ennaennaenna,Tforcnabtforcnabtforcnab
1,Bobbobbob,Nalydnalydnalyd
2,Eilrahceilrahceilrahc,Nilpahcnilpahcnilpahc
3,Bobbobbob,Yelramyelramyelram


Unnamed: 0,first_name,last_name
0,E n n a e n n a e n n a,T f o r c n a b t f o r c n a b t f o r c n a b
1,B o b b o b b o b,N a l y d n a l y d n a l y d
2,E i l r a h c e i l r a h c e i l r a h c,N i l p a h c n i l p a h c n i l p a h c
3,B o b b o b b o b,Y e l r a m y e l r a m y e l r a m


Unnamed: 0,first_name,last_name
0,PREPEND_E n n a e n n a e n n a,PREPEND_T f o r c n a b t f o r c n a b t f o ...
1,PREPEND_B o b b o b b o b,PREPEND_N a l y d n a l y d n a l y d
2,PREPEND_E i l r a h c e i l r a h c e i l r a h c,PREPEND_N i l p a h c n i l p a h c n i l p a h c
3,PREPEND_B o b b o b b o b,PREPEND_Y e l r a m y e l r a m y e l r a m


Obviously, these are crazy transforms and meant to simply illustrate that the output of one transformer in the pipeline is the input to the next.

## Splitting and Branching

Let's look now at the following model that does more splitting and branching. It preprocesses data using the reverse transformer, sends that reversed data to a lower, upper, and factor transformer via a FeatureUnionDataFrame, and then sends all of that output to a spacer transformer.

In [37]:
fu_tx = FeatureUnionDataFrame(
    [
        ('lower', lowercase_tx), 
        ('upper', uppercase_tx), 
        ('factor', factor_tx)
    ]
)

pipe_tx = Pipeline(
    [
        ('reverse', reverse_tx), 
        ('branched', fu_tx),
        ('space', spacer_tx),
        ('last', 'passthrough')
    ]
)

pipe_tx

And looking at the intermediate steps...

In [38]:
with intermediate_transforms(pipe_tx):
    Xt = pipe_tx.transform(df[['first_name', 'last_name']])
    intermediate_results = pipe_tx.intermediate_results__

for k, v in intermediate_results.items():
    display(HTML(f'<h3>{k}</h3>'), v)

Unnamed: 0,first_name,last_name
0,ennA,tforcnaB
1,boB,nalyD
2,eilrahC,nilpahC
3,boB,yelraM


Unnamed: 0,lower__first_name,lower__last_name,upper__first_name,upper__last_name,factor__first_name,factor__last_name
0,enna,tforcnab,ENNA,TFORCNAB,ennAennAennA,tforcnaBtforcnaBtforcnaB
1,bob,nalyd,BOB,NALYD,boBboBboB,nalyDnalyDnalyD
2,eilrahc,nilpahc,EILRAHC,NILPAHC,eilrahCeilrahCeilrahC,nilpahCnilpahCnilpahC
3,bob,yelram,BOB,YELRAM,boBboBboB,yelraMyelraMyelraM


Unnamed: 0,lower__first_name,lower__last_name,upper__first_name,upper__last_name,factor__first_name,factor__last_name
0,e n n a,t f o r c n a b,E N N A,T F O R C N A B,e n n A e n n A e n n A,t f o r c n a B t f o r c n a B t f o r c n a B
1,b o b,n a l y d,B O B,N A L Y D,b o B b o B b o B,n a l y D n a l y D n a l y D
2,e i l r a h c,n i l p a h c,E I L R A H C,N I L P A H C,e i l r a h C e i l r a h C e i l r a h C,n i l p a h C n i l p a h C n i l p a h C
3,b o b,y e l r a m,B O B,Y E L R A M,b o B b o B b o B,y e l r a M y e l r a M y e l r a M


# Pipelines in FeatureUnions

Now let's consider sub-pipelines in feature unions. In this case, our main pipeline preprocesses by reversing its input. This then goes to a FeatureUnion that has two pipelines. In one pipeline, we lowercase and then space the characters. In the other pipeline, we uppercase and factor it. Finally, both outputs are sent to a camelcase transformer.

### FeatureNamesPipeline

Pipelines in sklearn do not include `get_feature_names()` like many other transformers. The `ColumnTransformerDataFrame` and `FeatureUnionDataFrame` objects, however, need its transformers to supply this. The `FeatureNamesPipeline` object permits this functionality by taking the `get_feature_names()` of its last transformer.

In the example below, we explore using feature names as well as having sub-pipelines.

In [39]:
prepend_tx = ApplyTransformer(applymap, prepend_text)
reverse_tx = ApplyTransformer(applymap, reverse_text)
uppercase_tx = ApplyTransformer(applymap, str.upper)

pipe_lowerspace = FeatureNamesPipeline(
    [
        ('lower', lowercase_tx), 
        ('space', spacer_tx), 
        ('last', 'passthrough')
    ]
)

pipe_upperfactor = FeatureNamesPipeline(
    [
        ('upper', uppercase_tx), 
        ('factor', factor_tx), 
        ('last', 'passthrough')
    ]
)

fu_tx = FeatureUnionDataFrame(
    [
        ('pipe_lowerspace', pipe_lowerspace), 
        ('pipe_upperfactor', pipe_upperfactor)
    ]
)

pipe_tx = Pipeline(
    [
        ('reverse', reverse_tx), 
        ('branched', fu_tx),
        ('prepend', prepend_tx),
        ('last', 'passthrough')
    ]
)

pipe_tx

In [40]:
pipe_tx.transform(df[['first_name', 'last_name']])

Unnamed: 0,pipe_lowerspace__first_name,pipe_lowerspace__last_name,pipe_upperfactor__first_name,pipe_upperfactor__last_name
0,PREPEND_e n n a,PREPEND_t f o r c n a b,PREPEND_ENNAENNAENNA,PREPEND_TFORCNABTFORCNABTFORCNAB
1,PREPEND_b o b,PREPEND_n a l y d,PREPEND_BOBBOBBOB,PREPEND_NALYDNALYDNALYD
2,PREPEND_e i l r a h c,PREPEND_n i l p a h c,PREPEND_EILRAHCEILRAHCEILRAHC,PREPEND_NILPAHCNILPAHCNILPAHC
3,PREPEND_b o b,PREPEND_y e l r a m,PREPEND_BOBBOBBOB,PREPEND_YELRAMYELRAMYELRAM


### Looking at the intermediate steps.

In [41]:
with intermediate_transforms(pipe_tx):
    Xt = pipe_tx.transform(df[['first_name', 'last_name']])
    intermediate_results = pipe_tx.intermediate_results__

for k, v in intermediate_results.items():
    display(HTML(f'<h3>{k}</h3>'), v)

Unnamed: 0,first_name,last_name
0,ennA,tforcnaB
1,boB,nalyD
2,eilrahC,nilpahC
3,boB,yelraM


Unnamed: 0,first_name,last_name
0,enna,tforcnab
1,bob,nalyd
2,eilrahc,nilpahc
3,bob,yelram


Unnamed: 0,first_name,last_name
0,e n n a,t f o r c n a b
1,b o b,n a l y d
2,e i l r a h c,n i l p a h c
3,b o b,y e l r a m


Unnamed: 0,first_name,last_name
0,ENNA,TFORCNAB
1,BOB,NALYD
2,EILRAHC,NILPAHC
3,BOB,YELRAM


Unnamed: 0,first_name,last_name
0,ENNAENNAENNA,TFORCNABTFORCNABTFORCNAB
1,BOBBOBBOB,NALYDNALYDNALYD
2,EILRAHCEILRAHCEILRAHC,NILPAHCNILPAHCNILPAHC
3,BOBBOBBOB,YELRAMYELRAMYELRAM


Unnamed: 0,pipe_lowerspace__first_name,pipe_lowerspace__last_name,pipe_upperfactor__first_name,pipe_upperfactor__last_name
0,e n n a,t f o r c n a b,ENNAENNAENNA,TFORCNABTFORCNABTFORCNAB
1,b o b,n a l y d,BOBBOBBOB,NALYDNALYDNALYD
2,e i l r a h c,n i l p a h c,EILRAHCEILRAHCEILRAHC,NILPAHCNILPAHCNILPAHC
3,b o b,y e l r a m,BOBBOBBOB,YELRAMYELRAMYELRAM


Unnamed: 0,pipe_lowerspace__first_name,pipe_lowerspace__last_name,pipe_upperfactor__first_name,pipe_upperfactor__last_name
0,PREPEND_e n n a,PREPEND_t f o r c n a b,PREPEND_ENNAENNAENNA,PREPEND_TFORCNABTFORCNABTFORCNAB
1,PREPEND_b o b,PREPEND_n a l y d,PREPEND_BOBBOBBOB,PREPEND_NALYDNALYDNALYD
2,PREPEND_e i l r a h c,PREPEND_n i l p a h c,PREPEND_EILRAHCEILRAHCEILRAHC,PREPEND_NILPAHCNILPAHCNILPAHC
3,PREPEND_b o b,PREPEND_y e l r a m,PREPEND_BOBBOBBOB,PREPEND_YELRAMYELRAMYELRAM


# Joining Pipelines

We may want to make a Pipeline that has one particular flow, and, independently, have another Pipeline that has its own particular flow, too. These two pipelines might have shared preprocessing steps and in certain conditions, we want to use the two pipelines simultaneously and have them share the preprocessing part. This can be accomplished using `leftjoin_pipeline()` as illustrated below. 

In this case, we instantiate two Pipelines that have the same preprocessing steps. We instantiate two different versions of those steps in the two pipelines and do the left join, which uses the preprocessing instances of the left when it conjoins with the right.

Actually, we demonstrate how the left preprocessing steps *may* contain extra steps in its pipeline, but where the pipelines join, they must be sufficiently equivalent with taking the same number of inputs and even the same input columns. We insert an "ignored" transformation to illustrate the point, which becomes clearer when viewing the intermediate transforms.

In [42]:
# Create the left pipeline
reverse_tx = ApplyTransformer(applymap, reverse_text, keys=['first_name', 'last_name'])
ignored_tx = ApplyTransformer(
    applymap, 
    repeat_factor, 
    keys={'first_name': 'fn_repeat_ignored', 'last_name': 'ln_repeat_ignored'}
)

prepend_tx = ApplyTransformer(
    applymap, 
    prepend_text, 
    keys={'first_name': 'fnew', 'last_name': 'lnew'},
    feature_names_out=['fnew', 'lnew']
)

uppercase_tx = ApplyTransformer(
    applymap, 
    str.upper, 
    keys={'fnew': 'fnoo', 'lnew': 'lnoo'}  # We create new columns: fnoo and lnoo
)

pipe_a_tx = Pipeline(
    [
        ('reverse', reverse_tx),
        ('ignored', ignored_tx),
        ('prepend', prepend_tx),
        ('lower', lowercase_tx),
        ('space', spacer_tx),
        ('last', 'passthrough')
    ]
)

# Make another "right" pipeline to join with the one we just created. We duplicate the
# same functionality of the reverse and prepend transformers to illustrate that they
# are different instances that get spliced out in the join.
reverse2_tx = ApplyTransformer(applymap, reverse_text, keys=['first_name', 'last_name'])

prepend2_tx = ApplyTransformer(
    applymap, 
    prepend_text, 
    keys={'first_name': 'fnew', 'last_name': 'lnew'},
    feature_names_out=['fnew', 'lnew']
)

pipe_b_tx = Pipeline(
    [
        ('reverse2', reverse2_tx), 
        ('prepend2', prepend2_tx),
        ('upper', uppercase_tx),
        ('last', 'passthrough')
    ]
)

# Do the join
joined_tx = leftjoin_pipeline(pipe_a_tx, pipe_b_tx, left_on='prepend', right_on='prepend2')
joined_tx

In [43]:
joined_tx.transform(df)

Unnamed: 0,left__fnew,left__lnew,right__fnew,right__lnew,right__fnoo,right__lnoo
0,p r e p e n d _ e n n a,p r e p e n d _ t f o r c n a b,PREPEND_ennA,PREPEND_tforcnaB,PREPEND_ENNA,PREPEND_TFORCNAB
1,p r e p e n d _ b o b,p r e p e n d _ n a l y d,PREPEND_boB,PREPEND_nalyD,PREPEND_BOB,PREPEND_NALYD
2,p r e p e n d _ e i l r a h c,p r e p e n d _ n i l p a h c,PREPEND_eilrahC,PREPEND_nilpahC,PREPEND_EILRAHC,PREPEND_NILPAHC
3,p r e p e n d _ b o b,p r e p e n d _ y e l r a m,PREPEND_boB,PREPEND_yelraM,PREPEND_BOB,PREPEND_YELRAM


Notice that the joined pipeline uses the preprocessing pipeline from the left pipeline, but not the right.

In [44]:
joined_tx.named_steps['pre'].named_steps['prepend'] == prepend_tx

True

In [45]:
joined_tx.named_steps['pre'].named_steps['prepend'] == prepend2_tx

False

In [46]:
with intermediate_transforms(joined_tx):
    Xt = joined_tx.transform(df[['first_name', 'last_name']])
    intermediate_results = joined_tx.intermediate_results__

for k, v in intermediate_results.items():
    display(HTML(f'<h3>{k}</h3>'), v)

Unnamed: 0,first_name,last_name
0,Anne,Bancroft
1,Bob,Dylan
2,Charlie,Chaplin
3,Bob,Marley


Unnamed: 0,first_name,last_name,fn_repeat_ignored,ln_repeat_ignored
0,Anne,Bancroft,AnneAnneAnne,BancroftBancroftBancroft
1,Bob,Dylan,BobBobBob,DylanDylanDylan
2,Charlie,Chaplin,CharlieCharlieCharlie,ChaplinChaplinChaplin
3,Bob,Marley,BobBobBob,MarleyMarleyMarley


Unnamed: 0,fnew,lnew
0,PREPEND_Anne,PREPEND_Bancroft
1,PREPEND_Bob,PREPEND_Dylan
2,PREPEND_Charlie,PREPEND_Chaplin
3,PREPEND_Bob,PREPEND_Marley


Unnamed: 0,fnew,lnew
0,PREPEND_Anne,PREPEND_Bancroft
1,PREPEND_Bob,PREPEND_Dylan
2,PREPEND_Charlie,PREPEND_Chaplin
3,PREPEND_Bob,PREPEND_Marley


Unnamed: 0,fnew,lnew
0,prepend_anne,prepend_bancroft
1,prepend_bob,prepend_dylan
2,prepend_charlie,prepend_chaplin
3,prepend_bob,prepend_marley


Unnamed: 0,fnew,lnew
0,p r e p e n d _ a n n e,p r e p e n d _ b a n c r o f t
1,p r e p e n d _ b o b,p r e p e n d _ d y l a n
2,p r e p e n d _ c h a r l i e,p r e p e n d _ c h a p l i n
3,p r e p e n d _ b o b,p r e p e n d _ m a r l e y


Unnamed: 0,fnew,lnew,fnoo,lnoo
0,PREPEND_Anne,PREPEND_Bancroft,PREPEND_ANNE,PREPEND_BANCROFT
1,PREPEND_Bob,PREPEND_Dylan,PREPEND_BOB,PREPEND_DYLAN
2,PREPEND_Charlie,PREPEND_Chaplin,PREPEND_CHARLIE,PREPEND_CHAPLIN
3,PREPEND_Bob,PREPEND_Marley,PREPEND_BOB,PREPEND_MARLEY


Unnamed: 0,left__fnew,left__lnew,left__fnoo,left__lnoo,right__fnew,right__lnew,right__fnoo,right__lnoo
0,p r e p e n d _ a n n e,p r e p e n d _ b a n c r o f t,p r e p e n d _ a n n e,p r e p e n d _ b a n c r o f t,PREPEND_Anne,PREPEND_Bancroft,PREPEND_ANNE,PREPEND_BANCROFT
1,p r e p e n d _ b o b,p r e p e n d _ d y l a n,p r e p e n d _ b o b,p r e p e n d _ d y l a n,PREPEND_Bob,PREPEND_Dylan,PREPEND_BOB,PREPEND_DYLAN
2,p r e p e n d _ c h a r l i e,p r e p e n d _ c h a p l i n,p r e p e n d _ c h a r l i e,p r e p e n d _ c h a p l i n,PREPEND_Charlie,PREPEND_Chaplin,PREPEND_CHARLIE,PREPEND_CHAPLIN
3,p r e p e n d _ b o b,p r e p e n d _ m a r l e y,p r e p e n d _ b o b,p r e p e n d _ m a r l e y,PREPEND_Bob,PREPEND_Marley,PREPEND_BOB,PREPEND_MARLEY


# Splitting with ColumnTransformer

The sklearn [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) takes a DataFrame and sends particular columns to different streams. Here, we are going to make the `lower__first_name` and `upper__first_name` columns go to a spacer transformer and the `lower__last_name`, `upper__last_name`, and `lower__first_name` columns to the prepend transformer. Notice that the `lower__first_name` column goes to *both* transformers. And, finally, those columns that went to the factor transformer will continue to pass through.

In [47]:
# Reset our Transformers. Above, we specified filtering keys that are not used below.
prepend_tx = ApplyTransformer(applymap, prepend_text)
reverse_tx = ApplyTransformer(applymap, reverse_text)
lowercase_tx = ApplyTransformer(applymap, str.lower)
uppercase_tx = ApplyTransformer(applymap, str.upper)
spacer_tx = ApplyTransformer(applymap, space_text)
factor_tx = ApplyTransformer(applymap, repeat_factor)
agg_tx = ApplyTransformer(
    try_member_func, 
    'agg', 
    feature_names_out=['first_len', 'first_lensquared', 'last_len', 'last_lensquared'],
    fkwargs=dict(
        fkwargs=(dict(func=[len, len_squared]))
    )
)

pipe_lowerspace = FeatureNamesPipeline(
    [
        ('lower', lowercase_tx), 
        ('space', spacer_tx), 
        ('last', 'passthrough')
    ]
)

pipe_upperfactor = FeatureNamesPipeline(
    [
        ('upper', uppercase_tx), 
        ('factor', factor_tx), 
        ('last', 'passthrough')
    ]
)

fu_tx = FeatureUnionDataFrame(
    [
        ('pipe_lowerspace', pipe_lowerspace), 
        ('pipe_upperfactor', pipe_upperfactor)
    ]
)

subpipe = FeatureNamesPipeline(
    [
        ('spacer', spacer_tx),
        ('agg', agg_tx)
    ]
)

col_tx = ColumnTransformerDataFrame(
    [
        ('subpipe', subpipe,
         [
             'pipe_lowerspace__first_name', 
             'pipe_upperfactor__first_name'
         ]
        ),
        ('prepends', prepend_tx, 
         [
             'pipe_lowerspace__last_name',
             'pipe_upperfactor__last_name',
             'pipe_lowerspace__first_name'
         ]
        )
    ],
    remainder='passthrough'
)

pipe_tx = Pipeline(
    [
        ('reverse', reverse_tx), 
        ('branched', fu_tx),
        ('columntx', col_tx),
        ('last', 'passthrough')
    ]
)

pipe_tx

Let's execute it. (This results in an error addressed below.)

In [48]:
pipe_tx.transform(df[['first_name', 'last_name']])

NotFittedError: This ColumnTransformerDataFrame instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

### Be sure to `fit()` or `fit_transform()` when using the ColumnTransformer.

To avoid the error above, we have to call `fit()` or `fit_transform()` on the model.

In [49]:
pipe_tx.fit(df[['first_name', 'last_name']])

In [50]:
pipe_tx.transform(df[['first_name', 'last_name']])

Unnamed: 0,subpipe__first_len,subpipe__first_lensquared,subpipe__last_len,subpipe__last_lensquared,prepends__pipe_lowerspace__last_name,prepends__pipe_upperfactor__last_name,prepends__pipe_lowerspace__first_name
0,13,169,23,529,PREPEND_b a n c r o f t,PREPEND_BANCROFTBANCROFTBANCROFT,PREPEND_a n n e
1,9,81,17,289,PREPEND_d y l a n,PREPEND_DYLANDYLANDYLAN,PREPEND_b o b
2,25,625,41,1681,PREPEND_c h a p l i n,PREPEND_CHAPLINCHAPLINCHAPLIN,PREPEND_c h a r l i e
3,9,81,17,289,PREPEND_m a r l e y,PREPEND_MARLEYMARLEYMARLEY,PREPEND_b o b


### Intermediate results again.

As before, we can look at intermediate results.

In [51]:
with intermediate_transforms(pipe_tx):
    Xt = pipe_tx.transform(df[['first_name', 'last_name']])
    intermediate_results = pipe_tx.intermediate_results__

for k, v in intermediate_results.items():
    display(HTML(f'<h3>{k}</h3>'), v)

Unnamed: 0,first_name,last_name
0,Anne,Bancroft
1,Bob,Dylan
2,Charlie,Chaplin
3,Bob,Marley


Unnamed: 0,first_name,last_name
0,anne,bancroft
1,bob,dylan
2,charlie,chaplin
3,bob,marley


Unnamed: 0,first_name,last_name
0,a n n e,b a n c r o f t
1,b o b,d y l a n
2,c h a r l i e,c h a p l i n
3,b o b,m a r l e y


Unnamed: 0,first_name,last_name
0,ANNE,BANCROFT
1,BOB,DYLAN
2,CHARLIE,CHAPLIN
3,BOB,MARLEY


Unnamed: 0,first_name,last_name
0,ANNEANNEANNE,BANCROFTBANCROFTBANCROFT
1,BOBBOBBOB,DYLANDYLANDYLAN
2,CHARLIECHARLIECHARLIE,CHAPLINCHAPLINCHAPLIN
3,BOBBOBBOB,MARLEYMARLEYMARLEY


Unnamed: 0,pipe_lowerspace__first_name,pipe_lowerspace__last_name,pipe_upperfactor__first_name,pipe_upperfactor__last_name
0,a n n e,b a n c r o f t,ANNEANNEANNE,BANCROFTBANCROFTBANCROFT
1,b o b,d y l a n,BOBBOBBOB,DYLANDYLANDYLAN
2,c h a r l i e,c h a p l i n,CHARLIECHARLIECHARLIE,CHAPLINCHAPLINCHAPLIN
3,b o b,m a r l e y,BOBBOBBOB,MARLEYMARLEYMARLEY


Unnamed: 0,pipe_lowerspace__first_name,pipe_upperfactor__first_name
0,a n n e,A N N E A N N E A N N E
1,b o b,B O B B O B B O B
2,c h a r l i e,C H A R L I E C H A R L I E C H A R L I E
3,b o b,B O B B O B B O B


Unnamed: 0,first_len,first_lensquared,last_len,last_lensquared
0,13,169,23,529
1,9,81,17,289
2,25,625,41,1681
3,9,81,17,289


Unnamed: 0,subpipe__first_len,subpipe__first_lensquared,subpipe__last_len,subpipe__last_lensquared,prepends__pipe_lowerspace__last_name,prepends__pipe_upperfactor__last_name,prepends__pipe_lowerspace__first_name
0,13,169,23,529,PREPEND_b a n c r o f t,PREPEND_BANCROFTBANCROFTBANCROFT,PREPEND_a n n e
1,9,81,17,289,PREPEND_d y l a n,PREPEND_DYLANDYLANDYLAN,PREPEND_b o b
2,25,625,41,1681,PREPEND_c h a p l i n,PREPEND_CHAPLINCHAPLINCHAPLIN,PREPEND_c h a r l i e
3,9,81,17,289,PREPEND_m a r l e y,PREPEND_MARLEYMARLEYMARLEY,PREPEND_b o b


# Looping through a ColumnTransformer

The final code snippet shows how we can loop through pipelines of a ColumnTransformer. Note that if the subtransformers of the ColumnTransformer are not Pipeline objects, then this method will fail.

In [52]:
agg_tx = ApplyTransformer(
    try_member_func, 
    'agg', 
    feature_names_out=['first_len', 'first_lensquared'],
    fkwargs=dict(
        fkwargs=(dict(func=[len, len_squared]))
    )
)

pipereverseupper = FeatureNamesPipeline(
    [
        ('reverse', reverse_tx),
        ('upper', uppercase_tx),
        ('last', 'passthrough'),
    ]
)

pipefactorcamel = FeatureNamesPipeline(
    [
        ('factor', factor_tx),
        ('camel', camelcase_tx),
        ('last', 'passthrough'),
    ]
)

pipereverseupper_agg = FeatureNamesPipeline(
    [
        ('pipereverseupper', pipereverseupper),
        ('agg', agg_tx),
        ('last', 'passthrough'),
    ]
)

col_tx1 = ColumnTransformerDataFrame(
    [
        ('pipefirst', pipereverseupper_agg, ['first_name']),
        ('pipelast', pipefactorcamel, ['last_name']),
    ],
    remainder='passthrough'
)

col_tx1.fit(df)

In [53]:
intermediate_results = {}
for name, transformer, cols in col_tx1.transformers_:
    if transformer == 'passthrough' or name == 'remainder':
        continue
    with intermediate_transforms(transformer):
        Xt = transformer.transform(df[cols])
        intermediate_results_sub = transformer.intermediate_results__

    for key_sub in intermediate_results_sub.keys():
        sub_name = name + '__' + key_sub
        intermediate_results[sub_name] = intermediate_results_sub[key_sub]

for k, v in intermediate_results.items():
    display(HTML(f'<h3>{k}</h3>'), v)
    
display(HTML(f'<h3>Composite Transformation</h3>'), col_tx1.transform(df))

Unnamed: 0,first_name
0,Anne
1,Bob
2,Charlie
3,Bob


Unnamed: 0,first_name
0,ANNE
1,BOB
2,CHARLIE
3,BOB


Unnamed: 0,first_name
0,ANNE
1,BOB
2,CHARLIE
3,BOB


Unnamed: 0,first_len,first_lensquared
0,4,16
1,3,9
2,7,49
3,3,9


Unnamed: 0,last_name
0,tforcnaBtforcnaBtforcnaB
1,nalyDnalyDnalyD
2,nilpahCnilpahCnilpahC
3,yelraMyelraMyelraM


Unnamed: 0,last_name
0,Tforcnabtforcnabtforcnab
1,Nalydnalydnalyd
2,Nilpahcnilpahcnilpahc
3,Yelramyelramyelram


Unnamed: 0,pipefirst__first_len,pipefirst__first_lensquared,pipelast__last_name,remainder__age,remainder__len_all
0,4,16,Tforcnabtforcnabtforcnab,20,12
1,3,9,Nalydnalydnalyd,21,8
2,7,49,Nilpahcnilpahcnilpahc,22,14
3,3,9,Yelramyelramyelram,23,9


# Conclusion

This notebook demonstrates the utility of using the sklearn Pipeline architecture, with branching and splitting. Models can obviously be made rather complex with the branching and splitting capabilities, but through graphical tools and vieweing intermediate results, these approaches can help in the debugging and explainability of our models.

With respect to our models, inputs will largely come as a formatted DataFrame, so ColumnTransformerDataFrame should be the object that splits columns into other FeatureNamesPipeline objects. This also requires careful use of the `feature_names` properties of the transformers.
