# Capstone Project: <br/> Arvato Customer Acquisition Prediction Using Supervised Learning <br/><br/> Part 2: Developing the Data Transformation Pipeline

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Run-Common-Code-Notebook" data-toc-modified-id="Run-Common-Code-Notebook-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Run Common Code Notebook</a></span></li><li><span><a href="#Load-Raw-Data" data-toc-modified-id="Load-Raw-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load Raw Data</a></span></li><li><span><a href="#Metadata-Class-Definition" data-toc-modified-id="Metadata-Class-Definition-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Metadata Class Definition</a></span></li><li><span><a href="#Custom--Transformers-Class-Definitions" data-toc-modified-id="Custom--Transformers-Class-Definitions-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Custom  Transformers Class Definitions</a></span><ul class="toc-item"><li><span><a href="#Transformers-package-dependencies" data-toc-modified-id="Transformers-package-dependencies-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Transformers package dependencies</a></span></li><li><span><a href="#Custom-Preprocessor" data-toc-modified-id="Custom-Preprocessor-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Custom Preprocessor</a></span></li><li><span><a href="#Missing-Data-Columns-Remover" data-toc-modified-id="Missing-Data-Columns-Remover-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Missing Data Columns Remover</a></span></li><li><span><a href="#Missing-Train-Data-Rows-Remover" data-toc-modified-id="Missing-Train-Data-Rows-Remover-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Missing Train Data Rows Remover</a></span></li><li><span><a href="#Correlation-Remover" data-toc-modified-id="Correlation-Remover-5.5"><span class="toc-item-num">5.5&nbsp;&nbsp;</span>Correlation Remover</a></span></li><li><span><a href="#Training-Samples-Duplicate-Remover" 
data-toc-modified-id="Training-Samples-Duplicate-Remover-5.6"><span class="toc-item-num">5.6&nbsp;&nbsp;</span>Training Samples Duplicate Remover</a></span></li><li><span><a href="#Custom-Simple-Imputer" data-toc-modified-id="Custom-Simple-Imputer-5.7"><span class="toc-item-num">5.7&nbsp;&nbsp;</span>Custom Simple Imputer</a></span></li><li><span><a href="#Custom-Outlier-Remover" data-toc-modified-id="Custom-Outlier-Remover-5.8"><span class="toc-item-num">5.8&nbsp;&nbsp;</span>Custom Outlier Remover</a></span></li><li><span><a href="#Custom-OverSampler" data-toc-modified-id="Custom-OverSampler-5.9"><span class="toc-item-num">5.9&nbsp;&nbsp;</span>Custom OverSampler</a></span></li><li><span><a href="#Custom-Column-Transformer" data-toc-modified-id="Custom-Column-Transformer-5.10"><span class="toc-item-num">5.10&nbsp;&nbsp;</span>Custom Column Transformer</a></span></li></ul></li><li><span><a href="#Write-Custom-Transformer-Class-Definitions-to-Python-Source-Script" data-toc-modified-id="Write-Custom-Transformer-Class-Definitions-to-Python-Source-Script-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Write Custom Transformer Class Definitions to Python Source Script</a></span></li><li><span><a href="#Pipeline-Testing" data-toc-modified-id="Pipeline-Testing-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Pipeline Testing</a></span><ul class="toc-item"><li><span><a href="#Load-Transformer-Class-Definitions-and-Metadata" data-toc-modified-id="Load-Transformer-Class-Definitions-and-Metadata-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Load Transformer Class Definitions and Metadata</a></span></li><li><span><a href="#Pipeline-Definition" data-toc-modified-id="Pipeline-Definition-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Pipeline Definition</a></span></li><li><span><a href="#Extract-the-Pipeline-Intermediate-Outputs" data-toc-modified-id="Extract-the-Pipeline-Intermediate-Outputs-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>Extract the 
Pipeline Intermediate Outputs</a></span></li><li><span><a href="#Test-Pipeline-Training" data-toc-modified-id="Test-Pipeline-Training-7.4"><span class="toc-item-num">7.4&nbsp;&nbsp;</span>Test Pipeline Training</a></span></li><li><span><a href="#Test-Pipeline-Inference" data-toc-modified-id="Test-Pipeline-Inference-7.5"><span class="toc-item-num">7.5&nbsp;&nbsp;</span>Test Pipeline Inference</a></span></li></ul></li></ul></div>

## Introduction

In this notebook we develop the data processing pipeline for our project. The code developed here is packaged in `../src/metadata.py` and `../src/transformers.py`.

The pipeline is outlined in detail in the following diagram:

![pipeline.svg](attachment:pipeline.svg)

## Run Common Code Notebook

In [39]:
%run 00_common.ipynb

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Overwriting ../src/helper_functions.py


## Load Raw Data

In [2]:
# Load the training dataset, from a compressed archive
data_archive = os.path.join(dir_data_raw, "Udacity_MAILOUT_052018_TRAIN.tar.xz")
with tarfile.open(data_archive, "r:*") as tar:
    mailout_train = pd.read_csv(tar.extractfile('Udacity_MAILOUT_052018_TRAIN.csv'), sep=";")



In [3]:
# Load the test data
data_archive = os.path.join(dir_data_raw, "Udacity_MAILOUT_052018_TEST.tar.xz")
with tarfile.open(data_archive, "r:*") as tar:
    mailout_test = pd.read_csv(tar.extractfile('Udacity_MAILOUT_052018_TEST.csv'), sep=";")



## Metadata Class Definition

In [7]:
%%writefile "../src/metadata.py"

import numpy as np
import pandas as pd

class Metadata():
    """
        Attributes
        ----------
        df_metadata : pandas.DataFrame
            Clean metadata loaded from a CSV file.
        
        df_lookup : pandas.DataFrame
            A lookup table of features with their type and their source ('metadata' or 'extra').
    """
    def __init__(self, file_path):
        """
        
        Parameters
        ----------
        file_path : str
            A path to clean metadata file.
        """
 
        self.df_metadata = pd.read_csv(file_path)
        dict_features_extra = {
            'numeric': [
                         'ALTER_KIND1', 'ALTER_KIND2', 'ALTER_KIND3', 'ALTER_KIND4', 'ANZ_STATISTISCHE_HAUSHALTE',
                         'EINGEFUEGT_AM', 'EINGEZOGENAM_HH_JAHR', 'EXTSEL992', 'VERDICHTUNGSRAUM'
            ],
            'nominal': [
                        'AKT_DAT_KL', 'ALTERSKATEGORIE_FEIN', 'ANZ_KINDER', 'ARBEIT', 'CJT_KATALOGNUTZER',
                        'CJT_TYP_1', 'CJT_TYP_2', 'CJT_TYP_3', 'CJT_TYP_4', 'CJT_TYP_5', 'CJT_TYP_6',
                        'D19_BUCH_CD', 'D19_KONSUMTYP_MAX',
                        'D19_LETZTER_KAUF_BRANCHE', 'D19_SOZIALES',
                        'D19_TELKO_ONLINE_QUOTE_12', 'D19_VERSI_DATUM',
                        'D19_VERSI_OFFLINE_DATUM', 'D19_VERSI_ONLINE_DATUM',
                        'D19_VERSI_ONLINE_QUOTE_12', 'DSL_FLAG', 'FIRMENDICHTE', 'GEMEINDETYP',
                        'HH_DELTA_FLAG', 'KOMBIALTER', 'KONSUMZELLE', 'MOBI_RASTER',
                        'RT_KEIN_ANREIZ', 'RT_SCHNAEPPCHEN', 'RT_UEBERGROESSE',
                        'STRUKTURTYP', 'UMFELD_ALT', 'UMFELD_JUNG', 'UNGLEICHENN_FLAG',
                        'VHA', 'VHN', 'VK_DHT4A', 'VK_DISTANZ', 'VK_ZG11'            
            ],
            'ordinal': [
                        'KBA13_ANTG1', 'KBA13_ANTG2', 'KBA13_ANTG3',
                        'KBA13_ANTG4', 'KBA13_BAUMAX', 'KBA13_GBZ', 'KBA13_HHZ',
                        'KBA13_KMH_210'
            ]
        }
        self.df_lookup = self.df_metadata[['attribute', 'type']]\
                        .drop_duplicates(subset=['attribute', 'type'])\
                        .reset_index(drop=True)
        self.df_lookup['dataset'] = 'metadata'
        
        for key in dict_features_extra.keys():
            n = len(dict_features_extra[key])
            
            self.df_lookup = pd.concat(
                [
                    self.df_lookup,
                    pd.DataFrame({
                                'attribute': dict_features_extra[key],
                                'type': [key] * n,
                                'dataset': ['extra'] * n
                                 })
                ],
                ignore_index=True
            )

    def drop_unknown_values(self):
        """Drop rows tagged `unknown` from the metadata frame.

        Values that are no longer listed in the metadata are treated as
        invalid by the preprocessor and replaced with np.nan.
        """
        self.df_metadata.drop(self.df_metadata.query("meaning == 'unknown'").index, inplace=True)

    def lookup_features(self, input_df, method='intersect', subsets=['metadata', 'extra'], types=['numeric', 'ordinal','nominal']):
        """Look up metadata features against the columns of a dataframe, either by intersection or exclusion.
        
        Parameters
        ----------
        input_df : pandas.DataFrame
            A dataframe whose columns will be used in the query.
            
        method : str, default='intersect'
            - If 'intersect', return features in metadata that intersect with input_df columns.
            - If 'diff', return features that are in input_df columns, but not in metadata.
            - If 'complement', return features that are in metadata, but not in input_df columns.
            
        subsets : list, default=['metadata', 'extra']
            Metadata subsets to be queried.
            - If 'metadata', return features that exist in metadata only.
            - If 'extra', return features that don't exist in the metadata.
            
        types : list, default=['numeric', 'ordinal','nominal']
            Feature types to be queried.

        Returns
        -------
        list
            Sorted feature names selected by `method`.
        """
    
        lookup_cols = self.get_features(subsets=subsets, types=types)
        if method == 'intersect':
            func = np.intersect1d
            set_order = (input_df.columns, lookup_cols)
        elif method == 'diff':
            func = np.setdiff1d
            set_order = (input_df.columns, lookup_cols)
        elif method == 'complement':
            func = np.setdiff1d
            set_order = (lookup_cols, input_df.columns)
        else:
            # Fail early instead of hitting an undefined `func` below
            raise ValueError(f"Unknown method: {method!r}")
        return func(*set_order).tolist()
    
    def get_features(self, subsets=['metadata', 'extra'], types=['numeric', 'ordinal','nominal']):
        """Look up metadata features by subset and type.

        Parameters
        ----------
        subsets : list, default=['metadata', 'extra']
            Metadata subsets to be queried.
            - If 'metadata', return features that exist in metadata only.
            - If 'extra', return features that don't exist in the metadata.
            
        types : list, default=['numeric', 'ordinal','nominal']
            Feature types to be queried.
        """

        lookup_cols = self.df_lookup.query(f"type.isin({types}) and dataset.isin({subsets})", engine="python")\
                          .attribute.tolist()
        return lookup_cols
    
    def get_valid_ranges(self, col_name, col_dtype):
        """Get valid ranges of a feature
        
        Parameters
        ----------
        col_name : str
            Column name.
            
        col_dtype : object
            Column dtype.
            
        Returns
        -------
        valid_ranges : ndarray
            An array of the valid values listed in the metadata for the given feature.
        """
        return self.df_metadata.query(f"attribute == '{col_name}'")['value'].to_numpy().astype(col_dtype)

Overwriting ../src/metadata.py
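
The three `method` options of `lookup_features` map onto NumPy set operations. The following is a minimal sketch of that mapping; the column names are made up for illustration and do not come from the dataset or the metadata file.

```python
import numpy as np

# Hypothetical column names, for illustration only
dataset_cols = ["A", "B", "C"]
metadata_cols = ["B", "C", "D"]

# 'intersect': features present in both the dataset and the metadata
intersect = np.intersect1d(dataset_cols, metadata_cols).tolist()

# 'diff': features in the dataset that the metadata does not describe
diff = np.setdiff1d(dataset_cols, metadata_cols).tolist()

# 'complement': features described in the metadata but absent from the dataset
complement = np.setdiff1d(metadata_cols, dataset_cols).tolist()

print(intersect, diff, complement)
```

Both NumPy functions return sorted unique values, so the original column order of the dataframe is not preserved in the result.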


## Custom  Transformers Class Definitions

### Transformers package dependencies

In [None]:
import re

import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (OneHotEncoder, StandardScaler)

### Custom Preprocessor

In [10]:
class CustomPreprocessor(BaseEstimator,TransformerMixin):
    def __init__(self, metadata):
        self.metadata = metadata
        
        # Process: Drop unknown values from the metadata frame, as all values not found in the metadata will be replaced by np.nan
        self.metadata.drop_unknown_values()     

    def replace_invalid_values(self, series):
        series = series.copy()
        col_name = series.name
        col_dtype = series.dtype
        valid_metadata_ranges = self.metadata.get_valid_ranges(col_name, col_dtype)

        if col_dtype == object:
            dataset_ranges = series.fillna('NULL').unique().astype(col_dtype)
        else:
            dataset_ranges = series.unique().astype(col_dtype)

        invalid_values = np.setdiff1d(
            dataset_ranges,
            valid_metadata_ranges
        )
        series.replace(invalid_values, np.nan, inplace=True)

        return series


    def fit(self,X,y=None):
        return self
    
    def transform(self,X,y=None):
        X = pd.DataFrame(X).copy()
        
        # Process: Remove invalid entries in ['CAMEO_DEUG_2015', 'CAMEO_INTL_2015'] (columns 18 and 19) and change dtype to float.
        # The result is assigned back, because replace(inplace=True) on an iloc slice may act on a copy.
        cameo_cols = X.columns[[18, 19]]
        X[cameo_cols] = X[cameo_cols].replace(['X', 'XX'], np.nan).astype('float')
        
        # Process: Drop 'LNR' column as it has unique values:
        X.drop('LNR', axis=1, inplace=True)
        
        # Process: Columns that exist in dataset that don't have metadata:
        extra_cols_dataset = self.metadata.lookup_features(X, method='diff', subsets=['metadata'])

        # Process: Columns that exist in metadata and not in the dataset:
        extra_cols_metadata = self.metadata.lookup_features(X, method='complement', subsets=['metadata'])

        # Process: Rename D19 columns in the dataset whose metadata counterpart carries an _RZ suffix, so that they match the metadata
        X.columns =  X.columns.map(
            lambda x: re.sub(r"^(D19_.*)$", r"\1_RZ", x) 
            if x in np.intersect1d([re.sub(r"_RZ", "", x) for x in extra_cols_metadata], extra_cols_dataset) 
            else x
        )
        
        # Process: Manually associate and rename columns
        X.rename(columns={
                        'CAMEO_INTL_2015':'CAMEO_DEUINTL_2015',
                        'KK_KUNDENTYP': 'D19_KK_KUNDENTYP',
                        'KBA13_CCM_1401_2500': 'KBA13_CCM_1400_2500', 
                        'D19_BUCH_CD_RZ': 'D19_BUCH_RZ',
                        'SOHO_KZ': 'SOHO_FLAG'
                        }, 
                        inplace=True
                )

        # Process: Convert the `EINGEFUEGT_AM` date column to a year column.
        X['EINGEFUEGT_AM'] = X['EINGEFUEGT_AM'].astype('datetime64[ns]').dt.year
        
        # Process: find categorical features that exist in both the dataset and metadata.
        dataset_meta_categorical = self.metadata.lookup_features(
            X,
            method='intersect',
            subsets=['metadata'], 
            types=['ordinal', 'nominal']
        )
        
        # Process: find and replace invalid values with np.nan
        X[dataset_meta_categorical] = X[dataset_meta_categorical].apply(self.replace_invalid_values)
               
        return X

In [219]:
# preprocessor_pipeline = Pipeline([
#     ('preprocessor', CustomPreprocessor(metadata)), 
# ])
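
The invalid-value step of the preprocessor can be sketched on toy data as follows. The feature name and its valid codes are hypothetical, and the sketch masks values with `isin`/`where` instead of the `replace` call used in `replace_invalid_values`.

```python
import numpy as np
import pandas as pd

# Hypothetical feature: assume the metadata lists 1, 2 and 3 as valid codes
valid_values = np.array([1, 2, 3], dtype="float64")
series = pd.Series([1.0, 2.0, 9.0, 3.0, -1.0], name="TOY_FEATURE")

# Values observed in the data that the metadata does not list
invalid_values = np.setdiff1d(series.dropna().unique(), valid_values)

# Mask them as missing so that a later imputation step can handle them
cleaned = series.where(~series.isin(invalid_values))

print(invalid_values)
print(cleaned)
```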

### Missing Data Columns Remover

In [15]:
class MissingDataColsRemover(BaseEstimator,TransformerMixin):
    def __init__(self, missing_threshold=0.3):
        self.missing_threshold = missing_threshold
        
    def fit(self,X,y=None):
        missing_columns_freqs = (X.isna().sum() / X.shape[0]).sort_values(ascending = False)
        self.missing_data_cols = missing_columns_freqs[missing_columns_freqs > self.missing_threshold].index.tolist()
        return self
    
    def transform(self,X,y=None):
        X = pd.DataFrame(X).copy()
        X = X.drop(self.missing_data_cols, axis=1)       
        self.feature_names_out = X.columns.to_list()
        return X
    def get_feature_names_out(self, *args):
        return self.feature_names_out
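
As a small illustration of the thresholding done in `fit`, the sketch below computes the per-column missing fraction on a toy dataframe (made-up column names) and selects the columns above the threshold.

```python
import numpy as np
import pandas as pd

# Toy data: one column is 75% missing, one is complete, one is 25% missing
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "complete": [1.0, 2.0, 3.0, 4.0],
    "one_missing": [1.0, np.nan, 3.0, 4.0],
})

missing_threshold = 0.3

# isna().mean() equals isna().sum() / number of rows
missing_fraction = df.isna().mean()
cols_to_drop = missing_fraction[missing_fraction > missing_threshold].index.tolist()

print(cols_to_drop)
```

Because the list of columns is learned in `fit`, the same columns are dropped at inference time regardless of how many values are missing in the new data.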

### Missing Train Data Rows Remover

In [54]:
class TrainMissingDataRowsRemover(TransformerMixin):
    def __init__(self, missing_threshold=0.3):
        self.missing_threshold = missing_threshold

    def fit(self,X,y=None):
        # Drop the rows in place from both X and y, so that they stay aligned
        missing_rows_freqs = (X.isna().sum(axis=1) / X.shape[1]).sort_values(ascending = False)
        drop_idx = missing_rows_freqs[missing_rows_freqs > self.missing_threshold].index.tolist()        
        X.drop(drop_idx, inplace=True)
        y.drop(drop_idx, inplace=True)
        return self
    
    def transform(self,X,y=None):
        return X

In [36]:
# # Testing code
# X = output[1]['X'].copy()
# y = output[1]['y'].copy()

# train_missing_transformer = TrainMissingDataRowsRemover()

# print(X.shape, y.shape)

# train_missing_transformer.fit_transform(X, y)

# print(X.shape, y.shape)

### Correlation Remover

In [16]:
class CorrelatedRemover(BaseEstimator,TransformerMixin):
    def __init__(self, correlation_threshold=0.6):
        self.correlation_threshold = correlation_threshold
        

    def fit(self,X,y=None):
        correlation_matrix = X.corr().abs()
        
        # Select upper triangle of correlation matrix
        upper_corr_mat = correlation_matrix.where(
            np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))

        self.correlated_cols_to_drop = upper_corr_mat.columns[(upper_corr_mat > self.correlation_threshold).any(axis='rows')].tolist()

        return self

    def transform(self,X,y=None):
        X = pd.DataFrame(X).copy()
        X.drop(self.correlated_cols_to_drop, axis=1, inplace=True)
        self.feature_names_out = X.columns.to_list()
        return X
    def get_feature_names_out(self, *args):
        return self.feature_names_out

In [142]:
# # Testing code
# column_transformer = CustomColumnTransformer(
#     [
#         (
#             'numeric_ordinal',
#             CorrelatedRemover(correlation_threshold=0.6),
#             partial(metadata.lookup_features, types=['numeric', 'ordinal'])
#         )
#     ],
#     remainder = 'passthrough'
# )

# X = output[1]['X'].copy()
# y = output[1]['y'].copy()

# X_out = column_transformer.fit_transform(X, y)

# X_out.shape
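
The upper-triangle trick used in `CorrelatedRemover.fit` can be seen on a toy dataframe. This is a sketch with made-up columns: `b` is an exact multiple of `a`, while `c` is only weakly correlated with both.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly correlated with "a" and "b"
})

corr = df.corr().abs()

# Keep only the entries above the diagonal, so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is dropped if it correlates strongly with any earlier column
to_drop = upper.columns[(upper > 0.6).any(axis=0)].tolist()

print(to_drop)
```

Only the later column of each correlated pair is dropped, so `a` is kept and `b` is removed.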

### Training Samples Duplicate Remover

In [19]:
class TrainDuplicatesRemover(TransformerMixin):
    def __init__(self):
        pass

    def fit(self,X,y=None):
        # Drop the duplicated rows in place from both X and y, so that they stay aligned
        drop_idx = X[X.duplicated()].index
        X.drop(drop_idx, inplace=True)
        y.drop(drop_idx, inplace=True)
        return self
    
    def transform(self,X,y=None):
        return X
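
The row-removing transformers in this notebook drop rows inside `fit` and mutate `X` and `y` in place. A minimal sketch of that idea on toy data (made-up values):

```python
import pandas as pd

X = pd.DataFrame({"f1": [1, 1, 2], "f2": [7, 7, 8]})
y = pd.Series([0, 0, 1])

# Index labels of rows that repeat an earlier row
drop_idx = X[X.duplicated()].index

# Dropping the same labels from both objects keeps features and target aligned
X.drop(drop_idx, inplace=True)
y.drop(drop_idx, inplace=True)

print(X.index.tolist(), y.index.tolist())
```

Since `transform` returns `X` unchanged, rows are only removed while fitting; no rows are dropped at inference time.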

### Custom Simple Imputer

In [17]:
class CustomSimpleImputer(SimpleImputer):
    def __init__(self, strategy, **kwargs):
        super().__init__(strategy=strategy, **kwargs)
    def transform(self, X):
        output = super().transform(X)
        return pd.DataFrame(output, columns=self.feature_names_in_, index=X.index)
    def get_feature_names_out(self, column):
        return column

### Custom Outlier Remover

In [297]:
# # Using Local Outlier Factor
# class TrainOutlierRemover(BaseEstimator, TransformerMixin):
#     def __init__(
#         self, 
#         selector_callable = lambda X: X.columns, 
#         n_neighbors=20, 
#         contamination=0.1,
#         **kwargs
#     ):
#         self.lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination, **kwargs)
#         self.selector_callable = selector_callable
#         pass
#     def fit(self, X, y):
#         ypred = self.lof.fit_predict(X[self.selector_callable(X)])
#         X.drop(X.index[ypred == -1], inplace=True)
#         y.drop(y.index[ypred == -1], inplace=True)
#         return self

#     def transform(self, X, y=None):
#         return X

In [308]:
# IQR method
# class TrainOutlierRemover(BaseEstimator, TransformerMixin):
#     def __init__(
#         self, 
#         selector_callable = lambda X: X.columns, 
#     ):
#         self.selector_callable = selector_callable
        
#     def fit(self, X, y):
#         query_cols = self.selector_callable(X)
#         outliers_idx = X[X[query_cols].apply(self.get_outliers).any(axis=1)].index
#         X.drop(outliers_idx, inplace=True)
#         y.drop(outliers_idx, inplace=True)
#         return self

#     def transform(self, X, y=None):
#         return X
    
#     def get_outliers(self, s):
#         q25 , q75 = s.quantile(.25), s.quantile(.75)
#         cutoff = 1.5 * (q75 - q25)
#         return ~ s.between(q25 - cutoff, q75 + cutoff) # a series that has True in place of outliers

In [20]:
# 3 Standard Deviation Outliers
class TrainOutlierRemover(BaseEstimator, TransformerMixin):
    def __init__(
        self, 
        selector_callable = lambda X: X.columns, 
    ):
        self.selector_callable = selector_callable
        
    def fit(self, X, y):
        query_cols = self.selector_callable(X)
        outliers_idx = X[X[query_cols].apply(self.get_outliers).any(axis=1)].index
        X.drop(outliers_idx, inplace=True)
        y.drop(outliers_idx, inplace=True)
        return self

    def transform(self, X, y=None):
        return X
    
    def get_outliers(self, s):
        mean , std = s.mean(), s.std()
        cutoff = 3 * std 
        return ~ s.between(mean - cutoff, mean + cutoff) # a series that has True in place of outliers

In [338]:
# # Testing code
# outlier_transformer = TrainOutlierRemover(
#     selector_callable = partial(metadata.lookup_features, method='intersect', types=['numeric'])
# )
                                

# X = output[4]['X'].copy()
# y = output[4]['y'].copy()

# outlier_transformer.fit_transform(X, y)
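
The three-standard-deviation rule in `get_outliers` can be checked on a toy series. This is a sketch with made-up values: twenty typical observations and one extreme one.

```python
import pandas as pd

# Toy data: 20 typical values and one extreme value
s = pd.Series([10.0] * 10 + [12.0] * 10 + [100.0])

mean, std = s.mean(), s.std()
cutoff = 3 * std

# True where the value lies outside [mean - 3*std, mean + 3*std]
is_outlier = ~s.between(mean - cutoff, mean + cutoff)

print(s[is_outlier])
```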

### Custom OverSampler

In [30]:
from imblearn.over_sampling import SMOTENC


class TrainOverSampler(BaseEstimator, TransformerMixin):
    def __init__(
        self, 
        categorical_selector_callable = lambda X: X.columns, 
    ):
        self.categorical_selector_callable = categorical_selector_callable

    def fit(self, X, y):
        # Boolean mask marking the categorical columns for SMOTENC
        categorical_cols = X.columns.isin(self.categorical_selector_callable(X))
        sampler = SMOTENC(categorical_features=categorical_cols)
        X_sample, y_sample = sampler.fit_resample(X, y)
        # Swap the resampled data into the original objects, so that later pipeline steps see it
        X.__dict__, y.__dict__ = X_sample.__dict__, y_sample.__dict__
        return self

    def transform(self, X, y=None):
        return X

In [61]:
# # Testing code
# X = output[5]['X'].copy()
# y = output[5]['y'].copy()

# oversampler = TrainOverSampler(
#     categorical_selector_callable = partial(metadata.lookup_features, method='intersect', types=['nominal'])
# )

# oversampler.fit_transform(X, y)

### Custom Column Transformer

In [18]:
class CustomColumnTransformer(ColumnTransformer):
    def __init__(
        self,
        transformers,
        verbose_feature_names_out=False,
        **kwargs
    ):
        super().__init__(
            transformers=transformers,
            verbose_feature_names_out=verbose_feature_names_out,
            **kwargs
        )
    def fit_transform(self, X, y=None):
        output = super().fit_transform(X, y)
        return pd.DataFrame(
            output if type(output) == np.ndarray else output.toarray(),
            columns=self.get_feature_names_out(),
            index = X.index,
        ) #.convert_dtypes()
    
    def transform(self, X):
        output = super().transform(X)
        return pd.DataFrame(
            output if type(output) == np.ndarray else output.toarray(),
            columns=self.get_feature_names_out(),
            index = X.index,
        ) #.convert_dtypes()

## Write Custom Transformer Class Definitions to Python Source Script

The following code opens the notebook file and writes every cell in the current notebook that has the tag `transformer` to the Python script `../src/transformers.py`.

In [59]:
with open('02_data-processing-pipeline.ipynb', 'r') as f:
    notebook_json = json.load(f)

transformers_code = ''
for cell in notebook_json['cells']:
    tags = cell['metadata'].get('tags') or []
    if 'transformer' in tags:
        transformers_code += ''.join(cell['source'])
        transformers_code += '\n\n'

with open(os.path.join(dir_src, 'transformers.py'), 'w') as f:
    f.write(transformers_code)
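
To see what the tag filter does without touching any file, here is a sketch that runs the same idea on a small in-memory stand-in for the notebook JSON. The cell contents and the helper name `collect_tagged_source` are made up for illustration, and the chunks are joined slightly differently from the cell above.

```python
# A tiny stand-in for the notebook JSON; the cell contents are made up
notebook_json = {
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"tags": ["transformer"]},
            "source": ["import re\n", "import numpy as np"],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["### Notes"]},
        {
            "cell_type": "code",
            "metadata": {"tags": ["transformer"]},
            "source": ["class Demo:\n", "    pass"],
        },
    ]
}


def collect_tagged_source(notebook, tag):
    """Concatenate the source of all cells that carry `tag`."""
    chunks = []
    for cell in notebook["cells"]:
        # Cells without tags have no 'tags' key in their metadata
        if tag in cell["metadata"].get("tags", []):
            chunks.append("".join(cell["source"]))
    return "\n\n".join(chunks) + "\n"


script = collect_tagged_source(notebook_json, "transformer")
print(script)
```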

## Pipeline Testing

### Load Transformer Class Definitions and Metadata

In [4]:
from metadata import Metadata
from transformers import *

In [5]:
metadata = Metadata(os.path.join(dir_data_processed, 'metadata.csv'))

### Pipeline Definition

In [6]:
imputer_column_transformer = CustomColumnTransformer(
    [
        (
            'numeric', 
            CustomSimpleImputer(strategy="mean"), 
            partial(metadata.lookup_features, types=['numeric'])
        ),
        (
            'ordinal_nominal', 
            CustomSimpleImputer(strategy="most_frequent"), 
            partial(metadata.lookup_features, types=['ordinal', 'nominal'])
        )
    ],
#    remainder='passthrough'
)


terminal_column_transformer = CustomColumnTransformer(
    [
        (
            'numeric_ordinal', 
            StandardScaler(), 
            partial(metadata.lookup_features, types=['numeric', 'ordinal'])
        ),
       (
            'nominal', 
            OneHotEncoder(handle_unknown='ignore'), 
            partial(metadata.lookup_features, types=['nominal'])
        ),
     ],
#    remainder='passthrough'
)

correlated_cols_remover = CustomColumnTransformer(
    [
        (
            'numeric_ordinal',
            CorrelatedRemover(correlation_threshold=0.6),
            partial(metadata.lookup_features, types=['numeric', 'ordinal'])
        )
    ],
    remainder = 'passthrough'
)

custom_outlier_remover =  TrainOutlierRemover(
                            selector_callable = partial(metadata.lookup_features, 
                                                        method='intersect', 
                                                        types=['numeric']
                                                       ),
)

In [None]:
# custom_oversampler = TrainOverSampler(
#     categorical_selector_callable = partial(metadata.lookup_features, method='intersect', types=['nominal', 'ordinal'])
# )

In [7]:
pipeline = Pipeline([
    ('preprocessor', CustomPreprocessor(metadata)), 
    ('remove_missing_cols', MissingDataColsRemover(missing_threshold=0.3)),
#    ('remove_missing_rows', TrainMissingDataRowsRemover(missing_threshold=0.5)),
    ('remove_correlated_cols', correlated_cols_remover),
    ('remove_duplicate_cols', TrainDuplicatesRemover()),
    ('imputer_column', imputer_column_transformer),
    ('outlier_remover', custom_outlier_remover),
#     ('oversampler', custom_oversampler),
    ('terminal_column', terminal_column_transformer),
    ('benchmark_model', LogisticRegression(max_iter=2000))
])

In [8]:
pipeline

### Extract the Pipeline Intermediate Outputs

In [9]:
# n_steps = 8

# All steps except the final estimator
n_steps = len(pipeline) - 1

pipeline_output = []
pipeline_output.append(
    {
     'X': pipeline[0].fit_transform(mailout_train.drop('RESPONSE', axis=1), mailout_train['RESPONSE']),
     'y': mailout_train['RESPONSE'].copy()
    }
)
print(0, pipeline.steps[0][0])

for i in range(1, n_steps):
    pipeline_output.append(
    {
     'X': pipeline[i].fit_transform(pipeline_output[i-1]['X'], pipeline_output[i-1]['y']),
     'y': pipeline_output[i-1]['y']
    }
    )   
    print(i, pipeline.steps[i][0])

0 preprocessor
1 remove_missing_cols
2 remove_correlated_cols
3 remove_duplicate_cols
4 imputer_column
5 outlier_remover
6 terminal_column


### Test Pipeline Training

In [10]:
t = time.time()
pipeline.fit(mailout_train.drop('RESPONSE', axis=1), mailout_train['RESPONSE'])
elapsed = time.time() - t
print(elapsed)

34.84210658073425


### Test Pipeline Inference

In [33]:
submission_benchmark = mailout_test[['LNR']].set_index('LNR')
submission_benchmark['RESPONSE'] = pipeline.predict_proba(mailout_test)[:,1]
submission_benchmark.to_csv(os.path.join(dir_submit, "submission_pipeline_test.csv"))

In [35]:
!kaggle competitions submit -c udacity-arvato-identify-customers -f {os.path.join(dir_submit, "submission_pipeline_test.csv")} -m "pipeline test submission"

100%|███████████████████████████████████████| 1.11M/1.11M [00:07<00:00, 163kB/s]
Successfully submitted to Udacity+Arvato: Identify Customer Segments

**The resulting score:**
    
![kaggle-pipeline-test-score.svg](attachment:kaggle-pipeline-test-score.svg)