## Multi-Step Column-Transformer
This notebook shows how to nest a Pipeline inside a ColumnTransformer, so that a set of columns can go through a transformation that needs more than one step.

In [25]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

from sklearn.svm import SVC

%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

### Reading the data:

In [26]:
# read the csv-files and take the respondent_id column as index:

X_train_df = pd.read_csv("training_set_features.csv", index_col="respondent_id")
y_train_df = pd.read_csv("training_set_labels.csv", index_col="respondent_id")
X_test_df = pd.read_csv("test_set_features.csv", index_col="respondent_id")

Check the columns and their data-types:

In [27]:
# for convenience the code producing the column-name : dtype output of the data-frame is commented out:

#for i,t in zip(X_train_df.dtypes.index,X_train_df.dtypes):
#    print(f"{i} : {t}")

# Result:
""" 
h1n1_concern : float64
h1n1_knowledge : float64
behavioral_antiviral_meds : float64
behavioral_avoidance : float64
behavioral_face_mask : float64
behavioral_wash_hands : float64
behavioral_large_gatherings : float64
behavioral_outside_home : float64
behavioral_touch_face : float64
doctor_recc_h1n1 : float64
doctor_recc_seasonal : float64
chronic_med_condition : float64
child_under_6_months : float64
health_worker : float64
health_insurance : float64
opinion_h1n1_vacc_effective : float64
opinion_h1n1_risk : float64
opinion_h1n1_sick_from_vacc : float64
opinion_seas_vacc_effective : float64
opinion_seas_risk : float64
opinion_seas_sick_from_vacc : float64
age_group : object
education : object
race : object
sex : object
income_poverty : object
marital_status : object
rent_or_own : object
employment_status : object
hhs_geo_region : object
census_msa : object
household_adults : float64
household_children : float64
employment_industry : object
employment_occupation : object

"""


From the data-types and the meaning of the columns, we select the appropriate transformations to convert data into numeric form.

### Defining which columns get which transformation:

In [28]:
ohe_columns = ["race", "sex", "marital_status", "rent_or_own", "employment_status", "hhs_geo_region", "census_msa", "employment_industry", "employment_occupation"]
ordinal_columns = ['age_group','education', 'income_poverty']
numeric_columns = X_train_df.columns[X_train_df.dtypes == "float64"]

### Use a Pipeline for multi-step Transformation of a Column-Set:

If more than one transformation has to be executed on one and the same set of columns, like numeric_columns here, these steps have to be collected inside a Pipeline. Listing them as separate entries of the column-transformer does not work: a ColumnTransformer applies each of its transformers independently to the original input columns and concatenates the results side by side, so every entry adds its own block of output columns instead of building on the previous one. This is to be expected, since a ColumnTransformer is not a pipeline, i.e. it knows nothing about a first, second or third step.
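The difference can be seen on a small example. The sketch below uses a made-up two-column frame (`toy`, `a`, `b` are illustrative names, not part of the flu data) and compares two separate transformer entries on the same columns with a single Pipeline entry.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
cols = ["a", "b"]

# Two separate entries: each one sees the ORIGINAL columns.
side_by_side = ColumnTransformer([
    ("impute", SimpleImputer(strategy="median"), cols),
    ("scale", StandardScaler(), cols),
])
out_parallel = side_by_side.fit_transform(toy)

# One entry holding a Pipeline: the scaler sees the imputed columns.
chained = ColumnTransformer([
    ("impute_then_scale", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), cols),
])
out_chained = chained.fit_transform(toy)

print(out_parallel.shape, np.isnan(out_parallel).any())
print(out_chained.shape, np.isnan(out_chained).any())
```

With the separate entries the result has four columns (two imputed, two scaled), and the scaled block still contains the NaNs, because the scaler never saw the imputed values. With the Pipeline entry the result has two columns and no NaNs.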

Defining the Pipeline:

In [29]:
num_pipe = Pipeline([
    ("SimpleImputer", SimpleImputer(strategy="median", missing_values=np.nan)),
    ("PostImp_StandardScaler", StandardScaler(copy=False))
])

Defining the Column-Transformer: <br>
<br>
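We will one-hot-encode the data-frame with pandas before sending the data to the column-transformer, but we can already define the column-transformer here. The one-hot-encoded columns are not listed in any transformer, so remainder="passthrough" lets them through unchanged.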

In [30]:
from sklearn.preprocessing import OrdinalEncoder


full_columnTransformer = ColumnTransformer(
    transformers= [
        ("numerical_pipeline", num_pipe, numeric_columns),
        ("ordinal_preprocessing", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1, encoded_missing_value=-1), ordinal_columns),
    ],
    remainder="passthrough"
)

One-hot-encoding the training- and the test-data: <br>
We print out the shape of the data-frame before and after the encoding to see how it changes.

In [31]:
print(f"X_train_df.shape: {X_train_df.shape}")
X_train_df = pd.get_dummies(data=X_train_df, columns=ohe_columns, dummy_na=True)
print(f"X_train_df.shape: {X_train_df.shape}")
X_test_df = pd.get_dummies(data=X_test_df, columns=ohe_columns, dummy_na=True)



X_train_df.shape: (26707, 35)
X_train_df.shape: (26707, 105)


The nine categorical columns have been replaced by 79 indicator columns (35 - 9 + 79 = 105). Because of dummy_na=True, this includes one NaN-indicator column for each encoded column.
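Since pd.get_dummies is called on the training- and the test-data separately, the two frames only get the same columns if both contain the same categories, which this notebook does not check. A minimal sketch of one way to align them, using made-up frames (`train`, `test`, `color` are illustrative names):

```python
import pandas as pd

# Hypothetical data: the test set has a category the training set lacks.
train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "color": ["red", "green", "blue"]})
test = pd.DataFrame({"x": [4.0, 5.0], "color": ["red", "purple"]})

train_ohe = pd.get_dummies(train, columns=["color"], dummy_na=True)
test_ohe = pd.get_dummies(test, columns=["color"], dummy_na=True)

# Force the test frame onto the training columns: indicator columns missing
# in the test frame are filled with 0, columns unknown to training are dropped.
test_aligned = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(list(test_ohe.columns))
print(list(test_aligned.columns))
```

After the reindex, the test frame has exactly the training columns in the training order, so a transformer fitted on the training frame can be applied to it.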

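Save the column-names of the data-frame, because the column-transformer only returns a numpy-array and we might want to re-construct a pandas data-frame from it later on. Note that the column-transformer orders its output by transformer (first the numeric columns, then the ordinal columns, then the passed-through remainder), which need not match the column order of the data-frame, so the saved names have to be re-ordered accordingly before they are attached to the array.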

In [32]:
X_columns = X_train_df.columns

X_train_dfnp = full_columnTransformer.fit_transform(X_train_df)
X_test_dfnp = full_columnTransformer.transform(X_test_df)
print(f"X_train_dfnp.shape: {X_train_dfnp.shape}")
print(f"X_train_df.shape: {X_train_df.shape}")

X_train_dfnp.shape: (26707, 105)
X_train_df.shape: (26707, 105)


Check if all missing values have been removed by imputation and ordinal-encoding:

In [35]:
np.isnan(X_train_dfnp).sum()

0