## ColumnTransformer

ColumnTransformer is a scikit-learn utility (in `sklearn.compose`) that applies different preprocessing steps to different subsets of features (columns) in a dataset and concatenates the results into a single feature matrix. This is particularly useful for datasets that mix numerical and categorical data, since each type of data usually needs its own preprocessing or transformation.

**Why Use ColumnTransformer?**

* **Streamlining the Preprocessing:** Instead of transforming each subset of columns by hand and stitching the results back together, ColumnTransformer applies every transformation in a single `fit_transform` call.
* **Integration in Pipelines:** ColumnTransformer can be used as a step in a `Pipeline`, so the same transformations, fitted only on the training data, are applied every time you train a model or make predictions. This reduces the risk of data leakage and manual mistakes.
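
The pipeline integration mentioned above can be sketched as follows. This is a minimal illustration, not part of the walkthrough below: the `toy` frame, the `labels`, and the choice of `LogisticRegression` as the final estimator are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 45, 35],
    'City': ['New York', 'Chicago', 'New York', 'Chicago', 'New York', 'Chicago'],
})
labels = [0, 1, 0, 1, 0, 1]

# Preprocessing: impute the numeric column, one-hot encode the categorical one
preprocess = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='mean'), ['Age']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['City']),
    ]
)

# The ColumnTransformer is just another step in front of the estimator
model = Pipeline(steps=[
    ('preprocess', preprocess),
    ('classifier', LogisticRegression()),
])

# fit() learns the imputation mean and the categories from this data only;
# predict() reuses those fitted values instead of refitting them
model.fit(toy, labels)
predictions = model.predict(toy)
print(predictions)
```

Because the preprocessing lives inside the pipeline, calling `model.predict` on new rows applies the statistics learned during `fit`, which is what keeps information from held-out data out of the transformers.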

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [5]:
data = {
    'Age': [25, 30, np.nan, 40, 45],  # Missing value in Age
    'Salary': [50000, 60000, 70000, np.nan, 90000],  # Missing value in Salary
    'Owner': ['First Owner', 'Second Owner', 'Third Owner', 'First Owner', 'Second Owner'],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']
}
df = pd.DataFrame(data)

In [6]:
X = df[['Age', 'Salary', 'Owner', 'Gender', 'City']]
y = [0, 1, 0, 1, 0]  # toy binary target, one label per row

In [7]:
# Hold out 20% of the rows (1 of 5) as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**--- Individual Transformation Steps ---**

In [9]:
# Apply SimpleImputer to handle missing values in 'Age' and 'Salary'

simple_imputer = SimpleImputer(strategy='mean')
X_train_imputed = simple_imputer.fit_transform(X_train[['Age', 'Salary']])
X_test_imputed = simple_imputer.transform(X_test[['Age', 'Salary']])

In [10]:
# Apply OrdinalEncoder to 'Owner' (ordinal feature).
# By default the categories are sorted, which here happens to match First < Second < Third.

ordinal_encoder = OrdinalEncoder()
X_train_ord = ordinal_encoder.fit_transform(X_train[['Owner']])
X_test_ord = ordinal_encoder.transform(X_test[['Owner']])
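
With default settings, `OrdinalEncoder` infers the categories from the training data and sorts them, so the integer codes follow alphabetical order rather than any intended ranking. For 'Owner' the two coincide, but that is a property of these particular labels. A minimal sketch of stating the order explicitly through the `categories` parameter is shown below; the `owners` frame and the variable names are made up for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column of ownership labels, for illustration only
owners = pd.DataFrame(
    {'Owner': ['Second Owner', 'Third Owner', 'First Owner', 'First Owner']}
)

# Spell out the intended rank order instead of relying on sorted order.
# One inner list per encoded column.
owner_order = [['First Owner', 'Second Owner', 'Third Owner']]
explicit_encoder = OrdinalEncoder(categories=owner_order)

encoded_owners = explicit_encoder.fit_transform(owners)
print(encoded_owners.ravel())  # [1. 2. 0. 0.]
```

Passing the order explicitly matters for labels such as 'Low', 'Medium', 'High', where the sorted order ('High', 'Low', 'Medium') would not reflect the ranking.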

In [11]:
# Apply OneHotEncoder to 'Gender' and 'City' (categorical features)

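# sparse_output replaced the sparse argument in scikit-learn 1.2; on older versions use sparse=False
one_hot_encoder = OneHotEncoder(drop='first', sparse_output=False)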
X_train_cat = one_hot_encoder.fit_transform(X_train[['Gender', 'City']])
X_test_cat = one_hot_encoder.transform(X_test[['Gender', 'City']])



In [12]:
# Combine all transformed columns (for comparison)

X_train_combined = np.hstack([X_train_imputed, X_train_ord, X_train_cat])
X_test_combined = np.hstack([X_test_imputed, X_test_ord, X_test_cat])

In [13]:
print("Transformed X_train (individual transformations):")
print(X_train_combined)

Transformed X_train (individual transformations):
[[4.50000000e+01 9.00000000e+04 1.00000000e+00 1.00000000e+00
  1.00000000e+00 0.00000000e+00]
 [3.66666667e+01 7.00000000e+04 2.00000000e+00 1.00000000e+00
  0.00000000e+00 1.00000000e+00]
 [2.50000000e+01 5.00000000e+04 0.00000000e+00 1.00000000e+00
  0.00000000e+00 1.00000000e+00]
 [4.00000000e+01 7.00000000e+04 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00]]


In [14]:
X_train_combined.shape

(4, 6)

**--- Combined Transformation Steps Using ColumnTransformer ---**

In [17]:
# Define the ColumnTransformer

column_transformer = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='mean'), ['Age', 'Salary']),  # Impute missing values for numerical columns
        ('ord', OrdinalEncoder(), ['Owner']),  # Apply OrdinalEncoder to 'Owner' column
        # One-hot encode 'Gender' and 'City'; sparse_output requires scikit-learn >= 1.2 (older versions: sparse=False)
        ('cat', OneHotEncoder(drop='first', sparse_output=False), ['Gender', 'City'])
    ]
)
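
One detail worth keeping in mind when defining a ColumnTransformer: columns that are not listed in any transformer are dropped by default (`remainder='drop'`). The example above lists every column, so nothing is lost, but the sketch below shows the difference on a small made-up frame. The `frame` data and variable names are assumptions for illustration; `sparse_threshold=0` is used only to force a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical two-column frame, for illustration only
frame = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Age': [25, 30, 45],
})

# Default remainder='drop': 'Age' is not listed, so it is discarded
drop_rest = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(drop='first'), ['Gender'])],
    sparse_threshold=0,  # always return a dense array
)

# remainder='passthrough': unlisted columns are appended unchanged
keep_rest = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(drop='first'), ['Gender'])],
    remainder='passthrough',
    sparse_threshold=0,
)

dropped = drop_rest.fit_transform(frame)
kept = keep_rest.fit_transform(frame)
print(dropped.shape, kept.shape)  # (3, 1) (3, 2)
```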

In [18]:
# Apply ColumnTransformer on the training and test sets

X_train_transformed = column_transformer.fit_transform(X_train)
X_test_transformed = column_transformer.transform(X_test)



In [19]:
print("\nTransformed X_train (using ColumnTransformer):")
print(X_train_transformed)


Transformed X_train (using ColumnTransformer):
[[4.50000000e+01 9.00000000e+04 1.00000000e+00 1.00000000e+00
  1.00000000e+00 0.00000000e+00]
 [3.66666667e+01 7.00000000e+04 2.00000000e+00 1.00000000e+00
  0.00000000e+00 1.00000000e+00]
 [2.50000000e+01 5.00000000e+04 0.00000000e+00 1.00000000e+00
  0.00000000e+00 1.00000000e+00]
 [4.00000000e+01 7.00000000e+04 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00]]


In [20]:
# Compare the results from the individual and ColumnTransformer methods
print("\nAre the results the same?")
print(np.array_equal(X_train_combined, X_train_transformed))


Are the results the same?
True
