<h1 align="center"><strong>ColumnTransformer | make_column_transformer code</strong></h1>

| Feature                          | ColumnTransformer (class)                                    | make_column_transformer (function)                  |
|----------------------------------|---------------------------------------------------------------|-----------------------------------------------------|
| **Type**                         | Class (need to instantiate explicitly)                       | Helper function (returns a ColumnTransformer)       |
| **Syntax**                       | More verbose, need to pass parameters manually               | Shorter, cleaner, less boilerplate                  |
| **Flexibility**                  | High: explicit transformer names and `transformer_weights`   | Slightly lower: names are auto-generated, no `transformer_weights` |
| **`remainder` default**          | `"drop"`                                                      | `"drop"`                                            |
| **`verbose_feature_names_out`**  | Constructor argument (default `True`)                        | Keyword argument (default `True`)                   |
| **Use cases**                    | Production pipelines, complex transformations                | Quick prototyping, simple pipelines                 |
| **Naming behavior**              | Transformer names are supplied by the caller                 | Names derived from lowercased class names (e.g. `onehotencoder`) |

<body>
    <div style = "
        width: 100%;
        height: 30px;
        background: linear-gradient(to right,rgb(235, 238, 212),rgb(235, 238, 212));">
    </div>
</body>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

df = pd.read_csv('../data/50_Startups.csv')
df.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [2]:
def get_column_types(df, numerics=None, show_summary=True):
    """Identify numerical and categorical columns in a DataFrame."""
    from itertools import zip_longest

    # Avoid a mutable default argument; fall back to all NumPy numeric dtypes
    if numerics is None:
        numerics = [np.number]

    # Numeric columns
    numerical_columns = df.select_dtypes(include=numerics).columns.tolist()
    
    # Categorical columns (everything that is not numeric)
    categorical_columns = df.select_dtypes(exclude=numerics).columns.tolist()
    
    if show_summary:
        print("Column Type Summary:")
        print(f"   Numerical   : {len(numerical_columns)} columns")
        print(f"   Categorical : {len(categorical_columns)} columns")
        print(f"   Total       : {len(df.columns)} columns")
        
        print("\n Numerical Columns                                     | Categorical Columns")
        print("==========================================================================================")
        for i, (num, cat) in enumerate(zip_longest(numerical_columns, categorical_columns, fillvalue='')):
            print(f"{i+1:2d}. {num:<50} | {cat:<28}")
    
    return numerical_columns, categorical_columns

numerical_columns, categorical_columns = get_column_types(df)

Column Type Summary:
   Numerical   : 4 columns
   Categorical : 1 columns
   Total       : 5 columns

 Numerical Columns                                     | Categorical Columns
 1. R&D Spend                                          | State                       
 2. Administration                                     |                             
 3. Marketing Spend                                    |                             
 4. Profit                                             |                             


In [3]:
# --------------------------
# Example 1: Using ColumnTransformer directly
# --------------------------
# Named ct1 so it does not collide with the ct3 defined in Example 4
ct1 = ColumnTransformer(
    transformers=[
        # name, transformer, columns:
        ('ordinalencoder', OrdinalEncoder(), ['State']),      # Encode 'State' column with OrdinalEncoder
        ('passthrough', 'passthrough', numerical_columns)     # Keep numerical columns unchanged
    ],
    remainder='drop',                                         # Drop any other columns
    verbose_feature_names_out=False                           # Avoid prefixing feature names
)

# Fit and transform the DataFrame (returns a NumPy array by default)
df_new = ct1.fit_transform(df)

# Convert to DataFrame and manually rename columns: first encoded 'State' + numerical columns
df_new = pd.DataFrame(df_new, columns=['State_encoded'] + numerical_columns)

# Check shapes and preview
display(df.shape, df.head())
display(df_new.shape, df_new.head())

(50, 5)

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


(50, 5)

Unnamed: 0,State_encoded,R&D Spend,Administration,Marketing Spend,Profit
0,2.0,165349.2,136897.8,471784.1,192261.83
1,0.0,162597.7,151377.59,443898.53,191792.06
2,1.0,153441.51,101145.55,407934.54,191050.39
3,2.0,144372.41,118671.85,383199.62,182901.99
4,1.0,142107.34,91391.77,366168.42,166187.94


In [4]:
# --------------------------
# Example 2: Using make_column_transformer with remainder='passthrough'
# --------------------------
ohe = OneHotEncoder(sparse_output=False)
ode = OrdinalEncoder()

ct = make_column_transformer(
    (ohe, ['State']),         # Apply OneHotEncoder to 'State'
    (ode, ['State']),         # Also apply OrdinalEncoder to 'State'
    remainder='passthrough'   # Keep all other columns unchanged
)

ct.set_output(transform='pandas')   # Return a pandas DataFrame instead of NumPy array
df_new = ct.fit_transform(df)

# Show original vs transformed shapes and data
display(df.shape, df.head())
display(df_new.shape, df_new.head())

(50, 5)

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


(50, 8)

Unnamed: 0,onehotencoder__State_California,onehotencoder__State_Florida,onehotencoder__State_New York,ordinalencoder__State,remainder__R&D Spend,remainder__Administration,remainder__Marketing Spend,remainder__Profit
0,0.0,0.0,1.0,2.0,165349.2,136897.8,471784.1,192261.83
1,1.0,0.0,0.0,0.0,162597.7,151377.59,443898.53,191792.06
2,0.0,1.0,0.0,1.0,153441.51,101145.55,407934.54,191050.39
3,0.0,0.0,1.0,2.0,144372.41,118671.85,383199.62,182901.99
4,0.0,1.0,0.0,1.0,142107.34,91391.77,366168.42,166187.94


In [5]:
# --------------------------
# Example 3: make_column_transformer with remainder='drop'
# --------------------------
ct2 = make_column_transformer(
    (ohe, ['State']),      # OneHotEncoder on 'State'
    (ode, ['State']),      # OrdinalEncoder on 'State'
    remainder='drop'       # Drop all other columns
)

ct2.set_output(transform='pandas')
df_new_2 = ct2.fit_transform(df)

display(df.shape, df.head())
display(df_new_2.shape, df_new_2.head())

(50, 5)

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


(50, 4)

Unnamed: 0,onehotencoder__State_California,onehotencoder__State_Florida,onehotencoder__State_New York,ordinalencoder__State
0,0.0,0.0,1.0,2.0
1,1.0,0.0,0.0,0.0
2,0.0,1.0,0.0,1.0
3,0.0,0.0,1.0,2.0
4,0.0,1.0,0.0,1.0


In [6]:
# --------------------------
# Example 4: Custom passthrough of selected columns
# --------------------------
ct3 = make_column_transformer(
    (ohe, ['State']),       # OneHotEncoder on 'State'
    (ode, ['State']),       # OrdinalEncoder on 'State'
    ('passthrough', [       # Explicitly keep these columns as they are
        'R&D Spend',
        'Administration',
        'Marketing Spend',
        'State',
        'Profit'
    ]),
    remainder='drop'        # Drop any other columns not listed
)

ct3.set_output(transform='pandas')
df_new_3 = ct3.fit_transform(df)

display(df.shape, df.head())
display(df_new_3.shape, df_new_3.head())

(50, 5)

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


(50, 9)

Unnamed: 0,onehotencoder__State_California,onehotencoder__State_Florida,onehotencoder__State_New York,ordinalencoder__State,passthrough__R&D Spend,passthrough__Administration,passthrough__Marketing Spend,passthrough__State,passthrough__Profit
0,0.0,0.0,1.0,2.0,165349.2,136897.8,471784.1,New York,192261.83
1,1.0,0.0,0.0,0.0,162597.7,151377.59,443898.53,California,191792.06
2,0.0,1.0,0.0,1.0,153441.51,101145.55,407934.54,Florida,191050.39
3,0.0,0.0,1.0,2.0,144372.41,118671.85,383199.62,New York,182901.99
4,0.0,1.0,0.0,1.0,142107.34,91391.77,366168.42,Florida,166187.94


In [7]:
df2 = df.copy()

encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' avoids multicollinearity
encoded = encoder.fit_transform(df2[['State']])

df2 = df2.drop(columns=['State'])
# Reuse df2's index so the concat aligns row by row even if the index is not 0..n-1
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(), index=df2.index)
df2 = pd.concat([df2, encoded_df], axis=1)

df2.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,Profit,State_Florida,State_New York
0,165349.2,136897.8,471784.1,192261.83,0.0,1.0
1,162597.7,151377.59,443898.53,191792.06,0.0,0.0
2,153441.51,101145.55,407934.54,191050.39,1.0,0.0
3,144372.41,118671.85,383199.62,182901.99,0.0,1.0
4,142107.34,91391.77,366168.42,166187.94,1.0,0.0


In [8]:
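df3 = df.copy()

# LabelEncoder is designed for target labels (1-D y); it is applied to a feature
# here only for comparison. OrdinalEncoder is the feature-oriented counterpart.
le = LabelEncoder()
df3['State'] = le.fit_transform(df3['State'])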
df3.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,2,192261.83
1,162597.7,151377.59,443898.53,0,191792.06
2,153441.51,101145.55,407934.54,1,191050.39
3,144372.41,118671.85,383199.62,2,182901.99
4,142107.34,91391.77,366168.42,1,166187.94
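
A small sketch comparing the two encoders on made-up data (the `states` frame is hypothetical). Both sort the categories alphabetically by default, so they assign the same codes; the difference is that `LabelEncoder` takes a 1-D array and returns integers, while `OrdinalEncoder` takes a 2-D feature matrix and returns floats.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up single-column frame for illustration
states = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

# LabelEncoder: 1-D input (a Series), integer codes
label_codes = LabelEncoder().fit_transform(states["State"])

# OrdinalEncoder: 2-D input (a DataFrame), float codes; ravel() flattens the single column
ordinal_codes = OrdinalEncoder().fit_transform(states[["State"]]).ravel()

print(label_codes, ordinal_codes)
```
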


In [9]:
df4 = df.copy()
# Explicit category order: New York -> 0, California -> 1, Florida -> 2
oe = OrdinalEncoder(categories=[['New York', 'California', 'Florida']])
df4['State'] = oe.fit_transform(df4[['State']]).ravel()
df4.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,0.0,192261.83
1,162597.7,151377.59,443898.53,1.0,191792.06
2,153441.51,101145.55,407934.54,2.0,191050.39
3,144372.41,118671.85,383199.62,0.0,182901.99
4,142107.34,91391.77,366168.42,2.0,166187.94


In [10]:
df5= df.copy()
import category_encoders as ce

binary_encoder = ce.BinaryEncoder(cols=['State'])
df5 = binary_encoder.fit_transform(df5)
df5.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State_0,State_1,Profit
0,165349.2,136897.8,471784.1,0,1,192261.83
1,162597.7,151377.59,443898.53,1,0,191792.06
2,153441.51,101145.55,407934.54,1,1,191050.39
3,144372.41,118671.85,383199.62,0,1,182901.99
4,142107.34,91391.77,366168.42,1,1,166187.94


In [11]:
df6= df.copy()
import category_encoders as ce

target_encoder = ce.TargetEncoder(cols=['State'], smoothing=1.0)
# Only the 'State' column is passed in, so df6 ends up holding just the encoded 'State'
df6 = target_encoder.fit_transform(df6['State'], df6['Profit'])
df6.head()

Unnamed: 0,State
0,112095.340782
1,111628.135645
2,112134.250893
3,112095.340782
4,112134.250893


In [12]:
df7= df.copy()

freq_map = df['State'].value_counts().to_dict()
df7['State'] = df7['State'].map(freq_map)
df7.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,17,192261.83
1,162597.7,151377.59,443898.53,17,191792.06
2,153441.51,101145.55,407934.54,16,191050.39
3,144372.41,118671.85,383199.62,17,182901.99
4,142107.34,91391.77,366168.42,16,166187.94


In [13]:
df8= df.copy()

import category_encoders as ce

hash_encoder = ce.HashingEncoder(cols=['State'], n_components=4)
df8 = hash_encoder.fit_transform(df8)
df8.head()

Unnamed: 0,col_0,col_1,col_2,col_3,R&D Spend,Administration,Marketing Spend,Profit
0,0,0,0,1,165349.2,136897.8,471784.1,192261.83
1,1,0,0,0,162597.7,151377.59,443898.53,191792.06
2,1,0,0,0,153441.51,101145.55,407934.54,191050.39
3,0,0,0,1,144372.41,118671.85,383199.62,182901.99
4,1,0,0,0,142107.34,91391.77,366168.42,166187.94


In [14]:
from sklearn.model_selection import train_test_split
X = df5.drop('Profit', axis=1)
y = df5['Profit']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=None)

print(f"X_train shape: {X_train.shape}")
print(f"X_test  shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test  shape: {y_test.shape}")
print("\nMissing values:")
print(f"X_train : {X_train.isnull().sum().sum()}")
print(f"X_test  : {X_test.isnull().sum().sum()}")
print(f"y_train : {y_train.isnull().sum()}")   # y is a Series, so one .sum() is enough
print(f"y_test  : {y_test.isnull().sum()}")

X_train shape: (40, 5)
X_test  shape: (10, 5)
y_train shape: (40,)
y_test  shape: (10,)

Missing values:
X_train : 0
X_test  : 0
y_train : 0
y_test  : 0
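
The encoding and modelling steps can also be combined into one `Pipeline`, so that the encoder is fitted on the training split only. This is a sketch on synthetic data, not the notebook's workflow: the `toy` frame, its `Spend` column, and the `pipe` name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data: 60 rows, three states in equal proportion
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "Spend": rng.uniform(0, 100, size=n),
    "State": np.tile(["New York", "California", "Florida"], n // 3),
})
toy["Profit"] = 2.0 * toy["Spend"] + rng.normal(0, 1, size=n)

X_toy = toy.drop(columns="Profit")
y_toy = toy["Profit"]
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# The column transformer is a pipeline step, so it is fitted on Xtr only
pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(drop="first"), ["State"]),
        remainder="passthrough",
    ),
    LinearRegression(),
)
pipe.fit(Xtr, ytr)

preds = pipe.predict(Xte)
test_r2 = pipe.score(Xte, yte)
print(preds.shape, round(test_r2, 4))
```
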


In [15]:
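def regression_evaluate_model(model, X_train, X_test, y_train, y_test, model_name="RegressionModel", verbose=True):
    """Report train/test R² and test-set MAE, MSE, RMSE for an already fitted regressor.

    `time_sec` in the returned dict covers prediction and metric computation only, not fitting.
    """
    import time
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    
    start_time = time.time()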
    
    # Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred  = model.predict(X_test)
    
    # Metrics
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2  = r2_score(y_test, y_test_pred)
    
    mae  = mean_absolute_error(y_test, y_test_pred)
    mse  = mean_squared_error(y_test, y_test_pred)
    rmse = np.sqrt(mse)
    
    end_time = time.time()
    elapsed  = end_time - start_time
    
    if verbose:
        print(f"{'='*60}")
        print(f"Evaluation results: {model_name}")
        print(f"{'='*60}")
        print(f"--------- R² Scores ---------")
        print(f"Train R²       : {train_r2:.4f}")
        print(f"Test  R²       : {test_r2:.4f}")
        print(f"Difference     : {abs(train_r2 - test_r2):.4f}")
        if abs(train_r2 - test_r2) > 0.05:
            print("⚠️ Possible overfitting/underfitting detected!")
        else:
            print("✅ Good generalization!")
        
        print(f"\n--------- Error Metrics (Test Set) ---------")
        print(f"MAE   : {mae:.4f}")
        print(f"MSE   : {mse:.4f}")
        print(f"RMSE  : {rmse:.4f}")
    
    return {
        "train_r2": train_r2,
        "test_r2": test_r2,
        "mae": mae,
        "mse": mse,
        "rmse": rmse,
        "time_sec": elapsed
    }

In [16]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

LR_results = regression_evaluate_model(lr_model, X_train, X_test, y_train, y_test, "Linear Regression")

Evaluation results: Linear Regression
--------- R² Scores ---------
Train R²       : 0.9537
Test  R²       : 0.8987
Difference     : 0.0550
⚠️ Possible overfitting/underfitting detected!

--------- Error Metrics (Test Set) ---------
MAE   : 6961.4778
MSE   : 82010363.0447
RMSE  : 9055.9573


In [17]:
lasso = Lasso(alpha=1000)
lasso.fit(X_train, y_train)

Lasso_results = regression_evaluate_model(lasso, X_train, X_test, y_train, y_test, "Lasso Regression")

Evaluation results: Lasso Regression
--------- R² Scores ---------
Train R²       : 0.9536
Test  R²       : 0.9001
Difference     : 0.0535
⚠️ Possible overfitting/underfitting detected!

--------- Error Metrics (Test Set) ---------
MAE   : 6979.1342
MSE   : 80925843.4293
RMSE  : 8995.8792


In [18]:
ridge = Ridge(alpha=100)
ridge.fit(X_train, y_train)

Ridge_results = regression_evaluate_model(ridge, X_train, X_test, y_train, y_test, "Ridge Regression")

Evaluation results: Ridge Regression
--------- R² Scores ---------
Train R²       : 0.9536
Test  R²       : 0.9000
Difference     : 0.0536
⚠️ Possible overfitting/underfitting detected!

--------- Error Metrics (Test Set) ---------
MAE   : 6978.4264
MSE   : 80965109.2030
RMSE  : 8998.0614


In [19]:
ElasticNet_model = ElasticNet(alpha=0.1, l1_ratio=0.5)
ElasticNet_model.fit(X_train, y_train)

ElasticNet_results = regression_evaluate_model(ElasticNet_model, X_train, X_test, y_train, y_test, "ElasticNet Regression")

Evaluation results: ElasticNet Regression
--------- R² Scores ---------
Train R²       : 0.9537
Test  R²       : 0.8992
Difference     : 0.0545
⚠️ Possible overfitting/underfitting detected!

--------- Error Metrics (Test Set) ---------
MAE   : 6967.1060
MSE   : 81632027.5870
RMSE  : 9035.0444


In [20]:
from sklearn.svm import SVR

svr_model = SVR(
    kernel='rbf',
    C=1.0
)
svr_model.fit(X_train, y_train)
SVR_results = regression_evaluate_model(svr_model, X_train, X_test, y_train, y_test, "Support Vector Regression")

Evaluation results: Support Vector Regression
--------- R² Scores ---------
Train R²       : -0.0215
Test  R²       : -0.1799
Difference     : 0.1585
⚠️ Possible overfitting/underfitting detected!

--------- Error Metrics (Test Set) ---------
MAE   : 22845.3262
MSE   : 955509927.3603
RMSE  : 30911.3236
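
One plausible contributor to the negative R² above is scale: with `C=1.0` and a target in the hundreds of thousands, an RBF `SVR` can barely move away from a constant prediction. The sketch below, on synthetic data of a similar magnitude (not the notebook's dataset), standardizes both the features and the target. `X_toy`, `y_toy`, `raw_svr`, and `scaled_svr` are names invented for this example.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features and target on roughly the same scale as spend/profit figures
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1e5, size=(200, 3))
y_toy = 0.8 * X_toy[:, 0] + 0.1 * X_toy[:, 1] + 5e4 + rng.normal(0, 2000, size=200)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Same settings as the cell above, fitted on unscaled data
raw_svr = SVR(kernel="rbf", C=1.0).fit(Xtr, ytr)

# Features standardized inside the pipeline; the target is standardized for
# fitting and mapped back to the original units at prediction time
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0)),
    transformer=StandardScaler(),
).fit(Xtr, ytr)

r2_raw = raw_svr.score(Xte, yte)
r2_scaled = scaled_svr.score(Xte, yte)
print(f"unscaled R²: {r2_raw:.4f} | scaled R²: {r2_scaled:.4f}")
```
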


In [21]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

poly_model = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('linear', LinearRegression())
])
poly_model.fit(X_train, y_train)

Poly_results = regression_evaluate_model(poly_model, X_train, X_test, y_train, y_test, "Polynomial Regression (Degree 2)")

Evaluation results: Polynomial Regression (Degree 2)
--------- R² Scores ---------
Train R²       : 0.9643
Test  R²       : 0.9005
Difference     : 0.0638
⚠️ Possible overfitting/underfitting detected!

--------- Error Metrics (Test Set) ---------
MAE   : 7067.8598
MSE   : 80597275.7853
RMSE  : 8977.5986
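
With only 10 test rows, a train/test R² gap of about 0.05 is hard to interpret from a single split. A hedged sketch of k-fold cross-validation on synthetic data of the same size (50 rows) shows how fold-to-fold spread can be inspected; the data and names here are made up and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a 50-row regression dataset
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(50, 3))
y_toy = X_toy @ np.array([0.8, 0.1, 0.05]) + rng.normal(0, 5, size=50)

# Five shuffled folds; each fold gives one held-out R² score
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=cv, scoring="r2")

mean_r2, std_r2 = scores.mean(), scores.std()
print(f"R² per fold: {np.round(scores, 4)}")
print(f"mean = {mean_r2:.4f}, std = {std_r2:.4f}")
```
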
