<h1 style="text-align: center;">Scikit-learn ColumnTransformer and Pipeline</h1>
<hr>

Scikit-learn's ColumnTransformer
--------------------------------

The ColumnTransformer is a crucial tool for applying **different preprocessing steps to different columns** of a dataset within a single, unified operation. 

It's the standard solution for handling datasets with **mixed data types** (e.g., numerical, categorical, text), where each data type requires a unique transformation pipeline.

### Why Use a ColumnTransformer?

Real-world datasets are rarely uniform. You might have:

*   **Numerical columns** that need scaling or imputation.
    
*   **Categorical columns** that need one-hot encoding or ordinal encoding.
    
*   **Text columns** that require vectorization.
    

The ColumnTransformer allows you to define and apply these separate transformations cleanly and efficiently.

### How It Works

You construct a ColumnTransformer by providing it with a list of tuples. Each tuple defines a specific transformation and the columns it applies to.

Each tuple has the following structure:('name\_of\_step', transformer\_object, columns\_to\_apply\_to)

*   **name\_of\_step**: A string to identify the transformer (e.g., 'numeric\_scaler').
    
*   **transformer\_object**: An instance of a scikit-learn transformer (e.g., StandardScaler(), OneHotEncoder()).
    
*   **columns\_to\_apply\_to**: A list of column names or indices.
    

#### Important Parameter: remainder

The remainder parameter controls what happens to the columns that you _don't_ explicitly select for transformation.

*   remainder='passthrough': **Keeps** the unselected columns in the output. This is useful for features you don't want to transform.
    
*   remainder='drop' (default): **Discards** the unselected columns.
    

By integrating ColumnTransformer into a Pipeline, you can create a single, robust preprocessing and modeling workflow that handles complex datasets with ease.

In [68]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split

In [3]:
df = pd.read_csv(r'C:\Feature Engineering\Datasets\cars.csv')

In [4]:
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [5]:
df['owner'].value_counts()

owner
First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: count, dtype: int64

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
                                                      df.drop(columns=['selling_price']),
                                                      df['selling_price'],
                                                      test_size=0.2,
                                                      random_state=42
                                                    )

In [7]:
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
6518,Tata,2560,Petrol,First Owner
6144,Honda,80000,Petrol,Second Owner
6381,Hyundai,150000,Diesel,Fourth & Above Owner
438,Maruti,120000,Diesel,Second Owner
5939,Maruti,25000,Petrol,First Owner


## Manual Preprocessing vs. `ColumnTransformer`

The core difference is between a **fragmented, error-prone process** and a **single, robust, automated workflow**.  
While the manual approach forces you to handle each column set and data split separately, `ColumnTransformer` encapsulates the entire logic into one object.

Here is a direct comparison:

| Aspect | Manual Approach  | `ColumnTransformer` Approach  |
| :--- | :--- | :--- |
| **Workflow** | You manually select column subsets, fit transformers on training data, transform train and test sets separately, and then combine the results. | You define all transformations for all column sets in a single object. It handles the "fit" and "transform" logic internally. |
| **Data Leakage Risk** | **High** . It's very easy to accidentally `fit` on the test data or the entire dataset before splitting, leading to an over-optimistic model that performs poorly on new data. | **Low** . It's designed to prevent data leakage. When used in a `Pipeline`, it correctly learns parameters from the training data only and applies them to any other data. |
| **Code Complexity** | **Verbose and Cluttered**. Your code becomes long and repetitive, with separate logic for scaling, encoding, etc., making it hard to read and debug. | **Concise and Organized**. The entire preprocessing logic is defined in one clean, readable block. |
| **Reproducibility** | **Difficult**. You must perfectly re-implement the same sequence of steps and use the exact same fitted objects for new data, which is tedious and error-prone. | **Simple**. The entire `ColumnTransformer` (often within a `Pipeline`) can be saved as a single object, ensuring the exact same process is applied to new data with a simple `.predict()` call. |

---

### How `ColumnTransformer` Makes Your Work Easier

1. **It Prevents the Biggest Mistake (Data Leakage):**  
   By far, the most critical advantage. `ColumnTransformer` (especially inside a `Pipeline`) enforces the correct pattern of fitting **only** on training data and transforming all data splits (train, validation, test).  
   This saves you from the most common and dangerous pitfall in machine learning.

2. **It Organizes Your Entire Workflow:**  
   Instead of having scattered pieces of code for handling different columns, you have a single object that serves as a blueprint for your entire feature engineering process.  
   This makes your project's logic immediately understandable.

3. **It Automates Everything:**  
   Once defined, the `ColumnTransformer` works seamlessly within a `Pipeline` for cross-validation (`cross_val_score`) and hyperparameter tuning (`GridSearchCV`).  
   You don't need to manually apply transformations at each fold of cross-validation; the `Pipeline` handles it automatically.

---

In short, `ColumnTransformer` turns a complex, risky manual process into a single, robust, and reproducible step, saving you time and preventing critical errors.

### The Hard Way!

In [8]:
# apply ordinal encoder to owner
oe = OrdinalEncoder(categories=[['Test Drive Car', 'Fourth & Above Owner', 'Third Owner', 'Second Owner', 'First Owner']])

X_train_owner = oe.fit_transform(X_train.loc[:,['owner']])
X_test_owner = oe.transform(X_test.loc[:,['owner']])

In [9]:
# convert to df
X_train_owner_df = pd.DataFrame(X_train_owner,columns=oe.get_feature_names_out())
X_test_owner_df = pd.DataFrame(X_test_owner,columns=oe.get_feature_names_out())

In [10]:
X_train_owner_df.head()

Unnamed: 0,owner
0,4.0
1,3.0
2,1.0
3,3.0
4,4.0


In [11]:
# apply ohe to brand and fuel
ohe = OneHotEncoder(sparse_output=False)

X_train_brand_fuel = ohe.fit_transform(X_train[['brand','fuel']])
X_test_brand_fuel = ohe.transform(X_test[['brand','fuel']])

In [12]:
# converting to dataframe
X_train_brand_fuel_df = pd.DataFrame(X_train_brand_fuel, columns=ohe.get_feature_names_out())
X_test_brand_fuel_df = pd.DataFrame(X_test_brand_fuel, columns=ohe.get_feature_names_out())

In [13]:
X_train_brand_fuel_df.head()

Unnamed: 0,brand_Ambassador,brand_Ashok,brand_Audi,brand_BMW,brand_Chevrolet,brand_Daewoo,brand_Datsun,brand_Fiat,brand_Force,brand_Ford,...,brand_Renault,brand_Skoda,brand_Tata,brand_Toyota,brand_Volkswagen,brand_Volvo,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [14]:
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
6518,Tata,2560,Petrol,First Owner
6144,Honda,80000,Petrol,Second Owner
6381,Hyundai,150000,Diesel,Fourth & Above Owner
438,Maruti,120000,Diesel,Second Owner
5939,Maruti,25000,Petrol,First Owner


In [15]:
X_train_rem = X_train.drop(columns=['brand','fuel','owner'],inplace=True)
X_test_rem = X_test.drop(columns=['brand','fuel','owner'],inplace=True)

In [16]:
X_train = pd.concat([X_train_rem, X_train_owner_df, X_train_brand_fuel_df],axis=1)
X_test = pd.concat([X_test_rem, X_test_owner_df, X_test_brand_fuel_df],axis=1)

In [17]:
X_train.head()

Unnamed: 0,owner,brand_Ambassador,brand_Ashok,brand_Audi,brand_BMW,brand_Chevrolet,brand_Daewoo,brand_Datsun,brand_Fiat,brand_Force,...,brand_Renault,brand_Skoda,brand_Tata,brand_Toyota,brand_Volkswagen,brand_Volvo,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol
0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


### The Easy Way!

In [18]:
from sklearn.compose import ColumnTransformer

In [21]:
df = pd.read_csv(r"C:\Feature Engineering\Datasets\cars.csv")

In [22]:
X_train, X_test, y_train, y_test = train_test_split(
                                                      df.drop(columns=['selling_price']),
                                                      df['selling_price'],
                                                      test_size=0.2,
                                                      random_state=42
                                                    )

In [23]:
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
6518,Tata,2560,Petrol,First Owner
6144,Honda,80000,Petrol,Second Owner
6381,Hyundai,150000,Diesel,Fourth & Above Owner
438,Maruti,120000,Diesel,Second Owner
5939,Maruti,25000,Petrol,First Owner


In [24]:
transformer = ColumnTransformer(
    [
        ("ordinal", OrdinalEncoder(categories=[['Test Drive Car', 'Fourth & Above Owner', 'Third Owner', 'Second Owner', 'First Owner']]), ['owner']),
        ("onehot", OneHotEncoder(sparse_output=False), ['brand', 'fuel'])
    ],
    remainder='passthrough'
)

# setting to get a pandas df
transformer.set_output(transform='pandas')


0,1,2
,transformers,"[('ordinal', ...), ('onehot', ...)]"
,remainder,'passthrough'
,sparse_threshold,0.3
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True
,force_int_remainder_cols,'deprecated'

0,1,2
,categories,"[['Test Drive Car', 'Fourth & Above Owner', ...]]"
,dtype,<class 'numpy.float64'>
,handle_unknown,'error'
,unknown_value,
,encoded_missing_value,
,min_frequency,
,max_categories,

0,1,2
,categories,'auto'
,drop,
,sparse_output,False
,dtype,<class 'numpy.float64'>
,handle_unknown,'error'
,min_frequency,
,max_categories,
,feature_name_combiner,'concat'


In [25]:
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)

In [26]:
transformer.set_output(transform='pandas')

0,1,2
,transformers,"[('ordinal', ...), ('onehot', ...)]"
,remainder,'passthrough'
,sparse_threshold,0.3
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True
,force_int_remainder_cols,'deprecated'

0,1,2
,categories,"[['Test Drive Car', 'Fourth & Above Owner', ...]]"
,dtype,<class 'numpy.float64'>
,handle_unknown,'error'
,unknown_value,
,encoded_missing_value,
,min_frequency,
,max_categories,

0,1,2
,categories,'auto'
,drop,
,sparse_output,False
,dtype,<class 'numpy.float64'>
,handle_unknown,'error'
,min_frequency,
,max_categories,
,feature_name_combiner,'concat'


In [27]:
transformer.fit_transform(X_train)

Unnamed: 0,ordinal__owner,onehot__brand_Ambassador,onehot__brand_Ashok,onehot__brand_Audi,onehot__brand_BMW,onehot__brand_Chevrolet,onehot__brand_Daewoo,onehot__brand_Datsun,onehot__brand_Fiat,onehot__brand_Force,...,onehot__brand_Skoda,onehot__brand_Tata,onehot__brand_Toyota,onehot__brand_Volkswagen,onehot__brand_Volvo,onehot__fuel_CNG,onehot__fuel_Diesel,onehot__fuel_LPG,onehot__fuel_Petrol,remainder__km_driven
6518,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2560
6144,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,80000
6381,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,150000
438,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,120000
5939,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,25000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5226,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,120000
5390,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,80000
860,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,35000
7603,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,27000


In [28]:
transformer.feature_names_in_

array(['brand', 'km_driven', 'fuel', 'owner'], dtype=object)

In [29]:
transformer.get_feature_names_out()

array(['ordinal__owner', 'onehot__brand_Ambassador',
       'onehot__brand_Ashok', 'onehot__brand_Audi', 'onehot__brand_BMW',
       'onehot__brand_Chevrolet', 'onehot__brand_Daewoo',
       'onehot__brand_Datsun', 'onehot__brand_Fiat',
       'onehot__brand_Force', 'onehot__brand_Ford', 'onehot__brand_Honda',
       'onehot__brand_Hyundai', 'onehot__brand_Isuzu',
       'onehot__brand_Jaguar', 'onehot__brand_Jeep', 'onehot__brand_Kia',
       'onehot__brand_Land', 'onehot__brand_Lexus', 'onehot__brand_MG',
       'onehot__brand_Mahindra', 'onehot__brand_Maruti',
       'onehot__brand_Mercedes-Benz', 'onehot__brand_Mitsubishi',
       'onehot__brand_Nissan', 'onehot__brand_Opel',
       'onehot__brand_Peugeot', 'onehot__brand_Renault',
       'onehot__brand_Skoda', 'onehot__brand_Tata',
       'onehot__brand_Toyota', 'onehot__brand_Volkswagen',
       'onehot__brand_Volvo', 'onehot__fuel_CNG', 'onehot__fuel_Diesel',
       'onehot__fuel_LPG', 'onehot__fuel_Petrol', 'remainder__km_drive

In [30]:
transformer.n_features_in_

4

In [31]:
transformer.transformers_

[('ordinal',
  OrdinalEncoder(categories=[['Test Drive Car', 'Fourth & Above Owner',
                              'Third Owner', 'Second Owner', 'First Owner']]),
  ['owner']),
 ('onehot', OneHotEncoder(sparse_output=False), ['brand', 'fuel']),
 ('remainder',
  FunctionTransformer(accept_sparse=True, check_inverse=False,
                      feature_names_out='one-to-one'),
  ['km_driven'])]

In [32]:
transformer.output_indices_

{'ordinal': slice(0, 1, None),
 'onehot': slice(1, 37, None),
 'remainder': slice(37, 38, None)}

### Sklearn Pipeline

A Pipeline in scikit-learn chains multiple data processing steps and a final model into a single object. This simplifies your machine learning workflow by bundling a sequence of transformations (like scaling and encoding) and a final estimator (like a classifier) into one cohesive unit. 

Why Use a Pipeline?
-------------------

Using a Pipeline is a best practice in machine learning for several key reasons:

*   **Simplicity and Organization**: It cleans up your code by combining a multi-step workflow into a single object. This makes your entire process, from preprocessing to prediction, much easier to manage and understand.
    
*   **Preventing Data Leakage**: This is the most critical benefit. A Pipeline ensures that data transformations (like learning scaling parameters) are fitted **only** on the training data during cross-validation. This prevents information from the test set from "leaking" into your training process, giving you a more reliable measure of your model's performance. 
    
*   **Automation**: It automates the process of applying the same sequence of steps, which is essential for tasks like cross-validation and grid searching for hyperparameters. You can treat the entire pipeline as a single estimator.
    

How It Works
------------

A Pipeline is built by providing a list of steps. Each step is a tuple containing a name you choose and an instance of a transformer or estimator.

A common workflow might look like this:

1.  **Impute** missing values.
    
2.  **Scale** numerical features.
    
3.  **Train** a final model (e.g., a classifier or regressor).
    

The Pipeline object seamlessly manages the fit, transform, and predict logic across all these steps, ensuring they are executed in the correct order every time.

In [78]:
df = pd.read_csv(r"C:\Feature Engineering\Datasets\cars.csv")
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [79]:
df.shape

(8128, 5)

In [101]:
import numpy as np

np.random.seed(42)
missing_km_indices = np.random.choice(df.index, size=int(0.05*len(df)), replace=False)
df.loc[missing_km_indices, 'km_driven'] = np.nan

# Introduce missing values in 'owner' column (1% missing values)
missing_owner_indices = np.random.choice(df.index, size=int(0.01*len(df)), replace=False)
df.loc[missing_owner_indices, 'owner'] = np.nan

In [102]:
df.isnull().sum()

brand              0
km_driven        406
fuel               0
owner             81
selling_price      0
dtype: int64

In [103]:
X_train, X_test, y_train, y_test = train_test_split(
                                                      df.drop(columns=['selling_price']),
                                                      df['selling_price'],
                                                      test_size=0.2,
                                                      random_state=42
                                                    )

In [104]:
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
6518,Tata,2560.0,Petrol,First Owner
6144,Honda,80000.0,Petrol,Second Owner
6381,Hyundai,150000.0,Diesel,Fourth & Above Owner
438,Maruti,120000.0,Diesel,Second Owner
5939,Maruti,25000.0,Petrol,First Owner


In [105]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6502 entries, 6518 to 7270
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   brand      6502 non-null   object 
 1   km_driven  6502 non-null   float64
 2   fuel       6502 non-null   object 
 3   owner      6442 non-null   object 
dtypes: float64(1), object(3)
memory usage: 254.0+ KB


In [106]:
# Plan of Attack

# Missing value imputation
# Encoding Categorical Variables
# Scaling
# Feature Selection
# Model building
# Prediction

In [107]:
df['owner'].value_counts()

owner
First Owner             5235
Second Owner            2085
Third Owner              549
Fourth & Above Owner     173
Test Drive Car             5
Name: count, dtype: int64

In [108]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest,chi2

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

In [109]:
# imputation transformer
trf1 = ColumnTransformer([
    ('impute_km_driven',SimpleImputer(),[1]),
    ('impute_owner',SimpleImputer(strategy='most_frequent'),[3])
],remainder='passthrough')

In [110]:
# encoding categorical variables
trf2 = ColumnTransformer(
    [
        ("ordinal", OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), [3]),
        ("onehot", OneHotEncoder(handle_unknown='ignore', sparse_output=False), [0,2])
    ],
    remainder='passthrough'
)

In [111]:
# Scaling
trf3 = ColumnTransformer([
    ('scale',MinMaxScaler(),slice(0,38))
])

In [112]:
a = [1,2,3,4,5]
x = slice(0,5)
a[x]

[1, 2, 3, 4, 5]

In [113]:
# Feature selection
trf4 = SelectKBest(score_func=chi2,k=10)

In [114]:
# train the model
trf5 = RandomForestRegressor()

In [115]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('imputer',trf1),
    ('encoder',trf2),
    ('scaler',trf3),
    ('fselector',trf4),
    ('model',trf5)
])


In [116]:
pipe.fit(X_train, y_train)

0,1,2
,steps,"[('imputer', ...), ('encoder', ...), ...]"
,transform_input,
,memory,
,verbose,False

0,1,2
,transformers,"[('impute_km_driven', ...), ('impute_owner', ...)]"
,remainder,'passthrough'
,sparse_threshold,0.3
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True
,force_int_remainder_cols,'deprecated'

0,1,2
,missing_values,
,strategy,'mean'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False

0,1,2
,missing_values,
,strategy,'most_frequent'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False

0,1,2
,transformers,"[('ordinal', ...), ('onehot', ...)]"
,remainder,'passthrough'
,sparse_threshold,0.3
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True
,force_int_remainder_cols,'deprecated'

0,1,2
,categories,'auto'
,dtype,<class 'numpy.float64'>
,handle_unknown,'use_encoded_value'
,unknown_value,-1
,encoded_missing_value,
,min_frequency,
,max_categories,

0,1,2
,categories,'auto'
,drop,
,sparse_output,False
,dtype,<class 'numpy.float64'>
,handle_unknown,'ignore'
,min_frequency,
,max_categories,
,feature_name_combiner,'concat'

0,1,2
,transformers,"[('scale', ...)]"
,remainder,'drop'
,sparse_threshold,0.3
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True
,force_int_remainder_cols,'deprecated'

0,1,2
,feature_range,"(0, ...)"
,copy,True
,clip,False

0,1,2
,score_func,<function chi...001CED880D3A0>
,k,10

0,1,2
,n_estimators,100
,criterion,'squared_error'
,max_depth,
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,1.0
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True


In [117]:
pipe.feature_names_in_

array(['brand', 'km_driven', 'fuel', 'owner'], dtype=object)

In [118]:
pipe.named_steps

{'imputer': ColumnTransformer(remainder='passthrough',
                   transformers=[('impute_km_driven', SimpleImputer(), [1]),
                                 ('impute_owner',
                                  SimpleImputer(strategy='most_frequent'),
                                  [3])]),
 'encoder': ColumnTransformer(remainder='passthrough',
                   transformers=[('ordinal',
                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                 unknown_value=-1),
                                  [3]),
                                 ('onehot',
                                  OneHotEncoder(handle_unknown='ignore',
                                                sparse_output=False),
                                  [0, 2])]),
 'scaler': ColumnTransformer(transformers=[('scale', MinMaxScaler(), slice(0, 38, None))]),
 'fselector': SelectKBest(score_func=<function chi2 at 0x000001CED880D3A0>),
 'model

In [119]:
pipe.named_steps['scaler'].transformers_[0][1].data_max_

array([3., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1.])

In [120]:
pipe.predict(X_test)[10:40]

array([631107.85033222, 631107.85033222, 631107.85033222, 631107.85033222,
       631107.85033222, 631107.85033222, 631107.85033222, 631107.85033222,
       631107.85033222, 631107.85033222, 631107.85033222, 631107.85033222,
       631107.85033222, 631107.85033222, 631107.85033222, 631107.85033222,
       631107.85033222, 631107.85033222, 631107.85033222, 631107.85033222,
       631107.85033222, 631107.85033222, 631107.85033222, 631107.85033222,
       631107.85033222, 631107.85033222, 631107.85033222, 631107.85033222,
       631107.85033222, 631107.85033222])

In [121]:
# Predict
pipe.predict(np.array(['Maruti',100000.0,'Diesel','First Owner']).reshape(1,4))



array([631107.85033222])

### Cross Validation

In [122]:
# cross validation using cross_val_score
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean()

np.float64(-639113244101.0538)

### Hyperparameter Tuning

In [123]:
# gridsearchcv
params = {
    'model__max_depth':[1,2,3,4,5,None]
}

In [124]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, params, cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

0,1,2
,estimator,Pipeline(step...Regressor())])
,param_grid,"{'model__max_depth': [1, 2, ...]}"
,scoring,'neg_mean_squared_error'
,n_jobs,
,refit,True
,cv,5
,verbose,0
,pre_dispatch,'2*n_jobs'
,error_score,
,return_train_score,False

0,1,2
,transformers,"[('impute_km_driven', ...), ('impute_owner', ...)]"
,remainder,'passthrough'
,sparse_threshold,0.3
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True
,force_int_remainder_cols,'deprecated'

0,1,2
,missing_values,
,strategy,'mean'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False

0,1,2
,missing_values,
,strategy,'most_frequent'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False

0,1,2
,transformers,"[('ordinal', ...), ('onehot', ...)]"
,remainder,'passthrough'
,sparse_threshold,0.3
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True
,force_int_remainder_cols,'deprecated'

0,1,2
,categories,'auto'
,dtype,<class 'numpy.float64'>
,handle_unknown,'use_encoded_value'
,unknown_value,-1
,encoded_missing_value,
,min_frequency,
,max_categories,

0,1,2
,categories,'auto'
,drop,
,sparse_output,False
,dtype,<class 'numpy.float64'>
,handle_unknown,'ignore'
,min_frequency,
,max_categories,
,feature_name_combiner,'concat'

0,1,2
,transformers,"[('scale', ...)]"
,remainder,'drop'
,sparse_threshold,0.3
,n_jobs,
,transformer_weights,
,verbose,False
,verbose_feature_names_out,True
,force_int_remainder_cols,'deprecated'

0,1,2
,feature_range,"(0, ...)"
,copy,True
,clip,False

0,1,2
,score_func,<function chi...001CED880D3A0>
,k,10

0,1,2
,n_estimators,100
,criterion,'squared_error'
,max_depth,
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,1.0
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True


In [125]:
grid.best_score_

np.float64(-639088313514.5734)

In [126]:
grid.best_params_

{'model__max_depth': None}

### Export the Pipeline

In [127]:
# export
import pickle
pickle.dump(pipe,open('pipe.pkl','wb'))