<div id="header">
    <p style="color:#6a66bd; text-align:center; font-weight:bold; font-family:verdana; font-size:20px;">Sklearn ColumnTransformer
    </p>
</div>

---

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>What is ColumnTransformer?</strong>
<br>
• The ColumnTransformer in scikit-learn is a powerful tool for applying different preprocessing steps to different subsets of the features in a dataset.
<br>
• It allows you to handle mixed data types (numerical, categorical, etc.) in a structured way.
<br>
• ColumnTransformer was introduced in scikit-learn version 0.20, which was released in September 2018.
<br>
<br>
<strong>Benefits of ColumnTransformer</strong>
<br>
➩ <strong>Flexibility</strong>
<br>
• Apply various preprocessing techniques to different columns. 
<br>
• For example, you can standardize numerical features while applying one-hot encoding to categorical features in a single step.
<br>
➩ <strong>Simplicity</strong>
<br>
• Reduces the complexity of the preprocessing workflow by consolidating multiple transformations into a single object.
<br>
• This makes the code cleaner and more maintainable.
<br>
➩ <strong>Integration with Pipelines</strong>
<br>
• Works seamlessly with scikit-learn’s Pipeline, allowing users to combine preprocessing and model fitting in a coherent workflow. 
<br>
• Inside a Pipeline, the preprocessing is fitted only on the training data, which helps avoid data leakage during cross-validation.
<br>
➩ <strong>Improved Performance</strong>
<br>
• The individual transformers can be fitted in parallel through the n_jobs parameter, which can help when working with large datasets.
<br>
➩ <strong>Control Over Remaining Columns</strong>
<br>
• The remainder parameter allows you to specify what happens to columns not included in the transformers (e.g., pass them through unchanged or drop them), offering additional control over the preprocessing steps.
</div>
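As a minimal sketch of the flexibility and Pipeline points above, the example below standardizes a numerical column and one-hot encodes a categorical column in one step, then feeds the result to a classifier. The toy data, the names `toy`, `toy_target`, `preprocess` and `model`, and the choice of LogisticRegression are illustrative assumptions, not part of this notebook's dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numerical and one categorical feature
toy = pd.DataFrame({
    'age': [25, 40, 31, 58, 47, 36],
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Kolkata', 'Mumbai', 'Kolkata'],
})
toy_target = [0, 1, 0, 1, 1, 0]

# Different preprocessing for different columns, in a single object
preprocess = ColumnTransformer(transformers=[
    ('scale', StandardScaler(), ['age']),
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['city']),
])

# Preprocessing and model fitting combined in one Pipeline
model = Pipeline(steps=[
    ('preprocess', preprocess),
    ('classifier', LogisticRegression()),
])

model.fit(toy, toy_target)
toy_predictions = model.predict(toy)
print(toy_predictions)
```

Calling `model.fit` fits the ColumnTransformer and the classifier together, so the same preprocessing is reapplied automatically at prediction time.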

In [1]:
# Importing Libraries
import numpy as np
import pandas as pd

In [2]:
# Reading CSV File
df = pd.read_csv('covid.csv')
df.sample(5)

,age,gender,fever,cough,city,has_covid
77,8,Female,101.0,Mild,Kolkata,No
69,73,Female,103.0,Mild,Delhi,No
38,49,Female,101.0,Mild,Delhi,Yes
37,55,Male,100.0,Mild,Kolkata,No
40,49,Female,102.0,Mild,Delhi,No


In [3]:
# Shape of the DataFrame
df.shape

(100, 6)

In [4]:
# Value counts for the gender column
df['gender'].value_counts()

gender
Female    59
Male      41
Name: count, dtype: int64

In [5]:
# Value counts for the cough column
df['cough'].value_counts()

cough
Mild      62
Strong    38
Name: count, dtype: int64

In [6]:
# Value counts for the city column
df['city'].value_counts()

city
Kolkata      32
Bangalore    30
Delhi        22
Mumbai       16
Name: count, dtype: int64

In [7]:
# Value counts for the has_covid column
df['has_covid'].value_counts()

has_covid
No     55
Yes    45
Name: count, dtype: int64

In [8]:
# Null values in the DataFrame
df.isna().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Train Test Split</strong>
<br>
The train-test split is a common technique in machine learning for evaluating model performance. It involves dividing your dataset into two parts :
<br>
• <strong>Training Set :</strong> Used to train the model.
<br>
• <strong>Testing Set :</strong> Used to evaluate the model's performance on unseen data.
<br>
<br>
<strong>Parameters</strong>
<br>
• <strong>*arrays :</strong> One or more arrays of the same length, passed as positional arguments (e.g., features and target). Lists, NumPy arrays and pandas DataFrames are accepted.
<br>
• <strong>test_size :</strong> The proportion of the dataset to include in the test split (e.g., 0.2 for 20%), or an integer giving the absolute number of test samples.
<br>
• <strong>random_state :</strong> Seeds the shuffling applied to the data before the split. Pass an integer to get the same split on every run.
<br>
• <strong>shuffle :</strong> A boolean that indicates whether to shuffle the data before splitting (True by default).
</div>
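The parameters above can be sketched on a small array. The names `features` and `target` and their values are made up for illustration; only `train_test_split` and its `test_size`, `random_state` and `shuffle` parameters come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
features = np.arange(20).reshape(10, 2)
target = np.array([0, 1] * 5)

# A fixed random_state makes the shuffled split reproducible
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
print(f_train.shape, f_test.shape)  # (8, 2) (2, 2)

# shuffle=False keeps the original order, so the last 20% become the test set
_, f_test_ordered, _, t_test_ordered = train_test_split(
    features, target, test_size=0.2, shuffle=False
)
print(f_test_ordered)  # [[16 17] [18 19]]
```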

In [9]:
# Importing train_test_split
from sklearn.model_selection import train_test_split

In [10]:
# Defining Features and Target Variables
X = df.iloc[:,0:5]
y = df['has_covid']

In [11]:
# Splitting the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [12]:
# Shape of Training and Testing Set
print(X_train.shape, X_test.shape)

(70, 5) (30, 5)


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>ColumnTransformer</strong>
<br>
• ColumnTransformer applies a separate transformer to each group of columns and concatenates the results into a single feature matrix.
<br>
• The output columns follow the order of the transformers list, not the original column order, and any passed-through remaining columns are placed last.
<br>
<br>
<strong>Parameters of ColumnTransformer</strong>
<br>
➩ <strong>transformers</strong>
<br>
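• A list of (name, transformer, columns) tuples :
<br>
◦ <strong>name</strong> (string) : Identifies the transformer.
<br>
◦ <strong>transformer</strong> : The preprocessing estimator to apply (e.g., StandardScaler, OneHotEncoder), or 'drop' / 'passthrough'.
<br>
◦ <strong>columns</strong> : The columns to apply the transformer to, typically a list of column names. A single string selects a 1D column, so transformers that expect 2D input need a list such as ['fever'].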
<br>
➩ <strong>remainder</strong>
<br>
• This specifies what to do with the remaining columns that are not specified in the transformers.
<br>
• 'drop' (default): Discards the remaining columns.
<br>
• 'passthrough': Keeps the remaining columns unchanged.
<br>
• You can also provide a transformer to apply to the remaining columns.
</div>

In [13]:
# Importing SimpleImputer, OrdinalEncoder and OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

In [14]:
# Importing ColumnTransformer
from sklearn.compose import ColumnTransformer

In [15]:
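# Creating ColumnTransformer Object
transformer = ColumnTransformer(transformers=[
    # Fill missing fever values with the column mean (SimpleImputer default)
    ('tnf1', SimpleImputer(), ['fever']),
    # Encode cough as ordered integers: Mild -> 0, Strong -> 1
    ('tnf2', OrdinalEncoder(categories=[['Mild', 'Strong']]), ['cough']),
    # One-hot encode gender and city, dropping the first category of each
    # (sparse_output requires scikit-learn 1.2 or later)
    ('tnf3', OneHotEncoder(sparse_output=False, drop='first'), ['gender', 'city'])
], remainder='passthrough')  # age is passed through unchanged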

In [16]:
# Fitting and Transforming Training Data
X_train_transformed = transformer.fit_transform(X_train)

In [17]:
# Transforming Testing Data (reuses what was learned from the training data, no refitting)
X_test_transformed = transformer.transform(X_test)