## `Column Transformer`

- Different columns in a dataset come with different problems: some have missing values, some hold categorical data, some need to be scaled, and so on. During **FE** (feature engineering) we solve these problems with techniques such as ***Imputation*** (to fill missing values), ***Encoding*** (to convert categorical data to numeric form), and ***Standardization and Normalization*** (to scale a column).
- Handling these problems one at a time is tedious, because each column may need a different transformation.
- To solve this we use the **ColumnTransformer** class from the **scikit-learn** library.

In [1]:
# importing the libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

import warnings
warnings.filterwarnings('ignore')

In [2]:
# importing the dataset

df = pd.read_csv('datasets/covid_toy.csv')
df.sample(5)

Unnamed: 0,age,gender,fever,cough,city,has_covid
7,20,Female,,Strong,Mumbai,Yes
88,5,Female,100.0,Mild,Kolkata,No
5,84,Female,,Mild,Bangalore,Yes
63,10,Male,100.0,Mild,Bangalore,No
22,71,Female,98.0,Strong,Kolkata,Yes


### First we will do the transformations one by one

- Here we use ***SimpleImputer*** to fill the missing values, and ***OneHotEncoder***, ***OrdinalEncoder*** and ***LabelEncoder*** for encoding the categorical values.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        100 non-null    int64  
 1   gender     100 non-null    object 
 2   fever      90 non-null     float64
 3   cough      100 non-null    object 
 4   city       100 non-null    object 
 5   has_covid  100 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 4.8+ KB


In [4]:
# Checking for null values in all the columns

df.isna().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64

**Notes:**

- So we have 10 missing values in the column `fever`.

In [5]:
# Checking different Categories in all the categorical columns

for column in df.columns:
    if df[column].dtype == 'O':
        print(f"\nThe different categories in the column {column} is:")
        print(df[column].unique())
        print(f"Number of categories in the column {column} is: {len(df[column].unique())}")
        print("*"*50)


The different categories in the column gender is:
['Male' 'Female']
Number of categories in the column gender is: 2
**************************************************

The different categories in the column cough is:
['Mild' 'Strong']
Number of categories in the column cough is: 2
**************************************************

The different categories in the column city is:
['Kolkata' 'Delhi' 'Mumbai' 'Bangalore']
Number of categories in the column city is: 4
**************************************************

The different categories in the column has_covid is:
['No' 'Yes']
Number of categories in the column has_covid is: 2
**************************************************


**Notes:**

- Now here we will use ***One Hot Encoding*** on the `city` and `gender` columns, ***Ordinal Encoding*** on the `cough` column, and ***Label Encoding*** on the `has_covid` column.
- Along with this we use ***Simple Imputer*** to fill the missing values in the `fever` column.

### Doing the train test split

In [6]:
# Here we are using 'has_covid' as the output feature

X = df.drop(columns=['has_covid'])
y = df['has_covid']

X.head()

Unnamed: 0,age,gender,fever,cough,city
0,60,Male,103.0,Mild,Kolkata
1,27,Male,100.0,Mild,Delhi
2,42,Male,101.0,Mild,Delhi
3,31,Female,98.0,Mild,Kolkata
4,65,Female,101.0,Mild,Mumbai


In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape

((80, 5), (20, 5))

In [8]:
X_train.head()

Unnamed: 0,age,gender,fever,cough,city
55,81,Female,101.0,Mild,Mumbai
88,5,Female,100.0,Mild,Kolkata
26,19,Female,100.0,Mild,Kolkata
42,27,Male,100.0,Mild,Delhi
69,73,Female,103.0,Mild,Delhi


In [9]:
# Checking for missing value

X_train.isnull().sum()

age       0
gender    0
fever     9
cough     0
city      0
dtype: int64

In [10]:
X_test.isnull().sum()

age       0
gender    0
fever     1
cough     0
city      0
dtype: int64

In [11]:
# Using Simple Imputer on column 'fever'
# Here all the missing values will get replaced with the 'mean' of the column (the default strategy)

# Creating object of the class
si = SimpleImputer()

# train data
X_train_fever = si.fit_transform(X_train[['fever']])

# test data
X_test_fever = si.transform(X_test[['fever']])

In [12]:
# Creating dataframe of the numpy array so we can check for null values

X_train_fever_df = pd.DataFrame(X_train_fever, columns=['fever'])
X_test_fever_df = pd.DataFrame(X_test_fever, columns=['fever'])

In [13]:
X_train_fever_df.isna().sum()

fever    0
dtype: int64

In [14]:
X_test_fever_df.isnull().sum()

fever    0
dtype: int64
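
As a quick check of what the imputer does, here is a minimal sketch on made-up values (the `train` and `test` arrays are hypothetical, not rows from `covid_toy.csv`). It shows that the mean is learned from the training column only and then reused for the test column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy values, not taken from covid_toy.csv
train = np.array([[98.0], [np.nan], [102.0], [100.0]])
test = np.array([[np.nan], [99.0]])

imputer = SimpleImputer()                    # strategy='mean' is the default
train_filled = imputer.fit_transform(train)  # learns the mean from train
test_filled = imputer.transform(test)        # reuses the train mean

print(imputer.statistics_)    # value learned per column during fit
print(train_filled.ravel())
print(test_filled.ravel())
```

The learned value is stored in `statistics_`, so inspecting it is an easy way to confirm which number was used to fill the gaps.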

In [15]:
# Using Ordinal encoding on 'cough' column
# Here we will make sure to give 'Mild' smaller value and 'Strong' larger value 
# by providing them in order we want them in parameter categories


oe = OrdinalEncoder(categories=[['Mild','Strong']])

# train data
X_train_cough = oe.fit_transform(X_train[['cough']])

# also the test data
X_test_cough = oe.transform(X_test[['cough']])

In [16]:
# Checking the categories in Ordinal encoding

oe.categories_

[array(['Mild', 'Strong'], dtype=object)]
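
For `Mild` and `Strong` the alphabetical order happens to match the order we want, so the effect of `categories` is easy to miss. The sketch below uses a hypothetical three-level column (`Low`, `Medium`, `High`, which is not part of this dataset) to show how the default ordering can differ from an explicit one.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical severity column; these labels are not in the dataset
severity = np.array([['Low'], ['High'], ['Medium'], ['Low']], dtype=object)

# Default: categories are sorted, so High=0, Low=1, Medium=2
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(severity).ravel()

# Explicit: the order we pass defines the codes, so Low=0, Medium=1, High=2
explicit_enc = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
explicit_codes = explicit_enc.fit_transform(severity).ravel()

print(default_enc.categories_, default_codes)
print(explicit_enc.categories_, explicit_codes)
```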

In [17]:
# Creating dataframe of the numpy array 

X_train_cough_df = pd.DataFrame(X_train_cough, columns=['cough'])
X_test_cough_df = pd.DataFrame(X_test_cough, columns=['cough'])

In [18]:
X_train_cough_df.sample(5)

Unnamed: 0,cough
20,0.0
57,1.0
65,1.0
23,0.0
12,0.0


In [19]:
X_test_cough_df.sample(5)

Unnamed: 0,cough
0,0.0
1,0.0
15,0.0
6,1.0
10,0.0


In [20]:
# Using One Hot Encoding on 'gender' and 'city' column

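# Note: scikit-learn >= 1.2 names this parameter 'sparse_output'; older versions call it 'sparse'
ohe = OneHotEncoder(drop='first', sparse_output=False)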

X_train_gender_city = ohe.fit_transform(X_train[['gender', 'city']])

# also the test data
X_test_gender_city = ohe.transform(X_test[['gender', 'city']])

In [21]:
X_train_gender_city.shape, X_test_gender_city.shape

((80, 4), (20, 4))

**Notes:**
- So now we have 4 columns instead of 2: 1 for `gender` (2 categories, first one dropped) and 3 for `city` (4 categories, first one dropped).

In [22]:
# Using Label Encoding on 'has_covid' column

le = LabelEncoder()

y_train_covid = le.fit_transform(y_train)

# also the test data
y_test_covid = le.transform(y_test)

In [23]:
# Checking values of the label encoder

le.classes_

array(['No', 'Yes'], dtype=object)
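
The position of each label in `classes_` is the integer it is mapped to, so `No` becomes 0 and `Yes` becomes 1. A minimal sketch with a hypothetical `labels` array shows the mapping and how `inverse_transform` recovers the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values with the same two classes as 'has_covid'
labels = np.array(['No', 'Yes', 'Yes', 'No', 'Yes'])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)       # index into encoder.classes_
decoded = encoder.inverse_transform(codes)  # back to the original strings

print(encoder.classes_)
print(codes)
print(decoded)
```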

In [24]:
# Creating dataframe of the numpy array 

y_train_covid_df = pd.DataFrame(y_train_covid, columns=['has_covid'])
y_test_covid_df = pd.DataFrame(y_test_covid, columns=['has_covid'])

In [25]:
y_train_covid_df.sample(5)

Unnamed: 0,has_covid
35,1
7,1
10,1
22,1
57,1


In [26]:
y_test_covid_df.sample(5)

Unnamed: 0,has_covid
16,1
7,1
2,0
5,0
0,0


In [27]:
# Checking shape of all the arrays

X_train_fever.shape, X_train_cough.shape, X_train_gender_city.shape, y_train_covid.shape

((80, 1), (80, 1), (80, 4), (80,))

- So all the arrays have 80 rows

### Now we need to concatenate all the input columns

In [28]:
# The 'age' column needs no transformation, so we extract it as-is
# from both the train and test datasets and keep its values in a variable

# Extracting Age
X_train_age = X_train.drop(columns=['gender','fever','cough','city']).values

# also the test data
X_test_age = X_test.drop(columns=['gender','fever','cough','city']).values

In [29]:
# Checking the values

X_train_age

array([[81],
       [ 5],
       [19],
       [27],
       [73],
       [70],
       [49],
       [51],
       [64],
       [83],
       [65],
       [18],
       [16],
       [16],
       [27],
       [84],
       [51],
       [69],
       [82],
       [69],
       [44],
       [74],
       [20],
       [12],
       [33],
       [42],
       [65],
       [23],
       [56],
       [64],
       [13],
       [31],
       [40],
       [49],
       [19],
       [11],
       [14],
       [42],
       [38],
       [46],
       [71],
       [10],
       [60],
       [22],
       [19],
       [65],
       [19],
       [54],
       [81],
       [20],
       [48],
       [82],
       [23],
       [66],
       [ 5],
       [49],
       [ 5],
       [34],
       [79],
       [ 6],
       [10],
       [69],
       [55],
       [34],
       [27],
       [47],
       [73],
       [42],
       [80],
       [47],
       [38],
       [34],
       [25],
       [24],
       [12],
       [24],
       [75],

In [30]:
# Now concatenating all the columns

X_train_transformed = np.concatenate((X_train_age, X_train_fever, X_train_gender_city, X_train_cough), axis=1)

# also the test data
X_test_transformed = np.concatenate((X_test_age, X_test_fever, X_test_gender_city, X_test_cough), axis=1)

In [31]:
X_train_transformed.shape

(80, 7)

- So now we have 7 columns instead of the 5 columns in the original train dataset.
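
The column count follows from the pieces being stacked side by side: 1 (`age`) + 1 (`fever`) + 4 (`gender` and `city`) + 1 (`cough`) = 7. The sketch below reproduces that arithmetic with small hypothetical arrays and also shows that the integer `age` values are upcast to float in the combined array.

```python
import numpy as np

# Hypothetical stand-ins for the arrays built in the cells above (3 rows each)
age = np.array([[60], [27], [42]])             # 1 integer column
fever = np.array([[103.0], [100.0], [101.0]])  # 1 column
gender_city = np.zeros((3, 4))                 # 4 one-hot columns
cough = np.array([[0.0], [1.0], [0.0]])        # 1 column

# axis=1 stacks column-wise, so all arrays must have the same number of rows
combined = np.concatenate((age, fever, gender_city, cough), axis=1)

print(combined.shape)
print(combined.dtype)
```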

### Now doing all these above using the `ColumnTransformer`

- Here we need to pass two parameters:
    - **transformers** - A list of `tuples`. Each tuple holds the name of the transformer, the transformation to apply, and the list of columns to apply it to.
    - **remainder** - What to do with the columns that are not listed in any transformer. The default value is `drop`; the other common option is `passthrough`, which keeps those columns unchanged.
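
The effect of `remainder` is easiest to see on a tiny example. This is a sketch on a hypothetical two-column `toy` frame; the helper `build` and the transformer name `cough_enc` are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame: one column to encode, one column left unlisted
toy = pd.DataFrame({'age': [60, 27, 42], 'cough': ['Mild', 'Strong', 'Mild']})

def build(remainder):
    """Hypothetical helper: same transformer list, different remainder."""
    return ColumnTransformer(
        transformers=[
            ('cough_enc', OrdinalEncoder(categories=[['Mild', 'Strong']]), ['cough']),
        ],
        remainder=remainder,
    )

dropped = build('drop').fit_transform(toy)      # 'age' is discarded
kept = build('passthrough').fit_transform(toy)  # 'age' is appended at the end

print(dropped.shape)
print(kept)
```

With `passthrough`, the untouched columns are placed after the transformed ones, which is why `age` ends up as the last column.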

In [32]:
# importing column transformer

from sklearn.compose import ColumnTransformer

In [33]:
# Here the 'age' column will remain intact as we passed remainder as passthrough

transformer = ColumnTransformer(transformers=[
    ('tnf1', SimpleImputer(), ['fever']),
    ('tnf2', OrdinalEncoder(categories=[['Mild','Strong']]), ['cough']),
    ('tnf3', OneHotEncoder(sparse_output=False, drop='first'), ['gender','city'])  # use sparse=False on scikit-learn < 1.2
], remainder='passthrough')

In [34]:
# Now doing fit and transform on train data

X_train_tf = transformer.fit_transform(X_train)
X_test_tf = transformer.transform(X_test)

In [35]:
X_train_tf.shape

(80, 7)
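
Both routes give 7 columns, but the column order differs: the manual concatenation used `age, fever, gender/city, cough`, while `ColumnTransformer` emits the transformers in the order listed and puts the passthrough columns last. The sketch below checks this on a hypothetical `toy` frame; `sparse_threshold=0` is used here only to force a dense result regardless of the encoder's sparse setting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical rows shaped like the notebook's dataset
toy = pd.DataFrame({
    'age': [60, 27, 42, 31],
    'gender': ['Male', 'Male', 'Female', 'Female'],
    'fever': [103.0, np.nan, 101.0, 98.0],
    'cough': ['Mild', 'Strong', 'Mild', 'Strong'],
    'city': ['Kolkata', 'Delhi', 'Mumbai', 'Bangalore'],
})

# Manual route: transform each piece, then concatenate
age = toy[['age']].values
fever = SimpleImputer().fit_transform(toy[['fever']])
cough = OrdinalEncoder(categories=[['Mild', 'Strong']]).fit_transform(toy[['cough']])
gender_city = OneHotEncoder(drop='first').fit_transform(toy[['gender', 'city']]).toarray()
manual = np.concatenate((age, fever, gender_city, cough), axis=1)

# ColumnTransformer route
ct = ColumnTransformer(
    transformers=[
        ('tnf1', SimpleImputer(), ['fever']),
        ('tnf2', OrdinalEncoder(categories=[['Mild', 'Strong']]), ['cough']),
        ('tnf3', OneHotEncoder(drop='first'), ['gender', 'city']),
    ],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)
combined = ct.fit_transform(toy)

# ColumnTransformer order: fever, cough, gender, city x3, age
reordered = combined[:, [6, 0, 2, 3, 4, 5, 1]]
same_values = np.allclose(manual, reordered)
print(same_values)
```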

### Now using the transformer technique just before the train test split and after all the FE

In [36]:
X.head()

Unnamed: 0,age,gender,fever,cough,city
0,60,Male,103.0,Mild,Kolkata
1,27,Male,100.0,Mild,Delhi
2,42,Male,101.0,Mild,Delhi
3,31,Female,98.0,Mild,Kolkata
4,65,Female,101.0,Mild,Mumbai


In [37]:
X.shape

(100, 5)

In [38]:
y.head()

0     No
1    Yes
2     No
3     No
4     No
Name: has_covid, dtype: object

In [39]:
# Doing all the transformations of the independent features i.e. 'X'

imputer_columns = ['fever']
ordinal_columns = ['cough']
onehot_columns = ['gender','city']



impute_transformer = SimpleImputer()
order_transformer = OrdinalEncoder(categories=[['Mild','Strong']])
oh_transformer = OneHotEncoder(sparse_output=False, drop='first')  # use sparse=False on scikit-learn < 1.2


transformer = ColumnTransformer(transformers=
                                [
                                    ("Imputation", impute_transformer, imputer_columns),
                                    ("OrdinalEncoder", order_transformer, ordinal_columns),
                                    ("OneHotEncoder", oh_transformer, onehot_columns),
                                ], remainder='passthrough')


# Now doing the fit transform

X = transformer.fit_transform(X)

In [40]:
# Doing the label encoding on dependent feature i.e. 'y'

le = LabelEncoder()

y = le.fit_transform(y)

In [41]:
# Now doing the train test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape

((80, 7), (20, 7))

In [42]:
y_train.shape, y_test.shape

((80,), (20,))
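
One thing worth keeping in mind with this ordering: when `fit_transform` runs on the full `X` before the split, the imputer's mean is computed from rows that later land in the test set. The sketch below illustrates the difference on hypothetical fever readings, where the last two rows play the role of the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical fever readings; the last two rows act as the test set
fever = np.array([[98.0], [np.nan], [100.0], [102.0], [106.0], [np.nan]])
train, test = fever[:4], fever[4:]

# Mean learned from every row, including the future test rows
mean_full = SimpleImputer().fit(fever).statistics_[0]

# Mean learned from the training rows only
mean_train = SimpleImputer().fit(train).statistics_[0]

print(mean_full, mean_train)
```

On a dataset this small the gap is minor, but fitting on the training split only (as in the earlier `transformer.fit_transform(X_train)` cell) keeps the test rows out of the learned statistics.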

In [43]:
# Creating dataframe of the X_train dataset to see the columns

x_train_df = pd.DataFrame(X_train)
x_train_df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,101.0,0.0,0.0,0.0,0.0,1.0,81.0
1,100.0,0.0,0.0,0.0,1.0,0.0,5.0
2,100.0,0.0,0.0,0.0,1.0,0.0,19.0
3,100.0,0.0,1.0,1.0,0.0,0.0,27.0
4,103.0,0.0,0.0,1.0,0.0,0.0,73.0


- Here the first column is `fever`, the second is `cough`, the third is `gender`, the next three are `city`, and the last one is `age` (passed through by `remainder`).
- So the total is ***(1 + 1 + 1 + 3 + 1) = 7*** columns.