### Transformer Description

The custom transformer `EncoderByCol` accepts a dictionary as a parameter. The dictionary's keys are encoding methods (`'onehot'` or `'label'`), and each key's value is the list of column names to be encoded with that method.
Each listed column is transformed by the encoder for its method, and the DataFrame is returned with those columns encoded.

For example, given

    col = {'label': ['Sex'], 'onehot': ['Embarked','Pclass']}

the feature 'Sex' will be transformed by a `LabelEncoder`, while the features 'Embarked' and 'Pclass' will be transformed by a `OneHotEncoder`.
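
To make the mapping concrete, the dictionary can be read "inside out" as one encoding method per column. The snippet below is only an illustrative sketch; `method_by_column` is a hypothetical name and is not part of the transformer.

```python
# The example dictionary from the text: encoding method -> list of columns.
col = {'label': ['Sex'], 'onehot': ['Embarked', 'Pclass']}

# Invert it into column -> encoding method (hypothetical helper name).
method_by_column = {
    name: method
    for method, names in col.items()
    for name in names
}

print(method_by_column)
# {'Sex': 'label', 'Embarked': 'onehot', 'Pclass': 'onehot'}
```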

The transformer supports two encodings: label encoding and one-hot encoding.

The transformer also handles missing data. Missing values that are present before the transformation remain missing afterwards, and they are excluded from the encoding process, i.e., encoding is performed on only the non-missing values.
Internally, all missing values in the DataFrame are first replaced by the string `'NaN'` to prevent the encoders from raising an error. The encoding is then performed, and the transformed DataFrame is returned with those placeholders converted back to `np.nan`.
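
The placeholder approach described above can be sketched on a single toy column. This is a minimal illustration under assumptions, not the transformer's actual code: the Series `sex` and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with one missing value (illustrative data only).
sex = pd.Series(['male', 'female', np.nan, 'female'], name='Sex')

# Step 1: replace missing values with the placeholder string 'NaN'.
filled = sex.fillna('NaN')
present = filled != 'NaN'

# Step 2: fit and apply the encoder on the non-missing values only.
encoder = LabelEncoder().fit(filled[present])
encoded = pd.Series(np.nan, index=sex.index, name=sex.name)
encoded[present] = encoder.transform(filled[present])

# Step 3: positions that held the placeholder are still np.nan.
print(encoded.tolist())  # [1.0, 0.0, nan, 0.0]
```

One consequence of using a string placeholder is that a column which legitimately contains the text `'NaN'` would be treated as missing.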

The transformer returns the full DataFrame passed into it, with the specified features encoded as required. The other features remain unchanged.
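
As a rough sketch of what "other features remain unchanged" means for one-hot encoding, the toy example below encodes one column and leaves the rest alone. It is a simplified variant that assumes a boolean `notna()` mask instead of the `'NaN'` placeholder, and the DataFrame `df` is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame: 'Fare' should pass through untouched, 'Embarked' is encoded.
df = pd.DataFrame({'Fare': [7.25, 80.0, 8.05],
                   'Embarked': ['S', np.nan, 'C']})

# Encode only the rows where 'Embarked' is present.
present = df['Embarked'].notna()
encoder = OneHotEncoder()
dense = encoder.fit_transform(df.loc[present, ['Embarked']]).toarray()

# Name the new columns after the learned categories.
columns = [f'Embarked_{c}' for c in encoder.categories_[0]]
encoded = pd.DataFrame(dense, index=df.index[present], columns=columns)

# Rows missing from `encoded` become NaN in the new columns after concat.
result = pd.concat([df.drop(columns='Embarked'), encoded], axis=1)
print(result)
```

The row whose 'Embarked' value was missing ends up with `NaN` in every generated column, while 'Fare' keeps its original values.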

In [1]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

class EncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        # Dictionary with the encoding methods ('onehot' or 'label') as its keys
        # and lists of the DataFrame columns to encode as its values.
        self.col = col
        # List of the column names to be encoded.
        self.col_list = []
        # Dictionary storing the encoder for each column.
        self.enc_dic = {}

        for enc, attrs in self.col.items():
            for attr in attrs:
                if enc == 'onehot':
                    # Keep the encoder's default sparse output; it is densified
                    # in transform(). This avoids the `sparse`/`sparse_output`
                    # keyword, whose name depends on the scikit-learn version.
                    self.enc_dic[attr] = OneHotEncoder()
                elif enc == 'label':
                    self.enc_dic[attr] = LabelEncoder()
                else:
                    # Unknown encoding methods are ignored.
                    continue
                self.col_list.append(attr)

    def fit(self, X, y=None):
        # Fill missing values with the string 'NaN'. fillna returns a new
        # DataFrame, so the caller's data is not modified.
        X = X.fillna('NaN')
        for attr in self.col_list:
            not_missing = X[attr] != 'NaN'
            if isinstance(self.enc_dic[attr], OneHotEncoder):
                # OneHotEncoder expects 2D input, hence the list [attr].
                a = X.loc[not_missing, [attr]]
            else:
                # LabelEncoder expects 1D input.
                a = X.loc[not_missing, attr]
            # Fit the encoder on the non-missing values only.
            self.enc_dic[attr].fit(a)
        return self

    def transform(self, X, y=None):
        # Fill missing values with the string 'NaN' (on a new DataFrame).
        X = X.fillna('NaN')
        for attr in self.col_list:
            not_missing = X[attr] != 'NaN'
            if isinstance(self.enc_dic[attr], OneHotEncoder):
                # Transform only the values that are not 'NaN'.
                a = X.loc[not_missing, [attr]]
                one_hot_encoder = self.enc_dic[attr]
                encoded_array = one_hot_encoder.transform(a).toarray()
                # Categories learned by the encoder for this column.
                cat = list(one_hot_encoder.categories_[0])
                # Build a DataFrame whose column names match the encoding.
                encoded_df = pd.DataFrame(data=encoded_array, index=a.index,
                                          columns=[f'{attr}_{i}' for i in cat])
                # Drop the unencoded column and append the one-hot columns;
                # rows that were missing become NaN in the new columns.
                X = pd.concat([X.drop([attr], axis=1), encoded_df], axis=1)
            else:
                # Transform only the values that are not 'NaN' and write
                # them back, leaving the 'NaN' placeholders in place.
                a = X.loc[not_missing, attr]
                b = X[attr].astype(object)
                b[not_missing] = self.enc_dic[attr].transform(a)
                X[attr] = b

        # Restore the missing values: from 'NaN' back to np.nan.
        for attr in X.columns:
            is_placeholder = X[attr] == 'NaN'
            if is_placeholder.any():
                X.loc[is_placeholder, attr] = np.nan
        # Let pandas re-infer numeric dtypes for the object columns,
        # then return the transformed DataFrame.
        return X.infer_objects()

### Demonstration
The well-known Titanic dataset is used to demonstrate how this custom transformer works. <br>
First, make the necessary imports, and then load the dataset.

In [18]:
import pandas as pd
import numpy as np

In [10]:
titanic = pd.read_csv('train.csv')

In [11]:
titanic = titanic.set_index('PassengerId')
titanic = titanic.drop(columns=['Name', 'Ticket', 'Cabin'])
titanic.head()

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |


Check the number of missing values per column:

In [12]:
titanic.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

In [14]:
# View the rows with missing values in the 'Embarked' column.
titanic.loc[titanic['Embarked'].isnull()]

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 62 | 1 | 1 | female | 38.0 | 0 | 0 | 80.0 | NaN |
| 830 | 1 | 1 | female | 62.0 | 0 | 0 | 80.0 | NaN |


Next, encode the features 'Sex' and 'Pclass' with label encoding, and 'Embarked' with one-hot encoding:

In [15]:
col = {'label': ['Sex', 'Pclass'], 'onehot': ['Embarked']}

In [19]:
encoder = EncoderByCol(col)
titanic_enc = encoder.fit_transform(titanic)


In [20]:
titanic_enc.head()

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 2.0 | 1.0 | 22.0 | 1.0 | 0.0 | 7.25 | 0.0 | 0.0 | 1.0 |
| 2 | 1.0 | 0.0 | 0.0 | 38.0 | 1.0 | 0.0 | 71.2833 | 1.0 | 0.0 | 0.0 |
| 3 | 1.0 | 2.0 | 0.0 | 26.0 | 0.0 | 0.0 | 7.925 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 | 35.0 | 1.0 | 0.0 | 53.1 | 0.0 | 0.0 | 1.0 |
| 5 | 0.0 | 2.0 | 1.0 | 35.0 | 0.0 | 0.0 | 8.05 | 0.0 | 0.0 | 1.0 |


Check to confirm that the missing values remain unchanged:

In [21]:
titanic_enc.isnull().sum()

Survived        0
Pclass          0
Sex             0
Age           177
SibSp           0
Parch           0
Fare            0
Embarked_C      2
Embarked_Q      2
Embarked_S      2
dtype: int64

In [24]:
# viewing the rows where 'Embarked' had missing values.
titanic_enc.loc[[62,830]]

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 62 | 1.0 | 0.0 | 0.0 | 38.0 | 0.0 | 0.0 | 80.0 | NaN | NaN | NaN |
| 830 | 1.0 | 0.0 | 0.0 | 62.0 | 0.0 | 0.0 | 80.0 | NaN | NaN | NaN |
