# Scikit-Transformers: BoolColumnTransformer

## Imports 

Import the data libraries

In [1]:
import pandas as pd
import numpy as np



Import the scikit-learn libraries

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.impute import KNNImputer

Uncomment the following cell to install the necessary packages.

In [3]:
#!pip install scikit-transformers

Import scikit-transformers

In [4]:
try:
    from sktransf import get_titanic
except Exception as e:
    # Report the failure, then fall back to the private module path.
    print(e)
    print("Please install the package using the following command")
    print("pip install scikit-transformers")
    from sktransf._get_titanic import get_titanic

No module named 'sktransf.logger'
Please install the package using the following command
pip install scikit-transformers


In [6]:
from sktransf import BoolColumnTransformer

## Data

Get the data from the [Kaggle](https://www.kaggle.com/c/titanic/data) Titanic dataset.

In [14]:
df, _ = get_titanic()


Display the first few rows of the data.

In [15]:
df.head()

|   | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 |
| 4 | 3 | 35.0 | 0 | 0 | 8.05 |


Add a dummy column to the data.

In [16]:
df["dummy_column"] = np.random.choice(["A", "B"], size=df.shape[0])
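
Because `np.random.choice` is called without a seed here, the A/B assignment changes on every run. A minimal sketch of a reproducible variant follows. The seed value 42 and the small stand-in frame `demo_df` are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Stand-in frame; in the notebook this would be the Titanic df.
demo_df = pd.DataFrame({"Pclass": [3, 1, 3, 1, 3]})

# A seeded Generator makes the random labels repeatable across runs.
rng = np.random.default_rng(42)  # 42 is an arbitrary, assumed seed
demo_df["dummy_column"] = rng.choice(["A", "B"], size=demo_df.shape[0])

# Re-creating the generator with the same seed yields the same labels.
rng_again = np.random.default_rng(42)
same_labels = rng_again.choice(["A", "B"], size=demo_df.shape[0])
```

With a fixed seed, the sampled rows and the encoded values shown later would stay the same between runs.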

Display a random sample of `df` with the new column.

In [18]:
df.sample(10)

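|   | Pclass | Age | SibSp | Parch | Fare | dummy_column |
|---|---|---|---|---|---|---|
| 456 | 1 | 65.0 | 0 | 0 | 26.55 | B |
| 84 | 2 | 17.0 | 0 | 0 | 10.5 | B |
| 225 | 3 | 22.0 | 0 | 0 | 9.35 | A |
| 578 | 3 | NaN | 1 | 0 | 14.4583 | B |
| 579 | 3 | 32.0 | 0 | 0 | 7.925 | A |
| 634 | 3 | 9.0 | 3 | 2 | 27.9 | A |
| 746 | 3 | 16.0 | 1 | 1 | 20.25 | B |
| 417 | 2 | 18.0 | 0 | 2 | 13.0 | A |
| 83 | 1 | 28.0 | 0 | 0 | 47.1 | B |
| 155 | 1 | 51.0 | 0 | 1 | 61.3792 | B |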


## Using BoolColumnTransformer

Fit and transform the data with `BoolColumnTransformer`.

In [19]:
clean_df = BoolColumnTransformer().fit_transform(df)
clean_df



array([[ 3.    , 22.    ,  1.    ,  0.    ,  7.25  ,  0.    ],
       [ 1.    , 38.    ,  1.    ,  0.    , 71.2833,  0.    ],
       [ 3.    , 26.    ,  0.    ,  0.    ,  7.925 ,  0.    ],
       ...,
       [ 3.    ,     nan,  1.    ,  2.    , 23.45  ,  1.    ],
       [ 1.    , 26.    ,  0.    ,  0.    , 30.    ,  0.    ],
       [ 3.    , 32.    ,  0.    ,  0.    ,  7.75  ,  1.    ]])

By default, `clean_df` is a NumPy `ndarray`.
To get a DataFrame back instead, pass `force_df_out=True`:

In [20]:
clean_df = BoolColumnTransformer(force_df_out=True).fit_transform(df)
clean_df

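|   | Pclass | Age | SibSp | Parch | Fare | dummy_column |
|---|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 |
| 2 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 |
| 4 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 886 | 2 | 27.0 | 0 | 0 | 13.0000 | 0 |
| 887 | 1 | 19.0 | 0 | 0 | 30.0000 | 0 |
| 888 | 3 | NaN | 1 | 2 | 23.4500 | 1 |
| 889 | 1 | 26.0 | 0 | 0 | 30.0000 | 0 |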
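
If only the array output is at hand, a DataFrame can also be rebuilt manually with plain pandas. The sketch below does not use scikit-transformers; the stand-in array `arr` and the column list are assumptions chosen to mirror a few of the columns above.

```python
import numpy as np
import pandas as pd

# Stand-in for the ndarray returned by fit_transform (values are illustrative).
arr = np.array([[3.0, 22.0, 0.0],
                [1.0, 38.0, 1.0]])
columns = ["Pclass", "Age", "dummy_column"]  # assumed column order

# Wrapping the array restores the labels, but every column is float64.
rebuilt = pd.DataFrame(arr, columns=columns)

# Cast the integer-like columns back explicitly.
rebuilt = rebuilt.astype({"Pclass": "int64", "dummy_column": "int64"})
```

The explicit `astype` step is needed because a homogeneous float array carries no per-column type information.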


Check the data types of `clean_df`.

In [23]:
clean_df.dtypes

Pclass            int64
Age             float64
SibSp             int64
Parch             int64
Fare            float64
dummy_column      int64
dtype: object
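
With every column numeric, a frame like this can be passed to the scikit-learn classes imported at the top of the notebook. The following is a sketch under stated assumptions, not a step from the original: it uses synthetic stand-in data, a hypothetical binary target `y`, and an arbitrary choice of `KNNImputer`, `StandardScaler` and `LogisticRegression` as pipeline steps.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for clean_df: all numeric, with some missing ages.
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Age": rng.uniform(1, 80, size=n),
    "Fare": rng.uniform(5, 100, size=n),
    "dummy_column": rng.integers(0, 2, size=n),
})
X.loc[X.index % 10 == 0, "Age"] = np.nan

# Hypothetical target: 1 when the fare is above the median, else 0.
y = (X["Fare"] > X["Fare"].median()).astype(int)

pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),  # fill the missing ages
    ("scale", StandardScaler()),            # standardize each column
    ("model", LogisticRegression()),        # simple baseline classifier
])
pipe.fit(X, y)
preds = pipe.predict(X)
```

The imputation step matters here because `Age` contains missing values, which `LogisticRegression` does not accept.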

Check the number of unique values in each column of the original `df`.

In [24]:
df.nunique()

Pclass            3
Age              88
SibSp             7
Parch             7
Fare            248
dummy_column      2
dtype: int64
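
The outputs above are consistent with a transformer that rewrites a column holding exactly two distinct values as 0/1 and leaves other columns alone. A minimal sketch of that idea follows. It is not the scikit-transformers implementation: the class name `TwoValueEncoder` is hypothetical, and mapping the values in sorted order is an assumption.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TwoValueEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: encode two-valued columns as 0/1."""

    def fit(self, X, y=None):
        # Learn a value -> {0, 1} mapping for each column with two distinct values.
        self.mappings_ = {}
        for col in X.columns:
            values = sorted(X[col].dropna().unique())
            if len(values) == 2:
                self.mappings_[col] = {values[0]: 0, values[1]: 1}
        return self

    def transform(self, X):
        # Work on a copy so the caller's frame is left untouched.
        X = X.copy()
        for col, mapping in self.mappings_.items():
            X[col] = X[col].map(mapping)
        return X


demo = pd.DataFrame({
    "Pclass": [3, 1, 2, 3],               # three distinct values: unchanged
    "dummy_column": ["A", "B", "B", "A"],  # two distinct values: encoded
})
encoded = TwoValueEncoder().fit_transform(demo)
```

Learning the mapping in `fit` and applying it in `transform` keeps the encoding stable when the same fitted object is later applied to new data.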