# Scikit-Transformers: BoolColumnTransformer

## Imports 

Import the `warnings` module and suppress all warnings for this notebook.

In [1]:
import warnings

warnings.filterwarnings("ignore")

Import the data libraries, pandas and NumPy.

In [2]:
import pandas as pd
import numpy as np

Uncomment the following cell to install the necessary packages.

In [3]:
#!pip install scikit-transformers

Import the `get_titanic` helper from scikit-transformers.

In [4]:
try:
    from sktransf.utils import get_titanic
except ImportError as e:
    # The package is missing: show the error and how to install it.
    print(e)
    print("Please install the package using the following command")
    print("pip install scikit-transformers")


In [5]:
from sktransf.transformer import BoolColumnTransformer

## Data

Get the data from the [Kaggle](https://www.kaggle.com/c/titanic/data) Titanic dataset.

In [6]:
df, _ = get_titanic()

Display the first few rows of the data.

In [7]:
df.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
0,1,3,22.0,1,0,7.25
1,2,1,38.0,1,0,71.2833
2,3,3,26.0,0,0,7.925
3,4,1,35.0,1,0,53.1
4,5,3,35.0,0,0,8.05


Add a dummy column whose values are drawn at random from "A" and "B".

In [8]:
df["dummy_column"] = np.random.choice(["A", "B"], size=df.shape[0])

Display a random sample of 10 rows from the updated DataFrame.

In [9]:
df.sample(10)

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,dummy_column
336,337,1,29.0,1,0,66.6,B
699,700,3,42.0,0,0,7.65,B
168,169,1,,0,0,25.925,B
67,68,3,19.0,0,0,8.1583,A
733,734,2,23.0,0,0,13.0,A
224,225,1,38.0,1,0,90.0,B
478,479,3,22.0,0,0,7.5208,A
731,732,3,11.0,0,0,18.7875,B
606,607,3,30.0,0,0,7.8958,A
239,240,2,33.0,0,0,12.275,A
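
Because `np.random.choice` is called without a seed, the dummy column differs on every run. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator and an arbitrary seed of 42; `demo_df` is a small stand-in frame for illustration, not the Titanic data:

```python
import numpy as np
import pandas as pd

# Stand-in frame for illustration; in the notebook this would be `df`.
demo_df = pd.DataFrame({"Pclass": [3, 1, 3, 1, 3]})

# A seeded generator makes the random column reproducible across runs.
rng = np.random.default_rng(42)  # the seed value is arbitrary
demo_df["dummy_column"] = rng.choice(["A", "B"], size=demo_df.shape[0])

print(demo_df)
```

Re-running the cell with the same seed yields the same sequence of "A" and "B" values.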


## Using BoolColumnTransformer

Fit and transform the data using the BoolColumnTransformer. In the result, the two-valued `dummy_column` is encoded as 0 and 1.

In [10]:
clean_df = BoolColumnTransformer().fit_transform(df)
clean_df

array([[1, 3, 22.0, ..., 0, 7.25, 0],
       [2, 1, 38.0, ..., 0, 71.2833, 0],
       [3, 3, 26.0, ..., 0, 7.925, 1],
       ...,
       [889, 3, nan, ..., 2, 23.45, 1],
       [890, 1, 26.0, ..., 0, 30.0, 0],
       [891, 3, 32.0, ..., 0, 7.75, 1]], dtype=object)
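
In the output above, the two-valued `dummy_column` has been replaced by 0s and 1s while the numeric columns are unchanged. The sketch below reproduces that kind of encoding with plain pandas, for intuition only. `encode_two_valued` is a hypothetical helper, not the library's implementation, and the choice of which value maps to 0 (the first in sorted order) is an assumption that may differ from what BoolColumnTransformer does.

```python
import pandas as pd


def encode_two_valued(frame: pd.DataFrame) -> pd.DataFrame:
    """Map each non-numeric column with exactly two distinct values to 0/1."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # leave numeric columns untouched
        values = sorted(out[col].dropna().unique())
        if len(values) == 2:
            # Assumption: first value in sorted order -> 0, second -> 1.
            out[col] = out[col].map({values[0]: 0, values[1]: 1})
    return out


demo = pd.DataFrame(
    {"Fare": [7.25, 71.2833, 7.925], "dummy_column": ["A", "B", "A"]}
)
encoded = encode_two_valued(demo)
print(encoded)
```

The helper works on a copy, so the input frame keeps its original "A"/"B" values.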

By default, `clean_df` is a NumPy `ndarray`.
To get a DataFrame back instead, pass `force_df_out=True` to the transformer:

In [11]:
clean_df = BoolColumnTransformer(force_df_out=True).fit_transform(df)
clean_df

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,dummy_column
0,1,3,22.0,1,0,7.2500,0
1,2,1,38.0,1,0,71.2833,0
2,3,3,26.0,0,0,7.9250,1
3,4,1,35.0,1,0,53.1000,1
4,5,3,35.0,0,0,8.0500,1
...,...,...,...,...,...,...,...
886,887,2,27.0,0,0,13.0000,1
887,888,1,19.0,0,0,30.0000,0
888,889,3,,1,2,23.4500,1
889,890,1,26.0,0,0,30.0000,0


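Check the data types of the new DataFrame. Note that `dummy_column` keeps the `object` dtype even though its values are now 0 and 1.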

In [12]:
clean_df.dtypes

PassengerId       int64
Pclass            int64
Age             float64
SibSp             int64
Parch             int64
Fare            float64
dummy_column     object
dtype: object
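
If a numeric dtype is needed downstream, one option is an explicit cast with pandas' `astype`. This is a sketch on a stand-in Series that mimics the transformed `dummy_column`, not a step the library requires:

```python
import pandas as pd

# Stand-in for clean_df["dummy_column"]: integers stored with object dtype.
col = pd.Series([0, 0, 1, 1, 1], dtype="object", name="dummy_column")
print(col.dtype)  # object

# Cast to an integer dtype; this raises if a value cannot be converted
# to an integer (for example a leftover string such as "A").
col_int = col.astype("int64")
print(col_int.dtype)  # int64
```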

Check the number of unique values in each column of the new DataFrame.

In [13]:
clean_df.nunique()

PassengerId     891
Pclass            3
Age              88
SibSp             7
Parch             7
Fare            248
dummy_column      2
dtype: int64
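
The counts above show that `dummy_column` is the only column with exactly two distinct values. A short sketch of selecting such columns programmatically with `nunique`, using a small stand-in frame rather than the Titanic data:

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 2],
        "Fare": [7.25, 71.2833, 7.925, 13.0],
        "dummy_column": ["A", "B", "B", "A"],
    }
)

# nunique() counts distinct non-null values per column.
counts = demo.nunique()

# Keep the names of columns with exactly two distinct values.
two_valued = counts[counts == 2].index.tolist()
print(two_valued)  # ['dummy_column']
```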