This notebook was created using the tutorial videos from Data School. Video link: https://youtu.be/sCt4LVD5hPc
Resource links:
https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html#pandas.DataFrame.select_dtypes

### There are seven ways to select columns with ColumnTransformer

> column name

> integer position

> slice

> boolean mask

> regex pattern

> dtype to include

> dtype to exclude


In [2]:
import pandas as pd
import numpy as np

In [15]:
# import the data
df = pd.read_csv('http://bit.ly/kaggletrain')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [16]:
# check for missing values
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [19]:
# create a small dataframe with the first 6 rows (.loc[:5] includes label 5) and the columns 'Fare', 'Embarked', 'Sex', 'Age'

X = df.loc[:5,['Fare', 'Embarked', 'Sex', 'Age']]
X.head()

Unnamed: 0,Fare,Embarked,Sex,Age
0,7.25,S,male,22.0
1,71.2833,C,female,38.0
2,7.925,S,female,26.0
3,53.1,S,female,35.0
4,8.05,S,male,35.0
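Note that `.loc` slices by label and includes the end label, so `df.loc[:5]` returns six rows, while `X.head()` displays only the first five. A minimal sketch of the difference, using a small hypothetical frame (`demo`) instead of the Titanic data:

```python
import pandas as pd

# Hypothetical stand-in frame with the default integer index 0..9
demo = pd.DataFrame({"value": range(10)})

by_label = demo.loc[:5]      # label-based: labels 0 through 5, end included
by_position = demo.iloc[:5]  # position-based: positions 0 through 4, end excluded

print(len(by_label), len(by_position))  # 6 5
```

This is why the transformed arrays in the sections that follow have six rows.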


Our goal is to select the columns `'Embarked'` and `'Sex'` and one-hot encode them.

In [20]:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector
from sklearn.compose import make_column_transformer

In [21]:
one_hot = OneHotEncoder()

`remainder='drop'` is the default in `make_column_transformer`, so columns that are not selected are dropped from the output.
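To see what the `remainder` setting changes, here is a small sketch that compares the default with `remainder='passthrough'`. The frame `X_demo` is an illustrative stand-in built from the five rows displayed above, not the notebook's own `X`.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative stand-in data: the five rows shown by X.head()
X_demo = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Default: unselected columns (Fare, Age) are dropped
ct_drop = make_column_transformer((OneHotEncoder(), ["Embarked", "Sex"]))

# passthrough: unselected columns are appended to the output unchanged
ct_pass = make_column_transformer(
    (OneHotEncoder(), ["Embarked", "Sex"]),
    remainder="passthrough",
)

out_drop = ct_drop.fit_transform(X_demo)
out_pass = ct_pass.fit_transform(X_demo)

print(out_drop.shape)  # only the one-hot columns
print(out_pass.shape)  # one-hot columns plus Fare and Age
```

With this stand-in data, `Embarked` has two categories and `Sex` has two, so the dropped version has four columns and the passthrough version has six.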

## 1. Selecting by dataframe column names

In [22]:
col_trans = make_column_transformer((one_hot,['Embarked', 'Sex']))

col_trans.fit_transform(X)

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.]])

## 2. Select by integer position

This is useful if your input data is not a dataframe. The input to a column transformer doesn't have to be a dataframe; it could be a NumPy array.

Arrays don't have column names, so we specify column positions instead.

In [25]:
col_trans = make_column_transformer((one_hot, [1,2]))

col_trans.fit_transform(X)

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.]])
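The example above still passes a dataframe. As a sketch of the NumPy case, the same integer positions can be applied to an object array. `X_arr` is an illustrative stand-in that copies the five rows displayed earlier.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative stand-in: the same values as a NumPy object array (no column names)
X_arr = np.array(
    [
        [7.25, "S", "male", 22.0],
        [71.2833, "C", "female", 38.0],
        [7.925, "S", "female", 26.0],
        [53.1, "S", "female", 35.0],
        [8.05, "S", "male", 35.0],
    ],
    dtype=object,
)

# Positions 1 and 2 hold the Embarked and Sex values
ct = make_column_transformer((OneHotEncoder(), [1, 2]))
encoded = ct.fit_transform(X_arr)

print(encoded.shape)
```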

## 3. Select using slicing
This is useful when the dataframe has many columns, say 100, and the ones we want are contiguous: `slice(0, 100)` selects the first 100 columns.

In `slice(start, stop)`, `start` is inclusive and `stop` is exclusive.

In [27]:
col_trans = make_column_transformer((one_hot, slice(1,3)))

col_trans.fit_transform(X)

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.]])
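Positional slices follow ordinary Python slice semantics, so the inclusive start and exclusive stop can be checked on a plain list of column names. The `wide` list below is a made-up example of a 150-column frame.

```python
# Column order of the small frame used in this notebook
columns = ["Fare", "Embarked", "Sex", "Age"]

# slice(1, 3) covers positions 1 and 2; position 3 is excluded
selected = columns[slice(1, 3)]
print(selected)  # ['Embarked', 'Sex']

# Hypothetical wide frame with 150 column names
wide = [f"col_{i}" for i in range(150)]
first_100 = wide[slice(0, 100)]   # positions 0..99
first_101 = wide[slice(0, 101)]   # positions 0..100

print(len(first_100), len(first_101))  # 100 101
```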

## 4. Select using a boolean mask

In [29]:
col_trans = make_column_transformer((one_hot, [False,True,True,False]))

col_trans.fit_transform(X)

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.]])
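Writing the mask by hand gets error-prone as the number of columns grows. One possible approach, sketched here with an illustrative stand-in frame `X_demo`, is to derive the mask from the column index with `Index.isin`.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative stand-in data: the five rows shown by X.head()
X_demo = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Build one boolean per column, in column order
mask = X_demo.columns.isin(["Embarked", "Sex"]).tolist()
print(mask)  # [False, True, True, False]

ct = make_column_transformer((OneHotEncoder(), mask))
encoded = ct.fit_transform(X_demo)
print(encoded.shape)
```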

## 5. Select using regular expressions
This is useful when we have a lot of columns and a subset of them share a pattern in their names (for example, a common prefix or suffix). We can use a regular expression to match on that pattern.

Here we can use `make_column_selector(pattern='')`, which selects the columns whose names contain the pattern.

In [32]:
col_trans = make_column_transformer((one_hot, make_column_selector(pattern='E|S')))

col_trans.fit_transform(X)

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.]])
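A selector object can also be called directly on a dataframe to see which column names it picks, which helps when checking a pattern before using it in a transformer. The sketch below uses an illustrative stand-in frame `X_demo`; the anchored and lowercase patterns are extra examples, not part of the original tutorial.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Illustrative stand-in data: the five rows shown by X.head()
X_demo = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

# 'E|S' matches any name containing an uppercase E or S
loose = make_column_selector(pattern="E|S")(X_demo)

# An anchored pattern matches the two names exactly
anchored = make_column_selector(pattern="^(?:Embarked|Sex)$")(X_demo)

# Matching is case-sensitive: a lowercase 'e' appears in all four names
lowercase_e = make_column_selector(pattern="e")(X_demo)

print(loose)        # ['Embarked', 'Sex']
print(anchored)     # ['Embarked', 'Sex']
print(lowercase_e)  # ['Fare', 'Embarked', 'Sex', 'Age']
```

A short pattern such as `'E|S'` works here only because no other column name contains those letters, so a more specific pattern is usually safer on wider frames.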

## 6. Select using dtype_include

This is useful when we want to include columns of a certain dtype. For example, we can use this to one-hot encode all object-dtype columns.

In [33]:
col_trans = make_column_transformer((one_hot, make_column_selector(dtype_include=object)))

col_trans.fit_transform(X)

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.]])
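To check which columns a dtype-based selector returns, it can be called on the dataframe directly. In this sketch the stand-in frame `X_demo` creates its text columns with an explicit `object` dtype, which is an assumption made so the result does not depend on how a given pandas version infers string columns.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Illustrative stand-in data; text columns are forced to object dtype
X_demo = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": pd.Series(["S", "C", "S", "S", "S"], dtype=object),
    "Sex": pd.Series(["male", "female", "female", "female", "male"], dtype=object),
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

object_cols = make_column_selector(dtype_include=object)(X_demo)
number_cols = make_column_selector(dtype_include="number")(X_demo)

print(object_cols)  # ['Embarked', 'Sex']
print(number_cols)  # ['Fare', 'Age']
```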

## 7. Select using dtype_exclude

This is useful when we want to exclude columns of a certain dtype. For example, we can use this to one-hot encode all columns except the numerical ones.

In [36]:
col_trans = make_column_transformer((one_hot, make_column_selector(dtype_exclude='number')))

col_trans.fit_transform(X)

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.]])
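The two dtype selectors can be combined so that each column goes to a suitable transformer. The following is a sketch only: `StandardScaler` for the numeric columns is an illustrative choice that does not come from the tutorial, and `X_demo` is a stand-in for the notebook's `X`.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative stand-in data: the five rows shown by X.head()
X_demo = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

ct = make_column_transformer(
    # everything that is not numeric gets one-hot encoded
    (OneHotEncoder(), make_column_selector(dtype_exclude="number")),
    # numeric columns are scaled (illustrative choice)
    (StandardScaler(), make_column_selector(dtype_include="number")),
)

out = ct.fit_transform(X_demo)
print(out.shape)  # four one-hot columns plus two scaled numeric columns
```

Because the selectors are evaluated when the transformer is fitted, the same definition keeps working if columns are added to the frame later.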