### Column Transformer
In this notebook, we apply different transformations to different columns, depending on what each column needs. "age" and "fever" are <b style = 'color:orange'>numeric columns</b>; "fever" has some missing values, which we impute. "gender", "cough", "city", and "has_covid" are <b style = 'color:red'>categorical columns</b>: we apply <b style = 'color:red'>One-Hot Encoding</b> to the "city" and "gender" columns, <b style = 'color:red'>Label encoding</b> to the target column "has_covid", and <b style = 'color:red'>Ordinal encoding</b> to the "cough" column.


In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

In [3]:
df = pd.read_csv('covid_toy.csv')
df.sample(5)

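| | age | gender | fever | cough | city | has_covid |
|---|---|---|---|---|---|---|
| 17 | 40 | Female | 98.0 | Strong | Delhi | No |
| 77 | 8 | Female | 101.0 | Mild | Kolkata | No |
| 27 | 33 | Female | 102.0 | Strong | Delhi | No |
| 2 | 42 | Male | 101.0 | Mild | Delhi | No |
| 15 | 70 | Male | 103.0 | Strong | Kolkata | Yes |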


In [4]:
# A DataFrame becomes a 2-D NumPy array when converted with df.values (or df.to_numpy()).
#df.values
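To make the DataFrame-to-array conversion concrete, here is a minimal sketch on a hypothetical toy frame (the `toy` name and its values are illustrative, not taken from `covid_toy.csv`). It assumes a pandas version that provides `to_numpy()`.

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns
toy = pd.DataFrame({"age": [40, 8, 33], "fever": [98.0, 101.0, 102.0]})

arr = toy.to_numpy()  # same result as toy.values
print(type(arr).__name__, arr.shape, arr.dtype)
```

Because the integer and float columns have to share one dtype, the resulting array is upcast to `float64`.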

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        100 non-null    int64  
 1   gender     100 non-null    object 
 2   fever      90 non-null     float64
 3   cough      100 non-null    object 
 4   city       100 non-null    object 
 5   has_covid  100 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 4.8+ KB


In [7]:
df.isnull().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64

In [8]:
df['cough'].value_counts()

Mild      62
Strong    38
Name: cough, dtype: int64

In [9]:
df['city'].nunique()

4

In [10]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.drop(columns = ['has_covid']),df['has_covid'],
                                                test_size=0.2,random_state=2)
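As a quick illustration of what the split returns, the sketch below runs `train_test_split` on a hypothetical 10-row frame (the `toy` frame and the `X_tr`/`X_te` names are made up for this example). With `test_size=0.2`, 20% of the rows go to the test set, and the feature and target pieces keep matching indices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for the real dataset
toy = pd.DataFrame({"age": range(10), "has_covid": ["No", "Yes"] * 5})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns=["has_covid"]),  # features
    toy["has_covid"],                 # target
    test_size=0.2,
    random_state=2,                   # fixed seed, so the split is reproducible
)
print(X_tr.shape, X_te.shape)
```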

In [11]:
X_train

| | age | gender | fever | cough | city |
|---|---|---|---|---|---|
| 35 | 82 | Female | 102.0 | Strong | Bangalore |
| 11 | 65 | Female | 98.0 | Mild | Mumbai |
| 84 | 69 | Female | 98.0 | Strong | Mumbai |
| 44 | 20 | Male | 102.0 | Strong | Delhi |
| 73 | 34 | Male | 98.0 | Strong | Kolkata |
| ... | ... | ... | ... | ... | ... |
| 43 | 22 | Female | 99.0 | Mild | Bangalore |
| 22 | 71 | Female | 98.0 | Strong | Kolkata |
| 72 | 83 | Female | 101.0 | Mild | Kolkata |
| 15 | 70 | Male | 103.0 | Strong | Kolkata |


### Without ColumnTransformer

We will transform the columns manually, one at a time, and then concatenate the results into a single array.

### "Fever" column

In [14]:
"""
SimpleImputer fills missing values according to its "strategy" parameter.
The default strategy is "mean". The available strategies are:
mean,
median,
most_frequent,
constant.
"""
si = SimpleImputer()
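The four strategies listed above can be compared on a small, hypothetical column (the values below are invented for illustration). This is only a sketch of how each strategy fills the same missing entry.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value at row index 2
col = np.array([[98.0], [100.0], [np.nan], [100.0], [104.0]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # Record the value that replaced the NaN
    results[strategy] = float(imputer.fit_transform(col)[2, 0])

# "constant" needs an explicit fill_value; 99.0 is an arbitrary choice here
constant_imputer = SimpleImputer(strategy="constant", fill_value=99.0)
results["constant"] = float(constant_imputer.fit_transform(col)[2, 0])

print(results)
```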

In [15]:
#X_train['fever'].shape

In [16]:
X_train_fever = np.round(si.fit_transform(X_train[['fever']]),2)

In [17]:
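# Note: fit_transform re-fits the imputer on the test set, so the missing values below are
# filled with the test-set mean (101.06). To avoid data leakage, fit on the training set only
# and call si.transform(X_test[['fever']]) here instead.
X_test_fever = np.round(si.fit_transform(X_test[['fever']]),2)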
X_test_fever

array([[104.  ],
       [101.  ],
       [101.06],
       [100.  ],
       [103.  ],
       [ 98.  ],
       [101.  ],
       [102.  ],
       [104.  ],
       [102.  ],
       [ 98.  ],
       [102.  ],
       [100.  ],
       [104.  ],
       [103.  ],
       [ 98.  ],
       [ 98.  ],
       [101.06],
       [ 98.  ],
       [103.  ]])
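A common convention is to learn imputation statistics from the training data only and reuse them on the test data. The sketch below shows that pattern on hypothetical arrays (`train` and `test` are invented here and are not the notebook's data).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical train/test columns, each with one missing value
train = np.array([[98.0], [100.0], [np.nan], [102.0]])
test = np.array([[104.0], [np.nan]])

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)  # learns the mean from train only
test_filled = imputer.transform(test)        # reuses the train mean, no re-fitting

print(imputer.statistics_)   # the learned per-column mean
print(test_filled.ravel())
```

Here the missing test value is filled with the training mean (100.0), not with a statistic computed from the test set.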

### "Cough" Column

In [18]:
oe = OrdinalEncoder(categories=[['Mild','Strong']])   # Mild -> 0, Strong -> 1
X_train_cough = oe.fit_transform(X_train[['cough']])
X_test_cough = oe.transform(X_test[['cough']])        # reuse the encoder fitted on the training set
X_test_cough

array([[0.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [0.]])
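The order of the list passed to `categories` decides which integer each label receives. A small sketch on a hypothetical column (the `toy` frame is invented for illustration) shows how reversing the list flips the codes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical cough column
toy = pd.DataFrame({"cough": ["Strong", "Mild", "Mild", "Strong"]})

# Mild -> 0, Strong -> 1 (same ordering as in the notebook)
enc = OrdinalEncoder(categories=[["Mild", "Strong"]])
codes = enc.fit_transform(toy[["cough"]]).ravel()

# Reversing the category list flips the mapping: Strong -> 0, Mild -> 1
enc_rev = OrdinalEncoder(categories=[["Strong", "Mild"]])
codes_rev = enc_rev.fit_transform(toy[["cough"]]).ravel()

print(codes, codes_rev)
```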

### "City" and "Gender" Columns

In [21]:
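# On scikit-learn >= 1.2, use sparse_output=False instead of sparse=False.
ohe = OneHotEncoder(drop = 'first',sparse = False)
X_train_gender_city = ohe.fit_transform(X_train[['gender','city']])
X_test_gender_city = ohe.transform(X_test[['gender','city']])   # reuse the categories learned from the training set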

In [22]:
X_test_gender_city

array([[0., 0., 1., 0.],
       [1., 1., 0., 0.],
       [1., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [1., 1., 0., 0.],
       [0., 1., 0., 0.],
       [1., 0., 1., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 1., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.]])

### "age" Column
Extract the "age" column from the train and test sets and convert it to NumPy arrays.


In [23]:
X_train_age = X_train[['age']].values   # 2-D array of shape (n_samples, 1)
X_test_age = X_test[['age']].values
#X_train_age

In [24]:
"""
np.concatenate requires all arrays to have the same shape, except along the
concatenation axis (here axis=1, i.e. the columns).
"""
X_train_transformed = np.concatenate((X_train_age,X_train_cough,X_train_fever,X_train_gender_city),axis = 1)
X_test_transformed = np.concatenate((X_test_age,X_test_cough,X_test_fever,X_test_gender_city),axis = 1)
X_test_transformed.shape

(20, 7)
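The shape rule for column-wise concatenation can be checked on small hypothetical arrays (the values are invented). Every array has three rows, and only the number of columns differs.

```python
import numpy as np

# Hypothetical per-column results, all 2-D with the same number of rows
age = np.array([[17], [15], [71]])                       # shape (3, 1)
cough = np.array([[0.0], [0.0], [1.0]])                  # shape (3, 1)
city = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])    # shape (3, 2)

combined = np.concatenate((age, cough, city), axis=1)    # stack side by side
print(combined.shape)  # (3, 4)
```

This is why each column is selected with double brackets (for example `X_train[['fever']]`) in the cells above: the result stays 2-D, so it can be concatenated along `axis=1`.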

In [25]:
X_test_transformed

array([[ 17.  ,   0.  , 104.  ,   0.  ,   0.  ,   1.  ,   0.  ],
       [ 15.  ,   0.  , 101.  ,   1.  ,   1.  ,   0.  ,   0.  ],
       [ 71.  ,   1.  , 101.06,   1.  ,   0.  ,   1.  ,   0.  ],
       [ 13.  ,   1.  , 100.  ,   0.  ,   0.  ,   1.  ,   0.  ],
       [ 69.  ,   0.  , 103.  ,   0.  ,   0.  ,   1.  ,   0.  ],
       [ 80.  ,   0.  ,  98.  ,   0.  ,   1.  ,   0.  ,   0.  ],
       [ 42.  ,   0.  , 101.  ,   1.  ,   1.  ,   0.  ,   0.  ],
       [ 33.  ,   1.  , 102.  ,   0.  ,   1.  ,   0.  ,   0.  ],
       [ 16.  ,   0.  , 104.  ,   1.  ,   0.  ,   1.  ,   0.  ],
       [ 64.  ,   0.  , 102.  ,   1.  ,   0.  ,   0.  ,   0.  ],
       [ 10.  ,   1.  ,  98.  ,   0.  ,   0.  ,   1.  ,   0.  ],
       [ 82.  ,   1.  , 102.  ,   0.  ,   0.  ,   1.  ,   0.  ],
       [ 80.  ,   0.  , 100.  ,   1.  ,   0.  ,   0.  ,   0.  ],
       [ 51.  ,   0.  , 104.  ,   1.  ,   0.  ,   0.  ,   0.  ],
       [ 60.  ,   0.  , 103.  ,   1.  ,   0.  ,   1.  ,   0.  ],
       [ 73.  ,   0.  ,  

### "has_covid" Column

In [37]:
le = LabelEncoder()
y_train_has_covid = le.fit_transform(y_train)
y_test_has_covid = le.transform(y_test)   # reuse the label mapping learned from y_train
y_test_has_covid

array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1])
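To check which integer corresponds to which label, `LabelEncoder` exposes `classes_` and `inverse_transform`. The sketch below uses hypothetical target arrays (`y_train_toy` and `y_test_toy` are invented for illustration).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical targets
y_train_toy = np.array(["No", "Yes", "No", "No"])
y_test_toy = np.array(["Yes", "No"])

le_demo = LabelEncoder()
y_train_enc = le_demo.fit_transform(y_train_toy)  # learns the sorted classes: No -> 0, Yes -> 1
y_test_enc = le_demo.transform(y_test_toy)        # applies the same mapping

print(list(le_demo.classes_))
print(list(le_demo.inverse_transform(y_test_enc)))  # back to the original labels
```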

# Using Column Transformer

The <b style = 'color:orange'>ColumnTransformer</b> is a class in the scikit-learn machine learning library that applies different data preparation transforms to different subsets of columns and concatenates the results into a single output.

In [26]:
from sklearn.compose import ColumnTransformer

In [27]:
"""
We pass two parameters when creating the ColumnTransformer object: "transformers" and "remainder".
"transformers" is a list of (name, transformer, columns) tuples specifying the transformer objects
to be applied to subsets of the data.

By default, only the columns specified in "transformers" are transformed and combined in the output,
and the non-specified columns are dropped. With remainder='passthrough', all remaining columns that
were not specified in "transformers" are passed through without any transformation. This subset of
columns is concatenated with the output of the transformers.
"""

transformer = ColumnTransformer(transformers=[
    ('tnf1',SimpleImputer(),['fever']),
    ('tnf2',OrdinalEncoder(categories = [['Mild','Strong']]),['cough']),
    ('tnf3',OneHotEncoder(drop = 'first',sparse = False),['gender','city'])   # use sparse_output=False on scikit-learn >= 1.2
],remainder='passthrough')

In [28]:
transformer.fit_transform(X_train).shape   # fit on the training set only

(80, 7)

In [29]:
transformer.transform(X_test).shape        # reuse the fitted transformers on the test set

(20, 7)
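To see what the ColumnTransformer produces, here is a minimal end-to-end sketch on hypothetical frames with the same column names (the rows and the `ct`, `train`, `test` names are invented). It leaves `OneHotEncoder` on its default sparse output and sets `sparse_threshold=0` on the ColumnTransformer to force a dense result, so it does not depend on the `sparse` / `sparse_output` parameter name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical training and test rows
train = pd.DataFrame({
    "age": [40, 8, 33, 70],
    "gender": ["Female", "Male", "Female", "Male"],
    "fever": [98.0, np.nan, 102.0, 100.0],
    "cough": ["Strong", "Mild", "Mild", "Strong"],
    "city": ["Delhi", "Kolkata", "Mumbai", "Bangalore"],
})
test = pd.DataFrame({
    "age": [25], "gender": ["Male"], "fever": [np.nan],
    "cough": ["Strong"], "city": ["Kolkata"],
})

ct = ColumnTransformer(
    transformers=[
        ("impute_fever", SimpleImputer(), ["fever"]),
        ("encode_cough", OrdinalEncoder(categories=[["Mild", "Strong"]]), ["cough"]),
        ("onehot", OneHotEncoder(drop="first"), ["gender", "city"]),
    ],
    remainder="passthrough",  # keep "age" untouched
    sparse_threshold=0,       # always return a dense array
)

train_out = ct.fit_transform(train)  # statistics and categories come from train only
test_out = ct.transform(test)        # the same fitted transformers are reused

print(train_out.shape)
print(test_out)
```

The output columns follow the order of the `transformers` list, with the passthrough columns appended last: fever, cough, gender_Male, the three city columns, then age. This differs from the manual concatenation earlier, where "age" came first.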