In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('D:\\Study\\Data-Science\\Machine Learning\\Practice\\CampusX\\Feature Engineering\\Datasets\\covid_toy.csv')

In [3]:
df.sample(10)

Unnamed: 0,age,gender,fever,cough,city,has_covid
43,22,Female,99.0,Mild,Bangalore,Yes
49,44,Male,104.0,Mild,Mumbai,No
29,34,Female,,Strong,Mumbai,Yes
10,75,Female,,Mild,Delhi,No
33,26,Female,98.0,Mild,Kolkata,No
32,34,Female,101.0,Strong,Delhi,Yes
94,79,Male,,Strong,Kolkata,Yes
2,42,Male,101.0,Mild,Delhi,No
55,81,Female,101.0,Mild,Mumbai,Yes
58,23,Male,98.0,Strong,Mumbai,Yes


We will apply:

        One-hot encoding -> gender, city

        Ordinal encoding -> cough

        SimpleImputer (mean) -> fever

        Label encoding -> has_covid

### Train/test split

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
X = df.iloc[:,:-1]
X

Unnamed: 0,age,gender,fever,cough,city
0,60,Male,103.0,Mild,Kolkata
1,27,Male,100.0,Mild,Delhi
2,42,Male,101.0,Mild,Delhi
3,31,Female,98.0,Mild,Kolkata
4,65,Female,101.0,Mild,Mumbai
...,...,...,...,...,...
95,12,Female,104.0,Mild,Bangalore
96,51,Female,101.0,Strong,Kolkata
97,20,Female,101.0,Mild,Bangalore
98,5,Female,98.0,Strong,Mumbai


In [6]:
y = df['has_covid']
y

0      No
1     Yes
2      No
3      No
4      No
     ... 
95     No
96    Yes
97     No
98     No
99    Yes
Name: has_covid, Length: 100, dtype: object

### Manual approach to encode and impute features

In [7]:
df['fever'].isnull().sum()

10

### Why can't we encode and impute before splitting the data? Why must it happen after the split?

Splitting the data before imputing and encoding avoids data leakage and keeps the model's evaluation a realistic estimate of its performance on unseen data. The main points are:

Data Leakage: Data leakage refers to a situation where information from the testing set unintentionally influences the training process. It can lead to overly optimistic model evaluation because the model has "seen" some of the test data during preprocessing. This can result in models that do not generalize well to new, unseen data.

Training and Testing Sets Must Be Independent: The fundamental principle of machine learning is that the model should learn from one set of data (the training set) and be evaluated on a separate, independent set of data (the testing set). This separation ensures that the model's performance assessment is not biased and that it can handle new, unseen data.

Imputing and Encoding Based on the Training Data: When you perform imputation (e.g., filling missing values) and encoding (e.g., one-hot encoding) on the entire dataset before splitting, the preprocessing steps consider information from both the training and testing data. This leads to two problems:

a. Information Leakage: Imputing and encoding with the entire dataset can introduce information from the testing set into the training set, violating the independence principle. For example, if you use the mean for imputation, the mean value computed on the entire dataset will include information from the testing set.

b. Optimistic Evaluation: If you perform imputation and encoding on the entire dataset before splitting, your model might perform well on the test set due to the information it "leaked" during preprocessing. However, this doesn't reflect how well the model generalizes to truly unseen data.

Detecting Overfitting: Leakage inflates the test score, so a model that has fit noise in the training data instead of the underlying patterns can go unnoticed. Keeping the test data out of every fitting step keeps the test score an honest check for overfitting.

By splitting the data first, then fitting the imputer and encoders on the training set only and applying those fitted transformers to the test set, you ensure that nothing learned from the test set influences training. The test score is then a more realistic estimate of how the model will perform on new, unseen data.
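The leakage described above can be seen in a small sketch. The fever readings here are hypothetical toy values, not rows from covid_toy.csv; the point is only that a mean computed over all rows differs from the mean learned from the training rows alone.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy fever readings; np.nan marks a missing value.
train_fever = np.array([[98.0], [100.0], [np.nan], [102.0]])
test_fever = np.array([[104.0], [np.nan]])

# Leaky: the mean is computed over training and test rows together.
leaky_mean = np.nanmean(np.vstack([train_fever, test_fever]))

# Leak-free: learn the mean from the training rows only,
# then reuse that same value to fill the gaps in the test rows.
imputer = SimpleImputer(strategy="mean")
imputer.fit(train_fever)
train_mean = imputer.statistics_[0]
test_imputed = imputer.transform(test_fever)

print(leaky_mean)            # 101.0
print(train_mean)            # 100.0
print(test_imputed.ravel())  # [104. 100.]
```

The missing test value is filled with 100.0, the training mean, rather than 101.0, which would have been pulled upward by the test reading of 104.0.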

In [8]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)

In [9]:
X_train.shape

(80, 5)

In [10]:
X_test.shape

(20, 5)

#### Applying SimpleImputer to the fever column

In [11]:
from sklearn.impute import SimpleImputer

In [12]:
# With no arguments, SimpleImputer fills missing values with the column mean
imputer = SimpleImputer()

In [13]:
X_train_fever = imputer.fit_transform(X_train[['fever']])
X_train_fever.shape

(80, 1)

In [14]:
X_test_fever = imputer.transform(X_test[['fever']])
X_test_fever.shape

(20, 1)

#### Adding One Hot Encoding to Gender and City columns

In [15]:
from sklearn.preprocessing import OneHotEncoder

In [16]:
X_train[['gender']].value_counts()

gender
Female    48
Male      32
dtype: int64

In [17]:
X_train[['city']].value_counts()

city     
Bangalore    25
Kolkata      23
Delhi        19
Mumbai       13
dtype: int64

In [18]:
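# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`, and `sparse` was removed in 1.4;
# on those versions use OneHotEncoder(drop='first', sparse_output=False) instead.
ohe = OneHotEncoder(drop='first',sparse=False)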

X_train_gender_city = ohe.fit_transform(X_train[['gender','city']])

X_train_gender_city.shape

(80, 4)

In [19]:
X_test_gender_city = ohe.transform(X_test[['gender','city']])

X_test_gender_city.shape

(20, 4)

#### Ordinal Encoding on Cough column

In [20]:
X_train['cough'].value_counts()

Mild      48
Strong    32
Name: cough, dtype: int64

In [21]:
from sklearn.preprocessing import OrdinalEncoder

In [22]:
order = ['Mild','Strong']
oe = OrdinalEncoder(categories=[order])

In [23]:
X_train_cough = oe.fit_transform(X_train[['cough']])
X_train_cough.shape

(80, 1)

In [24]:
X_test_cough = oe.transform(X_test[['cough']])
X_test_cough.shape

(20, 1)
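The effect of the explicit `order` list can be checked on a few hypothetical values: each category is replaced by its position in the list, so `Mild` maps to 0 and `Strong` to 1. The variable names below are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical cough values standing in for X_train[['cough']].
toy_cough = pd.DataFrame({"cough": ["Strong", "Mild", "Strong"]})

# The position in the list defines the code: Mild -> 0, Strong -> 1.
oe_demo = OrdinalEncoder(categories=[["Mild", "Strong"]])
codes = oe_demo.fit_transform(toy_cough)

print(codes.ravel())                       # [1. 0. 1.]
print(oe_demo.inverse_transform([[0.0]]))  # maps the code 0 back to 'Mild'
```

Without the `categories` argument the encoder would sort the categories alphabetically, which happens to give the same order here but would not in general.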

#### Label Encoding on Target column

In [25]:
y_train.value_counts()

No     49
Yes    31
Name: has_covid, dtype: int64

In [26]:
from sklearn.preprocessing import LabelEncoder

In [27]:
le = LabelEncoder()

In [28]:
y_train = le.fit_transform(y_train)
y_train.shape

(80,)

In [29]:
y_test = le.transform(y_test)
y_test.shape

(20,)
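To confirm which integer each label received, the fitted encoder's `classes_` attribute can be inspected. A small sketch with hypothetical labels (the names are illustrative, not from the notebook):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values standing in for y_train.
le_demo = LabelEncoder()
encoded_labels = le_demo.fit_transform(["No", "Yes", "Yes", "No"])

# classes_ is sorted; the index of each class is its integer code.
print(list(le_demo.classes_))                   # ['No', 'Yes']
print(encoded_labels.tolist())                  # [0, 1, 1, 0]
print(list(le_demo.inverse_transform([1, 0])))  # ['Yes', 'No']
```

`inverse_transform` is useful later for turning predicted 0/1 values back into the original `No` / `Yes` labels.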

In [30]:
# age needs no preprocessing, so select it as-is (kept 2-D for concatenation)
X_train_age = X_train[['age']].values

# same for the test data
X_test_age = X_test[['age']].values

X_train_age.shape

(80, 1)

#### Concatenating the imputed and encoded NumPy arrays

In [34]:
X_train_transformed = np.concatenate((X_train_age,X_train_fever,X_train_gender_city,X_train_cough),axis=1)
# also the test data
X_test_transformed = np.concatenate((X_test_age,X_test_fever,X_test_gender_city,X_test_cough),axis=1)

X_train_transformed.shape

(80, 7)

### Column Transformer Technique

In [36]:
from sklearn.compose import ColumnTransformer

In [37]:
# Note: on scikit-learn 1.4+ replace `sparse=False` below with `sparse_output=False`
transformer = ColumnTransformer(
    transformers=[
        ('tnf1',OneHotEncoder(sparse=False,drop='first'),['gender','city']),
        ('tnf2',OrdinalEncoder(categories=[['Mild','Strong']]),['cough']),
        ('tnf3',SimpleImputer(),['fever'])
    ],
    remainder='passthrough'
)

In [41]:
transformer.fit_transform(X_train).shape

(80, 7)

In [42]:
transformer.transform(X_test).shape

(20, 7)