### ColumnTransformer in Machine Learning

A **ColumnTransformer** is used in machine learning when different columns in a dataset require different preprocessing steps. It is especially useful 
for datasets with mixed data types (numerical and categorical) or when feature-specific transformations are needed.

#### When to Use:
1. **Mixed Data Types:** Apply different preprocessing (e.g., scaling for numerical features, encoding for categorical features).
2. **Feature-Specific Preprocessing:** Handle specific transformations like normalization, log transformation, or encoding for individual features.
3. **Simplify Pipelines:** Create clean, maintainable, and efficient preprocessing pipelines.
4. **Dimensionality Reduction:** Apply techniques like PCA or feature selection on specific columns.

#### Example:
- **Numerical Features:** Scale features like age and income using `StandardScaler`.
- **Categorical Features:** Encode features like gender and country using `OneHotEncoder`.
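
The example above can be sketched in a few lines. This is a minimal illustration, not a recommended recipe: the `toy` DataFrame and the step names `"num"` and `"cat"` are made up for this sketch.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [25, 40, 58, 33],
    "income": [30000, 52000, 81000, 45000],
    "gender": ["Male", "Female", "Female", "Male"],
    "country": ["India", "USA", "India", "UK"],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Scale the numerical columns to zero mean and unit variance
        ("num", StandardScaler(), ["age", "income"]),
        # One-hot encode the categorical columns
        ("cat", OneHotEncoder(), ["gender", "country"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

toy_transformed = preprocessor.fit_transform(toy)
print(toy_transformed.shape)  # (4, 7): 2 scaled + 2 gender + 3 country columns
```

Each tuple names a step, gives the transformer, and lists the columns it applies to, so every column is processed by exactly the transformer intended for it.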

#### Benefits:
- Clean and efficient handling of multiple preprocessing steps.
- Reduces errors by ensuring correct transformations are applied to the intended columns.
- Scalable for large and complex datasets.

In [17]:
import numpy as np
import pandas as pd

In [41]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

In [19]:
df = pd.read_csv('covid_toy.csv')

In [20]:
df.head()

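| | age | gender | fever | cough | city | has_covid |
|---|---|---|---|---|---|---|
| 0 | 60 | Male | 103.0 | Mild | Kolkata | No |
| 1 | 27 | Male | 100.0 | Mild | Delhi | Yes |
| 2 | 42 | Male | 101.0 | Mild | Delhi | No |
| 3 | 31 | Female | 98.0 | Mild | Kolkata | No |
| 4 | 65 | Female | 101.0 | Mild | Mumbai | No |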


In [21]:
df.isnull().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64

In [22]:
from sklearn.model_selection import train_test_split

In [34]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['has_covid']), df['has_covid'], test_size=0.2)



In [35]:
X_train

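| | age | gender | fever | cough | city |
|---|---|---|---|---|---|
| 33 | 26 | Female | 98.0 | Mild | Kolkata |
| 65 | 69 | Female | 102.0 | Mild | Bangalore |
| 61 | 81 | Female | 98.0 | Strong | Mumbai |
| 69 | 73 | Female | 103.0 | Mild | Delhi |
| 57 | 49 | Female | 99.0 | Strong | Bangalore |
| ... | ... | ... | ... | ... | ... |
| 50 | 19 | Male | 101.0 | Mild | Delhi |
| 58 | 23 | Male | 98.0 | Strong | Mumbai |
| 34 | 74 | Male | 102.0 | Mild | Mumbai |
| 81 | 65 | Male | 99.0 | Mild | Delhi |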


# The manual way (without ColumnTransformer)

# SimpleImputer in Machine Learning

In machine learning, **SimpleImputer** is a tool provided by the **scikit-learn** library to handle missing data in datasets. Missing data is common in real-world scenarios, and the `SimpleImputer` class helps fill in these missing values using various strategies.

## What Does SimpleImputer Do?

The **SimpleImputer** replaces missing values in a dataset with a specific value or statistic (like the mean, median, or most frequent value) for each column. This ensures that the dataset is complete and ready for machine learning algorithms, which generally do not handle missing values well.

## Key Features of SimpleImputer

1. **Customizable Strategies**:
   - Fill missing values with:
     - **Mean**: Useful for numerical data.
     - **Median**: Useful for skewed numerical data.
     - **Most Frequent**: Useful for categorical data.
     - **Constant**: Fill with a fixed value (e.g., `0` or `"missing"`).

2. **Supports Arrays and DataFrames**:
   - Works with both `numpy` arrays and `pandas` DataFrames.

3. **Consistent and Repeatable**:
   - Ensures imputation is applied consistently across training and testing datasets.
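
The strategies listed above can be compared side by side. The sketch below uses small made-up arrays (`ages`, `cities`) that are not part of the dataset in this notebook; object arrays are used for the text column so that `np.nan` can mark the missing entry.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: a skewed numerical column and a categorical column
ages = np.array([[20.0], [22.0], [np.nan], [25.0], [90.0]])
cities = np.array([["Delhi"], [np.nan], ["Delhi"], ["Mumbai"], ["Delhi"]], dtype=object)

# Median is less affected by the outlier (90) than the mean would be
median_imputer = SimpleImputer(strategy="median")
ages_filled = median_imputer.fit_transform(ages)

# Most frequent value suits categorical data
mode_imputer = SimpleImputer(strategy="most_frequent")
cities_filled = mode_imputer.fit_transform(cities)

# Constant fill marks the gap explicitly instead of guessing a value
const_imputer = SimpleImputer(strategy="constant", fill_value="missing")
cities_flagged = const_imputer.fit_transform(cities)

print(ages_filled[2, 0])     # 23.5, the median of 20, 22, 25, 90
print(cities_filled[1, 0])   # Delhi
print(cities_flagged[1, 0])  # missing
```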

## How to Use SimpleImputer

### Step-by-Step Guide

1. **Import the SimpleImputer**:
   ```python
   from sklearn.impute import SimpleImputer
   ```

2. **Define the Imputer**:
   Specify the strategy for handling missing values.
   ```python
   imputer = SimpleImputer(strategy='mean')  # Use 'mean', 'median', 'most_frequent', or 'constant'
   ```

3. **Fit the Imputer**:
   Learn the statistics from the dataset.
   ```python
   imputer.fit(X)  # X is the dataset (array or DataFrame)
   ```

4. **Transform the Dataset**:
   Replace the missing values with the computed statistics.
   ```python
   X_transformed = imputer.transform(X)
   ```
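
Keeping `fit` and `transform` as separate steps matters once the data is split: the statistics should be learned from the training set only and then reused on the test set. A minimal sketch, assuming two small made-up arrays (`X_train_demo`, `X_test_demo`):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical train/test split of a single fever-like column
X_train_demo = np.array([[98.0], [100.0], [np.nan], [102.0]])
X_test_demo = np.array([[np.nan], [104.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train_demo)          # learns the training mean only
print(imputer.statistics_)         # [100.]

# The test set is filled with the training mean, not its own mean (104.0)
X_test_filled = imputer.transform(X_test_demo)
print(X_test_filled[0, 0])         # 100.0
```

Calling `fit_transform` on the test set instead would recompute the statistic from test data, so the two sets would be imputed inconsistently.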

## Example

### Dataset with Missing Values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Sample dataset
data = np.array([[1, 2], [3, np.nan], [5, 6], [7, 8], [np.nan, 10]])

# Define imputer strategy as 'mean'
imputer = SimpleImputer(strategy='mean')

# Fit and transform the data
data_transformed = imputer.fit_transform(data)

print(data_transformed)
```

### Output:
```plaintext
[[ 1.   2. ]
 [ 3.   6.5]
 [ 5.   6. ]
 [ 7.   8. ]
 [ 4.  10. ]]
```

In this example, the missing value in the second row is replaced with the mean of the second column (`6.5`, the average of 2, 6, 8, and 10), and the missing value in the last row is replaced with the mean of the first column (`4.0`, the average of 1, 3, 5, and 7).

In [36]:
# Impute missing values in the fever column (default strategy: mean)
imputer = SimpleImputer()
X_train_fever = imputer.fit_transform(X_train[['fever']])

# Test data: reuse the mean learned from the training data (transform, not fit_transform)
X_test_fever = imputer.transform(X_test[['fever']])

X_train_fever.shape

(80, 1)

In [37]:
# Ordinal encoding -> cough (Mild = 0, Strong = 1)
oe = OrdinalEncoder(categories=[['Mild','Strong']])
X_train_cough = oe.fit_transform(X_train[['cough']])

# Test data: apply the encoder fitted on the training data
X_test_cough = oe.transform(X_test[['cough']])

X_train_cough.shape

(80, 1)

In [43]:
# One-hot encoding -> gender, city (drop the first category of each to avoid redundancy)
ohe = OneHotEncoder(drop='first',sparse_output=False)
X_train_gender_city = ohe.fit_transform(X_train[['gender','city']])

# Test data: apply the encoder fitted on the training data so the columns match
X_test_gender_city = ohe.transform(X_test[['gender','city']])

X_train_gender_city.shape

(80, 4)

In [44]:
# Extract age as-is (no preprocessing needed)
X_train_age = X_train[['age']].values

# also the test data
X_test_age = X_test[['age']].values

X_train_age.shape

(80, 1)

In [46]:
X_train_transformed = np.concatenate((X_train_age,X_train_fever,X_train_gender_city,X_train_cough),axis=1)
# also the test data
X_test_transformed = np.concatenate((X_test_age,X_test_fever,X_test_gender_city,X_test_cough),axis=1)

X_train_transformed.shape

(80, 7)

**Using Column Transformer**

In [48]:
from sklearn.compose import ColumnTransformer

In [55]:
transformer = ColumnTransformer(transformers=[
    ('tnf1',SimpleImputer(),['fever']),
    ('tnf2',OrdinalEncoder(categories=[['Mild','Strong']]),['cough']),
    ('tnf3',OneHotEncoder(sparse_output=False,drop='first'),['gender','city'])
],remainder='passthrough')

In [56]:
transformer.fit_transform(X_train).shape

(80, 7)

In [57]:
transformer.transform(X_test).shape

(20, 7)