# Handmade Standardizer

🧑🏻‍🏫 In this challenge, we are going to create our *own* StandardScaler. Are you wondering why? Glad you asked!

🎯 The goals of this exercise are to:
- understand `stateless transformers` vs. `stateful transformers`
- manipulate `FeatureUnion`

## 📚 Stateless Transformer vs. Stateful Transformer

## Imports 

In [1]:
import numpy as np
import pandas as pd
from sklearn import set_config; set_config(display='diagram')
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.pipeline import make_pipeline, make_union
from sklearn.base import TransformerMixin, BaseEstimator

In [2]:
X_train = pd.DataFrame({
    'A': {0: 1, 1: 2, 2: 3},
    'B': {0: 4, 1: 5, 2: 6},
    'C': {0: 7, 1: 8, 2: 9}
})

print("This is the training dataset:")
display(X_train)
X_test = pd.DataFrame({
    'A': {0: 1, 1: 2, 2: 3},
    'B': {0: 2, 1: 3, 2: 4},
    'C': {0: 3, 1: 4, 2: 10}
})

print("This is the test dataset:")
display(X_test)

This is the training dataset:


|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 4 | 7 |
| 1 | 2 | 5 | 8 |
| 2 | 3 | 6 | 9 |


This is the test dataset:


|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 2 | 3 | 4 |
| 2 | 3 | 4 | 10 |


In [3]:
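# Stateful: StandardScaler learns each column's mean and standard deviation during fit
standard_scaler = StandardScaler()
# Stateless: FunctionTransformer only applies the given function (here, the row-wise average of A, B and C)
feature_averager = FunctionTransformer(lambda df: pd.DataFrame(1/3 * (df["A"] + df["B"] + df["C"])))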

pipeline = make_union(standard_scaler, feature_averager)
pipeline

Let's:
- fit the pipeline on the training set
- transform both the training set and the test set

In [4]:
pipeline.fit(X_train)

In [5]:
X_train_transformed = pd.DataFrame(pipeline.transform(X_train))
X_train_transformed

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -1.224745 | -1.224745 | -1.224745 | 4.0 |
| 1 | 0.0 | 0.0 | 0.0 | 5.0 |
| 2 | 1.224745 | 1.224745 | 1.224745 | 6.0 |


In [6]:
X_test_transformed = pd.DataFrame(pipeline.transform(X_test))
X_test_transformed

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -1.224745 | -3.674235 | -6.123724 | 2.0 |
| 1 | 0.0 | -2.44949 | -4.898979 | 3.0 |
| 2 | 1.224745 | -1.224745 | 2.44949 | 5.666667 |
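
The difference between the two transformers in this union can be made explicit. The sketch below is a minimal illustration, not part of the original challenge: it rebuilds the two datasets so it runs on its own, and `row_mean` is a hypothetical helper standing in for the lambda used above. It shows that `StandardScaler` stores statistics learned during `fit` (in its `mean_` and `scale_` attributes) and reuses them on the test set, whereas the averaging function depends only on the rows it receives.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X_train = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
X_test = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 10]})

# Stateful: fit() learns per-column statistics from the training set.
scaler = StandardScaler().fit(X_train)
learned_means = scaler.mean_   # per-column training means
learned_stds = scaler.scale_   # per-column training standard deviations (ddof=0)

# transform() reuses the training statistics, even on the test set.
manual_test = (X_test.to_numpy() - learned_means) / learned_stds
same_as_scaler = np.allclose(scaler.transform(X_test), manual_test)


# Stateless: the row-wise average needs nothing from the training set.
def row_mean(df):  # hypothetical helper, equivalent to the lambda above
    return df.mean(axis="columns").to_frame()


averager = FunctionTransformer(row_mean)
averaged_test = averager.fit(X_train).transform(X_test)

print(learned_means, learned_stds, same_as_scaler)
print(averaged_test)
```

This explains why the first three columns of the transformed test set are not centered around zero: they are scaled with the training statistics, not with their own.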


## 💻 Create your own stateful transformer

### 💻 Custom Standardizer

❓ **Questions: Coding your own class** ❓

1. Code your own class `CustomStandardizer`
    * It should behave exactly like the `StandardScaler` from scikit-learn, which means having:
        * a `.fit()` method which computes ("learns") $\mu_{\color{blue}{train}}$ and $\sigma_{\color{blue}{train}}$
        * and a `.transform()` method.


2. Fit it on `X_train` 

3. Transform both `X_train` and `X_test` 

4. Compare your `CustomStandardizer` with the `StandardScaler` from scikit-learn to make sure you got it right!
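
As a reminder of what the class has to compute, standardizing a value means subtracting the training mean and dividing by the training standard deviation, column by column. The worked example below uses column `B` of the datasets defined above and assumes the population standard deviation (dividing by $n$), which is consistent with the `StandardScaler` outputs shown earlier:

$$
z = \frac{x - \mu_{train}}{\sigma_{train}}, \qquad \mu_{train}^{B} = \frac{4 + 5 + 6}{3} = 5, \qquad \sigma_{train}^{B} = \sqrt{\frac{(4-5)^2 + (5-5)^2 + (6-5)^2}{3}} \approx 0.8165
$$

$$
x_{test}^{B} = 2 \;\Rightarrow\; z = \frac{2 - 5}{0.8165} \approx -3.674
$$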

In [7]:
class CustomStandardizer(TransformerMixin, BaseEstimator):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        '''
        Stores what needs to be stored as instance attributes. 
        Returns "self" to allow chaining fit and transform.
        '''
        self.means = X.mean()
        self.stds = X.std(ddof=0)
        return self
    
    def transform(self, X, y=None): 
        return (X - self.means)/self.stds
    
    def inverse_transform(self, X, y=None):
        return X * self.stds + self.means
     

In [8]:
custom_standardizer = CustomStandardizer()
custom_standardizer.fit(X_train)

In [9]:
X_train_transformed = custom_standardizer.transform(X_train)
print("X_train_transformed:")
X_train_transformed

X_train_transformed:


|   | A | B | C |
|---|---|---|---|
| 0 | -1.224745 | -1.224745 | -1.224745 |
| 1 | 0.0 | 0.0 | 0.0 |
| 2 | 1.224745 | 1.224745 | 1.224745 |


In [10]:
X_test_transformed = custom_standardizer.transform(X_test)
print("X_test_transformed:")
X_test_transformed

X_test_transformed:


|   | A | B | C |
|---|---|---|---|
| 0 | -1.224745 | -3.674235 | -6.123724 |
| 1 | 0.0 | -2.44949 | -4.898979 |
| 2 | 1.224745 | -1.224745 | 2.44949 |
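
Question 4 asks for a comparison with scikit-learn's `StandardScaler`. One possible way to do it is sketched below. To stay runnable on its own, the sketch recomputes the statistics directly with pandas instead of reusing the class, and its variable names are illustrative only. It also shows why `ddof=0` matters: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), which does not match `StandardScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
X_test = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 10]})

means = X_train.mean()
stds_population = X_train.std(ddof=0)  # population std, as in StandardScaler
stds_sample = X_train.std()            # pandas default: sample std (ddof=1)

# Reference result from scikit-learn, fitted on the training set only.
reference = StandardScaler().fit(X_train).transform(X_test)

scaled_population = ((X_test - means) / stds_population).to_numpy()
scaled_sample = ((X_test - means) / stds_sample).to_numpy()

matches_population = np.allclose(scaled_population, reference)
matches_sample = np.allclose(scaled_sample, reference)

print(matches_population, matches_sample)
```

Only the `ddof=0` version agrees with the reference.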


### 💻 Inverse Transform

❓ **Question (Inverse Transform)** ❓

`StandardScaler` from scikit-learn has an [`.inverse_transform()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler.inverse_transform) method that maps scaled data back to the original, unscaled values.

1. Go back to your `CustomStandardizer` class and implement your own `.inverse_transform()` method.

2. Try it on your scaled training set and your scaled test set.

In [11]:
X_train_inverse_transformed = custom_standardizer.inverse_transform(X_train_transformed)
X_train_inverse_transformed

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 4.0 | 7.0 |
| 1 | 2.0 | 5.0 | 8.0 |
| 2 | 3.0 | 6.0 | 9.0 |


In [12]:
X_test_inverse_transformed = custom_standardizer.inverse_transform(X_test_transformed)
X_test_inverse_transformed

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 |
| 1 | 2.0 | 3.0 | 4.0 |
| 2 | 3.0 | 4.0 | 10.0 |
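
Rather than comparing the tables by eye, the round trip can be checked programmatically. The following is a minimal sketch under the same assumptions as before (data rebuilt locally, statistics computed with pandas, illustrative variable names). It also cross-checks the manual inverse against `StandardScaler.inverse_transform`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
X_test = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 10]})

means = X_train.mean()
stds = X_train.std(ddof=0)

# Forward transform, then manual inverse: x = z * sigma + mu
scaled_test = (X_test - means) / stds
restored_test = scaled_test * stds + means
round_trip_ok = np.allclose(restored_test.to_numpy(), X_test.to_numpy())

# Cross-check with scikit-learn's own inverse_transform.
scaler = StandardScaler().fit(X_train)
sklearn_restored = scaler.inverse_transform(scaled_test)
agrees_with_sklearn = np.allclose(sklearn_restored, X_test.to_numpy())

print(round_trip_ok, agrees_with_sklearn)
```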


### 💻 Complete custom pipeline!

In [13]:
class CustomStandardizer(TransformerMixin, BaseEstimator):

    def __init__(self, shrink_factor=1):
        self.shrink_factor = shrink_factor

    def fit(self, X, y=None):
        '''
        Stores what needs to be stored as instance attributes. 
        Returns "self" to allow chaining fit and transform.
        '''
        self.means = X.mean()
        self.stds = X.std(ddof=0)
        return self
    
    def transform(self, X, y=None): 
        return (X - self.means) / self.stds / self.shrink_factor
    
    def inverse_transform(self, X, y=None):
        return X * self.shrink_factor * self.stds + self.means
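
Reading the class above, `shrink_factor` (written $s$ here, a symbol introduced only for this note) divides the standardized value, and `inverse_transform` undoes both steps:

$$
z = \frac{x - \mu_{train}}{\sigma_{train} \cdot s}, \qquad x = z \cdot s \cdot \sigma_{train} + \mu_{train}
$$

For example, with $s = 2$, a value that standardizes to $-1.224745$ becomes $-1.224745 / 2 \approx -0.612372$.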

In [14]:
custom_standardizer_2 = CustomStandardizer(shrink_factor=2).fit(X_train)
custom_standardizer_2

In [15]:
X_train_transformed = custom_standardizer_2.transform(X_train)
X_train_transformed

|   | A | B | C |
|---|---|---|---|
| 0 | -0.612372 | -0.612372 | -0.612372 |
| 1 | 0.0 | 0.0 | 0.0 |
| 2 | 0.612372 | 0.612372 | 0.612372 |


In [16]:
X_test_transformed = custom_standardizer_2.transform(X_test)
X_test_transformed

|   | A | B | C |
|---|---|---|---|
| 0 | -0.612372 | -1.837117 | -3.061862 |
| 1 | 0.0 | -1.224745 | -2.44949 |
| 2 | 0.612372 | -0.612372 | 1.224745 |


In [17]:
# Run the following cells to ensure you got the right transformations 
truth_train = np.array([
    [-0.612372, -0.612372, -0.612372],
    [0.000000, 0.000000, 0.000000],
    [0.612372, 0.612372, 0.612372]
])
truth_test = np.array([
    [-0.612372, -1.837117, -3.061862],
    [ 0.        , -1.224745, -2.449490],
    [ 0.612372, -0.612372,  1.224745]])

In [18]:
# Checks: each of the following cells should return True
np.allclose(X_train_transformed, truth_train)

True

In [19]:
np.allclose(X_test_transformed, truth_test)

True

In [20]:
class FeatureAverager(TransformerMixin, BaseEstimator):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        '''
        If needed, this method will store information as instance attributes.
        Returns "self".
        '''
        return self

    def transform(self, X, y=None):
        # Row-wise mean of the features, divided by the row-wise maximum
        features_sum = X.sum(axis="columns")
        max_factor = X.max(axis="columns")
        ncol = X.shape[1]
        feature_averager = ((1 / ncol) * features_sum) / max_factor
        return pd.DataFrame(feature_averager)
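
As a sanity check on the `transform` logic above, each row is mapped to its mean divided by its maximum. For the first row of `X_train` and the last row of `X_test`, this gives:

$$
\frac{(1 + 4 + 7)/3}{7} = \frac{4}{7} \approx 0.571429, \qquad \frac{(3 + 4 + 10)/3}{10} = \frac{17}{30} \approx 0.566667
$$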

In [21]:
custom_feature_averager = FeatureAverager().fit(X_train)
custom_feature_averager

In [22]:
X_train_transformed = custom_feature_averager.transform(X_train)
X_train_transformed

|   | 0 |
|---|---|
| 0 | 0.571429 |
| 1 | 0.625 |
| 2 | 0.666667 |


In [23]:
X_test_transformed = custom_feature_averager.transform(X_test)
X_test_transformed

|   | 0 |
|---|---|
| 0 | 0.666667 |
| 1 | 0.75 |
| 2 | 0.566667 |


In [24]:
custom_standardizer_3 = CustomStandardizer(shrink_factor=3)
custom_standardizer_3

In [25]:
custom_feature_averager

In [26]:
pipeline = make_union(custom_standardizer_3, custom_feature_averager)
pipeline

In [27]:
pipeline.fit(X_train)

In [28]:
X_train_transformed = pd.DataFrame(pipeline.transform(X_train))
X_train_transformed

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -0.408248 | -0.408248 | -0.408248 | 0.571429 |
| 1 | 0.0 | 0.0 | 0.0 | 0.625 |
| 2 | 0.408248 | 0.408248 | 0.408248 | 0.666667 |


In [29]:
X_test_transformed = pd.DataFrame(pipeline.transform(X_test))
X_test_transformed

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -0.408248 | -1.224745 | -2.041241 | 0.666667 |
| 1 | 0.0 | -0.816497 | -1.632993 | 0.75 |
| 2 | 0.408248 | -0.408248 | 0.816497 | 0.566667 |
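
The four-column result illustrates what a `FeatureUnion` does: it fits each transformer on the same input and concatenates their outputs side by side. The sketch below is a minimal illustration of that behaviour. To remain self-contained it uses scikit-learn's `StandardScaler` and a hypothetical `row_mean` helper rather than the custom classes defined above, so the numbers differ from the table.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X_train = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
X_test = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 10]})


def row_mean(df):  # hypothetical stateless helper: one output column
    return df.mean(axis="columns").to_frame()


# The union fits both transformers on X_train and stacks their outputs.
union = make_union(StandardScaler(), FunctionTransformer(row_mean)).fit(X_train)
union_output = union.transform(X_test)

# Same result built by hand: transform separately, then stack horizontally.
manual_output = np.hstack([
    StandardScaler().fit(X_train).transform(X_test),  # 3 scaled columns
    row_mean(X_test).to_numpy(),                      # 1 averaged column
])

shape_ok = union_output.shape == (3, 4)
values_match = np.allclose(union_output, manual_output)

print(shape_ok, values_match)
```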
