# Transformations

In this section we'll look at a few common transformations that we use to pre-process data before training a model.

## Preprocessing

One of the most common steps before running a model is pre-processing our features. This can be as simple as standardizing features so that they share the same scale, or as involved as mapping empirical data to a Gaussian distribution. `sklearn` has a suite of built-in preprocessors to help us do this easily.

### Standardization

Standardization takes a set of data points, subtracts their mean, and divides by their standard deviation.

Many machine learning models need it, as features with different scales and means can dramatically affect the estimated results. It's good practice to standardize features by default, and to skip standardization only when there's a very good reason to.

To standardize, we can use sklearn's transformers to help us.  For example, if we want to standardize a variable:

In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np

In [None]:
X_train = np.array([
    [ 1., -1.,  2.],
    [ 2.,  0.,  0.],
    [ 0.,  1., -1.]]
)


In [None]:
scaler = StandardScaler()
scaler.fit(X_train)

In [None]:
scaler.mean_

In [None]:
scaler.scale_

In [None]:
X_scaled = scaler.transform(X_train)
X_scaled

**note**: we can do both steps by calling `fit_transform`:

In [None]:
X_scaled = StandardScaler().fit_transform(X_train)
X_scaled

We can now verify that each column has zero mean and unit standard deviation:

In [None]:
X_scaled.mean(axis=0)

In [None]:
X_scaled.std(axis=0)
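
As a sanity check on the definition above, the sketch below recomputes the standardized values by hand with NumPy and compares them to `StandardScaler`. It also scales one new observation (`X_new`, a made-up row that is not part of the original example) to illustrate that new data is transformed with the mean and standard deviation learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([
    [1., -1.,  2.],
    [2.,  0.,  0.],
    [0.,  1., -1.],
])

scaler = StandardScaler().fit(X_train)

# Manual z-score: subtract the column mean, divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.
X_manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_manual, scaler.transform(X_train)))

# Hypothetical new observation: it is scaled with the statistics learned from X_train,
# not with its own mean and standard deviation.
X_new = np.array([[1., 1., 1.]])
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```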

### Normalization

Normalization is related to standardization but works differently: instead of rescaling each feature (column), `Normalizer` rescales each sample (row) so that it has unit norm.

In [None]:
from sklearn.preprocessing import Normalizer

In [None]:
normalizer = Normalizer(norm='l2')

In [None]:
normalizer.fit_transform(X_train)

You can also use other norms, e.g.:

In [None]:
normalizer_l1 = Normalizer(norm='l1')
normalizer_l1.fit_transform(X_train)
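
To make the "unit norm" property concrete, here is a small check (a sketch reusing the same 3x3 example matrix) that every row has length 1 under the norm that was requested.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X_train = np.array([
    [1., -1.,  2.],
    [2.,  0.,  0.],
    [0.,  1., -1.],
])

X_l2 = Normalizer(norm='l2').fit_transform(X_train)
X_l1 = Normalizer(norm='l1').fit_transform(X_train)

# After L2 normalization, each row has Euclidean length 1
row_l2_norms = np.linalg.norm(X_l2, axis=1)
# After L1 normalization, the absolute values in each row sum to 1
row_l1_norms = np.abs(X_l1).sum(axis=1)

print(row_l2_norms)
print(row_l1_norms)
```

Note that the normalization is applied per row, so `fit` has nothing to learn here; the columns are not rescaled relative to each other.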

### Scaling

In some situations we prefer to scale our features rather than standardize them. We may want to do this for data sets with a lot of meaningful zeros, since subtracting the mean would turn those zeros into non-zero values.

We can scale our input matrix above to `[-1, 1]`:

In [None]:
from sklearn.preprocessing import MaxAbsScaler

In [None]:
max_abs_scaler = MaxAbsScaler()

In [None]:
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs

In [None]:
max_abs_scaler.scale_

We can see in the above example that we have scaled our dataset to `[-1, 1]` while leaving zero entries as zeros.

If we want to scale our data to some arbitrary `[a, b]`, we can use `MinMaxScaler` instead. It is used the same way as `MaxAbsScaler`, but it can be initialized with `feature_range=(min, max)` to specify the range. Unlike `MaxAbsScaler`, it shifts the data, so zeros are generally not preserved.
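
As a minimal sketch of `MinMaxScaler`, the example below scales the same 3x3 matrix to the range `[0, 5]` (the range is an arbitrary choice for illustration). Each column's minimum maps to 0 and its maximum to 5.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([
    [1., -1.,  2.],
    [2.,  0.,  0.],
    [0.,  1., -1.],
])

# feature_range=(0, 5) is an arbitrary target range chosen for this example
range_scaler = MinMaxScaler(feature_range=(0, 5))
X_train_ranged = range_scaler.fit_transform(X_train)

print(X_train_ranged)
print(X_train_ranged.min(axis=0), X_train_ranged.max(axis=0))
```

The zero in the middle of the second column ends up at 2.5, the midpoint of the range, which shows that this scaler moves zeros rather than keeping them in place.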

### Quantiles

Sometimes we can get more signal from our features by grouping our data in a logical way. Quantile transformation is one very common approach: we take our data points and map them to a uniform (or normal) distribution based on their quantiles. This has two major effects:
- it spreads out the data when data is tightly clustered, and groups data that is sparse
- it reduces the impact of outliers, since they will just be grouped into the top or bottom quantile

In [None]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import QuantileTransformer

In [None]:
X, y = load_iris(return_X_y=True)

In [None]:
quantile = QuantileTransformer(random_state=0, n_quantiles=10)
x_quantiles = quantile.fit_transform(X)
x_quantiles[:10]

After transforming the data, we can see that the values now lie in `[0, 1]` and are approximately uniformly distributed:

In [None]:
np.percentile(X[:, 0], [0, 25, 50, 75, 100]) 


In [None]:
np.percentile(x_quantiles[:, 0], [0, 25, 50, 75, 100]) 
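
The text above mentions that the target distribution can also be normal. A minimal sketch of that variant, assuming the same iris data and the same `n_quantiles=10`, only changes the `output_distribution` argument:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import QuantileTransformer

X, y = load_iris(return_X_y=True)

# Map each feature to an approximately standard normal distribution instead of a uniform one
quantile_normal = QuantileTransformer(
    output_distribution='normal', n_quantiles=10, random_state=0
)
x_normal = quantile_normal.fit_transform(X)

# The percentiles are now roughly symmetric around zero rather than spread over [0, 1]
print(np.percentile(x_normal[:, 0], [0, 25, 50, 75, 100]))
```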


### Categorical Feature Encoding

Sometimes our data is categorical rather than numeric, and most machine learning models cannot work with non-numeric inputs directly. As a result, we need to encode our categorical variables as numeric equivalents.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
X = [
    ['Berkeley', 'Male', 'Masters'],
    ['Oakland', 'Male', 'Bachelors'],
    ['Berkeley', 'Female', 'PhD']
]

In [None]:
encoder = OrdinalEncoder()

In [None]:
encoder.fit_transform(X)

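We can see from the above that we have converted our categorical features to ordinal features. However, this is not always useful for modeling, since models will treat these integer codes as numeric values with a meaningful order and spacing. For example, a regression on this transformed data would assume that the step from `Bachelors` to `Masters` is the same size as the step from `Masters` to `PhD`, and would impose an arbitrary alphabetical order on the cities.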
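
To see which integer each category was assigned, we can inspect the fitted encoder. The sketch below uses the same three rows; by default the categories of each feature are sorted, so the codes follow alphabetical order rather than any real-world ordering.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = [
    ['Berkeley', 'Male', 'Masters'],
    ['Oakland', 'Male', 'Bachelors'],
    ['Berkeley', 'Female', 'PhD']
]

encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X)

# categories_ holds one array per feature; the position in the array is the integer code
for feature_categories in encoder.categories_:
    print(list(enumerate(feature_categories)))

# inverse_transform maps the codes back to the original labels
X_decoded = encoder.inverse_transform(X_encoded)
print(X_decoded)
```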

Instead, we can use one-hot encoding (aka dummy variables) to turn our categorical features into a set of dummy features that we can use in most downstream models:

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
hot = OneHotEncoder()

In [None]:
hot.fit_transform(X).toarray()

In [None]:
hot.categories_

**note** this showcases another really powerful use case of `Transformers`: the output does not need to have the same number of columns as the input. In our case we have 7 distinct categories across 3 features, so our 3 rows become a 3x7 matrix.

We can also pre-specify the categories. This is especially useful if the data set doesn't include all possible categories, but it is important for the model to incorporate them:

In [None]:
hot = OneHotEncoder(categories=[
    ['Berkeley', 'Oakland', 'San Francisco'],
    ['Female', 'Male'],
    ['Bachelors', 'Masters', 'PhD', 'High School']
])
hot.fit_transform(X).toarray()

We can now see that we have 9 categories represented in the output, even though 2 of them don't appear in our sample data set.

However, if we were to run a regression, this output still would not work, as the columns are perfectly collinear: the dummy columns belonging to each feature always sum to 1. Instead, we can add the `drop` argument to get a non-collinear matrix:

In [None]:
hot = OneHotEncoder(drop='first') # can also use 'if_binary'
hot.fit_transform(X).toarray()
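
The collinearity can be checked directly. In this sketch (same three rows as above), the dummy columns of each feature sum to 1 in every row, so the city columns and the gender columns add up to the same constant vector. With `drop='first'`, one column per feature is removed and the matrix shrinks from 7 to 4 columns.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [
    ['Berkeley', 'Male', 'Masters'],
    ['Oakland', 'Male', 'Bachelors'],
    ['Berkeley', 'Female', 'PhD']
]

full = OneHotEncoder().fit_transform(X).toarray()
reduced = OneHotEncoder(drop='first').fit_transform(X).toarray()

# Columns 0-1 encode the city and columns 2-3 the gender; each group sums to 1 per row
city_sum = full[:, 0:2].sum(axis=1)
gender_sum = full[:, 2:4].sum(axis=1)
print(city_sum, gender_sum)

print(full.shape)     # all 7 categories
print(reduced.shape)  # one category dropped per feature
```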

### Discretization

Discretization is useful when we don't need the granularity of continuous variables, or when we get a higher signal-to-noise ratio from the discrete/binned representation than from the continuous one.

One example is threshold signals (e.g. a binary option, where the payout is 0 if the stock price is under $100, and 100 otherwise): if we wanted to regress payout on stock price, the discretized representation of the feature would yield a much better model than the continuous variable.

To do this, we can use the `KBinsDiscretizer`:

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

In [None]:
encoder = KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal')

In [None]:
encoder.fit_transform(X_train)

We can encode the output of the discretization either as ordinal (above) or as one-hot (below):

In [None]:
encoder = KBinsDiscretizer(n_bins=[3, 2, 2])
encoder.fit_transform(X_train).toarray()

We can also change the way that `KBinsDiscretizer` cuts. By default the transformer cuts using quantiles, but we can also make uniform cuts by setting `strategy='uniform'`, which splits the range into equal-width bins:

In [None]:
encoder = KBinsDiscretizer(strategy='uniform', n_bins=[3, 2, 2])
encoder.fit_transform(X_train).toarray()
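
The difference between the two strategies is easiest to see on skewed data. The sketch below uses a made-up single column with one large outlier and inspects the fitted `bin_edges_`: uniform bins split the range evenly, so almost everything lands in the first bin, while quantile bins spread the points more evenly.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up skewed data: five small values and one outlier
x = np.array([[0.], [1.], [2.], [3.], [4.], [100.]])

uniform = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform').fit(x)
quantile = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile').fit(x)

# bin_edges_ holds one array of edges per feature
print(uniform.bin_edges_[0])
print(quantile.bin_edges_[0])

uniform_bins = uniform.transform(x).ravel()
quantile_bins = quantile.transform(x).ravel()
print(uniform_bins)
print(quantile_bins)
```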

Lastly, we can discretize to binary using the `Binarizer`:

In [None]:
from sklearn.preprocessing import Binarizer

In [None]:
encoder = Binarizer(threshold=0)

In [None]:
encoder.fit_transform(X_train)
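
To illustrate the binary option example from the start of this section, here is a minimal sketch on synthetic data. The prices are made up (an evenly spaced grid that avoids the strike itself), and the payout follows the rule described earlier: 0 below $100 and 100 otherwise. We compare the fit of a linear regression on the raw price against one on the binarized price.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import Binarizer

# Synthetic prices from 50.5 to 149.5 in steps of 1 (assumed data, for illustration only)
prices = np.linspace(50.5, 149.5, 100).reshape(-1, 1)
payout = np.where(prices.ravel() < 100, 0.0, 100.0)

# Regression on the continuous price: a straight line cannot follow the step
r2_continuous = LinearRegression().fit(prices, payout).score(prices, payout)

# Regression on the binarized price: 1 if the price is above the threshold, else 0
above_strike = Binarizer(threshold=100).fit_transform(prices)
r2_binary = LinearRegression().fit(above_strike, payout).score(above_strike, payout)

print(round(r2_continuous, 3), round(r2_binary, 3))
```

Note that `Binarizer` maps values strictly greater than the threshold to 1, which is why the synthetic grid skips a price of exactly 100.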

## Imputation

We previously looked at how to deal with missing data as part of data cleaning, and one method we mentioned was imputation. Once we move from data analysis to modeling, we need to build our imputation strategy into our modeling pipeline to make sure our training/testing process is consistent. To do this, we can leverage `Transformers` again to help us transform the data (in this case the transformation is an imputation).

The simplest way to impute is using the `SimpleImputer`:

In [None]:
X_train = [
    [1, 2, 3],
    [4, np.nan, 6],
    [np.nan, np.nan, 9],
    [1, 3, 7],
    [6, 8, 1]
]

X_train_cat = [
    ['a', '1'],
    ['a', '2'],
    [np.nan, '2'],
    ['b', np.nan]
]

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

In [None]:
imputer.fit_transform(X_train)
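
To keep training and testing consistent, the fill values should be learned once from the training data and then reused. The sketch below shows this with the same training matrix and a made-up test row (`X_test` is an assumption for illustration): the fitted column means are stored in `statistics_` and applied to the new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([
    [1, 2, 3],
    [4, np.nan, 6],
    [np.nan, np.nan, 9],
    [1, 3, 7],
    [6, 8, 1]
])

# Hypothetical unseen row with two missing values
X_test = np.array([[np.nan, 5, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean').fit(X_train)

# Column means of the observed training values
print(imputer.statistics_)

# Missing test entries are filled with the training means, not with test statistics
X_test_imputed = imputer.transform(X_test)
print(X_test_imputed)
```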

We can do the same type of imputation with categorical variables:

In [None]:
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X_train_cat)

In situations where there are strong relationships between features, we can leverage multivariate imputers instead of relying on single-feature properties. One common way to do this is KNN-based imputation:

In [None]:
from sklearn.impute import KNNImputer

In [None]:
imputer = KNNImputer(n_neighbors=2, weights="uniform")

In [None]:
imputer.fit_transform(X_train)
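
To show where an imputed value comes from, the sketch below reproduces one of them by hand. It assumes the default distance metric of `KNNImputer`, a Euclidean distance that skips missing coordinates, which is available as `nan_euclidean_distances`. For the missing entry in row 1, column 1, we find the two closest rows that have column 1 observed and average their values.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X_train = np.array([
    [1, 2, 3],
    [4, np.nan, 6],
    [np.nan, np.nan, 9],
    [1, 3, 7],
    [6, 8, 1]
])

X_imputed = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X_train)

# Reproduce the imputed value for row 1, column 1 by hand
row, col = 1, 1
# Only rows where this column is observed can act as donors
donors = np.flatnonzero(~np.isnan(X_train[:, col]))
distances = nan_euclidean_distances(X_train[[row]], X_train[donors])[0]
nearest = donors[np.argsort(distances)[:2]]
manual_value = X_train[nearest, col].mean()

print(nearest, manual_value, X_imputed[row, col])
```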

## Dimensionality Reduction

Lastly, when we have a lot of features, we may want to reduce the dimensionality of the data before training the model on it. One very popular way to do this is PCA, which at a very abstract level is just another transformation of the data.

We can use the `PCA` transformer to project our higher-dimensional data onto fewer dimensions:

In [None]:
from sklearn.decomposition import PCA

In [None]:
X_train = np.array([
    [-1, -1, 3], 
    [-2, -1, 10], 
    [-3, -2, 13], 
    [1, 1, 15], 
    [2, 1, 22], 
    [3, 2, 1]
])

In [None]:
pca = PCA(n_components=1)

In [None]:
pca.fit_transform(X_train)


After transforming the data, we can take the lower-dimensional inputs and use them to train our model.
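
Before relying on the reduced data, it is usually worth checking how much of the original variance the kept components retain. A minimal sketch using the same six rows: `explained_variance_ratio_` reports that share, and `inverse_transform` maps the reduced data back to the original feature space as an approximation.

```python
import numpy as np
from sklearn.decomposition import PCA

X_train = np.array([
    [-1, -1, 3],
    [-2, -1, 10],
    [-3, -2, 13],
    [1, 1, 15],
    [2, 1, 22],
    [3, 2, 1]
])

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_train)

# One value per sample, since we kept a single component
print(X_reduced.shape)

# Fraction of the total variance captured by the kept component
print(pca.explained_variance_ratio_)

# Approximate reconstruction in the original 3 dimensions
X_restored = pca.inverse_transform(X_reduced)
print(X_restored.shape)
```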

## Custom Transformers

We've gone through many different feature transformation use cases, but it's quite possible that none of the above will suit your specific use case. In that situation, you can create your own Transformer. Most transformers just need to inherit from `BaseEstimator` (since all transformers are estimators) and from `TransformerMixin`, which gives the transformer the `fit_transform` method.

For example, we can create a Transformer that transforms a feature into a boolean column that is `True` if the value is not null and `False` otherwise:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

In [None]:
class BinaryNullTransformer(BaseEstimator, TransformerMixin):
    """Flags each entry as True if it is observed and False if it is NaN."""

    def fit(self, X, y=None):
        # fit must exist because this is an estimator, but there is nothing to learn
        return self

    def transform(self, X):
        # elementwise boolean mask with the same shape as X
        return ~np.isnan(X)

In [None]:
X = np.array([
    [-1, np.nan, 3], 
    [-2, -1, 10], 
    [-3, -2, 13], 
    [1, 1, np.nan], 
    [2, 1, 22], 
    [np.nan, 2, 1]
])

In [None]:
transformer = BinaryNullTransformer()

In [None]:
transformer.fit_transform(X)
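
The transformer above has nothing to learn in `fit`. As a complementary sketch, here is a hypothetical transformer that does learn state: `PercentileClipper` (a name invented for this example, not part of sklearn) learns per-column percentiles in `fit` and clips values to them in `transform`. The mixin is listed before `BaseEstimator` here, which is the ordering scikit-learn's developer guide recommends; the percentile defaults are arbitrary.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class PercentileClipper(TransformerMixin, BaseEstimator):
    """Hypothetical example: clip each column to percentiles learned during fit."""

    def __init__(self, lower=5, upper=95):
        # Constructor arguments are stored unchanged so that get_params/set_params work
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned attributes get a trailing underscore, following sklearn's convention
        self.lower_ = np.nanpercentile(X, self.lower, axis=0)
        self.upper_ = np.nanpercentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Bounds come from the data seen in fit, not from the data being transformed
        return np.clip(X, self.lower_, self.upper_)


# Made-up data with an outlier in the first column
X_outliers = np.array([[1., 10.], [2., 20.], [3., 30.], [4., 40.], [100., 50.]])

clipper = PercentileClipper(lower=0, upper=75)
X_clipped = clipper.fit_transform(X_outliers)
print(clipper.upper_)
print(X_clipped)
```

Because the bounds are stored on the fitted object, the same clipping can later be applied to test data with `clipper.transform(...)`.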