# Feature Engineering


Feature engineering is the process of creating domain-specific features from existing features in order to improve the performance of a machine learning algorithm. It can be as simple as formulating a new feature as a linear combination of other features, or as complex as computing term-frequency or tf-idf features for an email spam classifier.

Feature engineering often requires domain-specific knowledge and is often the most _time-consuming_ part of building a model.

## What are some options that work for feature engineering?

**Binarization**

Is the process of converting a numeric feature into a binary one by enforcing a condition `X['new_feature'] = X['selected feature'] >= t`. 


In [1]:
from sklearn.preprocessing import Binarizer


X = [[4], [1], [3], [0]]
binarizer = Binarizer(threshold=2.9)
X_new = binarizer.fit_transform(X)
X_new

array([[1],
       [0],
       [1],
       [0]])
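
One detail worth checking: scikit-learn's `Binarizer` maps a value to 1 only when it is strictly greater than the threshold, whereas the condition written above uses `>=`. The sketch below contrasts the two conventions with plain NumPy; the threshold of 3 is an illustrative choice so that one sample sits exactly on the boundary.

```python
import numpy as np

# Same toy column as in the cell above; the threshold is an illustrative assumption.
X_demo = np.array([[4], [1], [3], [0]])
threshold = 3

# Strict comparison: the rule Binarizer applies (1 only when value > threshold).
strict = (X_demo > threshold).astype(int)

# Inclusive comparison: a value equal to the threshold also maps to 1.
inclusive = (X_demo >= threshold).astype(int)

print(strict.ravel())
print(inclusive.ravel())
```

The two results differ only for the sample equal to the threshold, so the choice matters mostly for integer or heavily rounded features.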

**Discretization / Binning**

Is the process of converting a numerical feature into one or more categorical ones. Binarization is the special case of discretization with two bins, and discretization is the generalization of binarization to any number of bins. A common example is an age group: adolescent, teenager, adult, middle-aged, and old.
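
A minimal sketch of binning ages with `pandas.cut`. The bin edges and the group labels are illustrative assumptions, not a standard definition of age groups.

```python
import pandas as pd

ages = pd.Series([3, 15, 30, 50, 80], name="age")

# Illustrative bin edges and labels (assumptions made for this example).
bin_edges = [0, 12, 19, 39, 64, 120]
labels = ["child", "teenager", "adult", "middle-aged", "senior"]

# With the default right=True, each bin includes its right edge, e.g. (12, 19].
age_group = pd.cut(ages, bins=bin_edges, labels=labels)
print(age_group.tolist())
```

Passing only two bins reduces this to binarization, which is the special case described above.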

**Interaction**

Is the process of producing a new feature by applying a mathematical operation to two existing features. For example, the two features `n_visits_week` and `n_purchases_week` can produce `n_purchases_visit` as `n_purchases_week / n_visits_week`.
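
A minimal sketch of such a ratio feature, assuming purchases per visit is defined as purchases divided by visits. The data values are made up, and treating zero visits as missing is one possible design choice, not the only one.

```python
import numpy as np
import pandas as pd

df_visits = pd.DataFrame({
    "n_visits_week": [10, 4, 0],
    "n_purchases_week": [2, 1, 0],
})

# Replace zero visits with NaN so the ratio is missing instead of inf or 0/0.
visits = df_visits["n_visits_week"].replace(0, np.nan)
df_visits["n_purchases_visit"] = df_visits["n_purchases_week"] / visits

print(df_visits)
```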

The following class implements all pairwise interaction terms, assuming the operations are commutative ($a+b=b+a$). I say this because some operators are not commutative, like $a-b$, $a/b$, and $a^b$, but the implementation generates each pair in one order only (for example `A / B` but never `B / A`).

In [2]:
import operator as op
import itertools
import numpy as np
import pandas as pd


class Interaction:
    op_map = {
        '+': op.add,
        '-': op.sub,
        '*': op.mul,
        '/': op.truediv,
        '^': op.pow
    }

    @staticmethod
    def build_interactions(df, operators, corr_filter=None):
        """
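        :param df: DataFrame - the full dataframe whose columns are the features to combine
        :param operators: list of operator strings; each must be a key of op_map
        :param corr_filter: correlation coefficient threshold used to filter feature pairs
        :return: dict mapping each operator string to a DataFrame of interaction columns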
        """
        return {op_str: Interaction._interaction_comb(df, op_str, corr_filter) for op_str in operators}

    @staticmethod
    def op_column_name(op_str, feature_a, feature_b):
        return f'{feature_a} {op_str} {feature_b}'

    @staticmethod
    def _interaction_comb(df, op_str, corr_filter):
        df_map = {}
        the_op = Interaction.op_map[op_str]
        for a, b in itertools.combinations(list(df), 2):
            df_map[Interaction.op_column_name(op_str, a, b)] = the_op(df[a], df[b])
        full_df = pd.DataFrame(df_map)
        if corr_filter is None:
            return full_df
        else:
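            # Keep only interactions of feature pairs whose correlation in the
            # original dataframe reaches the threshold. The correlations must
            # come from df (not full_df), since a and b are df's column names.
            corr_matrix = df.corr()
            good_corr_filters = [Interaction.op_column_name(op_str, a, b) for a, b in
                                 itertools.combinations(list(df), 2)
                                 if corr_matrix[a][b] >= abs(corr_filter)]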
            return full_df[good_corr_filters]


df = pd.DataFrame(np.random.randint(0, 100, size=(50, 4)), columns=list('ABCD'))
dfs = Interaction.build_interactions(df, ['+', '-', '/', '^', '*'])

print(dfs['+'].head(3), end='\n\n')
print(dfs['-'].head(3), end='\n\n')
print(dfs['*'].head(3), end='\n\n')
print(dfs['/'].head(3), end='\n\n')


   A + B  A + C  A + D  B + C  B + D  C + D
0     82     38    126     56    144    100
1     98    125    121    163    159    186
2     91     71    104     74    107     87

   A - B  A - C  A - D  B - C  B - D  C - D
0    -18     26    -62     44    -44    -88
1    -38    -65    -61    -27    -23      4
2     -3     17    -16     20    -13    -33

   A * B  A * C  A * D  B * C  B * D  C * D
0   1600    192   3008    300   4700    564
1   2040   2850   2730   6460   6188   8645
2   2068   1188   2640   1269   2820   1620

      A / B     A / C     A / D     B / C     B / D     C / D
0  0.640000  5.333333  0.340426  8.333333  0.531915  0.063830
1  0.441176  0.315789  0.329670  0.715789  0.747253  1.043956
2  0.936170  1.629630  0.733333  1.740741  0.783333  0.450000
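
If operand order matters for your features, one option is to generate both orders for the non-commutative operators. The sketch below is a hypothetical helper (`ordered_interactions` is not part of the `Interaction` class) that switches from `itertools.combinations` to `itertools.permutations` for `-`, `/`, and `^`.

```python
import itertools
import operator as op

import pandas as pd

# Hypothetical split of the operators by whether operand order matters.
COMMUTATIVE = {"+": op.add, "*": op.mul}
NON_COMMUTATIVE = {"-": op.sub, "/": op.truediv, "^": op.pow}


def ordered_interactions(df, op_str):
    """Build pairwise interaction columns, covering both operand orders when order matters."""
    if op_str in COMMUTATIVE:
        func = COMMUTATIVE[op_str]
        pairs = itertools.combinations(df.columns, 2)
    else:
        func = NON_COMMUTATIVE[op_str]
        pairs = itertools.permutations(df.columns, 2)
    return pd.DataFrame({f"{a} {op_str} {b}": func(df[a], df[b]) for a, b in pairs})


demo = pd.DataFrame({"A": [1.0, 2.0], "B": [4.0, 5.0], "C": [2.0, 10.0]})
sums = ordered_interactions(demo, "+")    # one column per unordered pair
ratios = ordered_interactions(demo, "/")  # both A / B and B / A are present
print(list(sums.columns))
print(list(ratios.columns))
```

This doubles the number of columns for the non-commutative operators, so it is worth combining with some form of feature filtering.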



**Polynomial Transformation**

Is the process of generating all combinations of polynomial and interaction features. For example, two features $a$ and $b$ have the degree-2 polynomial features $1$ (that is, $a^0 b^0$), $a$, $b$, $a^2$, $ab$, and $b^2$.

Two features with degree three contain all the polynomial features of degree 2 as a subset, plus the additional combinations of degree 3 ($a^3$, $a^2b$, $ab^2$, $b^3$). Under the hood, we can generalize all the possible combinations of features using a power matrix that operates on the features. For example, with two features, every polynomial term of degree at most $k$ can be described as $a^{k_1} b^{k_2}$ with $k_1 + k_2 \le k$.
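
The power matrix can be inspected directly: a fitted `PolynomialFeatures` exposes it as the `powers_` attribute, with one row per output feature and one column per input feature. The sketch below uses two features and degree 2; the sample values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features (a=2, b=3); the values are arbitrary.
X_ab = np.array([[2.0, 3.0]])
poly2 = PolynomialFeatures(degree=2).fit(X_ab)

# Row i holds the exponents (k_1, k_2) of output feature i.
print(poly2.powers_)

# Raising each input to its exponent and multiplying across features
# reproduces the transformer's output by hand.
manual = np.prod(X_ab[:, np.newaxis, :] ** poly2.powers_, axis=2)
print(manual)
```

Every row of the power matrix sums to at most the chosen degree, which is the constraint $k_1 + k_2 \le k$ in matrix form.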

In summary, with `PolynomialFeatures` we can generate a power matrix that describes all the possible combinations and interactions a model can have in terms of the variable number of degrees (power) and the number of features (columns).

For example:

In [3]:
import pandas as pd
from myutils.regression.utils import polynomial_function


max_ndegrees, max_nfeatures = 4, 4
# Outer list = rows (one per degree), inner list = columns (one per feature count).
demo_poly_matrix = [['' for _ in range(max_nfeatures+1)] for _ in range(max_ndegrees+1)]
for degree in range(max_ndegrees+1):
    for feature in range(1, max_nfeatures+1):
        demo_poly_matrix[degree][feature] = polynomial_function(degree, feature)
        
print('Columns = number of features')
print('Rows = number of degrees')

pd.DataFrame(demo_poly_matrix)

Columns = number of features
Rows = number of degrees


| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | | 1 | 1 | 1 | 1 |
| 1 | | 1+A | 1+A+B | 1+A+B+C | 1+A+B+C+D |
| 2 | | 1+A+A^2 | 1+A+B+A^2+A B+B^2 | 1+A+B+C+A^2+A B+A C+B^2+B C+C^2 | 1+A+B+C+D+A^2+A B+A C+A D+B^2+B C+B D+C^2+C D+D^2 |
| 3 | | 1+A+A^2+A^3 | 1+A+B+A^2+A B+B^2+A^3+A^2 B+A B^2+B^3 | 1+A+B+C+A^2+A B+A C+B^2+B C+C^2+A^3+A^2 B+A^2 ... | 1+A+B+C+D+A^2+A B+A C+A D+B^2+B C+B D+C^2+C D+... |
| 4 | | 1+A+A^2+A^3+A^4 | 1+A+B+A^2+A B+B^2+A^3+A^2 B+A B^2+B^3+A^4+A^3 ... | 1+A+B+C+A^2+A B+A C+B^2+B C+C^2+A^3+A^2 B+A^2 ... | 1+A+B+C+D+A^2+A B+A C+A D+B^2+B C+B D+C^2+C D+... |


Notice how we can entirely describe a purely linear model using a polynomial model. See the row for degree 1: a linear model is nothing more than a polynomial model of degree 1.

`PolynomialFeatures` works by taking all combinations of the features in the model together with the powers each feature can have, up to the chosen degree. We can then use this to transform our data to obtain _additional_ features based on the _same_ variables. The number of output features grows as $\frac{(n+d)!}{d!\,n!}$, where $d$ is the chosen degree and $n$ is the number of features in your model. It also includes interaction terms such as $ab$ if we have 2 features. In this sense, polynomial regression is capable of finding relationships between features that a purely linear model cannot.


In [4]:
from sklearn.preprocessing import PolynomialFeatures


poly = PolynomialFeatures(degree=3)
X_new = poly.fit_transform(X)
X_new

array([[ 1.,  4., 16., 64.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  3.,  9., 27.],
       [ 1.,  0.,  0.,  0.]])
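
The growth formula quoted above can be checked by brute force. The sketch below enumerates every monomial of total degree at most $d$ over $n$ features and compares the count with $\binom{n+d}{d} = \frac{(n+d)!}{d!\,n!}$. The helper name `count_polynomial_terms` is hypothetical and uses only the standard library.

```python
from itertools import combinations_with_replacement
from math import comb


def count_polynomial_terms(n_features, degree):
    """Count monomials of total degree <= degree by explicit enumeration."""
    return sum(
        1
        for d in range(degree + 1)
        # Each multiset of d feature indices corresponds to one monomial of degree d.
        for _ in combinations_with_replacement(range(n_features), d)
    )


for n, d in [(1, 3), (2, 2), (4, 4)]:
    assert count_polynomial_terms(n, d) == comb(n + d, d)
    print(f"n={n}, d={d}: {comb(n + d, d)} output features")
```

The case `n=1, d=3` gives 4, which matches the four columns in the output above.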