### Feature Engineering

There are many situations where you can avoid large, complicated models and instead use simple models with carefully engineered features.

**Feature engineering is not just about creating new features from data but also includes different types of normalization and transformations.**

### Dealing with Date and Time Data

In [None]:
df.loc[:, 'year'] = df['datetime_column'].dt.year
# .dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
df.loc[:, 'weekofyear'] = df['datetime_column'].dt.isocalendar().week
df.loc[:, 'month'] = df['datetime_column'].dt.month
df.loc[:, 'dayofweek'] = df['datetime_column'].dt.dayofweek
df.loc[:, 'weekend'] = (df['datetime_column'].dt.weekday >= 5).astype(int)
df.loc[:, 'hour'] = df['datetime_column'].dt.hour


Let’s see some of the sample features that can be created.

In [1]:
import pandas as pd

# create a series of datetime with a frequency of 10 hours
s = pd.date_range('2020-01-06', '2020-01-10', freq='10h').to_series()

# create some features based on datetime
features = {
 "dayofweek": s.dt.dayofweek.values,
 "dayofyear": s.dt.dayofyear.values,
 "hour": s.dt.hour.values,
 "is_leap_year": s.dt.is_leap_year.values,
 "quarter": s.dt.quarter.values,
 "weekofyear": s.dt.isocalendar().week.values
}




Using aggregates in pandas, it is quite easy to create features by grouping, for example one row of summary statistics per customer.

In [2]:
def generate_features(df):
 # create a bunch of features using the date column
 df.loc[:, 'year'] = df['date'].dt.year
 df.loc[:, 'weekofyear'] = df['date'].dt.isocalendar().week
 df.loc[:, 'month'] = df['date'].dt.month
 df.loc[:, 'dayofweek'] = df['date'].dt.dayofweek
 df.loc[:, 'weekend'] = (df['date'].dt.weekday >=5).astype(int)
 
 # create an aggregate dictionary
 aggs = {}

 # for aggregation by month, we calculate the
 # number of unique month values and also the mean
 aggs['month'] = ['nunique', 'mean']
 aggs['weekofyear'] = ['nunique', 'mean']

 # we aggregate by num1 and calculate sum, max, min 
 # and mean values of this column
 aggs['num1'] = ['sum','max','min','mean']

 # for customer_id, we calculate the total count (size)
 # and the number of unique values in one list; assigning
 # the same key twice would overwrite the first entry
 aggs['customer_id'] = ['size', 'nunique']
 
 # we group by customer_id and calculate the aggregates
 agg_df = df.groupby('customer_id').agg(aggs)
 agg_df = agg_df.reset_index()
 return agg_df
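
As a minimal sketch of the same idea on data small enough to check by hand, the snippet below builds a hypothetical five-row transaction table (the column names and values are made up for illustration) and uses pandas named aggregation, which gives flat column names instead of the two-level columns produced by a dictionary of lists.

```python
import pandas as pd

# hypothetical transactions: 3 rows for customer 1, 2 rows for customer 2
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-06", "2020-01-11", "2020-02-03",
        "2020-01-07", "2020-03-01",
    ]),
    "customer_id": [1, 1, 1, 2, 2],
    "num1": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# row-level date features
df["month"] = df["date"].dt.month
df["weekend"] = (df["date"].dt.weekday >= 5).astype(int)

# one row per customer; each keyword is output_name=(column, function)
agg_df = (
    df.groupby("customer_id")
    .agg(
        n_rows=("num1", "size"),
        month_nunique=("month", "nunique"),
        num1_sum=("num1", "sum"),
        num1_mean=("num1", "mean"),
        weekend_share=("weekend", "mean"),
    )
    .reset_index()
)
print(agg_df)
```

The resulting frame can be merged back onto the row-level data on `customer_id` if the aggregates are needed as features for every transaction.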

- When dealing with time-series problems, you might have features that are not individual values but lists of values, for example the transactions made by a customer in a given period of time.

- This typically happens with numerical features when you group on a categorical column: each group yields a list of values distributed over time.
- In these cases, you can create a bunch of statistical features from each list.

In [None]:
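import numpy as np

# x is assumed to be a list (or 1-D array) of numerical values,
# e.g. all the transaction amounts of one customer
feature_dict = {}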
# calculate mean
feature_dict['mean'] = np.mean(x)

# calculate max
feature_dict['max'] = np.max(x)

# calculate min
feature_dict['min'] = np.min(x)

# calculate standard deviation
feature_dict['std'] = np.std(x)

# calculate variance
feature_dict['var'] = np.var(x)

# peak-to-peak
feature_dict['ptp'] = np.ptp(x)

# percentile features
feature_dict['percentile_10'] = np.percentile(x, 10)
feature_dict['percentile_60'] = np.percentile(x, 60)
feature_dict['percentile_90'] = np.percentile(x, 90)

# quantile features
feature_dict['quantile_5'] = np.quantile(x, 0.05)
feature_dict['quantile_95'] = np.quantile(x, 0.95)
feature_dict['quantile_99'] = np.quantile(x, 0.99)
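
To connect these statistics to the grouping described earlier, here is a minimal sketch that collects one list of values per customer and turns each list into a row of features. The `transactions` table, its column names, and the helper `stat_features` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd


def stat_features(x):
    """Return a few summary statistics for a list of numbers."""
    x = np.asarray(x, dtype=float)
    return {
        "mean": np.mean(x),
        "max": np.max(x),
        "min": np.min(x),
        "std": np.std(x),
        "ptp": np.ptp(x),
        "percentile_90": np.percentile(x, 90),
    }


# hypothetical data: several transaction amounts per customer
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# one list of values per customer
per_customer = transactions.groupby("customer_id")["amount"].apply(list)

# one row of statistical features per customer
stat_df = pd.DataFrame(
    {cid: stat_features(values) for cid, values in per_customer.items()}
).T
stat_df.index.name = "customer_id"
print(stat_df)
```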

Time series data (a list of values) can be converted into many more features. The tsfresh library provides calculators for a number of them.

In [None]:
from tsfresh.feature_extraction import feature_calculators as fc

# tsfresh based features
feature_dict['abs_energy'] = fc.abs_energy(x)
feature_dict['count_above_mean'] = fc.count_above_mean(x)
feature_dict['count_below_mean'] = fc.count_below_mean(x)
feature_dict['mean_abs_change'] = fc.mean_abs_change(x)
feature_dict['mean_change'] = fc.mean_change(x)
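
To make clear what these five calculators measure, the sketch below computes the same quantities with plain NumPy, following their documented definitions. It is an illustration under that assumption rather than a replacement for tsfresh, and the sample values in `x` are made up.

```python
import numpy as np

# made-up series of four values
x = np.array([1.0, 3.0, 2.0, 6.0])

# differences between consecutive values: [2, -1, 4]
diffs = np.diff(x)

numpy_feats = {
    # sum of squared values
    "abs_energy": np.sum(x ** 2),
    # number of values strictly above / below the mean
    "count_above_mean": int(np.sum(x > x.mean())),
    "count_below_mean": int(np.sum(x < x.mean())),
    # mean of absolute consecutive differences
    "mean_abs_change": np.mean(np.abs(diffs)),
    # mean of consecutive differences
    "mean_change": np.mean(diffs),
}
print(numpy_feats)
```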

### Polynomial Features

A simple way to generate many features is just to create a bunch of polynomial features. For example, a second-degree polynomial feature from two features “a” and “b” would include: “a”, “b”, “ab”, “a²” and “b²”.

In [4]:
import numpy as np
# generate a random dataframe with 
# 2 columns and 100 rows
df = pd.DataFrame(
 np.random.rand(100, 2),
 columns=[f"f_{i}" for i in range(1, 3)]
)
df

| | f_1 | f_2 |
|---|---|---|
| 0 | 0.647649 | 0.680086 |
| 1 | 0.027097 | 0.765764 |
| 2 | 0.093894 | 0.539776 |
| 3 | 0.214612 | 0.587476 |
| 4 | 0.353642 | 0.842006 |
| ... | ... | ... |
| 95 | 0.111657 | 0.818410 |
| 96 | 0.172780 | 0.792059 |
| 97 | 0.385267 | 0.175224 |
| 98 | 0.958516 | 0.841332 |


We can create second-degree polynomial features using PolynomialFeatures from scikit-learn.

In [5]:
from sklearn import preprocessing
# initialize polynomial features class object
# for two-degree polynomial features
pf = preprocessing.PolynomialFeatures(
 degree=2,
 interaction_only=False,
 include_bias=False
)
# fit to the features
pf.fit(df)

# create polynomial features
poly_feats = pf.transform(df)

# create a dataframe with all the features
num_feats = poly_feats.shape[1]
df_transformed = pd.DataFrame(
 poly_feats,
 columns=[f"f_{i}" for i in range(1, num_feats + 1)]
)

In [6]:
df_transformed

| | f_1 | f_2 | f_3 | f_4 | f_5 |
|---|---|---|---|---|---|
| 0 | 0.647649 | 0.680086 | 0.419449 | 0.440457 | 0.462517 |
| 1 | 0.027097 | 0.765764 | 0.000734 | 0.020750 | 0.586395 |
| 2 | 0.093894 | 0.539776 | 0.008816 | 0.050682 | 0.291358 |
| 3 | 0.214612 | 0.587476 | 0.046058 | 0.126079 | 0.345129 |
| 4 | 0.353642 | 0.842006 | 0.125063 | 0.297769 | 0.708974 |
| ... | ... | ... | ... | ... | ... |
| 95 | 0.111657 | 0.818410 | 0.012467 | 0.091381 | 0.669796 |
| 96 | 0.172780 | 0.792059 | 0.029853 | 0.136852 | 0.627358 |
| 97 | 0.385267 | 0.175224 | 0.148431 | 0.067508 | 0.030703 |
| 98 | 0.958516 | 0.841332 | 0.918753 | 0.806431 | 0.707840 |


### Binning

Another interesting technique converts numbers to categories. It’s known as binning.

In [7]:
# create bins of the numerical columns
# 10 bins
df["f_bin_10"] = pd.cut(df["f_1"], bins=10, labels=False)

# 100 bins
df["f_bin_100"] = pd.cut(df["f_1"], bins=100, labels=False)

In [8]:
df

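| | f_1 | f_2 | f_bin_10 | f_bin_100 |
|---|---|---|---|---|
| 0 | 0.647649 | 0.680086 | 6 | 64 |
| 1 | 0.027097 | 0.765764 | 0 | 2 |
| 2 | 0.093894 | 0.539776 | 0 | 9 |
| 3 | 0.214612 | 0.587476 | 2 | 21 |
| 4 | 0.353642 | 0.842006 | 3 | 35 |
| ... | ... | ... | ... | ... |
| 95 | 0.111657 | 0.818410 | 1 | 11 |
| 96 | 0.172780 | 0.792059 | 1 | 17 |
| 97 | 0.385267 | 0.175224 | 3 | 38 |
| 98 | 0.958516 | 0.841332 | 9 | 96 |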


Binning also enables you to treat numerical features as categorical.
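
As a minimal sketch of how `pd.cut` assigns values to equal-width bins, the snippet below bins six made-up values into four bins and also returns the bin edges through `retbins=True`. Converting the integer codes to the pandas `category` dtype is one possible way to treat the result as categorical; it is an assumption here, not something the cells above do.

```python
import pandas as pd

# made-up values between 0 and 1
values = pd.Series([0.0, 0.1, 0.3, 0.6, 0.9, 1.0])

# four equal-width bins; labels=False returns integer bin indices,
# retbins=True additionally returns the bin edges
codes, edges = pd.cut(values, bins=4, labels=False, retbins=True)

# one possible way to treat the bin index as a categorical feature
binned_as_category = codes.astype("category")

print(list(codes))  # [0, 0, 1, 2, 3, 3]
print(edges)
```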

### Log Transformation

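Yet another interesting type of feature that you can create from numerical features is the log transformation. Taking the logarithm compresses large values, so it is useful for reducing the variance of a feature whose values span several orders of magnitude.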

In [None]:
# assumes df has a numerical column f_3 (not created in the cells above)
df.f_3.apply(lambda x: np.log(1 + x)).var()
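
A minimal, self-contained sketch of the effect on variance, using made-up values that span several orders of magnitude. `np.log1p(v)` computes `log(1 + v)`, the same transformation as in the cell above.

```python
import numpy as np

# made-up feature values spanning several orders of magnitude
values = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

# variance before and after the log(1 + x) transformation
var_raw = values.var()
var_log = np.log1p(values).var()

print(var_raw, var_log)
```

The variance of the transformed values is far smaller than that of the raw values, because the logarithm pulls the largest values much closer to the rest.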

### Handling Missing/NaN Values

- **For categorical features,** let’s keep it super simple. 

- If you ever encounter missing values in categorical features, treat them as a **new category!** As simple as this is, it (almost) always works!

- **Numerical Data**

- Choose a value that does not appear in the feature and fill with it. For example, let’s say 0 is not seen in the feature, so we fill all the missing values with 0. This is one of the ways but might not be the most effective.

- One method that often works better than filling with 0s for numerical data is to fill with the mean instead.
- You can also try to fill with the median of all the values for that feature.
- Or you can use the most common value to fill the missing values.
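
The simple strategies above can be sketched with pandas `fillna`. The table, its column names, and the placeholder label `"NONE"` are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical data with one missing value in each column
data = pd.DataFrame({
    "city": ["a", None, "b", "a", "b"],
    "age": [20.0, np.nan, 40.0, 20.0, 30.0],
})

# categorical: treat missing values as a new category
data["city_filled"] = data["city"].fillna("NONE")

# numerical: constant, mean, median and most common value
data["age_zero"] = data["age"].fillna(0)
data["age_mean"] = data["age"].fillna(data["age"].mean())
data["age_median"] = data["age"].fillna(data["age"].median())
data["age_mode"] = data["age"].fillna(data["age"].mode().iloc[0])

print(data)
```

When a train/validation split is involved, the fill value would normally be computed on the training part only and then reused for the other parts.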

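#### Using k-Nearest Neighbours

- You can use the KNN imputer implementation from scikit-learn to fill missing values. Each missing value is replaced by the mean of that feature over the k nearest samples, where distances are computed on the features that are present.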

In [13]:
import numpy as np
from sklearn import impute

# create a random numpy array with 10 samples
# and 6 features and integer values from 1 to 14
# (the upper bound passed to randint is exclusive)
X = np.random.randint(1, 15, (10, 6))

# convert the array to float
X = X.astype(float)

# randomly assign 10 elements to NaN (missing)
X.ravel()[np.random.choice(X.size, 10, replace=False)] = np.nan

X


array([[ 9., nan, nan,  7., 14.,  6.],
       [14., nan, nan,  4., nan,  2.],
       [ 5., nan,  7.,  9.,  8.,  7.],
       [ 6., 12.,  5.,  8.,  6., nan],
       [ 1., 13.,  5.,  4., 12.,  3.],
       [ 1.,  8.,  4.,  3.,  1.,  7.],
       [10.,  7.,  5.,  2., 11.,  4.],
       [ 5., nan,  6.,  6., 10., 12.],
       [nan, 12.,  1., 10.,  6.,  4.],
       [13.,  3., 12.,  6.,  3., nan]])

In [14]:
# use 2 nearest neighbours to fill na values
knn_imputer = impute.KNNImputer(n_neighbors=2)
knn_imputer.fit_transform(X)

array([[ 9. , 10. ,  6. ,  7. , 14. ,  6. ],
       [14. ,  5. ,  8.5,  4. ,  7. ,  2. ],
       [ 5. , 12. ,  7. ,  9. ,  8. ,  7. ],
       [ 6. , 12. ,  5. ,  8. ,  6. ,  5.5],
       [ 1. , 13. ,  5. ,  4. , 12. ,  3. ],
       [ 1. ,  8. ,  4. ,  3. ,  1. ,  7. ],
       [10. ,  7. ,  5. ,  2. , 11. ,  4. ],
       [ 5. , 12.5,  6. ,  6. , 10. , 12. ],
       [ 5.5, 12. ,  1. , 10. ,  6. ,  4. ],
       [13. ,  3. , 12. ,  6. ,  3. ,  4.5]])
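
Because the array above is random, the result is hard to verify by eye. A minimal sketch on a made-up 4x3 array makes the neighbour averaging easy to check: the missing value in the first row becomes the mean of the third column in its two nearest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# made-up array with two missing entries
X_small = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_small)

# row 0, column 2: nearest rows are 1 and 2, so (3 + 5) / 2 = 4
# row 2, column 0: nearest rows are 1 and 3, so (3 + 8) / 2 = 5.5
print(X_filled)
```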

- Another way of imputing missing values in a column is to train a regression model that predicts the missing values based on the other columns.

- You start with one column that has missing values and treat it as the target column for a regression model, setting aside the samples where it is missing.

- Using all the other columns as features, you train the model on the samples for which the concerned column has no missing value.

- You then predict the target (the same column) for the samples that were set aside earlier.
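
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the table and the column names `a`, `b` and `target_col` are hypothetical, the other columns are assumed to be complete, and `LinearRegression` is just one possible choice of regression model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical data: target_col is an exact linear function of a and b
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
})
df["target_col"] = 2.0 * df["a"] + 3.0 * df["b"] + 1.0

# knock out two values to simulate missing data
df.loc[[1, 4], "target_col"] = np.nan

feature_cols = ["a", "b"]
missing = df["target_col"].isna()

# train only on samples where the column is present
model = LinearRegression()
model.fit(df.loc[~missing, feature_cols], df.loc[~missing, "target_col"])

# predict the column for the samples that were set aside
df.loc[missing, "target_col"] = model.predict(df.loc[missing, feature_cols])
print(df)
```

With several incomplete columns, the same procedure would be repeated column by column, which requires some initial fill for the feature columns.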

### Notes

**Imputing values is often unnecessary for tree-based models, as many implementations (for example XGBoost and LightGBM) can handle missing values themselves.**

**Always remember to scale or normalize your features if you are using linear models like logistic regression or a model like SVM.**

- Tree-based models will always work fine without any normalization of features.
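
A minimal sketch of scaling before a linear model, using scikit-learn's `StandardScaler` inside a pipeline so that the scaling statistics are learned from the training data only. The data is made up, and standardization is one of several possible scaling choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# made-up data: the second feature is on a much larger scale
X = np.array([
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 2000.0],
    [4.0, 6000.0],
    [5.0, 5000.0],
    [6.0, 7000.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# standardization: each column gets zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))

# scaler and model chained, so new data is scaled the same way
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
preds = model.predict(X)
print(preds)
```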
