## Feature Engineering 

**Feature engineering is not only about creating new features from data; it also includes different types of normalization and transformations.**

Let us say that you are dealing with **date and time data**, and we have a pandas dataframe with a datetime column. Using this column, we can create features like:

* Year
* Week of year
* Month
* Day of week
* Weekend
* Hour
* And many more.


**The code below shows how we can implement it.**

* `df.loc[:, "year"] = df["date"].dt.year`
* `df.loc[:, "weekofyear"] = df["date"].dt.isocalendar().week.astype(int)`
* `df.loc[:, "month"] = df["date"].dt.month`
* `df.loc[:, "dayofweek"] = df["date"].dt.dayofweek`
* `df.loc[:, "weekend"] = (df["date"].dt.weekday >= 5).astype(int)`
* `df.loc[:, "hour"] = df["date"].dt.hour`
     

In [4]:
import pandas as pd

#create a datetime series with a frequency of 10 hours
s = pd.date_range('2020-01-06','2020-01-10',freq = '10H').to_series()

#create some features based on datetime 
features = {
    "dayofweek": s.dt.dayofweek.values,
    "dayofyear": s.dt.dayofyear.values,
    "hour": s.dt.hour.values,
    "is_leap_year": s.dt.is_leap_year.values,
    "quarter": s.dt.quarter.values,
    "weekofyear": s.dt.isocalendar().week.astype(int).values
}

In [5]:
features

{'dayofweek': array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3]),
 'dayofyear': array([6, 6, 6, 7, 7, 8, 8, 8, 9, 9]),
 'hour': array([ 0, 10, 20,  6, 16,  2, 12, 22,  8, 18]),
 'is_leap_year': array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True]),
 'quarter': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 'weekofyear': array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])}

The above code generates a dictionary of features from a given series. 
* You can apply this to any datetime column in a pandas dataframe. These are some of the many datetime features that pandas offers. 
* Datetime features are critical when you are dealing with time-series data, for example, when you are predicting the sales of a store but would like to use a model like xgboost on aggregated features. 


Now let's create a dataframe.

In [2]:
data = {"date": ['2016-09-01','2017-04-01','2017-08-01','2017-12-01','2017-09-01'],
        "customer_id": [146361,180838,157857,159772,80014],
        "cart1": [2,4,3,5,3],
        "cart2": [2,1,3,1,2],
        "cart3": [0,0,1,1,1],
        "num1": [-0.518679,0.415853,-2.061687,-0.276558,-1.456827]}



In [5]:
new_data = pd.DataFrame.from_dict(data)

In [6]:
new_data

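| | date | customer_id | cart1 | cart2 | cart3 | num1 |
|---|---|---|---|---|---|---|
| 0 | 2016-09-01 | 146361 | 2 | 2 | 0 | -0.518679 |
| 1 | 2017-04-01 | 180838 | 4 | 1 | 0 | 0.415853 |
| 2 | 2017-08-01 | 157857 | 3 | 3 | 1 | -2.061687 |
| 3 | 2017-12-01 | 159772 | 5 | 1 | 1 | -0.276558 |
| 4 | 2017-09-01 | 80014 | 3 | 2 | 1 | -1.456827 |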


In [12]:
#convert the date column from strings to datetime
new_data["date"] = pd.to_datetime(new_data["date"])

Now we use aggregation to create more features.

In [15]:
def generate_features(df):
    
    df.loc[:,"year"] = df["date"].dt.year
    df.loc[:,"weekofyear"] = df["date"].dt.isocalendar().week.astype(int)
    df.loc[:,"month"] = df["date"].dt.month
    df.loc[:,"dayofweek"] = df["date"].dt.dayofweek
    df.loc[:,"weekend"] = (df["date"].dt.weekday >= 5).astype(int)
    
    #create an aggregate dictionary 
    
    aggs = {} 
    #for aggregation by month, we calculate the
    #number of unique month values and also the mean 
    aggs["month"] = ["nunique","mean"]
    
    #for aggregation by weekofyear, we calculate the
    #number of unique weekofyear values and also the mean
    aggs["weekofyear"] = ["nunique","mean"]
    
    #we aggregate by num1 and calculate sum,max,min 
    #and mean values of this column 
    aggs["num1"] = ["sum","max","min","mean"]
    
    #for customer_id, we calculate the number of unique values; 
    #a dictionary key holds only one list, so use ["size","nunique"] 
    #here if the total count is needed as well 
    aggs["customer_id"] = ["nunique"]
    
    #we group by customer_id and calculate the aggregates 
    agg_df = df.groupby("customer_id").agg(aggs)
    agg_df = agg_df.reset_index() 
    
    return agg_df

In [21]:
new_data1 = generate_features(new_data)

In [22]:
new_data1

| | customer_id | month nunique | month mean | weekofyear nunique | weekofyear mean | num1 sum | num1 max | num1 min | num1 mean | customer_id nunique |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80014 | 1 | 9 | 1 | 35 | -1.456827 | -1.456827 | -1.456827 | -1.456827 | 1 |
| 1 | 146361 | 1 | 9 | 1 | 35 | -0.518679 | -0.518679 | -0.518679 | -0.518679 | 1 |
| 2 | 157857 | 1 | 8 | 1 | 31 | -2.061687 | -2.061687 | -2.061687 | -2.061687 | 1 |
| 3 | 159772 | 1 | 12 | 1 | 48 | -0.276558 | -0.276558 | -0.276558 | -0.276558 | 1 |
| 4 | 180838 | 1 | 4 | 1 | 13 | 0.415853 | 0.415853 | 0.415853 | 0.415853 | 1 |


Hence, we get the above features after aggregating.

Here we are not trying to predict anything; we are just creating generic features. However, it would have been easier to create features if we were trying to predict something here.

Sometimes, for example when dealing with time-series problems, you might have features which are not individual values but a list of values, such as the transactions by a customer in a given period of time. This typically happens with numerical features when you group on a categorical column: each group yields a list of time-distributed values. In these cases, you can create a bunch of statistical features such as:

* Mean
* Max
* Unique
* Skew
* Kurtosis
* Kstat 
* Percentile
* Quantile
* Peak to peak
* And many more 

These can be created using numpy functions, as shown in the following Python snippet.


In [None]:
import numpy as np 

#x is a list of values; a small example list is assumed here 
x = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0] 

feature_dict = {} 

#calculate mean 
feature_dict['mean'] = np.mean(x) 

#calculate max 
feature_dict['max'] = np.max(x) 

#calculate min 
feature_dict['min'] = np.min(x) 

#calculate standard deviation 
feature_dict['std'] = np.std(x) 

#calculate variance 
feature_dict['var'] = np.var(x) 

#peak-to-peak 
feature_dict['ptp'] = np.ptp(x) 

#percentile features 
feature_dict['percentile_10'] = np.percentile(x, 10) 
feature_dict['percentile_60'] = np.percentile(x, 60) 
feature_dict['percentile_90'] = np.percentile(x, 90) 

#quantile features (np.quantile takes a fraction between 0 and 1) 
feature_dict['quantile_5'] = np.quantile(x, 0.05) 
feature_dict['quantile_95'] = np.quantile(x, 0.95) 
feature_dict['quantile_99'] = np.quantile(x, 0.99) 
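
The list above also mentions unique values, skew, kurtosis and kstat, which the numpy snippet does not cover. A minimal sketch of how these could be computed with `scipy.stats` follows; the array `x` and the dictionary name `extra_features` are hypothetical and only serve as an illustration.

```python
import numpy as np
from scipy import stats

# hypothetical list of values, e.g. transaction amounts of one customer
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 30.0])

extra_features = {
    # number of distinct values in the list
    "n_unique": len(np.unique(x)),
    # skewness: asymmetry of the distribution
    "skew": stats.skew(x),
    # excess kurtosis (Fisher definition, 0 for a normal distribution)
    "kurtosis": stats.kurtosis(x),
    # second and third k-statistics (unbiased estimators of the cumulants)
    "kstat_2": stats.kstat(x, 2),
    "kstat_3": stats.kstat(x, 3),
}
```

The second k-statistic is the unbiased sample variance, so it should agree with `np.var(x, ddof=1)`.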


Time series data (a list of values) can be converted into a lot of features.

A Python library called **tsfresh** is very useful in such cases.

In [None]:
from tsfresh.feature_extraction import feature_calculators as fc 

#tsfresh based features 
feature_dict['abs_energy'] = fc.abs_energy(x) 
feature_dict['count_above_mean'] = fc.count_above_mean(x) 
feature_dict['count_below_mean'] = fc.count_below_mean(x) 
feature_dict['mean_abs_change'] = fc.mean_abs_change(x) 
feature_dict['mean_change'] = fc.mean_change(x) 

This is not all; tsfresh offers hundreds of features and tens of variations of different features that you can use for time series (list of values) based features. In the examples above, x is a list of values.

There are many other features that you can create for numerical data with or without categorical data. A simple way to generate many features is to create a bunch of **polynomial features.** For example, second-degree polynomial features from two features “a” and “b” would include: “a”, “b”, “ab”, “a²” and “b²”. 

In [17]:
import numpy as np 

#generate a random dataframe with 
#2 columns and 100 rows 
df = pd.DataFrame(
    np.random.rand(100,2),
    columns = [f"f_{i}" for i in range(1,3)]
)

In [18]:
df.head(10)

| | f_1 | f_2 |
|---|---|---|
| 0 | 0.924478 | 0.316386 |
| 1 | 0.876845 | 0.442913 |
| 2 | 0.861783 | 0.113082 |
| 3 | 0.397906 | 0.658547 |
| 4 | 0.178985 | 0.953942 |
| 5 | 0.158267 | 0.619018 |
| 6 | 0.065997 | 0.767432 |
| 7 | 0.755679 | 0.614073 |
| 8 | 0.7248 | 0.048357 |
| 9 | 0.315393 | 0.981535 |


We can create second-degree polynomial features using PolynomialFeatures from scikit-learn.

In [19]:
from sklearn import preprocessing 

#initialize polynomial features class object 
#for two-degree polynomial features 

pf = preprocessing.PolynomialFeatures( degree=2, interaction_only=False, include_bias=False ) 

#fit to the features 
pf.fit(df) 

#create polynomial features 
poly_feats = pf.transform(df) 

#create a dataframe with all the features 
num_feats = poly_feats.shape[1] 
df_transformed = pd.DataFrame( poly_feats, columns=[f"f_{i}" for i in range(1, num_feats + 1)] ) 

In [22]:
df_transformed.head(10)

| | f_1 | f_2 | f_3 | f_4 | f_5 |
|---|---|---|---|---|---|
| 0 | 0.924478 | 0.316386 | 0.854659 | 0.292492 | 0.1001 |
| 1 | 0.876845 | 0.442913 | 0.768856 | 0.388366 | 0.196172 |
| 2 | 0.861783 | 0.113082 | 0.742671 | 0.097453 | 0.012788 |
| 3 | 0.397906 | 0.658547 | 0.158329 | 0.26204 | 0.433685 |
| 4 | 0.178985 | 0.953942 | 0.032036 | 0.170741 | 0.910006 |
| 5 | 0.158267 | 0.619018 | 0.025048 | 0.09797 | 0.383184 |
| 6 | 0.065997 | 0.767432 | 0.004356 | 0.050648 | 0.588952 |
| 7 | 0.755679 | 0.614073 | 0.571051 | 0.464042 | 0.377085 |
| 8 | 0.7248 | 0.048357 | 0.525336 | 0.035049 | 0.002338 |
| 9 | 0.315393 | 0.981535 | 0.099473 | 0.30957 | 0.963411 |


If we create third-degree polynomial features then there will be nine features in total.
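
The count of nine can be checked with a short sketch. The random input `X` and the names `pf3` and `poly3` are assumptions made for this illustration; `powers_` lists the exponent of each input feature in every output column.

```python
import numpy as np
from sklearn import preprocessing

# hypothetical data: 100 rows and 2 features, as in the example above
rng = np.random.default_rng(42)
X = rng.random((100, 2))

# third-degree polynomial features without the constant (bias) column
pf3 = preprocessing.PolynomialFeatures(
    degree=3, interaction_only=False, include_bias=False
)
poly3 = pf3.fit_transform(X)

# number of generated columns
n_features = poly3.shape[1]

# one row per output column: exponents of the two input features
exponents = pf3.powers_
```

For two inputs "a" and "b", the nine columns correspond to a, b, a², ab, b², a³, a²b, ab² and b³.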

<img src="feature_eng1.jpeg">

Another interesting technique converts numbers to categories. It's known as **binning**. Let's look at figure 5, which shows a sample histogram of a random numerical feature. We use ten bins for this figure, and we see that we can divide the data into ten parts. This is accomplished using the pandas *cut* function.

In [23]:
#create bins of the numerical column 
#with 10 bins and with 100 bins 
df["f_bin_10"] = pd.cut(df["f_1"],bins = 10,labels = False)
df["f_bin_100"] = pd.cut(df["f_1"],bins = 100,labels = False)

In [24]:
df.head(10)

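| | f_1 | f_2 | f_bin_10 | f_bin_100 |
|---|---|---|---|---|
| 0 | 0.924478 | 0.316386 | 9 | 92 |
| 1 | 0.876845 | 0.442913 | 8 | 88 |
| 2 | 0.861783 | 0.113082 | 8 | 86 |
| 3 | 0.397906 | 0.658547 | 3 | 39 |
| 4 | 0.178985 | 0.953942 | 1 | 17 |
| 5 | 0.158267 | 0.619018 | 1 | 14 |
| 6 | 0.065997 | 0.767432 | 0 | 5 |
| 7 | 0.755679 | 0.614073 | 7 | 75 |
| 8 | 0.7248 | 0.048357 | 7 | 72 |
| 9 | 0.315393 | 0.981535 | 3 | 30 |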


**Binning enables you to treat numerical features as categorical.**
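
To see what the bins look like, `pd.cut` can also return the bin edges. The sketch below uses a hypothetical random series named `values`; it is only meant to illustrate that `bins=10` produces ten equal-width intervals and that the resulting bin ids can be counted like any categorical column.

```python
import numpy as np
import pandas as pd

# hypothetical numerical feature with values between 0 and 1
rng = np.random.default_rng(0)
values = pd.Series(rng.random(1000))

# labels=False returns integer bin ids, retbins=True also returns the edges
bin_ids, edges = pd.cut(values, bins=10, labels=False, retbins=True)

# how many samples fall into each bin
counts = bin_ids.value_counts().sort_index()
```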

Yet another interesting type of feature that you can create from numerical features is a log transformation.
Take a look at the table below. 

<img src="feature_eng2.jpeg">


* f_3 is a special feature with a very high variance compared to the other features, which have a low variance. Thus, we would want to reduce the variance of this column, and that can be done by taking a log transformation.

* The values of column f_3 range from 0 to 10000, and a histogram is shown below.

<img src="feature_eng3.jpeg">


We can apply log(1 + x) to this column to reduce its variance. Figure 9 shows what happens to the histogram when the log transformation is applied.
<img src="feature_eng3.jpeg">
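
The effect on the variance can be sketched as follows. The series `f_3` here is randomly generated stand-in data in the range 0 to 10000, not the data from the table above.

```python
import numpy as np
import pandas as pd

# hypothetical column with integer values between 0 and 10000
rng = np.random.default_rng(0)
f_3 = pd.Series(rng.integers(0, 10001, size=1000), name="f_3")

# variance before the transformation
var_before = f_3.var()

# log(1 + x); log1p is safe when x is 0
f_3_log = np.log1p(f_3)

# variance after the transformation
var_after = f_3_log.var()
```

Comparing `var_before` and `var_after` shows how strongly the transformation compresses the range of the column.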


Sometimes, instead of log, you can also take exponential. A very interesting case is when you use a log-based evaluation metric, for example, RMSLE. In that case, you can train on log-transformed targets and convert back to original using exponential on the prediction. That would help optimize the model for the metric.
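
A rough sketch of this idea is shown below. The data is synthetic, and `LinearRegression` is used only as a placeholder model; any regressor could take its place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical features and a positive, skewed target
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = np.exp(3 * X[:, 0] + rng.normal(0, 0.1, 200))

# train on the log-transformed target: log(1 + y)
model = LinearRegression()
model.fit(X, np.log1p(y))

# convert predictions back to the original scale: exp(p) - 1
preds = np.expm1(model.predict(X))

# RMSLE computed on the original scale
rmsle = np.sqrt(np.mean((np.log1p(preds) - np.log1p(y)) ** 2))
```

Because the model minimizes squared error in log space, it is directly optimizing the quantity that RMSLE measures.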


When dealing with both categorical and numerical variables, you might encounter missing values. In the previous chapter, we saw a few ways to handle missing values, but there are many more. Handling missing values is also considered feature engineering.


For categorical features, let’s keep it super simple. If you ever encounter missing values in categorical features, treat them as a new category! As simple as this is, it (almost) always works! 
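
In pandas, this amounts to a single `fillna` call. The dataframe `df_cat`, the column name `color` and the placeholder label `"NONE"` are made up for this sketch.

```python
import numpy as np
import pandas as pd

# hypothetical categorical column with two missing values
df_cat = pd.DataFrame({"color": ["red", np.nan, "blue", "red", np.nan]})

# treat missing values as their own category
df_cat["color"] = df_cat["color"].fillna("NONE")

# the new category now shows up in the counts
counts = df_cat["color"].value_counts()
```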


One way to fill missing values in numerical data would be to choose a value that does not appear in the specific feature and fill using that. For example, let’s say 0 is not seen in the feature. So, we fill all the missing values using 0. This is one of the ways but might not be the most effective. One of the methods that works better than filling 0s for numerical data is to fill with mean instead. You can also try to fill with the median of all the values for that feature, or you can use the most common value to fill the missing values. There are just so many ways to do this. 
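
The mean, median and most-common-value strategies can be sketched with scikit-learn's `SimpleImputer`. The single-column array `col` is a made-up example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical feature column with one missing value
col = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

# fill the missing value with the mean, the median and the most common value
filled = {
    strategy: SimpleImputer(strategy=strategy).fit_transform(col).ravel()
    for strategy in ["mean", "median", "most_frequent"]
}
```

Here the mean of the observed values is 3.75, while the median and the most frequent value are both 2, so the choice of strategy can change the filled value noticeably when the column is skewed.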


A fancy way of filling in the missing values would be to use a k-nearest neighbour method. You can select a sample with missing values and find the nearest neighbours utilising some kind of distance metric, for example, Euclidean distance. Then you can take the mean of all nearest neighbours and fill up the missing value. You can use the KNN imputer implementation for filling missing values like this.

<img src="feature_eng5.jpeg">

Let's see how a matrix with missing values, as shown in the figure, is handled by **KNNImputer**.



In [4]:
import numpy as np
from sklearn import impute

#create a random numpy array with 10 samples 
#and 6 features and values ranging from 1 to 14 

x = np.random.randint(1,15,(10,6))

#convert the array to float 
x = x.astype(float)

#randomly assign 10 elements to NaN
x.ravel()[np.random.choice(x.size,10,replace = False)] = np.nan

#use 2 nearest neighbours to fill NaN values 
knn_imputer = impute.KNNImputer(n_neighbors = 2)
knn_imputer.fit_transform(x)

array([[11. ,  9. , 10. , 11. ,  6. ,  5. ],
       [ 2. , 11. ,  5. ,  5. , 13. ,  7. ],
       [ 5. ,  4. ,  1. , 10. ,  6. ,  5.5],
       [ 4. ,  8. ,  2. ,  9. ,  4. ,  5.5],
       [ 2. , 14. ,  8. ,  9. ,  7.5,  6. ],
       [ 1. ,  5. ,  1. , 14. ,  1. ,  7. ],
       [ 8. , 11. , 11. , 13. ,  9. ,  5. ],
       [ 7.5,  7. ,  8. ,  8. ,  7. ,  4. ],
       [ 7. ,  9. , 14. , 13. , 12. ,  2. ],
       [ 5. , 11. ,  2. , 12. ,  2. ,  4. ]])

Another way of imputing missing values in a column would be to train a regression model that tries to predict missing values in a column based on other columns. So, you start with one column that has a missing value and treat this column as the target column for regression model without the missing values. Using all the other columns, you now train a model on samples for which there is no missing value in the concerned column and then try to predict target (the same column) for the samples that were removed earlier. This way, you have a more robust model based imputation. 
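
The steps above can be sketched as follows, under simplifying assumptions: the data is synthetic, only one column has missing values, and `LinearRegression` stands in for whatever regression model you prefer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical data: the last column depends on the first two
rng = np.random.default_rng(0)
X = rng.random((50, 3))
X[:, 2] = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.01, 50)

# knock out 5 values in the last column
X_missing = X.copy()
missing_rows = rng.choice(50, size=5, replace=False)
X_missing[missing_rows, 2] = np.nan

target_col = 2
other_cols = [0, 1]
is_missing = np.isnan(X_missing[:, target_col])

# train on the samples where the target column is present
model = LinearRegression()
model.fit(
    X_missing[~is_missing][:, other_cols],
    X_missing[~is_missing, target_col],
)

# predict the target column for the samples where it is missing
X_imputed = X_missing.copy()
X_imputed[is_missing, target_col] = model.predict(
    X_missing[is_missing][:, other_cols]
)
```

When several columns have missing values, the other columns need to be filled first (for example with the mean) before they can be used as inputs.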


Now, let’s say you are working on a problem of predicting store sales of different items (per week or month). You have items, and you have store ids. So, you can create features like items per store. Now, this is one of the features that is not discussed above. These kinds of features cannot be generalized and come purely from domain, data and business knowledge. Look at the data and see what fits and create features accordingly.
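
As an illustration of the items-per-store idea, here is a small sketch. The dataframe `sales` and its column names `store_id` and `item_id` are hypothetical.

```python
import pandas as pd

# hypothetical sales records: one row per sale
sales = pd.DataFrame({
    "store_id": [1, 1, 1, 2, 2, 3],
    "item_id": [10, 11, 10, 10, 12, 13],
})

# number of distinct items sold in each store, attached to every row
sales["items_per_store"] = (
    sales.groupby("store_id")["item_id"].transform("nunique")
)
```

Using `transform` keeps the original row count, so the new feature lines up with the existing rows without a merge.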

**Always remember to scale or normalize your features if you are using linear models like logistic regression or models like SVM. Tree-based models will always work fine without any normalization of features.**
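
One common way to scale features is standardization with scikit-learn's `StandardScaler`, sketched below on made-up data with two columns of very different ranges. This is one option among several (min-max scaling is another), not a prescription.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical features: one in [0, 1), one in [0, 10000)
rng = np.random.default_rng(0)
X = np.column_stack([rng.random(100), rng.random(100) * 10000])

# standardize each column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

In practice, the scaler is usually fitted on the training data only and then applied to validation and test data with `scaler.transform`.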