# Generate new features for time series related data

In [2]:
import pandas as pd
# create a series of datetime with a frequency of 10 hours
s = pd.date_range('2020-01-06', '2020-01-10', freq='10H').to_series()
# create some features based on datetime
features = {
    "dayofweek": s.dt.dayofweek.values,
    "dayofyear": s.dt.dayofyear.values,
    "hour": s.dt.hour.values,
    "is_leap_year": s.dt.is_leap_year.values,
    "quarter": s.dt.quarter.values,
#     "weekofyear": s.dt.isocalendar().week.values  # .dt.weekofyear was removed in pandas 2.0
}

In [3]:
features

{'dayofweek': array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3], dtype=int32),
 'dayofyear': array([6, 6, 6, 7, 7, 8, 8, 8, 9, 9], dtype=int32),
 'hour': array([ 0, 10, 20,  6, 16,  2, 12, 22,  8, 18], dtype=int32),
 'is_leap_year': array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True]),
 'quarter': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)}

In [4]:
def generate_features(df):
    # create a bunch of features using the date column
    df.loc[:, 'year'] = df['date'].dt.year
    # .dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
    df.loc[:, 'weekofyear'] = df['date'].dt.isocalendar().week
    df.loc[:, 'month'] = df['date'].dt.month
    df.loc[:, 'dayofweek'] = df['date'].dt.dayofweek
    df.loc[:, 'weekend'] = (df['date'].dt.weekday >=5).astype(int)

    # create an aggregate dictionary
    aggs = {}
    # for aggregation by month, we calculate the
    # number of unique month values and also the mean
    aggs['month'] = ['nunique', 'mean']
    aggs['weekofyear'] = ['nunique', 'mean']
    # we aggregate by num1 and calculate sum, max, min 
    # and mean values of this column
    aggs['num1'] = ['sum','max','min','mean']
    # for customer_id, we calculate the total count and the
    # number of unique values (a single list, because assigning
    # the same dictionary key twice would overwrite the first entry)
    aggs['customer_id'] = ['size', 'nunique']

    # we group by customer_id and calculate the aggregates
    agg_df = df.groupby('customer_id').agg(aggs)
    agg_df = agg_df.reset_index()
    return agg_df
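
To make the group-and-aggregate idea concrete, here is a minimal sketch on a hypothetical four-row dataframe. The column values are made up for illustration, and it uses pandas named aggregation (which gives flat column names) rather than the dictionary of lists used above.

```python
import pandas as pd

# hypothetical toy data: two customers, two transactions each
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-06", "2020-01-18", "2020-02-03", "2020-02-09"]
    ),
    "customer_id": [1, 1, 2, 2],
    "num1": [10.0, 20.0, 5.0, 15.0],
})

# date-based features
toy["month"] = toy["date"].dt.month
toy["weekend"] = (toy["date"].dt.weekday >= 5).astype(int)

# one row per customer, one column per aggregate
agg_df = toy.groupby("customer_id").agg(
    month_nunique=("month", "nunique"),
    num1_sum=("num1", "sum"),
    num1_mean=("num1", "mean"),
    weekend_sum=("weekend", "sum"),
    n_rows=("num1", "count"),
).reset_index()

print(agg_df)
```

Each customer ends up with a single row of features that can be joined back to the original data on `customer_id`.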

When you group on a categorical column, you get features that are lists of values 
distributed over time. In these cases, you can create a bunch of statistical 
features such as:
- Mean
- Max
- Min
- Unique
- Skew
- Kurtosis
- Kstat
- Percentile
- Quantile
- Peak to peak

and many more.

In [None]:
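import numpy as np
# x is a list of values for one group
# (illustrative sample values, not real data)
x = [1.0, 3.0, 2.0, 5.0, 4.0]
feature_dict = {}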
# calculate mean
feature_dict['mean'] = np.mean(x)
# calculate max
feature_dict['max'] = np.max(x)
# calculate min
feature_dict['min'] = np.min(x)
# calculate standard deviation
feature_dict['std'] = np.std(x)
# calculate variance
feature_dict['var'] = np.var(x)
# peak-to-peak
feature_dict['ptp'] = np.ptp(x)
# percentile features
feature_dict['percentile_10'] = np.percentile(x, 10)
feature_dict['percentile_60'] = np.percentile(x, 60)
feature_dict['percentile_90'] = np.percentile(x, 90)
# quantile features
feature_dict['quantile_5'] = np.quantile(x, 0.05)
feature_dict['quantile_95'] = np.quantile(x, 0.95)
feature_dict['quantile_99'] = np.quantile(x, 0.99)
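
The list of statistics above also mentions skew, kurtosis, kstat and unique values, which the NumPy cell does not cover. A minimal sketch, assuming SciPy is installed and using a made-up list of values:

```python
import numpy as np
from scipy import stats

# illustrative list of values (an assumption, not real data)
x = [1.0, 2.0, 2.0, 3.0, 10.0]

extra_features = {
    # asymmetry of the distribution
    "skew": stats.skew(x),
    # excess kurtosis (Fisher definition, SciPy's default)
    "kurtosis": stats.kurtosis(x),
    # second k-statistic, an unbiased estimator of the variance
    "kstat_2": stats.kstat(x, 2),
    # number of distinct values
    "n_unique": len(np.unique(x)),
}
print(extra_features)
```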

In [None]:
# tsfresh is good at handling this type of feature engineering
from tsfresh.feature_extraction import feature_calculators as fc
# tsfresh based features
feature_dict['abs_energy'] = fc.abs_energy(x)
feature_dict['count_above_mean'] = fc.count_above_mean(x)
feature_dict['count_below_mean'] = fc.count_below_mean(x)
feature_dict['mean_abs_change'] = fc.mean_abs_change(x)
feature_dict['mean_change'] = fc.mean_change(x)

This is not all; tsfresh offers hundreds of features and tens of variations of different 
features that you can use for time series (list of values) based features. In the 
examples above, x is a list of values. There are also many other features 
that you can create for numerical data with or without categorical data. A simple 
way to generate many features is to create a bunch of polynomial features. For 
example, a second-degree polynomial feature from two features “a” and “b” would 
include: “a”, “b”, “ab”, “a²” and “b²”.

In [6]:
import numpy as np
# generate a random dataframe with 
# 2 columns and 100 rows
df = pd.DataFrame(
 np.random.rand(100, 2),
 columns=[f"f_{i}" for i in range(1, 3)]
)

In [7]:
from sklearn import preprocessing
# initialize polynomial features class object
# for two-degree polynomial features
pf = preprocessing.PolynomialFeatures(
 degree=2,
 interaction_only=False,
 include_bias=False
)
# fit to the features
pf.fit(df)
# create polynomial features
poly_feats = pf.transform(df)
# create a dataframe with all the features
num_feats = poly_feats.shape[1]
df_transformed = pd.DataFrame(
 poly_feats,
 columns=[f"f_{i}" for i in range(1, num_feats + 1)]
)

In [8]:
df_transformed

          f_1       f_2       f_3       f_4       f_5
0    0.450219  0.473560  0.202697  0.213206  0.224259
1    0.784832  0.973155  0.615961  0.763763  0.947031
2    0.864824  0.251149  0.747921  0.217200  0.063076
3    0.855703  0.401482  0.732228  0.343549  0.161188
4    0.683066  0.644092  0.466579  0.439957  0.414855
...       ...       ...       ...       ...       ...
95   0.504936  0.792664  0.254961  0.400245  0.628316
96   0.674998  0.963180  0.455623  0.650145  0.927716
97   0.018934  0.296746  0.000358  0.005619  0.088058
98   0.630630  0.453290  0.397694  0.285858  0.205472

In [10]:
# create bins of the numerical columns
# 10 bins
df["f_bin_10"] = pd.cut(df["f_1"], bins=10, labels=False)
# 100 bins
df["f_bin_100"] = pd.cut(df["f_1"], bins=100, labels=False)

In [13]:
df_transformed.f_3.var()

0.09333398704768911

In [14]:
df_transformed.f_3.apply(lambda x: np.log(1 + x)).var()

0.04661263999992299

You can see that the log transformation lowers the variance of the feature column.

Sometimes, instead of log, you can also take the exponential. A very interesting case is 
when you use a log-based evaluation metric, for example, RMSLE. In that case, 
you can train on log-transformed targets and convert back to the original scale using 
the exponential on the prediction. That would help optimize the model for the metric.
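
A minimal sketch of this idea, assuming synthetic data and a plain linear regression as a stand-in for whatever model you use. The target is transformed with `np.log1p` before training, and predictions are mapped back with `np.expm1`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic, strictly positive target (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = np.exp(X @ np.array([1.0, 2.0, 0.5]) + rng.normal(0, 0.1, 200))

# train on log(1 + y) instead of y
model = LinearRegression()
model.fit(X, np.log1p(y))

# convert predictions back to the original scale
preds = np.expm1(model.predict(X))


def rmsle(y_true, y_pred):
    # root mean squared logarithmic error
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))


score = rmsle(y, preds)
print(score)
```

Minimizing squared error on the log-transformed target is the same as minimizing squared log error on the original target, which is why this setup lines up with RMSLE.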

# Missing values/NaN feature engineering

For categorical features, let’s keep it super simple. If you ever encounter missing 
values in categorical features, treat them as a new category! As simple as this is, it 
(almost) always works!
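
As a small sketch of the "new category" approach, assuming a hypothetical `city` column and the placeholder label `"MISSING"` (any label that does not already occur in the column would do):

```python
import numpy as np
import pandas as pd

# hypothetical categorical column with two missing entries
df_cat = pd.DataFrame({"city": ["Oslo", np.nan, "Lima", None, "Oslo"]})

# treat missing values as their own category
df_cat["city"] = df_cat["city"].fillna("MISSING")

counts = df_cat["city"].value_counts()
print(counts)
```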


One way to fill missing values in numerical data would be to choose a value that 
does not appear in the specific feature and fill using that. For example, let’s say 0 
is not seen in the feature. So, we fill all the missing values using 0. This is one of 
the ways but might not be the most effective. One method that often works better 
than filling 0s for numerical data is to fill with the mean instead. You can also try to fill 
with the median of all the values for that feature, or you can use the most common 
value to fill the missing values. There are just so many ways to do this.
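
These simple strategies can be compared side by side. The sketch below uses scikit-learn's `SimpleImputer` on a made-up single-column array; the values are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# one numerical feature with a single missing value (illustrative data)
col = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

# fill with the mean, the median and the most common value
filled = {
    strategy: SimpleImputer(strategy=strategy).fit_transform(col).ravel()
    for strategy in ("mean", "median", "most_frequent")
}

# fill with a constant that does not appear in the feature
zero_filled = SimpleImputer(
    strategy="constant", fill_value=0
).fit_transform(col).ravel()

for name, values in filled.items():
    print(name, values)
print("constant", zero_filled)
```

The outlier 10 pulls the mean well above the median here, which is one reason to try several strategies rather than settle on one by default.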


A fancy way of filling in the missing values would be to use a k-nearest neighbour
method. You can select a sample with missing values and find the nearest 
neighbours utilising some kind of distance metric, for example, Euclidean distance. 
Then you can take the mean of all nearest neighbours and fill up the missing value. 

In [15]:
import numpy as np
from sklearn import impute
# create a random numpy array with 10 samples
# and 6 features and values ranging from 1 to 14
X = np.random.randint(1, 15, (10, 6))

# convert the array to float
X = X.astype(float)
# randomly assign 10 elements to NaN (missing)
X.ravel()[np.random.choice(X.size, 10, replace=False)] = np.nan
# use 2 nearest neighbours to fill na values
knn_imputer = impute.KNNImputer(n_neighbors=2)
knn_imputer.fit_transform(X)

array([[11. ,  2. , 12. ,  3.5,  7. , 14. ],
       [14. ,  5.5,  8. ,  1. ,  5. , 13. ],
       [ 4. , 13. , 13. , 10. ,  3. ,  3. ],
       [ 5. ,  8. ,  6. , 10. ,  2. , 13. ],
       [ 9. ,  9. , 10. ,  7. ,  6.5, 11. ],
       [10. ,  9. , 10. ,  6. , 11. , 14. ],
       [ 2. ,  9. ,  4. ,  8. ,  9. ,  8. ],
       [ 3.5, 13. ,  2. ,  7. ,  5.5, 10. ],
       [ 2. ,  5. ,  8.5, 13. ,  5.5, 10.5],
       [ 7. , 11. , 13. ,  9. ,  6. ,  4. ]])

Another way of imputing missing values in a column would be to train a regression 
model that tries to predict missing values in a column based on other columns. So, 
you start with one column that has missing values and treat this column as the 
target column for a regression model. Using all the other columns, you train the 
model on the samples for which there is no missing value in the concerned column 
and then predict the target (the same column) for the samples that were set aside 
because the value was missing. This way, you have a more robust model-based 
imputation.
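
The procedure described above can be sketched as follows. This is a minimal illustration on synthetic data, assuming only one column has missing values and using a linear regression as the (arbitrary) model choice:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data: column 3 depends on columns 0 and 1 (an assumption
# made so that the other columns carry information about the target)
rng = np.random.default_rng(0)
X = rng.random((100, 4))
X[:, 3] = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 0.01, 100)
true_values = X[:, 3].copy()

# knock out 10 values in column 3
missing_rows = rng.choice(100, size=10, replace=False)
X[missing_rows, 3] = np.nan

target_col = 3
other_cols = [0, 1, 2]
is_missing = np.isnan(X[:, target_col])

# train on rows where the target column is present
reg = LinearRegression()
reg.fit(X[~is_missing][:, other_cols], X[~is_missing, target_col])

# predict the target column for the rows where it is missing
X[is_missing, target_col] = reg.predict(X[is_missing][:, other_cols])
```

When several columns have missing values, the same idea has to be applied column by column. scikit-learn's experimental `IterativeImputer` automates a round-robin version of this.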

Always remember that imputing values for tree-based models is often unnecessary, as 
many implementations (for example, XGBoost and LightGBM) can handle missing 
values themselves.

Tree-based models have an advantage when it comes to handling missing values, which often makes imputation unnecessary. Here is why.

## Why Tree-Based Models Can Handle Missing Values

### 1. Splitting Mechanism

Tree-based models, such as decision trees, random forests, and gradient boosting machines, work by making sequential splits in the data based on feature values. When encountering missing values, these models have built-in methods to handle them:
- They can treat "missing" as a separate category
- They can use surrogate splits based on other correlated features

### 2. Implicit Imputation

Tree-based models perform a form of implicit imputation during the training process:
- For each split, the algorithm determines the best way to handle missing values to minimize the loss function
- This approach is often more effective than pre-processing imputation, as it considers the specific context of each split

### 3. Preservation of Information

By not imputing values beforehand:
- The model preserves the information that a value was missing
- It can potentially discover patterns related to missingness itself

### 4. Robustness to Different Types of Missing Data

Tree-based models can effectively handle different types of missing data:
- Missing Completely at Random (MCAR)
- Missing at Random (MAR)
- Missing Not at Random (MNAR)

## Advantages Over Explicit Imputation

- Saves preprocessing time and effort
- Avoids introducing bias through imputation
- Maintains the original data structure
- Can capture complex relationships between missing values and other features

## Exceptions and Considerations

While tree-based models generally handle missing values well, there are some scenarios where imputation might still be considered:

- When using tree-based models for feature importance analysis
- In ensemble methods that include non-tree-based models
- When dealing with extremely high proportions of missing data

In conclusion, the ability of tree-based models to handle missing values internally is a significant advantage, often making explicit imputation unnecessary and potentially counterproductive.


And always remember to scale or normalize your 
features if you are using linear models like logistic regression or a model like SVM. 
Tree-based models will always work fine without any normalization of features.
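
A minimal sketch of scaling before a linear model, assuming synthetic data with two features on very different scales. Putting the scaler inside a pipeline means it is fitted only on the data passed to `fit`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# two features with very different scales (illustrative data)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1, 300),
    rng.normal(0, 1000, 300),
])
y = (X[:, 0] + X[:, 1] / 1000 > 0).astype(int)

# standardize to zero mean and unit variance, then fit the linear model
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

# inspect the scaled features the model actually sees
X_scaled = pipe.named_steps["standardscaler"].transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```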
